Config File¶
The config file has four main sections: run, data, trainer, and training_module. These top-level config entries must always be present.
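As a minimal sketch of this layout, the skeleton below shows only the four top-level keys. The bodies are left as comments because their contents are covered in the sections that follow, and the `run: [train]` value is just an illustrative choice.

```yaml
# Skeleton only: the four required top-level entries.
run: [train]        # ordered list of tasks (see the run section)

data:
  # NequIPDataModule configuration goes here

trainer:
  # PyTorch Lightning Trainer arguments go here

training_module:
  # NequIPLightningModule configuration (model, loss, optimizer, ...) goes here
```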
Variable interpolation¶
NequIP uses the Hydra library for configurations, which is built on top of the OmegaConf YAML configuration library. OmegaConf offers a powerful variable interpolation feature, which includes special functions called “resolvers”. Hydra provides built-in resolvers that allow you to interpolate the run name or output directory into the config.
NequIP also registers a number of custom resolvers to allow users to do basic integer arithmetic directly in the config file:
Integer multiplication:
area: ${int_mul:${width},${height}}
Integer division:
half_width: ${int_div:${width},2}
These resolvers throw errors if the inputs are not integers or if the division is not exact.
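A small worked example may help here. Assuming hypothetical `width` and `height` keys (placeholder names for illustration, not NequIP settings), the resolvers would evaluate as noted in the comments:

```yaml
# `width` and `height` are hypothetical values for illustration.
width: 8
height: 4

area: ${int_mul:${width},${height}}   # 8 * 4 = 32
half_width: ${int_div:${width},2}     # 8 / 2 = 4 (exact, so no error)

# By contrast, ${int_div:${height},3} would raise an error,
# because 4 is not exactly divisible by 3.
```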
run¶
run allows users to specify an ordered agenda of tasks that nequip-train will run, of which there are three types: train (which requires a train and at least one val dataset), val (which requires one or more val datasets), and test (which requires one or more test datasets).
Users can specify one or more of these run types in the config. A common pattern is to perform training followed immediately by testing:
run: [train, test]
Important
Any val or test tasks that come after train will use the best model checkpoint.
To check how the untrained model performs on the validation and test datasets at initialization, then train, and finally assess the trained model’s performance, use:
run: [val, test, train, val, test]
Note
Continuing training from a checkpoint file resumes at the run task that was in progress when the checkpoint was saved. For example, if a nequip-train run with run: [test, train, val, test] crashed during the train task, a run restarted from that checkpoint resumes in the train task, skipping the initial test task that had already completed in the crashed run.
data¶
data defines the NequIPDataModule object, which manages the train, validation, and test datasets. For guidance on data configuration, see the Data Configuration guide and nequip.data.datamodule API documentation.
trainer¶
The trainer specifies arguments to instantiate a PyTorch Lightning Trainer object. To understand how to configure it, see the trainer flags and API documentation.
Callbacks that influence the course of training are specified in the trainer. These include the very important ModelCheckpoint callback, which should be configured to save checkpoint files as the user requires. nequip’s own callbacks can also be used here.
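As a hedged sketch, a trainer section with a ModelCheckpoint callback might look like the following. The `_target_` paths and argument names come from PyTorch Lightning. The specific values (`max_epochs`, the monitored metric name, and the use of Hydra's output directory) are assumptions chosen for illustration and should be adapted to your setup.

```yaml
trainer:
  _target_: lightning.Trainer
  max_epochs: 100                      # illustrative value
  callbacks:
    - _target_: lightning.pytorch.callbacks.ModelCheckpoint
      # Write checkpoints into the Hydra run directory (assumed choice).
      dirpath: ${hydra:runtime.output_dir}
      # Assumed metric name; it must match a metric that is actually logged.
      monitor: val0_epoch/weighted_sum
      mode: min
      save_top_k: 1                    # keep only the best checkpoint
      save_last: true                  # also keep the most recent checkpoint
```

Keeping both the best and the last checkpoint is one common arrangement: the best checkpoint serves later val or test tasks, while the last one is convenient for resuming an interrupted run.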
Logging¶
nequip supports various loggers through PyTorch Lightning, including its built-in loggers, e.g. Tensorboard, Weights & Biases, etc.
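Loggers are passed to the trainer through a `_target_` entry, so the fragment below would sit inside the trainer section. As a sketch for Weights & Biases, assuming the `wandb` package is installed and using a placeholder project name:

```yaml
logger:
  _target_: lightning.pytorch.loggers.WandbLogger
  project: my-nequip-project              # hypothetical project name
  name: ${hydra:job.name}                 # run name inherited from Hydra
  save_dir: ${hydra:runtime.output_dir}   # assumed choice of log location
```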
Tensorboard¶
Tensorboard can be configured, for example, as follows:
logger:
  _target_: lightning.pytorch.loggers.TensorBoardLogger
  # The run name in tensorboard can be, for example, inherited from Hydra.
  version: ${hydra:job.name}
  # By default (not if overridden) Hydra will make `./outputs` and put various runs at `./outputs/{name}`.
  # Here we add an additional `./outputs/tensorboard_logs` within which logs will be stored _across_ runs.
  save_dir: outputs/tensorboard_logs
The full set of options is found in the documentation of the underlying object from PyTorch Lightning.
training_module¶
training_module defines the NequIPLightningModule (or its subclasses). Users are directed to the nequip.train.NequIPLightningModule API documentation to learn how to configure it. Usually the EMALightningModule is the right choice.
The following important objects are configured as part of the training_module:
model¶
This section configures the model itself, including hyperparameters and the choice of architecture (for example, the NequIP message-passing E(3)-equivariant GNN, or the Allegro architecture). Refer to the model documentation page to learn how to configure this section.
loss and metrics¶
Loss functions and metrics to monitor training progress are configured here in the training_module. See the Loss and Metrics guide for configuration details, including simplified wrappers, coefficient mechanics, and monitoring setup.
optimizer and lr_scheduler¶
The optimizer can be any PyTorch-compatible optimizer. Options from PyTorch can be found in torch.optim. The Adam optimizer, for example, can be configured as follows:
optimizer:
  _target_: torch.optim.Adam
  lr: 0.01
The lr_scheduler is configured according to PyTorch Lightning’s lr_scheduler_config (see configure_optimizers() for the full range of options). Consider the following use of ReduceLROnPlateau as an example.
lr_scheduler:
  scheduler:
    _target_: torch.optim.lr_scheduler.ReduceLROnPlateau
    factor: 0.6
    patience: 5
    threshold: 0.2
    min_lr: 1e-6
  monitor: val0_epoch/weighted_sum
  interval: epoch
  frequency: 1
The scheduler is a PyTorch-compatible learning rate scheduler. Options from PyTorch can be found in torch.optim.lr_scheduler.
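To show how these fragments nest, the sketch below places both under training_module. The `_target_` path `nequip.train.EMALightningModule` is an assumption based on the module path mentioned above, and the model, loss, and metrics entries are omitted; consult the respective guides for those.

```yaml
training_module:
  _target_: nequip.train.EMALightningModule   # assumed import path
  # model, loss, and metrics entries omitted in this sketch
  optimizer:
    _target_: torch.optim.Adam
    lr: 0.01
  lr_scheduler:
    # Arguments for the scheduler object itself.
    scheduler:
      _target_: torch.optim.lr_scheduler.ReduceLROnPlateau
      factor: 0.6
      patience: 5
    # Lightning-level settings: what to watch and how often to step.
    monitor: val0_epoch/weighted_sum
    interval: epoch
    frequency: 1
```

Note that `monitor`, `interval`, and `frequency` sit beside `scheduler` rather than inside it, since they are read by PyTorch Lightning and are not arguments of ReduceLROnPlateau.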