# Config File

The config file has four main sections: [`run`](#run), [`data`](#data), [`trainer`](#trainer), [`training_module`](#training_module). These top-level config entries must always be present.

## Variable interpolation

NequIP uses the [Hydra library](https://hydra.cc/) for configurations, which is built on top of the [OmegaConf](https://omegaconf.readthedocs.io/) YAML configuration library. OmegaConf offers a powerful [variable interpolation](https://omegaconf.readthedocs.io/en/latest/usage.html#variable-interpolation) feature, which includes special functions called ["resolvers"](https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#resolvers). Hydra provides [built-in resolvers](https://hydra.cc/docs/1.3/configure_hydra/intro/#resolvers-provided-by-hydra) that allow you to interpolate the run name or output directory into the config.

NequIP also registers a number of custom resolvers to allow users to do basic integer arithmetic directly in the config file:

- Integer multiplication: `area: ${int_mul:${width},${height}}`
- Integer division: `half_width: ${int_div:${width},2}`

These resolvers raise errors if the inputs are not integers or if the division is not exact.

## `run`

`run` allows users to specify an ordered agenda of tasks that [`nequip-train`](../getting-started/workflow.md#training) will run. There are three task types:

- `train`, which requires a `train` dataset and at least one `val` dataset
- `val`, which requires one or more `val` datasets
- `test`, which requires one or more `test` datasets

Users can specify one or more of these run types in the config. A common pattern is to perform training followed immediately by testing:

```yaml
run: [train, test]
```

```{important}
Any `val` or `test` tasks that come after `train` will use the **best** model checkpoint.
```

To check how the untrained model performs on the validation and test datasets at initialization, then train, and finally assess the trained model's performance:

```yaml
run: [val, test, train, val, test]
```

```{note}
[Continuing training from a checkpoint file](../getting-started/workflow.md#saving-and-restarting) resumes from the `run` task that was in progress when the checkpoint file was written. For example, if one uses `run: [test, train, val, test]` and a `nequip-train` run crashed at the `train` step, a run restarted from that checkpoint will continue in the `train` stage (skipping the initial `test` stage that had already been completed in the previously crashed run).
```

## `data`

`data` defines the {class}`~nequip.data.datamodule.NequIPDataModule` object, which manages the train, validation, and test datasets. For guidance on data configuration, see the [Data Configuration guide](data.md) and [`nequip.data.datamodule`](../../api/datamodule.rst) API documentation.

## `trainer`

The `trainer` specifies arguments to instantiate a [PyTorch Lightning](https://lightning.ai/) {class}`~lightning.pytorch.trainer.trainer.Trainer` object. To understand how to configure it, see the trainer [flags](https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-flags) and [API](https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-class-api) documentation.

The {class}`~lightning.pytorch.trainer.trainer.Trainer` is also where users specify [callbacks](https://lightning.ai/docs/pytorch/stable/api_references.html#callbacks) that influence the course of training. This includes the very important {class}`~lightning.pytorch.callbacks.ModelCheckpoint` callback, which should be configured to save checkpoint files as the user requires. `nequip`'s own [callbacks](../../api/callbacks.rst) can also be used here.
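As a rough illustration of the callback setup described above, the sketch below attaches a `ModelCheckpoint` callback to the trainer. The `ModelCheckpoint` arguments shown (`dirpath`, `monitor`, `mode`, `save_top_k`, `save_last`) are standard PyTorch Lightning names. The specific values, the use of Hydra's `${hydra:runtime.output_dir}` as the checkpoint directory, and the monitored metric name `val0_epoch/weighted_sum` are assumptions to adapt to your own setup.

```yaml
trainer:
  _target_: lightning.Trainer
  max_epochs: 100  # assumed value, for illustration only
  callbacks:
    - _target_: lightning.pytorch.callbacks.ModelCheckpoint
      # Assumption: write checkpoints into the Hydra output directory of this run.
      dirpath: ${hydra:runtime.output_dir}
      # Assumption: this metric name must match a metric your config actually logs.
      monitor: val0_epoch/weighted_sum
      mode: min        # lower is better for the monitored metric
      save_top_k: 1    # keep only the best checkpoint by the monitored metric
      save_last: true  # also keep the most recent checkpoint, useful for restarts
```

Keeping both a best and a last checkpoint is one reasonable choice: the former serves later `val` or `test` tasks, the latter serves restarts after an interruption.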
### Logging

`nequip` supports various loggers through PyTorch Lightning, including its [built-in loggers](https://lightning.ai/docs/pytorch/stable/api_references.html#loggers), such as TensorBoard and Weights & Biases.

#### TensorBoard

TensorBoard can be configured, for example, as follows:

```yaml
logger:
  _target_: lightning.pytorch.loggers.TensorBoardLogger
  # The run name in TensorBoard can be, for example, inherited from Hydra.
  version: ${hydra:job.name}
  # By default (not if overridden) Hydra will make `./outputs` and put various runs at `./outputs/{name}`.
  # Here we add an additional `./outputs/tensorboard_logs` within which logs will be stored _across_ runs.
  save_dir: outputs/tensorboard_logs
```

The full set of options is listed in the documentation of the [underlying object from PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.tensorboard.html#module-lightning.pytorch.loggers.tensorboard).

## `training_module`

`training_module` defines the {class}`~nequip.train.NequIPLightningModule` (or its subclasses). Users are directed to the [`nequip.train.NequIPLightningModule` API documentation](../../api/lightning_module.rst) to learn how to configure it. Usually the {class}`~nequip.train.EMALightningModule` is the right choice.

The following important objects are configured as part of the `training_module`:

### `model`

This section configures the model itself, including hyperparameters and the choice of architecture (for example, the NequIP message-passing E(3)-equivariant GNN, or the Allegro architecture). Refer to the [model documentation page](../../api/model.rst) to learn how to configure this section.

### `loss` and `metrics`

Loss functions and metrics to monitor training progress are configured here in the `training_module`. See the [Loss and Metrics guide](metrics.md) for configuration details, including simplified wrappers, coefficient mechanics, and monitoring setup.
### `optimizer` and `lr_scheduler`

The `optimizer` can be any PyTorch-compatible optimizer. Options from PyTorch can be found in {mod}`torch.optim`. The {class}`~torch.optim.Adam` optimizer, for example, can be configured as follows:

```yaml
optimizer:
  _target_: torch.optim.Adam
  lr: 0.01
```

The `lr_scheduler` is configured according to PyTorch Lightning's `lr_scheduler_config` (see {meth}`~lightning.pytorch.core.LightningModule.configure_optimizers` for the full range of options). Consider the following use of {class}`~torch.optim.lr_scheduler.ReduceLROnPlateau` as an example.

```yaml
lr_scheduler:
  scheduler:
    _target_: torch.optim.lr_scheduler.ReduceLROnPlateau
    factor: 0.6
    patience: 5
    threshold: 0.2
    min_lr: 1e-6
  monitor: val0_epoch/weighted_sum
  interval: epoch
  frequency: 1
```

The `scheduler` is a PyTorch-compatible learning rate scheduler. Options from PyTorch can be found in {mod}`torch.optim.lr_scheduler`.
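
To show how the four top-level sections relate to one another, the skeleton below nests the fragments from this page in one plausible layout. It is a sketch rather than a complete config: the `data`, `model`, and `loss` entries are left as commented placeholders because their contents are covered in the linked guides, and `max_epochs` is an assumed value.

```yaml
# Illustrative skeleton only; not runnable until `data`, `model`, and `loss` are filled in.
run: [train, test]

# data: ...  (a NequIPDataModule; see the Data Configuration guide)

trainer:
  _target_: lightning.Trainer
  max_epochs: 100  # assumed value
  logger:
    _target_: lightning.pytorch.loggers.TensorBoardLogger
    version: ${hydra:job.name}
    save_dir: outputs/tensorboard_logs

training_module:
  _target_: nequip.train.EMALightningModule
  # model: ...  (see the model documentation page)
  # loss: ...   (see the Loss and Metrics guide)
  optimizer:
    _target_: torch.optim.Adam
    lr: 0.01
  lr_scheduler:
    scheduler:
      _target_: torch.optim.lr_scheduler.ReduceLROnPlateau
      factor: 0.6
      patience: 5
    monitor: val0_epoch/weighted_sum
    interval: epoch
    frequency: 1
```

Note that `logger` sits under `trainer` because it is an argument of the Lightning `Trainer`, while `optimizer` and `lr_scheduler` sit under `training_module`.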