The NequIP Workflow
Overview
At a glance, the NequIP workflow is as follows.
1. Train models with nequip-train, which produces a checkpoint file.
2. Test those models using nequip-train, sometimes as part of the same call to the command.
3. Package the model from the checkpoint file with nequip-package, which produces a package file. Package files are the recommended format for distributing NequIP framework models as they are designed to be usable on different machines and code environments (e.g. with different e3nn, nequip, allegro versions than what the model was initially trained with).
4. Compile the packaged model (or model from a checkpoint file) with nequip-compile, which produces a compiled model file that can be loaded for production simulations in supported integrations such as LAMMPS and ASE.
Figure: The NequIP framework workflow.
Training
The core command in NequIP is nequip-train, which takes in a YAML config file defining the dataset(s), model, and training hyperparameters, and then runs (or restarts) a training session. Hydra is used to manage the config files, and so many of the features and tricks from Hydra can be used if desired. nequip-train can be called as follows.
nequip-train -cp full/path/to/config/directory -cn config_name.yaml
nequip-train uses the Trainer from PyTorch Lightning to run a training loop.
Command line options
The command line interface of nequip-train is managed by Hydra, and complete details on its flexible syntax can be found in the Hydra documentation.
The flags -cp and -cn refer to the “config path” and “config name” respectively. If one runs nequip-train in the same directory where the config file is located, the -cp flag may be omitted. Note also that the full path is usually required if one uses -cp. Users who seek further configurability (e.g. relative paths, multiple config files located in different directories, etc.) are directed to the “command line flags” page in the Hydra docs to learn more.
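When launching runs from a script, it can help to derive both flags from a single config file path. The following is a minimal sketch, not part of NequIP: `build_train_command` is a hypothetical helper and `configs/tutorial.yaml` is a made-up location. It resolves the config directory to an absolute path, since the full path is usually required with -cp.

```python
from pathlib import Path


def build_train_command(config_file):
    """Return an argv list for nequip-train (hypothetical helper, not a NequIP API)."""
    path = Path(config_file).resolve()  # -cp usually needs the full path
    return [
        "nequip-train",
        "-cp", str(path.parent),  # config path: directory containing the config
        "-cn", path.name,         # config name: the config file name
    ]


# Hypothetical config location, used only for illustration.
cmd = build_train_command("configs/tutorial.yaml")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start the run.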
Working directories for output files from nequip-train are managed by Hydra, and users can configure how these directories are organized through Hydra’s options.
The config file
Under the hood, the Hydra config utilities and the PyTorch Lightning framework are used to facilitate training and testing in the NequIP infrastructure. The config defines a hierarchy of objects, built by instantiating classes, usually specified in the config with _target_, with the parameters the user provides. The Python API of these classes exactly corresponds to the available configuration options in the config file. As a result, the Python API of these classes is the single source of truth defining valid configuration options. These classes could come from:
- torch itself, in the case of optimizers and learning rate schedulers;
- lightning, such as Lightning’s Trainer or Lightning’s native callbacks;
- nequip, such as the various DataModules, custom callbacks, and so on.
Users are advised to look at the tutorial configuration to understand how the config file is structured, and then to look up what each class does and what parameters it takes (in the PyTorch, PyTorch Lightning, or NequIP API docs). The documentation for nequip’s own classes can be found in the Python API section of this documentation. For detailed guidance on config structure, see the Config File guide.
Tip
Hydra’s output directory can be accessed in the config file using variable interpolation, which is very useful, for example, to instruct Lightning to save checkpoints in Hydra’s output directory:
callbacks:
  - _target_: lightning.pytorch.callbacks.ModelCheckpoint
    dirpath: ${hydra:runtime.output_dir}
    ...
Saving and restarting
Checkpointing behavior is controlled by Lightning, and configuring it is the user’s responsibility. Checkpointing can be controlled by flags in Lightning’s Trainer and refined further with Lightning’s ModelCheckpoint callback.
If a run is interrupted, one can continue training from a checkpoint file with the following command
nequip-train -cp full/path/to/config/directory -cn config_name.yaml ++ckpt_path='path/to/ckpt_file'
where we have used Hydra’s override syntax (++). Note that one must still specify the config file used. Training from a checkpoint will always use the model from the checkpoint file, but other training hyperparameters (dataset, loss, metrics, callbacks, etc.) are determined by the config file passed to the restarting nequip-train call (and can therefore differ from those of the original config used to generate the checkpoint). The restart will also resume from the last run stage (i.e. train, val, test, etc.) that was running before the interruption.
Warning
DO NOT MODIFY THE CONFIG BETWEEN RESTARTS. There are no safety checks to guard against nonsensical changes to the config used for restarts, which can cause various problems during state restoration. It is safest to restart without changes to the original config. If one seeks to train a model from a checkpoint file with different training hyperparameters or datasets (e.g. for fine-tuning), one can use the ModelFromCheckpoint() model loader. The only endorsed exception is raising the max_epochs argument of the Trainer to extend the training run if it was interrupted because max_epochs was previously too small.
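Since there are no built-in safety checks, one option is to compare the original and restart configs before relaunching. The sketch below is only an illustration under stated assumptions: the configs are assumed to be already loaded as nested dictionaries, `changed_keys` and `unexpected_changes` are hypothetical helpers, and the dotted key `trainer.max_epochs` is an assumed location for the Trainer's max_epochs argument. The toy values are made up.

```python
_MISSING = object()  # marks keys that exist in only one of the two configs

# Assumed dotted path of the one change the text endorses between restarts.
ALLOWED = {"trainer.max_epochs"}


def changed_keys(old, new, prefix=""):
    """Return dotted paths whose values differ between two nested dicts."""
    changed = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        a, b = old.get(key, _MISSING), new.get(key, _MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            changed.extend(changed_keys(a, b, prefix=path + "."))
        elif a != b:
            changed.append(path)
    return changed


def unexpected_changes(old, new):
    """Return changed paths other than the allowed max_epochs extension."""
    return [p for p in changed_keys(old, new) if p not in ALLOWED]


# Toy configs for illustration only.
original = {"run": ["train"], "trainer": {"max_epochs": 100}}
restart = {"run": ["train"], "trainer": {"max_epochs": 200}}
print(unexpected_changes(original, restart))  # empty list: only max_epochs changed
```

A non-empty result would be a prompt to restart with the original config, or to switch to the ModelFromCheckpoint() loader for a fresh run.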
Testing
Testing is also performed with nequip-train by adding test to the list of run parameters in the config. Testing requires test dataset(s) to be defined in the NequIPDataModule specified under the data key in the config.
There are two main ways users can use test.
1. One can have testing be done automatically after training in the same nequip-train session by specifying run: [train, test] in the config. The test phase will use the best model checkpoint from the train phase.
2. One can run tests from a checkpoint file by having run: [test] in the config and using the ModelFromCheckpoint() model loader to load a model from a checkpoint file.
One can use the TestTimeXYZFileWriter callback (see API) to write out .xyz files containing the predictions of the model on the test dataset(s).
Packaging
The recommended way to archive a trained model is to package it with the build option of nequip-package.
nequip-package build path/to/ckpt_file path/to/packaged_model.nequip.zip
One can inspect the metadata of the packaged model by using the info option.
nequip-package info path/to/pkg_file.nequip.zip
Warning
The output path MUST have the extension .nequip.zip.
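Because the extension is mandatory, a wrapper script can check it before invoking the command. This is a minimal sketch: `build_package_command` is a hypothetical helper, not a NequIP API, and the file names in the example are placeholders.

```python
REQUIRED_SUFFIX = ".nequip.zip"  # extension required for package files


def build_package_command(ckpt_file, package_file):
    """Return an argv list for `nequip-package build` (hypothetical helper)."""
    if not str(package_file).endswith(REQUIRED_SUFFIX):
        raise ValueError(f"output path must end with {REQUIRED_SUFFIX}: {package_file}")
    return ["nequip-package", "build", str(ckpt_file), str(package_file)]


# Placeholder paths for illustration only.
print(" ".join(build_package_command("best.ckpt", "model.nequip.zip")))
```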
Tip
To see command line options, one can use nequip-package -h. There are two options, build and info, so one can get more detailed information with nequip-package build -h and nequip-package info -h.
While checkpoint files are unlikely to survive breaking changes across updates to the software, the packaging infrastructure is designed to allow packaged models to remain usable as the framework is updated.
nequip-package saves not only the model and its weights, but also a snapshot of the code that implements the model at the time the model is packaged.
The packaged model can thus be loaded and used independently even if new and different versions of NequIP (and extensions such as allegro) are later installed.
Fine-tuning packaged models
Packaged models can be used for both inference and fine-tuning. To fine-tune, use the ModelFromPackage() model loader in the config of a new nequip-train run so that the model from the package serves as the starting point. The checkpoint files produced by such a fine-tuning run can be used as usual: they support restarting training with ++ckpt_path='path/to/ckpt', further fine-tuning using ModelFromCheckpoint(), nequip-compile, nequip-package, etc.
See the Fine-Tuning training techniques section for further details.
Compilation
nequip-compile is the command used to compile a model (either from a checkpoint file or a package file) for production simulations with our various integrations. There are two compiler modes: torchscript and aotinductor, which produce compiled model files with extensions .nequip.pth and .nequip.pt2 respectively. We generally recommend the newer and faster aotinductor, but it requires PyTorch 2.6 or later.
Note on TorchScript deprecation: TorchScript compilation (--mode torchscript) is deprecated and no longer supported in PyTorch >= 2.10 as announced by PyTorch. Please use --mode aotinductor instead. If you must use TorchScript, use PyTorch 2.9 or earlier.
To compile a model with TorchScript (PyTorch < 2.10 only):
nequip-compile \
path/to/ckpt_file/or/package_file \
path/to/compiled_model.nequip.pth \
--device [cpu|cuda] \
--mode torchscript
To compile a model with AOTInductor:
nequip-compile \
path/to/ckpt_file/or/package_file \
path/to/compiled_model.nequip.pt2 \
--device [cpu|cuda] \
--mode aotinductor \
--target [ase|pair_nequip|pair_allegro|...]
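The pairing of compiler mode and file extension shown in the two commands above can be checked programmatically before launching a compile. The sketch below is illustrative only: `build_compile_command` is a hypothetical helper, not a NequIP API, and it appends --target only when one is supplied, which is an assumption about how one might wrap the CLI.

```python
# Extension each compiler mode produces, as stated in the text above.
EXTENSIONS = {"torchscript": ".nequip.pth", "aotinductor": ".nequip.pt2"}
DEVICES = {"cpu", "cuda"}


def build_compile_command(model, output, device, mode, target=None):
    """Return an argv list for nequip-compile (hypothetical helper)."""
    if mode not in EXTENSIONS:
        raise ValueError(f"unknown mode: {mode}")
    if device not in DEVICES:
        raise ValueError(f"unknown device: {device}")
    if not str(output).endswith(EXTENSIONS[mode]):
        raise ValueError(f"{mode} output must end with {EXTENSIONS[mode]}")
    cmd = ["nequip-compile", str(model), str(output),
           "--device", device, "--mode", mode]
    if target is not None:
        cmd += ["--target", target]  # e.g. ase, pair_nequip, pair_allegro
    return cmd


# Placeholder file names for illustration only.
print(" ".join(build_compile_command(
    "model.nequip.zip", "model.nequip.pt2", "cuda", "aotinductor", target="ase")))
```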
AOTInductor requires access to compilers such as gcc and nvcc when running nequip-compile. Specifically, it needs C++17 support, which means gcc version 8 or higher (preferably >=11, where C++17 is the default). Without the proper compiler version, you may encounter errors such as C++ compile error, issues involving the filesystem standard library, or even Segmentation fault (core dumped). You can check your gcc version with gcc --version, and may need to upgrade or load a specific module on your HPC system to get the required version before running nequip-compile.
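The version check can also be scripted, for example at the top of a job script. This is a rough sketch with hypothetical helper names: it uses `gcc -dumpversion` rather than `gcc --version` because the former prints only the version number, which is easier to parse, and it reports nothing about nvcc.

```python
import shutil
import subprocess

MIN_GCC_MAJOR = 8  # minimum for C++17 support, per the text above


def gcc_major(dumpversion_output):
    """Parse the major version from `gcc -dumpversion` output such as '11.4.0' or '11'."""
    return int(dumpversion_output.strip().split(".")[0])


def gcc_is_new_enough(dumpversion_output, minimum=MIN_GCC_MAJOR):
    return gcc_major(dumpversion_output) >= minimum


def local_gcc_major():
    """Return the major version of the gcc on PATH, or None if it cannot be determined."""
    if shutil.which("gcc") is None:
        return None
    try:
        result = subprocess.run(["gcc", "-dumpversion"],
                                capture_output=True, text=True, check=True)
        return gcc_major(result.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


major = local_gcc_major()
if major is None:
    print("could not determine gcc version")
else:
    print(f"gcc major version {major}; meets minimum: {major >= MIN_GCC_MAJOR}")
```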
Important
nequip-compile should be called on the same type of system and device where the compiled model will be used. This constraint may not always be necessary for TorchScript compilation, but it is required for AOTInductor compilation, which specializes the model to a particular type of GPU, etc.
Tip
If --mode aotinductor is used, the compiled model will be specific to a specified --target integration. For example, the framework provides --target ase for compiled models to be used with ASE, --target pair_nequip for compiled NequIP GNN models to be used in LAMMPS, or --target pair_allegro for compiled Allegro models to be used in LAMMPS.
The --target flag wraps the --input-fields and --output-fields options. Developers designing new models or wanting to set up new integrations can manually provide --input-fields and --output-fields. New integration “target”s may be added through PRs or through NequIP extension packages. Engage with us on GitHub if you seek to do something like this.
Tip
If training and inference are performed on separate machines, possibly with different Python, CUDA, or hardware environments, consider packaging the trained model, transferring the package to the inference machine, and running nequip-compile there.
Compiling models from nequip.net
Models from nequip.net can be compiled directly using the nequip.net: syntax:
nequip-compile \
nequip.net:mir-group/NequIP-OAM-L:0.1 \
path/to/compiled_model.nequip.pt2 \
--device cuda \
--mode aotinductor \
--target ase
The format is nequip.net:group-name/model-name:version, where you can find the full model ID on the model’s page at nequip.net.
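To make the identifier format concrete, the sketch below splits such a string into its three parts. It is an illustration of the format only: `parse_model_id` and `ModelID` are hypothetical names, and this is not how nequip-compile itself parses the argument.

```python
from typing import NamedTuple

PREFIX = "nequip.net:"


class ModelID(NamedTuple):
    group: str
    model: str
    version: str


def parse_model_id(spec):
    """Split 'nequip.net:group-name/model-name:version' into its parts."""
    if not spec.startswith(PREFIX):
        raise ValueError(f"expected a string starting with {PREFIX!r}: {spec!r}")
    rest = spec[len(PREFIX):]
    name, colon, version = rest.rpartition(":")  # version follows the last colon
    group, slash, model = name.partition("/")    # group precedes the first slash
    if not (colon and slash and group and model and version):
        raise ValueError(f"malformed model ID: {spec!r}")
    return ModelID(group, model, version)


print(parse_model_id("nequip.net:mir-group/NequIP-OAM-L:0.1"))
```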
Models are automatically downloaded and cached for compilation in ~/.nequip/model_cache (configurable via the NEQUIP_CACHE_DIR environment variable).
The first compilation will download the model from the server, but subsequent compilations will use the cached model instantly.
Cached files are validated using cryptographic hashes to ensure integrity.
To bypass the cache for a single run, set NEQUIP_NO_CACHE=1 (or true, yes, y). To re-enable caching, unset the variable or set it to any other value like NEQUIP_NO_CACHE=0.
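The rule for NEQUIP_NO_CACHE can be summarized as a small predicate. This is a sketch of the documented behavior, not the framework's implementation: `cache_bypassed` is a hypothetical name, and ignoring letter case and surrounding whitespace is an assumption the text does not confirm.

```python
import os

# Values that bypass the cache, as listed in the text above.
BYPASS_VALUES = {"1", "true", "yes", "y"}


def cache_bypassed(environ):
    """Return True if NEQUIP_NO_CACHE requests bypassing the model cache."""
    value = environ.get("NEQUIP_NO_CACHE", "")
    # Assumption: comparison ignores case and surrounding whitespace.
    return value.strip().lower() in BYPASS_VALUES


print("cache bypassed:", cache_bypassed(os.environ))
```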
Production Simulations
Once a model has been trained and compiled, it can be used to run production simulations in our supported integrations with other codes and simulation engines, including LAMMPS and ASE.