Troubleshooting

This page covers how one can troubleshoot a variety of issues when using the NequIP framework, and will be continuously improved based on user feedback. You are also encouraged to look through past GitHub issues, GitHub discussions, and our FAQ page.

Codebase changes break my checkpoints

While we strive for stability, there are times when breaking changes are necessary to enable new features and capabilities. Checkpoint files are unlikely to survive such breaking changes, i.e. one may no longer be able to load a checkpoint file to continue training or nequip-compile the model when upgrading versions.

The NequIP framework’s proposed solution is to package the model with nequip-package, which produces a packaged model file that contains a snapshot of the code (from the code version at package-time) associated with a model. Packaged models can be compiled with nequip-compile for inference and/or used as pretrained models for fine-tuning (with the possibility of repackaging the model with nequip-package again). Packaged models are more reliable in allowing for old models to be used across version changes of the NequIP codebase. While packaged models guard against breaking changes in model code, they may not be compatible if there are fundamental changes to the code that performs packaging and loads packaged models. Changes like this should be rare, and we will refrain from making such changes unless absolutely necessary.

More information can be found in the docs on packaging and on the different file types.

Poor Training Behavior

If you’re model doesn’t seem like it’s learning, the reasons could range from problematic model hyperparameters, to problematic training hyperparameters, to data-side problems.

Data Conventions. As a start, ensure that the data format conforms to NequIP conventions. For example, a common failure mode is when a dataset uses the opposite sign convention for stress.

Isolated Atom Energies. A common cause of poor training is not using the isolated atom energies computed at the same level of theory as the dataset as the per_type_energy_shifts of the model. More details can be found on the Energy Shifts & Scales docs.

Learning Rate. In terms of training hyperparameters, NequIP models tend to learn with learning rates of around 0.01, while Allegro models are better with learning rates of around 0.001. One can always try to lower the learning rate to test if the problem could lie with a large learning rate causing large jumps in the loss landscape.

Common Errors

nequip-train config.yaml fails

Problem: Trying to run nequip-train as follows fails.

nequip-train config.yaml

with an error that looks like

    raise OverrideParseException(
hydra.errors.OverrideParseException: Error parsing override 'config.yaml'
missing EQUAL at '<EOF>'
See https://hydra.cc/docs/1.2/advanced/override_grammar/basic for details

Solution: Read the workflow docs and follow hydra’s command line options, e.g.

nequip-train -cn config.yaml

Compilation with nequip-compile fails in AOTInductor Mode

Problem: Trying to run nequip-compile as follows, fails:

nequip-compile \
path/to/ckpt_file/or/package_file \
path/to/compiled_model.nequip.pt2 \
--device (cpu/cuda) \
--mode aotinductor \
--target target_integration

with an error like this:

torch._inductor.exc.CppCompileError: C++ compile error

or like this:

allegro_torch27/lib/python3.10/site-packages/torch/include/torch/csrc/inductor/aoti_include/common.h:4:10: fatal error: filesystem: No such file or directory
#include <filesystem>
        ^~~~~~~~~~~~
compilation terminated.

or like this:

Segmentation fault (core dumped)

Solution: Use a newer GCC version.

It’s likely your GCC version does not support C++17. Try a GCC version >= 11 that supports C++17 by default (see GCC C++17 status).

On HPC clusters, you can usually module load to a newer version of GCC.

EMA checkpoint error

Problem: When using EMALightningModule, you encounter:

AssertionError: EMA module loaded in a state where it does not contain EMA weights -- the checkpoint file is likely corrupted.

Solution: This error can occur when using check_val_every_n_epoch with a value other than 1. EMA requires validation to run every epoch. No configuration change is needed if the field is unspecified in the config.

Training Freezes or Hangs

Problem: Trying to run nequip-train as follows proceeds normally but then freezes at this point:

    Building model and training_module from scratch / checkpoint ...

This seems to occur due to corruption of torch extension optimisation caches (such as for OpenEquivariance).

Solution: Remove the torch extensions cache folder, to force it to re-build: rm -rf ~/.cache/torch_extensions/*.