FAQs

Logging

Q: How does logging work? How do I use Tensorboard or Weights and Biases?

A: Logging is configured under the trainer section of the config file by specifying the logger argument of the Trainer. Compatible loggers are found here. Read the Config docs for a more complete description.

Units

Q: What units do nequip framework models use?

A: nequip has no preferred system of units and uses the units of the data provided. Model inputs, outputs, error metrics, and all other quantities follow the units of the dataset. Users must use consistent input and output units. For example, if the length unit is Å and the energy labels are in eV, the force predictions from the model will be in eV/Å. The provided force labels should hence also be in eV/Å.

Warning

nequip cannot and does not check the consistency of units in the inputs you provide. It is your responsibility to ensure that input and output units are treated consistently.
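As an illustration of keeping units consistent, the sketch below converts labels from a hypothetical dataset stored in Hartree and Bohr to eV and Å before training. The function names are made up for this example and are not part of nequip; the conversion factors are standard values, rounded. The point is only that energies, positions, and forces must all be converted together so that forces end up in (energy unit)/(length unit).

```python
# Conversion factors (standard values, rounded).
HARTREE_TO_EV = 27.211386
BOHR_TO_ANGSTROM = 0.529177


def convert_energy(energy_hartree):
    """Convert an energy from Hartree to eV."""
    return energy_hartree * HARTREE_TO_EV


def convert_position(position_bohr):
    """Convert Cartesian coordinates from Bohr to Angstrom."""
    return [x * BOHR_TO_ANGSTROM for x in position_bohr]


def convert_force(force_hartree_per_bohr):
    """Convert force components from Hartree/Bohr to eV/Angstrom."""
    factor = HARTREE_TO_EV / BOHR_TO_ANGSTROM
    return [f * factor for f in force_hartree_per_bohr]


# Hypothetical labels for a single atom.
energy_ev = convert_energy(-1.0)
position_angstrom = convert_position([1.0, 0.0, 0.0])
force_ev_per_angstrom = convert_force([0.01, 0.0, -0.01])
```

With labels converted this way, a model trained on them would report energies in eV and forces in eV/Å.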

Floating Point Precision

Q: What floating point precision (torch.dtype) is used in the nequip framework?

A: float64 precision is used for data (inputs to model and reference labels). Either float32 or float64 precision can be used as the model_dtype (which is a mandatory hyperparameter of models in the nequip framework). If float32 precision is used for model_dtype, the model will cast down from the float64 inputs (e.g. positions) and cast up the outputs (e.g. energy) to float64. A major change in the post-revamp nequip framework is that NequIP or Allegro models keep the initial embeddings in float64 before casting down if model_dtype=float32 for better numerics.
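The following is a minimal sketch of why it can help to do early computations in float64 and only then cast down. It does not reproduce nequip's casting logic. It emulates float32 rounding with the standard library and uses two hypothetical coordinates that are far from the origin but close to each other, comparing "subtract, then cast" against "cast, then subtract".

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Two hypothetical coordinates: large in magnitude, small in separation.
x_i = 1000.0001
x_j = 1000.0002
exact = 1.0e-4

# Option A: subtract in float64, then cast the small difference to float32.
diff_cast_late = to_float32(x_j - x_i)

# Option B: cast each coordinate to float32 first, then subtract.
diff_cast_early = to_float32(to_float32(x_j) - to_float32(x_i))

rel_err_late = abs(diff_cast_late - exact) / exact
rel_err_early = abs(diff_cast_early - exact) / exact
print(f"cast late:  relative error {rel_err_late:.1e}")
print(f"cast early: relative error {rel_err_early:.1e}")
```

In this toy case, casting early loses most of the significant digits of the separation, while casting late keeps float32-level relative accuracy.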

Validation metrics are much better than training metrics or loss

Q: The same type of metric (e.g. force MAE) is a lot lower on the validation set than the training set during the course of training. What’s happening?

A: This phenomenon is generally observed when using EMALightningModule as the training module. Validation (and inference tasks in general) then uses an exponential moving average (EMA) of the weights, while training uses the raw weights, which vary more rapidly. Training and validation thus happen on different sets of model weights, which leads to the differences. The better validation metrics justify why the EMA approach is useful in practice. The answer would be different if this phenomenon were observed without EMA.
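To make the idea of an exponential moving average concrete, here is a small sketch of the generic EMA update applied to a single scalar "weight". It is not the implementation inside EMALightningModule; the decay value and the oscillating weight sequence are assumptions chosen only to show that the averaged value changes far more slowly than the raw one.

```python
def ema_update(ema_value, new_value, decay):
    """One EMA step: blend the running average with the newest value."""
    return decay * ema_value + (1.0 - decay) * new_value


decay = 0.99  # hypothetical decay rate, chosen only for illustration

# Stand-in for one weight that oscillates around 1.0 during training.
raw_weights = [1.0 + 0.5 * (-1) ** step for step in range(200)]

ema_weights = []
ema = raw_weights[0]
for w in raw_weights:
    ema = ema_update(ema, w, decay)
    ema_weights.append(ema)


def max_step_change(values):
    """Largest absolute change between consecutive entries."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


raw_step = max_step_change(raw_weights)
ema_step = max_step_change(ema_weights)
print(f"max step change, raw: {raw_step:.4f}, EMA: {ema_step:.4f}")
```

Because validation would see the slowly varying averaged weights while training sees the raw ones, metrics computed on the two can differ even for the same type of metric.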

Distributed Training

Q: How do I train with multiple GPUs?

A: Read our Distributed Data Parallel training docs.

AMD GPU Compatibility

Q: Does the NequIP framework support AMD GPUs?

A: The NequIP framework is compatible with AMD GPUs. However, certain acceleration features, including CuEquivariance, require NVIDIA GPUs. When specifying device parameters, use cuda as the device identifier for both NVIDIA and AMD GPUs.

Energy-Only Training

Q: How do I train on datasets that only contain energies (no forces)?

A: For energy-only datasets, use the following specialized components:

These components are specifically designed for energy-only training and will not attempt to compute force-related statistics or metrics that would cause errors with datasets lacking force labels.

Partial Stress Data

Q: How do I train on datasets where only some structures have stress labels?

A: For datasets with partial stress coverage (where stress labels are unavailable for some structures), use the following approach:

  1. Add the transform: Include AddNaNStressTransform in your transforms list to populate missing stress labels with NaN values:

    transforms:
      - _target_: nequip.data.transforms.AddNaNStressTransform
      - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
        model_type_names: ${model_type_names}
      - _target_: nequip.data.transforms.NeighborListTransform
        r_max: 5.0
    
  2. Configure loss: Use EnergyForceStressLoss with the ignore_nan parameter to skip stress in loss computation for structures without labels:

    loss:
      _target_: nequip.train.EnergyForceStressLoss
      coeffs:
        total_energy: 1.0
        forces: 1.0
        stress: 1.0
      ignore_nan:
        stress: true
    
  3. Configure metrics: Similarly, use EnergyForceStressMetrics with ignore_nan:

    val_metrics:
      _target_: nequip.train.EnergyForceStressMetrics
      coeffs:
        total_energy_rmse: 1.0
        forces_rmse: 1.0
        stress_rmse: 1.0
      ignore_nan:
        stress: true
    

The loss and metrics will only be computed on frames with valid (non-NaN) stress labels.
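Conceptually, ignoring NaN labels amounts to masking them out before averaging. The sketch below shows that idea on plain Python lists; it is an illustration under that assumption, not nequip's implementation, and `masked_mse` is a hypothetical helper.

```python
import math


def masked_mse(predictions, labels):
    """Mean squared error over entries whose label is not NaN.

    Returns NaN if every label is missing.
    """
    squared_errors = [
        (p - y) ** 2
        for p, y in zip(predictions, labels)
        if not math.isnan(y)
    ]
    if not squared_errors:
        return float("nan")
    return sum(squared_errors) / len(squared_errors)


nan = float("nan")
# Hypothetical per-frame stress component; the second and fourth frames
# have no stress label and were filled with NaN.
predicted = [0.10, 0.20, 0.30, 0.40]
reference = [0.12, nan, 0.27, nan]

loss = masked_mse(predicted, reference)
print(f"masked MSE: {loss:.6f}")
```

Only the two labeled frames contribute to the average here, so unlabeled frames neither add error nor dilute it.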

Upgrading from pre-0.7.0 nequip

Warning

Importing, restarting, or migrating models or training runs from pre-0.7.0 versions of nequip is not supported. Please use Python environment management to maintain separate installations of older nequip versions to keep working with that data, if necessary.

Q: What replaces nequip-evaluate, which was removed?

A: nequip-evaluate is replaced by running nequip-train with the test run type and the TestTimeXYZFileWriter callback.

Q: What replaces nequip-deploy, which was removed?

A: nequip-deploy (which previously generated a TorchScript .pth file) is replaced by nequip-compile, which can produce either a TorchScript .nequip.pth file or an AOTInductor .nequip.pt2 file for inference tasks in our integrations, such as ASE and LAMMPS.

Q: What replaces nequip-benchmark, which has been removed?

A: No direct substitute exists, but the NequIP ASE calculator can be used to similarly run a model from Python on a single static frame.

Q: Are losses still sometimes in normalized internal units?

A: No, in nequip >= 0.7.0, the loss components are all in physical units.