Data Configuration

Data Processing Flow

The data processing in NequIP follows this pipeline from raw files to model-ready data:

[Figure: flow chart of the NequIP data pipeline, from raw files to model-ready data]

This entire pipeline is coordinated by a NequIPDataModule object, which also:

  • Manages train/val/test dataset splits

  • Computes dataset statistics

Key Components:

  • NequIPDataModule: Orchestrates everything

  • Dataset: Reads raw data files and applies transforms to individual structures

  • Transforms: Process data sequentially (e.g. compute neighbor lists, map atom types)

  • DataLoader: Batches transformed data for efficient training with parallel loading

  • Statistics Manager: Computes statistics of processed data to initialize model parameters
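The relationship between datasets, transforms, and batching can be sketched in plain Python. This is a conceptual toy, not NequIP code: the dictionary-based structures and the transform names add_num_atoms and add_energy_per_atom are hypothetical and exist only for this illustration.

```python
from typing import Any, Callable, Dict, Iterator, List

Structure = Dict[str, Any]  # toy stand-in for one atomic structure
Transform = Callable[[Structure], Structure]


def apply_transforms(structure: Structure, transforms: List[Transform]) -> Structure:
    """Apply each transform in order, feeding the output of one into the next."""
    for transform in transforms:
        structure = transform(structure)
    return structure


def iter_batches(items: List[Structure], batch_size: int) -> Iterator[List[Structure]]:
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical transforms, named only for this illustration.
def add_num_atoms(structure: Structure) -> Structure:
    return {**structure, "num_atoms": len(structure["symbols"])}


def add_energy_per_atom(structure: Structure) -> Structure:
    # Relies on "num_atoms", so it must run after add_num_atoms.
    return {**structure, "energy_per_atom": structure["energy"] / structure["num_atoms"]}


raw = [
    {"symbols": ["H", "H", "O"], "energy": -3.0},
    {"symbols": ["C", "O"], "energy": -4.0},
    {"symbols": ["Cu"], "energy": -1.5},
]
processed = [apply_transforms(s, [add_num_atoms, add_energy_per_atom]) for s in raw]
batches = list(iter_batches(processed, batch_size=2))
```

Each structure passes through the transforms one at a time and in order, and only the transformed structures are grouped into batches.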

DataModules

The data section of the NequIP config file specifies a NequIPDataModule, which manages how training data is loaded and processed. NequIPDataModules coordinate all aspects of data handling from loading to preprocessing. For comprehensive configuration options, see nequip.data.datamodule.

Common DataModules

ASEDataModule is the most commonly used datamodule because it reads the many file formats supported by ASE (Atomic Simulation Environment), including the popular .xyz format. The following example splits a single data file into training, validation, and test sets.

data:
  _target_: nequip.data.datamodule.ASEDataModule
  split_dataset:
    file_path: training_data.xyz
    train: 0.8
    val: 0.1
    test: 0.1
  # ... other arguments
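To make the fractional split concrete, here is a minimal sketch of how fractions such as 0.8/0.1/0.1 could be turned into disjoint index sets. It is an illustration only: the function split_indices, its seed argument, and the rounding rule are assumptions, and NequIP's actual splitting logic may differ.

```python
import random
from typing import Dict, List


def split_indices(num_frames: int, fractions: Dict[str, float], seed: int = 0) -> Dict[str, List[int]]:
    """Shuffle frame indices and cut them into consecutive chunks, one per split."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    indices = list(range(num_frames))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible

    splits: Dict[str, List[int]] = {}
    names = list(fractions)
    start = 0
    for name in names[:-1]:
        count = int(round(fractions[name] * num_frames))
        splits[name] = indices[start:start + count]
        start += count
    # The last split takes the remainder so that every frame is used exactly once.
    splits[names[-1]] = indices[start:]
    return splits


splits = split_indices(1000, {"train": 0.8, "val": 0.1, "test": 0.1}, seed=123)
```

With 1000 frames this yields 800 training, 100 validation, and 100 test indices with no overlap.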

Specialized DataModules

For specific benchmark datasets, specialized datamodules provide auto-download capabilities and predefined configurations. They have unique APIs tailored to their datasets and often handle downloading and preprocessing automatically. See nequip.data.datamodule for the available classes.

Custom Data Configurations

For more complex or custom data setups, you can use the base NequIPDataModule directly, which lets you specify custom dataset configurations. Datasets are the components that actually read data files and apply transforms to individual structures. See nequip.data.dataset for available dataset classes.

The existing specialized datamodules are essentially convenience wrappers that simplify configuring the base NequIPDataModule with specific datasets and common settings.

DataModule Arguments

Key arguments that datamodules take include transforms (see Data Transforms), dataloaders (see DataLoaders), and dataset statistics managers (see Dataset Statistics).

Data Transforms

Transforms process raw data into a format suitable for model training. They are specified in datamodule configurations (see DataModules and nequip.data.datamodule) which pass them as arguments to datasets (see nequip.data.dataset) where they are applied sequentially to each data point. Two transforms are essential for most use cases:

  • ChemicalSpeciesToAtomTypeMapper maps atomic numbers to model type indices. This handles the distinction between chemical species (C, H, O) and the model atom type names:

    - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
      model_type_names: [C, H, O, Cu]
      chemical_species_to_atom_type_map:
        C: C
        H: H
        O: O
        Cu: Cu
    

    When model_type_names correspond exactly to chemical species (the common case), you can omit chemical_species_to_atom_type_map and it will default to an identity mapping:

    - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
      model_type_names: [C, H, O, Cu]
    

    Alternatively, you can use the list_to_identity_dict resolver to be explicit:

    model_type_names: [C, H, O, Cu]
    chemical_species: ${model_type_names}
    
    transforms:
      - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
        model_type_names: ${model_type_names}
        chemical_species_to_atom_type_map: ${list_to_identity_dict:${chemical_species}}
    
  • NeighborListTransform computes which atoms are neighbors of each atom within a cutoff distance.

    - _target_: nequip.data.transforms.NeighborListTransform
      r_max: 5.0  # should be the same as model `r_max`
    

The model_type_names list defines the atom types known to the model, and the chemical_species_to_atom_type_map dict explicitly maps chemical species to these types. The model_type_names should be consistent across data, model, and statistics configurations.
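The mapping from atomic numbers to type indices can be illustrated with a short sketch. This is not the implementation of ChemicalSpeciesToAtomTypeMapper or of the list_to_identity_dict resolver: the helpers identity_map and atom_types_from_atomic_numbers and the small atomic-number table are hypothetical stand-ins that assume the type index is the position of the type name in model_type_names.

```python
from typing import Dict, List, Optional

# Tiny lookup table covering only the species used in the examples above.
ATOMIC_NUMBER_TO_SYMBOL = {1: "H", 6: "C", 8: "O", 29: "Cu"}


def identity_map(names: List[str]) -> Dict[str, str]:
    """Map every name to itself, e.g. ["C", "H"] -> {"C": "C", "H": "H"}."""
    return {name: name for name in names}


def atom_types_from_atomic_numbers(
    atomic_numbers: List[int],
    model_type_names: List[str],
    species_to_type: Optional[Dict[str, str]] = None,
) -> List[int]:
    """Convert atomic numbers to indices into model_type_names."""
    if species_to_type is None:
        # Common case: type names coincide with chemical species.
        species_to_type = identity_map(model_type_names)
    type_index = {name: i for i, name in enumerate(model_type_names)}
    return [
        type_index[species_to_type[ATOMIC_NUMBER_TO_SYMBOL[z]]]
        for z in atomic_numbers
    ]


# Water (O, H, H) with the type names from the examples above.
water_types = atom_types_from_atomic_numbers([8, 1, 1], ["C", "H", "O", "Cu"])

# A non-identity mapping: the model calls its carbon type "carbon".
co_types = atom_types_from_atomic_numbers(
    [6, 8], ["carbon", "O"], {"C": "carbon", "O": "O"}
)
```

Because the indices depend on the order of model_type_names, changing that order between the data, model, and statistics configurations would silently relabel the atom types, which is why the list should stay consistent.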

Here’s an example with both transforms:

transforms:
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: ${model_type_names}
  - _target_: nequip.data.transforms.NeighborListTransform
    r_max: 5.0

Warning

Transform Order May Matter: The order of transforms can be important for some configurations. For example, when using per-edge-type cutoffs in NeighborListTransform, the ChemicalSpeciesToAtomTypeMapper must come before NeighborListTransform because the neighbor list transform needs atom type information to apply different cutoffs to different element pairs.
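The dependence on atom types can be seen in a brute-force neighbor search. The sketch below is a simplified illustration, not NequIP's NeighborListTransform: it ignores periodic boundary conditions, treats the cutoff as inclusive, and the pair_cutoffs argument is a hypothetical way of expressing per-type-pair cutoffs.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def neighbor_pairs(
    positions: Sequence[Tuple[float, float, float]],
    r_max: float,
    atom_types: Optional[Sequence[int]] = None,
    pair_cutoffs: Optional[Dict[Tuple[int, int], float]] = None,
) -> List[Tuple[int, int]]:
    """Return directed (i, j) pairs whose distance is within the cutoff."""
    if pair_cutoffs is not None and atom_types is None:
        # Mirrors the ordering requirement: types must be assigned first.
        raise ValueError("per-type-pair cutoffs need atom types; map types first")
    pairs = []
    for i, ri in enumerate(positions):
        for j, rj in enumerate(positions):
            if i == j:
                continue
            cutoff = r_max
            if pair_cutoffs is not None:
                # Fall back to the global r_max for pairs without a specific cutoff.
                cutoff = pair_cutoffs.get((atom_types[i], atom_types[j]), r_max)
            if math.dist(ri, rj) <= cutoff:
                pairs.append((i, j))
    return pairs


positions = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 4.0)]
atom_types = [0, 1, 1]

all_pairs = neighbor_pairs(positions, r_max=5.0)
# Shorten the cutoff between two atoms of type 1 to 2.0.
typed_pairs = neighbor_pairs(positions, 5.0, atom_types, {(1, 1): 2.0})
```

With a single global cutoff the atom types are never consulted, so the order of the two transforms is irrelevant; once cutoffs depend on the type pair, the types have to be known before neighbors are computed.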

Additional transforms are available for specific use cases, such as handling stress-related data. For a complete list of available transforms, see the transforms API documentation.

DataLoaders

DataLoaders handle batching and parallel data loading using PyTorch’s torch.utils.data.DataLoader. They are specified in datamodule configurations (see DataModules and nequip.data.datamodule) which use them to wrap datasets for efficient training:

train_dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 5        # an important training hyperparameter to tune
  num_workers: 5       # parallel workers for data loading
  shuffle: true        # often useful to shuffle training data

Tip

Training batch size affects learning dynamics and is an important hyperparameter to tune. However, validation and test batch sizes have no effect on training and should generally be set as large as possible without causing out-of-memory errors to speed up evaluation.

Tip

When using several dataloader workers (num_workers greater than 1) while training on GPUs, consider setting OMP_NUM_THREADS according to the number of available CPU cores. When training on CPUs, note that OMP_NUM_THREADS also affects model speed. If num_workers is close to the number of available CPU cores, setting OMP_NUM_THREADS=1 has been found to speed up data loading.
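One way to apply this tip from Python is sketched below. The helper suggest_omp_num_threads and its rule of dividing cores among the main process and the workers are a hypothetical heuristic, not a NequIP recommendation; the only firm point is that the environment variable generally has to be set before OpenMP-backed libraries are imported.

```python
import os


def suggest_omp_num_threads(num_workers: int, cpu_cores: int) -> int:
    """Hypothetical heuristic: share the cores among the main process and its workers."""
    processes = num_workers + 1  # dataloader workers plus the main training process
    return max(1, cpu_cores // processes)


num_workers = 5  # matches the train_dataloader example above
cpu_cores = os.cpu_count() or 1  # os.cpu_count() may return None
threads = suggest_omp_num_threads(num_workers, cpu_cores)

# Only takes effect if set before OpenMP-backed libraries (e.g. torch) are imported.
# setdefault leaves a value that was already exported in the shell untouched.
os.environ.setdefault("OMP_NUM_THREADS", str(threads))
```

With 5 workers on a 6-core machine the heuristic gives 1, consistent with the observation that OMP_NUM_THREADS=1 helps when num_workers is close to the core count.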

Dataset Statistics

Dataset statistics give you a rough picture of your dataset (e.g., average energy per atom, force magnitudes) and are crucial for initializing data-derived model hyperparameters. They are computed by passing a dataset statistics manager as an argument to a datamodule (see DataModules and nequip.data.datamodule). The CommonDataStatisticsManager automatically computes essential statistics:

stats_manager:
  _target_: nequip.data.CommonDataStatisticsManager
  type_names: [C, H, O, Cu]
  dataloader_kwargs:
    batch_size: 10  # Can be larger than training batch size to speed up computation

For energy-only datasets (without forces), use EnergyOnlyDataStatisticsManager instead:

stats_manager:
  _target_: nequip.data.EnergyOnlyDataStatisticsManager
  type_names: [C, H, O, Cu]
  dataloader_kwargs:
    batch_size: 10

The batch_size in dataloader_kwargs can be larger than your training batch size, which speeds up the statistics computation, usually without causing memory issues. Statistics are computed once during data setup, not during training.
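As a rough illustration of the kind of quantities involved, the sketch below computes a mean per-atom energy and a root-mean-square force component on toy data. The function names and the exact definitions (frame-wise average of energy divided by atom count; RMS over all Cartesian force components) are assumptions for this example and may not match what the statistics managers compute.

```python
import math
from typing import Any, Dict, List

Frame = Dict[str, Any]  # toy frame: total energy plus one force vector per atom


def per_atom_energy_mean(frames: List[Frame]) -> float:
    """Average over frames of (total energy / number of atoms)."""
    per_atom = [f["energy"] / len(f["forces"]) for f in frames]
    return sum(per_atom) / len(per_atom)


def forces_rms(frames: List[Frame]) -> float:
    """Root mean square over every Cartesian force component in the dataset."""
    components = [c for f in frames for force in f["forces"] for c in force]
    return math.sqrt(sum(c * c for c in components) / len(components))


frames = [
    {"energy": -6.0, "forces": [(3.0, 0.0, 0.0), (-3.0, 0.0, 0.0)]},
    {"energy": -4.0, "forces": [(0.0, 0.0, 0.0)]},
]

energy_mean = per_atom_energy_mean(frames)
force_rms = forces_rms(frames)
```

Both quantities need only a single pass over the data and no model evaluation, which is consistent with computing them once during data setup.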

For advanced use cases, you can use the base DataStatisticsManager directly for more flexible configuration. See the dataset statistics API documentation for configuration options.

For guidance on using computed statistics to initialize model parameters, see Training data statistics as hyperparameters.