Data Configuration
Data Processing Flow
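Data processing in NequIP follows a pipeline from raw files to model-ready data: a dataset reads the raw data files, transforms are applied sequentially to each structure, and a dataloader batches the transformed structures for training.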
This entire pipeline is coordinated and managed by a NequIPDataModule object, which also:
Manages train/val/test dataset splits
Computes dataset statistics
Key Components:
NequIPDataModule: Orchestrates everything
Dataset: Reads raw data files and applies transforms to individual structures
Transforms: Process data sequentially (e.g. compute neighbor lists, map atom types)
DataLoader: Batches transformed data for efficient training with parallel loading
Statistics Manager: Computes statistics of processed data to initialize model parameters
DataModules
The data section of the NequIP config file specifies a NequIPDataModule, which manages how training data is loaded and processed.
NequIPDataModules coordinate all aspects of data handling from loading to preprocessing.
For comprehensive configuration options, see nequip.data.datamodule.
Common DataModules
ASEDataModule is the most commonly used datamodule because it can read many file formats through ASE (Atomic Simulation Environment), including popular formats such as the .xyz format.
The following is an example of splitting a single data file into separate training, validation, and testing sets.
data:
  _target_: nequip.data.datamodule.ASEDataModule
  split_dataset:
    file_path: training_data.xyz
    train: 0.8
    val: 0.1
    test: 0.1
  # ... other arguments
Specialized DataModules
For specific benchmark datasets, specialized datamodules provide auto-download capabilities and predefined configurations:
MD22DataModule - MD22 datasets
rMD17DataModule - Revised MD17 datasets
sGDML_CCSD_DataModule - sGDML datasets
TM23DataModule - TM23 dataset
NequIP3BPADataModule - 3BPA dataset
These specialized datamodules have unique APIs tailored to their specific datasets and often handle downloading and preprocessing automatically.
Custom Data Configurations
For more complex or custom data setups, you can use the base NequIPDataModule directly. This allows you to specify custom dataset configurations - datasets are the components that actually read data files and apply transforms to individual structures. See nequip.data.dataset for available dataset classes.
The existing specialized datamodules are essentially convenience wrappers that simplify configuring the base NequIPDataModule with specific datasets and common settings.
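As a rough sketch of what a custom setup with the base NequIPDataModule might look like, the fragment below gives each split its own dataset. The `train_dataset` and `val_dataset` keys, the `ASEDataset` class name, its `file_path` and `transforms` arguments, and the file names are all assumptions made for illustration; check nequip.data.datamodule and nequip.data.dataset for the actual names before using it.

```yaml
# Sketch only: key and class names marked "assumed" are not confirmed by this page.
data:
  _target_: nequip.data.datamodule.NequIPDataModule
  train_dataset:                                   # assumed key
    _target_: nequip.data.dataset.ASEDataset       # assumed dataset class
    file_path: train.xyz                           # hypothetical file name
    transforms:                                    # datasets apply transforms per structure
      - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
        model_type_names: [C, H, O, Cu]
      - _target_: nequip.data.transforms.NeighborListTransform
        r_max: 5.0
  val_dataset:                                     # assumed key
    _target_: nequip.data.dataset.ASEDataset       # assumed dataset class
    file_path: val.xyz                             # hypothetical file name
    transforms:
      - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
        model_type_names: [C, H, O, Cu]
      - _target_: nequip.data.transforms.NeighborListTransform
        r_max: 5.0
```

Compared with a convenience datamodule such as ASEDataModule, this form repeats more configuration but lets each split come from a different file or dataset class.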
DataModule Arguments
Key arguments that datamodules take include transforms (see Data Transforms), dataloaders (see DataLoaders), and dataset statistics managers (see Dataset Statistics).
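To illustrate how these arguments might sit together, the sketch below nests the fragments shown on this page under a single data section. It reuses only keys that appear elsewhere in this document; the file name, type names, and numeric values are placeholders, and the full set of accepted arguments should be checked against nequip.data.datamodule.

```yaml
# Sketch only: combines fragments from this page under one datamodule.
data:
  _target_: nequip.data.datamodule.ASEDataModule
  split_dataset:
    file_path: training_data.xyz   # placeholder file name
    train: 0.8
    val: 0.1
    test: 0.1
  # transforms are applied sequentially to each structure
  transforms:
    - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
      model_type_names: [C, H, O, Cu]
    - _target_: nequip.data.transforms.NeighborListTransform
      r_max: 5.0                   # should match the model `r_max`
  # dataloader that batches the training set
  train_dataloader:
    _target_: torch.utils.data.DataLoader
    batch_size: 5
    num_workers: 5
    shuffle: true
  # statistics manager that computes dataset statistics
  stats_manager:
    _target_: nequip.data.CommonDataStatisticsManager
    type_names: [C, H, O, Cu]
    dataloader_kwargs:
      batch_size: 10
```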
Data Transforms
Transforms process raw data into a format suitable for model training. They are specified in datamodule configurations (see DataModules and nequip.data.datamodule) which pass them as arguments to datasets (see nequip.data.dataset) where they are applied sequentially to each data point.
Two transforms are essential for most use cases:
ChemicalSpeciesToAtomTypeMapper maps atomic numbers to model type indices. This handles the distinction between chemical species (C, H, O) and the model atom type names:
- _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
  model_type_names: [C, H, O, Cu]
  chemical_species_to_atom_type_map:
    C: C
    H: H
    O: O
    Cu: Cu
When model_type_names correspond exactly to chemical species (the common case), you can omit chemical_species_to_atom_type_map and it will default to an identity mapping:
- _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
  model_type_names: [C, H, O, Cu]
Alternatively, you can use the list_to_identity_dict resolver to be explicit:
model_type_names: [C, H, O, Cu]
chemical_species: ${model_type_names}
transforms:
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: ${model_type_names}
    chemical_species_to_atom_type_map: ${list_to_identity_dict:${chemical_species}}
NeighborListTransform computes which atoms are neighbors of each atom within a cutoff distance.
- _target_: nequip.data.transforms.NeighborListTransform
  r_max: 5.0  # should be the same as model `r_max`
The model_type_names list defines the atom types known to the model, and the chemical_species_to_atom_type_map dict explicitly maps chemical species to these types. The model_type_names should be consistent across data, model, and statistics configurations.
Here’s an example with both transforms:
transforms:
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: ${model_type_names}
  - _target_: nequip.data.transforms.NeighborListTransform
    r_max: 5.0
Warning
Transform Order May Matter: The order of transforms can be important for some configurations. For example, when using per-edge-type cutoffs in NeighborListTransform, the ChemicalSpeciesToAtomTypeMapper must come before NeighborListTransform because the neighborlist transform needs atom type information to apply different cutoffs for different element pairs.
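The ordering described in this warning could look like the following sketch. The `per_edge_type_cutoff` and `type_names` arguments and the cutoff values are assumptions made for illustration and are not taken from this page; consult the transforms API documentation for the actual argument names.

```yaml
# Sketch only: arguments marked "assumed" are not confirmed by this page.
transforms:
  # 1. Map chemical species to atom types first.
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: [C, H]
  # 2. Then build the neighbor list, which can now use the atom types.
  - _target_: nequip.data.transforms.NeighborListTransform
    r_max: 5.0
    type_names: [C, H]        # assumed argument
    per_edge_type_cutoff:     # assumed argument; values are placeholders
      H: 2.0
      C: 4.0
```

Swapping the two entries would leave the neighbor list transform without the type information it needs to choose a cutoff for each pair.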
Additional transforms are available for specific use cases. For stress-related data, you may need:
VirialToStressTransform - converts virial to stress tensors
StressSignFlipTransform - handles different stress sign conventions
AddNaNStressTransform - adds NaN stress tensors for structures without stress data (useful for datasets with partial stress coverage, see Partial Stress Data FAQ)
For a complete list of available transforms, see the transforms API documentation.
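As a hedged sketch, a stress-related transform would be added to the same transforms list as the two essential transforms. The module path `nequip.data.transforms.VirialToStressTransform` is assumed by analogy with the other transforms on this page, and the transform is shown without arguments, which may not match the actual API.

```yaml
# Sketch only: the path and argument-free form of the stress transform are assumed.
transforms:
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: [C, H, O, Cu]
  - _target_: nequip.data.transforms.NeighborListTransform
    r_max: 5.0
  # Convert virials stored in the data files to stress tensors.
  - _target_: nequip.data.transforms.VirialToStressTransform   # assumed path
```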
DataLoaders
DataLoaders handle batching and parallel data loading using PyTorch’s torch.utils.data.DataLoader.
They are specified in datamodule configurations (see DataModules and nequip.data.datamodule) which use them to wrap datasets for efficient training:
train_dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 5  # an important training hyperparameter to tune
  num_workers: 5  # parallel workers for data loading
  shuffle: true  # often useful to shuffle training data
Tip
Training batch size affects learning dynamics and is an important hyperparameter to tune. However, validation and test batch sizes have no effect on training and should generally be set as large as possible without causing out-of-memory errors to speed up evaluation.
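Following this tip, validation and test dataloaders might be given a larger batch size than the training dataloader. The `val_dataloader` and `test_dataloader` key names are assumed here by analogy with `train_dataloader` and are not shown elsewhere on this page, so confirm them against nequip.data.datamodule; the batch sizes are placeholders.

```yaml
# Sketch only: `val_dataloader` and `test_dataloader` are assumed key names.
train_dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 5      # tuned as a training hyperparameter
  shuffle: true
val_dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 50     # placeholder; as large as memory allows
test_dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 50     # placeholder; has no effect on training
```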
Tip
When num_workers is greater than 1 and you are training on GPUs, consider setting OMP_NUM_THREADS according to the number of available CPU cores. (When training on CPUs, OMP_NUM_THREADS also affects model speed.) When num_workers is close to the number of available CPU cores, setting OMP_NUM_THREADS=1 has been found to speed up data loading.
Dataset Statistics
Dataset statistics give a rough picture of your dataset (e.g., average energy per atom, force magnitudes) and are crucial for initializing data-derived model hyperparameters.
They are computed by specifying a dataset statistics manager as an argument to datamodules (see DataModules and nequip.data.datamodule).
The CommonDataStatisticsManager automatically computes essential statistics:
stats_manager:
  _target_: nequip.data.CommonDataStatisticsManager
  type_names: [C, H, O, Cu]
  dataloader_kwargs:
    batch_size: 10  # Can be larger than training batch size to speed up computation
For energy-only datasets (without forces), use EnergyOnlyDataStatisticsManager instead:
stats_manager:
  _target_: nequip.data.EnergyOnlyDataStatisticsManager
  type_names: [C, H, O, Cu]
  dataloader_kwargs:
    batch_size: 10
You can use a larger batch_size in dataloader_kwargs than your training batch size to compute statistics faster without memory issues.
Statistics are computed once during data setup, not during training.
For advanced use cases, you can use the base DataStatisticsManager directly for more flexible configuration.
See the dataset statistics API documentation for configuration options.
For guidance on using computed statistics to initialize model parameters, see Training data statistics as hyperparameters.