# Data Configuration

## Data Processing Flow

The data processing in NequIP follows this pipeline from raw files to model-ready data:

```{figure} ./ml-data-flow-chart.svg
:width: 400px
:align: center
```

This entire pipeline is coordinated by a {class}`~nequip.data.datamodule.NequIPDataModule` object, which also:
- Manages train/val/test dataset splits
- Computes dataset statistics

**Key Components:**

- **{class}`~nequip.data.datamodule.NequIPDataModule`**: Coordinates the other components, including dataset splits and statistics computation
- **Dataset**: Reads raw data files and applies transforms to individual structures
- **Transforms**: Process data sequentially (e.g. compute neighbor lists, map atom types)
- **DataLoader**: Batches transformed data for efficient training with parallel loading
- **Statistics Manager**: Computes statistics of processed data to initialize model parameters

## DataModules

The [data section](config.md/#data) of the NequIP config file specifies a {class}`~nequip.data.datamodule.NequIPDataModule`, which manages how training data is loaded and processed.
{class}`~nequip.data.datamodule.NequIPDataModule`s coordinate all aspects of data handling from loading to preprocessing.
For comprehensive configuration options, see [`nequip.data.datamodule`](../../api/datamodule.rst).

### Common DataModules

{class}`~nequip.data.datamodule.ASEDataModule` is the most commonly used datamodule because it can read many file formats through [ASE](https://wiki.fysik.dtu.dk/ase/) (Atomic Simulation Environment), including the widely used `.xyz` format.
The following is an example of splitting a single data file into separate training, validation, and testing sets.

```yaml
data:
  _target_: nequip.data.datamodule.ASEDataModule
  split_dataset:
    file_path: training_data.xyz
    train: 0.8
    val: 0.1
    test: 0.1
  # ... other arguments
```
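
If your data is already split into separate files, a configuration along the following lines may be more convenient. This is a minimal sketch that assumes {class}`~nequip.data.datamodule.ASEDataModule` accepts `train_file_path`, `val_file_path`, and `test_file_path` arguments; the file names are placeholders, so check the API documentation for the exact argument names before relying on it.

```yaml
data:
  _target_: nequip.data.datamodule.ASEDataModule
  # assumed argument names; file names are placeholders
  train_file_path: train.xyz
  val_file_path: val.xyz
  test_file_path: test.xyz
  # ... other arguments
```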

### Specialized DataModules

For specific benchmark datasets, specialized datamodules provide auto-download capabilities and predefined configurations:

- {class}`~nequip.data.datamodule.MD22DataModule` - MD22 datasets
- {class}`~nequip.data.datamodule.rMD17DataModule` - Revised MD17 datasets
- {class}`~nequip.data.datamodule.sGDML_CCSD_DataModule` - sGDML datasets
- {class}`~nequip.data.datamodule.TM23DataModule` - TM23 dataset
- {class}`~nequip.data.datamodule.NequIP3BPADataModule` - 3BPA dataset

These specialized datamodules have unique APIs tailored to their specific datasets and often handle downloading and preprocessing automatically.

### Custom Data Configurations

For more complex or custom data setups, you can use the base {class}`~nequip.data.datamodule.NequIPDataModule` directly and specify the dataset configurations yourself. Datasets are the components that actually read data files and apply transforms to individual structures. See [`nequip.data.dataset`](../../api/dataset.rst) for available dataset classes.

The existing [specialized datamodules](#specialized-datamodules) are essentially convenience wrappers that simplify configuring the base {class}`~nequip.data.datamodule.NequIPDataModule` with specific datasets and common settings.
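
As a rough illustration, configuring the base datamodule with explicit datasets might look like the sketch below. It assumes that the base class takes `seed`, `train_dataset`, and `val_dataset` arguments and that an `ASEDataset` class with `file_path` and `transforms` arguments is available in `nequip.data.dataset`; treat these names, the seed, and the file names as assumptions to verify against the API documentation.

```yaml
data:
  _target_: nequip.data.datamodule.NequIPDataModule
  seed: 123  # illustrative value
  # each dataset reads its own file and applies its own transforms
  train_dataset:
    _target_: nequip.data.dataset.ASEDataset  # assumed dataset class
    file_path: train.xyz                      # placeholder file name
    transforms:
      - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
        model_type_names: [C, H, O, Cu]
      - _target_: nequip.data.transforms.NeighborListTransform
        r_max: 5.0
  val_dataset:
    _target_: nequip.data.dataset.ASEDataset
    file_path: val.xyz
    transforms:
      - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
        model_type_names: [C, H, O, Cu]
      - _target_: nequip.data.transforms.NeighborListTransform
        r_max: 5.0
  # ... dataloaders, stats_manager, and other arguments
```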

### DataModule Arguments

Key arguments that datamodules take include transforms (see [Data Transforms](#data-transforms)), dataloaders (see [DataLoaders](#dataloaders)), and dataset statistics managers (see [Dataset Statistics](#dataset-statistics)).
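
To show how these pieces fit together, the sketch below combines the snippets from the following sections into a single `data` block. It assumes that `transforms`, `train_dataloader`, and `stats_manager` are passed directly as arguments of the datamodule and that `model_type_names` is defined once at the top level of the config; other required arguments are elided.

```yaml
model_type_names: [C, H, O, Cu]

data:
  _target_: nequip.data.datamodule.ASEDataModule
  split_dataset:
    file_path: training_data.xyz
    train: 0.8
    val: 0.1
    test: 0.1
  # applied sequentially to every structure
  transforms:
    - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
      model_type_names: ${model_type_names}
    - _target_: nequip.data.transforms.NeighborListTransform
      r_max: 5.0
  # batches the transformed structures for training
  train_dataloader:
    _target_: torch.utils.data.DataLoader
    batch_size: 5
    num_workers: 5
    shuffle: true
  # computes statistics of the processed data
  stats_manager:
    _target_: nequip.data.CommonDataStatisticsManager
    type_names: ${model_type_names}
  # ... other arguments
```

Defining `model_type_names` once and interpolating it keeps the type names consistent between the transforms and the statistics manager.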

## Data Transforms

Transforms process raw data into a format suitable for model training. They are specified in datamodule configurations (see [DataModules](#datamodules) and [`nequip.data.datamodule`](../../api/datamodule.rst)), which pass them to datasets (see [`nequip.data.dataset`](../../api/dataset.rst)), where they are applied sequentially to each data point.
Two transforms are essential for most use cases:

- **{class}`~nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper`** maps atomic numbers to model type indices. This handles the distinction between chemical species (C, H, O) and the model atom type names:
  ```yaml
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: [C, H, O, Cu]
    chemical_species_to_atom_type_map:
      C: C
      H: H
      O: O
      Cu: Cu
  ```

  When `model_type_names` correspond exactly to chemical species (the common case), you can omit `chemical_species_to_atom_type_map` and it will default to an identity mapping:
  ```yaml
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: [C, H, O, Cu]
  ```

  Alternatively, you can use the `list_to_identity_dict` resolver to be explicit:
  ```yaml
  model_type_names: [C, H, O, Cu]
  chemical_species: ${model_type_names}

  transforms:
    - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
      model_type_names: ${model_type_names}
      chemical_species_to_atom_type_map: ${list_to_identity_dict:${chemical_species}}
  ```

- **{class}`~nequip.data.transforms.NeighborListTransform`** computes which atoms are neighbors of each atom within a cutoff distance.
  ```yaml
  - _target_: nequip.data.transforms.NeighborListTransform
    r_max: 5.0  # should be the same as model `r_max`
  ```

The `model_type_names` list defines the atom types known to the model, and the `chemical_species_to_atom_type_map` dict explicitly maps chemical species to these types. The `model_type_names` should be consistent across data, model, and statistics configurations.
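
The explicit map matters when the model's type names differ from the chemical symbols. The sketch below uses hypothetical type names (`carbon`, `hydrogen`, `oxygen`) purely for illustration; any names work as long as every value in the map appears in `model_type_names`.

```yaml
- _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
  # hypothetical type names, chosen only to differ from the chemical symbols
  model_type_names: [carbon, hydrogen, oxygen]
  chemical_species_to_atom_type_map:
    C: carbon
    H: hydrogen
    O: oxygen
```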

Here's an example with both transforms:

```yaml
transforms:
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: ${model_type_names}
  - _target_: nequip.data.transforms.NeighborListTransform
    r_max: 5.0
```

```{warning}
**Transform Order May Matter**: The order of transforms can be important for some configurations. For example, when using per-edge-type cutoffs in {class}`~nequip.data.transforms.NeighborListTransform`, the {class}`~nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper` must come before {class}`~nequip.data.transforms.NeighborListTransform` because the neighbor list transform needs atom type information to apply different cutoffs for different element pairs.
```
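
A configuration with per-edge-type cutoffs might look like the following sketch. The `per_edge_type_cutoff` and `type_names` argument names, the nested layout, and the cutoff values are assumptions made for illustration; consult the {class}`~nequip.data.transforms.NeighborListTransform` API documentation for the exact format.

```yaml
transforms:
  # the type mapper comes first so that atom types are available
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: [C, H]
  - _target_: nequip.data.transforms.NeighborListTransform
    r_max: 5.0
    # assumed argument names; cutoff values are illustrative only
    type_names: [C, H]
    per_edge_type_cutoff:
      H: 2.0     # cutoff for all edges whose central atom is H
      C:
        H: 4.0   # cutoff for C-H edges
        C: 3.5   # cutoff for C-C edges
```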

Additional transforms are available for specific use cases. For stress-related data, you may need:

- {class}`~nequip.data.transforms.VirialToStressTransform` - converts virial to stress tensors
- {class}`~nequip.data.transforms.StressSignFlipTransform` - handles different stress sign conventions
- {class}`~nequip.data.transforms.AddNaNStressTransform` - adds NaN stress tensors for structures without stress data (useful for datasets with partial stress coverage, see [Partial Stress Data FAQ](../reference/faq.md#partial-stress-data))

For a complete list of available transforms, see the [transforms API documentation](../../api/data_transforms.rst).
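
As a sketch of how a stress-related transform could be added, the example below appends {class}`~nequip.data.transforms.StressSignFlipTransform` to the two essential transforms. It assumes the transform needs no arguments beyond `_target_`; whether your data needs it at all depends on the sign convention used when the stresses were generated.

```yaml
transforms:
  - _target_: nequip.data.transforms.ChemicalSpeciesToAtomTypeMapper
    model_type_names: [C, H, O, Cu]
  - _target_: nequip.data.transforms.NeighborListTransform
    r_max: 5.0
  # assumed to take no further arguments
  - _target_: nequip.data.transforms.StressSignFlipTransform
```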

## DataLoaders

DataLoaders handle batching and parallel data loading using PyTorch's {class}`torch.utils.data.DataLoader`.
They are specified in datamodule configurations (see [DataModules](#datamodules) and [`nequip.data.datamodule`](../../api/datamodule.rst)) which use them to wrap datasets for efficient training:

```yaml
train_dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 5        # an important training hyperparameter to tune
  num_workers: 5       # parallel workers for data loading
  shuffle: true        # often useful to shuffle training data
```

```{tip}
Training batch size affects learning dynamics and is an important hyperparameter to tune. Validation and test batch sizes, however, have no effect on training; to speed up evaluation, set them as large as possible without causing out-of-memory errors.
```
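
For instance, the validation and test dataloaders could use a larger batch size than the training dataloader. In this sketch the `val_dataloader` and `test_dataloader` argument names are assumed by analogy with `train_dataloader`, and the batch sizes are illustrative rather than recommended values.

```yaml
train_dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 5    # tuned as a training hyperparameter
  shuffle: true
val_dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 20   # illustrative; limited only by available memory
test_dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 20
```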

```{tip}
When using multiple dataloader workers (`num_workers`), consider setting `OMP_NUM_THREADS` according to the number of available CPU cores when training on GPUs (when training on CPUs, `OMP_NUM_THREADS` also affects model speed). When `num_workers` is close to the number of available CPU cores, setting `OMP_NUM_THREADS=1` has been found to speed up data loading.
```

## Dataset Statistics

Dataset statistics give you a rough picture of your dataset (e.g., average energy per atom, force magnitudes) and are crucial for initializing data-derived model hyperparameters.
They are computed by specifying a dataset statistics manager as an argument to datamodules (see [DataModules](#datamodules) and [`nequip.data.datamodule`](../../api/datamodule.rst)).
The {class}`~nequip.data.CommonDataStatisticsManager` automatically computes essential statistics:

```yaml
stats_manager:
  _target_: nequip.data.CommonDataStatisticsManager
  type_names: [C, H, O, Cu]
  dataloader_kwargs:
    batch_size: 10  # Can be larger than training batch size to speed up computation
```

For energy-only datasets (without forces), use {class}`~nequip.data.EnergyOnlyDataStatisticsManager` instead:

```yaml
stats_manager:
  _target_: nequip.data.EnergyOnlyDataStatisticsManager
  type_names: [C, H, O, Cu]
  dataloader_kwargs:
    batch_size: 10
```

The `batch_size` in `dataloader_kwargs` is independent of the training batch size; a larger value usually speeds up the statistics computation, as long as the batches still fit in memory.
Statistics are computed once during data setup, not during training.

For advanced use cases, you can use the base {class}`~nequip.data.DataStatisticsManager` directly, which allows more flexible configuration.
See the [dataset statistics API documentation](../../api/data_stats.rst) for configuration options.

For guidance on using computed statistics to initialize model parameters, see [Training data statistics as hyperparameters](model.md/#training-data-statistics-as-hyperparameters).