How to prepare training dataset

What does NequIP behind the scene

NequIP uses AtomicDataset class to store the atomic configurations. During the initialization of an AtomicDataset object, NequIP reads the atomic structures from the dataset, computes the neighbor list and other data structures needed for the GNN by converting raw data to a list of AtomicData objects.

The computed results are then cached on harddisk root/processed_hashkey folder. The hashing is based on all the metadata provided for the dataset, which includes the file name, the cutoff radius, float number precision and etc. In the case where multiple training/evaluation runs use the same dataset, the neighbor list will only be computed in the first NequIP run. The later runs will directly load the AtomicDataset object from the cache file to save computation time.

Note: be careful to the cached file. If you update your raw data file but keep using the same filename, NequIP will not automatically update the cached data.

Key concepts

yaml interface

nequip-train and nequip-evaluate automatically construct the AtomicDataset based on the yaml arguments. Later sections offer a couple different examples.

If the training and validation datasets are from different raw files, the arguments for each set can be defined with dataset prefix and validation_dataset prefix, respectively.

For example, dataset_file_name is used for training data and validation_dataset_file_name is for validation data.

Python interface

See nequip.data.AtomicInMemoryDataset.

Prepare dataset and specify in yaml config

ASE format

NequIP accept all format that can be parsed by ase.io.read function. We recommend extxyz.

Example: Given an atomic data stored in “H2.extxyz” that looks like below:

2
Properties=species:S:1:pos:R:3 energy=-10 user_label=2.0 pbc="F F F"
H       0.00000000       0.00000000       0.00000000
H       0.00000000       0.00000000       1.02000000

The yaml input should be

dataset: ase
dataset_file_name: H2.extxyz
ase_args:
format: extxyz
include_keys:
  - user_label
key_mapping:
  user_label: label0
chemical_symbol_to_type:
  H: 0

For other formats than extxyz, be careful to the ase parsers; they may have different behavior from the extxyz parser. For example, the ase vasp parser store potential energy to free_energy instead of energy. Because we optimize our code to the extxyz parser, NequIP will not be able to load any total_energy labels. We need some additional keys to help NequIP to understand the situtaion Here’s an example for vasp outcar.

dataset: ase
dataset_file_name: OUTCAR
ase_args:
  format: vasp-out
key_mapping:
  free_energy: total_energy
chemical_symbol_to_type:
  H: 0

The way around is to use key mapping, please see more note below.

NPZ format

If your dataset constitute configurations that always have the same number of atoms, npz data format can be an option.

In the npz file, all the values should have the same row as the number of the configurations. For example, the force array of 36 atomic configurations of an N-atom system should have the shape of (36, N, 3); their total_energy array should have the shape of (36).

NPZ also supports “fixed fields.” Fixed fields are the quantities that are shared among all the configurations in the dataset. For example, if the dataset is a trajectory of an NVT MD simulation, the super cell size and the atomic species are indeed a constant matrix/vector through out the whole dataset. In this case, in stead of repeating the same values for many times, we specify the cell and species as fixed fields and only provide them once.

Below is an example of the yaml specification.

dataset: npz
dataset_file_name: example.npz
include_keys:
  - user_label1
  - user_label2
npz_fixed_field_keys:
  - cell
  - atomic_numbers
key_mapping:
  position: pos
  force: forces
  energy: total_energy
  Z: atomic_numbers

Note on key mapping

NequIP has default key names for energy, force, cell (defined at nequip.data._keys) Unlike in the ASE format where these information is automatically parsed, in the npz data format, the correct key names have to be provided. The common key names are: total_energy, forces, atomic_numbers, pos, cell, pbc. the key_mapping can help to convert the user defined name (key) to NequIP default name (value).

Advanced options

skip frames during data processing

The include_frame argument can be specified in yaml to skip certain frames in the raw datafile. The item has to be a list or a python iteratable object.

register user-defined graph, node, edge fields

Graph, node, edge fields are quantities that belong to the whole graph, each atom, each edge, respectively. Example graph fields include cell, pbc, and total_energy. Example node fields include pos, forces

To help NequIP to properly assemble the batch data, graph quantity other than cell, pbc, total_energy should be registered.