Prepare a dataset for training


Dependency installation instructions

Install the example dependencies into a new Conda environment using the provided environment file, then launch the notebook:
mamba env create --file ../../devtools/conda-envs/examples_env.yaml --name openff-nagl-examples
mamba activate openff-nagl-examples
jupyter notebook prepare-dataset.ipynb

Training a graph convolutional network (GCN) requires a collection of examples for the model to reproduce and interpolate between. This notebook describes how to prepare such a dataset for predicting partial charges.

Imports

from pathlib import Path

from tqdm import tqdm

from openff.toolkit.topology import Molecule

from openff.nagl.storage.record import MoleculeRecord
from openff.nagl.storage import MoleculeStore

Choosing our molecules

The simplest way to specify the molecules in our dataset is with SMILES, though anything you can load into an OpenFF Molecule is fair game. For instance, with the Molecule.from_file() method you could load partial charges from SDF files. But for this example, we’ll have NAGL generate our charges, so we can just provide the SMILES themselves:

alkanes_smiles = Path("alkanes.smi").read_text().splitlines()
alkanes_smiles
['C',
 'CC',
 'CCC',
 'CCCC',
 'CC(C)C',
 'CCCCC',
 'CC(C)CC',
 'CCCCCC',
 'CC(C)CCC',
 'CC(CC)CC']
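Hand-maintained .smi files often pick up blank lines, comment lines, or a molecule name after each SMILES. The following is a minimal sketch of a more forgiving reader using only the standard library. The helper name `read_smiles_lines` and the `#` comment convention are assumptions made for illustration and are not part of NAGL or the OpenFF Toolkit.

```python
from __future__ import annotations


def read_smiles_lines(text: str) -> list[str]:
    """Return the unique SMILES strings in .smi-style text, in first-seen order."""
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and (by assumption) lines starting with '#'
        if not line or line.startswith("#"):
            continue
        # Keep only the first whitespace-separated field; any trailing name is dropped
        smiles = line.split()[0]
        seen.setdefault(smiles, None)
    return list(seen)


example_text = "C\nCC\n\n# comment\nCCC propane\nCC\n"
print(read_smiles_lines(example_text))  # ['C', 'CC', 'CCC']
```

Note that this removes duplicates by comparing strings only, so two different SMILES spellings of the same molecule would both be kept.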

Generating charges

NAGL can generate AM1-BCC and AM1-Mulliken charges automatically with the OpenFF Toolkit. If you’d like a dataset of other charges, load them into the Molecule.partial_charges attribute and use the MoleculeRecord.from_precomputed_openff() method.

records = [
    MoleculeRecord.from_openff(
        Molecule.from_smiles(smiles, allow_undefined_stereo=True),
        partial_charge_methods=["am1bcc", "am1"],
        generate_conformers=True,
        n_conformer_pool=500,  # Start with 500 conformers...
        n_conformers=10,  # ... and prune all but 10 (ELF10)
        rms_cutoff=0.05,  # Conformers in the initial pool must be at least this different
    )
    for smiles in tqdm(alkanes_smiles, desc="Labeling molecules")
]
Labeling molecules: 100%|██████████| 10/10 [00:36<00:00,  3.67s/it]
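To give a feel for what an RMS cutoff does to a conformer pool, here is a simplified sketch of greedy pruning in plain Python. It is not how NAGL or the OpenFF Toolkit generate conformers: real toolkits align conformers and account for symmetry before comparing them, and this sketch does neither. The function names `rmsd` and `prune_by_rms` are hypothetical, and the cutoff is assumed to be in the same length units as the coordinates.

```python
from __future__ import annotations

import math

Conformer = list[tuple[float, float, float]]


def rmsd(a: Conformer, b: Conformer) -> float:
    """Root-mean-square deviation between two conformers, without any alignment."""
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


def prune_by_rms(conformers: list[Conformer], rms_cutoff: float) -> list[Conformer]:
    """Keep a conformer only if it is at least rms_cutoff from every conformer kept so far."""
    kept: list[Conformer] = []
    for conformer in conformers:
        if all(rmsd(conformer, other) >= rms_cutoff for other in kept):
            kept.append(conformer)
    return kept


# Three toy two-atom "conformers": the second is nearly identical to the first
base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
near_duplicate = [(0.0, 0.0, 0.0), (1.01, 0.0, 0.0)]
distinct = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]

pool = prune_by_rms([base, near_duplicate, distinct], rms_cutoff=0.05)
print(len(pool))  # 2
```

A smaller cutoff keeps more near-duplicates in the pool, while a larger one yields a sparser, more diverse pool for the later selection step.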

Storing the dataset

Finally, we’ll save all the molecule records to a SQLite database file, which NAGL can use directly as a dataset via DGLMoleculeLightningDataModule:

output_store_file = Path("alkanes.sqlite")
# Remove any file left over from a previous run so we start from a fresh database
if output_store_file.exists():
    output_store_file.unlink()

store = MoleculeStore(output_store_file)
store.store(records)
grouping records to store by InChI key: 100%|██████████| 10/10 [00:00<00:00, 113.46it/s]
storing grouped records: 100%|██████████| 10/10 [00:00<00:00, 125.70it/s]
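Because the store is an ordinary SQLite file, it can be sanity-checked with Python's built-in `sqlite3` module. The sketch below lists each table and its row count. It makes no assumptions about the table names or schema NAGL uses, and the helper name `count_rows_per_table` is hypothetical.

```python
from __future__ import annotations

import sqlite3
from contextlib import closing
from pathlib import Path


def count_rows_per_table(db_path: Path) -> dict[str, int]:
    """Map each table name in a SQLite file to its number of rows."""
    with closing(sqlite3.connect(db_path)) as connection:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            # Table names come from the database itself, so quoting them is sufficient here
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }


store_path = Path("alkanes.sqlite")
# sqlite3.connect() would create a missing file, so only inspect one that already exists
if store_path.exists():
    for table, n_rows in count_rows_per_table(store_path).items():
        print(f"{table}: {n_rows} rows")
```

This only reads the file, so it is safe to run on a store that will later be used for training.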