Prepare a dataset for training
Dependency installation instructions
Install example dependencies into a new Conda environment using the provided environment.yaml:
mamba env create --file ../../devtools/conda-envs/examples_env.yaml --name openff-nagl-examples
mamba activate openff-nagl-examples
jupyter notebook prepare-dataset.ipynb
Training a graph convolutional network (GCN) requires a collection of examples that the network should reproduce and interpolate between. This notebook describes how to prepare such a dataset for predicting partial charges.
Imports
from pathlib import Path
from tqdm import tqdm
from openff.toolkit.topology import Molecule
from openff.nagl.storage.record import MoleculeRecord
from openff.nagl.storage import MoleculeStore
Choosing our molecules
The simplest way to specify the molecules in our dataset is with SMILES, though anything you can load into an OpenFF Molecule is fair game. For instance, with the Molecule.from_file() method you could load partial charges from SDF files. But for this example, we’ll have NAGL generate our charges, so we can just provide the SMILES themselves:
alkanes_smiles = Path("alkanes.smi").read_text().splitlines()
alkanes_smiles
['C',
'CC',
'CCC',
'CCCC',
'CC(C)C',
'CCCCC',
'CC(C)CC',
'CCCCCC',
'CC(C)CCC',
'CC(CC)CC']
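Real-world .smi files often contain blank lines, comment lines, or a molecule name after each SMILES string, any of which would end up in the list above. The sketch below is one way to read such a file more defensively. The read_smiles helper and the "#" comment convention are assumptions made for illustration; neither is part of NAGL or the OpenFF Toolkit.

```python
from pathlib import Path


def read_smiles(path):
    """Return the SMILES strings in a .smi file, one per non-empty line.

    Hypothetical helper: assumes the SMILES is the first whitespace-separated
    token on each line and that lines starting with '#' are comments.
    """
    smiles = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        smiles.append(line.split()[0])  # drop any trailing molecule name
    return smiles
```

With this helper, alkanes_smiles = read_smiles("alkanes.smi") would replace the read_text().splitlines() call. For a clean file with one SMILES per line, both approaches give the same list.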
Generating charges
NAGL can generate AM1-BCC and AM1-Mulliken charges automatically with the OpenFF Toolkit. If you’d like a dataset of other charges, load them into the Molecule.partial_charges attribute and use the MoleculeRecord.from_precomputed_openff() method.
records = [
    MoleculeRecord.from_openff(
        Molecule.from_smiles(smiles, allow_undefined_stereo=True),
        partial_charge_methods=["am1bcc", "am1"],
        generate_conformers=True,
        n_conformer_pool=500,  # Start with 500 conformers...
        n_conformers=10,  # ... and prune all but 10 (ELF10)
        rms_cutoff=0.05,  # Conformers in the initial pool must be at least this different
    )
    for smiles in tqdm(alkanes_smiles, desc="Labeling molecules")
]
Labeling molecules: 100%|██████████| 10/10 [00:36<00:00, 3.67s/it]
Storing the dataset
Finally, we’ll save all the molecule records to a SQLite database file, which NAGL can use directly as a dataset via DGLMoleculeLightningDataModule:
output_store_file = Path("alkanes.sqlite")
if output_store_file.exists():
    output_store_file.unlink()
store = MoleculeStore(output_store_file)
store.store(records)
grouping records to store by InChI key: 100%|██████████| 10/10 [00:00<00:00, 113.46it/s]
storing grouped records: 100%|██████████| 10/10 [00:00<00:00, 125.70it/s]
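Because the store is an ordinary SQLite file, it can be sanity-checked with Python's built-in sqlite3 module before being handed to training. The sketch below lists each table and its row count. It assumes nothing about NAGL's schema, and the summarize_sqlite helper is a hypothetical name introduced here rather than a NAGL function.

```python
import sqlite3


def summarize_sqlite(path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    connection = sqlite3.connect(path)
    try:
        tables = [
            name
            for (name,) in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names are read from the database itself and quoted as identifiers.
        return {
            table: connection.execute(
                f'SELECT COUNT(*) FROM "{table}"'
            ).fetchone()[0]
            for table in tables
        }
    finally:
        connection.close()
```

Calling summarize_sqlite("alkanes.sqlite") after the cell above should show at least one non-empty table, though the table names depend on the NAGL version. Note that sqlite3.connect() creates an empty database if the path does not exist, so an empty result may simply mean the filename is wrong.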