Inputs

Powered by the fantastic OpenFF Toolkit, QCSubmit can consume input molecules from a wide range of sources, including:

You can read more about each of these input paths below, but in general you can simply pass the input to your chosen QCSubmit dataset generation factory via the molecules keyword argument in the create_dataset function:

dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the single/multiple molecule sdf here
    molecules="my_exotic_sdf.sdf",
    ...
)

QCSubmit will then determine the type of input and process it accordingly using the component_result class, which deduplicates the molecules while preserving unique conformations.

Standard file formats

QCSubmit supports the following individual file formats, as well as directories containing a mix of formats. Simply provide the path to the target directory and QCSubmit will search through the directory and read in molecules for each file.

Molecule objects

In some cases you may want to pre-process the molecules using a custom workflow not yet supported by QCSubmit, and thus will have some collection of molecule objects from RDKit, OpenEye or the OpenFF Toolkit. As QCSubmit uses the OpenFF Toolkit Molecule class internally when processing datasets, the objects need to be first converted to this type. To ensure the correctness of the conversion, convenience methods are provided by the molecule class between RDKit and OpenEye molecule objects:

from openff.toolkit.topology import Molecule

# a list of OE and RDKit molecules
processed_mols = [oemol1, oemol2, rdmol1, rdmol2]

# convert to openff.toolkit.topology.Molecule instances
molecules = [Molecule(ref_mol) for ref_mol in processed_mols]

dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the list of molecules
    molecules=molecules,
    ...
)

HDF5 files

Warning

HDF5 support is still pre-alpha and so the specification is still evolving.

QCSubmit also supports the HDF5 file format, which is well suited to inputs containing many conformations per molecule. The format consists of one named group per molecule. Two datasets should then be made under this group with the following naming and information:

  • conformations: A Numpy array with shape (n, n_atoms, 3) containing all of the molecule conformations in Cartesian coordinates, where n is the number of conformations and n_atoms is the number of atoms in the molecule.

  • smiles: A length 1 list containing a single mapped SMILES string representing the molecule. Every atom in the molecule should be mapped to an index from 1 to n.

Note

If the “molecule” contains multiple components, this format still uses a single SMILES string; individual components may be distinguished using the . separator.

Finally, the units of the molecule conformation should be recorded as an attribute of the conformations dataset under the key units. Recognized units include:

  • nanometers

  • angstroms

  • bohrs

Demonstration

HDF5 files following this format can be constructed using the OpenFF Toolkit:

import h5py
import numpy as np
from openff.units import Quantity, unit

output_file = h5py.File("my_exotic_molecules.hdf5", "w")

# Create a list of OpenFF Toolkit Molecule instances with conformations
target_molecules = [...]

for mol in target_molecules: 
    smiles = mol.to_smiles(
        isomeric=True, 
        explicit_hydrogens=True, 
        mapped=True,
    )
    conformations = [c.m_as(unit.nanometers) for c in mol.conformers]

    group = output_file.create_group(mol.name)
    group.create_dataset(
        'smiles', 
        data=[smiles], 
        dtype=h5py.string_dtype(),
    )
    ds = group.create_dataset(
        'conformations', 
        data=np.array(conformations), 
        dtype=np.float32,
    )
    ds.attrs['units'] = 'nanometers'

output_file.close()