Inputs
Powered by the fantastic OpenFF Toolkit, QCSubmit can consume input molecules from a wide range of sources, including:
Standard file formats
Molecule objects
HDF5 files via a custom specification
You can read more about each of these input paths below, but in general you can simply pass the input to your chosen QCSubmit dataset generation factory via the molecules keyword argument of the create_dataset function:
dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the single/multiple molecule sdf here
    molecules="my_exotic_sdf.sdf",
    ...
)
QCSubmit will then determine the type of input and process it accordingly using the component_result class, which deduplicates the molecules while preserving unique conformations.
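The idea of deduplicating molecules while keeping their distinct conformations can be sketched in plain Python. This is an illustration only, not QCSubmit's implementation: the deduplicate function is hypothetical, the molecule key is assumed to be some canonical identifier such as a canonical SMILES string, and exact coordinate equality stands in for whatever geometric comparison QCSubmit actually performs.

```python
def deduplicate(entries):
    """Merge repeated molecules, keeping each distinct conformation once.

    entries: iterable of (key, conformer) pairs, where key identifies the
    molecule (assumed here to be a canonical identifier) and conformer is a
    tuple of (x, y, z) tuples.
    """
    merged = {}
    for key, conformer in entries:
        conformers = merged.setdefault(key, [])
        # Simplification: conformers count as duplicates only when their
        # coordinates are exactly equal.
        if conformer not in conformers:
            conformers.append(conformer)
    return merged
```

Two entries with the same key collapse into one molecule holding both geometries, unless the geometries are identical, in which case only one copy is kept.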
Standard file formats
QCSubmit supports the following individual file formats, as well as directories containing a mix of formats. Simply provide the path to the target directory and QCSubmit will search through it and read in the molecules from each file.
MOL/SDF
PDB, provided the openeye-toolkits package is available
SMILES file
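Before handing a mixed directory to QCSubmit, it can be useful to see which files fall under each of the formats listed above. The sketch below is a pre-flight check in plain Python, not QCSubmit's own detection logic. The group_by_format helper and the extension table are hypothetical, and the .smi extension for SMILES files is an assumption.

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical extension table for illustration only; QCSubmit's own file
# detection may recognise a different set of extensions.
FORMATS = {
    ".mol": "MOL/SDF",
    ".sdf": "MOL/SDF",
    ".pdb": "PDB",
    ".smi": "SMILES",  # assumed extension for SMILES files
}


def group_by_format(paths):
    """Group file paths by format, ignoring files with other extensions."""
    grouped = defaultdict(list)
    for path in paths:
        suffix = PurePath(path).suffix.lower()
        if suffix in FORMATS:
            grouped[FORMATS[suffix]].append(str(path))
    return dict(grouped)
```

For a real directory, the paths could come from pathlib.Path("my_inputs").iterdir(), where my_inputs is a placeholder directory name.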
Molecule objects
In some cases you may want to pre-process the molecules using a custom workflow not yet supported by QCSubmit, leaving you with a collection of molecule objects from RDKit, OpenEye or the OpenFF Toolkit. As QCSubmit uses the OpenFF Toolkit Molecule class internally when processing datasets, these objects must first be converted to that type. To help ensure the conversion is correct, the Molecule class provides convenience methods for converting from RDKit and OpenEye molecule objects:
from openff.toolkit.topology import Molecule
# a list of OE and RDKit molecules
processed_mols = [oemol1, oemol2, rdmol1, rdmol2]
# convert to openff.toolkit.topology.Molecule instances
molecules = [Molecule(ref_mol) for ref_mol in processed_mols]
dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the list of molecules
    molecules=molecules,
    ...
)
HDF5 files
Warning
HDF5 support is still pre-alpha and so the specification is still evolving.
QCSubmit also supports the HDF5 file format, which is well suited to inputs containing many conformations per molecule. The format consists of one named group per molecule. Two datasets should then be made under this group with the following naming and information:
conformations: A NumPy array with shape (n, n_atoms, 3) containing all of the molecule's conformations in Cartesian coordinates, where n is the number of conformations and n_atoms is the number of atoms in the molecule.
smiles: A length-1 list containing a single mapped SMILES string representing the molecule. Every atom in the molecule should be mapped to an index from 1 to n_atoms.
Note
If the “molecule” contains multiple components, this format still uses a single SMILES string; individual components may be distinguished using the . separator.
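The mapping requirement on the SMILES string can be checked before a file is written. The sketch below is a plain-Python sanity check under the assumption that every atom is written in bracket form with a map index (for example [C:1]); the map_indices_are_complete helper is hypothetical and is not part of QCSubmit or the OpenFF Toolkit.

```python
import re

# Matches the atom map index inside a bracket atom such as [C:1] or [Na+:2].
_MAP_INDEX = re.compile(r"\[[^\]]*:(\d+)\]")


def map_indices_are_complete(mapped_smiles, n_atoms):
    """Return True if the map indices are exactly 1..n_atoms, each used once."""
    indices = sorted(int(index) for index in _MAP_INDEX.findall(mapped_smiles))
    return indices == list(range(1, n_atoms + 1))
```

Because the check only looks at bracket atoms, it treats a multi-component string such as [Na+:1].[Cl-:2] as a single set of indices, which matches the single-SMILES convention described in the note above.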
Finally, the units of the molecule conformations should be recorded as an attribute of the conformations dataset under the key units. Recognized units include:
nanometers
angstroms
bohrs
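To make the role of the units attribute concrete, the sketch below converts a conformation stored in any of the three recognized units into angstroms. It is an illustration in plain Python: the to_angstroms helper is hypothetical, the choice of angstroms as the target is arbitrary, and raising an error for other unit strings is an assumption of this sketch rather than documented QCSubmit behaviour.

```python
# Conversion factors to angstroms, keyed by the unit strings listed above.
# The bohr factor is the CODATA 2018 Bohr radius expressed in angstroms.
ANGSTROMS_PER_UNIT = {
    "nanometers": 10.0,
    "angstroms": 1.0,
    "bohrs": 0.529177210903,
}


def to_angstroms(conformation, units):
    """Convert one conformation (a list of [x, y, z] lists) to angstroms."""
    try:
        factor = ANGSTROMS_PER_UNIT[units]
    except KeyError:
        raise ValueError(f"unrecognized units: {units!r}") from None
    return [[coordinate * factor for coordinate in atom] for atom in conformation]
```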
Demonstration
HDF5 files following this format can be constructed using the OpenFF Toolkit:
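import h5py
import numpy as np
from openff.units import unit

# Create a list of OpenFF Toolkit Molecule instances with conformations
target_molecules = [...]

# the context manager closes the file even if an error occurs part-way through
with h5py.File("my_exotic_molecules.hdf5", "w") as output_file:
    for mol in target_molecules:
        # mapped SMILES in which every atom, including hydrogens, carries a map index
        smiles = mol.to_smiles(
            isomeric=True,
            explicit_hydrogens=True,
            mapped=True,
        )
        # strip the units, expressing each conformer in nanometers
        conformations = [c.m_as(unit.nanometers) for c in mol.conformers]

        # one group per molecule; group names must be unique within the file,
        # so each mol.name should be distinct and non-empty
        group = output_file.create_group(mol.name)
        group.create_dataset(
            'smiles',
            data=[smiles],
            dtype=h5py.string_dtype(),
        )
        # array of shape (n, n_atoms, 3)
        ds = group.create_dataset(
            'conformations',
            data=np.array(conformations),
            dtype=np.float32,
        )
        # record the units the coordinates were written in
        ds.attrs['units'] = 'nanometers'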
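A file written this way can be read back with h5py to confirm that it follows the layout described above. The sketch below is a minimal check, not part of QCSubmit: the read_molecules helper is hypothetical, and it assumes h5py 3.0 or later, where the asstr method is available for decoding stored strings.

```python
import h5py
import numpy as np


def read_molecules(path):
    """Return {group name: (mapped SMILES, coordinate array, units string)}."""
    entries = {}
    with h5py.File(path, "r") as hdf5_file:
        for name, group in hdf5_file.items():
            # Variable-length strings are returned as bytes unless decoded.
            smiles = group["smiles"].asstr()[0]
            conformations = group["conformations"]
            coordinates = np.asarray(conformations[()])
            units = conformations.attrs["units"]
            # The layout expects an array of shape (n, n_atoms, 3).
            if coordinates.ndim != 3 or coordinates.shape[-1] != 3:
                raise ValueError(
                    f"{name}: expected shape (n, n_atoms, 3), got {coordinates.shape}"
                )
            entries[name] = (smiles, coordinates, units)
    return entries
```

Calling read_molecules("my_exotic_molecules.hdf5") on the file from the demonstration would return one entry per molecule group.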