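Powered by the fantastic OpenFF Toolkit, QCSubmit can consume input molecules from a wide range of sources, including individual files, directories of files, collections of RDKit, OpenEye or OpenFF Toolkit molecule objects, and HDF5 files.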
You can read more about each of these input paths below. In general, you can simply pass the input to your chosen QCSubmit dataset generation factory via the molecules keyword argument of create_dataset:
```python
dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the single/multiple molecule sdf here
    molecules="my_exotic_sdf.sdf",
    ...
)
```
QCSubmit will then determine the type of input and process it accordingly using the component_result class, which deduplicates the molecules while preserving unique conformations.
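The idea of deduplicating molecules while keeping every distinct conformation can be illustrated with a small standalone sketch. It is not the component_result implementation: it assumes each molecule is identified by a string such as a canonical SMILES, treats a conformation as a plain sequence of (x, y, z) tuples, and compares geometries by rounded coordinates. The function name deduplicate and its decimals parameter are hypothetical.

```python
def deduplicate(entries, decimals=6):
    """Merge repeated molecules, keeping each distinct conformation once.

    `entries` is an iterable of (identifier, conformation) pairs, where
    `conformation` is a sequence of (x, y, z) coordinates.
    Hypothetical helper for illustration only.
    """
    unique = {}
    for identifier, conformation in entries:
        # Round coordinates so floating-point noise is not treated as a new geometry
        key = tuple(tuple(round(c, decimals) for c in atom) for atom in conformation)
        conformers = unique.setdefault(identifier, [])
        if key not in conformers:
            conformers.append(key)
    return unique


entries = [
    ("CCO", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    ("CCO", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),  # exact repeat, dropped
    ("CCO", [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]),  # new geometry, kept
    ("C", [(0.0, 0.0, 0.0)]),
]
result = deduplicate(entries)
```

Here the three CCO entries collapse into one molecule with two conformations, while the repeated geometry is stored only once.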
QCSubmit supports the following individual file formats, as well as directories containing a mix of formats. Simply provide the path to the target directory and QCSubmit will search through it and read in the molecules from each file.
PDB, provided the openeye-toolkits package is available
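To make the directory search concrete, the following standalone sketch collects candidate files from a directory by extension. It only illustrates the concept and does not reproduce QCSubmit's own logic: the function name find_molecule_files is hypothetical, the default extensions are placeholders rather than the list of supported formats, and the non-recursive search is an assumption.

```python
from pathlib import Path


def find_molecule_files(directory, extensions=(".sdf", ".pdb")):
    """Return files directly inside `directory` whose suffix is in `extensions`.

    Hypothetical helper: the extension list is a placeholder, and
    subdirectories are not searched.
    """
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in Path(directory).iterdir()
        # Compare suffixes case-insensitively so "B.PDB" and "b.pdb" both match
        if path.is_file() and path.suffix.lower() in allowed
    )
```

Each returned path could then be read in turn, which is the behaviour described above for mixed-format directories.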
In some cases you may want to pre-process the molecules using a custom workflow not yet supported by QCSubmit, leaving you with a collection of molecule objects from RDKit, OpenEye or the OpenFF Toolkit. Because QCSubmit uses the OpenFF Toolkit Molecule class internally when processing datasets, these objects must first be converted to that type. To ensure the conversion is correct, the Molecule class provides convenience methods for converting from RDKit and OpenEye molecule objects:
```python
from openff.toolkit.topology import Molecule

# a list of OE and RDKit molecules
processed_mols = [oemol1, oemol2, rdmol1, rdmol2]

# convert to openff.toolkit.topology.Molecule instances
molecules = [Molecule(ref_mol) for ref_mol in processed_mols]

dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the list of molecules
    molecules=molecules,
    ...
)
```
HDF5 support is pre-alpha, so the specification is still evolving.
QCSubmit also supports the HDF5 file format, which is well suited to inputs containing many conformations per molecule. The format consists of one named group per molecule. Two datasets should then be made under this group with the following naming and information:
conformations: A NumPy array with shape (n, n_atoms, 3) containing all of the molecule conformations in Cartesian coordinates, where n is the number of conformations and n_atoms is the number of atoms in the molecule.
smiles: A length 1 list containing a single mapped SMILES string representing the molecule. Every atom in the molecule should be mapped to an index from 1 to
If the “molecule” contains multiple components, this format still uses a single SMILES string; individual components may be distinguished using the
Finally, the units of the molecule conformations should be recorded as an attribute of the conformations dataset under the key units. Recognized units include:
HDF5 files following this format can be constructed using the OpenFF Toolkit:
```python
import h5py
import numpy as np
from simtk import unit

output_file = h5py.File("my_exotic_molecules.hdf5", "w")

# Create a list of OpenFF Toolkit Molecule instances with conformations
target_molecules = [...]

for mol in target_molecules:
    smiles = mol.to_smiles(
        isomeric=True,
        explicit_hydrogens=True,
        mapped=True,
    )
    conformations = [c.value_in_unit(unit.nanometers) for c in mol.conformers]
    group = output_file.create_group(mol.name)
    group.create_dataset(
        'smiles',
        data=[smiles],
        dtype=h5py.string_dtype(),
    )
    ds = group.create_dataset(
        'conformations',
        data=np.array(conformations),
        dtype=np.float32,
    )
    ds.attrs['units'] = 'nanometers'

output_file.close()
```
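Before writing a conformations dataset, it can help to confirm that the array has the (n, n_atoms, 3) shape described above. The following is a minimal sketch of such a check. The function name check_conformations_shape is hypothetical and not part of QCSubmit, and it works on a plain shape tuple so that it has no dependencies.

```python
def check_conformations_shape(shape, n_atoms):
    """Return the number of conformations if `shape` is (n, n_atoms, 3).

    Hypothetical helper: raises ValueError for any other shape.
    """
    if len(shape) != 3:
        raise ValueError(
            f"expected 3 dimensions (n, n_atoms, 3), got {len(shape)}"
        )
    n_conformations, atoms, dims = shape
    if atoms != n_atoms or dims != 3:
        raise ValueError(f"expected (n, {n_atoms}, 3), got {tuple(shape)}")
    return n_conformations
```

With a NumPy array, the shape attribute can be passed directly, for example check_conformations_shape(np.array(conformations).shape, mol.n_atoms).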