Powered by the fantastic openff-toolkit
QCSubmit can consume input molecules from a wide range of sources including:
You can read more about each of these inputs below, but in general getting started simply requires passing the input to your chosen QCSubmit dataset generation factory via the molecules keyword argument of the create_dataset() function, as shown here:
dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the single/multiple molecule sdf here
    molecules="my_exotic_sdf.sdf",
    ...
)
QCSubmit will then determine the type of input and process it accordingly using the ComponentResult class, which deduplicates the molecules while preserving unique conformations.
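The idea of deduplicating molecules while keeping every distinct conformation can be sketched in plain Python. This is only an illustration and not the ComponentResult implementation: the string keys, the tolerance, and the helper names same_conformation and deduplicate are all assumptions made for the example.

```python
from math import isclose


def same_conformation(first, second, tol=1e-6):
    """Return True when two conformations match coordinate by coordinate within tol."""
    if len(first) != len(second):
        return False
    return all(
        isclose(x, y, abs_tol=tol)
        for atom_a, atom_b in zip(first, second)
        for x, y in zip(atom_a, atom_b)
    )


def deduplicate(entries):
    """Merge (key, conformation) pairs into one record per key.

    Hypothetical: the key stands in for whatever identifier is used to decide
    that two inputs are the same molecule.
    """
    unique = {}
    for key, conformation in entries:
        stored = unique.setdefault(key, [])
        # Keep the conformation only if no equivalent one was stored already.
        if not any(same_conformation(conformation, seen) for seen in stored):
            stored.append(conformation)
    return unique


# Toy two-atom "conformations" given as lists of (x, y, z) tuples.
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)]
conf_c = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

unique = deduplicate(
    [("mol-1", conf_a), ("mol-1", conf_a), ("mol-1", conf_b), ("mol-2", conf_c)]
)
print({key: len(confs) for key, confs in unique.items()})
```

The repeated copy of conf_a is dropped, while conf_b is kept as a second conformation of the same molecule.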
Standard file formats
QCSubmit supports the following individual file formats as well as directories containing a mix of formats. To use a directory, simply provide the path to the target directory and QCSubmit will search through it, trying to read in molecules from each file.
PDB, provided you have openeye-toolkits available
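The directory behaviour described above (try every file, keep what can be read) can be sketched as follows. This is a minimal illustration and not QCSubmit's own code: read_directory and toy_reader are hypothetical names, and catching ValueError is an assumption about how a reader signals an unreadable file. In practice the reader could wrap something like Molecule.from_file from the openff-toolkit.

```python
import tempfile
from pathlib import Path


def read_directory(directory, reader):
    """Try `reader` on every file in `directory`; collect results and skipped names."""
    molecules, skipped = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            molecules.extend(reader(path))
        except ValueError:
            # Assumption: the reader raises ValueError for files it cannot parse.
            skipped.append(path.name)
    return molecules, skipped


def toy_reader(path):
    """Stand-in reader: 'parses' .sdf files and rejects everything else."""
    if path.suffix.lower() != ".sdf":
        raise ValueError(f"unsupported file: {path.name}")
    return [path.stem]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.sdf", "b.sdf", "notes.txt"):
        (Path(tmp) / name).write_text("")
    molecules, skipped = read_directory(tmp, toy_reader)

print(molecules, skipped)
```

Collecting the skipped file names rather than failing on the first unreadable file matches the "mix of formats" use case, where some files in the directory may not be molecules at all.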
In some cases you may want to pre-process the molecules using a custom workflow not yet supported by QCSubmit, and will therefore have a collection of molecule objects from OpenEye or the RDKit. As QCSubmit uses the openff-toolkit Molecule class internally when processing datasets, these objects first need to be converted to this type. To ensure the conversion is correct, the Molecule class provides convenience methods for converting from RDKit and OpenEye objects.
from openff.toolkit.topology import Molecule

# a list of OE and RDKit molecules
processed_mols = [oemol1, oemol2, rdmol1, rdmol2]

# convert to openff.toolkit.topology.Molecule instances
molecules = [Molecule(ref_mol) for ref_mol in processed_mols]

dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the list of molecules
    molecules=molecules,
    ...
)
HDF5 support is still pre-alpha and so the specification is still evolving.
QCSubmit also supports HDF5 files following a simple format which is well suited to inputs containing many
conformations per molecule. The format consists of one group per molecule, stored
under the index which should be assigned to that molecule. Two datasets should then
be created under this group with the following names and contents:
conformations: A numpy ndarray containing all of the molecule conformations with shape (n, n_atoms, 3), where n is the number of conformations and n_atoms is the number of atoms in the molecule.
smiles: A length 1 list of mapped smiles strings which represents the topology of the entire system.
If the system contains multiple components we should have a single smiles
string indexed from 1 to m where m is the total number of atoms, distinguishing individual components using the
Finally, the units of the molecule conformations should be set as an attribute of the conformations dataset under the key units. The recognised units are as follows:
HDF5 files following this format can then be readily made using the
import h5py
import numpy as np
from simtk import unit

# target_molecules: a list of openff.toolkit.topology.Molecule instances with conformations
with h5py.File("my_exotic_molecules.hdf5", "w") as output_file:
    for molecule in target_molecules:
        smiles = molecule.to_smiles(isomeric=True, explicit_hydrogens=True, mapped=True)
        # strip the units so the coordinates can be stored as plain floats
        conformations = [c.value_in_unit(unit.nanometers) for c in molecule.conformers]
        # one group per molecule, holding the two datasets described above
        group = output_file.create_group(molecule.name)
        group.create_dataset('smiles', data=[smiles], dtype=h5py.string_dtype())
        ds = group.create_dataset('conformations', data=np.array(conformations), dtype=np.float32)
        # record the units as an attribute of the conformations dataset
        ds.attrs['units'] = 'nanometers'
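Before passing such a file to QCSubmit it can be useful to read it back and check the layout. The sketch below assumes only the format described above (one group per molecule with a smiles dataset, a conformations dataset, and a units attribute). The helper name read_molecule_groups and the tiny two-atom example file are hypothetical, and this is not how QCSubmit itself parses the file.

```python
import os
import tempfile

import h5py
import numpy as np


def read_molecule_groups(path):
    """Return {group name: (smiles, conformations, units)} for each molecule group."""
    records = {}
    with h5py.File(path, "r") as handle:
        for name, group in handle.items():
            raw_smiles = group["smiles"][0]
            # Depending on the h5py version, strings come back as bytes or str.
            if isinstance(raw_smiles, bytes):
                raw_smiles = raw_smiles.decode("utf-8")
            dataset = group["conformations"]
            conformations = dataset[()]
            if conformations.ndim != 3 or conformations.shape[-1] != 3:
                raise ValueError(
                    f"{name}: expected shape (n, n_atoms, 3), got {conformations.shape}"
                )
            units = dataset.attrs["units"]
            if isinstance(units, bytes):
                units = units.decode("utf-8")
            records[name] = (str(raw_smiles), conformations, str(units))
    return records


# Write a tiny example file (two conformations of a two-atom molecule), then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.hdf5")
    with h5py.File(path, "w") as handle:
        group = handle.create_group("h2")
        group.create_dataset("smiles", data=["[H:1][H:2]"], dtype=h5py.string_dtype())
        coords = np.zeros((2, 2, 3), dtype=np.float32)
        coords[:, 1, 0] = [0.074, 0.080]
        ds = group.create_dataset("conformations", data=coords)
        ds.attrs["units"] = "nanometers"
    records = read_molecule_groups(path)

smiles, conformations, units = records["h2"]
print(smiles, conformations.shape, units)
```

Checking the array shape up front catches the most likely mistake, which is storing a single conformation as (n_atoms, 3) rather than (1, n_atoms, 3).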