Cookbook: Every way to make a Molecule

Every pathway through the OpenFF Toolkit boils down to four steps:

  1. Using other tools, assemble a graph of a molecule, including all of its atoms, bonds, bond orders, formal charges, and stereochemistry[1]

  2. Use that information to construct a Molecule

  3. Combine a number of Molecule objects to construct a Topology

  4. Call ForceField.create_openmm_system(topology) to create an OpenMM System (or an Interchange for painless conversion to other MD formats)

So let’s take a look at every way there is to construct a molecule! We’ll use zwitterionic L-alanine as an example biomolecule with all the tricky bits - a stereocenter, non-zero formal charges, and bonds of different orders.

from openff.toolkit import Molecule, Topology

From SMILES

SMILES is the classic way to create a Molecule. SMILES is a widely-used compact textual representation of arbitrary molecules. This lets us specify an exact molecule, including stereochemistry and bond orders, very easily — though they may not be the most human-readable format.

The Molecule.from_smiles() method is used to create a Molecule from a SMILES code.

Implicit hydrogens SMILES

zw_l_alanine = Molecule.from_smiles("C[C@H]([NH3+])C(=O)[O-]")

zw_l_alanine.visualize()
../_images/baea4bdf9e6c7bc5844494662cbfdf74a98735fdbd0616353d2b616e573becdf.svg

Explicit hydrogens SMILES

smiles_explicit_h = Molecule.from_smiles(
    "[H][C]([H])([H])[C@@]([H])([C](=[O])[O-])[N+]([H])([H])[H]",
    hydrogens_are_explicit=True,
)

assert zw_l_alanine.is_isomorphic_with(smiles_explicit_h)

smiles_explicit_h.visualize()
../_images/7eac1df6074103845dd0f49290abce6b07d38c5e25eef3e0fb57b0b4cfda7c1c.svg

Mapped SMILES

By default, no guarantees are made about the indexing of atoms from a SMILES string. If the indexing is important, a mapped SMILES string may be used. In this case, Hydrogens must be explicit. Note that though mapped SMILES strings must start at index 1, Python lists start at index 0.

mapped_smiles = Molecule.from_mapped_smiles(
    "[H:10][C:2]([H:7])([H:8])[C@@:4]([H:9])([C:3](=[O:5])[O-:6])[N+:1]([H:11])([H:12])[H:13]"
)

assert zw_l_alanine.is_isomorphic_with(mapped_smiles)

# First index is the Nitrogen
assert mapped_smiles.atoms[0].atomic_number == 7

# Final indices are all H
assert all([a.atomic_number == 1 for a in mapped_smiles.atoms[6:]])

mapped_smiles.visualize()
../_images/9d497089991f1627f7ac7eb809a628a6dfc16b5ba0d060518ef748194428e424.svg

SMILES without stereochemistry

The Toolkit won’t accept an ambiguous SMILES. This SMILES could be L- or D- alanine; rather than guess, the Toolkit throws an error:

from openff.toolkit.utils.exceptions import UndefinedStereochemistryError

try:
    smiles_non_isomeric = Molecule.from_smiles("CC([NH3+])C(=O)[O-]")
except UndefinedStereochemistryError as error:
    print(error)
Unable to make OFFMol from OEMol: OEMol has unspecified stereochemistry. oemol.GetTitle()=''Problematic atoms are:
Atom atomic num: 6, name: , idx: 1, aromatic: False, chiral: True with bonds:
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 0, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 7, name: , idx: 2, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 3, aromatic: False, chiral: False
bond order: 1, chiral: False to atom atomic num: 1, name: , idx: 9, aromatic: False, chiral: False

We can downgrade this error to a warning with the allow_undefined_stereo argument. This will create a molecule with undefined stereochemistry, which might lead to incorrect parametrization or surprising conformer generation. See the FAQ for more details.

smiles_non_isomeric = Molecule.from_smiles(
    "CC([NH3+])C(=O)[O-]",
    allow_undefined_stereo=True,
)

assert not zw_l_alanine.is_isomorphic_with(smiles_non_isomeric)

smiles_non_isomeric.visualize()
../_images/3755f649bf8513e9c303c7d24607d0693189b7c62d51c73f466cc7999fe426eb.svg

By hand

You can always construct a Molecule by building it up from individual atoms and bonds. Other methods are generally easier, but it’s a useful fallback for when you need to write your own constructor for an unsupported source format.

The Molecule() constructor and the add_atom() and add_bond() methods are used to construct a Molecule by hand.

by_hand = Molecule()
by_hand.name = "Zwitterionic l-Alanine"

by_hand.add_atom(
    atomic_number=8,  # Atomic number 8 is Oxygen
    formal_charge=-1,  # Formal negative charge
    is_aromatic=False,  # Atom is not part of an aromatic system
    stereochemistry=None,  # Optional argument; "R" or "S" stereochemistry
    name="O-",  # Optional argument; descriptive name for the atom
)
by_hand.add_atom(6, 0, False, name="C")
by_hand.add_atom(8, 0, False, name="O")
by_hand.add_atom(6, 0, False, stereochemistry="S", name="CA")
by_hand.add_atom(1, 0, False, name="CAH")
by_hand.add_atom(6, 0, False, name="CB")
by_hand.add_atom(1, 0, False, name="HB1")
by_hand.add_atom(1, 0, False, name="HB2")
by_hand.add_atom(1, 0, False, name="HB3")
by_hand.add_atom(7, +1, False, name="N+")
by_hand.add_atom(1, 0, False, name="HN1")
by_hand.add_atom(1, 0, False, name="HN2")
by_hand.add_atom(1, 0, False, name="HN3")


by_hand.add_bond(
    atom1=0,  # First (zero-indexed) atom specified above ("O-")
    atom2=1,  # Second atom specified above ("C")
    bond_order=1,  # Single bond
    is_aromatic=False,  # Bond is not aromatic
    stereochemistry=None,  # Optional argument; "E" or "Z" stereochemistry
    fractional_bond_order=None,  # Optional argument; Wiberg (or similar) bond order
)
by_hand.add_bond(1, 2, 2, False)  # C = O
by_hand.add_bond(1, 3, 1, False)  # C - CA
by_hand.add_bond(3, 4, 1, False)  # CA - CAH
by_hand.add_bond(3, 5, 1, False)  # CA - CB
by_hand.add_bond(5, 6, 1, False)  # CB - HB1
by_hand.add_bond(5, 7, 1, False)  # CB - HB2
by_hand.add_bond(5, 8, 1, False)  # CB - HB3
by_hand.add_bond(3, 9, 1, False)  # CB - N+
by_hand.add_bond(9, 10, 1, False)  # N+ - HN1
by_hand.add_bond(9, 11, 1, False)  # N+ - HN2
by_hand.add_bond(9, 12, 1, False)  # N+ - HN3

assert zw_l_alanine.is_isomorphic_with(by_hand)

by_hand.visualize()
../_images/07f673daacb2a86c55accafedc85fe66f8a18174129f754b78f5c0fda15c020e.svg

From a dictionary

Rather than build up the Molecule one method at a time, the Molecule.from_dict() method can construct a Molecule in one shot from a Python dict that describes the molecule in question. This allows Molecule objects to be written to and read from disk in any format that can be interpreted as a dict; this mechanism underlies the from_bson(), from_json(), from_messagepack(), from_pickle(), from_toml(), from_xml(), and from_yaml() methods.

This format can get very verbose, as it is intended for serialization, so this example uses hydrogen cyanide rather than alanine.

molecule_dict = {
    "name": "",
    "atoms": [
        {
            "atomic_number": 1,
            "formal_charge": 0,
            "is_aromatic": False,
            "stereochemistry": None,
            "name": "H",
        },
        {
            "atomic_number": 6,
            "formal_charge": 0,
            "is_aromatic": False,
            "stereochemistry": None,
            "name": "C",
        },
        {
            "atomic_number": 7,
            "formal_charge": 0,
            "is_aromatic": False,
            "stereochemistry": None,
            "name": "N",
        },
    ],
    "virtual_sites": [],
    "bonds": [
        {
            "atom1": 0,
            "atom2": 1,
            "bond_order": 1,
            "is_aromatic": False,
            "stereochemistry": None,
            "fractional_bond_order": None,
        },
        {
            "atom1": 1,
            "atom2": 2,
            "bond_order": 3,
            "is_aromatic": False,
            "stereochemistry": None,
            "fractional_bond_order": None,
        },
    ],
    "properties": {},
    "conformers": None,
    "partial_charges": None,
    "partial_charges_unit": None,
    "hierarchy_schemes": {},
}

from_dictionary = Molecule.from_dict(molecule_dict)

from_dictionary.visualize()
../_images/bd08caf96206f0075097380e248f13e37c4aed89ab1772dabe5eaffd5a92f62d.svg

From a file

We can construct a Molecule from a file or file-like object with the from_file() method. We’re a bit constrained in what file formats we can accept, because they need to provide all the information needed to construct the molecular graph; not just coordinates, but also elements, formal charges, bond orders, and stereochemistry.

From SDF file

We generally recommend the SDF format. The SDF file used here can be found on GitHub

sdf_path = Molecule.from_file("zw_l_alanine.sdf")
assert zw_l_alanine.is_isomorphic_with(sdf_path)
sdf_path.visualize()
../_images/20cefc63f2a9f8f7e426d930563617fd1507b8989746c87de725b02c451d2189.svg

From SDF file object

from_file() can also take a file object, rather than a path. Note that the object must be in binary mode!

with open("zw_l_alanine.sdf", mode="rb") as file:
    sdf_object = Molecule.from_file(file, file_format="SDF")

assert zw_l_alanine.is_isomorphic_with(sdf_object)
sdf_object.visualize()
../_images/20cefc63f2a9f8f7e426d930563617fd1507b8989746c87de725b02c451d2189.svg

From PDB file

The Topology.from_pdb() method is now the recommended method for loading all PDB files. It can interpret proteins, waters, ions, and small molecules from a PDB file as a Topology. It can infer the full chemical graph of the canonical amino acids (26 including protonation states) in capped and uncapped forms. It does this according to the RCSB chemical component dictionary — the information is not explicitly in the input PDB file. The following block loads a PDB file containing a single protein.

from openff.toolkit.utils import get_data_file_path

path = get_data_file_path("proteins/T4-protein.pdb")
topology = Topology.from_pdb(path)
protein = topology.molecule(0)
protein.visualize("nglview")

The following block loads a PDB containing two small molecules from PDB and SMILES.

top_from_pdb_from_smiles = Topology.from_pdb(
    get_data_file_path("molecules/po4_phenylphosphate.pdb"),
    unique_molecules=[
        Molecule.from_smiles("P(=O)([O-])([O-])([O-])"),
        Molecule.from_smiles("C1=CC=CC=C1OP(=O)([O-1])([O-1])"),
    ],
)

top_from_pdb_from_smiles.molecule(0).visualize("nglview")
top_from_pdb_from_smiles.molecule(1).visualize("nglview")

And for a maximalist example, the following block loads a single PDB file containing a protein, waters, ions, and a small molecule (the chemical identity of any small molecules must be provided with the unique_molecules keyword argument, and any small molecules’ connectivity must have corresponding CONECT records in the PDB file).

from openff.toolkit.utils import get_data_file_path

ligand_path = get_data_file_path("molecules/PT2385.sdf")
ligand = Molecule.from_file(ligand_path)

complex_path = get_data_file_path("proteins/5tbm_complex_solv_box.pdb")
topology = Topology.from_pdb(complex_path, unique_molecules=[ligand])

molecule_smileses = [mol.to_smiles() for mol in topology.molecules]
counts = molecule_smileses
for smiles in sorted(set(molecule_smileses)):
    count = molecule_smileses.count(smiles)
    if len(smiles) > 1000:
        smiles = "protein"
    print(smiles, ":", count, "molecule(s)")
[Cl-] : 12 molecule(s)
[H]O[H] : 4413 molecule(s)
protein : 1 molecule(s)
[H]c1c(c(c2c(c1Oc3c(c(c(c(c3[H])F)[H])C#N)[H])C(C([C@@]2([H])O[H])(F)F)([H])[H])S(=O)(=O)C([H])([H])[H])[H] : 1 molecule(s)
[Na+] : 17 molecule(s)

Other string identification formats

The OpenFF Toolkit supports a few text based molecular identity formats other than SMILES (see above)

From InChI

The Molecule.from_inchi() method constructs a Molecule from an IUPAC InChI string. Note that InChI cannot distinguish the zwitterionic form of alanine from the neutral form (see section 13.2 of the InChI Technical FAQ), so the toolkit defaults to the neutral form.

Warning

The OpenFF Toolkit makes no guarantees about the atomic ordering produced by the from_inchi method. InChI is not intended to be an interchange format.

inchi = Molecule.from_inchi(
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
)

inchi.visualize()
../_images/a1c3717ec50320213965eacc83b9f48dd971cbd39ee1c63ffe7f045cc6b853f0.svg

From IUPAC name

The Molecule.from_iupac() method constructs a Molecule from an IUPAC name.

Important

This code requires the OpenEye toolkit.

iupac = Molecule.from_iupac("(2S)-2-azaniumylpropanoate")

assert zw_l_alanine.is_isomorphic_with(iupac)

iupac.visualize()
../_images/9f89ea6c64e1fe895b9cdac1191153fa0e70ce22acecdb62068b2a2e64db9dba.svg

Re-ordering atoms in an existing Molecule

Most Molecule creation methods don’t specify the ordering of atoms in the new Molecule. The Molecule.remap() method allows a new ordering to be applied to an existing Molecule.

See also Mapped SMILES.

Warning

The Molecule.remap() method is experimental and subject to change.

# Note that this mapping is off-by-one from the mapping taken
# by the remap method, as Python indexing is 0-based but SMILES
# is 1-based
print("Before remapping:", zw_l_alanine.to_smiles(mapped=True))

# Flip the order of the carbonyl carbon and oxygen
remapped = zw_l_alanine.remap(
    {
        0: 0,
        1: 1,
        2: 2,
        3: 4,  # Note these two mappings,
        4: 3,  # which are flipped!
        5: 5,
        6: 6,
        7: 7,
        8: 8,
        9: 9,
        10: 10,
        11: 11,
        12: 12,
    }
)

print("After remapping: ", remapped.to_smiles(mapped=True))

# Doesn't affect the identity of the molecule
assert zw_l_alanine.is_isomorphic_with(remapped)
remapped.visualize()
Before remapping: [H:3][C@@:2]([C:5](=[O:6])[O-:7])([C:1]([H:8])([H:9])[H:10])[N+:4]([H:11])([H:12])[H:13]
After remapping:  [H:3][C@@:2]([C:4](=[O:6])[O-:7])([C:1]([H:8])([H:9])[H:10])[N+:5]([H:11])([H:12])[H:13]
../_images/36c541fe3accee6f43168197bd56d0bd0a461abfa9172fb45cf77dd47bcbdc3f.svg

Via Topology objects

The Topology class represents a biomolecular system; it is analogous to the similarly named objects in GROMACS, MDTraj or OpenMM. Notably, it does not include co-ordinates and may represent multiple copies of a particular molecular species or even more complex mixtures of molecules. Topology objects are usually built up one species at a time from Molecule objects.

Molecule objects can be retrieved from a Topology via the Topology.molecule() method by providing the index of the molecule within the topology. For a topology consisting of a single molecule, this is just topology.molecule(0).

Constructor methods that are available for Topology but not Molecule generally require a Molecule to be provided via the unique_molecules keyword argument. The provided Molecule is used to provide the identity of the molecule, including aromaticity, bond orders, formal charges, and so forth. These methods therefore don’t provide a route to the graph of the molecule, but can be useful for reordering atoms to match another software package.

From an OpenMM Topology

The Topology.from_openmm() method constructs an OpenFF Topology from an OpenMM Topology. The method requires that all the unique molecules in the Topology are provided as OpenFF Molecule objects, as the structure of an OpenMM Topology doesn’t include the concept of a molecule. When using this method to create a Molecule, this limitation means that the method really only offers a pathway to reorder the atoms of a Molecule to match that of the OpenMM Topology.

from openmm.app.pdbfile import PDBFile

openmm_topology = PDBFile("zw_l_alanine.pdb").getTopology()
openff_topology = Topology.from_openmm(openmm_topology, unique_molecules=[zw_l_alanine])

from_openmm_topology = openff_topology.molecule(0)

assert zw_l_alanine.is_isomorphic_with(from_openmm_topology)

from_openmm_topology.visualize()
../_images/f61691f5bb02cd6d654f6636a1985c8b8741e01781666030e7c280344434e4dc.svg

From an MDTraj Topology

The Topology.from_mdtraj() method can also be used to create an OpenFF Topology from an MDTraj Topology. This method has similar restrictions and behavior as Topology.from_openmm.

from mdtraj import load_pdb

mdtraj_topology = load_pdb("zw_l_alanine.pdb").topology
openff_topology = Topology.from_mdtraj(mdtraj_topology, unique_molecules=[zw_l_alanine])

from_mdtraj_topology = openff_topology.molecule(0)

assert zw_l_alanine.is_isomorphic_with(from_mdtraj_topology)

from_mdtraj_topology.visualize()
../_images/f61691f5bb02cd6d654f6636a1985c8b8741e01781666030e7c280344434e4dc.svg

From Toolkit objects

The OpenFF Toolkit calls out to other software to perform low-level tasks like reading SMILES or files. These external software packages are called toolkits, and presently include RDKit and the OpenEye Toolkit. OpenFF Molecule objects can be created from the equivalent objects in these toolkits.

From RDKit Mol

The Molecule.from_rdkit() method converts an rdkit.Chem.rdchem.Mol object to an OpenFF Molecule.

from rdkit import Chem

rdmol = Chem.MolFromSmiles("C[C@H]([NH3+])C([O-])=O")

print("rdmol is of type", type(rdmol))

from_rdmol = Molecule.from_rdkit(rdmol)

assert zw_l_alanine.is_isomorphic_with(from_rdmol)
from_rdmol.visualize()
rdmol is of type <class 'rdkit.Chem.rdchem.Mol'>
../_images/20cefc63f2a9f8f7e426d930563617fd1507b8989746c87de725b02c451d2189.svg

From OpenEye OEMol

The Molecule.from_openeye() method converts an object that inherits from openeye.oechem.OEMolBase to an OpenFF Molecule.

from openeye import oechem

oemol = oechem.OEGraphMol()
oechem.OESmilesToMol(oemol, "C[C@H]([NH3+])C([O-])=O")

assert isinstance(oemol, oechem.OEMolBase)

from_oemol = Molecule.from_openeye(oemol)

assert zw_l_alanine.is_isomorphic_with(from_oemol)
from_oemol.visualize()
../_images/80c08aa8200e8d296840edd2e8be3455d07dcdd0e846eb155dc7e34c6da7428e.svg

From QCArchive

QCArchive is a repository of quantum chemical calculations on small molecules. The Molecule.from_qcschema() method creates a Molecule from a record from the archive. Because the identity of a molecule can change of the course of a QC calculation, the Toolkit accepts records only if they contain a hydrogen-mapped SMILES code.

Note

These examples use molecules other than l-Alanine because of their availability in QCArchive

From a QCArchive molecule record

The Molecule.from_qcschema() method can take a molecule record queried from the QCArchive and create a Molecule from it.

client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443/")

record = [*client.query_molecules(molecular_formula="C16H20N3O5")][-1]

from_qcarchive = Molecule.from_qcschema(record)

from_qcarchive.visualize()
../_images/23f7c80504fc6b21e247c3e9b87ec5a4cc66a77c8c671aae9b741f21b438f97c.svg

From a QCArchive optimisation record

Molecule.from_qcschema() can also take an optimisation record and create the corresponding Molecule.

optimization_dataset = client.get_dataset(
    dataset_type="optimization",
    dataset_name="SMIRNOFF Coverage Set 1",
)
dimethoxymethanol_optimization = optimization_dataset.get_entry(
    "coc(o)oc-0",
)

from_optimisation = Molecule.from_qcschema(dimethoxymethanol_optimization)

from_optimisation.visualize()
../_images/15fbede9e12296fccbad80a3bd50430138191c15ffae51c4255f124f55296738.svg