Download Notebook View in GitHub Open in Google Colab
Protein-ligand-water systems with Interchange
In this example, we’ll take a docked protein-ligand system from an OpenFF benchmark data set, parameterize and solvate it, and export the parameterized system to a variety of simulation engines.
This example describes a number of procedures that are buggy, poorly performing, or outright broken. Get excited about what's coming, but hold off on using this in production work.
import urllib
import mdtraj as md
import nglview
import numpy as np
from openff.toolkit import ForceField, Molecule, Topology
from openff.units import unit
from openff.interchange import Interchange
from openff.interchange.components._packmol import UNIT_CUBE, pack_box
from openff.interchange.drivers import (
get_amber_energies,
get_gromacs_energies,
get_openmm_energies,
)
from openff.interchange.drivers.all import get_summary_data
Collecting structures
In this example we’ll use starting coordinates data from MCL1, which is part of the Protein Ligand Benchmark data set curated by the Open Force Field Initiative. Conveniently for the purposes of this example, the ligand is already docked and the protein is relatively small (~2000 atoms), so we can focus on using Interchange without too much prep.
Start by downloading the protein and ligand structure files from GitHub:
url = (
"https://raw.githubusercontent.com/openforcefield/protein-ligand-benchmark/"
"8c94c0dcc892dfd77992567294b1ff31c62e8695/plbenchmark/sample_data/2020-08-26_mcl1_sample/"
)
urllib.request.urlretrieve(url + "01_protein/crd/protein.pdb", "protein.pdb")
urllib.request.urlretrieve(url + "02_ligands/lig_23/crd/lig_23.sdf", "lig_23.sdf")
# `protein.pdb` and `lig_23.sdf` should be in the local path now
!ls -lhrt
total 232K
-rw-r--r-- 1 runner docker 23K May 9 00:14 protein_ligand.ipynb
-rw-r--r-- 1 runner docker 211 May 9 00:14 README.md
-rw-r--r-- 1 runner docker 194K May 9 00:30 protein.pdb
-rw-r--r-- 1 runner docker 4.1K May 9 00:30 lig_23.sdf
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/pty.py:89: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
pid, fd = os.forkpty()
The OpenFF Toolkit recently (version 0.13) added support for loading multi-component PDBs. There are some limitations but for our system - a well-structured PDB file including a protein and some crystal waters - it should work fine.
protein_with_crystal_water = Topology.from_pdb("protein.pdb")
protein_with_crystal_water.n_molecules
29
protein_with_crystal_water
is a Topology
, not a Molecule
, containing the protein and a few crystal waters. We can splice out the protein as a Molecule
object and visualize it:
protein = protein_with_crystal_water.molecule(0)
protein
Preparing components
This system has three components: Protein, ligand, and solvent (water). For each component, we need positions and parameters. Our protein and ligand positions come from PDBs, and we’ll generate solvent coordinates ourselves. For parameters, the Sage force field will be perfect for the ligand and water, but doesn’t support proteins - they’re coming in Rosemary. In the meantime, we’ll use a SMIRNOFF port of ff14SB, a popular force field in the Amber force field family which has a compatible functionaln form.
Unfortunately, this means we need to treat each component seperately. Interchange provides an experimental means for combining these systems, which we’ll see in a bit.
Protein component
Let’s start with the protein component. We The Molecule.from_polymer_pdb()
method constructs a Molecule
from a PDB file encoding a protein. A Molecule
object stores a molecule’s chemical identity, bond graph, and co-ordinates. The OpenFF Toolkit doesn’t accept PDB files for small molecules because they don’t have enough chemical information, but it makes an exception for biopolymers like proteins via a chemical substructure dictionary containing information about canonical amino aicds. This saves us from needing to do things like write up a SMILES string for an entire protein.
Our PDB file doesn’t only contain one molecule, though, it contains a protein and crystal waters. The OpenFF Toolkit recently (version 0.13) added support for loading multi-component PDBs. There are some limitations but for our system it should work fine.
protein_with_crystal_water = Topology.from_pdb("protein.pdb")
protein_with_crystal_water.n_molecules
29
protein_with_crystal_water
is a Topology
, not a Molecule
, containing the protein and a few crystal waters. We can splice out the protein as a Molecule
object and visualize it to make sure it looks reasonable:
protein = protein_with_crystal_water.molecule(0)
protein.visualize(backend="nglview")
OpenFF maintains a port of Amber ff14sb, which we’ll use for the protein parameters. We’re using the impropers
variant because Interchange doesn’t support Amber’s improper torsion function.
The Amber ff14sb port is intended as a proof-of-concept for SMIRNOFF protein force fields. It does not precisely match the energetics or forces of the original ff14sb, and Interchange is missing features required for it to work correctly. Wait for protein support in the Rosemary force field to use this in production.
ff14sb = ForceField("ff14sb_off_impropers_0.0.3.offxml")
We can use the Interchange.from_smirnoff
constructor method to combine the protein
molecule (with coordinates) and the ff14sb
force field into an Interchange
.
protein_intrcg = Interchange.from_smirnoff(
force_field=ff14sb,
topology=protein.to_topology(),
)
Ligand component
SDF files encode all the chemical information the Toolkit needs to construct a Molecule
, so we can use the general-purpose small molecule from_file
method for the ligand:
ligand = Molecule.from_file("lig_23.sdf")
ligand.name = "LIG"
ligand.visualize(backend="nglview")
We’ll use the OpenFF 2.0.0 “Sage” force field for the ligand, which is a production-ready small molecule SMIRNOFF force field. Its coordinates are taken from the lig_23.sdf
file we downloaded earlier. We just want to do some point energy calculations as a proof of concept, so we’ll use the unconstrained variant of Sage (see the OpenFF Toolkit FAQ for details).
ligand_intrcg = Interchange.from_smirnoff(
force_field=ForceField("openff_unconstrained-2.0.0.offxml"),
topology=[ligand],
)
Now that we have two interchanges, we can combine them with the Interchange.combine
method! We’ll need a combined system to solvate too, so this’ll be useful in a second.
The
Interchange.combine
method is unstable and may break unexpectedly or be removed altogether at any time. In the future, OpenFF force fields will support biopolymers and combining Interchanges will be less necessary than at present. Don't combine force fields in production code.
docked_intrcg = protein_intrcg.combine(ligand_intrcg)
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:954: UserWarning: Interchange object combination is experimental and likely to produce strange results. Any workflow using this method is not guaranteed to be suitable for production. Use with extreme caution and thoroughly validate results!
return _combine(self, other)
In addition to making it easy to parameterize systems for all sorts of engines, Interchange makes it easy to visualize systems. We can use the visualize()
method to view our docked system in NGLView and make sure the co-ordinates make sense:
w = docked_intrcg.visualize()
w.clear_representations()
w.add_representation(
"licorice",
radius=0.1,
selection=[*range(protein_intrcg.topology.n_atoms)],
)
w.add_representation(
"spacefill",
selection=[*range(protein_intrcg.topology.n_atoms, docked_intrcg.topology.n_atoms)],
)
w
Solvent component
We’ll reuse the Sage force field from earlier here, as it includes parameters for TIP3P water, but first we need coordinates for our solvated system. This is a portion of the OpenFF ecosystem that will be streamlined in the future, but we can use a PACKMOL wrapper to get the job done. We’re adding a fixed amount of water for this quick example so the density will be wrong, but imagine it’s right.
# Construct a water molecule
water = Molecule.from_smiles("O")
water.generate_conformers(n_conformers=1)
# Come up with a box size based on the size of the protein plus a 2 nm buffer
xyz = protein.conformers[0]
centroid = xyz.sum(axis=0) / xyz.shape[0]
protein_radius = np.sqrt(((xyz - centroid) ** 2).sum(axis=-1).max())
box_vectors = UNIT_CUBE * (protein_radius * 2 + 2 * unit.nanometer)
# Pack the box with an arbitrary number of water
n_water = 1000
packed_topology = pack_box(
molecules=[water],
number_of_copies=[n_water],
solute=docked_intrcg.topology,
box_vectors=box_vectors,
)
# Visualize
packed_topology.to_file("packed.pdb")
nglview.show_structure_file("packed.pdb")
And now we can create the interchange! The Topology
we got from pack_box
includes the positions we’ll later apply to the solvated complex. For now, we need an Interchange
that represents the water component. We can pass it Sage, wihch contains TIP3P parameters, and a topology of n_water
water molecules without worrying about the positions, since we’ll just set those later.
water_intrcg = Interchange.from_smirnoff(
force_field=ForceField("openff_unconstrained-2.0.0.offxml"),
topology=[water] * n_water,
)
Putting the system together
Now that we’ve got all the pieces, we can combine the docked protein-ligand system with the solvent, and add in the positions and box vectors we just worked out
The
`Interchange.combine`
method is unstable and under-tested and may break unexpectedly or be removed altogether.
system_intrcg = docked_intrcg.combine(water_intrcg)
system_intrcg.positions = packed_topology.get_positions()
system_intrcg.box = packed_topology.box_vectors
w = system_intrcg.visualize()
w.clear_representations()
# Protein rep
w.add_representation(
"licorice",
radius=0.2,
selection=[*range(protein_intrcg.topology.n_atoms)],
)
# Ligand rep
w.add_representation(
"spacefill",
selection=[*range(protein_intrcg.topology.n_atoms, docked_intrcg.topology.n_atoms)],
)
# Water rep
w.add_representation(
"licorice",
radius=0.1,
selection=[*range(docked_intrcg.topology.n_atoms, system_intrcg.topology.n_atoms)],
)
w
Exporting to simulation engines
Finally, we can export the final Interchange object to models understood by various simulation engines. Some of these exports are not yet optimized for large files.
OpenMM
openmm_system = system_intrcg.to_openmm()
openmm_topology = system_intrcg.topology.to_openmm(ensure_unique_atom_names=False)
print(type(openmm_system), type(openmm_topology))
<class 'openmm.openmm.System'> <class 'openmm.app.topology.Topology'>
Amber
system_intrcg.to_inpcrd("out.inpcrd")
system_intrcg.to_prmtop("out.prmtop")
LAMMPS
# The LAMMPS exporter has not yet been optimized for large molecules or systems
if False:
system_intrcg.to_lammps("out.lmp")
GROMACS
Interchange’s GROMACS exporter is a little slow for biopolymers; this will be faster in a future release.
system_intrcg.to_gromacs(prefix="out")
Energy tests
In order to verify the accuracy of each export, we can use functions in the openff.interchange.drivers
module to call out to each engine to evaluate single-point energies. Under the hood, each function uses the export functions just as we did in the above cells. The GROMACS and Amber exports are a little slower than the OpenMM export, so some of these cells might take a minute to execute.
To get a quick look at how a single engine reports energies, use get_openmm_energies
(or get_gromacs_energies
, get_amber_energies
, or get_lammps_energies
).
get_openmm_energies(system_intrcg)
EnergyReport(energies={'Bond': <Quantity(11255.7296, 'kilojoule / mole')>, 'Angle': <Quantity(1980.01657, 'kilojoule / mole')>, 'Torsion': <Quantity(7759.41267, 'kilojoule / mole')>, 'Nonbonded': <Quantity(5067733.97, 'kilojoule / mole')>})
For convenience there is a function get_summary_data
that runs through all available engines and summarizes the results in a Pandas DataFrame. (This cell might take a minute to execute). We’re setting the argument _engines
to a non-defualt value so that the LAMMPS driver is skipped even if it’s available; normally this argument is unnecessary if you don’t have LAMMPS installed on your system.
summary = get_summary_data(system_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
summary
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:463: UserWarning: Ambiguous failure while processing constraints. Constraining h-bonds as a stopgap.
warnings.warn(
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:397: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
warnings.warn(
Bond | Angle | Torsion | Electrostatics | vdW | RBTorsion | |
---|---|---|---|---|---|---|
OpenMM | 11255.729633 | 1980.016575 | 7759.412675 | -21041.921220 | 5.088776e+06 | NaN |
Amber | 538.964470 | 2180.593271 | 7759.412514 | -142996.707746 | 5.088723e+06 | NaN |
GROMACS | 538.962646 | 1980.016235 | 7759.417099 | -21053.490234 | 5.088701e+06 | 0.0 |
We can see from these large energy differences that something is wrong - this stems from the experimental Interchange
combination operation producing incorrect results.
summary.describe()
Bond | Angle | Torsion | Electrostatics | vdW | RBTorsion | |
---|---|---|---|---|---|---|
count | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000e+00 | 1.0 |
mean | 4111.218917 | 2046.875360 | 7759.414096 | -61697.373067 | 5.088734e+06 | 0.0 |
std | 6187.327778 | 115.803108 | 0.002602 | 70407.289380 | 3.830953e+01 | NaN |
min | 538.962646 | 1980.016235 | 7759.412514 | -142996.707746 | 5.088701e+06 | 0.0 |
25% | 538.963558 | 1980.016405 | 7759.412595 | -82025.098990 | 5.088712e+06 | 0.0 |
50% | 538.964470 | 1980.016575 | 7759.412675 | -21053.490234 | 5.088723e+06 | 0.0 |
75% | 5897.347052 | 2080.304923 | 7759.414887 | -21047.705727 | 5.088750e+06 | 0.0 |
max | 11255.729633 | 2180.593271 | 7759.417099 | -21041.921220 | 5.088776e+06 | 0.0 |
In the future this should work more smoothly with identical energies reported by each engine. In lieu of that, we can evaluate the energy of each component that we previously added together. This requires setting box vectors for each component and also setting the water positions, which we skipped earlier since we were able to use PACKMOL results directly.
for subset in [ligand_intrcg, protein_intrcg, water_intrcg]:
subset.box = system_intrcg.box
water_intrcg.positions = system_intrcg.positions[-3000:,]
get_summary_data(ligand_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:397: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
warnings.warn(
Bond | Angle | Torsion | Electrostatics | vdW | RBTorsion | |
---|---|---|---|---|---|---|
OpenMM | 25.837656 | 356.516457 | 27.319830 | -31.854656 | 105.023359 | NaN |
Amber | 25.837455 | 356.516548 | 27.319846 | -31.912623 | 104.965682 | NaN |
GROMACS | 25.838177 | 356.516541 | 27.319841 | -31.987286 | 105.025609 | 0.0 |
get_summary_data(protein_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:397: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
warnings.warn(
Bond | Angle | Torsion | Electrostatics | vdW | RBTorsion | |
---|---|---|---|---|---|---|
OpenMM | 11229.891977 | 1623.500118 | 7732.092845 | -20643.050126 | 5.086854e+06 | NaN |
Amber | 11229.891982 | 1623.499947 | 7732.092668 | -20643.369819 | 5.086796e+06 | NaN |
GROMACS | 11229.882812 | 1623.498657 | 7732.095154 | -20647.458984 | 5.086781e+06 | 0.0 |
get_summary_data(water_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:397: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
warnings.warn(
Bond | Angle | Torsion | Electrostatics | vdW | RBTorsion | |
---|---|---|---|---|---|---|
OpenMM | 0.0 | 0.000000 | 0.0 | -59.077080 | 1704.356370 | NaN |
Amber | 0.0 | 200.576358 | 0.0 | -123632.468733 | 1703.736515 | NaN |
GROMACS | 0.0 | 0.000000 | 0.0 | -66.763672 | 1704.347492 | 0.0 |
We can see from these results that each engine reports nearly identical energies for each individual component.