Protein-ligand-water systems with Interchange

In this example, we’ll take a docked protein-ligand system from an OpenFF benchmark data set, parameterize and solvate it, and export the parameterized system to a variety of simulation engines.

🚧 This code is not production-ready
This example describes a number of procedures that are buggy, poorly performing, or outright broken. Get excited about what's coming, but hold off on using this in production work.
import urllib

import mdtraj as md
import nglview
import numpy as np
from openff.toolkit import ForceField, Molecule, Topology
from openff.units import unit

from openff.interchange import Interchange
from openff.interchange.components._packmol import UNIT_CUBE, pack_box
from openff.interchange.drivers import (
    get_amber_energies,
    get_gromacs_energies,
    get_openmm_energies,
)
from openff.interchange.drivers.all import get_summary_data

Collecting structures

In this example we’ll use starting coordinates data from MCL1, which is part of the Protein Ligand Benchmark data set curated by the Open Force Field Initiative. Conveniently for the purposes of this example, the ligand is already docked and the protein is relatively small (~2000 atoms), so we can focus on using Interchange without too much prep.

Start by downloading the protein and ligand structure files from GitHub:

url = (
    "https://raw.githubusercontent.com/openforcefield/protein-ligand-benchmark/"
    "8c94c0dcc892dfd77992567294b1ff31c62e8695/plbenchmark/sample_data/2020-08-26_mcl1_sample/"
)

urllib.request.urlretrieve(url + "01_protein/crd/protein.pdb", "protein.pdb")
urllib.request.urlretrieve(url + "02_ligands/lig_23/crd/lig_23.sdf", "lig_23.sdf")

# `protein.pdb` and `lig_23.sdf` should be in the local path now
!ls -lhrt
total 232K
-rw-r--r-- 1 runner docker  23K May  9 00:14 protein_ligand.ipynb
-rw-r--r-- 1 runner docker  211 May  9 00:14 README.md
-rw-r--r-- 1 runner docker 194K May  9 00:30 protein.pdb
-rw-r--r-- 1 runner docker 4.1K May  9 00:30 lig_23.sdf
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/pty.py:89: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  pid, fd = os.forkpty()

The OpenFF Toolkit recently (version 0.13) added support for loading multi-component PDBs. There are some limitations but for our system - a well-structured PDB file including a protein and some crystal waters - it should work fine.

protein_with_crystal_water = Topology.from_pdb("protein.pdb")
protein_with_crystal_water.n_molecules
29

protein_with_crystal_water is a Topology, not a Molecule, containing the protein and a few crystal waters. We can splice out the protein as a Molecule object and visualize it:

protein = protein_with_crystal_water.molecule(0)
protein

Preparing components

This system has three components: Protein, ligand, and solvent (water). For each component, we need positions and parameters. Our protein and ligand positions come from PDBs, and we’ll generate solvent coordinates ourselves. For parameters, the Sage force field will be perfect for the ligand and water, but doesn’t support proteins - they’re coming in Rosemary. In the meantime, we’ll use a SMIRNOFF port of ff14SB, a popular force field in the Amber force field family which has a compatible functionaln form.

Unfortunately, this means we need to treat each component seperately. Interchange provides an experimental means for combining these systems, which we’ll see in a bit.

Protein component

Let’s start with the protein component. We The Molecule.from_polymer_pdb() method constructs a Molecule from a PDB file encoding a protein. A Molecule object stores a molecule’s chemical identity, bond graph, and co-ordinates. The OpenFF Toolkit doesn’t accept PDB files for small molecules because they don’t have enough chemical information, but it makes an exception for biopolymers like proteins via a chemical substructure dictionary containing information about canonical amino aicds. This saves us from needing to do things like write up a SMILES string for an entire protein.

Our PDB file doesn’t only contain one molecule, though, it contains a protein and crystal waters. The OpenFF Toolkit recently (version 0.13) added support for loading multi-component PDBs. There are some limitations but for our system it should work fine.

protein_with_crystal_water = Topology.from_pdb("protein.pdb")
protein_with_crystal_water.n_molecules
29

protein_with_crystal_water is a Topology, not a Molecule, containing the protein and a few crystal waters. We can splice out the protein as a Molecule object and visualize it to make sure it looks reasonable:

protein = protein_with_crystal_water.molecule(0)
protein.visualize(backend="nglview")

OpenFF maintains a port of Amber ff14sb, which we’ll use for the protein parameters. We’re using the impropers variant because Interchange doesn’t support Amber’s improper torsion function.

🚧 This code is not production-ready
The Amber ff14sb port is intended as a proof-of-concept for SMIRNOFF protein force fields. It does not precisely match the energetics or forces of the original ff14sb, and Interchange is missing features required for it to work correctly. Wait for protein support in the Rosemary force field to use this in production.
ff14sb = ForceField("ff14sb_off_impropers_0.0.3.offxml")

We can use the Interchange.from_smirnoff constructor method to combine the protein molecule (with coordinates) and the ff14sb force field into an Interchange.

protein_intrcg = Interchange.from_smirnoff(
    force_field=ff14sb,
    topology=protein.to_topology(),
)

Ligand component

SDF files encode all the chemical information the Toolkit needs to construct a Molecule, so we can use the general-purpose small molecule from_file method for the ligand:

ligand = Molecule.from_file("lig_23.sdf")
ligand.name = "LIG"
ligand.visualize(backend="nglview")

We’ll use the OpenFF 2.0.0 “Sage” force field for the ligand, which is a production-ready small molecule SMIRNOFF force field. Its coordinates are taken from the lig_23.sdf file we downloaded earlier. We just want to do some point energy calculations as a proof of concept, so we’ll use the unconstrained variant of Sage (see the OpenFF Toolkit FAQ for details).

ligand_intrcg = Interchange.from_smirnoff(
    force_field=ForceField("openff_unconstrained-2.0.0.offxml"),
    topology=[ligand],
)

Now that we have two interchanges, we can combine them with the Interchange.combine method! We’ll need a combined system to solvate too, so this’ll be useful in a second.

🚧 This code is not production-ready
The Interchange.combine method is unstable and may break unexpectedly or be removed altogether at any time. In the future, OpenFF force fields will support biopolymers and combining Interchanges will be less necessary than at present. Don't combine force fields in production code.
docked_intrcg = protein_intrcg.combine(ligand_intrcg)
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/interchange.py:954: UserWarning: Interchange object combination is experimental and likely to produce strange results. Any workflow using this method is not guaranteed to be suitable for production. Use with extreme caution and thoroughly validate results!
  return _combine(self, other)

In addition to making it easy to parameterize systems for all sorts of engines, Interchange makes it easy to visualize systems. We can use the visualize() method to view our docked system in NGLView and make sure the co-ordinates make sense:

w = docked_intrcg.visualize()
w.clear_representations()
w.add_representation(
    "licorice",
    radius=0.1,
    selection=[*range(protein_intrcg.topology.n_atoms)],
)
w.add_representation(
    "spacefill",
    selection=[*range(protein_intrcg.topology.n_atoms, docked_intrcg.topology.n_atoms)],
)
w

Solvent component

We’ll reuse the Sage force field from earlier here, as it includes parameters for TIP3P water, but first we need coordinates for our solvated system. This is a portion of the OpenFF ecosystem that will be streamlined in the future, but we can use a PACKMOL wrapper to get the job done. We’re adding a fixed amount of water for this quick example so the density will be wrong, but imagine it’s right.

# Construct a water molecule
water = Molecule.from_smiles("O")
water.generate_conformers(n_conformers=1)

# Come up with a box size based on the size of the protein plus a 2 nm buffer
xyz = protein.conformers[0]
centroid = xyz.sum(axis=0) / xyz.shape[0]
protein_radius = np.sqrt(((xyz - centroid) ** 2).sum(axis=-1).max())
box_vectors = UNIT_CUBE * (protein_radius * 2 + 2 * unit.nanometer)

# Pack the box with an arbitrary number of water
n_water = 1000

packed_topology = pack_box(
    molecules=[water],
    number_of_copies=[n_water],
    solute=docked_intrcg.topology,
    box_vectors=box_vectors,
)
# Visualize
packed_topology.to_file("packed.pdb")
nglview.show_structure_file("packed.pdb")

And now we can create the interchange! The Topology we got from pack_box includes the positions we’ll later apply to the solvated complex. For now, we need an Interchange that represents the water component. We can pass it Sage, wihch contains TIP3P parameters, and a topology of n_water water molecules without worrying about the positions, since we’ll just set those later.

water_intrcg = Interchange.from_smirnoff(
    force_field=ForceField("openff_unconstrained-2.0.0.offxml"),
    topology=[water] * n_water,
)

Putting the system together

Now that we’ve got all the pieces, we can combine the docked protein-ligand system with the solvent, and add in the positions and box vectors we just worked out

🚧 This code is not production-ready
The `Interchange.combine` method is unstable and under-tested and may break unexpectedly or be removed altogether.
system_intrcg = docked_intrcg.combine(water_intrcg)
system_intrcg.positions = packed_topology.get_positions()
system_intrcg.box = packed_topology.box_vectors
w = system_intrcg.visualize()
w.clear_representations()
# Protein rep
w.add_representation(
    "licorice",
    radius=0.2,
    selection=[*range(protein_intrcg.topology.n_atoms)],
)
# Ligand rep
w.add_representation(
    "spacefill",
    selection=[*range(protein_intrcg.topology.n_atoms, docked_intrcg.topology.n_atoms)],
)
# Water rep
w.add_representation(
    "licorice",
    radius=0.1,
    selection=[*range(docked_intrcg.topology.n_atoms, system_intrcg.topology.n_atoms)],
)
w

Exporting to simulation engines

Finally, we can export the final Interchange object to models understood by various simulation engines. Some of these exports are not yet optimized for large files.

OpenMM

openmm_system = system_intrcg.to_openmm()
openmm_topology = system_intrcg.topology.to_openmm(ensure_unique_atom_names=False)
print(type(openmm_system), type(openmm_topology))
<class 'openmm.openmm.System'> <class 'openmm.app.topology.Topology'>

Amber

system_intrcg.to_inpcrd("out.inpcrd")
system_intrcg.to_prmtop("out.prmtop")

LAMMPS

# The LAMMPS exporter has not yet been optimized for large molecules or systems
if False:
    system_intrcg.to_lammps("out.lmp")

GROMACS

Interchange’s GROMACS exporter is a little slow for biopolymers; this will be faster in a future release.

system_intrcg.to_gromacs(prefix="out")

Energy tests

In order to verify the accuracy of each export, we can use functions in the openff.interchange.drivers module to call out to each engine to evaluate single-point energies. Under the hood, each function uses the export functions just as we did in the above cells. The GROMACS and Amber exports are a little slower than the OpenMM export, so some of these cells might take a minute to execute.

To get a quick look at how a single engine reports energies, use get_openmm_energies (or get_gromacs_energies, get_amber_energies, or get_lammps_energies).

get_openmm_energies(system_intrcg)
EnergyReport(energies={'Bond': <Quantity(11255.7296, 'kilojoule / mole')>, 'Angle': <Quantity(1980.01657, 'kilojoule / mole')>, 'Torsion': <Quantity(7759.41267, 'kilojoule / mole')>, 'Nonbonded': <Quantity(5067733.97, 'kilojoule / mole')>})

For convenience there is a function get_summary_data that runs through all available engines and summarizes the results in a Pandas DataFrame. (This cell might take a minute to execute). We’re setting the argument _engines to a non-defualt value so that the LAMMPS driver is skipped even if it’s available; normally this argument is unnecessary if you don’t have LAMMPS installed on your system.

summary = get_summary_data(system_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
summary
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:463: UserWarning: Ambiguous failure while processing constraints. Constraining h-bonds as a stopgap.
  warnings.warn(
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:397: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
  warnings.warn(
Bond Angle Torsion Electrostatics vdW RBTorsion
OpenMM 11255.729633 1980.016575 7759.412675 -21041.921220 5.088776e+06 NaN
Amber 538.964470 2180.593271 7759.412514 -142996.707746 5.088723e+06 NaN
GROMACS 538.962646 1980.016235 7759.417099 -21053.490234 5.088701e+06 0.0

We can see from these large energy differences that something is wrong - this stems from the experimental Interchange combination operation producing incorrect results.

summary.describe()
Bond Angle Torsion Electrostatics vdW RBTorsion
count 3.000000 3.000000 3.000000 3.000000 3.000000e+00 1.0
mean 4111.218917 2046.875360 7759.414096 -61697.373067 5.088734e+06 0.0
std 6187.327778 115.803108 0.002602 70407.289380 3.830953e+01 NaN
min 538.962646 1980.016235 7759.412514 -142996.707746 5.088701e+06 0.0
25% 538.963558 1980.016405 7759.412595 -82025.098990 5.088712e+06 0.0
50% 538.964470 1980.016575 7759.412675 -21053.490234 5.088723e+06 0.0
75% 5897.347052 2080.304923 7759.414887 -21047.705727 5.088750e+06 0.0
max 11255.729633 2180.593271 7759.417099 -21041.921220 5.088776e+06 0.0

In the future this should work more smoothly with identical energies reported by each engine. In lieu of that, we can evaluate the energy of each component that we previously added together. This requires setting box vectors for each component and also setting the water positions, which we skipped earlier since we were able to use PACKMOL results directly.

for subset in [ligand_intrcg, protein_intrcg, water_intrcg]:
    subset.box = system_intrcg.box

water_intrcg.positions = system_intrcg.positions[-3000:,]
get_summary_data(ligand_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:397: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
  warnings.warn(
Bond Angle Torsion Electrostatics vdW RBTorsion
OpenMM 25.837656 356.516457 27.319830 -31.854656 105.023359 NaN
Amber 25.837455 356.516548 27.319846 -31.912623 104.965682 NaN
GROMACS 25.838177 356.516541 27.319841 -31.987286 105.025609 0.0
get_summary_data(protein_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:397: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
  warnings.warn(
Bond Angle Torsion Electrostatics vdW RBTorsion
OpenMM 11229.891977 1623.500118 7732.092845 -20643.050126 5.086854e+06 NaN
Amber 11229.891982 1623.499947 7732.092668 -20643.369819 5.086796e+06 NaN
GROMACS 11229.882812 1623.498657 7732.095154 -20647.458984 5.086781e+06 0.0
get_summary_data(water_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/openff/interchange/components/mdconfig.py:397: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
  warnings.warn(
Bond Angle Torsion Electrostatics vdW RBTorsion
OpenMM 0.0 0.000000 0.0 -59.077080 1704.356370 NaN
Amber 0.0 200.576358 0.0 -123632.468733 1703.736515 NaN
GROMACS 0.0 0.000000 0.0 -66.763672 1704.347492 0.0

We can see from these results that each engine reports nearly identical energies for each individual component.