Tutorial 02 - Estimating Data Sets

In this tutorial we will be estimating the data set we created in the first tutorial using molecular simulation. The tutorial will cover:

  • loading in the data set to estimate, and the force field parameters to use in the calculations.

  • defining custom calculation schemas for the properties in our data set.

  • estimating the data set of properties using an Evaluator server instance.

  • retrieving the results from the server and storing them on disk.

Note: If you are running this tutorial in Google Colab you will need to run a setup script instead of following the installation instructions:

[ ]:
# !wget https://raw.githubusercontent.com/openforcefield/openff-evaluator/main/docs/tutorials/colab_setup.ipynb
# %run colab_setup.ipynb

For this tutorial make sure that you are using a GPU accelerated runtime.

For the sake of clarity all warnings will be disabled in this tutorial:

[ ]:
import warnings

warnings.filterwarnings("ignore")
import logging

logging.getLogger("openff.toolkit").setLevel(logging.ERROR)

We will also enable time-stamped logging to help track the progress of our calculations:

[ ]:
from openff.evaluator.utils import setup_timestamp_logging

setup_timestamp_logging()

Loading the Data Set and Force Field Parameters

We will begin by loading in the data set which we created in the previous tutorial:

[ ]:
import pathlib

from openff.evaluator.datasets import PhysicalPropertyDataSet

data_set_path = "filtered_data_set.json"


# If you have not yet completed that tutorial, or do not have the data set file
# available, this tutorial will fall back to a copy provided by the framework.

if not pathlib.Path(data_set_path).exists():
    from openff.evaluator.utils import get_data_filename

    data_set_path = get_data_filename("tutorials/tutorial01/filtered_data_set.json")

data_set = PhysicalPropertyDataSet.from_json(data_set_path)

As a reminder, this data set contains the experimentally measured densities and \(H_{vap}\) values of ethanol and isopropanol at ambient conditions:

[ ]:
data_set.to_pandas().head()

We will also define the set of force field parameters which we wish to use to estimate this data set of properties. The framework supports force fields from a range of sources, including those in the OpenFF SMIRNOFF format, those which can be applied by AmberTools, and more.

Each source of a force field has a corresponding source object in the framework. In this tutorial we will be using the OpenFF Parsley force field, which is based on the SMIRNOFF format:

[ ]:
from openff.evaluator.forcefield import SmirnoffForceFieldSource

force_field_path = "openff-1.0.0.offxml"
force_field_source = SmirnoffForceFieldSource.from_path(force_field_path)
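
As an aside, force fields from other sources follow the same pattern. A GAFF force field applied via AmberTools could, for example, be defined through the corresponding source object (shown here only as a commented-out sketch, which assumes the TLeapForceFieldSource object and its leap_source argument):

[ ]:
# A commented-out sketch only - a GAFF 2 force field applied via AmberTools:
#
# from openff.evaluator.forcefield import TLeapForceFieldSource
#
# force_field_source = TLeapForceFieldSource(leap_source="leaprc.gaff2")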

Defining the Calculation Schemas

The next step is to define a custom calculation schema for each type of property in our data set.

A calculation schema is the blueprint for how a type of property should be calculated using a particular calculation approach, such as directly by simulation, by reprocessing cached simulation data or, in future, a range of other options.

The framework has built-in schemas defining how densities and \(H_{vap}\) should be estimated from molecular simulation, covering everything from coordinate generation and force field assignment, through energy minimisation and equilibration, to the production simulation and data analysis. All of this functionality is implemented via the framework's built-in, lightweight workflow engine; however, we won't dive into the details of this until a later tutorial.

For the purpose of this tutorial, we will simply modify the default calculation schemas to reduce the number of molecules included in our simulations, which will speed up the calculations. This step can be skipped entirely if the default options (which we recommend using for 'real-world' calculations) are to be used:

[ ]:
from openff.evaluator.properties import Density, EnthalpyOfVaporization

density_schema = Density.default_simulation_schema(n_molecules=256)
h_vap_schema = EnthalpyOfVaporization.default_simulation_schema(n_molecules=256)

We could further use this method to set either the absolute or the relative uncertainty that the property should be estimated to within. If either of these are set, the simulations will automatically be extended until the target uncertainty in the property has been met.

For our purposes, however, we won't set any targets, leaving the simulations to run for the default 1 ns.
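
As a rough sketch only (assuming the relative_tolerance keyword of default_simulation_schema), a schema which keeps extending the density simulations until the estimate falls within 5% of the experimental value might look like:

[ ]:
# A commented-out sketch only - target a relative uncertainty of 5% of the
# experimental value rather than a fixed simulation length:
#
# density_schema = Density.default_simulation_schema(
#     n_molecules=256, relative_tolerance=0.05
# )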

To use these custom schemas, we need to add them to a request options object, which defines all of the options for estimating our data set:

[ ]:
from openff.evaluator.client import RequestOptions

# Create an options object which defines how the data set should be estimated.
estimation_options = RequestOptions()
# Specify that we only wish to use molecular simulation to estimate the data set.
estimation_options.calculation_layers = ["SimulationLayer"]

# Add our custom schemas, specifying that they should be used by the 'SimulationLayer'.
estimation_options.add_schema("SimulationLayer", "Density", density_schema)
estimation_options.add_schema("SimulationLayer", "EnthalpyOfVaporization", h_vap_schema)

Launching the Server

The framework is split into two main applications - an EvaluatorServer and an EvaluatorClient.

The EvaluatorServer is the main object which will perform any and all calculations needed to estimate sets of properties. It is designed to run on whichever compute resources you have available (whether that be a single machine or a high-performance cluster), wait until a user requests that a set of properties be estimated, and then handle that request.

The EvaluatorClient is the object used to send estimation requests to a running server instance over a TCP connection. It is also used to query the server to see when a request has been fulfilled, and to pull back any results.

Let us begin by spawning a new server instance.

To launch a server, we need to define how this object is going to interact with the compute resource it is running on.

This is accomplished using a calculation backend. While there are several to choose from depending on your needs, we will go with a simple dask-based one designed to run on a single machine:

[ ]:
import os

from openff.evaluator.backends import ComputeResources
from openff.evaluator.backends.dask import DaskLocalCluster

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

calculation_backend = DaskLocalCluster(
    number_of_workers=1,
    resources_per_worker=ComputeResources(
        number_of_threads=1,
        number_of_gpus=1,
        preferred_gpu_toolkit=ComputeResources.GPUToolkit.CUDA,
    ),
)
calculation_backend.start()

Here we have specified that we want to run our calculations on a single worker which has access to a single GPU.
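
If no GPU is available, a CPU-only backend can be built in exactly the same way by simply dropping the GPU options (a minimal sketch; expect the simulations to run considerably slower):

[ ]:
# A commented-out sketch only - the same local backend restricted to a single
# CPU thread and no GPU:
#
# calculation_backend = DaskLocalCluster(
#     number_of_workers=1,
#     resources_per_worker=ComputeResources(number_of_threads=1),
# )
# calculation_backend.start()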

With that defined, we can go ahead and spin up the server:

[ ]:
from openff.evaluator.server import EvaluatorServer

evaluator_server = EvaluatorServer(calculation_backend=calculation_backend)
evaluator_server.start(asynchronous=True)

The server will run asynchronously in the background waiting until a client connects and requests that a data set be estimated.

Estimating the Data Set

With the server spun up we can go ahead and connect to it using an EvaluatorClient and request that it estimate our data set using the custom options we defined earlier:

[ ]:
from openff.evaluator.client import EvaluatorClient

evaluator_client = EvaluatorClient()

request, exception = evaluator_client.request_estimate(
    property_set=data_set,
    force_field_source=force_field_source,
    options=estimation_options,
)

assert exception is None

The server will now receive the request and begin whirring away fulfilling it. Note that the request_estimate() function returns two values: a request object and an exception object. If all went well (as it should here), the exception object will be None.

The request object represents the request which we just sent to the server. It stores the unique id which the server assigned to the request, as well as the address of the server that the request was sent to.

The request object is primarily used to query the current state of our request, and to pull down the results when the request finishes. Here we will use it to synchronously query the server every 30 seconds until our request has completed.

[ ]:
# Wait for the results.
results, exception = request.results(synchronous=True, polling_interval=30)
assert exception is None

Note: we could also asynchronously query for the results of the request. The returned results object would then contain the partial results of any completed estimates, as well as any exceptions raised during the estimation so far.
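
As a minimal sketch, such an asynchronous query only requires flipping the synchronous flag and returns immediately with whatever is currently available:

[ ]:
# A commented-out sketch only - query the server once and return immediately
# rather than blocking until the request has completed:
#
# partial_results, exception = request.results(synchronous=False)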

Inspecting the Results

Now that the server has finished estimating our data set and returned the results to us, we can begin to inspect the results of the calculations:

[ ]:
# Count the properties still queued, those estimated successfully, those which
# could not be estimated, and any exceptions raised along the way.
print(len(results.queued_properties))
print(len(results.estimated_properties))
print(len(results.unsuccessful_properties))
print(len(results.exceptions))

We can (hopefully) see here that there were no exceptions raised during the calculation, and that all of our properties were successfully estimated.
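
Had anything gone wrong, the exceptions list could be inspected to diagnose the failures (a minimal sketch):

[ ]:
# A commented-out sketch only - print any exceptions raised during the
# estimation to help diagnose failed properties:
#
# for failure in results.exceptions:
#     print(failure)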

We will extract the estimated data set and save this to disk:

[ ]:
results.estimated_properties.json("estimated_data_set.json", format=True);
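
If needed, the estimated data set can later be loaded back from this file in the same way as the filtered data set at the start of this tutorial:

[ ]:
# A commented-out sketch only - reload the estimated data set from disk:
#
# estimated_data_set = PhysicalPropertyDataSet.from_json("estimated_data_set.json")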

Conclusion

And that concludes the second tutorial. In the next tutorial we will be performing some basic analysis on our estimated results.

If you have any questions and / or feedback, please open an issue on the GitHub issue tracker.