Tutorial 01 - Loading Data Sets

In this tutorial we will explore the framework's utilities for loading and manipulating data sets of physical property measurements. The tutorial will cover:

  • Loading a data set of density measurements from NIST's ThermoML Archive

  • Filtering the data set down using a range of criteria, including temperature, pressure, and composition

  • Supplementing the data set with enthalpy of vaporization (\(\Delta H_{v}\)) data sourced directly from the literature

If you haven’t yet installed the OpenFF Evaluator framework on your machine, check out the installation instructions here!

Note: If you are running this tutorial in Google Colab you will need to run a setup script instead of following the installation instructions:

[1]:
# !wget https://raw.githubusercontent.com/openforcefield/openff-evaluator/master/docs/tutorials/colab_setup.ipynb
# %run colab_setup.ipynb

For the sake of clarity all warnings will be disabled in this tutorial:

[2]:
import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("openff.toolkit").setLevel(logging.ERROR)

Extracting Data from ThermoML

For anyone who is not familiar with it, the ThermoML archive is a fantastic database of physical property measurements which have been extracted from data published in the following journals:

  • Journal of Chemical and Engineering Data

  • Journal of Chemical Thermodynamics

  • Fluid Phase Equilibria

  • Thermochimica Acta

  • International Journal of Thermophysics

It includes data for a wealth of different physical properties, from simple densities and melting points to activity coefficients and osmotic coefficients, all of which are freely available. As such, it is a valuable resource for benchmarking and optimising molecular force fields.

The Evaluator framework has built-in support for extracting this wealth of data, storing it in easy-to-manipulate Python objects, and automatically re-computing those properties using an array of calculation techniques, such as molecular simulations and, in future, trained surrogate models.

This support is provided by the ThermoMLDataSet object:

[3]:
from openff.evaluator.datasets.thermoml import ThermoMLDataSet

The ThermoMLDataSet object offers two main routes for extracting data from the archive:

  • extracting data directly from the NIST ThermoML web server

  • extracting data from a local ThermoML XML archive file

Here we will be extracting data directly from the web server. To pull data from the web server we need to specify the digital object identifiers (DOIs) of the data we wish to extract. These correspond to the DOIs of the publications that the data was initially sourced from.

For this tutorial we will be extracting data using the following DOIs:

[4]:
data_set = ThermoMLDataSet.from_doi(
    "10.1016/j.fluid.2013.10.034",
    "10.1021/je1013476",
)

We can inspect the data set to see how many properties were loaded:

[5]:
len(data_set)
[5]:
275

and how many different substances those properties were measured for:

[6]:
len(data_set.substances)
[6]:
254

We can also easily check which types of properties were loaded in:

[7]:
print(data_set.property_types)
{'EnthalpyOfMixing', 'Density'}

Filtering the Data Set

The data set object we just created contains many different functions which will allow us to filter the data down, retaining only those measurements which are of interest to us.

The first thing we will do is filter out all of the measurements which aren’t density measurements:

[8]:
from openff.evaluator.datasets.curation.components.filtering import (
    FilterByPropertyTypes,
    FilterByPropertyTypesSchema
)

data_set = FilterByPropertyTypes.apply(
    data_set, FilterByPropertyTypesSchema(property_types=["Density"])
)

print(data_set.property_types)
{'Density'}

Next we will filter out all measurements which were made away from atmospheric conditions:

[9]:
from openff.evaluator.datasets.curation.components.filtering import (
    FilterByPressure,
    FilterByPressureSchema,
    FilterByTemperature,
    FilterByTemperatureSchema,
)

print(f"There were {len(data_set)} properties before filtering")

# First filter by temperature.
data_set = FilterByTemperature.apply(
    data_set,
    FilterByTemperatureSchema(minimum_temperature=298.0, maximum_temperature=298.2)
)
# and then by pressure
data_set = FilterByPressure.apply(
    data_set,
    FilterByPressureSchema(minimum_pressure=101.224, maximum_pressure=101.426)
)

print(f"There are now {len(data_set)} properties after filtering")
There were 213 properties before filtering
There are now 9 properties after filtering
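
The pressure bounds passed to the filter above are easier to read once they are related to one standard atmosphere. The short check below is only a sketch of that arithmetic. It assumes the bounds are expressed in kPa (consistent with the "Pressure (kPa)" column shown later in this tutorial) and that the window was chosen as roughly ±0.1 % around 1 atm; the tolerance is inferred from the numbers rather than stated by the tutorial.

```python
# Worked check of the pressure window used in the filter above.
# Assumption: the bounds are in kPa and span +/-0.1 % around 1 atm.
ATM_IN_KPA = 101.325  # one standard atmosphere, by definition

tolerance = 0.001  # 0.1 %, inferred from the bounds, not stated in the tutorial
minimum_pressure = ATM_IN_KPA * (1.0 - tolerance)
maximum_pressure = ATM_IN_KPA * (1.0 + tolerance)

# Rounded to three decimals these reproduce the bounds given to the filter.
print(f"{minimum_pressure:.3f} kPa to {maximum_pressure:.3f} kPa")
```

The temperature window of 298.0 K to 298.2 K plays the same role: it brackets 298.15 K (25 °C) while tolerating small differences in how the temperature was reported.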

Finally, we will remove all measurements which were not made for either ethanol (CCO) or isopropanol (CC(C)O):

[10]:
from openff.evaluator.datasets.curation.components.filtering import (
    FilterBySmiles,
    FilterBySmilesSchema,
)

data_set = FilterBySmiles.apply(
    data_set,
    FilterBySmilesSchema(smiles_to_include=["CCO", "CC(C)O"])
)

print(f"There are now {len(data_set)} properties after filtering")
There are now 2 properties after filtering
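
To make the combined effect of the filters concrete, the sketch below expresses the same four criteria as a single predicate over toy records. It does not use or reproduce the Evaluator implementation: the `Measurement` class, the `keep` function, and the records themselves are hypothetical, and the inclusive bounds and the "every component must be allowed" rule are assumptions about how the filters behave.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Measurement:
    """Toy stand-in for a physical property record (not an Evaluator class)."""

    property_type: str
    temperature: float  # K
    pressure: float  # kPa
    components: Tuple[str, ...]  # SMILES of every component in the substance


def keep(m: Measurement) -> bool:
    """Apply the four criteria used in this tutorial as one predicate."""
    allowed = {"CCO", "CC(C)O"}
    return (
        m.property_type == "Density"
        and 298.0 <= m.temperature <= 298.2
        and 101.224 <= m.pressure <= 101.426
        # Assumption: a record passes only if all of its components are allowed.
        and all(smiles in allowed for smiles in m.components)
    )


# Hypothetical records, made up purely to exercise each criterion.
records: List[Measurement] = [
    Measurement("Density", 298.15, 101.325, ("CCO",)),  # passes
    Measurement("Density", 298.15, 101.325, ("CC(C)O",)),  # passes
    Measurement("Density", 313.15, 101.325, ("CCO",)),  # outside temperature window
    Measurement("Density", 298.15, 200.0, ("CCO",)),  # outside pressure window
    Measurement("Density", 298.15, 101.325, ("CCO", "O")),  # water is not allowed
    Measurement("EnthalpyOfMixing", 298.15, 101.325, ("CCO", "CC(C)O")),  # wrong type
]

filtered = [m for m in records if keep(m)]
print(len(filtered))
```

Applying the filters one at a time, as the tutorial does, gives the same result as a single combined predicate, but makes it easier to see how many measurements each criterion removes.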

We will convert the filtered data to a pandas DataFrame to more easily visualize the final data set:

[11]:
pandas_data_set = data_set.to_pandas()
pandas_data_set[
    ["Temperature (K)", "Pressure (kPa)", "Component 1", "Density Value (g / ml)", "Source"]
].head()
[11]:
Temperature (K) Pressure (kPa) Component 1 Density Value (g / ml) Source
0 298.15 101.325 CC(C)O 0.78270 10.1016/j.fluid.2013.10.034
1 298.15 101.325 CCO 0.78507 10.1021/je1013476

Through filtering, we have now cut the data set down from over 250 property measurements to just 2. There are many more possible filters which can be applied. All of these, along with more information about the data set object, can be found in the API documentation for PhysicalPropertyDataSet (from which the ThermoMLDataSet class inherits).

Adding Extra Data

For the final part of this tutorial, we will be supplementing our newly filtered data set with some enthalpy of vaporization (\(\Delta H_{v}\)) measurements sourced directly from the literature (as opposed to from the ThermoML archive).

We will be sourcing values of the \(\Delta H_{v}\) of ethanol and isopropanol, summarised in the table below, from the Enthalpies of vaporization of some aliphatic alcohols publication:

Compound      Temperature / \(\mathrm{K}\)   \(\Delta H_{v}\) / \(\mathrm{kJ\,mol^{-1}}\)   \(\delta \Delta H_{v}\) / \(\mathrm{kJ\,mol^{-1}}\)
Ethanol       298.15                         42.26                                          0.02
Isopropanol   298.15                         45.34                                          0.02

In order to create the new \(\Delta H_{v}\) measurements, we will first define the state (namely temperature and pressure) that the measurements were recorded at:

[12]:
from openff.units import unit
from openff.evaluator.thermodynamics import ThermodynamicState

thermodynamic_state = ThermodynamicState(
    temperature=298.15 * unit.kelvin, pressure=1.0 * unit.atmosphere
)

Note: Here we have made use of the ``openff.units`` module to attach units to the temperature and pressure of the state. This module simply exposes a ``UnitRegistry`` from the fantastic pint library. Pint provides full support for attaching units to values and is used extensively throughout this framework.

Next, we define the substances that the measurements were recorded for:

[13]:
from openff.evaluator.substances import Substance

ethanol = Substance.from_components("CCO")
isopropanol = Substance.from_components("CC(C)O")

and the source of these measurements (defined as the DOI of the publication):

[14]:
from openff.evaluator.datasets import MeasurementSource

source = MeasurementSource(doi="10.1016/S0021-9614(71)80108-8")

We will combine this information with the values of the measurements to create an object which encodes each of the \(\Delta H_{v}\) measurements:

[15]:
from openff.evaluator.datasets import PropertyPhase
from openff.evaluator.properties import EnthalpyOfVaporization

ethanol_hvap = EnthalpyOfVaporization(
    thermodynamic_state=thermodynamic_state,
    phase=PropertyPhase.Liquid | PropertyPhase.Gas,
    substance=ethanol,
    value=42.26*unit.kilojoule / unit.mole,
    uncertainty=0.02*unit.kilojoule / unit.mole,
    source=source
)
isopropanol_hvap = EnthalpyOfVaporization(
    thermodynamic_state=thermodynamic_state,
    phase=PropertyPhase.Liquid | PropertyPhase.Gas,
    substance=isopropanol,
    value=45.34*unit.kilojoule / unit.mole,
    uncertainty=0.02*unit.kilojoule / unit.mole,
    source=source
)

These properties can then be added to our data set:

[16]:
data_set.add_properties(ethanol_hvap, isopropanol_hvap)

If we print the data set again using pandas we should see that our new measurements have been added:

[17]:
pandas_data_set = data_set.to_pandas()
pandas_data_set[
    ["Temperature (K)",
     "Pressure (kPa)",
     "Component 1",
     "Density Value (g / ml)",
     "EnthalpyOfVaporization Value (kJ / mol)",
     "Source"
     ]
].head()
[17]:
Temperature (K) Pressure (kPa) Component 1 Density Value (g / ml) EnthalpyOfVaporization Value (kJ / mol) Source
0 298.15 101.325 CC(C)O 0.78270 NaN 10.1016/j.fluid.2013.10.034
1 298.15 101.325 CCO 0.78507 NaN 10.1021/je1013476
2 298.15 101.325 CCO NaN 42.26 10.1016/S0021-9614(71)80108-8
3 298.15 101.325 CC(C)O NaN 45.34 10.1016/S0021-9614(71)80108-8

Conclusion

We will finish off this tutorial by saving the data set we have created as a JSON file for future use:

[18]:
data_set.json("filtered_data_set.json", format=True);
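
As a quick sanity check that the file was written and is valid JSON, it can be inspected with the standard library alone. This is only a sketch: `top_level_keys` is a hypothetical helper, not part of the Evaluator framework, and it assumes the serialized data set is a JSON object at the top level. No particular key names are assumed.

```python
import json
from pathlib import Path


def top_level_keys(path):
    """Return the sorted top-level keys of a JSON file, or [] if it is missing."""
    file = Path(path)
    if not file.is_file():
        return []
    with file.open() as handle:
        contents = json.load(handle)
    # Assumption: the serialized data set is a JSON object (a dict once loaded).
    return sorted(contents) if isinstance(contents, dict) else []


# Prints an empty list if the file has not been created yet.
print(top_level_keys("filtered_data_set.json"))
```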

And that concludes the first tutorial. For more information about data sets in the Evaluator framework check out the data set and ThermoML documentation.

In the next tutorial we will be estimating the properties in the data set we have created here using molecular simulation.

If you have any questions and / or feedback, please open an issue on the GitHub issue tracker.