Tutorial 03 - Analysing Data Sets

In this tutorial we will be analysing the results of the calculations which we performed in the second tutorial. The tutorial will cover:

- comparing the estimated data set with the experimental data set.
- plotting the two data sets.

Note: If you are running this tutorial in Google Colab, you will need to run a setup script instead of following the installation instructions:

[1]:

# !wget https://raw.githubusercontent.com/openforcefield/openff-evaluator/master/docs/tutorials/colab_setup.ipynb
# %run colab_setup.ipynb

For the sake of clarity all warnings will be disabled in this tutorial:

[2]:

import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("openforcefield").setLevel(logging.ERROR)
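The cell above silences warnings globally, which keeps the tutorial output tidy. Outside of a tutorial it is often safer to suppress warnings only around the call that produces them. The following is a minimal sketch of that approach using the standard library's `warnings.catch_warnings` context manager; `noisy_function` is a hypothetical stand-in and is not part of the framework.

```python
import warnings


def noisy_function():
    # Hypothetical function which emits a warning before returning a result.
    warnings.warn("this is only a demonstration warning", UserWarning)
    return 42


# Warnings are ignored only inside this block; the previous filters are
# restored as soon as the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_function()

# Outside of the block warnings can be raised again. Here they are recorded
# so that they can be inspected rather than printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_function()

print(result, len(caught))
```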

Loading the Data Sets

We will begin by loading both the experimental data set and the estimated data set:

[3]:

from openff.evaluator.datasets import PhysicalPropertyDataSet

experimental_data_set_path = "filtered_data_set.json"
estimated_data_set_path = "estimated_data_set.json"

# If you have not yet completed the previous tutorials or do not have the data set files
# available, copies are provided by the framework:

# from openff.evaluator.utils import get_data_filename
# experimental_data_set_path = get_data_filename(
#     "tutorials/tutorial01/filtered_data_set.json"
# )
# estimated_data_set_path = get_data_filename(
#     "tutorials/tutorial02/estimated_data_set.json"
# )

experimental_data_set = PhysicalPropertyDataSet.from_json(experimental_data_set_path)
estimated_data_set = PhysicalPropertyDataSet.from_json(estimated_data_set_path)

If everything went well in the previous tutorials, these data sets will contain the density and enthalpy of vaporization (\(H_{vap}\)) of ethanol and isopropanol:

[4]:

experimental_data_set.to_pandas().head()

[4]:

	Temperature (K)	Pressure (kPa)	Phase	N Components	Component 1	Role 1	Mole Fraction 1	Exact Amount 1	Density Value (g / ml)	Density Uncertainty (g / ml)	EnthalpyOfVaporization Value (kJ / mol)	EnthalpyOfVaporization Uncertainty (kJ / mol)	Source
0	298.15	101.325	Liquid	1	CC(C)O	Solvent	1.0	None	0.78270	NaN	NaN	NaN	10.1016/j.fluid.2013.10.034
1	298.15	101.325	Liquid	1	CCO	Solvent	1.0	None	0.78507	NaN	NaN	NaN	10.1021/je1013476
2	298.15	101.325	Liquid + Gas	1	CCO	Solvent	1.0	None	NaN	NaN	42.26	0.02	10.1016/S0021-9614(71)80108-8
3	298.15	101.325	Liquid + Gas	1	CC(C)O	Solvent	1.0	None	NaN	NaN	45.34	0.02	10.1016/S0021-9614(71)80108-8

[5]:

estimated_data_set.to_pandas().head()

[5]:

	Temperature (K)	Pressure (kPa)	Phase	N Components	Component 1	Role 1	Mole Fraction 1	Exact Amount 1	Density Value (g / ml)	Density Uncertainty (g / ml)	EnthalpyOfVaporization Value (kJ / mol)	EnthalpyOfVaporization Uncertainty (kJ / mol)	Source
0	298.15	101.325	Liquid	1	CCO	Solvent	1.0	None	0.791767	0.000705	NaN	NaN	SimulationLayer
1	298.15	101.325	Liquid + Gas	1	CCO	Solvent	1.0	None	NaN	NaN	39.434820	0.170356	SimulationLayer
2	298.15	101.325	Liquid	1	CC(C)O	Solvent	1.0	None	0.804158	0.000680	NaN	NaN	SimulationLayer
3	298.15	101.325	Liquid + Gas	1	CC(C)O	Solvent	1.0	None	NaN	NaN	45.649979	0.234394	SimulationLayer

Extracting the Results

We will now compare how the value of each property estimated by simulation deviates from the experimental measurement.

To do this we will build lists of (experimental, estimated) property pairs, grouped by property type. The properties can be matched easily using the unique ids which were automatically assigned to them when they were created:

[6]:

properties_by_type = {
    "Density": [],
    "EnthalpyOfVaporization": []
}

for experimental_property in experimental_data_set:

    # Find the estimated property which has the same id as the
    # experimental property.
    estimated_property = next(
        x for x in estimated_data_set if x.id == experimental_property.id
    )

    # Add this pair of properties to the list of pairs
    property_type = experimental_property.__class__.__name__
    properties_by_type[property_type].append((experimental_property, estimated_property))
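
Note that calling `next` on a generator without a default value raises `StopIteration` when no match is found, which would happen in the loop above if one of the properties could not be estimated. As a hedged alternative, the sketch below indexes the estimated properties by id first and collects the ids of any experimental properties which have no estimated counterpart. The `SimpleNamespace` objects and their ids are hypothetical stand-ins for real property objects; only the `id` attribute matters for the matching logic.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the properties stored in each data set.
experimental = [
    SimpleNamespace(id="a1", value=0.78507),
    SimpleNamespace(id="b2", value=42.26),
    SimpleNamespace(id="c3", value=0.78270),
]
estimated = [
    SimpleNamespace(id="b2", value=39.434820),
    SimpleNamespace(id="a1", value=0.791767),
]

# Index the estimated properties by id so that each lookup is a single
# dictionary access rather than a scan over the whole data set.
estimated_by_id = {x.id: x for x in estimated}

pairs = []
unmatched_ids = []

for experimental_property in experimental:
    estimated_property = estimated_by_id.get(experimental_property.id)

    if estimated_property is None:
        # Record the missing property instead of raising an exception.
        unmatched_ids.append(experimental_property.id)
        continue

    pairs.append((experimental_property, estimated_property))

print(len(pairs), unmatched_ids)
```

Whether a missing property should be skipped or treated as an error depends on the analysis being performed; skipping is only the choice made for this illustration.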

Plotting the Results

We will now compare the experimental results to the estimated ones by plotting them using matplotlib:

[7]:

from matplotlib import pyplot

# Create the figure we will plot to.
figure, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(8.0, 4.0))

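# Set the axis labels and titles. The units are kept outside of math mode so
# that the space between them is preserved when the title is rendered.
axes[0].set_xlabel('OpenFF 1.0.0')
axes[0].set_ylabel('Experimental')
axes[0].set_title('Density (kg m$^{-3}$)')

axes[1].set_xlabel('OpenFF 1.0.0')
axes[1].set_ylabel('Experimental')
axes[1].set_title('$H_{vap}$ (kJ mol$^{-1}$)')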

# Define the preferred units of the properties
from openff.units import unit

preferred_units = {
    "Density": unit.kilogram / unit.meter ** 3,
    "EnthalpyOfVaporization": unit.kilojoule / unit.mole
}

for index, property_type in enumerate(properties_by_type):

    experimental_values = []
    estimated_values = []

    preferred_unit = preferred_units[property_type]

    # Convert the values of our properties to the preferred units.
    for experimental_property, estimated_property in properties_by_type[property_type]:

        experimental_values.append(
            experimental_property.value.to(preferred_unit).magnitude
        )
        estimated_values.append(
            estimated_property.value.to(preferred_unit).magnitude
        )

    axes[index].plot(
        estimated_values, experimental_values, marker='x', linestyle='None'
    )

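[Figure: two scatter plots of experimental values (y axis) against values estimated with OpenFF 1.0.0 (x axis). Left panel: density in kg m^-3. Right panel: H_vap in kJ mol^-1.]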
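
A plot gives a qualitative picture of the agreement. To put numbers on it, the deviations can also be summarised with simple error statistics. The sketch below is one possible way of doing so, not part of the framework: it uses the values shown in the tables above, and the `summarise_errors` helper is a hypothetical name introduced only for this illustration. Densities are converted from g / ml to kg m^-3 by multiplying by 1000.

```python
import math

# (experimental, estimated) values copied from the tables above.
density_pairs = {  # g / ml
    "CCO": (0.78507, 0.791767),
    "CC(C)O": (0.78270, 0.804158),
}
h_vap_pairs = {  # kJ / mol
    "CCO": (42.26, 39.434820),
    "CC(C)O": (45.34, 45.649979),
}


def summarise_errors(pairs, scale=1.0):
    """Return the mean signed error and the root mean square error of the
    estimated values relative to the experimental ones.

    `scale` is a multiplicative unit conversion factor applied to each error.
    """
    errors = [
        (estimated - experimental) * scale
        for experimental, estimated in pairs.values()
    ]

    mean_signed_error = sum(errors) / len(errors)
    rmse = math.sqrt(sum(error**2 for error in errors) / len(errors))

    return mean_signed_error, rmse


# 1 g / ml is equal to 1000 kg / m^3.
density_mse, density_rmse = summarise_errors(density_pairs, scale=1000.0)
h_vap_mse, h_vap_rmse = summarise_errors(h_vap_pairs)

print(f"Density: MSE = {density_mse:.2f}, RMSE = {density_rmse:.2f} kg m^-3")
print(f"H_vap:   MSE = {h_vap_mse:.2f}, RMSE = {h_vap_rmse:.2f} kJ mol^-1")
```

With only two data points per property these statistics are indicative at best, but the same pattern applies unchanged to larger data sets.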

Conclusion

And that concludes the third tutorial!

If you have any questions or feedback, please open an issue on the GitHub issue tracker.