Data Set Curation
The framework offers a full suite of features to facilitate the curation of data sets of physical properties, including:
a significant amount of data filters, including to filter by state, substance composition and chemical functionalities.
and components to
easily download and import the full NIST ThermoML and FreeSolv archives .
select data points which were measured close to a set of target states, and which were measured for a diverse range of substances which contain specific functionalities.
convert between different compatible property types (e.g. convert density <-> excess molar volume data).
These features are implemented as CurationComponent
objects, which take as input an associated
CurationComponentSchema
which controls how the curation components should be applied to a particular data
set (or a data set which is being stored as pandas DataFrame
object).
An example of a curation component would be one that filters out data points which were measured outside of a particular temperature range:
# Filter data points measured at less than 290.0 K or greater than 320.0 K
filtered_frame = FilterByTemperature.apply(
data_frame,
FilterByTemperatureSchema(minimum_temperature=290.0, maximum_temperature=320.0),
)
Curation components can be conveniently chained together using a CurationWorkflow
and an associated
CurationWorkflowSchema
so as to easily curated full training and testing data sets:
curation_schema = WorkflowSchema(
component_schemas=[
# Import the ThermoML archive.
thermoml.ImportThermoMLDataSchema()
# Filter out any measurements made for systems with more than two components
filtering.FilterByNComponentsSchema(n_components=[1, 2]),
# Remove any duplicate data.
filtering.FilterDuplicatesSchema(),
# Filter out data points measured away from ambient
# and biologically relevant temperatures.
filtering.FilterByTemperatureSchema(
minimum_temperature=298.0, maximum_temperature=320.0
),
# Retain only density and enthalpy of mixing data points.
filtering.FilterByPropertyTypesSchema(
property_types=["Density", "EnthalpyOfMixing"],
),
# Select data points measured for alcohols, esters or mixtures of both.
selection.SelectSubstancesSchema(
target_environments=[
ChemicalEnvironment.Alcohol,
ChemicalEnvironment.CarboxylicAcidEster,
],
n_per_environment=10,
),
]
)
data_frame = Workflow.apply(pandas.DataFrame(), curation)
Examples
Data Extraction
ImportFreeSolv
: A component which will download the latest, full FreeSolv data set from the GitHub repository:from openff.evaluator.datasets.curation.components.freesolv import ( ImportFreeSolv, ImportFreeSolvSchema, ) # Import the full FreeSolv data set. data_frame = ImportFreeSolv.apply(pandas.DataFrame(), ImportFreeSolvSchema())
ImportThermoMLData
: A component which will download all supported data from the NIST ThermoML Archive:from openff.evaluator.datasets.curation.components.thermoml import ( ImportThermoMLData, ImportThermoMLDataSchema, ) # Import all data collected from the IJT journal. data_frame = ImportThermoMLData.apply(pandas.DataFrame(), ImportThermoMLDataSchema())
Filtration
FilterDuplicates
: A component to remove duplicate data points (within a specified precision) from a data set:from openff.evaluator.datasets.curation.components.filtering import ( FilterDuplicates, FilterDuplicatesSchema, ) filtered_frame = FilterDuplicates.apply(data_frame, FilterDuplicatesSchema())
FilterByTemperature
: A component which will filter out data points which were measured outside of a specified temperature range:from openff.evaluator.datasets.curation.components.filtering import ( FilterByTemperature, FilterByTemperatureSchema, ) filtered_frame = FilterByTemperature.apply( data_frame, FilterByTemperatureSchema(minimum_temperature=290.0, maximum_temperature=320.0), )
FilterByPressure
: A component which will filter out data points which were measured outside of a specified pressure range:from openff.evaluator.datasets.curation.components.filtering import ( FilterByPressure, FilterByPressureSchema, ) filtered_frame = FilterByPressure.apply( data_frame, FilterByPressureSchema(minimum_pressure=100.0, maximum_pressure=140.0), )
FilterByMoleFraction
: A component which will filter out data points which were measured outside of a specified mole fraction range:from openff.evaluator.datasets.curation.components.filtering import ( FilterByMoleFraction, FilterByMoleFractionSchema, ) filtered_frame = FilterByMoleFraction.apply( data_frame, FilterByMoleFractionSchema(mole_fraction_ranges={2: [[(0.1, 0.3)]]}) )
FilterByRacemic
: A component which will filter out data points which were measured for racemic mixtures:from openff.evaluator.datasets.curation.components.filtering import ( FilterByRacemic, FilterByRacemicSchema, ) filtered_frame = FilterByRacemic.apply(data_frame, FilterByRacemicSchema())
FilterByElements
: A component which will filter out data points which were measured for systems which contain specific elements:from openff.evaluator.datasets.curation.components.filtering import ( FilterByElements, FilterByElementsSchema, ) filtered_frame = FilterByElements.apply( data_frame, FilterByElementsSchema(allowed_elements=["C", "O", "H"]), )
FilterByPropertyTypes
: A component which will apply a filter which only retains properties of specified types:from openff.evaluator.datasets.curation.components.filtering import ( FilterByPropertyTypes, FilterByPropertyTypesSchema, ) # Retain only density measurements made for either pure or binary systems. filtered_frame = FilterByPropertyTypes.apply( data_frame, FilterByPropertyTypesSchema( property_types=["Density"], n_components={"Density": [1, 2]}, ), )
FilterByStereochemistry
: A component which filters out data points measured for systems whereby the stereochemistry of a number of components is undefined:from openff.evaluator.datasets.curation.components.filtering import ( FilterByStereochemistry, FilterByStereochemistrySchema, ) filtered_frame = FilterByStereochemistry.apply( data_frame, FilterByStereochemistrySchema() )
FilterByCharged
: A component which filters out data points measured for substance where any of the constituent components have a net non-zero charge.:from openff.evaluator.datasets.curation.components.filtering import ( FilterByCharged, FilterByChargedSchema, ) filtered_frame = FilterByCharged.apply(data_frame, FilterByChargedSchema())
FilterByIonicLiquid
: A component which filters out data points measured for substances which contain or are classed as an ionic liquids:from openff.evaluator.datasets.curation.components.filtering import ( FilterByIonicLiquid, FilterByIonicLiquidSchema, ) filtered_frame = FilterByIonicLiquid.apply(data_frame, FilterByIonicLiquidSchema())
FilterBySmiles
: A component which filters the data set so that it only contains either a specific set of smiles, or does not contain any of a set of specifically excluded smiles:from openff.evaluator.datasets.curation.components.filtering import ( FilterBySmiles, FilterBySmilesSchema, ) filtered_frame = FilterBySmiles.apply( data_frame, FilterBySmilesSchema(smiles_to_include=["CCCO"]), )
FilterBySmirks
: A component which filters a data set so that it only contains measurements made for molecules which contain (or don’t) a set of chemical environments represented by SMIRKS patterns:from openff.evaluator.datasets.curation.components.filtering import ( FilterBySmirks, FilterBySmirksSchema, ) filtered_frame = FilterBySmirks.apply( data_frame, FilterBySmirksSchema(smirks_to_include=["[#6a]"]), )
FilterByNComponents
: A component which filters out data points measured for systems with specified number of components:from openff.evaluator.datasets.curation.components.filtering import ( FilterByNComponents, FilterByNComponentsSchema, ) filtered_frame = FilterByNComponents.apply( data_frame, FilterByNComponentsSchema(n_components=[1, 2]) )
FilterBySubstances
: A component which filters the data set so that it only contains properties measured for particular substances:from openff.evaluator.datasets.curation.components.filtering import ( FilterBySubstances, FilterBySubstancesSchema, ) filtered_frame = FilterBySubstances.apply( data_frame, FilterBySubstancesSchema(substances_to_include=[("CO", "C")]) )
FilterByEnvironments
: A component which filters a data set so that it only contains measurements made for substances which contain specific chemical environments:from openff.evaluator.datasets.curation.components.filtering import ( FilterByEnvironments, FilterByEnvironmentsSchema, ) filtered_frame = FilterByEnvironments.apply( data_frame, FilterByEnvironmentsSchema( environments=[ ChemicalEnvironment.Aqueous, ChemicalEnvironment.Alcohol, ChemicalEnvironment.Amine, ] ), )
Data Selection
SelectSubstances
: A component for selecting data points which were measured for specified number of maximally diverse systems containing a specified set of chemical functionalities:# Select (if possible) data points which were measured for 10 different (and # structurally diverse) alcohols. schema = SelectSubstancesSchema( target_environments=[ChemicalEnvironment.Alcohol], n_per_environment=10, ) data_frame = ConvertExcessDensityData.apply(data_frame, schema)
SelectDataPoints
: A component for selecting a set of data points which are close to a particular set of states:# Select (if possible) density data points which were measured for pure systems # at close to 298.15 K and 308.15K schema = SelectDataPointsSchema( target_states=[ TargetState( property_types=[("Density", 1)], states=[ State(temperature=298.15, pressure=101.325, mole_fractions=(1.0,), State(temperature=308.15, pressure=101.325, mole_fractions=(1.0,), ], ) ] ) data_frame = ConvertExcessDensityData.apply(data_frame, schema)
Data Conversion
ConvertExcessDensityData
: A component for converting binary mass density data to excess molar volume data and vice versa where pure density data measured for the components is available:from openff.evaluator.datasets.curation.components.conversion import ( ConvertExcessDensityData, ConvertExcessDensityDataSchema, ) converted_data_frame = ConvertExcessDensityData.apply( data_frame, ConvertExcessDensityDataSchema() )