Data Classes and Queries

All data which is to be stored within a StorageBackend must inherit from the BaseStoredData class. More broadly there are typically two types of data which are expected to be stored:

  • HashableStoredData - data which is readily hashable and can be quickly queried for in a storage backend. The prime example of such data is ForceFieldData, whose hash can be easily computed from the file representation of a force field.

  • ReplaceableData - data which should be replaced in a storage backend when new data of the same type, but which has a higher information content, is stored in the backend. An example of this is when storing a piece of StoredSimulationData in the backend which was generated for a particular Substance and at the same ThermodynamicState as an existing piece of data, but which stores many more uncorrelated configurations.

Every data class must be paired with a corresponding data query class which inherits from the BaseDataQuery class. In addition, each data object must implement a to_storage_query() function which returns the data query which would uniquely match that data object. The to_storage_query() function is used heavily by storage backends when checking whether a piece of data already exists within the backend.
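The pairing between a data object and its query can be illustrated with a minimal, self-contained sketch. The ToyData, ToyQuery and ToyBackend classes below are hypothetical stand-ins written for this example; they are not the library's BaseStoredData, BaseDataQuery or StorageBackend classes, and only show how a to_storage_query() method could let a backend check whether a piece of data is already stored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToyQuery:
    """Hypothetical query: an attribute left as None matches anything."""

    substance: Optional[str] = None
    temperature: Optional[float] = None

    def matches(self, data: "ToyData") -> bool:
        return all(
            expected is None or expected == actual
            for expected, actual in (
                (self.substance, data.substance),
                (self.temperature, data.temperature),
            )
        )


@dataclass(frozen=True)
class ToyData:
    """Hypothetical data object paired with ToyQuery."""

    substance: str
    temperature: float

    def to_storage_query(self) -> ToyQuery:
        # Build the query which would uniquely match this object.
        return ToyQuery(self.substance, self.temperature)


class ToyBackend:
    """Hypothetical in-memory backend (an assumption for illustration)."""

    def __init__(self) -> None:
        self._stored: List[ToyData] = []

    def query(self, data_query: ToyQuery) -> List[ToyData]:
        return [data for data in self._stored if data_query.matches(data)]

    def store(self, data: ToyData) -> bool:
        # Only store the object if nothing already matches its own query.
        if self.query(data.to_storage_query()):
            return False
        self._stored.append(data)
        return True
```

In this sketch, storing the same ToyData twice leaves a single copy in the backend, because the second call finds a match for the object's own query.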

Force Field Data

The ForceFieldData class is used to store ForceFieldSource objects within the storage backend. It is a hashable storage object which allows for rapidly checking whether any calculations have previously been performed for a particular force field source.

It has a corresponding ForceFieldQuery class which can be used to query for particular force field sources within a storage backend.
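As a rough illustration of why hashable data is quick to look up, the sketch below hashes the text representation of a force field and uses the digest as a lookup key. The hash_force_field_text helper and the ToyForceFieldStore class are hypothetical, and the choice of SHA-256 is an assumption; the hashing scheme actually used by ForceFieldData is not described here.

```python
import hashlib


def hash_force_field_text(force_field_text: str) -> str:
    """Return a hex digest of the serialized force field (hypothetical helper)."""
    return hashlib.sha256(force_field_text.encode("utf-8")).hexdigest()


class ToyForceFieldStore:
    """Hypothetical store which maps content hashes to unique ids."""

    def __init__(self) -> None:
        self._ids_by_hash = {}

    def store(self, force_field_text: str) -> str:
        digest = hash_force_field_text(force_field_text)
        # Reuse the existing id if identical contents were stored before.
        if digest not in self._ids_by_hash:
            self._ids_by_hash[digest] = f"force-field-{len(self._ids_by_hash)}"
        return self._ids_by_hash[digest]

    def has(self, force_field_text: str) -> bool:
        # A dictionary lookup on the digest avoids comparing full file contents.
        return hash_force_field_text(force_field_text) in self._ids_by_hash
```

Storing identical contents twice returns the same id, which mirrors the idea of checking whether calculations have already been performed for a given force field source.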

Cached Simulation Data

Classes derived from the BaseSimulationData class are used to store the data generated by molecular simulation. The data object primarily records the Substance, PropertyPhase and ThermodynamicState that the simulation was run at, as well as provenance about the calculation and the force field parameters used (as the key of the force field in the storage system).

It has a corresponding BaseSimulationDataQuery class which can be used to query for simulation data which matches a set of particular criteria within a storage backend, which in part includes querying for data collected:

  • at a given thermodynamic_state (i.e. temperature and pressure).

  • for a given property_phase (e.g. gas, liquid, liquid+gas coexisting, …).

  • using a given set of force field parameters identified by their unique force_field_id assigned by the storage system.

Queries can find data generated for a particular substance (e.g. only data for methanol). They can also return data for each component of a given substance by setting the substance_query attribute to a SubstanceQuery whose components_only attribute is set to True:

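# Imports, assuming the openff.evaluator package layout.
from openff.evaluator.datasets import PropertyPhase
from openff.evaluator.storage import LocalFileStorage
from openff.evaluator.storage.query import SimulationDataQuery, SubstanceQuery
from openff.evaluator.substances import Substance

# Load an existing storage backend
storage_backend = LocalFileStorage()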

# Define a system of 50% water and 50% methanol.
full_substance = Substance.from_components("O", "CO")

# Look for all simulation data generated for the full substance
data_query = SimulationDataQuery()

data_query.substance = full_substance
data_query.property_phase = PropertyPhase.Liquid

full_substance_data = storage_backend.query(data_query)

# Now look for all of the pure data which has been stored for both pure
# water and pure methanol.
pure_substance_query = SubstanceQuery()
pure_substance_query.components_only = True

data_query.substance_query = pure_substance_query
component_data = storage_backend.query(data_query)

This is particularly useful when retrieving data for use in the calculation of excess properties (such as the enthalpy of mixing), where such calculations require information about both the full mixture and the pure components.
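To make the need for both sets of data concrete, the sketch below evaluates the standard expression for a molar enthalpy of mixing, namely the mixture enthalpy minus the mole-fraction-weighted sum of the pure component enthalpies. The enthalpy_of_mixing function and its arguments are hypothetical names chosen for this example and are not part of the storage API; all enthalpies are assumed to be molar values in the same units.

```python
from typing import Dict


def enthalpy_of_mixing(
    mixture_enthalpy: float,
    mole_fractions: Dict[str, float],
    pure_enthalpies: Dict[str, float],
) -> float:
    """Return H(mixture) - sum_i x_i * H(pure component i).

    Hypothetical helper: every enthalpy is assumed to be a molar quantity
    expressed in the same units.
    """
    ideal_enthalpy = sum(
        mole_fractions[component] * pure_enthalpies[component]
        for component in mole_fractions
    )
    return mixture_enthalpy - ideal_enthalpy
```

The mixture term would come from data matching the full substance, while each pure term would come from data returned by a components-only query.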

Single Simulation Data

The StoredSimulationData class is used to store data generated by a single molecular simulation and can be queried for using its accompanying SimulationDataQuery query class. In addition to the data stored by the parent BaseSimulationData class, this class further stores:

  • the number of molecules which were simulated.

  • the topology of the simulated system (stored as ancillary data).

  • the trajectory of configurations (stored as ancillary data) and the observables generated by the simulation.

  • the statistical inefficiency of the data.

Data of this kind is considered replaceable, whereby data which has the lowest statistical inefficiency is preferred. The philosophy here is that we should store the maximum number of samples (i.e. the maximum number of uncorrelated samples for the property which has the shortest correlation time) which will be useful for future calculations, such that future calculations can simply discard the data which cannot be used (i.e. is likely correlated).
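The replacement rule and the later discarding of correlated samples can be sketched as follows. ToySimulationData, most_information and decorrelate are hypothetical names used only for illustration and are not the library's implementation. The comparison assumes both records hold a comparable number of samples, and the stride rule (keep one sample per rounded-up statistical inefficiency) is likewise an assumption of this sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ToySimulationData:
    """Hypothetical record; not the library's StoredSimulationData."""

    observable: List[float]
    statistical_inefficiency: float


def most_information(
    existing: ToySimulationData, new: ToySimulationData
) -> ToySimulationData:
    """Keep whichever record has the lower statistical inefficiency."""
    if new.statistical_inefficiency < existing.statistical_inefficiency:
        return new
    return existing


def decorrelate(
    samples: Sequence[float], statistical_inefficiency: float
) -> List[float]:
    """Keep one sample per statistical inefficiency (assumed stride rule)."""
    stride = max(1, math.ceil(statistical_inefficiency))
    return list(samples[::stride])
```

A future calculation on a property with a longer correlation time would call something like decorrelate with its own, larger statistical inefficiency, discarding more of the stored samples.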

Free Energy Data

The StoredFreeEnergyData class is used to store data generated by a free energy calculation which computes the free energy difference between a start state and an end state. It can be queried for using its accompanying FreeEnergyDataQuery query class.

In addition to the data stored by the parent BaseSimulationData class, this class further stores:

  • the free energy difference between the start and end states.

  • the topology of the system (stored as ancillary data).

  • the trajectories of configurations generated in the start and end states (stored as ancillary data).

Although data of this kind inherits from the ReplaceableData base class, all data deposited in a storage backend will be retained. At this time no situation can be envisaged in which the same free energy data from exactly the same calculation would be stored twice, with the exception of operator error.