Data Classes and Queries
All data which is to be stored within a StorageBackend
must inherit from the BaseStoredData
class. More broadly
there are typically two types of data which are expected to be stored:
HashableStoredData - data which is readily hashable and can be quickly queried for in a storage backend. The prime example of such data is ForceFieldData, whose hash can be easily computed from the file representation of a force field.
ReplaceableData - data which should be replaced in a storage backend when new data of the same type, but which has a higher information content, is stored in the backend. An example of this is when storing a piece of StoredSimulationData in the backend which was generated for a particular Substance and at the same ThermodynamicState as an existing piece of data, but which stores many more uncorrelated configurations.
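The distinction between the two kinds of data can be illustrated with a minimal, self-contained sketch. The class names HashableSketch and ReplaceableSketch, the use of SHA-256 and the placeholder file contents are all assumptions made for illustration; they are not the library's own classes or hashing scheme.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HashableSketch:
    """Hypothetical stand-in for hashable data such as a force field."""

    file_contents: str

    def content_hash(self) -> str:
        # Hash the file representation, so that identical contents always
        # map to the same key (SHA-256 is an assumption of this sketch).
        return hashlib.sha256(self.file_contents.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ReplaceableSketch:
    """Hypothetical stand-in for replaceable data such as simulation output."""

    substance: str
    n_uncorrelated: int

    def most_information(self, other: "ReplaceableSketch") -> "ReplaceableSketch":
        # Keep whichever object stores more uncorrelated configurations.
        return self if self.n_uncorrelated >= other.n_uncorrelated else other


# Two objects built from the same (placeholder) file contents hash identically.
force_field_a = HashableSketch("placeholder force field file contents")
force_field_b = HashableSketch("placeholder force field file contents")
same_force_field = force_field_a.content_hash() == force_field_b.content_hash()

# The incoming data carries more information, so it would replace the old data.
existing = ReplaceableSketch("methanol", n_uncorrelated=200)
incoming = ReplaceableSketch("methanol", n_uncorrelated=950)
kept = existing.most_information(incoming)
```

In this sketch the hash acts as a cheap lookup key, while the replaceable comparison decides which of two otherwise equivalent objects should survive in the backend.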
Every data class must be paired with a corresponding data query class which inherits from the BaseDataQuery class. In addition, each data object must implement a to_storage_query() function which returns the data query which would uniquely match that data object. The to_storage_query() function is used heavily by storage backends when checking if a piece of data already exists within the backend.
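The pairing of a data class with a query class, and the role of to_storage_query() in duplicate checks, might look roughly like the following. This is a simplified sketch: SketchData, SketchQuery and SketchBackend are hypothetical names, and only the to_storage_query() method name is taken from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchQuery:
    """Hypothetical query: attributes left as None match any value."""

    substance: Optional[str] = None
    property_phase: Optional[str] = None

    def matches(self, data: "SketchData") -> bool:
        criteria = (
            (self.substance, data.substance),
            (self.property_phase, data.property_phase),
        )
        return all(
            expected is None or expected == actual
            for expected, actual in criteria
        )


@dataclass
class SketchData:
    """Hypothetical data object paired with SketchQuery."""

    substance: str
    property_phase: str

    def to_storage_query(self) -> SketchQuery:
        # Build the query which would match exactly this object.
        return SketchQuery(self.substance, self.property_phase)


class SketchBackend:
    """Hypothetical in-memory backend used only for this illustration."""

    def __init__(self) -> None:
        self._stored: List[SketchData] = []

    def query(self, data_query: SketchQuery) -> List[SketchData]:
        return [data for data in self._stored if data_query.matches(data)]

    def store(self, data: SketchData) -> bool:
        """Store the data unless a match already exists; return True if stored."""
        if self.query(data.to_storage_query()):
            return False
        self._stored.append(data)
        return True


backend = SketchBackend()
first_store = backend.store(SketchData("methanol", "liquid"))
second_store = backend.store(SketchData("methanol", "liquid"))  # duplicate
other_store = backend.store(SketchData("methanol", "gas"))
```

Here the second call is rejected because the query built from the object already matches stored data, which is the kind of existence check described above.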
Force Field Data
The ForceFieldData class is used to store ForceFieldSource objects within the storage backend. It is a hashable storage object which allows for rapidly checking whether any calculations have previously been performed for a particular force field source.
It has a corresponding ForceFieldQuery
class which can be used to query for particular force field sources within
a storage backend.
Cached Simulation Data
Classes derived from the BaseSimulationData class are used to store the data generated by molecular simulation. The data object primarily records the Substance, PropertyPhase and ThermodynamicState that the simulation was run at, as well as provenance about the calculation and the force field parameters used (as the key of the force field in the storage system).
It has a corresponding BaseSimulationDataQuery class which can be used to query for simulation data which matches a set of particular criteria within a storage backend, which in part includes querying for data collected:
at a given thermodynamic_state (i.e. temperature and pressure).
for a given property_phase (e.g. gas, liquid, liquid+gas coexisting, …).
using a given set of force field parameters identified by their unique force_field_id assigned by the storage system.
In addition to finding data generated for a particular substance (e.g. only data for methanol), a query can also return data for each component of a given substance. This is done by setting the substance_query attribute to a SubstanceQuery which has its components_only attribute set to True:
# NOTE: these import paths assume the openff-evaluator package layout;
# adjust them to match your installation.
from openff.evaluator.datasets import PropertyPhase
from openff.evaluator.storage import LocalFileStorage
from openff.evaluator.storage.query import SimulationDataQuery, SubstanceQuery
from openff.evaluator.substances import Substance

# Load an existing storage backend
storage_backend = LocalFileStorage()

# Define a system of 50% water and 50% methanol.
full_substance = Substance.from_components("O", "CO")

# Look for all simulation data generated for the full substance
data_query = SimulationDataQuery()
data_query.substance = full_substance
data_query.property_phase = PropertyPhase.Liquid

full_substance_data = storage_backend.query(data_query)

# Now look for all of the pure data which has been stored for both pure
# water and pure methanol.
pure_substance_query = SubstanceQuery()
pure_substance_query.components_only = True
data_query.substance_query = pure_substance_query

component_data = storage_backend.query(data_query)
This is particularly useful when retrieving data for use in the calculation of excess properties (such as the enthalpy of mixing), as such calculations require information about both the full mixture and the pure components.
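To make clear why both sets of data are needed, the following sketch evaluates a molar enthalpy of mixing as the mixture enthalpy minus the mole-fraction weighted sum of the pure component enthalpies. The function name enthalpy_of_mixing is hypothetical, and the numerical values are made-up placeholders chosen only to make the arithmetic easy to follow; they are not simulation results.

```python
from typing import Dict


def enthalpy_of_mixing(
    mixture_enthalpy: float,
    pure_enthalpies: Dict[str, float],
    mole_fractions: Dict[str, float],
) -> float:
    """Return H_mixture - sum_i x_i * H_i for molar enthalpies."""
    if abs(sum(mole_fractions.values()) - 1.0) > 1e-8:
        raise ValueError("The mole fractions must sum to one.")

    # Contribution expected if the components mixed ideally.
    ideal_enthalpy = sum(
        fraction * pure_enthalpies[component]
        for component, fraction in mole_fractions.items()
    )
    return mixture_enthalpy - ideal_enthalpy


# Placeholder molar enthalpies (kJ / mol) keyed by component SMILES.
mole_fractions = {"O": 0.5, "CO": 0.5}
pure_enthalpies = {"O": -41.0, "CO": -38.0}  # from the component query
mixture_enthalpy = -40.0  # from the full substance query

excess = enthalpy_of_mixing(mixture_enthalpy, pure_enthalpies, mole_fractions)
```

The mixture value corresponds to data returned by the query on the full substance, while the pure values correspond to data returned when components_only is set.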
Single Simulation Data
The StoredSimulationData
class is used to store data generated by a single molecular simulation and can be
queried for using its accompanying SimulationDataQuery
query class. In addition to the data stored by the parent
BaseSimulationData
class, this class further stores:
the number of molecules which were simulated.
the topology of the simulated system (stored as ancillary data).
the trajectory of configurations (stored as ancillary data) and the observables generated by the simulation.
the statistical inefficiency of the data.
Data of this kind is considered replaceable, whereby data which has the lowest statistical inefficiency is preferred. The philosophy here is that we should store the maximum number of samples (i.e. the maximum number of uncorrelated samples for the property which has the shortest correlation time) which will be useful for future calculations, such that future calculations can simply discard the data which cannot be used (i.e. is likely correlated).
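The idea of storing as many samples as possible and discarding correlated ones later can be sketched as follows. Both helper functions are hypothetical, and taking every ceil(g)-th frame for a statistical inefficiency g is one common subsampling convention assumed here, not necessarily the scheme used by the framework.

```python
import math
from typing import List


def uncorrelated_indices(n_samples: int, statistical_inefficiency: float) -> List[int]:
    """Return the indices of approximately uncorrelated samples.

    Assumes that samples separated by at least ceil(g) frames can be
    treated as uncorrelated, where g is the statistical inefficiency.
    """
    stride = max(1, math.ceil(statistical_inefficiency))
    return list(range(0, n_samples, stride))


def should_replace(existing_inefficiency: float, incoming_inefficiency: float) -> bool:
    """Prefer the data with the lower statistical inefficiency."""
    return incoming_inefficiency < existing_inefficiency


# Data stored according to the property with the shortest correlation time.
stored_indices = uncorrelated_indices(10, 2.0)

# A later calculation of a more slowly decorrelating property simply
# discards the frames it cannot use.
reused_indices = uncorrelated_indices(10, 4.5)

replace_existing = should_replace(existing_inefficiency=4.0, incoming_inefficiency=2.5)
```

Because the stored data retains the densest useful sampling, a later calculation with a longer correlation time only needs to apply a larger stride.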
Free Energy Data
The StoredFreeEnergyData
class is used to store data generated by a free energy calculation which computes the
free energy difference between an end and start state. It can be queried for using its accompanying
FreeEnergyDataQuery
query class.
In addition to the data stored by the parent BaseSimulationData
class, this class further stores:
the free energy difference between the end and starting state.
the topology of the system (stored as ancillary data).
the trajectories of configurations generated in the starting and end states (stored as ancillary data).
Although data of this kind inherits from the ReplaceableData base class, all data deposited in a storage backend will be retained. At this time, no situation can be envisaged in which the same free energy data from exactly the same calculation would be stored twice, with the exception of operator error.