Data Classes and Queries¶
All data which is to be stored within a StorageBackend
must inherit from the BaseStoredData
class. More broadly
there are typically two types of data which are expected to be stored:
HashableStoredData - data which is readily hashable and can be quickly queried for in a storage backend. The prime example of such data is ForceFieldData, whose hash can be easily computed from the file representation of a force field.
ReplaceableData - data which should be replaced in a storage backend when new data of the same type, but which has a higher information content, is stored in the backend. An example of this is when storing a piece of StoredSimulationData in the backend which was generated for a particular Substance and at the same ThermodynamicState as an existing piece of data, but which stores many more uncorrelated configurations.
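The difference between the two kinds of data can be illustrated with a minimal, self-contained sketch. All of the names below (ToyHashableData, ToyReplaceableData, ToyBackend) are hypothetical stand-ins invented for illustration, not the library's classes, and the SHA-256 hash and the n_uncorrelated field are assumptions about how such a scheme might look.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyHashableData:
    """Hypothetical hashable data: its identity is derived from its contents."""

    file_contents: str

    def content_hash(self) -> str:
        # Assumption: any stable hash of the file representation would do.
        return hashlib.sha256(self.file_contents.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ToyReplaceableData:
    """Hypothetical replaceable data: same key, varying information content."""

    key: str  # e.g. a substance + thermodynamic state label
    n_uncorrelated: int  # stand-in for "information content"


class ToyBackend:
    """A toy in-memory backend showing both storage behaviours."""

    def __init__(self):
        self._hashed = {}
        self._replaceable = {}

    def store_hashable(self, data: ToyHashableData) -> str:
        # Identical contents map to the same hash, so nothing is duplicated.
        digest = data.content_hash()
        self._hashed.setdefault(digest, data)
        return digest

    def store_replaceable(self, data: ToyReplaceableData) -> ToyReplaceableData:
        # Keep whichever object carries more information for this key.
        existing = self._replaceable.get(data.key)
        if existing is None or data.n_uncorrelated > existing.n_uncorrelated:
            self._replaceable[data.key] = data
        return self._replaceable[data.key]
```

In this sketch, storing the same force field text twice yields one entry, while storing simulation data for an existing key only overwrites the old entry when the new one holds more uncorrelated configurations.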
Every data class must be paired with a corresponding data query class which inherits from the BaseDataQuery
class. In addition, each data object must implement a to_storage_query()
function which returns the data query
which would uniquely match that data object. The to_storage_query()
function is used heavily by storage backends when checking
if a piece of data already exists within the backend.
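The pairing of a data class with a query class can be sketched as follows. This is an illustrative toy, not the library's implementation: ToyData, ToyQuery, the matches() method and the already_stored() helper are all hypothetical names, and only the to_storage_query() name is taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyQuery:
    """Hypothetical query: unset (None) fields match anything."""

    substance: Optional[str] = None
    temperature: Optional[float] = None

    def matches(self, data) -> bool:
        return (
            self.substance is None or self.substance == data.substance
        ) and (
            self.temperature is None or self.temperature == data.temperature
        )


@dataclass
class ToyData:
    """Hypothetical data object paired with ToyQuery."""

    substance: str
    temperature: float

    def to_storage_query(self) -> ToyQuery:
        # Build the query which would uniquely match this object.
        return ToyQuery(substance=self.substance, temperature=self.temperature)


def already_stored(stored, new_data) -> bool:
    """Check for an existing match the way a backend might (an assumption)."""
    query = new_data.to_storage_query()
    return any(query.matches(item) for item in stored)
```

A backend built along these lines only needs to know how to evaluate queries; each data class decides for itself which fields define uniqueness.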
Force Field Data¶
The ForceFieldData
class is used to store ForceFieldSource
objects within the storage backend. It is a hashable
storage object which allows for rapidly checking whether any calculations have previously been performed for
a particular force field source.
It has a corresponding ForceFieldQuery
class which can be used to query for particular force field sources within
a storage backend.
Cached Simulation Data¶
The StoredSimulationData
class is used to store the data generated by a single molecular simulation. The data object
primarily records the Substance
, PropertyPhase
and ThermodynamicState
that the simulation was run at, as well as
provenance about the calculation and the force field parameters used (as the key of the force field in the storage
system). Further, the object records the file names of the topology, trajectory and statistics files generated by the
simulation - these files should be stored in an associated ancillary data directory.
Cached simulation data is considered replaceable, whereby data which has the lowest statistical inefficiency is preferred. The philosophy here is that we should store the maximum number of samples (i.e. the maximum number of uncorrelated samples for the property which has the shortest correlation time) which will be useful for future calculations, such that future calculations can simply discard the data which cannot be used (i.e. is likely correlated).
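As a rough illustration of why this matters, the number of effectively uncorrelated samples in a time series of N frames is commonly estimated as N / g, where g >= 1 is the statistical inefficiency. The sketch below is an assumption-laden toy: the function names are hypothetical, and the fixed-stride subsampling is just one simple way a later calculation might discard correlated frames.

```python
import math


def n_uncorrelated(n_frames: int, statistical_inefficiency: float) -> int:
    """Approximate the number of uncorrelated samples as N / g."""
    if statistical_inefficiency < 1.0:
        raise ValueError("the statistical inefficiency must be >= 1")
    return int(n_frames // statistical_inefficiency)


def subsample_indices(n_frames: int, statistical_inefficiency: float) -> list:
    """Pick every ceil(g)-th frame so the retained frames are roughly independent."""
    stride = max(1, math.ceil(statistical_inefficiency))
    return list(range(0, n_frames, stride))
```

For the same number of stored frames, data recorded for a quickly decorrelating property (small g) retains more usable samples, and a calculation on a slower property can still subsample it with a larger stride.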
It has a corresponding SimulationDataQuery
class which can be used to query for simulation data which matches a set
of particular criteria within a storage backend, which in part includes querying for data collected:
- at a given thermodynamic_state (i.e. temperature and pressure).
- for a given property_phase (e.g. gas, liquid, liquid+gas coexisting, …).
- using a given set of force field parameters identified by their unique force_field_id assigned by the storage system.
Included is not only the ability to find data generated for a particular substance
(e.g. only data for methanol),
but also the ability to return data for each component of a given substance by setting the substance_query
attribute to a SubstanceQuery
which has the components_only
attribute set to true:
# Imports assume the openff-evaluator package layout; adjust to your version.
from openff.evaluator.datasets import PropertyPhase
from openff.evaluator.storage import LocalFileStorage
from openff.evaluator.storage.query import SimulationDataQuery, SubstanceQuery
from openff.evaluator.substances import Substance

# Load an existing storage backend
storage_backend = LocalFileStorage()

# Define a system of 50% water and 50% methanol.
full_substance = Substance.from_components("O", "CO")

# Look for all simulation data generated for the full substance
data_query = SimulationDataQuery()
data_query.substance = full_substance
data_query.property_phase = PropertyPhase.Liquid
full_substance_data = storage_backend.query(data_query)

# Now look for all of the pure data which has been stored for both pure
# water and pure methanol.
pure_substance_query = SubstanceQuery()
pure_substance_query.components_only = True
data_query.substance_query = pure_substance_query
component_data = storage_backend.query(data_query)
This is particularly useful when retrieving data for use in the calculation of excess properties (such as the enthalpy of mixing), where such calculations require information about both the full mixture and the pure components.
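To make the need for both kinds of data concrete, a molar excess property can be written as the mixture value minus the mole-fraction-weighted sum of the pure component values. The helper below is a hypothetical sketch, not part of the library; the function name and the plain dictionaries keyed by component are assumptions.

```python
def excess_property(mixture_value, pure_values, mole_fractions):
    """Return mixture_value - sum_i x_i * pure_value_i (all molar quantities).

    pure_values and mole_fractions are dictionaries keyed by component label.
    """
    if set(pure_values) != set(mole_fractions):
        raise ValueError("pure values and mole fractions must share components")
    if abs(sum(mole_fractions.values()) - 1.0) > 1e-8:
        raise ValueError("mole fractions must sum to one")

    # The ideal contribution is built from the pure component data only.
    ideal_value = sum(
        mole_fractions[component] * pure_values[component]
        for component in mole_fractions
    )
    return mixture_value - ideal_value
```

With made-up molar enthalpies of -42.0 for the mixture, -44.0 for water and -38.0 for methanol at equal mole fractions, the ideal value is -41.0 and the resulting enthalpy of mixing is -1.0 in the same units.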