BasicDataset

class openff.qcsubmit.datasets.BasicDataset(*, qc_specifications={'default': QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None, maxiter=200, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], keywords={})}, driver=SinglepointDriver.energy, priority='normal', dataset_tags=['openff'], compute_tag='openff', dataset_name, dataset_tagline, type='DataSet', description, metadata=Metadata(submitter='docs', creation_date=datetime.date(2024, 4, 24), collection_type=None, dataset_name=None, short_description=None, long_description_url=None, long_description=None, elements=set()), provenance={}, dataset={}, filtered_molecules={})[source]

The general QCFractal dataset class which contains all of the molecules and information about them prior to submission.

The class is a simple holder for the dataset and its associated information. It can run simple checks on the data before submission, such as ensuring that each molecule has CMILES information and a unique index by which it can be identified.

Note

The molecules in this dataset are all expanded so that different conformers are unique submissions.

Parameters

qc_specifications (Dict[str, openff.qcsubmit.common_structures.QCSpec]) –
driver (qcportal.singlepoint.record_models.SinglepointDriver) –
priority (str) –
dataset_tags (List[str]) –
compute_tag (str) –
dataset_name (str) –
dataset_tagline (pydantic.v1.types.ConstrainedStrValue) –
type (Literal['DataSet']) –
description (pydantic.v1.types.ConstrainedStrValue) –
metadata (openff.qcsubmit.common_structures.Metadata) –
provenance (Dict[str, str]) –
dataset (Dict[str, openff.qcsubmit.datasets.entries.DatasetEntry]) –
filtered_molecules (Dict[str, openff.qcsubmit.datasets.entries.FilterEntry]) –

Return type

None

__init__(**kwargs): Make sure the metadata has been assigned correctly; if not, autofill some information.

Methods

`__init__`(**kwargs)	Make sure the metadata has been assigned correctly; if not, autofill some information.
`add_molecule`(index, molecule[, extras, keywords])	Add a molecule to the dataset under the given index with the passed cmiles.
`add_qc_spec`(method, basis, program, ...[, ...])	Add a new QC specification that will be applied to the dataset.
`clear_qcspecs`()	Clear out any current QCSpecs.
`construct`([_fields_set])	Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data.
`copy`(*[, include, exclude, update, deep])	Duplicate a model, optionally choose which fields to include, exclude and change.
`coverage_report`(force_field[, verbose])	Returns a summary of how many molecules within this dataset would be assigned each of the parameters in a force field.
`dict`(*args, **kwargs)	Overwrite the dict method to handle any enums when saving to yaml/json via a dict call.
`export_dataset`(file_name[, compression])	Export the dataset to file so that it can be used to make another dataset quickly.
`filter_molecules`(molecules, component, ...)	Filter a molecule or list of molecules by the component they failed.
`from_orm`(obj)
`get_molecule_entry`(molecule)	Search through the dataset for a molecule and return the dataset index of any exact molecule matches.
`json`(*[, include, exclude, by_alias, ...])	Generate a JSON representation of the model, include and exclude arguments as per dict().
`molecules_to_file`(file_name, file_type)	Write the molecules to the requested file type.
`parse_file`(file_name)	Create a Dataset object from a compressed json file.
`parse_obj`(obj)
`parse_raw`(b, *[, content_type, encoding, ...])
`remove_qcspec`(spec_name)	Remove a QCSpec from the dataset.
`schema`([by_alias, ref_template])
`schema_json`(*[, by_alias, ref_template])
`submit`(client[, ignore_errors, verbose])	Submit the dataset to a QCFractal server.
`to_tasks`()	Build a dictionary of single QCEngine tasks that correspond to this dataset organised by program name.
`update_forward_refs`(**localns)	Try to update ForwardRefs on fields based on this Model, globalns and localns.
`validate`(value)
`visualize`(file_name[, columns, toolkit])	Create a pdf file of the molecules with any torsions highlighted using either openeye or rdkit.

Attributes

`components`	Gather the details of the components that were run during the creation of this dataset.
`filtered`	A generator which yields an OpenFF molecule representation for each molecule filtered while creating this dataset.
`molecules`	A generator that creates openff.toolkit.topology.Molecule objects one by one from the dataset.
`n_components`	Return the number of components that have been run while generating the dataset.
`n_filtered`	Calculate the total number of molecules filtered by the components used in a workflow to create this dataset.
`n_molecules`	Calculate the number of unique molecules to be submitted.
`n_qc_specs`	Return the number of QCSpecs on this dataset.
`n_records`	Return the total number of records that will be created on submission of the dataset.
`type`

to_tasks()[source]

Build a dictionary of single QCEngine tasks that correspond to this dataset organised by program name. The tasks can be passed directly to qcengine.compute.

Return type: Dict[str, List[qcelemental.models.results.AtomicInput]]
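As a rough sketch of how the returned mapping might be consumed, the helper below walks over each program's task list and hands every task to a compute callable. The helper name `run_tasks` and the injected `compute` argument are assumptions made for illustration; in practice `compute` would typically be `qcengine.compute`, which takes an input and a program name.

```python
from typing import Any, Callable, Dict, List


def run_tasks(dataset: Any, compute: Callable[[Any, str], Any]) -> Dict[str, List[Any]]:
    """Run every task from ``dataset.to_tasks()`` and group results by program.

    ``run_tasks`` is a hypothetical helper, not part of openff-qcsubmit.
    ``compute`` is injected so the sketch does not depend on qcengine.
    """
    results: Dict[str, List[Any]] = {}
    # to_tasks() returns {program_name: [AtomicInput, ...]}
    for program, tasks in dataset.to_tasks().items():
        results[program] = [compute(task, program) for task in tasks]
    return results
```

Injecting the callable keeps the loop easy to exercise with a stand-in function before any real calculations are launched.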

add_molecule(index, molecule, extras=None, keywords=None, **kwargs)

Add a molecule to the dataset under the given index with the passed cmiles.

Parameters

index (str) – The index that should be associated with the molecule in QCArchive.
molecule (Optional[openff.toolkit.topology.molecule.Molecule]) – The instance of the molecule which contains its conformer information.
extras (Optional[Dict[str, Any]]) – The extras that should be supplied into the qcportal.models.Molecule.
keywords (Optional[Dict[str, Any]]) – Any extra keywords which are required for the calculation.

Return type

None

Note

Each molecule in this basic dataset should have all of its conformers expanded out into separate entries. Thus here we take the general molecule index and increment it.
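The expansion described in the note can be pictured with a small standalone sketch. It does not use openff-qcsubmit; the function name `expand_conformers` and the `-0`, `-1`, ... suffix format are assumptions chosen to illustrate the idea of one entry per conformer, and the index format used by the library may differ.

```python
from typing import Dict, Sequence


def expand_conformers(index: str, conformers: Sequence[object]) -> Dict[str, object]:
    """Map one molecule with N conformers onto N uniquely indexed entries.

    Hypothetical illustration only: the suffix format is an assumption.
    """
    return {f"{index}-{i}": conformer for i, conformer in enumerate(conformers)}


# Placeholder strings stand in for real conformer geometries.
entries = expand_conformers("CCO", ["conformer_a", "conformer_b", "conformer_c"])
print(list(entries))
```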

add_qc_spec(method, basis, program, spec_name, spec_description, store_wavefunction='none', overwrite=False, implicit_solvent=None, maxiter=200, scf_properties=None, keywords=None)

Add a new QC specification that will be applied to the dataset.

Parameters

method (str) – The name of the method to use, e.g. B3LYP-D3BJ.
basis (Optional[str]) – The name of the basis to use; can also be None.
program (str) – The name of the program to execute the computation
spec_name (str) – The name the spec should be stored under
spec_description (str) – The description of the spec
store_wavefunction (str) – Which parts of the wavefunction should be saved.
overwrite (bool) – If a spec already exists under this name, overwrite it.
implicit_solvent (Optional[openff.qcsubmit.common_structures.PCMSettings]) – The implicit solvent settings if it is to be used.
maxiter (pydantic.v1.types.PositiveInt) – The maximum number of SCF iterations that should be done.
scf_properties (Optional[List[openff.qcsubmit.common_structures.SCFProperties]]) – The list of SCF properties that should be extracted from the calculation.
keywords (Optional[Dict[str, Union[pydantic.v1.types.StrictStr, pydantic.v1.types.StrictInt, pydantic.v1.types.StrictFloat, pydantic.v1.types.StrictBool, List[pydantic.v1.types.StrictFloat]]]]) – Program specific computational keywords that should be passed to the program

Return type

None
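A call to add_qc_spec might look like the sketch below. The wrapper name `add_example_spec`, the spec name, and the description text are hypothetical; the keyword names and the method, basis, and program values are taken from the signature and default specification shown on this page. The dataset is passed in as an argument so the sketch does not assume how it was constructed.

```python
from typing import Any


def add_example_spec(dataset: Any) -> None:
    """Register an extra QC specification on a BasicDataset-like object.

    ``add_example_spec`` and the spec name/description are hypothetical;
    the keyword names follow the documented ``add_qc_spec`` signature.
    """
    dataset.add_qc_spec(
        method="B3LYP-D3BJ",
        basis="DZVP",
        program="psi4",
        spec_name="example-spec",
        spec_description="Example specification added for illustration.",
        store_wavefunction="none",
        overwrite=False,  # do not replace an existing spec with this name
        maxiter=200,
    )
```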

clear_qcspecs()

Clear out any current QCSpecs.

Return type: None

property components: List[Dict[str, Union[str, Dict[str, str]]]]: Gather the details of the components that were run during the creation of this dataset.

coverage_report(force_field, verbose=False)

Returns a summary of how many molecules within this dataset would be assigned each of the parameters in a force field.

Notes

Parameters which would not be assigned to any molecules in the dataset will not be included in the returned summary.

Parameters

force_field (ForceField) – The force field containing the parameters to summarize.
verbose (bool) – If True, a progress bar will be shown on screen.

Returns

A dictionary of the form coverage[handler_name][parameter_smirks] = count which stores the number of molecules within this dataset that would be assigned to each parameter.

Return type

Dict[str, Dict[str, int]]
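Because the report is a plain nested dictionary, it can be post-processed with ordinary Python. The sketch below flattens a report and lists the parameters matched by the fewest molecules. The helper name `least_covered` and the example report are hypothetical; real handler names and SMIRKS patterns come from the force field being summarized.

```python
from typing import Dict, List, Tuple

Coverage = Dict[str, Dict[str, int]]


def least_covered(coverage: Coverage, limit: int = 3) -> List[Tuple[str, str, int]]:
    """Return the ``limit`` parameters matched by the fewest molecules."""
    rows = [
        (handler, smirks, count)
        for handler, parameters in coverage.items()
        for smirks, count in parameters.items()
    ]
    # Sort ascending by molecule count; ties keep their original order.
    return sorted(rows, key=lambda row: row[2])[:limit]


# Hypothetical report in the coverage[handler_name][parameter_smirks] form.
example = {
    "Bonds": {"[#6:1]-[#6:2]": 12, "[#6:1]-[#8:2]": 3},
    "Angles": {"[*:1]~[#6:2]~[*:3]": 15},
}
print(least_covered(example, limit=2))
```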

dict(*args, **kwargs): Overwrite the dict method to handle any enums when saving to yaml/json via a dict call.

export_dataset(file_name, compression=None)

Export the dataset to file so that it can be used to make another dataset quickly.

Parameters

file_name (str) – The name of the file the dataset should be written to.
compression (Optional[str]) – The type of compression that should be added to the export.

Raises

UnsupportedFiletypeError – If the requested file type is not supported.

Return type

None

Note

The supported file types are:

json

Additionally, the file will be automatically compressed depending on the final extension if compression is not explicitly supplied:

json.xz
json.gz
json.bz2

Check serializers.py for more details. Right now bz2 seems to produce the smallest files.
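The extension rule in the note can be sketched as a small standalone function. This is not the openff-qcsubmit implementation: the helper name `infer_compression`, the returned names (`xz`, `gz`, `bz2`), and the use of `ValueError` in place of `UnsupportedFiletypeError` are all assumptions made so the example runs without the library.

```python
from typing import Optional

# Extensions listed in the note, mapped to an assumed compression name.
_COMPRESSION_BY_SUFFIX = {".json.xz": "xz", ".json.gz": "gz", ".json.bz2": "bz2"}


def infer_compression(file_name: str, compression: Optional[str] = None) -> Optional[str]:
    """Guess the compression for an export file name.

    Hypothetical helper. An explicit ``compression`` wins; otherwise the
    final extension decides, and plain ``.json`` means no compression.
    """
    if compression is not None:
        return compression
    lowered = file_name.lower()
    for suffix, name in _COMPRESSION_BY_SUFFIX.items():
        if lowered.endswith(suffix):
            return name
    if lowered.endswith(".json"):
        return None  # plain, uncompressed JSON
    raise ValueError(f"Unsupported file type: {file_name}")
```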

filter_molecules(molecules, component, component_settings, component_provenance)

Filter a molecule or list of molecules by the component they failed.

Parameters

molecules (Union[openff.toolkit.topology.molecule.Molecule, List[openff.toolkit.topology.molecule.Molecule]]) – A molecule or list of molecules to be filtered.
component_settings (Dict[str, Any]) – The dictionary representation of the component that filtered this set of molecules.
component (str) – The name of the component.
component_provenance (Dict[str, str]) – The dictionary representation of the component provenance.

Return type

None

property filtered: openff.toolkit.topology.molecule.Molecule: A generator which yields an OpenFF molecule representation for each molecule filtered while creating this dataset.

Note

Modifying the molecule will have no effect on the data stored.

get_molecule_entry(molecule)

Search through the dataset for a molecule and return the dataset index of any exact molecule matches.

Parameters: molecule (Union[openff.toolkit.topology.molecule.Molecule, str]) – The SMILES string for the molecule or an openff.toolkit.topology.Molecule that is to be searched for.
Returns: A list of dataset indices which contain the target molecule.
Return type: List[str]

property molecules: Generator[openff.toolkit.topology.molecule.Molecule, None, None]: A generator that creates openff.toolkit.topology.Molecule objects one by one from the dataset.

Note

Editing the molecule will not affect the data stored in the dataset as it is immutable.

molecules_to_file(file_name, file_type)

Write the molecules to the requested file type.

Parameters

file_name (str) – The name of the file the molecules should be stored in.
file_type (str) – The file format that should be used to store the molecules.

Return type

None

Important

The supported file types are:

SMI
INCHI
INCHIKEY
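A minimal sketch of writing the molecules once per listed file type is shown below. The helper name `write_all_formats` and the output file names are hypothetical, and the third type is written as `INCHIKEY` on the assumption that it refers to the InChIKey format. The dataset is passed in so the sketch only relies on the molecules_to_file signature documented above.

```python
from typing import Any, List


def write_all_formats(dataset: Any, stem: str) -> List[str]:
    """Write the dataset molecules once per documented file type.

    Hypothetical helper; the file name suffixes are assumptions.
    """
    written = []
    for file_type, suffix in (("SMI", "smi"), ("INCHI", "inchi"), ("INCHIKEY", "inchikey")):
        file_name = f"{stem}.{suffix}"
        dataset.molecules_to_file(file_name=file_name, file_type=file_type)
        written.append(file_name)
    return written
```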

property n_components: int: Return the number of components that have been run while generating the dataset.

property n_filtered: int: Calculate the total number of molecules filtered by the components used in a workflow to create this dataset.

property n_molecules: int

Calculate the number of unique molecules to be submitted.

Notes

This method has been improved for better performance on large datasets and has been tested on an optimization dataset of over 10500 molecules.
This function does not calculate the total number of entries in the dataset; see n_records.

property n_qc_specs: int: Return the number of QCSpecs on this dataset.

property n_records: int

Return the total number of records that will be created on submission of the dataset.

Note

The number returned will be different depending on the dataset used.
The number of unique molecules can be found using n_molecules.
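Since the number of unique molecules and the number of records can differ, it may help to inspect the size-related properties together before submission. The helper below is a hypothetical convenience, not part of openff-qcsubmit; it only reads the properties listed on this page.

```python
from typing import Any, Dict


def size_summary(dataset: Any) -> Dict[str, int]:
    """Collect the size-related properties of a BasicDataset-like object."""
    names = ("n_molecules", "n_records", "n_qc_specs", "n_filtered", "n_components")
    return {name: getattr(dataset, name) for name in names}
```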

classmethod parse_file(file_name)

Create a Dataset object from a compressed json file.

Parameters: file_name (str) – The name of the file the dataset should be created from.

remove_qcspec(spec_name)

Remove a QCSpec from the dataset.

Parameters: spec_name (str) – The name of the spec that should be removed.
Return type: None

Note

The QCSpec settings are not mutable and so they must be removed and a new one added to ensure they are fully validated.
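The remove-then-add pattern described in the note might be wrapped as follows. The helper name `replace_spec` is hypothetical; it relies only on the remove_qcspec and add_qc_spec methods documented on this page and forwards the new settings unchanged.

```python
from typing import Any


def replace_spec(dataset: Any, spec_name: str, **settings: Any) -> None:
    """Swap an existing QCSpec for a freshly validated one.

    Hypothetical helper. ``settings`` must include the remaining required
    ``add_qc_spec`` arguments (method, basis, program, spec_description).
    """
    # Remove first, then re-add so the new settings go through validation.
    dataset.remove_qcspec(spec_name)
    dataset.add_qc_spec(spec_name=spec_name, **settings)
```

If the new settings fail validation, the old spec has already been removed, so callers may want to keep a copy of the previous settings.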

submit(client, ignore_errors=False, verbose=False)

Submit the dataset to a QCFractal server.

Parameters

client (qcportal.client.PortalClient) – Instance of a portal client
ignore_errors (bool) – Set this to True to submit the compute regardless of errors; mainly used to override the basis coverage check.
verbose (bool) – Set to True to print progress bars and submission statistics, or False to suppress them.

Returns

A dictionary of the compute response from the client for each specification submitted.

Raises

MissingBasisCoverageError – If the chosen basis set does not cover some of the elements in the dataset.

Return type

Dict
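A submission call might be guarded as in the sketch below. The helper name `submit_with_checks` and the pre-submission checks are assumptions for illustration; `client` is expected to be a qcportal.client.PortalClient as documented above, and constructing it is left to the caller.

```python
from typing import Any, Dict


def submit_with_checks(dataset: Any, client: Any) -> Dict:
    """Submit only when the dataset has something to compute.

    Hypothetical helper, not part of openff-qcsubmit.
    """
    if dataset.n_records == 0 or dataset.n_qc_specs == 0:
        raise ValueError("Dataset has no records or no QC specifications to submit.")
    # ignore_errors=False keeps the basis coverage check active.
    return dataset.submit(client, ignore_errors=False, verbose=True)
```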

visualize(file_name, columns=4, toolkit=None)

Create a pdf file of the molecules with any torsions highlighted using either openeye or rdkit.

Parameters

file_name (str) – The name of the pdf file which will be produced.
columns (int) – The number of molecules per row.
toolkit (Optional[Literal['openeye', 'rdkit']]) – The option to specify the backend toolkit used to produce the pdf file.

Return type

None