BasicDataset

class openff.qcsubmit.datasets.BasicDataset(*, qc_specifications={'default': QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None, maxiter=200, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], keywords={})}, driver=SinglepointDriver.energy, priority='normal', dataset_tags=['openff'], compute_tag='openff', dataset_name, dataset_tagline, type='DataSet', description, metadata=Metadata(submitter='docs', creation_date=datetime.date(2024, 3, 22), collection_type=None, dataset_name=None, short_description=None, long_description_url=None, long_description=None, elements=set()), provenance={}, dataset={}, filtered_molecules={})

The general QCFractal dataset class which contains all of the molecules and information about them prior to submission.

The class is a simple holder for the dataset and its metadata. It can run simple checks on the data before submission, such as ensuring that each molecule has CMILES information and a unique index by which it can be identified.

Note

The molecules in this dataset are all expanded so that different conformers are unique submissions.

Parameters
Return type

None

__init__(**kwargs)

Make sure the metadata has been assigned correctly; if not, autofill some information.

Methods

__init__(**kwargs)

Make sure the metadata has been assigned correctly; if not, autofill some information.

add_molecule(index, molecule[, extras, keywords])

Add a molecule to the dataset under the given index with the passed cmiles.

add_qc_spec(method, basis, program, ...[, ...])

Add a new QC specification which will be applied to the dataset.

clear_qcspecs()

Clear out any current QCSpecs.

construct([_fields_set])

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data.

copy(*[, include, exclude, update, deep])

Duplicate a model, optionally choose which fields to include, exclude and change.

coverage_report(force_field[, verbose])

Returns a summary of how many molecules within this dataset would be assigned each of the parameters in a force field.

dict(*args, **kwargs)

Override the dict method to handle any enums when saving to YAML/JSON via a dict call.

export_dataset(file_name[, compression])

Export the dataset to file so that it can be used to make another dataset quickly.

filter_molecules(molecules, component, ...)

Filter a molecule or list of molecules by the component they failed.

from_orm(obj)

get_molecule_entry(molecule)

Search through the dataset for a molecule and return the dataset index of any exact molecule matches.

json(*[, include, exclude, by_alias, ...])

Generate a JSON representation of the model, include and exclude arguments as per dict().

molecules_to_file(file_name, file_type)

Write the molecules to the requested file type.

parse_file(file_name)

Create a Dataset object from a compressed JSON file.

parse_obj(obj)

parse_raw(b, *[, content_type, encoding, ...])

remove_qcspec(spec_name)

Remove a QCSpec from the dataset.

schema([by_alias, ref_template])

schema_json(*[, by_alias, ref_template])

submit(client[, ignore_errors, verbose])

Submit the dataset to a QCFractal server.

to_tasks()

Build a dictionary of single QCEngine tasks that correspond to this dataset organised by program name.

update_forward_refs(**localns)

Try to update ForwardRefs on fields based on this Model, globalns and localns.

validate(value)

visualize(file_name[, columns, toolkit])

Create a PDF file of the molecules with any torsions highlighted using either OpenEye or RDKit.

Attributes

components

Gather the details of the components that were run during the creation of this dataset.

filtered

A generator which yields an OpenFF molecule representation for each molecule filtered while creating this dataset.

molecules

A generator that creates openff.toolkit.topology.Molecule objects one by one from the dataset.

n_components

Return the number of components that were run while generating the dataset.

n_filtered

Calculate the total number of molecules filtered by the components used in a workflow to create this dataset.

n_molecules

Calculate the number of unique molecules to be submitted.

n_qc_specs

Return the number of QCSpecs on this dataset.

n_records

Return the total number of records that will be created on submission of the dataset.

type

to_tasks()

Build a dictionary of single QCEngine tasks that correspond to this dataset organised by program name. The tasks can be passed directly to qcengine.compute.

Return type

Dict[str, List[qcelemental.models.results.AtomicInput]]
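As a rough illustration of consuming this return value, the sketch below walks a mapping of program name to task list and hands each task to a compute callable. The helper `run_tasks` and the stand-in `fake_compute` are hypothetical names introduced here, and the string tasks are placeholders for real AtomicInput objects.

```python
from typing import Any, Callable, Dict, List


def run_tasks(
    tasks: Dict[str, List[Any]],
    compute: Callable[[Any, str], Any],
) -> Dict[str, List[Any]]:
    """Run every task, grouped by the program that should execute it.

    `tasks` has the shape returned by `to_tasks()`: program name -> list of inputs.
    `compute` is any callable taking (input_data, program).
    """
    results: Dict[str, List[Any]] = {}
    for program, program_tasks in tasks.items():
        results[program] = [compute(task, program) for task in program_tasks]
    return results


# Stand-in compute function so the sketch runs without QCEngine installed.
def fake_compute(input_data: Any, program: str) -> str:
    return f"{program}:{input_data}"


# Hypothetical task mapping; real values would be AtomicInput objects.
example_tasks = {"psi4": ["task-a", "task-b"]}
example_results = run_tasks(example_tasks, fake_compute)
```

With QCEngine installed, `qcengine.compute`, which takes the input data and the program name, could be passed as `compute` in place of the stand-in.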

add_molecule(index, molecule, extras=None, keywords=None, **kwargs)

Add a molecule to the dataset under the given index with the passed cmiles.

Parameters
Return type

None

Note

Each molecule in this basic dataset should have all of its conformers expanded out into separate entries. Thus here we take the general molecule index and increment it.
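To make the conformer expansion concrete, here is a minimal sketch that turns one molecule index with several conformers into one entry index per conformer. The `index-N` naming and the helper `expand_conformer_indices` are assumptions made for illustration, not necessarily the scheme the library uses.

```python
from typing import List


def expand_conformer_indices(index: str, n_conformers: int) -> List[str]:
    """Return one entry index per conformer by appending an incrementing counter."""
    return [f"{index}-{i}" for i in range(n_conformers)]


# A molecule with three conformers becomes three separate entries.
ethanol_entries = expand_conformer_indices("CCO", 3)
```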

add_qc_spec(method, basis, program, spec_name, spec_description, store_wavefunction='none', overwrite=False, implicit_solvent=None, maxiter=200, scf_properties=None, keywords=None)

Add a new QC specification which will be applied to the dataset.

Parameters
  • method (str) – The name of the method to use, e.g. B3LYP-D3BJ

  • basis (Optional[str]) – The name of the basis to use; can also be None

  • program (str) – The name of the program to execute the computation

  • spec_name (str) – The name the spec should be stored under

  • spec_description (str) – The description of the spec

  • store_wavefunction (str) – Which parts of the wavefunction should be saved

  • overwrite (bool) – If a spec is already stored under this name, overwrite it

  • implicit_solvent (Optional[openff.qcsubmit.common_structures.PCMSettings]) – The implicit solvent settings if it is to be used.

  • maxiter (pydantic.v1.types.PositiveInt) – The maximum number of SCF iterations that should be done.

  • scf_properties (Optional[List[openff.qcsubmit.common_structures.SCFProperties]]) – The list of SCF properties that should be extracted from the calculation.

  • keywords (Optional[Dict[str, Union[pydantic.v1.types.StrictStr, pydantic.v1.types.StrictInt, pydantic.v1.types.StrictFloat, pydantic.v1.types.StrictBool, List[pydantic.v1.types.StrictFloat]]]]) – Program specific computational keywords that should be passed to the program

Return type

None
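The call below sketches how the parameters listed above fit together, reusing the values of the default specification shown in the class signature. It is written against any object exposing `add_qc_spec`, so the hypothetical `RecordingDataset` stand-in lets it run without the library installed; a real BasicDataset would validate the arguments instead of just recording them.

```python
from typing import Any, Dict, List


def add_default_like_spec(dataset: Any) -> None:
    """Call add_qc_spec with the documented parameters on a dataset-like object."""
    dataset.add_qc_spec(
        method="B3LYP-D3BJ",
        basis="DZVP",
        program="psi4",
        spec_name="default",
        spec_description="Standard OpenFF optimization quantum chemistry specification.",
        overwrite=True,  # replace any spec already stored under "default"
    )


class RecordingDataset:
    """Hypothetical stand-in that records the call instead of validating it."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def add_qc_spec(self, **kwargs: Any) -> None:
        self.calls.append(kwargs)


recorder = RecordingDataset()
add_default_like_spec(recorder)
```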

clear_qcspecs()

Clear out any current QCSpecs.

Return type

None

property components: List[Dict[str, Union[str, Dict[str, str]]]]

Gather the details of the components that were run during the creation of this dataset.

coverage_report(force_field, verbose=False)

Returns a summary of how many molecules within this dataset would be assigned each of the parameters in a force field.

Notes

  • Parameters which would not be assigned to any molecules in the dataset will not be included in the returned summary.

Parameters
  • force_field (ForceField) – The force field containing the parameters to summarize.

  • verbose (bool) – If true a progress bar will be shown on screen.

Returns

A dictionary of the form coverage[handler_name][parameter_smirks] = count which stores the number of molecules within this dataset that would be assigned to each parameter.

Return type

Dict[str, Dict[str, int]]
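Since the report is a nested dictionary of the form `coverage[handler_name][parameter_smirks] = count`, it can be post-processed with plain Python. The sketch below ranks the parameters of each handler by how many molecules they match; the helper `most_used_parameters` and the sample SMIRKS patterns and counts are invented for illustration.

```python
from typing import Dict, List, Tuple


def most_used_parameters(
    coverage: Dict[str, Dict[str, int]], top: int = 1
) -> Dict[str, List[Tuple[str, int]]]:
    """For each handler, list the `top` parameters matched by the most molecules."""
    summary: Dict[str, List[Tuple[str, int]]] = {}
    for handler_name, counts in coverage.items():
        ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
        summary[handler_name] = ranked[:top]
    return summary


# Hypothetical report; the SMIRKS patterns and counts are made up for illustration.
example_coverage = {
    "Bonds": {"[#6:1]-[#6:2]": 12, "[#6:1]-[#1:2]": 30},
    "Angles": {"[*:1]~[#6:2]~[*:3]": 25},
}
top_parameters = most_used_parameters(example_coverage)
```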

dict(*args, **kwargs)

Override the dict method to handle any enums when saving to YAML/JSON via a dict call.

export_dataset(file_name, compression=None)

Export the dataset to file so that it can be used to make another dataset quickly.

Parameters
  • file_name (str) – The name of the file the dataset should be written to.

  • compression (Optional[str]) – The type of compression that should be added to the export.

Raises

UnsupportedFiletypeError – If the requested file type is not supported.

Return type

None

Note

The supported file types are:

  • json

Additionally, the file will be automatically compressed depending on the final extension if compression is not explicitly supplied:

  • json.xz

  • json.gz

  • json.bz2

Check serializers.py for more details. Right now bz2 seems to produce the smallest files.
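The idea of choosing the compression from the final extension can be sketched with the standard library alone. This is not the code in serializers.py; the `OPENERS` table and the helpers `write_json` and `read_json` are hypothetical and only illustrate the extension-based dispatch.

```python
import bz2
import gzip
import json
import lzma
import os
import tempfile

# Map the final file extension to a stdlib opener; anything else is written uncompressed.
OPENERS = {".xz": lzma.open, ".gz": gzip.open, ".bz2": bz2.open}


def write_json(data: dict, file_name: str) -> None:
    """Write `data` as JSON, compressing when the final extension asks for it."""
    opener = OPENERS.get(os.path.splitext(file_name)[1], open)
    with opener(file_name, "wt", encoding="utf-8") as handle:
        json.dump(data, handle)


def read_json(file_name: str) -> dict:
    """Read JSON back, picking the matching decompressor from the extension."""
    opener = OPENERS.get(os.path.splitext(file_name)[1], open)
    with opener(file_name, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Round-trip a small payload through each extension listed above.
payload = {"dataset_name": "example", "dataset": {}}
round_trips = {}
with tempfile.TemporaryDirectory() as folder:
    for name in ("data.json", "data.json.xz", "data.json.gz", "data.json.bz2"):
        path = os.path.join(folder, name)
        write_json(payload, path)
        round_trips[name] = read_json(path)
```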

filter_molecules(molecules, component, component_settings, component_provenance)

Filter a molecule or list of molecules by the component they failed.

Parameters
Return type

None

property filtered: openff.toolkit.topology.molecule.Molecule

A generator which yields an OpenFF molecule representation for each molecule filtered while creating this dataset.

Note

Modifying the molecule will have no effect on the data stored.

get_molecule_entry(molecule)

Search through the dataset for a molecule and return the dataset index of any exact molecule matches.

Parameters

molecule (Union[openff.toolkit.topology.molecule.Molecule, str]) – The SMILES string of the molecule, or an openff.toolkit.topology.Molecule, that is to be searched for.

Returns

A list of dataset indices which contain the target molecule.

Return type

List[str]

property molecules: Generator[openff.toolkit.topology.molecule.Molecule, None, None]

A generator that creates openff.toolkit.topology.Molecule objects one by one from the dataset.

Note

Editing the molecule will not affect the data stored in the dataset, as it is immutable.

molecules_to_file(file_name, file_type)

Write the molecules to the requested file type.

Parameters
  • file_name (str) – The name of the file the molecules should be stored in.

  • file_type (str) – The file format that should be used to store the molecules.

Return type

None

Important

The supported file types are:

  • SMI

  • INCHI

  • INCHIKEY

property n_components: int

Return the number of components that were run while generating the dataset.

property n_filtered: int

Calculate the total number of molecules filtered by the components used in a workflow to create this dataset.

property n_molecules: int

Calculate the number of unique molecules to be submitted.

Notes

  • This method has been improved for better performance on large datasets and has been tested on an optimization dataset of over 10500 molecules.

  • This function does not calculate the total number of entries in the dataset; see n_records.

property n_qc_specs: int

Return the number of QCSpecs on this dataset.

property n_records: int

Return the total number of records that will be created on submission of the dataset.

Note

  • The number returned will differ depending on the type of dataset used.

  • The number of unique molecules can be found using n_molecules.
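The difference between the two counts can be shown with a toy calculation. It assumes one record per conformer, in line with the note that different conformers are unique submissions; the dictionary below is a hypothetical stand-in for the dataset contents and does not use the library.

```python
from typing import Dict

# Hypothetical dataset contents: molecule index -> number of conformers.
conformers_per_molecule: Dict[str, int] = {"CCO": 3, "CC": 1, "c1ccccc1": 2}

# Unique molecules: one per index.
unique_molecules = len(conformers_per_molecule)

# Records: one per conformer, under the assumption stated above.
total_records = sum(conformers_per_molecule.values())
```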

classmethod parse_file(file_name)

Create a Dataset object from a compressed JSON file.

Parameters

file_name (str) – The name of the file the dataset should be created from.

remove_qcspec(spec_name)

Remove a QCSpec from the dataset.

Parameters

spec_name (str) – The name of the spec that should be removed.

Return type

None

Note

The QCSpec settings are not mutable and so they must be removed and a new one added to ensure they are fully validated.
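The remove-then-add pattern described in this note can be sketched as a small helper. It relies only on the `remove_qcspec` and `add_qc_spec` methods documented on this page; the helper `replace_qc_spec`, the `StubDataset` stand-in, and the replacement settings are hypothetical and exist so the example runs without the library.

```python
from typing import Any, Dict


def replace_qc_spec(dataset: Any, spec_name: str, **new_settings: Any) -> None:
    """Swap a spec by removing the old one and adding a new one in its place."""
    dataset.remove_qcspec(spec_name=spec_name)
    dataset.add_qc_spec(spec_name=spec_name, **new_settings)


class StubDataset:
    """Hypothetical stand-in that stores specs in a plain dictionary."""

    def __init__(self) -> None:
        self.specs: Dict[str, Dict[str, Any]] = {
            "default": {"method": "B3LYP-D3BJ", "basis": "DZVP", "program": "psi4"}
        }

    def remove_qcspec(self, spec_name: str) -> None:
        self.specs.pop(spec_name, None)

    def add_qc_spec(self, spec_name: str, **settings: Any) -> None:
        self.specs[spec_name] = settings


stub = StubDataset()
replace_qc_spec(
    stub,
    "default",
    method="HF",
    basis="sto-3g",
    program="psi4",
    spec_description="Hartree-Fock with a minimal basis.",
)
```

On a real dataset, passing `overwrite=True` to `add_qc_spec` is documented as the way to replace a spec already stored under the same name in a single call.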

submit(client, ignore_errors=False, verbose=False)

Submit the dataset to a QCFractal server.

Parameters
  • client (qcportal.client.PortalClient) – Instance of a portal client

  • ignore_errors (bool) – Set this to True to submit the compute regardless of errors. Mainly used to override the basis coverage check.

  • verbose (bool) – Whether progress bars and submission statistics should be printed (True) or not (False).

Returns

A dictionary of the compute response from the client for each specification submitted.

Raises

MissingBasisCoverageError – If the chosen basis set does not cover some of the elements in the dataset.

Return type

Dict

visualize(file_name, columns=4, toolkit=None)

Create a PDF file of the molecules with any torsions highlighted using either OpenEye or RDKit.

Parameters
  • file_name (str) – The name of the PDF file which will be produced.

  • columns (int) – The number of molecules per row.

  • toolkit (Optional[Literal['openeye', 'rdkit']]) – The option to specify the backend toolkit used to produce the pdf file.

Return type

None