GEOM
Basic Information
Full Name: Geometric Ensemble Of Molecules
Domain: Computational Chemistry
Year: 2022
Publication & Access
Paper: DOI
Dataset: Harvard Dataverse
Dataset Composition
Total Size: 450,000+ molecules
Drug-like (AICures): 304,466 molecules
QM9: 133,258 molecules
MoleculeNet: 16,865 molecules
Technical Details
Format: RDKit mol objects, XYZ coordinates
Observables: Conformational energies, COVID-19 assay data
Annotations: Biological activity labels, experimental assays
Research Context
Authors: Simon Axelrod, Rafael Gómez-Bombarelli
Institution: Harvard University, MIT

Dataset Summary

GEOM (Geometric Ensemble Of Molecules) contains 37 million molecular conformations for over 450,000 molecules, with experimental properties and quantum mechanical energies. Unlike traditional datasets that treat molecules as flat 2D graphs, GEOM provides multiple 3D shapes for each molecule, helping machine learning models account for molecular flexibility. The dataset combines conformational sampling with experimental data, including biological assays from the AICures challenge for COVID-19 drug discovery.

Key Features

  • Conformational Diversity: 37 million 3D shapes for modeling how molecules can flex and bend.
  • Quality Levels: Three tiers of computational accuracy, from semi-empirical CREST/GFN2-xTB sampling to DFT single-point energies and full DFT optimization (r2scan-3c/mTZVPP).
  • Experimental Data: Labeled with experimental properties for physical chemistry, biophysics, and biology, including assay results for benchmarking.
  • Energy Information: Quantum mechanical energies help rank which molecular shapes are most likely.
  • ML-Ready: Available in accessible formats, including Python pickle files with RDKit mol objects.
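
As a rough illustration of working with a pickled per-molecule record, the sketch below round-trips a stand-in record through pickle and picks its lowest-energy conformer. The record layout and key names (`conformers`, `relativeenergy`, `boltzmannweight`, `rd_mol`) are assumptions made for illustration, and the numbers are made up; inspect a real file before relying on any of them. In real files the geometry entry would hold an RDKit mol object, so RDKit must be installed to unpickle them.

```python
import io
import pickle

# Hypothetical stand-in for one per-molecule record. Key names and values are
# assumptions for illustration only; "rd_mol" would hold an RDKit mol object.
record = {
    "smiles": "CCO",
    "conformers": [
        {"relativeenergy": 0.65, "boltzmannweight": 0.30, "rd_mol": None},
        {"relativeenergy": 0.00, "boltzmannweight": 0.70, "rd_mol": None},
    ],
}

# Round-trip through pickle, mimicking a read from disk.
buffer = io.BytesIO()
pickle.dump(record, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)


def lowest_energy_conformer(rec):
    """Return the conformer entry with the smallest relative energy."""
    return min(rec["conformers"], key=lambda conf: conf["relativeenergy"])


best = lowest_energy_conformer(loaded)
print(loaded["smiles"], len(loaded["conformers"]), best["relativeenergy"])
```

With a real file, `io.BytesIO` would be replaced by `open(path, "rb")`, where `path` points to one molecule's pickle.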

Dataset Structure

The GEOM dataset is composed of several subsets with varying levels of computational accuracy:

GEOM Dataset Subsets
Name | Count | Description
Drug-like (AICures) | 304,466 molecules | AICures COVID-19 challenge molecules
QM9 | 133,258 molecules | QM9 benchmark dataset
MoleculeNet | 16,865 molecules | Molecules from benchmarks for physical chemistry, biophysics, and physiology
BACE (High-quality DFT) | 1,511 molecules | A subset of MoleculeNet with DFT energies and experimental data

Quality Levels

Computational Quality Levels in GEOM
Method | Description | Accuracy | Molecules
CREST/GFN2-xTB | Semi-empirical conformational sampling | ~2 kcal/mol MAE | All subsets
DFT Single-Point | r2scan-3c/mTZVPP energies on CREST geometries | Sub-kcal/mol | BACE subset (1,511)
DFT Optimized | Full r2scan-3c/mTZVPP optimization and free energies | ~0.3 kcal/mol vs CCSD(T) | BACE subset (534)

Structural Diversity

  • Size Range: Molecules with up to 91 heavy atoms (181 total atoms).
  • Element Types: Mainly C, H, N, O, F for QM9, with S, P, Cl, Br, I included in the drug-like datasets.
  • Conformers per Molecule: Varies widely; the AICures subset averages ~103 conformers per molecule, with a maximum of 7,451.
  • Energy Window: Conformers are included up to 6.0 kcal/mol above the lowest-energy structure.
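
To put the 6.0 kcal/mol window in perspective, relative energies can be converted to Boltzmann populations. The following is a minimal sketch, assuming a temperature of 298.15 K, made-up relative energies, and no degeneracy factors; the function name is hypothetical.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_weights(rel_energies_kcal, temperature_k=298.15):
    """Normalized Boltzmann populations from relative energies in kcal/mol."""
    kt = R_KCAL_PER_MOL_K * temperature_k
    e_min = min(rel_energies_kcal)  # shift so the largest factor is exactly 1
    factors = [math.exp(-(e - e_min) / kt) for e in rel_energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


# Made-up ensemble: the last conformer sits at the edge of the 6.0 kcal/mol window.
energies = [0.0, 0.5, 1.5, 6.0]
weights = boltzmann_weights(energies)
for e, w in zip(energies, weights):
    print(f"{e:4.1f} kcal/mol -> weight {w:.6f}")
```

Under these assumptions, a conformer at the edge of the window carries a room-temperature population far below 0.01%, so the window leaves a wide margin around the thermally relevant structures.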

Use Cases

Primary Applications

  • Conformer-Aware ML Models: Training models that account for molecular flexibility when predicting properties.
  • Property Prediction: Improving prediction accuracy by averaging properties over multiple molecular shapes.
  • Generative Models: Training and testing models that generate realistic 3D molecular structures from graphs.
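
Averaging a property over multiple molecular shapes is commonly done with Boltzmann weights. The sketch below is one way to do it, assuming per-conformer property values and relative energies in kcal/mol at 298.15 K; the property values are invented for illustration and the function name is hypothetical.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_average(values, rel_energies_kcal, temperature_k=298.15):
    """Boltzmann-weighted ensemble average of a per-conformer property."""
    kt = R_KCAL_PER_MOL_K * temperature_k
    e_min = min(rel_energies_kcal)
    factors = [math.exp(-(e - e_min) / kt) for e in rel_energies_kcal]
    partition = sum(factors)
    return sum(v * f for v, f in zip(values, factors)) / partition


# Invented per-conformer property values (e.g. a dipole moment) and energies.
dipoles = [1.2, 2.8, 3.5]
energies = [0.0, 0.4, 1.1]
ensemble_value = boltzmann_average(dipoles, energies)
print(f"lowest-energy conformer: {dipoles[0]:.2f}, ensemble average: {ensemble_value:.2f}")
```

The ensemble average differs from the value of the lowest-energy conformer alone whenever higher-energy conformers carry non-negligible weight.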

Research Applications

  • Drug Discovery: Studying structure-activity relationships where molecular flexibility matters.
  • Computational Chemistry: Testing new methods for finding molecular conformations.
  • Molecular Dynamics: Providing high-quality starting structures and reference energies for simulations.

Quality & Limitations

Strengths

  • Large Scale: One of the largest available datasets of molecular conformers annotated with experimental properties.
  • Quality Hierarchy: A multi-tier approach balances computational cost and accuracy, enabling diverse applications.
  • Experimental Data: Extensive biological and physicochemical data for benchmarking property prediction models.
  • Standardized Format: Provides easy-to-use RDKit mol objects with consistent graph and geometry information.
  • Energy Annotations: Quantum mechanical energies allow for robust, physics-based ranking of conformers.

Limitations

  • Energy Uncertainty: CREST energies have an error of approximately 2 kcal/mol, which can affect the statistical weights of conformers.
  • Solvation Model: Uses implicit solvation models (C-PCM or ALPB) only; explicit solvent effects are not included.
  • Size Constraints: Although the dataset includes large molecules, conformer generation for highly flexible molecules was computationally intensive and sometimes did not finish.
  • Element Coverage: Primarily focused on organic molecules; metals and less common elements are not included.
  • Sampling Completeness: The underlying CREST method has high recall but may not capture all thermally accessible conformers for every molecule.
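
The effect of a roughly 2 kcal/mol energy error on statistical weights can be estimated with a two-conformer model. This is a back-of-the-envelope sketch, assuming 298.15 K and a hypothetical nominal energy gap of 1.0 kcal/mol; the function names are illustrative.

```python
import math

KT_298 = 0.0019872041 * 298.15  # thermal energy in kcal/mol at 298.15 K


def population_ratio(delta_e_kcal, kt=KT_298):
    """Population of a conformer relative to one lying delta_e_kcal lower."""
    return math.exp(-delta_e_kcal / kt)


def upper_state_weight(delta_e_kcal, kt=KT_298):
    """Weight of the higher conformer in a two-conformer ensemble."""
    ratio = population_ratio(delta_e_kcal, kt)
    return ratio / (1.0 + ratio)


nominal = upper_state_weight(1.0)        # hypothetical computed gap
if_gap_larger = upper_state_weight(3.0)  # true gap 2 kcal/mol larger
if_gap_smaller = upper_state_weight(-1.0)  # true gap 2 kcal/mol smaller (order flips)

print(f"ratio change for a 2 kcal/mol shift: x{1.0 / population_ratio(2.0):.1f}")
print(f"weights: {if_gap_larger:.3f} / {nominal:.3f} / {if_gap_smaller:.3f}")
```

In this model a 2 kcal/mol shift changes the population ratio by a factor of about 29, enough to swap which conformer dominates.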

Generation and Processing Pipeline

GEOM employs a multi-step pipeline to generate high-quality conformational ensembles:

Conformational Sampling

  1. Initial Structures: Generated from 2D molecular graphs (SMILES).
  2. CREST Sampling: A metadynamics-enhanced conformational search is performed using the semi-empirical GFN2-xTB method to explore the potential energy surface.
  3. Energy Window: Conformers within 6.0 kcal/mol of the lowest-energy structure are retained.
  4. Clustering: Structurally similar conformers are identified and duplicates are removed.
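
Steps 3 and 4 above can be sketched as an energy-window filter followed by greedy duplicate removal. This is a simplified stand-in, not CREST's actual procedure: the RMSD here assumes identical atom ordering and pre-aligned coordinates, and the tolerance values and function names are illustrative assumptions.

```python
import math


def rmsd(coords_a, coords_b):
    """Plain RMSD between two coordinate lists (no alignment, same atom order)."""
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def prune_ensemble(conformers, window_kcal=6.0, rmsd_tol=0.125, energy_tol=0.05):
    """Keep conformers inside the energy window and drop near-duplicates.

    conformers: list of (relative_energy_kcal, coords) tuples.
    """
    e_min = min(energy for energy, _ in conformers)
    kept = []
    for energy, coords in sorted(conformers, key=lambda c: c[0]):
        if energy - e_min > window_kcal:
            break  # sorted by energy, so everything after is also outside
        duplicate = any(
            abs(energy - kept_e) < energy_tol and rmsd(coords, kept_c) < rmsd_tol
            for kept_e, kept_c in kept
        )
        if not duplicate:
            kept.append((energy, coords))
    return kept


# Toy two-atom "conformers" with made-up energies (kcal/mol) and coordinates.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.01, 0.0)]  # near-duplicate of a
c = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
d = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]   # outside the window
pruned = prune_ensemble([(0.00, a), (0.01, b), (2.30, c), (7.50, d)])
print([energy for energy, _ in pruned])
```

A production workflow would align structures before comparing them and account for symmetry-equivalent atoms, which this sketch omits.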

Energy Refinement

  • Semi-Empirical: GFN2-xTB energies are calculated for all 37 million conformers.
  • DFT Single-Point: For the BACE subset, single-point r2scan-3c/mTZVPP energies and other properties are computed on the fixed CREST geometries.
  • DFT Optimization: For a high-quality subset of 534 BACE molecules, full geometry optimization and free energy calculations are performed using r2scan-3c/mTZVPP via the CENSO protocol.
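
Re-evaluating energies at a higher level of theory can change the conformer ranking. The sketch below converts total energies to relative energies and compares rankings, assuming the totals are in Hartree (as is common for quantum chemistry output). All energy values and function names are invented for illustration.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095


def relative_energies_kcal(total_energies_hartree):
    """Energies relative to the lowest one, converted from Hartree to kcal/mol."""
    e_min = min(total_energies_hartree)
    return [(e - e_min) * HARTREE_TO_KCAL_PER_MOL for e in total_energies_hartree]


def ranking(energies):
    """Conformer indices ordered from lowest to highest energy."""
    return sorted(range(len(energies)), key=lambda i: energies[i])


# Invented total energies for three conformers at two levels of theory.
semi_empirical_totals = [-40.1250, -40.1242, -40.1230]
dft_totals = [-310.5521, -310.5530, -310.5502]

semi_rel = relative_energies_kcal(semi_empirical_totals)
dft_rel = relative_energies_kcal(dft_totals)

print("semi-empirical ranking:", ranking(semi_rel))
print("DFT ranking:           ", ranking(dft_rel))
```

Only relative energies within one level of theory are comparable; totals from different methods sit on different absolute scales, which is why each list is shifted by its own minimum.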

Experimental Integration

  • Assay Data: Biological assay results (e.g., antiviral activity) and physicochemical properties from AICures and MoleculeNet are linked to each molecule.
  • Property Mapping: Molecular graphs and experimental endpoints are mapped to the generated conformational ensembles.

Model Performance

Several machine learning models have been benchmarked on the GEOM dataset for conformer property prediction tasks. The experiments were conducted on 100,000 species randomly sampled from the AICures drug subset, using a 60-20-20 split for training, validation, and testing. Results show mean absolute error (MAE) performance across three prediction tasks:

AICures Property Prediction Benchmarks (MAE: Free Energy G in kcal/mol, Avg Energy E in kcal/mol, ln(Conformers) unitless)
Model name | Avg energy E | Free energy G | ln(conformers) | Notes
SchNetFeatures | 0.113 | 0.203 | 0.363 | Best overall performer
ChemProp | 0.110 | 0.225 | 0.380 | Graph neural network approach
FFNN | 0.119 | 0.274 | 0.455 | Feed-Forward Neural Network
KRR | 0.131 | 0.289 | 0.484 | Kernel Ridge Regression
Random Forest | 0.166 | 0.406 | 0.763 | Ensemble method baseline

Key Findings: SchNetFeatures and ChemProp demonstrated the strongest performance across the three prediction tasks, with SchNetFeatures excelling at free energy and conformer count prediction, while ChemProp achieved the best results for average conformational energy prediction.
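
The evaluation protocol described above (a random 60-20-20 split scored by MAE) can be sketched as follows. This is a minimal illustration, not the authors' code: the seed, the identifiers, and the function names are assumptions.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle items and split them into train/validation/test (60/20/20)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]
    return train, val, test_set


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Placeholder identifiers standing in for sampled species.
species = [f"mol_{i}" for i in range(100)]
train, val, test_set = split_60_20_20(species)
print(len(train), len(val), len(test_set))
print(mean_absolute_error([1.0, 2.0], [1.5, 1.0]))
```

Fixing the seed keeps the split reproducible, so different models can be compared on the same held-out molecules.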


Citation: Axelrod, S. & Gómez-Bombarelli, R. “GEOM, energy-annotated molecular conformations for property prediction and molecular generation” Sci Data 2022, 9, 185.