Dataset Summary

GEOM (Geometric Ensemble Of Molecules) is a large-scale dataset containing 37 million molecular conformations for over 450,000 molecules, designed specifically for machine learning applications in computational chemistry. The dataset addresses a critical gap by providing conformer ensembles with experimental properties, enabling the development of models that account for molecular flexibility rather than treating molecules as static 2D graphs or single 3D structures.

Quick Facts

Dataset Composition

Size and Scale

| Subset | Molecules | Source | Properties |
| --- | --- | --- | --- |
| Drug-like (AICures) | 304,466 | AICures COVID-19 challenge | Biological assay data |
| QM9 | 133,258 | QM9 benchmark | Quantum mechanical properties |
| MoleculeNet | 16,865 | Various benchmarks | Physical/biological properties |
| BACE (High-quality DFT) | 1,511 | MoleculeNet BACE | DFT energies + experimental |
| BACE (DFT optimized) | 534 | Subset of above | Fully DFT-optimized conformers |

Molecular Characteristics

AICures Drug Dataset (N=304,466)

  • Average atoms: 44.4 (max: 181)
  • Average heavy atoms: 24.9 (max: 91)
  • Molecular weight: 355.4 ± 80.4 amu (max: 1549.7)
  • Rotatable bonds: 6.5 ± 3.0 (max: 53)
  • Stereochemistry: 45,712 specified, 83,326 total stereocenters

QM9 Dataset (N=133,258)

  • Average atoms: 18.0 (max: 29)
  • Average heavy atoms: 8.8 (max: 9)
  • Molecular weight: 122.7 ± 7.6 amu (max: 152.0)
  • Rotatable bonds: 2.2 ± 1.6 (max: 8)
  • Stereochemistry: 95,734 with specified stereochemistry

Conformational Statistics

Drug Dataset Ensembles

  • Average conformers per molecule: 102.6 (max: 7,451)
  • Conformational entropy: 8.2 ± 2.6 cal/mol·K
  • Conformational free energy G: -2.4 ± 0.8 kcal/mol
  • Average energy ⟨E⟩: 0.4 ± 0.2 kcal/mol
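
The ensemble statistics listed above can be derived from the relative energies of a molecule's conformers. The following is a minimal sketch, not the dataset's own tooling: it assumes Boltzmann weights at T = 298.15 K, degeneracies of 1 unless supplied, and the relation G = ⟨E⟩ − TS. The function name `ensemble_statistics` and its arguments are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ensemble_statistics(rel_energies, degeneracies=None, temperature=298.15):
    """Return (S in cal/(mol*K), <E> in kcal/mol, G in kcal/mol).

    rel_energies: conformer energies in kcal/mol (any common reference).
    degeneracies: optional per-conformer degeneracies (assumed 1 if omitted).
    """
    if degeneracies is None:
        degeneracies = [1] * len(rel_energies)
    if len(degeneracies) != len(rel_energies):
        raise ValueError("energies and degeneracies must have equal length")

    rt = R_KCAL * temperature
    e_min = min(rel_energies)  # energies are measured from the lowest conformer

    # Degeneracy-weighted Boltzmann factors and their normalisation.
    factors = [d * math.exp(-(e - e_min) / rt)
               for e, d in zip(rel_energies, degeneracies)]
    z = sum(factors)
    probs = [f / z for f in factors]

    # Average energy relative to the lowest-energy conformer.
    avg_e = sum(p * (e - e_min) for p, e in zip(probs, rel_energies))

    # Entropy counted per degenerate microstate: S = -R * sum p ln(p / d).
    entropy = -R_KCAL * sum(p * math.log(p / d)
                            for p, d in zip(probs, degeneracies) if p > 0)

    free_energy = avg_e - temperature * entropy
    return entropy * 1000.0, avg_e, free_energy
```

With energies measured from the lowest conformer, G computed this way equals −RT ln Z and is therefore never positive, which is consistent with the negative mean value reported above.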

Data Generation Methodology

Three-Tier Quality Hierarchy

Tier 1: CREST Conformers (Majority)

  • Method: Semi-empirical GFN2-xTB with metadynamics sampling
  • Accuracy: ~2 kcal/mol energy accuracy
  • Coverage: Excellent conformational space coverage
  • Speed: Fast enough for large-scale generation
  • Statistical weights: $P_i^{\text{CREST}} = \frac{d_i\exp(-E_i/k_BT)}{\sum_j d_j\exp(-E_j/k_BT)}$
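
As an illustration of the statistical-weight formula above, the sketch below evaluates $P_i^{\text{CREST}}$ for a small ensemble. It assumes energies in kcal/mol, T = 298.15 K, and that $d_i$ is the conformer degeneracy; the function name `crest_weights` is hypothetical and not part of CREST or the dataset.

```python
import math

K_B_KCAL = 1.987204e-3  # Boltzmann constant per mole, kcal/(mol*K)


def crest_weights(energies_kcal, degeneracies, temperature=298.15):
    """Degeneracy-weighted Boltzmann probabilities for one conformer ensemble."""
    if len(energies_kcal) != len(degeneracies):
        raise ValueError("energies and degeneracies must have equal length")
    kt = K_B_KCAL * temperature
    # Shifting by the minimum energy avoids overflow and cancels in the ratio.
    e_min = min(energies_kcal)
    terms = [d * math.exp(-(e - e_min) / kt)
             for e, d in zip(energies_kcal, degeneracies)]
    total = sum(terms)
    return [t / total for t in terms]


# Example: three conformers, the second one doubly degenerate.
weights = crest_weights([0.0, 0.5, 1.2], [1, 2, 1])
```

Because only energy differences enter the ratio, the result is the same whether absolute or relative energies are passed in.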

Tier 2: DFT Single-Point (BACE subset)

  • Method: r²SCAN-3c functional, mTZVPP basis, C-PCM water model
  • Coverage: All 1,511 BACE molecules
  • Accuracy: High-quality energies on CREST geometries
  • Purpose: More accurate statistical weights

Tier 3: DFT Optimized (Premium subset)

  • Method: Full CENSO workflow with DFT optimization
  • Coverage: 534 BACE molecules (35% of BACE subset)
  • Quality: Highest accuracy conformers and free energies
  • Free energy: $G_i = E_{\text{gas}}^{(i)} + \delta G_{\text{solv}}^{(i)}(T) + G_{\text{trv}}^{(i)}(T)$

Generation Pipeline

  1. SMILES preprocessing: Canonical forms via RDKit
  2. Initial structure: RDKit conformer generation + GFN2-xTB optimization
  3. CREST simulation: Metadynamics-based conformational sampling
  4. Graph re-identification: xyz2mol for consistent molecular graphs
  5. Optional refinement: CENSO DFT optimization (premium subset)

Experimental Properties

Biological Assays (AICures)

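| Target | Molecular species | Active compounds |
| --- | --- | --- |
| SARS-CoV-2 | 5,832 | 101 |
| SARS-CoV-2 3CL protease | 817 | 78 |
| SARS-CoV 3CL protease | 289,808 | 447 |
| SARS-CoV PL protease | 232,708 | 696 |
| E. coli inhibition | 2,186 | 111 |
| P. aeruginosa inhibition | 1,968 | 48 |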

MoleculeNet Properties

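| Category | Dataset | Property | Tasks | Recovery rate |
| --- | --- | --- | --- | --- |
| Physical Chemistry | ESOL | Water solubility | 1 | 99.6% |
| | FreeSolv | Hydration free energy | 1 | 100.0% |
| | Lipophilicity | log K (octanol-water) | 1 | 99.9% |
| Biophysics | BACE | BACE-1 inhibition | 1 | 99.9% |
| | BBBP | Blood-brain barrier | 1 | 99.2% |
| Physiology | Tox21 | Qualitative toxicity | 12 | 98.0% |
| | ToxCast | Qualitative toxicity | 617 | 98.0% |
| | SIDER | Drug side effects | 27 | 95.1% |
| | ClinTox | Clinical toxicity | 2 | 98.7% |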

Intended Use Cases

Property Prediction

  • Conformer-aware models: Train ML models that use conformer ensembles as input
  • Ensemble averaging: Predict properties by averaging over thermally accessible conformers
  • Transfer learning: Leverage the similarity between SARS-CoV and SARS-CoV-2 targets for cross-target models
  • Multi-task learning: Joint prediction across multiple assays and properties
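
Ensemble averaging, as mentioned in the list above, can be sketched as a Boltzmann-weighted mean of a per-conformer property. This example makes several assumptions: energies are in kcal/mol, T = 298.15 K, degeneracies are ignored, and "thermally accessible" is approximated by an optional energy window. The function name and the `window_kcal` parameter are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def boltzmann_average(property_values, energies_kcal,
                      temperature=298.15, window_kcal=None):
    """Boltzmann-weighted mean of a per-conformer property.

    window_kcal: if given, only conformers within this energy of the
    minimum contribute (a simple proxy for thermal accessibility).
    """
    if len(property_values) != len(energies_kcal):
        raise ValueError("one property value is needed per conformer")
    rt = R_KCAL * temperature
    e_min = min(energies_kcal)

    # Keep (value, relative energy) pairs that fall inside the window.
    kept = [(x, e - e_min) for x, e in zip(property_values, energies_kcal)
            if window_kcal is None or e - e_min <= window_kcal]

    weights = [math.exp(-de / rt) for _, de in kept]
    z = sum(weights)  # never zero: the minimum-energy conformer is always kept
    return sum(w * x for w, (x, _) in zip(weights, kept)) / z
```

A model's per-conformer predictions can be passed as `property_values`; comparing the result with the prediction for the lowest-energy conformer alone gives a quick indication of how much the ensemble matters for a given molecule.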

Generative Modeling

  • Conformer generation: Train models to generate 3D conformations from 2D graphs
  • Pre-training: Large-scale pre-training for generalizable conformer models
  • Benchmark evaluation: Test conformer generation quality (recall, diversity)
  • Fast sampling: Replace expensive CREST calculations with ML inference
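
For the benchmark evaluation item above, recall-style metrics are commonly computed from a matrix of RMSD values between reference and generated conformers. The sketch below assumes such a matrix has already been computed (for example after optimal alignment) and uses a threshold of 1.25 Å purely as an illustrative default; neither the function name nor the threshold comes from the dataset itself.

```python
def coverage_and_matching(rmsd_matrix, threshold=1.25):
    """Recall-style conformer generation metrics.

    rmsd_matrix[i][j] is the RMSD (in angstroms) between reference
    conformer i and generated conformer j.

    Returns (coverage, matching):
      coverage: fraction of reference conformers that have at least one
                generated conformer closer than `threshold`.
      matching: mean over reference conformers of the smallest RMSD to
                any generated conformer.
    """
    if not rmsd_matrix or not rmsd_matrix[0]:
        raise ValueError("need at least one reference and one generated conformer")

    # Closest generated conformer for each reference conformer.
    best = [min(row) for row in rmsd_matrix]

    coverage = sum(1 for b in best if b < threshold) / len(best)
    matching = sum(best) / len(best)
    return coverage, matching


# Two reference conformers, two generated conformers.
cov, mat = coverage_and_matching([[0.5, 2.0],
                                  [1.5, 3.0]])
```

Transposing the matrix gives the precision-style counterpart, which measures how many generated conformers lie close to some reference conformer.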

Conformer Ensemble Learning

  • Ensemble-aware models: Develop models that explicitly learn from conformer ensembles
  • Benchmarking: Use with complementary datasets like MARCEL for comprehensive evaluation
  • Method comparison: Test different ensemble aggregation strategies

Computational Chemistry Research

  • Method validation: Benchmark new conformational sampling methods
  • Energy ranking: Test accuracy of different energy models
  • Statistical mechanics: Study conformational entropy and free energy
  • Solvent effects: Investigate implicit vs explicit solvation models

Quality Assessment

Strengths

  • Scale: Largest conformer dataset with experimental properties
  • Quality: Semi-empirical to DFT accuracy hierarchy
  • Coverage: Excellent conformational space sampling via metadynamics
  • Diversity: Drug-like and small molecule subsets
  • Validation: Extensive benchmarking against experimental data

Limitations

  • Statistical weights: CREST weights are approximate because the underlying energies carry errors of roughly 2 kcal/mol
  • Solvent model: Implicit solvation only (C-PCM)
  • Size constraints: Limited to medium-sized organic molecules
  • Computational cost: DFT subset is limited in size
  • Stereochemistry: Not all molecules have specified stereochemistry

Accuracy Metrics

  • CREST energies: ~2 kcal/mol error in relative conformer energies (MAE of 1.96 kcal/mol for GFN2-xTB on the BACE subset)
  • DFT energies: Sub-kcal/mol accuracy for optimized subset
  • Conformer recovery: High recall for thermally accessible conformers
  • Property prediction: Benchmarked against experimental assays

Technical Specifications

File Formats

  • Molecular graphs: RDKit mol objects
  • Conformers: XYZ coordinates
  • Energies: Hartree (DFT), kcal/mol (relative)
  • Properties: Various units depending on assay
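
Working with the formats above typically involves reading XYZ coordinates and converting absolute DFT energies in Hartree to relative energies in kcal/mol. The sketch below assumes the standard XYZ text convention (atom count, comment line, then one "symbol x y z" line per atom) and the conversion factor 1 Hartree = 627.5095 kcal/mol. The exact on-disk layout of the dataset files is not specified here, so both helper functions are hypothetical.

```python
HARTREE_TO_KCAL = 627.5095  # kcal/mol per Hartree


def parse_xyz(text):
    """Parse one XYZ block into (comment, [(symbol, x, y, z), ...])."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    comment = lines[1].strip()
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("atom count does not match the XYZ header")
    return comment, atoms


def relative_energies_kcal(energies_hartree):
    """Convert absolute energies (Hartree) to kcal/mol above the minimum."""
    e_min = min(energies_hartree)
    return [(e - e_min) * HARTREE_TO_KCAL for e in energies_hartree]


water = """3
water
O 0.000 0.000 0.000
H 0.000 0.757 0.586
H 0.000 -0.757 0.586
"""
comment, atoms = parse_xyz(water)
```

Subtracting the minimum before converting keeps the relative energies numerically well behaved, since absolute DFT energies are large compared with conformer energy differences.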

Software Dependencies

  • CREST: Conformational sampling
  • ORCA: DFT calculations
  • RDKit: Chemical informatics
  • xyz2mol: Graph reconstruction

Computational Requirements

  • CREST generation: ~1-10 CPU hours per molecule
  • DFT optimization: ~100-1000 CPU hours per molecule
  • Storage: Several TB for full dataset

Ethical Considerations

Intended Use

  • Accelerating drug discovery and computational chemistry research
  • Academic and commercial research applications
  • Method development and benchmarking

Potential Misuse

  • Dataset should not be used as sole basis for clinical decisions
  • Conformer quality varies; users should understand limitations
  • COVID-19 data reflects 2020-2022 research priorities

Leaderboards

Conformational Property Prediction

Task: Predict conformational statistics from molecular graphs (SMILES)

| Model | Conformational free energy G (kcal/mol) | Average energy ⟨E⟩ (kcal/mol) | ln(unique conformers) | Paper/Source | Year |
| --- | --- | --- | --- | --- | --- |
| SchNet Features | 0.203 | 0.113 | 0.363 | GEOM Dataset Paper | 2022 |
| ChemProp | 0.225 | 0.110 | 0.380 | GEOM Dataset Paper | 2022 |
| FFNN | 0.274 | 0.119 | 0.455 | GEOM Dataset Paper | 2022 |
| KRR | 0.289 | 0.131 | 0.484 | GEOM Dataset Paper | 2022 |
| Random Forest | 0.406 | 0.166 | 0.763 | GEOM Dataset Paper | 2022 |

Metrics: Mean absolute error (MAE); lower is better. Best per metric: SchNet Features for G (0.203) and ln(unique conformers) (0.363), ChemProp for ⟨E⟩ (0.110).

Energy Ranking Accuracy

Task: Accurately rank conformers by their relative energies

| Method | Spearman correlation (ρ) | MAE (kcal/mol) | Subset | Paper/Source | Year |
| --- | --- | --- | --- | --- | --- |
| CENSO (DFT optimized) | 0.85 ± 0.18 (vs energy) | 0.33 | BACE (534 molecules) | GEOM Dataset Paper | 2022 |
| DFT Single-Point | 0.69 ± 0.27 | 0.54 | BACE (1,511 molecules) | GEOM Dataset Paper | 2022 |
| GFN2-xTB (CREST) | 0.39 ± 0.35 | 1.96 | BACE (1,511 molecules) | GEOM Dataset Paper | 2022 |

Metrics: Higher Spearman correlation and lower MAE indicate better energy ranking accuracy.


Access and Licensing

  • Availability: Publicly available; described in the journal Scientific Data (Nature Portfolio)
  • License: Check original publication for specific terms
  • Citation: Axelrod & Gómez-Bombarelli, Sci Data 9, 185 (2022)

Dataset Status: Active, publicly available
Last Updated: 2022
Contact: Authors via original publication