Dataset Summary
GEOM (Geometric Ensemble Of Molecules) is a large-scale dataset containing 37 million molecular conformations for over 450,000 molecules, designed specifically for machine learning applications in computational chemistry. The dataset addresses a critical gap by providing conformer ensembles with experimental properties, enabling the development of models that account for molecular flexibility rather than treating molecules as static 2D graphs or single 3D structures.
Quick Facts
- Total Conformations: 37 million
- Unique Molecules: 450,000+
- Paper: GEOM, energy-annotated molecular conformations for property prediction and molecular generation
- Authors: Simon Axelrod (Harvard), Rafael Gómez-Bombarelli (MIT)
- Publication: Scientific Data (2022)
Dataset Composition
Size and Scale
Subset | Molecules | Source | Properties |
---|---|---|---|
Drug-like (AICures) | 304,466 | AICures COVID-19 challenge | Biological assay data |
QM9 | 133,258 | QM9 benchmark | Quantum mechanical properties |
MoleculeNet | 16,865 | Various benchmarks | Physical/biological properties |
BACE (High-quality DFT) | 1,511 | MoleculeNet BACE | DFT energies + experimental |
BACE (DFT optimized) | 534 | Subset of above | Fully DFT-optimized conformers |
Molecular Characteristics
AICures Drug Dataset (N=304,466)
- Average atoms: 44.4 (max: 181)
- Average heavy atoms: 24.9 (max: 91)
- Molecular weight: 355.4 ± 80.4 amu (max: 1549.7)
- Rotatable bonds: 6.5 ± 3.0 (max: 53)
- Stereochemistry: 45,712 specified, 83,326 total stereocenters
QM9 Dataset (N=133,258)
- Average atoms: 18.0 (max: 29)
- Average heavy atoms: 8.8 (max: 9)
- Molecular weight: 122.7 ± 7.6 amu (max: 152.0)
- Rotatable bonds: 2.2 ± 1.6 (max: 8)
- Stereochemistry: 95,734 with specified stereochemistry
Conformational Statistics
Drug Dataset Ensembles
- Average conformers per molecule: 102.6 (max: 7,451)
- Conformational entropy S: 8.2 ± 2.6 cal/(mol·K)
- Conformational free energy G: -2.4 ± 0.8 kcal/mol
- Average conformational energy ⟨E⟩: 0.4 ± 0.2 kcal/mol
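These ensemble statistics can be computed from conformer populations. The sketch below assumes the standard definitions $S = -R\sum_i p_i \ln p_i$, $G = -TS$ and $\langle E\rangle = \sum_i p_i E_i$ at $T = 298.15$ K. That assumption is consistent with the reported values, since 8.2 cal/(mol·K) × 298.15 K ≈ 2.4 kcal/mol. The function name and the toy inputs are illustrative and are not part of the dataset's tooling.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def ensemble_statistics(weights, rel_energies, temperature=298.15):
    """Return (S in cal/(mol*K), G in kcal/mol, <E> in kcal/mol).

    weights      -- conformer populations p_i (renormalized here)
    rel_energies -- conformer energies relative to the minimum, kcal/mol
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    # Gibbs entropy; conformers with zero population contribute nothing
    entropy = -R_KCAL * sum(p * math.log(p) for p in probs if p > 0.0)
    free_energy = -temperature * entropy  # G = -T*S (assumed definition)
    mean_energy = sum(p * e for p, e in zip(probs, rel_energies))
    return entropy * 1000.0, free_energy, mean_energy

# Toy ensemble: three conformers with made-up populations and energies
s, g, e_avg = ensemble_statistics([0.6, 0.3, 0.1], [0.0, 0.4, 1.1])
```

A single-conformer ensemble gives S = 0 and G = 0, so more negative G values indicate more flexible molecules.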
Data Generation Methodology
Three-Tier Quality Hierarchy
Tier 1: CREST Conformers (Majority)
- Method: Semi-empirical GFN2-xTB with metadynamics sampling
- Accuracy: Energies accurate to roughly 2 kcal/mol
- Coverage: Excellent conformational space coverage
- Speed: Fast enough for large-scale generation
- Statistical weights: $P_i^{\text{CREST}} = \frac{d_i\exp(-E_i/k_BT)}{\sum_j d_j\exp(-E_j/k_BT)}$
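The statistical-weight expression above can be evaluated directly. This is a minimal sketch, assuming energies in kcal/mol, $T = 298.15$ K, and that $d_i$ is the degeneracy of conformer $i$. The function name and example numbers are illustrative.

```python
import math

K_B_KCAL = 1.987204e-3  # Boltzmann constant per mole, kcal/(mol*K)

def boltzmann_weights(energies, degeneracies=None, temperature=298.15):
    """P_i = d_i exp(-E_i/kT) / sum_j d_j exp(-E_j/kT)."""
    if degeneracies is None:
        degeneracies = [1] * len(energies)
    kt = K_B_KCAL * temperature
    # Shifting by the minimum energy avoids overflow and leaves P_i unchanged
    e_min = min(energies)
    factors = [d * math.exp(-(e - e_min) / kt)
               for e, d in zip(energies, degeneracies)]
    z = sum(factors)
    return [f / z for f in factors]

# Made-up ensemble: energies in kcal/mol, second conformer doubly degenerate
weights = boltzmann_weights([0.0, 0.5, 2.0], degeneracies=[1, 2, 1])
```

Because $k_BT \approx 0.59$ kcal/mol at room temperature, an energy error of about 2 kcal/mol can change a conformer's weight by more than an order of magnitude. This is why the DFT tiers below matter for accurate populations.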
Tier 2: DFT Single-Point (BACE subset)
- Method: r²SCAN-3c functional, mTZVPP basis, C-PCM water model
- Coverage: All 1,511 BACE molecules
- Accuracy: High-quality energies on CREST geometries
- Purpose: More accurate statistical weights
Tier 3: DFT Optimized (Premium subset)
- Method: Full CENSO workflow with DFT optimization
- Coverage: 534 BACE molecules (35% of BACE subset)
- Quality: Highest accuracy conformers and free energies
- Free energy: $G_i = E_{\text{gas}}^{(i)} + \delta G_{\text{solv}}^{(i)}(T) + G_{\text{trv}}^{(i)}(T)$
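Assuming the DFT-level populations are obtained by Boltzmann-weighting these free energies, by analogy with the CREST expression (this card does not spell it out, and the superscript label is only illustrative), the weight of conformer $i$ would be:

$$P_i^{\text{CENSO}} = \frac{\exp(-G_i/k_BT)}{\sum_j \exp(-G_j/k_BT)}$$

At $T = 298.15$ K, $k_BT \approx 0.593$ kcal/mol, so a conformer whose free energy lies 1 kcal/mol above the minimum has a population of about $\exp(-1/0.593) \approx 0.19$ relative to the lowest conformer.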
Generation Pipeline
- SMILES preprocessing: Canonical forms via RDKit
- Initial structure: RDKit conformer generation + GFN2-xTB optimization
- CREST simulation: Metadynamics-based conformational sampling
- Graph re-identification: xyz2mol for consistent molecular graphs
- Optional refinement: CENSO DFT optimization (premium subset)
Experimental Properties
Biological Assays (AICures)
Target | Molecules | Active Compounds |
---|---|---|
SARS-CoV-2 | 5,832 | 101 |
SARS-CoV-2 3CL protease | 817 | 78 |
SARS-CoV 3CL protease | 289,808 | 447 |
SARS-CoV PL protease | 232,708 | 696 |
E. coli inhibition | 2,186 | 111 |
P. aeruginosa inhibition | 1,968 | 48 |
MoleculeNet Properties
Category | Dataset | Property | Tasks | Recovery Rate |
---|---|---|---|---|
Physical Chemistry | ESOL | Water solubility | 1 | 99.6% |
Physical Chemistry | FreeSolv | Hydration free energy | 1 | 100.0% |
Physical Chemistry | Lipophilicity | log K(octanol-water) | 1 | 99.9% |
Biophysics | BACE | BACE-1 inhibition | 1 | 99.9% |
Biophysics | BBBP | Blood-brain barrier | 1 | 99.2% |
Physiology | Tox21 | Qualitative toxicity | 12 | 98.0% |
Physiology | ToxCast | Qualitative toxicity | 617 | 98.0% |
Physiology | SIDER | Drug side effects | 27 | 95.1% |
Physiology | ClinTox | Clinical toxicity | 2 | 98.7% |
Intended Use Cases
Property Prediction
- Conformer-aware models: Train ML models that use conformer ensembles as input
- Ensemble averaging: Predict properties by averaging over thermally accessible conformers
- Transfer learning: Leverage SARS-CoV/CoV-2 similarity for cross-target models
- Multi-task learning: Joint prediction across multiple assays and properties
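Ensemble averaging over thermally accessible conformers can be sketched as follows. The sketch assumes per-conformer predictions are already available (for example from a 3D model) and that populations follow Boltzmann statistics at 298.15 K. The `window` cutoff and all names and numbers are hypothetical choices for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal/(mol*K)

def ensemble_average(values, rel_energies, temperature=298.15, window=None):
    """Boltzmann-weighted mean of per-conformer values.

    values       -- one predicted property value per conformer
    rel_energies -- conformer energies in kcal/mol (any common reference)
    window       -- optional cutoff in kcal/mol above the minimum; conformers
                    beyond it are treated as inaccessible and dropped
    """
    kt = R_KCAL * temperature
    e_min = min(rel_energies)
    kept = [(v, e - e_min) for v, e in zip(values, rel_energies)
            if window is None or e - e_min <= window]
    factors = [math.exp(-de / kt) for _, de in kept]
    z = sum(factors)
    return sum(v * f for (v, _), f in zip(kept, factors)) / z

# Hypothetical per-conformer predictions and relative energies
per_conformer_pred = [1.20, 0.80, 2.50]
rel_energies = [0.0, 0.3, 3.0]

averaged = ensemble_average(per_conformer_pred, rel_energies)
lowest_only = per_conformer_pred[rel_energies.index(min(rel_energies))]
```

Comparing `averaged` with `lowest_only` shows how much a prediction changes when the model accounts for the ensemble instead of a single 3D structure.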
Generative Modeling
- Conformer generation: Train models to generate 3D conformations from 2D graphs
- Pre-training: Large-scale pre-training for generalizable conformer models
- Benchmark evaluation: Test conformer generation quality (recall, diversity)
- Fast sampling: Replace expensive CREST calculations with ML inference
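Generated ensembles are often scored against reference conformers with coverage and matching metrics. The sketch below assumes a precomputed RMSD matrix between reference (CREST) and generated conformers. The metric definitions and the 1.25 Å default threshold follow common practice in the conformer-generation literature and are assumptions here, since this card does not define them.

```python
def coverage_and_matching(rmsd, threshold=1.25):
    """Recall-style metrics from an RMSD matrix (values in angstrom).

    rmsd[i][j] is the RMSD between reference conformer i and generated
    conformer j. Coverage is the fraction of reference conformers that have
    at least one generated conformer within `threshold`; matching is the
    mean, over reference conformers, of the RMSD to the closest one.
    """
    best = [min(row) for row in rmsd]  # closest generated conformer per reference
    coverage = sum(1 for b in best if b <= threshold) / len(best)
    matching = sum(best) / len(best)
    return coverage, matching

# Made-up matrix: 3 reference conformers x 2 generated conformers
rmsd = [
    [0.2, 1.9],
    [1.6, 1.4],
    [0.9, 3.0],
]
cov, mat = coverage_and_matching(rmsd)
```

Transposing the matrix before the call gives the precision-style variants, which penalize generated conformers that resemble no reference conformer.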
Conformer Ensemble Learning
- Ensemble-aware models: Develop models that explicitly learn from conformer ensembles
- Benchmarking: Use with complementary datasets like MARCEL for comprehensive evaluation
- Method comparison: Test different ensemble aggregation strategies
Computational Chemistry Research
- Method validation: Benchmark new conformational sampling methods
- Energy ranking: Test accuracy of different energy models
- Statistical mechanics: Study conformational entropy and free energy
- Solvent effects: Investigate implicit vs explicit solvation models
Quality Assessment
Strengths
- Scale: Largest conformer dataset with experimental properties
- Quality: Semi-empirical to DFT accuracy hierarchy
- Coverage: Excellent conformational space sampling via metadynamics
- Diversity: Drug-like and small molecule subsets
- Validation: Extensive benchmarking against experimental data
Limitations
- Statistical weights: CREST weights are approximate because the underlying GFN2-xTB energies carry errors of about 2 kcal/mol
- Solvent model: Implicit solvation only (C-PCM)
- Size constraints: Limited to medium-sized organic molecules
- Computational cost: DFT subset is limited in size
- Stereochemistry: Not all molecules have specified stereochemistry
Accuracy Metrics
- CREST energies: ~2 kcal/mol error (MAE of 1.96 kcal/mol on the BACE subset)
- DFT energies: Sub-kcal/mol accuracy for optimized subset
- Conformer recovery: High recall for thermally accessible conformers
- Property prediction: Benchmarked against experimental assays
Technical Specifications
File Formats
- Molecular graphs: RDKit mol objects
- Conformers: XYZ coordinates
- Energies: Hartree (DFT), kcal/mol (relative)
- Properties: Various units depending on assay
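Reading one conformer in the standard XYZ text layout and converting absolute energies in Hartree to relative energies in kcal/mol might look like the sketch below. The card does not describe how the released files are packaged, so the helper names and the example geometry are purely illustrative.

```python
HARTREE_TO_KCAL = 627.509474  # kcal/mol per Hartree

def parse_xyz(text):
    """Parse one XYZ block into (element symbols, list of (x, y, z))."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])              # first line: number of atoms
    symbols, coords = [], []
    for line in lines[2:2 + n_atoms]:    # second line is a free-form comment
        sym, x, y, z = line.split()[:4]
        symbols.append(sym)
        coords.append((float(x), float(y), float(z)))
    return symbols, coords

def relative_energies_kcal(energies_hartree):
    """Energies relative to the lowest conformer, converted to kcal/mol."""
    e_min = min(energies_hartree)
    return [(e - e_min) * HARTREE_TO_KCAL for e in energies_hartree]

example = """3
water (illustrative geometry)
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""
symbols, coords = parse_xyz(example)
rel = relative_energies_kcal([-76.4000, -76.3990])
```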
Software Dependencies
- CREST: Conformational sampling
- ORCA: DFT calculations
- RDKit: Chemical informatics
- xyz2mol: Graph reconstruction
Computational Requirements
- CREST generation: ~1-10 CPU hours per molecule
- DFT optimization: ~100-1000 CPU hours per molecule
- Storage: Several TB for full dataset
Ethical Considerations
Intended Use
- Accelerating drug discovery and computational chemistry research
- Academic and commercial research applications
- Method development and benchmarking
Potential Misuse
- Dataset should not be used as sole basis for clinical decisions
- Conformer quality varies; users should understand limitations
- COVID-19 data reflects 2020-2022 research priorities
Leaderboards
Conformational Property Prediction
Task: Predict conformational statistics from molecular graphs (SMILES)
Model | Conformational Free Energy G (kcal/mol) | Average Energy ⟨E⟩ (kcal/mol) | ln(Unique Conformers) | Paper/Source | Year |
---|---|---|---|---|---|
SchNet Features | 0.203 | 0.113 | 0.363 | GEOM Dataset Paper | 2022 |
ChemProp | 0.225 | 0.110 | 0.380 | GEOM Dataset Paper | 2022 |
FFNN | 0.274 | 0.119 | 0.455 | GEOM Dataset Paper | 2022 |
KRR | 0.289 | 0.131 | 0.484 | GEOM Dataset Paper | 2022 |
Random Forest | 0.406 | 0.166 | 0.763 | GEOM Dataset Paper | 2022 |
Metrics: Mean Absolute Error (MAE). Lower is better. SchNet Features gives the lowest error for G and ln(Unique Conformers); ChemProp gives the lowest error for ⟨E⟩.
Energy Ranking Accuracy
Task: Accurately rank conformers by their relative energies
Method | Spearman Correlation (ρ) | MAE (kcal/mol) | Subset | Paper/Source | Year |
---|---|---|---|---|---|
CENSO (DFT optimized) | 0.85 ± 0.18 (vs energy) | 0.33 | BACE (534 molecules) | GEOM Dataset Paper | 2022 |
DFT Single-Point | 0.69 ± 0.27 | 0.54 | BACE (1,511 molecules) | GEOM Dataset Paper | 2022 |
GFN2-xTB (CREST) | 0.39 ± 0.35 | 1.96 | BACE (1,511 molecules) | GEOM Dataset Paper | 2022 |
Metrics: Higher Spearman correlation and lower MAE indicate better energy ranking accuracy.
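Both ranking metrics can be computed for one molecule from two lists of relative conformer energies. The sketch below uses only the standard library. It assumes the ± values in the table are spreads over per-molecule correlations, which this card does not state, and the energies are made up.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

def mae(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Made-up relative energies (kcal/mol) for four conformers of one molecule
reference = [0.0, 0.4, 1.1, 2.3]   # e.g. a higher-level method
cheap = [0.0, 0.9, 0.6, 3.5]       # e.g. a faster, less accurate method
rho = spearman(reference, cheap)
err = mae(reference, cheap)
```

In this toy case the cheaper method swaps the order of the second and third conformers, which lowers the rank correlation even though the lowest and highest conformers are identified correctly.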
Access and Licensing
- Availability: Publicly available, as described in the Scientific Data publication (Nature Portfolio)
- License: Check original publication for specific terms
- Citation: Axelrod & Gómez-Bombarelli, Sci Data 9, 185 (2022)
References
- Primary Paper: GEOM, energy-annotated molecular conformations for property prediction and molecular generation
- CREST Software: Conformer-Rotamer Ensemble Sampling Tool
- AICures Challenge: MIT AI Cures
- MoleculeNet: Benchmarking platform for molecular machine learning
- Related Datasets: MARCEL Dataset Card - Conformer ensemble learning benchmark
Dataset Status: Active, publicly available
Last Updated: 2022
Contact: Authors via original publication