Key Contribution
GEOM addresses the lack of large-scale data linking molecular conformer ensembles to experimental properties by providing 450k+ molecules with 37M+ conformations, enabling more accurate machine learning models for property prediction and molecular generation.
Overview
The Geometric Ensemble Of Molecules (GEOM) dataset provides energy-annotated molecular conformations generated through systematic computational methods. The dataset includes molecules from drug discovery campaigns (AICures), quantum chemistry benchmarks (QM9), and molecular property prediction benchmarks (MoleculeNet), with conformations sampled using CREST/GFN2-xTB and a subset refined with high-quality DFT calculations.
Strengths
- Scale: 37M+ conformations across 450k+ molecules, covering drug-like and small molecule chemical spaces
- Energy annotations: All conformations include semi-empirical energies (GFN2-xTB); BACE subset includes DFT-quality energies
- Quality levels: Three tiers of computational quality to match different accuracy requirements
- Benchmark ready: Includes splits and baseline results for property prediction tasks
- Diverse sources: Combines molecules from drug discovery, quantum chemistry, and biophysics domains
Limitations
- Computational energies: Even DFT energies are approximations; experimental validation is limited to the BACE subset
- Solvent models: BACE uses implicit solvent (ALPB/C-PCM for water); other subsets use vacuum calculations
- Semi-empirical accuracy: GFN2-xTB has ~2 kcal/mol MAE vs DFT, which may be insufficient for high-precision applications
- Coverage gaps: Conformations may be missing for highly flexible molecules or for molecules that require more advanced sampling methods
- Computational cost: High-quality DFT subset (BACE) limited to 1,511 molecules due to compute constraints
Technical Notes
Data Generation Pipeline
Initial conformer sampling (RDKit):
- `EmbedMultipleConfs` with `numConfs=50`, `pruneRmsThresh=0.01` Å
- MMFF force field optimization
- GFN2-xTB optimization of seed conformer
Conformational exploration (CREST):
- Metadynamics in the NVT ensemble with a bias potential
- 12 independent runs per molecule
- 6.0 kcal/mol energy window for conformer retention
- Solvent: ALPB (water) for BACE; vacuum for others
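The energy-window criterion above can be illustrated with a minimal sketch. It assumes conformer energies are already available in kcal/mol; the function name and the example energies are hypothetical, and this is not CREST's own implementation, which also applies further checks such as duplicate removal.

```python
def filter_by_energy_window(energies_kcal, window_kcal=6.0):
    """Return indices of conformers within `window_kcal` of the lowest energy.

    Illustrative only: the name and signature are assumptions,
    not part of CREST or the GEOM tooling.
    """
    e_min = min(energies_kcal)
    return [i for i, e in enumerate(energies_kcal) if e - e_min <= window_kcal]


# Hypothetical absolute energies (kcal/mol) for five conformers.
energies = [-120.0, -118.5, -113.5, -115.0, -119.2]

# Conformer 2 lies 6.5 kcal/mol above the minimum and is dropped.
kept = filter_by_energy_window(energies)
```

Here `kept` holds the indices of the conformers that survive the 6.0 kcal/mol cutoff relative to the lowest-energy structure.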
Energy calculation (GFN2-xTB):
- Semi-empirical tight-binding DFT
- Accuracy: ~2 kcal/mol MAE vs DFT
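To make the accuracy figure concrete, the sketch below shows one way a mean absolute error between two sets of conformer energies could be computed. It assumes both methods are evaluated on the same conformers and that energies are taken relative to the first conformer; the exact protocol behind the ~2 kcal/mol figure is not given here, and all numbers are made up for illustration.

```python
def relative_energies(energies):
    """Shift energies so the first conformer is the zero reference."""
    ref = energies[0]
    return [e - ref for e in energies]


def mean_absolute_error(predicted, reference):
    """Mean absolute difference between two equal-length energy lists."""
    if len(predicted) != len(reference):
        raise ValueError("energy lists must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical relative energies (kcal/mol) for the same four conformers.
xtb_energies = [0.0, 1.2, 3.5, 4.1]
dft_energies = [0.0, 2.0, 2.5, 6.0]

mae = mean_absolute_error(
    relative_energies(xtb_energies), relative_energies(dft_energies)
)
```

Because the reference conformer contributes a zero error by construction, some protocols exclude it from the average; that choice is left open in this sketch.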
Quality Levels
| Level | Method | Subset | Accuracy |
|---|---|---|---|
| Standard | CREST/GFN2-xTB | All subsets | ~2 kcal/mol MAE vs DFT |
| DFT Single-Point | r2scan-3c/mTZVPP on CREST geometries | BACE (1,511 molecules) | Sub-kcal/mol |
| DFT Optimized | CENSO full optimization + free energies | BACE (534 molecules) | ~0.3 kcal/mol vs CCSD(T) |
Benchmark Setup
Task: Predict ensemble properties (Gibbs free energy $G$, average energy $E$, number of unique conformers) from molecular structure.
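As a rough illustration of how such ensemble properties can be derived from conformer energies, the sketch below assumes Boltzmann statistics at 298.15 K, takes the free energy as $G = -TS$ with $S$ the conformational entropy of the Boltzmann weights, and ignores conformer degeneracies. These definitions and all function names are assumptions made for this example; the exact target definitions should be checked against the dataset documentation.

```python
import math

K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Normalized Boltzmann populations for a list of conformer energies."""
    kt = K_B_KCAL * temperature
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    z = sum(factors)
    return [f / z for f in factors]


def ensemble_properties(energies_kcal, temperature=298.15):
    """Average energy, conformational free energy, and conformer count.

    Assumed definitions: <E> = sum(p_i * E_i) relative to the minimum,
    S = -k_B * sum(p_i * ln p_i), and G = -T * S.
    """
    weights = boltzmann_weights(energies_kcal, temperature)
    e_min = min(energies_kcal)
    avg_energy = sum(w * (e - e_min) for w, e in zip(weights, energies_kcal))
    entropy = -K_B_KCAL * sum(w * math.log(w) for w in weights if w > 0)
    return {
        "average_energy": avg_energy,
        "free_energy": -temperature * entropy,
        "n_conformers": len(energies_kcal),
    }


# Hypothetical relative energies (kcal/mol) for a four-conformer ensemble.
props = ensemble_properties([0.0, 0.5, 1.2, 3.0])
```

Under these assumptions, a molecule with many low-lying conformers has a more negative free energy than one dominated by a single structure.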
Data: 100,000 species randomly sampled from AICures subset, split 60/20/20 (train/validation/test).
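A random 60/20/20 split of this kind can be sketched as follows. The seed, the function name, and the use of integer indices in place of real species identifiers are assumptions for illustration; the split files shipped with the dataset should be used to reproduce the reported baselines.

```python
import random


def split_indices(n_items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle indices and cut them into train/validation/test partitions."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    n_train = round(fractions[0] * n_items)
    n_val = round(fractions[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 100,000 sampled species, represented here by integer indices.
train_idx, val_idx, test_idx = split_indices(100_000)
```

Assigning the remainder to the test set guarantees that every species lands in exactly one partition even when the fractions do not divide the sample size evenly.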
Hyperparameters: Optimized using Hyperopt package for each model/task combination.
Models:
- SchNetFeatures: 3D SchNet architecture + graph features, trained on highest-probability conformer
- ChemProp: Message Passing Neural Network (state-of-the-art graph model)
- FFNN: Feed-forward network on Morgan fingerprints
- KRR: Kernel Ridge Regression on Morgan fingerprints
- Random Forest: Random Forest on Morgan fingerprints
Hardware & Computational Cost
CREST/GFN2-xTB Generation
Total compute: ~15.7 million core hours
AICures subset:
- 13M core hours on Knights Landing (32-core nodes)
- 1.2M core hours on Cascade Lake/Sky Lake (13-core nodes)
- Average wall time: 2.8 hours/molecule (KNL) or 0.63 hours/molecule (Sky Lake)
MoleculeNet subset: 1.5M core hours
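The breakdown above can be checked with a few lines of arithmetic. The per-molecule estimate additionally assumes that each molecule occupies all of the listed cores for its full wall time, which the figures above suggest but do not state outright.

```python
# Core hours reported above for each part of the CREST/GFN2-xTB generation.
core_hours = {
    "AICures (Knights Landing)": 13.0e6,
    "AICures (Cascade Lake/Sky Lake)": 1.2e6,
    "MoleculeNet": 1.5e6,
}

# The three parts add up to the quoted total of ~15.7 million core hours.
total_core_hours = sum(core_hours.values())

# Assumption: one molecule uses every listed core for its whole wall time.
core_hours_per_molecule = {
    "Knights Landing": 2.8 * 32,  # hours/molecule * cores
    "Sky Lake": 0.63 * 13,
}
```

Under that assumption, a single molecule costs roughly an order of magnitude more core hours on Knights Landing than on Sky Lake.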
DFT Calculations (BACE only)
Software: CENSO 1.1.2 + ORCA 5.0.1 (r2scan-3c composite method with the mTZVPP basis set)
Solvent: C-PCM implicit solvation (water)
Hardware: ~54 cores per job
Compute cost:
- 781,000 CPU hours for CENSO optimizations
- 1.1M CPU hours for single-point energy calculations

