GEOM: Energy-Annotated Molecular Conformations
Dataset Details
Authors: Simon Axelrod, Rafael Gómez-Bombarelli
Paper Title: GEOM, energy-annotated molecular conformations for property prediction and molecular generation
Institutions: Harvard University, MIT
Published In: Nature Scientific Data
Category: Computational Chemistry
Format: SMILES (canonicalized), RDKit mol objects, MessagePack files, pickle files
Size: 37,000,000+ conformations; 450,000+ molecules
Date: September 2025
Year: 2022
Links: Dataset · DOI · Paper
Example SARS-CoV-2 3CL protease active molecule from the GEOM dataset: N-(4-pyrimidin-2-yloxyphenyl)acetamide

Key Contribution

GEOM addresses the lack of large-scale data linking molecular conformer ensembles to experimental properties by providing 450k+ molecules with 37M+ conformations, enabling more accurate machine learning models for property prediction and molecular generation.

Overview

The Geometric Ensemble Of Molecules (GEOM) dataset provides energy-annotated molecular conformations generated through systematic computational methods. The dataset includes molecules from drug discovery campaigns (AICures), quantum chemistry benchmarks (QM9), and molecular property prediction benchmarks (MoleculeNet), with conformations sampled using CREST/GFN2-xTB and a subset refined with high-quality DFT calculations.
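Each molecule in the release is stored as a pickle of nested dictionaries. A minimal, stdlib-only sketch of pulling the lowest-energy conformer from one record follows; the key names (`conformers`, `totalenergy`, `rd_mol`) reflect the published schema but should be verified against the specific release you download:

```python
import pickle


def load_record(path):
    """Load one per-molecule pickle file from the GEOM release."""
    with open(path, "rb") as f:
        return pickle.load(f)


def lowest_energy_conformer(record):
    """Return the conformer dict with the smallest total energy.

    Assumes the GEOM pickle schema: a dict with a 'conformers' list,
    each entry carrying a 'totalenergy' value and (in the real data)
    an RDKit molecule under 'rd_mol'.
    """
    return min(record["conformers"], key=lambda c: c["totalenergy"])


# Synthetic record mimicking the schema (no RDKit needed for this demo).
record = {"conformers": [{"totalenergy": -10.0}, {"totalenergy": -10.5}]}
best = lowest_energy_conformer(record)  # -> {"totalenergy": -10.5}
```

In the actual dataset, `best["rd_mol"]` would then give an RDKit mol object carrying the 3D coordinates.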

Strengths

  • Scale: 37M+ conformations across 450k+ molecules, covering drug-like and small molecule chemical spaces
  • Energy annotations: All conformations include semi-empirical energies (GFN2-xTB); BACE subset includes DFT-quality energies
  • Quality levels: Three tiers of computational quality to match different accuracy requirements
  • Benchmark ready: Includes splits and baseline results for property prediction tasks
  • Diverse sources: Combines molecules from drug discovery, quantum chemistry, and biophysics domains

Limitations

  • Computational energies: Even DFT energies are approximations; experimental validation is limited to BACE subset
  • Solvent models: BACE uses implicit solvent (ALPB/C-PCM for water); other subsets use vacuum calculations
  • Semi-empirical accuracy: GFN2-xTB has ~2 kcal/mol MAE vs DFT, which may be insufficient for high-precision applications
  • Coverage gaps: CREST sampling may miss low-energy conformers of highly flexible molecules that would require more advanced sampling methods
  • Computational cost: High-quality DFT subset (BACE) limited to 1,511 molecules due to compute constraints

Technical Notes

Data Generation Pipeline

Initial conformer sampling (RDKit):

  • EmbedMultipleConfs with numConfs=50, pruneRmsThresh=0.01 Å
  • MMFF force field optimization
  • GFN2-xTB optimization of seed conformer
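The RDKit seeding step above can be sketched as follows. The `numConfs` and `pruneRmsThresh` values are those reported in the paper; the use of the ETKDGv3 embedding parameters is an assumption on my part, and the subsequent GFN2-xTB refinement (done externally with xtb) is not shown:

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def seed_conformers(smiles, num_confs=50, prune_rms=0.01):
    """Embed and MMFF-optimize conformers, mirroring GEOM's seeding step.

    numConfs=50 and pruneRmsThresh=0.01 Å follow the paper; ETKDGv3 is
    assumed here. GFN2-xTB optimization of the seed conformer happens
    outside RDKit and is omitted.
    """
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.pruneRmsThresh = prune_rms
    AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, params=params)
    AllChem.MMFFOptimizeMoleculeConfs(mol)  # MMFF force-field relaxation
    return mol


# Example molecule from this card (N-(4-pyrimidin-2-yloxyphenyl)acetamide)
mol = seed_conformers("CC(=O)Nc1ccc(Oc2ncccn2)cc1")
```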

Conformational exploration (CREST):

  • Metadynamics in NVT ensemble with biased potential
  • 12 independent runs per molecule
  • 6.0 kcal/mol energy window for conformer retention
  • Solvent: ALPB (water) for BACE; vacuum for others
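The retention rule above amounts to keeping every conformer within a fixed energy window of the ensemble minimum. A stdlib-only sketch (CREST's real pruning also deduplicates structures by RMSD and rotational constants, which is omitted here):

```python
def within_energy_window(energies, window=6.0):
    """Indices of conformers within `window` kcal/mol of the minimum,
    mirroring CREST's 6.0 kcal/mol retention window. Duplicate removal
    by RMSD, which CREST also applies, is not modeled."""
    e_min = min(energies)
    return [i for i, e in enumerate(energies) if e - e_min <= window]


# Relative energies in kcal/mol
kept = within_energy_window([0.0, 2.5, 5.9, 6.1, 12.0])  # -> [0, 1, 2]
```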

Energy calculation (GFN2-xTB):

  • Semi-empirical tight-binding DFT
  • Accuracy: ~2 kcal/mol MAE vs DFT
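GEOM reports a Boltzmann weight per conformer alongside its GFN2-xTB energy. Given relative conformer energies, the standard weighting can be sketched as below; note that because the weights are exponential in energy, the ~2 kcal/mol MAE of GFN2-xTB can shift populations appreciably:

```python
import math

KB_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Boltzmann populations from relative conformer energies (kcal/mol).

    Subtracting the minimum energy before exponentiating keeps the
    computation numerically stable; the weights sum to 1.
    """
    beta = 1.0 / (KB_KCAL * temperature)
    e_min = min(energies_kcal)
    factors = [math.exp(-beta * (e - e_min)) for e in energies_kcal]
    z = sum(factors)
    return [f / z for f in factors]


weights = boltzmann_weights([0.0, 1.5, 3.0])  # lowest conformer dominates
```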

Quality Levels

| Level | Method | Subset | Accuracy |
|---|---|---|---|
| Standard | CREST/GFN2-xTB | All subsets | ~2 kcal/mol MAE vs DFT |
| DFT Single-Point | r2scan-3c/mTZVPP on CREST geometries | BACE (1,511 molecules) | Sub-kcal/mol |
| DFT Optimized | CENSO full optimization + free energies | BACE (534 molecules) | ~0.3 kcal/mol vs CCSD(T) |

Benchmark Setup

Task: Predict ensemble properties (Gibbs free energy $G$, average energy $E$, number of unique conformers) from molecular structure.

Data: 100,000 species randomly sampled from AICures subset, split 60/20/20 (train/validation/test).

Hyperparameters: Optimized using Hyperopt package for each model/task combination.

Models:

  • SchNetFeatures: 3D SchNet architecture + graph features, trained on highest-probability conformer
  • ChemProp: Message Passing Neural Network (state-of-the-art graph model)
  • FFNN: Feed-forward network on Morgan fingerprints
  • KRR: Kernel Ridge Regression on Morgan fingerprints
  • Random Forest: Random Forest on Morgan fingerprints
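The 60/20/20 split described above can be sketched with the standard library; the exact seed and partition used in the paper are not reproduced here:

```python
import random


def split_60_20_20(items, seed=0):
    """Random 60/20/20 train/validation/test split, as in the GEOM
    benchmark setup (the paper's exact partition is not reproduced)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.6 * n)
    n_val = int(0.2 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# e.g. 100,000 species sampled from the AICures subset
train, val, test = split_60_20_20(range(100_000))
```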

Hardware & Computational Cost

CREST/GFN2-xTB Generation

Total compute: ~15.7 million core hours

AICures subset:

  • 13M core hours on Knights Landing (32-core nodes)
  • 1.2M core hours on Cascade Lake/Skylake (13-core nodes)
  • Average wall time: 2.8 hours/molecule (Knights Landing) or 0.63 hours/molecule (Skylake)

MoleculeNet subset: 1.5M core hours

DFT Calculations (BACE only)

Software: CENSO 1.1.2 + ORCA 5.0.1 (r2scan-3c composite method with the mTZVPP basis set)

Solvent: C-PCM implicit solvation (water)

Hardware: ~54 cores per job

Compute cost:

  • 781,000 CPU hours for CENSO optimizations
  • 1.1M CPU hours for single-point energy calculations

Dataset Information

Format

SMILES (canonicalized), RDKit mol objects, MessagePack files, pickle files

Size

| Type | Count |
|---|---|
| Conformations | 37,000,000+ |
| Molecules | 450,000+ |

Dataset Examples

Example SARS-CoV-2 3CL protease active molecule: CC(=O)Nc1ccc(Oc2ncccn2)cc1 (N-(4-pyrimidin-2-yloxyphenyl)acetamide)

Dataset Subsets

| Subset | Count | Description |
|---|---|---|
| Drug-like (AICures) | 304,466 molecules | Drug-like molecules from the AICures COVID-19 challenge (avg. 44 atoms) |
| QM9 | 133,258 molecules | Small molecules from QM9 (up to 9 heavy atoms) |
| MoleculeNet | 16,865 molecules | Molecules from MoleculeNet benchmarks for physical chemistry, biophysics, and physiology |
| BACE (high-quality DFT) | 1,511 molecules | BACE subset with high-quality DFT energies (r2scan-3c) and experimental inhibition data |

Benchmarks

Gibbs Free Energy Prediction

Predict ensemble Gibbs free energy (G) from molecular structure

Subset: 100k AICures · Split: 60/20/20 train/val/test

| Rank | Model | Description | MAE (kcal/mol) |
|---|---|---|---|
| 1 | SchNetFeatures | 3D SchNet + graph features (trained on highest-probability conformer) | 0.203 |
| 2 | ChemProp | Message passing neural network (graph model) | 0.225 |
| 3 | FFNN | Feed-forward network on Morgan fingerprints | 0.274 |
| 4 | KRR | Kernel ridge regression on Morgan fingerprints | 0.289 |
| 5 | Random Forest | Random forest on Morgan fingerprints | 0.406 |

Average Energy Prediction

Predict ensemble average energy (E) from molecular structure

Subset: 100k AICures · Split: 60/20/20 train/val/test

| Rank | Model | Description | MAE (kcal/mol) |
|---|---|---|---|
| 1 | ChemProp | Message passing neural network (graph model) | 0.110 |
| 2 | SchNetFeatures | 3D SchNet + graph features (trained on highest-probability conformer) | 0.113 |
| 3 | FFNN | Feed-forward network on Morgan fingerprints | 0.119 |
| 4 | KRR | Kernel ridge regression on Morgan fingerprints | 0.131 |
| 5 | Random Forest | Random forest on Morgan fingerprints | 0.166 |

Conformer Count Prediction

Predict ln(number of unique conformers) from molecular structure

Subset: 100k AICures · Split: 60/20/20 train/val/test

| Rank | Model | Description | MAE |
|---|---|---|---|
| 1 | SchNetFeatures | 3D SchNet + graph features (trained on highest-probability conformer) | 0.363 |
| 2 | ChemProp | Message passing neural network (graph model) | 0.380 |
| 3 | FFNN | Feed-forward network on Morgan fingerprints | 0.455 |
| 4 | KRR | Kernel ridge regression on Morgan fingerprints | 0.484 |
| 5 | Random Forest | Random forest on Morgan fingerprints | 0.763 |

Citation

If you use this dataset, please cite:

Axelrod, S. & Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci. Data 9, 185 (2022). https://doi.org/10.1038/s41597-022-01288-4