GEOM: Energy-Annotated Molecular Conformations
Dataset Details
Authors: Simon Axelrod, Rafael Gómez-Bombarelli
Paper Title: GEOM, energy-annotated molecular conformations for property prediction and molecular generation
Institutions: Harvard University, MIT
Published In: Nature Scientific Data
Category: Computational Chemistry
Format: SMILES, RDKit mol objects, experimental annotations, computational annotations
Size: 37,000,000+ conformations; 450,000+ molecules
Date: September 2025
Year: 2022
Links: Dataset, DOI, Paper
GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide
Example SARS-CoV-2 3CL protease active molecule from the GEOM dataset, demonstrating energy-annotated molecular conformations

Key Contribution

This paper presents the Geometric Ensemble Of Molecules (GEOM) dataset, which addresses the lack of a large-scale dataset linking molecular conformer ensembles to experimental data. Data of this kind is needed to train more accurate machine learning models.

Dataset Information

Format

SMILES, RDKit mol objects, experimental annotations, computational annotations

Size

Type            Count
Conformations   37,000,000+
Molecules       450,000+

Dataset Examples

Example SARS-CoV-2 3CL protease active molecule: CC(=O)Nc1ccc(Oc2ncccn2)cc1 (N-(4-pyrimidin-2-yloxyphenyl)acetamide)

Dataset Subsets

Subset                    Count              Description
Drug-like (AICures)       304,466 molecules  AICures COVID-19 challenge molecules
QM9                       133,258 molecules  QM9 benchmark dataset
MoleculeNet               16,865 molecules   Molecules from benchmarks for physical chemistry, biophysics, and physiology
BACE (High-quality DFT)   1,511 molecules    A subset of MoleculeNet with DFT energies and experimental data
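
As a quick consistency check on the counts above, the subset sizes can be summed and compared with the headline figures. This is a minimal sketch that assumes the three top-level subsets do not overlap (the source does not say so explicitly) and treats BACE as already contained in MoleculeNet, as its description states. The variable names are illustrative only.

```python
# Molecule counts per subset, copied from the table above.
subset_molecules = {
    "Drug-like (AICures)": 304_466,
    "QM9": 133_258,
    "MoleculeNet": 16_865,
}

# BACE is described as a subset of MoleculeNet, so it is not added again.
# Assumption: the three subsets above are disjoint.
total_molecules = sum(subset_molecules.values())

# Lower bound on the conformer count quoted in the Size table ("37,000,000+").
min_conformers = 37_000_000

# Rough lower bound on the average number of conformers per molecule.
avg_conformers_per_molecule = min_conformers / total_molecules

print(f"Total molecules: {total_molecules:,}")
print(f"Average conformers per molecule: at least {avg_conformers_per_molecule:.1f}")
```

Under these assumptions the total comes to 454,589 molecules, consistent with the "450,000+" figure, and each molecule has on the order of 80 conformers on average.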

Results

AICures Property Prediction

Model            G          E          ln(Conformers)
SchNetFeatures   🥇 0.203   🥈 0.113   🥇 0.363
ChemProp         🥈 0.225   🥇 0.110   🥈 0.380
FFNN             0.274      0.119      0.455
KRR              0.289      0.131      0.484
Random Forest    0.406      0.166      0.763
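
The targets in this table are properties of a whole conformer ensemble, not of a single geometry. The sketch below shows one common way such quantities can be computed from a list of conformer energies using Boltzmann statistics. It assumes that G denotes a conformational free energy and E a Boltzmann-averaged energy, uses a temperature of 298.15 K, and ignores conformer degeneracies. These are assumptions for illustration and not necessarily the exact definitions used by the authors. The function name and the energies are hypothetical, not values from the dataset.

```python
import math

# Boltzmann constant in kcal/(mol*K).
K_B_KCAL = 0.0019872041


def ensemble_properties(energies_kcal, temperature=298.15):
    """Return (weights, mean_energy, free_energy) for conformer energies in kcal/mol.

    Hypothetical helper: assumes non-degenerate conformers.
    """
    kt = K_B_KCAL * temperature
    e_min = min(energies_kcal)
    # Shift by the lowest energy so the exponentials stay well-scaled.
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    z = sum(factors)
    # Boltzmann weight of each conformer.
    weights = [f / z for f in factors]
    # Ensemble-average energy: sum_i p_i * E_i.
    mean_energy = sum(w * e for w, e in zip(weights, energies_kcal))
    # Free energy: -kT * ln(sum_i exp(-E_i / kT)), written with the shift applied.
    free_energy = e_min - kt * math.log(z)
    return weights, mean_energy, free_energy


# Made-up relative energies for four conformers (kcal/mol).
energies = [0.0, 0.5, 1.2, 3.0]
weights, mean_e, g = ensemble_properties(energies)
print([round(w, 3) for w in weights], round(mean_e, 3), round(g, 3))
```

The free energy is never higher than the lowest conformer energy, and the gap between the average energy and the free energy is the entropic contribution of having several accessible conformers.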

Technical Notes

Quality Levels

All subsets

Semi-empirical conformational sampling with CREST/GFN2-xTB (accuracy level ~2 kcal/mol MAE).

BACE subset (1,511)

r2SCAN-3c/mTZVPP energies on CREST geometries (sub-kcal/mol accuracy).

BACE subset (534)

Full r2SCAN-3c/mTZVPP optimization and free energies (accuracy level ~0.3 kcal/mol vs CCSD(T)).

Methods - CREST

CREST is conformer generation software that evaluates energies with semi-empirical tight-binding DFT (GFN2-xTB). It runs metadynamics in the NVT ensemble, biasing the potential to drive exploration of conformational space. This step consumed 1.5 million CPU hours.

Methods - DFT

CENSO was applied to 534 species in BACE, yielding high-accuracy ensembles for 35% of the 1,511 species in the subset. Single-point energies were also computed for the remaining species in BACE. The CENSO simulations consumed 781,000 CPU hours, and the single-point energy calculations consumed 1.1 million CPU hours.
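
The figures quoted in the two Methods sections can be cross-checked with a few lines of arithmetic. This sketch assumes that the three CPU-hour budgets (CREST, CENSO, single-point energies) are separate and can be added. The source lists them individually without giving a total. Variable names are illustrative.

```python
# Species counts for the BACE subset, from the Quality Levels section.
bace_total = 1_511
censo_species = 534

# Fraction of BACE species with full CENSO ensembles.
censo_fraction = censo_species / bace_total

# CPU hours reported for each stage.
# Assumption: the three budgets do not overlap.
cpu_hours = {
    "CREST sampling": 1_500_000,
    "CENSO refinement": 781_000,
    "DFT single points": 1_100_000,
}
total_cpu_hours = sum(cpu_hours.values())

print(f"CENSO coverage of BACE: {censo_fraction:.1%}")
print(f"Total CPU hours: {total_cpu_hours:,}")
```

The coverage comes out at about 35%, matching the stated figure, and the combined compute under this assumption is roughly 3.4 million CPU hours.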