GEOM: Energy-Annotated Molecular Conformations

Key Contribution

This paper presents the Geometric Ensemble Of Molecules (GEOM) dataset. It addresses the lack of a large-scale dataset linking molecular conformer ensembles to experimental data, which is needed to train more accurate machine learning models for property prediction and molecular generation.

Dataset Information

Format

SMILES, RDKit mol objects, experimental annotations, computational annotations

Size

Type	Count
Conformations	37,000,000+
Molecules	450,000+

Dataset Examples

Example SARS-CoV-2 3CL protease active molecule: CC(=O)Nc1ccc(Oc2ncccn2)cc1 (N-(4-pyrimidin-2-yloxyphenyl)acetamide)

Dataset Subsets

Subset	Count	Description
Drug-like (AICures)	304,466 molecules	AICures COVID-19 challenge molecules
QM9	133,258 molecules	QM9 benchmark dataset
MoleculeNet	16,865 molecules	Molecules from benchmarks for physical chemistry, biophysics, and physiology
BACE (High-quality DFT)	1,511 molecules	A subset of MoleculeNet with DFT energies and experimental data
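
The subset counts can be cross-checked against the headline size with a few lines of arithmetic. This is a minimal sketch that assumes the three top-level subsets do not overlap (the table does not say so) and that BACE is counted inside MoleculeNet, as its description states. The variable names are illustrative only.

```python
# Molecule counts copied from the subset table above.
subsets = {
    "Drug-like (AICures)": 304_466,
    "QM9": 133_258,
    "MoleculeNet": 16_865,
}
bace_molecules = 1_511  # described as a subset of MoleculeNet, so not added again

# Assumption: the three top-level subsets are disjoint.
total_molecules = sum(subsets.values())
bace_fraction = bace_molecules / subsets["MoleculeNet"]

print(f"Total molecules: {total_molecules:,}")
print(f"BACE share of MoleculeNet: {bace_fraction:.1%}")
```

Under that assumption the total is consistent with the "450,000+" molecules reported under Size.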

Results

AICures Property Prediction

Model	G	E	ln(Conformers)
SchNetFeatures	🥇 0.203	🥈 0.113	🥇 0.363
ChemProp	🥈 0.225	🥇 0.110	🥈 0.380
FFNN	0.274	0.119	0.455
KRR	0.289	0.131	0.484
Random Forest	0.406	0.166	0.763
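
The table does not define its target columns. The sketch below assumes that G is the conformational free energy of an ensemble and E its Boltzmann-averaged energy, and shows how such quantities could be computed from a list of conformer energies. The energies are hypothetical, conformer degeneracies are ignored, and the function name `ensemble_statistics` is introduced only for illustration; this is not the authors' code.

```python
import math

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def ensemble_statistics(energies, temperature=298.15):
    """Return (weights, average energy, free energy) for conformer energies in kcal/mol."""
    kt = KB_KCAL_PER_MOL_K * temperature
    e_min = min(energies)
    # Shift by the lowest energy so the exponentials stay numerically stable.
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    partition = sum(factors)
    weights = [f / partition for f in factors]
    avg_energy = sum(w * e for w, e in zip(weights, energies))
    # G = -kT ln(sum_i exp(-E_i / kT)), written with the same energy shift.
    free_energy = e_min - kt * math.log(partition)
    return weights, avg_energy, free_energy

# Hypothetical relative conformer energies (kcal/mol), not taken from GEOM.
energies = [0.0, 0.5, 1.2, 3.0]
weights, avg_e, g = ensemble_statistics(energies)
print([round(w, 3) for w in weights], round(avg_e, 3), round(g, 3))
```

Because the entropy term is non-negative, the free energy computed this way never exceeds the average energy, which gives a quick sanity check on any implementation.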

Technical Notes

Quality Levels

All subsets

Semi-empirical conformational sampling with CREST/GFN2-xTB (accuracy level ~2 kcal/mol MAE).

BACE subset (1,511)

r2SCAN-3c/mTZVPP energies on CREST geometries (sub-kcal/mol accuracy).

BACE subset (534)

Full r2SCAN-3c/mTZVPP optimization and free energies (accuracy level ~0.3 kcal/mol vs CCSD(T)).

Methods - CREST

CREST is the conformer generation software. It evaluates energies with semi-empirical tight-binding DFT (GFN2-xTB) and runs metadynamics in the NVT ensemble, adding a bias to the potential so that the simulation explores conformational space. This step consumed 1.5 million CPU hours.
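
The idea of biasing the potential can be illustrated with a one-dimensional toy model. In this sketch a sum of repulsive Gaussians is centred on previously visited positions, which raises the energy of regions already explored. CREST's bias is commonly described as acting on the RMSD to earlier snapshots; the scalar coordinate, the double-well potential, and the parameters `k` and `alpha` here are simplifying assumptions, not CREST's actual settings.

```python
import math

def bias_energy(x, centers, k=1.0, alpha=1.0):
    """Sum of repulsive Gaussians centred on previously visited positions."""
    return sum(k * math.exp(-alpha * (x - c) ** 2) for c in centers)

def double_well(x):
    """Toy potential with minima at x = -1 and x = +1 and a barrier at x = 0."""
    return (x * x - 1.0) ** 2

# Hypothetical history: the walker has only sampled the left well so far.
visited = [-1.0, -0.9, -1.1]

left = double_well(-1.0) + bias_energy(-1.0, visited, k=0.5, alpha=4.0)
right = double_well(1.0) + bias_energy(1.0, visited, k=0.5, alpha=4.0)
print(f"biased energy, left well:  {left:.3f}")
print(f"biased energy, right well: {right:.3f}")
```

The two wells have equal unbiased energy, but the accumulated bias makes the already-visited well less favourable, which is what pushes the simulation toward new conformations.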

Methods - DFT

CENSO was applied to 534 species in BACE, yielding high-accuracy ensembles for 35% of the species in the subset. Single-point energies were also computed for the remaining species in BACE. The CENSO simulations consumed 781,000 CPU hours, and the single-point energy calculations consumed 1.1 million CPU hours.
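
The figures quoted above can be checked with simple arithmetic. The sketch below uses only numbers stated in this summary; the per-species cost is a derived average and is not a value reported in the source.

```python
censo_species = 534
bace_species = 1_511          # size of the BACE subset
censo_cpu_hours = 781_000
single_point_cpu_hours = 1_100_000

censo_fraction = censo_species / bace_species
# Derived average, not a figure reported in the source.
hours_per_censo_species = censo_cpu_hours / censo_species
dft_cpu_hours = censo_cpu_hours + single_point_cpu_hours

print(f"CENSO coverage of BACE: {censo_fraction:.1%}")
print(f"Average CENSO cost per species: {hours_per_censo_species:,.0f} CPU hours")
print(f"Total DFT cost: {dft_cpu_hours:,} CPU hours")
```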

Dataset Details
Authors	Simon Axelrod, Rafael Gómez-Bombarelli
Paper Title	GEOM, energy-annotated molecular conformations for property prediction and molecular generation
Institutions	Harvard University, MIT
Published In	Scientific Data (Nature Portfolio)
Category	Computational Chemistry
Format	SMILES, RDKit mol objects, experimental annotations, computational annotations
Size	Conformations: 37,000,000+; Molecules: 450,000+
Date	September 2025
Year	2022