GEOM Dataset Card

GEOM
Basic Information
Full Name	Geometric Ensemble Of Molecules
Domain	Computational Chemistry
Year	2022
Publication & Access
Paper	DOI
Dataset	Harvard Dataverse
Dataset Composition
Total Size	450,000+ molecules
Drug-like (AICures)	304,466 molecules
QM9	133,258 molecules
MoleculeNet	16,865 molecules
Technical Details
Format	RDKit mol objects, XYZ coordinates
Observables	Conformational energies, COVID-19 assay data
Annotations	Biological activity labels, experimental assays
Research Context
Authors	Simon Axelrod, Rafael Gómez-Bombarelli
Institution	Harvard University, MIT

Dataset Summary

GEOM (Geometric Ensemble Of Molecules) contains 37 million molecular conformations for over 450,000 molecules, with experimental properties and quantum mechanical energies. Unlike traditional datasets that treat molecules as flat 2D graphs, GEOM provides multiple 3D shapes for each molecule, helping machine learning models account for molecular flexibility. The dataset combines conformational sampling with experimental data, including biological assays from the AICures challenge for COVID-19 drug discovery.

Key Features

Conformational Diversity: 37 million 3D shapes for modeling how molecules can flex and bend.
Quality Levels: Three tiers of computational accuracy, from fast methods to high-accuracy DFT calculations.
Experimental Data: Labeled with experimental properties for physical chemistry, biophysics, and biology, including assay results for benchmarking.
Energy Information: Quantum mechanical energies help rank which molecular shapes are most likely.
ML-Ready: Available in accessible formats, including Python pickle files with RDKit mol objects.
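
As a rough illustration of how energies can rank conformers, the sketch below converts relative energies into Boltzmann populations. The temperature of 298.15 K, the example energies, and the function name `boltzmann_weights` are assumptions made for illustration, and conformer degeneracy is ignored; this is not the dataset's own tooling.

```python
import math

R_KCAL = 1.987204e-3      # gas constant in kcal/(mol*K)
TEMPERATURE_K = 298.15    # assumed temperature, not taken from the dataset card


def boltzmann_weights(energies_kcal, temperature=TEMPERATURE_K):
    """Return normalized Boltzmann weights for conformer energies in kcal/mol."""
    kt = R_KCAL * temperature
    e_min = min(energies_kcal)
    # Shift by the minimum energy so the exponentials stay numerically stable.
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


# Hypothetical relative energies (kcal/mol) for four conformers of one molecule.
energies = [1.2, 0.0, 3.0, 0.5]
weights = boltzmann_weights(energies)

# Rank conformers from most to least populated.
ranked = sorted(zip(energies, weights), key=lambda pair: pair[1], reverse=True)
for energy, weight in ranked:
    print(f"{energy:4.1f} kcal/mol -> population {weight:.3f}")
```

Lower-energy conformers receive larger weights, so sorting by weight is equivalent to sorting by energy; the weights additionally indicate how much each shape contributes to the ensemble.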

Dataset Structure

The GEOM dataset is composed of several subsets with varying levels of computational accuracy:

GEOM Dataset Subsets
Name	Description	Count
Drug-like (AICures)	AICures COVID-19 challenge molecules	304,466 molecules
QM9	QM9 benchmark dataset	133,258 molecules
MoleculeNet	Molecules from benchmarks for physical chemistry, biophysics, and physiology	16,865 molecules
BACE (High-quality DFT)	A subset of MoleculeNet with DFT energies and experimental data	1,511 molecules

Quality Levels

Computational Quality Levels in GEOM
Method	Description	Accuracy	Molecules
CREST/GFN2-xTB	Semi-empirical conformational sampling	~2 kcal/mol MAE	All subsets
DFT Single-Point	r2scan-3c/mTZVPP energies on CREST geometries	Sub-kcal/mol	BACE subset (1,511)
DFT Optimized	Full r2scan-3c/mTZVPP optimization and free energies	~0.3 kcal/mol vs CCSD(T)	BACE subset (534)

Structural Diversity

Size Range: Molecules with up to 91 heavy atoms (181 total atoms).
Element Types: Mainly C, H, N, O, F for QM9, with S, P, Cl, Br, I included in the drug-like datasets.
Conformers per Molecule: Varies widely; the AICures subset averages ~103 conformers per molecule, with a maximum of 7,451.
Energy Window: Conformers are included up to 6.0 kcal/mol above the lowest-energy structure.

Use Cases

Primary Applications

Conformer-Aware ML Models: Training models that account for molecular flexibility when predicting properties.
Property Prediction: Improving prediction accuracy by averaging properties over multiple molecular shapes.
Generative Models: Training and testing models that generate realistic 3D molecular structures from graphs.
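
The property-averaging idea above can be sketched as a weighted mean over conformers. The property values, the statistical weights, and the function name `ensemble_average` are hypothetical; in practice the weights would come from the conformer energies.

```python
def ensemble_average(values, weights):
    """Weighted mean of a per-conformer property; weights need not be normalized."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical per-conformer property values and statistical weights
# for three conformers, listed from lowest to highest energy.
per_conformer_property = [1.8, 2.6, 3.1]
conformer_weights = [0.70, 0.20, 0.10]

averaged = ensemble_average(per_conformer_property, conformer_weights)
single_conformer = per_conformer_property[0]  # lowest-energy conformer only

print(f"ensemble average: {averaged:.2f}")
print(f"single conformer: {single_conformer:.2f}")
```

The gap between the two printed numbers is the information a single-conformer model would miss in this toy case.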

Research Applications

Drug Discovery: Studying structure-activity relationships where molecular flexibility matters.
Computational Chemistry: Testing new methods for finding molecular conformations.
Molecular Dynamics: Providing high-quality starting structures and reference energies for simulations.

Quality & Limitations

Strengths

Large Scale: One of the largest available datasets of molecular conformers annotated with experimental properties.
Quality Hierarchy: A multi-tier approach balances computational cost and accuracy, enabling diverse applications.
Experimental Data: Extensive biological and physicochemical data for benchmarking property prediction models.
Standardized Format: Provides easy-to-use RDKit mol objects with consistent graph and geometry information.
Energy Annotations: Quantum mechanical energies allow for robust, physics-based ranking of conformers.

Limitations

Energy Uncertainty: CREST energies have an error of approximately 2 kcal/mol, which can affect the statistical weights of conformers.
Solvation Model: Uses implicit solvation models (C-PCM or ALPB) only; explicit solvent effects are not included.
Size Constraints: Although the dataset includes large molecules, conformer generation for highly flexible molecules was computationally intensive and sometimes did not finish.
Element Coverage: Primarily focused on organic molecules; metals and less common elements are not included.
Sampling Completeness: The underlying CREST method has high recall but may not capture all thermally accessible conformers for every molecule.
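
To give a sense of scale for the energy-uncertainty limitation, the sketch below shows how a 2 kcal/mol shift changes the population of one conformer in a two-conformer system. The 1.0 kcal/mol "true" gap and the temperature of 298.15 K are assumptions chosen for illustration.

```python
import math

KT_KCAL = 1.987204e-3 * 298.15  # kT in kcal/mol at an assumed 298.15 K


def second_conformer_population(gap_kcal):
    """Population of the second conformer when it lies gap_kcal above the first."""
    return 1.0 / (1.0 + math.exp(gap_kcal / KT_KCAL))


true_gap = 1.0       # hypothetical energy gap between two conformers, kcal/mol
energy_error = 2.0   # error size quoted for CREST energies, kcal/mol

populations = {}
for gap in (true_gap - energy_error, true_gap, true_gap + energy_error):
    populations[gap] = second_conformer_population(gap)
    print(f"gap {gap:+.1f} kcal/mol -> population {populations[gap]:.3f}")

# Shifting the gap by 2 kcal/mol rescales the population ratio by exp(2 / kT).
ratio_factor = math.exp(energy_error / KT_KCAL)
print(f"population ratio changes by a factor of about {ratio_factor:.0f}")
```

Because kT is roughly 0.6 kcal/mol at room temperature, an error of this size can turn a minor conformer into the dominant one, or make it effectively vanish.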

Generation and Processing Pipeline

GEOM employs a multi-step pipeline to generate high-quality conformational ensembles:

Conformational Sampling

Initial Structures: Generated from 2D molecular graphs (SMILES).
CREST Sampling: A metadynamics-enhanced conformational search is performed using the semi-empirical GFN2-xTB method to explore the potential energy surface.
Energy Window: Conformers within 6.0 kcal/mol of the lowest-energy structure are retained.
Clustering: Structurally similar conformers are identified and duplicates are removed.
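
The energy-window step can be sketched as a simple filter. This is a minimal illustration, assuming conformers are given as (label, energy) pairs in kcal/mol; the labels, energies, and the function name `filter_energy_window` are hypothetical, and the structural duplicate removal performed in the real pipeline is not reproduced here.

```python
def filter_energy_window(conformers, window_kcal=6.0):
    """Keep conformers within window_kcal of the lowest energy.

    Returns (label, relative_energy) pairs sorted by relative energy.
    """
    if not conformers:
        return []
    e_min = min(energy for _, energy in conformers)
    kept = [
        (label, energy - e_min)
        for label, energy in conformers
        if energy - e_min <= window_kcal
    ]
    return sorted(kept, key=lambda item: item[1])


# Hypothetical (label, energy in kcal/mol) pairs for one molecule.
candidates = [("c1", -120.4), ("c2", -123.0), ("c3", -116.5), ("c4", -118.1)]
retained = filter_energy_window(candidates)

for label, rel_energy in retained:
    print(f"{label}: {rel_energy:.1f} kcal/mol above the minimum")
```

Here `c3` lies 6.5 kcal/mol above the minimum and is dropped, while the other three conformers are kept.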

Energy Refinement

Semi-Empirical: GFN2-xTB energies are calculated for all 37 million conformers.
DFT Single-Point: For the BACE subset, single-point r2scan-3c/mTZVPP energies and other properties are computed on the fixed CREST geometries.
DFT Optimization: For a high-quality subset of 534 BACE molecules, full geometry optimization and free energy calculations are performed using r2scan-3c/mTZVPP via the CENSO protocol.

Experimental Integration

Assay Data: Biological assay results (e.g., antiviral activity) and physicochemical properties from AICures and MoleculeNet are linked to each molecule.
Property Mapping: Molecular graphs and experimental endpoints are mapped to the generated conformational ensembles.

Model Performance

Several machine learning models have been benchmarked on the GEOM dataset for conformer property prediction tasks. The experiments were conducted on 100,000 species randomly sampled from the AICures drug subset, using a 60-20-20 split for training, validation, and testing. Results show mean absolute error (MAE) performance across three prediction tasks:

AICures Property Prediction Benchmarks (MAE; lower is better)
Model	Avg Energy E (kcal/mol)	Free Energy G (kcal/mol)	ln(Conformers) (unitless)	Notes
SchNetFeatures	0.113	0.203	0.363	Best overall performer
ChemProp	0.110	0.225	0.380	Graph neural network approach
FFNN	0.119	0.274	0.455	Feed-Forward Neural Network
KRR	0.131	0.289	0.484	Kernel Ridge Regression
Random Forest	0.166	0.406	0.763	Ensemble method baseline

Key Findings: SchNetFeatures and ChemProp demonstrated the strongest performance across the three prediction tasks, with SchNetFeatures excelling at free energy and conformer count prediction, while ChemProp achieved the best results for average conformational energy prediction.
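
The evaluation protocol described in this section (a random 60-20-20 split and MAE scoring) can be sketched as follows. The seed, the integer species identifiers, and both function names are assumptions for illustration; the actual split used in the benchmark is not reproduced here.

```python
import random


def split_60_20_20(items, seed=0):
    """Randomly split items into 60% train, 20% validation, 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def mean_absolute_error(predicted, reference):
    """Mean absolute error between two equal-length sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical identifiers standing in for 100,000 sampled species.
species_ids = list(range(100_000))
train, val, test = split_60_20_20(species_ids)
print(len(train), len(val), len(test))

# Toy predictions and reference values, only to show the metric.
toy_mae = mean_absolute_error([1.0, 2.0], [1.5, 1.0])
print(f"toy MAE: {toy_mae:.2f}")
```

Fixing the seed keeps the split reproducible, which matters when comparing models on the same held-out species.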

Citation: Axelrod, S. & Gómez-Bombarelli, R. “GEOM, energy-annotated molecular conformations for property prediction and molecular generation” Sci Data 2022, 9, 185.
