MARCEL Dataset Card

MARCEL
Basic Information
Full Name	MoleculAR Conformer Ensemble Learning
Domain	Computational Chemistry
Year	2024
Publication & Access
Paper	ICLR 2024 | arXiv
Dataset	GitHub
Dataset Composition
Total Size	76,651 molecules (Drugs-75K and Kraken) plus 6,787 reactions (EE and BDE)
Drugs-75K (subset of [GEOM](/notes/dataset-card/computational-chemistry/geom))	75,099
Kraken	1,552
EE (Enantiomeric Excess)	872
BDE (Binding Energy)	5,915
Technical Details
Format	Graph representations, 3D coordinates, statistical weights
Observables	Electronic properties, sterimol descriptors, enantiomeric excess
Annotations	DFT-calculated properties, experimental catalysis data
Research Context
Authors	Yanqiao Zhu, Jeehyun Hwang, Brock Anton Stenfors, Yuanqi Du, Olexandr Isayev, Keir Adams, Jatin Chauhan, Connor W. Coley, Yizhou Sun, Zhen Liu, Bozhao Nan, Olaf Wiest, Wei Wang
Institution	UCLA, MIT, CMU, Notre Dame, Cornell

Dataset Summary

MARCEL addresses a specific problem in molecular machine learning: molecules are flexible and adopt many conformations, yet most models are trained on a single conformation per molecule. MARCEL provides four diverse datasets annotated with conformer ensembles, together with evaluation protocols that account for this flexibility.
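To make the idea of an ensemble concrete, the sketch below shows one common way to combine per-conformer values: a Boltzmann-weighted average, where low-energy conformers count more. This is an illustrative assumption, not the benchmark's own code; the function names, the temperature, and the three-conformer example values are all hypothetical.

```python
import math

# Gas constant in kcal/(mol*K); energies below are assumed to be in kcal/mol.
GAS_CONSTANT_KCAL = 0.0019872041


def boltzmann_weights(energies_kcal, temperature_k=298.15):
    """Return normalized Boltzmann weights for a list of conformer energies."""
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [
        math.exp(-(e - e_min) / (GAS_CONSTANT_KCAL * temperature_k))
        for e in energies_kcal
    ]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_average(values, weights):
    """Weighted average of a per-conformer property."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical ensemble: relative energies (kcal/mol) and a made-up property.
energies = [0.0, 0.5, 2.0]
prop = [1.0, 2.0, 4.0]

weights = boltzmann_weights(energies)
avg = ensemble_average(prop, weights)
print(weights, avg)
```

A single-conformer model would effectively use only `prop[0]`; the weighted average instead reflects every conformer in proportion to its estimated population.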

Key Features

Ensemble-focused benchmark: Dedicated evaluation framework for learning from multiple molecular conformations
Diverse chemical coverage: Spans drug-like molecules, organocatalysts, and transition-metal chemistry
Multiple evaluation strategies: Various approaches for incorporating molecular flexibility
High-quality conformers: DFT-level accuracy with efficient Auto3D generation

Dataset Structure

MARCEL Dataset Composition
Name	Type	Count	Conformers	Description	Properties	Tasks
Drugs-75K (subset of GEOM)	Molecules	75,099	558,002	Drug-like molecules with at least 5 rotatable bonds	Electronic properties (IP, EA, χ)	3
Kraken	Molecules	1,552	21,287	Monodentate organophosphorus(III) ligands	Sterimol and Buried Sterimol descriptors for catalysis	4
EE (Enantiomeric Excess)	Reactions	872	28,806	Rh-bound atropisomeric catalysts for asymmetric hydrogenation	Enantiomeric excess prediction	1
BDE (Binding Energy)	Reactions	5,915	114,098	Organometallic catalysts for cross-coupling	Electronic binding energy	1
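The counts in the table above determine how large a typical ensemble is in each subset. The following sketch copies those counts into a small record type and derives the average number of conformers per entry; the `Subset` class and its field names are hypothetical and do not correspond to any loader shipped with the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    name: str
    kind: str        # "molecule" or "reaction"
    items: int       # number of molecules or reactions
    conformers: int  # total conformers across all items
    tasks: int       # number of prediction targets


# Values copied from the composition table.
SUBSETS = [
    Subset("Drugs-75K", "molecule", 75_099, 558_002, 3),
    Subset("Kraken", "molecule", 1_552, 21_287, 4),
    Subset("EE", "reaction", 872, 28_806, 1),
    Subset("BDE", "reaction", 5_915, 114_098, 1),
]


def conformers_per_item(subset):
    """Average ensemble size for one subset."""
    return subset.conformers / subset.items


for s in SUBSETS:
    print(f"{s.name:10s} {conformers_per_item(s):6.2f} conformers per {s.kind}")

total_tasks = sum(s.tasks for s in SUBSETS)
total_molecules = sum(s.items for s in SUBSETS if s.kind == "molecule")
print("tasks:", total_tasks, "molecules:", total_molecules)
```

The averages differ by subset, from roughly 7 conformers per molecule in Drugs-75K to roughly 33 per reaction in EE, which matters when budgeting memory for ensemble models.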

Task Details

The MARCEL benchmark includes nine distinct prediction tasks across its four datasets:

Prediction Tasks in MARCEL
Dataset	Task	Property type	Description
Drugs-75K	Ionization Potential (IP)	Electronic	Minimum energy required to remove an electron from a neutral molecule
Drugs-75K	Electron Affinity (EA)	Electronic	Energy change on adding an electron to a neutral molecule
Drugs-75K	Electronegativity (χ)	Electronic	Tendency of an atom or molecule to attract a bonding pair of electrons
Kraken	Sterimol B₅	Steric	Maximum width of the ligand perpendicular to its primary axis
Kraken	Sterimol L	Steric	Length of the ligand along its primary axis
Kraken	Buried Sterimol B₅	Steric	Sterimol B₅ restricted to the part of the ligand within a metal’s coordination sphere
Kraken	Buried Sterimol L	Steric	Sterimol L restricted to the part of the ligand within a metal’s coordination sphere
EE	Enantiomeric Excess	Catalytic	Selectivity in an asymmetric catalysis reaction
BDE	Binding Energy	Electronic	Energy difference between bound and unbound catalyst complexes
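Several of the targets above follow textbook definitions that are easy to state in code. The sketch below is a minimal illustration under stated assumptions: IP and EA are taken as total-energy differences, electronegativity uses the Mulliken form χ = (IP + EA) / 2 (the card does not say which definition MARCEL uses), and enantiomeric excess is the standard percentage difference between the two enantiomers. All function names and example numbers are hypothetical.

```python
def ionization_potential(e_neutral, e_cation):
    """Energy needed to remove one electron: E(cation) - E(neutral)."""
    return e_cation - e_neutral


def electron_affinity(e_neutral, e_anion):
    """Energy released on adding one electron: E(neutral) - E(anion)."""
    return e_neutral - e_anion


def mulliken_electronegativity(ip, ea):
    """Mulliken electronegativity (assumed definition): mean of IP and EA."""
    return 0.5 * (ip + ea)


def enantiomeric_excess(major, minor):
    """Enantiomeric excess in percent from the amounts of the two enantiomers."""
    total = major + minor
    if total <= 0:
        raise ValueError("amounts must sum to a positive value")
    return 100.0 * (major - minor) / total


# Made-up total energies in hartree for one molecule and its ions.
e_neutral, e_cation, e_anion = -100.00, -99.70, -100.05

ip = ionization_potential(e_neutral, e_cation)
ea = electron_affinity(e_neutral, e_anion)
chi = mulliken_electronegativity(ip, ea)
ee = enantiomeric_excess(major=95.0, minor=5.0)
print(ip, ea, chi, ee)
```

With these conventions a product mixture of 95% of one enantiomer and 5% of the other has an enantiomeric excess of 90%, and a racemic mixture has 0%.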

Use Cases

Primary Applications

Testing conformer ensemble learning methods
Developing molecular representations that account for flexibility
Catalyst design with conformational considerations

Research Applications

Cross-dataset generalization studies
Evaluating conformer sampling strategies
Transfer learning between chemical domains

Quality & Limitations

Strengths

Evaluation protocols designed for ensemble learning
Chemical diversity across organic and organometallic systems
Practical relevance to drug discovery and catalysis

Limitations

Relatively small dataset sizes for some deep learning applications
Limited to computationally accessible conformer sampling
Focus on specific chemical classes may limit broader applicability

Related Datasets: GEOM (source for Drugs-75K subset)
Citation: Zhu, Y., Hwang, J., Adams, K., Liu, Z., Nan, B., Stenfors, B., Du, Y., Chauhan, J., Wiest, O., Isayev, O., Coley, C. W., Sun, Y., & Wang, W. “Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks.” The Twelfth International Conference on Learning Representations (ICLR 2024).
