MARCEL
Basic Information
Full Name: MoleculAR Conformer Ensemble Learning
Domain: Computational Chemistry
Year: 2024
Publication & Access
Paper: ICLR 2024 | arXiv
Dataset: GitHub
Dataset Composition
Total Size: 76,651 molecules (Drugs-75K and Kraken) plus 6,787 reactions (EE and BDE)
Drugs-75K (subset of [GEOM](/notes/dataset-card/computational-chemistry/geom)): 75,099
Kraken: 1,552
EE (Enantiomeric Excess): 872
BDE (Binding Energy): 5,915
Technical Details
Format: Graph representations, 3D coordinates, statistical weights
Observables: Electronic properties, Sterimol descriptors, enantiomeric excess
Annotations: DFT-calculated properties, experimental catalysis data
Research Context
Authors: Yanqiao Zhu, Jeehyun Hwang, Brock Anton Stenfors, Yuanqi Du, Olexandr Isayev, Keir Adams, Jatin Chauhan, Connor W. Coley, Yizhou Sun, Zhen Liu, Bozhao Nan, Olaf Wiest, Wei Wang
Institution: UCLA, MIT, CMU, Notre Dame, Cornell

Dataset Summary

MARCEL addresses a specific problem in molecular machine learning: how to train models on molecules that are flexible and populate many conformations rather than one fixed 3D structure. Where traditional approaches represent each molecule by a single conformation, MARCEL provides four datasets (Drugs-75K, Kraken, EE, and BDE) that pair molecules or reactions with conformer ensembles, along with evaluation protocols built around that flexibility.

Key Features

  • Ensemble-focused benchmark: Dedicated evaluation framework for learning from multiple molecular conformations
  • Diverse chemical coverage: Spans drug-like molecules, organocatalysts, and transition-metal chemistry
  • Multiple evaluation strategies: Various approaches for incorporating molecular flexibility
  • High-quality conformers: DFT-level accuracy with efficient Auto3D generation
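As a rough illustration of what learning from multiple conformations can look like, the sketch below pools per-conformer embeddings into a single molecule-level vector. It is not taken from the MARCEL codebase: the function name `pool_ensemble`, the toy embeddings, and the choice of mean or weighted pooling are all assumptions made for illustration. The key property is that the result does not depend on the order in which conformers are listed.

```python
import numpy as np


def pool_ensemble(conformer_embeddings, weights=None):
    """Aggregate per-conformer embeddings into one molecule-level vector.

    Hypothetical helper: uses a plain mean when no weights are given,
    otherwise a weighted mean with the weights normalized to sum to 1.
    """
    emb = np.asarray(conformer_embeddings, dtype=float)  # (n_conformers, dim)
    if weights is None:
        return emb.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights form a distribution
    return w @ emb


# Toy embeddings for three conformers of one molecule (made-up values).
emb = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])

mean_vec = pool_ensemble(emb)
weighted_vec = pool_ensemble(emb, weights=[0.5, 0.25, 0.25])
```

Mean pooling treats every conformer equally, while the weighted variant lets more populated conformers contribute more. Either output can be passed to an ordinary regression head.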

Dataset Structure

MARCEL Dataset Composition
| Name | Type | Count | Conformers | Tasks | Properties | Description |
|---|---|---|---|---|---|---|
| Drugs-75K (subset of GEOM) | Molecules | 75,099 | 558,002 | 3 | Electronic properties (IP, EA, χ) | Drug-like molecules with >= 5 rotatable bonds |
| Kraken | Molecules | 1,552 | 21,287 | 4 | Sterimol and Buried Sterimol descriptors for catalysis | Monodentate organophosphorus(III) ligands |
| EE (Enantiomeric Excess) | Reactions | 872 | 28,806 | 1 | Enantiomeric excess prediction | Rh-bound atropisomeric catalysts for asymmetric hydrogenation |
| BDE (Binding Energy) | Reactions | 5,915 | 114,098 | 1 | Electronic binding energy | Organometallic catalysts for cross-coupling |
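The Technical Details above list statistical weights alongside the 3D coordinates of each conformer. Assuming these are Boltzmann weights derived from relative conformer energies (an assumption here, not something this card confirms), a minimal sketch of computing them and using them for an ensemble-averaged property could look like the following. The function names, the kcal/mol units, and the example energies are hypothetical.

```python
import numpy as np

# Gas constant in kcal/(mol*K); assumes energies are given in kcal/mol.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Return normalized Boltzmann weights for a set of conformer energies."""
    e = np.asarray(energies_kcal, dtype=float)
    rel = e - e.min()  # shift so the lowest-energy conformer is at zero
    factors = np.exp(-rel / (R_KCAL_PER_MOL_K * temperature))
    return factors / factors.sum()


def ensemble_average(values, weights):
    """Weighted average of a per-conformer property."""
    return float(np.dot(weights, values))


# Made-up relative energies (kcal/mol) and per-conformer property values.
energies = [0.0, 0.6, 1.5]
per_conformer_property = [4.2, 4.8, 5.1]

w = boltzmann_weights(energies)
avg = ensemble_average(per_conformer_property, w)
```

Shifting by the minimum energy before exponentiating keeps the computation numerically stable and does not change the normalized weights.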

Task Details

The MARCEL benchmark includes nine prediction tasks across its four datasets: three on Drugs-75K, four on Kraken, and one each on EE and BDE.

Prediction Tasks in MARCEL
| Dataset | Task | Property type | Description |
|---|---|---|---|
| Drugs-75K | Ionization Potential (IP) | Electronic | Minimum energy to remove an electron from a neutral molecule |
| Drugs-75K | Electron Affinity (EA) | Electronic | Energy change from adding an electron to a neutral molecule |
| Drugs-75K | Electronegativity (χ) | Electronic | Tendency of an atom or molecule to attract a bonding pair of electrons |
| Kraken | Sterimol B₅ | Steric | Steric descriptor quantifying ligand size: the maximum width of the ligand perpendicular to its primary axis |
| Kraken | Sterimol L | Steric | Steric descriptor quantifying ligand size: the length of the ligand along its primary axis |
| Kraken | Buried Sterimol B₅ | Steric | B₅ computed only for the part of the ligand within a metal’s coordination sphere |
| Kraken | Buried Sterimol L | Steric | L computed only for the part of the ligand within a metal’s coordination sphere |
| EE | Enantiomeric Excess | Catalytic | Selectivity in an asymmetric catalysis reaction |
| BDE | Binding Energy | Electronic | Energy difference between bound and unbound catalyst complexes |
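Several of these targets have simple textbook definitions, sketched below to make the quantities concrete. These are general definitions and not necessarily how the MARCEL labels were computed: the Mulliken form of electronegativity, the percentage form of enantiomeric excess, and the sign convention for binding energy are all assumptions, and the function names are hypothetical.

```python
def mulliken_electronegativity(ip, ea):
    """Mulliken electronegativity: the mean of IP and EA (same energy units)."""
    return 0.5 * (ip + ea)


def enantiomeric_excess(major, minor):
    """Enantiomeric excess in percent from amounts of the two enantiomers."""
    total = major + minor
    if total <= 0:
        raise ValueError("total amount of enantiomers must be positive")
    return 100.0 * (major - minor) / total


def binding_energy(e_bound, e_unbound):
    """Energy of the bound complex minus that of the unbound one.

    Assumed sign convention: negative values mean binding is favorable.
    """
    return e_bound - e_unbound


# Made-up inputs for illustration only.
chi = mulliken_electronegativity(ip=8.0, ea=2.0)
ee = enantiomeric_excess(major=95.0, minor=5.0)
e_bind = binding_energy(e_bound=-10.5, e_unbound=-8.0)
```

A 95:5 mixture of enantiomers corresponds to 90% ee under this definition, which is why ee is a natural regression target for selectivity.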

Use Cases

Primary Applications

  • Testing conformer ensemble learning methods
  • Developing molecular representations that account for flexibility
  • Catalyst design with conformational considerations

Research Applications

  • Cross-dataset generalization studies
  • Evaluating conformer sampling strategies
  • Transfer learning between chemical domains

Quality & Limitations

Strengths

  • Evaluation protocols designed for ensemble learning
  • Chemical diversity across organic and organometallic systems
  • Practical relevance to drug discovery and catalysis

Limitations

  • Some datasets are small for deep learning: Kraken has 1,552 molecules and EE has 872 reactions
  • Conformer ensembles are limited to what computational sampling can reach, so they may not cover the full conformational space
  • Focus on specific chemical classes may limit broader applicability

Related Datasets: GEOM (source for Drugs-75K subset)
Citation: Zhu, Y., Hwang, J., Adams, K., Liu, Z., Nan, B., Stenfors, B., Du, Y., Chauhan, J., Wiest, O., Isayev, O., Coley, C. W., Sun, Y., & Wang, W. (2024). “Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks.” The Twelfth International Conference on Learning Representations (ICLR 2024).