MARCEL | |
---|---|
Basic Information | |
Full Name | MoleculAR Conformer Ensemble Learning |
Domain | Computational Chemistry |
Year | 2024 |
Publication & Access | |
Paper | ICLR 2024, arXiv |
Dataset | GitHub |
Dataset Composition | |
Total Size | 76,651 molecules and 6,787 reactions |
Drugs-75K (subset of [GEOM](/notes/dataset-card/computational-chemistry/geom)) | 75,099 |
Kraken | 1,552 |
EE (Enantiomeric Excess) | 872 |
BDE (Binding Energy) | 5,915 |
Technical Details | |
Format | Graph representations, 3D coordinates, statistical weights |
Observables | Electronic properties, sterimol descriptors, enantiomeric excess |
Annotations | DFT-calculated properties, experimental catalysis data |
Research Context | |
Authors | Yanqiao Zhu, Jeehyun Hwang, Keir Adams, Zhen Liu, Bozhao Nan, Brock Anton Stenfors, Yuanqi Du, Jatin Chauhan, Olaf Wiest, Olexandr Isayev, Connor W. Coley, Yizhou Sun, Wei Wang |
Institution | UCLA, MIT, CMU, Notre Dame, Cornell |
Dataset Summary
MARCEL addresses a specific problem in molecular machine learning: how to train models that account for molecular flexibility. Most approaches represent each molecule by a single conformation. MARCEL instead provides four datasets in which every molecule or reaction comes with an ensemble of conformers, together with evaluation protocols built around those ensembles.
Key Features
- Ensemble-focused benchmark: Dedicated evaluation framework for learning from multiple molecular conformations
- Diverse chemical coverage: Spans drug-like molecules, organocatalysts, and transition-metal chemistry
- Multiple evaluation strategies: Various approaches for incorporating molecular flexibility
- High-quality conformers: DFT-level accuracy with efficient Auto3D generation
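The card lists statistical weights alongside the 3D coordinates, and a common way to use such weights is to average a per-conformer property over the ensemble. The following is a minimal sketch of that idea, assuming the weights are Boltzmann weights derived from relative conformer energies in kcal/mol at room temperature. The function names, energies, and property values are hypothetical and are not taken from the MARCEL code.

```python
import math

# Gas constant in kcal/(mol*K); energies are assumed to be in kcal/mol.
R_KCAL = 0.0019872041


def boltzmann_weights(energies, temperature=298.15):
    """Return normalized Boltzmann weights for a list of conformer energies."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_average(values, weights):
    """Weighted average of a per-conformer property."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical three-conformer ensemble: relative energies and one property each.
energies = [0.0, 0.5, 2.0]
prop = [6.1, 6.4, 7.0]
weights = boltzmann_weights(energies)
avg = ensemble_average(prop, weights)
print(weights, avg)
```

Low-energy conformers dominate the average, so a single-conformer model that only sees the global minimum approximates this value well only when the higher-energy conformers carry little weight.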
Dataset Structure
Name | Type | Entries | Conformers | Description | Properties | Tasks |
---|---|---|---|---|---|---|
Drugs-75K (subset of GEOM) | Molecules | 75,099 | 558,002 | Drug-like molecules with at least 5 rotatable bonds | Electronic properties (IP, EA, χ) | 3 |
Kraken | Molecules | 1,552 | 21,287 | Monodentate organophosphorus(III) ligands | Sterimol and Buried Sterimol descriptors for catalysis | 4 |
EE (Enantiomeric Excess) | Reactions | 872 | 28,806 | Rh-bound atropisomeric catalysts for asymmetric hydrogenation | Enantiomeric excess prediction | 1 |
BDE (Binding Energy) | Reactions | 5,915 | 114,098 | Organometallic catalysts for cross-coupling | Electronic binding energy | 1 |
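To make the ensemble sizes concrete, the short calculation below divides each dataset's conformer count by its number of molecules or reactions, using only the figures in the table above. The variable and function names are illustrative.

```python
# (conformers, entries) copied from the Dataset Structure table.
datasets = {
    "Drugs-75K": (558_002, 75_099),
    "Kraken": (21_287, 1_552),
    "EE": (28_806, 872),
    "BDE": (114_098, 5_915),
}


def conformers_per_entry(conformers, entries):
    """Average ensemble size for one dataset."""
    return conformers / entries


ratios = {name: conformers_per_entry(c, n) for name, (c, n) in datasets.items()}
total_conformers = sum(c for c, _ in datasets.values())
total_entries = sum(n for _, n in datasets.values())

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} conformers per entry")
# Drugs-75K: 7.4 conformers per entry
# Kraken: 13.7 conformers per entry
# EE: 33.0 conformers per entry
# BDE: 19.3 conformers per entry
print(total_entries, total_conformers)  # 83438 722193
```

EE is the smallest dataset by entries but has the largest ensembles on average, while Drugs-75K is the largest dataset with the smallest ensembles.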
Task Details
The MARCEL benchmark includes nine distinct prediction tasks across its four datasets:
Dataset | Task | Property type | Description |
---|---|---|---|
Drugs-75K | Ionization Potential (IP) | Electronic | Minimum energy to remove an electron from a neutral molecule |
Drugs-75K | Electron Affinity (EA) | Electronic | Energy change from adding an electron to a neutral molecule |
Drugs-75K | Electronegativity (χ) | Electronic | Tendency of an atom or molecule to attract a bonding pair of electrons |
Kraken | Sterimol B₅ | Steric | Maximum width of the ligand perpendicular to its primary axis |
Kraken | Sterimol L | Steric | Length of the ligand along its primary axis |
Kraken | Buried Sterimol B₅ | Steric | Sterimol B₅ restricted to the part of the ligand within a metal’s coordination sphere |
Kraken | Buried Sterimol L | Steric | Sterimol L restricted to the part of the ligand within a metal’s coordination sphere |
EE | Enantiomeric Excess | Catalytic | Selectivity in an asymmetric catalysis reaction |
BDE | Binding Energy | Electronic | Energy difference between bound and unbound catalyst complexes |
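Two of the targets in the table above have simple textbook definitions that can be written out directly. The sketch below uses the Mulliken convention for electronegativity, χ = (IP + EA) / 2, and the standard definition of enantiomeric excess from the amounts of the two enantiomers. Both are assumptions for illustration: the card does not state which conventions the benchmark labels follow, and EA is taken here as positive when the anion is more stable than the neutral molecule. The function names are hypothetical.

```python
def mulliken_electronegativity(ip, ea):
    """Mulliken electronegativity: the mean of IP and EA (same energy unit)."""
    return 0.5 * (ip + ea)


def enantiomeric_excess(major, minor):
    """Enantiomeric excess in percent from the amounts of two enantiomers."""
    total = major + minor
    if total <= 0:
        raise ValueError("total amount of product must be positive")
    return 100.0 * abs(major - minor) / total


# Hypothetical values: IP = 8.0 eV, EA = 2.0 eV, and a 95:5 product mixture.
chi = mulliken_electronegativity(8.0, 2.0)
ee = enantiomeric_excess(95.0, 5.0)
print(chi, ee)  # 5.0 90.0
```

A racemic mixture gives an enantiomeric excess of 0 percent and a single pure enantiomer gives 100 percent, which is the range the EE task predicts over under this definition.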
Use Cases
Primary Applications
- Testing conformer ensemble learning methods
- Developing molecular representations that account for flexibility
- Catalyst design with conformational considerations
Research Applications
- Cross-dataset generalization studies
- Evaluating conformer sampling strategies
- Transfer learning between chemical domains
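Comparing conformer sampling strategies usually comes down to choosing which members of an ensemble a model gets to see. A minimal sketch of two such strategies follows, assuming each conformer is a record with an identifier and a relative energy. The record layout, function names, and values are hypothetical and do not reflect the MARCEL data format.

```python
import random


def lowest_energy_subset(conformers, k):
    """Keep the k conformers with the lowest energy."""
    return sorted(conformers, key=lambda c: c["energy"])[:k]


def random_subset(conformers, k, seed=0):
    """Keep k conformers drawn uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(conformers, min(k, len(conformers)))


# Hypothetical four-conformer ensemble with relative energies.
ensemble = [
    {"id": "c0", "energy": 1.2},
    {"id": "c1", "energy": 0.0},
    {"id": "c2", "energy": 3.4},
    {"id": "c3", "energy": 0.7},
]
low = lowest_energy_subset(ensemble, 2)
rand = random_subset(ensemble, 2)
print([c["id"] for c in low])  # ['c1', 'c3']
```

Training the same model on each subset and comparing test error is one way to measure how much a task depends on the choice of conformers.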
Quality & Limitations
Strengths
- Evaluation protocols designed for ensemble learning
- Chemical diversity across organic and organometallic systems
- Practical relevance to drug discovery and catalysis
Limitations
- Small datasets for data-hungry deep learning models: Kraken has 1,552 entries and EE only 872
- Limited to computationally accessible conformer sampling
- Focus on specific chemical classes may limit broader applicability
Related Datasets: GEOM (source for Drugs-75K subset)
Citation: Zhu, Y., Hwang, J., Adams, K., Liu, Z., Nan, B., Stenfors, B., Du, Y., Chauhan, J., Wiest, O., Isayev, O., Coley, C. W., Sun, Y., & Wang, W. (2024). “Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks.” The Twelfth International Conference on Learning Representations (ICLR 2024).