Dataset Summary
MARCEL (MoleculAR Conformer Ensemble Learning) is the first comprehensive benchmark specifically designed for evaluating molecular representation learning from conformer ensembles. Unlike traditional approaches that use single 2D graphs or individual 3D structures, MARCEL addresses the fundamental challenge of molecular flexibility by providing datasets and benchmarks that explicitly account for the ensemble of thermodynamically accessible conformations.
Quick Facts
- Total Datasets: 4 (Drugs-75K, Kraken, EE, BDE)
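- Total Entries: 77,523 (76,651 molecules and 872 reactions), not counting BDE, whose size is not listed in this card
- Total Conformers: 608,095 across Drugs-75K, Kraken, and EE (BDE not counted)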
- Paper: Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks
- Repository: https://github.com/SXKDZ/MARCEL
- Authors: Yanqiao Zhu (UCLA), Jeehyun Hwang (UCLA), et al.
- Focus: Conformer ensemble learning for diverse chemical applications
Dataset Composition
Overview Statistics
| Dataset | Molecules/Reactions | Conformers | Heavy Atoms (avg) | Rotatable Bonds (avg) | Tasks | Chemical Focus |
|---|---|---|---|---|---|---|
| Drugs-75K | 75,099 molecules | 558,002 | 30.56 | 7.53 | 3 | Drug-like molecules |
| Kraken | 1,552 molecules | 21,287 | 23.70 | 9.05 | 4 | Organophosphorus ligands |
| EE | 872 reactions | 14,807 (Pro-R) + 13,999 (Pro-S) | 59.32 | 18.57 | 1 | Enantiomeric excess |
| BDE | - | - | - | - | 1 | Bond dissociation energy |
Chemical Diversity
Atomic Species Coverage:
- Drugs-75K: H, C, N, O, F, Si, P, S, Cl
- Kraken: H, B, C, N, O, F, Si, P, S, Cl, Fe, Se, Br, Sn, I
- EE: H, C, N, O, F, P, Cl, Br, Rh
- BDE: Various organic and organometallic species
Individual Dataset Details
Drugs-75K Dataset
Source: Subset of GEOM-Drugs dataset (see GEOM Dataset Card)
Size: 75,099 molecules with ≥5 rotatable bonds
Conformer Generation: Auto3D, which is reported to approach DFT-level accuracy
Property Calculation: AIMNet-NSE
Tasks (Boltzmann-averaged properties):
Ionization Potential (IP)
- Definition: $IP = E_{cation} - E_{neutral}$
- Units: eV
- Physical meaning: Energy required to remove an electron
Electron Affinity (EA)
- Definition: $EA = E_{neutral} - E_{anion}$
- Units: eV
- Physical meaning: Energy released when an electron is added (positive when the anion is lower in energy than the neutral molecule)
Electronegativity (χ)
- Definition: $\chi = -\left(\frac{\partial E}{\partial N}\right)$, where $N$ is the number of electrons
- Units: eV
- Physical meaning: Tendency to attract electrons
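The derivative in the electronegativity definition cannot be evaluated directly for a molecule with an integer number of electrons. A common finite-difference approximation (the Mulliken form) estimates it from the neutral, cation, and anion energies, which ties the three Drugs-75K targets together. Whether the benchmark labels were produced with exactly this formula is an assumption here, not something this card states:

$$\chi = -\left(\frac{\partial E}{\partial N}\right) \approx -\frac{E_{anion} - E_{cation}}{2} = \frac{(E_{cation} - E_{neutral}) + (E_{neutral} - E_{anion})}{2} = \frac{IP + EA}{2}$$

Because Boltzmann averaging is linear, the same relation would carry over to the ensemble-averaged values whenever it holds for each conformer.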
Kraken Dataset
Focus: Monodentate organophosphorus(III) ligands
Size: 1,552 molecules
Application: Catalyst design and QSAR modeling
Conformer Quality: DFT-computed ensembles
Tasks (Sterimol descriptors):
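- Sterimol B₅: Maximum width of the ligand perpendicular to the P-X bond axis
- Sterimol L: Length of the ligand along the P-X bond axis
- Buried Sterimol B₅: B₅ evaluated only on the part of the ligand inside the buried-volume sphere
- Buried Sterimol L: L evaluated only on the part of the ligand inside the buried-volume sphere
All four descriptors are reported in ångströms (Å)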
EE (Enantiomeric Excess) Dataset
Focus: Asymmetric catalysis reactions
Size: 872 reactions
Chemistry: Rhodium-catalyzed asymmetric reactions
Conformers: Separate ensembles for Pro-R and Pro-S pathways
Task:
- Enantiomeric Excess Prediction: Quantifying stereoselectivity in asymmetric catalysis
- Physical meaning: Preference for one enantiomer over another
- Range: -100% to +100%
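As a minimal sketch of what a signed value in this range means, the snippet below computes enantiomeric excess from the amounts of the two enantiomers, and from a free-energy gap between the two competing pathways under kinetic control. The sign convention (positive when R dominates), the kcal/mol unit, and the function names are illustrative assumptions, not the dataset's documented convention.

```python
import math

GAS_CONSTANT_KCAL = 0.0019872041  # R in kcal/(mol K)


def enantiomeric_excess(amount_r: float, amount_s: float) -> float:
    """Signed ee in percent; positive when R is the major enantiomer (assumed convention)."""
    total = amount_r + amount_s
    if total <= 0:
        raise ValueError("total amount must be positive")
    return 100.0 * (amount_r - amount_s) / total


def ee_from_barrier_gap(ddg_kcal_mol: float, temperature_k: float = 298.15) -> float:
    """ee in percent from ddG = G(pro-S barrier) - G(pro-R barrier).

    Under kinetic control the product ratio R/S is exp(ddG / RT),
    so ee = (ratio - 1) / (ratio + 1) = tanh(ddG / (2 RT)).
    """
    return 100.0 * math.tanh(ddg_kcal_mol / (2.0 * GAS_CONSTANT_KCAL * temperature_k))
```

A 95:5 mixture gives an ee of 90%, and a racemic mixture gives 0%. The second function shows why the Pro-R and Pro-S ensembles matter: small differences in their energies translate into large differences in selectivity.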
BDE (Bond Dissociation Energy) Dataset
Focus: Homolytic bond cleavage energies
Application: Understanding chemical reactivity and stability
Relevance: Critical for predicting reaction pathways
Methodology and Data Generation
Conformer Ensemble Representation
Statistical Weighting: Each conformer $C_i$ has probability: $$p_i = \frac{\exp(-\frac{e_i}{k_B T})}{\sum_j \exp(-\frac{e_j}{k_B T})}$$
Where:
- $e_i$ = energy of conformer $C_i$
- $k_B$ = Boltzmann constant
- $T$ = temperature
Boltzmann Averaging: Target properties computed as: $$\langle y \rangle_{k_B} = \sum_{C_i \in \mathcal{C}} p_i y_i$$ where $y_i$ is the property value of conformer $C_i$ and $\mathcal{C}$ is the conformer ensemble.
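The two formulas above can be sketched in a few lines of Python. This is an illustration rather than the benchmark's own code: it assumes conformer energies in kcal/mol and a temperature of 298.15 K, and the function names are made up for this example.

```python
import math

K_B_KCAL = 0.0019872041  # Boltzmann constant per mole, kcal/(mol K); assumes energies in kcal/mol


def boltzmann_weights(energies, temperature=298.15):
    """Return p_i = exp(-e_i / kT) / sum_j exp(-e_j / kT) for each conformer."""
    if not energies:
        raise ValueError("at least one conformer energy is required")
    kt = K_B_KCAL * temperature
    e_min = min(energies)  # shifting by the minimum cancels in the ratio and avoids overflow
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    partition = sum(factors)
    return [f / partition for f in factors]


def boltzmann_average(values, energies, temperature=298.15):
    """Return the ensemble average sum_i p_i * y_i of per-conformer values."""
    if len(values) != len(energies):
        raise ValueError("values and energies must have the same length")
    weights = boltzmann_weights(energies, temperature)
    return sum(p * y for p, y in zip(weights, values))
```

Only energy differences between conformers matter, so the weights are unchanged if every energy is shifted by the same constant. Lower-energy conformers always receive the larger weights.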
Intended Use Cases
Property Prediction
- Ensemble-aware models: Leverage conformational flexibility for better predictions
- Multi-conformer training: Improve model robustness and generalization
- Catalyst design: Predict ligand properties for organometallic catalysis
- Drug discovery: Account for molecular flexibility in ADMET prediction
Method Development
- Benchmark evaluation: Compare conformer ensemble learning approaches
- Model architecture: Develop new neural networks for ensemble inputs
- Sampling strategies: Test conformer selection and weighting methods
- Transfer learning: Evaluate cross-task and cross-chemistry generalization
- Complementary datasets: Use with GEOM for comprehensive conformer studies
Computational Chemistry Research
- Conformational analysis: Study molecular flexibility effects on properties
- Reaction prediction: Model stereoselectivity and reaction outcomes
- Catalyst screening: Virtual screening with conformer ensembles
- Method validation: Benchmark new conformer generation approaches
Quality Assessment
Strengths
- First comprehensive benchmark: Dedicated to conformer ensemble learning
- Chemical diversity: Covers drugs, organocatalysts, and transition-metal chemistry
- Multiple strategies: Various approaches for ensemble incorporation
- High-quality data: Conformers and labels come from DFT (Kraken) or from faster methods reported to approach DFT-level accuracy (Auto3D, AIMNet-NSE)
- Practical relevance: Tasks directly relevant to chemical applications
Limitations
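- Dataset size: Kraken (1,552 molecules) and EE (872 reactions) are small by deep learning standards
- Conformer coverage: Limited by computational sampling methods
- Statistical weights: Boltzmann weights depend on computed conformer energies, so errors in those energies propagate into the weights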
- Chemical scope: Focus on organic and organometallic chemistry
- Computational cost: High-quality conformers require significant resources
Validation Metrics
- Property prediction: MAE, RMSE for regression tasks
- Ensemble quality: Coverage of thermally accessible conformers
- Model comparison: Relative performance across different approaches
- Generalization: Cross-dataset and cross-task evaluation
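For reference, the two regression metrics named above can be written as follows. This is a plain-Python sketch with illustrative function names; it is not taken from the benchmark's evaluation code.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared difference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

RMSE is never smaller than MAE on the same predictions and penalizes large individual errors more heavily, so reporting both gives a rough sense of how errors are distributed.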
Technical Specifications
Data Formats
- Molecular graphs: Standard chemical graph representations
- Conformers: 3D atomic coordinates in Cartesian format
- Properties: Numerical values with specified units
- Ensembles: Collections of conformers with statistical weights
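One way to picture how these four pieces fit together is the container sketched below. The class and field names are hypothetical and do not reflect the repository's actual file layout or loaders; coordinates in ångströms are also an assumption.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Conformer:
    coordinates: List[Tuple[float, float, float]]  # one (x, y, z) per atom; Å assumed
    energy: float  # conformer energy; unit depends on the source data
    weight: float  # statistical (Boltzmann) weight p_i


@dataclass
class ConformerEnsemble:
    smiles: str  # molecular graph as a SMILES string
    atomic_numbers: List[int]  # shared by every conformer of the molecule
    conformers: List[Conformer]
    properties: Dict[str, float] = field(default_factory=dict)  # target name -> value

    def validate(self, tol: float = 1e-6) -> None:
        """Check atom counts per conformer and that the weights sum to 1."""
        if not self.conformers:
            raise ValueError("ensemble has no conformers")
        n_atoms = len(self.atomic_numbers)
        for index, conformer in enumerate(self.conformers):
            if len(conformer.coordinates) != n_atoms:
                raise ValueError(f"conformer {index} has the wrong number of atoms")
        total = sum(conformer.weight for conformer in self.conformers)
        if not math.isclose(total, 1.0, abs_tol=tol):
            raise ValueError(f"weights sum to {total}, expected 1")
```

Keeping the atomic numbers on the ensemble rather than on each conformer reflects that all conformers share one molecular graph and differ only in geometry, energy, and weight.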
Software Dependencies
- Auto3D: Conformer generation
- AIMNet-NSE: Property calculation
- RDKit: Chemical informatics
- PyTorch Geometric: Graph neural network implementations
Ethical Considerations
Intended Use
- Advancing molecular representation learning research
- Improving computational chemistry and drug discovery methods
- Developing more accurate property prediction models
- Educational and academic research applications
Potential Concerns
- Computational bias: Results may favor computationally accessible conformers
- Chemical bias: Focus on specific chemical classes may limit generalizability
- Accuracy limitations: Model predictions should not replace experimental validation
- Resource requirements: High computational costs may limit accessibility
Access and Usage
- Availability: Open source via GitHub
- Repository: https://github.com/SXKDZ/MARCEL
- License: Check repository for specific terms
- Documentation: Comprehensive benchmarking code and examples provided
References
- Primary Paper: Zhu, Y., Hwang, J., et al. “Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks” (arXiv:2310.00115)
- GEOM Dataset: Axelrod & Gómez-Bombarelli, Sci Data 9, 185 (2022) - Dataset Card
- Auto3D: Efficient conformer generation tool
- AIMNet-NSE: Neural network for quantum chemical properties
Dataset Status: Active development, publicly available
Last Updated: 2025
Contact: Primary authors via GitHub repository