Dataset Summary

MARCEL (MoleculAR Conformer Ensemble Learning) is the first comprehensive benchmark specifically designed for evaluating molecular representation learning from conformer ensembles. Unlike traditional approaches that use single 2D graphs or individual 3D structures, MARCEL addresses the fundamental challenge of molecular flexibility by providing datasets and benchmarks that explicitly account for the ensemble of thermodynamically accessible conformations.

Dataset Composition

Overview Statistics

| Dataset | Molecules/Reactions | Conformers | Heavy Atoms (avg) | Rotatable Bonds (avg) | Tasks | Chemical Focus |
|---|---|---|---|---|---|---|
| Drugs-75K | 75,099 molecules | 558,002 | 30.56 | 7.53 | 3 | Drug-like molecules |
| Kraken | 1,552 molecules | 21,287 | 23.70 | 9.05 | 4 | Organophosphorus ligands |
| EE | 872 reactions | Pro-R: 14,807; Pro-S: 13,999 | 59.32 | 18.57 | 1 | Enantiomeric excess |
| BDE | - | - | - | - | 1 | Bond dissociation energy |

Chemical Diversity

Atomic Species Coverage:

  • Drugs-75K: H, C, N, O, F, Si, P, S, Cl
  • Kraken: H, B, C, N, O, F, Si, P, S, Cl, Fe, Se, Br, Sn, I
  • EE: H, C, N, O, F, P, Cl, Br, Rh
  • BDE: Various organic and organometallic species

Individual Dataset Details

Drugs-75K Dataset

Source: Subset of GEOM-Drugs dataset (see GEOM Dataset Card)
Size: 75,099 molecules with ≥5 rotatable bonds
Conformer Generation: Auto3D with DFT-level accuracy
Property Calculation: AIMNet-NSE

Tasks (Boltzmann-averaged properties):

  1. Ionization Potential (IP)

    • Definition: $IP = E_{cation} - E_{neutral}$
    • Units: eV
    • Physical meaning: Energy required to remove an electron
  2. Electron Affinity (EA)

    • Definition: $EA = E_{neutral} - E_{anion}$
    • Units: eV
    • Physical meaning: Energy released when an electron is added to the neutral molecule
  3. Electronegativity (χ)

    • Definition: $\chi = -\left(\frac{\partial E}{\partial N}\right)$, where $N$ is the number of electrons
    • Units: eV
    • Physical meaning: Tendency to attract electrons
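
The three definitions above can be tied together in a few lines. The sketch below assumes that total energies of the neutral, cation, and anion states are already available in eV; the numbers are made-up placeholders, not MARCEL data. It approximates the derivative in the electronegativity definition with the standard finite-difference (Mulliken) form $\chi \approx (IP + EA)/2$. Whether the MARCEL labels were produced with exactly this form is an assumption.

```python
def ionization_potential(e_cation: float, e_neutral: float) -> float:
    """IP = E_cation - E_neutral (eV)."""
    return e_cation - e_neutral


def electron_affinity(e_neutral: float, e_anion: float) -> float:
    """EA = E_neutral - E_anion (eV)."""
    return e_neutral - e_anion


def mulliken_electronegativity(ip: float, ea: float) -> float:
    """Finite-difference estimate of -dE/dN: chi ~ (IP + EA) / 2 (eV)."""
    return 0.5 * (ip + ea)


# Placeholder total energies in eV (illustrative only, not real data).
e_neutral, e_cation, e_anion = -1000.0, -992.0, -1001.0

ip = ionization_potential(e_cation, e_neutral)
ea = electron_affinity(e_neutral, e_anion)
chi = mulliken_electronegativity(ip, ea)
print(f"IP = {ip} eV, EA = {ea} eV, chi = {chi} eV")
```

In the benchmark these quantities are per-conformer values that are then Boltzmann-averaged over the ensemble, so a function like this would be applied to each conformer before averaging.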

Kraken Dataset

Focus: Monodentate organophosphorus(III) ligands
Size: 1,552 molecules
Application: Catalyst design and QSAR modeling
Conformer Quality: DFT-computed ensembles

Tasks (Sterimol descriptors):

  1. Sterimol B₅: Steric bulk perpendicular to P-X bond
  2. Sterimol L: Length along P-X bond axis
  3. Buried Sterimol B₅: Buried volume variant of B₅
  4. Buried Sterimol L: Buried volume variant of L

All four descriptors are reported in ångströms (Å).

EE (Enantiomeric Excess) Dataset

Focus: Asymmetric catalysis reactions
Size: 872 reactions
Chemistry: Rhodium-catalyzed asymmetric reactions
Conformers: Separate ensembles for Pro-R and Pro-S pathways

Task:

  • Enantiomeric Excess Prediction: Quantifying stereoselectivity in asymmetric catalysis
  • Physical meaning: Preference for one enantiomer over another
  • Range: -100% to +100%
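
For readers unfamiliar with the target, the sketch below computes enantiomeric excess from the amounts of the two enantiomers using the standard definition $ee = 100 \cdot (R - S)/(R + S)$. The sign convention (positive means excess of R) and the function name are assumptions made for illustration; the dataset's own convention should be checked in the repository.

```python
def enantiomeric_excess(amount_r: float, amount_s: float) -> float:
    """Return ee in percent; positive values mean an excess of the R enantiomer.

    The sign convention is an assumption for this sketch.
    """
    if amount_r < 0 or amount_s < 0:
        raise ValueError("amounts must be non-negative")
    total = amount_r + amount_s
    if total == 0:
        raise ValueError("at least one enantiomer must be present")
    return 100.0 * (amount_r - amount_s) / total


ee_racemic = enantiomeric_excess(50.0, 50.0)  # equal amounts
ee_r_rich = enantiomeric_excess(95.0, 5.0)    # mostly R
ee_pure_s = enantiomeric_excess(0.0, 10.0)    # only S
print(ee_racemic, ee_r_rich, ee_pure_s)
```

The three calls cover the middle and both ends of the -100% to +100% range stated above.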

BDE (Bond Dissociation Energy) Dataset

Focus: Homolytic bond cleavage energies
Application: Understanding chemical reactivity and stability
Relevance: Critical for predicting reaction pathways

Methodology and Data Generation

Conformer Ensemble Representation

Statistical Weighting: Each conformer $C_i$ has probability: $$p_i = \frac{\exp(-\frac{e_i}{k_B T})}{\sum_j \exp(-\frac{e_j}{k_B T})}$$

Where:

  • $e_i$ = energy of conformer $C_i$
  • $k_B$ = Boltzmann constant
  • $T$ = temperature

Boltzmann Averaging: Target properties computed as: $$\langle y \rangle_{k_B} = \sum_{C_i \in \mathcal{C}} p_i y_i$$
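
The two formulas above translate directly into code. The following is a minimal sketch, assuming conformer energies are given in eV and a temperature of 298.15 K; both choices, the function names, and the toy numbers are assumptions for illustration, not values taken from MARCEL. Energies are shifted by their minimum before exponentiation, which leaves $p_i$ unchanged but avoids overflow.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def boltzmann_weights(energies, temperature=298.15, k_b=K_B_EV_PER_K):
    """Return p_i = exp(-e_i / k_B T) / sum_j exp(-e_j / k_B T)."""
    beta = 1.0 / (k_b * temperature)
    e_min = min(energies)  # shifting by a constant does not change p_i
    unnormalized = [math.exp(-(e - e_min) * beta) for e in energies]
    partition = sum(unnormalized)
    return [u / partition for u in unnormalized]


def boltzmann_average(values, weights):
    """Return <y> = sum_i p_i * y_i."""
    return sum(p * y for p, y in zip(weights, values))


# Toy ensemble: relative conformer energies (eV) and a per-conformer property.
energies = [0.00, 0.03, 0.10]
ips = [8.10, 8.30, 7.90]

weights = boltzmann_weights(energies)
avg_ip = boltzmann_average(ips, weights)
print(weights, avg_ip)
```

Because the lowest-energy conformer dominates at room temperature, the average sits close to its property value; raising `temperature` flattens the weights toward a uniform mean.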

Intended Use Cases

Property Prediction

  • Ensemble-aware models: Leverage conformational flexibility for better predictions
  • Multi-conformer training: Improve model robustness and generalization
  • Catalyst design: Predict ligand properties for organometallic catalysis
  • Drug discovery: Account for molecular flexibility in ADMET prediction

Method Development

  • Benchmark evaluation: Compare conformer ensemble learning approaches
  • Model architecture: Develop new neural networks for ensemble inputs
  • Sampling strategies: Test conformer selection and weighting methods
  • Transfer learning: Evaluate cross-task and cross-chemistry generalization
  • Complementary datasets: Use with GEOM for comprehensive conformer studies

Computational Chemistry Research

  • Conformational analysis: Study molecular flexibility effects on properties
  • Reaction prediction: Model stereoselectivity and reaction outcomes
  • Catalyst screening: Virtual screening with conformer ensembles
  • Method validation: Benchmark new conformer generation approaches

Quality Assessment

Strengths

  • First comprehensive benchmark: Dedicated to conformer ensemble learning
  • Chemical diversity: Covers drug-like molecules, organophosphorus ligands, and rhodium-catalyzed reactions
  • Multiple strategies: Various approaches for ensemble incorporation
  • High-quality data: DFT-level accuracy with efficient generation methods
  • Practical relevance: Tasks directly relevant to chemical applications

Limitations

  • Dataset size: Kraken (1,552 molecules) and EE (872 reactions) are small for deep learning
  • Conformer coverage: Limited by computational sampling methods
  • Statistical weights: Boltzmann weights are derived from computed conformer energies, so errors in those energies propagate to the weights
  • Chemical scope: Focus on organic and organometallic chemistry
  • Computational cost: High-quality conformers require significant resources

Validation Metrics

  • Property prediction: MAE, RMSE for regression tasks
  • Ensemble quality: Coverage of thermally accessible conformers
  • Model comparison: Relative performance across different approaches
  • Generalization: Cross-dataset and cross-task evaluation
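
The two regression metrics named above can be computed without any dependencies. The sketch below uses made-up target and prediction values purely to show the arithmetic; the function names are illustrative and not taken from the MARCEL codebase.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(mae_value, rmse_value)
```

RMSE is never smaller than MAE and grows faster with large individual errors, so reporting both gives a rough sense of how heavy-tailed the errors are.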

Technical Specifications

Data Formats

  • Molecular graphs: Standard chemical graph representations
  • Conformers: 3D atomic coordinates in Cartesian format
  • Properties: Numerical values with specified units
  • Ensembles: Collections of conformers with statistical weights
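
As an illustration of how these four pieces fit together, the sketch below defines a hypothetical in-memory container for one molecule. The class names, field names, and toy values are assumptions made for this example and do not reflect the actual file layout or loaders in the MARCEL repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Conformer:
    """One 3D geometry: Cartesian coordinates in Å plus a statistical weight."""
    coordinates: List[Tuple[float, float, float]]
    weight: float


@dataclass
class ConformerEnsemble:
    """Hypothetical container: molecular graph, conformers, and target values."""
    atomic_numbers: List[int]            # graph nodes
    bonds: List[Tuple[int, int]]         # graph edges as atom-index pairs
    conformers: List[Conformer] = field(default_factory=list)
    targets: Dict[str, float] = field(default_factory=dict)

    def add_conformer(self, conformer: Conformer) -> None:
        # Every conformer must provide one coordinate triple per atom.
        if len(conformer.coordinates) != len(self.atomic_numbers):
            raise ValueError("coordinate count does not match atom count")
        self.conformers.append(conformer)

    def normalized_weights(self) -> List[float]:
        total = sum(c.weight for c in self.conformers)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        return [c.weight / total for c in self.conformers]


# Toy example: a four-atom H-O-O-H skeleton with placeholder coordinates.
ensemble = ConformerEnsemble(
    atomic_numbers=[8, 8, 1, 1],
    bonds=[(0, 1), (0, 2), (1, 3)],
    targets={"example_property_ev": 1.0},  # placeholder value
)
ensemble.add_conformer(Conformer(
    [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (-0.3, 0.9, 0.0), (1.7, 0.9, 0.0)], 3.0))
ensemble.add_conformer(Conformer(
    [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (-0.3, 0.9, 0.0), (1.7, -0.9, 0.0)], 1.0))
print(ensemble.normalized_weights())
```

Keeping the graph once per molecule and only the coordinates and weight per conformer avoids duplicating connectivity across the ensemble.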

Software Dependencies

  • Auto3D: Conformer generation
  • AIMNet-NSE: Property calculation
  • RDKit: Chemical informatics
  • PyTorch Geometric: Graph neural network implementations

Ethical Considerations

Intended Use

  • Advancing molecular representation learning research
  • Improving computational chemistry and drug discovery methods
  • Developing more accurate property prediction models
  • Educational and academic research applications

Potential Concerns

  • Computational bias: Results may favor computationally accessible conformers
  • Chemical bias: Focus on specific chemical classes may limit generalizability
  • Accuracy limitations: Model predictions should not replace experimental validation
  • Resource requirements: High computational costs may limit accessibility

Access and Usage

  • Availability: Open source via GitHub
  • Repository: https://github.com/SXKDZ/MARCEL
  • License: Check repository for specific terms
  • Documentation: Comprehensive benchmarking code and examples provided


Dataset Status: Active development, publicly available
Last Updated: 2025
Contact: Primary authors via GitHub repository