MARCEL

MARCEL: Molecular Representation & Conformers
Dataset Details
Authors: Yanqiao Zhu, Jeehyun Hwang, Brock Anton Stenfors, Yuanqi Du, Olexandr Isayev, Keir Adams, Jatin Chauhan, Connor W. Coley, Yizhou Sun, Zhen Liu, Bozhao Nan, Olaf Wiest, Wei Wang
Paper Title: MARCEL: Molecular Representation and Conformer Ensemble Learning
Institutions: UCLA, MIT, CMU, Notre Dame, Cornell
Published In: International Conference on Learning Representations
Category: Computational Chemistry
Format: SMILES, RDKit mol objects, 3D coordinates, statistical weights, experimental properties
Size: Conformers: 722,193; Molecules: 76,651; Reactions: 6,787
Date: September 2025
Year: 2024
Figure: Example conformer from MARCEL's Kraken subset, showcasing the dataset's focus on 3D molecular conformations for machine learning applications

Key Contribution

MARCEL provides the first comprehensive benchmark for conformer ensemble learning, demonstrating that explicitly modeling full conformer distributions significantly improves property prediction across drug-like molecules and organometallic catalysts.

Overview

The Molecular Representation and Conformer Ensemble Learning (MARCEL) dataset provides 722K+ conformations across 76K+ molecules spanning four diverse chemical domains: drug-like molecules (Drugs-75K), organophosphorus ligands (Kraken), chiral catalysts (EE), and organometallic complexes (BDE). Unlike prior datasets that focus solely on drug-like molecules, MARCEL enables evaluation of conformer ensemble methods across both pharmaceutical and catalysis applications.

Strengths

  • Domain diversity: Beyond drug-like molecules, includes organometallics and catalysts rarely covered in existing benchmarks
  • Ensemble-based: Provides full conformer ensembles with statistical weights, not just single conformers
  • DFT-quality energies: Drugs-75K features DFT-level conformers and energies (higher accuracy than GEOM-Drugs)
  • Realistic scenarios: BDE subset models the practical constraint of lacking DFT-computed conformers for large catalyst systems
  • Comprehensive baselines: Benchmarks 19 model configurations (the count listed in each results table) across 1D (SMILES), 2D (graph), 3D (single conformer), and ensemble methods
  • Property diversity: Covers ionization potential, electron affinity, electronegativity, ligand descriptors, and catalytic properties
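
Statistical weights for a conformer ensemble are commonly Boltzmann weights derived from relative conformer energies. The sketch below illustrates that convention only; whether MARCEL's stored weights were computed exactly this way is an assumption, and the function names, the temperature, and the example energies are all illustrative.

```python
import math

# Gas constant in kcal/(mol*K). The 298.15 K default is an assumed temperature.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal, temperature_k=298.15):
    """Return normalized Boltzmann weights for conformer energies in kcal/mol."""
    if not energies_kcal:
        raise ValueError("need at least one conformer energy")
    kt = R_KCAL_PER_MOL_K * temperature_k
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_average(values, weights):
    """Weighted average of a per-conformer property."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical relative energies (kcal/mol) for three conformers of one molecule.
weights = boltzmann_weights([0.0, 0.5, 2.0])
```

Lower-energy conformers receive larger weights, and the weights sum to one, so `ensemble_average` returns a population-weighted property value.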

Limitations

  • Regression only: All tasks are regression; no classification benchmarks
  • Chemical space coverage: 76K molecules cannot represent full drug-like or catalyst chemical space
  • Compute requirements: Working with large conformer ensembles demands significant computational resources
  • Proprietary data: EE subset not publicly available (as of December 2025)
  • DFT bottleneck: BDE illustrates a practical limitation: a single DFT optimization can take 2-3 days, making conformer-level DFT infeasible for large organometallics

Technical Notes

Data Generation Pipeline

Drugs-75K

Source: GEOM-Drugs subset

Filtering:

  • Minimum 5 rotatable bonds (focus on flexible molecules)
  • Allowed elements: H, C, N, O, F, Si, P, S, Cl
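
The two filtering criteria above can be sketched as a simple predicate. This is a minimal illustration, not the authors' pipeline: `passes_drugs75k_filter` is a hypothetical helper, and it assumes the element symbols and rotatable-bond count have already been computed for each molecule (in practice by a cheminformatics toolkit).

```python
# Element whitelist and flexibility threshold taken from the filtering list above.
ALLOWED_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "Si", "P", "S", "Cl"})
MIN_ROTATABLE_BONDS = 5


def passes_drugs75k_filter(element_symbols, n_rotatable_bonds):
    """Return True if a molecule meets both filtering criteria."""
    only_allowed = set(element_symbols) <= ALLOWED_ELEMENTS
    flexible_enough = n_rotatable_bonds >= MIN_ROTATABLE_BONDS
    return only_allowed and flexible_enough


# Hypothetical molecules given as (element symbols, rotatable-bond count).
flexible_ok = passes_drugs75k_filter(["C", "H", "N", "O", "F"], 7)
too_rigid = passes_drugs75k_filter(["C", "H", "O"], 2)
has_bromine = passes_drugs75k_filter(["C", "H", "Br"], 6)
```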

Conformer generation:

  • DFT-level calculations for both conformers and energies
  • Higher accuracy than original GEOM-Drugs (semi-empirical GFN2-xTB)

Properties: Ionization Potential (IP), Electron Affinity (EA), Electronegativity (χ)
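
The three properties are related under Mulliken's definition of electronegativity. Assuming the dataset follows that definition (the description above does not say), $\chi$ is the mean of the other two quantities:

$$\chi = \frac{\text{IP} + \text{EA}}{2}$$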

Kraken

Source: Original Kraken dataset (1,552 monodentate organophosphorus(III) ligands)

Properties: 4 of the 78 available properties

  • $B_5$: Sterimol $B_5$ (maximum ligand width)
  • $L$: Sterimol $L$ (ligand length)
  • $\text{Bur}B_5$: Buried Sterimol $B_5$
  • $\text{Bur}L$: Buried Sterimol $L$

EE (Enantiomeric Excess)

Generation method: Q2MM (Quantum-guided Molecular Mechanics)

Molecules: 872 rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphine ligands

Property: Enantiomeric excess (EE) for asymmetric catalysis

Availability: Proprietary-only (not publicly available as of December 2025)

BDE (Bond Dissociation Energy)

Molecules: 5,195 organometallic catalysts (ML₁L₂ structure)

Initial conformers: OpenBabel with geometric optimization

Energies: DFT calculations

Property: Electronic dissociation energy (difference between bound and unbound states)

Key constraint: DFT optimization of full conformer ensembles is computationally infeasible (2-3 days per molecule)

Benchmark Setup

Task: Predict molecular properties from structure using different representation strategies (1D/2D/3D/Ensemble)

Data splits: Not explicitly specified in the available information; standard train/validation/test splits used
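
Since the split protocol is not specified here, the following is only a sketch of one reasonable choice: a seeded random split over molecule IDs, so that all conformers of a molecule land in the same partition. The 70/10/20 fractions, the seed, and the function name are assumptions made for illustration.

```python
import random


def split_molecule_ids(molecule_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Randomly split molecule IDs into train/validation/test lists.

    The default fractions are an illustrative assumption, not the
    benchmark's documented protocol.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = sorted(set(molecule_ids))  # deduplicate and sort for reproducibility
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Hypothetical integer IDs standing in for 100 molecules.
train_ids, val_ids, test_ids = split_molecule_ids(range(100))
```

Splitting by molecule rather than by conformer avoids leaking conformers of a test molecule into training.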

Hyperparameters: Tuned per model (specific optimization method not detailed in available documentation)

Model categories:

  1. 1D Models: SMILES-based (Random Forest on Morgan fingerprints, LSTM, Transformer)
  2. 2D Models: Graph-based (GIN, GIN+VN, ChemProp, GraphGPS)
  3. 3D Models: Single conformer (SchNet, DimeNet++, GemNet, PaiNN, ClofNet, LEFTNet)
  4. Ensemble Models: Full conformer ensemble (same 3D architectures, aggregating over all conformers)
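
The ensemble category aggregates per-conformer representations into a single molecule-level representation. The exact aggregation used by the benchmark is not described here, so the sketch below assumes the simplest option, a (optionally weighted) mean over conformer embeddings; `pool_conformer_embeddings` and the example vectors are hypothetical.

```python
def pool_conformer_embeddings(embeddings, weights=None):
    """Aggregate per-conformer embedding vectors into one molecule embedding.

    embeddings: list of equal-length lists of floats, one per conformer.
    weights: optional per-conformer weights; uniform mean pooling if None.
    """
    if not embeddings:
        raise ValueError("need at least one conformer embedding")
    if weights is None:
        weights = [1.0] * len(embeddings)
    if len(weights) != len(embeddings):
        raise ValueError("one weight per conformer is required")
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings)) / total
        for d in range(dim)
    ]


# Hypothetical 2-dimensional embeddings for two conformers.
mean_pooled = pool_conformer_embeddings([[1.0, 0.0], [3.0, 2.0]])
weighted = pool_conformer_embeddings([[1.0, 0.0], [3.0, 2.0]], weights=[3.0, 1.0])
```

Because the pooled vector does not depend on conformer order, the prediction is invariant to how the ensemble is enumerated.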

Evaluation metric: Mean Absolute Error (MAE) for all tasks
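
For reference, MAE is the average absolute difference between predicted and reference values. A minimal sketch follows; the function name and the example numbers are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between reference and predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical reference and predicted values.
mae = mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```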

Key Findings

Ensemble advantage: Ensemble methods (processing full conformer distributions) generally match or outperform single-conformer 3D models, though not uniformly: on BDE, single-conformer DimeNet++ (1.45 kcal/mol) edges out its ensemble variant (1.47 kcal/mol), and on Drugs-75K electronegativity single-conformer GemNet ranks first. Representative comparisons:

  • Drugs-75K: Ensemble GemNet achieves 0.4066 eV (IP) vs 0.4069 eV (single conformer), a negligible difference
  • Kraken: Ensemble PaiNN achieves 0.2225 (B₅) vs 0.3443 (single conformer)
  • EE: Ensemble GemNet achieves 11.61% vs 18.03% (single conformer)
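
To put the three comparisons on a common scale, the relative MAE reduction can be computed directly from the quoted numbers. The helper name below is illustrative; the values are the single-conformer and ensemble MAEs listed above.

```python
def relative_mae_reduction(single_mae, ensemble_mae):
    """Fractional MAE reduction of the ensemble model over the single-conformer model."""
    return (single_mae - ensemble_mae) / single_mae


# (single-conformer MAE, ensemble MAE) pairs quoted in the findings above.
comparisons = {
    "Drugs-75K IP, GemNet": (0.4069, 0.4066),
    "Kraken B5, PaiNN": (0.3443, 0.2225),
    "EE, GemNet": (18.03, 11.61),
}

reductions = {
    name: relative_mae_reduction(single, ensemble)
    for name, (single, ensemble) in comparisons.items()
}

for name, reduction in reductions.items():
    print(f"{name}: {100 * reduction:.1f}% lower MAE with the ensemble")
```

On these numbers the ensemble gain is roughly 35% for Kraken B5 and EE, but under 0.1% for Drugs-75K ionization potential.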

3D vs 2D: 3D models generally outperform 2D graph models, especially for conformationally-sensitive properties

Model architecture: GemNet ranks first on seven of the nine tasks; PaiNN (Kraken B₅) and DimeNet++ (BDE) lead the other two

Dataset Information

Format

SMILES, RDKit mol objects, 3D coordinates, statistical weights, experimental properties
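
One way to picture how these pieces fit together is a per-molecule record holding a SMILES string, a property label, and a list of weighted conformers. The schema below is hypothetical and does not reflect the dataset's actual file layout; all class and field names, and the example values, are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Conformer:
    coordinates: List[Tuple[float, float, float]]  # one (x, y, z) per atom
    weight: float  # statistical weight within the ensemble


@dataclass
class MoleculeRecord:
    smiles: str
    target: float  # property label, e.g. a regression target
    conformers: List[Conformer] = field(default_factory=list)

    def normalized_weights(self) -> List[float]:
        """Rescale conformer weights so they sum to one."""
        total = sum(c.weight for c in self.conformers)
        return [c.weight / total for c in self.conformers]


# Hypothetical record with two made-up conformers (heavy atoms only).
record = MoleculeRecord(
    smiles="CCO",
    target=1.23,
    conformers=[
        Conformer([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.3, 0.0)], weight=3.0),
        Conformer([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, -1.3, 0.0)], weight=1.0),
    ],
)
normalized = record.normalized_weights()
```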

Size

| Type | Count |
|---|---|
| Conformers | 722,193 |
| Molecules | 76,651 |
| Reactions | 6,787 |

Dataset Examples

Example conformer from Drugs-75K (SMILES: `COC(=O)[C@@]1(Cc2ccc(OC)cc2)[C@H]2c3cc(C(=O)N(C)C)n(Cc4ccc(OC(F)(F)F)cc4)c3C[C@H]2CN1C(=O)c1ccccc1`; IUPAC: methyl (2R,3R,6R)-4-benzoyl-10-(dimethylcarbamoyl)-3-[(4-methoxyphenyl)methyl]-9-[[4-(trifluoromethoxy)phenyl]methyl]-4,9-diazatricyclo[6.3.0.02,6]undeca-1(8),10-diene-3-carboxylate)
2D structure of Drugs-75K conformer above
Example conformer from Kraken (ligand 10, conformer 0) in 2D
Example conformer from Kraken (ligand 10, conformer 0) in 3D
Example substrate from BDE in 3D (Pt_9.63)
2D structure of BDE substrate above

Dataset Subsets

| Subset | Count | Description |
|---|---|---|
| Drugs-75K | 75,099 molecules | Drug-like molecules with at least 5 rotatable bonds |
| Kraken | 1,552 molecules | Monodentate organophosphorus(III) ligands |
| EE | 872 molecules | Rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphine |
| BDE | 5,195 molecules | Organometallic catalysts ML$_1$L$_2$ |

Benchmarks

Ionization Potential (Drugs-75K)

Predict ionization potential from molecular structure

Subset: Drugs-75K

| Rank | Model | Description | MAE (eV) |
|---|---|---|---|
| 1 | Ensemble - GemNet | GemNet on full conformer ensemble | 0.4066 |
| 2 | 3D - GemNet | Geometry-enhanced message passing (single conformer) | 0.4069 |
| 3 | Ensemble - DimeNet++ | DimeNet++ on full conformer ensemble | 0.4126 |
| 4 | Ensemble - LEFTNet | LEFTNet on full conformer ensemble | 0.4149 |
| 5 | 3D - LEFTNet | Local Environment Feature Transformer (single conformer) | 0.4174 |
| 6 | Ensemble - ClofNet | ClofNet on full conformer ensemble | 0.428 |
| 7 | 2D - GraphGPS | Graph Transformer with positional encodings | 0.4351 |
| 8 | 2D - GIN | Graph Isomorphism Network | 0.4354 |
| 9 | 2D - GIN+VN | GIN with Virtual Nodes | 0.4361 |
| 10 | 3D - ClofNet | Equivariant GNN with complete local frames (single conformer) | 0.4393 |
| 11 | 3D - SchNet | Continuous-filter convolutional network (single conformer) | 0.4394 |
| 12 | 3D - DimeNet++ | Directional message passing network (single conformer) | 0.4441 |
| 13 | Ensemble - SchNet | SchNet on full conformer ensemble | 0.4452 |
| 14 | Ensemble - PaiNN | PaiNN on full conformer ensemble | 0.4466 |
| 15 | 3D - PaiNN | Polarizable Atom Interaction Network (single conformer) | 0.4505 |
| 16 | 2D - ChemProp | Message Passing Neural Network | 0.4595 |
| 17 | 1D - LSTM | LSTM on SMILES sequences | 0.4788 |
| 18 | 1D - Random forest | Random Forest on Morgan fingerprints | 0.4987 |
| 19 | 1D - Transformer | Transformer on SMILES sequences | 0.6617 |

Electron Affinity (Drugs-75K)

Predict electron affinity from molecular structure

Subset: Drugs-75K

| Rank | Model | Description | MAE (eV) |
|---|---|---|---|
| 1 | Ensemble - GemNet | GemNet on full conformer ensemble | 0.391 |
| 2 | 3D - GemNet | Geometry-enhanced message passing (single conformer) | 0.3922 |
| 3 | Ensemble - DimeNet++ | DimeNet++ on full conformer ensemble | 0.3944 |
| 4 | Ensemble - LEFTNet | LEFTNet on full conformer ensemble | 0.3953 |
| 5 | 3D - LEFTNet | Local Environment Feature Transformer (single conformer) | 0.3964 |
| 6 | Ensemble - ClofNet | ClofNet on full conformer ensemble | 0.4033 |
| 7 | 2D - GraphGPS | Graph Transformer with positional encodings | 0.4085 |
| 8 | 2D - GIN | Graph Isomorphism Network | 0.4169 |
| 9 | 2D - GIN+VN | GIN with Virtual Nodes | 0.4169 |
| 10 | 3D - SchNet | Continuous-filter convolutional network (single conformer) | 0.4207 |
| 11 | 3D - DimeNet++ | Directional message passing network (single conformer) | 0.4233 |
| 12 | Ensemble - SchNet | SchNet on full conformer ensemble | 0.4232 |
| 13 | 3D - ClofNet | Equivariant GNN with complete local frames (single conformer) | 0.4251 |
| 14 | Ensemble - PaiNN | PaiNN on full conformer ensemble | 0.4269 |
| 15 | 2D - ChemProp | Message Passing Neural Network | 0.4417 |
| 16 | 3D - PaiNN | Polarizable Atom Interaction Network (single conformer) | 0.4495 |
| 17 | 1D - LSTM | LSTM on SMILES sequences | 0.4648 |
| 18 | 1D - Random forest | Random Forest on Morgan fingerprints | 0.4747 |
| 19 | 1D - Transformer | Transformer on SMILES sequences | 0.585 |

Electronegativity (Drugs-75K)

Predict electronegativity (χ) from molecular structure

Subset: Drugs-75K

| Rank | Model | Description | MAE (eV) |
|---|---|---|---|
| 1 | 3D - GemNet | Geometry-enhanced message passing (single conformer) | 0.197 |
| 2 | Ensemble - GemNet | GemNet on full conformer ensemble | 0.2027 |
| 3 | Ensemble - LEFTNet | LEFTNet on full conformer ensemble | 0.2069 |
| 4 | 3D - LEFTNet | Local Environment Feature Transformer (single conformer) | 0.2083 |
| 5 | Ensemble - ClofNet | ClofNet on full conformer ensemble | 0.2199 |
| 6 | 2D - GraphGPS | Graph Transformer with positional encodings | 0.2212 |
| 7 | 3D - SchNet | Continuous-filter convolutional network (single conformer) | 0.2243 |
| 8 | Ensemble - SchNet | SchNet on full conformer ensemble | 0.2243 |
| 9 | 2D - GIN | Graph Isomorphism Network | 0.226 |
| 10 | 2D - GIN+VN | GIN with Virtual Nodes | 0.2267 |
| 11 | Ensemble - DimeNet++ | DimeNet++ on full conformer ensemble | 0.2267 |
| 12 | Ensemble - PaiNN | PaiNN on full conformer ensemble | 0.2294 |
| 13 | 3D - PaiNN | Polarizable Atom Interaction Network (single conformer) | 0.2324 |
| 14 | 3D - ClofNet | Equivariant GNN with complete local frames (single conformer) | 0.2378 |
| 15 | 3D - DimeNet++ | Directional message passing network (single conformer) | 0.2436 |
| 16 | 2D - ChemProp | Message Passing Neural Network | 0.2441 |
| 17 | 1D - LSTM | LSTM on SMILES sequences | 0.2505 |
| 18 | 1D - Random forest | Random Forest on Morgan fingerprints | 0.2732 |
| 19 | 1D - Transformer | Transformer on SMILES sequences | 0.4073 |

B₅ Sterimol Parameter (Kraken)

Predict B₅ sterimol descriptor for organophosphorus ligands

Subset: Kraken

| Rank | Model | Description | MAE |
|---|---|---|---|
| 1 | Ensemble - PaiNN | PaiNN on full conformer ensemble | 0.2225 |
| 2 | Ensemble - GemNet | GemNet on full conformer ensemble | 0.2313 |
| 3 | Ensemble - DimeNet++ | DimeNet++ on full conformer ensemble | 0.263 |
| 4 | Ensemble - LEFTNet | LEFTNet on full conformer ensemble | 0.2644 |
| 5 | Ensemble - SchNet | SchNet on full conformer ensemble | 0.2704 |
| 6 | 3D - GemNet | Geometry-enhanced message passing (single conformer) | 0.2789 |
| 7 | 3D - LEFTNet | Local Environment Feature Transformer (single conformer) | 0.3072 |
| 8 | 2D - GIN | Graph Isomorphism Network | 0.3128 |
| 9 | Ensemble - ClofNet | ClofNet on full conformer ensemble | 0.3228 |
| 10 | 3D - SchNet | Continuous-filter convolutional network (single conformer) | 0.3293 |
| 11 | 3D - PaiNN | Polarizable Atom Interaction Network (single conformer) | 0.3443 |
| 12 | 2D - GraphGPS | Graph Transformer with positional encodings | 0.345 |
| 13 | 3D - DimeNet++ | Directional message passing network (single conformer) | 0.351 |
| 14 | 2D - GIN+VN | GIN with Virtual Nodes | 0.3567 |
| 15 | 1D - Random forest | Random Forest on Morgan fingerprints | 0.476 |
| 16 | 2D - ChemProp | Message Passing Neural Network | 0.485 |
| 17 | 3D - ClofNet | Equivariant GNN with complete local frames (single conformer) | 0.4873 |
| 18 | 1D - LSTM | LSTM on SMILES sequences | 0.4879 |
| 19 | 1D - Transformer | Transformer on SMILES sequences | 0.9611 |

L Sterimol Parameter (Kraken)

Predict L sterimol descriptor for organophosphorus ligands

Subset: Kraken

| Rank | Model | Description | MAE |
|---|---|---|---|
| 1 | Ensemble - GemNet | GemNet on full conformer ensemble | 0.3386 |
| 2 | Ensemble - DimeNet++ | DimeNet++ on full conformer ensemble | 0.3468 |
| 3 | Ensemble - PaiNN | PaiNN on full conformer ensemble | 0.3619 |
| 4 | Ensemble - LEFTNet | LEFTNet on full conformer ensemble | 0.3643 |
| 5 | 3D - GemNet | Geometry-enhanced message passing (single conformer) | 0.3754 |
| 6 | 2D - GIN | Graph Isomorphism Network | 0.4003 |
| 7 | 3D - DimeNet++ | Directional message passing network (single conformer) | 0.4174 |
| 8 | 1D - Random forest | Random Forest on Morgan fingerprints | 0.4303 |
| 9 | Ensemble - SchNet | SchNet on full conformer ensemble | 0.4322 |
| 10 | 2D - GIN+VN | GIN with Virtual Nodes | 0.4344 |
| 11 | 2D - GraphGPS | Graph Transformer with positional encodings | 0.4363 |
| 12 | 3D - PaiNN | Polarizable Atom Interaction Network (single conformer) | 0.4471 |
| 13 | Ensemble - ClofNet | ClofNet on full conformer ensemble | 0.4485 |
| 14 | 3D - LEFTNet | Local Environment Feature Transformer (single conformer) | 0.4493 |
| 15 | 1D - LSTM | LSTM on SMILES sequences | 0.5142 |
| 16 | 2D - ChemProp | Message Passing Neural Network | 0.5452 |
| 17 | 3D - SchNet | Continuous-filter convolutional network (single conformer) | 0.5458 |
| 18 | 3D - ClofNet | Equivariant GNN with complete local frames (single conformer) | 0.6417 |
| 19 | 1D - Transformer | Transformer on SMILES sequences | 0.8389 |

Buried B₅ Parameter (Kraken)

Predict buried B₅ sterimol descriptor for organophosphorus ligands

Subset: Kraken

| Rank | Model | Description | MAE |
|---|---|---|---|
| 1 | Ensemble - GemNet | GemNet on full conformer ensemble | 0.1589 |
| 2 | Ensemble - PaiNN | PaiNN on full conformer ensemble | 0.1693 |
| 3 | 2D - GIN | Graph Isomorphism Network | 0.1719 |
| 4 | 3D - GemNet | Geometry-enhanced message passing (single conformer) | 0.1782 |
| 5 | Ensemble - DimeNet++ | DimeNet++ on full conformer ensemble | 0.1783 |
| 6 | Ensemble - SchNet | SchNet on full conformer ensemble | 0.2024 |
| 7 | Ensemble - LEFTNet | LEFTNet on full conformer ensemble | 0.2017 |
| 8 | 2D - GraphGPS | Graph Transformer with positional encodings | 0.2066 |
| 9 | 3D - DimeNet++ | Directional message passing network (single conformer) | 0.2097 |
| 10 | Ensemble - ClofNet | ClofNet on full conformer ensemble | 0.2178 |
| 11 | 3D - LEFTNet | Local Environment Feature Transformer (single conformer) | 0.2176 |
| 12 | 3D - SchNet | Continuous-filter convolutional network (single conformer) | 0.2295 |
| 13 | 3D - PaiNN | Polarizable Atom Interaction Network (single conformer) | 0.2395 |
| 14 | 2D - GIN+VN | GIN with Virtual Nodes | 0.2422 |
| 15 | 1D - Random forest | Random Forest on Morgan fingerprints | 0.2758 |
| 16 | 1D - LSTM | LSTM on SMILES sequences | 0.2813 |
| 17 | 3D - ClofNet | Equivariant GNN with complete local frames (single conformer) | 0.2884 |
| 18 | 2D - ChemProp | Message Passing Neural Network | 0.3002 |
| 19 | 1D - Transformer | Transformer on SMILES sequences | 0.4929 |

Buried L Parameter (Kraken)

Predict buried L sterimol descriptor for organophosphorus ligands

Subset: Kraken

| Rank | Model | Description | MAE |
|---|---|---|---|
| 1 | Ensemble - GemNet | GemNet on full conformer ensemble | 0.0947 |
| 2 | Ensemble - DimeNet++ | DimeNet++ on full conformer ensemble | 0.1185 |
| 3 | 2D - GIN | Graph Isomorphism Network | 0.12 |
| 4 | Ensemble - PaiNN | PaiNN on full conformer ensemble | 0.1324 |
| 5 | Ensemble - LEFTNet | LEFTNet on full conformer ensemble | 0.1386 |
| 6 | Ensemble - SchNet | SchNet on full conformer ensemble | 0.1443 |
| 7 | 3D - LEFTNet | Local Environment Feature Transformer (single conformer) | 0.1486 |
| 8 | 2D - GraphGPS | Graph Transformer with positional encodings | 0.15 |
| 9 | 1D - Random forest | Random Forest on Morgan fingerprints | 0.1521 |
| 10 | 3D - DimeNet++ | Directional message passing network (single conformer) | 0.1526 |
| 11 | Ensemble - ClofNet | ClofNet on full conformer ensemble | 0.1548 |
| 12 | 3D - GemNet | Geometry-enhanced message passing (single conformer) | 0.1635 |
| 13 | 3D - PaiNN | Polarizable Atom Interaction Network (single conformer) | 0.1673 |
| 14 | 2D - GIN+VN | GIN with Virtual Nodes | 0.1741 |
| 15 | 3D - SchNet | Continuous-filter convolutional network (single conformer) | 0.1861 |
| 16 | 1D - LSTM | LSTM on SMILES sequences | 0.1924 |
| 17 | 2D - ChemProp | Message Passing Neural Network | 0.1948 |
| 18 | 3D - ClofNet | Equivariant GNN with complete local frames (single conformer) | 0.2529 |
| 19 | 1D - Transformer | Transformer on SMILES sequences | 0.2781 |

Enantioselectivity (EE)

Predict enantiomeric excess for Rh-catalyzed asymmetric reactions

Subset: EE

| Rank | Model | Description | MAE (%) |
|---|---|---|---|
| 1 | Ensemble - GemNet | GemNet on full conformer ensemble | 11.61 |
| 2 | Ensemble - DimeNet++ | DimeNet++ on full conformer ensemble | 12.03 |
| 3 | Ensemble - PaiNN | PaiNN on full conformer ensemble | 13.56 |
| 4 | Ensemble - ClofNet | ClofNet on full conformer ensemble | 13.96 |
| 5 | Ensemble - SchNet | SchNet on full conformer ensemble | 14.22 |
| 6 | 3D - DimeNet++ | Directional message passing network (single conformer) | 14.64 |
| 7 | 3D - SchNet | Continuous-filter convolutional network (single conformer) | 17.74 |
| 8 | 3D - GemNet | Geometry-enhanced message passing (single conformer) | 18.03 |
| 9 | Ensemble - LEFTNet | LEFTNet on full conformer ensemble | 18.42 |
| 10 | 3D - LEFTNet | Local Environment Feature Transformer (single conformer) | 19.8 |
| 11 | 3D - PaiNN | Polarizable Atom Interaction Network (single conformer) | 20.24 |
| 12 | 3D - ClofNet | Equivariant GNN with complete local frames (single conformer) | 33.95 |
| 13 | 2D - ChemProp | Message Passing Neural Network | 61.03 |
| 14 | 1D - Random forest | Random Forest on Morgan fingerprints | 61.3 |
| 15 | 2D - GraphGPS | Graph Transformer with positional encodings | 61.63 |
| 16 | 1D - Transformer | Transformer on SMILES sequences | 62.08 |
| 17 | 2D - GIN | Graph Isomorphism Network | 62.31 |
| 18 | 2D - GIN+VN | GIN with Virtual Nodes | 62.38 |
| 19 | 1D - LSTM | LSTM on SMILES sequences | 64.01 |

Bond Dissociation Energy (BDE)

Predict metal-ligand bond dissociation energy for organometallic catalysts

Subset: BDE

| Rank | Model | Description | MAE (kcal/mol) |
|---|---|---|---|
| 1 | 3D - DimeNet++ | Directional message passing network (single conformer) | 1.45 |
| 2 | Ensemble - DimeNet++ | DimeNet++ on full conformer ensemble | 1.47 |
| 3 | 3D - LEFTNet | Local Environment Feature Transformer (single conformer) | 1.53 |
| 4 | Ensemble - LEFTNet | LEFTNet on full conformer ensemble | 1.53 |
| 5 | Ensemble - GemNet | GemNet on full conformer ensemble | 1.61 |
| 6 | 3D - GemNet | Geometry-enhanced message passing (single conformer) | 1.65 |
| 7 | Ensemble - PaiNN | PaiNN on full conformer ensemble | 1.87 |
| 8 | Ensemble - SchNet | SchNet on full conformer ensemble | 1.97 |
| 9 | Ensemble - ClofNet | ClofNet on full conformer ensemble | 2.01 |
| 10 | 3D - PaiNN | Polarizable Atom Interaction Network (single conformer) | 2.13 |
| 11 | 2D - GraphGPS | Graph Transformer with positional encodings | 2.48 |
| 12 | 3D - SchNet | Continuous-filter convolutional network (single conformer) | 2.55 |
| 13 | 3D - ClofNet | Equivariant GNN with complete local frames (single conformer) | 2.61 |
| 14 | 2D - GIN | Graph Isomorphism Network | 2.64 |
| 15 | 2D - ChemProp | Message Passing Neural Network | 2.66 |
| 16 | 2D - GIN+VN | GIN with Virtual Nodes | 2.74 |
| 17 | 1D - LSTM | LSTM on SMILES sequences | 2.83 |
| 18 | 1D - Random forest | Random Forest on Morgan fingerprints | 3.03 |
| 19 | 1D - Transformer | Transformer on SMILES sequences | 10.08 |

| Dataset | Relationship |
|---|---|
| GEOM | Source |

Citation

If you use this dataset, please cite:

https://doi.org/10.48550/arXiv.2310.00115