MARCEL

MARCEL: Molecular Representation and Conformer Ensemble Learning
Dataset Details
Authors: Yanqiao Zhu, Jeehyun Hwang, Brock Anton Stenfors, Yuanqi Du, Olexandr Isayev, Keir Adams, Jatin Chauhan, Connor W. Coley, Yizhou Sun, Zhen Liu, Bozhao Nan, Olaf Wiest, Wei Wang
Paper Title: Learning over Molecular Conformer Ensembles: Datasets and Benchmarks
Institutions: UCLA, MIT, CMU, Notre Dame, Cornell
Published In: International Conference on Learning Representations
Category: Computational Chemistry
Format: SMILES, RDKit mol objects, 3D coordinates, statistical weights, experimental properties
Size: 722,193 conformers; 76,651 molecules; 6,787 reactions
Date: September 2025
Year: 2024
Links: 📊 Dataset • 📄 Paper
MARCEL dataset Kraken ligand example in 3D conformation
Example conformer from MARCEL's Kraken subset, showcasing the dataset's focus on 3D molecular conformations for machine learning applications

Key Contribution

MARCEL contributes a large-scale benchmark dataset for molecular representation and conformer ensemble learning, comprising 722,193 conformers, 76,651 molecules, and 6,787 reactions, to support research in drug discovery and cheminformatics.

Dataset Information

Format

SMILES, RDKit mol objects, 3D coordinates, statistical weights, experimental properties

Size

| Type | Count |
| --- | --- |
| Conformers | 722,193 |
| Molecules | 76,651 |
| Reactions | 6,787 |

Dataset Examples

Example conformer from Drugs-75K (SMILES: COC(=O)[C@@]1(Cc2ccc(OC)cc2)[C@H]2c3cc(C(=O)N(C)C)n(Cc4ccc(OC(F)(F)F)cc4)c3C[C@H]2CN1C(=O)c1ccccc1; IUPAC: methyl (2R,3R,6R)-4-benzoyl-10-(dimethylcarbamoyl)-3-[(4-methoxyphenyl)methyl]-9-[[4-(trifluoromethoxy)phenyl]methyl]-4,9-diazatricyclo[6.3.0.0$^{2,6}$]undeca-1(8),10-diene-3-carboxylate)
2D structure of Drugs-75K conformer above.
Example conformer from Kraken (ligand 10, conformer 0) in 2D
Example conformer from Kraken (ligand 10, conformer 0) in 3D
Example substrate from BDE in 3D (Pt_9.63).
2D structure of BDE substrate above.

Dataset Subsets

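| Subset | Count | Description |
| --- | --- | --- |
| Drugs-75K | 75,099 | Drug-like molecules with at least 5 rotatable bonds |
| Kraken | 1,552 | Monodentate organophosphorus(III) ligands |
| EE | 872 | Rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphine |
| BDE | 5,915 | Organometallic catalysts ML$_1$L$_2$ |

Drugs-75K and Kraken together account for the 76,651 molecules; EE and BDE together account for the 6,787 reactions.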

Results

Drugs-75K Property Prediction

| Model | IP | EA | χ |
| --- | --- | --- | --- |
| 1D - Random forest | 0.4987 | 0.4747 | 0.2732 |
| 1D - LSTM | 0.4788 | 0.4648 | 0.2505 |
| 1D - Transformer | 0.6617 | 0.5850 | 0.4073 |
| 2D - GIN | 0.4354 | 0.4169 | 0.2260 |
| 2D - GIN+VN | 0.4361 | 0.4169 | 0.2267 |
| 2D - ChemProp | 0.4595 | 0.4417 | 0.2441 |
| 2D - GraphGPS | 0.4351 | 0.4085 | 0.2212 |
| 3D - SchNet | 0.4394 | 0.4207 | 0.2243 |
| 3D - DimeNet++ | 0.4441 | 0.4233 | 0.2436 |
| 3D - GemNet | 🥈 0.4069 | 🥈 0.3922 | 🥇 0.1970 |
| 3D - PaiNN | 0.4505 | 0.4495 | 0.2324 |
| 3D - ClofNet | 0.4393 | 0.4251 | 0.2378 |
| 3D - LEFTNet | 0.4174 | 0.3964 | 0.2083 |
| Ensemble - SchNet | 0.4452 | 0.4232 | 0.2243 |
| Ensemble - DimeNet++ | 0.4126 | 0.3944 | 0.2267 |
| Ensemble - GemNet | 🥇 0.4066 | 🥇 0.3910 | 🥈 0.2027 |
| Ensemble - PaiNN | 0.4466 | 0.4269 | 0.2294 |
| Ensemble - ClofNet | 0.4280 | 0.4033 | 0.2199 |
| Ensemble - LEFTNet | 0.4149 | 0.3953 | 0.2069 |

Kraken Property Prediction

| Model | B₅ | L | BurB₅ | BurL |
| --- | --- | --- | --- | --- |
| 1D - Random forest | 0.4760 | 0.4303 | 0.2758 | 0.1521 |
| 1D - LSTM | 0.4879 | 0.5142 | 0.2813 | 0.1924 |
| 1D - Transformer | 0.9611 | 0.8389 | 0.4929 | 0.2781 |
| 2D - GIN | 0.3128 | 0.4003 | 0.1719 | 0.1200 |
| 2D - GIN+VN | 0.3567 | 0.4344 | 0.2422 | 0.1741 |
| 2D - ChemProp | 0.4850 | 0.5452 | 0.3002 | 0.1948 |
| 2D - GraphGPS | 0.3450 | 0.4363 | 0.2066 | 0.1500 |
| 3D - SchNet | 0.3293 | 0.5458 | 0.2295 | 0.1861 |
| 3D - DimeNet++ | 0.3510 | 0.4174 | 0.2097 | 0.1526 |
| 3D - GemNet | 0.2789 | 0.3754 | 0.1782 | 0.1635 |
| 3D - PaiNN | 0.3443 | 0.4471 | 0.2395 | 0.1673 |
| 3D - ClofNet | 0.4873 | 0.6417 | 0.2884 | 0.2529 |
| 3D - LEFTNet | 0.3072 | 0.4493 | 0.2176 | 0.1486 |
| Ensemble - SchNet | 0.2704 | 0.4322 | 0.2024 | 0.1443 |
| Ensemble - DimeNet++ | 0.2630 | 🥈 0.3468 | 0.1783 | 🥈 0.1185 |
| Ensemble - GemNet | 🥈 0.2313 | 🥇 0.3386 | 🥇 0.1589 | 🥇 0.0947 |
| Ensemble - PaiNN | 🥇 0.2225 | 0.3619 | 🥈 0.1693 | 0.1324 |
| Ensemble - ClofNet | 0.3228 | 0.4485 | 0.2178 | 0.1548 |
| Ensemble - LEFTNet | 0.2644 | 0.3643 | 0.2017 | 0.1386 |
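
To make the comparison between single-conformer 3D models and their ensemble counterparts concrete, the short sketch below computes the relative reduction in the Kraken B₅ score for each architecture. It assumes the reported values are errors where lower is better, which the medal placement suggests but the table does not state. The numbers are copied from the Kraken table; the function and variable names are illustrative only.

```python
# Kraken B5 scores copied from the table (assumed to be errors: lower is better).
single_conformer = {
    "SchNet": 0.3293, "DimeNet++": 0.3510, "GemNet": 0.2789,
    "PaiNN": 0.3443, "ClofNet": 0.4873, "LEFTNet": 0.3072,
}
ensemble = {
    "SchNet": 0.2704, "DimeNet++": 0.2630, "GemNet": 0.2313,
    "PaiNN": 0.2225, "ClofNet": 0.3228, "LEFTNet": 0.2644,
}


def relative_reduction(single_err: float, ensemble_err: float) -> float:
    """Percentage by which the ensemble score is lower than the single-conformer score."""
    return 100.0 * (single_err - ensemble_err) / single_err


reductions = {
    model: relative_reduction(single_conformer[model], ensemble[model])
    for model in single_conformer
}

for model, reduction in reductions.items():
    print(f"{model:10s} {reduction:5.1f}%")
```

For this one property, every listed architecture scores lower with the ensemble input than with a single conformer; the same calculation can be repeated for the other columns, where the picture is more mixed.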

EE and BDE Property Prediction

| Model | EE | BDE |
| --- | --- | --- |
| 1D - Random forest | 61.2963 | 3.0335 |
| 1D - LSTM | 64.0088 | 2.8279 |
| 1D - Transformer | 62.0816 | 10.0771 |
| 2D - GIN | 62.3065 | 2.6368 |
| 2D - GIN+VN | 62.3815 | 2.7417 |
| 2D - ChemProp | 61.0336 | 2.6616 |
| 2D - GraphGPS | 61.6251 | 2.4827 |
| 3D - SchNet | 17.7421 | 2.5488 |
| 3D - DimeNet++ | 14.6414 | 🥇 1.4503 |
| 3D - GemNet | 18.0338 | 1.6530 |
| 3D - PaiNN | 20.2359 | 2.1261 |
| 3D - ClofNet | 33.9473 | 2.6057 |
| 3D - LEFTNet | 19.7974 | 🥈 1.5328 |
| Ensemble - SchNet | 14.2238 | 1.9737 |
| Ensemble - DimeNet++ | 🥈 12.0259 | 1.4741 |
| Ensemble - GemNet | 🥇 11.6142 | 1.6059 |
| Ensemble - PaiNN | 13.5570 | 1.8744 |
| Ensemble - ClofNet | 13.9647 | 2.0106 |
| Ensemble - LEFTNet | 18.4189 | 1.5276 |

Strengths

  • Covers organometallics and catalysts in addition to drug-like molecules
  • Includes flexible drug-like molecules
  • Provides conformer ensembles rather than single structures
  • DFT-level accuracy for conformer energies
  • Reflects realistic scenarios (e.g., BDE, where DFT-computed conformers are unavailable)

Limitations

  • Only regression tasks
  • Does not cover the full breadth of chemical space
  • Computational cost of working with large conformer ensembles
  • EE appears to be proprietary-only

Technical Notes

Drugs-75K

  • Begins with SMILES from GEOM-Drugs.
  • Excludes molecules with fewer than 5 rotatable bonds.
  • Restricts atomic species to H, C, N, O, F, Si, P, S, and Cl.
  • Computes DFT-level conformers and energies (higher accuracy than GEOM-Drugs).
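
The dataset pairs each conformer with an energy and a statistical weight. One common way to derive such weights is Boltzmann weighting over relative conformer energies; the sketch below illustrates that idea and how a weighted ensemble average of a per-conformer property would follow from it. Whether MARCEL's weights are computed exactly this way is an assumption, as are the temperature, the kcal/mol units, and all example values and function names.

```python
import math

# Gas constant in kcal/(mol*K); assumes energies are given in kcal/mol.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal, temperature_k=298.15):
    """Normalized Boltzmann weights for a list of conformer energies."""
    # Shift by the minimum energy for numerical stability; weights are unchanged.
    e_min = min(energies_kcal)
    kt = R_KCAL_PER_MOL_K * temperature_k
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_average(values, weights):
    """Weighted average of a per-conformer property."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical three-conformer ensemble (relative energies in kcal/mol).
energies = [0.0, 0.5, 2.0]
properties = [1.2, 1.5, 2.0]  # hypothetical per-conformer property values

weights = boltzmann_weights(energies)
average = ensemble_average(properties, weights)
print(weights, average)
```

Low-energy conformers dominate the average, which is why accurate (DFT-level) relative energies matter for ensemble-based models.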

Kraken

Uses the original dataset directly but focuses on a small subset of its 78 properties; the results reported here cover four of them (B₅, L, BurB₅, and BurL).

EE

  • Uses Q2MM to simulate conformer ensembles of 872 catalyst-substrate pairs.
  • Seeks to predict enantiomeric excess (EE).
  • Not available publicly (as of September 2025).
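
For readers unfamiliar with the target quantity, enantiomeric excess is conventionally defined as the absolute difference between the amounts of the two enantiomers divided by their sum, expressed as a percentage. The sketch below implements that textbook definition; it is not taken from the MARCEL code, and the function name and inputs are illustrative.

```python
def enantiomeric_excess(major: float, minor: float) -> float:
    """Enantiomeric excess in percent from the amounts of the two enantiomers."""
    total = major + minor
    if total <= 0:
        raise ValueError("total amount of enantiomers must be positive")
    return 100.0 * abs(major - minor) / total


# A 95:5 mixture of enantiomers.
ee_95_5 = enantiomeric_excess(95.0, 5.0)
# A racemic (50:50) mixture.
ee_racemic = enantiomeric_excess(50.0, 50.0)
print(ee_95_5, ee_racemic)
```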

BDE

  • Consists of metallic centers with two flexible ligands.
  • Seeks to predict electronic dissociation energies, computed as the difference between the bound and unbound states.
  • Uses OpenBabel to generate initial conformers, followed by further geometry optimization.
  • Computes the actual energies with DFT.
  • Does not compute conformers with DFT, since a single DFT-level conformer search would take upwards of 2-3 days (as of September 2025).
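
The bound-minus-unbound calculation described above can be sketched as follows. This is a minimal illustration that assumes the target is the difference between the lowest energy found for the bound state and the lowest energy found for the unbound state; the use of the minimum, the sign convention, and the example energies are assumptions rather than details confirmed by the source.

```python
def dissociation_energy(bound_energies, unbound_energies):
    """Lowest bound-state energy minus lowest unbound-state energy.

    Both inputs are sequences of energies (same units) for the
    conformers sampled in each state.
    """
    if not bound_energies or not unbound_energies:
        raise ValueError("each state needs at least one energy")
    return min(bound_energies) - min(unbound_energies)


# Hypothetical energies for two conformers of each state.
bound = [-10.0, -12.5]
unbound = [-8.0, -9.0]

delta_e = dissociation_energy(bound, unbound)
print(delta_e)
```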

Related Datasets

| Dataset | Relationship | Link |
| --- | --- | --- |
| GEOM | Contains Subset | 📄 View Details |
| Kraken | Contains | N/A |
| EE | Contains | N/A |
| BDE | Contains | N/A |