Molecular Databases & Datasets on Hunter Heidenreich | ML Research Scientist

ZINC-22: A Multi-Billion Scale Database for Ligand Discovery

Sat, 27 Sep 2025 00:00:00 +0000

Key Contribution: Scaling Make-on-Demand Libraries

ZINC-22 addresses the critical infrastructure challenges of managing multi-billion-scale libraries of make-on-demand chemical compounds through a federated database architecture, the CartBlanche web interface, and cloud distribution systems that enable modern virtual screening.

Overview

ZINC-22 is a multi-billion scale public database of commercially available chemical compounds designed for virtual screening. It contains over 37 billion make-on-demand molecules and uses a distributed infrastructure to stay within the indexing limits of any single database. For structural biology pipelines, it provides 4.5 billion ready-to-dock 3D conformations alongside pre-calculated pH-specific protonation states, tautomers, and AMSOL partial charges.

Dataset Examples

ZINC-22’s 2D Tranche Browser showing the organization of 37.2 billion molecules by physicochemical properties

Dataset Subsets

Subset	Count	Description
2D Database	37B+	Complete 2D chemical structures from make-on-demand catalogs (Enamine REAL, Enamine REAL Space, WuXi GalaXi, Mcule Ultimate)
3D Database	4.5B+	Ready-to-dock 3D conformations with pre-calculated charges and solvation energies
Custom Tranches	Variable	User-selected molecular subsets via Tranche Browser (e.g., lead-like, fragment-like)

Use Cases

ZINC-22 is designed for ultra-large virtual screening (ULVS), analog searching, and molecular docking campaigns. The Tranche Browser enables targeted subset selection (e.g., lead-like, fragment-like) for screening, and the CartBlanche interface supports both interactive and programmatic access to the database. The authors note that as the database grows, docking can identify better-fitting molecules.

Dataset	Relationship	Link
ZINC-20	Predecessor
Enamine REAL	Source catalog
WuXi GalaXi	Source catalog

Strengths

Massive scale: 37+ billion purchasable compounds from major vendors (Enamine, WuXi, Mcule)
Federated architecture: Supports asynchronous building and horizontal scaling toward trillion-molecule libraries
Platform access: CartBlanche GUI provides a shopping cart metaphor for compound acquisition
Privacy protection: Dual public/private server clusters protect patentability of undisclosed catalogs
Chemical diversity: Linear growth (1 new scaffold per 10 molecules added), with 96.3M+ unique Bemis-Murcko scaffolds
Ready-to-dock: 3D models include pre-calculated charges, protonation states, and solvation energies
Cloud distribution: Available via AWS Open Data, Oracle OCI, and UCSF servers
Scale-aware search: SmallWorld (similarity) and Arthor (substructure) tools partitioned to address specific constraints of billion-scale queries
Organized access: Tranche system enables targeted selection of chemical space
Open access: Entire database freely available to academic and commercial users

Limitations

Data Transfer Bottlenecks: Distributing the 4.5 billion 3D models in standard docking formats (like db2 flexibase) requires roughly 1 petabyte of storage. Transferring that volume takes months over a standard gigabit connection, which makes local copies impractical and effectively pushes users toward the cloud-hosted copies.
Search Result Caps: Interactive Arthor searches are capped at 20,000 molecules to maintain a reliable public service. Users needing more results can use the asynchronous Arthor search tool via TLDR, which sends results by email.
Enumeration Ceiling: Scaling relies entirely on PostgreSQL sharding. To continue using rigid docking tools, the database must fully enumerate structural states. The authors acknowledge that hardware limitations will likely cap full database enumeration well before the 10-trillion molecule mark, forcing future pipelines to accommodate unenumerated combinatorial fragment spaces.
Download Workflow: Individual 3D molecule downloads are unavailable directly; researchers must rebuild them via the TLDR tool.
Vendor Updates: The federated structure makes it difficult to remove discontinued vendor molecules.
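
The "months" figure in the data-transfer limitation can be sanity-checked with a short calculation. This is a rough sketch, assuming 1 PB = 10^15 bytes and a fully saturated 1 Gbit/s link with no protocol overhead; both assumptions are illustrative and not taken from the paper.

```python
# Back-of-the-envelope transfer time for a ~1 PB payload.
# Assumptions (illustrative): 1 PB = 1e15 bytes, a saturated 1 Gbit/s
# link, and no protocol overhead, retries, or throttling.

PETABYTE_BYTES = 1e15
LINK_BITS_PER_SECOND = 1e9
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bps: float) -> float:
    """Return the ideal transfer time in days for a payload over a link."""
    seconds = size_bytes * 8 / link_bps
    return seconds / SECONDS_PER_DAY


days = transfer_days(PETABYTE_BYTES, LINK_BITS_PER_SECOND)
print(f"{days:.1f} days")
```

Under these idealized assumptions the transfer already takes about three months; real-world throughput would likely be lower.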

Technical Notes

Hardware & Software

Compute infrastructure:

1,700 cores across 14 computers for parallel processing
174 independent PostgreSQL 12.0 databases (110 ‘Sn’ for ZINC-ID, 64 ‘Sb’ for Supplier Codes)
Distributed across Amazon AWS, Oracle OCI, and UCSF servers

Software stack:

PostgreSQL 12.2
Python 3.6.8
RDKit 2020.03
Celery task queue with Redis for background processing
All code available on GitHub: docking-org/zinc22-2d, zinc22-3d

Data Organization & Access

Tranche system: Molecules organized into “Tranches” based on 4 dimensions:

Heavy Atom Count
Lipophilicity (LogP)
Charge
File Format

This enables downloading specific chemical neighborhoods (e.g., neutral lead-like molecules) without accessing the entire database.
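
As a toy illustration of the tranche idea, the sketch below buckets molecules by heavy atom count, a logP bin, and charge, then selects one bucket family. The `Molecule` type, the `tranche_key` helper, the 0.5 logP bin width, and the descriptor values are all hypothetical placeholders; this is not the naming or binning scheme ZINC-22 actually uses.

```python
import math
from collections import defaultdict
from typing import NamedTuple


class Molecule(NamedTuple):
    smiles: str
    heavy_atoms: int
    logp: float  # placeholder values below, not computed descriptors
    charge: int


def tranche_key(mol: Molecule, logp_bin_width: float = 0.5) -> tuple:
    """Map a molecule to a (heavy-atom count, logP bin, charge) bucket.

    The bin width and key layout are illustrative assumptions.
    """
    logp_bin = math.floor(mol.logp / logp_bin_width)
    return (mol.heavy_atoms, logp_bin, mol.charge)


molecules = [
    Molecule("CCO", 3, -0.1, 0),
    Molecule("CC(=O)O", 4, -0.2, 0),
    Molecule("c1ccccc1", 6, 2.0, 0),
    Molecule("CC[NH3+]", 3, -0.3, 1),
]

# Group molecules into buckets keyed by their tranche.
tranches = defaultdict(list)
for mol in molecules:
    tranches[tranche_key(mol)].append(mol.smiles)

# Select one "chemical neighborhood": neutral molecules with 3 heavy atoms.
neutral_small = [
    smiles
    for (hac, _logp_bin, charge), members in tranches.items()
    if hac == 3 and charge == 0
    for smiles in members
]
print(neutral_small)
```

The point of the design is that a subset can be selected by key alone, without scanning every molecule in the database.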

Search infrastructure: At the billion-molecule scale, search indexes no longer fit in main memory (RAM). ZINC-22 therefore splits retrieval between two distinct search tools:

SmallWorld: Handles whole-molecule similarity using Graph Edit Distance (GED). GED defines the minimum cost of operations (node/edge insertions, deletions, or substitutions) required to transform graph $G_1$ into graph $G_2$:

$$ \text{GED}(G_1, G_2) = \min_{(e_1, \dots, e_k) \in \mathcal{P}(G_1, G_2)} \sum_{i=1}^k c(e_i) $$

Because SmallWorld searches pre-calculated anonymous graphs, it evaluates close neighbors in near $\mathcal{O}(1)$ time and scales sub-linearly, though it struggles with highly distant structural matches.
Arthor: Provides exact substructure and pattern matching. It scales linearly $\mathcal{O}(N)$ with database size and successfully finds distant hits (e.g., PAINS filters), but performance heavily degrades if the index exceeds available RAM.
CartBlanche: Web interface wrapping these search tools with shopping cart functionality.
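
To make the GED definition concrete, here is a brute-force sketch for tiny labeled graphs with unit edit costs. It only illustrates the definition: it is exponential in graph size and is not how SmallWorld works, which relies on pre-calculated graphs. The function name and the graph encoding (a list of atom labels plus index pairs for bonds) are assumptions made for this example.

```python
from itertools import permutations


def graph_edit_distance(nodes1, edges1, nodes2, edges2):
    """Exact unit-cost GED by trying every node mapping (tiny graphs only).

    Nodes are label lists; edges are pairs of node indices. The smaller
    graph is padded with None "dummy" nodes, so mapping a real node to a
    dummy counts as one insertion or deletion.
    """
    n = max(len(nodes1), len(nodes2))
    a = list(nodes1) + [None] * (n - len(nodes1))
    b = list(nodes2) + [None] * (n - len(nodes2))
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}

    best = None
    for perm in permutations(range(n)):
        # Node substitutions, insertions, and deletions.
        cost = sum(1 for i in range(n) if a[i] != b[perm[i]])
        # Edge insertions and deletions under this mapping.
        for i in range(n):
            for j in range(i + 1, n):
                in1 = frozenset((i, j)) in e1
                in2 = frozenset((perm[i], perm[j])) in e2
                if in1 != in2:
                    cost += 1
        if best is None or cost < best:
            best = cost
    return best


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propane = (["C", "C", "C"], [(0, 1), (1, 2)])
ethane = (["C", "C"], [(0, 1)])

print(graph_edit_distance(*ethanol, *propane))  # one atom substitution
print(graph_edit_distance(*ethanol, *ethane))   # delete one atom and one bond
```

With unit costs, substituting a node is never more expensive than deleting it and inserting another, which is why padding to the larger graph and enumerating bijections is sufficient here.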

3D Generation Pipeline

The 3D database construction pipeline involves multiple specialized tools:

ChemAxon JChem: Protonation state and tautomer generation at physiological pH
Corina: Initial 3D structure generation
Omega: Conformation sampling
AMSOL 7.1: Calculation of atomic partial charges and desolvation energies
Strain calculation: Relative energies of conformations

At sustained throughput, the pipeline builds approximately 11 million molecules per day, each with hundreds of pre-calculated conformations.
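
For a sense of scale, the quoted throughput can be combined with the 3D database size. This is a simple estimate that assumes a constant build rate and a single pass with no rebuilds; neither assumption comes from the paper.

```python
# Rough build-time estimate from figures quoted in this entry.
MOLECULES_3D = 4.5e9        # size of the 3D database
BUILD_RATE_PER_DAY = 11e6   # sustained pipeline throughput

build_days = MOLECULES_3D / BUILD_RATE_PER_DAY
build_years = build_days / 365.25
print(f"{build_days:.0f} days (~{build_years:.1f} years)")
```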

Chemical Diversity Analysis

A core debate in billion-scale library generation is whether continued enumeration merely yields repetitive derivatives. Analysis of Bemis-Murcko (BM) scaffolds shows that chemical diversity in ZINC-22 continues to grow, but sub-linearly, following a power law. Specifically, the authors observe a one-log-unit increase in BM scaffolds for every two-log-unit increase in database size:

$$ \log(\text{Scaffolds}_{BM}) \propto 0.5 \log(\text{Molecules}) $$

This suggests that while diversity does not saturate, it grows proportionally to the square root of the library size ($\mathcal{O}(\sqrt{N})$). The majority of this scaffold novelty stems from compounds with the highest heavy atom counts (HAC 24-25), which contribute roughly twice as many unique core structures as the combined HAC 06-23 subset.
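
The practical meaning of the 0.5 exponent can be shown with a few lines of arithmetic. The helper name below is hypothetical, and the sketch assumes the power law holds exactly over the range considered.

```python
SCALING_EXPONENT = 0.5  # slope of log(scaffolds) vs. log(molecules) quoted above


def scaffold_growth_factor(library_growth_factor: float,
                           exponent: float = SCALING_EXPONENT) -> float:
    """Factor by which scaffolds grow when the library grows by a given factor."""
    return library_growth_factor ** exponent


for factor in (10, 100, 1000):
    growth = scaffold_growth_factor(factor)
    print(f"{factor}x molecules -> {growth:.1f}x scaffolds")
```

Under this scaling, a 100-fold larger library yields roughly 10 times as many scaffolds.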

Vendor Integration

ZINC-22 is built from five source catalogs with the following approximate sizes:

Enamine REAL Database: 5 billion compounds
Enamine REAL Space: 29 billion compounds
WuXi GalaXi: 2.5 billion compounds
Mcule Ultimate: 128 million compounds
ZINC20 in-stock: 4 million compounds (incorporated as layer “g”)

This focus on purchasable, make-on-demand molecules distinguishes ZINC-22 from theoretical chemical space databases. ZINC20 continues to be maintained separately for smaller catalogs and in-stock compounds.

Reproducibility Details

Artifact	Type	License	Notes
CartBlanche web interface	Dataset	Free access	Web GUI for searching and downloading ZINC-22
docking-org/zinc22-2d	Code	BSD-3-Clause	2D curation and loading pipeline
docking-org/zinc22-3d	Code	Unknown	3D building pipeline
docking-org/cartblanche22	Code	Unknown	CartBlanche22 web application
AWS Open Data / Oracle OCI	Dataset	Free access	Cloud-hosted 3D database mirrors

Data Availability: The compiled database is openly accessible and searchable through the CartBlanche web interface. Subsets can be downloaded, and programmatic access is provided via curl, wget, and Globus.
Code & Algorithms: The source code for database construction, parallel processing, and querying is open-source.
- 2D Pipeline: docking-org/zinc22-2d
- 3D Pipeline: docking-org/zinc22-3d
- CartBlanche: docking-org/cartblanche22
- TLDR modules: docking-org/TLDR and docking-org/tldr-modules (repositories no longer available)
Software Dependencies: While the orchestration code is public, the 3D structure generation relies on commercial software that requires separate licenses (CORINA, OpenEye OMEGA, ChemAxon JChem). This limits end-to-end reproducibility for researchers without access to these tools.
Hardware Limitations: Recreating the entire 37+ billion molecule database from raw vendor catalogs requires approximately 1,700 CPU cores and petabytes of data transfer, restricting full recreation to large institutional clusters or substantial cloud compute budgets.

Paper Information

Tingle, B. I., Tang, K. G., Castanon, M., Gutierrez, J. J., Khurelbaatar, M., Dandarchuluun, C., Moroz, Y. S., and Irwin, J. J. (2023). ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. Journal of Chemical Information and Modeling, 63(4), 1166–1176. https://doi.org/10.1021/acs.jcim.2c01253

@article{Tingle_2023,
    title={ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery},
    volume={63},
    ISSN={1549-960X},
    url={http://dx.doi.org/10.1021/acs.jcim.2c01253},
    DOI={10.1021/acs.jcim.2c01253},
    number={4},
    journal={Journal of Chemical Information and Modeling},
    publisher={American Chemical Society (ACS)},
    author={Tingle, Benjamin I. and Tang, Khanh G. and Castanon, Mar and Gutierrez, John J. and Khurelbaatar, Munkhzul and Dandarchuluun, Chinzorig and Moroz, Yurii S. and Irwin, John J.},
    year={2023},
    month={Feb},
    pages={1166--1176}
}

MARCEL: Molecular Conformer Ensemble Learning Benchmark

Mon, 08 Sep 2025 00:00:00 +0000

Key Contribution

MARCEL provides a benchmark for conformer ensemble learning. It demonstrates that explicitly modeling full conformer distributions improves property prediction across drug-like molecules and organometallic catalysts.

Overview

The Molecular Representation and Conformer Ensemble Learning (MARCEL) dataset provides 722K+ conformations across 76K+ molecules spanning four diverse chemical domains: drug-like molecules (Drugs-75K), organophosphorus ligands (Kraken), chiral catalysts (EE), and organometallic complexes (BDE). MARCEL evaluates conformer ensemble methods across both pharmaceutical and catalysis applications.

Dataset Examples

Example conformer from Drugs-75K (SMILES: COC(=O)[C@@]1(Cc2ccc(OC)cc2)[C@H]2c3cc(C(=O)N(C)C)n(Cc4ccc(OC(F)(F)F)cc4)c3C[C@H]2CN1C(=O)c1ccccc1; IUPAC: methyl (2R,3R,6R)-4-benzoyl-10-(dimethylcarbamoyl)-3-[(4-methoxyphenyl)methyl]-9-[[4-(trifluoromethoxy)phenyl]methyl]-4,9-diazatricyclo[6.3.0.02,6]undeca-1(8),10-diene-3-carboxylate)

2D structure of Drugs-75K conformer above

Example conformer from Kraken (ligand 10, conformer 0) in 2D

Example conformer from Kraken (ligand 10, conformer 0) in 3D

Example substrate from BDE in 3D (Pt_9.63)

2D structure of BDE substrate above

Dataset Subsets

Subset	Count	Description
Drugs-75K	75,099 molecules	Drug-like molecules with at least 5 rotatable bonds
Kraken	1,552 molecules	Monodentate organophosphorus (III) ligands
EE	872 reactions	Rhodium (Rh)-bound atropisomeric catalyst-substrate pairs derived from chiral bisphosphine ligands
BDE	5,915 reactions	Organometallic catalysts ML$_1$L$_2$ with electronic binding energies

Benchmarks

Ionization Potential (Drugs-75K)

Predict ionization potential from molecular structure

Subset: Drugs-75K

Rank	Model	MAE (eV)
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.4066
🥈 2	3D - GemNet Geometry-enhanced message passing (single conformer)	0.4069
🥉 3	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.4126
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.4149
5	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.4174
6	Ensemble - ClofNet ClofNet on full conformer ensemble	0.428
7	2D - GraphGPS Graph Transformer with positional encodings	0.4351
8	2D - GIN Graph Isomorphism Network	0.4354
9	2D - GIN+VN GIN with Virtual Nodes	0.4361
10	3D - ClofNet Complete local frames network (single conformer)	0.4393
11	3D - SchNet Continuous-filter convolutional network (single conformer)	0.4394
12	3D - DimeNet++ Directional message passing network (single conformer)	0.4441
13	Ensemble - SchNet SchNet on full conformer ensemble	0.4452
14	Ensemble - PaiNN PaiNN on full conformer ensemble	0.4466
15	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.4505
16	2D - ChemProp Message Passing Neural Network	0.4595
17	1D - LSTM LSTM on SMILES sequences	0.4788
18	1D - Random forest Random Forest on Morgan fingerprints	0.4987
19	1D - Transformer Transformer on SMILES sequences	0.6617

Electron Affinity (Drugs-75K)

Predict electron affinity from molecular structure

Subset: Drugs-75K

Rank	Model	MAE (eV)
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.391
🥈 2	3D - GemNet Geometry-enhanced message passing (single conformer)	0.3922
🥉 3	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.3944
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.3953
5	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.3964
6	Ensemble - ClofNet ClofNet on full conformer ensemble	0.4033
7	2D - GraphGPS Graph Transformer with positional encodings	0.4085
8	2D - GIN Graph Isomorphism Network	0.4169
9	2D - GIN+VN GIN with Virtual Nodes	0.4169
10	3D - SchNet Continuous-filter convolutional network (single conformer)	0.4207
11	Ensemble - SchNet SchNet on full conformer ensemble	0.4232
12	3D - DimeNet++ Directional message passing network (single conformer)	0.4233
13	3D - ClofNet Complete local frames network (single conformer)	0.4251
14	Ensemble - PaiNN PaiNN on full conformer ensemble	0.4269
15	2D - ChemProp Message Passing Neural Network	0.4417
16	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.4495
17	1D - LSTM LSTM on SMILES sequences	0.4648
18	1D - Random forest Random Forest on Morgan fingerprints	0.4747
19	1D - Transformer Transformer on SMILES sequences	0.585

Electronegativity (Drugs-75K)

Predict electronegativity (χ) from molecular structure

Subset: Drugs-75K

Rank	Model	MAE (eV)
🥇 1	3D - GemNet Geometry-enhanced message passing (single conformer)	0.197
🥈 2	Ensemble - GemNet GemNet on full conformer ensemble	0.2027
🥉 3	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.2069
4	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.2083
5	Ensemble - ClofNet ClofNet on full conformer ensemble	0.2199
6	2D - GraphGPS Graph Transformer with positional encodings	0.2212
7	3D - SchNet Continuous-filter convolutional network (single conformer)	0.2243
8	Ensemble - SchNet SchNet on full conformer ensemble	0.2243
9	2D - GIN Graph Isomorphism Network	0.226
10	2D - GIN+VN GIN with Virtual Nodes	0.2267
11	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.2267
12	Ensemble - PaiNN PaiNN on full conformer ensemble	0.2294
13	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.2324
14	3D - ClofNet Complete local frames network (single conformer)	0.2378
15	3D - DimeNet++ Directional message passing network (single conformer)	0.2436
16	2D - ChemProp Message Passing Neural Network	0.2441
17	1D - LSTM LSTM on SMILES sequences	0.2505
18	1D - Random forest Random Forest on Morgan fingerprints	0.2732
19	1D - Transformer Transformer on SMILES sequences	0.4073

B₅ Sterimol Parameter (Kraken)

Predict B₅ sterimol descriptor for organophosphorus ligands

Subset: Kraken

Rank	Model	MAE
🥇 1	Ensemble - PaiNN PaiNN on full conformer ensemble	0.2225
🥈 2	Ensemble - GemNet GemNet on full conformer ensemble	0.2313
🥉 3	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.263
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.2644
5	Ensemble - SchNet SchNet on full conformer ensemble	0.2704
6	3D - GemNet Geometry-enhanced message passing (single conformer)	0.2789
7	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.3072
8	2D - GIN Graph Isomorphism Network	0.3128
9	Ensemble - ClofNet ClofNet on full conformer ensemble	0.3228
10	3D - SchNet Continuous-filter convolutional network (single conformer)	0.3293
11	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.3443
12	2D - GraphGPS Graph Transformer with positional encodings	0.345
13	3D - DimeNet++ Directional message passing network (single conformer)	0.351
14	2D - GIN+VN GIN with Virtual Nodes	0.3567
15	1D - Random forest Random Forest on Morgan fingerprints	0.476
16	2D - ChemProp Message Passing Neural Network	0.485
17	3D - ClofNet Complete local frames network (single conformer)	0.4873
18	1D - LSTM LSTM on SMILES sequences	0.4879
19	1D - Transformer Transformer on SMILES sequences	0.9611

L Sterimol Parameter (Kraken)

Predict L sterimol descriptor for organophosphorus ligands

Subset: Kraken

Rank	Model	MAE
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.3386
🥈 2	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.3468
🥉 3	Ensemble - PaiNN PaiNN on full conformer ensemble	0.3619
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.3643
5	3D - GemNet Geometry-enhanced message passing (single conformer)	0.3754
6	2D - GIN Graph Isomorphism Network	0.4003
7	3D - DimeNet++ Directional message passing network (single conformer)	0.4174
8	1D - Random forest Random Forest on Morgan fingerprints	0.4303
9	Ensemble - SchNet SchNet on full conformer ensemble	0.4322
10	2D - GIN+VN GIN with Virtual Nodes	0.4344
11	2D - GraphGPS Graph Transformer with positional encodings	0.4363
12	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.4471
13	Ensemble - ClofNet ClofNet on full conformer ensemble	0.4485
14	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.4493
15	1D - LSTM LSTM on SMILES sequences	0.5142
16	2D - ChemProp Message Passing Neural Network	0.5452
17	3D - SchNet Continuous-filter convolutional network (single conformer)	0.5458
18	3D - ClofNet Complete local frames network (single conformer)	0.6417
19	1D - Transformer Transformer on SMILES sequences	0.8389

Buried B₅ Parameter (Kraken)

Predict buried B₅ sterimol descriptor for organophosphorus ligands

Subset: Kraken

Rank	Model	MAE
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.1589
🥈 2	Ensemble - PaiNN PaiNN on full conformer ensemble	0.1693
🥉 3	2D - GIN Graph Isomorphism Network	0.1719
4	3D - GemNet Geometry-enhanced message passing (single conformer)	0.1782
5	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.1783
6	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.2017
7	Ensemble - SchNet SchNet on full conformer ensemble	0.2024
8	2D - GraphGPS Graph Transformer with positional encodings	0.2066
9	3D - DimeNet++ Directional message passing network (single conformer)	0.2097
10	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.2176
11	Ensemble - ClofNet ClofNet on full conformer ensemble	0.2178
12	3D - SchNet Continuous-filter convolutional network (single conformer)	0.2295
13	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.2395
14	2D - GIN+VN GIN with Virtual Nodes	0.2422
15	1D - Random forest Random Forest on Morgan fingerprints	0.2758
16	1D - LSTM LSTM on SMILES sequences	0.2813
17	3D - ClofNet Complete local frames network (single conformer)	0.2884
18	2D - ChemProp Message Passing Neural Network	0.3002
19	1D - Transformer Transformer on SMILES sequences	0.4929

Buried L Parameter (Kraken)

Predict buried L sterimol descriptor for organophosphorus ligands

Subset: Kraken

Rank	Model	MAE
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.0947
🥈 2	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.1185
🥉 3	2D - GIN Graph Isomorphism Network	0.12
4	Ensemble - PaiNN PaiNN on full conformer ensemble	0.1324
5	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.1386
6	Ensemble - SchNet SchNet on full conformer ensemble	0.1443
7	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.1486
8	2D - GraphGPS Graph Transformer with positional encodings	0.15
9	1D - Random forest Random Forest on Morgan fingerprints	0.1521
10	3D - DimeNet++ Directional message passing network (single conformer)	0.1526
11	Ensemble - ClofNet ClofNet on full conformer ensemble	0.1548
12	3D - GemNet Geometry-enhanced message passing (single conformer)	0.1635
13	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.1673
14	2D - GIN+VN GIN with Virtual Nodes	0.1741
15	3D - SchNet Continuous-filter convolutional network (single conformer)	0.1861
16	1D - LSTM LSTM on SMILES sequences	0.1924
17	2D - ChemProp Message Passing Neural Network	0.1948
18	3D - ClofNet Complete local frames network (single conformer)	0.2529
19	1D - Transformer Transformer on SMILES sequences	0.2781

Enantioselectivity (EE)

Predict enantiomeric excess for Rh-catalyzed asymmetric reactions

Subset: EE

Rank	Model	MAE (%)
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	11.61
🥈 2	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	12.03
🥉 3	Ensemble - PaiNN PaiNN on full conformer ensemble	13.56
4	Ensemble - ClofNet ClofNet on full conformer ensemble	13.96
5	Ensemble - SchNet SchNet on full conformer ensemble	14.22
6	3D - DimeNet++ Directional message passing network (single conformer)	14.64
7	3D - SchNet Continuous-filter convolutional network (single conformer)	17.74
8	3D - GemNet Geometry-enhanced message passing (single conformer)	18.03
9	Ensemble - LEFTNet LEFTNet on full conformer ensemble	18.42
10	3D - LEFTNet Local Environment Feature Transformer (single conformer)	19.8
11	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	20.24
12	3D - ClofNet Complete local frames network (single conformer)	33.95
13	2D - ChemProp Message Passing Neural Network	61.03
14	1D - Random forest Random Forest on Morgan fingerprints	61.3
15	2D - GraphGPS Graph Transformer with positional encodings	61.63
16	1D - Transformer Transformer on SMILES sequences	62.08
17	2D - GIN Graph Isomorphism Network	62.31
18	2D - GIN+VN GIN with Virtual Nodes	62.38
19	1D - LSTM LSTM on SMILES sequences	64.01

Bond Dissociation Energy (BDE)

Predict metal-ligand bond dissociation energy for organometallic catalysts

Subset: BDE

Rank	Model	MAE (kcal/mol)
🥇 1	3D - DimeNet++ Directional message passing network (single conformer)	1.45
🥈 2	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	1.47
🥉 3	3D - LEFTNet Local Environment Feature Transformer (single conformer)	1.53
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	1.53
5	Ensemble - GemNet GemNet on full conformer ensemble	1.61
6	3D - GemNet Geometry-enhanced message passing (single conformer)	1.65
7	Ensemble - PaiNN PaiNN on full conformer ensemble	1.87
8	Ensemble - SchNet SchNet on full conformer ensemble	1.97
9	Ensemble - ClofNet ClofNet on full conformer ensemble	2.01
10	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	2.13
11	2D - GraphGPS Graph Transformer with positional encodings	2.48
12	3D - SchNet Continuous-filter convolutional network (single conformer)	2.55
13	3D - ClofNet Complete local frames network (single conformer)	2.61
14	2D - GIN Graph Isomorphism Network	2.64
15	2D - ChemProp Message Passing Neural Network	2.66
16	2D - GIN+VN GIN with Virtual Nodes	2.74
17	1D - LSTM LSTM on SMILES sequences	2.83
18	1D - Random forest Random Forest on Morgan fingerprints	3.03
19	1D - Transformer Transformer on SMILES sequences	10.08

Dataset	Relationship	Link
GEOM	Source	Notes

Strengths

Domain diversity: Beyond drug-like molecules, includes organometallics and catalysts rarely covered in existing benchmarks
Ensemble-based: Provides full conformer ensembles with statistical weights
DFT-quality energies: Drugs-75K features DFT-level conformers and energies (higher accuracy than GEOM-Drugs)
Realistic scenarios: BDE subset models the practical constraint of lacking DFT-computed conformers for large catalyst systems
Comprehensive baselines: Benchmarks 18 models across 1D (SMILES), 2D (graph), 3D (single conformer), and ensemble methods
Property diversity: Covers ionization potential, electron affinity, electronegativity, ligand descriptors, and catalytic properties

Limitations

Regression only: All tasks evaluate regression metrics exclusively
Chemical space coverage: The 76K molecules cover only a small fraction of the drug-like and catalyst chemical spaces
Compute requirements: Working with large conformer ensembles demands significant computational resources
Proprietary data: EE subset is proprietary (as of December 2025)
DFT bottleneck: The BDE subset illustrates a practical limitation: a single DFT optimization can take 2-3 days, making conformer-level DFT infeasible for large organometallics
Uniform sampling baseline: The data augmentation strategy first tested for handling ensembles samples conformers uniformly rather than by Boltzmann weight. This physically unprincipled choice likely explains why the strategy occasionally introduces noise and fails to help complex 3D architectures.
Drugs-75K properties: The large-scale benchmark (Drugs-75K) targets only electronic properties (ionization potential, electron affinity, electronegativity). As the authors highlight in Section 5.2, these properties are generally less sensitive to conformational rotations than steric or spatial descriptors. This makes it hard to judge whether explicit conformer ensembles actually benefit large-scale regression tasks.
Unrealistic single-conformer baselines: The 3D single-conformer models are exclusively evaluated on the lowest-energy conformer. This setup is inherently flawed for real-world application, as knowing the global minimum a priori requires exhaustively searching and computing energies for the entire conformer space.

Technical Notes

Data Generation Pipeline

Drugs-75K

Source: GEOM-Drugs subset

Filtering:

Minimum 5 rotatable bonds (focus on flexible molecules)
Allowed elements: H, C, N, O, F, Si, P, S, Cl

Conformer generation:

DFT-level calculations for both conformers and energies
Higher accuracy than original GEOM-Drugs (semi-empirical GFN2-xTB)

Properties: Ionization Potential (IP), Electron Affinity (EA), Electronegativity (χ)

Kraken

Source: Original Kraken dataset (1,552 monodentate organophosphorus(III) ligands)

Properties: 4 of 78 available properties (selected for high variance across conformer ensembles)

$B_5$: Sterimol B5, maximum width of substituent (steric descriptor)
$L$: Sterimol L, length of substituent (steric descriptor)
$\text{Bur}B_5$: Buried Sterimol B5, steric effects within the first coordination sphere
$\text{Bur}L$: Buried Sterimol L, steric effects within the first coordination sphere

EE (Enantiomeric Excess)

Generation method: Q2MM (Quantum-guided Molecular Mechanics)

Reactions: 872 catalyst-substrate pairs involving 253 rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphine ligands and 10 enamide substrates

Property: Enantiomeric excess (EE) for asymmetric catalysis

Availability: Proprietary-only (closed-source as of December 2025)

BDE (Bond Dissociation Energy)

Molecules: 5,915 organometallic catalysts (ML₁L₂ structure)

Initial conformers: OpenBabel with geometric optimization

Energies: DFT calculations

Property: Electronic binding energy (difference in minimum energies of bound-catalyst complex and unbound catalyst)

Key constraint: DFT optimization of full conformer ensembles is computationally infeasible (2-3 days per molecule)

Benchmark Setup

Task: Predict molecular properties from structure using different representation strategies (1D/2D/3D/Ensemble). The ground-truth regression targets are calculated as the Boltzmann-averaged value of the property across the conformer ensemble:

$$ \langle y \rangle_{k_B} = \sum_{\mathbf{C}_i \in \mathcal{C}} p_i y_i $$

Where $p_i$ is the conformer probability (Boltzmann weight) under experimental conditions derived from the conformer energy $e_i$:

$$ p_i = \frac{\exp(-e_i / k_B T)}{\sum_j \exp(-e_j / k_B T)} $$
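
The two formulas above translate directly into code. The following is a minimal sketch, assuming energies in kcal/mol and a temperature of 298.15 K; the function names and the example energies and property values are made up for illustration and are not taken from the MARCEL repository.

```python
import math

K_B_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_weights(energies, temperature=298.15):
    """Return p_i = exp(-e_i / k_B T) / sum_j exp(-e_j / k_B T)."""
    kt = K_B_KCAL_PER_MOL_K * temperature
    e_min = min(energies)  # shifting by the minimum leaves p_i unchanged
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def boltzmann_average(values, energies, temperature=298.15):
    """Return the Boltzmann-averaged property <y> = sum_i p_i * y_i."""
    weights = boltzmann_weights(energies, temperature)
    return sum(p * y for p, y in zip(weights, values))


energies = [0.0, 0.5, 2.0]  # hypothetical relative conformer energies
values = [1.0, 2.0, 3.0]    # hypothetical per-conformer property values

weights = boltzmann_weights(energies)
target = boltzmann_average(values, energies)
print(weights, target)
```

Subtracting the minimum energy before exponentiating avoids underflow for high-energy conformers and cancels out in the normalization.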

Data splits: Datasets are partitioned 70% train, 10% validation, and 20% test.
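
A 70/10/20 partition can be produced as follows. This sketch assumes a seeded random split, which may differ from the splitting procedure actually used in the benchmark; the function name is hypothetical.

```python
import random


def split_indices(n, seed=0, train_frac=0.7, valid_frac=0.1):
    """Shuffle indices 0..n-1 and cut them into train/validation/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n)
    n_valid = int(valid_frac * n)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # remainder, roughly 20%
    return train, valid, test


# Example with the Kraken subset size quoted earlier (1,552 molecules).
train, valid, test = split_indices(1552)
print(len(train), len(valid), len(test))
```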

Model categories:

1D Models: SMILES-based (Random Forest on concatenated MACCS/ECFP/RDKit fingerprints, LSTM, Transformer).
2D Models: Graph-based (GIN, GIN+VN, ChemProp, GraphGPS).
3D Models: Single conformer (SchNet, DimeNet++, GemNet, PaiNN, ClofNet, LEFTNet). For evaluation, single-conformer 3D models use only the lowest-energy conformer. This baseline often performs well but is unrealistic in practice, because identifying the global minimum requires exhaustively searching the entire conformer space.
Ensemble Models: Full conformer ensemble processing via explicit set encoders. For each conformer embedding $\mathbf{z}_i$, three aggregation strategies are evaluated:

Mean Pooling: $$ \mathbf{s}_{\text{MEAN}} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \mathbf{z}_i $$

DeepSets: $$ \mathbf{s}_{\text{DS}} = g\left(\sum_{i=1}^{|\mathcal{C}|} h(\mathbf{z}_i)\right) $$

Self-Attention: $$ \begin{aligned} \mathbf{s}_{\text{ATT}} &= \sum_{i=1}^{|\mathcal{C}|} \mathbf{c}_i, \quad \text{where} \quad \mathbf{c}_i = g\left( \sum_{j=1}^{|\mathcal{C}|} \alpha_{ij} h(\mathbf{z}_j) \right) \\ \alpha_{ij} &= \frac{\exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_j))\right)}{\sum_{k=1}^{|\mathcal{C}|} \exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_k))\right)} \end{aligned} $$
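
The three aggregation strategies above can be sketched in plain Python. In the benchmark, $h$, $g$, and $\mathbf{W}$ are learned; here they are replaced by fixed stand-ins (ReLU, identity, identity) purely for illustration, and all function names are assumptions rather than the repository's API.

```python
import math
from typing import List

Vector = List[float]


def vec_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def mean_pool(z: List[Vector]) -> Vector:
    """s_MEAN: average of the conformer embeddings."""
    return [x / len(z) for x in vec_sum(z)]


def deep_sets(z, h, g):
    """s_DS = g(sum_i h(z_i))."""
    return g(vec_sum([h(z_i) for z_i in z]))


def self_attention(z, h, g, w):
    """s_ATT = sum_i g(sum_j alpha_ij h(z_j)), alpha_ij a softmax over j."""
    hz = [h(z_i) for z_i in z]
    projected = [w(v) for v in hz]
    contexts = []
    for q in projected:
        scores = [sum(a * b for a, b in zip(q, k)) for k in projected]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        mixed = vec_sum([[e / total * x for x in v] for e, v in zip(exps, hz)])
        contexts.append(g(mixed))
    return vec_sum(contexts)


def relu(v):
    return [max(0.0, x) for x in v]


def identity(v):
    return list(v)


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy conformer embeddings
s_mean = mean_pool(embeddings)
s_ds = deep_sets(embeddings, relu, identity)
s_att = self_attention(embeddings, relu, identity, identity)
s_att_reversed = self_attention(embeddings[::-1], relu, identity, identity)
print(s_mean, s_ds, s_att)
```

All three encoders are permutation-invariant: reordering the conformers leaves the pooled vector unchanged up to floating-point rounding, which is the property that makes them suitable for unordered conformer sets.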

Evaluation metric: Mean Absolute Error (MAE) for all tasks.

Key Findings

Ensemble superiority (task-dependent): Across benchmarks, explicitly modeling the full conformer set using DeepSets often achieved top performance. However, these improvements are not uniform:

Small-Scale Success: Ensemble methods show large improvements on tasks like Kraken (Ensemble PaiNN achieves 0.2225 on $B_5$ vs 0.3443 single) and EE (Ensemble GemNet achieves 11.61% vs 18.03% single).
Large-Scale Plateau: The performance improvements did not strongly transfer to large subsets like Drugs-75K (best ensemble strategy for GemNet achieves 0.4066 eV on IP vs 0.4069 eV single). The authors conjecture that the computational burden of encoding all conformers in each ensemble alters learning dynamics and increases training difficulty.
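
The contrast between the small-scale and large-scale results is easier to see as relative MAE reductions. The sketch below uses only the numbers quoted in the two points above; the helper name is hypothetical.

```python
def relative_mae_reduction(single_mae, ensemble_mae):
    """Fraction by which the ensemble model lowers MAE versus a single conformer."""
    return (single_mae - ensemble_mae) / single_mae


# (single-conformer MAE, ensemble MAE) pairs quoted in the text.
comparisons = {
    "Kraken B5 (PaiNN)": (0.3443, 0.2225),
    "EE (GemNet)": (18.03, 11.61),
    "Drugs-75K IP (GemNet)": (0.4069, 0.4066),
}

reductions = {
    name: relative_mae_reduction(single, ensemble)
    for name, (single, ensemble) in comparisons.items()
}
for name, reduction in reductions.items():
    print(f"{name}: {100 * reduction:.2f}% lower MAE")
```

The two small datasets show reductions of roughly 35%, while the Drugs-75K ionization potential task improves by less than 0.1%.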

Conformer Sampling for Noise: Data augmentation (randomly sampling one conformer from an ensemble during training) improves performance and robustness when underlying conformers are imprecise (e.g., the forcefield-generated conformers in the BDE subset).

3D vs 2D: 3D models generally outperform 2D graph models, especially for conformationally-sensitive properties, though 1D and 2D methods remain highly competitive on low-resource datasets or less rotation-sensitive properties.

Model architecture: No single model dominates all tasks. GemNet and LEFTNet excel on large-scale Drugs-75K, while DimeNet++ shows strong performance on smaller Kraken and reaction datasets. Model selection depends on dataset size and task characteristics.

Reproducibility Details

Artifact	Type	License	Notes
SXKDZ/MARCEL	Code + Dataset	Apache-2.0	Benchmark suite, dataset loaders, and hyperparameter configs
Drugs-75K	Dataset	Apache-2.0	DFT-level conformers and energies derived from GEOM-Drugs
Kraken	Dataset	Copyright retained by original authors	Conformer ensembles and four steric descriptors
BDE	Dataset	Apache-2.0	OpenBabel-generated conformers with DFT binding energies
EE	Dataset	Proprietary	Closed-source as of 2026

Data: The Drugs-75K, Kraken, and BDE subsets are openly available via the project’s GitHub repository. The EE dataset remains closed-source/proprietary (as of 2026), making the EE suite of the benchmark currently irreproducible.
Code: The benchmark suite and PyTorch-Geometric dataset loaders are open-sourced at GitHub (SXKDZ/MARCEL) under the Apache-2.0 license.
Hardware: The authors trained models using Nvidia A100 (40GB) GPUs. Memory-intensive models (e.g., GemNet, LEFTNet) required Nvidia H100 (80GB) GPUs. Total computation across all benchmark experiments was approximately 6,000 GPU hours.
Algorithms/Models: Hyperparameters for all 18 evaluated models are provided in the repository configuration files (benchmarks/params). All baseline models use publicly available frameworks (e.g., PyTorch Geometric, OGB, RDKit).
Evaluation: Evaluation scripts are provided in the repository with consistent tracking of Mean Absolute Error (MAE) and proper configuration of benchmark splits.

Paper Information

Zhu, Y., Hwang, J., Adams, K., Liu, Z., Nan, B., Stenfors, B., Du, Y., Chauhan, J., Wiest, O., Isayev, O., Coley, C. W., Sun, Y., and Wang, W. (2024). Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks. In The Twelfth International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=NSDszJ2uIV

@inproceedings{zhu2024learning,
title={Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks},
author={Yanqiao Zhu and Jeehyun Hwang and Keir Adams and Zhen Liu and Bozhao Nan and Brock Stenfors and Yuanqi Du and Jatin Chauhan and Olaf Wiest and Olexandr Isayev and Connor W. Coley and Yizhou Sun and Wei Wang},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=NSDszJ2uIV}
}

GEOM: Energy-Annotated Molecular Conformations Dataset

Thu, 04 Sep 2025 00:00:00 +0000

Key Contribution

GEOM addresses the gap between 2D molecular graphs and flexible 3D properties by providing 450k+ molecules with 37M+ conformations. This extensive sampling connects conformer ensembles to experimental properties, providing the necessary infrastructure to benchmark conformer generation methods and train 3D-aware property predictors.

Overview

The Geometric Ensemble Of Molecules (GEOM) dataset provides energy-annotated molecular conformations generated through systematic computational methods. The dataset includes molecules from drug discovery campaigns (AICures), quantum chemistry benchmarks (QM9), and molecular property prediction benchmarks (MoleculeNet), with conformations sampled using CREST/GFN2-xTB and a subset refined with high-quality DFT calculations.

Dataset Examples

Example SARS-CoV-2 3CL protease active molecule: CC(=O)Nc1ccc(Oc2ncccn2)cc1 (N-(4-pyrimidin-2-yloxyphenyl)acetamide)

Dataset Subsets

Subset	Count	Description
Drug-like (AICures)	304,466 molecules	Drug-like molecules from AICures COVID-19 challenge (avg 44 atoms)
QM9	133,258 molecules	Small molecules from QM9 (up to 9 heavy atoms)
MoleculeNet	16,865 molecules	Molecules from MoleculeNet benchmarks for physical chemistry, biophysics, and physiology (includes BACE)
BACE (High-quality DFT)	1,511 molecules	BACE subset of MoleculeNet with high-quality DFT energies (r2scan-3c) and experimental inhibition data

Benchmarks

Gibbs Free Energy Prediction

Predict ensemble Gibbs free energy (G) from molecular structure

Subset: 100k AICures · Split: 60/20/20 train/val/test

Rank	Model	Description	MAE (kcal/mol)
🥇 1	SchNetFeatures	3D SchNet + graph features (trained on highest-prob conformer)	0.203
🥈 2	ChemProp	Message Passing Neural Network (graph model)	0.225
🥉 3	FFNN	Feed-forward network on Morgan fingerprints	0.274
4	KRR	Kernel Ridge Regression on Morgan fingerprints	0.289
5	Random Forest	Random Forest on Morgan fingerprints	0.406

Average Energy Prediction

Predict ensemble average energy (E) from molecular structure

Subset: 100k AICures · Split: 60/20/20 train/val/test

Rank	Model	Description	MAE (kcal/mol)
🥇 1	ChemProp	Message Passing Neural Network (graph model)	0.11
🥈 2	SchNetFeatures	3D SchNet + graph features (trained on highest-prob conformer)	0.113
🥉 3	FFNN	Feed-forward network on Morgan fingerprints	0.119
4	KRR	Kernel Ridge Regression on Morgan fingerprints	0.131
5	Random Forest	Random Forest on Morgan fingerprints	0.166

Conformer Count Prediction

Predict ln(number of unique conformers) from molecular structure

Subset: 100k AICures · Split: 60/20/20 train/val/test

Rank	Model	Description	MAE
🥇 1	SchNetFeatures	3D SchNet + graph features (trained on highest-prob conformer)	0.363
🥈 2	ChemProp	Message Passing Neural Network (graph model)	0.38
🥉 3	FFNN	Feed-forward network on Morgan fingerprints	0.455
4	KRR	Kernel Ridge Regression on Morgan fingerprints	0.484
5	Random Forest	Random Forest on Morgan fingerprints	0.763

Dataset	Description
QM9	134k small molecules with up to 9 heavy atoms and DFT properties
PCQM4Mv2	Millions of computationally generated molecules for HOMO-LUMO gap prediction
PubChemQC	DFT structures and energy properties for millions of PubChem molecules

Strengths

Scale: 37M+ conformations across 450k+ molecules, providing massive coverage of drug-like and small molecule chemical space.
Energy Annotations: All conformations include semi-empirical energies (GFN2-xTB); the BACE subset includes high-quality DFT energies.
Quality Tiers: Three levels of computational rigor allow researchers to trade off dataset size for simulation accuracy.
Benchmark Ready: Includes validated splits and architectural baselines (e.g., ChemProp, SchNet) for property prediction tasks.
Task Diversity: Combines molecules sourced from drug discovery (AICures), quantum chemistry (QM9), and biophysiology domains (MoleculeNet).

Limitations

Computational Constraints: The highest-accuracy DFT subset (BACE) is limited to 1,511 molecules due to the extreme computational cost of exact free energy sampling and Hessian estimation.
Semi-Empirical Accuracy Gap: The $p^{\text{CREST}}$ statistical weights rely on GFN2-xTB energies, which exhibit a $\sim$2 kcal/mol MAE against true DFT. At room temperature ($k_BT \approx 0.59$ kcal/mol), this error heavily skews the Boltzmann distribution, meaning standard subset weights are imprecise.
Solvation Assumptions: Most subsets rely on vacuum calculations. Only the BACE subset uses an implicit solvent (ALPB/C-PCM for water).
Coverage Lapses: Extremely flexible molecules (e.g., within the SIDER dataset) frequently failed the conformer generation pipeline due to runaway topologies.

Technical Notes

Data Generation Pipeline

Initial conformer sampling (RDKit):

EmbedMultipleConfs with numConfs=50, pruneRmsThresh=0.01 Å
MMFF force field optimization
GFN2-xTB optimization of seed conformer

Conformational exploration (CREST):

Metadynamics in NVT ensemble driven by a pushing bias potential: $$ V_{\text{bias}} = \sum_i k_i \exp(-\alpha_i \Delta_i^2) $$ where $\Delta_i$ is the root-mean-square displacement (RMSD) against the $i$-th reference structure.
12 independent MTD runs per molecule with different settings for $k_i$ and $\alpha_i$.
6.0 kcal/mol safety window for conformer retention.
Solvent: ALPB for water (BACE); vacuum for others.
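As a rough illustration of the pushing bias above, the sketch below evaluates $V_{\text{bias}}$ for a given set of RMSD values. The function name and the example parameters are hypothetical; this is not the CREST implementation, only the formula written out.

```python
import math


def bias_potential(rmsds, k, alpha):
    """Evaluate V_bias = sum_i k_i * exp(-alpha_i * Delta_i^2).

    rmsds : RMSD (Delta_i) of the current structure to each reference
    k     : pushing strengths k_i (energy units)
    alpha : Gaussian widths alpha_i (inverse squared length units)
    """
    if not (len(rmsds) == len(k) == len(alpha)):
        raise ValueError("rmsds, k and alpha must have the same length")
    return sum(k_i * math.exp(-a_i * d * d)
               for d, k_i, a_i in zip(rmsds, k, alpha))


# Illustrative values only: the bias is largest when the structure sits
# on top of a reference (RMSD = 0) and decays as it moves away.
near = bias_potential([0.0, 0.1], k=[1.0, 1.0], alpha=[0.5, 0.5])
far = bias_potential([3.0, 4.0], k=[1.0, 1.0], alpha=[0.5, 0.5])
```

The penalty is highest near already-visited structures, which is what pushes the simulation toward unexplored conformations.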

Energy calculation & Weighting:

Standard (GFN2-xTB): Semi-empirical tight-binding DFT ($\approx$ 2 kcal/mol MAE vs DFT). Conformers are assigned a statistical probability based on energy $E_i$ and rotamer degeneracy $d_i$: $$ p^{\text{CREST}}_i = \frac{d_i \exp(-E_i / k_B T)}{\sum_j d_j \exp(-E_j / k_B T)} $$
High-Quality DFT (CENSO): Refines structures using the r2scan-3c functional, computing exact conformation-dependent free energies ($G_i$) that remove the need for explicit rotamer degeneracy approximations:

$$ \begin{aligned} p^{\text{CENSO}}_i &= \frac{\exp(-G_i / k_B T)}{\sum_j \exp(-G_j / k_B T)} \\ G_i &= E_{\text{gas}}^{(i)} + \delta G_{\text{solv}}^{(i)}(T) + G_{\text{trv}}^{(i)}(T) \end{aligned} $$
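Both weighting schemes are Boltzmann distributions, so one small helper covers them. The sketch below is a minimal illustration, assuming energies in kcal/mol and a temperature of 298.15 K; the function and constant names are hypothetical. Passing degeneracies gives $p^{\text{CREST}}$, while passing free energies $G_i$ with no degeneracies gives $p^{\text{CENSO}}$.

```python
import math

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def boltzmann_weights(energies, degeneracies=None, temperature=298.15):
    """Return normalized statistical weights for a conformer ensemble."""
    if degeneracies is None:
        degeneracies = [1.0] * len(energies)
    kT = KB_KCAL_PER_MOL_K * temperature
    e_min = min(energies)  # shift by the minimum to avoid overflow
    unnorm = [d * math.exp(-(e - e_min) / kT)
              for e, d in zip(energies, degeneracies)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Illustrative energies (kcal/mol): shifting one conformer by 2 kcal/mol,
# comparable to the quoted GFN2-xTB error, reorders the weights.
reference = boltzmann_weights([0.0, 1.0])
perturbed = boltzmann_weights([2.0, 1.0])
```

Since $k_BT$ is only about 0.59 kcal/mol at room temperature, an energy error of a few kcal/mol is several multiples of the thermal scale, which is why the semi-empirical weights are treated as approximate.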

Quality Levels

Level	Method	Subset	Accuracy
Standard	CREST/GFN2-xTB	All subsets	~2 kcal/mol MAE vs DFT
DFT Single-Point	r2scan-3c/mTZVPP on CREST geometries	BACE (1,511 molecules)	Sub-kcal/mol
DFT Optimized	CENSO full optimization + free energies	BACE (534 molecules)	~0.3 kcal/mol vs CCSD(T)

Benchmark Setup

Task: Predict ensemble summary statistics directly from the 2D molecular structure. The target properties include:

Conformational Free Energy ($G$): $G = -TS$, where $S = -R \sum_i p_i \log p_i$.
Average Energy ($\langle E \rangle$): $\langle E \rangle = \sum_i p_i E_i$.
Unique Conformers: Natural log of the conformer count retained within the energy window.
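The three targets follow directly from the conformer weights and energies. A minimal sketch, assuming weights that already sum to one, energies in kcal/mol, and a temperature of 298.15 K (the function name and dictionary keys are hypothetical):

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol K)


def ensemble_targets(weights, energies, temperature=298.15):
    """Compute the ensemble summary statistics used as regression targets."""
    # S = -R * sum_i p_i log p_i (terms with p_i = 0 contribute nothing)
    entropy = -R_KCAL_PER_MOL_K * sum(p * math.log(p) for p in weights if p > 0)
    return {
        "G": -temperature * entropy,                             # G = -T S
        "E_avg": sum(p * e for p, e in zip(weights, energies)),  # <E>
        "ln_n_conformers": math.log(len(weights)),
    }


# Illustrative ensemble of four equally likely conformers.
targets = ensemble_targets([0.25] * 4, [0.0, 0.5, 1.0, 1.5])
```

For a uniform ensemble of $n$ conformers the entropy reduces to $R \ln n$, so $G$ becomes more negative as the ensemble grows.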

Data: 100,000 species randomly sampled from AICures subset, split 60/20/20 (train/validation/test).

Hyperparameters: Optimized using Hyperopt package for each model/task combination.

Models:

SchNetFeatures: 3D SchNet architecture + graph features, trained on highest-probability conformer
ChemProp: Message Passing Neural Network on molecular graphs
FFNN: Feed-forward network on Morgan fingerprints
KRR: Kernel Ridge Regression on Morgan fingerprints
Random Forest: Random Forest on Morgan fingerprints

Hardware & Computational Cost

CREST/GFN2-xTB Generation

Total compute: ~15.7 million core hours

AICures subset:

13M core hours on Knights Landing (32-core nodes)
1.2M core hours on Cascade Lake/Skylake (13-core nodes)
Average wall time: 2.8 hours/molecule (KNL) or 0.63 hours/molecule (Skylake)

MoleculeNet subset: 1.5M core hours

DFT Calculations (BACE only)

Software: CENSO 1.1.2 + ORCA 5.0.1 (r2scan-3c/mTZVPP functional)

Solvent: C-PCM implicit solvation (water)

Hardware: ~54 cores per job

Compute cost:

781,000 CPU hours for CENSO optimizations
1.1M CPU hours for single-point energy calculations

Reproducibility Details

Data Availability: All generated conformations, energies, and thermodynamic properties are publicly hosted on Harvard Dataverse. The data is provided in language-agnostic MessagePack format and Python-specific RDKit .pkl formats.
Code & Analysis: The primary GitHub repository (learningmatter-mit/geom) provides tutorials for data extraction, RDKit processing, and conformational visualization.
Model Training & Baselines: The machine learning benchmarks (SchNet, ChemProp) and corresponding training scripts used to evaluate the dataset can be reproduced using the authors’ NeuralForceField repository.
Hardware & Compute: Extreme compute was required (15.7M core hours for CREST sampling alone), heavily utilizing Knights Landing (KNL) and Cascade Lake architectures. See Hardware & Computational Cost section above for full details.
Software Versions: Precise reproduction of conformational properties requires specific versions to mitigate numerical variances: CREST v2.9, xTB v6.2.3/v6.4.1, CENSO v1.1.2, ORCA v5.0.1/v5.0.2, and RDKit v2020.09.1.
Open-Access Paper: The full methodology is accessible via the arXiv preprint.

Paper Information

Axelrod, S. and Gómez-Bombarelli, R. (2022). GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data, 9(1), 185. https://doi.org/10.1038/s41597-022-01288-4

@article{Axelrod_2022,
    title={GEOM, energy-annotated molecular conformations for property prediction and molecular generation},
    volume={9},
    ISSN={2052-4463},
    url={http://dx.doi.org/10.1038/s41597-022-01288-4},
    DOI={10.1038/s41597-022-01288-4},
    number={1},
    journal={Scientific Data},
    publisher={Springer Science and Business Media LLC},
    author={Axelrod, Simon and Gómez-Bombarelli, Rafael},
    year={2022},
    month={apr},
    pages={185}
}

GDB-11: Chemical Universe Database (26.4M Molecules)

Fri, 29 Aug 2025 00:00:00 +0000

Dataset Examples

GDB-11 molecule (SMILES: FC1C2OC1c3c(F)coc23)

Dataset	Relationship	Link
GDB-13	Successor	Notes
GDB-17	Successor	Notes

Key Contribution

The generation and analysis of the Generated Database (GDB), an exhaustive collection of all possible small molecules that meet specific criteria for stability and synthetic feasibility.

Overview

GDB-11 represents the first systematic enumeration of the small molecule chemical universe up to 11 atoms of C, N, O, and F. The database contains 26.4 million unique molecules corresponding to 110.9 million stereoisomers. It was created to support virtual screening and drug discovery by providing a comprehensive collection of diverse, drug-like small molecules that obey standard chemical stability rules.

Strengths

Systematic Enumeration: Exhaustive coverage of mathematically and chemically possible structures up to 11 atoms.
Drug-Likeness: 100% of compounds follow Lipinski’s “Rule of 5” for bioavailability, and 50% (13.2 million) follow Congreve’s more restrictive “Rule of 3” for lead-likeness.
Structural Novelty: Features 538 newly identified ring systems that were previously unknown in existing chemical databases (like the CAS Registry or Beilstein).
High Chirality: Over 70% of GDB molecules are chiral, providing rich 3D structural diversity, particularly in fused carbocycles and heterocycles.

Limitations

Size Restriction: Strictly limited to small molecules with a maximum of 11 heavy atoms.
Element Restriction: Only contains C, N, O, and F. Important biological and pharmaceutical elements like Phosphorus (P), Sulfur (S), and Silicon (Si) are excluded to prevent combinatorial explosion.
Excluded Topologies: Excludes highly strained molecules (e.g., specific bridged systems), allenes, and bridgehead double bonds.
Unstable Functional Groups: Excludes chemical classes deemed unstable or highly reactive (e.g., gem-diols, hemiacetals, aminals, enols, orthoacids).
Computational Nature: Consists entirely of computer-generated, theoretical structures without experimental synthesis or biological validation.

Technical Notes

Construction

Graph Selection

The program GENG was used to generate an initial set of 843,335 connected graphs with up to 11 nodes and a maximum node connectivity of 4. These were filtered to 15,726 stable saturated hydrocarbon graphs using:

Topological Criteria: Removed graphs with a node in multiple small (3- or 4-membered) rings, tetravalent bridgeheads in small rings, and nonplanar graphs (e.g., Claus-benzol).
Steric Criteria: Graphs containing highly distorted centers were removed using an adapted MM2 force field energy-minimization with a cutoff of +17 kcal/mol.

Structure Generation

Graph symmetry algorithms identified valid locations for unsaturations and heteroatoms (C, N, O, F). Specific valence constraints were continuously enforced. Combinatorial distribution of elements and multiple bonds (excluding bridgehead double bonds, triple bonds in rings smaller than nine, and allenes) yielded a theoretical “dark matter universe” (DMU) of over 1.7 billion unique structures.

Filters

The 1.7 billion structural candidates contained unstable environments which were aggressively filtered, reducing the set to 27.7 million possible stable molecules. Rejected unstable/reactive features included:

High-Energy Bonds: Gem-diols, non-stabilized aminals, hemiaminals, enols, orthoesters, unstable imines, acyl fluorides, and geminal di-heteroatoms.
Heteroatom-Heteroatom Bonds: Peroxides (O-O), N-O, N-N, N-F, and triazanes, unless stabilized (e.g., hydrazones, oximes).
Strained Topologies: 3/4-membered rings containing N-N or N-O bonds, and bridgehead heteroatom bonds causing instabilities (like Bredt’s rule violations).

Removal of redundant tautomeric forms collapsed the set to the foundational 26.4 million structures.

Stereoisomer Generation

Stereoisomers were cleanly enumerated by identifying all asymmetric centers and functional double bonds, blocking Z/E isomerism in rings smaller than 10 nodes. From the 26.4 million unique constitutional isomers, 110.9 million stereoisomers were generated (averaging 4.2 stereoisomers per molecule).

Analysis Methodology

Kohonen Maps (Self-Organizing Maps)

The chemical space visualization and compound class analysis used a Kohonen Map (Self-Organizing Map/SOM):

Input Features: 48-dimensional autocorrelation vectors encoding topological relationships and atomic properties. The autocorrelation vector $\text{AC}_d$ for a topological distance $d$ is defined as:

$$ \text{AC}_d = \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(p_i, p_j)_d $$

(where $N$ is the number of atoms, $p$ is the atomic property, and $\delta (p_i, p_j)_d = p_i p_j$ if the topological distance between atoms $i$ and $j$ is $d$, and 0 otherwise).

Training Data: Random subset of 1,000,000 GDB molecules
Architecture: 200x200 neuron grid
Training Protocol: 250,000 epochs with 100 molecules presented per epoch
Algorithm: Standard Kohonen algorithm
Key Insight: Reveals that “lead-like” compounds cluster in chiral regions of fused carbocycles/heterocycles
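The autocorrelation descriptor used as SOM input can be sketched as follows. This is a hypothetical helper, not the authors' code: the molecule is given as an adjacency list, topological distance is the shortest bond path found by breadth-first search, and the double sum runs over ordered pairs exactly as written in the definition (so each unordered pair is counted twice and $d = 0$ collects the $p_i^2$ terms).

```python
from collections import deque


def topological_distances(adjacency):
    """All-pairs shortest bond-path lengths via breadth-first search."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if dist[start][nbr] is None:
                    dist[start][nbr] = dist[start][atom] + 1
                    queue.append(nbr)
    return dist


def autocorrelation(adjacency, props, max_distance):
    """Return [AC_0, ..., AC_max_distance] for one atomic property."""
    dist = topological_distances(adjacency)
    ac = [0.0] * (max_distance + 1)
    n = len(adjacency)
    for i in range(n):
        for j in range(n):
            d = dist[i][j]
            if d is not None and d <= max_distance:
                ac[d] += props[i] * props[j]
    return ac


# Propane heavy-atom skeleton C-C-C with a unit property on each atom.
propane = {0: [1], 1: [0, 2], 2: [1]}
vector = autocorrelation(propane, [1.0, 1.0, 1.0], max_distance=2)
```

Concatenating such vectors for several atomic properties and distances would give a fixed-length descriptor; how the 48 dimensions are split between properties and distances is not stated here.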

Comparison

The full database was compared comprehensively to a Reference Database (RDB) of 63,857 known compounds (up to 11 atoms) extracted from PubChem, ChemACX, ChemSCX, NCI Open Database, and the Merck Index. Of the 63,857 RDB compounds, 37,393 (58.6%) were found in GDB. The remaining 26,464 compounds were absent due to structural rule violations, exclusion of elements beyond C/N/O/F, and filtered unstable chemistries.

New Rings

All 309 entirely acyclic graphs in GDB mapped cleanly to published structures. External databases contained only 670 of the 1,208 purely cyclic theoretical ring systems (55.5%). Furthermore, 367 of the 538 newly identified ring systems (68.2%) express inherently chiral topologies.

Stereochemistry

Small molecules under 5 heavy atoms skew strongly towards simple achiral structures. As the atom count increases, a dominant stereochemical shift emerges: over two-thirds of structures containing exactly 10 or 11 atoms occupy chiral configuration spaces. Approximately 86% of the molecules in GDB contain exactly 11 atoms (22.8 million of 26.4 million).

Physicochemical Properties

Because all GDB molecules contain at most 11 heavy atoms, 100% of them satisfy Lipinski’s “Rule of 5” for bioavailability. Under the more restrictive Congreve “Rule of 3” for lead-likeness (MW < 300, RBC < 3, logP < 3, HBDC < 3, HBAC < 3, TPSA < 60 $\text{\AA}^2$), exactly 50% (13.2 million structures) qualify. Virtual screening using the Molinspiration miscreen toolkit (Bayesian statistics-based) identified 42,804 virtual hits across three drug target classes: 3,043 kinase inhibitor candidates, 24,489 GPCR ligand candidates, and 19,696 ion-channel modulator candidates. Of these virtual hits, 59.8% occupied Kohonen map neurons not populated by any known RDB compound.
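The "Rule of 3" thresholds quoted above translate into a simple filter. The sketch below assumes the descriptors have already been computed by some cheminformatics toolkit and are passed in as a dictionary; the key names and the example values are hypothetical, and the strict inequalities mirror the thresholds as written in this summary.

```python
# Upper limits as listed above (strict "<" comparisons).
RULE_OF_3_LIMITS = {
    "mw": 300.0,   # molecular weight (Da)
    "rbc": 3,      # rotatable bond count
    "logp": 3.0,   # octanol-water partition coefficient
    "hbdc": 3,     # hydrogen-bond donor count
    "hbac": 3,     # hydrogen-bond acceptor count
    "tpsa": 60.0,  # topological polar surface area (square angstroms)
}


def passes_rule_of_3(descriptors):
    """Return True if every descriptor is below its Rule-of-3 limit."""
    return all(descriptors[name] < limit
               for name, limit in RULE_OF_3_LIMITS.items())


# Illustrative descriptor values for a small fragment-sized molecule.
fragment = {"mw": 152.2, "rbc": 2, "logp": 1.1, "hbdc": 1, "hbac": 2, "tpsa": 46.5}
is_lead_like = passes_rule_of_3(fragment)
```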

Reproducibility Details

While the generated GDB-11 database is openly available, reproducing the exact generation from graph to stereoisomer relies on in-house and proprietary software which is not publicly available.

Artifact	Type	License	Notes
GDB Downloads (University of Berne)	Dataset	Unknown	Official host for GDB databases
Zenodo Record (10.5281/zenodo.5172017)	Dataset	Unknown	Version-agnostic Zenodo archive of GDB-11

Paper Accessibility: Closed-access (Published in JCIM 2007; no preprint available).
Data Availability: The complete dataset is hosted on an open-access Zenodo repository (version-agnostic DOI): 10.5281/zenodo.5172017.
Software Dependencies (Closed/Commercial):
- Generation code is a closed-source Java (J2SE v5.0) application.
- Relies on proprietary ChemAxon libraries (JChem v3.1, Marvin v4.0 API).
- Virtual screening evaluation utilized the commercial Molinspiration miscreen toolkit.
Hardware Profile:
- CPUs: Two AMD Opteron 252 2.6 GHz processors
- Parallelization: 80-fold parallelization
- Compute Time: Approximately 20 hours for full generation

Force Field

A custom implementation of the MM2 force field was used for steric energy minimization during structure validation. It used the parameter set from Allinger, specifically adding a quartic term for bond stretching to prevent bond lengthening far from equilibrium:

$$ \begin{aligned} E_{\text{Steric}} &= \sum_{\text{bonds}} k_b(l_i - l_{0,i})^2 \left[1 + k'_b(l_i - l_{0,i}) + k''_b(l_i - l_{0,i})^2\right] \\ &\quad + \sum_{\text{angles}} k_\theta(\theta_i - \theta_{0,i})^2 \left[1 + k'_\theta(\theta_i - \theta_{0,i})^4\right] \\ &\quad + \sum_{\text{angles}} k_{b,\theta}(\theta_i - \theta_{0,i})^2 \left[(l_a - l_{0,a}) + (l_b - l_{0,b})\right] \\ &\quad + \sum_{\text{torsions}} \left[ \frac{V_1}{2} (1 + \cos \omega) + \frac{V_2}{2} (1 - \cos 2\omega) + \frac{V_3}{2} (1 + \cos 3\omega) \right] \\ &\quad + \sum_{i=1}^N \sum_{j=i+1}^N \epsilon_{ij} \left[ A \exp \left( \frac{-B r_{ij}}{\sum r^{\ast}_{ij}} \right) - C \left( \frac{r_{ij}}{\sum r^{\ast}_{ij}} \right)^6 \right] \end{aligned} $$

Paper Information

Fink, T. and Reymond, J.-L. (2007). Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F: Assembly of 26.4 Million Structures (110.9 Million Stereoisomers) and Analysis for New Ring Systems, Stereochemistry, Physicochemical Properties, Compound Classes, and Drug Discovery. Journal of Chemical Information and Modeling, 47(2), 342–353. https://doi.org/10.1021/ci600423u

@article{fink2007virtual,
  title={Virtual exploration of the chemical universe up to 11 atoms of C, N, O, and F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery},
  author={Fink, Tobias and Reymond, Jean-Louis},
  journal={Journal of Chemical Information and Modeling},
  volume={47},
  number={2},
  pages={342--353},
  year={2007},
  publisher={ACS Publications}
}

GDB-17: Chemical Universe Database (166.4B Molecules)

Sat, 16 Aug 2025 00:00:00 +0000

Key Contribution

The systematic enumeration of 166.4 billion organic molecules (GDB-17) up to 17 atoms, extending the known chemical universe into the drug-relevant size range. This reveals a highly dense novel chemical space that is measurably richer in complex stereochemical and three-dimensional structures compared to historically biased chemical databases.

Overview

GDB-17 represents the largest enumerated database of drug-like small molecules, reaching the size range typical of lead compounds and approved drugs ($100 < \text{MW} < 350$ Da). It contains 166.4 billion structures consisting of up to 17 atoms of C, N, O, S, and halogens (F, Cl, Br, I). Because the number of possible structures grows exponentially with heavy atom count (HAC), the MW distribution of the database peaks sharply in the $240$-$250 \text{ Da}$ range. Compared to known molecules in databases like PubChem, GDB-17 molecules are significantly richer in non-aromatic heterocycles, quaternary centers, and stereoisomers, avoiding “flatland” by densely populating the third dimension in shape space.

Dataset Examples

Example GDB-17 molecule (SMILES: C1CC2C3CCCC3C3(C4CCC3CC4)C2C1) demonstrating the complex polycyclic structures and 3D diversity characteristic of the database

Dataset Subsets

Subset	Size	Description
GDB-17 (Full)	166.4B	Complete enumeration of the database
GDBLL-17	29B	Lead-like subset ($1 < \text{clogP} < 3$ and $100 < \text{MW} < 350$ Da)
GDBLLnoSR-17	22B	Lead-like subset excluding compounds with small rings (3- or 4-membered)
Random Sample	50M	Random 50M subset available for download, including pre-filtered lead-like and no-small-ring fractions

Benchmarks

Note: As an enumerated database of theoretical structures, GDB-17 lacks standard supervised ML benchmarks. It serves primarily as a reference and exploration library for unsupervised learning and molecular generation.

Dataset	Relationship	Link
GDB-11	Predecessor	Notes
GDB-13	Predecessor	Notes

Strengths & Limitations

Strengths:

3D Shape Space (“Escape out of Flatland”): Populates the third dimension (spherical, non-planar shapes) significantly better than known structures in PubChem or ChEMBL, which are primarily flat and rod-like due to aromatic dominance
Stereochemical Complexity: Averages 6.4 possible stereoisomers per molecule (compared to 2.0 in PubChem-17), driven by an abundance of non-planar features and small rings
Massive Scaffold Diversity: Features 35-fold more Murcko scaffolds and 61-fold more ring systems than molecules of matching size in PubChem
Rich in Known Drug Isomers: Contains millions of exact geometric and formula isomers of approved drugs, offering direct variations and “methyl walk” analogs

Limitations:

Experimental Gap: These are virtual, combinatorially enumerated molecules. Despite strict computational stability filtering, they remain unsynthesized and lack experimental validation.
Small Ring Dominance: Up to 16 atoms, roughly 83% of the database consists of compounds with challenging small (3- or 4-membered) rings, though this drops for the 17-atom set, resulting in an overall 28% fraction of small ring compounds
Elemental Scope Restrictions: Elements like P, Si, and B, which occasionally appear in drugs, are completely excluded
Strict Stability Filters: Excludes some potentially viable functional groups strictly to manage the combinatorial explosion and avoid unstable structures (e.g., hemiacetals, aminals, acyclic acetals)
Polarity Skew: The full database contains disproportionately more polar molecules ($\text{clogP} < 0$) than reference sets, and its sheer size makes it computationally demanding to query using advanced docking or 3D shape tools

Technical Notes

Generation Pipeline

GDB-17 was generated from first principles through a highly filtered, multi-stage pipeline:

Graphs $\rightarrow$ Hydrocarbons: Started with 114.3 billion topologies (generated using GENG), filtered down to 5.4 million stable hydrocarbons by applying geometrical strain rules (H-filters).
Hydrocarbons $\rightarrow$ Skeletons: Substituted single bonds with double and triple bonds to yield 1.3 billion skeletons, simultaneously removing reactive unsaturations like allenes (S-filters).
Skeletons $\rightarrow$ CNO Molecules: Diversified into 110.4 billion molecules by combinatorially substituting C with N and O, explicitly avoiding heteroatom-heteroatom bonds and enforcing stability filters (F-filters).
Post-processing: Added diversity by transforming groups to generate aromatics, oximes, $\text{CF}_3$, halogens, and sulfones (P-filters), yielding the final 166.4 billion count.

Hardware & Software

Compute: More than 40,000 jobs distributed across a 360-CPU cluster, consuming 100,000 CPU hours (~11 CPU years)
Software: GENG (Nauty package) for graph generation, CORINA for 3D stereoisomer generation, and ChemAxon JChem libraries called from custom Java 1.6 applications

Shape Analysis (PMI)

To quantify the “escape from flatland,” the original paper classifies molecular shape using the normalized Principal Moments of Inertia (PMI) of the generated 3D conformers. The principal moments ($I_1 \le I_2 \le I_3$) are derived by diagonalizing the standard moment of inertia tensor. Molecules are plotted within a normalized 2D triangular space mapped by the ratios:

$$ P_1 = \frac{I_1}{I_3}, \quad P_2 = \frac{I_2}{I_3} $$

The vertices of this plot define the three geometrical boundaries of chemical space:

Rod-like (1D): $(0, 1)$ typical of stretched alkanes
Disc-like (2D): $(0.5, 0.5)$ typical of flat aromatics like benzene
Sphere-like (3D): $(1, 1)$ typical of globular structures like cubane

GDB-17’s core structural finding is that mathematically enumerated chemical space thickly populates the interior and $(1,1)$ spherical regions of this plot, demonstrating significant 3D structure. Empirical libraries traditionally cluster densely along the rod-to-disc axis.
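The normalized PMI coordinates can be computed from atomic coordinates with nothing beyond the standard library. The sketch below is illustrative only: the function names are hypothetical, unit masses are assumed when none are given, and the eigenvalues of the symmetric 3x3 inertia tensor come from the standard trigonometric closed form rather than from whatever software the original study used.

```python
import math


def inertia_tensor(coords, masses):
    """Moment of inertia tensor about the center of mass."""
    total = sum(masses)
    com = [sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)]
    I = [[0.0] * 3 for _ in range(3)]
    for m, c in zip(masses, coords):
        x, y, z = (c[k] - com[k] for k in range(3))
        I[0][0] += m * (y * y + z * z)
        I[1][1] += m * (x * x + z * z)
        I[2][2] += m * (x * x + y * y)
        I[0][1] -= m * x * y
        I[0][2] -= m * x * z
        I[1][2] -= m * y * z
    I[1][0], I[2][0], I[2][1] = I[0][1], I[0][2], I[1][2]
    return I


def symmetric_eigenvalues(A):
    """Ascending eigenvalues of a symmetric 3x3 matrix (closed form)."""
    q = (A[0][0] + A[1][1] + A[2][2]) / 3.0
    off = A[0][1] ** 2 + A[0][2] ** 2 + A[1][2] ** 2
    if off <= 1e-18 * q * q:  # effectively diagonal already
        return sorted([A[0][0], A[1][1], A[2][2]])
    p = math.sqrt((sum((A[k][k] - q) ** 2 for k in range(3)) + 2.0 * off) / 6.0)
    B = [[(A[r][c] - (q if r == c else 0.0)) / p for c in range(3)] for r in range(3)]
    det = (B[0][0] * (B[1][1] * B[2][2] - B[1][2] * B[2][1])
           - B[0][1] * (B[1][0] * B[2][2] - B[1][2] * B[2][0])
           + B[0][2] * (B[1][0] * B[2][1] - B[1][1] * B[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det / 2.0))) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [smallest, 3.0 * q - largest - smallest, largest]


def normalized_pmi(coords, masses=None):
    """Return (P1, P2) = (I1/I3, I2/I3) for a set of 3D coordinates."""
    masses = masses or [1.0] * len(coords)
    i1, i2, i3 = symmetric_eigenvalues(inertia_tensor(coords, masses))
    if i3 <= 0.0:
        raise ValueError("degenerate geometry: largest moment is zero")
    return i1 / i3, i2 / i3


# Idealized point sets that approach the three vertices of the PMI triangle.
rod = normalized_pmi([(-1, -1, 0), (0, 0, 0), (1, 1, 0)])
disc = normalized_pmi([(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3), 0.0)
                       for k in range(6)])
sphere = normalized_pmi([(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)])
```

A collinear set, a planar hexagon, and the corners of a cube are used here as stand-ins for the rod, disc, and sphere vertices; real molecules fall somewhere inside the triangle.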

Differences from GDB-13

The algorithm was completely rewritten to optimize memory efficiency, boosting computing speed roughly 400-fold and allowing enumeration beyond the previous 13-atom limit
Scope aggressively expanded to include all functional halogens (F, Cl, Br, I) within the base framework
Introduced intensive, size-dependent graph selection filters (prohibiting complex bridgeheads and completely eliminating small rings in 17-atom graphs) to manage combinatorial explosion
Functional post-processing cycles deliberately decoupled to add features like cyclic oximes, aromatic halogens, and sulfones that would otherwise be rejected or break underlying generation constraints

Reproducibility Details

Paper Accessibility: The original paper is published in the Journal of Chemical Information and Modeling and is available as an Open Access publication under a CC-BY license.
Data Availability: The full 166.4 billion molecule dataset is not publicly available for download (estimated >400 GB compressed). However, a 50 million random subset and pre-filtered lead-like fractions are openly available on the GDB website and archived on Zenodo.
Code & Algorithms: The enumeration rules and logic are well-described in the paper, but the actual Java 1.6 source code has not been released.
Dependencies: The pipeline is a mix of open-source and proprietary software tools. Graph generation uses open-source GENG (Nauty), while chemical logic and stereoisomer generation rely on proprietary ChemAxon JChem libraries and CORINA.
Hardware Specifications: The original database generation was explicitly parallelized across a 360-CPU cluster, consuming 100,000 CPU hours (approximately 11 CPU years) with over 40,000 calculation runs.

Paper Information

Ruddigkeit, L., van Deursen, R., Blum, L. C., and Reymond, J.-L. (2012). Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. Journal of Chemical Information and Modeling, 52(11), 2864–2875. https://doi.org/10.1021/ci300415d

@article{Ruddigkeit_2012,
  title={Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17},
  volume={52},
  ISSN={1549-960X},
  url={http://dx.doi.org/10.1021/ci300415d},
  DOI={10.1021/ci300415d},
  number={11},
  journal={Journal of Chemical Information and Modeling},
  publisher={American Chemical Society (ACS)},
  author={Ruddigkeit, Lars and van Deursen, Ruud and Blum, Lorenz C. and Reymond, Jean-Louis},
  year={2012},
  month=nov,
  pages={2864--2875}
}

GDB-13: Chemical Universe Database (970M Molecules)

Sat, 16 Aug 2025 00:00:00 +0000

Dataset Examples

Example GDB-13 molecule (SMILES: CCCC(O)(CO)CC1CC1CN)

Dataset Subsets

Subset	Size	Description
C/N/O Set	~910.1M	Molecules containing up to 13 atoms of Carbon, Nitrogen, and Oxygen.
Cl/S Set	~67.3M	Molecules containing up to 13 atoms, adding Sulfur (aromatic heterocycles, sulfones, sulfonamides, thioureas) and Chlorine (aromatic substituents).

Dataset	Relationship	Link
GDB-11	Predecessor	Notes
GDB-17	Successor	Notes

Key Contribution

The creation and release of the 977.5 million-compound GDB-13, a significant expansion in molecular size (up to 13 atoms) and elemental diversity (including S and Cl) made possible by key algorithmic optimizations that significantly accelerated the enumeration process.

Overview

GDB-13 extends the systematic enumeration of drug-like chemical space to molecules containing up to 13 atoms of Carbon, Nitrogen, Oxygen, Sulfur, and Chlorine. Building on the methodology established in GDB-11, this database represents a 37-fold increase in size while maintaining 100% Lipinski compliance for virtual screening applications. The enumeration results in a vast array of cyclic topologies, where 54% of the database comprises molecules with at least one three- or four-membered ring.

Strengths

Systematic coverage of structures with up to 13 atoms
High drug-likeness: 100% Lipinski compliance and 99.5% Vieth compliance
High proportion of leadlike (98.9%) and fragmentlike (45.1%) molecules
Structural novelty providing fragments absent from established databases like ZINC, ACX, and PubChem

Limitations

Limited to small molecules with up to 13 atoms of C, N, O, S, and Cl
Omits 66.2% of known chemical space up to 13 atoms found in external databases
Excludes specific nonenumerated elements (F, Br, I, P, Si, metals) and functional groups (chlorine on nonaromatic carbons, mercaptans, sulfoxides, enamines, allenes)
Excludes highly strained molecules and highly polar combinations
Consists entirely of computer-generated structures pending experimental validation

Technical Notes

Algorithmic Approach

Type: Rule-Based Combinatorial Graph Enumeration

The dataset is built by combinatorial enumeration: the GENG graph generator produces candidate graphs, and rule-based chemical stability filters prune them.

Process:

Start with mathematical graphs representing saturated hydrocarbons up to 13 nodes using GENG (non-planar graphs discarded)
Apply topological filters to remove highly strained small ring systems (e.g., fused cyclopropanes and bridgehead 3/4-membered rings)
Generate 3D structures via CORINA or ChemAxon to apply a 3D volume-based strain filter. The local strain of a tetravalent carbon is estimated by the volume $V$ of the tetrahedron formed by extending a $1 \text{ \AA}$ line along its four single bonds. Hydrocarbons with planar or pyramidal carbon centers are discarded if: $$ V < 0.345 \text{ \AA}^3 $$
Introduce unsaturations and heteroatoms through systematic substitution
Apply chemical rule filters and element-ratio heuristics to ensure stability and drug-likeness
Apply post-processing algorithms to introduce nitro groups, nitriles, aromatic chlorines, thiophenes, sulfonamides, and thioureas
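
The volume-based strain check in the third step can be illustrated with a minimal sketch. It assumes the 3D coordinates of a carbon and its four bonded neighbors are already available (for example, from a CORINA or ChemAxon structure). The function names and the input layout are hypothetical and are not taken from the original implementation, which is not public.

```python
import math

STRAIN_THRESHOLD = 0.345  # cubic angstroms, the cutoff quoted in the process above


def _unit_vector(center, neighbor):
    """Direction of the bond from center to neighbor, scaled to length 1 (i.e. 1 angstrom)."""
    delta = [n - c for n, c in zip(neighbor, center)]
    norm = math.sqrt(sum(x * x for x in delta))
    return [x / norm for x in delta]


def tetrahedron_volume(center, neighbors):
    """Volume of the tetrahedron whose corners lie 1 angstrom from the
    central carbon along each of its four single bonds."""
    if len(neighbors) != 4:
        raise ValueError("expected exactly four bonded neighbors")
    a, b, c, d = (_unit_vector(center, n) for n in neighbors)
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    # Scalar triple product u . (v x w); the volume is |det| / 6.
    det = (
        u[0] * (v[1] * w[2] - v[2] * w[1])
        - u[1] * (v[0] * w[2] - v[2] * w[0])
        + u[2] * (v[0] * w[1] - v[1] * w[0])
    )
    return abs(det) / 6.0


def is_too_strained(center, neighbors):
    """True if the carbon center falls below the volume cutoff."""
    return tetrahedron_volume(center, neighbors) < STRAIN_THRESHOLD


# Illustrative geometries (not taken from the paper).
origin = (0.0, 0.0, 0.0)
ideal = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]  # ideal tetrahedral carbon
planar = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]     # fully flattened carbon

print(round(tetrahedron_volume(origin, ideal), 3))  # ideal tetrahedral geometry
print(is_too_strained(origin, planar))              # planar center has zero volume
```

An ideal tetrahedral carbon gives $V = 8/(9\sqrt{3}) \approx 0.513 \text{ \AA}^3$, comfortably above the cutoff, while a fully planar center gives $V = 0$ and is rejected.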

Key Optimization: Replaced computationally expensive MM2 minimization (used in GDB-11) with a fast geometry-based estimation of strained polycyclic ring systems, combined with fast “element-ratio” filters. This achieved a 6.4-fold speedup in structure validation early in the pipeline.

Differences from GDB-11

Element Selection: Fluorine removed from allowed elements; sulfur and chlorine added for higher drug relevance (e.g., thiophenes, sulfonamides).
Optimization Method: MM2-based structure optimization replaced with a much faster, custom geometry-based estimation of local strain (measuring the tetrahedron volume of carbon centers).
Heuristic Filters: Fast elemental ratio filters added to quickly reject highly polar, unstable combinations early in the pipeline.

Reproducibility Details

Paper & Data Availability

Paper Access: The original paper is published in the Journal of the American Chemical Society (JACS) and is closed-access/paywalled. No open-access preprint exists on arXiv or ChemRxiv.
Data Access: The full GDB-13 database and its subsets are freely available via the Reymond Group Downloads Page and are persistently hosted on Zenodo.

Artifacts

Artifact	Type	License	Notes
GDB-13 Database (Reymond Group)	Dataset	Free download	Official download page hosted by the Reymond Group
GDB-13 on Zenodo	Dataset	Unknown	Persistent archival copy

Source Code & Algorithms

The exact custom source code (e.g., GENG orchestration, local strain filters) is not publicly available. Researchers must re-implement the rules as described in the paper and supplementary materials.

Heuristic Filters

The pipeline applies element-ratio filters, derived from an analysis of known compound databases, to reject chemically unstable or highly polar molecules early in generation:

$$ \begin{aligned} \frac{N + O}{C} &< 1.0 \\ \frac{N}{C} &< 0.571 \\ \frac{O}{C} &< 0.666 \end{aligned} $$
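
As a minimal sketch of how these three inequalities could be applied, the function below takes heavy-atom counts and returns whether a composition passes. The function name is hypothetical, and rejecting carbon-free compositions is an assumption made here because the ratios are undefined without carbon.

```python
def passes_element_ratio_filters(n_carbon, n_nitrogen, n_oxygen):
    """Apply the three element-ratio cutoffs to heavy-atom counts."""
    if n_carbon == 0:
        # Assumption: ratios are undefined without carbon, so reject.
        return False
    return (
        (n_nitrogen + n_oxygen) / n_carbon < 1.0
        and n_nitrogen / n_carbon < 0.571
        and n_oxygen / n_carbon < 0.666
    )


# The example molecule CCCC(O)(CO)CC1CC1CN has 10 C, 1 N, and 2 O atoms.
print(passes_element_ratio_filters(10, 1, 2))  # ratios 0.3, 0.1, 0.2

# Mannitol (C6H14O6) has O/C = 1.0, a high-heteroatom-ratio case.
print(passes_element_ratio_filters(6, 0, 6))
```

The example GDB-13 molecule passes all three cutoffs, while mannitol fails both the $(N + O)/C$ and $O/C$ limits.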

Excluded Functional Groups

O-O bonds (peroxides)
Hemiacetals, aminals, acyclic imines, non-aromatic enols
Compounds containing both primary/secondary amines and aldehydes/ketones
Nonenumerated elements (F, Br, I, P, Si, metals)
High-heteroatom ratio structures (e.g., mannitol)

Hardware & Compute

Compute Cost: ~40,000 CPU hours for the 910 million C/N/O structures.
Infrastructure: Executed in parallel on a 500-node cluster.
Assembly Optimization: Replacing MM2 minimization with geometry-based estimation of strained polycyclic ring systems, together with the element-ratio filters, reduced assembly time 6.4-fold on the GDB-11 workload (from 1,600 to 250 CPU hours).

Paper Information

Blum, L. C. and Reymond, J.-L. (2009). 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Journal of the American Chemical Society, 131(25), 8732–8733. https://doi.org/10.1021/ja902302h

@article{blum2009gdb13,
  title={970 million druglike small molecules for virtual screening in the chemical universe database GDB-13},
  author={Blum, Lorenz C and Reymond, Jean-Louis},
  journal={Journal of the American Chemical Society},
  volume={131},
  number={25},
  pages={8732--8733},
  year={2009},
  publisher={ACS Publications},
  doi={10.1021/ja902302h}
}
