GDB-13: Chemical Universe Database (970M Molecules)

Dataset Examples

Example GDB-13 molecule (SMILES: `CCCC(O)(CO)CC1CC1CN`)
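
As a quick sanity check on the 13-atom limit, the heavy atoms in the example SMILES can be counted directly. The snippet below is a minimal sketch, not part of the GDB-13 tooling: `count_heavy_atoms` is a hypothetical helper that only understands bracket-free SMILES over the GDB-13 elements (C, N, O, S, Cl) and is no substitute for a real SMILES parser.

```python
import re
from collections import Counter

# "Cl" is listed first so chlorine is never miscounted as a carbon.
# Lowercase symbols are aromatic atoms in SMILES notation.
_ATOM = re.compile(r"Cl|[CNOS]|[cnos]")


def count_heavy_atoms(smiles: str) -> Counter:
    """Count heavy atoms per element in a bracket-free SMILES string.

    Hypothetical helper for illustration only: it ignores bracket atoms,
    charges, and isotopes, which a full SMILES parser would handle.
    """
    if "[" in smiles:
        raise ValueError("bracket atoms are not handled by this sketch")
    return Counter(token.capitalize() for token in _ATOM.findall(smiles))


counts = count_heavy_atoms("CCCC(O)(CO)CC1CC1CN")
print(dict(counts), sum(counts.values()))
```

For the example molecule this gives 10 carbons, 2 oxygens, and 1 nitrogen, which is 13 heavy atoms and therefore sits at the upper size limit of the database.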

Dataset Subsets

Subset	Size	Description
C/N/O Set	~910.1M	Molecules containing up to 13 atoms of Carbon, Nitrogen, and Oxygen.
Cl/S Set	~67.4M	Molecules containing up to 13 atoms, adding Sulfur (aromatic heterocycles, sulfones, sulfonamides, thioureas) and Chlorine (aromatic substituents).

Related Datasets

Dataset	Relationship	Link
GDB-11	Predecessor	Notes
GDB-17	Successor	Notes

Key Contribution

The paper creates and releases GDB-13, a database of 977.4 million compounds. Compared with GDB-11, it expands both molecular size (up to 13 atoms) and elemental diversity (adding S and Cl), which was made possible by algorithmic optimizations that accelerated the enumeration process.

Overview

GDB-13 extends the systematic enumeration of drug-like chemical space to molecules containing up to 13 atoms of Carbon, Nitrogen, Oxygen, Sulfur, and Chlorine. Building on the methodology established in GDB-11, this database represents a 37-fold increase in size while maintaining 100% Lipinski compliance for virtual screening applications. The enumeration yields a wide range of cyclic topologies: 54% of the database consists of molecules with at least one three- or four-membered ring.

Strengths

Systematic coverage of structures with up to 13 atoms
High drug-likeness: 100% Lipinski compliance and 99.5% Vieth compliance
High proportion of leadlike (98.9%) and fragmentlike (45.1%) molecules
Structural novelty providing fragments absent from established databases like ZINC, ACX, and PubChem

Limitations

Limited to small molecules with up to 13 atoms of C, N, O, S, and Cl
Omits 66.2% of the known compounds with up to 13 atoms that are found in external databases
Excludes specific nonenumerated elements (F, Br, I, P, Si, metals) and functional groups (chlorine on nonaromatic carbons, mercaptans, sulfoxides, enamines, allenes)
Excludes highly strained molecules and highly polar combinations
Consists entirely of computer-generated structures pending experimental validation

Technical Notes

Algorithmic Approach

Type: Rule-Based Combinatorial Graph Enumeration

The dataset is built by combinatorial enumeration: the graph generator GENG produces candidate graphs, and rule-based chemical stability filters prune them.

Process:

Start with mathematical graphs representing saturated hydrocarbons up to 13 nodes using GENG (non-planar graphs discarded)
Apply topological filters to remove highly strained small ring systems (e.g., fused cyclopropanes and bridgehead 3/4-membered rings)
Generate 3D structures via CORINA or ChemAxon to apply a 3D volume-based strain filter. The local strain of a tetravalent carbon is estimated by the volume $V$ of the tetrahedron formed by extending a $1 \text{ \AA}$ line along its four single bonds. Hydrocarbons with planar or pyramidal carbon centers are discarded if: $$ V < 0.345 \text{ \AA}^3 $$
Introduce unsaturations and heteroatoms through systematic substitution
Apply chemical rule filters and element-ratio heuristics to ensure stability and drug-likeness
Apply post-processing algorithms to introduce nitro groups, nitriles, aromatic chlorines, thiophenes, sulfonamides, and thioureas
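
The volume-based strain test in the third step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code (which is not public): `tetrahedron_volume` and `is_strained` are hypothetical names, and coordinates are assumed to be plain 3D tuples in Å taken from a generated 3D structure. Each bond is normalized to a length of 1 Å from the central carbon, and the volume of the tetrahedron spanned by the four end points is compared with the 0.345 Å³ threshold.

```python
import math

STRAIN_THRESHOLD = 0.345  # in cubic angstroms, threshold quoted in the text


def _unit(vec):
    """Scale a 3D vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return tuple(x / norm for x in vec)


def tetrahedron_volume(center, neighbors):
    """Volume of the tetrahedron whose corners lie 1 angstrom from
    `center` along each of its four bonds (hypothetical helper)."""
    if len(neighbors) != 4:
        raise ValueError("a tetravalent carbon needs exactly four neighbors")
    # Corner positions relative to the center; the center offset cancels
    # when the edge vectors are formed below.
    a, b, c, d = (
        _unit(tuple(n - m for n, m in zip(nb, center))) for nb in neighbors
    )
    u = tuple(x - y for x, y in zip(b, a))
    v = tuple(x - y for x, y in zip(c, a))
    w = tuple(x - y for x, y in zip(d, a))
    # Scalar triple product u . (v x w); the volume is |det| / 6.
    det = (
        u[0] * (v[1] * w[2] - v[2] * w[1])
        - u[1] * (v[0] * w[2] - v[2] * w[0])
        + u[2] * (v[0] * w[1] - v[1] * w[0])
    )
    return abs(det) / 6.0


def is_strained(center, neighbors):
    """True if the carbon center falls below the volume threshold."""
    return tetrahedron_volume(center, neighbors) < STRAIN_THRESHOLD


# Ideal tetrahedral carbon: bonds point to alternating cube corners.
ideal = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
# Fully planar carbon: all four bonds lie in the xy-plane.
planar = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
print(tetrahedron_volume((0, 0, 0), ideal), is_strained((0, 0, 0), ideal))
print(tetrahedron_volume((0, 0, 0), planar), is_strained((0, 0, 0), planar))
```

An ideal tetrahedral center gives a volume of 8/(9√3) ≈ 0.513 Å³ and passes, while a fully planar center gives a volume of zero and is discarded.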

Key Optimization: Replaced computationally expensive MM2 minimization (used in GDB-11) with a fast geometry-based estimation of strained polycyclic ring systems, combined with fast “element-ratio” filters. This achieved a 6.4-fold speedup in structure validation early in the pipeline.

Differences from GDB-11

Element Selection: Fluorine removed from allowed elements; sulfur and chlorine added for higher drug relevance (e.g., thiophenes, sulfonamides).
Optimization Method: MM2-based structure optimization replaced with a much faster, custom geometry-based estimation of local strain (measuring the tetrahedron volume of carbon centers).
Heuristic Filters: Fast elemental ratio filters added to quickly reject highly polar, unstable combinations early in the pipeline.

Reproducibility Details

Paper & Data Availability

Paper Access: The original paper is published in the Journal of the American Chemical Society (JACS) and is closed-access/paywalled. No open-access preprint exists on arXiv or ChemRxiv.
Data Access: The full GDB-13 database and its subsets are freely available via the Reymond Group Downloads Page and are persistently hosted on Zenodo.

Source Code & Algorithms

The custom source code (e.g., GENG orchestration, local strain filters) is not publicly available. Researchers must re-implement the pipeline strictly from the rules described in the paper and its supplementary materials.

Heuristic Filters

The pipeline applies element-ratio filters, derived from an analysis of known compound databases, to reject chemically unstable or highly polar molecules early in the generation process. A molecule is kept only if its atom counts satisfy all three conditions:

$$ \begin{aligned} \frac{N + O}{C} &< 1.0 \\ \frac{N}{C} &< 0.571 \\ \frac{O}{C} &< 0.666 \end{aligned} $$
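
A minimal sketch of these three inequalities as a single predicate is shown below. The function name `passes_element_ratios` is hypothetical, and rejecting molecules without any carbon is an assumption made here to avoid division by zero; the source does not say how that case was handled.

```python
def passes_element_ratios(n_c: int, n_n: int, n_o: int) -> bool:
    """Apply the three element-ratio filters to heavy-atom counts.

    n_c, n_n, n_o are the numbers of carbon, nitrogen, and oxygen atoms.
    """
    if n_c <= 0:
        # Assumption: carbon-free molecules are rejected outright.
        return False
    return (
        (n_n + n_o) / n_c < 1.0
        and n_n / n_c < 0.571
        and n_o / n_c < 0.666
    )


# Example molecule CCCC(O)(CO)CC1CC1CN: 10 C, 1 N, 2 O
print(passes_element_ratios(10, 1, 2))
# Mannitol, C6H14O6: 6 C, 0 N, 6 O
print(passes_element_ratios(6, 0, 6))
```

The example GDB-13 molecule has ratios of 0.3, 0.1, and 0.2 and passes, whereas mannitol has O/C = 1.0 and is rejected, consistent with its listing among the excluded high-heteroatom structures.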

Excluded Functional Groups

O-O bonds (peroxides)
Hemiacetals, aminals, acyclic imines, non-aromatic enols
Compounds containing both primary/secondary amines and aldehydes/ketones
Nonenumerated elements (F, Br, I, P, Si, metals)
Structures with a high heteroatom ratio (e.g., mannitol)

Hardware & Compute

Compute Cost: ~40,000 CPU hours for the 910 million C/N/O structures.
Infrastructure: Executed in parallel on a 500-node cluster.
Assembly Optimization: Replacing MM2 minimization with a geometry-based estimation of strain in polycyclic ring systems, together with the element-ratio filters, reduced assembly time 6.4-fold on the GDB-11 workload (from 1600 to 250 CPU hours).

Citation

@article{blum2009gdb13,
  title={970 million druglike small molecules for virtual screening in the chemical universe database GDB-13},
  author={Blum, Lorenz C and Reymond, Jean-Louis},
  journal={Journal of the American Chemical Society},
  volume={131},
  number={25},
  pages={8732--8733},
  year={2009},
  publisher={ACS Publications},
  doi={10.1021/ja902302h}
}
