Dataset Examples

Example GDB-13 molecule (SMILES: CCCC(O)(CO)CC1CC1CN)

Dataset Subsets

| Subset | Size | Description |
|---|---|---|
| C/N/O Set | ~910.1M | Molecules containing up to 13 atoms of Carbon, Nitrogen, and Oxygen. |
| Cl/S Set | ~67.4M | Molecules containing up to 13 atoms, adding Sulfur (aromatic heterocycles, sulfones, sulfonamides, thioureas) and Chlorine (aromatic substituents). |

| Dataset | Relationship | Link |
|---|---|---|
| GDB-11 | Predecessor | Notes |
| GDB-17 | Successor | Notes |

Key Contribution

The creation and release of GDB-13, a database of 977.4 million compounds that expands on GDB-11 in both molecular size (up to 13 atoms) and elemental diversity (adding S and Cl). This was made possible by algorithmic optimizations that substantially accelerated the enumeration process.

Overview

GDB-13 extends the systematic enumeration of drug-like chemical space to molecules containing up to 13 atoms of Carbon, Nitrogen, Oxygen, Sulfur, and Chlorine. Building on the methodology established in GDB-11, this database represents a 37-fold increase in size while maintaining 100% Lipinski compliance for virtual screening applications. The enumeration results in a vast array of cyclic topologies, where 54% of the database comprises molecules with at least one three- or four-membered ring.

Strengths

  • Systematic coverage of structures with up to 13 atoms
  • High drug-likeness: 100% Lipinski compliance and 99.5% Vieth compliance
  • High proportion of leadlike (98.9%) and fragmentlike (45.1%) molecules
  • Structural novelty providing fragments absent from established databases like ZINC, ACX, and PubChem

Limitations

  • Limited to small molecules with up to 13 atoms of C, N, O, S, and Cl
  • Omits 66.2% of known chemical space up to 13 atoms found in external databases
  • Excludes specific nonenumerated elements (F, Br, I, P, Si, metals) and functional groups (chlorine on nonaromatic carbons, mercaptans, sulfoxides, enamines, allenes)
  • Excludes highly strained molecules and highly polar combinations
  • Consists entirely of computer-generated structures pending experimental validation

Technical Notes

Algorithmic Approach

Type: Rule-Based Combinatorial Graph Enumeration

The dataset is built by combinatorial enumeration: the graph generator GENG produces candidate graphs, and rule-based chemical stability filters prune them.

Process:

  1. Start with mathematical graphs representing saturated hydrocarbons up to 13 nodes using GENG (non-planar graphs discarded)
  2. Apply topological filters to remove highly strained small ring systems (e.g., fused cyclopropanes and bridgehead 3/4-membered rings)
  3. Generate 3D structures via CORINA or ChemAxon and apply a volume-based strain filter. The local strain of a tetravalent carbon is estimated by the volume $V$ of the tetrahedron whose vertices lie $1 \text{ \AA}$ from the carbon along each of its four single bonds. Hydrocarbons with planar or pyramidal carbon centers, identified by the following condition, are discarded: $$ V < 0.345 \text{ \AA}^3 $$
  4. Introduce unsaturations and heteroatoms through systematic substitution
  5. Apply chemical rule filters and element-ratio heuristics to ensure stability and drug-likeness
  6. Apply post-processing algorithms to introduce nitro groups, nitriles, aromatic chlorines, thiophenes, sulfonamides, and thioureas
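
The volume test in step 3 can be illustrated with a short Python sketch. It assumes 3D coordinates for a carbon and its four bonded neighbors are already available (for example, from a structure generator); the function names `tetrahedron_volume` and `is_too_strained` are hypothetical, and this is an illustration of the geometric idea, not the authors' implementation.

```python
import math

# Threshold quoted in the text, in cubic angstroms.
STRAIN_THRESHOLD = 0.345


def _unit(v):
    """Scale a 3D vector to length 1 (i.e., 1 angstrom along the bond)."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def tetrahedron_volume(center, neighbors):
    """Volume of the tetrahedron whose vertices lie 1 angstrom from
    `center` along each of the four bonds to `neighbors`."""
    if len(neighbors) != 4:
        raise ValueError("expected exactly four bonded neighbors")
    # Vertices are expressed relative to the center; the offset cancels
    # when taking differences between vertices.
    p = [_unit(_sub(n, center)) for n in neighbors]
    a, b, c = _sub(p[1], p[0]), _sub(p[2], p[0]), _sub(p[3], p[0])
    # Scalar triple product a . (b x c) gives six times the signed volume.
    det = (
        a[0] * (b[1] * c[2] - b[2] * c[1])
        - a[1] * (b[0] * c[2] - b[2] * c[0])
        + a[2] * (b[0] * c[1] - b[1] * c[0])
    )
    return abs(det) / 6.0


def is_too_strained(center, neighbors, threshold=STRAIN_THRESHOLD):
    """True if the carbon center falls below the volume threshold."""
    return tetrahedron_volume(center, neighbors) < threshold


# Ideal tetrahedral geometry versus a fully planar center.
ideal = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
planar = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
v_ideal = tetrahedron_volume((0, 0, 0), ideal)
v_planar = tetrahedron_volume((0, 0, 0), planar)
```

For an ideal tetrahedral carbon the volume is $8 / (9\sqrt{3}) \approx 0.513 \text{ \AA}^3$, so the 0.345 cutoff corresponds to roughly two thirds of the unstrained value, while a planar center gives a volume of zero.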

Key Optimization: Replaced computationally expensive MM2 minimization (used in GDB-11) with a fast geometry-based estimation of strained polycyclic ring systems, combined with fast “element-ratio” filters. This achieved a 6.4-fold speedup in structure validation early in the pipeline.

Differences from GDB-11

  • Element Selection: Fluorine removed from allowed elements; sulfur and chlorine added for higher drug relevance (e.g., thiophenes, sulfonamides).
  • Optimization Method: MM2-based structure optimization replaced with a much faster, custom geometry-based estimation of local strain (measuring the tetrahedron volume of carbon centers).
  • Heuristic Filters: Fast elemental ratio filters added to quickly reject highly polar, unstable combinations early in the pipeline.

Reproducibility Details

Paper & Data Availability

  • Paper Access: The original paper is published in the Journal of the American Chemical Society (JACS) and is closed-access/paywalled. No open-access preprint exists on arXiv or ChemRxiv.
  • Data Access: The full GDB-13 database and its subsets are freely available via the Reymond Group Downloads Page and are persistently hosted on Zenodo.

Source Code & Algorithms

The exact custom source code (e.g., GENG orchestration, local strain filters) is not publicly available. Researchers must re-implement the rules as described in the paper and supplementary materials.

Heuristic Filters

Element-ratio filters, derived from an analysis of known compound databases, reject chemically unstable or highly polar molecules early in the generation pipeline:

$$ \begin{aligned} \frac{N + O}{C} &< 1.0 \\ \frac{N}{C} &< 0.571 \\ \frac{O}{C} &< 0.666 \end{aligned} $$
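
As a minimal sketch of how these thresholds could be applied, the Python snippet below counts heavy atoms in a SMILES string with a simple regular expression and tests the three ratios. The function names `count_elements` and `passes_ratio_filters` are hypothetical, the regex only handles the unbracketed organic subset of SMILES, and rejecting carbon-free inputs is an assumption; none of this is the authors' code.

```python
import re
from collections import Counter

# Matches two-letter halogens first, then single-letter organic-subset
# atoms (aliphatic uppercase, aromatic lowercase). Bracket atoms and other
# elements are NOT handled; this is a simplifying assumption.
_ATOM = re.compile(r"Cl|Br|[BCNOSPFI]|[cnos]")


def count_elements(smiles):
    """Count heavy atoms per element in a simple SMILES string."""
    counts = Counter()
    for symbol in _ATOM.findall(smiles):
        counts[symbol.capitalize()] += 1  # 'c' -> 'C', 'Cl' stays 'Cl'
    return counts


def passes_ratio_filters(counts):
    """Apply the three element-ratio thresholds quoted in the text."""
    c, n, o = counts["C"], counts["N"], counts["O"]
    if c == 0:
        return False  # assumption: ratios are undefined without carbon
    return (n + o) / c < 1.0 and n / c < 0.571 and o / c < 0.666


# The example GDB-13 molecule from the top of this page.
example = count_elements("CCCC(O)(CO)CC1CC1CN")
# Mannitol, cited in the text as a high-heteroatom-ratio structure.
mannitol = count_elements("OCC(O)C(O)C(O)C(O)CO")

example_ok = passes_ratio_filters(example)
mannitol_ok = passes_ratio_filters(mannitol)
```

The example molecule has 10 carbons, 1 nitrogen, and 2 oxygens, so all three ratios (0.3, 0.1, 0.2) fall under their limits. Mannitol has an O/C ratio of 1.0 and is rejected.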

Excluded Functional Groups

  • O-O bonds (peroxides)
  • Hemiacetals, aminals, acyclic imines, non-aromatic enols
  • Compounds containing both primary/secondary amines and aldehydes/ketones
  • Nonenumerated elements (F, Br, I, P, Si, metals)
  • High-heteroatom ratio structures (e.g., mannitol)

Hardware & Compute

  • Compute Cost: ~40,000 CPU hours for the 910 million C/N/O structures.
  • Infrastructure: Executed in parallel on a 500-node cluster.
  • Assembly Optimization: The switch from MM2 minimization to geometry-based estimation of strained polycyclic ring systems, alongside element-ratio filters, reduced assembly time 6.4-fold when measured on the GDB-11 workload (from 1600 CPU hours to 250 CPU hours).
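
As a quick check of these figures: the speedup is the ratio of the two assembly times. Under the hypothetical assumption of one CPU per node and an evenly distributed workload (neither is stated in the source), the C/N/O enumeration would correspond to about 80 hours of wall-clock time per node:

$$ \frac{1600}{250} = 6.4, \qquad \frac{40{,}000 \text{ CPU hours}}{500 \text{ nodes}} = 80 \text{ hours per node} $$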

Citation

@article{blum2009gdb13,
  title={970 million druglike small molecules for virtual screening in the chemical universe database GDB-13},
  author={Blum, Lorenz C and Reymond, Jean-Louis},
  journal={Journal of the American Chemical Society},
  volume={131},
  number={25},
  pages={8732--8733},
  year={2009},
  publisher={ACS Publications},
  doi={10.1021/ja902302h}
}