GDB-13: Chemical Universe Database (970M Molecules)
Dataset Details

Authors: Lorenz C. Blum, Jean-Louis Reymond
Paper Title: 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13
Institution: University of Berne
Published In: Journal of the American Chemical Society
Category: Computational Chemistry
Format: SMILES
Size: 977,468,314 molecules
Date: August 2025
Year: 2009
Links: Dataset, DOI, Paper
Figure: Example GDB-13 molecule (SMILES: `CCCC(O)(CO)CC1CC1CN`), demonstrating the expanded chemical space with up to 13 atoms

Key Contribution

The paper creates and releases GDB-13, a database of 977.4 million compounds. Relative to GDB-11, it extends molecular size to 13 atoms and broadens elemental diversity by adding S and Cl. This expansion was made possible by algorithmic optimizations that significantly accelerated the enumeration process.

Overview

GDB-13 extends the systematic enumeration of drug-like chemical space to molecules containing up to 13 atoms of C, N, O, S, and Cl. Building on the methodology established in GDB-11, this database represents a 37-fold increase in size while maintaining 100% Lipinski compliance for virtual screening applications.

Strengths

  • Systematic coverage of structures with up to 13 atoms
  • High drug-likeness: 100% Lipinski compliance
  • Structural novelty

Limitations

  • Limited to small molecules with up to 13 atoms of C, N, O, S, and Cl
  • Excludes highly strained molecules and some bond patterns
  • Excludes certain functional groups (such as enamines and hemiacetals) and highly polar molecules
  • Computer-generated structures, not experimentally validated compounds

Technical Notes

Algorithmic Approach

Type: Combinatorial Graph Enumeration (Non-ML)

This paper uses combinatorial enumeration, not machine learning. The “model” is a rule-based graph generation algorithm (GENG) combined with chemical stability filters, not a neural network or trained system.

Process:

  1. Start with mathematical graphs representing saturated hydrocarbons
  2. Apply topological and strain criteria to filter unstable structures
  3. Introduce unsaturations and heteroatoms through systematic substitution
  4. Apply chemical rule filters to ensure stability and drug-likeness
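The substitution step can be illustrated with a toy sketch. It is not the authors' implementation: the skeleton, the `MAX_VALENCE` table, and the function names are assumptions made for illustration, only C, N, and O are considered, and unsaturations are ignored. It labels every node of a saturated skeleton with each element whose standard valence can accommodate that node's number of bonds.

```python
from itertools import product

# Standard maximum valences; S and Cl are omitted to keep the toy small.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


def node_degrees(edges, n_nodes):
    """Count how many bonds each node of the skeleton graph takes part in."""
    degrees = [0] * n_nodes
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


def enumerate_substitutions(edges, n_nodes):
    """Yield every element labelling of the skeleton that respects valence."""
    degrees = node_degrees(edges, n_nodes)
    # A node with d bonds may only hold an element whose valence is >= d.
    choices = [
        [element for element, valence in MAX_VALENCE.items() if valence >= d]
        for d in degrees
    ]
    for labels in product(*choices):
        yield labels


# Hypothetical isobutane-like skeleton: node 0 is bonded to nodes 1, 2 and 3.
skeleton = [(0, 1), (0, 2), (0, 3)]
labelings = list(enumerate_substitutions(skeleton, 4))
print(len(labelings), "valence-compatible labelings")
```

The central node has three bonds, so oxygen is never placed there. The sketch does not remove symmetry-equivalent duplicates (for example, swapping two terminal atoms), which a real enumeration would have to canonicalize before applying the stability and drug-likeness filters of step 4.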

Key Optimization: Replaced computationally expensive MM2 minimization (used in GDB-11) with a fast geometry-based estimation, achieving a 6.4-fold speedup in structure validation.

Differences from GDB-11

  • Element Selection: Fluorine removed from allowed elements; sulfur and chlorine added
  • Optimization Method: MM2-based structure minimization replaced with a much faster geometry-based estimation
  • Heuristic Filters: Fast elemental ratio filters added to auto-reject unstable structures early in the pipeline (informed by analysis of existing molecular databases)

Replication Details

Heuristic Filters

Implemented element-ratio filters derived from analyzing known compound databases to reject chemically unstable molecules early:

  • $(N + O)/C < 1.0$
  • $N/C < 0.571$
  • $O/C < 0.666$
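A minimal sketch of how these three ratio checks could be applied to a SMILES string is shown below. The regular-expression atom counter is an assumption made for illustration: it only recognizes the GDB-13 elements (C, N, O, S, Cl, including lowercase aromatic forms) and is not a general SMILES parser. Rejecting carbon-free strings is also an assumption, since the ratios are undefined in that case.

```python
import re
from collections import Counter

# Matches GDB-13 heavy atoms only; "Cl" must be tried before "C".
ATOM_TOKEN = re.compile(r"Cl|[CNOScnos]")


def element_counts(smiles: str) -> Counter:
    """Count heavy atoms per element, folding aromatic lowercase into uppercase."""
    return Counter(token.capitalize() for token in ATOM_TOKEN.findall(smiles))


def passes_ratio_filters(smiles: str) -> bool:
    """Apply the (N + O)/C, N/C and O/C thresholds listed above."""
    counts = element_counts(smiles)
    c, n, o = counts["C"], counts["N"], counts["O"]
    if c == 0:
        return False  # assumption: ratios are undefined without carbon, so reject
    return (n + o) / c < 1.0 and n / c < 0.571 and o / c < 0.666


print(passes_ratio_filters("CCCC(O)(CO)CC1CC1CN"))  # example molecule from this page
print(passes_ratio_filters("OCCO"))                 # (N + O)/C equals 1.0
```

For the example molecule, the counts are C = 10, N = 1, O = 2, giving ratios of 0.3, 0.1, and 0.2, so it passes all three checks. `OCCO` fails because its (N + O)/C ratio is exactly 1.0, which is not strictly below the threshold.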

Excluded Functional Groups

  • Enamines
  • Hemiacetals
  • High-heteroatom ratio structures
  • Fluorine-containing compounds (rare in virtual screening contexts)

Hardware & Compute

  • Compute Cost: ~40,000 CPU hours (approximately 4.5 years of single-core compute time)
  • Infrastructure: Executed in parallel on a 500-node cluster
  • Total Wall Time: Significantly reduced through parallelization
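The figures above can be sanity-checked with a few lines of arithmetic. The wall-time estimate is only an idealized lower bound under two assumptions that this page does not confirm: one core per node and perfect parallel scaling.

```python
CPU_HOURS = 40_000          # approximate compute cost quoted above
NODES = 500                 # cluster size quoted above
HOURS_PER_YEAR = 24 * 365

# Equivalent time if everything ran on a single core.
single_core_years = CPU_HOURS / HOURS_PER_YEAR

# Assumption: one core per node and perfect scaling, so this is a lower bound.
ideal_wall_hours = CPU_HOURS / NODES

print(f"{single_core_years:.2f} single-core years")
print(f"{ideal_wall_hours:.0f} hours of ideal wall time")
```

This gives about 4.57 single-core years, consistent with the "approximately 4.5 years" above, and roughly 80 hours of wall time in the idealized case.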

Dataset Information

Format

SMILES

Size

Molecules: 977,468,314

Dataset Examples

Example GDB-13 molecule (SMILES: `CCCC(O)(CO)CC1CC1CN`)
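As a quick consistency check, the sketch below reads SMILES strings line by line and counts heavy atoms to confirm they stay within the 13-atom limit. The in-memory `sample` stands in for a real file, whose name and layout are not specified here. The one-SMILES-per-line layout and the regular-expression counter (restricted to C, N, O, S, Cl) are assumptions, not a documented file format or a full SMILES parser.

```python
import io
import re

# Matches GDB-13 heavy atoms only; "Cl" must be tried before "C".
ATOM_TOKEN = re.compile(r"Cl|[CNOScnos]")
MAX_HEAVY_ATOMS = 13


def heavy_atom_count(smiles: str) -> int:
    """Count heavy atoms; ring-closure digits and parentheses are ignored."""
    return len(ATOM_TOKEN.findall(smiles))


def iter_smiles(handle):
    """Yield the first whitespace-separated field of each non-empty line."""
    for line in handle:
        fields = line.split()
        if fields:
            yield fields[0]


# Stand-in for an open file handle; streaming avoids loading ~977M lines at once.
sample = io.StringIO("CCCC(O)(CO)CC1CC1CN\n")
sizes = {smiles: heavy_atom_count(smiles) for smiles in iter_smiles(sample)}

for smiles, size in sizes.items():
    print(smiles, size, size <= MAX_HEAVY_ATOMS)
```

The example molecule contains 10 carbons, 2 oxygens, and 1 nitrogen, so it sits exactly at the 13-atom limit.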
Related datasets:

  • GDB-11: Predecessor
  • GDB-17: Successor

Citation

If you use this dataset, please cite:

https://doi.org/10.1021/ja902302h