Key Contribution
The creation and release of GDB-13, a database of 977.4 million compounds. Relative to GDB-11, it extends molecular size to 13 atoms and adds S and Cl to the allowed elements, an expansion made possible by algorithmic optimizations that accelerated the enumeration process.
Overview
GDB-13 extends the systematic enumeration of drug-like chemical space to molecules containing up to 13 atoms of C, N, O, S, and Cl. Building on the methodology established in GDB-11, this database represents a 37-fold increase in size (from 26.4 million compounds in GDB-11 to 977.4 million) while maintaining 100% Lipinski compliance for virtual screening applications.
Strengths
- Systematic coverage of structures with up to 13 atoms
- High drug-likeness: 100% Lipinski compliance
- Structural novelty
Limitations
- Limited to small molecules with up to 13 atoms of C, N, O, S, and Cl
- Excludes highly strained molecules and some bond patterns
- Excludes unstable functional groups (e.g., enamines and hemiacetals) and highly polar molecules with high heteroatom-to-carbon ratios
- Computer-generated structures, not experimentally validated compounds
Technical Notes
Algorithmic Approach
Type: Combinatorial Graph Enumeration (Non-ML)
This paper uses combinatorial enumeration, not machine learning. The “model” is a rule-based graph generation algorithm (GENG) combined with chemical stability filters, not a neural network or trained system.
Process:
- Start with mathematical graphs representing saturated hydrocarbons
- Apply topological and strain criteria to filter unstable structures
- Introduce unsaturations and heteroatoms through systematic substitution
- Apply chemical rule filters to ensure stability and drug-likeness
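The first step of this process can be illustrated with a small brute-force sketch. It is not GENG and does not reproduce the paper's pipeline: it simply enumerates connected graphs whose nodes have at most four neighbors (carbon valence) and removes duplicates that differ only by node labeling. The function names and the permutation-based canonical form are assumptions chosen for clarity, and the approach is only practical for about five nodes or fewer.

```python
from itertools import combinations, permutations

MAX_DEGREE = 4  # a carbon atom forms at most four bonds


def is_connected(n, edges):
    """Depth-first search from node 0 must reach every node."""
    adjacency = {i: set() for i in range(n)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == n


def respects_valence(n, edges):
    """Reject graphs in which any node exceeds the carbon valence."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return max(degree, default=0) <= MAX_DEGREE


def canonical_form(n, edges):
    """Smallest relabelled edge list over all node permutations (toy approach)."""
    best = None
    for perm in permutations(range(n)):
        relabelled = tuple(
            sorted(tuple(sorted((perm[a], perm[b]))) for a, b in edges)
        )
        if best is None or relabelled < best:
            best = relabelled
    return best


def enumerate_skeletons(n):
    """Return one representative per isomorphism class of carbon skeletons."""
    all_pairs = list(combinations(range(n), 2))
    unique = set()
    # A connected graph on n nodes needs at least n - 1 edges.
    for k in range(n - 1, len(all_pairs) + 1):
        for edges in combinations(all_pairs, k):
            if respects_valence(n, edges) and is_connected(n, edges):
                unique.add(canonical_form(n, edges))
    return sorted(unique)


if __name__ == "__main__":
    print(len(enumerate_skeletons(4)))  # 6 skeletons for four carbons
```

For four nodes the sketch yields six skeletons, including the complete graph that corresponds to tetrahedrane, which is the kind of highly strained structure the topological and strain criteria in the next step are meant to remove.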
Key Optimization: Replaced computationally expensive MM2 minimization (used in GDB-11) with a fast geometry-based estimation, achieving a 6.4-fold speedup in structure validation.
Differences from GDB-11
- Element Selection: Fluorine removed from allowed elements; sulfur and chlorine added
- Optimization Method: MM2-based structure minimization replaced with a fast geometry-based estimation (6.4-fold speedup in structure validation)
- Heuristic Filters: Fast elemental ratio filters added to auto-reject unstable structures early in the pipeline (informed by analysis of existing molecular databases)
Replication Details
Heuristic Filters
Implemented element-ratio filters derived from analyzing known compound databases to reject chemically unstable molecules early:
- $(N + O)/C < 1.0$
- $N/C < 0.571$
- $O/C < 0.666$
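The three ratio thresholds above translate directly into a predicate on atom counts. The following is a minimal sketch, assuming the counts are already available as integers; the function name and the handling of carbon-free formulas are illustrative choices, not details from the paper.

```python
def passes_ratio_filters(n_carbon, n_nitrogen, n_oxygen):
    """Return True if a formula satisfies all three element-ratio filters."""
    if n_carbon == 0:
        # Assumption: a formula without carbon is rejected outright,
        # since every ratio would be undefined.
        return False
    return (
        (n_nitrogen + n_oxygen) / n_carbon < 1.0
        and n_nitrogen / n_carbon < 0.571
        and n_oxygen / n_carbon < 0.666
    )


if __name__ == "__main__":
    print(passes_ratio_filters(6, 1, 1))  # True: all three ratios are low
    print(passes_ratio_filters(2, 1, 1))  # False: (N + O)/C equals 1.0
    print(passes_ratio_filters(3, 2, 0))  # False: N/C is about 0.667
```

Because the check needs only atom counts, it can run before any geometry or substructure analysis, which is what makes it suitable for early rejection.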
Excluded Functional Groups
- Enamines
- Hemiacetals
- Structures with high heteroatom-to-carbon ratios
- Fluorine-containing compounds (rare in virtual screening contexts)
Hardware & Compute
- Compute Cost: ~40,000 CPU hours (approximately 4.5 years of single-core compute time)
- Infrastructure: Executed in parallel on a 500-node cluster
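- Total Wall Time: Not stated in these notes; assuming one core per node and ideal scaling, 40,000 CPU hours spread across 500 nodes would correspond to roughly 80 hours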

