Key Contribution
GDB-17 enumerates 166.4 billion compounds and extends the enumerated chemical universe into the drug-relevant size range of up to 17 atoms. A rewritten algorithm running 400 times faster made this possible, and the resulting chemical space is novel and rich in three-dimensional and stereochemically complex structures.
Overview
GDB-17 is the largest enumerated database of drug-like small molecules, containing 166.4 billion structures with up to 17 atoms of C, N, O, S, and halogens (F, Cl, Br, I). This database reaches the size range typical of most approved drugs and reveals unprecedented structural diversity, particularly in 3D architecture and ring systems.
Strengths
- Systematic enumeration of structures within the defined element set and size limit
- Structural novelty, especially 3D diversity
- Significant diversity in scaffolds and ring systems
Limitations
- Experimental Gap: These are virtual molecules; while chemically reasonable, they have not been synthesized or tested
- Elemental Scope: Excludes P, Si, B, and other drug-relevant elements (limited to C, N, O, S, halogens)
- Stability Filters: Excludes specific functional groups deemed unstable or difficult to synthesize (e.g., hemiacetals, acyclic acetals, carbonic acids, aminals). Despite these exclusions, the database is on average more polar than PubChem
- Small Ring Dominance: Compounds containing small (3- or 4-membered) rings make up a large share of the database (83% of molecules with up to 16 atoms); such rings are chemically challenging and rare in approved drugs
Technical Notes
Hardware & Software
- Compute: 360-CPU cluster, consuming 100,000 CPU hours (approximately 11 CPU years)
- Software: Uses GENG (from the Nauty package) for graph generation and CORINA for 3D stereoisomer generation and counting
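As a quick check on the compute figures above, the following sketch converts the reported CPU hours into CPU years and into an idealised wall-clock time. The wall-clock estimate assumes all 360 CPUs were fully used for the whole run, which is an assumption made here for illustration and not a figure from the source.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

cpu_hours = 100_000  # total CPU time reported above
n_cpus = 360         # cluster size reported above

# CPU years: total CPU time expressed in years of a single CPU
cpu_years = cpu_hours / HOURS_PER_YEAR

# Idealised wall-clock time, ASSUMING perfect parallel use of all CPUs
wall_clock_days = cpu_hours / n_cpus / 24

print(f"{cpu_years:.1f} CPU years")                      # 11.4 CPU years
print(f"{wall_clock_days:.1f} days at full utilisation")  # 11.6 days
```

The result of about 11.4 CPU years agrees with the "approximately 11 CPU years" quoted above.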
Differences from GDB-13
- The generation algorithm was entirely rewritten for memory efficiency, resulting in a 400-fold increase in computing speed that enabled enumeration up to 17 atoms
- The scope of allowed elements was expanded to include all halogens (F, Cl, Br, I)
- More aggressive, size-dependent graph selection filters were introduced to manage the combinatorial explosion, such as restricting or prohibiting small rings and complex bridgeheads in molecules with 14 or more atoms
- A multi-step post-processing stage was added to introduce specific functional groups (e.g., oximes, nitro groups, $\text{CF}_3$, sulfones) that were not generated during the main combinatorial step
- A new functional group filter was implemented to exclude 17-atom molecules containing non-aromatic C=C bonds, further controlling the output size
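The idea of a size-dependent graph filter can be illustrated with a toy sketch. It is not the GDB-17 implementation: the function names, the adjacency-map representation, and the rule of rejecting every 3- or 4-membered ring outright at 14 or more atoms are all simplifying assumptions. The actual filters restrict or prohibit small rings and bridgeheads by more detailed criteria.

```python
from itertools import combinations


def graph_from_edges(n_atoms, edges):
    """Build an adjacency map {atom: set(neighbours)} for a simple graph."""
    adj = {i: set() for i in range(n_atoms)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def has_small_ring(adj):
    """Return True if the graph contains a 3- or 4-membered cycle."""
    # 3-ring: two neighbours of the same atom are bonded to each other.
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                return True
    # 4-ring: two distinct atoms share at least two common neighbours.
    for u, v in combinations(adj, 2):
        if len(adj[u] & adj[v]) >= 2:
            return True
    return False


def passes_size_filter(adj, min_atoms=14):
    """Hypothetical rule: reject small rings once the graph has >= min_atoms nodes."""
    if len(adj) < min_atoms:
        return True  # small graphs are kept regardless of ring size
    return not has_small_ring(adj)


chain = [(i, i + 1) for i in range(13)]  # 14 atoms in an acyclic chain
cyclopropane = graph_from_edges(3, [(0, 1), (1, 2), (0, 2)])
plain_chain = graph_from_edges(14, chain)
chain_with_3_ring = graph_from_edges(14, chain + [(0, 2)])
chain_with_4_ring = graph_from_edges(14, chain + [(0, 3)])
chain_with_6_ring = graph_from_edges(14, chain + [(0, 5)])

print(passes_size_filter(cyclopropane))       # True: below the size threshold
print(passes_size_filter(plain_chain))        # True: no rings at all
print(passes_size_filter(chain_with_3_ring))  # False: 3-ring at 14 atoms
print(passes_size_filter(chain_with_4_ring))  # False: 4-ring at 14 atoms
print(passes_size_filter(chain_with_6_ring))  # True: 6-ring is allowed
```

Applying such a test at the graph level, before atoms and bond orders are assigned, discards whole families of molecules at once. This is why filters of this kind are effective at containing the combinatorial explosion.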

