GDB-17

GDB-17: Chemical Universe Database (166B Molecules)
Dataset Details
Authors: Lars Ruddigkeit, Ruud van Deursen, Lorenz C. Blum, Jean-Louis Reymond
Paper Title: Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17
Institutions: University of Berne, Ecole Polytechnique Fédérale de Lausanne
Published In: Journal of Chemical Information and Modeling
Category: Computational Chemistry
Format: SMILES
Size: 166,443,860,262 molecules (50 million subset available)
Date: August 2025
Year: 2012
Links: Dataset, DOI, Paper
Example GDB-17 molecule demonstrating the complex 3D diversity and polycyclic structures characteristic of the 166 billion molecule database

Key Contribution

GDB-17 enumerates 166.4 billion compounds, extending the enumerated chemical universe into the drug-relevant size range of up to 17 atoms. The enumeration was made possible by a 400-fold faster algorithm, and the resulting chemical space is rich in three-dimensional and stereochemically complex structures.

Overview

GDB-17 represents the largest enumerated database of drug-like small molecules, containing 166 billion structures with up to 17 atoms of C, N, O, S, and halogens (F, Cl, Br, I). This database reaches the size range typical of most approved drugs and reveals unprecedented structural diversity, particularly in 3D architecture and ring systems.
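The size and element limits stated above can be checked mechanically. The following is a minimal sketch, not the authors' enumeration code: it tests only the heavy-atom count and the element set, and ignores every other filter applied when building GDB-17. The names `heavy_atoms` and `within_gdb17_scope` are hypothetical, and the regex-based tokenizer assumes the input is already a valid SMILES string.

```python
import re

# Element scope stated in the overview: C, N, O, S and the halogens.
ALLOWED = {"C", "N", "O", "S", "F", "Cl", "Br", "I"}

# A bracket atom such as [N+] or [nH], or an organic-subset atom written bare.
ATOM = re.compile(r"\[([^\]]+)\]|(Cl|Br|[BCNOPSFI]|[bcnops])")
# Inside brackets: optional isotope digits, then the element symbol.
BRACKET_ELEMENT = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atoms(smiles):
    """Return the element symbols of all non-hydrogen atoms in a SMILES string."""
    elements = []
    for match in ATOM.finditer(smiles):
        if match.group(1) is not None:
            inner = BRACKET_ELEMENT.match(match.group(1))
            if inner is None:
                raise ValueError(f"cannot read bracket atom [{match.group(1)}]")
            symbol = inner.group(1)
        else:
            symbol = match.group(2)
        symbol = symbol.capitalize()  # aromatic 'c' counts as element 'C'
        if symbol != "H":
            elements.append(symbol)
    return elements


def within_gdb17_scope(smiles, max_atoms=17):
    """True if the molecule has at most max_atoms heavy atoms, all from ALLOWED."""
    elements = heavy_atoms(smiles)
    return 0 < len(elements) <= max_atoms and set(elements) <= ALLOWED


print(within_gdb17_scope("C1CC2C3CCCC3C3(C4CCC3CC4)C2C1"))  # example from this page
print(within_gdb17_scope("CP(C)C"))                          # phosphorus is out of scope
```

A result of `True` means only that a molecule fits the stated size and element limits; it does not imply that the molecule is actually present in GDB-17.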

Strengths

  • Systematic enumeration of structures with up to 17 atoms of C, N, O, S, and halogens
  • Structural novelty, especially 3D diversity
  • Significant diversity in scaffolds and ring systems

Limitations

  • Experimental Gap: These are virtual molecules; while chemically reasonable, they have not been synthesized or tested
  • Elemental Scope: Excludes P, Si, B, and other drug-relevant elements (limited to C, N, O, S, halogens)
  • Stability Filters: Excludes specific functional groups deemed unstable or difficult to synthesize (e.g., hemiacetals, acyclic acetals, carbonic acids, aminals), though the database is on average more polar than PubChem
  • Small Ring Dominance: A large portion of the database (83% of the molecules with up to 16 atoms) consists of compounds with small rings (3- or 4-membered), which are chemically challenging and rare in approved drugs

Technical Notes

Hardware & Software

  • Compute: 360-CPU cluster, consuming 100,000 CPU hours (approximately 11 CPU years)
  • Software: Uses GENG (from the Nauty package) for graph generation and CORINA for 3D stereoisomer generation and counting
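The compute figures above can be put in perspective with some quick arithmetic. This is a back-of-the-envelope sketch: the wall-clock estimate assumes all 360 CPUs were busy for the entire run, and the throughput figure averages over the whole pipeline. Neither assumption comes from the source.

```python
CPU_HOURS = 100_000          # total compute reported above
CPUS = 360                   # cluster size reported above
MOLECULES = 166_443_860_262  # database size
HOURS_PER_YEAR = 24 * 365

# CPU hours expressed in CPU years.
cpu_years = CPU_HOURS / HOURS_PER_YEAR

# Idealized wall-clock time, assuming perfect parallel utilization (an assumption).
wall_clock_days = CPU_HOURS / CPUS / 24

# Average number of final molecules produced per CPU second.
molecules_per_cpu_second = MOLECULES / (CPU_HOURS * 3600)

print(f"{cpu_years:.1f} CPU years")
print(f"{wall_clock_days:.1f} days of wall-clock time at full utilization")
print(f"{molecules_per_cpu_second:.0f} molecules per CPU second")
```

The first figure reproduces the "approximately 11 CPU years" quoted above. The other two are derived estimates, not reported values.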

Differences from GDB-13

  • The generation algorithm was entirely rewritten for memory efficiency, resulting in a 400-fold increase in computing speed that enabled enumeration up to 17 atoms
  • The scope of allowed elements was expanded to include all halogens (F, Cl, Br, I)
  • More aggressive, size-dependent graph selection filters were introduced to manage the combinatorial explosion, such as restricting or prohibiting small rings and complex bridgeheads in molecules with 14 or more atoms
  • A multi-step post-processing stage was added to introduce specific functional groups (e.g., oximes, nitro groups, $\text{CF}_3$, sulfones) that were not generated during the main combinatorial step
  • A new functional group filter was implemented to remove non-aromatic C=C bonds for molecules with 17 atoms, further controlling the output size
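The last filter in the list above can be illustrated with a small sketch. This is not the authors' implementation, which is not described here; it is a hypothetical check that rejects a 17-atom molecule if its SMILES contains a double bond between two aliphatic carbons. It assumes bracket-free SMILES with aromatic rings written in lowercase (a Kekulé-form benzene would be wrongly flagged). The names `parse_bonds`, `has_nonaromatic_cc_double`, and `passes_17_atom_filter` are illustrative.

```python
import re

# Atoms (two-letter halogens first), bond symbols, branches, ring-closure labels.
TOKEN = re.compile(r"Cl|Br|[CNOSFI]|[cnos]|[-=#:]|[()]|%\d{2}|\d")


def parse_bonds(smiles):
    """Parse a bracket-free SMILES into (atoms, bonds).

    atoms is a list of symbols as written; bonds is a list of (i, j, symbol).
    Implicit bonds are recorded as "-", even between aromatic atoms.
    """
    atoms, bonds = [], []
    prev, pending = None, None          # last atom index, explicit bond awaiting an atom
    branch_stack, open_rings = [], {}
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError(f"unsupported SMILES syntax at position {pos}")
        tok, pos = m.group(), m.end()
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok in ("-", "=", "#", ":"):
            pending = tok
        elif tok[0] in "%0123456789":   # ring-closure label
            label = int(tok.lstrip("%"))
            if label in open_rings:
                other, symbol = open_rings.pop(label)
                bonds.append((other, prev, pending or symbol or "-"))
            else:
                open_rings[label] = (prev, pending)
            pending = None
        else:                           # atom
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending or "-"))
            prev, pending = len(atoms) - 1, None
    return atoms, bonds


def has_nonaromatic_cc_double(smiles):
    """True if any double bond joins two aliphatic (uppercase) carbons."""
    atoms, bonds = parse_bonds(smiles)
    return any(sym == "=" and atoms[i] == "C" and atoms[j] == "C"
               for i, j, sym in bonds)


def passes_17_atom_filter(smiles):
    """Reject only molecules that have 17 heavy atoms and a non-aromatic C=C."""
    atoms, _ = parse_bonds(smiles)
    return not (len(atoms) == 17 and has_nonaromatic_cc_double(smiles))


print(passes_17_atom_filter("C=C" + "C" * 15))  # 17 atoms with an alkene
print(passes_17_atom_filter("C=CC"))            # smaller molecules are unaffected
```

Because the check applies only at 17 atoms, smaller alkenes pass unchanged, which matches the size-dependent nature of the filters described above.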

Dataset Information

Format

SMILES

Size

Molecules: 166,443,860,262 (50 million subset available)

Dataset Examples

Example GDB-17 molecule (SMILES: `C1CC2C3CCCC3C3(C4CCC3CC4)C2C1`) demonstrating the complex polycyclic structures and 3D diversity characteristic of the database
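The polycyclic character of this example can be read directly off its SMILES string without a cheminformatics toolkit. The sketch below rests on two assumptions: the SMILES is connected and bracket-free, so each pair of ring-closure labels adds exactly one ring, and the molecule is a saturated hydrocarbon, so the hydrogen count follows from H = 2n + 2 - 2r for n carbons and r rings. The function names are hypothetical.

```python
import re

EXAMPLE = "C1CC2C3CCCC3C3(C4CCC3CC4)C2C1"


def ring_count(smiles):
    """Number of rings (cyclomatic number) of a connected, bracket-free SMILES."""
    closures = re.findall(r"%\d{2}|\d", smiles)
    if len(closures) % 2:
        raise ValueError("unmatched ring-closure label")
    return len(closures) // 2


def saturated_hydrocarbon_formula(smiles):
    """Molecular formula for a SMILES made only of aliphatic C, branches and ring labels."""
    if not re.fullmatch(r"[C()%\d]+", smiles):
        raise ValueError("only saturated hydrocarbons are supported")
    carbons = smiles.count("C")
    hydrogens = 2 * carbons + 2 - 2 * ring_count(smiles)
    return f"C{carbons}H{hydrogens}"


print(ring_count(EXAMPLE))                     # 5
print(saturated_hydrocarbon_formula(EXAMPLE))  # C17H26
```

With 17 carbons and five rings, the example sits at the upper size limit of the database.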

Related Datasets

  • GDB-11: Predecessor
  • GDB-13: Predecessor

Citation

If you use this dataset, please cite:

https://doi.org/10.1021/ci300415d