GDB-17: Chemical Universe Database (166.4B Molecules)

Key Contribution

GDB-17 systematically enumerates 166.4 billion organic molecules of up to 17 atoms, extending the enumerated chemical universe into the drug-relevant size range. The resulting chemical space is dense, largely novel, and measurably richer in stereochemically complex, three-dimensional structures than historically biased databases of known compounds.

Overview

GDB-17 represents the largest enumerated database of drug-like small molecules, reaching the size range typical of lead compounds and approved drugs ($100 < \text{MW} < 350$ Da). It contains 166.4 billion structures consisting of up to 17 atoms of C, N, O, S, and halogens (F, Cl, Br, I). Because the number of combinatorial possibilities grows exponentially with heavy atom count (HAC), the largest molecules dominate the database and its MW distribution peaks sharply around $240 \text{ Da}$. Compared to known molecules in databases like PubChem, GDB-17 molecules are significantly richer in non-aromatic heterocycles, quaternary centers, and stereoisomers, escaping “flatland” by densely populating the third dimension in shape space.

Dataset Examples

Example GDB-17 molecule (SMILES: `C1CC2C3CCCC3C3(C4CCC3CC4)C2C1`) demonstrating the complex polycyclic structures and 3D diversity characteristic of the database

Dataset Subsets

Subset	Size	Description
GDB-17 (Full)	166.4B	Complete enumeration of the database
GDBLL-17	29B	Lead-like subset ($1 < \text{clogP} < 3$ and $100 < \text{MW} < 350$ Da)
GDBLLnoSR-17	22B	Lead-like subset excluding compounds with small rings (3- or 4-membered)
Random Sample	50M	Random 50M subset available for download, including pre-filtered lead-like and no-small-ring fractions
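The subset definitions above can be sketched as a small filter. This is a minimal illustration, not the authors' code: the `Compound` record, its field names, and the idea that clogP, MW, and smallest ring size arrive precomputed from a cheminformatics toolkit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Compound:
    """Hypothetical record; descriptor values are assumed to be precomputed."""
    name: str
    clogp: float
    mw: float                      # molecular weight in Da
    smallest_ring: Optional[int]   # size of the smallest ring, None if acyclic


def is_lead_like(c: Compound) -> bool:
    # Thresholds from the table: 1 < clogP < 3 and 100 < MW < 350 Da (strict bounds).
    return 1.0 < c.clogp < 3.0 and 100.0 < c.mw < 350.0


def has_small_ring(c: Compound) -> bool:
    # "Small" means a 3- or 4-membered ring.
    return c.smallest_ring is not None and c.smallest_ring <= 4


def subsets(c: Compound) -> List[str]:
    """Return the names of the subsets a compound would fall into."""
    names = ["GDB-17"]
    if is_lead_like(c):
        names.append("GDBLL-17")
        if not has_small_ring(c):
            names.append("GDBLLnoSR-17")
    return names


if __name__ == "__main__":
    # Descriptor values below are invented purely for illustration.
    for c in (
        Compound("example-a", clogp=2.0, mw=240.0, smallest_ring=6),
        Compound("example-b", clogp=2.0, mw=240.0, smallest_ring=3),
        Compound("example-c", clogp=-0.5, mw=240.0, smallest_ring=None),
    ):
        print(c.name, subsets(c))
```

Nesting the small-ring check inside the lead-like check mirrors the table, where GDBLLnoSR-17 is described as a lead-like subset with small-ring compounds removed.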

Benchmarks

Note: As an enumerated database of theoretical structures, GDB-17 lacks standard supervised ML benchmarks. It functions primarily as a generative compass and foundational exploration library for unsupervised learning and molecular generation.

Related Datasets

Dataset	Relationship	Link
GDB-11	Predecessor	Notes
GDB-13	Predecessor	Notes

Strengths & Limitations

Strengths:

3D Shape Space (“Escape out of Flatland”): Populates the third dimension (spherical, non-planar shapes) significantly better than known structures in PubChem or ChEMBL, which are primarily flat and rod-like due to aromatic dominance
Stereochemical Complexity: Averages 6.4 possible stereoisomers per molecule (compared to 2.0 in PubChem-17), driven by an abundance of non-planar features and small rings
Massive Scaffold Diversity: Features 35-fold more Murcko scaffolds and 61-fold more ring systems than molecules of matching size in PubChem
Rich in Known Drug Isomers: Contains millions of exact geometric and formula isomers of approved drugs, offering direct variations and “methyl walk” analogs

Limitations:

Experimental Gap: These are virtual, combinatorially enumerated molecules. Despite strict computational stability filtering, they remain unsynthesized and lack experimental validation.
Small Ring Dominance: Among molecules of up to 16 atoms, roughly 83% contain challenging small (3- or 4-membered) rings. The fraction drops for the 17-atom set, giving an overall small-ring fraction of 28%
Elemental Scope Restrictions: Elements like P, Si, and B, which occasionally appear in drugs, are completely excluded
Strict Stability Filters: Excludes some potentially viable functional groups strictly to manage the combinatorial explosion and avoid unstable structures (e.g., hemiacetals, aminals, acyclic acetals)
Polarity Skew: The full database contains disproportionately more polar molecules ($\text{clogP} < 0$) than reference sets, and its sheer size makes it computationally demanding to query using advanced docking or 3D shape tools

Technical Notes

Generation Pipeline

GDB-17 was generated from first principles through a highly filtered, multi-stage pipeline:

Graphs $\rightarrow$ Hydrocarbons: Started with 114.3 billion topologies (generated using GENG), filtered down to 5.4 million stable hydrocarbons by applying geometrical strain rules (H-filters).
Hydrocarbons $\rightarrow$ Skeletons: Substituted single bonds with double and triple bonds to yield 1.3 billion skeletons, simultaneously removing reactive unsaturations like allenes (S-filters).
Skeletons $\rightarrow$ CNO Molecules: Diversified into 110.4 billion molecules by combinatorially substituting C with N and O, explicitly avoiding heteroatom-heteroatom bonds and enforcing stability filters (F-filters).
Post-processing: Added diversity by transforming groups to generate aromatics, oximes, $\text{CF}_3$, halogens, and sulfones (P-filters), yielding the final 166.4 billion count.
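The skeleton-to-molecule substitution step can be illustrated with a toy enumerator. This is a simplified sketch and not the GDB-17 implementation: the function name, the neighbour caps, and the graph encoding are assumptions, every bond is treated as single, symmetry-equivalent labelings are not merged, and none of the real F-filters are applied beyond the rule against heteroatom-heteroatom bonds.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Simplified caps on heavy-atom neighbours (assumption: all bonds are single).
MAX_NEIGHBORS = {"C": 4, "N": 3, "O": 2}


def substitute_heteroatoms(n_atoms: int, edges: Sequence[Tuple[int, int]]) -> List[str]:
    """Label each node of a skeleton graph with C, N, or O.

    Keeps a labeling only if no atom exceeds its neighbour cap and no edge
    joins two heteroatoms. Returns labelings as strings indexed by node.
    """
    degree = [0] * n_atoms
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    kept = []
    for labels in product("CNO", repeat=n_atoms):
        # Reject labelings that overload an atom's valence.
        if any(degree[i] > MAX_NEIGHBORS[el] for i, el in enumerate(labels)):
            continue
        # Reject heteroatom-heteroatom bonds (N-N, N-O, O-O).
        if any(labels[a] != "C" and labels[b] != "C" for a, b in edges):
            continue
        kept.append("".join(labels))
    return kept


if __name__ == "__main__":
    propane_like = substitute_heteroatoms(3, [(0, 1), (1, 2)])
    print(len(propane_like), propane_like)
```

Even this three-atom chain yields 11 labelings out of 27 candidates, which hints at why the real pipeline needs aggressive filtering once skeletons reach 17 atoms.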

Hardware & Software

Compute: Ran over 40,000 jobs spread across a 360-CPU cluster, consuming 100,000 CPU hours (~11 CPU years)
Software: Powered by GENG (Nauty package) for graph generation, CORINA for 3D stereoisomer generation, and ChemAxon JChem libraries running inside custom Java 1.6 applications
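As a quick check of the compute figures, the conversion to CPU years and a rough wall-clock estimate can be worked out directly. The wall-clock value assumes all 360 CPUs were kept busy with perfect parallel efficiency, which is an idealization and not a figure reported in the source.

$$ \frac{100{,}000 \text{ CPU h}}{24 \times 365 \text{ h/yr}} \approx 11.4 \text{ CPU years}, \qquad \frac{100{,}000 \text{ CPU h}}{360 \text{ CPUs}} \approx 278 \text{ h} \approx 11.6 \text{ days} $$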

Shape Analysis (PMI)

To quantitatively define the “escape from flatland,” the original paper classifies molecular shape using the normalized Principal Moments of Inertia (PMI) of the generated 3D conformers. The principal moments ($I_1 \le I_2 \le I_3$) are obtained by diagonalizing the standard moment of inertia tensor. Molecules are plotted within a normalized 2D triangular space defined by the ratios:

$$ P_1 = \frac{I_1}{I_3}, \quad P_2 = \frac{I_2}{I_3} $$

The vertices of this plot define the three geometrical boundaries of chemical space:

Rod-like (1D): $(0, 1)$ typical of stretched alkanes
Disc-like (2D): $(0.5, 0.5)$ typical of flat aromatics like benzene
Sphere-like (3D): $(1, 1)$ typical of globular structures like cubane

GDB-17’s core structural finding is that mathematically enumerated chemical space thickly populates the interior and $(1,1)$ spherical regions of this plot, demonstrating significant 3D structure. Empirical libraries traditionally cluster densely along the rod-to-disc axis.
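The PMI ratios can be computed from atomic masses and 3D coordinates with a few lines of standard-library Python. The sketch below is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, the eigenvalues use the closed-form trigonometric solution for symmetric 3x3 matrices rather than a linear algebra library, and the test shapes are idealized unit-mass point sets rather than real conformers.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def inertia_tensor(masses: Sequence[float], coords: Sequence[Point]) -> List[List[float]]:
    """Moment-of-inertia tensor about the centre of mass."""
    total = sum(masses)
    com = [sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)]
    t = [[0.0] * 3 for _ in range(3)]
    for m, c in zip(masses, coords):
        x, y, z = (c[k] - com[k] for k in range(3))
        t[0][0] += m * (y * y + z * z)
        t[1][1] += m * (x * x + z * z)
        t[2][2] += m * (x * x + y * y)
        t[0][1] -= m * x * y
        t[0][2] -= m * x * z
        t[1][2] -= m * y * z
    t[1][0], t[2][0], t[2][1] = t[0][1], t[0][2], t[1][2]
    return t


def symmetric_eigenvalues(a: List[List[float]]) -> List[float]:
    """Eigenvalues of a symmetric 3x3 matrix in ascending order."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted(a[i][i] for i in range(3))
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [smallest, 3.0 * q - largest - smallest, largest]


def normalized_pmi(masses: Sequence[float], coords: Sequence[Point]) -> Tuple[float, float]:
    """Return (I1/I3, I2/I3) with I1 <= I2 <= I3."""
    i1, i2, i3 = symmetric_eigenvalues(inertia_tensor(masses, coords))
    if i3 <= 0.0:
        raise ValueError("need at least two non-coincident atoms")
    return i1 / i3, i2 / i3


if __name__ == "__main__":
    rod = [(1.0, 1.0, 1.0), (-1.0, -1.0, -1.0)]
    disc = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3), 0.0) for k in range(6)]
    cube = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]
    for name, pts in (("rod", rod), ("disc", disc), ("sphere-like", cube)):
        print(name, normalized_pmi([1.0] * len(pts), pts))
```

The three idealized point sets land approximately on the rod, disc, and sphere vertices of the triangle, while real conformers fall somewhere inside it.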

Differences from GDB-13

The algorithm was completely rewritten to optimize memory efficiency, boosting computing speed roughly 400-fold and allowing enumeration beyond the previous 13-atom limit
Scope aggressively expanded to include all functional halogens (F, Cl, Br, I) within the base framework
Introduced intensive, size-dependent graph selection filters (prohibiting complex bridgeheads and completely eliminating small rings in 17-atom graphs) to manage combinatorial explosion
Post-processing steps were deliberately decoupled from the core enumeration to add features such as cyclic oximes, aromatic halogens, and sulfones, which would otherwise be rejected by, or break, the underlying generation constraints

Reproducibility Details

Paper Accessibility: The original paper is published in the Journal of Chemical Information and Modeling and is available as an Open Access publication under a CC-BY license.
Data Availability: The full 166.4 billion molecule dataset is not publicly available for download (estimated >400 GB compressed). However, a 50 million random subset and pre-filtered lead-like fractions are openly available on the GDB website and archived on Zenodo.
Code & Algorithms: The enumeration rules and logic are well-described in the paper, but the actual Java 1.6 source code has not been released.
Dependencies: The pipeline is a mix of open-source and proprietary software tools. Graph generation uses open-source GENG (Nauty), while chemical logic and stereoisomer generation rely on proprietary ChemAxon JChem libraries and CORINA.
Hardware Specifications: The original database generation was explicitly parallelized across a 360-CPU cluster, consuming 100,000 CPU hours (approximately 11 CPU years) with over 40,000 calculation runs.

Citation

@article{Ruddigkeit_2012,
  title={Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17},
  volume={52},
  ISSN={1549-960X},
  url={http://dx.doi.org/10.1021/ci300415d},
  DOI={10.1021/ci300415d},
  number={11},
  journal={Journal of Chemical Information and Modeling},
  publisher={American Chemical Society (ACS)},
  author={Ruddigkeit, Lars and van Deursen, Ruud and Blum, Lorenz C. and Reymond, Jean-Louis},
  year={2012},
  month=nov,
  pages={2864–2875}
}
