GDB-17
Basic Information
Full Name: Generated Database 17
Domain: Computational Chemistry
Year: 2012
Publication & Access
Paper: DOI
Dataset: gdb.unibe.ch
Dataset Composition
Total Size: 166,443,860,262 molecules
CNO Molecules: 110.4 billion molecules
Full Set (with S, Halogens): 166.4 billion molecules
Leadlike Subset (GDBLL-17): 29 billion molecules
Technical Details
Format: SMILES strings
Research Context
Authors: Lars Ruddigkeit, Ruud van Deursen, Lorenz C. Blum, Jean-Louis Reymond
Institution: University of Berne; Ecole Polytechnique Fédérale de Lausanne

GDB Series Overview: The Generated Database (GDB) series explores chemical space systematically by enumerating all molecular structures, up to a fixed atom count, that pass a set of chemical stability filters. GDB-11 (26M molecules) established the methodology, GDB-13 (977M molecules) approached billion-scale generation, and GDB-17 (166B molecules) represents the current limit of systematic chemical space generation.

Dataset Summary

GDB-17 contains 166.4 billion small organic molecules, generated by systematically enumerating structures with up to 17 atoms of carbon, nitrogen, oxygen, sulfur, and halogens and keeping those that pass the stability filters. Building on the methodology of GDB-11 and GDB-13, it was the largest systematically generated chemical database of its time. Compared with databases of known compounds, it is far richer in novel scaffolds, ring systems, and three-dimensional shapes. All molecules are provided as SMILES strings.

Related Databases: GDB-17 is the largest database in the Generated Database (GDB) series, following GDB-11 (26 million molecules) and GDB-13 (977 million molecules), demonstrating how far systematic chemical space enumeration can be scaled.


Key Features

  • Large Scale: 166.4 billion molecules, greatly expanding known chemical space
  • Complete Coverage: Systematically enumerates all structures allowed by the generation rules, starting from first principles and avoiding bias from existing databases
  • More Elements: Includes sulfur and halogens (F, Cl, Br, I) beyond carbon, nitrogen, oxygen
  • 3D Diversity: Contains more spherical molecules compared to the flat structures common in existing databases
  • High Novelty: Contains 35× more molecular frameworks and 61× more ring systems than equivalent molecules from PubChem
  • Series Culmination: Represents the current limit of the GDB methodology

Dataset Structure

The database is primarily defined by its total size, but two important “leadlike” subsets were also characterized for drug discovery applications.

GDB-17 Dataset Structure
Category | Composition | Count | Description
Total Molecules | C, N, O, S, Halogens | 166.4 B | The complete enumerated database of molecules up to 17 heavy atoms.
Leadlike (GDBLL-17) | C, N, O, S, Halogens | 29 B | Subset with properties typical for lead compounds (100 < MW < 350, 1 < clogP < 3).
Leadlike, No Small Rings (GDBLLnoSR-17) | C, N, O, S, Halogens | 22 B | The GDBLL-17 subset with strained 3- and 4-membered rings removed.
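
The leadlike window in the table above can be written as a simple predicate. The following is a minimal sketch, not the authors' implementation: it assumes molecular weight and clogP have already been computed by a cheminformatics toolkit, and the `Candidate` record, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record: a molecule label plus precomputed properties."""
    name: str
    mw: float     # molecular weight in g/mol, computed elsewhere
    clogp: float  # calculated logP, computed elsewhere


def is_leadlike(c: Candidate) -> bool:
    # Strict inequalities, as in the table: 100 < MW < 350 and 1 < clogP < 3.
    return 100 < c.mw < 350 and 1 < c.clogp < 3


# Illustrative values only; they are not taken from GDB-17.
candidates = [
    Candidate("inside-window", mw=236.4, clogp=2.1),
    Candidate("too-heavy", mw=351.0, clogp=2.0),
    Candidate("too-polar", mw=180.2, clogp=0.4),
]

leadlike = [c.name for c in candidates if is_leadlike(c)]
print(leadlike)
```

Because both bounds are strict, a molecule with a molecular weight of exactly 350 or a clogP of exactly 3 falls outside the window in this sketch.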

Structural Diversity

  • Topology: Rich in complex ring structures. Molecules with three or more rings account for 32% of GDB-17, far more than in known databases. Simple molecules with no rings are rare (1.8%).
  • Compound Types: Dominated by non-aromatic heterocycles (57%). Aromatic molecules, which make up a third of known compounds, are very rare (0.8%).
  • Stereochemistry: Because non-flat structures are so prevalent, GDB-17 molecules are highly chiral. They average 6.4 stereoisomers per molecule, three times more than molecules in PubChem. Over 44% have eight or more possible stereoisomers.

Example Sample

The following is an example of a novel, previously unknown C₁₇-hydrocarbon ring system found in GDB-17.

C1CC2C3CCCC3C3(C4CCC3CC4)C2C1

Figure: Representative GDB-17 molecule (SMILES: C1CC2C3CCCC3C3(C4CCC3CC4)C2C1), a polycyclic hydrocarbon with multiple fused rings that illustrates the complex ring systems and 3D diversity characteristic of the database. Visualized with PubChem Sketcher.
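
As a quick sanity check, the heavy-atom and ring counts of the example can be read directly from its SMILES string. The sketch below uses only the standard library and is not a full SMILES parser: it assumes a single connected molecule, treats every bracket atom as one heavy atom, and counts rings as the number of ring-closure bonds. The function name `count_atoms_and_rings` is illustrative.

```python
import re

# Tokens: bracket atoms, two-letter halogens, organic-subset atoms
# (aliphatic and aromatic), and ring-closure labels (%nn or a single digit).
TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d{2}|\d")


def count_atoms_and_rings(smiles):
    """Return (heavy atoms, ring-closure bonds) for a simple SMILES string."""
    atoms = 0
    closure_labels = 0
    for match in TOKEN.finditer(smiles):
        token = match.group()
        if token[0].isdigit() or token[0] == "%":
            closure_labels += 1  # each ring bond contributes two labels
        else:
            atoms += 1
    return atoms, closure_labels // 2


SMILES = "C1CC2C3CCCC3C3(C4CCC3CC4)C2C1"
heavy_atoms, rings = count_atoms_and_rings(SMILES)
print(heavy_atoms, rings)  # prints "17 5"
```

For this molecule the count gives 17 carbon atoms and 5 rings, consistent with its description as a C₁₇ polycyclic hydrocarbon. For real work a cheminformatics toolkit is the safer choice, since this tokenizer ignores stereo marks, charges, and disconnected fragments.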


Use Cases

Primary Applications

  • Virtual Screening: Shape-based screening and active site complementarity
  • Scaffold Hopping: Discovering new molecular series from structural isomers
  • Fragment-Based Discovery: Starting points for complex drug candidate building

Research Applications

  • Chemical Space Exploration: Analyzing fundamental properties of small molecule universe
  • AI/ML Benchmarking: Training and validating chemoinformatics tools and models
  • Structure-Property Studies: 3D shape, stereochemistry, and scaffold complexity relationships

Quality & Limitations

Strengths

  • Large Scale: Most complete systematic generation to date (166 billion molecules)
  • High Novelty: Significantly more scaffolds and ring systems than existing databases
  • 3D Diversity: Rich in spherical molecules, “escape from flatland”
  • Series Achievement: Final development of GDB series methodology
  • Coverage: Complete coverage of all structures with up to 17 atoms that satisfy the enumeration rules

Limitations

  • Limited Atom Types: Excludes P, Si, B and other drug-relevant elements, although the element set is broader than in GDB-11 and GDB-13
  • Structural Constraints: Highly strained graphs, allenes, and nonaromatic C=C bonds excluded
  • Functional Group Gaps: Wide range of unstable groups filtered out (hemiacetals, enols, aminals)
  • Virtual Molecules: Computationally generated without experimental validation
  • Synthetic Accessibility: No explicit scoring for synthetic feasibility

Generation and Filtering Pipeline

GDB-17 represents the current limit of the methodology established in GDB-11 and refined in GDB-13, achieving a roughly 170-fold increase over GDB-13 (166.4 billion versus 977 million molecules) while maintaining chemical quality. Construction required a multi-step pipeline designed to ensure that the generated molecules are chemically stable and meaningful:

  1. Graph Generation & Selection: The process started with over 114 billion mathematical graphs (node-edge diagrams) with up to 17 nodes. These were filtered using topological and geometric criteria to remove highly strained and complex structures (e.g., fused small rings), resulting in 5.4 million stable hydrocarbon graphs.
  2. Skeleton Generation: Double and triple bonds were systematically added to the hydrocarbon graphs to create 1.3 billion unique unsaturated “skeletons”. Filters were applied to remove unstable arrangements like allenes and triple bonds in small rings.
  3. CNO Molecule Generation: The carbon atoms in the skeletons were combinatorially substituted with nitrogen and oxygen atoms, respecting valency rules. The resulting molecules were curated with a set of 12 chemical filters to remove unstable functional groups (e.g., enols, hemiacetals), giving the 110.4 billion CNO molecules in the database.
  4. Post-Processing & Diversification: Additional chemical diversity was introduced in a final step. Specific functional groups were systematically transformed to add sulfur, halogens, nitro groups, and oximes, adding another 56 billion molecules to yield the final database of 166.4 billion structures.
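
The counts quoted in this document can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text (rounded as they appear there); the variable names are illustrative.

```python
# Counts as quoted in this document.
GRAPHS_INITIAL = 114e9        # "over 114 billion" starting graphs
GRAPHS_KEPT = 5.4e6           # hydrocarbon graphs kept after filtering
CNO_MOLECULES = 110.4e9       # C, N, O molecules
POST_PROCESSING = 56e9        # added by S, halogen, nitro, oxime transforms
GDB17_TOTAL = 166_443_860_262
GDB13_TOTAL = 977e6
GDB11_TOTAL = 26e6

# Fraction of the starting graphs that survive the graph filters.
graph_survival = GRAPHS_KEPT / GRAPHS_INITIAL

# Steps 3 and 4 together should reproduce the headline total.
pipeline_total = CNO_MOLECULES + POST_PROCESSING

# Growth between successive members of the GDB series.
growth_13_to_17 = GDB17_TOTAL / GDB13_TOTAL
growth_11_to_13 = GDB13_TOTAL / GDB11_TOTAL

print(f"graphs kept: {graph_survival:.4%} of the starting graphs")
print(f"CNO + post-processing: {pipeline_total / 1e9:.1f} billion")
print(f"GDB-13 to GDB-17: about {growth_13_to_17:.0f}-fold")
print(f"GDB-11 to GDB-13: about {growth_11_to_13:.0f}-fold")
```

The sum of the CNO and post-processing counts matches the 166.4 billion total, and the graph filter keeps fewer than one in ten thousand of the starting graphs.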

Citation: Ruddigkeit, L., van Deursen, R., Blum, L. C., & Reymond, J.-L. “Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17” J. Chem. Inf. Model. 2012, 52 (11), pp 2864–2875.