GDB-17 Dataset Card

GDB-17
Basic Information
Full Name	Generated Database 17
Domain	Computational Chemistry
Year	2012
Publication & Access
Paper	DOI
Dataset	gdb.unibe.ch
Dataset Composition
Total Size	166,443,860,262 molecules
CNO Molecules	110.4 billion molecules
Full Set (with S, Halogens)	166.4 billion molecules
Leadlike Subset (GDBLL-17)	29 billion molecules
Technical Details
Format	SMILES strings
Research Context
Authors	Lars Ruddigkeit, Ruud van Deursen, Lorenz C. Blum, Jean-Louis Reymond
Institution	University of Berne, Ecole Polytechnique Fédérale de Lausanne

GDB Series Overview: The Generated Database (GDB) series systematically explores chemical space by enumerating the molecular structures that are possible under defined chemical stability and feasibility rules. GDB-11 (26M molecules) established the methodology, GDB-13 (977M molecules) approached the billion-molecule scale, and GDB-17 (166B molecules) represents the current limit of systematic chemical space generation.

Dataset Summary

GDB-17 contains 166.4 billion organic small molecules created by systematically enumerating structures with up to 17 heavy atoms (carbon, nitrogen, oxygen, sulfur, and halogens). Building on the methodology from GDB-11 and GDB-13, it was the largest systematically generated chemical database of its time. All molecules are provided as SMILES strings.
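Because the molecules are provided as SMILES strings and the full set is far too large to hold in memory, streaming is the natural way to work with it. The following is a minimal sketch of reservoir sampling (Algorithm R), assuming one SMILES string per line; the file layout, the function name `reservoir_sample`, and the path in the usage comment are illustrative assumptions, not documented properties of the distribution.

```python
import random
from typing import Iterable, List


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Draw up to k SMILES uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[str] = []
    seen = 0  # number of non-empty lines read so far
    for line in lines:
        smiles = line.strip()
        if not smiles:
            continue  # skip blank lines without counting them
        seen += 1
        if len(sample) < k:
            sample.append(smiles)      # fill the reservoir first
        else:
            j = rng.randrange(seen)    # uniform index in [0, seen)
            if j < k:
                sample[j] = smiles     # replace with probability k / seen
    return sample


# Hypothetical usage; the path is a placeholder, not a real file name:
# with open("gdb17_subset.smi") as handle:
#     subset = reservoir_sample(handle, k=1000)
```

The function reads each line once and keeps only `k` strings in memory, so it works the same way on a small excerpt and on a multi-billion-line file.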

Related Databases: GDB-17 is the largest database in the Generated Database (GDB) series, following GDB-11 (26 million molecules) and GDB-13 (977 million molecules), demonstrating how far systematic chemical space enumeration can be scaled.
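The growth across the series can be checked directly from the three counts quoted above. A short sketch (the dictionary and variable names are illustrative):

```python
# Molecule counts quoted above for each GDB release.
sizes = {"GDB-11": 26e6, "GDB-13": 977e6, "GDB-17": 166.4e9}

names = list(sizes)
growth = {
    f"{a} -> {b}": sizes[b] / sizes[a]
    for a, b in zip(names, names[1:])
}
for step, factor in growth.items():
    print(f"{step}: about {factor:.0f}x larger")
```

With these figures, GDB-13 is roughly 38 times the size of GDB-11, and GDB-17 is roughly 170 times the size of GDB-13.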

Key Features

Large Scale: 166.4 billion molecules, greatly expanding known chemical space
Complete Coverage: Systematically enumerates structures from first principles rather than drawing on existing databases, which avoids their bias
More Elements: Includes sulfur and halogens (F, Cl, Br, I) beyond carbon, nitrogen, oxygen
3D Diversity: Contains more spherical molecules compared to the flat structures common in existing databases
High Novelty: Contains 35× more molecular frameworks and 61× more ring systems than equivalent molecules from PubChem
Series Culmination: Represents the current limit of the GDB methodology

Dataset Structure

The database is primarily defined by its total size, but two important “leadlike” subsets were also characterized for drug discovery applications.

GDB-17 Dataset Structure
Category	Composition	Count	Description
Total Molecules	C, N, O, S, Halogens	166.4 B	The complete enumerated database of molecules up to 17 heavy atoms.
Leadlike (GDBLL-17)	C, N, O, S, Halogens	29 B	Subset with properties typical for lead compounds (100<MW<350, 1<clogP<3).
Leadlike, No Small Rings (GDBLLnoSR-17)	C, N, O, S, Halogens	22 B	The GDBLL-17 subset with strained 3- and 4-membered rings removed.
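The leadlike window in the table is a pair of range checks, and the subset sizes can be expressed as shares of the full database. The sketch below assumes the molecular weight and clogP values have already been computed with a cheminformatics toolkit (descriptor calculation is not shown); the function name `is_leadlike` is illustrative.

```python
def is_leadlike(mw: float, clogp: float) -> bool:
    """Apply the GDBLL-17 property window: 100 < MW < 350 and 1 < clogP < 3."""
    return 100 < mw < 350 and 1 < clogp < 3


# Counts from the table, in billions of molecules.
counts = {"GDB-17": 166.4, "GDBLL-17": 29.0, "GDBLLnoSR-17": 22.0}
shares = {name: n / counts["GDB-17"] for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of GDB-17")
```

On these rounded counts, GDBLL-17 is about 17% of the full database and GDBLLnoSR-17 about 13%.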

Structural Diversity

Topology: Rich in complex ring structures. Molecules with three or more rings account for 32% of GDB-17, far more than in known databases. Simple molecules with no rings are rare (1.8%).
Compound Types: Dominated by non-aromatic heterocycles (57%). Aromatic molecules, which make up a third of known compounds, are very rare (0.8%).
Stereochemistry: The prevalence of non-flat structures means GDB-17 molecules are highly chiral. They average 6.4 stereoisomers per molecule, three times more than molecules in PubChem. Over 44% have eight or more possible stereoisomers.
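To relate stereoisomer counts to stereocenters, a common rule of thumb is that a molecule with n independent stereocenters has at most 2^n stereoisomers; symmetry (for example meso forms) and ring constraints can lower the actual number. Under that upper bound, "eight or more" stereoisomers implies at least three stereocenters. A minimal sketch, with illustrative function names:

```python
def max_stereoisomers(n_stereocenters: int) -> int:
    """Upper bound on stereoisomers for n independent stereocenters (2**n)."""
    return 2 ** n_stereocenters


def min_stereocenters_for(n_isomers: int) -> int:
    """Smallest n whose 2**n bound reaches n_isomers (requires n_isomers >= 1)."""
    if n_isomers < 1:
        raise ValueError("n_isomers must be at least 1")
    # (x - 1).bit_length() equals ceil(log2(x)) for positive integers, without floats.
    return (n_isomers - 1).bit_length()


print(max_stereoisomers(3))
print(min_stereocenters_for(8))
```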

Example Sample

The following is an example of a novel, previously unknown C₁₇-hydrocarbon ring system found in GDB-17.

C1CC2C3CCCC3C3(C4CCC3CC4)C2C1

Figure (rendered with PubChem Sketcher): Representative GDB-17 molecule (SMILES: C1CC2C3CCCC3C3(C4CCC3CC4)C2C1), a polycyclic hydrocarbon with multiple fused rings that illustrates the complex ring systems and 3D diversity characteristic of the database.
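The example SMILES can be checked without a chemistry toolkit, because it uses only aliphatic carbon atoms, single-digit ring closures, and parentheses. The sketch below is a deliberately restricted parser for that subset of SMILES (it is not a general SMILES reader, and the function name is illustrative); it counts atoms and bonds, derives the ring count as the cyclomatic number of the molecular graph, and fills in hydrogens assuming every carbon has valence 4.

```python
from typing import List, Tuple


def parse_saturated_carbon_smiles(smiles: str) -> Tuple[int, List[Tuple[int, int]]]:
    """Parse a SMILES made only of 'C', single-digit ring closures, and parentheses.

    Returns (atom_count, bonds); every bond is a single bond.
    """
    bonds: List[Tuple[int, int]] = []
    open_rings = {}      # ring-closure digit -> index of the atom that opened it
    branch_stack = []    # atoms to return to when a branch closes
    prev = None          # most recent atom on the current chain
    n_atoms = 0
    for ch in smiles:
        if ch == "C":
            idx = n_atoms
            n_atoms += 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        elif ch.isdigit():
            if ch in open_rings:
                bonds.append((open_rings.pop(ch), prev))  # close the ring
            else:
                open_rings[ch] = prev                     # open a ring
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    if open_rings or branch_stack:
        raise ValueError("unclosed ring bond or branch")
    return n_atoms, bonds


smiles = "C1CC2C3CCCC3C3(C4CCC3CC4)C2C1"
n_atoms, bonds = parse_saturated_carbon_smiles(smiles)
n_rings = len(bonds) - n_atoms + 1           # cyclomatic number of a connected graph
degree = [0] * n_atoms
for a, b in bonds:
    degree[a] += 1
    degree[b] += 1
n_hydrogens = sum(4 - d for d in degree)     # saturate each carbon to valence 4
print(f"C{n_atoms}H{n_hydrogens}, {n_rings} rings")
```

For this string the parser finds 17 carbon atoms and 21 bonds, which gives a ring count of 5 and the formula C17H26, consistent with the description of a C₁₇ hydrocarbon ring system.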

Use Cases

Primary Applications

Virtual Screening: Shape-based screening and active site complementarity
Scaffold Hopping: Discovering new molecular series from structural isomers
Fragment-Based Discovery: Starting points for complex drug candidate building

Research Applications

Chemical Space Exploration: Analyzing fundamental properties of small molecule universe
AI/ML Benchmarking: Training and validating chemoinformatics tools and models
Structure-Property Studies: 3D shape, stereochemistry, and scaffold complexity relationships

Quality & Limitations

Strengths

Large Scale: Most complete systematic generation to date (166 billion molecules)
High Novelty: Significantly more scaffolds and ring systems than existing databases
3D Diversity: Rich in spherical molecules, “escape from flatland”
Series Achievement: Final development of GDB series methodology
Coverage: Exhaustive enumeration of structures with up to 17 heavy atoms that pass the stability and feasibility filters described under Limitations

Limitations

Limited Atom Types: Excludes P, Si, B and other drug-relevant elements, even though the element set is expanded relative to GDB-11 and GDB-13
Structural Constraints: Highly strained graphs, allenes, and nonaromatic C=C bonds excluded
Functional Group Gaps: Wide range of unstable groups filtered out (hemiacetals, enols, aminals)
Virtual Molecules: Computationally generated without experimental validation
Synthetic Accessibility: No explicit scoring for synthetic feasibility

Generation and Filtering Pipeline

GDB-17 represents the current limit of the methodology established in GDB-11 and refined in GDB-13, achieving a roughly 170-fold increase over GDB-13 (166.4 billion versus 977 million molecules) while maintaining chemical quality. The construction required a multi-step pipeline designed to ensure that the generated molecules are chemically stable and meaningful:

Graph Generation & Selection: The process started with over 114 billion mathematical graphs (node-edge diagrams) with up to 17 nodes. These were filtered using topological and geometric criteria to remove highly strained and complex structures (e.g., fused small rings), resulting in 5.4 million stable hydrocarbon graphs.
Skeleton Generation: Double and triple bonds were systematically added to the hydrocarbon graphs to create 1.3 billion unique unsaturated “skeletons”. Filters were applied to remove unstable arrangements like allenes and triple bonds in small rings.
CNO Molecule Generation: The carbon atoms in the skeletons were combinatorially substituted with nitrogen and oxygen atoms, respecting valency rules. This generated over 110 billion molecules, which were then curated with a set of 12 chemical filters to remove unstable functional groups (e.g., enols, hemiacetals).
Post-Processing & Diversification: Additional chemical diversity was introduced in a final step. Specific functional groups were systematically transformed to add sulfur, halogens, nitro groups, and oximes, adding another 56 billion molecules to yield the final database of 166.4 billion structures.
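The stage counts above form a funnel that first contracts sharply and then expands combinatorially. The sketch below computes the ratio between consecutive stages from the quoted figures; values given as "over" a number are treated as lower bounds, the CNO count uses the 110.4 billion figure from the Dataset Composition table, and the variable names are illustrative.

```python
# Stage counts quoted in the pipeline description.
stages = [
    ("graphs", 114_000_000_000),
    ("hydrocarbon graphs", 5_400_000),
    ("unsaturated skeletons", 1_300_000_000),
    ("CNO molecules", 110_400_000_000),
    ("final database", 166_400_000_000),
]

ratios = {}
for (prev_name, prev_n), (name, n) in zip(stages, stages[1:]):
    ratios[name] = n / prev_n
    print(f"{prev_name} -> {name}: x{ratios[name]:.3g}")

# Molecules contributed by the post-processing step.
post_processing_added = stages[-1][1] - stages[-2][1]
print(f"added in post-processing: {post_processing_added / 1e9:.0f} billion")
```

On these figures, fewer than 1 in 20,000 initial graphs survives as a hydrocarbon graph, each hydrocarbon graph yields about 240 skeletons on average, each skeleton yields about 85 CNO molecules, and post-processing adds 56 billion molecules, which matches the final step.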

Citation: Ruddigkeit, L., van Deursen, R., Blum, L. C., & Reymond, J.-L. “Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17” J. Chem. Inf. Model. 2012, 52 (11), pp 2864–2875.
