GDB-13 Dataset Card

GDB-13
Basic Information
Full Name	Generated Database 13
Domain	Computational Chemistry
Year	2009
Publication & Access
Paper	DOI
Dataset	gdb.unibe.ch
Dataset Composition
Total Size	977,468,314 molecules
Main C/N/O Set	910,111,673 molecules
Cl/S Set	67,356,641 molecules
Technical Details
Format	SMILES strings
Research Context
Authors	Lorenz C. Blum, Jean-Louis Reymond
Institution	University of Berne

GDB Series Overview: The Generated Database (GDB) series explores chemical space systematically by enumerating every molecule up to a given heavy-atom count that passes a set of chemical stability and feasibility filters. GDB-11 (26M molecules) established the methodology, GDB-13 (977M molecules) scaled it to nearly a billion molecules, and GDB-17 (166B molecules) is the largest database in the series.

Dataset Summary

GDB-13 contains nearly one billion small organic molecules, enumerated systematically as all structures with up to 13 heavy atoms (C, N, O, S, and Cl) that pass the database's stability and feasibility filters. Building on the GDB-11 approach, it scales from 26 million to 977 million molecules while retaining filters for chemical quality. The dataset targets drug-like chemical space for virtual screening, and all molecules are provided as SMILES strings.
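Because the dataset holds close to a billion SMILES strings, it is usually more practical to stream it than to load it into memory. The following is a minimal sketch of one way to do that. The function name `iter_smiles`, the one-record-per-line layout, and the optional `.gz` compression are assumptions made for illustration; the card does not describe how the distributed files are packaged.

```python
import gzip
from pathlib import Path
from typing import Iterator


def iter_smiles(path: str) -> Iterator[str]:
    """Yield one SMILES string per non-empty line without loading the whole file.

    Assumptions (not confirmed by the dataset card): one record per line,
    SMILES in the first whitespace-separated column, optional gzip compression.
    """
    p = Path(path)
    # Pick the opener from the file suffix; both return text-mode handles.
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt", encoding="ascii") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                # Ignore any trailing columns such as an identifier.
                yield fields[0]


def count_smiles(path: str) -> int:
    """Count records lazily; memory use stays constant regardless of file size."""
    return sum(1 for _ in iter_smiles(path))
```

A generator keeps memory use flat, so the same loop works on a small sample file and on the full set.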

Related Databases: Within the GDB series, GDB-13 follows GDB-11 (26 million molecules) and precedes GDB-17 (166 billion molecules).

Key Features

Large scale: 977,468,314 molecules, nearly 1 billion, up from 26 million in GDB-11
Complete coverage: Systematically enumerates all structures with up to 13 heavy atoms that pass the stability and feasibility filters
Drug-like properties: All molecules follow Lipinski’s Rule of Five for drug-likeness
Quality filtering: Filtered to remove chemically unrealistic or unstable structures

Dataset Structure

GDB-13 Dataset Structure
Subset	Composition	Molecules	Druglike	Leadlike	Fragmentlike
Main Set	C, N, O only	910M	100%	98.9%	45.1%
Cl/S Set	Cl, S groups	67M	100%	not given	not given
Total	All molecules	977M	100%	98.9%	45.1%

Structural Diversity

Heterocycles: 71% of molecules contain rings with nitrogen or oxygen atoms
Small rings: 54% contain strained 3- or 4-membered rings
Graph types: Most molecules have complex ring systems: 43.8% are polycyclic and 34.6% are tricyclic

Example Sample

CCCC(O)(CO)CC1CC1CN

Figure (rendered with PubChem Sketcher): Representative GDB-13 molecule (SMILES: CCCC(O)(CO)CC1CC1CN) with a cyclopropyl ring and alcohol groups, showing typical structural features: small rings, heteroatoms, and druglike properties.
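To see how the example relates to the 13-heavy-atom limit, the sketch below counts heavy atoms directly from a SMILES string. It is a deliberately crude tokenizer that only recognizes the five elements used in GDB-13 (C, N, O, S, Cl), not a SMILES parser, and the name `heavy_atom_counts` is introduced here for illustration. A cheminformatics toolkit would be the better choice for real work.

```python
import re
from collections import Counter

# Match "Cl" before single letters so it is not read as carbon.
# Lowercase letters are aromatic atoms in SMILES and are folded to uppercase.
# Assumption: the input uses only the GDB-13 element set (C, N, O, S, Cl).
_ATOM = re.compile(r"Cl|[CNOS]|[cnos]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Return a per-element count of heavy atoms found in the SMILES string."""
    return Counter(token.capitalize() for token in _ATOM.findall(smiles))


counts = heavy_atom_counts("CCCC(O)(CO)CC1CC1CN")
print(dict(counts))          # {'C': 10, 'O': 2, 'N': 1}
print(sum(counts.values()))  # 13
```

The example molecule has 10 carbons, 2 oxygens, and 1 nitrogen, so it sits exactly at the 13-heavy-atom ceiling.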

Use Cases

Primary Applications

Virtual Screening: Novel drug scaffolds across therapeutic areas
Fragment-Based Drug Discovery: Comprehensive fragment space coverage
Chemical Space Exploration: Large-scale structure-property relationship studies

Research Applications

Algorithm Benchmarking: Virtual screening and molecular descriptor development
ML Model Development: Training molecular property prediction models
Structure-Property Studies: Large-scale analysis of chemical relationships

Quality & Limitations

Strengths

Large Scale: Nearly 1 billion molecules, up from 26 million in GDB-11
Complete Drug-likeness: 100% Lipinski compliance across all molecules
High Novelty: Novel chemical structures absent from commercial databases
Coverage: Exhaustive enumeration of structures with up to 13 heavy atoms that pass the stability and feasibility filters
Refined Methodology: Improved approach enabling larger databases like GDB-17

Limitations

Limited Atom Types: The main set includes only carbon, nitrogen, and oxygen; a separate subset adds Cl and S, and GDB-17 expanded to include more elements
Structural Constraints: Excludes fused small rings and highly strained molecules
Element Ratio Restrictions: Filters out highly polar molecules with strict heteroatom ratios
Missing Functional Groups: Excludes peroxides and many unstable intermediates
Virtual Molecules: Computer-generated without experimental validation or synthetic accessibility scoring
Size Constraint: Maximum 13 heavy atoms limits coverage of larger drug-like molecules

Generation and Filtering Pipeline

GDB-13 refined and scaled the methodology pioneered in GDB-11, achieving a nearly 40-fold increase in database size (26 million to 977 million molecules) while maintaining chemical quality. Generation proceeds through several filtering stages that enforce chemical feasibility and druglikeness:

Graph Generation & Topological Filtering

Initial enumeration: 27.3M graphs generated using the GENG program
Topo I filter: Removes fused small rings (97.2% rejection rate)
Topo II filter: Eliminates bridgehead atoms in 3- or 4-membered rings (56.9% rejection)
SAV filter: 3D strain analysis using tetrahedron volumes (4.7% rejection)
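The card does not say whether each rejection rate is measured against the graphs that survived the previous filter or against the initial set. Under the assumption that the filters run in sequence and each rate applies to the survivors of the stage before it, the arithmetic can be sketched as follows. The function name `survivors_after_filters` is hypothetical, and the resulting counts are an illustration of that reading, not figures reported by the source.

```python
from typing import List, Sequence


def survivors_after_filters(initial: float, rejection_rates: Sequence[float]) -> List[float]:
    """Return the remaining count after each filter, applied one after another.

    Assumption: every rate is the fraction rejected among the items that
    reached that stage (sequential reading of the pipeline).
    """
    remaining = initial
    history = []
    for rate in rejection_rates:
        remaining = remaining * (1.0 - rate)
        history.append(remaining)
    return history


# Rates quoted above for Topo I, Topo II, and SAV, in that order.
stages = survivors_after_filters(27_300_000, [0.972, 0.569, 0.047])
for name, count in zip(["Topo I", "Topo II", "SAV"], stages):
    print(f"{name}: about {count:,.0f} graphs remain")
```

Read this way, the Topo I filter does most of the pruning, and the later filters act on a much smaller pool.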

Chemical Intelligence Filters

Element ratio limits: N/C < 0.571, O/C < 0.667, (N+O)/C < 1.0
Heteroatom bonds: Strict rules for N-N, N-O, O-O bond patterns
Functional group exclusions: Removes unstable groups (hemiacetals, orthoesters, etc.)
Tautomer standardization: Selects most stable tautomeric forms
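The element ratio limits listed above are simple to express in code. The following is a minimal sketch that checks the three thresholds from atom counts; the function name `passes_element_ratios` and the choice to reject carbon-free inputs are assumptions made here, and the heteroatom bond rules, functional group exclusions, and tautomer handling are not covered.

```python
def passes_element_ratios(n_carbon: int, n_nitrogen: int, n_oxygen: int) -> bool:
    """Check the ratio limits N/C < 0.571, O/C < 0.667, (N+O)/C < 1.0."""
    if n_carbon <= 0:
        # Assumption: the ratios are undefined without carbon, so reject.
        return False
    return (
        n_nitrogen / n_carbon < 0.571
        and n_oxygen / n_carbon < 0.667
        and (n_nitrogen + n_oxygen) / n_carbon < 1.0
    )


# The example molecule CCCC(O)(CO)CC1CC1CN has 10 C, 1 N, and 2 O.
print(passes_element_ratios(10, 1, 2))  # True
# Four carbons with three oxygens gives O/C = 0.75, which is too polar.
print(passes_element_ratios(4, 0, 3))   # False
```

Working from atom counts keeps the check independent of any particular SMILES parser.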

Post-Processing Expansions

The Cl/S subset adds chemical diversity through systematic transformations:

Aromatic nitro groups from carboxylic acids
Nitriles from aldehydes
Aromatic chlorines from hydroxyl groups
Thiophene analogs from oxygen heterocycles
Sulfonamides and thioureas from carbonyl groups

These filters account for the high druglikeness compliance while preserving chemical diversity. The same methodology was later extended to build GDB-17, which contains 166 billion molecules.

Citation: Blum, L. C. & Reymond, J.-L. “970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13” J. Am. Chem. Soc. 2009, 131 (25), pp 8732–8733.
