GDB-13 Dataset Card

Dataset Summary

GDB-13 (Generated Database 13) is a computed chemical database containing nearly one billion unique small organic molecules selected for likely synthetic accessibility. It was created by exhaustively enumerating molecules of up to 13 heavy atoms drawn from carbon, nitrogen, oxygen, sulfur, and chlorine, then filtering the results for chemical stability and synthetic feasibility. The dataset was designed to provide a vast and novel chemical space for virtual screening in drug discovery.

Quick Facts

Total Molecules: 977,468,314
Atom Types: C, N, O, S, Cl (plus implicit Hydrogens)
Size Constraint: Up to 13 heavy atoms
Paper: 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13
Authors: Lorenz C. Blum and Jean-Louis Reymond
Publication: Journal of the American Chemical Society (2009)
Availability: Free for public access at gdb.unibe.ch

Dataset Composition

Size and Scale

Subset	Molecules	Description
Main C/N/O Set	910,111,673	Molecules containing only C, N, and O atoms.
Cl/S Set	67,356,641	A supplementary set containing specific S and Cl functional groups.
Total	977,468,314	The combined database.
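As a quick consistency check on the table above, the two subset counts can be summed and compared with the reported total. This is only a sketch of the arithmetic; the variable names are illustrative and not part of the dataset.

```python
# Subset sizes exactly as listed in the table above.
SUBSET_SIZES = {
    "Main C/N/O Set": 910_111_673,
    "Cl/S Set": 67_356_641,
}
REPORTED_TOTAL = 977_468_314

# The combined database should be the plain sum of the two subsets.
computed_total = sum(SUBSET_SIZES.values())

# Share of the whole database contributed by each subset, in percent.
shares = {name: 100 * n / computed_total for name, n in SUBSET_SIZES.items()}

print(computed_total == REPORTED_TOTAL)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```

The counts agree with the reported total, and the main C/N/O set accounts for roughly 93% of the database.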

Molecular Characteristics (Main Set)

A statistical analysis of computed properties indicates that the database is overwhelmingly “druglike”: nearly all molecules fall within the ranges below.

Property	Description	Druglike Range	% in GDB-13
Molecular Weight (MW)	Mass of the molecule	< 500 Da	100%
clogP	Water/octanol partition coefficient	< 5	100%
Hydrogen Bond Donors (HBD)	Number of N-H and O-H bonds	< 5	100%
Hydrogen Bond Acceptors (HBA)	Number of N and O atoms	< 10	100%
Rotatable Bond Count (RBC)	Number of single bonds that can rotate	< 10	99.5%

Leadlike: 98.9% of molecules meet leadlike criteria.
Fragmentlike: 45.1% of molecules meet fragmentlike criteria.
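The druglike ranges in the table above can be expressed as a simple predicate. The sketch below assumes the five descriptors have already been computed by a cheminformatics toolkit of your choice; the `Properties` container and `is_druglike` function are hypothetical names introduced here, and the example values are illustrative rather than taken from GDB-13.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Properties:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mw: float     # molecular weight in Da
    clogp: float  # calculated water/octanol partition coefficient
    hbd: int      # hydrogen bond donors
    hba: int      # hydrogen bond acceptors
    rbc: int      # rotatable bond count


def is_druglike(p: Properties) -> bool:
    """Apply the thresholds from the table: every property must be in range."""
    return (
        p.mw < 500
        and p.clogp < 5
        and p.hbd < 5
        and p.hba < 10
        and p.rbc < 10
    )


# Illustrative values only, not measured or taken from the database.
small = Properties(mw=180.2, clogp=1.2, hbd=1, hba=4, rbc=3)
floppy = Properties(mw=180.2, clogp=1.2, hbd=1, hba=4, rbc=11)
print(is_druglike(small), is_druglike(floppy))
```

With at most 13 heavy atoms, the weight, clogP, and hydrogen-bonding limits are met by every molecule according to the table, so the rotatable bond count is the only criterion that excludes a small fraction (0.5%).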

Structural Diversity

The dataset is highly diverse in terms of molecular structure:

Graph Types: Dominated by polycyclic (43.8%) and tricyclic (34.6%) graph topologies.
Molecule Types: The resulting molecules are mostly bicyclic (37.7%), monocyclic (26.5%), and tricyclic (22.3%).
Heterocycles: A large majority (71.0%) of the molecules are heterocyclic.
Small Rings: 54% of molecules contain at least one three- or four-membered ring, indicating a rich space of strained structures.

Data Generation Methodology

GDB-13 was generated through a systematic, multi-step computational pipeline:

Graph Generation: All non-isomorphic connected graphs with up to 13 vertices were generated.
Chemical Feasibility Filtering:
- Graphs were treated as saturated hydrocarbons and filtered for topological and ring-strain stability. A fast geometry-based estimation was used instead of computationally expensive minimizations.
- An “element-ratio” filter was applied to quickly eliminate molecules with unlikely ratios of heteroatoms to carbon (e.g., by keeping only structures with (N+O)/C < 1.0).
Combinatorial Enumeration:
- For each valid graph, unsaturations (double and triple bonds) were added combinatorially.
- Carbon atoms were systematically replaced with N, O, S, and Cl, respecting valency rules.
Functional Group Filtering: Final structures were filtered to remove unstable or synthetically unfeasible functional groups (e.g., hemiacetals, enamines).
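Two of the steps above, heteroatom substitution and the element-ratio filter, can be illustrated on a toy scale. The sketch below is not the authors' implementation: it enumerates C/N/O assignments on an unbranched three-atom chain only, counts atoms with a simple regular expression that assumes SMILES limited to C, N, O, S, and Cl, and uses the example threshold (N+O)/C < 1.0 quoted above. All function names are hypothetical.

```python
import re
from itertools import product

# Heavy-atom symbols expected in the SMILES; "Cl" is listed first so it is
# not misread as carbon. Lowercase letters are aromatic atoms.
_ATOM = re.compile(r"Cl|[CNOS]|[cnos]")


def heavy_atom_counts(smiles: str) -> dict:
    """Count heavy atoms per element in a SMILES string (C/N/O/S/Cl only)."""
    counts = {}
    for symbol in _ATOM.findall(smiles):
        element = symbol.capitalize()  # fold aromatic "c" into "C"
        counts[element] = counts.get(element, 0) + 1
    return counts


def passes_element_ratio(smiles: str, max_ratio: float = 1.0) -> bool:
    """Keep a molecule only if (N + O) / C is strictly below max_ratio."""
    counts = heavy_atom_counts(smiles)
    carbons = counts.get("C", 0)
    hetero = counts.get("N", 0) + counts.get("O", 0)
    return carbons > 0 and hetero / carbons < max_ratio


def substituted_chains(length: int, elements=("C", "N", "O")) -> list:
    """Toy enumeration: every element assignment on an unbranched chain.

    A chain and its reversal describe the same molecule, so only the
    lexicographically smaller spelling is kept.
    """
    spellings = map("".join, product(elements, repeat=length))
    return sorted({min(s, s[::-1]) for s in spellings})


candidates = substituted_chains(3)
survivors = [s for s in candidates if passes_element_ratio(s)]
print(len(candidates), survivors)
# prints: 18 ['CCC', 'CCN', 'CCO', 'CNC', 'COC']
```

Candidates such as `NON` or `OOO` are removed here by the ratio filter alone; a real pipeline would also need the functional group filtering described above. Under the same example threshold, mannitol (`OCC(O)C(O)C(O)C(O)CO`, six carbons and six oxygens) has a ratio of exactly 1.0 and is rejected.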

Intended Use Cases

Virtual Screening & Drug Discovery

Source of Novel Scaffolds: The primary use case is to serve as a vast library for virtual screening campaigns to find novel hit and lead compounds that are absent from existing databases like ZINC or PubChem.
Fragment-Based Design: The large subset of fragment-like molecules can be used for fragment-based drug discovery approaches.

Cheminformatics & Method Development

Benchmarking: The database can be used to benchmark and develop new virtual screening algorithms, molecular descriptors, or machine learning models for property prediction.
Chemical Space Exploration: Provides a model of a large, synthetically accessible portion of chemical space for analysis and visualization.

Synthetic Chemistry

Inspiration for Synthesis: The database can serve as a source of inspiration for synthetic chemists looking for new, interesting, and synthetically accessible target molecules.

Quality Assessment

Strengths

Vast Scale: At the time of its publication in 2009, it was the largest publicly available database of small molecules.
Systematic Enumeration: Provides a comprehensive and unbiased exploration of a defined chemical space.
Druglikeness: The molecules are overwhelmingly compliant with standard druglikeness filters.
Novelty: Contains a wealth of structures not found in databases of existing or commercially available compounds.
Synthetic Feasibility: The generation process included filters to favor synthetically accessible molecules.

Limitations

Incomplete Chemical Space: The database is not exhaustive. It omits:
- Other common elements like F, Br, I, and P.
- Many common functional groups (e.g., mercaptans, sulfoxides) due to the filtering rules.
- Molecules with high heteroatom-to-carbon ratios (e.g., sugars like mannitol).
Computed Nature: The database consists of virtual molecules; their properties are calculated, not experimentally measured. Their synthetic accessibility is estimated, not proven.

Technical Specifications

Data Format

The database is available as a collection of SMILES strings, a standard line notation for representing chemical structures.
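Because the full database holds close to a billion SMILES strings, it is usually more practical to stream it line by line than to load it into memory. The following is a minimal sketch, assuming one SMILES per line with an optional whitespace-separated identifier after it; the function name and the file name mentioned in the comment are hypothetical, and the actual layout of the downloaded files should be checked first.

```python
import io
from typing import Iterable, Iterator


def iter_smiles(lines: Iterable[str]) -> Iterator[str]:
    """Yield the SMILES field from each non-empty line, one at a time."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield fields[0]  # ignore any trailing name/ID column


# Stand-in for an open file handle; real use would be something like
# `with open("gdb13_subset.smi") as handle:` (hypothetical file name).
sample = io.StringIO("CCO\nc1ccccc1\n\nCC(=O)N mol_3\n")
smiles_list = list(iter_smiles(sample))
print(smiles_list)
# prints: ['CCO', 'c1ccccc1', 'CC(=O)N']
```

Since `iter_smiles` is a generator, downstream filters or counters can consume it without ever holding more than one line in memory.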

Access

The database is freely available for download at www.gdb.unibe.ch.

Dataset Status: Static, publicly available
Last Updated: 2009
Contact: Jean-Louis Reymond via original publication
