Dataset Summary

GDB-13 (Generated Database 13) is a computed chemical database containing nearly one billion (977,468,314) unique small organic molecules. It was created by systematically enumerating molecules of up to 13 heavy (non-hydrogen) atoms of carbon, nitrogen, oxygen, sulfur, and chlorine, subject to chemical stability and synthetic feasibility filters. The dataset was designed to provide a vast and largely unexplored chemical space for virtual screening in drug discovery.

Quick Facts

Dataset Composition

Size and Scale

| Subset | Molecules | Description |
|---|---|---|
| Main C/N/O Set | 910,111,673 | Molecules containing only C, N, and O atoms. |
| Cl/S Set | 67,356,641 | A supplementary set containing specific S and Cl functional groups. |
| Total | 977,468,314 | The combined database. |

Molecular Characteristics (Main Set)

A statistical analysis of the database confirms its “druglike” nature.

| Property | Description | Druglike Range | % in GDB-13 |
|---|---|---|---|
| Molecular Weight (MW) | Mass of the molecule | < 500 Da | 100% |
| clogP | Calculated octanol/water partition coefficient (log P) | < 5 | 100% |
| Hydrogen Bond Donors (HBD) | Number of N-H and O-H bonds | < 5 | 100% |
| Hydrogen Bond Acceptors (HBA) | Number of N and O atoms | < 10 | 100% |
| Rotatable Bonds (RBC) | Number of single bonds that can rotate | < 10 | 99.5% |

  • Leadlike: 98.9% of molecules meet leadlike criteria.
  • Fragmentlike: 45.1% of molecules meet fragmentlike criteria.
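The druglike ranges in the table can be applied as a simple filter. The following is a minimal sketch, assuming the descriptor values have already been computed by a cheminformatics toolkit. The `Descriptors` class and the `passes_druglike_filter` function are hypothetical names introduced for illustration, and the strict `<` comparisons mirror the table as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptor values for one molecule (hypothetical container)."""
    mw: float     # molecular weight in Da
    clogp: float  # calculated log P
    hbd: int      # hydrogen bond donors
    hba: int      # hydrogen bond acceptors
    rbc: int      # rotatable bond count


def passes_druglike_filter(d: Descriptors) -> bool:
    """Return True if every descriptor lies inside the druglike range of the table."""
    return (
        d.mw < 500
        and d.clogp < 5
        and d.hbd < 5
        and d.hba < 10
        and d.rbc < 10
    )


# Illustrative values only; they are not taken from GDB-13.
example = Descriptors(mw=180.2, clogp=1.3, hbd=2, hba=4, rbc=3)
print(passes_druglike_filter(example))
```

Keeping the descriptor calculation separate from the threshold check means the same filter can be reused with whichever toolkit produces the values.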

Structural Diversity

The dataset is highly diverse in terms of molecular structure:

  • Graph Types: Dominated by polycyclic (43.8%) and tricyclic (34.6%) graph topologies.
  • Molecule Types: The resulting molecules are mostly bicyclic (37.7%), monocyclic (26.5%), and tricyclic (22.3%).
  • Heterocycles: A large majority (71.0%) of the molecules are heterocyclic.
  • Small Rings: 54% of molecules contain at least one three- or four-membered ring, indicating a rich space of strained structures.
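A rough way to bin SMILES strings into ring classes like those listed above is to count ring-closure labels, since each pair of matching labels closes one independent cycle. The sketch below assumes valid SMILES input and plain one-digit or `%nn` two-digit labels. The function names and the mapping from ring count to class label (in particular, "polycyclic" for four or more rings) are assumptions made for illustration, not the definitions used in the original analysis.

```python
def count_rings(smiles: str) -> int:
    """Count independent rings as the number of ring-closure label pairs."""
    closures = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside are isotopes, H counts, or charges.
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            # Two-digit ring-closure label such as %10 counts as one label.
            closures += 1
            i += 3
            continue
        if ch.isdigit():
            closures += 1
        i += 1
    # Every ring bond is written as two matching labels.
    return closures // 2


def ring_class(smiles: str) -> str:
    """Map a ring count to a class label (labels are an assumed convention)."""
    labels = {0: "acyclic", 1: "monocyclic", 2: "bicyclic", 3: "tricyclic"}
    return labels.get(count_rings(smiles), "polycyclic")


for smi in ["CCO", "C1CC1", "C1CC2CC2C1", "C12C3C1C23"]:
    print(smi, ring_class(smi))
```

Detecting three- or four-membered rings specifically would need the full molecular graph, which is better left to a cheminformatics toolkit.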

Data Generation Methodology

GDB-13 was generated through a systematic, multi-step computational pipeline:

  1. Graph Generation: All non-isomorphic connected graphs with up to 13 vertices were generated.
  2. Chemical Feasibility Filtering:
    • Graphs were treated as saturated hydrocarbons and filtered for topological and ring-strain stability. A fast geometry-based estimation was used instead of computationally expensive minimizations.
    • An “element-ratio” filter was applied to quickly discard molecules with unlikely heteroatom-to-carbon ratios, retaining only those that satisfy criteria such as (N+O)/C < 1.0.
  3. Combinatorial Enumeration:
    • For each valid graph, unsaturations (double and triple bonds) were added combinatorially.
    • Carbon atoms were systematically replaced with N, O, S, and Cl, respecting valency rules.
  4. Functional Group Filtering: Final structures were filtered to remove unstable or synthetically unfeasible functional groups (e.g., hemiacetals, enamines).
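The graph generation step can be illustrated at toy scale. The sketch below enumerates non-isomorphic connected graphs by brute force, using a canonical form computed over all vertex relabelings. This is only practical for a handful of vertices and is not the generator used to build GDB-13. The degree cap of 4 is an assumption that follows from treating each vertex as a saturated carbon, and all function names are hypothetical.

```python
from itertools import combinations, permutations


def is_connected(n, edges):
    """Depth-first search from vertex 0 must reach all n vertices."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


def max_degree(n, edges):
    """Largest number of edges meeting at any single vertex."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return max(deg)


def canonical_form(n, edges):
    """Smallest sorted edge tuple over all vertex relabelings (brute force)."""
    best = None
    for perm in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((perm[a], perm[b]))) for a, b in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best


def enumerate_graphs(n, degree_cap=4):
    """Return canonical forms of all connected graphs on n vertices under the cap."""
    all_edges = list(combinations(range(n), 2))
    found = set()
    # A connected graph on n vertices needs at least n - 1 edges.
    for k in range(n - 1, len(all_edges) + 1):
        for edges in combinations(all_edges, k):
            if max_degree(n, edges) <= degree_cap and is_connected(n, edges):
                found.add(canonical_form(n, edges))
    return found


# Keep n small: the cost grows factorially with the number of vertices.
print([len(enumerate_graphs(n)) for n in range(1, 6)])
```

For up to five vertices the degree cap never excludes anything, so the counts match the known numbers of connected graphs (1, 1, 2, 6, 21). Reaching 13 vertices requires a dedicated graph generator rather than this exhaustive approach.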

Intended Use Cases

Virtual Screening & Drug Discovery

  • Source of Novel Scaffolds: The primary use case is virtual screening for novel hit and lead compounds that are absent from existing databases such as ZINC or PubChem.
  • Fragment-Based Design: The large subset of fragment-like molecules can be used for fragment-based drug discovery approaches.

Cheminformatics & Method Development

  • Benchmarking: The database can be used to develop and benchmark virtual screening algorithms, molecular descriptors, and machine learning models for property prediction.
  • Chemical Space Exploration: Provides a model of a large, synthetically accessible portion of chemical space for analysis and visualization.

Synthetic Chemistry

  • Inspiration for Synthesis: The database can serve as a source of inspiration for synthetic chemists looking for new, interesting, and synthetically accessible target molecules.

Quality Assessment

Strengths

  • Vast Scale: At the time of its release (2009), it was the largest publicly available database of small molecules.
  • Systematic Enumeration: Provides a comprehensive and unbiased exploration of a defined chemical space.
  • Druglikeness: The molecules are overwhelmingly compliant with standard druglikeness filters.
  • Novelty: Contains a wealth of structures not found in databases of existing or commercially available compounds.
  • Synthetic Feasibility: The generation process included filters to favor synthetically accessible molecules.

Limitations

  • Incomplete Chemical Space: The database is not exhaustive. It omits:
    • Other common elements like F, Br, I, and P.
    • Many common functional groups (e.g., mercaptans, sulfoxides) due to the filtering rules.
    • Molecules with high heteroatom-to-carbon ratios (e.g., sugars like mannitol).
  • Computed Nature: The database consists of virtual molecules; their properties are calculated, not experimentally measured. Their synthetic accessibility is estimated, not proven.
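The exclusion of sugar-like molecules follows directly from the element-ratio criterion quoted in the methodology, (N+O)/C < 1.0. As a worked check, mannitol (C6H14O6) has six oxygens and six carbons, so its ratio is exactly 1.0 and it fails the strict inequality. The sketch below assumes atom counts are supplied as a plain mapping; the function names are hypothetical.

```python
from typing import Mapping


def heteroatom_ratio(counts: Mapping[str, int]) -> float:
    """Compute (N + O) / C from a mapping of element symbol to atom count."""
    carbons = counts.get("C", 0)
    if carbons == 0:
        raise ValueError("ratio is undefined without carbon atoms")
    return (counts.get("N", 0) + counts.get("O", 0)) / carbons


def passes_element_ratio(counts: Mapping[str, int], limit: float = 1.0) -> bool:
    """Keep a composition only if its (N + O) / C ratio is strictly below the limit."""
    return heteroatom_ratio(counts) < limit


mannitol = {"C": 6, "H": 14, "O": 6}  # C6H14O6
ethanol = {"C": 2, "H": 6, "O": 1}    # C2H6O

print(heteroatom_ratio(mannitol), passes_element_ratio(mannitol))
print(heteroatom_ratio(ethanol), passes_element_ratio(ethanol))
```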

Technical Specifications

Data Format

  • The database is available as a collection of SMILES strings, a standard line notation for representing chemical structures.
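With close to a billion entries, the SMILES files are best read as a stream rather than loaded into memory. The following is a minimal sketch, assuming one SMILES string per line, optionally followed by whitespace and an identifier, and optional gzip compression. The file layout, the `iter_smiles` name, and the demo file name are assumptions, not details confirmed by the dataset documentation.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterator


def iter_smiles(path: Path) -> Iterator[str]:
    """Lazily yield the first whitespace-separated field of each non-empty line."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                yield fields[0]


# Demo on a tiny temporary file standing in for a real database chunk.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.smi"
    demo.write_text("CCO\nC1CC1 cyclopropane\n\n", encoding="utf-8")
    print(list(iter_smiles(demo)))
```

Because the function is a generator, downstream filters can process one molecule at a time without holding the whole file in memory.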

Access


Dataset Status: Static, publicly available
Last Updated: 2009
Contact: Jean-Louis Reymond via original publication