Dataset Summary
GDB-13 (Generated Database 13) is a computed chemical database containing nearly one billion unique small organic molecules. It was created by systematically enumerating molecules with up to 13 heavy atoms of carbon, nitrogen, oxygen, sulfur, and chlorine, subject to chemical stability and synthetic feasibility filters. The dataset was designed to provide a vast and novel chemical space for virtual screening in drug discovery applications.
Quick Facts
- Total Molecules: 977,468,314
- Atom Types: C, N, O, S, Cl (plus implicit Hydrogens)
- Size Constraint: Up to 13 heavy atoms
- Paper: 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13
- Authors: Lorenz C. Blum and Jean-Louis Reymond
- Publication: Journal of the American Chemical Society (2009)
- Availability: Free for public access at gdb.unibe.ch
Dataset Composition
Size and Scale
Subset | Molecules | Description |
---|---|---|
Main C/N/O Set | 910,111,673 | Molecules containing only C, N, and O atoms. |
Cl/S Set | 67,356,641 | A supplementary set containing specific S and Cl functional groups. |
Total | 977,468,314 | The combined database. |
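As a quick sanity check on the table above, the following short sketch recomputes the total from the two subset counts and derives each subset's share of the database. The variable names are illustrative only; the counts are copied from the table.

```python
# Molecule counts copied from the "Size and Scale" table.
SUBSETS = {
    "Main C/N/O Set": 910_111_673,
    "Cl/S Set": 67_356_641,
}
REPORTED_TOTAL = 977_468_314

# Recompute the total and each subset's percentage share.
total = sum(SUBSETS.values())
shares = {name: 100 * count / total for name, count in SUBSETS.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}% of the database")
print("matches reported total:", total == REPORTED_TOTAL)
```

The main C/N/O set accounts for roughly 93% of the database, so statistics reported for the main set describe the large majority of GDB-13.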
Molecular Characteristics (Main Set)
A statistical analysis of the database confirms its “druglike” nature.
Property | Description | Druglike Range | % in GDB-13 |
---|---|---|---|
Molecular Weight (MW) | Mass of the molecule | < 500 Da | 100% |
clogP | Water/octanol partition coefficient | < 5 | 100% |
Hydrogen Bond Donors (HBD) | Number of N-H and O-H bonds | < 5 | 100% |
Hydrogen Bond Acceptors (HBA) | Number of N and O atoms | < 10 | 100% |
Rotatable Bonds (RBC) | Number of single bonds that can rotate | < 10 | 99.5% |
- Leadlike: 98.9% of molecules meet leadlike criteria.
- Fragmentlike: 45.1% of molecules meet fragmentlike criteria.
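The druglike ranges in the table can be expressed as a simple threshold check. The sketch below is illustrative and is not the analysis code used by the authors: it assumes the five descriptors have already been computed with a cheminformatics toolkit of your choice, and the `Descriptors` class, the function names, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one molecule (hypothetical container)."""
    mw: float     # molecular weight in Da
    clogp: float  # calculated water/octanol partition coefficient
    hbd: int      # hydrogen bond donors
    hba: int      # hydrogen bond acceptors
    rbc: int      # rotatable bond count


# Exclusive upper bounds, copied from the "Druglike Range" column.
DRUGLIKE_LIMITS = {"mw": 500, "clogp": 5, "hbd": 5, "hba": 10, "rbc": 10}


def violations(d: Descriptors) -> list[str]:
    """Return the names of all properties at or above their limit."""
    return [name for name, limit in DRUGLIKE_LIMITS.items()
            if getattr(d, name) >= limit]


def is_druglike(d: Descriptors) -> bool:
    """A molecule is druglike here if it violates none of the limits."""
    return not violations(d)


# Made-up descriptor values for a small, GDB-13-sized molecule.
small = Descriptors(mw=180.2, clogp=1.3, hbd=2, hba=4, rbc=3)
print(is_druglike(small), violations(small))
```

Because every GDB-13 molecule has at most 13 heavy atoms, the molecular weight and acceptor limits are satisfied almost by construction, which is consistent with the 100% figures in the table.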
Structural Diversity
The dataset is highly diverse in terms of molecular structure:
- Graph Types: Dominated by polycyclic (43.8%) and tricyclic (34.6%) graph topologies.
- Molecule Types: The resulting molecules are mostly bicyclic (37.7%), monocyclic (26.5%), and tricyclic (22.3%).
- Heterocycles: A large majority (71.0%) of the molecules are heterocyclic.
- Small Rings: 54% of molecules contain at least one three- or four-membered ring, indicating a rich space of strained structures.
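One plausible way to reproduce a ring-count classification like the one above directly from SMILES is to count ring-closure labels. The sketch below makes several assumptions: it equates the number of rings with the number of ring-closure bonds (which matches the smallest-set-of-rings count for a single connected molecule), it treats four or more rings as "polycyclic", and the function names are hypothetical. The original classification method may differ.

```python
import re

# Bracket atoms such as [nH] or [NH3+] may contain digits that are not
# ring-closure labels, so they are masked before counting.
BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
# A ring-closure label is either a single digit or %nn (two digits).
RING_CLOSURE = re.compile(r"%\d{2}|\d")

RING_CLASS = {0: "acyclic", 1: "monocyclic", 2: "bicyclic", 3: "tricyclic"}


def ring_count(smiles: str) -> int:
    """Count ring-closure bonds; each bond uses two matching labels."""
    masked = BRACKET_ATOM.sub("*", smiles)
    labels = RING_CLOSURE.findall(masked)
    if len(labels) % 2:
        raise ValueError(f"unpaired ring-closure label in {smiles!r}")
    return len(labels) // 2


def ring_class(smiles: str) -> str:
    """Map the ring count to a class name; 4+ rings are 'polycyclic'."""
    return RING_CLASS.get(ring_count(smiles), "polycyclic")


for smi in ["CCO", "C1CC1", "c1ccc2ccccc2c1", "C12C3C4C1C5C2C3C45"]:
    print(smi, ring_class(smi))
```

The last example is cubane, whose eight atoms and twelve bonds give five ring-closure bonds, so it falls into the polycyclic class under this definition.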
Data Generation Methodology
GDB-13 was generated through a systematic, multi-step computational pipeline:
- Graph Generation: All non-isomorphic connected graphs with up to 13 vertices were generated.
- Chemical Feasibility Filtering:
  - Graphs were treated as saturated hydrocarbons and filtered for topological and ring-strain stability. A fast geometry-based estimation was used instead of computationally expensive energy minimizations.
  - An “element-ratio” filter was applied to quickly eliminate molecules with unlikely heteroatom-to-carbon ratios; for example, only molecules with (N+O)/C < 1.0 were retained.
- Combinatorial Enumeration:
  - For each valid graph, unsaturations (double and triple bonds) were added combinatorially.
  - Carbon atoms were systematically replaced with N, O, S, and Cl, respecting valency rules.
- Functional Group Filtering: Final structures were filtered to remove unstable or synthetically unfeasible functional groups (e.g., hemiacetals, enamines).
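The heteroatom substitution and element-ratio steps can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' pipeline: it covers only C, N, and O on a saturated skeleton, it reads the (N+O)/C < 1.0 example as the condition a molecule must satisfy to be kept (consistent with the Limitations section, which notes that molecules with high heteroatom-to-carbon ratios are omitted), and it does not merge symmetry-equivalent labelings. All names are hypothetical.

```python
from itertools import product

# Maximum number of bonds per element (standard valences).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


def allowed_elements(degree: int) -> list[str]:
    """Elements whose valence can accommodate a vertex of this degree."""
    return [el for el, valence in MAX_VALENCE.items() if valence >= degree]


def passes_element_ratio(elements, limit: float = 1.0) -> bool:
    """Keep a labeling only if (N+O)/C is below the limit."""
    carbons = elements.count("C")
    hetero = len(elements) - carbons
    return carbons > 0 and hetero / carbons < limit


def substitute(degrees: list[int]) -> list[tuple[str, ...]]:
    """Enumerate valence-respecting labelings that pass the ratio filter."""
    choices = [allowed_elements(d) for d in degrees]
    return [combo for combo in product(*choices)
            if passes_element_ratio(combo)]


# Isobutane skeleton: one central vertex (degree 3), three terminal ones.
isobutane_degrees = [3, 1, 1, 1]
labelings = substitute(isobutane_degrees)
for labeling in labelings:
    print(labeling)
```

The central vertex can only be C or N because oxygen cannot form three bonds, and any labeling with two or more heteroatoms among the four atoms is rejected by the ratio filter. A real enumeration would additionally deduplicate equivalent labelings, add unsaturations, and apply the functional-group filters.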
Intended Use Cases
Virtual Screening & Drug Discovery
- Source of Novel Scaffolds: The primary use case is to serve as a vast library for virtual screening campaigns to find novel hit and lead compounds that are absent from existing databases like ZINC or PubChem.
- Fragment-Based Design: The large subset of fragment-like molecules can be used for fragment-based drug discovery approaches.
Cheminformatics & Method Development
- Benchmarking: Can be used to benchmark and develop new virtual screening algorithms, molecular descriptors, or machine learning models for property prediction.
- Chemical Space Exploration: Provides a model of a large, synthetically accessible portion of chemical space for analysis and visualization.
Synthetic Chemistry
- Inspiration for Synthesis: The database can serve as a source of inspiration for synthetic chemists looking for new, interesting, and synthetically accessible target molecules.
Quality Assessment
Strengths
- Vast Scale: At the time of publication (2009), it was the largest publicly available database of small molecules.
- Systematic Enumeration: Provides a comprehensive and unbiased exploration of a defined chemical space.
- Druglikeness: The molecules are overwhelmingly compliant with standard druglikeness filters.
- Novelty: Contains a wealth of structures not found in databases of existing or commercially available compounds.
- Synthetic Feasibility: The generation process included filters to favor synthetically accessible molecules.
Limitations
- Incomplete Chemical Space: The database is not exhaustive. It omits:
  - Other common elements like F, Br, I, and P.
  - Many common functional groups (e.g., mercaptans, sulfoxides) due to the filtering rules.
  - Molecules with high heteroatom-to-carbon ratios (e.g., sugars like mannitol).
- Computed Nature: The database consists of virtual molecules; their properties are calculated, not experimentally measured. Their synthetic accessibility is estimated, not proven.
Technical Specifications
Data Format
- The database is available as a collection of SMILES strings, a standard line notation for representing chemical structures.
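Because the data is plain SMILES text, basic checks need no special tooling. The sketch below assumes one SMILES string per line, optionally followed by whitespace and an identifier; the exact file layout of the download is not specified here, so treat that as an assumption. The simple atom pattern only recognizes the GDB-13 element set (C, N, O, S, Cl, including aromatic lowercase forms) and is not a general SMILES parser. Function names are hypothetical.

```python
import re
from collections import Counter

# Matches heavy atoms of the GDB-13 element set; "Cl" is tried first so
# that it is not misread as a carbon. Hydrogens are never matched.
ATOM = re.compile(r"Cl|[CNOS]|[cnos]")
MAX_HEAVY_ATOMS = 13


def element_counts(smiles: str) -> Counter:
    """Count heavy atoms per element, folding aromatic symbols to upper case."""
    return Counter(symbol.capitalize() for symbol in ATOM.findall(smiles))


def iter_molecules(lines):
    """Yield (smiles, counts) for each non-empty line within the size limit."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        smiles = fields[0]
        counts = element_counts(smiles)
        if sum(counts.values()) <= MAX_HEAVY_ATOMS:
            yield smiles, counts


# Any iterable of lines works, including an open file handle.
sample_lines = ["CC(=O)NC1CC1\n", "Clc1ccccc1 hypothetical-id\n", "\n"]
molecules = list(iter_molecules(sample_lines))
for smiles, counts in molecules:
    print(smiles, dict(counts))
```

Since `iter_molecules` is a generator, it can stream a file of hundreds of millions of lines without loading it into memory. For anything beyond atom counting, a dedicated cheminformatics toolkit is the safer choice.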
Access
- The database is freely available for download at www.gdb.unibe.ch.
- Dataset Status: Static, publicly available
- Last Updated: 2009
- Contact: Jean-Louis Reymond via original publication