Key Contribution
The generation and analysis of the Generated Database (GDB), an exhaustive collection of all possible small molecules that meet specific criteria for stability and synthetic feasibility.
Overview
GDB-11 represents the first systematic enumeration of the small molecule chemical universe up to 11 atoms of C, N, O, and F. The database was created to support virtual screening and drug discovery by providing a comprehensive collection of drug-like molecules that obey standard chemical stability rules.
Strengths
- Systematic coverage of structures with up to 11 atoms
- High drug-likeness: 100% Lipinski compliance, 50% Rule of Three compliance
- Structural novelty: 538 previously unknown ring systems
Limitations
- Limited to small molecules with up to 11 atoms of C, N, O, and F
- Excludes highly strained molecules and some bond patterns
- Excludes unstable functional groups such as hemiacetals and gem-diols
- Computer-generated structures, not experimentally validated compounds
Technical Notes
Construction
Graph Selection
GENG was used to generate the starting graphs, yielding 843,335 connected graphs with up to 11 nodes. Topological and steric filters reduced these to 15,726 stable graphs.
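A minimal sketch of what a topological graph filter can look like. The specific rule used here (at most 4 neighbours per node, matching carbon's valence) is an assumption for illustration; the notes above only say "topological and steric criteria". The function names are hypothetical.

```python
from collections import deque

def is_connected(adj):
    """Return True if the undirected graph (dict: node -> set of neighbours) is connected."""
    if not adj:
        return False
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(adj)

def passes_degree_filter(adj, max_degree=4):
    """Assumed topological rule: no node may have more than max_degree neighbours."""
    return all(len(nbrs) <= max_degree for nbrs in adj.values())

def keep_graph(adj, max_nodes=11, max_degree=4):
    """Keep connected graphs with at most max_nodes nodes that pass the degree rule."""
    return (
        len(adj) <= max_nodes
        and is_connected(adj)
        and passes_degree_filter(adj, max_degree)
    )

# A six-membered ring is kept; a star with a degree-5 centre is rejected.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
print(keep_graph(ring), keep_graph(star))
```

Steric criteria (ring strain and similar) would need 3D information and are not covered by this sketch.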
Structure Generation
A graph symmetry algorithm was used to identify valid locations for unsaturations and element types. Combinatorial expansion yielded 1.7 billion unique structures.
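To illustrate why graph symmetry matters during combinatorial expansion, the sketch below assigns element types to the nodes of a small graph and keeps one representative per symmetry class. It uses brute-force enumeration of automorphisms, which is an illustrative stand-in and not the algorithm used for GDB; valence rules and unsaturations are ignored, and all function names are hypothetical.

```python
from itertools import permutations, product

def automorphisms(n, edges):
    """Brute-force list of node permutations that map the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    autos = []
    for perm in permutations(range(n)):
        mapped = {frozenset((perm[a], perm[b])) for a, b in edges}
        if mapped == edge_set:
            autos.append(perm)
    return autos

def unique_labelings(n, edges, elements=("C", "N", "O")):
    """Return one canonical element assignment per symmetry-equivalent class."""
    autos = automorphisms(n, edges)
    seen = set()
    unique = []
    for labels in product(elements, repeat=n):
        # Canonical form: smallest labeling among all symmetry-equivalent images.
        canonical = min(tuple(labels[p[i]] for i in range(n)) for p in autos)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique

# Three-node chain: 27 raw assignments, fewer after removing mirror duplicates.
chain = [(0, 1), (1, 2)]
print(len(unique_labelings(3, chain)))
```

Brute force is only workable for very small graphs; at the scale described above, a dedicated symmetry or canonical-labeling routine would be needed.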
Filters
Filtering out heteroatom bonds, gem-diols, aminals, enols, orthoacids, acyl fluorides, and other labile functional groups reduces the set to 27.7 million structures. Removal of redundant tautomeric forms yields 26.4 million structures.
Stereoisomer Generation
110.9 million stereoisomers were generated from the 26.4 million structures.
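The construction funnel can be summarized with a little arithmetic. The sketch below only reuses the counts quoted in these notes; the variable and function names are hypothetical.

```python
# Counts quoted in the Construction notes.
GRAPHS = [
    ("connected graphs from GENG", 843_335),
    ("stable graphs", 15_726),
]
STRUCTURES = [
    ("combinatorial expansion", 1.7e9),
    ("after functional-group filters", 27.7e6),
    ("after tautomer removal", 26.4e6),
]
STEREOISOMERS = 110.9e6

def retention(stages):
    """Return (stage name, fraction kept relative to the previous stage)."""
    return [
        (name, count / prev)
        for (_, prev), (name, count) in zip(stages, stages[1:])
    ]

graph_retention = retention(GRAPHS)
structure_retention = retention(STRUCTURES)
stereo_per_structure = STEREOISOMERS / STRUCTURES[-1][1]

for name, frac in graph_retention + structure_retention:
    print(f"{name}: {frac:.2%} of the previous stage")
print(f"stereoisomers per structure: {stereo_per_structure:.1f}")
```

With these figures, roughly 1.9% of the starting graphs survive filtering, about 1.6% of the expanded structures survive the functional-group filters, about 95.3% of those survive tautomer removal, and each final structure corresponds to about 4.2 stereoisomers on average.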
Analysis Methodology
Kohonen Maps (Self-Organizing Maps)
The chemical space visualization and compound class analysis used a Kohonen Map (Self-Organizing Map/SOM):
- Input Features: 48-dimensional autocorrelation vectors encoding topological relationships and atomic properties
- Training Data: Random subset of 1,000,000 GDB molecules
- Architecture: 200×200 neuron grid
- Training Protocol: 250,000 epochs with 100 molecules presented per epoch
- Algorithm: Standard Kohonen algorithm
- Key Insight: Reveals that “lead-like” compounds cluster in chiral regions of fused carbocycles/heterocycles
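The protocol above amounts to 250,000 × 100 = 25 million molecule presentations to a 200×200 map. A minimal pure-Python sketch of a standard Kohonen update is shown below on a much smaller grid. The learning-rate and neighbourhood schedules (linear decay, Gaussian neighbourhood) are assumptions, since the notes do not specify them, and random vectors stand in for the real 48-dimensional autocorrelation descriptors.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(data, rows, cols, epochs, batch, lr0=0.5, seed=0):
    """Train a rows x cols map; each epoch presents `batch` randomly drawn vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed shrinking radius
        for x in rng.choices(data, k=batch):
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Gaussian neighbourhood centred on the best matching unit.
                    dist2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-dist2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
    return weights

# Toy run: 200 random 48-dimensional vectors on a 4x4 map.
toy_rng = random.Random(1)
toy_data = [[toy_rng.random() for _ in range(48)] for _ in range(200)]
som = train_som(toy_data, rows=4, cols=4, epochs=10, batch=20)
print(best_matching_unit(som, toy_data[0]))
```

After training, each molecule is assigned to its best matching unit, and colouring the grid by a property of the assigned molecules gives the kind of map used for the compound class analysis.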
Comparison
Compares GDB to a combined reference database (RDB) of organic molecules from PubChem, ChemACX, ChemSCX, the NCI Open Database, and the Merck Index.
New Rings
All 309 acyclic graphs in GDB are represented in the reference databases. Only 670 of the 1,208 ring systems (55.5%) are represented there, leaving 538 previously unknown ring systems, of which 367 (68.2%) are chiral.
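The ring-system figures can be cross-checked with a short calculation. This sketch only reuses the counts quoted above; the variable names are hypothetical.

```python
total_ring_systems = 1208
known_ring_systems = 670
chiral_unknown = 367

# Ring systems in GDB that do not appear in the reference databases.
unknown_ring_systems = total_ring_systems - known_ring_systems
known_fraction = known_ring_systems / total_ring_systems
chiral_fraction = chiral_unknown / unknown_ring_systems

print(f"unknown ring systems: {unknown_ring_systems}")
print(f"known fraction: {known_fraction:.1%}")
print(f"chiral fraction of unknown: {chiral_fraction:.1%}")
```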
Stereochemistry
Small molecules with fewer than 5 heavy atoms were mostly achiral. Over two thirds of molecules with 10 or 11 atoms were chiral.
Physicochemical Properties
100% of GDB obeys Lipinski’s “Rule of 5” for bioavailability. Half of GDB satisfies the more restrictive “Rule of 3” for fragment-based drug design.
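A minimal sketch of how the two rules can be checked from precomputed descriptors. The thresholds are the commonly cited ones (Rule of 5: molecular weight ≤ 500, logP ≤ 5, ≤ 5 H-bond donors, ≤ 10 H-bond acceptors; Rule of 3: weight ≤ 300 and the other three ≤ 3); the exact definitions used in the GDB analysis are not given in these notes, so treat them as assumptions. The `Descriptors` class and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    """Precomputed properties of one molecule (values supplied by the caller)."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int

def passes_rule_of_5(d):
    """Strict check: all four commonly cited Lipinski thresholds must hold."""
    return (d.mol_weight <= 500 and d.logp <= 5
            and d.h_donors <= 5 and d.h_acceptors <= 10)

def passes_rule_of_3(d):
    """Fragment-oriented check with the tighter, commonly cited thresholds."""
    return (d.mol_weight <= 300 and d.logp <= 3
            and d.h_donors <= 3 and d.h_acceptors <= 3)

# Hypothetical descriptor values for two small molecules.
fragment_like = Descriptors(mol_weight=150.0, logp=1.2, h_donors=1, h_acceptors=2)
polar = Descriptors(mol_weight=160.0, logp=0.5, h_donors=4, h_acceptors=5)

print(passes_rule_of_5(fragment_like), passes_rule_of_3(fragment_like))
print(passes_rule_of_5(polar), passes_rule_of_3(polar))
```

Because every GDB molecule has at most 11 heavy atoms, molecular weight stays far below 500, which is consistent with full Rule of 5 compliance; the Rule of 3 limits on donors and acceptors are where small polar molecules can fail.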
Implementation & Replication Details
Hardware
- CPUs: Two AMD Opteron 252 2.6 GHz processors
- Parallelization: 80-fold parallelization
- Compute Time: Approximately 20 hours for full generation
Software Stack
- Language: Java (J2SE v5.0)
- Cheminformatics Libraries: JChem v3.1, Marvin v4.0 API (ChemAxon)
- Graph Generation: GENG program
Force Field
Custom implementation of the MM2 force field using Allinger's parameter set. Used for steric energy minimization during structure validation.

