Paper Summary

Citation: Blum, L. C., & Reymond, J.-L. (2009). 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Journal of the American Chemical Society, 131(25), 8732–8733. https://doi.org/10.1021/ja902302h

Publication: Journal of the American Chemical Society (JACS) 2009

What kind of paper is this?

This is a communications paper that introduces a new, large-scale chemical database, GDB-13, and describes the computational methodology used for its generation.

What is the motivation?

The primary motivation is to expand the known chemical space for drug discovery. The authors argue that innovation in drug discovery requires access to novel small molecule structures that are not present in existing databases of known compounds. By systematically enumerating a vast number of chemically feasible small molecules, they aim to provide a rich resource for virtual screening and the discovery of new lead structures. This work builds upon their previous effort, GDB-11.

What is the novelty here?

The main novelty is the scale of the systematic enumeration: with nearly one billion structures, GDB-13 was, at the time of publication, the largest freely available small-molecule database. Key technical innovations that enabled this scale-up from the previous GDB-11 database include:

  • Algorithmic Improvements: They introduced a fast “element-ratio” filter that discards chemically unstable molecules early; discarding these molecules was a major bottleneck in the previous version. Cutoff values such as (N + O)/C < 1.0 were used.
  • Optimized Geometry: They replaced slow MM2 minimization for strained ring systems with a faster geometry-based estimation, significantly reducing computation time.
  • Expanded Chemical Space: The database covers molecules with up to 13 atoms of C, N, O, S, and Cl, whereas GDB-11 covered up to 11 atoms of C, N, O, and F.
  • A Dedicated Cl/S Set: A separate set of 67.3 million compounds was generated that contain specific chlorine and sulfur functional groups of particular interest for drug discovery (e.g., thiophenes, sulfonamides).
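
The element-ratio idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the simple formula parser, and the choice to reject carbon-free formulas are assumptions made for illustration. Only the (N + O)/C < 1.0 cutoff is taken from the summary above.

```python
import re
from collections import Counter


def count_elements(formula: str) -> Counter:
    """Parse a simple molecular formula such as 'C6H12O6' into element counts.

    Hypothetical helper: handles only flat formulas (no brackets or charges).
    """
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def passes_element_ratio(formula: str, max_hetero_to_carbon: float = 1.0) -> bool:
    """Return True if (N + O)/C is strictly below the cutoff.

    Assumption: formulas without carbon are rejected outright.
    """
    counts = count_elements(formula)
    carbon = counts["C"]
    if carbon == 0:
        return False
    return (counts["N"] + counts["O"]) / carbon < max_hetero_to_carbon


if __name__ == "__main__":
    for formula in ["C6H6", "C2H6O", "CH4N2O"]:
        print(formula, passes_element_ratio(formula))
```

Because the check needs only element counts, it can run before any more expensive structure-based filter, which is presumably what makes this kind of filter fast.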

What experiments were performed?

The paper’s main “experiment” is the computational generation of the database itself. The process involved:

  1. Generating a list of all possible molecular graphs (as saturated hydrocarbons) up to 13 nodes.
  2. Filtering these graphs based on topological and ring-strain criteria.
  3. Combinatorially introducing unsaturations (double/triple bonds) and heteroatoms (N, O, S, Cl) while respecting valency rules.
  4. Applying chemical stability and synthetic feasibility filters, including the new element-ratio filter, to remove unrealistic molecules.
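
Step 3 can be sketched on a toy scale. The Python below is an illustrative assumption, not the authors' code: it takes a small hydrocarbon skeleton as an edge list, tries every C/N/O assignment, keeps those that respect standard valences, and removes symmetry duplicates by brute force over graph automorphisms. It ignores unsaturations, S, and Cl, and the brute-force approach is only practical for a handful of atoms.

```python
from itertools import permutations, product

# Standard maximum valences; S and Cl are left out of this toy example.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


def automorphisms(n, edges):
    """Return all vertex permutations that map every edge onto an edge."""
    edge_set = {frozenset(e) for e in edges}
    return [
        perm
        for perm in permutations(range(n))
        if all(frozenset((perm[a], perm[b])) in edge_set for a, b in edges)
    ]


def substitute_heteroatoms(n, edges):
    """Yield symmetry-unique C/N/O labelings of a saturated skeleton."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    autos = automorphisms(n, edges)
    seen = set()
    for labels in product("CNO", repeat=n):
        # Valency rule: an atom cannot have more neighbours than its valence.
        if any(degree[i] > MAX_VALENCE[labels[i]] for i in range(n)):
            continue
        # Canonical form: smallest labeling among all symmetric equivalents.
        canonical = min(tuple(labels[p[i]] for i in range(n)) for p in autos)
        if canonical not in seen:
            seen.add(canonical)
            yield canonical


if __name__ == "__main__":
    chain = [(0, 1), (1, 2)]  # three-atom chain (propane skeleton)
    for labeling in substitute_heteroatoms(3, chain):
        print("-".join(labeling))
```

For the three-atom chain this toy yields 18 distinct labelings, including species such as O-O-O that a stability filter of the kind described in step 4 would be expected to remove.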

After generation, the authors performed a statistical analysis of the database, evaluating the distribution of key physicochemical properties, such as molecular weight (MW), topological polar surface area (TPSA), lipophilicity (clogP), rotatable bond count (RBC), and hydrogen-bond donor and acceptor counts (HBD, HBA), to confirm the “druglike” nature of the enumerated molecules. They also compared the generated molecules to known marketed drugs to demonstrate the novelty and diversity of the database.

What were the outcomes and conclusions drawn?

  • Successful Database Generation: The authors successfully generated GDB-13, containing 977,468,314 unique structures, in approximately 40,000 CPU hours.
  • Druglike Properties: The molecules in GDB-13 satisfy common druglikeness filters almost without exception: 100% pass Lipinski’s Rule of Five and 98.9% meet leadlikeness criteria.
  • Structural Diversity: The database is structurally diverse: most molecules are heterocyclic, and 54% contain small, strained 3- or 4-membered rings.
  • Vast Chemical Space: The enumeration revealed very large numbers of structural isomers for the molecular formulas of common drugs. For example, over 18 million isomers were found for the formula of the anesthetic propofol.
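
The druglikeness figures in the list above rest on simple threshold checks. As a minimal sketch, the Python below counts Rule of Five violations using the commonly cited thresholds (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10). The Properties class and the example values are hypothetical. In practice the properties would come from a cheminformatics toolkit, and the leadlikeness criteria used in the paper are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Properties:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float
    clogp: float
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(p: Properties) -> int:
    """Count how many of the four Rule of Five thresholds are exceeded."""
    checks = [
        p.mol_weight > 500,
        p.clogp > 5,
        p.h_bond_donors > 5,
        p.h_bond_acceptors > 10,
    ]
    return sum(checks)


if __name__ == "__main__":
    # Illustrative values only, not taken from GDB-13.
    small = Properties(mol_weight=180.2, clogp=1.2, h_bond_donors=1, h_bond_acceptors=4)
    print(rule_of_five_violations(small))
```

Because GDB-13 molecules have at most 13 heavy atoms, the molecular-weight threshold in particular is hard to exceed, which is consistent with the 100% pass rate listed above.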

The authors conclude that GDB-13 represents a rich source of novel, synthetically accessible small molecules for virtual screening, providing inspiration for medicinal chemists beyond what is available in databases of existing compounds. The database is made freely available to the public.


Note: This is a personal learning note and may be incomplete or evolving.