Chemical Space on Hunter Heidenreich | ML Research Scientist

VEHICLe: Heteroaromatic Rings of the Future

Sat, 11 Apr 2026 00:00:00 +0000

Exhaustive Enumeration of Heteroaromatic Ring Systems

VEHICLe (Virtual Exploratory Heterocyclic Library) is a complete enumeration of all possible heteroaromatic ring systems under a set of constraints designed to capture the ring types most relevant to medicinal chemistry. The library contains 24,867 ring systems (23,895 after collapsing tautomers), yet only 1,701 of these have ever appeared in published compounds across databases totaling over 10 million molecules. The authors use this complete library to predict which unsynthesized ring systems could plausibly be made and to challenge organic chemists to conquer them.

Why Heteroaromatic Rings Matter for Drug Design

Heteroaromatic rings are central to synthetic bioactive small molecules for several reasons: they bind proteins efficiently through shape and hydrophobicity, their rigidity combined with heteroatom hydrogen bonding provides target selectivity, they support parallelizable coupling reactions (Suzuki, Stille) for rapid SAR exploration, multiple substitution positions can be explored without introducing stereocenters, and unusual ring systems or substitution patterns provide patent novelty. These advantages come with tradeoffs: low aqueous solubility, restricted SAR from rigidity, tendency toward molecular bloat during optimization, and difficulty achieving patent novelty with well-explored ring systems.

VEHICLe Construction

The library is built through a simple combinatorial pipeline implemented in Pipeline Pilot (Accelrys Software Inc.) that runs in about 3 minutes on a single-core 3 GHz Intel Xeon workstation:

Building blocks: Six atomic units (C, N, O, S variants with appropriate bond types) serve as starting materials.
Chain formation: Building blocks are combined into all possible chains of length 5 and 6 using two bond-forming rules (single and double bond).
Ring closure: Chains are closed into five- and six-membered rings using three closure rules. Only rings satisfying Hückel’s $4n + 2$ aromaticity rule are retained.
Ring fusion: Monocyclic rings are fused pairwise into all possible bicyclic combinations using four fusion rules. Aromatic bicycles are retained.

The enumeration constraints are: mono- and bicyclic rings only, five- and six-membered rings only, atoms restricted to C, N, O, S, and H, all neutral, all aromatic by Hückel’s rule, and only exocyclic carbonyls allowed. Including the carbonyl building block expands the library from 2,986 to 24,867 ring systems. Within this count, 1,744 tautomeric pairs exist in 772 clusters. Building blocks are input as MDL mol files, chains formed using MDL REACCS rxn format reactions, and duplicates removed by canonical SMILES comparison.
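
The aromaticity filter can be illustrated with a minimal electron-counting sketch. The atom labels and their pi-electron contributions below are textbook assumptions chosen for illustration; they are not the paper's six building blocks, and the original Pipeline Pilot protocol may count electrons differently.

```python
# Assumed pi-electron contributions per ring atom (illustrative labels only,
# not the building blocks used in the paper).
PI_ELECTRONS = {
    "c": 1,      # sp2 carbon taking part in a ring double bond
    "n": 1,      # pyridine-type nitrogen
    "[nH]": 2,   # pyrrole-type nitrogen, lone pair in the pi system
    "o": 2,      # furan-type oxygen
    "s": 2,      # thiophene-type sulfur
    "C(=O)": 0,  # ring carbon bearing an exocyclic carbonyl
}


def pi_electron_count(ring_atoms):
    """Sum the assumed pi-electron contributions around a ring."""
    return sum(PI_ELECTRONS[atom] for atom in ring_atoms)


def satisfies_huckel(ring_atoms):
    """Return True when the pi-electron count equals 4n + 2 for some n >= 0."""
    count = pi_electron_count(ring_atoms)
    return count >= 2 and (count - 2) % 4 == 0


benzene = ["c"] * 6
pyrrole = ["[nH]", "c", "c", "c", "c"]
pyridone = ["[nH]", "C(=O)", "c", "c", "c", "c"]
cyclobutadiene = ["c"] * 4

for name, ring in [("benzene", benzene), ("pyrrole", pyrrole),
                   ("2-pyridone", pyridone), ("cyclobutadiene", cyclobutadiene)]:
    print(name, pi_electron_count(ring), satisfies_huckel(ring))
```

Under these assumptions the first three rings carry six pi electrons and pass, while cyclobutadiene carries four and is rejected.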

The following table summarizes VEHICLe ring system coverage across the compound datasets used for analysis:

Dataset	Molecules	Distinct Ring Systems	VEHICLe Rings	VEHICLe %
Launched + Phases II/III	2,461	950	120	13%
Phase I	730	494	86	17%
Derwent patents	44,367	7,910	388	5%
Vendor catalogues	2,991,988	24,073	708	3%

Synthetic Tractability Prediction

Many VEHICLe ring systems are clearly impractical (e.g., rings composed almost entirely of nitrogen). To separate plausible candidates from outlandish ones, the authors train a random forest classifier using the NovoD ArborPharm decision tree software (NovoDynamics, Inc.) within Pipeline Pilot:

Features: ECFP_2 circular fingerprints (346 unique fragment types across VEHICLe), recording the presence or absence of each small substructure fragment per ring system
Training labels: “Good” (769 ring systems found in compound databases totaling 3M+ molecules) vs. “bad” (24,098 remaining)
Method: 100 trees using the Buja pure-bucket split method, optimized to minimize false negatives (GoodBias = 32, the ratio of bad to good examples). The PreserveMinority parameter was set to true, ensuring that training data selected for exclusion came exclusively from the “bad” class.
Tree depth: 200 layers, chosen by systematic variation (50 to 250 in steps of 50) showing diminishing returns beyond this depth
Node parameters: EnrichmentThreshold = 0.2 (if $\geq 20\%$ of molecules in a node are “good”, the whole node is classified as good); minimum bucket size = 10 molecules per node ($0.04\%$ of the dataset)
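
The numbers in the list above can be checked with a few lines of arithmetic. The `node_label` helper is a hypothetical stand-in for the enrichment rule as described here, not ArborPharm's actual interface.

```python
# Class sizes as reported in the summary above.
N_GOOD = 769
N_BAD = 24_098
N_TOTAL = N_GOOD + N_BAD  # 24,867 ring systems in VEHICLe


def node_label(n_good, n_total, enrichment_threshold=0.2):
    """Label a node 'good' when its good fraction reaches the threshold.

    Hypothetical helper mirroring the EnrichmentThreshold description.
    """
    return "good" if n_good / n_total >= enrichment_threshold else "bad"


bad_to_good_ratio = N_BAD / N_GOOD       # class imbalance behind GoodBias
min_bucket_fraction = 10 / N_TOTAL       # 10 molecules as a share of the dataset

print(f"bad:good ratio = {bad_to_good_ratio:.1f}")
print(f"minimum bucket = {min_bucket_fraction:.2%} of the dataset")
print(node_label(2, 10), node_label(1, 10))
```

The computed class ratio is about 31.3, slightly below the reported GoodBias of 32; the summary does not say how the value was rounded.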

The classifier produces a $p(\text{good})$ score for each ring system. All 769 known ring systems scored $p(\text{good}) > 0.9$. Of the unknown ring systems, 2,185 (9%) were predicted tractable ($p(\text{good}) > 0.5$).

Validation: 36 VEHICLe rings from UCB’s corporate collection (not in the training set) were all correctly classified as good ($p(\text{good}) \geq 0.95$). Against the Beilstein database, 663 of 2,185 predicted-good unknowns had at least one substructure hit (30% minimum true positive rate), compared to only 374 of 21,913 predicted-bad unknowns (2% false negative rate), a 15-fold improvement over random. Selecting only $p(\text{good}) = 1.0$ predictions raised this ratio to 56-fold.
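
The Beilstein enrichment can be recomputed from the counts quoted above. This is only a reading of the numbers in this summary: the 15-fold figure matches the ratio of the rounded percentages, while the unrounded hit rates give a somewhat larger ratio.

```python
# Counts quoted in the validation paragraph.
predicted_good, good_hits = 2_185, 663
predicted_bad, bad_hits = 21_913, 374

hit_rate_good = good_hits / predicted_good   # share of predicted-good with a hit
hit_rate_bad = bad_hits / predicted_bad      # share of predicted-bad with a hit

# Ratio from the unrounded rates, and from the whole-percent figures.
enrichment_exact = hit_rate_good / hit_rate_bad
enrichment_rounded = round(hit_rate_good * 100) / round(hit_rate_bad * 100)

print(f"{hit_rate_good:.1%} vs {hit_rate_bad:.1%}")
print(f"exact ratio {enrichment_exact:.1f}, rounded ratio {enrichment_rounded:.0f}")
```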

A final random forest incorporating Beilstein data predicted 3,288 unique unknown ring systems as tractable, with 232 having fewer than five heteroatoms and $p(\text{good}) > 0.95$. The authors manually selected 22 of these as “unconquered” challenges for synthetic chemists.

Ring System Usage Patterns

Analysis of ring system frequency across compound databases reveals striking concentration:

Phenyl dominance: 2% of ring systems (15 types) account for 90% of occurrences, with phenyl alone at 70%.
Heteroatom penalty: The significance of ring system usage drops sharply with increasing heteroatom count, quantified as:

$$ \text{significance}_{i,j} = \frac{\text{nobs}_{i,j} / \text{nobs}_{j}}{\text{ntot}_{i,j} / \text{ntot}_{j}} $$

where $i$ is the number of heteroatoms, $j$ is the compound set, $\text{nobs}$ is the frequency of observation, and $\text{ntot}$ is the total count in VEHICLe. Drug molecules in clinical trials show an even steeper drop-off than the broader compound set.

Frequency distribution: Ring system frequency does not follow Zipf’s power law across the full range. Only ring systems occurring fewer than 500 times follow a power-law distribution.
Publication rate decline: The rate of first publication of novel heteroaromatic ring systems peaked at about 41 per year in the late 1970s and declined to 5-10 per year by the early 2000s.
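
The significance ratio defined above can be sketched as a small function. The counts in the example call are invented placeholders, not values from the paper.

```python
def significance(nobs_ij, nobs_j, ntot_ij, ntot_j):
    """Observed share of a heteroatom class divided by its share of the library.

    nobs_ij: observations of ring systems with i heteroatoms in compound set j
    nobs_j:  all ring-system observations in compound set j
    ntot_ij: library ring systems with i heteroatoms (indexed as in the formula)
    ntot_j:  library total used as the denominator for set j
    """
    observed_share = nobs_ij / nobs_j
    available_share = ntot_ij / ntot_j
    return observed_share / available_share


# Placeholder counts: a class holding 20% of the library but only 5% of
# the observations is under-represented by a factor of four.
example = significance(nobs_ij=50, nobs_j=1_000, ntot_ij=5_000, ntot_j=25_000)
print(round(example, 3))
```

A value above 1 means the class is used more often than its share of the library would suggest; a value below 1 means it is under-used.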

The concentration likely reflects the “principle of least effort,” the phylogenetic nature of drug discovery, and conservative risk management in pharma, rather than inherent unsuitability of the unused ring systems.

Reproducibility Details

The enumeration method is fully described and could be reimplemented, but the original implementation relies on proprietary software. The random forest model also uses proprietary tools but is specified in sufficient detail for reproduction with open-source alternatives.

Artifact	Type	License	Notes
VEHICLe on Wolfram Data Repository	Dataset	Unknown	24,867 ring systems with 16 properties each

Software dependencies: Pipeline Pilot (Accelrys Software Inc.) for enumeration; NovoD ArborPharm (NovoDynamics, Inc.) for decision trees. Both are proprietary.
Hardware: 3 GHz Intel Xeon workstation (enumeration completes in ~3 minutes).
Missing components: Original Pipeline Pilot protocols and rxn files are not publicly released. ECFP_2 fingerprints used a proprietary Accelrys implementation, though open-source equivalents (RDKit Morgan fingerprints with radius 1) exist.
Reproducibility status: Partially Reproducible. The VEHICLe library itself is publicly available, and the method is described in sufficient detail for reimplementation with modern open-source tools, but the original code and protocols are not released.

Paper Information

Journal: Journal of Medicinal Chemistry, Vol. 52, No. 9, pp. 2952-2963
Published: April 6, 2009

@article{pitt2009heteroaromatic,
  title={Heteroaromatic Rings of the Future},
  author={Pitt, William R. and Parry, David M. and Perry, Benjamin G. and Groom, Colin R.},
  journal={Journal of Medicinal Chemistry},
  volume={52},
  number={9},
  pages={2952--2963},
  year={2009},
  publisher={American Chemical Society},
  doi={10.1021/jm801513z}
}

Surge: Fastest Open-Source Chemical Graph Generator

Sat, 11 Apr 2026 00:00:00 +0000

A Three-Stage Canonical Generation Path

Surge is an open-source constitutional isomer generator that enumerates all possible molecular structures for a given molecular formula. It is built on the nauty package for graph automorphism computation and uses a three-stage canonical generation path method that decomposes the enumeration problem into progressively refined graph operations. Surge outperforms the previous state-of-the-art (MOLGEN 5.0) by orders of magnitude in speed while running in under 5 MB of RAM regardless of molecule size.

Motivation: The Need for Fast, Open Structure Generators

Chemical structure generators are essential for computer-assisted structure elucidation (CASE), virtual library creation, and chemical space enumeration (e.g., GDB-17’s 166.4 billion molecules). MOLGEN had been the gold standard for decades but is closed-source. The previous best open-source alternative, MAYGEN, was roughly 3x slower than MOLGEN. Reymond’s lab used an in-house nauty-based generator for GDB-17 but did not release it publicly. Surge fills this gap as a fast, open-source, and extensible alternative.

The Three-Stage Algorithm

Given a molecular formula (e.g., $\text{C}_9\text{H}_{18}\text{N}_2\text{O}_4$), Surge proceeds through three stages:

Stage 1 (geng): Simple graph generation. Computes all connected simple graphs with the appropriate number of non-hydrogen atoms and edges, subject to maximum degree constraints from the molecular formula. These graphs represent molecular topologies without atom types or bond orders. For Lysopine ($\text{C}_9\text{H}_{18}\text{N}_2\text{O}_4$), this produces 534,493 graphs in 1.3 seconds.

Stage 2 (vcolg): Vertex coloring (atom assignment). Assigns element types (C, N, O, S, etc.) to vertices in all distinct ways, using the automorphism group of each simple graph to avoid generating equivalent assignments. Given a fixed ordering of elements (e.g., $\text{C} < \text{O} < \text{S}$), element assignments are represented as lists $L$ and compared lexicographically. Exactly one representative from each equivalence class is selected by computing the canonical (lexicographically maximal) list:

$$ \text{canon}(L) = \max\{\gamma(L) \mid \gamma \in \text{Aut}(G)\} $$

A list $L$ is accepted if and only if $\text{canon}(L) = L$, i.e., no automorphism produces a lexicographically larger list. For Lysopine, this expands to 3.0 billion vertex-labeled graphs in 90 seconds.
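
The acceptance rule can be illustrated with a brute-force sketch on a three-atom path. Surge obtains automorphisms from nauty; the exhaustive permutation search below is a stand-in that only works for tiny graphs, and the function names are illustrative.

```python
from itertools import permutations, product


def automorphisms(n, edges):
    """All vertex permutations that map every edge onto an edge (brute force)."""
    edge_set = {frozenset(e) for e in edges}
    return [p for p in permutations(range(n))
            if all(frozenset((p[u], p[v])) in edge_set for u, v in edges)]


def is_canonical(labels, autos):
    """Accept labels only if no automorphism yields a lexicographically larger list."""
    n = len(labels)
    return all(tuple(labels[p[i]] for i in range(n)) <= labels for p in autos)


def canonical_colorings(n, edges, ranks):
    """One representative element assignment per equivalence class."""
    autos = automorphisms(n, edges)
    return [labels for labels in product(ranks, repeat=n)
            if is_canonical(labels, autos)]


RANK = {"C": 0, "O": 1}          # fixed element ordering C < O
path3 = [(0, 1), (1, 2)]          # three-atom chain 0-1-2

all_co = canonical_colorings(3, path3, sorted(RANK.values()))
c2o = [labels for labels in all_co if sorted(labels) == [0, 0, 1]]
print(len(all_co), c2o)
```

The chain has one non-trivial symmetry (swapping its ends), so the eight raw assignments collapse to six classes. Restricting to two carbons and one oxygen leaves two: oxygen in the middle or oxygen at an end.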

Stage 3 (multig): Edge multiplicity (bond orders). Assigns bond multiplicities (single, double, triple) to edges, again using automorphism group factorization to avoid duplicates. For Lysopine, this produces 6.0 billion completed molecules in an additional 100 seconds.

Efficient Automorphism Handling via Group Factorization

The key algorithmic innovation is the factorization of the automorphism group:

$$ \text{Aut}(G) = NM = \{\gamma\delta \mid \gamma \in N,\; \delta \in M\} $$

where $M$ is the “minor subgroup” generated by transpositions of leaves sharing a common neighbor (“flowers”), and $N$ is a complete set of coset representatives computed by nauty. A flower is a maximal set of degree-1 vertices (leaves) with the same neighbor. The minor subgroup $M$ is normal in $\text{Aut}(G)$, making the factorization well-defined.

Theorem. A list $L$ satisfies $L = \text{canon}(L)$ if and only if $L = \max\{\delta(L) \mid \delta \in M\}$ and $L = \max\{\gamma(L) \mid \gamma \in N\}$.

This factorization enables efficient canonicity testing. Maximality under $M$ reduces to enforcing decreasing element order within each flower (simple inequality constraints during recursive assignment). Maximality under $N$ requires explicit testing against the $N$ generators, but $N$ is trivial (identity only) 58% of the time in Stage 2 and 98% of the time in Stage 3.

Benchmark Results

Benchmarked against MOLGEN 5.0 on 30 natural product molecular formulas from the COCONUT database on a compute-optimized c2-standard-4 Google Cloud VM, Surge achieves 7-22 million molecules per second with a memory footprint of at most 5 MB regardless of molecule size. Representative results:

Formula	Isomers	Surge (s)	MOLGEN (s)	Speedup
$\text{C}_{10}\text{H}_{16}\text{O}_5$	1.1B	69	5,146	75x
$\text{C}_9\text{H}_{18}\text{N}_2\text{O}_4$	6.0B	289	27,250	94x
$\text{C}_{11}\text{H}_{12}\text{O}_4$	31.6B	2,179	181,725	83x
$\text{C}_{10}\text{H}_{13}\text{NO}_5$	552B	54,372	6,325,646	116x
$\text{C}_{10}\text{H}_{10}\text{N}_2\text{O}_3$	1.5T	83,186	8,292,585	100x
$\text{C}_9\text{H}_{12}\text{N}_2\text{O}_5$	1.8T	180,727	13,983,652	77x

MOLGEN hit its built-in limit of $2^{31} - 1$ structures for most formulas; reported times were linearly extrapolated. Both generators were instructed to generate but not output structures. MOLGEN was run with -noaromaticity for fair comparison since Surge v1.0 lacks aromaticity detection.
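
The speedup and throughput figures follow directly from the table. The sketch below recomputes them and shows one way the linear extrapolation for capped MOLGEN runs could be written; `extrapolate_time` is a hypothetical helper, and the isomer counts are the rounded table values, so throughputs are approximate.

```python
# (formula, isomers, Surge seconds, MOLGEN seconds) from the table above.
ROWS = [
    ("C10H16O5", 1.1e9, 69, 5_146),
    ("C9H18N2O4", 6.0e9, 289, 27_250),
    ("C11H12O4", 31.6e9, 2_179, 181_725),
    ("C10H13NO5", 552e9, 54_372, 6_325_646),
    ("C10H10N2O3", 1.5e12, 83_186, 8_292_585),
    ("C9H12N2O5", 1.8e12, 180_727, 13_983_652),
]

MOLGEN_LIMIT = 2**31 - 1  # structure cap mentioned in the text


def speedup(surge_s, molgen_s):
    """How many times faster Surge is than MOLGEN."""
    return molgen_s / surge_s


def throughput(isomers, surge_s):
    """Structures generated per second by Surge."""
    return isomers / surge_s


def extrapolate_time(seconds_at_limit, total_structures, limit=MOLGEN_LIMIT):
    """Scale the time measured at the cap linearly to the full isomer count."""
    return seconds_at_limit * total_structures / limit


for formula, isomers, surge_s, molgen_s in ROWS:
    print(formula, round(speedup(surge_s, molgen_s)),
          f"{throughput(isomers, surge_s) / 1e6:.1f}M/s")
```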

Surge supports output in both SDfile and SMILES formats. SMILES output is produced efficiently by constructing a template for each simple graph at Stage 1, so that only atom types and bond multiplicities must be filled in before output.

Surge also supports built-in filters applied during generation (more efficient than post-hoc filtering):

-p0:1: at most one cycle of length 5
-P: the molecule must be planar
-B5: no atom has two double bonds and otherwise only hydrogen neighbors
-B9: no atom lies on more than one cycle of length 3 or 4

These filter options are inspired by corresponding features in MOLGEN. Surge’s open-source design also supports a plugin mechanism: users can write small code snippets to insert custom filters into any of the three stages, enabling efficient pruning of the generation tree.

Limitations

Version 1.0 does not perform Hückel aromaticity detection, so Kekulé structures of the same aromatic ring, which are distinct as graphs, are generated as duplicates.
Benchmarking against MOLGEN required disabling MOLGEN’s aromaticity detection (-noaromaticity) for fair comparison.
Written in C and built on the nauty suite, which limits accessibility compared to Python-based tools, though this is also the source of its speed.

Reproducibility Details

Artifact	Type	License	Notes
Surge on GitHub	Code	Apache 2.0	Official C implementation from the nauty suite
Surge project page	Other	Apache 2.0	Project homepage with documentation and binaries

Status: Highly Reproducible. Source code, build instructions, and benchmark formulas are all publicly available.
Hardware: Benchmarks used a compute-optimized c2-standard-4 Google Cloud VM. Surge runs in at most 5 MB of RAM regardless of molecule size.
Build: Standard Unix Configure/Make scheme producing a standalone command-line executable. Written in portable C from the nauty suite.
Dependencies: Requires the nauty package (bundled).

Paper Information

Published: Journal of Cheminformatics, Volume 14, Article 24, April 23, 2022
Preprint: ChemRxiv, December 7, 2021
License: Apache 2.0 (software), Open Access (paper)

@article{mckay2022surge,
  title={Surge: a fast open-source chemical graph generator},
  author={McKay, Brendan D. and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={14},
  number={1},
  pages={24},
  year={2022},
  publisher={BioMed Central},
  doi={10.1186/s13321-022-00604-9}
}

Molecular Complexity from the GDB Chemical Space

Sat, 11 Apr 2026 00:00:00 +0000

Molecular Complexity as Branching in the Molecular Graph

This paper proposes two simple, interpretable measures of molecular complexity grounded in the observation that most GDB-enumerated molecules are synthetically challenging despite containing only standard functional groups and ring systems. The core insight is that branching points (non-divalent nodes) in the molecular graph correspond to synthesis difficulty: each additional branching point implies a new ring or substituent requiring extra synthetic steps, possible protecting groups, potential stereogenic centers, and increased steric hindrance.

Motivation: Why Most GDB Molecules Are Hard to Make

The Generated DataBases (GDBs) enumerate billions of hypothetical small organic molecules by exhaustively substituting atoms and bonds in mathematical graphs. Despite applying filters for ring strain, functional group diversity, fragment-likeness, drug-likeness, and ChEMBL-likeness, most enumerated molecules remain daunting to synthesize. Even in the most restrictive subset (GDB-13s, 99.4 million molecules from the 977 million in GDB-13), practical synthesis remains challenging for most entries. This motivated the search for a complexity measure that captures why these molecules are hard, without relying on reaction databases or machine learning.

MC1 and MC2: Two Graph-Based Complexity Measures

The two proposed measures are:

MC1 (size-independent): the fraction of non-divalent nodes in the molecular graph.

$$ \text{MC1} = 1 - \text{FDV} $$

where FDV is the fraction of divalent nodes (e.g., $-\text{CH}_2-$, $=\text{CH}-$, $=\text{C}=$, $-\text{O}-$, $-\text{NH}-$, $=\text{N}-$, $-\text{S}-$) in the molecular graph. The graph is computed by treating the molecule as if all bonds were single and all heavy atoms were carbon. MC1 is independent of molecule size, making it useful for comparing molecules of different sizes.

MC2 (size-dependent): the count of non-divalent nodes, excluding carbonyl carbons in standard carboxyl derivatives.

$$ \text{MC2} = \text{NDV} $$

where NDV is the number of non-divalent nodes, not counting $\text{C}{=}\text{O}$ in $(\text{X}-\text{C}{=}\text{O})$ for $\text{X} = \text{N}$ or $\text{O}$ (acids, esters, amides, carbonates, carbamates, ureas). MC2 grows with molecule size only when branching increases. Linear extensions (adding divalent atoms to chains or enlarging rings) do not increase MC2.

The rationale for excluding carboxyl groups from MC2 is that their chemistry (amide bond formation, esterification) is well-established and straightforward. Functional groups like amidines, guanidines, thioesters, thiones, sulfoxides, sulfinates, sulfones, and sulfonamides, as well as phosphorus-containing groups, are still counted because their synthesis is less routine.
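
Both measures reduce to counting node degrees in the heavy-atom graph. The sketch below uses plain edge lists rather than the authors' Python implementation; the `excluded` argument is a hypothetical way to pass in carbonyl carbons of carboxyl derivatives, which the caller must identify.

```python
def degrees(n_atoms, bonds):
    """Heavy-atom degree of each node; every bond counts once, whatever its order."""
    deg = [0] * n_atoms
    for a, b in bonds:
        deg[a] += 1
        deg[b] += 1
    return deg


def mc1(n_atoms, bonds):
    """Fraction of non-divalent nodes (1 - FDV)."""
    return sum(d != 2 for d in degrees(n_atoms, bonds)) / n_atoms


def mc2(n_atoms, bonds, excluded=frozenset()):
    """Count of non-divalent nodes, skipping caller-supplied excluded atoms."""
    return sum(1 for i, d in enumerate(degrees(n_atoms, bonds))
               if d != 2 and i not in excluded)


# tert-butanol: central C (0), three CH3 (1-3), OH (4)
tert_butanol = [(0, 1), (0, 2), (0, 3), (0, 4)]
# cyclohexane: six-membered ring
cyclohexane = [(i, (i + 1) % 6) for i in range(6)]
# ethyl acetate: CH3(0)-C(1)(=O(2))-O(3)-CH2(4)-CH3(5)
ethyl_acetate = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]

print(mc1(5, tert_butanol), mc1(6, cyclohexane))
print(mc2(6, ethyl_acetate), mc2(6, ethyl_acetate, excluded={1}))
```

Terminal atoms count as non-divalent here, which is what makes tert-butanol score 1. Excluding the ester carbonyl carbon lowers the ethyl acetate count from 4 to 3.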

Design Choices and Limitations

MC1 and MC2 deliberately do not distinguish between $\text{sp}^2$ and $\text{sp}^3$ branching points or count chiral centers. This choice is motivated by the observation that unusual substitution patterns on aromatic rings in GDB molecules are also synthetically difficult, and that functionalization of aromatic/heteroaromatic rings and control of atropisomerism in biaryls are both challenging. A consequence is that carbohydrates and polyphenols receive high complexity scores despite being abundant in biomass.

MC1 gives uninformative values for very small molecules (trifluoroacetic acid and tert-butanol both score $\text{MC1} = 1$) and for polymers (where the repeating unit dominates). MC2 similarly cannot give useful values for polymers due to its size dependence.

Comparison with Existing Complexity Measures

The authors compare MC1 and MC2 against six molecular complexity scores and two synthetic accessibility scores across four databases: GDB-13s, ZINC, ChEMBL, and COCONUT.

Measure	Category	Description
FCFP4	Complexity	Number of on-bits in a binary 2048-bit FCFP4 fingerprint
DataWarrior	Complexity	Fractal complexity via Minkowski-Bouligand (box-counting) dimension of distinct substructures up to 7 bonds
Böttcher	Complexity	Shannon entropy using additive atom contributions (valence electrons, atom environment, chirality, symmetry)
Proudfoot	Complexity	Shannon entropy using additive atom contributions (atomic number, connections, paths up to length 2)
SPS/nSPS	Complexity	Spacial score summing heavy atom contributions (hybridization, stereochemistry, nonaromaticity, neighbor count); nSPS normalizes by HAC
SAscore	Synthesizability	Fragment frequency from PubChem combined with complexity penalty (ring types, stereochemistry, size)
SCS	Synthesizability	Machine-learned score from 12 million Reaxys reactions predicting synthesis steps from ECFP4 fingerprint (max value 5)

Key findings from the correlation analysis:

For GDB-13s (where nearly all molecules have HAC = 13), complexity measures generally do not correlate with each other ($r^2 < 0.6$), except MC1 with MC2 and SPS with nSPS (expected, since each pair differs only in size normalization).
For ZINC, ChEMBL, and COCONUT (spanning a broad range of molecular sizes), several complexity measures correlate with heavy atom count (HAC) and therefore with each other.
Size-independent measures (DataWarrior, nSPS, SCS, SAscore, MC1) are unaffected by molecule size across datasets, while Böttcher and Proudfoot scores are strongly size-dependent. FCFP4 and SPS show partial size dependence.
SPS and nSPS also correlate with SAscore.

The analysis is supported by interactive TMAP visualizations (tree-maps organized by MAP4C molecular fingerprint similarity) for 30,000 random molecules from each database, color-coded by each complexity measure. The interactive TMAPs are available online for GDB-13s, ZINC, ChEMBL, and COCONUT.

Reproducibility Details

Artifact	Type	License	Notes
Molecular_Complexity	Code	MIT	Python implementation of MC1, MC2, and eight comparison metrics with Jupyter notebooks

The paper is open access (hybrid). The GitHub repository provides Python code for computing MC1 and MC2 along with Jupyter notebooks demonstrating all ten complexity and synthesizability measures from Table 1. The four databases used (GDB-13s, ZINC, ChEMBL, COCONUT) are all publicly available. No model training or specialized hardware is involved, as MC1 and MC2 are deterministic graph computations.

Reproducibility status: Highly Reproducible.

Paper Information

Journal: Journal of Chemical Information and Modeling, Vol. 65, No. 16, pp. 8405-8410
Published: May 15, 2025
Part of: Special issue “Chemical Compound Space Exploration by Multiscale High-Throughput Screening and Machine Learning”

@article{buehler2025view,
  title={A View on Molecular Complexity from the GDB Chemical Space},
  author={Buehler, Ye and Reymond, Jean-Louis},
  journal={Journal of Chemical Information and Modeling},
  volume={65},
  number={16},
  pages={8405--8410},
  year={2025},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.5c00334}
}

CHX8: Complete Eight-Carbon Hydrocarbon Space

Sat, 11 Apr 2026 00:00:00 +0000

Exhaustive Hydrocarbon Enumeration Without Exclusion Filters

CHX8 is the first dataset to fully enumerate all closed-shell hydrocarbons with up to eight carbon atoms, deliberately including strained, anti-Bredt, and unconventional architectures that prior enumerations (e.g., GDB-13, GDB-17) excluded. Of 77,524 enumerated structures, 31,497 are stable under DFT optimization, covering 16x more C8 hydrocarbons than GDB-13. A universal relative strain energy (RSE) metric provides a quantitative synthesizability proxy for every molecule.

Motivation: Strained Scaffolds Are No Longer Inaccessible

GDB-series databases applied strict filters during enumeration, excluding highly strained polycyclic systems, cyclic allenes, anti-Bredt frameworks, and other “unconventional” motifs. Recent synthetic advances have shown that many of these structures can be accessed and exploited: 3D strained bioisosteres improve pharmacokinetic properties, cyclic allenes enable rapid construction of complex skeletons, and anti-Bredt olefins can be generated and trapped stereospecifically. CHX8 deliberately retains all of these motifs to provide a future-proofed database that remains relevant as synthetic capabilities expand.

Enumeration and Optimization

CHX8-enum (77,524 structures): All mathematically feasible hydrocarbons generated by exhaustively enumerating saturated carbon frameworks using the GENG tool from the nauty graph-isomorphism package (all 1-to-8-node connected graphs with 1-4 edges per node), then converting graphs to 3D coordinates via OpenBabel’s --Gen3D with the MMFF94 force field. Unsaturations (double bonds, triple bonds, allenes) were introduced iteratively in all valid positions by identifying C-C bonds flanked by hydrogen atoms (SMARTS: [#1]~[#6]~[#6]~[#1]), removing H atoms, and incrementing bond order. Point diastereoisomers and E/Z isomers were generated by manipulating InChI chiral layers. Duplicate detection relied on canonical InChI strings; residual duplicates account for no more than 1.5% of CHX8.

HAC	Graphs	Saturated	Unsaturated	CHX8-enum	CHX8 (stable)
1	1	1	0	1	1
2	1	1	2	3	3
3	2	2	7	9	8
4	6	7	31	38	30
5	21	25	138	163	117
6	78	114	753	867	522
7	353	746	4,939	5,685	2,917
8	1,929	12,903	57,856	70,758	27,899
Total	2,391	13,799	63,726	77,524	31,497
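
The first entries of the Graphs column can be reproduced by brute force. The sketch below is a slow stand-in for GENG, not the authors' pipeline: it tries every edge subset, keeps connected graphs with at most four edges per node, and removes isomorphic copies by minimizing over all relabelings.

```python
from itertools import combinations, permutations


def is_connected(n, edges):
    """Depth-first search from node 0 reaches every node."""
    adjacency = {v: set() for v in range(n)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adjacency[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


def canonical_form(n, edges):
    """Smallest sorted edge list over all relabelings (isomorphism invariant)."""
    return min(
        tuple(sorted(tuple(sorted((p[u], p[v]))) for u, v in edges))
        for p in permutations(range(n))
    )


def count_skeleton_graphs(n, max_degree=4):
    """Count connected simple graphs on n nodes with bounded degree."""
    pairs = list(combinations(range(n), 2))
    found = set()
    for mask in range(1 << len(pairs)):
        edges = [pair for k, pair in enumerate(pairs) if mask >> k & 1]
        degree = [0] * n
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        if max(degree) <= max_degree and is_connected(n, edges):
            found.add(canonical_form(n, edges))
    return len(found)


print([count_skeleton_graphs(n) for n in range(1, 6)])
```

The cost grows factorially with the number of nodes, so this approach is only practical for the smallest rows of the table.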

DFT optimization: All structures were geometry-optimized at the PBE0-D4/def2-TZVP level of theory. 66.5% of structures converged after a single optimization; the remainder required one or two additional passes. 59% of CHX8-enum structures underwent $\sigma$-framework rearrangements during optimization and were classified as unstable. Rearranged structures were identified by comparing input and output InChI strings. Analysis confirmed that all rearrangement products (closed-shell, zwitterionic, or carbene species) were already present in the enumeration, so no new compounds were missed.

Relative Strain Energy as a Synthesizability Proxy

A universal RSE metric, referenced to cyclohexane (zero strain), was developed and assigned to every molecule. The RSE for a molecule of interest (subscript $n$) relative to a reference structure (subscript $r$) is:

$$ \text{RSE} = E_{n} - E_{r} - (c_{n} - c_{r})\,E_{\text{CH}_2} + E_{\text{unsat}} $$

where $E_{n}$ and $E_{r}$ are Gibbs energies, $c_{n}$ and $c_{r}$ are carbon counts, $E_{\text{CH}_2}$ is the average energy cost of adding an unstrained CH$_2$ unit, computed from the Gibbs energy differences between consecutive linear alkanes (ethane through octane, six increments), and $E_{\text{unsat}}$ corrects for differences in unsaturation:

$$ E_{\text{unsat}} = (r_{n} - r_{r})\,E_{\text{ring}} + (d_{n} - d_{r})\,E_{\text{double}} + (t_{n} - t_{r})\,E_{\text{triple}} $$

$E_{\text{double}}$ and $E_{\text{triple}}$ are each derived from internal transformations between the second and third carbon of linear chains, averaged over four chain lengths (n-butane through n-octane). Initial attempts using terminal unsaturations systematically underestimated RSE for structures containing double and triple bonds. $E_{\text{ring}}$ is derived separately using the Dudev-Lim homolytic bond dissociation approach:

$$ E_{\text{ring}} = 2E_{\text{C-H}} - E_{\text{C-C}} $$

where the individual bond energies are obtained from ethane:

$$ E_{\text{C-H}} = E_{\text{ethane}} - E_{\text{ethyl radical}}, \quad E_{\text{C-C}} = E_{\text{ethane}} - 2E_{\text{methyl radical}} $$
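
The RSE expressions above translate directly into code. The sketch below follows the formulas as written in this summary; the function names are illustrative, and the numbers in the example calls are arbitrary placeholders rather than computed energies.

```python
def ring_increment(e_ethane, e_ethyl_radical, e_methyl_radical):
    """E_ring = 2 E_C-H - E_C-C, with both bond energies taken from ethane."""
    e_ch = e_ethane - e_ethyl_radical
    e_cc = e_ethane - 2 * e_methyl_radical
    return 2 * e_ch - e_cc


def unsaturation_correction(rings, doubles, triples, e_ring, e_double, e_triple):
    """E_unsat; each count argument is a (molecule, reference) pair."""
    (r_n, r_r), (d_n, d_r), (t_n, t_r) = rings, doubles, triples
    return ((r_n - r_r) * e_ring
            + (d_n - d_r) * e_double
            + (t_n - t_r) * e_triple)


def relative_strain_energy(e_n, e_r, c_n, c_r, e_ch2, e_unsat):
    """RSE = E_n - E_r - (c_n - c_r) E_CH2 + E_unsat."""
    return e_n - e_r - (c_n - c_r) * e_ch2 + e_unsat


# Placeholder numbers only: a molecule compared with itself has zero RSE.
same = relative_strain_energy(e_n=5, e_r=5, c_n=6, c_r=6, e_ch2=2, e_unsat=0)
other = relative_strain_energy(e_n=-100, e_r=-120, c_n=3, c_r=6, e_ch2=-10, e_unsat=4)
print(same, other)
```

All energies must share one unit (the summary reports kcal/mol), and cyclohexane supplies the reference values in the paper's scheme.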

The highest-RSE molecule with synthetic precedent (a C6 structure detected by atomic force microscopy on a metal surface) has an RSE of 201.4 kcal/mol. Using this as a threshold, over 90% of the novel structures in CHX8 should be considered synthetically accessible in principle.

Notable reference points on the RSE scale:

Cyclopropane: 27.5 kcal/mol
Tetrahedrane: 140.1 kcal/mol (substituted variants synthesized, unsubstituted not yet)
Cubane: 157.4 kcal/mol (synthesized)
Highest synthesized: 201.4 kcal/mol (C6 structure on metal surface)

Key Findings on Strained Motifs

The exhaustive enumeration enables systematic analysis of structural classes previously excluded:

Trans-cycloalkenes: All trans-cycloalkenes in 6-membered rings or larger should be synthetically feasible. The stability of multi-trans systems depends on the relative position of double bonds: parallel trans-double bonds in a ring can undergo thermally accessible 4$\pi$-electrocyclisation, while non-parallel arrangements may be conformationally locked and stable.
Cyclic alkynes and allenes: 37% of the CHX8 dataset consists of cyclic alkynes or allenes. All cyclic alkynes except cyclopropyne, and all cyclic allenes, should be synthesizable (in singlet or triplet states), with RSE values below cubane.
Trans-fused rings: All but [3,3]- and [3,4]-unsubstituted trans-fused rings should be accessible. The proposed lower limit for trans-ring junctions is either (i) a 3-membered ring trans-fused to a ring of five or more atoms, or (ii) a 4-membered ring trans-fused to another 4-membered ring.
Anti-Bredt structures: CHX8 contains seven hydrocarbon skeletons with a bridging section, yielding fourteen possible anti-Bredt (bridgehead-unsaturated) derivatives. Of these, thirteen are stable under DFT optimization, and over 200 substituted anti-Bredt structures are present in the dataset. All stable anti-Bredt structures have RSE values below cubane. Stability is classified using Fawcett’s S parameter (the number of non-bridgehead ring atoms): in CHX8, structures with S $\geq$ 4 are stable to optimization, consistent with recent experimental work that has accessed anti-Bredt intermediates at S values as low as 4.

Comparison to Existing Databases

vs. GDB-13: CHX8 contains 31,497 C1-C8 hydrocarbons vs. 1,966 in GDB-13 (16x more). For C8 hydrocarbons specifically, GDB-13 has more coverage than GDB-17 (1,966 vs. 1,121). All GDB-13 hydrocarbons appear in CHX8-enum, though some were unstable to DFT optimization.
vs. VQM24: For C1-C5 hydrocarbons, VQM24 contains 123 closed-shell isomers vs. 154 in CHX8 (14-25% more). Many missing structures in VQM24 are diastereoisomers not generated by the SURGE process.
vs. PubChem: Fewer than 44% of CHX8 structures appear in PubChem.
vs. Reaxys: Only 25% of CHX7 (up to 7 carbons) structures are commercially available.

Reproducibility Details

The enumeration pipeline uses open-source tools: GENG from the nauty package for graph generation, RDKit for molecular manipulation and InChI canonicalization, and OpenBabel for 3D coordinate generation (MMFF94). DFT calculations used the PBE0-D4/def2-TZVP level of theory via the ORCA quantum chemistry package. The paper does not report total compute time or hardware specifications.

Artifact	Type	License	Notes
CHX8 Dataset (Nottingham Repository)	Dataset	Unknown	All optimized 3D structures, optimization/frequency output files, organized into CHX7, CHX8-sat, and CHX8-unsat subsets

Missing components for full reproduction: No source code for the enumeration or unsaturation-introduction scripts is released. The RSE calculation scripts and DFT input templates are not provided. Hardware/compute requirements are not reported.

Reproducibility status: Partially Reproducible. The dataset itself is deposited, but the enumeration and analysis code is not released.

Paper Information

Preprint: ChemRxiv, January 2, 2026

@article{harman2026complete,
  title={Complete Computational Exploration of Eight-Carbon Hydrocarbon Chemical Space},
  author={Harman, Stephen J. and Ermanis, Kristaps},
  journal={ChemRxiv},
  year={2026},
  doi={10.26434/chemrxiv-2026-qjr5r}
}

AllChem: Generating and Searching 10^20 Structures

Sat, 11 Apr 2026 00:00:00 +0000

Combinatorial Synthon Assembly at Scale

AllChem is a computer-aided molecular design system that generates and searches an unprecedentedly large space of synthetically accessible structures (on the order of $10^{20}$). Rather than enumerating molecules from mathematical graphs (as in the GDB databases), AllChem builds its chemical space from real synthetic chemistry: it recursively applies known reactions to commercial building blocks, producing synthons (structures with open valences of defined reactivity) that combinatorially assemble into complete molecules. Every structure found by a search comes paired with a proposed synthetic route.

Motivation: Costs and Benefits Together

Most computer-aided molecular design methods focus on predicting biological activity (the benefit) while leaving synthesis feasibility (the cost) to the laboratory chemist. AllChem addresses both simultaneously. Its predecessor, ChemSpace, accessed $\sim 10^{14}$ structures built from simple combinatorial libraries (chemist-proposed scaffolds plus commercial side chains), but only about 5% of structures in the medicinal chemistry literature fit that template. AllChem aims to cover roughly 50% of published structures by allowing multi-step synthon generation that produces more complex, non-trivial scaffolds.

The gensyn Synthon Generator

The core component is gensyn, a program that recursively applies a curated set of approximately 100 reactions to approximately 7,000 commercially available building blocks. Each product becomes a new building block for subsequent reaction steps, with recursion bounded primarily by a cumulative synthesis “cost” limit (roughly five AllChem-type steps per sequence). Structures bearing open valences are collected as synthons. A typical run produces around $5 \times 10^6$ synthons, which combinatorially represent $(5 \times 10^6)^3 \approx 10^{20}$ complete structures with an A-B-C topology.

Key design decisions in gensyn:

Reaction curation: All reactions come from external human-readable text files, based on reactions already practiced by laboratory chemists. Scope constraints are calibrated so that at least 90% of randomly sampled reaction applications appear unchallengeable to synthetic chemists.
Reactive intermediates: Explicitly represented. For example, amide formation requires three steps: acid chloride to electrophilic synthon, amine to nucleophilic synthon, then coupling.
Protective groups: Addition and removal are treated as standard reactions.
Concerted cyclizations: Represented by splitting the ring formation across two complementary synthons with specially labeled open valences.
Bimolecular reactions: In addition to unimolecular transformations, gensyn performs reactions that combine selected synthons with other synthons, increasing overall structural diversity.
Constraints: Maximum of one prochiral center (to avoid diastereomeric mixtures), heavy atom count limits for lead-likeness, and a cumulative cost bound on synthetic routes. Each reaction step has a default cost of $-5$, and the maximum allowed cumulative cost is $-25$ (roughly five steps per sequence).
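
The cost-bounded recursion can be illustrated with a toy sketch. The step cost of $-5$ and the bound of $-25$ come from the text; everything else (the string "structures", the two placeholder reactions, and the function names) is hypothetical and stands in for real chemistry that the paper does not publish.

```python
from collections import deque

STEP_COST = -5    # default cost per reaction step (from the text)
COST_LIMIT = -25  # most negative cumulative cost allowed (from the text)


# Hypothetical toy "reactions": each maps a structure label to a product or None.
def activate(s):
    """Attach an open-valence marker '*' if the structure has none yet."""
    return s + "*" if not s.endswith("*") else None


def extend(s):
    """Prepend a carbon; the length cap only keeps the toy example small."""
    return "C" + s if len(s) < 12 else None


REACTIONS = [("activate", activate, STEP_COST), ("extend", extend, STEP_COST)]


def generate(building_blocks, reactions=REACTIONS, cost_limit=COST_LIMIT):
    """Apply reactions breadth-first until the cumulative cost bound is reached.

    Returns {structure: (cumulative_cost, route)} for every structure reached.
    """
    seen = {b: (0, []) for b in building_blocks}
    queue = deque(building_blocks)
    while queue:
        s = queue.popleft()
        cost, route = seen[s]
        for name, rxn, step_cost in reactions:
            new_cost = cost + step_cost
            if new_cost < cost_limit:  # costs are negative: past the bound
                continue
            product = rxn(s)
            if product is None or product in seen:
                continue
            seen[product] = (new_cost, route + [name])
            queue.append(product)  # every product is a new building block
    return seen


found = generate(["N", "O"])
# Structures carrying an open valence are collected as synthons.
synthons = {s: v for s, v in found.items() if s.endswith("*")}
```

Because every step carries the same cost here, the bound is equivalent to a depth limit of five steps; with per-reaction costs the same loop would allow longer routes through cheap steps and shorter ones through expensive steps.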

Reaction Description Language

Reactions are described using an extension of Sybyl Line Notation (SLN), a SMILES-like notation. Each reaction description specifies the structural pattern required in the substrate, the transformation to apply, the reactivity class of resulting open valences, the relative cost, incompatible functional groups, and rules for handling multiple equivalent reactive sites. A separate reactivity table defines which valence classes can react with each other (e.g., nucleophilic with electrophilic).
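
As a rough illustration of the reactivity table idea, the sketch below stores complementary valence classes as unordered pairs. The class names are hypothetical; the actual AllChem table and its SLN-based reaction files are not public.

```python
# Hypothetical valence classes; only the nucleophile/electrophile pairing
# is named in the text, the second entry is purely illustrative.
REACTIVITY = {
    frozenset({"nucleophile", "electrophile"}),
    frozenset({"boronate", "aryl_halide"}),
}


def can_join(class_a, class_b):
    """Return True if two open valences have complementary reactivity."""
    return frozenset({class_a, class_b}) in REACTIVITY


ok = can_join("electrophile", "nucleophile")
rejected = can_join("nucleophile", "nucleophile")
```

Using `frozenset` keys makes the lookup symmetric, and a same-class pair collapses to a one-element set that never matches a two-class entry.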

Topomer Similarity Search

Searching among $10^{20}$ complete structures relies on topomer shape similarity as a branch-and-bound filter. A query structure is fragmented by breaking acyclic single bonds (individually and pairwise), each fragment is converted to a topomer (a canonical 3D shape), and the topomer is compared against all stored synthons. Topomer comparisons run at tens of thousands per second. Because the vast majority of synthons are individually shape-dissimilar enough to eliminate every complete structure containing them, the search space collapses rapidly. To be acceptable, a product must also have been formed by joining open valences with complementary reactivity.
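
The pruning argument can be sketched in a few lines. This is a toy model, not the topomer method itself: shapes are plain coordinate tuples, and the total difference of a product is assumed to be the sum of its per-fragment differences, so any synthon that alone exceeds the cutoff rules out every product containing it.

```python
import math
from itertools import product


def shape_distance(a, b):
    """Toy stand-in for a topomer shape difference (Euclidean distance)."""
    return math.dist(a, b)


def search(query_fragments, synthons, cutoff):
    """Two-stage search: filter synthons per position, then combine survivors.

    query_fragments: list of shape vectors, one per position in the product.
    synthons: {name: shape vector}.
    Returns (sorted hits, shortlist sizes per position).
    """
    # Stage 1: a synthon farther than the cutoff from its query fragment
    # cannot appear in any acceptable product, so it is dropped here.
    shortlists = []
    for frag in query_fragments:
        close = []
        for name, vec in synthons.items():
            d = shape_distance(frag, vec)
            if d <= cutoff:
                close.append((name, d))
        shortlists.append(close)

    # Stage 2: only combinations of surviving synthons are ever assembled.
    hits = []
    for combo in product(*shortlists):
        total = sum(d for _, d in combo)
        if total <= cutoff:
            hits.append((total, tuple(name for name, _ in combo)))
    return sorted(hits), [len(s) for s in shortlists]


toy_synthons = {"s1": (0.0, 0.0), "s2": (1.0, 0.0), "s3": (10.0, 10.0), "s4": (0.0, 5.0)}
hits, sizes = search([(0.0, 0.0), (0.0, 5.0)], toy_synthons, cutoff=1.5)
```

In this example only 2 x 1 combinations are assembled instead of all 4 x 4, which is the same collapse the text describes at much larger scale. A real search would also check reactivity complementarity before accepting a product.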

Validation used repeated “self-searches,” in which a query structure is assembled from randomly chosen synthons and searched for in the database. On the 250,000-synthon leadhopping database, average self-search time was 7.1 minutes; complete searches of the full-scale database take several hours on standard hardware.

Applications: Lead Hopping and Scaffold Generation

Lead hopping: Finding structurally novel molecules that are shape-similar (and therefore likely biologically similar) to a query lead. Using a 250,000-synthon leadhopping database, 18 of 19 self-search queries recovered the query structure perfectly (shape difference of 0 topomer units). The remaining query also recovered itself as the closest hit.

Scaffold idea generation: Filtering the synthon collection for small ($\leq$ 14 heavy atoms), low-chirality, ring-containing scaffolds that have at least two diversification sites (primarily through nucleophilic heteroatom reactions on activated carbon electrophiles or Suzuki-type couplings), a UV chromophore, few freely rotatable bonds (especially between diversification sites and rings), and short synthetic paths (all branches fewer than about six AllChem steps). Over 20% of gensyn-proposed synthons pass these scaffold filters, suggesting on the order of $10^6$ accessible and structurally distinct scaffolds, compared to the few thousand scaffolds typically represented in large screening collections.

Compute and Infrastructure

Full-scale synthon database recreation takes approximately one week using two standard workstations (one Oracle database server, one compute engine). The codebase was rewritten from Java to Python for portability and performance. All data is managed through an Oracle relational database, including synthons, intermediates, and a reactions table recording every gensyn conversion.

Limitations

Variable reactivity of open valences (e.g., weakly nucleophilic amines may not form the implied bond readily) is handled only approximately via reagent class annotations.
Stereospecificity and most aromatic electrophilic substitution reactions are omitted.
The system was described as under active development at the time of publication, giving the paper the character of an interim progress report.
Drug-likeness of 3-synthon products (average MW ~800, CLOGP ~8.0) requires careful filtering of the synthon distribution toward smaller, less lipophilic components.

Reproducibility Details

AllChem was developed as proprietary software at Tripos Inc. (Tripos Discovery Research, Bude, Cornwall, UK). No source code, synthon databases, or reaction files have been publicly released. The paper functions as a description of the system’s architecture and early results rather than a reproducibility-oriented publication.

Code: Not publicly available. The system was proprietary to Tripos Inc.
Data: Synthon databases and reaction description files are not shared.
Hardware: Two standard workstations (one Oracle server, one compute engine); no specialized hardware required.
Funding: NIH/GMS SBIR grant 2 R44 GM068359-02.

Reproducibility status: Closed.

Paper Information

Journal: Journal of Computer-Aided Molecular Design, Vol. 21, No. 6, pp. 341-350
Published: January 25, 2007

@article{cramer2007allchem,
  title={AllChem: generating and searching 10^{20} synthetically accessible structures},
  author={Cramer, Richard D. and Soltanshahi, Farhad and Jilek, Robert J. and Campbell, Brian},
  journal={Journal of Computer-Aided Molecular Design},
  volume={21},
  number={6},
  pages={341--350},
  year={2007},
  publisher={Springer Science+Business Media},
  doi={10.1007/s10822-006-9093-8}
}

ACSESS: Diverse Optimal Molecules in the SMU

Sat, 11 Apr 2026 00:00:00 +0000

Diversity-Biased Search of the Small Molecule Universe

The small molecule universe (SMU), estimated at over $10^{60}$ synthetically feasible organic molecules under ~500 Da, is far too large for exhaustive enumeration and evaluation. This paper extends the ACSESS (Algorithm for Chemical Space Exploration with Stochastic Search) framework to simultaneously optimize molecular diversity and a targeted physical property. The key insight is that enforcing diversity at each iteration prevents the search from collapsing into local optima, a failure mode common in standard genetic algorithms.

Motivation: Diversity vs. Fitness

Standard genetic algorithms optimize fitness effectively but sacrifice diversity: they converge to a few high-fitness regions while ignoring equally good solutions elsewhere. Exhaustive enumeration guarantees completeness but is computationally infeasible beyond ~20 heavy atoms. ACSESS bridges this gap by maintaining a maximally diverse library throughout the optimization process, ensuring coverage of multiple fitness peaks without needing to evaluate every candidate.

The Property-Optimizing ACSESS Algorithm

The method has four iterative steps:

Initialize a library (from a single molecule or a seed collection)
Breed new compounds via mutations and crossovers
Filter by property threshold, removing compounds below a cutoff
Select a maximally diverse subset of qualifying structures

The property threshold increases linearly with each iteration, starting low (to prevent population collapse) and gradually rising until the desired fitness level is reached. Diversity is enforced via either a maximin algorithm (maximizing nearest-neighbor distance) or cell-based partitioning (linear scaling for large libraries).
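
The filter and select steps can be sketched as follows. This is a minimal illustration under stated assumptions: the greedy farthest-point heuristic is one common way to approximate maximin selection and may differ from the authors' procedure, and the function names and the starting threshold are hypothetical.

```python
import math


def threshold(iteration, n_iterations, start, target):
    """Linear ramp of the property cutoff from `start` to `target`."""
    frac = min(iteration / max(n_iterations - 1, 1), 1.0)
    return start + frac * (target - start)


def maximin_select(points, k):
    """Greedy maximin: repeatedly add the point farthest from those chosen.

    Returns indices into `points`. Starts from index 0 for determinism.
    """
    if not points:
        return []
    chosen = [0]
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=nearest.__getitem__)
        if nearest[nxt] == 0.0:  # only duplicates of chosen points remain
            break
        chosen.append(nxt)
        nearest = [min(nearest[i], math.dist(points[i], points[nxt]))
                   for i in range(len(points))]
    return chosen


def filter_and_select(candidates, fitness, coords, cutoff, k):
    """Drop candidates below the cutoff, then keep a diverse subset of size k."""
    keep = [c for c in candidates if fitness[c] >= cutoff]
    idx = maximin_select([coords[c] for c in keep], k)
    return [keep[i] for i in idx]


picked = maximin_select([(0, 0), (0.1, 0), (10, 0), (5, 0)], k=3)
survivors = filter_and_select(
    ["a", "b", "c", "d"],
    fitness={"a": 6.0, "b": 2.0, "c": 7.0, "d": 5.9},
    coords={"a": (0, 0), "b": (50, 50), "c": (0.1, 0), "d": (9, 0)},
    cutoff=5.5,
    k=2,
)
```

Note that candidate `b` is the most distant point but is removed by the property filter before diversity selection, which is the ordering the four steps above imply.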

Molecules are represented in a 40-dimensional chemical space using Moreau-Broto autocorrelation descriptors. The descriptor encodes correlations of atomic properties as a function of topological distance (bond distance) $d$:

$$ AC(d, p) = \sum_{i \leq j} p_{i} \, p_{j} \, \delta(d_{ij} - d) $$

where $p_{i}$ is an atomic property of atom $i$ and $d_{ij}$ is the shortest bond path between atoms $i$ and $j$. Five atomic properties are used: atomic number, Gasteiger-Marsili partial charge, atomic polarizability, topological steric index, and unity ($p_{i} = 1$ for all $i$, effectively counting atom pairs at each distance). Topological distance $d$ ranges from 0 to 7, yielding $5 \times 8 = 40$ descriptor components. Descriptors are mean-centered and normalized to unit variance before computing distances.
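
A small worked example may make the descriptor concrete. The sketch below computes $AC(d, p)$ for one atomic property on a molecular graph given as an adjacency list; the function names and the three-atom test graph (the heavy atoms of ethanol, C-C-O) are illustrative choices, and the normalization step is omitted.

```python
from collections import deque


def topological_distances(adjacency):
    """All-pairs shortest bond-path lengths via BFS from each atom."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def autocorrelation(adjacency, prop, max_d=7):
    """Return [AC(0, p), ..., AC(max_d, p)] for one atomic property."""
    dist = topological_distances(adjacency)
    n = len(adjacency)
    ac = [0.0] * (max_d + 1)
    for i in range(n):
        for j in range(i, n):  # i <= j, so d = 0 contributes p_i squared
            d = dist[i][j]
            if d is not None and d <= max_d:
                ac[d] += prop[i] * prop[j]
    return ac


# Heavy atoms of ethanol: C0 - C1 - O2
adjacency = [[1], [0, 2], [1]]
by_atomic_number = autocorrelation(adjacency, [6, 6, 8])
by_unity = autocorrelation(adjacency, [1, 1, 1])
```

Repeating this for the five properties and concatenating the eight-component results gives the 40-dimensional vector described above.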

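Chemical space distance is the Euclidean distance between descriptor vectors, where $d_{ik}$ is the $k$-th normalized descriptor component of molecule $i$ (not to be confused with the topological distance $d_{ij}$ above) and $N = 40$ is the number of components: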

$$ D_{ij} = \sqrt{\sum_{k=1}^{N} (d_{ik} - d_{jk})^2} $$

Library diversity is measured as the average nearest-neighbor distance:

$$ D_{\min} = \frac{1}{M} \sqrt{\sum_{i=1}^{M} \min_{j \neq i} (D_{ij}^2)} $$
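
The two formulas translate directly into code. The sketch below implements the diversity expression exactly as it is written above (a square root of summed squared nearest-neighbor distances, divided by the library size $M$); the function names and the three-point test library are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diversity(library):
    """D_min as written above: (1/M) * sqrt(sum_i min_{j != i} D_ij^2)."""
    m = len(library)
    if m < 2:
        raise ValueError("diversity needs at least two library members")
    total = 0.0
    for i, a in enumerate(library):
        # Squared distance from member i to its nearest neighbor.
        total += min(euclidean(a, b) ** 2
                     for j, b in enumerate(library) if j != i)
    return math.sqrt(total) / m


toy_library = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
d_min = diversity(toy_library)
```

For the toy library the nearest-neighbor distances are 3, 4, and 3, so the value is $\sqrt{34}/3$. The pairwise loop is quadratic in library size, which is one reason a cell-based alternative is attractive for large libraries.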

Validation on NKp Fitness Landscapes

The NKp model maps binary strings of length $N$ to fitness values in $[0, 1]$. The fitness of a string $g$ is:

$$ \Phi(g) = \frac{1}{N} \sum_{i=1}^{N} \varphi_{i}(g) $$

where each $\varphi_{i} \in [0, 1]$ is a randomly drawn fitness contribution. Ruggedness is controlled by $K$ (the number of other bits that each position's contribution depends on), and neutrality by $p$ (the probability that a given fitness contribution is zero). Using $N = 19$, $K = 9$, $p = 0.9$ ($2^{19} = 524{,}288$ total strings, comparable to GDB-9 size), the global maximum was ~0.3. Both ACSESS and SGA were initialized with the same diverse subset and ran for 30 iterations across 10 independent runs:

ACSESS found the global optimum in 100% of runs (vs. 60% for SGA)
ACSESS discovered ~15 of 19 globally optimal strings on average (vs. ~3 for SGA)
ACSESS solutions had higher average fitness than SGA solutions
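
For readers who want to reproduce the benchmark landscape, here is a minimal NKp sketch. It follows the usual NKp construction as an assumption, since the text gives only the averaging formula: each contribution depends on bit $i$ plus $K$ randomly chosen other bits, and each lookup-table entry is zero with probability $p$ and uniform on $[0, 1)$ otherwise. The random neighbor choice, the seed handling, and the function name are illustrative, and a small $N$ is used so the full landscape can be enumerated quickly.

```python
import random
from itertools import product


def make_nkp(n, k, p, seed=0):
    """Build an NKp fitness function over bit tuples of length n."""
    rng = random.Random(seed)
    # K other positions that influence each position's contribution.
    neighbors = [rng.sample([j for j in range(n) if j != i], k)
                 for i in range(n)]
    # One lookup table per position with 2^(K+1) entries; each entry is
    # zero with probability p, which creates neutral plateaus.
    tables = [
        [0.0 if rng.random() < p else rng.random()
         for _ in range(2 ** (k + 1))]
        for _ in range(n)
    ]

    def fitness(bits):
        total = 0.0
        for i in range(n):
            idx = bits[i]
            for j in neighbors[i]:
                idx = (idx << 1) | bits[j]  # pack the K+1 relevant bits
            total += tables[i][idx]
        return total / n  # mean of the N contributions

    return fitness


f = make_nkp(n=8, k=3, p=0.9, seed=1)
values = [f(bits) for bits in product((0, 1), repeat=8)]
best = max(values)
n_optimal = sum(1 for v in values if v == best)
```

With many zero contributions, several strings can share the maximum fitness, which is why the comparison above counts how many globally optimal strings each method finds.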

Validation on GDB-9 Dipole Moments

The method was tested on all ~300,000 molecules in GDB-9 (up to 9 heavy atoms; allowed atom types: C, N, O, S, Cl). For each molecule, the Boltzmann-averaged dipole moment was computed at the AM1 level (Gaussian 09):

$$ D = \frac{\sum_{i \in C} \mu_{i} \, e^{-\beta E_{i}}}{\sum_{i \in C} e^{-\beta E_{i}}} $$

where $\mu_{i}$ and $E_{i}$ are the dipole moment and internal energy of conformation $i$, and $\beta = 1 / (k_{\text{B}} T)$ at $T = 298$ K. Conformations (including stereoisomers) were generated using OpenEye OMEGA. The target was molecules with dipole moments $\geq 5.5$ D (the 90th percentile). ACSESS first generated a maximally diverse seed set, then ran 60 iterations of fitness-biased optimization. All methods were initialized from the same diverse seed and compared over multiple runs.
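
The Boltzmann average is straightforward to compute once conformer dipoles and energies are available. The sketch below assumes energies in kcal/mol (the text does not state units) and shifts them by the minimum before exponentiating, which leaves the ratio unchanged and avoids overflow; the function name and the example values are illustrative.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_average(dipoles, energies, temperature=298.0):
    """Boltzmann-weighted mean dipole over a set of conformers.

    dipoles: dipole moment of each conformer (Debye).
    energies: energy of each conformer (kcal/mol, assumed).
    """
    if len(dipoles) != len(energies) or not dipoles:
        raise ValueError("need one energy per dipole and at least one conformer")
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)
    # Subtracting e_min rescales numerator and denominator by the same factor.
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)
    return sum(mu * w for mu, w in zip(dipoles, weights)) / z


degenerate = boltzmann_average([2.0, 6.0], [0.0, 0.0])
split = boltzmann_average([2.0, 6.0], [0.0, 1.0])
```

With equal energies the result is the plain mean; with a 1 kcal/mol gap the lower-energy conformer dominates and the average moves toward its dipole.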

Method	Dipole Moment (D)	Diversity ($D_{\min}$)
GA-Roulette	5.8 $\pm$ 0.03	6.5 $\pm$ 0.7
GA-Tournament	6.4 $\pm$ 0.08	3.5 $\pm$ 0.7
GA-Elitism	6.74 $\pm$ 0.08	5.4 $\pm$ 0.4
ACSESS	6.05 $\pm$ 0.05	9.7 $\pm$ 0.6

ACSESS achieved roughly 1.5 times the diversity of the most diverse SGA variant (9.7 vs. 6.5 for GA-Roulette) and nearly double that of the highest-fitness variant (5.4 for GA-Elitism), while maintaining competitive fitness. Its diversity (~9.7) approached that of the full enumerated high-fitness subset of GDB-9 (~12). Self-organizing map (SOM) visualizations confirmed that ACSESS covered high-activity regions that SGAs missed entirely.

Only ~30,000 fitness evaluations were needed to locate diverse optimal regions in the 300,000-molecule space, a 10x efficiency gain over exhaustive enumeration.

Limitations

Tested only on relatively small chemical spaces (GDB-9 with ~300k molecules and 19-bit NKp with ~500k strings); scaling to the full SMU ($10^{60}$) remains a research direction
Property evaluation (AM1 dipole moments with conformer generation) is the computational bottleneck, not the ACSESS algorithm itself
The 40-dimensional autocorrelation descriptor space may not capture all relevant structural features for every optimization target
Comparison is limited to simple genetic algorithms; more sophisticated evolutionary strategies were not benchmarked

Reproducibility Details

The ACSESS implementation described in the paper relies on proprietary software, limiting full reproducibility.

Artifact	Type	License	Notes
GDB-9	Dataset	CC-BY-4.0	Publicly available enumerated chemical universe (~300k molecules)

Code: No public source code was released. The implementation depends on OpenEye OEChem TK (molecule generation), OpenEye MolProp TK (filtering), and OpenEye OMEGA TK (conformer generation), all of which require commercial licenses.
Property calculations: Dipole moments were computed at the AM1 level using Gaussian 09, also commercial software.
NKp landscape: Fully specified by parameters ($N = 19$, $K = 9$, $p = 0.9$) and standard NKp model equations, making this portion independently reproducible.
Hardware: No specific compute requirements reported.
Reproducibility status: Partially Reproducible. The algorithm is well-described and the NKp experiments could be reimplemented, but the molecular experiments require OpenEye and Gaussian 09 licenses, and no reference implementation was released.

Paper Information

Journal: Journal of Chemical Information and Modeling, Vol. 55, No. 3, pp. 529-537
Published: January 16, 2015

@article{rupakheti2015strategy,
  title={Strategy To Discover Diverse Optimal Molecules in the Small Molecule Universe},
  author={Rupakheti, Chetan and Virshup, Aaron M. and Yang, Weitao and Beratan, David N.},
  journal={Journal of Chemical Information and Modeling},
  volume={55},
  number={3},
  pages={529--537},
  year={2015},
  publisher={American Chemical Society},
  doi={10.1021/ci500749q}
}