<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Search-Based Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/</link><description>Recent content in Search-Based Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Fri, 10 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/index.xml" rel="self" type="application/rss+xml"/><item><title>ChemGE: Molecule Generation via Grammatical Evolution</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/chemge-grammatical-evolution-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/chemge-grammatical-evolution-molecule-generation/</guid><description>ChemGE applies grammatical evolution to SMILES strings for population-based de novo molecule generation with inherent parallelism and diversity.</description><content:encoded><![CDATA[<h2 id="grammatical-evolution-for-de-novo-molecular-design">Grammatical Evolution for De Novo Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemGE, a population-based molecular generation approach built on grammatical evolution. Rather than using deep neural networks, ChemGE evolves populations of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings through a context-free grammar, enabling concurrent evaluation by multiple molecular simulators and producing diverse molecular libraries. The method represents an alternative paradigm for de novo drug design: evolutionary optimization over formal grammars rather than learned latent spaces or autoregressive neural models.</p>
<h2 id="limitations-of-sequential-deep-learning-generators">Limitations of Sequential Deep Learning Generators</h2>
<p>At the time of publication, the dominant approaches to de novo molecular generation included Bayesian optimization over VAE latent spaces (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GVAE</a>), reinforcement learning with recurrent neural networks (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>), sequential Monte Carlo search, and Monte Carlo tree search (ChemTS). These methods share two practical limitations:</p>
<ol>
<li>
<p><strong>Simulation concurrency</strong>: Most methods generate one molecule at a time, making it difficult to run multiple molecular simulations (e.g., <a href="https://en.wikipedia.org/wiki/Molecular_docking">docking</a>) in parallel. This wastes computational resources in high-throughput virtual screening settings.</p>
</li>
<li>
<p><strong>Molecular diversity</strong>: Deep learning generators tend to exploit narrow regions of chemical space. Deep reinforcement learning methods in particular often generate very similar molecules, requiring special countermeasures to maintain diversity. Since drug discovery is a multi-stage pipeline, limited diversity reduces survival rates in downstream <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> screening.</p>
</li>
</ol>
<p>ChemGE addresses both problems by maintaining a large population of molecules that are evolved and evaluated concurrently.</p>
<h2 id="core-innovation-chromosome-to-smiles-mapping-via-grammar-rules">Core Innovation: Chromosome-to-SMILES Mapping via Grammar Rules</h2>
<p>ChemGE encodes each molecule as a chromosome: a sequence of $N$ integers that deterministically maps to a SMILES string through a context-free grammar. The mapping process works as follows:</p>
<ol>
<li>Start with the grammar&rsquo;s start symbol</li>
<li>At each step $k$, look up the $k$-th integer $c = C[k]$ from the chromosome</li>
<li>Identify the leftmost non-terminal symbol and count its $r$ applicable production rules</li>
<li>Apply the $((c \bmod r) + 1)$-th rule</li>
<li>Repeat until no non-terminal symbols remain or the chromosome is exhausted</li>
</ol>
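<p>As an illustration, the mapping can be sketched in a few lines of Python. The grammar below is a toy fragment, not the paper&rsquo;s OpenSMILES subset, and the 0-indexed <code>c % r</code> lookup selects the same production as the 1-indexed $((c \bmod r) + 1)$ convention above.</p>

```python
# Toy genotype-to-phenotype mapping in the style of grammatical evolution.
# TOY_GRAMMAR is a hypothetical fragment, not the paper's OpenSMILES subset.

TOY_GRAMMAR = {
    # non-terminal -> list of productions (each a tuple of symbols)
    "smiles": [("chain",)],
    "chain": [("atom",), ("atom", "chain")],
    "atom": [("C",), ("N",), ("O",)],
}

def chromosome_to_string(chromosome, grammar, start="smiles", max_steps=100):
    """Derive a terminal string by always expanding the leftmost non-terminal."""
    symbols = [start]
    k = 0
    for _ in range(max_steps):
        idx = next((i for i, s in enumerate(symbols) if s in grammar), None)
        if idx is None:
            return "".join(symbols)  # fully derived: no non-terminals left
        if k >= len(chromosome):
            return None              # chromosome exhausted -> invalid molecule
        rules = grammar[symbols[idx]]
        symbols[idx:idx + 1] = list(rules[chromosome[k] % len(rules)])
        k += 1
    return None

print(chromosome_to_string([0, 1, 0, 0, 2], TOY_GRAMMAR))  # -> CO
```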
<p>The context-free grammar is a subset of the OpenSMILES specification, defined formally as $G = (V, \Sigma, R, S)$ where $V$ is the set of non-terminal symbols, $\Sigma$ is the set of terminal symbols, $R$ is the set of production rules, and $S$ is the start symbol.</p>
<p>Evolution follows the $(\mu + \lambda)$ evolution strategy:</p>
<ol>
<li>Create $\lambda$ new chromosomes by drawing random chromosomes from the population and mutating one integer at a random position</li>
<li>Translate each chromosome to a SMILES string and evaluate fitness (e.g., docking score). Invalid molecules receive fitness $-\infty$</li>
<li>Select the top $\mu$ molecules from the merged pool of $\mu + \lambda$ candidates</li>
</ol>
<p>The authors did not use crossover, as it did not improve performance. Diversity is inherently maintained because a large fraction of molecules are mutated in each generation.</p>
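<p>The three steps above reduce to a compact loop. The fitness function here is a stand-in (the paper scores decoded SMILES by docking or penalized logP, with $-\infty$ for invalid molecules), and the range of 256 integer values per gene is an assumption for illustration.</p>

```python
# Minimal (mu + lambda) evolution strategy with mutation-only variation.
import random

def evolve(population, fitness, mu, lam, generations, n_vals=256, seed=0):
    rng = random.Random(seed)
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            child = list(rng.choice(population))  # copy a random parent
            child[rng.randrange(len(child))] = rng.randrange(n_vals)  # one mutation
            offspring.append(child)
        pool = population + offspring             # merged mu + lambda pool
        pool.sort(key=fitness, reverse=True)
        population = pool[:mu]                    # truncation selection
    return population

# Toy fitness: prefer chromosomes whose integers sum high (illustration only).
best = evolve([[0] * 8 for _ in range(10)], sum, mu=10, lam=20, generations=50)
```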
<h2 id="experimental-setup-and-benchmark-comparisons">Experimental Setup and Benchmark Comparisons</h2>
<h3 id="druglikeness-score-benchmark">Druglikeness Score Benchmark</h3>
<p>The first experiment optimized the penalized logP score $J^{\log P}$, an indicator of druglikeness defined as:</p>
<p>$$
J^{\log P}(m) = \log P(m) - \text{SA}(m) - \text{ring-penalty}(m)
$$</p>
<p>where $\log P(m)$ is the <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">octanol-water partition coefficient</a>, $\text{SA}(m)$ is the synthetic accessibility score, and $\text{ring-penalty}(m)$ penalizes carbon rings larger than size 6. All terms are normalized to zero mean and unit standard deviation. Initial populations were randomly sampled from the ZINC database (35 million compounds), with fitness set to $-\infty$ for molecules with molecular weight above 500 or duplicate structures.</p>
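<p>A minimal sketch of how the score is assembled from precomputed raw terms. The normalization constants below are illustrative stand-ins for ZINC-derived statistics, and in practice $\log P$ and SA would come from a cheminformatics toolkit such as RDKit.</p>

```python
# Assembling penalized logP from raw terms, each z-score normalized.
# The (mean, std) pairs in `stats` are illustrative, not the paper's values.

def z(x, mean, std):
    return (x - mean) / std

def ring_penalty(largest_ring_size):
    # penalize carbon rings larger than six atoms
    return max(largest_ring_size - 6, 0)

def penalized_logp(logp, sa, largest_ring_size, stats):
    return (z(logp, *stats["logp"])
            - z(sa, *stats["sa"])
            - z(ring_penalty(largest_ring_size), *stats["ring"]))

stats = {"logp": (2.46, 1.44), "sa": (3.05, 0.83), "ring": (0.04, 0.29)}
score = penalized_logp(logp=3.2, sa=2.5, largest_ring_size=6, stats=stats)
```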
<p>ChemGE was compared against CVAE, GVAE, and ChemTS across population sizes $(\mu, \lambda) \in \{(10, 20), (100, 200), (1000, 2000), (10000, 20000)\}$.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>2h</th>
          <th>4h</th>
          <th>6h</th>
          <th>8h</th>
          <th>Mol/Min</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemGE (10, 20)</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>14.5</td>
      </tr>
      <tr>
          <td>ChemGE (100, 200)</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>135</td>
      </tr>
      <tr>
          <td>ChemGE (1000, 2000)</td>
          <td>4.45 +/- 0.24</td>
          <td>5.32 +/- 0.43</td>
          <td>5.73 +/- 0.33</td>
          <td>5.88 +/- 0.34</td>
          <td>527</td>
      </tr>
      <tr>
          <td>ChemGE (10000, 20000)</td>
          <td>4.20 +/- 0.33</td>
          <td>4.28 +/- 0.28</td>
          <td>4.40 +/- 0.27</td>
          <td>4.53 +/- 0.26</td>
          <td>555</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>-30.18 +/- 26.91</td>
          <td>-1.39 +/- 2.24</td>
          <td>-0.61 +/- 1.08</td>
          <td>-0.006 +/- 0.92</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>GVAE</td>
          <td>-4.34 +/- 3.14</td>
          <td>-1.29 +/- 1.67</td>
          <td>-0.17 +/- 0.96</td>
          <td>0.25 +/- 1.31</td>
          <td>1.38</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.91 +/- 0.38</td>
          <td>5.41 +/- 0.51</td>
          <td>5.49 +/- 0.44</td>
          <td>5.58 +/- 0.50</td>
          <td>40.89</td>
      </tr>
  </tbody>
</table>
<p>At $(\mu, \lambda) = (1000, 2000)$, ChemGE achieved the highest final score of 5.88 and generated 527 unique molecules per minute, roughly 13x faster than ChemTS and 3700x faster than CVAE. The small population (10, 20) converged prematurely with insufficient diversity, while the overly large population (10000, 20000) could not run enough generations to optimize effectively.</p>
<h3 id="docking-experiment-with-thymidine-kinase">Docking Experiment with Thymidine Kinase</h3>
<p>The second experiment applied ChemGE to generate molecules with high predicted binding affinity for <a href="https://en.wikipedia.org/wiki/Thymidine_kinase">thymidine kinase</a> (KITH), a well-known antiviral drug target. The authors used rDock for docking simulation, taking the best intermolecular score $S_{\text{inter}}$ from three runs with different initial conformations. Fitness was defined as $-S_{\text{inter}}$ (lower scores indicate higher affinity). The protein structure was taken from PDB ID 2B8T.</p>
<p>With 32 parallel cores and $(\mu, \lambda) = (32, 64)$, ChemGE completed 1000 generations in approximately 26 hours, generating 9466 molecules total. Among these, 349 molecules achieved intermolecular scores better than the best known inhibitor in the DUD-E database.</p>
<h3 id="diversity-analysis">Diversity Analysis</h3>
<p>Molecular diversity was measured using internal diversity based on Morgan fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|^2} \sum_{(x,y) \in A \times A} T_d(x, y)
$$</p>
<p>where $T_d(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance">Tanimoto distance</a>.</p>
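<p>The metric follows directly from the definition; below, fingerprints are sketched as Python sets of on-bit indices (real Morgan fingerprints would come from a toolkit such as RDKit):</p>

```python
# Internal diversity I(A): mean pairwise Tanimoto distance over all
# ordered pairs, including self-pairs, per the formula above.

def tanimoto_distance(x, y):
    union = x | y
    if not union:
        return 0.0
    return 1.0 - len(x & y) / len(union)

def internal_diversity(fingerprints):
    n = len(fingerprints)
    total = sum(tanimoto_distance(a, b)
                for a in fingerprints for b in fingerprints)
    return total / (n * n)

fps = [{1, 2, 3}, {2, 3, 4}, {8, 9}]
d = internal_diversity(fps)
```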
<p>The 349 &ldquo;ChemGE-active&rdquo; molecules (those scoring better than the best known inhibitor) had an internal diversity of 0.55, compared to 0.46 for known inhibitors and 0.65 for the whole ZINC database. This is a substantial improvement over known actives, achieved without any explicit diversity-promoting mechanism.</p>
<p>ISOMAP visualizations showed that ChemGE populations migrated away from known inhibitors over generations, ultimately occupying a completely different region of chemical space by generation 1000. This suggests ChemGE discovered a novel structural class of potential binders.</p>
<h2 id="high-throughput-and-diversity-without-deep-learning">High Throughput and Diversity Without Deep Learning</h2>
<p>ChemGE demonstrates several notable findings:</p>
<ol>
<li>
<p><strong>Deep learning is not required</strong> for competitive de novo molecular generation. Grammatical evolution over SMILES achieves higher throughput and comparable or better optimization scores than VAE- and RNN-based methods.</p>
</li>
<li>
<p><strong>Population size matters significantly</strong>. Too small a population leads to premature convergence. Too large a population prevents sufficient per-molecule optimization within the computational budget. The $(\mu, \lambda) = (1000, 2000)$ setting provided the best balance.</p>
</li>
<li>
<p><strong>Inherent diversity</strong> is a key advantage of evolutionary methods. Without any explicit diversity loss or penalty, ChemGE maintains diversity comparable to the ZINC database and exceeds that of known active molecules.</p>
</li>
<li>
<p><strong>Parallel evaluation</strong> is naturally supported. Each generation produces $\lambda$ independent molecules that can be evaluated by separate docking simulators simultaneously.</p>
</li>
</ol>
<p>The authors acknowledge several limitations. Synthetic routes and ADMET properties were not evaluated for the generated molecules. The docking scores, while favorable, require confirmation through biological assays. The authors also note that incorporating probabilistic or neural models into the evolutionary process might further improve performance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial population</td>
          <td>ZINC</td>
          <td>~35M compounds</td>
          <td>Randomly sampled starting molecules</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2B8T</td>
          <td>1 structure</td>
          <td>Thymidine kinase-ligand complex</td>
      </tr>
      <tr>
          <td>Baseline actives</td>
          <td>DUD-E (KITH)</td>
          <td>57 inhibitors</td>
          <td>Known thymidine kinase inhibitors</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Grammatical evolution with $(\mu + \lambda)$ evolution strategy</li>
<li>Mutation only (no crossover)</li>
<li>Context-free grammar subset of OpenSMILES specification</li>
<li>Chromosome length: $N$ integers per molecule</li>
<li>Fitness set to $-\infty$ for invalid SMILES, MW &gt; 500, or duplicate molecules</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. ChemGE is purely evolutionary.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max $J^{\log P}$ (8h)</td>
          <td>5.88 +/- 0.34</td>
          <td>ChemTS: 5.58 +/- 0.50</td>
          <td>ChemGE (1000, 2000)</td>
      </tr>
      <tr>
          <td>Molecules/min</td>
          <td>527</td>
          <td>ChemTS: 40.89</td>
          <td>~13x throughput improvement</td>
      </tr>
      <tr>
          <td>Docking hits</td>
          <td>349</td>
          <td>Best DUD-E inhibitor</td>
          <td>Molecules with better $S_{\text{inter}}$</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.55</td>
          <td>Known inhibitors: 0.46</td>
          <td>Morgan fingerprint Tanimoto distance</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>CPU: Intel Xeon E5-2630 v3 (benchmark experiments, single core)</li>
<li>Docking: 32 cores in parallel (thymidine kinase experiment, ~26 hours for 1000 generations)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tsudalab/ChemGE">ChemGE</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoshikawa, N., Terayama, K., Sumita, M., Homma, T., Oono, K., &amp; Tsuda, K. (2018). Population-based de novo molecule generation, using grammatical evolution. <em>Chemistry Letters</em>, 47(11), 1431-1434. <a href="https://doi.org/10.1246/cl.180665">https://doi.org/10.1246/cl.180665</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yoshikawa2018chemge,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Population-based De Novo Molecule Generation, Using Grammatical Evolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yoshikawa, Naruki and Terayama, Kei and Sumita, Masato and Homma, Teruki and Oono, Kenta and Tsuda, Koji}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1431--1434}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1246/cl.180665}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>STONED: Training-Free Molecular Design with SELFIES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/</guid><description>STONED uses string mutations in the SELFIES representation for training-free molecular generation, interpolation, and chemical space exploration.</description><content:encoded><![CDATA[<h2 id="a-training-free-algorithm-for-molecular-generation">A Training-Free Algorithm for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces STONED (Superfast Traversal, Optimization, Novelty, Exploration and Discovery), a suite of algorithms for molecular generation and chemical space exploration. STONED operates entirely through string manipulations on the <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> molecular representation, avoiding the need for deep learning models, training data, or GPU resources. The key claim is that simple character-level mutations and interpolations in SELFIES can achieve results competitive with state-of-the-art deep generative models on standard benchmarks.</p>
<h2 id="why-deep-generative-models-may-be-overkill">Why Deep Generative Models May Be Overkill</h2>
<p>Deep generative models (VAEs, GANs, RNNs, reinforcement learning) have become popular for <a href="/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/">inverse molecular design</a>, but they come with practical costs: large training datasets, expensive GPU compute, and long training times. Fragile representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> compound the problem, since large portions of a latent space can map to invalid molecules. Even with the introduction of SELFIES (a 100% valid string representation), prior work still embedded it within neural network architectures.</p>
<p>The authors argue that for tasks like local chemical space exploration and molecular interpolation, the guarantees of SELFIES alone may be sufficient. Because every SELFIES string maps to a valid molecule, random character mutations always produce valid structures. This observation eliminates the need for learned generation procedures entirely.</p>
<h2 id="core-innovation-selfies-string-mutations-as-molecular-operators">Core Innovation: SELFIES String Mutations as Molecular Operators</h2>
<p>STONED relies on four key techniques built on SELFIES string manipulations:</p>
<p><strong>1. Random character mutations.</strong> A point mutation in SELFIES (character replacement, deletion, or addition) always yields a valid molecule. The position of mutations serves as a hyperparameter controlling exploration vs. exploitation: terminal character mutations preserve more structural similarity to the seed, while random mutations explore more broadly.</p>
<p><strong>2. Multiple SMILES orderings.</strong> A single molecule has many valid SMILES strings, each mapping to a different SELFIES. By generating 50,000 SMILES orderings and converting to SELFIES before mutation, the diversity of generated structures increases substantially.</p>
<p><strong>3. Deterministic interpolation.</strong> Given two SELFIES strings (padded to equal length), characters at equivalent positions can be successively replaced from the start molecule to the target molecule. Every intermediate string is a valid molecule. A chemical path is extracted by keeping only those intermediates that increase fingerprint similarity to the target.</p>
<p><strong>4. Fingerprint-based filtering.</strong> Since edit distance in SELFIES does not reflect molecular similarity, STONED uses fingerprint comparisons (ECFP4, FCFP4, atom-pair) to enforce structural similarity constraints.</p>
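<p>Technique 1 can be sketched on a tokenized SELFIES string. The alphabet and seed below are illustrative; STONED draws replacement tokens from the full SELFIES alphabet, and the validity guarantee comes from decoding with the <code>selfies</code> library, which this sketch omits.</p>

```python
# Point mutation on a token list: replace, delete, or insert one token.
import random

# Illustrative token alphabet; STONED uses the full SELFIES alphabet.
ALPHABET = ["[C]", "[N]", "[O]", "[=C]", "[Ring1]", "[Branch1]"]

def mutate(tokens, rng, terminal_fraction=None):
    """With terminal_fraction set, the mutation site is restricted to the
    end of the string (more exploitation); otherwise any position may
    mutate (more exploration)."""
    tokens = list(tokens)
    lo = 0
    if terminal_fraction is not None:
        lo = min(int(len(tokens) * (1 - terminal_fraction)), len(tokens) - 1)
    pos = rng.randrange(lo, len(tokens))
    op = rng.choice(["replace", "delete", "insert"])
    if op == "replace":
        tokens[pos] = rng.choice(ALPHABET)
    elif op == "delete" and len(tokens) > 1:
        del tokens[pos]
    else:
        tokens.insert(pos, rng.choice(ALPHABET))
    return tokens

seed = ["[C]", "[C]", "[O]", "[C]", "[N]"]
mutant = mutate(seed, random.Random(7), terminal_fraction=0.1)
```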
<p>The authors also propose a revised joint molecular similarity metric for evaluating median molecules. Given $n$ reference molecules $M = \{m_1, m_2, \ldots, m_n\}$, the joint similarity of a candidate molecule $m$ is:</p>
<p>$$
F(m) = \frac{1}{n} \sum_{i=1}^{n} \text{sim}(m_i, m) - \left[\max_{i} \text{sim}(m_i, m) - \min_{i} \text{sim}(m_i, m)\right]
$$</p>
<p>This penalizes candidates that are similar to only a subset of references, unlike the geometric mean metric used in GuacaMol which can yield high scores even with lopsided similarities.</p>
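<p>The metric is straightforward to compute from pairwise similarities (which in practice come from fingerprint Tanimoto comparisons); the toy values below show how a lopsided candidate is penalized relative to a balanced one:</p>

```python
# Joint similarity F(m): mean pairwise similarity minus the max-min spread.

def joint_similarity(sims):
    avg = sum(sims) / len(sims)
    return avg - (max(sims) - min(sims))

# Same average similarity, very different joint scores.
balanced = joint_similarity([0.6, 0.6])  # 0.6 - 0.0 = 0.6
lopsided = joint_similarity([0.9, 0.3])  # 0.6 - 0.6 = 0.0
```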
<h2 id="experimental-setup-and-applications">Experimental Setup and Applications</h2>
<h3 id="local-chemical-subspace-formation">Local chemical subspace formation</h3>
<p>Starting from a single seed molecule (<a href="https://en.wikipedia.org/wiki/Aripiprazole">aripiprazole</a>, albuterol, mestranol, or <a href="https://en.wikipedia.org/wiki/Celecoxib">celecoxib</a>), the algorithm generates 50,000 SMILES orderings and performs 1-5 point mutations per ordering, producing 250,000 candidate strings. Unique valid molecules are filtered by fingerprint similarity thresholds.</p>
<table>
  <thead>
      <tr>
          <th>Starting structure</th>
          <th>Fingerprint</th>
          <th>Molecules at $\delta &gt; 0.75$</th>
          <th>Molecules at $\delta &gt; 0.60$</th>
          <th>Molecules at $\delta &gt; 0.40$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Aripiprazole (SELFIES, random)</td>
          <td>ECFP4</td>
          <td>513 (0.25%)</td>
          <td>4,206 (2.15%)</td>
          <td>34,416 (17.66%)</td>
      </tr>
      <tr>
          <td>Albuterol (SELFIES, random)</td>
          <td>FCFP4</td>
          <td>587 (0.32%)</td>
          <td>4,156 (2.33%)</td>
          <td>16,977 (9.35%)</td>
      </tr>
      <tr>
          <td>Mestranol (SELFIES, random)</td>
          <td>AP</td>
          <td>478 (0.22%)</td>
          <td>4,079 (1.90%)</td>
          <td>45,594 (21.66%)</td>
      </tr>
      <tr>
          <td>Celecoxib (SELFIES, random)</td>
          <td>ECFP4</td>
          <td>198 (0.10%)</td>
          <td>1,925 (1.00%)</td>
          <td>18,045 (9.44%)</td>
      </tr>
      <tr>
          <td>Celecoxib (SELFIES, terminal 10%)</td>
          <td>ECFP4</td>
          <td>864 (2.02%)</td>
          <td>9,407 (21.99%)</td>
          <td>34,187 (79.91%)</td>
      </tr>
  </tbody>
</table>
<p>Key finding: restricting mutations to terminal characters yields a 20x increase in high-similarity molecules over random positions. Whereas mutating SMILES produces only 0.30% valid strings and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> 1.44%, every SELFIES mutation is valid by construction.</p>
<p>A two-step expansion (mutating all unique first-round neighbors) produced over 17 million unique molecules, with 120,000 having similarity greater than 0.4 to celecoxib.</p>
<h3 id="chemical-path-formation-and-drug-design">Chemical path formation and drug design</h3>
<p>Deterministic SELFIES interpolation between <a href="https://en.wikipedia.org/wiki/Tadalafil">tadalafil</a> and <a href="https://en.wikipedia.org/wiki/Sildenafil">sildenafil</a> generated paths where <a href="https://en.wikipedia.org/wiki/Partition_coefficient">logP</a> and QED values varied smoothly. A more challenging application docked intermediates between <a href="https://en.wikipedia.org/wiki/Dihydroergotamine">dihydroergotamine</a> (<a href="https://en.wikipedia.org/wiki/5-HT1B_receptor">5-HT1B</a> binder) and prinomastat (<a href="https://en.wikipedia.org/wiki/CYP2D6">CYP2D6</a> binder), finding molecules with non-trivial binding affinity to both proteins without any optimization routine.</p>
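<p>The deterministic interpolation behind these paths can be sketched on token lists. The <code>[nop]</code> padding token is ignored by the SELFIES decoder, so every intermediate decodes to a valid molecule; the fingerprint-similarity filtering that extracts the final chemical path is omitted here.</p>

```python
# Left-to-right token replacement between two padded sequences.
PAD = "[nop]"  # padding token ignored by the SELFIES decoder

def interpolation_path(start, target):
    """Replace tokens left-to-right, yielding every intermediate string."""
    n = max(len(start), len(target))
    cur = start + [PAD] * (n - len(start))
    tgt = target + [PAD] * (n - len(target))
    path = ["".join(cur)]
    for i in range(n):
        if cur[i] != tgt[i]:
            cur[i] = tgt[i]
            path.append("".join(cur))
    return path

path = interpolation_path(["[C]", "[C]", "[O]"], ["[C]", "[N]"])
# path[0] is the start string, path[-1] the padded target
```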
<h3 id="median-molecules-for-photovoltaics">Median molecules for photovoltaics</h3>
<p>Using 100 triplets from the Harvard Clean Energy (HCE) dataset, each with one molecule optimized for high LUMO energy, one for high dipole moment, and one for high <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO-LUMO gap</a>, generalized chemical paths produced median molecules. These were evaluated with GFN2-xTB semiempirical calculations. The generated medians matched or exceeded the best molecules available in the HCE database in both structural similarity and target properties.</p>
<h3 id="guacamol-benchmarks">GuacaMol benchmarks</h3>
<p>Without any training, STONED achieved an overall <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> score of 14.70, competitive with several deep generative models. The approach simply identifies the single best molecule in the benchmark&rsquo;s training set and generates its local chemical subspace. 38% of the top-100 molecules from each benchmark passed compound quality filters, comparable to <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a> and SMILES GA.</p>
<h2 id="results-summary-and-limitations">Results Summary and Limitations</h2>
<p>STONED demonstrates that SELFIES string mutations can match or approach deep generative models on standard molecular design benchmarks while being orders of magnitude faster and requiring no training. The most expensive benchmark (aripiprazole subspace) completed in 500 seconds on a laptop CPU.</p>
<p>The method comparison table from the paper highlights STONED&rsquo;s unique position:</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Expert Systems</th>
          <th>VAE</th>
          <th>GAN</th>
          <th>RL</th>
          <th>STONED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Expert rule-free</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Structure coverage</td>
          <td>Partial</td>
          <td>Partial</td>
          <td>Partial</td>
          <td>Partial</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Interpolatability</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Property-based navigation</td>
          <td>Partial</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Partial</td>
      </tr>
      <tr>
          <td>Training-free</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Data independence</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>STONED lacks property-based navigation (gradient-guided optimization toward specific property targets). It can only do stochastic property optimization when wrapped in a genetic algorithm.</li>
<li>The success rate of mutations leading to structurally similar molecules is relatively low (0.1-2% at high similarity thresholds), though speed compensates.</li>
<li>Chemical paths can contain molecules with unstable functional groups or <a href="https://en.wikipedia.org/wiki/Tautomer">tautomerization</a> issues, requiring post-hoc filtering with domain-specific rules.</li>
<li>Fingerprint similarity does not capture all aspects of chemical similarity (3D geometry, reactivity, synthesizability).</li>
<li>The penalized logP and QED benchmarks used by GuacaMol do not represent the full complexity of practical molecular design.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Photovoltaics</td>
          <td>Harvard Clean Energy (HCE) database</td>
          <td>~2.3M molecules</td>
          <td>Used for median molecule triplet experiments</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>GuacaMol benchmark suite</td>
          <td>Varies per task</td>
          <td>Standard benchmarks for generative molecular design</td>
      </tr>
      <tr>
          <td>Comparison</td>
          <td>ChEMBL (SCScore &lt;= 2.5 subset)</td>
          <td>Fragment database</td>
          <td>Used for CReM comparison experiments</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Local subspace formation</strong>: 50,000 SMILES orderings per seed molecule, 1-5 SELFIES point mutations each, totaling 250,000 candidates per experiment.</li>
<li><strong>Chemical paths</strong>: Deterministic character-by-character interpolation between padded SELFIES strings, with monotonic fingerprint similarity filtering.</li>
<li><strong>Median molecules</strong>: Generalized paths between 3+ reference molecules using 10,000 paths per triplet with randomized SMILES orderings.</li>
<li><strong>Docking</strong>: <a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">SMINA</a> with crystal structures from PDB (4IAQ for 5-HT1B, 3QM4 for CYP2D6). Top-5 binding poses averaged.</li>
<li><strong>Quantum chemistry</strong>: GFN2-xTB for dipole moments, LUMO energies, and HOMO-LUMO gaps.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GuacaMol overall score</td>
          <td>14.70</td>
          <td>Varies by model</td>
          <td>Competitive with deep generative models</td>
      </tr>
      <tr>
          <td>Quality filter pass rate</td>
          <td>38%</td>
          <td>Graph GA/SMILES GA comparable</td>
          <td>Top-100 molecules per benchmark</td>
      </tr>
      <tr>
          <td>Celecoxib neighbors ($\delta &gt; 0.75$)</td>
          <td>198-864</td>
          <td>CReM: 239</td>
          <td>Depends on mutation position strategy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All experiments were run on a laptop with an Intel i7-8750H CPU at 2.20 GHz; no GPU was required. The most expensive single experiment (the aripiprazole subspace) completed in 500 seconds.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/stoned-selfies">stoned-selfies</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation of STONED algorithms</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nigam, A. K., Pollice, R., Krenn, M., dos Passos Gomes, G., &amp; Aspuru-Guzik, A. (2021). Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES. <em>Chemical Science</em>, 12(20), 7079-7090. <a href="https://doi.org/10.1039/d1sc00231g">https://doi.org/10.1039/d1sc00231g</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{nigam2021stoned,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery ({STONED}) algorithm for molecules using {SELFIES}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nigam, AkshatKumar and Pollice, Robert and Krenn, Mario and dos Passos Gomes, Gabriel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{7079--7090}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d1sc00231g}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph-Based GA and MCTS Generative Model for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/</guid><description>Jensen introduces a graph-based genetic algorithm and generative model with MCTS that outperforms ML methods for penalized logP optimization.</description><content:encoded><![CDATA[<h2 id="a-graph-based-approach-to-molecular-optimization">A Graph-Based Approach to Molecular Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces two graph-based approaches for exploring chemical space: a genetic algorithm (GB-GA) and a generative model combined with <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte Carlo tree search</a> (GB-GM-MCTS). The primary contribution is demonstrating that these non-ML, graph-based methods can match or exceed the performance of contemporary ML-based generative models for molecular property optimization, while being several orders of magnitude faster. The paper provides open-source implementations built on the RDKit cheminformatics package. The two approaches explore <a href="https://en.wikipedia.org/wiki/Chemical_space">chemical space</a> using direct graph manipulations rather than string-based representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</p>
<h2 id="why-compare-simple-baselines-to-ml-generative-models">Why Compare Simple Baselines to ML Generative Models?</h2>
<p>By 2018, several ML-based generative models for molecules had been published, including VAEs, RNNs, and graph convolutional policy networks. However, these models were rarely compared against traditional optimization approaches such as genetic algorithms. Jensen identifies this gap explicitly: while ML generative model performance had been impressive, the lack of comparison to simpler baselines made it difficult to assess whether the complexity of ML approaches was justified.</p>
<p>A practical barrier to such comparisons was the absence of free, open-source GA implementations for molecular optimization (the existing ACSESS algorithm required proprietary OpenEye toolkits). This paper fills that gap by providing RDKit-based implementations of both the GB-GA and GB-GM-MCTS.</p>
<h2 id="graph-based-crossovers-mutations-and-monte-carlo-tree-search">Graph-Based Crossovers, Mutations, and Monte Carlo Tree Search</h2>
<h3 id="gb-ga-crossovers-and-mutations-on-molecular-graphs">GB-GA: Crossovers and Mutations on Molecular Graphs</h3>
<p>The GB-GA operates directly on molecular graph representations (not string representations like SMILES). It combines ideas from Brown et al. (2004) and the ACSESS algorithm of Virshup et al. (2013).</p>
<p><strong>Crossovers</strong> can occur at two types of positions with equal probability:</p>
<ul>
<li>Non-ring bonds: a molecule is cut at a non-ring bond, and fragments from two parent molecules are recombined</li>
<li>Ring bonds: adjacent bonds or bonds separated by one bond are cut, and fragments are mated using single or double bonds</li>
</ul>
<p><strong>Mutations</strong> include seven operation types, each with specified probabilities:</p>
<ul>
<li>Append atom (15%): adds an atom with a single, double, or triple bond</li>
<li>Insert atom (15%): inserts an atom into an existing bond</li>
<li>Delete atom (14%): removes an atom, reconnecting neighbors</li>
<li>Change atom type (14%): swaps element identity (C, N, O, F, S, Cl, Br)</li>
<li>Change bond order (14%): toggles between single, double, and triple bonds</li>
<li>Delete ring bond (14%): opens a ring</li>
<li>Add ring bond (14%): closes a new ring</li>
</ul>
<p>Molecules with macrocycles (rings of seven or more atoms), allene centers in rings, fewer than five heavy atoms, incorrect valences, or more non-H atoms than the target size are discarded. The target size is sampled from a normal distribution with mean 39.15 and standard deviation 3.50 non-H atoms, calibrated to match the molecules found by Yang et al. (2017).</p>
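<p>The mutation-type probabilities and target-size sampling above can be sketched as follows. Only the selection logic is shown; the actual graph edits (appending atoms, opening rings, etc.) are performed with RDKit in the released code.</p>

```python
import random

# Mutation types and probabilities as reported for GB-GA (sum to 100%).
MUTATIONS = {
    "append_atom": 0.15, "insert_atom": 0.15, "delete_atom": 0.14,
    "change_atom": 0.14, "change_bond_order": 0.14,
    "delete_ring_bond": 0.14, "add_ring_bond": 0.14,
}

def pick_mutation(rng):
    """Draw one of the seven mutation types with the stated probabilities."""
    ops, weights = zip(*MUTATIONS.items())
    return rng.choices(ops, weights=weights, k=1)[0]

def sample_target_size(rng, mean=39.15, sd=3.50):
    """Sample a target molecule size in non-H atoms from a normal distribution."""
    return rng.gauss(mean, sd)

rng = random.Random(42)
print(pick_mutation(rng), round(sample_target_size(rng), 1))
```
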
<h3 id="gb-gm-mcts-a-probabilistic-growth-model-with-tree-search">GB-GM-MCTS: A Probabilistic Growth Model with Tree Search</h3>
<p>The GB-GM grows molecules one atom at a time, with the choice of bond order and atom type determined probabilistically from a bonding analysis of a reference dataset (the first 1000 molecules from ZINC). Since 63% of atoms in the reference set are ring atoms, ring-creation or ring-insertion growth steps are chosen 63% of the time.</p>
<p>The generative model is combined with a <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte Carlo tree search</a> where:</p>
<ul>
<li>Each node corresponds to an atom addition step</li>
<li>Leaf parallelization uses a maximum of 25 leaf nodes</li>
<li>The exploration factor is $1 / \sqrt{2}$</li>
<li>Rollout terminates if the molecule exceeds the target size</li>
<li>The reward function returns 1 if the predicted $J(\mathbf{m})$ value exceeds the largest value found so far, and 0 otherwise</li>
</ul>
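<p>The selection and reward rules above can be written compactly. This is a generic UCT sketch with the paper's constants (exploration factor $1/\sqrt{2}$, binary reward), not the released implementation, which builds on the haroldsultan/MCTS code.</p>

```python
import math

def uct_score(child_value, child_visits, parent_visits, c=1 / math.sqrt(2)):
    """UCT selection: mean reward plus exploration bonus with c = 1/sqrt(2)."""
    if child_visits == 0:
        return float("inf")  # always expand unvisited children first
    return child_value / child_visits + c * math.sqrt(
        math.log(parent_visits) / child_visits)

def binary_reward(j_value, best_so_far):
    """Paper's reward: 1 if the rollout beats the best J(m) found so far."""
    return 1.0 if j_value > best_so_far else 0.0
```

<p>The binary reward makes the search greedy toward record-breaking molecules rather than toward a high average score.</p>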
<h3 id="the-penalized-logp-objective">The Penalized logP Objective</h3>
<p>Both methods optimize the penalized logP score $J(\mathbf{m})$:</p>
<p>$$
J(\mathbf{m}) = \log P(\mathbf{m}) - \text{SA}(\mathbf{m}) - \text{RingPenalty}(\mathbf{m})
$$</p>
<p>where $\log P(\mathbf{m})$ is the <a href="https://en.wikipedia.org/wiki/Partition_coefficient">octanol-water partition coefficient</a> predicted by RDKit, $\text{SA}(\mathbf{m})$ is a synthetic accessibility score, and $\text{RingPenalty}(\mathbf{m})$ penalizes unrealistically large rings by reducing the score by $\text{RingSize} - 6$ for each oversized ring. Each property is normalized to zero mean and unit standard deviation across the ZINC dataset.</p>
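<p>A minimal sketch of the scoring function, assuming logP and the SA score have already been computed (RDKit supplies both in practice). The normalization statistics below are placeholders for illustration, not the actual ZINC values.</p>

```python
def penalized_logp(logp, sa_score, ring_sizes, stats):
    """J(m) = normalized logP - normalized SA - normalized ring penalty.

    `stats` maps each component to its (mean, std) over ZINC.
    """
    # Each ring larger than six atoms contributes (size - 6) to the penalty.
    ring_penalty = sum(size - 6 for size in ring_sizes if size > 6)

    def z(x, key):
        mean, std = stats[key]
        return (x - mean) / std

    return z(logp, "logp") - z(sa_score, "sa") - z(ring_penalty, "ring")

# Placeholder normalization statistics, for illustration only.
STATS = {"logp": (2.46, 1.45), "sa": (3.05, 0.83), "ring": (0.04, 0.29)}
print(round(penalized_logp(3.0, 2.5, [6, 5], STATS), 3))  # 1.173
```
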
<h2 id="experimental-setup-and-comparisons-to-ml-methods">Experimental Setup and Comparisons to ML Methods</h2>
<h3 id="gb-ga-experiments">GB-GA Experiments</h3>
<p>Ten GA simulations were performed with a population size of 20 over 50 generations (1000 $J(\mathbf{m})$ evaluations per run). The initial mating pool was 20 random molecules from the first 1000 molecules in ZINC. Two mutation rates were tested: 50% and 1%.</p>
<h3 id="gb-gm-mcts-experiments">GB-GM-MCTS Experiments</h3>
<p>Ten simulations used ethane as a seed molecule with 1000 tree traversals per run. Additional experiments used 5000 traversals and an adjusted probability of generating $\text{C}=\text{C}-\text{C}$ ring patterns (increased from 63% to 80%).</p>
<h3 id="baselines">Baselines</h3>
<p>Results were compared to those compiled by Yang et al. (2017):</p>
<ul>
<li>ChemTS (RNN + MCTS)</li>
<li>RNN with and without Bayesian optimization</li>
<li><a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Continuous VAE (CVAE)</a></li>
<li><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE (GVAE)</a></li>
<li>Graph convolutional policy network (GCPN, from You et al. 2018)</li>
</ul>
<h3 id="key-results">Key Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Average $J(\mathbf{m})$</th>
          <th>Molecules Evaluated</th>
          <th>CPU Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GB-GA (50% mutation)</td>
          <td>6.8 +/- 0.7</td>
          <td>1000</td>
          <td>30 seconds</td>
      </tr>
      <tr>
          <td>GB-GA (1% mutation)</td>
          <td>7.4 +/- 0.9</td>
          <td>1000</td>
          <td>30 seconds</td>
      </tr>
      <tr>
          <td>GB-GM-MCTS (63%)</td>
          <td>2.6 +/- 0.6</td>
          <td>1000</td>
          <td>90 seconds</td>
      </tr>
      <tr>
          <td>GB-GM-MCTS (80%)</td>
          <td>3.4 +/- 0.6</td>
          <td>1000</td>
          <td>90 seconds</td>
      </tr>
      <tr>
          <td>GB-GM-MCTS (80%)</td>
          <td>4.3 +/- 0.6</td>
          <td>5000</td>
          <td>9 minutes</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.9 +/- 0.5</td>
          <td>~5000</td>
          <td>2 hours</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>5.6 +/- 0.5</td>
          <td>~20000</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>RNN + BO</td>
          <td>4.5 +/- 0.2</td>
          <td>~4000</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>Only RNN</td>
          <td>4.8 +/- 0.2</td>
          <td>~20000</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>CVAE + BO</td>
          <td>0.0 +/- 0.9</td>
          <td>~100</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>GVAE + BO</td>
          <td>0.2 +/- 1.3</td>
          <td>~1000</td>
          <td>8 hours</td>
      </tr>
  </tbody>
</table>
<p>The GB-GA with 1% mutation rate achieved an average maximum $J(\mathbf{m})$ of 7.4, which is 1.8 units higher than the best ML result (ChemTS at 5.6) while using 20x fewer evaluations and completing in 30 seconds versus 8 hours. The two highest-scoring individual molecules found by GB-GA had $J(\mathbf{m})$ scores of 8.8 and 8.5, exceeding the 7.8-8.0 range found by the GCPN approach. These molecules bore little resemblance to the initial mating pool (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarities</a> of 0.27 and 0.12 to the most similar ZINC molecules), indicating that the GA traversed a large distance in chemical space in just 50 generations.</p>
<p>The GB-GM-MCTS performed below ChemTS at equal evaluations (4.3 vs. 4.9 at 5000 evaluations) but was substantially faster (9 minutes vs. 2 hours). The MCTS approach tended to extract the dominant hydrophobic structural motif (benzene rings) from the training set, making it more dependent on training set composition than the GA.</p>
<h2 id="simple-methods-set-a-high-bar-for-molecular-optimization">Simple Methods Set a High Bar for Molecular Optimization</h2>
<p>The central finding is that a simple graph-based genetic algorithm outperforms all tested ML-based generative models on penalized logP optimization, both in terms of solution quality and computational efficiency. The GB-GA achieves higher $J(\mathbf{m})$ scores with 1000 evaluations in 30 seconds than ML methods achieve with 20,000 evaluations over 8 hours.</p>
<p>Several additional observations emerge:</p>
<ol>
<li><strong>Chemical space traversal</strong>: The GB-GA can reach high-scoring molecules that are structurally distant from the starting population, with Tanimoto similarity as low as 0.12 to the nearest ZINC molecule.</li>
<li><strong>Mutation rate matters</strong>: A 1% mutation rate outperformed a 50% rate (7.4 vs. 6.8), suggesting that preserving more parental structure during crossover is beneficial for this objective.</li>
<li><strong>Training set dependence</strong>: The GB-GM-MCTS is more sensitive to training set composition than the GA. Its preference for benzene-ring-containing molecules (the dominant ZINC motif) limits its ability to discover alternative structural solutions like the long aliphatic chains favored by the GA.</li>
<li><strong>Generalizability caveat</strong>: Jensen explicitly notes that these comparisons cover only one property (penalized logP) and that similar comparisons for other properties are needed before drawing general conclusions.</li>
</ol>
<p>The paper&rsquo;s influence has been substantial: it helped establish the expectation that new molecular generative models should be benchmarked against genetic algorithm baselines, a position subsequently reinforced by Brown et al. (2019) in <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and by <a href="/notes/chemistry/molecular-design/generation/search-based/genetic-algorithms-molecule-generation-baselines/">Tripp and Hernandez-Lobato (2023)</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial mating pool / reference set</td>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a> (subset)</td>
          <td>First 1000 molecules</td>
          <td>Same subset used in previous studies (Gomez-Bombarelli et al., Yang et al.)</td>
      </tr>
      <tr>
          <td>Target molecule size</td>
          <td>Derived from Yang et al. results</td>
          <td>20 molecules</td>
          <td>Mean 39.15, SD 3.50 non-H atoms</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>GB-GA</strong>: Population size 20, 50 generations, mutation rates of 1% and 50% tested. Crossovers at ring and non-ring bonds with equal probability. Seven mutation types with specified probabilities. Molecules selected from mating pool based on normalized logP scores.</li>
<li><strong>GB-GM</strong>: Atom-by-atom growth using probabilistic rules derived from ZINC bonding analysis. Ring creation probability 63% (matching ZINC), with 80% variant also tested. Seed molecule: ethane.</li>
<li><strong>MCTS</strong>: Modified from haroldsultan/MCTS Python implementation. Leaf parallelization with max 25 leaf nodes. Exploration factor $1/\sqrt{2}$. Binary reward function (1 if new best, 0 otherwise).</li>
<li><strong>Property calculation</strong>: logP, SA score, and ring penalty all computed via RDKit. Each property normalized to zero mean and unit standard deviation across ZINC.</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. The GB-GA and GB-GM are purely algorithmic approaches parameterized by bonding statistics from the ZINC dataset.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GB-GA (1%)</th>
          <th>Best ML (ChemTS)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Average max $J(\mathbf{m})$</td>
          <td>7.4 +/- 0.9</td>
          <td>5.6 +/- 0.5</td>
          <td>Over 10 runs</td>
      </tr>
      <tr>
          <td>Single best $J(\mathbf{m})$</td>
          <td>8.8</td>
          <td>~8.0 (GCPN)</td>
          <td>GB-GA vs. You et al.</td>
      </tr>
      <tr>
          <td>Evaluations per run</td>
          <td>1000</td>
          <td>~20,000</td>
          <td>20x fewer for GB-GA</td>
      </tr>
      <tr>
          <td>CPU time per run</td>
          <td>30 seconds</td>
          <td>8 hours</td>
          <td>~960x faster</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All GB-GA and GB-GM experiments were run on a laptop. No GPU required. The GB-GA completes in 30 seconds per run and the GB-GM-MCTS in 90 seconds (1000 traversals) to 9 minutes (5000 traversals).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jensengroup/GB-GA/tree/v0.0">GB-GA (v0.0)</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Graph-based genetic algorithm, RDKit dependency only</td>
      </tr>
      <tr>
          <td><a href="https://github.com/jensengroup/GB-GM/tree/v0.0">GB-GM (v0.0)</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Graph-based generative model + MCTS, RDKit dependency only</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jensen, J. H. (2019). A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. <em>Chemical Science</em>, 10(12), 3567-3572. <a href="https://doi.org/10.1039/c8sc05372c">https://doi.org/10.1039/c8sc05372c</a></p>
<p><strong>Publication</strong>: Chemical Science (Royal Society of Chemistry), 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/jensengroup/GB-GA">GB-GA Code (GitHub)</a></li>
<li><a href="https://github.com/jensengroup/GB-GM">GB-GM Code (GitHub)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jensen2019graph,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jensen, Jan H.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3567--3572}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/c8sc05372c}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Genetic Algorithms as Baselines for Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/genetic-algorithms-molecule-generation-baselines/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/genetic-algorithms-molecule-generation-baselines/</guid><description>Genetic algorithms outperform many deep learning methods for molecule generation. Tripp and Hernández-Lobato propose the GA criterion.</description><content:encoded><![CDATA[<h2 id="a-position-paper-on-molecular-generation-baselines">A Position Paper on Molecular Generation Baselines</h2>
<p>This is a <strong>Position</strong> paper that argues genetic algorithms (GAs) are underused and underappreciated as baselines in the molecular generation community. The primary contribution is empirical evidence that a simple GA implementation (MOL_GA) matches or outperforms many sophisticated deep learning methods on standard benchmarks. The authors propose the &ldquo;GA criterion&rdquo; as a minimum bar for evaluating new molecular generation algorithms.</p>
<h2 id="why-molecular-generation-may-be-easier-than-assumed">Why Molecular Generation May Be Easier Than Assumed</h2>
<p>Drug discovery is fundamentally a molecular generation task, and many machine learning methods have been proposed for it (Du et al., 2022). The problem has many variants, from unconditional generation of novel molecules to directed optimization of specific molecular properties.</p>
<p>The authors observe that generating valid molecules is, in some respects, straightforward. The rules governing molecular validity are well-defined bond constraints that can be checked using standard cheminformatics software like <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>. This means new molecules can be generated simply by adding, removing, or substituting fragments of known molecules. When applied iteratively, this is exactly what a genetic algorithm does. Despite this, many papers in the field propose complex deep learning methods without adequately comparing to simple GA baselines.</p>
<h2 id="the-ga-criterion-for-evaluating-new-methods">The GA Criterion for Evaluating New Methods</h2>
<p>The core proposal is the <strong>GA criterion</strong>: new methods in molecular generation should offer some clear advantage over genetic algorithms. This advantage can be:</p>
<ul>
<li><strong>Empirical</strong>: outperforming GAs on relevant benchmarks</li>
<li><strong>Conceptual</strong>: identifying and overcoming a specific limitation of randomly modifying known molecules</li>
</ul>
<p>The authors argue that the current state of molecular generation research reflects poor empirical practices, where comprehensive baseline evaluation is treated as optional rather than essential.</p>
<h2 id="genetic-algorithm-framework-and-benchmark-experiments">Genetic Algorithm Framework and Benchmark Experiments</h2>
<h3 id="how-genetic-algorithms-work-for-molecules">How Genetic Algorithms Work for Molecules</h3>
<p>GAs operate through the following iterative procedure:</p>
<ol>
<li>Start with an initial population $P$ of molecules</li>
<li>Sample a subset $S \subseteq P$ from the population (possibly biased toward better molecules)</li>
<li>Generate new molecules $N$ from $S$ via mutation and crossover operations</li>
<li>Select a new population $P'$ from $P \cup N$ (e.g., keep the highest-scoring molecules)</li>
<li>Set $P \leftarrow P'$ and repeat from step 2</li>
</ol>
<p>The MOL_GA implementation uses:</p>
<ul>
<li><strong>Quantile-based sampling</strong> (step 2): molecules are sampled from the top quantiles of the population using a log-uniform distribution over quantile thresholds:</li>
</ul>
<p>$$
u \sim \mathcal{U}[-3, 0], \quad \epsilon = 10^{u}
$$</p>
<p>A molecule is drawn uniformly from the top $\epsilon$ fraction of the population.</p>
<ul>
<li><strong>Mutation and crossover</strong> (step 3): graph-based operations from <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Jensen (2019)</a>, as implemented in the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol benchmark (Brown et al., 2019)</a></li>
<li><strong>Greedy population selection</strong> (step 4): molecules with the highest scores are retained</li>
</ul>
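<p>The quantile-sampling rule and the overall loop can be sketched as follows. Here <code>offspring_fn</code> stands in for the graph-based mutation and crossover operators of Jensen (2019), and molecules are treated as abstract objects; the real MOL_GA operates on RDKit molecular graphs.</p>

```python
import math
import random

def quantile_sample(population, scores, rng):
    """MOL_GA sampling: u ~ U[-3, 0], eps = 10**u, then draw uniformly
    from the top-eps fraction of the population by score."""
    eps = 10 ** rng.uniform(-3.0, 0.0)
    k = max(1, math.ceil(eps * len(population)))
    ranked = sorted(zip(scores, population), key=lambda p: p[0], reverse=True)
    return rng.choice(ranked[:k])[1]

def run_ga(init_pop, score_fn, offspring_fn, n_iters, pop_size, rng):
    """Generic GA loop: sample parents, generate offspring, keep the best."""
    pop = list(init_pop)
    for _ in range(n_iters):
        scores = [score_fn(m) for m in pop]
        parents = [quantile_sample(pop, scores, rng) for _ in range(2)]
        pop.extend(offspring_fn(parents, rng))
        # Greedy population selection: retain only the highest scorers.
        pop = sorted(pop, key=score_fn, reverse=True)[:pop_size]
    return pop
```

<p>Note how a small offspring count per iteration (as in MOL_GA's generation size of 5) buys more selection rounds out of a fixed evaluation budget.</p>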
<h3 id="unconditional-generation-on-zinc-250k">Unconditional Generation on ZINC 250K</h3>
<p>The first experiment evaluates unconditional molecule generation, where the task is to produce novel, valid, and unique molecules distinct from a reference set (ZINC 250K). Success is measured by validity, novelty (at 10,000 generated molecules), and uniqueness.</p>
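<p>These three metrics can be computed with a short helper. Exact definitions vary slightly between papers; this sketch uses the common conventions (uniqueness among valid molecules, novelty among unique ones) and takes validity as a caller-supplied predicate, which in practice would be RDKit SMILES parsing.</p>

```python
def generation_metrics(generated, reference, is_valid):
    """Validity, uniqueness, and novelty of a list of generated strings.

    `is_valid` is a caller-supplied predicate (RDKit parsing in practice).
    """
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)                 # distinct valid molecules
    novel = unique - set(reference)     # not present in the reference set
    n = len(generated)
    return {
        "validity": len(valid) / n,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

m = generation_metrics(["CC", "CC", "CCO", "??"], ["CC"], lambda s: "?" not in s)
print(m)  # validity 0.75, uniqueness ~0.67, novelty 0.5
```
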
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Paper</th>
          <th>Validity</th>
          <th>Novelty@10k</th>
          <th>Uniqueness</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>Jin et al. (2018)</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>You et al. (2018)</td>
          <td>100%</td>
          <td>100%</td>
          <td>99.97%</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/">MolecularRNN</a></td>
          <td>Popova et al. (2019)</td>
          <td>100%</td>
          <td>100%</td>
          <td>99.89%</td>
      </tr>
      <tr>
          <td>Graph NVP</td>
          <td>Madhawa et al. (2019)</td>
          <td>100%</td>
          <td>100%</td>
          <td>94.80%</td>
      </tr>
      <tr>
          <td>Graph AF</td>
          <td>Shi et al. (2020)</td>
          <td>100%</td>
          <td>100%</td>
          <td>99.10%</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>Zang and Wang (2020)</td>
          <td>100%</td>
          <td>100%</td>
          <td>99.99%</td>
      </tr>
      <tr>
          <td>GraphCNF</td>
          <td>Lippe and Gavves (2020)</td>
          <td>96.35%</td>
          <td>99.98%</td>
          <td>99.98%</td>
      </tr>
      <tr>
          <td>Graph DF</td>
          <td>Luo et al. (2021)</td>
          <td>100%</td>
          <td>100%</td>
          <td>99.16%</td>
      </tr>
      <tr>
          <td>ModFlow</td>
          <td>Verma et al. (2022)</td>
          <td>98.1%</td>
          <td>100%</td>
          <td>99.3%</td>
      </tr>
      <tr>
          <td>GraphEBM</td>
          <td>Liu et al. (2021)</td>
          <td>99.96%</td>
          <td>100%</td>
          <td>98.79%</td>
      </tr>
      <tr>
          <td>AddCarbon</td>
          <td>Renz et al. (2019)</td>
          <td>100%</td>
          <td>99.94%</td>
          <td>99.86%</td>
      </tr>
      <tr>
          <td>MOL_GA</td>
          <td>(this paper)</td>
          <td>99.76%</td>
          <td>99.94%</td>
          <td>98.60%</td>
      </tr>
  </tbody>
</table>
<p>All methods perform near 100% on all metrics, demonstrating that unconditional molecule generation is not a particularly discriminative benchmark. The authors note that generation speed (molecules per second) is an important missing dimension from these comparisons, where simple methods like GAs have a clear advantage.</p>
<h3 id="molecule-optimization-on-the-pmo-benchmark">Molecule Optimization on the PMO Benchmark</h3>
<p>The second experiment evaluates directed molecule optimization on the <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO) benchmark (Gao et al., 2022)</a>, which measures the ability to find molecules optimizing a scalar objective function $f: \mathcal{M} \mapsto \mathbb{R}$ with a budget of 10,000 evaluations.</p>
<p>A key insight is that previous GA implementations in PMO used large generation sizes ($\approx 100$), which limits the number of improvement iterations. The authors set the generation size to 5, allowing approximately 2,000 iterations of improvement within the same evaluation budget.</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></th>
          <th>Graph GA</th>
          <th>MOL_GA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>albuterol_similarity</td>
          <td>0.882 +/- 0.006</td>
          <td>0.838 +/- 0.016</td>
          <td><strong>0.896 +/- 0.035</strong></td>
      </tr>
      <tr>
          <td>amlodipine_mpo</td>
          <td>0.635 +/- 0.035</td>
          <td>0.661 +/- 0.020</td>
          <td><strong>0.688 +/- 0.039</strong></td>
      </tr>
      <tr>
          <td>celecoxib_rediscovery</td>
          <td><strong>0.713 +/- 0.067</strong></td>
          <td>0.630 +/- 0.097</td>
          <td>0.567 +/- 0.083</td>
      </tr>
      <tr>
          <td>drd2</td>
          <td>0.945 +/- 0.007</td>
          <td><strong>0.964 +/- 0.012</strong></td>
          <td>0.936 +/- 0.016</td>
      </tr>
      <tr>
          <td>fexofenadine_mpo</td>
          <td>0.784 +/- 0.006</td>
          <td>0.760 +/- 0.011</td>
          <td><strong>0.825 +/- 0.019</strong></td>
      </tr>
      <tr>
          <td>isomers_c9h10n2o2pf2cl</td>
          <td>0.642 +/- 0.054</td>
          <td>0.719 +/- 0.047</td>
          <td><strong>0.865 +/- 0.012</strong></td>
      </tr>
      <tr>
          <td>sitagliptin_mpo</td>
          <td>0.021 +/- 0.003</td>
          <td>0.433 +/- 0.075</td>
          <td><strong>0.582 +/- 0.040</strong></td>
      </tr>
      <tr>
          <td>zaleplon_mpo</td>
          <td>0.358 +/- 0.062</td>
          <td>0.346 +/- 0.032</td>
          <td><strong>0.519 +/- 0.029</strong></td>
      </tr>
      <tr>
          <td><strong>Sum (23 tasks)</strong></td>
          <td>14.196</td>
          <td>13.751</td>
          <td><strong>14.708</strong></td>
      </tr>
      <tr>
          <td><strong>Rank</strong></td>
          <td>2</td>
          <td>3</td>
          <td><strong>1</strong></td>
      </tr>
  </tbody>
</table>
<p>MOL_GA achieves the highest aggregate score across all 23 PMO tasks, outperforming both the previous best GA (Graph GA) and the previous best overall method (REINVENT). The authors attribute this partly to insufficient hyperparameter tuning of the baselines in PMO rather than to MOL_GA being an especially strong method, since MOL_GA is essentially the same algorithm as Graph GA with different hyperparameters.</p>
<h2 id="implications-for-molecular-generation-research">Implications for Molecular Generation Research</h2>
<p>The key findings and arguments are:</p>
<ol>
<li>
<p><strong>GAs match or outperform deep learning methods</strong> on standard molecular generation benchmarks, both for unconditional generation and directed optimization.</p>
</li>
<li>
<p><strong>Hyperparameter choices matter significantly</strong>: MOL_GA&rsquo;s strong performance on PMO comes partly from using a smaller generation size (5 vs. ~100), which allows more iterations of refinement within the same evaluation budget.</p>
</li>
<li>
<p><strong>The GA criterion should be enforced in peer review</strong>: new molecular generation methods should demonstrate a clear advantage over GAs, whether empirical or conceptual.</p>
</li>
<li>
<p><strong>Deep learning methods may implicitly do what GAs do explicitly</strong>: many generative models are trained on datasets of known molecules, so the novel molecules they produce may simply be variants of their training data. The authors consider this an important direction for future investigation.</p>
</li>
<li>
<p><strong>Poor empirical practices are widespread</strong>: the paper argues that many experiments in molecule generation are conducted with an explicit desired outcome (that the novel algorithm is the best), leading to inadequate baseline comparisons.</p>
</li>
</ol>
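<p>The generation-size tradeoff in point 2 can be made concrete with one line of arithmetic (a toy sketch, not from the paper's code; the function name is mine):</p>

```python
# With a fixed oracle budget, a smaller generation size buys more
# rounds of refinement within the same number of evaluations.
def num_iterations(generation_size: int, budget: int = 10_000) -> int:
    """Number of GA iterations possible under the evaluation budget."""
    return budget // generation_size

print(num_iterations(5))    # MOL_GA-style setting: 2000 iterations
print(num_iterations(100))  # typical larger generation size: 100 iterations
```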
<p>The authors are careful to note that this result should not be read as evidence that GAs are exceptional algorithms. Rather, it indicates that more complex methods have made surprisingly little progress beyond what simple heuristic search can achieve.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unconditional generation</td>
          <td>ZINC 250K</td>
          <td>250,000 molecules</td>
          <td>Reference set for novelty evaluation</td>
      </tr>
      <tr>
          <td>Directed optimization</td>
          <td>PMO benchmark</td>
          <td>23 tasks</td>
          <td>10,000 evaluation budget per task</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>GA implementation</strong>: MOL_GA package, using graph-based mutation and crossover from Jensen (2019) via the GuacaMol implementation</li>
<li><strong>Generation size</strong>: 5 molecules per iteration (allowing ~2,000 iterations with 10,000 evaluations)</li>
<li><strong>Population selection</strong>: Greedy (highest-scoring molecules retained)</li>
<li><strong>Sampling</strong>: Quantile-based with log-uniform distribution over quantile thresholds</li>
</ul>
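<p>The loop described by the bullets above can be sketched in a few lines of Python. This is a hedged toy version, not the MOL_GA implementation: <code>score</code> and <code>mutate</code> are hypothetical string-based stand-ins for the PMO oracles and the Jensen (2019) graph operators (which require RDKit), and the exact range of the log-uniform quantile distribution is my reading of the bullet, not verified against the package.</p>

```python
import math
import random

random.seed(0)

def score(s: str) -> float:
    """Toy stand-in oracle: fraction of 'C' characters (real runs use PMO objectives)."""
    return s.count("C") / max(len(s), 1)

def mutate(s: str) -> str:
    """Toy mutation; MOL_GA uses graph-based mutation/crossover instead."""
    i = random.randrange(len(s))
    return s[:i] + random.choice("CNO") + s[i + 1:]

def sample_parent(population: list[tuple[float, str]]) -> str:
    """Quantile-based sampling: draw a quantile threshold log-uniformly,
    then pick uniformly among the molecules above that quantile."""
    n = len(population)
    q = math.exp(random.uniform(math.log(1.0 / n), 0.0))  # log-uniform in [1/n, 1]
    k = max(1, int(q * n))  # size of the top-k slice to sample from
    top = sorted(population, reverse=True)[:k]
    return random.choice(top)[1]

def run_ga(start_pop, budget=10_000, generation_size=5, population_size=100):
    population = [(score(s), s) for s in start_pop]
    evals = len(population)
    while evals + generation_size <= budget:
        children = [mutate(sample_parent(population)) for _ in range(generation_size)]
        population += [(score(c), c) for c in children]
        evals += generation_size
        # Greedy population selection: retain only the highest-scoring molecules.
        population = sorted(population, reverse=True)[:population_size]
    return population

best_score, best = max(run_ga(["CCO", "CCN", "COC"], budget=500))
```

<p>Greedy selection keeps only the top-scoring molecules, while the log-uniform quantile sampling biases parent choice toward, but not exclusively to, the best individuals.</p>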
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Benchmark</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity, Novelty@10k, Uniqueness</td>
          <td>ZINC 250K unconditional</td>
          <td>Calculated using <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES package</a></td>
      </tr>
      <tr>
          <td>AUC top-10 scores</td>
          <td>PMO benchmark</td>
          <td>23 optimization tasks with 10,000 evaluation budget</td>
      </tr>
  </tbody>
</table>
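<p>The AUC top-10 metric can be sketched as follows (my paraphrase of the PMO definition, not the benchmark's reference implementation; normalization details for under-spent budgets are omitted): at each oracle call, record the mean of the 10 best scores seen so far, then average that curve over the calls.</p>

```python
import heapq

def auc_top10(scores: list[float]) -> float:
    """Average, over oracle calls, of the running mean of the
    10 best scores seen so far (assumes oracles scaled to [0, 1])."""
    top10: list[float] = []  # min-heap holding the 10 best scores so far
    means = []
    for s in scores:
        if len(top10) < 10:
            heapq.heappush(top10, s)
        elif s > top10[0]:
            heapq.heapreplace(top10, s)
        means.append(sum(top10) / len(top10))
    return sum(means) / len(means)

# A method that scores 1.0 from the very first call achieves AUC 1.0.
print(auc_top10([1.0] * 100))
```

<p>Because the metric rewards finding good molecules <em>early</em> in the budget, a GA that completes ~2,000 small generations can accumulate area faster than one that completes ~100 large ones.</p>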
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements. Given that GAs are computationally lightweight compared to deep learning methods, standard CPU hardware is likely sufficient.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/AustinT/mol_ga">MOL_GA</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Python package for molecular genetic algorithms</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/mol-ga/">MOL_GA on PyPI</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>pip-installable package</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tripp, A., &amp; Hernández-Lobato, J. M. (2023). Genetic algorithms are strong baselines for molecule generation. <em>arXiv preprint arXiv:2310.09267</em>. <a href="https://arxiv.org/abs/2310.09267">https://arxiv.org/abs/2310.09267</a></p>
<p><strong>Publication</strong>: arXiv preprint, 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/AustinT/mol_ga">MOL_GA Python Package (GitHub)</a></li>
<li><a href="https://pypi.org/project/mol-ga/">MOL_GA on PyPI</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{tripp2023genetic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Genetic algorithms are strong baselines for molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tripp, Austin and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2310.09267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>