A Graph-Based Approach to Molecular Optimization

This is a Method paper that introduces two graph-based approaches for exploring chemical space: a genetic algorithm (GB-GA) and a generative model combined with Monte Carlo tree search (GB-GM-MCTS). The primary contribution is demonstrating that these non-ML, graph-based methods can match or exceed the performance of contemporary ML-based generative models for molecular property optimization, while being several orders of magnitude faster. The paper provides open-source implementations built on the RDKit cheminformatics package. The two approaches explore chemical space using direct graph manipulations rather than string-based representations like SMILES.

Why Compare Simple Baselines to ML Generative Models?

By 2018, several ML-based generative models for molecules had been published, including VAEs, RNNs, and graph convolutional policy networks. However, these models were rarely compared against traditional optimization approaches such as genetic algorithms. Jensen identifies this gap explicitly: while ML generative model performance had been impressive, the lack of comparison to simpler baselines made it difficult to assess whether the complexity of ML approaches was justified.

A practical barrier to such comparisons was the absence of free, open-source GA implementations for molecular optimization (the existing ACSESS algorithm required proprietary OpenEye toolkits). This paper fills that gap by providing RDKit-based implementations of both the GB-GA and GB-GM-MCTS.

GB-GA: Crossovers and Mutations on Molecular Graphs

The GB-GA operates directly on molecular graph representations (not string representations like SMILES). It combines ideas from Brown et al. (2004) and the ACSESS algorithm of Virshup et al. (2013).

Crossovers can occur at two types of positions with equal probability:

  • Non-ring bonds: a molecule is cut at a non-ring bond, and fragments from two parent molecules are recombined
  • Ring bonds: adjacent bonds or bonds separated by one bond are cut, and fragments are mated using single or double bonds

Mutations include seven operation types, each with specified probabilities:

  • Append atom (15%): adds an atom with a single, double, or triple bond
  • Insert atom (15%): inserts an atom into an existing bond
  • Delete atom (14%): removes an atom, reconnecting neighbors
  • Change atom type (14%): swaps element identity (C, N, O, F, S, Cl, Br)
  • Change bond order (14%): toggles between single, double, and triple bonds
  • Delete ring bond (14%): opens a ring
  • Add ring bond (14%): closes a new ring

Molecules with macrocycles (seven or more atoms), allene centers in rings, fewer than five heavy atoms, incorrect valences, or more non-H atoms than the target size are discarded. The target size is sampled from a normal distribution with mean 39.15 and standard deviation 3.50 non-H atoms, calibrated to match the molecules found by Yang et al. (2017).

The GB-GM grows molecules one atom at a time, with the choice of bond order and atom type determined probabilistically from a bonding analysis of a reference dataset (the first 1000 molecules from ZINC). Since 63% of atoms in the reference set are ring atoms, ring-creation or ring-insertion mutations are chosen 63% of the time.

The generative model is combined with a Monte Carlo tree search where:

  • Each node corresponds to an atom addition step
  • Leaf parallelization uses a maximum of 25 leaf nodes
  • The exploration factor is $1 / \sqrt{2}$
  • Rollout terminates if the molecule exceeds the target size
  • The reward function returns 1 if the predicted $J(\mathbf{m})$ value exceeds the largest value found so far, and 0 otherwise

The Penalized logP Objective

Both methods optimize the penalized logP score $J(\mathbf{m})$:

$$ J(\mathbf{m}) = \log P(\mathbf{m}) - \text{SA}(\mathbf{m}) - \text{RingPenalty}(\mathbf{m}) $$

where $\log P(\mathbf{m})$ is the octanol-water partition coefficient predicted by RDKit, $\text{SA}(\mathbf{m})$ is a synthetic accessibility score, and $\text{RingPenalty}(\mathbf{m})$ penalizes unrealistically large rings by reducing the score by $\text{RingSize} - 6$ for each oversized ring. Each property is normalized to zero mean and unit standard deviation across the ZINC dataset.

Experimental Setup and Comparisons to ML Methods

GB-GA Experiments

Ten GA simulations were performed with a population size of 20 over 50 generations (1000 $J(\mathbf{m})$ evaluations per run). The initial mating pool was 20 random molecules from the first 1000 molecules in ZINC. Two mutation rates were tested: 50% and 1%.

GB-GM-MCTS Experiments

Ten simulations used ethane as a seed molecule with 1000 tree traversals per run. Additional experiments used 5000 traversals and an adjusted probability of generating $\text{C}=\text{C}-\text{C}$ ring patterns (increased from 62% to 80%).

Baselines

Results were compared to those compiled by Yang et al. (2017):

Key Results

MethodAverage $J(\mathbf{m})$Molecules EvaluatedCPU Time
GB-GA (50% mutation)6.8 +/- 0.7100030 seconds
GB-GA (1% mutation)7.4 +/- 0.9100030 seconds
GB-GM-MCTS (62%)2.6 +/- 0.6100090 seconds
GB-GM-MCTS (80%)3.4 +/- 0.6100090 seconds
GB-GM-MCTS (80%)4.3 +/- 0.650009 minutes
ChemTS4.9 +/- 0.5~50002 hours
ChemTS5.6 +/- 0.5~200008 hours
RNN + BO4.5 +/- 0.2~40008 hours
Only RNN4.8 +/- 0.2~200008 hours
CVAE + BO0.0 +/- 0.9~1008 hours
GVAE + BO0.2 +/- 1.3~10008 hours

The GB-GA with 1% mutation rate achieved an average maximum $J(\mathbf{m})$ of 7.4, which is 1.8 units higher than the best ML result (ChemTS at 5.6) while using 20x fewer evaluations and completing in 30 seconds versus 8 hours. The two highest-scoring individual molecules found by GB-GA had $J(\mathbf{m})$ scores of 8.8 and 8.5, exceeding the 7.8-8.0 range found by the GCPN approach. These molecules bore little resemblance to the initial mating pool (Tanimoto similarities of 0.27 and 0.12 to the most similar ZINC molecules), indicating that the GA traversed a large distance in chemical space in just 50 generations.

The GB-GM-MCTS performed below ChemTS at equal evaluations (4.3 vs. 4.9 at 5000 evaluations) but was several orders of magnitude faster (9 minutes vs. 2 hours). The MCTS approach tended to extract the dominant hydrophobic structural motif (benzene rings) from the training set, making it more dependent on training set composition than the GA.

Simple Methods Set a High Bar for Molecular Optimization

The central finding is that a simple graph-based genetic algorithm outperforms all tested ML-based generative models on penalized logP optimization, both in terms of solution quality and computational efficiency. The GB-GA achieves higher $J(\mathbf{m})$ scores with 1000 evaluations in 30 seconds than ML methods achieve with 20,000 evaluations over 8 hours.

Several additional observations emerge:

  1. Chemical space traversal: The GB-GA can reach high-scoring molecules that are structurally distant from the starting population, with Tanimoto similarity as low as 0.12 to the nearest ZINC molecule.
  2. Mutation rate matters: A 1% mutation rate outperformed a 50% rate (7.4 vs. 6.8), suggesting that preserving more parental structure during crossover is beneficial for this objective.
  3. Training set dependence: The GB-GM-MCTS is more sensitive to training set composition than the GA. Its preference for benzene-ring-containing molecules (the dominant ZINC motif) limits its ability to discover alternative structural solutions like the long aliphatic chains favored by the GA.
  4. Generalizability caveat: Jensen explicitly notes that these comparisons cover only one property (penalized logP) and that similar comparisons for other properties are needed before drawing general conclusions.

The paper’s influence has been substantial: it helped establish the expectation that new molecular generative models should be benchmarked against genetic algorithm baselines, a position subsequently reinforced by Brown et al. (2019) in GuacaMol and by Tripp and Hernandez-Lobato (2023).


Reproducibility Details

Data

PurposeDatasetSizeNotes
Initial mating pool / reference setZINC (subset)First 1000 moleculesSame subset used in previous studies (Gomez-Bombarelli et al., Yang et al.)
Target molecule sizeDerived from Yang et al. results20 moleculesMean 39.15, SD 3.50 non-H atoms

Algorithms

  • GB-GA: Population size 20, 50 generations, mutation rates of 1% and 50% tested. Crossovers at ring and non-ring bonds with equal probability. Seven mutation types with specified probabilities. Molecules selected from mating pool based on normalized logP scores.
  • GB-GM: Atom-by-atom growth using probabilistic rules derived from ZINC bonding analysis. Ring creation probability 63% (matching ZINC), with 80% variant also tested. Seed molecule: ethane.
  • MCTS: Modified from haroldsultan/MCTS Python implementation. Leaf parallelization with max 25 leaf nodes. Exploration factor $1/\sqrt{2}$. Binary reward function (1 if new best, 0 otherwise).
  • Property calculation: logP, SA score, and ring penalty all computed via RDKit. Each property normalized to zero mean and unit standard deviation across ZINC.

Models

No neural network models are used. The GB-GA and GB-GM are purely algorithmic approaches parameterized by bonding statistics from the ZINC dataset.

Evaluation

MetricGB-GA (1%)Best ML (ChemTS)Notes
Average max $J(\mathbf{m})$7.4 +/- 0.95.6 +/- 0.5Over 10 runs
Single best $J(\mathbf{m})$8.8~8.0 (GCPN)GB-GA vs. You et al.
Evaluations per run1000~20,00020x fewer for GB-GA
CPU time per run30 seconds8 hours~960x faster

Hardware

All GB-GA and GB-GM experiments were run on a laptop. No GPU required. The GB-GA completes in 30 seconds per run and the GB-GM-MCTS in 90 seconds (1000 traversals) to 9 minutes (5000 traversals).

Artifacts

ArtifactTypeLicenseNotes
GB-GA (v0.0)CodeNot specifiedGraph-based genetic algorithm, RDKit dependency only
GB-GM (v0.0)CodeNot specifiedGraph-based generative model + MCTS, RDKit dependency only

Paper Information

Citation: Jensen, J. H. (2019). A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chemical Science, 10(12), 3567-3572. https://doi.org/10.1039/c8sc05372c

Publication: Chemical Science (Royal Society of Chemistry), 2019

Additional Resources:

Citation

@article{jensen2019graph,
  title={A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space},
  author={Jensen, Jan H.},
  journal={Chemical Science},
  volume={10},
  number={12},
  pages={3567--3572},
  year={2019},
  publisher={Royal Society of Chemistry},
  doi={10.1039/c8sc05372c}
}