GuacaMol: Benchmarking Models for De Novo Molecular Design

A Standardized Benchmark for Molecular Design

GuacaMol is a Resource paper. Its primary contribution is a standardized, open-source benchmarking framework for evaluating models for de novo molecular design. The framework defines 5 distribution-learning benchmarks and 20 goal-directed optimization benchmarks, implemented as a Python package. The authors also provide baseline results for several classical and neural generative models, establishing reference performance levels for future comparisons.

The Need for Consistent Evaluation in Generative Chemistry

By 2018, deep generative models for molecular design (VAEs, RNNs, GANs) had shown promising results, but the field lacked consistent evaluation standards. Different papers used different tasks, different datasets, and different metrics, making it difficult to compare models or assess real progress. Comparative studies between neural approaches and well-established algorithms like genetic algorithms were rare.

In other areas of machine learning, standardized benchmarks (ImageNet for vision, GLUE for NLP) had driven rapid progress by enabling fair comparisons. The de novo design community lacked an equivalent. Additionally, many existing evaluations focused on easily optimizable properties (logP, QED) that could not differentiate between models, since even simple baselines achieved near-perfect scores on those tasks.

Benchmark Design: Distribution Learning and Goal-Directed Optimization

GuacaMol separates evaluation into two independent dimensions, reflecting the two main use cases of generative models.

Distribution-Learning Benchmarks

These five benchmarks assess how well a model learns to generate molecules similar to a training set (a standardized subset of ChEMBL 24):

Validity: Fraction of generated molecules that are chemically valid (parseable by RDKit), measured over 10,000 generated samples.
Uniqueness: Fraction of unique canonical SMILES among 10,000 valid generated molecules.
Novelty: Fraction of generated molecules not present in the training set, measured over 10,000 unique samples.
Fréchet ChemNet Distance (FCD): Measures distributional similarity between generated and reference molecules using hidden representations from ChemNet (trained on biological activity prediction). The FCD score is transformed as:

$$S = \exp(-0.2 \cdot \text{FCD})$$

KL Divergence: Compares distributions of nine physicochemical descriptors (BertzCT, MolLogP, MolWt, TPSA, NumHAcceptors, NumHDonors, NumRotatableBonds, NumAliphaticRings, NumAromaticRings) plus maximum nearest-neighbor ECFP4 similarity. The final score aggregates per-descriptor KL divergences:

$$S = \frac{1}{k} \sum_{i}^{k} \exp(-D_{\text{KL}, i})$$

where $k = 9$ is the number of descriptors.

Goal-Directed Benchmarks

The 20 goal-directed benchmarks evaluate a model’s ability to generate molecules that maximize a given scoring function. These span several categories:

Rediscovery (3 tasks): Regenerate a specific target molecule (Celecoxib, Troglitazone, Thiothixene) using Tanimoto similarity on ECFP4 fingerprints.
Similarity (3 tasks): Generate many molecules similar to a target (Aripiprazole, Albuterol, Mestranol) above a threshold of 0.75.
Isomers (2 tasks): Generate molecules matching a target molecular formula ($\text{C}_{11}\text{H}_{24}$ and $\text{C}_9\text{H}_{10}\text{N}_2\text{O}_2\text{PF}_2\text{Cl}$).
Median molecules (2 tasks): Maximize similarity to two reference molecules simultaneously (camphor/menthol and tadalafil/sildenafil).
Multi-property optimization (7 tasks): Optimize combinations of similarity, physicochemical properties, and structural features for drug-relevant molecules (Osimertinib, Fexofenadine, Ranolazine, Perindopril, Amlodipine, Sitagliptin, Zaleplon).
SMARTS-based (1 task): Target molecules containing specific substructure patterns with constrained physicochemical properties (Valsartan SMARTS).
Scaffold/decorator hop (2 tasks): Modify molecular scaffolds while preserving substituent patterns, or vice versa.

The benchmark score for most goal-directed tasks combines top-1, top-10, and top-100 molecule scores:

$$S = \frac{1}{3}\left(s_1 + \frac{1}{10}\sum_{i=1}^{10} s_i + \frac{1}{100}\sum_{i=1}^{100} s_i\right)$$

where $s_i$ are molecule scores sorted in decreasing order.

Score Modifiers

Raw molecular properties are transformed via modifier functions to restrict scores to [0, 1]:

Gaussian($\mu$, $\sigma$): Targets a specific property value
MinGaussian($\mu$, $\sigma$): Full score below $\mu$, decreasing above
MaxGaussian($\mu$, $\sigma$): Full score above $\mu$, decreasing below
Thresholded($t$): Full score above threshold $t$, linear decrease below

Multi-property objectives use either arithmetic or geometric means to combine individual scores.

Baseline Models and Experimental Setup

The authors evaluate six baseline models spanning different paradigms:

Distribution-learning baselines:

Random sampler: Samples molecules directly from the dataset (provides upper/lower bounds).
SMILES LSTM: 3-layer LSTM (hidden size 1024) trained to predict next SMILES characters.
Graph MCTS: Monte Carlo Tree Search building molecules atom-by-atom.
VAE: Variational autoencoder on SMILES representations.
AAE: Adversarial autoencoder.
ORGAN: Objective-reinforced generative adversarial network.

Goal-directed baselines:

Best of dataset: Scores all training molecules and returns the best (virtual screening baseline).
SMILES LSTM: Same model with 20 iterations of hill-climbing (8192 samples per iteration, top 1024 for fine-tuning).
SMILES GA: Genetic algorithm operating on SMILES strings with grammar-based mutations.
Graph GA: Genetic algorithm operating on molecular graphs with crossover and mutation.
Graph MCTS: Monte Carlo Tree Search with 40 simulations per molecule.

The training dataset is ChEMBL 24 after filtering: salt removal, charge neutralization, SMILES length cap of 100, element restrictions, and removal of molecules similar (ECFP4 > 0.323) to 10 held-out drug molecules used in benchmarks.

Distribution-Learning Results

Benchmark	Random	SMILES LSTM	Graph MCTS	AAE	ORGAN	VAE
Validity	1.000	0.959	1.000	0.822	0.379	0.870
Uniqueness	0.997	1.000	1.000	1.000	0.841	0.999
Novelty	0.000	0.912	0.994	0.998	0.687	0.974
KL divergence	0.998	0.991	0.522	0.886	0.267	0.982
FCD	0.929	0.913	0.015	0.529	0.000	0.863

Goal-Directed Results (Selected)

Benchmark	Best of Dataset	SMILES LSTM	SMILES GA	Graph GA	Graph MCTS
Celecoxib rediscovery	0.505	1.000	0.732	1.000	0.355
Osimertinib MPO	0.839	0.907	0.886	0.953	0.784
Sitagliptin MPO	0.509	0.545	0.689	0.891	0.458
Scaffold Hop	0.738	0.998	0.885	1.000	0.478
Total (20 tasks)	12.144	17.340	14.396	17.983	9.009

Key Findings and Limitations

Main Findings

The Graph GA achieves the highest total score across goal-directed benchmarks (17.983), followed closely by the SMILES LSTM (17.340). This result is notable because genetic algorithms are well-established methods, and the LSTM-based neural approach nearly matches their optimization performance.

However, compound quality tells a different story. When examining the top 100 molecules per task through chemical quality filters (SureChEMBL, Glaxo, PAINS rules), 77% of LSTM-generated molecules pass, matching the Best of ChEMBL baseline. In contrast, Graph GA produces only 40% passing molecules, and Graph MCTS only 22%. This suggests that neural models benefit from pre-training on real molecular distributions, which encodes implicit knowledge about what constitutes a “reasonable” molecule.

ORGAN performs poorly across all distribution-learning tasks, with more than half its generated molecules being invalid. This is consistent with mode collapse, a known problem in GAN training.

Simpler generative models (LSTM, VAE) outperform more complex architectures (ORGAN, AAE) on distribution learning. Graph MCTS struggles with both distribution learning and goal-directed optimization, suggesting that single-molecule search trees are less effective than population-based approaches.

Limitations

The authors explicitly identify several issues:

Compound quality is hard to quantify: The rule-based filters used are acknowledged as “high precision, low recall” surrogates. They catch some problematic molecules but cannot encode the full breadth of medicinal chemistry expertise.
Some benchmarks are too easy: The trivially optimizable tasks (logP, QED, CNS MPO) cannot differentiate between models. All baselines achieve near-perfect scores on these.
Sample efficiency and runtime are not benchmarked: The framework does not penalize models for requiring excessive scoring function calls.
Synthesis accessibility is not addressed: Generated molecules may be valid but impractical to synthesize.

Future Directions

The authors call for harder benchmark tasks, better compound quality metrics, attention to sample efficiency and runtime constraints, and further development of graph-based neural generative models.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Training	ChEMBL 24 (post-processed)	~1.6M molecules	Salt removal, neutralization, SMILES length cap, element restrictions
Evaluation	10 held-out drug molecules	10	Removed from training set via ECFP4 similarity threshold
Quality filters	SureChEMBL, Glaxo, PAINS, in-house rules	N/A	Applied via rd_filters

Algorithms

SMILES LSTM: 3-layer LSTM, hidden size 1024; hill-climbing with 20 iterations, 8192 samples per iteration, top 1024 for fine-tuning
Graph GA: Population of 100, mating pool of 200, crossover + mutation (probability 0.5), 1000 epochs max
SMILES GA: Population of 300, offspring of 600, SMILES grammar-based mutations, 1000 epochs max
Graph MCTS: 40 simulations per molecule, 25 children per step, rollout to 60 atoms, starting from CC

Models

All baseline implementations are released as open-source code. VAE, AAE, and ORGAN implementations are from the MOSES repository.

Evaluation

All distribution-learning benchmarks sample 10,000 molecules. Goal-directed benchmarks use combinations of top-1, top-10, and top-100 scores. Compound quality is assessed via the percentage of top-100 molecules passing chemical filters.

Hardware

Hardware requirements are not specified in the paper.

Artifacts

Artifact	Type	License	Notes
GuacaMol	Code	MIT	Benchmarking framework and scoring functions
GuacaMol Baselines	Code	MIT	Baseline model implementations
ChEMBL dataset	Dataset	CC-BY-SA 3.0	Post-processed ChEMBL 24 for benchmarks
FCD package	Code	LGPL-3.0	Fréchet ChemNet Distance implementation

Paper Information

Citation: Brown, N., Fiscato, M., Segler, M. H. S., & Vaucher, A. C. (2019). GuacaMol: Benchmarking Models for De Novo Molecular Design. Journal of Chemical Information and Modeling, 59(3), 1096-1108. https://doi.org/10.1021/acs.jcim.8b00839

Additional Resources:

@article{brown2019guacamol,
  title={GuacaMol: Benchmarking Models for de Novo Molecular Design},
  author={Brown, Nathan and Fiscato, Marco and Segler, Marwin H. S. and Vaucher, Alain C.},
  journal={Journal of Chemical Information and Modeling},
  volume={59},
  number={3},
  pages={1096--1108},
  year={2019},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.8b00839}
}

A Standardized Benchmark for Molecular Design#

The Need for Consistent Evaluation in Generative Chemistry#

Benchmark Design: Distribution Learning and Goal-Directed Optimization#

Distribution-Learning Benchmarks#

Goal-Directed Benchmarks#

Score Modifiers#

Baseline Models and Experimental Setup#

Distribution-Learning Results#

Goal-Directed Results (Selected)#

Key Findings and Limitations#

Main Findings#

Limitations#

Future Directions#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Paper Information#