<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Evaluation &amp; Tools on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/</link><description>Recent content in Evaluation &amp; Tools on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/index.xml" rel="self" type="application/rss+xml"/><item><title>Re-evaluating Sample Efficiency in Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</guid><description>Thomas et al. re-evaluate generative model benchmarks for de novo drug design, adding property filters and diversity metrics that re-rank model performance.</description><content:encoded><![CDATA[<h2 id="an-empirical-re-evaluation-of-generative-model-benchmarks">An Empirical Re-evaluation of Generative Model Benchmarks</h2>
<p>This is an <strong>Empirical</strong> paper. The primary contribution is a critical reassessment of the <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO)</a> benchmark for de novo molecule generation. Rather than proposing a new generative model, the authors modify existing benchmark metrics to account for chemical desirability (molecular weight, LogP, topological novelty) and molecular diversity. They then re-evaluate all 25 generative models from the original PMO benchmark plus the recently proposed <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb (AHC)</a> method.</p>
<h2 id="sample-efficiency-and-chemical-quality-in-drug-design">Sample Efficiency and Chemical Quality in Drug Design</h2>
<p>Deep generative models for de novo molecule generation often require large numbers of oracle evaluations (up to $10^5$ samples) to optimize toward a target objective. This is a practical limitation when using computationally expensive scoring functions like molecular docking. The <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> by Gao et al. addressed this by reformulating performance as maximizing an objective within a fixed budget of 10,000 oracle calls, finding <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> to be the most sample-efficient model across 23 tasks.</p>
<p>However, the authors identify a key limitation: the PMO benchmark measures only sample efficiency without considering the chemical quality of proposed molecules. Investigating the top-performing REINVENT model on the <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> task, they find that 4 of 5 replicate runs produce molecules with molecular weight and LogP distributions far outside the training data (ZINC250k). The resulting molecules contain large structures with repeating substructures that are undesirable from a medicinal chemistry perspective. This disconnect between benchmark performance and practical utility motivates the modified evaluation metrics.</p>
<h2 id="modified-metrics-property-filters-and-diversity-requirements">Modified Metrics: Property Filters and Diversity Requirements</h2>
<p>The core innovation is the introduction of three modified AUC Top-10 metrics that extend the original PMO benchmark evaluation:</p>
<p><strong>AUC Top-10 (Filtered)</strong>: Molecules are excluded if their molecular weight or LogP falls beyond 4 standard deviations from the mean of the ZINC250k pre-training dataset ($\mu \pm 4\sigma$, covering approximately 99.99% of a normal distribution). Molecules with more than 10% de novo (unobserved in ZINC250k) ECFP4 fingerprint bits are also filtered out. This ensures the generative model does not drift beyond its applicability domain.</p>
<p><strong>AUC Top-10 (Diverse)</strong>: The top 10 molecules are selected iteratively, where a molecule is only added if its Tanimoto similarity (by ECFP4 fingerprints) to every previously selected compound does not exceed 0.35. This threshold corresponds to an approximately 80% probability that molecules above it belong to the same bioactivity class, so the selected candidates are likely to have distinct bioactivity profiles.</p>
<p><strong>AUC Top-10 (Combined)</strong>: Applies both property filters and diversity filters simultaneously, providing the most stringent evaluation of practical performance.</p>
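<p>A minimal sketch of the filtered and diverse selection steps, assuming fingerprints are represented as Python sets of ECFP4 bit indices and that scores, molecular weights, and LogP values are precomputed (the helper names and dictionary keys are illustrative, not taken from the paper's code):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def property_filter(mols, mw_mu, mw_sd, logp_mu, logp_sd, k=4.0):
    """Filtered metric: keep molecules whose MW and LogP lie within
    mu +/- k*sigma of the pre-training (ZINC250k) distribution (k = 4)."""
    return [m for m in mols
            if abs(m["mw"] - mw_mu) <= k * mw_sd
            and abs(m["logp"] - logp_mu) <= k * logp_sd]

def diverse_top_k(mols, k=10, max_sim=0.35):
    """Diverse metric: walk molecules in descending score and add one only
    if its Tanimoto similarity to every already-selected molecule does not
    exceed max_sim."""
    selected = []
    for m in sorted(mols, key=lambda m: m["score"], reverse=True):
        if all(tanimoto(m["fp"], s["fp"]) <= max_sim for s in selected):
            selected.append(m)
            if len(selected) == k:
                break
    return selected
```

<p>The combined metric would apply <code>property_filter</code> first and then run <code>diverse_top_k</code> on the survivors.</p>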
<h2 id="benchmark-setup-and-generative-models-evaluated">Benchmark Setup and Generative Models Evaluated</h2>
<h3 id="implementation-details">Implementation Details</h3>
<p>The authors re-implement the PMO benchmark using the original code and data (MIT license) with no changes beyond adding AHC and the new metrics. For Augmented Hill-Climb, the architecture follows REINVENT: an embedding layer of size 128 and 3 layers of Gated Recurrent Units (GRU) with size 512. The prior is trained on ZINC250k using SMILES notation with batch size 128 for 5 epochs.</p>
<p>Two AHC variants are benchmarked:</p>
<ul>
<li><strong>SMILES-AHC</strong>: Hyperparameters optimized via the standard PMO procedure, yielding batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>SMILES-AHC*</strong>: Uses $\sigma = 60$, chosen based on prior knowledge that lower $\sigma$ values maintain better regularization and chemical quality</li>
</ul>
<p>Both omit diversity filters and non-unique penalization for standardized comparison, despite these being shown to improve performance in prior work.</p>
<h3 id="models-compared">Models Compared</h3>
<p>The benchmark includes 25 generative models from the original PMO paper spanning diverse architectures: REINVENT (RNN + RL), Graph GA (graph-based genetic algorithm), GP BO (Gaussian process Bayesian optimization), SMILES GA (SMILES-based genetic algorithm), SELFIES-based VAEs, and others. The 23 objective tasks derive primarily from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark.</p>
<h2 id="re-ranked-results-and-augmented-hill-climb-performance">Re-ranked Results and Augmented Hill-Climb Performance</h2>
<p>The modified metrics substantially re-order the ranking of generative models:</p>
<ol>
<li>
<p><strong>SMILES-AHC* achieves top performance on AUC Top-10 (Combined)</strong>, where both property filters and diversity are enforced. The use of domain-informed hyperparameter selection ($\sigma = 60$) proves critical.</p>
</li>
<li>
<p><strong>SMILES-AHC (data-driven hyperparameters) ranks first</strong> when accounting for property filters alone, diversity alone, or both combined, demonstrating that the AHC algorithm itself provides strong performance even without manual tuning.</p>
</li>
<li>
<p><strong>REINVENT retains its first-place rank under property filters alone</strong>, suggesting that the minority of compounds staying within acceptable property space still perform well. However, it drops when diversity is also required.</p>
</li>
<li>
<p><strong>Evolutionary algorithms (Graph GA, GP BO, SMILES GA) drop significantly</strong> under the new metrics. This is expected because rule-based methods are not constrained by the ZINC250k distribution and tend to propose molecules that diverge from drug-like chemical space.</p>
</li>
<li>
<p><strong>Both AHC variants excel on empirically difficult tasks</strong>, including isomer-based tasks, Zaleplon MPO, and Sitagliptin MPO, where other methods struggle.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Results are preliminary because generative models have not undergone hyperparameter optimization against the new metrics</li>
<li>Property filter thresholds are subjective, and the 10% de novo ECFP4 bit threshold was chosen by visual inspection</li>
<li>Comparing rule-based models against distribution-based models using ZINC250k similarity introduces a bias toward distribution-based approaches</li>
<li>The reference molecules for six of the objective tasks sit in the lowest 0.01% of ZINC250k property space, raising questions about whether distribution-based models can reasonably optimize these objectives</li>
<li>Property filters and diversity could alternatively be incorporated directly into the objective function as additional oracles, though this would not necessarily produce the same results</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC250k</td>
          <td>~250K molecules</td>
          <td>Subset of ZINC15, provided by PMO benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO</a> benchmark tasks</td>
          <td>23 objectives</td>
          <td>Derived primarily from <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Augmented Hill-Climb</strong>: RL strategy from Thomas et al. (2022), patience of 5</li>
<li><strong>Hyperparameters (SMILES-AHC)</strong>: batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>Hyperparameters (SMILES-AHC*)</strong>: $\sigma = 60$ (domain-informed selection)</li>
<li><strong>Prior training</strong>: 5 epochs, batch size 128, SMILES notation</li>
<li><strong>Oracle budget</strong>: 10,000 evaluations per task</li>
<li><strong>Replicates</strong>: 5 per model per task</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Embedding (128) + 3x GRU (512), following REINVENT</li>
<li><strong>All 25 PMO benchmark models</strong> re-evaluated using original implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-10 (Original)</td>
          <td>Area under curve of average top 10 molecules</td>
          <td>Standard PMO metric</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Filtered)</td>
          <td>Original with MW/LogP and ECFP4 novelty filters</td>
          <td>$\mu \pm 4\sigma$ from ZINC250k</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Diverse)</td>
          <td>Top 10 selected with Tanimoto &lt; 0.35 diversity</td>
          <td>ECFP4 fingerprints</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Combined)</td>
          <td>Both filters and diversity applied</td>
          <td>Most stringent metric</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The benchmark uses 10,000 oracle evaluations per task with 5 replicates, which is computationally modest compared to standard generative model training.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Scoring and benchmarking framework by the first author</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original benchmark code and data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Re-evaluating sample efficiency in de novo molecule generation. <em>arXiv preprint arXiv:2212.01385</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{thomas2022reevaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Re-evaluating sample efficiency in de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2212.01385}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2212.01385}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Avoiding Failure Modes in Goal-Directed Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</guid><description>Langevin et al. show that apparent failure modes in goal-directed molecular generation stem from QSAR model disagreement, not algorithmic flaws.</description><content:encoded><![CDATA[<h2 id="reinterpreting-goal-directed-generation-failures-as-qsar-model-issues">Reinterpreting Goal-Directed Generation Failures as QSAR Model Issues</h2>
<p>This is an <strong>Empirical</strong> study that challenges a widely cited finding about failure modes in goal-directed molecular generation. <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">Renz et al. (2019)</a> had shown that when molecules are optimized against a machine learning scoring function, control models trained on the same data distribution assign much lower scores to the generated molecules. This was interpreted as evidence that generation algorithms exploit model-specific biases. Langevin et al. demonstrate that this divergence is already present in the original data distribution and is attributable to disagreement among the QSAR classifiers, not to flaws in the generation algorithms themselves.</p>
<h2 id="why-qsar-model-agreement-matters-for-molecular-generation">Why QSAR Model Agreement Matters for Molecular Generation</h2>
<p>Goal-directed generation uses a scoring function (typically a <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> model) to guide the design of molecules that maximize predicted activity. In the experimental framework from Renz et al., three Random Forest classifiers are trained: an optimization model $C_{opt}$ on Split 1, a model control $C_{mc}$ on Split 1 with a different random seed, and a data control $C_{dc}$ on Split 2. Each returns a confidence score ($S_{opt}$, $S_{mc}$, $S_{dc}$). The expectation is that molecules with high $S_{opt}$ should also score highly under $S_{mc}$ and $S_{dc}$, since all three models are trained on the same data distribution for the same target.</p>
<p>Renz et al. observed that during optimization, $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$, reaching substantially lower values. This was interpreted as goal-directed generation exploiting biases unique to the optimization model. The recommendation was to halt generation when control scores stop increasing, requiring a held-out dataset for a control model, which may not be feasible in low-data regimes.</p>
<p>The key insight of Langevin et al. is that nobody had checked whether this score disagreement existed before generation even began. If the classifiers already disagree on high-scoring molecules in the original dataset, the divergence during generation is expected behavior, not evidence of algorithmic failure.</p>
<h2 id="pre-existing-classifier-disagreement-explains-the-divergence">Pre-Existing Classifier Disagreement Explains the Divergence</h2>
<p>The core contribution is showing that the gap between optimization and control scores is a property of the QSAR models, not of the generation algorithms.</p>
<p>The authors introduce a held-out test set (10% of the data, used for neither training split) and augment it via Topliss tree enumeration to produce structural analogs for smoother statistical estimates. On this held-out set, they compute the Mean Average Difference (MAD) between $S_{opt}$ and control scores as a function of $S_{opt}$:</p>
<p>$$
\text{MAD}(x) = \frac{1}{|\{i : S_{opt}(x_i) \geq x\}|} \sum_{S_{opt}(x_i) \geq x} |S_{opt}(x_i) - S_{dc}(x_i)|
$$</p>
<p>On the three original datasets (DRD2, EGFR, JAK2), the MAD between $S_{opt}$ and $S_{dc}$ grows substantially with $S_{opt}$, reaching approximately 0.3 for the highest-scoring molecules. For EGFR, even the top molecules (with $S_{opt}$ between 0.5 and 0.6) have $S_{dc}$ below 0.2. This disagreement exists entirely within the original data distribution, before any generative algorithm is applied.</p>
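<p>The MAD curve can be estimated from held-out scores with a few lines of NumPy (a sketch; the function and array names are illustrative):</p>

```python
import numpy as np

def mad_curve(s_opt, s_ctrl, thresholds):
    """MAD(x): mean |S_opt - S_ctrl| over held-out molecules whose
    optimization score satisfies S_opt >= x, evaluated at each threshold x."""
    s_opt, s_ctrl = np.asarray(s_opt), np.asarray(s_ctrl)
    values = []
    for x in thresholds:
        mask = s_opt >= x
        values.append(np.abs(s_opt[mask] - s_ctrl[mask]).mean()
                      if mask.any() else np.nan)
    return np.array(values)
```

<p>Plotting this curve against $x$ on the held-out set is how the paper shows the disagreement grows with $S_{opt}$ before any generation takes place.</p>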
<p>The authors formalize this with tolerance intervals. At each generation time step $t$, the distribution of optimization scores is $P_t[S_{opt}(x)]$. From the held-out set, the conditional distributions $P[S_{dc}(x) | S_{opt}(x)]$ and $P[S_{mc}(x) | S_{opt}(x)]$ are estimated empirically. The expected control scores at time $t$ are then:</p>
<p>$$
\mathbb{E}[S_{dc}] = \int P[S_{dc}(x) | S_{opt}(x)] \cdot P_t[S_{opt}(x)] \, dS_{opt}
$$</p>
<p>By sampling from these distributions, the authors construct 95% tolerance intervals for the expected control scores at each time step. The observed trajectories of $S_{mc}$ and $S_{dc}$ during generation fall within these intervals, demonstrating that the divergence is fully explained by pre-existing classifier disagreement.</p>
<h2 id="experimental-setup-original-reproduction-and-corrected-experiments">Experimental Setup: Original Reproduction and Corrected Experiments</h2>
<h3 id="reproduction-of-renz-et-al">Reproduction of Renz et al.</h3>
<p>The original experimental framework uses three datasets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> (842 molecules, 59 actives), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> (842 molecules, 40 actives), and <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> (667 molecules, 140 actives). These are small, noisy, and chemically diverse. Three goal-directed generation algorithms are tested:</p>
<table>
  <thead>
      <tr>
          <th>Algorithm</th>
          <th>Type</th>
          <th>Mechanism</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph GA</td>
          <td>Genetic algorithm on molecular graphs</td>
          <td>Mutation and crossover of molecular graphs</td>
      </tr>
      <tr>
          <td>SMILES-LSTM</td>
          <td>Recurrent neural network</td>
          <td>Hill-climbing fine-tuning on best molecules</td>
      </tr>
      <tr>
          <td>MSO</td>
          <td>Particle swarm in CDDD latent space</td>
          <td>Multiple swarm optimization</td>
      </tr>
  </tbody>
</table>
<p>All algorithms are run for 151 epochs with 10 runs each. The reproduction confirms the findings of Renz et al.: $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$ during optimization.</p>
<h3 id="tolerance-interval-analysis">Tolerance interval analysis</h3>
<p>The held-out set is augmented using Topliss tree enumeration on phenyl rings, providing structural analogs that are reasonable from a medicinal chemistry perspective. The optimization score range is divided into 25 equal bins, and for each molecule at each time step, 10 samples from the conditional control score distribution are drawn to construct empirical tolerance intervals.</p>
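<p>The binning-and-sampling procedure described above can be sketched as follows, assuming all scores lie in $[0, 1]$ (function and variable names are illustrative, not from the authors' code):</p>

```python
import numpy as np

def control_tolerance_interval(held_opt, held_ctrl, gen_opt,
                               n_bins=25, n_samples=10, seed=0):
    """Estimate a 95% tolerance interval for expected control scores:
    bin the [0, 1] score range into n_bins equal bins, then for each
    generated molecule draw n_samples control scores from the held-out
    molecules sharing its S_opt bin (empirical P[S_ctrl | S_opt])."""
    rng = np.random.default_rng(seed)
    held_opt, held_ctrl = np.asarray(held_opt), np.asarray(held_ctrl)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    held_bin = np.clip(np.digitize(held_opt, edges) - 1, 0, n_bins - 1)
    draws = []
    for s in np.asarray(gen_opt):
        b = int(np.clip(np.digitize(s, edges) - 1, 0, n_bins - 1))
        pool = held_ctrl[held_bin == b]
        if pool.size == 0:  # empty bin: fall back to all held-out scores
            pool = held_ctrl
        draws.append(rng.choice(pool, size=n_samples))
    return np.percentile(np.concatenate(draws), [2.5, 97.5])
```

<p>Observed $S_{mc}$ and $S_{dc}$ trajectories falling inside such intervals at every time step is the evidence that the divergence is already implied by the data.</p>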
<h3 id="corrected-experiments-with-adequate-models">Corrected experiments with adequate models</h3>
<p>To test whether generation algorithms actually exploit biases when the classifiers agree, the authors construct two tasks where optimization and control models correlate well:</p>
<ol>
<li><strong>ALDH1 dataset</strong>: 464 molecules from LIT-PCBA, split using similarity-based pairing to maximize intra-pair chemical similarity. This ensures both splits sample similar chemistry.</li>
<li><strong>Modified JAK2</strong>: The same JAK2 dataset but with Random Forest hyperparameters adjusted (200 trees instead of 100, minimum 3 samples per leaf instead of 1) to reduce overfitting to spurious correlations.</li>
</ol>
<p>In both cases, $S_{opt}$, $S_{mc}$, and $S_{dc}$ agree well on the held-out test set. The starting population for generation is set to the held-out test set (rather than random ChEMBL molecules) to avoid building in a distribution shift.</p>
<h2 id="findings-no-algorithmic-failure-when-models-agree">Findings: No Algorithmic Failure When Models Agree</h2>
<p>On the corrected experimental setups (ALDH1 and modified JAK2), there is no major divergence between optimization and control scores during generation. The three algorithms produce molecules that score similarly under all three classifiers.</p>
<p>Key findings:</p>
<ol>
<li>
<p><strong>Pre-existing disagreement explains divergence</strong>: On all three original datasets, the divergence between $S_{opt}$ and control scores during generation falls within the tolerance intervals predicted from the initial data distribution alone. The generation algorithms are not exploiting model-specific biases beyond what already exists in the data.</p>
</li>
<li>
<p><strong>Split similarity bias is also pre-existing</strong>: Renz et al. observed that generated molecules are more similar to Split 1 (used to train $C_{opt}$) than Split 2. The authors show this bias is already present in the top-5 percentile of the held-out set: on EGFR and DRD2, high-scoring molecules are inherently more similar to Split 1.</p>
</li>
<li>
<p><strong>Appropriate model design resolves the issue</strong>: When Random Forest hyperparameters are chosen to avoid overfitting (more trees, higher minimum samples per leaf), or when data splits are constructed to be chemically balanced, the classifiers agree and the generation algorithms behave as expected.</p>
</li>
<li>
<p><strong>Quality problems remain independent</strong>: Even when optimization and control scores align, the generated molecules can still be poor drug candidates (unreactive, unsynthesizable, containing unusual fragments). The score divergence issue and the chemical quality issue are separate problems.</p>
</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations acknowledged by the authors</h3>
<ul>
<li>The study focuses on Random Forest classifiers with ECFP fingerprints. The behavior of other model types (e.g., graph neural networks) and descriptor types is not fully explored, though supplementary results show similar patterns with physico-chemical descriptors and Atom-Pair fingerprints.</li>
<li>The corrected ALDH1 task uses a relatively small dataset (464 molecules) with careful split construction. Scaling this approach to larger, more heterogeneous datasets is not demonstrated.</li>
<li>The authors note that their results do not prove generation algorithms never exploit biases; they show that the specific evidence from Renz et al. can be explained without invoking algorithmic failure.</li>
<li>The problem of low-quality generated molecules (poor synthesizability, unusual fragments) remains unresolved and is acknowledged as an open question.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Original tasks</td>
          <td>DRD2, EGFR, JAK2</td>
          <td>842, 842, 667 molecules</td>
          <td>Extracted from ChEMBL; small with few actives</td>
      </tr>
      <tr>
          <td>New task</td>
          <td>ALDH1</td>
          <td>464 molecules (173 with purine substructure)</td>
          <td>Extracted from LIT-PCBA; similarity-based split</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>Topliss tree analogs</td>
          <td>~10x augmentation of held-out set</td>
          <td>Structural analogs via phenyl ring enumeration</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Three goal-directed generation algorithms from the original Renz et al. study:</p>
<ul>
<li><strong>Graph GA</strong>: Genetic algorithm on molecular graphs (Jensen, 2019)</li>
<li><strong>SMILES-LSTM</strong>: Hill-climbing on LSTM-generated SMILES (Segler et al., 2018)</li>
<li><strong>MSO</strong>: Multi-Swarm Optimization in CDDD latent space (Winter et al., 2019)</li>
</ul>
<p>All run for 151 epochs, 10 runs each.</p>
<h3 id="models">Models</h3>
<p>Random Forest classifiers (scikit-learn) with:</p>
<ul>
<li>ECFP fingerprints (radius 2, 1024 bits, RDKit)</li>
<li>Default parameters for original tasks</li>
<li>Modified parameters for JAK2 correction: 200 trees, min 3 samples per leaf</li>
</ul>
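<p>A minimal sketch of the corrected classifier setup in scikit-learn, assuming the ECFP fingerprint matrix has already been computed (e.g. with RDKit); the helper names and seeds are illustrative:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_classifier_pair(fps, labels, seed_opt=0, seed_mc=1):
    """Train the optimization model C_opt and its model control C_mc on the
    same data, differing only in random seed, using the corrected JAK2
    hyperparameters (200 trees, minimum 3 samples per leaf)."""
    def fit(seed):
        return RandomForestClassifier(n_estimators=200, min_samples_leaf=3,
                                      random_state=seed).fit(fps, labels)
    return fit(seed_opt), fit(seed_mc)

def confidence(clf, fps):
    """Score S(x): predicted probability of the active class."""
    return clf.predict_proba(fps)[:, 1]
```

<p>The data control $C_{dc}$ would be trained the same way on the second split.</p>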
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Purpose</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean Average Difference (MAD)</td>
          <td>Measures disagreement between optimization and control scores</td>
          <td>Computed as function of $S_{opt}$ on held-out set</td>
      </tr>
      <tr>
          <td>95% tolerance intervals</td>
          <td>Expected range of control scores given optimization scores</td>
          <td>Empirical, constructed from held-out set</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Split bias assessment</td>
          <td>Morgan fingerprints, radius 2, 1024 bits</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classifier predictive performance</td>
          <td>Used to verify models have comparable accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Sanofi-Public/IDD-papers-avoiding_failure_modes">Code and datasets</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Fork of Renz et al. codebase with modifications</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Langevin, M., Vuilleumier, R., &amp; Bianciotto, M. (2022). Explaining and avoiding failure modes in goal-directed generation of small molecules. <em>Journal of Cheminformatics</em>, 14, 20. <a href="https://doi.org/10.1186/s13321-022-00601-y">https://doi.org/10.1186/s13321-022-00601-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{langevin2022explaining,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Explaining and avoiding failure modes in goal-directed generation of small molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Langevin, Maxime and Vuilleumier, Rodolphe and Bianciotto, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00601-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Failure Modes in Molecule Generation &amp; Optimization</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/</guid><description>Renz et al. show trivial models fool distribution-learning metrics and ML scoring functions introduce exploitable biases in goal-directed molecule generation.</description><content:encoded><![CDATA[<h2 id="an-empirical-critique-of-molecular-generation-evaluation">An Empirical Critique of Molecular Generation Evaluation</h2>
<p>This is an <strong>Empirical</strong> paper that critically examines evaluation practices for molecular generative models. Rather than proposing a new generative method, the paper exposes systematic weaknesses in both distribution-learning metrics and goal-directed optimization scoring functions. The primary contributions are: (1) demonstrating that a trivially simple &ldquo;AddCarbon&rdquo; model can achieve near-perfect scores on widely used distribution-learning benchmarks, and (2) introducing an experimental framework with optimization scores and control scores that reveals model-specific and data-specific biases when ML models serve as scoring functions for goal-directed generation.</p>
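<p>The AddCarbon idea can be sketched in a few lines (a hedged illustration, not the authors' code: in the paper, insertion positions are retried until the modified string parses as a valid molecule, a RDKit check elided here):</p>

```python
import random

def add_carbon(smiles, rng=random):
    """Copy a training SMILES string and insert a single aliphatic carbon
    ('C') at a random position. Validity checking of the result (retry
    until RDKit accepts the string) is omitted in this sketch."""
    pos = rng.randrange(len(smiles) + 1)
    return smiles[:pos] + "C" + smiles[pos:]
```

<p>Because every output is one atom away from a training molecule, such a generator trivially matches the training distribution under most distribution-learning metrics while contributing nothing new.</p>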
<h2 id="evaluation-gaps-in-de-novo-molecular-design">Evaluation Gaps in De Novo Molecular Design</h2>
<p>The rapid growth of deep learning methods for molecular generation (RNN-based SMILES generators, VAEs, GANs, graph neural networks) created a need for standardized evaluation. Benchmarking suites like <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> introduced metrics for validity, uniqueness, novelty, KL divergence over molecular properties, and <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance (FCD)</a>. For goal-directed generation, penalized logP became a common optimization target.</p>
<p>However, these metrics leave significant blind spots. Distribution-learning metrics do not detect whether a model merely copies training molecules with minimal modifications. Goal-directed benchmarks often use scoring functions that fail to capture the full requirements of drug discovery (synthetic feasibility, drug-likeness, absence of reactive substructures). When ML models serve as scoring functions, the problem worsens because generated molecules can exploit artifacts of the learned model rather than exhibiting genuinely desirable properties.</p>
<p>At the time of writing, wet-lab validations of generative models remained scarce, with only a handful of studies (Merk et al., Zhavoronkov et al.) demonstrating in vitro activity for generated compounds. The lack of rigorous evaluation left the field unable to distinguish meaningfully innovative methods from those that simply exploit metric weaknesses.</p>
<h2 id="the-copy-problem-and-control-score-framework">The Copy Problem and Control Score Framework</h2>
<p>The paper introduces two key conceptual contributions.</p>
<h3 id="the-addcarbon-model-for-distribution-learning">The AddCarbon Model for Distribution-Learning</h3>
<p>The AddCarbon model is deliberately trivial: it samples a molecule from the training set, inserts a single carbon atom at a random position in its SMILES string, and returns the result if it produces a valid, novel molecule. This model achieves near-perfect scores across most <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> distribution-learning benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>RS</th>
          <th>LSTM</th>
          <th>GraphMCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
          <th>AddCarbon</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
          <td>0.871</td>
      </tr>
  </tbody>
</table>
<p>The AddCarbon model beats every learned baseline except the LSTM on the FCD metric (only random sampling of training molecules, which by construction has zero novelty, scores higher), despite being practically useless. This exposes what the authors call the &ldquo;copy problem&rdquo;: current metrics check only for exact matches to training molecules, so minimal edits evade novelty detection. The authors argue that likelihood-based evaluation on hold-out test sets, analogous to standard practice in NLP, would provide a more comprehensive metric.</p>
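<p>The AddCarbon procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors&rsquo; code: the validity check is passed in as a stub, whereas the paper uses RDKit to parse each candidate SMILES.</p>

```python
import random

def add_carbon(training_smiles, is_valid, seen, max_tries=100):
    """AddCarbon baseline: copy a training molecule and insert a single
    'C' at a random position in its SMILES string, returning the result
    only if it is valid and not already a known molecule."""
    for _ in range(max_tries):
        smiles = random.choice(training_smiles)
        pos = random.randint(0, len(smiles))
        candidate = smiles[:pos] + "C" + smiles[pos:]
        if is_valid(candidate) and candidate not in seen:
            return candidate
    return None

# Toy usage with a permissive validity stub (the real check would be
# RDKit's MolFromSmiles returning a non-None molecule).
train = ["CCO", "c1ccccc1"]
mol = add_carbon(train, is_valid=lambda s: True, seen=set(train))
```

Because the edit is a one-character insertion into a real training molecule, the output trivially passes validity, uniqueness, and exact-match novelty checks.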
<h3 id="control-scores-for-goal-directed-generation">Control Scores for Goal-Directed Generation</h3>
<p>For goal-directed generation, the authors introduce a three-score experimental design:</p>
<ul>
<li><strong>Optimization Score (OS)</strong>: Output of a classifier trained on data split 1, used to guide the molecular optimizer.</li>
<li><strong>Model Control Score (MCS)</strong>: Output of a second classifier trained on split 1 with a different random seed. Divergence between OS and MCS quantifies model-specific biases.</li>
<li><strong>Data Control Score (DCS)</strong>: Output of a classifier trained on data split 2. Divergence between OS and DCS quantifies data-specific biases.</li>
</ul>
<p>This mirrors the training/test split paradigm in supervised learning. If a generator truly produces molecules with the desired bioactivity, the control scores should track the optimization score. Divergence between them indicates the optimizer is exploiting artifacts of the specific model or training data rather than learning generalizable chemical properties.</p>
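<p>The three-score design can be sketched with scikit-learn. The random bit vectors below are stand-ins for the paper&rsquo;s 1024-bit ECFP4 fingerprints and binary activity labels; the classifier type (random forest), split scheme, and seed variation follow the setup described above, but everything else here is illustrative.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for 1024-bit ECFP4 fingerprints with binary labels.
X = rng.integers(0, 2, size=(400, 1024))
y = rng.integers(0, 2, size=400)

# Split the data in half, as the paper splits each ChEMBL target set.
X1, y1 = X[:200], y[:200]
X2, y2 = X[200:], y[200:]

os_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X1, y1)   # Optimization Score
mcs_clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X1, y1)  # Model Control Score: same data, new seed
dcs_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X2, y2)  # Data Control Score: held-out split

def score_divergence(fps):
    """Gap between the optimization score and each control score for a
    batch of candidate fingerprints; a growing gap signals that the
    optimizer is exploiting model- or data-specific biases."""
    os_ = os_clf.predict_proba(fps)[:, 1].mean()
    mcs = mcs_clf.predict_proba(fps)[:, 1].mean()
    dcs = dcs_clf.predict_proba(fps)[:, 1].mean()
    return os_ - mcs, os_ - dcs
```

Tracking these two gaps over optimization steps reproduces the paper&rsquo;s diagnostic: model-specific bias (OS&ndash;MCS) versus data-specific bias (OS&ndash;DCS).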
<h2 id="experimental-setup-three-targets-three-generators">Experimental Setup: Three Targets, Three Generators</h2>
<h3 id="targets-and-data">Targets and Data</h3>
<p>The authors selected three biological targets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">Janus kinase 2</a> (JAK2), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">epidermal growth factor receptor</a> (EGFR), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a> (DRD2). For each target, the data was split into two halves (split 1 and split 2) with balanced active/inactive ratios. Random forest classifiers using binary folded ECFP4 fingerprints (radius 2, size 1024) were trained to produce three scoring functions per target: the OS and MCS on split 1 (different random seeds), and the DCS on split 2.</p>
<h3 id="generators">Generators</h3>
<p>Three molecular generators were evaluated:</p>
<ol>
<li><strong>Graph-based Genetic Algorithm (GA)</strong>: Iteratively applies random mutations and crossovers to a population of molecules, retaining the best in each generation. One of the top performers in GuacaMol.</li>
<li><strong>SMILES-LSTM</strong>: An autoregressive model that generates SMILES character by character, optimized via hill climbing (iteratively sampling, keeping top molecules, fine-tuning). Also a top GuacaMol performer.</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">Particle Swarm Optimization</a> (PS)</strong>: Optimizes molecules in the continuous latent space of a SMILES-based sequence-to-sequence model.</li>
</ol>
<p>Each optimizer was run 10 times per target dataset.</p>
<h2 id="score-divergence-and-exploitable-biases">Score Divergence and Exploitable Biases</h2>
<h3 id="optimization-vs-control-score-divergence">Optimization vs. Control Score Divergence</h3>
<p>Across all three targets and all three generators, the OS consistently outpaced both control scores during optimization. The DCS sometimes stagnated or even decreased while the OS continued to climb. This divergence demonstrates that the generators exploit biases in the scoring function rather than discovering genuinely active compounds.</p>
<p>The MCS also diverged from the OS despite being trained on exactly the same data, confirming model-specific biases: the optimization exploits features unique to the particular random forest instance. The larger gap between OS and DCS (compared to OS and MCS) indicates that data-specific biases contribute more to the divergence than model-specific biases.</p>
<h3 id="chemical-space-migration">Chemical Space Migration</h3>
<p>Optimized molecules migrated toward the region of split 1 actives (used to train the OS), as shown by t-SNE embeddings and nearest-neighbor Tanimoto similarity analysis. Optimized molecules had more similar neighbors in split 1 than in split 2, confirming data-specific bias. By the end of optimization, generated molecules occupied different regions of chemical space than known actives when measured by logP and molecular weight, with compounds from the same optimization run forming distinct clusters.</p>
<h3 id="quality-of-generated-molecules">Quality of Generated Molecules</h3>
<p>High-scoring generated molecules frequently contained problematic substructures: reactive dienes, nitrogen-fluorine bonds, long heteroatom chains that are synthetically infeasible, and highly uncommon functional groups. The LSTM optimizer showed a bias toward high molecular weight, low diversity, and high logP values. These molecules would be rejected by medicinal chemists despite their high optimization scores.</p>
<h3 id="key-takeaways">Key Takeaways</h3>
<p>The authors emphasize several practical implications:</p>
<ol>
<li><strong>Early stopping</strong>: Control scores can indicate when further optimization is exploiting biases rather than finding better molecules. Optimization should stop when control scores plateau.</li>
<li><strong>Scoring function iteration</strong>: In practice, generative models are &ldquo;highly adept at exploiting&rdquo; incomplete scoring functions, necessitating several iterations of generation and scoring function refinement.</li>
<li><strong>Synthetic accessibility</strong>: Even high-scoring molecules are useless if they cannot be synthesized. The authors consider this a major challenge for practical adoption.</li>
<li><strong>Likelihood-based evaluation</strong>: For distribution-learning, the authors recommend reporting test-set likelihoods for likelihood-based models, following standard NLP practice.</li>
</ol>
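<p>The early-stopping recommendation can be operationalized as a plateau check on the control-score trajectory. The function and thresholds below are a hypothetical sketch, not a rule from the paper:</p>

```python
def should_stop(control_history, patience=5, min_delta=1e-3):
    """Stop optimization when the control score has not improved by at
    least min_delta over the last `patience` steps, even if the
    optimization score is still climbing."""
    if len(control_history) <= patience:
        return False
    best_recent = max(control_history[-patience:])
    best_before = max(control_history[:-patience])
    return best_recent < best_before + min_delta

# A control score that plateaus after a few steps triggers the stop.
history = [0.40, 0.52, 0.60, 0.61, 0.61, 0.61, 0.61, 0.61, 0.61]
```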
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bioactivity data</td>
          <td>ChEMBL (JAK2, EGFR, DRD2)</td>
          <td>See Table S1</td>
          <td>Binary classification tasks, split 50/50</td>
      </tr>
      <tr>
          <td>Distribution-learning</td>
          <td>GuacaMol training set</td>
          <td>Subset of ChEMBL</td>
          <td>Used as starting population for GA and PS</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Scoring function</strong>: Random forest classifier (scikit-learn) on binary ECFP4 fingerprints (size 1024, radius 2, RDKit)</li>
<li><strong>GA</strong>: Graph-based genetic algorithm from Jensen (2019)</li>
<li><strong>LSTM</strong>: SMILES-LSTM with hill climbing, pretrained model from GuacaMol</li>
<li><strong>PS</strong>: Particle swarm optimization in latent space of a sequence-to-sequence model (Winter et al. 2019)</li>
<li>Each optimizer run 10 times per target</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization Score (OS)</td>
          <td>RF classifier on split 1</td>
          <td>Guides optimization</td>
      </tr>
      <tr>
          <td>Model Control Score (MCS)</td>
          <td>RF on split 1, different seed</td>
          <td>Detects model-specific bias</td>
      </tr>
      <tr>
          <td>Data Control Score (DCS)</td>
          <td>RF on split 2</td>
          <td>Detects data-specific bias</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> metrics</td>
          <td>Validity, uniqueness, novelty, KL div, FCD</td>
          <td>For distribution-learning</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ml-jku/mgenerators-failure-modes">ml-jku/mgenerators-failure-modes</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Data, code, and results</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{renz2019failure,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{On failure modes in molecule generation and optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Renz, Philipp and Van Rompaey, Dries and Wegner, J{\&#34;o}rg Kurt and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Drug Discovery Today: Technologies}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{32-33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55--63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ddtec.2020.09.003}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S., &amp; Klambauer, G. (2019). On failure modes in molecule generation and optimization. <em>Drug Discovery Today: Technologies</em>, 32-33, 55-63. <a href="https://doi.org/10.1016/j.ddtec.2020.09.003">https://doi.org/10.1016/j.ddtec.2020.09.003</a></p>
<p><strong>Publication</strong>: Drug Discovery Today: Technologies, Volume 32-33, 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ml-jku/mgenerators-failure-modes">Code and data (GitHub)</a></li>
</ul>
]]></content:encoded></item><item><title>UnCorrupt SMILES: Post Hoc Correction for De Novo Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/uncorrupt-smiles/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/uncorrupt-smiles/</guid><description>A transformer-based SMILES corrector that fixes invalid outputs from molecular generators, recovering 60-95% of erroneous SMILES strings.</description><content:encoded><![CDATA[<h2 id="a-transformer-based-smiles-error-corrector">A Transformer-Based SMILES Error Corrector</h2>
<p>This is a <strong>Method</strong> paper that proposes a post hoc approach to fixing invalid SMILES produced by de novo molecular generators. Rather than trying to prevent invalid outputs through alternative representations (<a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>) or constrained architectures (graph models), the authors train a transformer model to translate invalid SMILES into valid ones. The corrector is framed as a sequence-to-sequence translation task, drawing on techniques from grammatical error correction (GEC) in natural language processing.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based generative models produce some percentage of invalid outputs that cannot be converted to molecules. The invalidity rate varies substantially across model types:</p>
<ul>
<li><strong>RNN models</strong> (DrugEx): 5.7% invalid (pretrained) and 4.7% invalid (target-directed)</li>
<li><strong>GANs</strong> (ORGANIC): 9.5% invalid</li>
<li><strong>VAEs</strong> (GENTRL): 88.9% invalid</li>
</ul>
<p>These invalid outputs represent wasted computation and potentially introduce bias toward molecules that are easier to generate correctly. Previous approaches to this problem include using alternative representations (<a href="/notes/computational-chemistry/molecular-representations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>) or graph-based models, but these either limit the search space or increase computational cost. The authors propose a complementary strategy: fix the errors after generation.</p>
<h2 id="error-taxonomy-across-generator-types">Error Taxonomy Across Generator Types</h2>
<p>The paper classifies invalid SMILES errors into six categories based on RDKit error messages:</p>
<ol>
<li><strong>Syntax errors</strong>: malformed SMILES grammar</li>
<li><strong>Unclosed rings</strong>: unmatched ring closure digits</li>
<li><strong>Parentheses errors</strong>: unbalanced open/close parentheses</li>
<li><strong>Bond already exists</strong>: duplicate bonds between the same atoms</li>
<li><strong>Aromaticity errors</strong>: atoms incorrectly marked as aromatic or kekulization failures</li>
<li><strong>Valence errors</strong>: atoms exceeding their maximum bond count</li>
</ol>
<p>The distribution of error types differs across generators. RNN-based models primarily produce aromaticity errors, suggesting they learn SMILES grammar well but struggle with chemical validity. The GAN (ORGANIC) produces mostly valence errors. The VAE (GENTRL) produces more grammar-level errors (syntax, parentheses, unclosed rings), indicating that sampling from the continuous latent space often produces sequences that violate basic SMILES structure.</p>
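<p>A classification of this kind can be implemented by pattern-matching parser error messages. The patterns below are paraphrased from typical RDKit output and are illustrative only; the exact strings vary by RDKit version, and the paper&rsquo;s mapping should be taken from its released code.</p>

```python
import re

# Illustrative patterns only -- not the paper's exact rules.
ERROR_PATTERNS = [
    ("aromaticity",   re.compile(r"kekulize|aromatic", re.I)),
    ("valence",       re.compile(r"valence", re.I)),
    ("parentheses",   re.compile(r"parenthes", re.I)),
    ("unclosed_ring", re.compile(r"ring closure|unclosed ring", re.I)),
    ("bond_exists",   re.compile(r"duplicates? bond|bond already", re.I)),
]

def classify_error(message):
    """Map a SMILES parser error message to one of the six categories;
    anything unmatched falls back to a generic syntax error."""
    for label, pattern in ERROR_PATTERNS:
        if pattern.search(message):
            return label
    return "syntax"
```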
<h2 id="architecture-and-training">Architecture and Training</h2>
<p>The SMILES corrector uses a standard encoder-decoder transformer architecture based on Vaswani et al., with learned positional encodings. Key specifications:</p>
<ul>
<li>Embedding dimension: 256</li>
<li>Encoder/decoder layers: 3 each</li>
<li>Attention heads: 8</li>
<li>Feed-forward dimension: 512</li>
<li>Dropout: 0.1</li>
<li>Optimizer: Adam (learning rate 0.0005)</li>
<li>Training: 20 epochs, batch size 16</li>
</ul>
<p>Since no dataset of manually corrected invalid-valid SMILES pairs exists, the authors create synthetic training data by introducing errors into valid SMILES from the Papyrus bioactivity dataset (approximately 1.3M pairs). Errors are introduced through random perturbations following SMILES syntax rules: character substitutions, bond order changes, fragment additions from the <a href="/notes/computational-chemistry/datasets/gdb-11/">GDB</a>-8 database to atoms with full valence, and other structural modifications.</p>
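<p>The pair-generation idea can be sketched with simple character-level edits. The paper&rsquo;s perturbations are more structured (bond-order changes, GDB-8 fragment additions respecting valence), so this is a simplified stand-in:</p>

```python
import random

SMILES_TOKENS = list("CNOSFclnos123456()=#")  # illustrative token pool

def corrupt(smiles, n_errors=1, rng=random):
    """Introduce n_errors random single-character edits (substitution or
    insertion) into a valid SMILES, yielding an (invalid-candidate, valid)
    training pair for the corrector."""
    chars = list(smiles)
    for _ in range(n_errors):
        if rng.random() < 0.5 and chars:
            chars[rng.randrange(len(chars))] = rng.choice(SMILES_TOKENS)
        else:
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(SMILES_TOKENS))
    return "".join(chars), smiles

# One synthetic training pair at the paper's best-performing error count.
pair = corrupt("CC(=O)Oc1ccccc1C(=O)O", n_errors=12)
```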
<h2 id="training-with-multiple-errors-improves-correction">Training with Multiple Errors Improves Correction</h2>
<p>A key finding is that training the corrector on inputs with multiple errors per SMILES substantially improves performance on real generator outputs. The baseline model (1 error per input) fixes 35-80% of invalid outputs depending on the generator. Increasing errors per training input to 12 raises this to 62-95%:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>1 error/input</th>
          <th>12 errors/input</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN (DrugEx)</td>
          <td>~60% fixed</td>
          <td>62% fixed</td>
      </tr>
      <tr>
          <td>Target-directed RNN</td>
          <td>~60% fixed</td>
          <td>68% fixed</td>
      </tr>
      <tr>
          <td>GAN (ORGANIC)</td>
          <td>~80% fixed</td>
          <td>95% fixed</td>
      </tr>
      <tr>
          <td>VAE (GENTRL)</td>
          <td>~35% fixed</td>
          <td>80% fixed</td>
      </tr>
  </tbody>
</table>
<p>Training beyond 12 errors per input yields diminishing returns (80% average at 20 errors vs. 78% at 12). The improvement from multi-error training is consistent with GEC literature, where models learn to &ldquo;distrust&rdquo; inputs more when exposed to higher error rates.</p>
<p>The model also shows low overcorrection: only 14% of valid SMILES are altered during translation, comparable to overcorrection rates in spelling correction systems.</p>
<h2 id="fixed-molecules-are-comparable-to-generator-outputs">Fixed Molecules Are Comparable to Generator Outputs</h2>
<p>The corrected molecules are evaluated against both the training set and the readily generated (valid) molecules from each generator:</p>
<ul>
<li><strong>Uniqueness</strong>: 97% of corrected molecules are unique</li>
<li><strong>Novelty vs. generated</strong>: 97% of corrected molecules are novel compared to the valid generator outputs</li>
<li><strong>Similarity to nearest neighbor (SNN)</strong>: 0.45 between fixed and generated sets, indicating the corrected molecules explore different parts of chemical space</li>
<li><strong>Property distributions</strong>: KL divergence scores between fixed molecules and the training set are comparable to those between generated molecules and the training set</li>
</ul>
<p>This demonstrates that SMILES correction produces molecules that are as chemically reasonable as the generator&rsquo;s valid outputs while exploring complementary regions of chemical space.</p>
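<p>The SNN metric reported above reduces to nearest-neighbor Tanimoto similarity averaged over the query set. A minimal sketch, using Python sets of &ldquo;on&rdquo; bits in place of real RDKit fingerprints:</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def snn(query_fps, reference_fps):
    """Similarity to Nearest Neighbor: average, over the query set, of each
    fingerprint's highest Tanimoto similarity to any reference fingerprint."""
    return sum(max(tanimoto(q, r) for r in reference_fps)
               for q in query_fps) / len(query_fps)

# Toy bit sets standing in for ECFP4 fingerprints.
fixed = [{1, 2, 3}, {2, 3, 4}]
generated = [{1, 2, 3}, {7, 8, 9}]
```

An SNN near 1 would mean the fixed set shadows the generated set; the paper&rsquo;s value of 0.45 indicates the two sets occupy different regions of chemical space.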
<h2 id="local-chemical-space-exploration-via-error-introduction">Local Chemical Space Exploration via Error Introduction</h2>
<p>Beyond fixing generator errors, the authors propose using the SMILES corrector for analog generation. The workflow is:</p>
<ol>
<li>Take a known active molecule</li>
<li>Introduce random errors into its SMILES (repeated 1000 times)</li>
<li>Correct the errors using the trained corrector</li>
</ol>
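<p>The three-step loop above can be sketched with stub components. Everything here is a hypothetical stand-in: the real <code>correct</code> step is the trained transformer, and <code>corrupt</code>/<code>is_valid</code> follow the paper&rsquo;s error-introduction and RDKit validity checks.</p>

```python
import random

def explore_analogs(active_smiles, corrupt, correct, is_valid, n=1000, seed=0):
    """Local chemical-space exploration: repeatedly corrupt a known
    active's SMILES, let the corrector repair it, and collect the
    unique valid results as candidate analogs."""
    rng = random.Random(seed)
    analogs = set()
    for _ in range(n):
        broken = corrupt(active_smiles, rng)
        fixed = correct(broken)  # the trained seq2seq corrector in the paper
        if is_valid(fixed):
            analogs.add(fixed)
    return analogs

# Stub components for illustration only.
analogs = explore_analogs(
    "CCO",
    corrupt=lambda s, rng: s + rng.choice("CNO"),
    correct=lambda s: s,          # identity stand-in for the transformer
    is_valid=lambda s: True,
)
```

Deduplicating through a set mirrors why measured uniqueness is low: many corrupted inputs are repaired back to the same molecule.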
<p>This &ldquo;local sequence exploration&rdquo; generates novel analogs with 97% validity. The uniqueness (39%) and novelty (16-37%) are lower than for generator correction because the corrector often regenerates the original molecule. However, the approach produces molecules that are structurally similar to the starting compound (SNN of 0.85 to known ligands).</p>
<p>The authors demonstrate this on selective <a href="https://en.wikipedia.org/wiki/Aurora_kinase_B">Aurora kinase B</a> (AURKB) inhibitors. The generated analogs occupy the same binding site region as the co-crystallized ligand VX-680 in docking studies, with predicted bioactivities similar to known compounds. Compared to target-directed RNN generation, SMILES exploration produces molecules closer to known actives (higher SNN, scaffold similarity, and KL divergence scores).</p>
<h2 id="limitations">Limitations</h2>
<p>The corrector&rsquo;s performance drops on real generator outputs relative to synthetic test data because the synthetic error distribution does not perfectly match the errors generators actually produce; generator-specific correctors trained on actual invalid outputs could close this gap. The local exploration approach has limited novelty since the corrector frequently regenerates the original molecule. Finally, the evaluation uses predicted rather than experimental bioactivities for the Aurora kinase case study.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">LindeSchoenmaker/SMILES-corrector</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training code, synthetic error generation, and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Synthetic training pairs derived from the Papyrus bioactivity dataset (v5.5). Approximately 1.3M invalid-valid pairs per error-count setting.</p>
<p><strong>Code</strong>: Transformer implemented in PyTorch, adapted from Ben Trevett&rsquo;s seq2seq tutorial. Generative model baselines use DrugEx, GENTRL, and ORGANIC.</p>
<p><strong>Evaluation</strong>: Validity assessed with RDKit. Similarity metrics (SNN, fragment, scaffold) and KL divergence computed following <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> and <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark protocols.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schoenmaker, L., Béquignon, O. J. M., Jespers, W., &amp; van Westen, G. J. P. (2023). UnCorrupt SMILES: a novel approach to de novo design. <em>Journal of Cheminformatics</em>, 15, 22.</p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">GitHub: LindeSchoenmaker/SMILES-corrector</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schoenmaker2023uncorrupt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{UnCorrupt SMILES: a novel approach to de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Schoenmaker, Linde and B{\&#39;e}quignon, Olivier J. M. and Jespers, Willem and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00696-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>