<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>RL-Tuned Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/</link><description>Recent content in RL-Tuned Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 28 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/index.xml" rel="self" type="application/rss+xml"/><item><title>REINVENT: Reinforcement Learning for Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</guid><description>REINVENT uses augmented episodic likelihood to fine-tune a SMILES-based RNN via reinforcement learning for goal-directed molecular generation.</description><content:encoded><![CDATA[<h2 id="augmented-episodic-likelihood-for-goal-directed-generation">Augmented Episodic Likelihood for Goal-Directed Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces REINVENT, a policy-based reinforcement learning framework for molecular de novo design. The primary contribution is a novel cost function, the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented episodic likelihood</a>, that fine-tunes a <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based recurrent neural network (RNN) pre-trained on ChEMBL toward generating molecules satisfying user-defined property objectives. The method anchors the agent to the prior distribution of valid drug-like molecules, addressing failure modes of standard REINFORCE algorithms (reward exploitation and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">mode collapse</a> to trivially simple structures).</p>
<h2 id="de-novo-design-needs-flexible-data-driven-approaches">De Novo Design Needs Flexible, Data-Driven Approaches</h2>
<p>Traditional de novo design methods fall into three categories, each with limitations:</p>
<ol>
<li><strong>Structure-based approaches</strong> grow ligands to fit binding pockets but often produce molecules with poor DMPK profiles and synthetic intractability.</li>
<li><strong>Ligand-based virtual library</strong> approaches generate large libraries and score them, but are constrained by pre-defined reaction rules or transformation rules that limit chemical diversity.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Inverse QSAR</a></strong> methods attempt to map favorable activity regions back to molecular structures, but require descriptors suitable for both forward prediction and inverse mapping.</li>
</ol>
<p>RNN-based generative models trained on SMILES offer a data-driven alternative that can learn the underlying distribution of drug-like chemical space without rigid rules. Segler et al. (2017) showed that fine-tuning a pre-trained RNN on focused actives yields high fractions of predicted actives. However, this maximum likelihood fine-tuning cannot use negative or continuous scores and risks catastrophic forgetting.</p>
<p>Prior RL approaches had significant issues. Jaques et al. (2016) used Deep Q-learning with prior likelihood regularization for sequence generation, but reported dependence on hand-written rules to penalize undesirable sequences and still observed reward exploitation producing unrealistically simple molecules. Standard REINFORCE algorithms tend to converge on trivial solutions (e.g., generating only &ldquo;C&rdquo; to satisfy a scoring function).</p>
<h2 id="the-augmented-episodic-likelihood-framework">The Augmented Episodic Likelihood Framework</h2>
<p>The core innovation is a formulation where the agent learns a policy that minimizes the squared difference between its own log-likelihood and an augmented target likelihood.</p>
<p>The RNN is first pre-trained on 1.5 million canonical SMILES from ChEMBL via maximum likelihood estimation:</p>
<p>$$
J(\Theta) = -\sum_{t=1}^{T} \log P(x^{t} \mid x^{t-1}, \dots, x^{1})
$$</p>
<p>The pre-trained model (the Prior) is then used as the starting point for the Agent. For a generated SMILES sequence $A = a_1, a_2, \dots, a_T$, the model likelihood is $P(A) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$, and a scoring function $S(A) \in [-1, 1]$ rates desirability.</p>
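<p>The factorized sequence likelihood can be sketched in a few lines of plain Python; the token distributions here are illustrative stand-ins for the RNN&rsquo;s softmax outputs at each step:</p>

```python
import math

def sequence_log_likelihood(steps):
    """log P(A) = sum_t log pi(a_t | s_t).

    `steps` is a list of (chosen_token, distribution) pairs, where each
    distribution maps tokens to the policy's probabilities at that step --
    stand-ins for the RNN's per-step softmax outputs.
    """
    return sum(math.log(dist[tok]) for tok, dist in steps)
```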
<p>The augmented likelihood combines prior likelihood with the score:</p>
<p>$$
\log P(A)_{\mathbb{U}} = \log P(A)_{Prior} + \sigma S(A)
$$</p>
<p>where $\sigma$ is a scalar coefficient controlling the trade-off between prior fidelity and score optimization.</p>
<p>The return is defined as the negative squared difference between the augmented likelihood and the agent&rsquo;s likelihood:</p>
<p>$$
G(A) = -\left[\log P(A)_{\mathbb{U}} - \log P(A)_{\mathbb{A}}\right]^{2}
$$</p>
<p>The agent minimizes $J(\Theta) = -G$, effectively learning a policy whose sequence likelihoods match the prior modulated by the scoring function. The authors show in supplementary material that this is equivalent to a REINFORCE algorithm with a specific final-step reward formulation.</p>
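<p>The augmented likelihood and the squared-difference loss are simple to sketch; in practice both log-likelihoods come from summing per-token log-probabilities under the Prior and Agent RNNs:</p>

```python
def augmented_log_likelihood(log_p_prior, score, sigma):
    """log P(A)_U = log P(A)_Prior + sigma * S(A)."""
    return log_p_prior + sigma * score

def reinvent_loss(log_p_prior, log_p_agent, score, sigma):
    """Negative return -G(A): squared gap between the augmented target
    likelihood and the Agent's own sequence log-likelihood."""
    target = augmented_log_likelihood(log_p_prior, score, sigma)
    return (target - log_p_agent) ** 2
```

With $\sigma = 15$ (the setting reported for the Celecoxib experiment), a perfectly scored sequence ($S = 1$) shifts the target 15 log-units above the prior likelihood.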
<p>This design has three key advantages over standard REINFORCE:</p>
<ul>
<li>The target policy is explicitly stochastic, preserving diversity in generated molecules</li>
<li>The prior anchoring prevents catastrophic forgetting of SMILES syntax and chemical space coverage</li>
<li>No hand-written rules are needed to penalize degenerate solutions</li>
</ul>
<p>The Agent is trained on-policy with batches of 128 generated sequences, using SGD with learning rate 0.0005 and gradient clipping to $[-3, 3]$.</p>
<h2 id="three-experiments-sulphur-avoidance-celecoxib-analogues-and-drd2-activity">Three Experiments: Sulphur Avoidance, Celecoxib Analogues, and DRD2 Activity</h2>
<h3 id="prior-network-architecture">Prior Network Architecture</h3>
<p>The Prior is a 3-layer RNN with 1024 Gated Recurrent Units per layer, trained on RDKit canonical SMILES from ChEMBL (molecules with 10-50 heavy atoms, elements from $\{H, B, C, N, O, F, Si, P, S, Cl, Br, I\}$). Training used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) for 50,000 steps with batch size 128, an initial learning rate of 0.001, and a learning-rate decay of 0.02 every 100 steps. The Prior generates 94% valid SMILES, of which 90% are novel.</p>
<h3 id="experiment-1-learning-to-avoid-sulphur">Experiment 1: Learning to Avoid Sulphur</h3>
<p>A proof-of-principle task where the scoring function assigns $S(A) = 1$ for valid sulphur-free molecules, $S(A) = 0$ for invalid SMILES, and $S(A) = -1$ for sulphur-containing molecules.</p>
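<p>A toy version of this scoring function, with the validity check left as a caller-supplied predicate (RDKit parsing in the paper). The single-character sulphur test is a simplification; it would, for example, also flag <code>[Si]</code>, which a real implementation should parse atom-by-atom:</p>

```python
def sulphur_score(smiles, is_valid):
    """Experiment 1 scoring: +1 for valid sulphur-free molecules,
    0 for invalid SMILES, -1 for sulphur-containing molecules.

    `is_valid` is a caller-supplied validity predicate (e.g. RDKit parsing).
    Note: the crude 'S'/'s' character check would also flag [Si].
    """
    if not is_valid(smiles):
        return 0
    return -1 if ("S" in smiles or "s" in smiles) else 1
```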
<p>The Agent method was compared against three alternatives:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Fraction No S</th>
          <th>Avg MW</th>
          <th>Avg cLogP</th>
          <th>Avg RotBonds</th>
          <th>Avg AromRings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior</td>
          <td>0.94</td>
          <td>0.66</td>
          <td>371</td>
          <td>3.36</td>
          <td>5.39</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Agent</td>
          <td>0.95</td>
          <td>0.98</td>
          <td>367</td>
          <td>3.37</td>
          <td>5.41</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Action basis</td>
          <td>0.95</td>
          <td>0.92</td>
          <td>372</td>
          <td>3.39</td>
          <td>6.08</td>
          <td>2.09</td>
      </tr>
      <tr>
          <td>REINFORCE</td>
          <td>0.98</td>
          <td>0.98</td>
          <td>585</td>
          <td>11.3</td>
          <td>30.0</td>
          <td>0.57</td>
      </tr>
      <tr>
          <td>REINFORCE + Prior</td>
          <td>0.98</td>
          <td>0.92</td>
          <td>232</td>
          <td>3.05</td>
          <td>2.8</td>
          <td>2.11</td>
      </tr>
  </tbody>
</table>
<p>Standard REINFORCE exploited the reward by generating sequences of predominantly &ldquo;C&rdquo; (average MW 585, cLogP 11.3). REINFORCE + Prior avoided this but collapsed to small, simplistic structures (MW 232). The Agent achieved 98% sulphur-free structures while maintaining molecular properties nearly identical to the Prior, demonstrating that augmented episodic likelihood preserves the prior distribution.</p>
<h3 id="experiment-2-similarity-guided-generation-celecoxib-analogues">Experiment 2: Similarity-Guided Generation (Celecoxib Analogues)</h3>
<p>The scoring function uses <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> on FCFP4 fingerprints:</p>
<p>$$
S(A) = -1 + 2 \times \frac{\min(J_{i,j}, k)}{k}
$$</p>
<p>where $k$ caps the rewarded similarity. With $k = 1$ and $\sigma = 15$, the Agent recovers <a href="https://en.wikipedia.org/wiki/Celecoxib">Celecoxib</a> itself within 200 training steps. Even when all structures with $J &gt; 0.5$ to Celecoxib (1,804 molecules) were removed from the Prior training set, the Agent still found Celecoxib after 400 steps, despite a 700-fold reduction in prior likelihood ($\log_e P$ from $-12.7$ to $-19.2$).</p>
<p>With moderate similarity targets ($k = 0.7$, $\sigma = 12$), the Agent generates diverse analogues including scaffold hops where functional groups are rearranged.</p>
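<p>The capped-similarity score maps a Jaccard similarity $J_{i,j} \in [0, 1]$ onto $[-1, 1]$, saturating at the cap $k$; a direct transcription:</p>

```python
def similarity_score(jaccard, k):
    """S(A) = -1 + 2 * min(J, k) / k: reaches +1 once similarity hits the cap k,
    so the agent is not pushed to reproduce the query exactly when k < 1."""
    return -1.0 + 2.0 * min(jaccard, k) / k
```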
<h3 id="experiment-3-target-activity-drd2">Experiment 3: Target Activity (DRD2)</h3>
<p>The most drug-discovery-relevant task: generating molecules predicted active against the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2 (DRD2)</a>. An SVM classifier (Gaussian kernel, $C = 2^7$, $\gamma = 2^{-6}$) was trained on bioactivity data from ExCAPE-DB (7,218 actives with pIC50 &gt; 5, 100,000 sampled inactives). The actives were split by Butina clustering (ECFP6, cutoff 0.4) to decrease nearest-neighbor similarity between train and test sets.</p>
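<p>The reported SVM hyperparameters translate directly to scikit-learn; a sketch only, since the actual model was trained on fingerprints of the ExCAPE-DB compounds, which are not reproduced here:</p>

```python
from sklearn.svm import SVC

def make_drd2_classifier():
    """Activity model with the paper's reported settings:
    Gaussian (RBF) kernel, C = 2^7, gamma = 2^-6.
    Inputs would be molecular fingerprint vectors in practice."""
    return SVC(kernel="rbf", C=2**7, gamma=2**-6, probability=True)
```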
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Prior</th>
          <th>Agent</th>
          <th>Prior (reduced)</th>
          <th>Agent (reduced)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid SMILES</td>
          <td>0.94</td>
          <td>0.99</td>
          <td>0.94</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Fraction predicted actives</td>
          <td>0.03</td>
          <td>0.97</td>
          <td>0.02</td>
          <td>0.96</td>
      </tr>
      <tr>
          <td>Fraction similar to train active</td>
          <td>0.02</td>
          <td>0.79</td>
          <td>0.02</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>Fraction similar to test active</td>
          <td>0.01</td>
          <td>0.46</td>
          <td>0.01</td>
          <td>0.38</td>
      </tr>
      <tr>
<td>Test actives recovered (&times;10<sup>-3</sup>)</td>
          <td>13.5</td>
          <td>126</td>
          <td>2.85</td>
          <td>72.6</td>
      </tr>
  </tbody>
</table>
<p>The Agent increased the fraction of predicted actives from 2-3% (Prior) to 96-97%, representing a 250-fold enrichment in the probability of generating a test set active. The Agent based on the reduced Prior (DRD2 actives removed from ChEMBL) still recovered 7% of test actives, meaning it generated experimentally confirmed actives that appeared in neither the generative model nor the activity prediction model training data.</p>
<h2 id="anchored-policy-learning-prevents-reward-exploitation">Anchored Policy Learning Prevents Reward Exploitation</h2>
<p>The key finding is that augmented episodic likelihood successfully balances score optimization with prior distribution preservation. The Agent achieves task objectives (sulphur avoidance, similarity targets, activity prediction) while maintaining the molecular property distributions learned from ChEMBL. This is a significant improvement over standard REINFORCE, which either exploits rewards trivially or collapses to simple structures.</p>
<p>Analysis of the conditional probability distributions between the Prior and Agent (for DRD2 active generation) shows that the policy changes are not drastic: most trends learned by the Prior carry over, with targeted modifications at specific steps that substantially alter sequence likelihoods and generated structure types.</p>
<p>Limitations acknowledged by the authors:</p>
<ul>
<li>All experiments use single-parameter scoring functions; multi-parametric optimization (activity + DMPK + synthetic accessibility) is left for future work</li>
<li>The quality of generated structures depends heavily on the Prior&rsquo;s coverage of chemical space</li>
<li>The activity model (SVM) has limited domain of applicability, and structures outside this domain may be falsely scored</li>
<li>No exhaustive study of how Prior training set size, model size, and regularization affect generation quality</li>
</ul>
<p>Future directions include multi-parametric scoring functions, exploration of token embeddings, and adversarial training where the scoring function is replaced by a discriminator network (GAN-style training).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>1.5M structures</td>
          <td>10-50 heavy atoms, filtered elements</td>
      </tr>
      <tr>
          <td>DRD2 activity model</td>
          <td>ExCAPE-DB</td>
          <td>7,218 actives + 100K inactives</td>
          <td>Butina clustering split (ECFP6, cutoff 0.4)</td>
      </tr>
      <tr>
          <td>Similarity target</td>
          <td>Celecoxib</td>
          <td>1 query structure</td>
          <td>FCFP4 fingerprints for Jaccard similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prior</strong>: 3-layer GRU RNN (1024 units/layer), Adam optimizer, 50K steps, batch size 128, LR 0.001 with 0.02 decay/100 steps</li>
<li><strong>Agent</strong>: Same architecture, SGD with LR 0.0005, gradient clipping [-3, 3], on-policy batches of 128</li>
<li><strong>DRD2 model</strong>: SVM with Gaussian kernel ($C = 2^7$, $\gamma = 2^{-6}$), grid search on validation set</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MarcusOlivecrona/REINVENT">REINVENT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation in TensorFlow/Python 2.7</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.572576">Archived version</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Zenodo archive (DOI: 10.5281/zenodo.572576)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>SMILES validity rate (RDKit parsing)</li>
<li>Fraction of structures satisfying scoring function</li>
<li>Molecular property distributions (MW, cLogP, rotatable bonds, aromatic rings)</li>
<li>Jaccard similarity on ECFP6/FCFP4 fingerprints</li>
<li>Recovery rate of known actives from test set</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. The implementation uses TensorFlow 1.0.1 with Python 2.7, RDKit, and Scikit-learn.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Olivecrona, M., Blaschke, T., Engkvist, O., &amp; Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. <em>Journal of Cheminformatics</em>, 9(1), 48.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{olivecrona2017molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular de-novo design through deep reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Olivecrona, Marcus and Blaschke, Thomas and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{48}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-017-0235-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ORGAN: Objective-Reinforced GANs for Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</guid><description>ORGAN combines GANs with reinforcement learning to steer SMILES-based molecular generation toward drug-likeness, solubility, and synthesizability objectives.</description><content:encoded><![CDATA[<h2 id="combining-gans-and-reinforcement-learning-for-goal-directed-sequence-generation">Combining GANs and Reinforcement Learning for Goal-Directed Sequence Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces ORGAN (Objective-Reinforced Generative Adversarial Network), a framework for generating sequences that are both realistic (close to the training distribution) and optimized for domain-specific objectives. ORGAN extends SeqGAN by adding external reward functions to the reinforcement learning signal, with a tunable parameter $\lambda$ controlling the balance between adversarial (discriminator) and objective-based rewards. The authors demonstrate ORGAN on two domains: molecular generation using <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings (optimizing druglikeness, solubility, and synthesizability) and musical melody generation (optimizing tonality and step ratios).</p>
<h2 id="exposure-bias-and-mode-collapse-in-discrete-sequence-generation">Exposure Bias and Mode Collapse in Discrete Sequence Generation</h2>
<p>Generating discrete sequences with desirable properties presents two intertwined challenges. First, RNNs trained via maximum likelihood estimation (MLE) suffer from exposure bias, where the model sees only ground-truth prefixes during training but must condition on its own (potentially erroneous) outputs at generation time. Second, while <a href="/posts/what-is-a-gan/">GANs</a> can address some of these issues through adversarial training, they were not initially applicable to discrete data due to non-differentiability of the sampling step. SeqGAN resolved this by framing the generator as an RL agent, but it optimizes only for distributional fidelity (fooling the discriminator) without any mechanism to steer generation toward specific property targets.</p>
<p>In drug discovery, simply generating valid, drug-like molecules is insufficient. Practitioners need to optimize for particular pharmaceutical properties (e.g., solubility, synthesizability, druglikeness) while maintaining structural diversity. Naive RL approaches can optimize properties effectively but tend to collapse onto trivial solutions (e.g., repeating &ldquo;CCCCCCC&rdquo; to maximize solubility). The challenge is to combine the distributional regularization of adversarial training with the goal-directedness of RL.</p>
<h2 id="mixed-reward-interpolating-between-adversarial-and-objective-signals">Mixed Reward: Interpolating Between Adversarial and Objective Signals</h2>
<p>ORGAN&rsquo;s core innovation is a reward function that linearly interpolates between the discriminator score and domain-specific objectives:</p>
<p>$$R(Y_{1:T}) = \lambda \cdot D_{\phi}(Y_{1:T}) + (1 - \lambda) \cdot O_{i}(Y_{1:T})$$</p>
<p>When $\lambda = 1$, the model reduces to SeqGAN (pure adversarial training). When $\lambda = 0$, it becomes naive RL optimizing only the objective. Intermediate values allow the adversarial component to regularize the generator, keeping samples within the distribution while the objective component steers toward desired properties.</p>
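<p>The mixed reward is a single convex combination; a minimal sketch:</p>

```python
def organ_reward(d_score, objective, lam):
    """R(Y) = lambda * D(Y) + (1 - lambda) * O(Y).

    lam = 1.0 recovers SeqGAN (pure adversarial reward);
    lam = 0.0 recovers naive RL on the objective alone.
    """
    assert 0.0 <= lam <= 1.0
    return lam * d_score + (1.0 - lam) * objective
```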
<p>The generator $G_{\theta}$ is an LSTM-based RNN that produces sequences token-by-token. Training follows the REINFORCE algorithm, where the expected long-term reward is:</p>
<p>$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_{0}, \theta\right] = \sum_{y_{1} \in Y} G_{\theta}(y_{1} \mid s_{0}) \cdot Q(s_{0}, y_{1})$$</p>
<p>For intermediate timesteps (partial sequences), the action-value function $Q$ is estimated via $N$-time Monte Carlo rollouts:</p>
<p>$$Q(Y_{1:t-1}, y_{t}) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} R(Y_{1:T}^{n}), &amp; \text{if } t &lt; T \\ R(Y_{1:T}), &amp; \text{if } t = T \end{cases}$$</p>
<p>where $Y_{1:T}^{n}$ are completions sampled by rolling out the current policy $G_{\theta}$ from state $Y_{1:t}$.</p>
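<p>A sketch of the Monte Carlo Q-estimate, with the policy abstracted as a caller-supplied <code>rollout</code> function that samples the next token (function names are illustrative):</p>

```python
def mc_q_estimate(prefix, rollout, reward, T, n_rollouts=16):
    """Estimate Q for a partial sequence by completing it n_rollouts times
    under the current policy and averaging terminal rewards; at the final
    step (t = T) the reward is used directly."""
    if len(prefix) == T:
        return reward(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < T:
            seq.append(rollout(seq))  # sample next token from the policy
        total += reward(seq)
    return total / n_rollouts
```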
<p>The policy gradient is:</p>
<p>$$\nabla_{\theta} J(\theta) \simeq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_{t} \sim G_{\theta}(y_{t} \mid Y_{1:t-1})} \left[\nabla_{\theta} \log G_{\theta}(y_{t} \mid Y_{1:t-1}) \cdot Q(Y_{1:t-1}, y_{t})\right]$$</p>
<p>Two additional mechanisms improve training:</p>
<ol>
<li><strong>Diversity penalty</strong>: Repeated sequences have their reward divided by their copy count, providing diminishing returns for non-unique outputs.</li>
<li><strong>Wasserstein distance</strong>: The authors also implement a variant (OR(W)GAN) that replaces the standard GAN discriminator loss with the Wasserstein-1 distance via Kantorovich-Rubinstein duality, which can improve training stability and diversity.</li>
</ol>
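<p>The diversity penalty is straightforward to sketch: each sequence&rsquo;s reward within a batch is divided by its copy count:</p>

```python
from collections import Counter

def penalized_rewards(sequences, base_reward):
    """Divide each sequence's reward by its number of copies in the batch,
    giving diminishing returns for non-unique outputs."""
    counts = Counter(sequences)
    return [base_reward(s) / counts[s] for s in sequences]
```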
<h2 id="molecular-and-musical-melody-generation-experiments">Molecular and Musical Melody Generation Experiments</h2>
<h3 id="architecture">Architecture</h3>
<p>The generator $G_{\theta}$ is an RNN with LSTM cells. The discriminator $D_{\phi}$ is a CNN for text classification following Kim (2014), with 75% dropout and L2 regularization. All optimization uses Adam. Molecular metrics are computed with RDKit.</p>
<h3 id="molecular-generation-setup">Molecular Generation Setup</h3>
<p>Training data consists of 5,000 random molecules from the QM9 dataset (134k stable small molecules with up to 9 heavy atoms), encoded as SMILES strings with maximum sequence length 51 and alphabet size 43. Each generator is pre-trained for 250 MLE epochs, with the discriminator trained for 10 epochs. Adversarial/RL training then proceeds for up to 100 additional epochs. The default $\lambda$ is 0.5.</p>
<p>Three molecular objectives are evaluated:</p>
<ul>
<li><strong>Solubility (LogP)</strong>: water-octanol partition coefficient via RDKit&rsquo;s Crippen function</li>
<li><strong>Synthesizability</strong>: SA score estimating ease of synthesis (0 = hard, 1 = easy)</li>
<li><strong>Druglikeness</strong>: QED score capturing medicinal chemistry aesthetics</li>
</ul>
<p>Diversity is measured using average Jaccard distance of molecular fingerprints relative to a random training subset.</p>
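<p>Treating fingerprints as bit sets, the diversity metric can be sketched with plain set operations; in practice RDKit&rsquo;s fingerprints would supply the sets:</p>

```python
def jaccard_distance(fp_a, fp_b):
    """1 - |A n B| / |A u B| on fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / len(fp_a | fp_b)

def avg_diversity(gen_fps, ref_fps):
    """Mean Jaccard distance of each generated fingerprint to a
    random reference subset of the training data."""
    pairs = [(g, r) for g in gen_fps for r in ref_fps]
    return sum(jaccard_distance(g, r) for g, r in pairs) / len(pairs)
```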
<h3 id="molecular-generation-results">Molecular Generation Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Validity (%)</th>
          <th>Diversity</th>
          <th>Druglikeness</th>
          <th>Synthesizability</th>
          <th>Solubility</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>75.9</td>
          <td>0.64</td>
          <td>0.48 (0%)</td>
          <td>0.23 (0%)</td>
          <td>0.30 (0%)</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>80.3</td>
          <td>0.61</td>
          <td>0.49 (+2%)</td>
          <td>0.25 (+6%)</td>
          <td>0.31 (+3%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>ORGAN</td>
          <td>88.2</td>
          <td>0.55</td>
          <td>0.52 (+8%)</td>
          <td>0.32 (+38%)</td>
          <td>0.35 (+18%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>OR(W)GAN</td>
          <td>85.0</td>
          <td>0.95</td>
          <td>0.60 (+25%)</td>
          <td>0.54 (+130%)</td>
          <td>0.47 (+57%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Naive RL</td>
          <td>97.1</td>
          <td>0.80</td>
          <td>0.57 (+19%)</td>
          <td>0.53 (+126%)</td>
          <td>0.50 (+67%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>ORGAN</td>
          <td>96.5</td>
          <td>0.92</td>
          <td>0.51 (+6%)</td>
          <td>0.83 (+255%)</td>
          <td>0.45 (+52%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>OR(W)GAN</td>
          <td>97.6</td>
          <td>1.00</td>
          <td>0.20 (-59%)</td>
          <td>0.75 (+223%)</td>
          <td>0.84 (+184%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>ORGAN</td>
          <td>94.7</td>
          <td>0.76</td>
          <td>0.50 (+4%)</td>
          <td>0.63 (+171%)</td>
          <td>0.55 (+85%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>OR(W)GAN</td>
          <td>94.1</td>
          <td>0.90</td>
          <td>0.42 (-12%)</td>
          <td>0.66 (+185%)</td>
          <td>0.54 (+81%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>Naive RL</td>
          <td>92.7</td>
          <td>0.75</td>
          <td>0.49 (+3%)</td>
          <td>0.70 (+200%)</td>
          <td>0.78 (+162%)</td>
      </tr>
      <tr>
          <td>All (alternated)</td>
          <td>ORGAN</td>
          <td>96.1</td>
          <td>0.92</td>
          <td>0.52 (+9%)</td>
          <td>0.71 (+206%)</td>
          <td>0.53 (+79%)</td>
      </tr>
  </tbody>
</table>
<p>Key observations: OR(W)GAN consistently achieves higher diversity than standard ORGAN. Naive RL often achieves higher raw objective scores but at the cost of generating trivial solutions (e.g., simple atom chains for solubility). The Wasserstein variant provides better diversity properties. Multi-objective training via alternating objectives across epochs achieves gains comparable to individually optimized models.</p>
<h3 id="music-generation-setup">Music Generation Setup</h3>
<p>Using 1,000 melodies from the EsAC folk dataset, each encoded as 36-token sequences where tokens represent sixteenth-note events across three octaves (C3-B5). Two metrics are optimized: tonality (proportion of perfect fifths) and ratio of steps (conjunct melodic motion). Diversity is measured as average pairwise edit distance.</p>
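<p>The ratio-of-steps metric rewards conjunct melodic motion; a simplified sketch over scale degrees (the paper&rsquo;s exact tokenization of sixteenth-note events is not reproduced here):</p>

```python
def ratio_of_steps(degrees):
    """Fraction of consecutive note pairs exactly one scale step apart --
    a simplified stand-in for ORGAN's conjunct-motion objective."""
    if len(degrees) < 2:
        return 0.0
    steps = sum(1 for a, b in zip(degrees, degrees[1:]) if abs(a - b) == 1)
    return steps / (len(degrees) - 1)
```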
<h3 id="music-results">Music Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Diversity</th>
          <th>Tonality</th>
          <th>Ratio of Steps</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>0.221</td>
          <td>0.007</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>0.187</td>
          <td>0.005</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Naive RL</td>
          <td>0.100</td>
          <td>0.478</td>
          <td>2.9E-05</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>ORGAN</td>
          <td>0.268</td>
          <td>0.372</td>
          <td>1.78E-04</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>OR(W)GAN</td>
          <td>0.268</td>
          <td>0.177</td>
          <td>2.4E-04</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Naive RL</td>
          <td>0.321</td>
          <td>0.001</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>ORGAN</td>
          <td>0.433</td>
          <td>0.001</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>OR(W)GAN</td>
          <td>0.134</td>
          <td>5.95E-05</td>
          <td>0.622</td>
      </tr>
  </tbody>
</table>
<p>ORGAN outperforms SeqGAN and MLE on all metrics. Naive RL achieves higher raw scores but with lower diversity, producing simpler, less interesting outputs.</p>
<h2 id="capacity-ceilings-trade-offs-and-future-directions">Capacity Ceilings, Trade-offs, and Future Directions</h2>
<p>The authors identify several limitations and findings:</p>
<p><strong>Capacity ceiling</strong>: GAN-based models tend to generate sequences matching the training set&rsquo;s average length (15.42 characters). RL-only approaches can break this constraint, generating shorter (9.4) or longer (21.3) sequences depending on the objective. The upper bound of optimized properties also matches the training data&rsquo;s maximum, suggesting dataset-dependent limits.</p>
<p><strong>Lambda trade-off</strong>: Varying $\lambda$ reveals an optimal balance between objective optimization and distributional fidelity. This optimum depends on the model, dataset, and metric, suggesting that hyperparameter search over $\lambda$ is important in practice.</p>
<p><strong>Tonality vs. steps inverse relationship</strong>: In the music task, optimizing for tonality (perfect fifths) inherently conflicts with optimizing for step ratios (consecutive notes), since consecutive scale notes do not form perfect fifths.</p>
<p><strong>Limitations</strong>: The paper evaluates on relatively small datasets (5k molecules, 1k melodies) and short sequences. The molecular experiments use QM9 (small molecules with up to 9 heavy atoms), which limits the scope of conclusions for drug-like chemical space. The Wasserstein variant sometimes lags behind the standard GAN loss in raw metric scores, though it offers better diversity.</p>
<p><strong>Future directions</strong>: The authors propose extending ORGAN to non-sequential data (images, audio) by framing GANs as RL problems more broadly, and investigating how different heuristic choices affect performance. They also suggest exploring other discrete GAN formulations (MaliGAN, BGAN) with RL extensions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecular training</td>
          <td>QM9 subset</td>
          <td>5,000 molecules</td>
          <td>Random subset from 134k stable small molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Music training</td>
          <td>EsAC folk dataset</td>
          <td>1,000 melodies</td>
          <td>36-token sequences, processed following Chen et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Generator pre-trained for 250 epochs via MLE; discriminator for 10 epochs</li>
<li>Adversarial/RL training for up to 100 epochs</li>
<li>Default $\lambda = 0.5$ for reward mixing</li>
<li>Monte Carlo rollouts for intermediate reward estimation</li>
<li>Duplicate penalty: reward divided by copy count</li>
</ul>
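<p>The duplicate penalty admits a direct batch-level sketch (an illustrative reimplementation, not the official code):</p>

```python
from collections import Counter

def penalize_duplicates(batch, rewards):
    """Divide each sequence's reward by the number of times it appears
    in the batch, so exact copies share rather than multiply their reward."""
    counts = Counter(batch)
    return [r / counts[seq] for seq, r in zip(batch, rewards)]
```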
<h3 id="models">Models</h3>
<ul>
<li><strong>Generator</strong>: RNN with LSTM cells</li>
<li><strong>Discriminator</strong>: CNN for text classification (Kim, 2014) with 75% dropout, L2 regularization</li>
<li><strong>Optimizer</strong>: Adam for all gradient descent steps</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Domain</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (%)</td>
          <td>Fraction of generated SMILES that decode to valid molecules</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Average Jaccard distance of fingerprints to training subset</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Druglikeness (QED)</td>
          <td>Quantitative Estimate of Drug-likeness</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Synthesizability (SA)</td>
          <td>Synthetic accessibility score</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Solubility (LogP)</td>
          <td>Water-octanol partition coefficient</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Proportion of perfect fifths</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Proportion of conjunct melodic intervals</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Diversity (edit)</td>
          <td>Average pairwise edit distance</td>
          <td>Music</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gablg1/ORGAN">ORGAN</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Official implementation including metrics for molecules and music</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guimaraes, G. L., Sánchez-Lengeling, B., Outeiral, C., Farias, P. L. C., &amp; Aspuru-Guzik, A. (2017). Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. <em>arXiv preprint arXiv:1705.10843</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guimaraes2017organ,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guimaraes, Gabriel Lima and Sanchez-Lengeling, Benjamin and Outeiral, Carlos and Farias, Pedro Luis Cunha and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1705.10843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolecularRNN: Graph-Based Molecular Generation and RL</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</guid><description>MolecularRNN extends GraphRNN with atom and bond type predictions, valency-based rejection sampling, and policy gradient optimization for molecular generation.</description><content:encoded><![CDATA[<h2 id="a-graph-recurrent-model-for-molecular-generation-with-property-optimization">A Graph Recurrent Model for Molecular Generation with Property Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces MolecularRNN, a graph-based recurrent generative model for molecular structures. The model extends GraphRNN to handle typed nodes (atom types) and typed edges (bond types), enabling direct generation of molecular graphs rather than working through string representations like SMILES. Three key contributions are combined: (1) the MolecularRNN architecture for autoregressive graph generation, (2) valency-based rejection sampling for guaranteed 100% validity at inference, and (3) policy gradient reinforcement learning for shifting molecular property distributions toward desired ranges.</p>
<h2 id="why-generate-molecules-as-graphs-rather-than-strings">Why Generate Molecules as Graphs Rather Than Strings</h2>
<p>Computational de novo molecular design aims to create novel molecules with desired properties, a task central to drug discovery. At the time of this work, most deep generative models for molecules operated on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, inheriting the complications of SMILES grammar and the problem that structurally similar molecules can have very different string representations. Graph-based representations are more natural for molecules, with atoms mapping to nodes and bonds to edges, and they allow direct enforcement of chemical constraints during generation.</p>
<p>Existing graph-based methods had their own limitations. Junction tree VAE (JT-VAE) generates molecules from structural fragments, which introduces ambiguity when converting junction trees back to molecules, particularly problematic during property optimization since molecules sharing a junction tree can have very different property values. The GCPN model uses graph convolutional networks with reinforcement learning but was evaluated only on top-3 generated molecules, making it difficult to assess overall distribution quality. Prior atom-level graph generation models like Li et al. (2018a) were restricted to molecules with at most 20 heavy atoms, limiting practical applicability.</p>
<h2 id="core-innovation-extending-graphrnn-with-chemical-constraints-and-rl">Core Innovation: Extending GraphRNN with Chemical Constraints and RL</h2>
<p>MolecularRNN builds on the GraphRNN architecture by introducing atom type predictions alongside edge type predictions. The model generates molecular graphs sequentially: at each step, a NodeRNN predicts the type of the next atom, then an EdgeRNN predicts bond types to all preceding atoms within a BFS-ordered window.</p>
<h3 id="autoregressive-graph-generation">Autoregressive Graph Generation</h3>
<p>The joint likelihood over atom types $C^{\pi}$ and adjacency vectors $S^{\pi}$ under BFS ordering $\pi$ is factorized as:</p>
<p>$$
p\left(S^{\pi}, C^{\pi}\right) = \prod_{i=1}^{n+1} p\left(C_{i}^{\pi} \mid S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right) p\left(S_{i}^{\pi} \mid C_{i}^{\pi}, S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right)
$$</p>
<p>NodeRNN processes embeddings of previous atom types and adjacency vectors to produce a hidden state, from which a two-layer MLP with softmax predicts the next atom type $\psi_{i}$:</p>
<p>$$
h_{i}^{\text{node}} = \text{NodeRNN}\left(h_{i-1}^{\text{node}}, \left[\text{emb}(S_{i-1}^{\pi}), \text{emb}(C_{i-1}^{\pi})\right]\right)
$$</p>
<p>$$
\psi_{i} = \text{NodeMLP}\left(h_{i}^{\text{node}}\right)
$$</p>
<p>EdgeRNN then unrolls across preceding atoms to predict bond types $\phi_{i,j}$, initialized with the NodeRNN hidden state:</p>
<p>$$
h_{i,j}^{\text{edge}} = \text{EdgeRNN}\left(h_{i,j-1}^{\text{edge}}, \text{emb}(S_{i,j-1}^{\pi})\right), \quad h_{i,0}^{\text{edge}} = h_{i}^{\text{node}}
$$</p>
<p>$$
\phi_{i,j} = \text{EdgeMLP}\left(h_{i,j}^{\text{edge}}\right)
$$</p>
<p>Bond types are categorical over {no bond, single, double, triple}, and molecules are represented in kekulized form. BFS ordering limits the EdgeRNN window to $M = 12$ preceding atoms.</p>
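<p>The fixed-size EdgeRNN input implied by the BFS window can be sketched as follows (my own reading of the GraphRNN-style setup, not code from the paper; bond orders are encoded 0 = no bond, 1/2/3 = single/double/triple):</p>

```python
def adjacency_vector(bonds_to_previous, M=12):
    """Fixed-length adjacency vector for the current atom: bond types to
    its M most recent BFS predecessors, left-padded with 'no bond' (0)
    when the atom has fewer than M predecessors."""
    window = bonds_to_previous[-M:]
    return [0] * (M - len(window)) + window
```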
<h3 id="valency-based-rejection-sampling">Valency-Based Rejection Sampling</h3>
<p>During inference, each proposed bond of order $k$ between atoms $i$ and $j$ is accepted only if both atoms remain within their allowed valencies:</p>
<p>$$
\sum_{j} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{i}^{\pi}} \quad \text{and} \quad \sum_{i} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{j}^{\pi}}
$$</p>
<p>Atoms that do not fill their valencies are complemented with hydrogens. This constraint can be enforced directly on graphs (unlike SMILES, where intermediate substrings are not chemically meaningful), yielding 100% valid molecules.</p>
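<p>The rejection test is cheap to state in code. A minimal sketch (the valency table is a simplification; e.g. sulfur also occurs with valence 2 or 4, and hydrogens fill any remainder):</p>

```python
# Simplified maximum valencies for a few heavy-atom types.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "S": 6}

def bond_is_acceptable(atoms, adj, i, j, k):
    """Accept a proposed bond of order k between atoms i and j only if
    neither atom would exceed its maximum valency; `adj` is a symmetric
    matrix of current bond orders."""
    return (sum(adj[i]) + k <= MAX_VALENCE[atoms[i]]
            and sum(adj[j]) + k <= MAX_VALENCE[atoms[j]])
```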
<h3 id="property-optimization-via-policy-gradient">Property Optimization via Policy Gradient</h3>
<p>For property optimization, MolecularRNN is formulated as a policy network in a Markov Decision Process. The loss function uses REINFORCE with a discounted final reward:</p>
<p>$$
L(\theta) = -\sum_{i=1}^{N} r(s_{N}) \cdot \gamma^{i} \cdot \log p(s_{i} \mid s_{i-1}; \theta)
$$</p>
<p>where $r(s_{N})$ is the reward from a property critic and $\gamma$ is a discount factor. The authors also introduce a structural penalty during RL training that assigns a penalty of $-10$ to atoms violating valency constraints, providing a learning signal from invalid intermediate molecules.</p>
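<p>The loss above reduces to a short function once per-step log-probabilities are collected. A minimal sketch, assuming the critic's reward is available only at the terminal state:</p>

```python
import math

def policy_gradient_loss(step_log_probs, final_reward, gamma=0.97):
    """L(theta) = -sum_i r(s_N) * gamma**i * log p(s_i | s_{i-1}),
    with the terminal reward r(s_N) discounted back to each step."""
    return -sum(final_reward * gamma ** i * lp
                for i, lp in enumerate(step_log_probs, start=1))
```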
<h2 id="experimental-setup-pretraining-and-property-optimization">Experimental Setup: Pretraining and Property Optimization</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MolecularRNN is pretrained on three datasets: ChEMBL (~1.5M bioactive molecules), <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC 250k</a> (250K randomly selected commercially available compounds), and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> (~1.9M drug-like molecules from ZINC). The model considers 9 atom types (C, N, O, F, P, S, Cl, Br, I), 3 bond types (single, double, triple), and molecules with 10-50 heavy atoms. Architecture: NodeRNN with 4 GRU layers (hidden size 256), EdgeRNN with 4 GRU layers (hidden size 128), node embedding size 128, edge embedding size 16. Training uses Adam with learning rate 0.001 and multiplicative decay on 4 GPUs with batch size 512 per GPU for 250 epochs.</p>
<h3 id="generation-quality-at-scale">Generation Quality at Scale</h3>
<p>The pretrained model generates 1 million molecules per dataset (far larger than prior work: JT-VAE used 5K samples, Li et al. used 100K). Results with valency-based rejection sampling:</p>
<table>
  <thead>
      <tr>
          <th>Training Set</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>IntDiv (p=1)</th>
          <th>IntDiv (p=2)</th>
          <th>SA Score</th>
          <th>QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>100%</td>
          <td>99.2%</td>
          <td>99.3%</td>
          <td>0.895</td>
          <td>0.890</td>
          <td>3.67 +/- 1.20</td>
          <td>0.56 +/- 0.20</td>
      </tr>
      <tr>
          <td>ZINC 250k</td>
          <td>100%</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>0.892</td>
          <td>0.887</td>
          <td>3.60 +/- 1.01</td>
          <td>0.68 +/- 0.16</td>
      </tr>
      <tr>
          <td>MOSES</td>
          <td>100%</td>
          <td>99.4%</td>
          <td>100%</td>
          <td>0.881</td>
          <td>0.876</td>
          <td>3.24 +/- 0.97</td>
          <td>0.74 +/- 0.14</td>
      </tr>
  </tbody>
</table>
<p>Comparison with baselines on ZINC 250k (30K samples):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>SA Score</th>
          <th>QED</th>
          <th>IntDiv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>100%</td>
          <td>3.37</td>
          <td>0.76</td>
          <td>0.85</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>100%</td>
          <td>99.97%</td>
          <td>100%</td>
          <td>4.62</td>
          <td>0.61</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>100%</td>
          <td>99.89%</td>
          <td>100%</td>
          <td>3.59</td>
          <td>0.68</td>
          <td>0.89</td>
      </tr>
  </tbody>
</table>
<p>GCPN generates overly complex molecules (high SA score of 4.62), while MolecularRNN produces more realistic structures with higher internal diversity than JT-VAE.</p>
<h3 id="property-optimization-results">Property Optimization Results</h3>
<p>Policy gradient optimization is run for 300 iterations with batch size 512 and constant learning rate $10^{-5}$, discount factor $\gamma = 0.97$. Top-3 scores for penalized logP and QED:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>logP 1st</th>
          <th>logP 2nd</th>
          <th>logP 3rd</th>
          <th>QED 1st</th>
          <th>QED 2nd</th>
          <th>QED 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></td>
          <td>3.63</td>
          <td>3.49</td>
          <td>3.44</td>
          <td>0.896</td>
          <td>0.824</td>
          <td>0.820</td>
      </tr>
      <tr>
          <td>JT-VAE</td>
          <td>5.30</td>
          <td>4.93</td>
          <td>4.49</td>
          <td>0.925</td>
          <td>0.911</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>7.98</td>
          <td>7.85</td>
          <td>7.80</td>
          <td>0.948</td>
          <td>0.947</td>
          <td>0.946</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>10.34</td>
          <td>10.19</td>
          <td>10.14</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.947</td>
      </tr>
  </tbody>
</table>
<p>MolecularRNN achieves the highest penalized logP scores (10.34 vs. GCPN&rsquo;s 7.98) while matching GCPN on QED. The authors also demonstrate melting temperature optimization using a GCN-based property predictor as the critic (RMSE 39.5 degrees C), showing that the RL framework generalizes to properties that cannot be computed directly from molecular graphs.</p>
<h2 id="distribution-level-evaluation-and-learned-chemical-patterns">Distribution-Level Evaluation and Learned Chemical Patterns</h2>
<p>The authors emphasize that reporting only top-3 scores is not informative, and they compare full property distributions. MolecularRNN shifts the QED distribution further toward maximum values compared to GCPN. They also note that during melting temperature optimization, the model rediscovered two chemical phenomena: fusing aromatic rings increases melting point, and the presence of polar groups (C=O, OH, NH2, heterocyclic nitrogens) enhances dipole-dipole interactions and raises melting temperature.</p>
<p>Without valency-based rejection sampling, the pretrained model achieves 65% validity. After structural penalty training (assigning -10 to valency-violating atoms and optimizing with policy gradient), validity increases to 90%. Enabling rejection sampling then achieves 100%.</p>
<p>Several limitations are worth noting. The BFS ordering introduces an arbitrary sequencing over equivalent graph traversals (the node order permutation problem is not addressed). The evaluation uses top-3 scores for property optimization, though the authors do advocate for distributional evaluation. The molecule size is capped at 50 heavy atoms. The paper does not report training time or wall-clock generation speed. Future directions mentioned include multi-objective property optimization and scaffold completion (graph completion from a given core structure).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>~1.5M molecules</td>
          <td>Bioactive molecules with experimental measurements</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC 250k</td>
          <td>250K molecules</td>
          <td>Random subset of ZINC database</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>MOSES</td>
          <td>~1.9M molecules</td>
          <td>Drug-like subset of ZINC</td>
      </tr>
      <tr>
          <td>Melting point critic</td>
          <td>Custom split</td>
          <td>37,940 train / 9,458 test</td>
          <td>Melting temperatures from -196 to 517 degrees C</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Maximum likelihood with Adam optimizer, learning rate 0.001 with multiplicative decay to $10^{-5}$, 250 epochs</li>
<li><strong>Structural penalty</strong>: Policy gradient with -10 penalty per valency-violating atom</li>
<li><strong>Property optimization</strong>: REINFORCE (policy gradient), 300 iterations, batch size 512, learning rate $10^{-5}$, discount factor $\gamma = 0.97$</li>
<li><strong>Melting point critic</strong>: GCN regression (4 layers, hidden size 128), Adam with learning rate 0.001, exponential decay $\gamma = 0.8$, 30 epochs, batch size 32</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>NodeRNN</strong>: 4 GRU layers, hidden size 256, node embedding 128</li>
<li><strong>EdgeRNN</strong>: 4 GRU layers, hidden size 128, edge embedding 16</li>
<li><strong>NodeMLP/EdgeMLP</strong>: 2-layer MLP with 128 hidden units, ReLU activation, softmax output</li>
<li><strong>BFS window</strong>: $M = 12$ preceding atoms</li>
<li><strong>Atom types</strong>: 9 (C, N, O, F, P, S, Cl, Br, I)</li>
<li><strong>Bond types</strong>: 3 (single, double, triple) + no bond</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>% chemically valid molecules (RDKit)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>% unique in generated pool (up to 1M)</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>% not in training set</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Average pairwise Tanimoto distance</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility (2-4 optimal range)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Drug-likeness score (0-1)</td>
      </tr>
      <tr>
          <td>Penalized logP</td>
          <td>Lipophilicity with ring and SA penalties</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 GPUs (NVIDIA, specific model not stated)</li>
<li>Per-GPU batch size of 512 for pretraining</li>
<li>Training time not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Popova, M., Shvets, M., Oliva, J., &amp; Isayev, O. (2019). MolecularRNN: Generating realistic molecular graphs with optimized properties. <em>arXiv preprint arXiv:1905.13372</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{popova2019molecularrnn,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolecularRNN: Generating realistic molecular graphs with optimized properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Popova, Mariya and Shvets, Mykhailo and Oliva, Junier and Isayev, Olexandr}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1905.13372}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Memory-Assisted RL for Diverse De Novo Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</guid><description>A memory unit for REINVENT-based RL that tracks generated scaffolds and penalizes repeated solutions, increasing molecular diversity up to fourfold.</description><content:encoded><![CDATA[<h2 id="a-memory-module-for-diverse-molecular-generation-via-rl">A Memory Module for Diverse Molecular Generation via RL</h2>
<p>This is a <strong>Method</strong> paper that introduces a memory unit for reinforcement learning (RL)-based molecular generation. The primary contribution is a hash-table-based memory mechanism that integrates into the REINVENT framework&rsquo;s scoring function. By tracking previously generated high-scoring molecules and penalizing the reward when new molecules are too similar to those already stored, the memory unit forces the generative model to explore different regions of chemical space rather than collapsing onto a single scaffold family.</p>
<h2 id="policy-collapse-limits-rl-based-de-novo-design">Policy Collapse Limits RL-Based De Novo Design</h2>
<p>Recurrent neural networks (RNNs) trained with reinforcement learning can generate novel molecules optimized for desired properties. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> algorithm and related approaches (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, GENTRL) demonstrated the viability of coupling a pretrained SMILES-based generative model with a scoring function via RL. However, a persistent problem is <strong>policy collapse</strong> (also called mode collapse): once the model discovers a high-scoring region of chemical space, it continues to exploit that region, producing structurally similar compounds with minor substitution differences. This severely limits the practical utility of RL-based generation in drug design, where medicinal chemists need diverse scaffolds to explore structure-activity relationships and manage intellectual property concerns.</p>
<p>Prior work by Liu et al. [31] attempted to address this by engineering an explorative RNN alongside the standard generative RNN, but it did not substantially increase diversity compared to standard REINVENT. Other approaches like Generative Examination Networks (GEN) performed statistical analysis during training but were not evaluated in optimization scenarios.</p>
<h2 id="core-innovation-hash-table-memory-unit-for-reward-modification">Core Innovation: Hash-Table Memory Unit for Reward Modification</h2>
<p>The key insight is to dynamically modify the reward surface during RL by maintaining a memory of previously explored chemical space. The memory unit is a hash table of index-bucket pairs. Each bucket stores up to a fixed number of high-scoring molecules (default: 25) that are chemically similar to a seed molecule (the index).</p>
<h3 id="integration-with-reinvent">Integration with REINVENT</h3>
<p>The memory unit modifies the augmented likelihood used in REINVENT. For a generated compound $c$, the augmented log-likelihood becomes:</p>
<p>$$
\log P(c)_{Aug} = \log P(c)_{PriorNetwork} + \sigma \times S(c) \times M(c)
$$</p>
<p>where $\sigma$ is a scalar coefficient, $S(c)$ is the scoring function output, and $M(c)$ is the memory unit output (either 0 or 1). The reward is:</p>
<p>$$
R(c) = \left(\log P(c)_{Aug} - \log P(c)_{AgentNetwork}\right)^2
$$</p>
<p>and the loss is $\text{loss} = -R(c)$.</p>
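<p>Putting the three equations together (a sketch, not the authors' implementation; the default <code>sigma = 60</code> here is an illustrative value, not taken from the paper):</p>

```python
def reinvent_memory_loss(prior_ll, agent_ll, score, memory_out, sigma=60.0):
    """Memory-gated REINVENT update: the augmented log-likelihood adds
    sigma * S(c) * M(c) to the prior likelihood, the reward is the squared
    gap to the agent likelihood, and the loss is its negation."""
    augmented_ll = prior_ll + sigma * score * memory_out
    reward = (augmented_ll - agent_ll) ** 2
    return -reward  # loss = -R(c)
```

<p>Note that with <code>memory_out = 0</code> the score term vanishes entirely, which is exactly how a full bucket discourages further sampling from that region.</p>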
<h3 id="memory-unit-operation">Memory Unit Operation</h3>
<p>When a high-scoring molecule is generated:</p>
<ol>
<li>Its fingerprint or scaffold is compared against all index structures in the memory</li>
<li>If it is similar to an index (above a Tanimoto cutoff, default 0.6) and the corresponding bucket is not full, $M(c) = 1$ and the molecule is added to the bucket</li>
<li>If the bucket is full, $M(c) = 0$, effectively zeroing the reward contribution and discouraging the model from generating similar molecules</li>
<li>If no similar index exists, a new index-bucket pair is created</li>
</ol>
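<p>The four steps above can be sketched as a small class. This is a minimal illustration, not the published code; <code>similarity</code> is a placeholder for whichever criterion is chosen (e.g. ECFP4 Tanimoto):</p>

```python
class MemoryUnit:
    """Hash-table memory: buckets of similar high-scoring molecules,
    each keyed by a seed (index) structure. Calling it returns M(c)."""

    def __init__(self, similarity, bucket_size=25, cutoff=0.6):
        self.similarity = similarity
        self.bucket_size = bucket_size
        self.cutoff = cutoff
        self.buckets = {}  # index structure -> list of stored molecules

    def __call__(self, mol):
        for index, bucket in self.buckets.items():
            if self.similarity(mol, index) >= self.cutoff:
                if len(bucket) < self.bucket_size:
                    bucket.append(mol)
                    return 1  # reward kept; molecule recorded
                return 0      # bucket full: zero the reward term
        self.buckets[mol] = [mol]  # no similar index: open a new bucket
        return 1
```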
<h3 id="four-similarity-criteria">Four Similarity Criteria</h3>
<p>The authors evaluate four criteria for grouping molecules in the memory:</p>
<ol>
<li><strong>Compound similarity</strong>: ECFP4 Tanimoto similarity at the whole-molecule level</li>
<li><strong>Identical Bemis-Murcko (BM) scaffold</strong>: exact match of Bemis-Murcko frameworks</li>
<li><strong>Identical carbon skeleton</strong>: exact match of carbon skeletons (BM scaffolds with all heteroatoms replaced by carbon and bonds set to single)</li>
<li><strong>Scaffold similarity</strong>: atom pair fingerprint Tanimoto similarity between carbon skeletons (fuzzy matching)</li>
</ol>
<h3 id="alternative-output-modes">Alternative Output Modes</h3>
<p>Beyond the binary output ($M(c) \in \{0, 1\}$), the authors also explored smooth output functions. The linear mode:</p>
<p>$$
M(c) = 1 - \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>And the sigmoid mode:</p>
<p>$$
M(c) = 1 - \frac{1}{1 + e^{-\left(\frac{\frac{\text{compounds in bucket}}{\text{bucket size}} \times 2 - 1}{0.15}\right)}}
$$</p>
<p>Both smooth modes yielded slightly fewer analogs than the binary mode and were not pursued further.</p>
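<p>For comparison, the three output modes as functions of bucket fill (a direct transcription of the formulas above, with names of my choosing):</p>

```python
import math

def memory_binary(filled, size):
    return 1 if filled < size else 0

def memory_linear(filled, size):
    return 1.0 - filled / size

def memory_sigmoid(filled, size):
    # Map fill fraction to [-1, 1], then squash with a steep sigmoid.
    x = (filled / size) * 2.0 - 1.0
    return 1.0 - 1.0 / (1.0 + math.exp(-x / 0.15))
```

<p>All three start near 1 for an empty bucket and reach (or approach) 0 as the bucket fills; the smooth modes simply taper the reward earlier.</p>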
<h2 id="experimental-setup-logp-optimization-and-target-activity-prediction">Experimental Setup: LogP Optimization and Target Activity Prediction</h2>
<h3 id="case-study-1-logp-optimization">Case Study 1: LogP Optimization</h3>
<p>As a proof of concept, the authors optimized LogP values for known DRD2 inhibitors. Starting from 487 DRD2 compounds with LogP &gt;= 5 (from ExCAPE-DB), they applied transfer learning to the prior model for 20 epochs, then ran RL for 150 iterations (100 compounds per iteration, 15,000 total). The scoring function was:</p>
<p>$$
S = 1 - \tanh\left(\min\left(|2 - \text{AlogP}|, |3 - \text{AlogP}|\right)\right)
$$</p>
<p>targeting LogP values between 2.0 and 3.0.</p>
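<p>The scoring function is easy to verify numerically. A one-line transcription (function name mine):</p>

```python
import math

def logp_score(alogp):
    """S = 1 - tanh(min(|2 - AlogP|, |3 - AlogP|)): maximal at AlogP = 2
    or 3 and decaying smoothly with distance from the [2, 3] window."""
    return 1.0 - math.tanh(min(abs(2.0 - alogp), abs(3.0 - alogp)))
```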
<h3 id="case-study-2-htr1a-and-drd2-activity-prediction">Case Study 2: HTR1A and DRD2 Activity Prediction</h3>
<p>For a more complex scenario, the authors trained SVM classifiers (with <a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt scaling</a> for probabilistic output) on bioactivity data from ExCAPE-DB to predict activity against two neurotransmitter receptors:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/5-HT1A_receptor">HTR1A</a></strong>: 3,599 actives (pIC50 &gt;= 7) and 66,684 inactives</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a></strong>: 2,981 actives (pIC50 &gt;= 7) and 346,206 inactives (100,000 sampled)</li>
</ul>
<p>Data was split using Butina clustering on ECFP6 at a 0.4 Tanimoto cutoff (60/20/20 train/val/test). The SVM models achieved excellent performance:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Set</th>
          <th>Balanced Accuracy</th>
          <th>ROC AUC</th>
          <th>F1</th>
          <th>MCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>Test</td>
          <td>0.96</td>
          <td>0.99</td>
          <td>0.75</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Test</td>
          <td>0.95</td>
          <td>0.99</td>
          <td>0.71</td>
          <td>0.72</td>
      </tr>
  </tbody>
</table>
<p>RL was run for 300 iterations (100 compounds each, 30,000 total). Compounds with predicted activity &gt;= 0.7 were considered active.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN prior model followed the REINVENT architecture: an embedding layer, three GRU layers with 256 dimensions, and a linear output layer. It was pretrained on ~1.5 million ChEMBL 25 compounds (filtered to remove known HTR1A actives and DRD2 analogs) for 10 epochs using Adam with a learning rate of 0.01.</p>
<h3 id="comparisons">Comparisons</h3>
<p>The authors compared memory-assisted RL against:</p>
<ul>
<li>Standard REINVENT RL (no memory)</li>
<li>Experience replay (re-presenting 8 high-scoring compounds per iteration)</li>
<li>Temperature scaling (values from 1.0 to 10.0)</li>
<li>Memory + experience replay combined</li>
</ul>
<h2 id="results-up-to-fourfold-increase-in-diverse-active-compounds">Results: Up to Fourfold Increase in Diverse Active Compounds</h2>
<h3 id="logp-optimization-results">LogP Optimization Results</h3>
<p>Memory-assisted RL increased the number of optimized compounds (LogP 2-3) by roughly threefold:</p>
<table>
  <thead>
      <tr>
          <th>Memory Type</th>
          <th>Optimized Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No memory</td>
          <td>938</td>
          <td>727</td>
          <td>396</td>
      </tr>
      <tr>
          <td>Compound similarity</td>
          <td>3,451</td>
          <td>2,963</td>
          <td>1,472</td>
      </tr>
      <tr>
          <td>Identical BM Scaffold</td>
          <td>3,428</td>
          <td>2,865</td>
          <td>1,398</td>
      </tr>
      <tr>
          <td>Identical Carbon Skeleton</td>
          <td>3,315</td>
          <td>3,002</td>
          <td>1,799</td>
      </tr>
      <tr>
          <td>Scaffold Similarity</td>
          <td>3,591</td>
          <td>3,056</td>
          <td>1,538</td>
      </tr>
  </tbody>
</table>
<p>The memory unit also increased the generation of relevant analogs. ECFP6 analogs (Tanimoto &gt;= 0.4 to the training set) increased from 145 to as many as 549, and shared MMP cores from 5 to as many as 19, confirming that the memory unit promoted exploration of chemically relevant space rather than random drift.</p>
<h3 id="htr1a-and-drd2-activity-optimization-results">HTR1A and DRD2 Activity Optimization Results</h3>
<p>The improvements were even more pronounced for target activity optimization:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Memory Type</th>
          <th>Active Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>No memory</td>
          <td>9,323</td>
          <td>7,312</td>
          <td>5,446</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Compound similarity</td>
          <td>16,779</td>
          <td>13,304</td>
          <td>9,887</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Identical Carbon Skeleton</td>
          <td>17,597</td>
          <td>15,531</td>
          <td>12,408</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>No memory</td>
          <td>5,143</td>
          <td>2,635</td>
          <td>1,949</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Compound similarity</td>
          <td>21,486</td>
          <td>17,844</td>
          <td>12,749</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Scaffold Similarity</td>
          <td>22,784</td>
          <td>20,712</td>
          <td>16,434</td>
      </tr>
  </tbody>
</table>
<p>For DRD2, the effect was particularly striking: standard RL showed clear policy collapse with only 576 ECFP6 analogs to the training set, while memory-assisted RL generated up to 6,315. The compound similarity memory unit produced the most MMP analogs (217 to the training set vs. 7 without memory).</p>
<h3 id="parameter-sensitivity">Parameter Sensitivity</h3>
<p>Bucket size had a modest effect: larger buckets (allowing more compounds before penalization) slightly increased analog generation. The Tanimoto similarity threshold of 0.6 was near-optimal for the scaffold similarity memory; higher thresholds reduced diversity gains. For the compound similarity memory, analog counts kept increasing at higher thresholds, but BM scaffold and carbon skeleton counts plateaued above 0.6.</p>
<h3 id="comparison-with-experience-replay-and-temperature-scaling">Comparison with Experience Replay and Temperature Scaling</h3>
<ul>
<li><strong>Experience replay alone</strong> increased diversity compared to vanilla RL but was less effective than the memory unit alone</li>
<li><strong>Memory + experience replay</strong> achieved the best results overall, as experience replay provided the model with diverse starting points for exploration after the memory unit altered the reward landscape</li>
<li><strong>Temperature scaling</strong> was largely ineffective: only a value of 1.25 showed improvement, and even then it achieved only about 50% of the analogs generated by memory-assisted RL. Temperatures above 2.0 degraded SMILES validity, and above 4.0 prevented valid molecule generation entirely</li>
</ul>
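<p>Temperature scaling simply divides the pre-softmax logits by a temperature $T$ before sampling; values above 1 flatten the token distribution, which is why high temperatures destroy SMILES validity. A minimal sketch of the operation (the logits below are hypothetical, not tied to the paper's vocabulary):</p>

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature > 1 flattens the distribution; < 1 sharpens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # hypothetical next-token logits
p_default = softmax_with_temperature(logits, 1.0)
p_flat = softmax_with_temperature(logits, 10.0)  # near-uniform
```

<p>At $T = 10$ the three probabilities become nearly equal, so sampling degenerates toward random token choice.</p>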
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>All evaluations are retrospective; no synthesized compounds were experimentally tested</li>
<li>The SVM activity models, while accurate, may have applicability domain limitations for highly novel scaffolds</li>
<li>The binary memory output mode was found to work best, but the transition from exploration to exploitation is abrupt</li>
<li>The method was only tested with two biological targets and one physicochemical property</li>
<li>Computational overhead of the memory unit is not discussed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior model training</td>
          <td>ChEMBL 25</td>
          <td>~1.5M compounds</td>
          <td>Filtered: max 50 heavy atoms, no stereochemistry, removed HTR1A actives and DRD2 analogs</td>
      </tr>
      <tr>
          <td>HTR1A activity data</td>
          <td>ExCAPE-DB</td>
          <td>3,599 actives + 66,684 inactives</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
      <tr>
          <td>DRD2 activity data</td>
          <td>ExCAPE-DB</td>
          <td>2,981 actives + 100,000 inactives (sampled)</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Generative model</strong>: RNN with embedding + 3 GRU layers (256 dim) + linear output (REINVENT architecture)</li>
<li><strong>RL</strong>: Augmented likelihood formulation with sigma scaling coefficient</li>
<li><strong>SVM classifiers</strong>: Non-linear SVM with MinMax kernel, Platt scaling, ECFP6 count-based fingerprints (2048 dim)</li>
<li><strong>Butina clustering</strong>: ECFP6 Tanimoto cutoff 0.4 for train/val/test splitting</li>
</ul>
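<p>The Butina split can be sketched in a few lines if fingerprints are represented as sets of on-bits (the fingerprints below are toy data; a real pipeline would compute ECFP6 with RDKit, and treating 0.4 as a similarity cutoff is an assumption):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def butina_cluster(fps, cutoff=0.4):
    """Greedy Butina clustering: the compound with the largest neighborhood
    becomes a centroid and claims its still-unassigned neighbors."""
    n = len(fps)
    neighbors = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    for i in sorted(range(n), key=lambda k: -len(neighbors[k])):
        if i not in unassigned:
            continue
        members = {i} | (neighbors[i] & unassigned)
        unassigned -= members
        clusters.append(sorted(members))
    return clusters

# toy fingerprints: two near-duplicates and one unrelated compound
fps = [{1, 2, 3}, {1, 2, 3, 4}, {10, 11}]
clusters = butina_cluster(fps)
```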
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unique compounds</td>
          <td>Number of distinct valid SMILES generated</td>
      </tr>
      <tr>
          <td>Unique BM scaffolds</td>
          <td>Bemis-Murcko framework diversity</td>
      </tr>
      <tr>
          <td>Unique carbon skeletons</td>
          <td>Carbon skeleton diversity (stripped BM scaffolds)</td>
      </tr>
      <tr>
          <td>ECFP6 analogs</td>
          <td>Compounds with Tanimoto &gt;= 0.4 to known actives</td>
      </tr>
      <tr>
          <td>MMP analogs</td>
          <td>Matched molecular pair relationships with known actives</td>
      </tr>
      <tr>
          <td>Shared MMP cores</td>
          <td>Scaffold cores shared between generated and known compounds</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tblaschke/reinvent-memory">reinvent-memory</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with prepared datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blaschke, T., Engkvist, O., Bajorath, J., &amp; Chen, H. (2020). Memory-assisted reinforcement learning for diverse molecular de novo design. <em>Journal of Cheminformatics</em>, 12, 68. <a href="https://doi.org/10.1186/s13321-020-00473-0">https://doi.org/10.1186/s13321-020-00473-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blaschke2020memory,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Memory-assisted reinforcement learning for diverse molecular de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blaschke, Thomas and Engkvist, Ola and Bajorath, J{\&#34;u}rgen and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00473-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v2: Pareto Multi-Objective RL for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</guid><description>DrugEx v2 extends RNN-based de novo drug design with Pareto ranking and evolutionary exploration for multi-objective molecule generation.</description><content:encoded><![CDATA[<h2 id="multi-objective-de-novo-drug-design-with-pareto-optimization">Multi-Objective De Novo Drug Design with Pareto Optimization</h2>
<p>This is a <strong>Method</strong> paper that extends the DrugEx framework (v1) to handle multi-objective optimization in de novo drug design. The primary contribution is integrating Pareto-based ranking with evolutionary algorithm concepts (crossover and mutation) into an RNN-based reinforcement learning pipeline. The system generates <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based molecules optimized simultaneously for activity toward multiple protein targets while avoiding off-targets, addressing polypharmacology scenarios where drugs must bind multiple specific receptors.</p>
<h2 id="polypharmacology-and-the-limits-of-single-objective-generation">Polypharmacology and the Limits of Single-Objective Generation</h2>
<p>Traditional drug discovery follows the &ldquo;one drug, one target, one disease&rdquo; paradigm, but drug molecules interact with an average of six protein targets. Off-target binding causes side effects that remain a leading cause of clinical failure and post-approval drug withdrawals (over 500 drugs withdrawn due to fatal toxicity). Complex diseases often require modulating multiple targets simultaneously, making polypharmacology an important design objective.</p>
<p>Prior deep learning approaches for de novo design, including DrugEx v1, focused on generating molecules active against a single target. Extending these methods to multiple objectives introduces fundamental challenges: objectives are often contradictory (high affinity for one target may correlate with high affinity for an undesired off-target), and naive weighted-sum approaches can collapse diversity by over-optimizing a single dominant objective. The authors specifically target the <a href="https://en.wikipedia.org/wiki/Adenosine_receptor">adenosine receptor</a> system, where $A_1AR$ and $A_{2A}AR$ selectivity profiles matter for therapeutic efficacy, and <a href="https://en.wikipedia.org/wiki/HERG">hERG</a> channel binding must be avoided to prevent cardiac toxicity.</p>
<h2 id="evolutionary-exploration-and-pareto-ranking-in-rl">Evolutionary Exploration and Pareto Ranking in RL</h2>
<p>The core innovation of DrugEx v2 has two components: an evolutionary exploration strategy and Pareto-based reward assignment.</p>
<h3 id="evolutionary-exploration-strategy">Evolutionary Exploration Strategy</h3>
<p>The generation process uses three RNN networks with identical LSTM architectures:</p>
<ul>
<li><strong>Agent net</strong> ($G_A$): the primary generator, updated at each training epoch via policy gradient</li>
<li><strong>Crossover net</strong> ($G_C$): initialized from the fine-tuned model, updated iteratively from $G_A$ after each convergence period</li>
<li><strong>Mutation net</strong> ($G_M$): initialized from the pre-trained model, parameters fixed throughout training</li>
</ul>
<p>At each token-generation step, a random number determines whether the token probability comes from the combination of $G_A$ and $G_C$ (with probability $1 - \varepsilon$) or from $G_M$ (with probability $\varepsilon$). This mirrors crossover and mutation operations from evolutionary algorithms, maintaining diversity while steering toward desired properties.</p>
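<p>The per-step mixing can be sketched directly from that description (the exact rule for combining $G_A$ and $G_C$ is an assumption here; a simple average is used):</p>

```python
import random

def next_token_distribution(p_agent, p_crossover, p_mutation, epsilon, rng=random):
    """With probability epsilon, take the (frozen) mutation net's distribution;
    otherwise combine agent and crossover nets (simple average, an assumption)."""
    if rng.random() < epsilon:
        return list(p_mutation)
    return [(a + c) / 2.0 for a, c in zip(p_agent, p_crossover)]

# hypothetical next-token distributions over a 2-token vocabulary
p = next_token_distribution([0.2, 0.8], [0.6, 0.4], [1.0, 0.0], epsilon=0.0)
```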
<h3 id="pareto-front-reward-scheme">Pareto Front Reward Scheme</h3>
<p>For $n$ objectives (three in this study: $A_1AR$, $A_{2A}AR$, hERG), each molecule receives a score $R_i$ based on its predicted bioactivity:</p>
<p>$$
R_{i} = \begin{cases} \text{minmax}(pX_{i}), &amp; \text{if high affinity required} \\ 1 - \text{minmax}(pX_{i}), &amp; \text{if low affinity required} \\ 0, &amp; \text{if SMILES invalid} \end{cases}
$$</p>
<p>where $pX_i$ is the predicted bioactivity (range 3.0 to 10.0), normalized to [0, 1].</p>
<p>For the multi-target case, high affinity is required for both $A_1AR$ and $A_{2A}AR$ while low affinity is required for hERG. For the target-specific case, high affinity is required only for $A_{2A}AR$ while low affinity is required for both $A_1AR$ and hERG.</p>
<p>Molecules are ranked using a <a href="https://en.wikipedia.org/wiki/Non-dominated_sorting_genetic_algorithm_II">non-dominated sorting</a> algorithm to construct Pareto fronts. Within each front, molecules are ranked by average Tanimoto distance (using ECFP6 fingerprints) rather than crowding distance, favoring chemically diverse solutions. The final reward is:</p>
<p>$$
R_i^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>where $k$ is the molecule&rsquo;s index in the Pareto rank. Rewards for undesired and desired solutions are distributed in $(0, 0.5]$ and $(0.5, 1.0]$, respectively.</p>
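<p>A minimal sketch of this ranking-to-reward pipeline, under stated assumptions: objectives are maximized, ranks $k$ are 1-based and run from worst Pareto front to best, all undesired molecules are ranked before desired ones, both groups are non-empty, and the within-front Tanimoto-distance ordering is omitted:</p>

```python
def dominates(a, b):
    """a dominates b: at least as good on every objective, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(scores):
    """Return Pareto fronts as lists of indices, best front first."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def pareto_rewards(scores, desired):
    """Map Pareto rank k to rewards in (0, 0.5] (undesired) / (0.5, 1] (desired)."""
    order = [i for front in reversed(non_dominated_sort(scores)) for i in front]
    order.sort(key=lambda i: desired[i])    # stable sort: undesired first (assumption)
    n_und = sum(1 for d in desired if not d)
    n_des = len(desired) - n_und
    rewards = {}
    for k, i in enumerate(order, start=1):  # 1-based rank
        rewards[i] = 0.5 + (k - n_und) / (2 * n_des) if desired[i] else k / (2 * n_und)
    return rewards

# toy example: per-molecule scores on two objectives
scores = [(0.9, 0.9), (0.1, 0.1), (0.8, 0.2)]
rewards = pareto_rewards(scores, desired=[True, False, False])
```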
<p>The agent is trained via policy gradient:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) \middle|\theta\right] = \sum_{t=1}^{T} \log G(y_t | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
<h3 id="weighted-sum-alternative">Weighted Sum Alternative</h3>
<p>The authors also implement a weighted sum (WS) scheme with dynamic weights proportional to the ratio of undesired to desired molecules per objective:</p>
<p>$$
w_i = \frac{r_i}{\sum_{k=1}^{M} r_k}, \quad R^{*} = \sum_{i=1}^{n} w_i R_i
$$</p>
<p>This auto-adjusts importance toward under-performing objectives during training.</p>
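<p>The dynamic weighting is a one-liner per objective; a sketch with hypothetical undesired-to-desired ratios $r_i$:</p>

```python
def dynamic_weights(ratios):
    """w_i = r_i / sum_k r_k, where r_i is the undesired-to-desired ratio for
    objective i, so lagging objectives receive more weight."""
    total = sum(ratios)
    return [r / total for r in ratios]

def weighted_sum_reward(rewards, weights):
    """R* = sum_i w_i * R_i."""
    return sum(w * r for w, r in zip(weights, rewards))

weights = dynamic_weights([2.0, 1.0, 1.0])  # objective 0 is lagging behind
reward = weighted_sum_reward([1.0, 0.0, 0.0], weights)
```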
<h3 id="molecular-diversity-metric">Molecular Diversity Metric</h3>
<p>Diversity is measured using the Solow-Polasky metric adapted from ecological biodiversity:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\top} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
<p>where $F(\mathbf{s})$ is a distance matrix with entries $f(d_{ij}) = e^{-\theta d_{ij}}$ and $d_{ij}$ is the Tanimoto distance between ECFP6 fingerprints of molecules $s_i$ and $s_j$.</p>
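<p>Because $\mathbf{e}^{\top} F^{-1} \mathbf{e} = \mathbf{e}^{\top} \mathbf{x}$ where $F\mathbf{x} = \mathbf{e}$, the metric needs only a single linear solve rather than a full matrix inverse. A self-contained sketch on a toy distance matrix (the data is illustrative only):</p>

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def solow_polasky_diversity(distances, theta=1.0):
    """I(A) = (1/|A|) e^T F^{-1} e with F_ij = exp(-theta * d_ij)."""
    n = len(distances)
    F = [[math.exp(-theta * d) for d in row] for row in distances]
    x = solve(F, [1.0] * n)  # x = F^{-1} e
    return sum(x) / n

# two molecules at Tanimoto distance 1.0 (maximally dissimilar fingerprints)
diversity = solow_polasky_diversity([[0.0, 1.0], [1.0, 0.0]])
```

<p>The metric approaches 1 as all pairwise distances grow and collapses toward $1/|A|$ as the set degenerates to near-duplicates.</p>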
<h2 id="multi-target-and-target-specific-experiments">Multi-Target and Target-Specific Experiments</h2>
<h3 id="qsar-environment">QSAR Environment</h3>
<p>Four ML algorithms were benchmarked for the bioactivity prediction environment: Random Forest (RF), SVM, PLS, and Multi-task DNN (MT-DNN). Input features combined 2048-bit ECFP6 fingerprints with 19 physicochemical descriptors (2067D total). The training data came from ChEMBL v26: 25,731 ligands with bioactivity measurements toward $A_1AR$, $A_{2A}AR$, and hERG. RF was selected as the final predictor based on superior performance in temporal-split independent testing ($R^2$ and RMSE), prioritizing robustness over cross-validation metrics.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN generator uses six layers: input, embedding (128D), three LSTM recurrent layers (512 hidden units), and output. LSTM was chosen over GRU based on higher valid SMILES rates (97.5% vs. 93.1% for pre-trained, 97.9% vs. 95.7% for fine-tuned). Pre-training used 1.7M molecules from ChEMBL; fine-tuning used the 25,731 LIGAND set molecules.</p>
<h3 id="baselines">Baselines</h3>
<p>DrugEx v2 was compared against DrugEx v1, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, all using the same RNN architecture and pre-trained/fine-tuned models, with only the RL framework differing. Both Pareto front (PF) and weighted sum (WS) reward schemes were tested.</p>
<h3 id="multi-target-results">Multi-Target Results</h3>
<p>In the multi-target case (high affinity for $A_1AR$ and $A_{2A}AR$, low affinity for hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.57%</td>
          <td>80.81%</td>
          <td>87.29%</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.80%</td>
          <td><strong>97.45%</strong></td>
          <td>89.08%</td>
          <td>0.49</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>PF</td>
          <td>99.54%</td>
          <td>57.43%</td>
          <td><strong>98.84%</strong></td>
          <td><strong>0.77</strong></td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.84%</td>
          <td>66.01%</td>
          <td>82.67%</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>DrugEx v1</td>
          <td>PF</td>
          <td>98.28%</td>
          <td>43.27%</td>
          <td>88.96%</td>
          <td>0.71</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 achieved the highest desirability under both schemes. The WS scheme maximized desirability (97.45%) but at the cost of diversity (0.49). The PF scheme maintained higher diversity (0.70) with still-strong desirability (80.81%).</p>
<h3 id="target-specific-results">Target-Specific Results</h3>
<p>In the target-specific case (high $A_{2A}AR$, low $A_1AR$ and hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.53%</td>
          <td><strong>89.49%</strong></td>
          <td>90.55%</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.62%</td>
          <td><strong>97.86%</strong></td>
          <td>90.54%</td>
          <td>0.31</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>WS</td>
          <td>99.55%</td>
          <td>81.27%</td>
          <td>98.87%</td>
          <td>0.34</td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.29%</td>
          <td>86.98%</td>
          <td>80.30%</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 with PF achieved high desirability (89.49%) while maintaining diversity (0.73), outperforming both the WS scheme&rsquo;s diversity collapse (0.31) and competing methods.</p>
<h3 id="chemical-space-coverage">Chemical Space Coverage</h3>
<p>t-SNE visualization with ECFP6 descriptors showed that the PF scheme guided generators to cover chemical space more broadly than the WS scheme. DrugEx v1 and v2 covered nearly all of the chemical space occupied by known active ligands, while REINVENT and ORGANIC covered only partial regions in the target-specific case.</p>
<h3 id="substructure-distribution">Substructure Distribution</h3>
<p>Generated molecules were evaluated for purine ring, furan ring, and benzene ring frequencies. DrugEx v2 with PF produced substructure distributions closest to the LIGAND set, suggesting it better preserves the chemical characteristics of known active molecules compared to REINVENT (which over-represented benzene rings) and ORGANIC.</p>
<h3 id="guacamol-benchmark">GuacaMol Benchmark</h3>
<p>DrugEx v2 was tested on 20 goal-directed tasks from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark, achieving the best score in 12 of 20 tasks and ranking second overall. The method struggled with tasks requiring contradictory objectives in narrow chemical spaces (e.g., the Sitagliptin MPO task), reflecting its emphasis on diverse feasible molecules rather than optimal individual solutions.</p>
<h2 id="diversity-desirability-trade-off-and-limitations">Diversity-Desirability Trade-off and Limitations</h2>
<p>The key finding is that the Pareto front scheme and weighted sum scheme offer complementary strengths: PF produces molecules with higher diversity and more realistic substructure distributions, while WS achieves higher raw desirability scores. The Pareto front scheme is preferred for polypharmacology applications where chemical diversity matters for lead optimization.</p>
<p>The mutation rate $\varepsilon$ controls the diversity-desirability trade-off. Higher $\varepsilon$ increases diversity at the cost of desirability. The authors tested $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$ and found that appropriate tuning is important.</p>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The method is less effective for tasks with contradictory objectives in narrow chemical spaces</li>
<li>Emphasis is on generating diverse feasible molecules rather than individual optimal solutions</li>
<li>REINVENT 2.0 did not converge with the PF scheme, suggesting the Pareto approach may not be universally compatible with all RL frameworks</li>
<li>Bioactivity predictions rely on QSAR models (RF), which may not generalize perfectly to novel chemical scaffolds</li>
</ul>
<p>Future directions mentioned include adopting newer architectures (BERT, Transformer, GPT-2), handling graph and fragment representations, and integrating additional objectives like stability and synthesizability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v26 (ChEMBL set)</td>
          <td>1.7M molecules</td>
          <td>SMILES syntax learning, drug-like molecules</td>
      </tr>
      <tr>
          <td>Fine-tuning / Environment</td>
          <td>LIGAND set</td>
          <td>25,731 ligands</td>
          <td>Bioactivities for $A_1AR$, $A_{2A}AR$, hERG from ChEMBL</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>GuacaMol</td>
          <td>20 tasks</td>
          <td>Goal-directed generation tasks</td>
      </tr>
  </tbody>
</table>
<p>Active/inactive thresholds: $pX \geq 6.5$ (active), $pX &lt; 6.5$ (inactive). Low-quality data without an exact pX value were assigned $pX = 3.99$ with a sample weight of 0.1.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>QSAR predictor</strong>: Random Forest, 1000 trees, Gini criterion. Input: 2048-bit ECFP6 + 19 physicochemical properties (2067D). MinMax normalization.</li>
<li><strong>Generator</strong>: 6-layer RNN with LSTM cells (512 hidden units), embedding dim 128, vocabulary 84 tokens. Adam optimizer, lr $10^{-3}$, batch size 512, 1000 epochs.</li>
<li><strong>RL training</strong>: Policy gradient with Pareto-based or weighted-sum reward. Mutation rates tested: $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$.</li>
<li><strong>Pareto ranking</strong>: GPU-accelerated non-dominated sorting via PyTorch. Average Tanimoto distance on ECFP6 fingerprints used in place of crowding distance.</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generator</td>
          <td>LSTM (3 layers, 512 hidden)</td>
          <td>Embedding 128D, vocab 84</td>
      </tr>
      <tr>
          <td>Predictor</td>
          <td>Random Forest</td>
          <td>1000 trees, 2067D input</td>
      </tr>
      <tr>
          <td>MT-DNN (alternative)</td>
          <td>3 hidden layers (4000, 2000, 1000)</td>
          <td>ReLU, 20% dropout</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated SMILES that parse to valid molecules</td>
      </tr>
      <tr>
          <td>Desirability</td>
          <td>Fraction of molecules meeting all activity thresholds ($pX \geq 6.5$ on-targets, $pX &lt; 6.5$ off-targets)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of non-duplicate molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Solow-Polasky metric on ECFP6 Tanimoto distances</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Synthetic accessibility (1-10, lower is easier)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative estimate of drug-likeness (0-1, higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>GPU acceleration was used for Pareto optimization via PyTorch. Specific hardware details (GPU model, training time) are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XuhanLiu/DrugEx">DrugEx GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Python, PyTorch)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v26</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source of training molecules and bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., Emmerich, M. T. M., IJzerman, A. P., &amp; van Westen, G. J. P. (2021). DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. <em>Journal of Cheminformatics</em>, 13(1), 85. <a href="https://doi.org/10.1186/s13321-021-00561-9">https://doi.org/10.1186/s13321-021-00561-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2021drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and Emmerich, Michael T. M. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{85}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-021-00561-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>REINVENT 4: Open-Source Generative Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/</guid><description>REINVENT 4 is an open-source generative AI framework combining RNNs and transformers with reinforcement and curriculum learning for de novo molecular design.</description><content:encoded><![CDATA[<h2 id="an-open-source-reference-implementation-for-generative-molecular-design">An Open-Source Reference Implementation for Generative Molecular Design</h2>
<p>REINVENT 4 is a <strong>Resource</strong> paper presenting a production-grade, open-source software framework for AI-driven generative molecular design. The primary contribution is the unified codebase that integrates four distinct molecule generators (de novo, scaffold decoration, linker design, molecular optimization) within three machine learning optimization algorithms (transfer learning, reinforcement learning, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/">curriculum learning</a>). The software is released under the Apache 2.0 license and represents the fourth major version of the REINVENT platform, which has been in continuous production use at AstraZeneca for drug discovery.</p>
<h2 id="bridging-the-gap-between-research-prototypes-and-production-molecular-design">Bridging the Gap Between Research Prototypes and Production Molecular Design</h2>
<p>The motivation for REINVENT 4 stems from several gaps in the generative molecular design landscape. While numerous AI model architectures have been developed for molecular generation (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">VAEs</a>, GANs, RNNs, transformers, flow models, diffusion models), most exist as research prototypes released alongside individual publications rather than as maintained, integrated software. The authors argue that the scientific community needs reference implementations of common generative molecular design algorithms in the public domain to:</p>
<ol>
<li>Enable nuanced debate about the application of AI in drug discovery</li>
<li>Serve as educational tools for practitioners entering the field</li>
<li>Increase transparency around AI-driven molecular design</li>
<li>Provide a foundation for future innovation</li>
</ol>
<p>REINVENT 4 consolidates previously separate codebases (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> v1, v2, LibInvent, LinkInvent, Mol2Mol) into a single repository with a consistent interface, addressing the fragmentation that characterized earlier releases.</p>
<h2 id="unified-framework-for-sequence-based-molecular-generation">Unified Framework for Sequence-Based Molecular Generation</h2>
<p>The core design of REINVENT 4 centers on sequence-based neural network models that generate <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings in an autoregressive manner. All generators model the probability of producing a token sequence, with two formulations.</p>
<p>For unconditional agents (de novo generation), the joint probability of a sequence $T$ with tokens $t_1, t_2, \ldots, t_\ell$ is:</p>
<p>$$
\mathbf{P}(T) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
<p>For conditional agents (scaffold decoration, linker design, molecular optimization), the joint probability given an input sequence $S$ is:</p>
<p>$$
\mathbf{P}(T \mid S) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1, S)
$$</p>
<p>The negative log-likelihood for unconditional agents is:</p>
<p>$$
NLL(T) = -\log \mathbf{P}(T) = -\sum_{i=1}^{\ell} \log \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
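<p>The factorization above can be sketched numerically. In this toy example the per-token conditional probabilities are invented, not model outputs; it only shows how the joint probability and NLL fall out of the per-step conditionals:</p>
<pre><code class="language-python">import math

def sequence_nll(step_probs):
    """NLL of a token sequence from its per-step conditional probabilities
    P(t_i | t_1 ... t_i-1); the joint probability is their product."""
    return -sum(math.log(p) for p in step_probs)

# Toy 4-token sequence with invented conditionals.
probs = [0.9, 0.5, 0.25, 0.8]
nll = sequence_nll(probs)
print(round(nll, 4))             # 2.4079 (= -ln 0.09)
print(round(math.exp(-nll), 4))  # 0.09, the product of the conditionals
</code></pre>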
<h3 id="reinforcement-learning-with-dap">Reinforcement Learning with DAP</h3>
<p>The key optimization mechanism is reinforcement learning via the &ldquo;Difference between Augmented and Posterior&rdquo; (DAP) strategy. For each generated sequence $T$, the augmented likelihood is defined as:</p>
<p>$$
\log \mathbf{P}_{\text{aug}}(T) = \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)
$$</p>
<p>where $\mathbf{S}(T) \in [0, 1]$ is the scalar score and $\sigma \geq 0$ controls the balance between reward and regularization. The DAP loss is:</p>
<p>$$
\mathcal{L}(T) = \left(\log \mathbf{P}_{\text{aug}}(T) - \log \mathbf{P}_{\text{agent}}(T)\right)^2
$$</p>
<p>The presence of the prior likelihood in the augmented likelihood constrains how far the agent can deviate from chemically plausible space, functioning similarly to proximal policy gradient methods. The loss is lower-bounded by:</p>
<p>$$
\mathcal{L}(T) \geq \max\left(0, \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)\right)^2
$$</p>
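<p>A minimal sketch of the DAP objective, with made-up log-likelihood and score values rather than real model outputs:</p>
<pre><code class="language-python">def dap_loss(logp_prior, logp_agent, score, sigma):
    """Squared difference between the augmented and agent log-likelihoods.
    Log-likelihoods are non-positive; score is in [0, 1]; sigma trades off
    reward against staying close to the prior."""
    logp_aug = logp_prior + sigma * score
    return (logp_aug - logp_agent) ** 2

# Invented values for one sampled SMILES: prior NLL 35, agent NLL 30, score 0.6.
loss = dap_loss(logp_prior=-35.0, logp_agent=-30.0, score=0.6, sigma=60.0)
print(loss)  # (-35 + 36 + 30)**2 = 961.0
</code></pre>
<p>Minimizing this loss pulls the agent's likelihood toward the prior's likelihood plus a reward bonus, which is what keeps generation anchored in chemically plausible space.</p>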
<h3 id="four-molecule-generators">Four Molecule Generators</h3>
<p>REINVENT 4 supports four generator types:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>Architecture</th>
          <th>Input</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reinvent</td>
          <td>RNN</td>
          <td>None</td>
          <td>De novo design from scratch</td>
      </tr>
      <tr>
          <td>LibInvent</td>
          <td>RNN</td>
          <td>Scaffold SMILES</td>
          <td>R-group replacement, library design</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/">LinkInvent</a></td>
          <td>RNN</td>
          <td>Two warhead fragments</td>
          <td>Linker design, scaffold hopping</td>
      </tr>
      <tr>
          <td>Mol2Mol</td>
          <td>Transformer</td>
          <td>Input molecule</td>
          <td>Molecular optimization within similarity bounds</td>
      </tr>
  </tbody>
</table>
<p>All generators are fully integrated with all three optimization algorithms (TL, RL, CL). The Mol2Mol transformer was trained on over 200 billion molecular pairs from PubChem with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> $\geq 0.50$, using ranking loss to directly link negative log-likelihood to molecular similarity.</p>
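<p>The Tanimoto similarity used to filter training pairs is the Jaccard index over fingerprint on-bits. A minimal pure-Python sketch, using toy bit indices rather than actual Morgan fingerprints:</p>
<pre><code class="language-python">def tanimoto(fp_a, fp_b):
    """Jaccard index over the on-bit indices of two binary fingerprints."""
    a, b = set(fp_a), set(fp_b)
    union = len(a.union(b))
    return len(a.intersection(b)) / union if union else 1.0

# Toy on-bit indices standing in for the fingerprints of a molecular pair.
print(tanimoto([1, 4, 7, 9], [1, 4, 8, 9, 12]))  # 3 shared / 6 total = 0.5
</code></pre>
<p>Under the paper's criterion, a pair would be kept when this value is at least 0.50.</p>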
<h3 id="staged-learning-curriculum-learning">Staged Learning (Curriculum Learning)</h3>
<p>A key new feature is staged learning, which implements curriculum learning as multi-stage RL. Each stage can define a different scoring profile, allowing users to gradually phase in computationally expensive scoring functions. For example, cheap drug-likeness filters can run first, followed by docking in later stages. Stages terminate when a maximum score threshold is exceeded or a step limit is reached.</p>
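<p>The staged-learning control flow can be sketched as follows; the stage names, thresholds, and the <code>agent_step</code> callback are hypothetical stand-ins, not REINVENT 4's actual API:</p>
<pre><code class="language-python">def staged_learning(agent_step, stages):
    """Run RL stages in order; each stage has its own scoring profile and
    terminates when the max-score threshold is exceeded or steps run out.
    agent_step(scorer) stands in for one RL update returning the batch's
    best score."""
    history = []
    for stage in stages:
        for step in range(stage["max_steps"]):
            best = agent_step(stage["scorer"])
            history.append((stage["name"], step, best))
            if best > stage["threshold"]:
                break  # stage satisfied; move on to the next scoring profile
    return history

# Assumed progression: a cheap drug-likeness stage, then an expensive docking stage.
scores = iter([0.2, 0.5, 0.9, 0.3, 0.95])
fake_step = lambda scorer: next(scores)
stages = [
    {"name": "qed", "scorer": "cheap", "threshold": 0.8, "max_steps": 10},
    {"name": "dock", "scorer": "expensive", "threshold": 0.9, "max_steps": 10},
]
log = staged_learning(fake_step, stages)
print([name for name, _, _ in log])  # ['qed', 'qed', 'qed', 'dock', 'dock']
</code></pre>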
<h3 id="scoring-subsystem">Scoring Subsystem</h3>
<p>The scoring subsystem implements a plugin architecture supporting over 25 scoring components, including:</p>
<ul>
<li>Physicochemical descriptors from RDKit (QED, SLogP, TPSA, molecular weight, etc.)</li>
<li>Molecular docking via DockStream (<a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>, rDock, Hybrid, Glide, GOLD)</li>
<li>QSAR models via Qptuna and ChemProp (D-MPNN)</li>
<li>Shape similarity via ROCS</li>
<li>Synthesizability estimation via SA score</li>
<li>Matched molecular pairs via mmpdb</li>
<li>Generic REST and external process interfaces</li>
</ul>
<p>Scores are aggregated via weighted arithmetic or geometric mean. A transform system (sigmoid, step functions, value maps) normalizes individual component scores to $[0, 1]$.</p>
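<p>A compact sketch of this transform-then-aggregate pipeline; the sigmoid parameterization here is an assumption for illustration, not REINVENT 4's exact implementation:</p>
<pre><code class="language-python">import math

def sigmoid_transform(x, low, high):
    """Map a raw component value into [0, 1], rising from near 0 at `low`
    to near 1 at `high` (one assumed parameterization)."""
    k = 10.0 / (high - low)
    return 1.0 / (1.0 + math.exp(-k * (x - 0.5 * (low + high))))

def aggregate(scores, weights, geometric=True):
    """Weighted geometric (default) or arithmetic mean of [0, 1] scores."""
    if geometric:
        prod = 1.0
        for s, w in zip(scores, weights):
            prod *= max(s, 1e-9) ** w  # clamp to avoid 0 ** w underflow issues
        return prod ** (1.0 / sum(weights))
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

qed_like = 0.25                              # component already in [0, 1]
mw_score = sigmoid_transform(420, 200, 600)  # raw descriptor, transformed first
print(round(aggregate([qed_like, 1.0], [1.0, 1.0]), 3))  # sqrt(0.25) = 0.5
</code></pre>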
<h2 id="pdk1-inhibitor-case-study">PDK1 Inhibitor Case Study</h2>
<p>The paper demonstrates REINVENT 4 through a structure-based drug design exercise targeting <a href="https://en.wikipedia.org/wiki/PDPK1">Phosphoinositide-dependent kinase-1 (PDK1)</a> inhibitors. The experimental setup uses PDB crystal structure 2XCH with DockStream and Glide for docking, defining hits as molecules with docking score $\leq -8$ kcal/mol and QED $\geq 0.7$.</p>
<p><strong>Baseline RL from prior</strong>: 50 epochs of staged learning with batch size 128 produced 119 hits from 6,400 generated molecules (1.9% hit rate), spread across 103 generic Bemis-Murcko scaffolds.</p>
<p><strong>Transfer learning + RL</strong>: After 10 epochs of TL on 315 congeneric pyridinone PDK1 actives from PubChem Assay AID1798002, the same 50-epoch RL run produced 222 hits (3.5% hit rate) across 176 unique generic scaffolds, nearly doubling productivity.</p>
<p>Both approaches generated top-scoring molecules (docking score of -10.1 kcal/mol each) with plausible binding poses reproducing key protein-ligand interactions seen in the native crystal structure, including hinge interactions with ALA 162 and contacts with LYS 111.</p>
<p>The paper also demonstrates the agent&rsquo;s plasticity through a molecular weight switching experiment: after 500 epochs driving generation toward 1500 Da molecules, switching the reward to favor molecules $\leq 500$ Da resulted in rapid adaptation within ~50 epochs, showing that the RL agent can recover from extreme biases.</p>
<h2 id="practical-software-for-ai-driven-drug-discovery">Practical Software for AI-Driven Drug Discovery</h2>
<p>REINVENT 4 represents a mature, well-documented framework that consolidates years of incremental development into a single codebase. Key practical features include TOML/JSON configuration, TensorBoard visualization, multinomial sampling and beam search decoding, diversity filters for scaffold-level novelty, experience replay (inception), and a plugin mechanism for extending the scoring subsystem.</p>
<p>The authors acknowledge that this is one approach among many and that there is no single solution that uniformly outperforms others. REINVENT has demonstrated strong sample efficiency in benchmarks and produced realistic 3D docking poses, but the paper does not claim universal superiority. The focus is on providing a well-engineered, transparent reference implementation rather than advancing a novel algorithm.</p>
<p>Limitations include that only the Mol2Mol prior supports stereochemistry, the training data biases constrain the explorable chemical space, and the SMILES-based representation inherits the known fragility of string-based molecular encodings.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training (Reinvent)</td>
          <td>ChEMBL 25</td>
          <td>~1.7M molecules</td>
          <td>Drug-like compounds</td>
      </tr>
      <tr>
          <td>Prior training (LibInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Scaffold-decoration pairs</td>
      </tr>
      <tr>
          <td>Prior training (LinkInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Fragment-linker pairs</td>
      </tr>
      <tr>
          <td>Prior training (Mol2Mol)</td>
          <td>ChEMBL 28 / PubChem</td>
          <td>~200B pairs</td>
          <td>Tanimoto similarity $\geq 0.50$</td>
      </tr>
      <tr>
          <td>Case study TL</td>
          <td>PubChem AID1798002</td>
          <td>315 compounds</td>
          <td>Congeneric PDK1 actives</td>
      </tr>
      <tr>
          <td>Case study docking</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 crystal structure</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimization</strong>: DAP (recommended), plus three deprecated alternatives (REINFORCE, A2C, MAULI)</li>
<li><strong>Decoding</strong>: Multinomial sampling (default, temperature $K = 1$) and beam search</li>
<li><strong>Diversity filter</strong>: Murcko scaffold, topological scaffold, scaffold similarity, same-SMILES penalty</li>
<li><strong>Experience replay</strong>: Inception memory with configurable size and sampling rate</li>
<li><strong>Gradient descent</strong>: Adam optimizer</li>
</ul>
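<p>Multinomial decoding can be sketched as a categorical draw from temperature-scaled softmax probabilities; this is illustrative only, not REINVENT 4's code:</p>
<pre><code class="language-python">import math, random

def sample_token(logits, temperature=1.0, rng=random):
    """One categorical draw from softmax(logits / temperature): T=1 keeps the
    model's distribution, lower T sharpens it, higher T flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if acc >= r:
            return i
    return len(exps) - 1  # guard against floating-point shortfall

rng = random.Random(0)
tokens = [sample_token([2.0, 0.5, 0.1], rng=rng) for _ in range(5)]
print(tokens)
</code></pre>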
<h3 id="models">Models</h3>
<p>All pre-trained priors are distributed with the repository: the RNN-based generators (Reinvent, LibInvent, LinkInvent) and the transformer-based Mol2Mol generator, the latter shipping in multiple similarity-conditioned variants.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Condition</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit rate (RL)</td>
          <td>1.9%</td>
          <td>50 epochs, batch 128</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Hit rate (TL+RL)</td>
          <td>3.5%</td>
          <td>10 TL + 50 RL epochs</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Scaffold diversity (RL)</td>
          <td>103 scaffolds</td>
          <td>From 119 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Scaffold diversity (TL+RL)</td>
          <td>176 scaffolds</td>
          <td>From 222 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Best docking score</td>
          <td>-10.1 kcal/mol</td>
          <td>Both methods</td>
          <td>Glide SP</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements. REINVENT 4 supports both GPU and CPU execution. Python 3.10+ is required, with PyTorch 1.x (2.0 also compatible) and RDKit 2022.9+.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/REINVENT4">REINVENT4</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Full framework with pre-trained priors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/DockStream">DockStream</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Docking wrapper for scoring</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Loeffler, H. H., He, J., Tibo, A., Janet, J. P., Voronov, A., Mervin, L. H., &amp; Engkvist, O. (2024). Reinvent 4: Modern AI-driven generative molecule design. <em>Journal of Cheminformatics</em>, 16, 20. <a href="https://doi.org/10.1186/s13321-024-00812-5">https://doi.org/10.1186/s13321-024-00812-5</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{loeffler2024reinvent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Reinvent 4: Modern AI-driven generative molecule design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Loeffler, Hannes H. and He, Jiazhen and Tibo, Alessandro and Janet, Jon Paul and Voronov, Alexey and Mervin, Lewis H. and Engkvist, Ola}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00812-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Link-INVENT: RL-Driven Molecular Linker Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</guid><description>Link-INVENT extends REINVENT for molecular linker design using RNN-based generation and reinforcement learning with flexible multi-parameter scoring.</description><content:encoded><![CDATA[<h2 id="a-method-for-generative-linker-design-with-reinforcement-learning">A Method for Generative Linker Design with Reinforcement Learning</h2>
<p>Link-INVENT is a <strong>Method</strong> paper that introduces a generative model for molecular linker design built on the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo design platform. The primary contribution is an encoder-decoder recurrent neural network (RNN) architecture that generates SMILES-based linkers connecting two molecular subunits, combined with a flexible multi-parameter optimization (MPO) scoring function and reinforcement learning (RL) to steer generation toward desired properties. Link-INVENT targets three practical drug discovery tasks: fragment linking, scaffold hopping, and <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">proteolysis targeting chimera</a> (PROTAC) design.</p>
<h2 id="why-linker-design-needs-flexible-multi-parameter-optimization">Why Linker Design Needs Flexible Multi-Parameter Optimization</h2>
<p>Generating suitable chemical linkers between molecular subunits is a central challenge in <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-based drug discovery</a> (FBDD), scaffold hopping, and PROTAC design. Traditional computational approaches rely on database searches, inherently limiting the generalizability of proposed linkers to the pre-defined collection. Recent deep learning methods (DeLinker, SyntaLinker, 3DLinker, DiffLinker) can generate novel linkers but offer limited support for optimizing specific physicochemical properties. Users can typically control only linker length and a few properties like hydrogen-bond donor count.</p>
<p>The key gaps that Link-INVENT addresses are:</p>
<ol>
<li><strong>Conditioning on both subunits</strong>: Prior RNN-based approaches (SAMOA) generate linkers conditioned only on the SMILES sequence seen so far, which may not account for the second molecular subunit. Link-INVENT conditions on both warheads simultaneously.</li>
<li><strong>Flexible scoring</strong>: Existing DL-based linker design tools lack the ability to define tailored MPO objectives. Link-INVENT inherits <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 4&rsquo;s</a> full scoring infrastructure and adds linker-specific properties.</li>
<li><strong>Generalizability</strong>: A single trained prior handles fragment linking, scaffold hopping, and PROTAC tasks without retraining.</li>
</ol>
<h2 id="core-innovation-conditional-linker-generation-with-augmented-likelihood-rl">Core Innovation: Conditional Linker Generation with Augmented Likelihood RL</h2>
<p>Link-INVENT&rsquo;s architecture is an encoder-decoder RNN adapted from the Lib-INVENT library design model. The encoder processes a pair of warheads (molecular subunits with defined exit vectors), and the decoder generates a linker token by token, yielding a connected molecule in SMILES format. The model uses three hidden layers of 512 LSTM cells with an embedding size of 256.</p>
<h3 id="training">Training</h3>
<p>The prior is trained on ChEMBL v27 data processed through reaction-based slicing to generate (linker, warheads pair, full molecule) tuples. <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a> augments the training data at each epoch, improving chemical space generalizability. The prior is trained by maximizing the likelihood of generating a linker conditioned on the input warhead pair, with teacher forcing for stability.</p>
<h3 id="multi-parameter-optimization-via-rl">Multi-Parameter Optimization via RL</h3>
<p>The scoring function $S(x)$ is a weighted geometric mean of individual component scores:</p>
<p>$$
S(x) = \left(\prod_{i=1}^{n} C_{i}(x)^{w_{i}}\right)^{\frac{1}{\sum_{i=1}^{n} w_{i}}}
$$</p>
<p>where $x$ is a sampled linked molecule, $C_{i}(x)$ is the score for the $i$-th component, and $w_{i}$ is its weight.</p>
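<p>The scoring function translates directly into code. A minimal sketch with hypothetical component scores and weights:</p>
<pre><code class="language-python">def mpo_score(component_scores, weights):
    """Weighted geometric mean S(x) of component scores C_i(x) in [0, 1]."""
    prod = 1.0
    for c, w in zip(component_scores, weights):
        prod *= c ** w
    return prod ** (1.0 / sum(weights))

# Hypothetical components: docking (0.8), linker MW (0.9), QED (0.5),
# with docking weighted twice as heavily as the others.
print(round(mpo_score([0.8, 0.9, 0.5], [2, 1, 1]), 3))  # 0.733
</code></pre>
<p>Because the mean is geometric, a single zero-valued component zeroes the total score, which is what makes hard filters effective under this aggregation.</p>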
<p>The agent (initialized as a copy of the prior) is updated via the Difference of Augmented and Posterior likelihoods (DAP) loss. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented log likelihood</a> is:</p>
<p>$$
\log \pi_{\text{augmented}} = \log \pi_{\text{prior}} + \sigma \cdot S(x)
$$</p>
<p>where $\pi$ denotes a policy (token sampling probabilities conditioned on the sequence so far) and $\sigma$ is a scalar factor. The loss function is:</p>
<p>$$
J(\theta) = \left(\log \pi_{\text{augmented}} - \log \pi_{\text{agent}}\right)^{2}
$$</p>
<p>Minimizing $J(\theta)$ steers the agent to generate molecules that satisfy the scoring function while remaining anchored to the prior&rsquo;s chemical space.</p>
<h3 id="diversity-filters">Diversity Filters</h3>
<p>Link-INVENT uses Diversity Filters (DFs) to balance exploration and exploitation. Buckets of limited size track unique <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/">Bemis-Murcko scaffolds</a>. When a bucket is full, further sampling of that scaffold receives a score of zero, encouraging the agent to explore diverse chemical space regions.</p>
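<p>The bucket mechanism can be sketched as follows; the class name and interface are invented for illustration, and scaffold extraction (e.g. Bemis-Murcko via RDKit) is left to the caller:</p>
<pre><code class="language-python">from collections import defaultdict

class DiversityFilter:
    """Scaffold-bucket diversity filter: once a scaffold's bucket is full,
    further molecules with that scaffold score zero, pushing the agent
    toward unexplored regions of chemical space."""
    def __init__(self, bucket_size=25):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(list)

    def filtered_score(self, scaffold, smiles, score):
        bucket = self.buckets[scaffold]
        if len(bucket) >= self.bucket_size:
            return 0.0  # bucket full: penalize further samples of this scaffold
        bucket.append(smiles)
        return score

df = DiversityFilter(bucket_size=2)
print([df.filtered_score("c1ccccc1", s, 0.9) for s in ("A", "B", "C")])
# third molecule of the same scaffold is zeroed: [0.9, 0.9, 0.0]
</code></pre>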
<h3 id="linker-specific-scoring-components">Linker-Specific Scoring Components</h3>
<p>New scoring components provide direct control over linker properties:</p>
<ul>
<li><strong>Linker effective length</strong>: number of bonds between attachment atoms</li>
<li><strong>Linker maximum graph length</strong>: bonds in the longest graph traversal path</li>
<li><strong>Linker length ratio</strong>: effective length divided by maximum graph length (controls branching)</li>
<li><strong>Linker ratio of rotatable bonds</strong>: rotatable bonds over total bonds (controls flexibility)</li>
<li><strong>Linker number of rings</strong>: controls linearity vs. cyclicity</li>
<li><strong>Linker number of HBDs</strong>: hydrogen-bond donors in the linker itself</li>
</ul>
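<p>The length-based components above can be illustrated on a toy linker graph using plain breadth-first search; atoms as integers and an adjacency mapping stand in for an RDKit molecule:</p>
<pre><code class="language-python">from collections import deque

def bfs_dist(adj, start):
    """Bond-count distances from `start` over an adjacency mapping."""
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def linker_lengths(adj, attach_a, attach_b):
    """Effective length: bonds between the attachment atoms.
    Maximum graph length: longest shortest path anywhere in the linker."""
    effective = bfs_dist(adj, attach_a)[attach_b]
    maximum = max(d for n in adj for d in bfs_dist(adj, n).values())
    return effective, maximum, effective / maximum

# Toy branched linker: chain 0-1-2-3 with a side chain 2-4-5.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
print(linker_lengths(adj, 0, 3))  # (3, 4, 0.75)
</code></pre>
<p>A ratio below 1 signals branching: here the side chain makes the longest path (0 to 5, four bonds) longer than the attachment-to-attachment path (three bonds).</p>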
<h2 id="experimental-evaluation-across-three-drug-discovery-tasks">Experimental Evaluation Across Three Drug Discovery Tasks</h2>
<p>Link-INVENT was evaluated through four experiments across three drug discovery applications, all using the same pre-trained prior.</p>
<h3 id="illustrative-example-two-benzene-rings">Illustrative Example: Two Benzene Rings</h3>
<p>A simple experiment linked two benzene rings with the objectives of limiting HBDs and requiring exactly one ring in the linker. Over 20 epochs, the agent learned to satisfy both objectives, demonstrating the basic RL-guided generation process.</p>
<h3 id="experiment-1a-fragment-linking-ck2-alpha-inhibitors">Experiment 1a: Fragment Linking (CK2 alpha Inhibitors)</h3>
<p>Based on the <a href="https://en.wikipedia.org/wiki/Casein_kinase_2">casein kinase 2</a> (CK2 alpha) fragment linking campaign by Fusco and Brear et al., Link-INVENT was tasked with linking two fragment hits while retaining the Lys68 hydrogen-bond interaction via a DockStream docking constraint (Glide/LigPrep backend). The scoring function also enforced linker length ratio &gt;= 70% and linker MW &lt;= 200 Da.</p>
<p>Over 100 epochs in triplicate, the agent generated molecules with gradually improving docking scores. Key results:</p>
<ul>
<li>Docking score distributions across triplicates were nearly identical, demonstrating reproducibility</li>
<li>Some generated molecules achieved more favorable docking scores than the reference ligand CAM4066 (-15.20 kcal/mol)</li>
<li>More than 5000 unique Bemis-Murcko scaffolds were generated, with minimal overlap across replicates</li>
<li>Binding pose analysis showed the generated linker closely resembled the ground-truth linker, retaining the Lys68 interaction</li>
</ul>
<h3 id="experiment-1b-comparison-fragment-linking-impdh-inhibitors">Experiment 1b: Comparison Fragment Linking (IMPDH Inhibitors)</h3>
<p>Using the IMPDH inhibitor fragment linking case study from Trapero et al., this experiment applied core constrained docking (fragment pose within 0.3 Å of the reference) and compared results to DeLinker and SyntaLinker. The scoring function enforced linker effective length in [3, 5], length ratio &gt;= 70%, and linker MW &lt;= 150 Da.</p>
<p>Link-INVENT generated 8960 SMILES across 70 epochs (comparable to DeLinker&rsquo;s 9000 molecular graphs). Results:</p>
<ul>
<li>Link-INVENT generated molecules with more favorable docking scores than the reference ligand across triplicate runs</li>
<li>Of the 20 DeLinker and 3 SyntaLinker example molecules, none of the former and only one of the latter (the recovered reference) docked as well as or better than the reference</li>
<li>Approximately 3000 unique Bemis-Murcko scaffolds were generated from 5000 total molecules</li>
<li>Link-INVENT&rsquo;s advantage comes from including docking explicitly as a learning objective rather than applying it post hoc</li>
</ul>
<h3 id="experiment-2-scaffold-hopping-dlk-inhibitor-cns-optimization">Experiment 2: Scaffold Hopping (DLK Inhibitor CNS Optimization)</h3>
<p>Based on Patel et al.&rsquo;s <a href="https://en.wikipedia.org/wiki/Dual_leucine_zipper_kinase">dual leucine zipper kinase</a> (DLK) inhibitor campaign, Link-INVENT generated new scaffold ideas to improve CNS penetration while retaining potency. The scoring function included a Cys193 docking constraint plus CNS-compatible properties (HBDs &lt; 2, tPSA &lt;= 90 Å², 3 &lt;= SlogP &lt;= 4, MW &lt;= 450 Da, 1-2 aromatic rings in linker).</p>
<p>The solution space was significantly narrower than fragment linking. The agent still generated diverse scaffolds with favorable docking scores, though fewer exceeded the reference ligand&rsquo;s score. Binding pose analysis confirmed retained Cys193 interactions and predicted additional Gln195 hydrogen bonds.</p>
<h3 id="experiment-3-protac-design-bcl-2mcl-1-dual-degradation">Experiment 3: PROTAC Design (Bcl-2/Mcl-1 Dual Degradation)</h3>
<p>Three sub-experiments demonstrated linker-specific scoring components for PROTAC design based on Wang et al.&rsquo;s Bcl-2/Mcl-1 dual degradation strategy:</p>
<table>
  <thead>
      <tr>
          <th>Sub-Experiment</th>
          <th>Objective</th>
          <th>Key Finding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sub-Exp 1: Linker length</td>
          <td>Generate linkers within specified length intervals [4,6], [7,9], [10,12], [13,15]</td>
          <td>Clear enrichment within target intervals vs. baseline broad distribution</td>
      </tr>
      <tr>
          <td>Sub-Exp 2: Linearity</td>
          <td>Control linear vs. cyclic linkers at fixed length [7,9]</td>
          <td>Baseline ratio ~1:2 linear:cyclic; enforcing linearity or cyclicity achieved strong enrichment</td>
      </tr>
      <tr>
          <td>Sub-Exp 3: Flexibility</td>
          <td>Generate linkers with Low [0,30], Moderate [40,60], or High [70,100] rotatable bond ratios</td>
          <td>Agent learned that rings and sp2 atoms yield rigidity; linear sp3 chains yield flexibility</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-practical-implications-for-drug-discovery">Key Findings and Practical Implications for Drug Discovery</h2>
<p>Link-INVENT demonstrates several practical advantages for molecular linker design:</p>
<ol>
<li><strong>Single prior, multiple tasks</strong>: The same pre-trained model handles fragment linking, scaffold hopping, and PROTAC design without retraining.</li>
<li><strong>Docking as a learning signal</strong>: Including molecular docking explicitly in the scoring function (via DockStream) during RL yields molecules with more favorable docking scores than approaches that apply docking post hoc.</li>
<li><strong>Implicit 3D awareness</strong>: The docking constraint guides the agent toward 3D structural awareness without explicit 3D coordinate inputs, as demonstrated by the overlap between generated and reference binding poses.</li>
<li><strong>Diverse and reproducible output</strong>: Diversity filters ensure exploration of multiple chemical space regions, and triplicate experiments show consistent docking score distributions with minimal scaffold overlap.</li>
</ol>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The linker flexibility metric (ratio of rotatable bonds) is agnostic to intra-molecular hydrogen bonds and does not account for all rigidity factors</li>
<li>Molecular docking is an approximation that can be exploited (e.g., excessive HBDs achieving favorable scores at the expense of permeability)</li>
<li>Experiments 1a and 1b require a proprietary Schrodinger license for Glide/LigPrep docking</li>
<li>No direct experimental (wet-lab) validation was performed in this study</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL v27 (reaction-sliced)</td>
          <td>Not specified</td>
          <td>Filtered for drug-like compounds, then reaction-based slicing with SMIRKS</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Held-out Bemis-Murcko scaffolds</td>
          <td>287 scaffolds</td>
          <td>Held out from training set</td>
      </tr>
      <tr>
          <td>SMILES augmentation</td>
          <td>Randomized SMILES per epoch</td>
          <td>Same tuples, different representations</td>
          <td>Improves generalizability</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder RNN with 3 hidden layers of 512 LSTM cells, embedding size 256</li>
<li><strong>RL loss</strong>: DAP (Difference of Augmented and Posterior likelihoods)</li>
<li><strong>Batch size</strong>: 128 molecules per epoch</li>
<li><strong>Diversity filter</strong>: Bemis-Murcko scaffold buckets of size 25</li>
<li><strong>Score threshold</strong>: 0 (to store all molecules for analysis)</li>
<li><strong>Scoring function</strong>: Weighted geometric mean of component scores</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Single pre-trained prior used across all experiments</li>
<li>Agent initialized as copy of prior, updated via RL</li>
<li>Pre-trained prior available at GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecular docking via DockStream with Glide/LigPrep backend</li>
<li>Triplicate runs for all experiments</li>
<li>Metrics: docking scores, unique Bemis-Murcko scaffold counts, binding pose overlap</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT (Link-INVENT code)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Main codebase for Link-INVENT</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity">ReinventCommunity (data + tutorial)</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training/validation data, reaction SMIRKS, pre-trained prior, Jupyter tutorial</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code, training data, and pre-trained prior are publicly available. However, reproducing the docking-based experiments (1a, 1b, and 2) requires a proprietary Schrodinger license for Glide and LigPrep. The PROTAC experiments (Experiment 3) that use only physicochemical scoring are fully reproducible with the open-source code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Knuth, F., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2023). Link-INVENT: generative linker design with reinforcement learning. <em>Digital Discovery</em>, 2, 392-408. <a href="https://doi.org/10.1039/D2DD00115B">https://doi.org/10.1039/D2DD00115B</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2023link,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Link-INVENT: generative linker design with reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Knuth, Franziska and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--408}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D2DD00115B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v3: Scaffold-Constrained Graph Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</guid><description>DrugEx v3 proposes a Graph Transformer with novel positional encoding for scaffold-constrained molecular generation via multi-objective reinforcement learning.</description><content:encoded><![CDATA[<h2 id="a-graph-transformer-method-for-scaffold-constrained-drug-design">A Graph Transformer Method for Scaffold-Constrained Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces DrugEx v3, a Graph Transformer model for scaffold-constrained de novo drug design. The primary contribution is a novel positional encoding scheme for molecular graphs that allows a Transformer architecture to operate on graph-structured molecular data rather than <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings. The model takes user-provided scaffold fragments as input and generates complete molecules through growing and connecting operations, trained with multi-objective reinforcement learning to optimize for both target affinity and drug-likeness.</p>
<h2 id="from-fixed-objectives-to-user-guided-scaffold-design">From Fixed Objectives to User-Guided Scaffold Design</h2>
<p>Prior versions of DrugEx (v1 and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">v2</a>) used RNN-based generators trained with reinforcement learning for de novo drug design, but they operated under fixed objectives and could not accept user-provided structural priors. If a medicinal chemist wanted to explore analogs of a specific scaffold, the model needed retraining from scratch. Meanwhile, SMILES-based molecular generators face inherent limitations for scaffold-constrained design: SMILES is a linear notation, so inserting fragments at multiple positions of a scaffold requires complex grammar handling, and small token changes can produce invalid molecules.</p>
<p>Several approaches had been proposed for scaffold-based generation, including graph generative models (Lim et al., 2019), DeepScaffold (Li et al., 2020), SMILES-based scaffold decorators (Arus-Pous et al., 2020), and SyntaLinker for fragment linking (Yang et al., 2020). DrugEx v3 aims to combine the advantages of graph representations (validity guarantees, local invariance, flexible extension) with the Transformer architecture&rsquo;s ability to handle complex dependencies, while maintaining the multi-objective reinforcement learning framework from DrugEx v2.</p>
<h2 id="graph-positional-encoding-for-molecular-transformers">Graph Positional Encoding for Molecular Transformers</h2>
<p>The core innovation is adapting the Transformer architecture to work directly with molecular graph representations. Two key modifications make this possible.</p>
<p><strong>Graph word encoding.</strong> Since atoms and bonds cannot be processed simultaneously in a graph, the authors combine them into a single index:</p>
<p>$$
W = T_{atom} \times 4 + T_{bond}
$$</p>
<p>where $T_{atom}$ is the atom type index and $T_{bond}$ is the bond type index (four bond types: single, double, triple, and none).</p>
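<p>As a concrete illustration of this packing (a minimal sketch; the constant and function names here are mine, not from the DrugEx codebase):</p>

```python
# Pack an atom type index and a bond type index into a single "graph word"
# index, W = T_atom * 4 + T_bond, and recover them again.
N_BOND_TYPES = 4  # single, double, triple, none

def encode_word(t_atom: int, t_bond: int) -> int:
    """W = T_atom * 4 + T_bond."""
    assert 0 <= t_bond < N_BOND_TYPES
    return t_atom * N_BOND_TYPES + t_bond

def decode_word(w: int) -> tuple:
    """Invert the packing to recover (T_atom, T_bond)."""
    return divmod(w, N_BOND_TYPES)
```

<p>Because there are exactly four bond types, the packing is lossless: <code>decode_word(encode_word(t, b))</code> always returns <code>(t, b)</code>.</p>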
<p><strong>Graph positional encoding.</strong> Standard sequential position encoding does not capture molecular topology. The authors propose an adjacency-matrix-based positional encoding:</p>
<p>$$
P = I_{Atom} \times L_{max} + I_{Connected}
$$</p>
<p>where $I_{Atom}$ is the current atom index, $L_{max}$ is the maximum sequence length, and $I_{Connected}$ is the index of the atom connected by the current bond. This encoding is then processed through the standard sinusoidal positional encoding:</p>
<p>$$
PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{m}})
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{m}})
$$</p>
<p>with $d_{m} = 512$.</p>
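<p>The two-step encoding can be sketched as follows (illustrative only; the $L_{max}$ value and function names are my assumptions, and the real model applies this per bond entry of the graph matrix):</p>

```python
import math

L_MAX = 128    # maximum sequence length (value here is illustrative)
D_MODEL = 512  # embedding dimension from the paper

def graph_position(i_atom: int, i_connected: int, l_max: int = L_MAX) -> int:
    """P = I_atom * L_max + I_connected (adjacency-based position index)."""
    return i_atom * l_max + i_connected

def sinusoidal_pe(pos: int, d_model: int = D_MODEL) -> list:
    """Standard sinusoidal encoding applied to the graph position index."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)       # even dimensions: sine
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe
```

<p>The topology-aware index $P$ replaces the token position $pos$, so two bonds touching the same atom pair map to nearby encodings regardless of generation order.</p>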
<p><strong>Molecule generation procedure.</strong> Each molecule in the training data is represented as a five-row matrix encoding atom type, bond type, connected atom index, current atom index, and fragment index. The columns are divided into three sections: fragment (the scaffold), growing (new atoms added to fragments), and linking (bonds connecting grown fragments). The decoder uses a GRU-based recurrent layer to sequentially output atom type, bond type, connected atom index, and current atom index at each step, with chemical valence rules enforced at every generation step to guarantee valid molecules.</p>
<p><strong>Multi-objective reinforcement learning.</strong> The generator is trained with a policy gradient objective:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) | \theta\right] = \sum_{t=1}^{T} \log G(y_{t} | y_{1:t-1}) \cdot R^{\ast}(y_{1:T})
$$</p>
<p>where $R^{*}$ is a Pareto-based reward combining target affinity and QED drug-likeness score:</p>
<p>$$
R^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>with $k$ being the solution&rsquo;s index in the Pareto rank. An exploration strategy uses two networks: an exploitation network $G_{\theta}$ (updated by policy gradient) and an exploration network $G_{\phi}$ (fixed, pre-trained on ChEMBL), with an exploration rate $\varepsilon$ controlling how many scaffolds are routed to $G_{\phi}$ during training.</p>
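<p>A minimal sketch of the reward and the $\varepsilon$-routing (my own illustrative code; the actual DrugEx implementation batches this over sampled molecules and GPU-accelerated Pareto fronts):</p>

```python
import random

def pareto_reward(k: int, desired: bool, n_desired: int, n_undesired: int) -> float:
    """Piecewise reward from the paper: desired solutions score in [0.5, 1],
    undesired ones in [0, 0.5]. k is the solution's index in the Pareto rank."""
    if desired:
        return 0.5 + (k - n_undesired) / (2 * n_desired)
    return k / (2 * n_undesired)

def route_scaffold(epsilon: float, rng: random.Random) -> str:
    """With probability epsilon the scaffold is handled by the fixed
    exploration network G_phi; otherwise by the trainable exploitation
    network G_theta."""
    return "G_phi" if rng.random() < epsilon else "G_theta"
```

<p>With this indexing, the top-ranked desired solution ($k = N_{desired} + N_{undesired}$) receives reward 1, and the boundary between the two cases meets at 0.5.</p>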
<h2 id="experimental-setup-architecture-comparison-and-rl-optimization">Experimental Setup: Architecture Comparison and RL Optimization</h2>
<h3 id="data">Data</h3>
<p>The ChEMBL set (version 27) contained approximately 1.7 million molecules for pre-training, preprocessed via RDKit (charge neutralization, metal/fragment removal). The LIGAND set comprised 10,828 adenosine receptor ligands for fine-tuning. Each molecule was decomposed into fragments using the BRICS algorithm, creating scaffold-molecule pairs (up to 15 pairs per molecule with four fragments). The ChEMBL set yielded 9.3 million training pairs, and the LIGAND set produced 53,888 training pairs.</p>
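<p>The &ldquo;up to 15 pairs&rdquo; figure follows from enumerating the non-empty subsets of at most four BRICS fragments ($2^4 - 1 = 15$). A small sketch (illustrative only; the real pipeline runs BRICS in RDKit and pairs each fragment subset with the full molecule):</p>

```python
from itertools import combinations

def scaffold_subsets(fragments):
    """Enumerate every non-empty subset of a molecule's fragments; each
    subset serves as one input scaffold, paired with the full molecule."""
    pairs = []
    for r in range(1, len(fragments) + 1):
        for subset in combinations(fragments, r):
            pairs.append(subset)
    return pairs
```

<p>Four fragments yield $4 + 6 + 4 + 1 = 15$ scaffold-molecule pairs, which is where the dataset sizes above come from.</p>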
<h3 id="architecture-comparison">Architecture comparison</h3>
<p>Four architectures were compared:</p>
<ol>
<li><strong>Graph Transformer</strong>: graph input with novel positional encoding</li>
<li><strong>Sequential Transformer</strong>: SMILES input with standard Transformer</li>
<li><strong>LSTM-BASE</strong>: SMILES encoder-decoder with three recurrent layers</li>
<li><strong>LSTM+ATTN</strong>: LSTM-BASE with an attention mechanism between encoder and decoder</li>
</ol>
<p>All models were pre-trained on ChEMBL and fine-tuned on the LIGAND set. The bioactivity predictor was a random forest regression model using 2048D ECFP6 fingerprints and 19D physicochemical descriptors, with an activity threshold of pX = 6.5 for the A2A adenosine receptor.</p>
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>Five metrics were used: validity (parseable molecules), accuracy (scaffold containment), desirability (meeting all objectives), uniqueness, and novelty (not in ChEMBL). Diversity was measured using the Solow-Polasky index with Tanimoto distance on ECFP6 fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\intercal} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
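<p>A self-contained sketch of the Solow-Polasky computation (my own implementation; I assume $F(\mathbf{s})$ is the pairwise similarity matrix, which the paper derives from Tanimoto distances on ECFP6 fingerprints). Rather than inverting $F$ explicitly, it solves $F\mathbf{x} = \mathbf{e}$:</p>

```python
def solve(F, b):
    """Solve F x = b by Gaussian elimination with partial pivoting."""
    n = len(F)
    A = [row[:] + [b[i]] for i, row in enumerate(F)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def solow_polasky(F):
    """I(A) = (1/|A|) e^T F^{-1} e, computed via the solve of F x = e."""
    n = len(F)
    x = solve(F, [1.0] * n)
    return sum(x) / n
```

<p>For a set of mutually dissimilar molecules ($F = I$) the index is 1; overlapping similarity pulls it below 1, consistent with the 0.84-0.88 values reported below.</p>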
<h3 id="hardware">Hardware</h3>
<p>Models were benchmarked on a server with NVIDIA Tesla P100 GPUs.</p>
<h2 id="key-results-graph-representation-advantages-and-rl-trade-offs">Key Results: Graph Representation Advantages and RL Trade-offs</h2>
<h3 id="pre-training-and-fine-tuning-performance">Pre-training and fine-tuning performance</h3>
<p>The Graph Transformer achieved the best overall performance across all metrics:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity (PT)</th>
          <th>Accuracy (PT)</th>
          <th>Validity (FT)</th>
          <th>Accuracy (FT)</th>
          <th>Novelty (FT)</th>
          <th>Uniqueness (FT)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph Transformer (512)</td>
          <td>100.0%</td>
          <td>99.3%</td>
          <td>100.0%</td>
          <td>99.2%</td>
          <td>68.9%</td>
          <td>82.9%</td>
      </tr>
      <tr>
          <td>Seq. Transformer (512)</td>
          <td>96.7%</td>
          <td>74.0%</td>
          <td>99.3%</td>
          <td>92.7%</td>
          <td>8.9%</td>
          <td>28.9%</td>
      </tr>
      <tr>
          <td>LSTM+ATTN (512)</td>
          <td>94.3%</td>
          <td>72.8%</td>
          <td>96.9%</td>
          <td>85.9%</td>
          <td>6.3%</td>
          <td>20.7%</td>
      </tr>
      <tr>
          <td>LSTM-BASE (512)</td>
          <td>93.9%</td>
          <td>52.4%</td>
          <td>98.7%</td>
          <td>81.6%</td>
          <td>3.9%</td>
          <td>19.2%</td>
      </tr>
  </tbody>
</table>
<p>PT = pre-trained, FT = fine-tuned. The Graph Transformer achieved 100% validity due to its explicit valence checking at each generation step. It also produced substantially more novel and unique molecules after fine-tuning compared to SMILES-based methods.</p>
<p>The authors identified four advantages of the graph representation over SMILES: (1) local invariance, where fragment ordering does not affect output; (2) global extendibility, where new atoms can be appended without restructuring existing data; (3) freedom from grammar constraints; and (4) direct accessibility of chemical valence rules for validity enforcement.</p>
<h3 id="reinforcement-learning-results">Reinforcement learning results</h3>
<p>With multi-objective RL (affinity + QED), 74.6% of generated molecules were predicted active at $\varepsilon = 0.0$. The exploration rate $\varepsilon$ trades off desirability against uniqueness:</p>
<table>
  <thead>
      <tr>
          <th>$\varepsilon$</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.0</td>
          <td>74.6%</td>
          <td>60.7%</td>
          <td>60.6%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.1</td>
          <td>66.8%</td>
          <td>75.0%</td>
          <td>74.6%</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>0.2</td>
          <td>61.6%</td>
          <td>80.2%</td>
          <td>79.4%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.3</td>
          <td>56.8%</td>
          <td>89.8%</td>
          <td>88.8%</td>
          <td>0.874</td>
      </tr>
  </tbody>
</table>
<p>The authors report that $\varepsilon = 0.3$ produced the best balance between desirability and uniqueness, with 56.8% desired molecules and 89.8% uniqueness. Diversity remained above 0.84 across all settings.</p>
<h3 id="limitations">Limitations</h3>
<p>The Graph Transformer produced molecules with worse synthetic accessibility (SA scores) compared to SMILES-based methods, particularly after fine-tuning on the smaller LIGAND set. The authors attribute this to uncommon ring systems generated when the model handles long-distance dependencies. A kekulization issue also causes a small fraction of molecules to fail scaffold matching: aromatic bond inference during sanitization can alter the scaffold substructure. Under single-objective optimization (affinity alone, without the QED constraint), the model generates molecules with molecular weight exceeding 500 Da, reducing drug-likeness. All bioactivity predictions rely on a random forest model rather than experimental validation, and the t-SNE analysis suggests some generated molecules fall outside the model&rsquo;s applicability domain.</p>
<h3 id="future-directions">Future directions</h3>
<p>The authors propose extending the Graph Transformer to accept protein information as input via proteochemometric modeling, enabling design of ligands for targets without known ligands. Lead optimization, where a &ldquo;hit&rdquo; serves as input to generate improved analogs, is also identified as a natural extension.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v27</td>
          <td>~1.7M molecules (9.3M scaffold-molecule pairs)</td>
          <td>Preprocessed via RDKit</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LIGAND set (A2A AR ligands from ChEMBL)</td>
          <td>10,828 ligands (53,888 pairs)</td>
          <td>Split 8:1:1 train/val/test</td>
      </tr>
      <tr>
          <td>Bioactivity labels</td>
          <td>ChEMBL A2A AR activity data</td>
          <td>pX threshold = 6.5</td>
          <td>Average pChEMBL values</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fragment decomposition: BRICS algorithm via RDKit (max 4 fragments per molecule)</li>
<li>Optimizer: Adam with learning rate $10^{-4}$, batch size 256</li>
<li>Pre-training: 20 epochs; fine-tuning: up to 1,000 epochs with early stopping (patience: 100 epochs)</li>
<li>Bioactivity predictor: random forest regression (scikit-learn) with 2048D ECFP6 + 19D physicochemical descriptors</li>
<li>Pareto-based multi-objective ranking with GPU acceleration</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Graph Transformer: 512 hidden units, 8 attention heads, $d_{k} = d_{v} = 64$</li>
<li>Sequential Transformer: same hidden size, sinusoidal positional encoding</li>
<li>LSTM-BASE / LSTM+ATTN: 128 embedding units, 512 hidden units, 3 recurrent layers</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Graph Transformer</th>
          <th>Best SMILES Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (fine-tuned)</td>
          <td>100.0%</td>
          <td>99.6% (LSTM-BASE 1024)</td>
          <td>Valence checking guarantees validity</td>
      </tr>
      <tr>
          <td>Accuracy (fine-tuned)</td>
          <td>99.2%</td>
          <td>94.3% (Seq. Transformer 1024)</td>
          <td>Scaffold containment</td>
      </tr>
      <tr>
          <td>Desirability (RL, $\varepsilon$=0.0)</td>
          <td>74.6%</td>
          <td>N/A</td>
          <td>Only Graph Transformer used for RL</td>
      </tr>
      <tr>
          <td>Diversity (RL)</td>
          <td>0.879</td>
          <td>N/A</td>
          <td>Solow-Polasky index</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware-1">Hardware</h3>
<p>NVIDIA Tesla P100 GPUs. Specific training times not reported, but Transformer models trained faster than LSTM models with the same hidden layer size.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CDDLeiden/DrugEx">CDDLeiden/DrugEx</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (v1, v2, v3)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v27</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Pre-training data source</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P., &amp; van Westen, G. J. P. (2023). DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. <em>Journal of Cheminformatics</em>, 15, 24. <a href="https://doi.org/10.1186/s13321-023-00694-z">https://doi.org/10.1186/s13321-023-00694-z</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2023drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00694-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Curriculum Learning for De Novo Drug Design (REINVENT)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</guid><description>Curriculum learning applied to REINVENT accelerates convergence on complex multi-parameter drug design objectives compared to standard reinforcement learning.</description><content:encoded><![CDATA[<h2 id="curriculum-learning-as-a-method-for-molecular-generation">Curriculum Learning as a Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces curriculum learning (CL) into the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo molecular design platform. The primary contribution is a training strategy that decomposes complex multi-parameter optimization (MPO) objectives into sequences of simpler tasks with increasing complexity. The agent learns each simpler task before progressing to the full production objective, accelerating convergence and improving the quality and diversity of generated molecules compared to standard policy-based reinforcement learning (RL).</p>
<h2 id="the-computational-cost-of-complex-reward-functions">The Computational Cost of Complex Reward Functions</h2>
<p>Policy-based RL for molecular design works by training a generative model (the agent) to produce molecules that maximize a reward function. In practice, drug design reward functions often include computationally expensive components such as molecular docking. When the reward landscape is complex and minima are difficult to find, the agent may spend many epochs sampling molecules far from the desired objective. The resulting small gradients cause minimal policy updates, leading to long periods of non-productivity. This is particularly wasteful when each reward evaluation involves expensive physics-based computations.</p>
<p>The core problem is that standard RL treats the full MPO objective as a monolithic task. If the agent cannot find any rewarding molecules early in training, it receives near-zero gradients and makes negligible progress. This creates a bootstrapping problem: the agent needs to already be sampling from favorable regions of chemical space to receive useful learning signals, but it has no guidance on how to get there.</p>
<p>Curriculum learning, originally proposed by Bengio et al. (2009), addresses this by arranging training tasks in order of increasing difficulty. When constituent tasks are correlated with the final objective, the gradients from simpler tasks provide more effective traversal of the optimization landscape.</p>
<h2 id="formalized-curriculum-strategy-for-reinvent">Formalized Curriculum Strategy for REINVENT</h2>
<p>The key innovation is a two-phase training protocol with formal definitions for curriculum progression.</p>
<p>A scoring function maps SMILES strings to desirability scores in $[0, 1]$ using a weighted geometric mean:</p>
<p>$$S(x) = \left(\prod_{i=1}^{n} c_{i}(x)^{w_{i}}\right)^{1 / \sum_{i=1}^{n} w_{i}}$$</p>
<p>where $x$ is a sampled compound in SMILES format, $c_{i}$ is the $i$-th scoring component, and $w_{i}$ is its weight.</p>
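<p>The scoring function is straightforward to sketch (illustrative code; the component scorers themselves, e.g. QED or a docking score transform, are abstracted away here as precomputed values in $[0, 1]$):</p>

```python
def desirability(scores, weights):
    """Weighted geometric mean S(x) = (prod_i c_i^{w_i})^(1 / sum_i w_i),
    where each component score c_i lies in [0, 1]."""
    assert len(scores) == len(weights)
    assert all(0.0 <= c <= 1.0 for c in scores)
    prod = 1.0
    for c, w in zip(scores, weights):
        prod *= c ** w
    return prod ** (1.0 / sum(weights))
```

<p>The geometric mean makes the aggregate score sensitive to any single poor component: one zero-scoring component drives $S(x)$ to zero regardless of the others.</p>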
<p>A Curriculum $C$ consists of a sequence of Objectives $O = \{O_{C_1}, \ldots, O_{C_n}, O_{P}\}$, where subscripts $C$ and $P$ denote Curriculum and Production Objectives respectively. Each Objective has a corresponding scoring function. Progression is controlled by Curriculum Progression Criteria $P = \{P_{1}, \ldots, P_{n}\}$, where each $P_{i}$ defines a score threshold the agent must achieve before advancing to the next objective.</p>
<p><strong>Curriculum Phase</strong>: The agent trains on sequential Curriculum Objectives with increasing complexity. A diversity filter is not applied during this phase, as it could be counterproductive to guiding the agent toward favorable chemical space. No computationally expensive components (e.g., docking) are used in Curriculum Objectives.</p>
<p><strong>Production Phase</strong>: Activated only when the final Curriculum Progression Criterion $P_{n}$ is satisfied. The agent now optimizes the full Production Objective, which may include expensive components like molecular docking. A new inception memory is initialized (clearing Curriculum Phase compounds), and a Bemis-Murcko scaffold diversity filter is applied to encourage exploration across multiple local minima.</p>
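<p>The two phases can be summarized schematically (structure only; the agent interface, method names, and epoch logic below are placeholders of my own, not REINVENT&rsquo;s API):</p>

```python
# Schematic of the two-phase curriculum protocol described above.
def run_curriculum(agent, curriculum_objectives, thresholds,
                   production_objective, max_epochs=2000):
    # Curriculum Phase: no diversity filter, no expensive scoring components;
    # advance only once the mean score clears the progression criterion.
    for objective, threshold in zip(curriculum_objectives, thresholds):
        for _ in range(max_epochs):
            mean_score = agent.train_epoch(objective)
            if mean_score >= threshold:
                break
    # Production Phase: fresh inception memory and a Bemis-Murcko scaffold
    # diversity filter; expensive components (e.g. docking) now allowed.
    agent.reset_inception_memory()
    agent.enable_diversity_filter("IdenticalMurckoScaffold")
    for _ in range(max_epochs):
        agent.train_epoch(production_objective)
```

<p>The important structural point is the ordering: the inception memory reset and the diversity filter activate only after the last Curriculum Progression Criterion is satisfied.</p>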
<p>The implementation builds on REINVENT&rsquo;s RNN architecture: three hidden layers of 512 LSTM cells with an embedding size of 256 and a linear layer with softmax activation, pretrained on ChEMBL to learn SMILES syntax.</p>
<h2 id="three-experiments-on-pdk1-inhibitor-design">Three Experiments on PDK1 Inhibitor Design</h2>
<p>The authors evaluate CL on three molecular design tasks of increasing complexity, all centered on designing <a href="https://en.wikipedia.org/wiki/PDPK1">3-phosphoinositide-dependent protein kinase-1</a> (PDK1) inhibitors.</p>
<h3 id="experiment-1-target-scaffold-construction">Experiment 1: Target Scaffold Construction</h3>
<p>The goal is to generate compounds possessing a dihydro-pyrazoloquinazoline scaffold with a phenyl substituent, a scaffold not present in the prior&rsquo;s training set. Standard RL fails entirely over 2000 epochs because the probability of randomly sampling a compound with this scaffold is negligibly small, producing binary rewards (1.0 if scaffold present, 0.5 otherwise) that never rise above 0.5.</p>
<p>CL decomposes the target scaffold into 5 progressively complex substructures. Each Curriculum Progression Criterion threshold is set to 0.8. The agent learns to generate compounds with each substructure before advancing. CL finds the target scaffold within 1750 epochs, while baseline RL cannot find it in the same timeframe.</p>
<h3 id="experiments-2-and-3-molecular-docking-constraints">Experiments 2 and 3: Molecular Docking Constraints</h3>
<p>These experiments use a Production Objective combining a molecular docking constraint (retaining two hydrogen-bonding interactions with Ala 162 of PDK1, PDB ID: 2XCH) and QED (Quantitative Estimate of Druglikeness). Both experiments limit computational cost by capping production epochs at 300.</p>
<p><strong>Experiment 2</strong> uses Tanimoto (2D) similarity to a reference ligand as the Curriculum Objective. Two scenarios are tested: &ldquo;Low&rdquo; (threshold 0.5) and &ldquo;High&rdquo; (threshold 0.8).</p>
<p><strong>Experiment 3</strong> uses ROCS (3D) shape-based similarity to the reference ligand as the Curriculum Objective, with &ldquo;Low&rdquo; (0.5) and &ldquo;High&rdquo; (0.75) thresholds.</p>
<p>All experiments are run in triplicate. The baseline includes both standard RL and RL with Tanimoto/ROCS components added directly to the scoring function (not sequentially), to control for the presence of these components.</p>
<p>Across all CL experiments, CL generates between 2,941 and 9,068 more compounds with docking scores better than the reference ligand (-10.907 kcal/mol) compared to baseline RL, corresponding to 12.42-23.79% improvement in the fraction of compounds exceeding the reference. Between the Curriculum Objectives, the &ldquo;High&rdquo; threshold scenario outperforms the &ldquo;Low&rdquo; scenario by 316-3,415 additional compounds (with percentage improvements ranging from -0.4% to 10.57%).</p>
<p>Baseline RL produces essentially no compounds satisfying the docking constraint for the first 100 epochs. CL agents achieve immediate productivity: in the &ldquo;High&rdquo; Tanimoto scenario, the initial docking score already exceeds the maximum score achieved by baseline RL over 300 epochs.</p>
<h3 id="scaffold-diversity-analysis">Scaffold Diversity Analysis</h3>
<p>CL generates more unique Bemis-Murcko scaffolds than baseline RL in all experiments. The &ldquo;High&rdquo; scenarios produce more unique scaffolds than the &ldquo;Low&rdquo; scenarios. CL also produces a higher fraction of &ldquo;favorable&rdquo; scaffolds (those with better docking scores than the reference ligand).</p>
<h2 id="accelerated-convergence-with-a-diversity-trade-off">Accelerated Convergence with a Diversity Trade-off</h2>
<p>The results demonstrate three consistent findings across all experiments:</p>
<ol>
<li>
<p><strong>Accelerated productivity</strong>: CL agents reach productive sampling of favorable compounds substantially faster than baseline RL. Even a single Curriculum Objective with a computationally inexpensive metric can guide the agent to regions of chemical space where expensive Production Objectives are readily satisfied.</p>
</li>
<li>
<p><strong>Improved output quality</strong>: CL generates more compounds with favorable docking scores, more unique scaffolds, and a higher fraction of scaffolds that outperform the reference ligand.</p>
</li>
<li>
<p><strong>Controllable trade-off</strong>: The Curriculum Progression Criterion threshold provides direct control over agent policy. Higher thresholds produce better Production Objective optimization but reduce intra-set diversity (higher cross-Tanimoto similarities among generated compounds). UMAP visualizations confirm that &ldquo;Low&rdquo; and &ldquo;High&rdquo; scenarios sample from nearby but distinct regions of chemical space.</p>
</li>
</ol>
<p>The authors note that even moderate optimization of similarity-based Curriculum Objectives (the &ldquo;Low&rdquo; scenarios) already substantially narrows the agent&rsquo;s perceived solution space. This suggests that CL inherently regularizes the agent policy, trading some diversity for convergence speed.</p>
<p><strong>Limitations</strong>: The authors acknowledge that data supporting the findings are available only upon request rather than as public deposits. The approach is demonstrated on a single target (PDK1), and the curriculum design requires domain expertise to decompose objectives appropriately. The inverse relationship between Curriculum Objective optimization and solution diversity means practitioners must carefully tune thresholds for their specific applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>Not specified</td>
          <td>Used to pretrain the RNN on SMILES syntax</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 receptor crystal structure</td>
      </tr>
  </tbody>
</table>
<p>Raw data supporting the findings are available from the corresponding author upon request.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>REINVENT platform with LSTM-based RNN (3 hidden layers, 512 cells, embedding size 256)</li>
<li>Scoring function: weighted geometric mean of components</li>
<li>Curriculum Progression Criteria: score thresholds (0.5 or 0.75-0.8 depending on scenario)</li>
<li>Diversity filter: Identical Murcko Scaffold with bucket size 25 (Production Phase only)</li>
<li>Inception (experience replay) for both phases, reset at phase transition</li>
<li>Batch size: 128, learning rate: 0.0001, sigma: 128, Adam optimizer</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Prior: RNN pretrained on ChEMBL SMILES</li>
<li>Agent: Initialized from prior, focused via RL/CL</li>
<li>No pretrained model weights are publicly released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score (Glide SP)</td>
          <td>Predicted binding affinity (kcal/mol)</td>
          <td>Lower is better; reference ligand: -10.907</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimate of Druglikeness</td>
          <td>Range [0, 1]</td>
      </tr>
      <tr>
          <td>Unique Bemis-Murcko scaffolds</td>
          <td>Scaffold diversity measure</td>
          <td>Averaged over triplicates</td>
      </tr>
      <tr>
          <td>Cross-Tanimoto similarity</td>
          <td>Intra-set compound diversity</td>
          <td>Calculated on pooled triplicates</td>
      </tr>
      <tr>
          <td>Tanimoto/ROCS similarity</td>
          <td>Curriculum Objective metrics</td>
          <td>2D fingerprint and 3D shape similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>GPU: NVIDIA Tesla V100 (32 GB)</li>
<li>Docking: AWS p3.8xlarge instance</li>
<li>LigPrep parallelized over 8 CPU cores</li>
<li>Glide docking parallelized over 48 CPU cores via DockStream</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>De novo molecular design platform</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity/blob/master/notebooks/Automated_Curriculum_Learning_Demo.ipynb">CL Tutorial Notebook</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Jupyter notebook tutorial for curriculum learning</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Fialková, V., Arango, J. D., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2022). Improving de novo molecular design with curriculum learning. <em>Nature Machine Intelligence</em>, 4, 555-563. <a href="https://doi.org/10.1038/s42256-022-00494-4">https://doi.org/10.1038/s42256-022-00494-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2022curriculum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving de novo molecular design with curriculum learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Fialkov{\&#39;a}, Vendy and Arango, Juan Diego and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{555--563}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00494-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Augmented Hill-Climb for RL-Based Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</guid><description>Augmented Hill-Climb combines REINVENT and Hill-Climb RL strategies to improve sample efficiency ~45-fold for SMILES-based de novo molecule generation.</description><content:encoded><![CDATA[<h2 id="a-hybrid-rl-strategy-for-de-novo-molecule-generation">A Hybrid RL Strategy for De Novo Molecule Generation</h2>
<p>This is a <strong>Method</strong> paper that proposes Augmented Hill-Climb (AHC), a reinforcement learning strategy for conditioning SMILES-based language models during de novo molecule generation. The primary contribution is a simple hybrid between the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and Hill-Climb (HC) RL strategies that computes the REINVENT loss function only on the top-k highest-scoring molecules per batch (as in HC), thereby removing the counterproductive regularization effect of low-scoring molecules. The authors demonstrate that AHC improves optimization ability ~1.5-fold and sample efficiency ~45-fold compared to REINVENT across docking tasks against four <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> targets, and that the approach generalizes to transformer architectures.</p>
<h2 id="sample-efficiency-bottleneck-in-rl-guided-molecular-generation">Sample Efficiency Bottleneck in RL-Guided Molecular Generation</h2>
<p>Recurrent neural networks trained on SMILES have become a standard approach for de novo molecule generation, with RL strategies like REINVENT and Hill-Climb achieving top performance on benchmarks such as <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>. However, RL-guided generation can be highly <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/">sample-inefficient</a>, often requiring $10^5$ or more molecules to optimize complex objectives. This is acceptable for cheap scoring functions (e.g., QSAR models, property calculators) but becomes a practical bottleneck when using computationally expensive scoring functions like molecular docking or computer-aided synthesis planning.</p>
<p>The REINVENT strategy regularizes the agent by computing a loss based on the difference between the agent&rsquo;s policy and an &ldquo;augmented likelihood&rdquo; that combines the prior policy with a scaled reward. When low-scoring molecules are sampled ($R_T \approx 0$), the augmented likelihood reduces to the prior likelihood, causing the agent to trend back toward the prior policy. This undoes useful learning, especially early in training or when the objective is difficult. Meanwhile, Hill-Climb simply fine-tunes the RNN on the top-k molecules per batch, which is sample-efficient but lacks explicit regularization, leading to mode collapse and generation of invalid SMILES.</p>
<p>Previous work by Neil et al. compared RL strategies but did not clearly quantify sample-efficiency differences, and modifications to the REINVENT loss function by Fialkova et al. showed no significant improvement. The best agent reminder (BAR) mechanism offered modest gains but was originally tested on graph-based models.</p>
<h2 id="core-innovation-filtering-low-scoring-molecules-from-the-reinvent-loss">Core Innovation: Filtering Low-Scoring Molecules from the REINVENT Loss</h2>
<p>Augmented Hill-Climb combines the loss formulation of REINVENT with the top-k selection mechanism of Hill-Climb. The agent samples a batch of molecules, ranks them by reward, and computes the REINVENT loss only on the top-k molecules. This removes the counterproductive regularization caused by low-scoring molecules while retaining the prior-based regularization for high-scoring molecules.</p>
<p>The REINVENT loss defines an augmented likelihood:</p>
<p>$$
\log P_{\mathbb{U}}(A) = \log P_{prior}(A) + \sigma R_T
$$</p>
<p>where $\sigma$ is a scaling coefficient controlling the reward contribution. The agent loss is the squared difference between the augmented likelihood and the agent&rsquo;s log-likelihood:</p>
<p>$$
L(\theta) = \left[\log P_{\mathbb{U}}(A) - \log P_{agent}(A)\right]^2
$$</p>
<p>In standard REINVENT, this loss is computed over all molecules in the batch. When $R_T \approx 0$, the augmented likelihood collapses to the prior likelihood, pushing the agent back toward the prior. AHC avoids this by computing the loss only on the top-k molecules ranked by reward, exactly as Hill-Climb selects molecules for fine-tuning.</p>
<p>The key insight is that high-scoring molecules are still regularized by the prior component of the augmented likelihood ($\log P_{prior}(A)$), preventing catastrophic forgetting. Low-scoring molecules, which would otherwise pull the agent back toward the prior, are simply excluded from the loss computation.</p>
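<p>The update described above can be sketched as follows (a NumPy sketch under the assumption that per-molecule log-likelihoods and rewards are already computed; in practice the loss would be backpropagated through the agent network):</p>

```python
import numpy as np

def ahc_loss(log_p_prior, log_p_agent, rewards, sigma=60.0, topk_frac=0.5):
    """One Augmented Hill-Climb loss evaluation (sketch).

    log_p_prior, log_p_agent: per-molecule sequence log-likelihoods under
    the fixed prior and the trainable agent. rewards: scores R_T in [0, 1].
    Only the top-k molecules by reward contribute to the loss.
    """
    k = max(1, int(len(rewards) * topk_frac))
    top = np.argsort(rewards)[::-1][:k]            # top-k indices by reward
    aug = log_p_prior[top] + sigma * rewards[top]  # augmented likelihood
    return np.mean((aug - log_p_agent[top]) ** 2)  # squared-difference loss
```

<p>Setting <code>topk_frac=1.0</code> recovers the standard REINVENT loss over the whole batch; dropping the prior term would recover plain Hill-Climb fine-tuning.</p>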
<h3 id="diversity-filters-to-prevent-mode-collapse">Diversity Filters to Prevent Mode Collapse</h3>
<p>AHC is more susceptible to mode collapse than REINVENT because it focuses learning on high-scoring molecules. The authors address this with diversity filters (DFs) that penalize the reward of molecules similar to previously generated ones. Through a hyperparameter search over 825 configurations on three GuacaMol tasks, they identify an optimal DF configuration (DF2) with:</p>
<ul>
<li>Minimum score threshold of 0.5 (lower than DF1&rsquo;s 0.8)</li>
<li>Linear penalization output mode (softer than binary)</li>
<li>Bin size of 50 (larger than DF1&rsquo;s 25)</li>
<li>Scaffold similarity based on ECFP4 fingerprints</li>
</ul>
<p>The authors find that stricter DFs (lower thresholds, smaller bins) better prevent mode collapse but reduce optimization performance, while more lenient DFs enable better learning of chemotype-reward associations. DF2 represents a compromise.</p>
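<p>A minimal bucket-based diversity filter might look like the following (a sketch: scaffold extraction is abstracted to a string key, where a real implementation would compute Murcko scaffolds with RDKit; the decay schedule is illustrative):</p>

```python
from collections import defaultdict

class DiversityFilter:
    """Penalize rewards for over-sampled scaffolds (sketch).

    Once a scaffold's bucket fills past bin_size, molecules mapping to
    that bucket receive a linearly decayed reward, nudging the agent
    toward new chemotypes.
    """
    def __init__(self, min_score=0.5, bin_size=50):
        self.min_score = min_score
        self.bin_size = bin_size
        self.buckets = defaultdict(int)

    def __call__(self, scaffold, score):
        if score < self.min_score:
            return score                 # low scorers are not tracked
        self.buckets[scaffold] += 1
        over = self.buckets[scaffold] - self.bin_size
        if over <= 0:
            return score                 # bucket not yet full
        # linear penalization: decay with bucket overflow
        return max(0.0, score * (1.0 - over / self.bin_size))
```

<p>Tightening <code>min_score</code> and <code>bin_size</code> corresponds to the stricter DF configurations discussed above: stronger mode-collapse protection at the cost of optimization performance.</p>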
<h2 id="experimental-setup-docking-tasks-and-benchmark-comparisons">Experimental Setup: Docking Tasks and Benchmark Comparisons</h2>
<p>The evaluation spans five experiments:</p>
<p><strong>Experiment 1</strong>: AHC vs. REINVENT on DRD2 docking over 100 RL updates (6,400 samples), varying $\sigma$ from 30 to 240. RNN trained on the MOSESn dataset (MOSES with neutralized charges, 2.45M molecules).</p>
<p><strong>Experiment 2</strong>: AHC + DF2 vs. REINVENT on four GPCR targets (DRD2, OPRM1, AGTR1, OX1R) over 500 RL updates. Docking performed with Glide-SP after ligand preparation with LigPrep.</p>
<p><strong>Experiment 3</strong>: Diversity filter hyperparameter search (825 configurations) on three GuacaMol tasks (<a href="https://en.wikipedia.org/wiki/Aripiprazole">Aripiprazole</a> similarity, C11H24 isomers, <a href="https://en.wikipedia.org/wiki/Osimertinib">Osimertinib</a> MPO) using the GuacaMol training set (1.27M molecules from ChEMBL24).</p>
<p><strong>Experiment 4</strong>: Benchmark of AHC against REINFORCE, REINVENT (v1 and v2), BAR, and Hill-Climb (with and without KL regularization) on six tasks of varying difficulty:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Difficulty</th>
          <th>Objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heavy atoms</td>
          <td>Easy</td>
          <td>Maximize number of heavy atoms</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Risperidone">Risperidone</a> similarity</td>
          <td>Easy</td>
          <td>Maximize Tanimoto similarity to Risperidone</td>
      </tr>
      <tr>
          <td>DRD2 activity</td>
          <td>Medium</td>
          <td>Maximize QSAR-predicted DRD2 activity</td>
      </tr>
      <tr>
          <td>DRD2 docking</td>
          <td>Medium</td>
          <td>Minimize Glide-SP docking score</td>
      </tr>
      <tr>
          <td>DRD2-DRD3 dual</td>
          <td>Hard</td>
          <td>Maximize predicted activity against both targets</td>
      </tr>
      <tr>
          <td>DRD2/DRD3 selective</td>
          <td>Hard</td>
          <td>Maximize selective DRD2 activity over DRD3</td>
      </tr>
  </tbody>
</table>
<p><strong>Experiment 5</strong>: AHC vs. REINVENT on transformer (Tr) and gated transformer (GTr) architectures on the same six benchmark tasks. The GTr implements a GRU-style gate in place of residual connections to stabilize RL training.</p>
<h3 id="rnn-and-transformer-architectures">RNN and Transformer Architectures</h3>
<p>Three RNN configurations were used: (1) embedding 128 + 3 GRU layers of 512 (REINVENT v1), (2) embedding 256 + 3 LSTM layers of 512 (REINVENT 2.0), (3) 3 LSTM layers of 512 with dropout 0.2 (GuacaMol). Transformers used 4 encoder layers with hidden dimension 512, 8 attention heads, and feed-forward dimension 1024.</p>
<p>QSAR models for DRD2 and DRD3 activity were random forest classifiers trained on ExCAPE-DB data with GHOST threshold identification for handling class imbalance.</p>
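<p>The GHOST idea is to shift the classifier&rsquo;s decision threshold, rather than resample, to compensate for class imbalance. A bare-bones version that scans candidate cutoffs and keeps the one maximizing Cohen&rsquo;s kappa might look like this (a sketch of the general technique, not the GHOST package API):</p>

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels: agreement corrected for chance."""
    n = len(y_true)
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n   # observed agreement
    p1t = sum(y_true) / n
    p1p = sum(y_pred) / n
    pe = p1t * p1p + (1 - p1t) * (1 - p1p)                 # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 0.0

def ghost_threshold(y_true, y_prob, grid=None):
    """Pick the probability cutoff maximizing kappa on held-out predictions."""
    grid = grid or [i / 20 for i in range(1, 20)]          # 0.05 .. 0.95
    return max(grid, key=lambda t: cohen_kappa(y_true, [int(p >= t) for p in y_prob]))
```

<p>With a heavily imbalanced active/inactive split like the ExCAPE-DB sets above, the selected cutoff typically lands well away from the default 0.5.</p>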
<h2 id="key-findings-45-fold-sample-efficiency-improvement">Key Findings: 45-Fold Sample Efficiency Improvement</h2>
<h3 id="experiment-1-ahc-consistently-outperforms-reinvent">Experiment 1: AHC Consistently Outperforms REINVENT</h3>
<p>AHC improved optimization ability by 1.39-fold over REINVENT averaged across all $\sigma$ values, with maximum optimization of 205% at $\sigma = 240$ (compared to 128% for REINVENT). AHC required ~80 fewer RL steps to match REINVENT&rsquo;s mean docking score at 100 steps. With DF1 applied, the improvement was 1.45-fold.</p>
<p>AHC showed greater sensitivity to $\sigma$, giving practitioners more control over the regularization-optimization trade-off. At $\sigma = 60$ (heavily regularized), AHC still improved 1.47-fold over REINVENT while maintaining property space defined by the MOSESn training set. At higher $\sigma$ values, AHC extrapolated further outside the training distribution, which can be favorable (novel chemical space) or unfavorable (scoring function exploitation, e.g., larger molecules getting better docking scores due to the additive nature of scoring functions).</p>
<h3 id="experiment-2-improvement-across-four-gpcr-targets">Experiment 2: Improvement Across Four GPCR Targets</h3>
<p>Across DRD2, OPRM1, AGTR1, and OX1R, AHC + DF2 required on average 7.4-fold fewer training steps and 45.5-fold fewer samples to reach optimization thresholds. The improvement was largest early in training: 19.8-fold fewer steps to reach 120% optimization, and 71.8-fold fewer samples to first produce a molecule exceeding 160% optimization.</p>
<p>AHC + DF2 surpassed the 80% retrospective precision threshold within 100 RL updates for all targets except the challenging OX1R. DF2 successfully stabilized learning, avoiding the convergence-to-threshold failure mode observed with DF1.</p>
<p>Scaffold analysis showed AHC generates similar chemistry to REINVENT. The top 500 scaffolds produced by REINVENT were also generated by AHC, but typically much sooner.</p>
<h3 id="experiment-4-benchmark-against-all-rl-strategies">Experiment 4: Benchmark Against All RL Strategies</h3>
<p>AHC outperformed all other RL strategies on all six benchmark tasks except maximizing heavy atoms (an extrapolation task of limited practical relevance). Its advantage was largest during early-stage optimization and on the harder objectives (dual activity, selective activity).</p>
<p>Hill-Climb with a smaller batch size (HC*) showed improved early-stage sample efficiency similar to AHC, but rapidly underwent mode collapse. KL regularization did not rescue mode collapse in any case and sometimes worsened performance. BAR performed poorly in most tasks, possibly because the best-agent memory acts as a second regularizer that inhibits learning.</p>
<p>In terms of wall time for the DRD2 docking task, AHC reached 140% optimization in 16 CPU hours vs. 202 CPU hours for <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 2.0</a>. AHC was the only strategy to reach 200% optimization within the allotted time (216 CPU hours). Parallelized over 10 CPUs, this corresponds to ~21.6 hours, making docking-guided generation feasible on local machines.</p>
<h3 id="experiment-5-generalization-to-transformers">Experiment 5: Generalization to Transformers</h3>
<p>AHC outperformed REINVENT on both the standard transformer and the gated transformer architectures. The standard transformer was unstable under RL, readily undergoing mode collapse. The gated transformer (with GRU-style gating replacing residual connections) stabilized RL training. AHC&rsquo;s efficiency gains generalized to both architectures.</p>
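<p>One plausible form of such a GRU-style gate, merging a sublayer output back into its input in place of the residual <code>x + sublayer(x)</code>, is sketched below (weight shapes, initialization, and the exact gate parameterization are illustrative assumptions, loosely following GRU-style gating, and may differ from the paper&rsquo;s implementation):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_residual(x, y, Wr, Ur, Wz, Uz, Wg, Ug, bg=2.0):
    """GRU-style gate merging a sublayer output y back into its input x.

    With the update-gate bias bg > 0, z starts near 0, so the layer
    begins close to the identity map on x -- the property credited with
    stabilizing RL fine-tuning of the transformer.
    """
    r = sigmoid(x @ Wr + y @ Ur)          # reset gate
    z = sigmoid(x @ Wz + y @ Uz - bg)     # update gate, biased toward keeping x
    h = np.tanh((r * x) @ Wg + y @ Ug)    # candidate state
    return (1.0 - z) * x + z * h          # convex blend of input and candidate
```

<p>As <code>bg</code> grows, the layer approaches a pure skip connection, which is why a gated block can start training as an identity and only gradually let the sublayer contribute.</p>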
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Chemistry quality evaluation is complicated by the interaction between RL strategy and scoring function suitability: stronger optimization can yield unreasonable chemistry because the scoring function is exploited, not because of the RL strategy itself.</li>
<li>The diversity filter hyperparameter search was conducted on GuacaMol toy tasks, which may not fully transfer to docking-based objectives.</li>
<li>The docking scoring function was system-dependent: DRD2 and OPRM1 were optimized effectively, while AGTR1 and OX1R proved more challenging (especially AGTR1, where the docking algorithm targeted the wrong sub-pocket).</li>
<li>KL regularization proved ineffective for HC and REINFORCE, suggesting it is not a sufficient regularization method in this context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN pretraining</td>
          <td>MOSESn (MOSES neutralized)</td>
          <td>2,454,087 molecules</td>
          <td>ZINC15 clean leads with neutralized charges</td>
      </tr>
      <tr>
          <td>RNN pretraining</td>
          <td>GuacaMol train</td>
          <td>1,273,104 molecules</td>
          <td>ChEMBL24 with property filters</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD2)</td>
          <td>4,609 actives / 343,026 inactives</td>
          <td>Random forest with GHOST thresholds</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD3)</td>
          <td>2,758 actives / 402,524 inactives</td>
          <td>Unique subsets for dual/selective tasks</td>
      </tr>
      <tr>
          <td>DF parameter search</td>
          <td>GuacaMol benchmark tasks</td>
          <td>3 tasks</td>
          <td>825 configurations tested</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>AHC</strong>: REINVENT loss computed on top-k molecules per batch, ranked by reward</li>
<li><strong>Baselines</strong>: REINFORCE, REINVENT (v1, v2), BAR, Hill-Climb, Hill-Climb + KL regularization</li>
<li><strong>Hyperparameters</strong>: Default values from each original publication (listed in Supplementary Table S3)</li>
<li><strong>Docking</strong>: Glide-SP with Schrodinger Protein Preparation Wizard, LigPrep for ligand preparation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RNNs</strong>: 3 configurations (GRU/LSTM, 512 hidden units, trained 5-10 epochs)</li>
<li><strong>Transformer</strong>: 4 encoder layers, 512 hidden dim, 8 heads, 1024 FFN dim</li>
<li><strong>Gated Transformer</strong>: Same architecture with GRU-style gating replacing residual connections</li>
<li><strong>QSAR</strong>: Random forest classifiers (100 estimators, max depth 15, min leaf 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AHC + DF2</th>
          <th>REINVENT</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization fold-improvement</td>
          <td>1.45x</td>
          <td>baseline</td>
          <td>DRD2 docking, averaged across sigma values</td>
      </tr>
      <tr>
          <td>Sample efficiency</td>
          <td>45.5x fewer samples</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>Step efficiency</td>
          <td>7.4x fewer steps</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>CPU hours to 140% (DRD2 docking)</td>
          <td>16h</td>
          <td>202h (REINVENT 2.0)</td>
          <td>AMD Threadripper 1920 + RTX 2060 Super</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>AMD Threadripper 1920 CPU</li>
<li>Nvidia GeForce RTX 2060 Super GPU</li>
<li>DRD2 docking benchmark: 216 CPU hours for AHC to reach 200% optimization (~21.6h parallelized over 10 CPUs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/SMILES-RNN">SMILES-RNN</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>RNN and transformer generative model code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/">Scoring function platform</a></td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.6084/m9.figshare.19591024.v1">Figshare datasets</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Supporting data (published under same license as paper)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. <em>Journal of Cheminformatics</em>, 14, 68. <a href="https://doi.org/10.1186/s13321-022-00646-z">https://doi.org/10.1186/s13321-022-00646-z</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2022augmented,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00646-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>