<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>RL-Tuned Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/</link><description>Recent content in RL-Tuned Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 28 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/index.xml" rel="self" type="application/rss+xml"/><item><title>REINVENT: Reinforcement Learning for Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</guid><description>REINVENT uses augmented episodic likelihood to fine-tune a SMILES-based RNN via reinforcement learning for goal-directed molecular generation.</description><content:encoded><![CDATA[<h2 id="augmented-episodic-likelihood-for-goal-directed-generation">Augmented Episodic Likelihood for Goal-Directed Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces REINVENT, a policy-based reinforcement learning framework for molecular de novo design. The primary contribution is a novel cost function, the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented episodic likelihood</a>, that fine-tunes a <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based recurrent neural network (RNN) pre-trained on ChEMBL toward generating molecules satisfying user-defined property objectives. The method anchors the agent to the prior distribution of valid drug-like molecules, addressing failure modes of standard REINFORCE algorithms (reward exploitation and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">mode collapse</a> to trivially simple structures).</p>
<h2 id="de-novo-design-needs-flexible-data-driven-approaches">De Novo Design Needs Flexible, Data-Driven Approaches</h2>
<p>Traditional de novo design methods fall into three categories, each with limitations:</p>
<ol>
<li><strong>Structure-based approaches</strong> grow ligands to fit binding pockets but often produce molecules with poor DMPK profiles and synthetic intractability.</li>
<li><strong>Ligand-based virtual library</strong> approaches generate large libraries and score them, but are constrained by pre-defined reaction rules or transformation rules that limit chemical diversity.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Inverse QSAR</a></strong> methods attempt to map favorable activity regions back to molecular structures, but require descriptors suitable for both forward prediction and inverse mapping.</li>
</ol>
<p>RNN-based generative models trained on SMILES offer a data-driven alternative that can learn the underlying distribution of drug-like chemical space without rigid rules. Segler et al. (2017) showed that fine-tuning a pre-trained RNN on focused actives yields high fractions of predicted actives. However, this maximum likelihood fine-tuning cannot use negative or continuous scores and risks catastrophic forgetting.</p>
<p>Prior RL approaches had significant issues. Jaques et al. (2016) used Deep Q-learning with prior likelihood regularization for sequence generation, but reported dependence on hand-written rules to penalize undesirable sequences and still observed reward exploitation producing unrealistically simple molecules. Standard REINFORCE algorithms tend to converge on trivial solutions (e.g., generating only &ldquo;C&rdquo; to satisfy a scoring function).</p>
<h2 id="the-augmented-episodic-likelihood-framework">The Augmented Episodic Likelihood Framework</h2>
<p>The core innovation is a formulation where the agent learns a policy that minimizes the squared difference between its own log-likelihood and an augmented target likelihood.</p>
<p>The RNN is first pre-trained on 1.5 million canonical SMILES from ChEMBL via maximum likelihood estimation:</p>
<p>$$
J(\Theta) = -\sum_{t=1}^{T} \log P(x^{t} \mid x^{t-1}, \dots, x^{1})
$$</p>
<p>The pre-trained model (the Prior) is then used as the starting point for the Agent. For a generated SMILES sequence $A = a_1, a_2, \dots, a_T$, the model likelihood is $P(A) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$, and a scoring function $S(A) \in [-1, 1]$ rates desirability.</p>
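<p>The factorized sequence likelihood can be sketched in a few lines of plain Python; the token distributions here are illustrative stand-ins for the RNN&rsquo;s softmax outputs at each step:</p>

```python
import math

def sequence_log_likelihood(steps):
    """log P(A) = sum_t log pi(a_t | s_t).

    `steps` is a list of (chosen_token, distribution) pairs, where each
    distribution maps tokens to the policy's probabilities at that step --
    stand-ins for the RNN's per-step softmax outputs.
    """
    return sum(math.log(dist[tok]) for tok, dist in steps)
```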
<p>The augmented likelihood combines prior likelihood with the score:</p>
<p>$$
\log P(A)_{\mathbb{U}} = \log P(A)_{Prior} + \sigma S(A)
$$</p>
<p>where $\sigma$ is a scalar coefficient controlling the trade-off between prior fidelity and score optimization.</p>
<p>The return is defined as the negative squared difference between the augmented likelihood and the agent&rsquo;s likelihood:</p>
<p>$$
G(A) = -\left[\log P(A)_{\mathbb{U}} - \log P(A)_{\mathbb{A}}\right]^{2}
$$</p>
<p>The agent minimizes $J(\Theta) = -G$, effectively learning a policy whose sequence likelihoods match the prior modulated by the scoring function. The authors show in supplementary material that this is equivalent to a REINFORCE algorithm with a specific final-step reward formulation.</p>
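<p>The augmented likelihood and the squared-difference loss are simple to sketch; in practice both log-likelihoods come from summing per-token log-probabilities under the Prior and Agent RNNs:</p>

```python
def augmented_log_likelihood(log_p_prior, score, sigma):
    """log P(A)_U = log P(A)_Prior + sigma * S(A)."""
    return log_p_prior + sigma * score

def reinvent_loss(log_p_prior, log_p_agent, score, sigma):
    """Negative return -G(A): squared gap between the augmented target
    likelihood and the Agent's own sequence log-likelihood."""
    target = augmented_log_likelihood(log_p_prior, score, sigma)
    return (target - log_p_agent) ** 2
```

With $\sigma = 15$ (the setting reported for the Celecoxib experiment), a perfectly scored sequence ($S = 1$) shifts the target 15 log-units above the prior likelihood.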
<p>This design has three key advantages over standard REINFORCE:</p>
<ul>
<li>The target policy is explicitly stochastic, preserving diversity in generated molecules</li>
<li>The prior anchoring prevents catastrophic forgetting of SMILES syntax and chemical space coverage</li>
<li>No hand-written rules are needed to penalize degenerate solutions</li>
</ul>
<p>The Agent is trained on-policy with batches of 128 generated sequences, using SGD with learning rate 0.0005 and gradient clipping to $[-3, 3]$.</p>
<h2 id="three-experiments-sulphur-avoidance-celecoxib-analogues-and-drd2-activity">Three Experiments: Sulphur Avoidance, Celecoxib Analogues, and DRD2 Activity</h2>
<h3 id="prior-network-architecture">Prior Network Architecture</h3>
<p>The Prior is a 3-layer RNN with 1024 Gated Recurrent Units per layer, trained on RDKit canonical SMILES from ChEMBL (molecules with 10-50 heavy atoms, elements from $\{H, B, C, N, O, F, Si, P, S, Cl, Br, I\}$). Training used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) for 50,000 steps with batch size 128, an initial learning rate of 0.001, and a learning-rate decay of 0.02 every 100 steps. The Prior generates 94% valid SMILES, of which 90% are novel.</p>
<h3 id="experiment-1-learning-to-avoid-sulphur">Experiment 1: Learning to Avoid Sulphur</h3>
<p>A proof-of-principle task where the scoring function assigns $S(A) = 1$ for valid sulphur-free molecules, $S(A) = 0$ for invalid SMILES, and $S(A) = -1$ for sulphur-containing molecules.</p>
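<p>A toy version of this scoring function, with the validity check left as a caller-supplied predicate (RDKit parsing in the paper). The single-character sulphur test is a simplification; it would, for example, also flag <code>[Si]</code>, which a real implementation should parse atom-by-atom:</p>

```python
def sulphur_score(smiles, is_valid):
    """Experiment 1 scoring: +1 for valid sulphur-free molecules,
    0 for invalid SMILES, -1 for sulphur-containing molecules.

    `is_valid` is a caller-supplied validity predicate (e.g. RDKit parsing).
    Note: the crude 'S'/'s' character check would also flag [Si].
    """
    if not is_valid(smiles):
        return 0
    return -1 if ("S" in smiles or "s" in smiles) else 1
```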
<p>The Agent method was compared against three alternatives:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Fraction No S</th>
          <th>Avg MW</th>
          <th>Avg cLogP</th>
          <th>Avg RotBonds</th>
          <th>Avg AromRings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior</td>
          <td>0.94</td>
          <td>0.66</td>
          <td>371</td>
          <td>3.36</td>
          <td>5.39</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Agent</td>
          <td>0.95</td>
          <td>0.98</td>
          <td>367</td>
          <td>3.37</td>
          <td>5.41</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Action basis</td>
          <td>0.95</td>
          <td>0.92</td>
          <td>372</td>
          <td>3.39</td>
          <td>6.08</td>
          <td>2.09</td>
      </tr>
      <tr>
          <td>REINFORCE</td>
          <td>0.98</td>
          <td>0.98</td>
          <td>585</td>
          <td>11.3</td>
          <td>30.0</td>
          <td>0.57</td>
      </tr>
      <tr>
          <td>REINFORCE + Prior</td>
          <td>0.98</td>
          <td>0.92</td>
          <td>232</td>
          <td>3.05</td>
          <td>2.8</td>
          <td>2.11</td>
      </tr>
  </tbody>
</table>
<p>Standard REINFORCE exploited the reward by generating sequences of predominantly &ldquo;C&rdquo; (average MW 585, cLogP 11.3). REINFORCE + Prior avoided this but collapsed to small, simplistic structures (MW 232). The Agent achieved 98% sulphur-free structures while maintaining molecular properties nearly identical to the Prior, demonstrating that augmented episodic likelihood preserves the prior distribution.</p>
<h3 id="experiment-2-similarity-guided-generation-celecoxib-analogues">Experiment 2: Similarity-Guided Generation (Celecoxib Analogues)</h3>
<p>The scoring function uses <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> on FCFP4 fingerprints:</p>
<p>$$
S(A) = -1 + 2 \times \frac{\min(J_{i,j}, k)}{k}
$$</p>
<p>where $k$ caps the rewarded similarity. With $k = 1$ and $\sigma = 15$, the Agent recovers <a href="https://en.wikipedia.org/wiki/Celecoxib">Celecoxib</a> itself within 200 training steps. Even when all structures with $J &gt; 0.5$ to Celecoxib (1,804 molecules) were removed from the Prior training set, the Agent still found Celecoxib after 400 steps, despite a 700-fold reduction in prior likelihood ($\log_e P$ from $-12.7$ to $-19.2$).</p>
<p>With moderate similarity targets ($k = 0.7$, $\sigma = 12$), the Agent generates diverse analogues including scaffold hops where functional groups are rearranged.</p>
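<p>The capped-similarity score maps a Jaccard similarity $J_{i,j} \in [0, 1]$ onto $[-1, 1]$, saturating at the cap $k$; a direct transcription:</p>

```python
def similarity_score(jaccard, k):
    """S(A) = -1 + 2 * min(J, k) / k: reaches +1 once similarity hits the cap k,
    so the agent is not pushed to reproduce the query exactly when k < 1."""
    return -1.0 + 2.0 * min(jaccard, k) / k
```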
<h3 id="experiment-3-target-activity-drd2">Experiment 3: Target Activity (DRD2)</h3>
<p>The most drug-discovery-relevant task: generating molecules predicted active against the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2 (DRD2)</a>. An SVM classifier (Gaussian kernel, $C = 2^7$, $\gamma = 2^{-6}$) was trained on bioactivity data from ExCAPE-DB (7,218 actives with pIC50 &gt; 5, 100,000 sampled inactives). The actives were split by Butina clustering (ECFP6, cutoff 0.4) to decrease nearest-neighbor similarity between train and test sets.</p>
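<p>The reported SVM hyperparameters translate directly to scikit-learn; a sketch only, since the actual model was trained on fingerprints of the ExCAPE-DB compounds, which are not reproduced here:</p>

```python
from sklearn.svm import SVC

def make_drd2_classifier():
    """Activity model with the paper's reported settings:
    Gaussian (RBF) kernel, C = 2^7, gamma = 2^-6.
    Inputs would be molecular fingerprint vectors in practice."""
    return SVC(kernel="rbf", C=2**7, gamma=2**-6, probability=True)
```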
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Prior</th>
          <th>Agent</th>
          <th>Prior (reduced)</th>
          <th>Agent (reduced)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid SMILES</td>
          <td>0.94</td>
          <td>0.99</td>
          <td>0.94</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Fraction predicted actives</td>
          <td>0.03</td>
          <td>0.97</td>
          <td>0.02</td>
          <td>0.96</td>
      </tr>
      <tr>
          <td>Fraction similar to train active</td>
          <td>0.02</td>
          <td>0.79</td>
          <td>0.02</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>Fraction similar to test active</td>
          <td>0.01</td>
          <td>0.46</td>
          <td>0.01</td>
          <td>0.38</td>
      </tr>
      <tr>
<td>Test actives recovered (&times;10<sup>-3</sup>)</td>
          <td>13.5</td>
          <td>126</td>
          <td>2.85</td>
          <td>72.6</td>
      </tr>
  </tbody>
</table>
<p>The Agent increased the fraction of predicted actives from 2-3% (Prior) to 96-97%, representing a 250-fold enrichment in the probability of generating a test set active. The Agent based on the reduced Prior (DRD2 actives removed from ChEMBL) still recovered 7% of test actives, meaning it generated experimentally confirmed actives that appeared in neither the generative model nor the activity prediction model training data.</p>
<h2 id="anchored-policy-learning-prevents-reward-exploitation">Anchored Policy Learning Prevents Reward Exploitation</h2>
<p>The key finding is that augmented episodic likelihood successfully balances score optimization with prior distribution preservation. The Agent achieves task objectives (sulphur avoidance, similarity targets, activity prediction) while maintaining the molecular property distributions learned from ChEMBL. This is a significant improvement over standard REINFORCE, which either exploits rewards trivially or collapses to simple structures.</p>
<p>Analysis of the conditional probability distributions between the Prior and Agent (for DRD2 active generation) shows that the policy changes are not drastic: most trends learned by the Prior carry over, with targeted modifications at specific steps that substantially alter sequence likelihoods and generated structure types.</p>
<p>Limitations acknowledged by the authors:</p>
<ul>
<li>All experiments use single-parameter scoring functions; multi-parametric optimization (activity + DMPK + synthetic accessibility) is left for future work</li>
<li>The quality of generated structures depends heavily on the Prior&rsquo;s coverage of chemical space</li>
<li>The activity model (SVM) has limited domain of applicability, and structures outside this domain may be falsely scored</li>
<li>No exhaustive study of how Prior training set size, model size, and regularization affect generation quality</li>
</ul>
<p>Future directions include multi-parametric scoring functions, exploration of token embeddings, and adversarial training where the scoring function is replaced by a discriminator network (GAN-style training).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>1.5M structures</td>
          <td>10-50 heavy atoms, filtered elements</td>
      </tr>
      <tr>
          <td>DRD2 activity model</td>
          <td>ExCAPE-DB</td>
          <td>7,218 actives + 100K inactives</td>
          <td>Butina clustering split (ECFP6, cutoff 0.4)</td>
      </tr>
      <tr>
          <td>Similarity target</td>
          <td>Celecoxib</td>
          <td>1 query structure</td>
          <td>FCFP4 fingerprints for Jaccard similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prior</strong>: 3-layer GRU RNN (1024 units/layer), Adam optimizer, 50K steps, batch size 128, LR 0.001 with 0.02 decay/100 steps</li>
<li><strong>Agent</strong>: Same architecture, SGD with LR 0.0005, gradient clipping [-3, 3], on-policy batches of 128</li>
<li><strong>DRD2 model</strong>: SVM with Gaussian kernel ($C = 2^7$, $\gamma = 2^{-6}$), grid search on validation set</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MarcusOlivecrona/REINVENT">REINVENT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation in TensorFlow/Python 2.7</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.572576">Archived version</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Zenodo archive (DOI: 10.5281/zenodo.572576)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>SMILES validity rate (RDKit parsing)</li>
<li>Fraction of structures satisfying scoring function</li>
<li>Molecular property distributions (MW, cLogP, rotatable bonds, aromatic rings)</li>
<li>Jaccard similarity on ECFP6/FCFP4 fingerprints</li>
<li>Recovery rate of known actives from test set</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. The implementation uses TensorFlow 1.0.1 with Python 2.7, RDKit, and Scikit-learn.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Olivecrona, M., Blaschke, T., Engkvist, O., &amp; Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. <em>Journal of Cheminformatics</em>, 9(1), 48.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{olivecrona2017molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular de-novo design through deep reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Olivecrona, Marcus and Blaschke, Thomas and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{48}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-017-0235-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ORGAN: Objective-Reinforced GANs for Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</guid><description>ORGAN combines GANs with reinforcement learning to steer SMILES-based molecular generation toward drug-likeness, solubility, and synthesizability objectives.</description><content:encoded><![CDATA[<h2 id="combining-gans-and-reinforcement-learning-for-goal-directed-sequence-generation">Combining GANs and Reinforcement Learning for Goal-Directed Sequence Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces ORGAN (Objective-Reinforced Generative Adversarial Network), a framework for generating sequences that are both realistic (close to the training distribution) and optimized for domain-specific objectives. ORGAN extends SeqGAN by adding external reward functions to the reinforcement learning signal, with a tunable parameter $\lambda$ controlling the balance between adversarial (discriminator) and objective-based rewards. The authors demonstrate ORGAN on two domains: molecular generation using <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings (optimizing druglikeness, solubility, and synthesizability) and musical melody generation (optimizing tonality and step ratios).</p>
<h2 id="exposure-bias-and-mode-collapse-in-discrete-sequence-generation">Exposure Bias and Mode Collapse in Discrete Sequence Generation</h2>
<p>Generating discrete sequences with desirable properties presents two intertwined challenges. First, RNNs trained via maximum likelihood estimation (MLE) suffer from exposure bias, where the model sees only ground-truth prefixes during training but must condition on its own (potentially erroneous) outputs at generation time. Second, while <a href="/posts/what-is-a-gan/">GANs</a> can address some of these issues through adversarial training, they were not initially applicable to discrete data due to non-differentiability of the sampling step. SeqGAN resolved this by framing the generator as an RL agent, but it optimizes only for distributional fidelity (fooling the discriminator) without any mechanism to steer generation toward specific property targets.</p>
<p>In drug discovery, simply generating valid, drug-like molecules is insufficient. Practitioners need to optimize for particular pharmaceutical properties (e.g., solubility, synthesizability, druglikeness) while maintaining structural diversity. Naive RL approaches can optimize properties effectively but tend to collapse onto trivial solutions (e.g., repeating &ldquo;CCCCCCC&rdquo; to maximize solubility). The challenge is to combine the distributional regularization of adversarial training with the goal-directedness of RL.</p>
<h2 id="mixed-reward-interpolating-between-adversarial-and-objective-signals">Mixed Reward: Interpolating Between Adversarial and Objective Signals</h2>
<p>ORGAN&rsquo;s core innovation is a reward function that linearly interpolates between the discriminator score and domain-specific objectives:</p>
<p>$$R(Y_{1:T}) = \lambda \cdot D_{\phi}(Y_{1:T}) + (1 - \lambda) \cdot O_{i}(Y_{1:T})$$</p>
<p>When $\lambda = 1$, the model reduces to SeqGAN (pure adversarial training). When $\lambda = 0$, it becomes naive RL optimizing only the objective. Intermediate values allow the adversarial component to regularize the generator, keeping samples within the distribution while the objective component steers toward desired properties.</p>
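<p>The mixed reward is a single convex combination; a minimal sketch:</p>

```python
def organ_reward(d_score, objective, lam):
    """R(Y) = lambda * D(Y) + (1 - lambda) * O(Y).

    lam = 1.0 recovers SeqGAN (pure adversarial reward);
    lam = 0.0 recovers naive RL on the objective alone.
    """
    assert 0.0 <= lam <= 1.0
    return lam * d_score + (1.0 - lam) * objective
```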
<p>The generator $G_{\theta}$ is an LSTM-based RNN that produces sequences token-by-token. Training follows the REINFORCE algorithm, where the expected long-term reward is:</p>
<p>$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_{0}, \theta\right] = \sum_{y_{1} \in Y} G_{\theta}(y_{1} \mid s_{0}) \cdot Q(s_{0}, y_{1})$$</p>
<p>For intermediate timesteps (partial sequences), the action-value function $Q$ is estimated via $N$-time Monte Carlo rollouts:</p>
<p>$$Q(Y_{1:t-1}, y_{t}) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} R(Y_{1:T}^{n}), &amp; \text{if } t &lt; T \\ R(Y_{1:T}), &amp; \text{if } t = T \end{cases}$$</p>
<p>where $Y_{1:T}^{n}$ are completions sampled by rolling out the current policy $G_{\theta}$ from state $Y_{1:t}$.</p>
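<p>A sketch of the Monte Carlo Q-estimate, with the policy abstracted as a caller-supplied <code>rollout</code> function that samples the next token (function names are illustrative):</p>

```python
def mc_q_estimate(prefix, rollout, reward, T, n_rollouts=16):
    """Estimate Q for a partial sequence by completing it n_rollouts times
    under the current policy and averaging terminal rewards; at the final
    step (t = T) the reward is used directly."""
    if len(prefix) == T:
        return reward(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < T:
            seq.append(rollout(seq))  # sample next token from the policy
        total += reward(seq)
    return total / n_rollouts
```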
<p>The policy gradient is:</p>
<p>$$\nabla_{\theta} J(\theta) \simeq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_{t} \sim G_{\theta}(y_{t} \mid Y_{1:t-1})} \left[\nabla_{\theta} \log G_{\theta}(y_{t} \mid Y_{1:t-1}) \cdot Q(Y_{1:t-1}, y_{t})\right]$$</p>
<p>Two additional mechanisms improve training:</p>
<ol>
<li><strong>Diversity penalty</strong>: Repeated sequences have their reward divided by their copy count, providing diminishing returns for non-unique outputs.</li>
<li><strong>Wasserstein distance</strong>: The authors also implement a variant (OR(W)GAN) that replaces the standard GAN discriminator loss with the Wasserstein-1 distance via Kantorovich-Rubinstein duality, which can improve training stability and diversity.</li>
</ol>
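<p>The diversity penalty is straightforward to sketch: each sequence&rsquo;s reward within a batch is divided by its copy count:</p>

```python
from collections import Counter

def penalized_rewards(sequences, base_reward):
    """Divide each sequence's reward by its number of copies in the batch,
    giving diminishing returns for non-unique outputs."""
    counts = Counter(sequences)
    return [base_reward(s) / counts[s] for s in sequences]
```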
<h2 id="molecular-and-musical-melody-generation-experiments">Molecular and Musical Melody Generation Experiments</h2>
<h3 id="architecture">Architecture</h3>
<p>The generator $G_{\theta}$ is an RNN with LSTM cells. The discriminator $D_{\phi}$ is a CNN for text classification following Kim (2014), with 75% dropout and L2 regularization. All optimization uses Adam. Molecular metrics are computed with RDKit.</p>
<h3 id="molecular-generation-setup">Molecular Generation Setup</h3>
<p>Training data consists of 5,000 random molecules from the QM9 dataset (134k stable small molecules with up to 9 heavy atoms), encoded as SMILES strings with maximum sequence length 51 and alphabet size 43. Each generator is pre-trained for 250 MLE epochs, with the discriminator trained for 10 epochs. Adversarial/RL training then proceeds for up to 100 additional epochs. The default $\lambda$ is 0.5.</p>
<p>Three molecular objectives are evaluated:</p>
<ul>
<li><strong>Solubility (LogP)</strong>: water-octanol partition coefficient via RDKit&rsquo;s Crippen function</li>
<li><strong>Synthesizability</strong>: SA score estimating ease of synthesis (0 = hard, 1 = easy)</li>
<li><strong>Druglikeness</strong>: QED score capturing medicinal chemistry aesthetics</li>
</ul>
<p>Diversity is measured using average Jaccard distance of molecular fingerprints relative to a random training subset.</p>
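<p>Treating fingerprints as bit sets, the diversity metric can be sketched with plain set operations; in practice RDKit&rsquo;s fingerprints would supply the sets:</p>

```python
def jaccard_distance(fp_a, fp_b):
    """1 - |A n B| / |A u B| on fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / len(fp_a | fp_b)

def avg_diversity(gen_fps, ref_fps):
    """Mean Jaccard distance of each generated fingerprint to a
    random reference subset of the training data."""
    pairs = [(g, r) for g in gen_fps for r in ref_fps]
    return sum(jaccard_distance(g, r) for g, r in pairs) / len(pairs)
```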
<h3 id="molecular-generation-results">Molecular Generation Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Validity (%)</th>
          <th>Diversity</th>
          <th>Druglikeness</th>
          <th>Synthesizability</th>
          <th>Solubility</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>75.9</td>
          <td>0.64</td>
          <td>0.48 (0%)</td>
          <td>0.23 (0%)</td>
          <td>0.30 (0%)</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>80.3</td>
          <td>0.61</td>
          <td>0.49 (+2%)</td>
          <td>0.25 (+6%)</td>
          <td>0.31 (+3%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>ORGAN</td>
          <td>88.2</td>
          <td>0.55</td>
          <td>0.52 (+8%)</td>
          <td>0.32 (+38%)</td>
          <td>0.35 (+18%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>OR(W)GAN</td>
          <td>85.0</td>
          <td>0.95</td>
          <td>0.60 (+25%)</td>
          <td>0.54 (+130%)</td>
          <td>0.47 (+57%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Naive RL</td>
          <td>97.1</td>
          <td>0.80</td>
          <td>0.57 (+19%)</td>
          <td>0.53 (+126%)</td>
          <td>0.50 (+67%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>ORGAN</td>
          <td>96.5</td>
          <td>0.92</td>
          <td>0.51 (+6%)</td>
          <td>0.83 (+255%)</td>
          <td>0.45 (+52%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>OR(W)GAN</td>
          <td>97.6</td>
          <td>1.00</td>
          <td>0.20 (-59%)</td>
          <td>0.75 (+223%)</td>
          <td>0.84 (+184%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>ORGAN</td>
          <td>94.7</td>
          <td>0.76</td>
          <td>0.50 (+4%)</td>
          <td>0.63 (+171%)</td>
          <td>0.55 (+85%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>OR(W)GAN</td>
          <td>94.1</td>
          <td>0.90</td>
          <td>0.42 (-12%)</td>
          <td>0.66 (+185%)</td>
          <td>0.54 (+81%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>Naive RL</td>
          <td>92.7</td>
          <td>0.75</td>
          <td>0.49 (+3%)</td>
          <td>0.70 (+200%)</td>
          <td>0.78 (+162%)</td>
      </tr>
      <tr>
          <td>All (alternated)</td>
          <td>ORGAN</td>
          <td>96.1</td>
          <td>0.92</td>
          <td>0.52 (+9%)</td>
          <td>0.71 (+206%)</td>
          <td>0.53 (+79%)</td>
      </tr>
  </tbody>
</table>
<p>Key observations: OR(W)GAN consistently achieves higher diversity than standard ORGAN. Naive RL often achieves higher raw objective scores but at the cost of generating trivial solutions (e.g., simple atom chains for solubility). The Wasserstein variant provides better diversity properties. Multi-objective training via alternating objectives across epochs achieves gains comparable to individually optimized models.</p>
<h3 id="music-generation-setup">Music Generation Setup</h3>
<p>Using 1,000 melodies from the EsAC folk dataset, each encoded as 36-token sequences where tokens represent sixteenth-note events across three octaves (C3-B5). Two metrics are optimized: tonality (proportion of perfect fifths) and ratio of steps (conjunct melodic motion). Diversity is measured as average pairwise edit distance.</p>
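<p>The ratio-of-steps metric rewards conjunct melodic motion; a simplified sketch over scale degrees (the paper&rsquo;s exact tokenization of sixteenth-note events is not reproduced here):</p>

```python
def ratio_of_steps(degrees):
    """Fraction of consecutive note pairs exactly one scale step apart --
    a simplified stand-in for ORGAN's conjunct-motion objective."""
    if len(degrees) < 2:
        return 0.0
    steps = sum(1 for a, b in zip(degrees, degrees[1:]) if abs(a - b) == 1)
    return steps / (len(degrees) - 1)
```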
<h3 id="music-results">Music Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Diversity</th>
          <th>Tonality</th>
          <th>Ratio of Steps</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>0.221</td>
          <td>0.007</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>0.187</td>
          <td>0.005</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Naive RL</td>
          <td>0.100</td>
          <td>0.478</td>
          <td>2.9E-05</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>ORGAN</td>
          <td>0.268</td>
          <td>0.372</td>
          <td>1.78E-04</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>OR(W)GAN</td>
          <td>0.268</td>
          <td>0.177</td>
          <td>2.4E-04</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Naive RL</td>
          <td>0.321</td>
          <td>0.001</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>ORGAN</td>
          <td>0.433</td>
          <td>0.001</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>OR(W)GAN</td>
          <td>0.134</td>
          <td>5.95E-05</td>
          <td>0.622</td>
      </tr>
  </tbody>
</table>
<p>ORGAN outperforms SeqGAN and MLE on all metrics. Naive RL achieves higher raw scores but with lower diversity, producing simpler, less interesting outputs.</p>
<h2 id="capacity-ceilings-trade-offs-and-future-directions">Capacity Ceilings, Trade-offs, and Future Directions</h2>
<p>The authors identify several limitations and findings:</p>
<p><strong>Capacity ceiling</strong>: GAN-based models tend to generate sequences matching the training set&rsquo;s average length (15.42 characters). RL-only approaches can break this constraint, generating shorter (9.4) or longer (21.3) sequences depending on the objective. The upper bound of optimized properties also matches the training data&rsquo;s maximum, suggesting dataset-dependent limits.</p>
<p><strong>Lambda trade-off</strong>: Varying $\lambda$ reveals an optimal balance between objective optimization and distributional fidelity. This optimum depends on the model, dataset, and metric, suggesting that hyperparameter search over $\lambda$ is important in practice.</p>
<p><strong>Tonality vs. steps inverse relationship</strong>: In the music task, optimizing for tonality (perfect fifths) inherently conflicts with optimizing for step ratios (consecutive notes), since consecutive scale notes do not form perfect fifths.</p>
<p><strong>Limitations</strong>: The paper evaluates on relatively small datasets (5k molecules, 1k melodies) and short sequences. The molecular experiments use QM9 (small molecules with up to 9 heavy atoms), which limits the scope of conclusions for drug-like chemical space. The Wasserstein variant sometimes lags behind the standard GAN loss in raw metric scores, though it offers better diversity.</p>
<p><strong>Future directions</strong>: The authors propose extending ORGAN to non-sequential data (images, audio) by framing GANs as RL problems more broadly, and investigating how different heuristic choices affect performance. They also suggest exploring other discrete GAN formulations (MaliGAN, BGAN) with RL extensions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecular training</td>
          <td>QM9 subset</td>
          <td>5,000 molecules</td>
          <td>Random subset from 134k stable small molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Music training</td>
          <td>EsAC folk dataset</td>
          <td>1,000 melodies</td>
          <td>36-token sequences, processed following Chen et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Generator pre-trained for 250 epochs via MLE; discriminator for 10 epochs</li>
<li>Adversarial/RL training for up to 100 epochs</li>
<li>Default $\lambda = 0.5$ for reward mixing</li>
<li>Monte Carlo rollouts for intermediate reward estimation</li>
<li>Duplicate penalty: reward divided by copy count</li>
</ul>
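<p>The duplicate penalty admits a direct batch-level sketch (an illustrative reimplementation, not the official code):</p>

```python
from collections import Counter

def penalize_duplicates(batch, rewards):
    """Divide each sequence's reward by the number of times it appears
    in the batch, so exact copies share rather than multiply their reward."""
    counts = Counter(batch)
    return [r / counts[seq] for seq, r in zip(batch, rewards)]
```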
<h3 id="models">Models</h3>
<ul>
<li><strong>Generator</strong>: RNN with LSTM cells</li>
<li><strong>Discriminator</strong>: CNN for text classification (Kim, 2014) with 75% dropout, L2 regularization</li>
<li><strong>Optimizer</strong>: Adam for all gradient descent steps</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Domain</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (%)</td>
          <td>Fraction of generated SMILES that decode to valid molecules</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Average Jaccard distance of fingerprints to training subset</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Druglikeness (QED)</td>
          <td>Quantitative Estimate of Drug-likeness</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Synthesizability (SA)</td>
          <td>Synthetic accessibility score</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Solubility (LogP)</td>
          <td>Water-octanol partition coefficient</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Proportion of perfect fifths</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Proportion of conjunct melodic intervals</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Diversity (edit)</td>
          <td>Average pairwise edit distance</td>
          <td>Music</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gablg1/ORGAN">ORGAN</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Official implementation including metrics for molecules and music</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guimaraes, G. L., Sánchez-Lengeling, B., Outeiral, C., Farias, P. L. C., &amp; Aspuru-Guzik, A. (2017). Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. <em>arXiv preprint arXiv:1705.10843</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guimaraes2017organ,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guimaraes, Gabriel Lima and Sanchez-Lengeling, Benjamin and Outeiral, Carlos and Farias, Pedro Luis Cunha and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1705.10843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolecularRNN: Graph-Based Molecular Generation and RL</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</guid><description>MolecularRNN extends GraphRNN with atom and bond type predictions, valency-based rejection sampling, and policy gradient optimization for molecular generation.</description><content:encoded><![CDATA[<h2 id="a-graph-recurrent-model-for-molecular-generation-with-property-optimization">A Graph Recurrent Model for Molecular Generation with Property Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces MolecularRNN, a graph-based recurrent generative model for molecular structures. The model extends GraphRNN to handle typed nodes (atom types) and typed edges (bond types), enabling direct generation of molecular graphs rather than working through string representations like SMILES. Three key contributions are combined: (1) the MolecularRNN architecture for autoregressive graph generation, (2) valency-based rejection sampling for guaranteed 100% validity at inference, and (3) policy gradient reinforcement learning for shifting molecular property distributions toward desired ranges.</p>
<h2 id="why-generate-molecules-as-graphs-rather-than-strings">Why Generate Molecules as Graphs Rather Than Strings</h2>
<p>Computational de novo molecular design aims to create novel molecules with desired properties, a task central to drug discovery. At the time of this work, most deep generative models for molecules operated on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, inheriting the complications of SMILES grammar and the problem that structurally similar molecules can have very different string representations. Graph-based representations are more natural for molecules, with atoms mapping to nodes and bonds to edges, and they allow direct enforcement of chemical constraints during generation.</p>
<p>Existing graph-based methods had their own limitations. Junction tree VAE (JT-VAE) generates molecules from structural fragments, which introduces ambiguity when converting junction trees back to molecules, particularly problematic during property optimization since molecules sharing a junction tree can have very different property values. The GCPN model uses graph convolutional networks with reinforcement learning but was evaluated only on top-3 generated molecules, making it difficult to assess overall distribution quality. Prior atom-level graph generation models like Li et al. (2018a) were restricted to molecules with at most 20 heavy atoms, limiting practical applicability.</p>
<h2 id="core-innovation-extending-graphrnn-with-chemical-constraints-and-rl">Core Innovation: Extending GraphRNN with Chemical Constraints and RL</h2>
<p>MolecularRNN builds on the GraphRNN architecture by introducing atom type predictions alongside edge type predictions. The model generates molecular graphs sequentially: at each step, a NodeRNN predicts the type of the next atom, then an EdgeRNN predicts bond types to all preceding atoms within a BFS-ordered window.</p>
<h3 id="autoregressive-graph-generation">Autoregressive Graph Generation</h3>
<p>The joint likelihood over atom types $C^{\pi}$ and adjacency vectors $S^{\pi}$ under BFS ordering $\pi$ is factorized as:</p>
<p>$$
p\left(S^{\pi}, C^{\pi}\right) = \prod_{i=1}^{n+1} p\left(C_{i}^{\pi} \mid S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right) p\left(S_{i}^{\pi} \mid C_{i}^{\pi}, S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right)
$$</p>
<p>NodeRNN processes embeddings of previous atom types and adjacency vectors to produce a hidden state, from which a two-layer MLP with softmax predicts the next atom type $\psi_{i}$:</p>
<p>$$
h_{i}^{\text{node}} = \text{NodeRNN}\left(h_{i-1}^{\text{node}}, \left[\text{emb}(S_{i-1}^{\pi}), \text{emb}(C_{i-1}^{\pi})\right]\right)
$$</p>
<p>$$
\psi_{i} = \text{NodeMLP}\left(h_{i}^{\text{node}}\right)
$$</p>
<p>EdgeRNN then unrolls across preceding atoms to predict bond types $\phi_{i,j}$, initialized with the NodeRNN hidden state:</p>
<p>$$
h_{i,j}^{\text{edge}} = \text{EdgeRNN}\left(h_{i,j-1}^{\text{edge}}, \text{emb}(S_{i,j-1}^{\pi})\right), \quad h_{i,0}^{\text{edge}} = h_{i}^{\text{node}}
$$</p>
<p>$$
\phi_{i,j} = \text{EdgeMLP}\left(h_{i,j}^{\text{edge}}\right)
$$</p>
<p>Bond types are categorical over {no bond, single, double, triple}, and molecules are represented in kekulized form. BFS ordering limits the EdgeRNN window to $M = 12$ preceding atoms.</p>
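<p>The fixed-size EdgeRNN input implied by the BFS window can be sketched as follows (my own reading of the GraphRNN-style setup, not code from the paper; bond orders are encoded 0 = no bond, 1/2/3 = single/double/triple):</p>

```python
def adjacency_vector(bonds_to_previous, M=12):
    """Fixed-length adjacency vector for the current atom: bond types to
    its M most recent BFS predecessors, left-padded with 'no bond' (0)
    when the atom has fewer than M predecessors."""
    window = bonds_to_previous[-M:]
    return [0] * (M - len(window)) + window
```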
<h3 id="valency-based-rejection-sampling">Valency-Based Rejection Sampling</h3>
<p>During inference, each proposed bond of order $k$ between atoms $i$ and $j$ is accepted only if both atoms remain within their allowed valencies:</p>
<p>$$
\sum_{j} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{i}^{\pi}} \quad \text{and} \quad \sum_{i} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{j}^{\pi}}
$$</p>
<p>Atoms that do not fill their valencies are complemented with hydrogens. This constraint can be enforced directly on graphs (unlike SMILES, where intermediate substrings are not chemically meaningful), yielding 100% valid molecules.</p>
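<p>The rejection test is cheap to state in code. A minimal sketch (the valency table is a simplification; e.g. sulfur also occurs with valence 2 or 4, and hydrogens fill any remainder):</p>

```python
# Simplified maximum valencies for a few heavy-atom types.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "S": 6}

def bond_is_acceptable(atoms, adj, i, j, k):
    """Accept a proposed bond of order k between atoms i and j only if
    neither atom would exceed its maximum valency; `adj` is a symmetric
    matrix of current bond orders."""
    return (sum(adj[i]) + k <= MAX_VALENCE[atoms[i]]
            and sum(adj[j]) + k <= MAX_VALENCE[atoms[j]])
```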
<h3 id="property-optimization-via-policy-gradient">Property Optimization via Policy Gradient</h3>
<p>For property optimization, MolecularRNN is formulated as a policy network in a Markov Decision Process. The loss function uses REINFORCE with a discounted final reward:</p>
<p>$$
L(\theta) = -\sum_{i=1}^{N} r(s_{N}) \cdot \gamma^{i} \cdot \log p(s_{i} \mid s_{i-1}; \theta)
$$</p>
<p>where $r(s_{N})$ is the reward from a property critic and $\gamma$ is a discount factor. The authors also introduce a structural penalty during RL training that assigns a penalty of $-10$ to atoms violating valency constraints, providing a learning signal from invalid intermediate molecules.</p>
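<p>The loss above reduces to a short function once per-step log-probabilities are collected. A minimal sketch, assuming the critic's reward is available only at the terminal state:</p>

```python
import math

def policy_gradient_loss(step_log_probs, final_reward, gamma=0.97):
    """L(theta) = -sum_i r(s_N) * gamma**i * log p(s_i | s_{i-1}),
    with the terminal reward r(s_N) discounted back to each step."""
    return -sum(final_reward * gamma ** i * lp
                for i, lp in enumerate(step_log_probs, start=1))
```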
<h2 id="experimental-setup-pretraining-and-property-optimization">Experimental Setup: Pretraining and Property Optimization</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MolecularRNN is pretrained on three datasets: ChEMBL (~1.5M bioactive molecules), <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC 250k</a> (250K randomly selected commercially available compounds), and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> (~1.9M drug-like molecules from ZINC). The model considers 9 atom types (C, N, O, F, P, S, Cl, Br, I), 3 bond types (single, double, triple), and molecules with 10-50 heavy atoms. Architecture: NodeRNN with 4 GRU layers (hidden size 256), EdgeRNN with 4 GRU layers (hidden size 128), node embedding size 128, edge embedding size 16. Training uses Adam with learning rate 0.001 and multiplicative decay on 4 GPUs with batch size 512 per GPU for 250 epochs.</p>
<h3 id="generation-quality-at-scale">Generation Quality at Scale</h3>
<p>The pretrained model generates 1 million molecules per dataset (far larger than prior work: JT-VAE used 5K samples, Li et al. used 100K). Results with valency-based rejection sampling:</p>
<table>
  <thead>
      <tr>
          <th>Training Set</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>IntDiv (p=1)</th>
          <th>IntDiv (p=2)</th>
          <th>SA Score</th>
          <th>QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>100%</td>
          <td>99.2%</td>
          <td>99.3%</td>
          <td>0.895</td>
          <td>0.890</td>
          <td>3.67 +/- 1.20</td>
          <td>0.56 +/- 0.20</td>
      </tr>
      <tr>
          <td>ZINC 250k</td>
          <td>100%</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>0.892</td>
          <td>0.887</td>
          <td>3.60 +/- 1.01</td>
          <td>0.68 +/- 0.16</td>
      </tr>
      <tr>
          <td>MOSES</td>
          <td>100%</td>
          <td>99.4%</td>
          <td>100%</td>
          <td>0.881</td>
          <td>0.876</td>
          <td>3.24 +/- 0.97</td>
          <td>0.74 +/- 0.14</td>
      </tr>
  </tbody>
</table>
<p>Comparison with baselines on ZINC 250k (30K samples):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>SA Score</th>
          <th>QED</th>
          <th>IntDiv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>100%</td>
          <td>3.37</td>
          <td>0.76</td>
          <td>0.85</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>100%</td>
          <td>99.97%</td>
          <td>100%</td>
          <td>4.62</td>
          <td>0.61</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>100%</td>
          <td>99.89%</td>
          <td>100%</td>
          <td>3.59</td>
          <td>0.68</td>
          <td>0.89</td>
      </tr>
  </tbody>
</table>
<p>GCPN generates overly complex molecules (high SA score of 4.62), while MolecularRNN produces more realistic structures with higher internal diversity than JT-VAE.</p>
<h3 id="property-optimization-results">Property Optimization Results</h3>
<p>Policy gradient optimization is run for 300 iterations with batch size 512 and constant learning rate $10^{-5}$, discount factor $\gamma = 0.97$. Top-3 scores for penalized logP and QED:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>logP 1st</th>
          <th>logP 2nd</th>
          <th>logP 3rd</th>
          <th>QED 1st</th>
          <th>QED 2nd</th>
          <th>QED 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></td>
          <td>3.63</td>
          <td>3.49</td>
          <td>3.44</td>
          <td>0.896</td>
          <td>0.824</td>
          <td>0.820</td>
      </tr>
      <tr>
          <td>JT-VAE</td>
          <td>5.30</td>
          <td>4.93</td>
          <td>4.49</td>
          <td>0.925</td>
          <td>0.911</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>7.98</td>
          <td>7.85</td>
          <td>7.80</td>
          <td>0.948</td>
          <td>0.947</td>
          <td>0.946</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>10.34</td>
          <td>10.19</td>
          <td>10.14</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.947</td>
      </tr>
  </tbody>
</table>
<p>MolecularRNN achieves the highest penalized logP scores (10.34 vs. GCPN&rsquo;s 7.98) while matching GCPN on QED. The authors also demonstrate melting temperature optimization using a GCN-based property predictor as the critic (RMSE 39.5 degrees C), showing that the RL framework generalizes to properties that cannot be computed directly from molecular graphs.</p>
<h2 id="distribution-level-evaluation-and-learned-chemical-patterns">Distribution-Level Evaluation and Learned Chemical Patterns</h2>
<p>The authors emphasize that reporting only top-3 scores is not informative, and they compare full property distributions. MolecularRNN shifts the QED distribution further toward maximum values compared to GCPN. They also note that during melting temperature optimization, the model rediscovered two chemical phenomena: fusing aromatic rings increases melting point, and the presence of polar groups (C=O, OH, NH2, heterocyclic nitrogens) enhances dipole-dipole interactions and raises melting temperature.</p>
<p>Without valency-based rejection sampling, the pretrained model achieves 65% validity. After structural penalty training (assigning -10 to valency-violating atoms and optimizing with policy gradient), validity increases to 90%. Enabling rejection sampling then achieves 100%.</p>
<p>Several limitations are worth noting. The BFS ordering introduces an arbitrary sequencing over equivalent graph traversals (the node order permutation problem is not addressed). The evaluation uses top-3 scores for property optimization, though the authors do advocate for distributional evaluation. The molecule size is capped at 50 heavy atoms. The paper does not report training time or wall-clock generation speed. Future directions mentioned include multi-objective property optimization and scaffold completion (graph completion from a given core structure).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>~1.5M molecules</td>
          <td>Bioactive molecules with experimental measurements</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC 250k</td>
          <td>250K molecules</td>
          <td>Random subset of ZINC database</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>MOSES</td>
          <td>~1.9M molecules</td>
          <td>Drug-like subset of ZINC</td>
      </tr>
      <tr>
          <td>Melting point critic</td>
          <td>Custom split</td>
          <td>37,940 train / 9,458 test</td>
          <td>Melting temperatures from -196 to 517 degrees C</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Maximum likelihood with Adam optimizer, learning rate 0.001 with multiplicative decay to $10^{-5}$, 250 epochs</li>
<li><strong>Structural penalty</strong>: Policy gradient with -10 penalty per valency-violating atom</li>
<li><strong>Property optimization</strong>: REINFORCE (policy gradient), 300 iterations, batch size 512, learning rate $10^{-5}$, discount factor $\gamma = 0.97$</li>
<li><strong>Melting point critic</strong>: GCN regression (4 layers, hidden size 128), Adam with learning rate 0.001, exponential decay $\gamma = 0.8$, 30 epochs, batch size 32</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>NodeRNN</strong>: 4 GRU layers, hidden size 256, node embedding 128</li>
<li><strong>EdgeRNN</strong>: 4 GRU layers, hidden size 128, edge embedding 16</li>
<li><strong>NodeMLP/EdgeMLP</strong>: 2-layer MLP with 128 hidden units, ReLU activation, softmax output</li>
<li><strong>BFS window</strong>: $M = 12$ preceding atoms</li>
<li><strong>Atom types</strong>: 9 (C, N, O, F, P, S, Cl, Br, I)</li>
<li><strong>Bond types</strong>: 3 (single, double, triple) + no bond</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>% chemically valid molecules (RDKit)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>% unique in generated pool (up to 1M)</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>% not in training set</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Average pairwise Tanimoto distance</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility (2-4 optimal range)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Drug-likeness score (0-1)</td>
      </tr>
      <tr>
          <td>Penalized logP</td>
          <td>Lipophilicity with ring and SA penalties</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 GPUs (NVIDIA, specific model not stated)</li>
<li>Per-GPU batch size of 512 for pretraining</li>
<li>Training time not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Popova, M., Shvets, M., Oliva, J., &amp; Isayev, O. (2019). MolecularRNN: Generating realistic molecular graphs with optimized properties. <em>arXiv preprint arXiv:1905.13372</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{popova2019molecularrnn,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolecularRNN: Generating realistic molecular graphs with optimized properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Popova, Mariya and Shvets, Mykhailo and Oliva, Junier and Isayev, Olexandr}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1905.13372}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Memory-Assisted RL for Diverse De Novo Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</guid><description>A memory unit for REINVENT-based RL that tracks generated scaffolds and penalizes repeated solutions, increasing molecular diversity up to fourfold.</description><content:encoded><![CDATA[<h2 id="a-memory-module-for-diverse-molecular-generation-via-rl">A Memory Module for Diverse Molecular Generation via RL</h2>
<p>This is a <strong>Method</strong> paper that introduces a memory unit for reinforcement learning (RL)-based molecular generation. The primary contribution is a hash-table-based memory mechanism that integrates into the REINVENT framework&rsquo;s scoring function. By tracking previously generated high-scoring molecules and penalizing the reward when new molecules are too similar to those already stored, the memory unit forces the generative model to explore different regions of chemical space rather than collapsing onto a single scaffold family.</p>
<h2 id="policy-collapse-limits-rl-based-de-novo-design">Policy Collapse Limits RL-Based De Novo Design</h2>
<p>Recurrent neural networks (RNNs) trained with reinforcement learning can generate novel molecules optimized for desired properties. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> algorithm and related approaches (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, GENTRL) demonstrated the viability of coupling a pretrained SMILES-based generative model with a scoring function via RL. However, a persistent problem is <strong>policy collapse</strong> (also called mode collapse): once the model discovers a high-scoring region of chemical space, it continues to exploit that region, producing structurally similar compounds with minor substitution differences. This severely limits the practical utility of RL-based generation in drug design, where medicinal chemists need diverse scaffolds to explore structure-activity relationships and manage intellectual property concerns.</p>
<p>Prior work by Liu et al. [31] attempted to address this by engineering an explorative RNN alongside the standard generative RNN, but it did not substantially increase diversity compared to standard REINVENT. Other approaches like Generative Examination Networks (GEN) performed statistical analysis during training but were not evaluated in optimization scenarios.</p>
<h2 id="core-innovation-hash-table-memory-unit-for-reward-modification">Core Innovation: Hash-Table Memory Unit for Reward Modification</h2>
<p>The key insight is to dynamically modify the reward surface during RL by maintaining a memory of previously explored chemical space. The memory unit is a hash table of index-bucket pairs. Each bucket stores up to a fixed number of high-scoring molecules (default: 25) that are chemically similar to a seed molecule (the index).</p>
<h3 id="integration-with-reinvent">Integration with REINVENT</h3>
<p>The memory unit modifies the augmented likelihood used in REINVENT. For a generated compound $c$, the augmented log-likelihood becomes:</p>
<p>$$
\log P(c)_{Aug} = \log P(c)_{PriorNetwork} + \sigma \times S(c) \times M(c)
$$</p>
<p>where $\sigma$ is a scalar coefficient, $S(c)$ is the scoring function output, and $M(c)$ is the memory unit output (either 0 or 1). The reward is:</p>
<p>$$
R(c) = \left(\log P(c)_{Aug} - \log P(c)_{AgentNetwork}\right)^2
$$</p>
<p>and the loss is $\text{loss} = -R(c)$.</p>
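<p>Putting the three equations together (a sketch, not the authors' implementation; the default <code>sigma = 60</code> here is an illustrative value, not taken from the paper):</p>

```python
def reinvent_memory_loss(prior_ll, agent_ll, score, memory_out, sigma=60.0):
    """Memory-gated REINVENT update: the augmented log-likelihood adds
    sigma * S(c) * M(c) to the prior likelihood, the reward is the squared
    gap to the agent likelihood, and the loss is its negation."""
    augmented_ll = prior_ll + sigma * score * memory_out
    reward = (augmented_ll - agent_ll) ** 2
    return -reward  # loss = -R(c)
```

<p>Note that with <code>memory_out = 0</code> the score term vanishes entirely, which is exactly how a full bucket discourages further sampling from that region.</p>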
<h3 id="memory-unit-operation">Memory Unit Operation</h3>
<p>When a high-scoring molecule is generated:</p>
<ol>
<li>Its fingerprint or scaffold is compared against all index structures in the memory</li>
<li>If it is similar to an index (above a Tanimoto cutoff, default 0.6) and the corresponding bucket is not full, $M(c) = 1$ and the molecule is added to the bucket</li>
<li>If the bucket is full, $M(c) = 0$, effectively zeroing the reward contribution and discouraging the model from generating similar molecules</li>
<li>If no similar index exists, a new index-bucket pair is created</li>
</ol>
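<p>The four steps above can be sketched as a small class. This is a minimal illustration, not the published code; <code>similarity</code> is a placeholder for whichever criterion is chosen (e.g. ECFP4 Tanimoto):</p>

```python
class MemoryUnit:
    """Hash-table memory: buckets of similar high-scoring molecules,
    each keyed by a seed (index) structure. Calling it returns M(c)."""

    def __init__(self, similarity, bucket_size=25, cutoff=0.6):
        self.similarity = similarity
        self.bucket_size = bucket_size
        self.cutoff = cutoff
        self.buckets = {}  # index structure -> list of stored molecules

    def __call__(self, mol):
        for index, bucket in self.buckets.items():
            if self.similarity(mol, index) >= self.cutoff:
                if len(bucket) < self.bucket_size:
                    bucket.append(mol)
                    return 1  # reward kept; molecule recorded
                return 0      # bucket full: zero the reward term
        self.buckets[mol] = [mol]  # no similar index: open a new bucket
        return 1
```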
<h3 id="four-similarity-criteria">Four Similarity Criteria</h3>
<p>The authors evaluate four criteria for grouping molecules in the memory:</p>
<ol>
<li><strong>Compound similarity</strong>: ECFP4 Tanimoto similarity at the whole-molecule level</li>
<li><strong>Identical Bemis-Murcko (BM) scaffold</strong>: exact match of Bemis-Murcko frameworks</li>
<li><strong>Identical carbon skeleton</strong>: exact match of carbon skeletons (BM scaffolds with all heteroatoms replaced by carbon and bonds set to single)</li>
<li><strong>Scaffold similarity</strong>: atom pair fingerprint Tanimoto similarity between carbon skeletons (fuzzy matching)</li>
</ol>
<h3 id="alternative-output-modes">Alternative Output Modes</h3>
<p>Beyond the binary output ($M(c) \in \{0, 1\}$), the authors also explored smooth output functions. The linear mode:</p>
<p>$$
M(c) = 1 - \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>And the sigmoid mode:</p>
<p>$$
M(c) = 1 - \frac{1}{1 + e^{-\left(\frac{\frac{\text{compounds in bucket}}{\text{bucket size}} \times 2 - 1}{0.15}\right)}}
$$</p>
<p>Both smooth modes yielded slightly fewer analogs than the binary mode and were not pursued further.</p>
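<p>For comparison, the three output modes as functions of bucket fill (a direct transcription of the formulas above, with names of my choosing):</p>

```python
import math

def memory_binary(filled, size):
    return 1 if filled < size else 0

def memory_linear(filled, size):
    return 1.0 - filled / size

def memory_sigmoid(filled, size):
    # Map fill fraction to [-1, 1], then squash with a steep sigmoid.
    x = (filled / size) * 2.0 - 1.0
    return 1.0 - 1.0 / (1.0 + math.exp(-x / 0.15))
```

<p>All three start near 1 for an empty bucket and reach (or approach) 0 as the bucket fills; the smooth modes simply taper the reward earlier.</p>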
<h2 id="experimental-setup-logp-optimization-and-target-activity-prediction">Experimental Setup: LogP Optimization and Target Activity Prediction</h2>
<h3 id="case-study-1-logp-optimization">Case Study 1: LogP Optimization</h3>
<p>As a proof of concept, the authors optimized LogP values for known DRD2 inhibitors. Starting from 487 DRD2 compounds with LogP &gt;= 5 (from ExCAPE-DB), they applied transfer learning to the prior model for 20 epochs, then ran RL for 150 iterations (100 compounds per iteration, 15,000 total). The scoring function was:</p>
<p>$$
S = 1 - \tanh\left(\min\left(|2 - \text{AlogP}|, |3 - \text{AlogP}|\right)\right)
$$</p>
<p>targeting LogP values between 2.0 and 3.0.</p>
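<p>The scoring function is easy to verify numerically. A one-line transcription (function name mine):</p>

```python
import math

def logp_score(alogp):
    """S = 1 - tanh(min(|2 - AlogP|, |3 - AlogP|)): maximal at AlogP = 2
    or 3 and decaying smoothly with distance from the [2, 3] window."""
    return 1.0 - math.tanh(min(abs(2.0 - alogp), abs(3.0 - alogp)))
```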
<h3 id="case-study-2-htr1a-and-drd2-activity-prediction">Case Study 2: HTR1A and DRD2 Activity Prediction</h3>
<p>For a more complex scenario, the authors trained SVM classifiers (with <a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt scaling</a> for probabilistic output) on bioactivity data from ExCAPE-DB to predict activity against two neurotransmitter receptors:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/5-HT1A_receptor">HTR1A</a></strong>: 3,599 actives (pIC50 &gt;= 7) and 66,684 inactives</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a></strong>: 2,981 actives (pIC50 &gt;= 7) and 346,206 inactives (100,000 sampled)</li>
</ul>
<p>Data was split using Butina clustering on ECFP6 at a 0.4 Tanimoto cutoff (60/20/20 train/val/test). The SVM models achieved excellent performance:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Set</th>
          <th>Balanced Accuracy</th>
          <th>ROC AUC</th>
          <th>F1</th>
          <th>MCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>Test</td>
          <td>0.96</td>
          <td>0.99</td>
          <td>0.75</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Test</td>
          <td>0.95</td>
          <td>0.99</td>
          <td>0.71</td>
          <td>0.72</td>
      </tr>
  </tbody>
</table>
<p>RL was run for 300 iterations (100 compounds each, 30,000 total). Compounds with predicted activity &gt;= 0.7 were considered active.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN prior model followed the REINVENT architecture: an embedding layer, three GRU layers with 256 dimensions, and a linear output layer. It was pretrained on ~1.5 million ChEMBL 25 compounds (filtered to remove known HTR1A actives and DRD2 analogs) for 10 epochs using Adam with a learning rate of 0.01.</p>
<h3 id="comparisons">Comparisons</h3>
<p>The authors compared memory-assisted RL against:</p>
<ul>
<li>Standard REINVENT RL (no memory)</li>
<li>Experience replay (re-presenting 8 high-scoring compounds per iteration)</li>
<li>Temperature scaling (values from 1.0 to 10.0)</li>
<li>Memory + experience replay combined</li>
</ul>
<h2 id="results-up-to-fourfold-increase-in-diverse-active-compounds">Results: Up to Fourfold Increase in Diverse Active Compounds</h2>
<h3 id="logp-optimization-results">LogP Optimization Results</h3>
<p>Memory-assisted RL increased the number of optimized compounds (LogP 2-3) by roughly threefold:</p>
<table>
  <thead>
      <tr>
          <th>Memory Type</th>
          <th>Optimized Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No memory</td>
          <td>938</td>
          <td>727</td>
          <td>396</td>
      </tr>
      <tr>
          <td>Compound similarity</td>
          <td>3,451</td>
          <td>2,963</td>
          <td>1,472</td>
      </tr>
      <tr>
          <td>Identical BM Scaffold</td>
          <td>3,428</td>
          <td>2,865</td>
          <td>1,398</td>
      </tr>
      <tr>
          <td>Identical Carbon Skeleton</td>
          <td>3,315</td>
          <td>3,002</td>
          <td>1,799</td>
      </tr>
      <tr>
          <td>Scaffold Similarity</td>
          <td>3,591</td>
          <td>3,056</td>
          <td>1,538</td>
      </tr>
  </tbody>
</table>
<p>The memory unit also increased the generation of relevant analogs. ECFP6 analogs (Tanimoto &gt;= 0.4 to the training set) increased from 145 to as many as 549, and shared MMP cores from 5 to as many as 19, confirming that the memory unit promoted exploration of chemically relevant space rather than random drift.</p>
<h3 id="htr1a-and-drd2-activity-optimization-results">HTR1A and DRD2 Activity Optimization Results</h3>
<p>The improvements were even more pronounced for target activity optimization:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Memory Type</th>
          <th>Active Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>No memory</td>
          <td>9,323</td>
          <td>7,312</td>
          <td>5,446</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Compound similarity</td>
          <td>16,779</td>
          <td>13,304</td>
          <td>9,887</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Identical Carbon Skeleton</td>
          <td>17,597</td>
          <td>15,531</td>
          <td>12,408</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>No memory</td>
          <td>5,143</td>
          <td>2,635</td>
          <td>1,949</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Compound similarity</td>
          <td>21,486</td>
          <td>17,844</td>
          <td>12,749</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Scaffold Similarity</td>
          <td>22,784</td>
          <td>20,712</td>
          <td>16,434</td>
      </tr>
  </tbody>
</table>
<p>For DRD2, the effect was particularly striking: standard RL showed clear policy collapse with only 576 ECFP6 analogs to the training set, while memory-assisted RL generated up to 6,315. The compound similarity memory unit produced the most MMP analogs (217 to the training set vs. 7 without memory).</p>
<h3 id="parameter-sensitivity">Parameter Sensitivity</h3>
<p>Bucket size had a modest effect: larger buckets (allowing more compounds before penalization) slightly increased analog generation. The Tanimoto similarity threshold of 0.6 was near-optimal for the scaffold similarity memory; higher thresholds reduced diversity gains. For the compound similarity memory, analog counts kept increasing at higher thresholds, but BM scaffold and carbon skeleton counts plateaued above 0.6.</p>
<h3 id="comparison-with-experience-replay-and-temperature-scaling">Comparison with Experience Replay and Temperature Scaling</h3>
<ul>
<li><strong>Experience replay alone</strong> increased diversity compared to vanilla RL but was less effective than the memory unit alone</li>
<li><strong>Memory + experience replay</strong> achieved the best results overall, as experience replay provided the model with diverse starting points for exploration after the memory unit altered the reward landscape</li>
<li><strong>Temperature scaling</strong> was largely ineffective: only a value of 1.25 showed improvement, and even then it achieved only about 50% of the analogs generated by memory-assisted RL. Temperatures above 2.0 degraded SMILES validity, and above 4.0 prevented valid molecule generation entirely</li>
</ul>
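<p>Temperature scaling simply divides the pre-softmax logits by a temperature $T$ before sampling; values above 1 flatten the token distribution, which is why high temperatures destroy SMILES validity. A minimal sketch of the operation (the logits below are hypothetical, not tied to the paper's vocabulary):</p>

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature > 1 flattens the distribution; < 1 sharpens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # hypothetical next-token logits
p_default = softmax_with_temperature(logits, 1.0)
p_flat = softmax_with_temperature(logits, 10.0)  # near-uniform
```

<p>At $T = 10$ the three probabilities become nearly equal, so sampling degenerates toward random token choice.</p>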
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>All evaluations are retrospective; no synthesized compounds were experimentally tested</li>
<li>The SVM activity models, while accurate, may have applicability domain limitations for highly novel scaffolds</li>
<li>The binary memory output mode was found to work best, but the transition from exploration to exploitation is abrupt</li>
<li>The method was only tested with two biological targets and one physicochemical property</li>
<li>Computational overhead of the memory unit is not discussed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior model training</td>
          <td>ChEMBL 25</td>
          <td>~1.5M compounds</td>
          <td>Filtered: max 50 heavy atoms, no stereochemistry, removed HTR1A actives and DRD2 analogs</td>
      </tr>
      <tr>
          <td>HTR1A activity data</td>
          <td>ExCAPE-DB</td>
          <td>3,599 actives + 66,684 inactives</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
      <tr>
          <td>DRD2 activity data</td>
          <td>ExCAPE-DB</td>
          <td>2,981 actives + 100,000 inactives (sampled)</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Generative model</strong>: RNN with embedding + 3 GRU layers (256 dim) + linear output (REINVENT architecture)</li>
<li><strong>RL</strong>: Augmented likelihood formulation with sigma scaling coefficient</li>
<li><strong>SVM classifiers</strong>: Non-linear SVM with MinMax kernel, Platt scaling, ECFP6 count-based fingerprints (2048 dim)</li>
<li><strong>Butina clustering</strong>: ECFP6 Tanimoto cutoff 0.4 for train/val/test splitting</li>
</ul>
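<p>The Butina split can be sketched in a few lines if fingerprints are represented as sets of on-bits (the fingerprints below are toy data; a real pipeline would compute ECFP6 with RDKit, and treating 0.4 as a similarity cutoff is an assumption):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def butina_cluster(fps, cutoff=0.4):
    """Greedy Butina clustering: the compound with the largest neighborhood
    becomes a centroid and claims its still-unassigned neighbors."""
    n = len(fps)
    neighbors = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    for i in sorted(range(n), key=lambda k: -len(neighbors[k])):
        if i not in unassigned:
            continue
        members = {i} | (neighbors[i] & unassigned)
        unassigned -= members
        clusters.append(sorted(members))
    return clusters

# toy fingerprints: two near-duplicates and one unrelated compound
fps = [{1, 2, 3}, {1, 2, 3, 4}, {10, 11}]
clusters = butina_cluster(fps)
```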
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unique compounds</td>
          <td>Number of distinct valid SMILES generated</td>
      </tr>
      <tr>
          <td>Unique BM scaffolds</td>
          <td>Bemis-Murcko framework diversity</td>
      </tr>
      <tr>
          <td>Unique carbon skeletons</td>
          <td>Carbon skeleton diversity (stripped BM scaffolds)</td>
      </tr>
      <tr>
          <td>ECFP6 analogs</td>
          <td>Compounds with Tanimoto &gt;= 0.4 to known actives</td>
      </tr>
      <tr>
          <td>MMP analogs</td>
          <td>Matched molecular pair relationships with known actives</td>
      </tr>
      <tr>
          <td>Shared MMP cores</td>
          <td>Scaffold cores shared between generated and known compounds</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tblaschke/reinvent-memory">reinvent-memory</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with prepared datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blaschke, T., Engkvist, O., Bajorath, J., &amp; Chen, H. (2020). Memory-assisted reinforcement learning for diverse molecular de novo design. <em>Journal of Cheminformatics</em>, 12, 68. <a href="https://doi.org/10.1186/s13321-020-00473-0">https://doi.org/10.1186/s13321-020-00473-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blaschke2020memory,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Memory-assisted reinforcement learning for diverse molecular de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blaschke, Thomas and Engkvist, Ola and Bajorath, J{\&#34;u}rgen and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00473-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v2: Pareto Multi-Objective RL for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</guid><description>DrugEx v2 extends RNN-based de novo drug design with Pareto ranking and evolutionary exploration for multi-objective molecule generation.</description><content:encoded><![CDATA[<h2 id="multi-objective-de-novo-drug-design-with-pareto-optimization">Multi-Objective De Novo Drug Design with Pareto Optimization</h2>
<p>This is a <strong>Method</strong> paper that extends the DrugEx framework (v1) to handle multi-objective optimization in de novo drug design. The primary contribution is integrating Pareto-based ranking with evolutionary algorithm concepts (crossover and mutation) into an RNN-based reinforcement learning pipeline. The system generates <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based molecules optimized simultaneously for activity toward multiple protein targets while avoiding off-targets, addressing polypharmacology scenarios where drugs must bind multiple specific receptors.</p>
<h2 id="polypharmacology-and-the-limits-of-single-objective-generation">Polypharmacology and the Limits of Single-Objective Generation</h2>
<p>Traditional drug discovery follows the &ldquo;one drug, one target, one disease&rdquo; paradigm, but drug molecules interact with an average of six protein targets. Off-target binding causes side effects that remain a leading cause of clinical failure and post-approval drug withdrawals (over 500 drugs withdrawn due to fatal toxicity). Complex diseases often require modulating multiple targets simultaneously, making polypharmacology an important design objective.</p>
<p>Prior deep learning approaches for de novo design, including DrugEx v1, focused on generating molecules active against a single target. Extending these methods to multiple objectives introduces fundamental challenges: objectives are often contradictory (high affinity for one target may correlate with high affinity for an undesired off-target), and naive weighted-sum approaches can collapse diversity by over-optimizing a single dominant objective. The authors specifically target the <a href="https://en.wikipedia.org/wiki/Adenosine_receptor">adenosine receptor</a> system, where $A_1AR$ and $A_{2A}AR$ selectivity profiles matter for therapeutic efficacy, and <a href="https://en.wikipedia.org/wiki/HERG">hERG</a> channel binding must be avoided to prevent cardiac toxicity.</p>
<h2 id="evolutionary-exploration-and-pareto-ranking-in-rl">Evolutionary Exploration and Pareto Ranking in RL</h2>
<p>The core innovation of DrugEx v2 has two components: an evolutionary exploration strategy and Pareto-based reward assignment.</p>
<h3 id="evolutionary-exploration-strategy">Evolutionary Exploration Strategy</h3>
<p>The generation process uses three RNN networks with identical LSTM architectures:</p>
<ul>
<li><strong>Agent net</strong> ($G_A$): the primary generator, updated at each training epoch via policy gradient</li>
<li><strong>Crossover net</strong> ($G_C$): initialized from the fine-tuned model, updated iteratively from $G_A$ after each convergence period</li>
<li><strong>Mutation net</strong> ($G_M$): initialized from the pre-trained model, parameters fixed throughout training</li>
</ul>
<p>At each token-generation step, a random number determines whether the token probability comes from the combination of $G_A$ and $G_C$ (with probability $1 - \varepsilon$) or from $G_M$ (with probability $\varepsilon$). This mirrors crossover and mutation operations from evolutionary algorithms, maintaining diversity while steering toward desired properties.</p>
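<p>The per-step mixing can be sketched directly from that description (the exact rule for combining $G_A$ and $G_C$ is an assumption here; a simple average is used):</p>

```python
import random

def next_token_distribution(p_agent, p_crossover, p_mutation, epsilon, rng=random):
    """With probability epsilon, take the (frozen) mutation net's distribution;
    otherwise combine agent and crossover nets (simple average, an assumption)."""
    if rng.random() < epsilon:
        return list(p_mutation)
    return [(a + c) / 2.0 for a, c in zip(p_agent, p_crossover)]

# hypothetical next-token distributions over a 2-token vocabulary
p = next_token_distribution([0.2, 0.8], [0.6, 0.4], [1.0, 0.0], epsilon=0.0)
```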
<h3 id="pareto-front-reward-scheme">Pareto Front Reward Scheme</h3>
<p>For $n$ objectives (three in this study: $A_1AR$, $A_{2A}AR$, hERG), each molecule receives a score $R_i$ based on its predicted bioactivity:</p>
<p>$$
R_{i} = \begin{cases} \text{minmax}(pX_{i}), &amp; \text{if high affinity required} \\ 1 - \text{minmax}(pX_{i}), &amp; \text{if low affinity required} \\ 0, &amp; \text{if SMILES invalid} \end{cases}
$$</p>
<p>where $pX_i$ is the predicted bioactivity (range 3.0 to 10.0), normalized to [0, 1].</p>
<p>For the multi-target case, high affinity is required for both $A_1AR$ and $A_{2A}AR$ while low affinity is required for hERG. For the target-specific case, high affinity is required only for $A_{2A}AR$ while low affinity is required for both $A_1AR$ and hERG.</p>
<p>Molecules are ranked using a <a href="https://en.wikipedia.org/wiki/Non-dominated_sorting_genetic_algorithm_II">non-dominated sorting</a> algorithm to construct Pareto fronts. Within each front, molecules are ranked by average Tanimoto distance (using ECFP6 fingerprints) rather than crowding distance, favoring chemically diverse solutions. The final reward is:</p>
<p>$$
R_i^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>where $k$ is the molecule&rsquo;s index in the Pareto rank. Rewards for undesired and desired solutions are distributed in $(0, 0.5]$ and $(0.5, 1.0]$, respectively.</p>
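<p>A minimal sketch of this ranking-to-reward pipeline, under stated assumptions: objectives are maximized, ranks $k$ are 1-based and run from worst Pareto front to best, all undesired molecules are ranked before desired ones, both groups are non-empty, and the within-front Tanimoto-distance ordering is omitted:</p>

```python
def dominates(a, b):
    """a dominates b: at least as good on every objective, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(scores):
    """Return Pareto fronts as lists of indices, best front first."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def pareto_rewards(scores, desired):
    """Map Pareto rank k to rewards in (0, 0.5] (undesired) / (0.5, 1] (desired)."""
    order = [i for front in reversed(non_dominated_sort(scores)) for i in front]
    order.sort(key=lambda i: desired[i])    # stable sort: undesired first (assumption)
    n_und = sum(1 for d in desired if not d)
    n_des = len(desired) - n_und
    rewards = {}
    for k, i in enumerate(order, start=1):  # 1-based rank
        rewards[i] = 0.5 + (k - n_und) / (2 * n_des) if desired[i] else k / (2 * n_und)
    return rewards

# toy example: per-molecule scores on two objectives
scores = [(0.9, 0.9), (0.1, 0.1), (0.8, 0.2)]
rewards = pareto_rewards(scores, desired=[True, False, False])
```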
<p>The agent is trained via policy gradient:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) \middle|\theta\right] = \sum_{t=1}^{T} \log G(y_t | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
<h3 id="weighted-sum-alternative">Weighted Sum Alternative</h3>
<p>The authors also implement a weighted sum (WS) scheme with dynamic weights proportional to the ratio of undesired to desired molecules per objective:</p>
<p>$$
w_i = \frac{r_i}{\sum_{k=1}^{M} r_k}, \quad R^{*} = \sum_{i=1}^{n} w_i R_i
$$</p>
<p>This auto-adjusts importance toward under-performing objectives during training.</p>
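<p>The dynamic weighting is a one-liner per objective; a sketch with hypothetical undesired-to-desired ratios $r_i$:</p>

```python
def dynamic_weights(ratios):
    """w_i = r_i / sum_k r_k, where r_i is the undesired-to-desired ratio for
    objective i, so lagging objectives receive more weight."""
    total = sum(ratios)
    return [r / total for r in ratios]

def weighted_sum_reward(rewards, weights):
    """R* = sum_i w_i * R_i."""
    return sum(w * r for w, r in zip(weights, rewards))

weights = dynamic_weights([2.0, 1.0, 1.0])  # objective 0 is lagging behind
reward = weighted_sum_reward([1.0, 0.0, 0.0], weights)
```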
<h3 id="molecular-diversity-metric">Molecular Diversity Metric</h3>
<p>Diversity is measured using the Solow-Polasky metric adapted from ecological biodiversity:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\top} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
<p>where $F(\mathbf{s})$ is a distance matrix with entries $f(d_{ij}) = e^{-\theta d_{ij}}$ and $d_{ij}$ is the Tanimoto distance between ECFP6 fingerprints of molecules $s_i$ and $s_j$.</p>
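<p>Because $\mathbf{e}^{\top} F^{-1} \mathbf{e} = \mathbf{e}^{\top} \mathbf{x}$ where $F\mathbf{x} = \mathbf{e}$, the metric needs only a single linear solve rather than a full matrix inverse. A self-contained sketch on a toy distance matrix (the data is illustrative only):</p>

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def solow_polasky_diversity(distances, theta=1.0):
    """I(A) = (1/|A|) e^T F^{-1} e with F_ij = exp(-theta * d_ij)."""
    n = len(distances)
    F = [[math.exp(-theta * d) for d in row] for row in distances]
    x = solve(F, [1.0] * n)  # x = F^{-1} e
    return sum(x) / n

# two molecules at Tanimoto distance 1.0 (maximally dissimilar fingerprints)
diversity = solow_polasky_diversity([[0.0, 1.0], [1.0, 0.0]])
```

<p>The metric approaches 1 as all pairwise distances grow and collapses toward $1/|A|$ as the set degenerates to near-duplicates.</p>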
<h2 id="multi-target-and-target-specific-experiments">Multi-Target and Target-Specific Experiments</h2>
<h3 id="qsar-environment">QSAR Environment</h3>
<p>Four ML algorithms were benchmarked for the bioactivity prediction environment: Random Forest (RF), SVM, PLS, and Multi-task DNN (MT-DNN). Input features combined 2048-bit ECFP6 fingerprints with 19 physicochemical descriptors (2067D total). The training data came from ChEMBL v26: 25,731 ligands with bioactivity measurements toward $A_1AR$, $A_{2A}AR$, and hERG. RF was selected as the final predictor based on superior performance in temporal-split independent testing ($R^2$ and RMSE), prioritizing robustness over cross-validation metrics.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN generator uses six layers: input, embedding (128D), three LSTM recurrent layers (512 hidden units), and output. LSTM was chosen over GRU based on higher valid SMILES rates (97.5% vs. 93.1% for pre-trained, 97.9% vs. 95.7% for fine-tuned). Pre-training used 1.7M molecules from ChEMBL; fine-tuning used the 25,731 LIGAND set molecules.</p>
<h3 id="baselines">Baselines</h3>
<p>DrugEx v2 was compared against DrugEx v1, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, all using the same RNN architecture and pre-trained/fine-tuned models, with only the RL framework differing. Both Pareto front (PF) and weighted sum (WS) reward schemes were tested.</p>
<h3 id="multi-target-results">Multi-Target Results</h3>
<p>In the multi-target case (high affinity for $A_1AR$ and $A_{2A}AR$, low affinity for hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.57%</td>
          <td>80.81%</td>
          <td>87.29%</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.80%</td>
          <td><strong>97.45%</strong></td>
          <td>89.08%</td>
          <td>0.49</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>PF</td>
          <td>99.54%</td>
          <td>57.43%</td>
          <td><strong>98.84%</strong></td>
          <td><strong>0.77</strong></td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.84%</td>
          <td>66.01%</td>
          <td>82.67%</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>DrugEx v1</td>
          <td>PF</td>
          <td>98.28%</td>
          <td>43.27%</td>
          <td>88.96%</td>
          <td>0.71</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 achieved the highest desirability under both schemes. The WS scheme maximized desirability (97.45%) but at the cost of diversity (0.49). The PF scheme maintained higher diversity (0.70) with still-strong desirability (80.81%).</p>
<h3 id="target-specific-results">Target-Specific Results</h3>
<p>In the target-specific case (high $A_{2A}AR$, low $A_1AR$ and hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.53%</td>
          <td><strong>89.49%</strong></td>
          <td>90.55%</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.62%</td>
          <td><strong>97.86%</strong></td>
          <td>90.54%</td>
          <td>0.31</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>WS</td>
          <td>99.55%</td>
          <td>81.27%</td>
          <td>98.87%</td>
          <td>0.34</td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.29%</td>
          <td>86.98%</td>
          <td>80.30%</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 with PF achieved high desirability (89.49%) while maintaining diversity (0.73), outperforming both the WS scheme&rsquo;s diversity collapse (0.31) and competing methods.</p>
<h3 id="chemical-space-coverage">Chemical Space Coverage</h3>
<p>t-SNE visualization with ECFP6 descriptors showed that the PF scheme guided generators to cover chemical space more broadly than the WS scheme. DrugEx v1 and v2 covered nearly all of the chemical space occupied by known active ligands, while REINVENT and ORGANIC covered only partial regions in the target-specific case.</p>
<h3 id="substructure-distribution">Substructure Distribution</h3>
<p>Generated molecules were evaluated for purine ring, furan ring, and benzene ring frequencies. DrugEx v2 with PF produced substructure distributions closest to the LIGAND set, suggesting it better preserves the chemical characteristics of known active molecules compared to REINVENT (which over-represented benzene rings) and ORGANIC.</p>
<h3 id="guacamol-benchmark">GuacaMol Benchmark</h3>
<p>DrugEx v2 was tested on 20 goal-directed tasks from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark, achieving the best score in 12 of 20 tasks and ranking second overall. The method struggled with tasks requiring contradictory objectives in narrow chemical spaces (e.g., the Sitagliptin MPO task), reflecting its emphasis on diverse feasible molecules rather than optimal individual solutions.</p>
<h2 id="diversity-desirability-trade-off-and-limitations">Diversity-Desirability Trade-off and Limitations</h2>
<p>The key finding is that the Pareto front scheme and weighted sum scheme offer complementary strengths: PF produces molecules with higher diversity and more realistic substructure distributions, while WS achieves higher raw desirability scores. The Pareto front scheme is preferred for polypharmacology applications where chemical diversity matters for lead optimization.</p>
<p>The mutation rate $\varepsilon$ controls the diversity-desirability trade-off. Higher $\varepsilon$ increases diversity at the cost of desirability. The authors tested $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$ and found that appropriate tuning is important.</p>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The method is less effective for tasks with contradictory objectives in narrow chemical spaces</li>
<li>Emphasis is on generating diverse feasible molecules rather than individual optimal solutions</li>
<li>REINVENT 2.0 did not converge with the PF scheme, suggesting the Pareto approach may not be universally compatible with all RL frameworks</li>
<li>Bioactivity predictions rely on QSAR models (RF), which may not generalize perfectly to novel chemical scaffolds</li>
</ul>
<p>Future directions mentioned include adopting newer architectures (BERT, Transformer, GPT-2), handling graph and fragment representations, and integrating additional objectives like stability and synthesizability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v26 (ChEMBL set)</td>
          <td>1.7M molecules</td>
          <td>SMILES syntax learning, drug-like molecules</td>
      </tr>
      <tr>
          <td>Fine-tuning / Environment</td>
          <td>LIGAND set</td>
          <td>25,731 ligands</td>
          <td>Bioactivities for $A_1AR$, $A_{2A}AR$, hERG from ChEMBL</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>GuacaMol</td>
          <td>20 tasks</td>
          <td>Goal-directed generation tasks</td>
      </tr>
  </tbody>
</table>
<p>Active/inactive thresholds: $pX \geq 6.5$ (active), $pX &lt; 6.5$ (inactive). Low-quality data without an exact pX value were assigned $pX = 3.99$ with a sample weight of 0.1.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>QSAR predictor</strong>: Random Forest, 1000 trees, Gini criterion. Input: 2048-bit ECFP6 + 19 physicochemical properties (2067D). MinMax normalization.</li>
<li><strong>Generator</strong>: 6-layer RNN with LSTM cells (512 hidden units), embedding dim 128, vocabulary 84 tokens. Adam optimizer, lr $10^{-3}$, batch size 512, 1000 epochs.</li>
<li><strong>RL training</strong>: Policy gradient with Pareto-based or weighted-sum reward. Mutation rates tested: $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$.</li>
<li><strong>Pareto ranking</strong>: GPU-accelerated non-dominated sorting via PyTorch. Average Tanimoto distance on ECFP6 fingerprints used in place of crowding distance.</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generator</td>
          <td>LSTM (3 layers, 512 hidden)</td>
          <td>Embedding 128D, vocab 84</td>
      </tr>
      <tr>
          <td>Predictor</td>
          <td>Random Forest</td>
          <td>1000 trees, 2067D input</td>
      </tr>
      <tr>
          <td>MT-DNN (alternative)</td>
          <td>3 hidden layers (4000, 2000, 1000)</td>
          <td>ReLU, 20% dropout</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated SMILES that parse to valid molecules</td>
      </tr>
      <tr>
          <td>Desirability</td>
          <td>Fraction of molecules meeting all activity thresholds ($pX \geq 6.5$ on-targets, $pX &lt; 6.5$ off-targets)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of non-duplicate molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Solow-Polasky metric on ECFP6 Tanimoto distances</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Synthetic accessibility (1-10, lower is easier)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative estimate of drug-likeness (0-1, higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>GPU acceleration was used for Pareto optimization via PyTorch. Specific hardware details (GPU model, training time) are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XuhanLiu/DrugEx">DrugEx GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Python, PyTorch)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v26</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source of training molecules and bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., Emmerich, M. T. M., IJzerman, A. P., &amp; van Westen, G. J. P. (2021). DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. <em>Journal of Cheminformatics</em>, 13(1), 85. <a href="https://doi.org/10.1186/s13321-021-00561-9">https://doi.org/10.1186/s13321-021-00561-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2021drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and Emmerich, Michael T. M. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{85}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-021-00561-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>REINVENT 4: Open-Source Generative Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/</guid><description>REINVENT 4 is an open-source generative AI framework combining RNNs and transformers with reinforcement and curriculum learning for de novo molecular design.</description><content:encoded><![CDATA[<h2 id="an-open-source-reference-implementation-for-generative-molecular-design">An Open-Source Reference Implementation for Generative Molecular Design</h2>
<p>REINVENT 4 is a <strong>Resource</strong> paper presenting a production-grade, open-source software framework for AI-driven generative molecular design. The primary contribution is the unified codebase that integrates four distinct molecule generators (de novo, scaffold decoration, linker design, molecular optimization) within three machine learning optimization algorithms (transfer learning, reinforcement learning, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/">curriculum learning</a>). The software is released under the Apache 2.0 license and represents the fourth major version of the REINVENT platform, which has been in continuous production use at AstraZeneca for drug discovery.</p>
<h2 id="bridging-the-gap-between-research-prototypes-and-production-molecular-design">Bridging the Gap Between Research Prototypes and Production Molecular Design</h2>
<p>The motivation for REINVENT 4 stems from several gaps in the generative molecular design landscape. While numerous AI model architectures have been developed for molecular generation (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">VAEs</a>, GANs, RNNs, transformers, flow models, diffusion models), most exist as research prototypes released alongside individual publications rather than as maintained, integrated software. The authors argue that the scientific community needs reference implementations of common generative molecular design algorithms in the public domain to:</p>
<ol>
<li>Enable nuanced debate about the application of AI in drug discovery</li>
<li>Serve as educational tools for practitioners entering the field</li>
<li>Increase transparency around AI-driven molecular design</li>
<li>Provide a foundation for future innovation</li>
</ol>
<p>REINVENT 4 consolidates previously separate codebases (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> v1, v2, LibInvent, LinkInvent, Mol2Mol) into a single repository with a consistent interface, addressing the fragmentation that characterized earlier releases.</p>
<h2 id="unified-framework-for-sequence-based-molecular-generation">Unified Framework for Sequence-Based Molecular Generation</h2>
<p>The core design of REINVENT 4 centers on sequence-based neural network models that generate <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings in an autoregressive manner. All generators model the probability of producing a token sequence, with two formulations.</p>
<p>For unconditional agents (de novo generation), the joint probability of a sequence $T$ with tokens $t_1, t_2, \ldots, t_\ell$ is:</p>
<p>$$
\mathbf{P}(T) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
<p>For conditional agents (scaffold decoration, linker design, molecular optimization), the joint probability given an input sequence $S$ is:</p>
<p>$$
\mathbf{P}(T \mid S) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1, S)
$$</p>
<p>The negative log-likelihood for unconditional agents is:</p>
<p>$$
NLL(T) = -\log \mathbf{P}(T) = -\sum_{i=1}^{\ell} \log \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
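<p>The factorization above can be sketched numerically. In this toy example the per-token conditional probabilities are invented, not model outputs; it only shows how the joint probability and NLL fall out of the per-step conditionals:</p>
<pre><code class="language-python">import math

def sequence_nll(step_probs):
    """NLL of a token sequence from its per-step conditional probabilities
    P(t_i | t_1 ... t_i-1); the joint probability is their product."""
    return -sum(math.log(p) for p in step_probs)

# Toy 4-token sequence with invented conditionals.
probs = [0.9, 0.5, 0.25, 0.8]
nll = sequence_nll(probs)
print(round(nll, 4))             # 2.4079 (= -ln 0.09)
print(round(math.exp(-nll), 4))  # 0.09, the product of the conditionals
</code></pre>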
<h3 id="reinforcement-learning-with-dap">Reinforcement Learning with DAP</h3>
<p>The key optimization mechanism is reinforcement learning via the &ldquo;Difference between Augmented and Posterior&rdquo; (DAP) strategy. For each generated sequence $T$, the augmented likelihood is defined as:</p>
<p>$$
\log \mathbf{P}_{\text{aug}}(T) = \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)
$$</p>
<p>where $\mathbf{S}(T) \in [0, 1]$ is the scalar score and $\sigma \geq 0$ controls the balance between reward and regularization. The DAP loss is:</p>
<p>$$
\mathcal{L}(T) = \left(\log \mathbf{P}_{\text{aug}}(T) - \log \mathbf{P}_{\text{agent}}(T)\right)^2
$$</p>
<p>The presence of the prior likelihood in the augmented likelihood constrains how far the agent can deviate from chemically plausible space, functioning similarly to proximal policy gradient methods. The loss is lower-bounded by:</p>
<p>$$
\mathcal{L}(T) \geq \max\left(0, \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)\right)^2
$$</p>
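<p>A minimal sketch of the DAP objective, with made-up log-likelihood and score values rather than real model outputs:</p>
<pre><code class="language-python">def dap_loss(logp_prior, logp_agent, score, sigma):
    """Squared difference between the augmented and agent log-likelihoods.
    Log-likelihoods are non-positive; score is in [0, 1]; sigma trades off
    reward against staying close to the prior."""
    logp_aug = logp_prior + sigma * score
    return (logp_aug - logp_agent) ** 2

# Invented values for one sampled SMILES: prior NLL 35, agent NLL 30, score 0.6.
loss = dap_loss(logp_prior=-35.0, logp_agent=-30.0, score=0.6, sigma=60.0)
print(loss)  # (-35 + 36 + 30)**2 = 961.0
</code></pre>
<p>Minimizing this loss pulls the agent's likelihood toward the prior's likelihood plus a reward bonus, which is what keeps generation anchored in chemically plausible space.</p>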
<h3 id="four-molecule-generators">Four Molecule Generators</h3>
<p>REINVENT 4 supports four generator types:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>Architecture</th>
          <th>Input</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reinvent</td>
          <td>RNN</td>
          <td>None</td>
          <td>De novo design from scratch</td>
      </tr>
      <tr>
          <td>LibInvent</td>
          <td>RNN</td>
          <td>Scaffold SMILES</td>
          <td>R-group replacement, library design</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/">LinkInvent</a></td>
          <td>RNN</td>
          <td>Two warhead fragments</td>
          <td>Linker design, scaffold hopping</td>
      </tr>
      <tr>
          <td>Mol2Mol</td>
          <td>Transformer</td>
          <td>Input molecule</td>
          <td>Molecular optimization within similarity bounds</td>
      </tr>
  </tbody>
</table>
<p>All generators are fully integrated with all three optimization algorithms (TL, RL, CL). The Mol2Mol transformer was trained on over 200 billion molecular pairs from PubChem with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> $\geq 0.50$, using ranking loss to directly link negative log-likelihood to molecular similarity.</p>
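<p>The Tanimoto similarity used to filter training pairs is the Jaccard index over fingerprint on-bits. A minimal pure-Python sketch, using toy bit indices rather than actual Morgan fingerprints:</p>
<pre><code class="language-python">def tanimoto(fp_a, fp_b):
    """Jaccard index over the on-bit indices of two binary fingerprints."""
    a, b = set(fp_a), set(fp_b)
    union = len(a.union(b))
    return len(a.intersection(b)) / union if union else 1.0

# Toy on-bit indices standing in for the fingerprints of a molecular pair.
print(tanimoto([1, 4, 7, 9], [1, 4, 8, 9, 12]))  # 3 shared / 6 total = 0.5
</code></pre>
<p>Under the paper's criterion, a pair would be kept when this value is at least 0.50.</p>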
<h3 id="staged-learning-curriculum-learning">Staged Learning (Curriculum Learning)</h3>
<p>A key new feature is staged learning, which implements curriculum learning as multi-stage RL. Each stage can define a different scoring profile, allowing users to gradually phase in computationally expensive scoring functions. For example, cheap drug-likeness filters can run first, followed by docking in later stages. Stages terminate when a maximum score threshold is exceeded or a step limit is reached.</p>
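<p>The staged-learning control flow can be sketched as follows; the stage names, thresholds, and the <code>agent_step</code> callback are hypothetical stand-ins, not REINVENT 4's actual API:</p>
<pre><code class="language-python">def staged_learning(agent_step, stages):
    """Run RL stages in order; each stage has its own scoring profile and
    terminates when the max-score threshold is exceeded or steps run out.
    agent_step(scorer) stands in for one RL update returning the batch's
    best score."""
    history = []
    for stage in stages:
        for step in range(stage["max_steps"]):
            best = agent_step(stage["scorer"])
            history.append((stage["name"], step, best))
            if best > stage["threshold"]:
                break  # stage satisfied; move on to the next scoring profile
    return history

# Assumed progression: a cheap drug-likeness stage, then an expensive docking stage.
scores = iter([0.2, 0.5, 0.9, 0.3, 0.95])
fake_step = lambda scorer: next(scores)
stages = [
    {"name": "qed", "scorer": "cheap", "threshold": 0.8, "max_steps": 10},
    {"name": "dock", "scorer": "expensive", "threshold": 0.9, "max_steps": 10},
]
log = staged_learning(fake_step, stages)
print([name for name, _, _ in log])  # ['qed', 'qed', 'qed', 'dock', 'dock']
</code></pre>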
<h3 id="scoring-subsystem">Scoring Subsystem</h3>
<p>The scoring subsystem implements a plugin architecture supporting over 25 scoring components, including:</p>
<ul>
<li>Physicochemical descriptors from RDKit (QED, SLogP, TPSA, molecular weight, etc.)</li>
<li>Molecular docking via DockStream (<a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>, rDock, Hybrid, Glide, GOLD)</li>
<li>QSAR models via Qptuna and ChemProp (D-MPNN)</li>
<li>Shape similarity via ROCS</li>
<li>Synthesizability estimation via SA score</li>
<li>Matched molecular pairs via mmpdb</li>
<li>Generic REST and external process interfaces</li>
</ul>
<p>Scores are aggregated via weighted arithmetic or geometric mean. A transform system (sigmoid, step functions, value maps) normalizes individual component scores to $[0, 1]$.</p>
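<p>A compact sketch of this transform-then-aggregate pipeline; the sigmoid parameterization here is an assumption for illustration, not REINVENT 4's exact implementation:</p>
<pre><code class="language-python">import math

def sigmoid_transform(x, low, high):
    """Map a raw component value into [0, 1], rising from near 0 at `low`
    to near 1 at `high` (one assumed parameterization)."""
    k = 10.0 / (high - low)
    return 1.0 / (1.0 + math.exp(-k * (x - 0.5 * (low + high))))

def aggregate(scores, weights, geometric=True):
    """Weighted geometric (default) or arithmetic mean of [0, 1] scores."""
    if geometric:
        prod = 1.0
        for s, w in zip(scores, weights):
            prod *= max(s, 1e-9) ** w  # clamp to avoid 0 ** w underflow issues
        return prod ** (1.0 / sum(weights))
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

qed_like = 0.25                              # component already in [0, 1]
mw_score = sigmoid_transform(420, 200, 600)  # raw descriptor, transformed first
print(round(aggregate([qed_like, 1.0], [1.0, 1.0]), 3))  # sqrt(0.25) = 0.5
</code></pre>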
<h2 id="pdk1-inhibitor-case-study">PDK1 Inhibitor Case Study</h2>
<p>The paper demonstrates REINVENT 4 through a structure-based drug design exercise targeting <a href="https://en.wikipedia.org/wiki/PDPK1">Phosphoinositide-dependent kinase-1 (PDK1)</a> inhibitors. The experimental setup uses PDB crystal structure 2XCH with DockStream and Glide for docking, defining hits as molecules with docking score $\leq -8$ kcal/mol and QED $\geq 0.7$.</p>
<p><strong>Baseline RL from prior</strong>: 50 epochs of staged learning with batch size 128 produced 119 hits from 6,400 generated molecules (1.9% hit rate), spread across 103 generic Bemis-Murcko scaffolds.</p>
<p><strong>Transfer learning + RL</strong>: After 10 epochs of TL on 315 congeneric pyridinone PDK1 actives from PubChem Assay AID1798002, the same 50-epoch RL run produced 222 hits (3.5% hit rate) across 176 unique generic scaffolds, nearly doubling productivity.</p>
<p>Both approaches generated top-scoring molecules (docking score of -10.1 kcal/mol each) with plausible binding poses reproducing key protein-ligand interactions seen in the native crystal structure, including hinge interactions with ALA 162 and contacts with LYS 111.</p>
<p>The paper also demonstrates the agent&rsquo;s plasticity through a molecular weight switching experiment: after 500 epochs driving generation toward 1500 Da molecules, switching the reward to favor molecules $\leq 500$ Da resulted in rapid adaptation within ~50 epochs, showing that the RL agent can recover from extreme biases.</p>
<h2 id="practical-software-for-ai-driven-drug-discovery">Practical Software for AI-Driven Drug Discovery</h2>
<p>REINVENT 4 represents a mature, well-documented framework that consolidates years of incremental development into a single codebase. Key practical features include TOML/JSON configuration, TensorBoard visualization, multinomial sampling and beam search decoding, diversity filters for scaffold-level novelty, experience replay (inception), and a plugin mechanism for extending the scoring subsystem.</p>
<p>The authors acknowledge that this is one approach among many and that there is no single solution that uniformly outperforms others. REINVENT has demonstrated strong sample efficiency in benchmarks and produced realistic 3D docking poses, but the paper does not claim universal superiority. The focus is on providing a well-engineered, transparent reference implementation rather than advancing a novel algorithm.</p>
<p>Limitations include that only the Mol2Mol prior supports stereochemistry, the training data biases constrain the explorable chemical space, and the SMILES-based representation inherits the known fragility of string-based molecular encodings.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training (Reinvent)</td>
          <td>ChEMBL 25</td>
          <td>~1.7M molecules</td>
          <td>Drug-like compounds</td>
      </tr>
      <tr>
          <td>Prior training (LibInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Scaffold-decoration pairs</td>
      </tr>
      <tr>
          <td>Prior training (LinkInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Fragment-linker pairs</td>
      </tr>
      <tr>
          <td>Prior training (Mol2Mol)</td>
          <td>ChEMBL 28 / PubChem</td>
          <td>~200B pairs</td>
          <td>Tanimoto similarity $\geq 0.50$</td>
      </tr>
      <tr>
          <td>Case study TL</td>
          <td>PubChem AID1798002</td>
          <td>315 compounds</td>
          <td>Congeneric PDK1 actives</td>
      </tr>
      <tr>
          <td>Case study docking</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 crystal structure</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimization</strong>: DAP (recommended), plus three deprecated alternatives (REINFORCE, A2C, MAULI)</li>
<li><strong>Decoding</strong>: Multinomial sampling (default, temperature $K = 1$) and beam search</li>
<li><strong>Diversity filter</strong>: Murcko scaffold, topological scaffold, scaffold similarity, same-SMILES penalty</li>
<li><strong>Experience replay</strong>: Inception memory with configurable size and sampling rate</li>
<li><strong>Gradient descent</strong>: Adam optimizer</li>
</ul>
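<p>Multinomial decoding can be sketched as a categorical draw from temperature-scaled softmax probabilities; this is illustrative only, not REINVENT 4's code:</p>
<pre><code class="language-python">import math, random

def sample_token(logits, temperature=1.0, rng=random):
    """One categorical draw from softmax(logits / temperature): T=1 keeps the
    model's distribution, lower T sharpens it, higher T flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if acc >= r:
            return i
    return len(exps) - 1  # guard against floating-point shortfall

rng = random.Random(0)
tokens = [sample_token([2.0, 0.5, 0.1], rng=rng) for _ in range(5)]
print(tokens)
</code></pre>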
<h3 id="models">Models</h3>
<p>All pre-trained priors are distributed with the repository: the RNN-based generators (Reinvent, LibInvent, LinkInvent) and the transformer-based Mol2Mol generator, the latter shipping in multiple similarity-conditioned variants.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Condition</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit rate (RL)</td>
          <td>1.9%</td>
          <td>50 epochs, batch 128</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Hit rate (TL+RL)</td>
          <td>3.5%</td>
          <td>10 TL + 50 RL epochs</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Scaffold diversity (RL)</td>
          <td>103 scaffolds</td>
          <td>From 119 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Scaffold diversity (TL+RL)</td>
          <td>176 scaffolds</td>
          <td>From 222 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Best docking score</td>
          <td>-10.1 kcal/mol</td>
          <td>Both methods</td>
          <td>Glide SP</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements. REINVENT 4 supports both GPU and CPU execution. Python 3.10+ is required, with PyTorch 1.x (2.0 also compatible) and RDKit 2022.9+.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/REINVENT4">REINVENT4</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Full framework with pre-trained priors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/DockStream">DockStream</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Docking wrapper for scoring</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Loeffler, H. H., He, J., Tibo, A., Janet, J. P., Voronov, A., Mervin, L. H., &amp; Engkvist, O. (2024). Reinvent 4: Modern AI-driven generative molecule design. <em>Journal of Cheminformatics</em>, 16, 20. <a href="https://doi.org/10.1186/s13321-024-00812-5">https://doi.org/10.1186/s13321-024-00812-5</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{loeffler2024reinvent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Reinvent 4: Modern AI-driven generative molecule design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Loeffler, Hannes H. and He, Jiazhen and Tibo, Alessandro and Janet, Jon Paul and Voronov, Alexey and Mervin, Lewis H. and Engkvist, Ola}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00812-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Link-INVENT: RL-Driven Molecular Linker Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</guid><description>Link-INVENT extends REINVENT for molecular linker design using RNN-based generation and reinforcement learning with flexible multi-parameter scoring.</description><content:encoded><![CDATA[<h2 id="a-method-for-generative-linker-design-with-reinforcement-learning">A Method for Generative Linker Design with Reinforcement Learning</h2>
<p>Link-INVENT is a <strong>Method</strong> paper that introduces a generative model for molecular linker design built on the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo design platform. The primary contribution is an encoder-decoder recurrent neural network (RNN) architecture that generates SMILES-based linkers connecting two molecular subunits, combined with a flexible multi-parameter optimization (MPO) scoring function and reinforcement learning (RL) to steer generation toward desired properties. Link-INVENT targets three practical drug discovery tasks: fragment linking, scaffold hopping, and <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">proteolysis targeting chimera</a> (PROTAC) design.</p>
<h2 id="why-linker-design-needs-flexible-multi-parameter-optimization">Why Linker Design Needs Flexible Multi-Parameter Optimization</h2>
<p>Generating suitable chemical linkers between molecular subunits is a central challenge in <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-based drug discovery</a> (FBDD), scaffold hopping, and PROTAC design. Traditional computational approaches rely on database searches, inherently limiting the generalizability of proposed linkers to the pre-defined collection. Recent deep learning methods (DeLinker, SyntaLinker, 3DLinker, DiffLinker) can generate novel linkers but offer limited support for optimizing specific physicochemical properties. Users can typically control only linker length and a few properties like hydrogen-bond donor count.</p>
<p>The key gaps that Link-INVENT addresses are:</p>
<ol>
<li><strong>Conditioning on both subunits</strong>: Prior RNN-based approaches (SAMOA) generate linkers conditioned only on the SMILES sequence seen so far, which may not account for the second molecular subunit. Link-INVENT conditions on both warheads simultaneously.</li>
<li><strong>Flexible scoring</strong>: Existing DL-based linker design tools lack the ability to define tailored MPO objectives. Link-INVENT inherits <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 4&rsquo;s</a> full scoring infrastructure and adds linker-specific properties.</li>
<li><strong>Generalizability</strong>: A single trained prior handles fragment linking, scaffold hopping, and PROTAC tasks without retraining.</li>
</ol>
<h2 id="core-innovation-conditional-linker-generation-with-augmented-likelihood-rl">Core Innovation: Conditional Linker Generation with Augmented Likelihood RL</h2>
<p>Link-INVENT&rsquo;s architecture is an encoder-decoder RNN adapted from the Lib-INVENT library design model. The encoder processes a pair of warheads (molecular subunits with defined exit vectors), and the decoder generates a linker token by token, yielding a connected molecule in SMILES format. The model uses three hidden layers of 512 LSTM cells with an embedding size of 256.</p>
<h3 id="training">Training</h3>
<p>The prior is trained on ChEMBL v27 data processed through reaction-based slicing to generate (linker, warheads pair, full molecule) tuples. <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a> augments the training data at each epoch, improving chemical space generalizability. The prior is trained by maximizing the likelihood of generating a linker conditioned on the input warhead pair, with teacher forcing for stability.</p>
<h3 id="multi-parameter-optimization-via-rl">Multi-Parameter Optimization via RL</h3>
<p>The scoring function $S(x)$ is a weighted geometric mean of individual component scores:</p>
<p>$$
S(x) = \left(\prod_{i=1}^{n} C_{i}(x)^{w_{i}}\right)^{\frac{1}{\sum_{i=1}^{n} w_{i}}}
$$</p>
<p>where $x$ is a sampled linked molecule, $C_{i}(x)$ is the score for the $i$-th component, and $w_{i}$ is its weight.</p>
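<p>The scoring function translates directly into code. A minimal sketch with hypothetical component scores and weights:</p>
<pre><code class="language-python">def mpo_score(component_scores, weights):
    """Weighted geometric mean S(x) of component scores C_i(x) in [0, 1]."""
    prod = 1.0
    for c, w in zip(component_scores, weights):
        prod *= c ** w
    return prod ** (1.0 / sum(weights))

# Hypothetical components: docking (0.8), linker MW (0.9), QED (0.5),
# with docking weighted twice as heavily as the others.
print(round(mpo_score([0.8, 0.9, 0.5], [2, 1, 1]), 3))  # 0.733
</code></pre>
<p>Because the mean is geometric, a single zero-valued component zeroes the total score, which is what makes hard filters effective under this aggregation.</p>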
<p>The agent (initialized as a copy of the prior) is updated via the Difference of Augmented and Posterior likelihoods (DAP) loss. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented log likelihood</a> is:</p>
<p>$$
\log \pi_{\text{augmented}} = \log \pi_{\text{prior}} + \sigma \cdot S(x)
$$</p>
<p>where $\pi$ denotes a policy (token sampling probabilities conditioned on the sequence so far) and $\sigma$ is a scalar factor. The loss function is:</p>
<p>$$
J(\theta) = \left(\log \pi_{\text{augmented}} - \log \pi_{\text{agent}}\right)^{2}
$$</p>
<p>Minimizing $J(\theta)$ steers the agent to generate molecules that satisfy the scoring function while remaining anchored to the prior&rsquo;s chemical space.</p>
<h3 id="diversity-filters">Diversity Filters</h3>
<p>Link-INVENT uses Diversity Filters (DFs) to balance exploration and exploitation. Buckets of limited size track unique <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/">Bemis-Murcko scaffolds</a>. When a bucket is full, further sampling of that scaffold receives a score of zero, encouraging the agent to explore diverse chemical space regions.</p>
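<p>The bucket mechanism can be sketched as follows; the class name and interface are invented for illustration, and scaffold extraction (e.g. Bemis-Murcko via RDKit) is left to the caller:</p>
<pre><code class="language-python">from collections import defaultdict

class DiversityFilter:
    """Scaffold-bucket diversity filter: once a scaffold's bucket is full,
    further molecules with that scaffold score zero, pushing the agent
    toward unexplored regions of chemical space."""
    def __init__(self, bucket_size=25):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(list)

    def filtered_score(self, scaffold, smiles, score):
        bucket = self.buckets[scaffold]
        if len(bucket) >= self.bucket_size:
            return 0.0  # bucket full: penalize further samples of this scaffold
        bucket.append(smiles)
        return score

df = DiversityFilter(bucket_size=2)
print([df.filtered_score("c1ccccc1", s, 0.9) for s in ("A", "B", "C")])
# third molecule of the same scaffold is zeroed: [0.9, 0.9, 0.0]
</code></pre>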
<h3 id="linker-specific-scoring-components">Linker-Specific Scoring Components</h3>
<p>New scoring components provide direct control over linker properties:</p>
<ul>
<li><strong>Linker effective length</strong>: number of bonds between attachment atoms</li>
<li><strong>Linker maximum graph length</strong>: bonds in the longest graph traversal path</li>
<li><strong>Linker length ratio</strong>: effective length divided by maximum graph length (controls branching)</li>
<li><strong>Linker ratio of rotatable bonds</strong>: rotatable bonds over total bonds (controls flexibility)</li>
<li><strong>Linker number of rings</strong>: controls linearity vs. cyclicity</li>
<li><strong>Linker number of HBDs</strong>: hydrogen-bond donors in the linker itself</li>
</ul>
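<p>The length-based components above can be illustrated on a toy linker graph using plain breadth-first search; atoms as integers and an adjacency mapping stand in for an RDKit molecule:</p>
<pre><code class="language-python">from collections import deque

def bfs_dist(adj, start):
    """Bond-count distances from `start` over an adjacency mapping."""
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def linker_lengths(adj, attach_a, attach_b):
    """Effective length: bonds between the attachment atoms.
    Maximum graph length: longest shortest path anywhere in the linker."""
    effective = bfs_dist(adj, attach_a)[attach_b]
    maximum = max(d for n in adj for d in bfs_dist(adj, n).values())
    return effective, maximum, effective / maximum

# Toy branched linker: chain 0-1-2-3 with a side chain 2-4-5.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
print(linker_lengths(adj, 0, 3))  # (3, 4, 0.75)
</code></pre>
<p>A ratio below 1 signals branching: here the side chain makes the longest path (0 to 5, four bonds) longer than the attachment-to-attachment path (three bonds).</p>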
<h2 id="experimental-evaluation-across-three-drug-discovery-tasks">Experimental Evaluation Across Three Drug Discovery Tasks</h2>
<p>Link-INVENT was evaluated through four experiments across three drug discovery applications, all using the same pre-trained prior.</p>
<h3 id="illustrative-example-two-benzene-rings">Illustrative Example: Two Benzene Rings</h3>
<p>A simple experiment linked two benzene rings with the objectives of limiting HBDs and requiring exactly one ring in the linker. Over 20 epochs, the agent learned to satisfy both objectives, demonstrating the basic RL-guided generation process.</p>
<h3 id="experiment-1a-fragment-linking-ck2-alpha-inhibitors">Experiment 1a: Fragment Linking (CK2 alpha Inhibitors)</h3>
<p>Based on the <a href="https://en.wikipedia.org/wiki/Casein_kinase_2">casein kinase 2</a> (CK2 alpha) fragment linking campaign by Fusco and Brear et al., Link-INVENT was tasked with linking two fragment hits while retaining the Lys68 hydrogen-bond interaction via a DockStream docking constraint (Glide/LigPrep backend). The scoring function also enforced linker length ratio &gt;= 70% and linker MW &lt;= 200 Da.</p>
<p>Over 100 epochs in triplicate, the agent generated molecules with gradually improving docking scores. Key results:</p>
<ul>
<li>Docking score distributions across triplicates were nearly identical, demonstrating reproducibility</li>
<li>Some generated molecules achieved more favorable docking scores than the reference ligand CAM4066 (-15.20 kcal/mol)</li>
<li>More than 5000 unique Bemis-Murcko scaffolds were generated, with minimal overlap across replicates</li>
<li>Binding pose analysis showed the generated linker closely resembled the ground-truth linker, retaining the Lys68 interaction</li>
</ul>
<h3 id="experiment-1b-comparison-fragment-linking-impdh-inhibitors">Experiment 1b: Comparison Fragment Linking (IMPDH Inhibitors)</h3>
<p>Using the IMPDH inhibitor fragment linking case study from Trapero et al., this experiment applied core constrained docking (fragment pose within 0.3 Å of the reference) and compared results to DeLinker and SyntaLinker. The scoring function enforced linker effective length in [3, 5], length ratio &gt;= 70%, and linker MW &lt;= 150 Da.</p>
<p>Link-INVENT generated 8960 SMILES across 70 epochs (comparable to DeLinker&rsquo;s 9000 molecular graphs). Results:</p>
<ul>
<li>Link-INVENT generated molecules with more favorable docking scores than the reference ligand across triplicate runs</li>
<li>Of the 20 DeLinker and 3 SyntaLinker example molecules, none of the former and only one of the latter (the recovered reference) docked as well as or better than the reference</li>
<li>Approximately 3000 unique Bemis-Murcko scaffolds were generated from 5000 total molecules</li>
<li>Link-INVENT&rsquo;s advantage comes from including docking explicitly as a learning objective rather than applying it post hoc</li>
</ul>
<h3 id="experiment-2-scaffold-hopping-dlk-inhibitor-cns-optimization">Experiment 2: Scaffold Hopping (DLK Inhibitor CNS Optimization)</h3>
<p>Based on Patel et al.&rsquo;s <a href="https://en.wikipedia.org/wiki/Dual_leucine_zipper_kinase">dual leucine zipper kinase</a> (DLK) inhibitor campaign, Link-INVENT generated new scaffold ideas to improve CNS penetration while retaining potency. The scoring function included a Cys193 docking constraint plus CNS-compatible properties (HBDs &lt; 2, tPSA &lt;= 90 Å², 3 &lt;= SlogP &lt;= 4, MW &lt;= 450 Da, 1-2 aromatic rings in linker).</p>
<p>The solution space was significantly narrower than fragment linking. The agent still generated diverse scaffolds with favorable docking scores, though fewer exceeded the reference ligand&rsquo;s score. Binding pose analysis confirmed retained Cys193 interactions and predicted additional Gln195 hydrogen bonds.</p>
<h3 id="experiment-3-protac-design-bcl-2mcl-1-dual-degradation">Experiment 3: PROTAC Design (Bcl-2/Mcl-1 Dual Degradation)</h3>
<p>Three sub-experiments demonstrated linker-specific scoring components for PROTAC design based on Wang et al.&rsquo;s Bcl-2/Mcl-1 dual degradation strategy:</p>
<table>
  <thead>
      <tr>
          <th>Sub-Experiment</th>
          <th>Objective</th>
          <th>Key Finding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sub-Exp 1: Linker length</td>
          <td>Generate linkers within specified length intervals [4,6], [7,9], [10,12], [13,15]</td>
          <td>Clear enrichment within target intervals vs. baseline broad distribution</td>
      </tr>
      <tr>
          <td>Sub-Exp 2: Linearity</td>
          <td>Control linear vs. cyclic linkers at fixed length [7,9]</td>
          <td>Baseline ratio ~1:2 linear:cyclic; enforcing linearity or cyclicity achieved strong enrichment</td>
      </tr>
      <tr>
          <td>Sub-Exp 3: Flexibility</td>
          <td>Generate linkers with Low [0,30], Moderate [40,60], or High [70,100] rotatable bond ratios</td>
          <td>Agent learned that rings and sp2 atoms yield rigidity; linear sp3 chains yield flexibility</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-practical-implications-for-drug-discovery">Key Findings and Practical Implications for Drug Discovery</h2>
<p>Link-INVENT demonstrates several practical advantages for molecular linker design:</p>
<ol>
<li><strong>Single prior, multiple tasks</strong>: The same pre-trained model handles fragment linking, scaffold hopping, and PROTAC design without retraining.</li>
<li><strong>Docking as a learning signal</strong>: Including molecular docking explicitly in the scoring function (via DockStream) during RL yields molecules with more favorable docking scores than approaches that apply docking post hoc.</li>
<li><strong>Implicit 3D awareness</strong>: The docking constraint guides the agent toward 3D structural awareness without explicit 3D coordinate inputs, as demonstrated by the overlap between generated and reference binding poses.</li>
<li><strong>Diverse and reproducible output</strong>: Diversity filters ensure exploration of multiple chemical space regions, and triplicate experiments show consistent docking score distributions with minimal scaffold overlap.</li>
</ol>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The linker flexibility metric (ratio of rotatable bonds) is agnostic to intra-molecular hydrogen bonds and does not account for all rigidity factors</li>
<li>Molecular docking is an approximation that can be exploited (e.g., excessive HBDs achieving favorable scores at the expense of permeability)</li>
<li>Experiments 1a and 1b require a proprietary Schrodinger license for Glide/LigPrep docking</li>
<li>No direct experimental (wet-lab) validation was performed in this study</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL v27 (reaction-sliced)</td>
          <td>Not specified</td>
          <td>Filtered for drug-like compounds, then reaction-based slicing with SMIRKS</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Held-out Bemis-Murcko scaffolds</td>
          <td>287 scaffolds</td>
          <td>Held out from training set</td>
      </tr>
      <tr>
          <td>SMILES augmentation</td>
          <td>Randomized SMILES per epoch</td>
          <td>Same tuples, different representations</td>
          <td>Improves generalizability</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder RNN with 3 hidden layers of 512 LSTM cells, embedding size 256</li>
<li><strong>RL loss</strong>: DAP (Difference of Augmented and Posterior likelihoods)</li>
<li><strong>Batch size</strong>: 128 molecules per epoch</li>
<li><strong>Diversity filter</strong>: Bemis-Murcko scaffold buckets of size 25</li>
<li><strong>Score threshold</strong>: 0 (to store all molecules for analysis)</li>
<li><strong>Scoring function</strong>: Weighted geometric mean of component scores</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Single pre-trained prior used across all experiments</li>
<li>Agent initialized as copy of prior, updated via RL</li>
<li>Pre-trained prior available at GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecular docking via DockStream with Glide/LigPrep backend</li>
<li>Triplicate runs for all experiments</li>
<li>Metrics: docking scores, unique Bemis-Murcko scaffold counts, binding pose overlap</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT (Link-INVENT code)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Main codebase for Link-INVENT</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity">ReinventCommunity (data + tutorial)</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training/validation data, reaction SMIRKS, pre-trained prior, Jupyter tutorial</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code, training data, and pre-trained prior are publicly available. However, reproducing the docking-based experiments (1a, 1b, and 2) requires a proprietary Schrodinger license for Glide and LigPrep. The PROTAC experiments (Experiment 3) that use only physicochemical scoring are fully reproducible with the open-source code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Knuth, F., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2023). Link-INVENT: generative linker design with reinforcement learning. <em>Digital Discovery</em>, 2, 392-408. <a href="https://doi.org/10.1039/D2DD00115B">https://doi.org/10.1039/D2DD00115B</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2023link,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Link-INVENT: generative linker design with reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Knuth, Franziska and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--408}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D2DD00115B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v3: Scaffold-Constrained Graph Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</guid><description>DrugEx v3 proposes a Graph Transformer with novel positional encoding for scaffold-constrained molecular generation via multi-objective reinforcement learning.</description><content:encoded><![CDATA[<h2 id="a-graph-transformer-method-for-scaffold-constrained-drug-design">A Graph Transformer Method for Scaffold-Constrained Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces DrugEx v3, a Graph Transformer model for scaffold-constrained de novo drug design. The primary contribution is a novel positional encoding scheme for molecular graphs that allows a Transformer architecture to operate on graph-structured molecular data rather than <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings. The model takes user-provided scaffold fragments as input and generates complete molecules through growing and connecting operations, trained with multi-objective reinforcement learning to optimize for both target affinity and drug-likeness.</p>
<h2 id="from-fixed-objectives-to-user-guided-scaffold-design">From Fixed Objectives to User-Guided Scaffold Design</h2>
<p>Prior versions of DrugEx (v1 and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">v2</a>) used RNN-based generators trained with reinforcement learning for de novo drug design, but they operated under fixed objectives and could not accept user-provided structural priors. If a medicinal chemist wanted to explore analogs of a specific scaffold, the model needed retraining from scratch. Meanwhile, SMILES-based molecular generators face inherent limitations for scaffold-constrained design: SMILES is a linear notation, so inserting fragments at multiple positions of a scaffold requires complex grammar handling, and small token changes can produce invalid molecules.</p>
<p>Several approaches had been proposed for scaffold-based generation, including graph generative models (Lim et al., 2019), DeepScaffold (Li et al., 2020), SMILES-based scaffold decorators (Arus-Pous et al., 2020), and SyntaLinker for fragment linking (Yang et al., 2020). DrugEx v3 aims to combine the advantages of graph representations (validity guarantees, local invariance, flexible extension) with the Transformer architecture&rsquo;s ability to handle complex dependencies, while maintaining the multi-objective reinforcement learning framework from DrugEx v2.</p>
<h2 id="graph-positional-encoding-for-molecular-transformers">Graph Positional Encoding for Molecular Transformers</h2>
<p>The core innovation is adapting the Transformer architecture to work directly with molecular graph representations. Two key modifications make this possible.</p>
<p><strong>Graph word encoding.</strong> Since atoms and bonds cannot be processed simultaneously in a graph, the authors combine them into a single index:</p>
<p>$$
W = T_{atom} \times 4 + T_{bond}
$$</p>
<p>where $T_{atom}$ is the atom type index and $T_{bond}$ is the bond type index (four bond types: single, double, triple, and none).</p>
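<p>As a concrete illustration of this packing (a minimal sketch; the constant and function names here are mine, not from the DrugEx codebase):</p>

```python
# Pack an atom type index and a bond type index into a single "graph word"
# index, W = T_atom * 4 + T_bond, and recover them again.
N_BOND_TYPES = 4  # single, double, triple, none

def encode_word(t_atom: int, t_bond: int) -> int:
    """W = T_atom * 4 + T_bond."""
    assert 0 <= t_bond < N_BOND_TYPES
    return t_atom * N_BOND_TYPES + t_bond

def decode_word(w: int) -> tuple:
    """Invert the packing to recover (T_atom, T_bond)."""
    return divmod(w, N_BOND_TYPES)
```

<p>Because there are exactly four bond types, the packing is lossless: <code>decode_word(encode_word(t, b))</code> always returns <code>(t, b)</code>.</p>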
<p><strong>Graph positional encoding.</strong> Standard sequential position encoding does not capture molecular topology. The authors propose an adjacency-matrix-based positional encoding:</p>
<p>$$
P = I_{Atom} \times L_{max} + I_{Connected}
$$</p>
<p>where $I_{Atom}$ is the current atom index, $L_{max}$ is the maximum sequence length, and $I_{Connected}$ is the index of the atom connected by the current bond. This encoding is then processed through the standard sinusoidal positional encoding:</p>
<p>$$
PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{m}})
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{m}})
$$</p>
<p>with $d_{m} = 512$.</p>
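<p>The two-step encoding can be sketched as follows (illustrative only; the $L_{max}$ value and function names are my assumptions, and the real model applies this per bond entry of the graph matrix):</p>

```python
import math

L_MAX = 128    # maximum sequence length (value here is illustrative)
D_MODEL = 512  # embedding dimension from the paper

def graph_position(i_atom: int, i_connected: int, l_max: int = L_MAX) -> int:
    """P = I_atom * L_max + I_connected (adjacency-based position index)."""
    return i_atom * l_max + i_connected

def sinusoidal_pe(pos: int, d_model: int = D_MODEL) -> list:
    """Standard sinusoidal encoding applied to the graph position index."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)       # even dimensions: sine
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe
```

<p>The topology-aware index $P$ replaces the token position $pos$, so two bonds touching the same atom pair map to nearby encodings regardless of generation order.</p>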
<p><strong>Molecule generation procedure.</strong> Each molecule in the training data is represented as a five-row matrix encoding atom type, bond type, connected atom index, current atom index, and fragment index. The columns are divided into three sections: fragment (the scaffold), growing (new atoms added to fragments), and linking (bonds connecting grown fragments). The decoder uses a GRU-based recurrent layer to sequentially output atom type, bond type, connected atom index, and current atom index at each step, with chemical valence rules enforced at every generation step to guarantee valid molecules.</p>
<p><strong>Multi-objective reinforcement learning.</strong> The generator is trained with a policy gradient objective:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) | \theta\right] = \sum_{t=1}^{T} \log G(y_{t} | y_{1:t-1}) \cdot R^{\ast}(y_{1:T})
$$</p>
<p>where $R^{*}$ is a Pareto-based reward combining target affinity and QED drug-likeness score:</p>
<p>$$
R^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>with $k$ being the solution&rsquo;s index in the Pareto rank. An exploration strategy uses two networks: an exploitation network $G_{\theta}$ (updated by policy gradient) and an exploration network $G_{\phi}$ (fixed, pre-trained on ChEMBL), with an exploration rate $\varepsilon$ controlling how many scaffolds are routed to $G_{\phi}$ during training.</p>
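<p>A minimal sketch of the reward and the $\varepsilon$-routing (my own illustrative code; the actual DrugEx implementation batches this over sampled molecules and GPU-accelerated Pareto fronts):</p>

```python
import random

def pareto_reward(k: int, desired: bool, n_desired: int, n_undesired: int) -> float:
    """Piecewise reward from the paper: desired solutions score in [0.5, 1],
    undesired ones in [0, 0.5]. k is the solution's index in the Pareto rank."""
    if desired:
        return 0.5 + (k - n_undesired) / (2 * n_desired)
    return k / (2 * n_undesired)

def route_scaffold(epsilon: float, rng: random.Random) -> str:
    """With probability epsilon the scaffold is handled by the fixed
    exploration network G_phi; otherwise by the trainable exploitation
    network G_theta."""
    return "G_phi" if rng.random() < epsilon else "G_theta"
```

<p>With this indexing, the top-ranked desired solution ($k = N_{desired} + N_{undesired}$) receives reward 1, and the boundary between the two cases meets at 0.5.</p>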
<h2 id="experimental-setup-architecture-comparison-and-rl-optimization">Experimental Setup: Architecture Comparison and RL Optimization</h2>
<h3 id="data">Data</h3>
<p>The ChEMBL set (version 27) contained approximately 1.7 million molecules for pre-training, preprocessed via RDKit (charge neutralization, metal/fragment removal). The LIGAND set comprised 10,828 adenosine receptor ligands for fine-tuning. Each molecule was decomposed into fragments using the BRICS algorithm, creating scaffold-molecule pairs (up to 15 pairs per molecule with four fragments). The ChEMBL set yielded 9.3 million training pairs, and the LIGAND set produced 53,888 training pairs.</p>
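<p>The &ldquo;up to 15 pairs&rdquo; figure follows from enumerating the non-empty subsets of at most four BRICS fragments ($2^4 - 1 = 15$). A small sketch (illustrative only; the real pipeline runs BRICS in RDKit and pairs each fragment subset with the full molecule):</p>

```python
from itertools import combinations

def scaffold_subsets(fragments):
    """Enumerate every non-empty subset of a molecule's fragments; each
    subset serves as one input scaffold, paired with the full molecule."""
    pairs = []
    for r in range(1, len(fragments) + 1):
        for subset in combinations(fragments, r):
            pairs.append(subset)
    return pairs
```

<p>Four fragments yield $4 + 6 + 4 + 1 = 15$ scaffold-molecule pairs, which is where the dataset sizes above come from.</p>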
<h3 id="architecture-comparison">Architecture comparison</h3>
<p>Four architectures were compared:</p>
<ol>
<li><strong>Graph Transformer</strong>: graph input with novel positional encoding</li>
<li><strong>Sequential Transformer</strong>: SMILES input with standard Transformer</li>
<li><strong>LSTM-BASE</strong>: SMILES encoder-decoder with three recurrent layers</li>
<li><strong>LSTM+ATTN</strong>: LSTM-BASE with an attention mechanism between encoder and decoder</li>
</ol>
<p>All models were pre-trained on ChEMBL and fine-tuned on the LIGAND set. The bioactivity predictor was a random forest regression model using 2048D ECFP6 fingerprints and 19D physicochemical descriptors, with an activity threshold of pX = 6.5 for the A2A adenosine receptor.</p>
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>Five metrics were used: validity (parseable molecules), accuracy (scaffold containment), desirability (meeting all objectives), uniqueness, and novelty (not in ChEMBL). Diversity was measured using the Solow-Polasky index with Tanimoto distance on ECFP6 fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\intercal} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
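<p>A self-contained sketch of the Solow-Polasky computation (my own implementation; I assume $F(\mathbf{s})$ is the pairwise similarity matrix, which the paper derives from Tanimoto distances on ECFP6 fingerprints). Rather than inverting $F$ explicitly, it solves $F\mathbf{x} = \mathbf{e}$:</p>

```python
def solve(F, b):
    """Solve F x = b by Gaussian elimination with partial pivoting."""
    n = len(F)
    A = [row[:] + [b[i]] for i, row in enumerate(F)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def solow_polasky(F):
    """I(A) = (1/|A|) e^T F^{-1} e, computed via the solve of F x = e."""
    n = len(F)
    x = solve(F, [1.0] * n)
    return sum(x) / n
```

<p>For a set of mutually dissimilar molecules ($F = I$) the index is 1; overlapping similarity pulls it below 1, consistent with the 0.84-0.88 values reported below.</p>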
<h3 id="hardware">Hardware</h3>
<p>Models were benchmarked on a server with NVIDIA Tesla P100 GPUs.</p>
<h2 id="key-results-graph-representation-advantages-and-rl-trade-offs">Key Results: Graph Representation Advantages and RL Trade-offs</h2>
<h3 id="pre-training-and-fine-tuning-performance">Pre-training and fine-tuning performance</h3>
<p>The Graph Transformer achieved the best overall performance across all metrics:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity (PT)</th>
          <th>Accuracy (PT)</th>
          <th>Validity (FT)</th>
          <th>Accuracy (FT)</th>
          <th>Novelty (FT)</th>
          <th>Uniqueness (FT)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph Transformer (512)</td>
          <td>100.0%</td>
          <td>99.3%</td>
          <td>100.0%</td>
          <td>99.2%</td>
          <td>68.9%</td>
          <td>82.9%</td>
      </tr>
      <tr>
          <td>Seq. Transformer (512)</td>
          <td>96.7%</td>
          <td>74.0%</td>
          <td>99.3%</td>
          <td>92.7%</td>
          <td>8.9%</td>
          <td>28.9%</td>
      </tr>
      <tr>
          <td>LSTM+ATTN (512)</td>
          <td>94.3%</td>
          <td>72.8%</td>
          <td>96.9%</td>
          <td>85.9%</td>
          <td>6.3%</td>
          <td>20.7%</td>
      </tr>
      <tr>
          <td>LSTM-BASE (512)</td>
          <td>93.9%</td>
          <td>52.4%</td>
          <td>98.7%</td>
          <td>81.6%</td>
          <td>3.9%</td>
          <td>19.2%</td>
      </tr>
  </tbody>
</table>
<p>PT = pre-trained, FT = fine-tuned. The Graph Transformer achieved 100% validity due to its explicit valence checking at each generation step. It also produced substantially more novel and unique molecules after fine-tuning compared to SMILES-based methods.</p>
<p>The authors identified four advantages of the graph representation over SMILES: (1) local invariance, where fragment ordering does not affect output; (2) global extendibility, where new atoms can be appended without restructuring existing data; (3) freedom from grammar constraints; and (4) direct accessibility of chemical valence rules for validity enforcement.</p>
<h3 id="reinforcement-learning-results">Reinforcement learning results</h3>
<p>With multi-objective RL (affinity + QED), 74.6% of generated molecules were predicted active at $\varepsilon = 0.0$. The exploration rate $\varepsilon$ trades off desirability against uniqueness:</p>
<table>
  <thead>
      <tr>
          <th>$\varepsilon$</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.0</td>
          <td>74.6%</td>
          <td>60.7%</td>
          <td>60.6%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.1</td>
          <td>66.8%</td>
          <td>75.0%</td>
          <td>74.6%</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>0.2</td>
          <td>61.6%</td>
          <td>80.2%</td>
          <td>79.4%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.3</td>
          <td>56.8%</td>
          <td>89.8%</td>
          <td>88.8%</td>
          <td>0.874</td>
      </tr>
  </tbody>
</table>
<p>The authors report that $\varepsilon = 0.3$ produced the best balance between desirability and uniqueness, with 56.8% desired molecules and 89.8% uniqueness. Diversity remained above 0.84 across all settings.</p>
<h3 id="limitations">Limitations</h3>
<p>The Graph Transformer produced molecules with worse synthetic accessibility (SA scores) compared to SMILES-based methods, particularly after fine-tuning on the smaller LIGAND set. The authors attribute this to uncommon ring systems generated when the model handles long-distance dependencies. A kekulization issue also causes a small fraction of molecules to fail scaffold matching: aromatic bond inference during sanitization can alter the scaffold substructure. Under single-objective optimization (affinity alone, without the QED constraint), the model generates molecules with molecular weight exceeding 500 Da, reducing drug-likeness. All bioactivity predictions rely on a random forest model rather than experimental validation, and the t-SNE analysis suggests some generated molecules fall outside the model&rsquo;s applicability domain.</p>
<h3 id="future-directions">Future directions</h3>
<p>The authors propose extending the Graph Transformer to accept protein information as input via proteochemometric modeling, enabling design of ligands for targets without known ligands. Lead optimization, where a &ldquo;hit&rdquo; serves as input to generate improved analogs, is also identified as a natural extension.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v27</td>
          <td>~1.7M molecules (9.3M scaffold-molecule pairs)</td>
          <td>Preprocessed via RDKit</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LIGAND set (A2A AR ligands from ChEMBL)</td>
          <td>10,828 ligands (53,888 pairs)</td>
          <td>Split 8:1:1 train/val/test</td>
      </tr>
      <tr>
          <td>Bioactivity labels</td>
          <td>ChEMBL A2A AR activity data</td>
          <td>pX threshold = 6.5</td>
          <td>Average pChEMBL values</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fragment decomposition: BRICS algorithm via RDKit (max 4 fragments per molecule)</li>
<li>Optimizer: Adam with learning rate $10^{-4}$, batch size 256</li>
<li>Pre-training: 20 epochs; fine-tuning: up to 1,000 epochs with early stopping (patience: 100 epochs)</li>
<li>Bioactivity predictor: random forest regression (scikit-learn) with 2048D ECFP6 + 19D physicochemical descriptors</li>
<li>Pareto-based multi-objective ranking with GPU acceleration</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Graph Transformer: 512 hidden units, 8 attention heads, $d_{k} = d_{v} = 64$</li>
<li>Sequential Transformer: same hidden size, sinusoidal positional encoding</li>
<li>LSTM-BASE / LSTM+ATTN: 128 embedding units, 512 hidden units, 3 recurrent layers</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Graph Transformer</th>
          <th>Best SMILES Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (fine-tuned)</td>
          <td>100.0%</td>
          <td>99.6% (LSTM-BASE 1024)</td>
          <td>Valence checking guarantees validity</td>
      </tr>
      <tr>
          <td>Accuracy (fine-tuned)</td>
          <td>99.2%</td>
          <td>94.3% (Seq. Transformer 1024)</td>
          <td>Scaffold containment</td>
      </tr>
      <tr>
          <td>Desirability (RL, $\varepsilon$=0.0)</td>
          <td>74.6%</td>
          <td>N/A</td>
          <td>Only Graph Transformer used for RL</td>
      </tr>
      <tr>
          <td>Diversity (RL)</td>
          <td>0.879</td>
          <td>N/A</td>
          <td>Solow-Polasky index</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware-1">Hardware</h3>
<p>NVIDIA Tesla P100 GPUs. Specific training times not reported, but Transformer models trained faster than LSTM models with the same hidden layer size.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CDDLeiden/DrugEx">CDDLeiden/DrugEx</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (v1, v2, v3)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v27</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Pre-training data source</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P., &amp; van Westen, G. J. P. (2023). DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. <em>Journal of Cheminformatics</em>, 15, 24. <a href="https://doi.org/10.1186/s13321-023-00694-z">https://doi.org/10.1186/s13321-023-00694-z</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2023drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00694-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Curriculum Learning for De Novo Drug Design (REINVENT)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</guid><description>Curriculum learning applied to REINVENT accelerates convergence on complex multi-parameter drug design objectives compared to standard reinforcement learning.</description><content:encoded><![CDATA[<h2 id="curriculum-learning-as-a-method-for-molecular-generation">Curriculum Learning as a Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces curriculum learning (CL) into the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo molecular design platform. The primary contribution is a training strategy that decomposes complex multi-parameter optimization (MPO) objectives into sequences of simpler tasks with increasing complexity. The agent learns each simpler task before progressing to the full production objective, accelerating convergence and improving the quality and diversity of generated molecules compared to standard policy-based reinforcement learning (RL).</p>
<h2 id="the-computational-cost-of-complex-reward-functions">The Computational Cost of Complex Reward Functions</h2>
<p>Policy-based RL for molecular design works by training a generative model (the agent) to produce molecules that maximize a reward function. In practice, drug design reward functions often include computationally expensive components such as molecular docking. When the reward landscape is complex and minima are difficult to find, the agent may spend many epochs sampling molecules far from the desired objective. The resulting small gradients cause minimal policy updates, leading to long periods of non-productivity. This is particularly wasteful when each reward evaluation involves expensive physics-based computations.</p>
<p>The core problem is that standard RL treats the full MPO objective as a monolithic task. If the agent cannot find any rewarding molecules early in training, it receives near-zero gradients and makes negligible progress. This creates a bootstrapping problem: the agent needs to already be sampling from favorable regions of chemical space to receive useful learning signals, but it has no guidance on how to get there.</p>
<p>Curriculum learning, originally proposed by Bengio et al. (2009), addresses this by arranging training tasks in order of increasing difficulty. When constituent tasks are correlated with the final objective, the gradients from simpler tasks provide more effective traversal of the optimization landscape.</p>
<h2 id="formalized-curriculum-strategy-for-reinvent">Formalized Curriculum Strategy for REINVENT</h2>
<p>The key innovation is a two-phase training protocol with formal definitions for curriculum progression.</p>
<p>A scoring function maps SMILES strings to desirability scores in $[0, 1]$ using a weighted geometric mean:</p>
<p>$$S(x) = \left(\prod_{i=1}^{n} c_{i}(x)^{w_{i}}\right)^{1 / \sum_{i=1}^{n} w_{i}}$$</p>
<p>where $x$ is a sampled compound in SMILES format, $c_{i}$ is the $i$-th scoring component, and $w_{i}$ is its weight.</p>
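<p>The scoring function is straightforward to sketch (illustrative code; the component scorers themselves, e.g. QED or a docking score transform, are abstracted away here as precomputed values in $[0, 1]$):</p>

```python
def desirability(scores, weights):
    """Weighted geometric mean S(x) = (prod_i c_i^{w_i})^(1 / sum_i w_i),
    where each component score c_i lies in [0, 1]."""
    assert len(scores) == len(weights)
    assert all(0.0 <= c <= 1.0 for c in scores)
    prod = 1.0
    for c, w in zip(scores, weights):
        prod *= c ** w
    return prod ** (1.0 / sum(weights))
```

<p>The geometric mean makes the aggregate score sensitive to any single poor component: one zero-scoring component drives $S(x)$ to zero regardless of the others.</p>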
<p>A Curriculum $C$ consists of a sequence of Objectives $O = \{O_{C_1}, \ldots, O_{C_n}, O_{P}\}$, where subscripts $C$ and $P$ denote Curriculum and Production Objectives respectively. Each Objective has a corresponding scoring function. Progression is controlled by Curriculum Progression Criteria $P = \{P_{1}, \ldots, P_{n}\}$, where each $P_{i}$ defines a score threshold the agent must achieve before advancing to the next objective.</p>
<p><strong>Curriculum Phase</strong>: The agent trains on sequential Curriculum Objectives with increasing complexity. A diversity filter is not applied during this phase, as it could be counterproductive to guiding the agent toward favorable chemical space. No computationally expensive components (e.g., docking) are used in Curriculum Objectives.</p>
<p><strong>Production Phase</strong>: Activated only when the final Curriculum Progression Criterion $P_{n}$ is satisfied. The agent now optimizes the full Production Objective, which may include expensive components like molecular docking. A new inception memory is initialized (clearing Curriculum Phase compounds), and a Bemis-Murcko scaffold diversity filter is applied to encourage exploration across multiple local minima.</p>
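<p>The two phases can be summarized schematically (structure only; the agent interface, method names, and epoch logic below are placeholders of my own, not REINVENT&rsquo;s API):</p>

```python
# Schematic of the two-phase curriculum protocol described above.
def run_curriculum(agent, curriculum_objectives, thresholds,
                   production_objective, max_epochs=2000):
    # Curriculum Phase: no diversity filter, no expensive scoring components;
    # advance only once the mean score clears the progression criterion.
    for objective, threshold in zip(curriculum_objectives, thresholds):
        for _ in range(max_epochs):
            mean_score = agent.train_epoch(objective)
            if mean_score >= threshold:
                break
    # Production Phase: fresh inception memory and a Bemis-Murcko scaffold
    # diversity filter; expensive components (e.g. docking) now allowed.
    agent.reset_inception_memory()
    agent.enable_diversity_filter("IdenticalMurckoScaffold")
    for _ in range(max_epochs):
        agent.train_epoch(production_objective)
```

<p>The important structural point is the ordering: the inception memory reset and the diversity filter activate only after the last Curriculum Progression Criterion is satisfied.</p>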
<p>The implementation builds on REINVENT&rsquo;s RNN architecture: three hidden layers of 512 LSTM cells with an embedding size of 256 and a linear layer with softmax activation, pretrained on ChEMBL to learn SMILES syntax.</p>
<h2 id="three-experiments-on-pdk1-inhibitor-design">Three Experiments on PDK1 Inhibitor Design</h2>
<p>The authors evaluate CL on three molecular design tasks of increasing complexity, all centered on designing <a href="https://en.wikipedia.org/wiki/PDPK1">3-phosphoinositide-dependent protein kinase-1</a> (PDK1) inhibitors.</p>
<h3 id="experiment-1-target-scaffold-construction">Experiment 1: Target Scaffold Construction</h3>
<p>The goal is to generate compounds possessing a dihydro-pyrazoloquinazoline scaffold with a phenyl substituent, a scaffold not present in the prior&rsquo;s training set. Standard RL fails entirely over 2000 epochs because the probability of randomly sampling a compound with this scaffold is negligibly small, producing binary rewards (1.0 if scaffold present, 0.5 otherwise) that never rise above 0.5.</p>
<p>CL decomposes the target scaffold into 5 progressively complex substructures. Each Curriculum Progression Criterion threshold is set to 0.8. The agent learns to generate compounds with each substructure before advancing. CL finds the target scaffold within 1750 epochs, while baseline RL cannot find it in the same timeframe.</p>
<h3 id="experiments-2-and-3-molecular-docking-constraints">Experiments 2 and 3: Molecular Docking Constraints</h3>
<p>These experiments use a Production Objective combining a molecular docking constraint (retaining two hydrogen-bonding interactions with Ala 162 of PDK1, PDB ID: 2XCH) and QED (Quantitative Estimate of Druglikeness). Both experiments limit computational cost by capping production epochs at 300.</p>
<p><strong>Experiment 2</strong> uses Tanimoto (2D) similarity to a reference ligand as the Curriculum Objective. Two scenarios are tested: &ldquo;Low&rdquo; (threshold 0.5) and &ldquo;High&rdquo; (threshold 0.8).</p>
<p><strong>Experiment 3</strong> uses ROCS (3D) shape-based similarity to the reference ligand as the Curriculum Objective, with &ldquo;Low&rdquo; (0.5) and &ldquo;High&rdquo; (0.75) thresholds.</p>
<p>All experiments are run in triplicate. The baseline includes both standard RL and RL with Tanimoto/ROCS components added directly to the scoring function (not sequentially), to control for the presence of these components.</p>
<p>Across all CL experiments, CL generates between 2,941 and 9,068 more compounds with docking scores better than the reference ligand (-10.907 kcal/mol) compared to baseline RL, corresponding to 12.42-23.79% improvement in the fraction of compounds exceeding the reference. Between the Curriculum Objectives, the &ldquo;High&rdquo; threshold scenario outperforms the &ldquo;Low&rdquo; scenario by 316-3,415 additional compounds (with percentage improvements ranging from -0.4% to 10.57%).</p>
<p>Baseline RL produces essentially no compounds satisfying the docking constraint for the first 100 epochs. CL agents achieve immediate productivity: in the &ldquo;High&rdquo; Tanimoto scenario, the initial docking score already exceeds the maximum score achieved by baseline RL over 300 epochs.</p>
<h3 id="scaffold-diversity-analysis">Scaffold Diversity Analysis</h3>
<p>CL generates more unique Bemis-Murcko scaffolds than baseline RL in all experiments. The &ldquo;High&rdquo; scenarios produce more unique scaffolds than the &ldquo;Low&rdquo; scenarios. CL also produces a higher fraction of &ldquo;favorable&rdquo; scaffolds (those with better docking scores than the reference ligand).</p>
<h2 id="accelerated-convergence-with-a-diversity-trade-off">Accelerated Convergence with a Diversity Trade-off</h2>
<p>The results demonstrate three consistent findings across all experiments:</p>
<ol>
<li>
<p><strong>Accelerated productivity</strong>: CL agents reach productive sampling of favorable compounds substantially faster than baseline RL. Even a single Curriculum Objective with a computationally inexpensive metric can guide the agent to regions of chemical space where expensive Production Objectives are readily satisfied.</p>
</li>
<li>
<p><strong>Improved output quality</strong>: CL generates more compounds with favorable docking scores, more unique scaffolds, and a higher fraction of scaffolds that outperform the reference ligand.</p>
</li>
<li>
<p><strong>Controllable trade-off</strong>: The Curriculum Progression Criterion threshold provides direct control over agent policy. Higher thresholds produce better Production Objective optimization but reduce intra-set diversity (higher cross-Tanimoto similarities among generated compounds). UMAP visualizations confirm that &ldquo;Low&rdquo; and &ldquo;High&rdquo; scenarios sample from nearby but distinct regions of chemical space.</p>
</li>
</ol>
<p>The authors note that even moderate optimization of similarity-based Curriculum Objectives (the &ldquo;Low&rdquo; scenarios) already substantially narrows the agent&rsquo;s perceived solution space. This suggests that CL inherently regularizes the agent policy, trading some diversity for convergence speed.</p>
<p><strong>Limitations</strong>: The authors acknowledge that data supporting the findings are available only upon request rather than as public deposits. The approach is demonstrated on a single target (PDK1), and the curriculum design requires domain expertise to decompose objectives appropriately. The inverse relationship between Curriculum Objective optimization and solution diversity means practitioners must carefully tune thresholds for their specific applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>Not specified</td>
          <td>Used to pretrain the RNN on SMILES syntax</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 receptor crystal structure</td>
      </tr>
  </tbody>
</table>
<p>Raw data supporting the findings are available from the corresponding author upon request.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>REINVENT platform with LSTM-based RNN (3 hidden layers, 512 cells, embedding size 256)</li>
<li>Scoring function: weighted geometric mean of components</li>
<li>Curriculum Progression Criteria: score thresholds (0.5 or 0.75-0.8 depending on scenario)</li>
<li>Diversity filter: Identical Murcko Scaffold with bucket size 25 (Production Phase only)</li>
<li>Inception (experience replay) for both phases, reset at phase transition</li>
<li>Batch size: 128, learning rate: 0.0001, sigma: 128, Adam optimizer</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Prior: RNN pretrained on ChEMBL SMILES</li>
<li>Agent: Initialized from prior, focused via RL/CL</li>
<li>No pretrained model weights are publicly released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score (Glide SP)</td>
          <td>Predicted binding affinity (kcal/mol)</td>
          <td>Lower is better; reference ligand: -10.907</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimate of Druglikeness</td>
          <td>Range [0, 1]</td>
      </tr>
      <tr>
          <td>Unique Bemis-Murcko scaffolds</td>
          <td>Scaffold diversity measure</td>
          <td>Averaged over triplicates</td>
      </tr>
      <tr>
          <td>Cross-Tanimoto similarity</td>
          <td>Intra-set compound diversity</td>
          <td>Calculated on pooled triplicates</td>
      </tr>
      <tr>
          <td>Tanimoto/ROCS similarity</td>
          <td>Curriculum Objective metrics</td>
          <td>2D fingerprint and 3D shape similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>GPU: NVIDIA Tesla V100 (32 GB)</li>
<li>Docking: AWS p3.8xlarge instance</li>
<li>LigPrep parallelized over 8 CPU cores</li>
<li>Glide docking parallelized over 48 CPU cores via DockStream</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>De novo molecular design platform</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity/blob/master/notebooks/Automated_Curriculum_Learning_Demo.ipynb">CL Tutorial Notebook</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Jupyter notebook tutorial for curriculum learning</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Fialková, V., Arango, J. D., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2022). Improving de novo molecular design with curriculum learning. <em>Nature Machine Intelligence</em>, 4, 555-563. <a href="https://doi.org/10.1038/s42256-022-00494-4">https://doi.org/10.1038/s42256-022-00494-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2022curriculum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving de novo molecular design with curriculum learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Fialkov{\&#39;a}, Vendy and Arango, Juan Diego and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{555--563}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00494-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Augmented Hill-Climb for RL-Based Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</guid><description>Augmented Hill-Climb combines REINVENT and Hill-Climb RL strategies to improve sample efficiency ~45-fold for SMILES-based de novo molecule generation.</description><content:encoded><![CDATA[<h2 id="a-hybrid-rl-strategy-for-de-novo-molecule-generation">A Hybrid RL Strategy for De Novo Molecule Generation</h2>
<p>This is a <strong>Method</strong> paper that proposes Augmented Hill-Climb (AHC), a reinforcement learning strategy for conditioning SMILES-based language models during de novo molecule generation. The primary contribution is a simple hybrid between the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and Hill-Climb (HC) RL strategies that computes the REINVENT loss function only on the top-k highest-scoring molecules per batch (as in HC), thereby removing the counterproductive regularization effect of low-scoring molecules. The authors demonstrate that AHC improves optimization ability ~1.5-fold and sample efficiency ~45-fold compared to REINVENT across docking tasks against four <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> targets, and that the approach generalizes to transformer architectures.</p>
<h2 id="sample-efficiency-bottleneck-in-rl-guided-molecular-generation">Sample Efficiency Bottleneck in RL-Guided Molecular Generation</h2>
<p>Recurrent neural networks trained on SMILES have become a standard approach for de novo molecule generation, with RL strategies like REINVENT and Hill-Climb achieving top performance on benchmarks such as <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>. However, RL-guided generation can be highly <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/">sample-inefficient</a>, often requiring $10^5$ or more molecules to optimize complex objectives. This is acceptable for cheap scoring functions (e.g., QSAR models, property calculators) but becomes a practical bottleneck when using computationally expensive scoring functions like molecular docking or computer-aided synthesis planning.</p>
<p>The REINVENT strategy regularizes the agent by computing a loss based on the difference between the agent&rsquo;s policy and an &ldquo;augmented likelihood&rdquo; that combines the prior policy with a scaled reward. When low-scoring molecules are sampled ($R_T \approx 0$), the augmented likelihood reduces to the prior likelihood, causing the agent to trend back toward the prior policy. This undoes useful learning, especially early in training or when the objective is difficult. Meanwhile, Hill-Climb simply fine-tunes the RNN on the top-k molecules per batch, which is sample-efficient but lacks explicit regularization, leading to mode collapse and generation of invalid SMILES.</p>
<p>Previous work by Neil et al. compared RL strategies but did not clearly quantify sample-efficiency differences, and modifications to the REINVENT loss function by Fialkova et al. showed no significant improvement. The best agent reminder (BAR) mechanism offered modest gains but was originally tested on graph-based models.</p>
<h2 id="core-innovation-filtering-low-scoring-molecules-from-the-reinvent-loss">Core Innovation: Filtering Low-Scoring Molecules from the REINVENT Loss</h2>
<p>Augmented Hill-Climb combines the loss formulation of REINVENT with the top-k selection mechanism of Hill-Climb. The agent samples a batch of molecules, ranks them by reward, and computes the REINVENT loss only on the top-k molecules. This removes the counterproductive regularization caused by low-scoring molecules while retaining the prior-based regularization for high-scoring molecules.</p>
<p>The REINVENT loss defines an augmented likelihood:</p>
<p>$$
\log P_{\mathbb{U}}(A) = \log P_{prior}(A) + \sigma R_T
$$</p>
<p>where $\sigma$ is a scaling coefficient controlling the reward contribution. The agent loss is the squared difference between the augmented likelihood and the agent&rsquo;s log-likelihood:</p>
<p>$$
L(\theta) = \left[\log P_{\mathbb{U}}(A) - \log P_{agent}(A)\right]^2
$$</p>
<p>In standard REINVENT, this loss is computed over all molecules in the batch. When $R_T \approx 0$, the augmented likelihood collapses to the prior likelihood, pushing the agent back toward the prior. AHC avoids this by computing the loss only on the top-k molecules ranked by reward, exactly as Hill-Climb selects molecules for fine-tuning.</p>
<p>The key insight is that high-scoring molecules are still regularized by the prior component of the augmented likelihood ($\log P_{prior}(A)$), preventing catastrophic forgetting. Low-scoring molecules, which would otherwise pull the agent back toward the prior, are simply excluded from the loss computation.</p>
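<p>The update described above can be sketched as follows (a NumPy sketch under the assumption that per-molecule log-likelihoods and rewards are already computed; in practice the loss would be backpropagated through the agent network):</p>

```python
import numpy as np

def ahc_loss(log_p_prior, log_p_agent, rewards, sigma=60.0, topk_frac=0.5):
    """One Augmented Hill-Climb loss evaluation (sketch).

    log_p_prior, log_p_agent: per-molecule sequence log-likelihoods under
    the fixed prior and the trainable agent. rewards: scores R_T in [0, 1].
    Only the top-k molecules by reward contribute to the loss.
    """
    k = max(1, int(len(rewards) * topk_frac))
    top = np.argsort(rewards)[::-1][:k]            # top-k indices by reward
    aug = log_p_prior[top] + sigma * rewards[top]  # augmented likelihood
    return np.mean((aug - log_p_agent[top]) ** 2)  # squared-difference loss
```

<p>Setting <code>topk_frac=1.0</code> recovers the standard REINVENT loss over the whole batch; dropping the prior term would recover plain Hill-Climb fine-tuning.</p>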
<h3 id="diversity-filters-to-prevent-mode-collapse">Diversity Filters to Prevent Mode Collapse</h3>
<p>AHC is more susceptible to mode collapse than REINVENT because it focuses learning on high-scoring molecules. The authors address this with diversity filters (DFs) that penalize the reward of molecules similar to previously generated ones. Through a hyperparameter search over 825 configurations on three GuacaMol tasks, they identify an optimal DF configuration (DF2) with:</p>
<ul>
<li>Minimum score threshold of 0.5 (lower than DF1&rsquo;s 0.8)</li>
<li>Linear penalization output mode (softer than binary)</li>
<li>Bin size of 50 (larger than DF1&rsquo;s 25)</li>
<li>Scaffold similarity based on ECFP4 fingerprints</li>
</ul>
<p>The authors find that stricter DFs (lower thresholds, smaller bins) better prevent mode collapse but reduce optimization performance, while more lenient DFs enable better learning of chemotype-reward associations. DF2 represents a compromise.</p>
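<p>A minimal bucket-based diversity filter might look like the following (a sketch: scaffold extraction is abstracted to a string key, where a real implementation would compute Murcko scaffolds with RDKit; the decay schedule is illustrative):</p>

```python
from collections import defaultdict

class DiversityFilter:
    """Penalize rewards for over-sampled scaffolds (sketch).

    Once a scaffold's bucket fills past bin_size, molecules mapping to
    that bucket receive a linearly decayed reward, nudging the agent
    toward new chemotypes.
    """
    def __init__(self, min_score=0.5, bin_size=50):
        self.min_score = min_score
        self.bin_size = bin_size
        self.buckets = defaultdict(int)

    def __call__(self, scaffold, score):
        if score < self.min_score:
            return score                 # low scorers are not tracked
        self.buckets[scaffold] += 1
        over = self.buckets[scaffold] - self.bin_size
        if over <= 0:
            return score                 # bucket not yet full
        # linear penalization: decay with bucket overflow
        return max(0.0, score * (1.0 - over / self.bin_size))
```

<p>Tightening <code>min_score</code> and <code>bin_size</code> corresponds to the stricter DF configurations discussed above: stronger mode-collapse protection at the cost of optimization performance.</p>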
<h2 id="experimental-setup-docking-tasks-and-benchmark-comparisons">Experimental Setup: Docking Tasks and Benchmark Comparisons</h2>
<p>The evaluation spans five experiments:</p>
<p><strong>Experiment 1</strong>: AHC vs. REINVENT on DRD2 docking over 100 RL updates (6,400 samples), varying $\sigma$ from 30 to 240. RNN trained on the MOSESn dataset (MOSES with neutralized charges, 2.45M molecules).</p>
<p><strong>Experiment 2</strong>: AHC + DF2 vs. REINVENT on four GPCR targets (DRD2, OPRM1, AGTR1, OX1R) over 500 RL updates. Docking performed with Glide-SP after ligand preparation with LigPrep.</p>
<p><strong>Experiment 3</strong>: Diversity filter hyperparameter search (825 configurations) on three GuacaMol tasks (<a href="https://en.wikipedia.org/wiki/Aripiprazole">Aripiprazole</a> similarity, C11H24 isomers, <a href="https://en.wikipedia.org/wiki/Osimertinib">Osimertinib</a> MPO) using the GuacaMol training set (1.27M molecules from ChEMBL24).</p>
<p><strong>Experiment 4</strong>: Benchmark of AHC against REINFORCE, REINVENT (v1 and v2), BAR, and Hill-Climb (with and without KL regularization) on six tasks of varying difficulty:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Difficulty</th>
          <th>Objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heavy atoms</td>
          <td>Easy</td>
          <td>Maximize number of heavy atoms</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Risperidone">Risperidone</a> similarity</td>
          <td>Easy</td>
          <td>Maximize Tanimoto similarity to Risperidone</td>
      </tr>
      <tr>
          <td>DRD2 activity</td>
          <td>Medium</td>
          <td>Maximize QSAR-predicted DRD2 activity</td>
      </tr>
      <tr>
          <td>DRD2 docking</td>
          <td>Medium</td>
          <td>Minimize Glide-SP docking score</td>
      </tr>
      <tr>
          <td>DRD2-DRD3 dual</td>
          <td>Hard</td>
          <td>Maximize predicted activity against both targets</td>
      </tr>
      <tr>
          <td>DRD2/DRD3 selective</td>
          <td>Hard</td>
          <td>Maximize selective DRD2 activity over DRD3</td>
      </tr>
  </tbody>
</table>
<p><strong>Experiment 5</strong>: AHC vs. REINVENT on transformer (Tr) and gated transformer (GTr) architectures on the same six benchmark tasks. The GTr implements a GRU-style gate in place of residual connections to stabilize RL training.</p>
<h3 id="rnn-and-transformer-architectures">RNN and Transformer Architectures</h3>
<p>Three RNN configurations were used: (1) embedding 128 + 3 GRU layers of 512 (REINVENT v1), (2) embedding 256 + 3 LSTM layers of 512 (REINVENT 2.0), (3) 3 LSTM layers of 512 with dropout 0.2 (GuacaMol). Transformers used 4 encoder layers with hidden dimension 512, 8 attention heads, and feed-forward dimension 1024.</p>
<p>QSAR models for DRD2 and DRD3 activity were random forest classifiers trained on ExCAPE-DB data with GHOST threshold identification for handling class imbalance.</p>
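<p>The GHOST idea is to shift the classifier&rsquo;s decision threshold, rather than resample, to compensate for class imbalance. A bare-bones version that scans candidate cutoffs and keeps the one maximizing Cohen&rsquo;s kappa might look like this (a sketch of the general technique, not the GHOST package API):</p>

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels: agreement corrected for chance."""
    n = len(y_true)
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n   # observed agreement
    p1t = sum(y_true) / n
    p1p = sum(y_pred) / n
    pe = p1t * p1p + (1 - p1t) * (1 - p1p)                 # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 0.0

def ghost_threshold(y_true, y_prob, grid=None):
    """Pick the probability cutoff maximizing kappa on held-out predictions."""
    grid = grid or [i / 20 for i in range(1, 20)]          # 0.05 .. 0.95
    return max(grid, key=lambda t: cohen_kappa(y_true, [int(p >= t) for p in y_prob]))
```

<p>With a heavily imbalanced active/inactive split like the ExCAPE-DB sets above, the selected cutoff typically lands well away from the default 0.5.</p>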
<h2 id="key-findings-45-fold-sample-efficiency-improvement">Key Findings: 45-Fold Sample Efficiency Improvement</h2>
<h3 id="experiment-1-ahc-consistently-outperforms-reinvent">Experiment 1: AHC Consistently Outperforms REINVENT</h3>
<p>AHC improved optimization ability by 1.39-fold over REINVENT averaged across all $\sigma$ values, with maximum optimization of 205% at $\sigma = 240$ (compared to 128% for REINVENT). AHC required ~80 fewer RL steps to match REINVENT&rsquo;s mean docking score at 100 steps. With DF1 applied, the improvement was 1.45-fold.</p>
<p>AHC showed greater sensitivity to $\sigma$, giving practitioners more control over the regularization-optimization trade-off. At $\sigma = 60$ (heavily regularized), AHC still improved 1.47-fold over REINVENT while maintaining property space defined by the MOSESn training set. At higher $\sigma$ values, AHC extrapolated further outside the training distribution, which can be favorable (novel chemical space) or unfavorable (scoring function exploitation, e.g., larger molecules getting better docking scores due to the additive nature of scoring functions).</p>
<h3 id="experiment-2-improvement-across-four-gpcr-targets">Experiment 2: Improvement Across Four GPCR Targets</h3>
<p>Across DRD2, OPRM1, AGTR1, and OX1R, AHC + DF2 required on average 7.4-fold fewer training steps and 45.5-fold fewer samples to reach optimization thresholds. The improvement was largest early in training: 19.8-fold fewer steps to reach 120% optimization, and 71.8-fold fewer samples to first produce a molecule exceeding 160% optimization.</p>
<p>AHC + DF2 surpassed the 80% retrospective precision threshold within 100 RL updates for all targets except the challenging OX1R. DF2 successfully stabilized learning, avoiding the convergence-to-threshold failure mode observed with DF1.</p>
<p>Scaffold analysis showed AHC generates similar chemistry to REINVENT. The top 500 scaffolds produced by REINVENT were also generated by AHC, but typically much sooner.</p>
<h3 id="experiment-4-benchmark-against-all-rl-strategies">Experiment 4: Benchmark Against All RL Strategies</h3>
<p>AHC outperformed all other RL strategies on all six benchmark tasks except maximizing heavy atoms (an extrapolation task of limited practical relevance). Its advantage was largest during early-stage optimization and on the harder objectives (dual activity, selective activity).</p>
<p>Hill-Climb with a smaller batch size (HC*) showed improved early-stage sample efficiency similar to AHC, but rapidly underwent mode collapse. KL regularization did not rescue mode collapse in any case and sometimes worsened performance. BAR performed poorly in most tasks, possibly because the best-agent memory acts as a second regularizer that inhibits learning.</p>
<p>In terms of wall time for the DRD2 docking task, AHC reached 140% optimization in 16 CPU hours vs. 202 CPU hours for <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 2.0</a>. AHC was the only strategy to reach 200% optimization within the allotted time (216 CPU hours). Parallelized over 10 CPUs, this corresponds to ~21.6 hours, making docking-guided generation feasible on local machines.</p>
<h3 id="experiment-5-generalization-to-transformers">Experiment 5: Generalization to Transformers</h3>
<p>AHC outperformed REINVENT on both the standard transformer and the gated transformer architectures. The standard transformer was unstable under RL, readily undergoing mode collapse. The gated transformer (with GRU-style gating replacing residual connections) stabilized RL training. AHC&rsquo;s efficiency gains generalized to both architectures.</p>
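<p>One plausible form of such a GRU-style gate, merging a sublayer output back into its input in place of the residual <code>x + sublayer(x)</code>, is sketched below (weight shapes, initialization, and the exact gate parameterization are illustrative assumptions, loosely following GRU-style gating, and may differ from the paper&rsquo;s implementation):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_residual(x, y, Wr, Ur, Wz, Uz, Wg, Ug, bg=2.0):
    """GRU-style gate merging a sublayer output y back into its input x.

    With the update-gate bias bg > 0, z starts near 0, so the layer
    begins close to the identity map on x -- the property credited with
    stabilizing RL fine-tuning of the transformer.
    """
    r = sigmoid(x @ Wr + y @ Ur)          # reset gate
    z = sigmoid(x @ Wz + y @ Uz - bg)     # update gate, biased toward keeping x
    h = np.tanh((r * x) @ Wg + y @ Ug)    # candidate state
    return (1.0 - z) * x + z * h          # convex blend of input and candidate
```

<p>As <code>bg</code> grows, the layer approaches a pure skip connection, which is why a gated block can start training as an identity and only gradually let the sublayer contribute.</p>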
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Chemistry quality evaluation is complicated by the interaction between RL strategy and scoring function suitability: stronger optimization can yield unreasonable chemistry because the scoring function is exploited, not because of the RL strategy itself.</li>
<li>The diversity filter hyperparameter search was conducted on GuacaMol toy tasks, which may not fully transfer to docking-based objectives.</li>
<li>The docking scoring function was system-dependent: DRD2 and OPRM1 were optimized effectively, while AGTR1 and OX1R proved more challenging (especially AGTR1, where the docking algorithm targeted the wrong sub-pocket).</li>
<li>KL regularization proved ineffective for HC and REINFORCE, suggesting it is not a sufficient regularization method in this context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN pretraining</td>
          <td>MOSESn (MOSES neutralized)</td>
          <td>2,454,087 molecules</td>
          <td>ZINC15 clean leads with neutralized charges</td>
      </tr>
      <tr>
          <td>RNN pretraining</td>
          <td>GuacaMol train</td>
          <td>1,273,104 molecules</td>
          <td>ChEMBL24 with property filters</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD2)</td>
          <td>4,609 actives / 343,026 inactives</td>
          <td>Random forest with GHOST thresholds</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD3)</td>
          <td>2,758 actives / 402,524 inactives</td>
          <td>Unique subsets for dual/selective tasks</td>
      </tr>
      <tr>
          <td>DF parameter search</td>
          <td>GuacaMol benchmark tasks</td>
          <td>3 tasks</td>
          <td>825 configurations tested</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>AHC</strong>: REINVENT loss computed on top-k molecules per batch, ranked by reward</li>
<li><strong>Baselines</strong>: REINFORCE, REINVENT (v1, v2), BAR, Hill-Climb, Hill-Climb + KL regularization</li>
<li><strong>Hyperparameters</strong>: Default values from each original publication (listed in Supplementary Table S3)</li>
<li><strong>Docking</strong>: Glide-SP with Schrodinger Protein Preparation Wizard, LigPrep for ligand preparation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RNNs</strong>: 3 configurations (GRU/LSTM, 512 hidden units, trained 5-10 epochs)</li>
<li><strong>Transformer</strong>: 4 encoder layers, 512 hidden dim, 8 heads, 1024 FFN dim</li>
<li><strong>Gated Transformer</strong>: Same architecture with GRU-style gating replacing residual connections</li>
<li><strong>QSAR</strong>: Random forest classifiers (100 estimators, max depth 15, min leaf 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AHC + DF2</th>
          <th>REINVENT</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization fold-improvement</td>
          <td>1.45x</td>
          <td>baseline</td>
          <td>DRD2 docking, averaged across sigma values</td>
      </tr>
      <tr>
          <td>Sample efficiency</td>
          <td>45.5x fewer samples</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>Step efficiency</td>
          <td>7.4x fewer steps</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>CPU hours to 140% (DRD2 docking)</td>
          <td>16h</td>
          <td>202h (REINVENT 2.0)</td>
          <td>AMD Threadripper 1920 + RTX 2060 Super</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>AMD Threadripper 1920 CPU</li>
<li>Nvidia GeForce RTX 2060 Super GPU</li>
<li>DRD2 docking benchmark: 216 CPU hours for AHC to reach 200% optimization (~21.6h parallelized over 10 CPUs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/SMILES-RNN">SMILES-RNN</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>RNN and transformer generative model code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/">Scoring function platform</a></td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.6084/m9.figshare.19591024.v1">Figshare datasets</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Supporting data (published under same license as paper)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. <em>Journal of Cheminformatics</em>, 14, 68. <a href="https://doi.org/10.1186/s13321-022-00646-z">https://doi.org/10.1186/s13321-022-00646-z</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2022augmented,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00646-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>