<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/</link><description>Recent content in Molecular Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/index.xml" rel="self" type="application/rss+xml"/><item><title>REINVENT: Reinforcement Learning for Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</guid><description>REINVENT uses augmented episodic likelihood to fine-tune a SMILES-based RNN via reinforcement learning for goal-directed molecular generation.</description><content:encoded><![CDATA[<h2 id="augmented-episodic-likelihood-for-goal-directed-generation">Augmented Episodic Likelihood for Goal-Directed Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces REINVENT, a policy-based reinforcement learning framework for molecular de novo design. The primary contribution is a novel cost function, the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented episodic likelihood</a>, that fine-tunes a <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based recurrent neural network (RNN) pre-trained on ChEMBL toward generating molecules satisfying user-defined property objectives. The method anchors the agent to the prior distribution of valid drug-like molecules, addressing failure modes of standard REINFORCE algorithms (reward exploitation and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">mode collapse</a> to trivially simple structures).</p>
<h2 id="de-novo-design-needs-flexible-data-driven-approaches">De Novo Design Needs Flexible, Data-Driven Approaches</h2>
<p>Traditional de novo design methods fall into three categories, each with limitations:</p>
<ol>
<li><strong>Structure-based approaches</strong> grow ligands to fit binding pockets but often produce molecules with poor DMPK profiles and synthetic intractability.</li>
<li><strong>Ligand-based virtual library</strong> approaches generate large libraries and score them, but are constrained by pre-defined reaction rules or transformation rules that limit chemical diversity.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Inverse QSAR</a></strong> methods attempt to map favorable activity regions back to molecular structures, but require descriptors suitable for both forward prediction and inverse mapping.</li>
</ol>
<p>RNN-based generative models trained on SMILES offer a data-driven alternative that can learn the underlying distribution of drug-like chemical space without rigid rules. Segler et al. (2017) showed that fine-tuning a pre-trained RNN on focused actives yields high fractions of predicted actives. However, this maximum likelihood fine-tuning cannot use negative or continuous scores and risks catastrophic forgetting.</p>
<p>Prior RL approaches had significant issues. Jaques et al. (2016) used Deep Q-learning with prior likelihood regularization for sequence generation, but reported dependence on hand-written rules to penalize undesirable sequences and still observed reward exploitation producing unrealistically simple molecules. Standard REINFORCE algorithms tend to converge on trivial solutions (e.g., generating only &ldquo;C&rdquo; to satisfy a scoring function).</p>
<h2 id="the-augmented-episodic-likelihood-framework">The Augmented Episodic Likelihood Framework</h2>
<p>The core innovation is a formulation where the agent learns a policy that minimizes the squared difference between its own log-likelihood and an augmented target likelihood.</p>
<p>The RNN is first pre-trained on 1.5 million canonical SMILES from ChEMBL via maximum likelihood estimation:</p>
<p>$$
J(\Theta) = -\sum_{t=1}^{T} \log P(x^{t} \mid x^{t-1}, \dots, x^{1})
$$</p>
<p>The pre-trained model (the Prior) is then used as the starting point for the Agent. For a generated SMILES sequence $A = a_1, a_2, \dots, a_T$, the model likelihood is $P(A) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$, and a scoring function $S(A) \in [-1, 1]$ rates desirability.</p>
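<p>As a minimal illustration, the sequence log-likelihood is just a sum of per-step log-probabilities; the probabilities below are toy stand-ins for the RNN's softmax outputs:</p>

```python
import math

def sequence_log_likelihood(step_probs):
    """log P(A) = sum_t log pi(a_t | s_t), given the probability the model
    assigned to each sampled token a_t."""
    return sum(math.log(p) for p in step_probs)
```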
<p>The augmented likelihood combines prior likelihood with the score:</p>
<p>$$
\log P(A)_{\mathbb{U}} = \log P(A)_{Prior} + \sigma S(A)
$$</p>
<p>where $\sigma$ is a scalar coefficient controlling the trade-off between prior fidelity and score optimization.</p>
<p>The return is defined as the negative squared difference between the augmented likelihood and the agent&rsquo;s likelihood:</p>
<p>$$
G(A) = -\left[\log P(A)_{\mathbb{U}} - \log P(A)_{\mathbb{A}}\right]^{2}
$$</p>
<p>The agent minimizes $J(\Theta) = -G$, effectively learning a policy whose sequence likelihoods match the prior modulated by the scoring function. The authors show in supplementary material that this is equivalent to a REINFORCE algorithm with a specific final-step reward formulation.</p>
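<p>As a concrete sketch (not the authors' implementation), the per-sequence loss can be computed directly from the two summed log-likelihoods and the score; the default $\sigma$ and the toy values below are illustrative:</p>

```python
def augmented_loss(log_p_prior: float, log_p_agent: float,
                   score: float, sigma: float = 15.0) -> float:
    """REINVENT-style loss for one sampled sequence: the squared difference
    between the augmented likelihood (prior log-likelihood shifted by
    sigma * S(A)) and the agent's own log-likelihood."""
    log_p_augmented = log_p_prior + sigma * score
    return (log_p_augmented - log_p_agent) ** 2

# A neutrally scored sequence (S = 0) that the agent already assigns the
# prior's likelihood incurs zero loss:
print(augmented_loss(-20.0, -20.0, 0.0))  # 0.0
```

<p>Minimizing this over a batch pushes the agent's likelihood up for high-scoring sequences and down for low-scoring ones, relative to the prior.</p>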
<p>This design has three key advantages over standard REINFORCE:</p>
<ul>
<li>The target policy is explicitly stochastic, preserving diversity in generated molecules</li>
<li>The prior anchoring prevents catastrophic forgetting of SMILES syntax and chemical space coverage</li>
<li>No hand-written rules are needed to penalize degenerate solutions</li>
</ul>
<p>The Agent is trained on-policy with batches of 128 generated sequences, using SGD with learning rate 0.0005 and gradient clipping to $[-3, 3]$.</p>
<h2 id="three-experiments-sulphur-avoidance-celecoxib-analogues-and-drd2-activity">Three Experiments: Sulphur Avoidance, Celecoxib Analogues, and DRD2 Activity</h2>
<h3 id="prior-network-architecture">Prior Network Architecture</h3>
<p>The Prior is a 3-layer RNN with 1024 Gated Recurrent Units per layer, trained on RDKit canonical SMILES from ChEMBL (molecules with 10-50 heavy atoms, elements from ${H, B, C, N, O, F, Si, P, S, Cl, Br, I}$). Training used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) for 50,000 steps with batch size 128, an initial learning rate of 0.001, and a learning-rate decay of 0.02 every 100 steps. The Prior generates 94% valid SMILES, of which 90% are novel.</p>
<h3 id="experiment-1-learning-to-avoid-sulphur">Experiment 1: Learning to Avoid Sulphur</h3>
<p>A proof-of-principle task where the scoring function assigns $S(A) = 1$ for valid sulphur-free molecules, $S(A) = 0$ for invalid SMILES, and $S(A) = -1$ for sulphur-containing molecules.</p>
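<p>A minimal sketch of this ternary scoring function; the element check is approximate string matching (a real implementation would parse the molecule, e.g. with RDKit), and <code>is_valid</code> stands in for an actual SMILES validity check:</p>

```python
import re

def contains_sulphur(smiles: str) -> bool:
    """Approximate sulphur check on raw SMILES: aliphatic 'S' (excluding
    bracket elements such as Si/Sn/Sr) or lone aromatic 's' (excluding
    Cs/As/Os and 'se')."""
    return re.search(r"S(?![inr])|(?<![A-Z])s(?!e)", smiles) is not None

def sulphur_score(smiles: str, is_valid) -> float:
    """Experiment 1 scoring: +1 for valid sulphur-free molecules,
    0 for invalid SMILES, -1 for sulphur-containing molecules."""
    if not is_valid(smiles):
        return 0.0
    return -1.0 if contains_sulphur(smiles) else 1.0
```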
<p>The Agent method was compared against three alternatives:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Fraction No S</th>
          <th>Avg MW</th>
          <th>Avg cLogP</th>
          <th>Avg RotBonds</th>
          <th>Avg AromRings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior</td>
          <td>0.94</td>
          <td>0.66</td>
          <td>371</td>
          <td>3.36</td>
          <td>5.39</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Agent</td>
          <td>0.95</td>
          <td>0.98</td>
          <td>367</td>
          <td>3.37</td>
          <td>5.41</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Action basis</td>
          <td>0.95</td>
          <td>0.92</td>
          <td>372</td>
          <td>3.39</td>
          <td>6.08</td>
          <td>2.09</td>
      </tr>
      <tr>
          <td>REINFORCE</td>
          <td>0.98</td>
          <td>0.98</td>
          <td>585</td>
          <td>11.3</td>
          <td>30.0</td>
          <td>0.57</td>
      </tr>
      <tr>
          <td>REINFORCE + Prior</td>
          <td>0.98</td>
          <td>0.92</td>
          <td>232</td>
          <td>3.05</td>
          <td>2.8</td>
          <td>2.11</td>
      </tr>
  </tbody>
</table>
<p>Standard REINFORCE exploited the reward by generating sequences of predominantly &ldquo;C&rdquo; (average MW 585, cLogP 11.3). REINFORCE + Prior avoided this but collapsed to small, simplistic structures (MW 232). The Agent achieved 98% sulphur-free structures while maintaining molecular properties nearly identical to the Prior, demonstrating that augmented episodic likelihood preserves the prior distribution.</p>
<h3 id="experiment-2-similarity-guided-generation-celecoxib-analogues">Experiment 2: Similarity-Guided Generation (Celecoxib Analogues)</h3>
<p>The scoring function uses <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> on FCFP4 fingerprints:</p>
<p>$$
S(A) = -1 + 2 \times \frac{\min(J_{i,j},\, k)}{k}
$$</p>
<p>where $k$ caps the rewarded similarity. With $k = 1$ and $\sigma = 15$, the Agent recovers <a href="https://en.wikipedia.org/wiki/Celecoxib">Celecoxib</a> itself within 200 training steps. Even when all structures with $J &gt; 0.5$ to Celecoxib (1,804 molecules) were removed from the Prior training set, the Agent still found Celecoxib after 400 steps, despite a 700-fold reduction in prior likelihood ($\log_e P$ from $-12.7$ to $-19.2$).</p>
<p>With moderate similarity targets ($k = 0.7$, $\sigma = 12$), the Agent generates diverse analogues including scaffold hops where functional groups are rearranged.</p>
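<p>The capped similarity reward is a one-liner; in the paper $J_{i,j}$ is the Jaccard similarity of FCFP4 fingerprints (computable with RDKit), supplied here as a plain float:</p>

```python
def similarity_score(j: float, k: float) -> float:
    """Map similarity j in [0, 1] to S(A) in [-1, 1]; the reward saturates
    once j reaches the cap k (k = 1 rewards exact recovery, k < 1 rewards
    analogues without forcing identity)."""
    return -1.0 + 2.0 * min(j, k) / k
```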
<h3 id="experiment-3-target-activity-drd2">Experiment 3: Target Activity (DRD2)</h3>
<p>The most drug-discovery-relevant task: generating molecules predicted active against the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2 (DRD2)</a>. An SVM classifier (Gaussian kernel, $C = 2^7$, $\gamma = 2^{-6}$) was trained on bioactivity data from ExCAPE-DB (7,218 actives with pIC50 &gt; 5, 100,000 sampled inactives). The actives were split by Butina clustering (ECFP6, cutoff 0.4) to decrease nearest-neighbor similarity between train and test sets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Prior</th>
          <th>Agent</th>
          <th>Prior (reduced)</th>
          <th>Agent (reduced)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid SMILES</td>
          <td>0.94</td>
          <td>0.99</td>
          <td>0.94</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Fraction predicted actives</td>
          <td>0.03</td>
          <td>0.97</td>
          <td>0.02</td>
          <td>0.96</td>
      </tr>
      <tr>
          <td>Fraction similar to train active</td>
          <td>0.02</td>
          <td>0.79</td>
          <td>0.02</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>Fraction similar to test active</td>
          <td>0.01</td>
          <td>0.46</td>
          <td>0.01</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>Test actives recovered (×10<sup>-3</sup>)</td>
          <td>13.5</td>
          <td>126</td>
          <td>2.85</td>
          <td>72.6</td>
      </tr>
  </tbody>
</table>
<p>The Agent increased the fraction of predicted actives from 2-3% (Prior) to 96-97%, representing a 250-fold enrichment in the probability of generating a test set active. The Agent based on the reduced Prior (DRD2 actives removed from ChEMBL) still recovered 7% of test actives, meaning it generated experimentally confirmed actives that appeared in neither the generative model nor the activity prediction model training data.</p>
<h2 id="anchored-policy-learning-prevents-reward-exploitation">Anchored Policy Learning Prevents Reward Exploitation</h2>
<p>The key finding is that augmented episodic likelihood successfully balances score optimization with prior distribution preservation. The Agent achieves task objectives (sulphur avoidance, similarity targets, activity prediction) while maintaining the molecular property distributions learned from ChEMBL. This is a significant improvement over standard REINFORCE, which either exploits rewards trivially or collapses to simple structures.</p>
<p>Analysis of the conditional probability distributions between the Prior and Agent (for DRD2 active generation) shows that the policy changes are not drastic: most trends learned by the Prior carry over, with targeted modifications at specific steps that substantially alter sequence likelihoods and generated structure types.</p>
<p>Limitations acknowledged by the authors:</p>
<ul>
<li>All experiments use single-parameter scoring functions; multi-parametric optimization (activity + DMPK + synthetic accessibility) is left for future work</li>
<li>The quality of generated structures depends heavily on the Prior&rsquo;s coverage of chemical space</li>
<li>The activity model (SVM) has limited domain of applicability, and structures outside this domain may be falsely scored</li>
<li>No exhaustive study of how Prior training set size, model size, and regularization affect generation quality</li>
</ul>
<p>Future directions include multi-parametric scoring functions, exploration of token embeddings, and adversarial training where the scoring function is replaced by a discriminator network (GAN-style training).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>1.5M structures</td>
          <td>10-50 heavy atoms, filtered elements</td>
      </tr>
      <tr>
          <td>DRD2 activity model</td>
          <td>ExCAPE-DB</td>
          <td>7,218 actives + 100K inactives</td>
          <td>Butina clustering split (ECFP6, cutoff 0.4)</td>
      </tr>
      <tr>
          <td>Similarity target</td>
          <td>Celecoxib</td>
          <td>1 query structure</td>
          <td>FCFP4 fingerprints for Jaccard similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prior</strong>: 3-layer GRU RNN (1024 units/layer), Adam optimizer, 50K steps, batch size 128, LR 0.001 with 0.02 decay/100 steps</li>
<li><strong>Agent</strong>: Same architecture, SGD with LR 0.0005, gradient clipping [-3, 3], on-policy batches of 128</li>
<li><strong>DRD2 model</strong>: SVM with Gaussian kernel ($C = 2^7$, $\gamma = 2^{-6}$), grid search on validation set</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MarcusOlivecrona/REINVENT">REINVENT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation in TensorFlow/Python 2.7</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.572576">Archived version</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Zenodo archive (DOI: 10.5281/zenodo.572576)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>SMILES validity rate (RDKit parsing)</li>
<li>Fraction of structures satisfying scoring function</li>
<li>Molecular property distributions (MW, cLogP, rotatable bonds, aromatic rings)</li>
<li>Jaccard similarity on ECFP6/FCFP4 fingerprints</li>
<li>Recovery rate of known actives from test set</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. The implementation uses TensorFlow 1.0.1 with Python 2.7, RDKit, and Scikit-learn.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Olivecrona, M., Blaschke, T., Engkvist, O., &amp; Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. <em>Journal of Cheminformatics</em>, 9(1), 48.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{olivecrona2017molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular de-novo design through deep reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Olivecrona, Marcus and Blaschke, Thomas and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{48}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-017-0235-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ORGAN: Objective-Reinforced GANs for Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</guid><description>ORGAN combines GANs with reinforcement learning to steer SMILES-based molecular generation toward drug-likeness, solubility, and synthesizability objectives.</description><content:encoded><![CDATA[<h2 id="combining-gans-and-reinforcement-learning-for-goal-directed-sequence-generation">Combining GANs and Reinforcement Learning for Goal-Directed Sequence Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces ORGAN (Objective-Reinforced Generative Adversarial Network), a framework for generating sequences that are both realistic (close to the training distribution) and optimized for domain-specific objectives. ORGAN extends SeqGAN by adding external reward functions to the reinforcement learning signal, with a tunable parameter $\lambda$ controlling the balance between adversarial (discriminator) and objective-based rewards. The authors demonstrate ORGAN on two domains: molecular generation using <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings (optimizing druglikeness, solubility, and synthesizability) and musical melody generation (optimizing tonality and step ratios).</p>
<h2 id="exposure-bias-and-mode-collapse-in-discrete-sequence-generation">Exposure Bias and Mode Collapse in Discrete Sequence Generation</h2>
<p>Generating discrete sequences with desirable properties presents two intertwined challenges. First, RNNs trained via maximum likelihood estimation (MLE) suffer from exposure bias, where the model sees only ground-truth prefixes during training but must condition on its own (potentially erroneous) outputs at generation time. Second, while <a href="/posts/what-is-a-gan/">GANs</a> can address some of these issues through adversarial training, they were not initially applicable to discrete data due to non-differentiability of the sampling step. SeqGAN resolved this by framing the generator as an RL agent, but it optimizes only for distributional fidelity (fooling the discriminator) without any mechanism to steer generation toward specific property targets.</p>
<p>In drug discovery, simply generating valid, drug-like molecules is insufficient. Practitioners need to optimize for particular pharmaceutical properties (e.g., solubility, synthesizability, druglikeness) while maintaining structural diversity. Naive RL approaches can optimize properties effectively but tend to collapse onto trivial solutions (e.g., repeating &ldquo;CCCCCCC&rdquo; to maximize solubility). The challenge is to combine the distributional regularization of adversarial training with the goal-directedness of RL.</p>
<h2 id="mixed-reward-interpolating-between-adversarial-and-objective-signals">Mixed Reward: Interpolating Between Adversarial and Objective Signals</h2>
<p>ORGAN&rsquo;s core innovation is a reward function that linearly interpolates between the discriminator score and domain-specific objectives:</p>
<p>$$R(Y_{1:T}) = \lambda \cdot D_{\phi}(Y_{1:T}) + (1 - \lambda) \cdot O_{i}(Y_{1:T})$$</p>
<p>When $\lambda = 1$, the model reduces to SeqGAN (pure adversarial training). When $\lambda = 0$, it becomes naive RL optimizing only the objective. Intermediate values allow the adversarial component to regularize the generator, keeping samples within the distribution while the objective component steers toward desired properties.</p>
<p>The generator $G_{\theta}$ is an LSTM-based RNN that produces sequences token-by-token. Training follows the REINFORCE algorithm, where the expected long-term reward is:</p>
<p>$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_{0}, \theta\right] = \sum_{y_{1} \in Y} G_{\theta}(y_{1} \mid s_{0}) \cdot Q(s_{0}, y_{1})$$</p>
<p>For intermediate timesteps (partial sequences), the action-value function $Q$ is estimated via $N$-time Monte Carlo rollouts:</p>
<p>$$Q(Y_{1:t-1}, y_{t}) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} R(Y_{1:T}^{n}), &amp; \text{if } t &lt; T \\ R(Y_{1:T}), &amp; \text{if } t = T \end{cases}$$</p>
<p>where $Y_{1:T}^{n}$ are completions sampled by rolling out the current policy $G_{\theta}$ from state $Y_{1:t}$.</p>
<p>The policy gradient is:</p>
<p>$$\nabla_{\theta} J(\theta) \simeq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_{t} \sim G_{\theta}(y_{t} \mid Y_{1:t-1})} \left[\nabla_{\theta} \log G_{\theta}(y_{t} \mid Y_{1:t-1}) \cdot Q(Y_{1:t-1}, y_{t})\right]$$</p>
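<p>A sketch of the mixed reward and the rollout-based action-value estimate; <code>discriminator</code>, <code>objective</code>, and <code>rollout_policy</code> are stand-in callables for the trained networks:</p>

```python
def mixed_reward(seq, discriminator, objective, lam=0.5):
    """ORGAN reward: lam * D(Y) + (1 - lam) * O(Y)."""
    return lam * discriminator(seq) + (1.0 - lam) * objective(seq)

def q_estimate(prefix, T, rollout_policy, discriminator, objective,
               lam=0.5, n=8):
    """At full length, score the finished sequence directly; otherwise
    average the mixed reward over n Monte Carlo completions sampled by
    rolling out the current policy from the prefix."""
    if len(prefix) == T:
        return mixed_reward(prefix, discriminator, objective, lam)
    total = sum(mixed_reward(rollout_policy(prefix, T), discriminator,
                             objective, lam) for _ in range(n))
    return total / n
```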
<p>Two additional mechanisms improve training:</p>
<ol>
<li><strong>Diversity penalty</strong>: Repeated sequences have their reward divided by their copy count, providing diminishing returns for non-unique outputs.</li>
<li><strong>Wasserstein distance</strong>: The authors also implement a variant (OR(W)GAN) that replaces the standard GAN discriminator loss with the Wasserstein-1 distance via Kantorovich-Rubinstein duality, which can improve training stability and diversity.</li>
</ol>
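<p>The diversity penalty amounts to a per-batch reweighting (an illustrative sketch, not the authors' code):</p>

```python
from collections import Counter

def diversity_penalty(sequences, rewards):
    """Divide each sequence's reward by its copy count in the batch,
    giving diminishing returns for duplicated outputs."""
    counts = Counter(sequences)
    return [r / counts[s] for s, r in zip(sequences, rewards)]

print(diversity_penalty(["CCO", "CCO", "CCN"], [1.0, 1.0, 0.8]))
# [0.5, 0.5, 0.8]
```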
<h2 id="molecular-and-musical-melody-generation-experiments">Molecular and Musical Melody Generation Experiments</h2>
<h3 id="architecture">Architecture</h3>
<p>The generator $G_{\theta}$ is an RNN with LSTM cells. The discriminator $D_{\phi}$ is a CNN for text classification following Kim (2014), with 75% dropout and L2 regularization. All optimization uses Adam. Molecular metrics are computed with RDKit.</p>
<h3 id="molecular-generation-setup">Molecular Generation Setup</h3>
<p>Training data consists of 5,000 random molecules from the QM9 dataset (134k stable small molecules with up to 9 heavy atoms), encoded as SMILES strings with maximum sequence length 51 and alphabet size 43. Each generator is pre-trained for 250 MLE epochs, with the discriminator trained for 10 epochs. Adversarial/RL training then proceeds for up to 100 additional epochs. The default $\lambda$ is 0.5.</p>
<p>Three molecular objectives are evaluated:</p>
<ul>
<li><strong>Solubility (LogP)</strong>: water-octanol partition coefficient via RDKit&rsquo;s Crippen function</li>
<li><strong>Synthesizability</strong>: SA score estimating ease of synthesis (0 = hard, 1 = easy)</li>
<li><strong>Druglikeness</strong>: QED score capturing medicinal chemistry aesthetics</li>
</ul>
<p>Diversity is measured using average Jaccard distance of molecular fingerprints relative to a random training subset.</p>
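<p>The diversity metric can be sketched as a mean pairwise Jaccard distance over fingerprint bit sets (plain Python sets stand in here for RDKit fingerprints):</p>

```python
def jaccard_distance(fp_a, fp_b):
    """1 - |A ∩ B| / |A ∪ B| over fingerprint bit sets."""
    a, b = set(fp_a), set(fp_b)
    if not (a or b):
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def avg_diversity(generated, reference):
    """Mean Jaccard distance between generated and reference fingerprints."""
    pairs = [(g, r) for g in generated for r in reference]
    return sum(jaccard_distance(g, r) for g, r in pairs) / len(pairs)
```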
<h3 id="molecular-generation-results">Molecular Generation Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Validity (%)</th>
          <th>Diversity</th>
          <th>Druglikeness</th>
          <th>Synthesizability</th>
          <th>Solubility</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>75.9</td>
          <td>0.64</td>
          <td>0.48 (0%)</td>
          <td>0.23 (0%)</td>
          <td>0.30 (0%)</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>80.3</td>
          <td>0.61</td>
          <td>0.49 (+2%)</td>
          <td>0.25 (+6%)</td>
          <td>0.31 (+3%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>ORGAN</td>
          <td>88.2</td>
          <td>0.55</td>
          <td>0.52 (+8%)</td>
          <td>0.32 (+38%)</td>
          <td>0.35 (+18%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>OR(W)GAN</td>
          <td>85.0</td>
          <td>0.95</td>
          <td>0.60 (+25%)</td>
          <td>0.54 (+130%)</td>
          <td>0.47 (+57%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Naive RL</td>
          <td>97.1</td>
          <td>0.80</td>
          <td>0.57 (+19%)</td>
          <td>0.53 (+126%)</td>
          <td>0.50 (+67%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>ORGAN</td>
          <td>96.5</td>
          <td>0.92</td>
          <td>0.51 (+6%)</td>
          <td>0.83 (+255%)</td>
          <td>0.45 (+52%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>OR(W)GAN</td>
          <td>97.6</td>
          <td>1.00</td>
          <td>0.20 (-59%)</td>
          <td>0.75 (+223%)</td>
          <td>0.84 (+184%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>ORGAN</td>
          <td>94.7</td>
          <td>0.76</td>
          <td>0.50 (+4%)</td>
          <td>0.63 (+171%)</td>
          <td>0.55 (+85%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>OR(W)GAN</td>
          <td>94.1</td>
          <td>0.90</td>
          <td>0.42 (-12%)</td>
          <td>0.66 (+185%)</td>
          <td>0.54 (+81%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>Naive RL</td>
          <td>92.7</td>
          <td>0.75</td>
          <td>0.49 (+3%)</td>
          <td>0.70 (+200%)</td>
          <td>0.78 (+162%)</td>
      </tr>
      <tr>
          <td>All (alternated)</td>
          <td>ORGAN</td>
          <td>96.1</td>
          <td>0.92</td>
          <td>0.52 (+9%)</td>
          <td>0.71 (+206%)</td>
          <td>0.53 (+79%)</td>
      </tr>
  </tbody>
</table>
<p>Key observations: OR(W)GAN consistently achieves higher diversity than standard ORGAN. Naive RL often achieves higher raw objective scores but at the cost of generating trivial solutions (e.g., simple atom chains for solubility). The Wasserstein variant provides better diversity properties. Multi-objective training via alternating objectives across epochs achieves gains comparable to individually optimized models.</p>
<h3 id="music-generation-setup">Music Generation Setup</h3>
<p>Using 1,000 melodies from the EsAC folk dataset, each encoded as 36-token sequences where tokens represent sixteenth-note events across three octaves (C3-B5). Two metrics are optimized: tonality (proportion of perfect fifths) and ratio of steps (conjunct melodic motion). Diversity is measured as average pairwise edit distance.</p>
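<p>As an illustrative sketch only (the paper's exact definitions may differ), a step-ratio metric over a pitch sequence could look like:</p>

```python
def ratio_of_steps(pitches):
    """Fraction of consecutive note pairs in conjunct (stepwise) motion,
    taken here as 1-2 semitones apart; a simplified stand-in for the
    paper's metric."""
    if len(pitches) < 2:
        return 0.0
    steps = sum(1 for a, b in zip(pitches, pitches[1:]) if 1 <= abs(b - a) <= 2)
    return steps / (len(pitches) - 1)
```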
<h3 id="music-results">Music Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Diversity</th>
          <th>Tonality</th>
          <th>Ratio of Steps</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>0.221</td>
          <td>0.007</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>0.187</td>
          <td>0.005</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Naive RL</td>
          <td>0.100</td>
          <td>0.478</td>
          <td>2.9E-05</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>ORGAN</td>
          <td>0.268</td>
          <td>0.372</td>
          <td>1.78E-04</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>OR(W)GAN</td>
          <td>0.268</td>
          <td>0.177</td>
          <td>2.4E-04</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Naive RL</td>
          <td>0.321</td>
          <td>0.001</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>ORGAN</td>
          <td>0.433</td>
          <td>0.001</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>OR(W)GAN</td>
          <td>0.134</td>
          <td>5.95E-05</td>
          <td>0.622</td>
      </tr>
  </tbody>
</table>
<p>ORGAN outperforms SeqGAN and MLE on all metrics. Naive RL achieves higher raw scores but with lower diversity, producing simpler, less interesting outputs.</p>
<h2 id="capacity-ceilings-trade-offs-and-future-directions">Capacity Ceilings, Trade-offs, and Future Directions</h2>
<p>The authors identify several limitations and findings:</p>
<p><strong>Capacity ceiling</strong>: GAN-based models tend to generate sequences matching the training set&rsquo;s average length (15.42 characters). RL-only approaches can break this constraint, generating shorter (9.4) or longer (21.3) sequences depending on the objective. The upper bound of optimized properties also matches the training data&rsquo;s maximum, suggesting dataset-dependent limits.</p>
<p><strong>Lambda trade-off</strong>: Varying $\lambda$ reveals an optimal balance between objective optimization and distributional fidelity. This optimum depends on the model, dataset, and metric, suggesting that hyperparameter search over $\lambda$ is important in practice.</p>
<p><strong>Tonality vs. steps inverse relationship</strong>: In the music task, optimizing for tonality (perfect fifths) inherently conflicts with optimizing for step ratios (consecutive notes), since consecutive scale notes do not form perfect fifths.</p>
<p><strong>Limitations</strong>: The paper evaluates on relatively small datasets (5k molecules, 1k melodies) and short sequences. The molecular experiments use QM9 (small molecules with up to 9 heavy atoms), which limits the scope of conclusions for drug-like chemical space. The Wasserstein variant sometimes lags behind the standard GAN loss in raw metric scores, though it offers better diversity.</p>
<p><strong>Future directions</strong>: The authors propose extending ORGAN to non-sequential data (images, audio) by framing GANs as RL problems more broadly, and investigating how different heuristic choices affect performance. They also suggest exploring other discrete GAN formulations (MaliGAN, BGAN) with RL extensions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecular training</td>
          <td>QM9 subset</td>
          <td>5,000 molecules</td>
          <td>Random subset from 134k stable small molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Music training</td>
          <td>EsAC folk dataset</td>
          <td>1,000 melodies</td>
          <td>36-token sequences, processed following Chen et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Generator pre-trained for 250 epochs via MLE; discriminator for 10 epochs</li>
<li>Adversarial/RL training for up to 100 epochs</li>
<li>Default $\lambda = 0.5$ for reward mixing</li>
<li>Monte Carlo rollouts for intermediate reward estimation</li>
<li>Duplicate penalty: reward divided by copy count</li>
</ul>
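<p>The duplicate penalty in the list above admits a one-line reading: each exact copy of a sequence shares the reward rather than multiplying it. A sketch under the assumption that copies are counted per batch:</p>

```python
from collections import Counter

def penalized_rewards(samples, rewards):
    """Divide each sample's reward by its copy count in the batch, so
    generating the same sequence repeatedly cannot farm the same reward."""
    counts = Counter(samples)
    return [r / counts[s] for s, r in zip(samples, rewards)]
```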
<h3 id="models">Models</h3>
<ul>
<li><strong>Generator</strong>: RNN with LSTM cells</li>
<li><strong>Discriminator</strong>: CNN for text classification (Kim, 2014) with 75% dropout, L2 regularization</li>
<li><strong>Optimizer</strong>: Adam for all gradient descent steps</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Domain</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (%)</td>
          <td>Fraction of generated SMILES that decode to valid molecules</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Average Jaccard distance of fingerprints to training subset</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Druglikeness (QED)</td>
          <td>Quantitative Estimate of Drug-likeness</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Synthesizability (SA)</td>
          <td>Synthetic accessibility score</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Solubility (LogP)</td>
          <td>Water-octanol partition coefficient</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Proportion of perfect fifths</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Proportion of conjunct melodic intervals</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Diversity (edit)</td>
          <td>Average pairwise edit distance</td>
          <td>Music</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gablg1/ORGAN">ORGAN</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Official implementation including metrics for molecules and music</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guimaraes, G. L., Sánchez-Lengeling, B., Outeiral, C., Farias, P. L. C., &amp; Aspuru-Guzik, A. (2017). Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. <em>arXiv preprint arXiv:1705.10843</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guimaraes2017organ,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guimaraes, Gabriel Lima and Sanchez-Lengeling, Benjamin and Outeiral, Carlos and Farias, Pedro Luis Cunha and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1705.10843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolecularRNN: Graph-Based Molecular Generation and RL</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</guid><description>MolecularRNN extends GraphRNN with atom and bond type predictions, valency-based rejection sampling, and policy gradient optimization for molecular generation.</description><content:encoded><![CDATA[<h2 id="a-graph-recurrent-model-for-molecular-generation-with-property-optimization">A Graph Recurrent Model for Molecular Generation with Property Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces MolecularRNN, a graph-based recurrent generative model for molecular structures. The model extends GraphRNN to handle typed nodes (atom types) and typed edges (bond types), enabling direct generation of molecular graphs rather than working through string representations like SMILES. Three key contributions are combined: (1) the MolecularRNN architecture for autoregressive graph generation, (2) valency-based rejection sampling for guaranteed 100% validity at inference, and (3) policy gradient reinforcement learning for shifting molecular property distributions toward desired ranges.</p>
<h2 id="why-generate-molecules-as-graphs-rather-than-strings">Why Generate Molecules as Graphs Rather Than Strings</h2>
<p>Computational de novo molecular design aims to create novel molecules with desired properties, a task central to drug discovery. At the time of this work, most deep generative models for molecules operated on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, inheriting the complications of SMILES grammar and the problem that structurally similar molecules can have very different string representations. Graph-based representations are more natural for molecules, with atoms mapping to nodes and bonds to edges, and they allow direct enforcement of chemical constraints during generation.</p>
<p>Existing graph-based methods had their own limitations. Junction tree VAE (JT-VAE) generates molecules from structural fragments, which introduces ambiguity when converting junction trees back to molecules, particularly problematic during property optimization since molecules sharing a junction tree can have very different property values. The GCPN model uses graph convolutional networks with reinforcement learning but was evaluated only on top-3 generated molecules, making it difficult to assess overall distribution quality. Prior atom-level graph generation models like Li et al. (2018a) were restricted to molecules with at most 20 heavy atoms, limiting practical applicability.</p>
<h2 id="core-innovation-extending-graphrnn-with-chemical-constraints-and-rl">Core Innovation: Extending GraphRNN with Chemical Constraints and RL</h2>
<p>MolecularRNN builds on the GraphRNN architecture by introducing atom type predictions alongside edge type predictions. The model generates molecular graphs sequentially: at each step, a NodeRNN predicts the type of the next atom, then an EdgeRNN predicts bond types to all preceding atoms within a BFS-ordered window.</p>
<h3 id="autoregressive-graph-generation">Autoregressive Graph Generation</h3>
<p>The joint likelihood over atom types $C^{\pi}$ and adjacency vectors $S^{\pi}$ under BFS ordering $\pi$ is factorized as:</p>
<p>$$
p\left(S^{\pi}, C^{\pi}\right) = \prod_{i=1}^{n+1} p\left(C_{i}^{\pi} \mid S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right) p\left(S_{i}^{\pi} \mid C_{i}^{\pi}, S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right)
$$</p>
<p>NodeRNN processes embeddings of previous atom types and adjacency vectors to produce a hidden state, from which a two-layer MLP with softmax predicts the next atom type $\psi_{i}$:</p>
<p>$$
h_{i}^{\text{node}} = \text{NodeRNN}\left(h_{i-1}^{\text{node}}, \left[\text{emb}(S_{i-1}^{\pi}), \text{emb}(C_{i-1}^{\pi})\right]\right)
$$</p>
<p>$$
\psi_{i} = \text{NodeMLP}\left(h_{i}^{\text{node}}\right)
$$</p>
<p>EdgeRNN then unrolls across preceding atoms to predict bond types $\phi_{i,j}$, initialized with the NodeRNN hidden state:</p>
<p>$$
h_{i,j}^{\text{edge}} = \text{EdgeRNN}\left(h_{i,j-1}^{\text{edge}}, \text{emb}(S_{i,j-1}^{\pi})\right), \quad h_{i,0}^{\text{edge}} = h_{i}^{\text{node}}
$$</p>
<p>$$
\phi_{i,j} = \text{EdgeMLP}\left(h_{i,j}^{\text{edge}}\right)
$$</p>
<p>Bond types are categorical over {no bond, single, double, triple}, and molecules are represented in kekulized form. BFS ordering limits the EdgeRNN window to $M = 12$ preceding atoms.</p>
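<p>The sampling loop implied by these equations can be sketched with the two heads abstracted away (<code>sample_atom</code> and <code>sample_bond</code> are stand-ins for the NodeRNN/NodeMLP and EdgeRNN/EdgeMLP stacks; this is an illustration, not the authors' code):</p>

```python
def generate_graph(sample_atom, sample_bond, max_atoms=50, window=12):
    """Autoregressive graph sampling: draw the next atom type, then bond
    types to up to `window` preceding atoms (the BFS window M = 12).
    sample_atom returning None plays the role of an end-of-graph token."""
    atoms, bonds = [], {}
    for i in range(max_atoms):
        atom = sample_atom(atoms, bonds)
        if atom is None:
            break
        atoms.append(atom)
        for j in range(max(0, i - window), i):
            k = sample_bond(i, j, atoms, bonds)
            if k:  # 0 encodes "no bond"
                bonds[(i, j)] = k
    return atoms, bonds
```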
<h3 id="valency-based-rejection-sampling">Valency-Based Rejection Sampling</h3>
<p>During inference, each proposed bond of order $k$ between atoms $i$ and $j$ is accepted only if both atoms remain within their allowed valencies:</p>
<p>$$
\sum_{j} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{i}^{\pi}} \quad \text{and} \quad \sum_{i} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{j}^{\pi}}
$$</p>
<p>Atoms that do not fill their valencies are complemented with hydrogens. This constraint can be enforced directly on graphs (unlike SMILES, where intermediate substrings are not chemically meaningful), yielding 100% valid molecules.</p>
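<p>The acceptance test is a pair of inequalities over running bond-order sums. A sketch using a standard valency table (the exact table the authors used is not given here, so the values below are common defaults):</p>

```python
# Maximum valencies per atom type -- assumed common defaults, not values
# taken from the paper.
VALENCY = {"C": 4, "N": 3, "O": 2, "F": 1, "P": 5, "S": 6,
           "Cl": 1, "Br": 1, "I": 1}

def bond_allowed(atom_i, atom_j, order_sum_i, order_sum_j, k):
    """Accept a proposed bond of order k between atoms i and j only if
    neither atom's total bond order would exceed its allowed valency."""
    return (order_sum_i + k <= VALENCY[atom_i]
            and order_sum_j + k <= VALENCY[atom_j])
```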
<h3 id="property-optimization-via-policy-gradient">Property Optimization via Policy Gradient</h3>
<p>For property optimization, MolecularRNN is formulated as a policy network in a Markov Decision Process. The loss function uses REINFORCE with a discounted final reward:</p>
<p>$$
L(\theta) = -\sum_{i=1}^{N} r(s_{N}) \cdot \gamma^{i} \cdot \log p(s_{i} \mid s_{i-1}; \theta)
$$</p>
<p>where $r(s_{N})$ is the reward from a property critic and $\gamma$ is a discount factor. The authors also introduce a structural penalty during RL training that assigns a penalty of $-10$ to atoms violating valency constraints, providing a learning signal from invalid intermediate molecules.</p>
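<p>The loss is the terminal reward broadcast backward with discounting. A direct transcription, mirroring the sum over steps $i = 1 \dots N$ above:</p>

```python
def reinforce_loss(final_reward, step_log_probs, gamma=0.97):
    """Negative discounted-reward-weighted log-likelihood of a sampled
    trajectory: L = -sum_i r(s_N) * gamma^i * log p(s_i | s_{i-1})."""
    return -sum(final_reward * gamma ** i * lp
                for i, lp in enumerate(step_log_probs, start=1))
```

<p>Note that the single reward $r(s_N)$ is only available at the end of generation, so every step's log-probability is weighted by it, with earlier steps discounted less than later ones.</p>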
<h2 id="experimental-setup-pretraining-and-property-optimization">Experimental Setup: Pretraining and Property Optimization</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MolecularRNN is pretrained on three datasets: ChEMBL (~1.5M bioactive molecules), <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC 250k</a> (250K randomly selected commercially available compounds), and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> (~1.9M drug-like molecules from ZINC). The model considers 9 atom types (C, N, O, F, P, S, Cl, Br, I), 3 bond types (single, double, triple), and molecules with 10-50 heavy atoms. Architecture: NodeRNN with 4 GRU layers (hidden size 256), EdgeRNN with 4 GRU layers (hidden size 128), node embedding size 128, edge embedding size 16. Training uses Adam with learning rate 0.001 and multiplicative decay on 4 GPUs with batch size 512 per GPU for 250 epochs.</p>
<h3 id="generation-quality-at-scale">Generation Quality at Scale</h3>
<p>The pretrained model generates 1 million molecules per dataset (far larger than prior work: JT-VAE used 5K samples, Li et al. used 100K). Results with valency-based rejection sampling:</p>
<table>
  <thead>
      <tr>
          <th>Training Set</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>IntDiv (p=1)</th>
          <th>IntDiv (p=2)</th>
          <th>SA Score</th>
          <th>QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>100%</td>
          <td>99.2%</td>
          <td>99.3%</td>
          <td>0.895</td>
          <td>0.890</td>
          <td>3.67 +/- 1.20</td>
          <td>0.56 +/- 0.20</td>
      </tr>
      <tr>
          <td>ZINC 250k</td>
          <td>100%</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>0.892</td>
          <td>0.887</td>
          <td>3.60 +/- 1.01</td>
          <td>0.68 +/- 0.16</td>
      </tr>
      <tr>
          <td>MOSES</td>
          <td>100%</td>
          <td>99.4%</td>
          <td>100%</td>
          <td>0.881</td>
          <td>0.876</td>
          <td>3.24 +/- 0.97</td>
          <td>0.74 +/- 0.14</td>
      </tr>
  </tbody>
</table>
<p>Comparison with baselines on ZINC 250k (30K samples):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>SA Score</th>
          <th>QED</th>
          <th>IntDiv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>100%</td>
          <td>3.37</td>
          <td>0.76</td>
          <td>0.85</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>100%</td>
          <td>99.97%</td>
          <td>100%</td>
          <td>4.62</td>
          <td>0.61</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>100%</td>
          <td>99.89%</td>
          <td>100%</td>
          <td>3.59</td>
          <td>0.68</td>
          <td>0.89</td>
      </tr>
  </tbody>
</table>
<p>GCPN generates overly complex molecules (high SA score of 4.62), while MolecularRNN produces more realistic structures with higher internal diversity than JT-VAE.</p>
<h3 id="property-optimization-results">Property Optimization Results</h3>
<p>Policy gradient optimization is run for 300 iterations with batch size 512 and constant learning rate $10^{-5}$, discount factor $\gamma = 0.97$. Top-3 scores for penalized logP and QED:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>logP 1st</th>
          <th>logP 2nd</th>
          <th>logP 3rd</th>
          <th>QED 1st</th>
          <th>QED 2nd</th>
          <th>QED 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></td>
          <td>3.63</td>
          <td>3.49</td>
          <td>3.44</td>
          <td>0.896</td>
          <td>0.824</td>
          <td>0.820</td>
      </tr>
      <tr>
          <td>JT-VAE</td>
          <td>5.30</td>
          <td>4.93</td>
          <td>4.49</td>
          <td>0.925</td>
          <td>0.911</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>7.98</td>
          <td>7.85</td>
          <td>7.80</td>
          <td>0.948</td>
          <td>0.947</td>
          <td>0.946</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>10.34</td>
          <td>10.19</td>
          <td>10.14</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.947</td>
      </tr>
  </tbody>
</table>
<p>MolecularRNN achieves the highest penalized logP scores (10.34 vs. GCPN&rsquo;s 7.98) while matching GCPN on QED. The authors also demonstrate melting temperature optimization using a GCN-based property predictor as the critic (RMSE 39.5 degrees C), showing that the RL framework generalizes to properties that cannot be computed directly from molecular graphs.</p>
<h2 id="distribution-level-evaluation-and-learned-chemical-patterns">Distribution-Level Evaluation and Learned Chemical Patterns</h2>
<p>The authors emphasize that reporting only top-3 scores is not informative, and they compare full property distributions. MolecularRNN shifts the QED distribution further toward maximum values compared to GCPN. They also note that during melting temperature optimization, the model rediscovered two chemical phenomena: fusing aromatic rings increases melting point, and the presence of polar groups (C=O, OH, NH2, heterocyclic nitrogens) enhances dipole-dipole interactions and raises melting temperature.</p>
<p>Without valency-based rejection sampling, the pretrained model achieves 65% validity. After structural penalty training (assigning -10 to valency-violating atoms and optimizing with policy gradient), validity increases to 90%. Enabling rejection sampling then achieves 100%.</p>
<p>Several limitations are worth noting. The BFS ordering introduces an arbitrary sequencing over equivalent graph traversals (the node order permutation problem is not addressed). The evaluation uses top-3 scores for property optimization, though the authors do advocate for distributional evaluation. The molecule size is capped at 50 heavy atoms. The paper does not report training time or wall-clock generation speed. Future directions mentioned include multi-objective property optimization and scaffold completion (graph completion from a given core structure).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>~1.5M molecules</td>
          <td>Bioactive molecules with experimental measurements</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC 250k</td>
          <td>250K molecules</td>
          <td>Random subset of ZINC database</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>MOSES</td>
          <td>~1.9M molecules</td>
          <td>Drug-like subset of ZINC</td>
      </tr>
      <tr>
          <td>Melting point critic</td>
          <td>Custom split</td>
          <td>37,940 train / 9,458 test</td>
          <td>Melting temperatures from -196 to 517 degrees C</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Maximum likelihood with Adam optimizer, learning rate 0.001 with multiplicative decay to $10^{-5}$, 250 epochs</li>
<li><strong>Structural penalty</strong>: Policy gradient with -10 penalty per valency-violating atom</li>
<li><strong>Property optimization</strong>: REINFORCE (policy gradient), 300 iterations, batch size 512, learning rate $10^{-5}$, discount factor $\gamma = 0.97$</li>
<li><strong>Melting point critic</strong>: GCN regression (4 layers, hidden size 128), Adam with learning rate 0.001, exponential decay $\gamma = 0.8$, 30 epochs, batch size 32</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>NodeRNN</strong>: 4 GRU layers, hidden size 256, node embedding 128</li>
<li><strong>EdgeRNN</strong>: 4 GRU layers, hidden size 128, edge embedding 16</li>
<li><strong>NodeMLP/EdgeMLP</strong>: 2-layer MLP with 128 hidden units, ReLU activation, softmax output</li>
<li><strong>BFS window</strong>: $M = 12$ preceding atoms</li>
<li><strong>Atom types</strong>: 9 (C, N, O, F, P, S, Cl, Br, I)</li>
<li><strong>Bond types</strong>: 3 (single, double, triple) + no bond</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>% chemically valid molecules (RDKit)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>% unique in generated pool (up to 1M)</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>% not in training set</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Average pairwise Tanimoto distance</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility (2-4 optimal range)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Drug-likeness score (0-1)</td>
      </tr>
      <tr>
          <td>Penalized logP</td>
          <td>Lipophilicity with ring and SA penalties</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 GPUs (NVIDIA, specific model not stated)</li>
<li>Per-GPU batch size of 512 for pretraining</li>
<li>Training time not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Popova, M., Shvets, M., Oliva, J., &amp; Isayev, O. (2019). MolecularRNN: Generating realistic molecular graphs with optimized properties. <em>arXiv preprint arXiv:1905.13372</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{popova2019molecularrnn,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolecularRNN: Generating realistic molecular graphs with optimized properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Popova, Mariya and Shvets, Mykhailo and Oliva, Junier and Isayev, Olexandr}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1905.13372}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Memory-Assisted RL for Diverse De Novo Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</guid><description>A memory unit for REINVENT-based RL that tracks generated scaffolds and penalizes repeated solutions, increasing molecular diversity up to fourfold.</description><content:encoded><![CDATA[<h2 id="a-memory-module-for-diverse-molecular-generation-via-rl">A Memory Module for Diverse Molecular Generation via RL</h2>
<p>This is a <strong>Method</strong> paper that introduces a memory unit for reinforcement learning (RL)-based molecular generation. The primary contribution is a hash-table-based memory mechanism that integrates into the REINVENT framework&rsquo;s scoring function. By tracking previously generated high-scoring molecules and penalizing the reward when new molecules are too similar to those already stored, the memory unit forces the generative model to explore different regions of chemical space rather than collapsing onto a single scaffold family.</p>
<h2 id="policy-collapse-limits-rl-based-de-novo-design">Policy Collapse Limits RL-Based De Novo Design</h2>
<p>Recurrent neural networks (RNNs) trained with reinforcement learning can generate novel molecules optimized for desired properties. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> algorithm and related approaches (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, GENTRL) demonstrated the viability of coupling a pretrained SMILES-based generative model with a scoring function via RL. However, a persistent problem is <strong>policy collapse</strong> (also called mode collapse): once the model discovers a high-scoring region of chemical space, it continues to exploit that region, producing structurally similar compounds with minor substitution differences. This severely limits the practical utility of RL-based generation in drug design, where medicinal chemists need diverse scaffolds to explore structure-activity relationships and manage intellectual property concerns.</p>
<p>Prior work by Liu et al. [31] attempted to address this by engineering an explorative RNN alongside the standard generative RNN, but it did not substantially increase diversity compared to standard REINVENT. Other approaches like Generative Examination Networks (GEN) performed statistical analysis during training but were not evaluated in optimization scenarios.</p>
<h2 id="core-innovation-hash-table-memory-unit-for-reward-modification">Core Innovation: Hash-Table Memory Unit for Reward Modification</h2>
<p>The key insight is to dynamically modify the reward surface during RL by maintaining a memory of previously explored chemical space. The memory unit is a hash table of index-bucket pairs. Each bucket stores up to a fixed number of high-scoring molecules (default: 25) that are chemically similar to a seed molecule (the index).</p>
<h3 id="integration-with-reinvent">Integration with REINVENT</h3>
<p>The memory unit modifies the augmented likelihood used in REINVENT. For a generated compound $c$, the augmented log-likelihood becomes:</p>
<p>$$
\log P(c)_{Aug} = \log P(c)_{PriorNetwork} + \sigma \times S(c) \times M(c)
$$</p>
<p>where $\sigma$ is a scalar coefficient, $S(c)$ is the scoring function output, and $M(c)$ is the memory unit output (either 0 or 1). The reward is:</p>
<p>$$
R(c) = \left(\log P(c)_{Aug} - \log P(c)_{AgentNetwork}\right)^2
$$</p>
<p>and the loss is $\text{loss} = -R(c)$.</p>
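<p>Putting the three equations together (with $\sigma$ left as a free parameter; the value used in the paper is not stated in this note, so the default below is a placeholder):</p>

```python
def memory_reinvent_step(log_p_prior, log_p_agent, score, memory_gate,
                         sigma=60.0):
    """Compute the memory-modified augmented likelihood, the reward R(c),
    and the loss -R(c). memory_gate is M(c) in {0, 1}; sigma = 60 is a
    placeholder default, not a value from the paper."""
    log_p_aug = log_p_prior + sigma * score * memory_gate
    reward = (log_p_aug - log_p_agent) ** 2
    return reward, -reward
```

<p>With the gate closed ($M(c) = 0$), the score term vanishes from the augmented likelihood, so the agent receives no pull toward that region of chemical space.</p>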
<h3 id="memory-unit-operation">Memory Unit Operation</h3>
<p>When a high-scoring molecule is generated:</p>
<ol>
<li>Its fingerprint or scaffold is compared against all index structures in the memory</li>
<li>If it is similar to an index (above a Tanimoto cutoff, default 0.6) and the corresponding bucket is not full, $M(c) = 1$ and the molecule is added to the bucket</li>
<li>If the bucket is full, $M(c) = 0$, effectively zeroing the reward contribution and discouraging the model from generating similar molecules</li>
<li>If no similar index exists, a new index-bucket pair is created</li>
</ol>
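<p>The four-step procedure above maps onto a small class. A sketch with the similarity function left pluggable (a stand-in for, e.g., ECFP4 Tanimoto or exact scaffold match; the 0.6 cutoff and bucket size 25 are the defaults stated earlier, and the first-matching-index rule is my simplification):</p>

```python
class MemoryUnit:
    """Hash table of index -> bucket pairs; M(c) gates the RL reward."""

    def __init__(self, similarity, cutoff=0.6, bucket_size=25):
        self.similarity = similarity
        self.cutoff = cutoff
        self.bucket_size = bucket_size
        self.buckets = {}  # index structure -> list of similar molecules

    def gate(self, mol):
        """Return M(mol): 1 if mol lands in a non-full (or new) bucket,
        0 if its bucket is already full."""
        for index, bucket in self.buckets.items():
            if self.similarity(mol, index) >= self.cutoff:
                if len(bucket) >= self.bucket_size:
                    return 0  # bucket full: discourage more of the same
                bucket.append(mol)
                return 1
        self.buckets[mol] = [mol]  # no similar index: seed a new bucket
        return 1
```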
<h3 id="four-similarity-criteria">Four Similarity Criteria</h3>
<p>The authors evaluate four criteria for grouping molecules in the memory:</p>
<ol>
<li><strong>Compound similarity</strong>: ECFP4 Tanimoto similarity at the whole-molecule level</li>
<li><strong>Identical Bemis-Murcko (BM) scaffold</strong>: exact match of Bemis-Murcko frameworks</li>
<li><strong>Identical carbon skeleton</strong>: exact match of carbon skeletons (BM scaffolds with all heteroatoms replaced by carbon and bonds set to single)</li>
<li><strong>Scaffold similarity</strong>: atom pair fingerprint Tanimoto similarity between carbon skeletons (fuzzy matching)</li>
</ol>
<h3 id="alternative-output-modes">Alternative Output Modes</h3>
<p>Beyond the binary output ($M(c) \in \{0, 1\}$), the authors also explored smooth output functions. The linear mode:</p>
<p>$$
M(c) = 1 - \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>And the sigmoid mode:</p>
<p>$$
M(c) = 1 - \frac{1}{1 + e^{-\left(\frac{\frac{\text{compounds in bucket}}{\text{bucket size}} \times 2 - 1}{0.15}\right)}}
$$</p>
<p>Both smooth modes yielded slightly fewer analogs than the binary mode and were not pursued further.</p>
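<p>The three output modes, written as functions of the bucket fill ratio (the binary mode gates at a full bucket; the sigmoid uses the 0.15 temperature from the formula above):</p>

```python
import math

def memory_output(fill, mode="binary"):
    """M(c) as a function of fill = compounds_in_bucket / bucket_size."""
    if mode == "binary":
        return 0.0 if fill >= 1.0 else 1.0
    if mode == "linear":
        return 1.0 - fill
    if mode == "sigmoid":
        return 1.0 - 1.0 / (1.0 + math.exp(-(fill * 2 - 1) / 0.15))
    raise ValueError(f"unknown mode: {mode}")
```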
<h2 id="experimental-setup-logp-optimization-and-target-activity-prediction">Experimental Setup: LogP Optimization and Target Activity Prediction</h2>
<h3 id="case-study-1-logp-optimization">Case Study 1: LogP Optimization</h3>
<p>As a proof of concept, the authors optimized LogP values for known DRD2 inhibitors. Starting from 487 DRD2 compounds with LogP &gt;= 5 (from ExCAPE-DB), they applied transfer learning to the prior model for 20 epochs, then ran RL for 150 iterations (100 compounds per iteration, 15,000 total). The scoring function was:</p>
<p>$$
S = 1 - \tanh\left(\min\left(|2 - \text{AlogP}|, |3 - \text{AlogP}|\right)\right)
$$</p>
<p>targeting LogP values between 2.0 and 3.0.</p>
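<p>As written, the score is exactly 1 when AlogP equals 2 or 3 and stays above $1 - \tanh(0.5) \approx 0.54$ anywhere inside the window, decaying smoothly outside it. A direct transcription:</p>

```python
import math

def logp_score(alogp):
    """S = 1 - tanh(min(|2 - AlogP|, |3 - AlogP|)): high near the
    2-3 window, falling off with distance from its nearer edge."""
    return 1.0 - math.tanh(min(abs(2.0 - alogp), abs(3.0 - alogp)))
```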
<h3 id="case-study-2-htr1a-and-drd2-activity-prediction">Case Study 2: HTR1A and DRD2 Activity Prediction</h3>
<p>For a more complex scenario, the authors trained SVM classifiers (with <a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt scaling</a> for probabilistic output) on bioactivity data from ExCAPE-DB to predict activity against two neurotransmitter receptors:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/5-HT1A_receptor">HTR1A</a></strong>: 3,599 actives (pIC50 &gt;= 7) and 66,684 inactives</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a></strong>: 2,981 actives (pIC50 &gt;= 7) and 346,206 inactives (100,000 sampled)</li>
</ul>
<p>Data was split using Butina clustering on ECFP6 at a 0.4 Tanimoto cutoff (60/20/20 train/val/test). The SVM models performed well on the held-out test clusters:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Set</th>
          <th>Balanced Accuracy</th>
          <th>ROC AUC</th>
          <th>F1</th>
          <th>MCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>Test</td>
          <td>0.96</td>
          <td>0.99</td>
          <td>0.75</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Test</td>
          <td>0.95</td>
          <td>0.99</td>
          <td>0.71</td>
          <td>0.72</td>
      </tr>
  </tbody>
</table>
<p>RL was run for 300 iterations (100 compounds each, 30,000 total). Compounds with predicted activity &gt;= 0.7 were considered active.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN prior model followed the REINVENT architecture: an embedding layer, three GRU layers with 256 dimensions, and a linear output layer. It was pretrained on ~1.5 million ChEMBL 25 compounds (filtered to remove known HTR1A actives and DRD2 analogs) for 10 epochs using Adam with a learning rate of 0.01.</p>
<h3 id="comparisons">Comparisons</h3>
<p>The authors compared memory-assisted RL against:</p>
<ul>
<li>Standard REINVENT RL (no memory)</li>
<li>Experience replay (re-presenting 8 high-scoring compounds per iteration)</li>
<li>Temperature scaling (values from 1.0 to 10.0)</li>
<li>Memory + experience replay combined</li>
</ul>
<h2 id="results-up-to-fourfold-increase-in-diverse-active-compounds">Results: Up to Fourfold Increase in Diverse Active Compounds</h2>
<h3 id="logp-optimization-results">LogP Optimization Results</h3>
<p>Memory-assisted RL increased the number of optimized compounds (LogP 2-3) by roughly threefold:</p>
<table>
  <thead>
      <tr>
          <th>Memory Type</th>
          <th>Optimized Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No memory</td>
          <td>938</td>
          <td>727</td>
          <td>396</td>
      </tr>
      <tr>
          <td>Compound similarity</td>
          <td>3,451</td>
          <td>2,963</td>
          <td>1,472</td>
      </tr>
      <tr>
          <td>Identical BM Scaffold</td>
          <td>3,428</td>
          <td>2,865</td>
          <td>1,398</td>
      </tr>
      <tr>
          <td>Identical Carbon Skeleton</td>
          <td>3,315</td>
          <td>3,002</td>
          <td>1,799</td>
      </tr>
      <tr>
          <td>Scaffold Similarity</td>
          <td>3,591</td>
          <td>3,056</td>
          <td>1,538</td>
      </tr>
  </tbody>
</table>
<p>The memory unit also increased the generation of relevant analogs. ECFP6 analogs (Tanimoto &gt;= 0.4 to the training set) increased from 145 to as many as 549, and shared MMP cores increased from 5 to as many as 19, confirming that the memory unit promoted exploration of chemically relevant space rather than random drift.</p>
<h3 id="htr1a-and-drd2-activity-optimization-results">HTR1A and DRD2 Activity Optimization Results</h3>
<p>The improvements were even more pronounced for target activity optimization:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Memory Type</th>
          <th>Active Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>No memory</td>
          <td>9,323</td>
          <td>7,312</td>
          <td>5,446</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Compound similarity</td>
          <td>16,779</td>
          <td>13,304</td>
          <td>9,887</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Identical Carbon Skeleton</td>
          <td>17,597</td>
          <td>15,531</td>
          <td>12,408</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>No memory</td>
          <td>5,143</td>
          <td>2,635</td>
          <td>1,949</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Compound similarity</td>
          <td>21,486</td>
          <td>17,844</td>
          <td>12,749</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Scaffold Similarity</td>
          <td>22,784</td>
          <td>20,712</td>
          <td>16,434</td>
      </tr>
  </tbody>
</table>
<p>For DRD2, the effect was particularly striking: standard RL showed clear policy collapse with only 576 ECFP6 analogs to the training set, while memory-assisted RL generated up to 6,315. The compound similarity memory unit produced the most MMP analogs (217 to the training set vs. 7 without memory).</p>
<h3 id="parameter-sensitivity">Parameter Sensitivity</h3>
<p>Bucket size had a modest effect: larger buckets (allowing more compounds before penalization) slightly increased analog generation. The Tanimoto similarity threshold of 0.6 was near-optimal for the scaffold similarity memory; higher thresholds reduced diversity gains. The compound similarity memory showed increasing analogs with higher thresholds, but BM scaffold and carbon skeleton counts plateaued above 0.6.</p>
<h3 id="comparison-with-experience-replay-and-temperature-scaling">Comparison with Experience Replay and Temperature Scaling</h3>
<ul>
<li><strong>Experience replay alone</strong> increased diversity compared to vanilla RL but was less effective than the memory unit alone</li>
<li><strong>Memory + experience replay</strong> achieved the best results overall, as experience replay provided the model with diverse starting points for exploration after the memory unit altered the reward landscape</li>
<li><strong>Temperature scaling</strong> was largely ineffective: only a value of 1.25 showed improvement, and even then it achieved only about 50% of the analogs generated by memory-assisted RL. Temperatures above 2.0 degraded SMILES validity, and above 4.0 prevented valid molecule generation entirely</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>All evaluations are retrospective; no synthesized compounds were experimentally tested</li>
<li>The SVM activity models, while accurate, may have applicability domain limitations for highly novel scaffolds</li>
<li>The binary memory output mode was found to work best, but the transition from exploration to exploitation is abrupt</li>
<li>The method was only tested with two biological targets and one physicochemical property</li>
<li>Computational overhead of the memory unit is not discussed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior model training</td>
          <td>ChEMBL 25</td>
          <td>~1.5M compounds</td>
          <td>Filtered: max 50 heavy atoms, no stereochemistry, removed HTR1A actives and DRD2 analogs</td>
      </tr>
      <tr>
          <td>HTR1A activity data</td>
          <td>ExCAPE-DB</td>
          <td>3,599 actives + 66,684 inactives</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
      <tr>
          <td>DRD2 activity data</td>
          <td>ExCAPE-DB</td>
          <td>2,981 actives + 100,000 inactives (sampled)</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Generative model</strong>: RNN with embedding + 3 GRU layers (256 dim) + linear output (REINVENT architecture)</li>
<li><strong>RL</strong>: Augmented likelihood formulation with sigma scaling coefficient</li>
<li><strong>SVM classifiers</strong>: Non-linear SVM with MinMax kernel, Platt scaling, ECFP6 count-based fingerprints (2048 dim)</li>
<li><strong>Butina clustering</strong>: ECFP6 Tanimoto cutoff 0.4 for train/val/test splitting</li>
</ul>
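<p>The MinMax kernel on count fingerprints has a simple closed form, generalizing Tanimoto similarity from bit vectors to count vectors. The function below is a minimal sketch of that kernel, not the authors' implementation:</p>

```python
def minmax_kernel(x, y):
    """MinMax similarity between two count fingerprints (sketch).

    k(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i),
    which reduces to Tanimoto similarity when x and y are 0/1 vectors.
    """
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 1.0  # convention: two all-zero vectors match
```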
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unique compounds</td>
          <td>Number of distinct valid SMILES generated</td>
      </tr>
      <tr>
          <td>Unique BM scaffolds</td>
          <td>Bemis-Murcko framework diversity</td>
      </tr>
      <tr>
          <td>Unique carbon skeletons</td>
          <td>Carbon skeleton diversity (stripped BM scaffolds)</td>
      </tr>
      <tr>
          <td>ECFP6 analogs</td>
          <td>Compounds with Tanimoto &gt;= 0.4 to known actives</td>
      </tr>
      <tr>
          <td>MMP analogs</td>
          <td>Matched molecular pair relationships with known actives</td>
      </tr>
      <tr>
          <td>Shared MMP cores</td>
          <td>Scaffold cores shared between generated and known compounds</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tblaschke/reinvent-memory">reinvent-memory</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with prepared datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blaschke, T., Engkvist, O., Bajorath, J., &amp; Chen, H. (2020). Memory-assisted reinforcement learning for diverse molecular de novo design. <em>Journal of Cheminformatics</em>, 12, 68. <a href="https://doi.org/10.1186/s13321-020-00473-0">https://doi.org/10.1186/s13321-020-00473-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blaschke2020memory,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Memory-assisted reinforcement learning for diverse molecular de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blaschke, Thomas and Engkvist, Ola and Bajorath, J{\&#34;u}rgen and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00473-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LSTM Neural Network for Drug-Like Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/</guid><description>An LSTM neural network trained on 509K ChEMBL SMILES generates one million novel drug-like molecules with realistic substructures and bioactivity profiles.</description><content:encoded><![CDATA[<h2 id="an-early-method-for-lstm-based-molecular-generation">An Early Method for LSTM-Based Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that applies character-level LSTM networks to the task of de novo drug-like molecule generation. The primary contribution is demonstrating that an LSTM trained on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings from a large bioactive compound database (ChEMBL) can produce novel, diverse molecules whose chemical properties closely match those of known drug-like compounds. The paper also validates the generated molecules through virtual screening with profile QSAR models, showing comparable predicted bioactivity to the training set.</p>
<h2 id="the-challenge-of-exploring-drug-like-chemical-space">The Challenge of Exploring Drug-Like Chemical Space</h2>
<p>The theoretical space of drug-like molecules is astronomically large. Brute-force enumeration approaches such as <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a> (which catalogued 166 billion molecules) are feasible only for small molecules, and full enumeration of molecules with 25-30 heavy atoms (the typical size of drug molecules) remains computationally intractable. Traditional cheminformatics approaches to sampling this space rely on fragment combination, evolutionary algorithms, or particle swarm optimization.</p>
<p>The authors position LSTM networks as a viable alternative. LSTMs had already demonstrated the ability to learn sequential structure in domains like text and music generation, making them natural candidates for learning SMILES grammar and generating novel valid molecular strings. At the time of writing (late 2017), several groups were exploring this direction, including Bjerrum and Threlfall (ZINC-based generation), <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> (VAE-based latent space design), <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">Olivecrona et al.</a> (RL-guided generation), and Segler et al. (focused library design). This paper contributes a large-scale empirical study with detailed analysis of the generated molecules&rsquo; chemical quality.</p>
<h2 id="character-level-lstm-with-temperature-based-sampling">Character-Level LSTM with Temperature-Based Sampling</h2>
<p>The core approach is straightforward: train an LSTM to predict the next character in a SMILES string, then sample from the trained model to generate new molecules character by character.</p>
<p>The network architecture consists of:</p>
<ul>
<li>Two stacked LSTM layers (which learn the SMILES grammar)</li>
<li>A dropout layer for regularization</li>
<li>A dense output layer with 23 neurons (one per character in the reduced SMILES alphabet) and softmax activation</li>
</ul>
<p>The RMSProp optimizer was used for training. The learning rate was gradually decreased from 0.01 to 0.0002 during training. At generation time, a temperature parameter controls the randomness of character sampling to produce more diverse structures rather than reproducing training molecules too closely.</p>
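<p>Temperature sampling rescales the model's logits before the softmax; higher temperatures flatten the distribution and increase diversity, lower temperatures sharpen it toward the training distribution. The helper below is a minimal stdlib sketch of the idea (the paper does not give code):</p>

```python
import math
import random


def temperature_softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given sampling temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_char(alphabet, logits, temperature=1.0):
    """Draw the next SMILES character from the tempered distribution."""
    probs = temperature_softmax(logits, temperature)
    return random.choices(alphabet, weights=probs, k=1)[0]
```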
<p>A key preprocessing step reduces the SMILES alphabet to 23 characters. Multi-character atom tokens are replaced with single characters (<code>Cl</code> → <code>L</code>, <code>Br</code> → <code>R</code>, <code>[nH]</code> → <code>A</code>). Only the organic atom subset (<code>H</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>S</code>, <code>P</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, <code>I</code>) is retained. Charged molecules, stereo information, and molecules with more than 5 ring closures are excluded. The training corpus totals 23,664,668 characters, with 40-character windows used as input sequences during training.</p>
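<p>The token replacements can be implemented as ordered string substitutions. The snippet below is a sketch covering only the three replacements named above; the paper's full preprocessing additionally filters atoms, charges, stereochemistry, and ring counts:</p>

```python
# Replace multi-character tokens first so later single-character
# substitutions cannot collide with their constituent characters.
REPLACEMENTS = [("[nH]", "A"), ("Cl", "L"), ("Br", "R")]


def reduce_smiles(smiles):
    """Map multi-character SMILES tokens to single characters (sketch)."""
    for token, char in REPLACEMENTS:
        smiles = smiles.replace(token, char)
    return smiles


def expand_smiles(reduced):
    """Invert the reduction to recover standard SMILES."""
    for token, char in reversed(REPLACEMENTS):
        reduced = reduced.replace(char, token)
    return reduced
```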
<h2 id="training-on-chembl-and-generating-one-million-molecules">Training on ChEMBL and Generating One Million Molecules</h2>
<h3 id="training-data">Training Data</h3>
<p>The training set consists of 509,000 bioactive molecules from ChEMBL with reported activity below 10 micromolar on any target.</p>
<h3 id="generation-and-filtering">Generation and Filtering</h3>
<p>The LSTM generates SMILES strings character by character, and each generated string passes through a two-stage validation:</p>
<ol>
<li><strong>Bracket and ring closure check</strong> (fast text-based): 54% of generated SMILES are discarded for unpaired brackets or ring closures</li>
<li><strong>Full chemical parsing with RDKit</strong>: a further 14% fail due to unrealistic aromatic systems or incorrect valences</li>
</ol>
<p>The final yield is 32% of generated SMILES corresponding to valid molecules.</p>
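<p>The fast text-based pre-check can be approximated with a bracket stack plus ring-closure parity counts. The function below is a simplified sketch of that kind of filter, not the authors' exact code; it ignores two-digit <code>%nn</code> ring closures, which the reduced alphabet here does not use anyway:</p>

```python
from collections import Counter


def quick_smiles_check(smiles):
    """Cheap pre-filter: balanced brackets and paired ring closures.

    Passing this check does NOT guarantee chemical validity; surviving
    strings still need a full parser (RDKit in the paper).
    """
    pairs = {")": "(", "]": "["}
    stack = []
    ring_digits = Counter()
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in ")]":
            if not stack or stack.pop() != pairs[ch]:
                return False
        elif ch.isdigit():
            ring_digits[ch] += 1
    # every ring-closure label must occur an even number of times
    return not stack and all(n % 2 == 0 for n in ring_digits.values())
```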
<p>One million valid molecules were generated in under 2 hours on 300 CPUs.</p>
<h3 id="novelty-and-diversity">Novelty and Diversity</h3>
<p>Out of one million generated molecules, only 2,774 (0.28%) were identical to molecules in the training ChEMBL set. The generated set contained 627,000 unique scaffolds compared to 172,000 in ChEMBL, with an overlap of only 18,000 scaffolds. This demonstrates substantial novelty and diversity.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Calculated molecular descriptors (molecular weight, logP, and topological polar surface area) for the generated molecules closely matched the distributions of the ChEMBL training set. The synthetic accessibility score distributions were also practically identical, indicating comparable molecular complexity.</p>
<h3 id="substructure-feature-comparison">Substructure Feature Comparison</h3>
<p>The paper compares substructure features across three molecule sets: ChEMBL training data, LSTM-generated molecules, and a naive SMILES baseline generator. The naive generator uses only character frequency statistics and basic SMILES syntax rules, producing primarily macrocycles with very few fused aromatic systems.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>ChEMBL (%)</th>
          <th>LSTM Generated (%)</th>
          <th>Naive Baseline (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No rings</td>
          <td>0.4</td>
          <td>0.4</td>
          <td>0.1</td>
      </tr>
      <tr>
          <td>1 ring</td>
          <td>2.8</td>
          <td>4.3</td>
          <td>13.2</td>
      </tr>
      <tr>
          <td>2 rings</td>
          <td>14.8</td>
          <td>23.1</td>
          <td>17.7</td>
      </tr>
      <tr>
          <td>3 rings</td>
          <td>32.2</td>
          <td>43.5</td>
          <td>27.3</td>
      </tr>
      <tr>
          <td>4 rings</td>
          <td>32.7</td>
          <td>23.9</td>
          <td>25.2</td>
      </tr>
      <tr>
          <td>&gt;4 rings</td>
          <td>17.2</td>
          <td>4.8</td>
          <td>16.5</td>
      </tr>
      <tr>
          <td>Fused aromatic rings</td>
          <td>38.8</td>
          <td>30.9</td>
          <td>0.2</td>
      </tr>
      <tr>
          <td>Large rings (&gt;8)</td>
          <td>0.4</td>
          <td>1.8</td>
          <td>75.9</td>
      </tr>
      <tr>
          <td>Spiro rings</td>
          <td>1.9</td>
          <td>0.6</td>
          <td>0.6</td>
      </tr>
      <tr>
          <td>Contains N</td>
          <td>96.5</td>
          <td>96.1</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td>Contains O</td>
          <td>93.0</td>
          <td>92.0</td>
          <td>85.5</td>
      </tr>
      <tr>
          <td>Contains S</td>
          <td>35.6</td>
          <td>27.9</td>
          <td>39.6</td>
      </tr>
      <tr>
          <td>Contains halogen</td>
          <td>40.7</td>
          <td>38.8</td>
          <td>49.4</td>
      </tr>
  </tbody>
</table>
<p>The LSTM-generated molecules closely mirror the ChEMBL distributions, while the naive generator fails to capture drug-like structural patterns. The LSTM tends to slightly over-represent 2-3 ring systems and under-represent 4+ ring systems relative to ChEMBL. Functional group distributions also closely matched between ChEMBL and the LSTM output.</p>
<h3 id="virtual-screening-validation">Virtual Screening Validation</h3>
<p>The generated molecules were evaluated using profile QSAR models for 159 ChEMBL kinase assays. The six best models (those with test set R-squared above 0.75) were used to predict pIC50 values for both actual ChEMBL compounds and generated compounds. The cumulative frequency distributions of predicted activity were nearly identical between the two sets.</p>
<p>Kolmogorov-Smirnov (KS) tests on random samples of 1,000 compounds confirmed this quantitatively:</p>
<table>
  <thead>
      <tr>
          <th>Assay</th>
          <th>KS D</th>
          <th>Distributions Differ?</th>
          <th>Mean (Real)</th>
          <th>Mean (Gen)</th>
          <th>Stdev (Real)</th>
          <th>Stdev (Gen)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>688395</td>
          <td>6.01%</td>
          <td>No</td>
          <td>4.66</td>
          <td>4.69</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>668624</td>
          <td>3.60%</td>
          <td>No</td>
          <td>4.86</td>
          <td>4.86</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>9.90%</td>
          <td>Yes</td>
          <td>5.33</td>
          <td>5.26</td>
          <td>0.34</td>
          <td>0.30</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>4.30%</td>
          <td>No</td>
          <td>5.18</td>
          <td>5.13</td>
          <td>0.47</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>688781</td>
          <td>2.20%</td>
          <td>No</td>
          <td>4.83</td>
          <td>4.82</td>
          <td>0.26</td>
          <td>0.25</td>
      </tr>
      <tr>
          <td>809170</td>
          <td>8.70%</td>
          <td>Yes</td>
          <td>5.12</td>
          <td>5.07</td>
          <td>0.51</td>
          <td>0.46</td>
      </tr>
  </tbody>
</table>
<p>For 4 of 6 models, the null hypothesis that the distributions are the same could not be rejected at the 95% confidence level (critical D = 6.04%). Even for the two assays where the KS test rejected the null hypothesis, the maximum vertical distance between distributions was below 10%.</p>
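<p>The KS statistic used here is the maximum vertical gap between two empirical CDFs, and the large-sample critical value at 95% confidence for samples of sizes $n$ and $m$ is approximately $1.36\sqrt{(n+m)/nm}$. The snippet below sketches both in plain Python; in practice <code>scipy.stats.ks_2samp</code> would be the standard tool:</p>

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS D: max vertical distance between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for x in a + b:  # the maximum gap occurs at an observed value
        cdf_a = sum(1 for v in a if v <= x) / len(a)
        cdf_b = sum(1 for v in b if v <= x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_d(n, m, c_alpha=1.36):
    """Approximate two-sample critical D at 95% confidence."""
    return c_alpha * math.sqrt((n + m) / (n * m))
```

For two samples of 1,000 compounds each this gives a critical D of about 6.1%, consistent with the ~6% threshold quoted in the paper.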
<h2 id="generated-molecules-are-novel-drug-like-and-potentially-bioactive">Generated Molecules Are Novel, Drug-Like, and Potentially Bioactive</h2>
<p>The key findings of this study are:</p>
<ol>
<li><strong>High novelty</strong>: Only 0.28% of generated molecules match training compounds; 627K novel scaffolds were produced versus 172K in ChEMBL</li>
<li><strong>Drug-like quality</strong>: Physicochemical properties, substructure features, functional group distributions, and synthetic accessibility scores all closely match the ChEMBL training distribution, without these being explicit constraints</li>
<li><strong>Predicted bioactivity</strong>: Virtual screening with profile QSAR models shows the generated molecules have comparable predicted activity profiles to known bioactive compounds</li>
<li><strong>Scalability</strong>: One million valid molecules in under 2 hours on 300 CPUs, with the potential to scale to billions with GPU acceleration</li>
<li><strong>LSTM superiority over naive baselines</strong>: A simple statistical SMILES generator using only character frequencies produces chemically unrealistic molecules (mostly macrocycles), demonstrating that the LSTM genuinely learns drug-like chemical patterns</li>
</ol>
<p>The main limitations are the 32% validity rate (68% of generated SMILES are invalid), the exclusion of stereochemistry and charged molecules from the training set, and the lack of any goal-directed generation capability (the model produces unconditional samples from the training distribution). The code was described as &ldquo;available on request&rdquo; from the corresponding author rather than publicly released.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL bioactive molecules</td>
          <td>509,000 molecules</td>
          <td>Activity &lt; 10 uM on any target; organic atoms only; no charges or stereo</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Double-stacked LSTM layers with dropout</li>
<li>Softmax output over 23-character reduced SMILES alphabet</li>
<li>RMSProp optimizer with learning rate annealed from 0.01 to 0.0002</li>
<li>Temperature-based sampling at generation time</li>
<li>40-character input windows during training</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture consists of two LSTM layers, a dropout layer, and a 23-neuron dense output layer. Exact hidden unit counts and dropout rates are not specified in the paper.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES rate</td>
          <td>32%</td>
          <td>After bracket check and RDKit parsing</td>
      </tr>
      <tr>
          <td>Novelty (vs. training)</td>
          <td>99.72%</td>
          <td>Only 2,774 of 1M match ChEMBL</td>
      </tr>
      <tr>
          <td>Unique scaffolds</td>
          <td>627,000</td>
          <td>vs. 172,000 in ChEMBL</td>
      </tr>
      <tr>
          <td>KS test (4/6 assays)</td>
          <td>Not significantly different</td>
          <td>At 95% confidence</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Generation: 300 CPUs for under 2 hours (1 million valid molecules)</li>
<li>Training hardware not specified</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ertl, P., Lewis, R., Martin, E., &amp; Polyakov, V. (2017). In silico generation of novel, drug-like chemical matter using the LSTM neural network. <em>arXiv preprint</em>, arXiv:1712.07449.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ertl2017silico,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{In silico generation of novel, drug-like chemical matter using the LSTM neural network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ertl, Peter and Lewis, Richard and Martin, Eric and Polyakov, Valery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.07449}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LatentGAN: Latent-Space GAN for Molecular Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</guid><description>LatentGAN combines a SMILES heteroencoder with a Wasserstein GAN to generate novel drug-like molecules in latent space, avoiding SMILES syntax issues.</description><content:encoded><![CDATA[<h2 id="a-gan-operating-in-learned-latent-space-for-molecular-design">A GAN Operating in Learned Latent Space for Molecular Design</h2>
<p>LatentGAN is a <strong>Method</strong> paper that introduces a two-stage architecture for de novo molecular generation. The first stage trains a heteroencoder to map SMILES strings into a continuous latent vector space. The second stage trains a Wasserstein GAN with gradient penalty (WGAN-GP) to generate new latent vectors that, when decoded, produce valid and novel molecular structures. The key contribution is decoupling the GAN from direct SMILES string generation, allowing the adversarial training to focus on learning the distribution of molecular latent representations rather than character-level sequence generation.</p>
<h2 id="limitations-of-direct-smiles-generation-with-gans">Limitations of Direct SMILES Generation with GANs</h2>
<p>Prior GAN-based molecular generation methods such as ORGAN and ORGANIC operated directly on SMILES strings. This created a fundamental challenge: the generator had to simultaneously learn valid SMILES syntax and the distribution of chemically meaningful molecules. ORGAN struggled with optimizing discrete molecular properties like Lipinski&rsquo;s Rule of Five, while ORGANIC showed limited success beyond the QED drug-likeness score. Other approaches (RANC, ATNC) substituted more advanced recurrent architectures but still operated in the discrete SMILES space.</p>
<p>Meanwhile, variational autoencoders (VAEs) demonstrated that working in continuous latent space could enable molecular generation, but they relied on forcing the latent distribution to match a Gaussian prior through KL divergence. This assumption is not necessarily appropriate for chemical space, which is inherently discontinuous.</p>
<p>RNN-based methods with transfer learning offered an alternative for target-biased generation, but the authors hypothesized that combining GANs with learned latent representations could produce complementary chemical space coverage.</p>
<h2 id="heteroencoder-plus-wasserstein-gan-architecture">Heteroencoder Plus Wasserstein GAN Architecture</h2>
<p>The core innovation of LatentGAN is separating molecular representation learning from adversarial generation through a two-component pipeline.</p>
<h3 id="heteroencoder">Heteroencoder</h3>
<p>The heteroencoder is an autoencoder trained on pairs of different non-canonical (randomized) SMILES representations of the same molecule. This is distinct from a standard autoencoder because the input and target SMILES are different representations of the same structure.</p>
<p>The encoder uses a two-layer bidirectional LSTM with 512 units per layer (256 forward, 256 backward). The concatenated output feeds into a 512-dimensional feed-forward layer. During training, zero-centered Gaussian noise with $\sigma = 0.1$ is added to the latent vector as regularization. The decoder is a four-layer unidirectional LSTM with a softmax output layer. Batch normalization with momentum 0.9 is applied to all hidden layers except the noise layer.</p>
<p>Training uses teacher forcing with categorical cross-entropy loss for 100 epochs. The learning rate starts at $10^{-3}$ for the first 50 epochs and decays exponentially to $10^{-6}$ by the final epoch. After training, the noise layer is deactivated for deterministic encoding and decoding.</p>
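<p>One way to realize the stated schedule (constant $10^{-3}$ for the first 50 epochs, then exponential decay to $10^{-6}$ by epoch 100) is log-linear interpolation between the two endpoints. The function below is an assumed reconstruction, since the paper specifies only the endpoints and the decay shape:</p>

```python
import math

LR_START, LR_END = 1e-3, 1e-6
WARM_EPOCHS, TOTAL_EPOCHS = 50, 100


def learning_rate(epoch):
    """Piecewise schedule: flat warm-up, then exponential decay (sketch)."""
    if epoch <= WARM_EPOCHS:
        return LR_START
    # interpolate log-linearly between the two stated endpoints
    frac = (epoch - WARM_EPOCHS) / (TOTAL_EPOCHS - WARM_EPOCHS)
    return LR_START * math.exp(frac * math.log(LR_END / LR_START))
```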
<p>An important design choice is that the heteroencoder makes no assumption about the latent space distribution (unlike VAEs with their KL divergence term). The latent space is shaped purely by reconstruction loss, and the GAN later learns to sample from this unconstrained distribution.</p>
<h3 id="wasserstein-gan-with-gradient-penalty">Wasserstein GAN with Gradient Penalty</h3>
<p>The GAN uses the WGAN-GP formulation. The critic (discriminator) consists of three feed-forward layers of 256 dimensions each with leaky ReLU activations (no activation on the final layer). The generator has five feed-forward layers of 256 dimensions each with batch normalization and leaky ReLU between layers.</p>
<p>The training ratio is 5:1, with five critic updates for every generator update. The generator takes random vectors sampled from a uniform distribution and learns to produce latent vectors indistinguishable from the real encoded molecular latent vectors.</p>
<p>The WGAN-GP loss for the critic is:</p>
<p>$$L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$</p>
<p>where $\lambda$ is the gradient penalty coefficient, $\mathbb{P}_r$ is the real data distribution (encoded latent vectors), $\mathbb{P}_g$ is the generator distribution, and $\mathbb{P}_{\hat{x}}$ samples uniformly along straight lines between pairs of real and generated points.</p>
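<p>To see the penalty term concretely, take a toy linear critic $D(x) = w \cdot x$, whose gradient with respect to the input is $w$ everywhere. The sketch below computes an interpolated point $\hat{x} = \epsilon x + (1-\epsilon)\tilde{x}$ and the penalty $\lambda(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2$ without any deep-learning framework; a real WGAN-GP implementation would obtain the gradient via autograd:</p>

```python
import math
import random


def gradient_penalty_linear(w, x_real, x_fake, lam=10.0):
    """WGAN-GP penalty for a toy linear critic D(x) = w . x (sketch).

    For a linear critic the input gradient equals w at every point, so
    the interpolation only illustrates where the penalty is evaluated.
    """
    eps = random.random()
    x_hat = [eps * r + (1 - eps) * f for r, f in zip(x_real, x_fake)]
    grad_norm = math.sqrt(sum(wi * wi for wi in w))  # ||grad D(x_hat)|| = ||w||
    penalty = lam * (grad_norm - 1.0) ** 2
    return x_hat, penalty
```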
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>At inference time, the full pipeline operates as: (1) sample a random vector, (2) pass through the trained generator to produce a latent vector, (3) decode the latent vector into a SMILES string using the pretrained heteroencoder decoder.</p>
<h2 id="experiments-on-drug-like-and-target-biased-generation">Experiments on Drug-Like and Target-Biased Generation</h2>
<h3 id="datasets">Datasets</h3>
<p>The heteroencoder was trained on 1,347,173 SMILES from ChEMBL 25, standardized with MolVS and restricted to molecules with atoms from {H, C, N, O, S, Cl, Br} and at most 50 heavy atoms.</p>
<p>For general drug-like generation, a random subset of 100,000 ChEMBL compounds was used to train the GAN model for 30,000 epochs.</p>
<p>For target-biased generation, three datasets were extracted from ExCAPE-DB for EGFR, HTR1A, and S1PR1 targets. These were clustered into training and test sets to ensure chemical series were not split across sets.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Training Set</th>
          <th>Test Set</th>
          <th>SVM ROC-AUC</th>
          <th>SVM Kappa</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>2,949</td>
          <td>2,326</td>
          <td>0.850</td>
          <td>0.56</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>48,283</td>
          <td>23,048</td>
          <td>0.993</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>49,381</td>
          <td>23,745</td>
          <td>0.995</td>
          <td>0.91</td>
      </tr>
  </tbody>
</table>
<p>SVM target prediction models using 2048-bit FCFP6 fingerprints were built with scikit-learn to evaluate generated compounds.</p>
<h3 id="baselines">Baselines</h3>
<p>RNN-based generative models with transfer learning served as the primary baseline. A prior RNN model was trained on the same ChEMBL set, then fine-tuned on each target dataset. The LatentGAN was also benchmarked on the MOSES platform against VAE, JTN-VAE, and AAE architectures.</p>
<h3 id="heteroencoder-performance">Heteroencoder Performance</h3>
<p>The heteroencoder achieved 99% valid SMILES on the training set and 98% on the test set. Reconstruction error (decoding to a different molecule) was 18% on training and 20% on test. Notably, decoding to a different valid SMILES of the same molecule is not counted as an error.</p>
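<p>The accounting convention matters: outputs are compared after canonicalization, so a different valid SMILES of the same molecule is not an error. The sketch below uses a hypothetical lookup table in place of a real canonicalizer such as RDKit's <code>MolToSmiles</code>.</p>

```python
# Hypothetical canonical-form table (a stand-in for RDKit canonicalization).
CANONICAL = {
    "CCO": "CCO", "C(C)O": "CCO",                        # ethanol, two spellings
    "c1ccccc1": "c1ccccc1", "C1=CC=CC=C1": "c1ccccc1",   # benzene, two spellings
    "CCN": "CCN",
}

def reconstruction_error_rate(pairs):
    """Count an error only when input and output canonicalize differently."""
    errors = sum(CANONICAL[inp] != CANONICAL[out] for inp, out in pairs)
    return errors / len(pairs)

pairs = [("CCO", "C(C)O"),            # same molecule, different SMILES: not an error
         ("c1ccccc1", "C1=CC=CC=C1"), # same molecule: not an error
         ("CCO", "CCN")]              # different molecule: error
print(reconstruction_error_rate(pairs))  # 1 error out of 3 → 0.3333...
```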
<h3 id="target-biased-generation-results">Target-Biased Generation Results</h3>
<p>From 50,000 sampled SMILES per target model:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Arch.</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>Active (%)</th>
          <th>Recovered Actives (%)</th>
          <th>Recovered Neighbors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>GAN</td>
          <td>86</td>
          <td>56</td>
          <td>97</td>
          <td>71</td>
          <td>5.26</td>
          <td>196</td>
      </tr>
      <tr>
          <td>EGFR</td>
          <td>RNN</td>
          <td>96</td>
          <td>46</td>
          <td>95</td>
          <td>65</td>
          <td>7.74</td>
          <td>238</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>GAN</td>
          <td>86</td>
          <td>66</td>
          <td>95</td>
          <td>71</td>
          <td>5.05</td>
          <td>284</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>RNN</td>
          <td>96</td>
          <td>50</td>
          <td>90</td>
          <td>81</td>
          <td>7.28</td>
          <td>384</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>GAN</td>
          <td>89</td>
          <td>31</td>
          <td>98</td>
          <td>44</td>
          <td>0.93</td>
          <td>24</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>RNN</td>
          <td>97</td>
          <td>35</td>
          <td>97</td>
          <td>65</td>
          <td>3.72</td>
          <td>43</td>
      </tr>
  </tbody>
</table>
<h3 id="moses-benchmark">MOSES Benchmark</h3>
<p>On the MOSES benchmark (trained on a ZINC subset of 1,584,663 compounds, sampled 30,000 SMILES), LatentGAN showed comparable or better results than JTN-VAE and AAE on Frechet ChemNet Distance (FCD), Fragment similarity, and Scaffold similarity, while producing slightly worse nearest-neighbor cosine similarity (SNN). The standard VAE showed signs of mode collapse with high test metric overlap and low novelty.</p>
<h2 id="complementary-generation-and-drug-likeness-preservation">Complementary Generation and Drug-Likeness Preservation</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Validity and novelty</strong>: LatentGAN achieved 86-89% validity on target-biased tasks (lower than RNN&rsquo;s 96-97%) but produced higher uniqueness on two of three targets and comparable or higher novelty (95-98%).</p>
<p><strong>Complementary chemical space</strong>: The overlap between LatentGAN-generated and RNN-generated active compounds was very small at both compound and scaffold levels. A probabilistic analysis showed that, even with continued sampling, the RNN model would be very unlikely to cover the LatentGAN output space. This suggests the two architectures can work complementarily in de novo design campaigns.</p>
<p><strong>Drug-likeness</strong>: QED score distributions of LatentGAN-generated compounds closely matched training set distributions across all three targets, with training compounds showing only slightly higher drug-likeness. SA score distributions were similarly well-preserved.</p>
<p><strong>Chemical space coverage</strong>: PCA analysis using MQN fingerprints confirmed that generated compounds occupy most of the chemical space of the training sets. Some regions of the PCA plots contained compounds predicted as inactive, which corresponded to non-drug-like outliers in the training data.</p>
<p><strong>Novel scaffolds</strong>: About 14% of scaffolds in the sampled sets had similarity below 0.4 to the training set across all three targets, indicating LatentGAN can generate genuinely novel chemical scaffolds. Around 5% of generated compounds were identical to training set compounds, while 21-25% had Tanimoto similarity below 0.4.</p>
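<p>The similarity thresholding above relies on Tanimoto similarity between fingerprint bit sets, $|A \cap B| / |A \cup B|$. A minimal sketch, with hypothetical on-bit indices standing in for real ECFP fingerprints:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit
    indices: |A intersect B| / |A union B|."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Hypothetical bit sets for a generated scaffold vs. its nearest
# training-set scaffold.
gen_bits   = {3, 17, 42, 101, 256}
train_bits = {3, 17, 99, 300, 512, 777}
sim = tanimoto(gen_bits, train_bits)
print(sim)        # 2 shared bits / 9 total bits → ≈ 0.222
print(sim < 0.4)  # below the paper's novelty threshold → True
```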
<h3 id="limitations">Limitations</h3>
<p>The paper acknowledges several limitations. The 18-20% heteroencoder reconstruction error means a non-trivial fraction of encoded molecules decode to different structures. Validity rates (86-89%) are lower than RNN baselines (96-97%). The S1PR1 target showed notably lower uniqueness (31%) and predicted activity (44%) compared to the other targets, possibly due to the smaller effective training set of active compounds. The paper does not report specific hardware requirements or training times. No wet-lab experimental validation of generated compounds was performed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors envision LatentGAN as a complementary tool to existing RNN-based generative models, with the two architectures covering different regions of chemical space. The approach of operating in learned latent space rather than directly on SMILES strings offers a general framework that could be extended to other molecular representations or generation objectives.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heteroencoder training</td>
          <td>ChEMBL 25 (subset)</td>
          <td>1,347,173 SMILES</td>
          <td>Standardized with MolVS; atoms restricted to H, C, N, O, S, Cl, Br; max 50 heavy atoms</td>
      </tr>
      <tr>
          <td>General GAN training</td>
          <td>ChEMBL 25 (random subset)</td>
          <td>100,000</td>
          <td>Subset of heteroencoder training set</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (EGFR)</td>
          <td>2,949 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (HTR1A)</td>
          <td>48,283 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (S1PR1)</td>
          <td>49,381 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>ZINC (MOSES subset)</td>
          <td>1,584,663</td>
          <td>Canonical SMILES</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Heteroencoder</strong>: Bidirectional LSTM encoder (2 layers, 512 units) + unidirectional LSTM decoder (4 layers), trained with teacher forcing and categorical cross-entropy for 100 epochs</li>
<li><strong>GAN</strong>: WGAN-GP with 5:1 critic-to-generator training ratio. General model trained 30,000 epochs; target models trained 10,000 epochs</li>
<li><strong>Evaluation</strong>: SVM classifiers with FCFP6 fingerprints (2048 bits) for activity prediction; MQN fingerprints for PCA-based chemical space analysis; Murcko scaffolds for scaffold-level analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Heteroencoder: 512-dim latent space, bidirectional LSTM encoder, unidirectional LSTM decoder</li>
<li>Generator: 5 feed-forward layers of 256 dims with batch norm and leaky ReLU</li>
<li>Critic: 3 feed-forward layers of 256 dims with leaky ReLU</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LatentGAN (EGFR)</th>
          <th>RNN Baseline (EGFR)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>86%</td>
          <td>96%</td>
          <td>Percent valid SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>56%</td>
          <td>46%</td>
          <td>Percent unique among valid</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>97%</td>
          <td>95%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Predicted active</td>
          <td>71%</td>
          <td>65%</td>
          <td>By SVM model</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Dierme/latent-gan">LatentGAN source code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Includes trained heteroencoder model and training sets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Prykhodko, O., Johansson, S.V., Kotsias, P.-C., Arús-Pous, J., Bjerrum, E.J., Engkvist, O., &amp; Chen, H. (2019). A de novo molecular generation method using latent vector based generative adversarial network. <em>Journal of Cheminformatics</em>, 11(1), 74. <a href="https://doi.org/10.1186/s13321-019-0397-9">https://doi.org/10.1186/s13321-019-0397-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{prykhodko2019latentgan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A de novo molecular generation method using latent vector based generative adversarial network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Prykhodko, Oleksii and Johansson, Simon Viet and Kotsias, Panagiotis-Christos and Ar{\&#39;u}s-Pous, Josep and Bjerrum, Esben Jannik and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{74}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0397-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Grammar VAE: Generating Valid Molecules via CFGs</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/</guid><description>The Grammar VAE encodes and decodes molecular parse trees from context-free grammars, guaranteeing syntactically valid SMILES outputs during generation.</description><content:encoded><![CDATA[<h2 id="a-grammar-constrained-vae-for-discrete-data-generation">A Grammar-Constrained VAE for Discrete Data Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Grammar Variational Autoencoder (GVAE), a variant of the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder</a> that operates directly on parse trees from context-free grammars (CFGs) rather than on raw character sequences. The primary contribution is a decoding mechanism that uses a stack and grammar-derived masks to restrict the output at every timestep to only syntactically valid production rules. This guarantees that every decoded output is a valid string under the grammar, addressing a fundamental limitation of character-level VAEs when applied to structured discrete data such as <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> molecular strings and arithmetic expressions.</p>
<h2 id="why-character-level-vaes-fail-on-structured-discrete-data">Why Character-Level VAEs Fail on Structured Discrete Data</h2>
<p>Generative models for continuous data (images, audio) had achieved impressive results by 2017, but generating structured discrete data remained difficult. The key challenge is that string representations of molecules and mathematical expressions are brittle: small perturbations to a character sequence often produce invalid outputs. <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> demonstrated a character-level VAE (CVAE) for SMILES strings that could encode molecules into a continuous latent space and decode them back, enabling latent-space optimization for molecular design. However, the CVAE frequently decoded latent points into strings that were not valid SMILES, particularly when exploring regions of latent space far from training data.</p>
<p>The fundamental issue is that character-level decoders must implicitly learn the syntactic rules of the target language from data alone. For SMILES, this includes matching parentheses, valid atom types, proper bonding, and ring closure notation. The GVAE addresses this by giving the decoder explicit knowledge of the grammar, so it can focus entirely on learning the semantic structure of the data.</p>
<h2 id="core-innovation-stack-based-grammar-masking-in-the-decoder">Core Innovation: Stack-Based Grammar Masking in the Decoder</h2>
<p>The GVAE encodes and decodes sequences of production rules from a context-free grammar rather than sequences of characters.</p>
<p><strong>Encoding.</strong> Given an input string (e.g., a SMILES molecule), the encoder first parses it into a parse tree using the CFG, then performs a left-to-right pre-order traversal of the tree to extract an ordered sequence of production rules. Each rule is represented as a one-hot vector of dimension $K$ (total number of production rules in the grammar). The resulting $T(\mathbf{X}) \times K$ matrix is processed by a convolutional neural network to produce the mean and variance of a Gaussian posterior $q_{\phi}(\mathbf{z} \mid \mathbf{X})$.</p>
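<p>The rule-sequence encoding step can be sketched as follows; the grammar size and rule indices are hypothetical, and in the paper the resulting matrix feeds a CNN encoder.</p>

```python
import numpy as np

K = 4                            # hypothetical grammar with K production rules
rule_sequence = [0, 2, 2, 3, 1]  # pre-order traversal of a parse tree

def one_hot_rules(seq, num_rules):
    """Turn an ordered rule sequence into the T(X) x K one-hot matrix."""
    mat = np.zeros((len(seq), num_rules))
    mat[np.arange(len(seq)), seq] = 1.0
    return mat

X = one_hot_rules(rule_sequence, K)
print(X.shape)        # → (5, 4): one row per timestep, one column per rule
print(X.sum(axis=1))  # → all ones: exactly one rule fires per timestep
```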
<p><strong>Decoding with grammar masks.</strong> The decoder maps a latent vector $\mathbf{z}$ through an RNN to produce a matrix of logits $\mathbf{F} \in \mathbb{R}^{T_{max} \times K}$. The key innovation is a last-in first-out (LIFO) stack that tracks the current parsing state. At each timestep $t$, the decoder:</p>
<ol>
<li>Pops the top non-terminal $\alpha$ from the stack</li>
<li>Applies a fixed binary mask $\mathbf{m}_{\alpha} \in \{0, 1\}^K$ that zeros out all production rules whose left-hand side is not $\alpha$</li>
<li>Samples a production rule from the masked softmax distribution:</li>
</ol>
<p>$$
p(\mathbf{x}_{t} = k \mid \alpha, \mathbf{z}) = \frac{m_{\alpha,k} \exp(f_{tk})}{\sum_{j=1}^{K} m_{\alpha,j} \exp(f_{tj})}
$$</p>
<ol start="4">
<li>Pushes the right-hand-side non-terminals of the selected rule onto the stack (right-to-left, so the leftmost is on top)</li>
</ol>
<p>This process continues until the stack is empty or $T_{max}$ timesteps are reached. Because the mask restricts selection to only those rules applicable to the current non-terminal, every generated sequence of production rules is guaranteed to be a valid derivation under the grammar.</p>
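<p>A minimal sketch of the stack-based masked decoding loop, using a hypothetical four-rule arithmetic grammar and greedy rule selection in place of the paper's sampling from the masked softmax:</p>

```python
import numpy as np

# Toy CFG (hypothetical): (left-hand side, right-hand-side nonterminals).
RULES = [
    ("S", ["S", "T"]),   # 0: S -> S + T
    ("S", ["T"]),        # 1: S -> T
    ("T", ["S"]),        # 2: T -> ( S )
    ("T", []),           # 3: T -> x
]
K = len(RULES)
MASKS = {lhs: np.array([1.0 if r[0] == lhs else 0.0 for r in RULES])
         for lhs in {"S", "T"}}

def decode(logits, t_max=20):
    """At each step, pop the top nonterminal, mask the logits to rules whose
    LHS matches it, pick a rule, and push its RHS nonterminals right-to-left
    so the leftmost ends up on top."""
    stack, derivation = ["S"], []
    for t in range(t_max):
        if not stack:
            break
        alpha = stack.pop()
        masked = np.where(MASKS[alpha] > 0, logits[t], -np.inf)
        k = int(np.argmax(masked))           # greedy stand-in for sampling
        derivation.append(k)
        stack.extend(reversed(RULES[k][1]))
    return derivation, len(stack) == 0       # non-empty stack => invalid

rng = np.random.default_rng(0)
seq, ok = decode(rng.normal(size=(20, K)))
print(seq, ok)  # rule sequence, and whether the stack emptied within t_max
```

<p>By construction, every rule in the emitted sequence is applicable to the nonterminal on top of the stack, so the derivation is always grammatical; only non-termination within $T_{max}$ can invalidate it.</p>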
<p><strong>Training.</strong> The model is trained by maximizing the ELBO:</p>
<p>$$
\mathcal{L}(\phi, \theta; \mathbf{X}) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{X})} \left[ \log p_{\theta}(\mathbf{X}, \mathbf{z}) - \log q_{\phi}(\mathbf{z} \mid \mathbf{X}) \right]
$$</p>
<p>where the likelihood factorizes as:</p>
<p>$$
p(\mathbf{X} \mid \mathbf{z}) = \prod_{t=1}^{T(\mathbf{X})} p(\mathbf{x}_{t} \mid \mathbf{z})
$$</p>
<p>During training, the masks at each timestep are determined by the ground-truth production rule sequence, so no stack simulation is needed. The stack-based decoding is only required at generation time.</p>
<p><strong>Syntactic vs. semantic validity.</strong> The grammar guarantees syntactic validity but not semantic validity. The GVAE can still produce chemically implausible molecules (e.g., an oxygen atom with three bonds) because such constraints are not context-free. SMILES ring-bond digit matching is also not context-free, so the grammar cannot enforce it. Additionally, sequences that have not emptied the stack by $T_{max}$ are marked invalid.</p>
<h2 id="experiments-on-symbolic-regression-and-molecular-optimization">Experiments on Symbolic Regression and Molecular Optimization</h2>
<p>The authors evaluate the GVAE on two domains: arithmetic expressions and molecules. Both use Bayesian optimization (BO) over the learned latent space.</p>
<p><strong>Setup.</strong> After training each VAE, the authors encode training data into latent vectors and train a sparse Gaussian process (SGP) with 500 inducing points to predict properties from latent representations. They then run batch BO with expected improvement, selecting 50 candidates per iteration.</p>
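<p>The expected improvement acquisition used in the BO loop can be computed in closed form under a Gaussian posterior. The sketch below ranks a few hypothetical latent candidates and keeps a small batch (the paper selects 50 per iteration from the sparse-GP posterior); candidate names and values are made up.</p>

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximization under N(mu, sigma^2); 'best' is the incumbent."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))           # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)    # normal PDF
    return (mu - best - xi) * Phi + sigma * phi

# (name, posterior mean, posterior std) for hypothetical latent points.
posterior = [("z1", 1.2, 0.3), ("z2", 0.9, 1.0), ("z3", 1.4, 0.1)]
best_so_far = 1.3
batch = sorted(posterior,
               key=lambda c: -expected_improvement(c[1], c[2], best_so_far))[:2]
print([name for name, _, _ in batch])  # → ['z2', 'z3']
```

<p>Note that the high-variance candidate <code>z2</code> outranks the slightly-above-incumbent <code>z3</code>: EI trades off exploitation against exploration through the posterior uncertainty.</p>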
<h3 id="arithmetic-expressions">Arithmetic Expressions</h3>
<ul>
<li><strong>Data</strong>: 100,000 randomly generated univariate expressions from a simple grammar (3 binary operators, 2 unary operators, 3 constants), each with at most 15 production rules</li>
<li><strong>Target</strong>: Find an expression minimizing $\log(1 + \text{MSE})$ against the true function $1/3 + x + \sin(x \cdot x)$</li>
<li><strong>BO iterations</strong>: 5, averaged over 10 repetitions</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.99 +/- 0.01</td>
          <td>3.47 +/- 0.24</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.86 +/- 0.06</td>
          <td>4.75 +/- 0.25</td>
      </tr>
  </tbody>
</table>
<p>The GVAE&rsquo;s best expression ($x/1 + \sin(3) + \sin(x \cdot x)$, score 0.04) nearly exactly recovers the true function, while the CVAE&rsquo;s best ($x \cdot 1 + \sin(3) + \sin(3/1)$, score 0.39) misses the sinusoidal component.</p>
<h3 id="molecular-optimization">Molecular Optimization</h3>
<ul>
<li><strong>Data</strong>: 250,000 SMILES strings from the ZINC database</li>
<li><strong>Target</strong>: Maximize penalized logP (water-octanol partition coefficient penalized for ring size and synthetic accessibility)</li>
<li><strong>BO iterations</strong>: 10, averaged over 5 trials</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.31 +/- 0.07</td>
          <td>-9.57 +/- 1.77</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.17 +/- 0.05</td>
          <td>-54.66 +/- 2.66</td>
      </tr>
  </tbody>
</table>
<p>The GVAE produces roughly twice as many valid molecules as the CVAE and finds molecules with substantially better penalized logP scores (best: 2.94 vs. 1.98).</p>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>Interpolation experiments show that the GVAE produces valid outputs at every intermediate point when linearly interpolating between two encoded expressions, while the CVAE passes through invalid strings. Grid searches around encoded molecules in the GVAE latent space show smooth transitions where neighboring points differ by single atoms.</p>
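<p>The interpolation itself is plain linear mixing of latent codes, $\mathbf{z}(\alpha) = (1-\alpha)\mathbf{z}_a + \alpha \mathbf{z}_b$, with each intermediate point decoded; decoding is omitted in this sketch and the endpoints are arbitrary.</p>

```python
import numpy as np

def interpolate(z_a, z_b, steps=5):
    """Evenly spaced points on the line segment between two latent codes."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

z_a = np.zeros(56)   # 56-dim latent space, as used for molecules
z_b = np.ones(56)
path = interpolate(z_a, z_b, steps=5)
print([float(p[0]) for p in path])  # → [0.0, 0.25, 0.5, 0.75, 1.0]
```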
<h3 id="predictive-performance">Predictive Performance</h3>
<p>Sparse GP models trained on GVAE latent features achieve better test RMSE and log-likelihood than those trained on CVAE features for both expressions and molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE (Expressions)</th>
          <th>CVAE (Expressions)</th>
          <th>GVAE (Molecules)</th>
          <th>CVAE (Molecules)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Test LL</td>
          <td>-1.320 +/- 0.001</td>
          <td>-1.397 +/- 0.003</td>
          <td>-1.739 +/- 0.004</td>
          <td>-1.812 +/- 0.004</td>
      </tr>
      <tr>
          <td>Test RMSE</td>
          <td>0.884 +/- 0.002</td>
          <td>0.975 +/- 0.004</td>
          <td>1.404 +/- 0.006</td>
          <td>1.504 +/- 0.006</td>
      </tr>
  </tbody>
</table>
<h3 id="reconstruction-and-prior-sampling">Reconstruction and Prior Sampling</h3>
<p>On held-out molecules, the GVAE achieves 53.7% reconstruction accuracy vs. 44.6% for the CVAE. When sampling from the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$, 7.2% of GVAE samples are valid molecules vs. 0.7% for the CVAE.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<p><strong>Key findings.</strong> Incorporating grammar structure into the VAE decoder consistently improves validity rates, latent space smoothness, downstream predictive performance, and Bayesian optimization outcomes across both domains. The approach is general: any domain with a context-free grammar can benefit.</p>
<p><strong>Limitations acknowledged by the authors.</strong></p>
<ul>
<li>The GVAE guarantees syntactic but not semantic validity. For molecules, invalid ring-bond patterns and chemically implausible structures can still be generated.</li>
<li>The molecular validity rate during BO (31%) is substantially higher than the CVAE&rsquo;s (17%) but still means most decoded molecules are invalid, largely due to non-context-free constraints in SMILES.</li>
<li>The approach requires a context-free grammar for the target domain, which limits applicability to well-defined formal languages.</li>
<li>Sequences that do not complete parsing within $T_{max}$ timesteps are discarded as invalid.</li>
</ul>
<p><strong>Impact.</strong> The GVAE was an influential early contribution to constrained molecular generation. It directly inspired the Syntax-Directed VAE (SD-VAE) by Dai et al. (2018), which uses attribute grammars for tighter semantic constraints, and contributed to the broader movement toward structured molecular generation methods including graph-based approaches. The paper demonstrated that encoding domain knowledge into the decoder architecture is more effective than relying on the model to learn structural constraints from data alone.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (expressions)</td>
          <td>Generated arithmetic expressions</td>
          <td>100,000</td>
          <td>Up to 15 production rules each</td>
      </tr>
      <tr>
          <td>Training (molecules)</td>
          <td>ZINC database subset</td>
          <td>250,000 SMILES</td>
          <td>Same subset as <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 1D convolutional neural network over one-hot rule sequences</li>
<li>Decoder: RNN with stack-based grammar masking</li>
<li>Latent space: 56 dimensions (molecules), isotropic Gaussian prior</li>
<li>Property predictor: Sparse Gaussian process with 500 inducing points</li>
<li>Optimization: Batch Bayesian optimization with expected improvement, 50 candidates per iteration, Kriging Believer for batch selection</li>
</ul>
<h3 id="models">Models</h3>
<p>Architecture details follow <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> with modifications for grammar-based encoding/decoding. Specific layer sizes and hyperparameters are described in the supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE</th>
          <th>CVAE</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid (expressions)</td>
          <td>0.99</td>
          <td>0.86</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Fraction valid (molecules)</td>
          <td>0.31</td>
          <td>0.17</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Best penalized logP</td>
          <td>2.94</td>
          <td>1.98</td>
          <td>Best molecule found</td>
      </tr>
      <tr>
          <td>Reconstruction accuracy</td>
          <td>53.7%</td>
          <td>44.6%</td>
          <td>On held-out molecules</td>
      </tr>
      <tr>
          <td>Prior validity</td>
          <td>7.2%</td>
          <td>0.7%</td>
          <td>Sampling from N(0,I)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mkusner/grammarVAE">grammarVAE</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kusner, M. J., Paige, B., &amp; Hernández-Lobato, J. M. (2017). Grammar Variational Autoencoder. <em>Proceedings of the 34th International Conference on Machine Learning (ICML)</em>, 1945-1954.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kusner2017grammar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Grammar Variational Autoencoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kusner, Matt J. and Paige, Brooks and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 34th International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1945--1954}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v2: Pareto Multi-Objective RL for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</guid><description>DrugEx v2 extends RNN-based de novo drug design with Pareto ranking and evolutionary exploration for multi-objective molecule generation.</description><content:encoded><![CDATA[<h2 id="multi-objective-de-novo-drug-design-with-pareto-optimization">Multi-Objective De Novo Drug Design with Pareto Optimization</h2>
<p>This is a <strong>Method</strong> paper that extends the DrugEx framework (v1) to handle multi-objective optimization in de novo drug design. The primary contribution is integrating Pareto-based ranking with evolutionary algorithm concepts (crossover and mutation) into an RNN-based reinforcement learning pipeline. The system generates <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based molecules optimized simultaneously for activity toward multiple protein targets while avoiding off-targets, addressing polypharmacology scenarios where drugs must bind multiple specific receptors.</p>
<h2 id="polypharmacology-and-the-limits-of-single-objective-generation">Polypharmacology and the Limits of Single-Objective Generation</h2>
<p>Traditional drug discovery follows the &ldquo;one drug, one target, one disease&rdquo; paradigm, but drug molecules interact with an average of six protein targets. Off-target binding causes side effects that remain a leading cause of clinical failure and post-approval drug withdrawals (over 500 drugs withdrawn due to fatal toxicity). Complex diseases often require modulating multiple targets simultaneously, making polypharmacology an important design objective.</p>
<p>Prior deep learning approaches for de novo design, including DrugEx v1, focused on generating molecules active against a single target. Extending these methods to multiple objectives introduces fundamental challenges: objectives are often contradictory (high affinity for one target may correlate with high affinity for an undesired off-target), and naive weighted-sum approaches can collapse diversity by over-optimizing a single dominant objective. The authors specifically target the <a href="https://en.wikipedia.org/wiki/Adenosine_receptor">adenosine receptor</a> system, where $A_1AR$ and $A_{2A}AR$ selectivity profiles matter for therapeutic efficacy, and <a href="https://en.wikipedia.org/wiki/HERG">hERG</a> channel binding must be avoided to prevent cardiac toxicity.</p>
<h2 id="evolutionary-exploration-and-pareto-ranking-in-rl">Evolutionary Exploration and Pareto Ranking in RL</h2>
<p>The core innovation of DrugEx v2 has two components: an evolutionary exploration strategy and Pareto-based reward assignment.</p>
<h3 id="evolutionary-exploration-strategy">Evolutionary Exploration Strategy</h3>
<p>The generation process uses three RNN networks with identical LSTM architectures:</p>
<ul>
<li><strong>Agent net</strong> ($G_A$): the primary generator, updated at each training epoch via policy gradient</li>
<li><strong>Crossover net</strong> ($G_C$): initialized from the fine-tuned model, updated iteratively from $G_A$ after each convergence period</li>
<li><strong>Mutation net</strong> ($G_M$): initialized from the pre-trained model, parameters fixed throughout training</li>
</ul>
<p>At each token-generation step, a random number determines whether the token probability comes from the combination of $G_A$ and $G_C$ (with probability $1 - \varepsilon$) or from $G_M$ (with probability $\varepsilon$). This mirrors crossover and mutation operations from evolutionary algorithms, maintaining diversity while steering toward desired properties.</p>
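<p>The token-level mixing can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the exact rule for combining the $G_A$ and $G_C$ output distributions is assumed here to be an element-wise average, and all function and parameter names are hypothetical.</p>

```python
import random

def sample_token(probs_agent, probs_crossover, probs_mutation, epsilon, rng=random):
    """One token-generation step of the evolutionary exploration strategy.

    With probability epsilon the token distribution comes from the mutation
    net; otherwise from an (assumed) element-wise average of the agent and
    crossover nets' output distributions."""
    if rng.random() < epsilon:
        dist = probs_mutation
    else:
        # Assumed combination rule: mean of G_A and G_C token probabilities.
        dist = [(a + c) / 2 for a, c in zip(probs_agent, probs_crossover)]
    # Sample a token index from the chosen categorical distribution.
    r, acc = rng.random(), 0.0
    for idx, p in enumerate(dist):
        acc += p
        if r < acc:
            return idx
    return len(dist) - 1
```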
<h3 id="pareto-front-reward-scheme">Pareto Front Reward Scheme</h3>
<p>For $n$ objectives (three in this study: $A_1AR$, $A_{2A}AR$, hERG), each molecule receives a score $R_i$ based on its predicted bioactivity:</p>
<p>$$
R_{i} = \begin{cases} \text{minmax}(pX_{i}), &amp; \text{if high affinity required} \\ 1 - \text{minmax}(pX_{i}), &amp; \text{if low affinity required} \\ 0, &amp; \text{if SMILES invalid} \end{cases}
$$</p>
<p>where $pX_i$ is the predicted bioactivity (range 3.0 to 10.0), normalized to [0, 1].</p>
<p>For the multi-target case, high affinity is required for both $A_1AR$ and $A_{2A}AR$ while low affinity is required for hERG. For the target-specific case, high affinity is required only for $A_{2A}AR$ while low affinity is required for both $A_1AR$ and hERG.</p>
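<p>The per-objective score $R_i$ can be written directly from the case definition above. A minimal sketch, assuming predicted $pX$ values outside the stated 3.0-10.0 range are clipped; the function name and keyword arguments are illustrative, not from the paper.</p>

```python
def objective_score(pX, high_affinity_required, valid=True,
                    px_min=3.0, px_max=10.0):
    """Per-objective reward R_i from a predicted bioactivity pX in [3.0, 10.0].

    minmax-normalizes pX to [0, 1]; inverts the score when low affinity is
    desired; an invalid SMILES scores 0, matching the paper's case definition."""
    if not valid:
        return 0.0
    s = (pX - px_min) / (px_max - px_min)  # minmax normalization
    s = min(max(s, 0.0), 1.0)              # assumed clipping outside the range
    return s if high_affinity_required else 1.0 - s

# Multi-target case: high affinity for A1AR and A2AAR, low affinity for hERG.
scores = [objective_score(7.5, True),    # A1AR
          objective_score(8.0, True),    # A2AAR
          objective_score(4.0, False)]   # hERG
```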
<p>Molecules are ranked using a <a href="https://en.wikipedia.org/wiki/Non-dominated_sorting_genetic_algorithm_II">non-dominated sorting</a> algorithm to construct Pareto fronts. Within each front, molecules are ranked by average Tanimoto distance (using ECFP6 fingerprints) rather than crowding distance, favoring chemically diverse solutions. The final reward is:</p>
<p>$$
R_i^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>where $k$ is the molecule&rsquo;s index in the Pareto rank. Rewards for undesired and desired solutions are distributed in $(0, 0.5]$ and $(0.5, 1.0]$, respectively.</p>
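<p>The non-dominated sorting step can be illustrated with a simple (non-GPU) sketch. This naive version omits the within-front Tanimoto-distance ranking and is quadratic per front; the paper's implementation is a GPU-accelerated variant.</p>

```python
def dominates(a, b):
    """a Pareto-dominates b: at least as good in every objective, strictly
    better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(scores):
    """Return Pareto fronts as lists of molecule indices, best front first."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        # A molecule is on the current front if nothing remaining dominates it.
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts
```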
<p>The agent is trained via policy gradient:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) \middle|\theta\right] = \sum_{t=1}^{T} \log G(y_t | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
<h3 id="weighted-sum-alternative">Weighted Sum Alternative</h3>
<p>The authors also implement a weighted sum (WS) scheme with dynamic weights proportional to the ratio of undesired to desired molecules per objective:</p>
<p>$$
w_i = \frac{r_i}{\sum_{k=1}^{M} r_k}, \quad R^{*} = \sum_{i=1}^{n} w_i R_i
$$</p>
<p>This auto-adjusts importance toward under-performing objectives during training.</p>
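<p>A minimal sketch of the dynamic weighting, assuming each $r_i$ is the ratio of undesired to desired molecules for objective $i$ at the current epoch; function names are illustrative.</p>

```python
def dynamic_weights(undesired_ratios):
    """Weights w_i proportional to the ratio r_i of undesired to desired
    molecules per objective, so under-performing objectives gain importance."""
    total = sum(undesired_ratios)
    return [r / total for r in undesired_ratios]

def weighted_sum_reward(scores, weights):
    """R* = sum_i w_i * R_i for one molecule's per-objective scores."""
    return sum(w * s for w, s in zip(weights, scores))
```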
<h3 id="molecular-diversity-metric">Molecular Diversity Metric</h3>
<p>Diversity is measured using the Solow-Polasky metric adapted from ecological biodiversity:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\top} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
<p>where $F(\mathbf{s})$ is a distance matrix with entries $f(d_{ij}) = e^{-\theta d_{ij}}$ and $d_{ij}$ is the Tanimoto distance between ECFP6 fingerprints of molecules $s_i$ and $s_j$.</p>
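<p>The metric can be computed from a distance matrix as sketched below. This is a pure-Python illustration (a real implementation would use a linear-algebra library); it solves $F\mathbf{x} = \mathbf{e}$ by Gaussian elimination rather than forming the inverse explicitly, and fails if $F$ is singular (e.g., duplicate molecules).</p>

```python
import math

def solve(F, e):
    """Solve F x = e by Gaussian elimination with partial pivoting."""
    n = len(F)
    A = [row[:] + [e[i]] for i, row in enumerate(F)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def solow_polasky(distances, theta=1.0):
    """I(A) = (1/|A|) e^T F^{-1} e with F_ij = exp(-theta * d_ij).

    `distances` is a symmetric matrix of Tanimoto distances (d_ii = 0)."""
    n = len(distances)
    F = [[math.exp(-theta * d) for d in row] for row in distances]
    x = solve(F, [1.0] * n)   # x = F^{-1} e
    return sum(x) / n         # (1/n) e^T x
```

With two identical molecules the metric approaches 0.5 per this normalization; as pairwise distances grow, it approaches 1.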
<h2 id="multi-target-and-target-specific-experiments">Multi-Target and Target-Specific Experiments</h2>
<h3 id="qsar-environment">QSAR Environment</h3>
<p>Four ML algorithms were benchmarked for the bioactivity prediction environment: Random Forest (RF), SVM, PLS, and Multi-task DNN (MT-DNN). Input features combined 2048-bit ECFP6 fingerprints with 19 physicochemical descriptors (2067D total). The training data came from ChEMBL v26: 25,731 ligands with bioactivity measurements toward $A_1AR$, $A_{2A}AR$, and hERG. RF was selected as the final predictor based on superior performance in temporal-split independent testing ($R^2$ and RMSE), prioritizing robustness over cross-validation metrics.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN generator uses six layers: input, embedding (128D), three LSTM recurrent layers (512 hidden units), and output. LSTM was chosen over GRU based on higher valid SMILES rates (97.5% vs. 93.1% for pre-trained, 97.9% vs. 95.7% for fine-tuned). Pre-training used 1.7M molecules from ChEMBL; fine-tuning used the 25,731 LIGAND set molecules.</p>
<h3 id="baselines">Baselines</h3>
<p>DrugEx v2 was compared against DrugEx v1, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, all using the same RNN architecture and pre-trained/fine-tuned models, with only the RL framework differing. Both Pareto front (PF) and weighted sum (WS) reward schemes were tested.</p>
<h3 id="multi-target-results">Multi-Target Results</h3>
<p>In the multi-target case (high affinity for $A_1AR$ and $A_{2A}AR$, low affinity for hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.57%</td>
          <td>80.81%</td>
          <td>87.29%</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.80%</td>
          <td><strong>97.45%</strong></td>
          <td>89.08%</td>
          <td>0.49</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>PF</td>
          <td>99.54%</td>
          <td>57.43%</td>
          <td><strong>98.84%</strong></td>
          <td><strong>0.77</strong></td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.84%</td>
          <td>66.01%</td>
          <td>82.67%</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>DrugEx v1</td>
          <td>PF</td>
          <td>98.28%</td>
          <td>43.27%</td>
          <td>88.96%</td>
          <td>0.71</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 achieved the highest desirability under both schemes. The WS scheme maximized desirability (97.45%) but at the cost of diversity (0.49). The PF scheme maintained higher diversity (0.70) with still-strong desirability (80.81%).</p>
<h3 id="target-specific-results">Target-Specific Results</h3>
<p>In the target-specific case (high $A_{2A}AR$, low $A_1AR$ and hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.53%</td>
          <td><strong>89.49%</strong></td>
          <td>90.55%</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.62%</td>
          <td><strong>97.86%</strong></td>
          <td>90.54%</td>
          <td>0.31</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>WS</td>
          <td>99.55%</td>
          <td>81.27%</td>
          <td>98.87%</td>
          <td>0.34</td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.29%</td>
          <td>86.98%</td>
          <td>80.30%</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 with PF achieved high desirability (89.49%) while maintaining diversity (0.73), outperforming both the WS scheme&rsquo;s diversity collapse (0.31) and competing methods.</p>
<h3 id="chemical-space-coverage">Chemical Space Coverage</h3>
<p>t-SNE visualization with ECFP6 descriptors showed that the PF scheme guided generators to cover chemical space more broadly than the WS scheme. DrugEx v1 and v2 covered nearly all of the chemical space occupied by known active ligands, while REINVENT and ORGANIC covered only partial regions in the target-specific case.</p>
<h3 id="substructure-distribution">Substructure Distribution</h3>
<p>Generated molecules were evaluated for purine ring, furan ring, and benzene ring frequencies. DrugEx v2 with PF produced substructure distributions closest to the LIGAND set, suggesting it better preserves the chemical characteristics of known active molecules compared to REINVENT (which over-represented benzene rings) and ORGANIC.</p>
<h3 id="guacamol-benchmark">GuacaMol Benchmark</h3>
<p>DrugEx v2 was tested on 20 goal-directed tasks from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark, achieving the best score on 12 of 20 tasks and placing second overall. The method struggled with tasks requiring contradictory objectives in narrow chemical spaces (e.g., the Sitagliptin MPO task), reflecting its emphasis on diverse feasible molecules rather than optimal individual solutions.</p>
<h2 id="diversity-desirability-trade-off-and-limitations">Diversity-Desirability Trade-off and Limitations</h2>
<p>The key finding is that the Pareto front scheme and weighted sum scheme offer complementary strengths: PF produces molecules with higher diversity and more realistic substructure distributions, while WS achieves higher raw desirability scores. The Pareto front scheme is preferred for polypharmacology applications where chemical diversity matters for lead optimization.</p>
<p>The mutation rate $\varepsilon$ controls the diversity-desirability trade-off: higher $\varepsilon$ increases diversity at the cost of desirability. The authors tested $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$ and found that tuning $\varepsilon$ is needed to balance exploration against objective satisfaction.</p>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The method is less effective for tasks with contradictory objectives in narrow chemical spaces</li>
<li>Emphasis is on generating diverse feasible molecules rather than individual optimal solutions</li>
<li>REINVENT 2.0 did not converge with the PF scheme, suggesting the Pareto approach may not be universally compatible with all RL frameworks</li>
<li>Bioactivity predictions rely on QSAR models (RF), which may not generalize perfectly to novel chemical scaffolds</li>
</ul>
<p>Future directions mentioned include adopting newer architectures (BERT, Transformer, GPT-2), handling graph and fragment representations, and integrating additional objectives like stability and synthesizability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v26 (ChEMBL set)</td>
          <td>1.7M molecules</td>
          <td>SMILES syntax learning, drug-like molecules</td>
      </tr>
      <tr>
          <td>Fine-tuning / Environment</td>
          <td>LIGAND set</td>
          <td>25,731 ligands</td>
          <td>Bioactivities for $A_1AR$, $A_{2A}AR$, hERG from ChEMBL</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>GuacaMol</td>
          <td>20 tasks</td>
          <td>Goal-directed generation tasks</td>
      </tr>
  </tbody>
</table>
<p>Active/inactive thresholds: $pX \geq 6.5$ (active), $pX &lt; 6.5$ (inactive). Low-quality data points without an exact pX value were assigned $pX = 3.99$ with a sample weight of 0.1.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>QSAR predictor</strong>: Random Forest, 1000 trees, Gini criterion. Input: 2048-bit ECFP6 + 19 physicochemical properties (2067D). MinMax normalization.</li>
<li><strong>Generator</strong>: 6-layer RNN with LSTM cells (512 hidden units), embedding dim 128, vocabulary 84 tokens. Adam optimizer, lr $10^{-3}$, batch size 512, 1000 epochs.</li>
<li><strong>RL training</strong>: Policy gradient with Pareto-based or weighted-sum reward. Mutation rates tested: $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$.</li>
<li><strong>Pareto ranking</strong>: GPU-accelerated non-dominated sorting via PyTorch. Within-front ranking by average Tanimoto distance over ECFP6 fingerprints, in place of crowding distance.</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generator</td>
          <td>LSTM (3 layers, 512 hidden)</td>
          <td>Embedding 128D, vocab 84</td>
      </tr>
      <tr>
          <td>Predictor</td>
          <td>Random Forest</td>
          <td>1000 trees, 2067D input</td>
      </tr>
      <tr>
          <td>MT-DNN (alternative)</td>
          <td>3 hidden layers (4000, 2000, 1000)</td>
          <td>ReLU, 20% dropout</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated SMILES that parse to valid molecules</td>
      </tr>
      <tr>
          <td>Desirability</td>
          <td>Fraction of molecules meeting all activity thresholds ($pX \geq 6.5$ on-targets, $pX &lt; 6.5$ off-targets)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of non-duplicate molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Solow-Polasky metric on ECFP6 Tanimoto distances</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Synthetic accessibility (1-10, lower is easier)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative estimate of drug-likeness (0-1, higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>GPU acceleration was used for Pareto optimization via PyTorch. Specific hardware details (GPU model, training time) are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XuhanLiu/DrugEx">DrugEx GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Python, PyTorch)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v26</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source of training molecules and bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., Emmerich, M. T. M., IJzerman, A. P., &amp; van Westen, G. J. P. (2021). DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. <em>Journal of Cheminformatics</em>, 13(1), 85. <a href="https://doi.org/10.1186/s13321-021-00561-9">https://doi.org/10.1186/s13321-021-00561-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2021drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and Emmerich, Michael T. M. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{85}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-021-00561-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGE: Molecule Generation via Grammatical Evolution</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemge-grammatical-evolution-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemge-grammatical-evolution-molecule-generation/</guid><description>ChemGE applies grammatical evolution to SMILES strings for population-based de novo molecule generation with inherent parallelism and diversity.</description><content:encoded><![CDATA[<h2 id="grammatical-evolution-for-de-novo-molecular-design">Grammatical Evolution for De Novo Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemGE, a population-based molecular generation approach built on grammatical evolution. Rather than using deep neural networks, ChemGE evolves populations of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings through a context-free grammar, enabling concurrent evaluation by multiple molecular simulators and producing diverse molecular libraries. The method represents an alternative paradigm for de novo drug design: evolutionary optimization over formal grammars rather than learned latent spaces or autoregressive neural models.</p>
<h2 id="limitations-of-sequential-deep-learning-generators">Limitations of Sequential Deep Learning Generators</h2>
<p>At the time of publication, the dominant approaches to de novo molecular generation included Bayesian optimization over VAE latent spaces (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">GVAE</a>), reinforcement learning with recurrent neural networks (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>), sequential Monte Carlo search, and Monte Carlo tree search (ChemTS). These methods share two practical limitations:</p>
<ol>
<li>
<p><strong>Simulation concurrency</strong>: Most methods generate one molecule at a time, making it difficult to run multiple molecular simulations (e.g., <a href="https://en.wikipedia.org/wiki/Molecular_docking">docking</a>) in parallel. This wastes computational resources in high-throughput virtual screening settings.</p>
</li>
<li>
<p><strong>Molecular diversity</strong>: Deep learning generators tend to exploit narrow regions of chemical space. Deep reinforcement learning methods in particular often generate very similar molecules, requiring special countermeasures to maintain diversity. Since drug discovery is a multi-stage pipeline, limited diversity reduces survival rates in downstream <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> screening.</p>
</li>
</ol>
<p>ChemGE addresses both problems by maintaining a large population of molecules that are evolved and evaluated concurrently.</p>
<h2 id="core-innovation-chromosome-to-smiles-mapping-via-grammar-rules">Core Innovation: Chromosome-to-SMILES Mapping via Grammar Rules</h2>
<p>ChemGE encodes each molecule as a chromosome: a sequence of $N$ integers that deterministically maps to a SMILES string through a context-free grammar. The mapping process works as follows:</p>
<ol>
<li>Start with the grammar&rsquo;s start symbol</li>
<li>At each step $k$, look up the $k$-th integer $c = C[k]$ from the chromosome</li>
<li>Identify the leftmost non-terminal symbol and count its $r$ applicable production rules</li>
<li>Apply the $((c \bmod r) + 1)$-th rule</li>
<li>Repeat until no non-terminal symbols remain or the chromosome is exhausted</li>
</ol>
<p>The context-free grammar is a subset of the OpenSMILES specification, defined formally as $G = (V, \Sigma, R, S)$ where $V$ is the set of non-terminal symbols, $\Sigma$ is the set of terminal symbols, $R$ is the set of production rules, and $S$ is the start symbol.</p>
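<p>The mapping above can be sketched with a toy grammar. The grammar here (a "chain" of C/N/O atoms) is a hypothetical fragment for illustration, not ChemGE's OpenSMILES subset; the code uses 0-based <code>c % r</code> rule indexing, equivalent to the 1-based $((c \bmod r) + 1)$-th rule in the text.</p>

```python
# Toy grammar in the paper's G = (V, Sigma, R, S) form: keys are
# non-terminals V, right-hand sides are production rules R.
GRAMMAR = {
    "chain": [["atom"], ["atom", "chain"]],
    "atom":  [["C"], ["N"], ["O"]],
}
START = "chain"

def chromosome_to_string(chromosome, grammar=GRAMMAR, start=START):
    """Map a list of integers to a string by repeatedly rewriting the
    leftmost non-terminal using rule (c mod r) of its r productions."""
    symbols = [start]
    for c in chromosome:
        # Find the leftmost non-terminal symbol.
        nt_pos = next((i for i, s in enumerate(symbols) if s in grammar), None)
        if nt_pos is None:                 # derivation already complete
            break
        rules = grammar[symbols[nt_pos]]
        symbols[nt_pos:nt_pos + 1] = rules[c % len(rules)]
    if any(s in grammar for s in symbols):
        return None                        # chromosome exhausted: invalid
    return "".join(symbols)
```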
<p>Evolution follows the $(\mu + \lambda)$ evolution strategy:</p>
<ol>
<li>Create $\lambda$ new chromosomes by drawing random chromosomes from the population and mutating one integer at a random position</li>
<li>Translate each chromosome to a SMILES string and evaluate fitness (e.g., docking score). Invalid molecules receive fitness $-\infty$</li>
<li>Select the top $\mu$ molecules from the merged pool of $\mu + \lambda$ candidates</li>
</ol>
<p>The authors did not use crossover, as it did not improve performance. Diversity is inherently maintained because a large fraction of molecules are mutated in each generation.</p>
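<p>The $(\mu + \lambda)$ loop itself is compact. A sketch under stated assumptions: <code>fitness</code> stands in for chromosome-to-SMILES translation plus scoring (returning $-\infty$ for invalid molecules), and <code>n_alleles</code> is an arbitrary gene range for illustration.</p>

```python
import random

def evolve(population, fitness, mu, lam, generations, mutate, rng=random):
    """(mu + lambda) evolution strategy over integer chromosomes, mutation
    only (no crossover), as in ChemGE."""
    pop = list(population)
    for _ in range(generations):
        # Create lambda offspring by mutating randomly drawn parents.
        offspring = [mutate(rng.choice(pop), rng) for _ in range(lam)]
        # Select the top mu from the merged mu + lambda pool.
        merged = pop + offspring
        merged.sort(key=fitness, reverse=True)
        pop = merged[:mu]
    return pop

def point_mutation(chrom, rng, n_alleles=256):
    """Replace one integer at a random position with a random value."""
    child = list(chrom)
    child[rng.randrange(len(child))] = rng.randrange(n_alleles)
    return child
```

Because each generation's $\lambda$ offspring are independent, the fitness calls can be dispatched to parallel simulators, which is the concurrency advantage the paper emphasizes.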
<h2 id="experimental-setup-and-benchmark-comparisons">Experimental Setup and Benchmark Comparisons</h2>
<h3 id="druglikeness-score-benchmark">Druglikeness Score Benchmark</h3>
<p>The first experiment optimized the penalized logP score $J^{\log P}$, an indicator of druglikeness defined as:</p>
<p>$$
J^{\log P}(m) = \log P(m) - \text{SA}(m) - \text{ring-penalty}(m)
$$</p>
<p>where $\log P(m)$ is the <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">octanol-water partition coefficient</a>, $\text{SA}(m)$ is the synthetic accessibility score, and ring-penalty$(m)$ penalizes carbon rings larger than size 6. All terms are normalized to zero mean and unit standard deviation. Initial populations were randomly sampled from the ZINC database (35 million compounds), with fitness set to $-\infty$ for molecules with molecular weight above 500 or duplicate structures.</p>
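<p>The normalization of the three terms can be sketched as below. This assumes z-scoring (zero mean, unit population standard deviation) is applied per term across a batch of molecules; constant terms are mapped to zero to avoid division by zero, an assumption not spelled out in the text.</p>

```python
import statistics

def zscore(values):
    """Normalize a list to zero mean and unit standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)  # assumed handling of constant terms
    return [(v - mu) / sd for v in values]

def penalized_logp(logp, sa, ring_penalty):
    """J^logP = z(logP) - z(SA) - z(ring-penalty), each term z-scored."""
    zl, zs, zr = zscore(logp), zscore(sa), zscore(ring_penalty)
    return [l - s - r for l, s, r in zip(zl, zs, zr)]
```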
<p>ChemGE was compared against CVAE, GVAE, and ChemTS across population sizes $(\mu, \lambda) \in \{(10, 20), (100, 200), (1000, 2000), (10000, 20000)\}$.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>2h</th>
          <th>4h</th>
          <th>6h</th>
          <th>8h</th>
          <th>Mol/Min</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemGE (10, 20)</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>14.5</td>
      </tr>
      <tr>
          <td>ChemGE (100, 200)</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>135</td>
      </tr>
      <tr>
          <td>ChemGE (1000, 2000)</td>
          <td>4.45 +/- 0.24</td>
          <td>5.32 +/- 0.43</td>
          <td>5.73 +/- 0.33</td>
          <td>5.88 +/- 0.34</td>
          <td>527</td>
      </tr>
      <tr>
          <td>ChemGE (10000, 20000)</td>
          <td>4.20 +/- 0.33</td>
          <td>4.28 +/- 0.28</td>
          <td>4.40 +/- 0.27</td>
          <td>4.53 +/- 0.26</td>
          <td>555</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>-30.18 +/- 26.91</td>
          <td>-1.39 +/- 2.24</td>
          <td>-0.61 +/- 1.08</td>
          <td>-0.006 +/- 0.92</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>GVAE</td>
          <td>-4.34 +/- 3.14</td>
          <td>-1.29 +/- 1.67</td>
          <td>-0.17 +/- 0.96</td>
          <td>0.25 +/- 1.31</td>
          <td>1.38</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.91 +/- 0.38</td>
          <td>5.41 +/- 0.51</td>
          <td>5.49 +/- 0.44</td>
          <td>5.58 +/- 0.50</td>
          <td>40.89</td>
      </tr>
  </tbody>
</table>
<p>At $(\mu, \lambda) = (1000, 2000)$, ChemGE achieved the highest final score of 5.88 and generated 527 unique molecules per minute, roughly 13x faster than ChemTS and 3700x faster than CVAE. The small population (10, 20) converged prematurely with insufficient diversity, while the overly large population (10000, 20000) could not run enough generations to optimize effectively.</p>
<h3 id="docking-experiment-with-thymidine-kinase">Docking Experiment with Thymidine Kinase</h3>
<p>The second experiment applied ChemGE to generate molecules with high predicted binding affinity for <a href="https://en.wikipedia.org/wiki/Thymidine_kinase">thymidine kinase</a> (KITH), a well-known antiviral drug target. The authors used rDock for docking simulation, taking the best intermolecular score $S_{\text{inter}}$ from three runs with different initial conformations. Fitness was defined as $-S_{\text{inter}}$ (lower scores indicate higher affinity). The protein structure was taken from PDB ID 2B8T.</p>
<p>With 32 parallel cores and $(\mu, \lambda) = (32, 64)$, ChemGE completed 1000 generations in approximately 26 hours, generating 9466 molecules total. Among these, 349 molecules achieved intermolecular scores better than the best known inhibitor in the DUD-E database.</p>
<h3 id="diversity-analysis">Diversity Analysis</h3>
<p>Molecular diversity was measured using internal diversity based on Morgan fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|^2} \sum_{(x,y) \in A \times A} T_d(x, y)
$$</p>
<p>where $T_d(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance">Tanimoto distance</a>.</p>
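<p>On fingerprints represented as sets of on-bit indices, the metric is a direct translation of the two formulas above; this sketch includes the self-pairs, as in the $A \times A$ sum.</p>

```python
def tanimoto_distance(x, y):
    """T_d(x, y) = 1 - |x ∩ y| / |x ∪ y| on sets of on-bit indices."""
    union = len(x | y)
    return 1.0 - (len(x & y) / union if union else 1.0)

def internal_diversity(fingerprints):
    """I(A) = (1/|A|^2) * sum of all pairwise Tanimoto distances,
    self-pairs included (each contributes distance 0)."""
    n = len(fingerprints)
    return sum(tanimoto_distance(a, b)
               for a in fingerprints for b in fingerprints) / (n * n)
```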
<p>The 349 &ldquo;ChemGE-active&rdquo; molecules (those scoring better than the best known inhibitor) had an internal diversity of 0.55, compared to 0.46 for known inhibitors and 0.65 for the whole ZINC database. This is a substantial improvement over known actives, achieved without any explicit diversity-promoting mechanism.</p>
<p>ISOMAP visualizations showed that ChemGE populations migrated away from known inhibitors over generations, ultimately occupying a completely different region of chemical space by generation 1000. This suggests ChemGE discovered a novel structural class of potential binders.</p>
<h2 id="high-throughput-and-diversity-without-deep-learning">High Throughput and Diversity Without Deep Learning</h2>
<p>ChemGE demonstrates several notable findings:</p>
<ol>
<li>
<p><strong>Deep learning is not required</strong> for competitive de novo molecular generation. Grammatical evolution over SMILES achieves higher throughput and comparable or better optimization scores than VAE- and RNN-based methods.</p>
</li>
<li>
<p><strong>Population size matters significantly</strong>. Too small a population leads to premature convergence. Too large a population prevents sufficient per-molecule optimization within the computational budget. The $(\mu, \lambda) = (1000, 2000)$ setting provided the best balance.</p>
</li>
<li>
<p><strong>Inherent diversity</strong> is a key advantage of evolutionary methods. Without any explicit diversity loss or penalty, ChemGE maintains diversity comparable to the ZINC database and exceeds that of known active molecules.</p>
</li>
<li>
<p><strong>Parallel evaluation</strong> is naturally supported. Each generation produces $\lambda$ independent molecules that can be evaluated by separate docking simulators simultaneously.</p>
</li>
</ol>
<p>The authors acknowledge several limitations. Synthetic routes and ADMET properties were not evaluated for the generated molecules. The docking scores, while favorable, require confirmation through biological assays. The authors also note that incorporating probabilistic or neural models into the evolutionary process might further improve performance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial population</td>
          <td>ZINC</td>
          <td>~35M compounds</td>
          <td>Randomly sampled starting molecules</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2B8T</td>
          <td>1 structure</td>
          <td>Thymidine kinase-ligand complex</td>
      </tr>
      <tr>
          <td>Baseline actives</td>
          <td>DUD-E (KITH)</td>
          <td>57 inhibitors</td>
          <td>Known thymidine kinase inhibitors</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Grammatical evolution with $(\mu + \lambda)$ evolution strategy</li>
<li>Mutation only (no crossover)</li>
<li>Context-free grammar subset of OpenSMILES specification</li>
<li>Chromosome length: $N$ integers per molecule</li>
<li>Fitness set to $-\infty$ for invalid SMILES, MW &gt; 500, or duplicate molecules</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. ChemGE is purely evolutionary.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max $J^{\log P}$ (8h)</td>
          <td>5.88 +/- 0.34</td>
          <td>ChemTS: 5.58 +/- 0.50</td>
          <td>ChemGE (1000, 2000)</td>
      </tr>
      <tr>
          <td>Molecules/min</td>
          <td>527</td>
          <td>ChemTS: 40.89</td>
          <td>~13x throughput improvement</td>
      </tr>
      <tr>
          <td>Docking hits</td>
          <td>349</td>
          <td>Best DUD-E inhibitor</td>
          <td>Molecules with better $S_{\text{inter}}$</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.55</td>
          <td>Known inhibitors: 0.46</td>
          <td>Morgan fingerprint Tanimoto distance</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>CPU: Intel Xeon E5-2630 v3 (benchmark experiments, single core)</li>
<li>Docking: 32 cores in parallel (thymidine kinase experiment, ~26 hours for 1000 generations)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tsudalab/ChemGE">ChemGE</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoshikawa, N., Terayama, K., Sumita, M., Homma, T., Oono, K., &amp; Tsuda, K. (2018). Population-based de novo molecule generation, using grammatical evolution. <em>Chemistry Letters</em>, 47(11), 1431-1434. <a href="https://doi.org/10.1246/cl.180665">https://doi.org/10.1246/cl.180665</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yoshikawa2018chemge,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Population-based De Novo Molecule Generation, Using Grammatical Evolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yoshikawa, Naruki and Terayama, Kei and Sumita, Masato and Homma, Teruki and Oono, Kenta and Tsuda, Koji}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1431--1434}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1246/cl.180665}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>VAE for Automatic Chemical Design (2018 Seminal)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/</guid><description>A variational autoencoder maps SMILES strings to a continuous latent space, enabling gradient-based optimization for molecular design and generation.</description><content:encoded><![CDATA[<h2 id="a-foundational-method-for-continuous-molecular-representation">A Foundational Method for Continuous Molecular Representation</h2>
<p>This is a <strong>Method</strong> paper that introduces a variational autoencoder (VAE) framework for mapping discrete molecular representations (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings) into a continuous latent space. The primary contribution is demonstrating that this continuous representation enables three key capabilities: (1) automatic generation of novel molecules by decoding random or perturbed latent vectors, (2) smooth interpolation between molecules in latent space, and (3) gradient-based optimization of molecular properties using a jointly trained property predictor. This work is widely regarded as one of the earliest and most influential applications of deep generative models to molecular design.</p>
<h2 id="the-challenge-of-searching-discrete-chemical-space">The Challenge of Searching Discrete Chemical Space</h2>
<p>Molecular design is fundamentally an optimization problem: identify molecules that maximize some set of desirable properties. The search space is enormous (estimated $10^{23}$ to $10^{60}$ drug-like molecules) and discrete, making systematic exploration difficult. Prior approaches fell into two categories, each with significant limitations:</p>
<ol>
<li><strong>Virtual screening</strong> over fixed libraries: effective but inflexible, since libraries are monolithic, costly to enumerate, and require hand-crafted rules to avoid impractical chemistries.</li>
<li><strong>Discrete local search</strong> (e.g., genetic algorithms): requires manual specification of mutation and crossover heuristics, and cannot leverage gradient information to guide the search.</li>
</ol>
<p>The core insight is that mapping molecules into a continuous vector space sidesteps these problems entirely. In a continuous space, new compounds can be generated by vector perturbation (no hand-crafted mutation rules), optimization can follow property gradients (enabling larger and more directed jumps), and large unlabeled chemical databases can be leveraged through unsupervised representation learning.</p>
<h2 id="a-vae-architecture-for-smiles-strings-with-joint-property-prediction">A VAE Architecture for SMILES Strings with Joint Property Prediction</h2>
<p>The architecture consists of three coupled neural networks trained jointly:</p>
<ol>
<li>
<p><strong>Encoder</strong>: Converts SMILES character strings into fixed-dimensional continuous vectors (the latent representation). Uses three 1D convolutional layers followed by a fully connected layer. For ZINC molecules, the latent space has 196 dimensions; for QM9, 156 dimensions.</p>
</li>
<li>
<p><strong>Decoder</strong>: Converts latent vectors back into SMILES strings character by character using three layers of gated recurrent units (GRUs). The output is stochastic, as each character is sampled from a probability distribution over the SMILES alphabet.</p>
</li>
<li>
<p><strong>Property Predictor</strong>: A multilayer perceptron that predicts molecular properties directly from the latent representation. Joint training with the autoencoder reconstruction loss organizes the latent space so that molecules with similar properties cluster together.</p>
</li>
</ol>
<h3 id="the-vae-objective">The VAE Objective</h3>
<p>The model uses the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder framework of Kingma and Welling</a>. The training objective combines three terms:</p>
<p>$$\mathcal{L} = \mathcal{L}_{recon} + \beta \cdot D_{KL}(q(z|x) \| p(z)) + \lambda \cdot \mathcal{L}_{prop}$$</p>
<p>where $\mathcal{L}_{recon}$ is the reconstruction loss (cross-entropy over SMILES characters), $D_{KL}$ is the KL divergence regularizer that encourages the latent distribution $q(z|x)$ to match a standard Gaussian prior $p(z)$, and $\mathcal{L}_{prop}$ is the property prediction regression loss. Both the variational loss and the property prediction loss are annealed in via a sigmoid schedule that starts after epoch 29 of the 120 total training epochs.</p>
<p>The KL regularization is critical: it forces the decoder to handle a wider variety of latent points, preventing &ldquo;dead areas&rdquo; in latent space that would decode to invalid molecules.</p>
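<p>As a minimal sketch (plain Python; the sigmoid steepness is an assumed value, since the paper does not report one), the three-term objective and its annealing schedule look like:</p>

```python
import math

def kl_diag_gaussian(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    # summed over the latent dimensions.
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def anneal_weight(epoch, start=29, steepness=0.5):
    # Sigmoid schedule that ramps in after `start` epochs.
    # The steepness value is an assumption for illustration.
    return 1.0 / (1.0 + math.exp(-steepness * (epoch - start)))

def vae_objective(recon_loss, mu, log_var, prop_loss, epoch, lam=1.0):
    # L = L_recon + beta(epoch) * D_KL(q(z|x) || p(z)) + lambda * L_prop
    beta = anneal_weight(epoch)
    return recon_loss + beta * kl_diag_gaussian(mu, log_var) + lam * prop_loss
```

With a perfectly matched posterior (zero means, unit variances) the KL term vanishes and the objective reduces to the reconstruction and property losses.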
<h3 id="gradient-based-optimization">Gradient-Based Optimization</h3>
<p>After training, a Gaussian process (GP) surrogate model is fit on top of the latent representations to predict the target property. Optimization proceeds by:</p>
<ol>
<li>Encoding a seed molecule into the latent space</li>
<li>Using the GP model to define a smooth property surface over the latent space</li>
<li>Optimizing the latent vector $z$ to maximize the predicted property via gradient ascent</li>
<li>Decoding the optimized $z$ back into a SMILES string</li>
</ol>
<p>The objective used for demonstration is $5 \times \text{QED} - \text{SAS}$, balancing drug-likeness (QED) against synthetic accessibility (SAS).</p>
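<p>A minimal sketch of the latent-space ascent, with a toy differentiable surrogate standing in for the paper&rsquo;s Gaussian process:</p>

```python
def optimize_latent(z, grad_fn, lr=0.1, steps=200):
    # Gradient ascent on a surrogate property model over the latent space.
    # In the paper the surrogate is a Gaussian process; any differentiable
    # predictor with a gradient function works the same way.
    z = list(z)
    for _ in range(steps):
        g = grad_fn(z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z

# Toy quadratic surrogate f(z) = -||z - z_star||^2, maximized at z_star.
z_star = [1.0, -2.0]
grad = lambda z: [2.0 * (t - zi) for zi, t in zip(z, z_star)]
z_opt = optimize_latent([0.0, 0.0], grad)  # converges toward z_star
```

The optimized latent vector would then be decoded back to a SMILES string, which may require repeated stochastic decoding attempts in practice.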
<h2 id="experiments-on-zinc-and-qm9-datasets">Experiments on ZINC and QM9 Datasets</h2>
<p>Two autoencoder systems were trained:</p>
<ul>
<li><strong>ZINC</strong>: 250,000 drug-like molecules from the ZINC database, with a 196-dimensional latent space. Properties predicted: logP, QED, SAS.</li>
<li><strong>QM9</strong>: 108,000 molecules with fewer than 9 heavy atoms, with a 156-dimensional latent space. Properties predicted: HOMO energy, LUMO energy, electronic spatial extent ($\langle R^2 \rangle$).</li>
</ul>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>The encoded latent dimensions follow approximately normal distributions as enforced by the variational regularizer. Decoding is stochastic: sampling the same latent point multiple times yields different SMILES strings, with the most frequent decoding tending to be closest to the original point in latent space. Decoding validity rates are 73-79% for points near known molecules but only 4% for randomly selected latent points.</p>
<p>Spherical interpolation (slerp) between molecules in latent space produces smooth structural transitions, accounting for the geometry of high-dimensional Gaussian distributions where linear interpolation would pass through low-probability regions.</p>
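<p>Spherical interpolation itself is a few lines; a plain-Python sketch (the fallback to linear interpolation for nearly parallel vectors is a standard numerical guard, not from the paper):</p>

```python
import math

def slerp(z0, z1, t):
    # Spherical interpolation between two latent vectors. Linear
    # interpolation would cut through low-probability regions of a
    # high-dimensional Gaussian; slerp stays near the typical shell.
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    omega = math.acos(max(-1.0, min(1.0, dot / (n0 * n1))))
    so = math.sin(omega)
    if so < 1e-8:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    return [
        (math.sin((1 - t) * omega) / so) * a + (math.sin(t * omega) / so) * b
        for a, b in zip(z0, z1)
    ]
```

Decoding a sequence of slerp points between two encoded molecules yields the smooth structural transitions described above.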
<h3 id="molecular-generation-comparison">Molecular Generation Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Dataset</th>
          <th>Samples</th>
          <th>logP</th>
          <th>SAS</th>
          <th>QED</th>
          <th>% in ZINC</th>
          <th>% in eMolecules</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data</td>
          <td>ZINC</td>
          <td>249k</td>
          <td>2.46 (1.43)</td>
          <td>3.05 (0.83)</td>
          <td>0.73 (0.14)</td>
          <td>100</td>
          <td>12.9</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>ZINC</td>
          <td>5303</td>
          <td>2.84 (1.86)</td>
          <td>3.80 (1.01)</td>
          <td>0.57 (0.20)</td>
          <td>6.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>ZINC</td>
          <td>8728</td>
          <td>2.67 (1.46)</td>
          <td>3.18 (0.86)</td>
          <td>0.70 (0.14)</td>
          <td>5.8</td>
          <td>7.0</td>
      </tr>
      <tr>
          <td>Data</td>
          <td>QM9</td>
          <td>134k</td>
          <td>0.30 (1.00)</td>
          <td>4.25 (0.94)</td>
          <td>0.48 (0.07)</td>
          <td>0.0</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>QM9</td>
          <td>5470</td>
          <td>0.96 (1.53)</td>
          <td>4.47 (1.01)</td>
          <td>0.53 (0.13)</td>
          <td>0.018</td>
          <td>3.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>QM9</td>
          <td>2839</td>
          <td>0.30 (0.97)</td>
          <td>4.34 (0.98)</td>
          <td>0.47 (0.08)</td>
          <td>0.0</td>
          <td>8.9</td>
      </tr>
  </tbody>
</table>
<p>The VAE generates molecules whose property distributions closely match the training data, outperforming a genetic algorithm baseline that biases toward higher chemical complexity and decreased drug-likeness. Only 5.8% of VAE-generated ZINC molecules were found in the original ZINC database, indicating genuine novelty.</p>
<h3 id="property-prediction">Property Prediction</h3>
<table>
  <thead>
      <tr>
          <th>Dataset/Property</th>
          <th>Mean Baseline</th>
          <th>ECFP</th>
          <th>Graph Conv.</th>
          <th>1-hot SMILES</th>
          <th>Encoder Only</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC/logP</td>
          <td>1.14</td>
          <td>0.38</td>
          <td>0.05</td>
          <td>0.16</td>
          <td>0.13</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>ZINC/QED</td>
          <td>0.112</td>
          <td>0.045</td>
          <td>0.017</td>
          <td>0.041</td>
          <td>0.037</td>
          <td>0.054</td>
      </tr>
      <tr>
          <td>QM9/HOMO (eV)</td>
          <td>0.44</td>
          <td>0.20</td>
          <td>0.12</td>
          <td>0.12</td>
          <td>0.13</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/LUMO (eV)</td>
          <td>1.05</td>
          <td>0.20</td>
          <td>0.15</td>
          <td>0.11</td>
          <td>0.14</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/Gap (eV)</td>
          <td>1.07</td>
          <td>0.30</td>
          <td>0.18</td>
          <td>0.16</td>
          <td>0.18</td>
          <td>0.21</td>
      </tr>
  </tbody>
</table>
<p>The VAE latent representation achieves property prediction accuracy comparable to graph convolutions for some properties, though graph convolutions generally perform best. The primary purpose of joint training is not to maximize prediction accuracy but to organize the latent space for optimization.</p>
<h3 id="optimization-results">Optimization Results</h3>
<p>Bayesian optimization with a GP model on the jointly trained latent space consistently produces molecules with higher percentile scores on the $5 \times \text{QED} - \text{SAS}$ objective compared to both random Gaussian search and genetic algorithm baselines. Starting from molecules in the bottom 10th percentile of the ZINC dataset, the optimizer reliably discovers molecules in regions of high objective value. Training the GP with 1000 molecules (vs. 2000) produces a wider diversity of solutions by optimizing to multiple local optima rather than a single global optimum.</p>
<h2 id="key-findings-limitations-and-legacy">Key Findings, Limitations, and Legacy</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>A continuous latent representation of molecules enables gradient-based search through chemical space, a qualitatively different approach from discrete enumeration or genetic algorithms.</li>
<li>Joint training with property prediction organizes the latent space by property values, creating smooth gradients that optimization can follow.</li>
<li>The VAE generates novel molecules with realistic property distributions, and the latent space encodes an estimated 7.5 million molecules despite training on only 250,000.</li>
</ul>
<h3 id="acknowledged-limitations">Acknowledged Limitations</h3>
<ul>
<li>The SMILES-based decoder sometimes produces formally valid but chemically undesirable molecules (acid chlorides, anhydrides, cyclopentadienes, aziridines, etc.) because the grammar of valid SMILES does not capture all synthetic or stability constraints.</li>
<li>Character-level SMILES generation is fragile: the decoder must implicitly learn which strings are valid SMILES, making the learning problem harder than necessary.</li>
<li>Decoding validity drops to only 4% for random latent points far from training data, limiting the ability to explore truly novel regions of chemical space.</li>
</ul>
<h3 id="directions-identified">Directions Identified</h3>
<p>The authors point to several extensions that were already underway at the time of publication:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></strong>: Using an explicitly defined SMILES grammar instead of forcing the model to learn one (Kusner et al., 2017).</li>
<li><strong>Graph-based decoders</strong>: Directly outputting molecular graphs to avoid the SMILES validity problem.</li>
<li><strong>Adversarial training</strong>: Using GANs for molecular generation (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN, ORGANIC</a>).</li>
<li><strong>LSTM/RNN generators</strong>: Applying recurrent networks directly to SMILES for generation and reaction prediction.</li>
</ul>
<p>This paper has been cited over 2,900 times and launched a large body of follow-up work in VAE-based, GAN-based, and reinforcement learning-based molecular generation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ZINC (drug-like subset)</td>
          <td>250,000 molecules</td>
          <td>Randomly sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>QM9</td>
          <td>108,000 molecules</td>
          <td>Molecules with fewer than 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ZINC held-out set</td>
          <td>5,000 molecules</td>
          <td>For latent space analysis</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Encoder</strong>: 3 x 1D convolutional layers (ZINC: filters 9,9,10 with kernels 9,9,11; QM9: filters 2,2,1 with kernels 5,5,4), followed by a fully connected layer</li>
<li><strong>Decoder</strong>: 3 x GRU layers (ZINC: hidden dim 488; QM9: hidden dim 500), trained with teacher forcing</li>
<li><strong>Property Predictor</strong>: 2 fully connected layers of 1000 neurons (dropout 0.20) for prediction; smaller 3-layer MLP of 67 neurons (dropout 0.15) for latent space shaping</li>
<li><strong>Variational loss annealing</strong>: Sigmoid schedule after 29 epochs, total 120 epochs</li>
<li><strong>SMILES validation</strong>: Post-hoc filtering with RDKit; invalid outputs discarded</li>
<li><strong>Optimization</strong>: Gaussian process surrogate model trained on 2000 maximally diverse molecules from latent space</li>
</ul>
<h3 id="models">Models</h3>
<p>Built with Keras and TensorFlow. Latent dimensions: 196 (ZINC), 156 (QM9). SMILES alphabet: 35 characters (ZINC), 22 characters (QM9). Maximum string length: 120 (ZINC), 34 (QM9). Only canonicalized SMILES used for training.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>Water-octanol partition coefficient</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimation of Drug-likeness (0-1)</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>Synthetic Accessibility Score</td>
      </tr>
      <tr>
          <td>HOMO/LUMO (eV)</td>
          <td>Frontier orbital energies (QM9)</td>
      </tr>
      <tr>
          <td>Decoding validity</td>
          <td>Fraction of latent points producing valid SMILES</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on the Harvard FAS Odyssey Cluster. Specific GPU types and training times are not reported. The Gaussian process optimization requires only minutes to train on a few thousand molecules.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/chemical_vae">chemical_vae</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with training scripts and pre-trained models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., &amp; Aspuru-Guzik, A. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. <em>ACS Central Science</em>, 4(2), 268-276. <a href="https://doi.org/10.1021/acscentsci.7b00572">https://doi.org/10.1021/acscentsci.7b00572</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gomez2018automatic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{G{\&#39;o}mez-Bombarelli, Rafael and Wei, Jennifer N. and Duvenaud, David and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and S{\&#39;a}nchez-Lengeling, Benjam{\&#39;i}n and Sheberla, Dennis and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{268--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acscentsci.7b00572}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>S4 Structured State Space Models for De Novo Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</guid><description>S4 state space models are applied to chemical language modeling for de novo drug design, outperforming LSTMs and GPTs in bioactivity learning from SMILES.</description><content:encoded><![CDATA[<h2 id="structured-state-spaces-meet-chemical-language-modeling">Structured State Spaces Meet Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces structured state space sequence (S4) models to chemical language modeling (CLM) for de novo drug design. S4 models have a dual formulation: they process entire input sequences via convolution during training (like Transformers) and generate sequences element-by-element via recurrence during inference (like LSTMs). The authors benchmark S4 against LSTM and GPT architectures across multiple drug discovery tasks, including drug-like molecule generation, bioactivity learning, chemical space exploration, natural product design, and prospective kinase inhibitor design validated by molecular dynamics simulations.</p>
<h2 id="bridging-the-lstm-transformer-gap-in-molecular-generation">Bridging the LSTM-Transformer Gap in Molecular Generation</h2>
<p>Chemical language models (CLMs) generate molecules by learning the &ldquo;chemical language&rdquo; of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> string representations. The two dominant architectures for CLMs are LSTMs and GPTs, each with complementary strengths and limitations:</p>
<ul>
<li><strong>LSTMs</strong> generate sequences recurrently (element-by-element), which enables efficient generation and good learning of local/short-range dependencies. However, their sequential information bottleneck limits learning of global sequence properties.</li>
<li><strong>GPTs</strong> (Transformer decoders) process the entire input at once, better capturing global properties like bioactivity. However, they become increasingly compute-intensive for longer SMILES strings and struggle with chemical space exploration at higher sampling temperatures.</li>
</ul>
<p>Complex molecular properties like bioactivity can emerge from separated portions of a SMILES string (e.g., distant functional groups in the linear notation). Neither architecture fully addresses the need to learn these long-range dependencies while maintaining efficient, robust generation. The chemical space, estimated at up to $10^{60}$ small molecules, demands models that can both capture complex property relationships and explore diverse scaffolds efficiently.</p>
<h2 id="the-dual-nature-of-s4-convolution-meets-recurrence">The Dual Nature of S4: Convolution Meets Recurrence</h2>
<p>S4 models are built on discrete <a href="https://en.wikipedia.org/wiki/State-space_model">state space models</a>, which map an input sequence $\mathbf{u}$ to an output sequence $\mathbf{y}$ through learnable parameters $\overline{\mathbf{A}} \in \mathbb{R}^{N \times N}$, $\overline{\mathbf{B}} \in \mathbb{R}^{N \times 1}$, $\overline{\mathbf{C}} \in \mathbb{R}^{1 \times N}$, and $\overline{\mathbf{D}} \in \mathbb{R}^{1 \times 1}$:</p>
<p>$$
x_{k} = \overline{\mathbf{A}} x_{k-1} + \overline{\mathbf{B}} u_{k}
$$</p>
<p>$$
y_{k} = \overline{\mathbf{C}} x_{k} + \overline{\mathbf{D}} u_{k}
$$</p>
<p>This linear recurrence can equivalently be &ldquo;unrolled&rdquo; into a global convolution:</p>
<p>$$
\mathbf{y} = \mathbf{u} * \overline{\mathbf{K}}
$$</p>
<p>where $\overline{\mathbf{K}}$ is a convolution filter parameterized by $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, and $\overline{\mathbf{C}}$. This duality is the core innovation for CLMs:</p>
<ul>
<li><strong>Training</strong>: S4 uses the convolutional formulation to learn from entire SMILES sequences simultaneously, capturing global molecular properties.</li>
<li><strong>Generation</strong>: S4 switches to the recurrent formulation, producing SMILES tokens one at a time for efficient, robust chemical space exploration.</li>
</ul>
<p>S4 addresses the numerical instabilities of naive state space models through high-order polynomial projection operators (HiPPO) and reduction to the stable Cauchy kernel computation, enabling effective learning of long-range dependencies.</p>
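<p>The duality can be checked on a toy scalar state space model: the recurrent rollout and the unrolled convolution give identical outputs. This sketch ignores the HiPPO parameterization and the Cauchy-kernel machinery that make S4 stable at scale:</p>

```python
def ssm_recurrent(u, A, B, C, D):
    # x_k = A x_{k-1} + B u_k ;  y_k = C x_k + D u_k   (scalar state)
    x, ys = 0.0, []
    for uk in u:
        x = A * x + B * uk
        ys.append(C * x + D * uk)
    return ys

def ssm_convolutional(u, A, B, C, D):
    # Unrolled kernel K = (C B, C A B, C A^2 B, ...) plus direct term D.
    L = len(u)
    K = [C * (A ** i) * B for i in range(L)]
    return [
        sum(K[i - j] * u[j] for j in range(i + 1)) + D * u[i]
        for i in range(L)
    ]
```

The convolutional form is what allows parallel training over whole SMILES sequences, while the recurrent form is used for token-by-token generation.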
<p>For molecular ranking after fine-tuning, the log-likelihood score subtracts the pre-training likelihood to isolate target-specific information:</p>
<p>$$
\mathcal{L}_{\text{score}}(\mathbf{M}) = \mathcal{L}(\mathbf{M}_{\text{ft}}) - \mathcal{L}(\mathbf{M}_{\text{pt}})
$$</p>
<p>where $\mathcal{L}(\mathbf{M}_{\text{ft}})$ and $\mathcal{L}(\mathbf{M}_{\text{pt}})$ are the fine-tuned and pre-trained model log-likelihoods, respectively.</p>
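<p>A minimal sketch of ranking by this score (hypothetical inputs; real log-likelihoods come from the fine-tuned and pre-trained models):</p>

```python
def rank_by_score(molecules, ll_ft, ll_pt):
    # Rank molecules by L_score = L(M_ft) - L(M_pt), descending.
    # Subtracting the pre-training log-likelihood removes generic
    # "drug-likeness" signal and keeps target-specific signal.
    scored = zip(molecules, (f - p for f, p in zip(ll_ft, ll_pt)))
    return [m for m, _ in sorted(scored, key=lambda ms: ms[1], reverse=True)]
```

A molecule whose likelihood rose sharply during fine-tuning ranks above one the pre-trained prior already favored.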
<h2 id="benchmarking-s4-across-drug-discovery-tasks">Benchmarking S4 Across Drug Discovery Tasks</h2>
<h3 id="drug-like-molecule-generation">Drug-like molecule generation</h3>
<p>All three CLMs (S4, LSTM, GPT) were pre-trained on 1.9M canonical SMILES from ChEMBL v31 (molecules with fewer than 100 tokens). Each model generated 102,400 SMILES strings de novo.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>S4</td>
          <td>99,268 (97%)</td>
          <td>98,712 (96%)</td>
          <td>95,552 (93%)</td>
      </tr>
      <tr>
          <td>LSTM</td>
          <td>97,151 (95%)</td>
          <td>96,618 (94%)</td>
          <td>82,988 (81%)</td>
      </tr>
      <tr>
          <td>GPT</td>
          <td>93,580 (91%)</td>
          <td>93,263 (91%)</td>
          <td>91,590 (89%)</td>
      </tr>
  </tbody>
</table>
<p>S4 produces the most valid, unique, and novel molecules. Error analysis reveals that each architecture shows different failure modes: LSTMs struggle most with branching errors, GPTs with ring and bond assignment errors, while S4 generates fewer branching and ring errors but more bond assignment errors than LSTM. This pattern supports the hypothesis that S4 captures long-range dependencies (branching, ring opening/closure) better while local dependencies (bond assignment) are handled better by recurrent processing.</p>
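<p>The valid/unique/novel counts in the table follow the standard nesting for CLM evaluation. A sketch with a stand-in validity check (a real pipeline would parse each SMILES with RDKit instead):</p>

```python
def generation_metrics(generated, training_set, is_valid):
    # Valid / unique / novel, nested as is standard for CLM evaluation:
    # uniqueness is computed over valid SMILES, and novelty over the
    # valid-and-unique set relative to the training data.
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return len(valid), len(unique), len(novel)

# Stand-in validity check for illustration only.
toy_valid = lambda s: not s.endswith("(")
counts = generation_metrics(
    ["CCO", "CCO", "c1ccccc1", "CC("], ["CCO"], toy_valid
)  # -> (3, 2, 1)
```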
<h3 id="bioactivity-learning-via-transfer-learning">Bioactivity learning via transfer learning</h3>
<p>Five fine-tuning campaigns were conducted on targets from the LIT-PCBA dataset: PKM2, <a href="https://en.wikipedia.org/wiki/Mitogen-activated_protein_kinase_1">MAPK1</a>, GBA, mTORC1, and TP53. After fine-tuning, models ranked held-out test molecules by learned log-likelihoods to evaluate bioactive compound prioritization.</p>
<p>S4 outperformed both benchmarks across targets. Wilcoxon signed-rank tests on pooled scores confirmed statistically significant superiority:</p>
<ul>
<li>S4 vs. LSTM: $p = 8.41 \times 10^{-6}$ (top 10), $p = 2.93 \times 10^{-7}$ (top 50), $p = 1.45 \times 10^{-7}$ (top 100)</li>
<li>S4 vs. GPT: $p = 2.33 \times 10^{-3}$ (top 10), $p = 3.72 \times 10^{-3}$ (top 50), $p = 2.61 \times 10^{-2}$ (top 100)</li>
</ul>
<p>TP53 was the most challenging target, where no model consistently retrieved actives in the top 10, possibly due to <a href="/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/">activity cliffs</a> in the test set.</p>
<h3 id="chemical-space-exploration-with-temperature-sampling">Chemical space exploration with temperature sampling</h3>
<p>Models were evaluated across sampling temperatures from $T = 1.0$ to $T = 2.0$ on three metrics: SMILES validity, rediscovery rate of known actives, and scaffold diversity. Key findings:</p>
<ul>
<li><strong>Validity</strong>: S4 and LSTM maintain higher validity than GPT at elevated temperatures (GPT median validity drops below 40% at high T).</li>
<li><strong>Rediscovery</strong>: S4 outperforms LSTM in rediscovering bioactive molecules at all temperatures.</li>
<li><strong>Scaffold diversity</strong>: LSTM achieves the highest number of unique scaffold clusters (median 6,602 at $T = 1.75$), with S4 a close second (6,520 clusters).</li>
</ul>
<p>S4 provides the best balance between bioactivity capture and structural diversity.</p>
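<p>Temperature sampling rescales the model&rsquo;s token logits before the softmax; a plain-Python sketch:</p>

```python
import math

def temperature_softmax(logits, T=1.0):
    # Temperature-scaled softmax over the token vocabulary:
    # T > 1 flattens the distribution (more exploration, higher risk of
    # invalid SMILES), T < 1 sharpens it toward the argmax token.
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

At $T = 2.0$ the probability mass spreads across more tokens than at $T = 1.0$, which is exactly the regime where GPT validity collapses while S4 and LSTM remain robust.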
<h3 id="natural-product-design">Natural product design</h3>
<p>Models were trained on 32,360 large natural product SMILES (length &gt; 100 tokens) from the COCONUT database and used to generate 102,400 designs each.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>S4</th>
          <th>LSTM</th>
          <th>GPT</th>
          <th>Training Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>82,633 (81%)</td>
          <td>76,264 (74%)</td>
          <td>70,117 (68%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Unique</td>
          <td>53,293 (52%)</td>
          <td>51,326 (50%)</td>
          <td>50,487 (49%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Novel</td>
          <td>40,897 (40%)</td>
          <td>43,245 (42%)</td>
          <td>43,168 (42%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>NP-likeness</td>
          <td>1.6 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.6 +/- 0.7</td>
      </tr>
  </tbody>
</table>
<p>S4 designs the most valid molecules (6,000 to 12,000 more than benchmarks) and achieves significantly higher NP-likeness ($p = 1.41 \times 10^{-53}$ vs. LSTM, $p = 1.02 \times 10^{-82}$ vs. GPT). S4 also achieves the lowest Kolmogorov-Smirnov distances to the training/test distributions across multiple structural properties (sp3 carbons, aliphatic rings, spiro atoms, molecular weight, fused ring size, heavy atoms).</p>
<p>For computational efficiency, S4 trains as fast as GPT (both approximately 1.3x faster than LSTM) and generates fastest among all architectures.</p>
<h3 id="prospective-mapk1-inhibitor-design">Prospective MAPK1 inhibitor design</h3>
<p>The pre-trained S4 model was fine-tuned on 68 manually curated MAPK1 inhibitors ($K_i &lt; 1 \mu M$) from ChEMBL v33. The last five fine-tuning epochs generated 256K molecules across five temperature values. After ranking and filtering by log-likelihood score and scaffold similarity, the top 10 designs were evaluated via <a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a> <a href="/notes/computational-chemistry/molecular-dynamics/">molecular dynamics</a> simulations.</p>
<p>Eight out of ten designs showed high predicted affinity, with $\Delta G$ values ranging from $-10.3 \pm 0.6$ to $-23 \pm 4$ kcal/mol. These affinities are comparable to or exceed those of the closest known active neighbors ($\Delta G = -9.1 \pm 0.8$ to $-13 \pm 2$ kcal/mol). The most potent predicted design (molecule 2, $\Delta G = -23 \pm 4$ kcal/mol) engages extensively with the MAPK1 binding pocket, though synthetic accessibility may be limited. Several designs incorporate halogen substitutions favorable for MAPK1 inhibition, consistent with known structure-activity relationships.</p>
<h2 id="s4-combines-the-best-of-lstms-and-gpts-for-molecular-design">S4 Combines the Best of LSTMs and GPTs for Molecular Design</h2>
<p>The main findings of this study are:</p>
<ol>
<li><strong>S4 outperforms both LSTM and GPT</strong> in learning complex molecular properties like bioactivity, while maintaining competitive or superior performance in syntax learning and chemical space exploration.</li>
<li><strong>The dual formulation is key</strong>: holistic training (convolution) enables better capture of global molecular properties, while recurrent generation preserves robust chemical syntax and diverse scaffold exploration.</li>
<li><strong>S4 is especially strong for longer sequences</strong>: natural product design (SMILES &gt; 100 tokens) shows the largest advantages over benchmarks in validity and property matching.</li>
<li><strong>Prospective validation</strong>: 8/10 S4-designed MAPK1 inhibitors are predicted as highly active by molecular dynamics, with affinities comparable to or exceeding known actives.</li>
</ol>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>All evaluations are computational; no wet-lab experimental validation is reported.</li>
<li>Bioactivity evaluation relies on likelihood-based ranking, which is an indirect proxy.</li>
<li>The MD simulations, while more rigorous than simple docking, still represent in silico predictions.</li>
<li>SMILES augmentation and improved ranking protocols could further boost performance.</li>
</ul>
<p><strong>Future directions</strong> include application to macrocyclic peptides and protein sequences, organic reaction planning, structure-based drug design, and integration with wet-lab experimental validation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v31</td>
          <td>1.9M SMILES</td>
          <td>Molecules with SMILES length &lt;= 100 tokens</td>
      </tr>
      <tr>
          <td>Fine-tuning (bioactivity)</td>
          <td>LIT-PCBA (5 targets)</td>
          <td>11-56 actives + ~10K inactives per target</td>
          <td>PKM2, MAPK1, GBA, mTORC1, TP53</td>
      </tr>
      <tr>
          <td>Natural product training</td>
          <td>COCONUT</td>
          <td>32,360 SMILES</td>
          <td>SMILES length &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>Prospective fine-tuning</td>
          <td>ChEMBL v33 (MAPK1)</td>
          <td>68 inhibitors</td>
          <td>$K_i &lt; 1 \mu M$, target ID CHEMBL4040</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: next-token prediction on SMILES strings</li>
<li>Fine-tuning: transfer learning with early stopping (patience 5, tolerance $10^{-5}$)</li>
<li>Molecule ranking: log-likelihood scoring with pre-training bias subtraction (Eq. 5)</li>
<li>Temperature sampling: $T$ from 1.0 to 2.0 (step 0.25) for chemical space exploration</li>
</ul>
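<p>The temperature sampling listed above ($T$ from 1.0 to 2.0) rescales the model's logits before the softmax, trading determinism for exploration. A minimal illustration with made-up logits (function name and shapes are not from the paper):</p>

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Multinomial sampling with a temperature: T > 1 flattens the softmax
    (more exploration of chemical space), T < 1 sharpens it."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # shift for numerical stability
    total = sum(exps)
    threshold = random.random() * total
    cumulative = 0.0
    for index, weight in enumerate(exps):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(exps) - 1
```

<p>At very low temperature this reduces to greedy argmax decoding; at $T = 2$ rare SMILES tokens are sampled far more often.</p>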
<h3 id="models">Models</h3>
<ul>
<li><strong>S4</strong>: Structured state space sequence model with HiPPO initialization; hyperparameter search over 242 + 108 configurations</li>
<li><strong>LSTM</strong>: 40 configurations optimized via random search</li>
<li><strong>GPT</strong>: 35 configurations optimized via random search</li>
<li>All models share the same pre-training data and fine-tuning protocol for fair comparison</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (ChEMBL)</td>
          <td>S4</td>
          <td>97%</td>
          <td>Out of 102,400 generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness (ChEMBL)</td>
          <td>S4</td>
          <td>96%</td>
          <td>Among valid designs</td>
      </tr>
      <tr>
          <td>Novelty (ChEMBL)</td>
          <td>S4</td>
          <td>93%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Bioactivity ranking (top 10)</td>
          <td>S4</td>
          <td>Significant ($p = 8.41 \times 10^{-6}$ vs. LSTM)</td>
          <td>Wilcoxon signed-rank test</td>
      </tr>
      <tr>
          <td>NP validity</td>
          <td>S4</td>
          <td>81%</td>
          <td>COCONUT, SMILES &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>MAPK1 inhibitor success</td>
          <td>S4</td>
          <td>8/10 designs active</td>
          <td>Validated by MD (Umbrella Sampling)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Hyperparameter search: NVIDIA A100 40GB GPUs</li>
<li>LSTM/GPT search: 5 days on single A100</li>
<li>S4 search: 10 days on multiple A100 GPUs</li>
<li>MD simulations: Dutch supercomputer Snellius; 1.2-1.6 microseconds per ligand (<a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a>)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/s4-for-de-novo-drug-design">S4 for de novo drug design</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with data and trained models</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.12666371">Zenodo archive</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Source data and molecule designs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ozcelik, R., de Ruiter, S., Criscuolo, E., &amp; Grisoni, F. (2024). Chemical language modeling with structured state space sequence models. <em>Nature Communications</em>, 15, 6176.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ozcelik2024chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language modeling with structured state space sequence models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{\&#34;O{}z\c{c}elik, R{\i}za and de Ruiter, Sarah and Criscuolo, Emanuele and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6176}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-50469-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>REINVENT 4: Open-Source Generative Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/</guid><description>REINVENT 4 is an open-source generative AI framework combining RNNs and transformers with reinforcement and curriculum learning for de novo molecular design.</description><content:encoded><![CDATA[<h2 id="an-open-source-reference-implementation-for-generative-molecular-design">An Open-Source Reference Implementation for Generative Molecular Design</h2>
<p>REINVENT 4 is a <strong>Resource</strong> paper presenting a production-grade, open-source software framework for AI-driven generative molecular design. The primary contribution is a unified codebase that integrates four distinct molecule generators (de novo design, scaffold decoration, linker design, molecular optimization) with three machine learning optimization algorithms (transfer learning, reinforcement learning, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/">curriculum learning</a>). The software is released under the Apache 2.0 license and represents the fourth major version of the REINVENT platform, which has been in continuous production use at AstraZeneca for drug discovery.</p>
<h2 id="bridging-the-gap-between-research-prototypes-and-production-molecular-design">Bridging the Gap Between Research Prototypes and Production Molecular Design</h2>
<p>The motivation for REINVENT 4 stems from several gaps in the generative molecular design landscape. While numerous AI model architectures have been developed for molecular generation (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">VAEs</a>, GANs, RNNs, transformers, flow models, diffusion models), most exist as research prototypes released alongside individual publications rather than as maintained, integrated software. The authors argue that the scientific community needs reference implementations of common generative molecular design algorithms in the public domain to:</p>
<ol>
<li>Enable nuanced debate about the application of AI in drug discovery</li>
<li>Serve as educational tools for practitioners entering the field</li>
<li>Increase transparency around AI-driven molecular design</li>
<li>Provide a foundation for future innovation</li>
</ol>
<p>REINVENT 4 consolidates previously separate codebases (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> v1, v2, LibInvent, LinkInvent, Mol2Mol) into a single repository with a consistent interface, addressing the fragmentation that characterized earlier releases.</p>
<h2 id="unified-framework-for-sequence-based-molecular-generation">Unified Framework for Sequence-Based Molecular Generation</h2>
<p>The core design of REINVENT 4 centers on sequence-based neural network models that generate <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings in an autoregressive manner. All generators model the probability of producing a token sequence, with two formulations.</p>
<p>For unconditional agents (de novo generation), the joint probability of a sequence $T$ with tokens $t_1, t_2, \ldots, t_\ell$ is:</p>
<p>$$
\mathbf{P}(T) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
<p>For conditional agents (scaffold decoration, linker design, molecular optimization), the joint probability given an input sequence $S$ is:</p>
<p>$$
\mathbf{P}(T \mid S) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1, S)
$$</p>
<p>The negative log-likelihood for unconditional agents is:</p>
<p>$$
NLL(T) = -\log \mathbf{P}(T) = -\sum_{i=1}^{\ell} \log \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
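<p>The factorized probability above translates directly into a per-token sum of log-probabilities. A toy sketch with hypothetical token probability tables (not REINVENT 4's actual API):</p>

```python
import math

def sequence_nll(stepwise_probs, tokens):
    """NLL of a token sequence under an autoregressive model:
    -sum_i log P(t_i | t_<i), with one probability table per step."""
    return -sum(math.log(step[tok]) for step, tok in zip(stepwise_probs, tokens))
```

<p>For a two-token sequence generated with probability 0.5 at each step, the NLL is $-\log(0.25) = 2\log 2 \approx 1.386$.</p>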
<h3 id="reinforcement-learning-with-dap">Reinforcement Learning with DAP</h3>
<p>The key optimization mechanism is reinforcement learning via the &ldquo;Difference between Augmented and Posterior&rdquo; (DAP) strategy. For each generated sequence $T$, the augmented likelihood is defined as:</p>
<p>$$
\log \mathbf{P}_{\text{aug}}(T) = \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)
$$</p>
<p>where $\mathbf{S}(T) \in [0, 1]$ is the scalar score and $\sigma \geq 0$ controls the balance between reward and regularization. The DAP loss is:</p>
<p>$$
\mathcal{L}(T) = \left(\log \mathbf{P}_{\text{aug}}(T) - \log \mathbf{P}_{\text{agent}}(T)\right)^2
$$</p>
<p>The presence of the prior likelihood in the augmented likelihood constrains how far the agent can deviate from chemically plausible space, functioning similarly to proximal policy gradient methods. Because $\log \mathbf{P}_{\text{agent}}(T) \leq 0$ for any sequence, the loss is bounded below by:</p>
<p>$$
\mathcal{L}(T) \geq \max\left(0, \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)\right)^2
$$</p>
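<p>For a single sequence, the DAP loss reduces to a few lines. A sketch; the default $\sigma$ below is illustrative, not a value prescribed by the paper:</p>

```python
def dap_loss(logp_prior, logp_agent, score, sigma=128.0):
    """'Difference between Augmented and Posterior' squared loss for one
    sampled sequence. score is the scalar S(T) in [0, 1]; sigma balances
    reward against staying close to the prior."""
    logp_augmented = logp_prior + sigma * score
    return (logp_augmented - logp_agent) ** 2
```

<p>When the score is zero, the agent is pushed back toward the prior's likelihood; a high score pulls the agent's log-likelihood up by $\sigma \mathbf{S}(T)$.</p>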
<h3 id="four-molecule-generators">Four Molecule Generators</h3>
<p>REINVENT 4 supports four generator types:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>Architecture</th>
          <th>Input</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reinvent</td>
          <td>RNN</td>
          <td>None</td>
          <td>De novo design from scratch</td>
      </tr>
      <tr>
          <td>LibInvent</td>
          <td>RNN</td>
          <td>Scaffold SMILES</td>
          <td>R-group replacement, library design</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/">LinkInvent</a></td>
          <td>RNN</td>
          <td>Two warhead fragments</td>
          <td>Linker design, scaffold hopping</td>
      </tr>
      <tr>
          <td>Mol2Mol</td>
          <td>Transformer</td>
          <td>Input molecule</td>
          <td>Molecular optimization within similarity bounds</td>
      </tr>
  </tbody>
</table>
<p>All generators are fully integrated with all three optimization algorithms (TL, RL, CL). The Mol2Mol transformer was trained on over 200 billion molecular pairs from PubChem with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> $\geq 0.50$, using ranking loss to directly link negative log-likelihood to molecular similarity.</p>
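<p>Tanimoto similarity on binary fingerprints is simple to state. A minimal sketch over sets of on-bit indices (in practice RDKit computes this over bit vectors):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    represented here as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)
```

<p>A pair threshold of $\geq 0.50$, as used for the Mol2Mol training data, keeps only pairs sharing at least half of their combined on-bits.</p>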
<h3 id="staged-learning-curriculum-learning">Staged Learning (Curriculum Learning)</h3>
<p>A key new feature is staged learning, which implements curriculum learning as multi-stage RL. Each stage can define a different scoring profile, allowing users to gradually phase in computationally expensive scoring functions. For example, cheap drug-likeness filters can run first, followed by docking in later stages. Stages terminate when a maximum score threshold is exceeded or a step limit is reached.</p>
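<p>The staged-learning loop can be sketched as follows; the stage dictionary keys (<code>score_fn</code>, <code>threshold</code>, <code>max_steps</code>) and the mean-score termination rule are assumptions for illustration, not REINVENT 4's configuration schema:</p>

```python
def run_stages(stages, generate_batch, update_agent):
    """Hypothetical multi-stage RL driver: each stage scores with its own
    (possibly more expensive) function and ends early once the batch's
    mean score clears the stage threshold."""
    history = []
    for stage in stages:
        for step in range(stage["max_steps"]):
            batch = generate_batch()
            scores = [stage["score_fn"](mol) for mol in batch]
            update_agent(batch, scores)
            mean_score = sum(scores) / len(scores)
            history.append((stage["name"], step, mean_score))
            if mean_score >= stage["threshold"]:
                break  # advance to the next stage's scoring profile
    return history
```

<p>An early stage might score with cheap RDKit drug-likeness filters while a later stage swaps in docking, so expensive oracles only see already-plausible molecules.</p>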
<h3 id="scoring-subsystem">Scoring Subsystem</h3>
<p>The scoring subsystem implements a plugin architecture supporting over 25 scoring components, including:</p>
<ul>
<li>Physicochemical descriptors from RDKit (QED, SLogP, TPSA, molecular weight, etc.)</li>
<li>Molecular docking via DockStream (<a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>, rDock, Hybrid, Glide, GOLD)</li>
<li>QSAR models via Qptuna and ChemProp (D-MPNN)</li>
<li>Shape similarity via ROCS</li>
<li>Synthesizability estimation via SA score</li>
<li>Matched molecular pairs via mmpdb</li>
<li>Generic REST and external process interfaces</li>
</ul>
<p>Scores are aggregated via weighted arithmetic or geometric mean. A transform system (sigmoid, step functions, value maps) normalizes individual component scores to $[0, 1]$.</p>
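<p>A sketch of the two pieces just described, a sigmoid transform onto $[0, 1]$ and weighted geometric-mean aggregation; the parameterization is illustrative rather than REINVENT 4's exact implementation:</p>

```python
import math

def sigmoid_transform(x, center, k=1.0):
    """Map a raw component value (e.g. TPSA, SLogP) onto [0, 1];
    center and steepness k are illustrative parameters."""
    return 1.0 / (1.0 + math.exp(-k * (x - center)))

def weighted_geometric_mean(scores, weights):
    """Aggregate normalized component scores: prod_i s_i^(w_i / sum_w),
    clamped to avoid log(0). Any zero-weighted failure drags the total down."""
    total = sum(weights)
    return math.exp(
        sum(w * math.log(max(s, 1e-12)) for s, w in zip(scores, weights)) / total
    )
```

<p>The geometric mean is often preferred over the arithmetic mean because a single near-zero component (e.g. a failed docking score) collapses the aggregate rather than being averaged away.</p>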
<h2 id="pdk1-inhibitor-case-study">PDK1 Inhibitor Case Study</h2>
<p>The paper demonstrates REINVENT 4 through a structure-based drug design exercise targeting <a href="https://en.wikipedia.org/wiki/PDPK1">Phosphoinositide-dependent kinase-1 (PDK1)</a> inhibitors. The experimental setup uses PDB crystal structure 2XCH with DockStream and Glide for docking, defining hits as molecules with docking score $\leq -8$ kcal/mol and QED $\geq 0.7$.</p>
<p><strong>Baseline RL from prior</strong>: 50 epochs of staged learning with batch size 128 produced 119 hits from 6,400 generated molecules (1.9% hit rate), spread across 103 generic Bemis-Murcko scaffolds.</p>
<p><strong>Transfer learning + RL</strong>: After 10 epochs of TL on 315 congeneric pyridinone PDK1 actives from PubChem Assay AID1798002, the same 50-epoch RL run produced 222 hits (3.5% hit rate) across 176 unique generic scaffolds, nearly doubling productivity.</p>
<p>Both approaches generated top-scoring molecules (docking score of -10.1 kcal/mol each) with plausible binding poses reproducing key protein-ligand interactions seen in the native crystal structure, including hinge interactions with ALA 162 and contacts with LYS 111.</p>
<p>The paper also demonstrates the agent&rsquo;s plasticity through a molecular weight switching experiment: after 500 epochs driving generation toward 1500 Da molecules, switching the reward to favor molecules $\leq 500$ Da resulted in rapid adaptation within ~50 epochs, showing that the RL agent can recover from extreme biases.</p>
<h2 id="practical-software-for-ai-driven-drug-discovery">Practical Software for AI-Driven Drug Discovery</h2>
<p>REINVENT 4 represents a mature, well-documented framework that consolidates years of incremental development into a single codebase. Key practical features include TOML/JSON configuration, TensorBoard visualization, multinomial sampling and beam search decoding, diversity filters for scaffold-level novelty, experience replay (inception), and a plugin mechanism for extending the scoring subsystem.</p>
<p>The authors acknowledge that this is one approach among many and that there is no single solution that uniformly outperforms others. REINVENT has demonstrated strong sample efficiency in benchmarks and produced realistic 3D docking poses, but the paper does not claim universal superiority. The focus is on providing a well-engineered, transparent reference implementation rather than advancing a novel algorithm.</p>
<p>Limitations include that only the Mol2Mol prior supports stereochemistry, the training data biases constrain the explorable chemical space, and the SMILES-based representation inherits the known fragility of string-based molecular encodings.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training (Reinvent)</td>
          <td>ChEMBL 25</td>
          <td>~1.7M molecules</td>
          <td>Drug-like compounds</td>
      </tr>
      <tr>
          <td>Prior training (LibInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Scaffold-decoration pairs</td>
      </tr>
      <tr>
          <td>Prior training (LinkInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Fragment-linker pairs</td>
      </tr>
      <tr>
          <td>Prior training (Mol2Mol)</td>
          <td>ChEMBL 28 / PubChem</td>
          <td>~200B pairs</td>
          <td>Tanimoto similarity $\geq 0.50$</td>
      </tr>
      <tr>
          <td>Case study TL</td>
          <td>PubChem AID1798002</td>
          <td>315 compounds</td>
          <td>Congeneric PDK1 actives</td>
      </tr>
      <tr>
          <td>Case study docking</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 crystal structure</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimization</strong>: DAP (recommended), plus three deprecated alternatives (REINFORCE, A2C, MAULI)</li>
<li><strong>Decoding</strong>: Multinomial sampling (default, temperature $K = 1$) and beam search</li>
<li><strong>Diversity filter</strong>: Murcko scaffold, topological scaffold, scaffold similarity, same-SMILES penalty</li>
<li><strong>Experience replay</strong>: Inception memory with configurable size and sampling rate</li>
<li><strong>Gradient descent</strong>: Adam optimizer</li>
</ul>
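<p>The experience-replay (&ldquo;inception&rdquo;) idea can be sketched as a small best-so-far buffer; the class name and parameters below are hypothetical, not REINVENT 4's API:</p>

```python
import random

class InceptionMemory:
    """Hypothetical experience-replay buffer: keep the best-scoring SMILES
    seen so far and mix a few of them into each RL update batch."""

    def __init__(self, size=100, sample_size=10):
        self.size = size
        self.sample_size = sample_size
        self.buffer = []  # list of (score, smiles), best first

    def add(self, smiles, score):
        self.buffer.append((score, smiles))
        self.buffer.sort(key=lambda pair: pair[0], reverse=True)
        del self.buffer[self.size:]  # evict everything below the top `size`

    def sample(self):
        k = min(self.sample_size, len(self.buffer))
        return random.sample(self.buffer, k)
```

<p>Replaying high-reward sequences alongside freshly sampled ones keeps the agent from forgetting good solutions between updates.</p>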
<h3 id="models">Models</h3>
<p>All pre-trained priors are distributed with the repository. RNN-based generators (Reinvent, LibInvent, LinkInvent) and transformer-based generator (Mol2Mol) with multiple similarity-conditioned variants.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Condition</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit rate (RL)</td>
          <td>1.9%</td>
          <td>50 epochs, batch 128</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Hit rate (TL+RL)</td>
          <td>3.5%</td>
          <td>10 TL + 50 RL epochs</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Scaffold diversity (RL)</td>
          <td>103 scaffolds</td>
          <td>From 119 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Scaffold diversity (TL+RL)</td>
          <td>176 scaffolds</td>
          <td>From 222 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Best docking score</td>
          <td>-10.1 kcal/mol</td>
          <td>Both methods</td>
          <td>Glide SP</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements. REINVENT 4 supports both GPU and CPU execution. Python 3.10+ is required, with PyTorch 1.x (2.0 also compatible) and RDKit 2022.9+.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/REINVENT4">REINVENT4</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Full framework with pre-trained priors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/DockStream">DockStream</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Docking wrapper for scoring</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Loeffler, H. H., He, J., Tibo, A., Janet, J. P., Voronov, A., Mervin, L. H., &amp; Engkvist, O. (2024). Reinvent 4: Modern AI-driven generative molecule design. <em>Journal of Cheminformatics</em>, 16, 20. <a href="https://doi.org/10.1186/s13321-024-00812-5">https://doi.org/10.1186/s13321-024-00812-5</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{loeffler2024reinvent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Reinvent 4: Modern AI-driven generative molecule design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Loeffler, Hannes H. and He, Jiazhen and Tibo, Alessandro and Janet, Jon Paul and Voronov, Alexey and Mervin, Lewis H. and Engkvist, Ola}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00812-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Re-evaluating Sample Efficiency in Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</guid><description>Thomas et al. re-evaluate generative model benchmarks for de novo drug design, adding property filters and diversity metrics that re-rank model performance.</description><content:encoded><![CDATA[<h2 id="an-empirical-re-evaluation-of-generative-model-benchmarks">An Empirical Re-evaluation of Generative Model Benchmarks</h2>
<p>This is an <strong>Empirical</strong> paper. The primary contribution is a critical reassessment of the <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO)</a> benchmark for de novo molecule generation. Rather than proposing a new generative model, the authors modify existing benchmark metrics to account for chemical desirability (molecular weight, LogP, topological novelty) and molecular diversity. They then re-evaluate all 25 generative models from the original PMO benchmark plus the recently proposed <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb (AHC)</a> method.</p>
<h2 id="sample-efficiency-and-chemical-quality-in-drug-design">Sample Efficiency and Chemical Quality in Drug Design</h2>
<p>Deep generative models for de novo molecule generation often require large numbers of oracle evaluations (up to $10^5$ samples) to optimize toward a target objective. This is a practical limitation when using computationally expensive scoring functions like molecular docking. The <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> by Gao et al. addressed this by reformulating performance as maximizing an objective within a fixed budget of 10,000 oracle calls, finding <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> to be the most sample-efficient model across 23 tasks.</p>
<p>However, the authors identify a key limitation: the PMO benchmark measures only sample efficiency without considering the chemical quality of proposed molecules. Investigating the top-performing REINVENT model on the <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> task, they find that 4 of 5 replicate runs produce molecules with molecular weight and LogP distributions far outside the training data (ZINC250k). The resulting molecules contain large structures with repeating substructures that are undesirable from a medicinal chemistry perspective. This disconnect between benchmark performance and practical utility motivates the modified evaluation metrics.</p>
<h2 id="modified-metrics-property-filters-and-diversity-requirements">Modified Metrics: Property Filters and Diversity Requirements</h2>
<p>The core innovation is the introduction of three modified AUC Top-10 metrics that extend the original PMO benchmark evaluation:</p>
<p><strong>AUC Top-10 (Filtered)</strong>: Molecules are excluded if their molecular weight or LogP falls beyond 4 standard deviations from the mean of the ZINC250k pre-training dataset ($\mu \pm 4\sigma$, covering approximately 99.99% of a normal distribution). Molecules with more than 10% de novo (unobserved in ZINC250k) ECFP4 fingerprint bits are also filtered out. This ensures the generative model does not drift beyond its applicability domain.</p>
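<p>The $\mu \pm 4\sigma$ property filter can be sketched in a few lines; <code>reference</code> stands in for the ZINC250k property values (molecular weight or LogP):</p>

```python
import statistics

def passes_property_filter(value, reference, n_sigma=4.0):
    """Keep a molecule only if its property lies within mu +/- n_sigma
    of the reference (pre-training) distribution."""
    mu = statistics.mean(reference)
    sd = statistics.stdev(reference)
    return abs(value - mu) <= n_sigma * sd
```
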
<p><strong>AUC Top-10 (Diverse)</strong>: The top 10 molecules are selected iteratively, where a molecule is only added if its Tanimoto similarity (by ECFP4 fingerprints) to any previously selected compound does not exceed 0.35. This threshold corresponds to an approximately 80% probability that more-similar molecules belong to the same bioactivity class, enforcing that distinct candidates possess different profiles.</p>
<p><strong>AUC Top-10 (Combined)</strong>: Applies both property filters and diversity filters simultaneously, providing the most stringent evaluation of practical performance.</p>
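<p>The iterative diverse-selection rule can be sketched as a greedy pass over score-ranked candidates; <code>similarity</code> is a stand-in for an ECFP4 Tanimoto function:</p>

```python
def select_diverse_top_k(scored_candidates, similarity, k=10, max_sim=0.35):
    """Greedy diverse top-k: walk candidates in descending score order and
    keep a molecule only if it is no more than max_sim similar to every
    molecule already kept."""
    kept = []
    for mol, score in sorted(scored_candidates, key=lambda pair: -pair[1]):
        if all(similarity(mol, other) <= max_sim for other, _ in kept):
            kept.append((mol, score))
            if len(kept) == k:
                break
    return kept
```

<p>With the 0.35 threshold, near-duplicates of an already-selected scaffold are skipped in favor of lower-scoring but structurally distinct candidates.</p>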
<h2 id="benchmark-setup-and-generative-models-evaluated">Benchmark Setup and Generative Models Evaluated</h2>
<h3 id="implementation-details">Implementation Details</h3>
<p>The authors re-implement the PMO benchmark using the original code and data (MIT license) with no changes beyond adding AHC and the new metrics. For Augmented Hill-Climb, the architecture follows REINVENT: an embedding layer of size 128 and 3 layers of Gated Recurrent Units (GRU) with size 512. The prior is trained on ZINC250k using SMILES notation with batch size 128 for 5 epochs.</p>
<p>Two AHC variants are benchmarked:</p>
<ul>
<li><strong>SMILES-AHC</strong>: Hyperparameters optimized via the standard PMO procedure, yielding batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>SMILES-AHC*</strong>: Uses $\sigma = 60$, chosen based on prior knowledge that lower $\sigma$ values maintain better regularization and chemical quality</li>
</ul>
<p>Both omit diversity filters and non-unique penalization for standardized comparison, despite these being shown to improve performance in prior work.</p>
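<p>The defining move of Augmented Hill-Climb, updating on only the top fraction $K$ of each sampled batch, can be sketched as follows (function name and data layout are illustrative):</p>

```python
def ahc_update_subset(scored_batch, k=0.25):
    """Augmented Hill-Climb keeps only the top-k fraction of the sampled
    batch (ranked by score) for the policy update, discarding the rest."""
    ranked = sorted(scored_batch, key=lambda pair: pair[1], reverse=True)
    n_keep = max(1, int(len(ranked) * k))
    return ranked[:n_keep]
```

<p>With $K = 0.25$ and batch size 256, only the 64 best-scoring SMILES per step contribute to the loss, which is what makes the method sample efficient.</p>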
<h3 id="models-compared">Models Compared</h3>
<p>The benchmark includes 25 generative models from the original PMO paper spanning diverse architectures: REINVENT (RNN + RL), Graph GA (graph-based genetic algorithm), GP BO (Gaussian process Bayesian optimization), SMILES GA (SMILES-based genetic algorithm), SELFIES-based VAEs, and others. The 23 objective tasks derive primarily from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark.</p>
<h2 id="re-ranked-results-and-augmented-hill-climb-performance">Re-ranked Results and Augmented Hill-Climb Performance</h2>
<p>The modified metrics substantially re-order the ranking of generative models:</p>
<ol>
<li>
<p><strong>SMILES-AHC* achieves top performance on AUC Top-10 (Combined)</strong>, where both property filters and diversity are enforced. The use of domain-informed hyperparameter selection ($\sigma = 60$) proves critical.</p>
</li>
<li>
<p><strong>SMILES-AHC (data-driven hyperparameters) ranks first</strong> when accounting for property filters alone, diversity alone, or both combined, demonstrating that the AHC algorithm itself provides strong performance even without manual tuning.</p>
</li>
<li>
<p><strong>REINVENT retains its first-place rank under property filters alone</strong>, suggesting that the minority of compounds staying within acceptable property space still perform well. However, it drops when diversity is also required.</p>
</li>
<li>
<p><strong>Evolutionary algorithms (Graph GA, GP BO, SMILES GA) drop significantly</strong> under the new metrics. This is expected because rule-based methods are not constrained by the ZINC250k distribution and tend to propose molecules that diverge from drug-like chemical space.</p>
</li>
<li>
<p><strong>Both AHC variants excel on empirically difficult tasks</strong>, including isomer-based tasks, Zaleplon MPO, and Sitagliptin MPO, where other methods struggle.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Results are preliminary because generative models have not undergone hyperparameter optimization against the new metrics</li>
<li>Property filter thresholds are subjective, and the 10% de novo ECFP4 bit threshold was chosen by visual inspection</li>
<li>Comparing rule-based models against distribution-based models using ZINC250k similarity introduces a bias toward distribution-based approaches</li>
<li>Six objective task reference molecules sit in the lowest 0.01% of ZINC250k property space, raising questions about whether distribution-based models can reasonably optimize for these objectives</li>
<li>Property filters and diversity could alternatively be incorporated directly into the objective function as additional oracles, though this would not necessarily produce the same results</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC250k</td>
          <td>~250K molecules</td>
          <td>Subset of ZINC15, provided by PMO benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO</a> benchmark tasks</td>
          <td>23 objectives</td>
          <td>Derived primarily from <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Augmented Hill-Climb</strong>: RL strategy from Thomas et al. (2022), patience of 5</li>
<li><strong>Hyperparameters (SMILES-AHC)</strong>: batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>Hyperparameters (SMILES-AHC*)</strong>: $\sigma = 60$ (domain-informed selection)</li>
<li><strong>Prior training</strong>: 5 epochs, batch size 128, SMILES notation</li>
<li><strong>Oracle budget</strong>: 10,000 evaluations per task</li>
<li><strong>Replicates</strong>: 5 per model per task</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Embedding (128) + 3x GRU (512), following REINVENT</li>
<li><strong>All 25 PMO benchmark models</strong> re-evaluated using original implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-10 (Original)</td>
          <td>Area under curve of average top 10 molecules</td>
          <td>Standard PMO metric</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Filtered)</td>
          <td>Original with MW/LogP and ECFP4 novelty filters</td>
          <td>$\mu \pm 4\sigma$ from ZINC250k</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Diverse)</td>
          <td>Top 10 selected with Tanimoto &lt; 0.35 diversity</td>
          <td>ECFP4 fingerprints</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Combined)</td>
          <td>Both filters and diversity applied</td>
          <td>Most stringent metric</td>
      </tr>
  </tbody>
</table>
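<p>The diversity-aware variants reduce to a greedy filter: walk down the score-ranked list and keep a molecule only if its Tanimoto similarity to everything already kept stays below 0.35. A toy sketch, with fingerprints as plain Python bit sets standing in for ECFP4:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 0.0

def top_k_diverse(scored, k=10, max_sim=0.35):
    """Greedy Top-K selection under a diversity constraint, as in the
    AUC Top-10 (Diverse) metric (sketch). `scored` is a list of
    (score, fingerprint_set) pairs."""
    picked = []
    for score, fp in sorted(scored, key=lambda t: t[0], reverse=True):
        if all(tanimoto(fp, p_fp) < max_sim for _, p_fp in picked):
            picked.append((score, fp))
        if len(picked) == k:
            break
    return picked
```

The Combined metric applies the $\mu \pm 4\sigma$ property filters first, then this diversity-constrained selection.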
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The benchmark uses 10,000 oracle evaluations per task with 5 replicates, which is computationally modest compared to standard generative model training.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Scoring and benchmarking framework by the first author</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original benchmark code and data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Re-evaluating sample efficiency in de novo molecule generation. <em>arXiv preprint arXiv:2212.01385</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{thomas2022reevaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Re-evaluating sample efficiency in de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2212.01385}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2212.01385}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Protein-to-Drug Molecule Translation via Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</guid><description>A Transformer model frames protein-targeted drug generation as machine translation from amino acid sequences to SMILES molecular strings.</description><content:encoded><![CDATA[<h2 id="protein-targeted-drug-generation-as-machine-translation">Protein-Targeted Drug Generation as Machine Translation</h2>
<p>This is a <strong>Method</strong> paper that proposes using the Transformer neural network architecture for protein-specific de novo drug generation. The primary contribution is framing the problem of generating molecules that bind to a target protein as a machine translation task: translating from the &ldquo;language&rdquo; of amino acid sequences to the SMILES representation of candidate drug molecules. The model takes only a protein&rsquo;s amino acid sequence as input and generates novel molecules with predicted binding affinity, requiring no prior knowledge of active ligands, physicochemical descriptors, or the protein&rsquo;s three-dimensional structure.</p>
<h2 id="limitations-of-existing-generative-drug-design-approaches">Limitations of Existing Generative Drug Design Approaches</h2>
<p>Existing deep learning methods for de novo molecule generation suffer from several limitations. Most RNN-based approaches require a library of known active compounds against the target protein to fine-tune the generator or train a reward predictor for reinforcement learning. Structure-based drug design methods require the three-dimensional structure of the target protein, which can be costly and technically difficult to obtain through protein expression, purification, and crystallization. Autoencoder-based approaches (variational and adversarial) similarly depend on prior knowledge of protein binders or their physicochemical characteristics.</p>
<p>The estimated drug-like molecule space is on the order of $10^{60}$, while only around $10^{8}$ compounds have been synthesized. High-throughput screening is expensive and time-consuming, and virtual screening operates only on known molecules. Computational de novo design methods often generate molecules that are hard to synthesize or restrict accessible chemical space through coded rules. A method that requires only a protein&rsquo;s amino acid sequence would substantially simplify the initial stages of drug discovery, particularly for targets with limited or no information about inhibitors and 3D structure.</p>
<h2 id="sequence-to-sequence-translation-with-self-attention">Sequence-to-Sequence Translation with Self-Attention</h2>
<p>The core insight is to treat protein-targeted drug generation as a translation problem between two &ldquo;languages,&rdquo; applying the Transformer architecture that had demonstrated strong results in neural machine translation. The encoder maps a protein amino acid sequence $(a_1, \ldots, a_n)$ to continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$, and the decoder autoregressively generates a SMILES string conditioned on $\mathbf{z}$.</p>
<p>The self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimensionality of the keys and $\sqrt{d_k}$ acts as a scaling factor. Multihead attention runs $h$ parallel attention heads:</p>
<p>$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$</p>
<p>$$
\text{Multihead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$</p>
<p>Positional encoding uses sinusoidal functions:</p>
<p>$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>The self-attention mechanism is particularly well-suited for this task for two reasons. First, protein sequences can be much longer than SMILES strings (dozens of times longer), making the ability to capture long-range dependencies essential. Second, three-dimensional structural features of the binding pocket may be formed by amino acid residues far apart in the linear sequence, and multihead attention can jointly attend to different positional aspects simultaneously.</p>
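<p>The sinusoidal positional encoding is straightforward to reproduce; a minimal pure-Python sketch of the two formulas above:</p>

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in the original Transformer:
    even dimensions get sin, odd dimensions get cos, with wavelengths
    forming a geometric progression."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe
```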
<h2 id="data-model-architecture-and-docking-evaluation">Data, Model Architecture, and Docking Evaluation</h2>
<h3 id="data">Data</h3>
<p>The training data was retrieved from BindingDB, filtering for interactions between proteins from Homo sapiens, Rattus norvegicus, Mus musculus, and Bos taurus with binding affinity below 100 nM (IC50, Kd, or EC50). After filtering for valid PubChem CIDs, SMILES representations, UniProt IDs, molecular weight under 1000 Da, and amino acid sequence lengths between 80 and 2050, the final dataset contained 238,147 records with 1,613 unique proteins and 154,924 unique ligand SMILES strings.</p>
<p>Five Monte Carlo cross-validation splits were created, with the constraint that test set proteins share less than 20% sequence similarity with training set proteins (measured via <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment).</p>
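<p>Enforcing the similarity constraint requires a global alignment score for every train/test protein pair. A minimal Needleman-Wunsch scorer (illustrative match/mismatch/gap scores; the paper used EMBOSS, whose default scoring and gap model differ):</p>

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the Needleman-Wunsch dynamic program,
    computed row by row to keep memory at O(min-length)."""
    n, m = len(a), len(b)
    prev = [j * gap for j in range(m + 1)]  # DP row for the empty prefix of a
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[m]
```

Percent identity (as used for the 20% threshold) additionally requires a traceback to count matched positions; only the score is sketched here.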
<h3 id="model-configuration">Model Configuration</h3>
<p>The model uses the original Transformer implementation via the tensor2tensor library with:</p>
<ul>
<li>4 encoder/decoder layers of size 128</li>
<li>4 attention heads</li>
<li>Adam optimizer with learning rate decay from the original Transformer paper</li>
<li>Batch size of 4,096 tokens</li>
<li>Training for 600K epochs on a single GPU in Google Colaboratory</li>
<li>Vocabulary of 71 symbols (character-level tokenization)</li>
</ul>
<p>Beam search decoding was used with two modes: beam size 4 keeping only the top-1 result (&ldquo;one per one&rdquo; mode) and beam size 10 keeping all 10 results (&ldquo;ten per one&rdquo; mode).</p>
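<p>Beam search keeps the <code>beam_size</code> highest-scoring partial sequences at every step. A toy sketch, where the hypothetical <code>next_log_probs</code> callable stands in for the trained Transformer decoder:</p>

```python
import math

def beam_search(next_log_probs, beam_size=4, max_len=20, eos="$"):
    """Beam-search decoding sketch. `next_log_probs(prefix)` returns a dict
    of token -> log-probability for the next position."""
    beams = [("", 0.0)]  # (partial SMILES, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq.endswith(eos):  # finished hypotheses carry over unchanged
                candidates.append((seq, lp))
                continue
            for tok, tok_lp in next_log_probs(seq).items():
                candidates.append((seq + tok, lp + tok_lp))
        beams = sorted(candidates, key=lambda t: t[1], reverse=True)[:beam_size]
    return beams
```

&ldquo;One per one&rdquo; corresponds to returning only <code>beams[0]</code>; &ldquo;ten per one&rdquo; returns all hypotheses from a width-10 beam.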
<h3 id="chemical-validity-and-uniqueness">Chemical Validity and Uniqueness</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>One per One (avg)</th>
          <th>Ten per One (avg)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES (%)</td>
          <td>90.2</td>
          <td>82.6</td>
      </tr>
      <tr>
          <td>Unique SMILES (%)</td>
          <td>92.3</td>
          <td>81.7</td>
      </tr>
      <tr>
          <td>ZINC15 match (%)</td>
          <td>30.6</td>
          <td>17.1</td>
      </tr>
  </tbody>
</table>
<h3 id="docking-evaluation">Docking Evaluation</h3>
<p>To assess binding affinity, the authors selected two receptor tyrosine kinases from the test set (IGF-1R and VEGFR2) and performed molecular docking with <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a>. Four sets of ligands were compared: known binders, randomly selected compounds, molecules generated for the target protein, and molecules generated for other targets (cross-docking control).</p>
<p>ROC-AUC analysis showed that the docking tool classified generated molecules for the correct target as binders at rates comparable to known binders. For the best-discriminating structures (PDB 3O23 for IGF-1R, PDB 3BE2 for VEGFR2), Mann-Whitney U tests confirmed statistically significant differences between generated-for-target molecules and random compounds, while the difference between generated-for-target and known binders was not significant (p = 0.40 and 0.26 respectively), suggesting the model generates plausible binders.</p>
<h3 id="drug-likeness-properties">Drug-Likeness Properties</h3>
<p>Generated molecules were evaluated against <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and other drug-likeness criteria:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Constraint</th>
          <th>One per One (%)</th>
          <th>Ten per One (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>&lt; 5</td>
          <td>84.4</td>
          <td>85.6</td>
      </tr>
      <tr>
          <td>Molecular weight</td>
          <td>&lt; 500 Da</td>
          <td>95.8</td>
          <td>88.9</td>
      </tr>
      <tr>
          <td>H-bond donors</td>
          <td>&lt; 5</td>
          <td>95.8</td>
          <td>91.9</td>
      </tr>
      <tr>
          <td>H-bond acceptors</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>Rotatable bonds</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>91.2</td>
      </tr>
      <tr>
          <td>TPSA</td>
          <td>&lt; 140</td>
          <td>98.0</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>&lt; 6</td>
          <td>99.9</td>
          <td>100.0</td>
      </tr>
  </tbody>
</table>
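<p>The criteria in the table reduce to threshold checks over precomputed descriptors. A sketch over a property dict (in practice the values would come from RDKit; the descriptor key names here are made up for illustration):</p>

```python
def passes_drug_likeness(props):
    """Evaluate the drug-likeness thresholds from the table above on a
    dict of precomputed descriptors; returns per-rule pass/fail flags."""
    rules = {
        "logp": lambda v: v < 5,        # octanol-water partition
        "mw": lambda v: v < 500,        # molecular weight, Da
        "hbd": lambda v: v < 5,         # H-bond donors
        "hba": lambda v: v < 10,        # H-bond acceptors
        "rot_bonds": lambda v: v < 10,  # rotatable bonds
        "tpsa": lambda v: v < 140,      # topological polar surface area
        "sas": lambda v: v < 6,         # synthetic accessibility score
    }
    return {name: rule(props[name]) for name, rule in rules.items()}
```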
<p>Mean QED values were 0.66 +/- 0.19 (one per one) and 0.58 +/- 0.21 (ten per one).</p>
<h3 id="structural-novelty">Structural Novelty</h3>
<p>Tanimoto similarity analysis showed that only 8% of generated structures had similarity above the threshold (&gt; 0.85) to training compounds. The majority (51%) had Tanimoto scores below 0.5. The mean nearest-neighbor Tanimoto similarity of generated molecules to the training set (0.54 +/- 0.17 in one-per-one mode) was substantially lower than the mean within-training-set similarity (0.74 +/- 0.14), indicating the model generates structurally diverse molecules outside the training distribution.</p>
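<p>The novelty analysis boils down to a nearest-neighbor similarity search against the training set. A sketch with fingerprints as bit sets (stand-ins for the actual hashed fingerprints):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter) if (fp_a or fp_b) else 0.0

def mean_nn_similarity(generated, training):
    """Mean nearest-neighbor Tanimoto similarity of generated molecules
    to the training set (the novelty statistic discussed above)."""
    nearest = [max(tanimoto(g, t) for t in training) for g in generated]
    return sum(nearest) / len(nearest)
```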
<h2 id="generated-molecules-show-drug-like-properties-and-predicted-binding">Generated Molecules Show Drug-Like Properties and Predicted Binding</h2>
<p>The model generates roughly 90% chemically valid SMILES in one-per-one mode, with 92% uniqueness. Docking simulations on IGF-1R and VEGFR2 suggest that generated molecules for the correct target are statistically indistinguishable from known binders, while molecules generated for other targets behave more like random compounds. Drug-likeness properties fall within acceptable ranges for the vast majority of generated compounds.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Only two protein targets were analyzed via docking due to computational constraints, and the analysis was limited to proteins with a single well-known druggable binding pocket.</li>
<li>Beam search produces molecules that differ only slightly; diverse beam search or coupling with variational/adversarial autoencoders could improve diversity.</li>
<li>The fraction of molecules matching the ZINC15 database (30.6% in one-per-one mode) could potentially be reduced by pretraining on a larger compound set (e.g., ChEMBL&rsquo;s 1.5 million molecules).</li>
<li>Model interpretability remains limited and is identified as important future work.</li>
<li>The approach is a proof of concept and requires further validation via in vitro assays across diverse protein targets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Test</td>
          <td>BindingDB (filtered)</td>
          <td>238,147 records</td>
          <td>1,613 unique proteins, 154,924 unique SMILES; IC50/Kd/EC50 &lt; 100 nM</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>11 (IGF-1R), 20 (VEGFR2)</td>
          <td>SMINA docking with default settings</td>
      </tr>
      <tr>
          <td>Database matching</td>
          <td>ZINC15</td>
          <td>N/A</td>
          <td>Used for novelty assessment</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer (encoder-decoder) via tensor2tensor library</li>
<li>Beam search decoding (beam sizes 4 and 10)</li>
<li>Needleman-Wunsch global alignment for protein sequence similarity (EMBOSS)</li>
<li>SMINA for molecular docking</li>
<li>RDKit for validity checking, property calculation, and canonicalization</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4 layers, 128 hidden size, 4 attention heads</li>
<li>Character-level tokenization with 71-symbol vocabulary</li>
<li>5-fold Monte Carlo cross-validation with &lt; 20% sequence similarity between train/test proteins</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES</td>
          <td>90.2% (1-per-1), 82.6% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>Unique SMILES</td>
          <td>92.3% (1-per-1), 81.7% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>ZINC15 match</td>
          <td>30.6% (1-per-1), 17.1% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.66 +/- 0.19 (1-per-1), 0.58 +/- 0.21 (10-per-1)</td>
          <td>Drug-likeness score</td>
      </tr>
      <tr>
          <td>SAS compliance</td>
          <td>99.9% (1-per-1), 100% (10-per-1)</td>
          <td>SAS &lt; 6</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Google Colaboratory with one GPU</li>
<li>Training for 600K epochs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dariagrechishnikova/molecule_structure_generation">molecule_structure_generation</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation using tensor2tensor</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grechishnikova, D. (2021). Transformer neural network for protein-specific de novo drug generation as a machine translation problem. <em>Scientific Reports</em>, 11, 321. <a href="https://doi.org/10.1038/s41598-020-79682-4">https://doi.org/10.1038/s41598-020-79682-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grechishnikova2021transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer neural network for protein-specific de novo drug generation as a machine translation problem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grechishnikova, Daria}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{321}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-020-79682-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PrefixMol: Prefix Embeddings for Drug Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</guid><description>PrefixMol uses prefix embeddings in a GPT SMILES generator to jointly condition on protein pockets and chemical properties for drug design.</description><content:encoded><![CDATA[<h2 id="unified-multi-conditional-molecular-generation">Unified Multi-Conditional Molecular Generation</h2>
<p>PrefixMol is a <strong>Method</strong> paper that introduces a unified generative model for structure-based drug design that simultaneously conditions on protein binding pockets and multiple chemical properties. The primary contribution is a prefix-embedding mechanism, borrowed from NLP multi-task learning, that represents each condition (pocket geometry, Vina score, QED, SA, LogP, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski</a>) as a learnable feature vector prepended to the input sequence of a GPT-based <a href="/notes/computational-chemistry/molecular-representations/smiles-original-paper/">SMILES</a> generator. This allows a single model to handle customized multi-conditional generation without the negative transfer that typically arises from merging separate task-specific models.</p>
<h2 id="bridging-target-aware-and-chemistry-aware-molecular-design">Bridging Target-Aware and Chemistry-Aware Molecular Design</h2>
<p>Prior structure-based drug design methods (e.g., Pocket2Mol, GraphBP) generate molecules conditioned on protein binding pockets but impose no constraints on the chemical properties of the output. Conversely, controllable molecule generation methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a>, CMG) can steer chemical properties but ignore protein-ligand interactions. Merging these two objectives into a single model is difficult for two reasons:</p>
<ol>
<li><strong>Data scarcity</strong>: Few datasets contain both protein-ligand binding affinity data and comprehensive molecular property annotations.</li>
<li><strong>Negative transfer</strong>: Treating each condition as a separate task in a multi-task framework can hurt overall performance when tasks conflict.</li>
</ol>
<p>PrefixMol addresses both problems by extending the CrossDocked dataset with molecular property labels and using a parameter-efficient prefix conditioning strategy that decouples task-specific knowledge from the shared generative backbone.</p>
<h2 id="prefix-conditioning-in-attention-layers">Prefix Conditioning in Attention Layers</h2>
<p>The core innovation adapts prefix-tuning from NLP to molecular generation. Given a GPT transformer that generates SMILES token-by-token, PrefixMol prepends $n_c$ learnable condition vectors $\mathbf{p}_{\phi} \in \mathbb{R}^{n_c \times d}$ to the left of the sequence embedding $\mathbf{x} \in \mathbb{R}^{l \times d}$, forming an extended input $\mathbf{x}' = [\text{PREFIX}; \mathbf{x}]$.</p>
<p>The output of each position is:</p>
<p>$$
h_i = \begin{cases} p_{\phi,i}, &amp; \text{if } i &lt; n_c \\ \text{LM}_\theta(x_i', h_{&lt;i}), &amp; \text{otherwise} \end{cases}
$$</p>
<p>Because the prefix features always sit to the left, the causal attention mask ensures they influence all subsequent token predictions. The key insight is that the attention mechanism decomposes into a weighted sum of self-attention and prefix attention:</p>
<p>$$
\begin{aligned}
\text{head} &amp;= (1 - \lambda(\mathbf{x})) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{c}\mathbf{W}_k, \mathbf{c}\mathbf{W}_v)}_{\text{self-attention}} \\
&amp;\quad + \lambda(\mathbf{x}) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{p}_\phi\mathbf{W}_k, \mathbf{p}_\phi\mathbf{W}_v)}_{\text{prefix attention}}
\end{aligned}
$$</p>
<p>where $\lambda(\mathbf{x})$ is a scalar representing the normalized attention weight on the prefix positions. This decomposition shows that conditions modulate generation through an additive attention pathway, and the activation map $\text{softmax}(\mathbf{x}\mathbf{W}_q \mathbf{W}_k^\top \mathbf{p}_\phi^\top)$ directly reveals how each condition steers model behavior.</p>
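<p>This decomposition is an exact algebraic identity of the softmax, and it can be checked numerically in a toy one-dimensional setting (scalar queries, keys, and values; $\lambda$ is simply the softmax mass falling on the prefix positions):</p>

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(q, kv):
    """Single-query attention over (key, value) pairs, toy scalar version."""
    weights = softmax([q * k for k, _ in kv])
    return sum(w * v for w, (_, v) in zip(weights, kv))

def prefix_attention(q, prefix_kv, ctx_kv):
    """Attention over the concatenated [PREFIX; context] key/value list,
    returned together with lambda, the total softmax mass on the prefix."""
    kv = prefix_kv + ctx_kv
    weights = softmax([q * k for k, _ in kv])
    lam = sum(weights[: len(prefix_kv)])
    out = sum(w * v for w, (_, v) in zip(weights, kv))
    return out, lam
```

The full attention output equals $(1 - \lambda)$ times context-only attention plus $\lambda$ times prefix-only attention, because restricting and renormalizing a softmax preserves the relative weights.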
<p><strong>Condition correlation</strong> is similarly revealed. For the prefix features themselves, the causal mask zeros out the cross-attention to the sequence, leaving only the prefix self-correlation term:</p>
<p>$$
\text{head} = \text{Attn}(\mathbf{p}_\phi \mathbf{W}_q, \mathbf{p}_\phi \mathbf{W}_k, \mathbf{p}_\phi \mathbf{W}_v)
$$</p>
<p>The attention map $\mathbf{A}(\mathbf{p}_\phi)$ from this term encodes how conditions relate to one another.</p>
<h3 id="condition-encoders">Condition Encoders</h3>
<p>Each condition has a dedicated encoder:</p>
<ul>
<li><strong>3D Pocket</strong>: A Geometric Vector Transformer (GVF) processes the binding pocket as a 3D graph with SE(3)-equivariant node and edge features. GVF extends GVP-GNN with a global attention module over geometric features. A position-aware attention mechanism with radial basis functions produces the pocket embedding.</li>
<li><strong>Chemical properties</strong>: Separate MLPs embed each scalar property (Vina, QED, SA, LogP, Lipinski) into the shared $d$-dimensional space.</li>
</ul>
<h3 id="training-objective">Training Objective</h3>
<p>PrefixMol is trained with two losses. The auto-regressive loss is:</p>
<p>$$
\mathcal{L}_{AT} = -\sum_{1 &lt; i \leq t} \log p_{\phi, \theta}(x_i \mid \mathbf{x}_{&lt;i}, \mathbf{p}_\phi)
$$</p>
<p>A triplet property prediction loss encourages generated molecules to match desired properties:</p>
<p>$$
\mathcal{L}_{Pred} = \max\left((\hat{\mathbf{c}} - \mathbf{c})^2 - (\hat{\mathbf{c}} - \dot{\mathbf{c}})^2, 0\right)
$$</p>
<p>where $\mathbf{c}$ is the input condition, $\hat{\mathbf{c}}$ is predicted by an MLP head, and $\dot{\mathbf{c}}$ is computed by RDKit from the generated SMILES (gradient is propagated through $\hat{\mathbf{c}}$ since RDKit is non-differentiable).</p>
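<p>The triplet loss is applied element-wise over the condition vector; a direct transcription of the formula (scalar illustration, without the autograd plumbing through $\hat{\mathbf{c}}$):</p>

```python
def triplet_property_loss(c, c_hat, c_dot):
    """Element-wise triplet property loss max((c_hat - c)^2 - (c_hat - c_dot)^2, 0),
    summed over the condition vector. c: input conditions; c_hat: MLP-predicted
    properties; c_dot: properties computed from the generated SMILES."""
    return sum(
        max((h - t) ** 2 - (h - d) ** 2, 0.0)
        for t, h, d in zip(c, c_hat, c_dot)
    )
```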
<h2 id="experimental-setup-and-controllability-evaluation">Experimental Setup and Controllability Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use the CrossDocked dataset (22.5 million protein-ligand structures) with chemical properties appended for each ligand. Data splitting and evaluation follow Pocket2Mol and Masuda et al.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li><strong>Vina score</strong> (binding affinity, computed by QVina after UFF refinement)</li>
<li><strong>QED</strong> (quantitative estimate of drug-likeness, 0-1)</li>
<li><strong>SA</strong> (synthetic accessibility, 0-1)</li>
<li><strong>LogP</strong> (octanol-water partition coefficient)</li>
<li><strong>Lipinski</strong> (rule-of-five compliance count)</li>
<li><strong>High Affinity</strong> (fraction of pockets where generated molecules match or exceed test set affinities)</li>
<li><strong>Diversity</strong> (average pairwise Tanimoto distance over Morgan fingerprints)</li>
<li><strong>Sim.Train</strong> (maximum Tanimoto similarity to training set)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>Unconditional comparison against CVAE, AR (Luo et al. 2021a), and Pocket2Mol.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Unconditional generation</strong> (Table 1): PrefixMol without conditions achieves sub-optimal results on Vina (-6.532), QED (0.551), SA (0.750), and LogP (1.415) compared to Pocket2Mol. However, it substantially outperforms all baselines on diversity (0.856 vs. 0.688 for Pocket2Mol) and novelty (Sim.Train of 0.239 vs. 0.376), indicating it generates genuinely novel molecules rather than memorizing training data.</p>
<p><strong>Single-property control</strong> (Table 2): Molecular properties are positively correlated with conditional inputs across VINA, QED, SA, LogP, and Lipinski. With favorable control scales, PrefixMol surpasses Pocket2Mol on QED (0.767 vs. 0.563), SA (0.924 vs. 0.765), and LogP. The Vina score also improves when QED or LogP conditions are increased (e.g., -7.733 at QED control scale +2), revealing coupling between conditions.</p>
<p><strong>Multi-property control</strong> (Table 3): Jointly adjusting all five conditions shows consistent positive relationships. For example, at control scale +4, QED reaches 0.722, SA reaches 0.913, and Lipinski saturates at 5.0. Joint QED+SA control at +2.0 achieves Lipinski = 5.0, confirming that certain properties are coupled.</p>
<h3 id="condition-relation-analysis">Condition Relation Analysis</h3>
<p>By computing partial derivatives of the prefix attention map with respect to each condition, the authors construct a relation matrix $\mathbf{R} = \sum_{i=2}^{6} |\partial \mathbf{A} / \partial c_i|$. Key findings:</p>
<ul>
<li><strong>Vina is weakly self-controllable</strong> but strongly influenced by QED, LogP, and SA, explaining why multi-condition control improves binding affinity even when Vina alone responds poorly.</li>
<li><strong>LogP and QED</strong> are the most correlated property pair.</li>
<li><strong>Lipinski is coupled to QED and SA</strong>, saturating at 5.0 when both QED and SA control scales reach +2.</li>
</ul>
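<p>The relation-matrix construction can be sketched with the same first-order forward difference the paper uses ($\Delta = 1$). Here <code>attention_map</code> is a hypothetical stand-in for the model&rsquo;s prefix attention map as a function of the condition vector; a toy linear map is used so the behavior is easy to verify.</p>

```python
import numpy as np

def relation_matrix(attention_map, c, delta=1.0):
    """Approximate R = sum_i |dA/dc_i| with forward differences over the
    five property conditions (index 0 is the pocket condition, skipped,
    matching the paper's sum over i = 2..6)."""
    A0 = attention_map(c)
    R = np.zeros_like(A0)
    for i in range(1, len(c)):
        c_pert = c.copy()
        c_pert[i] += delta
        R += np.abs(attention_map(c_pert) - A0) / delta
    return R

# Toy linear "attention map" of the 6 conditions, for illustration only.
W = np.arange(24, dtype=float).reshape(6, 2, 2)
toy_attention = lambda c: np.tensordot(c, W, axes=1)

R = relation_matrix(toy_attention, np.ones(6))
```

<p>For the linear toy map the finite difference recovers $|W_i|$ exactly, so <code>R</code> is simply the elementwise sum of the five property slices of <code>W</code>; in the real model the entries quantify how strongly each condition perturbs the attention pattern.</p>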
<h2 id="key-findings-limitations-and-interpretability-insights">Key Findings, Limitations, and Interpretability Insights</h2>
<p>PrefixMol demonstrates that prefix embedding is an effective strategy for unifying target-aware and chemistry-aware molecular generation. The main findings are:</p>
<ol>
<li>A single prefix-conditioned GPT model can control multiple chemical properties simultaneously while targeting specific protein pockets.</li>
<li>Multi-conditional generation outperforms unconditional baselines in drug-likeness metrics, and the controllability enables PrefixMol to surpass Pocket2Mol on QED, SA, and LogP.</li>
<li>The attention mechanism provides interpretable coupling relationships between conditions, offering practical guidance (e.g., improving QED indirectly improves Vina).</li>
</ol>
<p><strong>Limitations</strong>: The paper does not report validity rates for generated SMILES. The unconditional model underperforms Pocket2Mol on binding affinity (Vina), suggesting that generating 2D SMILES strings and relying on post hoc 3D conformer generation may be less effective than direct atom-by-atom 3D generation for binding affinity optimization. The condition relation analysis uses a first-order finite difference approximation ($\Delta = 1$), which may not capture nonlinear interactions. No external validation on prospective drug discovery tasks is provided. Hardware and training time details are not reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training / Evaluation</td>
          <td>CrossDocked (extended)</td>
          <td>22.5M protein-ligand structures</td>
          <td>Extended with molecular properties (QED, SA, LogP, Lipinski, Vina)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-based auto-regressive SMILES generation with prefix conditioning</li>
<li>GVF (Geometric Vector Transformer) for 3D pocket encoding, extending GVP-GNN with global attention</li>
<li>Separate MLP encoders for each chemical property</li>
<li>Triplet property prediction loss with non-differentiable RDKit-computed properties</li>
<li>QVina for Vina score computation with UFF refinement</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT transformer backbone for SMILES generation</li>
<li>6 prefix condition vectors ($n_c = 6$): Pocket, Vina, QED, SA, LogP, Lipinski</li>
<li>Specific architectural hyperparameters (hidden dimension, number of layers, heads) not reported in the paper</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PrefixMol (unconditional)</th>
          <th>Pocket2Mol</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vina (kcal/mol)</td>
          <td>-6.532</td>
          <td>-7.288</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.551</td>
          <td>0.563</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>SA</td>
          <td>0.750</td>
          <td>0.765</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.856</td>
          <td>0.688</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Sim.Train</td>
          <td>0.239</td>
          <td>0.376</td>
          <td>Lower is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/A4Bio/PrefixMol">PrefixMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, Z., Hu, Y., Tan, C., &amp; Li, S. Z. (2023). PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. <em>arXiv preprint arXiv:2302.07120</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gao2023prefixmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Zhangyang and Hu, Yuqi and Tan, Cheng and Li, Stan Z.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2302.07120}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PASITHEA: Gradient-Based Molecular Design via Dreaming</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/deep-molecular-dreaming-pasithea/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/deep-molecular-dreaming-pasithea/</guid><description>PASITHEA applies inceptionism to molecular design, using gradient-based optimization on SELFIES representations to generate molecules with target properties.</description><content:encoded><![CDATA[<h2 id="inceptionism-applied-to-molecular-inverse-design">Inceptionism Applied to Molecular Inverse Design</h2>
<p>This is a <strong>Method</strong> paper that introduces PASITHEA, a gradient-based approach to de-novo molecular design inspired by inceptionism (deep dreaming) techniques from computer vision. The core contribution is a direct optimization framework that modifies molecular structures by backpropagating through a trained property-prediction network, with the molecular input (rather than weights) serving as the optimizable variable. PASITHEA is enabled by SELFIES, a surjective molecular string representation that guarantees 100% validity of generated molecules.</p>
<h2 id="the-need-for-direct-gradient-based-molecular-optimization">The Need for Direct Gradient-Based Molecular Optimization</h2>
<p>Existing inverse molecular design methods, including variational autoencoders (VAEs), generative adversarial networks (GANs), reinforcement learning (RL), and genetic algorithms (GAs), share a common characteristic: they optimize molecules indirectly. VAEs and GANs learn distributions and scan latent spaces. RL agents learn policies from environmental rewards. GAs iteratively apply mutations and selections. None of these approaches directly maximize an objective function in a gradient-based manner with respect to the molecular representation itself.</p>
<p>This indirection has several consequences. VAE-based methods require learning a latent space, and the optimization happens in that space rather than directly on molecular structures. RL and GA methods require expensive function evaluations for each candidate molecule. The authors identify an opportunity to exploit gradients more directly by reversing the learning process of a neural network trained to predict molecular properties, thereby sidestepping latent spaces, policies, and population-based search entirely.</p>
<p>A second motivation is interpretability. By operating directly on the molecular representation (rather than a learned latent space), PASITHEA can reveal what a regression network has learned about structure-property relationships, a capability the authors frame as analogous to how deep dreaming reveals what image classifiers have learned about visual features.</p>
<h2 id="core-innovation-inverting-regression-networks-on-selfies">Core Innovation: Inverting Regression Networks on SELFIES</h2>
<p>PASITHEA&rsquo;s key insight is a two-phase training procedure that repurposes the standard neural network training loop for molecule generation.</p>
<p><strong>Phase 1: Prediction training.</strong> A fully connected neural network is trained to predict a real-valued chemical property (logP) from one-hot encoded SELFIES strings. The standard feedforward and backpropagation process updates the network weights to minimize mean squared error between predicted and ground-truth property values:</p>
<p>$$
\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} (f_{\theta}(\mathbf{x}_i) - y_i)^2
$$</p>
<p>where $f_{\theta}$ is the neural network with parameters $\theta$, $\mathbf{x}_i$ is the one-hot encoded SELFIES input, and $y_i$ is the target logP value.</p>
<p><strong>Phase 2: Inverse training (deep dreaming).</strong> The network weights $\theta$ are frozen. For a given input molecule $\mathbf{x}$ and a desired target property value $y_{\text{target}}$, the gradients are computed with respect to the input representation rather than the weights:</p>
<p>$$
\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} \mathcal{L}(f_{\theta}(\mathbf{x}), y_{\text{target}})
$$</p>
<p>This gradient descent on the input incrementally modifies the one-hot encoding of the molecular string, transforming it toward a structure whose predicted property matches the target value. At each step, the argmax function converts the continuous one-hot encoding back to a discrete SELFIES string, which always maps to a valid molecular graph due to the surjective property of SELFIES.</p>
<p><strong>The role of SELFIES.</strong> The surjective mapping from strings to molecular graphs is essential. With SMILES, intermediate strings during optimization can become syntactically invalid (e.g., an unclosed ring like &ldquo;CCCC1CCCCC&rdquo;), producing no valid molecule. SELFIES enforces constraints that guarantee every string maps to a valid molecular graph, making the continuous gradient-based optimization feasible.</p>
<p><strong>Input noise injection.</strong> Because inverse training transforms a one-hot encoding from binary values to real numbers, the discrete-to-continuous transition can cause convergence problems. The authors address this by initializing the input with noise: every zero in the one-hot encoding is replaced by a random number in $[0, k]$, where $k$ is a hyperparameter between 0.5 and 0.95. This smooths the optimization landscape and enables incremental molecular modifications rather than abrupt changes.</p>
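<p>A minimal sketch of the inverse-training loop, assuming a toy linear predictor with an analytic input gradient standing in for the frozen MLP (in PASITHEA itself the gradient comes from backpropagation through the trained network):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def dream(x_onehot, w, y_target, lr=0.01, k=0.9, steps=500):
    """Gradient descent on the *input* of a frozen predictor f(x) = w . x;
    the weights w never change, only the (noised) one-hot input does."""
    x = x_onehot.astype(float)
    x[x == 0] = rng.uniform(0.0, k, size=int((x == 0).sum()))  # noise in [0, k]
    for _ in range(steps):
        pred = w @ x
        grad = 2.0 * (pred - y_target) * w  # d/dx of (f(x) - y_target)^2
        x -= lr * grad
    return x

w = np.array([1.0, -2.0, 0.5, 3.0])
x_opt = dream(np.array([1, 0, 0, 0]), w, y_target=6.0)
# The frozen model's prediction w @ x_opt now sits near the target of 6.0.
```

<p>At convergence, taking the argmax over each token&rsquo;s slice of the optimized input would recover a discrete SELFIES string; that decode step is omitted here.</p>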
<h2 id="experimental-setup-on-qm9-with-logp-optimization">Experimental Setup on QM9 with LogP Optimization</h2>
<h3 id="dataset-and-property">Dataset and Property</h3>
<p>The experiments use a random subset of 10,000 molecules from the QM9 dataset. The target property is the logarithm of the partition coefficient (logP), computed using RDKit. LogP measures lipophilicity, an important drug-likeness indicator that follows an approximately normal distribution in QM9 and has a nearly continuous range, making it suitable for gradient-based optimization.</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>PASITHEA uses a fully connected neural network with four layers, each containing 500 nodes with ReLU activation. The loss function is mean squared error. Data is split 85%/15% for training/testing. The prediction model trains for approximately 1,500 epochs with an Adam optimizer and a learning rate of $1 \times 10^{-6}$.</p>
<p>For inverse training, the authors select a noise upper-bound of 0.9 and a learning rate of 0.01, chosen from hyperparameter tuning experiments that evaluate the percentage of molecules optimized toward the target property.</p>
<h3 id="optimization-targets">Optimization Targets</h3>
<p>Two extreme logP targets are used: $+6$ (high lipophilicity) and $-6$ (low lipophilicity). These values exceed the range of logP values in the QM9 dataset (minimum: $-2.19$, maximum: $3.08$), testing whether the model can extrapolate beyond the training distribution.</p>
<h2 id="distribution-shifts-and-interpretable-molecular-transformations">Distribution Shifts and Interpretable Molecular Transformations</h2>
<h3 id="distribution-level-results">Distribution-Level Results</h3>
<p>Applying deep dreaming to the full set of 10,000 molecules produces a clear shift in the logP distribution:</p>
<table>
  <thead>
      <tr>
          <th>Statistic</th>
          <th>QM9 Original</th>
          <th>Optimized (target +6)</th>
          <th>Optimized (target -6)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean logP</td>
          <td>0.3909</td>
          <td>1.8172</td>
          <td>-0.3360</td>
      </tr>
      <tr>
          <td>Min logP</td>
          <td>-2.1903</td>
          <td>-0.8240</td>
          <td>-2.452</td>
      </tr>
      <tr>
          <td>Max logP</td>
          <td>3.0786</td>
          <td>4.2442</td>
          <td>0.9018</td>
      </tr>
  </tbody>
</table>
<p>The optimized distributions extend beyond the original dataset&rsquo;s property range. The right-shifted distribution (target +6) produces molecules with logP values up to 4.24, exceeding the original maximum of 3.08. The left-shifted distribution (target -6) reaches -2.45, below the original minimum. This indicates that PASITHEA can generate molecules with properties outside the training data bounds.</p>
<p>Additionally, 97.2% of the generated molecules do not exist in the original training set, indicating that the network is not memorizing data but rather using structural features to guide optimization. Some generated molecules contain more heavy atoms than the QM9 maximum of 9, since the SELFIES string length allows for larger structures.</p>
<h3 id="molecule-level-interpretability">Molecule-Level Interpretability</h3>
<p>The stepwise molecular transformations reveal interpretable &ldquo;strategies&rdquo; the network employs:</p>
<ol>
<li>
<p><strong>Nitrogen appendage</strong>: When optimizing for lower logP, the network repeatedly appends nitrogen atoms to the molecule. The authors observe this as a consistent pattern across multiple test molecules, reflecting the known relationship between nitrogen content and reduced lipophilicity.</p>
</li>
<li>
<p><strong>Length modulation</strong>: When optimizing for higher logP, the network tends to increase molecular chain length (e.g., extending a carbon chain). When optimizing for lower logP, it shortens chains. This captures the intuition that larger, more carbon-heavy molecules tend to be more lipophilic.</p>
</li>
<li>
<p><strong>Bond order changes</strong>: The network replaces single bonds with double or triple bonds during optimization, demonstrating an understanding of the relationship between bonding patterns and logP.</p>
</li>
<li>
<p><strong>Consistency across trials</strong>: Because the input initialization includes random noise, repeated trials with the same molecule produce different transformation sequences. Despite this stochasticity, the network applies consistent strategies across trials (e.g., always shortening chains for negative optimization), validating that it has learned genuine structure-property relationships.</p>
</li>
</ol>
<h3 id="thermodynamic-stability">Thermodynamic Stability</h3>
<p>The authors assess thermodynamic stability (as a rough proxy for synthesizability) by computing heats of formation using MOPAC2016 at the PM7 level of theory. Some optimization trajectories move toward thermodynamically stable molecules (negative heats of formation), while others produce less stable structures. The authors acknowledge this limitation and propose multi-objective optimization incorporating stability as a future direction.</p>
<h3 id="comparison-to-vaes">Comparison to VAEs</h3>
<p>The key distinction from VAEs is where gradient computation occurs. In VAEs, a latent space is learned through encoding and decoding, and property optimization happens in that latent space. In PASITHEA, gradients are computed directly with respect to the molecular representation (SELFIES one-hot encoding). The authors argue this makes the approach more interpretable, since we can probe what the network learned about molecular structure without the &ldquo;detour&rdquo; through a latent space.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors are forthright about the preliminary nature of these results:</p>
<ul>
<li>The method is demonstrated only on a small subset of QM9 with a single, computationally inexpensive property (logP).</li>
<li>The simple four-layer architecture may not scale to larger molecular spaces or more complex properties.</li>
<li>Generated molecules are not always thermodynamically stable, requiring additional optimization objectives.</li>
<li>The approach has not been benchmarked against established methods (VAEs, GANs, RL) on standard generative benchmarks.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>QM9 (random subset)</td>
          <td>10,000 molecules</td>
          <td>logP values computed via RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prediction training</strong>: 4-layer fully connected NN, 500 nodes/layer, ReLU activation, MSE loss, Adam optimizer, LR $1 \times 10^{-6}$, ~1,500 epochs, 85/15 train/test split</li>
<li><strong>Inverse training</strong>: Frozen weights, Adam optimizer, LR 0.01, noise upper-bound 0.9, logP targets of +6 and -6</li>
<li><strong>Heats of formation</strong>: MOPAC2016, PM7 level, geometry optimization with eigenvector following (EF)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture is a simple 4-layer MLP. No pre-trained weights are distributed, but the full code is available.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Novel molecules</td>
          <td>97.2%</td>
          <td>Generated molecules not in training set</td>
      </tr>
      <tr>
          <td>Max logP (target +6)</td>
          <td>4.2442</td>
          <td>Exceeds QM9 max of 3.0786</td>
      </tr>
      <tr>
          <td>Min logP (target -6)</td>
          <td>-2.452</td>
          <td>Below QM9 min of -2.1903</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Pasithea">Pasithea</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shen, C., Krenn, M., Eppel, S., &amp; Aspuru-Guzik, A. (2021). Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations. <em>Machine Learning: Science and Technology</em>, 2(3), 03LT02. <a href="https://doi.org/10.1088/2632-2153/ac09d6">https://doi.org/10.1088/2632-2153/ac09d6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2021deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Cynthia and Krenn, Mario and Eppel, Sagi and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{03LT02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/ac09d6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LMs Generate 3D Molecules from XYZ, CIF, PDB Files</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</guid><description>Transformer language models trained on XYZ, CIF, and PDB sequences generate valid 3D molecules, crystals, and protein binding sites.</description><content:encoded><![CDATA[<h2 id="language-models-as-3d-chemical-structure-generators">Language Models as 3D Chemical Structure Generators</h2>
<p>This is a <strong>Method</strong> paper that demonstrates transformer-based language models can generate molecules, crystalline materials, and protein binding sites directly in three dimensions by training on sequences derived from standard chemical file formats (XYZ, CIF, PDB). The key contribution is showing that unmodified autoregressive language models, using only next-token prediction, achieve performance comparable to domain-specific 3D generative models that incorporate SE(3) equivariance and other geometric inductive biases.</p>
<h2 id="beyond-graphs-and-strings-the-need-for-3d-chemical-generation">Beyond Graphs and Strings: The Need for 3D Chemical Generation</h2>
<p>Molecular design with deep learning has largely relied on two representation paradigms: molecular graphs (processed with graph neural networks) and linearized string representations like <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> and <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (processed with sequence models). Both approaches have proven effective for drug-like organic molecules, but they share a fundamental limitation: they cannot represent structures whose identity depends on 3D spatial arrangement.</p>
<p>Crystalline materials, for example, have periodic lattice structures that cannot be reduced to simple graphs. Protein binding sites are defined by the 3D arrangement of hundreds of atoms across multiple residues. For tasks like catalysis design or structure-based drug discovery, the geometric positions of atoms are essential information that graphs and strings discard entirely.</p>
<p>Existing 3D generative models address this gap but typically require specialized architectures with SE(3) equivariance to handle rotational and translational symmetries. This work asks whether the general-purpose sequence modeling capability of transformers is sufficient to learn 3D chemical structure distributions without any domain-specific architectural modifications.</p>
<h2 id="direct-tokenization-of-chemical-file-formats">Direct Tokenization of Chemical File Formats</h2>
<p>The core insight is straightforward: any 3D molecule, crystal, or biomolecule is already stored as text in standard file formats (<a href="https://en.wikipedia.org/wiki/XYZ_file_format">XYZ</a>, <a href="https://en.wikipedia.org/wiki/Crystallographic_Information_File">CIF</a>, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)">PDB</a>). These files encode atom types and their Cartesian coordinates as sequences of characters and numbers. Rather than designing specialized architectures for point cloud generation, the authors simply tokenize these files and train a standard GPT-style transformer to predict the next token.</p>
<p>A molecule with $n$ atoms is represented as:</p>
<p>$$
\mathcal{M} = (e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>where $e_i$ is the element type and $(x_i, y_i, z_i)$ are Cartesian coordinates. Crystals additionally include lattice parameters:</p>
<p>$$
\mathcal{C} = (\ell_a, \ell_b, \ell_c, \alpha, \beta, \gamma, e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>Protein binding sites use residue-atom indicators (e.g., HIS-C, CYS-N) instead of bare element symbols:</p>
<p>$$
\mathcal{P} = (a_1, x_1, y_1, z_1, \dots, a_n, x_n, y_n, z_n)
$$</p>
<p>The language model learns the joint distribution over the token sequence via the autoregressive factorization:</p>
<p>$$
p(t_1, \dots, t_m) = \prod_{i=1}^{m} p(t_i \mid t_1, \dots, t_{i-1})
$$</p>
<p>where $m$ is the length of the tokenized sequence (distinct from the atom count $n$).</p>
<p>Two tokenization strategies are explored:</p>
<ol>
<li><strong>Character-level (LM-CH)</strong>: Every character in the file is a token, including digits, minus signs, spaces, and newlines. This produces long sequences but uses a small vocabulary (~30 tokens).</li>
<li><strong>Atom+coordinate-level (LM-AC)</strong>: Each atom placement requires exactly 4 tokens: one element/residue token and three coordinate tokens (e.g., &lsquo;-1.98&rsquo;). The vocabulary is larger (~100-10K tokens) but sequences are shorter.</li>
</ol>
<p>Numerical precision is controlled by rounding coordinates to 1, 2, or 3 decimal places. Since the model lacks rotation and translation invariance, random rotation augmentation during training improves performance.</p>
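<p>A minimal sketch of the two tokenization schemes, assuming the molecule arrives as a list of <code>(element, x, y, z)</code> tuples; the exact token vocabulary and file formatting in the paper may differ:</p>

```python
def tokenize_atom_coord(atoms, decimals=2):
    """LM-AC: exactly 4 tokens per atom -- one element token + three coordinate tokens."""
    tokens = []
    for elem, x, y, z in atoms:
        tokens.append(elem)
        tokens.extend(f"{v:.{decimals}f}" for v in (x, y, z))
    return tokens

def tokenize_char(atoms, decimals=2):
    """LM-CH: every character of the serialized file is its own token."""
    lines = [f"{e} {x:.{decimals}f} {y:.{decimals}f} {z:.{decimals}f}"
             for e, x, y, z in atoms]
    return list("\n".join(lines))

mol = [("C", 0.0, 0.0, 0.0), ("O", 1.23, -0.45, 0.0)]
# tokenize_atom_coord(mol) -> ['C', '0.00', '0.00', '0.00', 'O', '1.23', '-0.45', '0.00']
```

<p>The trade-off is visible directly: the atom+coordinate scheme emits 4 tokens per atom from a larger vocabulary, while the character scheme emits one token per character from a vocabulary of a few dozen symbols.</p>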
<h2 id="experiments-across-molecules-crystals-and-protein-binding-sites">Experiments Across Molecules, Crystals, and Protein Binding Sites</h2>
<h3 id="molecular-generation-zinc">Molecular Generation (ZINC)</h3>
<p>The model is evaluated on 250K commercially available molecules from the ZINC dataset, with an average of 23 heavy atoms. XYZ files are generated using RDKit&rsquo;s conformer tools. Coordinates use 2 decimal places of precision. The authors generate 10K molecules and evaluate both 3D geometry quality and standard generative metrics.</p>
<p>For 3D geometry assessment, the root mean squared deviation (RMSD) between language-model-generated conformers and RDKit-generated conformers mostly falls between 1.0 and 2.0 &#8491;, with a heavy tail extending to 4.0 &#8491;.</p>
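<p>For matched atom orderings, RMSD reduces to the root mean squared per-atom displacement; this sketch omits the optimal rigid alignment that tools such as RDKit&rsquo;s <code>GetBestRMS</code> perform first:</p>

```python
import numpy as np

def rmsd(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """RMSD between two (n_atoms, 3) coordinate arrays with matched atom order."""
    diff = coords_a - coords_b
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

a = np.zeros((4, 3))
b = np.full((4, 3), 1.0)  # every atom displaced by (1, 1, 1)
# rmsd(a, b) == sqrt(3), i.e. about 1.732
```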
<p>Standard metrics include validity, uniqueness, novelty, and the earth mover&rsquo;s (Wasserstein, WA) distance between molecular property distributions (QED, SA score, molecular weight).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>3D</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>WA MW</th>
          <th>WA SA</th>
          <th>WA QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Train</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>0.816</td>
          <td>0.013</td>
          <td>0.002</td>
      </tr>
      <tr>
          <td>SM-LM</td>
          <td>No</td>
          <td>98.35</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.640</td>
          <td>0.049</td>
          <td>0.005</td>
      </tr>
      <tr>
          <td>SF-LM</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.772</td>
          <td>0.085</td>
          <td>0.006</td>
      </tr>
      <tr>
          <td>JTVAE</td>
          <td>No</td>
          <td>100.0</td>
          <td>98.56</td>
          <td>100.0</td>
          <td>22.63</td>
          <td>0.126</td>
          <td>0.023</td>
      </tr>
      <tr>
          <td>ENF</td>
          <td>Yes</td>
          <td>1.05</td>
          <td>96.37</td>
          <td>99.72</td>
          <td>168.5</td>
          <td>1.886</td>
          <td>0.160</td>
      </tr>
      <tr>
          <td>G-SchNet</td>
          <td>Yes</td>
          <td>1.20</td>
          <td>55.96</td>
          <td>98.33</td>
          <td>152.7</td>
          <td>1.126</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>Yes</td>
          <td>77.51</td>
          <td>96.40</td>
          <td>95.30</td>
          <td>101.2</td>
          <td>0.939</td>
          <td>0.093</td>
      </tr>
      <tr>
          <td>LM-CH</td>
          <td>Yes</td>
          <td>90.13</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.912</td>
          <td>2.608</td>
          <td>0.077</td>
      </tr>
      <tr>
          <td>LM-AC</td>
          <td>Yes</td>
          <td>98.51</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>1.811</td>
          <td>0.026</td>
          <td>0.004</td>
      </tr>
  </tbody>
</table>
<p>The atom+coordinate tokenization model (LM-AC) achieves 98.51% validity with 100% uniqueness and novelty. Its WA scores for molecular weight (1.811) and QED (0.004) are substantially better than all other 3D generative baselines and competitive with SMILES/SELFIES language models. The character-level model (LM-CH) at 90.13% validity performs comparably to graph-based models but falls short of the string-based language models.</p>
<h3 id="crystal-generation-perov-5-and-mp-20">Crystal Generation (Perov-5 and MP-20)</h3>
<p>Crystal generation uses CIF-derived sequences with 3 decimal places of precision. Two datasets are used: Perov-5 (18,928 <a href="https://en.wikipedia.org/wiki/Perovskite_(structure)">perovskite</a> materials, 5 atoms per unit cell, 56 elements) and MP-20 (45,231 diverse materials, 1-20 atoms per unit cell, 89 elements).</p>
<p>Evaluation metrics include structural validity (minimum interatomic distance &gt; 0.5 angstrom), compositional validity (charge neutrality via SMACT), coverage (recall and precision between generated and test sets), and earth mover&rsquo;s distance for density and number of unique elements.</p>
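<p>For intuition, the earth mover's (Wasserstein-1) distance between two equal-size 1D property samples reduces to the mean absolute difference of their sorted values. A minimal sketch with hypothetical density values (the numbers below are illustrative, not from the paper):</p>

```python
def emd_1d(u, v):
    """Earth mover's distance between two equal-size 1D samples:
    the mean absolute difference of the sorted values, which equals the
    Wasserstein-1 distance between their empirical distributions."""
    assert len(u) == len(v)
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

# Hypothetical densities (g/cm^3) for generated vs. held-out crystals.
generated = [2.1, 3.4, 5.0, 4.2, 2.8]
reference = [2.0, 3.5, 4.9, 4.0, 3.0]
wa_density = emd_1d(generated, reference)
```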
<table>
  <thead>
      <tr>
          <th>Data</th>
          <th>Model</th>
          <th>Struc. Valid (%)</th>
          <th>Comp. Valid (%)</th>
          <th>COV-R (%)</th>
          <th>COV-P (%)</th>
          <th>WA density</th>
          <th>WA elements</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perov-5</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>98.59</td>
          <td>99.45</td>
          <td>98.46</td>
          <td>0.126</td>
          <td>0.063</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-CH</td>
          <td>100.0</td>
          <td>98.51</td>
          <td>99.60</td>
          <td>99.42</td>
          <td>0.071</td>
          <td>0.036</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-AC</td>
          <td>100.0</td>
          <td>98.79</td>
          <td>98.78</td>
          <td>99.36</td>
          <td>0.089</td>
          <td>0.028</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>86.70</td>
          <td>99.15</td>
          <td>99.49</td>
          <td>0.688</td>
          <td>1.432</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-CH</td>
          <td>84.81</td>
          <td>83.55</td>
          <td>99.25</td>
          <td>97.89</td>
          <td>0.864</td>
          <td>0.132</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-AC</td>
          <td>95.81</td>
          <td>88.87</td>
          <td>99.60</td>
          <td>98.55</td>
          <td>0.696</td>
          <td>0.092</td>
      </tr>
  </tbody>
</table>
<p>On Perov-5, both language models outperform CDVAE across most metrics. On the more diverse MP-20 dataset, LM-AC achieves the best scores on 3 of 6 metrics and remains competitive on the others. LM-CH struggles more with structural validity on MP-20 (84.81%).</p>
<h3 id="protein-binding-site-generation-pdb">Protein Binding Site Generation (PDB)</h3>
<p>The most challenging task involves generating protein binding sites (~200-250 atoms each) from PDB-derived sequences. The dataset contains approximately 180K protein-ligand pairs. Residue-atom tokenization is used (e.g., CYS-C, CYS-N), with 2 decimal places of precision.</p>
<p>Validity is assessed per-residue using xyz2mol, with an additional check for inter-residue atomic overlap (atoms from different residues closer than the minimum bond distance). Approximately 99% of generated pockets pass the residue validity check, while about 5% fail the overlap check. Of generated pockets, 89.8% have unique residue orderings, and 83.6% have novel orderings not seen in training, indicating the model is generating novel binding site structures rather than memorizing.</p>
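<p>The inter-residue overlap check can be sketched as a pairwise distance scan that ignores atom pairs within the same residue; the 0.9 angstrom threshold below is an assumed placeholder for the minimum bond distance, not the paper's exact value:</p>

```python
import math

MIN_BOND_DIST = 0.9  # assumed minimum bond distance (angstrom), illustrative

def has_inter_residue_clash(atoms, min_dist=MIN_BOND_DIST):
    """atoms: list of (residue_id, (x, y, z)) tuples.
    Flags any pair of atoms from *different* residues that sit closer
    than min_dist; intra-residue geometry is left to the per-residue
    xyz2mol validity check."""
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            res_i, pos_i = atoms[i]
            res_j, pos_j = atoms[j]
            if res_i == res_j:
                continue
            if math.dist(pos_i, pos_j) < min_dist:
                return True
    return False
```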
<h2 id="competitive-3d-generation-without-geometric-inductive-biases">Competitive 3D Generation Without Geometric Inductive Biases</h2>
<p>The central finding is that standard transformer language models, without any equivariance or geometric inductive biases, can generate valid 3D chemical structures across three substantially different domains. The atom+coordinate tokenization (LM-AC) consistently outperforms character-level tokenization (LM-CH), likely because it produces shorter sequences and reduces the number of sequential decisions needed per atom placement.</p>
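<p>To make the tokenization contrast concrete, here is a minimal sketch (with hypothetical helper names) of how a single XYZ-file line might be split under the two schemes; the exact vocabularies in the paper differ, but the gap in sequence length is the point:</p>

```python
def tokenize_char(line: str) -> list[str]:
    """Character-level tokenization (LM-CH style): every non-space symbol,
    including digits, signs, and decimal points, is its own token."""
    return [c for c in line if not c.isspace()]

def tokenize_atom_coord(line: str) -> list[str]:
    """Atom+coordinate tokenization (LM-AC style): the element symbol and
    each complete coordinate string become single tokens."""
    return line.split()

xyz_line = "C -1.43 0.28 2.17"
char_tokens = tokenize_char(xyz_line)   # 14 tokens for this line
ac_tokens = tokenize_atom_coord(xyz_line)  # 4 tokens for this line
```

Placing one atom then costs 4 sequential decisions instead of 14, at the price of a much larger vocabulary (one token per distinct coordinate value).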
<p>Several limitations are worth noting. The model generates atoms using absolute Cartesian coordinates, which means it must learn rotation and translation invariance purely from data augmentation rather than having it built into the architecture. The authors acknowledge this becomes increasingly difficult as structure size grows. The vocabulary size also scales with coordinate precision and structure complexity, which could become prohibitive for very large systems.</p>
<p>The paper does not include computational cost comparisons with baseline models, making it difficult to assess the practical tradeoff between the simplicity of the language modeling approach and the efficiency of specialized architectures. The authors also note that further validation through computational simulation and experiment is needed to confirm the physical plausibility of generated structures.</p>
<p>Future directions identified include inverse design of molecules and materials conditioned on target properties, extension to more complex structures (metal-organic frameworks), and exploration of alternative tokenization strategies to handle larger systems.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC</td>
          <td>250K molecules</td>
          <td>~23 heavy atoms avg; XYZ files via RDKit conformer generation</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Perov-5</td>
          <td>18,928 perovskites</td>
          <td>5 atoms/unit cell, 56 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>MP-20</td>
          <td>45,231 materials</td>
          <td>1-20 atoms/unit cell, 89 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Protein binding sites</td>
          <td>~180K protein-ligand pairs</td>
          <td>Processed to 200-250 atoms per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-style transformer with ~1M to 100M parameters</li>
<li><strong>Layers</strong>: 12</li>
<li><strong>Embedding size</strong>: 128 to 1024</li>
<li><strong>Attention heads</strong>: 4 to 12</li>
<li><strong>Batch size</strong>: 4 to 32 structures</li>
<li><strong>Learning rate</strong>: $10^{-4}$ to $10^{-5}$, decayed to $9 \times 10^{-6}$</li>
<li><strong>Data augmentation</strong>: Random rotation of training structures at each epoch</li>
<li><strong>Numerical precision</strong>: 2 decimal places (molecules, proteins), 3 decimal places (crystals)</li>
</ul>
<h3 id="models">Models</h3>
<p>No pre-trained model weights are publicly available. The paper mentions &ldquo;Example code can be found at&rdquo; but the URL appears to be missing from the published version.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Molecules</td>
          <td>xyz2mol produces valid RDKit Mol object</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Crystals</td>
          <td>Structural (min distance &gt; 0.5 angstrom) and compositional (charge neutral)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>All</td>
          <td>Fraction of distinct generated structures</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>All</td>
          <td>Fraction not in training set</td>
      </tr>
      <tr>
          <td>Earth mover&rsquo;s distance</td>
          <td>All</td>
          <td>Distribution match for domain-specific properties</td>
      </tr>
      <tr>
          <td>RMSD</td>
          <td>Molecules</td>
          <td>Deviation from RDKit conformer geometries</td>
      </tr>
      <tr>
          <td>Coverage</td>
          <td>Crystals</td>
          <td>Recall and precision between generated and test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Models were trained on Compute Canada systems. Specific GPU types, counts, and training times are not reported.</p>

<h3 id="artifacts">Artifacts</h3>
<p>No public code repository, model weights, or datasets specific to this work were found. The ZINC, Perov-5, and MP-20 datasets used for evaluation are publicly available from their original sources.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D. &amp; Aspuru-Guzik, A. (2023). Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files. <em>arXiv preprint arXiv:2305.05708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2023language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can generate molecules, materials, and protein binding sites directly in three dimensions as {XYZ}, {CIF}, and {PDB} files}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2305.05708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Link-INVENT: RL-Driven Molecular Linker Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</guid><description>Link-INVENT extends REINVENT for molecular linker design using RNN-based generation and reinforcement learning with flexible multi-parameter scoring.</description><content:encoded><![CDATA[<h2 id="a-method-for-generative-linker-design-with-reinforcement-learning">A Method for Generative Linker Design with Reinforcement Learning</h2>
<p>Link-INVENT is a <strong>Method</strong> paper that introduces a generative model for molecular linker design built on the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo design platform. The primary contribution is an encoder-decoder recurrent neural network (RNN) architecture that generates SMILES-based linkers connecting two molecular subunits, combined with a flexible multi-parameter optimization (MPO) scoring function and reinforcement learning (RL) to steer generation toward desired properties. Link-INVENT targets three practical drug discovery tasks: fragment linking, scaffold hopping, and <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">proteolysis targeting chimera</a> (PROTAC) design.</p>
<h2 id="why-linker-design-needs-flexible-multi-parameter-optimization">Why Linker Design Needs Flexible Multi-Parameter Optimization</h2>
<p>Generating suitable chemical linkers between molecular subunits is a central challenge in <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-based drug discovery</a> (FBDD), scaffold hopping, and PROTAC design. Traditional computational approaches rely on database searches, inherently limiting the generalizability of proposed linkers to the pre-defined collection. Recent deep learning methods (DeLinker, SyntaLinker, 3DLinker, DiffLinker) can generate novel linkers but offer limited support for optimizing specific physicochemical properties. Users can typically control only linker length and a few properties like hydrogen-bond donor count.</p>
<p>The key gaps that Link-INVENT addresses are:</p>
<ol>
<li><strong>Conditioning on both subunits</strong>: Prior RNN-based approaches (SAMOA) generate linkers conditioned only on the SMILES sequence seen so far, which may not account for the second molecular subunit. Link-INVENT conditions on both warheads simultaneously.</li>
<li><strong>Flexible scoring</strong>: Existing DL-based linker design tools lack the ability to define tailored MPO objectives. Link-INVENT inherits <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 4&rsquo;s</a> full scoring infrastructure and adds linker-specific properties.</li>
<li><strong>Generalizability</strong>: A single trained prior handles fragment linking, scaffold hopping, and PROTAC tasks without retraining.</li>
</ol>
<h2 id="core-innovation-conditional-linker-generation-with-augmented-likelihood-rl">Core Innovation: Conditional Linker Generation with Augmented Likelihood RL</h2>
<p>Link-INVENT&rsquo;s architecture is an encoder-decoder RNN adapted from the Lib-INVENT library design model. The encoder processes a pair of warheads (molecular subunits with defined exit vectors), and the decoder generates a linker token by token, yielding a connected molecule in SMILES format. The model uses three hidden layers of 512 LSTM cells with an embedding size of 256.</p>
<h3 id="training">Training</h3>
<p>The prior is trained on ChEMBL v27 data processed through reaction-based slicing to generate (linker, warheads pair, full molecule) tuples. <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a> augments the training data at each epoch, improving chemical space generalizability. The prior is trained by maximizing the likelihood of generating a linker conditioned on the input warhead pair, with teacher forcing for stability.</p>
<h3 id="multi-parameter-optimization-via-rl">Multi-Parameter Optimization via RL</h3>
<p>The scoring function $S(x)$ is a weighted geometric mean of individual component scores:</p>
<p>$$
S(x) = \left(\prod_{i=1}^{n} C_{i}(x)^{w_{i}}\right)^{\frac{1}{\sum_{i=1}^{n} w_{i}}}
$$</p>
<p>where $x$ is a sampled linked molecule, $C_{i}(x)$ is the score for the $i$-th component, and $w_{i}$ is its weight.</p>
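<p>A minimal implementation of this weighted geometric mean, computed in log space for numerical stability (assuming all component scores are strictly positive; in REINVENT-style scoring each component is typically mapped into [0, 1]):</p>

```python
import math

def mpo_score(component_scores, weights):
    """Weighted geometric mean S(x) of per-component scores C_i(x) with
    weights w_i, as in the Link-INVENT scoring function. Assumes every
    component score is > 0 (a zero component zeroes the whole product)."""
    total_w = sum(weights)
    log_sum = sum(w * math.log(c) for c, w in zip(component_scores, weights))
    return math.exp(log_sum / total_w)
```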
<p>The agent (initialized as a copy of the prior) is updated via the Difference of Augmented and Posterior likelihoods (DAP) loss. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented log likelihood</a> is:</p>
<p>$$
\log \pi_{\text{augmented}} = \log \pi_{\text{prior}} + \sigma \cdot S(x)
$$</p>
<p>where $\pi$ denotes a policy (token sampling probabilities conditioned on the sequence so far) and $\sigma$ is a scalar factor. The loss function is:</p>
<p>$$
J(\theta) = \left(\log \pi_{\text{augmented}} - \log \pi_{\text{agent}}\right)^{2}
$$</p>
<p>Minimizing $J(\theta)$ steers the agent to generate molecules that satisfy the scoring function while remaining anchored to the prior&rsquo;s chemical space.</p>
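<p>The DAP update for a single sampled sequence can be sketched in a few lines; the default value of $\sigma$ below is an assumed placeholder, since the papers treat it as a tunable hyperparameter:</p>

```python
def dap_loss(log_p_prior, log_p_agent, score, sigma=120.0):
    """Difference of Augmented and Posterior likelihoods (DAP) loss for
    one sampled SMILES sequence: the squared gap between the augmented
    log-likelihood (prior anchored, reward shifted) and the agent's
    log-likelihood. sigma=120 is an illustrative default."""
    log_p_augmented = log_p_prior + sigma * score
    return (log_p_augmented - log_p_agent) ** 2
```

When the agent already assigns the augmented likelihood to a sequence, the loss is zero; otherwise the gradient pulls the agent toward high-scoring sequences that remain plausible under the prior.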
<h3 id="diversity-filters">Diversity Filters</h3>
<p>Link-INVENT uses Diversity Filters (DFs) to balance exploration and exploitation. Buckets of limited size track unique <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/">Bemis-Murcko scaffolds</a>. When a bucket is full, further sampling of that scaffold receives a score of zero, encouraging the agent to explore diverse chemical space regions.</p>
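<p>A minimal sketch of such a bucket-based filter (the bucket size of 25 follows the reproducibility details below; the class name and scaffold-string interface are illustrative):</p>

```python
from collections import defaultdict

class DiversityFilter:
    """Bucket-based diversity filter: once `bucket_size` molecules sharing
    a Bemis-Murcko scaffold have been scored, any further molecule with
    that scaffold receives a score of zero."""

    def __init__(self, bucket_size=25):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(int)  # scaffold -> count seen so far

    def filter_score(self, scaffold, raw_score):
        if self.buckets[scaffold] >= self.bucket_size:
            return 0.0  # bucket full: push the agent elsewhere
        self.buckets[scaffold] += 1
        return raw_score
```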
<h3 id="linker-specific-scoring-components">Linker-Specific Scoring Components</h3>
<p>New scoring components provide direct control over linker properties:</p>
<ul>
<li><strong>Linker effective length</strong>: number of bonds between attachment atoms</li>
<li><strong>Linker maximum graph length</strong>: bonds in the longest graph traversal path</li>
<li><strong>Linker length ratio</strong>: effective length divided by maximum graph length (controls branching)</li>
<li><strong>Linker ratio of rotatable bonds</strong>: rotatable bonds over total bonds (controls flexibility)</li>
<li><strong>Linker number of rings</strong>: controls linearity vs. cyclicity</li>
<li><strong>Linker number of HBDs</strong>: hydrogen-bond donors in the linker itself</li>
</ul>
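<p>The graph-based length components reduce to shortest-path computations on the linker's bond graph. A self-contained sketch using breadth-first search on a toy adjacency list (interpreting &ldquo;maximum graph length&rdquo; as the graph diameter over shortest paths, which is an assumption):</p>

```python
from collections import deque

def shortest_path_bonds(adjacency, start, end):
    """Number of bonds on the shortest path between two atoms.
    adjacency: atom index -> list of bonded atom indices."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom == end:
            return dist
        for nbr in adjacency[atom]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # disconnected

# Toy branched linker: backbone 0-1-2-3 with a branch 1-4-5.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1, 5], 5: [4]}
effective = shortest_path_bonds(adj, 0, 3)      # attachment atoms 0 and 3
maximum = max(shortest_path_bonds(adj, a, b)    # longest path in the graph
              for a in adj for b in adj)
length_ratio = effective / maximum              # branching penalizes this ratio
```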
<h2 id="experimental-evaluation-across-three-drug-discovery-tasks">Experimental Evaluation Across Three Drug Discovery Tasks</h2>
<p>Link-INVENT was evaluated through four experiments across three drug discovery applications, all using the same pre-trained prior.</p>
<h3 id="illustrative-example-two-benzene-rings">Illustrative Example: Two Benzene Rings</h3>
<p>A simple experiment linked two benzene rings with the objectives of limiting HBDs and requiring exactly one ring in the linker. Over 20 epochs, the agent learned to satisfy both objectives, demonstrating the basic RL-guided generation process.</p>
<h3 id="experiment-1a-fragment-linking-ck2-alpha-inhibitors">Experiment 1a: Fragment Linking (CK2 alpha Inhibitors)</h3>
<p>Based on the <a href="https://en.wikipedia.org/wiki/Casein_kinase_2">casein kinase 2</a> (CK2 alpha) fragment linking campaign by Fusco and Brear et al., Link-INVENT was tasked with linking two fragment hits while retaining the Lys68 hydrogen-bond interaction via a DockStream docking constraint (Glide/LigPrep backend). The scoring function also enforced linker length ratio &gt;= 70 and linker MW &lt;= 200 Da.</p>
<p>Over 100 epochs in triplicate, the agent generated molecules with gradually improving docking scores. Key results:</p>
<ul>
<li>Docking score distributions across triplicates were nearly identical, demonstrating reproducibility</li>
<li>Some generated molecules achieved more favorable docking scores than the reference ligand CAM4066 (-15.20 kcal/mol)</li>
<li>More than 5000 unique Bemis-Murcko scaffolds were generated, with minimal overlap across replicates</li>
<li>Binding pose analysis showed the generated linker closely resembled the ground-truth linker, retaining the Lys68 interaction</li>
</ul>
<h3 id="experiment-1b-comparison-fragment-linking-impdh-inhibitors">Experiment 1b: Comparison Fragment Linking (IMPDH Inhibitors)</h3>
<p>Using the IMPDH inhibitor fragment linking case study from Trapero et al., this experiment applied core constrained docking (fragment pose within 0.3 A of reference) and compared results to DeLinker and SyntaLinker. The scoring function enforced linker effective length in [3, 5], length ratio &gt;= 70, and linker MW &lt;= 150 Da.</p>
<p>Link-INVENT generated 8960 SMILES across 70 epochs (comparable to DeLinker&rsquo;s 9000 molecular graphs). Results:</p>
<ul>
<li>Link-INVENT generated molecules with more favorable docking scores than the reference ligand across triplicate runs</li>
<li>Of the 20 DeLinker example molecules, none docked as well as or better than the reference; of the 3 SyntaLinker examples, only one (the recovered reference itself) did</li>
<li>Approximately 3000 unique Bemis-Murcko scaffolds were generated from 5000 total molecules</li>
<li>Link-INVENT&rsquo;s advantage comes from including docking explicitly as a learning objective rather than applying it post hoc</li>
</ul>
<h3 id="experiment-2-scaffold-hopping-dlk-inhibitor-cns-optimization">Experiment 2: Scaffold Hopping (DLK Inhibitor CNS Optimization)</h3>
<p>Based on Patel et al.&rsquo;s <a href="https://en.wikipedia.org/wiki/Dual_leucine_zipper_kinase">dual leucine zipper kinase</a> (DLK) inhibitor campaign, Link-INVENT generated new scaffold ideas to improve CNS penetration while retaining potency. The scoring function included a Cys193 docking constraint plus CNS-compatible properties (HBDs &lt; 2, tPSA &lt;= 90 A squared, 3 &lt;= SlogP &lt;= 4, MW &lt;= 450 Da, 1-2 aromatic rings in linker).</p>
<p>The solution space was significantly narrower than fragment linking. The agent still generated diverse scaffolds with favorable docking scores, though fewer exceeded the reference ligand&rsquo;s score. Binding pose analysis confirmed retained Cys193 interactions and predicted additional Gln195 hydrogen bonds.</p>
<h3 id="experiment-3-protac-design-bcl-2mcl-1-dual-degradation">Experiment 3: PROTAC Design (Bcl-2/Mcl-1 Dual Degradation)</h3>
<p>Three sub-experiments demonstrated linker-specific scoring components for PROTAC design based on Wang et al.&rsquo;s Bcl-2/Mcl-1 dual degradation strategy:</p>
<table>
  <thead>
      <tr>
          <th>Sub-Experiment</th>
          <th>Objective</th>
          <th>Key Finding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sub-Exp 1: Linker length</td>
          <td>Generate linkers within specified length intervals [4,6], [7,9], [10,12], [13,15]</td>
          <td>Clear enrichment within target intervals vs. baseline broad distribution</td>
      </tr>
      <tr>
          <td>Sub-Exp 2: Linearity</td>
          <td>Control linear vs. cyclic linkers at fixed length [7,9]</td>
          <td>Baseline ratio ~1:2 linear:cyclic; enforcing linearity or cyclicity achieved strong enrichment</td>
      </tr>
      <tr>
          <td>Sub-Exp 3: Flexibility</td>
          <td>Generate linkers with Low [0,30], Moderate [40,60], or High [70,100] rotatable bond ratios</td>
          <td>Agent learned that rings and sp2 atoms yield rigidity; linear sp3 chains yield flexibility</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-practical-implications-for-drug-discovery">Key Findings and Practical Implications for Drug Discovery</h2>
<p>Link-INVENT demonstrates several practical advantages for molecular linker design:</p>
<ol>
<li><strong>Single prior, multiple tasks</strong>: The same pre-trained model handles fragment linking, scaffold hopping, and PROTAC design without retraining.</li>
<li><strong>Docking as a learning signal</strong>: Including molecular docking explicitly in the scoring function (via DockStream) during RL yields molecules with more favorable docking scores than approaches that apply docking post hoc.</li>
<li><strong>Implicit 3D awareness</strong>: The docking constraint guides the agent toward 3D structural awareness without explicit 3D coordinate inputs, as demonstrated by the overlap between generated and reference binding poses.</li>
<li><strong>Diverse and reproducible output</strong>: Diversity filters ensure exploration of multiple chemical space regions, and triplicate experiments show consistent docking score distributions with minimal scaffold overlap.</li>
</ol>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The linker flexibility metric (ratio of rotatable bonds) is agnostic to intra-molecular hydrogen bonds and does not account for all rigidity factors</li>
<li>Molecular docking is an approximation that can be exploited (e.g., excessive HBDs achieving favorable scores at the expense of permeability)</li>
<li>Experiments 1a and 1b require a proprietary Schrodinger license for Glide/LigPrep docking</li>
<li>No direct experimental (wet-lab) validation was performed in this study</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL v27 (reaction-sliced)</td>
          <td>Not specified</td>
          <td>Filtered for drug-like compounds, then reaction-based slicing with SMIRKS</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Held-out Bemis-Murcko scaffolds</td>
          <td>287 scaffolds</td>
          <td>Held out from training set</td>
      </tr>
      <tr>
          <td>SMILES augmentation</td>
          <td>Randomized SMILES per epoch</td>
          <td>Same tuples, different representations</td>
          <td>Improves generalizability</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder RNN with 3 hidden layers of 512 LSTM cells, embedding size 256</li>
<li><strong>RL loss</strong>: DAP (Difference of Augmented and Posterior likelihoods)</li>
<li><strong>Batch size</strong>: 128 molecules per epoch</li>
<li><strong>Diversity filter</strong>: Bemis-Murcko scaffold buckets of size 25</li>
<li><strong>Score threshold</strong>: 0 (to store all molecules for analysis)</li>
<li><strong>Scoring function</strong>: Weighted geometric mean of component scores</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Single pre-trained prior used across all experiments</li>
<li>Agent initialized as copy of prior, updated via RL</li>
<li>Pre-trained prior available in the ReinventCommunity GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecular docking via DockStream with Glide/LigPrep backend</li>
<li>Triplicate runs for all experiments</li>
<li>Metrics: docking scores, unique Bemis-Murcko scaffold counts, binding pose overlap</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT (Link-INVENT code)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Main codebase for Link-INVENT</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity">ReinventCommunity (data + tutorial)</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training/validation data, reaction SMIRKS, pre-trained prior, Jupyter tutorial</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code, training data, and pre-trained prior are publicly available. However, reproducing the docking-based experiments (1a, 1b, and 2) requires a proprietary Schrodinger license for Glide and LigPrep. The PROTAC experiments (Experiment 3) that use only physicochemical scoring are fully reproducible with the open-source code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Knuth, F., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2023). Link-INVENT: generative linker design with reinforcement learning. <em>Digital Discovery</em>, 2, 392-408. <a href="https://doi.org/10.1039/D2DD00115B">https://doi.org/10.1039/D2DD00115B</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2023link,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Link-INVENT: generative linker design with reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Knuth, Franziska and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--408}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D2DD00115B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lingo3DMol: Language Model for 3D Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</guid><description>Lingo3DMol combines language models with geometric deep learning for structure-based 3D molecule generation using a fragment-based SMILES representation.</description><content:encoded><![CDATA[<h2 id="a-language-model-approach-to-structure-based-drug-design">A Language Model Approach to Structure-Based Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces Lingo3DMol, a pocket-based 3D molecule generation model combining transformer language models with geometric deep learning. The primary contribution is threefold: (1) a new molecular representation called FSMILES (fragment-based SMILES) that encodes both 2D topology and 3D spatial coordinates, (2) a dual-decoder architecture that jointly predicts molecular topology and atomic positions, and (3) an auxiliary non-covalent interaction (NCI) predictor that guides molecule generation toward favorable binding modes.</p>
<h2 id="limitations-of-existing-3d-molecular-generative-models">Limitations of Existing 3D Molecular Generative Models</h2>
<p>Existing approaches to structure-based drug design fall into two categories, each with notable limitations. Graph-based autoregressive methods (e.g., Pocket2Mol) represent molecules as 3D graphs and use GNNs for generation, but frequently produce non-drug-like structures: large rings (seven or more atoms), honeycomb-like ring arrays, and molecules with either too many or too few rings. The autoregressive sampling process tends to get stuck in local optima early in generation and accumulates errors at each step. Diffusion-based methods (e.g., TargetDiff) avoid autoregressive generation but still produce a notable proportion of undesirable structures due to weak perception of molecular topology, since they do not directly encode or predict bonds. Both approaches struggle with metrics like QED (quantitative estimate of drug-likeness) and SAS (synthetic accessibility score), and neither reliably reproduces known active compounds when evaluated on protein pockets.</p>
<h2 id="fsmiles-fragment-based-smiles-with-dual-coordinate-systems">FSMILES: Fragment-Based SMILES with Dual Coordinate Systems</h2>
<p>The core innovation of Lingo3DMol is a new molecular sequence representation called FSMILES that addresses the topology problem inherent in atom-by-atom generation. FSMILES reorganizes a molecule into fragments using a ring-first, depth-first traversal. Each fragment is represented using standard SMILES syntax, and the full molecule is assembled by combining fragments with a specific connection syntax. Ring size information is encoded directly in atom tokens (e.g., <code>C_6</code> for a carbon in a six-membered ring), providing the autoregressive decoder with critical context about local topology before it needs to close the ring.</p>
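<p>A toy illustration of the ring-size-annotated atom tokens (ring membership is supplied directly here; in practice it would come from a ring-perception step such as RDKit's, and the full FSMILES grammar also covers fragment connection syntax):</p>

```python
def fsmiles_atom_tokens(atoms, ring_sizes):
    """Annotate each atom symbol with the size of the ring it belongs to
    (0 for acyclic atoms), mimicking FSMILES tokens such as C_6.
    atoms: list of element symbols; ring_sizes: parallel list of ints."""
    return [f"{sym}_{size}" for sym, size in zip(atoms, ring_sizes)]

# Toluene sketch: six aromatic ring carbons plus one acyclic methyl carbon.
tokens = fsmiles_atom_tokens(["C"] * 7, [6, 6, 6, 6, 6, 6, 0])
```

Because the decoder sees <code>C_6</code> rather than bare <code>C</code>, it knows the intended ring size well before it has to emit the ring-closure token.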
<p>The model integrates two coordinate systems. Local spherical coordinates encode bond length ($r$), bond angle ($\theta$), and dihedral angle ($\phi$) relative to three reference atoms (root1, root2, root3). These are predicted using separate MLP heads:</p>
<p>$$r = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_1\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}\right]\right)\right)\right)$$</p>
<p>$$\theta = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_2\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}\right]\right)\right)\right)$$</p>
<p>$$\phi = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_3\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}, h_{\text{root3}}\right]\right)\right)\right)$$</p>
<p>Global Euclidean coordinates ($x, y, z$) are predicted by a separate 3D decoder ($D_{\text{3D}}$). During inference, the model defines a search space around the predicted local coordinates ($r \pm 0.1$ Å, $\theta \pm 2°$, $\phi \pm 2°$) and selects the global position with the highest joint probability within that space. This fusion strategy exploits the rigidity of bond lengths and angles (which makes local prediction easier) while maintaining global spatial awareness.</p>
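<p>The local-to-global conversion these predictions imply can be illustrated with the standard natural-extension-reference-frame (NeRF) construction. This is a generic sketch, assuming $\theta$ is the bond angle at root1 and $\phi$ the dihedral about the root2-root1 axis; it is not the paper's exact geometry code:</p>

```python
import math

def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def unit(u):
    n = math.sqrt(sum(a * a for a in u))
    return tuple(a / n for a in u)

def place_atom(root3, root2, root1, r, theta, phi):
    """Place a new atom given bond length r (to root1), bond angle theta
    (root2-root1-new), and dihedral phi (root3-root2-root1-new)."""
    bc = unit(sub(root1, root2))                 # root2 -> root1 axis
    n = unit(cross(sub(root2, root3), bc))       # normal of the reference plane
    m = cross(n, bc)                             # completes the local frame
    d = (-r * math.cos(theta),
         r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi))
    return tuple(root1[i] + d[0]*bc[i] + d[1]*m[i] + d[2]*n[i]
                 for i in range(3))

# Place an atom 1.5 Å from root1 at a 109.5° bond angle, 0° dihedral:
new = place_atom((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                 1.5, math.radians(109.5), 0.0)
```

<p>Because the local frame is orthonormal, the bond length and angle are reproduced exactly; the model's search over $r \pm 0.1$ Å, $\theta \pm 2°$, $\phi \pm 2°$ amounts to evaluating this construction over a small grid of local coordinates.</p>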
<h3 id="ncianchor-prediction-model">NCI/Anchor Prediction Model</h3>
<p>A separately trained NCI/anchor prediction model identifies potential non-covalent interaction sites and anchor points in the protein pocket. This model shares the transformer architecture of the generation model and is initialized from pretrained parameters. It predicts whether each pocket atom will form hydrogen bonds, <a href="https://en.wikipedia.org/wiki/Halogen_bond">halogen bonds</a>, salt bridges, or <a href="https://en.wikipedia.org/wiki/Pi_stacking">pi-pi stacking</a> interactions with the ligand, and whether it lies within 4 Å of any ligand atom (anchor points). The predicted NCI sites serve two purposes: they are incorporated as input features to the encoder, and they provide starting positions for molecule generation (the first atom is placed within 4.5 Å of a sampled NCI site).</p>
<h3 id="pretraining-and-architecture">Pretraining and Architecture</h3>
<p>The model uses a denoising pretraining strategy inspired by BART. During pretraining on 12 million drug-like molecules, the model receives perturbed molecules (with 25% of atoms deleted, coordinates perturbed by $\pm 0.5$ Å, and 25% of carbon element types corrupted) and learns to reconstruct the original structure. The architecture is transformer-based with graph structural information encoded through distance and edge vector bias terms in the attention mechanism:</p>
<p>$$A_{\text{biased}} = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_D + B_J\right)V$$</p>
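<p>A minimal NumPy sketch of this biased attention, with random matrices standing in for the distance bias $B_D$ and edge-vector bias $B_J$ (the shapes here are assumptions for illustration):</p>

```python
import numpy as np

def biased_attention(Q, K, V, B_D, B_J):
    """Scaled dot-product attention with additive structural bias terms.
    Q, K, V: (n, d_k) per-atom projections; B_D, B_J: (n, n) pairwise biases."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + B_D + B_J
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
n, d = 4, 8
out = biased_attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)),
                       rng.normal(size=(n, n)), rng.normal(size=(n, n)))
```

<p>Since the biases enter before the softmax, spatially close atom pairs (via $B_D$) and bonded pairs (via $B_J$) can be given systematically higher attention weight without changing the attention mechanism itself.</p>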
<p>The overall loss combines FSMILES token prediction, absolute coordinate prediction, and local coordinate predictions ($r$, $\theta$, $\phi$) with their auxiliary counterparts:</p>
<p>$$L = L_{\text{FSMILES}} + L_{\text{abs-coord}} + L_r + L_\theta + L_\phi + L_{r,\text{aux}} + L_{\theta,\text{aux}} + L_{\phi,\text{aux}}$$</p>
<p>Fine-tuning is performed on 11,800 protein-ligand complex samples from PDBbind 2020, with the first three encoder layers frozen to prevent overfitting.</p>
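<p>The denoising corruption used in pretraining can be mocked up as follows. This is a toy reimplementation of the perturbations described above (deletion, coordinate jitter, element corruption), not the authors' code, and the replacement-element vocabulary is a placeholder:</p>

```python
import random

def corrupt(atoms, coords, rng, element_vocab=("N", "O", "S", "F")):
    """Apply the denoising-pretraining perturbations described in the text:
    drop 25% of atoms, jitter each coordinate by up to +/-0.5 Angstrom, and
    relabel 25% of carbons with a random other element."""
    kept = [i for i in range(len(atoms)) if rng.random() > 0.25]
    new_atoms, new_coords = [], []
    for i in kept:
        sym = atoms[i]
        if sym == "C" and rng.random() < 0.25:
            sym = rng.choice(element_vocab)  # corrupt the carbon element type
        jitter = tuple(c + rng.uniform(-0.5, 0.5) for c in coords[i])
        new_atoms.append(sym)
        new_coords.append(jitter)
    return new_atoms, new_coords

rng = random.Random(0)
atoms = ["C"] * 8
coords = [(float(i), 0.0, 0.0) for i in range(8)]
noisy_atoms, noisy_coords = corrupt(atoms, coords, rng)
```

<p>The pretraining objective is then to reconstruct the clean <code>atoms</code>/<code>coords</code> from the corrupted pair, which forces the model to internalize both valence patterns and realistic geometry.</p>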
<h2 id="evaluation-on-dud-e-with-drug-likeness-filtering">Evaluation on DUD-E with Drug-Likeness Filtering</h2>
<p>The evaluation uses the DUD-E dataset (101 targets, 20,000+ active compounds), comparing Lingo3DMol against Pocket2Mol and TargetDiff. A key methodological contribution is the emphasis on filtering generated molecules for drug-likeness (QED &gt;= 0.3 and SAS &lt;= 5) before evaluating binding metrics, as the authors demonstrate that molecules with good docking scores can still be poor drug candidates.</p>
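<p>The filter itself is a simple threshold on two precomputed scores. In practice QED comes from RDKit's <code>QED</code> module and SAS from the RDKit contrib <code>sascorer</code> script; the score values below are made up for illustration:</p>

```python
def is_drug_like(qed, sas, qed_min=0.3, sas_max=5.0):
    """Drug-likeness filter applied before computing binding metrics."""
    return qed >= qed_min and sas <= sas_max

# (QED, SAS) pairs, e.g. precomputed with RDKit's QED module and the
# contrib sascorer:
candidates = [(0.65, 2.8), (0.25, 3.1), (0.45, 6.2), (0.59, 3.1)]
kept = [c for c in candidates if is_drug_like(*c)]
print(len(kept))  # 2
```

<p>Only molecules passing this gate enter the docking and similarity comparisons, which is why the "% drug-like" row in Table 1 matters so much for the downstream metrics.</p>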
<p><strong>Molecular properties and binding mode (Table 1, drug-like molecules only):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Pocket2Mol</th>
          <th>TargetDiff</th>
          <th>Lingo3DMol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (% of total)</td>
          <td>61%</td>
          <td>49%</td>
          <td><strong>82%</strong></td>
      </tr>
      <tr>
          <td>Mean QED</td>
          <td>0.56</td>
          <td>0.60</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Mean SAS</td>
          <td>3.5</td>
          <td>4.0</td>
          <td><strong>3.1</strong></td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% of targets)</td>
          <td>8%</td>
          <td>3%</td>
          <td><strong>33%</strong></td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td>-6.7</td>
          <td>-6.2</td>
          <td><strong>-6.8</strong></td>
      </tr>
      <tr>
          <td>Mean GlideSP redocking</td>
          <td>-7.5</td>
          <td>-7.0</td>
          <td><strong>-7.8</strong></td>
      </tr>
      <tr>
          <td>Mean RMSD vs. low-energy conformer (Å)</td>
          <td>1.1</td>
          <td>1.1</td>
          <td><strong>0.9</strong></td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.84</td>
          <td><strong>0.88</strong></td>
          <td>0.82</td>
      </tr>
  </tbody>
</table>
<p>Lingo3DMol generates substantially more drug-like molecules (82% vs. 61% and 49%) and finds similar-to-active compounds for 33% of targets compared to 8% (Pocket2Mol) and 3% (TargetDiff). The model also achieves the best min-in-place GlideSP scores and lowest RMSD versus low-energy conformers, indicating higher quality binding poses and more realistic 3D geometries.</p>
<p><strong>Molecular geometry:</strong> Lingo3DMol demonstrated the lowest Jensen-Shannon divergence for all atom-atom distance distributions and produced significantly fewer molecules with large rings (0.23% with 7-membered rings vs. 2.59% for Pocket2Mol and 11.70% for TargetDiff).</p>
<p><strong>Information leakage analysis:</strong> The authors controlled for information leakage by excluding proteins with &gt;30% sequence identity to DUD-E targets from training. When DUD-E targets were stratified by sequence identity to Pocket2Mol&rsquo;s training set, Lingo3DMol&rsquo;s advantage widened as leakage decreased, suggesting the performance gap is genuine rather than an artifact of training overlap.</p>
<p><strong>Ablation studies (Table 2):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard</th>
          <th>Random NCI</th>
          <th>No Pretraining</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like (%)</td>
          <td><strong>82%</strong></td>
          <td>47%</td>
          <td>71%</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5</td>
          <td><strong>33%</strong></td>
          <td>6%</td>
          <td>3%</td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td><strong>-6.8</strong></td>
          <td>-5.8</td>
          <td>-4.9</td>
      </tr>
      <tr>
          <td>Dice score</td>
          <td><strong>0.25</strong></td>
          <td>0.15</td>
          <td>0.13</td>
      </tr>
  </tbody>
</table>
<p>Both pretraining and the NCI predictor are essential. Removing pretraining reduces the number of valid molecules and binding quality. Replacing the trained NCI predictor with random NCI site selection severely degrades drug-likeness and the ability to generate active-like compounds.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>Lingo3DMol demonstrates that combining language model sequence generation with geometric deep learning can produce drug-like 3D molecules that outperform graph-based and diffusion-based alternatives in binding mode quality, drug-likeness, and similarity to known actives. The FSMILES representation successfully constrains generated molecules to realistic topologies by encoding ring size information and using fragment-level generation.</p>
<p>Several limitations are acknowledged. Capturing all non-covalent interactions within a single molecule remains difficult with autoregressive generation. The model does not enforce equivariance (SE(3) invariance is approximated via rotation/translation augmentation and invariant features rather than built into the architecture). The pretraining dataset is partially proprietary (12M molecules from a commercial library, of which 1.4M molecules from public sources are shared). Diversity of generated drug-like molecules is slightly lower than that of the baselines, though the authors argue that the baselines' extra diversity explores chemical space away from known active regions. A comprehensive evaluation of drug-like properties beyond QED and SAS metrics is identified as an important next step.</p>
<p>Future directions include investigating electron density representations for molecular interactions, incorporating SE(3) equivariant architectures (e.g., GVP, Vector Neurons), and developing more systematic drug-likeness evaluation frameworks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>In-house commercial library</td>
          <td>12M molecules (1.4M public)</td>
          <td>Filtered for drug-likeness; conformers via ConfGen</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PDBbind 2020 (general set)</td>
          <td>11,800 samples (8,201 PDB IDs)</td>
          <td>Filtered for &lt;30% sequence identity to DUD-E targets</td>
      </tr>
      <tr>
          <td>NCI labels</td>
          <td>PDBbind 2020</td>
          <td>Same as fine-tuning</td>
          <td>Labeled using ODDT for H-bonds, halogen bonds, salt bridges, pi-pi stacking</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>DUD-E</td>
          <td>101 targets, 20,000+ active compounds</td>
          <td>Standard benchmark for structure-based drug design</td>
      </tr>
      <tr>
          <td>Geometry evaluation</td>
          <td>CrossDocked2020</td>
          <td>100 targets</td>
          <td>Used for bond length and atom distance distribution comparisons</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer-based encoder-decoder with graph structural bias terms (distance matrix $B_D$, edge vector matrix $B_J$)</li>
<li>Denoising pretraining: 25% atom deletion, coordinate perturbation ($\pm 0.5$ Å), 25% carbon element type corruption</li>
<li>Depth-first search sampling with reward function combining model confidence and anchor fulfillment</li>
<li>Fine-tuning: first three encoder layers frozen</li>
<li>Local-global coordinate fusion during inference with search space: $r \pm 0.1$ Å, $\theta \pm 2°$, $\phi \pm 2°$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Generation model: transformer encoder-decoder with dual decoders ($D_{\text{2D}}$ for topology, $D_{\text{3D}}$ for global coordinates)</li>
<li>NCI/anchor prediction model: same architecture, initialized from pretrained parameters</li>
<li>Pretrained, fine-tuned, and NCI model checkpoints available on GitHub and figshare</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Lingo3DMol</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (%)</td>
          <td>82%</td>
          <td>61% (P2M)</td>
          <td>QED &gt;= 0.3, SAS &lt;= 5</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% targets)</td>
          <td>33%</td>
          <td>8% (P2M)</td>
          <td>Tanimoto similarity to known actives</td>
      </tr>
      <tr>
          <td>Min-in-place GlideSP</td>
          <td>-6.8</td>
          <td>-6.7 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>GlideSP redocking</td>
          <td>-7.8</td>
          <td>-7.5 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>RMSD vs. low-energy conformer</td>
          <td>0.9 Å</td>
          <td>1.1 Å (both)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Generation speed (100 mol)</td>
          <td>874 +/- 401 s</td>
          <td>962 +/- 622 s (P2M)</td>
          <td>NVIDIA Tesla V100</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference benchmarked on NVIDIA Tesla V100 GPUs</li>
<li>Generation of 100 valid molecules per target: 874 +/- 401 seconds</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/stonewiseAIDrugDesign/Lingo3DMol">Lingo3DMol</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Inference code and model architecture</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/software/Code_for_Lingo3DMo/24633084">Model checkpoints</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and NCI checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Data_for_Lingo3DMol/24550351">Training data</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Partial pretraining data (1.4M public molecules), fine-tuning complexes, evaluation molecules</td>
      </tr>
      <tr>
          <td><a href="https://sw3dmg.stonewise.cn">Online service</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Web interface for molecule generation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., Bai, R., Wang, H., Zhou, J., Peng, W., Huang, B., &amp; Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. <em>Nature Machine Intelligence</em>, 6(1), 62-73. <a href="https://doi.org/10.1038/s42256-023-00775-6">https://doi.org/10.1038/s42256-023-00775-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{feng2024generation,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generation of 3D molecules in pockets via a language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Feng, Wei and Wang, Lvwei and Lin, Zaiyun and Zhu, Yanhao and Wang, Han and Dong, Jianqiang and Bai, Rong and Wang, Huting and Zhou, Jielong and Peng, Wei and Huang, Bo and Zhou, Wenbiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{62--73}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00775-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Evolutionary Molecular Design via Deep Learning + GA</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</guid><description>Kwon et al. combine an RNN decoder for SMILES reconstruction with a genetic algorithm operating on ECFP fingerprints for goal-directed molecular design.</description><content:encoded><![CDATA[<h2 id="fingerprint-based-evolutionary-molecular-design">Fingerprint-Based Evolutionary Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces an evolutionary design methodology (EDM) for goal-directed molecular optimization. The primary contribution is a four-stage framework in which (1) molecules are encoded as <a href="https://en.wikipedia.org/wiki/Extended-connectivity_fingerprint">extended-connectivity fingerprint</a> (ECFP) vectors, (2) a genetic algorithm evolves these fingerprint vectors through mutation and crossover, (3) a recurrent neural network (RNN) decodes the evolved fingerprints back into valid SMILES strings, and (4) a deep neural network (DNN) evaluates molecular fitness. The key advantage over prior evolutionary approaches is that no hand-crafted chemical rules or fragment libraries are needed, as the RNN learns valid molecular reconstruction from data.</p>
<h2 id="challenges-in-evolutionary-molecular-optimization">Challenges in Evolutionary Molecular Optimization</h2>
<p>Evolutionary algorithms for molecular design face two core challenges. First, maintaining chemical validity of evolved molecules is difficult when operating on graph or string representations directly. Prior methods rely on predefined chemical rules and fragment libraries to constrain structural modifications (atom/bond additions, deletions, substitutions), but these introduce bias and risk convergence to local optima. Each new application domain requires specifying new chemical rules, which may not exist for emerging areas. Second, fitness evaluation must be both efficient and accurate. Simple evaluation methods like structural similarity indices or semi-empirical quantum chemistry calculations reduce computational cost but may not capture complex property relationships.</p>
<p>High-throughput computational screening (HTCS) is a common alternative, but it depends on the quality of predefined virtual chemical libraries and often requires multiple iterative enumerations, limiting its ability to explore novel chemical space.</p>
<h2 id="core-innovation-evolving-fingerprints-with-neural-decoding">Core Innovation: Evolving Fingerprints with Neural Decoding</h2>
<p>The key insight is to perform genetic operations in fingerprint space rather than in molecular graph or SMILES string space. The framework comprises three core functions, of which the decoder and property predictor are learned (the encoder is a fixed hash):</p>
<p><strong>Encoding function</strong> $e(\cdot)$: Converts a SMILES string $\mathbf{m}$ into a 5000-dimensional ECFP vector $\mathbf{x}$ using Morgan fingerprints with a neighborhood radius of 6. This is a deterministic hash-based encoding (not learned).</p>
<p><strong>Decoding function</strong> $d(\cdot)$: An RNN with three hidden layers of 500 LSTM units that reconstructs a SMILES string from an ECFP vector. The RNN generates SMILES as a sequence of three-character substrings, conditioning each prediction on the current substring and the input ECFP vector:</p>
<p>$$d(\mathbf{x}) = \mathbf{m}, \quad \mathbf{m}_{t+1} \sim p(\mathbf{m}_{t+1} \mid \mathbf{m}_{t}, \mathbf{x})$$</p>
<p>The three-character substring approach reduces the proportion of invalid SMILES by imposing additional constraints on subsequent characters.</p>
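<p>A minimal sketch of the fixed-length substring decomposition (one plausible reading of the paper's three-character scheme; the exact token inventory is not specified in the text):</p>

```python
def to_substrings(smiles, k=3):
    """Split a SMILES string into fixed-length substrings (space-padded),
    the decoding unit the RNN emits one step at a time."""
    padded = smiles + " " * (-len(smiles) % k)
    return [padded[i:i + k] for i in range(0, len(padded), k)]

chunks = to_substrings("CC(=O)Oc1ccccc1C(=O)O")  # aspirin SMILES
```

<p>Emitting three characters per step means each prediction must be jointly consistent across a small window (e.g. a ring-open digit and its neighboring atoms), which is the constraint the authors credit for the lower invalid-SMILES rate.</p>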
<p><strong>Property prediction function</strong> $f(\cdot)$: A five-layer DNN with 250 hidden units per layer that predicts molecular properties from ECFP vectors:</p>
<p>$$\mathbf{t} = f(e(\mathbf{m}))$$</p>
<p>The RNN is trained by minimizing cross-entropy loss between the softmax output and the target SMILES string $\mathbf{m}_{i}$, learning the relationship $d(e(\mathbf{m}_{i})) = \mathbf{m}_{i}$. The DNN is trained by minimizing mean squared error between predicted and computed property values. Both use the Adam optimizer with mini-batch size 100, 500 training epochs, and dropout rate 0.5.</p>
<h3 id="genetic-algorithm-operations">Genetic Algorithm Operations</h3>
<p>The GA evolves ECFP vectors using the DEAP library with the following parameters:</p>
<ul>
<li><strong>Population size</strong>: 50</li>
<li><strong>Crossover rate</strong>: 0.7 (uniform crossover, mixing ratio 0.2)</li>
<li><strong>Mutation rate</strong>: 0.3 (Gaussian mutation, $N(0, 0.2^{2})$, applied to 1% of elements)</li>
<li><strong>Selection</strong>: Tournament selection with size 3, top 3 individuals as parents</li>
<li><strong>Termination</strong>: 500 generations or 30 consecutive generations without fitness improvement</li>
</ul>
<p>The evolutionary loop proceeds as follows: a seed molecule $\mathbf{m}_{0}$ is encoded to $\mathbf{x}_{0}$, mutated to generate a population $\mathbf{P}^{0} = \{\mathbf{z}_{1}, \mathbf{z}_{2}, \ldots, \mathbf{z}_{L}\}$, each vector is decoded via the RNN, validity is checked with RDKit, fitness is evaluated via the DNN, and the top parents produce the next generation through crossover and mutation.</p>
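<p>The loop with the hyperparameters above can be sketched in plain Python. This is a simplified stand-in for the authors' DEAP setup: selection here is a per-offspring tournament rather than the exact top-3-parents scheme, and RNN decoding plus validity checking are folded into a stand-in fitness function:</p>

```python
import random

def mutate(vec, rng, rate=0.01, sigma=0.2):
    """Gaussian mutation N(0, 0.2^2) applied to ~1% of vector elements."""
    return [x + rng.gauss(0.0, sigma) if rng.random() < rate else x for x in vec]

def uniform_crossover(a, b, rng, mix=0.2):
    """Uniform crossover with mixing ratio 0.2."""
    child = a[:]
    for i in range(len(a)):
        if rng.random() < mix:
            child[i] = b[i]
    return child

def tournament(pop, fits, rng, k=3):
    """Tournament selection of size 3."""
    idx = max(rng.sample(range(len(pop)), k), key=lambda i: fits[i])
    return pop[idx]

def evolve(seed, fitness, rng, pop_size=50, cx_rate=0.7, mut_rate=0.3,
           generations=500, patience=30):
    pop = [mutate(seed, rng, rate=0.05) for _ in range(pop_size)]  # seed the population
    best, stale = max(fitness(v) for v in pop), 0
    for _ in range(generations):
        fits = [fitness(v) for v in pop]
        parents = [tournament(pop, fits, rng) for _ in range(pop_size)]
        nxt = []
        for i in range(pop_size):
            child = parents[i]
            if rng.random() < cx_rate:
                child = uniform_crossover(child, rng.choice(parents), rng)
            if rng.random() < mut_rate:
                child = mutate(child, rng)
            nxt.append(child)
        pop = nxt
        gen_best = max(fitness(v) for v in pop)
        if gen_best > best:
            best, stale = gen_best, 0
        else:
            stale += 1
            if stale >= patience:  # 30 generations without improvement
                break
    return best

rng = random.Random(0)
# Stand-in fitness: negative distance of the vector mean from a target value.
fitness = lambda v: -abs(sum(v) / len(v) - 0.5)
best = evolve([0.0] * 32, fitness, rng, generations=50)
```

<p>In the real pipeline, <code>fitness</code> would decode each vector with the RNN, reject RDKit-invalid SMILES, and score survivors with the DNN property predictor.</p>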
<h2 id="experimental-setup-light-absorbing-wavelength-optimization">Experimental Setup: Light-Absorbing Wavelength Optimization</h2>
<h3 id="training-data-and-deep-learning-performance">Training Data and Deep Learning Performance</h3>
<p>The models were trained on 10,000 to 100,000 molecules randomly sampled from PubChem (molecular weight 200-600 g/mol). Each molecule was labeled with DFT-computed excitation energy ($S_{1}$), <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO, and LUMO</a> energies using B3LYP/6-31G.</p>
<table>
  <thead>
      <tr>
          <th>Training Data</th>
          <th>Validity (%)</th>
          <th>Reconstructability (%)</th>
          <th>$S_{1}$ (R, MAE)</th>
          <th>HOMO (R, MAE)</th>
          <th>LUMO (R, MAE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>100,000</td>
          <td>88.8</td>
          <td>62.4</td>
          <td>0.977, 0.185 eV</td>
          <td>0.948, 0.168 eV</td>
          <td>0.960, 0.195 eV</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>86.7</td>
          <td>60.1</td>
          <td>0.973, 0.198 eV</td>
          <td>0.945, 0.172 eV</td>
          <td>0.955, 0.209 eV</td>
      </tr>
      <tr>
          <td>30,000</td>
          <td>85.3</td>
          <td>59.8</td>
          <td>0.930, 0.228 eV</td>
          <td>0.934, 0.191 eV</td>
          <td>0.945, 0.224 eV</td>
      </tr>
      <tr>
          <td>10,000</td>
          <td>83.2</td>
          <td>55.7</td>
          <td>0.913, 0.278 eV</td>
          <td>0.885, 0.244 eV</td>
          <td>0.917, 0.287 eV</td>
      </tr>
  </tbody>
</table>
<p>Validity refers to the proportion of chemically valid SMILES after RDKit inspection. Reconstructability measures how often the RNN reproduces the original molecule from its ECFP, judged by a canonical-SMILES match among 10,000 generated strings (62.4% at 100k training samples).</p>
<h3 id="design-task-1-unconstrained-s1-modification">Design Task 1: Unconstrained S1 Modification</h3>
<p>Fifty seed molecules with $S_{1}$ values between 3.8 eV and 4.2 eV were evolved in both increasing and decreasing directions. With 50,000 training samples, $S_{1}$ increased by approximately 60% on average in the increasing direction and showed slightly lower rates of change in the decreasing direction. The asymmetry is attributed to the skewed $S_{1}$ distribution of training data (average $S_{1}$ of 4.3-4.4 eV, higher than the seed median of 4.0 eV). Performance saturated at approximately 50,000 training samples.</p>
<h3 id="design-task-2-s1-modification-with-homolumo-constraints">Design Task 2: S1 Modification with HOMO/LUMO Constraints</h3>
<p>The same 50 seeds were evolved with constraints: $-7.0 \text{ eV} &lt; \text{HOMO} &lt; -5.0 \text{ eV}$ and $\text{LUMO} &lt; 0.0 \text{ eV}$. In the increasing $S_{1}$ direction, constraints suppressed the rate of change because both HOMO and LUMO bounds limit the achievable HOMO-LUMO gap. In the decreasing direction, constraints had minimal effect because LUMO could freely decrease while HOMO had sufficient room to rise within the allowed range.</p>
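<p>One plausible way to fold the HOMO/LUMO window into the GA's fitness is a linear penalty on constraint violations; the paper does not specify its exact constraint handling, so this is an illustrative sketch:</p>

```python
def constrained_fitness(s1, homo, lumo, direction=1, penalty=10.0):
    """Score S1 in the chosen direction (+1 to increase, -1 to decrease),
    penalizing violations of -7.0 eV < HOMO < -5.0 eV and LUMO < 0.0 eV."""
    score = direction * s1
    violation = (max(0.0, -7.0 - homo)        # HOMO below the lower bound
                 + max(0.0, homo - (-5.0))    # HOMO above the upper bound
                 + max(0.0, lumo - 0.0))      # LUMO above zero
    return score - penalty * violation

# A candidate inside the window keeps its raw score...
ok = constrained_fitness(4.2, -6.0, -1.0)
# ...while one with HOMO above -5.0 eV is penalized.
bad = constrained_fitness(4.2, -4.5, -1.0)
```

<p>With a penalty of this form, the asymmetry the authors observe follows directly: raising $S_{1}$ widens the HOMO-LUMO gap and quickly runs into both bounds, while lowering $S_{1}$ leaves room for LUMO to fall and HOMO to rise within the window.</p>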
<h3 id="design-task-3-extrapolation-beyond-training-data">Design Task 3: Extrapolation Beyond Training Data</h3>
<p>To generate molecules with $S_{1}$ values below 1.77 eV (outside the training distribution, which had mean $S_{1}$ of 4.91 eV), the authors introduced iterative &ldquo;phases&rdquo;: generate molecules, compute their properties via DFT, retrain the models, and repeat. Starting from the 30 lowest-$S_{1}$ seed molecules with 300 generation runs per phase:</p>
<ul>
<li>Phase 1: Average $S_{1}$ = 2.20 eV, 12 molecules below 1.77 eV</li>
<li>Phase 2: Average $S_{1}$ = 2.22 eV, 37 molecules below 1.77 eV</li>
<li>Phase 3: Average $S_{1}$ = 2.31 eV, 58 molecules below 1.77 eV</li>
</ul>
<p>While the average $S_{1}$ rose slightly across phases, variance decreased (from 1.40 to 1.36), indicating the model concentrated its outputs closer to the target range. This active-learning-like loop demonstrates the framework can extend beyond the training distribution.</p>
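<p>The phase loop reads as a simple active-learning cycle. Below is a schematic sketch in which <code>generate</code>, <code>compute_dft</code>, and <code>retrain</code> are hypothetical stand-ins for the GA runs, DFT labeling, and model retraining steps:</p>

```python
def iterative_design(seeds, generate, compute_dft, retrain,
                     target=1.77, phases=3, n_seeds=30):
    """Extrapolation loop: generate candidates, label them with DFT,
    retrain the models on the new labels, and reseed from the lowest-S1
    molecules for the next phase."""
    hits = []
    for _ in range(phases):
        candidates = generate(seeds)
        labeled = [(m, compute_dft(m)) for m in candidates]
        hits.extend(m for m, s1 in labeled if s1 < target)
        retrain(labeled)  # fold the new DFT labels back into RNN and DNN
        seeds = [m for m, _ in sorted(labeled, key=lambda t: t[1])[:n_seeds]]
    return hits

# Toy run where "molecules" are just S1 values that drop 0.1 eV per phase:
hits = iterative_design([2.0] * 30,
                        generate=lambda seeds: [s - 0.1 for s in seeds],
                        compute_dft=lambda m: m,
                        retrain=lambda labeled: None)
```

<p>Each pass pushes the seed pool toward the target region, mirroring the growing hit counts (12, 37, 58 molecules below 1.77 eV) across the three reported phases.</p>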
<h3 id="design-task-4-guacamol-benchmarks">Design Task 4: GuacaMol Benchmarks</h3>
<p>The method was evaluated on the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> goal-directed benchmark suite using the ChEMBL25 training dataset. The RNN model was retrained with three-character substrings.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a></th>
          <th>SMILES GA</th>
          <th><a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a></th>
          <th><a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph MCTS</a></th>
          <th>cRNN</th>
          <th>EDM (ours)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.607</td>
          <td>1.000</td>
          <td>0.378</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Troglitazone rediscovery</td>
          <td>0.419</td>
          <td>1.000</td>
          <td>0.558</td>
          <td>1.000</td>
          <td>0.312</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Thiothixene rediscovery</td>
          <td>0.456</td>
          <td>1.000</td>
          <td>0.495</td>
          <td>1.000</td>
          <td>0.308</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(-1.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.980</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(8.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.979</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>TPSA(150.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>CNS MPO</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.944</td>
          <td>0.948</td>
          <td>0.948</td>
      </tr>
  </tbody>
</table>
<p>The EDM achieves maximum scores on all eight tasks, matching the cRNN baseline. The 256 highest-scoring molecules from the ChEMBL25 test set were used as seeds, with 500 SMILES strings generated per seed.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="results">Results</h3>
<p>The evolutionary design framework successfully evolved seed molecules toward target properties across all four design tasks. The RNN decoder maintained 88.8% chemical validity at 100k training samples, and the DNN property predictor achieved correlation coefficients above 0.94 for $S_{1}$, HOMO, and LUMO prediction. The iterative retraining procedure enabled exploration outside the training data distribution, generating 58 molecules with $S_{1}$ below 1.77 eV after three phases. On GuacaMol benchmarks, the method achieved maximum scores on all eight tasks, matching <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a>, <a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, and cRNN baselines.</p>
<h3 id="limitations">Limitations</h3>
<p>Several limitations are worth noting:</p>
<ol>
<li><strong>Reconstructability ceiling</strong>: Only 62.4% of molecules could be reconstructed from their ECFP vectors, meaning the RNN decoder fails to recover the original molecule approximately 38% of the time. This information loss in the ECFP encoding is a fundamental bottleneck.</li>
<li><strong>Data dependence</strong>: Performance is sensitive to the training data distribution. The asymmetric evolution rates for increasing vs. decreasing $S_{1}$ directly reflect the skewed training data.</li>
<li><strong>Structural constraints</strong>: Three heuristic constraints (fused ring sizes, number of fused rings, alkyl chain lengths) were still needed to maintain reasonable molecular structures, partially undermining the claim of a fully data-driven approach.</li>
<li><strong>DFT reliance</strong>: The extrapolation experiment requires DFT calculations in the loop, which are computationally expensive and may limit scalability.</li>
<li><strong>Limited benchmark scope</strong>: Only 8 GuacaMol tasks were tested, and all achieved perfect scores, making it difficult to differentiate from competing methods. The paper does not report on harder multi-objective benchmarks.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>PubChem random sample</td>
          <td>10,000-100,000 molecules</td>
          <td>MW 200-600 g/mol, labeled with DFT-computed $S_{1}$, HOMO, LUMO</td>
      </tr>
      <tr>
          <td>GuacaMol Benchmark</td>
          <td>ChEMBL25</td>
          <td>Standard split</td>
          <td>Used for retraining RNN; 256 top-scoring seeds</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Genetic algorithm</strong>: DEAP library; population 50, crossover rate 0.7, mutation rate 0.3, tournament size 3</li>
<li><strong>RNN decoder</strong>: 3 hidden layers, 500 LSTM units each, three-character substring generation</li>
<li><strong>DNN predictor</strong>: 5 layers, 250 hidden units, sigmoid activations, linear output</li>
<li><strong>Training</strong>: Adam optimizer, mini-batch 100, 500 epochs, dropout 0.5</li>
</ul>
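<p>The GA loop implied by these settings can be sketched in plain Python (a toy stand-in: the paper uses the DEAP library, and fitness comes from decoding ECFP vectors through the RNN and scoring them with the DNN property predictor; the bit length, mutation probability, and fitness function here are illustrative only):</p>

```python
import random

random.seed(0)
N_BITS = 64                                 # toy fingerprint length for illustration
POP, CXPB, MUTPB, TOURN = 50, 0.7, 0.3, 3   # hyperparameters from the paper

def evaluate(ind):
    # Placeholder fitness: in the paper, the ECFP vector is decoded to a
    # molecule by the RNN and scored by the DNN property predictor.
    return sum(ind)

def tournament(pop):
    # Tournament selection with size 3, as in the paper.
    return max(random.sample(pop, TOURN), key=evaluate)

def crossover(a, b):
    # Single-point crossover between two fingerprint vectors.
    p = random.randrange(1, N_BITS)
    return a[:p] + b[p:]

def mutate(ind):
    # Bit-flip mutation; per-bit rate is an assumption.
    return [bit ^ 1 if random.random() < 0.05 else bit for bit in ind]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for _ in range(20):  # generations
    nxt = []
    while len(nxt) < POP:
        child = tournament(pop)
        if random.random() < CXPB:
            child = crossover(child, tournament(pop))
        if random.random() < MUTPB:
            child = mutate(child)
        nxt.append(child)
    pop = nxt

best = max(pop, key=evaluate)
```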
<h3 id="models">Models</h3>
<p>All neural networks were implemented using Keras with the Theano backend (GPU-accelerated). No pre-trained model weights are publicly available.</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>RNN validity</strong>: Proportion of chemically valid SMILES (RDKit check)</li>
<li><strong>Reconstructability</strong>: Fraction of seed molecules recoverable from ECFP (canonical SMILES match in 10,000 generated strings)</li>
<li><strong>DNN accuracy</strong>: Correlation coefficient (R) and MAE via 10-fold cross-validation</li>
<li><strong>Evolutionary performance</strong>: Average rate of $S_{1}$ change across 50 seeds; molecule count in target range</li>
<li><strong>GuacaMol</strong>: Standard rediscovery and property satisfaction benchmarks</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU models, training times, or computational requirements for the evolutionary runs. DFT calculations used the Gaussian 09 program suite with B3LYP/6-31G.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained models are available. The paper is published under a CC-BY 4.0 license as open access in Scientific Reports.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.nature.com/articles/s41598-021-96812-8">Paper (Nature)</a></td>
          <td>Paper</td>
          <td>CC-BY 4.0</td>
          <td>Open access</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Partially Reproducible. The method is described in sufficient detail for reimplementation, but no code, trained models, or preprocessed datasets are released. The DFT calculations require Gaussian 09, a commercial software package.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kwon, Y., Kang, S., Choi, Y.-S., &amp; Kim, I. (2021). Evolutionary design of molecules based on deep learning and a genetic algorithm. <em>Scientific Reports</em>, 11, 17304. <a href="https://doi.org/10.1038/s41598-021-96812-8">https://doi.org/10.1038/s41598-021-96812-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kwon2021evolutionary,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evolutionary design of molecules based on deep learning and a genetic algorithm}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kwon, Youngchun and Kang, Seokho and Choi, Youn-Suk and Kim, Inkoo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-021-96812-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v3: Scaffold-Constrained Graph Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</guid><description>DrugEx v3 proposes a Graph Transformer with novel positional encoding for scaffold-constrained molecular generation via multi-objective reinforcement learning.</description><content:encoded><![CDATA[<h2 id="a-graph-transformer-method-for-scaffold-constrained-drug-design">A Graph Transformer Method for Scaffold-Constrained Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces DrugEx v3, a Graph Transformer model for scaffold-constrained de novo drug design. The primary contribution is a novel positional encoding scheme for molecular graphs that allows a Transformer architecture to operate on graph-structured molecular data rather than <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings. The model takes user-provided scaffold fragments as input and generates complete molecules through growing and connecting operations, trained with multi-objective reinforcement learning to optimize for both target affinity and drug-likeness.</p>
<h2 id="from-fixed-objectives-to-user-guided-scaffold-design">From Fixed Objectives to User-Guided Scaffold Design</h2>
<p>Prior versions of DrugEx (v1 and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">v2</a>) used RNN-based generators trained with reinforcement learning for de novo drug design, but they operated under fixed objectives and could not accept user-provided structural priors. If a medicinal chemist wanted to explore analogs of a specific scaffold, the model had to be retrained from scratch. SMILES-based molecular generators also face inherent limitations for scaffold-constrained design: SMILES is a linear notation, so inserting fragments at multiple positions of a scaffold requires complex grammar handling, and small token changes can produce invalid molecules.</p>
<p>Several approaches had been proposed for scaffold-based generation, including graph generative models (Lim et al., 2019), DeepScaffold (Li et al., 2020), SMILES-based scaffold decorators (Arus-Pous et al., 2020), and SyntaLinker for fragment linking (Yang et al., 2020). DrugEx v3 aims to combine the advantages of graph representations (validity guarantees, local invariance, flexible extension) with the Transformer architecture&rsquo;s ability to handle complex dependencies, while maintaining the multi-objective reinforcement learning framework from DrugEx v2.</p>
<h2 id="graph-positional-encoding-for-molecular-transformers">Graph Positional Encoding for Molecular Transformers</h2>
<p>The core innovation is adapting the Transformer architecture to work directly with molecular graph representations. Two key modifications make this possible.</p>
<p><strong>Graph word encoding.</strong> Since atoms and bonds cannot be processed simultaneously in a graph, the authors combine them into a single index:</p>
<p>$$
W = T_{atom} \times 4 + T_{bond}
$$</p>
<p>where $T_{atom}$ is the atom type index and $T_{bond}$ is the bond type index (four bond types: single, double, triple, and none).</p>
<p><strong>Graph positional encoding.</strong> Standard sequential position encoding does not capture molecular topology. The authors propose an adjacency-matrix-based positional encoding:</p>
<p>$$
P = I_{Atom} \times L_{max} + I_{Connected}
$$</p>
<p>where $I_{Atom}$ is the current atom index, $L_{max}$ is the maximum sequence length, and $I_{Connected}$ is the index of the atom connected by the current bond. This encoding is then processed through the standard sinusoidal positional encoding:</p>
<p>$$
PE_{(p, 2i)} = \sin(p / 10000^{2i / d_{m}})
$$</p>
<p>$$
PE_{(p, 2i+1)} = \cos(p / 10000^{2i / d_{m}})
$$</p>
<p>with $d_{m} = 512$.</p>
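<p>Concretely, the two encodings combine as follows (a minimal sketch; the particular atom/bond indices and the $L_{max}$ value are assumptions for illustration):</p>

```python
import numpy as np

L_MAX = 80  # assumed maximum sequence length; d_m = 512 is from the paper

def graph_word(t_atom, t_bond):
    # W = T_atom * 4 + T_bond (four bond types: single, double, triple, none)
    return t_atom * 4 + t_bond

def graph_position(i_atom, i_connected):
    # P = I_atom * L_max + I_connected
    return i_atom * L_MAX + i_connected

def sinusoidal_pe(pos, d_model=512):
    # Standard sinusoidal encoding applied to the graph position index.
    i = np.arange(d_model // 2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

w = graph_word(1, 0)                      # atom type 1 joined by bond type 0
pe = sinusoidal_pe(graph_position(3, 1))  # 512-d vector for atom 3 bonded to atom 1
```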
<p><strong>Molecule generation procedure.</strong> Each molecule in the training data is represented as a five-row matrix encoding atom type, bond type, connected atom index, current atom index, and fragment index. The columns are divided into three sections: fragment (the scaffold), growing (new atoms added to fragments), and linking (bonds connecting grown fragments). The decoder uses a GRU-based recurrent layer to sequentially output atom type, bond type, connected atom index, and current atom index at each step, with chemical valence rules enforced at every generation step to guarantee valid molecules.</p>
<p><strong>Multi-objective reinforcement learning.</strong> The generator is trained with a policy gradient objective:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) | \theta\right] = \sum_{t=1}^{T} \log G(y_{t} | y_{1:t-1}) \cdot R^{\ast}(y_{1:T})
$$</p>
<p>where $R^{*}$ is a Pareto-based reward combining target affinity and QED drug-likeness score:</p>
<p>$$
R^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>with $k$ being the solution&rsquo;s index in the Pareto rank. An exploration strategy uses two networks: an exploitation network $G_{\theta}$ (updated by policy gradient) and an exploration network $G_{\phi}$ (fixed, pre-trained on ChEMBL), with an exploration rate $\varepsilon$ controlling how many scaffolds are routed to $G_{\phi}$ during training.</p>
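<p>The piecewise reward is straightforward to transcribe (a direct sketch of the formula above; the rank-indexing convention in the example is an assumption):</p>

```python
def pareto_reward(k, desired, n_desired, n_undesired):
    # R* maps the Pareto-rank index k to [0, 1]: undesired solutions
    # score below 0.5, desired solutions at or above 0.5.
    if desired:
        return 0.5 + (k - n_undesired) / (2 * n_desired)
    return k / (2 * n_undesired)

# e.g. 6 undesired + 4 desired solutions, ranked worst (k=0) to best:
rewards = [pareto_reward(k, k >= 6, n_desired=4, n_undesired=6) for k in range(10)]
```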
<h2 id="experimental-setup-architecture-comparison-and-rl-optimization">Experimental Setup: Architecture Comparison and RL Optimization</h2>
<h3 id="data">Data</h3>
<p>The ChEMBL set (version 27) contained approximately 1.7 million molecules for pre-training, preprocessed via RDKit (charge neutralization, metal/fragment removal). The LIGAND set comprised 10,828 adenosine receptor ligands for fine-tuning. Each molecule was decomposed into fragments using the BRICS algorithm, creating scaffold-molecule pairs (up to 15 pairs per molecule with four fragments). The ChEMBL set yielded 9.3 million training pairs, and the LIGAND set produced 53,888 training pairs.</p>
<h3 id="architecture-comparison">Architecture comparison</h3>
<p>Four architectures were compared:</p>
<ol>
<li><strong>Graph Transformer</strong>: graph input with novel positional encoding</li>
<li><strong>Sequential Transformer</strong>: SMILES input with standard Transformer</li>
<li><strong>LSTM-BASE</strong>: SMILES encoder-decoder with three recurrent layers</li>
<li><strong>LSTM+ATTN</strong>: LSTM-BASE with an attention mechanism between encoder and decoder</li>
</ol>
<p>All models were pre-trained on ChEMBL and fine-tuned on the LIGAND set. The bioactivity predictor was a random forest regression model using 2048D ECFP6 fingerprints and 19D physicochemical descriptors, with an activity threshold of pX = 6.5 for the A2A adenosine receptor.</p>
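<p>The bioactivity predictor is a standard scikit-learn setup. A sketch with synthetic features standing in for the real inputs (in practice the 2048-bit ECFP6 vectors and 19 descriptors come from RDKit, e.g. Morgan fingerprints with radius 3; the labels here are random, not ChEMBL data):</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-ins for the paper's features: 2048-bit ECFP6 fingerprints plus
# 19 physicochemical descriptors.
n = 200
fps = rng.integers(0, 2, size=(n, 2048))
desc = rng.normal(size=(n, 19))
X = np.hstack([fps, desc])
y = rng.normal(loc=6.0, scale=1.0, size=n)  # synthetic pX labels

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = model.predict(X[:5])
active = pred >= 6.5  # activity threshold used in the paper
```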
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>Five metrics were used: validity (parseable molecules), accuracy (scaffold containment), desirability (meeting all objectives), uniqueness, and novelty (not in ChEMBL). Diversity was measured using the Solow-Polasky index with Tanimoto distance on ECFP6 fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\intercal} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
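<p>Here $F(\mathbf{s})$ is the pairwise similarity matrix over the generated set and $\mathbf{e}$ a vector of ones; the normalized index approaches 1 for a maximally diverse set. A numpy sketch (the exponential similarity kernel and $\theta$ value follow the usual Solow-Polasky formulation and are assumptions here):</p>

```python
import numpy as np

def solow_polasky(dist, theta=1.0):
    # I(A) = (1/|A|) e^T F^{-1} e, with F_ij = exp(-theta * d_ij).
    # dist: pairwise Tanimoto-distance matrix over the generated molecules.
    F = np.exp(-theta * dist)
    e = np.ones(len(dist))
    return float(e @ np.linalg.solve(F, e)) / len(dist)

d = np.array([[0.0, 0.8, 0.9],
              [0.8, 0.0, 0.7],
              [0.9, 0.7, 0.0]])
diversity = solow_polasky(d)  # higher = more diverse, in (0, 1]
```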
<h3 id="hardware">Hardware</h3>
<p>Models were benchmarked on a server with NVIDIA Tesla P100 GPUs.</p>
<h2 id="key-results-graph-representation-advantages-and-rl-trade-offs">Key Results: Graph Representation Advantages and RL Trade-offs</h2>
<h3 id="pre-training-and-fine-tuning-performance">Pre-training and fine-tuning performance</h3>
<p>The Graph Transformer achieved the best overall performance across all metrics:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity (PT)</th>
          <th>Accuracy (PT)</th>
          <th>Validity (FT)</th>
          <th>Accuracy (FT)</th>
          <th>Novelty (FT)</th>
          <th>Uniqueness (FT)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph Transformer (512)</td>
          <td>100.0%</td>
          <td>99.3%</td>
          <td>100.0%</td>
          <td>99.2%</td>
          <td>68.9%</td>
          <td>82.9%</td>
      </tr>
      <tr>
          <td>Seq. Transformer (512)</td>
          <td>96.7%</td>
          <td>74.0%</td>
          <td>99.3%</td>
          <td>92.7%</td>
          <td>8.9%</td>
          <td>28.9%</td>
      </tr>
      <tr>
          <td>LSTM+ATTN (512)</td>
          <td>94.3%</td>
          <td>72.8%</td>
          <td>96.9%</td>
          <td>85.9%</td>
          <td>6.3%</td>
          <td>20.7%</td>
      </tr>
      <tr>
          <td>LSTM-BASE (512)</td>
          <td>93.9%</td>
          <td>52.4%</td>
          <td>98.7%</td>
          <td>81.6%</td>
          <td>3.9%</td>
          <td>19.2%</td>
      </tr>
  </tbody>
</table>
<p>PT = pre-trained, FT = fine-tuned. The Graph Transformer achieved 100% validity due to its explicit valence checking at each generation step. It also produced substantially more novel and unique molecules after fine-tuning compared to SMILES-based methods.</p>
<p>The authors identified four advantages of the graph representation over SMILES: (1) local invariance, where fragment ordering does not affect output; (2) global extendibility, where new atoms can be appended without restructuring existing data; (3) freedom from grammar constraints; and (4) direct accessibility of chemical valence rules for validity enforcement.</p>
<h3 id="reinforcement-learning-results">Reinforcement learning results</h3>
<p>With multi-objective RL (affinity + QED), 74.6% of generated molecules were predicted active at $\varepsilon = 0.0$. The exploration rate $\varepsilon$ trades off desirability against uniqueness:</p>
<table>
  <thead>
      <tr>
          <th>$\varepsilon$</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.0</td>
          <td>74.6%</td>
          <td>60.7%</td>
          <td>60.6%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.1</td>
          <td>66.8%</td>
          <td>75.0%</td>
          <td>74.6%</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>0.2</td>
          <td>61.6%</td>
          <td>80.2%</td>
          <td>79.4%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.3</td>
          <td>56.8%</td>
          <td>89.8%</td>
          <td>88.8%</td>
          <td>0.874</td>
      </tr>
  </tbody>
</table>
<p>The authors report that $\varepsilon = 0.3$ produced the best balance between desirability and uniqueness, with 56.8% desired molecules and 89.8% uniqueness. Diversity remained above 0.84 across all settings.</p>
<h3 id="limitations">Limitations</h3>
<p>The Graph Transformer produced molecules with worse synthetic accessibility (SA scores) compared to SMILES-based methods, particularly after fine-tuning on the smaller LIGAND set. The authors attribute this to uncommon ring systems generated when the model handles long-distance dependencies. A kekulization issue also causes a small fraction of molecules to fail scaffold matching: aromatic bond inference during sanitization can alter the scaffold substructure. Without single-objective affinity constraint, the model generates molecules with molecular weight exceeding 500 Da, reducing drug-likeness. All bioactivity predictions rely on a random forest model rather than experimental validation, and the t-SNE analysis suggests some generated molecules fall outside the model&rsquo;s applicability domain.</p>
<h3 id="future-directions">Future directions</h3>
<p>The authors propose extending the Graph Transformer to accept protein information as input via proteochemometric modeling, enabling design of ligands for targets without known ligands. Lead optimization, where a &ldquo;hit&rdquo; serves as input to generate improved analogs, is also identified as a natural extension.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v27</td>
          <td>~1.7M molecules (9.3M scaffold-molecule pairs)</td>
          <td>Preprocessed via RDKit</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LIGAND set (A2A AR ligands from ChEMBL)</td>
          <td>10,828 ligands (53,888 pairs)</td>
          <td>Split 8:1:1 train/val/test</td>
      </tr>
      <tr>
          <td>Bioactivity labels</td>
          <td>ChEMBL A2A AR activity data</td>
          <td>pX threshold = 6.5</td>
          <td>Average pChEMBL values</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fragment decomposition: BRICS algorithm via RDKit (max 4 fragments per molecule)</li>
<li>Optimizer: Adam with learning rate $10^{-4}$, batch size 256</li>
<li>Pre-training: 20 epochs; fine-tuning: up to 1,000 epochs with early stopping (patience: 100 epochs)</li>
<li>Bioactivity predictor: random forest regression (scikit-learn) with 2048D ECFP6 + 19D physicochemical descriptors</li>
<li>Pareto-based multi-objective ranking with GPU acceleration</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Graph Transformer: 512 hidden units, 8 attention heads, $d_{k} = d_{v} = 64$</li>
<li>Sequential Transformer: same hidden size, sinusoidal positional encoding</li>
<li>LSTM-BASE / LSTM+ATTN: 128 embedding units, 512 hidden units, 3 recurrent layers</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Graph Transformer</th>
          <th>Best SMILES Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (fine-tuned)</td>
          <td>100.0%</td>
          <td>99.6% (LSTM-BASE 1024)</td>
          <td>Valence checking guarantees validity</td>
      </tr>
      <tr>
          <td>Accuracy (fine-tuned)</td>
          <td>99.2%</td>
          <td>94.3% (Seq. Transformer 1024)</td>
          <td>Scaffold containment</td>
      </tr>
      <tr>
          <td>Desirability (RL, $\varepsilon$=0.0)</td>
          <td>74.6%</td>
          <td>N/A</td>
          <td>Only Graph Transformer used for RL</td>
      </tr>
      <tr>
          <td>Diversity (RL)</td>
          <td>0.879</td>
          <td>N/A</td>
          <td>Solow-Polasky index</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware-1">Hardware</h3>
<p>NVIDIA Tesla P100 GPUs. Specific training times not reported, but Transformer models trained faster than LSTM models with the same hidden layer size.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CDDLeiden/DrugEx">CDDLeiden/DrugEx</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (v1, v2, v3)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v27</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Pre-training data source</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P., &amp; van Westen, G. J. P. (2023). DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. <em>Journal of Cheminformatics</em>, 15, 24. <a href="https://doi.org/10.1186/s13321-023-00694-z">https://doi.org/10.1186/s13321-023-00694-z</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2023drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00694-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Curriculum Learning for De Novo Drug Design (REINVENT)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</guid><description>Curriculum learning applied to REINVENT accelerates convergence on complex multi-parameter drug design objectives compared to standard reinforcement learning.</description><content:encoded><![CDATA[<h2 id="curriculum-learning-as-a-method-for-molecular-generation">Curriculum Learning as a Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces curriculum learning (CL) into the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo molecular design platform. The primary contribution is a training strategy that decomposes complex multi-parameter optimization (MPO) objectives into sequences of simpler tasks with increasing complexity. The agent learns each simpler task before progressing to the full production objective, accelerating convergence and improving the quality and diversity of generated molecules compared to standard policy-based reinforcement learning (RL).</p>
<h2 id="the-computational-cost-of-complex-reward-functions">The Computational Cost of Complex Reward Functions</h2>
<p>Policy-based RL for molecular design works by training a generative model (the agent) to produce molecules that maximize a reward function. In practice, drug design reward functions often include computationally expensive components such as molecular docking. When the reward landscape is complex and minima are difficult to find, the agent may spend many epochs sampling molecules far from the desired objective. The resulting small gradients cause minimal policy updates, leading to long periods of non-productivity. This is particularly wasteful when each reward evaluation involves expensive physics-based computations.</p>
<p>The core problem is that standard RL treats the full MPO objective as a monolithic task. If the agent cannot find any rewarding molecules early in training, it receives near-zero gradients and makes negligible progress. This creates a bootstrapping problem: the agent needs to already be sampling from favorable regions of chemical space to receive useful learning signals, but it has no guidance on how to get there.</p>
<p>Curriculum learning, originally proposed by Bengio et al. (2009), addresses this by arranging training tasks in order of increasing difficulty. When constituent tasks are correlated with the final objective, the gradients from simpler tasks provide more effective traversal of the optimization landscape.</p>
<h2 id="formalized-curriculum-strategy-for-reinvent">Formalized Curriculum Strategy for REINVENT</h2>
<p>The key innovation is a two-phase training protocol with formal definitions for curriculum progression.</p>
<p>A scoring function maps SMILES strings to desirability scores in $[0, 1]$ using a weighted geometric mean:</p>
<p>$$S(x) = \left(\prod_{i=1}^{n} c_{i}(x)^{w_{i}}\right)^{1 / \sum_{i=1}^{n} w_{i}}$$</p>
<p>where $x$ is a sampled compound in SMILES format, $c_{i}$ is the $i$-th scoring component, and $w_{i}$ is its weight.</p>
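<p>One consequence of the geometric mean is that a single zero-scoring component zeroes the whole score, so every objective must be at least partially satisfied. A minimal sketch:</p>

```python
import math

def desirability(components, weights):
    # S(x) = (prod_i c_i^{w_i})^(1 / sum_i w_i), each c_i in [0, 1]
    return math.prod(c ** w for c, w in zip(components, weights)) ** (1 / sum(weights))

# A compound scoring 0.9 on one component and 0.4 on another, equal weights:
s = desirability([0.9, 0.4], [1.0, 1.0])  # sqrt(0.9 * 0.4) = 0.6
```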
<p>A Curriculum $C$ consists of a sequence of Objectives $O = \{O_{C_1}, \ldots, O_{C_n}, O_{P}\}$, where subscripts $C$ and $P$ denote Curriculum and Production Objectives respectively. Each Objective has a corresponding scoring function. Progression is controlled by Curriculum Progression Criteria $P = \{P_{1}, \ldots, P_{n}\}$, where each $P_{i}$ defines a score threshold the agent must achieve before advancing to the next objective.</p>
<p><strong>Curriculum Phase</strong>: The agent trains on sequential Curriculum Objectives with increasing complexity. A diversity filter is not applied during this phase, as it could be counterproductive to guiding the agent toward favorable chemical space. No computationally expensive components (e.g., docking) are used in Curriculum Objectives.</p>
<p><strong>Production Phase</strong>: Activated only when the final Curriculum Progression Criterion $P_{n}$ is satisfied. The agent now optimizes the full Production Objective, which may include expensive components like molecular docking. A new inception memory is initialized (clearing Curriculum Phase compounds), and a Bemis-Murcko scaffold diversity filter is applied to encourage exploration across multiple local minima.</p>
<p>The implementation builds on REINVENT&rsquo;s RNN architecture: three hidden layers of 512 LSTM cells with an embedding size of 256 and a linear layer with softmax activation, pretrained on ChEMBL to learn SMILES syntax.</p>
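<p>The two-phase control flow reduces to a small loop (a toy sketch: the scalar &ldquo;skill&rdquo; stands in for the RNN agent and <code>train_epoch</code> for a policy-gradient update; the inception reset and diversity filter are noted as comments, and the numeric values are illustrative apart from the 0.8 thresholds used in Experiment 1):</p>

```python
def train_epoch(agent, objective):
    agent["skill"] += 0.05                       # stand-in for a policy update
    return min(1.0, agent["skill"] * objective)  # stand-in for the mean score

def run_curriculum(agent, curriculum, production, production_epochs=10):
    # Curriculum Phase: advance only once the progression criterion is met;
    # no diversity filter, no expensive scoring components.
    for objective, threshold in curriculum:
        while train_epoch(agent, objective) < threshold:
            pass
    # Production Phase: REINVENT resets the inception memory here and applies
    # a Bemis-Murcko scaffold diversity filter before optimizing the full
    # (possibly docking-based) Production Objective.
    for _ in range(production_epochs):
        train_epoch(agent, production)
    return agent

# Three curriculum objectives of increasing difficulty, thresholds 0.8:
agent = run_curriculum({"skill": 0.0},
                       curriculum=[(1.0, 0.8), (0.9, 0.8), (0.8, 0.8)],
                       production=0.7)
```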
<h2 id="three-experiments-on-pdk1-inhibitor-design">Three Experiments on PDK1 Inhibitor Design</h2>
<p>The authors evaluate CL on three molecular design tasks of increasing complexity, all centered on designing <a href="https://en.wikipedia.org/wiki/PDPK1">3-phosphoinositide-dependent protein kinase-1</a> (PDK1) inhibitors.</p>
<h3 id="experiment-1-target-scaffold-construction">Experiment 1: Target Scaffold Construction</h3>
<p>The goal is to generate compounds possessing a dihydro-pyrazoloquinazoline scaffold with a phenyl substituent, a scaffold not present in the prior&rsquo;s training set. Standard RL fails entirely over 2000 epochs because the probability of randomly sampling a compound with this scaffold is negligibly small, producing binary rewards (1.0 if scaffold present, 0.5 otherwise) that never rise above 0.5.</p>
<p>CL decomposes the target scaffold into 5 progressively complex substructures. Each Curriculum Progression Criterion threshold is set to 0.8. The agent learns to generate compounds with each substructure before advancing. CL finds the target scaffold within 1750 epochs, while baseline RL cannot find it in the same timeframe.</p>
<h3 id="experiments-2-and-3-molecular-docking-constraints">Experiments 2 and 3: Molecular Docking Constraints</h3>
<p>These experiments use a Production Objective combining a molecular docking constraint (retaining two hydrogen-bonding interactions with Ala 162 of PDK1, PDB ID: 2XCH) and QED (Quantitative Estimate of Druglikeness). Both experiments limit computational cost by capping production epochs at 300.</p>
<p><strong>Experiment 2</strong> uses Tanimoto (2D) similarity to a reference ligand as the Curriculum Objective. Two scenarios are tested: &ldquo;Low&rdquo; (threshold 0.5) and &ldquo;High&rdquo; (threshold 0.8).</p>
<p><strong>Experiment 3</strong> uses ROCS (3D) shape-based similarity to the reference ligand as the Curriculum Objective, with &ldquo;Low&rdquo; (0.5) and &ldquo;High&rdquo; (0.75) thresholds.</p>
<p>All experiments are run in triplicate. The baseline includes both standard RL and RL with Tanimoto/ROCS components added directly to the scoring function (not sequentially), to control for the presence of these components.</p>
<p>Across all CL experiments, CL generates between 2,941 and 9,068 more compounds with docking scores better than the reference ligand (-10.907 kcal/mol) compared to baseline RL, corresponding to 12.42-23.79% improvement in the fraction of compounds exceeding the reference. Between the Curriculum Objectives, the &ldquo;High&rdquo; threshold scenario outperforms the &ldquo;Low&rdquo; scenario by 316-3,415 additional compounds (with percentage improvements ranging from -0.4% to 10.57%).</p>
<p>Baseline RL produces essentially no compounds satisfying the docking constraint for the first 100 epochs. CL agents achieve immediate productivity: in the &ldquo;High&rdquo; Tanimoto scenario, the initial docking score already exceeds the maximum score achieved by baseline RL over 300 epochs.</p>
<h3 id="scaffold-diversity-analysis">Scaffold Diversity Analysis</h3>
<p>CL generates more unique Bemis-Murcko scaffolds than baseline RL in all experiments. The &ldquo;High&rdquo; scenarios produce more unique scaffolds than the &ldquo;Low&rdquo; scenarios. CL also produces a higher fraction of &ldquo;favorable&rdquo; scaffolds (those with better docking scores than the reference ligand).</p>
<h2 id="accelerated-convergence-with-a-diversity-trade-off">Accelerated Convergence with a Diversity Trade-off</h2>
<p>The results demonstrate three consistent findings across all experiments:</p>
<ol>
<li>
<p><strong>Accelerated productivity</strong>: CL agents reach productive sampling of favorable compounds substantially faster than baseline RL. Even a single Curriculum Objective with a computationally inexpensive metric can guide the agent to regions of chemical space where expensive Production Objectives are readily satisfied.</p>
</li>
<li>
<p><strong>Improved output quality</strong>: CL generates more compounds with favorable docking scores, more unique scaffolds, and a higher fraction of scaffolds that outperform the reference ligand.</p>
</li>
<li>
<p><strong>Controllable trade-off</strong>: The Curriculum Progression Criterion threshold provides direct control over agent policy. Higher thresholds produce better Production Objective optimization but reduce intra-set diversity (higher cross-Tanimoto similarities among generated compounds). UMAP visualizations confirm that &ldquo;Low&rdquo; and &ldquo;High&rdquo; scenarios sample from nearby but distinct regions of chemical space.</p>
</li>
</ol>
<p>The authors note that even moderate optimization of similarity-based Curriculum Objectives (the &ldquo;Low&rdquo; scenarios) already substantially narrows the agent&rsquo;s perceived solution space. This suggests that CL inherently regularizes the agent policy, trading some diversity for convergence speed.</p>
<p><strong>Limitations</strong>: The authors acknowledge that data supporting the findings are available only upon request rather than as public deposits. The approach is demonstrated on a single target (PDK1), and the curriculum design requires domain expertise to decompose objectives appropriately. The inverse relationship between Curriculum Objective optimization and solution diversity means practitioners must carefully tune thresholds for their specific applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>Not specified</td>
          <td>Used to pretrain the RNN on SMILES syntax</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 receptor crystal structure</td>
      </tr>
  </tbody>
</table>
<p>Raw data supporting the findings are available from the corresponding author upon request.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>REINVENT platform with LSTM-based RNN (3 hidden layers, 512 cells, embedding size 256)</li>
<li>Scoring function: weighted geometric mean of components</li>
<li>Curriculum Progression Criteria: score thresholds (0.5 or 0.75-0.8 depending on scenario)</li>
<li>Diversity filter: Identical Murcko Scaffold with bucket size 25 (Production Phase only)</li>
<li>Inception (experience replay) for both phases, reset at phase transition</li>
<li>Batch size: 128, learning rate: 0.0001, sigma: 128, Adam optimizer</li>
</ul>
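<p>As a rough sketch of how such a weighted geometric mean aggregates per-component scores into a single reward in [0, 1] (the component values and weights below are hypothetical, not taken from the paper):</p>

```python
import math

def weighted_geometric_mean(scores, weights):
    """Aggregate per-component scores in [0, 1] into a single value.

    A near-zero component drives the aggregate toward zero, which is the
    intended behavior: every objective must be at least partially satisfied.
    """
    total_w = sum(weights)
    # Work in log space for numerical stability; clamp to avoid log(0).
    log_sum = sum(w * math.log(max(s, 1e-12)) for s, w in zip(scores, weights))
    return math.exp(log_sum / total_w)

# Hypothetical components: docking-derived score, QED, similarity.
score = weighted_geometric_mean([0.8, 0.6, 0.9], [2.0, 1.0, 1.0])
```

Because the aggregation is multiplicative, raising a single component's weight sharpens the penalty for failing that objective, which is why a scoring function like this pairs naturally with threshold-based Curriculum Progression Criteria.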
<h3 id="models">Models</h3>
<ul>
<li>Prior: RNN pretrained on ChEMBL SMILES</li>
<li>Agent: Initialized from prior, focused via RL/CL</li>
<li>No pretrained model weights are publicly released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score (Glide SP)</td>
          <td>Predicted binding affinity (kcal/mol)</td>
          <td>Lower is better; reference ligand: -10.907</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimate of Druglikeness</td>
          <td>Range [0, 1]</td>
      </tr>
      <tr>
          <td>Unique Bemis-Murcko scaffolds</td>
          <td>Scaffold diversity measure</td>
          <td>Averaged over triplicates</td>
      </tr>
      <tr>
          <td>Cross-Tanimoto similarity</td>
          <td>Intra-set compound diversity</td>
          <td>Calculated on pooled triplicates</td>
      </tr>
      <tr>
          <td>Tanimoto/ROCS similarity</td>
          <td>Curriculum Objective metrics</td>
          <td>2D fingerprint and 3D shape similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>GPU: NVIDIA Tesla V100 (32 GB)</li>
<li>Docking: AWS p3.8xlarge instance</li>
<li>LigPrep parallelized over 8 CPU cores</li>
<li>Glide docking parallelized over 48 CPU cores via DockStream</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>De novo molecular design platform</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity/blob/master/notebooks/Automated_Curriculum_Learning_Demo.ipynb">CL Tutorial Notebook</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Jupyter notebook tutorial for curriculum learning</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Fialková, V., Arango, J. D., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2022). Improving de novo molecular design with curriculum learning. <em>Nature Machine Intelligence</em>, 4, 555-563. <a href="https://doi.org/10.1038/s42256-022-00494-4">https://doi.org/10.1038/s42256-022-00494-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2022curriculum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving de novo molecular design with curriculum learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Fialkov{\&#39;a}, Vendy and Arango, Juan Diego and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{555--563}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00494-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CogMol: Controlled Molecule Generation for COVID-19</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</guid><description>CogMol combines a SMILES VAE with controlled latent space sampling to generate drug-like molecules with target specificity for novel viral proteins.</description><content:encoded><![CDATA[<h2 id="a-controlled-generation-framework-for-target-specific-drug-design">A Controlled Generation Framework for Target-Specific Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces CogMol (Controlled Generation of Molecules), an end-to-end framework for de novo drug design. The primary contribution is a pipeline that combines a SMILES-based Variational Autoencoder (VAE) with multi-attribute controlled latent space sampling (CLaSS) to generate novel drug-like molecules with high binding affinity to specified protein targets, off-target selectivity, and favorable drug-likeness properties. The framework operates on protein sequence embeddings, allowing it to generalize to unseen target proteins without model retraining.</p>
<h2 id="multi-constraint-drug-design-for-novel-viral-targets">Multi-Constraint Drug Design for Novel Viral Targets</h2>
<p>Traditional drug discovery costs 2-3 billion USD, takes over a decade, and succeeds less than 10% of the time. Generating drug molecules requires satisfying multiple competing objectives simultaneously: target binding affinity, off-target selectivity, synthetic accessibility, drug-likeness, and low toxicity. Prior generative approaches using reinforcement learning or Bayesian optimization are computationally expensive and typically require fine-tuning on target-specific ligand libraries, making them unable to generalize to unseen protein targets.</p>
<p>The emergence of SARS-CoV-2 in 2020 created an urgent need for antiviral drug candidates targeting novel viral proteins. Because no binding affinity data existed for these new targets, and the viral proteins were not closely related to proteins in existing databases like BindingDB, existing target-specific generative frameworks could not be directly applied. CogMol addresses this by using pre-trained protein sequence embeddings from UniRep (trained on 24 million UniRef50 sequences) rather than learning protein representations from the limited BindingDB training set.</p>
<h2 id="controlled-latent-space-sampling-with-pre-trained-protein-embeddings">Controlled Latent Space Sampling with Pre-trained Protein Embeddings</h2>
<p>CogMol&rsquo;s core innovation is a three-component architecture that enables multi-constraint molecule generation for unseen targets:</p>
<p><strong>1. SMILES VAE with adaptive pre-training.</strong> A Variational Autoencoder is first trained unsupervised on the MOSES/ZINC dataset (1.6M molecules), then jointly fine-tuned with QED and SA property predictors on BindingDB molecules. The standard VAE objective is:</p>
<p>$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{p(x)} \left\{ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z)) \right\}$$</p>
<p>where $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$ specifies a diagonal Gaussian encoder distribution.</p>
<p><strong>2. Protein-molecule binding affinity predictor.</strong> A regression model takes pre-trained UniRep protein sequence embeddings and molecule latent embeddings $z$ as input and predicts pIC50 binding affinity ($= -\log(\text{IC50})$). Because UniRep embeddings capture sequence, structural, and functional relationships from a large unsupervised corpus, the predictor can estimate binding affinity for novel target sequences not present in the training data.</p>
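<p>The pIC50 transform and the shape of the predictor&rsquo;s input can be sketched as follows; the embedding dimensions and the one-hidden-layer regression head are illustrative assumptions, not the paper&rsquo;s exact architecture:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def pic50(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 expressed in mol/L (10 nM -> ~8)."""
    return -np.log10(ic50_molar)

# Hypothetical dimensions: UniRep protein embedding (1900-d) concatenated
# with the VAE molecule latent z (128-d) forms the predictor input.
protein_emb = rng.standard_normal(1900)
z = rng.standard_normal(128)
x = np.concatenate([protein_emb, z])

# A stand-in one-hidden-layer regression head (random, untrained weights).
W1, b1 = rng.standard_normal((256, x.size)) * 0.01, np.zeros(256)
W2, b2 = rng.standard_normal(256) * 0.01, 0.0
pred_pic50 = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
```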
<p><strong>3. CLaSS controlled sampling.</strong> The Conditional Latent attribute Space Sampling scheme generates molecules satisfying multiple constraints (affinity, QED, selectivity) through rejection sampling in the VAE latent space:</p>
<p>$$p(\mathbf{x} | \mathbf{a}) = \mathbb{E}_{\mathbf{z}} [p(\mathbf{z} | \mathbf{a}) \, p(\mathbf{x} | \mathbf{z})] \approx \mathbb{E}_{\mathbf{z}} [\hat{p}_\xi(\mathbf{z} | \mathbf{a}) \, p_\theta(\mathbf{x} | \mathbf{z})]$$</p>
<p>where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ is a set of independent attribute constraints. The conditional density $\hat{p}_\xi(\mathbf{z} | \mathbf{a})$ is approximated using a Gaussian mixture model $Q_\xi(\mathbf{z})$ and per-attribute classifiers $q_\xi(a_i | \mathbf{z})$, with Bayes&rsquo; rule and conditional independence assumptions. The acceptance probability equals the product of all attribute predictor scores, enabling efficient multi-constraint sampling without surrogate model or policy learning.</p>
<p><strong>Selectivity modeling.</strong> Off-target selectivity for a molecule $m$ against target $T$ is defined as:</p>
<p>$$\text{Sel}_{T,m} = \text{BA}(T, m) - \frac{1}{k} \sum_{i=1}^{k} \text{BA}(T_i, m)$$</p>
<p>where $\text{BA}(T, m)$ is binding affinity to the target and $T_i$ are $k$ randomly selected off-targets. This selectivity score is incorporated as a control attribute during CLaSS sampling.</p>
<h2 id="experimental-setup-covid-19-targets-and-in-silico-screening">Experimental Setup: COVID-19 Targets and In Silico Screening</h2>
<p><strong>Target proteins.</strong> CogMol was applied to three SARS-CoV-2 targets not present in BindingDB: NSP9 Replicase dimer, Main Protease (Mpro), and the Receptor-Binding Domain (RBD) of the spike protein. A cancer target (human HDAC1) with low ligand coverage in the training data was also evaluated.</p>
<p><strong>Training data.</strong> The SMILES VAE was trained on the MOSES benchmark (1.6M molecules from ZINC). The binding affinity predictor used curated IC50 data from BindingDB as reported in DeepAffinity, with all protein classes included in training.</p>
<p><strong>CLaSS controlled generation.</strong> Molecules were generated with simultaneous constraints on binding affinity (&gt; 0.5 normalized), QED (&gt; 0.8 normalized), and selectivity (&gt; 0.5 normalized). Approximately 1000 molecules per target were selected for downstream evaluation.</p>
<p><strong>In silico screening pipeline.</strong> Generated molecules underwent:</p>
<ul>
<li>Toxicity prediction via a multi-task deep neural network (MT-DNN) on 12 Tox21 in vitro endpoints and ClinTox clinical trial failure</li>
<li>Binding affinity rescoring with a higher-accuracy SMILES-level predictor</li>
<li>Blind docking (5 independent runs per molecule) using AutoDock Vina against target protein structures</li>
<li>Synthetic feasibility assessment using a retrosynthetic algorithm based on the Molecular Transformer trained on patent reaction data</li>
</ul>
<p><strong>Baselines.</strong> VAE performance was benchmarked against models from the MOSES platform. CLaSS-accepted molecules were compared against randomly sampled molecules from the latent space. Generated molecules were compared against FDA-approved drugs for toxicity and synthesizability.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>CLaSS enrichment (Table 1).</strong> CLaSS consistently produced higher fractions of molecules meeting all criteria compared to random sampling. For the triple constraint (affinity &gt; 0.5, QED &gt; 0.8, selectivity &gt; 0.5), the enrichment was substantial: 6.9% vs. 0.7% for NSP9, 9.0% vs. 0.9% for RBD, and 10.4% vs. 1.1% for Mpro.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>CLaSS (Aff+QED+Sel)</th>
          <th>Random (Aff+QED+Sel)</th>
          <th>Enrichment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NSP9</td>
          <td>6.9%</td>
          <td>0.7%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>RBD</td>
          <td>9.0%</td>
          <td>0.9%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>Mpro</td>
          <td>10.4%</td>
          <td>1.1%</td>
          <td>~9.5x</td>
      </tr>
  </tbody>
</table>
<p><strong>Docking results (Table 3).</strong> 87-95% of high-affinity generated molecules showed docking binding free energy (BFE) below -6 kcal/mol, with minimum BFEs reaching -8.6 to -9.5 kcal/mol depending on the target.</p>
<p><strong>Novelty.</strong> The likelihood of generating an exact duplicate of a training molecule was 2% or less. Against the full PubChem database (~103M molecules), exact matches ranged from 3.7% to 9.5%. Generated molecules also exhibited novel chemical scaffolds, as indicated by a high Fr&eacute;chet ChemNet Distance.</p>
<p><strong>Synthesizability.</strong> Generated molecules for COVID-19 targets showed 85-90% synthetic feasibility using retrosynthetic analysis, exceeding the ~78% rate of FDA-approved drugs.</p>
<p><strong>Toxicity.</strong> Approximately 70% of generated parent molecules and ~80% of predicted metabolites were toxic in 0-1 endpoints out of 13, comparable to FDA-approved drugs.</p>
<h2 id="generated-molecules-show-favorable-binding-and-drug-like-properties">Generated Molecules Show Favorable Binding and Drug-Like Properties</h2>
<p>CogMol demonstrates that controlled latent space sampling with pre-trained protein embeddings can generate novel, drug-like molecules for unseen viral targets. The key findings are:</p>
<ol>
<li>CLaSS provides roughly 10x enrichment over random latent space sampling for molecules satisfying all three constraints (affinity, QED, selectivity).</li>
<li>Generated molecules bind favorably to druggable pockets in target protein 3D structures, even though the generation model uses only 1D sequence information.</li>
<li>Some generated SMILES matched existing PubChem molecules with known biological activity, suggesting the model identifies chemically relevant regions of molecular space.</li>
<li>The framework generalizes across targets of varying novelty, with Mpro (more similar to training proteins) yielding easier generation than NSP9 or RBD.</li>
</ol>
<p><strong>Limitations.</strong> The authors note that no wet-lab validation was performed on generated candidates. There may be divergence between ML-predicted properties and experimental measurements. The binding affinity predictor&rsquo;s accuracy is bounded by the quality and coverage of BindingDB training data. Selectivity modeling uses a random sample of off-targets rather than a pharmacologically curated panel.</p>
<p><strong>Future directions.</strong> The authors propose incorporating additional contexts beyond target protein (e.g., metabolic pathways), adding more pharmacologically relevant controls, and weighting objectives by relative importance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VAE pre-training</td>
          <td>MOSES/ZINC</td>
          <td>1.6M train, 176K test</td>
          <td>Publicly available benchmark</td>
      </tr>
      <tr>
          <td>VAE adaptive training</td>
          <td>BindingDB (DeepAffinity split)</td>
          <td>~27K protein-ligand pairs</td>
          <td>Curated IC50 data</td>
      </tr>
      <tr>
          <td>Protein embeddings</td>
          <td>UniRef50 via UniRep</td>
          <td>24M sequences</td>
          <td>Pre-trained, publicly available</td>
      </tr>
      <tr>
          <td>Toxicity prediction</td>
          <td>Tox21 + ClinTox</td>
          <td>12 in vitro + clinical endpoints</td>
          <td>Public benchmark datasets</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>3 SARS-CoV-2 targets</td>
          <td>Public crystal structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>VAE architecture: SMILES encoder-decoder with diagonal Gaussian latent space, jointly trained with QED and SA regressors</li>
<li>CLaSS: rejection sampling from Gaussian mixture model of latent space with per-attribute classifiers</li>
<li>Binding affinity: regression on concatenated UniRep protein embeddings and VAE molecule embeddings</li>
<li>Selectivity: excess binding affinity over average of $k$ random off-targets</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>SMILES VAE with adaptive pre-training (ZINC then BindingDB)</li>
<li>Multi-task toxicity classifier (MT-DNN) for Tox21 and ClinTox endpoints</li>
<li>Binding affinity predictor (latent-level for generation, SMILES-level for screening)</li>
<li>Retrosynthetic predictor based on Molecular Transformer</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>90%</td>
          <td>-</td>
          <td>Generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99%</td>
          <td>-</td>
          <td>Among valid molecules</td>
      </tr>
      <tr>
          <td>Filter pass</td>
          <td>95%</td>
          <td>-</td>
          <td>Relevant chemical filters</td>
      </tr>
      <tr>
          <td>Docking BFE &lt; -6 kcal/mol</td>
          <td>87-95%</td>
          <td>-</td>
          <td>Varies by target</td>
      </tr>
      <tr>
          <td>Synthetic feasibility</td>
          <td>85-90%</td>
          <td>78% (FDA drugs)</td>
          <td>COVID-19 targets</td>
      </tr>
      <tr>
          <td>Low toxicity (0-1 endpoints)</td>
          <td>~70% parent, ~80% metabolite</td>
          <td>Comparable to FDA drugs</td>
          <td>MT-DNN prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU types or training times. The work was funded internally by IBM Research.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">CogMol (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">~3500 generated molecules</a></td>
          <td>Dataset</td>
          <td>Open license</td>
          <td>For three SARS-CoV-2 targets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chenthamarakshan, V., Das, P., Hoffman, S. C., Strobelt, H., Padhi, I., Lim, K. W., Hoover, B., Manica, M., Born, J., Laino, T., &amp; Mojsilovic, A. (2020). CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. <em>Advances in Neural Information Processing Systems</em>, 33, 4320-4332.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chenthamarakshan2020cogmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chenthamarakshan, Vijil and Das, Payel and Hoffman, Samuel C. and Strobelt, Hendrik and Padhi, Inkit and Lim, Kar Wai and Hoover, Benjamin and Manica, Matteo and Born, Jannis and Laino, Teodoro and Mojsilovi{\&#39;c}, Aleksandra}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4320--4332}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BindGPT: GPT for 3D Molecular Design and Docking</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</guid><description>BindGPT applies GPT-style language modeling to 3D molecular generation using SMILES+XYZ tokenization, pre-training, and RL-based docking optimization.</description><content:encoded><![CDATA[<h2 id="a-language-model-for-joint-3d-molecular-graph-and-conformation-generation">A Language Model for Joint 3D Molecular Graph and Conformation Generation</h2>
<p>BindGPT is a <strong>Method</strong> paper that introduces a GPT-based language model for generating 3D molecular structures. The primary contribution is a unified framework that jointly produces molecular graphs (via SMILES) and 3D coordinates (via XYZ tokens) within a single autoregressive model. This eliminates the need for external graph reconstruction tools like OpenBabel, which are error-prone when applied to noisy atom positions. The same pre-trained model serves as a 3D molecular generative model, a conformer generator conditioned on molecular graphs, and a pocket-conditioned 3D molecule generator.</p>
<h2 id="the-graph-reconstruction-problem-in-3d-molecular-generation">The Graph Reconstruction Problem in 3D Molecular Generation</h2>
<p>Most existing 3D molecular generators focus on predicting atom types and positions, relying on supplementary software (e.g., OpenBabel or RDKit) to reconstruct molecular bonds from predicted coordinates. This introduces a fragile dependency: small positional errors can drastically change the reconstructed molecular graph or produce disconnected structures. Additionally, while diffusion models and equivariant GNNs have shown strong results for 3D molecular generation, they often depend on SE(3) equivariance inductive biases and are computationally expensive at sampling time (up to $10^6$ seconds to generate 1000 valid molecules in the case of EDM). The pocket-conditioned generation task is further limited by the small size of available 3D binding pose datasets (e.g., CrossDocked), making it difficult for specialized models to generalize without large-scale pre-training.</p>
<h2 id="smilesxyz-tokenization-jointly-encoding-graphs-and-coordinates">SMILES+XYZ Tokenization: Jointly Encoding Graphs and Coordinates</h2>
<p>The core innovation in BindGPT is coupling SMILES notation with XYZ coordinate format in a single token sequence. The sequence starts with a <code>&lt;LIGAND&gt;</code> token, followed by character-level SMILES tokens encoding the molecular graph, then an <code>&lt;XYZ&gt;</code> token marking the transition to coordinate data. Each 3D atom position is encoded using 6 tokens (integer and fractional parts for each of the three coordinates). The atom ordering is synchronized between SMILES and XYZ, so atom symbols from SMILES are not repeated in the coordinate section.</p>
<p>For protein pockets, sequences begin with a <code>&lt;POCKET&gt;</code> token followed by atom names and coordinates. Following AlphaFold&rsquo;s approach, only alpha-carbon coordinates are retained to keep pocket representations compact.</p>
<p>The model uses the GPT-NeoX architecture with rotary position embeddings (RoPE), which enables length generalization between pre-training and fine-tuning where sequence lengths differ substantially. The pre-trained model has 108M parameters with 15 layers, 12 attention heads, and a hidden dimension of 768.</p>
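<p>A simplified encoder for this sequence layout might look as follows; the exact token vocabulary and decimal precision are assumptions for illustration, not BindGPT&rsquo;s actual tokenizer:</p>

```python
def encode_ligand(smiles, coords, frac_digits=2):
    """Serialize a molecule as <LIGAND> SMILES chars <XYZ> coordinate tokens.

    Each coordinate contributes 2 tokens (integer part, fractional part),
    i.e. 6 tokens per atom; atom order matches the SMILES atom order, so
    element symbols are not repeated in the XYZ section.
    """
    tokens = ["<LIGAND>"] + list(smiles) + ["<XYZ>"]
    for x, y, z in coords:
        for c in (x, y, z):
            sign = "-" if c < 0 else ""
            ip, fp = divmod(abs(c), 1.0)
            tokens.append(f"{sign}{int(ip)}")
            tokens.append(f"{round(fp * 10**frac_digits):0{frac_digits}d}")
    return tokens

# Methanol with two (made-up) heavy-atom positions: 4 header/SMILES tokens
# plus 6 coordinate tokens per atom.
toks = encode_ligand("CO", [(0.0, 0.0, 0.0), (1.43, 0.0, 0.0)])
```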
<h3 id="pre-training-on-large-scale-3d-data">Pre-training on Large-Scale 3D Data</h3>
<p>Pre-training uses the Uni-Mol dataset containing 208M conformations for 12M molecules and 3.2M protein pocket structures. Each training batch contains either ligand sequences or pocket sequences (not mixed within a sequence). Since pockets are far fewer than ligands, the training schedule runs 5 pocket epochs per ligand epoch, resulting in roughly 8% pocket tokens overall. Training uses large batches of 1.6M tokens per step with Flash Attention and DeepSpeed optimizations.</p>
<h3 id="supervised-fine-tuning-with-augmentation">Supervised Fine-Tuning with Augmentation</h3>
<p>For pocket-conditioned generation, BindGPT is fine-tuned on CrossDocked 2020, which contains aligned pocket-ligand pairs. Unlike prior work that subsamples less than 1% of the best pairs, BindGPT uses all intermediate ligand poses (including lower-quality ones), yielding approximately 27M pocket-ligand pairs. To combat overfitting on the limited diversity (14k unique molecules, 3k pockets), two augmentation strategies are applied:</p>
<ol>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a></strong>: Each molecule can yield 100-1000 different valid SMILES strings</li>
<li><strong>Random 3D rotation</strong>: The same rotation matrix is applied to both pocket and ligand coordinates</li>
</ol>
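<p>The second augmentation can be sketched with NumPy: draw one uniformly random rotation and apply it to both point clouds, so the relative binding pose is unchanged while the absolute frame varies (the coordinate shapes below are arbitrary):</p>

```python
import numpy as np

def random_rotation(rng):
    """Uniform random 3x3 rotation via QR decomposition of a Gaussian."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q *= np.sign(np.diag(r))      # fix column signs to make QR unique
    if np.linalg.det(q) < 0:      # ensure a proper rotation (det = +1)
        q[:, 0] *= -1
    return q

def augment(pocket_xyz, ligand_xyz, rng):
    """Apply the SAME rotation to pocket and ligand coordinates."""
    rot = random_rotation(rng)
    return pocket_xyz @ rot.T, ligand_xyz @ rot.T

rng = np.random.default_rng(0)
pocket = rng.standard_normal((10, 3))   # e.g. alpha-carbon positions
ligand = rng.standard_normal((4, 3))    # e.g. heavy-atom positions
p2, l2 = augment(pocket, ligand, rng)
```

Because both point clouds are rotated together, all pocket-ligand distances are preserved, which is what makes this a pose-preserving augmentation rather than a perturbation.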
<p>During fine-tuning, the pocket token sequence is concatenated before the ligand token sequence. An optional variant conditions on binding energy scores from the CrossDocked dataset, enabling contrastive learning between good and bad binding examples.</p>
<h3 id="reinforcement-learning-with-docking-feedback">Reinforcement Learning with Docking Feedback</h3>
<p>BindGPT applies REINFORCE (not PPO or <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, which were found less stable) to further optimize pocket-conditioned generation. On each RL step, the model generates 3D ligand structures for a batch of random protein pockets, computes binding energy rewards using QVINA docking software, and updates model parameters. A KL-penalty between the current model and the SFT initialization stabilizes training.</p>
<p>The RL objective can be written as:</p>
<p>$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \pi_\theta}\left[ R(x) \right] + \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{SFT}})$$</p>
<p>where $R(x)$ is the docking reward from QVINA and $\beta$ controls the strength of the KL regularization.</p>
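<p>A toy surrogate loss for this objective on a sampled batch (with a mean-reward baseline added for variance reduction, a standard REINFORCE detail assumed here rather than stated in the summary above):</p>

```python
def reinforce_kl_loss(logp_current, logp_sft, rewards, beta=0.1):
    """Batch surrogate loss for REINFORCE with a KL penalty to the SFT model.

    logp_current / logp_sft are log-probabilities of the sampled sequences
    under the current policy and the frozen SFT policy; the KL term uses
    the standard single-sample estimator E[log pi_theta - log pi_sft].
    """
    n = len(rewards)
    baseline = sum(rewards) / n
    # Policy-gradient term: -E[(R - baseline) * log pi_theta(x)].
    pg = -sum((r - b_lp) * 0 + (r - baseline) * lp * -1
              for r, lp, b_lp in zip(rewards, logp_current, logp_current)) / n
    pg = -sum((r - baseline) * lp for r, lp in zip(rewards, logp_current)) / n
    kl = sum(lc - ls for lc, ls in zip(logp_current, logp_sft)) / n
    return pg + beta * kl

loss = reinforce_kl_loss(
    logp_current=[-2.0, -3.0], logp_sft=[-2.5, -2.5],
    rewards=[1.0, 0.0], beta=0.1,
)
```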
<h2 id="experimental-evaluation-across-three-3d-generation-tasks">Experimental Evaluation Across Three 3D Generation Tasks</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations (12M molecules) + 3.2M pockets</td>
          <td>Large-scale 3D molecular dataset</td>
      </tr>
      <tr>
          <td>Fine-tuning (SFT)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>14k molecules x 3k pockets, includes all pose qualities</td>
      </tr>
      <tr>
          <td>Fine-tuning (conformer)</td>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM-DRUGS</a></td>
          <td>27M conformations for 300k molecules</td>
          <td>Standard benchmark for 3D conformer generation</td>
      </tr>
      <tr>
          <td>Evaluation (conformer)</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot evaluation holdout</td>
      </tr>
      <tr>
          <td>Evaluation (pocket)</td>
          <td>CrossDocked holdout</td>
          <td>100 pockets</td>
          <td>Held out from training</td>
      </tr>
  </tbody>
</table>
<h3 id="task-1-3d-molecule-generation-pre-training">Task 1: 3D Molecule Generation (Pre-training)</h3>
<p>Compared against XYZ-Transformer (the only other model capable of large-scale pre-training), BindGPT achieves 98.58% validity (vs. 12.87% for XYZ-TF without hydrogens), higher SA (0.77 vs. 0.21), QED (0.59 vs. 0.30), and Lipinski scores (4.86 vs. 4.79). BindGPT also produces conformations with RMSD of 0.89 (XYZ-TF&rsquo;s RMSD calculation failed to converge). Generation is 12x faster (13s vs. 165s for 1000 molecules).</p>
<h3 id="task-2-3d-molecule-generation-fine-tuned-on-geom-drugs">Task 2: 3D Molecule Generation (Fine-tuned on GEOM-DRUGS)</h3>
<p>Against EDM and MolDiff (diffusion baselines), BindGPT outperforms on nearly all 3D distributional metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>EDM</th>
          <th>MolDiff</th>
          <th>BindGPT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JS bond lengths</td>
          <td>0.246</td>
          <td>0.365</td>
          <td><strong>0.029</strong></td>
      </tr>
      <tr>
          <td>JS bond angles</td>
          <td>0.282</td>
          <td>0.155</td>
          <td><strong>0.075</strong></td>
      </tr>
      <tr>
          <td>JS dihedral angles</td>
          <td>0.328</td>
          <td>0.162</td>
          <td><strong>0.098</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond types</td>
          <td>0.378</td>
          <td>0.163</td>
          <td><strong>0.045</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond pairs</td>
          <td>0.396</td>
          <td>0.136</td>
          <td><strong>0.043</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond triplets</td>
          <td>0.449</td>
          <td>0.125</td>
          <td><strong>0.042</strong></td>
      </tr>
      <tr>
          <td>Time (1000 molecules)</td>
          <td>1.4e6 s</td>
          <td>7500 s</td>
          <td><strong>200 s</strong></td>
      </tr>
  </tbody>
</table>
<p>BindGPT is two orders of magnitude faster than diffusion baselines while producing more accurate 3D geometries. MolDiff achieves better drug-likeness scores (QED, SA), but the authors argue 3D distributional metrics are more relevant for evaluating 3D structure fidelity.</p>
<h3 id="task-3-pocket-conditioned-molecule-generation">Task 3: Pocket-Conditioned Molecule Generation</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score</th>
          <th>SA</th>
          <th>QED</th>
          <th>Lipinski</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-7.15 +/- 4.89</td>
          <td>0.75</td>
          <td>0.57</td>
          <td>4.88</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-7.80 +/- 3.61</td>
          <td>0.58</td>
          <td>0.48</td>
          <td>4.51</td>
      </tr>
      <tr>
          <td>BindGPT-FT</td>
          <td>-5.44 +/- 2.09</td>
          <td>0.78</td>
          <td>0.50</td>
          <td>4.72</td>
      </tr>
      <tr>
          <td>BindGPT-RFT</td>
          <td>-7.24 +/- 1.68</td>
          <td>0.74</td>
          <td>0.48</td>
          <td>4.32</td>
      </tr>
      <tr>
          <td>BindGPT-RL</td>
          <td><strong>-8.60 +/- 1.90</strong></td>
          <td><strong>0.84</strong></td>
          <td>0.43</td>
          <td>4.81</td>
      </tr>
  </tbody>
</table>
<p>The RL-fine-tuned model achieves the best Vina binding scores (-8.60 vs. -7.80 for TargetDiff) with lower variance and the highest SA score (0.84). The SFT-only model (BindGPT-FT) underperforms baselines on binding score, demonstrating that RL is essential for strong pocket-conditioned generation. QED is lower for BindGPT-RL, but the authors note that QED could be included in the RL reward and was excluded for fair comparison.</p>
<h3 id="conformer-generation">Conformer Generation</h3>
<p>On the Platinum dataset (zero-shot), BindGPT matches the performance of Torsional Diffusion (the specialized state-of-the-art) when assisted by RDKit, with a small gap without RDKit assistance. Uni-Mol fails to generalize to this dataset despite pre-training on the same Uni-Mol data.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BindGPT demonstrates that a simple autoregressive language model without equivariance inductive biases can match or surpass specialized diffusion models and GNNs across multiple 3D molecular generation tasks. The key findings include:</p>
<ol>
<li><strong>Joint SMILES+XYZ generation eliminates graph reconstruction errors</strong>, achieving 98.58% validity compared to 12.87% for XYZ-Transformer</li>
<li><strong>Large-scale pre-training is critical for pocket-conditioned generation</strong>, as none of the baselines use pre-training and instead rely on heavy inductive biases</li>
<li><strong>RL fine-tuning with docking feedback substantially improves binding affinity</strong> beyond what SFT alone achieves</li>
<li><strong>Sampling is two orders of magnitude faster</strong> than diffusion baselines (200s vs. 1.4M s for EDM)</li>
</ol>
<p>Limitations include the relatively modest model size (108M parameters), with the authors finding this sufficient for current tasks but not exploring larger scales. The RL optimization uses only Vina score as reward; multi-objective optimization incorporating SA, QED, and other properties is left as future work. The model also relies on character-level SMILES tokenization rather than more sophisticated chemical tokenizers. BindGPT is the first model to explicitly generate hydrogens at scale, though validity drops from 98.58% to 77.33% when hydrogens are included.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations, 12M molecules, 3.2M pockets</td>
          <td>From Zhou et al. (2023)</td>
      </tr>
      <tr>
          <td>SFT (pocket)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>Full version including low-quality poses</td>
      </tr>
      <tr>
          <td>SFT (conformer)</td>
          <td>GEOM-DRUGS</td>
          <td>27M conformations, 300k molecules</td>
          <td>Standard benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot holdout</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-NeoX with rotary position embeddings (RoPE)</li>
<li><strong>Pre-training</strong>: Causal language modeling with 1.6M tokens per batch</li>
<li><strong>SFT augmentation</strong>: SMILES randomization + random 3D rotation</li>
<li><strong>RL</strong>: REINFORCE with KL-penalty from SFT initialization; QVINA docking as reward</li>
</ul>
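<p>The RL components above can be sketched as REINFORCE with KL-shaped rewards. This is an illustrative reconstruction under stated assumptions (the shaping coefficient <code>beta</code> and the exact placement of the KL term are not given in the main text):</p>

```python
# Hedged sketch of REINFORCE with a KL penalty toward the frozen SFT model.
# The reward (e.g., a negated QVINA docking score) is shaped by subtracting
# a scaled KL estimate; BindGPT's exact formulation may differ.
def reinforce_kl_loss(logp_agent, logp_ref, reward, beta=0.1):
    """Per-sequence loss.

    logp_agent, logp_ref -- per-token log-probs under the agent and the
                            frozen SFT reference model.
    reward               -- scalar terminal reward for the generated ligand.
    """
    # Sequence-level KL estimate: sum of per-token log-prob differences.
    kl = sum(a - r for a, r in zip(logp_agent, logp_ref))
    shaped = reward - beta * kl
    # REINFORCE: push up the log-likelihood of actions, weighted by reward.
    return -shaped * sum(logp_agent)
```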
<h3 id="models">Models</h3>
<ul>
<li><strong>Size</strong>: 108M parameters, 15 layers, 12 heads, hidden size 768</li>
<li><strong>Vocabulary</strong>: Character-level SMILES tokens + special tokens (<code>&lt;LIGAND&gt;</code>, <code>&lt;POCKET&gt;</code>, <code>&lt;XYZ&gt;</code>) + coordinate tokens (6 per 3D position)</li>
</ul>
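<p>The main text states that each 3D position uses 6 coordinate tokens but does not spell out the scheme. A plausible sketch (an assumption, not the paper's definition) splits each axis value into a signed integer token and a two-digit fractional token:</p>

```python
# Illustrative coordinate tokenization: 2 tokens per axis (signed integer
# part, two-digit fraction) gives 6 tokens per (x, y, z) position.
# This exact split is an assumption; BindGPT's scheme may differ.
def tokenize_xyz(x, y, z):
    tokens = []
    for v in (x, y, z):
        whole, frac = f"{v:+.2f}".split(".")  # e.g. -0.50 -> "-0", "50"
        tokens.extend([whole, frac])
    return tokens
```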
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Validity, SA, QED, Lipinski</strong>: Standard drug-likeness metrics</li>
<li><strong>Jensen-Shannon divergences</strong>: Distribution-level 3D structural metrics (bond lengths, angles, dihedrals, bond types)</li>
<li><strong>RMSD</strong>: Alignment quality of generated conformations vs. RDKit reference</li>
<li><strong>RMSD-Coverage</strong>: CDF of RMSD between generated and reference conformers</li>
<li><strong>Vina score</strong>: Binding energy from QVINA docking software</li>
</ul>
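<p>The Jensen-Shannon metrics compare histograms of geometric features (bond lengths, angles, dihedrals) between generated and reference molecules. A minimal implementation over normalized discrete distributions (the binning strategy is assumed):</p>

```python
import math

# Jensen-Shannon divergence (base 2, so values lie in [0, 1]) between two
# normalized histograms, as used for the 3D distributional metrics.
def js_divergence(p, q, eps=1e-12):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    def kl(a, b):
        return sum(ai * math.log2(ai / (bi + eps))
                   for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

<p>Lower is better: identical histograms give 0, disjoint histograms give 1.</p>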
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training and fine-tuning use Flash Attention and DeepSpeed for efficiency</li>
<li>Specific GPU counts and training times are described in Appendix G (not available in the main text)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://bindgpt.github.io/">Project Page</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Project website with additional details</td>
      </tr>
  </tbody>
</table>
<p>No public code repository or pre-trained model weights were identified. The project website exists but no source code has been released as of this writing.</p>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The paper provides detailed architecture specs and hyperparameters, but no public code or model weights are available. All training datasets (Uni-Mol, CrossDocked, GEOM-DRUGS) are publicly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zholus, A., Kuznetsov, M., Schutski, R., Shayakhmetov, R., Polykovskiy, D., Chandar, S., &amp; Zhavoronkov, A. (2025). BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(24), 26083-26091. <a href="https://doi.org/10.1609/aaai.v39i24.34804">https://doi.org/10.1609/aaai.v39i24.34804</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zholus2025bindgpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zholus, Artem and Kuznetsov, Maksim and Schutski, Roman and Shayakhmetov, Rim and Polykovskiy, Daniil and Chandar, Sarath and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{26083--26091}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i24.34804}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Avoiding Failure Modes in Goal-Directed Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</guid><description>Langevin et al. show that apparent failure modes in goal-directed molecular generation stem from QSAR model disagreement, not algorithmic flaws.</description><content:encoded><![CDATA[<h2 id="reinterpreting-goal-directed-generation-failures-as-qsar-model-issues">Reinterpreting Goal-Directed Generation Failures as QSAR Model Issues</h2>
<p>This is an <strong>Empirical</strong> study that challenges a widely cited finding about failure modes in goal-directed molecular generation. <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">Renz et al. (2019)</a> had shown that when molecules are optimized against a machine learning scoring function, control models trained on the same data distribution assign much lower scores to the generated molecules. This was interpreted as evidence that generation algorithms exploit model-specific biases. Langevin et al. demonstrate that this divergence is already present in the original data distribution and is attributable to disagreement among the QSAR classifiers, not to flaws in the generation algorithms themselves.</p>
<h2 id="why-qsar-model-agreement-matters-for-molecular-generation">Why QSAR Model Agreement Matters for Molecular Generation</h2>
<p>Goal-directed generation uses a scoring function (typically a <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> model) to guide the design of molecules that maximize predicted activity. In the experimental framework from Renz et al., three Random Forest classifiers are trained: an optimization model $C_{opt}$ on Split 1, a model control $C_{mc}$ on Split 1 with a different random seed, and a data control $C_{dc}$ on Split 2. Each returns a confidence score ($S_{opt}$, $S_{mc}$, $S_{dc}$). The expectation is that molecules with high $S_{opt}$ should also score highly under $S_{mc}$ and $S_{dc}$, since all three models are trained on the same data distribution for the same target.</p>
<p>Renz et al. observed that during optimization, $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$, reaching substantially lower values. This was interpreted as goal-directed generation exploiting biases unique to the optimization model. The recommendation was to halt generation when control scores stop increasing, requiring a held-out dataset for a control model, which may not be feasible in low-data regimes.</p>
<p>The key insight of Langevin et al. is that nobody had checked whether this score disagreement existed before generation even began. If the classifiers already disagree on high-scoring molecules in the original dataset, the divergence during generation is expected behavior, not evidence of algorithmic failure.</p>
<h2 id="pre-existing-classifier-disagreement-explains-the-divergence">Pre-Existing Classifier Disagreement Explains the Divergence</h2>
<p>The core contribution is showing that the gap between optimization and control scores is a property of the QSAR models, not of the generation algorithms.</p>
<p>The authors introduce a held-out test set (10% of the data, used for neither training split) and augment it via Topliss tree enumeration to produce structural analogs for smoother statistical estimates. On this held-out set, they compute the Mean Average Difference (MAD) between $S_{opt}$ and control scores as a function of $S_{opt}$:</p>
<p>$$
\text{MAD}(x) = \frac{1}{|\{i : S_{opt}(x_i) \geq x\}|} \sum_{S_{opt}(x_i) \geq x} |S_{opt}(x_i) - S_{dc}(x_i)|
$$</p>
<p>On the three original datasets (DRD2, EGFR, JAK2), the MAD between $S_{opt}$ and $S_{dc}$ grows substantially with $S_{opt}$, reaching approximately 0.3 for the highest-scoring molecules. For EGFR, even the top molecules (with $S_{opt}$ between 0.5 and 0.6) have $S_{dc}$ below 0.2. This disagreement exists entirely within the original data distribution, before any generative algorithm is applied.</p>
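<p>The MAD statistic is straightforward to compute from paired scores on the held-out set; a direct sketch of the formula above (variable names assumed):</p>

```python
# Mean Average Difference between optimization and control scores,
# restricted to molecules with optimization score at or above threshold x.
def mad_above(s_opt, s_ctrl, x):
    pairs = [(o, c) for o, c in zip(s_opt, s_ctrl) if o >= x]
    if not pairs:
        return 0.0  # no molecules clear the threshold
    return sum(abs(o - c) for o, c in pairs) / len(pairs)
```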
<p>The authors formalize this with tolerance intervals. At each generation time step $t$, the distribution of optimization scores is $P_t[S_{opt}(x)]$. From the held-out set, the conditional distributions $P[S_{dc}(x) | S_{opt}(x)]$ and $P[S_{mc}(x) | S_{opt}(x)]$ are estimated empirically. The expected control scores at time $t$ are then:</p>
<p>$$
\mathbb{E}_t[S_{dc}] = \int \mathbb{E}[S_{dc}(x) | S_{opt}(x)] \cdot P_t[S_{opt}(x)] \, dS_{opt}
$$</p>
<p>By sampling from these distributions, the authors construct 95% tolerance intervals for the expected control scores at each time step. The observed trajectories of $S_{mc}$ and $S_{dc}$ during generation fall within these intervals, demonstrating that the divergence is fully explained by pre-existing classifier disagreement.</p>
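<p>The tolerance-interval construction can be approximated by binning the held-out pairs and resampling control scores for each generated molecule. A simplified sketch (scores assumed in $[0, 1]$; 25 equal bins as in the paper; the resampling count is an assumption):</p>

```python
import random

# Empirical 95% tolerance interval for the mean control score of a generated
# batch, from held-out (S_opt, S_dc) pairs. Simplified reconstruction; the
# authors' exact procedure may differ.
def expected_control_interval(held_out, generated_opt, n_bins=25,
                              n_resamples=1000, seed=0):
    rng = random.Random(seed)
    # Bin held-out control scores by optimization score.
    bins = [[] for _ in range(n_bins)]
    for s_opt, s_dc in held_out:
        bins[min(int(s_opt * n_bins), n_bins - 1)].append(s_dc)
    # Repeatedly draw a control score for each generated molecule from the
    # matching bin (skipping empty bins) and record the batch mean.
    means = []
    for _ in range(n_resamples):
        draws = [rng.choice(bins[min(int(s * n_bins), n_bins - 1)])
                 for s in generated_opt
                 if bins[min(int(s * n_bins), n_bins - 1)]]
        if draws:
            means.append(sum(draws) / len(draws))
    means.sort()
    # Central 95% of the resampled batch means.
    return means[int(0.025 * len(means))], means[int(0.975 * len(means)) - 1]
```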
<h2 id="experimental-setup-original-reproduction-and-corrected-experiments">Experimental Setup: Original Reproduction and Corrected Experiments</h2>
<h3 id="reproduction-of-renz-et-al">Reproduction of Renz et al.</h3>
<p>The original experimental framework uses three datasets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> (842 molecules, 59 actives), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> (842 molecules, 40 actives), and <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> (667 molecules, 140 actives). These are small, noisy, and chemically diverse. Three goal-directed generation algorithms are tested:</p>
<table>
  <thead>
      <tr>
          <th>Algorithm</th>
          <th>Type</th>
          <th>Mechanism</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph GA</td>
          <td>Genetic algorithm on molecular graphs</td>
          <td>Mutation and crossover of molecular graphs</td>
      </tr>
      <tr>
          <td>SMILES-LSTM</td>
          <td>Recurrent neural network</td>
          <td>Hill-climbing fine-tuning on best molecules</td>
      </tr>
      <tr>
          <td>MSO</td>
          <td>Particle swarm in CDDD latent space</td>
          <td>Multiple swarm optimization</td>
      </tr>
  </tbody>
</table>
<p>All algorithms are run for 151 epochs with 10 runs each. The reproduction confirms the findings of Renz et al.: $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$ during optimization.</p>
<h3 id="tolerance-interval-analysis">Tolerance interval analysis</h3>
<p>The held-out set is augmented using Topliss tree enumeration on phenyl rings, providing structural analogs that are reasonable from a medicinal chemistry perspective. The optimization score range is divided into 25 equal bins, and for each molecule at each time step, 10 samples from the conditional control score distribution are drawn to construct empirical tolerance intervals.</p>
<h3 id="corrected-experiments-with-adequate-models">Corrected experiments with adequate models</h3>
<p>To test whether generation algorithms actually exploit biases when the classifiers agree, the authors construct two tasks where optimization and control models correlate well:</p>
<ol>
<li><strong>ALDH1 dataset</strong>: 464 molecules from LIT-PCBA, split using similarity-based pairing to maximize intra-pair chemical similarity. This ensures both splits sample similar chemistry.</li>
<li><strong>Modified JAK2</strong>: The same JAK2 dataset but with Random Forest hyperparameters adjusted (200 trees instead of 100, minimum 3 samples per leaf instead of 1) to reduce overfitting to spurious correlations.</li>
</ol>
<p>In both cases, $S_{opt}$, $S_{mc}$, and $S_{dc}$ agree well on the held-out test set. The starting population for generation is set to the held-out test set (rather than random ChEMBL molecules) to avoid building in a distribution shift.</p>
<h2 id="findings-no-algorithmic-failure-when-models-agree">Findings: No Algorithmic Failure When Models Agree</h2>
<p>On the corrected experimental setups (ALDH1 and modified JAK2), there is no major divergence between optimization and control scores during generation. The three algorithms produce molecules that score similarly under all three classifiers.</p>
<p>Key findings:</p>
<ol>
<li>
<p><strong>Pre-existing disagreement explains divergence</strong>: On all three original datasets, the divergence between $S_{opt}$ and control scores during generation falls within the tolerance intervals predicted from the initial data distribution alone. The generation algorithms are not exploiting model-specific biases beyond what already exists in the data.</p>
</li>
<li>
<p><strong>Split similarity bias is also pre-existing</strong>: Renz et al. observed that generated molecules are more similar to Split 1 (used to train $C_{opt}$) than Split 2. The authors show this bias is already present in the top-5 percentile of the held-out set: on EGFR and DRD2, high-scoring molecules are inherently more similar to Split 1.</p>
</li>
<li>
<p><strong>Appropriate model design resolves the issue</strong>: When Random Forest hyperparameters are chosen to avoid overfitting (more trees, higher minimum samples per leaf), or when data splits are constructed to be chemically balanced, the classifiers agree and the generation algorithms behave as expected.</p>
</li>
<li>
<p><strong>Quality problems remain independent</strong>: Even when optimization and control scores align, the generated molecules can still be poor drug candidates (unreactive, unsynthesizable, containing unusual fragments). The score divergence issue and the chemical quality issue are separate problems.</p>
</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations acknowledged by the authors</h3>
<ul>
<li>The study focuses on Random Forest classifiers with ECFP fingerprints. The behavior of other model types (e.g., graph neural networks) and descriptor types is not fully explored, though supplementary results show similar patterns with physico-chemical descriptors and Atom-Pair fingerprints.</li>
<li>The corrected ALDH1 task uses a relatively small dataset (464 molecules) with careful split construction. Scaling this approach to larger, more heterogeneous datasets is not demonstrated.</li>
<li>The authors note that their results do not prove generation algorithms never exploit biases; they show that the specific evidence from Renz et al. can be explained without invoking algorithmic failure.</li>
<li>The problem of low-quality generated molecules (poor synthesizability, unusual fragments) remains unresolved and is acknowledged as an open question.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Original tasks</td>
          <td>DRD2, EGFR, JAK2</td>
          <td>842, 842, 667 molecules</td>
          <td>Extracted from ChEMBL; small with few actives</td>
      </tr>
      <tr>
          <td>New task</td>
          <td>ALDH1</td>
          <td>464 molecules (173 with purine substructure)</td>
          <td>Extracted from LIT-PCBA; similarity-based split</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>Topliss tree analogs</td>
          <td>~10x augmentation of held-out set</td>
          <td>Structural analogs via phenyl ring enumeration</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Three goal-directed generation algorithms from the original Renz et al. study:</p>
<ul>
<li><strong>Graph GA</strong>: Genetic algorithm on molecular graphs (Jensen, 2019)</li>
<li><strong>SMILES-LSTM</strong>: Hill-climbing on LSTM-generated SMILES (Segler et al., 2018)</li>
<li><strong>MSO</strong>: Multi-Swarm Optimization in CDDD latent space (Winter et al., 2019)</li>
</ul>
<p>All run for 151 epochs, 10 runs each.</p>
<h3 id="models">Models</h3>
<p>Random Forest classifiers (scikit-learn) with:</p>
<ul>
<li>ECFP fingerprints (radius 2, 1024 bits, RDKit)</li>
<li>Default parameters for original tasks</li>
<li>Modified parameters for JAK2 correction: 200 trees, min 3 samples per leaf</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Purpose</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean Average Difference (MAD)</td>
          <td>Measures disagreement between optimization and control scores</td>
          <td>Computed as function of $S_{opt}$ on held-out set</td>
      </tr>
      <tr>
          <td>95% tolerance intervals</td>
          <td>Expected range of control scores given optimization scores</td>
          <td>Empirical, constructed from held-out set</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Split bias assessment</td>
          <td>Morgan fingerprints, radius 2, 1024 bits</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classifier predictive performance</td>
          <td>Used to verify models have comparable accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Sanofi-Public/IDD-papers-avoiding_failure_modes">Code and datasets</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Fork of Renz et al. codebase with modifications</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Langevin, M., Vuilleumier, R., &amp; Bianciotto, M. (2022). Explaining and avoiding failure modes in goal-directed generation of small molecules. <em>Journal of Cheminformatics</em>, 14, 20. <a href="https://doi.org/10.1186/s13321-022-00601-y">https://doi.org/10.1186/s13321-022-00601-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{langevin2022explaining,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Explaining and avoiding failure modes in goal-directed generation of small molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Langevin, Maxime and Vuilleumier, Rodolphe and Bianciotto, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00601-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Augmented Hill-Climb for RL-Based Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</guid><description>Augmented Hill-Climb combines REINVENT and Hill-Climb RL strategies to improve sample efficiency ~45-fold for SMILES-based de novo molecule generation.</description><content:encoded><![CDATA[<h2 id="a-hybrid-rl-strategy-for-de-novo-molecule-generation">A Hybrid RL Strategy for De Novo Molecule Generation</h2>
<p>This is a <strong>Method</strong> paper that proposes Augmented Hill-Climb (AHC), a reinforcement learning strategy for conditioning SMILES-based language models during de novo molecule generation. The primary contribution is a simple hybrid between the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and Hill-Climb (HC) RL strategies that computes the REINVENT loss function only on the top-k highest-scoring molecules per batch (as in HC), thereby removing the counterproductive regularization effect of low-scoring molecules. The authors demonstrate that AHC improves optimization ability ~1.5-fold and sample efficiency ~45-fold compared to REINVENT across docking tasks against four <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> targets, and that the approach generalizes to transformer architectures.</p>
<h2 id="sample-efficiency-bottleneck-in-rl-guided-molecular-generation">Sample Efficiency Bottleneck in RL-Guided Molecular Generation</h2>
<p>Recurrent neural networks trained on SMILES have become a standard approach for de novo molecule generation, with RL strategies like REINVENT and Hill-Climb achieving top performance on benchmarks such as <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>. However, RL-guided generation can be highly <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/">sample-inefficient</a>, often requiring $10^5$ or more molecules to optimize complex objectives. This is acceptable for cheap scoring functions (e.g., QSAR models, property calculators) but becomes a practical bottleneck when using computationally expensive scoring functions like molecular docking or computer-aided synthesis planning.</p>
<p>The REINVENT strategy regularizes the agent by computing a loss based on the difference between the agent&rsquo;s policy and an &ldquo;augmented likelihood&rdquo; that combines the prior policy with a scaled reward. When low-scoring molecules are sampled ($R_T \approx 0$), the augmented likelihood reduces to the prior likelihood, causing the agent to trend back toward the prior policy. This negates useful learnings, especially early in training or when the objective is difficult. Meanwhile, Hill-Climb simply fine-tunes the RNN on the top-k molecules per batch, which is sample-efficient but lacks explicit regularization, leading to mode collapse and generation of invalid SMILES.</p>
<p>Previous work by Neil et al. compared RL strategies but did not clearly quantify sample-efficiency differences, and modifications to the REINVENT loss function by Fialkova et al. showed no significant improvement. The best agent reminder (BAR) mechanism offered modest gains but was originally tested on graph-based models.</p>
<h2 id="core-innovation-filtering-low-scoring-molecules-from-the-reinvent-loss">Core Innovation: Filtering Low-Scoring Molecules from the REINVENT Loss</h2>
<p>Augmented Hill-Climb combines the loss formulation of REINVENT with the top-k selection mechanism of Hill-Climb. The agent samples a batch of molecules, ranks them by reward, and computes the REINVENT loss only on the top-k molecules. This removes the counterproductive regularization caused by low-scoring molecules while retaining the prior-based regularization for high-scoring molecules.</p>
<p>The REINVENT loss defines an augmented likelihood:</p>
<p>$$
\log P_{\mathbb{U}}(A) = \log P_{prior}(A) + \sigma R_T
$$</p>
<p>where $\sigma$ is a scaling coefficient controlling the reward contribution. The agent loss is the squared difference between the augmented likelihood and the agent&rsquo;s log-likelihood:</p>
<p>$$
L(\theta) = \left[\log P_{\mathbb{U}}(A) - \log P_{agent}(A)\right]^2
$$</p>
<p>In standard REINVENT, this loss is computed over all molecules in the batch. When $R_T \approx 0$, the augmented likelihood collapses to the prior likelihood, pushing the agent back toward the prior. AHC avoids this by computing the loss only on the top-k molecules ranked by reward, exactly as Hill-Climb selects molecules for fine-tuning.</p>
<p>The key insight is that high-scoring molecules are still regularized by the prior component of the augmented likelihood ($\log P_{prior}(A)$), preventing catastrophic forgetting. Low-scoring molecules, which would otherwise pull the agent back toward the prior, are simply excluded from the loss computation.</p>
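<p>As a concrete sketch (not the authors&rsquo; implementation), the top-k restriction of the REINVENT loss can be written in a few lines; the function name, NumPy types, and default hyperparameters here are illustrative assumptions:</p>

```python
import numpy as np

def ahc_loss(log_p_prior, log_p_agent, rewards, sigma=60.0, topk_frac=0.5):
    """Sketch of the Augmented Hill-Climb loss: the REINVENT squared-error
    loss, restricted to the top-k molecules of the batch ranked by reward."""
    rewards = np.asarray(rewards, dtype=float)
    k = max(1, int(topk_frac * len(rewards)))
    top = np.argsort(rewards)[::-1][:k]            # indices of the k best molecules
    # Augmented likelihood: prior log-likelihood plus sigma-scaled reward.
    log_p_aug = log_p_prior[top] + sigma * rewards[top]
    # Squared difference between augmented and agent log-likelihoods, averaged.
    return np.mean((log_p_aug - log_p_agent[top]) ** 2)
```

<p>Low-scoring molecules never enter the loss at all, so the &ldquo;pull back to the prior&rdquo; term only acts on molecules worth keeping.</p>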
<h3 id="diversity-filters-to-prevent-mode-collapse">Diversity Filters to Prevent Mode Collapse</h3>
<p>AHC is more susceptible to mode collapse than REINVENT because it focuses learning on high-scoring molecules. The authors address this with diversity filters (DFs) that penalize the reward of molecules similar to previously generated ones. Through a hyperparameter search over 825 configurations on three GuacaMol tasks, they identify an optimal DF configuration (DF2) with:</p>
<ul>
<li>Minimum score threshold of 0.5 (lower than DF1&rsquo;s 0.8)</li>
<li>Linear penalization output mode (softer than binary)</li>
<li>Bin size of 50 (larger than DF1&rsquo;s 25)</li>
<li>Scaffold similarity based on ECFP4 fingerprints</li>
</ul>
<p>The authors find that stricter DFs (lower thresholds, smaller bins) better prevent mode collapse but reduce optimization performance, while more lenient DFs enable better learning of chemotype-reward associations. DF2 represents a compromise.</p>
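<p>The scaffold-memory mechanism behind DF2 can be sketched as follows; the class name, the exact linear-penalty shape, and the scaffold-keying scheme are assumptions for illustration (the real filter keys bins by ECFP4-based scaffold similarity):</p>

```python
from collections import defaultdict

class DiversityFilter:
    """Sketch of a scaffold-memory diversity filter with DF2-style settings:
    molecules scoring above `min_score` fill their scaffold's bin, and the
    reward is linearly attenuated as the bin fills up."""

    def __init__(self, min_score=0.5, bin_size=50):
        self.min_score = min_score
        self.bin_size = bin_size
        self.bins = defaultdict(int)   # scaffold -> occupancy count

    def __call__(self, scaffold, score):
        if score < self.min_score:
            return score               # low scorers neither fill bins nor get penalized
        self.bins[scaffold] += 1
        occupancy = self.bins[scaffold]
        # Linear penalization: full reward for a fresh scaffold, zero once the bin is full.
        penalty = max(0.0, 1.0 - (occupancy - 1) / self.bin_size)
        return score * penalty
```

<p>A stricter filter (smaller <code>bin_size</code>, lower <code>min_score</code>) drives rewards to zero faster for repeated chemotypes, which is exactly the mode-collapse-prevention vs. optimization trade-off the authors describe.</p>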
<h2 id="experimental-setup-docking-tasks-and-benchmark-comparisons">Experimental Setup: Docking Tasks and Benchmark Comparisons</h2>
<p>The evaluation spans five experiments:</p>
<p><strong>Experiment 1</strong>: AHC vs. REINVENT on DRD2 docking over 100 RL updates (6,400 samples), varying $\sigma$ from 30 to 240. RNN trained on the MOSESn dataset (MOSES with neutralized charges, 2.45M molecules).</p>
<p><strong>Experiment 2</strong>: AHC + DF2 vs. REINVENT on four GPCR targets (DRD2, OPRM1, AGTR1, OX1R) over 500 RL updates. Docking performed with Glide-SP after ligand preparation with LigPrep.</p>
<p><strong>Experiment 3</strong>: Diversity filter hyperparameter search (825 configurations) on three GuacaMol tasks (<a href="https://en.wikipedia.org/wiki/Aripiprazole">Aripiprazole</a> similarity, C11H24 isomers, <a href="https://en.wikipedia.org/wiki/Osimertinib">Osimertinib</a> MPO) using the GuacaMol training set (1.27M molecules from ChEMBL24).</p>
<p><strong>Experiment 4</strong>: Benchmark of AHC against REINFORCE, REINVENT (v1 and v2), BAR, and Hill-Climb (with and without KL regularization) on six tasks of varying difficulty:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Difficulty</th>
          <th>Objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heavy atoms</td>
          <td>Easy</td>
          <td>Maximize number of heavy atoms</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Risperidone">Risperidone</a> similarity</td>
          <td>Easy</td>
          <td>Maximize Tanimoto similarity to Risperidone</td>
      </tr>
      <tr>
          <td>DRD2 activity</td>
          <td>Medium</td>
          <td>Maximize QSAR-predicted DRD2 activity</td>
      </tr>
      <tr>
          <td>DRD2 docking</td>
          <td>Medium</td>
          <td>Minimize Glide-SP docking score</td>
      </tr>
      <tr>
          <td>DRD2-DRD3 dual</td>
          <td>Hard</td>
          <td>Maximize predicted activity against both targets</td>
      </tr>
      <tr>
          <td>DRD2/DRD3 selective</td>
          <td>Hard</td>
          <td>Maximize selective DRD2 activity over DRD3</td>
      </tr>
  </tbody>
</table>
<p><strong>Experiment 5</strong>: AHC vs. REINVENT on transformer (Tr) and gated transformer (GTr) architectures on the same six benchmark tasks. The GTr implements a GRU-style gate in place of residual connections to stabilize RL training.</p>
<h3 id="rnn-and-transformer-architectures">RNN and Transformer Architectures</h3>
<p>Three RNN configurations were used: (1) embedding 128 + 3 GRU layers of 512 (REINVENT v1), (2) embedding 256 + 3 LSTM layers of 512 (REINVENT 2.0), (3) 3 LSTM layers of 512 with dropout 0.2 (GuacaMol). Transformers used 4 encoder layers with hidden dimension 512, 8 attention heads, and feed-forward dimension 1024.</p>
<p>QSAR models for DRD2 and DRD3 activity were random forest classifiers trained on ExCAPE-DB data with GHOST threshold identification for handling class imbalance.</p>
<h2 id="key-findings-45-fold-sample-efficiency-improvement">Key Findings: 45-Fold Sample Efficiency Improvement</h2>
<h3 id="experiment-1-ahc-consistently-outperforms-reinvent">Experiment 1: AHC Consistently Outperforms REINVENT</h3>
<p>AHC improved optimization ability by 1.39-fold over REINVENT averaged across all $\sigma$ values, with maximum optimization of 205% at $\sigma = 240$ (compared to 128% for REINVENT). AHC required ~80 fewer RL steps to match REINVENT&rsquo;s mean docking score at 100 steps. With DF1 applied, the improvement was 1.45-fold.</p>
<p>AHC showed greater sensitivity to $\sigma$, giving practitioners more control over the regularization-optimization trade-off. At $\sigma = 60$ (heavily regularized), AHC still improved 1.47-fold over REINVENT while remaining within the property space defined by the MOSESn training set. At higher $\sigma$ values, AHC extrapolated further outside the training distribution, which can be favorable (novel chemical space) or unfavorable (scoring function exploitation, e.g., larger molecules getting better docking scores due to the additive nature of scoring functions).</p>
<h3 id="experiment-2-improvement-across-four-gpcr-targets">Experiment 2: Improvement Across Four GPCR Targets</h3>
<p>Across DRD2, OPRM1, AGTR1, and OX1R, AHC + DF2 required on average 7.4-fold fewer training steps and 45.5-fold fewer samples to reach optimization thresholds. The improvement was largest early in training: 19.8-fold fewer steps to reach 120% optimization, and 71.8-fold fewer samples to first produce a molecule exceeding 160% optimization.</p>
<p>AHC + DF2 surpassed the 80% retrospective precision threshold within 100 RL updates for all targets except the challenging OX1R. DF2 successfully stabilized learning, avoiding the convergence-to-threshold failure mode observed with DF1.</p>
<p>Scaffold analysis showed AHC generates similar chemistry to REINVENT. The top 500 scaffolds produced by REINVENT were also generated by AHC, but typically much sooner.</p>
<h3 id="experiment-4-benchmark-against-all-rl-strategies">Experiment 4: Benchmark Against All RL Strategies</h3>
<p>AHC outperformed all other RL strategies on all six benchmark tasks except maximizing heavy atoms (an extrapolation task of limited practical relevance). AHC was particularly superior during early-stage optimization and for harder objectives (dual activity, selective activity).</p>
<p>Hill-Climb with a smaller batch size (HC*) showed improved early-stage sample efficiency similar to AHC, but rapidly underwent mode collapse. KL regularization did not rescue mode collapse in any case and sometimes worsened performance. BAR performed poorly in most tasks, possibly because the best-agent memory acts as a second regularizer that inhibits learning.</p>
<p>In terms of wall time for the DRD2 docking task, AHC reached 140% optimization in 16 CPU hours vs. 202 CPU hours for <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 2.0</a>. AHC was the only strategy to reach 200% optimization within the allotted time (216 CPU hours). Parallelized over 10 CPUs, this corresponds to ~21.6 hours, making docking-guided generation feasible on local machines.</p>
<h3 id="experiment-5-generalization-to-transformers">Experiment 5: Generalization to Transformers</h3>
<p>AHC outperformed REINVENT on both the standard transformer and the gated transformer architectures. The standard transformer was unstable under RL, readily undergoing mode collapse. The gated transformer (with GRU-style gating replacing residual connections) stabilized RL training. AHC&rsquo;s efficiency gains generalized to both architectures.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Chemistry quality evaluation is complicated by the interaction between RL strategy and scoring function suitability. Greater optimization may lead to unreasonable chemistry due to scoring function exploitation rather than the RL strategy itself.</li>
<li>The diversity filter hyperparameter search was conducted on GuacaMol toy tasks, which may not fully transfer to docking-based objectives.</li>
<li>The docking scoring function was system-dependent: DRD2 and OPRM1 were optimized effectively, while AGTR1 and OX1R proved more challenging (especially AGTR1, where the docking algorithm targeted the wrong sub-pocket).</li>
<li>KL regularization proved ineffective for HC and REINFORCE, suggesting it is not a sufficient regularization method in this context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN pretraining</td>
          <td>MOSESn (MOSES neutralized)</td>
          <td>2,454,087 molecules</td>
          <td>ZINC15 clean leads with neutralized charges</td>
      </tr>
      <tr>
          <td>RNN pretraining</td>
          <td>GuacaMol train</td>
          <td>1,273,104 molecules</td>
          <td>ChEMBL24 with property filters</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD2)</td>
          <td>4,609 actives / 343,026 inactives</td>
          <td>Random forest with GHOST thresholds</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD3)</td>
          <td>2,758 actives / 402,524 inactives</td>
          <td>Unique subsets for dual/selective tasks</td>
      </tr>
      <tr>
          <td>DF parameter search</td>
          <td>GuacaMol benchmark tasks</td>
          <td>3 tasks</td>
          <td>825 configurations tested</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>AHC</strong>: REINVENT loss computed on top-k molecules per batch, ranked by reward</li>
<li><strong>Baselines</strong>: REINFORCE, REINVENT (v1, v2), BAR, Hill-Climb, Hill-Climb + KL regularization</li>
<li><strong>Hyperparameters</strong>: Default values from each original publication (listed in Supplementary Table S3)</li>
<li><strong>Docking</strong>: Glide-SP with Schrodinger Protein Preparation Wizard, LigPrep for ligand preparation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RNNs</strong>: 3 configurations (GRU/LSTM, 512 hidden units, trained 5-10 epochs)</li>
<li><strong>Transformer</strong>: 4 encoder layers, 512 hidden dim, 8 heads, 1024 FFN dim</li>
<li><strong>Gated Transformer</strong>: Same architecture with GRU-style gating replacing residual connections</li>
<li><strong>QSAR</strong>: Random forest classifiers (100 estimators, max depth 15, min leaf 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AHC + DF2</th>
          <th>REINVENT</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization fold-improvement</td>
          <td>1.45x</td>
          <td>baseline</td>
          <td>DRD2 docking, averaged across sigma values</td>
      </tr>
      <tr>
          <td>Sample efficiency</td>
          <td>45.5x fewer samples</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>Step efficiency</td>
          <td>7.4x fewer steps</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>CPU hours to 140% (DRD2 docking)</td>
          <td>16h</td>
          <td>202h (REINVENT 2.0)</td>
          <td>AMD Threadripper 1920 + RTX 2060 Super</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>AMD Threadripper 1920 CPU</li>
<li>Nvidia GeForce RTX 2060 Super GPU</li>
<li>DRD2 docking benchmark: 216 CPU hours for AHC to reach 200% optimization (~21.6h parallelized over 10 CPUs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/SMILES-RNN">SMILES-RNN</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>RNN and transformer generative model code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/">Scoring function platform</a></td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.6084/m9.figshare.19591024.v1">Figshare datasets</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Supporting data (published under same license as paper)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. <em>Journal of Cheminformatics</em>, 14, 68.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2022augmented,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00646-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AlphaDrug: MCTS-Guided Target-Specific Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</guid><description>AlphaDrug combines a modified transformer with Monte Carlo tree search and docking rollouts for target-specific de novo molecular generation.</description><content:encoded><![CDATA[<h2 id="target-conditioned-molecular-generation-via-transformer-and-mcts">Target-Conditioned Molecular Generation via Transformer and MCTS</h2>
<p>AlphaDrug is a <strong>Method</strong> paper that proposes a target-specific de novo molecular generation framework. The primary contribution is the combination of two components: (1) an Lmser Transformer (LT) that embeds protein-ligand context through hierarchical skip connections from encoder to decoder, and (2) a Monte Carlo tree search (MCTS) procedure guided by both the LT&rsquo;s predicted probabilities and docking scores from the <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a> program. The method generates SMILES strings autoregressively, with each symbol selection informed by look-ahead search over potential binding affinities.</p>
<h2 id="bridging-the-gap-between-molecular-generation-and-protein-targeting">Bridging the Gap Between Molecular Generation and Protein Targeting</h2>
<p>Most deep learning methods for de novo molecular generation optimize physicochemical properties (LogP, QED, SA) without conditioning on a specific protein target. Virtual screening approaches rely on existing compound databases and are computationally expensive. The few methods that do consider protein targets, such as LiGANN and the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">transformer-based approach of Grechishnikova (2021)</a>, show limited docking performance. The core challenge is twofold: the search space of drug-like molecules is estimated at $10^{60}$ compounds, and learning protein-ligand interaction patterns from sequence data is difficult because proteins and ligands have very different structures and sequence lengths.</p>
<p>AlphaDrug addresses these gaps by proposing a method that jointly learns protein-ligand representations and uses docking-guided search to navigate the vast chemical space.</p>
<h2 id="lmser-transformer-and-docking-guided-mcts">Lmser Transformer and Docking-Guided MCTS</h2>
<p>The key innovations are the Lmser Transformer architecture and the MCTS search strategy.</p>
<h3 id="lmser-transformer-lt">Lmser Transformer (LT)</h3>
<p>The standard transformer for sequence-to-sequence tasks passes information from the encoder&rsquo;s top layer to the decoder through cross-attention. AlphaDrug identifies an information transfer bottleneck: deep protein features from the encoder&rsquo;s final layer must serve all decoder layers. Inspired by the Lmser (least mean squared error reconstruction) network, the authors add hierarchical skip connections from each encoder layer to the corresponding decoder layer.</p>
<p>Each decoder layer receives protein features at the matching level of abstraction through a cross-attention mechanism:</p>
<p>$$f_{ca}(Q_m, K_S, V_S) = \text{softmax}\left(\frac{Q_m K_S^T}{\sqrt{d_k}}\right) V_S$$</p>
<p>where $Q_m$ comes from the ligand molecule decoder and $(K_S, V_S)$ are passed through skip connections from the protein encoder. This allows different decoder layers to access different levels of protein features, rather than all layers sharing the same top-level encoding.</p>
<p>The multi-head attention follows the standard formulation:</p>
<p>$$\text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \dots, H_h) W^O$$</p>
<p>$$H_i = f_{ca}(Q W_i^Q, K W_i^K, V W_i^V)$$</p>
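<p>A minimal NumPy sketch of the single-head cross-attention $f_{ca}$ above; batching, masking, and the learned projections $W_i^Q, W_i^K, W_i^V$ are omitted, and the function name mirrors the paper&rsquo;s notation rather than any released code:</p>

```python
import numpy as np

def f_ca(Q_m, K_S, V_S):
    """Scaled dot-product cross-attention as in the LT: ligand queries Q_m
    attend over protein keys/values (K_S, V_S) passed via skip connections."""
    d_k = K_S.shape[-1]
    scores = Q_m @ K_S.T / np.sqrt(d_k)             # (L_mol, L_prot)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over protein positions
    return weights @ V_S                            # (L_mol, d_v)
```

<p>Because each decoder layer calls this with the matching encoder layer&rsquo;s $(K_S, V_S)$, shallow decoder layers see shallow protein features and deep layers see deep ones.</p>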
<h3 id="mcts-for-molecular-generation">MCTS for Molecular Generation</h3>
<p>The molecular generation process models SMILES construction as a sequential decision problem. At each step $\tau$, the context $C_\tau = \{S, a_1 a_2 \cdots a_\tau\}$ consists of the protein sequence $S$ and the intermediate SMILES string. MCTS runs a fixed number of simulations per step, each consisting of four phases:</p>
<p><strong>Select</strong>: Starting from the current root node, child nodes are selected using a variant of the PUCT algorithm:</p>
<p>$$\tilde{a}_{\tau+t} = \underset{a \in A}{\arg\max}\left(Q(\tilde{C}_{\tau+t-1}, a) + U(\tilde{C}_{\tau+t-1}, a)\right)$$</p>
<p>where $Q(\tilde{C}, a) = W_a / N_a$ is the average reward and $U(\tilde{C}, a) = c_{puct} \cdot P(a | \tilde{C}) \cdot \sqrt{N_t} / (1 + N_t(a))$ is an exploration bonus based on the LT&rsquo;s predicted probability.</p>
<p>The Q-values are normalized to $[0, 1]$ using the range of docking scores in the tree:</p>
<p>$$Q(\tilde{C}, a) \leftarrow \frac{Q(\tilde{C}, a) - \min_{m \in \mathcal{M}} f_d(S, m)}{\max_{m \in \mathcal{M}} f_d(S, m) - \min_{m \in \mathcal{M}} f_d(S, m)}$$</p>
<p><strong>Expand</strong>: At a leaf node, the LT computes next-symbol probabilities and adds child nodes to the tree.</p>
<p><strong>Rollout</strong>: A complete molecule is generated greedily using LT probabilities. Valid molecules are scored with SMINA docking; invalid molecules receive the minimum observed docking score.</p>
<p><strong>Backup</strong>: Docking values propagate back up the tree, updating visit counts and cumulative rewards.</p>
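<p>The Select phase can be sketched in plain Python; the child-statistics dictionary layout (<code>N</code> visits, <code>W</code> cumulative normalized reward, <code>P</code> LT prior) is an assumption for illustration, not the paper&rsquo;s data structure:</p>

```python
import math

def puct_select(children, c_puct=1.5):
    """Sketch of the PUCT selection rule: pick the SMILES symbol maximizing
    Q (average normalized docking reward) plus the exploration bonus
    U = c_puct * P * sqrt(N_total) / (1 + N_a)."""
    n_total = sum(child["N"] for child in children.values())
    best_action, best_score = None, -math.inf
    for action, child in children.items():
        q = child["W"] / child["N"] if child["N"] > 0 else 0.0
        u = c_puct * child["P"] * math.sqrt(n_total) / (1 + child["N"])
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```

<p>Note how a heavily visited child&rsquo;s bonus shrinks via the $1 + N_a$ denominator, so search effort shifts toward symbols the LT rates highly but that have been tried less.</p>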
<h3 id="training-objective">Training Objective</h3>
<p>The LT is trained on known protein-ligand pairs using cross-entropy loss:</p>
<p>$$J(\Theta) = -\sum_{(S,m) \in \mathcal{D}} \sum_{\tau=1}^{L_m} \sum_{a \in \mathcal{A}} y_a \ln P(a \mid C_\tau(S, m))$$</p>
<p>MCTS is only activated during inference, not during training.</p>
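<p>The inner sum of $J(\Theta)$ for a single protein-ligand pair is just a summed next-symbol negative log-likelihood; a toy sketch with an assumed two-symbol vocabulary (the real model predicts over the full SMILES alphabet):</p>

```python
import numpy as np

def sequence_nll(probs, targets):
    """Per-pair term of the training objective: summed next-symbol
    cross-entropy over one ligand SMILES. probs[t] is the model's
    distribution over the vocabulary given context C_t; targets[t]
    is the index of the true next symbol."""
    return -sum(np.log(probs[t][a]) for t, a in enumerate(targets))
```

<p>Summing this quantity over the dataset $\mathcal{D}$ recovers $J(\Theta)$; minimizing it is ordinary teacher-forced language-model training.</p>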
<h2 id="experiments-on-diverse-protein-targets">Experiments on Diverse Protein Targets</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use BindingDB, filtered to 239,455 protein-ligand pairs across 981 unique proteins. Filtering criteria include: human proteins only, IC50 &lt; 100 nM, molecular weight &lt; 1000 Da, and single-chain targets. Proteins are clustered at 30% sequence identity using MMseqs2, with 25 clusters held out for testing (100 proteins), and the remainder split 90/10 for training (192,712 pairs) and validation (17,049 pairs).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>T+BS10</strong>: Standard transformer with beam search (K=10) from <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">Grechishnikova (2021)</a></li>
<li><strong>LT+BS10</strong>: The proposed Lmser Transformer with beam search</li>
<li><strong>LiGANN</strong>: 3D pocket-to-ligand shape generation via BicycleGAN</li>
<li><strong>SBMolGen</strong>: ChemTS-based method with docking constraints</li>
<li><strong>SBDD-3D</strong>: 3D autoregressive graph-based generation</li>
<li><strong>Decoys</strong>: Random compounds from ZINC database</li>
<li><strong>Known ligands</strong>: Original binding partners from the database</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Docking</th>
          <th>Uniqueness</th>
          <th>LogP</th>
          <th>QED</th>
          <th>SA</th>
          <th>NP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Decoys</td>
          <td>7.3</td>
          <td>-</td>
          <td>2.4</td>
          <td>0.8</td>
          <td>2.4</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>Known ligands</td>
          <td>9.8</td>
          <td>-</td>
          <td>2.2</td>
          <td>0.5</td>
          <td>3.3</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>LiGANN</td>
          <td>6.7</td>
          <td>94.7%</td>
          <td>2.9</td>
          <td>0.6</td>
          <td>3.0</td>
          <td>-1.1</td>
      </tr>
      <tr>
          <td>SBMolGen</td>
          <td>7.7</td>
          <td>100%</td>
          <td>2.6</td>
          <td>0.7</td>
          <td>2.8</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>SBDD-3D</td>
          <td>7.7</td>
          <td>99.3%</td>
          <td>1.5</td>
          <td>0.6</td>
          <td>4.0</td>
          <td>0.3</td>
      </tr>
      <tr>
          <td>T+BS10</td>
          <td>8.5</td>
          <td>90.6%</td>
          <td>3.8</td>
          <td>0.5</td>
          <td>2.8</td>
          <td>-0.8</td>
      </tr>
      <tr>
          <td>LT+BS10</td>
          <td>8.5</td>
          <td>98.1%</td>
          <td>4.0</td>
          <td>0.5</td>
          <td>2.7</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (freq)</td>
          <td>10.8</td>
          <td>99.5%</td>
          <td>4.9</td>
          <td>0.4</td>
          <td>2.9</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (max)</td>
          <td>11.6</td>
          <td>100%</td>
          <td>5.2</td>
          <td>0.4</td>
          <td>2.7</td>
          <td>-0.8</td>
      </tr>
  </tbody>
</table>
<p>AlphaDrug (max) achieves the highest average docking score (11.6), surpassing known ligands (9.8). Statistical significance is confirmed with two-tailed t-test P-values below 0.01 for all comparisons.</p>
<h3 id="mcts-vs-beam-search-under-equal-compute">MCTS vs. Beam Search Under Equal Compute</h3>
<p>When constrained to the same number of docking evaluations, MCTS consistently outperforms beam search:</p>
<table>
  <thead>
      <tr>
          <th>Docking times (N)</th>
          <th>BS</th>
          <th>MCTS</th>
          <th>P-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>N = 105 (S=10)</td>
          <td>8.4 (10.9)</td>
          <td>10.9 (11.5)</td>
          <td>1.8e-34 (4.5e-3)</td>
      </tr>
      <tr>
          <td>N = 394 (S=50)</td>
          <td>8.3 (11.4)</td>
          <td>11.6 (12.2)</td>
          <td>1.4e-31 (1.8e-3)</td>
      </tr>
      <tr>
          <td>N = 1345 (S=500)</td>
          <td>8.4 (11.9)</td>
          <td>12.4 (13.2)</td>
          <td>2.2e-39 (8.2e-6)</td>
      </tr>
  </tbody>
</table>
<p>Values in parentheses are average top-1 scores per protein.</p>
<h3 id="ablation-effect-of-protein-sequence-input">Ablation: Effect of Protein Sequence Input</h3>
<p>Replacing the full transformer (T) or LT with a transformer encoder only (TE, no protein input) demonstrates that protein conditioning improves both uniqueness and docking score per symbol (SpS):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Uniqueness</th>
          <th>SpS</th>
          <th>Molecular length</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TE + MCTS (S=50)</td>
          <td>81.0%</td>
          <td>0.1926</td>
          <td>62.93</td>
      </tr>
      <tr>
          <td>T + MCTS (S=50)</td>
          <td>98.0%</td>
          <td>0.2149</td>
          <td>55.63</td>
      </tr>
      <tr>
          <td>LT + MCTS (S=50)</td>
          <td>100.0%</td>
          <td>0.2159</td>
          <td>56.54</td>
      </tr>
  </tbody>
</table>
<p>The SpS metric (docking score normalized by molecule length) isolates the quality improvement from the tendency of longer molecules to score higher.</p>
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p>A docking lookup table caches previously computed protein-molecule docking scores, reducing actual docking calls by 81-86% compared to the theoretical maximum ($L \times S$ calls per molecule). With $S = 10$, AlphaDrug generates molecules in about 52 minutes per protein; with $S = 50$, about 197 minutes per protein.</p>
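<p>The lookup table amounts to memoizing the docking function on (protein, SMILES) keys; a minimal sketch, where <code>dock_fn</code> stands in for the actual SMINA call:</p>

```python
def make_cached_docker(dock_fn):
    """Sketch of the docking lookup table: previously computed
    (protein, SMILES) docking scores are reused instead of re-docked."""
    cache = {}

    def docker(protein, smiles):
        key = (protein, smiles)
        if key not in cache:
            cache[key] = dock_fn(protein, smiles)  # expensive SMINA call
        return cache[key]

    docker.cache = cache  # expose for inspection
    return docker
```

<p>Because MCTS rollouts frequently regenerate identical complete molecules, most of the theoretical $L \times S$ docking calls per molecule hit the cache, consistent with the reported 81-86% reduction.</p>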
<h2 id="docking-gains-with-acknowledged-limitations">Docking Gains with Acknowledged Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>86% of AlphaDrug-generated molecules have higher docking scores than known ligands for their respective targets.</li>
<li>The LT architecture with hierarchical skip connections improves uniqueness (from 90.6% to 98.1% with beam search) and provides slight SpS gains over the vanilla transformer.</li>
<li>MCTS is the dominant factor in performance improvement: even with only 10 simulations, it boosts docking scores by 31.3% over greedy LT decoding.</li>
<li>Case studies on three proteins (3gcs, 3eig, 4o28) show that generated molecules share meaningful substructures with known ligands, suggesting chemical plausibility.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors identify three areas for improvement:</p>
<ol>
<li><strong>Sequence-only representation</strong>: AlphaDrug uses amino acid sequences rather than 3D protein structures. While it outperforms existing 3D methods (SBDD-3D), incorporating 3D pocket geometry could further improve performance.</li>
<li><strong>External docking as value function</strong>: SMINA docking calls are computationally expensive and become a bottleneck during MCTS. A learnable end-to-end value network would reduce this cost and allow joint policy-value training.</li>
<li><strong>Full rollout requirement</strong>: Every MCTS simulation requires generating a complete molecule for docking evaluation. Estimating binding affinity from partial molecules remains an open challenge.</li>
</ol>
<p>The physicochemical properties (QED, SA) of AlphaDrug&rsquo;s outputs are comparable to baselines but not explicitly optimized. LogP values trend toward the upper end of the Ghose filter range (4.9-5.2 vs. the 5.6 limit), which may indicate lipophilicity bias.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BindingDB (filtered)</td>
          <td>192,712 protein-ligand pairs</td>
          <td>Human proteins, IC50 &lt; 100 nM, MW &lt; 1000 Da</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>BindingDB (filtered)</td>
          <td>17,049 pairs</td>
          <td>Same filtering criteria</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>BindingDB (filtered)</td>
          <td>100 proteins from 25 clusters</td>
          <td>Clustered at 30% sequence identity via MMseqs2</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>MCTS with PUCT selection criterion, $c_{puct} = 1.5$</li>
<li>$S = 50$ simulations per step (default), $S = 10$ for fast variant</li>
<li>Greedy rollout policy using LT probabilities</li>
<li>Docking lookup table for efficiency (caches SMINA results)</li>
<li>Two generation modes: max (deterministic, highest visit count) and freq (stochastic, proportional to visit counts)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Lmser Transformer with hierarchical encoder-to-decoder skip connections</li>
<li>Sinusoidal positional encoding</li>
<li>Multi-head cross-attention at each decoder layer</li>
<li>Detailed hyperparameters (embedding dimensions, number of layers/heads) are in the supplementary material (Table S1)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AlphaDrug (max)</th>
          <th>Known ligands</th>
          <th>Best baseline (T+BS10)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score</td>
          <td>11.6</td>
          <td>9.8</td>
          <td>8.5</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>100%</td>
          <td>-</td>
          <td>90.6%</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>100%</td>
          <td>-</td>
          <td>Not reported</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not explicitly reported in the paper. Generation time is reported as approximately 52 minutes per protein ($S = 10$) and 197 minutes per protein ($S = 50$), with docking (via SMINA) being the dominant cost.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CMACH508/AlphaDrug">CMACH508/AlphaDrug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation, includes data processing and generation scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, H., Lin, C., Zhao, D., Tu, S., &amp; Xu, L. (2022). AlphaDrug: protein target specific de novo molecular generation. <em>PNAS Nexus</em>, 1(4), pgac227. <a href="https://doi.org/10.1093/pnasnexus/pgac227">https://doi.org/10.1093/pnasnexus/pgac227</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2022alphadrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AlphaDrug: protein target specific de novo molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Hao and Lin, Cheng and Zhao, Dengwei and Tu, Shikui and Xu, Lei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{PNAS Nexus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{pgac227}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/pnasnexus/pgac227}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>TamGen: GPT-Based Target-Aware Drug Design and Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</guid><description>TamGen combines a GPT-like chemical language model with protein pocket encoding and VAE refinement to generate drug candidates with experimental validation.</description><content:encoded><![CDATA[<h2 id="a-method-for-target-conditioned-molecular-generation">A Method for Target-Conditioned Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces TamGen (Target-aware molecular generation), a three-module architecture for generating drug-like compounds conditioned on protein binding pocket structures. The primary contribution is a GPT-like chemical language model pre-trained on 10 million SMILES from PubChem, combined with a Transformer-based protein encoder and a VAE-based contextual encoder for compound refinement. The authors validate TamGen on the CrossDocked2020 benchmark and apply it through a Design-Refine-Test pipeline to discover 14 novel inhibitors of the <em>Mycobacterium tuberculosis</em> ClpP protease, with $\text{IC}_{50}$ values ranging from 1.88 to 35.2 $\mu$M.</p>
<h2 id="bridging-generative-ai-and-practical-drug-discovery">Bridging Generative AI and Practical Drug Discovery</h2>
<p>Target-based generative drug design aims to create novel compounds with desired pharmacological properties from scratch, exploring the estimated $10^{60}$ feasible compounds in chemical space rather than screening existing libraries of $10^{4}$ to $10^{8}$ molecules. Prior approaches using diffusion models, GANs, VAEs, and autoregressive models have demonstrated the feasibility of generating compounds conditioned on target proteins. However, most generated compounds lack satisfactory physicochemical properties for drug-likeness, and validations with biophysical or biochemical assays are largely missing.</p>
<p>The key limitations of existing 3D generation methods (TargetDiff, Pocket2Mol, ResGen, 3D-AR) include:</p>
<ul>
<li>Generated compounds frequently contain multiple fused rings, leading to poor synthetic accessibility</li>
<li>High cellular toxicity and decreased developability associated with excessive fused ring counts</li>
<li>Slow generation speeds (tens of minutes to hours per 100 compounds)</li>
<li>Limited real-world experimental validation of generated candidates</li>
</ul>
<p>TamGen addresses these issues by operating in 1D SMILES space rather than 3D coordinate space, leveraging pre-training on natural compound distributions to produce more drug-like molecules.</p>
<h2 id="three-module-architecture-with-pre-training-and-refinement">Three-Module Architecture with Pre-Training and Refinement</h2>
<p>TamGen consists of three components: a compound decoder, a protein encoder, and a contextual encoder.</p>
<h3 id="compound-decoder-chemical-language-model">Compound Decoder (Chemical Language Model)</h3>
<p>The compound decoder is a GPT-style autoregressive model pre-trained on 10 million SMILES randomly sampled from PubChem. The pre-training objective follows standard next-token prediction:</p>
<p>$$
\min -\sum_{y \in \mathcal{D}_0} \frac{1}{M_y} \sum_{i=1}^{M_y} \log P(y_i \mid y_{i-1}, y_{i-2}, \ldots, y_1)
$$</p>
<p>where $M_y$ is the SMILES sequence length. This enables both unconditional and conditional generation. The decoder uses 12 Transformer layers with hidden dimension 768.</p>
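<p>The objective is the standard length-normalized negative log-likelihood. A minimal sketch with hypothetical per-token probabilities:</p>

```python
import math

def smiles_nll(token_probs):
    # Length-normalized negative log-likelihood of one SMILES sequence:
    # (1 / M_y) * sum_i -log P(y_i | y_1, ..., y_{i-1})
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for a short tokenized SMILES.
loss = smiles_nll([0.5, 0.4, 0.2, 0.9])
```

<p>Minimizing this quantity averaged over the pre-training corpus $\mathcal{D}_0$ recovers the objective above.</p>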
<h3 id="protein-encoder-with-distance-aware-attention">Protein Encoder with Distance-Aware Attention</h3>
<p>The protein encoder processes binding pocket residues using both sequential and geometric information. Given amino acids $\mathbf{a} = (a_1, \ldots, a_N)$ with 3D coordinates $\mathbf{r} = (r_1, \ldots, r_N)$, the input representation combines amino acid embeddings with coordinate embeddings:</p>
<p>$$
h_i^{(0)} = E_a a_i + E_r \rho\left(r_i - \frac{1}{N}\sum_{j=1}^{N} r_j\right)
$$</p>
<p>where $\rho$ denotes a random roto-translation operation applied as data augmentation, and coordinates are centered to the origin.</p>
<p>The encoder uses a distance-aware self-attention mechanism that weights attention scores by spatial proximity:</p>
<p>$$
\begin{aligned}
\hat{\alpha}_j &amp;= \exp\left(-\frac{|r_i - r_j|^2}{\tau}\right)(h_i^{(l)\top} W h_j^{(l)}) \\
\alpha_j &amp;= \frac{\exp \hat{\alpha}_j}{\sum_{k=1}^{N} \exp \hat{\alpha}_k} \\
\hat{\boldsymbol{h}}_i^{(l+1)} &amp;= \sum_{j=1}^{N} \alpha_j (W_v h_j^{(l)})
\end{aligned}
$$</p>
<p>where $\tau$ is a temperature hyperparameter and $W$, $W_v$ are learnable parameters. The encoder uses 4 layers with hidden dimension 256. Outputs are passed to the compound decoder via cross-attention.</p>
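<p>As a sanity check on the mechanism, here is a single-head numpy sketch of the distance-aware attention above. The dimensions and random inputs are illustrative assumptions; the actual encoder is a 4-layer multi-head Transformer with hidden dimension 256.</p>

```python
import numpy as np

def distance_aware_attention(h, r, W, Wv, tau=1.0):
    # Raw scores h_i^T W h_j are damped by exp(-|r_i - r_j|^2 / tau)
    # before the softmax, so spatially distant residues get less attention.
    sq_dist = np.sum((r[:, None, :] - r[None, :, :]) ** 2, axis=-1)  # (N, N)
    scores = np.exp(-sq_dist / tau) * (h @ W @ h.T)                  # alpha_hat
    scores = scores - scores.max(axis=-1, keepdims=True)             # stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return alpha @ (h @ Wv.T)                                        # updated h_i

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 8))   # 5 pocket residues, toy hidden dim 8
r = rng.standard_normal((5, 3))   # 3D coordinates
out = distance_aware_attention(h, r,
                               W=rng.standard_normal((8, 8)),
                               Wv=rng.standard_normal((8, 8)))
```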
<h3 id="vae-based-contextual-encoder">VAE-Based Contextual Encoder</h3>
<p>A VAE-based contextual encoder maps each (compound, protein) pair to the mean $\mu$ and standard deviation $\sigma$ of a latent posterior. During training, the model reconstructs the input compound; at inference time, encoding a seed compound enables refinement. The full training objective combines reconstruction loss with KL regularization:</p>
<p>$$
\min_{\Theta, q} \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} -\log P(\mathbf{y} \mid \mathbf{x}, z; \Theta) + \beta \mathcal{D}_{\text{KL}}(q(z \mid \mathbf{x}, \mathbf{y}) \| p(z))
$$</p>
<p>where $\beta$ is a hyperparameter controlling the KL divergence weight, and $p(z)$ is a standard Gaussian prior.</p>
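<p>A minimal numpy sketch of this objective, assuming a diagonal-Gaussian posterior so the KL term takes its standard closed form (the reconstruction NLL is passed in as a precomputed scalar):</p>

```python
import numpy as np

def vae_objective(recon_nll, mu, log_sigma, beta=0.1):
    # Closed-form KL(q || N(0, I)) for a diagonal Gaussian:
    # 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1 - 2 * log_sigma)
    return recon_nll + beta * kl

# With mu = 0 and sigma = 1 the posterior equals the prior, so KL = 0.
loss = vae_objective(recon_nll=12.3, mu=np.zeros(4), log_sigma=np.zeros(4))
```

<p>The paper uses $\beta$ of 0.1 or 1.0 depending on the generation stage (see Reproducibility Details below).</p>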
<h2 id="benchmark-evaluation-and-tuberculosis-drug-discovery">Benchmark Evaluation and Tuberculosis Drug Discovery</h2>
<h3 id="crossdocked2020-benchmark">CrossDocked2020 Benchmark</h3>
<p>TamGen was evaluated against five baselines (liGAN, 3D-AR, Pocket2Mol, ResGen, TargetDiff) on the CrossDocked2020 dataset (~100k drug-target pairs for training, 100 test binding pockets). For each target, 100 compounds were generated per method. Evaluation metrics included:</p>
<ul>
<li><strong>Docking score</strong> (AutoDock-Vina): binding affinity estimate</li>
<li><strong>QED</strong>: quantitative estimate of drug-likeness</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a></strong>: physicochemical property compliance</li>
<li><strong>SAS</strong>: synthetic accessibility score</li>
<li><strong>LogP</strong>: lipophilicity (optimal range 0-5 for oral administration)</li>
<li><strong>Molecular diversity</strong>: Tanimoto similarity between Morgan fingerprints</li>
</ul>
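<p>Of these metrics, molecular diversity reduces to pairwise Tanimoto comparisons. A minimal sketch using toy on-bit sets in place of the Morgan fingerprints RDKit would compute:</p>

```python
from itertools import combinations

def tanimoto(a, b):
    # Tanimoto similarity of two fingerprints represented as sets of on-bits.
    return len(a & b) / len(a | b) if (a | b) else 0.0

def mean_diversity(fps):
    # Diversity as the mean pairwise (1 - Tanimoto) over a generated set.
    pairs = list(combinations(fps, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy on-bit sets standing in for fingerprints of three molecules.
d = mean_diversity([{1, 2, 3}, {2, 3, 4}, {7, 8}])
```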
<p>TamGen ranked first or second on 5 of 6 metrics and achieved the best overall score using mean reciprocal rank (MRR) across all metrics. On synthetic accessibility for high-affinity compounds, TamGen performed best. The generated compounds averaged 1.78 fused rings, closely matching FDA-approved drugs, while competing 3D methods produced compounds with significantly more fused rings.</p>
<p>TamGen was also 85x to 394x faster than competing methods: generating 100 compounds per target in an average of 9 seconds on a single A6000 GPU, compared to tens of minutes or hours for the baselines.</p>
<h3 id="design-refine-test-pipeline-for-clpp-inhibitors">Design-Refine-Test Pipeline for ClpP Inhibitors</h3>
<p>The practical application targeted the ClpP protease of <em>Mycobacterium tuberculosis</em>, an emerging antibiotic target with no documented advanced inhibitors beyond <a href="https://en.wikipedia.org/wiki/Bortezomib">Bortezomib</a>.</p>
<p><strong>Design stage</strong>: Using the ClpP binding pocket from PDB structure 5DZK, TamGen generated 2,612 unique compounds. Compounds were filtered by molecular docking (retaining those with better scores than Bortezomib) and Ligandformer phenotypic activity prediction. Peptidomimetic compounds were excluded for poor ADME properties. Four seed compounds were selected.</p>
<p><strong>Refine stage</strong>: Using the 4 seed compounds plus 3 weakly active compounds ($\text{IC}_{50}$ 100-200 $\mu$M) from prior experiments, TamGen generated 8,635 unique compounds conditioned on both the target and seeds. After filtering, 296 compounds were selected for testing.</p>
<p><strong>Test stage</strong>: From a 446k commercial compound library, 159 analogs (MCS similarity &gt; 0.55) were identified. Five analogs showed significant inhibitory effects. Dose-response experiments revealed $\text{IC}_{50}$ values below 20 $\mu$M for all five, with Analog-005 achieving $\text{IC}_{50}$ of 1.9 $\mu$M. Three additional novel compounds were synthesized for SAR analysis:</p>
<table>
  <thead>
      <tr>
          <th>Compound</th>
          <th>Series</th>
          <th>Source</th>
          <th>$\text{IC}_{50}$ ($\mu$M)</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Analog-005</td>
          <td>II</td>
          <td>Commercial library</td>
          <td>1.9</td>
          <td>Most potent analog</td>
      </tr>
      <tr>
          <td>Analog-003</td>
          <td>I</td>
          <td>Commercial library</td>
          <td>&lt; 20</td>
          <td>Strongest single-dose inhibition</td>
      </tr>
      <tr>
          <td>Syn-A003-01</td>
          <td>I</td>
          <td>TamGen (synthesized)</td>
          <td>&lt; 20</td>
          <td>Diphenylurea scaffold</td>
      </tr>
  </tbody>
</table>
<p>Both compound series (diphenylurea and benzenesulfonamide scaffolds) represent novel ClpP inhibitor chemotypes distinct from Bortezomib. Additionally, 6 out of 8 directly synthesized TamGen compounds demonstrated $\text{IC}_{50}$ below 40 $\mu$M, confirming TamGen&rsquo;s ability to produce viable hits without the library search step.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Four ablation experiments clarified the contributions of TamGen&rsquo;s components:</p>
<ol>
<li><strong>Without pre-training</strong>: Significantly worse docking scores and simpler structures. The optimal decoder depth dropped from 12 to 4 layers without pre-training due to overfitting.</li>
<li><strong>Shuffled pocket-ligand pairs (TamGen-r)</strong>: Substantially worse docking scores, confirming TamGen learns meaningful pocket-ligand interactions rather than generic compound distributions.</li>
<li><strong>Without distance-aware attention</strong>: Significant decline in docking scores when removing the geometric attention term from Eq. 2.</li>
<li><strong>Without coordinate augmentation</strong>: Performance degradation when removing the roto-translation augmentation $\rho$, highlighting the importance of geometric invariance.</li>
</ol>
<h2 id="validated-drug-like-generation-with-practical-limitations">Validated Drug-Like Generation with Practical Limitations</h2>
<p>TamGen demonstrates that 1D SMILES-based generation with pre-training on natural compounds produces molecules with better drug-likeness properties than 3D generation methods. The experimental validation against ClpP is a notable strength, as most generative drug design methods lack biochemical assay confirmation.</p>
<p>Key limitations acknowledged by the authors include:</p>
<ul>
<li><strong>Insufficient sensitivity to minor target differences</strong>: TamGen cannot reliably distinguish targets with point mutations or protein isoforms, limiting applicability for cancer-related proteins</li>
<li><strong>Requires known structure and pocket</strong>: As a structure-based method, TamGen needs the 3D structure of the target protein and binding pocket information</li>
<li><strong>Limited cellular validation</strong>: The study focuses on hit identification; cellular activities and toxicities of proposed compounds were not extensively tested</li>
<li><strong>1D generation trade-off</strong>: SMILES-based generation does not fully exploit 3D protein-ligand geometric interactions available in coordinate space</li>
</ul>
<p>Future directions include integrating insights from 3D autoregressive methods, using Monte Carlo Tree Search or reinforcement learning to guide generation for better docking scores and ADME/T properties, and property-guided generation as explored in <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/">PrefixMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (random sample)</td>
          <td>10M SMILES</td>
          <td>Compound decoder pre-training</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>CrossDocked2020</td>
          <td>~100k pairs</td>
          <td>Filtered pocket-ligand pairs</td>
      </tr>
      <tr>
          <td>Extended fine-tuning</td>
          <td>CrossDocked + PDB</td>
          <td>~300k pairs</td>
          <td>Used for TB compound generation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>CrossDocked2020 test</td>
          <td>100 pockets</td>
          <td>Same split as TargetDiff/Pocket2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Compound decoder</strong>: 12-layer GPT with hidden dimension 768, pre-trained for 200k steps</li>
<li><strong>Protein encoder</strong>: 4-layer Transformer with hidden dimension 256, distance-aware attention</li>
<li><strong>VAE encoder</strong>: 4-layer standard Transformer encoder with hidden dimension 256</li>
<li><strong>Optimizer</strong>: Adam with initial learning rate $3 \times 10^{-5}$</li>
<li><strong>VAE $\beta$</strong>: 0.1 or 1.0 depending on generation stage</li>
<li><strong>Beam search</strong>: beam sizes of 4, 10, or 20 depending on stage</li>
<li><strong>Pocket definition</strong>: residues within 10 or 15 Angstrom distance cutoff from ligand center</li>
</ul>
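<p>The pocket definition in the last bullet amounts to a distance filter. A minimal numpy sketch, with toy coordinates in place of parsed PDB structures:</p>

```python
import numpy as np

def pocket_residue_indices(residue_coords, ligand_coords, cutoff=10.0):
    # Keep residues whose representative coordinate lies within `cutoff`
    # Angstroms of the ligand's geometric center.
    center = ligand_coords.mean(axis=0)
    dists = np.linalg.norm(residue_coords - center, axis=1)
    return np.where(dists <= cutoff)[0]

residues = np.array([[0.0, 0.0, 0.0], [4.0, 3.0, 0.0], [20.0, 0.0, 0.0]])
ligand = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
idx = pocket_residue_indices(residues, ligand)
```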
<h3 id="models">Models</h3>
<p>Pre-trained model weights are available via Zenodo at <a href="https://doi.org/10.5281/zenodo.13751391">https://doi.org/10.5281/zenodo.13751391</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>TamGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Overall MRR</td>
          <td>Best</td>
          <td>TargetDiff (2nd)</td>
          <td>Ranked across 6 metrics</td>
      </tr>
      <tr>
          <td>Fused rings (avg)</td>
          <td>1.78</td>
          <td>~3-5 (others)</td>
          <td>Matches FDA-approved drug average</td>
      </tr>
      <tr>
          <td>Generation speed</td>
          <td>9 sec/100 compounds</td>
          <td>~13 min (ResGen)</td>
          <td>Single A6000 GPU</td>
      </tr>
      <tr>
          <td>ClpP hit rate</td>
          <td>6/8 synthesized</td>
          <td>N/A</td>
          <td>$\text{IC}_{50}$ &lt; 40 $\mu$M</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x V100 GPUs for 200k steps</li>
<li>Inference benchmarking: 1x A6000 GPU</li>
<li>Generation time: ~9 seconds per 100 compounds per target</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SigmaGenX/TamGen">TamGen code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13751391">Model weights and data</a></td>
          <td>Model + Data</td>
          <td>CC-BY-4.0</td>
          <td>Pre-trained weights, source data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, K., Xia, Y., Deng, P., Liu, R., Zhang, Y., Guo, H., Cui, Y., Pei, Q., Wu, L., Xie, S., Chen, S., Lu, X., Hu, S., Wu, J., Chan, C.-K., Chen, S., Zhou, L., Yu, N., Chen, E., Liu, H., Guo, J., Qin, T., &amp; Liu, T.-Y. (2024). TamGen: drug design with target-aware molecule generation through a chemical language model. <em>Nature Communications</em>, 15, 9360. <a href="https://doi.org/10.1038/s41467-024-53632-4">https://doi.org/10.1038/s41467-024-53632-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tamgen,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{TamGen: drug design with target-aware molecule generation through a chemical language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{9360}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-53632-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Failure Modes in Molecule Generation &amp; Optimization</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/</guid><description>Renz et al. show trivial models fool distribution-learning metrics and ML scoring functions introduce exploitable biases in goal-directed molecule generation.</description><content:encoded><![CDATA[<h2 id="an-empirical-critique-of-molecular-generation-evaluation">An Empirical Critique of Molecular Generation Evaluation</h2>
<p>This is an <strong>Empirical</strong> paper that critically examines evaluation practices for molecular generative models. Rather than proposing a new generative method, the paper exposes systematic weaknesses in both distribution-learning metrics and goal-directed optimization scoring functions. The primary contributions are: (1) demonstrating that a trivially simple &ldquo;AddCarbon&rdquo; model can achieve near-perfect scores on widely used distribution-learning benchmarks, and (2) introducing an experimental framework with optimization scores and control scores that reveals model-specific and data-specific biases when ML models serve as scoring functions for goal-directed generation.</p>
<h2 id="evaluation-gaps-in-de-novo-molecular-design">Evaluation Gaps in De Novo Molecular Design</h2>
<p>The rapid growth of deep learning methods for molecular generation (RNN-based SMILES generators, VAEs, GANs, graph neural networks) created a need for standardized evaluation. Benchmarking suites like <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> introduced metrics for validity, uniqueness, novelty, KL divergence over molecular properties, and <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Frechet ChemNet Distance (FCD)</a>. For goal-directed generation, penalized logP became a common optimization target.</p>
<p>However, these metrics leave significant blind spots. Distribution-learning metrics do not detect whether a model merely copies training molecules with minimal modifications. Goal-directed benchmarks often use scoring functions that fail to capture the full requirements of drug discovery (synthetic feasibility, drug-likeness, absence of reactive substructures). When ML models serve as scoring functions, the problem worsens because generated molecules can exploit artifacts of the learned model rather than exhibiting genuinely desirable properties.</p>
<p>At the time of writing, wet-lab validations of generative models remained scarce, with only a handful of studies (Merk et al., Zhavoronkov et al.) demonstrating in vitro activity for generated compounds. The lack of rigorous evaluation left the field unable to distinguish meaningfully innovative methods from those that simply exploit metric weaknesses.</p>
<h2 id="the-copy-problem-and-control-score-framework">The Copy Problem and Control Score Framework</h2>
<p>The paper introduces two key conceptual contributions.</p>
<h3 id="the-addcarbon-model-for-distribution-learning">The AddCarbon Model for Distribution-Learning</h3>
<p>The AddCarbon model is deliberately trivial: it samples a molecule from the training set, inserts a single carbon atom at a random position in its SMILES string, and returns the result if it produces a valid, novel molecule. This model achieves near-perfect scores across most <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> distribution-learning benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>RS</th>
          <th>LSTM</th>
          <th>GraphMCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
          <th>AddCarbon</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
          <td>0.871</td>
      </tr>
  </tbody>
</table>
<p>On the FCD metric, the AddCarbon model outscores every learned generative baseline except the LSTM (only random sampling, which replays training molecules verbatim, also scores higher), despite being practically useless. This exposes what the authors call the &ldquo;copy problem&rdquo;: current metrics check only for exact matches to training molecules, so minimal edits evade novelty detection. The authors argue that likelihood-based evaluation on hold-out test sets, analogous to standard practice in NLP, would provide a more comprehensive metric.</p>
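<p>The AddCarbon procedure itself fits in a few lines. In this sketch the validity check is a stand-in argument; the real model would call RDKit&rsquo;s <code>Chem.MolFromSmiles</code> sanitization:</p>

```python
import random

def add_carbon(training_smiles, is_valid, seed=0):
    # Sample a training SMILES, insert "C" at a random position, and keep
    # the result if it is a valid molecule not already in the training set.
    rng = random.Random(seed)
    train = set(training_smiles)
    while True:
        s = rng.choice(training_smiles)
        i = rng.randrange(len(s) + 1)
        mutated = s[:i] + "C" + s[i:]
        if mutated not in train and is_valid(mutated):
            return mutated

# Stand-in validity check for illustration only.
novel = add_carbon(["CCO", "CCN"], is_valid=lambda s: True)
```

<p>Because each output differs from a training molecule by a single character, it counts as valid, unique, and novel under the benchmark while being chemically uninteresting.</p>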
<h3 id="control-scores-for-goal-directed-generation">Control Scores for Goal-Directed Generation</h3>
<p>For goal-directed generation, the authors introduce a three-score experimental design:</p>
<ul>
<li><strong>Optimization Score (OS)</strong>: Output of a classifier trained on data split 1, used to guide the molecular optimizer.</li>
<li><strong>Model Control Score (MCS)</strong>: Output of a second classifier trained on split 1 with a different random seed. Divergence between OS and MCS quantifies model-specific biases.</li>
<li><strong>Data Control Score (DCS)</strong>: Output of a classifier trained on data split 2. Divergence between OS and DCS quantifies data-specific biases.</li>
</ul>
<p>This mirrors the training/test split paradigm in supervised learning. If a generator truly produces molecules with the desired bioactivity, the control scores should track the optimization score. Divergence between them indicates the optimizer is exploiting artifacts of the specific model or training data rather than learning generalizable chemical properties.</p>
<h2 id="experimental-setup-three-targets-three-generators">Experimental Setup: Three Targets, Three Generators</h2>
<h3 id="targets-and-data">Targets and Data</h3>
<p>The authors selected three biological targets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">Janus kinase 2</a> (JAK2), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">epidermal growth factor receptor</a> (EGFR), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a> (DRD2). For each target, the data was split into two halves (split 1 and split 2) with balanced active/inactive ratios. Random forest classifiers using binary folded ECFP4 fingerprints (radius 2, size 1024) were trained to produce three scoring functions per target: the OS and MCS on split 1 (different random seeds), and the DCS on split 2.</p>
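<p>The three-scoring-function setup can be sketched with scikit-learn. The random bit vectors below stand in for ECFP4 fingerprints of real ChEMBL actives and inactives, and the forests use default hyperparameters rather than the paper&rsquo;s:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 1024))  # stand-in 1024-bit fingerprints
y = rng.integers(0, 2, size=400)          # stand-in active/inactive labels
X1, y1, X2, y2 = X[:200], y[:200], X[200:], y[200:]

os_clf = RandomForestClassifier(random_state=0).fit(X1, y1)   # OS: guides optimization
mcs_clf = RandomForestClassifier(random_state=1).fit(X1, y1)  # MCS: same data, new seed
dcs_clf = RandomForestClassifier(random_state=0).fit(X2, y2)  # DCS: held-out split

candidates = rng.integers(0, 2, size=(10, 1024))
os_score = os_clf.predict_proba(candidates)[:, 1]
mcs_score = mcs_clf.predict_proba(candidates)[:, 1]
dcs_score = dcs_clf.predict_proba(candidates)[:, 1]
# Divergence of OS from MCS flags model-specific bias; from DCS, data-specific bias.
```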
<h3 id="generators">Generators</h3>
<p>Three molecular generators were evaluated:</p>
<ol>
<li><strong>Graph-based Genetic Algorithm (GA)</strong>: Iteratively applies random mutations and crossovers to a population of molecules, retaining the best in each generation. One of the top performers in GuacaMol.</li>
<li><strong>SMILES-LSTM</strong>: An autoregressive model that generates SMILES character by character, optimized via hill climbing (iteratively sampling, keeping top molecules, fine-tuning). Also a top GuacaMol performer.</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">Particle Swarm Optimization</a> (PS)</strong>: Optimizes molecules in the continuous latent space of a SMILES-based sequence-to-sequence model.</li>
</ol>
<p>Each optimizer was run 10 times per target dataset.</p>
<h2 id="score-divergence-and-exploitable-biases">Score Divergence and Exploitable Biases</h2>
<h3 id="optimization-vs-control-score-divergence">Optimization vs. Control Score Divergence</h3>
<p>Across all three targets and all three generators, the OS consistently outpaced both control scores during optimization. The DCS sometimes stagnated or even decreased while the OS continued to climb. This divergence demonstrates that the generators exploit biases in the scoring function rather than discovering genuinely active compounds.</p>
<p>The MCS also diverged from the OS despite being trained on exactly the same data, confirming model-specific biases: the optimization exploits features unique to the particular random forest instance. The larger gap between OS and DCS (compared to OS and MCS) indicates that data-specific biases contribute more to the divergence than model-specific biases.</p>
<h3 id="chemical-space-migration">Chemical Space Migration</h3>
<p>Optimized molecules migrated toward the region of split 1 actives (used to train the OS), as shown by t-SNE embeddings and nearest-neighbor Tanimoto similarity analysis. Optimized molecules had more similar neighbors in split 1 than in split 2, confirming data-specific bias. By the end of optimization, generated molecules occupied different regions of chemical space than known actives when measured by logP and molecular weight, with compounds from the same optimization run forming distinct clusters.</p>
<h3 id="quality-of-generated-molecules">Quality of Generated Molecules</h3>
<p>High-scoring generated molecules frequently contained problematic substructures: reactive dienes, nitrogen-fluorine bonds, long heteroatom chains that are synthetically infeasible, and highly uncommon functional groups. The LSTM optimizer showed a bias toward high molecular weight, low diversity, and high logP values. These molecules would be rejected by medicinal chemists despite their high optimization scores.</p>
<h3 id="key-takeaways">Key Takeaways</h3>
<p>The authors emphasize several practical implications:</p>
<ol>
<li><strong>Early stopping</strong>: Control scores can indicate when further optimization is exploiting biases rather than finding better molecules. Optimization should stop when control scores plateau.</li>
<li><strong>Scoring function iteration</strong>: In practice, generative models are &ldquo;highly adept at exploiting&rdquo; incomplete scoring functions, necessitating several iterations of generation and scoring function refinement.</li>
<li><strong>Synthetic accessibility</strong>: Even high-scoring molecules are useless if they cannot be synthesized. The authors consider this a major challenge for practical adoption.</li>
<li><strong>Likelihood-based evaluation</strong>: For distribution-learning, the authors recommend reporting test-set likelihoods for likelihood-based models, following standard NLP practice.</li>
</ol>
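<p>The early-stopping recommendation amounts to a plateau test on the control-score trajectory. A minimal sketch in Python (the window size and tolerance here are illustrative choices, not values from the paper):</p>

```python
def plateau_step(scores, window=5, tol=1e-3):
    """Return the first optimization step at which a control score has
    plateaued: its best value over the trailing window exceeds the best
    value seen before that window by less than `tol`. Returns None if
    the score is still improving at the end of the trajectory."""
    for t in range(window, len(scores)):
        prior_best = max(scores[: t - window + 1])
        recent_best = max(scores[t - window + 1 : t + 1])
        if recent_best - prior_best < tol:
            return t  # further optimization likely exploits biases
    return None
```

<p>Monitoring the DCS with such a check would halt optimization near the point where further gains in the OS reflect bias exploitation rather than genuinely better molecules.</p>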
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bioactivity data</td>
          <td>ChEMBL (JAK2, EGFR, DRD2)</td>
          <td>See Table S1</td>
          <td>Binary classification tasks, split 50/50</td>
      </tr>
      <tr>
          <td>Distribution-learning</td>
          <td>GuacaMol training set</td>
          <td>Subset of ChEMBL</td>
          <td>Used as starting population for GA and PS</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Scoring function</strong>: Random forest classifier (scikit-learn) on binary ECFP4 fingerprints (size 1024, radius 2, RDKit)</li>
<li><strong>GA</strong>: Graph-based genetic algorithm from Jensen (2019)</li>
<li><strong>LSTM</strong>: SMILES-LSTM with hill climbing, pretrained model from GuacaMol</li>
<li><strong>PS</strong>: Particle swarm optimization in latent space of a sequence-to-sequence model (Winter et al. 2019)</li>
<li>Each optimizer run 10 times per target</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization Score (OS)</td>
          <td>RF classifier on split 1</td>
          <td>Guides optimization</td>
      </tr>
      <tr>
          <td>Model Control Score (MCS)</td>
          <td>RF on split 1, different seed</td>
          <td>Detects model-specific bias</td>
      </tr>
      <tr>
          <td>Data Control Score (DCS)</td>
          <td>RF on split 2</td>
          <td>Detects data-specific bias</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> metrics</td>
          <td>Validity, uniqueness, novelty, KL div, FCD</td>
          <td>For distribution-learning</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ml-jku/mgenerators-failure-modes">ml-jku/mgenerators-failure-modes</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Data, code, and results</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{renz2019failure,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{On failure modes in molecule generation and optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Renz, Philipp and Van Rompaey, Dries and Wegner, J{\&#34;o}rg Kurt and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Drug Discovery Today: Technologies}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{32-33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55--63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ddtec.2020.09.003}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S., &amp; Klambauer, G. (2019). On failure modes in molecule generation and optimization. <em>Drug Discovery Today: Technologies</em>, 32-33, 55-63. <a href="https://doi.org/10.1016/j.ddtec.2020.09.003">https://doi.org/10.1016/j.ddtec.2020.09.003</a></p>
<p><strong>Publication</strong>: Drug Discovery Today: Technologies, Volume 32-33, 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ml-jku/mgenerators-failure-modes">Code and data (GitHub)</a></li>
</ul>
]]></content:encoded></item><item><title>Back Translation for Semi-Supervised Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/back-translation-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/back-translation-molecule-generation/</guid><description>A semi-supervised method adapting NLP back translation to molecule generation, improving property optimization and retrosynthesis with unlabeled ZINC data.</description><content:encoded><![CDATA[<h2 id="semi-supervised-data-augmentation-for-molecular-tasks">Semi-Supervised Data Augmentation for Molecular Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces back translation, a semi-supervised technique from neural machine translation, to the domain of molecular generation. The primary contribution is a general-purpose data augmentation strategy that leverages large pools of unlabeled molecules (from databases like ZINC) to improve the performance of both sequence-based and graph-based models on molecule optimization and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> prediction tasks.</p>
<h2 id="bridging-the-labeled-data-gap-in-molecular-generation">Bridging the Labeled Data Gap in Molecular Generation</h2>
<p>Molecular generation tasks, such as property optimization and retrosynthesis, require paired training data: an input molecule (or property specification) mapped to a desired output molecule. Obtaining these labeled pairs is expensive and labor-intensive. Meanwhile, enormous databases of unlabeled molecules exist. ZINC alone contains over 750 million compounds, and PubChem has 109 million.</p>
<p>Prior approaches to using unlabeled molecular data include <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">variational autoencoders (VAEs)</a> for learning latent representations, conditional recurrent neural networks for inverse design, and pretraining techniques borrowed from NLP. However, these methods either focus on representation learning rather than direct generation, or require task-specific architectural modifications. The authors identify back translation, a well-established technique in machine translation, as a natural fit for molecular generation tasks that can be treated as sequence-to-sequence mappings.</p>
<h2 id="back-translation-as-molecular-data-augmentation">Back Translation as Molecular Data Augmentation</h2>
<p>The core idea is straightforward. Given a main task that maps from source domain $\mathcal{X}$ to target domain $\mathcal{Y}$ (e.g., mapping low-QED molecules to high-QED molecules), the method trains a reverse model $g$ that maps from $\mathcal{Y}$ back to $\mathcal{X}$. This reverse model then &ldquo;back translates&rdquo; unlabeled molecules from $\mathcal{Y}$ to generate synthetic source molecules, creating pseudo-labeled training pairs.</p>
<p>The theoretical motivation comes from maximizing the reconstruction probability. Given an unlabeled molecule $y_u \in \mathcal{U}_y$, the logarithmic reconstruction probability through the reverse model $g$ and forward model $f$ is:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) = \log \sum_{\hat{x}_u \in \mathcal{X}} P(\hat{x}_u \mid y_u; g) P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>Since summing over the exponentially large space $\mathcal{X}$ is intractable, the authors apply Jensen&rsquo;s inequality to obtain a lower bound:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) \geq \mathbb{E}_{\hat{x}_u \sim P(\cdot \mid y_u; g)} \log P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>This lower bound is optimized via Monte Carlo sampling in three steps:</p>
<p><strong>Step 1</strong>: Train both forward model $f$ and reverse model $g$ on the labeled data $\mathcal{L}$:</p>
<p>$$
\begin{aligned}
\min_{\theta_f} \sum_{(x,y) \in \mathcal{L}} -\log P(y \mid x; \theta_f) \\
\min_{\theta_g} \sum_{(x,y) \in \mathcal{L}} -\log P(x \mid y; \theta_g)
\end{aligned}
$$</p>
<p><strong>Step 2</strong>: Use the trained reverse model $g$ to back translate each unlabeled molecule $y_u \in \mathcal{U}_y$, producing synthetic pairs:</p>
<p>$$
\hat{\mathcal{L}} = \{(\hat{x}_u, y_u) \mid y_u \in \mathcal{U}_y,\ \hat{x}_u \text{ sampled from } P(\cdot \mid y_u; \theta_g)\}
$$</p>
<p><strong>Step 3</strong>: Retrain the forward model $f$ on the combined labeled and synthetic data $\mathcal{L} \cup \hat{\mathcal{L}}$, warm-starting from the parameters obtained in Step 1:</p>
<p>$$
\min_{\theta_f^*} \sum_{(x,y) \in \mathcal{L} \cup \hat{\mathcal{L}}} -\log P(y \mid x; \theta_f^*)
$$</p>
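<p>The three steps can be condensed into a short driver routine. This is a schematic sketch, not the paper's implementation; <code>train</code> and <code>sample</code> are hypothetical stand-ins for any seq2seq training and decoding procedure:</p>

```python
def back_translate(labeled_pairs, unlabeled_targets, train, sample):
    """Three-step back translation. `train(pairs)` returns a model fit on
    (input, output) pairs; `sample(model, inp)` draws one output."""
    # Step 1: train forward f (x -> y) and reverse g (y -> x) on labeled data
    f = train(labeled_pairs)
    g = train([(y, x) for x, y in labeled_pairs])
    # Step 2: back translate unlabeled targets into synthetic source molecules
    synthetic = [(sample(g, y_u), y_u) for y_u in unlabeled_targets]
    # Step 3: retrain the forward model on labeled + synthetic pairs
    f = train(labeled_pairs + synthetic)
    return f, synthetic

# Toy stand-ins: "training" memorizes pairs; "sampling" is a dictionary lookup.
toy_train = lambda pairs: dict(pairs)
toy_sample = lambda model, inp: model.get(inp, inp)
```

<p>In the paper, Step 3 warm-starts from the Step 1 parameters rather than training from scratch, and Step 2 may be followed by the filtration described below.</p>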
<p>A key practical finding is that data filtration matters. When using large amounts of unlabeled data (1M molecules), keeping only the synthetic pairs that satisfy the same constraints as the labeled data (e.g., similarity thresholds and property ranges) significantly improves performance over using all back-translated data unfiltered.</p>
<h2 id="experiments-on-property-optimization-and-retrosynthesis">Experiments on Property Optimization and Retrosynthesis</h2>
<h3 id="molecular-property-improvement">Molecular Property Improvement</h3>
<p>The authors evaluate on four tasks from Jin et al. (2019, 2020), each requiring the model to improve a specific molecular property while maintaining structural similarity (measured by Dice similarity on Morgan fingerprints):</p>
<ul>
<li><strong>LogP</strong> (penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">partition coefficient</a>): two settings with similarity thresholds $\delta \geq 0.4$ and $\delta \geq 0.6$</li>
<li><strong>QED</strong> (quantitative estimation of drug-likeness): translate molecules from QED range [0.7, 0.8] to [0.9, 1.0]</li>
<li><strong>DRD2</strong> (<a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine type 2 receptor</a> activity): translate inactive ($P &lt; 0.5$) to active ($P \geq 0.5$)</li>
</ul>
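<p>The similarity constraint uses Dice similarity, which for binary fingerprints represented as sets of on-bit indices reduces to $2|A \cap B| / (|A| + |B|)$. A minimal sketch (in practice one would use RDKit's fingerprint and similarity utilities):</p>

```python
def dice_similarity(bits_a, bits_b):
    """Dice similarity between two binary fingerprints given as
    sets of on-bit indices: 2|A ∩ B| / (|A| + |B|)."""
    if not bits_a and not bits_b:
        return 1.0  # convention for two empty fingerprints
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))
```
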
<p>Two backbone architectures are tested: a Transformer (6 layers, 4 heads, 128-dim embeddings, 512-dim FFN) and HierG2G, a hierarchical graph-to-graph translation model. Unlabeled molecules are sampled from ZINC at 250K and 1M scales.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP ($\delta \geq 0.6$)</th>
          <th>LogP ($\delta \geq 0.4$)</th>
          <th>QED (%)</th>
          <th>DRD2 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.28</td>
          <td>1.03</td>
          <td>8.8</td>
          <td>3.4</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>0.79</td>
          <td>2.49</td>
          <td>9.4</td>
          <td>4.4</td>
      </tr>
      <tr>
          <td>JTNN</td>
          <td>2.33</td>
          <td>3.55</td>
          <td>59.9</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Transformer baseline</td>
          <td>2.45</td>
          <td>3.69</td>
          <td>71.9</td>
          <td>60.2</td>
      </tr>
      <tr>
          <td>+BT (1M, filtered)</td>
          <td>2.86</td>
          <td>4.41</td>
          <td>82.9</td>
          <td>67.4</td>
      </tr>
      <tr>
          <td>HierG2G baseline</td>
          <td>2.49</td>
          <td>3.98</td>
          <td>76.9</td>
          <td>85.9</td>
      </tr>
      <tr>
          <td>+BT (250K, filtered)</td>
          <td>2.75</td>
          <td>4.24</td>
          <td>79.1</td>
          <td>87.3</td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis-prediction">Retrosynthesis Prediction</h3>
<p>On the USPTO-50K benchmark (50K reactions, 10 reaction types, 80/10/10 train/val/test split), the method is applied to Transformer and GLN (Graph Logic Network) backbones. For other approaches to this benchmark, see <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a> and <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">Data Transfer for Retrosynthesis</a>. Unlabeled reactant sets are constructed by sampling molecules from ZINC and concatenating them following the training data&rsquo;s reactant count distribution ($N_1 : N_2 : N_3 = 29.3\% : 70.4\% : 0.3\%$).</p>
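<p>Constructing pseudo-reactant sets from this count distribution can be sketched with the standard library; the molecule pool and dot-joining here are illustrative:</p>

```python
import random

REACTANT_COUNT_WEIGHTS = (0.293, 0.704, 0.003)  # N1 : N2 : N3 from the training data

def sample_reactant_set(pool, weights=REACTANT_COUNT_WEIGHTS, rng=random):
    """Draw 1, 2, or 3 distinct molecules from an unlabeled pool, matching
    the training data's reactant-count distribution, and join them with '.'
    (the SMILES separator for disconnected components)."""
    n = rng.choices([1, 2, 3], weights=weights)[0]
    return ".".join(rng.sample(pool, n))
```
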
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Top-1</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Reaction type given</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>64.2</td>
          <td>79.1</td>
          <td>85.2</td>
          <td>90.0</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>67.9</td>
          <td>82.5</td>
          <td>87.3</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>52.2</td>
          <td>68.2</td>
          <td>72.7</td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>55.9</td>
          <td>72.8</td>
          <td>77.8</td>
          <td>79.7</td>
      </tr>
      <tr>
          <td><strong>Reaction type unknown</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>54.7</td>
          <td>70.2</td>
          <td>77.0</td>
          <td>84.4</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>37.9</td>
          <td>57.3</td>
          <td>62.7</td>
          <td>68.1</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>43.5</td>
          <td>58.8</td>
          <td>64.6</td>
          <td>69.7</td>
      </tr>
  </tbody>
</table>
<p>The improvements are largest at lower $k$ values (top-1 and top-3), suggesting that back translation helps the model make more precise high-confidence predictions.</p>
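<p>Top-$k$ accuracy here is exact-match rank accuracy: a test case counts as correct if the ground-truth reactant set appears among the model's $k$ highest-ranked candidates. A minimal sketch:</p>

```python
def top_k_accuracy(predictions, targets, k):
    """Fraction of test cases whose ground-truth string appears among the
    model's k highest-ranked predictions. `predictions` is a list of ranked
    candidate lists, aligned index-by-index with `targets`."""
    hits = sum(target in preds[:k] for preds, target in zip(predictions, targets))
    return hits / len(targets)
```
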
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Effect of unlabeled data size</strong>: On retrosynthesis with Transformer, performance improves as unlabeled data increases from 50K to 250K, then plateaus or declines beyond 250K. The authors attribute this to noise in the back-translated data outweighing the benefits at larger scales.</p>
<p><strong>Effect of labeled data size</strong>: With only 5K labeled samples, adding back-translated data hurts performance because the reverse model is too weak to generate useful synthetic data. As labeled data increases (10K, 25K, 50K), the benefit of back translation grows. This confirms that the method requires a reasonably well-trained reverse model to be effective.</p>
<p><strong>Data filtration</strong>: Scaling to 1M unfiltered back-translated molecules can underperform (e.g., on QED, 1M unfiltered data reaches only 75.1%, barely above the 71.9% baseline and well below the 82.9% achieved with filtering), while filtering to enforce the same constraints as the labeled data recovers and exceeds the 250K filtered results.</p>
<h2 id="consistent-gains-across-architectures-and-tasks">Consistent Gains Across Architectures and Tasks</h2>
<p>The method achieves state-of-the-art results on all four molecular property improvement tasks and the USPTO-50K retrosynthesis benchmark at time of publication. Several observations stand out:</p>
<ol>
<li><strong>Architecture agnosticism</strong>: Back translation improves both sequence-based (Transformer) and graph-based (HierG2G, GLN) models, confirming that the approach is independent of the underlying architecture.</li>
<li><strong>Filtration is essential at scale</strong>: Unfiltered 1M back-translated data can degrade performance, but filtered data at the same scale consistently outperforms smaller unfiltered sets.</li>
<li><strong>Training overhead is moderate</strong>: On the DRD2 task, the full back-translation pipeline with Transformer takes about 2.5x the supervised training time: 8.5h of initial training, under 1 hour for the back-translation step itself, and 11.0h of retraining on the augmented data.</li>
<li><strong>Diversity and novelty increase</strong>: Back translation improves both diversity (average pairwise distance among generated molecules) and novelty (fraction of generated molecules not seen in training) across QED and DRD2 tasks.</li>
</ol>
<p>The authors acknowledge limitations: the method does not form a closed loop between forward and reverse models (as in dual learning approaches), and the data filtration strategy is rule-based rather than learned. They suggest joint training of forward and reverse models and learned filtration as future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (property improvement)</td>
          <td>Jin et al. (2019, 2020) datasets</td>
          <td>34K-99K pairs</td>
          <td>LogP, QED, DRD2 tasks</td>
      </tr>
      <tr>
          <td>Training (retrosynthesis)</td>
          <td>USPTO-50K</td>
          <td>40K reactions</td>
          <td>80/10/10 split from Dai et al. (2019)</td>
      </tr>
      <tr>
          <td>Unlabeled molecules</td>
          <td>ZINC</td>
          <td>250K or 1M</td>
          <td>Randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Same as training</td>
          <td>800-1000 test samples</td>
          <td>Per-task test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Back translation with optional data filtration</li>
<li>Beam search with $k=20$ for inference</li>
<li>Random sampling for back-translation step (Equation 5)</li>
<li>Dice similarity on Morgan fingerprints for similarity constraint</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Transformer</strong>: 6 layers, 4 attention heads, 128-dim embeddings, 512-dim FFN (for property improvement); 4 layers, 8 heads, 256-dim embeddings, 2048-dim FFN (for retrosynthesis)</li>
<li><strong>HierG2G</strong>: Settings from Jin et al. (2020)</li>
<li><strong>GLN</strong>: Settings from Dai et al. (2019)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.6$)</td>
          <td>2.86</td>
          <td>2.49 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.4$)</td>
          <td>4.41</td>
          <td>3.98 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>QED</td>
          <td>82.9%</td>
          <td>76.9% (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>DRD2</td>
          <td>87.3%</td>
          <td>85.9% (HierG2G)</td>
          <td>HierG2G + BT(250K, filtered)</td>
      </tr>
      <tr>
          <td>Top-1 accuracy</td>
          <td>USPTO-50K (known type)</td>
          <td>67.9%</td>
          <td>64.2% (GLN)</td>
          <td>Ours + GLN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper reports training times (8.5h for Transformer, 16.8h for HierG2G on DRD2 with 1M unlabeled data) but does not specify the GPU hardware used.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fyabc/BT4MolGen">BT4MolGen</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation in Python</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, Y., Xia, Y., Zhu, J., Wu, L., Xie, S., &amp; Qin, T. (2021). Back translation for molecule generation. <em>Bioinformatics</em>, 38(5), 1244-1251. <a href="https://doi.org/10.1093/bioinformatics/btab817">https://doi.org/10.1093/bioinformatics/btab817</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fan2022back,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Back translation for molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fan, Yang and Xia, Yingce and Zhu, Jinhua and Wu, Lijun and Xie, Shufang and Qin, Tao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1244--1251}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bioinformatics/btab817}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>UnCorrupt SMILES: Post Hoc Correction for De Novo Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/uncorrupt-smiles/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/uncorrupt-smiles/</guid><description>A transformer-based SMILES corrector that fixes invalid outputs from molecular generators, recovering 60-95% of erroneous SMILES strings.</description><content:encoded><![CDATA[<h2 id="a-transformer-based-smiles-error-corrector">A Transformer-Based SMILES Error Corrector</h2>
<p>This is a <strong>Method</strong> paper that proposes a post hoc approach to fixing invalid SMILES produced by de novo molecular generators. Rather than trying to prevent invalid outputs through alternative representations (<a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>) or constrained architectures (graph models), the authors train a transformer model to translate invalid SMILES into valid ones. The corrector is framed as a sequence-to-sequence translation task, drawing on techniques from grammatical error correction (GEC) in natural language processing.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based generative models produce some percentage of invalid outputs that cannot be converted to molecules. The invalidity rate varies substantially across model types:</p>
<ul>
<li><strong>RNN models</strong> (DrugEx): 5.7% invalid (pretrained) and 4.7% invalid (target-directed)</li>
<li><strong>GANs</strong> (ORGANIC): 9.5% invalid</li>
<li><strong>VAEs</strong> (GENTRL): 88.9% invalid</li>
</ul>
<p>These invalid outputs represent wasted computation and potentially introduce bias toward molecules that are easier to generate correctly. Previous approaches to this problem include using alternative representations (<a href="/notes/computational-chemistry/molecular-representations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>) or graph-based models, but these either limit the search space or increase computational cost. The authors propose a complementary strategy: fix the errors after generation.</p>
<h2 id="error-taxonomy-across-generator-types">Error Taxonomy Across Generator Types</h2>
<p>The paper classifies invalid SMILES errors into six categories based on RDKit error messages:</p>
<ol>
<li><strong>Syntax errors</strong>: malformed SMILES grammar</li>
<li><strong>Unclosed rings</strong>: unmatched ring closure digits</li>
<li><strong>Parentheses errors</strong>: unbalanced open/close parentheses</li>
<li><strong>Bond already exists</strong>: duplicate bonds between the same atoms</li>
<li><strong>Aromaticity errors</strong>: atoms incorrectly marked as aromatic or kekulization failures</li>
<li><strong>Valence errors</strong>: atoms exceeding their maximum bond count</li>
</ol>
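<p>Bucketing invalid SMILES by RDKit's error messages can be sketched as pattern matching; the patterns below are illustrative guesses at the message text (which varies by RDKit version), not the paper's exact rules:</p>

```python
import re

# Illustrative patterns for RDKit-style parse/sanitization messages.
ERROR_PATTERNS = [
    ("unclosed_ring", re.compile(r"unclosed ring", re.I)),
    ("parentheses", re.compile(r"parenthes", re.I)),
    ("bond_exists", re.compile(r"bond already exists", re.I)),
    ("aromaticity", re.compile(r"aromatic|kekulize", re.I)),
    ("valence", re.compile(r"valence", re.I)),
]

def classify_error(message):
    """Map an RDKit error message to one of the paper's six categories;
    anything unmatched falls back to the generic syntax-error bucket."""
    for label, pattern in ERROR_PATTERNS:
        if pattern.search(message):
            return label
    return "syntax"
```
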
<p>The distribution of error types differs across generators. RNN-based models primarily produce aromaticity errors, suggesting they learn SMILES grammar well but struggle with chemical validity. The GAN (ORGANIC) produces mostly valence errors. The VAE (GENTRL) produces more grammar-level errors (syntax, parentheses, unclosed rings), indicating that sampling from the continuous latent space often produces sequences that violate basic SMILES structure.</p>
<h2 id="architecture-and-training">Architecture and Training</h2>
<p>The SMILES corrector uses a standard encoder-decoder transformer architecture based on Vaswani et al., with learned positional encodings. Key specifications:</p>
<ul>
<li>Embedding dimension: 256</li>
<li>Encoder/decoder layers: 3 each</li>
<li>Attention heads: 8</li>
<li>Feed-forward dimension: 512</li>
<li>Dropout: 0.1</li>
<li>Optimizer: Adam (learning rate 0.0005)</li>
<li>Training: 20 epochs, batch size 16</li>
</ul>
<p>Since no dataset of manually corrected invalid-valid SMILES pairs exists, the authors create synthetic training data by introducing errors into valid SMILES from the Papyrus bioactivity dataset (approximately 1.3M pairs). Errors are introduced through random perturbations following SMILES syntax rules: character substitutions, bond order changes, fragment additions from the <a href="/notes/computational-chemistry/datasets/gdb-11/">GDB</a>-8 database to atoms with full valence, and other structural modifications.</p>
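<p>The error-introduction step can be sketched as random character-level edits; the character set and edit operations below are illustrative simplifications of the paper's syntax-aware perturbations (which also draw fragments from GDB-8):</p>

```python
import random

SMILES_CHARS = list("CNOFSPcnos123456()=#[]")  # illustrative character set

def corrupt(smiles, n_errors=1, rng=random):
    """Introduce `n_errors` random character-level perturbations
    (substitution, insertion, or deletion) into a valid SMILES string,
    yielding a likely-invalid input for corrector training."""
    chars = list(smiles)
    for _ in range(n_errors):
        op = rng.choice(["substitute", "insert", "delete"])
        i = rng.randrange(len(chars))
        if op == "substitute":
            chars[i] = rng.choice(SMILES_CHARS)
        elif op == "insert":
            chars.insert(i, rng.choice(SMILES_CHARS))
        elif len(chars) > 1:  # deletion, but never empty the string
            del chars[i]
    return "".join(chars)
```

<p>A synthetic training pair is then <code>(corrupt(s, n_errors=12), s)</code> for each valid SMILES <code>s</code>, matching the best-performing multi-error setting reported below.</p>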
<h2 id="training-with-multiple-errors-improves-correction">Training with Multiple Errors Improves Correction</h2>
<p>A key finding is that training the corrector on inputs with multiple errors per SMILES substantially improves performance on real generator outputs. The baseline model (1 error per input) fixes 35-80% of invalid outputs depending on the generator. Increasing errors per training input to 12 raises this to 62-95%:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>1 error/input</th>
          <th>12 errors/input</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN (DrugEx)</td>
          <td>~60% fixed</td>
          <td>62% fixed</td>
      </tr>
      <tr>
          <td>Target-directed RNN</td>
          <td>~60% fixed</td>
          <td>68% fixed</td>
      </tr>
      <tr>
          <td>GAN (ORGANIC)</td>
          <td>~80% fixed</td>
          <td>95% fixed</td>
      </tr>
      <tr>
          <td>VAE (GENTRL)</td>
          <td>~35% fixed</td>
          <td>80% fixed</td>
      </tr>
  </tbody>
</table>
<p>Training beyond 12 errors per input yields diminishing returns (80% average at 20 errors vs. 78% at 12). The improvement from multi-error training is consistent with GEC literature, where models learn to &ldquo;distrust&rdquo; inputs more when exposed to higher error rates.</p>
<p>The model also shows low overcorrection: only 14% of valid SMILES are altered during translation, comparable to overcorrection rates in spelling correction systems.</p>
<h2 id="fixed-molecules-are-comparable-to-generator-outputs">Fixed Molecules Are Comparable to Generator Outputs</h2>
<p>The corrected molecules are evaluated against both the training set and the readily generated (valid) molecules from each generator:</p>
<ul>
<li><strong>Uniqueness</strong>: 97% of corrected molecules are unique</li>
<li><strong>Novelty vs. generated</strong>: 97% of corrected molecules are novel compared to the valid generator outputs</li>
<li><strong>Similarity to nearest neighbor (SNN)</strong>: 0.45 between fixed and generated sets, indicating the corrected molecules explore different parts of chemical space</li>
<li><strong>Property distributions</strong>: KL divergence scores between fixed molecules and the training set are comparable to those between generated molecules and the training set</li>
</ul>
<p>This demonstrates that SMILES correction produces molecules that are as chemically reasonable as the generator&rsquo;s valid outputs while exploring complementary regions of chemical space.</p>
<h2 id="local-chemical-space-exploration-via-error-introduction">Local Chemical Space Exploration via Error Introduction</h2>
<p>Beyond fixing generator errors, the authors propose using the SMILES corrector for analog generation. The workflow is:</p>
<ol>
<li>Take a known active molecule</li>
<li>Introduce random errors into its SMILES (repeated 1000 times)</li>
<li>Correct the errors using the trained corrector</li>
</ol>
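<p>The workflow above can be sketched as a corrupt-then-correct loop, with <code>perturb</code> and <code>corrector</code> as hypothetical stand-ins for the error-introduction routine and the trained transformer:</p>

```python
def explore_analogs(seed_smiles, perturb, corrector, n_samples=1000):
    """Local chemical-space exploration: repeatedly perturb a known
    active's SMILES, repair each variant with the corrector, and keep
    the unique results."""
    analogs = {corrector(perturb(seed_smiles)) for _ in range(n_samples)}
    analogs.discard(seed_smiles)  # the corrector often regenerates the input
    return analogs
```
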
<p>This &ldquo;local sequence exploration&rdquo; generates novel analogs with 97% validity. The uniqueness (39%) and novelty (16-37%) are lower than for generator correction because the corrector often regenerates the original molecule. However, the approach produces molecules that are structurally similar to the starting compound (SNN of 0.85 to known ligands).</p>
<p>The authors demonstrate this on selective <a href="https://en.wikipedia.org/wiki/Aurora_kinase_B">Aurora kinase B</a> (AURKB) inhibitors. The generated analogs occupy the same binding site region as the co-crystallized ligand VX-680 in docking studies, with predicted bioactivities similar to known compounds. Compared to target-directed RNN generation, SMILES exploration produces molecules closer to known actives (higher SNN, scaffold similarity, and KL divergence scores).</p>
<h2 id="limitations">Limitations</h2>
<p>The corrector performance drops when applied to real generator outputs compared to synthetic test data, because the synthetic error distribution does not perfectly match the errors that generators actually produce. Generator-specific correctors trained on actual invalid outputs could improve performance. The local exploration approach has limited novelty since the corrector frequently regenerates the original molecule. The evaluation uses predicted rather than experimental bioactivities for the Aurora kinase case study.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">LindeSchoenmaker/SMILES-corrector</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training code, synthetic error generation, and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Synthetic training pairs derived from the Papyrus bioactivity dataset (v5.5). Approximately 1.3M invalid-valid pairs per error-count setting.</p>
<p><strong>Code</strong>: Transformer implemented in PyTorch, adapted from Ben Trevett&rsquo;s seq2seq tutorial. Generative model baselines use DrugEx, GENTRL, and ORGANIC.</p>
<p><strong>Evaluation</strong>: Validity assessed with RDKit. Similarity metrics (SNN, fragment, scaffold) and KL divergence computed following <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> and <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark protocols.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schoenmaker, L., Béquignon, O. J. M., Jespers, W., &amp; van Westen, G. J. P. (2023). UnCorrupt SMILES: a novel approach to de novo design. <em>Journal of Cheminformatics</em>, 15, 22.</p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">GitHub: LindeSchoenmaker/SMILES-corrector</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schoenmaker2023uncorrupt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{UnCorrupt SMILES: a novel approach to de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Schoenmaker, Linde and B{\&#39;e}quignon, Olivier J. M. and Jespers, Willem and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00696-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RetMol: Retrieval-Based Controllable Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/</guid><description>RetMol uses retrieval-augmented generation to steer a pre-trained molecular model toward desired properties using only a handful of exemplar molecules.</description><content:encoded><![CDATA[<h2 id="retrieval-augmented-generation-for-molecules">Retrieval-Augmented Generation for Molecules</h2>
<p>This is a <strong>Method</strong> paper that introduces RetMol, a retrieval-based framework for controllable molecule generation. The key idea is to guide a pre-trained generative model using a small set of exemplar molecules that partially satisfy the desired design criteria, retrieved from a task-specific database. The approach requires no task-specific fine-tuning of the generative backbone and works effectively with very few exemplar molecules (as few as 23).</p>
<h2 id="limitations-of-existing-controllable-generation">Limitations of Existing Controllable Generation</h2>
<p>Existing approaches to controllable molecule generation fall into three categories, each with drawbacks:</p>
<ol>
<li><strong>Reinforcement learning (RL)-based methods</strong> require task-specific fine-tuning of the generative model for each new objective</li>
<li><strong>Supervised learning (SL)-based methods</strong> need molecules with desired properties as training data, which may be scarce</li>
<li><strong>Latent optimization-based methods</strong> require training property predictors in the latent space, which is challenging with limited active molecules and incompatible with variable-length latent spaces like those in transformers</li>
</ol>
<p>RetMol addresses all three issues by keeping the generative backbone frozen and using a lightweight, task-agnostic retrieval module that can be applied to new tasks simply by swapping the retrieval database.</p>
<h2 id="the-retmol-framework">The RetMol Framework</h2>
<p>RetMol consists of four components built around a pre-trained encoder-decoder backbone (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>, a BART variant trained on ZINC):</p>
<h3 id="retrieval-database">Retrieval Database</h3>
<p>A task-specific collection of exemplar molecules that at least partially satisfy the design criteria. The database can be very small (e.g., 23 known inhibitors for the SARS-CoV-2 task) and is dynamically updated during inference with newly generated molecules.</p>
<h3 id="molecule-retriever">Molecule Retriever</h3>
<p>A heuristic-based module that selects the $K$ most relevant exemplar molecules (default $K = 10$). It first constructs a feasible set of molecules satisfying all constraints, then selects those with the best property scores. If too few molecules satisfy all constraints, it progressively relaxes constraints until enough candidates are available.</p>
<h3 id="information-fusion-via-cross-attention">Information Fusion via Cross-Attention</h3>
<p>The core trainable component. Retrieved exemplar embeddings are fused with the input molecule embedding using cross-attention:</p>
<p>$$\boldsymbol{e} = f_{\text{CA}}(\boldsymbol{e}_{\text{in}}, \boldsymbol{E}_r; \theta) = \text{Attn}(\text{Query}(\boldsymbol{e}_{\text{in}}), \text{Key}(\boldsymbol{E}_r)) \cdot \text{Value}(\boldsymbol{E}_r)$$</p>
<p>where $\boldsymbol{e}_{\text{in}} = \text{Enc}(x_{\text{in}}) \in \mathbb{R}^{L \times D}$ is the input embedding and $\boldsymbol{E}_r = [\boldsymbol{e}_r^1, \ldots, \boldsymbol{e}_r^K]$ are the retrieved exemplar embeddings. This module adds less than 5% parameter overhead (460K parameters over the 10M base model).</p>
<h3 id="self-supervised-training-nearest-neighbor-prediction">Self-Supervised Training: Nearest Neighbor Prediction</h3>
<p>Rather than reconstructing the input molecule (which would make the retrieval module unnecessary), RetMol trains the fusion module to predict the nearest neighbor of the input:</p>
<p>$$\mathcal{L}(\theta) = \sum_{i=1}^{B} \text{CE}\left(\text{Dec}\left(f_{\text{CA}}(\boldsymbol{e}_{\text{in}}^{(i)}, \boldsymbol{E}_r^{(i)}; \theta)\right), x_{\text{1NN}}^{(i)}\right)$$</p>
<p>The remaining $K - 1$ nearest neighbors serve as the retrieved exemplar molecules. This forces the fusion module to learn how to use exemplar molecules to transform the input toward a related target. Only the fusion module parameters are updated; the encoder and decoder remain frozen.</p>
<h2 id="iterative-refinement-at-inference">Iterative Refinement at Inference</h2>
<p>During inference, RetMol uses an iterative process:</p>
<ol>
<li>Encode the input molecule and retrieved exemplars</li>
<li>Fuse embeddings via cross-attention</li>
<li>Perturb the fused embedding $M$ times with Gaussian noise</li>
<li>Greedily decode $M$ candidate molecules</li>
<li>Replace the input with the best candidate if it improves upon the current score</li>
<li>Add remaining good candidates to the retrieval database</li>
<li>Repeat until convergence or a maximum number of iterations</li>
</ol>
<p>The dynamic update of the retrieval database is critical for extrapolating beyond the initial set of exemplar molecules.</p>
<h2 id="experiments-and-results">Experiments and Results</h2>
<p>RetMol is evaluated on four tasks of increasing difficulty:</p>
<h3 id="qed-optimization-under-similarity-constraint">QED Optimization Under Similarity Constraint</h3>
<p>Goal: generate molecules with QED $\geq$ 0.9 while maintaining <a href="https://en.wikipedia.org/wiki/Tanimoto_coefficient">Tanimoto similarity</a> $\geq$ 0.4 to the input. RetMol achieves 94.5% success rate, compared to 92.8% for the previous best (QMO).</p>
<h3 id="penalized-logp-optimization">Penalized LogP Optimization</h3>
<p>Goal: maximize penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">LogP</a> while maintaining structural similarity. At $\delta = 0.4$, RetMol achieves 11.55 average improvement, compared to 7.71 for QMO.</p>
<h3 id="gsk3beta--jnk3-dual-inhibitor-design"><a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>$\beta$ + <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> Dual Inhibitor Design</h3>
<p>Goal: simultaneously satisfy four constraints (GSK3$\beta$ inhibition $\geq$ 0.5, JNK3 inhibition $\geq$ 0.5, QED $\geq$ 0.6, SA $\leq$ 4). Results:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Success %</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>47.9</td>
          <td>0.561</td>
          <td>0.621</td>
      </tr>
      <tr>
          <td>RationaleRL</td>
          <td>74.8</td>
          <td>0.568</td>
          <td>0.701</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>92.3</td>
          <td>0.824</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>MolEvol</td>
          <td>93.0</td>
          <td>0.757</td>
          <td>0.681</td>
      </tr>
      <tr>
          <td>RetMol</td>
          <td>96.9</td>
          <td>0.862</td>
          <td>0.732</td>
      </tr>
  </tbody>
</table>
<p>RetMol achieves this without task-specific fine-tuning and requires only 80 iterations compared to MARS&rsquo;s 550.</p>
<h3 id="sars-cov-2-main-protease-inhibitor-optimization"><a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 Main Protease</a> Inhibitor Optimization</h3>
<p>A real-world task using only 23 known inhibitors as the retrieval database and optimizing 8 weakly-binding drugs. Under the milder similarity constraint ($\delta = 0.4$), RetMol achieves 2.84 kcal/mol average binding affinity improvement versus 1.67 for Graph GA. Under the stricter constraint ($\delta = 0.6$), RetMol succeeds on 5/8 molecules versus 3/8 for Graph GA.</p>
<h2 id="key-analysis-findings">Key Analysis Findings</h2>
<ul>
<li><strong>Database size</strong>: Strong performance even with 100 molecules, already outperforming baselines on success rate</li>
<li><strong>Database quality</strong>: Molecules satisfying all four constraints give the best results (96.9%), but partial satisfaction still works reasonably (84.7% with two properties)</li>
<li><strong>Training objective</strong>: The nearest neighbor prediction objective outperforms conventional reconstruction on validity (0.902 vs. 0.834) and uniqueness (0.922 vs. 0.665)</li>
<li><strong>Dynamic database update</strong>: Essential for extrapolating beyond the initial retrieval database, generating molecules with property values exceeding the best in the original database</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>RetMol requires exemplar molecules that at least partially satisfy the design criteria. When such molecules are entirely unavailable, the framework cannot be applied. The method also relies on property predictors (for scoring and retrieval), whose accuracy directly affects generation quality. The iterative refinement process adds computational overhead at inference time, and the results depend on the Chemformer backbone&rsquo;s generation capabilities.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol</a></td>
          <td>Code</td>
          <td>NVIDIA Source Code License-NC</td>
          <td>Full training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol (checkpoints)</a></td>
          <td>Model</td>
          <td>CC BY-NC-SA 4.0</td>
          <td>Pre-trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k and ChEMBL datasets for training. Task-specific retrieval databases constructed from these datasets. COVID-19 task uses 23 known SARS-CoV-2 Mpro inhibitors.</p>
<p><strong>Training</strong>: Information fusion module trained on 4x V100 GPUs (16GB each) for approximately 2 hours. Batch size of 256 per GPU, 50K iterations.</p>
<p><strong>Inference</strong>: Single V100 GPU. Greedy decoding with Gaussian perturbation ($\sigma = 1$) for sampling multiple candidates per iteration.</p>
<p><strong>Backbone</strong>: Chemformer (BART variant) pre-trained on ZINC. Frozen during RetMol training and inference.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, Z., Nie, W., Qiao, Z., Xiao, C., Baraniuk, R. G., &amp; Anandkumar, A. (2023). Retrieval-based Controllable Molecule Generation. <em>Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023)</em>.</p>
<p><strong>Publication</strong>: International Conference on Learning Representations (ICLR) 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/NVlabs/RetMol">GitHub: NVlabs/RetMol</a></li>
<li><a href="https://openreview.net/forum?id=vDFA1tpuLvk">OpenReview</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2023retrieval,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Retrieval-based Controllable Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Zichao and Nie, Weili and Qiao, Zhuoran and Xiao, Chaowei and Baraniuk, Richard G. and Anandkumar, Anima}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=vDFA1tpuLvk}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LIMO: Latent Inceptionism for Targeted Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/</guid><description>LIMO uses gradient-based optimization through a VAE latent space and stacked property predictor to generate drug-like molecules with high binding affinity.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eckmann, P., Sun, K., Zhao, B., Feng, M., Gilson, M. K., &amp; Yu, R. (2022). LIMO: Latent Inceptionism for Targeted Molecule Generation. <em>Proceedings of the 39th International Conference on Machine Learning (ICML 2022)</em>, PMLR 162, 5777&ndash;5792.</p>
<p><strong>Publication</strong>: ICML 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Rose-STL-Lab/LIMO">GitHub: Rose-STL-Lab/LIMO</a></li>
<li><a href="https://arxiv.org/abs/2206.09010">arXiv: 2206.09010</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{eckmann2022limo,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LIMO: Latent Inceptionism for Targeted Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Eckmann, Peter and Sun, Kunyang and Zhao, Bo and Feng, Mudong and Gilson, Michael K and Yu, Rose}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5777--5792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">organization</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="gradient-based-reverse-optimization-in-molecular-latent-space">Gradient-Based Reverse Optimization in Molecular Latent Space</h2>
<p>This is a <strong>Method</strong> paper that introduces LIMO, a framework for generating molecules with desired properties using gradient-based optimization on a VAE latent space. The key innovation is a stacked architecture where a property predictor operates on the decoded molecular representation rather than directly on the latent space, combined with an inceptionism-like technique that backpropagates through the frozen decoder and predictor to optimize the latent code. This approach is 6-8x faster than RL baselines and 12x faster than sampling-based approaches while producing molecules with higher binding affinities.</p>
<h2 id="slow-property-optimization-in-existing-methods">Slow Property Optimization in Existing Methods</h2>
<p>Generating molecules with high binding affinity to target proteins is a central goal of early drug discovery, but existing computational approaches are slow when optimizing for properties that are expensive to evaluate (such as docking-based binding affinity). RL-based methods require many calls to the property function during training. Sampling-based approaches like MARS need hundreds of iterations. Latent optimization methods that predict properties directly from the latent space suffer from poor prediction accuracy because the mapping from latent space to molecular properties is difficult to learn.</p>
<h2 id="the-limo-framework">The LIMO Framework</h2>
<p>LIMO consists of three components: a VAE for learning a molecular latent space, a property predictor with a novel stacked architecture, and a gradient-based reverse optimization procedure.</p>
<h3 id="selfies-based-vae">SELFIES-Based VAE</h3>
<p>The VAE encodes molecules represented as SELFIES strings into a 1024-dimensional latent space $\mathbf{z} \in \mathbb{R}^m$ and decodes to probability distributions over SELFIES symbols. Since all SELFIES strings correspond to valid molecules, this guarantees 100% chemical validity. The output molecule is obtained by taking the argmax at each position:</p>
<p>$$\hat{x}_i = s_{d_i^*}, \quad d_i^* = \operatorname{argmax}_{d} \{y_{i,1}, \ldots, y_{i,d}\}$$</p>
<p>The VAE uses fully-connected layers (not recurrent), with a 64-dimensional embedding layer, four batch-normalized linear layers (2000-dimensional first layer, 1000-dimensional for the rest) with ReLU activation, and is trained with ELBO loss (0.9 weight on reconstruction, 0.1 on KL divergence).</p>
<h3 id="stacked-property-predictor">Stacked Property Predictor</h3>
<p>The critical architectural choice: the property predictor $g_\theta$ takes the decoded molecular representation $\hat{\mathbf{x}}$ as input rather than the latent code $\mathbf{z}$. The predictor is trained after the VAE is frozen by minimizing MSE on VAE-generated molecules:</p>
<p>$$\ell_0(\theta) = \left\| g_\theta\left(f_{\text{dec}}(\mathbf{z})\right) - \pi\left(f_{\text{dec}}(\mathbf{z})\right) \right\|^2$$</p>
<p>where $\pi$ is the ground-truth property function. This stacking improves prediction accuracy from $r^2 = 0.04$ (predicting from $\mathbf{z}$) to $r^2 = 0.38$ (predicting from $\hat{\mathbf{x}}$) on an unseen test set. The improvement comes because the mapping from molecular space to property is easier to learn than the mapping from latent space to property.</p>
<h3 id="reverse-optimization-inceptionism">Reverse Optimization (Inceptionism)</h3>
<p>After training, the decoder and predictor weights are frozen and $\mathbf{z}$ becomes the trainable parameter. For multiple properties with weights $(w_1, \ldots, w_k)$, the optimization minimizes:</p>
<p>$$\ell_1(\mathbf{z}) = -\sum_{i=1}^{k} w_i \cdot g^i\left(f_{\text{dec}}(\mathbf{z})\right)$$</p>
<p>Since both the decoder and predictor are neural networks, gradients flow through the entire chain, enabling efficient optimization with Adam. This is analogous to the &ldquo;inceptionism&rdquo; (DeepDream) technique from computer vision, where network inputs are optimized to maximize specific outputs.</p>
<h3 id="substructure-constrained-optimization">Substructure-Constrained Optimization</h3>
<p>For lead optimization, LIMO can fix a molecular substructure during optimization by adding a regularization term:</p>
<p>$$\ell_2(\mathbf{z}) = \lambda \sum_{i=1}^{n} \sum_{j=1}^{d} \left(M_{i,j} \cdot \left(f_{\text{dec}}(\mathbf{z})_{i,j} - (\hat{\mathbf{x}}_{\text{start}})_{i,j}\right)\right)^2$$</p>
<p>where $M$ is a binary mask specifying which SELFIES positions must remain unchanged and $\lambda = 1000$. This capability is enabled by the intermediate decoded representation, which most VAE-based methods lack.</p>
<h2 id="experiments-and-results">Experiments and Results</h2>
<h3 id="benchmark-tasks-qed-and-penalized-logp">Benchmark Tasks (QED and Penalized LogP)</h3>
<p>LIMO achieves results competitive with deep generative and RL-based models in 1 hour, compared to 8-24 hours for baselines. Top QED score: 0.947 (maximum possible: 0.948). Top penalized LogP: 10.5 (among length-limited models, comparable to MolDQN&rsquo;s 11.8).</p>
<p>The ablation study (&ldquo;LIMO on z&rdquo;) confirms the value of the stacked predictor architecture: predicting from $\hat{\mathbf{x}}$ yields a top p-logP of 10.5 versus 6.52 when predicting directly from $\mathbf{z}$.</p>
<h3 id="binding-affinity-maximization">Binding Affinity Maximization</h3>
<p>The primary contribution. LIMO generates molecules with substantially higher computed binding affinities (lower $K_D$) than baselines against two protein targets:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>ESR1 best $K_D$ (nM)</th>
          <th>ACAA1 best $K_D$ (nM)</th>
          <th>Time (hrs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>6.4</td>
          <td>75</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MolDQN</td>
          <td>373</td>
          <td>240</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>17</td>
          <td>163</td>
          <td>6</td>
      </tr>
      <tr>
          <td>GraphDF</td>
          <td>25</td>
          <td>370</td>
          <td>12</td>
      </tr>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>37</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p>For ESR1, LIMO&rsquo;s best molecule has a $K_D$ of 0.72 nM from docking, nearly 10x better than the next method (GCPN at 6.4 nM). When corroborated with more rigorous absolute binding free energy (ABFE) calculations, one LIMO compound achieved a predicted $K_D$ of $6 \times 10^{-14}$ M (0.00006 nM), far exceeding the affinities of approved drugs tamoxifen ($K_D$ = 1.5 nM) and raloxifene ($K_D$ = 0.03 nM).</p>
<h3 id="multi-objective-optimization">Multi-Objective Optimization</h3>
<p>Single-objective optimization produces molecules with high affinity but problematic structures (polyenes, large rings). Multi-objective optimization simultaneously targeting binding affinity, QED ($&gt;$ 0.4), and SA ($&lt;$ 5.5) produces drug-like, synthesizable molecules that still have nanomolar binding affinities. Generated molecules satisfy Lipinski&rsquo;s rule of 5 with zero PAINS alerts.</p>
<h2 id="limitations">Limitations</h2>
<p>The LIMO property predictor achieves only moderate prediction accuracy ($r^2$ = 0.38), meaning the optimization relies on gradient direction being correct rather than absolute predictions being accurate. AutoDock-GPU docking scores do not correlate well with the more accurate ABFE results, a known limitation of docking. The fully-connected VAE architecture limits the molecular diversity compared to recurrent or attention-based alternatives (LSTM decoder produced max QED of only 0.3). The greedy fine-tuning step (replacing carbons with heteroatoms) is a heuristic rather than a learned procedure.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Rose-STL-Lab/LIMO">Rose-STL-Lab/LIMO</a></td>
          <td>Code</td>
          <td>UC San Diego Custom (non-commercial)</td>
          <td>Full training, optimization, and evaluation code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k dataset for optimization tasks. MOSES dataset for random generation evaluation. Binding affinities computed with AutoDock-GPU.</p>
<p><strong>Hardware</strong>: Two GTX 1080 Ti GPUs (one for PyTorch, one for AutoDock-GPU), 4 CPU cores, 32 GB memory.</p>
<p><strong>Training</strong>: VAE trained for 18 epochs with learning rate 0.0001. Property predictor uses 3 layers of 1000 units, trained for 5 epochs. Reverse optimization uses learning rate 0.1 for 10 epochs.</p>
<p><strong>Targets</strong>: Human estrogen receptor (ESR1, PDB 1ERR) and human peroxisomal acetyl-CoA acyl transferase 1 (ACAA1, PDB 2IIK).</p>
]]></content:encoded></item><item><title>MolGen: Molecular Generation with Chemical Feedback</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/</guid><description>MolGen pre-trains on SELFIES molecules and uses chemical feedback to align generated molecules with real-world chemical preferences across domains.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-method-for-molecular-generation">A SELFIES-Based Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces MolGen, a pre-trained molecular language model for generating molecules with desired chemical properties. The primary contribution is a three-part framework: (1) pre-training on 100M+ molecular SELFIES to learn structural and grammatical knowledge, (2) domain-agnostic molecular prefix tuning for cross-domain knowledge transfer, and (3) a chemical feedback paradigm that aligns the model&rsquo;s generative probabilities with real-world chemical preferences. MolGen is the first language model pre-trained on SELFIES rather than SMILES, which guarantees 100% syntactic validity of generated molecules.</p>
<h2 id="challenges-in-language-model-based-molecule-generation">Challenges in Language Model-Based Molecule Generation</h2>
<p>Generating novel molecules with desirable properties is a central task in drug discovery and chemical design. The molecular space is estimated at $10^{33}$ possible structures, making exhaustive search impractical. Prior deep generative approaches face several limitations:</p>
<ol>
<li><strong>Syntactic invalidity</strong>: <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based language models frequently generate strings that do not correspond to valid molecular graphs. A single random mutation of a SMILES string has only a 9.9% chance of remaining valid.</li>
<li><strong>Narrow domain focus</strong>: Most existing models focus exclusively on synthetic molecules and neglect <a href="https://en.wikipedia.org/wiki/Natural_product">natural products</a>, which have distinct structural complexity and scaffold diversity.</li>
<li><strong>Molecular hallucinations</strong>: Generated molecules may satisfy chemical structural rules yet fail to exhibit anticipated chemical activity in practical applications. The authors formally define this as molecules that &ldquo;comply with chemical structural rules, yet fail to exhibit practical utility or the anticipated properties.&rdquo;</li>
<li><strong>Limited optimization signals</strong>: Existing approaches rely on reinforcement learning (high variance), fixed-dimensional latent spaces, or expert-provided generation rules, all of which impede efficient exploration of chemical space.</li>
</ol>
<h2 id="core-innovation-pre-training-with-selfies-and-chemical-feedback">Core Innovation: Pre-training with SELFIES and Chemical Feedback</h2>
<p>MolGen&rsquo;s novelty rests on three interconnected components.</p>
<h3 id="selfies-based-pre-training">SELFIES-Based Pre-training</h3>
<p>MolGen uses <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (Self-Referencing Embedded Strings) instead of SMILES. SELFIES guarantees that every possible combination of symbols in the alphabet corresponds to a chemically valid molecular graph. The model uses a compact vocabulary of 185 tokens.</p>
<p>The first pre-training stage uses a BART-style encoder-decoder. Tokens from a SELFIES string $S = \{s_1, \ldots, s_l\}$ are randomly replaced with [MASK], then the corrupted input is encoded bidirectionally and decoded left-to-right. The reconstruction loss is:</p>
<p>$$
\mathcal{L}_{\text{ce}}(S) = -\sum_{j=1}^{l} \sum_{s} p_{\text{true}}(s \mid S, S_{&lt; j}) \log p_{\theta}(s \mid S, S_{&lt; j}; \theta)
$$</p>
<p>where $S_{&lt; j}$ denotes the partial sequence $\{s_0, \ldots, s_{j-1}\}$ and $p_{\text{true}}$ is the one-hot distribution under standard maximum likelihood estimation.</p>
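<p>With a one-hot $p_{\text{true}}$, the objective reduces to ordinary token-level cross-entropy over the decoder's left-to-right predictions. A minimal pure-Python sketch (toy probabilities, not real model outputs):</p>

```python
import math

def reconstruction_loss(target_ids, pred_probs):
    """Cross-entropy of left-to-right predictions against one-hot ground
    truth, i.e. L_ce with p_true a point mass on the target token.
    pred_probs[j] is the model's distribution over the vocabulary at step j."""
    return -sum(math.log(pred_probs[j][t]) for j, t in enumerate(target_ids))

# Toy example: 3-token vocabulary, 2-token target sequence.
probs = [
    {0: 0.7, 1: 0.2, 2: 0.1},  # predicted distribution at step 1
    {0: 0.1, 1: 0.8, 2: 0.1},  # predicted distribution at step 2
]
loss = reconstruction_loss([0, 1], probs)  # = -log(0.7) - log(0.8)
```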
<h3 id="domain-agnostic-molecular-prefix-tuning">Domain-Agnostic Molecular Prefix Tuning</h3>
<p>The second pre-training stage introduces shared prefix vectors $P_k, P_v \in \mathbb{R}^{m \times d}$ prepended to the keys and values of multi-head attention at each layer. Unlike conventional prefix tuning that freezes model parameters, MolGen updates the entire model. The attention output becomes:</p>
<p>$$
\text{head} = \text{Attn}\left(xW_q, [P_k, XW_k], [P_v, XW_v]\right)
$$</p>
<p>This decomposes into a linear interpolation between prefix attention and standard attention:</p>
<p>$$
\text{head} = \lambda(x) \cdot \text{Attn}(xW_q, P_k, P_v) + (1 - \lambda(x)) \cdot \text{Attn}(xW_q, XW_k, XW_v)
$$</p>
<p>where $\lambda(x)$ is a scalar representing the sum of normalized attention weights on the prefixes. The prefixes are trained simultaneously across synthetic and natural product domains, acting as a domain instructor.</p>
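<p>The interpolation identity can be checked numerically: softmax attention over the concatenated prefix and sequence scores equals a $\lambda$-weighted mix of the two separate attentions, with $\lambda$ the total normalized weight on the prefix slots. A scalar-valued sketch with toy numbers (not the paper's implementation):</p>

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attn(scores, values):
    """Softmax-weighted average of scalar values."""
    return sum(w * v for w, v in zip(softmax(scores), values))

# Toy scalar q-k scores for 2 prefix slots and 3 sequence positions.
prefix_scores, prefix_vals = [0.5, -0.2], [1.0, 2.0]
seq_scores, seq_vals = [0.1, 0.3, -0.4], [3.0, 4.0, 5.0]

# Full attention over [prefixes; sequence].
full = attn(prefix_scores + seq_scores, prefix_vals + seq_vals)

# lambda(x): total normalized attention weight landing on the prefix slots.
lam = sum(softmax(prefix_scores + seq_scores)[:2])

# Decomposition: lambda * prefix-only attention + (1 - lambda) * standard attention.
decomposed = lam * attn(prefix_scores, prefix_vals) + (1 - lam) * attn(seq_scores, seq_vals)
```

The identity is exact because the joint softmax normalizer factors into the prefix and sequence partition sums.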
<h3 id="chemical-feedback-paradigm">Chemical Feedback Paradigm</h3>
<p>To address molecular hallucinations, MolGen aligns the model&rsquo;s probabilistic rankings with chemical preference rankings. Given a molecule $S$ and a set of candidate outputs $\mathcal{S}^*$ with distinct property scores $\text{Ps}(\cdot)$, the model should satisfy:</p>
<p>$$
p_{\text{true}}(S_i \mid S) &gt; p_{\text{true}}(S_j \mid S), \quad \forall S_i, S_j \in \mathcal{S}^* \text{ with } \text{Ps}(S_i) &gt; \text{Ps}(S_j)
$$</p>
<p>This is enforced via a rank loss:</p>
<p>$$
\mathcal{L}_{\text{rank}}(S) = \sum_{i} \sum_{j &gt; i} \max\left(0, f(S_j) - f(S_i) + \gamma_{ij}\right)
$$</p>
<p>where $\gamma_{ij} = (j - i) \cdot \gamma$ is a margin scaled by rank difference and $f(S) = \sum_{t=1}^{l} \log p_{\theta}(s_t \mid S, S_{&lt; t}; \theta)$ is the estimated log-probability. The overall training objective combines cross-entropy and rank loss:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{\text{ce}} + \alpha \mathcal{L}_{\text{rank}}
$$</p>
<p>Label smoothing is applied to the target distribution in $\mathcal{L}_{\text{ce}}$, allocating probability mass $\beta$ to non-target tokens to maintain generative diversity.</p>
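<p>A minimal sketch of the rank loss, assuming candidates are pre-sorted by descending property score (toy log-probabilities, not model outputs):</p>

```python
def rank_loss(log_probs, gamma=1.0):
    """Pairwise margin loss over candidates sorted by descending property
    score. log_probs[i] = f(S_i), the model's total sequence log-probability
    of the i-th best candidate; the margin grows with the rank gap,
    gamma_ij = (j - i) * gamma."""
    loss = 0.0
    n = len(log_probs)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, log_probs[j] - log_probs[i] + (j - i) * gamma)
    return loss

# Correctly ordered candidates whose gaps exceed the margins incur zero loss...
assert rank_loss([-1.0, -3.0, -5.0], gamma=1.0) == 0.0
# ...while a mis-ranked pair is penalized by its violation plus the margin.
assert rank_loss([-3.0, -1.0], gamma=1.0) == 3.0
```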
<h2 id="experiments-across-distribution-learning-and-property-optimization">Experiments Across Distribution Learning and Property Optimization</h2>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>Stage 1 pre-training</strong>: 100M+ unlabeled molecules from ZINC-15 (molecular weight $\leq$ 500 Da, LogP $\leq$ 5)</li>
<li><strong>Stage 2 pre-training</strong>: 2.22M molecules spanning synthetic (ZINC, MOSES) and natural product (NPASS, 30,926 compounds) domains</li>
<li><strong>Downstream evaluation</strong>: MOSES synthetic dataset, ZINC250K, and natural product molecules</li>
</ul>
<h3 id="molecular-distribution-learning">Molecular Distribution Learning</h3>
<p>MolGen generates 10,000 synthetic and 80,000 natural product molecules, evaluated on seven metrics (Validity, Fragment similarity, Scaffold similarity, SNN, Internal Diversity, <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a>, and Novelty). Baselines include AAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>, CharRNN, VAE, JT-VAE, LIMO, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Validity</th>
          <th>Frag</th>
          <th>Scaf</th>
          <th>SNN</th>
          <th>IntDiv</th>
          <th>FCD</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemformer</td>
          <td>.9843</td>
          <td>.9889</td>
          <td>.9248</td>
          <td>.5622</td>
          <td>.8553</td>
          <td>.0061</td>
          <td>.9581</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>1.000</td>
          <td>.9999</td>
          <td>.9999</td>
          <td>.9996</td>
          <td>.8567</td>
          <td>.0015</td>
          <td>1.000</td>
      </tr>
  </tbody>
</table>
<p>On synthetic molecules, MolGen achieves 100% validity, near-perfect fragment and scaffold similarity, and the lowest FCD (0.0015). For natural products, MolGen achieves FCD of 0.6519 compared to Chemformer&rsquo;s 0.8346.</p>
<h3 id="targeted-molecule-discovery">Targeted Molecule Discovery</h3>
<p>For penalized logP maximization (top-3 scores):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1st</th>
          <th>2nd</th>
          <th>3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MARS (no length limit)</td>
          <td>44.99</td>
          <td>44.32</td>
          <td>43.81</td>
      </tr>
      <tr>
          <td>MolGen (no length limit)</td>
          <td>80.30</td>
          <td>74.70</td>
          <td>69.85</td>
      </tr>
      <tr>
          <td>MolGen (length-limited)</td>
          <td>30.51</td>
          <td>28.98</td>
          <td>28.95</td>
      </tr>
  </tbody>
</table>
<p>For QED maximization, MolGen achieves the maximum score of 0.948 across the top-3.</p>
<h3 id="molecular-docking">Molecular Docking</h3>
<p>MolGen optimizes binding affinity for two protein targets (<a href="https://en.wikipedia.org/wiki/Estrogen_receptor_alpha">ESR1</a> and ACAA1), measured by <a href="https://en.wikipedia.org/wiki/Dissociation_constant">dissociation constant</a> $K_D$ (lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESR1 1st</th>
          <th>ESR1 2nd</th>
          <th>ESR1 3rd</th>
          <th>ACAA1 1st</th>
          <th>ACAA1 2nd</th>
          <th>ACAA1 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>0.89</td>
          <td>1.4</td>
          <td>37</td>
          <td>37</td>
          <td>41</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>0.13</td>
          <td>0.35</td>
          <td>0.47</td>
          <td>3.36</td>
          <td>3.98</td>
          <td>8.50</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the lowest dissociation constants across both targets. Optimization of the 1,000 worst-affinity molecules yields 96.7% relative improvement for ESR1 and 70.4% for ACAA1.</p>
<h3 id="constrained-molecular-optimization">Constrained Molecular Optimization</h3>
<p>Optimizing 800 molecules from ZINC250K with lowest p-logP scores under Tanimoto similarity constraints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$\delta = 0.6$</th>
          <th>$\delta = 0.4$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a></td>
          <td>3.78 (3.29)</td>
          <td>11.55 (11.27)</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>12.08 (0.82)</td>
          <td>12.35 (1.21)</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the highest mean improvement with the lowest standard deviation under both constraints.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<ul>
<li><strong>Chemical feedback</strong>: Without it, the model generates molecules with property scores similar to initial molecules. With it ($\alpha = 3$), property scores increase progressively across generation rounds.</li>
<li><strong>Prefix tuning</strong>: Removing prefix tuning reduces constrained optimization improvement by 0.45 at $\delta = 0.6$ and 2.12 at $\delta = 0.4$.</li>
<li><strong>Label smoothing</strong>: Enhances diversity of generated molecules as measured by Internal Diversity.</li>
<li><strong>Substructure attention</strong>: MolGen focuses attention on chemically meaningful functional groups (fluoro, phenyl, hydroxyl), while SMILES-based PLMs scatter attention across syntactic tokens. The Substructure Attention Level (SAL) metric confirms MolGen&rsquo;s superior focus.</li>
</ul>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>SELFIES pre-training guarantees 100% molecular validity, eliminating the need for external valency checks.</li>
<li>Domain-agnostic prefix tuning enables effective knowledge transfer between synthetic and natural product domains.</li>
<li>The chemical feedback paradigm aligns model outputs with chemical preferences without requiring external annotated data or reference databases.</li>
<li>MolGen achieves the best or competitive results across all evaluated tasks: distribution learning, targeted molecule discovery, constrained optimization, and molecular docking.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational cost</strong>: Training and fine-tuning on large datasets is computationally intensive.</li>
<li><strong>Model interpretability</strong>: The transformer architecture makes it difficult to understand explicit rationale behind decisions.</li>
<li><strong>Single-target optimization only</strong>: The chemical feedback paradigm handles single-target optimization; multiple conflicting objectives could create ambiguous optimization trajectories.</li>
<li><strong>Task specificity</strong>: MolGen is designed for 2D molecular generation; 3D conformation information is not incorporated.</li>
<li><strong>Reaction prediction</strong>: When applied to reaction prediction (an off-target task), MolGen achieves only 71.4% accuracy on 39,990 reaction samples.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest applying MolGen to retrosynthesis and reaction prediction, exploring multimodal pre-training, and incorporating additional knowledge sources.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stage 1 pre-training</td>
          <td>ZINC-15</td>
          <td>100M+ molecules</td>
          <td>MW $\leq$ 500 Da, LogP $\leq$ 5</td>
      </tr>
      <tr>
          <td>Stage 2 pre-training</td>
          <td>ZINC + <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> + NPASS</td>
          <td>2.22M molecules</td>
          <td>Synthetic and natural product domains</td>
      </tr>
      <tr>
          <td>Distribution learning (synthetic)</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a></td>
          <td>~1.9M molecules</td>
          <td>Standard benchmark split</td>
      </tr>
      <tr>
          <td>Distribution learning (natural)</td>
          <td>NPASS</td>
          <td>30,926 compounds</td>
          <td>30,126 train / 800 test</td>
      </tr>
      <tr>
          <td>Constrained optimization</td>
          <td>ZINC250K</td>
          <td>800 molecules</td>
          <td>Lowest p-logP scores</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: BART-based encoder-decoder with SELFIES vocabulary (185 tokens)</li>
<li><strong>Prefix length</strong>: 5 tunable vectors per layer</li>
<li><strong>Optimizer</strong>: LAMB (pre-training), AdamW (fine-tuning)</li>
<li><strong>Pre-training</strong>: 600M steps with linear warm-up (180,000 steps) followed by linear decay</li>
<li><strong>Rank loss weight</strong> ($\alpha$): Recommended values of 3 or 5</li>
<li><strong>Candidate generation</strong>: 30 candidates per molecule (synthetic), 8 candidates (natural products)</li>
</ul>
<h3 id="models">Models</h3>
<p>MolGen is publicly available on Hugging Face. The model uses a vocabulary of 185 SELFIES tokens and is comparable in size to Chemformer-large.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>MolGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a> (lower is better)</td>
          <td>Synthetic</td>
          <td>0.0015</td>
          <td>0.0061 (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>)</td>
          <td>Distribution learning</td>
      </tr>
      <tr>
          <td>p-logP top-1 (no limit)</td>
          <td>Synthetic</td>
          <td>80.30</td>
          <td>44.99 (MARS)</td>
          <td>Targeted discovery</td>
      </tr>
      <tr>
          <td>QED top-1</td>
          <td>Synthetic</td>
          <td>0.948</td>
          <td>0.948 (several)</td>
          <td>Tied at maximum</td>
      </tr>
      <tr>
          <td>ESR1 $K_D$ top-1</td>
          <td>Docking</td>
          <td>0.13</td>
          <td>0.72 (LIMO)</td>
          <td>Binding affinity</td>
      </tr>
      <tr>
          <td>p-logP improvement ($\delta=0.4$)</td>
          <td>Synthetic</td>
          <td>12.35 (1.21)</td>
          <td>11.55 (11.27) (RetMol)</td>
          <td>Constrained optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>6 NVIDIA V100 GPUs</li>
<li>Pre-training batch size: 256 molecules per GPU</li>
<li>Fine-tuning batch size: 6 (synthetic and natural product)</li>
<li>Training: 100 epochs for fine-tuning tasks</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zjunlp/MolGen">zjunlp/MolGen</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/zjunlp">zjunlp/MolGen-large</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained weights on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, Y., Zhang, N., Chen, Z., Guo, L., Fan, X., &amp; Chen, H. (2024). Domain-Agnostic Molecular Generation with Chemical Feedback. <em>Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)</em>.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zjunlp/MolGen">GitHub: zjunlp/MolGen</a></li>
<li><a href="https://huggingface.co/zjunlp">Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2024domain,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Domain-Agnostic Molecular Generation with Chemical Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Guo, Lingbing and Fan, Xiaohui and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=9rPyHyjfwP}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GP-MoLFormer: Molecular Generation via Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/</guid><description>A 46.8M parameter transformer for molecular generation trained on 1.1B SMILES, introducing pair-tuning for efficient property optimization.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-focus">Contribution and Taxonomic Focus</h2>
<p>This is primarily a <strong>Methodological</strong> paper, as it proposes a specific neural architecture (GP-MoLFormer) and a novel fine-tuning algorithm (Pair-tuning) for molecular generation. It validates these contributions against standard baselines (e.g., JT-VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b).</p>
<p>It also contains a secondary <strong>Theoretical</strong> contribution by establishing an empirical <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">scaling law</a> that relates inference compute (generation size) to the novelty of the generated molecules.</p>
<h2 id="motivation-data-scale-and-prompt-based-optimization">Motivation: Data Scale and Prompt-Based Optimization</h2>
<p>While large language models (LLMs) have transformed text generation, the impact of training data scale and memorization on <em>molecular</em> generative models remains under-explored. Specifically, there is a need to understand how training on billion-scale datasets affects the novelty of generated molecules and whether biases in public databases (like ZINC and PubChem) perpetuate memorization. Furthermore, existing optimization methods often require computationally expensive property predictors or reinforcement learning loops; there is a practical need for more efficient &ldquo;prompt-based&rdquo; optimization techniques.</p>
<h2 id="core-innovations-architecture-and-pair-tuning">Core Innovations: Architecture and Pair-Tuning</h2>
<ol>
<li><strong>Architecture</strong>: The application of a linear-attention transformer decoder with Rotary Positional Embeddings (RoPE) to generative chemistry, allowing for efficient training on 1.1 billion SMILES.</li>
<li><strong>Pair-Tuning</strong>: A novel, parameter-efficient fine-tuning method that uses property-ordered molecular pairs to learn &ldquo;soft prompts&rdquo; for optimization without updating the base model weights.</li>
<li><strong>Scaling Analysis</strong>: An extensive empirical investigation mapping the trade-off between inference compute (up to 10B generations) and chemical novelty, fitting an exponential decay curve that demonstrates how novelty saturates as generation volume grows.</li>
</ol>
<h2 id="experimental-methodology-and-downstream-tasks">Experimental Methodology and Downstream Tasks</h2>
<p>The authors evaluated GP-MoLFormer on three distinct tasks, though the comparisons highlight the difficulty of evaluating foundation models against classical baselines:</p>
<ol>
<li><strong>De Novo Generation</strong>: Comparing validity, uniqueness, and novelty against baselines (CharRNN, VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/">LIMO</a>, MolGen-7b) on a held-out test set. Notably, this is an unequal comparison; most baselines were trained on the 1.6M molecule <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> dataset, whereas GP-MoLFormer uses up to 1.1B molecules, meaning performance gains are heavily driven by data scale.</li>
<li><strong>Scaffold-Constrained Decoration</strong>: Generating molecules from DRD2 active binder scaffolds and measuring the hit rate of active compounds against specialized scaffold decorators.</li>
<li><strong>Property-Guided Optimization</strong>: Using Pair-tuning to optimize for Drug-likeness (QED), Penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">logP</a>, and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> binding activity, comparing the results to graph-based and reinforcement learning benchmarks.</li>
</ol>
<p>Additionally, they performed a <strong>Scaling Study</strong>:</p>
<ul>
<li>Comparing models trained on raw (1.1B) vs. de-duplicated (650M) data.</li>
<li>Generating up to 10 billion molecules to fit empirical scaling laws for novelty.</li>
</ul>
<h2 id="key-findings-and-scaling-laws">Key Findings and Scaling Laws</h2>
<ul>
<li><strong>Scale-Driven Performance</strong>: GP-MoLFormer achieves high internal diversity and validity on generation metrics. However, its baseline novelty percentage (~32%) is considerably lower than that of classical models. The authors attribute this to the massive training scale forcing the model to heavily prioritize matching real-world molecule frequencies over pure exploration. GP-MoLFormer&rsquo;s advantage in generation metrics over LLM baselines like <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b likely stems heavily from its 10x larger training dataset rather than fundamental architectural superiority.</li>
<li><strong>Pair-Tuning Efficacy</strong>: The proposed pair-tuning method effectively optimizes properties (e.g., improving DRD2 activity scores) without requiring full model fine-tuning or external reward loops. While successful, the text-based generation yields ~94.5% validity during optimization, which lags behind graph and SELFIES-based baselines that guarantee 100% structural validity.</li>
<li><strong>Memorization vs. Novelty</strong>: Training on de-duplicated data (GP-MoLFormer-UNIQ) yields higher novelty (approx. 5-8% higher) than training on raw data, confirming that duplication bias in public databases leads directly to memorization.</li>
<li><strong>Inference Scaling Law</strong>: Novelty decays exponentially with generation size ($y = ae^{-bx}$), yet the model maintains generative capability (~16.7% novelty) even after generating an unprecedented 10 billion molecules.</li>
</ul>
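<p>Because $\log y = \log a - bx$, the decay parameters of the novelty scaling law can be recovered by ordinary least squares on log-novelty. A sketch with synthetic numbers (the coefficients below are illustrative, not the paper's reported fit):</p>

```python
import math

def fit_exp_decay(xs, ys):
    """Fit y = a * exp(-b * x) by least squares on log(y) -- a toy recovery
    of a novelty-vs-generation-size curve; the data here is made up."""
    n = len(xs)
    lys = [math.log(y) for y in ys]
    mx, my = sum(xs) / n, sum(lys) / n
    slope = (sum((x - mx) * (ly - my) for x, ly in zip(xs, lys))
             / sum((x - mx) ** 2 for x in xs))
    b = -slope
    a = math.exp(my + b * mx)
    return a, b

# Noiseless synthetic data generated from a=0.32, b=0.05 is recovered exactly.
xs = [1, 2, 4, 8]
ys = [0.32 * math.exp(-0.05 * x) for x in xs]
a, b = fit_exp_decay(xs, ys)
```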
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Sources</strong>: A combination of <strong>PubChem</strong> (111M SMILES) and <strong>ZINC</strong> (1B SMILES) databases. Downloading and pre-training instructions are located in the repository&rsquo;s <code>data/README.md</code>.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>All SMILES were canonicalized using RDKit (no isomeric information).</li>
<li><strong>GP-MoLFormer (Base)</strong>: Trained on the full 1.1B dataset (includes duplicates).</li>
<li><strong>GP-MoLFormer-UNIQ</strong>: Trained on a de-duplicated subset of 650M SMILES.</li>
</ul>
</li>
<li><strong>Tokenization</strong>: Uses the tokenizer from Schwaller et al. (2019) with a vocabulary size of <strong>2,362 tokens</strong>.</li>
<li><strong>Filtering</strong>: Sequences restricted to a maximum length of <strong>202 tokens</strong>.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pair-Tuning (Algorithm 1)</strong>:</p>
<ul>
<li><strong>Objective</strong>: Learn task-specific soft prompts $\phi_T$ to maximize the conditional probability of target molecule $b$ given a seed molecule $a$, where pair $(a, b)$ satisfies the property condition $b &gt; a$. The base model parameters $\theta$ remain frozen.</li>
<li><strong>Prompt Structure</strong>: Autoregressive training optimizes the continuous embeddings of $n$ enhancement tokens against the cross-entropy loss of the target sequence:
$$ \mathcal{L}(\phi_T) = - \sum_{i=1}^{|b|} \log P_{\theta}(b_i | \phi_T, a, b_{&lt;i}) $$</li>
<li><strong>Hyperparameters</strong>: Trained for 1,000 epochs with a batch size of 35 and a fixed learning rate of $3 \times 10^{-2}$.</li>
<li><strong>Inference</strong>: The learned prompt $\phi_T$ and seed molecule $a$ are prepended as context, and candidates are sampled autoregressively until a termination token is produced.</li>
</ul>
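<p>Schematically, the pair-tuning objective is the negative log-likelihood of the target molecule $b$ conditioned on the learned soft prompt and the seed $a$. The sketch below substitutes a stub conditional distribution for the frozen base model, so only the shape of the computation is real:</p>

```python
import math

def pair_tuning_loss(cond_prob, prompt, a, b):
    """Negative log-likelihood of target molecule b given soft prompt phi_T
    and seed molecule a, with the base model frozen. cond_prob(context, token)
    stands in for P_theta; here it is a toy stub, not a chemical LM."""
    nll = 0.0
    context = list(prompt) + list(a)
    for tok in b:
        nll -= math.log(cond_prob(tuple(context), tok))
        context.append(tok)  # autoregressive: condition on b_{<i} as well
    return nll

# Toy stub: uniform distribution over a 4-symbol vocabulary.
def uniform_prob(context, token):
    return 0.25

loss = pair_tuning_loss(uniform_prob,
                        prompt=["<p1>", "<p2>"],  # hypothetical soft-prompt slots
                        a=["C", "O"], b=["C", "N", "O"])
# Under the uniform stub, the loss equals |b| * log(4).
```

In the actual method, only the continuous embeddings behind the prompt slots receive gradients; everything else is frozen.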
<h3 id="models">Models</h3>
<ul>
<li><strong>Availability</strong>: The model trained on deduplicated data (GP-MoLFormer-UNIQ) is publicly available on <a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">Hugging Face</a>. The full 1.1B base model is not explicitly hosted. The source code repository includes a disclosure that IBM will not maintain the code going forward.</li>
<li><strong>Architecture</strong>: Transformer decoder (~47M parameters: 12 layers, 12 heads, hidden size 768).</li>
<li><strong>Attention Mechanism</strong>: Combines Linear Attention (Generalized Random Feature map, $\phi$) with Rotary Positional Embeddings (RoPE). To avoid the quadratic complexity of standard attention while maintaining relative positional awareness, RoPE is applied to queries ($Q$) and keys ($K$) prior to the random feature mapping:
$$ \text{Attention}(Q, K, V) = \frac{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle v_n}{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle} $$</li>
<li><strong>Inference Speed</strong>: ~3ms per forward pass on a single A100 GPU.</li>
</ul>
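<p>A toy 2-dimensional sketch of the mechanism: rotate queries and keys by a position-dependent angle (RoPE), pass them through a positive feature map, and take the normalized similarity-weighted average of the values. The exponential feature map below is a simple stand-in for the paper's generalized random features, not their actual $\phi$:</p>

```python
import math

def rope(vec, pos, theta=10000.0):
    """Rotate a 2-d query/key vector by a position-dependent angle (RoPE)."""
    ang = pos / theta  # single frequency suffices for a 2-d toy head
    c, s = math.cos(ang), math.sin(ang)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def feat(vec):
    """Positive feature map standing in for the random-feature phi."""
    return (math.exp(vec[0]), math.exp(vec[1]))

def linear_attn(q, ks, vs, m, positions):
    """Linear attention with RoPE applied to q and k before the feature map."""
    fq = feat(rope(q, m))
    sims = [fq[0] * fk[0] + fq[1] * fk[1]
            for fk in (feat(rope(k, n)) for k, n in zip(ks, positions))]
    z = sum(sims)  # normalizer: sum of positive similarities
    return sum(s * v for s, v in zip(sims, vs)) / z

out = linear_attn(q=(0.3, -0.1),
                  ks=[(0.2, 0.1), (-0.4, 0.5), (0.0, 0.3)],
                  vs=[1.0, 2.0, 3.0],
                  m=2, positions=[0, 1, 2])
# Because all similarities are positive, the output is a convex combination
# of the values and stays within their range.
```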
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Generation Quality Metrics</strong>: Validity, Uniqueness, Novelty (<a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> suite), <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance (FCD)</a>, Scaffold similarity (Scaf), and Similarity to Nearest Neighbor (SNN).</li>
<li><strong>MoLFormer-Based Metrics</strong>: The authors introduce Fréchet <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a> Distance (FMD) and MoLFormer-space IntDiv2 to measure distributional similarity using their own pre-trained continuous embeddings instead of standard fingerprints.</li>
<li><strong>Optimization Metrics</strong>: Penalized logP (calculated as $\text{logP} - \text{SA} - \max(\text{largest ring size} - 6, 0)$), Drug-likeness (QED), and DRD2 activity scores.</li>
<li><strong>Scaling Metrics</strong>: Empirical fit for novelty decay: $y = ae^{-bx}$.</li>
</ul>
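<p>The penalized logP arithmetic, assuming logP, the synthetic-accessibility score, and the largest ring size have already been computed (e.g. with RDKit), and interpreting the ring term as a penalty on rings larger than six atoms:</p>

```python
def penalized_logp(logp, sa, largest_ring_size):
    """Penalized logP: octanol-water partition coefficient minus the
    synthetic-accessibility score minus a penalty for rings larger than
    6 atoms. Inputs are assumed precomputed; this sketch only performs
    the final arithmetic."""
    return logp - sa - max(largest_ring_size - 6, 0)

# A molecule with logP 2.5, SA 3.0, and no ring larger than 6 atoms:
score = penalized_logp(2.5, 3.0, 6)       # -> -0.5
# An 8-membered macrocycle pays a 2-point ring penalty on top of that:
score_ring = penalized_logp(2.5, 3.0, 8)  # -> -2.5
```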
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 16 x NVIDIA A100 (80 GB) GPUs across 2 nodes connected via EDR Infiniband.</li>
<li><strong>Training Time</strong>:
<ul>
<li>GP-MoLFormer (1.1B data): ~115 hours total (28.75 hours/epoch for 4 epochs).</li>
<li>GP-MoLFormer-UNIQ (650M data): ~80 hours total.</li>
</ul>
</li>
<li><strong>Hyperparameters</strong>: Used a batch size of 1,600 molecules per GPU with a fixed learning rate of $1.6 \times 10^{-4}$ (scaled by up to a factor of 8 as the number of GPUs increased).</li>
<li><strong>Optimization</strong>: Used distributed data-parallel training and adaptive bucketing by sequence length to handle scale.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/gp-molformer/">GP-MoLFormer (GitHub)</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation; IBM will not maintain going forward</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">GP-MoLFormer-Uniq (Hugging Face)</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained on 650M de-duplicated SMILES</td>
      </tr>
  </tbody>
</table>
<p>The full 1.1B base model weights are not publicly hosted. The training data (PubChem and ZINC) is publicly available, and instructions for downloading and pre-processing are in the repository&rsquo;s <code>data/README.md</code>.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Hoffman, S. C., Chenthamarakshan, V., Navratil, J., Mroueh, Y., &amp; Das, P. (2025). GP-MoLFormer: A Foundation Model For Molecular Generation. <em>Digital Discovery</em>, 4(10), 2684&ndash;2696. <a href="https://doi.org/10.1039/D5DD00122F">https://doi.org/10.1039/D5DD00122F</a></p>
<p><strong>Publication</strong>: Digital Discovery, vol. 4, no. 10, pp. 2684&ndash;2696 (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2025gpmolformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GP-MoLFormer: a foundation model for molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Hoffman, Samuel C and Chenthamarakshan, Vijil and Navratil, Jiri and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2684--2696}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D5DD00122F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemformer: A Pre-trained Transformer for Comp Chem</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/</guid><description>BART-based Transformer pre-trained on 100M molecules using self-supervision to accelerate convergence on chemical sequence tasks.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It proposes an architecture adaptation (Chemformer based on BART) and a specific pre-training strategy (&ldquo;Combined&rdquo; masking and augmentation). The paper validates this method by benchmarking against established models on multiple tasks, including direct synthesis, retrosynthesis, and molecular optimization. It also includes a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution by making the pre-trained models and code available.</p>
<h2 id="motivation-computational-bottlenecks-in-cheminformatics">Motivation: Computational Bottlenecks in Cheminformatics</h2>
<p>Existing Transformer models for cheminformatics are often developed for single applications and are computationally expensive to train from scratch. For example, training a Molecular Transformer for reaction prediction can take days, limiting hyperparameter exploration. Self-supervised pre-training (like BERT or T5) has significantly advanced NLP by reducing fine-tuning time and improving performance. In chemistry, applications have traditionally focused on task-specific datasets or encoder-only architectures, which perform poorly on sequence generation tasks. The authors aim to use transfer learning on a large unlabelled dataset to create a model that converges quickly and performs well across diverse sequence-to-sequence and discriminative tasks.</p>
<h2 id="core-innovation-bart-architecture-and-combined-pre-training">Core Innovation: BART Architecture and Combined Pre-training</h2>
<p>The primary insight lies in the adaptation of the <strong>BART architecture</strong> for chemistry and the introduction of a <strong>&ldquo;Combined&rdquo; self-supervised pre-training task</strong>.</p>
<ul>
<li><strong>Architecture</strong>: Chemformer uses the BART encoder-decoder structure, allowing it to handle both discriminative (property prediction) and generative (reaction prediction) tasks efficiently. This provides an alternative to encoder-only (BERT) or decoder-only (GPT) models.</li>
<li><strong>Combined Pre-training</strong>: The authors introduce a task that applies both <strong>Span Masking</strong> (randomly replacing tokens with <code>&lt;mask&gt;</code>) and <strong><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> Augmentation</strong> (permuting atom order, see <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">Randomized SMILES</a>) simultaneously. Formally, given a canonical SMILES sequence $x$, a corrupted sequence $\tilde{x} = \text{Mask}(\text{Augment}(x))$ is generated. The model is trained using an autoregressive cross-entropy loss to reconstruct the canonical sequence from the corrupted input:
$$ \mathcal{L}_{\text{pre-train}} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{&lt;t}, \tilde{x}) $$</li>
<li><strong>Tunable Augmentation</strong>: A downstream augmentation strategy is proposed where the probability of augmenting the input/output SMILES ($p_{aug}$) is a tunable hyperparameter, performed on-the-fly.</li>
</ul>
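<p>A stdlib-only sketch of the Combined corruption described above: in the real pipeline the input is first atom-order augmented (e.g. a randomized SMILES via RDKit) and then span-masked, while the reconstruction target stays canonical. The regex below is a common SMILES tokenization pattern, not the paper's exact 523-token vocabulary, and the span lengths are illustrative:</p>

```python
import random
import re

# Common SMILES tokenization pattern (illustrative; not the paper's vocabulary).
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@?|[BCNOSPFIbcnosp]|%[0-9]{2}|[0-9]|[=#\-\+\(\)/\\\.~])"
)

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

def span_mask(tokens, mask_prob=0.15, rng=random):
    """Replace random short spans of tokens with a single <mask> (BART style)."""
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            out.append("<mask>")
            i += rng.randint(1, 3)   # illustrative span length
        else:
            out.append(tokens[i])
            i += 1
    return out

canonical = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin; the augmentation step is elided
corrupted = span_mask(tokenize(canonical), rng=random.Random(1))
print(corrupted)
```

<p>The model is then trained to reconstruct the canonical sequence autoregressively from the corrupted input under the cross-entropy loss above.</p>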
<h2 id="experimental-setup-and-pre-training-tasks">Experimental Setup and Pre-training Tasks</h2>
<p>The authors pre-trained Chemformer on <strong>100 million molecules</strong> from ZINC-15 and fine-tuned it on three distinct task types:</p>
<ol>
<li><strong>Seq2Seq Reaction Prediction</strong>:
<ul>
<li><em>Direct Synthesis</em>: USPTO-MIT dataset (Mixed and Separated).</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: USPTO-50K dataset (see also <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a>).</li>
</ul>
</li>
<li><strong>Molecular Optimization</strong>: Generating molecules with improved properties (<a href="https://en.wikipedia.org/wiki/Distribution_coefficient">LogD</a>, solubility, clearance) starting from ChEMBL matched molecular pairs.</li>
<li><strong>Discriminative Tasks</strong>:
<ul>
<li><em><a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a></em>: Predicting properties (ESOL, FreeSolv, Lipophilicity) from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
<li><em>Bioactivity</em>: Predicting pXC50 values for 133 genes using ExCAPE data.</li>
</ul>
</li>
</ol>
<p>Ablation studies compared three pre-training strategies (Masking, Augmentation, Combined) against a randomly initialized baseline.</p>
<h2 id="results-trade-offs-and-conclusions">Results, Trade-offs, and Conclusions</h2>
<ul>
<li><strong>Performance</strong>: Chemformer achieved <strong>competitive top-1 accuracy</strong> on USPTO-MIT (91.3% Mixed) and USPTO-50K (53.6&ndash;54.3%), outperforming the Augmented Transformer and graph-based models (GLN, GraphRetro).</li>
<li><strong>Convergence Speed</strong>: Pre-training substantially accelerated convergence; models fine-tuned for just 20 epochs (~30 minutes) outperformed previous baselines that were trained for far longer.</li>
<li><strong>Pre-training Tasks</strong>: The &ldquo;Combined&rdquo; task generally performed best for reaction prediction and bioactivity, while &ldquo;Masking&rdquo; was superior for molecular optimization.</li>
<li><strong>Augmentation Trade-off</strong>: The augmentation strategy improved top-1 accuracy but significantly degraded top-5/10 accuracy because beam search outputs became populated with augmented versions of the same molecule. This presents a considerable limitation for practical applications like retrosynthesis mapping, where retrieving a diverse set of candidate reactions is often critical.</li>
<li><strong>Discriminative Evaluation Caveats</strong>: Chemformer underperformed specialized baselines (like D-MPNN or <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/">MolBERT</a>) on small discriminative datasets. The authors note that direct comparison is difficult: Chemformer was trained simultaneously on multiple subtasks (multi-task learning), while the literature baselines were trained and tuned on each subtask separately. Additionally, the Chemformer encoder uses fewer than 20M parameters compared to MolBERT&rsquo;s approximately 85M, and Chemformer&rsquo;s pre-training does not include molecular property objectives. For other transfer learning approaches to QSAR, see <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a>.</li>
<li><strong>Pre-training Data Scope</strong>: The 100M pre-training dataset from ZINC-15 was selected with constraints on molecular weight ($\le 500$ Da) and LogP ($\le 5$), focusing the learned representations on small, drug-like molecules.</li>
</ul>
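<p>The top-5/10 degradation can be made concrete: an augmentation-trained decoder's beam fills with alternative SMILES spellings of the same molecule, so recovering distinct candidates requires canonicalizing and deduplicating the outputs (via an RDKit round-trip through <code>Chem.MolToSmiles</code> in practice). A hypothetical toy canonicalizer stands in below:</p>

```python
# Deduplicate ranked beam outputs by canonical form, keeping the
# highest-ranked spelling of each distinct molecule.
def dedupe_beam(beam, canonicalize):
    seen, unique = set(), []
    for smiles in beam:                  # beam is ordered best-first
        key = canonicalize(smiles)
        if key not in seen:
            seen.add(key)
            unique.append(smiles)
    return unique

# Toy canonical map: three spellings of phenol plus ethanol. Real code would
# round-trip each string through RDKit instead.
toy_canon = {"c1ccccc1O": "Oc1ccccc1", "Oc1ccccc1": "Oc1ccccc1",
             "OC1=CC=CC=C1": "Oc1ccccc1", "CCO": "CCO"}
beam = ["c1ccccc1O", "Oc1ccccc1", "OC1=CC=CC=C1", "CCO"]
print(dedupe_beam(beam, toy_canon.get))  # → ['c1ccccc1O', 'CCO']
```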
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><em>Note: The primary GitHub repository for Chemformer was officially archived on February 11, 2026. Pre-trained weights and datasets used in the paper are still hosted externally on <a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Box</a>. Active development of Chemformer models has moved to the <a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels</a> repository.</em></p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/Chemformer">Chemformer (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Archived; original PyTorch implementation</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Active successor repository</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Pre-trained weights (Box)</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Base and Large model checkpoints</td>
      </tr>
  </tbody>
</table>
<p>The following datasets were used for pre-training and benchmarking.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Pre-training</strong></td>
          <td style="text-align: left">ZINC-15</td>
          <td style="text-align: left">100M</td>
          <td style="text-align: left">Selected subset (reactive, annotated purchasability, MW $\le 500$, LogP $\le 5$). Split: 99% Train / 0.5% Val / 0.5% Test.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Direct Synthesis</strong></td>
          <td style="text-align: left">USPTO-MIT</td>
          <td style="text-align: left">~470k</td>
          <td style="text-align: left">Evaluated on &ldquo;Mixed&rdquo; and &ldquo;Separated&rdquo; variants.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Retrosynthesis</strong></td>
          <td style="text-align: left">USPTO-50K</td>
          <td style="text-align: left">~50k</td>
          <td style="text-align: left">Standard benchmark for retrosynthesis.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimization</strong></td>
          <td style="text-align: left">ChEMBL MMPs</td>
          <td style="text-align: left">~160k Train</td>
          <td style="text-align: left">Matched Molecular Pairs for LogD, solubility, and clearance optimization.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Properties</strong></td>
          <td style="text-align: left">MoleculeNet</td>
          <td style="text-align: left">Small</td>
          <td style="text-align: left">ESOL (1128), FreeSolv (642), Lipophilicity (4200).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Bioactivity</strong></td>
          <td style="text-align: left">ExCAPE</td>
          <td style="text-align: left">~312k</td>
          <td style="text-align: left">133 gene targets; &gt;1200 compounds per gene.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenization (523 tokens total) derived from ChEMBL 27 canonical SMILES.</li>
<li><strong>Augmentation</strong>: SMILES enumeration (permuting atom order) used for pre-training and on-the-fly during fine-tuning ($p_{aug}=0.5$ for Seq2Seq, $p_{aug}=1.0$ for discriminative).</li>
</ul>
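<p>The on-the-fly augmentation amounts to a per-example coin flip; <code>randomize</code> below is a stand-in for a real SMILES enumerator (e.g. RDKit's <code>MolToSmiles</code> with <code>doRandom=True</code>), and the toy transform shown is not a valid SMILES operation:</p>

```python
import random

# With probability p_aug, replace the example with a randomized SMILES;
# otherwise keep the canonical string (p_aug=0.5 for Seq2Seq fine-tuning,
# p_aug=1.0 for discriminative tasks).
def maybe_augment(smiles, randomize, p_aug, rng=random):
    return randomize(smiles) if rng.random() < p_aug else smiles

rotate = lambda s: s[1:] + s[0]   # toy stand-in, NOT a real SMILES randomizer
print(maybe_augment("CCO", rotate, p_aug=1.0))  # → 'COC'
print(maybe_augment("CCO", rotate, p_aug=0.0))  # → 'CCO'
```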
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pre-training Tasks</strong>:
<ol>
<li><em>Masking</em>: Span masking (BART style).</li>
<li><em>Augmentation</em>: Input is a randomized SMILES; target is canonical SMILES.</li>
<li><em>Combined</em>: Input is augmented <em>then</em> masked; target is canonical SMILES.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$).</li>
<li>Schedule: Linear warm-up (8000 steps) for pre-training; One-cycle schedule for fine-tuning.</li>
</ul>
</li>
<li><strong>Inference</strong>: <a href="https://en.wikipedia.org/wiki/Beam_search">Beam search</a> with width 10 for Seq2Seq tasks. The scripts <code>molbart/inference_score.py</code> and <code>molbart/retrosynthesis/round_trip_inference.py</code> handle standard and round-trip validation, respectively.</li>
</ul>
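<p>A minimal sketch of the pre-training warm-up schedule, assuming the learning rate holds at the base value once the 8000-step ramp completes (the note specifies the warm-up itself, not any subsequent decay; <code>base_lr</code> is a placeholder value):</p>

```python
# Linear warm-up: LR ramps from 0 to base_lr over warmup_steps, then
# (assumption) stays constant. Adam consumes these per-step rates.
def warmup_lr(step, base_lr=1e-4, warmup_steps=8000):
    return base_lr * min(1.0, step / warmup_steps)

print(warmup_lr(2000))   # → 2.5e-05 (a quarter of the way up the ramp)
print(warmup_lr(8000))   # → 0.0001 (base rate reached)
```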
<h3 id="models">Models</h3>
<p>Two model sizes were trained. Both use the Pre-Norm Transformer layout with GELU activation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Chemformer (Base)</th>
          <th style="text-align: left">Chemformer-Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">6</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model Dimension</strong></td>
          <td style="text-align: left">512</td>
          <td style="text-align: left">1024</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Feed-forward Dim</strong></td>
          <td style="text-align: left">2048</td>
          <td style="text-align: left">4096</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention Heads</strong></td>
          <td style="text-align: left">8</td>
          <td style="text-align: left">16</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~45M</td>
          <td style="text-align: left">~230M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Pre-training Task</strong></td>
          <td style="text-align: left">All 3 variants</td>
          <td style="text-align: left">Combined only</td>
      </tr>
  </tbody>
</table>
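<p>A back-of-the-envelope check of the parameter counts in the table, counting only the attention and feed-forward weight matrices (embeddings, biases, and layer norms are ignored; decoder layers carry an extra cross-attention block):</p>

```python
# Rough weight count for a Pre-Norm encoder-decoder Transformer.
def approx_params(layers, d_model, d_ff):
    attn = 4 * d_model * d_model        # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff            # two linear maps
    encoder = layers * (attn + ffn)
    decoder = layers * (2 * attn + ffn) # self- plus cross-attention
    return encoder + decoder

print(f"Base:  ~{approx_params(6, 512, 2048) / 1e6:.0f}M")   # ≈ 44M (table: ~45M)
print(f"Large: ~{approx_params(8, 1024, 4096) / 1e6:.0f}M")  # ≈ 235M (table: ~230M)
```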
<h3 id="evaluation">Evaluation</h3>
<p>Comparisons relied on Top-N accuracy for reaction tasks and validity metrics for optimization.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Direct Synthesis (Sep)</td>
          <td style="text-align: left"><strong>92.8%</strong> (Large)</td>
          <td style="text-align: left">91.1% (Aug Transformer)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Retrosynthesis</td>
          <td style="text-align: left"><strong>54.3%</strong> (Large)</td>
          <td style="text-align: left">53.7% (GraphRetro) / 52.5% (GLN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Desirable %</strong></td>
          <td style="text-align: left">Mol Optimization</td>
          <td style="text-align: left"><strong>75.0%</strong> (Base-Mask)</td>
          <td style="text-align: left">70.2% (Transformer-R)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>RMSE</strong></td>
          <td style="text-align: left">Lipophilicity</td>
          <td style="text-align: left">0.598 (Combined)</td>
          <td style="text-align: left">0.555 (D-MPNN)</td>
      </tr>
  </tbody>
</table>
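<p>Top-N accuracy here counts a sample as correct when the ground truth appears among the model's N highest-ranked beam outputs; a minimal sketch with toy data:</p>

```python
# Top-N accuracy over ranked beam outputs.
def top_n_accuracy(ranked_candidates, targets, n):
    hits = sum(target in beam[:n]
               for beam, target in zip(ranked_candidates, targets))
    return hits / len(targets)

beams = [["CCO", "CCC"], ["OCC", "CCN"], ["CNC", "CCO"]]
truth = ["CCO", "CCN", "CCC"]
print(top_n_accuracy(beams, truth, n=1))  # only the first sample hits at rank 1
print(top_n_accuracy(beams, truth, n=2))  # the second sample's truth sits at rank 2
```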
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 NVIDIA V100 GPUs (batch size 128 per GPU).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Pre-training: 2.5 days (Base) / 6 days (Large) for 1M steps.</li>
<li>Fine-tuning: ~20&ndash;40 epochs for reaction prediction (&lt;12 hours).</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Irwin, R., Dimitriadis, S., He, J., &amp; Bjerrum, E. J. (2022). Chemformer: a pre-trained transformer for computational chemistry. <em>Machine Learning: Science and Technology</em>, 3(1), 015022. <a href="https://doi.org/10.1088/2632-2153/ac3ffb">https://doi.org/10.1088/2632-2153/ac3ffb</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology, vol. 3, no. 1, 015022 (2022)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{irwinChemformerPretrainedTransformer2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemformer: A Pre-Trained Transformer for Computational Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Irwin, Ross and Dimitriadis, Spyridon and He, Jiazhen and Bjerrum, Esben Jannik}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{015022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2632-2153}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/ac3ffb}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>