<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Drug-Design on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/tags/drug-design/</link><description>Recent content in Drug-Design on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/drug-design/index.xml" rel="self" type="application/rss+xml"/><item><title>REINVENT: Reinforcement Learning for Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</guid><description>REINVENT uses augmented episodic likelihood to fine-tune a SMILES-based RNN via reinforcement learning for goal-directed molecular generation.</description><content:encoded><![CDATA[<h2 id="augmented-episodic-likelihood-for-goal-directed-generation">Augmented Episodic Likelihood for Goal-Directed Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces REINVENT, a policy-based reinforcement learning framework for molecular de novo design. The primary contribution is a novel cost function, the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented episodic likelihood</a>, that fine-tunes a <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based recurrent neural network (RNN) pre-trained on ChEMBL toward generating molecules satisfying user-defined property objectives. The method anchors the agent to the prior distribution of valid drug-like molecules, addressing failure modes of standard REINFORCE algorithms (reward exploitation and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">mode collapse</a> to trivially simple structures).</p>
<h2 id="de-novo-design-needs-flexible-data-driven-approaches">De Novo Design Needs Flexible, Data-Driven Approaches</h2>
<p>Traditional de novo design methods fall into three categories, each with limitations:</p>
<ol>
<li><strong>Structure-based approaches</strong> grow ligands to fit binding pockets but often produce molecules with poor DMPK profiles and synthetic intractability.</li>
<li><strong>Ligand-based virtual library</strong> approaches generate large libraries and score them, but are constrained by pre-defined reaction rules or transformation rules that limit chemical diversity.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Inverse QSAR</a></strong> methods attempt to map favorable activity regions back to molecular structures, but require descriptors suitable for both forward prediction and inverse mapping.</li>
</ol>
<p>RNN-based generative models trained on SMILES offer a data-driven alternative that can learn the underlying distribution of drug-like chemical space without rigid rules. Segler et al. (2017) showed that fine-tuning a pre-trained RNN on focused sets of actives yields high fractions of predicted actives. However, such maximum-likelihood fine-tuning cannot exploit negative examples or continuous scores, and it risks catastrophic forgetting of the pre-trained distribution.</p>
<p>Prior RL approaches had significant issues. Jaques et al. (2016) used Deep Q-learning with prior likelihood regularization for sequence generation, but reported dependence on hand-written rules to penalize undesirable sequences and still observed reward exploitation producing unrealistically simple molecules. Standard REINFORCE algorithms tend to converge on trivial solutions (e.g., generating only &ldquo;C&rdquo; to satisfy a scoring function).</p>
<h2 id="the-augmented-episodic-likelihood-framework">The Augmented Episodic Likelihood Framework</h2>
<p>The core innovation is a formulation where the agent learns a policy that minimizes the squared difference between its own log-likelihood and an augmented target likelihood.</p>
<p>The RNN is first pre-trained on 1.5 million canonical SMILES from ChEMBL via maximum likelihood estimation:</p>
<p>$$
J(\Theta) = -\sum_{t=1}^{T} \log P(x^{t} \mid x^{t-1}, \dots, x^{1})
$$</p>
<p>The pre-trained model (the Prior) is then used as the starting point for the Agent. For a generated SMILES sequence $A = a_1, a_2, \dots, a_T$, the model likelihood is $P(A) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$, and a scoring function $S(A) \in [-1, 1]$ rates desirability.</p>
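<p>As a toy illustration (not the paper&rsquo;s code), the sequence likelihood is simply the product of per-step policy probabilities, accumulated in log space for numerical stability:</p>

```python
import math

def sequence_log_likelihood(step_probs):
    """log P(A) = sum_t log pi(a_t | s_t), given the probability the
    policy assigned to each sampled token. Working in log space avoids
    underflow on long SMILES sequences."""
    return sum(math.log(p) for p in step_probs)

# Two tokens sampled with probability 0.5 each: P(A) = 0.25.
print(sequence_log_likelihood([0.5, 0.5]))  # log(0.25) ~ -1.386
```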
<p>The augmented likelihood combines prior likelihood with the score:</p>
<p>$$
\log P(A)_{\mathbb{U}} = \log P(A)_{Prior} + \sigma S(A)
$$</p>
<p>where $\sigma$ is a scalar coefficient controlling the trade-off between prior fidelity and score optimization.</p>
<p>The return is defined as the negative squared difference between the augmented likelihood and the agent&rsquo;s likelihood:</p>
<p>$$
G(A) = -\left[\log P(A)_{\mathbb{U}} - \log P(A)_{\mathbb{A}}\right]^{2}
$$</p>
<p>The agent minimizes $J(\Theta) = -G$, effectively learning a policy whose sequence likelihoods match the prior modulated by the scoring function. The authors show in supplementary material that this is equivalent to a REINFORCE algorithm with a specific final-step reward formulation.</p>
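<p>Numerically the objective is just a squared gap between log-likelihoods. A minimal sketch with toy numbers ($\sigma$ and the likelihood values are arbitrary, not from the paper):</p>

```python
def augmented_loss(log_p_prior, log_p_agent, score, sigma=15.0):
    """Squared difference between the augmented target likelihood
    (prior log-likelihood shifted by sigma * S(A)) and the agent's
    own log-likelihood for the same sequence."""
    log_p_augmented = log_p_prior + sigma * score
    return (log_p_augmented - log_p_agent) ** 2

# An agent that matches the prior pays nothing on neutral sequences,
# but is pushed to raise its likelihood on high-scoring ones:
print(augmented_loss(-20.0, -20.0, score=0.0))  # 0.0
print(augmented_loss(-20.0, -20.0, score=1.0))  # 225.0
```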
<p>This design has three key advantages over standard REINFORCE:</p>
<ul>
<li>The target policy is explicitly stochastic, preserving diversity in generated molecules</li>
<li>The prior anchoring prevents catastrophic forgetting of SMILES syntax and chemical space coverage</li>
<li>No hand-written rules are needed to penalize degenerate solutions</li>
</ul>
<p>The Agent is trained on-policy with batches of 128 generated sequences, using SGD with learning rate 0.0005 and gradient clipping to $[-3, 3]$.</p>
<h2 id="three-experiments-sulphur-avoidance-celecoxib-analogues-and-drd2-activity">Three Experiments: Sulphur Avoidance, Celecoxib Analogues, and DRD2 Activity</h2>
<h3 id="prior-network-architecture">Prior Network Architecture</h3>
<p>The Prior is a 3-layer RNN with 1024 Gated Recurrent Units per layer, trained on RDKit canonical SMILES from ChEMBL (molecules with 10-50 heavy atoms, elements from $\{H, B, C, N, O, F, Si, P, S, Cl, Br, I\}$). Training used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) for 50,000 steps with batch size 128, an initial learning rate of 0.001, and a decay of 0.02 every 100 steps. The Prior generates 94% valid SMILES, of which 90% are novel.</p>
<h3 id="experiment-1-learning-to-avoid-sulphur">Experiment 1: Learning to Avoid Sulphur</h3>
<p>A proof-of-principle task where the scoring function assigns $S(A) = 1$ for valid sulphur-free molecules, $S(A) = 0$ for invalid SMILES, and $S(A) = -1$ for sulphur-containing molecules.</p>
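<p>A toy version of this scoring function might look as follows (hedged sketch: the paper determines validity by parsing with RDKit, whereas here validity is passed in and sulphur is detected with a naive character scan that would also flag bracketed symbols like [Se]):</p>

```python
def sulphur_score(smiles, is_valid):
    """S(A) = 1 for a valid sulphur-free molecule, 0 for invalid
    SMILES, -1 when sulphur is present. The 'S'/'s' scan is only an
    approximation; a real scorer should enumerate parsed atoms."""
    if not is_valid:
        return 0.0
    has_sulphur = any(ch in smiles for ch in ("S", "s"))
    return -1.0 if has_sulphur else 1.0

print(sulphur_score("CCO", True))    # 1.0: ethanol, sulphur-free
print(sulphur_score("CCS", True))    # -1.0: ethanethiol
print(sulphur_score("C1CC", False))  # 0.0: unparsable SMILES
```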
<p>The Agent method was compared against three alternatives:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Fraction No S</th>
          <th>Avg MW</th>
          <th>Avg cLogP</th>
          <th>Avg RotBonds</th>
          <th>Avg AromRings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior</td>
          <td>0.94</td>
          <td>0.66</td>
          <td>371</td>
          <td>3.36</td>
          <td>5.39</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Agent</td>
          <td>0.95</td>
          <td>0.98</td>
          <td>367</td>
          <td>3.37</td>
          <td>5.41</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Action basis</td>
          <td>0.95</td>
          <td>0.92</td>
          <td>372</td>
          <td>3.39</td>
          <td>6.08</td>
          <td>2.09</td>
      </tr>
      <tr>
          <td>REINFORCE</td>
          <td>0.98</td>
          <td>0.98</td>
          <td>585</td>
          <td>11.3</td>
          <td>30.0</td>
          <td>0.57</td>
      </tr>
      <tr>
          <td>REINFORCE + Prior</td>
          <td>0.98</td>
          <td>0.92</td>
          <td>232</td>
          <td>3.05</td>
          <td>2.8</td>
          <td>2.11</td>
      </tr>
  </tbody>
</table>
<p>Standard REINFORCE exploited the reward by generating sequences of predominantly &ldquo;C&rdquo; (average MW 585, cLogP 11.3). REINFORCE + Prior avoided this but collapsed to small, simplistic structures (MW 232). The Agent achieved 98% sulphur-free structures while maintaining molecular properties nearly identical to the Prior, demonstrating that augmented episodic likelihood preserves the prior distribution.</p>
<h3 id="experiment-2-similarity-guided-generation-celecoxib-analogues">Experiment 2: Similarity-Guided Generation (Celecoxib Analogues)</h3>
<p>The scoring function uses <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> on FCFP4 fingerprints:</p>
<p>$$
S(A) = -1 + 2 \times \frac{\min\{J_{i,j}, k\}}{k}
$$</p>
<p>where $k$ caps the rewarded similarity. With $k = 1$ and $\sigma = 15$, the Agent recovers <a href="https://en.wikipedia.org/wiki/Celecoxib">Celecoxib</a> itself within 200 training steps. Even when all structures with $J &gt; 0.5$ to Celecoxib (1,804 molecules) were removed from the Prior training set, the Agent still found Celecoxib after 400 steps, despite a 700-fold reduction in prior likelihood ($\log_e P$ from $-12.7$ to $-19.2$).</p>
<p>With moderate similarity targets ($k = 0.7$, $\sigma = 12$), the Agent generates diverse analogues including scaffold hops where functional groups are rearranged.</p>
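<p>On set-based fingerprints, the capped similarity score can be sketched as follows (illustrative only; the paper computes $J_{i,j}$ on RDKit FCFP4 fingerprints):</p>

```python
def capped_similarity_score(fp_query, fp_gen, k=0.7):
    """S(A) = -1 + 2 * min(J, k) / k: the reward saturates once the
    Jaccard similarity J reaches the cap k, so an exact match is not
    required to earn the maximum score."""
    union = len(fp_query | fp_gen)
    j = len(fp_query & fp_gen) / union if union else 0.0
    return -1.0 + 2.0 * min(j, k) / k

query = {1, 4, 9, 16}                                    # toy fingerprint bits
print(capped_similarity_score(query, query, k=1.0))      # 1.0: exact match
print(capped_similarity_score(query, {2, 3}, k=1.0))     # -1.0: disjoint
print(capped_similarity_score(query, {1, 4, 9}, k=0.7))  # 1.0: J = 0.75 >= k
```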
<h3 id="experiment-3-target-activity-drd2">Experiment 3: Target Activity (DRD2)</h3>
<p>The most drug-discovery-relevant task: generating molecules predicted active against the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2 (DRD2)</a>. An SVM classifier (Gaussian kernel, $C = 2^7$, $\gamma = 2^{-6}$) was trained on bioactivity data from ExCAPE-DB (7,218 actives with pIC50 &gt; 5, 100,000 sampled inactives). The actives were split by Butina clustering (ECFP6, cutoff 0.4) to decrease nearest-neighbor similarity between train and test sets.</p>
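<p>The kernel at these hyperparameters is a one-liner (pure-Python stand-in; the paper trains the classifier with Scikit-learn):</p>

```python
import math

def rbf_kernel(x, y, gamma=2 ** -6):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2), with
    gamma = 2^-6 as selected by the paper's grid search."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # 1.0: identical inputs
print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))  # exp(-2/64) ~ 0.969
```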
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Prior</th>
          <th>Agent</th>
          <th>Prior (reduced)</th>
          <th>Agent (reduced)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid SMILES</td>
          <td>0.94</td>
          <td>0.99</td>
          <td>0.94</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Fraction predicted actives</td>
          <td>0.03</td>
          <td>0.97</td>
          <td>0.02</td>
          <td>0.96</td>
      </tr>
      <tr>
          <td>Fraction similar to train active</td>
          <td>0.02</td>
          <td>0.79</td>
          <td>0.02</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>Fraction similar to test active</td>
          <td>0.01</td>
          <td>0.46</td>
          <td>0.01</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>Fraction of test actives recovered (&times;10<sup>-3</sup>)</td>
          <td>13.5</td>
          <td>126</td>
          <td>2.85</td>
          <td>72.6</td>
      </tr>
  </tbody>
</table>
<p>The Agent increased the fraction of predicted actives from 2-3% (Prior) to 96-97%, representing a 250-fold enrichment in the probability of generating a test-set active. The Agent based on the reduced Prior (DRD2 actives removed from ChEMBL) still recovered 7% of test actives, meaning it generated experimentally confirmed actives that appeared in neither the generative model&rsquo;s nor the activity model&rsquo;s training data.</p>
<h2 id="anchored-policy-learning-prevents-reward-exploitation">Anchored Policy Learning Prevents Reward Exploitation</h2>
<p>The key finding is that augmented episodic likelihood successfully balances score optimization with prior distribution preservation. The Agent achieves task objectives (sulphur avoidance, similarity targets, activity prediction) while maintaining the molecular property distributions learned from ChEMBL. This is a significant improvement over standard REINFORCE, which either exploits rewards trivially or collapses to simple structures.</p>
<p>Analysis of the conditional probability distributions between the Prior and Agent (for DRD2 active generation) shows that the policy changes are not drastic: most trends learned by the Prior carry over, with targeted modifications at specific steps that substantially alter sequence likelihoods and generated structure types.</p>
<p>Limitations acknowledged by the authors:</p>
<ul>
<li>All experiments use single-parameter scoring functions; multi-parametric optimization (activity + DMPK + synthetic accessibility) is left for future work</li>
<li>The quality of generated structures depends heavily on the Prior&rsquo;s coverage of chemical space</li>
<li>The activity model (SVM) has limited domain of applicability, and structures outside this domain may be falsely scored</li>
<li>No exhaustive study of how Prior training set size, model size, and regularization affect generation quality</li>
</ul>
<p>Future directions include multi-parametric scoring functions, exploration of token embeddings, and adversarial training where the scoring function is replaced by a discriminator network (GAN-style training).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>1.5M structures</td>
          <td>10-50 heavy atoms, filtered elements</td>
      </tr>
      <tr>
          <td>DRD2 activity model</td>
          <td>ExCAPE-DB</td>
          <td>7,218 actives + 100K inactives</td>
          <td>Butina clustering split (ECFP6, cutoff 0.4)</td>
      </tr>
      <tr>
          <td>Similarity target</td>
          <td>Celecoxib</td>
          <td>1 query structure</td>
          <td>FCFP4 fingerprints for Jaccard similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prior</strong>: 3-layer GRU RNN (1024 units/layer), Adam optimizer, 50K steps, batch size 128, LR 0.001 with 0.02 decay/100 steps</li>
<li><strong>Agent</strong>: Same architecture, SGD with LR 0.0005, gradient clipping [-3, 3], on-policy batches of 128</li>
<li><strong>DRD2 model</strong>: SVM with Gaussian kernel ($C = 2^7$, $\gamma = 2^{-6}$), grid search on validation set</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MarcusOlivecrona/REINVENT">REINVENT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation in TensorFlow/Python 2.7</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.572576">Archived version</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Zenodo archive (DOI: 10.5281/zenodo.572576)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>SMILES validity rate (RDKit parsing)</li>
<li>Fraction of structures satisfying scoring function</li>
<li>Molecular property distributions (MW, cLogP, rotatable bonds, aromatic rings)</li>
<li>Jaccard similarity on ECFP6/FCFP4 fingerprints</li>
<li>Recovery rate of known actives from test set</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. The implementation uses TensorFlow 1.0.1 with Python 2.7, RDKit, and Scikit-learn.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Olivecrona, M., Blaschke, T., Engkvist, O., &amp; Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. <em>Journal of Cheminformatics</em>, 9(1), 48.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{olivecrona2017molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular de-novo design through deep reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Olivecrona, Marcus and Blaschke, Thomas and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{48}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-017-0235-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PharMolixFM: Multi-Modal All-Atom Molecular Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/pharmolixfm-all-atom-foundation-models/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/pharmolixfm-all-atom-foundation-models/</guid><description>PharMolixFM unifies diffusion, flow matching, and Bayesian flow networks for all-atom molecular modeling and generation with task-specific denoising priors.</description><content:encoded><![CDATA[<h2 id="a-unified-framework-for-all-atom-molecular-foundation-models">A Unified Framework for All-Atom Molecular Foundation Models</h2>
<p>PharMolixFM is a <strong>Method</strong> paper that introduces a unified framework for constructing all-atom foundation models for molecular modeling and generation. The primary contribution is the systematic implementation of three multi-modal generative model variants (diffusion, flow matching, and Bayesian flow networks) within a single architecture, along with a task-unifying denoising formulation that enables training on multiple structural biology tasks simultaneously. The framework achieves competitive performance on protein-small-molecule docking and structure-based drug design while providing the first empirical analysis of inference scaling laws for molecular generative models.</p>
<h2 id="challenges-in-multi-modal-atomic-modeling">Challenges in Multi-Modal Atomic Modeling</h2>
<p>Existing all-atom foundation models such as AlphaFold3, RoseTTAFold All-Atom, and ESM-AA face two core challenges that limit their generalization across molecular modeling and generation tasks.</p>
<p>First, atomic data is inherently multi-modal: each atom comprises both a discrete atom type and continuous 3D coordinates. This poses challenges for structure models that need to jointly capture and predict both modalities. Unlike text or image data that exhibit a single modality, molecular structures require generative models that can handle discrete categorical variables (atom types, bond types) and continuous variables (coordinates) simultaneously.</p>
<p>Second, there has been no comprehensive analysis of how different training objectives and sampling strategies impact the performance of all-atom foundation models. Prior work has focused on individual model architectures without systematically comparing generative frameworks or studying how inference-time compute scaling affects prediction quality.</p>
<p>PharMolixFM addresses both challenges by providing a unified framework that implements three state-of-the-art multi-modal generative models and formulates all downstream tasks as a generalized denoising process with task-specific priors.</p>
<h2 id="multi-modal-denoising-with-task-specific-priors">Multi-Modal Denoising with Task-Specific Priors</h2>
<p>The core innovation of PharMolixFM is the formulation of molecular tasks as a generalized denoising process where task-specific priors control which parts of the molecular system are noised during training. The framework decomposes a biomolecular system into $N$ atoms represented as a triplet $\bar{\mathbf{S}}_0 = \langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle$, where $\mathbf{X}_0 \in \mathbb{R}^{N \times 3}$ are atom coordinates, $\mathbf{A}_0 \in \mathbb{Z}^{N \times D_1}$ are one-hot atom types, and $\mathbf{E}_0 \in \mathbb{Z}^{N \times N \times D_2}$ are one-hot bond types.</p>
<p>The generative model estimates the density $p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)$ subject to SE(3) invariance:</p>
<p>$$
p_\theta(\langle \mathbf{R}\mathbf{X}_0 + \mathbf{t}, \mathbf{A}_0, \mathbf{E}_0 \rangle) = p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)
$$</p>
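<p>A common way to satisfy this constraint is to build on quantities that are unchanged by rigid-body motion, such as pairwise distances. A small 2D sanity check (illustrative only, not PharMolixFM&rsquo;s actual architecture):</p>

```python
import math

def pairwise_distances(coords):
    """All inter-atom distances; invariant under rotation/translation."""
    return [
        math.dist(coords[i], coords[j])
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
    ]

def rigid_transform(coords, angle, tx, ty):
    """Rotate 2D points by `angle`, then translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in coords]

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
moved = rigid_transform(pts, angle=0.7, tx=5.0, ty=-3.0)
print(all(
    math.isclose(a, b)
    for a, b in zip(pairwise_distances(pts), pairwise_distances(moved))
))  # True: the distance set is invariant under the rigid motion
```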
<p>The variational lower bound is optimized over latent variables $S_1, \ldots, S_T$ obtained by adding independent noise to different modalities and atoms:</p>
<p>$$
q(S_{1:T} \mid S_0) = \prod_{i=1}^{T} \prod_{j=1}^{N} q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}, \sigma_{i,j}^{(\mathbf{X})}) \, q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}, \sigma_{i,j}^{(\mathbf{A})}) \, q(\mathbf{E}_{i,j} \mid \mathbf{E}_{0,j}, \sigma_{i,j}^{(\mathbf{E})})
$$</p>
<p>A key design choice is the noise schedule $\sigma_{i,j}^{(\mathcal{M})} = \frac{i}{T} \cdot \text{fix}_j^{(\mathcal{M})}$, where $\text{fix}_j^{(\mathcal{M})}$ is a scaling factor between 0 and 1 that controls which atoms and modalities receive noise. This &ldquo;Fix&rdquo; mechanism enables multiple training tasks:</p>
<ul>
<li><strong>Docking</strong> ($\text{Fix} = 1$ for protein and molecular graph, $\text{Fix} = 0$ for molecule coordinates): predicts binding pose given known atom/bond types.</li>
<li><strong>Structure-based drug design</strong> ($\text{Fix} = 1$ for protein, $\text{Fix} = 0$ for all molecule properties): generates novel molecules for a given pocket.</li>
<li><strong>Robustness augmentation</strong> ($\text{Fix} = 0.7$ for 15% randomly selected atoms, $\text{Fix} = 0$ for rest): simulates partial structure determination.</li>
</ul>
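<p>The schedule itself is a linear ramp in the step index, gated per atom and modality. A sketch following the formula above (the helper name is mine, not the paper&rsquo;s):</p>

```python
def noise_scale(i, T, fix):
    """sigma_{i,j} = (i / T) * fix_j: noise grows linearly over the T
    steps, scaled per atom/modality by its Fix factor (per the text's
    schedule; a factor of 0 leaves that component noise-free)."""
    return (i / T) * fix

T = 100
print(noise_scale(50, T, fix=1.0))  # 0.5: halfway up the full ramp
print(noise_scale(50, T, fix=0.7))  # 0.35: partially gated component
print(noise_scale(50, T, fix=0.0))  # 0.0: component never noised
```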
<h3 id="three-generative-model-variants">Three Generative Model Variants</h3>
<p><strong>Multi-modal diffusion (PharMolixFM-Diff)</strong> uses a Markovian forward process. Continuous coordinates follow Gaussian diffusion while discrete variables use a D3PM categorical transition:</p>
<p>$$
q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\sqrt{\alpha_{i,j}} \, \mathbf{X}_{0,j}, (1 - \alpha_{i,j}) \mathbf{I}), \quad \alpha_{i,j} = \prod_{k=1}^{i}(1 - \sigma_{k,j}^{(\mathbf{X})})
$$</p>
<p>$$
q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}) = \text{Cat}(\mathbf{A}_{0,j} \bar{Q}_{i,j}^{(\mathbf{A})}), \quad Q_{i,j}^{(\mathbf{A})} = (1 - \sigma_{i,j}^{(\mathbf{A})}) \mathbf{I} + \frac{\sigma_{i,j}^{(\mathbf{A})}}{D_1} \mathbb{1}\mathbb{1}^T
$$</p>
<p>The training loss combines coordinate MSE with cross-entropy for discrete variables:</p>
<p>$$
\mathcal{L} = \mathbb{E}_{S_0, i, S_i} \left[ \lambda_i^{(\mathbf{X})} \| \tilde{\mathbf{X}}_0 - \mathbf{X}_0 \|_2^2 + \lambda_i^{(\mathbf{A})} \mathcal{L}_{CE}(\tilde{\mathbf{A}}_0, \mathbf{A}_0) + \lambda_i^{(\mathbf{E})} \mathcal{L}_{CE}(\tilde{\mathbf{E}}_0, \mathbf{E}_0) \right]
$$</p>
<p><strong>Multi-modal flow matching (PharMolixFM-Flow)</strong> constructs a direct mapping between data and prior distributions using conditional vector fields. For coordinates, the conditional flow uses a Gaussian path $q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}((1 - \sigma_{i,j}^{(\mathbf{X})}) \mathbf{X}_{0,j}, (\sigma_{i,j}^{(\mathbf{X})})^2 \mathbf{I})$, while discrete variables use the same D3PM Markov chain. Sampling proceeds by solving an ODE via Euler integration.</p>
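<p>The Euler sampler referenced above is generic ODE integration; a minimal version (the true vector field is the trained network, here replaced by a toy function):</p>

```python
def euler_integrate(x0, vector_field, t0=0.0, t1=1.0, steps=1000):
    """Fixed-step Euler integration of dx/dt = v(x, t), the sampling
    scheme used for the flow-matching variant."""
    x, t = x0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        x = x + dt * vector_field(x, t)
        t += dt
    return x

# Sanity check on dx/dt = -x: x(1) should approach x(0) * e^-1 ~ 0.3679.
print(euler_integrate(1.0, lambda x, t: -x, steps=10000))
```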
<p><strong>Bayesian flow networks (PharMolixFM-BFN)</strong> perform generative modeling in the parameter space of the data distribution rather than the data space. The Bayesian flow distribution for coordinates is:</p>
<p>$$
p_F(\tilde{\mathbf{X}}_{i,j}^{(\theta)} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\gamma_{i,j} \mathbf{X}_{0,j}, \gamma_{i,j}(1 - \gamma_{i,j}) \mathbf{I}), \quad \gamma_{i,j} = 1 - \alpha^{2(1 - \sigma_{i,j}^{(\mathbf{X})})}
$$</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>The architecture follows PocketXMol with a dual-branch SE(3)-equivariant graph neural network. A protein branch (4-layer GNN with kNN graph) processes pocket atoms, then representations are passed to a molecule branch (6-layer GNN) that captures protein-molecule interactions. Independent prediction heads reconstruct atom coordinates, atom types, and bond types, with additional confidence heads for self-ranking during inference.</p>
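<p>The kNN graphs both branches operate on are straightforward to construct; a brute-force sketch (real implementations use optimized neighbor search):</p>

```python
import math

def knn_graph(coords, k):
    """Connect each atom to its k nearest neighbors by Euclidean
    distance, returning directed edges (atom -> neighbor)."""
    edges = []
    for i, p in enumerate(coords):
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(coords) if j != i
        )
        edges.extend((i, j) for _, j in others[:k])
    return edges

atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
print(knn_graph(atoms, k=2))
# The three clustered atoms link to one another; the distant atom
# still receives edges to its two nearest neighbors.
```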
<h2 id="docking-and-drug-design-experiments">Docking and Drug Design Experiments</h2>
<h3 id="protein-small-molecule-docking">Protein-Small-Molecule Docking</h3>
<p>PharMolixFM is evaluated on the PoseBusters benchmark (428 protein-small-molecule complexes) using the holo docking setting with a known protein structure and a 10 &#8491; binding pocket. The metric is the fraction of predictions with RMSD &lt; 2 &#8491;.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Self-Ranking (%)</th>
          <th>Oracle-Ranking (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffDock</td>
          <td>38.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>RFAA</td>
          <td>42.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Vina</td>
          <td>52.3</td>
          <td>-</td>
      </tr>
      <tr>
          <td>UniMol-Docking V2</td>
          <td>77.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>SurfDock</td>
          <td>78.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>AlphaFold3</td>
          <td>90.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>PocketXMol (50 repeats)</td>
          <td>82.2</td>
          <td>95.3</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (50 repeats)</td>
          <td>83.4</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow (50 repeats)</td>
          <td>73.4</td>
          <td>93.7</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN (50 repeats)</td>
          <td>78.5</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (500 repeats)</td>
          <td>83.9</td>
          <td>98.1</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM-Diff achieves the second-best self-ranking result (83.4%), outperforming PocketXMol by 1.2 percentage points but trailing AlphaFold3 (90.4%). The key advantage is inference speed: approximately 4.6 seconds per complex on a single A800 GPU versus approximately 249.0 seconds for AlphaFold3, a roughly 54x speedup. Under oracle-ranking with 500 repeats, PharMolixFM-Diff reaches 98.1%, suggesting that better ranking strategies could further improve practical performance.</p>
<h3 id="structure-based-drug-design">Structure-Based Drug Design</h3>
<p>Evaluation uses the CrossDocked test set (100 protein pockets, 100 molecules generated per pocket), measuring Vina binding affinity scores and drug-likeness properties (QED and SA).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score (Avg/Med)</th>
          <th>QED</th>
          <th>SA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-5.14 / -4.70</td>
          <td>0.57</td>
          <td>0.76</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-5.47 / -6.30</td>
          <td>0.48</td>
          <td>0.58</td>
      </tr>
      <tr>
          <td>DecompDiff</td>
          <td>-5.67 / -6.04</td>
          <td>0.45</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>MolCRAFT</td>
          <td>-6.61 / -8.14</td>
          <td>0.46</td>
          <td>0.62</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff</td>
          <td>-6.18 / -6.44</td>
          <td>0.50</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow</td>
          <td>-6.34 / -6.47</td>
          <td>0.49</td>
          <td>0.74</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN</td>
          <td>-6.38 / -6.45</td>
          <td>0.48</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM strikes a better balance between binding affinity and drug-like properties than the diffusion-based baselines. While MolCRAFT achieves the best Vina scores, the PharMolixFM-Diff and -Flow variants show notably higher QED (0.49-0.50 vs. 0.45-0.48) and SA (0.73-0.74 vs. 0.58-0.62) than TargetDiff, DecompDiff, and MolCRAFT, properties that matter for downstream validation and in-vivo application.</p>
<h3 id="inference-scaling-law">Inference Scaling Law</h3>
<p>The paper explores whether inference-time scaling holds for molecular generative models, fitting the relationship:</p>
<p>$$
\text{Acc} = a \log(bR + c) + d
$$</p>
<p>where $R$ is the number of sampling repeats. All three PharMolixFM variants exhibit logarithmic improvement in docking accuracy with increased sampling repeats, analogous to inference scaling laws observed in NLP. Performance plateaus eventually due to distributional differences between training and test sets.</p>
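<p>With illustrative coefficients (not values fitted in the paper), the curve makes the plateau behavior concrete: each additional block of repeats buys less accuracy than the last.</p>

```python
import math

def scaling_curve(R, a=0.1, b=1.0, c=1.0, d=0.5):
    """Acc = a * log(b * R + c) + d; the coefficients here are made up
    purely to illustrate the fitted functional form."""
    return a * math.log(b * R + c) + d

# Accuracy gained by 50 extra repeats, starting from ever-larger budgets:
gains = [scaling_curve(r + 50) - scaling_curve(r) for r in (10, 100, 400)]
print(gains)  # strictly positive but shrinking
```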
<h2 id="competitive-docking-with-faster-inference-but-limited-task-scope">Competitive Docking with Faster Inference, but Limited Task Scope</h2>
<p>PharMolixFM demonstrates that multi-modal generative models can achieve competitive all-atom molecular modeling with substantial inference speed advantages over AlphaFold3. The key findings are:</p>
<ol>
<li><strong>Diffusion outperforms flow matching and BFN</strong> for docking under standard sampling budgets. The stochastic nature of diffusion sampling appears beneficial compared to the deterministic ODE integration of flow matching.</li>
<li><strong>Oracle-ranking reveals untapped potential</strong>: the gap between self-ranking (83.4%) and oracle-ranking (98.1%) at 500 repeats indicates that confidence-based ranking is a bottleneck. Better ranking methods could close the gap with AlphaFold3.</li>
<li><strong>The three variants show similar performance for drug design</strong>, suggesting that model architecture and training data may matter more than the generative framework for generation tasks.</li>
<li><strong>Inference scaling laws hold</strong> for molecular generative models, paralleling findings in NLP.</li>
</ol>
<p>Limitations include that the framework is evaluated on only two tasks (docking and SBDD); the paper does not address protein structure prediction, protein-protein interactions, or nucleic acid modeling, all of which fall within AlphaFold3&rsquo;s scope. The BFN variant underperforms the diffusion model, which the authors attribute to its smaller noise scales at early sampling steps making the training task less challenging. The paper also does not compare against concurrent work on inference-time scaling for molecular models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PDBBind, Binding MOAD, CrossDocked2020, PepBDB</td>
          <td>Not specified</td>
          <td>Filtered by PocketXMol criteria</td>
      </tr>
      <tr>
          <td>Docking eval</td>
          <td>PoseBusters benchmark</td>
          <td>428 complexes</td>
          <td>Holo docking with known protein</td>
      </tr>
      <tr>
          <td>SBDD eval</td>
          <td>CrossDocked test set</td>
          <td>100 pockets</td>
          <td>100 molecules per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Three generative variants: multi-modal diffusion (D3PM), flow matching, Bayesian flow networks</li>
<li>Task-specific noise via Fix mechanism (0, 0.7, or 1.0)</li>
<li>Training tasks selected with equal probability per sample</li>
<li>AdamW optimizer: weight decay 0.001, $\beta_1 = 0.99$, $\beta_2 = 0.999$</li>
<li>Linear warmup to learning rate 0.001 over 1000 steps</li>
<li>180K training steps with batch size 40</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Dual-branch SE(3)-equivariant GNN (protein: 4-layer, molecule: 6-layer)</li>
<li>kNN graph construction for protein and protein-molecule interactions</li>
<li>Independent prediction heads for coordinates, atom types, bond types</li>
<li>Confidence heads for self-ranking during inference</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PharMolixFM-Diff</th>
          <th>AlphaFold3</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSD &lt; 2 Å (self-ranking)</td>
          <td>83.4% (50 rep)</td>
          <td>90.4%</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>RMSD &lt; 2 Å (oracle-ranking)</td>
          <td>98.1% (500 rep)</td>
          <td>-</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>Inference time (per complex)</td>
          <td>~4.6s</td>
          <td>~249.0s</td>
          <td>Single A800 GPU</td>
      </tr>
      <tr>
          <td>Vina score (avg)</td>
          <td>-6.18</td>
          <td>-</td>
          <td>CrossDocked SBDD</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 4x 80GB A800 GPUs</li>
<li>Inference benchmarked on single A800 GPU</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/PharMolix/OpenBioMed">OpenBioMed (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Wang, J., Fan, S., &amp; Nie, Z. (2025). PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation. <em>arXiv preprint arXiv:2503.21788</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2025pharmolixfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Wang, Jiashuo and Fan, Siqi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2503.21788}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ORGAN: Objective-Reinforced GANs for Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</guid><description>ORGAN combines GANs with reinforcement learning to steer SMILES-based molecular generation toward drug-likeness, solubility, and synthesizability objectives.</description><content:encoded><![CDATA[<h2 id="combining-gans-and-reinforcement-learning-for-goal-directed-sequence-generation">Combining GANs and Reinforcement Learning for Goal-Directed Sequence Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces ORGAN (Objective-Reinforced Generative Adversarial Network), a framework for generating sequences that are both realistic (close to the training distribution) and optimized for domain-specific objectives. ORGAN extends SeqGAN by adding external reward functions to the reinforcement learning signal, with a tunable parameter $\lambda$ controlling the balance between adversarial (discriminator) and objective-based rewards. The authors demonstrate ORGAN on two domains: molecular generation using <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings (optimizing druglikeness, solubility, and synthesizability) and musical melody generation (optimizing tonality and step ratios).</p>
<h2 id="exposure-bias-and-mode-collapse-in-discrete-sequence-generation">Exposure Bias and Mode Collapse in Discrete Sequence Generation</h2>
<p>Generating discrete sequences with desirable properties presents two intertwined challenges. First, RNNs trained via maximum likelihood estimation (MLE) suffer from exposure bias, where the model sees only ground-truth prefixes during training but must condition on its own (potentially erroneous) outputs at generation time. Second, while <a href="/posts/what-is-a-gan/">GANs</a> can address some of these issues through adversarial training, they were not initially applicable to discrete data due to non-differentiability of the sampling step. SeqGAN resolved this by framing the generator as an RL agent, but it optimizes only for distributional fidelity (fooling the discriminator) without any mechanism to steer generation toward specific property targets.</p>
<p>In drug discovery, simply generating valid, drug-like molecules is insufficient. Practitioners need to optimize for particular pharmaceutical properties (e.g., solubility, synthesizability, druglikeness) while maintaining structural diversity. Naive RL approaches can optimize properties effectively but tend to collapse onto trivial solutions (e.g., repeating &ldquo;CCCCCCC&rdquo; to maximize solubility). The challenge is to combine the distributional regularization of adversarial training with the goal-directedness of RL.</p>
<h2 id="mixed-reward-interpolating-between-adversarial-and-objective-signals">Mixed Reward: Interpolating Between Adversarial and Objective Signals</h2>
<p>ORGAN&rsquo;s core innovation is a reward function that linearly interpolates between the discriminator score and domain-specific objectives:</p>
<p>$$R(Y_{1:T}) = \lambda \cdot D_{\phi}(Y_{1:T}) + (1 - \lambda) \cdot O_{i}(Y_{1:T})$$</p>
<p>When $\lambda = 1$, the model reduces to SeqGAN (pure adversarial training). When $\lambda = 0$, it becomes naive RL optimizing only the objective. Intermediate values allow the adversarial component to regularize the generator, keeping samples within the distribution while the objective component steers toward desired properties.</p>
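<p>The mixed reward is a one-liner. A minimal sketch (function and variable names are ours, not the paper's) showing the two limiting cases:</p>

```python
def organ_reward(disc_score, obj_score, lam):
    """R(Y) = lam * D(Y) + (1 - lam) * O(Y)."""
    return lam * disc_score + (1.0 - lam) * obj_score

r_seqgan = organ_reward(0.8, 0.4, lam=1.0)  # 0.8: pure adversarial (SeqGAN limit)
r_naive = organ_reward(0.8, 0.4, lam=0.0)   # 0.4: pure objective (naive RL limit)
r_mixed = organ_reward(0.8, 0.4, lam=0.5)   # ~0.6: the paper's default balance
```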
<p>The generator $G_{\theta}$ is an LSTM-based RNN that produces sequences token-by-token. Training follows the REINFORCE algorithm, where the expected long-term reward is:</p>
<p>$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_{0}, \theta\right] = \sum_{y_{1} \in Y} G_{\theta}(y_{1} \mid s_{0}) \cdot Q(s_{0}, y_{1})$$</p>
<p>For intermediate timesteps (partial sequences), the action-value function $Q$ is estimated via $N$-time Monte Carlo rollouts:</p>
<p>$$Q(Y_{1:t-1}, y_{t}) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} R(Y_{1:T}^{n}), &amp; \text{if } t &lt; T \\ R(Y_{1:T}), &amp; \text{if } t = T \end{cases}$$</p>
<p>where $Y_{1:T}^{n}$ are completions sampled by rolling out the current policy $G_{\theta}$ from state $Y_{1:t}$.</p>
<p>The policy gradient is:</p>
<p>$$\nabla_{\theta} J(\theta) \simeq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_{t} \sim G_{\theta}(y_{t} \mid Y_{1:t-1})} \left[\nabla_{\theta} \log G_{\theta}(y_{t} \mid Y_{1:t-1}) \cdot Q(Y_{1:t-1}, y_{t})\right]$$</p>
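<p>The rollout-based $Q$ estimate that feeds this gradient can be sketched with a toy policy. Everything here (the binary alphabet, the fraction-of-ones reward) is an invented stand-in for the SMILES generator and its discriminator/objective scores:</p>

```python
import random

def q_estimate(prefix, rollout, reward, N=16, T=10):
    """N-time Monte Carlo estimate of Q(Y_{1:t-1}, y_t).

    `rollout(prefix, T)` completes `prefix` to length T by sampling from the
    current policy; `reward` scores a full sequence. A finished sequence
    (len == T) is scored directly, matching the t = T case above.
    """
    if len(prefix) == T:
        return reward(prefix)
    return sum(reward(rollout(prefix, T)) for _ in range(N)) / N

random.seed(0)
# Toy stand-ins: binary alphabet, reward = fraction of 1-tokens.
rollout = lambda p, T: p + [random.randint(0, 1) for _ in range(T - len(p))]
reward = lambda seq: sum(seq) / len(seq)

q_full = q_estimate([1] * 10, rollout, reward, T=10)      # finished: reward itself
q_prefix = q_estimate([1, 0], rollout, reward, N=8, T=10)  # averaged over rollouts
```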
<p>Two additional mechanisms improve training:</p>
<ol>
<li><strong>Diversity penalty</strong>: Repeated sequences have their reward divided by their copy count, providing diminishing returns for non-unique outputs.</li>
<li><strong>Wasserstein distance</strong>: The authors also implement a variant (OR(W)GAN) that replaces the standard GAN discriminator loss with the Wasserstein-1 distance via Kantorovich-Rubinstein duality, which can improve training stability and diversity.</li>
</ol>
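<p>The diversity penalty in particular is simple to state in code. A minimal sketch (hypothetical names; the constant reward is only to make the copy-count division visible):</p>

```python
from collections import Counter

def penalized_rewards(batch, reward):
    """Divide each sample's reward by its copy count in the batch, so
    duplicated sequences see diminishing returns."""
    counts = Counter(batch)
    return [reward(s) / counts[s] for s in batch]

batch = ["CCO", "CCO", "c1ccccc1"]
rs = penalized_rewards(batch, lambda s: 1.0)  # duplicates split their credit
```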
<h2 id="molecular-and-musical-melody-generation-experiments">Molecular and Musical Melody Generation Experiments</h2>
<h3 id="architecture">Architecture</h3>
<p>The generator $G_{\theta}$ is an RNN with LSTM cells. The discriminator $D_{\phi}$ is a CNN for text classification following Kim (2014), with 75% dropout and L2 regularization. All optimization uses Adam. Molecular metrics are computed with RDKit.</p>
<h3 id="molecular-generation-setup">Molecular Generation Setup</h3>
<p>Training data consists of 5,000 random molecules from the QM9 dataset (134k stable small molecules with up to 9 heavy atoms), encoded as SMILES strings with maximum sequence length 51 and alphabet size 43. Each generator is pre-trained for 250 MLE epochs, with the discriminator trained for 10 epochs. Adversarial/RL training then proceeds for up to 100 additional epochs. The default $\lambda$ is 0.5.</p>
<p>Three molecular objectives are evaluated:</p>
<ul>
<li><strong>Solubility (LogP)</strong>: water-octanol partition coefficient via RDKit&rsquo;s Crippen function</li>
<li><strong>Synthesizability</strong>: SA score estimating ease of synthesis (0 = hard, 1 = easy)</li>
<li><strong>Druglikeness</strong>: QED score capturing medicinal chemistry aesthetics</li>
</ul>
<p>Diversity is measured using average Jaccard distance of molecular fingerprints relative to a random training subset.</p>
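<p>In the paper this diversity is computed over RDKit molecular fingerprints; the same quantity can be sketched in pure Python by treating each fingerprint as its set of on-bits (the sets below are toy stand-ins, not real fingerprints):</p>

```python
def jaccard_distance(fp_a, fp_b):
    """1 - |A ∩ B| / |A ∪ B| over fingerprint on-bit sets."""
    union = len(fp_a | fp_b)
    return 1.0 - len(fp_a & fp_b) / union if union else 0.0

def avg_diversity(gen_fps, ref_fps):
    """Mean Jaccard distance of each generated fingerprint to a reference set."""
    pairs = [(g, r) for g in gen_fps for r in ref_fps]
    return sum(jaccard_distance(g, r) for g, r in pairs) / len(pairs)

a, b = {1, 2, 3}, {2, 3, 4}
d = jaccard_distance(a, b)        # 1 - 2/4 = 0.5
div = avg_diversity([a, b], [b])  # mean distance to the "training" subset
```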
<h3 id="molecular-generation-results">Molecular Generation Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Validity (%)</th>
          <th>Diversity</th>
          <th>Druglikeness</th>
          <th>Synthesizability</th>
          <th>Solubility</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>75.9</td>
          <td>0.64</td>
          <td>0.48 (0%)</td>
          <td>0.23 (0%)</td>
          <td>0.30 (0%)</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>80.3</td>
          <td>0.61</td>
          <td>0.49 (+2%)</td>
          <td>0.25 (+6%)</td>
          <td>0.31 (+3%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>ORGAN</td>
          <td>88.2</td>
          <td>0.55</td>
          <td>0.52 (+8%)</td>
          <td>0.32 (+38%)</td>
          <td>0.35 (+18%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>OR(W)GAN</td>
          <td>85.0</td>
          <td>0.95</td>
          <td>0.60 (+25%)</td>
          <td>0.54 (+130%)</td>
          <td>0.47 (+57%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Naive RL</td>
          <td>97.1</td>
          <td>0.80</td>
          <td>0.57 (+19%)</td>
          <td>0.53 (+126%)</td>
          <td>0.50 (+67%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>ORGAN</td>
          <td>96.5</td>
          <td>0.92</td>
          <td>0.51 (+6%)</td>
          <td>0.83 (+255%)</td>
          <td>0.45 (+52%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>OR(W)GAN</td>
          <td>97.6</td>
          <td>1.00</td>
          <td>0.20 (-59%)</td>
          <td>0.75 (+223%)</td>
          <td>0.84 (+184%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>ORGAN</td>
          <td>94.7</td>
          <td>0.76</td>
          <td>0.50 (+4%)</td>
          <td>0.63 (+171%)</td>
          <td>0.55 (+85%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>OR(W)GAN</td>
          <td>94.1</td>
          <td>0.90</td>
          <td>0.42 (-12%)</td>
          <td>0.66 (+185%)</td>
          <td>0.54 (+81%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>Naive RL</td>
          <td>92.7</td>
          <td>0.75</td>
          <td>0.49 (+3%)</td>
          <td>0.70 (+200%)</td>
          <td>0.78 (+162%)</td>
      </tr>
      <tr>
          <td>All (alternated)</td>
          <td>ORGAN</td>
          <td>96.1</td>
          <td>0.923</td>
          <td>0.52 (+9%)</td>
          <td>0.71 (+206%)</td>
          <td>0.53 (+79%)</td>
      </tr>
  </tbody>
</table>
<p>Key observations: OR(W)GAN consistently achieves higher diversity than standard ORGAN. Naive RL often reaches higher raw objective scores, but at the cost of collapsing onto trivial solutions (e.g., simple atom chains for solubility). Multi-objective training via alternating objectives across epochs achieves gains comparable to individually optimized models.</p>
<h3 id="music-generation-setup">Music Generation Setup</h3>
<p>Using 1,000 melodies from the EsAC folk dataset, each encoded as 36-token sequences where tokens represent sixteenth-note events across three octaves (C3-B5). Two metrics are optimized: tonality (proportion of perfect fifths) and ratio of steps (conjunct melodic motion). Diversity is measured as average pairwise edit distance.</p>
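<p>The summary does not spell out the metric formulas, so the following is an assumed reading (our interpretation, not the paper's code): tonality as the fraction of consecutive intervals spanning a perfect fifth (7 semitones), and ratio of steps as the fraction that are conjunct (1&ndash;2 semitones), over MIDI-style pitch numbers:</p>

```python
def interval_metrics(notes):
    """Toy versions of the two melody metrics (assumed definitions):
    tonality   = fraction of consecutive intervals equal to a perfect fifth,
    step_ratio = fraction that are conjunct steps (1 or 2 semitones)."""
    intervals = [abs(b - a) for a, b in zip(notes, notes[1:])]
    if not intervals:
        return 0.0, 0.0
    tonality = sum(i == 7 for i in intervals) / len(intervals)
    step_ratio = sum(i in (1, 2) for i in intervals) / len(intervals)
    return tonality, step_ratio

t, s = interval_metrics([60, 67, 69, 70])  # a fifth up, then two steps up
```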
<h3 id="music-results">Music Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Diversity</th>
          <th>Tonality</th>
          <th>Ratio of Steps</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>0.221</td>
          <td>0.007</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>0.187</td>
          <td>0.005</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Naive RL</td>
          <td>0.100</td>
          <td>0.478</td>
          <td>2.9E-05</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>ORGAN</td>
          <td>0.268</td>
          <td>0.372</td>
          <td>1.78E-04</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>OR(W)GAN</td>
          <td>0.268</td>
          <td>0.177</td>
          <td>2.4E-04</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Naive RL</td>
          <td>0.321</td>
          <td>0.001</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>ORGAN</td>
          <td>0.433</td>
          <td>0.001</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>OR(W)GAN</td>
          <td>0.134</td>
          <td>5.95E-05</td>
          <td>0.622</td>
      </tr>
  </tbody>
</table>
<p>ORGAN outperforms SeqGAN and MLE on all metrics. Naive RL achieves higher raw scores but with lower diversity, producing simpler, less interesting outputs.</p>
<h2 id="capacity-ceilings-trade-offs-and-future-directions">Capacity Ceilings, Trade-offs, and Future Directions</h2>
<p>The authors identify several limitations and findings:</p>
<p><strong>Capacity ceiling</strong>: GAN-based models tend to generate sequences matching the training set&rsquo;s average length (15.42 characters). RL-only approaches can break this constraint, generating shorter (9.4) or longer (21.3) sequences depending on the objective. The upper bound of optimized properties also matches the training data&rsquo;s maximum, suggesting dataset-dependent limits.</p>
<p><strong>Lambda trade-off</strong>: Varying $\lambda$ reveals an optimal balance between objective optimization and distributional fidelity. This optimum depends on the model, dataset, and metric, suggesting that hyperparameter search over $\lambda$ is important in practice.</p>
<p><strong>Tonality vs. steps inverse relationship</strong>: In the music task, optimizing for tonality (perfect fifths) inherently conflicts with optimizing for step ratios (consecutive notes), since consecutive scale notes do not form perfect fifths.</p>
<p><strong>Limitations</strong>: The paper evaluates on relatively small datasets (5k molecules, 1k melodies) and short sequences. The molecular experiments use QM9 (small molecules with up to 9 heavy atoms), which limits the scope of conclusions for drug-like chemical space. The Wasserstein variant sometimes lags behind the standard GAN loss in raw metric scores, though it offers better diversity.</p>
<p><strong>Future directions</strong>: The authors propose extending ORGAN to non-sequential data (images, audio) by framing GANs as RL problems more broadly, and investigating how different heuristic choices affect performance. They also suggest exploring other discrete GAN formulations (MaliGAN, BGAN) with RL extensions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecular training</td>
          <td>QM9 subset</td>
          <td>5,000 molecules</td>
          <td>Random subset from 134k stable small molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Music training</td>
          <td>EsAC folk dataset</td>
          <td>1,000 melodies</td>
          <td>36-token sequences, processed following Chen et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Generator pre-trained for 250 epochs via MLE; discriminator for 10 epochs</li>
<li>Adversarial/RL training for up to 100 epochs</li>
<li>Default $\lambda = 0.5$ for reward mixing</li>
<li>Monte Carlo rollouts for intermediate reward estimation</li>
<li>Duplicate penalty: reward divided by copy count</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Generator</strong>: RNN with LSTM cells</li>
<li><strong>Discriminator</strong>: CNN for text classification (Kim, 2014) with 75% dropout, L2 regularization</li>
<li><strong>Optimizer</strong>: Adam for all gradient descent steps</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Domain</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (%)</td>
          <td>Fraction of generated SMILES that decode to valid molecules</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Average Jaccard distance of fingerprints to training subset</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Druglikeness (QED)</td>
          <td>Quantitative Estimate of Drug-likeness</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Synthesizability (SA)</td>
          <td>Synthetic accessibility score</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Solubility (LogP)</td>
          <td>Water-octanol partition coefficient</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Proportion of perfect fifths</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Proportion of conjunct melodic intervals</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Diversity (edit)</td>
          <td>Average pairwise edit distance</td>
          <td>Music</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gablg1/ORGAN">ORGAN</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Official implementation including metrics for molecules and music</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guimaraes, G. L., Sánchez-Lengeling, B., Outeiral, C., Farias, P. L. C., &amp; Aspuru-Guzik, A. (2017). Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. <em>arXiv preprint arXiv:1705.10843</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guimaraes2017organ,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guimaraes, Gabriel Lima and Sanchez-Lengeling, Benjamin and Outeiral, Carlos and Farias, Pedro Luis Cunha and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1705.10843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolecularRNN: Graph-Based Molecular Generation and RL</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</guid><description>MolecularRNN extends GraphRNN with atom and bond type predictions, valency-based rejection sampling, and policy gradient optimization for molecular generation.</description><content:encoded><![CDATA[<h2 id="a-graph-recurrent-model-for-molecular-generation-with-property-optimization">A Graph Recurrent Model for Molecular Generation with Property Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces MolecularRNN, a graph-based recurrent generative model for molecular structures. The model extends GraphRNN to handle typed nodes (atom types) and typed edges (bond types), enabling direct generation of molecular graphs rather than working through string representations like SMILES. Three key contributions are combined: (1) the MolecularRNN architecture for autoregressive graph generation, (2) valency-based rejection sampling for guaranteed 100% validity at inference, and (3) policy gradient reinforcement learning for shifting molecular property distributions toward desired ranges.</p>
<h2 id="why-generate-molecules-as-graphs-rather-than-strings">Why Generate Molecules as Graphs Rather Than Strings</h2>
<p>Computational de novo molecular design aims to create novel molecules with desired properties, a task central to drug discovery. At the time of this work, most deep generative models for molecules operated on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, inheriting the complications of SMILES grammar and the problem that structurally similar molecules can have very different string representations. Graph-based representations are more natural for molecules, with atoms mapping to nodes and bonds to edges, and they allow direct enforcement of chemical constraints during generation.</p>
<p>Existing graph-based methods had their own limitations. Junction tree VAE (JT-VAE) generates molecules from structural fragments, which introduces ambiguity when converting junction trees back to molecules, particularly problematic during property optimization since molecules sharing a junction tree can have very different property values. The GCPN model uses graph convolutional networks with reinforcement learning but was evaluated only on top-3 generated molecules, making it difficult to assess overall distribution quality. Prior atom-level graph generation models like Li et al. (2018a) were restricted to molecules with at most 20 heavy atoms, limiting practical applicability.</p>
<h2 id="core-innovation-extending-graphrnn-with-chemical-constraints-and-rl">Core Innovation: Extending GraphRNN with Chemical Constraints and RL</h2>
<p>MolecularRNN builds on the GraphRNN architecture by introducing atom type predictions alongside edge type predictions. The model generates molecular graphs sequentially: at each step, a NodeRNN predicts the type of the next atom, then an EdgeRNN predicts bond types to all preceding atoms within a BFS-ordered window.</p>
<h3 id="autoregressive-graph-generation">Autoregressive Graph Generation</h3>
<p>The joint likelihood over atom types $C^{\pi}$ and adjacency vectors $S^{\pi}$ under BFS ordering $\pi$ is factorized as:</p>
<p>$$
p\left(S^{\pi}, C^{\pi}\right) = \prod_{i=1}^{n+1} p\left(C_{i}^{\pi} \mid S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right) p\left(S_{i}^{\pi} \mid C_{i}^{\pi}, S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right)
$$</p>
<p>NodeRNN processes embeddings of previous atom types and adjacency vectors to produce a hidden state, from which a two-layer MLP with softmax predicts the next atom type $\psi_{i}$:</p>
<p>$$
h_{i}^{\text{node}} = \text{NodeRNN}\left(h_{i-1}^{\text{node}}, \left[\text{emb}(S_{i-1}^{\pi}), \text{emb}(C_{i-1}^{\pi})\right]\right)
$$</p>
<p>$$
\psi_{i} = \text{NodeMLP}\left(h_{i}^{\text{node}}\right)
$$</p>
<p>EdgeRNN then unrolls across preceding atoms to predict bond types $\phi_{i,j}$, initialized with the NodeRNN hidden state:</p>
<p>$$
h_{i,j}^{\text{edge}} = \text{EdgeRNN}\left(h_{i,j-1}^{\text{edge}}, \text{emb}(S_{i,j-1}^{\pi})\right), \quad h_{i,0}^{\text{edge}} = h_{i}^{\text{node}}
$$</p>
<p>$$
\phi_{i,j} = \text{EdgeMLP}\left(h_{i,j}^{\text{edge}}\right)
$$</p>
<p>Bond types are categorical over {no bond, single, double, triple}, and molecules are represented in kekulized form. BFS ordering limits the EdgeRNN window to $M = 12$ preceding atoms.</p>
<h3 id="valency-based-rejection-sampling">Valency-Based Rejection Sampling</h3>
<p>During inference, each proposed bond of order $k$ between atoms $i$ and $j$ is accepted only if both atoms remain within their allowed valencies:</p>
<p>$$
\sum_{j} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{i}^{\pi}} \quad \text{and} \quad \sum_{i} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{j}^{\pi}}
$$</p>
<p>Atoms that do not fill their valencies are complemented with hydrogens. This constraint can be enforced directly on graphs (unlike SMILES, where intermediate substrings are not chemically meaningful), yielding 100% valid molecules.</p>
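<p>The acceptance test can be sketched directly on an adjacency matrix whose entries are bond orders. The valence table below is a small illustrative subset, not the model's full nine-atom alphabet:</p>

```python
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}  # illustrative subset of the atom types

def bond_allowed(adj, atom_types, i, j, order):
    """Accept a proposed bond of `order` between atoms i and j only if
    neither atom would exceed its allowed valency."""
    return (sum(adj[i]) + order <= MAX_VALENCE[atom_types[i]] and
            sum(adj[j]) + order <= MAX_VALENCE[atom_types[j]])

# Atom 0 is a carbon already holding two single bonds (entries are bond orders):
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
atoms = ["C", "O", "C"]
single_ok = bond_allowed(adj, atoms, 0, 1, 1)  # True: C at 3/4, O at 2/2
triple_ok = bond_allowed(adj, atoms, 0, 1, 3)  # False: both atoms would overflow
```

<p>Rejected bonds are simply resampled, which is what guarantees the 100% validity figures reported below.</p>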
<h3 id="property-optimization-via-policy-gradient">Property Optimization via Policy Gradient</h3>
<p>For property optimization, MolecularRNN is formulated as a policy network in a Markov Decision Process. The loss function uses REINFORCE with a discounted final reward:</p>
<p>$$
L(\theta) = -\sum_{i=1}^{N} r(s_{N}) \cdot \gamma^{i} \cdot \log p(s_{i} \mid s_{i-1}; \theta)
$$</p>
<p>where $r(s_{N})$ is the reward from a property critic and $\gamma$ is a discount factor. The authors also introduce a structural penalty during RL training that assigns a penalty of $-10$ to atoms violating valency constraints, providing a learning signal from invalid intermediate molecules.</p>
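<p>A minimal sketch of this loss for one generated sequence, assuming per-step log-probabilities have already been collected (in a real implementation these would be differentiable tensors, not floats):</p>

```python
import math

def reinforce_loss(step_logps, final_reward, gamma=0.97):
    """L(theta) = -sum_i r(s_N) * gamma**i * log p(s_i | s_{i-1}; theta).

    `final_reward` is the critic's score of the finished molecule; during RL
    training a structural penalty (e.g. -10) is substituted when the finished
    molecule violates valency constraints.
    """
    return -sum(final_reward * gamma ** i * lp
                for i, lp in enumerate(step_logps, start=1))

good = reinforce_loss([math.log(0.5)] * 3, final_reward=1.0)   # positive loss
bad = reinforce_loss([math.log(0.5)] * 3, final_reward=-10.0)  # sign flips
```

<p>Minimizing the loss raises the log-probability of steps that led to a high reward and lowers it for penalized trajectories.</p>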
<h2 id="experimental-setup-pretraining-and-property-optimization">Experimental Setup: Pretraining and Property Optimization</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MolecularRNN is pretrained on three datasets: ChEMBL (~1.5M bioactive molecules), <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC 250k</a> (250K randomly selected commercially available compounds), and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> (~1.9M drug-like molecules from ZINC). The model considers 9 atom types (C, N, O, F, P, S, Cl, Br, I), 3 bond types (single, double, triple), and molecules with 10-50 heavy atoms. Architecture: NodeRNN with 4 GRU layers (hidden size 256), EdgeRNN with 4 GRU layers (hidden size 128), node embedding size 128, edge embedding size 16. Training uses Adam with learning rate 0.001 and multiplicative decay on 4 GPUs with batch size 512 per GPU for 250 epochs.</p>
<h3 id="generation-quality-at-scale">Generation Quality at Scale</h3>
<p>The pretrained model generates 1 million molecules per dataset (far larger than prior work: JT-VAE used 5K samples, Li et al. used 100K). Results with valency-based rejection sampling:</p>
<table>
  <thead>
      <tr>
          <th>Training Set</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>IntDiv (p=1)</th>
          <th>IntDiv (p=2)</th>
          <th>SA Score</th>
          <th>QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>100%</td>
          <td>99.2%</td>
          <td>99.3%</td>
          <td>0.895</td>
          <td>0.890</td>
          <td>3.67 +/- 1.20</td>
          <td>0.56 +/- 0.20</td>
      </tr>
      <tr>
          <td>ZINC 250k</td>
          <td>100%</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>0.892</td>
          <td>0.887</td>
          <td>3.60 +/- 1.01</td>
          <td>0.68 +/- 0.16</td>
      </tr>
      <tr>
          <td>MOSES</td>
          <td>100%</td>
          <td>99.4%</td>
          <td>100%</td>
          <td>0.881</td>
          <td>0.876</td>
          <td>3.24 +/- 0.97</td>
          <td>0.74 +/- 0.14</td>
      </tr>
  </tbody>
</table>
<p>Comparison with baselines on ZINC 250k (30K samples):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>SA Score</th>
          <th>QED</th>
          <th>IntDiv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>100%</td>
          <td>3.37</td>
          <td>0.76</td>
          <td>0.85</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>100%</td>
          <td>99.97%</td>
          <td>100%</td>
          <td>4.62</td>
          <td>0.61</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>100%</td>
          <td>99.89%</td>
          <td>100%</td>
          <td>3.59</td>
          <td>0.68</td>
          <td>0.89</td>
      </tr>
  </tbody>
</table>
<p>GCPN generates overly complex molecules (high SA score of 4.62), while MolecularRNN produces more realistic structures with higher internal diversity than JT-VAE.</p>
<h3 id="property-optimization-results">Property Optimization Results</h3>
<p>Policy gradient optimization is run for 300 iterations with batch size 512 and constant learning rate $10^{-5}$, discount factor $\gamma = 0.97$. Top-3 scores for penalized logP and QED:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>logP 1st</th>
          <th>logP 2nd</th>
          <th>logP 3rd</th>
          <th>QED 1st</th>
          <th>QED 2nd</th>
          <th>QED 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></td>
          <td>3.63</td>
          <td>3.49</td>
          <td>3.44</td>
          <td>0.896</td>
          <td>0.824</td>
          <td>0.820</td>
      </tr>
      <tr>
          <td>JT-VAE</td>
          <td>5.30</td>
          <td>4.93</td>
          <td>4.49</td>
          <td>0.925</td>
          <td>0.911</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>7.98</td>
          <td>7.85</td>
          <td>7.80</td>
          <td>0.948</td>
          <td>0.947</td>
          <td>0.946</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>10.34</td>
          <td>10.19</td>
          <td>10.14</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.947</td>
      </tr>
  </tbody>
</table>
<p>MolecularRNN achieves the highest penalized logP scores (10.34 vs. GCPN&rsquo;s 7.98) while matching GCPN on QED. The authors also demonstrate melting temperature optimization using a GCN-based property predictor as the critic (RMSE 39.5 degrees C), showing that the RL framework generalizes to properties that cannot be computed directly from molecular graphs.</p>
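<p>The policy-gradient setup above uses a discount factor of $\gamma = 0.97$. As a minimal sketch of what that means in practice (illustrative rewards and names, not the authors' code), here is how a terminal reward on a completed molecule is propagated back to earlier generation steps:</p>

```python
# Sketch: discounted returns in a REINFORCE-style setup with gamma = 0.97,
# matching the hyperparameter above. Rewards and names are illustrative.

def discounted_returns(rewards, gamma=0.97):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for each step t."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Terminal-reward episode: the property score arrives only at the final step,
# and earlier steps receive geometrically discounted credit.
print(discounted_returns([0.0, 0.0, 1.0]))
```

<p>With $\gamma$ close to 1, early generation steps still receive most of the credit for a high-scoring final molecule.</p>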
<h2 id="distribution-level-evaluation-and-learned-chemical-patterns">Distribution-Level Evaluation and Learned Chemical Patterns</h2>
<p>The authors emphasize that reporting only top-3 scores is not informative, and they compare full property distributions. MolecularRNN shifts the QED distribution further toward maximum values compared to GCPN. They also note that during melting temperature optimization, the model rediscovered two chemical phenomena: fusing aromatic rings increases melting point, and the presence of polar groups (C=O, OH, NH2, heterocyclic nitrogens) enhances dipole-dipole interactions and raises melting temperature.</p>
<p>Without valency-based rejection sampling, the pretrained model achieves 65% validity. After structural penalty training (assigning -10 to valency-violating atoms and optimizing with policy gradient), validity increases to 90%. Enabling rejection sampling then achieves 100%.</p>
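<p>The valency-based rejection sampling step can be sketched as masking and resampling bond types whose order would exceed an atom's remaining valence budget. The valence table and sampler below are illustrative stand-ins, not the paper's implementation:</p>

```python
# Sketch of valency-based rejection sampling: bond types that would violate
# an atom's maximum valence are masked out and the remaining probabilities
# are renormalized. All names and the valence table are illustrative.
import random

MAX_VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"none": 0, "single": 1, "double": 2, "triple": 3}

def sample_bond(probs, used_valence, atom, rng):
    """Sample a bond type, rejecting any that would exceed the valence limit."""
    budget = MAX_VALENCE[atom] - used_valence
    allowed = {b: p for b, p in probs.items() if BOND_ORDER[b] <= budget}
    total = sum(allowed.values())
    # Renormalize the surviving probabilities and draw once.
    r, acc = rng.random() * total, 0.0
    for bond, p in allowed.items():
        acc += p
        if r <= acc:
            return bond
    return "none"

rng = random.Random(0)
# An oxygen with one bond already used can only accept "none" or "single".
bond = sample_bond({"none": 0.1, "single": 0.3, "double": 0.6}, 1, "O", rng)
assert bond in ("none", "single")
```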
<p>Several limitations are worth noting. The BFS ordering introduces an arbitrary sequencing over equivalent graph traversals (the node order permutation problem is not addressed). The evaluation uses top-3 scores for property optimization, though the authors do advocate for distributional evaluation. The molecule size is capped at 50 heavy atoms. The paper does not report training time or wall-clock generation speed. Future directions mentioned include multi-objective property optimization and scaffold completion (graph completion from a given core structure).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>~1.5M molecules</td>
          <td>Bioactive molecules with experimental measurements</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC 250k</td>
          <td>250K molecules</td>
          <td>Random subset of ZINC database</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>MOSES</td>
          <td>~1.9M molecules</td>
          <td>Drug-like subset of ZINC</td>
      </tr>
      <tr>
          <td>Melting point critic</td>
          <td>Custom split</td>
          <td>37,940 train / 9,458 test</td>
          <td>Melting temperatures from -196 to 517 degrees C</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Maximum likelihood with Adam optimizer, learning rate 0.001 with multiplicative decay to $10^{-5}$, 250 epochs</li>
<li><strong>Structural penalty</strong>: Policy gradient with -10 penalty per valency-violating atom</li>
<li><strong>Property optimization</strong>: REINFORCE (policy gradient), 300 iterations, batch size 512, learning rate $10^{-5}$, discount factor $\gamma = 0.97$</li>
<li><strong>Melting point critic</strong>: GCN regression (4 layers, hidden size 128), Adam with learning rate 0.001, exponential decay $\gamma = 0.8$, 30 epochs, batch size 32</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>NodeRNN</strong>: 4 GRU layers, hidden size 256, node embedding 128</li>
<li><strong>EdgeRNN</strong>: 4 GRU layers, hidden size 128, edge embedding 16</li>
<li><strong>NodeMLP/EdgeMLP</strong>: 2-layer MLP with 128 hidden units, ReLU activation, softmax output</li>
<li><strong>BFS window</strong>: $M = 12$ preceding atoms</li>
<li><strong>Atom types</strong>: 9 (C, N, O, F, P, S, Cl, Br, I)</li>
<li><strong>Bond types</strong>: 3 (single, double, triple) + no bond</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>% chemically valid molecules (RDKit)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>% unique in generated pool (up to 1M)</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>% not in training set</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Average pairwise Tanimoto distance</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility (2-4 optimal range)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Drug-likeness score (0-1)</td>
      </tr>
      <tr>
          <td>Penalized logP</td>
          <td>Lipophilicity with ring and SA penalties</td>
      </tr>
  </tbody>
</table>
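<p>The uniqueness and novelty metrics in the table reduce to simple set operations over canonical SMILES strings. A minimal sketch (real pipelines canonicalize with RDKit first; the toy strings here are assumed already canonical):</p>

```python
# Sketch of the uniqueness and novelty metrics over canonical SMILES.
# Toy inputs are illustrative, not from the paper.

def uniqueness(generated):
    """Fraction of generated molecules that are distinct."""
    return len(set(generated)) / len(generated)

def novelty(generated, training_set):
    """Fraction of unique generated molecules absent from the training set."""
    unique = set(generated)
    return len(unique - set(training_set)) / len(unique)

gen = ["CCO", "CCO", "CCN", "c1ccccc1"]
train = {"CCO"}
print(uniqueness(gen))      # 3 distinct out of 4 generated
print(novelty(gen, train))  # 2 of the 3 unique molecules are not in training
```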
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 GPUs (NVIDIA, specific model not stated)</li>
<li>Per-GPU batch size of 512 for pretraining</li>
<li>Training time not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Popova, M., Shvets, M., Oliva, J., &amp; Isayev, O. (2019). MolecularRNN: Generating realistic molecular graphs with optimized properties. <em>arXiv preprint arXiv:1905.13372</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{popova2019molecularrnn,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolecularRNN: Generating realistic molecular graphs with optimized properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Popova, Mariya and Shvets, Mykhailo and Oliva, Junier and Isayev, Olexandr}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1905.13372}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Memory-Assisted RL for Diverse De Novo Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</guid><description>A memory unit for REINVENT-based RL that tracks generated scaffolds and penalizes repeated solutions, increasing molecular diversity up to fourfold.</description><content:encoded><![CDATA[<h2 id="a-memory-module-for-diverse-molecular-generation-via-rl">A Memory Module for Diverse Molecular Generation via RL</h2>
<p>This is a <strong>Method</strong> paper that introduces a memory unit for reinforcement learning (RL)-based molecular generation. The primary contribution is a hash-table-based memory mechanism that integrates into the REINVENT framework&rsquo;s scoring function. By tracking previously generated high-scoring molecules and penalizing the reward when new molecules are too similar to those already stored, the memory unit forces the generative model to explore different regions of chemical space rather than collapsing onto a single scaffold family.</p>
<h2 id="policy-collapse-limits-rl-based-de-novo-design">Policy Collapse Limits RL-Based De Novo Design</h2>
<p>Recurrent neural networks (RNNs) trained with reinforcement learning can generate novel molecules optimized for desired properties. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> algorithm and related approaches (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, GENTRL) demonstrated the viability of coupling a pretrained SMILES-based generative model with a scoring function via RL. However, a persistent problem is <strong>policy collapse</strong> (also called mode collapse): once the model discovers a high-scoring region of chemical space, it continues to exploit that region, producing structurally similar compounds with minor substitution differences. This severely limits the practical utility of RL-based generation in drug design, where medicinal chemists need diverse scaffolds to explore structure-activity relationships and manage intellectual property concerns.</p>
<p>Prior work by Liu et al. [31] attempted to address this by engineering an explorative RNN alongside the standard generative RNN, but it did not substantially increase diversity compared to standard REINVENT. Other approaches like Generative Examination Networks (GEN) performed statistical analysis during training but were not evaluated in optimization scenarios.</p>
<h2 id="core-innovation-hash-table-memory-unit-for-reward-modification">Core Innovation: Hash-Table Memory Unit for Reward Modification</h2>
<p>The key insight is to dynamically modify the reward surface during RL by maintaining a memory of previously explored chemical space. The memory unit is a hash table of index-bucket pairs. Each bucket stores up to a fixed number of high-scoring molecules (default: 25) that are chemically similar to a seed molecule (the index).</p>
<h3 id="integration-with-reinvent">Integration with REINVENT</h3>
<p>The memory unit modifies the augmented likelihood used in REINVENT. For a generated compound $c$, the augmented log-likelihood becomes:</p>
<p>$$
\log P(c)_{Aug} = \log P(c)_{PriorNetwork} + \sigma \times S(c) \times M(c)
$$</p>
<p>where $\sigma$ is a scalar coefficient, $S(c)$ is the scoring function output, and $M(c)$ is the memory unit output (either 0 or 1). The reward is:</p>
<p>$$
R(c) = \left(\log P(c)_{Aug} - \log P(c)_{AgentNetwork}\right)^2
$$</p>
<p>and the loss is $\text{loss} = -R(c)$.</p>
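<p>A small numeric sketch of these two equations shows how a full bucket mutes the score term. The values and the $\sigma = 60$ coefficient are illustrative assumptions, not numbers from the paper:</p>

```python
# Numeric sketch of the memory-modified augmented likelihood and reward.
# All values, including sigma = 60, are hypothetical.

def augmented_loglik(logp_prior, score, memory_out, sigma=60.0):
    """log P(c)_Aug = log P(c)_Prior + sigma * S(c) * M(c)."""
    return logp_prior + sigma * score * memory_out

def reward(logp_aug, logp_agent):
    """R(c) = (log P(c)_Aug - log P(c)_Agent)^2."""
    return (logp_aug - logp_agent) ** 2

logp_prior, logp_agent, score = -30.0, -25.0, 0.5
full = reward(augmented_loglik(logp_prior, score, 1.0), logp_agent)   # bucket open: M(c) = 1
muted = reward(augmented_loglik(logp_prior, score, 0.0), logp_agent)  # bucket full: M(c) = 0
print(full, muted)  # 625.0 25.0
```

<p>Setting $M(c) = 0$ removes the score contribution entirely, leaving only the prior term, which is what discourages the agent from revisiting an exhausted region.</p>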
<h3 id="memory-unit-operation">Memory Unit Operation</h3>
<p>When a high-scoring molecule is generated:</p>
<ol>
<li>Its fingerprint or scaffold is compared against all index structures in the memory</li>
<li>If it is similar to an index (above a Tanimoto cutoff, default 0.6) and the corresponding bucket is not full, $M(c) = 1$ and the molecule is added to the bucket</li>
<li>If the bucket is full, $M(c) = 0$, effectively zeroing the reward contribution and discouraging the model from generating similar molecules</li>
<li>If no similar index exists, a new index-bucket pair is created</li>
</ol>
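<p>The four steps above can be sketched as a small class. Tanimoto similarity on binary fingerprints is Jaccard similarity on the sets of on-bits, so plain Python sets stand in for ECFP fingerprints here; this is an illustrative sketch, not the authors' implementation:</p>

```python
# Sketch of the memory unit's index/bucket logic (steps 1-4 above).
# Sets of "on bits" stand in for real fingerprints; Tanimoto on binary
# fingerprints is exactly Jaccard similarity on these sets.

class MemoryUnit:
    def __init__(self, bucket_size=25, cutoff=0.6):
        self.bucket_size = bucket_size
        self.cutoff = cutoff
        self.buckets = {}  # index fingerprint (frozenset) -> list of members

    @staticmethod
    def tanimoto(a, b):
        return len(a & b) / len(a | b)

    def __call__(self, fp):
        """Return M(c) in {0, 1} and record the molecule if accepted."""
        for index, bucket in self.buckets.items():
            if self.tanimoto(index, fp) >= self.cutoff:
                if len(bucket) < self.bucket_size:
                    bucket.append(fp)
                    return 1  # similar to an index, bucket has room
                return 0      # similar, bucket full: zero the reward term
        self.buckets[frozenset(fp)] = [fp]  # novel region: new index-bucket pair
        return 1

mem = MemoryUnit(bucket_size=2)
fp = {1, 2, 3, 4}
assert [mem(fp), mem(fp), mem(fp)] == [1, 1, 0]  # third near-duplicate is penalized
```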
<h3 id="four-similarity-criteria">Four Similarity Criteria</h3>
<p>The authors evaluate four criteria for grouping molecules in the memory:</p>
<ol>
<li><strong>Compound similarity</strong>: ECFP4 Tanimoto similarity at the whole-molecule level</li>
<li><strong>Identical Bemis-Murcko (BM) scaffold</strong>: exact match of Bemis-Murcko frameworks</li>
<li><strong>Identical carbon skeleton</strong>: exact match of carbon skeletons (BM scaffolds with all heteroatoms replaced by carbon and bonds set to single)</li>
<li><strong>Scaffold similarity</strong>: atom pair fingerprint Tanimoto similarity between carbon skeletons (fuzzy matching)</li>
</ol>
<h3 id="alternative-output-modes">Alternative Output Modes</h3>
<p>Beyond the binary output ($M(c) \in {0, 1}$), the authors also explored smooth output functions. The linear mode:</p>
<p>$$
M(c) = 1 - \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>And the sigmoid mode:</p>
<p>$$
M(c) = 1 - \frac{1}{1 + \exp\left(-\frac{2f - 1}{0.15}\right)}, \qquad f = \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>Both smooth modes yielded slightly fewer analogs than the binary mode and were not pursued further.</p>
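<p>The three output modes, side by side, as functions of the bucket fill fraction $f$ (function names are illustrative):</p>

```python
# The binary, linear, and sigmoid memory output modes as functions of the
# bucket fill fraction f = compounds in bucket / bucket size.
import math

def m_binary(f):
    return 0.0 if f >= 1.0 else 1.0

def m_linear(f):
    return 1.0 - f

def m_sigmoid(f):
    return 1.0 - 1.0 / (1.0 + math.exp(-((2.0 * f - 1.0) / 0.15)))

for f in (0.0, 0.5, 1.0):
    print(f, m_binary(f), m_linear(f), round(m_sigmoid(f), 3))
# The sigmoid fades from ~1 to ~0 around a half-full bucket (f = 0.5),
# penalizing repeated scaffolds earlier than the binary mode does.
```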
<h2 id="experimental-setup-logp-optimization-and-target-activity-prediction">Experimental Setup: LogP Optimization and Target Activity Prediction</h2>
<h3 id="case-study-1-logp-optimization">Case Study 1: LogP Optimization</h3>
<p>As a proof of concept, the authors optimized LogP values for known DRD2 inhibitors. Starting from 487 DRD2 compounds with LogP &gt;= 5 (from ExCAPE-DB), they applied transfer learning to the prior model for 20 epochs, then ran RL for 150 iterations (100 compounds per iteration, 15,000 total). The scoring function was:</p>
<p>$$
S = 1 - \tanh\left(\min\left(|2 - \text{AlogP}|, |3 - \text{AlogP}|\right)\right)
$$</p>
<p>targeting LogP values between 2.0 and 3.0.</p>
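<p>The scoring function above, implemented as written (score 1.0 when AlogP sits exactly at either window edge, decaying with tanh of the distance to the nearer edge):</p>

```python
# The LogP scoring function S = 1 - tanh(min(|2 - AlogP|, |3 - AlogP|)),
# transcribed directly from the formula above.
import math

def logp_score(alogp):
    return 1.0 - math.tanh(min(abs(2.0 - alogp), abs(3.0 - alogp)))

print(logp_score(2.0))            # maximal at a window edge
print(round(logp_score(5.0), 3))  # far outside [2, 3]: strongly penalized
```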
<h3 id="case-study-2-htr1a-and-drd2-activity-prediction">Case Study 2: HTR1A and DRD2 Activity Prediction</h3>
<p>For a more complex scenario, the authors trained SVM classifiers (with <a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt scaling</a> for probabilistic output) on bioactivity data from ExCAPE-DB to predict activity against two neurotransmitter receptors:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/5-HT1A_receptor">HTR1A</a></strong>: 3,599 actives (pIC50 &gt;= 7) and 66,684 inactives</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a></strong>: 2,981 actives (pIC50 &gt;= 7) and 346,206 inactives (100,000 sampled)</li>
</ul>
<p>Data was split using Butina clustering on ECFP6 at a 0.4 Tanimoto cutoff (60/20/20 train/val/test). The SVM models achieved excellent performance:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Set</th>
          <th>Balanced Accuracy</th>
          <th>ROC AUC</th>
          <th>F1</th>
          <th>MCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>Test</td>
          <td>0.96</td>
          <td>0.99</td>
          <td>0.75</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Test</td>
          <td>0.95</td>
          <td>0.99</td>
          <td>0.71</td>
          <td>0.72</td>
      </tr>
  </tbody>
</table>
<p>RL was run for 300 iterations (100 compounds each, 30,000 total). Compounds with predicted activity &gt;= 0.7 were considered active.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN prior model followed the REINVENT architecture: an embedding layer, three GRU layers with 256 dimensions, and a linear output layer. It was pretrained on ~1.5 million ChEMBL 25 compounds (filtered to remove known HTR1A actives and DRD2 analogs) for 10 epochs using Adam with a learning rate of 0.01.</p>
<h3 id="comparisons">Comparisons</h3>
<p>The authors compared memory-assisted RL against:</p>
<ul>
<li>Standard REINVENT RL (no memory)</li>
<li>Experience replay (re-presenting 8 high-scoring compounds per iteration)</li>
<li>Temperature scaling (values from 1.0 to 10.0)</li>
<li>Memory + experience replay combined</li>
</ul>
<h2 id="results-up-to-fourfold-increase-in-diverse-active-compounds">Results: Up to Fourfold Increase in Diverse Active Compounds</h2>
<h3 id="logp-optimization-results">LogP Optimization Results</h3>
<p>Memory-assisted RL increased the number of optimized compounds (LogP 2-3) by roughly threefold:</p>
<table>
  <thead>
      <tr>
          <th>Memory Type</th>
          <th>Optimized Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No memory</td>
          <td>938</td>
          <td>727</td>
          <td>396</td>
      </tr>
      <tr>
          <td>Compound similarity</td>
          <td>3,451</td>
          <td>2,963</td>
          <td>1,472</td>
      </tr>
      <tr>
          <td>Identical BM Scaffold</td>
          <td>3,428</td>
          <td>2,865</td>
          <td>1,398</td>
      </tr>
      <tr>
          <td>Identical Carbon Skeleton</td>
          <td>3,315</td>
          <td>3,002</td>
          <td>1,799</td>
      </tr>
      <tr>
          <td>Scaffold Similarity</td>
          <td>3,591</td>
          <td>3,056</td>
          <td>1,538</td>
      </tr>
  </tbody>
</table>
<p>The memory unit also increased the generation of relevant analogs. ECFP6 analogs (Tanimoto &gt;= 0.4 to training set) increased from 145 to up to 549, and shared MMP cores increased from 5 to up to 19, confirming that the memory unit promoted exploration of chemically relevant space rather than random drift.</p>
<h3 id="htr1a-and-drd2-activity-optimization-results">HTR1A and DRD2 Activity Optimization Results</h3>
<p>The improvements were even more pronounced for target activity optimization:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Memory Type</th>
          <th>Active Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>No memory</td>
          <td>9,323</td>
          <td>7,312</td>
          <td>5,446</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Compound similarity</td>
          <td>16,779</td>
          <td>13,304</td>
          <td>9,887</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Identical Carbon Skeleton</td>
          <td>17,597</td>
          <td>15,531</td>
          <td>12,408</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>No memory</td>
          <td>5,143</td>
          <td>2,635</td>
          <td>1,949</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Compound similarity</td>
          <td>21,486</td>
          <td>17,844</td>
          <td>12,749</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Scaffold Similarity</td>
          <td>22,784</td>
          <td>20,712</td>
          <td>16,434</td>
      </tr>
  </tbody>
</table>
<p>For DRD2, the effect was particularly striking: standard RL showed clear policy collapse with only 576 ECFP6 analogs to the training set, while memory-assisted RL generated up to 6,315. The compound similarity memory unit produced the most MMP analogs (217 to the training set vs. 7 without memory).</p>
<h3 id="parameter-sensitivity">Parameter Sensitivity</h3>
<p>Bucket size had a modest effect: larger buckets (allowing more compounds before penalization) slightly increased analog generation. The Tanimoto similarity threshold of 0.6 was near-optimal for the scaffold similarity memory; higher thresholds reduced diversity gains. The compound similarity memory showed increasing analogs with higher thresholds, but BM scaffold and carbon skeleton counts plateaued above 0.6.</p>
<h3 id="comparison-with-experience-replay-and-temperature-scaling">Comparison with Experience Replay and Temperature Scaling</h3>
<ul>
<li><strong>Experience replay alone</strong> increased diversity compared to vanilla RL but was less effective than the memory unit alone</li>
<li><strong>Memory + experience replay</strong> achieved the best results overall, as experience replay provided the model with diverse starting points for exploration after the memory unit altered the reward landscape</li>
<li><strong>Temperature scaling</strong> was largely ineffective: only a value of 1.25 showed improvement, and even then it achieved only about 50% of the analogs generated by memory-assisted RL. Temperatures above 2.0 degraded SMILES validity, and above 4.0 prevented valid molecule generation entirely</li>
</ul>
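<p>Why high temperatures destroy SMILES validity is easy to see from the sampling distribution itself. A sketch of temperature-scaled softmax over next-token logits (toy logits, illustrative names):</p>

```python
# Sketch of temperature scaling on next-token logits: dividing by T > 1
# flattens the distribution, moving probability mass onto unlikely tokens,
# which is why very high temperatures yield syntactically invalid SMILES.
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 4.0)
assert max(sharp) > max(flat)  # higher T spreads mass onto rare tokens
```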
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>All evaluations are retrospective; no synthesized compounds were experimentally tested</li>
<li>The SVM activity models, while accurate, may have applicability domain limitations for highly novel scaffolds</li>
<li>The binary memory output mode was found to work best, but the transition from exploration to exploitation is abrupt</li>
<li>The method was only tested with two biological targets and one physicochemical property</li>
<li>Computational overhead of the memory unit is not discussed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior model training</td>
          <td>ChEMBL 25</td>
          <td>~1.5M compounds</td>
          <td>Filtered: max 50 heavy atoms, no stereochemistry, removed HTR1A actives and DRD2 analogs</td>
      </tr>
      <tr>
          <td>HTR1A activity data</td>
          <td>ExCAPE-DB</td>
          <td>3,599 actives + 66,684 inactives</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
      <tr>
          <td>DRD2 activity data</td>
          <td>ExCAPE-DB</td>
          <td>2,981 actives + 100,000 inactives (sampled)</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Generative model</strong>: RNN with embedding + 3 GRU layers (256 dim) + linear output (REINVENT architecture)</li>
<li><strong>RL</strong>: Augmented likelihood formulation with sigma scaling coefficient</li>
<li><strong>SVM classifiers</strong>: Non-linear SVM with MinMax kernel, Platt scaling, ECFP6 count-based fingerprints (2048 dim)</li>
<li><strong>Butina clustering</strong>: ECFP6 Tanimoto cutoff 0.4 for train/val/test splitting</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unique compounds</td>
          <td>Number of distinct valid SMILES generated</td>
      </tr>
      <tr>
          <td>Unique BM scaffolds</td>
          <td>Bemis-Murcko framework diversity</td>
      </tr>
      <tr>
          <td>Unique carbon skeletons</td>
          <td>Carbon skeleton diversity (stripped BM scaffolds)</td>
      </tr>
      <tr>
          <td>ECFP6 analogs</td>
          <td>Compounds with Tanimoto &gt;= 0.4 to known actives</td>
      </tr>
      <tr>
          <td>MMP analogs</td>
          <td>Matched molecular pair relationships with known actives</td>
      </tr>
      <tr>
          <td>Shared MMP cores</td>
          <td>Scaffold cores shared between generated and known compounds</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tblaschke/reinvent-memory">reinvent-memory</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with prepared datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blaschke, T., Engkvist, O., Bajorath, J., &amp; Chen, H. (2020). Memory-assisted reinforcement learning for diverse molecular de novo design. <em>Journal of Cheminformatics</em>, 12, 68. <a href="https://doi.org/10.1186/s13321-020-00473-0">https://doi.org/10.1186/s13321-020-00473-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blaschke2020memory,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Memory-assisted reinforcement learning for diverse molecular de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blaschke, Thomas and Engkvist, Ola and Bajorath, J{\&#34;u}rgen and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00473-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LatentGAN: Latent-Space GAN for Molecular Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</guid><description>LatentGAN combines a SMILES heteroencoder with a Wasserstein GAN to generate novel drug-like molecules in latent space, avoiding SMILES syntax issues.</description><content:encoded><![CDATA[<h2 id="a-gan-operating-in-learned-latent-space-for-molecular-design">A GAN Operating in Learned Latent Space for Molecular Design</h2>
<p>LatentGAN is a <strong>Method</strong> paper that introduces a two-stage architecture for de novo molecular generation. The first stage trains a heteroencoder to map SMILES strings into a continuous latent vector space. The second stage trains a Wasserstein GAN with gradient penalty (WGAN-GP) to generate new latent vectors that, when decoded, produce valid and novel molecular structures. The key contribution is decoupling the GAN from direct SMILES string generation, allowing the adversarial training to focus on learning the distribution of molecular latent representations rather than character-level sequence generation.</p>
<h2 id="limitations-of-direct-smiles-generation-with-gans">Limitations of Direct SMILES Generation with GANs</h2>
<p>Prior GAN-based molecular generation methods such as ORGAN and ORGANIC operated directly on SMILES strings. This created a fundamental challenge: the generator had to simultaneously learn valid SMILES syntax and the distribution of chemically meaningful molecules. ORGAN struggled with optimizing discrete molecular properties like Lipinski&rsquo;s Rule of Five, while ORGANIC showed limited success beyond the QED drug-likeness score. Other approaches (RANC, ATNC) substituted more advanced recurrent architectures but still operated in the discrete SMILES space.</p>
<p>Meanwhile, variational autoencoders (VAEs) demonstrated that working in continuous latent space could enable molecular generation, but they relied on forcing the latent distribution to match a Gaussian prior through KL divergence. This assumption is not necessarily appropriate for chemical space, which is inherently discontinuous.</p>
<p>RNN-based methods with transfer learning offered an alternative for target-biased generation, but the authors hypothesized that combining GANs with learned latent representations could produce complementary chemical space coverage.</p>
<h2 id="heteroencoder-plus-wasserstein-gan-architecture">Heteroencoder Plus Wasserstein GAN Architecture</h2>
<p>The core innovation of LatentGAN is separating molecular representation learning from adversarial generation through a two-component pipeline.</p>
<h3 id="heteroencoder">Heteroencoder</h3>
<p>The heteroencoder is an autoencoder trained on pairs of different non-canonical (randomized) SMILES representations of the same molecule. This is distinct from a standard autoencoder because the input and target SMILES are different representations of the same structure.</p>
<p>The encoder uses a two-layer bidirectional LSTM with 512 units per layer (256 forward, 256 backward). The concatenated output feeds into a 512-dimensional feed-forward layer. During training, zero-centered Gaussian noise with $\sigma = 0.1$ is added to the latent vector as regularization. The decoder is a four-layer unidirectional LSTM with a softmax output layer. Batch normalization with momentum 0.9 is applied to all hidden layers except the noise layer.</p>
<p>Training uses teacher forcing with categorical cross-entropy loss for 100 epochs. The learning rate starts at $10^{-3}$ for the first 50 epochs and decays exponentially to $10^{-6}$ by the final epoch. After training, the noise layer is deactivated for deterministic encoding and decoding.</p>
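<p>One closed form consistent with that schedule (constant $10^{-3}$ for 50 epochs, then exponential decay reaching $10^{-6}$ at epoch 100) is geometric interpolation between the two endpoints. This is an assumption matching the stated endpoints, not the authors' exact implementation:</p>

```python
# Sketch of the decoder learning-rate schedule: constant 1e-3 for the first
# 50 epochs, then exponential decay to 1e-6 by epoch 100. The closed form
# is an assumption consistent with those endpoints.

def heteroencoder_lr(epoch, warm=50, total=100, lr0=1e-3, lr_final=1e-6):
    if epoch <= warm:
        return lr0
    frac = (epoch - warm) / (total - warm)  # 0 at epoch 50, 1 at epoch 100
    return lr0 * (lr_final / lr0) ** frac   # geometric interpolation

for e in (1, 50, 75, 100):
    print(e, heteroencoder_lr(e))
```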
<p>An important design choice is that the heteroencoder makes no assumption about the latent space distribution (unlike VAEs with their KL divergence term). The latent space is shaped purely by reconstruction loss, and the GAN later learns to sample from this unconstrained distribution.</p>
<h3 id="wasserstein-gan-with-gradient-penalty">Wasserstein GAN with Gradient Penalty</h3>
<p>The GAN uses the WGAN-GP formulation. The critic (discriminator) consists of three feed-forward layers of 256 dimensions each with leaky ReLU activations (no activation on the final layer). The generator has five feed-forward layers of 256 dimensions each with batch normalization and leaky ReLU between layers.</p>
<p>The training ratio is 5:1, with five critic updates for every generator update. The generator takes random vectors sampled from a uniform distribution and learns to produce latent vectors indistinguishable from the real encoded molecular latent vectors.</p>
<p>The WGAN-GP loss for the critic is:</p>
<p>$$L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$</p>
<p>where $\lambda$ is the gradient penalty coefficient, $\mathbb{P}_r$ is the real data distribution (encoded latent vectors), $\mathbb{P}_g$ is the generator distribution, and $\mathbb{P}_{\hat{x}}$ samples uniformly along straight lines between pairs of real and generated points.</p>
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>At inference time, the full pipeline operates as: (1) sample a random vector, (2) pass through the trained generator to produce a latent vector, (3) decode the latent vector into a SMILES string using the pretrained heteroencoder decoder.</p>
<h2 id="experiments-on-drug-like-and-target-biased-generation">Experiments on Drug-Like and Target-Biased Generation</h2>
<h3 id="datasets">Datasets</h3>
<p>The heteroencoder was trained on 1,347,173 SMILES from ChEMBL 25, standardized with MolVS and restricted to molecules with atoms from {H, C, N, O, S, Cl, Br} and at most 50 heavy atoms.</p>
<p>For general drug-like generation, a random subset of 100,000 ChEMBL compounds was used to train the GAN model for 30,000 epochs.</p>
<p>For target-biased generation, three datasets were extracted from ExCAPE-DB for EGFR, HTR1A, and S1PR1 targets. These were clustered into training and test sets to ensure chemical series were not split across sets.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Training Set</th>
          <th>Test Set</th>
          <th>SVM ROC-AUC</th>
          <th>SVM Kappa</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>2,949</td>
          <td>2,326</td>
          <td>0.850</td>
          <td>0.56</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>48,283</td>
          <td>23,048</td>
          <td>0.993</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>49,381</td>
          <td>23,745</td>
          <td>0.995</td>
          <td>0.91</td>
      </tr>
  </tbody>
</table>
<p>SVM target prediction models using 2048-bit FCFP6 fingerprints were built with scikit-learn to evaluate generated compounds.</p>
<h3 id="baselines">Baselines</h3>
<p>RNN-based generative models with transfer learning served as the primary baseline. A prior RNN model was trained on the same ChEMBL set, then fine-tuned on each target dataset. The LatentGAN was also benchmarked on the MOSES platform against VAE, JTN-VAE, and AAE architectures.</p>
<h3 id="heteroencoder-performance">Heteroencoder Performance</h3>
<p>The heteroencoder achieved 99% valid SMILES on the training set and 98% on the test set. Reconstruction error (decoding to a different molecule) was 18% on training and 20% on test. Notably, decoding to a different valid SMILES of the same molecule is not counted as an error.</p>
<h3 id="target-biased-generation-results">Target-Biased Generation Results</h3>
<p>From 50,000 sampled SMILES per target model:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Arch.</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>Active (%)</th>
          <th>Recovered Actives (%)</th>
          <th>Recovered Neighbors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>GAN</td>
          <td>86</td>
          <td>56</td>
          <td>97</td>
          <td>71</td>
          <td>5.26</td>
          <td>196</td>
      </tr>
      <tr>
          <td>EGFR</td>
          <td>RNN</td>
          <td>96</td>
          <td>46</td>
          <td>95</td>
          <td>65</td>
          <td>7.74</td>
          <td>238</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>GAN</td>
          <td>86</td>
          <td>66</td>
          <td>95</td>
          <td>71</td>
          <td>5.05</td>
          <td>284</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>RNN</td>
          <td>96</td>
          <td>50</td>
          <td>90</td>
          <td>81</td>
          <td>7.28</td>
          <td>384</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>GAN</td>
          <td>89</td>
          <td>31</td>
          <td>98</td>
          <td>44</td>
          <td>0.93</td>
          <td>24</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>RNN</td>
          <td>97</td>
          <td>35</td>
          <td>97</td>
          <td>65</td>
          <td>3.72</td>
          <td>43</td>
      </tr>
  </tbody>
</table>
<h3 id="moses-benchmark">MOSES Benchmark</h3>
<p>On the MOSES benchmark (trained on a ZINC subset of 1,584,663 compounds, sampled 30,000 SMILES), LatentGAN showed comparable or better results than JTN-VAE and AAE on Fréchet ChemNet Distance (FCD), Fragment similarity, and Scaffold similarity, while producing slightly worse nearest-neighbor cosine similarity (SNN). The standard VAE showed signs of mode collapse with high test metric overlap and low novelty.</p>
<h2 id="complementary-generation-and-drug-likeness-preservation">Complementary Generation and Drug-Likeness Preservation</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Validity and novelty</strong>: LatentGAN achieved 86-89% validity on target-biased tasks (lower than RNN&rsquo;s 96-97%) but produced higher uniqueness on two of three targets and comparable or higher novelty (95-98%).</p>
<p><strong>Complementary chemical space</strong>: The overlap between LatentGAN-generated and RNN-generated active compounds was very small at both compound and scaffold levels. A probabilistic analysis showed that the RNN model would be very unlikely to eventually cover the LatentGAN output space. This suggests the two architectures can work complementarily in de novo design campaigns.</p>
<p><strong>Drug-likeness</strong>: QED score distributions of LatentGAN-generated compounds closely matched training set distributions across all three targets, with training compounds showing only slightly higher drug-likeness. SA score distributions were similarly well-preserved.</p>
<p><strong>Chemical space coverage</strong>: PCA analysis using MQN fingerprints confirmed that generated compounds occupy most of the chemical space of the training sets. Some regions of the PCA plots contained compounds predicted as inactive, which corresponded to non-drug-like outliers in the training data.</p>
<p><strong>Novel scaffolds</strong>: About 14% of scaffolds in the sampled sets had similarity below 0.4 to the training set across all three targets, indicating LatentGAN can generate genuinely novel chemical scaffolds. Around 5% of generated compounds were identical to training set compounds, while 21-25% had Tanimoto similarity below 0.4.</p>
<h3 id="limitations">Limitations</h3>
<p>The paper acknowledges several limitations. The 18-20% heteroencoder reconstruction error means a non-trivial fraction of encoded molecules decode to different structures. Validity rates (86-89%) are lower than RNN baselines (96-97%). The S1PR1 target showed notably lower uniqueness (31%) and predicted activity (44%) compared to the other targets, possibly due to the smaller effective training set of active compounds. The paper does not report specific hardware requirements or training times. No wet-lab experimental validation of generated compounds was performed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors envision LatentGAN as a complementary tool to existing RNN-based generative models, with the two architectures covering different regions of chemical space. The approach of operating in learned latent space rather than directly on SMILES strings offers a general framework that could be extended to other molecular representations or generation objectives.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heteroencoder training</td>
          <td>ChEMBL 25 (subset)</td>
          <td>1,347,173 SMILES</td>
          <td>Standardized with MolVS; atoms restricted to H, C, N, O, S, Cl, Br; max 50 heavy atoms</td>
      </tr>
      <tr>
          <td>General GAN training</td>
          <td>ChEMBL 25 (random subset)</td>
          <td>100,000</td>
          <td>Subset of heteroencoder training set</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (EGFR)</td>
          <td>2,949 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (HTR1A)</td>
          <td>48,283 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (S1PR1)</td>
          <td>49,381 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>ZINC (MOSES subset)</td>
          <td>1,584,663</td>
          <td>Canonical SMILES</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Heteroencoder</strong>: Bidirectional LSTM encoder (2 layers, 512 units) + unidirectional LSTM decoder (4 layers), trained with teacher forcing and categorical cross-entropy for 100 epochs</li>
<li><strong>GAN</strong>: WGAN-GP with 5:1 critic-to-generator training ratio. General model trained 30,000 epochs; target models trained 10,000 epochs</li>
<li><strong>Evaluation</strong>: SVM classifiers with FCFP6 fingerprints (2048 bits) for activity prediction; MQN fingerprints for PCA-based chemical space analysis; Murcko scaffolds for scaffold-level analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Heteroencoder: 512-dim latent space, bidirectional LSTM encoder, unidirectional LSTM decoder</li>
<li>Generator: 5 feed-forward layers of 256 dims with batch norm and leaky ReLU</li>
<li>Critic: 3 feed-forward layers of 256 dims with leaky ReLU</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LatentGAN (EGFR)</th>
          <th>RNN Baseline (EGFR)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>86%</td>
          <td>96%</td>
          <td>Percent valid SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>56%</td>
          <td>46%</td>
          <td>Percent unique among valid</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>97%</td>
          <td>95%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Predicted active</td>
          <td>71%</td>
          <td>65%</td>
          <td>By SVM model</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Dierme/latent-gan">LatentGAN source code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Includes trained heteroencoder model and training sets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Prykhodko, O., Johansson, S.V., Kotsias, P.-C., Arús-Pous, J., Bjerrum, E.J., Engkvist, O., &amp; Chen, H. (2019). A de novo molecular generation method using latent vector based generative adversarial network. <em>Journal of Cheminformatics</em>, 11(1), 74. <a href="https://doi.org/10.1186/s13321-019-0397-9">https://doi.org/10.1186/s13321-019-0397-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{prykhodko2019latentgan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A de novo molecular generation method using latent vector based generative adversarial network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Prykhodko, Oleksii and Johansson, Simon Viet and Kotsias, Panagiotis-Christos and Ar{\&#39;u}s-Pous, Josep and Bjerrum, Esben Jannik and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{74}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0397-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v2: Pareto Multi-Objective RL for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</guid><description>DrugEx v2 extends RNN-based de novo drug design with Pareto ranking and evolutionary exploration for multi-objective molecule generation.</description><content:encoded><![CDATA[<h2 id="multi-objective-de-novo-drug-design-with-pareto-optimization">Multi-Objective De Novo Drug Design with Pareto Optimization</h2>
<p>This is a <strong>Method</strong> paper that extends the DrugEx framework (v1) to handle multi-objective optimization in de novo drug design. The primary contribution is integrating Pareto-based ranking with evolutionary algorithm concepts (crossover and mutation) into an RNN-based reinforcement learning pipeline. The system generates <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based molecules optimized simultaneously for activity toward multiple protein targets while avoiding off-targets, addressing polypharmacology scenarios where drugs must bind multiple specific receptors.</p>
<h2 id="polypharmacology-and-the-limits-of-single-objective-generation">Polypharmacology and the Limits of Single-Objective Generation</h2>
<p>Traditional drug discovery follows the &ldquo;one drug, one target, one disease&rdquo; paradigm, but drug molecules interact with an average of six protein targets. Off-target binding causes side effects that remain a leading cause of clinical failure and post-approval drug withdrawals (over 500 drugs withdrawn due to fatal toxicity). Complex diseases often require modulating multiple targets simultaneously, making polypharmacology an important design objective.</p>
<p>Prior deep learning approaches for de novo design, including DrugEx v1, focused on generating molecules active against a single target. Extending these methods to multiple objectives introduces fundamental challenges: objectives are often contradictory (high affinity for one target may correlate with high affinity for an undesired off-target), and naive weighted-sum approaches can collapse diversity by over-optimizing a single dominant objective. The authors specifically target the <a href="https://en.wikipedia.org/wiki/Adenosine_receptor">adenosine receptor</a> system, where $A_1AR$ and $A_{2A}AR$ selectivity profiles matter for therapeutic efficacy, and <a href="https://en.wikipedia.org/wiki/HERG">hERG</a> channel binding must be avoided to prevent cardiac toxicity.</p>
<h2 id="evolutionary-exploration-and-pareto-ranking-in-rl">Evolutionary Exploration and Pareto Ranking in RL</h2>
<p>The core innovation of DrugEx v2 has two components: an evolutionary exploration strategy and Pareto-based reward assignment.</p>
<h3 id="evolutionary-exploration-strategy">Evolutionary Exploration Strategy</h3>
<p>The generation process uses three RNN networks with identical LSTM architectures:</p>
<ul>
<li><strong>Agent net</strong> ($G_A$): the primary generator, updated at each training epoch via policy gradient</li>
<li><strong>Crossover net</strong> ($G_C$): initialized from the fine-tuned model, updated iteratively from $G_A$ after each convergence period</li>
<li><strong>Mutation net</strong> ($G_M$): initialized from the pre-trained model, parameters fixed throughout training</li>
</ul>
<p>At each token-generation step, a random number determines whether the token probability comes from the combination of $G_A$ and $G_C$ (with probability $1 - \varepsilon$) or from $G_M$ (with probability $\varepsilon$). This mirrors crossover and mutation operations from evolutionary algorithms, maintaining diversity while steering toward desired properties.</p>
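<p>The per-token mixing rule can be sketched as follows. How $G_A$ and $G_C$ are combined when the mutation net is not selected is an assumption here (simple averaging of the two distributions, which keeps the result normalized); the paper's exact combination rule may differ:</p>

```python
import random

def mixed_token_probs(p_agent, p_cross, p_mut, eps, rng):
    """With probability eps, take the mutation net's token distribution;
    otherwise blend the agent and crossover nets (averaged here, an
    assumption -- the average of two distributions still sums to 1)."""
    if rng.random() < eps:
        return list(p_mut)
    return [(a + c) / 2.0 for a, c in zip(p_agent, p_cross)]
```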
<h3 id="pareto-front-reward-scheme">Pareto Front Reward Scheme</h3>
<p>For $n$ objectives (three in this study: $A_1AR$, $A_{2A}AR$, hERG), each molecule receives a score $R_i$ based on its predicted bioactivity:</p>
<p>$$
R_{i} = \begin{cases} \text{minmax}(pX_{i}), &amp; \text{if high affinity required} \\ 1 - \text{minmax}(pX_{i}), &amp; \text{if low affinity required} \\ 0, &amp; \text{if SMILES invalid} \end{cases}
$$</p>
<p>where $pX_i$ is the predicted bioactivity (range 3.0 to 10.0), normalized to [0, 1].</p>
<p>For the multi-target case, high affinity is required for both $A_1AR$ and $A_{2A}AR$ while low affinity is required for hERG. For the target-specific case, high affinity is required only for $A_{2A}AR$ while low affinity is required for both $A_1AR$ and hERG.</p>
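<p>The per-objective score can be sketched directly from the definition; clamping $pX$ to the stated 3.0&ndash;10.0 range before normalizing is an assumption:</p>

```python
def objective_reward(pX, high_affinity, valid=True, lo=3.0, hi=10.0):
    """Minmax-normalize predicted bioactivity pX (range 3.0-10.0) to [0, 1];
    invert the score when *low* affinity is the goal (e.g. hERG)."""
    if not valid:
        return 0.0  # invalid SMILES score zero on every objective
    s = (min(max(pX, lo), hi) - lo) / (hi - lo)
    return s if high_affinity else 1.0 - s
```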
<p>Molecules are ranked using a <a href="https://en.wikipedia.org/wiki/Non-dominated_sorting_genetic_algorithm_II">non-dominated sorting</a> algorithm to construct Pareto fronts. Within each front, molecules are ranked by average Tanimoto distance (using ECFP6 fingerprints) rather than crowding distance, favoring chemically diverse solutions. The final reward is:</p>
<p>$$
R_i^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>where $k$ is the molecule&rsquo;s index in the Pareto rank. Rewards for undesired and desired solutions are distributed in $(0, 0.5]$ and $(0.5, 1.0]$, respectively.</p>
<p>The agent is trained via policy gradient:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) \middle|\theta\right] = \sum_{t=1}^{T} \log G(y_t | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
<h3 id="weighted-sum-alternative">Weighted Sum Alternative</h3>
<p>The authors also implement a weighted sum (WS) scheme with dynamic weights proportional to the ratio of undesired to desired molecules per objective:</p>
<p>$$
w_i = \frac{r_i}{\sum_{k=1}^{M} r_k}, \quad R^{*} = \sum_{i=1}^{n} w_i R_i
$$</p>
<p>This auto-adjusts importance toward under-performing objectives during training.</p>
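<p>A minimal sketch of the dynamic weighting, assuming each $r_i$ is computed as the ratio of undesired to desired molecule counts for objective $i$ in the current batch (the exact counting scheme is not spelled out in this summary):</p>

```python
def dynamic_weights(undesired_counts, desired_counts):
    # r_i = (# undesired) / (# desired) per objective, normalized to sum to 1,
    # so under-performing objectives receive proportionally larger weights
    ratios = [u / d for u, d in zip(undesired_counts, desired_counts)]
    total = sum(ratios)
    return [r / total for r in ratios]

def weighted_reward(scores, weights):
    # R* = sum_i w_i * R_i
    return sum(w * s for w, s in zip(weights, scores))
```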
<h3 id="molecular-diversity-metric">Molecular Diversity Metric</h3>
<p>Diversity is measured using the Solow-Polasky metric adapted from ecological biodiversity:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\top} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
<p>where $F(\mathbf{s})$ is a distance matrix with entries $f(d_{ij}) = e^{-\theta d_{ij}}$ and $d_{ij}$ is the Tanimoto distance between ECFP6 fingerprints of molecules $s_i$ and $s_j$.</p>
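<p>The metric reduces to one linear solve. A minimal numpy sketch, assuming the pairwise Tanimoto distances are precomputed (fingerprinting would require a cheminformatics toolkit such as RDKit); note that exact duplicates make $F$ singular, so duplicate molecules must be removed first:</p>

```python
import numpy as np

def solow_polasky_diversity(dist, theta=1.0):
    """dist: symmetric pairwise Tanimoto-distance matrix with zero diagonal.
    Returns I(A) = (1/|A|) e^T F^{-1} e, where F_ij = exp(-theta * d_ij)."""
    F = np.exp(-theta * np.asarray(dist, dtype=float))
    e = np.ones(len(F))
    return float(e @ np.linalg.solve(F, e)) / len(F)
```

<p>Two identical molecules drive the value toward its minimum, while a set of mutually distant molecules pushes it toward 1.</p>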
<h2 id="multi-target-and-target-specific-experiments">Multi-Target and Target-Specific Experiments</h2>
<h3 id="qsar-environment">QSAR Environment</h3>
<p>Four ML algorithms were benchmarked for the bioactivity prediction environment: Random Forest (RF), SVM, PLS, and Multi-task DNN (MT-DNN). Input features combined 2048-bit ECFP6 fingerprints with 19 physicochemical descriptors (2067D total). The training data came from ChEMBL v26: 25,731 ligands with bioactivity measurements toward $A_1AR$, $A_{2A}AR$, and hERG. RF was selected as the final predictor based on superior performance in temporal-split independent testing ($R^2$ and RMSE), prioritizing robustness over cross-validation metrics.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN generator uses six layers: input, embedding (128D), three LSTM recurrent layers (512 hidden units), and output. LSTM was chosen over GRU based on higher valid SMILES rates (97.5% vs. 93.1% for pre-trained, 97.9% vs. 95.7% for fine-tuned). Pre-training used 1.7M molecules from ChEMBL; fine-tuning used the 25,731 LIGAND set molecules.</p>
<h3 id="baselines">Baselines</h3>
<p>DrugEx v2 was compared against DrugEx v1, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, all using the same RNN architecture and pre-trained/fine-tuned models, with only the RL framework differing. Both Pareto front (PF) and weighted sum (WS) reward schemes were tested.</p>
<h3 id="multi-target-results">Multi-Target Results</h3>
<p>In the multi-target case (high affinity for $A_1AR$ and $A_{2A}AR$, low affinity for hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.57%</td>
          <td>80.81%</td>
          <td>87.29%</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.80%</td>
          <td><strong>97.45%</strong></td>
          <td>89.08%</td>
          <td>0.49</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>PF</td>
          <td>99.54%</td>
          <td>57.43%</td>
          <td><strong>98.84%</strong></td>
          <td><strong>0.77</strong></td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.84%</td>
          <td>66.01%</td>
          <td>82.67%</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>DrugEx v1</td>
          <td>PF</td>
          <td>98.28%</td>
          <td>43.27%</td>
          <td>88.96%</td>
          <td>0.71</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 achieved the highest desirability under both schemes. The WS scheme maximized desirability (97.45%) but at the cost of diversity (0.49). The PF scheme maintained higher diversity (0.70) with still-strong desirability (80.81%).</p>
<h3 id="target-specific-results">Target-Specific Results</h3>
<p>In the target-specific case (high $A_{2A}AR$, low $A_1AR$ and hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.53%</td>
          <td><strong>89.49%</strong></td>
          <td>90.55%</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.62%</td>
          <td><strong>97.86%</strong></td>
          <td>90.54%</td>
          <td>0.31</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>WS</td>
          <td>99.55%</td>
          <td>81.27%</td>
          <td>98.87%</td>
          <td>0.34</td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.29%</td>
          <td>86.98%</td>
          <td>80.30%</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 with PF achieved high desirability (89.49%) while maintaining diversity (0.73), outperforming both the WS scheme&rsquo;s diversity collapse (0.31) and competing methods.</p>
<h3 id="chemical-space-coverage">Chemical Space Coverage</h3>
<p>t-SNE visualization with ECFP6 descriptors showed that the PF scheme guided generators to cover chemical space more broadly than the WS scheme. DrugEx v1 and v2 covered nearly all of the chemical space occupied by known active ligands, while REINVENT and ORGANIC covered only partial regions in the target-specific case.</p>
<h3 id="substructure-distribution">Substructure Distribution</h3>
<p>Generated molecules were evaluated for purine ring, furan ring, and benzene ring frequencies. DrugEx v2 with PF produced substructure distributions closest to the LIGAND set, suggesting it better preserves the chemical characteristics of known active molecules compared to REINVENT (which over-represented benzene rings) and ORGANIC.</p>
<h3 id="guacamol-benchmark">GuacaMol Benchmark</h3>
<p>DrugEx v2 was tested on 20 goal-directed tasks from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark, achieving the best score in 12 of 20 tasks and an overall second place. The method struggled with tasks requiring contradictory objectives in narrow chemical spaces (e.g., the Sitagliptin MPO task), reflecting its emphasis on diverse feasible molecules rather than optimal individual solutions.</p>
<h2 id="diversity-desirability-trade-off-and-limitations">Diversity-Desirability Trade-off and Limitations</h2>
<p>The key finding is that the Pareto front scheme and weighted sum scheme offer complementary strengths: PF produces molecules with higher diversity and more realistic substructure distributions, while WS achieves higher raw desirability scores. The Pareto front scheme is preferred for polypharmacology applications where chemical diversity matters for lead optimization.</p>
<p>The mutation rate $\varepsilon$ controls the diversity-desirability trade-off. Higher $\varepsilon$ increases diversity at the cost of desirability. The authors tested $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$ and found that appropriate tuning is important.</p>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The method is less effective for tasks with contradictory objectives in narrow chemical spaces</li>
<li>Emphasis is on generating diverse feasible molecules rather than individual optimal solutions</li>
<li>REINVENT 2.0 did not converge with the PF scheme, suggesting the Pareto approach may not be universally compatible with all RL frameworks</li>
<li>Bioactivity predictions rely on QSAR models (RF), which may not generalize perfectly to novel chemical scaffolds</li>
</ul>
<p>Future directions mentioned include adopting newer architectures (BERT, Transformer, GPT-2), handling graph and fragment representations, and integrating additional objectives like stability and synthesizability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v26 (ChEMBL set)</td>
          <td>1.7M molecules</td>
          <td>SMILES syntax learning, drug-like molecules</td>
      </tr>
      <tr>
          <td>Fine-tuning / Environment</td>
          <td>LIGAND set</td>
          <td>25,731 ligands</td>
          <td>Bioactivities for $A_1AR$, $A_{2A}AR$, hERG from ChEMBL</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>GuacaMol</td>
          <td>20 tasks</td>
          <td>Goal-directed generation tasks</td>
      </tr>
  </tbody>
</table>
<p>Active/inactive thresholds: $pX \geq 6.5$ (active), $pX &lt; 6.5$ (inactive). Low-quality data points without an exact pX value were assigned $pX = 3.99$ with a sample weight of 0.1.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>QSAR predictor</strong>: Random Forest, 1000 trees, Gini criterion. Input: 2048-bit ECFP6 + 19 physicochemical properties (2067D). MinMax normalization.</li>
<li><strong>Generator</strong>: 6-layer RNN with LSTM cells (512 hidden units), embedding dim 128, vocabulary 84 tokens. Adam optimizer, lr $10^{-3}$, batch size 512, 1000 epochs.</li>
<li><strong>RL training</strong>: Policy gradient with Pareto-based or weighted-sum reward. Mutation rates tested: $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$.</li>
<li><strong>Pareto ranking</strong>: GPU-accelerated non-dominated sorting via PyTorch. Tanimoto-based crowding distance with ECFP6 fingerprints.</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generator</td>
          <td>LSTM (3 layers, 512 hidden)</td>
          <td>Embedding 128D, vocab 84</td>
      </tr>
      <tr>
          <td>Predictor</td>
          <td>Random Forest</td>
          <td>1000 trees, 2067D input</td>
      </tr>
      <tr>
          <td>MT-DNN (alternative)</td>
          <td>3 hidden layers (4000, 2000, 1000)</td>
          <td>ReLU, 20% dropout</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated SMILES that parse to valid molecules</td>
      </tr>
      <tr>
          <td>Desirability</td>
          <td>Fraction of molecules meeting all activity thresholds ($pX \geq 6.5$ on-targets, $pX &lt; 6.5$ off-targets)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of non-duplicate molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Solow-Polasky metric on ECFP6 Tanimoto distances</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Synthetic accessibility (1-10, lower is easier)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative estimate of drug-likeness (0-1, higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>GPU acceleration was used for Pareto optimization via PyTorch. Specific hardware details (GPU model, training time) are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XuhanLiu/DrugEx">DrugEx GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Python, PyTorch)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v26</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source of training molecules and bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., Emmerich, M. T. M., IJzerman, A. P., &amp; van Westen, G. J. P. (2021). DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. <em>Journal of Cheminformatics</em>, 13(1), 85. <a href="https://doi.org/10.1186/s13321-021-00561-9">https://doi.org/10.1186/s13321-021-00561-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2021drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and Emmerich, Michael T. M. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{85}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-021-00561-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugChat: Conversational QA on Drug Molecule Graphs</title><link>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/drugchat-chatgpt-drug-molecule-graphs/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/drugchat-chatgpt-drug-molecule-graphs/</guid><description>DrugChat connects a GNN molecular encoder with Vicuna-13B via a linear adaptor, enabling multi-turn conversational QA about drug compound graphs.</description><content:encoded><![CDATA[<h2 id="a-prototype-for-conversational-drug-compound-analysis">A Prototype for Conversational Drug Compound Analysis</h2>
<p><strong>Method ($\Psi_{\text{Method}}$)</strong></p>
<p>DrugChat is a prototype system that enables ChatGPT-like conversational interaction with drug molecule graphs. Users upload a compound&rsquo;s molecular graph and ask free-form, multi-turn questions about its properties, mechanism of action, or therapeutic applications. The system generates natural language answers by combining a graph neural network (GNN) encoder, a large language model (LLM), and a lightweight linear adaptor that bridges the two modalities. The primary contribution is the architecture and the accompanying instruction tuning datasets (10,834 drug compounds, 143,517 QA pairs) that make this graph-to-language interaction possible.</p>
<h2 id="why-conversational-interfaces-for-drug-molecules">Why Conversational Interfaces for Drug Molecules?</h2>
<p>Drug discovery is time-intensive and expensive, often requiring years and billions of dollars to bring a single compound to market. Traditional computational chemistry tools provide specialized outputs but lack the ability to support open-ended, interactive exploration of molecular properties. Researchers working with drug compound data frequently need quick answers to diverse questions: What is the mechanism of action? Are there known drug interactions? What structural modifications could improve efficacy?</p>
<p>At the time of this work, large language models had demonstrated strong conversational capabilities for text, and multimodal extensions (MiniGPT-4, LLaVA) had connected vision encoders to LLMs. However, no system had bridged graph-structured molecular data with LLMs for interactive dialogue. DrugChat addresses this gap by proposing the first system (to the authors&rsquo; knowledge) that connects molecular graph representations directly to an LLM for multi-turn question answering.</p>
<h2 id="architecture-gnn-adaptor-llm-pipeline">Architecture: GNN-Adaptor-LLM Pipeline</h2>
<p>The core innovation is the three-component architecture and its training strategy:</p>
<p><strong>Graph Neural Network (GNN)</strong>: A pre-trained GNN from Hu et al. (2020) processes the compound&rsquo;s molecular graph. At each layer $k$, node representations are updated by aggregating features from neighboring nodes:</p>
<p>$$
h_{v}^{k} = \sigma\left(h_{v}^{k-1}, \text{AGG}\left(\left\{h_{u}^{k-1}, u \in \mathcal{N}(v)\right\}\right)\right)
$$</p>
<p>A permutation-invariant pooling function produces the graph-level representation:</p>
<p>$$
h_{G} = f\left(\left\{h_{v}^{K}, v \in G\right\}\right)
$$</p>
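<p>The two updates above can be sketched schematically. The actual encoder is the pre-trained GNN of Hu et al. (2020); here the sum aggregation, the <code>update</code> callable, and mean pooling are illustrative placeholders for AGG, $\sigma$, and $f$:</p>

```python
def gnn_layer(h, adjacency, update):
    """One message-passing layer: each node sums its neighbors'
    feature vectors (AGG), then `update` combines the aggregate with
    the node's previous representation h_v^{k-1}."""
    new_h = {}
    for v, h_v in h.items():
        agg = [0.0] * len(h_v)
        for u in adjacency[v]:
            agg = [a + x for a, x in zip(agg, h[u])]
        new_h[v] = update(h_v, agg)
    return new_h

def mean_pool(h):
    """Permutation-invariant readout f: average the final node vectors
    into a single graph-level representation h_G."""
    vecs = list(h.values())
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

<p>On a triangle graph with scalar features, one layer with an additive update gives every node the sum of the whole graph's features, and pooling collapses that to a single graph vector.</p>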
<p><strong>Linear Adaptor</strong>: A single linear transformation matrix converts the GNN graph representation into a soft prompt vector compatible with the LLM&rsquo;s input space. This is the only component whose weights are updated during training.</p>
<p><strong>Large Language Model (Vicuna-13B)</strong>: The pre-trained Vicuna-13B model takes the transformed graph prompt vector along with user questions and generates answers. Both the GNN and LLM weights remain frozen during training.</p>
<p>The prompt template follows the Vicuna conversational format:</p>
<p>$$
\mathbf{Q}: \langle\text{Graph}\rangle\langle\text{GraphFeature}\rangle\langle/\text{Graph}\rangle\langle\text{Instruction}\rangle \quad \mathbf{A}: \langle\text{Desc}\rangle
$$</p>
<p>During training, the system minimizes a negative log-likelihood loss between generated and ground-truth answers. The entire training procedure updates only the adaptor&rsquo;s parameters, making the approach computationally lightweight compared to full fine-tuning.</p>
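<p>Mechanically, the trainable part of the pipeline is tiny: a single affine map from the frozen GNN's graph embedding into the LLM's token-embedding space, whose output is prepended as a soft prompt. A schematic sketch (dimensions and names are illustrative, not from the paper):</p>

```python
def linear_adaptor(W, b, h_G):
    """The only trainable component: project the frozen GNN's graph
    embedding h_G (dim d_g) into the LLM embedding space (dim d_llm)
    via a single linear transformation W h_G + b."""
    return [sum(w_ij * x for w_ij, x in zip(row, h_G)) + b_i
            for row, b_i in zip(W, b)]

def build_llm_input(soft_prompt, instruction_embeddings):
    """Prepend the projected graph vector as a soft-prompt token ahead
    of the embedded instruction, mirroring the
    <Graph><GraphFeature></Graph><Instruction> template."""
    return [soft_prompt] + instruction_embeddings
```

<p>Because gradients flow only into <code>W</code> and <code>b</code>, training cost is dominated by frozen forward passes through the GNN and Vicuna-13B rather than by parameter updates.</p>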
<h2 id="instruction-tuning-datasets-from-chembl-and-pubchem">Instruction Tuning Datasets from ChEMBL and PubChem</h2>
<p>The authors constructed two instruction tuning datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Drug Compounds</th>
          <th>QA Pairs</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>3,892</td>
          <td>129,699</td>
          <td>ChEMBL database (Feb 2023)</td>
      </tr>
      <tr>
          <td>PubChem</td>
          <td>6,942</td>
          <td>13,818</td>
          <td>PubChem (May 2023)</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>10,834</strong></td>
          <td><strong>143,517</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p><strong>ChEMBL Dataset</strong>: Starting from 2,354,965 compounds in <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, the authors identified 14,816 with drug information and filtered to 3,892 with sufficient descriptive content. For each drug, they gathered <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, molecular features (formula, acid/base classification), and drug-specific properties (mechanism of action, therapeutic applications). They manually crafted QA pairs covering topics like rotatable bond count, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski rule</a> violations, <a href="https://en.wikipedia.org/wiki/Chirality_(chemistry)">chirality</a>, <a href="https://en.wikipedia.org/wiki/Polar_surface_area">polar surface area</a>, development stage, approval year, and <a href="https://en.wikipedia.org/wiki/United_States_Adopted_Name">USAN</a> classification.</p>
<p><strong>PubChem Dataset</strong>: From 66,469,244 compounds in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, 19,319 had drug information, and 6,942 were retained after filtering for detailed descriptions. Descriptions were sourced from <a href="https://en.wikipedia.org/wiki/ChEBI">ChEBI</a>, LOTUS, and YMDB databases, yielding 13,818 QA pairs primarily asking for drug descriptions.</p>
<p>The QA pairs are formulaic: the ChEMBL set covers up to 34 question types per drug (an example drug in the paper shows all 34), while PubChem questions ask for descriptive summaries from different source databases.</p>
<h2 id="qualitative-demonstrations-only">Qualitative Demonstrations Only</h2>
<p>The paper presents only qualitative results. Two demonstration examples show DrugChat answering multi-turn questions about test compounds not seen during training. Questions like &ldquo;what makes this compound unique?&rdquo; and &ldquo;what diseases can this compound potentially treat?&rdquo; are answered in natural language.</p>
<p>No systematic quantitative evaluation is reported. The authors state they &ldquo;will perform a systematic quantitative evaluation by collaborating with pharmaceutical scientists,&rdquo; but this evaluation is not included in the technical report.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors identify <strong>language hallucination</strong> as the primary limitation. Since DrugChat incorporates an LLM, it may produce convincing but incorrect text descriptions about drugs, which could mislead decision-makers in real drug discovery pipelines.</p>
<p>Proposed mitigations include:</p>
<ul>
<li>Higher-quality training data and filtering strategies</li>
<li>More advanced GNN encoders and LLMs</li>
<li>Reinforcement learning from human feedback (RLHF) as the user base grows</li>
</ul>
<p>Several additional limitations are worth noting:</p>
<ul>
<li>The QA pairs are largely factoid-style questions with short, formulaic answers, which may not capture the nuanced reasoning needed for real drug discovery tasks</li>
<li>The evaluation is entirely qualitative, with no comparison to baselines or quantitative metrics</li>
<li>The linear adaptor is a minimal alignment mechanism; it remains unclear how much molecular structural information is preserved through this single linear transformation</li>
<li>The training data covers only a small fraction of known chemical space (10,834 compounds out of millions)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL Drug Instruction Tuning</td>
          <td>3,892 drugs, 129,699 QA pairs</td>
          <td>From ChEMBL (Feb 2023 dump)</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem Drug Instruction Tuning</td>
          <td>6,942 drugs, 13,818 QA pairs</td>
          <td>From PubChem (May 2023)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>GNN</strong>: Pre-trained model from Hu et al. (2020), &ldquo;Strategies for Pre-training Graph Neural Networks&rdquo;</li>
<li><strong>Adaptor</strong>: Single linear transformation matrix (only trainable component)</li>
<li><strong>Loss</strong>: Negative log-likelihood between generated and ground-truth answers</li>
<li><strong>Training</strong>: Only adaptor weights updated; GNN and LLM weights frozen</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Model</th>
          <th>Parameters</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GNN Encoder</td>
          <td>Pre-trained GNN (Hu et al., 2020)</td>
          <td>Not specified</td>
          <td>Frozen during training</td>
      </tr>
      <tr>
          <td>LLM</td>
          <td>Vicuna-13B</td>
          <td>~13B</td>
          <td>Frozen during training</td>
      </tr>
      <tr>
          <td>Adaptor</td>
          <td>Linear projection</td>
          <td>Not specified</td>
          <td>Trained</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>No quantitative evaluation metrics are reported. The paper provides only qualitative demonstrations on unseen compounds.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware specifications are reported for training or inference.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/UCSD-AI4H/drugchat">DrugChat Code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation (repository returned 404 as of March 2026)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liang, Y., Zhang, R., Zhang, L., &amp; Xie, P. (2023). DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs. <em>arXiv preprint arXiv:2309.03907</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liang2023drugchat,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liang, Youwei and Zhang, Ruiyi and Zhang, Li and Xie, Pengtao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.03907}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugAssist: Interactive LLM Molecule Optimization</title><link>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/drugassist-llm-molecule-optimization/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/drugassist-llm-molecule-optimization/</guid><description>DrugAssist fine-tunes Llama2-7B-Chat for interactive molecule optimization via natural language dialogue, releasing the MolOpt-Instructions dataset.</description><content:encoded><![CDATA[<h2 id="an-interactive-llm-for-molecule-optimization">An Interactive LLM for Molecule Optimization</h2>
<p>DrugAssist is a <strong>Method</strong> paper that proposes an interactive molecule optimization model built by fine-tuning Llama2-7B-Chat with LoRA on a newly constructed instruction dataset. The primary contribution is twofold: (1) the MolOpt-Instructions dataset containing over one million molecule pairs with six molecular properties and three optimization task categories, and (2) a dialogue-based molecule optimization system that allows domain experts to iteratively refine molecular modifications through multi-turn natural language conversations.</p>
<h2 id="why-interactive-molecule-optimization-matters">Why Interactive Molecule Optimization Matters</h2>
<p>Molecule optimization is a core step in the drug discovery pipeline, where lead compounds must be modified to improve specific pharmacological properties while maintaining structural similarity. Existing approaches fall into sequence-based methods (treating <a href="/notes/computational-chemistry/molecular-representations/">SMILES</a> optimization as machine translation) and graph-based methods (graph-to-graph translation), but they share a critical limitation: they are non-interactive. These models learn patterns from chemical structure data without incorporating expert feedback.</p>
<p>The drug discovery process is inherently iterative and requires integrating domain expertise. Medicinal chemists typically refine candidates through repeated cycles of suggestion, evaluation, and adjustment. Prior LLM-based approaches like <a href="/notes/computational-chemistry/llms-for-chemistry/chatdrug-conversational-drug-editing/">ChatDrug</a> relied on prompt engineering with general-purpose models (GPT-3.5-turbo) rather than fine-tuning, limiting their optimization accuracy. Additionally, most existing molecule optimization benchmarks focus on single-property optimization with vague objectives (e.g., &ldquo;maximize QED&rdquo;), while real-world drug design requires optimizing property values within specific ranges across multiple properties simultaneously.</p>
<h2 id="instruction-based-fine-tuning-with-molopt-instructions">Instruction-Based Fine-Tuning with MolOpt-Instructions</h2>
<p>The core innovation has two components: the MolOpt-Instructions dataset construction pipeline and the multi-task instruction tuning strategy.</p>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>MolOpt-Instructions is built from one million molecules randomly sampled from the <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC database</a>. The construction workflow uses mmpdb (an open-source Matched Molecular Pair platform) to generate structurally similar molecule pairs through <a href="https://en.wikipedia.org/wiki/Matched_molecular_pair_analysis">Matched Molecular Pair Analysis (MMPA)</a>. Pairs are filtered to satisfy two criteria: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> greater than 0.65 and <a href="https://en.wikipedia.org/wiki/Partition_coefficient">logP</a> difference greater than 2.5. Property values for six properties (Solubility, BBBP, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG</a> inhibition, QED, hydrogen bond donor count, and hydrogen bond acceptor count) are computed using Tencent&rsquo;s iDrug platform. The final dataset contains 1,029,949 unique pairs covering 1,595,839 unique molecules, with mean similarity of 0.69 and mean logP difference of 2.82.</p>
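<p>A toy sketch of the pair filter, with fingerprints represented as sets of on-bit indices. In practice the fingerprints and logP values come from cheminformatics tooling (e.g. RDKit) and the candidate pairs from mmpdb; this only illustrates the two cutoffs:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints
    represented as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def keep_pair(fp_a, fp_b, logp_a, logp_b,
              sim_cutoff=0.65, logp_cutoff=2.5):
    """MolOpt-Instructions pair filter: keep pairs that are
    structurally similar (Tanimoto > 0.65) yet differ in logP
    by more than 2.5."""
    return (tanimoto(fp_a, fp_b) > sim_cutoff
            and abs(logp_a - logp_b) > logp_cutoff)
```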
<p>Three categories of optimization tasks are defined:</p>
<ul>
<li><strong>Loose</strong>: Increase or decrease a given property value (no threshold)</li>
<li><strong>Strict</strong>: Increase or decrease by at least a specified threshold</li>
<li><strong>Range</strong>: Optimize the property value to fall within a given interval</li>
</ul>
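<p>The three success criteria can be written as a small checker. This sketch covers the "increase" direction of a property; the function and argument names are illustrative, not taken from the released code:</p>

```python
def meets_objective(task, old_value, new_value,
                    threshold=None, low=None, high=None):
    """Success criterion for the three MolOpt-Instructions task
    categories, for an 'increase'-direction property."""
    if task == "loose":      # any increase counts
        return new_value > old_value
    if task == "strict":     # increase by at least `threshold`
        return new_value >= old_value + threshold
    if task == "range":      # land inside [low, high]
        return low <= new_value <= high
    raise ValueError(f"unknown task category: {task}")
```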
<p>Instruction templates are generated with ChatGPT assistance and manually refined. To ensure balance, source and target molecules are swapped for some pairs to maintain a roughly 1:1 ratio of property increases to decreases.</p>
<p>Murcko scaffold analysis confirms chemical diversity: the average number of molecules per scaffold is 2.95, and over 93.7% of scaffolds contain no more than five molecules.</p>
<h3 id="multi-task-instruction-tuning">Multi-Task Instruction Tuning</h3>
<p>The model is fine-tuned on Llama2-7B-Chat using LoRA (rank 64, alpha 128). To prevent catastrophic forgetting of general language capabilities, the training data combines MolOpt-Instructions with the Stanford Alpaca dataset (52k instruction-following examples, replicated 5x to balance the mixture). The training objective minimizes the negative log-likelihood over the response tokens:</p>
<p>$$L(R; \boldsymbol{\theta}) = -\sum_{u_i \in R} \log \Phi(u_i \mid u_{&lt;i}, I)$$</p>
<p>where $I$ is the instruction, $R$ is the response, and $\Phi$ is the model&rsquo;s conditional probability.</p>
<p>Training runs for 10 epochs with batch size 512, using AdamW ($\beta = (0.9, 0.999)$), learning rate 1e-4, 3% warm-up steps with cosine decay, and no weight decay. The data is split 90/5/5 for train/validation/test.</p>
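<p>The loss is computed only over response tokens; instruction tokens are masked out. A minimal sketch of this masked negative log-likelihood, assuming per-token target probabilities are already available:</p>

```python
import math

def response_nll(token_probs, response_mask):
    """L(R; theta) = -sum_{u_i in R} log Phi(u_i | u_<i, I).
    `token_probs[i]` is the model's probability of the i-th target
    token; `response_mask[i]` is True where that token belongs to the
    response R (instruction tokens I contribute no loss)."""
    return -sum(math.log(p)
                for p, in_response in zip(token_probs, response_mask)
                if in_response)
```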
<h2 id="experimental-setup-and-multi-property-optimization-results">Experimental Setup and Multi-Property Optimization Results</h2>
<h3 id="comparison-with-traditional-approaches">Comparison with Traditional Approaches</h3>
<p>DrugAssist is compared against Mol-Seq2Seq and Mol-Transformer (He et al., 2021) on simultaneous Solubility and BBBP optimization with range constraints. The evaluation prompt asks the model to generate an optimized molecule with solubility within a given range and BBBP category changed from one level to another.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Solubility</th>
          <th>BBBP</th>
          <th>Both</th>
          <th>Valid Rate</th>
          <th>Similarity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mol-Seq2Seq</td>
          <td>0.46</td>
          <td>0.55</td>
          <td>0.35</td>
          <td>0.76</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>Mol-Transformer</td>
          <td>0.70</td>
          <td>0.78</td>
          <td>0.59</td>
          <td>0.96</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugAssist</td>
          <td>0.74</td>
          <td>0.80</td>
          <td>0.62</td>
          <td>0.98</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugAssist achieves the highest success rates in both single-property and multi-property optimization while maintaining high validity (0.98) and comparable structural similarity (0.69).</p>
<h3 id="comparison-with-llms">Comparison with LLMs</h3>
<p>DrugAssist is compared against Llama2-7B-Chat, GPT-3.5-turbo (via ChatDrug), and BioMedGPT-LM-7B on 16 tasks covering all three optimization categories. These comparisons use multi-turn dialogues following the ChatDrug protocol: if the model&rsquo;s output fails to meet requirements, a database-retrieved molecule meeting the criteria and similar to the model&rsquo;s output is provided as a hint for iterative refinement.</p>
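<p>That refinement protocol can be sketched as a loop. All callables here are hypothetical stand-ins for the LLM call, the similarity search over the property database, and the property check — none are real APIs from the paper's code:</p>

```python
def optimize_with_hints(model, retrieve_hint, objective, molecule, max_turns=3):
    """ChatDrug-style multi-turn refinement: if the model's proposal
    fails the objective, retrieve a database molecule that satisfies
    the objective and resembles the failed proposal, then feed it back
    as a hint in the next turn."""
    hint = None
    for _ in range(max_turns):
        candidate = model(molecule, hint)
        if objective(candidate):
            return candidate
        hint = retrieve_hint(candidate)
    return None  # no compliant molecule within the turn budget
```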
<p>Selected results on single-property tasks (valid ratio / correct ratio, loose/strict):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Llama2-7B-Chat</th>
          <th>GPT-3.5-turbo</th>
          <th>BioMedGPT-LM</th>
          <th>DrugAssist</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED+</td>
          <td>0.17 / 0.16</td>
          <td>0.15 / 0.15</td>
          <td>0.15 / 0.09</td>
          <td>0.76 / 0.63</td>
      </tr>
      <tr>
          <td>Acceptor+</td>
          <td>0.08 / 0.08</td>
          <td>0.04 / 0.06</td>
          <td>0.18 / 0.13</td>
          <td>0.71 / 0.67</td>
      </tr>
      <tr>
          <td>Donor+</td>
          <td>0.15 / 0.08</td>
          <td>0.10 / 0.04</td>
          <td>0.17 / 0.09</td>
          <td>0.72 / 0.76</td>
      </tr>
      <tr>
          <td>Solubility+</td>
          <td>0.36 / 0.20</td>
          <td>0.16 / 0.05</td>
          <td>0.18 / 0.09</td>
          <td>0.80 / 0.41</td>
      </tr>
      <tr>
          <td>BBBP+</td>
          <td>0.19 / 0.14</td>
          <td>0.10 / 0.10</td>
          <td>0.16 / 0.07</td>
          <td>0.82 / 0.61</td>
      </tr>
      <tr>
          <td>hERG-</td>
          <td>0.39 / 0.31</td>
          <td>0.13 / 0.15</td>
          <td>0.13 / 0.12</td>
          <td>0.71 / 0.67</td>
      </tr>
  </tbody>
</table>
<p>Multi-property tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Llama2-7B-Chat</th>
          <th>GPT-3.5-turbo</th>
          <th>BioMedGPT-LM</th>
          <th>DrugAssist</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sol+ &amp; Acc+</td>
          <td>0.15 / 0.04</td>
          <td>0.09 / 0.02</td>
          <td>0.10 / 0.07</td>
          <td>0.50 / 0.27</td>
      </tr>
      <tr>
          <td>QED+ &amp; BBBP+</td>
          <td>0.14 / 0.09</td>
          <td>0.09 / 0.06</td>
          <td>0.16 / 0.11</td>
          <td>0.65 / 0.41</td>
      </tr>
  </tbody>
</table>
<p>DrugAssist outperforms all baselines across every task. BioMedGPT-LM frequently misunderstands the task, generating guidance text rather than molecules. GPT-3.5-turbo achieves high validity but often outputs the input molecule unchanged.</p>
<h2 id="transferability-iterative-refinement-and-limitations">Transferability, Iterative Refinement, and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Zero-shot transferability</strong>: Although DrugAssist trains on single-property optimization data, it successfully handles multi-property optimization requests at inference time. In a case study, the model simultaneously increased both BBBP and QED by at least 0.1 while maintaining structural similarity, without any multi-property training examples.</p>
<p><strong>Few-shot generalization</strong>: DrugAssist optimizes properties not seen during training (e.g., logP) when provided with a few in-context examples of successful optimizations, a capability that traditional sequence-based or graph-based models cannot achieve without retraining.</p>
<p><strong>Iterative optimization</strong>: When an initial optimization fails to meet requirements, DrugAssist can incorporate feedback (a database-retrieved hint molecule) and modify different functional groups in a second attempt to produce a compliant molecule.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge that DrugAssist's success rate drops on the hardest setting: solubility optimization succeeds at only 0.41 under strict criteria, versus 0.80 under loose criteria. The model also depends on iDrug to predict Solubility, BBBP, and hERG inhibition, so its optimization quality is bounded by the accuracy of those property predictors. The LLM comparisons use only 500 test molecules, a relatively small evaluation set, and the paper reports no statistical significance tests or confidence intervals.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors plan to improve multimodal data handling to reduce hallucination problems and to further enhance DrugAssist&rsquo;s interactive capabilities for better understanding of user needs and feedback.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>MolOpt-Instructions</td>
          <td>1,029,949 molecule pairs</td>
          <td>Sourced from ZINC via mmpdb; 6 properties</td>
      </tr>
      <tr>
          <td>Training (auxiliary)</td>
          <td>Stanford Alpaca</td>
          <td>52k instructions (5x replicated)</td>
          <td>Mitigates catastrophic forgetting</td>
      </tr>
      <tr>
          <td>Evaluation (traditional)</td>
          <td>From He et al. (2021)</td>
          <td>Not specified</td>
          <td>Multi-property optimization test</td>
      </tr>
      <tr>
          <td>Evaluation (LLM)</td>
          <td>ZINC subset</td>
          <td>500 molecules</td>
          <td>Randomly selected</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Base model</strong>: Llama2-7B-Chat</li>
<li><strong>Fine-tuning</strong>: LoRA with rank 64, alpha 128</li>
<li><strong>Optimizer</strong>: AdamW, $\beta = (0.9, 0.999)$, lr = 1e-4, no weight decay</li>
<li><strong>Schedule</strong>: 3% warm-up, cosine decay</li>
<li><strong>Epochs</strong>: 10</li>
<li><strong>Batch size</strong>: 512</li>
<li><strong>Property calculation</strong>: iDrug (Solubility, BBBP, hERG); RDKit (H-bond donors/acceptors, QED)</li>
<li><strong>Molecular pairs</strong>: mmpdb for Matched Molecular Pair Analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Fine-tuned Llama2-7B-Chat with LoRA adapters</li>
<li>No pre-trained weights released (code and data available)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success rate</td>
          <td>Fraction of molecules meeting optimization criteria</td>
      </tr>
      <tr>
          <td>Valid rate</td>
          <td>Fraction of generated SMILES that parse as valid molecules</td>
      </tr>
      <tr>
          <td>Similarity</td>
          <td>Tanimoto similarity between input and optimized molecules</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8 NVIDIA Tesla A100-SXM4-40GB GPUs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/blazerye/DrugAssist">DrugAssist Code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/blazerye/DrugAssist">MolOpt-Instructions</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>1M+ molecule pairs, 6 properties</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ye, G., Cai, X., Lai, H., Wang, X., Huang, J., Wang, L., Liu, W., &amp; Zeng, X. (2024). DrugAssist: A Large Language Model for Molecule Optimization. <em>Briefings in Bioinformatics</em>, 26(1), bbae693.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ye2024drugassist,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugAssist: A Large Language Model for Molecule Optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ye, Geyan and Cai, Xibao and Lai, Houtim and Wang, Xing and Huang, Junhong and Wang, Longyue and Liu, Wei and Zeng, Xiangxiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{26}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae693}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae693}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemCrow: Augmenting LLMs with 18 Chemistry Tools</title><link>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/chemcrow-augmenting-llms-chemistry-tools/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/chemcrow-augmenting-llms-chemistry-tools/</guid><description>ChemCrow integrates 18 expert-designed chemistry tools with GPT-4 to enable autonomous synthesis planning, drug discovery, and materials design tasks.</description><content:encoded><![CDATA[<h2 id="an-llm-powered-chemistry-agent">An LLM-Powered Chemistry Agent</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemCrow, an LLM chemistry agent that augments GPT-4 with 18 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. Rather than relying on the LLM&rsquo;s internal knowledge (which is often inaccurate for chemistry), ChemCrow uses the LLM as a reasoning engine that iteratively calls specialized tools to gather information, plan actions, and execute experiments. The system successfully planned and executed real-world chemical syntheses on a robotic platform, making it one of the first LLM-based chemistry agents to interact with the physical world.</p>
<h2 id="bridging-llm-reasoning-and-chemical-expertise">Bridging LLM Reasoning and Chemical Expertise</h2>
<p>Large language models have transformed many domains, but they struggle with chemistry-specific problems. GPT-4 cannot reliably perform basic operations like multiplying large numbers, converting <a href="https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_chemistry">IUPAC names</a> to molecular structures, or predicting reaction outcomes. These limitations stem from the models&rsquo; token-prediction design, which does not encode chemical reasoning or factual chemical knowledge reliably.</p>
<p>Meanwhile, the chemistry community has developed numerous specialized computational tools for reaction prediction, <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/">retrosynthesis</a> planning, molecular property prediction, and de novo molecular generation. These tools exist in isolated environments with steep learning curves, making them difficult for experimental chemists to integrate and use together. The gap between LLM reasoning capabilities and specialized chemistry tools presents an opportunity: augmenting LLMs with these tools could compensate for the models&rsquo; chemical knowledge deficiencies while providing a natural language interface to specialized computational chemistry capabilities.</p>
<h2 id="tool-augmented-reasoning-via-react">Tool-Augmented Reasoning via ReAct</h2>
<p>ChemCrow builds on the ReAct (Reasoning and Acting) framework, where the LLM follows an iterative Thought &rarr; Action &rarr; Action Input &rarr; Observation loop. At each step, the model reasons about the current state of the task, selects an appropriate tool, provides its input, pauses while the tool executes, and then incorporates the observation before deciding on the next step. This continues until a final answer is reached.</p>
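<p>The loop can be sketched in a few lines. The stub LLM and tool registry below are illustrative stand-ins, not ChemCrow&rsquo;s actual code (the real system wires GPT-4 and its tools together via LangChain):</p>

```python
# Illustrative ReAct-style loop. `fake_llm` stands in for the GPT-4 call;
# a real agent would send the transcript to the model and parse its
# Thought / Action / Action Input reply.

def fake_llm(transcript):
    if "Observation: 180.16" in transcript:  # tool result already seen
        return {"thought": "I now know the weight.",
                "action": "Final Answer", "input": "180.16 g/mol"}
    return {"thought": "Look up the molecular weight.",
            "action": "SMILES2Weight", "input": "CC(=O)Oc1ccccc1C(=O)O"}

TOOLS = {"SMILES2Weight": lambda smiles: "180.16"}  # stub tool registry

def react_loop(task, llm, tools, max_steps=5):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += f"Thought: {step['thought']}\nAction: {step['action']}\n"
        if step["action"] == "Final Answer":
            return step["input"]
        # Run the chosen tool and feed its output back as an observation.
        observation = tools[step["action"]](step["input"])
        transcript += f"Action Input: {step['input']}\nObservation: {observation}\n"
    return None

print(react_loop("What is the molecular weight of aspirin?", fake_llm, TOOLS))
```

<p>The &ldquo;Final Answer&rdquo; branch terminates the loop; in ChemCrow, execution-type actions are additionally gated by the safety tools described below.</p>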
<p>The system integrates 18 tools organized into four categories:</p>
<p><strong>General tools</strong> include web search (via SerpAPI), literature search (using paper-qa with OpenAI embeddings and FAISS), a Python REPL for arbitrary code execution, and a human interaction interface.</p>
<p><strong>Molecule tools</strong> cover Name2SMILES (converting molecule names to <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> via Chem-Space, PubChem, and OPSIN), SMILES2Price (checking purchasability via molbloom and ZINC20), Name2CAS (CAS number lookup via PubChem), molecular Similarity (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> with ECFP2 fingerprints), ModifyMol (local chemical space exploration via SynSpace), PatentCheck (bloom filter patent lookup via molbloom), FuncGroups (functional group identification via SMARTS patterns), and SMILES2Weight (molecular weight calculation via RDKit).</p>
<p><strong>Safety tools</strong> include ControlledChemicalCheck (screening against chemical weapons lists from <a href="https://en.wikipedia.org/wiki/Organisation_for_the_Prohibition_of_Chemical_Weapons">OPCW</a> and the Australia Group), ExplosiveCheck (GHS explosive classification via PubChem), and SafetySummary (comprehensive safety overview from PubChem data).</p>
<p><strong>Chemical reaction tools</strong> include NameRXN (reaction classification via NextMove Software), ReactionPredict (product prediction via IBM&rsquo;s RXN4Chemistry API using the <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a>), ReactionPlanner (multi-step synthesis planning via RXN4Chemistry), and ReactionExecute (direct synthesis execution on IBM&rsquo;s RoboRXN robotic platform).</p>
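<p>Several of the molecule tools above reduce to classical cheminformatics operations. For instance, Tanimoto similarity over fingerprint bit sets is just intersection over union; a dependency-free sketch (a real implementation would derive the on-bit sets from ECFP2/Morgan fingerprints with RDKit):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets.

    In practice the sets would hold the on-bit indices of ECFP2
    (Morgan radius-1) fingerprints computed with RDKit; plain Python
    sets stand in for them here.
    """
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy on-bit index sets for two hypothetical molecules:
print(tanimoto({1, 4, 7, 9}, {1, 4, 8}))  # 2 shared bits / 5 total = 0.4
```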
<p>A key design feature is that safety checks are automatically invoked before synthesis execution. If a molecule is flagged as a controlled chemical or precursor, execution stops immediately.</p>
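<p>That gating behavior amounts to a simple guard before execution; the function and screening set below are hypothetical placeholders, not ChemCrow&rsquo;s actual interfaces:</p>

```python
def execute_synthesis(smiles, controlled_set, run_synthesis):
    """Refuse to run a synthesis for molecules on a controlled list.

    `controlled_set` stands in for the OPCW / Australia Group screening
    data and `run_synthesis` for the downstream execution tool; both
    are illustrative placeholders.
    """
    if smiles in controlled_set:
        raise PermissionError(f"Refusing controlled chemical: {smiles}")
    return run_synthesis(smiles)
```

<p>Because the check runs before the execution tool is ever invoked, a flagged molecule stops the pipeline rather than producing a procedure that must be caught later.</p>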
<h2 id="experimental-validation-and-evaluation">Experimental Validation and Evaluation</h2>
<h3 id="autonomous-synthesis">Autonomous Synthesis</h3>
<p>ChemCrow autonomously planned and executed four real-world syntheses on the IBM RoboRXN cloud-connected robotic platform:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/DEET">DEET</a></strong> (insect repellent), from the prompt &ldquo;Plan and execute the synthesis of an insect repellent&rdquo;</li>
<li><strong>Three <a href="https://en.wikipedia.org/wiki/Thiourea">thiourea</a> <a href="https://en.wikipedia.org/wiki/Organocatalysis">organocatalysts</a></strong> (Schreiner&rsquo;s, Ricci&rsquo;s, and Takemoto&rsquo;s catalysts), from a prompt asking to find and synthesize a thiourea organocatalyst that accelerates the <a href="https://en.wikipedia.org/wiki/Diels%E2%80%93Alder_reaction">Diels-Alder reaction</a></li>
</ul>
<p>All four syntheses yielded the anticipated compounds. ChemCrow demonstrated the ability to autonomously adapt synthesis procedures when the RoboRXN platform flagged issues (such as insufficient solvent or invalid purification actions), iteratively modifying the procedure until it was valid.</p>
<h3 id="novel-chromophore-discovery">Novel Chromophore Discovery</h3>
<p>In a human-AI collaboration scenario, ChemCrow was instructed to train a machine learning model to screen candidate <a href="https://en.wikipedia.org/wiki/Chromophore">chromophores</a>. The system loaded and cleaned data from a chromophore database, trained and evaluated a random forest model, and suggested a molecule with a target absorption maximum of 369 nm. The proposed molecule was subsequently synthesized and characterized, revealing a measured absorption maximum of 336 nm, confirming the discovery of a new chromophore.</p>
<h3 id="expert-vs-llm-evaluation">Expert vs. LLM Evaluation</h3>
<p>The evaluation used 14 use cases spanning synthesis planning, molecular design, and chemical logic. Both ChemCrow and standalone GPT-4 (without tools) were evaluated by:</p>
<ol>
<li><strong>Expert human evaluators</strong> (n=4): Assessed correctness of chemistry, quality of reasoning, and degree of task completion</li>
<li><strong>EvaluatorGPT</strong>: An LLM evaluator prompted to assess responses</li>
</ol>
<p>Key findings from the evaluation:</p>
<table>
  <thead>
      <tr>
          <th>Evaluator</th>
          <th>Preferred System</th>
          <th>Reasoning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Human experts</td>
          <td>ChemCrow</td>
          <td>Better chemical accuracy and task completeness, especially on complex tasks</td>
      </tr>
      <tr>
          <td>EvaluatorGPT</td>
          <td>GPT-4</td>
          <td>Favored fluent, complete-sounding responses despite factual errors</td>
      </tr>
  </tbody>
</table>
<p>Human experts preferred ChemCrow across most tasks, with the exception of very simple tasks where GPT-4 could answer from memorized training data (e.g., synthesis of well-known molecules like paracetamol). GPT-4 without tools consistently produced hallucinations that appeared convincing but were factually incorrect upon expert inspection.</p>
<p>An important finding is that LLM-based evaluation (EvaluatorGPT) cannot replace expert human assessment for scientific tasks. The LLM evaluator lacks the domain knowledge needed to distinguish fluent but incorrect answers from accurate ones, rendering it unsuitable for benchmarking factuality in chemistry.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>ChemCrow demonstrates that augmenting LLMs with expert-designed tools transforms them from &ldquo;hyperconfident, typically wrong information sources&rdquo; into reasoning engines that can gather and act on accurate chemical information. The system lowers the barrier for non-experts to access computational chemistry tools through natural language while serving as an assistant to expert chemists.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li><strong>Tool dependency</strong>: ChemCrow&rsquo;s performance is bounded by the quality and coverage of its tools. Improved synthesis engines would directly improve synthesis planning capabilities.</li>
<li><strong>Reasoning failures</strong>: Tools become useless if the LLM&rsquo;s reasoning about when and how to use them is flawed, or if garbage inputs are provided.</li>
<li><strong>Reproducibility</strong>: The API-based approach to closed-source LLMs (GPT-4) limits reproducibility of individual results. The authors note that open-source models could address this, potentially at the cost of reasoning quality.</li>
<li><strong>Evaluation scope</strong>: The 14 evaluation tasks, while diverse, represent a limited test set. Standardized benchmarks for LLM-based chemistry tools did not exist at the time of publication.</li>
<li><strong>Safety considerations</strong>: While safety tools prevent execution of controlled chemical syntheses, risks remain from inaccurate reasoning or tool outputs leading to suboptimal conclusions.</li>
</ul>
<p>The authors emphasize that ChemCrow&rsquo;s modular design allows easy extension with new tools, and that future integration of image-processing tools, additional language-based tools, and other capabilities could substantially enhance the system.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chromophore screening</td>
          <td>DB for chromophore (Joung et al.)</td>
          <td>Not specified</td>
          <td>Used for training random forest model</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>14 expert-designed tasks</td>
          <td>14 tasks</td>
          <td>Spanning synthesis, molecular design, and chemical logic</td>
      </tr>
      <tr>
          <td>Chemical safety</td>
          <td>OPCW Schedules 1-3, Australia Group lists</td>
          <td>Not specified</td>
          <td>Used for controlled chemical screening</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>LLM</strong>: GPT-4 with temperature 0.1</li>
<li><strong>Framework</strong>: LangChain for tool integration</li>
<li><strong>Reasoning</strong>: ReAct (Reasoning + Acting) framework with chain-of-thought prompting</li>
<li><strong>Synthesis planning</strong>: IBM RXN4Chemistry API (Molecular Transformer-based)</li>
<li><strong>Molecule similarity</strong>: Tanimoto similarity with ECFP2 fingerprints via RDKit</li>
<li><strong>Chemical space exploration</strong>: SynSpace with 50 robust medicinal chemistry reactions</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-4 (OpenAI, closed-source) for reasoning</li>
<li>Random forest for chromophore screening (trained on the fly)</li>
<li>Molecular Transformer via RXN4Chemistry API for reaction prediction and retrosynthesis</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Human evaluation</strong>: 4 expert chemists rated responses on chemistry correctness, reasoning quality, and task completion</li>
<li><strong>LLM evaluation</strong>: EvaluatorGPT assessed responses (found unreliable for factuality)</li>
<li><strong>Experimental validation</strong>: 4 syntheses on RoboRXN platform, 1 novel chromophore characterization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The system relies primarily on API calls to GPT-4 and RXN4Chemistry, so local compute requirements are minimal.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-public">chemcrow-public</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source implementation with 12 of 18 tools</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-runs">chemcrow-runs</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>All experiment outputs and evaluation data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884639">Zenodo release (code)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release v0.3.24</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884645">Zenodo release (runs)</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>Archived experiment runs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., &amp; Schwaller, P. (2024). Augmenting large language models with chemistry tools. <em>Nature Machine Intelligence</em>, 6(5), 525-535.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{bran2024augmenting,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmenting large language models with chemistry tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bran, Andres M. and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D. and Schwaller, Philippe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{525--535}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-024-00832-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChatDrug: Conversational Drug Editing with ChatGPT</title><link>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/chatdrug-conversational-drug-editing/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/chatdrug-conversational-drug-editing/</guid><description>ChatDrug uses ChatGPT with retrieval and domain feedback for drug editing across small molecules, peptides, and proteins on 39 tasks.</description><content:encoded><![CDATA[<h2 id="a-framework-for-conversational-drug-editing-with-llms">A Framework for Conversational Drug Editing with LLMs</h2>
<p>This is a <strong>Method</strong> paper that introduces ChatDrug, a parameter-free framework for drug editing using conversational large language models (specifically ChatGPT/GPT-3.5). The primary contribution is a three-module pipeline that combines prompt engineering, retrieval-augmented domain feedback, and iterative conversation to perform text-guided editing of small molecules, peptides, and proteins. The paper also establishes a benchmark of 39 drug editing tasks spanning these three drug types.</p>
<h2 id="bridging-conversational-ai-and-drug-discovery">Bridging Conversational AI and Drug Discovery</h2>
<p>Drug editing (also called <a href="https://en.wikipedia.org/wiki/Hit_to_lead">lead optimization</a> or protein design) is a critical step in the drug discovery pipeline where molecular substructures are modified to achieve desired properties. Traditional approaches rely on domain experts for manual editing, which can be subjective and biased. Recent multi-modal approaches like MoleculeSTM and ProteinDT have started exploring text-guided drug editing, but they are domain-specific (limited to one drug type) and lack conversational capabilities for iterative refinement.</p>
<p>The authors identify three properties of conversational LLMs that make them suitable for drug discovery: (1) pretraining on comprehensive knowledge bases covering drug-related concepts, (2) strong few-shot adaptation and generalization abilities, and (3) interactive communication enabling iterative feedback incorporation. However, directly applying LLMs to drug editing yields suboptimal results because the models do not fully utilize prior domain knowledge. ChatDrug addresses this gap through structured retrieval and feedback mechanisms.</p>
<h2 id="three-module-pipeline-pdds-redf-and-conversation">Three-Module Pipeline: PDDS, ReDF, and Conversation</h2>
<p>ChatDrug consists of three modules that operate sequentially without any parameter learning.</p>
<h3 id="pdds-module-prompt-design-for-domain-specific">PDDS Module (Prompt Design for Domain-Specific)</h3>
<p>The PDDS module constructs domain-specific prompts for ChatGPT. Given an input drug $\pmb{x}_{\text{in}}$ and a text prompt $\pmb{x}_t$ describing the desired property change, the goal is:</p>
<p>$$
\pmb{x}_{\text{out}} = \text{ChatDrug}(\pmb{x}_{\text{in}}, \pmb{x}_t)
$$</p>
<p>The prompts are designed around high-level property descriptions (e.g., &ldquo;more soluble in water&rdquo;) rather than exact substructure replacements. The authors argue that ChatDrug is better suited for &ldquo;fuzzy searching&rdquo; (property-based editing with non-deterministic answers) rather than &ldquo;exact searching&rdquo; (precise substructure replacement that experts can do directly).</p>
<h3 id="redf-module-retrieval-and-domain-feedback">ReDF Module (Retrieval and Domain Feedback)</h3>
<p>The ReDF module retrieves structurally similar examples from a domain-specific database and injects them into the conversation as demonstrations. For an input drug $\pmb{x}_{\text{in}}$, a candidate drug $\tilde{\pmb{x}}$ that failed the desired property change, and a retrieval database, ReDF returns:</p>
<p>$$
\pmb{x}_R = \text{ReDF}(\pmb{x}_{\text{in}}, \tilde{\pmb{x}}; \pmb{x}_t) = \underset{\pmb{x}'_R \in \text{RetrievalDB}}{\arg\max} \langle \tilde{\pmb{x}}, \pmb{x}'_R \rangle \wedge D(\pmb{x}_{\text{in}}, \pmb{x}'_R; \pmb{x}_t)
$$</p>
<p>where $D(\cdot, \cdot; \cdot) \in \{\text{True}, \text{False}\}$ is a domain feedback function checking whether the retrieved drug satisfies the desired property change, and $\langle \tilde{\pmb{x}}, \pmb{x}'_R \rangle$ is a similarity function (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> for small molecules, <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> for peptides and proteins).</p>
<p>The retrieved example $\pmb{x}_R$ is injected into the prompt as: &ldquo;Your provided sequence [$\tilde{\pmb{x}}$] is not correct. We find a sequence [$\pmb{x}_R$] which is correct and similar to the molecule you provided. Can you give me a new molecule?&rdquo;</p>
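<p>Operationally, the argmax-with-constraint amounts to filtering the database by the feedback function $D$ and then taking the most similar remaining entry. A minimal sketch with caller-supplied similarity and feedback callables (names are illustrative):</p>

```python
def redf(candidate, retrieval_db, similarity, feedback_ok):
    """Return the database entry most similar to the failed candidate,
    restricted to entries satisfying the domain feedback function D.

    `similarity` would be Tanimoto similarity for small molecules or a
    Levenshtein-based score for sequences; `feedback_ok` plays the role
    of D. Both are supplied by the caller in this sketch.
    """
    passing = [x for x in retrieval_db if feedback_ok(x)]
    if not passing:
        return None  # no retrievable demonstration for this prompt
    return max(passing, key=lambda x: similarity(candidate, x))

# Toy usage: retrieve an oxygen-containing molecule, using shared-character
# count as a stand-in similarity function.
db = ["CCO", "CCC", "CCN"]
print(redf("CCCl", db, lambda a, b: len(set(a) & set(b)), lambda s: "O" in s))  # CCO
```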
<h3 id="conversation-module">Conversation Module</h3>
<p>The conversation module enables iterative refinement over $C$ rounds. At each round $c$, if the edited drug $\pmb{x}_c$ does not satisfy the evaluation condition, ChatDrug retrieves a new example via ReDF using $\tilde{\pmb{x}} = \pmb{x}_c$ and continues the conversation. This aligns with the iterative nature of real drug discovery workflows.</p>
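<p>The three modules compose into a simple loop; the sketch below uses placeholder callables for the ChatGPT edit step, the evaluation oracle, and ReDF retrieval:</p>

```python
def chatdrug_loop(x_in, prompt, edit, satisfied, redf_retrieve, rounds=2):
    """Iterate edit -> check -> retrieve-and-re-prompt for up to `rounds`
    conversation rounds. `edit` stands in for the ChatGPT call,
    `satisfied` for the evaluation condition, and `redf_retrieve` for
    the ReDF module; all are placeholders in this sketch.
    """
    x_c = edit(x_in, prompt, example=None)      # zero-shot first attempt
    for _ in range(rounds):
        if satisfied(x_c):
            return x_c
        x_r = redf_retrieve(x_in, x_c)          # similar, property-satisfying demo
        x_c = edit(x_in, prompt, example=x_r)   # re-prompt with the demonstration
    return x_c if satisfied(x_c) else None
```

<p>With <code>rounds=2</code> this mirrors the $C = 2$ setting used for the paper&rsquo;s main results.</p>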
<h2 id="experiments-across-39-drug-editing-tasks">Experiments Across 39 Drug Editing Tasks</h2>
<h3 id="task-design">Task Design</h3>
<p>The benchmark includes 39 tasks across three drug types:</p>
<ul>
<li><strong>Small molecules</strong> (28 tasks): 16 single-objective (tasks 101-108, each with loose and strict thresholds) and 12 multi-objective tasks (tasks 201-206, each with two thresholds). Properties include solubility (<a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>), drug-likeness (QED), permeability (<a href="https://en.wikipedia.org/wiki/Polar_surface_area">tPSA</a>), <a href="https://en.wikipedia.org/wiki/Hydrogen_bond">hydrogen bond</a> acceptors/donors.</li>
<li><strong>Peptides</strong> (9 tasks): 6 single-objective and 3 multi-objective tasks for editing <a href="https://en.wikipedia.org/wiki/Major_histocompatibility_complex">peptide-MHC binding</a> affinity across different <a href="https://en.wikipedia.org/wiki/Human_leukocyte_antigen">HLA allele</a> types.</li>
<li><strong>Proteins</strong> (2 tasks): Editing protein sequences to increase <a href="https://en.wikipedia.org/wiki/Alpha_helix">alpha-helix</a> or <a href="https://en.wikipedia.org/wiki/Beta_sheet">beta-strand</a> secondary structures.</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>For small molecules, baselines include Random, PCA, High-Variance, and GS-Mutate (all based on MegaMolBART), plus MoleculeSTM with <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> and Graph representations. For peptides and proteins, random mutation baselines with 1-3 mutated positions are used.</p>
<h3 id="main-results">Main Results</h3>
<p>ChatDrug achieves the best performance on 33 out of 39 tasks. Key results for small molecule editing (hit ratio):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Property</th>
          <th>ChatDrug (loose)</th>
          <th>Best Baseline (loose)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>101</td>
          <td>More soluble</td>
          <td>94.13</td>
          <td>67.86 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>102</td>
          <td>Less soluble</td>
          <td>96.86</td>
          <td>64.79 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>106</td>
          <td>Lower permeability</td>
          <td>77.35</td>
          <td>34.13 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>107</td>
          <td>More HBA</td>
          <td>95.35</td>
          <td>54.01 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>108</td>
          <td>More HBD</td>
          <td>96.54</td>
          <td>60.97 (MoleculeSTM-Graph)</td>
      </tr>
  </tbody>
</table>
<p>ChatDrug underperforms on tasks 104 (less like a drug) and 105 (higher permeability) and most multi-objective tasks involving permeability (205), where MoleculeSTM variants perform better.</p>
<p>For peptide editing, ChatDrug achieves 41-69% hit ratios compared to 0.4-14.4% for random mutation baselines. For protein editing, ChatDrug reaches 34.79% and 51.38% hit ratios on helix and strand tasks respectively, compared to 26.90% and 21.44% for the best random mutation baseline.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Conversation rounds</strong>: Performance increases with more rounds, converging around $C = 2$. For example, on task 101 (loose threshold), zero-shot achieves 78.26%, $C = 1$ reaches 89.56%, and $C = 2$ reaches 93.37%.</p>
<p><strong>ReDF threshold</strong>: Using a stricter threshold in the domain feedback function $D$ (matching the evaluation threshold) yields substantially higher performance than using a loose threshold. For example, on task 107 with strict evaluation, the strict-threshold ReDF achieves 72.60% vs. 14.96% for the loose-threshold ReDF.</p>
<p><strong>Similarity analysis</strong>: Retrieved molecules $\pmb{x}_R$ tend to have lower similarity to input molecules than the intermediate outputs $\pmb{x}_1$, yet they have higher hit ratios. This suggests the ReDF module explores the chemical space effectively, and the conversation module balances similarity preservation with property optimization.</p>
<p><strong>Knowledge extraction</strong>: ChatDrug can articulate domain-specific reasoning for its edits (e.g., summarizing rules for increasing water solubility by introducing polar functional groups), though the extracted knowledge shows some redundancy.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>ChatDrug demonstrates that conversational LLMs can serve as useful tools for drug editing, achieving strong results across diverse drug types with a parameter-free approach. The framework exhibits open vocabulary and compositional properties, allowing it to handle novel drug concepts and multi-objective tasks through natural language.</p>
<p>The authors acknowledge two main limitations. First, ChatDrug struggles with understanding complex 3D drug geometries, which would require deeper geometric modeling. Second, the framework requires multiple conversation rounds to achieve strong performance, adding computational cost through repeated API calls. The authors suggest that knowledge summarization capabilities of LLMs could help reduce this cost.</p>
<p>The evaluation relies entirely on computational oracles (RDKit for small molecules, MHCflurry 2.0 for peptides, ProteinCLAP for proteins) rather than wet-lab validation. The hit ratio metric also excludes invalid outputs from the denominator, so the effective success rate over all attempted edits may be lower than reported.</p>
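<p>The denominator convention matters when comparing reported numbers; a minimal illustration of the hit-ratio computation, with placeholder validity and property oracles:</p>

```python
def hit_ratio(outputs, is_valid, hits_property):
    """Fraction of *valid* outputs that satisfy the property change.

    Invalid outputs (e.g., unparsable SMILES) are dropped from the
    denominator, matching the paper's convention; `is_valid` and
    `hits_property` are placeholder oracles here.
    """
    valid = [o for o in outputs if is_valid(o)]
    if not valid:
        return 0.0
    return sum(hits_property(o) for o in valid) / len(valid)

outs = ["CCO", "???", "CCN", "CCC"]      # one unparsable output
is_ok = lambda s: "?" not in s
hit = lambda s: s.endswith(("O", "N"))   # toy property check
print(hit_ratio(outs, is_ok, hit))       # 2 hits / 3 valid = 0.666...
```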
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Small molecule inputs</td>
          <td><a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC</a></td>
          <td>200 molecules</td>
          <td>Sampled SMILES strings</td>
      </tr>
      <tr>
          <td>Small molecule retrieval DB</td>
          <td>ZINC</td>
          <td>10K molecules</td>
          <td>For ReDF similarity search</td>
      </tr>
      <tr>
          <td>Peptide inputs</td>
          <td>Peptide-MHC binding dataset</td>
          <td>500 peptides per task</td>
          <td>From 30 common MHC alleles</td>
      </tr>
      <tr>
          <td>Peptide retrieval DB</td>
          <td>Experimental binding data</td>
          <td>Varies by allele</td>
          <td>Target allele experimental data</td>
      </tr>
      <tr>
          <td>Protein inputs</td>
          <td>TAPE test set</td>
          <td>Varies</td>
          <td>Secondary structure prediction test data</td>
      </tr>
      <tr>
          <td>Protein retrieval DB</td>
          <td>TAPE training set</td>
          <td>Varies</td>
          <td>Secondary structure prediction training data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-3.5-turbo via OpenAI ChatCompletion API, temperature=0, frequency_penalty=0.2</li>
<li>System prompt: &ldquo;You are an expert in the field of molecular chemistry.&rdquo;</li>
<li>$C = 2$ conversation rounds for main results</li>
<li>5 random seeds (0-4) for small molecule main results, seed 0 for ablations</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>ChatGPT (GPT-3.5-turbo): used as-is, no fine-tuning</li>
<li>MHCflurry 2.0: pseudo-oracle for peptide binding affinity evaluation</li>
<li>ProteinCLAP-EBM-NCE from ProteinDT: protein secondary structure prediction</li>
<li>ESMFold: protein folding for visualization</li>
<li>RDKit: molecular property calculations for small molecules</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit Ratio</td>
          <td>Fraction of valid edits satisfying property requirements</td>
          <td>Invalid sequences excluded from denominator</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All experiments were conducted on a single NVIDIA RTX A6000 GPU, which was needed only for peptide and protein evaluation. The total OpenAI API cost was less than $100.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/chao1224/ChatDrug">ChatDrug GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, S., Wang, J., Yang, Y., Wang, C., Liu, L., Guo, H., &amp; Xiao, C. (2024). Conversational Drug Editing Using Retrieval and Domain Feedback. <em>ICLR 2024</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liu2024chatdrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Conversational Drug Editing Using Retrieval and Domain Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Shengchao and Wang, Jiongxiao and Yang, Yijin and Wang, Chengpeng and Liu, Ling and Guo, Hongyu and Xiao, Chaowei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MG-BERT: Graph BERT for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/mg-bert-molecular-graph-bert/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/mg-bert-molecular-graph-bert/</guid><description>MG-BERT integrates graph neural network message passing into BERT with masked atom pretraining on 1.7M molecules for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-aware-bert-for-molecular-property-prediction">A Graph-Aware BERT for Molecular Property Prediction</h2>
<p>MG-BERT is a <strong>Method</strong> paper that adapts the BERT pretraining paradigm from NLP to molecular graphs. The primary contribution is a modified Transformer architecture that replaces global self-attention with bond-based local attention, allowing atoms to exchange information only through chemical bonds. This creates a deep message-passing network that avoids the oversmoothing problem of conventional graph neural networks (GNNs). Combined with a masked atom prediction pretraining strategy on 1.7 million unlabeled molecules from ChEMBL, MG-BERT learns context-sensitive atomic representations that transfer effectively to downstream property prediction tasks.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Molecular property prediction</a> is central to drug discovery, particularly for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) endpoints. While deep learning has advanced many domains, molecular property prediction faces a persistent challenge: labeled data scarcity. ADMET measurements require expensive, time-consuming experiments, and typical datasets contain only hundreds to thousands of examples.</p>
<p>Prior approaches fall into three categories, each with limitations:</p>
<ol>
<li><strong>Feature engineering</strong> (molecular fingerprints, descriptors): Requires expert design, suffers from low scalability, and fixed representations cannot be optimized for specific tasks.</li>
<li><strong>SMILES-based deep learning</strong> (CNNs, LSTMs, Transformers on SMILES strings): Must learn to parse molecular information from complex string syntax, increasing learning difficulty. Autoencoder-based methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a>) learn fixed representations that cannot be fine-tuned.</li>
<li><strong>Graph neural networks</strong> (GAT, GCN): Can learn directly from molecular topology, but are limited to 2-3 layers due to oversmoothing, restricting their capacity to capture deep-level patterns.</li>
</ol>
<p>The BERT model from NLP demonstrated that self-supervised pretraining on large unlabeled corpora followed by fine-tuning on small labeled datasets can substantially improve downstream performance. <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a> applied this idea to SMILES strings directly, but suffered from interpretability issues due to auxiliary characters in the SMILES syntax. MG-BERT addresses these limitations by operating directly on molecular graphs.</p>
<h2 id="bond-based-local-attention-and-masked-atom-pretraining">Bond-Based Local Attention and Masked Atom Pretraining</h2>
<p>The core innovation of MG-BERT has two components: a modified Transformer architecture for molecular graphs and a self-supervised pretraining strategy.</p>
<h3 id="architecture-modifications">Architecture Modifications</h3>
<p>The original BERT model uses three components: an embedding layer, Transformer encoder layers, and a task-specific output layer. MG-BERT makes four key modifications:</p>
<ol>
<li>
<p><strong>Atom embeddings replace word embeddings.</strong> The dictionary contains 16 tokens: 13 common atom types ([H], [C], [N], [O], [F], [S], [Cl], [P], [Br], [B], [I], [Si], [Se]), plus [UNK] for rare atoms, [MASK] for pretraining, and [GLOBAL] for graph-level readout.</p>
</li>
<li>
<p><strong>No positional encoding.</strong> Unlike sequential text, atoms in a molecular graph have no inherent ordering, so positional embeddings are removed.</p>
</li>
<li>
<p><strong>Local attention replaces global attention.</strong> The adjacency matrix of the molecular graph is used as a visibility matrix to modulate the attention scores. Each atom can only attend to atoms connected by chemical bonds. Formally, the attention is constrained so that:</p>
</li>
</ol>
<p>$$A'_{ij} = \begin{cases} A_{ij} &amp; \text{if a bond exists between } i \text{ and } j \\ -\infty &amp; \text{otherwise} \end{cases}$$</p>
<p>where $A_{ij}$ is the standard scaled dot-product attention score. This local message passing makes MG-BERT a GNN variant, but one that can stack many layers (six in the medium configuration) without oversmoothing, thanks to the residual connections inherited from the Transformer architecture.</p>
<ol start="4">
<li><strong>Supernode for graph-level readout.</strong> A [GLOBAL] supernode is added to each molecular graph, connected to all atoms. This node aggregates information from the entire molecule and serves as the molecular representation for downstream prediction.</li>
</ol>
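<p>The visibility-matrix constraint above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors&rsquo; implementation; the [GLOBAL] supernode would simply appear as a fully connected row and column of the adjacency matrix:</p>

```python
import numpy as np

def local_attention(scores: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
    """Mask scaled dot-product attention scores with the molecular
    adjacency (visibility) matrix so atoms attend only over bonds.

    scores:    (n_atoms, n_atoms) raw attention scores A_ij
    adjacency: (n_atoms, n_atoms) 1 where a bond (or self-loop) exists
    """
    masked = np.where(adjacency > 0, scores, -np.inf)
    # Row-wise softmax over the visible (bonded) positions only.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy 3-atom chain A-B-C with self-loops: A and C are not bonded,
# so each receives exactly zero attention weight from the other.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])
w = local_attention(np.random.randn(3, 3), adj)
print(w[0, 2])  # 0.0 -- no bond between atoms 0 and 2
```

<p>Because masked positions hold $-\infty$ before the softmax, they contribute exactly zero weight, which is how the adjacency matrix turns global attention into bond-local message passing.</p>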
<h3 id="masked-atom-prediction">Masked Atom Prediction</h3>
<p>The pretraining strategy mirrors BERT&rsquo;s masked language model but operates on atoms:</p>
<ul>
<li>15% of atoms in each molecule are randomly selected (at least one atom per molecule)</li>
<li>Of selected atoms: 80% are replaced with [MASK], 10% are randomly replaced with another atom type, and 10% remain unchanged</li>
<li>The model is trained to predict the original atom type at masked positions</li>
<li>Loss is computed only at masked positions</li>
</ul>
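<p>The selection-and-corruption step above can be sketched as follows (a hypothetical helper operating on a plain atom-token list, not the paper&rsquo;s code):</p>

```python
import random

ATOM_VOCAB = ["[H]", "[C]", "[N]", "[O]", "[F]", "[S]", "[Cl]", "[P]",
              "[Br]", "[B]", "[I]", "[Si]", "[Se]"]

def mask_atoms(atoms, mask_rate=0.15, seed=None):
    """BERT-style masking on an atom-token list: select ~15% of
    positions (at least one), then 80/10/10 replace with [MASK] /
    a random atom type / the unchanged token. Returns the corrupted
    token list and the positions where the loss is computed."""
    rng = random.Random(seed)
    n = max(1, round(len(atoms) * mask_rate))
    positions = rng.sample(range(len(atoms)), n)
    corrupted = list(atoms)
    for pos in positions:
        r = rng.random()
        if r < 0.8:
            corrupted[pos] = "[MASK]"
        elif r < 0.9:
            corrupted[pos] = rng.choice(ATOM_VOCAB)
        # else: token kept unchanged, but the model must still predict it
    return corrupted, set(positions)

tokens, targets = mask_atoms(["[C]", "[C]", "[O]", "[H]", "[H]"], seed=0)
```

<p>For a five-atom molecule, 15% rounds to a single masked position, matching the at-least-one-atom rule above.</p>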
<h3 id="model-configurations">Model Configurations</h3>
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MG-BERT Small</td>
          <td>3</td>
          <td>2</td>
          <td>128</td>
          <td>256</td>
          <td>95.27%</td>
      </tr>
      <tr>
          <td>MG-BERT Medium</td>
          <td>6</td>
          <td>4</td>
          <td>256</td>
          <td>512</td>
          <td>98.31%</td>
      </tr>
      <tr>
          <td>MG-BERT Large</td>
          <td>12</td>
          <td>8</td>
          <td>576</td>
          <td>1152</td>
          <td>98.35%</td>
      </tr>
  </tbody>
</table>
<p>The medium configuration was selected for all experiments because it achieved the best downstream performance, despite the large model having slightly higher pretraining recovery accuracy. The authors attribute this to overfitting risk with the larger model.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MG-BERT was pretrained on 1.7 million compounds randomly selected from ChEMBL, with 10% held out for evaluation (1.53M training molecules). Molecules were converted to 2D undirected graphs using RDKit, with hydrogen atoms explicitly included. The model was pretrained for 10 epochs using Adam with learning rate 1e-4 and batch size 256.</p>
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<p>Sixteen datasets covering ADMET endpoints and common molecular properties were collected from ADMETlab and <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Dataset</th>
          <th>Category</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Regression</td>
          <td>Caco2</td>
          <td>Absorption</td>
          <td>979</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logD</td>
          <td>Physicochemical</td>
          <td>10,354</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logS</td>
          <td>Physicochemical</td>
          <td>5,045</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PPB</td>
          <td>Distribution</td>
          <td>1,480</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>tox</td>
          <td>Toxicity</td>
          <td>7,295</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>Physicochemical</td>
          <td>1,128</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>Physicochemical</td>
          <td>642</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipo</td>
          <td>Physicochemical</td>
          <td>4,200</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Ames</td>
          <td>Toxicity</td>
          <td>6,719</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBB</td>
          <td>Distribution</td>
          <td>1,855</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>FDAMDD</td>
          <td>Toxicity</td>
          <td>795</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>H_HT</td>
          <td>Toxicity</td>
          <td>2,170</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_inh</td>
          <td>Absorption</td>
          <td>2,125</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_sub</td>
          <td>Absorption</td>
          <td>1,210</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>Biophysics</td>
          <td>1,513</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
      </tr>
  </tbody>
</table>
<p>Datasets were split 8:1:1 (train:validation:test) with stratified sampling by SMILES length. Each experiment was repeated 10 times with random splits, reporting mean and standard deviation. Regression was evaluated by R-squared, classification by ROC-AUC. Early stopping with a maximum of 100 epochs was used.</p>
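<p>A stratified 8:1:1 split of this shape might look like the following sketch (the 10-character length-bin width is an assumption; the paper does not give bin edges):</p>

```python
import random

def stratified_split_by_length(smiles_list, seed=0):
    """8:1:1 train/val/test split, stratified by SMILES length bins so
    each split sees a similar length distribution."""
    rng = random.Random(seed)
    bins = {}
    for s in smiles_list:
        bins.setdefault(len(s) // 10, []).append(s)
    train, val, test = [], [], []
    for members in bins.values():
        rng.shuffle(members)
        n_test = len(members) // 10
        n_val = len(members) // 10
        test.extend(members[:n_test])
        val.extend(members[n_test:n_test + n_val])
        train.extend(members[n_test + n_val:])
    return train, val, test

# 100 dummy SMILES with lengths 5..44 -> roughly 82/9/9 split
compounds = ["C" * (5 + i % 40) for i in range(100)]
train, val, test = stratified_split_by_length(compounds)
```

<p>Stratifying by length prevents a split from being dominated by unusually large or small molecules, which matters when each experiment is repeated over 10 random splits.</p>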
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ol>
<li><strong>ECFP4-XGBoost</strong>: Extended connectivity fingerprints (diameter 4) with gradient-boosted trees</li>
<li><strong>GAT</strong>: Graph Attention Network</li>
<li><strong>GCN</strong>: Graph Convolutional Network</li>
<li><strong>CDDD</strong>: Continuous and Data-Driven Descriptors (pretrained RNN encoder on SMILES with a fully connected network)</li>
<li><strong>SMILES-BERT</strong>: Original BERT applied directly to SMILES strings</li>
</ol>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Two ablation studies were conducted:</p>
<ol>
<li><strong>Pretraining effectiveness</strong>: Comparing pretrained vs. non-pretrained MG-BERT under identical hyperparameters</li>
<li><strong>Hydrogen atoms</strong>: Comparing MG-BERT with and without explicit hydrogen atoms in the molecular graph</li>
</ol>
<h2 id="consistent-improvements-across-admet-benchmarks">Consistent Improvements Across ADMET Benchmarks</h2>
<h3 id="main-results">Main Results</h3>
<p>MG-BERT consistently outperformed all baselines across all 16 datasets. Key results on the 11 ADMET datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ECFP4-XGBoost</th>
          <th>GAT</th>
          <th>GCN</th>
          <th>CDDD</th>
          <th>SMILES-BERT</th>
          <th>MG-BERT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caco2 (R2)</td>
          <td>61.41</td>
          <td>69.16</td>
          <td>67.15</td>
          <td>73.42</td>
          <td>72.39</td>
          <td><strong>74.68</strong></td>
      </tr>
      <tr>
          <td>logD (R2)</td>
          <td>70.84</td>
          <td>84.62</td>
          <td>86.22</td>
          <td>85.85</td>
          <td>86.31</td>
          <td><strong>87.46</strong></td>
      </tr>
      <tr>
          <td>logS (R2)</td>
          <td>73.73</td>
          <td>84.06</td>
          <td>83.47</td>
          <td>84.01</td>
          <td>85.20</td>
          <td><strong>87.66</strong></td>
      </tr>
      <tr>
          <td>PPB (R2)</td>
          <td>55.11</td>
          <td>59.96</td>
          <td>57.34</td>
          <td>54.12</td>
          <td>62.37</td>
          <td><strong>65.94</strong></td>
      </tr>
      <tr>
          <td>Ames (AUC)</td>
          <td>87.21</td>
          <td>86.38</td>
          <td>87.04</td>
          <td>86.82</td>
          <td>87.69</td>
          <td><strong>89.33</strong></td>
      </tr>
      <tr>
          <td>BBB (AUC)</td>
          <td>94.62</td>
          <td>93.03</td>
          <td>92.67</td>
          <td>94.44</td>
          <td>94.02</td>
          <td><strong>95.41</strong></td>
      </tr>
      <tr>
          <td>BBBP (AUC)</td>
          <td>89.16</td>
          <td>90.33</td>
          <td>90.74</td>
          <td>91.12</td>
          <td>91.32</td>
          <td><strong>92.08</strong></td>
      </tr>
  </tbody>
</table>
<p>The overall improvement across all datasets was 28.1% (7.02% on classification tasks, 21.28% on regression). Improvements were statistically significant (paired t-test, P &lt;= 0.001).</p>
<h3 id="pretraining-ablation">Pretraining Ablation</h3>
<p>Pretraining improved performance by more than 2% on all datasets. The benefit was largest for small datasets: Caco2 improved by approximately 10 percentage points (64.79 to 74.68 R2), and FDAMDD improved by about 7.5 points (80.76 to 88.23 AUC). This confirms that self-supervised pretraining effectively addresses the labeled data scarcity problem.</p>
<h3 id="hydrogen-atom-ablation">Hydrogen Atom Ablation</h3>
<p>Including explicit hydrogen atoms improved pretraining recovery accuracy from 92.25% to 98.31% and consistently improved downstream performance. The authors provide an intuitive explanation: hydrogen atoms help determine bond counts for neighboring atoms, which is critical for the masked atom recovery task. They also show that removing hydrogens can make structurally distinct molecules (e.g., benzene and cyclohexane) indistinguishable at the graph level.</p>
<h3 id="interpretability-via-attention-visualization">Interpretability via Attention Visualization</h3>
<p>The authors provide two forms of interpretability analysis:</p>
<ol>
<li>
<p><strong>t-SNE visualization of atomic representations</strong>: Pretrained atomic representations cluster by atom type and, more specifically, by local chemical environment (e.g., aromatic carbons separate from aliphatic carbons, C-N bonds from C-O bonds). This demonstrates that pretraining captures neighborhood context beyond simple atom identity.</p>
</li>
<li>
<p><strong>Attention weight visualization</strong>: On the logD task, the supernode&rsquo;s attention focuses on polar groups (which govern lipophilicity). On the Ames mutagenicity task, attention concentrates on known mutagenic structural alerts (acylchloride, nitrosamide, azide groups). This provides chemically meaningful explanations for predictions.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The paper does not extensively discuss limitations, but several can be identified:</p>
<ul>
<li>The model uses only 2D molecular topology (atom types and bonds) without 3D conformational information or bond-type features</li>
<li>The atom dictionary is limited to 13 common types plus [UNK], which may lose information for molecules containing rarer elements</li>
<li>Evaluation is limited to ADMET-focused datasets; broader chemical spaces (e.g., materials, catalysts) are not tested</li>
<li>The comparison baselines do not include other graph-based pretraining methods (e.g., the contemporaneous Strategies for Pre-training Graph Neural Networks by Hu et al.)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL (random subset)</td>
          <td>1.7M molecules (1.53M train)</td>
          <td>10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>16 datasets (642-10,354 molecules)</td>
          <td>8:1:1 splits, stratified by SMILES length</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam (pretraining: lr=1e-4, batch=256; fine-tuning: lr from {1e-5, 5e-5, 1e-4}, batch from {16, 32, 64})</li>
<li><strong>Pretraining epochs</strong>: 10</li>
<li><strong>Fine-tuning</strong>: Up to 100 epochs with early stopping</li>
<li><strong>Dropout</strong>: Optimized per task in range [0.0, 0.5]</li>
<li><strong>Masking</strong>: 15% of atoms (80% [MASK], 10% random, 10% unchanged)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: MG-BERT Medium (6 layers, 4 heads, embedding size 256, FFN size 512)</li>
<li><strong>Molecule processing</strong>: RDKit for graph conversion with explicit hydrogens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>R-squared (R2)</td>
          <td>Regression</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Accuracy, RMSE</td>
          <td>Both</td>
          <td>Reported in supplementary Table S1</td>
      </tr>
  </tbody>
</table>
<p>All results averaged over 10 random splits with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements (GPU type, training time, or memory usage).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/Molecular-graph-BERT">Molecular-graph-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation; last code push August 2021</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yang, Z.-J., Wu, Z.-X., Yi, J.-C., Hsieh, C.-Y., Hou, T.-J., &amp; Cao, D.-S. (2021). MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. <em>Briefings in Bioinformatics</em>, 22(6), bbab152. <a href="https://doi.org/10.1093/bib/bbab152">https://doi.org/10.1093/bib/bbab152</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2021mgbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{MG-BERT}: leveraging unsupervised atomic representation learning for molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yang, Zhi-Jiang and Wu, Zhen-Xing and Yi, Jia-Cai and Hsieh, Chang-Yu and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbab152}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbab152}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Maxsmi: SMILES Augmentation for Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/</guid><description>Maxsmi systematically evaluates five SMILES augmentation strategies with CNN and RNN models across solubility, lipophilicity, and bioactivity tasks.</description><content:encoded><![CDATA[<h2 id="systematic-benchmarking-of-smiles-data-augmentation">Systematic Benchmarking of SMILES Data Augmentation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates how SMILES augmentation affects deep learning molecular property prediction. The primary contribution is a comprehensive comparison of five augmentation strategies across three neural network architectures and four datasets, producing the &ldquo;Maxsmi&rdquo; models that maximize prediction performance. The study also demonstrates that test-time augmentation provides a practical confidence measure for predictions.</p>
<h2 id="the-data-scarcity-problem-in-qsar-modeling">The Data Scarcity Problem in QSAR Modeling</h2>
<p>Deep learning models require large training sets to perform well, but experimental physico-chemical and bioactivity datasets remain small, typically ranging from hundreds to a few thousand compounds. SMILES augmentation, where the non-unique <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES representation</a> of a molecule is exploited to generate multiple training examples per compound, has been shown to help in prior work by Bjerrum (2017), Kimber et al. (2018), and Li and Fourches (2020). However, no prior study had systematically compared different augmentation strategies, analyzed how much augmentation is needed, or examined the relationship between augmentation factor and prediction confidence. Most previous work chose augmentation numbers a priori without justification. Maxsmi fills this gap by providing a systematic analysis and practical guidelines.</p>
<h2 id="five-augmentation-strategies-and-test-time-ensemble-learning">Five Augmentation Strategies and Test-Time Ensemble Learning</h2>
<p>The core insight is twofold. First, the authors define five distinct strategies for generating augmented SMILES:</p>
<ol>
<li><strong>No augmentation</strong>: use only the canonical SMILES (baseline)</li>
<li><strong>Augmentation with duplication</strong>: generate $m$ random SMILES per compound, allowing duplicates; dataset grows to $N \times m$</li>
<li><strong>Augmentation without duplication</strong>: generate $m$ random SMILES and discard exact duplicates</li>
<li><strong>Augmentation with reduced duplication</strong>: keep only $f(m) = \sqrt{m}$ copies of each duplicate, a compromise between the above</li>
<li><strong>Augmentation with estimated maximum</strong>: sample random SMILES until the same string has been generated 10 times, attempting to cover most of the valid SMILES space</li>
</ol>
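<p>Strategies 2&ndash;4 differ only in how duplicates among the $m$ enumerated SMILES are handled. A minimal sketch, operating on pre-generated random SMILES (e.g. from RDKit&rsquo;s <code>Chem.MolToSmiles(mol, doRandom=True)</code>); the function and strategy names are illustrative, not the paper&rsquo;s API:</p>

```python
import math
from collections import Counter

def augment(random_smiles, strategy, m):
    """Apply one duplicate-handling strategy to m randomly enumerated
    SMILES of a single compound."""
    if strategy == "with_dup":          # keep all m samples as-is
        return list(random_smiles)
    if strategy == "without_dup":       # order-preserving dedup
        return list(dict.fromkeys(random_smiles))
    if strategy == "reduced_dup":       # at most sqrt(m) copies each
        cap = max(1, int(math.sqrt(m)))
        counts, kept = Counter(), []
        for s in random_smiles:
            if counts[s] < cap:
                counts[s] += 1
                kept.append(s)
        return kept
    raise ValueError(strategy)

samples = ["CCO", "OCC", "CCO", "C(C)O", "CCO"]   # 5 draws for ethanol
print(augment(samples, "without_dup", 5))   # ['CCO', 'OCC', 'C(C)O']
print(augment(samples, "reduced_dup", 5))   # at most 2 copies of 'CCO'
```

<p>With $m = 5$, the reduced-duplication cap is $\lfloor\sqrt{5}\rfloor = 2$, so the third occurrence of <code>CCO</code> is dropped while both distinct alternatives survive.</p>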
<p>Second, the authors formalize test-time augmentation as ensemble learning. Given a trained model $M_{\Theta}$, each test compound $C$ is represented by $k$ random SMILES $S_1(C), \ldots, S_k(C)$. The per-SMILES predictions are:</p>
<p>$$
\hat{y}_i(C) = M_{\Theta}(S_i(C))
$$</p>
<p>The compound-level prediction is an aggregation (mean) over these:</p>
<p>$$
\hat{y}(C) = A\big(\hat{y}_1(C), \ldots, \hat{y}_k(C)\big)
$$</p>
<p>The standard deviation of the per-SMILES predictions serves as a confidence measure: high variance indicates the model is uncertain about a compound.</p>
<h2 id="experimental-design-three-architectures-four-datasets">Experimental Design: Three Architectures, Four Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size (after preprocessing)</th>
          <th>Train / Test</th>
          <th>Task</th>
          <th>Provenance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>902 / 226</td>
          <td>Water solubility</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></td>
      </tr>
      <tr>
          <td>ESOL_small</td>
          <td>1,068</td>
          <td>854 / 214</td>
          <td>Solubility (max 25 heavy atoms)</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>513 / 129</td>
          <td>Hydration free energy</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,199</td>
          <td>3,359 / 840</td>
          <td>Octanol/water distribution</td>
          <td><a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a></td>
      </tr>
      <tr>
          <td>Affinity (EGFR)</td>
          <td>5,849</td>
          <td>4,679 / 1,170</td>
          <td><a href="https://en.wikipedia.org/wiki/IC50">pIC50</a> against <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> kinase</td>
          <td>Kinodata</td>
      </tr>
  </tbody>
</table>
<h3 id="architectures">Architectures</h3>
<p>Three shallow neural networks are compared:</p>
<ul>
<li><strong>CONV1D</strong>: 1D convolution (kernel size 10, stride 1) followed by two fully connected layers</li>
<li><strong>CONV2D</strong>: 2D convolution on the one-hot encoded SMILES matrix, followed by two fully connected layers</li>
<li><strong>RNN</strong>: LSTM layer followed by two fully connected layers (128 and 64 units)</li>
</ul>
<p>All models are trained for 250 epochs with batch size 16, MSE loss, SGD optimizer, and learning rate 0.001. A Random Forest baseline with Morgan fingerprints (radius 2, length 1024) is also included.</p>
<h3 id="augmentation-sweep">Augmentation sweep</h3>
<p>The augmentation number $m$ is varied from 1 to 20 (step 1) and from 20 to 100 (step 10) for three strategies (with, without, and reduced duplication). The estimated maximum strategy is tested on the smaller datasets. Both training and test sets receive the same augmentation.</p>
<h2 id="key-findings-augmentation-consistently-improves-rmse">Key Findings: Augmentation Consistently Improves RMSE</h2>
<h3 id="augmentation-always-helps">Augmentation always helps</h3>
<p>Across all datasets and architectures, SMILES augmentation reduces test RMSE compared to the no-augmentation baseline. Performance improves sharply in the low augmentation range (1 to 10) and reaches a plateau around 40 to 70, after which additional augmentation provides diminishing returns.</p>
<h3 id="best-models-maxsmi">Best models (Maxsmi)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Model</th>
          <th>Augmentation Number</th>
          <th>Strategy</th>
          <th>Test RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>Reduced duplication</td>
          <td>0.569</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>With duplication</td>
          <td>1.032</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>CONV1D</td>
          <td>80</td>
          <td>Without duplication</td>
          <td>0.593</td>
      </tr>
  </tbody>
</table>
<p>The CONV1D architecture consistently outperforms RNN and CONV2D. For ESOL, the CONV1D model improves from 0.839 RMSE (no augmentation) to 0.569 RMSE (70x reduced duplication), a 32% reduction.</p>
<h3 id="no-single-best-augmentation-strategy">No single best augmentation strategy</h3>
<p>The three main augmentation strategies (with, without, and reduced duplication) perform similarly. Generating the estimated maximum number of unique SMILES does not yield the best results, suggesting a saturation point exists where additional SMILES diversity stops helping.</p>
<h3 id="canonical-smiles-outperform-single-random-smiles">Canonical SMILES outperform single random SMILES</h3>
<p>When augmentation is limited to a single representation ($m = 1$), the canonical SMILES consistently outperforms a single random SMILES. On ESOL with CONV1D, the canonical model achieves 0.839 RMSE versus 0.964 for a random SMILES. The authors attribute this to the simpler, more readable structure of canonical SMILES (fewer branches and brackets).</p>
<h3 id="comparison-to-prior-work">Comparison to prior work</h3>
<table>
  <thead>
      <tr>
          <th>Study</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Maxsmi</td>
          <td>0.569</td>
          <td>1.032</td>
          <td>0.593</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td>MoleculeNet</td>
          <td>0.58 +/- 0.03</td>
          <td>1.15 +/- 0.12</td>
          <td>0.655 +/- 0.036</td>
          <td>GNN</td>
      </tr>
      <tr>
          <td>CNF</td>
          <td>0.62</td>
          <td>1.11</td>
          <td>0.67</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a></td>
          <td>N/A</td>
          <td>1.197 +/- 0.127</td>
          <td>0.565 +/- 0.037</td>
          <td>RNN</td>
      </tr>
  </tbody>
</table>
<p>Maxsmi outperforms or matches MoleculeNet&rsquo;s graph neural networks and the CNF model on all three tasks. MolPMoFiT slightly outperforms Maxsmi on lipophilicity (0.565 vs 0.593) but performs worse on FreeSolv.</p>
<h3 id="confidence-estimation">Confidence estimation</h3>
<p>The standard deviation of per-SMILES predictions correlates with prediction error. Confidence curves show that sequentially removing compounds with the highest uncertainty leads to monotonically decreasing mean prediction error. For ESOL, keeping only the top 10% most confident predictions yields errors below 0.25.</p>
<h3 id="egfr-affinity-test-case">EGFR affinity test case</h3>
<p>Applying the Maxsmi approach (CONV1D, 70x augmentation, reduced duplication) to EGFR kinase affinity prediction yields test RMSE of 0.777 and R2 of 0.712, compared to 1.031 RMSE and 0.494 R2 for the canonical model (a 25% RMSE improvement). The Random Forest baseline (0.758 RMSE, 0.726 R2) performs comparably, a result the authors note but do not explain.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>All experiments use a single train/test split (80/20) without cross-validation, due to the computational cost of the full augmentation sweep. This means reported RMSE values lack uncertainty estimates for the Maxsmi models.</li>
<li>The study uses shallow networks only. Whether the same augmentation benefits apply to deeper architectures or pre-trained models is untested.</li>
<li>The EGFR test case shows the Random Forest baseline performing comparably to the Maxsmi model, raising questions about when SMILES augmentation provides a meaningful advantage over traditional fingerprint-based methods.</li>
<li>The comparison to prior work uses different splits, preprocessing, and evaluation protocols across studies, which the authors acknowledge limits direct comparability.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>MoleculeNet, water solubility</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet, hydration free energy</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,199</td>
          <td>ChEMBL, logD</td>
      </tr>
      <tr>
          <td>Test case</td>
          <td>EGFR Affinity</td>
          <td>5,849</td>
          <td>Kinodata (ChEMBL v28), pIC50</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet/DeepChem and Kinodata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES generation via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>&rsquo;s random SMILES enumeration</li>
<li>One-hot encoding of SMILES characters with padding to max length</li>
<li>Five augmentation strategies applied to both training and test sets</li>
<li>Mean aggregation for compound-level predictions</li>
</ul>
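<p>A minimal sketch of the one-hot encoding step (the vocabulary and padding character here are illustrative; the actual character set is derived from the training data):</p>

```python
def one_hot_encode(smiles, vocab, max_len, pad_char=" "):
    # Pad the SMILES string to a fixed length, then map each
    # character to a one-hot row over the vocabulary.
    padded = smiles.ljust(max_len, pad_char)
    index = {ch: i for i, ch in enumerate(vocab)}
    matrix = []
    for ch in padded:
        row = [0] * len(vocab)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix

vocab = [" ", "C", "O", "(", ")", "=", "1"]  # padding character first
encoded = one_hot_encode("C(=O)C", vocab, max_len=8)
```

<p>Each augmented SMILES of a compound is encoded independently; the per-SMILES predictions are then averaged per compound as described above.</p>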
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CONV1D</td>
          <td>1D conv (kernel 10, stride 1) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>CONV2D</td>
          <td>2D conv (single channel) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RNN</td>
          <td>LSTM + FC(128) + FC(64)</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RF Baseline</td>
          <td>Random Forest (default sklearn)</td>
          <td>Morgan FP, radius 2, length 1024</td>
      </tr>
  </tbody>
</table>
<p>Training: 250 epochs, batch size 16, MSE loss, SGD, lr=0.001.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE (ESOL)</td>
          <td>0.569</td>
          <td>1.102 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
      <tr>
          <td>RMSE (FreeSolv)</td>
          <td>1.032</td>
          <td>2.563 (RF)</td>
          <td>CONV1D, 70x with dup</td>
      </tr>
      <tr>
          <td>RMSE (Lipophilicity)</td>
          <td>0.593</td>
          <td>0.860 (RF)</td>
          <td>CONV1D, 80x without dup</td>
      </tr>
      <tr>
          <td>RMSE (EGFR)</td>
          <td>0.777</td>
          <td>0.758 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a GeForce GTX 1080 Ti GPU on the HPC cluster at Freie Universität Berlin. Training CONV1D on ESOL with 100x augmentation (keeping duplicates, 90,200 data points) takes approximately 3 hours; with 19x augmentation, the model reaches an RMSE of 0.605 in under 30 minutes.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/maxsmi">volkamerlab/maxsmi</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full source code, trained models, CLI for prediction</td>
      </tr>
      <tr>
          <td><a href="https://maxsmi.readthedocs.io/en/latest/">Documentation</a></td>
          <td>Docs</td>
          <td>N/A</td>
          <td>Read the Docs documentation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/openkinome/kinodata">Kinodata</a></td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Curated kinase bioactivity data from ChEMBL v28</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code, data, trained models, and a command-line prediction tool are all publicly available under the MIT license.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kimber, T. B., Gagnebin, M., &amp; Volkamer, A. (2021). Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. <em>Artificial Intelligence in the Life Sciences</em>, 1, 100014. <a href="https://doi.org/10.1016/j.ailsci.2021.100014">https://doi.org/10.1016/j.ailsci.2021.100014</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimber2021maxsmi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kimber, Talia B. and Gagnebin, Maxime and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Artificial Intelligence in the Life Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{100014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ailsci.2021.100014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer CLMs for SMILES: Literature Review 2024</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/transformer-clms-smiles-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/transformer-clms-smiles-review/</guid><description>Review of transformer-based chemical language models for SMILES, covering encoder, decoder, and encoder-decoder architectures for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-based-chemical-language-models">A Systematization of Transformer-Based Chemical Language Models</h2>
<p>This paper is a <strong>Systematization</strong> (literature review) that surveys the landscape of transformer-based chemical language models (CLMs) operating on SMILES representations. It organizes the field into three architectural categories (encoder-only, decoder-only, encoder-decoder), discusses tokenization strategies, pre-training and fine-tuning methodologies, and identifies open challenges and future research directions. The review covers approximately 30 distinct CLMs published through early 2024.</p>
<h2 id="why-review-transformer-clms-for-smiles">Why Review Transformer CLMs for SMILES?</h2>
<p>The chemical space is vast, with databases like ZINC20 exceeding 5.5 billion compounds, and the amount of unlabeled molecular data far outstrips available labeled data for specific tasks like toxicity prediction or binding affinity estimation. Traditional molecular representations (fingerprints, descriptors, graph-based methods) require expert-engineered features and extensive domain knowledge.</p>
<p>Transformer-based language models, originally developed for NLP, have emerged as a compelling alternative. By treating <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings as a &ldquo;chemical language,&rdquo; these models can leverage large-scale unsupervised pre-training on abundant unlabeled molecules, then fine-tune on small labeled datasets for specific downstream tasks. Earlier approaches like Seq2Seq and Seq3Seq fingerprint methods used RNN-based encoder-decoders, but these suffered from vanishing gradients and sequential processing bottlenecks when handling long SMILES sequences.</p>
<p>The authors motivate this review by noting that no prior survey has comprehensively organized transformer-based CLMs by architecture type while simultaneously covering tokenization, embedding strategies, and downstream application domains.</p>
<h2 id="architectural-taxonomy-encoder-decoder-and-encoder-decoder-models">Architectural Taxonomy: Encoder, Decoder, and Encoder-Decoder Models</h2>
<p>The core organizational contribution is a three-way taxonomy of transformer CLMs based on their architectural backbone.</p>
<h3 id="encoder-only-models-bert-family">Encoder-Only Models (BERT Family)</h3>
<p>These models capture bidirectional context, making them well suited for extracting molecular representations for property prediction tasks. The review covers:</p>
<ul>
<li><strong>BERT</strong> (Lee and Nam, 2022): Adapted for SMILES processing with linguistic knowledge infusion, using BPE tokenization</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/">MOLBERT</a></strong> (Fabian et al., 2020): Chemistry-specific BERT for physicochemical property and bioactivity prediction</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a></strong> (Wang et al., 2019): BERT variant designed to learn molecular representations directly from SMILES without feature engineering</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a> / <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/">ChemBERTa-2</a></strong> (Chithrananda et al., 2020; Ahmad et al., 2022): RoBERTa-based models optimized for chemical property prediction, with ChemBERTa-2 exploring multi-task pre-training</li>
<li><strong>GPT-MolBERTa</strong> (Balaji et al., 2023): Combines GPT molecular features with a RoBERTa backbone</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a></strong> (Ross et al., 2022): Large-scale model trained on 1.1 billion molecules, published in Nature Machine Intelligence</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/selformer/">SELFormer</a></strong> (Yuksel et al., 2023): Operates on <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> representations rather than SMILES</li>
<li><strong>Mol-BERT / MolRoPE-BERT</strong> (Li and Jiang, 2021; Liu et al., 2023): Differ in positional embedding strategy, with MolRoPE-BERT using rotary position embedding to handle longer sequences</li>
<li><strong>BET</strong> (Chen et al., 2021): Extracts predictive representations from hundreds of millions of molecules</li>
</ul>
<h3 id="decoder-only-models-gpt-family">Decoder-Only Models (GPT Family)</h3>
<p>These models excel at generative tasks, including de novo molecular design:</p>
<ul>
<li><strong>GPT-2-based model</strong> (Adilov, 2021): Generative pre-training from molecules</li>
<li><strong>MolXPT</strong> (Liu et al., 2023): Wraps molecules with text for generative pre-training, connecting chemical and natural language</li>
<li><strong>BioGPT</strong> (Luo et al., 2022): Focuses on biomedical text generation and mining</li>
<li><strong>MolGPT</strong> (Haroon et al., 2023): Uses relative attention to capture token distances and relationships for de novo drug design</li>
<li><strong>Mol-Instructions</strong> (Fang et al., 2023): Large-scale biomolecular instruction dataset for LLMs</li>
</ul>
<h3 id="encoder-decoder-models">Encoder-Decoder Models</h3>
<p>These combine encoding and generation capabilities for sequence-to-sequence tasks:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a></strong> (Irwin et al., 2022): BART-based model for reaction prediction and molecular property prediction</li>
<li><strong>MolT5</strong> (adapted T5): Unified text-to-text framework for molecular tasks</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-transformer/">SMILES Transformer</a></strong> (Honda et al., 2019): Pre-trained molecular fingerprints for low-data drug discovery</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/x-mol-pretraining-molecular-understanding/">X-MOL</a></strong> (Xue et al., 2020): Large-scale pre-training for molecular understanding</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/regression-transformer/">Regression Transformer</a></strong> (Born and Manica, 2023): Operates on <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, enabling concurrent regression and generation</li>
<li><strong>TransAntivirus</strong> (Mao et al., 2023): Specialized for antiviral drug design using IUPAC nomenclature</li>
</ul>
<h2 id="tokenization-embedding-and-pre-training-strategies">Tokenization, Embedding, and Pre-Training Strategies</h2>
<h3 id="smiles-tokenization">SMILES Tokenization</h3>
<p>The review identifies tokenization as a critical preprocessing step that affects downstream performance. SMILES tokenization differs from standard NLP tokenization because SMILES strings contain no whitespace or sentence boundaries, and symbols such as parentheses encode chemical branching rather than serving as punctuation. The key approaches include:</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Source</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/molecular-representations/atom-in-smiles-tokenization/">Atom-in-SMILES (AIS)</a></td>
          <td>Ucak et al. (2023)</td>
          <td>Atom-level tokens preserving chemical identity</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/molecular-representations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a></td>
          <td>Li and Fourches (2021)</td>
          <td>BPE-inspired substructure tokenization</td>
      </tr>
      <tr>
          <td>Byte-Pair Encoding (BPE)</td>
          <td>Chithrananda et al. (2020); Lee and Nam (2022)</td>
          <td>Standard subword tokenization adapted for SMILES</td>
      </tr>
      <tr>
          <td>SMILESTokenizer</td>
          <td>Chithrananda et al. (2020)</td>
          <td>Character-level tokenization with chemical adjustments</td>
      </tr>
  </tbody>
</table>
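<p>Atom-level tokenization is commonly implemented with a single regular expression; the pattern below follows the widely used one from the reaction-prediction literature (the exact pattern varies between papers):</p>

```python
import re

# Bracket atoms, two-letter elements (Cl, Br), ring-closure
# labels, and bond symbols each become a single token.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: the tokens must reassemble the input exactly.
    assert "".join(tokens) == smiles, "untokenizable input"
    return tokens

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)[O-]")  # aspirin anion
```

<p>Character-level tokenizers instead split on every symbol, which keeps the vocabulary small but breaks multi-character tokens such as <code>Cl</code> and <code>[O-]</code> apart.</p>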
<h3 id="positional-embeddings">Positional Embeddings</h3>
<p>The models use various positional encoding strategies: absolute, relative key, relative key-query, rotary (RoPE), and sinusoidal. Notably, SMILES-based models omit segmentation embeddings since SMILES data consists of single sequences rather than sentence pairs.</p>
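<p>For reference, the sinusoidal variant (from the original transformer, shown here as a plain-Python sketch) assigns each position a vector of sines and cosines at geometrically spaced wavelengths:</p>

```python
import math

def sinusoidal_positional_encoding(max_len, d_model):
    # Even dimensions use sine, odd dimensions cosine, so each
    # position in the SMILES sequence gets a unique pattern.
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=8)
```

<p>Rotary embeddings (as in MolRoPE-BERT) instead rotate query/key vectors by position-dependent angles, which extrapolates better to sequences longer than those seen in training.</p>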
<h3 id="pre-training-and-fine-tuning-pipeline">Pre-Training and Fine-Tuning Pipeline</h3>
<p>The standard workflow follows two phases:</p>
<ol>
<li><strong>Pre-training</strong>: Unsupervised training on large unlabeled SMILES databases (ZINC, PubChem, ChEMBL) using masked language modeling (MLM), where the model learns to predict masked tokens within SMILES strings</li>
<li><strong>Fine-tuning</strong>: Supervised adaptation on smaller labeled datasets for specific tasks (classification or regression)</li>
</ol>
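<p>The MLM objective in phase 1 can be sketched as follows (simplified: full BERT-style masking also leaves some selected tokens unchanged or swaps in random tokens):</p>

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    # Replace a random fraction of tokens with a mask symbol; the
    # model is trained to predict the original tokens at the
    # masked positions only.
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # target to recover
        else:
            masked.append(tok)
            labels.append(None)  # not scored in the loss
    return masked, labels

tokens = list("CC(=O)Oc1ccccc1")  # character-level for brevity
masked, labels = mask_tokens(tokens)
```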
<p>The self-attention mechanism, central to all transformer CLMs, is formulated as:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}}\right) XW^V
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^Q$, $W^K$, $W^V \in \mathbb{R}^{M \times d_k}$ are learnable weight matrices, and $\sqrt{d_k}$ is the scaling factor.</p>
<h2 id="benchmark-datasets-and-evaluation-landscape">Benchmark Datasets and Evaluation Landscape</h2>
<p>The review catalogs the standard evaluation ecosystem for CLMs. Pre-training databases include ZINC, PubChem, and ChEMBL. Fine-tuning and evaluation rely heavily on <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Datasets</th>
          <th>Task Type</th>
          <th>Example Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Regression</td>
          <td>642 to 4,200</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA, MUV, HIV, PDBbind, BACE</td>
          <td>Classification/Regression</td>
          <td>11,908 to 437,929</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox</td>
          <td>Classification</td>
          <td>1,427 to 8,575</td>
      </tr>
  </tbody>
</table>
<p>The authors also propose four new fine-tuning datasets targeting diseases: COVID-19 drug compounds, cocrystal formation, antimalarial drugs (Plasmodium falciparum targets), and cancer gene expression/drug response data.</p>
<h2 id="challenges-limitations-and-future-directions">Challenges, Limitations, and Future Directions</h2>
<h3 id="current-challenges">Current Challenges</h3>
<p>The review identifies several persistent limitations:</p>
<ol>
<li><strong>Data efficiency</strong>: Despite transfer learning, transformer CLMs still require substantial pre-training data, and labeled datasets for specific tasks remain scarce</li>
<li><strong>Interpretability</strong>: The complexity of transformer architectures makes it difficult to understand how specific molecular features contribute to predictions</li>
<li><strong>Computational cost</strong>: Training large-scale models demands significant GPU resources, limiting accessibility</li>
<li><strong>Handling rare molecules</strong>: Models struggle with molecular structures that deviate significantly from training data distributions</li>
<li><strong>SMILES limitations</strong>: Non-unique representations, invalid strings, exceeded atom valency, and inadequate spatial information capture</li>
</ol>
<h3 id="smiles-representation-issues">SMILES Representation Issues</h3>
<p>The authors highlight five specific problems with SMILES as an input representation:</p>
<ul>
<li>Non-canonical representations reduce string uniqueness for the same molecule</li>
<li>Many symbol combinations produce chemically invalid outputs</li>
<li>Valid SMILES strings can encode chemically impossible molecules (e.g., exceeded valency)</li>
<li>Spatial information is inadequately captured</li>
<li>Syntactic and semantic robustness is limited</li>
</ul>
<h3 id="future-research-directions">Future Research Directions</h3>
<p>The review proposes several directions:</p>
<ul>
<li><strong>Alternative molecular representations</strong>: Exploring <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, <a href="/notes/computational-chemistry/molecular-representations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, IUPAC, and InChI beyond SMILES</li>
<li><strong>Role of SMILES token types</strong>: Strategic masking of metals, non-metals, bonds, and branches during MLM pre-training to identify which components are most critical</li>
<li><strong>Few-shot learning</strong>: Combining few-shot approaches with large-scale pre-trained CLMs for data-scarce scenarios</li>
<li><strong>Drug repurposing</strong>: Training CLMs to distinguish identical compounds with different biological activity profiles across therapeutic domains</li>
<li><strong>Improved benchmarks</strong>: Incorporating disease-specific datasets (malaria, cancer, COVID-19) for more realistic evaluation</li>
<li><strong>Ethical considerations</strong>: Addressing dual-use risks, data biases, and responsible open-source release of CLMs</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a literature review paper. It does not introduce new models, code, or experimental results. The reproducibility assessment focuses on the accessibility of the reviewed works and proposed datasets.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20</td>
          <td>5.5B+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>100M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL</td>
          <td>2M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>642 to 437,929</td>
          <td>Standard benchmark suite</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>COVID-19 drug compounds</td>
          <td>740</td>
          <td>From Harigua-Souiai et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cocrystal formation</td>
          <td>3,282</td>
          <td>From Mswahili et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Antimalarial drugs</td>
          <td>4,794</td>
          <td>From Mswahili et al. (2024)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cancer gene/drug response</td>
          <td>201 drugs, 734 cell lines</td>
          <td>From Kim et al. (2021)</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://dai.chungbuk.ac.kr/">DAI Lab website</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Authors&rsquo; research lab</td>
      </tr>
  </tbody>
</table>
<p>No code, models, or evaluation scripts are released with this review. The paper does not include a supplementary materials section or GitHub repository.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (literature review).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Mswahili, M. E., &amp; Jeong, Y.-S. (2024). Transformer-based models for chemical SMILES representation: A comprehensive literature review. <em>Heliyon</em>, 10(20), e39038. <a href="https://doi.org/10.1016/j.heliyon.2024.e39038">https://doi.org/10.1016/j.heliyon.2024.e39038</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mswahili2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based models for chemical {SMILES} representation: A comprehensive literature review}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Mswahili, Medard Edmund and Jeong, Young-Seob}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Heliyon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e39038}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.heliyon.2024.e39038}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Systematic Review of Deep Learning CLMs (2020-2024)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/systematic-review-deep-learning-clms/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/systematic-review-deep-learning-clms/</guid><description>Systematic review of 72 deep learning molecular generation studies using MOSES and GuacaMol benchmarks across RNNs, transformers, VAEs, and GANs.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-chemical-language-models-for-molecular-generation">A Systematization of Chemical Language Models for Molecular Generation</h2>
<p>This paper is a <strong>Systematization</strong> that provides a comprehensive, PRISMA-guided systematic review of deep learning chemical language models (CLMs) used for de novo molecular generation. The primary contribution is a structured statistical analysis of 72 retrieved articles from 2020 to June 2024, comparing architectures (RNNs, transformers, VAEs, GANs, S4 models), molecular representations, biased generation strategies, and quality metrics from the MOSES and GuacaMol benchmarking platforms. The review addresses five research questions about architecture configuration effects, best-performing architectures, impactful hyperparameters, common molecular representations, and effective biased generation methods.</p>
<h2 id="motivation-evaluating-four-years-of-generative-clm-progress">Motivation: Evaluating Four Years of Generative CLM Progress</h2>
<p>Deep learning molecular generation has expanded rapidly since 2018, when <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al.</a> demonstrated that deep generative models could learn to produce novel molecules from <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> representations. By 2020, multiple architectures (RNNs, transformers, VAEs, GANs) were being applied to chemical language modeling, and benchmarking platforms like <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> and <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> had been introduced to enable standardized evaluation.</p>
<p>Despite this growth, existing reviews largely focused on theoretical background or drug development applications rather than systematic statistical comparison of model performance. Few studies had examined how architecture choice, training dataset size, molecular representation format, and biased learning strategies interact to affect generation quality metrics like validity, uniqueness, and novelty. This review fills that gap by restricting the analysis to papers reporting MOSES or GuacaMol metrics, enabling quantitative cross-study comparison.</p>
<h2 id="prisma-based-systematic-review-methodology">PRISMA-Based Systematic Review Methodology</h2>
<p>The review follows the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) guidelines. Articles were retrieved from Scopus, Web of Science, and Google Scholar using six Boolean search queries combining terms like &ldquo;Molecule Generation,&rdquo; &ldquo;Chemical Language Models,&rdquo; &ldquo;Deep Learning,&rdquo; and specific architecture names. The search window covered January 2020 to June 2024.</p>
<h3 id="eligibility-criteria">Eligibility Criteria</h3>
<p>Papers were included if they:</p>
<ol>
<li>Were written in English</li>
<li>Explicitly presented at least two metrics of uniqueness, validity, or novelty</li>
<li>Defined these metrics consistent with MOSES or GuacaMol concepts</li>
<li>Used deep learning generative models for de novo molecule design</li>
<li>Used conventional (non-quantum) deep learning methods</li>
<li>Were published between January 2020 and June 2024</li>
</ol>
<p>This yielded 48 articles from query-based search and 25 from citation search, totaling 72 articles. Of these, 62 used CLM approaches (string-based molecular representations) and 10 used graph-based representations.</p>
<h3 id="data-collection">Data Collection</h3>
<p>For each article, the authors extracted: journal details, database name, training dataset size, molecular representation type (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>, <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, InChI, <a href="/notes/computational-chemistry/molecular-representations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>), architecture details (embedding length, layers, hidden units, trainable parameters, dropout, temperature, batch size, epochs, learning rate, optimizer), biased method usage (TL, RL, conditional learning), and generation metrics (validity, uniqueness, novelty, scaffold diversity, SNN, FCD).</p>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The review focuses on three core MOSES metrics:</p>
<p>$$
\text{Validity}(V_m) = \frac{\text{Valid molecules}}{\text{Molecules produced}}
$$</p>
<p>$$
\text{Uniqueness} = \frac{|\text{set}(V_m)|}{|V_m|}
$$</p>
<p>$$
\text{Novelty} = 1 - \frac{|V_m \cap T_d|}{|V_m|}
$$</p>
<p>where $V_m$ denotes valid molecules and $T_d$ the training dataset.</p>
<h2 id="architecture-distribution-and-performance-comparison">Architecture Distribution and Performance Comparison</h2>
<h3 id="architecture-trends-2020-2024">Architecture Trends (2020-2024)</h3>
<p>The review found that RNNs and transformers dominate CLM usage, with a growing trend toward transformers over time. The breakdown across 62 CLM articles: 24 RNN-based, 23 transformer-based, 16 VAE-based, 8 GAN-based, and 1 S4-based model. Among RNN variants, LSTM was the most common, followed by GRU, despite GRU having fewer trainable parameters.</p>
<p>The increase in transformer adoption is attributed to self-attention mechanisms enabling parallel computation and effective long-range dependency capture. Meanwhile, GANs and VAEs saw lower adoption rates, partly due to higher memory and time complexity and reduced ability to generate large molecules.</p>
<h3 id="molecular-representations-and-databases">Molecular Representations and Databases</h3>
<p>SMILES was used exclusively in 77.27% of CLM articles, reflecting its wide database availability and compact format. <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, <a href="/notes/computational-chemistry/molecular-representations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and InChI each appeared in smaller fractions. The dominant databases were ChEMBL and ZINC (27 articles each), followed by PubChem (4 articles). Approximately 71% of reviewed articles focused on drug discovery applications.</p>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Molecules (millions)</th>
          <th>Representation</th>
          <th>Articles</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>2.4</td>
          <td>SMILES, InChI</td>
          <td>27</td>
      </tr>
      <tr>
          <td>ZINC</td>
          <td>750</td>
          <td>SMILES</td>
          <td>27</td>
      </tr>
      <tr>
          <td>PubChem</td>
          <td>115.3</td>
          <td>SMILES, InChI</td>
          <td>4</td>
      </tr>
      <tr>
          <td>COCONUT</td>
          <td>0.695</td>
          <td>SMILES, InChI</td>
          <td>1</td>
      </tr>
      <tr>
          <td>DNA-Encoded Library</td>
          <td>1,040</td>
          <td>SMILES</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<h3 id="unbiased-model-performance">Unbiased Model Performance</h3>
<p><strong>Validity</strong>: No statistically significant differences were observed across architecture families. Transformers generally achieved high validity through self-attention mechanisms that retain uncompressed sequence information. However, one transformer model (TransMol) achieved only 6.9% validity when using stochastic sampling with Gaussian noise to explore unseen chemical space. GANs showed high dispersion, with validity as low as 8.5% when learning from gene expression signatures rather than molecular structures directly.</p>
<p><strong>Uniqueness</strong>: No significant differences in median uniqueness across architectures. Transformer-based models using masked self-attention achieved near-perfect uniqueness scores. Scaffold decoration and fragment-linking approaches sometimes compromised uniqueness due to overfit-driven redundancy.</p>
<p><strong>Validity-Novelty Trade-off</strong>: The authors propose a &ldquo;Valid/Sample&rdquo; metric (Validity × Novelty) and find an inverse trend between validity and novelty (Spearman $\rho = -0.3575$, $p = 0.0618$). Only 17.9% of models achieved above-median values for both validity (95.6%) and novelty (96.5%) simultaneously. SELFIES-based models achieve 100% validity by construction, which can help address this trade-off.</p>
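<p>The composite metric and the double-median filter are straightforward to reproduce. A sketch with made-up per-model scores (the review pools real values across the CLM articles):</p>

```python
from statistics import median

# Hypothetical (validity, novelty) fractions for four models -- illustrative only.
models = {
    "rnn_a": (0.98, 0.85),
    "tf_b":  (0.96, 0.97),
    "vae_c": (0.90, 0.99),
    "gan_d": (0.70, 0.95),
}

# Valid/Sample composite: Validity times Novelty.
valid_sample = {name: v * n for name, (v, n) in models.items()}

med_v = median(v for v, _ in models.values())
med_n = median(n for _, n in models.values())
# Models clearing BOTH medians -- only 17.9% of models did in the review.
both = sorted(name for name, (v, n) in models.items() if v > med_v and n > med_n)
```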
<h3 id="biased-model-performance">Biased Model Performance</h3>
<p>The review examines three biased generation strategies:</p>
<p><strong>Transfer Learning (TL)</strong>: The most prevalent biased method, used across all architecture types. Fine-tuning transfers pre-trained parameters to a target model, requiring significantly fewer training molecules (median ~2,507 vs. ~1.1M for unbiased). TL does not significantly affect validity (p = 0.16) or novelty (p = 0.84), but uniqueness decreases significantly (median 90.2% vs. 97.9%, p = 0.014), likely due to overfitting on small target datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>TL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training size</td>
          <td>1,128,920</td>
          <td>2,507</td>
          <td>&lt;0.0001</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>98.05%</td>
          <td>95.5%</td>
          <td>0.1602</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>97.9%</td>
          <td>90.2%</td>
          <td>0.0144</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.6%</td>
          <td>96.0%</td>
          <td>0.8438</td>
      </tr>
  </tbody>
</table>
<p><strong>Reinforcement Learning (RL)</strong>: Applied only to RNNs and transformers in the reviewed set. 90.1% of RL implementations used policy gradient methods with scoring functions for properties like synthesizability, binding affinity, and membrane permeability. No significant effects on generation metrics were observed.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>RL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>91.1%</td>
          <td>96.5%</td>
          <td>0.1289</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>89.7%</td>
          <td>0.0935</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.5%</td>
          <td>93.5%</td>
          <td>0.2500</td>
      </tr>
  </tbody>
</table>
<p><strong>Conditional Learning (CL)</strong>: Integrates domain-specific data (properties, bioactivities, functional groups) directly into training via constraint tokens or property embeddings. Used primarily with encoder-decoder architectures (ARAEs, VAEs, transformers). CL does not significantly degrade generation metrics relative to unbiased models.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>CL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>98.5%</td>
          <td>96.8%</td>
          <td>0.4648</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>97.5%</td>
          <td>0.0753</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>89.3%</td>
          <td>99.6%</td>
          <td>0.2945</td>
      </tr>
  </tbody>
</table>
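<p>The constraint-token encoding used in conditional learning can be illustrated in a few lines; the token names below are hypothetical, not drawn from any specific paper:</p>

```python
def conditioned_sequence(smiles_tokens, properties):
    """Prepend one constraint token per target property to the SMILES tokens,
    a common conditional-learning encoding (token format is illustrative)."""
    control = [f"<{name}={value}>" for name, value in sorted(properties.items())]
    return control + smiles_tokens + ["<eos>"]
```

<p>During training the model sees property tokens alongside the molecule, so at generation time sampling can be steered by fixing those tokens first.</p>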
<h2 id="key-findings-and-directions-for-chemical-language-models">Key Findings and Directions for Chemical Language Models</h2>
<h3 id="main-conclusions">Main Conclusions</h3>
<ol>
<li>
<p><strong>Transformers are overtaking RNNs</strong> as the dominant CLM architecture, driven by self-attention mechanisms that capture long-range dependencies without the vanishing-gradient issues of recurrent models.</p>
</li>
<li>
<p><strong>SMILES remains dominant</strong> (77% of models) despite known limitations (non-uniqueness, syntax errors). SELFIES shows promise for improving the validity-novelty trade-off.</p>
</li>
<li>
<p><strong>No architecture achieves both high validity and high novelty easily.</strong> Only 17.9% of unbiased models exceeded medians for both metrics simultaneously, highlighting a fundamental tension in generative chemistry.</p>
</li>
<li>
<p><strong>Transfer learning requires only ~2,500 molecules</strong> to generate targeted compounds, compared to ~1.1M for unbiased training, but at the cost of reduced uniqueness.</p>
</li>
<li>
<p><strong>Combining biased methods</strong> (e.g., TL + RL, CL + TL) shows promise for multi-objective optimization and exploring distant regions of chemical space.</p>
</li>
<li>
<p><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/">S4 models</a></strong> were first applied to CLMs in 2023 and show competitive performance, owing to their dual formulation: convolutional processing during training and recurrent generation at inference.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The review is restricted to papers reporting MOSES or GuacaMol metrics, which excludes many molecular generation studies that use alternative evaluation frameworks. The statistical comparisons rely on median values reported across different experimental settings, making direct architecture comparisons approximate. Graph-based approaches are included only for coarse comparison (10 of 72 articles) and are not the focus of the analysis.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a systematic review, so no new models were trained. The authors collected metadata from 72 published articles. No datasets were generated or analyzed beyond the literature corpus.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Statistical comparisons used Mann-Whitney U tests. Spearman correlation was used to assess the validity-novelty relationship. Outlier identification used the Valid/Sample (Validity × Novelty) metric with box plot analysis.</p>
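<p>For reference, the U statistic itself is a simple rank computation. A pure-Python sketch with average ranks for ties (real analyses would use <code>scipy.stats.mannwhitneyu</code>, which also supplies the p-value):</p>

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic (min of U_a, U_b); tied values get average ranks."""
    pooled = list(a) + list(b)
    order = sorted(range(len(pooled)), key=lambda k: pooled[k])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        for k in range(i, j + 1):          # tied values share the mean rank
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    r_a = sum(ranks[: len(a)])
    u_a = r_a - len(a) * (len(a) + 1) / 2
    return min(u_a, len(a) * len(b) - u_a)
```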
<h3 id="evaluation">Evaluation</h3>
<p>The review evaluates models using MOSES metrics: validity, uniqueness, novelty, scaffold diversity, scaffold novelty, fragment similarity, SNN, internal diversity, and <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a>. Statistical tests were applied to compare medians across architecture families and between biased and unbiased models.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (systematic review, no model training performed).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flores-Hernandez, H., &amp; Martínez-Ledesma, E. (2024). A systematic review of deep learning chemical language models in recent era. <em>Journal of Cheminformatics</em>, 16(1), 129. <a href="https://doi.org/10.1186/s13321-024-00916-y">https://doi.org/10.1186/s13321-024-00916-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{floreshernandez2024systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of deep learning chemical language models in recent era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flores-Hernandez, Hector and Mart{\&#39;i}nez-Ledesma, Emmanuel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{129}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00916-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>S4 Structured State Space Models for De Novo Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</guid><description>S4 state space models are applied to chemical language modeling for de novo drug design, outperforming LSTMs and GPTs in bioactivity learning from SMILES.</description><content:encoded><![CDATA[<h2 id="structured-state-spaces-meet-chemical-language-modeling">Structured State Spaces Meet Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces structured state space sequence (S4) models to chemical language modeling (CLM) for de novo drug design. S4 models have a dual formulation: they process entire input sequences via convolution during training (like Transformers) and generate sequences element-by-element via recurrence during inference (like LSTMs). The authors benchmark S4 against LSTM and GPT architectures across multiple drug discovery tasks, including drug-like molecule generation, bioactivity learning, chemical space exploration, natural product design, and prospective kinase inhibitor design validated by molecular dynamics simulations.</p>
<h2 id="bridging-the-lstm-transformer-gap-in-molecular-generation">Bridging the LSTM-Transformer Gap in Molecular Generation</h2>
<p>Chemical language models (CLMs) generate molecules by learning the &ldquo;chemical language&rdquo; of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> string representations. The two dominant architectures for CLMs are LSTMs and GPTs, each with complementary strengths and limitations:</p>
<ul>
<li><strong>LSTMs</strong> generate sequences recurrently (element-by-element), which enables efficient generation and good learning of local/short-range dependencies. However, their sequential information bottleneck limits learning of global sequence properties.</li>
<li><strong>GPTs</strong> (Transformer decoders) process the entire input at once, better capturing global properties like bioactivity. However, they become increasingly compute-intensive for longer SMILES strings and struggle with chemical space exploration at higher sampling temperatures.</li>
</ul>
<p>Complex molecular properties like bioactivity can emerge from separated portions of a SMILES string (e.g., distant functional groups in the linear notation). Neither architecture fully addresses the need to learn these long-range dependencies while maintaining efficient, robust generation. The chemical space, estimated at up to $10^{60}$ small molecules, demands models that can both capture complex property relationships and explore diverse scaffolds efficiently.</p>
<h2 id="the-dual-nature-of-s4-convolution-meets-recurrence">The Dual Nature of S4: Convolution Meets Recurrence</h2>
<p>S4 models are built on discrete <a href="https://en.wikipedia.org/wiki/State-space_model">state space models</a>, which map an input sequence $\mathbf{u}$ to an output sequence $\mathbf{y}$ through learnable parameters $\overline{\mathbf{A}} \in \mathbb{R}^{N \times N}$, $\overline{\mathbf{B}} \in \mathbb{R}^{N \times 1}$, $\overline{\mathbf{C}} \in \mathbb{R}^{1 \times N}$, and $\overline{\mathbf{D}} \in \mathbb{R}^{1 \times 1}$:</p>
<p>$$
x_{k} = \overline{\mathbf{A}} x_{k-1} + \overline{\mathbf{B}} u_{k}
$$</p>
<p>$$
y_{k} = \overline{\mathbf{C}} x_{k} + \overline{\mathbf{D}} u_{k}
$$</p>
<p>This linear recurrence can equivalently be &ldquo;unrolled&rdquo; into a global convolution:</p>
<p>$$
\mathbf{y} = \mathbf{u} * \overline{\mathbf{K}}
$$</p>
<p>where $\overline{\mathbf{K}}$ is a convolution filter parameterized by $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, and $\overline{\mathbf{C}}$. This duality is the core innovation for CLMs:</p>
<ul>
<li><strong>Training</strong>: S4 uses the convolutional formulation to learn from entire SMILES sequences simultaneously, capturing global molecular properties.</li>
<li><strong>Generation</strong>: S4 switches to the recurrent formulation, producing SMILES tokens one at a time for efficient, robust chemical space exploration.</li>
</ul>
<p>S4 addresses the numerical instabilities of naive state space models through high-order polynomial projection operators (HiPPO) and reduction to the stable Cauchy kernel computation, enabling effective learning of long-range dependencies.</p>
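<p>The equivalence of the two formulations can be checked numerically on a toy system. A pure-Python sketch with an $N = 2$ state and arbitrary matrices (real S4 computes $\overline{\mathbf{K}}$ stably via the HiPPO/Cauchy machinery rather than by explicit matrix powers):</p>

```python
# Toy discrete SSM: x_k = A x_{k-1} + B u_k ; y_k = C x_k + D u_k
A = [[0.5, 0.1], [0.0, 0.3]]   # N x N state transition (arbitrary values)
B = [1.0, 0.5]                 # N x 1 input map
C = [0.2, 0.8]                 # 1 x N readout
D = 0.1                        # skip connection

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def run_recurrent(u):
    """Generation-mode pass: one element at a time."""
    x, y = [0.0, 0.0], []
    for u_k in u:
        x = [a + b * u_k for a, b in zip(matvec(A, x), B)]
        y.append(sum(c * xi for c, xi in zip(C, x)) + D * u_k)
    return y

def run_convolutional(u):
    """Training-mode pass: y = u * K with K_j = C A^j B, plus the D skip term."""
    L = len(u)
    K, AjB = [], B[:]
    for _ in range(L):
        K.append(sum(c * v for c, v in zip(C, AjB)))
        AjB = matvec(A, AjB)
    return [sum(K[j] * u[k - j] for j in range(k + 1)) + D * u[k] for k in range(L)]

u = [1.0, -0.5, 2.0, 0.3]
yr, yc = run_recurrent(u), run_convolutional(u)
```

<p>Both passes produce the same output sequence, which is exactly the property S4 exploits: train with the parallel convolution, generate with the cheap recurrence.</p>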
<p>For molecular ranking after fine-tuning, the log-likelihood score subtracts the pre-training likelihood to isolate target-specific information:</p>
<p>$$
\mathcal{L}_{\text{score}}(\mathbf{M}) = \mathcal{L}(\mathbf{M}_{\text{ft}}) - \mathcal{L}(\mathbf{M}_{\text{pt}})
$$</p>
<p>where $\mathcal{L}(\mathbf{M}_{\text{ft}})$ and $\mathcal{L}(\mathbf{M}_{\text{pt}})$ are the fine-tuned and pre-trained model log-likelihoods, respectively.</p>
<h2 id="benchmarking-s4-across-drug-discovery-tasks">Benchmarking S4 Across Drug Discovery Tasks</h2>
<h3 id="drug-like-molecule-generation">Drug-like molecule generation</h3>
<p>All three CLMs (S4, LSTM, GPT) were pre-trained on 1.9M canonical SMILES from ChEMBL v31 (molecules with fewer than 100 tokens). Each model generated 102,400 SMILES strings de novo.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>S4</td>
          <td>99,268 (97%)</td>
          <td>98,712 (96%)</td>
          <td>95,552 (93%)</td>
      </tr>
      <tr>
          <td>LSTM</td>
          <td>97,151 (95%)</td>
          <td>96,618 (94%)</td>
          <td>82,988 (81%)</td>
      </tr>
      <tr>
          <td>GPT</td>
          <td>93,580 (91%)</td>
          <td>93,263 (91%)</td>
          <td>91,590 (89%)</td>
      </tr>
  </tbody>
</table>
<p>S4 produces the most valid, unique, and novel molecules. Error analysis reveals that each architecture shows different failure modes: LSTMs struggle most with branching errors, GPTs with ring and bond assignment errors, while S4 generates fewer branching and ring errors but more bond assignment errors than LSTM. This pattern supports the hypothesis that S4 captures long-range dependencies (branching, ring opening/closure) better while local dependencies (bond assignment) are handled better by recurrent processing.</p>
<h3 id="bioactivity-learning-via-transfer-learning">Bioactivity learning via transfer learning</h3>
<p>Five fine-tuning campaigns were conducted on targets from the LIT-PCBA dataset: PKM2, <a href="https://en.wikipedia.org/wiki/Mitogen-activated_protein_kinase_1">MAPK1</a>, GBA, mTORC1, and TP53. After fine-tuning, models ranked held-out test molecules by learned log-likelihoods to evaluate bioactive compound prioritization.</p>
<p>S4 outperformed both benchmarks across targets. Wilcoxon signed-rank tests on pooled scores confirmed statistically significant superiority:</p>
<ul>
<li>S4 vs. LSTM: $p = 8.41 \times 10^{-6}$ (top 10), $2.93 \times 10^{-7}$ (top 50), $1.45 \times 10^{-7}$ (top 100)</li>
<li>S4 vs. GPT: $p = 2.33 \times 10^{-3}$ (top 10), $3.72 \times 10^{-3}$ (top 50), $2.61 \times 10^{-2}$ (top 100)</li>
</ul>
<p>TP53 was the most challenging target, where no model consistently retrieved actives in the top 10, possibly due to <a href="/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/">activity cliffs</a> in the test set.</p>
<h3 id="chemical-space-exploration-with-temperature-sampling">Chemical space exploration with temperature sampling</h3>
<p>Models were evaluated across sampling temperatures from $T = 1.0$ to $T = 2.0$ on three metrics: SMILES validity, rediscovery rate of known actives, and scaffold diversity. Key findings:</p>
<ul>
<li><strong>Validity</strong>: S4 and LSTM maintain higher validity than GPT at elevated temperatures (GPT median validity drops below 40% at high T).</li>
<li><strong>Rediscovery</strong>: S4 outperforms LSTM in rediscovering bioactive molecules at all temperatures.</li>
<li><strong>Scaffold diversity</strong>: LSTM achieves the highest number of unique scaffold clusters (median 6,602 at $T = 1.75$), with S4 a close second (6,520 clusters).</li>
</ul>
<p>S4 provides the best balance between bioactivity capture and structural diversity.</p>
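<p>Temperature sampling rescales the model's next-token logits before sampling: $T &gt; 1$ flattens the distribution and encourages exploration. A minimal sketch with made-up logits:</p>

```python
import math
import random

def temperature_softmax(logits, T):
    """Softmax with temperature: divide logits by T before normalizing."""
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]                    # made-up next-token logits
p_base = temperature_softmax(logits, 1.0)   # sharper: favors the top token
p_hot = temperature_softmax(logits, 2.0)    # flatter: more exploration

# Sampling one token id from the tempered distribution:
token = random.choices(range(len(logits)), weights=p_hot)[0]
```

<p>Higher temperatures trade syntax reliability for diversity, which is why GPT's validity collapses at high $T$ while the recurrent generators degrade more gracefully.</p>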
<h3 id="natural-product-design">Natural product design</h3>
<p>Models were trained on 32,360 large natural product SMILES (length &gt; 100 tokens) from the COCONUT database and used to generate 102,400 designs each.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>S4</th>
          <th>LSTM</th>
          <th>GPT</th>
          <th>Training Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>82,633 (81%)</td>
          <td>76,264 (74%)</td>
          <td>70,117 (68%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Unique</td>
          <td>53,293 (52%)</td>
          <td>51,326 (50%)</td>
          <td>50,487 (49%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Novel</td>
          <td>40,897 (40%)</td>
          <td>43,245 (42%)</td>
          <td>43,168 (42%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>NP-likeness</td>
          <td>1.6 ± 0.7</td>
          <td>1.5 ± 0.7</td>
          <td>1.5 ± 0.7</td>
          <td>1.6 ± 0.7</td>
      </tr>
  </tbody>
</table>
<p>S4 designs the most valid molecules (roughly 6,400 more than LSTM and 12,500 more than GPT) and achieves significantly higher NP-likeness ($p = 1.41 \times 10^{-53}$ vs. LSTM, $p = 1.02 \times 10^{-82}$ vs. GPT). S4 also achieves the lowest Kolmogorov-Smirnov distances to the training/test distributions across multiple structural properties (sp3 carbons, aliphatic rings, spiro atoms, molecular weight, fused ring size, heavy atoms).</p>
<p>For computational efficiency, S4 trains as fast as GPT (both approximately 1.3x faster than LSTM) and generates fastest among all architectures.</p>
<h3 id="prospective-mapk1-inhibitor-design">Prospective MAPK1 inhibitor design</h3>
<p>The pre-trained S4 model was fine-tuned on 68 manually curated MAPK1 inhibitors ($K_i &lt; 1 \mu M$) from ChEMBL v33. The last five fine-tuning epochs generated 256K molecules across five temperature values. After ranking and filtering by log-likelihood score and scaffold similarity, the top 10 designs were evaluated via <a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a> <a href="/notes/computational-chemistry/molecular-dynamics/">molecular dynamics</a> simulations.</p>
<p>Eight out of ten designs showed high predicted affinity, with $\Delta G$ values ranging from $-10.3 \pm 0.6$ to $-23 \pm 4$ kcal/mol. These affinities are comparable to or exceed those of the closest known active neighbors ($\Delta G = -9.1 \pm 0.8$ to $-13 \pm 2$ kcal/mol). The most potent predicted design (molecule 2, $\Delta G = -23 \pm 4$ kcal/mol) engages extensively with the MAPK1 binding pocket, though synthetic accessibility may be limited. Several designs incorporate halogen substitutions favorable for MAPK1 inhibition, consistent with known structure-activity relationships.</p>
<h2 id="s4-combines-the-best-of-lstms-and-gpts-for-molecular-design">S4 Combines the Best of LSTMs and GPTs for Molecular Design</h2>
<p>The main findings of this study are:</p>
<ol>
<li><strong>S4 outperforms both LSTM and GPT</strong> in learning complex molecular properties like bioactivity, while maintaining competitive or superior performance in syntax learning and chemical space exploration.</li>
<li><strong>The dual formulation is key</strong>: holistic training (convolution) enables better capture of global molecular properties, while recurrent generation preserves robust chemical syntax and diverse scaffold exploration.</li>
<li><strong>S4 is especially strong for longer sequences</strong>: natural product design (SMILES &gt; 100 tokens) shows the largest advantages over benchmarks in validity and property matching.</li>
<li><strong>Prospective validation</strong>: 8/10 S4-designed MAPK1 inhibitors are predicted as highly active by molecular dynamics, with affinities comparable to or exceeding known actives.</li>
</ol>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>All evaluations are computational; no wet-lab experimental validation is reported.</li>
<li>Bioactivity evaluation relies on likelihood-based ranking, which is an indirect proxy.</li>
<li>The MD simulations, while more rigorous than simple docking, still represent in silico predictions.</li>
<li>SMILES augmentation and improved ranking protocols could further boost performance.</li>
</ul>
<p><strong>Future directions</strong> include application to macrocyclic peptides and protein sequences, organic reaction planning, structure-based drug design, and integration with wet-lab experimental validation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v31</td>
          <td>1.9M SMILES</td>
          <td>Molecules with SMILES length &lt;= 100 tokens</td>
      </tr>
      <tr>
          <td>Fine-tuning (bioactivity)</td>
          <td>LIT-PCBA (5 targets)</td>
          <td>11-56 actives + ~10K inactives per target</td>
          <td>PKM2, MAPK1, GBA, mTORC1, TP53</td>
      </tr>
      <tr>
          <td>Natural product training</td>
          <td>COCONUT</td>
          <td>32,360 SMILES</td>
          <td>SMILES length &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>Prospective fine-tuning</td>
          <td>ChEMBL v33 (MAPK1)</td>
          <td>68 inhibitors</td>
          <td>$K_i &lt; 1 \mu M$, target ID CHEMBL4040</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: next-token prediction on SMILES strings</li>
<li>Fine-tuning: transfer learning with early stopping (patience 5, tolerance $10^{-5}$)</li>
<li>Molecule ranking: log-likelihood scoring with pre-training bias subtraction (Eq. 5)</li>
<li>Temperature sampling: $T$ from 1.0 to 2.0 (step 0.25) for chemical space exploration</li>
</ul>
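<p>The early-stopping rule (patience 5, tolerance $10^{-5}$) can be sketched as follows; this is an illustrative reconstruction, not the authors' code:</p>

```python
def early_stop(losses, patience=5, tol=1e-5):
    """Return the epoch at which training stops, i.e. when the validation loss
    has not improved by more than `tol` for `patience` consecutive epochs;
    return None if training runs to the end of `losses`."""
    best, since = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < best - tol:
            best, since = loss, 0          # meaningful improvement: reset counter
        else:
            since += 1
            if since >= patience:
                return epoch
    return None
```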
<h3 id="models">Models</h3>
<ul>
<li><strong>S4</strong>: Structured state space sequence model with HiPPO initialization; hyperparameter search over 242 + 108 configurations</li>
<li><strong>LSTM</strong>: 40 configurations optimized via random search</li>
<li><strong>GPT</strong>: 35 configurations optimized via random search</li>
<li>All models share the same pre-training data and fine-tuning protocol for fair comparison</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (ChEMBL)</td>
          <td>S4</td>
          <td>97%</td>
          <td>Out of 102,400 generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness (ChEMBL)</td>
          <td>S4</td>
          <td>96%</td>
          <td>Among valid designs</td>
      </tr>
      <tr>
          <td>Novelty (ChEMBL)</td>
          <td>S4</td>
          <td>93%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Bioactivity ranking (top 10)</td>
          <td>S4</td>
          <td>Significant (p = 8.41e-6 vs LSTM)</td>
          <td>Wilcoxon signed-rank test</td>
      </tr>
      <tr>
          <td>NP validity</td>
          <td>S4</td>
          <td>81%</td>
          <td>COCONUT, SMILES &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>MAPK1 inhibitor success</td>
          <td>S4</td>
          <td>8/10 designs active</td>
          <td>Validated by MD (Umbrella Sampling)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Hyperparameter search: NVIDIA A100 40GB GPUs</li>
<li>LSTM/GPT search: 5 days on single A100</li>
<li>S4 search: 10 days on multiple A100 GPUs</li>
<li>MD simulations: Dutch supercomputer Snellius; 1.2-1.6 microseconds per ligand (<a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a>)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/s4-for-de-novo-drug-design">S4 for de novo drug design</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with data and trained models</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.12666371">Zenodo archive</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Source data and molecule designs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ozcelik, R., de Ruiter, S., Criscuolo, E., &amp; Grisoni, F. (2024). Chemical language modeling with structured state space sequence models. <em>Nature Communications</em>, 15, 6176.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ozcelik2024chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language modeling with structured state space sequence models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{\&#34;O{}z\c{c}elik, R{\i}za and de Ruiter, Sarah and Criscuolo, Emanuele and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6176}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-50469-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review: Deep Learning for Molecular Design (2019)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/deep-learning-molecular-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/deep-learning-molecular-design-review/</guid><description>A 2019 review surveying deep generative models for molecular design, covering RNNs, VAEs, GANs, and RL approaches with SMILES and graph representations.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-deep-generative-models-for-molecular-design">A Systematization of Deep Generative Models for Molecular Design</h2>
<p>This is a <strong>Systematization</strong> paper that organizes and compares the rapidly growing literature on deep generative modeling for molecules. Published in 2019, it catalogs 45 papers from the preceding two years, classifying them by architecture (RNNs, VAEs, GANs, reinforcement learning) and molecular representation (SMILES strings, context-free grammars, graph tensors, 3D voxels). The review provides mathematical foundations for each technique, identifies cross-cutting themes, and proposes a framework for reward function design that addresses diversity, novelty, stability, and synthesizability.</p>
<h2 id="the-challenge-of-navigating-vast-chemical-space">The Challenge of Navigating Vast Chemical Space</h2>
<p>The space of potential drug-like molecules has been estimated to contain between $10^{23}$ and $10^{60}$ compounds, while only about $10^{8}$ have ever been synthesized. Traditional approaches to molecular design rely on combinatorial methods, mixing known scaffolds and functional groups, but these generate many unstable or unsynthesizable candidates. High-throughput screening (HTS) and virtual screening (HTVS) help but remain computationally expensive. The average cost to bring a new drug to market exceeds one billion USD, with a 13-year average timeline from discovery to market.</p>
<p>By 2016, <a href="/notes/machine-learning/generative-models/">deep generative models</a> had shown strong results in producing original images, music, and text. The &ldquo;molecular autoencoder&rdquo; of <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016/2018)</a> first applied these techniques to molecular generation, triggering an explosion of follow-up work. By the time of this review, the landscape had grown complex enough, with many architectures, representation schemes, and no agreed-upon benchmarking standards, to warrant systematic organization.</p>
<h2 id="molecular-representations-and-architecture-taxonomy">Molecular Representations and Architecture Taxonomy</h2>
<p>The review&rsquo;s core organizational contribution is a two-axis taxonomy: molecular representations on one axis and deep learning architectures on the other.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The review categorizes representations into 3D and 2D graph-based schemes:</p>
<p><strong>3D representations</strong> include raw voxels (placing nuclear charges on a grid), smoothed voxels (Gaussian blurring around nuclei), and tensor field networks. These capture full geometric information but suffer from high dimensionality, sparsity, and difficulty encoding rotation/translation invariance.</p>
<p><strong>2D graph representations</strong> include:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings</strong>: The dominant representation, encoding molecular graphs as ASCII character sequences via depth-first traversal. Non-unique (each molecule with $N$ heavy atoms has at least $N$ SMILES representations), but invertible and widely supported.</li>
<li><strong>Canonical SMILES</strong>: Unique but potentially encode grammar rules rather than chemical structure.</li>
<li><strong>Context-free grammars (CFGs)</strong>: Decompose SMILES into grammar rules to improve validity rates, though not to 100%.</li>
<li><strong>Tensor representations</strong>: Store atom types in a vertex feature matrix $X \in \mathbb{R}^{N \times |\mathcal{A}|}$ and bond types in an adjacency tensor $A \in \mathbb{R}^{N \times N \times Y}$.</li>
<li><strong>Graph operations</strong>: Directly build molecular graphs by adding atoms and bonds, guaranteeing 100% chemical validity.</li>
</ul>
<h3 id="deep-learning-architectures">Deep Learning Architectures</h3>
<p><strong>Recurrent Neural Networks (RNNs)</strong> generate SMILES strings character by character, typically using LSTM or GRU units. Training uses maximum likelihood estimation (MLE) with teacher forcing:</p>
<p>$$
L^{\text{MLE}} = -\sum_{s \in \mathcal{X}} \sum_{t=2}^{T} \log \pi_{\theta}(s_{t} \mid S_{1:t-1})
$$</p>
<p>Thermal rescaling of the output distribution controls the diversity-validity tradeoff via a temperature parameter $T$. RNNs achieved SMILES validity rates of 94-98%.</p>
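<p>As a minimal sketch of thermal rescaling in plain Python (the function name <code>thermal_softmax</code> and the example logits are illustrative, not from the paper), dividing the next-character logits by a temperature $T$ before the softmax controls the tradeoff:</p>

```python
import math

def thermal_softmax(logits, T=1.0):
    """Rescale next-character logits by temperature T before sampling.

    T < 1 sharpens the distribution (more valid, less diverse SMILES);
    T > 1 flattens it (more diverse, more invalid strings).
    """
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs_cold = thermal_softmax([2.0, 1.0, 0.1], T=0.5)  # sharper
probs_hot = thermal_softmax([2.0, 1.0, 0.1], T=2.0)   # flatter
```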
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$
\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}[q_{\phi}(z|x) \,\|\, p(z)]
$$</p>
<p>The first term encourages accurate reconstruction while the KL divergence term regularizes the latent distribution toward a standard Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$. Variants include <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">grammar VAEs</a> (GVAEs), syntax-directed VAEs, junction tree VAEs, and adversarial autoencoders (AAEs) that replace the KL term with adversarial training.</p>
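<p>For a diagonal Gaussian posterior, the KL term in the ELBO has a closed form; the following sketch (function name hypothetical) evaluates it, assuming the prior is a standard Gaussian as above:</p>

```python
import math

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.

    Closed form: 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

# The KL penalty vanishes exactly when the posterior matches the prior:
zero_kl = kl_diag_gaussian([0.0, 0.0], [0.0, 0.0])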
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> train a generator against a discriminator using the minimax objective:</p>
<p>$$
\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{d}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]
$$</p>
<p>The review shows that with an optimal discriminator, the generator objective reduces to minimizing the Jensen-Shannon divergence, which captures both forward and reverse KL divergence terms. This provides a more &ldquo;balanced&rdquo; training signal than MLE alone. The Wasserstein GAN (WGAN) uses the Earth mover&rsquo;s distance for more stable training:</p>
<p>$$
W(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma} \lVert x - y \rVert
$$</p>
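<p>In one dimension the Earth mover's distance has a simple exact solution: the optimal coupling matches sorted samples. A small illustrative sketch (function name and data hypothetical):</p>

```python
def wasserstein_1d(xs, ys):
    """Earth mover's distance between two 1-D empirical distributions
    with equal sample counts: the optimal transport plan pairs sorted
    samples, so W = mean |x_(i) - y_(i)|."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Shifting a distribution by a constant c moves it exactly c units of mass:
w = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0])  # -> 3.0
```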
<p><strong>Reinforcement Learning</strong> recasts molecular generation as a sequential decision problem. The policy gradient (REINFORCE) update is:</p>
<p>$$
\nabla J(\theta) = \mathbb{E}\left[G_{t} \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid y_{1:t-1})}{\pi_{\theta}(a_{t} \mid y_{1:t-1})}\right]
$$</p>
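<p>The REINFORCE gradient above simplifies to $G_t \nabla_{\theta} \log \pi_{\theta}$, which for a softmax policy over a finite alphabet is just (one-hot minus probabilities). A toy single-step sketch, assuming a hypothetical reward function and learning rate:</p>

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5):
    """One REINFORCE update for a single-step categorical policy.
    grad of log pi(a) w.r.t. the logits is (onehot(a) - probs)."""
    probs = softmax(logits)
    a = random.choices(range(len(logits)), weights=probs)[0]
    G = reward_fn(a)  # return observed after taking action a
    return [z + lr * G * ((1.0 if i == a else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]

random.seed(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):  # only action 2 is rewarded
    logits = reinforce_step(logits, lambda a: 1.0 if a == 2 else 0.0)
best = max(range(3), key=lambda i: softmax(logits)[i])
```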
<p>To prevent RL fine-tuning from causing the generator to &ldquo;drift&rdquo; away from viable chemical structures, an augmented reward function incorporates the prior likelihood:</p>
<p>$$
R'(S) = [\sigma R(S) + \log P_{\text{prior}}(S) - \log P_{\text{current}}(S)]^{2}
$$</p>
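<p>The augmented formulation above can be sketched directly (function name and the value $\sigma = 60$, borrowed from typical REINVENT settings, are illustrative):</p>

```python
def augmented_loss(log_p_prior, log_p_agent, reward, sigma=60.0):
    """Squared gap between the agent log-likelihood and the prior
    log-likelihood augmented by the scaled reward."""
    augmented = log_p_prior + sigma * reward
    return (augmented - log_p_agent) ** 2

# An agent whose likelihood matches the augmented target incurs zero loss:
loss = augmented_loss(log_p_prior=-40.0, log_p_agent=-10.0, reward=0.5)
```

<p>Because the prior likelihood enters the target, the agent is penalized for drifting toward sequences the pre-trained model considers implausible, regardless of reward.</p>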
<h2 id="cataloging-45-models-and-their-design-choices">Cataloging 45 Models and Their Design Choices</h2>
<p>Rather than running new experiments, the review&rsquo;s methodology involves systematically cataloging and comparing 45 published models. Table 2 in the paper lists each model&rsquo;s architecture, representation, training dataset, and dataset size. Key patterns include:</p>
<ul>
<li><strong>RNN-based models</strong> (16 entries): Almost exclusively use SMILES, trained on ZINC or ChEMBL datasets with 0.1M-1.7M molecules.</li>
<li><strong>VAE variants</strong> (20 entries): The most diverse category, spanning SMILES VAEs, grammar VAEs, junction tree VAEs, graph-based VAEs, and 3D VAEs. Training sets range from 10K to 72M molecules.</li>
<li><strong>GAN models</strong> (7 entries): Include <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, RANC, ATNC, MolGAN, and CycleGAN approaches. Notably, GANs appear to work with fewer training samples.</li>
<li><strong>Other approaches</strong> (2 entries): Pure RL methods from Zhou et al. and Stahl et al. that do not require pretraining on a dataset.</li>
</ul>
<p>The review also catalogs 13 publicly available datasets (Table 3), ranging from QM9 (133K molecules with quantum chemical properties) to <a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a> (977M combinatorially generated molecules) and ZINC15 (750M+ commercially available compounds).</p>
<h3 id="metrics-and-reward-function-design">Metrics and Reward Function Design</h3>
<p>A significant contribution is the systematic treatment of reward functions. The review argues that generated molecules should satisfy six desiderata: diversity, novelty, stability, synthesizability, non-triviality, and good properties. Key metrics formalized include:</p>
<p><strong>Diversity</strong> using Tanimoto similarity over fingerprints:</p>
<p>$$
r_{\text{diversity}} = 1 - \frac{1}{|\mathcal{G}|^{2}} \sum_{(x_{1}, x_{2}) \in \mathcal{G} \times \mathcal{G}} D(x_{1}, x_{2})
$$</p>
<p><strong>Novelty</strong> measured as the fraction of generated molecules not appearing in a hold-out test set:</p>
<p>$$
r_{\text{novel}} = 1 - \frac{|\mathcal{G} \cap \mathcal{T}|}{|\mathcal{G}|}
$$</p>
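<p>Both metrics reduce to simple set arithmetic once molecules are mapped to fingerprint bit sets. A minimal sketch (function names and the tiny fingerprints are hypothetical; real pipelines would use RDKit ECFP fingerprints):</p>

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 1.0

def internal_diversity(fps):
    """1 minus the mean pairwise Tanimoto over all ordered pairs."""
    n = len(fps)
    total = sum(tanimoto(a, b) for a in fps for b in fps)
    return 1.0 - total / (n * n)

def novelty(generated, reference):
    """Fraction of generated fingerprints absent from the reference set."""
    return 1.0 - sum(1 for g in generated if g in reference) / len(generated)

fps = [frozenset({1, 2, 3}), frozenset({3, 4, 5})]
div = internal_diversity(fps)
nov = novelty(fps, {frozenset({1, 2, 3})})
```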
<p><strong>Synthesizability</strong> primarily assessed via the SA score, sometimes augmented with ring penalties and medicinal chemistry filters.</p>
<p>The review also discusses the <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> as an analog of FID for molecular generation, and notes the emergence of standardized benchmarking platforms including <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>, <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, and DiversityNet.</p>
<h2 id="key-findings-and-future-directions">Key Findings and Future Directions</h2>
<p>The review identifies several major trends and conclusions:</p>
<p><strong>Shift from SMILES to graph-based representations.</strong> SMILES-based methods struggle with validity (the molecular autoencoder VAE achieved only 0.7-75% valid SMILES depending on sampling strategy). Methods that work directly on molecular graphs with chemistry-preserving operations achieve 100% validity, and the review predicts this trend will continue.</p>
<p><strong>Advantages of adversarial and RL training over MLE.</strong> The mathematical analysis shows that MLE only optimizes forward KL divergence, which can lead to models that place probability mass where the data distribution is zero. GAN training optimizes the Jensen-Shannon divergence, which balances forward and reverse KL terms. RL approaches, particularly pure RL without pretraining, showed competitive performance with much less training data.</p>
<p><strong>Genetic algorithms remain competitive.</strong> The review notes that the latest genetic algorithm approaches (Grammatical Evolution) could match deep learning methods for molecular optimization under some metrics, and at 100x lower computational cost in some comparisons. This serves as an important baseline calibration.</p>
<p><strong>Reward function design is underappreciated.</strong> Early models generated unstable molecules with labile groups (enamines, hemiaminals, enol ethers). Better reward functions that incorporate synthesizability, diversity, and stability constraints significantly improved practical utility.</p>
<p><strong>Need for standardized benchmarks.</strong> The review identifies a lack of agreement on evaluation methodology as a major barrier to progress, noting that published comparisons are often subtly biased toward novel methods.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from early 2019, the work predates several important developments: transformer-based architectures (which would soon dominate), SELFIES representations, diffusion models for molecules, and large-scale pretrained chemical language models. The review focuses primarily on drug-like small molecules and does not deeply cover protein design or materials optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper that does not present new experimental results. The paper catalogs 13 publicly available datasets used across the reviewed works:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>977M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC15</td>
          <td>750M+</td>
          <td>Commercially available compounds</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td>50M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>2M</td>
          <td>Curated bioactive molecules</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>QM9</td>
          <td>133,885</td>
          <td>Small organic molecules with DFT properties</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>PubChemQC</td>
          <td>3.98M</td>
          <td>PubChem compounds with DFT data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review provides mathematical derivations for MLE training (Eq. 1), VAE ELBO (Eqs. 9-13), AAE objectives (Eqs. 15-16), GAN objectives (Eqs. 19-22), WGAN (Eq. 24), REINFORCE gradient (Eq. 7), and numerous reward function formulations (Eqs. 26-36).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Key evaluation frameworks discussed:</p>
<ul>
<li><a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (molecular analog of FID)</li>
<li><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> benchmarking platform</li>
<li><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmarking suite</li>
<li>Validity rate, uniqueness, novelty, and internal diversity metrics</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Elton, D. C., Boukouvalas, Z., Fuge, M. D., &amp; Chung, P. W. (2019). Deep Learning for Molecular Design: A Review of the State of the Art. <em>Molecular Systems Design &amp; Engineering</em>, 4(4), 828-849. <a href="https://doi.org/10.1039/C9ME00039A">https://doi.org/10.1039/C9ME00039A</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{elton2019deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep Learning for Molecular Design -- A Review of the State of the Art}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Elton, Daniel C. and Boukouvalas, Zois and Fuge, Mark D. and Chung, Peter W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Molecular Systems Design \&amp; Engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{828--849}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C9ME00039A}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Re-evaluating Sample Efficiency in Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</guid><description>Thomas et al. re-evaluate generative model benchmarks for de novo drug design, adding property filters and diversity metrics that re-rank model performance.</description><content:encoded><![CDATA[<h2 id="an-empirical-re-evaluation-of-generative-model-benchmarks">An Empirical Re-evaluation of Generative Model Benchmarks</h2>
<p>This is an <strong>Empirical</strong> paper. The primary contribution is a critical reassessment of the <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO)</a> benchmark for de novo molecule generation. Rather than proposing a new generative model, the authors modify existing benchmark metrics to account for chemical desirability (molecular weight, LogP, topological novelty) and molecular diversity. They then re-evaluate all 25 generative models from the original PMO benchmark plus the recently proposed <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb (AHC)</a> method.</p>
<h2 id="sample-efficiency-and-chemical-quality-in-drug-design">Sample Efficiency and Chemical Quality in Drug Design</h2>
<p>Deep generative models for de novo molecule generation often require large numbers of oracle evaluations (up to $10^5$ samples) to optimize toward a target objective. This is a practical limitation when using computationally expensive scoring functions like molecular docking. The <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> by Gao et al. addressed this by reformulating performance as maximizing an objective within a fixed budget of 10,000 oracle calls, finding <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> to be the most sample-efficient model across 23 tasks.</p>
<p>However, the authors identify a key limitation: the PMO benchmark measures only sample efficiency without considering the chemical quality of proposed molecules. Investigating the top-performing REINVENT model on the <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> task, they find that 4 of 5 replicate runs produce molecules with molecular weight and LogP distributions far outside the training data (ZINC250k). The resulting molecules contain large structures with repeating substructures that are undesirable from a medicinal chemistry perspective. This disconnect between benchmark performance and practical utility motivates the modified evaluation metrics.</p>
<h2 id="modified-metrics-property-filters-and-diversity-requirements">Modified Metrics: Property Filters and Diversity Requirements</h2>
<p>The core innovation is the introduction of three modified AUC Top-10 metrics that extend the original PMO benchmark evaluation:</p>
<p><strong>AUC Top-10 (Filtered)</strong>: Molecules are excluded if their molecular weight or LogP falls beyond 4 standard deviations from the mean of the ZINC250k pre-training dataset ($\mu \pm 4\sigma$, covering approximately 99.99% of a normal distribution). Molecules with more than 10% de novo (unobserved in ZINC250k) ECFP4 fingerprint bits are also filtered out. This ensures the generative model does not drift beyond its applicability domain.</p>
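<p>The property filter is a straightforward applicability-domain check. A sketch of the MW/LogP part (the function name and the statistics are illustrative, not the actual ZINC250k values):</p>

```python
def within_domain(mw, logp, stats, n_sigma=4.0):
    """Keep a molecule only if MW and LogP both fall within
    mu +/- n_sigma * sigma of the pre-training set statistics."""
    for value, (mu, sigma) in ((mw, stats["mw"]), (logp, stats["logp"])):
        if abs(value - mu) > n_sigma * sigma:
            return False
    return True

# Illustrative (mean, std) pairs, not actual ZINC250k statistics:
stats = {"mw": (330.0, 62.0), "logp": (2.5, 1.4)}
ok = within_domain(mw=350.0, logp=3.0, stats=stats)
too_big = within_domain(mw=900.0, logp=3.0, stats=stats)
```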
<p><strong>AUC Top-10 (Diverse)</strong>: The top 10 molecules are selected iteratively, where a molecule is only added if its Tanimoto similarity (by ECFP4 fingerprints) to any previously selected compound does not exceed 0.35. This threshold corresponds to an approximately 80% probability that more-similar molecules belong to the same bioactivity class, enforcing that distinct candidates possess different profiles.</p>
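<p>The diverse selection is a greedy filter over score-ranked molecules. A minimal sketch, assuming fingerprints are bit sets and using the 0.35 Tanimoto threshold from the paper (function name and toy data hypothetical):</p>

```python
def select_diverse_top_k(scored_fps, k=10, threshold=0.35):
    """Greedily pick the top-k scored molecules, skipping any whose
    Tanimoto similarity to an already-selected one exceeds threshold."""
    def tanimoto(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 1.0

    selected = []
    for score, fp in sorted(scored_fps, key=lambda t: -t[0]):
        if all(tanimoto(fp, s) <= threshold for _, s in selected):
            selected.append((score, fp))
        if len(selected) == k:
            break
    return selected

pool = [(0.9, frozenset({1, 2, 3})),
        (0.8, frozenset({1, 2, 4})),   # Tanimoto 0.5 to the first: skipped
        (0.7, frozenset({7, 8, 9}))]
picked = select_diverse_top_k(pool, k=2)
```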
<p><strong>AUC Top-10 (Combined)</strong>: Applies both property filters and diversity filters simultaneously, providing the most stringent evaluation of practical performance.</p>
<h2 id="benchmark-setup-and-generative-models-evaluated">Benchmark Setup and Generative Models Evaluated</h2>
<h3 id="implementation-details">Implementation Details</h3>
<p>The authors re-implement the PMO benchmark using the original code and data (MIT license) with no changes beyond adding AHC and the new metrics. For Augmented Hill-Climb, the architecture follows REINVENT: an embedding layer of size 128 and 3 layers of Gated Recurrent Units (GRU) with size 512. The prior is trained on ZINC250k using SMILES notation with batch size 128 for 5 epochs.</p>
<p>Two AHC variants are benchmarked:</p>
<ul>
<li><strong>SMILES-AHC</strong>: Hyperparameters optimized via the standard PMO procedure, yielding batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>SMILES-AHC*</strong>: Uses $\sigma = 60$, chosen based on prior knowledge that lower $\sigma$ values maintain better regularization and chemical quality</li>
</ul>
<p>Both omit diversity filters and non-unique penalization for standardized comparison, despite these being shown to improve performance in prior work.</p>
<h3 id="models-compared">Models Compared</h3>
<p>The benchmark includes 25 generative models from the original PMO paper spanning diverse architectures: REINVENT (RNN + RL), Graph GA (graph-based genetic algorithm), GP BO (Gaussian process Bayesian optimization), SMILES GA (SMILES-based genetic algorithm), SELFIES-based VAEs, and others. The 23 objective tasks derive primarily from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark.</p>
<h2 id="re-ranked-results-and-augmented-hill-climb-performance">Re-ranked Results and Augmented Hill-Climb Performance</h2>
<p>The modified metrics substantially re-order the ranking of generative models:</p>
<ol>
<li>
<p><strong>SMILES-AHC* achieves top performance on AUC Top-10 (Combined)</strong>, where both property filters and diversity are enforced. The use of domain-informed hyperparameter selection ($\sigma = 60$) proves critical.</p>
</li>
<li>
<p><strong>SMILES-AHC (data-driven hyperparameters) ranks first</strong> when accounting for property filters alone, diversity alone, or both combined, demonstrating that the AHC algorithm itself provides strong performance even without manual tuning.</p>
</li>
<li>
<p><strong>REINVENT retains its first-place rank under property filters alone</strong>, suggesting that the minority of compounds staying within acceptable property space still perform well. However, it drops when diversity is also required.</p>
</li>
<li>
<p><strong>Evolutionary algorithms (Graph GA, GP BO, SMILES GA) drop significantly</strong> under the new metrics. This is expected because rule-based methods are not constrained by the ZINC250k distribution and tend to propose molecules that diverge from drug-like chemical space.</p>
</li>
<li>
<p><strong>Both AHC variants excel on empirically difficult tasks</strong>, including isomer-based tasks, Zaleplon MPO, and Sitagliptin MPO, where other methods struggle.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Results are preliminary because generative models have not undergone hyperparameter optimization against the new metrics</li>
<li>Property filter thresholds are subjective, and the 10% de novo ECFP4 bit threshold was chosen by visual inspection</li>
<li>Comparing rule-based models against distribution-based models using ZINC250k similarity introduces a bias toward distribution-based approaches</li>
<li>Six objective task reference molecules sit in the lowest 0.01% of ZINC250k property space, raising questions about whether distribution-based models can reasonably optimize for these objectives</li>
<li>Property filters and diversity could alternatively be incorporated directly into the objective function as additional oracles, though this would not necessarily produce the same results</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC250k</td>
          <td>~250K molecules</td>
          <td>Subset of ZINC15, provided by PMO benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO</a> benchmark tasks</td>
          <td>23 objectives</td>
          <td>Derived primarily from <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Augmented Hill-Climb</strong>: RL strategy from Thomas et al. (2022), patience of 5</li>
<li><strong>Hyperparameters (SMILES-AHC)</strong>: batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>Hyperparameters (SMILES-AHC*)</strong>: $\sigma = 60$ (domain-informed selection)</li>
<li><strong>Prior training</strong>: 5 epochs, batch size 128, SMILES notation</li>
<li><strong>Oracle budget</strong>: 10,000 evaluations per task</li>
<li><strong>Replicates</strong>: 5 per model per task</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Embedding (128) + 3x GRU (512), following REINVENT</li>
<li><strong>All 25 PMO benchmark models</strong> re-evaluated using original implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-10 (Original)</td>
          <td>Area under curve of average top 10 molecules</td>
          <td>Standard PMO metric</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Filtered)</td>
          <td>Original with MW/LogP and ECFP4 novelty filters</td>
          <td>$\mu \pm 4\sigma$ from ZINC250k</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Diverse)</td>
          <td>Top 10 selected with Tanimoto &lt; 0.35 diversity</td>
          <td>ECFP4 fingerprints</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Combined)</td>
          <td>Both filters and diversity applied</td>
          <td>Most stringent metric</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The benchmark uses 10,000 oracle evaluations per task with 5 replicates, which is computationally modest compared to standard generative model training.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Scoring and benchmarking framework by the first author</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original benchmark code and data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Re-evaluating sample efficiency in de novo molecule generation. <em>arXiv preprint arXiv:2212.01385</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{thomas2022reevaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Re-evaluating sample efficiency in de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2212.01385}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2212.01385}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Protein-to-Drug Molecule Translation via Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</guid><description>A Transformer model frames protein-targeted drug generation as machine translation from amino acid sequences to SMILES molecular strings.</description><content:encoded><![CDATA[<h2 id="protein-targeted-drug-generation-as-machine-translation">Protein-Targeted Drug Generation as Machine Translation</h2>
<p>This is a <strong>Method</strong> paper that proposes using the Transformer neural network architecture for protein-specific de novo drug generation. The primary contribution is framing the problem of generating molecules that bind to a target protein as a machine translation task: translating from the &ldquo;language&rdquo; of amino acid sequences to the SMILES representation of candidate drug molecules. The model takes only a protein&rsquo;s amino acid sequence as input and generates novel molecules with predicted binding affinity, requiring no prior knowledge of active ligands, physicochemical descriptors, or the protein&rsquo;s three-dimensional structure.</p>
<h2 id="limitations-of-existing-generative-drug-design-approaches">Limitations of Existing Generative Drug Design Approaches</h2>
<p>Existing deep learning methods for de novo molecule generation suffer from several limitations. Most RNN-based approaches require a library of known active compounds against the target protein to fine-tune the generator or train a reward predictor for reinforcement learning. Structure-based drug design methods require the three-dimensional structure of the target protein, which can be costly and technically difficult to obtain through protein expression, purification, and crystallization. Autoencoder-based approaches (variational and adversarial) similarly depend on prior knowledge of protein binders or their physicochemical characteristics.</p>
<p>The estimated drug-like molecule space is on the order of $10^{60}$, while only around $10^{8}$ compounds have been synthesized. High-throughput screening is expensive and time-consuming, and virtual screening operates only on known molecules. Computational de novo design methods often generate molecules that are hard to synthesize or restrict accessible chemical space through coded rules. A method that requires only a protein&rsquo;s amino acid sequence would substantially simplify the initial stages of drug discovery, particularly for targets with limited or no information about inhibitors and 3D structure.</p>
<h2 id="sequence-to-sequence-translation-with-self-attention">Sequence-to-Sequence Translation with Self-Attention</h2>
<p>The core insight is to treat protein-targeted drug generation as a translation problem between two &ldquo;languages,&rdquo; applying the Transformer architecture that had demonstrated strong results in neural machine translation. The encoder maps a protein amino acid sequence $(a_1, \ldots, a_n)$ to continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$, and the decoder autoregressively generates a SMILES string conditioned on $\mathbf{z}$.</p>
<p>The self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimensionality of the keys and $\sqrt{d_k}$ is the scaling factor. Multihead attention runs $h$ parallel attention heads:</p>
<p>$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$</p>
<p>$$
\text{Multihead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$</p>
<p>Positional encoding uses sinusoidal functions:</p>
<p>$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
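<p>As a concrete illustration (a minimal NumPy sketch of the standard formulas above, not the paper's tensor2tensor implementation), scaled dot-product attention and the sinusoidal positional encoding can be written as:</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))  # assumes even d_model
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy example: 5 positions of width 8 (made-up sizes, not the paper's)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(X, X, X)
pe = sinusoidal_positional_encoding(max_len=16, d_model=8)
```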
<p>The self-attention mechanism is particularly well-suited for this task for two reasons. First, protein sequences can be much longer than SMILES strings (dozens of times longer), making the ability to capture long-range dependencies essential. Second, three-dimensional structural features of the binding pocket may be formed by amino acid residues far apart in the linear sequence, and multihead attention can jointly attend to different positional aspects simultaneously.</p>
<h2 id="data-model-architecture-and-docking-evaluation">Data, Model Architecture, and Docking Evaluation</h2>
<h3 id="data">Data</h3>
<p>The training data was retrieved from BindingDB, filtering for interactions between proteins from Homo sapiens, Rattus norvegicus, Mus musculus, and Bos taurus with binding affinity below 100 nM (IC50, Kd, or EC50). After filtering for valid PubChem CIDs, SMILES representations, UniProt IDs, molecular weight under 1000 Da, and amino acid sequence lengths between 80 and 2050, the final dataset contained 238,147 records with 1,613 unique proteins and 154,924 unique ligand SMILES strings.</p>
<p>Five Monte Carlo cross-validation splits were created, with the constraint that test set proteins share less than 20% sequence similarity with training set proteins (measured via <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment).</p>
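<p>A minimal Needleman-Wunsch global alignment makes the similarity constraint concrete; this sketch uses illustrative scoring (+1 match, -1 mismatch, -1 gap), not the EMBOSS parameters used in the paper:</p>

```python
def needleman_wunsch_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b, then return percent identity over the alignment."""
    n, m = len(a), len(b)
    # Fill the DP score matrix
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trace back to count matched positions and total alignment length
    i, j, matches, length = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 100.0 * matches / length

identity = needleman_wunsch_identity("HEAGAWGHEE", "HEAGAWGHE")
```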
<h3 id="model-configuration">Model Configuration</h3>
<p>The model uses the original Transformer implementation via the tensor2tensor library with:</p>
<ul>
<li>4 encoder/decoder layers of size 128</li>
<li>4 attention heads</li>
<li>Adam optimizer with learning rate decay from the original Transformer paper</li>
<li>Batch size of 4,096 tokens</li>
<li>Training for 600K epochs on a single GPU in Google Colaboratory</li>
<li>Vocabulary of 71 symbols (character-level tokenization)</li>
</ul>
<p>Beam search decoding was used with two modes: beam size 4 keeping only the top-1 result (&ldquo;one per one&rdquo; mode) and beam size 10 keeping all 10 results (&ldquo;ten per one&rdquo; mode).</p>
<h3 id="chemical-validity-and-uniqueness">Chemical Validity and Uniqueness</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>One per One (avg)</th>
          <th>Ten per One (avg)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES (%)</td>
          <td>90.2</td>
          <td>82.6</td>
      </tr>
      <tr>
          <td>Unique SMILES (%)</td>
          <td>92.3</td>
          <td>81.7</td>
      </tr>
      <tr>
          <td>ZINC15 match (%)</td>
          <td>30.6</td>
          <td>17.1</td>
      </tr>
  </tbody>
</table>
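<p>These metrics are straightforward to compute with RDKit (which the paper uses for validity checking and canonicalization); a sketch over a toy batch of generated strings, where uniqueness is measured after canonicalization:</p>

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.error")  # silence parse warnings for invalid SMILES

def validity_and_uniqueness(smiles_list):
    """Percent valid SMILES, and percent unique among the valid ones
    (uniqueness measured on canonical SMILES)."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]
    pct_valid = 100.0 * len(valid) / len(smiles_list)
    pct_unique = 100.0 * len(set(valid)) / len(valid) if valid else 0.0
    return pct_valid, pct_unique

# Toy batch: three valid strings (two are the same molecule), one invalid
batch = ["c1ccccc1", "C1=CC=CC=C1", "CCO", "C1CC"]  # last one: unclosed ring
pct_valid, pct_unique = validity_and_uniqueness(batch)
```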
<h3 id="docking-evaluation">Docking Evaluation</h3>
<p>To assess binding affinity, the authors selected two receptor tyrosine kinases from the test set (IGF-1R and VEGFR2) and performed molecular docking with <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a>. Four sets of ligands were compared: known binders, randomly selected compounds, molecules generated for the target protein, and molecules generated for other targets (cross-docking control).</p>
<p>ROC-AUC analysis showed that the docking tool classified generated molecules for the correct target as binders at rates comparable to known binders. For the best-discriminating structures (PDB 3O23 for IGF-1R, PDB 3BE2 for VEGFR2), Mann-Whitney U tests confirmed statistically significant differences between generated-for-target molecules and random compounds, while the difference between generated-for-target and known binders was not significant (p = 0.40 and 0.26 respectively), suggesting the model generates plausible binders.</p>
<h3 id="drug-likeness-properties">Drug-Likeness Properties</h3>
<p>Generated molecules were evaluated against <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and other drug-likeness criteria:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Constraint</th>
          <th>One per One (%)</th>
          <th>Ten per One (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>&lt; 5</td>
          <td>84.4</td>
          <td>85.6</td>
      </tr>
      <tr>
          <td>Molecular weight</td>
          <td>&lt; 500 Da</td>
          <td>95.8</td>
          <td>88.9</td>
      </tr>
      <tr>
          <td>H-bond donors</td>
          <td>&lt; 5</td>
          <td>95.8</td>
          <td>91.9</td>
      </tr>
      <tr>
          <td>H-bond acceptors</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>Rotatable bonds</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>91.2</td>
      </tr>
      <tr>
          <td>TPSA</td>
          <td>&lt; 140</td>
          <td>98.0</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>&lt; 6</td>
          <td>99.9</td>
          <td>100.0</td>
      </tr>
  </tbody>
</table>
<p>Mean QED values were 0.66 +/- 0.19 (one per one) and 0.58 +/- 0.21 (ten per one).</p>
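<p>The table's criteria can be checked per molecule with RDKit (also used in the paper for property calculation). A sketch, using aspirin as a stand-in for a generated SMILES; SAS is omitted because it requires RDKit's separate <code>sascorer</code> contrib module:</p>

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def drug_likeness(smiles):
    """Compute the properties from the table above for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # invalid SMILES
    return {
        "logP": Descriptors.MolLogP(mol),
        "mol_wt": Descriptors.MolWt(mol),
        "h_donors": Descriptors.NumHDonors(mol),
        "h_acceptors": Descriptors.NumHAcceptors(mol),
        "rot_bonds": Descriptors.NumRotatableBonds(mol),
        "tpsa": Descriptors.TPSA(mol),
        "qed": QED.qed(mol),
    }

props = drug_likeness("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example input
passes_ro5 = (props["logP"] < 5 and props["mol_wt"] < 500
              and props["h_donors"] < 5 and props["h_acceptors"] < 10)
```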
<h3 id="structural-novelty">Structural Novelty</h3>
<p>Tanimoto similarity analysis showed that only 8% of generated structures had similarity above the threshold (&gt; 0.85) to training compounds. The majority (51%) had Tanimoto scores below 0.5. The mean nearest-neighbor Tanimoto similarity of generated molecules to the training set (0.54 +/- 0.17 in one-per-one mode) was substantially lower than the mean within-training-set similarity (0.74 +/- 0.14), indicating the model generates structurally diverse molecules outside the training distribution.</p>
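<p>The nearest-neighbor analysis reduces to Tanimoto similarity between fingerprints. A pure-Python sketch with toy bit sets standing in for the Morgan fingerprints the paper computes via RDKit:</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints represented as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_nearest_neighbor_similarity(generated, training):
    """For each generated fingerprint, take its maximum similarity to any
    training fingerprint, then average: the novelty statistic reported above."""
    nearest = [max(tanimoto(g, t) for t in training) for g in generated]
    return sum(nearest) / len(nearest)

# Toy fingerprints: one "memorized" molecule and one genuinely novel one
training = [{1, 2, 3, 4}, {2, 3, 5, 8}]
generated = [{1, 2, 3, 4}, {6, 7, 9}]
score = mean_nearest_neighbor_similarity(generated, training)
```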
<h2 id="generated-molecules-show-drug-like-properties-and-predicted-binding">Generated Molecules Show Drug-Like Properties and Predicted Binding</h2>
<p>The model generates roughly 90% chemically valid SMILES in one-per-one mode, with 92% uniqueness. Docking simulations on IGF-1R and VEGFR2 suggest that generated molecules for the correct target are statistically indistinguishable from known binders, while molecules generated for other targets behave more like random compounds. Drug-likeness properties fall within acceptable ranges for the vast majority of generated compounds.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Only two protein targets were analyzed via docking due to computational constraints, and the analysis was limited to proteins with a single well-known druggable binding pocket.</li>
<li>Beam search produces molecules that differ only slightly; diverse beam search or coupling with variational/adversarial autoencoders could improve diversity.</li>
<li>The fraction of molecules matching the ZINC15 database (30.6% in one-per-one mode) could potentially be reduced by pretraining on a larger compound set (e.g., ChEMBL&rsquo;s 1.5 million molecules).</li>
<li>Model interpretability remains limited and is identified as important future work.</li>
<li>The approach is a proof of concept and requires further validation via in vitro assays across diverse protein targets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Test</td>
          <td>BindingDB (filtered)</td>
          <td>238,147 records</td>
          <td>1,613 unique proteins, 154,924 unique SMILES; IC50/Kd/EC50 &lt; 100 nM</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>11 (IGF-1R), 20 (VEGFR2)</td>
          <td>SMINA docking with default settings</td>
      </tr>
      <tr>
          <td>Database matching</td>
          <td>ZINC15</td>
          <td>N/A</td>
          <td>Used for novelty assessment</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer (encoder-decoder) via tensor2tensor library</li>
<li>Beam search decoding (beam sizes 4 and 10)</li>
<li>Needleman-Wunsch global alignment for protein sequence similarity (EMBOSS)</li>
<li>SMINA for molecular docking</li>
<li>RDKit for validity checking, property calculation, and canonicalization</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4 layers, 128 hidden size, 4 attention heads</li>
<li>Character-level tokenization with 71-symbol vocabulary</li>
<li>5-fold Monte Carlo cross-validation with &lt; 20% sequence similarity between train/test proteins</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES</td>
          <td>90.2% (1-per-1), 82.6% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>Unique SMILES</td>
          <td>92.3% (1-per-1), 81.7% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>ZINC15 match</td>
          <td>30.6% (1-per-1), 17.1% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.66 +/- 0.19 (1-per-1), 0.58 +/- 0.21 (10-per-1)</td>
          <td>Drug-likeness score</td>
      </tr>
      <tr>
          <td>SAS compliance</td>
          <td>99.9% (1-per-1), 100% (10-per-1)</td>
          <td>SAS &lt; 6</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Google Colaboratory with one GPU</li>
<li>Training for 600K epochs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dariagrechishnikova/molecule_structure_generation">molecule_structure_generation</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation using tensor2tensor</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grechishnikova, D. (2021). Transformer neural network for protein-specific de novo drug generation as a machine translation problem. <em>Scientific Reports</em>, 11, 321. <a href="https://doi.org/10.1038/s41598-020-79682-4">https://doi.org/10.1038/s41598-020-79682-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grechishnikova2021transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer neural network for protein-specific de novo drug generation as a machine translation problem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grechishnikova, Daria}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{321}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-020-79682-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PrefixMol: Prefix Embeddings for Drug Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</guid><description>PrefixMol uses prefix embeddings in a GPT SMILES generator to jointly condition on protein pockets and chemical properties for drug design.</description><content:encoded><![CDATA[<h2 id="unified-multi-conditional-molecular-generation">Unified Multi-Conditional Molecular Generation</h2>
<p>PrefixMol is a <strong>Method</strong> paper that introduces a unified generative model for structure-based drug design that simultaneously conditions on protein binding pockets and multiple chemical properties. The primary contribution is a prefix-embedding mechanism, borrowed from NLP multi-task learning, that represents each condition (pocket geometry, Vina score, QED, SA, LogP, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski</a>) as a learnable feature vector prepended to the input sequence of a GPT-based <a href="/notes/computational-chemistry/molecular-representations/smiles-original-paper/">SMILES</a> generator. This allows a single model to handle customized multi-conditional generation without the negative transfer that typically arises from merging separate task-specific models.</p>
<h2 id="bridging-target-aware-and-chemistry-aware-molecular-design">Bridging Target-Aware and Chemistry-Aware Molecular Design</h2>
<p>Prior structure-based drug design methods (e.g., Pocket2Mol, GraphBP) generate molecules conditioned on protein binding pockets but impose no constraints on the chemical properties of the output. Conversely, controllable molecule generation methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a>, CMG) can steer chemical properties but ignore protein-ligand interactions. Merging these two objectives into a single model is difficult for two reasons:</p>
<ol>
<li><strong>Data scarcity</strong>: Few datasets contain both protein-ligand binding affinity data and comprehensive molecular property annotations.</li>
<li><strong>Negative transfer</strong>: Treating each condition as a separate task in a multi-task framework can hurt overall performance when tasks conflict.</li>
</ol>
<p>PrefixMol addresses both problems by extending the CrossDocked dataset with molecular property labels and using a parameter-efficient prefix conditioning strategy that decouples task-specific knowledge from the shared generative backbone.</p>
<h2 id="prefix-conditioning-in-attention-layers">Prefix Conditioning in Attention Layers</h2>
<p>The core innovation adapts prefix-tuning from NLP to molecular generation. Given a GPT transformer that generates SMILES token-by-token, PrefixMol prepends $n_c$ learnable condition vectors $\mathbf{p}_{\phi} \in \mathbb{R}^{n_c \times d}$ to the left of the sequence embedding $\mathbf{x} \in \mathbb{R}^{l \times d}$, forming an extended input $\mathbf{x}' = [\text{PREFIX}; \mathbf{x}]$.</p>
<p>The output of each position is:</p>
<p>$$
h_i = \begin{cases} p_{\phi,i}, &amp; \text{if } i &lt; n_c \\ \text{LM}_\theta(x'_i, h_{&lt;i}), &amp; \text{otherwise} \end{cases}
$$</p>
<p>Because the prefix features always sit to the left, the causal attention mask ensures they influence all subsequent token predictions. The key insight is that the attention mechanism decomposes into a weighted sum of self-attention and prefix attention:</p>
<p>$$
\begin{aligned}
\text{head} &amp;= (1 - \lambda(\mathbf{x})) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{c}\mathbf{W}_k, \mathbf{c}\mathbf{W}_v)}_{\text{self-attention}} \\
&amp;\quad + \lambda(\mathbf{x}) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{p}_\phi\mathbf{W}_k, \mathbf{p}_\phi\mathbf{W}_v)}_{\text{prefix attention}}
\end{aligned}
$$</p>
<p>where $\mathbf{c}$ denotes the ordinary token context attended over and $\lambda(\mathbf{x})$ is a scalar representing the normalized attention weight on the prefix positions. This decomposition shows that conditions modulate generation through an additive attention pathway, and the activation map $\text{softmax}(\mathbf{x}\mathbf{W}_q \mathbf{W}_k^\top \mathbf{p}_\phi^\top)$ directly reveals how each condition steers model behavior.</p>
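<p>The decomposition is exact and easy to verify numerically: attention over the concatenated sequence equals the $\lambda$-weighted mixture of the two terms, with $\lambda$ being the total softmax mass on the prefix positions. A NumPy sketch with random toy matrices (keys/values are assumed already projected, and the $\sqrt{d_k}$ scaling is omitted for brevity):</p>

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(q, K, V):
    return softmax(q @ K.T) @ V

rng = np.random.default_rng(0)
d, n_c, l = 8, 3, 5                 # toy sizes, not the paper's
P = rng.normal(size=(n_c, d))       # prefix keys/values
C = rng.normal(size=(l, d))         # sequence keys/values
q = rng.normal(size=(1, d))         # one query position

# Full attention over the concatenated [PREFIX; x] sequence
full = attn(q, np.vstack([P, C]), np.vstack([P, C]))

# lambda = normalized attention mass on the prefix positions
w = softmax(q @ np.vstack([P, C]).T)
lam = w[:, :n_c].sum(axis=-1, keepdims=True)

# Weighted sum of self-attention and prefix attention reproduces it exactly
decomposed = (1 - lam) * attn(q, C, C) + lam * attn(q, P, P)
assert np.allclose(full, decomposed)
```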
<p><strong>Condition correlation</strong> is similarly revealed. For the prefix features themselves, the causal mask zeros out the cross-attention to the sequence, leaving only the prefix self-correlation term:</p>
<p>$$
\text{head} = \text{Attn}(\mathbf{p}_\phi \mathbf{W}_q, \mathbf{p}_\phi \mathbf{W}_k, \mathbf{p}_\phi \mathbf{W}_v)
$$</p>
<p>The attention map $\mathbf{A}(\mathbf{p}_\phi)$ from this term encodes how conditions relate to one another.</p>
<h3 id="condition-encoders">Condition Encoders</h3>
<p>Each condition has a dedicated encoder:</p>
<ul>
<li><strong>3D Pocket</strong>: A Geometric Vector Transformer (GVF) processes the binding pocket as a 3D graph with SE(3)-equivariant node and edge features. GVF extends GVP-GNN with a global attention module over geometric features. A position-aware attention mechanism with radial basis functions produces the pocket embedding.</li>
<li><strong>Chemical properties</strong>: Separate MLPs embed each scalar property (Vina, QED, SA, LogP, Lipinski) into the shared $d$-dimensional space.</li>
</ul>
<h3 id="training-objective">Training Objective</h3>
<p>PrefixMol is trained with two losses. The auto-regressive loss is:</p>
<p>$$
\mathcal{L}_{AT} = -\sum_{1 &lt; i \leq t} \log p_{\phi, \theta}(x_i \mid \mathbf{x}_{&lt;i}, \mathbf{p}_\phi)
$$</p>
<p>A triplet property prediction loss encourages generated molecules to match desired properties:</p>
<p>$$
\mathcal{L}_{Pred} = \max\left((\hat{\mathbf{c}} - \mathbf{c})^2 - (\hat{\mathbf{c}} - \dot{\mathbf{c}})^2, 0\right)
$$</p>
<p>where $\mathbf{c}$ is the input condition, $\hat{\mathbf{c}}$ is predicted by an MLP head, and $\dot{\mathbf{c}}$ is computed by RDKit from the generated SMILES (gradient is propagated through $\hat{\mathbf{c}}$ since RDKit is non-differentiable).</p>
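<p>Both losses are simple to write down. A NumPy sketch with made-up token probabilities and property vectors (the real model computes these with learned networks and propagates gradients only through $\hat{\mathbf{c}}$; here the triplet term is applied element-wise and summed):</p>

```python
import numpy as np

def autoregressive_nll(token_probs):
    """L_AT: negative sum of log-probabilities the model assigned to the
    generated tokens, conditioned on the prefix and preceding tokens."""
    return -np.sum(np.log(token_probs))

def triplet_property_loss(c, c_hat, c_dot):
    """L_Pred = max((c_hat - c)^2 - (c_hat - c_dot)^2, 0).
    c: input condition; c_hat: MLP-predicted; c_dot: RDKit-computed (detached)."""
    return np.maximum((c_hat - c) ** 2 - (c_hat - c_dot) ** 2, 0.0).sum()

probs = np.array([0.9, 0.8, 0.95])   # p(x_i | x_<i, prefix) per token (toy)
c     = np.array([0.7, 0.5])         # desired QED, SA (toy values)
c_hat = np.array([0.65, 0.55])       # predicted by the MLP head
c_dot = np.array([0.60, 0.52])       # computed by RDKit from the SMILES

loss = autoregressive_nll(probs) + triplet_property_loss(c, c_hat, c_dot)
```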
<h2 id="experimental-setup-and-controllability-evaluation">Experimental Setup and Controllability Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use the CrossDocked dataset (22.5 million protein-ligand structures) with chemical properties appended for each ligand. Data splitting and evaluation follow Pocket2Mol and Masuda et al.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li><strong>Vina score</strong> (binding affinity, computed by QVina after UFF refinement)</li>
<li><strong>QED</strong> (quantitative estimate of drug-likeness, 0-1)</li>
<li><strong>SA</strong> (synthetic accessibility, 0-1)</li>
<li><strong>LogP</strong> (octanol-water partition coefficient)</li>
<li><strong>Lipinski</strong> (rule-of-five compliance count)</li>
<li><strong>High Affinity</strong> (fraction of pockets where generated molecules match or exceed test set affinities)</li>
<li><strong>Diversity</strong> (average pairwise Tanimoto distance over Morgan fingerprints)</li>
<li><strong>Sim.Train</strong> (maximum Tanimoto similarity to training set)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>Unconditional comparison against CVAE, AR (Luo et al. 2021a), and Pocket2Mol.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Unconditional generation</strong> (Table 1): PrefixMol without conditions achieves sub-optimal results on Vina (-6.532), QED (0.551), SA (0.750), and LogP (1.415) compared to Pocket2Mol. However, it substantially outperforms all baselines on diversity (0.856 vs. 0.688 for Pocket2Mol) and novelty (Sim.Train of 0.239 vs. 0.376), indicating it generates genuinely novel molecules rather than memorizing training data.</p>
<p><strong>Single-property control</strong> (Table 2): Molecular properties are positively correlated with conditional inputs across VINA, QED, SA, LogP, and Lipinski. With favorable control scales, PrefixMol surpasses Pocket2Mol on QED (0.767 vs. 0.563), SA (0.924 vs. 0.765), and LogP. The Vina score also improves when QED or LogP conditions are increased (e.g., -7.733 at QED control scale +2), revealing coupling between conditions.</p>
<p><strong>Multi-property control</strong> (Table 3): Jointly adjusting all five conditions shows consistent positive relationships. For example, at control scale +4, QED reaches 0.722, SA reaches 0.913, and Lipinski saturates at 5.0. Joint QED+SA control at +2.0 achieves Lipinski = 5.0, confirming that certain properties are coupled.</p>
<h3 id="condition-relation-analysis">Condition Relation Analysis</h3>
<p>By computing partial derivatives of the prefix attention map with respect to each condition, the authors construct a relation matrix $\mathbf{R} = \sum_{i=2}^{6} |\partial \mathbf{A} / \partial c_i|$. Key findings:</p>
<ul>
<li><strong>Vina is weakly self-controllable</strong> but strongly influenced by QED, LogP, and SA, explaining why multi-condition control improves binding affinity even when Vina alone responds poorly.</li>
<li><strong>LogP and QED</strong> are the most correlated property pair.</li>
<li><strong>Lipinski is coupled to QED and SA</strong>, saturating at 5.0 when both QED and SA control scales reach +2.</li>
</ul>
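<p>The relation-matrix construction itself is generic: perturb one condition at a time and accumulate the absolute change in the attention map. A finite-difference sketch on a toy stand-in (the real analysis differentiates the model's prefix attention; <code>toy_attention_map</code> is a made-up function for illustration only):</p>

```python
import numpy as np

def toy_attention_map(c):
    """Stand-in for the prefix self-attention map A as a function of conditions."""
    M = np.outer(c, c)
    e = np.exp(M - M.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relation_matrix(f, c, delta=1.0):
    """R = sum_i |dA/dc_i|, approximated with forward differences
    (delta = 1, matching the paper's first-order approximation)."""
    A0 = f(c)
    R = np.zeros_like(A0)
    for i in range(len(c)):
        c_pert = c.copy()
        c_pert[i] += delta
        R += np.abs(f(c_pert) - A0) / delta
    return R

c = np.array([0.5, 1.0, -0.3, 0.8, 0.2])  # toy values for Vina, QED, SA, LogP, Lipinski
R = relation_matrix(toy_attention_map, c)
```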
<h2 id="key-findings-limitations-and-interpretability-insights">Key Findings, Limitations, and Interpretability Insights</h2>
<p>PrefixMol demonstrates that prefix embedding is an effective strategy for unifying target-aware and chemistry-aware molecular generation. The main findings are:</p>
<ol>
<li>A single prefix-conditioned GPT model can control multiple chemical properties simultaneously while targeting specific protein pockets.</li>
<li>Multi-conditional generation outperforms unconditional baselines in drug-likeness metrics, and the controllability enables PrefixMol to surpass Pocket2Mol on QED, SA, and LogP.</li>
<li>The attention mechanism provides interpretable coupling relationships between conditions, offering practical guidance (e.g., improving QED indirectly improves Vina).</li>
</ol>
<p><strong>Limitations</strong>: The paper does not report validity rates for generated SMILES. The unconditional model underperforms Pocket2Mol on binding affinity (Vina), suggesting that generating 2D SMILES strings and relying on post hoc 3D conformer generation may be less effective than direct atom-by-atom 3D generation for binding affinity optimization. The condition relation analysis uses a first-order finite difference approximation ($\Delta = 1$), which may not capture nonlinear interactions. No external validation on prospective drug discovery tasks is provided. Hardware and training time details are not reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training / Evaluation</td>
          <td>CrossDocked (extended)</td>
          <td>22.5M protein-ligand structures</td>
          <td>Extended with molecular properties (QED, SA, LogP, Lipinski, Vina)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-based auto-regressive SMILES generation with prefix conditioning</li>
<li>GVF (Geometric Vector Transformer) for 3D pocket encoding, extending GVP-GNN with global attention</li>
<li>Separate MLP encoders for each chemical property</li>
<li>Triplet property prediction loss with non-differentiable RDKit-computed properties</li>
<li>QVina for Vina score computation with UFF refinement</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT transformer backbone for SMILES generation</li>
<li>6 prefix condition vectors ($n_c = 6$): Pocket, Vina, QED, SA, LogP, Lipinski</li>
<li>Specific architectural hyperparameters (hidden dimension, number of layers, heads) not reported in the paper</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PrefixMol (unconditional)</th>
          <th>Pocket2Mol</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vina (kcal/mol)</td>
          <td>-6.532</td>
          <td>-7.288</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.551</td>
          <td>0.563</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>SA</td>
          <td>0.750</td>
          <td>0.765</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.856</td>
          <td>0.688</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Sim.Train</td>
          <td>0.239</td>
          <td>0.376</td>
          <td>Lower is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/A4Bio/PrefixMol">PrefixMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, Z., Hu, Y., Tan, C., &amp; Li, S. Z. (2023). PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. <em>arXiv preprint arXiv:2302.07120</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gao2023prefixmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Zhangyang and Hu, Yuqi and Tan, Cheng and Li, Stan Z.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2302.07120}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Link-INVENT: RL-Driven Molecular Linker Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</guid><description>Link-INVENT extends REINVENT for molecular linker design using RNN-based generation and reinforcement learning with flexible multi-parameter scoring.</description><content:encoded><![CDATA[<h2 id="a-method-for-generative-linker-design-with-reinforcement-learning">A Method for Generative Linker Design with Reinforcement Learning</h2>
<p>Link-INVENT is a <strong>Method</strong> paper that introduces a generative model for molecular linker design built on the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo design platform. The primary contribution is an encoder-decoder recurrent neural network (RNN) architecture that generates SMILES-based linkers connecting two molecular subunits, combined with a flexible multi-parameter optimization (MPO) scoring function and reinforcement learning (RL) to steer generation toward desired properties. Link-INVENT targets three practical drug discovery tasks: fragment linking, scaffold hopping, and <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">proteolysis targeting chimera</a> (PROTAC) design.</p>
<h2 id="why-linker-design-needs-flexible-multi-parameter-optimization">Why Linker Design Needs Flexible Multi-Parameter Optimization</h2>
<p>Generating suitable chemical linkers between molecular subunits is a central challenge in <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-based drug discovery</a> (FBDD), scaffold hopping, and PROTAC design. Traditional computational approaches rely on database searches, which inherently restrict proposed linkers to a pre-defined collection. Recent deep learning methods (DeLinker, SyntaLinker, 3DLinker, DiffLinker) can generate novel linkers but offer limited support for optimizing specific physicochemical properties. Users can typically control only linker length and a few properties like hydrogen-bond donor count.</p>
<p>The key gaps that Link-INVENT addresses are:</p>
<ol>
<li><strong>Conditioning on both subunits</strong>: Prior RNN-based approaches (SAMOA) generate linkers conditioned only on the SMILES sequence seen so far, which may not account for the second molecular subunit. Link-INVENT conditions on both warheads simultaneously.</li>
<li><strong>Flexible scoring</strong>: Existing DL-based linker design tools lack the ability to define tailored MPO objectives. Link-INVENT inherits <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 4&rsquo;s</a> full scoring infrastructure and adds linker-specific properties.</li>
<li><strong>Generalizability</strong>: A single trained prior handles fragment linking, scaffold hopping, and PROTAC tasks without retraining.</li>
</ol>
<h2 id="core-innovation-conditional-linker-generation-with-augmented-likelihood-rl">Core Innovation: Conditional Linker Generation with Augmented Likelihood RL</h2>
<p>Link-INVENT&rsquo;s architecture is an encoder-decoder RNN adapted from the Lib-INVENT library design model. The encoder processes a pair of warheads (molecular subunits with defined exit vectors), and the decoder generates a linker token by token, yielding a connected molecule in SMILES format. The model uses three hidden layers of 512 LSTM cells with an embedding size of 256.</p>
<h3 id="training">Training</h3>
<p>The prior is trained on ChEMBL v27 data processed through reaction-based slicing to generate (linker, warheads pair, full molecule) tuples. <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a> augments the training data at each epoch, improving chemical space generalizability. The prior is trained by maximizing the likelihood of generating a linker conditioned on the input warhead pair, with teacher forcing for stability.</p>
<h3 id="multi-parameter-optimization-via-rl">Multi-Parameter Optimization via RL</h3>
<p>The scoring function $S(x)$ is a weighted geometric mean of individual component scores:</p>
<p>$$
S(x) = \left(\prod_{i=1}^{n} C_{i}(x)^{w_{i}}\right)^{\frac{1}{\sum_{i=1}^{n} w_{i}}}
$$</p>
<p>where $x$ is a sampled linked molecule, $C_{i}(x)$ is the score for the $i$-th component, and $w_{i}$ is its weight.</p>
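<p>As a concrete illustration, the weighted geometric mean can be computed directly from component scores in $[0, 1]$ (a minimal sketch; the component values and weights below are illustrative, not taken from the paper):</p>

```python
import math

def weighted_geometric_mean(scores, weights):
    """Weighted geometric mean of component scores in [0, 1].

    A zero on any component zeroes the total score, which is the
    intended "all objectives must be satisfied" behaviour.
    """
    if any(s == 0.0 for s in scores):
        return 0.0
    total_weight = sum(weights)
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights))
    return math.exp(log_sum / total_weight)

# Three hypothetical components, e.g. a docking-derived score,
# a linker MW score, and a length-ratio score, weighted 2:1:1.
overall = weighted_geometric_mean([0.9, 0.5, 0.8], [2.0, 1.0, 1.0])
```

<p>The geometric (rather than arithmetic) mean penalizes molecules that fail any single objective, which is why a full diversity-filter bucket can veto a molecule by assigning it a zero.</p>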
<p>The agent (initialized as a copy of the prior) is updated via the Difference of Augmented and Posterior likelihoods (DAP) loss. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented log likelihood</a> is:</p>
<p>$$
\log \pi_{\text{augmented}} = \log \pi_{\text{prior}} + \sigma \cdot S(x)
$$</p>
<p>where $\pi$ denotes a policy (token sampling probabilities conditioned on the sequence so far) and $\sigma$ is a scalar factor. The loss function is:</p>
<p>$$
J(\theta) = \left(\log \pi_{\text{augmented}} - \log \pi_{\text{agent}}\right)^{2}
$$</p>
<p>Minimizing $J(\theta)$ steers the agent to generate molecules that satisfy the scoring function while remaining anchored to the prior&rsquo;s chemical space.</p>
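<p>A minimal sketch of the per-sequence DAP loss (in practice the log-likelihoods come from the prior and agent RNNs, the loss is averaged over a batch, and gradients are taken with respect to the agent&rsquo;s parameters; the default $\sigma$ here is illustrative, not the paper&rsquo;s value):</p>

```python
def dap_loss(logp_prior, logp_agent, score, sigma=120.0):
    """DAP loss for one sampled sequence.

    logp_prior / logp_agent: log-likelihoods of the sampled SMILES
    under the prior and agent policies; score: S(x) in [0, 1];
    sigma: scalar weighting the reward against the prior anchor.
    """
    logp_augmented = logp_prior + sigma * score
    return (logp_augmented - logp_agent) ** 2

# A perfect-scoring molecule whose agent likelihood already matches
# the augmented target contributes zero loss.
loss = dap_loss(logp_prior=-30.0, logp_agent=-20.0, score=1.0, sigma=10.0)  # == 0.0
```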
<h3 id="diversity-filters">Diversity Filters</h3>
<p>Link-INVENT uses Diversity Filters (DFs) to balance exploration and exploitation. Buckets of limited size track unique <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/">Bemis-Murcko scaffolds</a>. When a bucket is full, further sampling of that scaffold receives a score of zero, encouraging the agent to explore diverse chemical space regions.</p>
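<p>The bucket mechanism can be sketched as follows (the keys would in practice be canonical Bemis-Murcko scaffold SMILES computed with a cheminformatics toolkit such as RDKit; here they are plain strings, and the default bucket size of 25 matches the setting reported below):</p>

```python
from collections import defaultdict

class ScaffoldDiversityFilter:
    """Bucket-based diversity filter: once a scaffold's bucket is
    full, further molecules with that scaffold score zero."""

    def __init__(self, bucket_size=25):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(int)

    def filter_score(self, scaffold, score):
        if self.buckets[scaffold] >= self.bucket_size:
            return 0.0  # bucket full: penalize further sampling of this scaffold
        self.buckets[scaffold] += 1
        return score
```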
<h3 id="linker-specific-scoring-components">Linker-Specific Scoring Components</h3>
<p>New scoring components provide direct control over linker properties:</p>
<ul>
<li><strong>Linker effective length</strong>: number of bonds between attachment atoms</li>
<li><strong>Linker maximum graph length</strong>: bonds in the longest graph traversal path</li>
<li><strong>Linker length ratio</strong>: effective length divided by maximum graph length (controls branching)</li>
<li><strong>Linker ratio of rotatable bonds</strong>: rotatable bonds over total bonds (controls flexibility)</li>
<li><strong>Linker number of rings</strong>: controls linearity vs. cyclicity</li>
<li><strong>Linker number of HBDs</strong>: hydrogen-bond donors in the linker itself</li>
</ul>
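<p>The ratio-style components reduce to simple arithmetic on bond counts (a sketch; in practice the counts are derived from the linker substructure with a toolkit such as RDKit, and raw values are mapped into $[0, 1]$ by a score transform, approximated here by a hard in-range check):</p>

```python
def linker_length_ratio(effective_length, max_graph_length):
    """Effective length / maximum graph length, in percent.
    100% corresponds to a perfectly linear (unbranched) linker."""
    return 100.0 * effective_length / max_graph_length

def rotatable_bond_ratio(n_rotatable_bonds, n_bonds):
    """Fraction of linker bonds that are rotatable, in percent
    (a proxy for linker flexibility)."""
    return 100.0 * n_rotatable_bonds / n_bonds

def in_range_score(value, low, high):
    """Hard in-range transform mapping a raw property to {0, 1}."""
    return 1.0 if low <= value <= high else 0.0
```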
<h2 id="experimental-evaluation-across-three-drug-discovery-tasks">Experimental Evaluation Across Three Drug Discovery Tasks</h2>
<p>Link-INVENT was evaluated through four experiments across three drug discovery applications, all using the same pre-trained prior.</p>
<h3 id="illustrative-example-two-benzene-rings">Illustrative Example: Two Benzene Rings</h3>
<p>A simple experiment linked two benzene rings with the objectives of limiting HBDs and requiring exactly one ring in the linker. Over 20 epochs, the agent learned to satisfy both objectives, demonstrating the basic RL-guided generation process.</p>
<h3 id="experiment-1a-fragment-linking-ck2-alpha-inhibitors">Experiment 1a: Fragment Linking (CK2 alpha Inhibitors)</h3>
<p>Based on the <a href="https://en.wikipedia.org/wiki/Casein_kinase_2">casein kinase 2</a> (CK2 alpha) fragment linking campaign by Fusco and Brear et al., Link-INVENT was tasked with linking two fragment hits while retaining the Lys68 hydrogen-bond interaction via a DockStream docking constraint (Glide/LigPrep backend). The scoring function also enforced a linker length ratio &gt;= 70% and linker MW &lt;= 200 Da.</p>
<p>Over 100 epochs in triplicate, the agent generated molecules with gradually improving docking scores. Key results:</p>
<ul>
<li>Docking score distributions across triplicates were nearly identical, demonstrating reproducibility</li>
<li>Some generated molecules achieved more favorable docking scores than the reference ligand CAM4066 (-15.20 kcal/mol)</li>
<li>More than 5000 unique Bemis-Murcko scaffolds were generated, with minimal overlap across replicates</li>
<li>Binding pose analysis showed the generated linker closely resembled the ground-truth linker, retaining the Lys68 interaction</li>
</ul>
<h3 id="experiment-1b-comparison-fragment-linking-impdh-inhibitors">Experiment 1b: Comparison Fragment Linking (IMPDH Inhibitors)</h3>
<p>Using the IMPDH inhibitor fragment linking case study from Trapero et al., this experiment applied core constrained docking (fragment pose within 0.3 A of the reference) and compared results to DeLinker and SyntaLinker. The scoring function enforced a linker effective length in [3, 5], a length ratio &gt;= 70%, and linker MW &lt;= 150 Da.</p>
<p>Link-INVENT generated 8960 SMILES across 70 epochs (comparable to DeLinker&rsquo;s 9000 molecular graphs). Results:</p>
<ul>
<li>Link-INVENT generated molecules with more favorable docking scores than the reference ligand across triplicate runs</li>
<li>Of the 20 DeLinker and 3 SyntaLinker example molecules, none of the DeLinker molecules and only one SyntaLinker molecule (the recovered reference itself) docked at least as well as the reference</li>
<li>Approximately 3000 unique Bemis-Murcko scaffolds were generated from 5000 total molecules</li>
<li>Link-INVENT&rsquo;s advantage comes from including docking explicitly as a learning objective rather than applying it post hoc</li>
</ul>
<h3 id="experiment-2-scaffold-hopping-dlk-inhibitor-cns-optimization">Experiment 2: Scaffold Hopping (DLK Inhibitor CNS Optimization)</h3>
<p>Based on Patel et al.&rsquo;s <a href="https://en.wikipedia.org/wiki/Dual_leucine_zipper_kinase">dual leucine zipper kinase</a> (DLK) inhibitor campaign, Link-INVENT generated new scaffold ideas to improve CNS penetration while retaining potency. The scoring function included a Cys193 docking constraint plus CNS-compatible properties (HBDs &lt; 2, tPSA &lt;= 90 A², 3 &lt;= SlogP &lt;= 4, MW &lt;= 450 Da, 1-2 aromatic rings in the linker).</p>
<p>The solution space was significantly narrower than fragment linking. The agent still generated diverse scaffolds with favorable docking scores, though fewer exceeded the reference ligand&rsquo;s score. Binding pose analysis confirmed retained Cys193 interactions and predicted additional Gln195 hydrogen bonds.</p>
<h3 id="experiment-3-protac-design-bcl-2mcl-1-dual-degradation">Experiment 3: PROTAC Design (Bcl-2/Mcl-1 Dual Degradation)</h3>
<p>Three sub-experiments demonstrated linker-specific scoring components for PROTAC design based on Wang et al.&rsquo;s Bcl-2/Mcl-1 dual degradation strategy:</p>
<table>
  <thead>
      <tr>
          <th>Sub-Experiment</th>
          <th>Objective</th>
          <th>Key Finding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sub-Exp 1: Linker length</td>
          <td>Generate linkers within specified length intervals [4,6], [7,9], [10,12], [13,15]</td>
          <td>Clear enrichment within target intervals vs. baseline broad distribution</td>
      </tr>
      <tr>
          <td>Sub-Exp 2: Linearity</td>
          <td>Control linear vs. cyclic linkers at fixed length [7,9]</td>
          <td>Baseline ratio ~1:2 linear:cyclic; enforcing linearity or cyclicity achieved strong enrichment</td>
      </tr>
      <tr>
          <td>Sub-Exp 3: Flexibility</td>
          <td>Generate linkers with Low [0,30], Moderate [40,60], or High [70,100] rotatable bond ratios</td>
          <td>Agent learned that rings and sp2 atoms yield rigidity; linear sp3 chains yield flexibility</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-practical-implications-for-drug-discovery">Key Findings and Practical Implications for Drug Discovery</h2>
<p>Link-INVENT demonstrates several practical advantages for molecular linker design:</p>
<ol>
<li><strong>Single prior, multiple tasks</strong>: The same pre-trained model handles fragment linking, scaffold hopping, and PROTAC design without retraining.</li>
<li><strong>Docking as a learning signal</strong>: Including molecular docking explicitly in the scoring function (via DockStream) during RL yields molecules with more favorable docking scores than approaches that apply docking post hoc.</li>
<li><strong>Implicit 3D awareness</strong>: The docking constraint guides the agent toward 3D structural awareness without explicit 3D coordinate inputs, as demonstrated by the overlap between generated and reference binding poses.</li>
<li><strong>Diverse and reproducible output</strong>: Diversity filters ensure exploration of multiple chemical space regions, and triplicate experiments show consistent docking score distributions with minimal scaffold overlap.</li>
</ol>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The linker flexibility metric (ratio of rotatable bonds) is agnostic to intra-molecular hydrogen bonds and does not account for all rigidity factors</li>
<li>Molecular docking is an approximation that can be exploited (e.g., excessive HBDs achieving favorable scores at the expense of permeability)</li>
<li>Experiments 1a and 1b require a proprietary Schrödinger license for Glide/LigPrep docking</li>
<li>No direct experimental (wet-lab) validation was performed in this study</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL v27 (reaction-sliced)</td>
          <td>Not specified</td>
          <td>Filtered for drug-like compounds, then reaction-based slicing with SMIRKS</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Held-out Bemis-Murcko scaffolds</td>
          <td>287 scaffolds</td>
          <td>Held out from training set</td>
      </tr>
      <tr>
          <td>SMILES augmentation</td>
          <td>Randomized SMILES per epoch</td>
          <td>Same tuples, different representations</td>
          <td>Improves generalizability</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder RNN with 3 hidden layers of 512 LSTM cells, embedding size 256</li>
<li><strong>RL loss</strong>: DAP (Difference of Augmented and Posterior likelihoods)</li>
<li><strong>Batch size</strong>: 128 molecules per epoch</li>
<li><strong>Diversity filter</strong>: Bemis-Murcko scaffold buckets of size 25</li>
<li><strong>Score threshold</strong>: 0 (to store all molecules for analysis)</li>
<li><strong>Scoring function</strong>: Weighted geometric mean of component scores</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Single pre-trained prior used across all experiments</li>
<li>Agent initialized as copy of prior, updated via RL</li>
<li>Pre-trained prior available at GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecular docking via DockStream with Glide/LigPrep backend</li>
<li>Triplicate runs for all experiments</li>
<li>Metrics: docking scores, unique Bemis-Murcko scaffold counts, binding pose overlap</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT (Link-INVENT code)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Main codebase for Link-INVENT</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity">ReinventCommunity (data + tutorial)</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training/validation data, reaction SMIRKS, pre-trained prior, Jupyter tutorial</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code, training data, and pre-trained prior are publicly available. However, reproducing the docking-based experiments (1a, 1b, and 2) requires a proprietary Schrödinger license for Glide and LigPrep. The PROTAC experiments (Experiment 3) that use only physicochemical scoring are fully reproducible with the open-source code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Knuth, F., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2023). Link-INVENT: generative linker design with reinforcement learning. <em>Digital Discovery</em>, 2, 392-408. <a href="https://doi.org/10.1039/D2DD00115B">https://doi.org/10.1039/D2DD00115B</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2023link,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Link-INVENT: generative linker design with reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Knuth, Franziska and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--408}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D2DD00115B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lingo3DMol: Language Model for 3D Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</guid><description>Lingo3DMol combines language models with geometric deep learning for structure-based 3D molecule generation using a fragment-based SMILES representation.</description><content:encoded><![CDATA[<h2 id="a-language-model-approach-to-structure-based-drug-design">A Language Model Approach to Structure-Based Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces Lingo3DMol, a pocket-based 3D molecule generation model combining transformer language models with geometric deep learning. The primary contribution is threefold: (1) a new molecular representation called FSMILES (fragment-based SMILES) that encodes both 2D topology and 3D spatial coordinates, (2) a dual-decoder architecture that jointly predicts molecular topology and atomic positions, and (3) an auxiliary non-covalent interaction (NCI) predictor that guides molecule generation toward favorable binding modes.</p>
<h2 id="limitations-of-existing-3d-molecular-generative-models">Limitations of Existing 3D Molecular Generative Models</h2>
<p>Existing approaches to structure-based drug design fall into two categories, each with notable limitations. Graph-based autoregressive methods (e.g., Pocket2Mol) represent molecules as 3D graphs and use GNNs for generation, but frequently produce non-drug-like structures: large rings (seven or more atoms), honeycomb-like ring arrays, and molecules with either too many or too few rings. The autoregressive sampling process tends to get stuck in local optima early in generation and accumulates errors at each step. Diffusion-based methods (e.g., TargetDiff) avoid autoregressive generation but still produce a notable proportion of undesirable structures due to weak perception of molecular topology, since they do not directly encode or predict bonds. Both approaches struggle with metrics like QED (quantitative estimate of drug-likeness) and SAS (synthetic accessibility score), and neither reliably reproduces known active compounds when evaluated on protein pockets.</p>
<h2 id="fsmiles-fragment-based-smiles-with-dual-coordinate-systems">FSMILES: Fragment-Based SMILES with Dual Coordinate Systems</h2>
<p>The core innovation of Lingo3DMol is a new molecular sequence representation called FSMILES that addresses the topology problem inherent in atom-by-atom generation. FSMILES reorganizes a molecule into fragments using a ring-first, depth-first traversal. Each fragment is represented using standard SMILES syntax, and the full molecule is assembled by combining fragments with a specific connection syntax. Ring size information is encoded directly in atom tokens (e.g., <code>C_6</code> for a carbon in a six-membered ring), providing the autoregressive decoder with critical context about local topology before it needs to close the ring.</p>
<p>The model integrates two coordinate systems. Local spherical coordinates encode bond length ($r$), bond angle ($\theta$), and dihedral angle ($\phi$) relative to three reference atoms (root1, root2, root3). These are predicted using separate MLP heads:</p>
<p>$$r = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_1\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}\right]\right)\right)\right)$$</p>
<p>$$\theta = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_2\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}\right]\right)\right)\right)$$</p>
<p>$$\phi = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_3\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}, h_{\text{root3}}\right]\right)\right)\right)$$</p>
<p>Global Euclidean coordinates ($x, y, z$) are predicted by a separate 3D decoder ($D_{\text{3D}}$). During inference, the model defines a search space around the predicted local coordinates ($r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$) and selects the global position with the highest joint probability within that space. This fusion strategy exploits the rigidity of bond lengths and angles (which makes local prediction easier) while maintaining global spatial awareness.</p>
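<p>Converting predicted local spherical coordinates into a Cartesian position relative to the three reference atoms follows the standard internal-to-Cartesian (NeRF-style) construction, sketched below. The exact axis conventions used by Lingo3DMol are not specified in this summary, so treat this as an illustrative assumption:</p>

```python
import math

def place_atom(root1, root2, root3, r, theta, phi):
    """Place a new atom at bond length r from root1, with bond angle
    theta (new-root1-root2) and dihedral phi (new-root1-root2-root3).
    Angles in radians; roots are [x, y, z] lists; root2-root3 must
    not be collinear with root1-root2."""
    def sub(a, b):
        return [a[i] - b[i] for i in range(3)]

    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]

    def unit(a):
        n = math.sqrt(sum(x * x for x in a))
        return [x / n for x in a]

    bc = unit(sub(root1, root2))            # axis root2 -> root1
    n = unit(cross(sub(root2, root3), bc))  # normal of the root plane
    m = cross(n, bc)                        # completes the orthonormal frame
    d = [-r * math.cos(theta),
         r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi)]
    return [root1[i] + d[0] * bc[i] + d[1] * m[i] + d[2] * n[i]
            for i in range(3)]
```

<p>Because $r$ and $\theta$ are nearly rigid in real molecules, predicting them locally is easier than predicting raw Cartesian coordinates, which motivates the fusion with the global 3D decoder described above.</p>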
<h3 id="ncianchor-prediction-model">NCI/Anchor Prediction Model</h3>
<p>A separately trained NCI/anchor prediction model identifies potential non-covalent interaction sites and anchor points in the protein pocket. This model shares the transformer architecture of the generation model and is initialized from pretrained parameters. It predicts whether each pocket atom will form hydrogen bonds, <a href="https://en.wikipedia.org/wiki/Halogen_bond">halogen bonds</a>, salt bridges, or <a href="https://en.wikipedia.org/wiki/Pi_stacking">pi-pi stacking</a> interactions with the ligand, and whether it lies within 4 A of any ligand atom (anchor points). The predicted NCI sites serve two purposes: they are incorporated as input features to the encoder, and they provide starting positions for molecule generation (the first atom is placed within 4.5 A of a sampled NCI site).</p>
<h3 id="pretraining-and-architecture">Pretraining and Architecture</h3>
<p>The model uses a denoising pretraining strategy inspired by BART. During pretraining on 12 million drug-like molecules, the model receives perturbed molecules (with 25% of atoms deleted, coordinates perturbed by $\pm 0.5$ A, and 25% of carbon element types corrupted) and learns to reconstruct the original structure. The architecture is transformer-based with graph structural information encoded through distance and edge vector bias terms in the attention mechanism:</p>
<p>$$A_{\text{biased}} = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_D + B_J\right)V$$</p>
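<p>A framework-free toy sketch of attention with an additive structural bias (single head on plain Python lists, with one matrix <code>B</code> standing in for $B_D + B_J$; shapes and values are illustrative):</p>

```python
import math

def biased_attention(Q, K, V, B):
    """Single-head attention with an additive bias matrix B:
    softmax(Q K^T / sqrt(d) + B) V, computed on nested lists."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        logits = [sum(q[t] * k[t] for t in range(d)) / math.sqrt(d) + B[i][j]
                  for j, k in enumerate(K)]
        peak = max(logits)                      # numerically stable softmax
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(weights[j] * V[j][t] for j in range(len(V)))
                    for t in range(len(V[0]))])
    return out
```

<p>A large positive bias entry (e.g. from a short inter-atomic distance) pulls attention toward that key, which is how the distance and edge-vector terms inject graph structure.</p>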
<p>The overall loss combines FSMILES token prediction, absolute coordinate prediction, and local coordinate predictions ($r$, $\theta$, $\phi$) with their auxiliary counterparts:</p>
<p>$$L = L_{\text{FSMILES}} + L_{\text{abs-coord}} + L_r + L_\theta + L_\phi + L_{r,\text{aux}} + L_{\theta,\text{aux}} + L_{\phi,\text{aux}}$$</p>
<p>Fine-tuning is performed on 11,800 protein-ligand complex samples from PDBbind 2020, with the first three encoder layers frozen to prevent overfitting.</p>
<h2 id="evaluation-on-dud-e-with-drug-likeness-filtering">Evaluation on DUD-E with Drug-Likeness Filtering</h2>
<p>The evaluation uses the DUD-E dataset (101 targets, 20,000+ active compounds), comparing Lingo3DMol against Pocket2Mol and TargetDiff. A key methodological contribution is the emphasis on filtering generated molecules for drug-likeness (QED &gt;= 0.3 and SAS &lt;= 5) before evaluating binding metrics, as the authors demonstrate that molecules with good docking scores can still be poor drug candidates.</p>
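<p>The drug-likeness filter itself is a simple predicate (a sketch; in practice the QED and SAS values would come from a toolkit such as RDKit&rsquo;s QED module and an SA score implementation, and the molecules below are hypothetical):</p>

```python
def is_drug_like(qed, sas, qed_min=0.3, sas_max=5.0):
    """The paper's filter: keep molecules with QED >= 0.3 and SAS <= 5."""
    return qed >= qed_min and sas <= sas_max

# Hypothetical (QED, SAS) values for three generated molecules.
candidates = [{"qed": 0.6, "sas": 3.1},
              {"qed": 0.2, "sas": 2.0},
              {"qed": 0.5, "sas": 6.5}]
kept = [m for m in candidates if is_drug_like(m["qed"], m["sas"])]
```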
<p><strong>Molecular properties and binding mode (Table 1, drug-like molecules only):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Pocket2Mol</th>
          <th>TargetDiff</th>
          <th>Lingo3DMol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (% of total)</td>
          <td>61%</td>
          <td>49%</td>
          <td><strong>82%</strong></td>
      </tr>
      <tr>
          <td>Mean QED</td>
          <td>0.56</td>
          <td>0.60</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Mean SAS</td>
          <td>3.5</td>
          <td>4.0</td>
          <td><strong>3.1</strong></td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% of targets)</td>
          <td>8%</td>
          <td>3%</td>
          <td><strong>33%</strong></td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td>-6.7</td>
          <td>-6.2</td>
          <td><strong>-6.8</strong></td>
      </tr>
      <tr>
          <td>Mean GlideSP redocking</td>
          <td>-7.5</td>
          <td>-7.0</td>
          <td><strong>-7.8</strong></td>
      </tr>
      <tr>
          <td>Mean RMSD vs. low-energy conformer (A)</td>
          <td>1.1</td>
          <td>1.1</td>
          <td><strong>0.9</strong></td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.84</td>
          <td><strong>0.88</strong></td>
          <td>0.82</td>
      </tr>
  </tbody>
</table>
<p>Lingo3DMol generates substantially more drug-like molecules (82% vs. 61% and 49%) and finds similar-to-active compounds for 33% of targets compared to 8% (Pocket2Mol) and 3% (TargetDiff). The model also achieves the best min-in-place GlideSP scores and lowest RMSD versus low-energy conformers, indicating higher quality binding poses and more realistic 3D geometries.</p>
<p><strong>Molecular geometry:</strong> Lingo3DMol demonstrated the lowest Jensen-Shannon divergence for all atom-atom distance distributions and produced significantly fewer molecules with large rings (0.23% with 7-membered rings vs. 2.59% for Pocket2Mol and 11.70% for TargetDiff).</p>
<p><strong>Information leakage analysis:</strong> The authors controlled for information leakage by excluding proteins with &gt;30% sequence identity to DUD-E targets from training. When DUD-E targets were stratified by sequence identity to Pocket2Mol&rsquo;s training set, Lingo3DMol&rsquo;s advantage widened as leakage decreased, suggesting the performance gap is genuine rather than an artifact of training overlap.</p>
<p><strong>Ablation studies (Table 2):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard</th>
          <th>Random NCI</th>
          <th>No Pretraining</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like (%)</td>
          <td><strong>82%</strong></td>
          <td>47%</td>
          <td>71%</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5</td>
          <td><strong>33%</strong></td>
          <td>6%</td>
          <td>3%</td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td><strong>-6.8</strong></td>
          <td>-5.8</td>
          <td>-4.9</td>
      </tr>
      <tr>
          <td>Dice score</td>
          <td><strong>0.25</strong></td>
          <td>0.15</td>
          <td>0.13</td>
      </tr>
  </tbody>
</table>
<p>Both pretraining and the NCI predictor are essential. Removing pretraining reduces the number of valid molecules and binding quality. Replacing the trained NCI predictor with random NCI site selection severely degrades drug-likeness and the ability to generate active-like compounds.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>Lingo3DMol demonstrates that combining language model sequence generation with geometric deep learning can produce drug-like 3D molecules that outperform graph-based and diffusion-based alternatives in binding mode quality, drug-likeness, and similarity to known actives. The FSMILES representation successfully constrains generated molecules to realistic topologies by encoding ring size information and using fragment-level generation.</p>
<p>Several limitations are acknowledged. Capturing all non-covalent interactions within a single molecule remains difficult with autoregressive generation. The model does not enforce equivariance (SE(3) invariance is approximated via rotation/translation augmentation and invariant features rather than built into the architecture). The pretraining dataset is partially proprietary (12M molecules from a commercial library, of which 1.4M from public sources are shared). Diversity of generated drug-like molecules is slightly lower than baselines, though the authors argue that baseline diversity explores chemical space away from known active regions. A comprehensive evaluation of drug-like properties beyond QED and SAS metrics is identified as an important next step.</p>
<p>Future directions include investigating electron density representations for molecular interactions, incorporating SE(3) equivariant architectures (e.g., GVP, Vector Neurons), and developing more systematic drug-likeness evaluation frameworks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>In-house commercial library</td>
          <td>12M molecules (1.4M public)</td>
          <td>Filtered for drug-likeness; conformers via ConfGen</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PDBbind 2020 (general set)</td>
          <td>11,800 samples (8,201 PDB IDs)</td>
          <td>Filtered for &lt;30% sequence identity to DUD-E targets</td>
      </tr>
      <tr>
          <td>NCI labels</td>
          <td>PDBbind 2020</td>
          <td>Same as fine-tuning</td>
          <td>Labeled using ODDT for H-bonds, halogen bonds, salt bridges, pi-pi stacking</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>DUD-E</td>
          <td>101 targets, 20,000+ active compounds</td>
          <td>Standard benchmark for structure-based drug design</td>
      </tr>
      <tr>
          <td>Geometry evaluation</td>
          <td>CrossDocked2020</td>
          <td>100 targets</td>
          <td>Used for bond length and atom distance distribution comparisons</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer-based encoder-decoder with graph structural bias terms (distance matrix $B_D$, edge vector matrix $B_J$)</li>
<li>Denoising pretraining: 25% atom deletion, coordinate perturbation ($\pm 0.5$ A), 25% carbon element type corruption</li>
<li>Depth-first search sampling with reward function combining model confidence and anchor fulfillment</li>
<li>Fine-tuning: first three encoder layers frozen</li>
<li>Local-global coordinate fusion during inference with search space: $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Generation model: transformer encoder-decoder with dual decoders ($D_{\text{2D}}$ for topology, $D_{\text{3D}}$ for global coordinates)</li>
<li>NCI/anchor prediction model: same architecture, initialized from pretrained parameters</li>
<li>Pretrained, fine-tuned, and NCI model checkpoints available on GitHub and figshare</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Lingo3DMol</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (%)</td>
          <td>82%</td>
          <td>61% (P2M)</td>
          <td>QED &gt;= 0.3, SAS &lt;= 5</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% targets)</td>
          <td>33%</td>
          <td>8% (P2M)</td>
          <td>Tanimoto similarity to known actives</td>
      </tr>
      <tr>
          <td>Min-in-place GlideSP</td>
          <td>-6.8</td>
          <td>-6.7 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>GlideSP redocking</td>
          <td>-7.8</td>
          <td>-7.5 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>RMSD vs. low-energy conformer</td>
          <td>0.9 A</td>
          <td>1.1 A (both)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Generation speed (100 mol)</td>
          <td>874 +/- 401 s</td>
          <td>962 +/- 622 s (P2M)</td>
          <td>NVIDIA Tesla V100</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference benchmarked on NVIDIA Tesla V100 GPUs</li>
<li>Generation of 100 valid molecules per target: 874 +/- 401 seconds</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/stonewiseAIDrugDesign/Lingo3DMol">Lingo3DMol</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Inference code and model architecture</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/software/Code_for_Lingo3DMo/24633084">Model checkpoints</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and NCI checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Data_for_Lingo3DMol/24550351">Training data</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Partial pretraining data (1.4M public molecules), fine-tuning complexes, evaluation molecules</td>
      </tr>
      <tr>
          <td><a href="https://sw3dmg.stonewise.cn">Online service</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Web interface for molecule generation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., Bai, R., Wang, H., Zhou, J., Peng, W., Huang, B., &amp; Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. <em>Nature Machine Intelligence</em>, 6(1), 62-73. <a href="https://doi.org/10.1038/s42256-023-00775-6">https://doi.org/10.1038/s42256-023-00775-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{feng2024generation,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generation of 3D molecules in pockets via a language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Feng, Wei and Wang, Lvwei and Lin, Zaiyun and Zhu, Yanhao and Wang, Han and Dong, Jianqiang and Bai, Rong and Wang, Huting and Zhou, Jielong and Peng, Wei and Huang, Bo and Zhou, Wenbiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{62--73}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00775-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Generative AI Survey for De Novo Molecule and Protein Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/generative-ai-drug-design-survey/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/generative-ai-drug-design-survey/</guid><description>Comprehensive survey of generative AI for de novo drug design covering molecule and protein generation with VAEs, GANs, diffusion, and flow models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-generative-ai-for-drug-design">A Systematization of Generative AI for Drug Design</h2>
<p>This is a <strong>Systematization</strong> paper that provides a broad survey of generative AI methods applied to de novo drug design. The survey organizes the field into two overarching themes: small molecule generation and protein generation. Within each theme, the authors identify subtasks, catalog datasets and benchmarks, describe model architectures, and compare the performance of leading methods using standardized metrics. The paper covers over 200 references and provides 12 comparative benchmark tables.</p>
<p>The primary contribution is a unified organizational framework that allows both micro-level comparisons within each subtask and macro-level observations across the two application domains. The authors highlight parallel developments in both fields, particularly the shift from sequence-based to structure-based approaches and the growing dominance of diffusion models.</p>
<h2 id="the-challenge-of-navigating-de-novo-drug-design">The Challenge of Navigating De Novo Drug Design</h2>
<p>The drug design process requires creating ligands that interact with specific biological targets. These range from small molecules (tens of atoms) to large proteins (monoclonal antibodies). Traditional discovery methods are computationally expensive, with preclinical trials costing hundreds of millions of dollars and taking 3-6 years. The chemical space of potential drug-like compounds is estimated at $10^{23}$ to $10^{60}$, making brute-force exploration infeasible.</p>
<p>AI-driven generative methods have gained traction in recent years, with AI-focused biotech companies reporting over 150 small-molecule drugs in the discovery phase and 15 in clinical trials. The rate of AI-fueled drug design programs has expanded by almost 40% each year.</p>
<p>The rapid development of the field, combined with its inherent complexity, creates barriers for new researchers. Several prior surveys exist, but they focus on specific aspects: molecule generation, protein generation, antibody generation, or specific model architectures like diffusion models. This survey takes a broader approach, covering both molecule and protein generation under a single organizational framework.</p>
<h2 id="unified-taxonomy-two-themes-seven-subtasks">Unified Taxonomy: Two Themes, Seven Subtasks</h2>
<p>The survey&rsquo;s core organizational insight is structuring de novo drug design into two themes with distinct subtasks, while identifying common architectural patterns across them.</p>
<h3 id="generative-model-architectures">Generative Model Architectures</h3>
<p>The survey covers four main generative model families used across both molecule and protein generation:</p>
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode inputs into a latent distribution and decode from sampled points. The encoder maps input $x$ to a distribution parameterized by mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. Training minimizes reconstruction loss plus KL divergence:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$$</p>
<p>where the KL loss is:</p>
<p>$$\mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k} \left(1 + \log(\sigma_k^{(i)2}) - \mu_k^{(i)2} - \sigma_k^{(i)2}\right)$$</p>
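<p>A minimal sketch of this objective in plain Python (the squared-error reconstruction term is an assumption; the formula above leaves the reconstruction loss generic, and the encoder/decoder networks are omitted):</p>

```python
import math

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """L = L_recon + beta * L_KL over one sample, with
    L_KL = -1/2 * sum_k (1 + log sigma_k^2 - mu_k^2 - sigma_k^2).
    Inputs are plain lists; log_var holds log sigma_k^2 per latent dim."""
    recon = sum((a - b) ** 2 for a, b in zip(x_recon, x))  # assumed MSE term
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, log_var))
    return recon + beta * kl
```

With a perfect reconstruction and a standard-normal posterior (mu = 0, log_var = 0), both terms vanish.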
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> use a generator-discriminator game. The generator $G$ creates instances from random noise $z$ sampled from a prior $p_z(z)$, while the discriminator $D$ distinguishes real from synthetic data:</p>
<p>$$\min_{G} \max_{D} \mathbb{E}_x[\log D(x; \theta_d)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$$</p>
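<p>At a fixed discriminator, the value of this objective can be estimated from sampled discriminator outputs; a toy sketch (the batch values used below are illustrative, not from any trained model):</p>

```python
import math

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    from discriminator outputs on a real batch and a generated batch."""
    term_real = sum(math.log(p) for p in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return term_real + term_fake
```

At the equilibrium where the discriminator outputs D = 0.5 everywhere, the value is -2 log 2.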
<p><strong>Flow-Based Models</strong> generate data by applying an invertible function $f: z_0 \mapsto x$ to transform a simple latent distribution (Gaussian) to the target distribution. The log-likelihood is computed using the change-of-variable formula:</p>
<p>$$\log p(x) = \log p_0(z) - \log \left| \det \frac{\partial f}{\partial z} \right|$$</p>
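<p>For a one-dimensional affine flow, the change of variables can be checked directly; note that with $f: z \mapsto x$ the log-Jacobian term is subtracted:</p>

```python
import math

def affine_flow_logp(x, a=2.0, b=1.0):
    """Exact log-density under a 1-D affine flow x = f(z) = a*z + b
    with a standard-normal base distribution p0. Change of variables:
    log p(x) = log p0(z) - log |det df/dz|, and here df/dz = a."""
    z = (x - b) / a                                    # invert the flow
    log_p0 = -0.5 * (z * z + math.log(2 * math.pi))    # standard-normal log-density
    return log_p0 - math.log(abs(a))
```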
<p><strong>Diffusion Models</strong> gradually add Gaussian noise over $T$ steps in a forward process and learn to reverse the noising via a denoising neural network. The forward step is:</p>
<p>$$x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$</p>
<p>The training loss minimizes the difference between the true noise and the predicted noise:</p>
<p>$$L_t = \mathbb{E}_{t \sim [1,T], x_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2 \right]$$</p>
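<p>A minimal sketch of the forward noising step and the per-sample denoising objective, with the noise-prediction network left abstract (function names here are illustrative):</p>

```python
import math
import random

def forward_step(x_t, beta_t, rng=random.Random(0)):
    """One forward noising step:
    x_{t+1} = sqrt(1 - beta_t) * x_t + sqrt(beta_t) * eps, eps ~ N(0, I).
    Returns both x_{t+1} and the sampled noise (the training target)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x_t]
    x_next = [math.sqrt(1.0 - beta_t) * xi + math.sqrt(beta_t) * ei
              for xi, ei in zip(x_t, eps)]
    return x_next, eps

def denoising_loss(eps_true, eps_pred):
    """Squared-norm objective ||eps - eps_theta(x_t, t)||^2 for one sample;
    eps_pred would come from the denoising network."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred))
```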
<p>Graph neural networks (GNNs), particularly equivariant GNNs (EGNNs), are commonly paired with these generative methods: diffusion and flow-based models often use GNNs to process 2D/3D molecular and protein inputs, while VAEs and GANs are more typically applied to 1D (string or sequence) representations.</p>
<h2 id="small-molecule-generation-tasks-datasets-and-models">Small Molecule Generation: Tasks, Datasets, and Models</h2>
<h3 id="target-agnostic-molecule-design">Target-Agnostic Molecule Design</h3>
<p>The goal is to generate a set of novel, valid, and stable molecules without conditioning on any specific biological target. Models are evaluated on atom stability, molecule stability, validity, uniqueness, novelty, and QED (Quantitative Estimate of Drug-Likeness).</p>
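<p>The distribution-level metrics reduce to simple set arithmetic once a validity check is available. In the sketch below, <code>is_valid</code> stands in for a chemistry-toolkit parse (e.g. RDKit), which is an assumption; the metric definitions follow common usage in this literature.</p>

```python
def generation_metrics(samples, is_valid, train_set):
    """Standard target-agnostic generation metrics over SMILES strings:
    validity   = valid / generated
    uniqueness = unique valid / valid
    novelty    = unique valid not seen in training / unique valid"""
    valid = [s for s in samples if is_valid(s)]
    unique = set(valid)
    novel = unique - set(train_set)
    n = len(samples)
    return {
        "validity": len(valid) / n if n else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```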
<p><strong>Datasets</strong>: QM9 (small stable molecules from <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a>) and <a href="/notes/computational-chemistry/datasets/geom/">GEOM</a>-Drug (more complex, drug-like molecules).</p>
<p>The field has shifted from SMILES-based VAEs (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">GVAE</a>, SD-VAE) to 2D graph methods (JTVAE) and then to 3D diffusion-based models. Current leading methods on QM9:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>EGNN, Diffusion</td>
          <td>99.8</td>
          <td>97.5</td>
          <td>97.9</td>
          <td>97.6</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>EGNN, VAE, Diffusion</td>
          <td>99.2</td>
          <td>89.6</td>
          <td>98.6</td>
          <td>94.6</td>
      </tr>
      <tr>
          <td>JODO</td>
          <td>EGNN, Diffusion</td>
          <td>99.2</td>
          <td>93.4</td>
          <td>99.0</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>VAE, Diffusion</td>
          <td>98.9</td>
          <td>89.4</td>
          <td>93.8</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>EGNN, Diffusion</td>
          <td>98.7</td>
          <td>82.0</td>
          <td>91.9</td>
          <td>90.7</td>
      </tr>
  </tbody>
</table>
<p>EDM provided an initial baseline using diffusion with an equivariant GNN. GCDM introduced attention-based geometric message-passing. MDM separately handles covalent bond edges and Van der Waals forces, and also addresses diversity through an additional distribution-controlling noise variable. GeoLDM maps molecules to a lower-dimensional latent space for more efficient diffusion. MiDi uses a &ldquo;relaxed&rdquo; EGNN and jointly models 2D and 3D information through a graph representation capturing both spatial and connectivity data.</p>
<p>On the larger GEOM-Drugs dataset, performance drops for most models:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>99.8</td>
          <td>91.6</td>
          <td>77.8</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>&ndash;</td>
          <td>62.2</td>
          <td>99.5</td>
          <td>99.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>84.4</td>
          <td>&ndash;</td>
          <td>99.3</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>81.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<p>MiDi distinguishes itself by generating more stable complex molecules, though at the expense of validity. Models generally perform well on QM9 but show room for improvement on the more complex GEOM-Drugs molecules.</p>
<h3 id="target-aware-molecule-design">Target-Aware Molecule Design</h3>
<p>Target-aware generation produces molecules for specific protein targets, using either ligand-based (LBDD) or structure-based (SBDD) approaches. SBDD methods have become more prevalent as protein structure information becomes increasingly available.</p>
<p><strong>Datasets</strong>: CrossDocked2020 (22.5M ligand-protein pairs), ZINC20, Binding MOAD.</p>
<p><strong>Metrics</strong>: Vina Score (docking energy), High Affinity Percentage, QED, SA Score (synthetic accessibility), Diversity (Tanimoto similarity).</p>
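<p>Diversity is typically computed as one minus the mean pairwise Tanimoto similarity over molecular fingerprints; a minimal sketch, representing each fingerprint as a set of on-bit indices (e.g. ECFP bits):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets:
    |intersection| / |union|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def diversity(fps):
    """Mean pairwise (1 - Tanimoto) over a batch of generated molecules,
    the usual definition of the Diversity column in tables like this one."""
    pairs = [(i, j) for i in range(len(fps)) for j in range(i + 1, len(fps))]
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(fps[i], fps[j]) for i, j in pairs) / len(pairs)
```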
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Vina</th>
          <th>Affinity (%)</th>
          <th>QED</th>
          <th>SA</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffSBDD</td>
          <td>EGNN, Diffusion</td>
          <td>-7.333</td>
          <td>&ndash;</td>
          <td>0.467</td>
          <td>0.554</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Luo et al.</td>
          <td>SchNet</td>
          <td>-6.344</td>
          <td>29.09</td>
          <td>0.525</td>
          <td>0.657</td>
          <td>0.720</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>EGNN, Diffusion</td>
          <td>-6.3</td>
          <td>58.1</td>
          <td>0.48</td>
          <td>0.58</td>
          <td>0.72</td>
      </tr>
      <tr>
          <td>LiGAN</td>
          <td>CNN, VAE</td>
          <td>-6.144</td>
          <td>21.1</td>
          <td>0.39</td>
          <td>0.59</td>
          <td>0.66</td>
      </tr>
      <tr>
          <td>Pocket2Mol</td>
          <td>EGNN, MLP</td>
          <td>-5.14</td>
          <td>48.4</td>
          <td>0.56</td>
          <td>0.74</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugGPT is an LBDD autoregressive model using transformers on tokenized protein-ligand pairs. Among the SBDD models, LiGAN introduces a 3D CNN-VAE framework, Pocket2Mol emphasizes binding pocket geometry using an EGNN with geometric vector MLP layers, and Luo et al. model atomic probabilities in the binding site using SchNet. TargetDiff performs diffusion on an EGNN and optimizes binding affinity by reflecting low atom type entropy. DiffSBDD applies an inpainting approach by masking and replacing segments of ligand-protein complexes. DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high affinity. Interestingly, diffusion-based methods are outperformed by Pocket2Mol on drug-likeness metrics (QED and SA).</p>
<h3 id="molecular-conformation-generation">Molecular Conformation Generation</h3>
<p>Conformation generation involves producing 3D structures from 2D connectivity graphs. Models are evaluated on Coverage (COV, percentage of ground-truth conformations &ldquo;covered&rdquo; within an RMSD threshold) and Matching (MAT, average RMSD to closest ground-truth conformation).</p>
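<p>Given a table of RMSDs between generated and reference conformers, both metrics are a few lines; the default 1.25 Å threshold below follows the GEOM-Drugs convention discussed in this section:</p>

```python
def cov_mat(rmsd, threshold=1.25):
    """COV and MAT from an RMSD table rmsd[i][j] between generated
    conformer i and ground-truth reference conformer j.
    COV: fraction of references matched by some generated conformer
    within `threshold`.  MAT: mean best RMSD over references."""
    n_ref = len(rmsd[0])
    # Closest generated conformer for each reference conformer.
    best = [min(row[j] for row in rmsd) for j in range(n_ref)]
    cov = sum(1 for b in best if b < threshold) / n_ref
    mat = sum(best) / n_ref
    return cov, mat
```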
<p><strong>Datasets</strong>: GEOM-QM9, GEOM-Drugs, ISO17.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>GEOM-QM9 COV (%)</th>
          <th>GEOM-QM9 MAT</th>
          <th>GEOM-Drugs COV (%)</th>
          <th>GEOM-Drugs MAT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Torsional Diff.</td>
          <td>Diffusion</td>
          <td>92.8</td>
          <td>0.178</td>
          <td>72.7*</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>DGSM</td>
          <td>MPNN, Diffusion</td>
          <td>91.49</td>
          <td>0.2139</td>
          <td>78.73</td>
          <td>1.0154</td>
      </tr>
      <tr>
          <td>GeoDiff</td>
          <td>GFN, Diffusion</td>
          <td>90.07</td>
          <td>0.209</td>
          <td>89.13</td>
          <td>0.8629</td>
      </tr>
      <tr>
          <td>ConfGF</td>
          <td>GIN, Diffusion</td>
          <td>88.49</td>
          <td>0.2673</td>
          <td>62.15</td>
          <td>1.1629</td>
      </tr>
      <tr>
          <td>GeoMol</td>
          <td>MPNN</td>
          <td>71.26</td>
          <td>0.3731</td>
          <td>67.16</td>
          <td>1.0875</td>
      </tr>
  </tbody>
</table>
<p>*Torsional Diffusion uses a 0.75 Å threshold instead of the standard 1.25 Å for GEOM-Drugs coverage, leading to a deflated score. It outperforms GeoDiff and GeoMol when evaluated at the same threshold.</p>
<p>Torsional Diffusion operates in the space of torsion angles rather than Cartesian coordinates, allowing for improved representation and fewer denoising steps. GeoDiff uses Euclidean-space diffusion, treating each atom as a particle and incorporating Markov kernels that preserve E(3) equivariance through a graph field network (GFN) layer.</p>
<h2 id="protein-generation-from-sequence-to-structure">Protein Generation: From Sequence to Structure</h2>
<h3 id="protein-representation-learning">Protein Representation Learning</h3>
<p>Representation learning creates embeddings for protein inputs to support downstream tasks. Models are evaluated on contact prediction, fold classification (at family, superfamily, and fold levels), and stability prediction (Spearman&rsquo;s $\rho$).</p>
<p>Key models include: UniRep (mLSTM RNN), ProtBERT (BERT applied to amino acid sequences), ESM-1B (33-layer, 650M parameter transformer), MSA Transformer (pre-trained on MSA input), and GearNET (Geo-EGNN using 3D structure with directed edges). OntoProtein and KeAP incorporate knowledge graphs for direct knowledge injection.</p>
<h3 id="protein-structure-prediction">Protein Structure Prediction</h3>
<p>Given an amino acid sequence, models predict 3D point coordinates for each residue. Evaluated using RMSD, GDT-TS, TM-score, and LDDT on CASP14 and CAMEO benchmarks.</p>
<p>AlphaFold2 is the landmark model, integrating MSA and pair representations through transformers with invariant point attention (IPA). ESMFold uses ESM-2 language model representations instead of MSAs, achieving faster processing. RoseTTAFold uses a three-track neural network learning from 1D sequence, 2D distance map, and 3D backbone coordinate information simultaneously. EigenFold uses diffusion, representing the protein as a system of harmonic oscillators.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>CAMEO RMSD</th>
          <th>CAMEO TMScore</th>
          <th>CAMEO GDT-TS</th>
          <th>CAMEO lDDT</th>
          <th>CASP14 TMScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AlphaFold2</td>
          <td>Transformer</td>
          <td>3.30</td>
          <td>0.87</td>
          <td>0.86</td>
          <td>0.90</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>ESMFold</td>
          <td>Transformer</td>
          <td>3.99</td>
          <td>0.85</td>
          <td>0.83</td>
          <td>0.87</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>RoseTTAFold</td>
          <td>Transformer</td>
          <td>5.72</td>
          <td>0.77</td>
          <td>0.71</td>
          <td>0.79</td>
          <td>0.37</td>
      </tr>
      <tr>
          <td>EigenFold</td>
          <td>Diffusion</td>
          <td>7.37</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.78</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="sequence-generation-inverse-folding">Sequence Generation (Inverse Folding)</h3>
<p>Given a fixed protein backbone structure, models generate amino acid sequences that will fold into that structure. The space of valid sequences is between $10^{65}$ and $10^{130}$.</p>
<p>Evaluated using Amino Acid Recovery (AAR), diversity, RMSD, nonpolar loss, and perplexity (PPL):</p>
<p>$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_1, x_2, \ldots, x_{i-1})\right)$$</p>
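<p>As a sketch, perplexity is the exponentiated negative mean per-token log-probability, so lower is better:</p>

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/N) * sum_i log P(x_i | x_<i)) for one sequence,
    given the per-token log-probabilities assigned by the model."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)
```

A model that assigns probability 0.5 to every token has perplexity exactly 2.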
<p>ProteinMPNN is the current top performer, generating the most accurate sequences and leading in AAR, RMSD, and nonpolar loss. It uses a message-passing neural network with a flexible, order-agnostic autoregressive approach.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>AAR (%)</th>
          <th>Div.</th>
          <th>RMSD</th>
          <th>Non.</th>
          <th>Time (s)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ProteinMPNN</td>
          <td>MPNN</td>
          <td>48.7</td>
          <td>0.168</td>
          <td>1.019</td>
          <td>1.061</td>
          <td>112</td>
      </tr>
      <tr>
          <td>ESM-IF1</td>
          <td>Transformer</td>
          <td>47.7</td>
          <td>0.184</td>
          <td>1.265</td>
          <td>1.201</td>
          <td>1980</td>
      </tr>
      <tr>
          <td>GPD</td>
          <td>Transformer</td>
          <td>46.2</td>
          <td>0.219</td>
          <td>1.758</td>
          <td>1.333</td>
          <td>35</td>
      </tr>
      <tr>
          <td>ABACUS-R</td>
          <td>Transformer</td>
          <td>45.7</td>
          <td>0.124</td>
          <td>1.482</td>
          <td>0.968</td>
          <td>233280</td>
      </tr>
      <tr>
          <td>3D CNN</td>
          <td>CNN</td>
          <td>44.5</td>
          <td>0.272</td>
          <td>1.62</td>
          <td>1.027</td>
          <td>536544</td>
      </tr>
      <tr>
          <td>PiFold</td>
          <td>GNN</td>
          <td>42.8</td>
          <td>0.141</td>
          <td>1.592</td>
          <td>1.464</td>
          <td>221</td>
      </tr>
      <tr>
          <td>ProteinSolver</td>
          <td>GNN</td>
          <td>24.6</td>
          <td>0.186</td>
          <td>5.354</td>
          <td>1.389</td>
          <td>180</td>
      </tr>
  </tbody>
</table>
<p>Results are from the independent benchmark by Yu et al. GPD remains the fastest method, generating sequences around three times faster than ProteinMPNN. Current SOTA models recover fewer than half of target amino acid residues, indicating room for improvement.</p>
<h3 id="backbone-design">Backbone Design</h3>
<p>Backbone design creates protein structures from scratch, representing the core of de novo protein design. Models generate coordinates for backbone atoms (nitrogen, alpha-carbon, carbonyl, oxygen) and use external tools like Rosetta for side-chain packing.</p>
<p>Two evaluation paradigms exist: context-free generation (evaluated by self-consistency TM, or scTM) and context-given generation (inpainting, evaluated by AAR, PPL, RMSD).</p>
<p>ProtDiff represents residues as 3D Cartesian coordinates and uses particle-filtering diffusion. FoldingDiff instead uses an angular representation (six angles per residue) with a BERT-based DDPM. LatentDiff embeds proteins into a latent space using an equivariant autoencoder, then applies equivariant diffusion, analogous to GeoLDM for molecules. These early models work well for short proteins (up to 128 residues) but struggle with longer structures.</p>
<p>Frame-based methods address this scaling limitation. Genie uses Frenet-Serret frames with paired residue representations and IPA for noise prediction. FrameDiff parameterizes backbone structures on the $SE(3)^N$ manifold of frames using a score-based generative model. RFDiffusion is the current leading model, combining RoseTTAFold structure prediction with diffusion. It fine-tunes RoseTTAFold weights on a masked input sequence and random noise coordinates, using &ldquo;self-conditioning&rdquo; on predicted structures. Protpardelle co-designs sequence and structure by creating a &ldquo;superposition&rdquo; over possible sidechain states and collapsing them during each iterative diffusion step.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>scTM (%)</th>
          <th>Design. (%)</th>
          <th>PPL</th>
          <th>AAR (%)</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RFDiffusion</td>
          <td>Diffusion</td>
          <td>&ndash;</td>
          <td>95.1</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Protpardelle</td>
          <td>Diffusion</td>
          <td>85</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FrameDiff</td>
          <td>Diffusion</td>
          <td>84</td>
          <td>48.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Genie</td>
          <td>Diffusion</td>
          <td>81.5</td>
          <td>79.0</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>LatentDiff</td>
          <td>EGNN, Diffusion</td>
          <td>31.6</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FoldingDiff</td>
          <td>Diffusion</td>
          <td>14.2</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>ProtDiff</td>
          <td>EGNN, Diffusion</td>
          <td>11.8</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>12.47*</td>
          <td>8.01*</td>
      </tr>
  </tbody>
</table>
<p>*ProtDiff context-given results are tested only on beta-lactamase metalloproteins from PDB.</p>
<h3 id="antibody-design">Antibody Design</h3>
<p>The survey covers antibody structure prediction, representation learning, and CDR-H3 generation. Antibodies are Y-shaped proteins with complementarity-determining regions (CDRs), where CDR-H3 is the most variable and functionally important region.</p>
<p>For CDR-H3 generation, models have progressed from sequence-based (LSTM) to structure-based (RefineGNN) and sequence-structure co-design approaches (MEAN, AntiDesigner, DiffAb). dyMEAN is the current leading model, providing an end-to-end method incorporating structure prediction, docking, and CDR generation into a single framework. Informative multiple sequence alignments cannot be constructed for the highly variable antibody regions, which makes general models like AlphaFold2 less effective for antibody structure prediction. Specialized models like IgFold use sequence embeddings from AntiBERTy with invariant point attention to achieve faster antibody structure prediction.</p>
<h3 id="peptide-design">Peptide Design</h3>
<p>The survey briefly covers peptide generation, including models for therapeutic peptide generation (MMCD), peptide-protein interaction prediction (PepGB), peptide representation learning (PepHarmony), peptide sequencing (AdaNovo), and signal peptide prediction (PEFT-SP).</p>
<h2 id="current-trends-challenges-and-future-directions">Current Trends, Challenges, and Future Directions</h2>
<h3 id="current-trends">Current Trends</h3>
<p>The survey identifies several parallel trends across molecule and protein generation:</p>
<ol>
<li>
<p><strong>Shift from sequence to structure</strong>: In molecule generation, graph-based diffusion models (GeoLDM, MiDi, TargetDiff) now dominate. In protein generation, structure-based representation learning (GearNET) and diffusion-based backbone design (RFDiffusion) have overtaken sequence-only methods.</p>
</li>
<li>
<p><strong>Dominance of E(3) equivariant architectures</strong>: EGNNs appear across nearly all subtasks, reflecting the physical requirement that molecular and protein properties should be invariant to rotation and translation.</p>
</li>
<li>
<p><strong>Structure-based over ligand-based approaches</strong>: In target-aware molecule design, SBDD methods that use 3D protein structures demonstrate clear advantages over LBDD approaches that operate on amino acid sequences alone.</p>
</li>
</ol>
<h3 id="challenges">Challenges</h3>
<p><strong>For small molecule generation:</strong></p>
<ul>
<li><strong>Complexity</strong>: Models perform well on simple QM9 but struggle with complex GEOM-Drugs molecules.</li>
<li><strong>Applicability</strong>: Generating molecules with high binding affinity to targets remains difficult.</li>
<li><strong>Explainability</strong>: Methods are black-box, offering no insight into why generated molecules have desired properties.</li>
</ul>
<p><strong>For protein generation:</strong></p>
<ul>
<li><strong>Benchmarking</strong>: Protein generative tasks lack a standard evaluation procedure; metrics and testing conditions vary from model to model.</li>
<li><strong>Performance</strong>: SOTA models still struggle with fold classification, gene ontology, and antibody CDR-H3 generation.</li>
</ul>
<p>The authors also note that many generative tasks are evaluated using predictive models (e.g., classifier networks for binding affinity or molecular properties). Improvements to these classification methods would lead to more precise alignment with real-world biological applications.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors identify increasing performance in existing tasks, defining more applicable tasks (especially in molecule-protein binding, antibody generation), and exploring entirely new areas of research as key future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a survey paper, this work does not produce new models, datasets, or experimental results. All benchmark numbers reported are from the original papers cited.</p>
<h3 id="data">Data</h3>
<p>The survey catalogs the following key datasets across subtasks:</p>
<table>
  <thead>
      <tr>
          <th>Subtask</th>
          <th>Datasets</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target-agnostic molecule</td>
          <td>QM9, <a href="/notes/computational-chemistry/datasets/geom/">GEOM</a>-Drug</td>
          <td>QM9 from <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a>; GEOM-Drug for complex molecules</td>
      </tr>
      <tr>
          <td>Target-aware molecule</td>
          <td>CrossDocked2020, ZINC20, Binding MOAD</td>
          <td>CrossDocked2020 most used (22.5M pairs)</td>
      </tr>
      <tr>
          <td>Conformation generation</td>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM</a>-QM9, GEOM-Drugs, ISO17</td>
          <td>Conformer sets for molecules</td>
      </tr>
      <tr>
          <td>Protein structure prediction</td>
          <td>PDB, CASP14, CAMEO</td>
          <td>CASP biennial blind evaluation</td>
      </tr>
      <tr>
          <td>Protein sequence generation</td>
          <td>PDB, UniRef, UniParc, CATH, TS500</td>
          <td>CATH for domain classification</td>
      </tr>
      <tr>
          <td>Backbone design</td>
          <td>PDB, AlphaFoldDB, SCOP, CATH</td>
          <td>AlphaFoldDB for expanded structural coverage</td>
      </tr>
      <tr>
          <td>Antibody structure</td>
          <td>SAbDab, RAB</td>
          <td>SAbDab: all antibody structures from PDB</td>
      </tr>
      <tr>
          <td>Antibody CDR generation</td>
          <td>SAbDab, RAB, SKEMPI</td>
          <td>SKEMPI for affinity optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gersteinlab/GenAI4Drug">GenAI4Drug</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Organized repository of all covered sources</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, X., Dai, H., Knight, E., Wu, F., Li, Y., Li, T., &amp; Gerstein, M. (2024). A survey of generative AI for de novo drug design: New frontiers in molecule and protein generation. <em>Briefings in Bioinformatics</em>, 25(4), bbae338. <a href="https://doi.org/10.1093/bib/bbae338">https://doi.org/10.1093/bib/bbae338</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics, Volume 25, Issue 4, 2024.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2402.08703">arXiv: 2402.08703</a></li>
<li><a href="https://github.com/gersteinlab/GenAI4Drug">GitHub: GenAI4Drug</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11247410/">PMC: PMC11247410</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{tang2024survey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae338}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Curriculum Learning for De Novo Drug Design (REINVENT)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</guid><description>Curriculum learning applied to REINVENT accelerates convergence on complex multi-parameter drug design objectives compared to standard reinforcement learning.</description><content:encoded><![CDATA[<h2 id="curriculum-learning-as-a-method-for-molecular-generation">Curriculum Learning as a Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces curriculum learning (CL) into the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo molecular design platform. The primary contribution is a training strategy that decomposes complex multi-parameter optimization (MPO) objectives into sequences of simpler tasks with increasing complexity. The agent learns each simpler task before progressing to the full production objective, accelerating convergence and improving the quality and diversity of generated molecules compared to standard policy-based reinforcement learning (RL).</p>
<h2 id="the-computational-cost-of-complex-reward-functions">The Computational Cost of Complex Reward Functions</h2>
<p>Policy-based RL for molecular design works by training a generative model (the agent) to produce molecules that maximize a reward function. In practice, drug design reward functions often include computationally expensive components such as molecular docking. When the reward landscape is complex and minima are difficult to find, the agent may spend many epochs sampling molecules far from the desired objective. The resulting small gradients cause minimal policy updates, leading to long periods of non-productivity. This is particularly wasteful when each reward evaluation involves expensive physics-based computations.</p>
<p>The core problem is that standard RL treats the full MPO objective as a monolithic task. If the agent cannot find any rewarding molecules early in training, it receives near-zero gradients and makes negligible progress. This creates a bootstrapping problem: the agent needs to already be sampling from favorable regions of chemical space to receive useful learning signals, but it has no guidance on how to get there.</p>
<p>Curriculum learning, originally proposed by Bengio et al. (2009), addresses this by arranging training tasks in order of increasing difficulty. When constituent tasks are correlated with the final objective, the gradients from simpler tasks provide more effective traversal of the optimization landscape.</p>
<h2 id="formalized-curriculum-strategy-for-reinvent">Formalized Curriculum Strategy for REINVENT</h2>
<p>The key innovation is a two-phase training protocol with formal definitions for curriculum progression.</p>
<p>A scoring function maps SMILES strings to desirability scores in $[0, 1]$ using a weighted geometric mean:</p>
<p>$$S(x) = \left(\prod_{i=1}^{n} c_{i}(x)^{w_{i}}\right)^{1 / \sum_{i=1}^{n} w_{i}}$$</p>
<p>where $x$ is a sampled compound in SMILES format, $c_{i}$ is the $i$-th scoring component, and $w_{i}$ is its weight.</p>
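<p>As an illustrative sketch (not the REINVENT implementation), the weighted geometric mean can be computed in log space for numerical stability; the function and floor value here are assumptions for demonstration:</p>

```python
import math

def aggregate_score(component_scores, weights):
    """Weighted geometric mean S(x) of per-component scores c_i(x) in [0, 1].

    Computed in log space; a small floor keeps a zero component from
    producing log(0) while still driving the aggregate toward zero.
    """
    total_weight = sum(weights)
    log_sum = sum(w * math.log(max(c, 1e-12))
                  for c, w in zip(component_scores, weights))
    return math.exp(log_sum / total_weight)
```

<p>With equal weights this reduces to the plain geometric mean, so a single low-scoring component pulls the aggregate down sharply, which is the intended behavior for multi-parameter desirability.</p>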
<p>A Curriculum $C$ consists of a sequence of Objectives $O = \{O_{C_1}, \ldots, O_{C_n}, O_{P}\}$, where subscripts $C$ and $P$ denote Curriculum and Production Objectives, respectively. Each Objective has a corresponding scoring function. Progression is controlled by Curriculum Progression Criteria $P = \{P_{1}, \ldots, P_{n}\}$, where each $P_{i}$ defines a score threshold the agent must achieve before advancing to the next objective.</p>
<p><strong>Curriculum Phase</strong>: The agent trains on sequential Curriculum Objectives with increasing complexity. A diversity filter is not applied during this phase, as it could be counterproductive to guiding the agent toward favorable chemical space. No computationally expensive components (e.g., docking) are used in Curriculum Objectives.</p>
<p><strong>Production Phase</strong>: Activated only when the final Curriculum Progression Criterion $P_{n}$ is satisfied. The agent now optimizes the full Production Objective, which may include expensive components like molecular docking. A new inception memory is initialized (clearing Curriculum Phase compounds), and a Bemis-Murcko scaffold diversity filter is applied to encourage exploration across multiple local minima.</p>
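<p>The two-phase protocol can be sketched as a simple training loop. The <code>agent</code> interface below is a hypothetical stand-in, not REINVENT&rsquo;s actual API; a toy stub agent is included to make the sketch self-contained:</p>

```python
def run_curriculum(agent, curriculum_objectives, thresholds,
                   production_objective, max_epochs=2000, production_epochs=300):
    """Two-phase training sketch: advance through cheap Curriculum Objectives
    gated by score thresholds, then switch to the full Production Objective.
    `agent` is a hypothetical interface, not REINVENT's actual API."""
    epoch = 0
    for objective, threshold in zip(curriculum_objectives, thresholds):
        mean_score = 0.0
        # Curriculum Phase: no diversity filter, no expensive components.
        while mean_score < threshold and epoch < max_epochs:
            mean_score = agent.train_epoch(objective)
            epoch += 1
    # Production Phase: clear the inception memory, apply the scaffold
    # diversity filter, and allow expensive components such as docking.
    agent.reset_inception_memory()
    agent.enable_scaffold_diversity_filter()
    for _ in range(production_epochs):
        agent.train_epoch(production_objective)


class StubAgent:
    """Toy agent whose batch score rises by 0.25 per epoch, for demonstration."""
    def __init__(self):
        self.score, self.log = 0.0, []
    def train_epoch(self, objective):
        self.score = min(1.0, self.score + 0.25)
        self.log.append(objective)
        return self.score
    def reset_inception_memory(self):
        pass
    def enable_scaffold_diversity_filter(self):
        pass
```

<p>The gating structure is the essential point: the expensive Production Objective is never evaluated until the agent has already been steered into favorable chemical space by the cheap Curriculum Objectives.</p>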
<p>The implementation builds on REINVENT&rsquo;s RNN architecture: three hidden layers of 512 LSTM cells with an embedding size of 256 and a linear layer with softmax activation, pretrained on ChEMBL to learn SMILES syntax.</p>
<h2 id="three-experiments-on-pdk1-inhibitor-design">Three Experiments on PDK1 Inhibitor Design</h2>
<p>The authors evaluate CL on three molecular design tasks of increasing complexity, all centered on designing <a href="https://en.wikipedia.org/wiki/PDPK1">3-phosphoinositide-dependent protein kinase-1</a> (PDK1) inhibitors.</p>
<h3 id="experiment-1-target-scaffold-construction">Experiment 1: Target Scaffold Construction</h3>
<p>The goal is to generate compounds possessing a dihydro-pyrazoloquinazoline scaffold with a phenyl substituent, a scaffold not present in the prior&rsquo;s training set. Standard RL fails entirely over 2000 epochs because the probability of randomly sampling a compound with this scaffold is negligibly small, producing binary rewards (1.0 if scaffold present, 0.5 otherwise) that never rise above 0.5.</p>
<p>CL decomposes the target scaffold into 5 progressively complex substructures. Each Curriculum Progression Criterion threshold is set to 0.8. The agent learns to generate compounds with each substructure before advancing. CL finds the target scaffold within 1750 epochs, while baseline RL cannot find it in the same timeframe.</p>
<h3 id="experiments-2-and-3-molecular-docking-constraints">Experiments 2 and 3: Molecular Docking Constraints</h3>
<p>These experiments use a Production Objective combining a molecular docking constraint (retaining two hydrogen-bonding interactions with Ala 162 of PDK1, PDB ID: 2XCH) and QED (Quantitative Estimate of Druglikeness). Both experiments limit computational cost by capping production epochs at 300.</p>
<p><strong>Experiment 2</strong> uses Tanimoto (2D) similarity to a reference ligand as the Curriculum Objective. Two scenarios are tested: &ldquo;Low&rdquo; (threshold 0.5) and &ldquo;High&rdquo; (threshold 0.8).</p>
<p><strong>Experiment 3</strong> uses ROCS (3D) shape-based similarity to the reference ligand as the Curriculum Objective, with &ldquo;Low&rdquo; (0.5) and &ldquo;High&rdquo; (0.75) thresholds.</p>
<p>All experiments are run in triplicate. The baseline includes both standard RL and RL with Tanimoto/ROCS components added directly to the scoring function (not sequentially), to control for the presence of these components.</p>
<p>Across all CL experiments, CL generates between 2,941 and 9,068 more compounds with docking scores better than the reference ligand (-10.907 kcal/mol) compared to baseline RL, corresponding to a 12.42-23.79% improvement in the fraction of compounds exceeding the reference. Within each Curriculum Objective, the &ldquo;High&rdquo; threshold scenario outperforms the &ldquo;Low&rdquo; scenario by 316-3,415 additional compounds (with percentage improvements ranging from -0.4% to 10.57%).</p>
<p>Baseline RL produces essentially no compounds satisfying the docking constraint for the first 100 epochs. CL agents achieve immediate productivity: in the &ldquo;High&rdquo; Tanimoto scenario, the initial docking score already exceeds the maximum score achieved by baseline RL over 300 epochs.</p>
<h3 id="scaffold-diversity-analysis">Scaffold Diversity Analysis</h3>
<p>CL generates more unique Bemis-Murcko scaffolds than baseline RL in all experiments. The &ldquo;High&rdquo; scenarios produce more unique scaffolds than the &ldquo;Low&rdquo; scenarios. CL also produces a higher fraction of &ldquo;favorable&rdquo; scaffolds (those with better docking scores than the reference ligand).</p>
<h2 id="accelerated-convergence-with-a-diversity-trade-off">Accelerated Convergence with a Diversity Trade-off</h2>
<p>The results demonstrate three consistent findings across all experiments:</p>
<ol>
<li>
<p><strong>Accelerated productivity</strong>: CL agents reach productive sampling of favorable compounds substantially faster than baseline RL. Even a single Curriculum Objective with a computationally inexpensive metric can guide the agent to regions of chemical space where expensive Production Objectives are readily satisfied.</p>
</li>
<li>
<p><strong>Improved output quality</strong>: CL generates more compounds with favorable docking scores, more unique scaffolds, and a higher fraction of scaffolds that outperform the reference ligand.</p>
</li>
<li>
<p><strong>Controllable trade-off</strong>: The Curriculum Progression Criterion threshold provides direct control over agent policy. Higher thresholds produce better Production Objective optimization but reduce intra-set diversity (higher cross-Tanimoto similarities among generated compounds). UMAP visualizations confirm that &ldquo;Low&rdquo; and &ldquo;High&rdquo; scenarios sample from nearby but distinct regions of chemical space.</p>
</li>
</ol>
<p>The authors note that even moderate optimization of similarity-based Curriculum Objectives (the &ldquo;Low&rdquo; scenarios) already substantially narrows the agent&rsquo;s perceived solution space. This suggests that CL inherently regularizes the agent policy, trading some diversity for convergence speed.</p>
<p><strong>Limitations</strong>: The authors acknowledge that data supporting the findings are available only upon request rather than as public deposits. The approach is demonstrated on a single target (PDK1), and the curriculum design requires domain expertise to decompose objectives appropriately. The inverse relationship between Curriculum Objective optimization and solution diversity means practitioners must carefully tune thresholds for their specific applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>Not specified</td>
          <td>Used to pretrain the RNN on SMILES syntax</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 receptor crystal structure</td>
      </tr>
  </tbody>
</table>
<p>Raw data supporting the findings are available from the corresponding author upon request.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>REINVENT platform with LSTM-based RNN (3 hidden layers, 512 cells, embedding size 256)</li>
<li>Scoring function: weighted geometric mean of components</li>
<li>Curriculum Progression Criteria: score thresholds (0.5 or 0.75-0.8 depending on scenario)</li>
<li>Diversity filter: Identical Murcko Scaffold with bucket size 25 (Production Phase only)</li>
<li>Inception (experience replay) for both phases, reset at phase transition</li>
<li>Batch size: 128, learning rate: 0.0001, sigma: 128, Adam optimizer</li>
</ul>
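<p>The reported training settings can be collected in one place for reference. The field names below are illustrative and do not reflect REINVENT&rsquo;s actual configuration schema:</p>

```python
# Training settings reported in the paper, gathered into a plain dict.
# Key names are illustrative -- this is NOT REINVENT's real config format.
cl_config = {
    "batch_size": 128,
    "learning_rate": 1e-4,
    "sigma": 128,
    "optimizer": "Adam",
    "diversity_filter": {              # Production Phase only
        "name": "IdenticalMurckoScaffold",
        "bucket_size": 25,
    },
    "progression_thresholds": {        # Curriculum Progression Criteria
        "low_scenarios": 0.5,
        "high_scenarios": (0.75, 0.8), # high ROCS / high Tanimoto & scaffold
    },
}
```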
<h3 id="models">Models</h3>
<ul>
<li>Prior: RNN pretrained on ChEMBL SMILES</li>
<li>Agent: Initialized from prior, focused via RL/CL</li>
<li>No pretrained model weights are publicly released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score (Glide SP)</td>
          <td>Predicted binding affinity (kcal/mol)</td>
          <td>Lower is better; reference ligand: -10.907</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimate of Druglikeness</td>
          <td>Range [0, 1]</td>
      </tr>
      <tr>
          <td>Unique Bemis-Murcko scaffolds</td>
          <td>Scaffold diversity measure</td>
          <td>Averaged over triplicates</td>
      </tr>
      <tr>
          <td>Cross-Tanimoto similarity</td>
          <td>Intra-set compound diversity</td>
          <td>Calculated on pooled triplicates</td>
      </tr>
      <tr>
          <td>Tanimoto/ROCS similarity</td>
          <td>Curriculum Objective metrics</td>
          <td>2D fingerprint and 3D shape similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>GPU: NVIDIA Tesla V100 (32 GB)</li>
<li>Docking: AWS p3.8xlarge instance</li>
<li>LigPrep parallelized over 8 CPU cores</li>
<li>Glide docking parallelized over 48 CPU cores via DockStream</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>De novo molecular design platform</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity/blob/master/notebooks/Automated_Curriculum_Learning_Demo.ipynb">CL Tutorial Notebook</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Jupyter notebook tutorial for curriculum learning</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Fialková, V., Arango, J. D., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2022). Improving de novo molecular design with curriculum learning. <em>Nature Machine Intelligence</em>, 4, 555-563. <a href="https://doi.org/10.1038/s42256-022-00494-4">https://doi.org/10.1038/s42256-022-00494-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2022curriculum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving de novo molecular design with curriculum learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Fialkov{\&#39;a}, Vendy and Arango, Juan Diego and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{555--563}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00494-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CogMol: Controlled Molecule Generation for COVID-19</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</guid><description>CogMol combines a SMILES VAE with controlled latent space sampling to generate drug-like molecules with target specificity for novel viral proteins.</description><content:encoded><![CDATA[<h2 id="a-controlled-generation-framework-for-target-specific-drug-design">A Controlled Generation Framework for Target-Specific Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces CogMol (Controlled Generation of Molecules), an end-to-end framework for de novo drug design. The primary contribution is a pipeline that combines a SMILES-based Variational Autoencoder (VAE) with multi-attribute controlled latent space sampling (CLaSS) to generate novel drug-like molecules with high binding affinity to specified protein targets, off-target selectivity, and favorable drug-likeness properties. The framework operates on protein sequence embeddings, allowing it to generalize to unseen target proteins without model retraining.</p>
<h2 id="multi-constraint-drug-design-for-novel-viral-targets">Multi-Constraint Drug Design for Novel Viral Targets</h2>
<p>Traditional drug discovery costs 2-3 billion USD and takes over a decade, with a success rate below 10%. Generating drug molecules requires satisfying multiple competing objectives simultaneously: target binding affinity, off-target selectivity, synthetic accessibility, drug-likeness, and low toxicity. Prior generative approaches using reinforcement learning or Bayesian optimization are computationally expensive and typically require fine-tuning on target-specific ligand libraries, making them unable to generalize to unseen protein targets.</p>
<p>The emergence of SARS-CoV-2 in 2020 created an urgent need for antiviral drug candidates targeting novel viral proteins. Because no binding affinity data existed for these new targets, and the viral proteins were not closely related to proteins in existing databases like BindingDB, existing target-specific generative frameworks could not be directly applied. CogMol addresses this by using pre-trained protein sequence embeddings from UniRep (trained on 24 million UniRef50 sequences) rather than learning protein representations from the limited BindingDB training set.</p>
<h2 id="controlled-latent-space-sampling-with-pre-trained-protein-embeddings">Controlled Latent Space Sampling with Pre-trained Protein Embeddings</h2>
<p>CogMol&rsquo;s core innovation is a three-component architecture that enables multi-constraint molecule generation for unseen targets:</p>
<p><strong>1. SMILES VAE with adaptive pre-training.</strong> A Variational Autoencoder is first trained unsupervised on the MOSES/ZINC dataset (1.6M molecules), then jointly fine-tuned with QED and SA property predictors on BindingDB molecules. The standard VAE objective is:</p>
<p>$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{p(x)} \left\{ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z)) \right\}$$</p>
<p>where $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$ specifies a diagonal Gaussian encoder distribution.</p>
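<p>For this diagonal Gaussian encoder and a standard normal prior, the KL term in the objective above has a well-known closed form, sketched here over plain Python lists:</p>

```python
import math

def diag_gaussian_kl(mu, sigma_sq):
    """Closed-form KL( N(mu, diag(sigma_sq)) || N(0, I) ), the regularization
    term of the VAE objective, computed per-dimension and summed."""
    return 0.5 * sum(s + m * m - 1.0 - math.log(s)
                     for m, s in zip(mu, sigma_sq))
```

<p>The term vanishes exactly when the posterior matches the prior and grows as the encoder distribution drifts away, which is what anchors the latent space during training.</p>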
<p><strong>2. Protein-molecule binding affinity predictor.</strong> A regression model takes pre-trained UniRep protein sequence embeddings and molecule latent embeddings $z$ as input and predicts pIC50 binding affinity ($= -\log(\text{IC50})$). Because UniRep embeddings capture sequence, structural, and functional relationships from a large unsupervised corpus, the predictor can estimate binding affinity for novel target sequences not present in the training data.</p>
<p><strong>3. CLaSS controlled sampling.</strong> The Conditional Latent attribute Space Sampling scheme generates molecules satisfying multiple constraints (affinity, QED, selectivity) through rejection sampling in the VAE latent space:</p>
<p>$$p(\mathbf{x} | \mathbf{a}) = \mathbb{E}_{\mathbf{z}} [p(\mathbf{z} | \mathbf{a}) \, p(\mathbf{x} | \mathbf{z})] \approx \mathbb{E}_{\mathbf{z}} [\hat{p}_\xi(\mathbf{z} | \mathbf{a}) \, p_\theta(\mathbf{x} | \mathbf{z})]$$</p>
<p>where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ is a set of independent attribute constraints. The conditional density $\hat{p}_\xi(\mathbf{z} | \mathbf{a})$ is approximated using a Gaussian mixture model $Q_\xi(\mathbf{z})$ and per-attribute classifiers $q_\xi(a_i | \mathbf{z})$, with Bayes&rsquo; rule and conditional independence assumptions. The acceptance probability equals the product of all attribute predictor scores, enabling efficient multi-constraint sampling without surrogate model or policy learning.</p>
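<p>A minimal sketch of this rejection-sampling scheme follows. The <code>sample_latent</code> and <code>attribute_models</code> callables are hypothetical stand-ins for the paper&rsquo;s fitted latent density $Q_\xi(\mathbf{z})$ and per-attribute classifiers; decoding accepted latents to SMILES via the VAE decoder is omitted:</p>

```python
import random

def class_rejection_sample(sample_latent, attribute_models, n_accept,
                           max_draws=100_000, rng=random.random):
    """Sketch of CLaSS: draw z from the latent density model and accept with
    probability equal to the product of per-attribute classifier scores.
    All names here are illustrative stand-ins, not the paper's code."""
    accepted = []
    for _ in range(max_draws):
        if len(accepted) == n_accept:
            break
        z = sample_latent()
        accept_prob = 1.0
        for score in attribute_models:
            accept_prob *= score(z)   # each score lies in [0, 1]
        if rng() < accept_prob:
            accepted.append(z)
    return accepted
```

<p>Because acceptance requires no gradient steps or surrogate training, adding a constraint is as cheap as multiplying in one more classifier score.</p>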
<p><strong>Selectivity modeling.</strong> Off-target selectivity for a molecule $m$ against target $T$ is defined as:</p>
<p>$$\text{Sel}_{T,m} = \text{BA}(T, m) - \frac{1}{k} \sum_{i=1}^{k} \text{BA}(T_i, m)$$</p>
<p>where $\text{BA}(T, m)$ is binding affinity to the target and $T_i$ are $k$ randomly selected off-targets. This selectivity score is incorporated as a control attribute during CLaSS sampling.</p>
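<p>The selectivity score is a one-line computation once a binding-affinity predictor is available; <code>predict_affinity</code> below is a hypothetical stand-in for the paper&rsquo;s pIC50 regression model:</p>

```python
def selectivity(predict_affinity, molecule, target, off_targets):
    """Sel_{T,m}: predicted affinity for the intended target minus the mean
    predicted affinity over k off-targets. `predict_affinity(target, molecule)`
    is a hypothetical signature standing in for the pIC50 regressor."""
    off_mean = sum(predict_affinity(t, molecule)
                   for t in off_targets) / len(off_targets)
    return predict_affinity(target, molecule) - off_mean
```

<p>A positive score means the molecule is predicted to bind the intended target more strongly than the average off-target, which is the property CLaSS thresholds during sampling.</p>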
<h2 id="experimental-setup-covid-19-targets-and-in-silico-screening">Experimental Setup: COVID-19 Targets and In Silico Screening</h2>
<p><strong>Target proteins.</strong> CogMol was applied to three SARS-CoV-2 targets not present in BindingDB: NSP9 Replicase dimer, Main Protease (Mpro), and the Receptor-Binding Domain (RBD) of the spike protein. A cancer target (human HDAC1) with low ligand coverage in the training data was also evaluated.</p>
<p><strong>Training data.</strong> The SMILES VAE was trained on the MOSES benchmark (1.6M molecules from ZINC). The binding affinity predictor used curated IC50 data from BindingDB as reported in DeepAffinity, with all protein classes included in training.</p>
<p><strong>CLaSS controlled generation.</strong> Molecules were generated with simultaneous constraints on binding affinity (&gt; 0.5 normalized), QED (&gt; 0.8 normalized), and selectivity (&gt; 0.5 normalized). Approximately 1000 molecules per target were selected for downstream evaluation.</p>
<p><strong>In silico screening pipeline.</strong> Generated molecules underwent:</p>
<ul>
<li>Toxicity prediction via a multi-task deep neural network (MT-DNN) on 12 Tox21 in vitro endpoints and ClinTox clinical trial failure</li>
<li>Binding affinity rescoring with a higher-accuracy SMILES-level predictor</li>
<li>Blind docking (5 independent runs per molecule) using AutoDock Vina against target protein structures</li>
<li>Synthetic feasibility assessment using a retrosynthetic algorithm based on the Molecular Transformer trained on patent reaction data</li>
</ul>
<p><strong>Baselines.</strong> VAE performance was benchmarked against models from the MOSES platform. CLaSS-accepted molecules were compared against randomly sampled molecules from the latent space. Generated molecules were compared against FDA-approved drugs for toxicity and synthesizability.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>CLaSS enrichment (Table 1).</strong> CLaSS consistently produced higher fractions of molecules meeting all criteria compared to random sampling. For the triple constraint (affinity &gt; 0.5, QED &gt; 0.8, selectivity &gt; 0.5), the enrichment was substantial: 6.9% vs. 0.7% for NSP9, 9.0% vs. 0.9% for RBD, and 10.4% vs. 1.1% for Mpro.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>CLaSS (Aff+QED+Sel)</th>
          <th>Random (Aff+QED+Sel)</th>
          <th>Enrichment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NSP9</td>
          <td>6.9%</td>
          <td>0.7%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>RBD</td>
          <td>9.0%</td>
          <td>0.9%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>Mpro</td>
          <td>10.4%</td>
          <td>1.1%</td>
          <td>~9.5x</td>
      </tr>
  </tbody>
</table>
<p><strong>Docking results (Table 3).</strong> 87-95% of high-affinity generated molecules showed docking binding free energy (BFE) below -6 kcal/mol, with minimum BFEs reaching -8.6 to -9.5 kcal/mol depending on the target.</p>
<p><strong>Novelty.</strong> The likelihood of generating an exact duplicate of a training molecule was 2% or less. Against the full PubChem database (~103M molecules), exact matches ranged from 3.7% to 9.5%. Generated molecules also showed novel chemical scaffolds, as confirmed by a high Fréchet ChemNet Distance.</p>
<p><strong>Synthesizability.</strong> Generated molecules for COVID-19 targets showed 85-90% synthetic feasibility using retrosynthetic analysis, exceeding the ~78% rate of FDA-approved drugs.</p>
<p><strong>Toxicity.</strong> Approximately 70% of generated parent molecules and ~80% of predicted metabolites were predicted toxic in at most one of the 13 endpoints, comparable to FDA-approved drugs.</p>
<h2 id="generated-molecules-show-favorable-binding-and-drug-like-properties">Generated Molecules Show Favorable Binding and Drug-Like Properties</h2>
<p>CogMol demonstrates that controlled latent space sampling with pre-trained protein embeddings can generate novel, drug-like molecules for unseen viral targets. The key findings are:</p>
<ol>
<li>CLaSS provides roughly 10x enrichment over random latent space sampling for molecules satisfying all three constraints (affinity, QED, selectivity).</li>
<li>Generated molecules bind favorably to druggable pockets in target protein 3D structures, even though the generation model uses only 1D sequence information.</li>
<li>Some generated SMILES matched existing PubChem molecules with known biological activity, suggesting the model identifies chemically relevant regions of molecular space.</li>
<li>The framework generalizes across targets of varying novelty, with Mpro (more similar to training proteins) yielding easier generation than NSP9 or RBD.</li>
</ol>
<p><strong>Limitations.</strong> The authors note that no wet-lab validation was performed on generated candidates. There may be divergence between ML-predicted properties and experimental measurements. The binding affinity predictor&rsquo;s accuracy is bounded by the quality and coverage of BindingDB training data. Selectivity modeling uses a random sample of off-targets rather than a pharmacologically curated panel.</p>
<p><strong>Future directions.</strong> The authors propose incorporating additional contexts beyond target protein (e.g., metabolic pathways), adding more pharmacologically relevant controls, and weighting objectives by relative importance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VAE pre-training</td>
          <td>MOSES/ZINC</td>
          <td>1.6M train, 176K test</td>
          <td>Publicly available benchmark</td>
      </tr>
      <tr>
          <td>VAE adaptive training</td>
          <td>BindingDB (DeepAffinity split)</td>
          <td>~27K protein-ligand pairs</td>
          <td>Curated IC50 data</td>
      </tr>
      <tr>
          <td>Protein embeddings</td>
          <td>UniRef50 via UniRep</td>
          <td>24M sequences</td>
          <td>Pre-trained, publicly available</td>
      </tr>
      <tr>
          <td>Toxicity prediction</td>
          <td>Tox21 + ClinTox</td>
          <td>12 in vitro + clinical endpoints</td>
          <td>Public benchmark datasets</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>3 SARS-CoV-2 targets</td>
          <td>Public crystal structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>VAE architecture: SMILES encoder-decoder with diagonal Gaussian latent space, jointly trained with QED and SA regressors</li>
<li>CLaSS: rejection sampling from Gaussian mixture model of latent space with per-attribute classifiers</li>
<li>Binding affinity: regression on concatenated UniRep protein embeddings and VAE molecule embeddings</li>
<li>Selectivity: excess binding affinity over average of $k$ random off-targets</li>
</ul>
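<p>The selectivity criterion in the last bullet is simple to state numerically. A minimal sketch, assuming a hypothetical <code>affinity_fn(molecule, protein)</code> interface standing in for the paper&rsquo;s learned affinity regressor (the function name and toy values are illustrative, not from the paper):</p>

```python
import numpy as np

def selectivity_score(affinity_fn, molecule, target, off_targets, k=10, seed=0):
    """Excess predicted binding affinity of `molecule` for `target` over the
    mean affinity across k randomly sampled off-target proteins."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(off_targets), size=k, replace=False)
    off_mean = np.mean([affinity_fn(molecule, off_targets[i]) for i in idx])
    return affinity_fn(molecule, target) - off_mean

# Toy affinity function: higher predicted affinity for the intended target.
toy = lambda mol, prot: 9.0 if prot == "TARGET" else 5.0
offs = [f"OFF{i}" for i in range(50)]
score = selectivity_score(toy, "mol", "TARGET", offs, k=10)  # 9.0 - 5.0 = 4.0
```

A positive score indicates the molecule is predicted to bind the intended target more strongly than a random off-target panel.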
<h3 id="models">Models</h3>
<ul>
<li>SMILES VAE with adaptive pre-training (ZINC then BindingDB)</li>
<li>Multi-task toxicity classifier (MT-DNN) for Tox21 and ClinTox endpoints</li>
<li>Binding affinity predictor (latent-level for generation, SMILES-level for screening)</li>
<li>Retrosynthetic predictor based on Molecular Transformer</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>90%</td>
          <td>-</td>
          <td>Generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99%</td>
          <td>-</td>
          <td>Among valid molecules</td>
      </tr>
      <tr>
          <td>Filter pass</td>
          <td>95%</td>
          <td>-</td>
          <td>Relevant chemical filters</td>
      </tr>
      <tr>
          <td>Docking BFE &lt; -6 kcal/mol</td>
          <td>87-95%</td>
          <td>-</td>
          <td>Varies by target</td>
      </tr>
      <tr>
          <td>Synthetic feasibility</td>
          <td>85-90%</td>
          <td>78% (FDA drugs)</td>
          <td>COVID-19 targets</td>
      </tr>
      <tr>
          <td>Low toxicity (0-1 endpoints)</td>
          <td>~70% parent, ~80% metabolite</td>
          <td>Comparable to FDA drugs</td>
          <td>MT-DNN prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU types or training times. The work was funded internally by IBM Research.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">CogMol (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">~3500 generated molecules</a></td>
          <td>Dataset</td>
          <td>Open license</td>
          <td>For three SARS-CoV-2 targets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chenthamarakshan, V., Das, P., Hoffman, S. C., Strobelt, H., Padhi, I., Lim, K. W., Hoover, B., Manica, M., Born, J., Laino, T., &amp; Mojsilovic, A. (2020). CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. <em>Advances in Neural Information Processing Systems</em>, 33, 4320-4332.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chenthamarakshan2020cogmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chenthamarakshan, Vijil and Das, Payel and Hoffman, Samuel C. and Strobelt, Hendrik and Padhi, Inkit and Lim, Kar Wai and Hoover, Benjamin and Manica, Matteo and Born, Jannis and Laino, Teodoro and Mojsilovi{\&#39;c}, Aleksandra}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4320--4332}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CDDD: Learning Descriptors by Translating SMILES</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/</guid><description>CDDD learns continuous molecular descriptors by translating between SMILES and InChI representations, outperforming fingerprints in virtual screening.</description><content:encoded><![CDATA[<h2 id="a-translation-based-method-for-learned-molecular-descriptors">A Translation-Based Method for Learned Molecular Descriptors</h2>
<p>This is a <strong>Method</strong> paper that introduces Continuous and Data-Driven Descriptors (CDDD), a neural machine translation approach for learning fixed-size, continuous molecular representations. Rather than training an autoencoder to reconstruct <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, Winter et al. train an encoder-decoder model to translate between semantically equivalent but syntactically different molecular representations (e.g., randomized SMILES to canonical SMILES, or <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> to canonical SMILES). The bottleneck latent vector serves as a general-purpose molecular descriptor. Pretrained on approximately 72 million compounds from <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC15</a> and PubChem, CDDD produces 512-dimensional descriptors that achieve competitive QSAR performance and significantly outperform all tested molecular fingerprints in ligand-based virtual screening.</p>
<h2 id="why-translation-instead-of-reconstruction">Why Translation Instead of Reconstruction?</h2>
<p>Molecular descriptors are central to cheminformatics. Traditional approaches rely on human-engineered fingerprints like ECFPs, which encode structural features as fixed-length bit vectors. While effective, these representations are constrained by predefined feature extraction rules.</p>
<p>Recent work applied deep neural networks directly to molecular graphs or SMILES strings to learn task-specific representations. However, these end-to-end approaches must learn features from scratch for each new dataset, making them prone to overfitting on the small bioactivity datasets typical in drug discovery.</p>
<p>Unsupervised approaches based on autoencoders (notably <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.&rsquo;s VAE</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/">Xu et al.&rsquo;s seq2seq model</a>) offered a path toward general-purpose learned descriptors. These models reconstruct SMILES strings through an information bottleneck, forcing the latent space to capture molecular information. The concern with reconstruction, however, is that the model may focus on syntactic patterns of the string representation rather than the underlying chemical semantics. A model that memorizes SMILES syntax shortcuts can achieve low reconstruction error without truly encoding chemical meaning.</p>
<p>Winter et al. address this by drawing on the analogy to neural machine translation: a translator must understand the meaning of a sentence to produce a correct translation in another language. By training the model to translate between different molecular representations (which share chemical semantics but differ in syntax), the latent space is forced to capture the chemical information common to both representations, rather than representation-specific syntactic artifacts.</p>
<h2 id="translation-as-semantic-compression">Translation as Semantic Compression</h2>
<p>The core insight is that translating between two syntactically different but semantically equivalent representations forces the encoder to capture only the chemical meaning shared by both. The model architecture follows the standard encoder-decoder framework from neural machine translation.</p>
<p>The encoder reads a source molecular string (e.g., a randomized SMILES or InChI) and compresses it into a fixed-size latent vector. The decoder takes this latent vector and generates the target molecular string (canonical SMILES). The model is trained to minimize character-level cross-entropy between the decoder output and the target sequence.</p>
<p>Four translation tasks were evaluated:</p>
<ol>
<li><strong>Randomized SMILES to canonical SMILES</strong> (best performing)</li>
<li><strong>InChI to canonical SMILES</strong></li>
<li><strong>Canonical SMILES to canonical SMILES</strong> (autoencoding baseline)</li>
<li><strong>Canonical SMILES to InChI</strong> (failed to learn)</li>
</ol>
<p>The final model uses an RNN encoder with 3 stacked GRU layers (512, 1024, and 2048 units). The concatenated cell states pass through a fully connected layer with tanh activation to produce a 512-dimensional latent vector. The decoder mirrors this architecture, initializing its GRU states from the latent vector via separate fully connected layers. Teacher forcing is used during training, and left-to-right beam search is used at inference.</p>
<p>An auxiliary property prediction network takes the latent vector as input and predicts nine molecular properties (including logP, partial charges, valence electrons, H-bond donors/acceptors, Balaban&rsquo;s J, <a href="https://en.wikipedia.org/wiki/Molar_refractivity">molar refractivity</a>, and TPSA). This multi-task signal encourages the latent space to encode physically meaningful information. The full training objective combines the translation cross-entropy loss with the property prediction mean squared error:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{translation}} + \mathcal{L}_{\text{properties}}$$</p>
<p>To ensure invariance to input SMILES representation at inference time, the model uses randomized SMILES as input half the time and canonical SMILES the other half during training. Input dropout (15% at the character level) and Gaussian noise (standard deviation 0.05) are applied for regularization.</p>
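<p>The combined objective can be sketched with toy tensors. This is a numpy illustration of the equal-weighted sum of character-level translation cross-entropy and property MSE described above, not the authors&rsquo; TensorFlow implementation:</p>

```python
import numpy as np

def cddd_loss(decoder_logits, target_tokens, prop_pred, prop_true):
    """Translation cross-entropy plus auxiliary property MSE.
    Toy shapes: logits (T, V), targets (T,), properties (9,)."""
    # numerically stable log-softmax over the vocabulary axis
    z = decoder_logits - decoder_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(target_tokens)), target_tokens].mean()
    mse = np.mean((prop_pred - prop_true) ** 2)
    return ce + mse  # equal weighting, matching the stated objective
```

With uniform logits over a vocabulary of size V, the cross-entropy term reduces to log(V), which is a handy sanity check.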
<h2 id="qsar-benchmarks-virtual-screening-and-latent-space-exploration">QSAR Benchmarks, Virtual Screening, and Latent Space Exploration</h2>
<h3 id="pretraining">Pretraining</h3>
<p>The model was pretrained on approximately 72 million compounds from ZINC15 and PubChem (merged, deduplicated, filtered for organic molecules with MW 12-600, &gt;3 heavy atoms, logP between -7 and 5). All evaluation compounds were removed from the pretraining set.</p>
<h3 id="qsar-experiments">QSAR Experiments</h3>
<p>Ten QSAR datasets were used, spanning classification (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames mutagenicity</a>, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG inhibition</a>, <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB penetration</a>, BACE inhibition, bee toxicity) and regression (EGFR inhibition, <a href="https://en.wikipedia.org/wiki/Plasmodium_falciparum">Plasmodium falciparum</a> inhibition, lipophilicity, aqueous solubility, melting point). Two datasets (Ames and lipophilicity) served as validation for architecture selection; the remaining eight were held out for final evaluation.</p>
<p>CDDD descriptors with an SVM were benchmarked against:</p>
<ul>
<li>Nine circular fingerprint variants (Morgan fingerprints, radius 1-3, folded to 512/1024/2048 bits) with RF, SVM, and GB</li>
<li>Graph convolution models (<a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">DeepChem</a>)</li>
</ul>
<p>Both random-split and cluster-split (K-means on MACCS fingerprints, K=5) cross-validation were performed.</p>
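<p>The cluster-split protocol can be sketched as follows: cluster fingerprint vectors with K-means and treat each cluster as one CV fold, so test molecules are structurally dissimilar from training molecules. This minimal numpy implementation (rather than the MACCS/scikit-learn pipeline the authors likely used) illustrates the idea:</p>

```python
import numpy as np

def kmeans_cluster_split(fps, k=5, iters=20, seed=0):
    """Assign each fingerprint row to one of k clusters via Lloyd's
    algorithm; each cluster's indices form one cross-validation fold."""
    rng = np.random.default_rng(seed)
    fps = np.asarray(fps, float)
    centers = fps[rng.choice(len(fps), size=k, replace=False)].copy()
    labels = np.zeros(len(fps), dtype=int)
    for _ in range(iters):
        d = ((fps[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = fps[labels == j].mean(axis=0)
    return [np.where(labels == j)[0] for j in range(k)]
```

The folds partition the dataset, but unlike random CV their sizes are uneven, reflecting the cluster structure of chemical space.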
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Split</th>
          <th>CDDD + SVM</th>
          <th>Best Fingerprint</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ames (ROC-AUC)</td>
          <td>Random</td>
          <td>0.89</td>
          <td>0.89 (ecfc2, RF)</td>
          <td>0.88</td>
      </tr>
      <tr>
          <td>hERG (ROC-AUC)</td>
          <td>Random</td>
          <td>0.86</td>
          <td>0.85 (ecfc4, RF)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC)</td>
          <td>Random</td>
          <td>0.93</td>
          <td>0.93 (ecfc2, RF)</td>
          <td>0.92</td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC)</td>
          <td>Random</td>
          <td>0.90</td>
          <td>0.91 (ecfc2, RF)</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Bee toxicity (ROC-AUC)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.91 (ecfc6, RF)</td>
          <td>0.89</td>
      </tr>
      <tr>
          <td>Lipophilicity ($r^2$)</td>
          <td>Random</td>
          <td>0.72</td>
          <td>0.69 (ecfc2, SVM)</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>ESOL ($r^2$)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.58 (ecfc6, SVM)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>Melting point ($r^2$)</td>
          <td>Random</td>
          <td>0.42</td>
          <td>0.38 (ecfc2, SVM)</td>
          <td>0.39</td>
      </tr>
  </tbody>
</table>
<p>CDDD descriptors showed competitive or better performance across all tasks. Notably, CDDD achieved substantially higher $r^2$ on aqueous solubility (0.92 vs. 0.58 for the best fingerprint). The authors emphasize that CDDD&rsquo;s feature extraction was fixed based on two validation tasks, while baseline methods selected the best fingerprint/model combination per task, making the comparison conservative for CDDD.</p>
<h3 id="virtual-screening">Virtual Screening</h3>
<p>Ligand-based virtual screening experiments followed the Riniker et al. benchmarking protocol on 40 DUD targets and 17 MUV targets. Five active compounds were randomly selected per target, and remaining compounds were ranked by similarity (cosine similarity for CDDD, <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto</a> for fingerprints). This process was repeated 50 times per target.</p>
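<p>The ranking step can be sketched in a few lines. This assumes a MAX fusion rule over the query actives, a common choice in this style of benchmark, though the exact fusion used here is a detail of the Riniker et al. protocol:</p>

```python
import numpy as np

def screen_by_similarity(query_desc, library_desc):
    """Rank library molecules by maximum cosine similarity to a small
    set of query actives (CDDD descriptors; fingerprints would use
    Tanimoto similarity instead). Returns indices, most similar first."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    lib = library_desc / np.linalg.norm(library_desc, axis=1, keepdims=True)
    best = (lib @ q.T).max(axis=1)  # fuse per-query similarities with MAX
    return np.argsort(-best)
```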
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>CDDD (ROC-AUC)</th>
          <th>Second Best</th>
          <th>p-value (Wilcoxon)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DUD</td>
          <td>0.949</td>
          <td>0.899 (laval)</td>
          <td>$5 \times 10^{-38}$</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>0.679</td>
          <td>0.677 (ap)</td>
          <td>0.04</td>
      </tr>
  </tbody>
</table>
<p>CDDD significantly outperformed all 14 baseline fingerprints on both databases. The DUD improvement was particularly large (+5.0 ROC-AUC points over the next best). On MUV, which is designed to be harder, the advantage was smaller but still statistically significant. Importantly, while the best baseline fingerprint varied between DUD and MUV (laval vs. ap), CDDD ranked first on both, demonstrating consistent performance.</p>
<h3 id="latent-space-exploration">Latent Space Exploration</h3>
<p>The continuous, reversible nature of CDDD enables chemical space navigation. Shifting a molecule&rsquo;s embedding along the first principal component of the pretraining data correlates with molecular size (Spearman $r = 0.947$, $p = 0.00048$), while the second principal component correlates with polarity/logP ($r = -0.916$, $p = 0.00015$).</p>
<p>When shifting 1000 compounds along 100 random directions, the model maintained high valid SMILES generation rates (&gt;97% for the top beam search output, &gt;99% when considering the top 3 outputs). Euclidean distance in the descriptor space correlated smoothly with Tanimoto distance in fingerprint space, confirming that the latent space supports meaningful interpolation.</p>
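<p>The principal-component shift is easy to reproduce on the embedding side. A sketch (the decoder that maps the shifted vector back to a SMILES string is not shown):</p>

```python
import numpy as np

def shift_along_pc(embeddings, molecule_emb, alpha):
    """Move a molecule's embedding by alpha along the first principal
    component of a reference embedding set (e.g., the pretraining data)."""
    centered = embeddings - embeddings.mean(axis=0)
    # first right-singular vector = first principal component direction
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return molecule_emb + alpha * vt[0]
```

Because <code>vt[0]</code> is unit-norm, the Euclidean step size in descriptor space equals <code>|alpha|</code>, which is what makes the reported descriptor-distance vs. Tanimoto-distance correlation interpretable.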
<h2 id="consistent-learned-descriptors-for-chemistry">Consistent Learned Descriptors for Chemistry</h2>
<p>CDDD demonstrated that translation between molecular representations produces more informative latent spaces than autoencoder reconstruction. The key findings are:</p>
<ol>
<li><strong>Translation outperforms reconstruction</strong>: Models trained on translating between different representations consistently produced better downstream descriptors than autoencoding models, despite autoencoding being an easier task.</li>
<li><strong>Auxiliary property prediction helps</strong>: The additional property regression task improved descriptor quality, particularly for physicochemical endpoints correlated with the predicted properties.</li>
<li><strong>Consistent performance</strong>: Unlike baseline methods where the best fingerprint varies by task, CDDD showed consistent performance across all QSAR and VS experiments.</li>
<li><strong>Smooth latent space</strong>: The continuous descriptor space supports meaningful interpolation and chemical space exploration with high valid SMILES rates.</li>
</ol>
<p>The authors acknowledge several limitations. The InChI-to-SMILES translation worked but produced inferior descriptors compared to SMILES-to-SMILES, and SMILES-to-InChI translation failed entirely, likely due to InChI&rsquo;s complex syntax (counting, arithmetic). The approach was only tested with string-based representations; translation between conceptually different representations (e.g., 3D structures) remains future work. The QSAR evaluation, while extensive, used relatively standard datasets, and the method&rsquo;s advantage over graph convolution models was modest on tasks where end-to-end learning had sufficient data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15 + PubChem (merged)</td>
          <td>~72M compounds</td>
          <td>Filtered: organic, MW 12-600, &gt;3 heavy atoms, logP -7 to 5</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Ames mutagenicity</td>
          <td>6,130</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Lipophilicity</td>
          <td>3,817</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>hERG, BBBP, BACE, bee toxicity</td>
          <td>188-3,440</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>EGFR, Plasmodium, ESOL, melting point</td>
          <td>184-4,451</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>DUD</td>
          <td>40 targets</td>
          <td>Ligand-based virtual screening</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>MUV</td>
          <td>17 targets</td>
          <td>Maximum unbiased validation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 3 stacked GRU layers (512, 1024, 2048 units) with tanh bottleneck to 512-dim latent space</li>
<li>Decoder: Matching 3 stacked GRU layers, initialized from latent space</li>
<li>Auxiliary property network: 3 FC layers (512, 128, 9) regressing nine molecular properties</li>
<li>Optimizer: Adam, initial LR $5 \times 10^{-4}$, decayed by 0.9 every 50,000 steps</li>
<li>Batch size: 64 with bucketing by sequence length</li>
<li>Input regularization: 15% character dropout + Gaussian noise (std 0.05)</li>
<li>Beam search for decoding at inference</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jrwnter/cddd">CDDD (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pretrained model and extraction code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>QSAR: 5-fold random CV and 5-fold cluster CV (K-means on MACCS, K=5)</li>
<li>Classification metric: ROC-AUC</li>
<li>Regression metric: $r^2$</li>
<li>VS: ROC-AUC averaged over 50 random active set selections per target</li>
<li>Statistical test: <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for VS comparisons</li>
</ul>
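<p>ROC-AUC, the metric used for both the classification and VS experiments, can be computed directly from the rank-sum identity: it is the probability that a randomly chosen active scores above a randomly chosen inactive, with ties counted half. A self-contained sketch:</p>

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC-AUC via the Mann-Whitney U identity (O(n*m) pairwise form,
    fine for illustration; rank-based forms scale better)."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```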
<h3 id="hardware">Hardware</h3>
<ul>
<li>Framework: TensorFlow 1.4.1</li>
<li>CDDD descriptor extraction on GPU is comparable in speed to RDKit fingerprint calculation on CPU</li>
<li>SVM training on 512-dim CDDD descriptors takes seconds (vs. minutes for 2048-dim fingerprints)</li>
<li>Graph convolution training: ~30 minutes per task on GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Montanari, F., Noe, F., &amp; Clevert, D.-A. (2019). Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. <em>Chemical Science</em>, 10(6), 1692-1701. <a href="https://doi.org/10.1039/C8SC04175J">https://doi.org/10.1039/C8SC04175J</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{winter2019learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and Montanari, Floriane and No{\&#39;e}, Frank and Clevert, Djork-Arn{\&#39;e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1692--1701}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C8SC04175J}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BindGPT: GPT for 3D Molecular Design and Docking</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</guid><description>BindGPT applies GPT-style language modeling to 3D molecular generation using SMILES+XYZ tokenization, pre-training, and RL-based docking optimization.</description><content:encoded><![CDATA[<h2 id="a-language-model-for-joint-3d-molecular-graph-and-conformation-generation">A Language Model for Joint 3D Molecular Graph and Conformation Generation</h2>
<p>BindGPT is a <strong>Method</strong> paper that introduces a GPT-based language model for generating 3D molecular structures. The primary contribution is a unified framework that jointly produces molecular graphs (via SMILES) and 3D coordinates (via XYZ tokens) within a single autoregressive model. This eliminates the need for external graph reconstruction tools like OpenBabel, which are error-prone when applied to noisy atom positions. The same pre-trained model serves as a 3D molecular generative model, a conformer generator conditioned on molecular graphs, and a pocket-conditioned 3D molecule generator.</p>
<h2 id="the-graph-reconstruction-problem-in-3d-molecular-generation">The Graph Reconstruction Problem in 3D Molecular Generation</h2>
<p>Most existing 3D molecular generators focus on predicting atom types and positions, relying on supplementary software (e.g., OpenBabel or RDKit) to reconstruct molecular bonds from predicted coordinates. This introduces a fragile dependency: small positional errors can drastically change the reconstructed molecular graph or produce disconnected structures. Additionally, while diffusion models and equivariant GNNs have shown strong results for 3D molecular generation, they often depend on SE(3) equivariance inductive biases and are computationally expensive at sampling time (up to $10^6$ seconds for 1000 valid molecules for EDM). The pocket-conditioned generation task is further limited by the small size of available 3D binding pose datasets (e.g., CrossDocked), making it difficult for specialized models to generalize without large-scale pre-training.</p>
<h2 id="smilesxyz-tokenization-jointly-encoding-graphs-and-coordinates">SMILES+XYZ Tokenization: Jointly Encoding Graphs and Coordinates</h2>
<p>The core innovation in BindGPT is coupling SMILES notation with XYZ coordinate format in a single token sequence. The sequence starts with a <code>&lt;LIGAND&gt;</code> token, followed by character-level SMILES tokens encoding the molecular graph, then an <code>&lt;XYZ&gt;</code> token marking the transition to coordinate data. Each 3D atom position is encoded using 6 tokens (integer and fractional parts for each of the three coordinates). The atom ordering is synchronized between SMILES and XYZ, so atom symbols from SMILES are not repeated in the coordinate section.</p>
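<p>The tokenization scheme can be illustrated concretely. This sketch follows the description above (character-level SMILES, an <code>&lt;XYZ&gt;</code> separator, then 6 tokens per atom for the integer and fractional parts of x, y, z); the token strings and the fixed-precision encoding are illustrative assumptions, since the paper does not publish its exact vocabulary:</p>

```python
def tokenize_ligand(smiles, coords, precision=2):
    """Sketch of a BindGPT-style SMILES+XYZ token sequence.
    `coords` holds one (x, y, z) triple per heavy atom, in SMILES order."""
    tokens = ["<LIGAND>"] + list(smiles) + ["<XYZ>"]
    for x, y, z in coords:
        for c in (x, y, z):
            sign = "-" if c < 0 else ""
            whole, frac = divmod(round(abs(c) * 10**precision), 10**precision)
            tokens += [f"{sign}{whole}", f".{frac:0{precision}d}"]
    return tokens

# methanol graph "CO" with two toy atom positions
toks = tokenize_ligand("CO", [(0.0, 0.0, 0.0), (1.43, 0.0, 0.0)])
```

Note that atom symbols appear only in the SMILES section; the coordinate section relies on the shared atom ordering, exactly as described above.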
<p>For protein pockets, sequences begin with a <code>&lt;POCKET&gt;</code> token followed by atom names and coordinates. Following AlphaFold&rsquo;s approach, only alpha-carbon coordinates are retained to keep pocket representations compact.</p>
<p>The model uses the GPT-NeoX architecture with rotary position embeddings (RoPE), which enables length generalization between pre-training and fine-tuning where sequence lengths differ substantially. The pre-trained model has 108M parameters with 15 layers, 12 attention heads, and a hidden dimension of 768.</p>
<h3 id="pre-training-on-large-scale-3d-data">Pre-training on Large-Scale 3D Data</h3>
<p>Pre-training uses the Uni-Mol dataset containing 208M conformations for 12M molecules and 3.2M protein pocket structures. Each training batch contains either ligand sequences or pocket sequences (not mixed within a sequence). Since pockets are far fewer than ligands, the training schedule runs 5 pocket epochs per ligand epoch, resulting in roughly 8% pocket tokens overall. Training uses large batches of 1.6M tokens per step with Flash Attention and DeepSpeed optimizations.</p>
<h3 id="supervised-fine-tuning-with-augmentation">Supervised Fine-Tuning with Augmentation</h3>
<p>For pocket-conditioned generation, BindGPT is fine-tuned on CrossDocked 2020, which contains aligned pocket-ligand pairs. Unlike prior work that subsamples less than 1% of the best pairs, BindGPT uses all intermediate ligand poses (including lower-quality ones), yielding approximately 27M pocket-ligand pairs. To combat overfitting on the limited diversity (14k unique molecules, 3k pockets), two augmentation strategies are applied:</p>
<ol>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a></strong>: Each molecule can yield 100-1000 different valid SMILES strings</li>
<li><strong>Random 3D rotation</strong>: The same rotation matrix is applied to both pocket and ligand coordinates</li>
</ol>
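<p>The second augmentation is a shared rigid rotation, which preserves the pocket-ligand pose while decorrelating absolute coordinates. A numpy sketch (SMILES randomization, the first augmentation, would use a cheminformatics toolkit such as RDKit&rsquo;s <code>MolToSmiles(..., doRandom=True)</code>; this rotation construction is a standard recipe, not necessarily the authors&rsquo; exact code):</p>

```python
import numpy as np

def random_rigid_rotation(pocket_xyz, ligand_xyz, seed=0):
    """Apply one shared random 3D rotation to pocket and ligand coordinates."""
    rng = np.random.default_rng(seed)
    # random orthogonal matrix via QR decomposition of a Gaussian matrix
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs for a uniform distribution
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1            # ensure a proper rotation (det = +1)
    return pocket_xyz @ q.T, ligand_xyz @ q.T
```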
<p>During fine-tuning, the pocket token sequence is concatenated before the ligand token sequence. An optional variant conditions on binding energy scores from the CrossDocked dataset, enabling contrastive learning between good and bad binding examples.</p>
<h3 id="reinforcement-learning-with-docking-feedback">Reinforcement Learning with Docking Feedback</h3>
<p>BindGPT applies REINFORCE (not PPO or <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, which were found less stable) to further optimize pocket-conditioned generation. On each RL step, the model generates 3D ligand structures for a batch of random protein pockets, computes binding energy rewards using QVINA docking software, and updates model parameters. A KL-penalty between the current model and the SFT initialization stabilizes training.</p>
<p>The RL objective can be written as:</p>
<p>$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \pi_\theta}\left[ R(x) \right] + \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$$</p>
<p>where $R(x)$ is the docking reward from QVINA and $\beta$ controls the strength of the KL regularization.</p>
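<p>One common way to implement this objective (used widely in RLHF-style fine-tuning, and consistent with the loss above, though the paper&rsquo;s exact estimator may differ) is to fold the KL term into a per-sample shaped reward before the REINFORCE update:</p>

```python
import numpy as np

def kl_shaped_advantages(rewards, logp_theta, logp_sft, beta=0.1):
    """Shape docking rewards with the KL penalty and subtract a mean
    baseline; each sample's grad log pi_theta would be scaled by the
    result (score-function / REINFORCE estimator)."""
    shaped = rewards - beta * (logp_theta - logp_sft)
    return shaped - shaped.mean()  # baseline for variance reduction
```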
<h2 id="experimental-evaluation-across-three-3d-generation-tasks">Experimental Evaluation Across Three 3D Generation Tasks</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations (12M molecules) + 3.2M pockets</td>
          <td>Large-scale 3D molecular dataset</td>
      </tr>
      <tr>
          <td>Fine-tuning (SFT)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>14k molecules x 3k pockets, includes all pose qualities</td>
      </tr>
      <tr>
          <td>Fine-tuning (conformer)</td>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM-DRUGS</a></td>
          <td>27M conformations for 300k molecules</td>
          <td>Standard benchmark for 3D conformer generation</td>
      </tr>
      <tr>
          <td>Evaluation (conformer)</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot evaluation holdout</td>
      </tr>
      <tr>
          <td>Evaluation (pocket)</td>
          <td>CrossDocked holdout</td>
          <td>100 pockets</td>
          <td>Held out from training</td>
      </tr>
  </tbody>
</table>
<h3 id="task-1-3d-molecule-generation-pre-training">Task 1: 3D Molecule Generation (Pre-training)</h3>
<p>Compared against XYZ-Transformer (the only other model capable of large-scale pre-training), BindGPT achieves 98.58% validity (vs. 12.87% for XYZ-TF without hydrogens), higher SA (0.77 vs. 0.21), QED (0.59 vs. 0.30), and Lipinski scores (4.86 vs. 4.79). BindGPT also produces conformations with RMSD of 0.89 (XYZ-TF&rsquo;s RMSD calculation failed to converge). Generation is 12x faster (13s vs. 165s for 1000 molecules).</p>
<h3 id="task-2-3d-molecule-generation-fine-tuned-on-geom-drugs">Task 2: 3D Molecule Generation (Fine-tuned on GEOM-DRUGS)</h3>
<p>Against EDM and MolDiff (diffusion baselines), BindGPT outperforms on nearly all 3D distributional metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>EDM</th>
          <th>MolDiff</th>
          <th>BindGPT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JS bond lengths</td>
          <td>0.246</td>
          <td>0.365</td>
          <td><strong>0.029</strong></td>
      </tr>
      <tr>
          <td>JS bond angles</td>
          <td>0.282</td>
          <td>0.155</td>
          <td><strong>0.075</strong></td>
      </tr>
      <tr>
          <td>JS dihedral angles</td>
          <td>0.328</td>
          <td>0.162</td>
          <td><strong>0.098</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond types</td>
          <td>0.378</td>
          <td>0.163</td>
          <td><strong>0.045</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond pairs</td>
          <td>0.396</td>
          <td>0.136</td>
          <td><strong>0.043</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond triplets</td>
          <td>0.449</td>
          <td>0.125</td>
          <td><strong>0.042</strong></td>
      </tr>
      <tr>
          <td>Time (1000 molecules)</td>
          <td>1.4e6 s</td>
          <td>7500 s</td>
          <td><strong>200 s</strong></td>
      </tr>
  </tbody>
</table>
<p>BindGPT is two orders of magnitude faster than diffusion baselines while producing more accurate 3D geometries. MolDiff achieves better drug-likeness scores (QED, SA), but the authors argue 3D distributional metrics are more relevant for evaluating 3D structure fidelity.</p>
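<p>The Jensen-Shannon divergences above compare histograms of geometric features (bond lengths, angles, dihedrals, bond-type frequencies) between generated and reference molecules. A minimal sketch of the metric; the bin grid and random stand-in samples are illustrative assumptions, not the paper&rsquo;s exact protocol:</p>

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])
    between two discrete distributions given as histogram counts."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Histogram bond lengths (in angstroms) from generated vs. reference
# molecules on a shared bin grid, then compare the two histograms.
rng = np.random.default_rng(0)
gen_lengths = rng.normal(1.50, 0.05, 10_000)  # stand-in samples
ref_lengths = rng.normal(1.50, 0.06, 10_000)
bins = np.linspace(1.0, 2.0, 101)
p, _ = np.histogram(gen_lengths, bins=bins)
q, _ = np.histogram(ref_lengths, bins=bins)
print(js_divergence(p, q))
```

<p>Identical distributions give 0 and fully disjoint ones give 1, so values like BindGPT&rsquo;s 0.029 for bond lengths indicate near-identical geometry statistics.</p>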
<h3 id="task-3-pocket-conditioned-molecule-generation">Task 3: Pocket-Conditioned Molecule Generation</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score</th>
          <th>SA</th>
          <th>QED</th>
          <th>Lipinski</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-7.15 +/- 4.89</td>
          <td>0.75</td>
          <td>0.57</td>
          <td>4.88</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-7.80 +/- 3.61</td>
          <td>0.58</td>
          <td>0.48</td>
          <td>4.51</td>
      </tr>
      <tr>
          <td>BindGPT-FT</td>
          <td>-5.44 +/- 2.09</td>
          <td>0.78</td>
          <td>0.50</td>
          <td>4.72</td>
      </tr>
      <tr>
          <td>BindGPT-RFT</td>
          <td>-7.24 +/- 1.68</td>
          <td>0.74</td>
          <td>0.48</td>
          <td>4.32</td>
      </tr>
      <tr>
          <td>BindGPT-RL</td>
          <td><strong>-8.60 +/- 1.90</strong></td>
          <td><strong>0.84</strong></td>
          <td>0.43</td>
          <td>4.81</td>
      </tr>
  </tbody>
</table>
<p>The RL-fine-tuned model achieves the best Vina binding scores (-8.60 vs. -7.80 for TargetDiff) with lower variance and the highest SA score (0.84). The SFT-only model (BindGPT-FT) underperforms baselines on binding score, demonstrating that RL is essential for strong pocket-conditioned generation. QED is lower for BindGPT-RL, but the authors note that QED could be included in the RL reward and was excluded for fair comparison.</p>
<h3 id="conformer-generation">Conformer Generation</h3>
<p>On the Platinum dataset (zero-shot), BindGPT matches the performance of Torsional Diffusion (the specialized state-of-the-art) when assisted by RDKit, with a small gap without RDKit assistance. Uni-Mol fails to generalize to this dataset despite pre-training on the same Uni-Mol data.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BindGPT demonstrates that a simple autoregressive language model without equivariance inductive biases can match or surpass specialized diffusion models and GNNs across multiple 3D molecular generation tasks. The key findings include:</p>
<ol>
<li><strong>Joint SMILES+XYZ generation eliminates graph reconstruction errors</strong>, achieving 98.58% validity compared to 12.87% for XYZ-Transformer</li>
<li><strong>Large-scale pre-training is critical for pocket-conditioned generation</strong>, as none of the baselines use pre-training and instead rely on heavy inductive biases</li>
<li><strong>RL fine-tuning with docking feedback substantially improves binding affinity</strong> beyond what SFT alone achieves</li>
<li><strong>Sampling is two orders of magnitude faster</strong> than diffusion baselines (200s vs. 1.4M s for EDM)</li>
</ol>
<p>Limitations include the relatively modest model size (108M parameters); the authors find this sufficient for the current tasks but do not explore larger scales. The RL optimization uses only the Vina score as reward; multi-objective optimization incorporating SA, QED, and other properties is left as future work. The model also relies on character-level SMILES tokenization rather than more sophisticated chemical tokenizers. Finally, although BindGPT is the first model to explicitly generate hydrogens at scale, validity drops from 98.58% to 77.33% when hydrogens are included.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations, 12M molecules, 3.2M pockets</td>
          <td>From Zhou et al. (2023)</td>
      </tr>
      <tr>
          <td>SFT (pocket)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>Full version including low-quality poses</td>
      </tr>
      <tr>
          <td>SFT (conformer)</td>
          <td>GEOM-DRUGS</td>
          <td>27M conformations, 300k molecules</td>
          <td>Standard benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot holdout</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-NeoX with rotary position embeddings (RoPE)</li>
<li><strong>Pre-training</strong>: Causal language modeling with 1.6M tokens per batch</li>
<li><strong>SFT augmentation</strong>: SMILES randomization + random 3D rotation</li>
<li><strong>RL</strong>: REINFORCE with KL-penalty from SFT initialization; QVINA docking as reward</li>
</ul>
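<p>The RL objective can be sketched numerically. The exact BindGPT formulation is not reproduced here; this is an illustrative REINFORCE-with-KL-penalty loss under assumed shapes (per-sequence summed log-probabilities, a mean-reward baseline, and a Monte Carlo per-sequence KL estimate), not the authors&rsquo; implementation:</p>

```python
import numpy as np

def reinforce_kl_loss(logp_agent, logp_ref, rewards, kl_coef=0.1):
    """REINFORCE loss with a KL penalty toward the SFT reference policy (sketch).

    logp_agent / logp_ref: summed token log-probs per sampled ligand, shape (B,)
    rewards: scalar reward per ligand, e.g. the negated QVINA docking score
    """
    advantages = rewards - rewards.mean()   # mean-reward baseline
    kl = logp_agent - logp_ref              # Monte Carlo per-sequence KL estimate
    shaped = advantages - kl_coef * kl      # penalize drift from the SFT policy
    return float(-(shaped * logp_agent).mean())
```

<p>In a real training loop the agent log-probabilities carry gradients from the language model; here the function only evaluates the scalar loss.</p>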
<h3 id="models">Models</h3>
<ul>
<li><strong>Size</strong>: 108M parameters, 15 layers, 12 heads, hidden size 768</li>
<li><strong>Vocabulary</strong>: Character-level SMILES tokens + special tokens (<code>&lt;LIGAND&gt;</code>, <code>&lt;POCKET&gt;</code>, <code>&lt;XYZ&gt;</code>) + coordinate tokens (6 per 3D position)</li>
</ul>
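<p>The paper states that each 3D position occupies six tokens but the exact split is not spelled out here; one plausible scheme (a pure assumption for illustration) gives each axis a signed integer-part token and a two-decimal fractional token:</p>

```python
def coord_to_tokens(x, y, z):
    """Hypothetical 6-token encoding of one 3D position (2 tokens per axis):
    a signed integer-part token and a two-decimal fractional-part token."""
    tokens = []
    for v in (x, y, z):
        sign = "-" if v < 0 else "+"
        whole = int(abs(v))
        frac = round((abs(v) - whole) * 100)  # two decimal places
        tokens.append(f"{sign}{whole}")
        tokens.append(f".{frac:02d}")
    return tokens

print(coord_to_tokens(1.53, -0.25, 12.07))  # six tokens for one atom
```
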
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Validity, SA, QED, Lipinski</strong>: Standard drug-likeness metrics</li>
<li><strong>Jensen-Shannon divergences</strong>: Distribution-level 3D structural metrics (bond lengths, angles, dihedrals, bond types)</li>
<li><strong>RMSD</strong>: Alignment quality of generated conformations vs. RDKit reference</li>
<li><strong>RMSD-Coverage</strong>: CDF of RMSD between generated and reference conformers</li>
<li><strong>Vina score</strong>: Binding energy from QVINA docking software</li>
</ul>
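<p>RMSD-Coverage is just the empirical CDF of per-conformer best-match RMSDs; a small sketch (the threshold grid is an arbitrary choice here):</p>

```python
import numpy as np

def rmsd_coverage(best_rmsds, thresholds):
    """Fraction of reference conformers matched within each RMSD threshold,
    i.e. the CDF of per-conformer best-match RMSDs."""
    best_rmsds = np.asarray(best_rmsds)
    return np.array([(best_rmsds <= t).mean() for t in thresholds])
```
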
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training and fine-tuning use Flash Attention and DeepSpeed for efficiency</li>
<li>Specific GPU counts and training times are described in Appendix G (not available in the main text)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://bindgpt.github.io/">Project Page</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Project website with additional details</td>
      </tr>
  </tbody>
</table>
<p>No public code repository or pre-trained model weights were identified. The project website exists but no source code has been released as of this writing.</p>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The paper provides detailed architecture specs and hyperparameters, but no public code or model weights are available. All training datasets (Uni-Mol, CrossDocked, GEOM-DRUGS) are publicly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zholus, A., Kuznetsov, M., Schutski, R., Shayakhmetov, R., Polykovskiy, D., Chandar, S., &amp; Zhavoronkov, A. (2025). BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(24), 26083-26091. <a href="https://doi.org/10.1609/aaai.v39i24.34804">https://doi.org/10.1609/aaai.v39i24.34804</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zholus2025bindgpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zholus, Artem and Kuznetsov, Maksim and Schutski, Roman and Shayakhmetov, Rim and Polykovskiy, Daniil and Chandar, Sarath and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{26083--26091}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i24.34804}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Avoiding Failure Modes in Goal-Directed Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</guid><description>Langevin et al. show that apparent failure modes in goal-directed molecular generation stem from QSAR model disagreement, not algorithmic flaws.</description><content:encoded><![CDATA[<h2 id="reinterpreting-goal-directed-generation-failures-as-qsar-model-issues">Reinterpreting Goal-Directed Generation Failures as QSAR Model Issues</h2>
<p>This is an <strong>Empirical</strong> study that challenges a widely cited finding about failure modes in goal-directed molecular generation. <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">Renz et al. (2019)</a> had shown that when molecules are optimized against a machine learning scoring function, control models trained on the same data distribution assign much lower scores to the generated molecules. This was interpreted as evidence that generation algorithms exploit model-specific biases. Langevin et al. demonstrate that this divergence is already present in the original data distribution and is attributable to disagreement among the QSAR classifiers, not to flaws in the generation algorithms themselves.</p>
<h2 id="why-qsar-model-agreement-matters-for-molecular-generation">Why QSAR Model Agreement Matters for Molecular Generation</h2>
<p>Goal-directed generation uses a scoring function (typically a <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> model) to guide the design of molecules that maximize predicted activity. In the experimental framework from Renz et al., three Random Forest classifiers are trained: an optimization model $C_{opt}$ on Split 1, a model control $C_{mc}$ on Split 1 with a different random seed, and a data control $C_{dc}$ on Split 2. Each returns a confidence score ($S_{opt}$, $S_{mc}$, $S_{dc}$). The expectation is that molecules with high $S_{opt}$ should also score highly under $S_{mc}$ and $S_{dc}$, since all three models are trained on the same data distribution for the same target.</p>
<p>Renz et al. observed that during optimization, $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$, reaching substantially lower values. This was interpreted as goal-directed generation exploiting biases unique to the optimization model. The recommendation was to halt generation when control scores stop increasing, requiring a held-out dataset for a control model, which may not be feasible in low-data regimes.</p>
<p>The key insight of Langevin et al. is that nobody had checked whether this score disagreement existed before generation even began. If the classifiers already disagree on high-scoring molecules in the original dataset, the divergence during generation is expected behavior, not evidence of algorithmic failure.</p>
<h2 id="pre-existing-classifier-disagreement-explains-the-divergence">Pre-Existing Classifier Disagreement Explains the Divergence</h2>
<p>The core contribution is showing that the gap between optimization and control scores is a property of the QSAR models, not of the generation algorithms.</p>
<p>The authors introduce a held-out test set (10% of the data, used for neither training split) and augment it via Topliss tree enumeration to produce structural analogs for smoother statistical estimates. On this held-out set, they compute the Mean Average Difference (MAD) between $S_{opt}$ and control scores as a function of $S_{opt}$:</p>
<p>$$
\text{MAD}(x) = \frac{1}{|\{i : S_{opt}(x_i) \geq x\}|} \sum_{S_{opt}(x_i) \geq x} |S_{opt}(x_i) - S_{dc}(x_i)|
$$</p>
<p>On the three original datasets (DRD2, EGFR, JAK2), the MAD between $S_{opt}$ and $S_{dc}$ grows substantially with $S_{opt}$, reaching approximately 0.3 for the highest-scoring molecules. For EGFR, even the top molecules (with $S_{opt}$ between 0.5 and 0.6) have $S_{dc}$ below 0.2. This disagreement exists entirely within the original data distribution, before any generative algorithm is applied.</p>
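<p>Computing the MAD curve on a held-out set is direct; a sketch (array names are assumptions):</p>

```python
import numpy as np

def mad_curve(s_opt, s_ctrl, thresholds):
    """MAD(x): mean |S_opt - S_ctrl| over held-out molecules with S_opt >= x."""
    s_opt, s_ctrl = np.asarray(s_opt), np.asarray(s_ctrl)
    out = []
    for x in thresholds:
        mask = s_opt >= x
        out.append(np.abs(s_opt[mask] - s_ctrl[mask]).mean() if mask.any() else np.nan)
    return np.array(out)
```

<p>A MAD that grows with the threshold, as observed on DRD2, EGFR, and JAK2, signals that the classifiers already disagree on exactly the molecules the optimizer will be rewarded for finding.</p>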
<p>The authors formalize this with tolerance intervals. At each generation time step $t$, the distribution of optimization scores is $P_t[S_{opt}(x)]$. From the held-out set, the conditional distributions $P[S_{dc}(x) | S_{opt}(x)]$ and $P[S_{mc}(x) | S_{opt}(x)]$ are estimated empirically. The expected control scores at time $t$ are then:</p>
<p>$$
\mathbb{E}[S_{dc}] = \int P[S_{dc}(x) | S_{opt}(x)] \cdot P_t[S_{opt}(x)] \, dS_{opt}
$$</p>
<p>By sampling from these distributions, the authors construct 95% tolerance intervals for the expected control scores at each time step. The observed trajectories of $S_{mc}$ and $S_{dc}$ during generation fall within these intervals, demonstrating that the divergence is fully explained by pre-existing classifier disagreement.</p>
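<p>The tolerance-interval construction can be sketched as a binned resampling procedure. The 25 bins and 10 draws follow the paper&rsquo;s setup; everything else (function and argument names, the quantile-based interval) is an illustrative assumption:</p>

```python
import numpy as np

def control_score_interval(s_opt_holdout, s_dc_holdout, s_opt_generated,
                           n_bins=25, n_samples=10, alpha=0.05, seed=0):
    """Tolerance interval for the expected data-control score, estimated by
    drawing S_dc values from the held-out conditional P[S_dc | S_opt]."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_of = lambda s: np.clip(np.digitize(s, edges) - 1, 0, n_bins - 1)
    holdout_bins = bin_of(np.asarray(s_opt_holdout))
    s_dc_holdout = np.asarray(s_dc_holdout)
    means = []
    for _ in range(n_samples):
        draws = []
        for s in np.asarray(s_opt_generated):
            pool = s_dc_holdout[holdout_bins == bin_of(s)]
            if pool.size:                      # skip empty bins
                draws.append(rng.choice(pool))
        means.append(np.mean(draws))
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)
```

<p>If the observed control-score trajectory stays inside the interval, the divergence is fully explained by the held-out conditional distribution, which is the paper&rsquo;s central argument.</p>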
<h2 id="experimental-setup-original-reproduction-and-corrected-experiments">Experimental Setup: Original Reproduction and Corrected Experiments</h2>
<h3 id="reproduction-of-renz-et-al">Reproduction of Renz et al.</h3>
<p>The original experimental framework uses three datasets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> (842 molecules, 59 actives), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> (842 molecules, 40 actives), and <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> (667 molecules, 140 actives). These are small, noisy, and chemically diverse. Three goal-directed generation algorithms are tested:</p>
<table>
  <thead>
      <tr>
          <th>Algorithm</th>
          <th>Type</th>
          <th>Mechanism</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph GA</td>
          <td>Genetic algorithm on molecular graphs</td>
          <td>Mutation and crossover of molecular graphs</td>
      </tr>
      <tr>
          <td>SMILES-LSTM</td>
          <td>Recurrent neural network</td>
          <td>Hill-climbing fine-tuning on best molecules</td>
      </tr>
      <tr>
          <td>MSO</td>
          <td>Particle swarm in CDDD latent space</td>
          <td>Multiple swarm optimization</td>
      </tr>
  </tbody>
</table>
<p>All algorithms are run for 151 epochs with 10 runs each. The reproduction confirms the findings of Renz et al.: $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$ during optimization.</p>
<h3 id="tolerance-interval-analysis">Tolerance interval analysis</h3>
<p>The held-out set is augmented using Topliss tree enumeration on phenyl rings, providing structural analogs that are reasonable from a medicinal chemistry perspective. The optimization score range is divided into 25 equal bins, and for each molecule at each time step, 10 samples from the conditional control score distribution are drawn to construct empirical tolerance intervals.</p>
<h3 id="corrected-experiments-with-adequate-models">Corrected experiments with adequate models</h3>
<p>To test whether generation algorithms actually exploit biases when the classifiers agree, the authors construct two tasks where optimization and control models correlate well:</p>
<ol>
<li><strong>ALDH1 dataset</strong>: 464 molecules from LIT-PCBA, split using similarity-based pairing to maximize intra-pair chemical similarity. This ensures both splits sample similar chemistry.</li>
<li><strong>Modified JAK2</strong>: The same JAK2 dataset but with Random Forest hyperparameters adjusted (200 trees instead of 100, minimum 3 samples per leaf instead of 1) to reduce overfitting to spurious correlations.</li>
</ol>
<p>In both cases, $S_{opt}$, $S_{mc}$, and $S_{dc}$ agree well on the held-out test set. The starting population for generation is set to the held-out test set (rather than random ChEMBL molecules) to avoid building in a distribution shift.</p>
<h2 id="findings-no-algorithmic-failure-when-models-agree">Findings: No Algorithmic Failure When Models Agree</h2>
<p>On the corrected experimental setups (ALDH1 and modified JAK2), there is no major divergence between optimization and control scores during generation. The three algorithms produce molecules that score similarly under all three classifiers.</p>
<p>Key findings:</p>
<ol>
<li>
<p><strong>Pre-existing disagreement explains divergence</strong>: On all three original datasets, the divergence between $S_{opt}$ and control scores during generation falls within the tolerance intervals predicted from the initial data distribution alone. The generation algorithms are not exploiting model-specific biases beyond what already exists in the data.</p>
</li>
<li>
<p><strong>Split similarity bias is also pre-existing</strong>: Renz et al. observed that generated molecules are more similar to Split 1 (used to train $C_{opt}$) than Split 2. The authors show this bias is already present in the top-5 percentile of the held-out set: on EGFR and DRD2, high-scoring molecules are inherently more similar to Split 1.</p>
</li>
<li>
<p><strong>Appropriate model design resolves the issue</strong>: When Random Forest hyperparameters are chosen to avoid overfitting (more trees, higher minimum samples per leaf), or when data splits are constructed to be chemically balanced, the classifiers agree and the generation algorithms behave as expected.</p>
</li>
<li>
<p><strong>Quality problems remain independent</strong>: Even when optimization and control scores align, the generated molecules can still be poor drug candidates (unreactive, unsynthesizable, containing unusual fragments). The score divergence issue and the chemical quality issue are separate problems.</p>
</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations acknowledged by the authors</h3>
<ul>
<li>The study focuses on Random Forest classifiers with ECFP fingerprints. The behavior of other model types (e.g., graph neural networks) and descriptor types is not fully explored, though supplementary results show similar patterns with physico-chemical descriptors and Atom-Pair fingerprints.</li>
<li>The corrected ALDH1 task uses a relatively small dataset (464 molecules) with careful split construction. Scaling this approach to larger, more heterogeneous datasets is not demonstrated.</li>
<li>The authors note that their results do not prove generation algorithms never exploit biases; they show that the specific evidence from Renz et al. can be explained without invoking algorithmic failure.</li>
<li>The problem of low-quality generated molecules (poor synthesizability, unusual fragments) remains unresolved and is acknowledged as an open question.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Original tasks</td>
          <td>DRD2, EGFR, JAK2</td>
          <td>842, 842, 667 molecules</td>
          <td>Extracted from ChEMBL; small with few actives</td>
      </tr>
      <tr>
          <td>New task</td>
          <td>ALDH1</td>
          <td>464 molecules (173 with purine substructure)</td>
          <td>Extracted from LIT-PCBA; similarity-based split</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>Topliss tree analogs</td>
          <td>~10x augmentation of held-out set</td>
          <td>Structural analogs via phenyl ring enumeration</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Three goal-directed generation algorithms from the original Renz et al. study:</p>
<ul>
<li><strong>Graph GA</strong>: Genetic algorithm on molecular graphs (Jensen, 2019)</li>
<li><strong>SMILES-LSTM</strong>: Hill-climbing on LSTM-generated SMILES (Segler et al., 2018)</li>
<li><strong>MSO</strong>: Multi-Swarm Optimization in CDDD latent space (Winter et al., 2019)</li>
</ul>
<p>All run for 151 epochs, 10 runs each.</p>
<h3 id="models">Models</h3>
<p>Random Forest classifiers (scikit-learn) with:</p>
<ul>
<li>ECFP fingerprints (radius 2, 1024 bits, RDKit)</li>
<li>Default parameters for original tasks</li>
<li>Modified parameters for JAK2 correction: 200 trees, min 3 samples per leaf</li>
</ul>
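<p>The scoring-model setup translates to a few lines of scikit-learn. Random bits stand in for fingerprints so the sketch is self-contained; in practice the 1024-bit vectors would come from RDKit&rsquo;s Morgan fingerprint (radius 2), and the molecules and labels from the ChEMBL extracts:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for RDKit ECFP fingerprints (radius 2, 1024 bits); in practice:
# AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))
y = rng.integers(0, 2, size=200)

# Corrected JAK2 hyperparameters: 200 trees, min 3 samples per leaf
# (the original tasks used scikit-learn defaults: 100 trees, min 1 per leaf).
clf = RandomForestClassifier(n_estimators=200, min_samples_leaf=3, random_state=0)
clf.fit(X, y)
s_opt = clf.predict_proba(X)[:, 1]  # confidence score used as S_opt
```
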
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Purpose</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean Average Difference (MAD)</td>
          <td>Measures disagreement between optimization and control scores</td>
          <td>Computed as function of $S_{opt}$ on held-out set</td>
      </tr>
      <tr>
          <td>95% tolerance intervals</td>
          <td>Expected range of control scores given optimization scores</td>
          <td>Empirical, constructed from held-out set</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Split bias assessment</td>
          <td>Morgan fingerprints, radius 2, 1024 bits</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classifier predictive performance</td>
          <td>Used to verify models have comparable accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Sanofi-Public/IDD-papers-avoiding_failure_modes">Code and datasets</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Fork of Renz et al. codebase with modifications</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Langevin, M., Vuilleumier, R., &amp; Bianciotto, M. (2022). Explaining and avoiding failure modes in goal-directed generation of small molecules. <em>Journal of Cheminformatics</em>, 14, 20. <a href="https://doi.org/10.1186/s13321-022-00601-y">https://doi.org/10.1186/s13321-022-00601-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{langevin2022explaining,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Explaining and avoiding failure modes in goal-directed generation of small molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Langevin, Maxime and Vuilleumier, Rodolphe and Bianciotto, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00601-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Augmented Hill-Climb for RL-Based Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</guid><description>Augmented Hill-Climb combines REINVENT and Hill-Climb RL strategies to improve sample efficiency ~45-fold for SMILES-based de novo molecule generation.</description><content:encoded><![CDATA[<h2 id="a-hybrid-rl-strategy-for-de-novo-molecule-generation">A Hybrid RL Strategy for De Novo Molecule Generation</h2>
<p>This is a <strong>Method</strong> paper that proposes Augmented Hill-Climb (AHC), a reinforcement learning strategy for conditioning SMILES-based language models during de novo molecule generation. The primary contribution is a simple hybrid between the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and Hill-Climb (HC) RL strategies that computes the REINVENT loss function only on the top-k highest-scoring molecules per batch (as in HC), thereby removing the counterproductive regularization effect of low-scoring molecules. The authors demonstrate that AHC improves optimization ability ~1.5-fold and sample efficiency ~45-fold compared to REINVENT across docking tasks against four <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> targets, and that the approach generalizes to transformer architectures.</p>
<h2 id="sample-efficiency-bottleneck-in-rl-guided-molecular-generation">Sample Efficiency Bottleneck in RL-Guided Molecular Generation</h2>
<p>Recurrent neural networks trained on SMILES have become a standard approach for de novo molecule generation, with RL strategies like REINVENT and Hill-Climb achieving top performance on benchmarks such as <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>. However, RL-guided generation can be highly <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/">sample-inefficient</a>, often requiring $10^5$ or more molecules to optimize complex objectives. This is acceptable for cheap scoring functions (e.g., QSAR models, property calculators) but becomes a practical bottleneck when using computationally expensive scoring functions like molecular docking or computer-aided synthesis planning.</p>
<p>The REINVENT strategy regularizes the agent by computing a loss based on the difference between the agent&rsquo;s policy and an &ldquo;augmented likelihood&rdquo; that combines the prior policy with a scaled reward. When low-scoring molecules are sampled ($R_T \approx 0$), the augmented likelihood reduces to the prior likelihood, causing the agent to drift back toward the prior policy. This undoes useful learning, especially early in training or when the objective is difficult. Meanwhile, Hill-Climb simply fine-tunes the RNN on the top-k molecules per batch, which is sample-efficient but lacks explicit regularization, leading to mode collapse and generation of invalid SMILES.</p>
<p>Previous work by Neil et al. compared RL strategies but did not clearly quantify sample-efficiency differences, and modifications to the REINVENT loss function by Fialkova et al. showed no significant improvement. The best agent reminder (BAR) mechanism offered modest gains but was originally tested on graph-based models.</p>
<h2 id="core-innovation-filtering-low-scoring-molecules-from-the-reinvent-loss">Core Innovation: Filtering Low-Scoring Molecules from the REINVENT Loss</h2>
<p>Augmented Hill-Climb combines the loss formulation of REINVENT with the top-k selection mechanism of Hill-Climb. The agent samples a batch of molecules, ranks them by reward, and computes the REINVENT loss only on the top-k molecules. This removes the counterproductive regularization caused by low-scoring molecules while retaining the prior-based regularization for high-scoring molecules.</p>
<p>The REINVENT loss defines an augmented likelihood:</p>
<p>$$
\log P_{\mathbb{U}}(A) = \log P_{prior}(A) + \sigma R_T
$$</p>
<p>where $\sigma$ is a scaling coefficient controlling the reward contribution. The agent loss is the squared difference between the augmented likelihood and the agent&rsquo;s log-likelihood:</p>
<p>$$
L(\theta) = \left[\log P_{\mathbb{U}}(A) - \log P_{agent}(A)\right]^2
$$</p>
<p>In standard REINVENT, this loss is computed over all molecules in the batch. When $R_T \approx 0$, the augmented likelihood collapses to the prior likelihood, pushing the agent back toward the prior. AHC avoids this by computing the loss only on the top-k molecules ranked by reward, exactly as Hill-Climb selects molecules for fine-tuning.</p>
<p>The key insight is that high-scoring molecules are still regularized by the prior component of the augmented likelihood ($\log P_{prior}(A)$), preventing catastrophic forgetting. Low-scoring molecules, which would otherwise pull the agent back toward the prior, are simply excluded from the loss computation.</p>
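<p>Under these definitions, the AHC update can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors&rsquo; implementation; the function name and the choice of keeping the top half of the batch are assumptions for this example:</p>

```python
import numpy as np

def ahc_loss(log_p_prior, log_p_agent, rewards, sigma=60.0, topk_frac=0.5):
    """Augmented Hill-Climb: REINVENT's squared-difference loss, but
    computed only on the top-k molecules of the batch ranked by reward."""
    k = max(1, int(topk_frac * len(rewards)))
    top = np.argsort(rewards)[::-1][:k]          # indices of the top-k rewards
    # Augmented likelihood: prior log-likelihood plus sigma-scaled reward.
    augmented = log_p_prior[top] + sigma * rewards[top]
    return float(np.mean((augmented - log_p_agent[top]) ** 2))
```

<p>Molecules outside the top-k contribute nothing to the gradient, which is exactly how AHC discards the pull-back-to-prior effect of near-zero rewards.</p>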
<h3 id="diversity-filters-to-prevent-mode-collapse">Diversity Filters to Prevent Mode Collapse</h3>
<p>AHC is more susceptible to mode collapse than REINVENT because it focuses learning on high-scoring molecules. The authors address this with diversity filters (DFs) that penalize the reward of molecules similar to previously generated ones. Through a hyperparameter search over 825 configurations on three GuacaMol tasks, they identify an optimal DF configuration (DF2) with:</p>
<ul>
<li>Minimum score threshold of 0.5 (lower than DF1&rsquo;s 0.8)</li>
<li>Linear penalization output mode (softer than binary)</li>
<li>Bin size of 50 (larger than DF1&rsquo;s 25)</li>
<li>Scaffold similarity based on ECFP4 fingerprints</li>
</ul>
<p>The authors find that stricter DFs (lower thresholds, smaller bins) better prevent mode collapse but reduce optimization performance, while more lenient DFs enable better learning of chemotype-reward associations. DF2 represents a compromise.</p>
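<p>A DF of this kind can be sketched as a scaffold-keyed counter with linear reward penalization. This is an illustrative reconstruction, not the paper&rsquo;s code: the class name is invented, the linear penalty schedule is an assumption, and scaffold extraction (e.g. Murcko scaffolds via RDKit) is assumed to happen upstream:</p>

```python
from collections import defaultdict

class DiversityFilter:
    """DF2-style filter sketch: count high-scoring visits per scaffold
    and linearly scale down rewards as a scaffold's bin fills up."""
    def __init__(self, min_score=0.5, bin_size=50):
        self.min_score = min_score   # only molecules above this fill the bin
        self.bin_size = bin_size     # bin capacity per scaffold
        self.counts = defaultdict(int)

    def penalize(self, scaffold, reward):
        if reward >= self.min_score:
            self.counts[scaffold] += 1
        n = self.counts[scaffold]
        # Linear output mode: softer than zeroing the reward outright.
        return reward * max(0.0, 1.0 - n / self.bin_size)
```
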
<h2 id="experimental-setup-docking-tasks-and-benchmark-comparisons">Experimental Setup: Docking Tasks and Benchmark Comparisons</h2>
<p>The evaluation spans five experiments:</p>
<p><strong>Experiment 1</strong>: AHC vs. REINVENT on DRD2 docking over 100 RL updates (6,400 samples), varying $\sigma$ from 30 to 240. RNN trained on the MOSESn dataset (MOSES with neutralized charges, 2.45M molecules).</p>
<p><strong>Experiment 2</strong>: AHC + DF2 vs. REINVENT on four GPCR targets (DRD2, OPRM1, AGTR1, OX1R) over 500 RL updates. Docking performed with Glide-SP after ligand preparation with LigPrep.</p>
<p><strong>Experiment 3</strong>: Diversity filter hyperparameter search (825 configurations) on three GuacaMol tasks (<a href="https://en.wikipedia.org/wiki/Aripiprazole">Aripiprazole</a> similarity, C11H24 isomers, <a href="https://en.wikipedia.org/wiki/Osimertinib">Osimertinib</a> MPO) using the GuacaMol training set (1.27M molecules from ChEMBL24).</p>
<p><strong>Experiment 4</strong>: Benchmark of AHC against REINFORCE, REINVENT (v1 and v2), BAR, and Hill-Climb (with and without KL regularization) on six tasks of varying difficulty:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Difficulty</th>
          <th>Objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heavy atoms</td>
          <td>Easy</td>
          <td>Maximize number of heavy atoms</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Risperidone">Risperidone</a> similarity</td>
          <td>Easy</td>
          <td>Maximize Tanimoto similarity to Risperidone</td>
      </tr>
      <tr>
          <td>DRD2 activity</td>
          <td>Medium</td>
          <td>Maximize QSAR-predicted DRD2 activity</td>
      </tr>
      <tr>
          <td>DRD2 docking</td>
          <td>Medium</td>
          <td>Minimize Glide-SP docking score</td>
      </tr>
      <tr>
          <td>DRD2-DRD3 dual</td>
          <td>Hard</td>
          <td>Maximize predicted activity against both targets</td>
      </tr>
      <tr>
          <td>DRD2/DRD3 selective</td>
          <td>Hard</td>
          <td>Maximize selective DRD2 activity over DRD3</td>
      </tr>
  </tbody>
</table>
<p><strong>Experiment 5</strong>: AHC vs. REINVENT on transformer (Tr) and gated transformer (GTr) architectures on the same six benchmark tasks. The GTr implements a GRU-style gate in place of residual connections to stabilize RL training.</p>
<h3 id="rnn-and-transformer-architectures">RNN and Transformer Architectures</h3>
<p>Three RNN configurations were used: (1) a 128-dimensional embedding with 3 GRU layers of 512 units (REINVENT v1), (2) a 256-dimensional embedding with 3 LSTM layers of 512 units (REINVENT 2.0), and (3) 3 LSTM layers of 512 units with dropout 0.2 (GuacaMol). Transformers used 4 encoder layers with hidden dimension 512, 8 attention heads, and feed-forward dimension 1024.</p>
<p>QSAR models for DRD2 and DRD3 activity were random forest classifiers trained on ExCAPE-DB data with GHOST threshold identification for handling class imbalance.</p>
<h2 id="key-findings-45-fold-sample-efficiency-improvement">Key Findings: 45-Fold Sample Efficiency Improvement</h2>
<h3 id="experiment-1-ahc-consistently-outperforms-reinvent">Experiment 1: AHC Consistently Outperforms REINVENT</h3>
<p>AHC improved optimization ability by 1.39-fold over REINVENT averaged across all $\sigma$ values, with maximum optimization of 205% at $\sigma = 240$ (compared to 128% for REINVENT). AHC required ~80 fewer RL steps to match REINVENT&rsquo;s mean docking score at 100 steps. With DF1 applied, the improvement was 1.45-fold.</p>
<p>AHC showed greater sensitivity to $\sigma$, giving practitioners more control over the regularization-optimization trade-off. At $\sigma = 60$ (heavily regularized), AHC still improved 1.47-fold over REINVENT while staying within the property space defined by the MOSESn training set. At higher $\sigma$ values, AHC extrapolated further outside the training distribution, which can be favorable (novel chemical space) or unfavorable (scoring-function exploitation, e.g., larger molecules receiving better docking scores due to the additive nature of scoring functions).</p>
<h3 id="experiment-2-improvement-across-four-gpcr-targets">Experiment 2: Improvement Across Four GPCR Targets</h3>
<p>Across DRD2, OPRM1, AGTR1, and OX1R, AHC + DF2 required on average 7.4-fold fewer training steps and 45.5-fold fewer samples to reach optimization thresholds. The improvement was largest early in training: 19.8-fold fewer steps to reach 120% optimization, and 71.8-fold fewer samples to first produce a molecule exceeding 160% optimization.</p>
<p>AHC + DF2 surpassed the 80% retrospective precision threshold within 100 RL updates for all targets except the challenging OX1R. DF2 successfully stabilized learning, avoiding the convergence-to-threshold failure mode observed with DF1.</p>
<p>Scaffold analysis showed AHC generates similar chemistry to REINVENT. The top 500 scaffolds produced by REINVENT were also generated by AHC, but typically much sooner.</p>
<h3 id="experiment-4-benchmark-against-all-rl-strategies">Experiment 4: Benchmark Against All RL Strategies</h3>
<p>AHC outperformed all other RL strategies on all six benchmark tasks except maximizing heavy atoms (an extrapolation task of limited practical relevance). AHC was particularly superior during early-stage optimization and for harder objectives (dual activity, selective activity).</p>
<p>Hill-Climb with a smaller batch size (HC*) showed improved early-stage sample efficiency similar to AHC, but rapidly underwent mode collapse. KL regularization did not rescue mode collapse in any case and sometimes worsened performance. BAR performed poorly in most tasks, possibly because the best-agent memory acts as a second regularizer that inhibits learning.</p>
<p>In terms of wall time for the DRD2 docking task, AHC reached 140% optimization in 16 CPU hours vs. 202 CPU hours for <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 2.0</a>. AHC was the only strategy to reach 200% optimization within the allotted time (216 CPU hours). Parallelized over 10 CPUs, this corresponds to ~21.6 hours, making docking-guided generation feasible on local machines.</p>
<h3 id="experiment-5-generalization-to-transformers">Experiment 5: Generalization to Transformers</h3>
<p>AHC outperformed REINVENT on both the standard transformer and the gated transformer architectures. The standard transformer was unstable under RL, readily undergoing mode collapse. The gated transformer (with GRU-style gating replacing residual connections) stabilized RL training. AHC&rsquo;s efficiency gains generalized to both architectures.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Chemistry quality evaluation is complicated by the interaction between RL strategy and scoring function suitability. Greater optimization may lead to unreasonable chemistry due to scoring function exploitation rather than the RL strategy itself.</li>
<li>The diversity filter hyperparameter search was conducted on GuacaMol toy tasks, which may not fully transfer to docking-based objectives.</li>
<li>The docking scoring function was system-dependent: DRD2 and OPRM1 were optimized effectively, while AGTR1 and OX1R proved more challenging (especially AGTR1, where the docking algorithm targeted the wrong sub-pocket).</li>
<li>KL regularization proved ineffective for HC and REINFORCE, suggesting it is not a sufficient regularization method in this context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN pretraining</td>
          <td>MOSESn (MOSES neutralized)</td>
          <td>2,454,087 molecules</td>
          <td>ZINC15 clean leads with neutralized charges</td>
      </tr>
      <tr>
          <td>RNN pretraining</td>
          <td>GuacaMol train</td>
          <td>1,273,104 molecules</td>
          <td>ChEMBL24 with property filters</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD2)</td>
          <td>4,609 actives / 343,026 inactives</td>
          <td>Random forest with GHOST thresholds</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD3)</td>
          <td>2,758 actives / 402,524 inactives</td>
          <td>Unique subsets for dual/selective tasks</td>
      </tr>
      <tr>
          <td>DF parameter search</td>
          <td>GuacaMol benchmark tasks</td>
          <td>3 tasks</td>
          <td>825 configurations tested</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>AHC</strong>: REINVENT loss computed on top-k molecules per batch, ranked by reward</li>
<li><strong>Baselines</strong>: REINFORCE, REINVENT (v1, v2), BAR, Hill-Climb, Hill-Climb + KL regularization</li>
<li><strong>Hyperparameters</strong>: Default values from each original publication (listed in Supplementary Table S3)</li>
<li><strong>Docking</strong>: Glide-SP with Schrodinger Protein Preparation Wizard, LigPrep for ligand preparation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RNNs</strong>: 3 configurations (GRU/LSTM, 512 hidden units, trained 5-10 epochs)</li>
<li><strong>Transformer</strong>: 4 encoder layers, 512 hidden dim, 8 heads, 1024 FFN dim</li>
<li><strong>Gated Transformer</strong>: Same architecture with GRU-style gating replacing residual connections</li>
<li><strong>QSAR</strong>: Random forest classifiers (100 estimators, max depth 15, min leaf 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AHC + DF2</th>
          <th>REINVENT</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization fold-improvement</td>
          <td>1.45x</td>
          <td>baseline</td>
          <td>DRD2 docking, averaged across sigma values</td>
      </tr>
      <tr>
          <td>Sample efficiency</td>
          <td>45.5x fewer samples</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>Step efficiency</td>
          <td>7.4x fewer steps</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>CPU hours to 140% (DRD2 docking)</td>
          <td>16h</td>
          <td>202h (REINVENT 2.0)</td>
          <td>AMD Threadripper 1920 + RTX 2060 Super</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>AMD Threadripper 1920 CPU</li>
<li>Nvidia GeForce RTX 2060 Super GPU</li>
<li>DRD2 docking benchmark: 216 CPU hours for AHC to reach 200% optimization (~21.6h parallelized over 10 CPUs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/SMILES-RNN">SMILES-RNN</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>RNN and transformer generative model code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/">Scoring function platform</a></td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.6084/m9.figshare.19591024.v1">Figshare datasets</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Supporting data (published under same license as paper)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. <em>Journal of Cheminformatics</em>, 14, 68.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2022augmented,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00646-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AlphaDrug: MCTS-Guided Target-Specific Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</guid><description>AlphaDrug combines a modified transformer with Monte Carlo tree search and docking rollouts for target-specific de novo molecular generation.</description><content:encoded><![CDATA[<h2 id="target-conditioned-molecular-generation-via-transformer-and-mcts">Target-Conditioned Molecular Generation via Transformer and MCTS</h2>
<p>AlphaDrug is a <strong>Method</strong> paper that proposes a target-specific de novo molecular generation framework. The primary contribution is the combination of two components: (1) an Lmser Transformer (LT) that embeds protein-ligand context through hierarchical skip connections from encoder to decoder, and (2) a Monte Carlo tree search (MCTS) procedure guided by both the LT&rsquo;s predicted probabilities and docking scores from the <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a> program. The method generates SMILES strings autoregressively, with each symbol selection informed by look-ahead search over potential binding affinities.</p>
<h2 id="bridging-the-gap-between-molecular-generation-and-protein-targeting">Bridging the Gap Between Molecular Generation and Protein Targeting</h2>
<p>Most deep learning methods for de novo molecular generation optimize physicochemical properties (LogP, QED, SA) without conditioning on a specific protein target. Virtual screening approaches rely on existing compound databases and are computationally expensive. The few methods that do consider protein targets, such as LiGANN and the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">transformer-based approach of Grechishnikova (2021)</a>, show limited docking performance. The core challenge is twofold: the search space of drug-like molecules is estimated at $10^{60}$ compounds, and learning protein-ligand interaction patterns from sequence data is difficult because proteins and ligands have very different structures and sequence lengths.</p>
<p>AlphaDrug addresses these gaps by proposing a method that jointly learns protein-ligand representations and uses docking-guided search to navigate the vast chemical space.</p>
<h2 id="lmser-transformer-and-docking-guided-mcts">Lmser Transformer and Docking-Guided MCTS</h2>
<p>The key innovations are the Lmser Transformer architecture and the MCTS search strategy.</p>
<h3 id="lmser-transformer-lt">Lmser Transformer (LT)</h3>
<p>The standard transformer for sequence-to-sequence tasks passes information from the encoder&rsquo;s top layer to the decoder through cross-attention. AlphaDrug identifies an information transfer bottleneck: deep protein features from the encoder&rsquo;s final layer must serve all decoder layers. Inspired by the Lmser (least mean squared error reconstruction) network, the authors add hierarchical skip connections from each encoder layer to the corresponding decoder layer.</p>
<p>Each decoder layer receives protein features at the matching level of abstraction through a cross-attention mechanism:</p>
<p>$$f_{ca}(Q_m, K_S, V_S) = \text{softmax}\left(\frac{Q_m K_S^T}{\sqrt{d_k}}\right) V_S$$</p>
<p>where $Q_m$ comes from the ligand molecule decoder and $(K_S, V_S)$ are passed through skip connections from the protein encoder. This allows different decoder layers to access different levels of protein features, rather than all layers sharing the same top-level encoding.</p>
<p>The multi-head attention follows the standard formulation:</p>
<p>$$\text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \dots, H_h) W^O$$</p>
<p>$$H_i = f_{ca}(Q W_i^Q, K W_i^K, V W_i^V)$$</p>
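<p>The $f_{ca}$ building block above is standard scaled dot-product attention, with queries from the ligand decoder and keys/values from the skipped encoder layer. A minimal single-head NumPy sketch, without the learned projections (illustrative only):</p>

```python
import numpy as np

def cross_attention(Q_m, K_S, V_S):
    """Scaled dot-product cross-attention: ligand queries attend to
    protein keys/values passed through a skip connection."""
    d_k = K_S.shape[-1]
    logits = Q_m @ K_S.T / np.sqrt(d_k)
    # Numerically stable softmax over the protein positions.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_S
```

<p>In the LT, one such block sits in each decoder layer, fed by the encoder layer at the same depth rather than by the encoder&rsquo;s top layer alone.</p>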
<h3 id="mcts-for-molecular-generation">MCTS for Molecular Generation</h3>
<p>The molecular generation process models SMILES construction as a sequential decision problem. At each step $\tau$, the context $C_\tau = \{S, a_1 a_2 \cdots a_\tau\}$ consists of the protein sequence $S$ and the intermediate SMILES string. MCTS runs a fixed number of simulations per step, each consisting of four phases:</p>
<p><strong>Select</strong>: Starting from the current root node, child nodes are selected using a variant of the PUCT algorithm:</p>
<p>$$\tilde{a}_{\tau+t} = \underset{a \in A}{\arg\max}\left(Q(\tilde{C}_{\tau+t-1}, a) + U(\tilde{C}_{\tau+t-1}, a)\right)$$</p>
<p>where $Q(\tilde{C}, a) = W_a / N_a$ is the average reward and $U(\tilde{C}, a) = c_{puct} \cdot P(a | \tilde{C}) \cdot \sqrt{N_t} / (1 + N_t(a))$ is an exploration bonus based on the LT&rsquo;s predicted probability.</p>
<p>The Q-values are normalized to $[0, 1]$ using the range of docking scores in the tree:</p>
<p>$$Q(\tilde{C}, a) \leftarrow \frac{Q(\tilde{C}, a) - \min_{m \in \mathcal{M}} f_d(S, m)}{\max_{m \in \mathcal{M}} f_d(S, m) - \min_{m \in \mathcal{M}} f_d(S, m)}$$</p>
<p><strong>Expand</strong>: At a leaf node, the LT computes next-symbol probabilities and adds child nodes to the tree.</p>
<p><strong>Rollout</strong>: A complete molecule is generated greedily using LT probabilities. Valid molecules are scored with SMINA docking; invalid molecules receive the minimum observed docking score.</p>
<p><strong>Backup</strong>: Docking values propagate back up the tree, updating visit counts and cumulative rewards.</p>
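<p>The Select phase reduces to a PUCT argmax over child nodes. A small illustrative sketch, assuming each child stores its normalized mean reward, its LT prior probability, and its visit count (the field names are invented for this example):</p>

```python
import math

def puct_select(children, n_total, c_puct=1.5):
    """Pick the child maximizing Q + U, as in the Select phase.

    Each child is a dict with normalized mean reward "q", prior "p"
    (the LT's next-symbol probability), and visit count "n".
    """
    def score(ch):
        # Exploration bonus: high for probable, rarely visited symbols.
        u = c_puct * ch["p"] * math.sqrt(n_total) / (1 + ch["n"])
        return ch["q"] + u
    return max(children, key=score)
```
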
<h3 id="training-objective">Training Objective</h3>
<p>The LT is trained on known protein-ligand pairs using cross-entropy loss:</p>
<p>$$J(\Theta) = -\sum_{(S,m) \in \mathcal{D}} \sum_{\tau=1}^{L_m} \sum_{a \in \mathcal{A}} y_a \ln P(a \mid C_\tau(S, m))$$</p>
<p>MCTS is only activated during inference, not during training.</p>
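<p>For a single protein-ligand pair, the inner sums of $J(\Theta)$ amount to a standard sequence negative log-likelihood over the target SMILES symbols. A minimal NumPy sketch (illustrative; the real model emits logits rather than ready-made probabilities):</p>

```python
import numpy as np

def sequence_nll(probs, targets):
    """Sum of -log P(a_tau | context) over a target symbol sequence.

    probs:   (L, |A|) array of next-symbol distributions from the model
    targets: length-L array of ground-truth symbol indices
    """
    rows = np.arange(len(targets))
    return -np.sum(np.log(probs[rows, targets]))
```
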
<h2 id="experiments-on-diverse-protein-targets">Experiments on Diverse Protein Targets</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use BindingDB, filtered to 239,455 protein-ligand pairs across 981 unique proteins. Filtering criteria include: human proteins only, IC50 &lt; 100 nM, molecular weight &lt; 1000 Da, and single-chain targets. Proteins are clustered at 30% sequence identity using MMseqs2, with 25 clusters held out for testing (100 proteins), and the remainder split 90/10 for training (192,712 pairs) and validation (17,049 pairs).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>T+BS10</strong>: Standard transformer with beam search (K=10) from <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">Grechishnikova (2021)</a></li>
<li><strong>LT+BS10</strong>: The proposed Lmser Transformer with beam search</li>
<li><strong>LiGANN</strong>: 3D pocket-to-ligand shape generation via BicycleGAN</li>
<li><strong>SBMolGen</strong>: ChemTS-based method with docking constraints</li>
<li><strong>SBDD-3D</strong>: 3D autoregressive graph-based generation</li>
<li><strong>Decoys</strong>: Random compounds from ZINC database</li>
<li><strong>Known ligands</strong>: Original binding partners from the database</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Docking</th>
          <th>Uniqueness</th>
          <th>LogP</th>
          <th>QED</th>
          <th>SA</th>
          <th>NP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Decoys</td>
          <td>7.3</td>
          <td>-</td>
          <td>2.4</td>
          <td>0.8</td>
          <td>2.4</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>Known ligands</td>
          <td>9.8</td>
          <td>-</td>
          <td>2.2</td>
          <td>0.5</td>
          <td>3.3</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>LiGANN</td>
          <td>6.7</td>
          <td>94.7%</td>
          <td>2.9</td>
          <td>0.6</td>
          <td>3.0</td>
          <td>-1.1</td>
      </tr>
      <tr>
          <td>SBMolGen</td>
          <td>7.7</td>
          <td>100%</td>
          <td>2.6</td>
          <td>0.7</td>
          <td>2.8</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>SBDD-3D</td>
          <td>7.7</td>
          <td>99.3%</td>
          <td>1.5</td>
          <td>0.6</td>
          <td>4.0</td>
          <td>0.3</td>
      </tr>
      <tr>
          <td>T+BS10</td>
          <td>8.5</td>
          <td>90.6%</td>
          <td>3.8</td>
          <td>0.5</td>
          <td>2.8</td>
          <td>-0.8</td>
      </tr>
      <tr>
          <td>LT+BS10</td>
          <td>8.5</td>
          <td>98.1%</td>
          <td>4.0</td>
          <td>0.5</td>
          <td>2.7</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (freq)</td>
          <td>10.8</td>
          <td>99.5%</td>
          <td>4.9</td>
          <td>0.4</td>
          <td>2.9</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (max)</td>
          <td>11.6</td>
          <td>100%</td>
          <td>5.2</td>
          <td>0.4</td>
          <td>2.7</td>
          <td>-0.8</td>
      </tr>
  </tbody>
</table>
<p>AlphaDrug (max) achieves the highest average docking score (11.6), surpassing known ligands (9.8). Statistical significance is confirmed with two-tailed t-test P-values below 0.01 for all comparisons.</p>
<h3 id="mcts-vs-beam-search-under-equal-compute">MCTS vs. Beam Search Under Equal Compute</h3>
<p>When constrained to the same number of docking evaluations, MCTS consistently outperforms beam search:</p>
<table>
  <thead>
      <tr>
          <th>Docking times (N)</th>
          <th>BS</th>
          <th>MCTS</th>
          <th>P-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>N = 105 (S=10)</td>
          <td>8.4 (10.9)</td>
          <td>10.9 (11.5)</td>
          <td>1.8e-34 (4.5e-3)</td>
      </tr>
      <tr>
          <td>N = 394 (S=50)</td>
          <td>8.3 (11.4)</td>
          <td>11.6 (12.2)</td>
          <td>1.4e-31 (1.8e-3)</td>
      </tr>
      <tr>
          <td>N = 1345 (S=500)</td>
          <td>8.4 (11.9)</td>
          <td>12.4 (13.2)</td>
          <td>2.2e-39 (8.2e-6)</td>
      </tr>
  </tbody>
</table>
<p>Values in parentheses are average top-1 scores per protein.</p>
<h3 id="ablation-effect-of-protein-sequence-input">Ablation: Effect of Protein Sequence Input</h3>
<p>Replacing the full transformer (T) or LT with a transformer encoder only (TE, no protein input) demonstrates that protein conditioning improves both uniqueness and docking score per symbol (SpS):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Uniqueness</th>
          <th>SpS</th>
          <th>Molecular length</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TE + MCTS (S=50)</td>
          <td>81.0%</td>
          <td>0.1926</td>
          <td>62.93</td>
      </tr>
      <tr>
          <td>T + MCTS (S=50)</td>
          <td>98.0%</td>
          <td>0.2149</td>
          <td>55.63</td>
      </tr>
      <tr>
          <td>LT + MCTS (S=50)</td>
          <td>100.0%</td>
          <td>0.2159</td>
          <td>56.54</td>
      </tr>
  </tbody>
</table>
<p>The SpS metric (docking score normalized by molecule length) isolates the quality improvement from the tendency of longer molecules to score higher.</p>
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p>A docking lookup table caches previously computed protein-molecule docking scores, reducing actual docking calls by 81-86% compared to the theoretical maximum ($L \times S$ calls per molecule). With $S = 10$, AlphaDrug generates molecules in about 52 minutes per protein; with $S = 50$, about 197 minutes per protein.</p>
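<p>The lookup table itself is just memoization over (protein, molecule) pairs. A minimal sketch with a stand-in <code>dock_fn</code> in place of the actual SMINA call (class and field names are invented for this example):</p>

```python
class DockingCache:
    """Cache docking scores so repeated (protein, SMILES) queries during
    MCTS rollouts trigger only one expensive docking run."""
    def __init__(self, dock_fn):
        self.dock_fn = dock_fn   # the expensive docking call, e.g. SMINA
        self.table = {}
        self.calls = 0

    def score(self, protein, smiles):
        key = (protein, smiles)
        if key not in self.table:
            self.calls += 1      # an actual docking run happens here
            self.table[key] = self.dock_fn(protein, smiles)
        return self.table[key]
```

<p>Because rollouts from nearby tree nodes often regenerate identical molecules, a cache of this shape is what yields the reported 81-86% reduction in docking calls.</p>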
<h2 id="docking-gains-with-acknowledged-limitations">Docking Gains with Acknowledged Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>86% of AlphaDrug-generated molecules have higher docking scores than known ligands for their respective targets.</li>
<li>The LT architecture with hierarchical skip connections improves uniqueness (from 90.6% to 98.1% with beam search) and provides slight SpS gains over the vanilla transformer.</li>
<li>MCTS is the dominant factor in performance improvement: even with only 10 simulations, it boosts docking scores by 31.3% over greedy LT decoding.</li>
<li>Case studies on three proteins (3gcs, 3eig, 4o28) show that generated molecules share meaningful substructures with known ligands, suggesting chemical plausibility.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors identify three areas for improvement:</p>
<ol>
<li><strong>Sequence-only representation</strong>: AlphaDrug uses amino acid sequences rather than 3D protein structures. While it outperforms existing 3D methods (SBDD-3D), incorporating 3D pocket geometry could further improve performance.</li>
<li><strong>External docking as value function</strong>: SMINA docking calls are computationally expensive and become a bottleneck during MCTS. A learnable end-to-end value network would reduce this cost and allow joint policy-value training.</li>
<li><strong>Full rollout requirement</strong>: Every MCTS simulation requires generating a complete molecule for docking evaluation. Estimating binding affinity from partial molecules remains an open challenge.</li>
</ol>
<p>The physicochemical properties (QED, SA) of AlphaDrug&rsquo;s outputs are comparable to baselines but not explicitly optimized. LogP values trend toward the upper end of the Ghose filter range (4.9-5.2 vs. the 5.6 limit), which may indicate lipophilicity bias.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BindingDB (filtered)</td>
          <td>192,712 protein-ligand pairs</td>
          <td>Human proteins, IC50 &lt; 100 nM, MW &lt; 1000 Da</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>BindingDB (filtered)</td>
          <td>17,049 pairs</td>
          <td>Same filtering criteria</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>BindingDB (filtered)</td>
          <td>100 proteins from 25 clusters</td>
          <td>Clustered at 30% sequence identity via MMseqs2</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>MCTS with PUCT selection criterion, $c_{puct} = 1.5$</li>
<li>$S = 50$ simulations per step (default), $S = 10$ for fast variant</li>
<li>Greedy rollout policy using LT probabilities</li>
<li>Docking lookup table for efficiency (caches SMINA results)</li>
<li>Two generation modes: max (deterministic, highest visit count) and freq (stochastic, proportional to visit counts)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Lmser Transformer with hierarchical encoder-to-decoder skip connections</li>
<li>Sinusoidal positional encoding</li>
<li>Multi-head cross-attention at each decoder layer</li>
<li>Detailed hyperparameters (embedding dimensions, number of layers/heads) are in the supplementary material (Table S1)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AlphaDrug (max)</th>
          <th>Known ligands</th>
          <th>Best baseline (T+BS10)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score</td>
          <td>11.6</td>
          <td>9.8</td>
          <td>8.5</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>100%</td>
          <td>-</td>
          <td>90.6%</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>100%</td>
          <td>-</td>
          <td>Not reported</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not explicitly reported in the paper. Generation time is reported as approximately 52 minutes per protein ($S = 10$) and 197 minutes per protein ($S = 50$), with docking (via SMINA) being the dominant cost.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CMACH508/AlphaDrug">CMACH508/AlphaDrug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation, includes data processing and generation scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, H., Lin, C., Zhao, D., Tu, S., &amp; Xu, L. (2022). AlphaDrug: protein target specific de novo molecular generation. <em>PNAS Nexus</em>, 1(4), pgac227. <a href="https://doi.org/10.1093/pnasnexus/pgac227">https://doi.org/10.1093/pnasnexus/pgac227</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2022alphadrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AlphaDrug: protein target specific de novo molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Hao and Lin, Cheng and Zhao, Dengwei and Tu, Shikui and Xu, Lei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{PNAS Nexus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{pgac227}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/pnasnexus/pgac227}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>TamGen: GPT-Based Target-Aware Drug Design and Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</guid><description>TamGen combines a GPT-like chemical language model with protein pocket encoding and VAE refinement to generate drug candidates with experimental validation.</description><content:encoded><![CDATA[<h2 id="a-method-for-target-conditioned-molecular-generation">A Method for Target-Conditioned Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces TamGen (Target-aware molecular generation), a three-module architecture for generating drug-like compounds conditioned on protein binding pocket structures. The primary contribution is a GPT-like chemical language model pre-trained on 10 million SMILES from PubChem, combined with a Transformer-based protein encoder and a VAE-based contextual encoder for compound refinement. The authors validate TamGen on the CrossDocked2020 benchmark and apply it through a Design-Refine-Test pipeline to discover 14 novel inhibitors of the Mycobacterium tuberculosis ClpP protease, with $\text{IC}_{50}$ values ranging from 1.88 to 35.2 $\mu$M.</p>
<h2 id="bridging-generative-ai-and-practical-drug-discovery">Bridging Generative AI and Practical Drug Discovery</h2>
<p>Target-based generative drug design aims to create novel compounds with desired pharmacological properties from scratch, exploring the estimated $10^{60}$ feasible compounds in chemical space rather than screening existing libraries of $10^{4}$ to $10^{8}$ molecules. Prior approaches using diffusion models, GANs, VAEs, and autoregressive models have demonstrated the feasibility of generating compounds conditioned on target proteins. However, most generated compounds lack satisfactory physicochemical properties for drug-likeness, and validations with biophysical or biochemical assays are largely missing.</p>
<p>The key limitations of existing 3D generation methods (TargetDiff, Pocket2Mol, ResGen, 3D-AR) include:</p>
<ul>
<li>Generated compounds frequently contain multiple fused rings, leading to poor synthetic accessibility</li>
<li>High cellular toxicity and decreased developability associated with excessive fused ring counts</li>
<li>Slow generation speeds (tens of minutes to hours per 100 compounds)</li>
<li>Limited real-world experimental validation of generated candidates</li>
</ul>
<p>TamGen addresses these issues by operating in 1D SMILES space rather than 3D coordinate space, leveraging pre-training on natural compound distributions to produce more drug-like molecules.</p>
<h2 id="three-module-architecture-with-pre-training-and-refinement">Three-Module Architecture with Pre-Training and Refinement</h2>
<p>TamGen consists of three components: a compound decoder, a protein encoder, and a contextual encoder.</p>
<h3 id="compound-decoder-chemical-language-model">Compound Decoder (Chemical Language Model)</h3>
<p>The compound decoder is a GPT-style autoregressive model pre-trained on 10 million SMILES randomly sampled from PubChem. The pre-training objective follows standard next-token prediction:</p>
<p>$$
\min -\sum_{y \in \mathcal{D}_0} \frac{1}{M_y} \sum_{i=1}^{M_y} \log P(y_i \mid y_{i-1}, y_{i-2}, \ldots, y_1)
$$</p>
<p>where $M_y$ is the SMILES sequence length. This enables both unconditional and conditional generation. The decoder uses 12 Transformer layers with hidden dimension 768.</p>
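<p>A minimal sketch of this length-normalized objective, assuming per-token log-probabilities have already been computed by the decoder (padding is ignored for brevity):</p>

```python
import numpy as np

def pretraining_loss(log_probs, targets):
    """Length-normalized next-token NLL, matching the objective above.

    log_probs: list of (M_y, vocab) arrays of log P(token | prefix)
    targets:   list of length-M_y integer arrays (SMILES token ids)
    Each sequence's NLL is averaged over its length M_y, then summed
    over the dataset. Illustrative sketch only.
    """
    total = 0.0
    for lp, y in zip(log_probs, targets):
        token_nll = -lp[np.arange(len(y)), y]  # -log P(y_i | y_{<i})
        total += token_nll.mean()              # (1 / M_y) * sum_i
    return total
```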
<h3 id="protein-encoder-with-distance-aware-attention">Protein Encoder with Distance-Aware Attention</h3>
<p>The protein encoder processes binding pocket residues using both sequential and geometric information. Given amino acids $\mathbf{a} = (a_1, \ldots, a_N)$ with 3D coordinates $\mathbf{r} = (r_1, \ldots, r_N)$, the input representation combines amino acid embeddings with coordinate embeddings:</p>
<p>$$
h_i^{(0)} = E_a a_i + E_r \rho\left(r_i - \frac{1}{N}\sum_{j=1}^{N} r_j\right)
$$</p>
<p>where $\rho$ denotes a random roto-translation operation applied as data augmentation, and coordinates are centered to the origin.</p>
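<p>The centering and random roto-translation $\rho$ can be sketched as below; TamGen's exact augmentation may differ:</p>

```python
import numpy as np

def augment_coords(r, rng):
    """Center coordinates and apply a random rotation, mirroring the
    roto-translation augmentation rho in the input embedding above.

    r: (N, 3) residue coordinates. Sketch only; the published
    implementation may sample rotations differently.
    """
    r = r - r.mean(axis=0)                 # center to the origin
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:               # ensure a proper rotation
        q[:, 0] *= -1
    return r @ q.T
```

Because only rigid motions are applied, all pairwise distances are preserved, so the distance-aware attention described next sees the same geometry under every augmentation.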
<p>The encoder uses a distance-aware self-attention mechanism that weights attention scores by spatial proximity:</p>
<p>$$
\begin{aligned}
\hat{\alpha}_j &amp;= \exp\left(-\frac{|r_i - r_j|^2}{\tau}\right)(h_i^{(l)\top} W h_j^{(l)}) \\
\alpha_j &amp;= \frac{\exp \hat{\alpha}_j}{\sum_{k=1}^{N} \exp \hat{\alpha}_k} \\
\hat{\boldsymbol{h}}_i^{(l+1)} &amp;= \sum_{j=1}^{N} \alpha_j (W_v h_j^{(l)})
\end{aligned}
$$</p>
<p>where $\tau$ is a temperature hyperparameter and $W$, $W_v$ are learnable parameters. The encoder uses 4 layers with hidden dimension 256. Outputs are passed to the compound decoder via cross-attention.</p>
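<p>A single query position of this attention can be sketched in NumPy; this is a minimal, non-batched illustration, not the TamGen implementation:</p>

```python
import numpy as np

def distance_aware_attention(h, r, W, Wv, tau, i):
    """One query position of the distance-aware attention above.

    h: (N, d) residue features, r: (N, 3) coordinates,
    W, Wv: (d, d) learnable matrices, tau: temperature.
    Scores are damped by exp(-|r_i - r_j|^2 / tau) before the softmax,
    so spatially distant residues contribute less.
    """
    d2 = np.sum((r - r[i]) ** 2, axis=1)           # |r_i - r_j|^2
    scores = np.exp(-d2 / tau) * (h[i] @ W @ h.T)  # hat{alpha}_j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # softmax over j
    return alpha @ (h @ Wv.T)                      # sum_j alpha_j (Wv h_j)
```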
<h3 id="vae-based-contextual-encoder">VAE-Based Contextual Encoder</h3>
<p>A VAE-based contextual encoder infers the mean $\mu$ and standard deviation $\sigma$ of a latent distribution for any (compound, protein) pair; a latent vector $z$ sampled from this distribution conditions the decoder. During training, the model learns to recover the input compound. During application, encoding a seed compound enables compound refinement. The full training objective combines reconstruction loss with KL regularization:</p>
<p>$$
\min_{\Theta, q} \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} -\log P(\mathbf{y} \mid \mathbf{x}, z; \Theta) + \beta \, \mathcal{D}_{\text{KL}}\left(q(z \mid \mathbf{x}, \mathbf{y}) \,\|\, p(z)\right)
$$</p>
<p>where $\beta$ is a hyperparameter controlling the KL divergence weight, and $p(z)$ is a standard Gaussian prior.</p>
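<p>For a diagonal Gaussian posterior and standard Gaussian prior $p(z)$, the KL term has a closed form, sketched here:</p>

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) ), the
    regularizer that beta scales against the reconstruction loss."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
```

The term vanishes exactly when $\mu = 0$ and $\sigma = 1$, i.e. when the posterior collapses onto the prior.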
<h2 id="benchmark-evaluation-and-tuberculosis-drug-discovery">Benchmark Evaluation and Tuberculosis Drug Discovery</h2>
<h3 id="crossdocked2020-benchmark">CrossDocked2020 Benchmark</h3>
<p>TamGen was evaluated against five baselines (liGAN, 3D-AR, Pocket2Mol, ResGen, TargetDiff) on the CrossDocked2020 dataset (~100k drug-target pairs for training, 100 test binding pockets). For each target, 100 compounds were generated per method. Evaluation metrics included:</p>
<ul>
<li><strong>Docking score</strong> (AutoDock-Vina): binding affinity estimate</li>
<li><strong>QED</strong>: quantitative estimate of drug-likeness</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a></strong>: physicochemical property compliance</li>
<li><strong>SAS</strong>: synthetic accessibility score</li>
<li><strong>LogP</strong>: lipophilicity (optimal range 0-5 for oral administration)</li>
<li><strong>Molecular diversity</strong>: Tanimoto similarity between Morgan fingerprints</li>
</ul>
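<p>The diversity metric, mean pairwise Tanimoto distance, can be sketched as follows; fingerprints are represented here as plain sets of on-bit indices rather than RDKit Morgan bit vectors:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def diversity(fps):
    """Mean pairwise (1 - Tanimoto) over a batch of generated compounds.

    fps: list of on-bit index sets. In practice these would come from
    RDKit Morgan fingerprints; this is an illustrative sketch.
    """
    pairs = [(a, b) for i, a in enumerate(fps) for b in fps[i + 1:]]
    return sum(1 - tanimoto(a, b) for a, b in pairs) / len(pairs)
```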
<p>TamGen ranked first or second on 5 of 6 metrics and achieved the best overall score using mean reciprocal rank (MRR) across all metrics. On synthetic accessibility for high-affinity compounds, TamGen performed best. The generated compounds averaged 1.78 fused rings, closely matching FDA-approved drugs, while competing 3D methods produced compounds with significantly more fused rings.</p>
<p>TamGen was also 85x to 394x faster than competing methods: generating 100 compounds per target in an average of 9 seconds on a single A6000 GPU, compared to tens of minutes or hours for the baselines.</p>
<h3 id="design-refine-test-pipeline-for-clpp-inhibitors">Design-Refine-Test Pipeline for ClpP Inhibitors</h3>
<p>The practical application targeted ClpP protease of Mycobacterium tuberculosis, an emerging antibiotic target with no documented advanced inhibitors beyond <a href="https://en.wikipedia.org/wiki/Bortezomib">Bortezomib</a>.</p>
<p><strong>Design stage</strong>: Using the ClpP binding pocket from PDB structure 5DZK, TamGen generated 2,612 unique compounds. Compounds were filtered by molecular docking (retaining those with better scores than Bortezomib) and Ligandformer phenotypic activity prediction. Peptidomimetic compounds were excluded for poor ADME properties. Four seed compounds were selected.</p>
<p><strong>Refine stage</strong>: Using the 4 seed compounds plus 3 weakly active compounds ($\text{IC}_{50}$ 100-200 $\mu$M) from prior experiments, TamGen generated 8,635 unique compounds conditioned on both the target and seeds. After filtering, 296 compounds were selected for testing.</p>
<p><strong>Test stage</strong>: From a 446k commercial compound library, 159 analogs (MCS similarity &gt; 0.55) were identified. Five analogs showed significant inhibitory effects. Dose-response experiments revealed $\text{IC}_{50}$ values below 20 $\mu$M for all five, with Analog-005 achieving $\text{IC}_{50}$ of 1.9 $\mu$M. Three additional novel compounds were synthesized for SAR analysis:</p>
<table>
  <thead>
      <tr>
          <th>Compound</th>
          <th>Series</th>
          <th>Source</th>
          <th>$\text{IC}_{50}$ ($\mu$M)</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Analog-005</td>
          <td>II</td>
          <td>Commercial library</td>
          <td>1.9</td>
          <td>Most potent analog</td>
      </tr>
      <tr>
          <td>Analog-003</td>
          <td>I</td>
          <td>Commercial library</td>
          <td>&lt; 20</td>
          <td>Strongest single-dose inhibition</td>
      </tr>
      <tr>
          <td>Syn-A003-01</td>
          <td>I</td>
          <td>TamGen (synthesized)</td>
          <td>&lt; 20</td>
          <td>Diphenylurea scaffold</td>
      </tr>
  </tbody>
</table>
<p>Both compound series (diphenylurea and benzenesulfonamide scaffolds) represent novel ClpP inhibitor chemotypes distinct from Bortezomib. Additionally, 6 out of 8 directly synthesized TamGen compounds demonstrated $\text{IC}_{50}$ below 40 $\mu$M, confirming TamGen&rsquo;s ability to produce viable hits without the library search step.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Four ablation experiments clarified the contributions of TamGen&rsquo;s components:</p>
<ol>
<li><strong>Without pre-training</strong>: Significantly worse docking scores and simpler structures. The optimal decoder depth dropped from 12 to 4 layers without pre-training due to overfitting.</li>
<li><strong>Shuffled pocket-ligand pairs (TamGen-r)</strong>: Substantially worse docking scores, confirming TamGen learns meaningful pocket-ligand interactions rather than generic compound distributions.</li>
<li><strong>Without distance-aware attention</strong>: Significant decline in docking scores when removing the geometric attention term from Eq. 2.</li>
<li><strong>Without coordinate augmentation</strong>: Performance degradation when removing the roto-translation augmentation $\rho$, highlighting the importance of geometric invariance.</li>
</ol>
<h2 id="validated-drug-like-generation-with-practical-limitations">Validated Drug-Like Generation with Practical Limitations</h2>
<p>TamGen demonstrates that 1D SMILES-based generation with pre-training on natural compounds produces molecules with better drug-likeness properties than 3D generation methods. The experimental validation against ClpP is a notable strength, as most generative drug design methods lack biochemical assay confirmation.</p>
<p>Key limitations acknowledged by the authors include:</p>
<ul>
<li><strong>Insufficient sensitivity to minor target differences</strong>: TamGen cannot reliably distinguish targets with point mutations or protein isoforms, limiting applicability for cancer-related proteins</li>
<li><strong>Requires known structure and pocket</strong>: As a structure-based method, TamGen needs the 3D structure of the target protein and binding pocket information</li>
<li><strong>Limited cellular validation</strong>: The study focuses on hit identification; cellular activities and toxicities of proposed compounds were not extensively tested</li>
<li><strong>1D generation trade-off</strong>: SMILES-based generation does not fully exploit 3D protein-ligand geometric interactions available in coordinate space</li>
</ul>
<p>Future directions include integrating insights from 3D autoregressive methods, using Monte Carlo Tree Search or reinforcement learning to guide generation for better docking scores and ADME/T properties, and property-guided generation as explored in <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/">PrefixMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (random sample)</td>
          <td>10M SMILES</td>
          <td>Compound decoder pre-training</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>CrossDocked2020</td>
          <td>~100k pairs</td>
          <td>Filtered pocket-ligand pairs</td>
      </tr>
      <tr>
          <td>Extended fine-tuning</td>
          <td>CrossDocked + PDB</td>
          <td>~300k pairs</td>
          <td>Used for TB compound generation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>CrossDocked2020 test</td>
          <td>100 pockets</td>
          <td>Same split as TargetDiff/Pocket2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Compound decoder</strong>: 12-layer GPT with hidden dimension 768, pre-trained for 200k steps</li>
<li><strong>Protein encoder</strong>: 4-layer Transformer with hidden dimension 256, distance-aware attention</li>
<li><strong>VAE encoder</strong>: 4-layer standard Transformer encoder with hidden dimension 256</li>
<li><strong>Optimizer</strong>: Adam with initial learning rate $3 \times 10^{-5}$</li>
<li><strong>VAE $\beta$</strong>: 0.1 or 1.0 depending on generation stage</li>
<li><strong>Beam search</strong>: beam sizes of 4, 10, or 20 depending on stage</li>
<li><strong>Pocket definition</strong>: residues within 10 or 15 Angstrom distance cutoff from ligand center</li>
</ul>
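<p>The pocket definition in the last bullet can be sketched as below; this is a hypothetical helper, and the reference implementation may select residues by atom-level rather than ligand-center distances:</p>

```python
import numpy as np

def extract_pocket(residue_coords, ligand_coords, cutoff=10.0):
    """Select binding-pocket residues within `cutoff` angstroms of the
    ligand center, per the pocket definition above.

    residue_coords: (N, 3) one representative coordinate per residue
    ligand_coords:  (M, 3) ligand atom coordinates
    Returns indices of the retained residues. Illustrative sketch.
    """
    center = np.asarray(ligand_coords).mean(axis=0)
    dists = np.linalg.norm(np.asarray(residue_coords) - center, axis=1)
    return np.flatnonzero(dists <= cutoff)
```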
<h3 id="models">Models</h3>
<p>Pre-trained model weights are available via Zenodo at <a href="https://doi.org/10.5281/zenodo.13751391">https://doi.org/10.5281/zenodo.13751391</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>TamGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Overall MRR</td>
          <td>Best</td>
          <td>TargetDiff (2nd)</td>
          <td>Ranked across 6 metrics</td>
      </tr>
      <tr>
          <td>Fused rings (avg)</td>
          <td>1.78</td>
          <td>~3-5 (others)</td>
          <td>Matches FDA-approved drug average</td>
      </tr>
      <tr>
          <td>Generation speed</td>
          <td>9 sec/100 compounds</td>
          <td>~13 min (ResGen)</td>
          <td>Single A6000 GPU</td>
      </tr>
      <tr>
          <td>ClpP hit rate</td>
          <td>6/8 synthesized</td>
          <td>N/A</td>
          <td>$\text{IC}_{50}$ &lt; 40 $\mu$M</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x V100 GPUs for 200k steps</li>
<li>Inference benchmarking: 1x A6000 GPU</li>
<li>Generation time: ~9 seconds per 100 compounds per target</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SigmaGenX/TamGen">TamGen code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13751391">Model weights and data</a></td>
          <td>Model + Data</td>
          <td>CC-BY-4.0</td>
          <td>Pre-trained weights, source data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, K., Xia, Y., Deng, P., Liu, R., Zhang, Y., Guo, H., Cui, Y., Pei, Q., Wu, L., Xie, S., Chen, S., Lu, X., Hu, S., Wu, J., Chan, C.-K., Chen, S., Zhou, L., Yu, N., Chen, E., Liu, H., Guo, J., Qin, T., &amp; Liu, T.-Y. (2024). TamGen: drug design with target-aware molecule generation through a chemical language model. <em>Nature Communications</em>, 15, 9360. <a href="https://doi.org/10.1038/s41467-024-53632-4">https://doi.org/10.1038/s41467-024-53632-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tamgen,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{TamGen: drug design with target-aware molecule generation through a chemical language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{9360}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-53632-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of Molecular Representation Learning Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/molecular-representation-learning-foundation-models-review/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/molecular-representation-learning-foundation-models-review/</guid><description>A systematic review of molecular representation learning foundation models for drug discovery, covering five modalities and four pretraining strategies.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-molecular-representation-foundation-models">A Systematization of Molecular Representation Foundation Models</h2>
<p>This paper is a <strong>Systematization</strong> that provides the first comprehensive review of foundation models for molecular representation learning (MRL). The authors classify existing models by their input modality (unimodal vs. multimodal), analyze four mainstream pretraining strategies, survey five downstream application domains, and propose practical guidelines for model selection. The review covers over 35 representative models published between 2020 and 2024, with parameter counts ranging from 2 million to over 1 trillion.</p>
<h2 id="why-a-systematic-review-of-mrl-foundation-models-is-needed">Why a Systematic Review of MRL Foundation Models Is Needed</h2>
<p>Molecular representation learning transforms molecular structures and properties into numerical vectors that serve as inputs for machine learning models. The field has evolved rapidly from molecular fingerprints through SMILES-based sequence models to graph neural networks and 3D geometry-aware architectures. Foundation models, characterized by large-scale pretraining on unlabeled molecular data followed by fine-tuning on downstream tasks, have introduced new opportunities for generalizability and transfer learning in drug discovery.</p>
<p>Despite this rapid progress, the authors identify a gap: no prior work has systematically reviewed MRL foundation models across all input modalities and pretraining paradigms. Existing surveys tend to focus on specific representations (e.g., graph-based methods) or specific applications (e.g., property prediction) without providing the cross-cutting perspective needed to guide model selection. This review fills that gap by offering a unified taxonomy and practical guidelines.</p>
<h2 id="taxonomy-of-molecular-descriptors-and-model-architectures">Taxonomy of Molecular Descriptors and Model Architectures</h2>
<p>The core organizational framework classifies models along two axes: the molecular descriptor used as input and the backbone architecture.</p>
<h3 id="molecular-descriptors">Molecular Descriptors</h3>
<p>The review identifies five primary descriptor types:</p>
<ol>
<li><strong>Molecular fingerprints</strong>: Binary vectors encoding structural features (e.g., Morgan fingerprints). Rarely used in foundation models due to information loss and high-dimensional sparsity.</li>
<li><strong>1D sequences</strong>: <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> and <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> string representations. SMILES is compact and widely used but can produce invalid molecules. SELFIES guarantees valid molecular strings by construction.</li>
<li><strong>2D topological graphs</strong>: Atoms as nodes, bonds as edges. Can be derived from SMILES via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, making graph datasets effectively interchangeable with SMILES datasets.</li>
<li><strong>3D geometry</strong>: Spatial coordinates capturing conformational information, energy states, and stereochemistry. Experimentally expensive to obtain, limiting dataset availability.</li>
<li><strong>Multimodal</strong>: Combinations of the above with text, IUPAC names, knowledge graphs, and molecular images.</li>
</ol>
<p>The paper also discusses mathematically abstract molecular representations. For example, the <a href="https://en.wikipedia.org/wiki/Wiener_index">Wiener index</a> quantifies structural complexity:</p>
<p>$$
W = \sum_{i &lt; j} d_{ij}
$$</p>
<p>where $d_{ij}$ is the topological distance (shortest bonding path length) between atoms $i$ and $j$.</p>
<p>Degree centrality captures local connectivity:</p>
<p>$$
C_{D}(v_{i}) = \sum_{j=1}^{n} A_{ij}
$$</p>
<p>where $A \in \mathbb{R}^{n \times n}$ is the molecular graph adjacency matrix.</p>
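<p>Both quantities can be computed directly from the adjacency matrix; the sketch below recovers pairwise topological distances with Floyd-Warshall:</p>

```python
import numpy as np

def wiener_index(A):
    """Wiener index from an adjacency matrix A: the sum of shortest-path
    distances over unordered atom pairs, via Floyd-Warshall."""
    n = len(A)
    d = np.where(np.asarray(A) > 0, 1.0, np.inf)
    np.fill_diagonal(d, 0.0)
    for k in range(n):  # relax paths through intermediate atom k
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d.sum() / 2.0  # each unordered pair counted once

def degree_centrality(A):
    """C_D(v_i) = sum_j A_ij -- the row sums of the adjacency matrix."""
    return np.asarray(A).sum(axis=1)
```

For the path graph of a three-atom chain (e.g. the propane carbon skeleton), the distances are 1, 1, and 2, giving a Wiener index of 4 and degrees of 1, 2, 1.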
<h3 id="model-architectures">Model Architectures</h3>
<p>Models are classified into two primary categories:</p>
<p><strong>Unimodal-based models:</strong></p>
<ul>
<li><strong>Sequence-based</strong>: Transformer models operating on SMILES/SELFIES (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a>, MolGEN, <a href="/notes/computational-chemistry/llms-for-chemistry/llamsmol-instruction-tuning-chemistry/">LlaSMol</a>). These capture syntactic patterns but miss spatial and topological features.</li>
<li><strong>Topological graph-based</strong>: GNN variants (GIN, GCN, GAT) and Transformer-based graph models (Graphormer). GNNs capture local topology through message passing; Transformers overcome locality limitations through global self-attention.</li>
<li><strong>3D geometry-based</strong>: Models like Uni-Mol and 3D PGT that incorporate spatial coordinates. Uni-Mol uses distance-aware self-attention with an SE(3)-equivariant coordinate head for rotation/translation invariance.</li>
<li><strong>Image-based</strong>: CNN-based models (ImageMol) that process 2D molecular images using visual representation learning.</li>
</ul>
<p><strong>Multimodal-based models:</strong></p>
<ul>
<li><strong>Sequence + Graph</strong>: <a href="/notes/computational-chemistry/chemical-language-models/multimodal-molecular/dual-view-molecule-pretraining/">DVMP</a>, PanGu Drug Model. Combines the strengths of string and topological representations.</li>
<li><strong>Graph + 3D Geometry</strong>: GraphMVP, Transformer-M. Enriches topological features with spatial information.</li>
<li><strong>Text + Molecular Structure</strong>: KV-PLM, MolT5, MoleculeSTM, MolReGPT, Y-mol. Aligns molecular structural information with biomedical text through cross-modal learning.</li>
</ul>
<h2 id="four-pretraining-paradigms-for-mrl">Four Pretraining Paradigms for MRL</h2>
<p>The review systematically categorizes pretraining strategies into four paradigms:</p>
<h3 id="masked-language-modeling-mlm">Masked Language Modeling (MLM)</h3>
<p>The cornerstone strategy for sequence-based models. Randomly masks tokens in molecular sequences and trains the model to predict them. ChemBERTa pretrained on 77 million SMILES sequences from PubChem achieves 5-10% improvement in AUC-ROC on property prediction tasks compared to task-specific models. MLM captures local dependencies and global sequence patterns but cannot model spatial or topological features, making it best suited for unimodal sequence inputs.</p>
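<p>The corruption step at the heart of MLM can be sketched as follows; this is a simplified version, as BERT-style pipelines such as ChemBERTa's also substitute random tokens for a fraction of the masked positions:</p>

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Randomly mask SMILES tokens for MLM pretraining.

    Returns the corrupted sequence and the indices the model must
    predict. A minimal sketch of the corruption step only.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            corrupted[i] = mask_token
            targets.append(i)
    return corrupted, targets
```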
<h3 id="contrastive-learning-cl">Contrastive Learning (CL)</h3>
<p>The dominant strategy for multimodal models. Constructs positive-negative sample pairs to align features across modalities or views. In unimodal settings, CL generates negative samples by perturbing molecular graphs. In multimodal settings, it aligns features from different modalities. GraphMVP, which contrasts 2D topological features with 3D spatial features, reduces RMSE by 15% on QM9 energy prediction compared to unimodal models. Performance depends heavily on the quality of positive sample construction.</p>
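<p>A generic InfoNCE-style contrastive objective over two aligned views can be sketched as below; this is not the GraphMVP implementation, just the standard form such methods build on:</p>

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss between two views.

    z1, z2: (batch, dim) L2-normalized embeddings of matched pairs
    (e.g. a 2D-graph view and a 3D-geometry view of the same molecule).
    Row i of z1 should score highest against row i of z2; all other
    rows act as in-batch negatives. Generic sketch.
    """
    logits = (z1 @ z2.T) / temperature           # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives on the diagonal
```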
<h3 id="reconstruction-based-pretraining-rbp">Reconstruction-Based Pretraining (RBP)</h3>
<p>Learns global molecular features by reconstructing original data from corrupted inputs. Tasks include node feature reconstruction, graph structure reconstruction, and coordinate/energy reconstruction. MGMAE masks more than 50% of nodes and edges in molecular graphs and trains the model to reconstruct them, achieving 94.2% AUC-ROC on BBBP. RBP captures global structural patterns but requires high model complexity and training cost.</p>
<h3 id="multimodal-alignment-pretraining-map">Multimodal Alignment Pretraining (MAP)</h3>
<p>Designed for multimodal inputs, aligning and fusing features from different modalities through cross-modal tasks. KV-PLM uses SMILES-to-text matching to align molecular structure and functional information. MAP fuses structural information (SMILES, graphs) with semantic information (text) but requires large-scale cross-modal labeled data, posing significant data acquisition challenges.</p>
<h2 id="downstream-applications-and-performance-benchmarks">Downstream Applications and Performance Benchmarks</h2>
<p>The review evaluates MRL foundation models across five application domains.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The most common benchmark for MRL models. The review provides comprehensive ROC-AUC comparisons across eight <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification datasets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>BBBP</th>
          <th>BACE</th>
          <th>ClinTox</th>
          <th>Tox21</th>
          <th>SIDER</th>
          <th>HIV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MGMAE</td>
          <td>Graph</td>
          <td>94.2</td>
          <td>92.7</td>
          <td>96.7</td>
          <td>86.0</td>
          <td>66.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MPG</td>
          <td>Graph</td>
          <td>92.2</td>
          <td>92.0</td>
          <td>96.3</td>
          <td>83.7</td>
          <td>66.1</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GROVER</td>
          <td>Graph+Trans.</td>
          <td>94.0</td>
          <td>89.4</td>
          <td>94.4</td>
          <td>83.1</td>
          <td>65.8</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MoLFormer</td>
          <td>Sequence</td>
          <td>93.7</td>
          <td>88.2</td>
          <td>94.8</td>
          <td>84.7</td>
          <td>69.0</td>
          <td>82.2</td>
      </tr>
      <tr>
          <td>MM-Deacon</td>
          <td>Seq.+IUPAC</td>
          <td>78.5</td>
          <td>-</td>
          <td>99.5</td>
          <td>-</td>
          <td>69.3</td>
          <td>80.1</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>3D</td>
          <td>72.9</td>
          <td>85.7</td>
          <td>91.9</td>
          <td>79.6</td>
          <td>65.9</td>
          <td>80.8</td>
      </tr>
      <tr>
          <td>DVMP</td>
          <td>Seq.+Graph</td>
          <td>77.8</td>
          <td>89.4</td>
          <td>95.6</td>
          <td>79.1</td>
          <td>69.8</td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>TxD-T-LLM</td>
          <td>Seq.+Text</td>
          <td>-</td>
          <td>-</td>
          <td>86.3</td>
          <td>88.2</td>
          <td>-</td>
          <td>73.2</td>
      </tr>
  </tbody>
</table>
<p>The table shows that no single architecture dominates across all datasets. Transformer- and GIN-based architectures with graph inputs generally perform well. The review notes that model effectiveness depends heavily on the dataset, with Mole-BERT encountering negative transfer due to a small and unbalanced atomic vocabulary.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>MolGEN (SELFIES-based, 8B parameters) achieves 100% validity on synthetic molecules. MolT5 excels at text-to-molecule generation. Uni-Mol generates 3D conformations with 97.95% coverage on QM9.</p>
<h3 id="drug-drug-interaction-prediction"><a href="https://en.wikipedia.org/wiki/Drug_interaction">Drug-Drug Interaction</a> Prediction</h3>
<p>MPG achieves 96.6% AUC-ROC on BIOSNAP by combining unsupervised pretraining with supervised fine-tuning and multi-task learning.</p>
<h3 id="retrosynthesis-prediction"><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a> Prediction</h3>
<p>DVMP achieves 66.5% top-1 accuracy on USPTO-50K when reaction types are provided as priors (54.2% without).</p>
<h3 id="drug-synergy-prediction">Drug Synergy Prediction</h3>
<p>SynerGPT (GPT-based) achieves 77.7% AUC-ROC in few-shot settings for novel drug combinations, outperforming baselines through in-context learning.</p>
<h2 id="guidelines-limitations-and-future-directions">Guidelines, Limitations, and Future Directions</h2>
<h3 id="model-selection-guidelines">Model Selection Guidelines</h3>
<p>The authors provide structured guidelines for choosing MRL foundation models based on:</p>
<ol>
<li><strong>Task objective</strong>: Property prediction favors GNNs or large pretrained frameworks (ChemBERTa-2, Uni-Mol). Generation tasks favor GPT-style autoregressive models (MolGEN). Retrosynthesis benefits from multimodal architectures.</li>
<li><strong>Data characteristics</strong>: SMILES/graph representations suit generation tasks. Knowledge graph-enhanced models benefit interaction and synergy prediction. Transfer learning helps data-limited scenarios.</li>
<li><strong>Interpretability needs</strong>: Transformer architectures are preferred when interpretability is required, as attention matrices enable visualization of learned molecular features.</li>
<li><strong>Computational budget</strong>: GIN-based models have $\mathcal{O}(|V| + |E|)$ complexity, while Transformer-based models scale as $\mathcal{O}(n^2 \cdot d)$.</li>
</ol>
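<p>The complexity gap in the fourth guideline can be made concrete with a back-of-envelope sketch (illustrative only: token count is assumed equal to atom count and constant factors are ignored):</p>

```python
def per_layer_cost(n_atoms: int, n_bonds: int, d: int = 256) -> tuple[int, int]:
    """Rough per-layer operation counts, orders of magnitude only.

    A GIN-style message-passing layer touches each node and edge once,
    giving O(|V| + |E|); self-attention compares every token pair across
    d hidden channels, giving O(n^2 * d) with n tokens (~ atoms).
    """
    gin = n_atoms + n_bonds          # O(|V| + |E|)
    attention = n_atoms ** 2 * d     # O(n^2 * d)
    return gin, attention

# e.g. a 40-atom drug-like molecule with 43 bonds:
# message passing scales with ~83 node/edge visits per layer,
# while full attention scales with 40^2 * 256 = 409,600 pair-channel ops.
gin_ops, attn_ops = per_layer_cost(40, 43)
```

For small molecules the quadratic term is still modest, which is one reason Transformer MRL models remain practical despite the asymptotic gap.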
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The review identifies five key challenges:</p>
<ol>
<li><strong>Multimodal data integration</strong>: Each representation paradigm has distinct limitations (1D neglects spatial configuration, 2D omits conformational details, 3D faces rotational invariance challenges). The authors propose incorporating <a href="/notes/computational-chemistry/molecular-dynamics/">molecular dynamics</a> trajectories as a dynamic modality and using cross-modal data augmentation.</li>
<li><strong>Data scarcity</strong>: Semi-supervised learning can achieve more than 90% of fully supervised performance using only 10% labeled data on QM9. Cross-modal augmentation (e.g., 3D InfoMax) can generate plausible 3D conformers from 2D graphs.</li>
<li><strong>Interpretability</strong>: Current methods rely primarily on attention-based visualization, which is insufficient for multimodal models. The authors suggest assessing decision consistency across modalities and incorporating chemical knowledge graphs.</li>
<li><strong>Training efficiency</strong>: Large parameter counts demand distributed parallel training techniques, with data parallelism being the most common approach.</li>
<li><strong>Robustness and generalization</strong>: Strategies include data augmentation (multiple SMILES representations, 3D conformer generation), meta-learning for rapid adaptation, and sparse attention mechanisms to reduce sensitivity to irrelevant long-range interactions.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review paper, so standard reproducibility criteria for experimental papers do not directly apply. The review compiles results from the original publications of each surveyed model.</p>
<h3 id="data">Data</h3>
<p>The review catalogs 28 representative molecular datasets used by the surveyed foundation models:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Descriptor</th>
          <th>Primary Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>~118M</td>
          <td>SMILES, 3D, Image, IUPAC</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ZINC15</td>
          <td>~980M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>~2.4M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>133,884</td>
          <td>SMILES</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM</a></td>
          <td>450,000</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>950,000</td>
          <td>SMILES</td>
          <td>Reaction prediction</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>4M</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Z-dot-max/MRL_Foundation_Review">Review Materials (GitHub)</a></td>
          <td>Code/Data</td>
          <td>Not specified</td>
          <td>Code and data tables for figures</td>
      </tr>
      <tr>
          <td><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12784970/">Paper (PMC)</a></td>
          <td>Paper</td>
          <td>CC-BY</td>
          <td>Open access via PubMed Central</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>All performance metrics reported in the review are directly cited from the original studies. The evaluation protocols follow each model&rsquo;s original setup. The review covers:</p>
<ul>
<li>ROC-AUC for classification tasks (property prediction, DDI, synergy)</li>
<li>RMSE/MAE for regression tasks</li>
<li>Validity and novelty for molecular generation</li>
<li>Top-k accuracy for retrosynthesis</li>
<li>COV and MAT for conformation generation</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, B., Zhang, J., Liu, Y., Liu, Y., Jiang, J., Yuan, S., Zhen, X., &amp; Liu, Y. (2025). A systematic review of molecular representation learning foundation models. <em>Briefings in Bioinformatics</em>, 27(1), bbaf703. <a href="https://doi.org/10.1093/bib/bbaf703">https://doi.org/10.1093/bib/bbaf703</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{song2025systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of molecular representation learning foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Song, Bosheng and Zhang, Jiayi and Liu, Ying and Liu, Yuansheng and Jiang, Jing and Yuan, Sisi and Zhen, Xia and Liu, Yiping}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbaf703}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbaf703}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PMO: Benchmarking Sample-Efficient Molecular Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/</guid><description>PMO benchmarks 25 molecular optimization algorithms across 23 tasks under a 10K oracle budget, finding older methods like REINVENT still lead.</description><content:encoded><![CDATA[<h2 id="a-standardized-benchmark-for-molecular-optimization">A Standardized Benchmark for Molecular Optimization</h2>
<p>This is a <strong>Resource</strong> paper that introduces PMO (Practical Molecular Optimization), an open-source benchmark for evaluating molecular optimization algorithms with a focus on sample efficiency. The primary contribution is not a new algorithm but a comprehensive evaluation framework that exposes blind spots in how the field measures progress. By benchmarking 25 methods across 23 oracle functions under a fixed budget of 10,000 oracle calls, the authors provide a standardized protocol for transparent and reproducible comparison of molecular design methods.</p>
<h2 id="the-missing-dimension-oracle-budget-in-molecular-design">The Missing Dimension: Oracle Budget in Molecular Design</h2>
<p>Molecular optimization is central to drug and materials discovery, and the field has seen rapid growth in computational methods. Despite this progress, the authors identify three persistent problems with how methods are evaluated:</p>
<ol>
<li>
<p><strong>Lack of oracle budget control</strong>: Most papers do not report how many candidate molecules were evaluated by the oracle to achieve their results, despite this number spanning orders of magnitude. In practice, the most valuable oracles (wet-lab experiments, high-accuracy simulations) are expensive, making sample efficiency critical.</p>
</li>
<li>
<p><strong>Trivial or self-designed oracles</strong>: Many papers only report on easy objectives like QED or penalized LogP, or introduce custom tasks that make cross-method comparison impossible.</p>
</li>
<li>
<p><strong>Insufficient handling of randomness</strong>: Many algorithms are stochastic, yet existing benchmarks examined no more than five methods and rarely reported variance across independent runs.</p>
</li>
</ol>
<p>Prior benchmarks such as <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, Therapeutics Data Commons (TDC), and Tripp et al.&rsquo;s analysis each suffer from at least one of these issues. PMO addresses all three simultaneously.</p>
<h2 id="the-pmo-benchmark-design">The PMO Benchmark Design</h2>
<p>The core innovation of PMO is its evaluation protocol rather than any single algorithmic contribution. The benchmark enforces three design principles:</p>
<p><strong>Oracle budget constraint</strong>: All methods are limited to 10,000 oracle calls. This is deliberately much smaller than the unconstrained budgets typical in the literature, reflecting the practical reality that experimental evaluations are costly.</p>
<p><strong>AUC-based metric</strong>: Instead of reporting only the final top-K score, PMO uses the area under the curve (AUC) of top-K average property value versus oracle calls:</p>
<p>$$
\text{AUC Top-}K = \int_{0}^{N} \bar{f}_{K}(n)\, dn
$$</p>
<p>where $\bar{f}_{K}(n)$ is the average property value of the top $K$ molecules found after $n$ oracle calls, and $N = 10{,}000$. The paper uses $K = 10$. This metric rewards methods that reach high property values quickly, not just those that eventually converge given enough budget. All AUC values are min-max scaled to [0, 1].</p>
<p><strong>Standardized data</strong>: All methods use only the ZINC 250K dataset (approximately 250,000 molecules) whenever a database is required, ensuring a level playing field.</p>
<p>The benchmark includes 23 oracle functions: QED, <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a>, <a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>-beta, <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a>, and 19 oracles from <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> covering multi-property objectives (MPOs) based on similarity, molecular weight, CLogP, and other pharmaceutically relevant criteria. All oracle scores are normalized to [0, 1].</p>
<h2 id="25-methods-across-nine-algorithm-families">25 Methods Across Nine Algorithm Families</h2>
<p>The benchmark evaluates 25 molecular optimization methods organized along two dimensions: molecular assembly strategy (SMILES, SELFIES, atom-level graphs, fragment-level graphs, synthesis-based) and optimization algorithm (GA, MCTS, BO, VAE, GAN, score-based modeling, hill climbing, RL, gradient ascent). Each method was hyperparameter-tuned on two held-out tasks (zaleplon_mpo and perindopril_mpo) and then evaluated across all 23 oracles for 5 independent runs.</p>
<p>The following table summarizes the top 10 methods by sum of mean AUC Top-10 across all 23 tasks:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Method</th>
          <th>Assembly</th>
          <th>Sum AUC Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>SMILES</td>
          <td>14.196</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Graph GA</td>
          <td>Fragments</td>
          <td>13.751</td>
      </tr>
      <tr>
          <td>3</td>
          <td>SELFIES-REINVENT</td>
          <td>SELFIES</td>
          <td>13.471</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GP BO</td>
          <td>Fragments</td>
          <td>13.156</td>
      </tr>
      <tr>
          <td>5</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/stoned-selfies-chemical-space-exploration/">STONED</a></td>
          <td>SELFIES</td>
          <td>13.024</td>
      </tr>
      <tr>
          <td>6</td>
          <td>LSTM HC</td>
          <td>SMILES</td>
          <td>12.223</td>
      </tr>
      <tr>
          <td>7</td>
          <td>SMILES GA</td>
          <td>SMILES</td>
          <td>12.054</td>
      </tr>
      <tr>
          <td>8</td>
          <td>SynNet</td>
          <td>Synthesis</td>
          <td>11.498</td>
      </tr>
      <tr>
          <td>9</td>
          <td>DoG-Gen</td>
          <td>Synthesis</td>
          <td>11.456</td>
      </tr>
      <tr>
          <td>10</td>
          <td>DST</td>
          <td>Fragments</td>
          <td>10.989</td>
      </tr>
  </tbody>
</table>
<p>The bottom five methods by overall ranking were GFlowNet-AL, Pasithea, JT-VAE, Graph MCTS, and MolDQN.</p>
<p>REINVENT is ranked first across all six metrics considered (AUC Top-1, AUC Top-10, AUC Top-100, Top-1, Top-10, Top-100). Graph GA is consistently second. Both methods were released several years before many of the methods they outperform, yet they are rarely used as baselines in newer work.</p>
<h2 id="key-findings-older-methods-win-and-selfies-offers-limited-advantage">Key Findings: Older Methods Win and SELFIES Offers Limited Advantage</h2>
<p>The benchmark yields several findings with practical implications:</p>
<p><strong>No method solves optimization within realistic budgets.</strong> None of the 25 methods can optimize the included objectives within hundreds of oracle calls (the scale at which experimental evaluations would be feasible), except for trivially easy oracles like QED, DRD2, and osimertinib_mpo.</p>
<p><strong>Older algorithms remain competitive.</strong> REINVENT (2017) and Graph GA (2019) outperform all newer methods tested, including those published at top AI conferences. The absence of standardized benchmarking had obscured this fact.</p>
<p><strong>SMILES versus SELFIES.</strong> <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> was designed to guarantee syntactically valid molecular strings, but head-to-head comparisons show that SELFIES-based variants of language model methods (REINVENT, LSTM HC, VAE) generally do not outperform their <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> counterparts. Modern language models learn SMILES grammar well enough that syntactic invalidity is no longer a practical issue. The one exception is genetic algorithms, where SELFIES-based GAs (<a href="/notes/computational-chemistry/benchmark-problems/stoned-selfies-chemical-space-exploration/">STONED</a>) outperform SMILES-based GAs, likely because SELFIES provides more intuitive mutation operations.</p>
<p><strong>Model-based methods need careful design.</strong> Model-based variants (GP BO relative to Graph GA, GFlowNet-AL relative to GFlowNet) do not consistently outperform their model-free counterparts. GP BO beat Graph GA in 12 of 23 tasks but trailed it on summed AUC across all tasks, and GFlowNet-AL underperformed GFlowNet in nearly every task. The bottleneck is the quality of the predictive surrogate model, and naive surrogate integration can actually hurt performance.</p>
<p><strong>Oracle landscape determines method suitability.</strong> Clustering analysis of relative AUC Top-10 scores reveals clear patterns. String-based GAs excel on isomer-type oracles (which are sums of atomic contributions), while RL-based and fragment-based methods perform better on similarity-based MPOs. This suggests there is no single best algorithm, and method selection should be informed by the optimization landscape.</p>
<p><strong>Hyperparameter tuning and multiple runs are essential.</strong> Optimal hyperparameters differed substantially from default values in original papers. For example, REINVENT&rsquo;s performance is highly sensitive to its sigma parameter, and the best value under the constrained-budget setting is much larger than originally suggested. Methods like Graph GA and GP BO also show high variance across runs, underscoring the importance of reporting distributional outcomes rather than single-run results.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations: they cannot exhaustively tune every hyperparameter or include every variant of each method; the conclusion may be biased toward similarity-based oracles (which dominate the 23 tasks); important quantities like synthesizability and diversity are not thoroughly evaluated; and oracle calls from pre-training data in model-based methods are counted against the budget, which may disadvantage methods that could leverage prior data collection. For a follow-up study that adds property filters and diversity requirements to the PMO evaluation, see <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/">Re-evaluating Sample Efficiency</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecule library</td>
          <td>ZINC 250K</td>
          <td>~250,000 molecules</td>
          <td>Used for screening, pre-training generative models, and fragment extraction</td>
      </tr>
      <tr>
          <td>Oracle functions</td>
          <td>TDC / GuacaMol</td>
          <td>23 tasks</td>
          <td>All scores normalized to [0, 1]</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>25 molecular optimization methods spanning 9 algorithm families and 5 molecular assembly strategies. Each method was hyperparameter-tuned on 2 held-out tasks (zaleplon_mpo, perindopril_mpo) using 3 independent runs, then evaluated on all 23 tasks with 5 independent runs each.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-K</td>
          <td>Area under curve of top-K average vs. oracle calls</td>
          <td>Primary metric; K=10; min-max scaled to [0, 1]</td>
      </tr>
      <tr>
          <td>Top-K</td>
          <td>Final top-K average property value at 10K calls</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>Sum rank</td>
          <td>Sum of AUC Top-10 across all 23 tasks</td>
          <td>Used for overall ranking</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper states hardware details are in Appendix C.2. The benchmark runs on standard compute infrastructure and does not require GPUs for most methods. Specific compute requirements vary by method.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">mol_opt</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full benchmark implementation with all 25 methods</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Results_for_practival_molecular_optimization_PMO_benchmark/20123453">Benchmark results</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>All experimental results from the paper</td>
      </tr>
      <tr>
          <td><a href="https://tdcommons.ai">TDC</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Oracle functions and evaluation infrastructure</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{gao2022sample,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Wenhao and Fu, Tianfan and Sun, Jimeng and Coley, Connor W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{35}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{21342--21357}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, W., Fu, T., Sun, J., &amp; Coley, C. W. (2022). Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization. <em>Advances in Neural Information Processing Systems</em>, 35, 21342-21357. <a href="https://arxiv.org/abs/2206.12411">https://arxiv.org/abs/2206.12411</a></p>
<p><strong>Publication</strong>: NeurIPS 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark Code (GitHub)</a></li>
<li><a href="https://figshare.com/articles/dataset/Results_for_practival_molecular_optimization_PMO_benchmark/20123453">Benchmark Results (Figshare)</a></li>
<li><a href="https://tdcommons.ai">Therapeutics Data Commons</a></li>
</ul>
]]></content:encoded></item><item><title>MolScore: Scoring and Benchmarking for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/</guid><description>MolScore provides a unified, open-source Python framework for scoring, evaluating, and benchmarking generative models applied to de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-unified-resource-for-generative-molecular-design">A Unified Resource for Generative Molecular Design</h2>
<p>MolScore is a <strong>Resource</strong> paper that introduces an open-source Python framework for scoring, evaluating, and benchmarking generative models in de novo drug design. The primary contribution is the software itself: a modular, configurable platform that consolidates functionality previously scattered across multiple tools (GuacaMol, MOSES, MolOpt, REINVENT, TDC) into a single package. MolScore provides scoring functions for molecular optimization, evaluation metrics for assessing the quality of generated molecules, and a benchmark mode for standardized comparison of generative models.</p>
<h2 id="the-fragmented-landscape-of-generative-model-evaluation">The Fragmented Landscape of Generative Model Evaluation</h2>
<p>Generative models for molecular design have proliferated rapidly, but evaluating and comparing them remains difficult. Existing benchmarks each address only part of the problem:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong> provides 20 fixed optimization objectives but cannot separate top-performing models on most tasks, and custom objectives require code modification.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a></strong> focuses on distribution-learning metrics but does not support molecular optimization.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">MolOpt</a></strong> extends benchmark evaluation to 25 generative approaches but lacks evaluation of the quality of generated chemistry.</li>
<li><strong>Docking benchmarks</strong> (<a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">smina-docking-benchmark</a>, <a href="/notes/computational-chemistry/benchmark-problems/dockstring-docking-benchmarks-ligand-design/">DOCKSTRING</a>, TDC) test structure-based scoring but often lack proper ligand preparation, leading generative models to exploit non-holistic objectives by generating large or greasy molecules.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></strong> provides configurable scoring functions but is tightly coupled to its own generative model architecture.</li>
</ul>
<p>No single tool offered configurable objectives, comprehensive evaluation metrics, generative-model-agnostic design, and graphical user interfaces together. This fragmentation forces practitioners to write custom glue code and makes reproducible comparison across methods difficult.</p>
<h2 id="modular-architecture-for-scoring-evaluation-and-benchmarking">Modular Architecture for Scoring, Evaluation, and Benchmarking</h2>
<p>MolScore is split into two sub-packages:</p>
<h3 id="molscore-molecule-scoring">molscore: Molecule Scoring</h3>
<p>The <code>molscore</code> sub-package handles iterative scoring of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> generated by any generative model. The workflow for each iteration:</p>
<ol>
<li>Parse and validate SMILES via RDKit, canonicalize, and check intra-batch uniqueness.</li>
<li>Cross-reference against previously generated molecules to reuse cached scores (saving compute for expensive scoring functions like docking).</li>
<li>Run user-specified scoring functions on valid, unique molecules (invalid molecules receive a score of 0).</li>
<li>Transform each score to a 0-1 range using configurable transformation functions (normalize, linear threshold, Gaussian threshold, step threshold).</li>
<li>Aggregate transformed scores into a single desirability score using configurable aggregation (weighted sum, product, geometric mean, arithmetic mean, <a href="https://en.wikipedia.org/wiki/Pareto_front">Pareto front</a>, or auto-weighted variants).</li>
<li>Optionally apply diversity filters to penalize non-diverse molecules, or use any scoring function as a multiplicative filter.</li>
</ol>
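<p>Steps 3&ndash;5 can be sketched as follows, with hypothetical transforms and weights (MolScore itself reads these from the JSON configuration and supports more transformation and aggregation options than shown here):</p>

```python
import math

def norm(x: float, lo: float, hi: float) -> float:
    """Min-max normalize a raw score to [0, 1], clipped at the ends."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def gaussian(x: float, mu: float, sigma: float) -> float:
    """Gaussian threshold: 1 at the target value, decaying away from it."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def desirability(raw, transforms, weights) -> float:
    """Weighted geometric mean of transformed scores (one of several
    aggregation options); invalid molecules are scored 0 upstream."""
    total_w = sum(weights)
    logs = 0.0
    for x, t, w in zip(raw, transforms, weights):
        logs += w * math.log(max(t(x), 1e-9))  # clamp to avoid log(0)
    return math.exp(logs / total_w)

# e.g. combine a docking score (more negative is better, so flip its sign
# before normalizing) with molecular weight targeted near 350 Da:
score = desirability(
    raw=[-9.2, 360.0],
    transforms=[lambda x: norm(-x, 5.0, 12.0),
                lambda x: gaussian(x, 350.0, 50.0)],
    weights=[2.0, 1.0],
)
```

The geometric mean penalizes any single near-zero objective harshly, which is why MolScore also offers additive and Pareto-based aggregations for objectives that should trade off more gracefully.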
<p>The full objective is specified in a single JSON configuration file, with a Streamlit GUI provided for interactive configuration writing. The available scoring functions span:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Examples</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>RDKit descriptors, linker descriptors, penalized logP</td>
      </tr>
      <tr>
          <td>Similarity</td>
          <td>Fingerprint similarity, ROCS, Open3DAlign, substructure matching</td>
      </tr>
      <tr>
          <td>Predictive models</td>
          <td>Scikit-learn models, PIDGINv5 (2,337 ChEMBL31 targets), ChemProp, ADMET-AI</td>
      </tr>
      <tr>
          <td>Docking</td>
          <td>Glide, PLANTS, GOLD, OEDock, Smina, Gnina, Vina, rDock</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>SA score, RA Score, AiZynthFinder, reaction filters</td>
      </tr>
  </tbody>
</table>
<p>Most scoring functions support multiprocessing, and computationally expensive functions (docking, ligand preparation) can be distributed across compute clusters via Dask.</p>
<h3 id="moleval-molecule-evaluation">moleval: Molecule Evaluation</h3>
<p>The <code>moleval</code> sub-package computes performance metrics on generated molecules relative to reference datasets. It extends the MOSES metric suite with additional intrinsic metrics (sphere exclusion diversity, scaffold uniqueness, functional group and ring system diversity, ZINC20 purchasability via molbloom) and extrinsic metrics (analogue similarity/coverage, functional group and ring system similarity, outlier bits or &ldquo;Silliness&rdquo;).</p>
<h3 id="benchmark-mode">Benchmark Mode</h3>
<p>A <code>MolScoreBenchmark</code> class iterates over a list of JSON configuration files, providing standardized comparison. Pre-built presets reimplement GuacaMol and MolOpt benchmarks, and users can define custom benchmark suites without writing code.</p>
<h2 id="case-studies-5-ht2a-ligand-design-and-fine-tuning-evaluation">Case Studies: 5-HT2A Ligand Design and Fine-Tuning Evaluation</h2>
<p>The authors demonstrate MolScore with a SMILES-based RNN generative model using <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb</a> for optimization, designing serotonin <a href="https://en.wikipedia.org/wiki/5-HT2A_receptor">5-HT2A</a> receptor ligands across three objective sets of increasing complexity.</p>
<h3 id="first-objective-set-basic-drug-properties">First Objective Set: Basic Drug Properties</h3>
<p>Four objectives combine predicted 5-HT2A activity (via PIDGINv5 random forest models at 1 uM) with synthesizability (RAscore) and/or <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a> permeability property ranges (<a href="https://en.wikipedia.org/wiki/Polar_surface_area">TPSA</a> &lt; 70, HBD &lt; 2, logP 2-4, MW &lt; 400). All objectives were optimized successfully, with diversity filters preventing mode collapse. The single-property objective (5-HT2A activity alone) proved hardest, primarily because the diversity filter penalized similar molecules more heavily on this relatively easy task.</p>
<h3 id="second-objective-set-selectivity">Second Objective Set: Selectivity</h3>
<p>Six objectives incorporate selectivity proxies using PIDGINv5 models for off-target prediction against <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">Class A GPCR</a> membrane receptors (266 models), the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">D2 dopamine receptor</a>, dopamine receptor family, serotonin receptor subtypes, and combinations. These proved substantially harder: selectivity against dopamine and serotonin receptor families combined was barely improved during optimization. Even with imperfect predictive models, the PIDGINv5 ensemble correctly identified 95 of 126 known selective 5-HT2A ligands. Nearest-neighbor analysis of de novo molecules (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> 0.3-0.6) showed they tended to be structurally simpler versions of known selective ligands.</p>
<h3 id="third-objective-set-structure-based-docking">Third Objective Set: Structure-Based Docking</h3>
<p>Two objectives use molecular docking via GlideSP into 5-HT2A (PDB: 6A93) and D2 (PDB: 6CM4) crystal structures with full ligand preparation (LigPrep for stereoisomer/tautomer/protonation state enumeration). Multi-parameter optimization includes docking score, D155 polar interaction constraint, formal charge, and consecutive rotatable bond limits. Single-target docking scores reached the mean of known ligands within 200 steps, but optimizing for divergent 5-HT2A vs D2 docking scores was much harder due to binding pocket similarity. Protein-ligand interaction fingerprint analysis (ProLIF) revealed that molecules optimized for selectivity avoided specific binding pocket regions shared between the two receptors.</p>
<h3 id="evaluation-case-study-fine-tuning-epochs">Evaluation Case Study: Fine-Tuning Epochs</h3>
<p>The moleval sub-package was used to track metrics across fine-tuning epochs of a SMILES RNN on A2A receptor ligands, showing that just one or two epochs sufficed to increase similarity to the fine-tuning set, while further epochs reduced novelty and diversity.</p>
<h2 id="configurable-benchmarking-with-practical-drug-design-relevance">Configurable Benchmarking with Practical Drug Design Relevance</h2>
<p>MolScore provides a more comprehensive platform than any single existing tool. Compared to prior work:</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>GuacaMol</th>
          <th>MOSES</th>
          <th>MolOpt</th>
          <th>TDC</th>
          <th>REINVENT</th>
          <th>MolScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Configurable objectives</td>
          <td>No</td>
          <td>N/A</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Optimization objectives</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Evaluation metrics</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Model-agnostic</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>GUI</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p>The framework integrates into any Python-based generative model in three lines of code. Dependency conflicts between scoring function libraries are handled by running conflicting components as local servers from isolated conda environments.</p>
<p>Key limitations acknowledged by the authors include: the assumption of conda for environment management, the inherent difficulty of designing non-exploitable objectives, and the fact that ligand-based predictive models may have limited applicability domains for out-of-distribution de novo molecules.</p>
<p>Future directions include accepting 3D molecular conformations as inputs, structure interaction fingerprint rescoring, and dynamic configuration files for curriculum learning.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL compounds</td>
          <td>Not specified</td>
          <td>Standard ChEMBL training set for SMILES RNN</td>
      </tr>
      <tr>
          <td>Evaluation reference</td>
          <td>5-HT2A ligands from ChEMBL31</td>
          <td>3,771 compounds</td>
          <td>Extracted for score distribution comparison</td>
      </tr>
      <tr>
          <td>Activity models</td>
          <td>PIDGINv5 on ChEMBL31</td>
          <td>2,337 target models</td>
          <td>Random forest classifiers at various concentration thresholds</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>A2A receptor ligands</td>
          <td>Not specified</td>
          <td>Used for moleval case study</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The generative model used in case studies is a SMILES-based RNN with Augmented Hill-Climb reinforcement learning. Diversity filters penalize non-diverse molecules during optimization. Score transformation functions (normalize, linear threshold, Gaussian threshold, step threshold) map raw scores to 0-1 range. Aggregation functions (arithmetic mean, weighted sum, product, geometric mean, Pareto front) combine multi-parameter objectives.</p>
<h3 id="models">Models</h3>
<p>PIDGINv5 provides 2,337 pre-trained random forest classifiers on ChEMBL31 targets. RAscore provides pre-trained synthesizability prediction. ADMET-AI and ChemProp models are supported via isolated environments. Docking uses GlideSP with LigPrep for ligand preparation in the structure-based case study.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Intrinsic metrics: validity, uniqueness, scaffold uniqueness, internal diversity, sphere exclusion diversity, Solow-Polasky diversity, scaffold diversity, functional group diversity, ring system diversity, MCF and <a href="https://en.wikipedia.org/wiki/Pan-assay_interference_compounds">PAINS</a> filters, ZINC20 purchasability.</p>
<p>Extrinsic metrics: novelty, <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a>, analogue similarity/coverage, functional group similarity, ring system similarity, SNN similarity, fragment similarity, scaffold similarity, outlier bits, Wasserstein distance on LogP/SA/NP/QED/MW.</p>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Docking-based objectives can be distributed across compute clusters via Dask.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Main framework, installable via pip</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore_examples">MolScore Examples</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration examples with SMILES-RNN, CReM, GraphGA</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2024). MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design. <em>Journal of Cheminformatics</em>, 16(1), 64. <a href="https://doi.org/10.1186/s13321-024-00861-w">https://doi.org/10.1186/s13321-024-00861-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2024molscore,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00861-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGenBench: Benchmarking Molecular Generative Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/molgenbench-molecular-generative-models/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/molgenbench-molecular-generative-models/</guid><description>MolGenBench benchmarks 17 molecular generative models across 120 protein targets using novel metrics for target awareness, hit rates, and lead optimization.</description><content:encoded><![CDATA[<h2 id="a-comprehensive-benchmark-for-structure-based-molecular-generation">A Comprehensive Benchmark for Structure-Based Molecular Generation</h2>
<p>MolGenBench is a <strong>Resource</strong> paper that provides a large-scale, application-oriented benchmark for evaluating molecular generative models in the context of structure-based drug design (SBDD). The primary contribution is a dataset of 220,005 experimentally validated active molecules across 120 protein targets, organized into 5,433 chemical series, along with a suite of novel evaluation metrics. The benchmark addresses both <a href="https://en.wikipedia.org/wiki/De_novo_drug_design">de novo molecular design</a> and hit-to-lead (H2L) optimization, a critical drug discovery stage that existing benchmarks largely ignore.</p>
<h2 id="gaps-in-existing-molecular-generation-benchmarks">Gaps in Existing Molecular Generation Benchmarks</h2>
<p>Despite rapid progress in deep generative models for drug discovery, the evaluation landscape has not kept pace. The authors identify four categories of limitations in existing benchmarks:</p>
<ol>
<li>
<p><strong>Dataset construction</strong>: Existing benchmarks use overly stringent activity cutoffs and too few protein targets. The widely used CrossDocked2020 dataset contains very few reference ligands per target, making it difficult to evaluate whether a model can rediscover the full distribution of active compounds.</p>
</li>
<li>
<p><strong>Model selection</strong>: Prior benchmark studies evaluate a narrow range of architectures and do not systematically examine the effects of training data composition, prior knowledge integration, or architectural paradigm.</p>
</li>
<li>
<p><strong>Evaluation scenarios</strong>: Existing benchmarks focus exclusively on de novo generation. Hit-to-lead optimization, where a hit compound is refined through R-group modifications, remains unstandardized.</p>
</li>
<li>
<p><strong>Evaluation metrics</strong>: Standard metrics (QED, Vina score, SA score) correlate strongly with atom count and fail to assess target-specific generation capacity. The AddCarbon model illustrates this: simply adding random carbon atoms to training molecules achieves near-perfect scores on standard metrics while producing nonsensical chemistry.</p>
</li>
</ol>
<h2 id="novel-metrics-for-evaluating-molecular-generation">Novel Metrics for Evaluating Molecular Generation</h2>
<p>MolGenBench introduces three key metrics designed to capture aspects of model performance that existing metrics miss.</p>
<h3 id="target-aware-score-tascore">Target-Aware Score (TAScore)</h3>
<p>The TAScore measures whether a model generates target-specific molecules rather than generic structures. It compares the ratio of active molecule or scaffold recovery on a specific target to the background recovery across all targets:</p>
<p>$$
\text{TAScore}_{\text{label}, i} = \frac{S_{i} / S_{\text{all}}}{R_{i} / R_{\text{all}}}; \quad \text{label} \in \{\text{SMILES}, \text{scaffold}\}
$$</p>
<p>For target $i$: $R_{\text{all}}$ is the total number of distinct molecules generated across all 120 targets; $R_{i}$ is the subset matching known actives for target $i$ (without conditioning on target $i$); $S_{\text{all}}$ is the total generated when conditioned on target $i$; and $S_{i}$ is the subset matching known actives for target $i$. A TAScore above 1 indicates the model uses target-specific information effectively.</p>
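<p>The definition translates directly into code. The counts below are hypothetical, chosen so that conditioning on the target doubles the recovery rate:</p>

```python
def ta_score(s_i, s_all, r_i, r_all):
    """TAScore for target i: recovery rate when conditioned on the target,
    divided by the background recovery rate across all targets."""
    conditioned = s_i / s_all   # S_i / S_all
    background = r_i / r_all    # R_i / R_all
    return conditioned / background

# Hypothetical counts: 20 of 10,000 conditioned samples match known actives
# for target i, versus 1,200 of 1,200,000 unconditioned samples.
example = ta_score(s_i=20, s_all=10_000, r_i=1_200, r_all=1_200_000)  # 2.0
```

<p>A model that ignores the target entirely produces the same recovery rate conditioned or not, giving a TAScore of exactly 1.</p>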
<h3 id="hit-rate">Hit Rate</h3>
<p>The hit rate quantifies the efficiency of active compound discovery:</p>
<p>$$
\text{HitRate}_{\text{label}} = \frac{\mathcal{M}_{\text{active}}}{\mathcal{M}_{\text{sampled}}}; \quad \text{label} \in \{\text{SMILES}, \text{scaffold}\}
$$</p>
<p>where $\mathcal{M}_{\text{active}}$ is the number of unique active molecules or scaffolds found, and $\mathcal{M}_{\text{sampled}}$ is the total number of generated molecules.</p>
<h3 id="mean-normalized-affinity-mna-score">Mean Normalized Affinity (MNA) Score</h3>
<p>For H2L optimization, the MNA Score measures whether models generate compounds with improved potency relative to the known activity range within each chemical series:</p>
<p>$$
\text{NA}_{g} = \frac{\text{Affinity}_{g}^{\text{series}} - \text{Affinity}_{\min}^{\text{series}}}{\text{Affinity}_{\max}^{\text{series}} - \text{Affinity}_{\min}^{\text{series}}}
$$</p>
<p>$$
\text{MNAScore} = \frac{1}{G} \sum_{g=1}^{G} \text{NA}_{g}
$$</p>
<p>This normalizes affinities to [0, 1] within each series, enabling cross-series comparison.</p>
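<p>The two-step computation is a min-max normalization followed by a mean. The affinity values and units in this sketch are invented for illustration:</p>

```python
def normalized_affinity(affinity, series_min, series_max):
    """Min-max normalize a generated molecule's affinity within its chemical series."""
    return (affinity - series_min) / (series_max - series_min)

def mna_score(per_molecule_na):
    """Mean of normalized affinities over all G generated molecules."""
    return sum(per_molecule_na) / len(per_molecule_na)

# Toy series whose known actives span affinities 4.0 to 9.0.
na = [normalized_affinity(a, 4.0, 9.0) for a in (6.5, 8.0, 5.0)]
score = mna_score(na)  # mean of 0.5, 0.8, 0.2
```
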
<h2 id="systematic-evaluation-of-17-generative-models-across-two-drug-discovery-scenarios">Systematic Evaluation of 17 Generative Models Across Two Drug Discovery Scenarios</h2>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>The MolGenBench dataset was built from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL v33</a>. Ligands failing RDKit validation were discarded, along with entries where binding affinity exceeded 10 uM. The 120 protein targets were selected based on minimum thresholds: at least 50 active molecules, at least 50 unique Bemis-Murcko scaffolds, and at least 20 distinct chemical series per target. For H2L optimization, maximum common substructures (MCS) were identified per series, with dual thresholds requiring the MCS to appear in over 80% of molecules and cover more than one-third of each molecule&rsquo;s atoms. The top 5 series per target (ranked by dockable ligands) formed the H2L test set: 600 compound series across 120 targets.</p>
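<p>The dual-threshold series check can be sketched as below. This is one reading of the thresholds, with the MCS matching abstracted to per-molecule atom counts (a real implementation would compute the MCS with a tool like RDKit); how the paper treats molecules lacking the MCS is an assumption here.</p>

```python
def passes_series_thresholds(mcs_atom_counts, mol_atom_counts,
                             coverage_frac=0.80, atom_frac=1 / 3):
    """Dual-threshold check for an MCS-defined chemical series.

    mcs_atom_counts[i] is the number of atoms of molecule i matched by the
    series MCS (0 if the MCS is absent); mol_atom_counts[i] is molecule i's
    total atom count.
    """
    n = len(mol_atom_counts)
    # Threshold 1: the MCS must appear in over 80% of the series' molecules.
    hits = sum(1 for c in mcs_atom_counts if c > 0)
    if hits / n <= coverage_frac:
        return False
    # Threshold 2: the MCS must cover more than one-third of each matched
    # molecule's atoms.
    return all(c / total > atom_frac
               for c, total in zip(mcs_atom_counts, mol_atom_counts) if c > 0)
```
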
<h3 id="evaluated-models">Evaluated Models</h3>
<p><strong>De novo models (10)</strong>: Pocket2Mol, TargetDiff, FLAG, DecompDiff, SurfGen, PocketFlow, MolCraft, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/">TamGen</a>, DiffSBDD-M (trained on BindingMOAD), DiffSBDD-C (trained on CrossDock). These span autoregressive, diffusion, and Bayesian flow network architectures.</p>
<p><strong>H2L models (7)</strong>: Fragment-based (DiffSBDD-M/C inpainting, Delete, DiffDec) and ligand-based (ShEPhERD, ShapeMol, PGMG). These use pharmacophore, surface, or shape priors.</p>
<p>Models were further stratified by whether test proteins appeared in their CrossDock training set (&ldquo;Proteins in CrossDock&rdquo; vs. &ldquo;Proteins Not in CrossDock&rdquo;), enabling direct measurement of generalization.</p>
<h3 id="evaluation-dimensions">Evaluation Dimensions</h3>
<p>The benchmark evaluates six dimensions:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Key Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Basic molecular properties</td>
          <td>Validity, QED, SA score, uniqueness, diversity, JSD alignment</td>
      </tr>
      <tr>
          <td>Chemical safety</td>
          <td>Industry-standard filter pass rates (Eli Lilly, Novartis, ChEMBL rules)</td>
      </tr>
      <tr>
          <td>Conformational quality</td>
          <td>PoseBusters pass rate, strain energy, steric clash frequency</td>
      </tr>
      <tr>
          <td>Active compound recovery</td>
          <td>Hit rate, hit fraction, active molecule and scaffold recovery counts</td>
      </tr>
      <tr>
          <td>Target awareness</td>
          <td>TAScore at molecule and scaffold levels</td>
      </tr>
      <tr>
          <td>Lead optimization</td>
          <td>MNA Score, number of series with hits</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-basic-properties-and-chemical-safety">Key Results: Basic Properties and Chemical Safety</h3>
<p>Most models generate drug-like molecules with reasonable QED (0.4-0.6) and SA scores (0.5-0.8). However, two models (FLAG, SurfGen) showed validity below 0.4. TamGen exhibited low uniqueness (~27%), suggesting overreliance on pretrained patterns.</p>
<p>Chemical filter pass rates revealed a more concerning picture: only TamGen and PGMG exceeded 50% of molecules passing all industry-standard filters. Most models fell below 40%, and some (FLAG, SurfGen) below 5%. Nearly 70% of reference active molecules passed the same filters, indicating models frequently generate high-risk compounds.</p>
<h3 id="key-results-conformational-quality">Key Results: Conformational Quality</h3>
<p>MolCraft achieved the highest PoseBusters validity (0.783 PB-valid score among valid molecules). PocketFlow, despite perfect SMILES validity, had fewer than half of its valid molecules pass conformational checks. Most models produced conformations with higher <a href="https://en.wikipedia.org/wiki/Strain_(chemistry)">strain energy</a> than those from <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>. Some models (MolCraft for de novo, DiffDec for H2L) surpassed Vina in minimizing steric clashes, suggesting advanced architectures can exceed the patterns in their training data.</p>
<h3 id="key-results-active-compound-recovery-and-hit-rates">Key Results: Active Compound Recovery and Hit Rates</h3>
<p>De novo models exhibited very low hit rates. The highest molecular hit rate among de novo models was 0.124% on proteins in CrossDock, dropping to 0.024% on unseen proteins. Scaffold-level hit rates were 10-fold higher, showing that generating pharmacologically plausible scaffolds is considerably easier than generating fully active molecules.</p>
<p>After removing molecules overlapping with the CrossDock training set, TamGen&rsquo;s recovery dropped substantially (from 30.3 to 18.7 targets), confirming significant memorization effects. On proteins not in CrossDock, half of the de novo models failed to recover any active molecules at all.</p>
<p>Fragment-based H2L models substantially outperformed both de novo models and ligand-based H2L approaches. Delete recovered active molecules in 44.3 series (out of 600), and DiffDec in 34.7 series.</p>
<h3 id="key-results-target-awareness">Key Results: Target Awareness</h3>
<p>Most de novo models failed the TAScore evaluation. PocketFlow showed the strongest target awareness at the scaffold level, with only 27% of targets showing TAScore &lt; 1 (indicating no target specificity). At the molecular level, results were even weaker: TamGen achieved TAScore &gt; 1 for only 30.6% of CrossDock-seen targets and just 4 out of 35 unseen targets. Most models generated structurally similar molecules regardless of which target they were conditioned on.</p>
<h3 id="key-results-h2l-optimization-mna-score">Key Results: H2L Optimization (MNA Score)</h3>
<p>DiffDec achieved the highest total active hits (121.7) and the best MNA Score (0.523), followed by Delete (104.7 hits, MNA Score 0.482). Ligand-based models (ShEPhERD, PGMG) recovered fewer hits but showed higher MNA Scores per hit, suggesting pharmacophore-based priors help prioritize more potent molecules when actives are found. The most successful model (Delete) achieved a hit in only 9.6% of series (57/600), indicating substantial room for improvement.</p>
<h2 id="critical-findings-and-limitations-of-current-molecular-generative-models">Critical Findings and Limitations of Current Molecular Generative Models</h2>
<p>The benchmark reveals several consistent limitations:</p>
<ol>
<li>
<p><strong>Low screening efficiency</strong>: De novo models achieve molecular hit rates below 0.13%, far from practical utility. Scaffold recovery is more feasible but still limited.</p>
</li>
<li>
<p><strong>Weak target awareness</strong>: Most SBDD models fail to use protein structural information effectively, generating similar molecules across different targets. This raises concerns about off-target effects.</p>
</li>
<li>
<p><strong>Conformational prediction remains difficult</strong>: Most models produce conformations with higher strain energy than classical docking, and only a small fraction (typically below 23%) of generated poses match redocked conformations within 2 Angstrom RMSD.</p>
</li>
<li>
<p><strong>Generalization gap</strong>: Performance consistently drops on proteins not in the training set, and prior benchmarks that do not stratify by training data exposure overestimate real-world utility.</p>
</li>
<li>
<p><strong>Inference-time scaling does not solve the problem</strong>: Sampling up to 100,000 molecules increased the absolute number of active discoveries but with diminishing efficiency. Without better scoring functions, scaling sampling offers limited practical value.</p>
</li>
<li>
<p><strong>Chemical safety</strong>: Most models produce a majority of molecules that fail industry-standard reactivity and promiscuity filters.</p>
</li>
</ol>
<p>The authors acknowledge that the benchmark&rsquo;s 220,005 active molecules represent a biased subset of bioactive chemical space. A model&rsquo;s failure to rediscover known actives for a given target may therefore reflect sampling limitations rather than evidence that its generated molecules are inactive.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Active compounds</td>
          <td>ChEMBL v33</td>
          <td>220,005 molecules, 120 targets</td>
          <td>Filtered at 10 uM affinity threshold</td>
      </tr>
      <tr>
          <td>H2L series</td>
          <td>ChEMBL v33 + PDB</td>
          <td>5,433 series (600 used for H2L test)</td>
          <td>MCS-based series construction</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td><a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">PDB</a></td>
          <td>120 targets</td>
          <td>One PDB entry per target</td>
      </tr>
      <tr>
          <td>Training (most models)</td>
          <td>CrossDocked2020</td>
          <td>Varies</td>
          <td>Standard SBDD training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>De novo models sampled 1,000 molecules per target; H2L models sampled 200 per series</li>
<li>All experiments repeated three times with different random seeds</li>
<li>Docking performed with AutoDock Vina using standard parameters</li>
<li>Chemical filters applied via the medchem library</li>
<li>Conformational quality assessed with PoseBusters and PoseCheck</li>
<li>Interaction scores computed via ProLIF with frequency-weighted normalization</li>
</ul>
<h3 id="models">Models</h3>
<p>All 17 models were obtained from their official GitHub repositories and run with default configurations. The benchmark does not introduce new model architectures.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Summary of key metrics across the best-performing models in each category:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best De Novo</th>
          <th>Value</th>
          <th>Best H2L</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PB-valid score</td>
          <td>MolCraft</td>
          <td>0.783</td>
          <td>DiffSBDD-M</td>
          <td>0.597</td>
      </tr>
      <tr>
          <td>Molecular hit rate (in CrossDock)</td>
          <td>TamGen</td>
          <td>0.124%</td>
          <td>DiffDec</td>
          <td>Higher than de novo</td>
      </tr>
      <tr>
          <td>Scaffold hit rate (in CrossDock)</td>
          <td>PocketFlow</td>
          <td>&gt;10%</td>
          <td>Delete</td>
          <td>Lower than PocketFlow</td>
      </tr>
      <tr>
          <td>TAScore scaffold (% targets &gt;1)</td>
          <td>PocketFlow</td>
          <td>73%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>MNA Score</td>
          <td>N/A</td>
          <td>N/A</td>
          <td>DiffDec</td>
          <td>0.523</td>
      </tr>
      <tr>
          <td>Filter pass rate</td>
          <td>TamGen</td>
          <td>&gt;50%</td>
          <td>PGMG</td>
          <td>&gt;50%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Specific hardware requirements are not detailed in the paper. Models were run using their default configurations from official repositories.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CAODH/MolGenBench">MolGenBench</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmark evaluation framework</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/17572553">Zenodo dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-ND 4.0</td>
          <td>Processed data and source data for all results</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cao, D., Fan, Z., Yu, J., Chen, M., Jiang, X., Sheng, X., Wang, X., Zeng, C., Luo, X., Teng, D., &amp; Zheng, M. (2025). Benchmarking Real-World Applicability of Molecular Generative Models from De novo Design to Lead Optimization with MolGenBench. <em>bioRxiv</em>. <a href="https://doi.org/10.1101/2025.11.03.686215">https://doi.org/10.1101/2025.11.03.686215</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cao2025molgenbench,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Benchmarking Real-World Applicability of Molecular Generative Models from De novo Design to Lead Optimization with MolGenBench}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cao, Duanhua and Fan, Zhehuan and Yu, Jie and Chen, Mingan and Jiang, Xinyu and Sheng, Xia and Wang, Xingyou and Zeng, Chuanlong and Luo, Xiaomin and Teng, Dan and Zheng, Mingyue}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2025.11.03.686215}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoleculeNet: Benchmarking Molecular Machine Learning</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/</guid><description>MoleculeNet curates 17 datasets across quantum mechanics, physical chemistry, biophysics, and physiology with standardized splits and metrics for molecular ML.</description><content:encoded><![CDATA[<h2 id="a-resource-paper-for-molecular-machine-learning-benchmarking">A Resource Paper for Molecular Machine Learning Benchmarking</h2>
<p>This is a <strong>Resource</strong> paper. MoleculeNet provides a standardized benchmark suite for evaluating molecular machine learning methods. Its primary contribution is the curation of 17 public datasets spanning four categories of molecular properties, together with standardized evaluation metrics, multiple dataset splitting strategies, and open-source implementations of featurization and learning algorithms via the DeepChem library.</p>
<h2 id="why-molecular-ml-needed-a-unified-benchmark">Why Molecular ML Needed a Unified Benchmark</h2>
<p>Prior to MoleculeNet, algorithmic progress in molecular machine learning was difficult to measure. Individual papers benchmarked proposed methods on different datasets with different metrics, making cross-method comparison unreliable. Several factors make molecular ML particularly challenging:</p>
<ol>
<li><strong>Data scarcity</strong>: Molecular datasets are much smaller than those available for computer vision or NLP, since obtaining accurate chemical property measurements requires specialized instruments and expert supervision.</li>
<li><strong>Heterogeneous outputs</strong>: Properties of interest range from quantum mechanical characteristics to macroscopic physiological effects on the human body.</li>
<li><strong>Variable input structures</strong>: Molecules have arbitrary size, variable connectivity, and many possible 3D conformers, all of which must be encoded into fixed-length representations for conventional ML algorithms.</li>
<li><strong>No standard evaluation protocol</strong>: Without prescribed metrics, splits, or data subsets, two methods using the same underlying database (e.g., PubChem) could be entirely incomparable.</li>
</ol>
<p>Existing databases like PubChem, ChEMBL, and the Quantum Machine collections provided raw data but did not define evaluation protocols suitable for machine learning development. MoleculeNet bridges this gap, following the precedent set by ImageNet in computer vision and WordNet in NLP.</p>
<h2 id="core-design-datasets-splits-metrics-and-featurizations">Core Design: Datasets, Splits, Metrics, and Featurizations</h2>
<p>MoleculeNet is organized around four components: curated datasets, splitting methods, evaluation metrics, and molecular featurizations.</p>
<h3 id="datasets-across-four-property-categories">Datasets Across Four Property Categories</h3>
<p>The benchmark includes 17 datasets covering over 700,000 compounds and more than 800 tasks. These are organized into four categories reflecting different levels of molecular properties:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Compounds</th>
          <th>Task Type</th>
          <th>Rec. Split</th>
          <th>Rec. Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Quantum Mechanics</td>
          <td>QM7</td>
          <td>1</td>
          <td>7,165</td>
          <td>Regression</td>
          <td>Stratified</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM7b</td>
          <td>14</td>
          <td>7,211</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM8</td>
          <td>12</td>
          <td>21,786</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM9</td>
          <td>12</td>
          <td>133,885</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>1,128</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>643</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>Lipophilicity</td>
          <td>1</td>
          <td>4,200</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA</td>
          <td>128</td>
          <td>439,863</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>MUV</td>
          <td>17</td>
          <td>93,127</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>HIV</td>
          <td>1</td>
          <td>41,913</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>PDBbind</td>
          <td>1</td>
          <td>11,908</td>
          <td>Regression</td>
          <td>Time</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>BACE</td>
          <td>1</td>
          <td>1,522</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>2,053</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>Tox21</td>
          <td>12</td>
          <td>8,014</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ToxCast</td>
          <td>617</td>
          <td>8,615</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>SIDER</td>
          <td>27</td>
          <td>1,427</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ClinTox</td>
          <td>2</td>
          <td>1,491</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p><strong>Quantum mechanics</strong> datasets (QM7, QM7b, QM8, QM9) contain DFT-computed electronic properties for subsets of the <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB</a> database. <strong>Physical chemistry</strong> datasets cover solubility (ESOL), hydration free energy (FreeSolv), and lipophilicity. <strong>Biophysics</strong> datasets include high-throughput screening results (PCBA, MUV), HIV inhibition activity, protein-ligand binding affinity (PDBbind), and BACE-1 inhibition. <strong>Physiology</strong> datasets cover blood-brain barrier penetration (BBBP), toxicity (Tox21, ToxCast), side effects (SIDER), and clinical trial toxicity (ClinTox).</p>
<h3 id="data-splitting-strategies">Data Splitting Strategies</h3>
<p>MoleculeNet implements four splitting methods, all using an 80/10/10 train/validation/test ratio:</p>
<ul>
<li><strong>Random splitting</strong>: Standard random assignment to subsets.</li>
<li><strong>Scaffold splitting</strong>: Separates molecules by their 2D structural frameworks (Bemis-Murcko scaffolds), providing a harder generalization test since structurally different molecules appear in different subsets.</li>
<li><strong>Stratified splitting</strong>: Ensures each subset contains the full range of label values (used for QM7).</li>
<li><strong>Time splitting</strong>: Trains on older data and tests on newer data to mimic real-world development (used for PDBbind).</li>
</ul>
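<p>The key property of scaffold splitting is that an entire scaffold class is assigned to exactly one subset, so test molecules are structurally unlike anything seen in training. A minimal sketch of that group-assignment logic is below; in practice DeepChem derives the scaffold key from Bemis-Murcko frameworks via RDKit, whereas here <code>scaffold_of</code> is a deliberately toy placeholder (the first SMILES character) so the example stays dependency-free.</p>

```python
from collections import defaultdict

def scaffold_split(mols, scaffold_of, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test so no scaffold spans splits."""
    groups = defaultdict(list)
    for mol in mols:
        groups[scaffold_of(mol)].append(mol)
    # Larger scaffold classes are placed first, mirroring DeepChem's ordering.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mols)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Toy example: the "scaffold" is just the first character of the SMILES string.
mols = ["C1CC1O", "C1CC1N", "c1ccccc1", "c1ccncc1",
        "CCO", "CCN", "CC=O", "NCC", "OCC"]
train, valid, test = scaffold_split(mols, scaffold_of=lambda s: s[0])
```

<p>With a real Bemis-Murcko key the same loop produces the harder generalization test described above: held-out molecules share no framework with the training set.</p>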
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>Regression tasks use MAE or RMSE depending on the dataset. Classification tasks use either ROC-AUC or PRC-AUC. The choice between ROC-AUC and PRC-AUC depends on class imbalance: PRC-AUC is recommended for datasets with positive rates below 2% (PCBA, MUV), since precision-recall curves better capture performance under extreme imbalance.</p>
<p>The false positive rate and precision are defined as:</p>
<p>$$
\text{FPR} = \frac{\text{false positive}}{\text{false positive} + \text{true negative}}
$$</p>
<p>$$
\text{precision} = \frac{\text{true positive}}{\text{false positive} + \text{true positive}}
$$</p>
<p>When positive samples form a small fraction of the data, false positives influence precision much more than FPR, making PRC-AUC more informative than ROC-AUC.</p>
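<p>The effect is easy to verify numerically. The sketch below uses illustrative counts (a MUV-like 0.2% positive rate, invented recall and FPR values, not figures from the paper) to show how a classifier with a seemingly excellent FPR can still have low precision:</p>

```python
def fpr(fp, tn):
    """False positive rate: FP / (FP + TN)."""
    return fp / (fp + tn)

def precision(tp, fp):
    """Precision: TP / (TP + FP)."""
    return tp / (tp + fp)

# MUV-like imbalance: 200 actives among 100,000 compounds (0.2% positive rate).
pos, neg = 200, 99_800
# Suppose a classifier recalls 80% of actives at a 1% false positive rate:
tp = int(0.8 * pos)    # 160 true positives
fp = int(0.01 * neg)   # 998 false positives
tn = neg - fp

print(f"FPR = {fpr(fp, tn):.3f}")              # 0.010 -- looks excellent
print(f"precision = {precision(tp, fp):.3f}")  # 0.138 -- most hits are false
```

<p>ROC-AUC, built from FPR, barely registers the 998 false positives against 98,802 negatives, while precision collapses, which is why PRC-AUC is the recommended metric for PCBA and MUV.</p>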
<h3 id="featurization-methods">Featurization Methods</h3>
<p>MoleculeNet implements six molecular featurization approaches:</p>
<ol>
<li><strong>ECFP (Extended-Connectivity Fingerprints)</strong>: Fixed-length binary fingerprints capturing topological substructures via hashing.</li>
<li><strong><a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb Matrix</a></strong>: Encodes nuclear charges and 3D coordinates through atomic self-energies and Coulomb repulsion:</li>
</ol>
<p>$$
M_{IJ} = \begin{cases} 0.5 Z_{I}^{2.4} &amp; \text{for } I = J \\ \frac{Z_{I} Z_{J}}{|\mathbf{R}_{I} - \mathbf{R}_{J}|} &amp; \text{for } I \neq J \end{cases}
$$</p>
<ol start="3">
<li><strong>Grid Featurizer</strong>: Designed for PDBbind, incorporating both ligand and protein structural information including salt bridges, hydrogen bonds, and SPLIF fingerprints.</li>
<li><strong>Symmetry Functions</strong>: Preserve rotational and permutation symmetry through radial and angular functions between atom pairs and triplets.</li>
<li><strong>Graph Convolutions</strong>: Compute initial atom feature vectors and neighbor lists from molecular graphs.</li>
<li><strong>Weave</strong>: Similar to graph convolutions but also computes pairwise atom features encoding bond properties, graph distance, and ring information.</li>
</ol>
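<p>The Coulomb matrix definition above translates directly into code. The sketch below applies it to an H<sub>2</sub> molecule with an illustrative geometry (a ~1.4 Bohr bond length); charges and coordinates are assumed to be in atomic units, as in the standard formulation:</p>

```python
import math

def coulomb_matrix(charges, coords):
    """Coulomb matrix M_IJ: atomic self-energies on the diagonal,
    Coulomb repulsion off-diagonal (atomic units assumed)."""
    n = len(charges)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * charges[i] ** 2.4
            else:
                r = math.dist(coords[i], coords[j])
                M[i][j] = charges[i] * charges[j] / r
    return M

# H2 with an ~1.4 Bohr bond length (illustrative geometry):
M = coulomb_matrix([1.0, 1.0], [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)])
# Diagonal entries: 0.5 * 1**2.4 = 0.5; off-diagonal: 1 * 1 / 1.4
```

<p>Because the matrix depends only on charges and pairwise distances, it is invariant to rotation and translation, though not to atom reordering, which is why sorted or randomized variants are used in practice.</p>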
<h2 id="benchmarked-models-and-experimental-setup">Benchmarked Models and Experimental Setup</h2>
<p>MoleculeNet benchmarks 12 learning algorithms divided into conventional methods and graph-based methods.</p>
<h3 id="conventional-methods">Conventional Methods</h3>
<ul>
<li><strong>Logistic Regression</strong> (classification only)</li>
<li><strong>Kernel SVM</strong> with radial basis function kernel</li>
<li><strong>Kernel Ridge Regression (KRR)</strong></li>
<li><strong>Random Forests</strong></li>
<li><strong>Gradient Boosting</strong> (XGBoost)</li>
<li><strong>Singletask/Multitask Networks</strong>: Fully connected networks with shared layers across tasks</li>
<li><strong>Bypass Networks</strong>: Multitask networks augmented with per-task &ldquo;bypass&rdquo; layers that directly connect inputs to outputs</li>
<li><strong>Influence Relevance Voting (IRV)</strong>: Refined K-nearest neighbor classifiers using Jaccard-Tanimoto similarity:</li>
</ul>
<p>$$
S(\vec{A}, \vec{B}) = \frac{|A \cap B|}{|A \cup B|}
$$</p>
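<p>On binary fingerprints this similarity reduces to set operations over the on-bit positions. A minimal sketch (with made-up bit positions for illustration):</p>

```python
def tanimoto(a, b):
    """Jaccard-Tanimoto similarity between two sets of fingerprint on-bits."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)

# Two hypothetical fingerprints sharing 2 of 4 distinct on-bits:
s = tanimoto({1, 5, 9}, {1, 9, 12})
# |{1, 9}| = 2, |{1, 5, 9, 12}| = 4, so s = 0.5
```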
<h3 id="graph-based-methods">Graph-Based Methods</h3>
<ul>
<li><strong>Graph Convolutional Models (GC)</strong>: Extend circular fingerprints with learnable convolutions over molecular graphs.</li>
<li><strong>Weave Models</strong>: Update atom features using information from all other atoms and their pairwise features.</li>
<li><strong>Directed Acyclic Graph (DAG) Models</strong>: Define directed bonds toward a central atom and propagate features through the directed graph.</li>
<li><strong>Deep Tensor Neural Networks (DTNN)</strong>: Use nuclear charges and distance matrices directly, updating atom embeddings based on pairwise physical distances.</li>
<li><strong>ANI-1</strong>: Learns transferable potentials using symmetry function features with atom-type-specific neural networks.</li>
<li><strong>Message Passing Neural Networks (MPNN)</strong>: Generalized framework with edge-dependent message functions and set2set readout.</li>
</ul>
<h3 id="experimental-protocol">Experimental Protocol</h3>
<p>Gaussian process hyperparameter optimization was applied to each dataset-model combination, followed by three independent runs with different random seeds. All results are reported as means with standard deviations. Variable training-size experiments were conducted on Tox21, FreeSolv, and QM7 to study data efficiency.</p>
<h2 id="key-findings-across-property-categories">Key Findings Across Property Categories</h2>
<h3 id="biophysics-and-physiology">Biophysics and Physiology</h3>
<p>Graph convolutional and weave models showed strong performance on larger datasets with less overfitting than conventional methods. In the variable-training-size experiments on Tox21, graph-based models trained on 30% of the data outperformed multitask networks trained on 90%. However, for smaller single-task datasets (under 3,000 samples), kernel SVM and ensemble tree methods were more robust. On highly imbalanced datasets like MUV (0.20% positive rate), graph-based models struggled to control false positives.</p>
<p>Multitask training had a regularizing effect, reducing the gap between train and test scores compared to single-task models. Bypass networks consistently matched or exceeded vanilla multitask networks, confirming that per-task layers add explanatory power.</p>
<h3 id="physical-chemistry">Physical Chemistry</h3>
<p>Graph-based methods (GC, DAG, MPNN, Weave) provided significant improvements over single-task networks for predicting solubility, solvation energy, and lipophilicity. The best models achieved accuracy comparable to ab initio predictions (within 0.5 RMSE for ESOL, within 1.5 kcal/mol for FreeSolv). On FreeSolv, a weave model trained on approximately 200 samples matched the accuracy of alchemical free energy calculations.</p>
<h3 id="quantum-mechanics">Quantum Mechanics</h3>
<p>Models incorporating 3D distance information (DTNN, MPNN, KRR with Coulomb matrix) substantially outperformed models using only topological features. DTNN or MPNN was the best-performing model on 28 of 39 tasks across the QM datasets. The choice of physics-aware featurization proved more important than the choice of learning algorithm for these tasks.</p>
<h3 id="summary-of-best-performances">Summary of Best Performances</h3>
<p>Graph-based models outperformed conventional methods on 11 of 17 datasets. Key results on the test set:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>Best Conventional</th>
          <th>Best Graph-Based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM7</td>
          <td>MAE</td>
          <td>KRR (CM): 10.22</td>
          <td>DTNN: 8.75</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>MAE</td>
          <td>Multitask (CM): 4.35</td>
          <td>DTNN: 2.35</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>XGBoost: 0.99</td>
          <td>MPNN: 0.58</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>XGBoost: 1.74</td>
          <td>MPNN: 1.15</td>
      </tr>
      <tr>
          <td>PCBA</td>
          <td>PRC-AUC</td>
          <td>Logreg: 0.129</td>
          <td>GC: 0.136</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.822</td>
          <td>GC: 0.829</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.792</td>
          <td>GC: 0.763</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>ROC-AUC</td>
          <td>RF: 0.867</td>
          <td>Weave: 0.806</td>
      </tr>
  </tbody>
</table>
<p>Conventional methods (KernelSVM, RF) still won on several smaller or scaffold-split datasets (HIV, BACE, MUV, PDBbind, BBBP, SIDER), highlighting that graph-based models are not universally superior, particularly under data scarcity or challenging splits.</p>
<h2 id="conclusions-and-limitations">Conclusions and Limitations</h2>
<p>MoleculeNet demonstrated that learnable representations broadly offer the best performance for molecular machine learning. However, the authors identify several important caveats:</p>
<ol>
<li><strong>Data scarcity</strong>: Graph-based methods are not robust enough on complex tasks with limited training data.</li>
<li><strong>Class imbalance</strong>: On heavily imbalanced classification datasets, conventional methods such as kernel SVM outperform learnable featurizations with respect to recall of positives.</li>
<li><strong>Task-specific featurizations</strong>: For quantum mechanical and biophysical datasets, incorporating physics-aware features (<a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrix</a>, 3D coordinates) is more important than the choice of learning algorithm.</li>
<li><strong>Data-driven physical chemistry</strong>: On FreeSolv, data-driven methods outperformed ab initio calculations with moderate data, suggesting data-driven approaches will become increasingly important as methods and datasets mature.</li>
</ol>
<p>The authors express hope that MoleculeNet will stimulate algorithmic development similar to how ImageNet catalyzed breakthroughs in computer vision. Future directions include extending coverage to 3D protein structure prediction, DNA topological modeling, and other areas of molecular science.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All 17 datasets are publicly available and integrated into the DeepChem Python package. Users can load any dataset with a single library call.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM benchmark</td>
          <td>QM7/QM7b/QM8/QM9</td>
          <td>7K-134K compounds</td>
          <td>DFT-computed properties from GDB subsets</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL/FreeSolv/Lipophilicity</td>
          <td>643-4,200 compounds</td>
          <td>Experimental measurements</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA/MUV/HIV/PDBbind/BACE</td>
          <td>1.5K-440K compounds</td>
          <td>Bioassay and binding data</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP/Tox21/ToxCast/SIDER/ClinTox</td>
          <td>1.4K-8.6K compounds</td>
          <td>Toxicity and drug safety data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>All splitting methods (random, scaffold, stratified, time) and featurizations (ECFP, Coulomb matrix, grid, symmetry functions, graph convolutions, weave) are implemented in DeepChem. Hyperparameters were tuned via Gaussian process optimization. Three random seeds were used per experiment.</p>
<h3 id="models">Models</h3>
<p>All 12 models are implemented in DeepChem, built on Scikit-Learn and TensorFlow. No pretrained weights are provided; models are trained from scratch on each dataset.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics include MAE, RMSE, ROC-AUC, and PRC-AUC as specified per dataset. Multi-task datasets report mean metric values across all tasks.</p>
<h3 id="hardware">Hardware</h3>
<p>The authors used Stanford&rsquo;s Sherlock and Xstream GPU nodes. Specific GPU types and training times per model are provided in Table S1 of the supplementary material.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source library with all datasets, featurizations, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., &amp; Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. <em>Chemical Science</em>, 9(2), 513-530. <a href="https://doi.org/10.1039/c7sc02664a">https://doi.org/10.1039/c7sc02664a</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2018moleculenet,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MoleculeNet: a benchmark for molecular machine learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Zhenqin and Ramsundar, Bharath and Feinberg, Evan N. and Gomes, Joseph and Geniesse, Caleb and Pappu, Aneesh S. and Leswing, Karl and Pande, Vijay}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{513--530}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/c7sc02664a}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GuacaMol: Benchmarking Models for De Novo Molecular Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/</guid><description>GuacaMol introduces a standardized benchmark suite for evaluating de novo molecular design models across distribution learning and goal-directed optimization.</description><content:encoded><![CDATA[<h2 id="a-standardized-benchmark-for-molecular-design">A Standardized Benchmark for Molecular Design</h2>
<p>GuacaMol is a <strong>Resource</strong> paper. Its primary contribution is a standardized, open-source benchmarking framework for evaluating models for de novo molecular design. The framework defines 5 distribution-learning benchmarks and 20 goal-directed optimization benchmarks, implemented as a Python package. The authors also provide baseline results for several classical and neural generative models, establishing reference performance levels for future comparisons.</p>
<h2 id="the-need-for-consistent-evaluation-in-generative-chemistry">The Need for Consistent Evaluation in Generative Chemistry</h2>
<p>By 2018, deep generative models for molecular design (<a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAEs</a>, RNNs, <a href="/posts/what-is-a-gan/">GANs</a>) had shown promising results, but the field lacked consistent evaluation standards. Different papers used different tasks, different datasets, and different metrics, making it difficult to compare models or assess real progress. Comparative studies between neural approaches and well-established techniques such as genetic algorithms were rare.</p>
<p>In other areas of machine learning, standardized benchmarks (ImageNet for vision, GLUE for NLP) had driven rapid progress by enabling fair comparisons. The de novo design community lacked an equivalent. Additionally, many existing evaluations focused on easily optimizable properties (logP, QED) that could not differentiate between models, since even simple baselines achieved near-perfect scores on those tasks.</p>
<h2 id="benchmark-design-distribution-learning-and-goal-directed-optimization">Benchmark Design: Distribution Learning and Goal-Directed Optimization</h2>
<p>GuacaMol separates evaluation into two independent dimensions, reflecting the two main use cases of generative models.</p>
<h3 id="distribution-learning-benchmarks">Distribution-Learning Benchmarks</h3>
<p>These five benchmarks assess how well a model learns to generate molecules similar to a training set (a standardized subset of ChEMBL 24):</p>
<ol>
<li><strong>Validity</strong>: Fraction of generated molecules that are chemically valid (parseable by RDKit), measured over 10,000 generated samples.</li>
<li><strong>Uniqueness</strong>: Fraction of unique canonical <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> among 10,000 valid generated molecules.</li>
<li><strong>Novelty</strong>: Fraction of generated molecules not present in the training set, measured over 10,000 unique samples.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Measures distributional similarity between generated and reference molecules using hidden representations from ChemNet (trained on biological activity prediction). The FCD score is transformed as:</li>
</ol>
<p>$$S = \exp(-0.2 \cdot \text{FCD})$$</p>
<ol start="5">
<li><strong>KL Divergence</strong>: Compares distributions of nine physicochemical descriptors (BertzCT, MolLogP, MolWt, TPSA, NumHAcceptors, NumHDonors, NumRotatableBonds, NumAliphaticRings, NumAromaticRings) plus maximum nearest-neighbor ECFP4 similarity. The final score aggregates per-descriptor KL divergences:</li>
</ol>
<p>$$S = \frac{1}{k} \sum_{i}^{k} \exp(-D_{\text{KL}, i})$$</p>
<p>where $k$ is the number of compared distributions (the nine physicochemical descriptors plus the nearest-neighbor similarity distribution).</p>
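<p>Both distributional scores map a nonnegative distance onto [0, 1], with a perfect match scoring 1. A minimal sketch of the two transforms as stated above:</p>

```python
import math

def fcd_score(fcd):
    """Transform a raw Fréchet ChemNet Distance into a [0, 1] benchmark score."""
    return math.exp(-0.2 * fcd)

def kl_score(kl_divergences):
    """Average of exp(-D_KL) over the compared descriptor distributions."""
    return sum(math.exp(-d) for d in kl_divergences) / len(kl_divergences)

# A perfect match (FCD = 0, all KL divergences = 0) scores 1.0 on both:
assert fcd_score(0.0) == 1.0
assert kl_score([0.0] * 10) == 1.0
# Larger distances decay smoothly toward 0:
print(round(fcd_score(5.0), 3))  # 0.368
```

<p>The exponential decay means small distributional mismatches are penalized gently while large ones drive the score toward zero, which is visible in the baseline results below (e.g., ORGAN's FCD score of 0.000).</p>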
<h3 id="goal-directed-benchmarks">Goal-Directed Benchmarks</h3>
<p>The 20 goal-directed benchmarks evaluate a model&rsquo;s ability to generate molecules that maximize a given scoring function. These span several categories:</p>
<ul>
<li><strong>Rediscovery</strong> (3 tasks): Regenerate a specific target molecule (Celecoxib, Troglitazone, Thiothixene) using Tanimoto similarity on ECFP4 fingerprints.</li>
<li><strong>Similarity</strong> (3 tasks): Generate many molecules similar to a target (Aripiprazole, Albuterol, Mestranol) above a threshold of 0.75.</li>
<li><strong>Isomers</strong> (2 tasks): Generate molecules matching a target molecular formula ($\text{C}_{11}\text{H}_{24}$ and $\text{C}_9\text{H}_{10}\text{N}_2\text{O}_2\text{PF}_2\text{Cl}$).</li>
<li><strong>Median molecules</strong> (2 tasks): Maximize similarity to two reference molecules simultaneously (camphor/menthol and tadalafil/sildenafil).</li>
<li><strong>Multi-property optimization</strong> (7 tasks): Optimize combinations of similarity, physicochemical properties, and structural features for drug-relevant molecules (Osimertinib, Fexofenadine, Ranolazine, Perindopril, Amlodipine, Sitagliptin, Zaleplon).</li>
<li><strong>SMARTS-based</strong> (1 task): Target molecules containing specific substructure patterns with constrained physicochemical properties (Valsartan SMARTS).</li>
<li><strong>Scaffold/decorator hop</strong> (2 tasks): Modify molecular scaffolds while preserving substituent patterns, or vice versa.</li>
</ul>
<p>The benchmark score for most goal-directed tasks combines top-1, top-10, and top-100 molecule scores:</p>
<p>$$S = \frac{1}{3}\left(s_1 + \frac{1}{10}\sum_{i=1}^{10} s_i + \frac{1}{100}\sum_{i=1}^{100} s_i\right)$$</p>
<p>where $s_i$ are molecule scores sorted in decreasing order.</p>
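<p>This aggregation rewards both a single excellent molecule and breadth across the top 100. A direct implementation of the formula above:</p>

```python
def benchmark_score(scores):
    """Average of the top-1 score, mean of top-10, and mean of top-100."""
    s = sorted(scores, reverse=True)
    top1 = s[0]
    top10 = sum(s[:10]) / 10
    top100 = sum(s[:100]) / 100
    return (top1 + top10 + top100) / 3

# If all 100 returned molecules score 1.0, the benchmark score is 1.0:
assert benchmark_score([1.0] * 100) == 1.0
# A single perfect molecule among 99 zeros earns only partial credit:
print(round(benchmark_score([1.0] + [0.0] * 99), 4))  # 0.37
```

<p>The second case shows why mode-collapsed generators score poorly: one good molecule contributes fully to the top-1 term but only 1/10 and 1/100 of the other two terms.</p>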
<h3 id="score-modifiers">Score Modifiers</h3>
<p>Raw molecular properties are transformed via modifier functions to restrict scores to [0, 1]:</p>
<ul>
<li><strong>Gaussian($\mu$, $\sigma$)</strong>: Targets a specific property value</li>
<li><strong>MinGaussian($\mu$, $\sigma$)</strong>: Full score below $\mu$, decreasing above</li>
<li><strong>MaxGaussian($\mu$, $\sigma$)</strong>: Full score above $\mu$, decreasing below</li>
<li><strong>Thresholded($t$)</strong>: Full score above threshold $t$, linear decrease below</li>
</ul>
<p>Multi-property objectives use either arithmetic or geometric means to combine individual scores.</p>
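<p>A sketch of these modifiers is below. The exact functional forms (e.g., the normalization of the Gaussian and the linear ramp of the thresholded modifier) are assumptions consistent with the descriptions above, not a copy of GuacaMol's implementation, and the TPSA/logP targets in the usage example are invented for illustration:</p>

```python
import math

def gaussian(x, mu, sigma):
    """Full score at x == mu, Gaussian decay on both sides."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def min_gaussian(x, mu, sigma):
    """Full score below mu, Gaussian decay above."""
    return 1.0 if x <= mu else gaussian(x, mu, sigma)

def max_gaussian(x, mu, sigma):
    """Full score above mu, Gaussian decay below."""
    return 1.0 if x >= mu else gaussian(x, mu, sigma)

def thresholded(x, t):
    """Full score above threshold t, linear decrease toward 0 below it."""
    return 1.0 if x >= t else max(0.0, x / t)

# Hypothetical multi-property objective: TPSA near 100, logP capped at 4.
tpsa_score = gaussian(95.0, mu=100.0, sigma=20.0)
logp_score = min_gaussian(3.5, mu=4.0, sigma=1.0)
combined = (tpsa_score * logp_score) ** 0.5  # geometric mean of two scores
```

<p>Using a geometric rather than arithmetic mean makes the combined objective strict: a molecule failing any one property pulls the whole score toward zero.</p>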
<h2 id="baseline-models-and-experimental-setup">Baseline Models and Experimental Setup</h2>
<p>The authors evaluate six baseline models spanning different paradigms:</p>
<p><strong>Distribution-learning baselines:</strong></p>
<ul>
<li><strong>Random sampler</strong>: Samples molecules directly from the dataset (provides upper/lower bounds).</li>
<li><strong>SMILES LSTM</strong>: 3-layer LSTM (hidden size 1024) trained to predict next SMILES characters.</li>
<li><strong>Graph MCTS</strong>: Monte Carlo Tree Search building molecules atom-by-atom.</li>
<li><strong>VAE</strong>: Variational autoencoder on SMILES representations.</li>
<li><strong>AAE</strong>: Adversarial autoencoder.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></strong>: Objective-reinforced generative adversarial network.</li>
</ul>
<p><strong>Goal-directed baselines:</strong></p>
<ul>
<li><strong>Best of dataset</strong>: Scores all training molecules and returns the best (virtual screening baseline).</li>
<li><strong>SMILES LSTM</strong>: Same model with 20 iterations of hill-climbing (8192 samples per iteration, top 1024 for fine-tuning).</li>
<li><strong>SMILES GA</strong>: Genetic algorithm operating on SMILES strings with grammar-based mutations.</li>
<li><strong>Graph GA</strong>: Genetic algorithm operating on molecular graphs with crossover and mutation.</li>
<li><strong>Graph MCTS</strong>: Monte Carlo Tree Search with 40 simulations per molecule.</li>
</ul>
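<p>The SMILES LSTM baseline's hill-climbing is a sample-score-select-refit loop. The sketch below mirrors that loop in miniature: a 1-D Gaussian "generator" stands in for the LSTM, "fine-tuning" is reduced to shifting the generator toward the elite samples, and the population sizes are scaled down from the paper's 8192/1024; none of this is the authors' actual implementation.</p>

```python
import random

random.seed(0)

def hill_climb(score, n_iters=20, n_samples=512, top_k=64):
    """Sample-score-select loop in the spirit of the SMILES LSTM baseline.

    A real run samples SMILES from the LSTM and fine-tunes on the top-scoring
    molecules; here a 1-D Gaussian 'generator' stands in for the model.
    """
    mu, sigma = 0.0, 1.0
    for _ in range(n_iters):
        samples = [random.gauss(mu, sigma) for _ in range(n_samples)]
        elite = sorted(samples, key=score, reverse=True)[:top_k]
        # "Fine-tuning": move the generator toward the elite samples.
        mu = sum(elite) / len(elite)
    return mu

# Maximize closeness to a target value of 5.0:
best = hill_climb(lambda x: -abs(x - 5.0))
```

<p>Even this toy version shows the mechanism that makes hill-climbing effective: each round of selection biases the generator's distribution toward high-scoring regions, which is exactly what iterative fine-tuning does for the LSTM over SMILES.</p>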
<p>The training dataset is ChEMBL 24 after filtering: salt removal, charge neutralization, SMILES length cap of 100, element restrictions, and removal of molecules similar (ECFP4 &gt; 0.323) to 10 held-out drug molecules used in benchmarks.</p>
<h3 id="distribution-learning-results">Distribution-Learning Results</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Random</th>
          <th>SMILES LSTM</th>
          <th>Graph MCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
      </tr>
  </tbody>
</table>
<h3 id="goal-directed-results-selected">Goal-Directed Results (Selected)</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th>SMILES LSTM</th>
          <th>SMILES GA</th>
          <th>Graph GA</th>
          <th>Graph MCTS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.732</td>
          <td>1.000</td>
          <td>0.355</td>
      </tr>
      <tr>
          <td>Osimertinib MPO</td>
          <td>0.839</td>
          <td>0.907</td>
          <td>0.886</td>
          <td>0.953</td>
          <td>0.784</td>
      </tr>
      <tr>
          <td>Sitagliptin MPO</td>
          <td>0.509</td>
          <td>0.545</td>
          <td>0.689</td>
          <td>0.891</td>
          <td>0.458</td>
      </tr>
      <tr>
          <td>Scaffold Hop</td>
          <td>0.738</td>
          <td>0.998</td>
          <td>0.885</td>
          <td>1.000</td>
          <td>0.478</td>
      </tr>
      <tr>
          <td><strong>Total (20 tasks)</strong></td>
          <td><strong>12.144</strong></td>
          <td><strong>17.340</strong></td>
          <td><strong>14.396</strong></td>
          <td><strong>17.983</strong></td>
          <td><strong>9.009</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-findings">Main Findings</h3>
<p>The Graph GA achieves the highest total score across goal-directed benchmarks (17.983), followed closely by the SMILES LSTM (17.340). This result is notable because genetic algorithms are well-established methods, and the LSTM-based neural approach nearly matches their optimization performance.</p>
<p>However, compound quality tells a different story. When examining the top 100 molecules per task through chemical quality filters (SureChEMBL, Glaxo, PAINS rules), 77% of LSTM-generated molecules pass, matching the Best of ChEMBL baseline. In contrast, Graph GA produces only 40% passing molecules, and Graph MCTS only 22%. This suggests that neural models benefit from pre-training on real molecular distributions, which encodes implicit knowledge about what constitutes a &ldquo;reasonable&rdquo; molecule.</p>
<p><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> performs poorly across all distribution-learning tasks, with more than half its generated molecules being invalid. This is consistent with mode collapse, a known problem in GAN training.</p>
<p>Simpler generative models (LSTM, VAE) outperform more complex architectures (ORGAN, AAE) on distribution learning. Graph MCTS struggles with both distribution learning and goal-directed optimization, suggesting that single-molecule search trees are less effective than population-based approaches.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors explicitly identify several issues:</p>
<ul>
<li><strong>Compound quality is hard to quantify</strong>: The rule-based filters used are acknowledged as &ldquo;high precision, low recall&rdquo; surrogates. They catch some problematic molecules but cannot encode the full breadth of medicinal chemistry expertise.</li>
<li><strong>Some benchmarks are too easy</strong>: The trivially optimizable tasks (logP, QED, CNS MPO) cannot differentiate between models. All baselines achieve near-perfect scores on these.</li>
<li><strong>Sample efficiency and runtime are not benchmarked</strong>: The framework does not penalize models for requiring excessive scoring function calls.</li>
<li><strong>Synthesis accessibility is not addressed</strong>: Generated molecules may be valid but impractical to synthesize.</li>
</ul>
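<p>The sample-efficiency limitation can be made concrete with a call-budget wrapper around the scoring function. The sketch below is illustrative only (the class name and behavior are hypothetical, not part of GuacaMol): a benchmark that imposed such a budget would penalize methods that rely on excessive oracle calls.</p>

```python
class BudgetedOracle:
    """Wrap a scoring function with a hard evaluation budget.

    Hypothetical sketch of how a benchmark could penalize excessive
    scoring-function calls; not part of the GuacaMol framework itself.
    """

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.calls = 0

    def __call__(self, smiles):
        if self.calls >= self.budget:
            raise RuntimeError("evaluation budget exhausted")
        self.calls += 1
        return self.score_fn(smiles)


# Toy scoring function (SMILES length) just to exercise the wrapper.
oracle = BudgetedOracle(lambda s: len(s), budget=3)
scores = [oracle("CCO"), oracle("c1ccccc1"), oracle("CC(=O)O")]
# A fourth call would raise RuntimeError.
```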
<h3 id="future-directions">Future Directions</h3>
<p>The authors call for harder benchmark tasks, better compound quality metrics, attention to sample efficiency and runtime constraints, and further development of graph-based neural generative models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL 24 (post-processed)</td>
          <td>~1.6M molecules</td>
          <td>Salt removal, neutralization, SMILES length cap, element restrictions</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>10 held-out drug molecules</td>
          <td>10</td>
          <td>Removed from training set via ECFP4 similarity threshold</td>
      </tr>
      <tr>
          <td>Quality filters</td>
          <td>SureChEMBL, Glaxo, PAINS, in-house rules</td>
          <td>N/A</td>
          <td>Applied via rd_filters</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES LSTM</strong>: 3-layer LSTM, hidden size 1024; hill-climbing with 20 iterations, 8192 samples per iteration, top 1024 for fine-tuning</li>
<li><strong>Graph GA</strong>: Population of 100, mating pool of 200, crossover + mutation (probability 0.5), 1000 epochs max</li>
<li><strong>SMILES GA</strong>: Population of 300, offspring of 600, SMILES grammar-based mutations, 1000 epochs max</li>
<li><strong>Graph MCTS</strong>: 40 simulations per molecule, 25 children per step, rollout to 60 atoms, starting from CC</li>
</ul>
<h3 id="models">Models</h3>
<p>All baseline implementations are released as open-source code. VAE, AAE, and ORGAN implementations are from the <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> repository.</p>
<h3 id="evaluation">Evaluation</h3>
<p>All distribution-learning benchmarks sample 10,000 molecules. Goal-directed benchmarks use combinations of top-1, top-10, and top-100 scores. Compound quality is assessed via the percentage of top-100 molecules passing chemical filters.</p>
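<p>The top-k aggregation can be sketched in a few lines of plain Python. This is a simplified reading of the scheme described above (averaging the top-1, top-10, and top-100 means); the framework's own implementation should be treated as authoritative.</p>

```python
def top_k_mean(scores, k):
    """Mean of the k best scores (higher is better)."""
    return sum(sorted(scores, reverse=True)[:k]) / k

def combined_goal_directed_score(scores):
    """Average of the top-1, top-10, and top-100 means -- a sketch of
    the GuacaMol-style combined goal-directed score."""
    return sum(top_k_mean(scores, k) for k in (1, 10, 100)) / 3

# 1,000 generated molecules with scores evenly spread over [0, 1].
scores = [i / 999 for i in range(1000)]
combined = combined_goal_directed_score(scores)  # close to, but below, 1.0
```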
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/guacamol">GuacaMol</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmarking framework and scoring functions</td>
      </tr>
      <tr>
          <td><a href="https://github.com/BenevolentAI/guacamol_baselines">GuacaMol Baselines</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Baseline model implementations</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/projects/GuacaMol/56639">ChEMBL dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Post-processed ChEMBL 24 for benchmarks</td>
      </tr>
      <tr>
          <td><a href="https://github.com/bioinf-jku/FCD">FCD package</a></td>
          <td>Code</td>
          <td>LGPL-3.0</td>
          <td>Fréchet ChemNet Distance implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Brown, N., Fiscato, M., Segler, M. H. S., &amp; Vaucher, A. C. (2019). GuacaMol: Benchmarking Models for De Novo Molecular Design. <em>Journal of Chemical Information and Modeling</em>, 59(3), 1096-1108. <a href="https://doi.org/10.1021/acs.jcim.8b00839">https://doi.org/10.1021/acs.jcim.8b00839</a></p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BenevolentAI/guacamol">GuacaMol Python package</a></li>
<li><a href="https://github.com/BenevolentAI/guacamol_baselines">GuacaMol baselines</a></li>
<li><a href="https://figshare.com/projects/GuacaMol/56639">Post-processed ChEMBL datasets</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{brown2019guacamol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GuacaMol: Benchmarking Models for de Novo Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Brown, Nathan and Fiscato, Marco and Segler, Marwin H. S. and Vaucher, Alain C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{59}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1096--1108}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.8b00839}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DOCKSTRING: Docking-Based Benchmarks for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/dockstring-docking-benchmarks-ligand-design/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/dockstring-docking-benchmarks-ligand-design/</guid><description>DOCKSTRING provides an open-source Python docking package, 15M+ score dataset across 58 targets, and benchmark tasks for ML-driven drug design.</description><content:encoded><![CDATA[<h2 id="a-three-part-resource-for-docking-based-ml-benchmarks">A Three-Part Resource for Docking-Based ML Benchmarks</h2>
<p>DOCKSTRING is a <strong>Resource</strong> paper that delivers three integrated components for benchmarking machine learning models in drug discovery using molecular docking. The primary contributions are: (1) an open-source Python package wrapping <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a> for deterministic docking from <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, (2) a dataset of over 15 million docking scores and poses covering 260,000+ molecules docked against 58 medically relevant protein targets, and (3) a suite of benchmark tasks spanning regression, <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a>, and de novo molecular design. The paper additionally provides baseline results across classical and deep learning methods.</p>
<h2 id="why-existing-molecular-benchmarks-fall-short">Why Existing Molecular Benchmarks Fall Short</h2>
<p>ML methods for drug discovery are frequently evaluated using simple physicochemical properties such as penalized logP or QED (quantitative estimate of druglikeness). These properties are computationally cheap and easy to optimize, but they do not depend on the interaction between a candidate compound and a protein target. As a result, strong performance on logP or QED benchmarks does not necessarily translate to strong performance on real drug design tasks.</p>
<p><a href="https://en.wikipedia.org/wiki/Docking_(molecular)">Molecular docking</a> offers a more realistic evaluation objective because docking scores depend on the 3D structure of the ligand-target complex. Docking is routinely used by medicinal chemists to estimate binding affinities during hit discovery and lead optimization. Several prior efforts attempted to bring docking into ML benchmarking, but each had limitations:</p>
<ul>
<li><strong>VirtualFlow and DockStream</strong> require manually prepared target files and domain expertise.</li>
<li><strong>TDC and Cieplinski et al.</strong> provide SMILES-to-score wrappers but lack proper ligand protonation and randomness control, and cover very few targets (one and four, respectively).</li>
<li><strong>DUD-E</strong> is easily overfit by ML models that memorize actives vs. decoys.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a></strong> rely on physicochemical properties or similarity functions that miss 3D structural subtleties.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> compiles experimental datasets but does not support on-the-fly label computation needed for transfer learning or de novo design.</li>
</ul>
<p>DOCKSTRING addresses all of these gaps: it standardizes the docking procedure, automates ligand and target preparation, controls randomness for reproducibility, and provides a large, diverse target set.</p>
<h2 id="core-innovation-standardized-end-to-end-docking-pipeline">Core Innovation: Standardized End-to-End Docking Pipeline</h2>
<p>The key innovation is a fully automated, deterministic docking pipeline that produces reproducible scores from a SMILES string in four lines of Python code. The pipeline consists of three stages:</p>
<p><strong>Target Preparation.</strong> 57 of the 58 protein targets originate from the Directory of Useful Decoys Enhanced (DUD-E). PDB files are standardized with <a href="https://en.wikipedia.org/wiki/Open_Babel">Open Babel</a>, polar hydrogens are added, and conversion to PDBQT format is performed with AutoDock Tools. Search boxes are derived from crystallographic ligands with 12.5 Å padding and a minimum side length of 30 Å. The 58th target (DRD2, <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a>) was prepared separately following the same protocol.</p>
<p><strong>Ligand Preparation.</strong> Ligands are protonated at pH 7.4 with Open Babel, embedded into 3D conformations using the ETKDG algorithm in RDKit, refined with the <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94 force field</a>, and assigned Gasteiger partial charges. Stereochemistry of determined stereocenters is maintained, while undetermined stereocenters are assigned randomly but consistently across runs.</p>
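<p>Most of the ligand-preparation stage can be sketched with RDKit (assuming RDKit is installed). This mirrors the steps described above but is not the DOCKSTRING implementation itself; in particular, the Open Babel protonation step at pH 7.4 is omitted here.</p>

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def prepare_ligand(smiles, seed=42):
    """Embed a 3D conformer (ETKDG), refine with MMFF94, and assign
    Gasteiger charges -- a sketch of the steps described above."""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    params = AllChem.ETKDGv3()
    params.randomSeed = seed  # fixed seed keeps the embedding reproducible
    AllChem.EmbedMolecule(mol, params)
    AllChem.MMFFOptimizeMolecule(mol)
    AllChem.ComputeGasteigerCharges(mol)
    return mol

mol = prepare_ligand("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example input
```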
<p><strong>Docking.</strong> AutoDock Vina runs with default exhaustiveness (8), up to 9 binding modes, and an energy range of 3 kcal/mol. The authors verified that fixing the random seed yields docking score variance of less than 0.1 kcal/mol across runs, making the pipeline fully deterministic.</p>
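<p>The Vina settings above correspond to a configuration file along these lines. The search-box values are placeholders (DOCKSTRING derives the real boxes per target from crystallographic ligands, as described in the target-preparation step), and the seed value is illustrative:</p>

```text
# AutoDock Vina configuration matching the defaults described above
exhaustiveness = 8     # default search effort
num_modes = 9          # up to 9 binding modes
energy_range = 3       # kcal/mol window around the best mode
seed = 42              # fixed seed for reproducibility (placeholder value)

# Search box: placeholder values; real boxes are derived per target
center_x = 0.0
center_y = 0.0
center_z = 0.0
size_x = 30.0
size_y = 30.0
size_z = 30.0
```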
<p>The three de novo design objective functions incorporate a QED penalty to enforce druglikeness:</p>
<p>$$
f_{\text{F2}}(l) = s(l, \text{F2}) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{PPAR}}(l) = \max_{t \in \text{PPAR}} s(l, t) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{JAK2}}(l) = s(l, \text{JAK2}) - \min(s(l, \text{LCK}), -8.1) + 10(1 - \text{QED}(l))
$$</p>
<p>The F2 task optimizes binding to a single protease. The Promiscuous <a href="https://en.wikipedia.org/wiki/Peroxisome_proliferator-activated_receptor">PPAR</a> task requires strong binding to three nuclear receptors simultaneously. The Selective <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> task is adversarial, requiring strong JAK2 binding while avoiding <a href="https://en.wikipedia.org/wiki/Tyrosin-protein_kinase_Lck">LCK</a> binding (two kinases with a score correlation of 0.80).</p>
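<p>Given precomputed docking scores and QED, the three objectives are straightforward to express in code. A sketch in plain Python, treating the score lookup as an input (the PPAR subtype names are illustrative placeholders; lower objective values are better, since docking scores are negative for strong binders):</p>

```python
def f_f2(s, qed):
    """Single-target F2 objective: docking score plus QED penalty."""
    return s["F2"] + 10 * (1 - qed)

def f_ppar(s, qed):
    """Promiscuous PPAR objective: worst (highest) score across the
    three PPAR subtypes plus QED penalty."""
    return max(s[t] for t in ("PPARA", "PPARD", "PPARG")) + 10 * (1 - qed)

def f_jak2(s, qed):
    """Selective JAK2 objective: reward JAK2 binding while penalizing
    LCK binding, with the LCK score capped at -8.1 kcal/mol."""
    return s["JAK2"] - min(s["LCK"], -8.1) + 10 * (1 - qed)

# Illustrative scores (kcal/mol) and QED for a hypothetical ligand.
scores = {"F2": -9.0, "PPARA": -8.5, "PPARD": -7.9, "PPARG": -8.2,
          "JAK2": -9.5, "LCK": -7.0}
qed = 0.7
objectives = (f_f2(scores, qed), f_ppar(scores, qed), f_jak2(scores, qed))
```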
<h2 id="experimental-setup-regression-virtual-screening-and-de-novo-design">Experimental Setup: Regression, Virtual Screening, and De Novo Design</h2>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>The dataset draws its molecules from ExCAPE-DB (which curates PubChem and ChEMBL bioactivity assays). The authors selected all molecules with active labels against targets having at least 1,000 experimental actives, plus 150,000 inactive-only molecules. After discarding 1.8% of molecules that failed ligand preparation, the final dataset contains 260,155 compounds docked against 58 targets, producing over 15 million docking scores and poses. The dataset required over 500,000 CPU hours to generate.</p>
<p>Cluster analysis using <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard distance</a> threshold of 0.25 on RDKit fingerprints) found 52,000 clusters, and Bemis-Murcko scaffold decomposition identified 102,000 scaffolds, confirming high molecular diversity. Train/test splitting follows cluster labels to prevent data leakage.</p>
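<p>The cluster-based splitting strategy can be sketched in plain Python: whole clusters are assigned to either train or test, so near-duplicate molecules never straddle the split. The function below is an illustrative sketch, not the authors' code.</p>

```python
import random

def cluster_split(cluster_labels, test_fraction=0.2, seed=0):
    """Assign entire clusters to train or test so similar molecules
    never appear on both sides of the split (sketch)."""
    clusters = sorted(set(cluster_labels))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train_idx = [i for i, c in enumerate(cluster_labels) if c not in test_clusters]
    test_idx = [i for i, c in enumerate(cluster_labels) if c in test_clusters]
    return train_idx, test_idx

# Ten molecules in five clusters of two near-duplicates each.
labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
train_idx, test_idx = cluster_split(labels)
```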
<h3 id="regression-baselines">Regression Baselines</h3>
<p>Five targets of varying difficulty were selected: <a href="https://en.wikipedia.org/wiki/Poly_(ADP-ribose)_polymerase">PARP1</a> (easy), F2 (easy-medium), KIT (medium), ESR2 (hard), and PGR (hard). Baselines include Ridge, Lasso, XGBoost, exact GP, sparse GP, MPNN, and Attentive FP.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Ridge</th>
          <th>Lasso</th>
          <th>XGBoost</th>
          <th>GP (exact)</th>
          <th>GP (sparse)</th>
          <th>MPNN</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>0.640</td>
          <td>0.640</td>
          <td>0.734</td>
          <td>0.707</td>
          <td>0.716</td>
          <td>0.953</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.519</td>
          <td>0.483</td>
          <td>0.660</td>
          <td>0.640</td>
          <td>0.598</td>
          <td>0.901</td>
          <td>0.981</td>
      </tr>
      <tr>
          <td>ESR2</td>
          <td>0.421</td>
          <td>0.416</td>
          <td>0.497</td>
          <td>0.441</td>
          <td>0.508</td>
          <td>0.506</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>F2</td>
          <td>0.672</td>
          <td>0.663</td>
          <td>0.688</td>
          <td>0.705</td>
          <td>0.744</td>
          <td>0.798</td>
          <td>0.880</td>
      </tr>
      <tr>
          <td>KIT</td>
          <td>0.604</td>
          <td>0.594</td>
          <td>0.674</td>
          <td>0.637</td>
          <td>0.684</td>
          <td>0.755</td>
          <td>0.806</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>0.706</td>
          <td>0.700</td>
          <td>0.723</td>
          <td>0.743</td>
          <td>0.772</td>
          <td>0.815</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>0.242</td>
          <td>0.245</td>
          <td>0.345</td>
          <td>0.291</td>
          <td>0.387</td>
          <td>0.324</td>
          <td>0.678</td>
      </tr>
  </tbody>
</table>
<p>Values are mean $R^2$ over three runs. Attentive FP achieves the best performance on every target but remains well below perfect prediction on the harder targets, confirming that docking score regression is a meaningful benchmark.</p>
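<p>For reference, the reported metric is the standard coefficient of determination, which can be computed directly:</p>

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Perfect prediction of (hypothetical) docking scores gives R^2 = 1.
y_true = [-9.1, -8.4, -10.2, -7.8]
perfect = r_squared(y_true, y_true)
```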
<h3 id="virtual-screening-baselines">Virtual Screening Baselines</h3>
<p>Models trained on PARP1, KIT, and PGR docking scores rank all molecules in <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC20</a> (~1 billion compounds). The top 5,000 predictions are docked, and the enrichment factor (EF) is computed relative to a 0.1 percentile activity threshold.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Threshold</th>
          <th>FSS</th>
          <th>Ridge</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>KIT</td>
          <td>-10.7</td>
          <td>239.2</td>
          <td>451.6</td>
          <td>766.5</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>-12.1</td>
          <td>313.1</td>
          <td>325.9</td>
          <td>472.2</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>-10.1</td>
          <td>161.4</td>
          <td>120.5</td>
          <td>461.3</td>
      </tr>
  </tbody>
</table>
<p>The maximum possible EF is 1,000. Attentive FP substantially outperforms fingerprint similarity search (FSS) and Ridge regression across all targets.</p>
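<p>The enrichment factor is the hit rate among selected molecules divided by the base rate in the full library; with a 0.1th-percentile activity threshold the base rate is 0.001, which is why the maximum EF is 1,000. A minimal sketch (the hit count below is hypothetical):</p>

```python
def enrichment_factor(selected_hits, n_selected, active_fraction):
    """EF = hit rate in the selected set / base rate in the library."""
    return (selected_hits / n_selected) / active_fraction

# If 2,361 of the top 5,000 predictions beat the threshold (hypothetical
# count), the enrichment factor over a 0.001 base rate is 472.2.
ef = enrichment_factor(2361, 5000, 0.001)
```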
<h3 id="de-novo-design-baselines">De Novo Design Baselines</h3>
<p>Four optimization methods were tested: <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> GA, <a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, GP-BO with UCB acquisition ($\beta = 10$), and GP-BO with expected improvement (EI), each with a budget of 5,000 objective function evaluations. Without QED penalties, all methods easily surpass the best training set molecules but produce large, lipophilic, non-druglike compounds. With QED penalties, the tasks become substantially harder: GP-BO with EI is the only method that finds 25 molecules better than the training set across all three tasks.</p>
<p>The Selective JAK2 task proved hardest due to the high correlation between JAK2 and LCK scores. Pose analysis of the top de novo molecule revealed a dual binding mode: type V inhibitor behavior in JAK2 (binding distant N- and C-terminal lobe regions) and type I behavior in LCK (hinge-binding), suggesting a plausible selectivity mechanism.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ol>
<li>Docking scores are substantially harder to predict than logP or QED, making them more suitable for benchmarking high-performing ML models. Graph neural networks (Attentive FP) achieve near-perfect $R^2$ on logP but only 0.63-0.91 on docking targets.</li>
<li>In-distribution regression difficulty does not necessarily predict out-of-distribution virtual screening difficulty. PARP1 is easiest for regression, but KIT is easiest for virtual screening.</li>
<li>Adding a QED penalty to de novo design objectives transforms trivially solvable tasks into meaningful benchmarks. The adversarial Selective JAK2 objective, which exploits correlated docking scores, may be an effective way to avoid docking score biases toward large and lipophilic molecules.</li>
<li>Docking scores from related protein targets are highly correlated, supporting the biological meaningfulness of the dataset and enabling multiobjective and transfer learning tasks.</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>Docking scores are approximate heuristics. They use static binding sites and force fields with limited calibration for certain metal ions. DOCKSTRING benchmarks should not substitute for rational drug design and experimental validation.</li>
<li>The pipeline relies on AutoDock Vina specifically; other docking programs may produce different rankings.</li>
<li>Top de novo molecules for F2 and Promiscuous PPAR contain conjugated ring structures uncommon in successful drugs.</li>
<li>Platform support is primarily Linux, with noted scoring inconsistencies on macOS.</li>
</ul>
<p><strong>Future directions</strong> mentioned include multiobjective tasks (transfer learning, few-shot learning), improved objective functions for better pharmacokinetic properties and synthetic feasibility, and multifidelity optimization tasks combining docking with more expensive computational methods.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ligand source</td>
          <td>ExCAPE-DB (PubChem + ChEMBL)</td>
          <td>260,155 molecules</td>
          <td>Actives against 58 targets + 150K inactive-only</td>
      </tr>
      <tr>
          <td>Docking scores</td>
          <td>DOCKSTRING dataset</td>
          <td>15M+ scores and poses</td>
          <td>Full matrix across all molecule-target pairs</td>
      </tr>
      <tr>
          <td>Virtual screening library</td>
          <td>ZINC20</td>
          <td>~1 billion molecules</td>
          <td>Used for out-of-distribution evaluation</td>
      </tr>
      <tr>
          <td>Target structures</td>
          <td>DUD-E + PDB 6CM4 (DRD2)</td>
          <td>58 targets</td>
          <td>Kinases (22), enzymes (12), nuclear receptors (9), proteases (7), GPCRs (5), cytochromes (2), chaperone (1)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Docking engine</strong>: AutoDock Vina with default exhaustiveness (8), up to 9 binding modes, energy range of 3 kcal/mol</li>
<li><strong>Ligand preparation</strong>: Open Babel (protonation at pH 7.4), RDKit ETKDG (3D embedding), MMFF94 (force field refinement), Gasteiger charges</li>
<li><strong>Regression models</strong>: Ridge, Lasso, XGBoost (hyperparameters via 20-configuration random search with 5-fold CV), exact GP and sparse GP (Tanimoto kernel on fingerprints), MPNN, Attentive FP (DeepChem defaults, 10 epochs)</li>
<li><strong>Optimization</strong>: Graph GA (population 250, offspring 25, mutation rate 0.01), SELFIES GA (same population/offspring settings), GP-BO with UCB ($\beta = 10$) or EI (batch size 5, 1000 offspring, 25 generations per iteration)</li>
</ul>
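<p>The UCB acquisition used by the GP-BO baseline can be sketched as follows. The exact form of the exploration term varies between implementations; mean + &radic;&beta;&middot;&sigma; is a common convention, and the maximization sign convention here is an assumption rather than a detail taken from the paper.</p>

```python
import math

def ucb(mean, std, beta=10.0):
    """Upper confidence bound: posterior mean plus an exploration bonus.
    The paper sets beta = 10; the sqrt(beta) scaling is one common
    convention, not necessarily the paper's exact form."""
    return mean + math.sqrt(beta) * std

def select_batch(candidates, posterior, beta=10.0, batch_size=5):
    """Rank candidates by UCB of their (mean, std) posterior and return
    the top batch (batch size 5 in the GP-BO setup above)."""
    ranked = sorted(candidates, key=lambda c: ucb(*posterior[c], beta),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical posteriors: "B" is uncertain enough to win under UCB.
posterior = {"A": (0.5, 0.1), "B": (0.3, 0.5), "C": (0.6, 0.0)}
best = select_batch(list(posterior), posterior, batch_size=1)
```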
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Setting</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2$ (coefficient of determination)</td>
          <td>Regression</td>
          <td>Cluster-split train/test</td>
      </tr>
      <tr>
          <td>EF (enrichment factor)</td>
          <td>Virtual screening</td>
          <td>Top 5,000 from ZINC20, 0.1 percentile threshold</td>
      </tr>
      <tr>
          <td>Objective value trajectory</td>
          <td>De novo design</td>
          <td>5,000 function evaluation budget</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The dataset required over 500,000 CPU hours to compute, using the University of Cambridge Research Computing Service (EPSRC and DiRAC funded). Per-target docking takes approximately 15 seconds on 8 CPUs.</p>
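<p>The quoted figures are mutually consistent: about 15 seconds per docking on 8 CPUs, over the full molecule-target matrix, works out to roughly 500,000 CPU hours.</p>

```python
n_molecules = 260_155
n_targets = 58
seconds_per_docking = 15   # approximate wall time per docking
cpus_per_docking = 8

cpu_hours = (n_molecules * n_targets * seconds_per_docking
             * cpus_per_docking) / 3600
# Roughly 500,000 CPU hours, matching the paper's reported total.
```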
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">DOCKSTRING Python package</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Wraps AutoDock Vina; available via conda-forge and PyPI</td>
      </tr>
      <tr>
          <td><a href="https://dockstring.github.io">DOCKSTRING dataset</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
<td>15M+ docking scores and poses for 260K molecules &times; 58 targets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">Benchmark baselines</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Regression, virtual screening, and de novo design baseline implementations</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: García-Ortegón, M., Simm, G. N. C., Tripp, A. J., Hernández-Lobato, J. M., Bender, A., &amp; Bacallado, S. (2022). DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design. <em>Journal of Chemical Information and Modeling</em>, 62(15), 3486-3502. <a href="https://doi.org/10.1021/acs.jcim.1c01334">https://doi.org/10.1021/acs.jcim.1c01334</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://dockstring.github.io">DOCKSTRING Project Page</a></li>
<li><a href="https://github.com/dockstring/dockstring">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{garciaortegon2022dockstring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{DOCKSTRING}: Easy Molecular Docking Yields Better Benchmarks for Ligand Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Garc{\&#39;\i}a-Orteg{\&#39;o}n, Miguel and Simm, Gregor N. C. and Tripp, Austin J. and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and Bender, Andreas and Bacallado, Sergio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3486--3502}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.1c01334}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Tartarus: Realistic Inverse Molecular Design Benchmarks</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/tartarus-inverse-molecular-design/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/tartarus-inverse-molecular-design/</guid><description>Tartarus provides physics-based benchmark tasks for inverse molecular design spanning materials, drugs, and reactions with algorithm-domain dependencies.</description><content:encoded><![CDATA[<h2 id="a-resource-for-realistic-molecular-design-evaluation">A Resource for Realistic Molecular Design Evaluation</h2>
<p>This is a <strong>Resource</strong> paper. Its primary contribution is Tartarus, a modular benchmarking platform for inverse molecular design that provides physically grounded evaluation tasks across four application domains: organic photovoltaics, organic emitters, protein ligands, and chemical reaction substrates. Each task pairs a curated reference dataset with a computational simulation workflow that evaluates proposed molecular structures using established methods from computational chemistry (<a href="https://en.wikipedia.org/wiki/Force_field_(chemistry)">force fields</a>, semi-empirical quantum chemistry, <a href="https://en.wikipedia.org/wiki/Density_functional_theory">density functional theory</a>, and <a href="https://en.wikipedia.org/wiki/Docking_(molecular)">molecular docking</a>).</p>
<h2 id="the-problem-with-existing-molecular-design-benchmarks">The Problem with Existing Molecular Design Benchmarks</h2>
<p>Inverse molecular design, the challenge of crafting molecules with specific optimal properties, is central to drug, catalyst, and materials discovery. Many algorithms have been proposed for this task, but the benchmarks used to evaluate them have significant limitations:</p>
<ul>
<li><strong>Penalized logP</strong>, one of the most common benchmarks, depends heavily on molecule size and chain composition, limiting its informativeness.</li>
<li><strong>QED maximization</strong> has reached saturation, with numerous models achieving near-perfect scores.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong> often yields near-perfect scores across models, obscuring meaningful performance differences. <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">Gao et al. (2022)</a> traced this to unlimited property evaluations, with imposed limits revealing much larger disparities.</li>
<li><strong>MOSES</strong> evaluates distribution-matching ability, but the emergence of <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> and simple algorithms has made these tasks relatively straightforward.</li>
<li><strong>Molecular docking</strong> benchmarks are gaining popularity, but tend to favor reactive or unstable molecules and typically cover only drug design.</li>
</ul>
<p>These benchmarks share a common weakness: they rely on cheap, approximate property estimators (often QSAR models or simple heuristics) rather than physics-based simulations. This makes them poor proxies for real molecular design campaigns, where properties must be validated through computational or experimental workflows. Tartarus addresses this by providing benchmark tasks grounded in established simulation methods.</p>
<h2 id="physics-based-simulation-workflows-as-benchmark-oracles">Physics-Based Simulation Workflows as Benchmark Oracles</h2>
<p>The core innovation in Tartarus is the use of computational chemistry simulation pipelines as objective functions for benchmarking. Rather than relying on learned property predictors, each benchmark task runs a full simulation workflow to evaluate proposed molecules:</p>
<ol>
<li><strong>Organic Photovoltaics (OPV)</strong>: Starting from a <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> string, the workflow generates 3D coordinates with Open Babel, performs conformer search with CREST at the GFN-FF level, optimizes geometry at GFN2-xTB, and computes <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO/LUMO</a> energies. Power conversion efficiency (PCE) is estimated via the Scharber model for single-junction <a href="https://en.wikipedia.org/wiki/Organic_solar_cell">organic solar cells</a>. HOMO and LUMO energies are calibrated against DFT results from the Harvard Clean Energy Project Database using <a href="https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator">Theil-Sen regression</a>:</li>
</ol>
<p>$$
E_{\text{HOMO, calibrated}} = E_{\text{HOMO, GFN2-xTB}} \cdot 0.8051 + 2.5377 \text{ eV}
$$</p>
<p>$$
E_{\text{LUMO, calibrated}} = E_{\text{LUMO, GFN2-xTB}} \cdot 0.8788 + 3.7913 \text{ eV}
$$</p>
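<p>The calibration is a simple affine map and can be applied directly to raw GFN2-xTB orbital energies; a minimal sketch (the function name is ours, constants are the Theil-Sen coefficients quoted above):</p>

```python
def calibrate_frontier_orbitals(e_homo_xtb: float, e_lumo_xtb: float):
    """Map GFN2-xTB frontier-orbital energies (eV) onto the DFT reference
    scale of the Harvard Clean Energy Project, using the Theil-Sen
    regression coefficients reported for the OPV benchmark."""
    e_homo = e_homo_xtb * 0.8051 + 2.5377
    e_lumo = e_lumo_xtb * 0.8788 + 3.7913
    return e_homo, e_lumo
```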
<ol start="2">
<li>
<p><strong>Organic Emitters (OLED)</strong>: The workflow uses conformer search via CREST, geometry optimization at GFN0-xTB, and TD-DFT single-point calculations at the B3LYP/6-31G* level with PySCF to extract singlet-triplet gaps, <a href="https://en.wikipedia.org/wiki/Oscillator_strength">oscillator strengths</a>, and vertical excitation energies.</p>
</li>
<li>
<p><strong>Protein Ligands</strong>: The workflow generates 3D coordinates, applies structural filters (<a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a>, reactive moiety checks), and performs molecular docking using QuickVina2 with re-scoring via smina against three protein targets: 1SYH (ionotropic glutamate receptor), 6Y2F (<a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 main protease</a>), and 4LDE (beta-2 adrenoceptor).</p>
</li>
<li>
<p><strong>Chemical Reaction Substrates</strong>: The workflow models the intramolecular double hydrogen transfer in syn-sesquinorbornenes using the SEAM force field approach at the GFN-FF/GFN2-xTB level to compute activation and reaction energies.</p>
</li>
</ol>
<p>Each benchmark also includes a curated reference dataset for training generative models and a standardized evaluation protocol: train on 80% of the dataset, use 20% for hyperparameter optimization, then optimize structures starting from the best reference molecule with a constrained budget of 5,000 proposed compounds, a 24-hour runtime cap, and five independent repetitions.</p>
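<p>The constrained-budget part of the protocol can be sketched as a wrapper around any property oracle (class and function names are ours; <code>evaluate_fn</code> stands in for a real simulation workflow):</p>

```python
import time


class BudgetedOracle:
    """Enforce the Tartarus evaluation constraints on a property oracle:
    at most 5,000 proposed compounds and a 24-hour wall-clock cap."""

    def __init__(self, evaluate_fn, max_evals=5000, max_seconds=24 * 3600):
        self.evaluate_fn = evaluate_fn
        self.max_evals = max_evals
        self.deadline = time.monotonic() + max_seconds
        self.num_evals = 0

    def __call__(self, smiles: str):
        # Refuse further evaluations once either budget is spent.
        if self.num_evals >= self.max_evals or time.monotonic() > self.deadline:
            raise RuntimeError("evaluation budget exhausted")
        self.num_evals += 1
        return self.evaluate_fn(smiles)
```

An optimizer would call the wrapped oracle in its inner loop and stop when it raises; the five independent repetitions are then just five fresh wrappers.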
<h2 id="benchmark-tasks-datasets-and-model-comparisons">Benchmark Tasks, Datasets, and Model Comparisons</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>Eight generative models spanning major algorithm families were tested:</p>
<ul>
<li><strong>VAEs</strong>: SMILES-VAE and SELFIES-VAE</li>
<li><strong>Flow models</strong>: MoFlow</li>
<li><strong>Reinforcement learning</strong>: <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></li>
<li><strong>LSTM-based hill climbing</strong>: SMILES-LSTM-HC and SELFIES-LSTM-HC</li>
<li><strong>Genetic algorithms</strong>: <a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">GB-GA</a> and JANUS</li>
</ul>
<h3 id="organic-photovoltaics-results">Organic Photovoltaics Results</h3>
<p>The reference dataset (CEP_SUB) contains approximately 25,000 molecules from the Harvard Clean Energy Project Database. Two objectives combine PCE with synthetic accessibility (SAscore):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>PCE_PCBM - SAscore</th>
          <th>PCE_PCDTBT - SAscore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>7.57</td>
          <td>31.71</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>7.44 +/- 0.28</td>
          <td>10.23 +/- 11.14</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>7.05 +/- 0.66</td>
          <td>29.24 +/- 0.65</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>7.08 +/- 0.31</td>
          <td>29.81 +/- 0.37</td>
      </tr>
      <tr>
          <td>SMILES-LSTM-HC</td>
          <td>6.69 +/- 0.40</td>
          <td>31.79 +/- 0.15</td>
      </tr>
      <tr>
          <td>SELFIES-LSTM-HC</td>
          <td>7.40 +/- 0.41</td>
          <td>30.71 +/- 1.20</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>7.48 +/- 0.11</td>
          <td>30.47 +/- 0.44</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>7.78 +/- 0.02</td>
          <td>30.24 +/- 0.80</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>7.59 +/- 0.14</td>
          <td>31.34 +/- 0.74</td>
      </tr>
  </tbody>
</table>
<p>GB-GA achieves the best score on the first task (7.78), while SMILES-LSTM-HC leads on the second (31.79). Most models marginally improve PCE on its own but struggle to improve it while also lowering SAscore.</p>
<h3 id="organic-emitters-results">Organic Emitters Results</h3>
<p>The reference dataset (GDB-13_SUB) contains approximately 380,000 molecules filtered for conjugated pi-systems from <a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a>. Three objectives target singlet-triplet gap minimization, oscillator strength maximization, and a combined multi-objective:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Delta E(S1-T1)</th>
          <th>f12</th>
          <th>Multi-objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>0.020</td>
          <td>2.97</td>
          <td>-0.04</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>0.071 +/- 0.003</td>
          <td>0.50 +/- 0.27</td>
          <td>-0.57 +/- 0.33</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>0.016 +/- 0.001</td>
          <td>0.36 +/- 0.31</td>
          <td>0.17 +/- 0.10</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>0.013 +/- 0.001</td>
          <td>0.81 +/- 0.11</td>
          <td>-0.04 +/- 0.06</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>0.012 +/- 0.002</td>
          <td>2.14 +/- 0.45</td>
          <td>0.07 +/- 0.03</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>0.008 +/- 0.001</td>
          <td>2.07 +/- 0.16</td>
          <td>0.02 +/- 0.05</td>
      </tr>
  </tbody>
</table>
<p>Only JANUS, GB-GA, and SELFIES-VAE generate compounds comparable to or improving upon the best training molecules. JANUS achieves the lowest singlet-triplet gap (0.008 eV), while SELFIES-VAE achieves the highest multi-objective fitness (0.17). Some proposed structures contain reactive moieties, likely because stability is not explicitly penalized in the objective functions.</p>
<h3 id="protein-ligand-results">Protein Ligand Results</h3>
<p>The reference dataset contains approximately 152,000 molecules from the DTP Open Compound Collection, filtered for drug-likeness. Docking is performed against three protein targets using both QuickVina2 and smina re-scoring:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1SYH (smina)</th>
          <th>6Y2F (smina)</th>
          <th>4LDE (smina)</th>
          <th>SR (1SYH)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>-10.2</td>
          <td>-8.2</td>
          <td>-13.1</td>
          <td>100.0%</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>-10.4 +/- 0.6</td>
          <td>-8.9 +/- 0.8</td>
          <td>-11.1 +/- 0.4</td>
          <td>12.3%</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>-10.9 +/- 0.3</td>
          <td>-10.1 +/- 0.4</td>
          <td>-11.9 +/- 0.2</td>
          <td>34.8%</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>-12.1 +/- 0.2</td>
          <td>-11.4 +/- 0.3</td>
          <td>-13.7 +/- 0.5</td>
          <td>77.8%</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>-12.0 +/- 0.2</td>
          <td>-11.0 +/- 0.2</td>
          <td>-13.8 +/- 0.4</td>
          <td>72.6%</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>-11.9 +/- 0.2</td>
          <td>-11.9 +/- 0.4</td>
          <td>-13.6 +/- 0.5</td>
          <td>68.4%</td>
      </tr>
  </tbody>
</table>
<p>No single model consistently achieves the best docking score across all three targets. REINVENT leads on 1SYH, JANUS on 6Y2F, and GB-GA on 4LDE. Both VAE models show low success rates for structural filter compliance (12-39%), while REINVENT, GAs, and LSTMs achieve 68-78%.</p>
<h3 id="chemical-reaction-substrates-results">Chemical Reaction Substrates Results</h3>
<p>The reference dataset (SNB-60K) contains approximately 60,000 syn-sesquinorbornene derivatives generated via <a href="/notes/computational-chemistry/benchmark-problems/stoned-selfies-chemical-space-exploration/">STONED-SELFIES</a> mutations. Four objectives target activation energy, reaction energy, and two combined metrics:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Delta E(activation)</th>
          <th>Delta E(reaction)</th>
          <th>Delta E(act) + Delta E(rxn)</th>
          <th>-Delta E(act) + Delta E(rxn)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>64.94</td>
          <td>-34.39</td>
          <td>56.48</td>
          <td>-95.25</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>76.81 +/- 0.25</td>
          <td>-10.96 +/- 0.71</td>
          <td>71.01 +/- 0.62</td>
          <td>-90.94 +/- 1.04</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>70.12 +/- 2.13</td>
          <td>-20.21 +/- 4.13</td>
          <td>63.21 +/- 0.69</td>
          <td>-92.82 +/- 3.06</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>56.04 +/- 3.07</td>
          <td>-41.39 +/- 5.76</td>
          <td>45.20 +/- 6.78</td>
          <td>-100.07 +/- 1.35</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>47.56 +/- 2.19</td>
          <td>-45.37 +/- 7.90</td>
          <td>39.22 +/- 3.99</td>
          <td>-97.14 +/- 1.13</td>
      </tr>
  </tbody>
</table>
<p>Only JANUS and GB-GA consistently outperform the best reference compounds. Both VAE models fail to surpass the dataset baseline on any objective. JANUS achieves the best single-objective scores for activation energy (47.56) and reaction energy (-45.37), and the best combined score (39.22).</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="central-finding-algorithm-performance-is-domain-dependent">Central Finding: Algorithm Performance is Domain-Dependent</h3>
<p>The most important result from Tartarus is that no single generative model consistently outperforms the others across all benchmark domains. This has several implications:</p>
<ul>
<li><strong>Genetic algorithms (GB-GA and JANUS) show the most consistently strong performance</strong> across benchmarks, despite being among the simplest approaches and requiring minimal pre-conditioning time (seconds vs. hours for deep models).</li>
<li><strong>VAE-based models (SMILES-VAE and SELFIES-VAE) show the weakest overall performance</strong>, often failing to surpass the best molecules in the reference datasets. Their reliance on the available training data appears to limit their effectiveness.</li>
<li><strong>REINVENT performs competitively on protein ligand tasks</strong> but shows weaker performance on other benchmarks.</li>
<li><strong>Representation matters</strong>: SELFIES-based models generally outperform their SMILES-based counterparts (e.g., SELFIES-VAE vs. SMILES-VAE), consistent with SELFIES providing 100% validity guarantees.</li>
</ul>
<h3 id="timing-analysis">Timing Analysis</h3>
<p>Training time varies dramatically across models. Both VAEs require over 9 hours of GPU training, with estimated CPU-only training times of approximately 25 days. REINVENT and MoFlow train in under 1 hour. Both GAs complete pre-conditioning in seconds and require no GPU.</p>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ul>
<li>Benchmark domains covered are not comprehensive and need expansion.</li>
<li>3D generative models are not well supported, as proposed conformers are ignored in favor of simulation-derived geometries.</li>
<li>The chemical reaction substrate benchmark requires specialized geometries (reactant, product, transition state) that most 3D generative models cannot produce.</li>
<li>Results depend heavily on both model hyperparameters and benchmark settings (compute budget, number of evaluations).</li>
<li>Objective functions may need revision when undesired structures are promoted.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OPV Training</td>
          <td>CEP_SUB (Harvard Clean Energy Project subset)</td>
          <td>~25,000 molecules</td>
          <td>From HIPS/neural-fingerprint repository</td>
      </tr>
      <tr>
          <td>Emitter Training</td>
          <td>GDB-13_SUB (filtered GDB-13)</td>
          <td>~380,000 molecules</td>
          <td>Conjugated pi-system filter applied</td>
      </tr>
      <tr>
          <td>Ligand Training</td>
          <td>DTP Open Compound Collection (filtered)</td>
          <td>~152,000 molecules</td>
          <td>Drug-likeness and structural filters applied</td>
      </tr>
      <tr>
          <td>Reaction Training</td>
          <td>SNB-60K (STONED-SELFIES mutations)</td>
          <td>~60,000 molecules</td>
          <td>Generated from syn-sesquinorbornene core</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>All eight algorithms are implemented in the Tartarus repository with configuration files and installation instructions. The evaluation protocol specifies: 80/20 train/validation split, population size of 5,000, 24-hour runtime cap, five independent runs per model.</p>
<h3 id="models">Models</h3>
<p>Pre-trained model checkpoints are not provided. Training must be performed from scratch using the provided reference datasets and hyperparameter configurations documented in the Supporting Information.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Properties are evaluated through physics-based simulation workflows (not learned surrogates). Each workflow accepts a SMILES string and returns computed properties. Key software dependencies include: Open Babel, CREST, xTB, PySCF, QuickVina2, smina, and RDKit.</p>
<h3 id="hardware">Hardware</h3>
<p>Training and sampling benchmarks were conducted using 24 CPU cores (AMD Rome 7532 @ 2.40 GHz) and a single Tesla A100 GPU. Simulations were run on the Beluga, Narval, Niagara, Cedar, and Sherlock supercomputing clusters.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Tartarus">Tartarus GitHub</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Benchmark tasks, simulation workflows, model configs</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/badge/latestdoi/444879123">Zenodo Archive</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Reference datasets for all four benchmark domains</td>
      </tr>
      <tr>
          <td><a href="https://discord.gg/KypwPXTY2s">Discord Community</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Discussion and collaboration channel</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nigam, A., Pollice, R., Tom, G., Jorner, K., Willes, J., Thiede, L. A., Kundaje, A., &amp; Aspuru-Guzik, A. (2023). Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design. <em>Advances in Neural Information Processing Systems 36</em>, 3263-3306.</p>
<p><strong>Publication</strong>: NeurIPS 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/Tartarus">Tartarus GitHub Repository</a></li>
<li><a href="https://zenodo.org/badge/latestdoi/444879123">Zenodo Dataset Archive</a></li>
<li><a href="https://discord.gg/KypwPXTY2s">Discord Community</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{nigam2023tartarus,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nigam, AkshatKumar and Pollice, Robert and Tom, Gary and Jorner, Kjell and Willes, John and Thiede, Luca A. and Kundaje, Anshul and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3263--3306}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMINA Docking Benchmark for De Novo Drug Design Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/</guid><description>A docking-based benchmark for evaluating de novo drug design generative models, using SMINA scoring across eight protein targets from ChEMBL.</description><content:encoded><![CDATA[<h2 id="a-docking-based-benchmark-for-de-novo-drug-design">A Docking-Based Benchmark for De Novo Drug Design</h2>
<p>This is a <strong>Resource</strong> paper. Its primary contribution is a standardized benchmark for evaluating generative models in de novo drug design. Rather than introducing a new generative method, the paper provides a reusable evaluation framework built around molecular docking, a widely used computational proxy for predicting protein-ligand binding. The benchmark uses SMINA (a fork of <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>) to score generated molecules against eight protein targets, offering a more realistic evaluation than commonly used proxy metrics like logP or QED.</p>
<h2 id="why-existing-benchmarks-fall-short">Why Existing Benchmarks Fall Short</h2>
<p>De novo drug design methods are typically evaluated using simple proxy tasks that do not reflect the complexity of real drug discovery. The octanol-water partition coefficient (logP) can be trivially optimized by producing unrealistic molecules. The QED drug-likeness score suffers from the same issue. Neural network-based bioactivity predictors are similarly exploitable.</p>
<p>As Coley et al. (2020) note: &ldquo;The current evaluations for generative models do not reflect the complexity of real discovery problems.&rdquo;</p>
<p>More realistic evaluation approaches exist in adjacent domains (photovoltaics, excitation energies), where physical calculations are used to both train and evaluate models. Yet de novo drug design has largely relied on the same simplistic proxies. This gap between proxy task performance and real-world utility motivates the development of a docking-based benchmark that, while still a proxy, captures more of the structural complexity involved in protein-ligand interactions.</p>
<h2 id="benchmark-design-smina-docking-with-the-vinardo-scoring-function">Benchmark Design: SMINA Docking with the Vinardo Scoring Function</h2>
<p>The benchmark is defined by three components: (1) docking software that computes a ligand&rsquo;s pose in the binding site, (2) a scoring function that evaluates the pose, and (3) a training set of compounds with precomputed docking scores.</p>
<p>The concrete instantiation uses SMINA v. 2017.11.9 with the Vinardo scoring function:</p>
<p>$$S = -0.045 \cdot G + 0.8 \cdot R - 0.035 \cdot H - 0.6 \cdot B$$</p>
<p>where $S$ is the docking score, $G$ is the gauss term, $R$ is repulsion, $H$ is the hydrophobic term, and $B$ is the non-directional hydrogen bond term. The gauss and repulsion terms measure steric interactions between the ligand and the protein, while the hydrophobic and hydrogen bond terms capture favorable non-covalent contacts.</p>
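<p>Given the four summed interaction terms for a pose, the Vinardo score is just the weighted combination above; a one-line sketch (function name is ours, inputs are assumed to be the per-pose sums over atom pairs):</p>

```python
def vinardo_score(gauss: float, repulsion: float,
                  hydrophobic: float, hbond: float) -> float:
    """Vinardo docking score S from its four summed interaction terms,
    using the weights quoted above (lower is better)."""
    return -0.045 * gauss + 0.8 * repulsion - 0.035 * hydrophobic - 0.6 * hbond
```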
<p>The benchmark includes three task variants:</p>
<ol>
<li><strong>Docking Score Function</strong>: Optimize the full Vinardo docking score (lower is better).</li>
<li><strong>Repulsion</strong>: Minimize only the repulsion component, defined as:</li>
</ol>
<p>$$
R(a_1, a_2) = \begin{cases}
d(a_1, a_2)^2 &amp; d(a_1, a_2) &lt; 0 \\
0 &amp; \text{otherwise}
\end{cases}
$$</p>
<p>where $d(a_1, a_2)$ is the inter-atomic distance minus the sum of <a href="https://en.wikipedia.org/wiki/Van_der_Waals_radius">van der Waals radii</a>.</p>
<ol start="3">
<li><strong>Hydrogen Bonding</strong>: Maximize the hydrogen bond term:</li>
</ol>
<p>$$
B(a_1, a_2) = \begin{cases}
0 &amp; (a_1, a_2) \text{ do not form H-bond} \\
1 &amp; d(a_1, a_2) &lt; -0.6 \\
0 &amp; d(a_1, a_2) \geq 0 \\
\frac{d(a_1, a_2)}{-0.6} &amp; \text{otherwise}
\end{cases}
$$</p>
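<p>Both piecewise terms operate on the surface distance $d$ (inter-atomic distance minus the sum of van der Waals radii); a minimal sketch of the two definitions (function names are ours, and whether a pair can form a hydrogen bond is taken as a precomputed flag):</p>

```python
def repulsion_term(d: float) -> float:
    """Per-atom-pair repulsion: quadratic penalty for overlapping atoms
    (negative surface distance d), zero otherwise."""
    return d * d if d < 0 else 0.0


def hbond_term(d: float, forms_hbond: bool) -> float:
    """Per-atom-pair non-directional hydrogen-bond term: full credit for
    d < -0.6, zero for d >= 0, linear interpolation in between."""
    if not forms_hbond or d >= 0:
        return 0.0
    if d < -0.6:
        return 1.0
    return d / -0.6
```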
<p>Scores are averaged over the top 5 binding poses for stability. Generated compounds are filtered by <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and a minimum molecular weight of 100. Each model must generate 250 unique molecules per target.</p>
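<p>The top-5 pose averaging and the structural filter can be sketched as follows (function names are ours; the Lipinski descriptors are assumed precomputed, e.g. with RDKit):</p>

```python
def pose_score(pose_scores):
    """Benchmark convention: average the 5 best (most negative) pose
    scores for stability."""
    best = sorted(pose_scores)[:5]
    return sum(best) / len(best)


def passes_filters(mw: float, logp: float, hbd: int, hba: int) -> bool:
    """Lipinski's Rule of Five plus the benchmark's molecular weight
    floor of 100."""
    return 100 <= mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10
```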
<p>Training data comes from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, covering eight drug targets: 5-HT1B, 5-HT2B, ACM2, CYP2D6, ADRB1, MOR, A2A, and D2. Dataset sizes range from 1,082 (ADRB1) to 10,225 (MOR) molecules.</p>
<h2 id="experimental-evaluation-of-three-generative-models">Experimental Evaluation of Three Generative Models</h2>
<h3 id="models-tested">Models Tested</h3>
<p>Three popular generative models were evaluated:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">CVAE</a></strong> (Chemical Variational Autoencoder): A VAE operating on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">GVAE</a></strong> (Grammar Variational Autoencoder): Extends CVAE by enforcing grammatical correctness of generated SMILES.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></strong>: A recurrent neural network trained first on ChEMBL in a supervised manner, then fine-tuned with reinforcement learning using docking scores as rewards.</li>
</ul>
<p>For CVAE and GVAE, molecules are generated by sampling from the latent space and taking 50 gradient steps in latent space to optimize the docking score predicted by an MLP. For REINVENT, a random forest model predicts docking scores from ECFP fingerprints, and the reward combines this prediction with the QED score.</p>
<h3 id="baselines">Baselines</h3>
<p>Two baselines provide context:</p>
<ul>
<li><strong>Training set</strong>: The top 50%, 10%, and 1% of docking scores from the ChEMBL training set.</li>
<li><strong><a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC</a> subset</strong>: A random sample of ~9.2 million drug-like molecules from ZINC, with the same percentile breakdowns.</li>
</ul>
<p>Diversity is measured as the mean <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto distance</a> (using 1024-bit ECFP with radius 2) between all pairs of generated molecules.</p>
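<p>The diversity metric reduces to a mean over all unordered pairs; a minimal sketch operating on fingerprint on-bit sets (fingerprinting itself would use RDKit's 1024-bit ECFP with radius 2 and is omitted here; function names are ours):</p>

```python
from itertools import combinations


def tanimoto_distance(fp_a: set, fp_b: set) -> float:
    """1 - Tanimoto similarity between two fingerprints, represented as
    sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return 1.0 - (len(fp_a & fp_b) / union if union else 1.0)


def mean_pairwise_diversity(fingerprints) -> float:
    """Benchmark diversity: mean Tanimoto distance over all unordered
    pairs of generated molecules (higher = more diverse)."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)
```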
<h3 id="key-results">Key Results</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>5-HT1B Score</th>
          <th>5-HT1B Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking Score</td>
          <td>CVAE</td>
          <td>-4.647</td>
          <td>0.907</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>GVAE</td>
          <td>-4.955</td>
          <td>0.901</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>REINVENT</td>
          <td>-9.774</td>
          <td>0.506</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>ZINC (10%)</td>
          <td>-9.894</td>
          <td>0.862</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>ZINC (1%)</td>
          <td>-10.496</td>
          <td>0.861</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>Train (10%)</td>
          <td>-10.837</td>
          <td>0.749</td>
      </tr>
  </tbody>
</table>
<p>On the full docking score task, CVAE and GVAE fail to match even the mean ZINC docking score. REINVENT performs substantially better (e.g., -9.774 on 5-HT1B) but still falls short of the top 10% ZINC scores (-9.894) in most cases. The exception is ACM2, where REINVENT&rsquo;s score (-9.775) exceeds the ZINC 10% threshold (-8.282).</p>
<p>On the repulsion task, all three models fail to outperform the top 10% ZINC scores. On the hydrogen bonding task (the easiest), GVAE and REINVENT nearly match the top 1% ZINC scores, suggesting that optimizing individual scoring components is more tractable than the full docking score.</p>
<p>A consistent finding across all experiments is that REINVENT generates substantially less diverse molecules than the training set (e.g., 0.506 vs. 0.787 mean Tanimoto distance on 5-HT1B). The t-SNE visualizations show generated molecules clustering in a single dense region, separate from the training data, regardless of optimization target.</p>
<p>The paper also notes a moderately strong correlation between docking scores and molecular weight or the number of rotatable bonds. Generated compounds achieve better docking scores at the same molecular weight after optimization, suggesting the models learn some structural preferences rather than simply exploiting molecular size.</p>
<h2 id="limitations-of-current-generative-models-for-drug-design">Limitations of Current Generative Models for Drug Design</h2>
<p>The main finding is negative: popular generative models for de novo drug design struggle to generate molecules that dock well when trained on realistically sized datasets (1,000 to 10,000 compounds). Even the best-performing model (REINVENT) generally cannot outperform the top 10% of a random ZINC subset on the full docking score task.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Docking is itself a proxy</strong>: The SMINA docking score is only an approximation of true binding affinity. The fact that even this simpler proxy is challenging should raise concerns about these models&rsquo; readiness for real drug discovery pipelines.</li>
<li><strong>Limited model selection</strong>: Only three models were tested (CVAE, GVAE, REINVENT). The authors note that CVAE and GVAE were not designed for small training sets, and REINVENT may not represent the state of the art in all respects.</li>
<li><strong>ML-based scoring surrogate</strong>: All models use an ML model (MLP or random forest) to predict docking scores during generation, rather than running SMINA directly. This introduces an additional approximation layer.</li>
<li><strong>No similarity constraints</strong>: The benchmark does not constrain how similar generated molecules may be to the training set, so a trivial baseline could simply return the training molecules themselves.</li>
</ul>
<p>On a more positive note, the tested models perform well on the simplest subtask (hydrogen bonding), suggesting that optimizing docking scores from limited data is attainable but challenging. The benchmark has already been adopted by other groups, notably Nigam et al. (2021) for evaluating their JANUS genetic algorithm.</p>
<p>Future directions include adding similarity constraints, extending to additional protein targets, and using the benchmark to evaluate newer structure-based generative models that employ equivariant neural networks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ChEMBL (8 targets)</td>
          <td>1,082-10,225 molecules per target</td>
          <td>90/10 train/test split</td>
      </tr>
      <tr>
          <td>Baseline</td>
          <td>ZINC 15 subset</td>
          <td>~9.2M drug-like molecules</td>
          <td>In-stock, standard reactivity, drug-like</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td><a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a></td>
          <td>8 structures</td>
          <td>Cleaned with Schrodinger modeling package</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>CVAE/GVAE: Fine-tuned 5 epochs on target data, then 50 gradient steps in latent space to optimize MLP-predicted score</li>
<li>REINVENT: Pretrained on ChEMBL, fine-tuned with RL; reward = random forest prediction * QED score</li>
<li>All docking performed with SMINA v. 2017.11.9 using Vinardo scoring function in score_only mode</li>
<li>Scores averaged over top 5 binding poses</li>
<li>Filtering: Lipinski Rule of Five, minimum molecular weight 100</li>
</ul>
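<p>The per-molecule aggregation (&ldquo;averaged over top 5 binding poses&rdquo;) can be sketched as below; <code>mean_top_poses</code> is a hypothetical helper, and the convention that lower (more negative) SMINA/Vinardo scores indicate better poses is assumed:</p>

```python
def mean_top_poses(scores, k=5):
    """Average the k best (lowest) docking scores for one molecule.

    SMINA/Vinardo scores are energy-like: lower (more negative) means a
    better predicted pose, so the "top" poses are the k lowest values.
    """
    best = sorted(scores)[:k]
    return sum(best) / len(best)
```

<p>For example, with six pose scores only the five lowest contribute to the molecule&rsquo;s reported score.</p>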
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean docking score</td>
          <td>Average over 250 generated molecules</td>
          <td>Lower (more negative) is better for both the docking score and the repulsion term</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Mean Tanimoto distance (ECFP, r=2)</td>
          <td>Higher is more diverse</td>
      </tr>
      <tr>
          <td>ZINC percentile baselines</td>
          <td>Top 50%, 10%, 1% from random ZINC subset</td>
          <td>Task considered &ldquo;solved&rdquo; if the mean generated score beats the top-1% ZINC threshold</td>
      </tr>
  </tbody>
</table>
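<p>The diversity metric reduces to the Jaccard distance over each fingerprint&rsquo;s &ldquo;on&rdquo; bits. A minimal sketch, representing fingerprints as Python sets of bit indices (a simplification of real 1024-bit ECFPs):</p>

```python
from itertools import combinations

def tanimoto(a, b):
    # Tanimoto (Jaccard) similarity between two sets of "on" bits
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def internal_diversity(fingerprints):
    # Mean pairwise Tanimoto *distance* (1 - similarity) over all molecule pairs
    pairs = list(combinations(fingerprints, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)
```

<p>Two identical fingerprints contribute a distance of 0; fully disjoint ones contribute 1, so higher mean values indicate a more diverse generated set.</p>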
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/cieplinski-tobiasz/smina-docking-benchmark">smina-docking-benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmark code, data, evaluation notebooks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cieplinski, T., Danel, T., Podlewska, S., &amp; Jastrzebski, S. (2023). Generative Models Should at Least Be Able to Design Molecules That Dock Well: A New Benchmark. <em>Journal of Chemical Information and Modeling</em>, 63(11), 3238-3247. <a href="https://doi.org/10.1021/acs.jcim.2c01355">https://doi.org/10.1021/acs.jcim.2c01355</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/cieplinski-tobiasz/smina-docking-benchmark">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cieplinski2023generative,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generative Models Should at Least Be Able to Design Molecules That Dock Well: A New Benchmark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cieplinski, Tobiasz and Danel, Tomasz and Podlewska, Sabina and Jastrzebski, Stanislaw}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3238--3247}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01355}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Exposing Limitations of Molecular ML with Activity Cliffs</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/</guid><description>A benchmark of 24 ML methods on activity cliff compounds across 30 drug targets, showing descriptor-based models outperform deep learning.</description><content:encoded><![CDATA[<h2 id="a-benchmark-for-activity-cliff-prediction">A Benchmark for Activity Cliff Prediction</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>The paper systematically benchmarks 24 machine learning and deep learning approaches on their ability to predict bioactivity for activity cliff compounds: pairs of structurally similar molecules that exhibit large differences in potency. These cases violate the similarity principle (similar structure implies similar activity) and represent a practical failure mode for <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">molecular property prediction</a> in drug discovery. The authors release MoleculeACE, an open-source benchmarking platform for evaluating ML models on activity cliffs.</p>
<h2 id="activity-cliffs-as-a-blind-spot-in-molecular-ml">Activity Cliffs as a Blind Spot in Molecular ML</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Chemical_similarity">similarity principle</a> underpins most molecular ML: structurally similar compounds should have similar properties. Activity cliffs are the exceptions, where small structural changes cause large potency shifts (e.g., a single substituent change causing a 10x difference in $K_i$).</p>
<p>Despite their importance for <a href="https://en.wikipedia.org/wiki/Hit_to_lead">hit-to-lead optimization</a>, activity cliffs have received limited attention in ML benchmarking. Standard metrics like RMSE computed over entire test sets can mask poor predictions on cliff compounds. A model might achieve low overall error while systematically mispredicting these edge cases, which are precisely the molecules that matter most for medicinal chemistry applications.</p>
<p>The authors identify 7-52% of compounds as activity cliff molecules across their 30 target datasets, showing this is not a rare phenomenon.</p>
<h2 id="defining-and-detecting-activity-cliffs">Defining and Detecting Activity Cliffs</h2>
<p>The authors use three complementary similarity metrics to identify activity cliffs:</p>
<ol>
<li><strong>Substructure similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto coefficient</a> on extended connectivity fingerprints (ECFPs), capturing shared radial substructures</li>
<li><strong>Scaffold similarity</strong>: Tanimoto coefficient on ECFPs computed from molecular graph frameworks, detecting core/decoration differences</li>
<li><strong>SMILES similarity</strong>: <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> on canonical <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, capturing character-level insertions, deletions, and translocations</li>
</ol>
<p>Pairs with $\geq 90\%$ similarity on <strong>any one</strong> of the three metrics and $&gt; 10\times$ difference in bioactivity ($K_i$ or $\text{EC}_{50}$) are classified as activity cliff pairs. This union-based approach (rather than requiring agreement across all metrics) captures different types of structural relationships relevant to medicinal chemistry.</p>
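<p>A minimal sketch of this union rule, assuming the three pairwise similarities have already been computed (fingerprint generation and SMILES canonicalization are omitted):</p>

```python
def is_activity_cliff(similarities, potency_a, potency_b,
                      sim_threshold=0.9, fold_change=10.0):
    """Classify a molecule pair as an activity cliff.

    similarities: (substructure, scaffold, SMILES) similarity values in [0, 1]
    potency_a/b:  bioactivity values (e.g., Ki or EC50) on a linear scale
    """
    # Union rule: any ONE metric crossing the threshold suffices
    structurally_similar = any(s >= sim_threshold for s in similarities)
    # > 10-fold difference in bioactivity
    ratio = max(potency_a, potency_b) / min(potency_a, potency_b)
    return structurally_similar and ratio > fold_change
```

<p>A pair that is near-identical on only the substructure metric but differs 24-fold in $K_i$ qualifies; the same pair at a 4-fold gap does not.</p>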
<h2 id="24-methods-across-30-drug-targets">24 Methods Across 30 Drug Targets</h2>
<p>The benchmark evaluates 16 traditional ML configurations (4 algorithms $\times$ 4 descriptor types) and 8 deep learning approaches across 30 curated <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v29 datasets (48,707 total molecules).</p>
<p><strong>Traditional ML algorithms</strong>: KNN, RF, GBM, SVM, each combined with ECFPs, MACCS keys, WHIM descriptors, or physicochemical properties.</p>
<p><strong>Deep learning methods</strong>: MPNN, GCN, GAT, Attentive FP (graph-based), plus LSTM, CNN, Transformer/<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a> (SMILES-based), and an MLP on ECFPs.</p>
<p>Performance is measured with both standard RMSE and a dedicated $\text{RMSE}_{\text{cliff}}$ computed only on activity cliff compounds in the test set:</p>
<p>$$
\text{RMSE}_{\text{cliff}} = \sqrt{\frac{\sum_{j=1}^{n_c} (\hat{y}_j - y_j)^2}{n_c}}
$$</p>
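<p>Both metrics can be computed over a shared test set in a few lines; here <code>is_cliff</code> is an assumed boolean mask marking the activity cliff compounds:</p>

```python
import math

def rmse(preds, targets):
    # Standard root-mean-square error over all test molecules
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

def rmse_cliff(preds, targets, is_cliff):
    # Restrict the RMSE to activity cliff compounds only
    cliff = [(p, t) for p, t, c in zip(preds, targets, is_cliff) if c]
    return rmse([p for p, _ in cliff], [t for _, t in cliff])
```

<p>The point of the separate metric is visible even in a toy case: a model can have low overall RMSE while its error on the cliff subset is much larger.</p>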
<p>Key results:</p>
<ul>
<li><strong>Molecular descriptors matter more than algorithms</strong>: The choice of descriptor (ECFPs vs. MACCS vs. WHIM vs. physicochemical) had a larger impact on $\text{RMSE}_{\text{cliff}}$ than the choice of ML algorithm ($p &lt; 0.05$, <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Wilcoxon rank-sum test</a> with <a href="https://en.wikipedia.org/wiki/False_discovery_rate">Benjamini-Hochberg correction</a>).</li>
<li><strong>SVM + ECFPs wins on average</strong>: The best overall method for activity cliff prediction, though the difference from RF + ECFPs or GBM + ECFPs was not statistically significant.</li>
<li><strong>Deep learning underperforms</strong>: All graph and SMILES-based deep learning methods performed worse than a simple MLP on ECFPs. Among deep learning, LSTM with transfer learning (pretrained on 36K molecules) was the best, outperforming the ChemBERTa transformer pretrained on 10M compounds.</li>
<li><strong>Large case-by-case variation</strong>: $\text{RMSE}_{\text{cliff}}$ ranged from 0.62 to 1.60 log units across datasets, with no method consistently best. Deep learning methods showed the highest variance across targets.</li>
</ul>
<h2 id="simple-descriptors-beat-complex-architectures-on-cliffs">Simple Descriptors Beat Complex Architectures on Cliffs</h2>
<p>The core finding is that activity cliffs expose a gap in learned molecular representations. Despite graph neural networks and transformers being able to learn directly from molecular structure, they fail to capture the subtle structural differences that drive activity cliffs.</p>
<p>Key observations:</p>
<ul>
<li><strong>RMSE and $\text{RMSE}_{\text{cliff}}$ correlate ($r = 0.81$ on average)</strong>, so optimizing overall error usually helps with cliffs too. But this correlation breaks down for some targets (e.g., CLK4), where methods with similar RMSE can have very different $\text{RMSE}_{\text{cliff}}$.</li>
<li><strong>Training set size matters for the RMSE/$\text{RMSE}_{\text{cliff}}$ correlation</strong>: Datasets with $&gt; 1000$ training molecules show $r &gt; 0.80$ between the two metrics. In low-data regimes, the correlation weakens, making dedicated cliff evaluation more important.</li>
<li><strong>No relationship between % cliff compounds and model performance</strong>, and no target-family-specific effects were found.</li>
<li><strong>Transfer learning helped SMILES models (LSTM) but not graph models</strong>: Self-supervised pretraining strategies (context prediction, infomax, edge prediction, masking) did not improve GNN performance, consistent with findings from other studies.</li>
</ul>
<p>The MoleculeACE platform provides standardized data curation, activity cliff detection, and cliff-specific evaluation, enabling researchers to assess new methods against this benchmark.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Source</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmarking</td>
          <td>ChEMBL v29</td>
          <td>48,707 molecules (35,632 unique) across 30 targets</td>
          <td>Curated for duplicates, salts, outliers</td>
      </tr>
      <tr>
          <td>Smallest dataset</td>
          <td>JAK1</td>
          <td>615 molecules</td>
          <td>7% activity cliffs</td>
      </tr>
      <tr>
          <td>Largest dataset</td>
          <td>DRD3</td>
          <td>3,657 molecules</td>
          <td>39% activity cliffs</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Activity cliff detection</strong>: Pairwise similarity $\geq 0.9$ (Tanimoto on ECFPs, scaffold ECFPs, or Levenshtein on SMILES) with $&gt; 10\times$ potency difference</li>
<li><strong>Splitting</strong>: <a href="https://en.wikipedia.org/wiki/Spectral_clustering">Spectral clustering</a> on ECFPs (5 clusters), 80/20 stratified split preserving cliff proportion</li>
<li><strong>Hyperparameter optimization</strong>: <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian optimization</a> with Gaussian process, max 50 combinations, 5-fold cross-validation</li>
<li><strong>SMILES augmentation</strong>: 10-fold for all SMILES-based methods</li>
<li><strong>Transfer learning</strong>: LSTM pretrained on 36,281 merged training molecules (next-character prediction); ChemBERTa pretrained on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> compounds</li>
</ul>
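<p>The SMILES-based cliff criterion relies on Levenshtein distance; a standard dynamic-programming implementation follows. The conversion to a $[0, 1]$ similarity by normalizing against the longer string is an assumption for illustration, not necessarily the paper&rsquo;s exact formula:</p>

```python
def levenshtein(s, t):
    # Classic DP over one rolling row: prev[j] holds the edit
    # distance between the current prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def smiles_similarity(a, b):
    # Hypothetical normalization: 1 at identity, 0 at maximal distance
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
```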
<h3 id="models">Models</h3>
<ul>
<li><strong>Traditional ML</strong>: KNN, RF, GBM, SVM (scikit-learn v1.0.2)</li>
<li><strong>Descriptors</strong>: ECFPs (1024-bit, radius 2), MACCS keys (166-bit), WHIM (114 descriptors), physicochemical (11 properties)</li>
<li><strong>GNNs</strong>: MPNN, GCN, GAT, AFP (PyTorch Geometric v2.0.4), with graph multiset transformer pooling</li>
<li><strong>SMILES models</strong>: LSTM (4 layers, 5.8M params), 1D CNN, ChemBERTa transformer</li>
<li><strong>Total models trained</strong>: 720 (24 methods $\times$ 30 targets)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE</td>
          <td>All test molecules</td>
          <td>Standard root-mean-square error on $\text{pK}_i$ / $\text{pEC}_{50}$</td>
      </tr>
      <tr>
          <td>$\text{RMSE}_{\text{cliff}}$</td>
          <td>Activity cliff compounds only</td>
          <td>RMSE restricted to cliff molecules in test set</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE">MoleculeACE</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Benchmark platform with all 30 curated datasets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE/tree/main/MoleculeACE/Data/benchmark_data">Curated datasets</a></td>
          <td>Data</td>
          <td>MIT</td>
          <td>Processed ChEMBL bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: van Tilborg, D., Alenicheva, A., &amp; Grisoni, F. (2022). Exposing the Limitations of Molecular Machine Learning with Activity Cliffs. <em>Journal of Chemical Information and Modeling</em>, 62(23), 5938-5951. <a href="https://doi.org/10.1021/acs.jcim.2c01073">https://doi.org/10.1021/acs.jcim.2c01073</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molML/MoleculeACE">MoleculeACE GitHub Repository</a></li>
<li><a href="https://chemrxiv.org/engage/chemrxiv/article-details/630cc44058843b8403a19810">ChemRxiv Preprint</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{vantilborg2022activity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exposing the Limitations of Molecular Machine Learning with Activity Cliffs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{van Tilborg, Derek and Alenicheva, Alisa and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5938--5951}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01073}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ZINC-22: A Multi-Billion Scale Database for Ligand Discovery</title><link>https://hunterheidenreich.com/notes/computational-chemistry/datasets/zinc-22/</link><pubDate>Sat, 27 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/datasets/zinc-22/</guid><description>The ZINC-22 dataset provides over 37 billion make-on-demand molecules enabling virtual screening and modern drug discovery.</description><content:encoded><![CDATA[<h2 id="key-contribution-scaling-make-on-demand-libraries">Key Contribution: Scaling Make-on-Demand Libraries</h2>
<p>ZINC-22 addresses the critical infrastructure challenges of managing multi-billion-scale libraries of make-on-demand chemical compounds through a federated database architecture, the CartBlanche web interface, and cloud distribution systems that enable modern virtual screening.</p>
<h2 id="overview">Overview</h2>
<p>ZINC-22 is a multi-billion scale public database of commercially available chemical compounds designed for virtual screening. It contains over 37 billion make-on-demand molecules, federated across many independent databases to stay within per-instance indexing limits. For structural biology pipelines, it provides 4.5 billion ready-to-dock 3D conformations alongside pre-calculated pH-specific protonation states, tautomers, and AMSOL partial charges.</p>
<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/zinc-22-sample.webp"
         alt="ZINC-22&#39;s 2D Tranche Browser"
         title="ZINC-22&#39;s 2D Tranche Browser"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">ZINC-22&rsquo;s 2D Tranche Browser showing the organization of 37.2 billion molecules by physicochemical properties</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Database</strong></td>
          <td>37B+</td>
          <td>Complete 2D chemical structures from make-on-demand catalogs (Enamine REAL, Enamine REAL Space, WuXi GalaXi, Mcule Ultimate)</td>
      </tr>
      <tr>
          <td><strong>3D Database</strong></td>
          <td>4.5B+</td>
          <td>Ready-to-dock 3D conformations with pre-calculated charges and solvation energies</td>
      </tr>
      <tr>
          <td><strong>Custom Tranches</strong></td>
          <td>Variable</td>
          <td>User-selected molecular subsets via Tranche Browser (e.g., lead-like, fragment-like)</td>
      </tr>
  </tbody>
</table>
<h2 id="use-cases">Use Cases</h2>
<p>ZINC-22 is designed for ultra-large virtual screening (ULVS), analog searching, and molecular docking campaigns. The Tranche Browser enables targeted subset selection (e.g., lead-like, fragment-like) for screening, and the CartBlanche interface supports both interactive and programmatic access to the database. The authors note that as the database grows, docking can identify better-fitting molecules.</p>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ZINC-20</strong></td>
          <td>Predecessor</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Enamine REAL</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>WuXi GalaXi</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Massive scale</strong>: 37+ billion purchasable compounds from major vendors (Enamine, WuXi, Mcule)</li>
<li><strong>Federated architecture</strong>: Supports asynchronous building and horizontal scaling to trillion-molecule growth</li>
<li><strong>Platform access</strong>: CartBlanche GUI provides a shopping cart metaphor for compound acquisition</li>
<li><strong>Privacy protection</strong>: Dual public/private server clusters protect patentability of undisclosed catalogs</li>
<li><strong>Chemical diversity</strong>: Linear growth (1 new scaffold per 10 molecules added), with 96.3M+ unique Bemis-Murcko scaffolds</li>
<li><strong>Ready-to-dock</strong>: 3D models include pre-calculated charges, protonation states, and solvation energies</li>
<li><strong>Cloud distribution</strong>: Available via AWS Open Data, Oracle OCI, and UCSF servers</li>
<li><strong>Scale-aware search</strong>: SmallWorld (similarity) and Arthor (substructure) tools partitioned to address specific constraints of billion-scale queries</li>
<li><strong>Organized access</strong>: Tranche system enables targeted selection of chemical space</li>
<li><strong>Open access</strong>: Entire database freely available to academic and commercial users</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Data Transfer Bottlenecks</strong>: Distributing 4.5 billion 3D conformations in a standard rigid format (like db2 flexibase) requires roughly 1 petabyte of storage. Transferring this takes months over standard gigabit connections, effectively mandating cloud-based compilation and rendering local copies impractical.</li>
<li><strong>Search Result Caps</strong>: Interactive Arthor searches are capped at 20,000 molecules to maintain a reliable public service. Users needing more results can use the asynchronous Arthor search tool via TLDR, which sends results by email.</li>
<li><strong>Enumeration Ceiling</strong>: Scaling relies entirely on PostgreSQL sharding. To continue using rigid docking tools, the database must fully enumerate structural states. The authors acknowledge that hardware limitations will likely cap full database enumeration well before the 10-trillion molecule mark, forcing future pipelines to accommodate unenumerated combinatorial fragment spaces.</li>
<li><strong>Download Workflow</strong>: Individual 3D molecule downloads are unavailable directly; researchers must rebuild them via the TLDR tool.</li>
<li><strong>Vendor Updates</strong>: There is difficulty removing discontinued vendor molecules due to the federated structure.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="hardware--software">Hardware &amp; Software</h3>
<p><strong>Compute infrastructure</strong>:</p>
<ul>
<li>1,700 cores across 14 computers for parallel processing</li>
<li>174 independent PostgreSQL 12.0 databases (110 &lsquo;Sn&rsquo; for ZINC-ID, 64 &lsquo;Sb&rsquo; for Supplier Codes)</li>
<li>Distributed across Amazon AWS, Oracle OCI, and UCSF servers</li>
</ul>
<p><strong>Software stack</strong>:</p>
<ul>
<li>PostgreSQL 12.2</li>
<li>Python 3.6.8</li>
<li>RDKit 2020.03</li>
<li>Celery task queue with Redis for background processing</li>
<li>All code available on GitHub: docking-org/zinc22-2d, zinc22-3d</li>
</ul>
<h3 id="data-organization--access">Data Organization &amp; Access</h3>
<p><strong>Tranche system</strong>: Molecules organized into &ldquo;Tranches&rdquo; based on 4 dimensions:</p>
<ol>
<li>Heavy Atom Count</li>
<li>Lipophilicity (LogP)</li>
<li>Charge</li>
<li>File Format</li>
</ol>
<p>This enables downloading specific chemical neighborhoods (e.g., neutral lead-like molecules) without accessing the entire database.</p>
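<p>As an illustration of the idea (the bin edges and key format here are hypothetical, not ZINC-22&rsquo;s actual tranche naming scheme), a molecule can be routed to a tranche by discretizing these properties:</p>

```python
def tranche_key(heavy_atoms, logp, charge):
    """Map a molecule's properties to a coarse tranche identifier.

    Bin widths are illustrative: one bin per heavy-atom count (capped)
    and 1-log-unit logP bins; charge is kept as-is.
    """
    hac_bin = min(heavy_atoms, 40)
    logp_bin = int(logp // 1.0)
    return (hac_bin, logp_bin, charge)
```

<p>A neutral molecule with 24 heavy atoms and logP 2.7 lands in tranche <code>(24, 2, 0)</code>, so a download can target just that chemical neighborhood.</p>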
<p><strong>Search infrastructure</strong>:
Searching at the billion-molecule scale exceeds what fits in rapid-access memory, so ZINC-22 splits retrieval between two distinct algorithms:</p>
<ul>
<li>
<p><strong>SmallWorld</strong>: Handles whole-molecule similarity using Graph Edit Distance (GED). GED defines the minimum cost of operations (node/edge insertions, deletions, or substitutions) required to transform graph $G_1$ into graph $G_2$:</p>
<p>$$
\text{GED}(G_1, G_2) = \min_{(e_1, \ldots, e_k) \in \mathcal{P}(G_1, G_2)} \sum_{i=1}^k c(e_i)
$$</p>
<p>Because SmallWorld searches pre-calculated anonymous graphs, it evaluates close neighbors in near $\mathcal{O}(1)$ time and scales sub-linearly, though it struggles with highly distant structural matches.</p>
</li>
<li>
<p><strong>Arthor</strong>: Provides exact substructure and pattern matching. It scales linearly $\mathcal{O}(N)$ with database size and successfully finds distant hits (e.g., PAINS filters), but performance heavily degrades if the index exceeds available RAM.</p>
</li>
<li>
<p><strong>CartBlanche</strong>: Web interface wrapping these search tools with shopping cart functionality.</p>
</li>
</ul>
<h3 id="3d-generation-pipeline">3D Generation Pipeline</h3>
<p>The 3D database construction pipeline involves multiple specialized tools:</p>
<ol>
<li><strong>ChemAxon JChem</strong>: Protonation state and tautomer generation at physiological pH</li>
<li><strong>Corina</strong>: Initial 3D structure generation</li>
<li><strong>Omega</strong>: Conformation sampling</li>
<li><strong>AMSOL 7.1</strong>: Calculation of atomic partial charges and desolvation energies</li>
<li><strong>Strain calculation</strong>: Relative energies of conformations</li>
</ol>
<p>At sustained throughput, the pipeline builds approximately 11 million molecules per day, each with hundreds of pre-calculated conformations.</p>
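<p>Back-of-the-envelope, that throughput implies roughly $4.5 \times 10^9 / 1.1 \times 10^7 \approx 409$ days of sustained compute to build the current 3D subset; a sketch:</p>

```python
def build_days(n_molecules, per_day=11_000_000):
    # Days of sustained pipeline throughput needed for n_molecules,
    # at the ~11M molecules/day rate reported in the paper
    return n_molecules / per_day
```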
<h3 id="chemical-diversity-analysis">Chemical Diversity Analysis</h3>
<p>A core debate in billion-scale library generation involves whether continuous enumeration merely yields repetitive derivatives. Analysis of Bemis-Murcko (BM) scaffolds demonstrates that chemical diversity in ZINC-22 continues to grow, but scales sub-linearly based on a power law. Specifically, the authors observe a $\log$ increase in BM scaffolds for every two $\log$ increase in database size:</p>
<p>$$
\log(\text{Scaffolds}_{BM}) \propto 0.5 \log(\text{Molecules})
$$</p>
<p>This suggests that while diversity does not saturate, it grows proportionally to the square root of the library size ($\mathcal{O}(\sqrt{N})$). The majority of this scaffold novelty stems from compounds with the highest heavy atom counts (HAC 24-25), which contribute roughly twice as many unique core structures as the combined HAC 06-23 subset.</p>
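<p>Under this power law, growing the library 100-fold yields only about a 10-fold increase in unique scaffolds; a minimal sketch of the scaling relationship:</p>

```python
import math

def scaffold_growth(growth_factor):
    # If scaffolds ~ sqrt(molecules), a growth_factor increase in
    # library size multiplies the scaffold count by sqrt(growth_factor)
    return math.sqrt(growth_factor)
```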
<h3 id="vendor-integration">Vendor Integration</h3>
<p>ZINC-22 is built from five source catalogs with the following approximate sizes:</p>
<ul>
<li><strong>Enamine REAL Database</strong>: 5 billion compounds</li>
<li><strong>Enamine REAL Space</strong>: 29 billion compounds</li>
<li><strong>WuXi GalaXi</strong>: 2.5 billion compounds</li>
<li><strong>Mcule Ultimate</strong>: 128 million compounds</li>
<li><strong>ZINC20 in-stock</strong>: 4 million compounds (incorporated as layer &ldquo;g&rdquo;)</li>
</ul>
<p>This focus on purchasable, make-on-demand molecules distinguishes ZINC-22 from theoretical chemical space databases. ZINC20 continues to be maintained separately for smaller catalogs and in-stock compounds.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://cartblanche22.docking.org/">CartBlanche web interface</a></td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Web GUI for searching and downloading ZINC-22</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>2D curation and loading pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>3D building pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>CartBlanche22 web application</td>
      </tr>
      <tr>
          <td>AWS Open Data / Oracle OCI</td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Cloud-hosted 3D database mirrors</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Data Availability</strong>: The compiled database is openly accessible and searchable through the <a href="https://cartblanche22.docking.org/">CartBlanche web interface</a>. Subsets can be downloaded, and programmatic access is provided via curl, wget, and Globus.</li>
<li><strong>Code &amp; Algorithms</strong>: The source code for database construction, parallel processing, and querying is open-source.
<ul>
<li>2D Pipeline: <a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></li>
<li>3D Pipeline: <a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></li>
<li>CartBlanche: <a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></li>
<li>TLDR modules: docking-org/TLDR and docking-org/tldr-modules (repositories no longer available)</li>
</ul>
</li>
<li><strong>Software Dependencies</strong>: While the orchestration code is public, the 3D structure generation relies on commercial software that requires separate licenses (CORINA, OpenEye OMEGA, ChemAxon JChem). This limits end-to-end reproducibility for researchers without access to these tools.</li>
<li><strong>Hardware Limitations</strong>: Recreating the entire 37+ billion molecule database from raw vendor catalogs requires approximately 1,700 CPU cores and petabytes of data transfer, restricting full recreation to large institutional clusters or substantial cloud compute budgets.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Tingle, B. I., Tang, K. G., Castanon, M., Gutierrez, J. J., Khurelbaatar, M., Dandarchuluun, C., Moroz, Y. S., and Irwin, J. J. (2023). ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. <em>Journal of Chemical Information and Modeling</em>, 63(4), 1166&ndash;1176. <a href="https://doi.org/10.1021/acs.jcim.2c01253">https://doi.org/10.1021/acs.jcim.2c01253</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Tingle_2023,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{1549-960X}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society (ACS)}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tingle, Benjamin I. and Tang, Khanh G. and Castanon, Mar and Gutierrez, John J. and Khurelbaatar, Munkhzul and Dandarchuluun, Chinzorig and Moroz, Yurii S. and Irwin, John J.}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span>=<span style="color:#e6db74">{Feb}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1166--1176}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>