<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Autoregressive Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/</link><description>Recent content in Autoregressive Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/index.xml" rel="self" type="application/rss+xml"/><item><title>LSTM Neural Network for Drug-Like Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/</guid><description>An LSTM neural network trained on 509K ChEMBL SMILES generates one million novel drug-like molecules with realistic substructures and bioactivity profiles.</description><content:encoded><![CDATA[<h2 id="an-early-method-for-lstm-based-molecular-generation">An Early Method for LSTM-Based Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that applies character-level LSTM networks to the task of de novo drug-like molecule generation. The primary contribution is demonstrating that an LSTM trained on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings from a large bioactive compound database (ChEMBL) can produce novel, diverse molecules whose chemical properties closely match those of known drug-like compounds. The paper also validates the generated molecules through virtual screening with profile QSAR models, showing comparable predicted bioactivity to the training set.</p>
<h2 id="the-challenge-of-exploring-drug-like-chemical-space">The Challenge of Exploring Drug-Like Chemical Space</h2>
<p>The theoretical space of drug-like molecules is astronomically large. Brute-force enumeration approaches such as <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a> (which catalogued 166 billion molecules) are feasible only for small molecules, and full enumeration of molecules with 25-30 heavy atoms (the typical size of drug molecules) remains computationally intractable. Traditional cheminformatics approaches to sampling this space rely on fragment combination, evolutionary algorithms, or particle swarm optimization.</p>
<p>The authors position LSTM networks as a viable alternative. LSTMs had already demonstrated the ability to learn sequential structure in domains like text and music generation, making them natural candidates for learning SMILES grammar and generating novel valid molecular strings. At the time of writing (late 2017), several groups were exploring this direction, including Bjerrum and Threlfall (ZINC-based generation), <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> (VAE-based latent space design), <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">Olivecrona et al.</a> (RL-guided generation), and Segler et al. (focused library design). This paper contributes a large-scale empirical study with detailed analysis of the generated molecules&rsquo; chemical quality.</p>
<h2 id="character-level-lstm-with-temperature-based-sampling">Character-Level LSTM with Temperature-Based Sampling</h2>
<p>The core approach is straightforward: train an LSTM to predict the next character in a SMILES string, then sample from the trained model to generate new molecules character by character.</p>
<p>The network architecture consists of:</p>
<ul>
<li>Two stacked LSTM layers (which learn the SMILES grammar)</li>
<li>A dropout layer for regularization</li>
<li>A dense output layer with 23 neurons (one per character in the reduced SMILES alphabet) and softmax activation</li>
</ul>
<p>Training used the RMSProp optimizer, with the learning rate annealed gradually from 0.01 to 0.0002. At generation time, a temperature parameter controls the randomness of character sampling, encouraging more diverse structures rather than close reproductions of training molecules.</p>
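<p>The paper does not spell out the sampling formula; the standard temperature-scaled softmax it alludes to can be sketched as follows (a minimal pure-Python sketch; the function name is illustrative):</p>

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from unnormalized logits after temperature scaling.

    Lower temperatures sharpen the distribution (conservative, training-like
    output); higher temperatures flatten it (more diverse, riskier SMILES).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

<p>At temperature 1.0 this is ordinary softmax sampling; as the temperature approaches zero it degenerates to greedy argmax decoding.</p>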
<p>A key preprocessing step reduces the SMILES alphabet to 23 characters. Multi-character atom tokens are replaced with single characters (<code>Cl</code> → <code>L</code>, <code>Br</code> → <code>R</code>, <code>[nH]</code> → <code>A</code>). Only the organic atom subset (<code>H</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>S</code>, <code>P</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, <code>I</code>) is retained. Charged molecules, stereo information, and molecules with more than 5 ring closures are excluded. The training corpus totals 23,664,668 characters, with 40-character windows used as input sequences during training.</p>
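<p>The token recoding is plain string substitution; a minimal sketch of the three replacements named above (helper names are illustrative):</p>

```python
# Single-character recoding of multi-character SMILES tokens, as described
# in the paper (Cl -> L, Br -> R, [nH] -> A). The substitution must be
# reversed before handing generated strings to a chemistry toolkit.
REPLACEMENTS = [("Cl", "L"), ("Br", "R"), ("[nH]", "A")]

def encode_smiles(smiles: str) -> str:
    for token, char in REPLACEMENTS:
        smiles = smiles.replace(token, char)
    return smiles

def decode_smiles(encoded: str) -> str:
    for token, char in reversed(REPLACEMENTS):
        encoded = encoded.replace(char, token)
    return encoded
```

<p>For example, <code>encode_smiles("Clc1ccc(Br)cc1")</code> yields <code>"Lc1ccc(R)cc1"</code>, and decoding restores the original string.</p>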
<h2 id="training-on-chembl-and-generating-one-million-molecules">Training on ChEMBL and Generating One Million Molecules</h2>
<h3 id="training-data">Training Data</h3>
<p>The training set consists of 509,000 bioactive molecules from ChEMBL with reported activity below 10 micromolar on any target.</p>
<h3 id="generation-and-filtering">Generation and Filtering</h3>
<p>The LSTM generates SMILES strings character by character. The generated strings then undergo a two-stage validation:</p>
<ol>
<li><strong>Bracket and ring closure check</strong> (fast, text-based): 54% of generated SMILES are discarded for unpaired brackets or ring closures</li>
<li><strong>Full chemical parsing with RDKit</strong>: a further 14% fail due to unrealistic aromatic systems or incorrect valences</li>
</ol>
<p>The remaining 32% of generated SMILES correspond to valid molecules.</p>
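<p>The first-stage filter needs no chemistry at all; a sketch of such a text-level check (an approximation of the paper's unspecified implementation, ignoring two-digit <code>%nn</code> ring closures, which the reduced training set avoids by excluding molecules with more than 5 ring closures):</p>

```python
def passes_fast_check(smiles: str) -> bool:
    """Cheap text-level filter: balanced (), [] and paired ring-closure digits.

    Mirrors the paper's first validation stage; survivors still need full
    chemical parsing (e.g., RDKit) to catch valence and aromaticity errors.
    """
    depth_paren = depth_brack = 0
    digit_counts = {}
    for ch in smiles:
        if ch == "(":
            depth_paren += 1
        elif ch == ")":
            depth_paren -= 1
            if depth_paren < 0:
                return False
        elif ch == "[":
            depth_brack += 1
        elif ch == "]":
            depth_brack -= 1
            if depth_brack < 0:
                return False
        elif ch.isdigit():
            digit_counts[ch] = digit_counts.get(ch, 0) + 1
    return (depth_paren == 0 and depth_brack == 0
            and all(n % 2 == 0 for n in digit_counts.values()))
```

<p>Benzene (<code>c1ccccc1</code>) passes, while truncated strings such as <code>c1ccccc</code> or <code>C(C(C)C</code> are rejected before any parser is invoked.</p>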
<p>One million valid molecules were generated in under 2 hours on 300 CPUs.</p>
<h3 id="novelty-and-diversity">Novelty and Diversity</h3>
<p>Out of one million generated molecules, only 2,774 (0.28%) were identical to molecules in the training ChEMBL set. The generated set contained 627,000 unique scaffolds compared to 172,000 in ChEMBL, with an overlap of only 18,000 scaffolds. This demonstrates substantial novelty and diversity.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Calculated molecular descriptors (molecular weight, logP, and topological polar surface area) for the generated molecules closely matched the distributions of the ChEMBL training set. The synthetic accessibility score distributions were also practically identical, indicating comparable molecular complexity.</p>
<h3 id="substructure-feature-comparison">Substructure Feature Comparison</h3>
<p>The paper compares substructure features across three molecule sets: ChEMBL training data, LSTM-generated molecules, and a naive SMILES baseline generator. The naive generator uses only character frequency statistics and basic SMILES syntax rules, producing primarily macrocycles with very few fused aromatic systems.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>ChEMBL (%)</th>
          <th>LSTM Generated (%)</th>
          <th>Naive Baseline (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No rings</td>
          <td>0.4</td>
          <td>0.4</td>
          <td>0.1</td>
      </tr>
      <tr>
          <td>1 ring</td>
          <td>2.8</td>
          <td>4.3</td>
          <td>13.2</td>
      </tr>
      <tr>
          <td>2 rings</td>
          <td>14.8</td>
          <td>23.1</td>
          <td>17.7</td>
      </tr>
      <tr>
          <td>3 rings</td>
          <td>32.2</td>
          <td>43.5</td>
          <td>27.3</td>
      </tr>
      <tr>
          <td>4 rings</td>
          <td>32.7</td>
          <td>23.9</td>
          <td>25.2</td>
      </tr>
      <tr>
          <td>&gt;4 rings</td>
          <td>17.2</td>
          <td>4.8</td>
          <td>16.5</td>
      </tr>
      <tr>
          <td>Fused aromatic rings</td>
          <td>38.8</td>
          <td>30.9</td>
          <td>0.2</td>
      </tr>
      <tr>
          <td>Large rings (&gt;8)</td>
          <td>0.4</td>
          <td>1.8</td>
          <td>75.9</td>
      </tr>
      <tr>
          <td>Spiro rings</td>
          <td>1.9</td>
          <td>0.6</td>
          <td>0.6</td>
      </tr>
      <tr>
          <td>Contains N</td>
          <td>96.5</td>
          <td>96.1</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td>Contains O</td>
          <td>93.0</td>
          <td>92.0</td>
          <td>85.5</td>
      </tr>
      <tr>
          <td>Contains S</td>
          <td>35.6</td>
          <td>27.9</td>
          <td>39.6</td>
      </tr>
      <tr>
          <td>Contains halogen</td>
          <td>40.7</td>
          <td>38.8</td>
          <td>49.4</td>
      </tr>
  </tbody>
</table>
<p>The LSTM-generated molecules closely mirror the ChEMBL distributions, while the naive generator fails to capture drug-like structural patterns. The LSTM tends to slightly over-represent 2-3 ring systems and under-represent 4+ ring systems relative to ChEMBL. Functional group distributions also closely matched between ChEMBL and the LSTM output.</p>
<h3 id="virtual-screening-validation">Virtual Screening Validation</h3>
<p>The generated molecules were evaluated using profile QSAR models for 159 ChEMBL kinase assays. The six best models (with realistic test set R-squared &gt; 0.75) were used to predict pIC50 values for both actual ChEMBL compounds and generated compounds. The cumulative frequency distributions of predicted activity were nearly identical between the two sets.</p>
<p>Kolmogorov-Smirnov (KS) tests on random samples of 1,000 compounds confirmed this quantitatively:</p>
<table>
  <thead>
      <tr>
          <th>Assay</th>
          <th>KS D</th>
          <th>Distributions Differ?</th>
          <th>Mean (Real)</th>
          <th>Mean (Gen)</th>
          <th>Stdev (Real)</th>
          <th>Stdev (Gen)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>688395</td>
          <td>6.01%</td>
          <td>No</td>
          <td>4.66</td>
          <td>4.69</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>668624</td>
          <td>3.60%</td>
          <td>No</td>
          <td>4.86</td>
          <td>4.86</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>9.90%</td>
          <td>Yes</td>
          <td>5.33</td>
          <td>5.26</td>
          <td>0.34</td>
          <td>0.30</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>4.30%</td>
          <td>No</td>
          <td>5.18</td>
          <td>5.13</td>
          <td>0.47</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>688781</td>
          <td>2.20%</td>
          <td>No</td>
          <td>4.83</td>
          <td>4.82</td>
          <td>0.26</td>
          <td>0.25</td>
      </tr>
      <tr>
          <td>809170</td>
          <td>8.70%</td>
          <td>Yes</td>
          <td>5.12</td>
          <td>5.07</td>
          <td>0.51</td>
          <td>0.46</td>
      </tr>
  </tbody>
</table>
<p>For 4 of 6 models, the null hypothesis that the distributions are the same could not be rejected at the 95% confidence level (critical D = 6.04%). Even for the two assays where the KS test rejected the null hypothesis, the maximum vertical distance between distributions was below 10%.</p>
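<p>The D statistic used here is the maximum vertical gap between two empirical CDFs; a self-contained sketch:</p>

```python
import bisect

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov D: the maximum vertical distance
    between the empirical CDFs of xs and ys."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in sorted(set(xs) | set(ys)):
        cdf_x = bisect.bisect_right(xs, v) / len(xs)
        cdf_y = bisect.bisect_right(ys, v) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d
```

<p>For two samples of 1,000 compounds each, the asymptotic 95% critical value is roughly 1.36&middot;&radic;(2/1000) &asymp; 6.1%, consistent with the critical D of 6.04% quoted above.</p>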
<h2 id="generated-molecules-are-novel-drug-like-and-potentially-bioactive">Generated Molecules Are Novel, Drug-Like, and Potentially Bioactive</h2>
<p>The key findings of this study are:</p>
<ol>
<li><strong>High novelty</strong>: Only 0.28% of generated molecules match training compounds; 627K novel scaffolds were produced versus 172K in ChEMBL</li>
<li><strong>Drug-like quality</strong>: Physicochemical properties, substructure features, functional group distributions, and synthetic accessibility scores all closely match the ChEMBL training distribution, without these being explicit constraints</li>
<li><strong>Predicted bioactivity</strong>: Virtual screening with profile QSAR models shows the generated molecules have comparable predicted activity profiles to known bioactive compounds</li>
<li><strong>Scalability</strong>: One million valid molecules in under 2 hours on 300 CPUs, with the potential to scale to billions with GPU acceleration</li>
<li><strong>LSTM superiority over naive baselines</strong>: A simple statistical SMILES generator using only character frequencies produces chemically unrealistic molecules (mostly macrocycles), demonstrating that the LSTM genuinely learns drug-like chemical patterns</li>
</ol>
<p>The main limitations are the 32% validity rate (68% of generated SMILES are invalid), the exclusion of stereochemistry and charged molecules from the training set, and the lack of any goal-directed generation capability (the model produces unconditional samples from the training distribution). The code was described as &ldquo;available on request&rdquo; from the corresponding author rather than publicly released.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL bioactive molecules</td>
          <td>509,000 molecules</td>
          <td>Activity &lt; 10 uM on any target; organic atoms only; no charges or stereo</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Double-stacked LSTM layers with dropout</li>
<li>Softmax output over 23-character reduced SMILES alphabet</li>
<li>RMSProp optimizer with learning rate annealed from 0.01 to 0.0002</li>
<li>Temperature-based sampling at generation time</li>
<li>40-character input windows during training</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture consists of two LSTM layers, a dropout layer, and a 23-neuron dense output layer. Exact hidden unit counts and dropout rates are not specified in the paper.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES rate</td>
          <td>32%</td>
          <td>After bracket check and RDKit parsing</td>
      </tr>
      <tr>
          <td>Novelty (vs. training)</td>
          <td>99.72%</td>
          <td>Only 2,774 of 1M match ChEMBL</td>
      </tr>
      <tr>
          <td>Unique scaffolds</td>
          <td>627,000</td>
          <td>vs. 172,000 in ChEMBL</td>
      </tr>
      <tr>
          <td>KS test (4/6 assays)</td>
          <td>Not significantly different</td>
          <td>At 95% confidence</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Generation: 300 CPUs for under 2 hours (1 million valid molecules)</li>
<li>Training hardware not specified</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ertl, P., Lewis, R., Martin, E., &amp; Polyakov, V. (2017). In silico generation of novel, drug-like chemical matter using the LSTM neural network. <em>arXiv preprint</em>, arXiv:1712.07449.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ertl2017silico,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{In silico generation of novel, drug-like chemical matter using the LSTM neural network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ertl, Peter and Lewis, Richard and Martin, Eric and Polyakov, Valery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.07449}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGE: Molecule Generation via Grammatical Evolution</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemge-grammatical-evolution-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemge-grammatical-evolution-molecule-generation/</guid><description>ChemGE applies grammatical evolution to SMILES strings for population-based de novo molecule generation with inherent parallelism and diversity.</description><content:encoded><![CDATA[<h2 id="grammatical-evolution-for-de-novo-molecular-design">Grammatical Evolution for De Novo Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemGE, a population-based molecular generation approach built on grammatical evolution. Rather than using deep neural networks, ChemGE evolves populations of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings through a context-free grammar, enabling concurrent evaluation by multiple molecular simulators and producing diverse molecular libraries. The method represents an alternative paradigm for de novo drug design: evolutionary optimization over formal grammars rather than learned latent spaces or autoregressive neural models.</p>
<h2 id="limitations-of-sequential-deep-learning-generators">Limitations of Sequential Deep Learning Generators</h2>
<p>At the time of publication, the dominant approaches to de novo molecular generation included Bayesian optimization over VAE latent spaces (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">GVAE</a>), reinforcement learning with recurrent neural networks (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>), sequential Monte Carlo search, and Monte Carlo tree search (ChemTS). These methods share two practical limitations:</p>
<ol>
<li>
<p><strong>Simulation concurrency</strong>: Most methods generate one molecule at a time, making it difficult to run multiple molecular simulations (e.g., <a href="https://en.wikipedia.org/wiki/Molecular_docking">docking</a>) in parallel. This wastes computational resources in high-throughput virtual screening settings.</p>
</li>
<li>
<p><strong>Molecular diversity</strong>: Deep learning generators tend to exploit narrow regions of chemical space. Deep reinforcement learning methods in particular often generate very similar molecules, requiring special countermeasures to maintain diversity. Since drug discovery is a multi-stage pipeline, limited diversity reduces survival rates in downstream <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> screening.</p>
</li>
</ol>
<p>ChemGE addresses both problems by maintaining a large population of molecules that are evolved and evaluated concurrently.</p>
<h2 id="core-innovation-chromosome-to-smiles-mapping-via-grammar-rules">Core Innovation: Chromosome-to-SMILES Mapping via Grammar Rules</h2>
<p>ChemGE encodes each molecule as a chromosome: a sequence of $N$ integers that deterministically maps to a SMILES string through a context-free grammar. The mapping process works as follows:</p>
<ol>
<li>Start with the grammar&rsquo;s start symbol</li>
<li>At each step $k$, look up the $k$-th integer $c = C[k]$ from the chromosome</li>
<li>Identify the leftmost non-terminal symbol and count its $r$ applicable production rules</li>
<li>Apply the $((c \bmod r) + 1)$-th rule</li>
<li>Repeat until no non-terminal symbols remain or the chromosome is exhausted</li>
</ol>
<p>The context-free grammar is a subset of the OpenSMILES specification, defined formally as $G = (V, \Sigma, R, S)$ where $V$ is the set of non-terminal symbols, $\Sigma$ is the set of terminal symbols, $R$ is the set of production rules, and $S$ is the start symbol.</p>
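<p>The genotype-to-phenotype mapping can be illustrated with a toy grammar (a tiny made-up fragment for the sketch, not the paper's OpenSMILES subset):</p>

```python
# Toy grammar: non-terminals are the dict keys; everything else is a
# terminal emitted verbatim. Illustrative only.
GRAMMAR = {
    "CHAIN": [["ATOM"], ["ATOM", "CHAIN"], ["ATOM", "(", "CHAIN", ")", "CHAIN"]],
    "ATOM": [["C"], ["N"], ["O"]],
}

def chromosome_to_smiles(chromosome, start="CHAIN"):
    symbols = [start]
    for c in chromosome:
        # find the leftmost non-terminal symbol
        idx = next((i for i, s in enumerate(symbols) if s in GRAMMAR), None)
        if idx is None:  # fully terminal: derivation finished early
            break
        rules = GRAMMAR[symbols[idx]]
        chosen = rules[c % len(rules)]  # the ((c mod r) + 1)-th rule
        symbols[idx:idx + 1] = chosen
    if any(s in GRAMMAR for s in symbols):
        return None  # chromosome exhausted before derivation finished
    return "".join(symbols)
```

<p>For instance, the chromosome <code>[1, 0, 0, 0]</code> derives <code>CC</code>, while <code>[2, 0, 0, 0, 0, 0]</code> derives the branched string <code>C(C)C</code>; a chromosome that runs out before eliminating all non-terminals maps to no molecule.</p>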
<p>Evolution follows the $(\mu + \lambda)$ evolution strategy:</p>
<ol>
<li>Create $\lambda$ new chromosomes by drawing random chromosomes from the population and mutating one integer at a random position</li>
<li>Translate each chromosome to a SMILES string and evaluate fitness (e.g., docking score). Invalid molecules receive fitness $-\infty$</li>
<li>Select the top $\mu$ molecules from the merged pool of $\mu + \lambda$ candidates</li>
</ol>
<p>The authors did not use crossover, as it did not improve performance. Diversity is inherently maintained because a large fraction of molecules are mutated in each generation.</p>
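<p>One generation of this mutation-only loop can be sketched in a few lines (pure Python; <code>fitness</code> stands in for the docking or druglikeness score, and the codon range is an illustrative choice):</p>

```python
import random

def evolve_step(population, fitness, lam, rng, codon_max=255):
    """One (mu + lambda) generation: mutate lambda random parents at a
    single position, score everyone, and keep the best mu."""
    mu = len(population)
    offspring = []
    for _ in range(lam):
        parent = rng.choice(population)
        child = list(parent)
        child[rng.randrange(len(child))] = rng.randrange(codon_max + 1)
        offspring.append(child)
    pool = population + offspring  # parents compete with children (elitism)
    pool.sort(key=fitness, reverse=True)  # invalid molecules score -inf
    return pool[:mu]
```

<p>Because each of the $\lambda$ offspring is independent, all fitness evaluations within a generation can be dispatched to separate docking simulators in parallel.</p>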
<h2 id="experimental-setup-and-benchmark-comparisons">Experimental Setup and Benchmark Comparisons</h2>
<h3 id="druglikeness-score-benchmark">Druglikeness Score Benchmark</h3>
<p>The first experiment optimized the penalized logP score $J^{\log P}$, an indicator of druglikeness defined as:</p>
<p>$$
J^{\log P}(m) = \log P(m) - \text{SA}(m) - \text{ring-penalty}(m)
$$</p>
<p>where $\log P(m)$ is the <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">octanol-water partition coefficient</a>, $\text{SA}(m)$ is the synthetic accessibility score, and ring-penalty$(m)$ penalizes carbon rings larger than size 6. All terms are normalized to zero mean and unit standard deviation. Initial populations were randomly sampled from the ZINC database (35 million compounds), with fitness set to $-\infty$ for molecules with molecular weight above 500 or duplicate structures.</p>
<p>ChemGE was compared against CVAE, GVAE, and ChemTS across population sizes $(\mu, \lambda) \in \{(10, 20), (100, 200), (1000, 2000), (10000, 20000)\}$.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>2h</th>
          <th>4h</th>
          <th>6h</th>
          <th>8h</th>
          <th>Mol/Min</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemGE (10, 20)</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>14.5</td>
      </tr>
      <tr>
          <td>ChemGE (100, 200)</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>135</td>
      </tr>
      <tr>
          <td>ChemGE (1000, 2000)</td>
          <td>4.45 +/- 0.24</td>
          <td>5.32 +/- 0.43</td>
          <td>5.73 +/- 0.33</td>
          <td>5.88 +/- 0.34</td>
          <td>527</td>
      </tr>
      <tr>
          <td>ChemGE (10000, 20000)</td>
          <td>4.20 +/- 0.33</td>
          <td>4.28 +/- 0.28</td>
          <td>4.40 +/- 0.27</td>
          <td>4.53 +/- 0.26</td>
          <td>555</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>-30.18 +/- 26.91</td>
          <td>-1.39 +/- 2.24</td>
          <td>-0.61 +/- 1.08</td>
          <td>-0.006 +/- 0.92</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>GVAE</td>
          <td>-4.34 +/- 3.14</td>
          <td>-1.29 +/- 1.67</td>
          <td>-0.17 +/- 0.96</td>
          <td>0.25 +/- 1.31</td>
          <td>1.38</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.91 +/- 0.38</td>
          <td>5.41 +/- 0.51</td>
          <td>5.49 +/- 0.44</td>
          <td>5.58 +/- 0.50</td>
          <td>40.89</td>
      </tr>
  </tbody>
</table>
<p>At $(\mu, \lambda) = (1000, 2000)$, ChemGE achieved the highest final score of 5.88 and generated 527 unique molecules per minute, roughly 13x faster than ChemTS and 3700x faster than CVAE. The small population (10, 20) converged prematurely with insufficient diversity, while the overly large population (10000, 20000) could not run enough generations to optimize effectively.</p>
<h3 id="docking-experiment-with-thymidine-kinase">Docking Experiment with Thymidine Kinase</h3>
<p>The second experiment applied ChemGE to generate molecules with high predicted binding affinity for <a href="https://en.wikipedia.org/wiki/Thymidine_kinase">thymidine kinase</a> (KITH), a well-known antiviral drug target. The authors used rDock for docking simulation, taking the best intermolecular score $S_{\text{inter}}$ from three runs with different initial conformations. Fitness was defined as $-S_{\text{inter}}$ (lower scores indicate higher affinity). The protein structure was taken from PDB ID 2B8T.</p>
<p>With 32 parallel cores and $(\mu, \lambda) = (32, 64)$, ChemGE completed 1000 generations in approximately 26 hours, generating 9,466 molecules in total. Among these, 349 achieved intermolecular scores better than the best known inhibitor in the DUD-E database.</p>
<h3 id="diversity-analysis">Diversity Analysis</h3>
<p>Molecular diversity was measured using internal diversity based on Morgan fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|^2} \sum_{(x,y) \in A \times A} T_d(x, y)
$$</p>
<p>where $T_d(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance">Tanimoto distance</a>.</p>
<p>The 349 &ldquo;ChemGE-active&rdquo; molecules (those scoring better than the best known inhibitor) had an internal diversity of 0.55, compared to 0.46 for known inhibitors and 0.65 for the whole ZINC database. This is a substantial improvement over known actives, achieved without any explicit diversity-promoting mechanism.</p>
<p>ISOMAP visualizations showed that ChemGE populations migrated away from known inhibitors over generations, ultimately occupying a completely different region of chemical space by generation 1000. This suggests ChemGE discovered a novel structural class of potential binders.</p>
<h2 id="high-throughput-and-diversity-without-deep-learning">High Throughput and Diversity Without Deep Learning</h2>
<p>ChemGE demonstrates several notable findings:</p>
<ol>
<li>
<p><strong>Deep learning is not required</strong> for competitive de novo molecular generation. Grammatical evolution over SMILES achieves higher throughput and comparable or better optimization scores than VAE- and RNN-based methods.</p>
</li>
<li>
<p><strong>Population size matters significantly</strong>. Too small a population leads to premature convergence. Too large a population prevents sufficient per-molecule optimization within the computational budget. The $(\mu, \lambda) = (1000, 2000)$ setting provided the best balance.</p>
</li>
<li>
<p><strong>Inherent diversity</strong> is a key advantage of evolutionary methods. Without any explicit diversity loss or penalty, ChemGE maintains diversity comparable to the ZINC database and exceeds that of known active molecules.</p>
</li>
<li>
<p><strong>Parallel evaluation</strong> is naturally supported. Each generation produces $\lambda$ independent molecules that can be evaluated by separate docking simulators simultaneously.</p>
</li>
</ol>
<p>The authors acknowledge several limitations. Synthetic routes and ADMET properties were not evaluated for the generated molecules. The docking scores, while favorable, require confirmation through biological assays. The authors also note that incorporating probabilistic or neural models into the evolutionary process might further improve performance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial population</td>
          <td>ZINC</td>
          <td>~35M compounds</td>
          <td>Randomly sampled starting molecules</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2B8T</td>
          <td>1 structure</td>
          <td>Thymidine kinase-ligand complex</td>
      </tr>
      <tr>
          <td>Baseline actives</td>
          <td>DUD-E (KITH)</td>
          <td>57 inhibitors</td>
          <td>Known thymidine kinase inhibitors</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Grammatical evolution with $(\mu + \lambda)$ evolution strategy</li>
<li>Mutation only (no crossover)</li>
<li>Context-free grammar subset of OpenSMILES specification</li>
<li>Chromosome length: $N$ integers per molecule</li>
<li>Fitness set to $-\infty$ for invalid SMILES, MW &gt; 500, or duplicate molecules</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. ChemGE is purely evolutionary.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max $J^{\log P}$ (8h)</td>
          <td>5.88 +/- 0.34</td>
          <td>ChemTS: 5.58 +/- 0.50</td>
          <td>ChemGE (1000, 2000)</td>
      </tr>
      <tr>
          <td>Molecules/min</td>
          <td>527</td>
          <td>ChemTS: 40.89</td>
          <td>~13x throughput improvement</td>
      </tr>
      <tr>
          <td>Docking hits</td>
          <td>349</td>
          <td>Best DUD-E inhibitor</td>
          <td>Molecules with better $S_{\text{inter}}$</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.55</td>
          <td>Known inhibitors: 0.46</td>
          <td>Morgan fingerprint Tanimoto distance</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>CPU: Intel Xeon E5-2630 v3 (benchmark experiments, single core)</li>
<li>Docking: 32 cores in parallel (thymidine kinase experiment, ~26 hours for 1000 generations)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tsudalab/ChemGE">ChemGE</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoshikawa, N., Terayama, K., Sumita, M., Homma, T., Oono, K., &amp; Tsuda, K. (2018). Population-based de novo molecule generation, using grammatical evolution. <em>Chemistry Letters</em>, 47(11), 1431-1434. <a href="https://doi.org/10.1246/cl.180665">https://doi.org/10.1246/cl.180665</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yoshikawa2018chemge,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Population-based De Novo Molecule Generation, Using Grammatical Evolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yoshikawa, Naruki and Terayama, Kei and Sumita, Masato and Homma, Teruki and Oono, Kenta and Tsuda, Koji}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1431--1434}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1246/cl.180665}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>S4 Structured State Space Models for De Novo Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</guid><description>S4 state space models are applied to chemical language modeling for de novo drug design, outperforming LSTMs and GPTs in bioactivity learning from SMILES.</description><content:encoded><![CDATA[<h2 id="structured-state-spaces-meet-chemical-language-modeling">Structured State Spaces Meet Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces structured state space sequence (S4) models to chemical language modeling (CLM) for de novo drug design. S4 models have a dual formulation: they process entire input sequences via convolution during training (like Transformers) and generate sequences element-by-element via recurrence during inference (like LSTMs). The authors benchmark S4 against LSTM and GPT architectures across multiple drug discovery tasks, including drug-like molecule generation, bioactivity learning, chemical space exploration, natural product design, and prospective kinase inhibitor design validated by molecular dynamics simulations.</p>
<h2 id="bridging-the-lstm-transformer-gap-in-molecular-generation">Bridging the LSTM-Transformer Gap in Molecular Generation</h2>
<p>Chemical language models (CLMs) generate molecules by learning the &ldquo;chemical language&rdquo; of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> string representations. The two dominant architectures for CLMs are LSTMs and GPTs, each with complementary strengths and limitations:</p>
<ul>
<li><strong>LSTMs</strong> generate sequences recurrently (element-by-element), which enables efficient generation and good learning of local/short-range dependencies. However, their sequential information bottleneck limits learning of global sequence properties.</li>
<li><strong>GPTs</strong> (Transformer decoders) process the entire input at once, better capturing global properties like bioactivity. However, they become increasingly compute-intensive for longer SMILES strings and struggle with chemical space exploration at higher sampling temperatures.</li>
</ul>
<p>Complex molecular properties like bioactivity can emerge from separated portions of a SMILES string (e.g., distant functional groups in the linear notation). Neither architecture fully addresses the need to learn these long-range dependencies while maintaining efficient, robust generation. The chemical space, estimated at up to $10^{60}$ small molecules, demands models that can both capture complex property relationships and explore diverse scaffolds efficiently.</p>
<h2 id="the-dual-nature-of-s4-convolution-meets-recurrence">The Dual Nature of S4: Convolution Meets Recurrence</h2>
<p>S4 models are built on discrete <a href="https://en.wikipedia.org/wiki/State-space_model">state space models</a>, which map an input sequence $\mathbf{u}$ to an output sequence $\mathbf{y}$ through learnable parameters $\overline{\mathbf{A}} \in \mathbb{R}^{N \times N}$, $\overline{\mathbf{B}} \in \mathbb{R}^{N \times 1}$, $\overline{\mathbf{C}} \in \mathbb{R}^{1 \times N}$, and $\overline{\mathbf{D}} \in \mathbb{R}^{1 \times 1}$:</p>
<p>$$
x_{k} = \overline{\mathbf{A}} x_{k-1} + \overline{\mathbf{B}} u_{k}
$$</p>
<p>$$
y_{k} = \overline{\mathbf{C}} x_{k} + \overline{\mathbf{D}} u_{k}
$$</p>
<p>This linear recurrence can equivalently be &ldquo;unrolled&rdquo; into a global convolution:</p>
<p>$$
\mathbf{y} = \mathbf{u} * \overline{\mathbf{K}}
$$</p>
<p>where $\overline{\mathbf{K}}$ is a convolution filter parameterized by $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, and $\overline{\mathbf{C}}$. This duality is the core innovation for CLMs:</p>
<ul>
<li><strong>Training</strong>: S4 uses the convolutional formulation to learn from entire SMILES sequences simultaneously, capturing global molecular properties.</li>
<li><strong>Generation</strong>: S4 switches to the recurrent formulation, producing SMILES tokens one at a time for efficient, robust chemical space exploration.</li>
</ul>
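<p>This recurrence-convolution equivalence can be checked numerically. The sketch below is illustrative only: random matrices stand in for learned S4 parameters, and a real S4 computes $\overline{\mathbf{K}}$ via the stable Cauchy-kernel reduction rather than explicit matrix powers.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                                        # state size, sequence length
A = 0.5 * rng.standard_normal((N, N)) / np.sqrt(N)  # placeholder discretized state matrix
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
D = rng.standard_normal((1, 1))
u = rng.standard_normal(L)

# Recurrent form: x_k = A x_{k-1} + B u_k;  y_k = C x_k + D u_k
x = np.zeros((N, 1))
y_rec = np.zeros(L)
for k in range(L):
    x = A @ x + B * u[k]
    y_rec[k] = (C @ x + D * u[k]).item()

# Convolutional form: y = u * K with K_j = C A^j B, plus the D skip term
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
y_conv = np.array([sum(K[k - j] * u[j] for j in range(k + 1)) for k in range(L)]) + D.item() * u

assert np.allclose(y_rec, y_conv)
```

<p>The same parameters thus support parallel (convolutional) training and sequential (recurrent) generation.</p>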
<p>S4 addresses the numerical instabilities of naive state space models through high-order polynomial projection operators (HiPPO) and reduction to the stable Cauchy kernel computation, enabling effective learning of long-range dependencies.</p>
<p>For molecular ranking after fine-tuning, the log-likelihood score subtracts the pre-training likelihood to isolate target-specific information:</p>
<p>$$
\mathcal{L}_{\text{score}}(\mathbf{M}) = \mathcal{L}(\mathbf{M}_{\text{ft}}) - \mathcal{L}(\mathbf{M}_{\text{pt}})
$$</p>
<p>where $\mathcal{L}(\mathbf{M}_{\text{ft}})$ and $\mathcal{L}(\mathbf{M}_{\text{pt}})$ are the fine-tuned and pre-trained model log-likelihoods, respectively.</p>
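<p>A minimal sketch of this ranking step (hypothetical function and toy numbers; the log-likelihoods would come from the fine-tuned and pre-trained CLMs):</p>

```python
def rank_by_score(molecules, ll_finetuned, ll_pretrained):
    """Rank molecules by the bias-corrected log-likelihood score
    L_score(M) = L_ft(M) - L_pt(M): subtracting the pre-training
    likelihood isolates what fine-tuning on the target contributed."""
    scores = {m: ll_finetuned[m] - ll_pretrained[m] for m in molecules}
    return sorted(molecules, key=scores.get, reverse=True)

# Toy example: "b" is generically likely under the pre-trained model,
# but "a" gained the most likelihood from fine-tuning, so it ranks first.
ll_pt = {"a": -40.0, "b": -10.0, "c": -30.0}
ll_ft = {"a": -20.0, "b": -9.0, "c": -28.0}
print(rank_by_score(["a", "b", "c"], ll_ft, ll_pt))  # -> ['a', 'c', 'b']
```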
<h2 id="benchmarking-s4-across-drug-discovery-tasks">Benchmarking S4 Across Drug Discovery Tasks</h2>
<h3 id="drug-like-molecule-generation">Drug-like molecule generation</h3>
<p>All three CLMs (S4, LSTM, GPT) were pre-trained on 1.9M canonical SMILES from ChEMBL v31 (molecules with fewer than 100 tokens). Each model generated 102,400 SMILES strings de novo.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>S4</td>
          <td>99,268 (97%)</td>
          <td>98,712 (96%)</td>
          <td>95,552 (93%)</td>
      </tr>
      <tr>
          <td>LSTM</td>
          <td>97,151 (95%)</td>
          <td>96,618 (94%)</td>
          <td>82,988 (81%)</td>
      </tr>
      <tr>
          <td>GPT</td>
          <td>93,580 (91%)</td>
          <td>93,263 (91%)</td>
          <td>91,590 (89%)</td>
      </tr>
  </tbody>
</table>
<p>S4 produces the most valid, unique, and novel molecules. Error analysis reveals a distinct failure mode per architecture: LSTMs struggle most with branching errors, GPTs with ring and bond-assignment errors, while S4 makes fewer branching and ring errors than either but more bond-assignment errors than the LSTM. This pattern supports the hypothesis that S4's holistic (convolutional) training better captures long-range dependencies (branching, ring opening/closure), while purely recurrent processing better handles local dependencies (bond assignment).</p>
<h3 id="bioactivity-learning-via-transfer-learning">Bioactivity learning via transfer learning</h3>
<p>Five fine-tuning campaigns were conducted on targets from the LIT-PCBA dataset: PKM2, <a href="https://en.wikipedia.org/wiki/Mitogen-activated_protein_kinase_1">MAPK1</a>, GBA, mTORC1, and TP53. After fine-tuning, models ranked held-out test molecules by learned log-likelihoods to evaluate bioactive compound prioritization.</p>
<p>S4 outperformed both benchmarks across targets. Wilcoxon signed-rank tests on pooled scores confirmed statistically significant superiority:</p>
<ul>
<li>S4 vs. LSTM: $p = 8.41 \times 10^{-6}$ (top 10), $p = 2.93 \times 10^{-7}$ (top 50), $p = 1.45 \times 10^{-7}$ (top 100)</li>
<li>S4 vs. GPT: $p = 2.33 \times 10^{-3}$ (top 10), $p = 3.72 \times 10^{-3}$ (top 50), $p = 2.61 \times 10^{-2}$ (top 100)</li>
</ul>
<p>TP53 was the most challenging target, where no model consistently retrieved actives in the top 10, possibly due to <a href="/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/">activity cliffs</a> in the test set.</p>
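<p>For intuition, the statistic behind these $p$-values can be sketched in a simplified form (no tie or zero-difference handling; in practice a library routine such as scipy.stats.wilcoxon computes both the statistic and the $p$-value):</p>

```python
def wilcoxon_w(paired_a, paired_b):
    """Simplified Wilcoxon signed-rank statistic: rank the absolute
    paired differences, then sum the ranks where a exceeds b.
    Omits tie handling and the p-value computation for brevity."""
    diffs = [a - b for a, b in zip(paired_a, paired_b) if a != b]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return sum(r + 1 for r, i in enumerate(ranked) if diffs[i] > 0)

# If one model's paired scores always exceed the other's,
# every rank contributes, giving the maximum statistic:
s4_scores = [0.9, 0.8, 0.7, 0.95]
lstm_scores = [0.5, 0.6, 0.4, 0.70]
print(wilcoxon_w(s4_scores, lstm_scores))  # -> 10 (= 1+2+3+4)
```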
<h3 id="chemical-space-exploration-with-temperature-sampling">Chemical space exploration with temperature sampling</h3>
<p>Models were evaluated across sampling temperatures from $T = 1.0$ to $T = 2.0$ on three metrics: SMILES validity, rediscovery rate of known actives, and scaffold diversity. Key findings:</p>
<ul>
<li><strong>Validity</strong>: S4 and LSTM maintain higher validity than GPT at elevated temperatures (GPT median validity drops below 40% at high T).</li>
<li><strong>Rediscovery</strong>: S4 outperforms LSTM in rediscovering bioactive molecules at all temperatures.</li>
<li><strong>Scaffold diversity</strong>: LSTM achieves the highest number of unique scaffold clusters (median 6,602 at $T = 1.75$), with S4 as close second (6,520 clusters).</li>
</ul>
<p>S4 provides the best balance between bioactivity capture and structural diversity.</p>
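<p>Temperature sampling rescales the next-token logits by $1/T$ before the softmax; higher $T$ flattens the distribution (more exploration, more syntax errors), lower $T$ sharpens it. A minimal sketch with placeholder logits, not tied to any of the paper's models:</p>

```python
import math, random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample a next-token index from logits scaled by 1/T."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(exps)
    for i, e in enumerate(exps):
        r -= e
        if r <= 0:
            return i
    return len(exps) - 1

random.seed(0)
# At low T, sampling is nearly deterministic (always the top logit);
# at high T, lower-probability tokens are drawn far more often.
cold = [sample_token([2.0, 0.0, -2.0], temperature=0.1) for _ in range(200)]
hot = [sample_token([2.0, 0.0, -2.0], temperature=2.0) for _ in range(200)]
print(cold.count(0), hot.count(0))
```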
<h3 id="natural-product-design">Natural product design</h3>
<p>Models were trained on 32,360 large natural product SMILES (length &gt; 100 tokens) from the COCONUT database and used to generate 102,400 designs each.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>S4</th>
          <th>LSTM</th>
          <th>GPT</th>
          <th>Training Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>82,633 (81%)</td>
          <td>76,264 (74%)</td>
          <td>70,117 (68%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Unique</td>
          <td>53,293 (52%)</td>
          <td>51,326 (50%)</td>
          <td>50,487 (49%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Novel</td>
          <td>40,897 (40%)</td>
          <td>43,245 (42%)</td>
          <td>43,168 (42%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>NP-likeness</td>
          <td>1.6 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.6 +/- 0.7</td>
      </tr>
  </tbody>
</table>
<p>S4 designs the most valid molecules (6,000 to 12,000 more than benchmarks) and achieves significantly higher NP-likeness ($p = 1.41 \times 10^{-53}$ vs. LSTM, $p = 1.02 \times 10^{-82}$ vs. GPT). S4 also achieves the lowest Kolmogorov-Smirnov distances to the training/test distributions across multiple structural properties (sp3 carbons, aliphatic rings, spiro atoms, molecular weight, fused ring size, heavy atoms).</p>
<p>In terms of computational efficiency, S4 trains as fast as GPT (both approximately 1.3x faster than LSTM) and is the fastest of the three architectures at generation.</p>
<h3 id="prospective-mapk1-inhibitor-design">Prospective MAPK1 inhibitor design</h3>
<p>The pre-trained S4 model was fine-tuned on 68 manually curated MAPK1 inhibitors ($K_i &lt; 1 \mu M$) from ChEMBL v33. The last five fine-tuning epochs generated 256K molecules across five temperature values. After ranking and filtering by log-likelihood score and scaffold similarity, the top 10 designs were evaluated via <a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a> <a href="/notes/computational-chemistry/molecular-dynamics/">molecular dynamics</a> simulations.</p>
<p>Eight out of ten designs showed high predicted affinity, with $\Delta G$ values ranging from $-10.3 \pm 0.6$ to $-23 \pm 4$ kcal/mol. These affinities are comparable to or exceed those of the closest known active neighbors ($\Delta G = -9.1 \pm 0.8$ to $-13 \pm 2$ kcal/mol). The most potent predicted design (molecule 2, $\Delta G = -23 \pm 4$ kcal/mol) engages extensively with the MAPK1 binding pocket, though synthetic accessibility may be limited. Several designs incorporate halogen substitutions favorable for MAPK1 inhibition, consistent with known structure-activity relationships.</p>
<h2 id="s4-combines-the-best-of-lstms-and-gpts-for-molecular-design">S4 Combines the Best of LSTMs and GPTs for Molecular Design</h2>
<p>The main findings of this study are:</p>
<ol>
<li><strong>S4 outperforms both LSTM and GPT</strong> in learning complex molecular properties like bioactivity, while maintaining competitive or superior performance in syntax learning and chemical space exploration.</li>
<li><strong>The dual formulation is key</strong>: holistic training (convolution) enables better capture of global molecular properties, while recurrent generation preserves robust chemical syntax and diverse scaffold exploration.</li>
<li><strong>S4 is especially strong for longer sequences</strong>: natural product design (SMILES &gt; 100 tokens) shows the largest advantages over benchmarks in validity and property matching.</li>
<li><strong>Prospective validation</strong>: 8/10 S4-designed MAPK1 inhibitors are predicted as highly active by molecular dynamics, with affinities comparable to or exceeding known actives.</li>
</ol>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>All evaluations are computational; no wet-lab experimental validation is reported.</li>
<li>Bioactivity evaluation relies on likelihood-based ranking, which is an indirect proxy.</li>
<li>The MD simulations, while more rigorous than simple docking, still represent in silico predictions.</li>
<li>SMILES augmentation and improved ranking protocols could further boost performance.</li>
</ul>
<p><strong>Future directions</strong> include application to macrocyclic peptides and protein sequences, organic reaction planning, structure-based drug design, and integration with wet-lab experimental validation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v31</td>
          <td>1.9M SMILES</td>
          <td>Molecules with SMILES length &lt;= 100 tokens</td>
      </tr>
      <tr>
          <td>Fine-tuning (bioactivity)</td>
          <td>LIT-PCBA (5 targets)</td>
          <td>11-56 actives + ~10K inactives per target</td>
          <td>PKM2, MAPK1, GBA, mTORC1, TP53</td>
      </tr>
      <tr>
          <td>Natural product training</td>
          <td>COCONUT</td>
          <td>32,360 SMILES</td>
          <td>SMILES length &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>Prospective fine-tuning</td>
          <td>ChEMBL v33 (MAPK1)</td>
          <td>68 inhibitors</td>
          <td>$K_i &lt; 1 \mu M$, target ID CHEMBL4040</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: next-token prediction on SMILES strings</li>
<li>Fine-tuning: transfer learning with early stopping (patience 5, tolerance $10^{-5}$)</li>
<li>Molecule ranking: log-likelihood scoring with pre-training bias subtraction (Eq. 5)</li>
<li>Temperature sampling: $T$ from 1.0 to 2.0 (step 0.25) for chemical space exploration</li>
</ul>
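<p>The fine-tuning early-stopping criterion (patience 5, tolerance $10^{-5}$) can be sketched generically; this is a standard pattern, not the authors' code:</p>

```python
class EarlyStopping:
    """Stop fine-tuning when validation loss fails to improve by more than
    `tolerance` for `patience` consecutive epochs (paper: patience=5, tol=1e-5)."""
    def __init__(self, patience=5, tolerance=1e-5):
        self.patience, self.tolerance = patience, tolerance
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if self.best - val_loss > self.tolerance:   # meaningful improvement
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience     # True -> stop training

stopper = EarlyStopping()
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
stops = [stopper.step(l) for l in losses]
print(stops.index(True))  # -> 6: stops after 5 consecutive non-improving epochs
```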
<h3 id="models">Models</h3>
<ul>
<li><strong>S4</strong>: Structured state space sequence model with HiPPO initialization; hyperparameter search over 242 + 108 configurations</li>
<li><strong>LSTM</strong>: 40 configurations optimized via random search</li>
<li><strong>GPT</strong>: 35 configurations optimized via random search</li>
<li>All models share the same pre-training data and fine-tuning protocol for fair comparison</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (ChEMBL)</td>
          <td>S4</td>
          <td>97%</td>
          <td>Out of 102,400 generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness (ChEMBL)</td>
          <td>S4</td>
          <td>96%</td>
          <td>Among valid designs</td>
      </tr>
      <tr>
          <td>Novelty (ChEMBL)</td>
          <td>S4</td>
          <td>93%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Bioactivity ranking (top 10)</td>
          <td>S4</td>
          <td>Significant ($p = 8.41 \times 10^{-6}$ vs. LSTM)</td>
          <td>Wilcoxon signed-rank test</td>
      </tr>
      <tr>
          <td>NP validity</td>
          <td>S4</td>
          <td>81%</td>
          <td>COCONUT, SMILES &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>MAPK1 inhibitor success</td>
          <td>S4</td>
          <td>8/10 designs active</td>
          <td>Validated by MD (Umbrella Sampling)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Hyperparameter search: NVIDIA A100 40GB GPUs</li>
<li>LSTM/GPT search: 5 days on single A100</li>
<li>S4 search: 10 days on multiple A100 GPUs</li>
<li>MD simulations: Dutch supercomputer Snellius; 1.2-1.6 microseconds per ligand (<a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a>)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/s4-for-de-novo-drug-design">S4 for de novo drug design</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with data and trained models</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.12666371">Zenodo archive</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Source data and molecule designs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ozcelik, R., de Ruiter, S., Criscuolo, E., &amp; Grisoni, F. (2024). Chemical language modeling with structured state space sequence models. <em>Nature Communications</em>, 15, 6176.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ozcelik2024chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language modeling with structured state space sequence models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{\&#34;O{}z\c{c}elik, R{\i}za and de Ruiter, Sarah and Criscuolo, Emanuele and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6176}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-50469-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LMs Generate 3D Molecules from XYZ, CIF, PDB Files</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</guid><description>Transformer language models trained on XYZ, CIF, and PDB sequences generate valid 3D molecules, crystals, and protein binding sites.</description><content:encoded><![CDATA[<h2 id="language-models-as-3d-chemical-structure-generators">Language Models as 3D Chemical Structure Generators</h2>
<p>This is a <strong>Method</strong> paper that demonstrates transformer-based language models can generate molecules, crystalline materials, and protein binding sites directly in three dimensions by training on sequences derived from standard chemical file formats (XYZ, CIF, PDB). The key contribution is showing that unmodified autoregressive language models, using only next-token prediction, achieve performance comparable to domain-specific 3D generative models that incorporate SE(3) equivariance and other geometric inductive biases.</p>
<h2 id="beyond-graphs-and-strings-the-need-for-3d-chemical-generation">Beyond Graphs and Strings: The Need for 3D Chemical Generation</h2>
<p>Molecular design with deep learning has largely relied on two representation paradigms: molecular graphs (processed with graph neural networks) and linearized string representations like <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> and <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (processed with sequence models). Both approaches have proven effective for drug-like organic molecules, but they share a fundamental limitation: they cannot represent structures whose identity depends on 3D spatial arrangement.</p>
<p>Crystalline materials, for example, have periodic lattice structures that cannot be reduced to simple graphs. Protein binding sites are defined by the 3D arrangement of hundreds of atoms across multiple residues. For tasks like catalysis design or structure-based drug discovery, the geometric positions of atoms are essential information that graphs and strings discard entirely.</p>
<p>Existing 3D generative models address this gap but typically require specialized architectures with SE(3) equivariance to handle rotational and translational symmetries. This work asks whether the general-purpose sequence modeling capability of transformers is sufficient to learn 3D chemical structure distributions without any domain-specific architectural modifications.</p>
<h2 id="direct-tokenization-of-chemical-file-formats">Direct Tokenization of Chemical File Formats</h2>
<p>The core insight is straightforward: any 3D molecule, crystal, or biomolecule is already stored as text in standard file formats (<a href="https://en.wikipedia.org/wiki/XYZ_file_format">XYZ</a>, <a href="https://en.wikipedia.org/wiki/Crystallographic_Information_File">CIF</a>, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)">PDB</a>). These files encode atom types and their Cartesian coordinates as sequences of characters and numbers. Rather than designing specialized architectures for point cloud generation, the authors simply tokenize these files and train a standard GPT-style transformer to predict the next token.</p>
<p>A molecule with $n$ atoms is represented as:</p>
<p>$$
\mathcal{M} = (e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>where $e_i$ is the element type and $(x_i, y_i, z_i)$ are Cartesian coordinates. Crystals additionally include lattice parameters:</p>
<p>$$
\mathcal{C} = (\ell_a, \ell_b, \ell_c, \alpha, \beta, \gamma, e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>Protein binding sites use residue-atom indicators (e.g., HIS-C, CYS-N) instead of bare element symbols:</p>
<p>$$
\mathcal{P} = (a_1, x_1, y_1, z_1, \dots, a_n, x_n, y_n, z_n)
$$</p>
<p>The language model learns the joint distribution via autoregressive factorization:</p>
<p>$$
p(x) = \prod_{i=1}^{n} p(t_i \mid t_{i-1}, \dots, t_1)
$$</p>
<p>Two tokenization strategies are explored:</p>
<ol>
<li><strong>Character-level (LM-CH)</strong>: Every character in the file is a token, including digits, minus signs, spaces, and newlines. This produces long sequences but uses a small vocabulary (~30 tokens).</li>
<li><strong>Atom+coordinate-level (LM-AC)</strong>: Each atom placement requires exactly 4 tokens: one element/residue token and three coordinate tokens (e.g., &lsquo;-1.98&rsquo;). The vocabulary is larger (~100-10K tokens) but sequences are shorter.</li>
</ol>
<p>Numerical precision is controlled by rounding coordinates to 1, 2, or 3 decimal places. Since the model lacks rotation and translation invariance, random rotation augmentation during training improves performance.</p>
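<p>The atom+coordinate-level scheme (LM-AC) with coordinate rounding can be sketched as follows; the exact token formatting here is an assumption for illustration, not the paper's specification:</p>

```python
def tokenize_atoms(atoms, precision=2):
    """Atom+coordinate-level tokenization: each atom becomes exactly
    4 tokens -- its element symbol plus three rounded coordinate tokens."""
    tokens = []
    for element, (x, y, z) in atoms:
        tokens.append(element)
        tokens.extend(f"{c:.{precision}f}" for c in (x, y, z))
    return tokens

# Hypothetical water geometry, rounded to 2 decimal places:
water = [("O", (0.0, 0.0, 0.1173)),
         ("H", (0.0, 0.7572, -0.4692)),
         ("H", (0.0, -0.7572, -0.4692))]
print(tokenize_atoms(water))
# -> ['O', '0.00', '0.00', '0.12', 'H', '0.00', '0.76', '-0.47', ...]
```

<p>Shorter sequences come at the cost of a larger vocabulary, since every distinct rounded coordinate value needs its own token.</p>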
<h2 id="experiments-across-molecules-crystals-and-protein-binding-sites">Experiments Across Molecules, Crystals, and Protein Binding Sites</h2>
<h3 id="molecular-generation-zinc">Molecular Generation (ZINC)</h3>
<p>The model is evaluated on 250K commercially available molecules from the ZINC dataset, with an average of 23 heavy atoms. XYZ files are generated using RDKit&rsquo;s conformer tools. Coordinates use 2 decimal places of precision. The authors generate 10K molecules and evaluate both 3D geometry quality and standard generative metrics.</p>
<p>For 3D geometry assessment, root mean squared deviation (RMSD) between language model-generated conformers and RDKit-generated conformers shows most molecules fall between 1.0 and 2.0 RMSD, with a heavy tail extending to 4.0.</p>
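<p>For reference, RMSD between two conformers with matched atom ordering reduces to the following; a real conformer comparison would first superimpose the structures (e.g., Kabsch alignment), which this sketch omits:</p>

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two matched-atom conformers.
    Assumes the structures are already superimposed."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(rmsd(a, b))  # -> sqrt(0.5), about 0.7071
```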
<p>Standard metrics include validity, uniqueness, novelty, and earth mover&rsquo;s distance (WA) for molecular property distributions (QED, SA score, molecular weight).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>3D</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>WA MW</th>
          <th>WA SA</th>
          <th>WA QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Train</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>0.816</td>
          <td>0.013</td>
          <td>0.002</td>
      </tr>
      <tr>
          <td>SM-LM</td>
          <td>No</td>
          <td>98.35</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.640</td>
          <td>0.049</td>
          <td>0.005</td>
      </tr>
      <tr>
          <td>SF-LM</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.772</td>
          <td>0.085</td>
          <td>0.006</td>
      </tr>
      <tr>
          <td>JTVAE</td>
          <td>No</td>
          <td>100.0</td>
          <td>98.56</td>
          <td>100.0</td>
          <td>22.63</td>
          <td>0.126</td>
          <td>0.023</td>
      </tr>
      <tr>
          <td>ENF</td>
          <td>Yes</td>
          <td>1.05</td>
          <td>96.37</td>
          <td>99.72</td>
          <td>168.5</td>
          <td>1.886</td>
          <td>0.160</td>
      </tr>
      <tr>
          <td>G-SchNet</td>
          <td>Yes</td>
          <td>1.20</td>
          <td>55.96</td>
          <td>98.33</td>
          <td>152.7</td>
          <td>1.126</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>Yes</td>
          <td>77.51</td>
          <td>96.40</td>
          <td>95.30</td>
          <td>101.2</td>
          <td>0.939</td>
          <td>0.093</td>
      </tr>
      <tr>
          <td>LM-CH</td>
          <td>Yes</td>
          <td>90.13</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.912</td>
          <td>2.608</td>
          <td>0.077</td>
      </tr>
      <tr>
          <td>LM-AC</td>
          <td>Yes</td>
          <td>98.51</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>1.811</td>
          <td>0.026</td>
          <td>0.004</td>
      </tr>
  </tbody>
</table>
<p>The atom+coordinate tokenization model (LM-AC) achieves 98.51% validity with 100% uniqueness and novelty. Its WA scores for molecular weight (1.811) and QED (0.004) are substantially better than all other 3D generative baselines and competitive with SMILES/SELFIES language models. The character-level model (LM-CH) at 90.13% validity performs comparably to graph-based models but falls short of the string-based language models.</p>
<h3 id="crystal-generation-perov-5-and-mp-20">Crystal Generation (Perov-5 and MP-20)</h3>
<p>Crystal generation uses CIF-derived sequences with 3 decimal places of precision. Two datasets are used: Perov-5 (18,928 <a href="https://en.wikipedia.org/wiki/Perovskite_(structure)">perovskite</a> materials, 5 atoms per unit cell, 56 elements) and MP-20 (45,231 diverse materials, 1-20 atoms per unit cell, 89 elements).</p>
<p>Evaluation metrics include structural validity (minimum interatomic distance &gt; 0.5 angstrom), compositional validity (charge neutrality via SMACT), coverage (recall and precision between generated and test sets), and earth mover&rsquo;s distance for density and number of unique elements.</p>
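<p>The structural-validity criterion amounts to a pairwise minimum-distance check; this simplified sketch ignores periodic images, which a proper crystal check must also consider:</p>

```python
import itertools, math

def structurally_valid(coords, min_dist=0.5):
    """Structural validity: every pairwise interatomic distance must exceed
    `min_dist` (0.5 angstrom in the paper). Ignores periodic boundary images."""
    for a, b in itertools.combinations(coords, 2):
        if math.dist(a, b) <= min_dist:
            return False
    return True

good = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
bad = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)]   # atoms 0.2 A apart: too close
print(structurally_valid(good), structurally_valid(bad))  # -> True False
```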
<table>
  <thead>
      <tr>
          <th>Data</th>
          <th>Model</th>
          <th>Struc. Valid (%)</th>
          <th>Comp. Valid (%)</th>
          <th>COV-R (%)</th>
          <th>COV-P (%)</th>
          <th>WA density</th>
          <th>WA elements</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perov-5</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>98.59</td>
          <td>99.45</td>
          <td>98.46</td>
          <td>0.126</td>
          <td>0.063</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-CH</td>
          <td>100.0</td>
          <td>98.51</td>
          <td>99.60</td>
          <td>99.42</td>
          <td>0.071</td>
          <td>0.036</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-AC</td>
          <td>100.0</td>
          <td>98.79</td>
          <td>98.78</td>
          <td>99.36</td>
          <td>0.089</td>
          <td>0.028</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>86.70</td>
          <td>99.15</td>
          <td>99.49</td>
          <td>0.688</td>
          <td>1.432</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-CH</td>
          <td>84.81</td>
          <td>83.55</td>
          <td>99.25</td>
          <td>97.89</td>
          <td>0.864</td>
          <td>0.132</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-AC</td>
          <td>95.81</td>
          <td>88.87</td>
          <td>99.60</td>
          <td>98.55</td>
          <td>0.696</td>
          <td>0.092</td>
      </tr>
  </tbody>
</table>
<p>On Perov-5, both language models outperform CDVAE across most metrics. On the more diverse MP-20 dataset, LM-AC achieves the best scores on 3 of 6 metrics and remains competitive on the others. LM-CH struggles more with structural validity on MP-20 (84.81%).</p>
<h3 id="protein-binding-site-generation-pdb">Protein Binding Site Generation (PDB)</h3>
<p>The most challenging task involves generating protein binding sites (~200-250 atoms each) from PDB-derived sequences. The dataset contains approximately 180K protein-ligand pairs. Residue-atom tokenization is used (e.g., CYS-C, CYS-N), with 2 decimal places of precision.</p>
<p>Validity is assessed per-residue using xyz2mol, with an additional check for inter-residue atomic overlap (atoms from different residues closer than the minimum bond distance). Approximately 99% of generated pockets pass the residue validity check, while about 5% fail the overlap check. Of generated pockets, 89.8% have unique residue orderings, and 83.6% have novel orderings not seen in training, indicating the model is generating novel binding site structures rather than memorizing.</p>
<h2 id="competitive-3d-generation-without-geometric-inductive-biases">Competitive 3D Generation Without Geometric Inductive Biases</h2>
<p>The central finding is that standard transformer language models, without any equivariance or geometric inductive biases, can generate valid 3D chemical structures across three substantially different domains. The atom+coordinate tokenization (LM-AC) consistently outperforms character-level tokenization (LM-CH), likely because it produces shorter sequences and reduces the number of sequential decisions needed per atom placement.</p>
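<p>The sequence-length difference between the two tokenization schemes can be illustrated with a toy example. The exact vocabularies in the paper may differ; this sketch only contrasts one-token-per-character against one token per atom symbol and per formatted coordinate:</p>

```python
def tokenize_char(xyz_line):
    """Character-level tokenization (LM-CH style): every character,
    including digits, signs, and spaces, is its own token."""
    return list(xyz_line)

def tokenize_atom_coord(symbol, x, y, z, ndp=2):
    """Atom + coordinate tokenization (LM-AC style, illustrative):
    one token for the element, one per coordinate rounded to ndp places."""
    return [symbol] + [f"{c:.{ndp}f}" for c in (x, y, z)]

line = "C 1.23 -0.45 2.07"
len(tokenize_char(line))                       # 17 tokens for one atom
tokenize_atom_coord("C", 1.23, -0.45, 2.07)    # 4 tokens for the same atom
```

<p>Fewer tokens per atom means fewer sequential decisions per atom placement, which is the likely source of LM-AC's advantage.</p>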
<p>Several limitations are worth noting. The model generates atoms using absolute Cartesian coordinates, which means it must learn rotation and translation invariance purely from data augmentation rather than having it built into the architecture. The authors acknowledge this becomes increasingly difficult as structure size grows. The vocabulary size also scales with coordinate precision and structure complexity, which could become prohibitive for very large systems.</p>
<p>The paper does not include computational cost comparisons with baseline models, making it difficult to assess the practical tradeoff between the simplicity of the language modeling approach and the efficiency of specialized architectures. The authors also note that further validation through computational simulation and experiment is needed to confirm the physical plausibility of generated structures.</p>
<p>Future directions identified include inverse design of molecules and materials conditioned on target properties, extension to more complex structures (metal-organic frameworks), and exploration of alternative tokenization strategies to handle larger systems.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC</td>
          <td>250K molecules</td>
          <td>~23 heavy atoms avg; XYZ files via RDKit conformer generation</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Perov-5</td>
          <td>18,928 perovskites</td>
          <td>5 atoms/unit cell, 56 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>MP-20</td>
          <td>45,231 materials</td>
          <td>1-20 atoms/unit cell, 89 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Protein binding sites</td>
          <td>~180K protein-ligand pairs</td>
          <td>Processed to 200-250 atoms per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-style transformer with ~1M to 100M parameters</li>
<li><strong>Layers</strong>: 12</li>
<li><strong>Embedding size</strong>: 128 to 1024</li>
<li><strong>Attention heads</strong>: 4 to 12</li>
<li><strong>Batch size</strong>: 4 to 32 structures</li>
<li><strong>Learning rate</strong>: $10^{-4}$ to $10^{-5}$, decayed to $9 \times 10^{-6}$</li>
<li><strong>Data augmentation</strong>: Random rotation of training structures at each epoch</li>
<li><strong>Numerical precision</strong>: 2 decimal places (molecules, proteins), 3 decimal places (crystals)</li>
</ul>
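<p>The rotation augmentation amounts to applying a freshly sampled rotation matrix to every structure each epoch. A minimal sketch (one rotation axis shown for brevity; the full augmentation samples a uniform 3D rotation):</p>

```python
import math, random

def random_rotation_z(coords, rng=random):
    """Rotate a structure by a random angle about the z-axis.
    coords: list of (x, y, z) tuples. Distances are preserved, so the
    model sees the same structure at new absolute coordinates."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]
```

<p>Because the model consumes absolute Cartesian coordinates, this augmentation is the only mechanism through which it can learn rotational invariance.</p>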
<h3 id="models">Models</h3>
<p>No pre-trained model weights are publicly available. The paper mentions &ldquo;Example code can be found at&rdquo; but the URL appears to be missing from the published version.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Molecules</td>
          <td>xyz2mol produces valid RDKit Mol object</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Crystals</td>
          <td>Structural (min distance &gt; 0.5 angstrom) and compositional (charge neutral)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>All</td>
          <td>Fraction of distinct generated structures</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>All</td>
          <td>Fraction not in training set</td>
      </tr>
      <tr>
          <td>Earth mover&rsquo;s distance</td>
          <td>All</td>
          <td>Distribution match for domain-specific properties</td>
      </tr>
      <tr>
          <td>RMSD</td>
          <td>Molecules</td>
          <td>Deviation from RDKit conformer geometries</td>
      </tr>
      <tr>
          <td>Coverage</td>
          <td>Crystals</td>
          <td>Recall and precision between generated and test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Models were trained on Compute Canada systems. Specific GPU types, counts, and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository, model weights, or datasets specific to this work were found. The ZINC, Perov-5, and MP-20 datasets used for evaluation are publicly available from their original sources.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D. &amp; Aspuru-Guzik, A. (2023). Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files. <em>arXiv preprint arXiv:2305.05708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2023language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can generate molecules, materials, and protein binding sites directly in three dimensions as {XYZ}, {CIF}, and {PDB} files}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2305.05708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Back Translation for Semi-Supervised Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/back-translation-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/back-translation-molecule-generation/</guid><description>A semi-supervised method adapting NLP back translation to molecule generation, improving property optimization and retrosynthesis with unlabeled ZINC data.</description><content:encoded><![CDATA[<h2 id="semi-supervised-data-augmentation-for-molecular-tasks">Semi-Supervised Data Augmentation for Molecular Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces back translation, a semi-supervised technique from neural machine translation, to the domain of molecular generation. The primary contribution is a general-purpose data augmentation strategy that leverages large pools of unlabeled molecules (from databases like ZINC) to improve the performance of both sequence-based and graph-based models on molecule optimization and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> prediction tasks.</p>
<h2 id="bridging-the-labeled-data-gap-in-molecular-generation">Bridging the Labeled Data Gap in Molecular Generation</h2>
<p>Molecular generation tasks, such as property optimization and retrosynthesis, require paired training data: an input molecule (or property specification) mapped to a desired output molecule. Obtaining these labeled pairs is expensive and labor-intensive. Meanwhile, enormous databases of unlabeled molecules exist. ZINC alone contains over 750 million compounds, and PubChem has 109 million.</p>
<p>Prior approaches to using unlabeled molecular data include <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">variational autoencoders (VAEs)</a> for learning latent representations, conditional recurrent neural networks for inverse design, and pretraining techniques borrowed from NLP. However, these methods either focus on representation learning rather than direct generation, or require task-specific architectural modifications. The authors identify back translation, a well-established technique in machine translation, as a natural fit for molecular generation tasks that can be treated as sequence-to-sequence mappings.</p>
<h2 id="back-translation-as-molecular-data-augmentation">Back Translation as Molecular Data Augmentation</h2>
<p>The core idea is straightforward. Given a main task that maps from source domain $\mathcal{X}$ to target domain $\mathcal{Y}$ (e.g., mapping low-QED molecules to high-QED molecules), the method trains a reverse model $g$ that maps from $\mathcal{Y}$ back to $\mathcal{X}$. This reverse model then &ldquo;back translates&rdquo; unlabeled molecules from $\mathcal{Y}$ to generate synthetic source molecules, creating pseudo-labeled training pairs.</p>
<p>The theoretical motivation comes from maximizing the reconstruction probability. Given an unlabeled molecule $y_u \in \mathcal{U}_y$, the logarithmic reconstruction probability through the reverse model $g$ and forward model $f$ is:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) = \log \sum_{\hat{x}_u \in \mathcal{X}} P(\hat{x}_u \mid y_u; g) P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>Since summing over the exponentially large space $\mathcal{X}$ is intractable, the authors apply Jensen&rsquo;s inequality to obtain a lower bound:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) \geq \mathbb{E}_{\hat{x}_u \sim P(\cdot \mid y_u; g)} \log P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>This lower bound is optimized via Monte Carlo sampling in three steps:</p>
<p><strong>Step 1</strong>: Train both forward model $f$ and reverse model $g$ on the labeled data $\mathcal{L}$:</p>
<p>$$
\begin{aligned}
\min_{\theta_f} \sum_{(x,y) \in \mathcal{L}} -\log P(y \mid x; \theta_f) \\
\min_{\theta_g} \sum_{(x,y) \in \mathcal{L}} -\log P(x \mid y; \theta_g)
\end{aligned}
$$</p>
<p><strong>Step 2</strong>: Use the trained reverse model $g$ to back translate each unlabeled molecule $y_u \in \mathcal{U}_y$, producing synthetic pairs:</p>
<p>$$
\hat{\mathcal{L}} = {(\hat{x}_u, y_u) \mid y_u \in \mathcal{U}_y, \hat{x}_u \text{ sampled from } P(\cdot \mid y_u; \theta_g)}
$$</p>
<p><strong>Step 3</strong>: Retrain the forward model $f$ on the combined labeled and synthetic data $\mathcal{L} \cup \hat{\mathcal{L}}$, warm-starting from the parameters obtained in Step 1:</p>
<p>$$
\min_{\theta_f^*} \sum_{(x,y) \in \mathcal{L} \cup \hat{\mathcal{L}}} -\log P(y \mid x; \theta_f^*)
$$</p>
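<p>The three steps above can be sketched schematically, with the model internals abstracted behind callables (the names here are illustrative, not from the paper's code):</p>

```python
def back_translation(train, labeled_pairs, unlabeled_targets, sample_reverse):
    """Schematic back translation.
    train(pairs)         -> a trained model, i.e. a callable x -> y
    sample_reverse(g, y) -> a source sampled from P(. | y; g)"""
    # Step 1: train forward and reverse models on the labeled pairs.
    f = train(labeled_pairs)
    g = train([(y, x) for x, y in labeled_pairs])
    # Step 2: back translate unlabeled targets into synthetic pairs.
    synthetic = [(sample_reverse(g, y), y) for y in unlabeled_targets]
    # Step 3: retrain the forward model on labeled + synthetic data.
    f = train(labeled_pairs + synthetic)
    return f
```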
<p>A key practical finding is that data filtration matters. When using large amounts of unlabeled data (1M molecules), keeping only the synthetic pairs that satisfy the same constraints as the labeled data (e.g., similarity thresholds and property ranges) significantly improves performance over using all back-translated data unfiltered.</p>
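<p>The filtration step can be sketched as follows, assuming a QED-style target property range and a Dice similarity constraint between source and target (thresholds and helper names are illustrative; the paper computes Dice similarity on Morgan fingerprints via RDKit):</p>

```python
def dice_similarity(fp_a, fp_b):
    """Dice similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return 2.0 * len(fp_a & fp_b) / (len(fp_a) + len(fp_b))

def filter_synthetic(pairs, fingerprint, prop, sim_min=0.4, prop_range=(0.9, 1.0)):
    """Keep only back-translated pairs that satisfy the same constraints
    as the labeled data: target property in range, source sufficiently
    similar to target."""
    lo, hi = prop_range
    return [(x, y) for x, y in pairs
            if lo <= prop(y) <= hi
            and dice_similarity(fingerprint(x), fingerprint(y)) >= sim_min]
```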
<h2 id="experiments-on-property-optimization-and-retrosynthesis">Experiments on Property Optimization and Retrosynthesis</h2>
<h3 id="molecular-property-improvement">Molecular Property Improvement</h3>
<p>The authors evaluate on four tasks from Jin et al. (2019, 2020), each requiring the model to improve a specific molecular property while maintaining structural similarity (measured by Dice similarity on Morgan fingerprints):</p>
<ul>
<li><strong>LogP</strong> (penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">partition coefficient</a>): two settings with similarity thresholds $\delta \geq 0.4$ and $\delta \geq 0.6$</li>
<li><strong>QED</strong> (quantitative estimation of drug-likeness): translate molecules from QED range [0.7, 0.8] to [0.9, 1.0]</li>
<li><strong>DRD2</strong> (<a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine type 2 receptor</a> activity): translate inactive ($P &lt; 0.5$) to active ($P \geq 0.5$)</li>
</ul>
<p>Two backbone architectures are tested: a Transformer (6 layers, 4 heads, 128-dim embeddings, 512-dim FFN) and HierG2G, a hierarchical graph-to-graph translation model. Unlabeled molecules are sampled from ZINC at 250K and 1M scales.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP ($\delta \geq 0.6$)</th>
          <th>LogP ($\delta \geq 0.4$)</th>
          <th>QED (%)</th>
          <th>DRD2 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.28</td>
          <td>1.03</td>
          <td>8.8</td>
          <td>3.4</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>0.79</td>
          <td>2.49</td>
          <td>9.4</td>
          <td>4.4</td>
      </tr>
      <tr>
          <td>JTNN</td>
          <td>2.33</td>
          <td>3.55</td>
          <td>59.9</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Transformer baseline</td>
          <td>2.45</td>
          <td>3.69</td>
          <td>71.9</td>
          <td>60.2</td>
      </tr>
      <tr>
          <td>+BT (1M, filtered)</td>
          <td>2.86</td>
          <td>4.41</td>
          <td>82.9</td>
          <td>67.4</td>
      </tr>
      <tr>
          <td>HierG2G baseline</td>
          <td>2.49</td>
          <td>3.98</td>
          <td>76.9</td>
          <td>85.9</td>
      </tr>
      <tr>
          <td>+BT (250K, filtered)</td>
          <td>2.75</td>
          <td>4.24</td>
          <td>79.1</td>
          <td>87.3</td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis-prediction">Retrosynthesis Prediction</h3>
<p>On the USPTO-50K benchmark (50K reactions, 10 reaction types, 80/10/10 train/val/test split), the method is applied to Transformer and GLN (Graph Logic Network) backbones. For other approaches to this benchmark, see <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a> and <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">Data Transfer for Retrosynthesis</a>. Unlabeled reactant sets are constructed by sampling molecules from ZINC and concatenating them following the training data&rsquo;s reactant count distribution ($N_1 : N_2 : N_3 = 29.3\% : 70.4\% : 0.3\%$).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Top-1</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Reaction type given</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>64.2</td>
          <td>79.1</td>
          <td>85.2</td>
          <td>90.0</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>67.9</td>
          <td>82.5</td>
          <td>87.3</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>52.2</td>
          <td>68.2</td>
          <td>72.7</td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>55.9</td>
          <td>72.8</td>
          <td>77.8</td>
          <td>79.7</td>
      </tr>
      <tr>
          <td><strong>Reaction type unknown</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>54.7</td>
          <td>70.2</td>
          <td>77.0</td>
          <td>84.4</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>37.9</td>
          <td>57.3</td>
          <td>62.7</td>
          <td>68.1</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>43.5</td>
          <td>58.8</td>
          <td>64.6</td>
          <td>69.7</td>
      </tr>
  </tbody>
</table>
<p>The improvements are largest at lower $k$ values (top-1 and top-3), suggesting that back translation helps the model make more precise high-confidence predictions.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Effect of unlabeled data size</strong>: On retrosynthesis with Transformer, performance improves as unlabeled data increases from 50K to 250K, then plateaus or declines beyond 250K. The authors attribute this to noise in the back-translated data outweighing the benefits at larger scales.</p>
<p><strong>Effect of labeled data size</strong>: With only 5K labeled samples, adding back-translated data hurts performance because the reverse model is too weak to generate useful synthetic data. As labeled data increases (10K, 25K, 50K), the benefit of back translation grows. This confirms that the method requires a reasonably well-trained reverse model to be effective.</p>
<p><strong>Data filtration</strong>: Using 1M unfiltered back-translated molecules can underperform smaller filtered sets (e.g., QED reaches only 75.1% vs. 82.9% with filtering, though still above the 71.9% baseline), while filtering to enforce the same constraints as the labeled data recovers and exceeds the 250K filtered results.</p>
<h2 id="consistent-gains-across-architectures-and-tasks">Consistent Gains Across Architectures and Tasks</h2>
<p>The method achieves state-of-the-art results on all four molecular property improvement tasks and the USPTO-50K retrosynthesis benchmark at time of publication. Several observations stand out:</p>
<ol>
<li><strong>Architecture agnosticism</strong>: Back translation improves both sequence-based (Transformer) and graph-based (HierG2G, GLN) models, confirming that the approach is independent of the underlying architecture.</li>
<li><strong>Filtration is essential at scale</strong>: Unfiltered 1M back-translated data can degrade performance, but filtered data at the same scale consistently outperforms smaller unfiltered sets.</li>
<li><strong>Training overhead is moderate</strong>: On the DRD2 task, back translation with Transformer takes 11.0h vs. 8.5h for supervised training alone, with the back-translation step itself taking under 1 hour.</li>
<li><strong>Diversity and novelty increase</strong>: Back translation improves both diversity (average pairwise distance among generated molecules) and novelty (fraction of generated molecules not seen in training) across QED and DRD2 tasks.</li>
</ol>
<p>The authors acknowledge limitations: the method does not form a closed loop between forward and reverse models (as in dual learning approaches), and the data filtration strategy is rule-based rather than learned. They suggest joint training of forward and reverse models and learned filtration as future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (property improvement)</td>
          <td>Jin et al. (2019, 2020) datasets</td>
          <td>34K-99K pairs</td>
          <td>LogP, QED, DRD2 tasks</td>
      </tr>
      <tr>
          <td>Training (retrosynthesis)</td>
          <td>USPTO-50K</td>
          <td>40K reactions</td>
          <td>80/10/10 split from Dai et al. (2019)</td>
      </tr>
      <tr>
          <td>Unlabeled molecules</td>
          <td>ZINC</td>
          <td>250K or 1M</td>
          <td>Randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Same as training</td>
          <td>800-1000 test samples</td>
          <td>Per-task test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Back translation with optional data filtration</li>
<li>Beam search with $k=20$ for inference</li>
<li>Random sampling for back-translation step (Equation 5)</li>
<li>Dice similarity on Morgan fingerprints for similarity constraint</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Transformer</strong>: 6 layers, 4 attention heads, 128-dim embeddings, 512-dim FFN (for property improvement); 4 layers, 8 heads, 256-dim embeddings, 2048-dim FFN (for retrosynthesis)</li>
<li><strong>HierG2G</strong>: Settings from Jin et al. (2020)</li>
<li><strong>GLN</strong>: Settings from Dai et al. (2019)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.6$)</td>
          <td>2.86</td>
          <td>2.49 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.4$)</td>
          <td>4.41</td>
          <td>3.98 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>QED</td>
          <td>82.9%</td>
          <td>76.9% (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>DRD2</td>
          <td>87.3%</td>
          <td>85.9% (HierG2G)</td>
          <td>HierG2G + BT(250K, filtered)</td>
      </tr>
      <tr>
          <td>Top-1 accuracy</td>
          <td>USPTO-50K (known type)</td>
          <td>67.9%</td>
          <td>64.2% (GLN)</td>
          <td>Ours + GLN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper reports training times (8.5h for Transformer, 16.8h for HierG2G on DRD2 with 1M unlabeled data) but does not specify the GPU hardware used.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fyabc/BT4MolGen">BT4MolGen</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation in Python</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, Y., Xia, Y., Zhu, J., Wu, L., Xie, S., &amp; Qin, T. (2021). Back translation for molecule generation. <em>Bioinformatics</em>, 38(5), 1244-1251. <a href="https://doi.org/10.1093/bioinformatics/btab817">https://doi.org/10.1093/bioinformatics/btab817</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fan2022back,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Back translation for molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fan, Yang and Xia, Yingce and Zhu, Jinhua and Wu, Lijun and Xie, Shufang and Qin, Tao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1244--1251}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bioinformatics/btab817}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RetMol: Retrieval-Based Controllable Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/</guid><description>RetMol uses retrieval-augmented generation to steer a pre-trained molecular model toward desired properties using only a handful of exemplar molecules.</description><content:encoded><![CDATA[<h2 id="retrieval-augmented-generation-for-molecules">Retrieval-Augmented Generation for Molecules</h2>
<p>This is a <strong>Method</strong> paper that introduces RetMol, a retrieval-based framework for controllable molecule generation. The key idea is to guide a pre-trained generative model using a small set of exemplar molecules that partially satisfy the desired design criteria, retrieved from a task-specific database. The approach requires no task-specific fine-tuning of the generative backbone and works effectively with very few exemplar molecules (as few as 23).</p>
<h2 id="limitations-of-existing-controllable-generation">Limitations of Existing Controllable Generation</h2>
<p>Existing approaches to controllable molecule generation fall into three categories, each with drawbacks:</p>
<ol>
<li><strong>Reinforcement learning (RL)-based methods</strong> require task-specific fine-tuning of the generative model for each new objective</li>
<li><strong>Supervised learning (SL)-based methods</strong> need molecules with desired properties as training data, which may be scarce</li>
<li><strong>Latent optimization-based methods</strong> require training property predictors in the latent space, which is challenging with limited active molecules and incompatible with variable-length latent spaces like those in transformers</li>
</ol>
<p>RetMol addresses all three issues by keeping the generative backbone frozen and using a lightweight, task-agnostic retrieval module that can be applied to new tasks simply by swapping the retrieval database.</p>
<h2 id="the-retmol-framework">The RetMol Framework</h2>
<p>RetMol consists of four components built around a pre-trained encoder-decoder backbone (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>, a BART variant trained on ZINC):</p>
<h3 id="retrieval-database">Retrieval Database</h3>
<p>A task-specific collection of exemplar molecules that at least partially satisfy the design criteria. The database can be very small (e.g., 23 known inhibitors for the SARS-CoV-2 task) and is dynamically updated during inference with newly generated molecules.</p>
<h3 id="molecule-retriever">Molecule Retriever</h3>
<p>A heuristic-based module that selects the $K$ most relevant exemplar molecules (default $K = 10$). It first constructs a feasible set of molecules satisfying all constraints, then selects those with the best property scores. If too few molecules satisfy all constraints, it progressively relaxes constraints until enough candidates are available.</p>
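<p>A minimal sketch of this retrieval heuristic, with constraints as boolean predicates (names and the relaxation order are this sketch's assumptions, not RetMol's API):</p>

```python
def retrieve_exemplars(database, constraints, score, k=10):
    """Build the feasible set of molecules satisfying all constraints,
    then return the K best by property score. If too few molecules
    qualify, progressively drop constraints and retry."""
    active = list(constraints)
    while True:
        feasible = [m for m in database if all(c(m) for c in active)]
        if len(feasible) >= k or not active:
            break
        active.pop()  # relax one constraint and retry
    return sorted(feasible, key=score, reverse=True)[:k]
```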
<h3 id="information-fusion-via-cross-attention">Information Fusion via Cross-Attention</h3>
<p>The core trainable component. Retrieved exemplar embeddings are fused with the input molecule embedding using cross-attention:</p>
<p>$$\boldsymbol{e} = f_{\text{CA}}(\boldsymbol{e}_{\text{in}}, \boldsymbol{E}_r; \theta) = \text{Attn}(\text{Query}(\boldsymbol{e}_{\text{in}}), \text{Key}(\boldsymbol{E}_r)) \cdot \text{Value}(\boldsymbol{E}_r)$$</p>
<p>where $\boldsymbol{e}_{\text{in}} = \text{Enc}(x_{\text{in}}) \in \mathbb{R}^{L \times D}$ is the input embedding and $\boldsymbol{E}_r = [\boldsymbol{e}_r^1, \ldots, \boldsymbol{e}_r^K]$ are the retrieved exemplar embeddings. This module adds less than 5% parameter overhead (460K parameters over the 10M base model).</p>
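<p>A single-head sketch of this fusion in NumPy, with the exemplar embeddings concatenated along the sequence axis (projection matrices here are plain parameters; RetMol's actual module details may differ):</p>

```python
import numpy as np

def cross_attention(e_in, E_r, Wq, Wk, Wv):
    """The input embedding queries the exemplar embeddings.
    e_in: (L, D) input embedding; E_r: (K*L, D) stacked exemplar
    embeddings; Wq, Wk, Wv: (D, D) projection matrices."""
    Q, K, V = e_in @ Wq, E_r @ Wk, E_r @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # scaled dot products
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    return attn @ V                                # fused embedding, (L, D)
```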
<h3 id="self-supervised-training-nearest-neighbor-prediction">Self-Supervised Training: Nearest Neighbor Prediction</h3>
<p>Rather than reconstructing the input molecule (which would make the retrieval module unnecessary), RetMol trains the fusion module to predict the nearest neighbor of the input:</p>
<p>$$\mathcal{L}(\theta) = \sum_{i=1}^{B} \text{CE}\left(\text{Dec}\left(f_{\text{CA}}(\boldsymbol{e}_{\text{in}}^{(i)}, \boldsymbol{E}_r^{(i)}; \theta)\right), x_{\text{1NN}}^{(i)}\right)$$</p>
<p>The remaining $K - 1$ nearest neighbors serve as the retrieved exemplar molecules. This forces the fusion module to learn how to use exemplar molecules to transform the input toward a related target. Only the fusion module parameters are updated; the encoder and decoder remain frozen.</p>
<h2 id="iterative-refinement-at-inference">Iterative Refinement at Inference</h2>
<p>During inference, RetMol uses an iterative process:</p>
<ol>
<li>Encode the input molecule and retrieved exemplars</li>
<li>Fuse embeddings via cross-attention</li>
<li>Perturb the fused embedding $M$ times with Gaussian noise</li>
<li>Greedily decode $M$ candidate molecules</li>
<li>Replace the input with the best candidate if it improves upon the current score</li>
<li>Add remaining good candidates to the retrieval database</li>
<li>Repeat until convergence or a maximum number of iterations</li>
</ol>
<p>The dynamic update of the retrieval database is critical for extrapolating beyond the initial set of exemplar molecules.</p>
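<p>The loop structure can be sketched on a toy problem. Here the &ldquo;molecule&rdquo; is just a vector and decoding is the identity, so only the hill-climbing skeleton of steps 3&ndash;5 is visible; real RetMol perturbs the fused embedding and decodes SMILES with the frozen Chemformer.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def refine(x0, score, M=10, sigma=1.0, max_iters=20):
    """Perturb M times per round; keep the best candidate if it improves."""
    x, best = x0, score(x0)
    for _ in range(max_iters):
        cands = [x + sigma * rng.standard_normal(x.shape) for _ in range(M)]
        c = max(cands, key=score)
        if score(c) > best:        # step 5: replace input with the improvement
            x, best = c, score(c)
    return x, best

target = np.array([1.0, -2.0, 0.5])
score = lambda v: -float(np.linalg.norm(v - target))
x_opt, best = refine(np.zeros(3), score)
print(best > score(np.zeros(3)))  # whether refinement improved on the start
```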
<h2 id="experiments-and-results">Experiments and Results</h2>
<p>RetMol is evaluated on four tasks of increasing difficulty:</p>
<h3 id="qed-optimization-under-similarity-constraint">QED Optimization Under Similarity Constraint</h3>
<p>Goal: generate molecules with QED $\geq$ 0.9 while maintaining <a href="https://en.wikipedia.org/wiki/Tanimoto_coefficient">Tanimoto similarity</a> $\geq$ 0.4 to the input. RetMol achieves a 94.5% success rate, compared to 92.8% for the previous best method (QMO).</p>
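<p>The success criterion is easy to state in code. A real evaluation would compute QED and Morgan fingerprints with RDKit; the sketch below instead uses toy fingerprints represented as sets of on-bits, so it is self-contained but not chemically meaningful.</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def task_success(qed, sim, qed_min=0.9, sim_min=0.4):
    """QED >= 0.9 while staying Tanimoto-similar (>= 0.4) to the input."""
    return qed >= qed_min and sim >= sim_min

a, b = {1, 4, 7, 9}, {1, 4, 8}
print(round(tanimoto(a, b), 3))                     # 0.4 (2 shared bits / 5 total)
print(task_success(qed=0.92, sim=tanimoto(a, b)))   # True (boundary counts)
```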
<h3 id="penalized-logp-optimization">Penalized LogP Optimization</h3>
<p>Goal: maximize penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">LogP</a> while maintaining structural similarity. At $\delta = 0.4$, RetMol achieves an average improvement of 11.55, compared to 7.71 for QMO.</p>
<h3 id="gsk3beta--jnk3-dual-inhibitor-design"><a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>$\beta$ + <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> Dual Inhibitor Design</h3>
<p>Goal: simultaneously satisfy four constraints (GSK3$\beta$ inhibition $\geq$ 0.5, JNK3 inhibition $\geq$ 0.5, QED $\geq$ 0.6, SA $\leq$ 4). Results:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Success %</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>47.9</td>
          <td>0.561</td>
          <td>0.621</td>
      </tr>
      <tr>
          <td>RationaleRL</td>
          <td>74.8</td>
          <td>0.568</td>
          <td>0.701</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>92.3</td>
          <td>0.824</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>MolEvol</td>
          <td>93.0</td>
          <td>0.757</td>
          <td>0.681</td>
      </tr>
      <tr>
          <td>RetMol</td>
          <td>96.9</td>
          <td>0.862</td>
          <td>0.732</td>
      </tr>
  </tbody>
</table>
<p>RetMol achieves this without task-specific fine-tuning and requires only 80 iterations compared to MARS&rsquo;s 550.</p>
<h3 id="sars-cov-2-main-protease-inhibitor-optimization"><a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 Main Protease</a> Inhibitor Optimization</h3>
<p>A real-world task using only 23 known inhibitors as the retrieval database and optimizing 8 weakly binding drugs. Under the milder similarity constraint ($\delta = 0.4$), RetMol achieves an average binding-affinity improvement of 2.84 kcal/mol versus 1.67 kcal/mol for Graph GA. Under the stricter constraint ($\delta = 0.6$), RetMol succeeds on 5/8 molecules versus 3/8 for Graph GA.</p>
<h2 id="key-analysis-findings">Key Analysis Findings</h2>
<ul>
<li><strong>Database size</strong>: Strong performance even with 100 molecules, already outperforming baselines on success rate</li>
<li><strong>Database quality</strong>: Molecules satisfying all four constraints give the best results (96.9%), but partial satisfaction still works reasonably well (84.7% with molecules satisfying only two properties)</li>
<li><strong>Training objective</strong>: The nearest neighbor prediction objective outperforms conventional reconstruction on validity (0.902 vs. 0.834) and uniqueness (0.922 vs. 0.665)</li>
<li><strong>Dynamic database update</strong>: Essential for extrapolating beyond the initial retrieval database, generating molecules with property values exceeding the best in the original database</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>RetMol requires exemplar molecules that at least partially satisfy the design criteria. When such molecules are entirely unavailable, the framework cannot be applied. The method also relies on property predictors (for scoring and retrieval), whose accuracy directly affects generation quality. The iterative refinement process adds computational overhead at inference time, and the results depend on the Chemformer backbone&rsquo;s generation capabilities.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol</a></td>
          <td>Code</td>
          <td>NVIDIA Source Code License-NC</td>
          <td>Full training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol (checkpoints)</a></td>
          <td>Model</td>
          <td>CC BY-NC-SA 4.0</td>
          <td>Pre-trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k and ChEMBL datasets for training. Task-specific retrieval databases constructed from these datasets. COVID-19 task uses 23 known SARS-CoV-2 Mpro inhibitors.</p>
<p><strong>Training</strong>: Information fusion module trained on 4x V100 GPUs (16GB each) for approximately 2 hours. Batch size of 256 per GPU, 50K iterations.</p>
<p><strong>Inference</strong>: Single V100 GPU. Greedy decoding with Gaussian perturbation ($\sigma = 1$) for sampling multiple candidates per iteration.</p>
<p><strong>Backbone</strong>: Chemformer (BART variant) pre-trained on ZINC. Frozen during RetMol training and inference.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, Z., Nie, W., Qiao, Z., Xiao, C., Baraniuk, R. G., &amp; Anandkumar, A. (2023). Retrieval-based Controllable Molecule Generation. <em>Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023)</em>.</p>
<p><strong>Publication</strong>: International Conference on Learning Representations (ICLR) 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/NVlabs/RetMol">GitHub: NVlabs/RetMol</a></li>
<li><a href="https://openreview.net/forum?id=vDFA1tpuLvk">OpenReview</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2023retrieval,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Retrieval-based Controllable Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Zichao and Nie, Weili and Qiao, Zhuoran and Xiao, Chaowei and Baraniuk, Richard G. and Anandkumar, Anima}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=vDFA1tpuLvk}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGen: Molecular Generation with Chemical Feedback</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/</guid><description>MolGen pre-trains on SELFIES molecules and uses chemical feedback to align generated molecules with real-world chemical preferences across domains.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-method-for-molecular-generation">A SELFIES-Based Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces MolGen, a pre-trained molecular language model for generating molecules with desired chemical properties. The primary contribution is a three-part framework: (1) pre-training on 100M+ molecular SELFIES to learn structural and grammatical knowledge, (2) domain-agnostic molecular prefix tuning for cross-domain knowledge transfer, and (3) a chemical feedback paradigm that aligns the model&rsquo;s generative probabilities with real-world chemical preferences. MolGen is the first language model pre-trained on SELFIES rather than SMILES, which guarantees 100% syntactic validity of generated molecules.</p>
<h2 id="challenges-in-language-model-based-molecule-generation">Challenges in Language Model-Based Molecule Generation</h2>
<p>Generating novel molecules with desirable properties is a central task in drug discovery and chemical design. The molecular space is estimated at $10^{33}$ possible structures, making exhaustive search impractical. Prior deep generative approaches face several limitations:</p>
<ol>
<li><strong>Syntactic invalidity</strong>: <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based language models frequently generate strings that do not correspond to valid molecular graphs. A single random mutation of a SMILES string has only a 9.9% chance of remaining valid.</li>
<li><strong>Narrow domain focus</strong>: Most existing models focus exclusively on synthetic molecules and neglect <a href="https://en.wikipedia.org/wiki/Natural_product">natural products</a>, which have distinct structural complexity and scaffold diversity.</li>
<li><strong>Molecular hallucinations</strong>: Generated molecules may satisfy chemical structural rules yet fail to exhibit anticipated chemical activity in practical applications. The authors formally define this as molecules that &ldquo;comply with chemical structural rules, yet fail to exhibit practical utility or the anticipated properties.&rdquo;</li>
<li><strong>Limited optimization signals</strong>: Existing approaches rely on reinforcement learning (high variance), fixed-dimensional latent spaces, or expert-provided generation rules, all of which impede efficient exploration of chemical space.</li>
</ol>
<h2 id="core-innovation-pre-training-with-selfies-and-chemical-feedback">Core Innovation: Pre-training with SELFIES and Chemical Feedback</h2>
<p>MolGen&rsquo;s novelty rests on three interconnected components.</p>
<h3 id="selfies-based-pre-training">SELFIES-Based Pre-training</h3>
<p>MolGen uses <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (Self-Referencing Embedded Strings) instead of SMILES. SELFIES guarantees that every possible combination of symbols in the alphabet corresponds to a chemically valid molecular graph. The model uses a compact vocabulary of 185 tokens.</p>
<p>The first pre-training stage uses a BART-style encoder-decoder. Tokens from a SELFIES string $S = \{s_1, \ldots, s_l\}$ are randomly replaced with [MASK], then the corrupted input is encoded bidirectionally and decoded left-to-right. The reconstruction loss is:</p>
<p>$$
\mathcal{L}_{\text{ce}}(S) = -\sum_{j=1}^{l} \sum_{s} p_{\text{true}}(s \mid S, S_{&lt; j}) \log p_{\theta}(s \mid S, S_{&lt; j}; \theta)
$$</p>
<p>where $S_{&lt; j}$ denotes the partial sequence $\{s_0, \ldots, s_{j-1}\}$ and $p_{\text{true}}$ is the one-hot distribution under standard maximum likelihood estimation.</p>
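<p>The corruption step itself is straightforward; a sketch (the 15% default rate here is illustrative, not necessarily the paper&rsquo;s masking ratio):</p>

```python
import random

def corrupt(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace SELFIES tokens with [MASK]; the encoder reads the
    corrupted sequence and the decoder reconstructs the original left-to-right."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_prob else t for t in tokens]

tokens = ["[C]", "[C]", "[O]", "[C]", "[Branch1]", "[C]", "[O]"]
print(corrupt(tokens, mask_prob=0.5))
```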
<h3 id="domain-agnostic-molecular-prefix-tuning">Domain-Agnostic Molecular Prefix Tuning</h3>
<p>The second pre-training stage introduces shared prefix vectors $P_k, P_v \in \mathbb{R}^{m \times d}$ prepended to the keys and values of multi-head attention at each layer. Unlike conventional prefix tuning that freezes model parameters, MolGen updates the entire model. The attention output becomes:</p>
<p>$$
\text{head} = \text{Attn}\left(xW_q, [P_k, XW_k], [P_v, XW_v]\right)
$$</p>
<p>This decomposes into a linear interpolation between prefix attention and standard attention:</p>
<p>$$
\text{head} = \lambda(x) \cdot \text{Attn}(xW_q, P_k, P_v) + (1 - \lambda(x)) \cdot \text{Attn}(xW_q, XW_k, XW_v)
$$</p>
<p>where $\lambda(x)$ is a scalar representing the sum of normalized attention weights on the prefixes. The prefixes are trained simultaneously across synthetic and natural product domains, acting as a domain instructor.</p>
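<p>This decomposition is an exact algebraic identity of the softmax over concatenated keys, which a few lines of NumPy can verify for a single query (dimensions below are arbitrary):</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, m, n = 8, 5, 12                 # head dim, prefix length, sequence length
q = rng.standard_normal(d)         # a single query row x W_q
Pk, Pv = rng.standard_normal((m, d)), rng.standard_normal((m, d))
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))

s_p, s_x = q @ Pk.T / np.sqrt(d), q @ K.T / np.sqrt(d)
w = softmax(np.concatenate([s_p, s_x]))    # attention over [P_k; X W_k]
full = w @ np.concatenate([Pv, V])

lam = w[:m].sum()                           # mass the query puts on prefixes
head = lam * (softmax(s_p) @ Pv) + (1 - lam) * (softmax(s_x) @ V)
print(np.allclose(full, head))  # True: the interpolation form is exact
```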
<h3 id="chemical-feedback-paradigm">Chemical Feedback Paradigm</h3>
<p>To address molecular hallucinations, MolGen aligns the model&rsquo;s probabilistic rankings with chemical preference rankings. Given a molecule $S$ and a set of candidate outputs $\mathcal{S}^*$ with distinct property scores $\text{Ps}(\cdot)$, the model should satisfy:</p>
<p>$$
p_{\text{true}}(S_i \mid S) &gt; p_{\text{true}}(S_j \mid S), \quad \forall S_i, S_j \in \mathcal{S}^*, \text{Ps}(S_i) &gt; \text{Ps}(S_j)
$$</p>
<p>This is enforced via a rank loss:</p>
<p>$$
\mathcal{L}_{\text{rank}}(S) = \sum_{i} \sum_{j &gt; i} \max\left(0, f(S_j) - f(S_i) + \gamma_{ij}\right)
$$</p>
<p>where $\gamma_{ij} = (j - i) \cdot \gamma$ is a margin scaled by rank difference and $f(S) = \sum_{t=1}^{l} \log p_{\theta}(s_t \mid S, S_{&lt; t}; \theta)$ is the estimated log-probability. The overall training objective combines cross-entropy and rank loss:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{\text{ce}} + \alpha \mathcal{L}_{\text{rank}}
$$</p>
<p>Label smoothing is applied to the target distribution in $\mathcal{L}_{\text{ce}}$, allocating probability mass $\beta$ to non-target tokens to maintain generative diversity.</p>
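<p>A direct transcription of the rank loss (pure Python; in practice $f(S)$ comes from the model&rsquo;s sequence log-probabilities rather than hand-set numbers):</p>

```python
def rank_loss(logps, gamma=1.0):
    """Pairwise margin loss. logps[i] is f(S_i), with candidates ordered
    so that index 0 has the best property score Ps."""
    loss = 0.0
    for i in range(len(logps)):
        for j in range(i + 1, len(logps)):
            margin = (j - i) * gamma       # gamma_ij grows with the rank gap
            loss += max(0.0, logps[j] - logps[i] + margin)
    return loss

print(rank_loss([0.0, -2.0, -4.0]))  # 0.0: ranking matches, margins satisfied
print(rank_loss([-4.0, -2.0, 0.0]))  # 12.0: fully inverted ranking is penalized
```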
<h2 id="experiments-across-distribution-learning-and-property-optimization">Experiments Across Distribution Learning and Property Optimization</h2>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>Stage 1 pre-training</strong>: 100M+ unlabeled molecules from ZINC-15 (molecular weight $\leq$ 500 Da, LogP $\leq$ 5)</li>
<li><strong>Stage 2 pre-training</strong>: 2.22M molecules spanning synthetic (ZINC, MOSES) and natural product (NPASS, 30,926 compounds) domains</li>
<li><strong>Downstream evaluation</strong>: MOSES synthetic dataset, ZINC250K, and natural product molecules</li>
</ul>
<h3 id="molecular-distribution-learning">Molecular Distribution Learning</h3>
<p>MolGen generates 10,000 synthetic and 80,000 natural product molecules, evaluated on seven metrics (Validity, Fragment similarity, Scaffold similarity, SNN, Internal Diversity, <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a>, and Novelty). Baselines include AAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>, CharRNN, VAE, JT-VAE, LIMO, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Validity</th>
          <th>Frag</th>
          <th>Scaf</th>
          <th>SNN</th>
          <th>IntDiv</th>
          <th>FCD</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemformer</td>
          <td>.9843</td>
          <td>.9889</td>
          <td>.9248</td>
          <td>.5622</td>
          <td>.8553</td>
          <td>.0061</td>
          <td>.9581</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>1.000</td>
          <td>.9999</td>
          <td>.9999</td>
          <td>.9996</td>
          <td>.8567</td>
          <td>.0015</td>
          <td>1.000</td>
      </tr>
  </tbody>
</table>
<p>On synthetic molecules, MolGen achieves 100% validity, near-perfect fragment and scaffold similarity, and the lowest FCD (0.0015). For natural products, MolGen achieves FCD of 0.6519 compared to Chemformer&rsquo;s 0.8346.</p>
<h3 id="targeted-molecule-discovery">Targeted Molecule Discovery</h3>
<p>For penalized logP maximization (top-3 scores):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1st</th>
          <th>2nd</th>
          <th>3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MARS (no length limit)</td>
          <td>44.99</td>
          <td>44.32</td>
          <td>43.81</td>
      </tr>
      <tr>
          <td>MolGen (no length limit)</td>
          <td>80.30</td>
          <td>74.70</td>
          <td>69.85</td>
      </tr>
      <tr>
          <td>MolGen (length-limited)</td>
          <td>30.51</td>
          <td>28.98</td>
          <td>28.95</td>
      </tr>
  </tbody>
</table>
<p>For QED maximization, MolGen achieves the maximum score of 0.948 across the top-3.</p>
<h3 id="molecular-docking">Molecular Docking</h3>
<p>MolGen optimizes binding affinity for two protein targets (<a href="https://en.wikipedia.org/wiki/Estrogen_receptor_alpha">ESR1</a> and ACAA1), measured by <a href="https://en.wikipedia.org/wiki/Dissociation_constant">dissociation constant</a> $K_D$ (lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESR1 1st</th>
          <th>ESR1 2nd</th>
          <th>ESR1 3rd</th>
          <th>ACAA1 1st</th>
          <th>ACAA1 2nd</th>
          <th>ACAA1 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>0.89</td>
          <td>1.4</td>
          <td>37</td>
          <td>37</td>
          <td>41</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>0.13</td>
          <td>0.35</td>
          <td>0.47</td>
          <td>3.36</td>
          <td>3.98</td>
          <td>8.50</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the lowest dissociation constants across both targets. Optimization of the 1,000 worst-affinity molecules yields 96.7% relative improvement for ESR1 and 70.4% for ACAA1.</p>
<h3 id="constrained-molecular-optimization">Constrained Molecular Optimization</h3>
<p>Optimizing 800 molecules from ZINC250K with lowest p-logP scores under Tanimoto similarity constraints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$\delta = 0.6$</th>
          <th>$\delta = 0.4$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a></td>
          <td>3.78 (3.29)</td>
          <td>11.55 (11.27)</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>12.08 (0.82)</td>
          <td>12.35 (1.21)</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the highest mean improvement with the lowest standard deviation under both constraints.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<ul>
<li><strong>Chemical feedback</strong>: Without it, the model generates molecules with property scores similar to initial molecules. With it ($\alpha = 3$), property scores increase progressively across generation rounds.</li>
<li><strong>Prefix tuning</strong>: Removing prefix tuning reduces constrained optimization improvement by 0.45 at $\delta = 0.6$ and 2.12 at $\delta = 0.4$.</li>
<li><strong>Label smoothing</strong>: Enhances diversity of generated molecules as measured by Internal Diversity.</li>
<li><strong>Substructure attention</strong>: MolGen focuses attention on chemically meaningful functional groups (fluoro, phenyl, hydroxyl), while SMILES-based PLMs scatter attention across syntactic tokens. The Substructure Attention Level (SAL) metric confirms MolGen&rsquo;s superior focus.</li>
</ul>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>SELFIES pre-training guarantees 100% molecular validity, eliminating the need for external valency checks.</li>
<li>Domain-agnostic prefix tuning enables effective knowledge transfer between synthetic and natural product domains.</li>
<li>The chemical feedback paradigm aligns model outputs with chemical preferences without requiring external annotated data or reference databases.</li>
<li>MolGen achieves the best or competitive results across all evaluated tasks: distribution learning, targeted molecule discovery, constrained optimization, and molecular docking.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational cost</strong>: Training and fine-tuning on large datasets is computationally intensive.</li>
<li><strong>Model interpretability</strong>: The transformer architecture makes it difficult to understand explicit rationale behind decisions.</li>
<li><strong>Single-target optimization only</strong>: The chemical feedback paradigm handles single-target optimization; multiple conflicting objectives could create ambiguous optimization trajectories.</li>
<li><strong>Task specificity</strong>: MolGen is designed for 2D molecular generation; 3D conformation information is not incorporated.</li>
<li><strong>Reaction prediction</strong>: When applied to reaction prediction (an off-target task), MolGen achieves only 71.4% accuracy on 39,990 reaction samples.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest applying MolGen to retrosynthesis and reaction prediction, exploring multimodal pre-training, and incorporating additional knowledge sources.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stage 1 pre-training</td>
          <td>ZINC-15</td>
          <td>100M+ molecules</td>
          <td>MW $\leq$ 500 Da, LogP $\leq$ 5</td>
      </tr>
      <tr>
          <td>Stage 2 pre-training</td>
          <td>ZINC + <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> + NPASS</td>
          <td>2.22M molecules</td>
          <td>Synthetic and natural product domains</td>
      </tr>
      <tr>
          <td>Distribution learning (synthetic)</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a></td>
          <td>~1.9M molecules</td>
          <td>Standard benchmark split</td>
      </tr>
      <tr>
          <td>Distribution learning (natural)</td>
          <td>NPASS</td>
          <td>30,926 compounds</td>
          <td>30,126 train / 800 test</td>
      </tr>
      <tr>
          <td>Constrained optimization</td>
          <td>ZINC250K</td>
          <td>800 molecules</td>
          <td>Lowest p-logP scores</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: BART-based encoder-decoder with SELFIES vocabulary (185 tokens)</li>
<li><strong>Prefix length</strong>: 5 tunable vectors per layer</li>
<li><strong>Optimizer</strong>: LAMB (pre-training), AdamW (fine-tuning)</li>
<li><strong>Pre-training</strong>: 600M steps with linear warm-up (180,000 steps) followed by linear decay</li>
<li><strong>Rank loss weight</strong> ($\alpha$): Recommended values of 3 or 5</li>
<li><strong>Candidate generation</strong>: 30 candidates per molecule (synthetic), 8 candidates (natural products)</li>
</ul>
<h3 id="models">Models</h3>
<p>MolGen is publicly available on Hugging Face. The model uses a vocabulary of 185 SELFIES tokens and is comparable in size to Chemformer-large.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>MolGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a> (lower is better)</td>
          <td>Synthetic</td>
          <td>0.0015</td>
          <td>0.0061 (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>)</td>
          <td>Distribution learning</td>
      </tr>
      <tr>
          <td>p-logP top-1 (no limit)</td>
          <td>Synthetic</td>
          <td>80.30</td>
          <td>44.99 (MARS)</td>
          <td>Targeted discovery</td>
      </tr>
      <tr>
          <td>QED top-1</td>
          <td>Synthetic</td>
          <td>0.948</td>
          <td>0.948 (several)</td>
          <td>Tied at maximum</td>
      </tr>
      <tr>
          <td>ESR1 $K_D$ top-1</td>
          <td>Docking</td>
          <td>0.13</td>
          <td>0.72 (LIMO)</td>
          <td>Binding affinity</td>
      </tr>
      <tr>
          <td>p-logP improvement ($\delta=0.4$)</td>
          <td>Synthetic</td>
          <td>12.35 (1.21)</td>
          <td>11.55 (11.27) (RetMol)</td>
          <td>Constrained optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>6 NVIDIA V100 GPUs</li>
<li>Pre-training batch size: 256 molecules per GPU</li>
<li>Fine-tuning batch size: 6 (synthetic and natural product)</li>
<li>Training: 100 epochs for fine-tuning tasks</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zjunlp/MolGen">zjunlp/MolGen</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/zjunlp">zjunlp/MolGen-large</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained weights on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, Y., Zhang, N., Chen, Z., Guo, L., Fan, X., &amp; Chen, H. (2024). Domain-Agnostic Molecular Generation with Chemical Feedback. <em>Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)</em>.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zjunlp/MolGen">GitHub: zjunlp/MolGen</a></li>
<li><a href="https://huggingface.co/zjunlp">Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2024domain,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Domain-Agnostic Molecular Generation with Chemical Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Guo, Lingbing and Fan, Xiaohui and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=9rPyHyjfwP}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GP-MoLFormer: Molecular Generation via Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/</guid><description>A 46.8M parameter transformer for molecular generation trained on 1.1B SMILES, introducing pair-tuning for efficient property optimization.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-focus">Contribution and Taxonomic Focus</h2>
<p>This is primarily a <strong>Methodological</strong> paper, as it proposes a specific neural architecture (GP-MoLFormer) and a novel fine-tuning algorithm (Pair-tuning) for molecular generation. It validates these contributions against standard baselines (e.g., JT-VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b).</p>
<p>It also contains a secondary <strong>Theoretical</strong> contribution by establishing an empirical <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">scaling law</a> that relates inference compute (generation size) to the novelty of the generated molecules.</p>
<h2 id="motivation-data-scale-and-prompt-based-optimization">Motivation: Data Scale and Prompt-Based Optimization</h2>
<p>While large language models (LLMs) have transformed text generation, the impact of training data scale and memorization on <em>molecular</em> generative models remains under-explored. Specifically, there is a need to understand how training on billion-scale datasets affects the novelty of generated molecules and whether biases in public databases (like ZINC and PubChem) perpetuate memorization. Furthermore, existing optimization methods often require computationally expensive property predictors or reinforcement learning loops; there is a practical need for more efficient &ldquo;prompt-based&rdquo; optimization techniques.</p>
<h2 id="core-innovations-architecture-and-pair-tuning">Core Innovations: Architecture and Pair-Tuning</h2>
<ol>
<li><strong>Architecture</strong>: The application of a linear-attention transformer decoder with Rotary Positional Embeddings (RoPE) to generative chemistry, allowing for efficient training on 1.1 billion SMILES.</li>
<li><strong>Pair-Tuning</strong>: A novel, parameter-efficient fine-tuning method that uses property-ordered molecular pairs to learn &ldquo;soft prompts&rdquo; for optimization without updating the base model weights.</li>
<li><strong>Scaling Analysis</strong>: An extensive empirical investigation mapping the trade-off between inference compute (up to 10B generations) and chemical novelty, fitting an exponential decay curve that demonstrates how novelty saturates as generation volume grows.</li>
</ol>
<h2 id="experimental-methodology-and-downstream-tasks">Experimental Methodology and Downstream Tasks</h2>
<p>The authors evaluated GP-MoLFormer on three distinct tasks, though the comparisons highlight the difficulty of evaluating foundation models against classical baselines:</p>
<ol>
<li><strong>De Novo Generation</strong>: Comparing validity, uniqueness, and novelty against baselines (CharRNN, VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/">LIMO</a>, MolGen-7b) on a held-out test set. Notably, this is an unequal comparison; most baselines were trained on the 1.6M molecule <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> dataset, whereas GP-MoLFormer uses up to 1.1B molecules, meaning performance gains are heavily driven by data scale.</li>
<li><strong>Scaffold-Constrained Decoration</strong>: Generating molecules from DRD2 active binder scaffolds and measuring the hit rate of active compounds against specialized scaffold decorators.</li>
<li><strong>Property-Guided Optimization</strong>: Using Pair-tuning to optimize for Drug-likeness (QED), Penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">logP</a>, and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> binding activity, comparing the results to graph-based and reinforcement learning benchmarks.</li>
</ol>
<p>Additionally, they performed a <strong>Scaling Study</strong>:</p>
<ul>
<li>Comparing models trained on raw (1.1B) vs. de-duplicated (650M) data.</li>
<li>Generating up to 10 billion molecules to fit empirical scaling laws for novelty.</li>
</ul>
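<p>The raw-versus-de-duplicated comparison hinges on keying molecules by canonical form rather than by raw string, since the same molecule admits many SMILES spellings. A minimal sketch, assuming a caller-supplied canonicalizer; the upper-casing stand-in below is for illustration only, and a real pipeline would canonicalize with RDKit&rsquo;s <code>Chem.MolToSmiles</code>:</p>

```python
def deduplicate(smiles_list, canonicalize):
    """Keep the first occurrence of each molecule, keyed by canonical form."""
    seen = set()
    unique = []
    for s in smiles_list:
        key = canonicalize(s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

# Toy stand-in canonicalizer (illustration only): real pipelines use RDKit's
# Chem.MolToSmiles so that different SMILES spellings of the same molecule
# collapse to a single key.
toy_canonicalize = lambda s: s.upper()

raw = ["CCO", "cco", "CCN", "CCO"]
print(deduplicate(raw, toy_canonicalize))  # -> ['CCO', 'CCN']
```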
<h2 id="key-findings-and-scaling-laws">Key Findings and Scaling Laws</h2>
<ul>
<li><strong>Scale-Driven Performance</strong>: GP-MoLFormer achieves high internal diversity and validity on generation metrics. However, its baseline novelty percentage (~32%) is considerably lower than that of classical models. The authors attribute this to the massive training scale forcing the model to heavily prioritize matching real-world molecule frequencies over pure exploration. GP-MoLFormer&rsquo;s advantage in generation metrics over LLM baselines like <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b likely stems heavily from its 10x larger training dataset rather than fundamental architectural superiority.</li>
<li><strong>Pair-Tuning Efficacy</strong>: The proposed pair-tuning method effectively optimizes properties (e.g., improving DRD2 activity scores) without requiring full model fine-tuning or external reward loops. While successful, the text-based generation yields ~94.5% validity during optimization, which lags behind graph- and SELFIES-based baselines that guarantee 100% structural validity.</li>
<li><strong>Memorization vs. Novelty</strong>: Training on de-duplicated data (GP-MoLFormer-UNIQ) yields higher novelty (approximately 5&ndash;8% higher) than training on raw data, confirming that duplication bias in public databases leads directly to memorization.</li>
<li><strong>Inference Scaling Law</strong>: Novelty decays exponentially with generation size ($y = ae^{-bx}$), yet the model maintains generative capability (~16.7% novelty) even after generating an unprecedented 10 billion molecules.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Sources</strong>: A combination of <strong>PubChem</strong> (111M SMILES) and <strong>ZINC</strong> (1B SMILES) databases. Downloading and pre-training instructions are located in the repository&rsquo;s <code>data/README.md</code>.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>All SMILES were canonicalized using RDKit (no isomeric information).</li>
<li><strong>GP-MoLFormer (Base)</strong>: Trained on the full 1.1B dataset (includes duplicates).</li>
<li><strong>GP-MoLFormer-UNIQ</strong>: Trained on a de-duplicated subset of 650M SMILES.</li>
</ul>
</li>
<li><strong>Tokenization</strong>: Uses the tokenizer from Schwaller et al. (2019) with a vocabulary size of <strong>2,362 tokens</strong>.</li>
<li><strong>Filtering</strong>: Sequences restricted to a maximum length of <strong>202 tokens</strong>.</li>
</ul>
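<p>The Schwaller et al. (2019) tokenizer referenced above is regex-based. The sketch below uses the widely circulated pattern from that work; it illustrates the tokenization step only, and does not reproduce how GP-MoLFormer&rsquo;s 2,362-token vocabulary was built:</p>

```python
import re

# Regex pattern from Schwaller et al. (2019), widely used for SMILES
# tokenization: bracket atoms, two-letter halogens, organic-subset atoms,
# bonds, branches, ring closures (including %NN), and digits.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # Every input character must be consumed by some token.
    assert "".join(tokens) == smiles, "untokenizable characters in input"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```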
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pair-Tuning (Algorithm 1)</strong>:</p>
<ul>
<li><strong>Objective</strong>: Learn task-specific soft prompts $\phi_T$ to maximize the conditional probability of target molecule $b$ given a seed molecule $a$, where pair $(a, b)$ satisfies the property condition $b &gt; a$. The base model parameters $\theta$ remain frozen.</li>
<li><strong>Prompt Structure</strong>: Autoregressive training optimizes the continuous embeddings of $n$ enhancement tokens against the cross-entropy loss of the target sequence:
$$ \mathcal{L}(\phi_T) = - \sum_{i=1}^{|b|} \log P_{\theta}(b_i | \phi_T, a, b_{&lt;i}) $$</li>
<li><strong>Hyperparameters</strong>: Trained for 1,000 epochs with a batch size of 35 and a fixed learning rate of $3 \times 10^{-2}$.</li>
<li><strong>Inference</strong>: The learned prompt $\phi_T$ and seed molecule $a$ are prepended as context, and candidates are sampled autoregressively until a termination token is produced.</li>
</ul>
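<p>The objective above can be evaluated directly once the frozen model&rsquo;s per-token probabilities are in hand. A minimal sketch of the loss computation only: gradient flow into the prompt embeddings $\phi_T$ and all model details are omitted, and the probabilities are illustrative stand-ins:</p>

```python
import math

def pair_tuning_loss(cond_probs):
    """Cross-entropy objective for pair-tuning (Algorithm 1).

    cond_probs[i] stands for P_theta(b_i | phi_T, a, b_{<i}): the frozen
    base model's probability of the i-th target token given the learned
    soft prompt phi_T, the seed molecule a, and the preceding target
    tokens. During training only phi_T would receive gradients; this
    sketch just evaluates the objective for given probabilities.
    """
    return -sum(math.log(p) for p in cond_probs)

# A 4-token target molecule b with stand-in per-token probabilities:
loss = pair_tuning_loss([0.9, 0.8, 0.95, 0.7])
print(round(loss, 4))  # -> 0.7365
```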
<h3 id="models">Models</h3>
<ul>
<li><strong>Availability</strong>: The model trained on deduplicated data (GP-MoLFormer-UNIQ) is publicly available on <a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">Hugging Face</a>. The full 1.1B base model is not explicitly hosted. The source code repository includes a disclosure that IBM will not maintain the code going forward.</li>
<li><strong>Architecture</strong>: Transformer decoder (~47M parameters: 12 layers, 12 heads, hidden size 768).</li>
<li><strong>Attention Mechanism</strong>: Combines Linear Attention (Generalized Random Feature map, $\phi$) with Rotary Positional Embeddings (RoPE). To avoid the quadratic complexity of standard attention while maintaining relative positional awareness, RoPE is applied to queries ($Q$) and keys ($K$) prior to the random feature mapping; the attention output at position $m$ is:
$$ \text{Attention}(Q, K, V)_m = \frac{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle v_n}{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle} $$</li>
<li><strong>Inference Speed</strong>: ~3ms per forward pass on a single A100 GPU.</li>
</ul>
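<p>The causal form of the linear-attention formula above can be computed in $O(N)$ with running sums over the keys and values. A sketch under simplifying assumptions: a plain exponential feature map stands in for the paper&rsquo;s generalized random features, and the RoPE rotations $R_m$ are omitted:</p>

```python
import math

def phi(x):
    # Positive feature map; a simple exp() stand-in for the paper's
    # generalized random feature map.
    return [math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_linear_attention(Q, K, V):
    """O(N) causal linear attention via running sums:
    out_m = sum_{n<=m} <phi(q_m), phi(k_n)> v_n / sum_{n<=m} <phi(q_m), phi(k_n)>
    """
    feat = len(K[0])   # feature dim (phi preserves length here)
    d = len(V[0])
    S = [[0.0] * d for _ in range(feat)]  # running sum of phi(k_n) outer v_n
    z = [0.0] * feat                      # running sum of phi(k_n)
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for i in range(feat):
            z[i] += fk[i]
            for j in range(d):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = dot(fq, z)
        out.append([dot(fq, [S[i][j] for i in range(feat)]) / denom
                    for j in range(d)])
    return out

# Two positions, 1-d heads: position 2 attends equally to both values.
print(causal_linear_attention([[0.0], [0.0]], [[0.0], [0.0]], [[1.0], [3.0]]))
# -> [[1.0], [2.0]]
```

Because the running sums <code>S</code> and <code>z</code> are updated once per position, cost grows linearly in sequence length instead of quadratically, which is what makes training on 1.1B SMILES tractable.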
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Generation Quality Metrics</strong>: Validity, Uniqueness, Novelty (<a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> suite), <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance (FCD)</a>, Scaffold similarity (Scaf), and Similarity to Nearest Neighbor (SNN).</li>
<li><strong>MoLFormer-Based Metrics</strong>: The authors introduce Fréchet <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a> Distance (FMD) and MoLFormer-space IntDiv2 to measure distributional similarity using their own pre-trained continuous embeddings instead of standard fingerprints.</li>
<li><strong>Optimization Metrics</strong>: Penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">logP</a> (calculated as $\text{logP} - \text{SA} - \max(\text{largest ring size} - 6, 0)$), Drug-likeness (QED), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> activity scores.</li>
<li><strong>Scaling Metrics</strong>: Empirical fit for novelty decay: $y = ae^{-bx}$.</li>
</ul>
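<p>The novelty-decay fit $y = ae^{-bx}$ can be reproduced with a simple log-linear least-squares regression. A sketch on synthetic data; the authors&rsquo; exact fitting procedure and fitted coefficients are not reproduced here:</p>

```python
import math

def fit_exp_decay(xs, ys):
    """Fit y = a * exp(-b * x) by least squares on ln(y) = ln(a) - b*x.

    A closed-form alternative to scipy.optimize.curve_fit, valid when
    all y > 0 (novelty fractions are positive).
    """
    n = len(xs)
    lys = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(lys) / n
    slope = (sum((x - mx) * (ly - my) for x, ly in zip(xs, lys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic novelty fractions drawn exactly from y = 0.32 * exp(-0.15 * x):
xs = [0.0, 2.0, 4.0, 6.0, 8.0]
ys = [0.32 * math.exp(-0.15 * x) for x in xs]
a, b = fit_exp_decay(xs, ys)
print(round(a, 3), round(b, 3))  # recovers a = 0.32, b = 0.15
```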
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 16 x NVIDIA A100 (80 GB) GPUs across 2 nodes connected via EDR Infiniband.</li>
<li><strong>Training Time</strong>:
<ul>
<li>GP-MoLFormer (1.1B data): ~115 hours total (28.75 hours/epoch for 4 epochs).</li>
<li>GP-MoLFormer-UNIQ (650M data): ~80 hours total.</li>
</ul>
</li>
<li><strong>Hyperparameters</strong>: Used a batch size of 1,600 molecules per GPU with a fixed learning rate of $1.6 \times 10^{-4}$ (scaled by up to a factor of $8$ as the number of GPUs increased).</li>
<li><strong>Optimization</strong>: Used distributed data-parallel training and adaptive bucketing by sequence length to handle scale.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/gp-molformer/">GP-MoLFormer (GitHub)</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation; IBM will not maintain going forward</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">GP-MoLFormer-Uniq (Hugging Face)</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained on 650M de-duplicated SMILES</td>
      </tr>
  </tbody>
</table>
<p>The full 1.1B base model weights are not publicly hosted. The training data (PubChem and ZINC) is publicly available, and instructions for downloading and pre-processing are in the repository&rsquo;s <code>data/README.md</code>.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Hoffman, S. C., Chenthamarakshan, V., Navratil, J., Mroueh, Y., &amp; Das, P. (2025). GP-MoLFormer: A Foundation Model For Molecular Generation. <em>Digital Discovery</em>, 4(10), 2684&ndash;2696. <a href="https://doi.org/10.1039/D5DD00122F">https://doi.org/10.1039/D5DD00122F</a></p>
<p><strong>Publication</strong>: Digital Discovery, vol. 4, no. 10, pp. 2684&ndash;2696 (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2025gpmolformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GP-MoLFormer: a foundation model for molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Hoffman, Samuel C and Chenthamarakshan, Vijil and Navratil, Jiri and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2684--2696}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D5DD00122F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemformer: A Pre-trained Transformer for Comp Chem</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/</guid><description>BART-based Transformer pre-trained on 100M molecules using self-supervision to accelerate convergence on chemical sequence tasks.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It proposes an architecture adaptation (Chemformer based on BART) and a specific pre-training strategy (&ldquo;Combined&rdquo; masking and augmentation). The paper validates this method by benchmarking against established models on multiple tasks, including direct synthesis, retrosynthesis, and molecular optimization. It also includes a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution by making the pre-trained models and code available.</p>
<h2 id="motivation-computational-bottlenecks-in-cheminformatics">Motivation: Computational Bottlenecks in Cheminformatics</h2>
<p>Existing Transformer models for cheminformatics are often developed for single applications and are computationally expensive to train from scratch. For example, training a Molecular Transformer for reaction prediction can take days, limiting hyperparameter exploration. Self-supervised pre-training (like BERT or T5) has significantly advanced NLP by reducing fine-tuning time and improving performance. In chemistry, applications have traditionally focused on task-specific datasets or encoder-only architectures, which perform poorly on sequence generation tasks. The authors aim to use transfer learning on a large unlabelled dataset to create a model that converges quickly and performs well across diverse sequence-to-sequence and discriminative tasks.</p>
<h2 id="core-innovation-bart-architecture-and-combined-pre-training">Core Innovation: BART Architecture and Combined Pre-training</h2>
<p>The primary insight lies in the adaptation of the <strong>BART architecture</strong> for chemistry and the introduction of a <strong>&ldquo;Combined&rdquo; self-supervised pre-training task</strong>.</p>
<ul>
<li><strong>Architecture</strong>: Chemformer uses the BART encoder-decoder structure, allowing it to handle both discriminative (property prediction) and generative (reaction prediction) tasks efficiently. This provides an alternative to encoder-only (BERT) or decoder-only (GPT) models.</li>
<li><strong>Combined Pre-training</strong>: The authors introduce a task that applies both <strong>Span Masking</strong> (randomly replacing tokens with <code>&lt;mask&gt;</code>) and <strong><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> Augmentation</strong> (permuting atom order, see <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">Randomized SMILES</a>) simultaneously. Formally, given a canonical SMILES sequence $x$, a corrupted sequence $\tilde{x} = \text{Mask}(\text{Augment}(x))$ is generated. The model is trained using an autoregressive cross-entropy loss to reconstruct the canonical sequence from the corrupted input:
$$ \mathcal{L}_{\text{pre-train}} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{&lt;t}, \tilde{x}) $$</li>
<li><strong>Tunable Augmentation</strong>: A downstream augmentation strategy is proposed where the probability of augmenting the input/output SMILES ($p_{aug}$) is a tunable hyperparameter, performed on-the-fly.</li>
</ul>
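<p>The augment-then-mask corruption can be sketched on token lists. The span-masking parameters below are illustrative rather than Chemformer&rsquo;s exact settings, and the identity stand-in for augmentation replaces what would really be an RDKit-generated randomized SMILES of the same molecule:</p>

```python
import random

def span_mask(tokens, mask_prob=0.15, max_span=3, rng=None):
    """BART-style span corruption: replace short random spans with <mask>.

    The corruption rate and span-length distribution here are illustrative,
    not Chemformer's exact settings.
    """
    rng = rng or random.Random()
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            out.append("<mask>")
            i += 1 + rng.randrange(max_span)  # consume a span of 1..max_span
        else:
            out.append(tokens[i])
            i += 1
    return out

def combined_corrupt(tokens, augment, rng):
    # "Combined" task: augment first, then mask; the model is trained to
    # reconstruct the canonical token sequence from this corrupted input.
    return span_mask(augment(tokens), rng=rng)

# Identity stand-in: real augmentation yields a randomized (non-canonical)
# SMILES of the same molecule, e.g. via RDKit atom-order permutation.
identity = lambda t: list(t)

print(combined_corrupt(list("CCOC(=O)C"), identity, random.Random(0)))
```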
<h2 id="experimental-setup-and-pre-training-tasks">Experimental Setup and Pre-training Tasks</h2>
<p>The authors pre-trained Chemformer on <strong>100 million molecules</strong> from ZINC-15 and fine-tuned it on three distinct task types:</p>
<ol>
<li><strong>Seq2Seq Reaction Prediction</strong>:
<ul>
<li><em>Direct Synthesis</em>: USPTO-MIT dataset (Mixed and Separated).</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: USPTO-50K dataset (see also <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a>).</li>
</ul>
</li>
<li><strong>Molecular Optimization</strong>: Generating molecules with improved properties (<a href="https://en.wikipedia.org/wiki/Distribution_coefficient">LogD</a>, solubility, clearance) starting from ChEMBL matched molecular pairs.</li>
<li><strong>Discriminative Tasks</strong>:
<ul>
<li><em><a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a></em>: Predicting properties (ESOL, FreeSolv, Lipophilicity) from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
<li><em>Bioactivity</em>: Predicting pXC50 values for 133 genes using ExCAPE data.</li>
</ul>
</li>
</ol>
<p>Ablation studies compared three pre-training strategies (Masking, Augmentation, Combined) against a randomly initialized baseline.</p>
<h2 id="results-trade-offs-and-conclusions">Results, Trade-offs, and Conclusions</h2>
<ul>
<li><strong>Performance</strong>: Chemformer achieved <strong>competitive top-1 accuracy</strong> on USPTO-MIT (91.3% Mixed) and USPTO-50K (53.6&ndash;54.3%), outperforming the Augmented Transformer and graph-based models (GLN, GraphRetro).</li>
<li><strong>Convergence Speed</strong>: Pre-training significantly accelerated training; fine-tuning for just 20 epochs (~30 minutes) outperformed baselines that were trained for significantly longer.</li>
<li><strong>Pre-training Tasks</strong>: The &ldquo;Combined&rdquo; task generally performed best for reaction prediction and bioactivity, while &ldquo;Masking&rdquo; was superior for molecular optimization.</li>
<li><strong>Augmentation Trade-off</strong>: The augmentation strategy improved top-1 accuracy but significantly degraded top-5/10 accuracy because beam search outputs became populated with augmented versions of the same molecule. This presents a considerable limitation for practical applications like retrosynthesis mapping, where retrieving a diverse set of candidate reactions is often critical.</li>
<li><strong>Discriminative Evaluation Caveats</strong>: Chemformer underperformed specialized baselines (like D-MPNN or <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/">MolBERT</a>) on small discriminative datasets. The authors note that direct comparison is difficult: Chemformer was trained simultaneously on multiple subtasks (multi-task learning), while the literature baselines were trained and tuned on each subtask separately. Additionally, the Chemformer encoder uses fewer than 20M parameters compared to MolBERT&rsquo;s approximately 85M, and Chemformer&rsquo;s pre-training does not include molecular property objectives. For other transfer learning approaches to QSAR, see <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a>.</li>
<li><strong>Pre-training Data Scope</strong>: The 100M pre-training dataset from ZINC-15 was selected with constraints on molecular weight ($\le 500$ Da) and LogP ($\le 5$), focusing the learned representations on small, drug-like molecules.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><em>Note: The primary GitHub repository for Chemformer was officially archived on February 11, 2026. Pre-trained weights and datasets used in the paper are still hosted externally on <a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Box</a>. Active development of Chemformer models has moved to the <a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels</a> repository.</em></p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/Chemformer">Chemformer (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Archived; original PyTorch implementation</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Active successor repository</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Pre-trained weights (Box)</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Base and Large model checkpoints</td>
      </tr>
  </tbody>
</table>
<p>The following datasets were used for pre-training and benchmarking.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Pre-training</strong></td>
          <td style="text-align: left">ZINC-15</td>
          <td style="text-align: left">100M</td>
<td style="text-align: left">Selected subset (annotated as reactive and purchasable; MW $\le 500$, LogP $\le 5$). Split: 99% Train / 0.5% Val / 0.5% Test.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Direct Synthesis</strong></td>
          <td style="text-align: left">USPTO-MIT</td>
          <td style="text-align: left">~470k</td>
          <td style="text-align: left">Evaluated on &ldquo;Mixed&rdquo; and &ldquo;Separated&rdquo; variants.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Retrosynthesis</strong></td>
          <td style="text-align: left">USPTO-50K</td>
          <td style="text-align: left">~50k</td>
          <td style="text-align: left">Standard benchmark for retrosynthesis.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimization</strong></td>
          <td style="text-align: left">ChEMBL MMPs</td>
          <td style="text-align: left">~160k Train</td>
          <td style="text-align: left">Matched Molecular Pairs for LogD, solubility, and clearance optimization.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Properties</strong></td>
          <td style="text-align: left">MoleculeNet</td>
          <td style="text-align: left">Small</td>
          <td style="text-align: left">ESOL (1128), FreeSolv (642), Lipophilicity (4200).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Bioactivity</strong></td>
          <td style="text-align: left">ExCAPE</td>
          <td style="text-align: left">~312k</td>
          <td style="text-align: left">133 gene targets; &gt;1200 compounds per gene.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenization (523 tokens total) derived from ChEMBL 27 canonical SMILES.</li>
<li><strong>Augmentation</strong>: SMILES enumeration (permuting atom order) used for pre-training and on-the-fly during fine-tuning ($p_{aug}=0.5$ for Seq2Seq, $p_{aug}=1.0$ for discriminative).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pre-training Tasks</strong>:
<ol>
<li><em>Masking</em>: Span masking (BART style).</li>
<li><em>Augmentation</em>: Input is a randomized SMILES; target is canonical SMILES.</li>
<li><em>Combined</em>: Input is augmented <em>then</em> masked; target is canonical SMILES.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$).</li>
<li>Schedule: Linear warm-up (8000 steps) for pre-training; One-cycle schedule for fine-tuning.</li>
</ul>
</li>
<li><strong>Inference</strong>: <a href="https://en.wikipedia.org/wiki/Beam_search">Beam search</a> with width 10 for Seq2Seq tasks. Used <code>molbart/inference_score.py</code> and <code>molbart/retrosynthesis/round_trip_inference.py</code> for standard and round-trip validation.</li>
</ul>
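<p>The beam-search decoding above (width 10) can be sketched generically. Here <code>step_probs</code> is a hypothetical stand-in for the decoder&rsquo;s next-token softmax; this is not the <code>molbart</code> implementation:</p>

```python
import math

def beam_search(step_probs, width=10, eos="<eos>", max_len=50):
    """Minimal beam search (the paper decodes with width 10).

    step_probs(prefix) -> {token: prob} is a stand-in for the decoder's
    next-token distribution; beams are ranked by total log-probability.
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for tok, p in step_probs(seq).items():
                candidates.append((seq + [tok], lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, lp in candidates[:width]:
            (finished if seq[-1] == eos else beams).append((seq, lp))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

# Toy two-step "decoder": emits C or O, then terminates.
def toy_model(prefix):
    return {"C": 0.7, "O": 0.3} if not prefix else {"<eos>": 1.0}

top = beam_search(toy_model, width=2)
print(top[0][0])  # -> ['C', '<eos>']
```

The augmentation trade-off discussed in the results follows directly from this ranking step: when several SMILES spellings of the same molecule each receive high probability, they crowd out chemically distinct candidates from the top-$k$ beams.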
<h3 id="models">Models</h3>
<p>Two model sizes were trained. Both use the Pre-Norm Transformer layout with GELU activation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Chemformer (Base)</th>
          <th style="text-align: left">Chemformer-Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">6</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model Dimension</strong></td>
          <td style="text-align: left">512</td>
          <td style="text-align: left">1024</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Feed-forward Dim</strong></td>
          <td style="text-align: left">2048</td>
          <td style="text-align: left">4096</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention Heads</strong></td>
          <td style="text-align: left">8</td>
          <td style="text-align: left">16</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~45M</td>
          <td style="text-align: left">~230M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Pre-training Task</strong></td>
          <td style="text-align: left">All 3 variants</td>
          <td style="text-align: left">Combined only</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Comparisons relied on Top-N accuracy for reaction tasks and validity metrics for optimization.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Direct Synthesis (Sep)</td>
          <td style="text-align: left"><strong>92.8%</strong> (Large)</td>
          <td style="text-align: left">91.1% (Aug Transformer)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Retrosynthesis</td>
          <td style="text-align: left"><strong>54.3%</strong> (Large)</td>
          <td style="text-align: left">53.7% (GraphRetro) / 52.5% (GLN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Desirable %</strong></td>
          <td style="text-align: left">Mol Optimization</td>
          <td style="text-align: left"><strong>75.0%</strong> (Base-Mask)</td>
          <td style="text-align: left">70.2% (Transformer-R)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>RMSE</strong></td>
          <td style="text-align: left">Lipophilicity</td>
          <td style="text-align: left">0.598 (Combined)</td>
          <td style="text-align: left">0.555 (D-MPNN)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 NVIDIA V100 GPUs (batch size 128 per GPU).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Pre-training: 2.5 days (Base) / 6 days (Large) for 1M steps.</li>
<li>Fine-tuning: ~20&ndash;40 epochs for reaction prediction (&lt;12 hours).</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Irwin, R., Dimitriadis, S., He, J., &amp; Bjerrum, E. J. (2022). Chemformer: a pre-trained transformer for computational chemistry. <em>Machine Learning: Science and Technology</em>, 3(1), 015022. <a href="https://doi.org/10.1088/2632-2153/ac3ffb">https://doi.org/10.1088/2632-2153/ac3ffb</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{irwinChemformerPretrainedTransformer2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemformer: A Pre-Trained Transformer for Computational Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Irwin, Ross and Dimitriadis, Spyridon and He, Jiazhen and Bjerrum, Esben Jannik}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{015022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2632-2153}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/ac3ffb}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>