<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Reaction Prediction on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/</link><description>Recent content in Reaction Prediction on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 28 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/index.xml" rel="self" type="application/rss+xml"/><item><title>ReactionT5: Pre-trained T5 for Reaction Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</guid><description>ReactionT5 uses two-stage pretraining on ZINC and the Open Reaction Database to enable competitive reaction and yield prediction with minimal fine-tuning data.</description><content:encoded><![CDATA[<h2 id="a-two-stage-pre-trained-transformer-for-chemical-reactions">A Two-Stage Pre-trained Transformer for Chemical Reactions</h2>
<p>ReactionT5 is a <strong>Method</strong> paper that proposes a T5-based pre-trained model for chemical reaction tasks, specifically product prediction and yield prediction. The primary contribution is a two-stage pretraining pipeline: first on a compound library (ZINC, 23M molecules) to learn molecular representations, then on a large-scale reaction database (the Open Reaction Database, 1.5M reactions) to learn reaction-level patterns. The key result is that this pre-trained model can be fine-tuned with very limited target-domain data (as few as 30 reactions) and still achieve competitive performance against models trained on full datasets.</p>
<h2 id="bridging-the-gap-between-single-molecule-and-multi-molecule-pretraining">Bridging the Gap Between Single-Molecule and Multi-Molecule Pretraining</h2>
<p>While transformer-based models pre-trained on compound libraries (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a>, MolGPT) have seen substantial development, most focus on single-molecule inputs and outputs. Pretraining for multi-molecule contexts, such as chemical reactions involving reactants, reagents, catalysts, and products, remains underexplored. T5Chem supports multi-task reaction prediction but focuses on building a single multi-task model rather than investigating the effectiveness of pre-trained models for fine-tuning on limited in-house data.</p>
<p>The authors identify two key gaps:</p>
<ol>
<li>Most pre-trained chemical models do not account for reaction-level interactions between multiple molecules.</li>
<li>In practical settings, target-domain reaction data is often scarce, making transfer learning from large public datasets essential.</li>
</ol>
<h2 id="two-stage-pretraining-with-compound-restoration">Two-Stage Pretraining with Compound Restoration</h2>
<p>The core innovation is a two-stage pretraining procedure built on the <a href="https://en.wikipedia.org/wiki/T5_(language_model)">T5 (text-to-text transfer transformer)</a> architecture:</p>
<p><strong>Stage 1: Compound Pretraining (CompoundT5)</strong>. A randomly initialized T5 model is trained on 23M <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> from the ZINC database using span-masked language modeling. The model learns to predict masked subsequences of SMILES tokens. A SentencePiece unigram tokenizer is trained on this compound library, allowing more compact representations than character-level or atom-level tokenizers. After this stage, new tokens are added to the tokenizer to cover metal atoms and other characters present in the reaction database but absent from ZINC.</p>
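<p>The span-masking objective can be sketched in plain Python. This is a toy illustration, not the paper&rsquo;s implementation: the sentinel-token names follow the usual T5 convention, and the 15% masking rate and mean span length of 3 are the settings reported in the reproducibility details.</p>

```python
import random

def span_mask(tokens, mask_rate=0.15, mean_span=3, seed=0):
    """Toy T5-style span corruption: replace contiguous token spans with
    sentinel tokens; the target lists each sentinel followed by the
    tokens it hides."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * mask_rate))
    masked = set()
    while len(masked) < n_to_mask:
        span = max(1, int(rng.expovariate(1 / mean_span)))
        start = rng.randrange(len(tokens))
        masked.update(range(start, min(start + span, len(tokens))))
    corrupted, target, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            corrupted.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

# Character-level tokens stand in for SentencePiece unigram pieces here.
inp, tgt = span_mask(list("CC(=O)Oc1ccccc1C(=O)O"))
```

<p>In the actual model the units would be SentencePiece pieces rather than single characters, but the corruption scheme is the same.</p>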
<p><strong>Stage 2: Reaction Pretraining (ReactionT5)</strong>. CompoundT5 is further pretrained on 1.5M reactions from the Open Reaction Database (ORD) on both product prediction and yield prediction tasks. Reactions are formulated as text-to-text tasks using special tokens:</p>
<ul>
<li><code>REACTANT:</code>, <code>REAGENT:</code>, and <code>PRODUCT:</code> tokens delimit the role of each molecule in the reaction string.</li>
<li>For product prediction, the model takes reactants and reagents as input and generates product SMILES.</li>
<li>For yield prediction, the model takes the full reaction (including products) and outputs a numerical yield value.</li>
</ul>
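<p>A minimal sketch of assembling such reaction strings: the role tokens come from the paper, but the exact concatenation order and the <code>.</code> separator between molecules are assumptions here.</p>

```python
def format_reaction(reactants, reagents, products=None):
    """Assemble a ReactionT5-style reaction string. Role tokens follow
    the paper; the separators are illustrative assumptions."""
    s = "REACTANT:" + ".".join(reactants) + "REAGENT:" + ".".join(reagents)
    if products is None:
        return s  # product prediction input: the model generates the product side
    return s + "PRODUCT:" + ".".join(products)  # yield prediction input

x = format_reaction(["CCBr", "c1ccccc1O"], ["[K+].[OH-]"])
```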
<p><strong>Compound Restoration</strong>. A notable methodological detail is the handling of uncategorized compounds in the ORD. About 31.8% of ORD reactions contain compounds with unknown roles. Simply discarding these reactions introduces severe product bias (only 447 unique products remain vs. 439,898 with uncategorized data included). The authors develop RestorationT5, a binary classifier built from CompoundT5, that assigns uncategorized compounds to either reactant or reagent roles. This classifier uses a sigmoid output layer and achieves an F1 score of 0.1564 at a threshold of 0.97, outperforming a random forest baseline (F1 = 0.1136). The restored dataset (&ldquo;ORD(restored)&rdquo;) is then used for reaction pretraining.</p>
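<p>The precision/recall/F1 numbers at a fixed sigmoid threshold follow mechanically from the confusion-matrix counts; the sketch below uses toy probabilities and labels, not the paper&rsquo;s predictions.</p>

```python
def f1_at_threshold(probs, labels, threshold):
    """Binary precision, recall, and F1 after thresholding sigmoid outputs."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```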
<p>For yield prediction, the loss function is mean squared error:</p>
<p>$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$</p>
<p>where $y_i$ is the true yield (normalized to [0, 1]) and $\hat{y}_i$ is the predicted yield.</p>
<h2 id="experimental-setup-product-and-yield-prediction-benchmarks">Experimental Setup: Product and Yield Prediction Benchmarks</h2>
<h3 id="product-prediction">Product Prediction</h3>
<p>The USPTO dataset (479K reactions) is used for evaluation, with standard train/val/test splits (409K/30K/40K). Reactions overlapping with the ORD (18%) are removed during evaluation. Beam search with beam size 10 is used for decoding, and minimum/maximum output length constraints are set based on the training data distribution. Top-k accuracy (k = 1, 2, 3, 5) and invalidity rate are reported.</p>
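<p>Top-k accuracy over beam-search candidates reduces to exact-match lookups in the first k hypotheses. A sketch (in practice both candidates and references would first be canonicalized, e.g. with RDKit, before comparison):</p>

```python
def topk_accuracy(beam_outputs, references, ks=(1, 2, 3, 5)):
    """Top-k exact-match accuracy over beam-search candidate lists.
    beam_outputs: one best-first candidate list per reaction;
    references: the ground-truth product SMILES."""
    results = {}
    for k in ks:
        hits = sum(ref in cands[:k] for cands, ref in zip(beam_outputs, references))
        results[k] = hits / len(references)
    return results
```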
<p>Baselines include Seq-to-seq, WLDN (graph neural network), <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, and T5Chem.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Train</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Seq-to-seq</td>
          <td>USPTO</td>
          <td>80.3</td>
          <td>84.7</td>
          <td>86.2</td>
          <td>87.5</td>
          <td>-</td>
      </tr>
      <tr>
          <td>WLDN</td>
          <td>USPTO</td>
          <td>85.6</td>
          <td>90.5</td>
          <td>92.8</td>
          <td>93.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Molecular Transformer</td>
          <td>USPTO</td>
          <td>88.8</td>
          <td>92.6</td>
          <td>-</td>
          <td>94.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>USPTO</td>
          <td>90.4</td>
          <td>94.2</td>
          <td>-</td>
          <td>96.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>USPTO</td>
          <td>88.0</td>
          <td>92.4</td>
          <td>93.9</td>
          <td>95.0</td>
          <td>7.5</td>
      </tr>
      <tr>
          <td>ReactionT5 (restored ORD)</td>
          <td>USPTO200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<p>A critical finding: ReactionT5 pre-trained on ORD achieves 0% accuracy on USPTO without fine-tuning due to domain mismatch (ORD includes byproducts; USPTO lists only the main product). Fine-tuning on just 200 USPTO reactions with the restored ORD model produces competitive results.</p>
<p>The few-shot fine-tuning analysis shows rapid performance scaling:</p>
<table>
  <thead>
      <tr>
          <th>Samples</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>10</td>
          <td>9.0</td>
          <td>12.5</td>
          <td>15.3</td>
          <td>19.1</td>
          <td>12.4</td>
      </tr>
      <tr>
          <td>30</td>
          <td>80.5</td>
          <td>87.3</td>
          <td>89.8</td>
          <td>92.0</td>
          <td>17.2</td>
      </tr>
      <tr>
          <td>50</td>
          <td>83.7</td>
          <td>89.9</td>
          <td>92.2</td>
          <td>94.0</td>
          <td>14.8</td>
      </tr>
      <tr>
          <td>100</td>
          <td>85.1</td>
          <td>91.0</td>
          <td>92.8</td>
          <td>94.4</td>
          <td>14.0</td>
      </tr>
      <tr>
          <td>200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<h3 id="yield-prediction">Yield Prediction</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling dataset (3,955 reactions) is used with random 7:3 splits (repeated 10 times) plus four out-of-sample test sets (Tests 1-4) designed so that similar reactions do not appear in both train and test.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Random 7:3</th>
          <th>Test 1</th>
          <th>Test 2</th>
          <th>Test 3</th>
          <th>Test 4</th>
          <th>Avg. Tests 1-4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DFT</td>
          <td>0.92</td>
          <td>0.80</td>
          <td>0.77</td>
          <td>0.64</td>
          <td>0.54</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MFF</td>
          <td>0.927</td>
          <td>0.851</td>
          <td>0.713</td>
          <td>0.635</td>
          <td>0.184</td>
          <td>0.596</td>
      </tr>
      <tr>
          <td>Yield-BERT</td>
          <td>0.951</td>
          <td>0.838</td>
          <td>0.836</td>
          <td>0.738</td>
          <td>0.538</td>
          <td>0.738</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>0.970</td>
          <td>0.811</td>
          <td>0.907</td>
          <td>0.789</td>
          <td>0.627</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>0.971</td>
          <td>0.855</td>
          <td>0.852</td>
          <td>0.712</td>
          <td>0.547</td>
          <td>0.741</td>
      </tr>
      <tr>
          <td>ReactionT5</td>
          <td>0.966</td>
          <td>0.914</td>
          <td>0.940</td>
          <td>0.819</td>
          <td>0.896</td>
          <td>0.892</td>
      </tr>
      <tr>
          <td>ReactionT5 (zero-shot)</td>
          <td>0.904</td>
          <td>0.919</td>
          <td>0.927</td>
          <td>0.847</td>
          <td>0.909</td>
          <td>0.900</td>
      </tr>
  </tbody>
</table>
<p>ReactionT5 achieves the highest average $R^2$ across Tests 1-4 (0.892), with the zero-shot variant performing even better (0.900). The improvement is most dramatic on Test 4, the hardest split, where ReactionT5 achieves $R^2 = 0.896$ versus T5Chem&rsquo;s 0.627 and Yield-BERT&rsquo;s 0.538.</p>
<p>In a low-data regime (30% train / 70% test), ReactionT5 ($R^2 = 0.927$) substantially outperforms a random forest baseline ($R^2 = 0.853$), and even zero-shot ReactionT5 ($R^2 = 0.898$) exceeds the random forest.</p>
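<p>For reference, the $R^2$ metric used throughout the yield experiments is the ordinary coefficient of determination, which a few lines of Python reproduce:</p>

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```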
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Two-stage pretraining is effective</strong>: Compound pretraining followed by reaction pretraining produces models with strong generalization, particularly on out-of-distribution test sets.</li>
<li><strong>Few-shot transfer works</strong>: With as few as 30 fine-tuning reactions, ReactionT5 achieves over 80% Top-1 accuracy on product prediction, competitive with models trained on the full USPTO dataset.</li>
<li><strong>Compound restoration matters</strong>: Restoring uncategorized compounds in the ORD is essential for product prediction. Without restoration, fine-tuning on 200 USPTO reactions yields 0% accuracy; with restoration, the same fine-tuning yields 85.5% Top-1.</li>
<li><strong>Zero-shot yield prediction is surprisingly effective</strong>: ReactionT5 achieves $R^2 = 0.900$ on the out-of-sample yield tests without any task-specific fine-tuning, outperforming all fine-tuned baselines.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Product prediction shows a high invalidity rate (12.0% for the best ReactionT5 variant) compared to CompoundT5 (7.5%), suggesting that reaction-level pretraining may degrade the decoder&rsquo;s reliability at emitting syntactically valid SMILES.</li>
<li>The 0% accuracy without fine-tuning on product prediction reveals a significant domain gap between ORD and USPTO annotation conventions (byproducts vs. main products).</li>
<li>The RestorationT5 classifier has low precision (0.0878) despite high recall (0.7212), meaning many compounds are incorrectly assigned roles. The paper does not investigate how this impacts downstream performance.</li>
<li>The paper does not report training times, computational costs, or model sizes, making resource requirements unclear.</li>
<li>Only two downstream tasks (product prediction on USPTO, yield prediction on Buchwald-Hartwig) are evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compound pretraining</td>
          <td>ZINC</td>
          <td>22,992,522 compounds</td>
          <td>SMILES canonicalized with <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a></td>
      </tr>
      <tr>
          <td>Reaction pretraining</td>
          <td>ORD (restored)</td>
          <td>1,505,916 reactions</td>
          <td>Atom mapping removed, compounds canonicalized</td>
      </tr>
      <tr>
          <td>Product prediction eval</td>
          <td>USPTO</td>
          <td>479,035 reactions</td>
          <td>409K/30K/40K train/val/test split</td>
      </tr>
      <tr>
          <td>Yield prediction eval</td>
          <td>Buchwald-Hartwig C-N</td>
          <td>3,955 reactions</td>
          <td>Random 7:3 split (10 repeats) + 4 OOS tests</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Base architecture: T5 (text-to-text transfer transformer)</li>
<li>Tokenizer: SentencePiece unigram, trained on ZINC, extended with special reaction tokens</li>
<li>Compound pretraining: Span-masked language modeling (15% masking rate, average span length 3)</li>
<li>Beam search: size 10 for product prediction</li>
<li>Output length constraints: min/max from training data distribution</li>
<li>Yield normalization: clipped to [0, 100], then scaled to [0, 1]</li>
</ul>
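<p>The yield normalization in the last bullet is a one-liner; this is the straightforward reading of &ldquo;clipped to [0, 100], then scaled to [0, 1]&rdquo;:</p>

```python
def normalize_yield(percent):
    """Clip a reported yield to [0, 100] %, then rescale to [0, 1]."""
    return min(max(percent, 0.0), 100.0) / 100.0
```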
<h3 id="models">Models</h3>
<ul>
<li>CompoundT5: T5 pretrained on ZINC</li>
<li>RestorationT5: CompoundT5 fine-tuned for binary classification (reactant vs. reagent)</li>
<li>ReactionT5: CompoundT5 pretrained on ORD for product and yield prediction</li>
<li>Pre-trained weights available on Hugging Face</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>Product prediction</td>
          <td>85.5%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>Top-5 accuracy</td>
          <td>Product prediction</td>
          <td>94.9%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (random)</td>
          <td>0.966</td>
          <td>ReactionT5 fine-tuned</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (OOS avg.)</td>
          <td>0.900</td>
          <td>ReactionT5 zero-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Training times and GPU requirements are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sagawatatsuya/ReactionT5v2">ReactionT5v2 (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/sagawa">ReactionT5 models (Hugging Face)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sagawa, T. &amp; Kojima, R. (2023). ReactionT5: a large-scale pre-trained model towards application of limited reaction data. <em>arXiv preprint arXiv:2311.06708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sagawa2023reactiont5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ReactionT5: a large-scale pre-trained model towards application of limited reaction data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sagawa, Tatsuya and Kojima, Ryosuke}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2311.06708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2311.06708}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Machine Translation for Reaction Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/nmt-organic-reaction-prediction/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/nmt-organic-reaction-prediction/</guid><description>Nam and Kim apply a GRU-based seq2seq model with attention to predict organic reaction products from SMILES, pioneering the NMT approach to chemistry.</description><content:encoded><![CDATA[<h2 id="pioneering-seq2seq-translation-for-reaction-prediction">Pioneering Seq2Seq Translation for Reaction Prediction</h2>
<p>This is a <strong>Method</strong> paper. It introduces the idea of applying neural machine translation (NMT) to organic chemistry reaction prediction by framing product prediction as a sequence-to-sequence translation problem from reactant/reagent <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> to product SMILES. This was one of the earliest works to demonstrate that a data-driven encoder-decoder model could predict reaction products without any hand-coded reaction rules or SMARTS transformations.</p>
<h2 id="limitations-of-existing-reaction-prediction-methods">Limitations of Existing Reaction Prediction Methods</h2>
<p>Prior computational approaches to reaction prediction fell into three categories, each with significant drawbacks:</p>
<ol>
<li>
<p><strong>Rule-based methods</strong> (e.g., CAMEO, EROS) relied on manually encoded reaction rules. They performed well on reactions covered by the rules but required continuous manual encoding as new reaction types were discovered. Many older systems became outdated for this reason.</p>
</li>
<li>
<p><strong>Physical calculation methods</strong> computed energies of transition states from plausible reaction pathways using quantum mechanics. While principled, these approaches carried high computational cost. Simplified approaches (ToyChem, ROBIA) traded accuracy for speed.</p>
</li>
<li>
<p><strong>Machine learning methods</strong> at the time either predicted individual mechanistic steps (requiring tree search for multi-step reactions) or classified reaction types and applied SMARTS transformations to generate products. The classification-based approach of Wei et al. still required manual encoding of SMARTS transformations for new reaction types and struggled with ambiguous reaction classes.</p>
</li>
</ol>
<p>The key gap was the absence of a method that could predict reaction products directly from input molecules, learn from data alone, and generalize to new reaction types without manual rule encoding.</p>
<h2 id="core-innovation-reactions-as-machine-translation">Core Innovation: Reactions as Machine Translation</h2>
<p>The central insight is that SMILES strings can be treated as a language with grammatical specifications. Predicting reaction products then becomes a problem of translating &ldquo;reactant and reagent&rdquo; sentences into &ldquo;product&rdquo; sentences.</p>
<p>The model uses a <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a>-based encoder-decoder architecture with attention:</p>
<ul>
<li><strong>Encoder</strong>: 3 layers of GRU cells that process the reversed, tokenized SMILES string of reactants and reagents</li>
<li><strong>Decoder</strong>: 3 layers of GRU cells that generate product SMILES tokens autoregressively</li>
<li><strong>Attention mechanism</strong>: allows the decoder to attend to relevant encoder states at each generation step</li>
<li><strong>Embedding dimension</strong>: 600</li>
<li><strong>Vocabulary</strong>: 311 input tokens (reactants/reagents), 180 output tokens (products)</li>
<li><strong>Bucketed sequences</strong>: four bucket sizes handle variable-length inputs and outputs: (54, 54), (70, 60), (90, 65), (150, 80)</li>
</ul>
<p>The SMILES tokenization uses a <a href="https://en.wikipedia.org/wiki/Parsing_expression_grammar">PEG</a>-based parser that splits SMILES strings into atoms, bonds, branching symbols, and ring closure numbers. Input sequences are reversed before feeding to the encoder, following standard practice in NMT at the time.</p>
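<p>The paper uses a PEG grammar; a regex tokenizer that yields the same token classes (bracketed atoms, two-letter elements, bonds, branches, ring closures) is a common stand-in and is sketched here. The pattern is an approximation, not the authors&rsquo; grammar.</p>

```python
import re

# Regex approximation of atom/bond/branch/ring-closure tokenization.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|@@|@|=|#|\\|/|\(|\)|\.|%\d{2}|\d|\+|-)"
)

def tokenize(smiles, reverse=False):
    """Split a SMILES string into tokens; optionally reverse the
    sequence, as done for the encoder input."""
    tokens = SMILES_TOKEN.findall(smiles)
    return tokens[::-1] if reverse else tokens

toks = tokenize("CC(=O)Oc1ccccc1", reverse=True)
```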
<p>The translation objective finds the product sequence $\mathbf{y}$ that maximizes the conditional probability:</p>
<p>$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x})$$</p>
<p>where $\mathbf{x}$ is the tokenized reactant/reagent sequence and $T$ is the product sequence length.</p>
<h2 id="training-data-and-experimental-evaluation">Training Data and Experimental Evaluation</h2>
<h3 id="training-sets">Training Sets</h3>
<p>Two training sets were constructed:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Patent reactions (&ldquo;real&rdquo;)</td>
          <td style="text-align: left">1,094,235</td>
          <td style="text-align: left">USPTO patent applications (2001-2013), filtered by length</td>
      </tr>
      <tr>
          <td style="text-align: left">Generated reactions (&ldquo;gen&rdquo;)</td>
          <td style="text-align: left">865,118</td>
          <td style="text-align: left">75 reaction types from Wade&rsquo;s organic chemistry textbook, applied to <a href="/notes/computational-chemistry/datasets/gdb-11/">GDB-11</a> molecules (1-10 atoms)</td>
      </tr>
  </tbody>
</table>
<p>The &ldquo;real&rdquo; set was filtered to exclude reactions with reactant/reagent strings longer than 150 characters, product strings longer than 80 characters, or more than four products. The &ldquo;gen&rdquo; set was composed by iterating reaction templates (as SMARTS) over small molecules from GDB-11, covering five substrate types: acid derivatives, alcohols, aldehydes/ketones, alkenes, and alkynes.</p>
<p>Two models were compared: a &ldquo;gen&rdquo; model (trained only on generated reactions) and a &ldquo;real+gen&rdquo; model (trained on both sets).</p>
<h3 id="textbook-problem-evaluation">Textbook Problem Evaluation</h3>
<p>The models were tested on 10 problem sets from Wade&rsquo;s textbook, following the evaluation approach of Wei et al. Each problem set contained 6-15 reactions. Evaluation metrics included the ratio of fully correct predictions and the average <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between Morgan fingerprints of predicted and actual products.</p>
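<p>Tanimoto similarity is simply the Jaccard index over fingerprint bits. A sketch over plain integer sets (in the paper the sets would be on-bits of RDKit Morgan fingerprints):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |intersection| / |union|."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)
```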
<p>The &ldquo;real+gen&rdquo; model outperformed the &ldquo;gen&rdquo; model on most problem sets. On problem set 17-44 (aromatic compound reactions, only present in the &ldquo;real&rdquo; training set), the &ldquo;real+gen&rdquo; model correctly answered 4 out of 11 problems while the &ldquo;gen&rdquo; model answered 2. The &ldquo;gen&rdquo; model&rsquo;s ability to correctly predict some aromatic reactions despite never being trained on them suggests the model can extrapolate to unseen reaction patterns.</p>
<p>For <a href="https://en.wikipedia.org/wiki/Diels%E2%80%93Alder_reaction">Diels-Alder reactions</a> (problem set 15-30), neither model achieved fully correct predictions for all problems, though the &ldquo;real+gen&rdquo; model showed better Tanimoto scores, indicating partially correct structural predictions even when the exact product was missed.</p>
<h3 id="scalability-testing">Scalability Testing</h3>
<p>A scalability test used generated reactions with substrate molecules containing 11-16 atoms (larger than the training set molecules with fewer than 11 atoms). Results showed:</p>
<ul>
<li>The &ldquo;real+gen&rdquo; model maintained Tanimoto scores around 0.7 and error rates around 0.4 as substrate atom count increased</li>
<li>The ratio of fully correct predictions decreased as atom count increased, revealing that the recurrent network struggled with longer input sequences</li>
<li>The &ldquo;real+gen&rdquo; model produced fewer invalid SMILES strings than the &ldquo;gen&rdquo; model, likely because training on more reactions improved the decoder&rsquo;s ability to generate syntactically valid SMILES</li>
</ul>
<h3 id="attention-analysis">Attention Analysis</h3>
<p>Visualization of attention weights revealed a limitation: the decoder cells predominantly attended to the first few encoder cells rather than distributing attention across the full input sequence. This means the attention mechanism was not learning meaningful &ldquo;alignment&rdquo; between reactant atoms and product atoms. The authors note that if decoder cells generating tokens for unreactive sites could attend to the corresponding encoder cells (analogous to atom mapping), prediction quality on longer sequences could improve.</p>
<h3 id="token-embedding-analysis">Token Embedding Analysis</h3>
<p>t-SNE visualization of the learned token embeddings showed that encoder and decoder tokens clustered primarily by syntactic similarity rather than chemical properties. The model did not learn chemically meaningful embeddings, which the authors identify as an area for future improvement.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>Treating reaction prediction as NMT is viable: the seq2seq model can predict products without any hand-coded rules</li>
<li>Training on real patent data significantly improves prediction over generated data alone</li>
<li>The model can extrapolate to reaction types not seen during training (e.g., the &ldquo;gen&rdquo; model predicting aromatic reactions)</li>
<li>Compared to the fingerprint-based approach of Wei et al., this method performed better on textbook problems and eliminated the need for manual SMARTS encoding</li>
</ul>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Invalid SMILES generation</strong>: the token-by-token generation process can produce syntactically invalid SMILES (e.g., mismatched parentheses), which the authors scored as zero</li>
<li><strong>Sequence length degradation</strong>: prediction accuracy dropped for longer SMILES strings, a known limitation of RNN-based seq2seq models at the time</li>
<li><strong>Poor attention alignment</strong>: attention weights collapsed to the first encoder positions rather than learning meaningful reactant-product correspondences</li>
<li><strong>Chemically naive embeddings</strong>: token embeddings did not capture chemical properties</li>
<li><strong>Multiple reaction pathways</strong>: reactions with competing pathways (e.g., substitution vs. elimination) were difficult for the model to handle</li>
</ul>
<h3 id="historical-significance">Historical Significance</h3>
<p>This paper is historically significant as one of the first (alongside concurrent work) to propose the NMT framing for reaction prediction. This framing was later adopted and refined by the <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a> (Schwaller et al., 2019), which replaced GRUs with the Transformer architecture and achieved over 90% top-1 accuracy on standard benchmarks. The conceptual contribution of treating SMILES-to-SMILES translation as machine translation became the foundation of an entire subfield.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Training (real)</td>
          <td style="text-align: left">USPTO patent reactions</td>
          <td style="text-align: left">1,094,235</td>
          <td style="text-align: left">2001-2013 applications, filtered by length</td>
      </tr>
      <tr>
          <td style="text-align: left">Training (gen)</td>
          <td style="text-align: left">Generated from Wade textbook templates</td>
          <td style="text-align: left">865,118</td>
          <td style="text-align: left">75 reaction types, GDB-11 substrates</td>
      </tr>
      <tr>
          <td style="text-align: left">Testing (textbook)</td>
          <td style="text-align: left">Wade textbook problems</td>
          <td style="text-align: left">~100</td>
          <td style="text-align: left">10 problem sets, 6-15 reactions each</td>
      </tr>
      <tr>
          <td style="text-align: left">Testing (scalability)</td>
          <td style="text-align: left">Generated from <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td style="text-align: left">2,400</td>
          <td style="text-align: left">400 per atom count (11-16)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GRU-based encoder-decoder with attention mechanism</li>
<li>PEG-based SMILES tokenizer</li>
<li>Input sequence reversal</li>
<li>Bucketed training with four bucket sizes</li>
<li>TensorFlow seq2seq tutorial implementation with default learning rate</li>
</ul>
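<p>As a rough sketch of the tokenization and input-reversal steps (the paper itself used a PEG-based tokenizer; the regex below is an approximation in the style popularized by later SMILES work, so treat it as illustrative only):</p>

```python
import re

# Approximate SMILES tokenizer: bracket atoms, two-letter elements (Cl, Br),
# and ring-closure escapes (%NN) are kept as single tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|[BCNOPSFIbcnops]|[0-9]|=|#|\(|\)|\+|-|/|\\|\.|%[0-9]{2})"
)

def tokenize(smiles: str) -> list:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against characters the sketch pattern does not cover.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

def encoder_input(smiles: str) -> list:
    """Reverse the token sequence, as the paper does for encoder inputs."""
    return tokenize(smiles)[::-1]
```

Input reversal is the same trick used in early NMT seq2seq work: it shortens the distance between the start of the source and the start of the target, easing optimization.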
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">GRU layers</td>
          <td style="text-align: left">3</td>
      </tr>
      <tr>
          <td style="text-align: left">Embedding size</td>
          <td style="text-align: left">600</td>
      </tr>
      <tr>
          <td style="text-align: left">Input vocabulary</td>
          <td style="text-align: left">311 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left">Output vocabulary</td>
          <td style="text-align: left">180 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left">Buckets</td>
          <td style="text-align: left">(54,54), (70,60), (90,65), (150,80)</td>
      </tr>
  </tbody>
</table>
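<p>The bucketing scheme reduces to placing each (input, output) length pair into the smallest bucket that fits it; pairs exceeding the largest bucket are dropped, consistent with the length filtering noted in the data table. A minimal sketch:</p>

```python
from typing import Optional, Tuple

# Bucket sizes from the paper: (max input length, max output length).
BUCKETS = [(54, 54), (70, 60), (90, 65), (150, 80)]

def assign_bucket(src_len: int, tgt_len: int) -> Optional[Tuple[int, int]]:
    """Return the smallest bucket that fits the pair, or None if the
    pair is too long for every bucket and should be filtered out."""
    for src_max, tgt_max in BUCKETS:
        if src_len <= src_max and tgt_len <= tgt_max:
            return (src_max, tgt_max)
    return None
```

Sequences within a bucket are padded to that bucket's fixed shape, which is how the TensorFlow seq2seq tutorial implementation handles variable-length batching.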
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">gen Model</th>
          <th style="text-align: left">real+gen Model</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Textbook correct ratio</td>
          <td style="text-align: left">Variable by set</td>
          <td style="text-align: left">Higher on most sets</td>
          <td style="text-align: left">10 problem sets</td>
      </tr>
      <tr>
          <td style="text-align: left">Average Tanimoto similarity</td>
          <td style="text-align: left">Variable</td>
          <td style="text-align: left">~0.7 on scalability test</td>
          <td style="text-align: left">Morgan fingerprint based</td>
      </tr>
      <tr>
          <td style="text-align: left">Invalid SMILES ratio</td>
          <td style="text-align: left">Higher</td>
          <td style="text-align: left">~0.4 on scalability test</td>
          <td style="text-align: left">Decreases with more training data</td>
      </tr>
  </tbody>
</table>
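<p>The Tanimoto metric itself is simple to sketch over fingerprint bit sets; in the paper the bit sets come from Morgan fingerprints (via RDKit), but the pure-set version below shows just the computation:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A intersect B| / |A union B|."""
    if not fp_a and not fp_b:
        return 1.0  # convention for two empty fingerprints
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def average_tanimoto(pairs) -> float:
    """Mean similarity over (predicted, reference) fingerprint pairs."""
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```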
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nam, J. &amp; Kim, J. (2016). Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions. <em>arXiv preprint</em>, arXiv:1612.09529. <a href="https://arxiv.org/abs/1612.09529">https://arxiv.org/abs/1612.09529</a></p>
<p><strong>Publication</strong>: arXiv preprint 2016</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{nam2016linking,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nam, Juno and Kim, Jurae}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1612.09529}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1612.09529}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Transfer Approaches for Seq-to-Seq Retrosynthesis</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</guid><description>Systematic comparison of joint training, self-training, and pre-training plus fine-tuning for Transformer-based retrosynthesis on USPTO-50K.</description><content:encoded><![CDATA[<h2 id="systematic-study-of-data-transfer-for-retrosynthesis">Systematic Study of Data Transfer for Retrosynthesis</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares three standard data transfer methods (joint training, self-training, and pre-training plus fine-tuning) applied to a Transformer-based sequence-to-sequence model for single-step retrosynthesis. The primary contribution is demonstrating that pre-training on a large augmented dataset (USPTO-Full, 877K reactions) followed by fine-tuning on the smaller target dataset (USPTO-50K) produces substantial accuracy improvements over the baseline Transformer, achieving results competitive with or superior to contemporaneous state-of-the-art graph-based models at larger n in the n-best accuracy evaluation.</p>
<h2 id="bridging-the-data-gap-in-retrosynthesis-prediction">Bridging the Data Gap in Retrosynthesis Prediction</h2>
<p><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a>, the problem of predicting reactant compounds needed to synthesize a target product, has seen rapid progress through increasingly sophisticated model architectures: <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/nmt-organic-reaction-prediction/">LSTM seq-to-seq models</a>, <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Transformer models</a>, and graph-to-graph approaches. However, the authors identify a gap in this research trajectory. While model architecture has received extensive attention, the role of training data strategies has been largely neglected in the retrosynthesis literature.</p>
<p>The core practical problem is that high-quality supervised datasets for retrosynthesis (like USPTO-50K) tend to be small and distribution-skewed, with all samples pre-classified into ten major reaction classes. Meanwhile, larger datasets (USPTO-Full with 877K samples, USPTO-MIT with 479K samples) exist but have different distributional properties. Data transfer techniques are standard practice in computer vision, NLP, and machine translation for exactly this scenario, yet they had not been systematically evaluated for retrosynthesis at the time of this work.</p>
<p>The authors also note a contrast with Zoph et al. (2020), who found that self-training outperforms pre-training in image recognition. They hypothesize that chemical compound strings may have more universal representations than images, making pre-training more effective in the chemistry domain.</p>
<h2 id="three-data-transfer-methods-for-retrosynthesis">Three Data Transfer Methods for Retrosynthesis</h2>
<p>The paper formalizes retrosynthesis as a seq-to-seq problem where both the product $x$ and reactant set $y$ are represented as <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings. A retrosynthesis model defines a likelihood $p_{\mathcal{M}}(y \mid x; \theta)$ optimized via maximum log-likelihood:</p>
<p>$$
\theta^{*} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}^{T}_{\text{Train}}} \log p(y_{i} \mid x_{i})
$$</p>
<p>Given a target dataset $\mathcal{D}^{T}$ and an augment dataset $\mathcal{D}^{A}$, three transfer methods are examined:</p>
<p><strong>Joint Training</strong> concatenates the training sets and optimizes over the union:</p>
<p>$$
\theta^{*}_{\text{joint}} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}_{\text{joint}}} \log p(y_{i} \mid x_{i}), \quad \mathcal{D}_{\text{joint}} = \mathcal{D}^{T}_{\text{Train}} \cup \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>This requires that both datasets share the same input/output domain (same SMILES canonicalization rules).</p>
<p><strong>Self-Training</strong> (pseudo labeling) first trains a base model on $\mathcal{D}^{T}$ alone, then uses this model to relabel the augment dataset products:</p>
<p>$$
\hat{y}_{i} = \arg\max_{y} \log p(y \mid x_{i}; \theta^{*}_{\text{single}}) \quad \text{for } x_{i} \in \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>The pseudo-labeled augment set is then combined with $\mathcal{D}^{T}_{\text{Train}}$ for joint training. This approach does not require consistent label domains between datasets.</p>
<p><strong>Pre-training plus Fine-tuning</strong> trains first on the augment dataset to obtain $\theta^{*}_{\text{pretrain}}$, then initializes fine-tuning from this checkpoint:</p>
<p>$$
\theta^{0}_{\text{finetune}} \leftarrow \theta^{*}_{\text{pretrain}}, \quad \theta^{\ell+1}_{\text{finetune}} \leftarrow \theta^{\ell}_{\text{finetune}} - \gamma^{\ell} \nabla \mathcal{L}(\mathcal{D}^{T}_{\text{Train}}) \big|_{{\theta^{\ell}_{\text{finetune}}}}
$$</p>
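<p>Stripped of model details, the three strategies differ only in what data each training run sees and where its parameters start. A minimal sketch, with <code>train</code> and <code>predict</code> as hypothetical stand-ins for the actual Transformer training loop and beam-search decoding:</p>

```python
def joint_training(train, target_data, augment_data):
    # Concatenate the two training sets and run a single training job;
    # requires that both share the same SMILES canonicalization rules.
    return train(None, target_data + augment_data)

def self_training(train, predict, target_data, augment_inputs):
    # Train a base model on the target data alone, pseudo-label the
    # augment products with it, then retrain on the combined set.
    base = train(None, target_data)
    pseudo = [(x, predict(base, x)) for x in augment_inputs]
    return train(None, target_data + pseudo)

def pretrain_finetune(train, target_data, augment_data):
    # Train on the augment set first, then continue from that checkpoint
    # on the target set; the best-performing method in the paper.
    pretrained = train(None, augment_data)
    return train(pretrained, target_data)
```

Note how self-training never needs the augment dataset's labels, which is why it tolerates inconsistent label domains between datasets.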
<h2 id="experimental-setup-on-uspto-benchmarks">Experimental Setup on USPTO Benchmarks</h2>
<p>The experiments use a fixed Transformer architecture (3 self-attention layers, 500-dimensional latent vectors) implemented in OpenNMT-py, evaluated across all three transfer methods.</p>
<p><strong>Datasets:</strong></p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>40K/5K/5K (train/val/test)</td>
          <td>10 reaction classes, curated by Lowe (2012)</td>
      </tr>
      <tr>
          <td>Augment (main)</td>
          <td>USPTO-Full</td>
          <td>844K train (after cleansing)</td>
          <td>Curated by Lowe (2017)</td>
      </tr>
      <tr>
          <td>Augment (smaller)</td>
          <td>USPTO-MIT</td>
          <td>384K train (after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removed all augment dataset samples whose product SMILES appeared in any USPTO-50K subset, preventing data leakage. All datasets were re-canonicalized with a unified <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> version.</p>
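<p>The cleansing step itself reduces to a set-membership filter, assuming both sides were already canonicalized with the same RDKit version as described above (a hypothetical sketch, not the authors' script):</p>

```python
def cleanse_augment(augment, target_products):
    """Drop augment reactions whose canonical product SMILES appears in
    any target (USPTO-50K) subset, preventing data leakage into training.

    augment: list of (product_smiles, reactants_smiles) pairs
    target_products: iterable of canonical product SMILES from all splits
    """
    banned = set(target_products)
    return [(product, reactants) for product, reactants in augment
            if product not in banned]
```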
<p><strong>Evaluation</strong> uses n-best accuracy with k=50 beam search, computing accuracy at n=1, 3, 5, 10, 20, 50. Models are selected by best validation perplexity. All experiments report averages and standard deviations over 5 runs.</p>
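<p>The n-best accuracy metric follows directly from its definition (a hypothetical helper, not the authors' evaluation code):</p>

```python
def n_best_accuracy(beams, gold, ns=(1, 3, 5, 10, 20, 50)):
    """beams[i] is the ranked candidate list for example i (here, up to
    k=50 beam hypotheses); an example counts as correct at n if the gold
    reactant string appears among its first n candidates."""
    return {
        n: sum(1 for cands, y in zip(beams, gold) if y in cands[:n]) / len(gold)
        for n in ns
    }
```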
<p><strong>Optimization</strong> uses Adam with cyclic learning rate scheduling (warm-up) for all methods except fine-tuning, which uses a standard non-cyclic scheduler.</p>
<p><strong>Results comparing data transfer methods (USPTO-Full augment):</strong></p>
<table>
  <thead>
      <tr>
          <th>Training Method</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single model (No Transfer)</td>
          <td>35.3 +/- 1.4</td>
          <td>52.8 +/- 1.4</td>
          <td>58.9 +/- 1.3</td>
          <td>64.5 +/- 1.2</td>
          <td>68.8 +/- 1.2</td>
          <td>72.1 +/- 1.3</td>
      </tr>
      <tr>
          <td>Joint Training</td>
          <td>39.1 +/- 1.3</td>
          <td>63.4 +/- 0.9</td>
          <td>71.9 +/- 0.5</td>
          <td>80.1 +/- 0.2</td>
          <td>85.4 +/- 0.3</td>
          <td>89.4 +/- 0.2</td>
      </tr>
      <tr>
          <td>Self-Training</td>
          <td>41.5 +/- 1.0</td>
          <td>60.4 +/- 0.7</td>
          <td>66.1 +/- 0.7</td>
          <td>71.8 +/- 0.6</td>
          <td>75.3 +/- 0.5</td>
          <td>78.0 +/- 0.3</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune</td>
          <td>57.4 +/- 0.4</td>
          <td>77.6 +/- 0.4</td>
          <td>83.1 +/- 0.2</td>
          <td>87.4 +/- 0.4</td>
          <td>89.6 +/- 0.3</td>
          <td>90.9 +/- 0.2</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with state-of-the-art models:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GLN (Dai et al., 2019)</td>
          <td>Logic Network</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
          <td>88.5</td>
          <td>92.4</td>
      </tr>
      <tr>
          <td>G2Gs (Shi et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>48.9</td>
          <td>67.6</td>
          <td>72.5</td>
          <td>75.5</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RetroXpert (Yan et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>65.6</td>
          <td>78.7</td>
          <td>80.8</td>
          <td>83.3</td>
          <td>84.6</td>
          <td>86.0</td>
      </tr>
      <tr>
          <td>GraphRetro (Somnath et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>63.8</td>
          <td>80.5</td>
          <td>84.1</td>
          <td>85.9</td>
          <td>N/A</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune (ours)</td>
          <td>Seq-to-Seq</td>
          <td>57.4</td>
          <td>77.6</td>
          <td>83.1</td>
          <td>87.4</td>
          <td>89.6</td>
          <td>90.9</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Primary findings:</strong></p>
<ol>
<li>All three data transfer methods improve over the no-transfer baseline across all n-best accuracy levels.</li>
<li>Pre-training plus fine-tuning provides the largest gains, improving top-1 accuracy by 22.1 absolute percentage points (from 35.3% to 57.4%) and achieving the best n=10 and n=20 accuracy among all compared models, including graph-based approaches.</li>
<li>Augment dataset size matters: using USPTO-Full (844K) yields substantially better results than USPTO-MIT (384K) for joint training and pre-training plus fine-tuning, though self-training gains are surprisingly robust to augment dataset size.</li>
<li>Manual inspection of erroneous predictions shows that over 99% of top-1 predictions from the pre-trained/fine-tuned model are chemically appropriate or sensible, even when they do not exactly match the gold-standard reactants.</li>
<li>Pre-training plus fine-tuning shows a distinct advantage in training dynamics: the 1-best and n-best accuracy curves evolve similarly during fine-tuning, unlike the single model where these curves can diverge significantly. This makes early stopping more reliable.</li>
</ol>
<p><strong>Class-wise improvements</strong> are observed across all 10 reaction classes, with the largest gains in heterocycle formation (0.40 to 0.86 at 50-best) and functional group interconversion (0.57 to 0.90).</p>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The model struggles with compounds containing multiple similar substituents (e.g., long-chain hydrocarbons), occasionally selecting the wrong one.</li>
<li>Some reactions involving rare chemical groups (<a href="https://en.wikipedia.org/wiki/Polycyclic_aromatic_hydrocarbon">polycyclic aromatic hydrocarbons</a>) still produce invalid SMILES, suggesting the augment dataset lacks sufficient examples of these structures.</li>
<li>Top-1 accuracy (57.4%) lags behind the best graph-based models (RetroXpert at 65.6%), though the gap narrows at higher n values.</li>
<li>The study uses a fixed Transformer architecture without architecture-specific optimization for each transfer method.</li>
</ul>
<p><strong>Future directions</strong> proposed include freezing parts of the network during fine-tuning, applying data transfer to graph-to-graph models, and testing transferability to other retrosynthesis datasets beyond USPTO-50K.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>50K reactions</td>
          <td>Curated by Lowe (2012), 10 reaction classes</td>
      </tr>
      <tr>
          <td>Augment</td>
          <td>USPTO-Full</td>
          <td>877K reactions (844K after cleansing)</td>
          <td>Curated by Lowe (2017), available via Figshare</td>
      </tr>
      <tr>
          <td>Augment (alt)</td>
          <td>USPTO-MIT</td>
          <td>479K reactions (384K after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removes augment samples whose products appear in any USPTO-50K subset. Unified RDKit canonicalization applied to all datasets.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq-to-seq model (3 self-attention layers, 500-dim latent vectors)</li>
<li>Positional encoding enabled</li>
<li>Maximum sequence length: 200 tokens</li>
<li>Adam optimizer</li>
<li>Cyclic learning rate scheduler with warm-up (all methods except fine-tuning)</li>
<li>Non-cyclic scheduler for fine-tuning phase (Klein et al., 2017)</li>
<li>Beam search with k=50 for inference</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Implementation: OpenNMT-py</li>
<li>No pre-trained weights or model checkpoints released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>57.4%</td>
          <td>35.3% (no transfer)</td>
          <td>Pre-train + fine-tune, USPTO-Full augment</td>
      </tr>
      <tr>
          <td>Top-10 accuracy</td>
          <td>87.4%</td>
          <td>64.5% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-20 accuracy</td>
          <td>89.6%</td>
          <td>68.8% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-50 accuracy</td>
          <td>90.9%</td>
          <td>72.1% (no transfer)</td>
          <td>Competitive with GLN (92.4%)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the paper. The authors mention GPU memory constraints motivating the 200-token sequence length limit.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ishiguro, K., Ujihara, K., Sawada, R., Akita, H., &amp; Kotera, M. (2020). Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis. <em>arXiv preprint arXiv:2010.00792</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ishiguro2020data,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ishiguro, Katsuhiko and Ujihara, Kazuya and Sawada, Ryohto and Akita, Hirotaka and Kotera, Masaaki}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2010.00792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Tied Two-Way Transformers for Diverse Retrosynthesis</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/</guid><description>Tied two-way transformers with cycle consistency and multinomial latent variables improve retrosynthetic prediction validity, plausibility, and diversity.</description><content:encoded><![CDATA[<h2 id="bridging-forward-and-backward-reaction-prediction">Bridging Forward and Backward Reaction Prediction</h2>
<p>This is a <strong>Method</strong> paper that addresses three key limitations of template-free <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> models: invalid <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> outputs, chemically implausible predictions, and lack of diversity in reactant candidates. The solution combines three techniques: (1) cycle consistency checks using a paired forward reaction transformer, (2) parameter tying between the forward and backward transformers, and (3) multinomial latent variables with a learned prior to capture multiple reaction pathways.</p>
<h2 id="three-problems-in-template-free-retrosynthesis">Three Problems in Template-Free Retrosynthesis</h2>
<p>Template-free retrosynthesis models cast retrosynthesis as a <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">sequence-to-sequence</a> translation problem (product SMILES to reactant SMILES). While these models avoid the cost of hand-coded reaction templates, they suffer from:</p>
<ol>
<li><strong>Invalid SMILES</strong>: predicted reactant strings that contain grammatical errors and cannot be parsed into molecules</li>
<li><strong>Implausibility</strong>: predicted reactants that are valid molecules but cannot actually synthesize the target product</li>
<li><strong>Lack of diversity</strong>: beam search produces duplicate or near-duplicate candidates, reducing the number of useful suggestions</li>
</ol>
<p>Prior work addressed these individually (SCROP adds a syntax corrector for validity, Chen et al. use latent variables for diversity), but this paper tackles all three simultaneously.</p>
<h2 id="model-architecture">Model Architecture</h2>
<h3 id="tied-two-way-transformers">Tied Two-Way Transformers</h3>
<p>The model pairs a retrosynthesis transformer $p(y|z, x)$ (product to reactants) with a forward reaction transformer $p(\tilde{x}|z, y)$ (reactants to product). Both use the standard encoder-decoder transformer architecture with 6 layers, 8 attention heads, and 256-dimensional embeddings.</p>
<p>The key architectural innovation is aggressive parameter tying: the two transformers share the entire encoder and all decoder parameters except layer normalization. This means the two-transformer system has approximately the same parameter count as a single transformer (17.5M vs. 17.4M). The shared parameters force the model to learn bidirectional reaction patterns from both forward and backward training data simultaneously, improving grammar learning and reducing invalid outputs.</p>
<h3 id="multinomial-latent-variables">Multinomial Latent Variables</h3>
<p>A discrete latent variable $z \in \{1, \ldots, K\}$ is introduced to capture multiple reaction modes. Each latent value conditions a different decoding path, encouraging diverse reactant predictions. The decoder initializes with a latent-class-specific start token (e.g., &ldquo;&lt;CLS2&gt;&rdquo;) and then decodes autoregressively.</p>
<p>The prior $p(z|x)$ is a learned multinomial distribution parametrized by a two-layer feed-forward network with tanh activation, taking the mean-pooled encoder output as input. This learned prior outperforms the uniform prior used by Chen et al., producing a smaller trade-off between top-1 and top-10 accuracy as $K$ increases.</p>
<h3 id="training-with-hard-em">Training with Hard EM</h3>
<p>Since the latent variable $z$ is unobserved during training, the model is trained with the online <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">hard-EM algorithm</a>. The loss function is:</p>
<p>$$\mathcal{L}(\theta) = \mathbb{E}_{(x,y) \sim \text{data}} \left[ \min_{z} \mathcal{L}_h(x, y, z; \theta) \right]$$</p>
<p>where $\mathcal{L}_h = -(\log p(z|x) + \log p(y|z,x) + \log p(\tilde{x}=x|z,y))$. The E-step selects the best $z$ for each training pair (with dropout disabled), and the M-step updates parameters given the complete data.</p>
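<p>The E-step reduces to an argmin over the $K$ latent classes. A minimal sketch, with <code>loss_fn</code> as a hypothetical stand-in for the complete-data loss $\mathcal{L}_h$ evaluated by the networks (dropout disabled, per the paper):</p>

```python
def hard_em_step(x, y, K, loss_fn):
    """One hard-EM step: the E-step selects the latent class z with the
    lowest complete-data loss; the M-step would then backpropagate
    loss_fn(x, y, z_star) to update the parameters."""
    z_star = min(range(1, K + 1), key=lambda z: loss_fn(x, y, z))
    return z_star, loss_fn(x, y, z_star)
```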
<h3 id="inference-with-cycle-consistency-reranking">Inference with Cycle Consistency Reranking</h3>
<p>At inference, the model: (1) generates $K$ sets of beam search hypotheses from the retrosynthesis transformer (one per latent value), (2) scores each candidate with the forward reaction transformer for cycle consistency $p(\tilde{x}=x|z,y)$, and (3) reranks candidates by the full likelihood $p(z|x) \cdot p(y|z,x) \cdot p(\tilde{x}=x|z,y)$. This pushes chemically plausible predictions to higher ranks.</p>
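<p>The reranking step can be sketched with the three log-probability terms as hypothetical callables standing in for the trained networks:</p>

```python
def rerank(candidates, log_prior, log_retro, log_forward):
    """Rerank (z, y) candidates by the full joint log-likelihood
    log p(z|x) + log p(y|z,x) + log p(x|z,y), so that candidates the
    forward model judges cycle-consistent rise to the top."""
    scored = [
        (log_prior(z) + log_retro(y, z) + log_forward(y, z), z, y)
        for z, y in candidates
    ]
    scored.sort(reverse=True)  # highest joint log-likelihood first
    return [(z, y) for _, z, y in scored]
```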
<h2 id="results-on-uspto-50k">Results on USPTO-50K</h2>
<p>All results are averaged over 5 random seeds with beam size 10.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-5 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Top-1 Invalid</th>
          <th>Top-10 Invalid</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Liu-LSTM</td>
          <td>37.4%</td>
          <td>57.0%</td>
          <td>61.7%</td>
          <td>12.2%</td>
          <td>22.0%</td>
      </tr>
      <tr>
          <td>SCROP</td>
          <td>43.7%</td>
          <td>65.2%</td>
          <td>68.7%</td>
          <td>0.7%</td>
          <td>2.3%</td>
      </tr>
      <tr>
          <td>Lin-TF</td>
          <td>42.0%</td>
          <td>71.3%</td>
          <td>77.6%</td>
          <td>2.2%</td>
          <td>7.8%</td>
      </tr>
      <tr>
          <td>Base transformer</td>
          <td>44.3%</td>
          <td>68.4%</td>
          <td>72.7%</td>
          <td>1.7%</td>
          <td>12.1%</td>
      </tr>
      <tr>
          <td>Proposed ($K$=5)</td>
          <td>46.8%</td>
          <td>73.5%</td>
          <td>78.5%</td>
          <td>0.1%</td>
          <td>2.6%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model achieves a 3.1 percentage-point top-1 accuracy improvement over the best previous template-free method and reduces the top-1 invalid rate to 0.1%.</p>
<h3 id="ablation-analysis">Ablation Analysis</h3>
<p>The ablation study isolates the contribution of each component:</p>
<ul>
<li><strong>Base+CC</strong> (cycle consistency only): reranks candidates to improve top-1/3/5 accuracy and validity, but top-10 stays the same since the candidate set is unchanged. Parameter count doubles (34.8M).</li>
<li><strong>Base+PT</strong> (parameter tying only): improves accuracy and validity at all top-$k$ levels with negligible parameter increase. Parameter tying during training improves the retrosynthesis transformer itself, even without cycle consistency at inference.</li>
<li><strong>Proposed ($K$=1)</strong>: combines tying with cycle consistency reranking.</li>
<li><strong>Proposed ($K$=5)</strong>: adds latent diversity, further improving top-10 accuracy (+2.2%) and reducing top-10 invalid rate (from 10.2% to 2.6%).</li>
</ul>
<h3 id="diversity-unique-rate">Diversity: Unique Rate</h3>
<p>As $K$ increases from 1 to 5, the unique molecule rate among 10 predictions rises substantially, confirming that latent modeling produces more diverse candidates. The learned prior reduces the top-1/top-10 accuracy trade-off compared to Chen et al.&rsquo;s uniform prior.</p>
<h2 id="results-on-in-house-multi-pathway-dataset">Results on In-House Multi-Pathway Dataset</h2>
<p>The in-house dataset (162K reactions from <a href="https://en.wikipedia.org/wiki/Reaxys">Reaxys</a>) contains multiple ground-truth reactions per product, enabling direct evaluation of pathway diversity through coverage (proportion of ground-truth pathways correctly predicted in the top-10 candidates).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Unique Rate</th>
          <th>Coverage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>64.2%</td>
          <td>91.6%</td>
          <td>76.1%</td>
          <td>84.4%</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>66.0%</td>
          <td>92.8%</td>
          <td>93.2%</td>
          <td>87.3%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model covers 87.3% of ground-truth reaction pathways on average, compared to 84.4% for the baseline. The unique rate jumps from 76.1% to 93.2%, confirming that the latent variables effectively encourage diverse predictions.</p>
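<p>Both diversity metrics reduce to simple set operations over canonicalized prediction strings (hypothetical helpers matching the metric definitions, not the authors' code):</p>

```python
def unique_rate(predictions):
    """Fraction of distinct (canonical) reactant sets among the top-k
    candidates for one product."""
    return len(set(predictions)) / len(predictions)

def coverage(predictions, ground_truths):
    """Proportion of a product's ground-truth pathways recovered in the
    top-k candidate list (the in-house dataset has several per product)."""
    hits = sum(1 for g in ground_truths if g in set(predictions))
    return hits / len(ground_truths)
```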
<h2 id="limitations">Limitations</h2>
<p>The model operates on SMILES strings, which linearize molecules and discard the explicit graph structure that graph-based models exploit. Graph-based retrosynthesis models (e.g., GraphRetro at 63.8% top-1) substantially outperform template-free string-based models. The USPTO-50K dataset provides only one ground-truth pathway per product, making diversity evaluation limited on this benchmark. The in-house dataset is not publicly available. The model also does not predict reaction conditions (solvents, catalysts, temperature) or reagents.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ejklike/tied-twoway-transformer">ejklike/tied-twoway-transformer</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: USPTO-50K dataset (public, 50K reactions from USPTO patents). In-house dataset (162K reactions from Reaxys, not publicly available).</p>
<p><strong>Hardware</strong>: 4 NVIDIA Tesla M40 GPUs. Checkpoints saved every 5000 steps, last 5 averaged.</p>
<p><strong>Training</strong>: Adam optimizer ($\beta$ = 0.9, 0.98), initial learning rate 2 with 8000 warm-up steps, dropout 0.3, gradient accumulation over 4 batches. Label smoothing set to 0.</p>
<p><strong>Inference</strong>: Beam size 10, generating 10 candidates per product.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, E., Lee, D., Kwon, Y., Park, M. S., &amp; Choi, Y.-S. (2021). Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables. <em>Journal of Chemical Information and Modeling</em>, 61, 123-133.</p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ejklike/tied-twoway-transformer">GitHub: ejklike/tied-twoway-transformer</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kim2021valid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kim, Eunji and Lee, Dongseon and Kwon, Youngchun and Park, Min Sik and Choi, Youn-Suk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{123--133}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01074}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Transformer: Calibrated Reaction Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/</guid><description>A Transformer seq2seq model for chemical reaction prediction achieving 90.4% top-1 accuracy on USPTO_MIT with calibrated uncertainty estimation.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Method</strong> paper. It adapts the Transformer architecture to chemical reaction prediction, treating it as a machine translation problem from reactant <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> to product SMILES. The key contributions are (1) demonstrating that a fully attention-based model outperforms all prior template-based, graph-based, and RNN-based methods, (2) showing the model works without separating reactants from reagents, and (3) introducing calibrated uncertainty estimation for ranking synthesis pathways.</p>
<h2 id="motivation-limitations-of-existing-reaction-prediction">Motivation: Limitations of Existing Reaction Prediction</h2>
<p>Prior approaches to reaction prediction fell into two broad groups, template-based and template-free (the latter split between graph-based and sequence-based models), each with fundamental limitations:</p>
<ul>
<li><strong>Template-based methods</strong> rely on libraries of reaction rules, either handcrafted or automatically extracted from atom-mapped data. Automatic template extraction itself depends on atom mapping, which depends on templates, creating a circular dependency.</li>
<li><strong>Graph-based template-free methods</strong> (e.g., WLDN, ELECTRO) avoid explicit templates but still require atom-mapped training data and cannot handle stereochemistry.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/nmt-organic-reaction-prediction/">RNN-based seq2seq models</a></strong> (also template-free) treat reactions as SMILES translation but impose a positional inductive bias: tokens far apart in the SMILES string are assumed to be less related. This assumption is wrong: adjacency in a SMILES string does not reflect proximity in the molecular graph or in 3D space.</li>
</ul>
<h2 id="core-innovation-transformer-for-reaction-prediction">Core Innovation: Transformer for Reaction Prediction</h2>
<p>The Molecular Transformer adapts the Transformer architecture to chemical reactions by treating SMILES strings of reactants and reagents as source sequences and product SMILES as target sequences.</p>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder Transformer with 4 layers, 256-dimensional hidden states, 8 attention heads, and 12M parameters (reduced from the original 65M NMT model).</li>
<li><strong>Tokenization</strong>: Atom-wise regex tokenization of SMILES strings, applied uniformly to both reactants and reagents (no special reagent tokens).</li>
<li><strong>Data augmentation</strong>: Training data is doubled by generating <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">random (non-canonical) SMILES</a> for each reaction, which improves top-1 accuracy by roughly 1%.</li>
<li><strong>Weight averaging</strong>: Final model weights are averaged over the last 20 checkpoints, providing a further accuracy boost without the inference cost of ensembling.</li>
<li><strong>Mixed input</strong>: Unlike all prior work that separates reactants from reagents (which implicitly assumes knowledge of the product), the Molecular Transformer operates on mixed inputs where no distinction is made.</li>
</ul>
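<p>The atom-wise tokenization step can be reproduced with the regex published by Schwaller et al.; the snippet below is a minimal sketch of that tokenizer:</p>

```python
import re

# Atom-wise SMILES tokenization regex from Schwaller et al.; each atom,
# bond, ring-closure digit, and branch symbol becomes one token.
SMI_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:"
    r"|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; the split is lossless."""
    tokens = SMI_REGEX.findall(smiles)
    assert "".join(tokens) == smiles, "tokenization must be lossless"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1"))  # phenyl acetate, tokenized atom-wise
```

Note that two-letter elements (`Br`, `Cl`) are matched before single letters, so `BrCCl` splits into three atom tokens rather than five characters.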
<p>The multihead attention mechanism is the key architectural advantage over RNNs. It allows the model to attend to any pair of tokens regardless of their position in the SMILES string, correctly capturing long-range chemical relationships that RNNs miss.</p>
<h2 id="uncertainty-estimation">Uncertainty Estimation</h2>
<p>A central contribution is calibrated uncertainty scoring. The product of predicted token probabilities serves as a confidence score for each prediction. This score achieves 0.89 AUC-ROC for classifying whether a prediction is correct.</p>
<p>An important finding: <strong>label smoothing hurts uncertainty calibration</strong>. While label smoothing (as used in the original Transformer) marginally improves top-1 accuracy (87.44% vs 87.28%), it destroys the model&rsquo;s ability to distinguish correct from incorrect predictions. Setting the label smoothing parameter to 0.0 preserves calibration.</p>
<p>The confidence score shows no correlation with SMILES length (Pearson $r = 0.06$), confirming it is not biased against predictions of larger molecules.</p>
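<p>The confidence score itself is simple to compute from the decoder's per-token probabilities; a sketch (the log-probability values below are made up):</p>

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Product of predicted token probabilities, computed as the exp of
    summed log-probabilities for numerical stability."""
    return math.exp(sum(token_logprobs))

# Hypothetical per-token log-probs for one predicted product SMILES:
logps = [-0.01, -0.02, -0.05, -0.01]
print(sequence_confidence(logps))  # close to 1 => high-confidence prediction
```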
<h2 id="experimental-results">Experimental Results</h2>
<h3 id="forward-synthesis-prediction">Forward Synthesis Prediction</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Setting</th>
          <th style="text-align: left">Top-1 (%)</th>
          <th style="text-align: left">Top-2 (%)</th>
          <th style="text-align: left">Top-5 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">90.4</td>
          <td style="text-align: left">93.7</td>
          <td style="text-align: left">95.3</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">88.6</td>
          <td style="text-align: left">92.4</td>
          <td style="text-align: left">94.2</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">78.1</td>
          <td style="text-align: left">84.0</td>
          <td style="text-align: left">87.1</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">76.2</td>
          <td style="text-align: left">82.4</td>
          <td style="text-align: left">85.8</td>
      </tr>
  </tbody>
</table>
<p>The mixed-input model (88.6%) outperforms all prior methods that used separated inputs (best previous: WLDN5 at 85.6%).</p>
<h3 id="comparison-with-quantum-chemistry">Comparison with Quantum Chemistry</h3>
<p>On <a href="https://en.wikipedia.org/wiki/Regioselectivity">regioselectivity</a> of <a href="https://en.wikipedia.org/wiki/Electrophilic_aromatic_substitution">electrophilic aromatic substitution</a> in heteroaromatics, the Molecular Transformer achieves 83% top-1 accuracy vs 81% for RegioSQM (a quantum-chemistry-based predictor), at a fraction of the computational cost.</p>
<h3 id="comparison-with-human-chemists">Comparison with Human Chemists</h3>
<p>On 80 reactions sampled across rarity bins, the Molecular Transformer achieves 87.5% top-1 accuracy vs 76.5% for the best human chemist and 72.5% for the best graph-based model (WLDN5).</p>
<h3 id="chemically-constrained-beam-search">Chemically Constrained Beam Search</h3>
<p>Constraining beam search to only predict atoms present in the reactants (preventing &ldquo;alchemy&rdquo;) produces no change in accuracy, confirming the model has learned conservation of atoms from data alone.</p>
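<p>A &ldquo;no alchemy&rdquo; constraint of this kind can be approximated by filtering beam candidates whose atom tokens do not all appear in the reactants. The sketch below is illustrative, not the paper's exact implementation:</p>

```python
import re

# Matches atom tokens (bracket atoms, two-letter and one-letter elements,
# aromatic lowercase atoms); bonds, digits, and branches are ignored.
ATOM_RE = re.compile(r"\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p")

def atom_set(tokens: list[str]) -> set[str]:
    """Distinct atom tokens appearing in a token sequence."""
    return {t for t in tokens if ATOM_RE.fullmatch(t)}

def passes_atom_constraint(candidate: list[str], reactants: list[str]) -> bool:
    """Accept a candidate only if every atom it uses occurs in the reactants."""
    return atom_set(candidate) <= atom_set(reactants)

reactant_toks = ["C", "C", "(", "=", "O", ")", "O"]
print(passes_atom_constraint(["C", "C", "O"], reactant_toks))  # True
print(passes_atom_constraint(["C", "N"], reactant_toks))       # False
```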
<h2 id="trade-offs-and-limitations">Trade-offs and Limitations</h2>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/Stereochemistry">Stereochemistry</a></strong>: Accuracy drops significantly on USPTO_STEREO (76-78% vs 88-90% on USPTO_MIT), indicating stereochemical prediction remains challenging.</li>
<li><strong>Resolution reactions</strong>: Accuracy drops sharply on resolution reactions (28.6%), where reagent information is often missing from patent data.</li>
<li><strong>Unclassified reactions</strong>: Accuracy on &ldquo;unrecognized&rdquo; reaction classes is 46.3%, likely reflecting noisy or mistranscribed data.</li>
<li><strong>No atom mapping</strong>: The model provides no explicit atom mapping between reactants and products, which limits interpretability for understanding reaction mechanisms.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Primary benchmark</strong></td>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">479K</td>
          <td style="text-align: left">Filtered by Jin et al., no stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>LEF subset</strong></td>
          <td style="text-align: left">USPTO_LEF</td>
          <td style="text-align: left">350K</td>
          <td style="text-align: left">Subset of MIT with linear electron flow only</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stereo benchmark</strong></td>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">1.0M</td>
          <td style="text-align: left">Patent reactions through Sept 2016, includes stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Time-split test</strong></td>
          <td style="text-align: left">Pistachio_2017</td>
          <td style="text-align: left">15.4K</td>
          <td style="text-align: left">Non-public, reactions from 2017</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>: SMILES canonicalized with RDKit. Regex tokenization from Schwaller et al. (2018). Two input modes: &ldquo;separated&rdquo; (reactants &gt; reagents) and &ldquo;mixed&rdquo; (all molecules concatenated).</p>
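<p>The two input modes differ only in whether reagents are set off by a &gt; separator; an illustrative construction (the molecules are made up):</p>

```python
# Illustrative construction of the two input modes (example molecules):
reactants = ["CC(=O)Cl", "OCC"]    # acetyl chloride + ethanol
reagents = ["c1ccncc1"]            # pyridine as base

separated = ".".join(reactants) + ">" + ".".join(reagents)
mixed = ".".join(reactants + reagents)

print(separated)  # CC(=O)Cl.OCC>c1ccncc1
print(mixed)      # CC(=O)Cl.OCC.c1ccncc1
```

In the mixed mode the model must infer for itself which molecules contribute atoms to the product.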
<h3 id="model">Model</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">4</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model dimension</strong></td>
          <td style="text-align: left">256</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention heads</strong></td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~12M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Label smoothing</strong></td>
          <td style="text-align: left">0.0</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimizer</strong></td>
          <td style="text-align: left">Adam</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Warm-up steps</strong></td>
          <td style="text-align: left">8000</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Batch size</strong></td>
          <td style="text-align: left">~4096 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Beam width</strong></td>
          <td style="text-align: left">5</td>
      </tr>
  </tbody>
</table>
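<p>The Adam-plus-warm-up configuration in the table matches the standard Transformer (Noam) learning-rate schedule: linear warm-up for 8000 steps, then inverse-square-root decay. A sketch assuming that schedule with the reported model dimension of 256 (the scale factor of 2.0 is an assumption, not a value from the paper):</p>

```python
def noam_lr(step: int, d_model: int = 256, warmup: int = 8000,
            factor: float = 2.0) -> float:
    """Noam schedule: linear warm-up, then inverse-square-root decay.
    The factor of 2.0 is an assumption, not a value from the paper."""
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The learning rate peaks at the end of warm-up, then decays:
print(noam_lr(8000) > noam_lr(1000))   # True
print(noam_lr(8000) > noam_lr(64000))  # True
```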
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (sep)</td>
          <td style="text-align: left"><strong>90.4%</strong></td>
          <td style="text-align: left">85.6% (WLDN5)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (mixed)</td>
          <td style="text-align: left"><strong>88.6%</strong></td>
          <td style="text-align: left">80.3% (S2S RNN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>AUC-ROC</strong></td>
          <td style="text-align: left">Uncertainty calibration</td>
          <td style="text-align: left"><strong>0.89</strong></td>
          <td style="text-align: left">N/A</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Regioselectivity</td>
          <td style="text-align: left"><strong>83%</strong></td>
          <td style="text-align: left">81% (RegioSQM)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Human comparison</td>
          <td style="text-align: left"><strong>87.5%</strong></td>
          <td style="text-align: left">76.5% (best human)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Single Nvidia P100 GPU, 48h for best single model</li>
<li>Inference: 20 min for 40K reactions on single P100</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schwaller, P., Laino, T., Gaudin, T., Bolgar, P., Hunter, C. A., Bekas, C., &amp; Lee, A. A. (2019). Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. <em>ACS Central Science</em>, 5(9), 1572-1583. <a href="https://doi.org/10.1021/acscentsci.9b00576">https://doi.org/10.1021/acscentsci.9b00576</a></p>
<p><strong>Publication</strong>: ACS Central Science 2019</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schwallerMolecularTransformerModel2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Schwaller, Philippe and Laino, Teodoro and Gaudin, Th{\&#39;e}ophile and Bolgar, Peter and Hunter, Christopher A. and Bekas, Costas and Lee, Alpha A.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2019</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1572--1583}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acscentsci.9b00576}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>