<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Reaction Prediction on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/</link><description>Recent content in Reaction Prediction on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 28 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/index.xml" rel="self" type="application/rss+xml"/><item><title>ReactionT5: Pre-trained T5 for Reaction Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</guid><description>ReactionT5 uses two-stage pretraining on ZINC and the Open Reaction Database to enable competitive reaction and yield prediction with minimal fine-tuning data.</description><content:encoded><![CDATA[<h2 id="a-two-stage-pre-trained-transformer-for-chemical-reactions">A Two-Stage Pre-trained Transformer for Chemical Reactions</h2>
<p>ReactionT5 is a <strong>Method</strong> paper that proposes a T5-based pre-trained model for chemical reaction tasks, specifically product prediction and yield prediction. The primary contribution is a two-stage pretraining pipeline: first on a compound library (ZINC, 23M molecules) to learn molecular representations, then on a large-scale reaction database (the Open Reaction Database, 1.5M reactions) to learn reaction-level patterns. The key result is that this pre-trained model can be fine-tuned with very limited target-domain data (as few as 30 reactions) and still achieve competitive performance against models trained on full datasets.</p>
<h2 id="bridging-the-gap-between-single-molecule-and-multi-molecule-pretraining">Bridging the Gap Between Single-Molecule and Multi-Molecule Pretraining</h2>
<p>While transformer-based models pre-trained on compound libraries (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a>, MolGPT) have seen substantial development, most focus on single-molecule inputs and outputs. Pretraining for multi-molecule contexts, such as chemical reactions involving reactants, reagents, catalysts, and products, remains underexplored. T5Chem supports multi-task reaction prediction but focuses on building a single multi-task model rather than investigating the effectiveness of pre-trained models for fine-tuning on limited in-house data.</p>
<p>The authors identify two key gaps:</p>
<ol>
<li>Most pre-trained chemical models do not account for reaction-level interactions between multiple molecules.</li>
<li>In practical settings, target-domain reaction data is often scarce, making transfer learning from large public datasets essential.</li>
</ol>
<h2 id="two-stage-pretraining-with-compound-restoration">Two-Stage Pretraining with Compound Restoration</h2>
<p>The core innovation is a two-stage pretraining procedure built on the <a href="https://en.wikipedia.org/wiki/T5_(language_model)">T5 (text-to-text transfer transformer)</a> architecture:</p>
<p><strong>Stage 1: Compound Pretraining (CompoundT5)</strong>. A randomly initialized T5 model is trained on 23M <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> from the ZINC database using span-masked language modeling. The model learns to predict masked subsequences of SMILES tokens. A SentencePiece unigram tokenizer is trained on this compound library, allowing more compact representations than character-level or atom-level tokenizers. After this stage, new tokens are added to the tokenizer to cover metal atoms and other characters present in the reaction database but absent from ZINC.</p>
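<p>The span-masking objective can be sketched in plain Python. This is a toy illustration, not the paper&rsquo;s implementation: the sentinel-token names follow the usual T5 convention, and the 15% masking rate and mean span length of 3 are the settings reported in the reproducibility details.</p>

```python
import random

def span_mask(tokens, mask_rate=0.15, mean_span=3, seed=0):
    """Toy T5-style span corruption: replace contiguous token spans with
    sentinel tokens; the target lists each sentinel followed by the
    tokens it hides."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * mask_rate))
    masked = set()
    while len(masked) < n_to_mask:
        span = max(1, int(rng.expovariate(1 / mean_span)))
        start = rng.randrange(len(tokens))
        masked.update(range(start, min(start + span, len(tokens))))
    corrupted, target, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            corrupted.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

# Character-level tokens stand in for SentencePiece unigram pieces here.
inp, tgt = span_mask(list("CC(=O)Oc1ccccc1C(=O)O"))
```

<p>In the actual model the units would be SentencePiece pieces rather than single characters, but the corruption scheme is the same.</p>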
<p><strong>Stage 2: Reaction Pretraining (ReactionT5)</strong>. CompoundT5 is further pretrained on 1.5M reactions from the Open Reaction Database (ORD) on both product prediction and yield prediction tasks. Reactions are formulated as text-to-text tasks using special tokens:</p>
<ul>
<li><code>REACTANT:</code>, <code>REAGENT:</code>, and <code>PRODUCT:</code> tokens delimit the role of each molecule in the reaction string.</li>
<li>For product prediction, the model takes reactants and reagents as input and generates product SMILES.</li>
<li>For yield prediction, the model takes the full reaction (including products) and outputs a numerical yield value.</li>
</ul>
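<p>A minimal sketch of assembling such reaction strings: the role tokens come from the paper, but the exact concatenation order and the <code>.</code> separator between molecules are assumptions here.</p>

```python
def format_reaction(reactants, reagents, products=None):
    """Assemble a ReactionT5-style reaction string. Role tokens follow
    the paper; the separators are illustrative assumptions."""
    s = "REACTANT:" + ".".join(reactants) + "REAGENT:" + ".".join(reagents)
    if products is None:
        return s  # product prediction input: the model generates the product side
    return s + "PRODUCT:" + ".".join(products)  # yield prediction input

x = format_reaction(["CCBr", "c1ccccc1O"], ["[K+].[OH-]"])
```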
<p><strong>Compound Restoration</strong>. A notable methodological detail is the handling of uncategorized compounds in the ORD. About 31.8% of ORD reactions contain compounds with unknown roles. Simply discarding these reactions introduces severe product bias (only 447 unique products remain vs. 439,898 with uncategorized data included). The authors develop RestorationT5, a binary classifier built from CompoundT5, that assigns uncategorized compounds to either reactant or reagent roles. This classifier uses a sigmoid output layer and achieves an F1 score of 0.1564 at a threshold of 0.97, outperforming a random forest baseline (F1 = 0.1136). The restored dataset (&ldquo;ORD(restored)&rdquo;) is then used for reaction pretraining.</p>
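<p>The precision/recall/F1 numbers at a fixed sigmoid threshold follow mechanically from the confusion-matrix counts; the sketch below uses toy probabilities and labels, not the paper&rsquo;s predictions.</p>

```python
def f1_at_threshold(probs, labels, threshold):
    """Binary precision, recall, and F1 after thresholding sigmoid outputs."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```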
<p>For yield prediction, the loss function is mean squared error:</p>
<p>$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$</p>
<p>where $y_i$ is the true yield (normalized to [0, 1]) and $\hat{y}_i$ is the predicted yield.</p>
<h2 id="experimental-setup-product-and-yield-prediction-benchmarks">Experimental Setup: Product and Yield Prediction Benchmarks</h2>
<h3 id="product-prediction">Product Prediction</h3>
<p>The USPTO dataset (479K reactions) is used for evaluation, with standard train/val/test splits (409K/30K/40K). Reactions overlapping with the ORD (18%) are removed during evaluation. Beam search with beam size 10 is used for decoding, and minimum/maximum output length constraints are set based on the training data distribution. Top-k accuracy (k = 1, 2, 3, 5) and invalidity rate are reported.</p>
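<p>Top-k accuracy over beam-search candidates reduces to exact-match lookups in the first k hypotheses. A sketch (in practice both candidates and references would first be canonicalized, e.g. with RDKit, before comparison):</p>

```python
def topk_accuracy(beam_outputs, references, ks=(1, 2, 3, 5)):
    """Top-k exact-match accuracy over beam-search candidate lists.
    beam_outputs: one best-first candidate list per reaction;
    references: the ground-truth product SMILES."""
    results = {}
    for k in ks:
        hits = sum(ref in cands[:k] for cands, ref in zip(beam_outputs, references))
        results[k] = hits / len(references)
    return results
```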
<p>Baselines include Seq-to-seq, WLDN (graph neural network), <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, and T5Chem.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Train</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Seq-to-seq</td>
          <td>USPTO</td>
          <td>80.3</td>
          <td>84.7</td>
          <td>86.2</td>
          <td>87.5</td>
          <td>-</td>
      </tr>
      <tr>
          <td>WLDN</td>
          <td>USPTO</td>
          <td>85.6</td>
          <td>90.5</td>
          <td>92.8</td>
          <td>93.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Molecular Transformer</td>
          <td>USPTO</td>
          <td>88.8</td>
          <td>92.6</td>
          <td>-</td>
          <td>94.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>USPTO</td>
          <td>90.4</td>
          <td>94.2</td>
          <td>-</td>
          <td>96.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>USPTO</td>
          <td>88.0</td>
          <td>92.4</td>
          <td>93.9</td>
          <td>95.0</td>
          <td>7.5</td>
      </tr>
      <tr>
          <td>ReactionT5 (restored ORD)</td>
          <td>USPTO200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<p>A critical finding: ReactionT5 pre-trained on ORD achieves 0% accuracy on USPTO without fine-tuning due to domain mismatch (ORD includes byproducts; USPTO lists only the main product). Fine-tuning on just 200 USPTO reactions with the restored ORD model produces competitive results.</p>
<p>The few-shot fine-tuning analysis shows rapid performance scaling:</p>
<table>
  <thead>
      <tr>
          <th>Samples</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>10</td>
          <td>9.0</td>
          <td>12.5</td>
          <td>15.3</td>
          <td>19.1</td>
          <td>12.4</td>
      </tr>
      <tr>
          <td>30</td>
          <td>80.5</td>
          <td>87.3</td>
          <td>89.8</td>
          <td>92.0</td>
          <td>17.2</td>
      </tr>
      <tr>
          <td>50</td>
          <td>83.7</td>
          <td>89.9</td>
          <td>92.2</td>
          <td>94.0</td>
          <td>14.8</td>
      </tr>
      <tr>
          <td>100</td>
          <td>85.1</td>
          <td>91.0</td>
          <td>92.8</td>
          <td>94.4</td>
          <td>14.0</td>
      </tr>
      <tr>
          <td>200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<h3 id="yield-prediction">Yield Prediction</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling dataset (3,955 reactions) is used with random 7:3 splits (repeated 10 times) plus four out-of-sample test sets (Tests 1-4) designed so that similar reactions do not appear in both train and test.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Random 7:3</th>
          <th>Test 1</th>
          <th>Test 2</th>
          <th>Test 3</th>
          <th>Test 4</th>
          <th>Avg. Tests 1-4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DFT</td>
          <td>0.92</td>
          <td>0.80</td>
          <td>0.77</td>
          <td>0.64</td>
          <td>0.54</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MFF</td>
          <td>0.927</td>
          <td>0.851</td>
          <td>0.713</td>
          <td>0.635</td>
          <td>0.184</td>
          <td>0.596</td>
      </tr>
      <tr>
          <td>Yield-BERT</td>
          <td>0.951</td>
          <td>0.838</td>
          <td>0.836</td>
          <td>0.738</td>
          <td>0.538</td>
          <td>0.738</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>0.970</td>
          <td>0.811</td>
          <td>0.907</td>
          <td>0.789</td>
          <td>0.627</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>0.971</td>
          <td>0.855</td>
          <td>0.852</td>
          <td>0.712</td>
          <td>0.547</td>
          <td>0.741</td>
      </tr>
      <tr>
          <td>ReactionT5</td>
          <td>0.966</td>
          <td>0.914</td>
          <td>0.940</td>
          <td>0.819</td>
          <td>0.896</td>
          <td>0.892</td>
      </tr>
      <tr>
          <td>ReactionT5 (zero-shot)</td>
          <td>0.904</td>
          <td>0.919</td>
          <td>0.927</td>
          <td>0.847</td>
          <td>0.909</td>
          <td>0.900</td>
      </tr>
  </tbody>
</table>
<p>ReactionT5 achieves the highest average $R^2$ across Tests 1-4 (0.892), with the zero-shot variant performing even better (0.900). The improvement is most dramatic on Test 4, the hardest split, where ReactionT5 achieves $R^2 = 0.896$ versus T5Chem&rsquo;s 0.627 and Yield-BERT&rsquo;s 0.538.</p>
<p>In a low-data regime (30% train / 70% test), ReactionT5 ($R^2 = 0.927$) substantially outperforms a random forest baseline ($R^2 = 0.853$), and even zero-shot ReactionT5 ($R^2 = 0.898$) exceeds the random forest.</p>
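<p>For reference, the $R^2$ metric used throughout the yield experiments is the ordinary coefficient of determination, which a few lines of Python reproduce:</p>

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```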
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Two-stage pretraining is effective</strong>: Compound pretraining followed by reaction pretraining produces models with strong generalization, particularly on out-of-distribution test sets.</li>
<li><strong>Few-shot transfer works</strong>: With as few as 30 fine-tuning reactions, ReactionT5 achieves over 80% Top-1 accuracy on product prediction, competitive with models trained on the full USPTO dataset.</li>
<li><strong>Compound restoration matters</strong>: Restoring uncategorized compounds in the ORD is essential for product prediction. Without restoration, fine-tuning on 200 USPTO reactions yields 0% accuracy; with restoration, the same fine-tuning yields 85.5% Top-1.</li>
<li><strong>Zero-shot yield prediction is surprisingly effective</strong>: ReactionT5 achieves $R^2 = 0.900$ on the out-of-sample yield tests without any task-specific fine-tuning, outperforming all fine-tuned baselines.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Product prediction shows a high invalidity rate (12.0% for the best ReactionT5 variant) compared to CompoundT5 (7.5%), suggesting that reaction-level pretraining may degrade the decoder&rsquo;s reliability at emitting syntactically valid SMILES.</li>
<li>The 0% accuracy without fine-tuning on product prediction reveals a significant domain gap between ORD and USPTO annotation conventions (byproducts vs. main products).</li>
<li>The RestorationT5 classifier has low precision (0.0878) despite high recall (0.7212), meaning many compounds are incorrectly assigned roles. The paper does not investigate how this impacts downstream performance.</li>
<li>The paper does not report training times, computational costs, or model sizes, making resource requirements unclear.</li>
<li>Only two downstream tasks (product prediction on USPTO, yield prediction on Buchwald-Hartwig) are evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compound pretraining</td>
          <td>ZINC</td>
          <td>22,992,522 compounds</td>
          <td>SMILES canonicalized with <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a></td>
      </tr>
      <tr>
          <td>Reaction pretraining</td>
          <td>ORD (restored)</td>
          <td>1,505,916 reactions</td>
          <td>Atom mapping removed, compounds canonicalized</td>
      </tr>
      <tr>
          <td>Product prediction eval</td>
          <td>USPTO</td>
          <td>479,035 reactions</td>
          <td>409K/30K/40K train/val/test split</td>
      </tr>
      <tr>
          <td>Yield prediction eval</td>
          <td>Buchwald-Hartwig C-N</td>
          <td>3,955 reactions</td>
          <td>Random 7:3 split (10 repeats) + 4 OOS tests</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Base architecture: T5 (text-to-text transfer transformer)</li>
<li>Tokenizer: SentencePiece unigram, trained on ZINC, extended with special reaction tokens</li>
<li>Compound pretraining: Span-masked language modeling (15% masking rate, average span length 3)</li>
<li>Beam search: size 10 for product prediction</li>
<li>Output length constraints: min/max from training data distribution</li>
<li>Yield normalization: clipped to [0, 100], then scaled to [0, 1]</li>
</ul>
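<p>The yield normalization in the last bullet is a one-liner; this is the straightforward reading of &ldquo;clipped to [0, 100], then scaled to [0, 1]&rdquo;:</p>

```python
def normalize_yield(percent):
    """Clip a reported yield to [0, 100] %, then rescale to [0, 1]."""
    return min(max(percent, 0.0), 100.0) / 100.0
```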
<h3 id="models">Models</h3>
<ul>
<li>CompoundT5: T5 pretrained on ZINC</li>
<li>RestorationT5: CompoundT5 fine-tuned for binary classification (reactant vs. reagent)</li>
<li>ReactionT5: CompoundT5 pretrained on ORD for product and yield prediction</li>
<li>Pre-trained weights available on Hugging Face</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>Product prediction</td>
          <td>85.5%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>Top-5 accuracy</td>
          <td>Product prediction</td>
          <td>94.9%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (random)</td>
          <td>0.966</td>
          <td>ReactionT5 fine-tuned</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (OOS avg.)</td>
          <td>0.900</td>
          <td>ReactionT5 zero-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Training times and GPU requirements are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sagawatatsuya/ReactionT5v2">ReactionT5v2 (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/sagawa">ReactionT5 models (Hugging Face)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sagawa, T. &amp; Kojima, R. (2023). ReactionT5: a large-scale pre-trained model towards application of limited reaction data. <em>arXiv preprint arXiv:2311.06708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sagawa2023reactiont5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ReactionT5: a large-scale pre-trained model towards application of limited reaction data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sagawa, Tatsuya and Kojima, Ryosuke}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2311.06708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2311.06708}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Machine Translation for Reaction Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/nmt-organic-reaction-prediction/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/nmt-organic-reaction-prediction/</guid><description>Nam and Kim apply a GRU-based seq2seq model with attention to predict organic reaction products from SMILES, pioneering the NMT approach to chemistry.</description><content:encoded><![CDATA[<h2 id="pioneering-seq2seq-translation-for-reaction-prediction">Pioneering Seq2Seq Translation for Reaction Prediction</h2>
<p>This is a <strong>Method</strong> paper. It introduces the idea of applying neural machine translation (NMT) to organic chemistry reaction prediction by framing product prediction as a sequence-to-sequence translation problem from reactant/reagent <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> to product SMILES. This was one of the earliest works to demonstrate that a data-driven encoder-decoder model could predict reaction products without any hand-coded reaction rules or SMARTS transformations.</p>
<h2 id="limitations-of-existing-reaction-prediction-methods">Limitations of Existing Reaction Prediction Methods</h2>
<p>Prior computational approaches to reaction prediction fell into three categories, each with significant drawbacks:</p>
<ol>
<li>
<p><strong>Rule-based methods</strong> (e.g., CAMEO, EROS) relied on manually encoded reaction rules. They performed well on reactions covered by the rules but required continuous manual encoding as new reaction types were discovered. Many older systems became outdated for this reason.</p>
</li>
<li>
<p><strong>Physical calculation methods</strong> computed energies of transition states from plausible reaction pathways using quantum mechanics. While principled, these approaches carried high computational cost. Simplified approaches (ToyChem, ROBIA) traded accuracy for speed.</p>
</li>
<li>
<p><strong>Machine learning methods</strong> at the time either predicted individual mechanistic steps (requiring tree search for multi-step reactions) or classified reaction types and applied SMARTS transformations to generate products. The classification-based approach of Wei et al. still required manual encoding of SMARTS transformations for new reaction types and struggled with ambiguous reaction classes.</p>
</li>
</ol>
<p>The key gap was the absence of a method that could predict reaction products directly from input molecules, learn from data alone, and generalize to new reaction types without manual rule encoding.</p>
<h2 id="core-innovation-reactions-as-machine-translation">Core Innovation: Reactions as Machine Translation</h2>
<p>The central insight is that SMILES strings can be treated as a language with grammatical specifications. Predicting reaction products then becomes a problem of translating &ldquo;reactant and reagent&rdquo; sentences into &ldquo;product&rdquo; sentences.</p>
<p>The model uses a <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a>-based encoder-decoder architecture with attention:</p>
<ul>
<li><strong>Encoder</strong>: 3 layers of GRU cells that process the reversed, tokenized SMILES string of reactants and reagents</li>
<li><strong>Decoder</strong>: 3 layers of GRU cells that generate product SMILES tokens autoregressively</li>
<li><strong>Attention mechanism</strong>: allows the decoder to attend to relevant encoder states at each generation step</li>
<li><strong>Embedding dimension</strong>: 600</li>
<li><strong>Vocabulary</strong>: 311 input tokens (reactants/reagents), 180 output tokens (products)</li>
<li><strong>Bucketed sequences</strong>: four bucket sizes handle variable-length inputs and outputs: (54, 54), (70, 60), (90, 65), (150, 80)</li>
</ul>
<p>The SMILES tokenization uses a <a href="https://en.wikipedia.org/wiki/Parsing_expression_grammar">PEG</a>-based parser that splits SMILES strings into atoms, bonds, branching symbols, and ring closure numbers. Input sequences are reversed before feeding to the encoder, following standard practice in NMT at the time.</p>
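<p>The paper uses a PEG grammar; a regex tokenizer that yields the same token classes (bracketed atoms, two-letter elements, bonds, branches, ring closures) is a common stand-in and is sketched here. The pattern is an approximation, not the authors&rsquo; grammar.</p>

```python
import re

# Regex approximation of atom/bond/branch/ring-closure tokenization.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|@@|@|=|#|\\|/|\(|\)|\.|%\d{2}|\d|\+|-)"
)

def tokenize(smiles, reverse=False):
    """Split a SMILES string into tokens; optionally reverse the
    sequence, as done for the encoder input."""
    tokens = SMILES_TOKEN.findall(smiles)
    return tokens[::-1] if reverse else tokens

toks = tokenize("CC(=O)Oc1ccccc1", reverse=True)
```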
<p>The translation objective finds the product sequence $\mathbf{y}$ that maximizes the conditional probability:</p>
<p>$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x})$$</p>
<p>where $\mathbf{x}$ is the tokenized reactant/reagent sequence and $T$ is the product sequence length.</p>
<h2 id="training-data-and-experimental-evaluation">Training Data and Experimental Evaluation</h2>
<h3 id="training-sets">Training Sets</h3>
<p>Two training sets were constructed:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Patent reactions (&ldquo;real&rdquo;)</td>
          <td style="text-align: left">1,094,235</td>
          <td style="text-align: left">USPTO patent applications (2001-2013), filtered by length</td>
      </tr>
      <tr>
          <td style="text-align: left">Generated reactions (&ldquo;gen&rdquo;)</td>
          <td style="text-align: left">865,118</td>
          <td style="text-align: left">75 reaction types from Wade&rsquo;s organic chemistry textbook, applied to <a href="/notes/computational-chemistry/datasets/gdb-11/">GDB-11</a> molecules (1-10 atoms)</td>
      </tr>
  </tbody>
</table>
<p>The &ldquo;real&rdquo; set was filtered to exclude reactions with reactant/reagent strings longer than 150 characters, product strings longer than 80 characters, or more than four products. The &ldquo;gen&rdquo; set was composed by iterating reaction templates (as SMARTS) over small molecules from GDB-11, covering five substrate types: acid derivatives, alcohols, aldehydes/ketones, alkenes, and alkynes.</p>
<p>Two models were compared: a &ldquo;gen&rdquo; model (trained only on generated reactions) and a &ldquo;real+gen&rdquo; model (trained on both sets).</p>
<h3 id="textbook-problem-evaluation">Textbook Problem Evaluation</h3>
<p>The models were tested on 10 problem sets from Wade&rsquo;s textbook, following the evaluation approach of Wei et al. Each problem set contained 6-15 reactions. Evaluation metrics included the ratio of fully correct predictions and the average <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between Morgan fingerprints of predicted and actual products.</p>
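<p>Tanimoto similarity is simply the Jaccard index over fingerprint bits. A sketch over plain integer sets (in the paper the sets would be on-bits of RDKit Morgan fingerprints):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |intersection| / |union|."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)
```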
<p>The &ldquo;real+gen&rdquo; model outperformed the &ldquo;gen&rdquo; model on most problem sets. On problem set 17-44 (aromatic compound reactions, only present in the &ldquo;real&rdquo; training set), the &ldquo;real+gen&rdquo; model correctly answered 4 out of 11 problems while the &ldquo;gen&rdquo; model answered 2. The &ldquo;gen&rdquo; model&rsquo;s ability to correctly predict some aromatic reactions despite never being trained on them suggests the model can extrapolate to unseen reaction patterns.</p>
<p>For <a href="https://en.wikipedia.org/wiki/Diels%E2%80%93Alder_reaction">Diels-Alder reactions</a> (problem set 15-30), neither model achieved fully correct predictions for all problems, though the &ldquo;real+gen&rdquo; model showed better Tanimoto scores, indicating partially correct structural predictions even when the exact product was missed.</p>
<h3 id="scalability-testing">Scalability Testing</h3>
<p>A scalability test used generated reactions with substrate molecules containing 11-16 atoms (larger than the training set molecules with fewer than 11 atoms). Results showed:</p>
<ul>
<li>The &ldquo;real+gen&rdquo; model maintained Tanimoto scores around 0.7 and error rates around 0.4 as substrate atom count increased</li>
<li>The ratio of fully correct predictions decreased as atom count increased, revealing that the recurrent network struggled with longer input sequences</li>
<li>The &ldquo;real+gen&rdquo; model produced fewer invalid SMILES strings than the &ldquo;gen&rdquo; model, likely because training on more reactions improved the decoder&rsquo;s ability to generate syntactically valid SMILES</li>
</ul>
<h3 id="attention-analysis">Attention Analysis</h3>
<p>Visualization of attention weights revealed a limitation: the decoder cells predominantly attended to the first few encoder cells rather than distributing attention across the full input sequence. This means the attention mechanism was not learning meaningful &ldquo;alignment&rdquo; between reactant atoms and product atoms. The authors note that if decoder cells generating tokens for unreactive sites could attend to the corresponding encoder cells (analogous to atom mapping), prediction quality on longer sequences could improve.</p>
<h3 id="token-embedding-analysis">Token Embedding Analysis</h3>
<p>t-SNE visualization of the learned token embeddings showed that encoder and decoder tokens clustered primarily by syntactic similarity rather than chemical properties. The model did not learn chemically meaningful embeddings, which the authors identify as an area for future improvement.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>Treating reaction prediction as NMT is viable: the seq2seq model can predict products without any hand-coded rules</li>
<li>Training on real patent data significantly improves prediction over generated data alone</li>
<li>The model can extrapolate to reaction types not seen during training (e.g., the &ldquo;gen&rdquo; model predicting aromatic reactions)</li>
<li>Compared to the fingerprint-based approach of Wei et al., this method performed better on textbook problems and eliminated the need for manual SMARTS encoding</li>
</ul>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Invalid SMILES generation</strong>: the token-by-token generation process can produce syntactically invalid SMILES (e.g., mismatched parentheses), which the authors scored as zero</li>
<li><strong>Sequence length degradation</strong>: prediction accuracy dropped for longer SMILES strings, a known limitation of RNN-based seq2seq models at the time</li>
<li><strong>Poor attention alignment</strong>: attention weights collapsed to the first encoder positions rather than learning meaningful reactant-product correspondences</li>
<li><strong>Chemically naive embeddings</strong>: token embeddings did not capture chemical properties</li>
<li><strong>Multiple reaction pathways</strong>: reactions with competing pathways (e.g., substitution vs. elimination) were difficult for the model to handle</li>
</ul>
<h3 id="historical-significance">Historical Significance</h3>
<p>This paper is historically significant as one of the first (alongside concurrent work) to propose the NMT framing for reaction prediction. This framing was later adopted and refined by the <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a> (Schwaller et al., 2019), which replaced GRUs with the Transformer architecture and achieved over 90% top-1 accuracy on standard benchmarks. The conceptual contribution of treating SMILES-to-SMILES translation as machine translation became the foundation of an entire subfield.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Training (real)</td>
          <td style="text-align: left">USPTO patent reactions</td>
          <td style="text-align: left">1,094,235</td>
          <td style="text-align: left">2001-2013 applications, filtered by length</td>
      </tr>
      <tr>
          <td style="text-align: left">Training (gen)</td>
          <td style="text-align: left">Generated from Wade textbook templates</td>
          <td style="text-align: left">865,118</td>
          <td style="text-align: left">75 reaction types, GDB-11 substrates</td>
      </tr>
      <tr>
          <td style="text-align: left">Testing (textbook)</td>
          <td style="text-align: left">Wade textbook problems</td>
          <td style="text-align: left">~100</td>
          <td style="text-align: left">10 problem sets, 6-15 reactions each</td>
      </tr>
      <tr>
          <td style="text-align: left">Testing (scalability)</td>
          <td style="text-align: left">Generated from <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td style="text-align: left">2,400</td>
          <td style="text-align: left">400 per atom count (11-16)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GRU-based encoder-decoder with attention mechanism</li>
<li>PEG-based SMILES tokenizer</li>
<li>Input sequence reversal</li>
<li>Bucketed training with four bucket sizes</li>
<li>TensorFlow seq2seq tutorial implementation with default learning rate</li>
</ul>
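<p>As a rough sketch of the tokenization and input-reversal steps (the paper itself used a PEG-based tokenizer; the regex below is an approximation in the style popularized by later SMILES work, so treat it as illustrative only):</p>

```python
import re

# Approximate SMILES tokenizer: bracket atoms, two-letter elements (Cl, Br),
# and ring-closure escapes (%NN) are kept as single tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|[BCNOPSFIbcnops]|[0-9]|=|#|\(|\)|\+|-|/|\\|\.|%[0-9]{2})"
)

def tokenize(smiles: str) -> list:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against characters the sketch pattern does not cover.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

def encoder_input(smiles: str) -> list:
    """Reverse the token sequence, as the paper does for encoder inputs."""
    return tokenize(smiles)[::-1]
```

Input reversal is the same trick used in early NMT seq2seq work: it shortens the distance between the start of the source and the start of the target, easing optimization.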
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">GRU layers</td>
          <td style="text-align: left">3</td>
      </tr>
      <tr>
          <td style="text-align: left">Embedding size</td>
          <td style="text-align: left">600</td>
      </tr>
      <tr>
          <td style="text-align: left">Input vocabulary</td>
          <td style="text-align: left">311 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left">Output vocabulary</td>
          <td style="text-align: left">180 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left">Buckets</td>
          <td style="text-align: left">(54,54), (70,60), (90,65), (150,80)</td>
      </tr>
  </tbody>
</table>
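<p>The bucketing scheme reduces to placing each (input, output) length pair into the smallest bucket that fits it; pairs exceeding the largest bucket are dropped, consistent with the length filtering noted in the data table. A minimal sketch:</p>

```python
from typing import Optional, Tuple

# Bucket sizes from the paper: (max input length, max output length).
BUCKETS = [(54, 54), (70, 60), (90, 65), (150, 80)]

def assign_bucket(src_len: int, tgt_len: int) -> Optional[Tuple[int, int]]:
    """Return the smallest bucket that fits the pair, or None if the
    pair is too long for every bucket and should be filtered out."""
    for src_max, tgt_max in BUCKETS:
        if src_len <= src_max and tgt_len <= tgt_max:
            return (src_max, tgt_max)
    return None
```

Sequences within a bucket are padded to that bucket's fixed shape, which is how the TensorFlow seq2seq tutorial implementation handles variable-length batching.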
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">gen Model</th>
          <th style="text-align: left">real+gen Model</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Textbook correct ratio</td>
          <td style="text-align: left">Variable by set</td>
          <td style="text-align: left">Higher on most sets</td>
          <td style="text-align: left">10 problem sets</td>
      </tr>
      <tr>
          <td style="text-align: left">Average Tanimoto similarity</td>
          <td style="text-align: left">Variable</td>
          <td style="text-align: left">~0.7 on scalability test</td>
          <td style="text-align: left">Morgan fingerprint based</td>
      </tr>
      <tr>
          <td style="text-align: left">Invalid SMILES ratio</td>
          <td style="text-align: left">Higher</td>
          <td style="text-align: left">~0.4 on scalability test</td>
          <td style="text-align: left">Decreases with more training data</td>
      </tr>
  </tbody>
</table>
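<p>The Tanimoto metric itself is simple to sketch over fingerprint bit sets; in the paper the bit sets come from Morgan fingerprints (via RDKit), but the pure-set version below shows just the computation:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A intersect B| / |A union B|."""
    if not fp_a and not fp_b:
        return 1.0  # convention for two empty fingerprints
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def average_tanimoto(pairs) -> float:
    """Mean similarity over (predicted, reference) fingerprint pairs."""
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```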
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nam, J. &amp; Kim, J. (2016). Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions. <em>arXiv preprint</em>, arXiv:1612.09529. <a href="https://arxiv.org/abs/1612.09529">https://arxiv.org/abs/1612.09529</a></p>
<p><strong>Publication</strong>: arXiv preprint 2016</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{nam2016linking,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nam, Juno and Kim, Jurae}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1612.09529}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1612.09529}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Transfer Approaches for Seq-to-Seq Retrosynthesis</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</guid><description>Systematic comparison of joint training, self-training, and pre-training plus fine-tuning for Transformer-based retrosynthesis on USPTO-50K.</description><content:encoded><![CDATA[<h2 id="systematic-study-of-data-transfer-for-retrosynthesis">Systematic Study of Data Transfer for Retrosynthesis</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares three standard data transfer methods (joint training, self-training, and pre-training plus fine-tuning) applied to a Transformer-based sequence-to-sequence model for single-step retrosynthesis. The primary contribution is demonstrating that pre-training on a large augmented dataset (USPTO-Full, 877K reactions) followed by fine-tuning on the smaller target dataset (USPTO-50K) produces substantial accuracy improvements over the baseline Transformer, achieving results competitive with or superior to contemporaneous state-of-the-art graph-based models at larger n in the n-best accuracy evaluation.</p>
<h2 id="bridging-the-data-gap-in-retrosynthesis-prediction">Bridging the Data Gap in Retrosynthesis Prediction</h2>
<p><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a>, the problem of predicting reactant compounds needed to synthesize a target product, has seen rapid progress through increasingly sophisticated model architectures: <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/nmt-organic-reaction-prediction/">LSTM seq-to-seq models</a>, <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Transformer models</a>, and graph-to-graph approaches. However, the authors identify a gap in this research trajectory. While model architecture has received extensive attention, the role of training data strategies has been largely neglected in the retrosynthesis literature.</p>
<p>The core practical problem is that high-quality supervised datasets for retrosynthesis (like USPTO-50K) tend to be small and distribution-skewed, with all samples pre-classified into ten major reaction classes. Meanwhile, larger datasets (USPTO-Full with 877K samples, USPTO-MIT with 479K samples) exist but have different distributional properties. Data transfer techniques are standard practice in computer vision, NLP, and machine translation for exactly this scenario, yet they had not been systematically evaluated for retrosynthesis at the time of this work.</p>
<p>The authors also note a contrast with Zoph et al. (2020), who found that self-training outperforms pre-training in image recognition. They hypothesize that chemical compound strings may have more universal representations than images, making pre-training more effective in the chemistry domain.</p>
<h2 id="three-data-transfer-methods-for-retrosynthesis">Three Data Transfer Methods for Retrosynthesis</h2>
<p>The paper formalizes retrosynthesis as a seq-to-seq problem where both the product $x$ and reactant set $y$ are represented as <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings. A retrosynthesis model defines a likelihood $p_{\mathcal{M}}(y \mid x; \theta)$ optimized via maximum log-likelihood:</p>
<p>$$
\theta^{*} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}^{T}_{\text{Train}}} \log p(y_{i} \mid x_{i})
$$</p>
<p>Given a target dataset $\mathcal{D}^{T}$ and an augment dataset $\mathcal{D}^{A}$, three transfer methods are examined:</p>
<p><strong>Joint Training</strong> concatenates the training sets and optimizes over the union:</p>
<p>$$
\theta^{*}_{\text{joint}} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}_{\text{joint}}} \log p(y_{i} \mid x_{i}), \quad \mathcal{D}_{\text{joint}} = \mathcal{D}^{T}_{\text{Train}} \cup \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>This requires that both datasets share the same input/output domain (same SMILES canonicalization rules).</p>
<p><strong>Self-Training</strong> (pseudo labeling) first trains a base model on $\mathcal{D}^{T}$ alone, then uses this model to relabel the augment dataset products:</p>
<p>$$
\hat{y}_{i} = \arg\max_{y} \log p(y \mid x_{i}; \theta^{*}_{\text{single}}) \quad \text{for } x_{i} \in \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>The pseudo-labeled augment set is then combined with $\mathcal{D}^{T}_{\text{Train}}$ for joint training. This approach does not require consistent label domains between datasets.</p>
<p><strong>Pre-training plus Fine-tuning</strong> trains first on the augment dataset to obtain $\theta^{*}_{\text{pretrain}}$, then initializes fine-tuning from this checkpoint:</p>
<p>$$
\theta^{0}_{\text{finetune}} \leftarrow \theta^{*}_{\text{pretrain}}, \quad \theta^{\ell+1}_{\text{finetune}} \leftarrow \theta^{\ell}_{\text{finetune}} - \gamma^{\ell} \nabla \mathcal{L}(\mathcal{D}^{T}_{\text{Train}}) \big|_{{\theta^{\ell}_{\text{finetune}}}}
$$</p>
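<p>Stripped of model details, the three strategies differ only in what data each training run sees and where its parameters start. A minimal sketch, with <code>train</code> and <code>predict</code> as hypothetical stand-ins for the actual Transformer training loop and beam-search decoding:</p>

```python
def joint_training(train, target_data, augment_data):
    # Concatenate the two training sets and run a single training job;
    # requires that both share the same SMILES canonicalization rules.
    return train(None, target_data + augment_data)

def self_training(train, predict, target_data, augment_inputs):
    # Train a base model on the target data alone, pseudo-label the
    # augment products with it, then retrain on the combined set.
    base = train(None, target_data)
    pseudo = [(x, predict(base, x)) for x in augment_inputs]
    return train(None, target_data + pseudo)

def pretrain_finetune(train, target_data, augment_data):
    # Train on the augment set first, then continue from that checkpoint
    # on the target set; the best-performing method in the paper.
    pretrained = train(None, augment_data)
    return train(pretrained, target_data)
```

Note how self-training never needs the augment dataset's labels, which is why it tolerates inconsistent label domains between datasets.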
<h2 id="experimental-setup-on-uspto-benchmarks">Experimental Setup on USPTO Benchmarks</h2>
<p>The experiments use a fixed Transformer architecture (3 self-attention layers, 500-dimensional latent vectors) implemented in OpenNMT-py, evaluated across all three transfer methods.</p>
<p><strong>Datasets:</strong></p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>40K/5K/5K (train/val/test)</td>
          <td>10 reaction classes, curated by Lowe (2012)</td>
      </tr>
      <tr>
          <td>Augment (main)</td>
          <td>USPTO-Full</td>
          <td>844K train (after cleansing)</td>
          <td>Curated by Lowe (2017)</td>
      </tr>
      <tr>
          <td>Augment (smaller)</td>
          <td>USPTO-MIT</td>
          <td>384K train (after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removed all augment dataset samples whose product SMILES appeared in any USPTO-50K subset, preventing data leakage. All datasets were re-canonicalized with a unified <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> version.</p>
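<p>The cleansing step itself reduces to a set-membership filter, assuming both sides were already canonicalized with the same RDKit version as described above (a hypothetical sketch, not the authors' script):</p>

```python
def cleanse_augment(augment, target_products):
    """Drop augment reactions whose canonical product SMILES appears in
    any target (USPTO-50K) subset, preventing data leakage into training.

    augment: list of (product_smiles, reactants_smiles) pairs
    target_products: iterable of canonical product SMILES from all splits
    """
    banned = set(target_products)
    return [(product, reactants) for product, reactants in augment
            if product not in banned]
```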
<p><strong>Evaluation</strong> uses n-best accuracy with k=50 beam search, computing accuracy at n=1, 3, 5, 10, 20, 50. Models are selected by best validation perplexity. All experiments report averages and standard deviations over 5 runs.</p>
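<p>The n-best accuracy metric follows directly from its definition (a hypothetical helper, not the authors' evaluation code):</p>

```python
def n_best_accuracy(beams, gold, ns=(1, 3, 5, 10, 20, 50)):
    """beams[i] is the ranked candidate list for example i (here, up to
    k=50 beam hypotheses); an example counts as correct at n if the gold
    reactant string appears among its first n candidates."""
    return {
        n: sum(1 for cands, y in zip(beams, gold) if y in cands[:n]) / len(gold)
        for n in ns
    }
```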
<p><strong>Optimization</strong> uses Adam with cyclic learning rate scheduling (warm-up) for all methods except fine-tuning, which uses a standard non-cyclic scheduler.</p>
<p><strong>Results comparing data transfer methods (USPTO-Full augment):</strong></p>
<table>
  <thead>
      <tr>
          <th>Training Method</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single model (No Transfer)</td>
          <td>35.3 +/- 1.4</td>
          <td>52.8 +/- 1.4</td>
          <td>58.9 +/- 1.3</td>
          <td>64.5 +/- 1.2</td>
          <td>68.8 +/- 1.2</td>
          <td>72.1 +/- 1.3</td>
      </tr>
      <tr>
          <td>Joint Training</td>
          <td>39.1 +/- 1.3</td>
          <td>63.4 +/- 0.9</td>
          <td>71.9 +/- 0.5</td>
          <td>80.1 +/- 0.2</td>
          <td>85.4 +/- 0.3</td>
          <td>89.4 +/- 0.2</td>
      </tr>
      <tr>
          <td>Self-Training</td>
          <td>41.5 +/- 1.0</td>
          <td>60.4 +/- 0.7</td>
          <td>66.1 +/- 0.7</td>
          <td>71.8 +/- 0.6</td>
          <td>75.3 +/- 0.5</td>
          <td>78.0 +/- 0.3</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune</td>
          <td>57.4 +/- 0.4</td>
          <td>77.6 +/- 0.4</td>
          <td>83.1 +/- 0.2</td>
          <td>87.4 +/- 0.4</td>
          <td>89.6 +/- 0.3</td>
          <td>90.9 +/- 0.2</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with state-of-the-art models:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GLN (Dai et al., 2019)</td>
          <td>Logic Network</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
          <td>88.5</td>
          <td>92.4</td>
      </tr>
      <tr>
          <td>G2Gs (Shi et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>48.9</td>
          <td>67.6</td>
          <td>72.5</td>
          <td>75.5</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RetroXpert (Yan et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>65.6</td>
          <td>78.7</td>
          <td>80.8</td>
          <td>83.3</td>
          <td>84.6</td>
          <td>86.0</td>
      </tr>
      <tr>
          <td>GraphRetro (Somnath et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>63.8</td>
          <td>80.5</td>
          <td>84.1</td>
          <td>85.9</td>
          <td>N/A</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune (ours)</td>
          <td>Seq-to-Seq</td>
          <td>57.4</td>
          <td>77.6</td>
          <td>83.1</td>
          <td>87.4</td>
          <td>89.6</td>
          <td>90.9</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Primary findings:</strong></p>
<ol>
<li>All three data transfer methods improve over the no-transfer baseline across all n-best accuracy levels.</li>
<li>Pre-training plus fine-tuning provides the largest gains, improving top-1 accuracy by 22.1 absolute percentage points (from 35.3% to 57.4%) and achieving the best n=10 and n=20 accuracy among all compared models, including graph-based approaches.</li>
<li>Augment dataset size matters: using USPTO-Full (844K) yields substantially better results than USPTO-MIT (384K) for joint training and pre-training plus fine-tuning, though self-training gains are surprisingly robust to augment dataset size.</li>
<li>Manual inspection of erroneous predictions shows that over 99% of top-1 predictions from the pre-trained/fine-tuned model are chemically appropriate or sensible, even when they do not exactly match the gold-standard reactants.</li>
<li>Pre-training plus fine-tuning shows a distinct advantage in training dynamics: the 1-best and n-best accuracy curves evolve similarly during fine-tuning, unlike the single model where these curves can diverge significantly. This makes early stopping more reliable.</li>
</ol>
<p><strong>Class-wise improvements</strong> are observed across all 10 reaction classes, with the largest gains in heterocycle formation (0.40 to 0.86 at 50-best) and functional group interconversion (0.57 to 0.90).</p>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The model struggles with compounds containing multiple similar substituents (e.g., long-chain hydrocarbons), occasionally selecting the wrong one.</li>
<li>Some reactions involving rare chemical groups (<a href="https://en.wikipedia.org/wiki/Polycyclic_aromatic_hydrocarbon">polycyclic aromatic hydrocarbons</a>) still produce invalid SMILES, suggesting the augment dataset lacks sufficient examples of these structures.</li>
<li>Top-1 accuracy (57.4%) lags behind the best graph-based models (RetroXpert at 65.6%), though the gap narrows at higher n values.</li>
<li>The study uses a fixed Transformer architecture without architecture-specific optimization for each transfer method.</li>
</ul>
<p><strong>Future directions</strong> proposed include freezing parts of the network during fine-tuning, applying data transfer to graph-to-graph models, and testing transferability to other retrosynthesis datasets beyond USPTO-50K.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>50K reactions</td>
          <td>Curated by Lowe (2012), 10 reaction classes</td>
      </tr>
      <tr>
          <td>Augment</td>
          <td>USPTO-Full</td>
          <td>877K reactions (844K after cleansing)</td>
          <td>Curated by Lowe (2017), available via Figshare</td>
      </tr>
      <tr>
          <td>Augment (alt)</td>
          <td>USPTO-MIT</td>
          <td>479K reactions (384K after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removes augment samples whose products appear in any USPTO-50K subset. Unified RDKit canonicalization applied to all datasets.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq-to-seq model (3 self-attention layers, 500-dim latent vectors)</li>
<li>Positional encoding enabled</li>
<li>Maximum sequence length: 200 tokens</li>
<li>Adam optimizer</li>
<li>Cyclic learning rate scheduler with warm-up (all methods except fine-tuning)</li>
<li>Non-cyclic scheduler for fine-tuning phase (Klein et al., 2017)</li>
<li>Beam search with k=50 for inference</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Implementation: OpenNMT-py</li>
<li>No pre-trained weights or model checkpoints released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>57.4%</td>
          <td>35.3% (no transfer)</td>
          <td>Pre-train + fine-tune, USPTO-Full augment</td>
      </tr>
      <tr>
          <td>Top-10 accuracy</td>
          <td>87.4%</td>
          <td>64.5% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-20 accuracy</td>
          <td>89.6%</td>
          <td>68.8% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-50 accuracy</td>
          <td>90.9%</td>
          <td>72.1% (no transfer)</td>
          <td>Competitive with GLN (92.4%)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the paper. The authors mention GPU memory constraints motivating the 200-token sequence length limit.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ishiguro, K., Ujihara, K., Sawada, R., Akita, H., &amp; Kotera, M. (2020). Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis. <em>arXiv preprint arXiv:2010.00792</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ishiguro2020data,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ishiguro, Katsuhiko and Ujihara, Kazuya and Sawada, Ryohto and Akita, Hirotaka and Kotera, Masaaki}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2010.00792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Tied Two-Way Transformers for Diverse Retrosynthesis</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/</guid><description>Tied two-way transformers with cycle consistency and multinomial latent variables improve retrosynthetic prediction validity, plausibility, and diversity.</description><content:encoded><![CDATA[<h2 id="bridging-forward-and-backward-reaction-prediction">Bridging Forward and Backward Reaction Prediction</h2>
<p>This is a <strong>Method</strong> paper that addresses three key limitations of template-free <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> models: invalid <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> outputs, chemically implausible predictions, and lack of diversity in reactant candidates. The solution combines three techniques: (1) cycle consistency checks using a paired forward reaction transformer, (2) parameter tying between the forward and backward transformers, and (3) multinomial latent variables with a learned prior to capture multiple reaction pathways.</p>
<h2 id="three-problems-in-template-free-retrosynthesis">Three Problems in Template-Free Retrosynthesis</h2>
<p>Template-free retrosynthesis models cast retrosynthesis as a <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">sequence-to-sequence</a> translation problem (product SMILES to reactant SMILES). While these models avoid the cost of hand-coded reaction templates, they suffer from:</p>
<ol>
<li><strong>Invalid SMILES</strong>: predicted reactant strings that contain grammatical errors and cannot be parsed into molecules</li>
<li><strong>Implausibility</strong>: predicted reactants that are valid molecules but cannot actually synthesize the target product</li>
<li><strong>Lack of diversity</strong>: beam search produces duplicate or near-duplicate candidates, reducing the number of useful suggestions</li>
</ol>
<p>Prior work addressed these individually (SCROP adds a syntax corrector for validity, Chen et al. use latent variables for diversity), but this paper tackles all three simultaneously.</p>
<h2 id="model-architecture">Model Architecture</h2>
<h3 id="tied-two-way-transformers">Tied Two-Way Transformers</h3>
<p>The model pairs a retrosynthesis transformer $p(y|z, x)$ (product to reactants) with a forward reaction transformer $p(\tilde{x}|z, y)$ (reactants to product). Both use the standard encoder-decoder transformer architecture with 6 layers, 8 attention heads, and 256-dimensional embeddings.</p>
<p>The key architectural innovation is aggressive parameter tying: the two transformers share the entire encoder and all decoder parameters except layer normalization. This means the two-transformer system has approximately the same parameter count as a single transformer (17.5M vs. 17.4M). The shared parameters force the model to learn bidirectional reaction patterns from both forward and backward training data simultaneously, improving grammar learning and reducing invalid outputs.</p>
<h3 id="multinomial-latent-variables">Multinomial Latent Variables</h3>
<p>A discrete latent variable $z \in \{1, \ldots, K\}$ is introduced to capture multiple reaction modes. Each latent value conditions a different decoding path, encouraging diverse reactant predictions. The decoder initializes with a latent-class-specific start token (e.g., &ldquo;&lt;CLS2&gt;&rdquo;) and then decodes autoregressively.</p>
<p>The prior $p(z|x)$ is a learned multinomial distribution parametrized by a two-layer feed-forward network with tanh activation, taking the mean-pooled encoder output as input. This learned prior outperforms the uniform prior used by Chen et al., producing a smaller trade-off between top-1 and top-10 accuracy as $K$ increases.</p>
<h3 id="training-with-hard-em">Training with Hard EM</h3>
<p>Since the latent variable $z$ is unobserved during training, the model is trained with the online <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">hard-EM algorithm</a>. The loss function is:</p>
<p>$$\mathcal{L}(\theta) = \mathbb{E}_{(x,y) \sim \text{data}} \left[ \min_{z} \mathcal{L}_h(x, y, z; \theta) \right]$$</p>
<p>where $\mathcal{L}_h = -(\log p(z|x) + \log p(y|z,x) + \log p(\tilde{x}=x|z,y))$. The E-step selects the best $z$ for each training pair (with dropout disabled), and the M-step updates parameters given the complete data.</p>
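<p>The E-step reduces to an argmin over the $K$ latent classes. A minimal sketch, with <code>loss_fn</code> as a hypothetical stand-in for the complete-data loss $\mathcal{L}_h$ evaluated by the networks (dropout disabled, per the paper):</p>

```python
def hard_em_step(x, y, K, loss_fn):
    """One hard-EM step: the E-step selects the latent class z with the
    lowest complete-data loss; the M-step would then backpropagate
    loss_fn(x, y, z_star) to update the parameters."""
    z_star = min(range(1, K + 1), key=lambda z: loss_fn(x, y, z))
    return z_star, loss_fn(x, y, z_star)
```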
<h3 id="inference-with-cycle-consistency-reranking">Inference with Cycle Consistency Reranking</h3>
<p>At inference, the model: (1) generates $K$ sets of beam search hypotheses from the retrosynthesis transformer (one per latent value), (2) scores each candidate with the forward reaction transformer for cycle consistency $p(\tilde{x}=x|z,y)$, and (3) reranks candidates by the full likelihood $p(z|x) \cdot p(y|z,x) \cdot p(\tilde{x}=x|z,y)$. This pushes chemically plausible predictions to higher ranks.</p>
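<p>The reranking step can be sketched with the three log-probability terms as hypothetical callables standing in for the trained networks:</p>

```python
def rerank(candidates, log_prior, log_retro, log_forward):
    """Rerank (z, y) candidates by the full joint log-likelihood
    log p(z|x) + log p(y|z,x) + log p(x|z,y), so that candidates the
    forward model judges cycle-consistent rise to the top."""
    scored = [
        (log_prior(z) + log_retro(y, z) + log_forward(y, z), z, y)
        for z, y in candidates
    ]
    scored.sort(reverse=True)  # highest joint log-likelihood first
    return [(z, y) for _, z, y in scored]
```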
<h2 id="results-on-uspto-50k">Results on USPTO-50K</h2>
<p>All results are averaged over 5 random seeds with beam size 10.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-5 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Top-1 Invalid</th>
          <th>Top-10 Invalid</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Liu-LSTM</td>
          <td>37.4%</td>
          <td>57.0%</td>
          <td>61.7%</td>
          <td>12.2%</td>
          <td>22.0%</td>
      </tr>
      <tr>
          <td>SCROP</td>
          <td>43.7%</td>
          <td>65.2%</td>
          <td>68.7%</td>
          <td>0.7%</td>
          <td>2.3%</td>
      </tr>
      <tr>
          <td>Lin-TF</td>
          <td>42.0%</td>
          <td>71.3%</td>
          <td>77.6%</td>
          <td>2.2%</td>
          <td>7.8%</td>
      </tr>
      <tr>
          <td>Base transformer</td>
          <td>44.3%</td>
          <td>68.4%</td>
          <td>72.7%</td>
          <td>1.7%</td>
          <td>12.1%</td>
      </tr>
      <tr>
          <td>Proposed ($K$=5)</td>
          <td>46.8%</td>
          <td>73.5%</td>
          <td>78.5%</td>
          <td>0.1%</td>
          <td>2.6%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model achieves a 3.1 percentage-point top-1 accuracy improvement over the best previous template-free method and reduces the top-1 invalid rate to 0.1%.</p>
<h3 id="ablation-analysis">Ablation Analysis</h3>
<p>The ablation study isolates the contribution of each component:</p>
<ul>
<li><strong>Base+CC</strong> (cycle consistency only): reranks candidates to improve top-1/3/5 accuracy and validity, but top-10 stays the same since the candidate set is unchanged. Parameter count doubles (34.8M).</li>
<li><strong>Base+PT</strong> (parameter tying only): improves accuracy and validity at all top-$k$ levels with negligible parameter increase. Parameter tying during training improves the retrosynthesis transformer itself, even without cycle consistency at inference.</li>
<li><strong>Proposed ($K$=1)</strong>: combines tying with cycle consistency reranking.</li>
<li><strong>Proposed ($K$=5)</strong>: adds latent diversity, further improving top-10 accuracy (+2.2%) and reducing top-10 invalid rate (from 10.2% to 2.6%).</li>
</ul>
<h3 id="diversity-unique-rate">Diversity: Unique Rate</h3>
<p>As $K$ increases from 1 to 5, the unique molecule rate among 10 predictions rises substantially, confirming that latent modeling produces more diverse candidates. The learned prior reduces the top-1/top-10 accuracy trade-off compared to Chen et al.&rsquo;s uniform prior.</p>
<h2 id="results-on-in-house-multi-pathway-dataset">Results on In-House Multi-Pathway Dataset</h2>
<p>The in-house dataset (162K reactions from <a href="https://en.wikipedia.org/wiki/Reaxys">Reaxys</a>) contains multiple ground-truth reactions per product, enabling direct evaluation of pathway diversity through coverage (proportion of ground-truth pathways correctly predicted in the top-10 candidates).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Unique Rate</th>
          <th>Coverage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>64.2%</td>
          <td>91.6%</td>
          <td>76.1%</td>
          <td>84.4%</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>66.0%</td>
          <td>92.8%</td>
          <td>93.2%</td>
          <td>87.3%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model covers 87.3% of ground-truth reaction pathways on average, compared to 84.4% for the baseline. The unique rate jumps from 76.1% to 93.2%, confirming that the latent variables effectively encourage diverse predictions.</p>
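<p>Both diversity metrics reduce to simple set operations over canonicalized prediction strings (hypothetical helpers matching the metric definitions, not the authors' code):</p>

```python
def unique_rate(predictions):
    """Fraction of distinct (canonical) reactant sets among the top-k
    candidates for one product."""
    return len(set(predictions)) / len(predictions)

def coverage(predictions, ground_truths):
    """Proportion of a product's ground-truth pathways recovered in the
    top-k candidate list (the in-house dataset has several per product)."""
    hits = sum(1 for g in ground_truths if g in set(predictions))
    return hits / len(ground_truths)
```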
<h2 id="limitations">Limitations</h2>
<p>The model operates on SMILES strings, which linearize molecules and discard the explicit graph structure that graph-based models exploit. Graph-based retrosynthesis models (e.g., GraphRetro at 63.8% top-1) substantially outperform template-free string-based models. The USPTO-50K dataset provides only one ground-truth pathway per product, making diversity evaluation limited on this benchmark. The in-house dataset is not publicly available. The model also does not predict reaction conditions (solvents, catalysts, temperature) or reagents.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ejklike/tied-twoway-transformer">ejklike/tied-twoway-transformer</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: USPTO-50K dataset (public, 50K reactions from USPTO patents). In-house dataset (162K reactions from Reaxys, not publicly available).</p>
<p><strong>Hardware</strong>: 4 NVIDIA Tesla M40 GPUs. Checkpoints saved every 5000 steps, last 5 averaged.</p>
<p><strong>Training</strong>: Adam optimizer ($\beta$ = 0.9, 0.98), initial learning rate 2 with 8000 warm-up steps, dropout 0.3, gradient accumulation over 4 batches. Label smoothing set to 0.</p>
<p><strong>Inference</strong>: Beam size 10, generating 10 candidates per product.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, E., Lee, D., Kwon, Y., Park, M. S., &amp; Choi, Y.-S. (2021). Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables. <em>Journal of Chemical Information and Modeling</em>, 61, 123-133.</p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ejklike/tied-twoway-transformer">GitHub: ejklike/tied-twoway-transformer</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kim2021valid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kim, Eunji and Lee, Dongseon and Kwon, Youngchun and Park, Min Sik and Choi, Youn-Suk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{123--133}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01074}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Transformer: Calibrated Reaction Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/</guid><description>A Transformer seq2seq model for chemical reaction prediction achieving 90.4% top-1 accuracy on USPTO_MIT with calibrated uncertainty estimation.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Method</strong> paper. It adapts the Transformer architecture to chemical reaction prediction, treating it as a machine translation problem from reactant <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> to product SMILES. The key contributions are (1) demonstrating that a fully attention-based model outperforms all prior template-based, graph-based, and RNN-based methods, (2) showing the model works without separating reactants from reagents, and (3) introducing calibrated uncertainty estimation for ranking synthesis pathways.</p>
<h2 id="motivation-limitations-of-existing-reaction-prediction">Motivation: Limitations of Existing Reaction Prediction</h2>
<p>Prior approaches to reaction prediction fell into two broad groups, template-based and template-free (the latter split between graph-based and sequence-based models), each with fundamental limitations:</p>
<ul>
<li><strong>Template-based methods</strong> rely on libraries of reaction rules, either handcrafted or automatically extracted from atom-mapped data. Automatic template extraction itself depends on atom mapping, which depends on templates, creating a circular dependency.</li>
<li><strong>Graph-based template-free methods</strong> (e.g., WLDN, ELECTRO) avoid explicit templates but still require atom-mapped training data and cannot handle stereochemistry.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/nmt-organic-reaction-prediction/">RNN-based seq2seq models</a></strong> (also template-free) treat reactions as SMILES translation but impose a positional inductive bias: tokens far apart in the SMILES string are assumed to be less related. This assumption is wrong: adjacency in a SMILES string does not reflect proximity in the molecular graph or in 3D space.</li>
</ul>
<h2 id="core-innovation-transformer-for-reaction-prediction">Core Innovation: Transformer for Reaction Prediction</h2>
<p>The Molecular Transformer adapts the Transformer architecture to chemical reactions by treating SMILES strings of reactants and reagents as source sequences and product SMILES as target sequences.</p>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder Transformer with 4 layers, 256-dimensional hidden states, 8 attention heads, and 12M parameters (reduced from the original 65M NMT model).</li>
<li><strong>Tokenization</strong>: Atom-wise regex tokenization of SMILES strings, applied uniformly to both reactants and reagents (no special reagent tokens).</li>
<li><strong>Data augmentation</strong>: Training data is doubled by generating <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">random (non-canonical) SMILES</a> for each reaction, which improves top-1 accuracy by roughly 1%.</li>
<li><strong>Weight averaging</strong>: Final model weights are averaged over the last 20 checkpoints, providing a further accuracy boost without the inference cost of ensembling.</li>
<li><strong>Mixed input</strong>: Unlike all prior work that separates reactants from reagents (which implicitly assumes knowledge of the product), the Molecular Transformer operates on mixed inputs where no distinction is made.</li>
</ul>
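<p>The atom-wise tokenization step can be reproduced with the regex published by Schwaller et al.; the snippet below is a minimal sketch of that tokenizer:</p>

```python
import re

# Atom-wise SMILES tokenization regex from Schwaller et al.; each atom,
# bond, ring-closure digit, and branch symbol becomes one token.
SMI_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:"
    r"|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; the split is lossless."""
    tokens = SMI_REGEX.findall(smiles)
    assert "".join(tokens) == smiles, "tokenization must be lossless"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1"))  # phenyl acetate, tokenized atom-wise
```

Note that two-letter elements (`Br`, `Cl`) are matched before single letters, so `BrCCl` splits into three atom tokens rather than five characters.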
<p>The multihead attention mechanism is the key architectural advantage over RNNs. It allows the model to attend to any pair of tokens regardless of their position in the SMILES string, correctly capturing long-range chemical relationships that RNNs miss.</p>
<h2 id="uncertainty-estimation">Uncertainty Estimation</h2>
<p>A central contribution is calibrated uncertainty scoring. The product of predicted token probabilities serves as a confidence score for each prediction. This score achieves 0.89 AUC-ROC for classifying whether a prediction is correct.</p>
<p>An important finding: <strong>label smoothing hurts uncertainty calibration</strong>. While label smoothing (as used in the original Transformer) marginally improves top-1 accuracy (87.44% vs 87.28%), it destroys the model&rsquo;s ability to distinguish correct from incorrect predictions. Setting the label smoothing parameter to 0.0 preserves calibration.</p>
<p>The confidence score shows no correlation with SMILES length (Pearson $r = 0.06$), confirming it is not biased against predictions of larger molecules.</p>
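<p>The confidence score itself is simple to compute from the decoder's per-token probabilities; a sketch (the log-probability values below are made up):</p>

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Product of predicted token probabilities, computed as the exp of
    summed log-probabilities for numerical stability."""
    return math.exp(sum(token_logprobs))

# Hypothetical per-token log-probs for one predicted product SMILES:
logps = [-0.01, -0.02, -0.05, -0.01]
print(sequence_confidence(logps))  # close to 1 => high-confidence prediction
```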
<h2 id="experimental-results">Experimental Results</h2>
<h3 id="forward-synthesis-prediction">Forward Synthesis Prediction</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Setting</th>
          <th style="text-align: left">Top-1 (%)</th>
          <th style="text-align: left">Top-2 (%)</th>
          <th style="text-align: left">Top-5 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">90.4</td>
          <td style="text-align: left">93.7</td>
          <td style="text-align: left">95.3</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">88.6</td>
          <td style="text-align: left">92.4</td>
          <td style="text-align: left">94.2</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">78.1</td>
          <td style="text-align: left">84.0</td>
          <td style="text-align: left">87.1</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">76.2</td>
          <td style="text-align: left">82.4</td>
          <td style="text-align: left">85.8</td>
      </tr>
  </tbody>
</table>
<p>The mixed-input model (88.6%) outperforms all prior methods that used separated inputs (best previous: WLDN5 at 85.6%).</p>
<h3 id="comparison-with-quantum-chemistry">Comparison with Quantum Chemistry</h3>
<p>On <a href="https://en.wikipedia.org/wiki/Regioselectivity">regioselectivity</a> of <a href="https://en.wikipedia.org/wiki/Electrophilic_aromatic_substitution">electrophilic aromatic substitution</a> in heteroaromatics, the Molecular Transformer achieves 83% top-1 accuracy vs 81% for RegioSQM (a quantum-chemistry-based predictor), at a fraction of the computational cost.</p>
<h3 id="comparison-with-human-chemists">Comparison with Human Chemists</h3>
<p>On 80 reactions sampled across rarity bins, the Molecular Transformer achieves 87.5% top-1 accuracy vs 76.5% for the best human chemist and 72.5% for the best graph-based model (WLDN5).</p>
<h3 id="chemically-constrained-beam-search">Chemically Constrained Beam Search</h3>
<p>Constraining beam search to only predict atoms present in the reactants (preventing &ldquo;alchemy&rdquo;) produces no change in accuracy, confirming the model has learned conservation of atoms from data alone.</p>
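<p>A &ldquo;no alchemy&rdquo; constraint of this kind can be approximated by filtering beam candidates whose atom tokens do not all appear in the reactants. The sketch below is illustrative, not the paper's exact implementation:</p>

```python
import re

# Matches atom tokens (bracket atoms, two-letter and one-letter elements,
# aromatic lowercase atoms); bonds, digits, and branches are ignored.
ATOM_RE = re.compile(r"\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p")

def atom_set(tokens: list[str]) -> set[str]:
    """Distinct atom tokens appearing in a token sequence."""
    return {t for t in tokens if ATOM_RE.fullmatch(t)}

def passes_atom_constraint(candidate: list[str], reactants: list[str]) -> bool:
    """Accept a candidate only if every atom it uses occurs in the reactants."""
    return atom_set(candidate) <= atom_set(reactants)

reactant_toks = ["C", "C", "(", "=", "O", ")", "O"]
print(passes_atom_constraint(["C", "C", "O"], reactant_toks))  # True
print(passes_atom_constraint(["C", "N"], reactant_toks))       # False
```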
<h2 id="trade-offs-and-limitations">Trade-offs and Limitations</h2>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/Stereochemistry">Stereochemistry</a></strong>: Accuracy drops significantly on USPTO_STEREO (76-78% vs 88-90% on USPTO_MIT), indicating stereochemical prediction remains challenging.</li>
<li><strong>Resolution reactions</strong>: Accuracy drops sharply on resolution reactions (28.6%), where reagent information is often missing from patent data.</li>
<li><strong>Unclassified reactions</strong>: Accuracy on &ldquo;unrecognized&rdquo; reaction classes is 46.3%, likely reflecting noisy or mistranscribed data.</li>
<li><strong>No atom mapping</strong>: The model provides no explicit atom mapping between reactants and products, which limits interpretability for understanding reaction mechanisms.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Primary benchmark</strong></td>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">479K</td>
          <td style="text-align: left">Filtered by Jin et al., no stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>LEF subset</strong></td>
          <td style="text-align: left">USPTO_LEF</td>
          <td style="text-align: left">350K</td>
          <td style="text-align: left">Subset of MIT with linear electron flow only</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stereo benchmark</strong></td>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">1.0M</td>
          <td style="text-align: left">Patent reactions through Sept 2016, includes stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Time-split test</strong></td>
          <td style="text-align: left">Pistachio_2017</td>
          <td style="text-align: left">15.4K</td>
          <td style="text-align: left">Non-public, reactions from 2017</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>: SMILES canonicalized with RDKit. Regex tokenization from Schwaller et al. (2018). Two input modes: &ldquo;separated&rdquo; (reactants &gt; reagents) and &ldquo;mixed&rdquo; (all molecules concatenated).</p>
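<p>The two input modes differ only in whether reagents are set off by a &gt; separator; an illustrative construction (the molecules are made up):</p>

```python
# Illustrative construction of the two input modes (example molecules):
reactants = ["CC(=O)Cl", "OCC"]    # acetyl chloride + ethanol
reagents = ["c1ccncc1"]            # pyridine as base

separated = ".".join(reactants) + ">" + ".".join(reagents)
mixed = ".".join(reactants + reagents)

print(separated)  # CC(=O)Cl.OCC>c1ccncc1
print(mixed)      # CC(=O)Cl.OCC.c1ccncc1
```

In the mixed mode the model must infer for itself which molecules contribute atoms to the product.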
<h3 id="model">Model</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">4</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model dimension</strong></td>
          <td style="text-align: left">256</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention heads</strong></td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~12M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Label smoothing</strong></td>
          <td style="text-align: left">0.0</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimizer</strong></td>
          <td style="text-align: left">Adam</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Warm-up steps</strong></td>
          <td style="text-align: left">8000</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Batch size</strong></td>
          <td style="text-align: left">~4096 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Beam width</strong></td>
          <td style="text-align: left">5</td>
      </tr>
  </tbody>
</table>
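<p>The Adam-plus-warm-up configuration in the table matches the standard Transformer (Noam) learning-rate schedule: linear warm-up for 8000 steps, then inverse-square-root decay. A sketch assuming that schedule with the reported model dimension of 256 (the scale factor of 2.0 is an assumption, not a value from the paper):</p>

```python
def noam_lr(step: int, d_model: int = 256, warmup: int = 8000,
            factor: float = 2.0) -> float:
    """Noam schedule: linear warm-up, then inverse-square-root decay.
    The factor of 2.0 is an assumption, not a value from the paper."""
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The learning rate peaks at the end of warm-up, then decays:
print(noam_lr(8000) > noam_lr(1000))   # True
print(noam_lr(8000) > noam_lr(64000))  # True
```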
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (sep)</td>
          <td style="text-align: left"><strong>90.4%</strong></td>
          <td style="text-align: left">85.6% (WLDN5)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (mixed)</td>
          <td style="text-align: left"><strong>88.6%</strong></td>
          <td style="text-align: left">80.3% (S2S RNN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>AUC-ROC</strong></td>
          <td style="text-align: left">Uncertainty calibration</td>
          <td style="text-align: left"><strong>0.89</strong></td>
          <td style="text-align: left">N/A</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Regioselectivity</td>
          <td style="text-align: left"><strong>83%</strong></td>
          <td style="text-align: left">81% (RegioSQM)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Human comparison</td>
          <td style="text-align: left"><strong>87.5%</strong></td>
          <td style="text-align: left">76.5% (best human)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Single Nvidia P100 GPU, 48h for best single model</li>
<li>Inference: 20 min for 40K reactions on single P100</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schwaller, P., Laino, T., Gaudin, T., Bolgar, P., Hunter, C. A., Bekas, C., &amp; Lee, A. A. (2019). Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. <em>ACS Central Science</em>, 5(9), 1572-1583. <a href="https://doi.org/10.1021/acscentsci.9b00576">https://doi.org/10.1021/acscentsci.9b00576</a></p>
<p><strong>Publication</strong>: ACS Central Science 2019</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schwallerMolecularTransformerModel2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Schwaller, Philippe and Laino, Teodoro and Gaudin, Th{\&#39;e}ophile and Bolgar, Peter and Hunter, Christopher A. and Bekas, Costas and Lee, Alpha A.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2019</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1572--1583}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acscentsci.9b00576}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>