A Two-Stage Pre-trained Transformer for Chemical Reactions

ReactionT5 is a methods paper that proposes a T5-based pre-trained model for chemical reaction tasks, specifically product prediction and yield prediction. The primary contribution is a two-stage pretraining pipeline: first on a compound library (ZINC, 23M molecules) to learn molecular representations, then on a large-scale reaction database (the Open Reaction Database, 1.5M reactions) to learn reaction-level patterns. The key result is that the pre-trained model can be fine-tuned with very limited target-domain data (as few as 30 reactions) and still achieve performance competitive with models trained on full datasets.

Bridging the Gap Between Single-Molecule and Multi-Molecule Pretraining

While transformer-based models pre-trained on compound libraries (e.g., SMILES-BERT, MolGPT) have seen substantial development, most focus on single-molecule inputs and outputs. Pretraining for multi-molecule contexts, such as chemical reactions involving reactants, reagents, catalysts, and products, remains underexplored. T5Chem supports multi-task reaction prediction but focuses on building a single multi-task model rather than investigating the effectiveness of pre-trained models for fine-tuning on limited in-house data.

The authors identify two key gaps:

  1. Most pre-trained chemical models do not account for reaction-level interactions between multiple molecules.
  2. In practical settings, target-domain reaction data is often scarce, making transfer learning from large public datasets essential.

Two-Stage Pretraining with Compound Restoration

The core innovation is a two-stage pretraining procedure built on the T5 (text-to-text transfer transformer) architecture:

Stage 1: Compound Pretraining (CompoundT5). An initialized T5 model is trained on 23M SMILES from the ZINC database using span-masked language modeling. The model learns to predict masked subsequences of SMILES tokens. A SentencePiece unigram tokenizer is trained on this compound library, allowing more compact representations than character-level or atom-level tokenizers. After this stage, new tokens are added to the tokenizer to cover metal atoms and other characters present in the reaction database but absent from ZINC.
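The span-masked objective can be sketched in a few lines. This is an illustrative simplification of T5-style span corruption (fixed span length, sentinel naming as in T5's `<extra_id_N>` tokens), not the paper's exact implementation:

```python
import random

def span_corrupt(tokens, mask_rate=0.15, span_len=3, rng=None):
    """T5-style span corruption: hide contiguous spans behind sentinel tokens.

    Returns (corrupted_input, target); the target lists each sentinel followed
    by the tokens it replaced, as in the original T5 objective.
    """
    rng = rng or random.Random(0)
    n_spans = max(1, round(len(tokens) * mask_rate / span_len))

    # Greedily pick non-overlapping span starts in random order.
    starts = []
    candidates = list(range(len(tokens) - span_len + 1))
    rng.shuffle(candidates)
    for c in candidates:
        if all(abs(c - s) >= span_len for s in starts):
            starts.append(c)
        if len(starts) == n_spans:
            break
    starts.sort()

    corrupted, target = [], []
    i = 0
    for sentinel_id, s in enumerate(starts):
        sentinel = f"<extra_id_{sentinel_id}>"
        corrupted.extend(tokens[i:s])   # keep the unmasked prefix
        corrupted.append(sentinel)      # replace the span with one sentinel
        target.append(sentinel)
        target.extend(tokens[s:s + span_len])
        i = s + span_len
    corrupted.extend(tokens[i:])
    return corrupted, target
```

The model sees `corrupted` as input and must generate `target`, i.e., fill in each masked span after its sentinel. In the paper the same idea operates on SentencePiece subword tokens rather than characters.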

Stage 2: Reaction Pretraining (ReactionT5). CompoundT5 is further pretrained on 1.5M reactions from the Open Reaction Database (ORD) on both product prediction and yield prediction tasks. Reactions are formulated as text-to-text tasks using special tokens:

  • REACTANT:, REAGENT:, and PRODUCT: tokens delimit the role of each molecule in the reaction string.
  • For product prediction, the model takes reactants and reagents as input and generates product SMILES.
  • For yield prediction, the model takes the full reaction (including products) and outputs a numerical yield value.
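A rough sketch of how such role-tagged reaction strings can be assembled. The `REACTANT:`, `REAGENT:`, and `PRODUCT:` tokens follow the paper; joining molecules with `.` mirrors SMILES convention, but the exact string layout and the function name are illustrative assumptions:

```python
def format_reaction(reactants, reagents, products, task):
    """Assemble a role-tagged reaction string for a text-to-text model.

    Returns (input_text, target). For yield prediction the target is None
    here because the yield comes from a regression head, not generation.
    """
    reac = "REACTANT:" + ".".join(reactants)
    reag = "REAGENT:" + ".".join(reagents)
    prod = "PRODUCT:" + ".".join(products)
    if task == "product":   # input: reactants + reagents -> generate product SMILES
        return reac + reag, prod
    if task == "yield":     # input: full reaction -> scalar yield
        return reac + reag + prod, None
    raise ValueError(f"unknown task: {task}")
```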

Compound Restoration. A notable methodological detail is the handling of uncategorized compounds in the ORD. About 31.8% of ORD reactions contain compounds with unknown roles. Simply discarding these reactions introduces severe product bias (only 447 unique products remain vs. 439,898 with uncategorized data included). The authors develop RestorationT5, a binary classifier built from CompoundT5, that assigns uncategorized compounds to either reactant or reagent roles. This classifier uses a sigmoid output layer and achieves an F1 score of 0.1564 at a threshold of 0.97, outperforming a random forest baseline (F1 = 0.1136). The restored dataset (“ORD(restored)”) is then used for reaction pretraining.
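An operating threshold like 0.97 is typically found by sweeping the sigmoid output and scoring F1 at each cut. A minimal sketch of that selection procedure (function name and data are assumptions, not from the paper):

```python
import numpy as np

def best_f1_threshold(probs, labels, thresholds):
    """Sweep decision thresholds on sigmoid outputs; return (best_f1, threshold)."""
    best = (0.0, None)
    for t in thresholds:
        pred = probs >= t
        tp = np.sum(pred & labels)
        prec = tp / max(pred.sum(), 1)
        rec = tp / max(labels.sum(), 1)
        f1 = 2 * prec * rec / max(prec + rec, 1e-12)
        if f1 > best[0]:
            best = (f1, t)
    return best
```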

For yield prediction, the loss function is mean squared error:

$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

where $y_i$ is the true yield (normalized to [0, 1]) and $\hat{y}_i$ is the predicted yield.
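With yields normalized to [0, 1], the objective reduces to a one-line mean over squared residuals; a minimal sketch:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error over normalized yields in [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```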

Experimental Setup: Product and Yield Prediction Benchmarks

Product Prediction

The USPTO dataset (479K reactions) is used for evaluation, with standard train/val/test splits (409K/30K/40K). Reactions overlapping with the ORD (18%) are removed during evaluation. Beam search with beam size 10 is used for decoding, and minimum/maximum output length constraints are set based on the training data distribution. Top-k accuracy (k = 1, 2, 3, 5) and invalidity rate are reported.
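The two reported metrics are straightforward to compute from the beam outputs. A sketch, assuming the validity check is pluggable (in practice it would be SMILES parsing, e.g. with RDKit):

```python
def top_k_accuracy(beam_lists, references, k):
    """Fraction of examples whose reference product appears among the top-k beams."""
    hits = sum(ref in beams[:k] for beams, ref in zip(beam_lists, references))
    return hits / len(references)

def invalidity_rate(top1_outputs, is_valid):
    """Fraction of top-1 outputs that fail a validity check (e.g. SMILES parsing)."""
    return sum(not is_valid(s) for s in top1_outputs) / len(top1_outputs)
```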

Baselines include Seq-to-seq, WLDN (graph neural network), Molecular Transformer, and T5Chem.

| Model | Train | Top-1 (%) | Top-2 (%) | Top-3 (%) | Top-5 (%) | Invalidity (%) |
|---|---|---|---|---|---|---|
| Seq-to-seq | USPTO | 80.3 | 84.7 | 86.2 | 87.5 | - |
| WLDN | USPTO | 85.6 | 90.5 | 92.8 | 93.4 | - |
| Molecular Transformer | USPTO | 88.8 | 92.6 | - | 94.4 | - |
| T5Chem | USPTO | 90.4 | 94.2 | - | 96.4 | - |
| CompoundT5 | USPTO | 88.0 | 92.4 | 93.9 | 95.0 | 7.5 |
| ReactionT5 (restored ORD) | USPTO (200) | 85.5 | 91.7 | 93.5 | 94.9 | 12.0 |

A critical finding: ReactionT5 pre-trained on ORD achieves 0% accuracy on USPTO without fine-tuning due to domain mismatch (ORD includes byproducts; USPTO lists only the main product). Fine-tuning the restored-ORD model on just 200 USPTO reactions recovers competitive results.

The few-shot fine-tuning analysis shows rapid performance scaling:

| Samples | Top-1 (%) | Top-2 (%) | Top-3 (%) | Top-5 (%) | Invalidity (%) |
|---|---|---|---|---|---|
| 10 | 9.0 | 12.5 | 15.3 | 19.1 | 12.4 |
| 30 | 80.5 | 87.3 | 89.8 | 92.0 | 17.2 |
| 50 | 83.7 | 89.9 | 92.2 | 94.0 | 14.8 |
| 100 | 85.1 | 91.0 | 92.8 | 94.4 | 14.0 |
| 200 | 85.5 | 91.7 | 93.5 | 94.9 | 12.0 |

Yield Prediction

The Buchwald-Hartwig C-N cross-coupling dataset (3,955 reactions) is used with random 7:3 splits (repeated 10 times) plus four out-of-sample test sets (Tests 1-4) designed so that similar reactions do not appear in both train and test.

| Model | Random 7:3 | Test 1 | Test 2 | Test 3 | Test 4 | Avg. Tests 1-4 |
|---|---|---|---|---|---|---|
| DFT | 0.92 | 0.80 | 0.77 | 0.64 | 0.54 | 0.69 |
| MFF | 0.927 | 0.851 | 0.713 | 0.635 | 0.184 | 0.596 |
| Yield-BERT | 0.951 | 0.838 | 0.836 | 0.738 | 0.538 | 0.738 |
| T5Chem | 0.970 | 0.811 | 0.907 | 0.789 | 0.627 | 0.785 |
| CompoundT5 | 0.971 | 0.855 | 0.852 | 0.712 | 0.547 | 0.741 |
| ReactionT5 | 0.966 | 0.914 | 0.940 | 0.819 | 0.896 | 0.892 |
| ReactionT5 (zero-shot) | 0.904 | 0.919 | 0.927 | 0.847 | 0.909 | 0.900 |

ReactionT5 achieves the highest average $R^2$ across Tests 1-4 (0.892), with the zero-shot variant performing even better (0.900). The improvement is most dramatic on Test 4, the hardest split, where ReactionT5 achieves $R^2 = 0.896$ versus T5Chem’s 0.627 and Yield-BERT’s 0.538.

In a low-data regime (30% train / 70% test), ReactionT5 ($R^2 = 0.927$) substantially outperforms a random forest baseline ($R^2 = 0.853$), and even zero-shot ReactionT5 ($R^2 = 0.898$) exceeds the random forest.
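The $R^2$ values above follow the standard coefficient-of-determination definition; a minimal sketch:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Note that a model no better than predicting the mean yield scores 0, and negative values are possible on hard out-of-sample splits.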

Key Findings and Limitations

Key Findings

  1. Two-stage pretraining is effective: Compound pretraining followed by reaction pretraining produces models with strong generalization, particularly on out-of-distribution test sets.
  2. Few-shot transfer works: With as few as 30 fine-tuning reactions, ReactionT5 achieves over 80% Top-1 accuracy on product prediction, competitive with models trained on the full USPTO dataset.
  3. Compound restoration matters: Restoring uncategorized compounds in the ORD is essential for product prediction. Without restoration, fine-tuning on 200 USPTO reactions yields 0% accuracy; with restoration, the same fine-tuning yields 85.5% Top-1.
  4. Zero-shot yield prediction is surprisingly effective: ReactionT5 achieves $R^2 = 0.900$ on the out-of-sample yield tests without any task-specific fine-tuning, outperforming all fine-tuned baselines.

Limitations

  • Product prediction shows a high invalidity rate (12.0% for the best ReactionT5 variant) compared to CompoundT5 (7.5%), suggesting the reaction pretraining may introduce some noise.
  • The 0% accuracy without fine-tuning on product prediction reveals a significant domain gap between ORD and USPTO annotation conventions (byproducts vs. main products).
  • The RestorationT5 classifier has low precision (0.0878) despite high recall (0.7212), meaning many compounds are incorrectly assigned roles. The paper does not investigate how this impacts downstream performance.
  • The paper does not report training times, computational costs, or model sizes, making resource requirements unclear.
  • Only two downstream tasks (product prediction on USPTO, yield prediction on Buchwald-Hartwig) are evaluated.

Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Compound pretraining | ZINC | 22,992,522 compounds | SMILES canonicalized with RDKit |
| Reaction pretraining | ORD (restored) | 1,505,916 reactions | Atom mapping removed, compounds canonicalized |
| Product prediction eval | USPTO | 479,035 reactions | 409K/30K/40K train/val/test split |
| Yield prediction eval | Buchwald-Hartwig C-N | 3,955 reactions | Random 7:3 split (10 repeats) + 4 OOS tests |

Algorithms

  • Base architecture: T5 (text-to-text transfer transformer)
  • Tokenizer: SentencePiece unigram, trained on ZINC, extended with special reaction tokens
  • Compound pretraining: Span-masked language modeling (15% masking rate, average span length 3)
  • Beam search: size 10 for product prediction
  • Output length constraints: min/max from training data distribution
  • Yield normalization: clipped to [0, 100], then scaled to [0, 1]
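The yield normalization step in the list above amounts to a clip-then-scale; a minimal sketch:

```python
def normalize_yield(raw_percent):
    """Clip a raw percent yield to [0, 100], then scale to [0, 1]."""
    return min(max(raw_percent, 0.0), 100.0) / 100.0
```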

Models

  • CompoundT5: T5 pretrained on ZINC
  • RestorationT5: CompoundT5 fine-tuned for binary classification (reactant vs. reagent)
  • ReactionT5: CompoundT5 pretrained on ORD for product and yield prediction
  • Pre-trained weights available on Hugging Face

Evaluation

| Metric | Task | Best Value | Notes |
|---|---|---|---|
| Top-1 accuracy | Product prediction | 85.5% | ReactionT5 with 200 fine-tuning reactions |
| Top-5 accuracy | Product prediction | 94.9% | ReactionT5 with 200 fine-tuning reactions |
| $R^2$ | Yield prediction (random) | 0.966 | ReactionT5 fine-tuned |
| $R^2$ | Yield prediction (OOS avg.) | 0.900 | ReactionT5 zero-shot |

Hardware

Not specified in the paper. Training times and GPU requirements are not reported.

Artifacts

| Artifact | Type | License | Notes |
|---|---|---|---|
| ReactionT5v2 (GitHub) | Code | MIT | Official implementation |
| ReactionT5 models (Hugging Face) | Model | MIT | Pre-trained weights |

Paper Information

Citation: Sagawa, T. & Kojima, R. (2023). ReactionT5: a large-scale pre-trained model towards application of limited reaction data. arXiv preprint arXiv:2311.06708.

@article{sagawa2023reactiont5,
  title={ReactionT5: a large-scale pre-trained model towards application of limited reaction data},
  author={Sagawa, Tatsuya and Kojima, Ryosuke},
  journal={arXiv preprint arXiv:2311.06708},
  year={2023},
  doi={10.48550/arxiv.2311.06708}
}