Systematic Study of Data Transfer for Retrosynthesis

This is an empirical paper that systematically compares three standard data transfer methods (joint training, self-training, and pre-training plus fine-tuning) applied to a Transformer-based sequence-to-sequence model for single-step retrosynthesis. The primary contribution is demonstrating that pre-training on a large augment dataset (USPTO-Full, 877K reactions) followed by fine-tuning on the smaller target dataset (USPTO-50K) produces substantial accuracy improvements over the baseline Transformer, achieving results competitive with or superior to contemporaneous state-of-the-art graph-based models at larger n in n-best accuracy.

Bridging the Data Gap in Retrosynthesis Prediction

Retrosynthesis, the problem of predicting reactant compounds needed to synthesize a target product, has seen rapid progress through increasingly sophisticated model architectures: LSTM seq-to-seq models, Transformer models, and graph-to-graph approaches. However, the authors identify a gap in this research trajectory. While model architecture has received extensive attention, the role of training data strategies has been largely neglected in the retrosynthesis literature.

The core practical problem is that high-quality supervised datasets for retrosynthesis (like USPTO-50K) tend to be small and distribution-skewed, with all samples pre-classified into ten major reaction classes. Meanwhile, larger datasets (USPTO-Full with 877K samples, USPTO-MIT with 479K samples) exist but have different distributional properties. Data transfer techniques are standard practice in computer vision, NLP, and machine translation for exactly this scenario, yet they had not been systematically evaluated for retrosynthesis at the time of this work.

The authors also note a contrast with Zoph et al. (2020), who found that self-training outperforms pre-training in image recognition. They hypothesize that chemical compound strings may have more universal representations than images, making pre-training more effective in the chemistry domain.

Three Data Transfer Methods for Retrosynthesis

The paper formalizes retrosynthesis as a seq-to-seq problem where both the product $x$ and reactant set $y$ are represented as SMILES strings. A retrosynthesis model defines a likelihood $p_{\mathcal{M}}(y \mid x; \theta)$ optimized via maximum log-likelihood:

$$ \theta^{*} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}^{T}_{\text{Train}}} \log p(y_{i} \mid x_{i}) $$

Given a target dataset $\mathcal{D}^{T}$ and an augment dataset $\mathcal{D}^{A}$, three transfer methods are examined:

Joint Training concatenates the training sets and optimizes over the union:

$$ \theta^{*}_{\text{joint}} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}_{\text{joint}}} \log p(y_{i} \mid x_{i}), \quad \mathcal{D}_{\text{joint}} = \mathcal{D}^{T}_{\text{Train}} \cup \mathcal{D}^{A}_{\text{Train}} $$

This requires that both datasets share the same input/output domain (same SMILES canonicalization rules).
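Joint training needs no model changes, only dataset preparation: the two training sets are concatenated and shuffled before optimizing the usual seq-to-seq loss. A minimal sketch of that step (toy SMILES pairs standing in for USPTO-50K and USPTO-Full samples, not real data):

```python
import random

def make_joint_training_set(target_pairs, augment_pairs, seed=0):
    """Build the joint training set D_joint = D^T_Train union D^A_Train.

    Both sets must use the same SMILES canonicalization rules; this
    sketch assumes the (product, reactants) strings are already
    canonical, so preparation reduces to a shuffled concatenation.
    """
    joint = list(target_pairs) + list(augment_pairs)
    random.Random(seed).shuffle(joint)
    return joint

# Hypothetical toy pairs (product SMILES, reactant SMILES).
target = [("CCO", "CC=O")]
augment = [("c1ccccc1", "c1ccccc1Br"), ("CCN", "CC#N")]
joint = make_joint_training_set(target, augment)
```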

Self-Training (pseudo labeling) first trains a base model on $\mathcal{D}^{T}$ alone, then uses this model to relabel the augment dataset products:

$$ \hat{y}_{i} = \arg\max_{y} \log p(y \mid x_{i}; \theta^{*}_{\text{single}}) \quad \text{for } x_{i} \in \mathcal{D}^{A}_{\text{Train}} $$

The pseudo-labeled augment set is then combined with $\mathcal{D}^{T}_{\text{Train}}$ for joint training. This approach does not require consistent label domains between datasets.
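The relabeling step can be sketched as follows; the `base_model` callable is a hypothetical stand-in for the USPTO-50K-trained Transformer's 1-best beam-search decode:

```python
def pseudo_label(base_model, augment_products):
    """Relabel augment-set products with the base model's 1-best output.

    base_model       : callable mapping a product SMILES to its most
                       likely reactant SMILES (argmax decode)
    augment_products : product SMILES x_i from D^A_Train
    """
    return [(x, base_model(x)) for x in augment_products]

# Toy lookup table standing in for the trained single model.
toy_model = {"CCO": "CC=O", "CCN": "CC#N"}.get
pseudo_pairs = pseudo_label(toy_model, ["CCO", "CCN"])
```

The resulting `pseudo_pairs` then play the role of $\mathcal{D}^{A}_{\text{Train}}$ in the joint-training objective.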

Pre-training plus Fine-tuning trains first on the augment dataset to obtain $\theta^{*}_{\text{pretrain}}$, then initializes fine-tuning from this checkpoint:

$$ \theta^{0}_{\text{finetune}} \leftarrow \theta^{*}_{\text{pretrain}}, \quad \theta^{\ell+1}_{\text{finetune}} \leftarrow \theta^{\ell}_{\text{finetune}} - \gamma^{\ell} \nabla \mathcal{L}(\mathcal{D}^{T}_{\text{Train}}) \big|_{{\theta^{\ell}_{\text{finetune}}}} $$
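The update rule above is ordinary gradient descent warm-started from the pre-trained parameters. A toy scalar version (a generic sketch on a quadratic loss, not the paper's Transformer training loop) shows the mechanics:

```python
def finetune(theta_pretrain, grad_fn, lrs):
    """Fine-tuning as warm-started gradient descent.

    theta_pretrain : parameters theta*_pretrain from the pre-training phase
    grad_fn        : gradient of the target-set loss L(D^T_Train) at theta
    lrs            : per-step learning rates gamma^l (a decaying,
                     non-cyclic schedule in the paper)
    """
    theta = theta_pretrain            # theta^0_finetune <- theta*_pretrain
    for gamma in lrs:
        theta = theta - gamma * grad_fn(theta)
    return theta

# Toy quadratic loss L(theta) = (theta - 2)^2 with gradient 2*(theta - 2);
# fine-tuning converges toward the optimum theta = 2.
theta = finetune(10.0, lambda t: 2.0 * (t - 2.0), [0.4] * 20)
```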

Experimental Setup on USPTO Benchmarks

The experiments use a fixed Transformer architecture (3 self-attention layers, 500-dimensional latent vectors) implemented in OpenNMT-py, evaluated across all three transfer methods.

Datasets:

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Target | USPTO-50K | 40K/5K/5K (train/val/test) | 10 reaction classes, curated by Lowe (2012) |
| Augment (main) | USPTO-Full | 844K train (after cleansing) | Curated by Lowe (2017) |
| Augment (smaller) | USPTO-MIT | 384K train (after cleansing) | Curated by Jin et al. (2017) |

Data cleansing removed all augment dataset samples whose product SMILES appeared in any USPTO-50K subset, preventing data leakage. All datasets were re-canonicalized with a unified RDKit version.
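The cleansing step reduces to a set-membership filter over canonical product SMILES. A minimal sketch (assuming both sides were already canonicalized with the same RDKit version, so plain string comparison is valid):

```python
def cleanse_augment(augment_pairs, target_products):
    """Drop augment reactions whose product SMILES appears anywhere in
    the target dataset (train/val/test), preventing test-set leakage.

    augment_pairs   : (product, reactants) canonical SMILES pairs
    target_products : product SMILES from all USPTO-50K subsets
    """
    banned = set(target_products)
    return [(x, y) for (x, y) in augment_pairs if x not in banned]

# Toy example: the second augment reaction's product also appears in
# the target splits, so it is removed.
augment = [("c1ccccc1Br", "c1ccccc1.Br"), ("CCO", "CC=O")]
cleansed = cleanse_augment(augment, ["CCO"])
```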

Evaluation uses n-best accuracy: a prediction counts as correct if the ground-truth reactants appear among the top n of k=50 beam-search candidates, reported at n = 1, 3, 5, 10, 20, and 50. Models are selected by best validation perplexity. All experiments report averages and standard deviations over 5 runs.
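The metric itself is simple to compute from ranked beam output; a minimal sketch with toy candidates:

```python
def n_best_accuracy(predictions, gold, n):
    """Fraction of test cases whose gold reactant string appears among
    the top-n beam candidates.

    predictions : list of ranked candidate lists (k-best beam output)
    gold        : list of ground-truth reactant SMILES strings
    n           : cutoff (n <= k)
    """
    hits = sum(1 for cands, y in zip(predictions, gold) if y in cands[:n])
    return hits / len(gold)

# Toy ranked candidates for two test products.
preds = [["CC=O", "CCO"], ["CC#N", "CCN"]]
gold = ["CCO", "OC"]
```

With these toy inputs the 1-best accuracy is 0.0 (neither gold string is ranked first) and the 2-best accuracy is 0.5.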

Optimization uses Adam with cyclic learning rate scheduling (warm-up) for all methods except fine-tuning, which uses a standard non-cyclic scheduler.
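The paper does not give the exact schedule formulas, so the following is only an illustrative sketch with hypothetical parameters: linear warm-up followed by either inverse-square-root decay (non-cyclic, as used for fine-tuning) or cosine restarts (one plausible reading of "cyclic"):

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup=8000, cycle=None):
    """Illustrative learning-rate schedules (hypothetical parameters).

    cycle=None  -> non-cyclic: linear warm-up, then inverse-sqrt decay
    cycle=C     -> cyclic: linear warm-up, then cosine restarts of
                   period C steps
    """
    if step < warmup:                            # linear warm-up
        return base_lr * step / warmup
    if cycle is None:                            # inverse-sqrt decay
        return base_lr * math.sqrt(warmup / step)
    phase = ((step - warmup) % cycle) / cycle    # cosine restarts
    return base_lr * 0.5 * (1 + math.cos(math.pi * phase))
```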

Results comparing data transfer methods (USPTO-Full augment):

| Training Method | n=1 | n=3 | n=5 | n=10 | n=20 | n=50 |
|---|---|---|---|---|---|---|
| Single model (no transfer) | 35.3 ± 1.4 | 52.8 ± 1.4 | 58.9 ± 1.3 | 64.5 ± 1.2 | 68.8 ± 1.2 | 72.1 ± 1.3 |
| Joint training | 39.1 ± 1.3 | 63.4 ± 0.9 | 71.9 ± 0.5 | 80.1 ± 0.2 | 85.4 ± 0.3 | 89.4 ± 0.2 |
| Self-training | 41.5 ± 1.0 | 60.4 ± 0.7 | 66.1 ± 0.7 | 71.8 ± 0.6 | 75.3 ± 0.5 | 78.0 ± 0.3 |
| Pre-training + fine-tune | 57.4 ± 0.4 | 77.6 ± 0.4 | 83.1 ± 0.2 | 87.4 ± 0.4 | 89.6 ± 0.3 | 90.9 ± 0.2 |

Comparison with state-of-the-art models:

| Model | Architecture | n=1 | n=3 | n=5 | n=10 | n=20 | n=50 |
|---|---|---|---|---|---|---|---|
| GLN (Dai et al., 2019) | Logic Network | 52.5 | 69.0 | 75.6 | 83.7 | 88.5 | 92.4 |
| G2Gs (Shi et al., 2020) | Graph-to-Graph | 48.9 | 67.6 | 72.5 | 75.5 | N/A | N/A |
| RetroXpert (Yan et al., 2020) | Graph-to-Graph | 65.6 | 78.7 | 80.8 | 83.3 | 84.6 | 86.0 |
| GraphRetro (Somnath et al., 2020) | Graph-to-Graph | 63.8 | 80.5 | 84.1 | 85.9 | N/A | 87.2 |
| Pre-training + Fine-Tune (ours) | Seq-to-Seq | 57.4 | 77.6 | 83.1 | 87.4 | 89.6 | 90.9 |

Key Findings and Limitations

Primary findings:

  1. All three data transfer methods improve over the no-transfer baseline across all n-best accuracy levels.
  2. Pre-training plus fine-tuning provides the largest gains, improving top-1 accuracy by 22.1 absolute percentage points (from 35.3% to 57.4%) and achieving the best n=10 and n=20 accuracy among all compared models, including graph-based approaches.
  3. Augment dataset size matters: using USPTO-Full (844K) yields substantially better results than USPTO-MIT (384K) for joint training and pre-training plus fine-tuning, though self-training gains are surprisingly robust to augment dataset size.
  4. Manual inspection of erroneous predictions shows that over 99% of top-1 predictions from the pre-trained/fine-tuned model are chemically appropriate or sensible, even when they do not exactly match the gold-standard reactants.
  5. Pre-training plus fine-tuning shows a distinct advantage in training dynamics: the 1-best and n-best accuracy curves evolve similarly during fine-tuning, unlike the single model where these curves can diverge significantly. This makes early stopping more reliable.

Class-wise improvements are observed across all 10 reaction classes, with the largest gains in heterocycle formation (0.40 to 0.86 at 50-best) and functional group interconversion (0.57 to 0.90).

Limitations acknowledged by the authors:

  • The model struggles with compounds containing multiple similar substituents (e.g., long-chain hydrocarbons), occasionally selecting the wrong one.
  • Some reactions involving rare chemical groups (polycyclic aromatic hydrocarbons) still produce invalid SMILES, suggesting the augment dataset lacks sufficient examples of these structures.
  • Top-1 accuracy (57.4%) lags behind the best graph-based models (RetroXpert at 65.6%), though the gap narrows at higher n values.
  • The study uses a fixed Transformer architecture without architecture-specific optimization for each transfer method.

Future directions proposed include freezing parts of the network during fine-tuning, applying data transfer to graph-to-graph models, and testing transferability to other retrosynthesis datasets beyond USPTO-50K.


Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Target | USPTO-50K | 50K reactions | Curated by Lowe (2012), 10 reaction classes |
| Augment | USPTO-Full | 877K reactions (844K after cleansing) | Curated by Lowe (2017), available via Figshare |
| Augment (alt) | USPTO-MIT | 479K reactions (384K after cleansing) | Curated by Jin et al. (2017) |

Data cleansing removes augment samples whose products appear in any USPTO-50K subset. Unified RDKit canonicalization applied to all datasets.

Algorithms

  • Transformer seq-to-seq model (3 self-attention layers, 500-dim latent vectors)
  • Positional encoding enabled
  • Maximum sequence length: 200 tokens
  • Adam optimizer
  • Cyclic learning rate scheduler with warm-up (all methods except fine-tuning)
  • Non-cyclic scheduler for fine-tuning phase (Klein et al., 2017)
  • Beam search with k=50 for inference

Models

  • Implementation: OpenNMT-py
  • No pre-trained weights or model checkpoints released

Evaluation

| Metric | Value | Baseline | Notes |
|---|---|---|---|
| Top-1 accuracy | 57.4% | 35.3% (no transfer) | Pre-train + fine-tune, USPTO-Full augment |
| Top-10 accuracy | 87.4% | 64.5% (no transfer) | Best among all compared models |
| Top-20 accuracy | 89.6% | 68.8% (no transfer) | Best among all compared models |
| Top-50 accuracy | 90.9% | 72.1% (no transfer) | Competitive with GLN (92.4%) |

Hardware

Hardware details are not specified in the paper. The authors mention GPU memory constraints motivating the 200-token sequence length limit.


Paper Information

Citation: Ishiguro, K., Ujihara, K., Sawada, R., Akita, H., & Kotera, M. (2020). Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis. arXiv preprint arXiv:2010.00792.

@article{ishiguro2020data,
  title={Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis},
  author={Ishiguro, Katsuhiko and Ujihara, Kazuya and Sawada, Ryohto and Akita, Hirotaka and Kotera, Masaaki},
  journal={arXiv preprint arXiv:2010.00792},
  year={2020}
}