Systematic Study of Data Transfer for Retrosynthesis
This is an empirical paper that systematically compares three standard data transfer methods (joint training, self-training, and pre-training plus fine-tuning) applied to a Transformer-based sequence-to-sequence model for single-step retrosynthesis. The primary contribution is demonstrating that pre-training on a large augment dataset (USPTO-Full, 877K reactions) followed by fine-tuning on the smaller target dataset (USPTO-50K) yields substantial accuracy gains over the baseline Transformer, matching or surpassing contemporaneous state-of-the-art graph-based models at larger n in n-best accuracy.
Bridging the Data Gap in Retrosynthesis Prediction
Retrosynthesis, the problem of predicting the reactant compounds needed to synthesize a target product, has seen rapid progress through increasingly sophisticated model architectures: LSTM seq-to-seq models, Transformer models, and graph-to-graph approaches. However, the authors identify a gap in this research trajectory: while model architecture has received extensive attention, training data strategies have been largely neglected in the retrosynthesis literature.
The core practical problem is that high-quality supervised datasets for retrosynthesis (such as USPTO-50K) are small and skewed in distribution, with all samples pre-classified into ten major reaction classes. Meanwhile, larger datasets (USPTO-Full with 877K samples, USPTO-MIT with 479K samples) exist but have different distributional properties. Data transfer techniques are standard practice in computer vision, NLP, and machine translation for exactly this scenario, yet they had not been systematically evaluated for retrosynthesis at the time of this work.
The authors also note a contrast with Zoph et al. (2020), who found that self-training outperforms pre-training in image recognition. They hypothesize that chemical compound strings may have more universal representations than images, making pre-training more effective in the chemistry domain.
Three Data Transfer Methods for Retrosynthesis
The paper formalizes retrosynthesis as a seq-to-seq problem where both the product $x$ and reactant set $y$ are represented as SMILES strings. A retrosynthesis model defines a likelihood $p_{\mathcal{M}}(y \mid x; \theta)$ optimized via maximum log-likelihood:
$$ \theta^{*} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}^{T}_{\text{Train}}} \log p_{\mathcal{M}}(y_{i} \mid x_{i}; \theta) $$
Given a target dataset $\mathcal{D}^{T}$ and an augment dataset $\mathcal{D}^{A}$, three transfer methods are examined:
Joint Training concatenates the training sets and optimizes over the union:
$$ \theta^{*}_{\text{joint}} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}_{\text{joint}}} \log p_{\mathcal{M}}(y_{i} \mid x_{i}; \theta), \quad \mathcal{D}_{\text{joint}} = \mathcal{D}^{T}_{\text{Train}} \cup \mathcal{D}^{A}_{\text{Train}} $$
This requires that both datasets share the same input/output domain (same SMILES canonicalization rules).
Self-Training (pseudo-labeling) first trains a base model on $\mathcal{D}^{T}_{\text{Train}}$ alone, then uses this model to relabel the augment dataset products:
$$ \hat{y}_{i} = \arg\max_{y} \log p(y \mid x_{i}; \theta^{*}_{\text{single}}) \quad \text{for } x_{i} \in \mathcal{D}^{A}_{\text{Train}} $$
The pseudo-labeled augment set is then combined with $\mathcal{D}^{T}_{\text{Train}}$ for joint training. This approach does not require consistent label domains between datasets.
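The pseudo-labeling step above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: the real base model is a Transformer decoded with beam search in OpenNMT-py, while the `base_model` callable and the echoing stub below are hypothetical stand-ins.

```python
# Sketch of self-training: relabel augment products with the base model's
# 1-best prediction, then merge with the target training pairs.
def self_training_data(base_model, target_train, augment_products):
    """Return target pairs plus pseudo-labeled (product, reactants) pairs."""
    pseudo_pairs = [(x, base_model(x)) for x in augment_products]
    # Joint training then proceeds on the merged set.
    return target_train + pseudo_pairs

# Toy usage with a stub "model" that just echoes the product string,
# standing in for argmax decoding of p(y | x; theta*_single):
stub_model = lambda smiles: smiles
merged = self_training_data(stub_model, [("CCO", "CC=O")], ["c1ccccc1"])
```

Because the labels come from the base model itself, the two datasets never need to share a label domain, which is the practical appeal of this method.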
Pre-training plus Fine-tuning trains first on the augment dataset to obtain $\theta^{*}_{\text{pretrain}}$, then initializes fine-tuning from this checkpoint:
$$ \theta^{0}_{\text{finetune}} \leftarrow \theta^{*}_{\text{pretrain}}, \quad \theta^{\ell+1}_{\text{finetune}} \leftarrow \theta^{\ell}_{\text{finetune}} - \gamma^{\ell} \nabla \mathcal{L}(\mathcal{D}^{T}_{\text{Train}}) \big|_{{\theta^{\ell}_{\text{finetune}}}} $$
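The warm-start update rule above can be illustrated on a toy 1-D loss. This is only a conceptual sketch: the quadratic $L(\theta) = (\theta - 2)^2$ stands in for the seq-to-seq negative log-likelihood, and plain gradient descent stands in for Adam with the paper's scheduler.

```python
# Fine-tuning as warm-started gradient descent: the pretrained parameter
# initializes the loop instead of a random value.
def finetune(theta_pretrain, grad, lr=0.1, steps=100):
    theta = theta_pretrain            # theta^0_finetune <- theta*_pretrain
    for _ in range(steps):
        theta -= lr * grad(theta)     # theta^{l+1} <- theta^l - gamma * grad L
    return theta

grad = lambda t: 2.0 * (t - 2.0)      # dL/dtheta for L(theta) = (theta - 2)^2
theta_star = finetune(theta_pretrain=5.0, grad=grad)   # converges toward 2.0
```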
Experimental Setup on USPTO Benchmarks
The experiments use a fixed Transformer architecture (3 self-attention layers, 500-dimensional latent vectors) implemented in OpenNMT-py, evaluated across all three transfer methods.
Datasets:
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Target | USPTO-50K | 40K/5K/5K (train/val/test) | 10 reaction classes, curated by Lowe (2012) |
| Augment (main) | USPTO-Full | 844K train (after cleansing) | Curated by Lowe (2017) |
| Augment (smaller) | USPTO-MIT | 384K train (after cleansing) | Curated by Jin et al. (2017) |
Data cleansing removed all augment dataset samples whose product SMILES appeared in any USPTO-50K subset, preventing data leakage. All datasets were re-canonicalized with a unified RDKit version.
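The cleansing step amounts to a set-membership filter over canonical product SMILES. A minimal sketch, assuming the paper's setup: `canonicalize` is an identity stub here, standing in for an RDKit round-trip (`Chem.MolToSmiles(Chem.MolFromSmiles(s))`) with a unified RDKit version.

```python
# Drop any augment reaction whose (canonical) product SMILES occurs in
# any USPTO-50K split, so no test/val product leaks into pre-training.
def canonicalize(smiles):
    return smiles  # replace with RDKit canonicalization in practice

def cleanse(augment_pairs, target_splits):
    seen = {canonicalize(product)
            for split in target_splits
            for product, _reactants in split}
    return [(p, r) for p, r in augment_pairs
            if canonicalize(p) not in seen]

target = [[("CCO", "CC=O"), ("CC(=O)O", "CCO")]]    # toy train split
augment = [("CCO", "C=C"), ("c1ccccc1", "C1CCCCC1")]
kept = cleanse(augment, target)   # only the benzene pair survives
```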
Evaluation uses n-best accuracy with k=50 beam search, computing accuracy at n=1, 3, 5, 10, 20, 50. Models are selected by best validation perplexity. All experiments report averages and standard deviations over 5 runs.
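The n-best metric can be computed directly from the beam outputs. A small sketch: a prediction counts as correct at n if the gold reactant string appears anywhere in the top-n candidates, with exact string match after the shared canonicalization assumed above.

```python
# n-best accuracy over beam-search candidate lists.
def n_best_accuracy(beams, golds, n):
    """beams: per-sample ranked candidate lists; golds: gold SMILES."""
    hits = sum(1 for cands, gold in zip(beams, golds)
               if gold in cands[:n])
    return hits / len(golds)

beams = [["CC=O", "CCO", "C=C"],     # gold at rank 1
         ["C1CC1", "CCN", "CCO"]]    # gold at rank 3
golds = ["CC=O", "CCO"]
acc1 = n_best_accuracy(beams, golds, 1)   # 0.5
acc3 = n_best_accuracy(beams, golds, 3)   # 1.0
```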
Optimization uses Adam with a cyclic learning-rate schedule with warm-up for all methods except fine-tuning, which uses a standard non-cyclic scheduler.
Results comparing data transfer methods (USPTO-Full augment):
| Training Method | n=1 | n=3 | n=5 | n=10 | n=20 | n=50 |
|---|---|---|---|---|---|---|
| Single model (No Transfer) | 35.3 +/- 1.4 | 52.8 +/- 1.4 | 58.9 +/- 1.3 | 64.5 +/- 1.2 | 68.8 +/- 1.2 | 72.1 +/- 1.3 |
| Joint Training | 39.1 +/- 1.3 | 63.4 +/- 0.9 | 71.9 +/- 0.5 | 80.1 +/- 0.2 | 85.4 +/- 0.3 | 89.4 +/- 0.2 |
| Self-Training | 41.5 +/- 1.0 | 60.4 +/- 0.7 | 66.1 +/- 0.7 | 71.8 +/- 0.6 | 75.3 +/- 0.5 | 78.0 +/- 0.3 |
| Pre-training + Fine-Tune | 57.4 +/- 0.4 | 77.6 +/- 0.4 | 83.1 +/- 0.2 | 87.4 +/- 0.4 | 89.6 +/- 0.3 | 90.9 +/- 0.2 |
Comparison with state-of-the-art models:
| Model | Architecture | n=1 | n=3 | n=5 | n=10 | n=20 | n=50 |
|---|---|---|---|---|---|---|---|
| GLN (Dai et al., 2019) | Logic Network | 52.5 | 69.0 | 75.6 | 83.7 | 88.5 | 92.4 |
| G2Gs (Shi et al., 2020) | Graph-to-Graph | 48.9 | 67.6 | 72.5 | 75.5 | N/A | N/A |
| RetroXpert (Yan et al., 2020) | Graph-to-Graph | 65.6 | 78.7 | 80.8 | 83.3 | 84.6 | 86.0 |
| GraphRetro (Somnath et al., 2020) | Graph-to-Graph | 63.8 | 80.5 | 84.1 | 85.9 | N/A | 87.2 |
| Pre-training + Fine-Tune (ours) | Seq-to-Seq | 57.4 | 77.6 | 83.1 | 87.4 | 89.6 | 90.9 |
Key Findings and Limitations
Primary findings:
- All three data transfer methods improve over the no-transfer baseline across all n-best accuracy levels.
- Pre-training plus fine-tuning provides the largest gains, improving top-1 accuracy by 22.1 absolute percentage points (from 35.3% to 57.4%) and achieving the best n=10 and n=20 accuracy among all compared models, including graph-based approaches.
- Augment dataset size matters: using USPTO-Full (844K) yields substantially better results than USPTO-MIT (384K) for joint training and pre-training plus fine-tuning, though self-training gains are surprisingly robust to augment dataset size.
- Manual inspection of erroneous predictions shows that over 99% of top-1 predictions from the pre-trained/fine-tuned model are chemically appropriate or sensible, even when they do not exactly match the gold-standard reactants.
- Pre-training plus fine-tuning shows a distinct advantage in training dynamics: the 1-best and n-best accuracy curves evolve similarly during fine-tuning, unlike the single model where these curves can diverge significantly. This makes early stopping more reliable.
Class-wise improvements are observed across all 10 reaction classes, with the largest gains in heterocycle formation (50-best accuracy from 0.40 to 0.86) and functional group interconversion (0.57 to 0.90).
Limitations acknowledged by the authors:
- The model struggles with compounds containing multiple similar substituents (e.g., long-chain hydrocarbons), occasionally selecting the wrong one.
- Some reactions involving rare chemical groups (polycyclic aromatic hydrocarbons) still produce invalid SMILES, suggesting the augment dataset lacks sufficient examples of these structures.
- Top-1 accuracy (57.4%) lags behind the best graph-based models (RetroXpert at 65.6%), though the gap narrows at higher n values.
- The study uses a fixed Transformer architecture without architecture-specific optimization for each transfer method.
Future directions proposed include freezing parts of the network during fine-tuning, applying data transfer to graph-to-graph models, and testing transferability to other retrosynthesis datasets beyond USPTO-50K.
Reproducibility Details
Data
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Target | USPTO-50K | 50K reactions | Curated by Lowe (2012), 10 reaction classes |
| Augment | USPTO-Full | 877K reactions (844K after cleansing) | Curated by Lowe (2017), available via Figshare |
| Augment (alt) | USPTO-MIT | 479K reactions (384K after cleansing) | Curated by Jin et al. (2017) |
Data cleansing removes augment samples whose products appear in any USPTO-50K subset. Unified RDKit canonicalization applied to all datasets.
Algorithms
- Transformer seq-to-seq model (3 self-attention layers, 500-dim latent vectors)
- Positional encoding enabled
- Maximum sequence length: 200 tokens
- Adam optimizer
- Cyclic learning rate scheduler with warm-up (all methods except fine-tuning)
- Non-cyclic scheduler for fine-tuning phase (Klein et al., 2017)
- Beam search with k=50 for inference
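The beam search that produces the k=50 candidate lists can be sketched compactly. This is a generic length-limited beam search, not the OpenNMT-py implementation; `score_next` is a hypothetical stand-in for the Transformer decoder's next-token log-probabilities.

```python
import math

# Keep the k highest log-probability partial sequences at each step;
# finished sequences (ending in eos) are carried forward unchanged.
def beam_search(score_next, bos, eos, k, max_len):
    beams = [([bos], 0.0)]                    # (token sequence, log-prob)
    for _ in range(max_len):
        expanded = []
        for seq, lp in beams:
            if seq[-1] == eos:
                expanded.append((seq, lp))
                continue
            for tok, logp in score_next(seq):
                expanded.append((seq + [tok], lp + logp))
        beams = sorted(expanded, key=lambda b: -b[1])[:k]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams

# Toy scorer: always offers "C" (p=0.6) or end-of-sequence (p=0.4).
toy = lambda seq: [("C", math.log(0.6)), ("</s>", math.log(0.4))]
top = beam_search(toy, "<s>", "</s>", k=3, max_len=4)
```

The ranked `top` list is exactly what the n-best accuracy evaluation consumes, after detokenizing each sequence back into a SMILES string.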
Models
- Implementation: OpenNMT-py
- No pre-trained weights or model checkpoints released
Evaluation
| Metric | Value | Baseline | Notes |
|---|---|---|---|
| Top-1 accuracy | 57.4% | 35.3% (no transfer) | Pre-train + fine-tune, USPTO-Full augment |
| Top-10 accuracy | 87.4% | 64.5% (no transfer) | Best among all compared models |
| Top-20 accuracy | 89.6% | 68.8% (no transfer) | Best among all compared models |
| Top-50 accuracy | 90.9% | 72.1% (no transfer) | Competitive with GLN (92.4%) |
Hardware
Hardware details are not specified in the paper. The authors mention GPU memory constraints motivating the 200-token sequence length limit.
Paper Information
Citation: Ishiguro, K., Ujihara, K., Sawada, R., Akita, H., & Kotera, M. (2020). Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis. arXiv preprint arXiv:2010.00792.
@article{ishiguro2020data,
title={Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis},
author={Ishiguro, Katsuhiko and Ujihara, Kazuya and Sawada, Ryohto and Akita, Hirotaka and Kotera, Masaaki},
journal={arXiv preprint arXiv:2010.00792},
year={2020}
}
