Systematic Study of Data Transfer for Retrosynthesis

This is an empirical paper that systematically compares three standard data transfer methods (joint training, self-training, and pre-training plus fine-tuning) applied to a Transformer-based sequence-to-sequence model for single-step retrosynthesis. The primary contribution is demonstrating that pre-training on a large augment dataset (USPTO-Full, 877K reactions) followed by fine-tuning on the smaller target dataset (USPTO-50K) produces substantial accuracy improvements over the baseline Transformer, achieving results competitive with or superior to contemporaneous state-of-the-art graph-based models at larger n in n-best accuracy.

Bridging the Data Gap in Retrosynthesis Prediction

Retrosynthesis, the problem of predicting reactant compounds needed to synthesize a target product, has seen rapid progress through increasingly sophisticated model architectures: LSTM seq-to-seq models, Transformer models, and graph-to-graph approaches. However, the authors identify a gap in this research trajectory. While model architecture has received extensive attention, the role of training data strategies has been largely neglected in the retrosynthesis literature.

The core practical problem is that high-quality supervised datasets for retrosynthesis (like USPTO-50K) tend to be small and distribution-skewed, with all samples pre-classified into ten major reaction classes. Meanwhile, larger datasets (USPTO-Full with 877K samples, USPTO-MIT with 479K samples) exist but have different distributional properties. Data transfer techniques are standard practice in computer vision, NLP, and machine translation for exactly this scenario, yet they had not been systematically evaluated for retrosynthesis at the time of this work.

The authors also note a contrast with Zoph et al. (2020), who found that self-training outperforms pre-training in image recognition. They hypothesize that chemical compound strings may have more universal representations than images, making pre-training more effective in the chemistry domain.

Three Data Transfer Methods for Retrosynthesis

The paper formalizes retrosynthesis as a seq-to-seq problem where both the product $x$ and reactant set $y$ are represented as SMILES strings. A retrosynthesis model defines a likelihood $p_{\mathcal{M}}(y \mid x; \theta)$ optimized via maximum log-likelihood:

$$ \theta^{*} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}^{T}_{\text{Train}}} \log p(y_{i} \mid x_{i}) $$

Given a target dataset $\mathcal{D}^{T}$ and an augment dataset $\mathcal{D}^{A}$, three transfer methods are examined:

Joint Training concatenates the training sets and optimizes over the union:

$$ \theta^{*}_{\text{joint}} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}_{\text{joint}}} \log p(y_{i} \mid x_{i}), \quad \mathcal{D}_{\text{joint}} = \mathcal{D}^{T}_{\text{Train}} \cup \mathcal{D}^{A}_{\text{Train}} $$

This requires that both datasets share the same input/output domain (same SMILES canonicalization rules).
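Joint training needs no model changes, only dataset preparation: the two training sets are concatenated and shuffled before optimizing the usual seq-to-seq loss. A minimal sketch of that step (toy SMILES pairs standing in for USPTO-50K and USPTO-Full samples, not real data):

```python
import random

def make_joint_training_set(target_pairs, augment_pairs, seed=0):
    """Build the joint training set D_joint = D^T_Train union D^A_Train.

    Both sets must use the same SMILES canonicalization rules; this
    sketch assumes the (product, reactants) strings are already
    canonical, so preparation reduces to a shuffled concatenation.
    """
    joint = list(target_pairs) + list(augment_pairs)
    random.Random(seed).shuffle(joint)
    return joint

# Hypothetical toy pairs (product SMILES, reactant SMILES).
target = [("CCO", "CC=O")]
augment = [("c1ccccc1", "c1ccccc1Br"), ("CCN", "CC#N")]
joint = make_joint_training_set(target, augment)
```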

Self-Training (pseudo labeling) first trains a base model on $\mathcal{D}^{T}$ alone, then uses this model to relabel the augment dataset products:

$$ \hat{y}_{i} = \arg\max_{y} \log p(y \mid x_{i}; \theta^{*}_{\text{single}}) \quad \text{for } x_{i} \in \mathcal{D}^{A}_{\text{Train}} $$

The pseudo-labeled augment set is then combined with $\mathcal{D}^{T}_{\text{Train}}$ for joint training. This approach does not require consistent label domains between datasets.
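The relabeling step can be sketched as follows; the `base_model` callable is a hypothetical stand-in for the USPTO-50K-trained Transformer's 1-best beam-search decode:

```python
def pseudo_label(base_model, augment_products):
    """Relabel augment-set products with the base model's 1-best output.

    base_model       : callable mapping a product SMILES to its most
                       likely reactant SMILES (argmax decode)
    augment_products : product SMILES x_i from D^A_Train
    """
    return [(x, base_model(x)) for x in augment_products]

# Toy lookup table standing in for the trained single model.
toy_model = {"CCO": "CC=O", "CCN": "CC#N"}.get
pseudo_pairs = pseudo_label(toy_model, ["CCO", "CCN"])
```

The resulting `pseudo_pairs` then play the role of $\mathcal{D}^{A}_{\text{Train}}$ in the joint-training objective.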

Pre-training plus Fine-tuning trains first on the augment dataset to obtain $\theta^{*}_{\text{pretrain}}$, then initializes fine-tuning from this checkpoint:

$$ \theta^{0}_{\text{finetune}} \leftarrow \theta^{*}_{\text{pretrain}}, \quad \theta^{\ell+1}_{\text{finetune}} \leftarrow \theta^{\ell}_{\text{finetune}} - \gamma^{\ell} \nabla \mathcal{L}(\mathcal{D}^{T}_{\text{Train}}) \big|_{{\theta^{\ell}_{\text{finetune}}}} $$
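The update rule above is ordinary gradient descent warm-started from the pre-trained parameters. A toy scalar version (a generic sketch on a quadratic loss, not the paper's Transformer training loop) shows the mechanics:

```python
def finetune(theta_pretrain, grad_fn, lrs):
    """Fine-tuning as warm-started gradient descent.

    theta_pretrain : parameters theta*_pretrain from the pre-training phase
    grad_fn        : gradient of the target-set loss L(D^T_Train) at theta
    lrs            : per-step learning rates gamma^l (a decaying,
                     non-cyclic schedule in the paper)
    """
    theta = theta_pretrain            # theta^0_finetune <- theta*_pretrain
    for gamma in lrs:
        theta = theta - gamma * grad_fn(theta)
    return theta

# Toy quadratic loss L(theta) = (theta - 2)^2 with gradient 2*(theta - 2);
# fine-tuning converges toward the optimum theta = 2.
theta = finetune(10.0, lambda t: 2.0 * (t - 2.0), [0.4] * 20)
```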

Experimental Setup on USPTO Benchmarks

The experiments use a fixed Transformer architecture (3 self-attention layers, 500-dimensional latent vectors) implemented in OpenNMT-py, evaluated across all three transfer methods.

Datasets:

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Target | USPTO-50K | 40K/5K/5K (train/val/test) | 10 reaction classes, curated by Lowe (2012) |
| Augment (main) | USPTO-Full | 844K train (after cleansing) | Curated by Lowe (2017) |
| Augment (smaller) | USPTO-MIT | 384K train (after cleansing) | Curated by Jin et al. (2017) |

Data cleansing removed all augment dataset samples whose product SMILES appeared in any USPTO-50K subset, preventing data leakage. All datasets were re-canonicalized with a unified RDKit version.
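The cleansing step reduces to a set-membership filter over canonical product SMILES. A minimal sketch (assuming both sides were already canonicalized with the same RDKit version, so plain string comparison is valid):

```python
def cleanse_augment(augment_pairs, target_products):
    """Drop augment reactions whose product SMILES appears anywhere in
    the target dataset (train/val/test), preventing test-set leakage.

    augment_pairs   : (product, reactants) canonical SMILES pairs
    target_products : product SMILES from all USPTO-50K subsets
    """
    banned = set(target_products)
    return [(x, y) for (x, y) in augment_pairs if x not in banned]

# Toy example: the second augment reaction's product also appears in
# the target splits, so it is removed.
augment = [("c1ccccc1Br", "c1ccccc1.Br"), ("CCO", "CC=O")]
cleansed = cleanse_augment(augment, ["CCO"])
```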

Evaluation uses n-best accuracy: a prediction counts as correct if the ground-truth reactants appear among the top n of k=50 beam-search candidates, reported at n = 1, 3, 5, 10, 20, and 50. Models are selected by best validation perplexity. All experiments report averages and standard deviations over 5 runs.
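The metric itself is simple to compute from ranked beam output; a minimal sketch with toy candidates:

```python
def n_best_accuracy(predictions, gold, n):
    """Fraction of test cases whose gold reactant string appears among
    the top-n beam candidates.

    predictions : list of ranked candidate lists (k-best beam output)
    gold        : list of ground-truth reactant SMILES strings
    n           : cutoff (n <= k)
    """
    hits = sum(1 for cands, y in zip(predictions, gold) if y in cands[:n])
    return hits / len(gold)

# Toy ranked candidates for two test products.
preds = [["CC=O", "CCO"], ["CC#N", "CCN"]]
gold = ["CCO", "OC"]
```

With these toy inputs the 1-best accuracy is 0.0 (neither gold string is ranked first) and the 2-best accuracy is 0.5.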

Optimization uses Adam with cyclic learning rate scheduling (warm-up) for all methods except fine-tuning, which uses a standard non-cyclic scheduler.
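The paper does not give the exact schedule formulas, so the following is only an illustrative sketch with hypothetical parameters: linear warm-up followed by either inverse-square-root decay (non-cyclic, as used for fine-tuning) or cosine restarts (one plausible reading of "cyclic"):

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup=8000, cycle=None):
    """Illustrative learning-rate schedules (hypothetical parameters).

    cycle=None  -> non-cyclic: linear warm-up, then inverse-sqrt decay
    cycle=C     -> cyclic: linear warm-up, then cosine restarts of
                   period C steps
    """
    if step < warmup:                            # linear warm-up
        return base_lr * step / warmup
    if cycle is None:                            # inverse-sqrt decay
        return base_lr * math.sqrt(warmup / step)
    phase = ((step - warmup) % cycle) / cycle    # cosine restarts
    return base_lr * 0.5 * (1 + math.cos(math.pi * phase))
```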

Results comparing data transfer methods (USPTO-Full augment):

| Training Method | n=1 | n=3 | n=5 | n=10 | n=20 | n=50 |
|---|---|---|---|---|---|---|
| Single model (no transfer) | 35.3 ± 1.4 | 52.8 ± 1.4 | 58.9 ± 1.3 | 64.5 ± 1.2 | 68.8 ± 1.2 | 72.1 ± 1.3 |
| Joint training | 39.1 ± 1.3 | 63.4 ± 0.9 | 71.9 ± 0.5 | 80.1 ± 0.2 | 85.4 ± 0.3 | 89.4 ± 0.2 |
| Self-training | 41.5 ± 1.0 | 60.4 ± 0.7 | 66.1 ± 0.7 | 71.8 ± 0.6 | 75.3 ± 0.5 | 78.0 ± 0.3 |
| Pre-training + fine-tune | 57.4 ± 0.4 | 77.6 ± 0.4 | 83.1 ± 0.2 | 87.4 ± 0.4 | 89.6 ± 0.3 | 90.9 ± 0.2 |

Comparison with state-of-the-art models:

| Model | Architecture | n=1 | n=3 | n=5 | n=10 | n=20 | n=50 |
|---|---|---|---|---|---|---|---|
| GLN (Dai et al., 2019) | Logic Network | 52.5 | 69.0 | 75.6 | 83.7 | 88.5 | 92.4 |
| G2Gs (Shi et al., 2020) | Graph-to-Graph | 48.9 | 67.6 | 72.5 | 75.5 | N/A | N/A |
| RetroXpert (Yan et al., 2020) | Graph-to-Graph | 65.6 | 78.7 | 80.8 | 83.3 | 84.6 | 86.0 |
| GraphRetro (Somnath et al., 2020) | Graph-to-Graph | 63.8 | 80.5 | 84.1 | 85.9 | N/A | 87.2 |
| Pre-training + Fine-Tune (ours) | Seq-to-Seq | 57.4 | 77.6 | 83.1 | 87.4 | 89.6 | 90.9 |

Key Findings and Limitations

Primary findings:

  1. All three data transfer methods improve over the no-transfer baseline across all n-best accuracy levels.
  2. Pre-training plus fine-tuning provides the largest gains, improving top-1 accuracy by 22.1 absolute percentage points (from 35.3% to 57.4%) and achieving the best n=10 and n=20 accuracy among all compared models, including graph-based approaches.
  3. Augment dataset size matters: using USPTO-Full (844K) yields substantially better results than USPTO-MIT (384K) for joint training and pre-training plus fine-tuning, though self-training gains are surprisingly robust to augment dataset size.
  4. Manual inspection of erroneous predictions shows that over 99% of top-1 predictions from the pre-trained/fine-tuned model are chemically appropriate or sensible, even when they do not exactly match the gold-standard reactants.
  5. Pre-training plus fine-tuning shows a distinct advantage in training dynamics: the 1-best and n-best accuracy curves evolve similarly during fine-tuning, unlike the single model where these curves can diverge significantly. This makes early stopping more reliable.

Class-wise improvements are observed across all 10 reaction classes, with the largest gains in heterocycle formation (0.40 to 0.86 at 50-best) and functional group interconversion (0.57 to 0.90).

Limitations acknowledged by the authors:

  • The model struggles with compounds containing multiple similar substituents (e.g., long-chain hydrocarbons), occasionally selecting the wrong one.
  • Some reactions involving rare chemical groups (polycyclic aromatic hydrocarbons) still produce invalid SMILES, suggesting the augment dataset lacks sufficient examples of these structures.
  • Top-1 accuracy (57.4%) lags behind the best graph-based models (RetroXpert at 65.6%), though the gap narrows at higher n values.
  • The study uses a fixed Transformer architecture without architecture-specific optimization for each transfer method.

Future directions proposed include freezing parts of the network during fine-tuning, applying data transfer to graph-to-graph models, and testing transferability to other retrosynthesis datasets beyond USPTO-50K.


Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Target | USPTO-50K | 50K reactions | Curated by Lowe (2012), 10 reaction classes |
| Augment | USPTO-Full | 877K reactions (844K after cleansing) | Curated by Lowe (2017), available via Figshare |
| Augment (alt) | USPTO-MIT | 479K reactions (384K after cleansing) | Curated by Jin et al. (2017) |

Data cleansing removes augment samples whose products appear in any USPTO-50K subset. Unified RDKit canonicalization applied to all datasets.

Algorithms

  • Transformer seq-to-seq model (3 self-attention layers, 500-dim latent vectors)
  • Positional encoding enabled
  • Maximum sequence length: 200 tokens
  • Adam optimizer
  • Cyclic learning rate scheduler with warm-up (all methods except fine-tuning)
  • Non-cyclic scheduler for fine-tuning phase (Klein et al., 2017)
  • Beam search with k=50 for inference

Models

  • Implementation: OpenNMT-py
  • No pre-trained weights or model checkpoints released

Evaluation

| Metric | Value | Baseline | Notes |
|---|---|---|---|
| Top-1 accuracy | 57.4% | 35.3% (no transfer) | Pre-train + fine-tune, USPTO-Full augment |
| Top-10 accuracy | 87.4% | 64.5% (no transfer) | Best among all compared models |
| Top-20 accuracy | 89.6% | 68.8% (no transfer) | Best among all compared models |
| Top-50 accuracy | 90.9% | 72.1% (no transfer) | Competitive with GLN (92.4%) |

Hardware

Hardware details are not specified in the paper. The authors mention GPU memory constraints motivating the 200-token sequence length limit.


Paper Information

Citation: Ishiguro, K., Ujihara, K., Sawada, R., Akita, H., & Kotera, M. (2020). Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis. arXiv preprint arXiv:2010.00792.

@article{ishiguro2020data,
  title={Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis},
  author={Ishiguro, Katsuhiko and Ujihara, Kazuya and Sawada, Ryohto and Akita, Hirotaka and Kotera, Masaaki},
  journal={arXiv preprint arXiv:2010.00792},
  year={2020}
}