Back Translation for Semi-Supervised Molecule Generation

Semi-Supervised Data Augmentation for Molecular Tasks

This is a Method paper that introduces back translation, a semi-supervised technique from neural machine translation, to the domain of molecular generation. The primary contribution is a general-purpose data augmentation strategy that leverages large pools of unlabeled molecules (from databases like ZINC) to improve the performance of both sequence-based and graph-based models on molecule optimization and retrosynthesis prediction tasks.

Bridging the Labeled Data Gap in Molecular Generation

Molecular generation tasks, such as property optimization and retrosynthesis, require paired training data: an input molecule (or property specification) mapped to a desired output molecule. Obtaining these labeled pairs is expensive and labor-intensive. Meanwhile, enormous databases of unlabeled molecules exist. ZINC alone contains over 750 million compounds, and PubChem has 109 million.

Prior approaches to using unlabeled molecular data include variational autoencoders (VAEs) for learning latent representations, conditional recurrent neural networks for inverse design, and pretraining techniques borrowed from NLP. However, these methods either focus on representation learning rather than direct generation, or require task-specific architectural modifications. The authors identify back translation, a well-established technique in machine translation, as a natural fit for molecular generation tasks that can be treated as sequence-to-sequence mappings.

Back Translation as Molecular Data Augmentation

The core idea is straightforward. Given a main task that maps from source domain $\mathcal{X}$ to target domain $\mathcal{Y}$ (e.g., mapping low-QED molecules to high-QED molecules), the method trains a reverse model $g$ that maps from $\mathcal{Y}$ back to $\mathcal{X}$. This reverse model then “back translates” unlabeled molecules from $\mathcal{Y}$ to generate synthetic source molecules, creating pseudo-labeled training pairs.

The theoretical motivation comes from maximizing the reconstruction probability. Given an unlabeled molecule $y_u \in \mathcal{U}_y$, the logarithmic reconstruction probability through the reverse model $g$ and forward model $f$ is:

$$ \log P(y_u = \hat{y}_u \mid y_u; g, f) = \log \sum_{\hat{x}_u \in \mathcal{X}} P(\hat{x}_u \mid y_u; g) P(y_u = \hat{y}_u \mid \hat{x}_u; f) $$

Since summing over the exponentially large space $\mathcal{X}$ is intractable, the authors apply Jensen’s inequality to obtain a lower bound:

$$ \log P(y_u = \hat{y}_u \mid y_u; g, f) \geq \mathbb{E}_{\hat{x}_u \sim P(\cdot \mid y_u; g)} \log P(y_u = \hat{y}_u \mid \hat{x}_u; f) $$

This lower bound is optimized via Monte Carlo sampling in three steps:

Step 1: Train both forward model $f$ and reverse model $g$ on the labeled data $\mathcal{L}$:

$$ \begin{aligned} \min_{\theta_f} \sum_{(x,y) \in \mathcal{L}} -\log P(y \mid x; \theta_f) \\ \min_{\theta_g} \sum_{(x,y) \in \mathcal{L}} -\log P(x \mid y; \theta_g) \end{aligned} $$

Step 2: Use the trained reverse model $g$ to back translate each unlabeled molecule $y_u \in \mathcal{U}_y$, producing synthetic pairs:

$$ \hat{\mathcal{L}} = {(\hat{x}_u, y_u) \mid y_u \in \mathcal{U}_y, \hat{x}_u \text{ sampled from } P(\cdot \mid y_u; \theta_g)} $$

Step 3: Retrain the forward model $f$ on the combined labeled and synthetic data $\mathcal{L} \cup \hat{\mathcal{L}}$, warm-starting from the parameters obtained in Step 1:

$$ \min_{\theta_f^} \sum_{(x,y) \in \mathcal{L} \cup \hat{\mathcal{L}}} -\log P(y \mid x; \theta_f^) $$

A key practical finding is that data filtration matters. When using large amounts of unlabeled data (1M molecules), keeping only the synthetic pairs that satisfy the same constraints as the labeled data (e.g., similarity thresholds and property ranges) significantly improves performance over using all back-translated data unfiltered.

Experiments on Property Optimization and Retrosynthesis

Molecular Property Improvement

The authors evaluate on four tasks from Jin et al. (2019, 2020), each requiring the model to improve a specific molecular property while maintaining structural similarity (measured by Dice similarity on Morgan fingerprints):

LogP (penalized partition coefficient): two settings with similarity thresholds $\delta \geq 0.4$ and $\delta \geq 0.6$
QED (quantitative estimation of drug-likeness): translate molecules from QED range [0.7, 0.8] to [0.9, 1.0]
DRD2 (dopamine type 2 receptor activity): translate inactive ($P < 0.5$) to active ($P \geq 0.5$)

Two backbone architectures are tested: a Transformer (6 layers, 4 heads, 128-dim embeddings, 512-dim FFN) and HierG2G, a hierarchical graph-to-graph translation model. Unlabeled molecules are sampled from ZINC at 250K and 1M scales.

Method	LogP ($\delta \geq 0.6$)	LogP ($\delta \geq 0.4$)	QED (%)	DRD2 (%)
JT-VAE	0.28	1.03	8.8	3.4
GCPN	0.79	2.49	9.4	4.4
JTNN	2.33	3.55	59.9	77.8
Transformer baseline	2.45	3.69	71.9	60.2
+BT (1M, filtered)	2.86	4.41	82.9	67.4
HierG2G baseline	2.49	3.98	76.9	85.9
+BT (250K, filtered)	2.75	4.24	79.1	87.3

Retrosynthesis Prediction

On the USPTO-50K benchmark (50K reactions, 10 reaction types, 80/10/10 train/val/test split), the method is applied to Transformer and GLN (Graph Logic Network) backbones. For other approaches to this benchmark, see Tied Two-Way Transformers and Data Transfer for Retrosynthesis. Unlabeled reactant sets are constructed by sampling molecules from ZINC and concatenating them following the training data’s reactant count distribution ($N_1 : N_2 : N_3 = 29.3% : 70.4% : 0.3%$).

Method	Top-1	Top-3	Top-5	Top-10
Reaction type given
GLN	64.2	79.1	85.2	90.0
Ours + GLN	67.9	82.5	87.3	91.5
Transformer	52.2	68.2	72.7	77.4
Ours + Transformer	55.9	72.8	77.8	79.7
Reaction type unknown
GLN	52.5	69.0	75.6	83.7
Ours + GLN	54.7	70.2	77.0	84.4
Transformer	37.9	57.3	62.7	68.1
Ours + Transformer	43.5	58.8	64.6	69.7

The improvements are largest at lower $k$ values (top-1 and top-3), suggesting that back translation helps the model make more precise high-confidence predictions.

Ablation Studies

Effect of unlabeled data size: On retrosynthesis with Transformer, performance improves as unlabeled data increases from 50K to 250K, then plateaus or declines beyond 250K. The authors attribute this to noise in the back-translated data outweighing the benefits at larger scales.

Effect of labeled data size: With only 5K labeled samples, adding back-translated data hurts performance because the reverse model is too weak to generate useful synthetic data. As labeled data increases (10K, 25K, 50K), the benefit of back translation grows. This confirms that the method requires a reasonably well-trained reverse model to be effective.

Data filtration: Using 1M unfiltered back-translated molecules sometimes hurts performance (e.g., QED drops from 71.9% to 75.1% vs. 82.9% with filtering), while filtering to enforce the same constraints as the labeled data recovers and exceeds the 250K filtered results.

Consistent Gains Across Architectures and Tasks

The method achieves state-of-the-art results on all four molecular property improvement tasks and the USPTO-50K retrosynthesis benchmark at time of publication. Several observations stand out:

Architecture agnosticism: Back translation improves both sequence-based (Transformer) and graph-based (HierG2G, GLN) models, confirming that the approach is independent of the underlying architecture.
Filtration is essential at scale: Unfiltered 1M back-translated data can degrade performance, but filtered data at the same scale consistently outperforms smaller unfiltered sets.
Training overhead is moderate: On the DRD2 task, back translation with Transformer takes about 2.5x the supervised training time (11.0h vs. 8.5h for initial training), with the back-translation step itself taking under 1 hour.
Diversity and novelty increase: Back translation improves both diversity (average pairwise distance among generated molecules) and novelty (fraction of generated molecules not seen in training) across QED and DRD2 tasks.

The authors acknowledge limitations: the method does not form a closed loop between forward and reverse models (as in dual learning approaches), and the data filtration strategy is rule-based rather than learned. They suggest joint training of forward and reverse models and learned filtration as future directions.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Training (property improvement)	Jin et al. (2019, 2020) datasets	34K-99K pairs	LogP, QED, DRD2 tasks
Training (retrosynthesis)	USPTO-50K	40K reactions	80/10/10 split from Dai et al. (2019)
Unlabeled molecules	ZINC	250K or 1M	Randomly sampled
Evaluation	Same as training	800-1000 test samples	Per-task test sets

Algorithms

Back translation with optional data filtration
Beam search with $k=20$ for inference
Random sampling for back-translation step (Equation 5)
Dice similarity on Morgan fingerprints for similarity constraint

Models

Transformer: 6 layers, 4 attention heads, 128-dim embeddings, 512-dim FFN (for property improvement); 4 layers, 8 heads, 256-dim embeddings, 2048-dim FFN (for retrosynthesis)
HierG2G: Settings from Jin et al. (2020)
GLN: Settings from Dai et al. (2019)

Evaluation

Metric	Task	Best Value	Baseline	Notes
LogP improvement	LogP ($\delta \geq 0.6$)	2.86	2.49 (HierG2G)	Transformer + BT(1M, filtered)
LogP improvement	LogP ($\delta \geq 0.4$)	4.41	3.98 (HierG2G)	Transformer + BT(1M, filtered)
Success rate	QED	82.9%	76.9% (HierG2G)	Transformer + BT(1M, filtered)
Success rate	DRD2	87.3%	85.9% (HierG2G)	HierG2G + BT(250K, filtered)
Top-1 accuracy	USPTO-50K (known type)	67.9%	64.2% (GLN)	Ours + GLN

Hardware

The paper reports training times (8.5h for Transformer, 16.8h for HierG2G on DRD2 with 1M unlabeled data) but does not specify the GPU hardware used.

Artifacts

Artifact	Type	License	Notes
BT4MolGen	Code	MIT	Official implementation in Python

Paper Information

Citation: Fan, Y., Xia, Y., Zhu, J., Wu, L., Xie, S., & Qin, T. (2021). Back translation for molecule generation. Bioinformatics, 38(5), 1244-1251. https://doi.org/10.1093/bioinformatics/btab817

@article{fan2022back,
  title={Back translation for molecule generation},
  author={Fan, Yang and Xia, Yingce and Zhu, Jinhua and Wu, Lijun and Xie, Shufang and Qin, Tao},
  journal={Bioinformatics},
  volume={38},
  number={5},
  pages={1244--1251},
  year={2021},
  publisher={Oxford University Press},
  doi={10.1093/bioinformatics/btab817}
}

Semi-Supervised Data Augmentation for Molecular Tasks#

Bridging the Labeled Data Gap in Molecular Generation#

Back Translation as Molecular Data Augmentation#

Experiments on Property Optimization and Retrosynthesis#

Molecular Property Improvement#

Retrosynthesis Prediction#

Ablation Studies#

Consistent Gains Across Architectures and Tasks#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Paper Information#