A Unified Molecular Pre-training Framework

X-MOL is a methods paper that introduces a large-scale pre-training framework for SMILES-based molecular understanding. The primary contribution is a Transformer encoder-decoder model pre-trained on 1.1 billion molecules from ZINC15, which is then fine-tuned across five distinct molecular analysis tasks: molecular property prediction (classification and regression), chemical reaction productivity prediction, drug-drug interaction (DDI) prediction, de novo molecule generation (distribution learning and goal-directed), and molecule optimization. The paper demonstrates that a single pre-trained model can serve as a universal foundation for diverse downstream chemistry tasks.

Bridging Scale and Understanding in Molecular SMILES

Prior to X-MOL, most molecular analysis tasks were investigated individually with task-specific models. SMILES-based deep learning methods existed but lacked the benefit of large-scale pre-training that had proven transformative in NLP (BERT, RoBERTa, ERNIE, XLNet, T5). Two challenges motivated this work:

  1. SMILES sacrifices structural information for simplicity. While SMILES is a convenient linear representation, atom connectivity is only implicit in its grammar (branches and ring-closure digits) and 3D geometry is absent entirely, making it harder for models to recover molecular structure from string input.
  2. Labelled molecular data is scarce. Most benchmark datasets (MoleculeNet) contain only thousands of labelled examples, making it difficult to train large models from scratch without overfitting.

The authors hypothesized that massive-scale pre-training on unlabelled SMILES could teach a model the grammar rules and implicit structural information in SMILES, providing a strong initialization for multiple downstream tasks.

Generative Pre-training with Random SMILES

The core innovation in X-MOL is a generative pre-training strategy that exploits the non-uniqueness of SMILES. A single molecule can be represented by many valid SMILES strings (random SMILES), depending on the starting atom, main chain selection, and ring-opening position. X-MOL trains the model to generate a valid alternative SMILES given an input SMILES of the same molecule, forcing the model to:

  1. Reconstruct the molecular structure from the input SMILES
  2. Generate a valid output SMILES following SMILES grammar rules

The architecture uses a shared-parameter encoder-decoder based on the Transformer. Unlike standard encoder-decoder models (e.g., for machine translation), X-MOL shares all parameters between encoder and decoder, forcing both encoding and decoding to occur in the same semantic space. The output SMILES is fully masked during training, and only unidirectional attention is permitted within the output sequence.
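The masking scheme described above can be made concrete as an attention mask. The sketch below (a minimal illustration, not code from the paper) builds a UniLM-style mask in which input tokens attend bidirectionally over the input, while output tokens attend to the full input plus only earlier output tokens:

```python
def seq2seq_attention_mask(n_in, n_out):
    """Illustrative attention mask for a shared encoder-decoder.

    Input tokens attend bidirectionally to all input tokens; output
    tokens attend to all input tokens plus earlier output tokens
    (unidirectional within the output). mask[i][j] == 1 means
    position i may attend to position j.
    """
    n = n_in + n_out
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < n_in:                 # everyone sees the full input
                mask[i][j] = 1
            elif i >= n_in and j <= i:   # causal within the output
                mask[i][j] = 1
    return mask
```

For `n_in = 2, n_out = 2`, the output rows gain visibility one position at a time, while the input rows never see the output.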

The self-attention mechanism computes attention for each character $i$ as:

$$ Z_{i} = \text{SoftMax}\left(\frac{Q_{i} \cdot K^{T}}{\sqrt{D}}\right) \cdot V $$

where $Q_{i}$, $K$, and $V$ are the query, key, and value matrices, and $D$ is the feature dimension. The model uses 12 attention heads to capture different relational patterns.
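A pure-Python sketch of the scaled dot-product attention above (single head, no learned projections; shapes and names are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(D)) V.

    Q, K, V are lists of D-dimensional vectors (one per token).
    Each output row is a weighted average of the value vectors.
    """
    D = len(Q[0])
    Z = []
    for q in Q:
        scores = [sum(qd * kd for qd, kd in zip(q, k)) / math.sqrt(D)
                  for k in K]
        weights = softmax(scores)  # one weight per key
        Z.append([sum(w * v[d] for w, v in zip(weights, V))
                  for d in range(len(V[0]))])
    return Z
```

With a single key, the softmax weight is 1 and the output equals the value vector, which is a quick sanity check on the implementation.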

Model Architecture

  • 12 Transformer encoder layers
  • 768-dimensional hidden units
  • 12 attention heads
  • Character-level SMILES tokenization (108 chemical characters plus 5 special tokens: [PAD], [CLS], [SEP], [MASK], [UNK])
  • Characters within square brackets (e.g., [NH+]) and two-digit ring-closure labels preceded by “%” (e.g., %10) are treated as single tokens
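The tokenization rules above can be sketched with a short regex-based tokenizer (an illustration consistent with the description, not the paper's implementation):

```python
import re

# Bracket atoms like [NH+] and two-digit ring closures like %10 are
# kept as single tokens; everything else is split per character.
SMILES_TOKEN = re.compile(r"\[[^\]]*\]|%\d\d|.")

def tokenize(smiles):
    """Character-level SMILES tokenization with the two special cases."""
    return SMILES_TOKEN.findall(smiles)
```

For example, `tokenize("C[NH+]1CC%10")` yields six tokens: the bracket atom and the %-labelled ring closure each survive as one unit.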

Data Augmentation in Pre-training

Because a molecule has multiple valid random SMILES, the output may differ from the predefined target. To handle this, X-MOL generates multiple training samples per molecule with the same input SMILES but different output random SMILES, and places these in the same mini-batch.
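A minimal sketch of this augmentation step. The variant SMILES would in practice come from a cheminformatics toolkit (e.g., RDKit's random SMILES enumeration); here they are assumed given, and the function name is illustrative:

```python
def build_minibatch(input_smiles, random_variants, k=4):
    """Pair each input SMILES with k alternative random SMILES of the
    same molecule, placing all pairs in one mini-batch as described
    for X-MOL's pre-training augmentation.

    `random_variants` maps an input SMILES to a list of valid
    alternative SMILES for the same molecule (assumed precomputed).
    """
    batch = []
    for smi in input_smiles:
        for out in random_variants[smi][:k]:
            batch.append((smi, out))  # same input, different targets
    return batch
```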

Experimental Setup Across Five Tasks

X-MOL is fine-tuned with task-specific strategies organized into two categories: prediction tasks and generation tasks.

Prediction Tasks

For prediction tasks, the [CLS] token’s output representation is passed through a fully connected network to produce predictions. The input format varies by task:

| Task | Input Format | Loss Function | Metric |
| --- | --- | --- | --- |
| Property prediction (classification) | Single SMILES | Cross-entropy | ROC-AUC |
| Property prediction (regression) | Single SMILES | MSE | RMSE |
| Reaction productivity prediction | Four SMILES (reactant, additive, base, ligand) | MSE | RMSE |
| DDI prediction | Two SMILES (drug pair) | Cross-entropy | Accuracy |
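A toy sketch of the prediction-head pattern: the [CLS] vector passes through a fully connected layer to produce task logits (the paper specifies FC networks but not their sizes; everything here is illustrative):

```python
import math

def fc_head(cls_vec, W, b):
    """One fully connected layer mapping the [CLS] representation to
    task logits: logits = W @ cls_vec + b (sizes illustrative)."""
    return [sum(w * x for w, x in zip(row, cls_vec)) + bi
            for row, bi in zip(W, b)]

def sigmoid(z):
    """Squash a logit to a probability for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))
```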

Molecular Property Prediction (Classification): Four MoleculeNet benchmarks were used: HIV (41,127 compounds), BACE (1,513), BBBP (2,039), and ClinTox (1,484). Data were randomly split 20 times, and average ROC-AUC is reported.

Molecular Property Prediction (Regression): Three MoleculeNet benchmarks: ESOL (1,128), FreeSolv (642), and Lipophilicity (4,200). Data augmentation with random SMILES was applied to the training set. Average RMSE over 20 random splits is reported.

Chemical Reaction Productivity Prediction: The C-N cross-coupling dataset (3,956 reactions) from Ahneman et al. was used with 10-fold cross-validation.

DDI Prediction: The DeepDDI dataset (192,284 DDI pairs, 86 interaction types) was used as benchmark.

Generation Tasks

| Task | Generation Source | Sampling Strategy |
| --- | --- | --- |
| Distribution learning (DL) generation | Fixed initial symbol ([CLS]) | Random sampling |
| Goal-directed (GD) generation | Unfixed initial symbol | Random sampling |
| Molecule optimization | Input molecule | Beam search (beam size = 4) |
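The beam-search decoding used for optimization can be sketched generically as follows. The scoring callback `step_scores` is a stand-in for the model's next-token log-probabilities and is not from the paper:

```python
def beam_search(step_scores, vocab, eos, max_len, beam_size=4):
    """Generic beam-search decoding (beam size 4, as in the paper).

    `step_scores(prefix)` is assumed to return one log-probability per
    vocabulary token given the tokens generated so far.
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished beam
                continue
            scores = step_scores(seq)
            for tok, lp in zip(vocab, scores):
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # keep the best beam_size beams
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]
```

Unlike the random sampling used for DL and GD generation, beam search keeps the highest-scoring partial sequences at every step, which suits optimization where one strong output per input molecule is wanted.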

DL-based Generation: Evaluated on ZINC250K (249,456 molecules) using validity, uniqueness, and novelty.
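The three metrics can be sketched directly from their definitions. The validity checker is assumed external (in practice a toolkit call such as RDKit parsing); here it is passed in:

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty for distribution-learning
    generation.

    - validity:   fraction of generated strings that are valid molecules
    - uniqueness: fraction of valid molecules that are distinct
    - novelty:    fraction of unique molecules absent from training data
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```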

GD Generation: Also on ZINC250K, using QED as the goal property with target QED = 0.948 (the dataset maximum). 10,000 molecules were generated for evaluation.

Molecule Optimization: Evaluated on ZINC250K with QED as the optimization goal. Molecular pairs were constructed by selecting pairs with Tanimoto similarity in [0.6, 0.8], where the lower-QED molecule serves as input and the higher-QED molecule as target.
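The pair-construction step can be sketched with set-based Tanimoto similarity, |A ∩ B| / |A ∪ B| over fingerprint bits. Fingerprints and QED values are assumed precomputed (in practice by a cheminformatics toolkit); the data layout is illustrative:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprint bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def make_optimization_pairs(mols, lo=0.6, hi=0.8):
    """Build (input, target) pairs as described: molecule pairs with
    Tanimoto similarity in [lo, hi], lower-QED as input and higher-QED
    as target. `mols` holds (smiles, fingerprint, qed) triples.
    """
    pairs = []
    for i, (s1, f1, q1) in enumerate(mols):
        for s2, f2, q2 in mols[i + 1:]:
            if lo <= tanimoto(f1, f2) <= hi:
                src, tgt = (s1, s2) if q1 <= q2 else (s2, s1)
                pairs.append((src, tgt))
    return pairs
```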

Key Results

Classification (ROC-AUC, higher is better): X-MOL achieved state-of-the-art on all four datasets, outperforming both shallow learning methods and deep learning baselines including graph convolutional models.

Regression (RMSE, lower is better): X-MOL achieved the best RMSE on ESOL, FreeSolv, and Lipophilicity.

Reaction Productivity: X-MOL obtained an average RMSE of 0.0626, compared to the random forest baseline of 0.078.

DDI Prediction: X-MOL achieved accuracy of 0.952, improving over DeepDDI’s 0.924.

DL-based Generation:

| Method | Validity | Uniqueness | Novelty |
| --- | --- | --- | --- |
| GCPN | 20% | 99.97% | 100% |
| MRNN | 65% | 99.89% | 100% |
| GraphAF | 68% | 99.10% | 100% |
| X-MOL | 85.28% | 99.91% | 100% |

GD Generation: X-MOL generated all top-3 molecules with QED = 0.948, matching the dataset maximum. GraphAF reached 0.948/0.948/0.947, while JT-VAE and MRNN fell further behind.

Knowledge Embedding Ablation

The paper tested three additional embedding strategies to inject structural information into the model:

  • Link embedding: Encodes connection information between atoms (position of the previous connected atom)
  • Ring embedding: Encodes ring structure information from SMILES number pairs
  • Type embedding: Categorizes characters into 9 types (atoms, bonds, structural symbols)

None of these additional embeddings improved performance on the HIV or DDI tasks, whether with or without pre-training. The authors conclude that SMILES already contains sufficient information for molecular understanding and that pre-training effectively extracts this information, a finding they label “SMILES is all you need.”

Attention Visualization

The authors provide attention heatmap analysis demonstrating that:

  • Middle layers (e.g., layer 9) reconstruct molecular structure by correctly identifying atom connectivity and ring closures
  • Later layers abstract higher-level features for property prediction
  • In multi-input prediction tasks (reaction productivity), attention reveals which reaction components are most important (e.g., the ligand receives highest cross-attention)
  • In generation tasks, attention patterns differ between DL (self-focused), GD (source-constrained), and optimization (gradual shift from input to output)

Findings, Limitations, and Future Directions

X-MOL demonstrates that large-scale pre-training on SMILES can produce a single model that achieves competitive or state-of-the-art performance across five distinct molecular analysis tasks. The key findings are:

  1. Scale enables SMILES understanding. Pre-training on 1.1 billion molecules allows the model to learn SMILES grammar rules well enough to outperform graph-based methods on molecule generation validity.
  2. Unified framework. A single pre-trained backbone serves classification, regression, reaction prediction, DDI prediction, and generative tasks through different fine-tuning strategies.
  3. SMILES is sufficient. Additional knowledge embeddings (link, ring, type) do not improve performance, suggesting pre-training extracts the necessary structural information from SMILES alone.
  4. Interpretable attention. Attention visualization confirms that the model reconstructs molecular structure internally.

Limitations (observed):

  • Property-prediction evaluation covers only a handful of MoleculeNet datasets. No scaffold splits or temporal splits are used; all splits are random, which can overestimate performance on structurally novel compounds.
  • Comparison baselines are somewhat dated (2018-2019 era methods), and the paper does not compare against concurrent SMILES pre-training methods.
  • The molecule generation validity (85.28%) is much higher than graph baselines like GCPN (20%), but later work achieved near 100% validity with constrained SMILES grammars.
  • No code or model weights have been publicly released, limiting independent verification.
  • The paper remains a bioRxiv preprint and has not been published in a peer-reviewed venue.

Future directions proposed by the authors include: better pre-training strategies, extension to graph-based representations, and fine-tuning on additional downstream tasks.


Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Pre-training | ZINC15 | 1.1 billion molecules | Random SMILES augmentation |
| Classification | HIV (MoleculeNet) | 41,127 | Binary classification |
| Classification | BACE (MoleculeNet) | 1,513 | Binary classification |
| Classification | BBBP (MoleculeNet) | 2,039 | Binary classification |
| Classification | ClinTox (MoleculeNet) | 1,484 | Two sub-datasets, averaged |
| Regression | ESOL (MoleculeNet) | 1,128 | Water solubility |
| Regression | FreeSolv (MoleculeNet) | 642 | Hydration free energy |
| Regression | Lipophilicity (MoleculeNet) | 4,200 | logD at pH 7.4 |
| Reaction | C-N cross-coupling | 3,956 | From Ahneman et al. (2018) |
| DDI | DeepDDI | 192,284 DDI pairs | 86 interaction types |
| Generation | ZINC250K | 249,456 | For DL, GD, and optimization |

Algorithms

  • Pre-training: Generative SMILES-to-SMILES with shared encoder-decoder Transformer
  • Fine-tuning prediction tasks: [CLS] token passed through fully connected layers
  • Fine-tuning generation tasks: Autoregressive generation with random sampling (DL, GD) or beam search (optimization)
  • Data augmentation: Random SMILES augmentation for regression tasks
  • Repeated training: 20 random splits with averaged results for classification/regression
  • 10-fold cross-validation for reaction productivity

Models

  • 12-layer Transformer, 768 hidden dimensions, 12 attention heads
  • Character-level tokenization: 108 chemical characters + 5 special tokens
  • Implemented in PaddlePaddle framework

Evaluation

| Task | Metric | X-MOL | Best Baseline |
| --- | --- | --- | --- |
| HIV (classification) | ROC-AUC | State-of-the-art | Previous best (various) |
| BACE (classification) | ROC-AUC | State-of-the-art | Previous best (various) |
| BBBP (classification) | ROC-AUC | State-of-the-art | Previous best (various) |
| ClinTox (classification) | ROC-AUC | State-of-the-art | Previous best (various) |
| ESOL (regression) | RMSE | State-of-the-art | Previous best (various) |
| FreeSolv (regression) | RMSE | State-of-the-art | Previous best (various) |
| Lipophilicity (regression) | RMSE | State-of-the-art | Previous best (various) |
| C-N coupling | RMSE | 0.0626 | 0.078 (random forest) |
| DDI prediction | Accuracy | 0.952 | 0.924 (DeepDDI) |
| DL generation | Validity | 85.28% | 68% (GraphAF) |
| GD generation | Top-3 QED | All 0.948 | 0.948/0.948/0.947 (GraphAF) |

Hardware

  • Pre-training: 8-16 Tesla P40 GPUs (24 GB each), approximately 4 days
  • Data pre-processing: Over 1,000 CPUs with Hadoop

Artifacts

No code, model weights, or pre-trained checkpoints have been publicly released. The model was implemented in Baidu’s PaddlePaddle framework, but no repository is available.

Reproducibility status: Closed. While the datasets are all publicly available (ZINC15, MoleculeNet, ZINC250K, DeepDDI, C-N coupling), the model implementation, pre-trained weights, and fine-tuning code are not released. The computational requirements (1,000+ CPUs for data processing, 8-16 GPUs for 4 days of pre-training) are substantial.


Paper Information

Citation: Xue, D., Zhang, H., Xiao, D., Gong, Y., Chuai, G., Sun, Y., Tian, H., Wu, H., Li, Y., & Liu, Q. (2020). X-MOL: Large-scale pre-training for molecular understanding and diverse molecular analysis. bioRxiv.

@article{xue2020xmol,
  title={X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis},
  author={Xue, Dongyu and Zhang, Han and Xiao, Dongling and Gong, Yukang and Chuai, Guohui and Sun, Yu and Tian, Hao and Wu, Hua and Li, Yukun and Liu, Qi},
  journal={bioRxiv},
  year={2020},
  doi={10.1101/2020.12.23.424259},
  publisher={Cold Spring Harbor Laboratory}
}