A Unified Molecular Pre-training Framework

X-MOL is a methods paper that introduces a large-scale pre-training framework for SMILES-based molecular understanding. The primary contribution is a Transformer encoder-decoder model pre-trained on 1.1 billion molecules from ZINC15, which is then fine-tuned across five distinct molecular analysis tasks: molecular property prediction (classification and regression), chemical reaction productivity prediction, drug-drug interaction (DDI) prediction, de novo molecule generation (distribution learning and goal-directed), and molecule optimization. The paper demonstrates that a single pre-trained model can serve as a universal foundation for diverse downstream chemistry tasks.

Bridging Scale and Understanding in Molecular SMILES

Prior to X-MOL, most molecular analysis tasks were investigated individually with task-specific models. SMILES-based deep learning methods existed but lacked the benefit of large-scale pre-training that had proven transformative in NLP (BERT, RoBERTa, ERNIE, XLNet, T5). Two challenges motivated this work:

  1. SMILES sacrifices structural information for simplicity. While SMILES is a convenient linear representation, atom connectivity is only implicit in its grammar (branches and ring-closure digits) and 3D geometry is absent entirely, making it harder for models to recover molecular structure from string input.
  2. Labelled molecular data is scarce. Most benchmark datasets (MoleculeNet) contain only thousands of labelled examples, making it difficult to train large models from scratch without overfitting.

The authors hypothesized that massive-scale pre-training on unlabelled SMILES could teach a model the grammar rules and implicit structural information in SMILES, providing a strong initialization for multiple downstream tasks.

Generative Pre-training with Random SMILES

The core innovation in X-MOL is a generative pre-training strategy that exploits the non-uniqueness of SMILES. A single molecule can be represented by many valid SMILES strings (random SMILES), depending on the starting atom, main chain selection, and ring-opening position. X-MOL trains the model to generate a valid alternative SMILES given an input SMILES of the same molecule, forcing the model to:

  1. Reconstruct the molecular structure from the input SMILES
  2. Generate a valid output SMILES following SMILES grammar rules

The architecture uses a shared-parameter encoder-decoder based on the Transformer. Unlike standard encoder-decoder models (e.g., for machine translation), X-MOL shares all parameters between encoder and decoder, forcing both encoding and decoding to occur in the same semantic space. The output SMILES is fully masked during training, and only unidirectional attention is permitted within the output sequence.
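The masking scheme described above can be made concrete as an attention mask. The sketch below (a minimal illustration, not code from the paper) builds a UniLM-style mask in which input tokens attend bidirectionally over the input, while output tokens attend to the full input plus only earlier output tokens:

```python
def seq2seq_attention_mask(n_in, n_out):
    """Illustrative attention mask for a shared encoder-decoder.

    Input tokens attend bidirectionally to all input tokens; output
    tokens attend to all input tokens plus earlier output tokens
    (unidirectional within the output). mask[i][j] == 1 means
    position i may attend to position j.
    """
    n = n_in + n_out
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < n_in:                 # everyone sees the full input
                mask[i][j] = 1
            elif i >= n_in and j <= i:   # causal within the output
                mask[i][j] = 1
    return mask
```

For `n_in = 2, n_out = 2`, the output rows gain visibility one position at a time, while the input rows never see the output.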

The self-attention mechanism computes attention for each character $i$ as:

$$ Z_{i} = \text{SoftMax}\left(\frac{Q_{i} \cdot K^{T}}{\sqrt{D}}\right) \cdot V $$

where $Q_{i}$, $K$, and $V$ are the query, key, and value matrices, and $D$ is the feature dimension. The model uses 12 attention heads to capture different relational patterns.
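A pure-Python sketch of the scaled dot-product attention above (single head, no learned projections; shapes and names are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(D)) V.

    Q, K, V are lists of D-dimensional vectors (one per token).
    Each output row is a weighted average of the value vectors.
    """
    D = len(Q[0])
    Z = []
    for q in Q:
        scores = [sum(qd * kd for qd, kd in zip(q, k)) / math.sqrt(D)
                  for k in K]
        weights = softmax(scores)  # one weight per key
        Z.append([sum(w * v[d] for w, v in zip(weights, V))
                  for d in range(len(V[0]))])
    return Z
```

With a single key, the softmax weight is 1 and the output equals the value vector, which is a quick sanity check on the implementation.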

Model Architecture

  • 12 Transformer encoder layers
  • 768-dimensional hidden units
  • 12 attention heads
  • Character-level SMILES tokenization (108 chemical characters plus 5 special tokens: [PAD], [CLS], [SEP], [MASK], [UNK])
  • Characters within square brackets (e.g., [NH+]) and two-digit ring-closure labels preceded by “%” (e.g., %10) are treated as single tokens
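The tokenization rules above can be sketched with a short regex-based tokenizer (an illustration consistent with the description, not the paper's implementation):

```python
import re

# Bracket atoms like [NH+] and two-digit ring closures like %10 are
# kept as single tokens; everything else is split per character.
SMILES_TOKEN = re.compile(r"\[[^\]]*\]|%\d\d|.")

def tokenize(smiles):
    """Character-level SMILES tokenization with the two special cases."""
    return SMILES_TOKEN.findall(smiles)
```

For example, `tokenize("C[NH+]1CC%10")` yields six tokens: the bracket atom and the %-labelled ring closure each survive as one unit.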

Data Augmentation in Pre-training

Because a molecule has multiple valid random SMILES, the output may differ from the predefined target. To handle this, X-MOL generates multiple training samples per molecule with the same input SMILES but different output random SMILES, and places these in the same mini-batch.
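A minimal sketch of this augmentation step. The variant SMILES would in practice come from a cheminformatics toolkit (e.g., RDKit's random SMILES enumeration); here they are assumed given, and the function name is illustrative:

```python
def build_minibatch(input_smiles, random_variants, k=4):
    """Pair each input SMILES with k alternative random SMILES of the
    same molecule, placing all pairs in one mini-batch as described
    for X-MOL's pre-training augmentation.

    `random_variants` maps an input SMILES to a list of valid
    alternative SMILES for the same molecule (assumed precomputed).
    """
    batch = []
    for smi in input_smiles:
        for out in random_variants[smi][:k]:
            batch.append((smi, out))  # same input, different targets
    return batch
```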

Experimental Setup Across Five Tasks

X-MOL is fine-tuned with task-specific strategies organized into two categories: prediction tasks and generation tasks.

Prediction Tasks

For prediction tasks, the [CLS] token’s output representation is passed through a fully connected network to produce predictions. The input format varies by task:

| Task | Input Format | Loss Function | Metric |
| --- | --- | --- | --- |
| Property prediction (classification) | Single SMILES | Cross-entropy | ROC-AUC |
| Property prediction (regression) | Single SMILES | MSE | RMSE |
| Reaction productivity prediction | Four SMILES (reactant, additive, base, ligand) | MSE | RMSE |
| DDI prediction | Two SMILES (drug pair) | Cross-entropy | Accuracy |
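A toy sketch of the prediction-head pattern: the [CLS] vector passes through a fully connected layer to produce task logits (the paper specifies FC networks but not their sizes; everything here is illustrative):

```python
import math

def fc_head(cls_vec, W, b):
    """One fully connected layer mapping the [CLS] representation to
    task logits: logits = W @ cls_vec + b (sizes illustrative)."""
    return [sum(w * x for w, x in zip(row, cls_vec)) + bi
            for row, bi in zip(W, b)]

def sigmoid(z):
    """Squash a logit to a probability for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))
```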

Molecular Property Prediction (Classification): Four MoleculeNet benchmarks were used: HIV (41,127 compounds), BACE (1,513), BBBP (2,039), and ClinTox (1,484). Data were randomly split 20 times, and average ROC-AUC is reported.

Molecular Property Prediction (Regression): Three MoleculeNet benchmarks: ESOL (1,128), FreeSolv (642), and Lipophilicity (4,200). Data augmentation with random SMILES was applied to the training set. Average RMSE over 20 random splits is reported.

Chemical Reaction Productivity Prediction: The C-N cross-coupling dataset (3,956 reactions) from Ahneman et al. was used with 10-fold cross-validation.

DDI Prediction: The DeepDDI dataset (192,284 DDI pairs, 86 interaction types) was used as benchmark.

Generation Tasks

| Task | Generation Source | Sampling Strategy |
| --- | --- | --- |
| Distribution learning (DL) generation | Fixed initial symbol ([CLS]) | Random sampling |
| Goal-directed (GD) generation | Unfixed initial symbol | Random sampling |
| Molecule optimization | Input molecule | Beam search (beam size = 4) |
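The beam-search decoding used for optimization can be sketched generically as follows. The scoring callback `step_scores` is a stand-in for the model's next-token log-probabilities and is not from the paper:

```python
def beam_search(step_scores, vocab, eos, max_len, beam_size=4):
    """Generic beam-search decoding (beam size 4, as in the paper).

    `step_scores(prefix)` is assumed to return one log-probability per
    vocabulary token given the tokens generated so far.
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished beam
                continue
            scores = step_scores(seq)
            for tok, lp in zip(vocab, scores):
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # keep the best beam_size beams
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]
```

Unlike the random sampling used for DL and GD generation, beam search keeps the highest-scoring partial sequences at every step, which suits optimization where one strong output per input molecule is wanted.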

DL-based Generation: Evaluated on ZINC250K (249,456 molecules) using validity, uniqueness, and novelty.
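The three metrics can be sketched directly from their definitions. The validity checker is assumed external (in practice a toolkit call such as RDKit parsing); here it is passed in:

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty for distribution-learning
    generation.

    - validity:   fraction of generated strings that are valid molecules
    - uniqueness: fraction of valid molecules that are distinct
    - novelty:    fraction of unique molecules absent from training data
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```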

GD Generation: Also on ZINC250K, using QED as the goal property with target QED = 0.948 (the dataset maximum). 10,000 molecules were generated for evaluation.

Molecule Optimization: Evaluated on ZINC250K with QED as the optimization goal. Molecular pairs were constructed by selecting pairs with Tanimoto similarity in [0.6, 0.8], where the lower-QED molecule serves as input and the higher-QED molecule as target.
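The pair-construction step can be sketched with set-based Tanimoto similarity, |A ∩ B| / |A ∪ B| over fingerprint bits. Fingerprints and QED values are assumed precomputed (in practice by a cheminformatics toolkit); the data layout is illustrative:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprint bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def make_optimization_pairs(mols, lo=0.6, hi=0.8):
    """Build (input, target) pairs as described: molecule pairs with
    Tanimoto similarity in [lo, hi], lower-QED as input and higher-QED
    as target. `mols` holds (smiles, fingerprint, qed) triples.
    """
    pairs = []
    for i, (s1, f1, q1) in enumerate(mols):
        for s2, f2, q2 in mols[i + 1:]:
            if lo <= tanimoto(f1, f2) <= hi:
                src, tgt = (s1, s2) if q1 <= q2 else (s2, s1)
                pairs.append((src, tgt))
    return pairs
```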

Key Results

Classification (ROC-AUC, higher is better): X-MOL achieved state-of-the-art on all four datasets, outperforming both shallow learning methods and deep learning baselines including graph convolutional models.

Regression (RMSE, lower is better): X-MOL achieved the best RMSE on ESOL, FreeSolv, and Lipophilicity.

Reaction Productivity: X-MOL obtained an average RMSE of 0.0626, compared to the random forest baseline of 0.078.

DDI Prediction: X-MOL achieved accuracy of 0.952, improving over DeepDDI’s 0.924.

DL-based Generation:

| Method | Validity | Uniqueness | Novelty |
| --- | --- | --- | --- |
| GCPN | 20% | 99.97% | 100% |
| MRNN | 65% | 99.89% | 100% |
| GraphAF | 68% | 99.10% | 100% |
| X-MOL | 85.28% | 99.91% | 100% |

GD Generation: X-MOL generated all top-3 molecules with QED = 0.948, matching the dataset maximum. GraphAF reached 0.948/0.948/0.947, while JT-VAE and MRNN fell further behind.

Knowledge Embedding Ablation

The paper tested three additional embedding strategies to inject structural information into the model:

  • Link embedding: Encodes connection information between atoms (position of the previous connected atom)
  • Ring embedding: Encodes ring structure information from SMILES number pairs
  • Type embedding: Categorizes characters into 9 types (atoms, bonds, structural symbols)

None of these additional embeddings improved performance on the HIV or DDI tasks, whether with or without pre-training. The authors conclude that SMILES already contains sufficient information for molecular understanding and that pre-training effectively extracts this information, a finding they label “SMILES is all you need.”

Attention Visualization

The authors provide attention heatmap analysis demonstrating that:

  • Middle layers (e.g., layer 9) reconstruct molecular structure by correctly identifying atom connectivity and ring closures
  • Later layers abstract higher-level features for property prediction
  • In multi-input prediction tasks (reaction productivity), attention reveals which reaction components are most important (e.g., the ligand receives highest cross-attention)
  • In generation tasks, attention patterns differ between DL (self-focused), GD (source-constrained), and optimization (gradual shift from input to output)

Findings, Limitations, and Future Directions

X-MOL demonstrates that large-scale pre-training on SMILES can produce a single model that achieves competitive or state-of-the-art performance across five distinct molecular analysis tasks. The key findings are:

  1. Scale enables SMILES understanding. Pre-training on 1.1 billion molecules allows the model to learn SMILES grammar rules well enough to outperform graph-based methods on molecule generation validity.
  2. Unified framework. A single pre-trained backbone serves classification, regression, reaction prediction, DDI prediction, and generative tasks through different fine-tuning strategies.
  3. SMILES is sufficient. Additional knowledge embeddings (link, ring, type) do not improve performance, suggesting pre-training extracts the necessary structural information from SMILES alone.
  4. Interpretable attention. Attention visualization confirms that the model reconstructs molecular structure internally.

Limitations (observed):

  • Property-prediction evaluation covers only a handful of MoleculeNet datasets. No scaffold splits or temporal splits are used; all splits are random, which can overestimate performance on structurally novel compounds.
  • Comparison baselines are somewhat dated (2018-2019 era methods), and the paper does not compare against concurrent SMILES pre-training methods.
  • The molecule generation validity (85.28%) is much higher than graph baselines like GCPN (20%), but later work achieved near 100% validity with constrained SMILES grammars.
  • No code or model weights have been publicly released, limiting independent verification.
  • The paper remains a bioRxiv preprint and has not been published in a peer-reviewed venue.

Future directions proposed by the authors include: better pre-training strategies, extension to graph-based representations, and fine-tuning on additional downstream tasks.


Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Pre-training | ZINC15 | 1.1 billion molecules | Random SMILES augmentation |
| Classification | HIV (MoleculeNet) | 41,127 | Binary classification |
| Classification | BACE (MoleculeNet) | 1,513 | Binary classification |
| Classification | BBBP (MoleculeNet) | 2,039 | Binary classification |
| Classification | ClinTox (MoleculeNet) | 1,484 | Two sub-datasets, averaged |
| Regression | ESOL (MoleculeNet) | 1,128 | Water solubility |
| Regression | FreeSolv (MoleculeNet) | 642 | Hydration free energy |
| Regression | Lipophilicity (MoleculeNet) | 4,200 | logD at pH 7.4 |
| Reaction | C-N cross-coupling | 3,956 | From Ahneman et al. (2018) |
| DDI | DeepDDI | 192,284 DDI pairs | 86 interaction types |
| Generation | ZINC250K | 249,456 | For DL, GD, and optimization |

Algorithms

  • Pre-training: Generative SMILES-to-SMILES with shared encoder-decoder Transformer
  • Fine-tuning prediction tasks: [CLS] token passed through fully connected layers
  • Fine-tuning generation tasks: Autoregressive generation with random sampling (DL, GD) or beam search (optimization)
  • Data augmentation: Random SMILES augmentation for regression tasks
  • Repeated training: 20 random splits with averaged results for classification/regression
  • 10-fold cross-validation for reaction productivity

Models

  • 12-layer Transformer, 768 hidden dimensions, 12 attention heads
  • Character-level tokenization: 108 chemical characters + 5 special tokens
  • Implemented in PaddlePaddle framework

Evaluation

| Task | Metric | X-MOL | Best Baseline |
| --- | --- | --- | --- |
| HIV (classification) | ROC-AUC | State-of-the-art | Previous best (various) |
| BACE (classification) | ROC-AUC | State-of-the-art | Previous best (various) |
| BBBP (classification) | ROC-AUC | State-of-the-art | Previous best (various) |
| ClinTox (classification) | ROC-AUC | State-of-the-art | Previous best (various) |
| ESOL (regression) | RMSE | State-of-the-art | Previous best (various) |
| FreeSolv (regression) | RMSE | State-of-the-art | Previous best (various) |
| Lipophilicity (regression) | RMSE | State-of-the-art | Previous best (various) |
| C-N coupling | RMSE | 0.0626 | 0.078 (random forest) |
| DDI prediction | Accuracy | 0.952 | 0.924 (DeepDDI) |
| DL generation | Validity | 85.28% | 68% (GraphAF) |
| GD generation | Top-3 QED | All 0.948 | 0.948/0.948/0.947 (GraphAF) |

Hardware

  • Pre-training: 8-16 Tesla P40 GPUs (24 GB each), approximately 4 days
  • Data pre-processing: Over 1,000 CPUs with Hadoop

Artifacts

No code, model weights, or pre-trained checkpoints have been publicly released. The model was implemented in Baidu’s PaddlePaddle framework, but no repository is available.

Reproducibility status: Closed. While the datasets are all publicly available (ZINC15, MoleculeNet, ZINC250K, DeepDDI, C-N coupling), the model implementation, pre-trained weights, and fine-tuning code are not released. The computational requirements (1,000+ CPUs for data processing, 8-16 GPUs for 4 days of pre-training) are substantial.


Paper Information

Citation: Xue, D., Zhang, H., Xiao, D., Gong, Y., Chuai, G., Sun, Y., Tian, H., Wu, H., Li, Y., & Liu, Q. (2020). X-MOL: Large-scale pre-training for molecular understanding and diverse molecular analysis. bioRxiv.

@article{xue2020xmol,
  title={X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis},
  author={Xue, Dongyu and Zhang, Han and Xiao, Dongling and Gong, Yukang and Chuai, Guohui and Sun, Yu and Tian, Hao and Wu, Hua and Li, Yukun and Liu, Qi},
  journal={bioRxiv},
  year={2020},
  doi={10.1101/2020.12.23.424259},
  publisher={Cold Spring Harbor Laboratory}
}