SPE: Data-Driven SMILES Substructure Tokenization

A Data-Driven Tokenization Method for Chemical Deep Learning

This is a Method paper that introduces SMILES Pair Encoding (SPE), a tokenization algorithm adapted from byte pair encoding (BPE) in natural language processing. The primary contribution is a data-driven approach that learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset and then uses that vocabulary to tokenize SMILES for downstream deep learning tasks. The authors provide an open-source Python package (SmilesPE) and demonstrate improvements on both molecular generation and QSAR prediction benchmarks.

Limitations of Atom-Level SMILES Tokenization

SMILES-based deep learning models require tokenization to convert molecular strings into sequences of discrete units. The standard approaches have well-known drawbacks:

Character-level tokenization breaks SMILES character by character, splitting chemically meaningful multi-character atoms. For example, [C@@H] becomes six separate tokens ([, C, @, @, H, ]), so the stereochemistry of a single carbon is scattered across tokens that carry no meaning on their own.
Atom-level tokenization addresses some of these issues by treating multi-character element symbols (Cl, Br) and bracketed atoms ([nH], [O-]) as single tokens. However, these tokens still encode only individual atoms, not substructures.
k-mer tokenization (sequences of k consecutive overlapping characters) captures some connectivity information but suffers from the out-of-vocabulary problem: the model cannot represent k-mers not seen during training.

All three approaches produce relatively long input sequences (mean ~40 tokens per molecule on ChEMBL at the atom level), which increases computational cost for sequential architectures like RNNs and exacerbates long-range dependency issues.
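
The difference between the three schemes is easiest to see on a small example. The sketch below is illustrative only: the regular expression is a simplified atom-level pattern assumed here (not necessarily the one used by SmilesPE), the function names are hypothetical, and building k-mers over atom-level tokens is also an assumption.

```python
import re

# Simplified atom-level pattern (an assumption for illustration):
# bracketed atoms, two-letter halogens, two-digit ring closures,
# then any remaining single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def char_tokenize(smiles):
    """Character-level: every character is its own token."""
    return list(smiles)


def atom_tokenize(smiles):
    """Atom-level: bracketed atoms and Cl/Br stay intact."""
    return ATOM_PATTERN.findall(smiles)


def kmer_tokenize(smiles, k=4):
    """Overlapping k-mers over atom-level tokens (assumed construction)."""
    atoms = atom_tokenize(smiles)
    if not atoms:
        return []
    if len(atoms) <= k:
        return ["".join(atoms)]
    return ["".join(atoms[i:i + k]) for i in range(len(atoms) - k + 1)]


smiles = "C[C@@H](N)C(=O)Cl"
print(char_tokenize(smiles))
print(atom_tokenize(smiles))
print(kmer_tokenize(smiles))
```

Note that the atom-level output keeps [C@@H] and Cl whole, while the character-level output does not.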

Core Innovation: Adapting Byte Pair Encoding for SMILES

SPE adapts the byte pair encoding algorithm, originally developed for data compression and later adopted for subword tokenization in NLP, to the domain of chemical strings. The algorithm has two phases:

Vocabulary training:

Tokenize SMILES from a large dataset (ChEMBL) at the atom level
Initialize the vocabulary with all unique atom-level tokens
Iteratively count the frequency of all adjacent token pairs, merge the most frequent pair into a new token, and add it to the vocabulary
Stop when either the maximum vocabulary size (MVS) or a minimum frequency threshold (FT) is reached
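
The training steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SmilesPE implementation: `learn_spe_vocab` and `merge_pair` are hypothetical names, the atom-level pattern is simplified, and the toy corpus and thresholds are chosen only so the loop terminates quickly.

```python
import re
from collections import Counter

# Simplified atom-level pattern (assumption, for illustration only).
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_spe_vocab(smiles_list, max_vocab_size, min_frequency):
    """Return (vocabulary, ordered merge list) learned by iterative pair merging."""
    corpus = [ATOM_PATTERN.findall(s) for s in smiles_list]
    vocab = {tok for toks in corpus for tok in toks}  # all atom-level tokens
    merges = []
    while len(vocab) < max_vocab_size:  # MVS stopping criterion
        pair_counts = Counter()
        for toks in corpus:
            pair_counts.update(zip(toks, toks[1:]))
        if not pair_counts:
            break
        best, freq = pair_counts.most_common(1)[0]
        if freq < min_frequency:  # FT stopping criterion
            break
        corpus = [merge_pair(toks, best) for toks in corpus]
        vocab.add(best[0] + best[1])
        merges.append(best)
    return vocab, merges


vocab, merges = learn_spe_vocab(["CCO", "CCN", "CCO"], max_vocab_size=10, min_frequency=2)
print(sorted(vocab), merges)
```

The order of `merges` records how early each pair was learned, which is the ranking a tokenizer would later consult.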

Tokenization: Given a trained SPE vocabulary, a new SMILES string is first tokenized at the atom level. Adjacent token pairs are then merged iteratively, highest-ranked pair first (the rank reflects how frequent the pair was during vocabulary training), until no remaining pair appears in the vocabulary.
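
A minimal sketch of this tokenization phase follows. It assumes the vocabulary is available as an ordered list of merged pairs (earlier means higher rank); `spe_tokenize`, the example merge list, and the simplified atom-level pattern are all hypothetical and do not reproduce the SmilesPE API.

```python
import re

# Simplified atom-level pattern (assumption, for illustration only).
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def spe_tokenize(smiles, merges):
    """Apply learned merges to a SMILES string, best-ranked pair first."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = ATOM_PATTERN.findall(smiles)
    while len(tokens) > 1:
        known = [pair for pair in zip(tokens, tokens[1:]) if pair in ranks]
        if not known:
            break  # no adjacent pair is in the vocabulary
        best = min(known, key=ranks.__getitem__)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Toy merge list, not taken from the published vocabulary.
toy_merges = [("C", "C"), ("CC", "O")]
print(spe_tokenize("CCOCCN", toy_merges))
```

Strings containing no known pairs simply fall back to their atom-level tokens, so this scheme never needs an unknown-token placeholder as long as every atom-level token is in the vocabulary.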

The key hyperparameters are MVS and FT. In the reported experiments, MVS was set to 30,000 and FT was set to 2,000. The vocabulary was trained on ~3.4 million SMILES (one canonical and one non-canonical variant per molecule) from ChEMBL25. The resulting vocabulary contained 3,002 unique SMILES substrings with lengths ranging from 1 to 22 atom-level characters. Because 3,002 is far below the MVS of 30,000, training was evidently halted by the frequency threshold rather than by the vocabulary size limit.

The trained SPE vocabulary produces tokens that are human-readable and correspond to chemically meaningful substructures and functional groups. SPE tokenization reduces the mean sequence length from approximately 40 tokens (atom-level) to approximately 6 tokens on ChEMBL, a roughly 6-7x compression. This shorter representation directly reduces computational cost for RNN-based and other sequential models.

The algorithm is also compatible with other text-based molecular representations such as DeepSMILES and SELFIES, since these strings can likewise be split into atom-level tokens that serve as the starting point for pair merging.

Molecular Generation and QSAR Prediction Experiments

Molecular Generation

The authors trained AWD-LSTM language models with SPE and atom-level tokenization on 9 million SMILES (1 canonical + 5 non-canonical per compound from ChEMBL25). Each model sampled 1 million SMILES for evaluation. The AWD-LSTM architecture used an embedding size of 400, three LSTM layers with 1,152 hidden units each, and various dropout settings (embedding: 0.1, input: 0.6, weight: 0.5, hidden: 0.2). Models were trained for 10 epochs with a base learning rate of 0.008 using one-cycle scheduling.

Metric	SPE	Atom-level
Validity	0.941	0.970
Uniqueness	0.994	0.992
Novelty	0.983	0.978
Internal diversity	0.897	0.886
Nearest neighbor similarity	0.391	0.386

The SPE model generated a more diverse population of novel molecules at the cost of slightly lower validity (94.1% vs. 97.0%). Internal diversity is defined as:

$$ \text{Internal diversity} = 1 - \frac{1}{|G|^2} \sum_{(x_1, x_2) \in G \times G} T(x_1, x_2) $$

where $G$ is the set of generated molecules and $T(x_1, x_2)$ is the Tanimoto similarity between molecules $x_1$ and $x_2$ using 1024-bit ECFP6 fingerprints. Nearest neighbor similarity (SNN) measures how closely the generated set $G$ resembles the reference set $R$:

$$ \text{SNN} = \frac{1}{|G|} \sum_{x_G \in G} \max_{x_R \in R} T(x_G, x_R) $$
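
Both metrics reduce to simple loops over Tanimoto similarities. The sketch below is an illustration under stated assumptions: fingerprints are represented as Python sets of on-bit indices (in practice these would be ECFP6 bit vectors from a cheminformatics toolkit), internal diversity is taken as one minus the mean similarity over all ordered pairs in $G \times G$, and the function names and toy data are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def internal_diversity(generated):
    """1 minus the mean Tanimoto similarity over all ordered pairs in G x G."""
    n = len(generated)
    total = sum(tanimoto(x1, x2) for x1 in generated for x2 in generated)
    return 1.0 - total / (n * n)


def nearest_neighbor_similarity(generated, reference):
    """Mean, over generated molecules, of the best similarity to the reference set."""
    best = [max(tanimoto(g, r) for r in reference) for g in generated]
    return sum(best) / len(best)


# Toy fingerprints (made-up bit indices, not real ECFP6 output).
G = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
R = [{1, 2, 3}, {7, 8, 9}]
print(internal_diversity(G), nearest_neighbor_similarity(G, R))
```

The double loop is quadratic in the number of molecules, so at the scale of 1 million samples a vectorized or subsampled computation would be needed.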

Substructure coverage analysis showed both models recovered the same top-1000 BRICS fragments (100% coverage), but SPE consistently outperformed atom-level tokenization on top-5000 coverage across all four substructure types: BRICS fragments (0.997 vs. 0.987), functional groups (0.688 vs. 0.659), scaffolds (0.872 vs. 0.825), and ring systems (0.781 vs. 0.761).
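
One plausible reading of top-k coverage is the fraction of the k most frequent reference substructures that also appear among the generated molecules. The sketch below implements that reading as an assumption; the exact definition used in the paper may differ, and the function name and toy fragment labels are hypothetical.

```python
from collections import Counter


def top_k_coverage(reference_fragments, generated_fragments, k):
    """Fraction of the k most frequent reference fragments seen in the generated set."""
    top = [frag for frag, _ in Counter(reference_fragments).most_common(k)]
    generated = set(generated_fragments)
    return sum(frag in generated for frag in top) / len(top)


# Toy fragment labels standing in for BRICS fragments, scaffolds, etc.
reference = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
generated = ["a", "c", "x"]
print(top_k_coverage(reference, generated, k=3))
```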

QSAR Prediction

QSAR models were built using the MolPMoFiT transfer learning framework, which pre-trains a language model on ChEMBL and then fine-tunes it for specific prediction tasks. The evaluation used 24 regression benchmarks (pIC50 values) from Cortes-Ciriano et al., covering targets ranging from 199 molecules (alpha-2a adrenergic receptor) to 5,010 molecules (hERG). Models were evaluated on 10 random 80:10:10 splits using RMSE, R-squared, and MAE. Random forest models with 1024-bit ECFP6 were included as baseline comparisons.
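
The evaluation protocol can be sketched with the standard library alone. The code below is illustrative: `split_80_10_10` is a hypothetical helper, and R-squared is computed as the coefficient of determination, which is an assumption since the summary does not state which definition was used.

```python
import math
import random


def split_80_10_10(items, seed):
    """Shuffle with a fixed seed and split into train/validation/test (80:10:10)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_valid = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Ten random splits would use ten different seeds.
train, valid, test = split_80_10_10(range(100), seed=0)
print(len(train), len(valid), len(test))
```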

Cohen’s d effect sizes were computed to quantify performance differences between tokenization methods. SPE performed comparably to or better than atom-level tokenization on 23 out of 24 datasets. Notable results with medium or large effect sizes favoring SPE included cannabinoid CB1 receptor (large effect), alpha-2a adrenergic receptor, LCK, estrogen receptor, and Aurora-A kinase (all medium effects). Against k-mer tokenization, SPE matched or outperformed it on 22 out of 24 datasets.

Cohen’s d is defined as:

$$ \text{Cohen’s } d = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(\text{SD}_1^2 + \text{SD}_2^2) / 2}} $$

where $\bar{x}_1, \bar{x}_2$ are the group means and $\text{SD}_1, \text{SD}_2$ are the standard deviations. Thresholds of 0.2 (small), 0.5 (medium), and 0.8 (large) were used following standard recommendations.
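
As a worked illustration of the formula and thresholds, the sketch below computes Cohen's d for two groups of scores. The input numbers are made up for demonstration and are not results from the paper; using the sample standard deviation (`statistics.stdev`) is an assumption, and `effect_label` is a hypothetical helper.

```python
import math
from statistics import mean, stdev


def cohens_d(group1, group2):
    """Cohen's d with the pooled SD defined as sqrt((SD1^2 + SD2^2) / 2)."""
    pooled_sd = math.sqrt((stdev(group1) ** 2 + stdev(group2) ** 2) / 2)
    return (mean(group1) - mean(group2)) / pooled_sd


def effect_label(d):
    """Map |d| onto the 0.2 / 0.5 / 0.8 thresholds."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Made-up scores for two methods across five splits.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [2.0, 3.0, 4.0, 5.0, 6.0]
d = cohens_d(method_a, method_b)
print(round(d, 4), effect_label(d))
```

The sign of d depends on the order of the arguments, so for an error metric such as RMSE a negative value means the first method has the lower error.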

SMILES-based deep learning models generally performed on par with or better than the RF baseline, with particularly strong advantages on the four largest datasets (COX-2, acetylcholinesterase, erbB1, and hERG).

In addition to the performance gains, SPE-based models trained about 5 times faster on average than atom-level models because of their shorter input sequences.

Results Summary and Future Directions

The main findings of this study are:

SPE produces chemically meaningful tokens. The learned vocabulary contains human-readable SMILES substrings that correspond to common substructures and functional groups, making model interpretations more accessible.
SPE compresses input sequences by ~6-7x. Mean token sequence length drops from ~40 (atom-level) to ~6 (SPE) on ChEMBL, yielding a ~5x training speedup.
SPE improves molecular generation diversity. The SPE-based generative model produces molecules with higher novelty (98.3% vs. 97.8%), internal diversity (0.897 vs. 0.886), and substructure coverage, at the cost of slightly lower validity (94.1% vs. 97.0%).
SPE matches or outperforms atom-level and k-mer tokenization on QSAR prediction. Across 24 benchmarks, SPE showed comparable or better performance in 23/24 comparisons against atom-level and 22/24 against k-mer tokenization.

Limitations acknowledged by the authors:

The SPE vocabulary is trained on a specific dataset (ChEMBL25) and may not optimally represent chemical spaces that differ significantly from drug-like compounds.
The validity rate for molecular generation is slightly lower than atom-level tokenization (94.1% vs. 97.0%), since longer substructure tokens can introduce invalid fragments.
The k-mer tokenization suffers from an out-of-vocabulary problem, which the authors address by replacing unseen 4-mers with [UNK] tokens, but this is a limitation of the comparison rather than of SPE itself.
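
The out-of-vocabulary handling described for the k-mer baseline can be illustrated as follows. This is a sketch under assumptions: sequences are given as lists of atom-level tokens, the helper names are hypothetical, and the toy strings are chosen so that each character is one atom-level token.

```python
UNK = "[UNK]"


def kmers(tokens, k=4):
    """Overlapping k-mers over a list of atom-level tokens."""
    return ["".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def build_kmer_vocab(training_sequences, k=4):
    """Collect every k-mer seen in the training sequences."""
    return {kmer for tokens in training_sequences for kmer in kmers(tokens, k)}


def encode_with_unk(tokens, vocab, k=4):
    """Replace k-mers that were never seen during training with [UNK]."""
    return [kmer if kmer in vocab else UNK for kmer in kmers(tokens, k)]


vocab = build_kmer_vocab([list("CCOCC")])
print(encode_with_unk(list("CCOCN"), vocab))
```

Every [UNK] discards the identity of four atom-level tokens at once, which is why the problem does not arise for SPE, where unseen combinations fall back to smaller known tokens.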

Future directions: The authors suggest SPE could serve as a general tokenization method for SMILES-based deep learning, applicable to any task where SMILES strings are used as input (generation, property prediction, reaction prediction, retrosynthesis). The algorithm can also be applied to DeepSMILES and SELFIES representations without modification.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
SPE vocabulary training	ChEMBL25	~3.4M SMILES	1 canonical + 1 non-canonical per molecule
Language model training	ChEMBL25 augmented	~9M SMILES	1 canonical + 5 non-canonical per molecule
Molecular generation evaluation	Sampled from model	1M SMILES per model	Validated with RDKit
QSAR benchmarks	Cortes-Ciriano et al.	24 datasets, 199-5010 molecules	pIC50 regression tasks

Algorithms

SPE vocabulary training: iterative pair merging with MVS=30,000 and FT=2,000
Language model: AWD-LSTM with embedding size 400, 3 LSTM layers with 1,152 hidden units
Dropout: embedding=0.1, input=0.6, weight=0.5, hidden=0.2
Training: 10 epochs, base learning rate 0.008, one-cycle policy
QSAR: MolPMoFiT transfer learning with 25x training augmentation and 15x validation augmentation
Test time augmentation: average of canonical + 4 augmented SMILES predictions
RF baseline: 500 trees, 1024-bit ECFP6, default scikit-learn parameters
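
Test time augmentation as listed above amounts to averaging predictions over the canonical SMILES and four augmented variants. The sketch below shows only that averaging step; `predict_with_tta` is a hypothetical name, and the `randomize` and `predict` callables are placeholders for a SMILES randomizer and a trained model, neither of which is implemented here.

```python
from statistics import fmean


def predict_with_tta(canonical_smiles, randomize, predict, n_augmented=4):
    """Average the prediction for the canonical SMILES and n augmented variants."""
    variants = [canonical_smiles]
    variants += [randomize(canonical_smiles) for _ in range(n_augmented)]
    return fmean(predict(s) for s in variants)


# Toy stand-ins: a real setup would use a SMILES randomizer and a trained model.
def toy_randomize(smiles):
    return smiles[::-1]


def toy_predict(smiles):
    return 6.0 if smiles == "CCO" else 7.0


print(predict_with_tta("CCO", toy_randomize, toy_predict))
```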

Models

AWD-LSTM architecture from Merity et al. (2018)
MolPMoFiT framework from Li and Fourches (2020) for transfer learning QSAR

Evaluation

Metric	Task	Notes
Validity, Uniqueness, Novelty	Generation	Basic quality metrics
Internal diversity	Generation	1 - mean pairwise Tanimoto (ECFP6)
Nearest neighbor similarity	Generation	Mean max Tanimoto to reference set
Substructure coverage	Generation	BRICS, functional groups, scaffolds, ring systems
RMSE, R-squared, MAE	QSAR regression	10 random 80:10:10 splits
Cohen’s d	QSAR comparison	Effect size between tokenization methods

Hardware

Not explicitly specified in the paper.

Artifacts

Artifact	Type	License	Notes
SmilesPE	Code	Apache-2.0	SPE tokenization Python package
MolPMoFiT	Code	Not specified	Transfer learning QSAR framework

Paper Information

Citation: Li, X., & Fourches, D. (2021). SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning. Journal of Chemical Information and Modeling, 61(4), 1560-1569. https://doi.org/10.1021/acs.jcim.0c01127

@article{li2021smiles,
  title={SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning},
  author={Li, Xinhao and Fourches, Denis},
  journal={Journal of Chemical Information and Modeling},
  volume={61},
  number={4},
  pages={1560--1569},
  year={2021},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.0c01127}
}
