SMILES vs SELFIES Tokenization for Chemical LMs

Atom Pair Encoding for Chemical Language Modeling

This is a Method paper that introduces Atom Pair Encoding (APE), a tokenization algorithm designed specifically for chemical string representations (SMILES and SELFIES). Its primary contribution is to show that a chemistry-aware tokenizer, which preserves atomic identity during subword merging, improves molecular property classification (measured by ROC-AUC) in transformer-based models in most settings compared with the standard Byte Pair Encoding (BPE) approach.

Why Tokenization Matters for Chemical Strings

Existing chemical language models based on BERT/RoBERTa architectures have typically relied on BPE for tokenizing SMILES and SELFIES strings. BPE originated as a data compression algorithm and was later adopted for natural language, where it excels at breaking words into meaningful subword units. When applied to chemical strings, BPE operates at the character level with no notion of chemical semantics, which leads to several problems:

Stray characters: BPE may create tokens like “C)(” that have no chemical meaning.
Element splitting: Multi-character elements like chlorine (“Cl”) can be split into “C” and “l”, so the model reads a carbon atom followed by a meaningless stray character.
Lost structural context: BPE compresses sequences without considering how character position encodes molecular structure.
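
To make the element-splitting problem concrete, the sketch below contrasts a character-level split with an atom-aware split of a SMILES string. The regular expression is a deliberately simplified, hypothetical pattern (not taken from the paper or from any particular tokenizer): it only recognizes bracket atoms, the two-letter halogens Cl and Br, and single characters.

```python
import re

# Simplified, illustrative pattern (an assumption, not the paper's):
# bracket atoms first, then two-letter halogens, then any single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def char_split(smiles: str) -> list[str]:
    """Character-level split, the starting point of a plain BPE tokenizer."""
    return list(smiles)

def atom_split(smiles: str) -> list[str]:
    """Atom-aware split that keeps multi-character elements intact."""
    return ATOM_PATTERN.findall(smiles)

smiles = "ClCCBr"  # 1-bromo-2-chloroethane
print(char_split(smiles))  # ['C', 'l', 'C', 'C', 'B', 'r']
print(atom_split(smiles))  # ['Cl', 'C', 'C', 'Br']
```

In the character-level output, the chlorine atom is indistinguishable from a carbon followed by a stray "l"; the atom-aware split keeps both halogens as single units.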

Previous work on SMILES Pair Encoding (SPE) attempted to address this by iteratively merging SMILES substrings into chemically meaningful tokens. However, SPE had practical limitations: its Python implementation did not support SELFIES, and it produced a smaller vocabulary (~3000 tokens) than what the data could support. These gaps motivated the development of APE.

The APE Tokenizer: Chemistry-Aware Subword Merging

APE draws inspiration from both BPE and SPE but addresses their shortcomings. The key design decisions are:

Atom-level initialization: Instead of starting from individual characters (as BPE does), APE begins with chemically valid atomic units. For SMILES, this means recognizing multi-character elements (e.g., “Cl”, “Br”) as single tokens. For SELFIES, each bracketed string (e.g., [C], [Ring1], [=O]) serves as the fundamental unit.
Iterative pair merging: Like BPE, APE iteratively merges the most frequent adjacent token pairs. The difference is that the initial tokenization preserves atomic boundaries, so merged tokens always represent valid chemical substructures.
Larger vocabulary: Using the same minimum frequency threshold of 2000, APE generates approximately 5300 unique tokens from the PubChem dataset, compared to SPE’s approximately 3000. This richer vocabulary provides more expressive power for representing chemical substructures.
SELFIES compatibility: APE natively supports both SMILES and SELFIES, using the bracketed token structure of SELFIES as its starting point for that representation.
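
The design decisions above can be sketched as a small training loop. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function names (`selfies_units`, `merge_pair`, `train_merges`) are hypothetical, ties between equally frequent pairs are broken by first occurrence, and a toy threshold of 2 stands in for the minimum frequency of 2000 used on PubChem.

```python
from collections import Counter

def selfies_units(selfies: str) -> list[str]:
    """Split a SELFIES string into bracketed symbols: '[C][=O]' -> ['[C]', '[=O]']."""
    return [part + "]" for part in selfies.split("]") if part]

def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every non-overlapping occurrence of `pair` with the fused token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_merges(corpus: list[list[str]], min_frequency: int) -> list[tuple[str, str]]:
    """Greedily merge the most frequent adjacent pair until none reaches min_frequency."""
    merges = []
    while True:
        counts = Counter()
        for tokens in corpus:
            counts.update(zip(tokens, tokens[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_frequency:
            break
        merges.append(pair)
        corpus = [merge_pair(tokens, pair) for tokens in corpus]
    return merges

# Toy corpus of SELFIES strings, already split into atom-level units.
corpus = [selfies_units(s) for s in ["[C][C][=O]", "[C][C][=O]", "[N][C][=O]"]]
merges = train_merges(corpus, min_frequency=2)
print(merges)  # [('[C]', '[=O]'), ('[C]', '[C][=O]')]
```

Because merging starts from whole bracketed symbols, every learned token is a concatenation of complete atomic units; a character-level start could instead fuse fragments such as a closing bracket with the next opening one.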

The tokenizers were trained on a 2-million-molecule subset of a PubChem collection of 10 million SMILES strings. Combining the two tokenization algorithms with the two representations yields four tokenizer variants: SMILES-BPE, SMILES-APE, SELFIES-BPE, and SELFIES-APE.

Pre-training and Evaluation on MoleculeNet Benchmarks

Model architecture

All four models use the RoBERTa architecture with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads. Pre-training used masked language modeling (MLM) with 15% token masking on 1 million molecules from PubChem, with a validation set of 100,000 molecules. Each model was pre-trained for 20 epochs using AdamW, with hyperparameter optimization via Optuna.
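
As a rough illustration of the 15% masking objective, the sketch below hides a fixed fraction of tokens and records the originals as prediction targets. It is a simplification under stated assumptions: the `<mask>` placeholder string, the `mask_tokens` helper, and the fixed seed are all hypothetical, and common MLM implementations additionally replace some selected tokens with random tokens or leave them unchanged, which is omitted here.

```python
import random

MASK = "<mask>"  # placeholder string; an assumption for this sketch

def mask_tokens(tokens: list[str], mask_rate: float = 0.15, seed: int = 0):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so short sequences still yield a training signal.
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

# 21 hypothetical tokens, so round(0.15 * 21) = 3 positions are masked.
tokens = ["C", "C", "(", "=O", ")", "N", "c1ccccc1"] * 3
masked, labels = mask_tokens(tokens)
print(masked.count(MASK))  # 3
```

The model is trained to recover the entries of `labels` that are not None from the surrounding unmasked tokens.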

Downstream tasks

The models were fine-tuned on three MoleculeNet classification tasks:

Dataset	Category	Compounds	Tasks	Metric
BBBP	Physiology	2,039	1	ROC-AUC
HIV	Biophysics	41,127	1	ROC-AUC
Tox21	Physiology	7,831	12	ROC-AUC

Data was split 80/10/10 (train/validation/test) following MoleculeNet recommendations. Models were fine-tuned for 5 epochs with early stopping based on validation ROC-AUC.
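
A minimal sketch of an 80/10/10 split follows. The summary does not state whether the split is random or scaffold-based, so the seeded random shuffle here is purely an assumption for illustration, as is the helper name `split_80_10_10`.

```python
import random

def split_80_10_10(items: list, seed: int = 0):
    """Shuffle a copy of `items` and split it 80/10/10 into train/validation/test."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_valid = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test

# Integer indices stand in for the 2,039 BBBP compounds.
train, valid, test = split_80_10_10(list(range(2039)))
print(len(train), len(valid), len(test))  # 1631 203 205
```

With only about 200 BBBP compounds in each held-out set, small differences in ROC-AUC between models should be read alongside the reported standard deviations.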

Baselines

Results were compared against two text-based models (ChemBERTa-2 MTR-77M and SELFormer) and two graph-based models (D-MPNN from Chemprop and MoleculeNet Graph-Conv).

Main results

Model	BBBP ROC-AUC	HIV ROC-AUC	Tox21 ROC-AUC
SMILYAPE-1M	0.754 +/- 0.006	0.772 +/- 0.010	0.838 +/- 0.002
SMILYBPE-1M	0.746 +/- 0.006	0.754 +/- 0.015	0.849 +/- 0.002
SELFYAPE-1M	0.735 +/- 0.015	0.768 +/- 0.012	0.842 +/- 0.002
SELFYBPE-1M	0.676 +/- 0.014	0.709 +/- 0.012	0.825 +/- 0.001
ChemBERTa-2-MTR-77M	0.698 +/- 0.014	0.735 +/- 0.008	0.790 +/- 0.003
SELFormer	0.716 +/- 0.021	0.769 +/- 0.010	0.838 +/- 0.005
MoleculeNet-Graph-Conv	0.690	0.763	0.829
D-MPNN	0.737	0.776	0.851

APE outperforms BPE in five of the six tokenizer comparisons; the exception is Tox21 with SMILES, where SMILYBPE (0.849) beats SMILYAPE (0.838). SMILYAPE achieves the best BBBP score (0.754), beating D-MPNN (0.737). On HIV, SMILYAPE (0.772) is competitive with D-MPNN (0.776). On Tox21, D-MPNN (0.851) leads, with SMILYBPE (0.849) and SELFYAPE (0.842) close behind.

Statistical significance

Mann-Whitney U tests confirmed statistically significant differences between SMILYAPE and SMILYBPE (p < 0.05 on all datasets). Cliff’s delta values indicate large effect sizes: 0.74 (BBBP), 0.70 (HIV), and -1.00 (Tox21, favoring BPE). For SELFIES models, SELFYAPE achieved Cliff’s delta of 1.00 across all three datasets, indicating complete separation from SELFYBPE.
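
For readers unfamiliar with the effect size used here, Cliff's delta is the fraction of cross-group pairs in which the first group scores higher minus the fraction in which it scores lower, and it relates to the Mann-Whitney U statistic by delta = 2U/(nm) - 1. The sketch below computes both on hypothetical scores (not the paper's raw per-run data); the function names are illustrative.

```python
def cliffs_delta(xs: list[float], ys: list[float]) -> float:
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def mann_whitney_u(xs: list[float], ys: list[float]) -> float:
    """U statistic for xs: number of pairs with x > y, counting ties as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

# Hypothetical ROC-AUC scores from repeated runs, for illustration only.
ape = [0.75, 0.76, 0.74, 0.77]
bpe = [0.70, 0.71, 0.69, 0.72]

delta = cliffs_delta(ape, bpe)
u = mann_whitney_u(ape, bpe)
print(delta)                              # 1.0
print(2 * u / (len(ape) * len(bpe)) - 1)  # 1.0
```

A delta of 1.00 means every run of one model scored above every run of the other, which is what "complete separation" refers to; a delta of -1.00 is the same situation with the roles reversed.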

Key Findings and Limitations

APE outperforms BPE by preserving atomic identity

The advantage of APE over BPE, observed in all comparisons except SMILES on Tox21, is attributed to APE’s atom-level initialization. By starting with chemically valid units rather than individual characters, APE avoids creating nonsensical tokens that break chemical elements or mix structural delimiters with atoms.

SMILES outperforms SELFIES with APE tokenization

SMILYAPE generally outperforms SELFYAPE across tasks. Attention weight analysis revealed that SMILYAPE assigns more weight to immediate neighboring tokens (0.108 vs. 0.096) and less to distant tokens (0.030 vs. 0.043). This pattern aligns with chemical intuition: bonding is primarily determined by directly connected atoms. SMILYAPE also produces more compact tokenizations (8.6 tokens per molecule vs. 11.9 for SELFYAPE), potentially allowing more efficient attention allocation.

SELFIES models show higher inter-tokenizer agreement

On the BBBP dataset, all true positives identified by SELFYBPE were also captured by SELFYAPE, with SELFYAPE achieving higher recall (61.68% vs. 55.14%). In contrast, SMILES-based models shared only 29.3% of true positives between APE and BPE variants, indicating that tokenization choice has a larger impact on SMILES models.
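
The agreement analysis can be pictured as set operations on each model's true positives. The sketch below uses made-up labels and predictions; the helper names are hypothetical, and measuring the shared fraction as intersection over union is an assumption, since the summary does not say how the 29.3% figure was defined.

```python
def true_positives(y_true: list[int], y_pred: list[int]) -> set[int]:
    """Indices of molecules that are positive and were predicted positive."""
    return {i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t == 1 and p == 1}

def recall(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of actual positives that the model recovered."""
    return len(true_positives(y_true, y_pred)) / sum(y_true)

# Hypothetical labels and predictions for two models, for illustration only.
y_true = [1, 1, 1, 1, 0, 0]
pred_a = [1, 1, 1, 0, 0, 1]
pred_b = [1, 1, 0, 0, 0, 0]

tp_a = true_positives(y_true, pred_a)
tp_b = true_positives(y_true, pred_b)

print(tp_b <= tp_a)                                    # True: every hit of B is also a hit of A
print(recall(y_true, pred_a), recall(y_true, pred_b))  # 0.75 0.5
shared = len(tp_a & tp_b) / len(tp_a | tp_b)           # overlap as intersection over union
```

The subset check corresponds to the SELFIES finding, where one model's true positives are fully contained in the other's; a low shared fraction corresponds to the SMILES case.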

Limitations

Pre-training used only 1 million molecules, compared to 77 million for ChemBERTa-2. Despite this, APE models were competitive or superior, but scaling effects remain unexplored.
Evaluation was limited to three MoleculeNet classification datasets (BBBP, HIV, and the 12-task Tox21). Regression tasks, molecular generation, and reaction prediction were not tested.
The Tox21 result is notable: SMILYBPE outperforms SMILYAPE (0.849 vs. 0.838), suggesting APE’s advantage may be task-dependent.
The study does not compare APE with recent atom-level tokenizers such as Atom-in-SMILES, or with other newer approaches beyond SPE.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Tokenizer training	PubChem subset	2M molecules	SMILES strings converted to SELFIES via selfies library
Pre-training	PubChem subset	1M molecules	100K validation set
Evaluation	BBBP	2,039 compounds	80/10/10 split
Evaluation	HIV	41,127 compounds	80/10/10 split
Evaluation	Tox21	7,831 compounds	80/10/10 split, 12 tasks

Algorithms

Tokenizers: BPE (via Hugging Face), APE (custom implementation, minimum frequency 2000)
Pre-training: Masked Language Modeling (15% masking) for 20 epochs
Optimizer: AdamW with Optuna hyperparameter search
Fine-tuning: 5 epochs with early stopping on validation ROC-AUC

Models

Architecture: RoBERTa with 6 layers, hidden size 768, intermediate size 1536, 12 attention heads
Four variants: SMILYAPE, SMILYBPE, SELFYAPE, SELFYBPE

Evaluation

Metric	SMILYAPE	SMILYBPE	SELFYAPE	SELFYBPE
BBBP ROC-AUC	0.754	0.746	0.735	0.676
HIV ROC-AUC	0.772	0.754	0.768	0.709
Tox21 ROC-AUC	0.838	0.849	0.842	0.825

Hardware

NVIDIA RTX 3060 GPU with 12 GiB VRAM

Artifacts

Artifact	Type	License	Notes
APE Tokenizer	Code	Other (unspecified SPDX)	Official APE tokenizer implementation
PubChem10M SMILES/SELFIES	Dataset	Not specified	10M SMILES with SELFIES conversions
Pre-trained and fine-tuned models	Model	Not specified	All four model variants on Hugging Face

Paper Information

Citation: Leon, M., Perezhohin, Y., Peres, F., Popovič, A., & Castelli, M. (2024). Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. Scientific Reports, 14(1), 25016. https://doi.org/10.1038/s41598-024-76440-8

@article{leon2024comparing,
  title={Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling},
  author={Leon, Miguelangel and Perezhohin, Yuriy and Peres, Fernando and Popovi{\v{c}}, Ale{\v{s}} and Castelli, Mauro},
  journal={Scientific Reports},
  volume={14},
  number={1},
  pages={25016},
  year={2024},
  publisher={Nature Publishing Group},
  doi={10.1038/s41598-024-76440-8}
}
