SELFormer: A SELFIES-Based Molecular Language Model

Paper Information

Citation: Yüksel, A., Ulusoy, E., Ünlü, A., & Doğan, T. (2023). SELFormer: Molecular Representation Learning via SELFIES Language Models. Machine Learning: Science and Technology, 4(2), 025035. https://doi.org/10.1088/2632-2153/acdb30

Publication: Machine Learning: Science and Technology 2023

Additional Resources:

A SELFIES-Based Chemical Language Model

This is primarily a Method paper ($\Psi_{\text{Method}}$) with a secondary Resource component ($\Psi_{\text{Resource}}$).

SELFormer applies the RoBERTa transformer architecture to SELFIES molecular string representations instead of the SMILES notation used by prior chemical language models. The model is pretrained via masked language modeling (MLM) on 2M drug-like compounds from ChEMBL and fine-tuned for molecular property prediction tasks on MoleculeNet benchmarks. The authors release pretrained models, fine-tuning code, and datasets as open-source resources.

Why SELFIES Over SMILES for Pretraining?

Existing chemical language models, including ChemBERTa, ChemBERTa-2, MolBERT, and MolFormer, all use SMILES as their input representation. SMILES has well-documented validity and robustness issues: arbitrary perturbations to a SMILES string frequently produce syntactically invalid outputs. This means a pretrained model must spend capacity learning SMILES grammar rules rather than chemical semantics.

SELFIES addresses this by construction: every possible SELFIES string decodes to a valid molecule. Despite this theoretical advantage and SELFIES’ growing adoption in generative chemistry, no prior work had systematically evaluated SELFIES as input for large-scale transformer pretraining. SELFormer fills this gap by providing a direct comparison between SELFIES-based and SMILES-based chemical language models on standard benchmarks.

Masked Language Modeling on Guaranteed-Valid Molecular Strings

SELFormer uses byte-level Byte-Pair Encoding (BPE) to tokenize SELFIES strings, then pretrains a RoBERTa encoder using the standard MLM objective. 15% of input tokens are masked, and the model minimizes the cross-entropy loss over the masked positions:

$$ \mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta) $$

where $\mathcal{M}$ is the set of masked token indices, $x_i$ is the true token at position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted input context, and $\theta$ are the model parameters.

The key insight is that because SELFIES guarantees 100% validity, every masked token prediction corresponds to a valid molecular fragment. The model never wastes capacity predicting invalid chemistry. For fine-tuning, a two-layer classification or regression head is added on top of the encoder’s output embedding.

Two model sizes were trained. Notably, the larger SELFormer uses fewer attention heads (4) but more hidden layers (12) than SELFormer-Lite (12 heads, 8 layers). This counterintuitive configuration emerged from the authors’ hyperparameter search over ~100 models, where deeper architectures with fewer heads outperformed wider, shallower ones:

Configuration	SELFormer-Lite	SELFormer
Attention Heads	12	4
Hidden Layers	8	12
Batch Size	16	16
Learning Rate	5e-5	5e-5
Weight Decay	0.01	0.01
Pretraining Epochs	100	100
Parameters	58.3M	86.7M

Benchmarking Against SMILES Transformers and Graph Models

SELFormer was pretrained on 2.08M drug-like compounds from ChEMBL v30 (converted from SMILES to SELFIES), then fine-tuned on nine MoleculeNet tasks. All evaluations use scaffold splitting via the Chemprop library.

Classification tasks (ROC-AUC, scaffold split):

Model	BACE	BBBP	HIV	Tox21	SIDER
SELFormer	0.832	0.902	0.681	0.653	0.745
ChemBERTa-2	0.799	0.728	0.622	-	-
MolBERT	0.866	0.762	0.783	-	-
D-MPNN	0.809	0.710	0.771	0.759	0.570
MolCLR	0.890	0.736	0.806	0.787	0.652
GEM	0.856	0.724	0.806	0.781	0.672
KPGT	0.855	0.908	-	0.848	0.649

Regression tasks (RMSE, scaffold split, lower is better):

Model	ESOL	FreeSolv	Lipophilicity	PDBbind
SELFormer	0.682	2.797	0.735	1.488
ChemBERTa-2	-	-	0.986	-
D-MPNN	1.050	2.082	0.683	1.397
GEM	0.798	1.877	0.660	-
KPGT	0.803	2.121	0.600	-

The ablation study compared SELFormer vs. SELFormer-Lite across pretrained-only, 25-epoch, and 50-epoch fine-tuning configurations on randomly split datasets. SELFormer consistently outperformed SELFormer-Lite, confirming the benefit of the deeper (12-layer) architecture.

Strong Classification Performance with Compact Pretraining

SELFormer’s strongest results come on classification tasks where molecular substructure matters:

SIDER: Best overall ROC-AUC (0.745), outperforming the next best method (MolCLR at 0.652) by 9.3 percentage points. The authors attribute this to SELFIES’ ability to capture subtle structural differences relevant to drug side effects.
BBBP: Second best (0.902), behind only KPGT (0.908). SELFormer scored 17.4 percentage points above ChemBERTa-2 (0.728) on this task.
BACE/HIV vs. ChemBERTa-2: SELFormer outperformed ChemBERTa-2 by 3.3 points on BACE (0.832 vs 0.799), 17.4 on BBBP, and 5.9 on HIV (0.681 vs 0.622). Since both models use similar RoBERTa architectures, this comparison is suggestive of a SELFIES advantage, though differences in pretraining corpus (ChEMBL vs PubChem), corpus size, and training procedure confound a clean attribution to the input representation alone.
ESOL regression: Best RMSE (0.682) vs GEM (0.798), a 14.5% relative improvement.

Limitations are also apparent:

HIV and Tox21: SELFormer underperforms graph-based methods (MolCLR, GEM, KPGT) on these larger datasets. The authors attribute this to insufficient hyperparameter search given computational constraints.
FreeSolv and Lipophilicity regression: D-MPNN and graph-based methods maintain an edge, suggesting that explicit 2D/3D structural inductive biases remain valuable for certain property types.
Small pretraining corpus: At 2M molecules, SELFormer’s corpus is orders of magnitude smaller than MolFormer’s 1.1B. Despite this, SELFormer outperforms MolFormer on SIDER (0.745 vs 0.690), highlighting SELFIES’ representational advantage.
Single-task ablation scope: Some architectural claims rest on limited task coverage, and broader benchmarking would strengthen the conclusions.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Pretraining	ChEMBL v30	2,084,725 compounds (2,084,472 after SELFIES conversion)	Drug-like bioactive small molecules
Classification	BACE	1,513	Beta-secretase 1 inhibitor binding
Classification	BBBP	2,039	Blood-brain barrier permeability
Classification	HIV	41,127	HIV replication inhibition
Classification	SIDER	1,427	Drug side effects (27 classes)
Classification	Tox21	7,831	Toxicity (12 targets)
Regression	ESOL	1,128	Aqueous solubility
Regression	FreeSolv	642	Hydration free energy
Regression	Lipophilicity	4,200	Octanol/water distribution coefficient
Regression	PDBbind	11,908	Binding affinity

Algorithms

Pretraining objective: Masked language modeling (MLM), 15% token masking
Tokenization: Byte-level Byte-Pair Encoding (BPE) on SELFIES strings
SMILES to SELFIES conversion: SELFIES API with Pandaral.lel for parallelization
Splitting: Scaffold splitting via Chemprop library (80/10/10 train/validation/test)
Fine-tuning: Two-layer classification/regression head on encoder output; up to 200 epochs with hyperparameter search

Models

Architecture: RoBERTa (HuggingFace Transformers)
SELFormer: 12 hidden layers, 4 attention heads, 86.7M parameters
SELFormer-Lite: 8 hidden layers, 12 attention heads, 58.3M parameters
Hyperparameter search: Sequential search over ~100 configurations on 100K molecule subset

Evaluation

Metric	Task Type	Details
ROC-AUC	Classification	Area under receiver operating characteristic curve
PRC-AUC	Classification	Area under precision-recall curve (reported for random splits)
RMSE	Regression	Root mean squared error

Results reported on scaffold split and random split datasets.

Hardware

Compute: 2x NVIDIA A5000 GPUs
Hyperparameter optimization time: ~11 days
Full pretraining: 100 epochs on 2.08M molecules

Artifacts

Artifact	Type	License	Notes
SELFormer GitHub	Code	GPL-3.0	Pretraining, fine-tuning, and evaluation scripts
SELFormer on HuggingFace	Model	GPL-3.0	Pretrained SELFormer weights
ChEMBL v30	Dataset	CC BY-SA 3.0	Source pretraining data
MoleculeNet	Benchmark	Unknown	Downstream evaluation tasks

Citation

@article{yuksel2023selformer,
  title={{SELFormer}: Molecular Representation Learning via {SELFIES} Language Models},
  author={Y{\"u}ksel, Atakan and Ulusoy, Erva and {\"U}nl{\"u}, Atabey and Do{\u{g}}an, Tunca},
  journal={Machine Learning: Science and Technology},
  volume={4},
  number={2},
  pages={025035},
  year={2023},
  publisher={IOP Publishing},
  doi={10.1088/2632-2153/acdb30}
}

Paper Information#

A SELFIES-Based Chemical Language Model#

Why SELFIES Over SMILES for Pretraining?#

Masked Language Modeling on Guaranteed-Valid Molecular Strings#

Benchmarking Against SMILES Transformers and Graph Models#

Strong Classification Performance with Compact Pretraining#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Citation#