Paper Information

Citation: Yüksel, A., Ulusoy, E., Ünlü, A., & Doğan, T. (2023). SELFormer: Molecular Representation Learning via SELFIES Language Models. Machine Learning: Science and Technology, 4(2), 025035. https://doi.org/10.1088/2632-2153/acdb30

Publication: Machine Learning: Science and Technology 2023

A SELFIES-Based Chemical Language Model

This is primarily a Method paper ($\Psi_{\text{Method}}$) with a secondary Resource component ($\Psi_{\text{Resource}}$).

SELFormer applies the RoBERTa transformer architecture to SELFIES molecular string representations instead of the SMILES notation used by prior chemical language models. The model is pretrained via masked language modeling (MLM) on 2M drug-like compounds from ChEMBL and fine-tuned for molecular property prediction tasks on MoleculeNet benchmarks. The authors release pretrained models, fine-tuning code, and datasets as open-source resources.

Why SELFIES Over SMILES for Pretraining?

Existing chemical language models, including ChemBERTa, ChemBERTa-2, MolBERT, and MolFormer, all use SMILES as their input representation. SMILES has well-documented validity and robustness issues: arbitrary perturbations to a SMILES string frequently produce syntactically invalid outputs. This means a pretrained model must spend capacity learning SMILES grammar rules rather than chemical semantics.

SELFIES addresses this by construction: every possible SELFIES string decodes to a valid molecule. Despite this theoretical advantage and SELFIES’ growing adoption in generative chemistry, no prior work had systematically evaluated SELFIES as input for large-scale transformer pretraining. SELFormer fills this gap by providing a direct comparison between SELFIES-based and SMILES-based chemical language models on standard benchmarks.
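
To make the string format concrete, here is a minimal sketch of how a SELFIES string breaks into bracketed symbols. It illustrates only the surface syntax: it does not decode anything, it is not the byte-level BPE tokenizer the paper uses, and the helper name `split_symbols` and the benzene example string are assumptions made for this note.

```python
import re

# Matches one bracketed SELFIES symbol such as "[C]" or "[=Branch1]",
# or the "." separator used between disconnected fragments.
_SYMBOL = re.compile(r"\[[^\[\]]*\]|\.")


def split_symbols(selfies: str) -> list[str]:
    """Split a SELFIES string into its symbols, rejecting stray characters."""
    symbols = _SYMBOL.findall(selfies)
    if "".join(symbols) != selfies:
        raise ValueError(f"not a sequence of bracketed symbols: {selfies!r}")
    return symbols


# Illustrative input: a SELFIES string for benzene (assumed, not from the paper).
benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
symbols = split_symbols(benzene)
print(len(symbols), symbols[-2:])
```

Because every symbol is a self-delimiting unit, swapping one symbol for another still leaves a well-formed symbol sequence, which is the property the validity guarantee builds on. Turning such a sequence back into a molecule is the job of the SELFIES decoder and is not shown here.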

Masked Language Modeling on Guaranteed-Valid Molecular Strings

SELFormer uses byte-level Byte-Pair Encoding (BPE) to tokenize SELFIES strings, then pretrains a RoBERTa encoder using the standard MLM objective. 15% of input tokens are masked, and the model minimizes the cross-entropy loss over the masked positions:

$$ \mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta) $$

where $\mathcal{M}$ is the set of masked token indices, $x_i$ is the true token at position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted input context, and $\theta$ are the model parameters.
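
The objective above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the authors' training code: it masks a fixed 15% of positions (standard implementations typically also replace some selected tokens with random or unchanged tokens, which is omitted here), and the probabilities fed to the loss are made-up stand-ins for model outputs. The function names are hypothetical.

```python
import math
import random


def choose_masked_positions(n_tokens: int, rate: float = 0.15, seed: int = 0) -> list[int]:
    """Pick the index set M: a fixed fraction of positions, at least one."""
    k = max(1, round(rate * n_tokens))
    return sorted(random.Random(seed).sample(range(n_tokens), k))


def mlm_loss(true_token_probs: dict[int, float]) -> float:
    """Mean negative log-likelihood over masked positions.

    true_token_probs maps each masked index i to P(x_i | corrupted input).
    """
    if not true_token_probs:
        raise ValueError("at least one masked position is required")
    return -sum(math.log(p) for p in true_token_probs.values()) / len(true_token_probs)


masked = choose_masked_positions(n_tokens=20)  # 15% of 20 tokens -> 3 positions
# Hypothetical model outputs: probability assigned to the true token.
probs = {i: p for i, p in zip(masked, [0.9, 0.5, 0.1])}
print(masked, round(mlm_loss(probs), 4))
```

Only the masked positions contribute to the average, so a confident wrong prediction (the 0.1 entry) dominates the loss while unmasked tokens have no effect.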

The key insight is that because SELFIES guarantees 100% validity, every masked token prediction corresponds to a valid molecular fragment. The model never wastes capacity predicting invalid chemistry. For fine-tuning, a two-layer classification or regression head is added on top of the encoder’s output embedding.

Two model sizes were trained. Notably, the larger SELFormer uses fewer attention heads (4) but more hidden layers (12) than SELFormer-Lite (12 heads, 8 layers). This counterintuitive configuration emerged from the authors’ hyperparameter search over ~100 models, where deeper architectures with fewer heads outperformed wider, shallower ones:

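| Configuration | SELFormer-Lite | SELFormer |
| --- | --- | --- |
| Attention Heads | 12 | 4 |
| Hidden Layers | 8 | 12 |
| Batch Size | 16 | 16 |
| Learning Rate | 5e-5 | 5e-5 |
| Weight Decay | 0.01 | 0.01 |
| Pretraining Epochs | 100 | 100 |
| Parameters | 58.3M | 86.7M |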
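
A quick arithmetic check helps explain why the model with fewer heads is still the larger one. In standard multi-head attention the head count only partitions the hidden dimension, so it does not change the parameter count; depth does. The sketch below compares the per-layer size implied by the two reported totals with a standard encoder layer. The hidden size (768) and feed-forward size (3072) are assumptions borrowed from RoBERTa-base, not values stated in this summary, and the calculation also assumes the two models differ only in depth and head count.

```python
def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Weights and biases in one standard transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden
    layer_norms = 2 * (2 * hidden)  # scale and shift, twice per layer
    return attention + feed_forward + layer_norms


# Reported in the configuration table.
selformer = {"layers": 12, "params": 86.7e6}
lite = {"layers": 8, "params": 58.3e6}

implied_per_layer = (selformer["params"] - lite["params"]) / (
    selformer["layers"] - lite["layers"]
)

# ASSUMPTION: RoBERTa-base widths (hidden 768, feed-forward 3072).
assumed_per_layer = encoder_layer_params(hidden=768, ffn=3072)

print(f"implied by reported sizes: {implied_per_layer / 1e6:.2f}M per layer")
print(f"assumed 768/3072 layer:    {assumed_per_layer / 1e6:.2f}M per layer")
```

Under these assumptions both figures land near 7.1M parameters per layer, which is consistent with the four extra layers accounting for the gap between 58.3M and 86.7M.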

Benchmarking Against SMILES Transformers and Graph Models

SELFormer was pretrained on 2.08M drug-like compounds from ChEMBL v30 (converted from SMILES to SELFIES), then fine-tuned on nine MoleculeNet tasks. All evaluations use scaffold splitting via the Chemprop library.

Classification tasks (ROC-AUC, scaffold split):

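| Model | BACE | BBBP | HIV | Tox21 | SIDER |
| --- | --- | --- | --- | --- | --- |
| SELFormer | 0.832 | 0.902 | 0.681 | 0.653 | 0.745 |
| ChemBERTa-2 | 0.799 | 0.728 | 0.622 | - | - |
| MolBERT | 0.866 | 0.762 | 0.783 | - | - |
| D-MPNN | 0.809 | 0.710 | 0.771 | 0.759 | 0.570 |
| MolCLR | 0.890 | 0.736 | 0.806 | 0.787 | 0.652 |
| GEM | 0.856 | 0.724 | 0.806 | 0.781 | 0.672 |
| KPGT | 0.855 | 0.908 | - | 0.848 | 0.649 |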

Regression tasks (RMSE, scaffold split, lower is better):

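| Model | ESOL | FreeSolv | Lipophilicity | PDBbind |
| --- | --- | --- | --- | --- |
| SELFormer | 0.682 | 2.797 | 0.735 | 1.488 |
| ChemBERTa-2 | - | - | 0.986 | - |
| D-MPNN | 1.050 | 2.082 | 0.683 | 1.397 |
| GEM | 0.798 | 1.877 | 0.660 | - |
| KPGT | 0.803 | 2.121 | 0.600 | - |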

The ablation study compared SELFormer vs. SELFormer-Lite across pretrained-only, 25-epoch, and 50-epoch fine-tuning configurations on randomly split datasets. SELFormer consistently outperformed SELFormer-Lite, confirming the benefit of the deeper (12-layer) architecture.

Strong Classification Performance with Compact Pretraining

SELFormer’s strongest results come on classification tasks where molecular substructure matters:

  • SIDER: Best overall ROC-AUC (0.745), 7.3 percentage points above the strongest baseline in the classification table (GEM at 0.672) and 9.3 points above MolCLR (0.652). The authors attribute this to SELFIES’ ability to capture subtle structural differences relevant to drug side effects.
  • BBBP: Second best (0.902), behind only KPGT (0.908). SELFormer scored 17.4 percentage points above ChemBERTa-2 (0.728) on this task.
  • BACE/HIV vs. ChemBERTa-2: SELFormer outperformed ChemBERTa-2 by 3.3 points on BACE (0.832 vs 0.799), 17.4 on BBBP, and 5.9 on HIV (0.681 vs 0.622). Since both models use similar RoBERTa architectures, this comparison is suggestive of a SELFIES advantage, though differences in pretraining corpus (ChEMBL vs PubChem), corpus size, and training procedure confound a clean attribution to the input representation alone.
  • ESOL regression: Best RMSE (0.682) vs GEM (0.798), a 14.5% relative improvement.
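
The margins quoted in the bullets above mix two kinds of comparison: absolute percentage-point gaps for ROC-AUC and a relative reduction for RMSE. The following sketch recomputes them from the scaffold-split scores reported earlier; the helper names are made up for this note.

```python
# Scores copied from the scaffold-split results reported above.
roc_auc = {
    "SELFormer": {"BACE": 0.832, "BBBP": 0.902, "HIV": 0.681},
    "ChemBERTa-2": {"BACE": 0.799, "BBBP": 0.728, "HIV": 0.622},
}
esol_rmse = {"SELFormer": 0.682, "GEM": 0.798}


def point_gain(ours: float, theirs: float) -> float:
    """Absolute difference in percentage points (higher-is-better metric)."""
    return round((ours - theirs) * 100, 1)


def relative_error_reduction(ours: float, theirs: float) -> float:
    """Percent reduction relative to the baseline (lower-is-better metric)."""
    return round((theirs - ours) / theirs * 100, 1)


for task in ("BACE", "BBBP", "HIV"):
    gain = point_gain(roc_auc["SELFormer"][task], roc_auc["ChemBERTa-2"][task])
    print(f"{task}: +{gain} points")

reduction = relative_error_reduction(esol_rmse["SELFormer"], esol_rmse["GEM"])
print(f"ESOL: {reduction}% lower RMSE")
```

Keeping the two measures separate matters when reading the results: a 14.5% relative RMSE reduction and a 17.4-point ROC-AUC gap are not on the same scale and should not be compared directly.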

Limitations are also apparent:

  • HIV and Tox21: SELFormer (0.681 and 0.653) trails every graph-based method with a reported score on these larger datasets, for example MolCLR (0.806 and 0.787). The authors attribute this to insufficient hyperparameter search given computational constraints.
  • FreeSolv and Lipophilicity regression: D-MPNN and the other graph-based methods maintain an edge (for example, GEM reaches an RMSE of 1.877 on FreeSolv vs 2.797 for SELFormer), suggesting that explicit 2D/3D structural inductive biases remain valuable for certain property types.
  • Small pretraining corpus: At 2M molecules, SELFormer’s corpus is orders of magnitude smaller than MolFormer’s 1.1B. Despite this, SELFormer outperforms MolFormer on SIDER (0.745 vs 0.690), which is consistent with a representational advantage for SELFIES, although a single task cannot establish it.
  • Single-task ablation scope: Some architectural claims rest on limited task coverage, and broader benchmarking would strengthen the conclusions.

Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Pretraining | ChEMBL v30 | 2,084,725 compounds (2,084,472 after SELFIES conversion) | Drug-like bioactive small molecules |
| Classification | BACE | 1,513 | Beta-secretase 1 inhibitor binding |
| Classification | BBBP | 2,039 | Blood-brain barrier permeability |
| Classification | HIV | 41,127 | HIV replication inhibition |
| Classification | SIDER | 1,427 | Drug side effects (27 classes) |
| Classification | Tox21 | 7,831 | Toxicity (12 targets) |
| Regression | ESOL | 1,128 | Aqueous solubility |
| Regression | FreeSolv | 642 | Hydration free energy |
| Regression | Lipophilicity | 4,200 | Octanol/water distribution coefficient |
| Regression | PDBbind | 11,908 | Binding affinity |

Algorithms

  • Pretraining objective: Masked language modeling (MLM), 15% token masking
  • Tokenization: Byte-level Byte-Pair Encoding (BPE) on SELFIES strings
  • SMILES to SELFIES conversion: SELFIES API, parallelized with the pandarallel library
  • Splitting: Scaffold splitting via Chemprop library (80/10/10 train/validation/test)
  • Fine-tuning: Two-layer classification/regression head on encoder output; up to 200 epochs with hyperparameter search
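
To illustrate the splitting step, here is a minimal sketch of a greedy scaffold split. It is not Chemprop's implementation: it assumes the scaffold of each molecule has already been computed elsewhere (for example as a Murcko scaffold string) and is passed in as a plain label, and the function name and toy labels are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds: list[str], frac_train: float = 0.8, frac_valid: float = 0.1):
    """Greedy scaffold split over precomputed scaffold keys.

    scaffolds[i] is the scaffold identifier of molecule i, computed elsewhere.
    Returns three lists of molecule indices: train, validation, test.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    train_cap = frac_train * len(scaffolds)
    valid_cap = frac_valid * len(scaffolds)
    train, valid, test = [], [], []

    # Largest scaffold groups first, so rare scaffolds end up held out.
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold labels for ten molecules.
keys = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train, valid, test = scaffold_split(keys)
print(train, valid, test)
```

Because whole scaffold groups are assigned together, no scaffold appears in more than one partition. That makes the test set structurally different from the training set, which is why scaffold-split scores are usually lower than random-split scores.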

Models

  • Architecture: RoBERTa (HuggingFace Transformers)
  • SELFormer: 12 hidden layers, 4 attention heads, 86.7M parameters
  • SELFormer-Lite: 8 hidden layers, 12 attention heads, 58.3M parameters
  • Hyperparameter search: Sequential search over ~100 configurations on 100K molecule subset

Evaluation

| Metric | Task Type | Details |
| --- | --- | --- |
| ROC-AUC | Classification | Area under receiver operating characteristic curve |
| PRC-AUC | Classification | Area under precision-recall curve (reported for random splits) |
| RMSE | Regression | Root mean squared error |

Results reported on scaffold split and random split datasets.
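
For reference, the two headline metrics can be written out directly. The sketch below uses the pairwise-ranking definition of ROC-AUC, which is quadratic in the number of samples and meant only to show what the number measures; library implementations are faster. The toy inputs are invented, not taken from the paper.

```python
import math


def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def rmse(targets: list[float], predictions: list[float]) -> float:
    """Root mean squared error."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and the same length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets))


# Toy values, not from the paper.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 3 of 4 pairs ranked correctly
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

ROC-AUC depends only on how predictions are ranked, so 0.5 corresponds to random ordering, whereas RMSE is in the units of the target property and penalizes large errors more heavily.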

Hardware

  • Compute: 2x NVIDIA A5000 GPUs
  • Hyperparameter optimization time: ~11 days
  • Full pretraining: 100 epochs on 2.08M molecules

Artifacts

| Artifact | Type | License | Notes |
| --- | --- | --- | --- |
| SELFormer GitHub | Code | GPL-3.0 | Pretraining, fine-tuning, and evaluation scripts |
| SELFormer on HuggingFace | Model | GPL-3.0 | Pretrained SELFormer weights |
| ChEMBL v30 | Dataset | CC BY-SA 3.0 | Source pretraining data |
| MoleculeNet | Benchmark | Unknown | Downstream evaluation tasks |

Citation

@article{yuksel2023selformer,
  title={{SELFormer}: Molecular Representation Learning via {SELFIES} Language Models},
  author={Y{\"u}ksel, Atakan and Ulusoy, Erva and {\"U}nl{\"u}, Atabey and Do{\u{g}}an, Tunca},
  journal={Machine Learning: Science and Technology},
  volume={4},
  number={2},
  pages={025035},
  year={2023},
  publisher={IOP Publishing},
  doi={10.1088/2632-2153/acdb30}
}