Paper Information
Citation: Krenn, M., Häse, F., Nigam, A., Friederich, P., & Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4), 045024. https://doi.org/10.1088/2632-2153/aba947
Publication: Machine Learning: Science and Technology, 2020
Additional Resources: code and trained models at https://github.com/aspuru-guzik-group/selfies
What kind of paper is this?
This is a Method paper that introduces a new molecular string representation designed specifically for machine learning applications.
What is the motivation?
When neural networks generate molecules using SMILES notation, a large fraction of the output strings are invalid, whether through syntax errors or chemically impossible structures. This is not just an inconvenience but a fundamental bottleneck: a generative model that produces 70% invalid molecules wastes most of its computational effort and severely limits its exploration of chemical space.
What is the novelty here?
The authors’ key insight was a formal-grammar approach in which each symbol is interpreted according to its chemical context - the “state of the derivation,” which tracks the valence bonds still available. Impossible structures, such as a carbon with five single bonds, become unrepresentable by construction rather than something models must learn to avoid.
This approach guarantees 100% validity: every SELFIES string corresponds to a valid molecule, and every valid molecule can be represented.
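As a concrete illustration, here is a minimal sketch using the modern `selfies` Python package (`pip install selfies`); the API shown postdates the paper, and exact output strings may vary between library versions.

```python
import selfies as sf

# Encode a SMILES string into SELFIES and decode it back.
benzene = sf.encoder("c1ccccc1")
print(benzene)              # a SELFIES string, e.g. [C][=C][C][=C][C][=C][Ring1][=Branch1]
print(sf.decoder(benzene))  # a valid (Kekulized) SMILES for benzene

# The validity guarantee in action: even an arbitrary sequence of SELFIES
# symbols decodes to a real molecule, because the derivation state caps
# bond orders and stops once all valences are exhausted.
print(sf.decoder("[C][=C][F][#N][O]"))
```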
What experiments were performed?
The authors ran a convincing set of experiments to demonstrate SELFIES’ robustness:
Random Mutation Test
They took the SELFIES and SMILES representations of MDMA and introduced random changes:
- SMILES: After just one random mutation, only 26.6% of strings remained valid
- SELFIES: 100% of mutated strings still represented valid molecules (though different from the original)
This stark difference shows why SELFIES is particularly valuable for evolutionary algorithms and genetic programming approaches to molecular design.
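A hedged re-creation of this experiment (a sketch, not the authors' script) is straightforward with the modern `selfies` package: swap one random symbol and confirm the mutant still decodes.

```python
import random
import selfies as sf

mdma = sf.encoder("CC(CC1=CC2=C(C=C1)OCO2)NC")      # MDMA, as in the paper's test
symbols = list(sf.split_selfies(mdma))
alphabet = list(sf.get_semantic_robust_alphabet())  # symbols obeying valence rules

random.seed(0)
for _ in range(5):
    mutant = symbols.copy()
    mutant[random.randrange(len(mutant))] = random.choice(alphabet)
    # Decoding never fails for strings built from valid SELFIES symbols;
    # each mutant is some valid molecule, just no longer MDMA.
    print(sf.decoder("".join(mutant)))
```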
Generative Model Performance
The real test came with actual machine learning models. The authors trained Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) on both representations:
VAE Results:
- SMILES-based VAE: Large invalid regions scattered throughout the latent space
- SELFIES-based VAE: Every point in the continuous latent space mapped to a valid molecule
- The SELFIES model encoded over 100 times more diverse molecules
GAN Results:
- Best SMILES GAN: 18.5% diverse, valid molecules
- Best SELFIES GAN: 78.9% diverse, valid molecules
Evaluation Metrics:
- Validity: Percentage of generated strings representing valid molecular structures
- Diversity: Number of unique valid molecules produced
- Reconstruction Accuracy: How well the autoencoder reproduced input molecules (detailed in Table 1)
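As a sketch (not the authors' evaluation code), the first two metrics above can be computed with RDKit, whose `Chem.MolFromSmiles` returns `None` for invalid input:

```python
from rdkit import Chem

def validity_and_diversity(samples):
    """Return the fraction of valid samples and the count of unique valid molecules."""
    unique = set()
    n_valid = 0
    for smi in samples:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            n_valid += 1
            unique.add(Chem.MolToSmiles(mol))  # canonical SMILES deduplicates
    return n_valid / len(samples), len(unique)

# "CCO" and "OCC" are both ethanol; "C(C" is malformed.
print(validity_and_diversity(["CCO", "C(C", "OCC"]))  # validity 2/3, diversity 1
```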
Scalability Test
The authors showed that SELFIES works beyond toy molecules by successfully encoding and decoding all 72 million molecules from the PubChem database - demonstrating practical applicability to real chemical databases.
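A round-trip check in the spirit of this test might look as follows; it assumes the modern `selfies` API (including its `EncoderError` exception) and RDKit for canonical comparison, not the paper-era code.

```python
import selfies as sf
from rdkit import Chem

def roundtrip_ok(smiles: str) -> bool:
    """SMILES -> SELFIES -> SMILES, compared on canonical forms."""
    try:
        recovered = sf.decoder(sf.encoder(smiles))
    except sf.EncoderError:
        return False  # input uses features SELFIES cannot express
    return Chem.CanonSmiles(smiles) == Chem.CanonSmiles(recovered)

# In the paper, the equivalent loop ran over all 72 million PubChem entries.
print(roundtrip_ok("CC(CC1=CC2=C(C=C1)OCO2)NC"))
```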
What outcomes/conclusions?
Key Findings:
- SELFIES achieves 100% validity guarantee - every string represents a valid molecule
- SELFIES-based VAEs encode over 100× more diverse molecules than SMILES-based models
- SELFIES-based GANs produce 78.9% diverse valid molecules vs. 18.5% for SMILES GANs
- Successfully validated on all 72 million PubChem molecules
Limitations Acknowledged:
- No direct canonicalization method (initially required a round trip through SMILES)
- Missing features: aromaticity, isotopes, complex stereochemistry
- Requires community testing and adoption
Impact:
This work demonstrated that designing molecular representations specifically for machine learning can unlock new capabilities in drug discovery and materials science. The paper sparked a broader conversation about representation design and showed that fixing limitations at the representation level is more effective than forcing models to work around them.
Reproducibility Details
Data
The machine learning experiments used two distinct datasets:
- QM9 (134k molecules): Primary training dataset for VAE and GAN models
- PubChem (72M molecules): Used only to test representation coverage and scalability; not used for model training
Models
The VAE implementation included:
- Latent space: 241-dimensional with Gaussian distributions
- Input encoding: One-hot encoding of SELFIES/SMILES strings (see the sketch after this list)
- Full architectural details (encoder/decoder structures, layer types) provided in Supplementary Information
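The one-hot preprocessing might look like the following sketch; it relies on helpers from the modern `selfies` library (`selfies_to_encoding` and the `[nop]` padding symbol are taken from the current API, not from the paper's Supplementary Information).

```python
import selfies as sf

dataset = ["[C][O]", "[C][C][O]"]           # toy stand-in for QM9 SELFIES strings
alphabet = sorted(sf.get_alphabet_from_selfies(dataset))
alphabet.append("[nop]")                    # no-op symbol used for padding
stoi = {symbol: i for i, symbol in enumerate(alphabet)}
max_len = max(sf.len_selfies(s) for s in dataset)

one_hots = [
    sf.selfies_to_encoding(s, stoi, pad_to_len=max_len, enc_type="one_hot")
    for s in dataset
]
# Each entry is a (max_len, len(alphabet)) matrix, ready to be flattened
# into the VAE encoder input.
print(one_hots[0])
```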
Algorithms
The authors found GAN performance was highly sensitive to hyperparameter selection:
- Searched 200 different hyperparameter configurations to achieve the reported 78.9% diversity
- Specific optimizers, learning rates, and training duration detailed in Supplementary Information
- Full rule generation algorithm provided in Table 2
Evaluation
All models evaluated on:
- Validity rate: Percentage of syntactically and chemically valid outputs
- Diversity: Count of unique valid molecules generated
- Reconstruction accuracy: Fidelity of autoencoder reconstruction (VAEs only)
Hardware
- Training performed on SciNet supercomputing infrastructure
- Specific GPU/compute details in Supplementary Information
Replication Resources
Complete technical replication requires:
- The full rule generation algorithm (Table 2 in paper)
- Code and trained models: https://github.com/aspuru-guzik-group/selfies
- Supplementary Information for complete architectural and hyperparameter specifications
Note: The modern SELFIES library has evolved significantly since this foundational paper, addressing many of the implementation challenges identified by the authors.
