This paper introduced SELFIES (self-referencing embedded strings) to solve a critical problem in machine learning for chemistry: most AI-generated molecular strings don’t represent real molecules.

The Core Problem

Mario Krenn and colleagues identified that when neural networks generate molecules using SMILES notation, a huge fraction of the output strings are invalid - containing either syntax errors or chemically impossible structures. This wasn’t just an inconvenience but a fundamental bottleneck: if your generative model produces 70% invalid molecules, you’re wasting computational effort and severely limiting your exploration of chemical space.

The SELFIES Solution

The authors’ key insight was a formal grammar approach in which each symbol is interpreted relative to its chemical context - the “state of the derivation,” which tracks the valence bonds still available. This makes impossible structures, such as a carbon with five single bonds, unrepresentable by construction rather than hoping models learn to avoid them.
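The idea can be illustrated with a deliberately simplified sketch. This is not the real SELFIES grammar - the mini-alphabet, `VALENCE` table, and `decode` function below are invented for illustration - but it shows the mechanism: each symbol requests an atom and a bond order, and the derivation state clips every request to the valence still available.

```python
# Toy valence-constrained decoder - NOT the real SELFIES grammar.
# Each symbol is an (atom, requested_bond_order) pair. The derivation
# state tracks how many bonds the previous atom can still form, and
# every request is clipped to that state, so an over-bonded atom is
# impossible by construction.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def decode(symbols):
    """Decode symbols into a chain of (atom, bond_order_to_previous)."""
    chain, state = [], 0
    for atom, requested in symbols:
        if not chain:
            order = 0                      # first atom: nothing to bond to
        else:
            # Clip the requested bond order to what the previous atom
            # (state) and the new atom (its valence) can each accept.
            order = min(requested, state, VALENCE[atom])
            if order == 0:
                break                      # no capacity left: stop cleanly
        chain.append((atom, order))
        state = VALENCE[atom] - order      # valence left on the new atom
    return chain

# A request for a triple bond to fluorine is clipped to a single bond,
# and a symbol arriving after the valence is exhausted is ignored.
print(decode([("C", 0), ("F", 3)]))            # [('C', 0), ('F', 1)]
print(decode([("F", 0), ("F", 1), ("F", 1)]))  # [('F', 0), ('F', 1)]
```

Note that a malformed request is repaired, never rejected: every symbol string decodes to *some* valid chain, which is the property the paper builds on.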

Experimental Validation

The authors ran a convincing set of experiments to demonstrate SELFIES’ robustness:

Random Mutation Test

They took the SELFIES and SMILES representations of MDMA and introduced random changes:

  • SMILES: After just one random mutation, only 26.6% of strings remained valid
  • SELFIES: 100% of mutated strings still represented valid molecules (though different from the original)

This stark difference shows why SELFIES is particularly valuable for evolutionary algorithms and genetic programming approaches to molecular design.
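The robustness claim can be sketched with the same kind of toy grammar (again an invented mini-alphabet and decoder for illustration, not the paper’s actual MDMA experiment): because the decoder clips every requested bond order to the available valence, random point mutations of the symbol string always decode to a chain that respects every atom’s valence.

```python
import random

# Toy mutation-robustness demo - NOT the paper's actual experiment.
# Any random point mutation of the symbol string still decodes to a
# valid chain, because requested bond orders are clipped to valence.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
ALPHABET = [(atom, order) for atom in VALENCE for order in (1, 2, 3)]

def decode(symbols):
    chain, state = [], 0
    for atom, requested in symbols:
        if not chain:
            order = 0                      # first atom has no attachment
        else:
            order = min(requested, state, VALENCE[atom])
            if order == 0:
                break                      # valence exhausted: stop
        chain.append((atom, order))
        state = VALENCE[atom] - order
    return chain

def is_valid(chain):
    # Every attachment bond must fit within the new atom's valence.
    return all(1 <= order <= VALENCE[atom] for atom, order in chain[1:])

random.seed(0)
base = [("C", 0), ("C", 1), ("O", 2), ("N", 1)]
mutants = []
for _ in range(1000):
    m = list(base)
    m[random.randrange(len(m))] = random.choice(ALPHABET)
    mutants.append(m)

all_mutants_valid = all(is_valid(decode(m)) for m in mutants)
print(all_mutants_valid)  # True: 100% of mutants decode to valid chains
```

This validity-by-construction property is exactly what makes the representation friendly to genetic algorithms: every mutation or crossover yields a usable candidate rather than a syntax error.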

Generative Model Performance

The real test came with actual machine learning models. The authors trained Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) on both representations:

VAE Results:

  • SMILES-based VAE: Large invalid regions scattered throughout the latent space
  • SELFIES-based VAE: Every point in the continuous latent space mapped to a valid molecule
  • The SELFIES model encoded over 100 times more diverse molecules

GAN Results:

  • Best SMILES GAN: 18.5% diverse, valid molecules
  • Best SELFIES GAN: 78.9% diverse, valid molecules

Scalability Demonstration

The authors showed that SELFIES works beyond toy molecules by successfully encoding and decoding all 72 million molecules in the PubChem database - demonstrating practical applicability to real chemical databases.

Looking Forward: Implementation Challenges

The authors identified key areas needing development for widespread adoption:

  • Canonicalization: No direct method for unique SELFIES strings (initially required SMILES conversion)
  • Feature completeness: Missing aromaticity, isotopes, and complex stereochemistry support
  • Community adoption: Need for extensive testing and migration paths

Historical Impact

This 2020 paper provided immediate practical value to the computational chemistry community while demonstrating a key principle: design representations for your applications. Rather than adapting human-designed formats like SMILES, the authors showed we could engineer ML-native molecular representations.

The work sparked broader conversation about representation design and showed that addressing format limitations could unlock new capabilities in drug discovery, materials science, and method development.

Connection to Current SELFIES

The modern SELFIES library has evolved significantly since this foundational paper, addressing the implementation challenges identified here. However, the core insight remains: make invalid molecular structures impossible rather than hoping models learn to avoid them.

References

  • Krenn, M., Häse, F., Nigam, A., Friederich, P., & Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4), 045024. https://doi.org/10.1088/2632-2153/aba947