Overview
This perspective paper by Krenn, Aspuru-Guzik, and colleagues reviews 250 years of chemical notation evolution and proposes 16 specific research projects to expand SELFIES beyond traditional organic chemistry. Building on the original SELFIES work, the authors outline a research roadmap for extending robust molecular representations into new chemical domains and AI applications.
The paper’s main contribution is identifying concrete directions for making SELFIES applicable to complex systems like polymers, crystals, and chemical reactions - areas where traditional molecular representations break down.
Beyond SMILES vs SELFIES
While SMILES has been the de facto standard since its introduction in 1988, its fundamental weakness for machine learning is well-established: randomly generated or mutated SMILES strings are often syntactically or chemically invalid, which derails generative models that emit strings token by token. The authors note that other alternatives like InChI and DeepSMILES also fall short of providing 100% robustness.
SELFIES addressed this core problem, but this perspective paper asks: what’s next? How can we extend the robust representation principle to chemical domains that current formats can’t handle?
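The core idea can be illustrated with a toy decoder (a deliberately simplified sketch, not the actual SELFIES grammar): no matter what token sequence comes in, the decoder clamps every requested bond to the valence still available, so the output is always a valid structure.

```python
# Toy decoder illustrating the robustness principle (a simplified sketch,
# NOT the real SELFIES grammar): every token sequence yields a valid
# structure because each requested bond order is clamped to the valence
# that both atoms still have available.

MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}  # toy valence table

def decode(tokens):
    """Decode (atom_symbol, requested_bond_order) pairs into a chain.

    Returns (atoms, bonds) where bonds are (i, j, order) triples that
    never exceed any atom's valence, no matter what the input was.
    """
    atoms, bonds, free = [], [], []
    for symbol, requested in tokens:
        used = 0
        if atoms:
            # Clamp: a chemically impossible request (e.g. a triple bond
            # to fluorine) is silently reduced instead of raising an error.
            used = min(requested, free[-1], MAX_VALENCE[symbol])
            if used > 0:
                bonds.append((len(atoms) - 1, len(atoms), used))
                free[-1] -= used
        atoms.append(symbol)
        free.append(MAX_VALENCE[symbol] - used)
    return atoms, bonds
```

Requesting a triple bond from carbon to fluorine, for example, decodes to a single bond rather than an invalid molecule — errors are repaired during decoding instead of being raised.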
The 16 Research Projects: A Roadmap for Robust Representations
The 16 proposed projects extend robust representations beyond traditional organic chemistry, and they fall into several key themes:
Extending to New Domains
metaSELFIES (Project 1): Instead of manually crafting rules for each domain, the authors propose learning graph construction rules automatically from data. This could enable robust representations for any graph-based system - from quantum optics to biological networks - without needing domain-specific expertise.
Token Optimization (Project 2): SELFIES uses “overloading” where a symbol’s meaning changes based on context. This project would investigate how this affects machine learning performance and whether the approach can be optimized.
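Overloading can be sketched in a few lines (an assumption about the mechanism, using a simple modulo rule rather than SELFIES' actual derivation tables): a raw token is reinterpreted relative to whatever options are currently legal, so it can never select an illegal one.

```python
# Sketch of context-dependent ("overloaded") tokens. The modulo rule is a
# stand-in for SELFIES' real derivation tables: any integer token is mapped
# onto one of the currently legal options, so no token is ever invalid.

def interpret(token_index, legal_options):
    """Map any integer token onto one of the currently legal options."""
    return legal_options[token_index % len(legal_options)]
```

The same token 5 selects "O" when three atom choices are legal (5 % 3 = 2) but "double" when only two bond orders are (5 % 2 = 1); whether this symbol reuse helps or hurts learned representations is exactly what Project 2 asks.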
Handling Complex Molecular Systems
BigSELFIES (Project 3): Current representations struggle with large, often stochastic structures like polymers and biomolecules, which are assembled from repeating or probabilistic building blocks. BigSELFIES would combine hierarchical notation with stochastic building blocks to handle these complex systems where traditional small-molecule representations break down.
Crystal Structures (Projects 4-5): Crystals present unique challenges due to their infinite, periodic arrangements. The proposed approach represents crystal topology using labeled quotient graphs, enabling AI-driven materials design without relying on predefined crystal structures. This could unlock systematic exploration of theoretical materials space.
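A labeled quotient graph can be sketched as a finite graph whose edges carry lattice translation vectors; the encoding below is a standard textbook example (the primitive cubic net), not the paper's actual proposal, and the `neighbours` helper is a hypothetical illustration.

```python
# Hedged sketch of a labeled quotient graph: atoms in one unit cell are
# nodes, and each edge carries the lattice translation ("voltage") to the
# neighbouring copy of the cell. The infinite primitive cubic net — one
# atom per cell, bonded to its six nearest images — needs only 3 edges.
cubic_net = {
    "nodes": ["A"],
    "edges": [
        ("A", "A", (1, 0, 0)),  # bond to the cell shifted along x
        ("A", "A", (0, 1, 0)),  # along y
        ("A", "A", (0, 0, 1)),  # along z
    ],
}

def neighbours(graph, node, cell):
    """Periodic images bonded to `node` in unit cell `cell`."""
    result = []
    for u, v, t in graph["edges"]:
        if u == node:  # follow the edge forward: add the translation
            result.append((v, tuple(c + d for c, d in zip(cell, t))))
        if v == node:  # follow it backward: subtract the translation
            result.append((u, tuple(c - d for c, d in zip(cell, t))))
    return result
```

Unfolding the three labeled edges around any cell recovers the six bonds of the infinite crystal, which is why a finite graph suffices to describe a periodic structure.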
Beyond Organic Chemistry (Project 6): Transition metals and main-group compounds feature complex bonding that breaks the simple two-center, two-electron model. The solution: use machine learning on large structural databases to automatically learn these complex bonding rules rather than trying to encode them manually.
Chemical Reactions and Programming Concepts
Reaction Representations (Project 7): Moving beyond static molecules to represent chemical transformations. A robust reaction format would enforce conservation laws and could learn reactivity patterns from large reaction datasets, potentially revolutionizing synthesis planning.
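The conservation-law idea can be made concrete with an atom-balance check (a sketch: the formula parser here is deliberately naive, handling only flat formulas like C2H5OH with no parentheses, charges, or isotopes).

```python
# Sketch of the conservation law a robust reaction format would enforce:
# every atom on the reactant side must reappear on the product side.
import re
from collections import Counter

def formula_counts(formula):
    """Count atoms in a flat formula like 'C2H5OH' (no parentheses)."""
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts

def is_balanced(reactants, products):
    """True if both sides of the reaction contain the same atoms."""
    total = lambda side: sum((formula_counts(f) for f in side), Counter())
    return total(reactants) == total(products)
```

For methane combustion, `is_balanced(["CH4", "O2", "O2"], ["CO2", "H2O", "H2O"])` holds, while dropping one O2 breaks the balance. A robust reaction representation would make such violations inexpressible by construction rather than checking them after the fact.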
Programming Language Perspective (Projects 8-9): An intriguing reframing views molecular representations as programming languages executed by chemical parsers. This opens possibilities for adding loops, logic, and other programming concepts to efficiently describe complex structures. The ambitious goal: a Turing-complete programming language that’s also 100% robust.
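A minimal sketch of the loop idea (the `[RepeatN]` token is hypothetical, invented for this illustration and not part of any existing format): a parser expands a compact "program" into the full token string.

```python
# Toy "molecular program" parser: a hypothetical [RepeatN] token repeats
# the previous token N times, so long chains can be written compactly —
# a loop construct in the programming-language view of representations.

def expand(tokens):
    out = []
    for tok in tokens:
        if tok.startswith("[Repeat") and out:
            n = int(tok[len("[Repeat"):-1])
            out.extend([out[-1]] * (n - 1))  # previous token, n-1 more times
        else:
            # Ordinary tokens (or a leading repeat with nothing to copy)
            # pass through unchanged.
            out.append(tok)
    return out
```

A 10-carbon chain becomes two tokens, `["[C]", "[Repeat10]"]`, instead of ten; making such constructs both expressive and 100% robust is the open challenge the projects describe.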
Representation Comparison and Benchmarking
Empirical Comparisons (Projects 10-11): With multiple representation options (strings, matrices, images), we need systematic comparisons. The proposed benchmarks would go beyond simple validity metrics to focus on real-world design objectives in drug discovery, catalysis, and materials science.
Understanding Interpretability
Human Readability (Project 12): While SMILES is often called “human-readable,” this claim lacks scientific validation. The proposed study would test how well humans actually understand different molecular representations.
Machine Learning Perspectives (Projects 13-16): These projects explore how machines interpret molecular representations:
- Training networks to translate between formats to find universal representations
- Comparing learning efficiency across different formats
- Investigating latent space smoothness in generative models
- Visualizing what models actually learn about molecular structure
Strategic Importance for Molecular AI
This research roadmap addresses a crucial bottleneck in computational chemistry. As generative models become more sophisticated, representation limitations become fundamental barriers to progress. The proposed extensions could unlock new capabilities across multiple domains:
- Drug discovery: More efficient exploration of pharmacological space beyond small molecules
- Materials design: Systematic discovery of novel crystal structures and polymers
- Synthesis planning: Better representations for reaction prediction and retrosynthesis
- Fundamental research: New ways to understand and predict chemical behavior in complex systems
The authors emphasize that robust representations could become a bridge for human scientists to learn new concepts from AI systems, enabling the kind of bidirectional human-machine learning from which real breakthroughs emerge.
References
- Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., Friederich, P., Gaudin, T., Gayle, A. A., Jablonka, K. M., Lameiro, R. F., Lemm, D., Lo, A., Moosavi, S. M., Nápoles-Duarte, J. M., Nigam, A., Pollice, R., Rajan, K., Schatzschneider, U., … Aspuru-Guzik, A. (2022). SELFIES and the future of molecular string representations. Patterns, 3(10), 100588. https://doi.org/10.1016/j.patter.2022.100588