Overview
This perspective paper by Krenn, Aspuru-Guzik, and colleagues reviews 250 years of chemical notation evolution and proposes 16 specific research projects to expand SELFIES beyond traditional organic chemistry. Building on the original SELFIES work, the authors outline a research roadmap for extending robust molecular representations into new chemical domains and AI applications.
The paper’s main contribution is identifying concrete directions for making SELFIES applicable to complex systems like polymers, crystals, and chemical reactions - areas where traditional molecular representations break down.
Beyond SMILES vs SELFIES
While SMILES has been the de facto standard since its introduction in 1988, its fundamental weakness for machine learning is well-established: randomly generated or mutated SMILES strings are often syntactically or chemically invalid, which derails generative models that emit strings token by token. The authors note that other alternatives like InChI and DeepSMILES also fall short of providing 100% robustness.
SELFIES addressed this core problem, but this perspective paper asks: what’s next? How can we extend the robust representation principle to chemical domains that current formats can’t handle?
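The core idea can be illustrated with a toy decoder (a deliberately simplified sketch, not the actual SELFIES grammar): no matter what token sequence comes in, the decoder clamps every requested bond to the valence still available, so the output is always a valid structure.

```python
# Toy decoder illustrating the robustness principle (a simplified sketch,
# NOT the real SELFIES grammar): every token sequence yields a valid
# structure because each requested bond order is clamped to the valence
# that both atoms still have available.

MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}  # toy valence table

def decode(tokens):
    """Decode (atom_symbol, requested_bond_order) pairs into a chain.

    Returns (atoms, bonds) where bonds are (i, j, order) triples that
    never exceed any atom's valence, no matter what the input was.
    """
    atoms, bonds, free = [], [], []
    for symbol, requested in tokens:
        used = 0
        if atoms:
            # Clamp: a chemically impossible request (e.g. a triple bond
            # to fluorine) is silently reduced instead of raising an error.
            used = min(requested, free[-1], MAX_VALENCE[symbol])
            if used > 0:
                bonds.append((len(atoms) - 1, len(atoms), used))
                free[-1] -= used
        atoms.append(symbol)
        free.append(MAX_VALENCE[symbol] - used)
    return atoms, bonds
```

Requesting a triple bond from carbon to fluorine, for example, decodes to a single bond rather than an invalid molecule — errors are repaired during decoding instead of being raised.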
The 16 Research Projects: A Roadmap for Robust Representations
The 16 proposed projects extend robust representations beyond traditional organic chemistry, and they fall into several key themes:
Extending to New Domains
metaSELFIES (Project 1): Instead of manually crafting rules for each domain, the authors propose learning graph construction rules automatically from data. This could enable robust representations for any graph-based system - from quantum optics to biological networks - without needing domain-specific expertise.
Token Optimization (Project 2): SELFIES uses “overloading” where a symbol’s meaning changes based on context. This project would investigate how this affects machine learning performance and whether the approach can be optimized.
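Overloading can be sketched in a few lines (an assumption about the mechanism, using a simple modulo rule rather than SELFIES' actual derivation tables): a raw token is reinterpreted relative to whatever options are currently legal, so it can never select an illegal one.

```python
# Sketch of context-dependent ("overloaded") tokens. The modulo rule is a
# stand-in for SELFIES' real derivation tables: any integer token is mapped
# onto one of the currently legal options, so no token is ever invalid.

def interpret(token_index, legal_options):
    """Map any integer token onto one of the currently legal options."""
    return legal_options[token_index % len(legal_options)]
```

The same token 5 selects "O" when three atom choices are legal (5 % 3 = 2) but "double" when only two bond orders are (5 % 2 = 1); whether this symbol reuse helps or hurts learned representations is exactly what Project 2 asks.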
Handling Complex Molecular Systems
BigSELFIES (Project 3): Current representations struggle with large, often stochastic structures like polymers and biomolecules, which are assembled from repeating or probabilistic building blocks. BigSELFIES would combine hierarchical notation with stochastic building blocks to handle these complex systems where traditional small-molecule representations break down.
Crystal Structures (Projects 4-5): Crystals present unique challenges due to their infinite, periodic arrangements. The proposed approach represents crystal topology using labeled quotient graphs, enabling AI-driven materials design without relying on predefined crystal structures. This could unlock systematic exploration of theoretical materials space.
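A labeled quotient graph can be sketched as a finite graph whose edges carry lattice translation vectors; the encoding below is a standard textbook example (the primitive cubic net), not the paper's actual proposal, and the `neighbours` helper is a hypothetical illustration.

```python
# Hedged sketch of a labeled quotient graph: atoms in one unit cell are
# nodes, and each edge carries the lattice translation ("voltage") to the
# neighbouring copy of the cell. The infinite primitive cubic net — one
# atom per cell, bonded to its six nearest images — needs only 3 edges.
cubic_net = {
    "nodes": ["A"],
    "edges": [
        ("A", "A", (1, 0, 0)),  # bond to the cell shifted along x
        ("A", "A", (0, 1, 0)),  # along y
        ("A", "A", (0, 0, 1)),  # along z
    ],
}

def neighbours(graph, node, cell):
    """Periodic images bonded to `node` in unit cell `cell`."""
    result = []
    for u, v, t in graph["edges"]:
        if u == node:  # follow the edge forward: add the translation
            result.append((v, tuple(c + d for c, d in zip(cell, t))))
        if v == node:  # follow it backward: subtract the translation
            result.append((u, tuple(c - d for c, d in zip(cell, t))))
    return result
```

Unfolding the three labeled edges around any cell recovers the six bonds of the infinite crystal, which is why a finite graph suffices to describe a periodic structure.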
Beyond Organic Chemistry (Project 6): Transition metals and main-group compounds feature complex bonding that breaks the simple two-center, two-electron model. The solution: use machine learning on large structural databases to automatically learn these complex bonding rules rather than trying to encode them manually.
Chemical Reactions and Programming Concepts
Reaction Representations (Project 7): Moving beyond static molecules to represent chemical transformations. A robust reaction format would enforce conservation laws and could learn reactivity patterns from large reaction datasets, potentially revolutionizing synthesis planning.
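The conservation-law idea can be made concrete with an atom-balance check (a sketch: the formula parser here is deliberately naive, handling only flat formulas like C2H5OH with no parentheses, charges, or isotopes).

```python
# Sketch of the conservation law a robust reaction format would enforce:
# every atom on the reactant side must reappear on the product side.
import re
from collections import Counter

def formula_counts(formula):
    """Count atoms in a flat formula like 'C2H5OH' (no parentheses)."""
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts

def is_balanced(reactants, products):
    """True if both sides of the reaction contain the same atoms."""
    total = lambda side: sum((formula_counts(f) for f in side), Counter())
    return total(reactants) == total(products)
```

For methane combustion, `is_balanced(["CH4", "O2", "O2"], ["CO2", "H2O", "H2O"])` holds, while dropping one O2 breaks the balance. A robust reaction representation would make such violations inexpressible by construction rather than checking them after the fact.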
Programming Language Perspective (Projects 8-9): An intriguing reframing views molecular representations as programming languages executed by chemical parsers. This opens possibilities for adding loops, logic, and other programming concepts to efficiently describe complex structures. The ambitious goal: a Turing-complete programming language that’s also 100% robust.
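A minimal sketch of the loop idea (the `[RepeatN]` token is hypothetical, invented for this illustration and not part of any existing format): a parser expands a compact "program" into the full token string.

```python
# Toy "molecular program" parser: a hypothetical [RepeatN] token repeats
# the previous token N times, so long chains can be written compactly —
# a loop construct in the programming-language view of representations.

def expand(tokens):
    out = []
    for tok in tokens:
        if tok.startswith("[Repeat") and out:
            n = int(tok[len("[Repeat"):-1])
            out.extend([out[-1]] * (n - 1))  # previous token, n-1 more times
        else:
            # Ordinary tokens (or a leading repeat with nothing to copy)
            # pass through unchanged.
            out.append(tok)
    return out
```

A 10-carbon chain becomes two tokens, `["[C]", "[Repeat10]"]`, instead of ten; making such constructs both expressive and 100% robust is the open challenge the projects describe.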
Representation Comparison and Benchmarking
Empirical Comparisons (Projects 10-11): With multiple representation options (strings, matrices, images), we need systematic comparisons. The proposed benchmarks would go beyond simple validity metrics to focus on real-world design objectives in drug discovery, catalysis, and materials science.
Understanding Interpretability
Human Readability (Project 12): While SMILES is often called “human-readable,” this claim lacks scientific validation. The proposed study would test how well humans actually understand different molecular representations.
Machine Learning Perspectives (Projects 13-16): These projects explore how machines interpret molecular representations:
- Training networks to translate between formats to find universal representations
- Comparing learning efficiency across different formats
- Investigating latent space smoothness in generative models
- Visualizing what models actually learn about molecular structure
Strategic Importance for Molecular AI
This research roadmap addresses a crucial bottleneck in computational chemistry. As generative models become more sophisticated, representation limitations become fundamental barriers to progress. The proposed extensions could unlock new capabilities across multiple domains:
- Drug discovery: More efficient exploration of pharmacological space beyond small molecules
- Materials design: Systematic discovery of novel crystal structures and polymers
- Synthesis planning: Better representations for reaction prediction and retrosynthesis
- Fundamental research: New ways to understand and predict chemical behavior in complex systems
The authors emphasize that robust representations could become a bridge for human scientists to learn new concepts from AI systems, enabling the kind of bidirectional human-machine learning from which real breakthroughs emerge.
References
- Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., Friederich, P., Gaudin, T., Gayle, A. A., Jablonka, K. M., Lameiro, R. F., Lemm, D., Lo, A., Moosavi, S. M., Nápoles-Duarte, J. M., Nigam, A., Pollice, R., Rajan, K., Schatzschneider, U., … Aspuru-Guzik, A. (2022). SELFIES and the future of molecular string representations. Patterns, 3(10), 100588. https://doi.org/10.1016/j.patter.2022.100588