Paper Summary

Citation: Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., & Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. Digital Discovery, 2(4), 897–908. https://doi.org/10.1039/D3DD00044C

This software update paper documents the major improvements made to the SELFIES Python library since its original 2019 release. The original SELFIES concept was promising, but the initial implementation had limitations that held back adoption: it was slow, supported limited chemistry, and was not user-friendly enough for real-world applications.

This update turns SELFIES from a proof of concept into a production-ready tool that handles diverse chemical databases and fits into machine learning workflows.

Key Library Improvements

Architectural Redesign

The biggest change under the hood is that the library now uses directed molecular graphs internally instead of operating directly on strings. This architectural shift brings several benefits:

  • Faster encoding/decoding: Converting between SMILES and SELFIES is now much more efficient
  • Better flexibility: The graph-based approach makes it easier to add support for new chemical features
  • Cleaner code: The internal logic is more maintainable and extensible

The authors benchmarked the library on the DTP dataset of over 300,000 molecules, completing a full roundtrip conversion (SMILES → SELFIES → SMILES) in 252 seconds. That works out to under a millisecond per molecule, which is fast enough for most practical applications.

Expanded Chemical Support

The 2023 version dramatically expands what kinds of molecules SELFIES can represent:

  • Aromatic systems: Aromatic molecules are now supported, with aromatic SMILES kekulized during encoding (aromatics were a major limitation in the original version)
  • Stereochemistry: Can now encode chirality and stereochemical information
  • Charged species: Supports ionic compounds and formal charges
  • Isotopes: Can represent specific isotopes of elements
  • Broader element coverage: Works with most elements relevant to organic and medicinal chemistry

This means the library can now handle essentially the same chemical diversity as SMILES while maintaining SELFIES’ core advantage: a 100% validity guarantee.

Customizable Chemical Constraints

One of the most interesting additions is the ability to customize the underlying chemical constraints. The validity guarantee in SELFIES comes from enforcing valence rules, but these aren’t universal:

  • Hypervalent compounds: Some molecules violate standard valence rules (like PF₅ or SF₆)
  • Specialized chemistry: Different research areas might need different constraints
  • Theoretical studies: Sometimes you want to explore “impossible” molecules

The set_semantic_constraints() function lets users define their own rules, making SELFIES more flexible for specialized applications while keeping the validity guarantee within whatever constraints they choose.

Implementation Details

Enhanced Machine Learning Integration

The library now includes several utility functions that make it easier to use SELFIES in ML pipelines:

  • Tokenization: split_selfies() breaks a SELFIES string into individual symbols
  • Length calculation: len_selfies() counts symbols (useful for padding/truncation)
  • Numerical encoding: selfies_to_encoding() converts SELFIES to label or one-hot encodings for neural networks

Random Molecule Generation

One of the most compelling demonstrations is how easy it becomes to generate random but valid molecules: concatenate randomly chosen SELFIES symbols, decode the string, and the result is always a valid molecule under the active constraints. The size distribution of the resulting molecules can even be controlled by filtering which symbols are allowed in the random generation.

Translation Attribution

Both the encoder and decoder now support an “attribution” mode that shows which input symbols are responsible for each output symbol. This is useful for understanding how the translation works and for aligning properties between SMILES and SELFIES representations.

Production Readiness and Future Outlook

The 2023 SELFIES library has become mature enough for large-scale applications, as the DTP roundtrip benchmark above indicates (over 300,000 molecules converted SMILES → SELFIES → SMILES in 252 seconds).

Current Capabilities

  • Handles most molecular diversity found in databases like PubChem
  • Fast enough for production machine learning workflows
  • Supports essentially the same chemical features as SMILES while maintaining 100% validity

Research Extensions

The roadmap includes support for even more complex chemical systems:

  • Polymers: Repeating chemical units
  • Crystals: Extended solid-state structures
  • Chemical reactions: Representing transformations, not just molecules
  • Non-covalent interactions: Hydrogen bonds, van der Waals forces, etc.

These extensions would make SELFIES useful for an even broader range of chemical applications beyond small-molecule drug discovery.

Significance for Computational Chemistry

This update addresses the main barriers that prevented widespread SELFIES adoption. The original theoretical advantages are now accessible to a much broader research community, enabling applications in:

  • Drug discovery: More efficient exploration of pharmacological space
  • Materials science: Systematic discovery of novel chemical structures
  • Chemical databases: More robust alternative to SMILES for storage and retrieval

The 100% validity guarantee remains SELFIES’ key advantage, but this update makes that advantage practically accessible for real-world research workflows.
