Paper Information
Citation: Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., & Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. Digital Discovery, 2(4), 897-908. https://doi.org/10.1039/D3DD00044C
Publication: Digital Discovery 2023
What kind of paper is this?
This is a Resource paper - specifically a software update paper documenting major improvements to an existing computational tool (the SELFIES Python library).
What is the motivation?
While the original SELFIES concept was promising, the initial 2019 implementation had critical limitations that prevented widespread adoption:
- Performance: Too slow for production ML workflows
- Limited chemistry: Couldn’t represent aromatic molecules, stereochemistry, or many other important chemical features
- Poor usability: Lacked user-friendly APIs for common tasks
These barriers meant that despite SELFIES’ theoretical advantages (100% validity guarantee), researchers couldn’t practically use it for real-world applications like drug discovery or materials science.
What is the novelty here?
The key innovations transform SELFIES from a proof of concept into a production-ready tool:
Architectural redesign: Switched from string-based operations to directed molecular graphs internally, dramatically improving performance and extensibility
Expanded chemical support: Added aromatic systems, stereochemistry, charged species, isotopes, and broader element coverage - essentially matching SMILES’ chemical diversity while maintaining validity guarantees
Customizable semantic constraints: Introduced the `set_semantic_constraints()` API, allowing users to define custom valence rules for specialized applications (hypervalent compounds, theoretical studies)
ML integration utilities: Added tokenization, encoding, length calculation, and attribution functions specifically designed for neural network workflows
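To make the ML-oriented utilities more concrete, here is a minimal pure-Python sketch of tokenization, length calculation, and integer encoding. It is not the library's implementation: the helper names (`split_symbols`, `build_vocab`, `encode`) are hypothetical, the vocabulary ordering is an arbitrary choice, and `[nop]` is assumed as the padding symbol.

```python
import re

# One bracketed SELFIES symbol, e.g. "[C]" or "[=O]" (illustrative pattern;
# "." separators between disconnected fragments are simply skipped).
SYMBOL_RE = re.compile(r"\[[^\[\]]*\]")

PAD = "[nop]"  # assumed padding symbol


def split_symbols(selfies):
    """Split a SELFIES string into its bracketed symbols."""
    return SYMBOL_RE.findall(selfies)


def build_vocab(strings):
    """Map every symbol in the dataset to an integer, with PAD at index 0."""
    symbols = {s for string in strings for s in split_symbols(string)}
    symbols.discard(PAD)
    return {sym: i for i, sym in enumerate([PAD] + sorted(symbols))}


def encode(selfies, vocab, pad_to):
    """Integer-encode a SELFIES string, right-padded to a fixed length."""
    symbols = split_symbols(selfies)
    if len(symbols) > pad_to:
        raise ValueError("string is longer than pad_to")
    symbols = symbols + [PAD] * (pad_to - len(symbols))
    return [vocab[s] for s in symbols]


dataset = ["[C][C][O]", "[C][=O]", "[N][C][F]"]
vocab = build_vocab(dataset)
max_len = max(len(split_symbols(s)) for s in dataset)  # symbol count, not characters
encoded = [encode(s, vocab, pad_to=max_len) for s in dataset]
```

Length is measured in symbols, not characters, which is why the padding target comes from the token count of the longest string.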
What experiments were performed?
The authors validated the library through several benchmarks:
Performance testing: Roundtrip conversion (SMILES → SELFIES → SMILES) on 300,000+ molecules from the DTP dataset completed in 252 seconds, demonstrating production-ready speed.
Chemical coverage: Tested on diverse molecular databases to verify the library handles most chemical diversity found in resources like PubChem.
Validity guarantee: Demonstrated that random SELFIES strings always decode to valid molecules, with controllable size distributions through symbol filtering.
Attribution system: Showed both encoder and decoder can track which input symbols produce which output symbols, useful for property alignment.
What outcomes/conclusions?
The 2023 update successfully addresses the main adoption barriers:
- Fast enough for large-scale ML applications (300K molecules in ~4 minutes)
- Chemically comprehensive enough for drug discovery and materials science
- User-friendly enough for seamless integration into existing workflows
The validity guarantee - SELFIES’ core advantage - is now practically accessible for real-world research. The roadmap includes future extensions for polymers, crystals, chemical reactions, and non-covalent interactions, which would expand SELFIES’ applicability beyond small-molecule chemistry.
Limitations acknowledged: The paper focuses on implementation improvements rather than novel algorithmic contributions. Some advanced chemical systems (polymers, crystals) still need future work.
Reproducibility Details
Algorithms
Technical Specification: The Grammar
The core innovation of SELFIES is a Context-Free Grammar (CFG) augmented with state-machine logic to ensure that every derived string represents a valid molecule. While the software features are important, understanding the underlying derivation rules is essential for replication or extension of the system.
1. Derivation Rules: The Atom State Machine
The fundamental mechanism that guarantees validity is a state machine that tracks the remaining valence of the most recently added atom:
- State Tracking: The derivation maintains a non-terminal state $X_l$, where $l$ represents the current atom’s remaining valence (number of bonds it can still form)
- Standard Derivation: An atom symbol $[\beta \alpha]$ (bond order + atom type) transitions the state from $S$ (start) to $X_l$, where $l$ is calculated from the atom’s standard valence minus the incoming bond order
- Bond Demotion (The Key Rule): If the requested bond order $d(\beta)$ (the numeric order of the bond symbol $\beta$) exceeds the remaining valence $l$ of the previous atom, the bond is automatically demoted to order $\min(l, d(\beta))$ so that the valence is never exceeded. This automatic adjustment is the mathematical core of the validity guarantee.
This state machine ensures that no atom ever exceeds its allowed valence, making it impossible to generate chemically invalid structures.
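The state machine can be sketched for the simplest case, an unbranched chain. This is a toy illustration, not the library's decoder: `derive_chain` and the `VALENCE` table are hypothetical names, symbols are given as pre-parsed `(bond order, element)` pairs, and capping the bond by the new atom's own valence is an added assumption beyond the $\min(l, d(\beta))$ rule stated above.

```python
# Default valences quoted in this summary (illustrative subset).
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}


def derive_chain(symbols):
    """Derive a linear chain from (requested_bond_order, element) pairs.

    Returns a list of (bond_order_used, element); the first atom has bond 0.
    `remaining` plays the role of l in the state X_l.
    """
    atoms = []
    remaining = None  # start state S: no previous atom yet
    for order, element in symbols:
        cap = VALENCE[element]
        if remaining is None:
            used = 0  # first atom has no incoming bond
        elif remaining == 0:
            break  # previous atom is saturated, so the derivation stops
        else:
            # Bond demotion: never exceed what either atom can still accept.
            used = min(order, remaining, cap)
        atoms.append((used, element))
        remaining = cap - used
    return atoms


# A requested triple bond to oxygen is demoted to a double bond, after which
# oxygen is saturated and the trailing carbon is never attached.
chain = derive_chain([(1, "C"), (3, "O"), (1, "C")])
```

Here `chain` describes a carbon double-bonded to an oxygen: the over-ambitious request is repaired instead of being rejected, which is the behavior the validity guarantee relies on.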
2. Control Symbols: Branches and Rings
Branch length calculation: Unlike simple bracket-based branching, SELFIES uses a hexadecimal encoding to determine branch lengths. If a branch symbol is followed by index symbols $c_1 \dots c_k$, the number of symbols $N$ to include in the branch is calculated as:
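$$N = 1 + \sum_{i=1}^{k} 16^{\,k-i} \cdot c_i$$
This formula reads the index symbols as the digits of a base-16 number (each $c_i$ stands for a value from 0 to 15, with $c_1$ the most significant digit), allowing a compact representation of long branches.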
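As a worked example of the base-16 reading, the sketch below computes $N$ from a list of index values given most significant digit first. The function name `branch_length` is hypothetical, and the inputs are the numeric values (0 to 15) that the index symbols stand for, not the symbols themselves.

```python
def branch_length(index_values):
    """Return N = 1 + (base-16 number spelled by index_values).

    index_values: integers in 0..15, most significant digit first.
    """
    number = 0
    for value in index_values:
        if not 0 <= value < 16:
            raise ValueError("index values must be base-16 digits (0..15)")
        number = 16 * number + value
    return 1 + number


one_digit = branch_length([3])        # 1 + 3
two_digits = branch_length([1, 0])    # 1 + (1 * 16 + 0)
largest_two = branch_length([15, 15]) # 1 + (15 * 16 + 15)
```

With one index symbol a branch can span 1 to 16 symbols; a second index symbol extends the range to 256.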
Ring closure queue system: Ring formation uses a lazy evaluation strategy to maintain validity. Ring symbols don’t create bonds immediately - instead, they push “closure candidates” into a queue $R$. These candidates are resolved after the main derivation completes. A ring closure candidate is rejected if either atom has no remaining valence ($m = 0$), or the closure would create a self-loop (same atom). This deferred validation prevents the creation of invalid ring structures while still allowing the grammar to remain context-free during the main derivation.
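The deferred resolution step can be sketched as follows. This is an illustrative model, not the library's code: `resolve_ring_closures` is a hypothetical name, atoms are identified by integer indices, and demoting an accepted ring bond to the smaller of the two remaining valences is an assumption made here by analogy with bond demotion.

```python
from collections import deque


def resolve_ring_closures(free_valence, candidates):
    """Resolve queued ring-closure candidates after the main derivation.

    free_valence: dict atom_index -> remaining valence (updated in place)
    candidates:   iterable of (atom_a, atom_b, requested_order), in queue order
    Returns the list of accepted ring bonds as (atom_a, atom_b, order_used).
    """
    queue = deque(candidates)
    bonds = []
    while queue:
        a, b, order = queue.popleft()
        if a == b:
            continue  # rejected: closure would create a self-loop
        if free_valence[a] == 0 or free_valence[b] == 0:
            continue  # rejected: one of the atoms has no remaining valence
        # Assumption: demote the ring bond so neither atom is over-bonded.
        used = min(order, free_valence[a], free_valence[b])
        free_valence[a] -= used
        free_valence[b] -= used
        bonds.append((a, b, used))
    return bonds


free = {0: 1, 1: 0, 2: 2, 3: 2}
accepted = resolve_ring_closures(
    free, [(0, 0, 1), (0, 1, 1), (2, 3, 3), (0, 2, 1)]
)
```

Because candidates are processed in order, an early closure can use up the valence that a later candidate would have needed; the later one is then dropped without invalidating the molecule.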
3. Symbol Structure and Standardization
SELFIES enforces a strict, standardized format for atom symbols to eliminate ambiguity:
- Canonical Format: Atom symbols follow the structure `[Bond, Isotope, Element, Chirality, H-count, Charge]`
- No Variation: Unlike SMILES, there’s only one way to write each symbol (e.g., `[Fe++]` and `[Fe+2]` are standardized to a single form)
- Order Matters: The components must appear in the specified order
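The fixed component order lends itself to a single regular expression. The pattern below is a simplified, hypothetical sketch rather than the library's parser: it assumes a small set of bond prefixes (`=`, `#`, `/`, `\`), an explicit digit after `H`, and the signed-number charge form (`+2`, not `++`) as the canonical spelling.

```python
import re

# Components in the fixed order: bond, isotope, element, chirality, H-count, charge.
ATOM_RE = re.compile(
    r"\[(?P<bond>[=#/\\]?)"
    r"(?P<isotope>\d+)?"
    r"(?P<element>[A-Z][a-z]?)"
    r"(?P<chirality>@{1,2})?"
    r"(?:H(?P<hcount>\d+))?"
    r"(?P<charge>[+-]\d+)?\]"
)


def parse_atom_symbol(symbol):
    """Parse a canonical atom symbol into its components, or return None."""
    match = ATOM_RE.fullmatch(symbol)
    if match is None:
        return None  # non-canonical spelling or components out of order
    return {
        "bond": match["bond"],  # "" means no explicit bond prefix
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "element": match["element"],
        "chirality": match["chirality"],
        "hcount": int(match["hcount"]) if match["hcount"] else None,
        "charge": int(match["charge"]) if match["charge"] else 0,
    }


iron = parse_atom_symbol("[Fe+2]")
full = parse_atom_symbol("[=13C@@H1-1]")
non_canonical = parse_atom_symbol("[Fe++]")   # repeated-sign charge form
out_of_order = parse_atom_symbol("[C+1@@]")   # charge placed before chirality
```

Under these assumptions both `non_canonical` and `out_of_order` come back as `None`, which mirrors the "one spelling per symbol" and "order matters" rules in the list above.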
4. Default Semantic Constraints
By default, the library enforces standard organic chemistry valence rules:
- Typical Valences: C=4, N=3, O=2, F=1
- Element Coverage: Supports most elements relevant to organic and medicinal chemistry
- Customizable: These constraints can be modified via `set_semantic_constraints()` for specialized applications (hypervalent compounds, theoretical studies, etc.)
The combination of these grammar rules with the state machine ensures that every valid SELFIES string decodes to a chemically valid molecule, regardless of how the string was generated (random, ML model output, manual construction, etc.).
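As a rough illustration of how a customizable valence table changes which bonds are allowed, the sketch below models the table as a plain dictionary. It does not call the library (there, the entry point is `set_semantic_constraints()`); the helper names `customized` and `allowed_bond_order`, and the relaxed nitrogen cap of 5, are hypothetical choices for the example.

```python
# Default valences quoted in this summary (illustrative subset).
DEFAULT_CONSTRAINTS = {"C": 4, "N": 3, "O": 2, "F": 1}


def customized(base, overrides):
    """Return a new constraint table with overrides applied; base is untouched."""
    for element, cap in overrides.items():
        if not isinstance(cap, int) or cap < 0:
            raise ValueError(f"invalid valence cap for {element!r}: {cap!r}")
    return {**base, **overrides}


def allowed_bond_order(element, bonds_in_use, requested, constraints):
    """Largest bond order <= requested that keeps the atom within its cap."""
    free = constraints[element] - bonds_in_use
    return max(0, min(requested, free))


# Hypothetical relaxed table that permits pentavalent nitrogen.
relaxed = customized(DEFAULT_CONSTRAINTS, {"N": 5})

# A nitrogen that already carries three bonds asks for one more double bond.
default_result = allowed_bond_order("N", 3, 2, DEFAULT_CONSTRAINTS)
relaxed_result = allowed_bond_order("N", 3, 2, relaxed)
```

Under the default table the extra bond is refused (order 0), while the relaxed table admits the full double bond. The grammar is the same in both cases; only the caps it enforces differ.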
Data
Benchmark dataset: DTP (Developmental Therapeutics Program) dataset with 300,000+ molecules used for performance testing.
Chemical coverage testing: Diverse molecular databases including PubChem to verify the library handles broad chemical diversity (aromatic systems, stereochemistry, charged species, isotopes).
Evaluation
Performance metric: Roundtrip conversion time (SMILES → SELFIES → SMILES) - 252 seconds for 300K molecules.
Validity testing: Random SELFIES string generation with decoding verification - 100% of generated strings decode to valid molecules.
Attribution system: Encoder and decoder track which input symbols produce which output symbols, tested for property alignment between SMILES and SELFIES representations.
