Recent Advances in the SELFIES Library (2023)

SELFIES as a Production-Ready Resource

This software update paper documents major improvements to the SELFIES Python library, establishing it as a production-ready computational tool.

Limitations in the Original SELFIES Implementation

While the original SELFIES concept was promising, the initial 2019 implementation had critical limitations that prevented widespread adoption:

Performance: Too slow for production ML workflows
Limited chemistry: Couldn’t represent aromatic molecules, stereochemistry, or many other important chemical features
Poor usability: Lacked user-friendly APIs for common tasks

These barriers meant that despite SELFIES’ theoretical advantages (100% validity guarantee), researchers couldn’t practically use it for real-world applications like drug discovery or materials science.

Architectural Refactoring and New ML Integrations

The 2023 update refactors the underlying SELFIES engine to make the library production-ready. The key updates include:

Graph-Based Processing: The system replaces string-based operations with directed molecular graphs internally, significantly increasing execution speed and extensibility.
Expanded Canonical Forms: Adds standardized support for aromatic systems, stereochemistry, charged species, and isotopes, bringing chemical coverage close to that of SMILES while preserving the validity guarantee.
Semantic Constraint API: Introduces the set_semantic_constraints() function, allowing specification of custom valence definitions useful for theoretical studies or hypervalent states.
Deep Learning Hooks: Integrates tokenization, length estimation, and attribution utilities specifically engineered to plug directly into neural network encoders and decoders.

Performance Benchmarks & Validity Testing

The authors validated the library through several benchmarks:

Performance testing: Roundtrip conversion (SMILES → SELFIES → SMILES) on 300,000+ molecules from the DTP dataset completed in 252 seconds, a speed suitable for production use and reportedly orders of magnitude faster than the earlier string-based implementation.
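A roundtrip timing of this kind could be measured with a harness like the one below. This is a sketch, not the authors' benchmarking script: time_roundtrip is a hypothetical helper, and the placeholder translator stands in for selfies.encoder and selfies.decoder so the snippet runs without dependencies.

```python
import time
from typing import Callable, Iterable, Tuple


def time_roundtrip(smiles_list: Iterable[str],
                   encode: Callable[[str], str],
                   decode: Callable[[str], str]) -> Tuple[int, float]:
    """Return (molecule count, elapsed seconds) for SMILES -> SELFIES -> SMILES."""
    start = time.perf_counter()
    count = 0
    for smiles in smiles_list:
        decode(encode(smiles))
        count += 1
    return count, time.perf_counter() - start


def passthrough(text: str) -> str:
    """Placeholder translator; swap in selfies.encoder / selfies.decoder."""
    return text


count, seconds = time_roundtrip(["CCO", "c1ccccc1", "CC(=O)O"],
                                passthrough, passthrough)
print(f"{count} molecules in {seconds:.6f} s")

# Throughput implied by the reported figure, taking 300,000 as a lower bound.
reported_rate = 300_000 / 252
print(f"about {reported_rate:.0f} molecules per second")
```

At the reported figure this works out to roughly 1,190 molecules per second.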

Chemical coverage: Tested on diverse molecular databases to verify the library handles most chemical diversity found in resources like PubChem.

Validity guarantee: Demonstrated that random SELFIES strings always decode to valid molecules, with controllable size distributions through symbol filtering.
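The sampling side of this experiment can be sketched as follows. The alphabet is a small illustrative one (in practice it might come from selfies.get_semantic_robust_alphabet()), and the choice to filter out [F] is an assumption made here: a monovalent atom leaves no free valence, so it would tend to end a chain early. Each sampled string would then be passed to selfies.decoder to check validity.

```python
import random

# Illustrative alphabet only, not the library's full symbol set.
ALPHABET = ["[C]", "[=C]", "[N]", "[O]", "[F]", "[Branch1]", "[Ring1]"]


def filter_alphabet(alphabet, excluded):
    """Drop symbols that should not appear in sampled strings."""
    return [symbol for symbol in alphabet if symbol not in excluded]


def sample_selfies(alphabet, min_len, max_len, rng):
    """Concatenate a random number of symbols drawn uniformly from alphabet."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(alphabet) for _ in range(length))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
filtered = filter_alphabet(ALPHABET, excluded={"[F]"})
samples = [sample_selfies(filtered, 5, 20, rng) for _ in range(3)]
for sample in samples:
    print(sample)
```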

Attribution system: Showed both encoder and decoder can track which input symbols produce which output symbols, useful for property alignment.

Future Trajectories for General Chemical Representations

The 2023 update successfully addresses the main adoption barriers:

Fast enough for large-scale ML applications (300K molecules in ~4 minutes)
Chemically comprehensive enough for drug discovery and materials science
User-friendly enough for seamless integration into existing workflows

The validity guarantee, SELFIES’ core advantage, is now practically accessible for real-world research. The roadmap includes future extensions for polymers, crystals, chemical reactions, and non-covalent interactions, which would expand SELFIES’ applicability beyond small-molecule chemistry.

Limitations acknowledged: The paper focuses on implementation improvements. Some advanced chemical systems (polymers, crystals) still need future work.

Reproducibility Details

Code

The selfies library is completely open-source and written in pure Python. It requires no extra dependencies and is available on GitHub, installable via pip install selfies. The repository includes testing suites (tox) and example benchmarking scripts to reproduce the translation speeds reported in the paper.

Hardware

Performance benchmarks (e.g., the 252-second roundtrip conversion on 300K molecules) were executed on Google Colaboratory using two 2.20GHz Intel Xeon CPUs.

Algorithms

Technical Specification: The Grammar

The core innovation of SELFIES is a Context-Free Grammar (CFG) augmented with state-machine logic to ensure that every derived string represents a valid molecule. While the software features are important, understanding the underlying derivation rules is essential for replication or extension of the system.

1. Derivation Rules: The Atom State Machine

The fundamental mechanism that guarantees validity is a state machine that tracks the remaining valence of the most recently added atom:

State Tracking: The derivation maintains a non-terminal state $X_l$, where $l$ represents the current atom’s remaining valence (number of bonds it can still form)
Standard Derivation: An atom symbol $[\beta \alpha]$ (bond order + atom type) transitions the state from $S$ (start) to $X_l$, where $l$ is calculated from the atom’s standard valence minus the incoming bond order
Bond Demotion (The Key Rule): If a requested bond order $\beta$ exceeds the available valence $l$ of the previous atom, the bond is automatically demoted to $\min(l, d(\beta))$, where $d(\beta)$ is the numeric order of the requested bond (1, 2, or 3), to prevent invalid valency. This automatic adjustment is the core of the validity guarantee.

This state machine ensures that no atom ever exceeds its allowed valence, making it impossible to generate chemically invalid structures.
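The state machine can be sketched for the simplest case, a linear chain with no branches or rings. This is an illustration rather than the library's implementation: derive_chain and the lookup tables are hypothetical names, the valences are the defaults quoted in this summary, and the additional cap by the new atom's own valence is an assumption added so that the new atom also stays within bounds.

```python
# Default valences quoted in this summary; bond prefixes map to bond orders.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"": 1, "=": 2, "#": 3}
BOND_CHAR = {0: "", 1: "", 2: "=", 3: "#"}


def derive_chain(symbols):
    """Derive a linear chain from (bond, element) pairs, demoting bonds as needed."""
    chain = []
    remaining = None  # None is the start state S; an int l is the state X_l
    for bond, element in symbols:
        if remaining is None:
            order = 0  # the first atom has no incoming bond
        elif remaining == 0:
            break  # previous atom is saturated, so the derivation ends
        else:
            # Demotion: never exceed the previous atom's free valence.
            # Capping by the new atom's valence is an extra assumption here.
            order = min(remaining, BOND_ORDER[bond], VALENCE[element])
        chain.append(BOND_CHAR[order] + element)
        remaining = VALENCE[element] - order
    return "".join(chain)


print(derive_chain([("", "F"), ("=", "C"), ("#", "N")]))  # FC#N
print(derive_chain([("", "C"), ("#", "O"), ("", "C")]))   # C=O
```

In the first call the requested double bond to carbon is demoted to a single bond because fluorine has only one free valence. In the second, the triple bond to oxygen is demoted to a double bond and the trailing carbon is dropped because oxygen is then saturated.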

2. Control Symbols: Branches and Rings

Branch length calculation: SELFIES uses a hexadecimal encoding to determine branch lengths. If a branch symbol is followed by index symbols $c_1 \dots c_k$, the number of symbols $N$ to include in the branch is calculated as:

$$ \begin{aligned} N &= 1 + \sum_{i=1}^{k} 16^{k-i} \cdot c_i \end{aligned} $$

This formula interprets the index symbols $c_1 \dots c_k$ as the digits of a single base-16 number, with $c_1$ the most significant digit, allowing compact representation of arbitrarily long branches.
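The base-16 reading described above can be worked through in a few lines. The sketch takes the index symbols as already mapped to digit values 0 to 15 and treats the first one as most significant; branch_length is a hypothetical helper, and the mapping from actual SELFIES symbols to digit values is left out on purpose.

```python
def branch_length(index_values):
    """Return N = 1 + (index_values read as one base-16 number)."""
    value = 0
    for digit in index_values:
        if not 0 <= digit < 16:
            raise ValueError(f"index value {digit} is not a base-16 digit")
        value = value * 16 + digit  # shift earlier digits up one place
    return 1 + value


print(branch_length([0]))       # 1
print(branch_length([15]))      # 16
print(branch_length([1, 0]))    # 17
print(branch_length([15, 15]))  # 256
```

One index symbol therefore covers branch lengths 1 to 16, and two cover 1 to 256.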

Ring closure queue system: Ring formation uses a lazy evaluation strategy to maintain validity. Ring symbols don’t create bonds immediately; instead, they push “closure candidates” into a queue $R$. These candidates are resolved after the main derivation completes. A ring closure candidate is rejected if either atom has no remaining valence (that is, its free valence $m$ satisfies $m = 0$), or if the closure would create a self-loop (both ends are the same atom). This deferred validation prevents the creation of invalid ring structures while still allowing the grammar to remain context-free during the main derivation.
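The deferred resolution step might look like the sketch below. It is a simplified model, not the library's code: atoms are list indices, resolve_ring_closures is a hypothetical name, and capping the ring bond order by both atoms' free valence is an assumption made by analogy with bond demotion.

```python
def resolve_ring_closures(free_valence, candidates):
    """Resolve queued (atom_i, atom_j, bond_order) candidates in order.

    Returns the accepted ring bonds and the updated free valences.
    """
    free = list(free_valence)  # work on a copy
    bonds = []
    for left, right, order in candidates:
        if left == right:
            continue  # reject: closure would be a self-loop
        if free[left] == 0 or free[right] == 0:
            continue  # reject: one atom has no remaining valence
        order = min(order, free[left], free[right])  # assumed demotion
        free[left] -= order
        free[right] -= order
        bonds.append((left, right, order))
    return bonds, free


queue = [(0, 0, 1), (0, 1, 1), (1, 2, 1), (0, 3, 1), (1, 3, 2)]
bonds, free = resolve_ring_closures([1, 2, 0, 1], queue)
print(bonds)  # [(0, 1, 1), (1, 3, 1)]
print(free)   # [0, 0, 0, 0]
```

Because candidates are processed in queue order, an early closure can use up the valence a later candidate would have needed, which is why the fourth candidate is rejected here.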

3. Symbol Structure and Standardization

SELFIES enforces a strict, standardized format for atom symbols to eliminate ambiguity:

Canonical Format: Atom symbols follow the structure [Bond, Isotope, Element, Chirality, H-count, Charge]
No Variation: There is only one way to write each symbol (e.g., [Fe++] and [Fe+2] are standardized to a single form)
Order Matters: The components must appear in the specified order
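The fixed component order can be illustrated with a small formatter. The helper below is hypothetical and not part of the library, and the concrete spellings it produces (explicit H-count such as H1, signed numeric charge such as +2) are assumptions of this sketch rather than something the summary specifies.

```python
def format_atom_symbol(element, bond="", isotope=None, chirality="",
                       h_count=None, charge=0):
    """Assemble a symbol as bond, isotope, element, chirality, H-count, charge."""
    parts = [bond]
    if isotope is not None:
        parts.append(str(isotope))
    parts.append(element)
    parts.append(chirality)
    if h_count is not None:
        parts.append(f"H{h_count}")
    if charge:
        parts.append(f"{charge:+d}")  # signed number, e.g. +2 or -1
    return "[" + "".join(parts) + "]"


print(format_atom_symbol("Fe", charge=2))                    # [Fe+2]
print(format_atom_symbol("O", charge=-1))                    # [O-1]
print(format_atom_symbol("C", bond="=", isotope=13,
                         chirality="@@", h_count=1))         # [=13C@@H1]
```

Since every component has exactly one slot and one spelling, two callers describing the same atom always produce the same string.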

4. Default Semantic Constraints

By default, the library enforces standard organic chemistry valence rules:

Typical Valences: C=4, N=3, O=2, F=1
Element Coverage: Supports most elements relevant to organic and medicinal chemistry
Customizable: These constraints can be modified via set_semantic_constraints() for specialized applications (hypervalent compounds, theoretical studies, etc.)

The combination of these grammar rules with the state machine ensures that every syntactically well-formed SELFIES string decodes to a chemically valid molecule under the active constraints, regardless of how the string was generated (random, ML model output, manual construction, etc.).

Data

Benchmark dataset: DTP (Developmental Therapeutics Program) dataset with 300,000+ molecules used for performance testing.

Chemical coverage testing: Diverse molecular databases including PubChem to verify the library handles broad chemical diversity (aromatic systems, stereochemistry, charged species, isotopes).

Evaluation

Performance metric: Roundtrip conversion time (SMILES → SELFIES → SMILES) is 252 seconds for 300K molecules.

Validity testing: Random SELFIES string generation with decoding verification shows 100% of generated strings decode to valid molecules.

Attribution system: Encoder and decoder track which input symbols produce which output symbols, tested for property alignment between SMILES and SELFIES representations.

Paper Information

Citation: Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., & Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. Digital Discovery, 2(4), 897-908. https://doi.org/10.1039/D3DD00044C

Publication: Digital Discovery 2023

@article{lo2023recent,
  title={Recent advances in the self-referencing embedded strings (SELFIES) library},
  author={Lo, Alston and Pollice, Robert and Nigam, AkshatKumar and White, Andrew D and Krenn, Mario and Aspuru-Guzik, Al{\'a}n},
  journal={Digital Discovery},
  volume={2},
  number={4},
  pages={897--908},
  year={2023},
  publisher={Royal Society of Chemistry},
  doi={10.1039/D3DD00044C}
}
