Overview

SELFIES (SELF-referencIng Embedded Strings) is a string-based molecular representation designed to be 100% robust. This means every SELFIES string—even one generated randomly—corresponds to a syntactically and semantically valid molecule. SELFIES was developed to overcome a major limitation of SMILES, where a large fraction of strings generated in machine learning tasks do not represent valid chemical structures.

The format is implemented in an open-source Python library called selfies, which has significantly expanded its capabilities since the original publication. The modern library uses an internal graph-based representation for greater efficiency and flexibility.

Key Characteristics

  • 100% Robustness: Every possible SELFIES string can be decoded into a valid molecular graph that obeys chemical valence rules. This is its fundamental advantage over SMILES.
  • Machine Learning Friendly: Can be used directly in any machine learning model (like VAEs or GANs) without adaptation, guaranteeing that all generated outputs are valid molecules.
  • Customizable Constraints: The underlying chemical rules, such as maximum valence for different atoms, can be customized by the user. The library provides presets (e.g., for hypervalent species) and allows users to define their own rule sets.
  • Human-readable: With some familiarity, SELFIES strings are human-readable, allowing interpretation of functional groups and connectivity.
  • Local Operations: Unlike SMILES, which uses non-local numbers for rings and parentheses for branches, SELFIES encodes branch length and ring size locally within the string, preventing common syntactical errors.
  • Broad Support: The current selfies library supports aromatic molecules (via kekulization), isotopes, charges, radicals, and stereochemistry. It also includes a dot symbol (.) for representing disconnected molecular fragments.

Current Limitations

  • Indirect Canonicalization: A canonical SELFIES string is currently generated by first creating a canonical SMILES string and then converting it to SELFIES. Direct canonicalization is a goal for future development.
  • Ongoing Standardization: While the library now supports most major features found in SMILES, work is ongoing to extend the format to more complex systems like polymers, crystals, and reactions.

Basic Syntax

SELFIES uses symbols enclosed in square brackets (e.g., [C], [O], [#N]). The interpretation of each symbol depends on the current state of the derivation, which ensures chemical valence rules are not violated. The syntax is formally defined by a Chomsky type-2 context-free grammar.

Derivation Rules

SELFIES are constructed using a table of derivation rules. The process starts in an initial state (e.g., $X_0$) and reads the SELFIES string symbol by symbol. Each symbol, combined with the current state, determines the resulting atom/bond and the next state. The derivation state $X_n$ intuitively tracks that the previously added atom can form a maximum of $n$ additional bonds.

For example, the string [F][C][C][#N] is derived as follows:

  1. Start state: $X_0$
  2. Symbol [F]: The rule for [F] in state $X_0$ produces the atom F and transitions to state $X_1$. Current molecule: F.
  3. Symbol [C]: The rule for [C] in state $X_1$ produces =C and transitions to state $X_3$. Current molecule: F-C. The state $X_3$ indicates the new carbon atom can accept up to 3 more valence bonds.
  4. Symbol [C]: The rule for [C] in state $X_3$ might produce =C and transition to state $X_2$. Current molecule: F-C=C.
  5. Symbol [#N]: The rule for [#N] in state $X_2$ might produce =N. Final molecule: F-C=C=N (2-Fluoroethenimine), which satisfies all valence rules.

Structural Features

  • Branches: Represented by a [Branch] symbol. The symbols immediately following it are interpreted as an index that specifies the number of SELFIES symbols belonging to that branch. This structure prevents errors like unmatched parentheses in SMILES.
  • Rings: Represented by a [Ring] symbol. Similar to branches, subsequent symbols specify an index that indicates which previous atom to connect to, forming a ring closure. To guarantee validity, ring bond creation is postponed to a final post-processing step, where it is only completed if the target atom has available valence bonds.

Examples

Here are some examples of SELFIES representations for common molecules:

Ethanol molecule from SELFIES
Ethanol: [C][C][O]
Benzene molecule from SELFIES
Benzene: [C][=C][C][=C][C][=C][Ring1][=Branch1]
Aspirin molecule from SELFIES
Aspirin: [C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][C][=O][O]

Practical Applications

SELFIES is particularly advantageous for generative models in computational chemistry:

  • Inverse Molecular Design: Ideal for tasks where novel molecules with desired properties are generated using models like Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).
  • Valid Latent Space: When used in a VAE, the entire continuous latent space decodes to valid molecules, in stark contrast to SMILES where large regions of the latent space are invalid. This is crucial for both optimization tasks and model interpretability.
  • Higher Molecular Diversity: Experiments show that generative models trained with SELFIES produce a significantly higher density and diversity of valid molecules. A VAE trained with SELFIES stored two orders of magnitude more diverse molecules, and a GAN produced 78.9% diverse valid molecules compared to 18.5% for SMILES.

The selfies Python Library

The selfies library provides a user-friendly and efficient implementation without external dependencies. Key functionalities include:

  • Core Functions: encoder() to convert SMILES to SELFIES and decoder() to convert SELFIES to SMILES.
  • Customization: set_semantic_constraints() allows users to define their own valence rules.
  • Attribution: A flag in the core functions enables tracing which input symbols are responsible for each output symbol, explaining the translation process.
  • Utilities: Functions are provided for common tasks like tokenizing a SELFIES string (split_selfies), getting its length (len_selfies), and converting SELFIES strings into numerical representations (label or one-hot encodings) for machine learning pipelines (selfies_to_encoding).

Variants and Extensibility

The core SELFIES grammar is designed to be extensible. The selfies library has already extended the original concept to cover a much richer chemical space:

  • Current Support: The library can represent ions, radicals, stereochemistry, aromatic systems, and most elements relevant to organic and medicinal chemistry. It has been used to encode and decode all 72 million molecules from PubChem.
  • Future Directions: Researchers have outlined future extensions to represent more complex structures like polymers, crystals, molecules with non-covalent bonds, and chemical reactions.

References