Overview

SELFIES (SELF-referencIng Embedded Strings) is a string-based molecular representation designed to be 100% robust. This means every SELFIES string—even one generated randomly—corresponds to a syntactically and semantically valid molecule. SELFIES was developed to overcome a major limitation of SMILES, where a large fraction of generated strings in machine learning tasks do not represent valid chemical structures.

SELFIES uses a formal grammar and derivation rules that ensure chemical constraints, such as valence, are always satisfied.

Key Characteristics

  • 100% Robustness: Every possible SELFIES string can be decoded into a valid molecular graph; this is its fundamental advantage over SMILES (see the sketch after this list).
  • Machine Learning Friendly: Can be used directly in machine learning models (such as VAEs or GANs) without adding validity checks or grammar constraints, since every generated output is guaranteed to be a valid molecule.
  • Human-readable: With some familiarity, SELFIES strings are human-readable, allowing interpretation of functional groups and connectivity.
  • Local Operations: Unlike SMILES, which uses non-local numbers for rings and parentheses for branches, SELFIES encodes branch length and ring size locally within the string, preventing common syntactical errors.
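
The robustness claim can be checked directly with the released selfies Python package. The sketch below assumes that package (sf.encoder, sf.decoder, and sf.get_semantic_robust_alphabet are its functions; exact outputs vary by version): it round-trips a SMILES string and then decodes strings assembled from randomly chosen symbols.

    import random
    import selfies as sf

    # SMILES -> SELFIES -> SMILES round trip
    benzene = sf.encoder("c1ccccc1")
    print(benzene, "->", sf.decoder(benzene))

    # Decode strings built from randomly chosen symbols of the robust alphabet;
    # every one of them corresponds to a valid molecule.
    alphabet = list(sf.get_semantic_robust_alphabet())
    for _ in range(5):
        random_selfies = "".join(random.choices(alphabet, k=20))
        print(sf.decoder(random_selfies))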

Current Limitations

  • Indirect Canonicalization: A canonical SELFIES string is currently obtained by first generating a canonical SMILES string and then converting it to SELFIES (a short sketch of this workflow follows this list). Direct canonicalization is a goal for future development.
  • Standardization in Progress: While the core concept is established, further work is needed to standardize the format to cover the entire periodic table, advanced stereochemistry, aromaticity, and other special cases.
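
A minimal sketch of the indirect workflow described above, assuming RDKit for the canonical-SMILES step (any toolkit that produces canonical SMILES would do) and the selfies package for the conversion:

    from rdkit import Chem
    import selfies as sf

    def canonical_selfies(smiles: str) -> str:
        """Canonicalize indirectly: canonical SMILES first, then convert to SELFIES."""
        canonical_smiles = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
        return sf.encoder(canonical_smiles)

    print(canonical_selfies("OCC"))   # same result as for the equivalent SMILES "CCO"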

Basic Syntax

SELFIES uses symbols enclosed in square brackets (e.g., [C], [O], [#N]). The interpretation of each symbol depends on the current state of the derivation, which ensures chemical valence rules are not violated.
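
For instance, a SELFIES string can be split into its bracketed symbols with utilities from the selfies package (a sketch; sf.split_selfies and sf.len_selfies are assumed to be available in the installed version):

    import selfies as sf

    s = "[F][C][C][#N]"
    print(list(sf.split_selfies(s)))   # ['[F]', '[C]', '[C]', '[#N]']
    print(sf.len_selfies(s))           # 4 symbols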

Derivation Rules

SELFIES are constructed using a table of derivation rules. The process starts in an initial state (e.g., $X_0$). Each symbol in the string acts as a “rule vector” that, combined with the current derivation state, determines the resulting atom/bond and the next state.

For example, the string [F][C][C][#N] is derived as follows (a minimal code sketch of this state machine appears after the list):

  1. Start state: $X_0$
  2. Symbol [F]: The rule for [F] in state $X_0$ produces the atom F and transitions to state $X_1$, since fluorine can form only one further bond. Current molecule: F.
  3. Symbol [C]: The rule for [C] in state $X_1$ produces a single-bonded C and transitions to state $X_3$. Current molecule: F-C. The state $X_3$ indicates the new carbon atom can still accept up to 3 valence bonds.
  4. Symbol [C]: The rule for [C] in state $X_3$ produces =C and transitions to state $X_2$. Current molecule: F-C=C.
  5. Symbol [#N]: The rule for [#N] in state $X_2$ produces =N, because the requested triple bond is capped at the two bonds the state allows. Final molecule: F-C=C=N (2-fluoroethenimine), which satisfies all valence rules.
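
The sketch below mirrors this walk-through as a tiny state machine in Python. The rule table is a toy covering only these four (state, symbol) pairs and simply reproduces the steps above; it is not the full published derivation table.

    # Toy derivation: (state, symbol) -> (atom, bond order to previous atom, next state)
    RULES = {
        ("X0", "[F]"):  ("F", 1, "X1"),
        ("X1", "[C]"):  ("C", 1, "X3"),
        ("X3", "[C]"):  ("C", 2, "X2"),
        ("X2", "[#N]"): ("N", 2, "X1"),
    }

    def derive(symbols):
        state, atoms, bonds = "X0", [], []
        for sym in symbols:
            atom, order, state = RULES[(state, sym)]
            if atoms:  # bond the new atom to the previously placed atom
                bonds.append((len(atoms) - 1, len(atoms), order))
            atoms.append(atom)
        return atoms, bonds

    print(derive(["[F]", "[C]", "[C]", "[#N]"]))
    # (['F', 'C', 'C', 'N'], [(0, 1, 1), (1, 2, 2), (2, 3, 2)])  i.e. F-C=C=N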

Structural Features

  • Branches: Introduced by a [Branch] symbol followed by a symbol that encodes the branch length, i.e. how many of the subsequent symbols belong to the branch. Because the length is stated locally, errors such as the unmatched parentheses possible in SMILES cannot occur. For example, a branch symbol followed by a length symbol meaning 3 starts a branch consisting of the next three symbols.
  • Rings: Represented by a [Ring] symbol followed by a symbol that encodes the ring size, i.e. how many atoms back along the already-derived chain the current atom should bond to, closing the ring. The ring bond is only inserted if the target atom still has free valence, ensuring validity (see the example after this list).
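
Encoding branched and cyclic molecules shows these symbols in practice (a sketch using the selfies package; the exact symbols emitted depend on the package version):

    import selfies as sf

    print(sf.encoder("CC(C)C"))      # isobutane: the output contains a [Branch...] symbol
    print(sf.encoder("C1CCCCC1"))    # cyclohexane: the output contains a [Ring...] symbol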

Practical Applications

SELFIES is particularly advantageous for generative models in computational chemistry:

  • Inverse Molecular Design: Ideal for tasks where novel molecules with desired properties are generated using models like Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs); a short preprocessing sketch for such models follows this list.
  • Valid Latent Space: When used in a VAE, the entire continuous latent space decodes to valid molecules, in stark contrast to SMILES where large regions of the latent space are invalid. This is crucial for both optimization tasks and model interpretability.
  • Higher Molecular Diversity: Experiments show that generative models trained on SELFIES produce a significantly higher density and diversity of valid molecules than models trained on SMILES. For instance, a VAE trained on SELFIES stored two orders of magnitude more diverse molecules in its latent space, and a GAN produced 78.9% valid and diverse molecules compared to 18.5% for SMILES.
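
To feed SELFIES into such models, the strings are typically tokenized into symbols and mapped to integer (or one-hot) encodings. Below is a minimal preprocessing sketch, assuming utilities from the selfies package (get_alphabet_from_selfies, len_selfies, selfies_to_encoding) and its padding symbol [nop]; parameter names may differ between versions.

    import selfies as sf

    dataset = [sf.encoder(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]

    alphabet = sf.get_alphabet_from_selfies(dataset)  # symbols occurring in the dataset
    alphabet.add("[nop]")                             # padding symbol ignored by the decoder
    stoi = {symbol: i for i, symbol in enumerate(sorted(alphabet))}

    pad_to = max(sf.len_selfies(s) for s in dataset)
    labels = [
        sf.selfies_to_encoding(s, vocab_stoi=stoi, pad_to_len=pad_to, enc_type="label")
        for s in dataset
    ]
    print(labels)   # fixed-length integer sequences, ready for an embedding layer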

Variants and Extensibility

The core SELFIES grammar is designed to be extensible. The derivation table can be expanded to include the following (a configuration sketch follows this list):

  • Ions and radicals
  • Stereochemistry information
  • A broader range of atoms from the periodic table
  • Support for much larger molecules (the released code supports rings and branches with up to 8000 atoms)
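
Part of this extensibility is exposed in the released selfies package through configurable semantic constraints, i.e. the valence table that the derivation rules enforce. A sketch (assuming sf.get_semantic_constraints and sf.set_semantic_constraints; the default table and accepted keys depend on the package version):

    import selfies as sf

    constraints = sf.get_semantic_constraints()   # e.g. {"C": 4, "N": 3, "O": 2, ...}
    constraints["I"] = 3                          # e.g. permit hypervalent (trivalent) iodine
    sf.set_semantic_constraints(constraints)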

The framework has already been used to successfully encode and decode all 72 million molecules from the PubChem database.

References