<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular Notations on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/</link><description>Recent content in Molecular Notations on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/index.xml" rel="self" type="application/rss+xml"/><item><title>Materials Representations for ML Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</guid><description>Review of representation strategies for encoding solid-state materials as ML inputs, covering structural descriptors, crystal graphs, and generative models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-material-representations">A Systematization of Material Representations</h2>
<p>This paper is a <strong>Systematization</strong> that organizes and categorizes the strategies researchers use to convert solid-state materials into numerical representations suitable for machine learning models. Rather than proposing a new method, the review provides a structured taxonomy of existing approaches, connecting each to the practical constraints of data availability, computational cost, and prediction targets. It covers structural descriptors, graph-based learned representations, compositional features, transfer learning, and generative models for inverse design.</p>
<h2 id="why-material-representations-matter">Why Material Representations Matter</h2>
<p>Machine learning has enabled rapid property prediction for materials, but every ML pipeline depends on how the material is encoded as a numerical input. The authors identify three guiding principles for effective representations:</p>
<ol>
<li><strong>Similarity preservation</strong>: Similar materials should have similar representations, and dissimilar materials should diverge in representation space.</li>
<li><strong>Domain coverage</strong>: The representation should be constructable for every material in the target domain.</li>
<li><strong>Cost efficiency</strong>: Computing the representation should be cheaper than computing the target property directly (e.g., via <a href="https://en.wikipedia.org/wiki/Density_functional_theory">DFT</a>).</li>
</ol>
<p>In practice, materials scientists face several barriers. Atomistic structures span diverse space groups, supercell sizes, and disorder parameters. Real material performance depends on defects, microstructure, and interfaces. Structural information often requires expensive experimental or computational effort to obtain. Datasets in materials science tend to be small, sparse, and biased toward well-studied systems.</p>
<h2 id="structural-descriptors-local-global-and-topological">Structural Descriptors: Local, Global, and Topological</h2>
<p>The review covers three families of hand-crafted structural descriptors that encode atomic positions and types.</p>
<h3 id="local-descriptors">Local Descriptors</h3>
<p>Local descriptors characterize the environment around each atom. Atom-centered symmetry functions (ACSF), introduced by Behler and Parrinello, define radial and angular functions:</p>
<p>$$
G_{i}^{1} = \sum_{j \neq i}^{\text{neighbors}} e^{-\eta(R_{ij} - R_{s})^{2}} f_{c}(R_{ij})
$$</p>
<p>$$
G_{i}^{2} = 2^{1-\zeta} \sum_{j,k \neq i}^{\text{neighbors}} (1 + \lambda \cos \theta_{ijk})^{\zeta} e^{-\eta(R_{ij}^{2} + R_{ik}^{2} + R_{jk}^{2})} f_{c}(R_{ij}) f_{c}(R_{ik}) f_{c}(R_{jk})
$$</p>
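<p>As a rough illustration, the radial function $G_{i}^{1}$ can be written in a few lines of numpy. The cosine cutoff $f_{c}$ below is Behler&rsquo;s standard choice; the parameter values here are arbitrary:</p>

```python
import numpy as np

def cutoff(r, r_c):
    """Behler cosine cutoff: decays smoothly to zero at r_c."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def acsf_radial(distances, eta=1.0, r_s=0.0, r_c=6.0):
    """G^1 for one atom, given distances to its neighbors."""
    d = np.asarray(distances, dtype=float)
    return float(np.sum(np.exp(-eta * (d - r_s) ** 2) * cutoff(d, r_c)))

# Two neighbors at 1.0 and 2.0 Angstrom
g1 = acsf_radial([1.0, 2.0], eta=0.5)
```

<p>In practice one evaluates a whole set of such functions over many $(\eta, R_{s})$ pairs to fingerprint each atomic environment.</p>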
<p>The Smooth Overlap of Atomic Positions (SOAP), proposed by Bartók et al., defines atomic neighborhood density as a sum of Gaussians and computes a rotationally invariant kernel through expansion in radial functions and <a href="https://en.wikipedia.org/wiki/Spherical_harmonics">spherical harmonics</a>:</p>
<p>$$
\rho_{i}(\mathbf{r}) = \sum_{j} \exp\left(-\frac{|\mathbf{r} - \mathbf{r}_{ij}|^{2}}{2\sigma^{2}}\right) = \sum_{nlm} c_{nlm} g_{n}(r) Y_{lm}(\hat{\mathbf{r}})
$$</p>
<p>The power spectrum $p_{nn'l} = \sum_{m} c_{nlm}(c_{n'lm})^{*}$ serves as a vector descriptor of the local environment. SOAP has seen wide adoption both as a similarity metric and as input to ML models.</p>
<p><a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi tessellation</a> provides another local approach, segmenting space into cells and extracting features like effective coordination numbers, cell volumes, and neighbor properties.</p>
<h3 id="global-descriptors">Global Descriptors</h3>
<p>Global descriptors encode the full structure. The Coulomb matrix models electrostatic interactions between atoms:</p>
<p>$$
M_{i,j} = \begin{cases} \frac{1}{2} Z_{i}^{2.4} &amp; \text{for } i = j \\ \frac{Z_{i}Z_{j}}{|r_{i} - r_{j}|} &amp; \text{for } i \neq j \end{cases}
$$</p>
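<p>A minimal numpy sketch of the (unsorted) Coulomb matrix, assuming atomic numbers and Cartesian coordinates in Angstrom:</p>

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Z: (N,) atomic numbers; R: (N, 3) Cartesian coordinates."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4       # diagonal: nuclear self-term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

# H2 molecule, bond length 0.74 Angstrom
M = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]])
```

<p>Because row/column order depends on atom indexing, ML pipelines typically sort rows by norm or use the eigenvalue spectrum to obtain permutation invariance.</p>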
<p>Other global methods include partial radial distribution functions (PRDF), the many-body tensor representation (MBTR), and cluster expansions. The Atomic Cluster Expansion (ACE) framework generalizes cluster expansions to continuous environments and has become a foundation for modern deep learning potentials.</p>
<h3 id="topological-descriptors">Topological Descriptors</h3>
<p><a href="https://en.wikipedia.org/wiki/Persistent_homology">Persistent homology</a> from topological data analysis (TDA) identifies geometric features at multiple length scales. Topological descriptors capture pore geometries in porous materials and have outperformed traditional structural descriptors for predicting CO$_{2}$ adsorption in metal-organic frameworks and methane storage in <a href="https://en.wikipedia.org/wiki/Zeolite">zeolites</a>. A caveat is the $O(N^{3})$ worst-case computational cost per filtration.</p>
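<p>The 0-dimensional piece of a persistence computation can be sketched with a union-find over a growing distance threshold: in a Vietoris&ndash;Rips filtration every point is a component &ldquo;born&rdquo; at scale 0, and a component &ldquo;dies&rdquo; when an edge merges it into another. This is a toy illustration only, not a replacement for TDA libraries such as GUDHI or Ripser:</p>

```python
import numpy as np

def h0_persistence(points):
    """Death times of 0-dim features in a Vietoris-Rips filtration.
    All components are born at 0; each merge kills one component."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # All pairwise edges, sorted by length (the filtration order)
    edges = sorted(
        (np.linalg.norm(pts[i] - pts[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one component dies at this scale
    return deaths  # n-1 deaths; one component persists forever

# Two well-separated pairs of points: two short-lived merges, one late merge
deaths = h0_persistence([[0, 0], [1, 0], [10, 0], [11, 0]])
```

<p>Higher-dimensional features (loops, voids) require tracking simplices rather than just edges, which is where the $O(N^{3})$ worst-case cost enters.</p>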
<h2 id="crystal-graph-neural-networks">Crystal Graph Neural Networks</h2>
<p>Graph neural networks bypass manual feature engineering by learning representations directly from structural data. Materials are converted to graphs $G(V, E)$ where nodes represent atoms and edges connect neighbors within a cutoff radius, with periodic boundary conditions.</p>
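<p>A toy version of this graph construction can enumerate the 3&times;3&times;3 surrounding images of the unit cell so that bonds across periodic boundaries are included. This is a minimal illustration, not a production neighbor list:</p>

```python
import numpy as np
from itertools import product

def radius_graph_pbc(frac_coords, lattice, cutoff):
    """Edges (i, j, distance) between atoms within `cutoff`,
    including neighbors in adjacent periodic images.
    frac_coords: (N, 3) fractional coords; lattice: (3, 3) row vectors."""
    frac = np.asarray(frac_coords, dtype=float)
    lat = np.asarray(lattice, dtype=float)
    cart = frac @ lat
    edges = []
    for i, j in product(range(len(frac)), repeat=2):
        for shift in product((-1, 0, 1), repeat=3):
            if i == j and shift == (0, 0, 0):
                continue  # no self-loop within the same image
            d = np.linalg.norm(cart[j] + np.array(shift) @ lat - cart[i])
            if d < cutoff:
                edges.append((i, j, d))
    return edges

# Simple cubic lattice, one atom per cell, a = 3.0: six nearest images
edges = radius_graph_pbc([[0, 0, 0]], 3.0 * np.eye(3), cutoff=3.5)
```

<p>Real pipelines (e.g., via pymatgen or ASE neighbor lists) use cell-based searches that scale far better, but the logic is the same.</p>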
<p>Key architectures discussed include:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Key Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>Crystal graph convolutions for broad property prediction</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>Materials graph networks with global state attributes</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>Line graph neural networks incorporating three-body angular features</td>
      </tr>
      <tr>
          <td>Equivariant GNNs</td>
          <td>E(3)-equivariant message passing for tensorial properties</td>
      </tr>
  </tbody>
</table>
<p>The review identifies several limitations. Graph convolutions based on local neighborhoods can fail to capture long-range interactions or periodicity-dependent properties (e.g., lattice parameters, phonon spectra). Strategies to address this include concatenation with hand-tuned descriptors, plane-wave periodic basis modulation, and reciprocal-space features.</p>
<p>A major practical restriction is the requirement for relaxed atomic positions. Graphs built from unrelaxed crystal prototypes lose information about geometric distortions, degrading accuracy. Approaches to mitigate this include data augmentation with perturbed structures, Bayesian optimization of prototypes, and surrogate force-field relaxation.</p>
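<p>The data-augmentation strategy can be sketched as adding small Gaussian displacements to atomic coordinates; the noise scale and number of copies below are arbitrary choices for illustration:</p>

```python
import numpy as np

def perturb_structures(coords, n_copies=10, sigma=0.05, seed=0):
    """Generate noisy copies of an (N, 3) coordinate array.
    sigma is the displacement scale in Angstrom (arbitrary here)."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    return [coords + rng.normal(0.0, sigma, coords.shape)
            for _ in range(n_copies)]

# Four perturbed copies of a diatomic prototype
copies = perturb_structures([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]], n_copies=4)
```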
<p>Equivariant models that introduce higher-order tensors to node and edge features, constrained to transform correctly under E(3) operations, achieve state-of-the-art accuracy and can match structural descriptor performance even in low-data (~100 datapoints) regimes.</p>
<h2 id="compositional-descriptors-without-structure">Compositional Descriptors Without Structure</h2>
<p>When crystal structures are unavailable, representations can be built purely from stoichiometry and tabulated atomic properties (radii, electronegativity, valence electrons). Despite their simplicity, these methods have distinct advantages: zero computational overhead, accessibility to non-experts, and robustness for high-throughput screening.</p>
<p>Key methods include:</p>
<ul>
<li><strong>MagPie</strong>: 145 input features derived from elemental properties</li>
<li><strong>SISSO</strong>: Compressive sensing over algebraic combinations of atomic properties, capable of discovering interpretable descriptors (e.g., a new tolerance factor $\tau$ for perovskite stability)</li>
<li><strong>ElemNet</strong>: Deep neural network using only fractional stoichiometry as input, outperforming MagPie with &gt;3,000 training points</li>
<li><strong>ROOST</strong>: Fully-connected compositional graph with attention-based message passing, achieving strong performance with only hundreds of examples</li>
<li><strong>CrabNet</strong>: Self-attention on element embeddings with fractional encoding, handling dopant-level concentrations via log-scale inputs</li>
</ul>
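<p>The simplest of these inputs, the fractional stoichiometry vector used by ElemNet-style models, is easy to sketch. The element vocabulary here is truncated for brevity; real models cover all (or most) of the periodic table:</p>

```python
def composition_vector(formula_counts, elements=("H", "C", "N", "O", "Fe", "Ti")):
    """Map {element: count} to a fixed-length vector of atomic fractions.
    `elements` is a toy vocabulary for illustration."""
    total = sum(formula_counts.values())
    return [formula_counts.get(el, 0) / total for el in elements]

# TiO2 -> fractions over the toy vocabulary
vec = composition_vector({"Ti": 1, "O": 2})
```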
<p>Compositional models cannot distinguish polymorphs and generally underperform structural approaches. They are most valuable when atomistic resolution is unavailable.</p>
<h2 id="defects-surfaces-and-grain-boundaries">Defects, Surfaces, and Grain Boundaries</h2>
<p>The review extends beyond idealized unit cells to practical materials challenges:</p>
<p><strong>Point defects</strong>: Representations of the pristine bulk can predict vacancy formation energies through linear relationships with band structure descriptors. Frey et al. proposed using relative differences between defect and parent structure properties, requiring no DFT on the defect itself.</p>
<p><strong>Surfaces and catalysis</strong>: Binding energy prediction for catalysis requires representations beyond the bulk unit cell. The d-band center for metals and oxygen 2p-band center for metal oxides serve as simple electronic descriptors, following the <a href="https://en.wikipedia.org/wiki/Sabatier_principle">Sabatier principle</a> that optimal catalytic activity requires intermediate binding strength. Graph neural networks trained on the Open Catalyst 2020 dataset (&gt;1 million DFT energies) have enabled broader screening, though errors remain high for certain adsorbates and non-metallic surfaces.</p>
<p><strong>Grain boundaries</strong>: SOAP descriptors computed for atoms near grain boundaries and clustered into local environment classes can predict grain boundary energy, mobility, and shear coupling. This approach provides interpretable structure-property relationships.</p>
<h2 id="transfer-learning-across-representations">Transfer Learning Across Representations</h2>
<p>When target datasets are small, transfer learning leverages representations learned from large, related datasets. The standard procedure involves: (1) pretraining on a large dataset (e.g., all Materials Project formation energies), (2) freezing parameters up to a chosen depth, and (3) either fine-tuning remaining layers or extracting features for a separate model.</p>
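<p>Steps (2) and (3) can be sketched with a toy two-layer numpy &ldquo;network&rdquo;: the frozen weights stand in for a pretrained model, and only a ridge-regression readout is fit on the small target set. This is a schematic under invented weights, not a real pretrained model:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights were learned on a large source dataset
W1 = rng.normal(size=(16, 8))   # frozen layer 1
W2 = rng.normal(size=(8, 8))    # frozen layer 2

def frozen_features(X):
    """Forward pass through the frozen layers (ReLU activations)."""
    h = np.maximum(X @ W1, 0.0)
    return np.maximum(h @ W2, 0.0)

def fit_head(X, y, alpha=1e-2):
    """Closed-form ridge-regression readout on frozen features."""
    F = frozen_features(X)
    return np.linalg.solve(F.T @ F + alpha * np.eye(F.shape[1]), F.T @ y)

# Small target dataset: 50 examples, 16 raw input features
X = rng.normal(size=(50, 16))
y = rng.normal(size=50)
w = fit_head(X, y)
preds = frozen_features(X) @ w
```

<p>Fine-tuning instead of feature extraction would unfreeze some layers and continue gradient training; the trade-off depends on target dataset size.</p>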
<p>Key findings from the review:</p>
<ul>
<li>Transfer learning is most effective when the source dataset is orders of magnitude larger than the target</li>
<li>Physically related tasks transfer better (e.g., Open Catalyst adsorption energies transfer well to new adsorbates, less so to unrelated small molecules)</li>
<li>Earlier neural network layers learn more general representations and transfer better across properties</li>
<li>Multi-depth feature extraction, combining activations from multiple layers, can improve transfer</li>
<li>Predictions from surrogate models can serve as additional descriptors, expanding screening domains by orders of magnitude</li>
</ul>
<h2 id="generative-models-for-crystal-inverse-design">Generative Models for Crystal Inverse Design</h2>
<p>Generative models for solid-state materials face challenges beyond molecular generation: more diverse atomic species, the need to specify both positions and lattice parameters, non-unique definitions (rotations, translations, supercell scaling), and large unit cells (&gt;100 atoms for zeolites and MOFs).</p>
<p>The review traces the progression of approaches:</p>
<ol>
<li><strong>Voxel representations</strong>: Discretize unit cells into volume elements. Early work (iMatGen, Court et al.) demonstrated feasibility but was restricted to specific chemistries or cubic systems.</li>
<li><strong>Continuous coordinate models</strong>: Point cloud and invertible representations allowed broader chemical spaces but lacked symmetry invariances.</li>
<li><strong>Symmetry-aware models</strong>: Crystal Diffusion <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAE</a> (CDVAE) uses periodic graphs and SE(3)-equivariant message passing for translationally and rotationally invariant generation, establishing benchmark tasks for the field.</li>
<li><strong>Constrained models for porous materials</strong>: Approaches like SmVAE represent MOFs through their topological building blocks (RFcodes), ensuring all generated structures are physically valid.</li>
</ol>
<h2 id="open-problems-and-future-directions">Open Problems and Future Directions</h2>
<p>The review highlights four high-impact open questions:</p>
<ol>
<li><strong>Local vs. global descriptor trade-offs</strong>: Local descriptors (SOAP) excel for short-range interactions but struggle with long-range physics. Global descriptors model periodicity but lack generality across space groups. Combining local and long-range features could provide more universal models.</li>
<li><strong>Prediction from unrelaxed prototypes</strong>: ML force fields can relax structures at a fraction of DFT cost, potentially expanding screening domains. Key questions remain about required training data scale and generalizability.</li>
<li><strong>Applicability of compositional descriptors</strong>: The performance gap between compositional and structural models may be property-dependent, being smaller for properties like band gap that depend on global features rather than local site energies.</li>
<li><strong>Extensions of generative models</strong>: Diffusion-based architectures have improved on voxel approaches for small unit cells, but extending to microstructure, dimensionality, and surface generation remains open.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is a review and does not present new experimental results or release any novel code, data, or models. The paper is open-access (hybrid OA at Annual Reviews) and the arXiv preprint is freely available. The following artifacts table covers key publicly available resources discussed in the review.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://arxiv.org/abs/2301.08813">arXiv preprint (2301.08813)</a></td>
          <td>Other</td>
          <td>arXiv (open access)</td>
          <td>Free preprint version</td>
      </tr>
      <tr>
          <td><a href="https://materialsproject.org">Materials Project</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>DFT energies, band gaps, structures for &gt;100,000 compounds</td>
      </tr>
      <tr>
          <td><a href="https://oqmd.org">OQMD</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Open Quantum Materials Database, &gt;600,000 DFT entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Open-Catalyst-Project/ocp">Open Catalyst 2020 (OC20)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>&gt;1,000,000 DFT surface adsorption energies</td>
      </tr>
      <tr>
          <td><a href="https://aflowlib.org">AFLOW</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>High-throughput ab initio library, &gt;3,000,000 entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/hackingmaterials/matminer">Matminer</a></td>
          <td>Code</td>
          <td>BSD</td>
          <td>Open-source toolkit for materials data mining and featurization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review covers: ACSF, SOAP, Voronoi tessellation, Coulomb matrices, PRDF, MBTR, cluster expansions, ACE, persistent homology, CGCNN, MEGNet, ALIGNN, E(3)-equivariant GNNs, MagPie, SISSO, ElemNet, ROOST, CrabNet, VAE, GAN, and diffusion-based crystal generators.</p>
<h3 id="hardware">Hardware</h3>
<p>No new experiments are conducted. Hardware requirements vary by the referenced methods (DFT calculations require HPC; GNN training typically requires 1-8 GPUs).</p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Partially Reproducible</strong>: The review paper itself is open-access. All major datasets discussed (Materials Project, OQMD, OC20, AFLOW) are publicly available under permissive licenses. Most referenced model implementations (CGCNN, MEGNet, ALIGNN, ROOST, CDVAE) have open-source code. No novel artifacts are released by the authors.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Damewood, J., Karaguesian, J., Lunger, J. R., Tan, A. R., Xie, M., Peng, J., &amp; Gómez-Bombarelli, R. (2023). Representations of Materials for Machine Learning. <em>Annual Review of Materials Research</em>, 53. <a href="https://doi.org/10.1146/annurev-matsci-080921-085947">https://doi.org/10.1146/annurev-matsci-080921-085947</a></p>
<p><strong>Publication</strong>: Annual Review of Materials Research, 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{damewood2023representations,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Representations of Materials for Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Damewood, James and Karaguesian, Jessica and Lunger, Jaclyn R. and Tan, Aik Rui and Xie, Mingrou and Peng, Jiayu and G{\&#39;o}mez-Bombarelli, Rafael}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Annual Review of Materials Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{53}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1146/annurev-matsci-080921-085947}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI: The International Chemical Identifier</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi/</guid><description>InChI is IUPAC's open, layered chemical identifier that encodes molecular structure hierarchically for database interoperability and search.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p><strong>InChI (International Chemical Identifier)</strong> is an open, non-proprietary chemical structure identifier developed by <a href="https://iupac.org/">IUPAC</a> and <a href="https://www.nist.gov/">NIST</a>. Unlike <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, which linearizes a molecular graph through depth-first traversal, InChI decomposes a molecule into a hierarchy of <strong>layers</strong> (connectivity, hydrogen atoms, charge, stereochemistry) that build progressively from the molecular formula to full stereochemical detail. This layered design means that two representations of the same molecule always produce the same InChI, even if their input drawings differ in atom ordering or layout.</p>
<p>InChI was created to solve a specific problem: linking chemical information across databases on the open web. Before InChI, interoperability between chemical databases depended on proprietary identifiers (like CAS Registry Numbers) or format-dependent representations. The project began at a March 2000 IUPAC meeting and is maintained by the <a href="https://www.inchi-trust.org/">InChI Trust</a>, a UK charity supported by publishers and database providers. The algorithm&rsquo;s source code is <a href="https://github.com/IUPAC-InChI/InChI">open source</a>.</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Canonical by design</strong>: Every valid molecular structure maps to exactly one standard InChI string, regardless of how the structure was drawn or which atoms were numbered first. This uniqueness is built into the algorithm, not added as a post-processing step.</li>
<li><strong>Hierarchical layers</strong>: Information is organized from general (molecular formula) to specific (stereochemistry, isotopes). This allows matching at different levels of detail: a query with unknown stereochemistry can match against structures with known stereochemistry by comparing only the connectivity layers.</li>
<li><strong>Web-searchable via InChIKey</strong>: Because InChI strings contain characters (<code>/</code>, <code>+</code>, <code>=</code>) that break web search engines, the 27-character InChIKey hash provides a fixed-length, search-friendly identifier.</li>
<li><strong>Non-proprietary and open</strong>: Governed by IUPAC through the InChI Trust. The algorithm, source code, and specification are freely available.</li>
<li><strong>Machine-optimized</strong>: Designed for programmatic parsing and database operations rather than human readability. Compare with SMILES, which prioritizes human readability.</li>
</ul>
<h2 id="layered-structure">Layered Structure</h2>
<p>An InChI string begins with the prefix <code>InChI=</code> followed by a version number, then a series of layers separated by <code>/</code>. Each layer encodes a specific aspect of the molecular structure.</p>
<h3 id="layer-breakdown">Layer Breakdown</h3>
<p>For L-alanine (an amino acid with a chiral center):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
</span></span><span style="display:flex;"><span>       │  │      │            │                   │   │  │
</span></span><span style="display:flex;"><span>       │  │      │            │                   │   │  └─ /s: stereo type (1=absolute)
</span></span><span style="display:flex;"><span>       │  │      │            │                   │   └─ /m: parity inversion flag
</span></span><span style="display:flex;"><span>       │  │      │            │                   └─ /t: tetrahedral parity
</span></span><span style="display:flex;"><span>       │  │      │            └─ /h: hydrogen layer
</span></span><span style="display:flex;"><span>       │  │      └─ /c: connectivity layer
</span></span><span style="display:flex;"><span>       │  └─ molecular formula
</span></span><span style="display:flex;"><span>       └─ version (1S = standard InChI v1)
</span></span></code></pre></div><p>The full set of layers, in order:</p>
<ol>
<li><strong>Main layer</strong>: Molecular formula (e.g., <code>C3H7NO2</code>)</li>
<li><strong>Connectivity (<code>/c</code>)</strong>: Atom-to-atom connections, excluding bond orders. Atoms are numbered starting from 1, and connections are listed as pairs.</li>
<li><strong>Hydrogen (<code>/h</code>)</strong>: Hydrogen atom assignments, distinguishing mobile (tautomeric) from fixed hydrogens</li>
<li><strong>Charge (<code>/q</code>) and proton balance (<code>/p</code>)</strong>: Net charge and protonation state</li>
<li><strong>Double bond stereochemistry (<code>/b</code>)</strong>: E/Z configuration around double bonds</li>
<li><strong>Tetrahedral stereochemistry (<code>/t</code>)</strong>: R/S configuration at sp3 centers</li>
<li><strong>Parity inversion (<code>/m</code>)</strong>: Relates computed parity to actual configuration</li>
<li><strong>Stereo type (<code>/s</code>)</strong>: Whether stereochemistry is absolute, relative, or racemic</li>
<li><strong>Isotope layer (<code>/i</code>)</strong>: Isotopic labeling (e.g., deuterium, carbon-13)</li>
</ol>
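<p>Because each layer is delimited by <code>/</code> and identified by a single-letter prefix, splitting an InChI into its layers needs only the standard library. This is a minimal parser for well-formed strings with the layers listed above, not a validator:</p>

```python
def parse_inchi_layers(inchi):
    """Split an InChI string into {prefix: content} plus version and formula."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]  # e.g. 'c1-2(4)...' -> {'c': '1-2(4)...'}
    return layers

layers = parse_inchi_layers(
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
)
```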
<h3 id="standard-vs-non-standard-inchi">Standard vs. Non-Standard InChI</h3>
<p>The <code>S</code> in <code>InChI=1S/</code> indicates a <strong>Standard InChI</strong>, which uses a fixed set of normalization options to guarantee that any software producing Standard InChI will generate the same string for the same molecule. Non-standard InChI allows custom options (such as the Fixed-H layer <code>/f</code>, which distinguishes specific tautomeric forms) but sacrifices cross-implementation consistency.</p>
<h2 id="the-inchikey">The InChIKey</h2>
<p>InChI strings can be arbitrarily long for large molecules, and their <code>/</code>, <code>+</code>, and <code>=</code> characters cause problems for web search engines. The <strong>InChIKey</strong> addresses both issues by condensing the InChI into a fixed 27-character string. It is not a single hash of the whole string: the skeleton layers and the remaining layers are hashed separately with truncated SHA-256, then joined with flag characters:</p>
<p>$$
\text{InChIKey} = \operatorname{trunc}_{14}\big(\text{SHA-256}(\text{skeleton})\big) \,\Vert\, \operatorname{trunc}_{8}\big(\text{SHA-256}(\text{other layers})\big) \,\Vert\, \text{flags}
$$</p>
<h3 id="structure">Structure</h3>
<p>An InChIKey has the format <code>XXXXXXXXXXXXXX-XXXXXXXXXX-X</code>:</p>
<ul>
<li><strong>First block (14 characters)</strong>: SHA-256 hash of the connectivity layer (molecular skeleton)</li>
<li><strong>Second block (10 characters)</strong>: 8 characters encoding stereochemistry and isotopes, plus a standard/non-standard flag (<code>S</code> or <code>N</code>) and a version indicator (<code>A</code> for v1)</li>
<li><strong>Third block (1 character)</strong>: Protonation flag (<code>N</code> for neutral)</li>
</ul>
<p>For example, L-alanine:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>InChIKey: QNAYBMKLOCPYGJ-REOHCLBHSA-N
</span></span><span style="display:flex;"><span>          │                │          │
</span></span><span style="display:flex;"><span>          └─ connectivity  └─ stereo  └─ protonation
</span></span></code></pre></div><h3 id="collision-risk">Collision Risk</h3>
<p>Because the InChIKey is a hash, collisions are theoretically possible. The first block provides $2^{65}$ possible values for connectivity, making accidental collisions extremely unlikely for practical database sizes (estimated 1 in $10^{12}$ chance for $10^9$ compounds). It is important to distinguish InChIKey collisions (a mathematical inevitability of hashing, but rare in practice) from InChI collisions (bugs in the algorithm, which are very rare and targeted by the certification suite).</p>
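<p>The fixed 14&ndash;10&ndash;1 block structure also makes InChIKeys easy to recognize with a regex. This is a format check only; it cannot tell whether a key corresponds to any real structure:</p>

```python
import re

INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

def looks_like_inchikey(s):
    """True if s has the 14-10-1 uppercase block structure of an InChIKey."""
    return bool(INCHIKEY_RE.match(s))

print(looks_like_inchikey("QNAYBMKLOCPYGJ-REOHCLBHSA-N"))  # True
```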
<h2 id="working-with-inchi-in-python">Working with InChI in Python</h2>
<p>The RDKit library provides InChI support through its built-in functions:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.inchi <span style="color:#f92672">import</span> MolFromInchi, MolToInchi, InchiToInchiKey
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMILES -&gt; InChI</span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;C[C@@H](N)C(=O)O&#34;</span>)  <span style="color:#75715e"># L-alanine</span>
</span></span><span style="display:flex;"><span>inchi <span style="color:#f92672">=</span> MolToInchi(mol)
</span></span><span style="display:flex;"><span>print(inchi)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># InChI -&gt; Molecule -&gt; SMILES</span>
</span></span><span style="display:flex;"><span>mol2 <span style="color:#f92672">=</span> MolFromInchi(inchi)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol2))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C[C@@H](N)C(=O)O</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># InChI -&gt; InChIKey</span>
</span></span><span style="display:flex;"><span>key <span style="color:#f92672">=</span> InchiToInchiKey(inchi)
</span></span><span style="display:flex;"><span>print(key)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; QNAYBMKLOCPYGJ-REOHCLBHSA-N</span>
</span></span></code></pre></div><h3 id="layer-level-matching">Layer-Level Matching</h3>
<p>Because InChI is hierarchical, you can compare molecules at different levels of detail by truncating layers. Two molecules that differ only in stereochemistry will share the same connectivity layers:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.inchi <span style="color:#f92672">import</span> MolToInchi, InchiToInchiKey
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># L-alanine and D-alanine differ only in chirality</span>
</span></span><span style="display:flex;"><span>l_ala <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;C[C@@H](N)C(=O)O&#34;</span>)
</span></span><span style="display:flex;"><span>d_ala <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;C[C@H](N)C(=O)O&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>l_inchi <span style="color:#f92672">=</span> MolToInchi(l_ala)
</span></span><span style="display:flex;"><span>d_inchi <span style="color:#f92672">=</span> MolToInchi(d_ala)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Full InChIs differ (different /t and /m layers)</span>
</span></span><span style="display:flex;"><span>print(l_inchi)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1</span>
</span></span><span style="display:flex;"><span>print(d_inchi)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># First block of InChIKey is identical (same connectivity)</span>
</span></span><span style="display:flex;"><span>l_key <span style="color:#f92672">=</span> InchiToInchiKey(l_inchi)
</span></span><span style="display:flex;"><span>d_key <span style="color:#f92672">=</span> InchiToInchiKey(d_inchi)
</span></span><span style="display:flex;"><span>print(l_key[:<span style="color:#ae81ff">14</span>] <span style="color:#f92672">==</span> d_key[:<span style="color:#ae81ff">14</span>])
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; True (same molecular skeleton)</span>
</span></span><span style="display:flex;"><span>print(l_key <span style="color:#f92672">==</span> d_key)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; False (different stereochemistry)</span>
</span></span></code></pre></div><h2 id="inchi-in-machine-learning">InChI in Machine Learning</h2>
<p>InChI was designed for database interoperability, not for machine learning. Its hierarchical, layer-based structure differs fundamentally from the sequential, atom-by-atom encoding used by <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. This has practical implications for ML applications.</p>
<h3 id="optical-chemical-structure-recognition">Optical Chemical Structure Recognition</h3>
<p>InChI is widely used as an output format for <a href="/posts/what-is-ocsr/">optical chemical structure recognition (OCSR)</a> systems that extract molecular structures from images in scientific literature. Because InChI is canonical, it provides an unambiguous target for image-to-text models.</p>
<p><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI</a> uses an improved SwinTransformer encoder with attention-based feature fusion to convert molecular images directly to InChI strings, achieving 99.8% accuracy on the BMS dataset. The <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI Transformer</a> takes a similar approach with a Vision Transformer backbone.</p>
<p>In a <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">systematic comparison of string representations for OCSR</a>, Rajan et al. (2022) evaluated SMILES, DeepSMILES, SELFIES, and InChI using the same transformer architecture. InChI strings are longer than SMILES (producing more tokens for the decoder), which increases sequence modeling difficulty. SMILES achieved the highest exact match accuracy (88.62%), while SELFIES achieved 100% structural validity.</p>
<h3 id="chemical-name-translation">Chemical Name Translation</h3>
<p>InChI&rsquo;s canonical structure makes it a natural intermediate representation for translating between chemical names and structures. <a href="/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/">Handsel et al. (2021)</a> trained a sequence-to-sequence Transformer to translate InChI identifiers to IUPAC names character-by-character, achieving 91% accuracy on organic compounds from PubChem (10 million training pairs). <a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT</a> converts through SELFIES as an intermediate but validates outputs against InChI for structural equivalence.</p>
<h3 id="representation-comparison-for-ml">Representation Comparison for ML</h3>
<p>InChI&rsquo;s design trade-offs position it differently from SMILES and SELFIES for machine learning:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>InChI</th>
          <th>SMILES</th>
          <th>SELFIES</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Uniqueness</td>
          <td>Canonical by design</td>
          <td>Requires canonicalization algorithm</td>
          <td>Via SMILES roundtrip</td>
      </tr>
      <tr>
          <td>Validity guarantee</td>
          <td>N/A (not generative)</td>
          <td>No</td>
          <td>Yes (every string is valid)</td>
      </tr>
      <tr>
          <td>Human readability</td>
          <td>Low (machine-optimized)</td>
          <td>High</td>
          <td>Moderate</td>
      </tr>
      <tr>
          <td>String length</td>
          <td>Longest</td>
          <td>Shortest</td>
          <td>Moderate</td>
      </tr>
      <tr>
          <td>Primary ML use</td>
          <td>OCSR output, database linking</td>
          <td>Generation, property prediction</td>
          <td>Generation with validity</td>
      </tr>
      <tr>
          <td>Tokenization</td>
          <td>Complex (layers, separators)</td>
          <td>Regex-based atom tokens</td>
          <td>Bracket-delimited tokens</td>
      </tr>
  </tbody>
</table>
<p>InChI&rsquo;s length and structural complexity (layer separators, parenthetical groupings, comma-delimited atom lists) make it less common as a direct input representation for generative models. Most molecular language models use SMILES or SELFIES for generation tasks, and convert to InChI only for canonicalized comparison or database lookup.</p>
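<p>To make the layered structure concrete, here is a short illustration that splits an InChI string on its <code>/</code> separators (a sketch for well-formed standard InChIs only; the <code>split_layers</code> helper is hypothetical, not part of any InChI library):</p>

```python
# Sketch: splitting an InChI into its layers to illustrate the
# hierarchical structure that complicates sequence tokenization.
# The example InChI is L-alanine (from the section above).
inchi = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

def split_layers(inchi: str) -> dict:
    """Split an InChI string into {prefix: content} layers."""
    header, _, body = inchi.partition("=")
    parts = body.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]  # first character is the layer prefix
    return layers

layers = split_layers(inchi)
print(layers["formula"])  # -> C3H7NO2
print(layers["c"])        # -> 1-2(4)3(5)6
```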
<h2 id="limitations">Limitations</h2>
<h3 id="tautomerism">Tautomerism</h3>
<p>InChI v1 handles many tautomeric forms by normalizing mobile hydrogen atoms in the <code>/h</code> layer. However, certain tautomeric transformations (such as 1,4-oxime/nitroso conversions) can produce different InChIs for what chemists consider the same compound. This is a <a href="/notes/chemistry/molecular-representations/notations/inchi-and-tautomers/">known limitation targeted for InChI v2</a>, with 86 tautomeric transformation rules compiled and validated across 400M+ structures to inform the update.</p>
<h3 id="inorganic-and-organometallic-chemistry">Inorganic and Organometallic Chemistry</h3>
<p>The original InChI specification was designed primarily for organic molecules. Metal-ligand bonds, coordination compounds, and extended solid-state structures posed challenges. The <a href="/notes/chemistry/molecular-representations/notations/inchi-2025/">InChI v1.07 release</a> addresses this with dedicated handling for metal-ligand bonds, though complete coverage of all inorganic chemistry remains an ongoing effort.</p>
<h3 id="not-designed-for-generation">Not Designed for Generation</h3>
<p>Unlike SMILES (which can be generated token-by-token through depth-first graph traversal) or SELFIES (which guarantees validity by construction), InChI&rsquo;s layered format does not lend itself to autoregressive generation. A generative model would need to produce internally consistent layers: the connectivity layer must agree with the molecular formula, the hydrogen layer must be consistent with the connectivity, and the stereochemistry layers must reference valid atom indices. This cross-layer dependency makes InChI poorly suited as a target for token-by-token molecular generation, which is why most generative chemistry models use SMILES or SELFIES.</p>
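<p>The formula/connectivity dependency can be illustrated with a toy consistency check (a deliberate simplification for this example; real InChI validation is far more involved):</p>

```python
import re

# Illustration of the cross-layer consistency a generative model would
# have to maintain: the atom numbering in the connectivity (/c) layer
# must cover exactly the heavy-atom count implied by the formula layer.
inchi = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

def heavy_atom_count(formula: str) -> int:
    """Count non-hydrogen atoms in a Hill-formula string like C3H7NO2."""
    total = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol != "H":
            total += int(count) if count else 1
    return total

parts = inchi.split("/")
formula = parts[1]
c_layer = parts[2][1:]  # strip the 'c' prefix
indices = [int(n) for n in re.findall(r"\d+", c_layer)]

print(heavy_atom_count(formula))                  # -> 6 (3 C + 1 N + 2 O)
print(max(indices) == heavy_atom_count(formula))  # -> True: layers agree
```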
<h3 id="irreversibility-of-inchikey">Irreversibility of InChIKey</h3>
<p>The InChIKey is a one-way hash: an InChIKey cannot be converted back to an InChI or a molecular structure. It is useful for search and comparison, but not for structure retrieval unless paired with a lookup table.</p>
<h2 id="variants-and-extensions">Variants and Extensions</h2>
<h3 id="rinchi-reactions">RInChI: Reactions</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/rinchi/">RInChI (Reaction InChI)</a> extends InChI to represent chemical reactions by combining the InChIs of reactants, products, and agents into a single identifier. It provides a canonical identifier for reactions, enabling reaction database searching and duplicate detection (Grethe et al., 2018).</p>
<h3 id="minchi-mixtures">MInChI: Mixtures</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/mixfile-minchi/">MInChI (Mixture InChI)</a> represents mixtures of substances, combined with the Mixfile format for storing detailed mixture composition data. This extends the InChI framework to complex multi-component systems like formulations and alloys (Clark et al., 2019).</p>
<h3 id="ninchi-nanomaterials">NInChI: Nanomaterials</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/ninchi-alpha/">NInChI</a> proposes a hierarchical adaptation of InChI for nanomaterial identification. Traditional chemical identifiers break down at the nanoscale, where a single &ldquo;entity&rdquo; may consist of millions of atoms arranged in layers, coatings, and surface functionalizations (Lynch et al., 2020).</p>
<h2 id="references">References</h2>
<ul>
<li>Heller, S., McNaught, A., Pletnev, I., Stein, S., &amp; Tchekhovskoi, D. (2015). InChI, the IUPAC International Chemical Identifier. <a href="https://doi.org/10.1186/s13321-015-0068-4"><em>Journal of Cheminformatics</em>, <em>7</em>(1), 23.</a></li>
<li>Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., &amp; Pletnev, I. (2013). InChI - the worldwide chemical structure identifier standard. <a href="https://doi.org/10.1186/1758-2946-5-7"><em>Journal of Cheminformatics</em>, <em>5</em>(1), 7.</a></li>
<li>Grethe, G., Blanke, G., Kraut, H., &amp; Goodman, J. M. (2018). International Chemical Identifier for reactions (RInChI). <a href="https://doi.org/10.1186/s13321-018-0277-8"><em>Journal of Cheminformatics</em>, <em>10</em>(1), 22.</a></li>
<li>Clark, A. M., McEwen, L. R., Gedeck, P., &amp; Bunin, B. A. (2019). Capturing mixture composition: an open machine-readable format for representing mixed substances. <a href="https://doi.org/10.1186/s13321-019-0357-4"><em>Journal of Cheminformatics</em>, <em>11</em>(1), 33.</a></li>
<li>Lynch, I., et al. (2020). Can an InChI for nano address the need for a simplified representation of complex nanomaterials across experimental and nanoinformatics studies? <a href="https://doi.org/10.3390/nano10122493"><em>Nanomaterials</em>, <em>10</em>(12), 2493.</a></li>
<li><a href="https://www.inchi-trust.org/">InChI Trust</a></li>
<li><a href="https://github.com/IUPAC-InChI/InChI">InChI GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>t-SMILES: Tree-Based Fragment Molecular Encoding</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</guid><description>t-SMILES encodes fragmented molecules as SMILES-type strings via breadth-first traversal of full binary trees, reducing nesting depth and improving generation.</description><content:encoded><![CDATA[<h2 id="a-fragment-based-molecular-representation-method">A Fragment-Based Molecular Representation Method</h2>
<p>This is a <strong>Method</strong> paper that proposes t-SMILES (tree-based SMILES), a framework for representing molecules as SMILES-type strings derived from fragment-based decompositions. The primary contribution is an encoding algorithm that converts fragmented molecular graphs into full binary trees (FBTs) and then traverses them breadth-first to produce linear strings. Three coding variants are introduced: TSSA (shared atom), TSDY (dummy atom without ID), and TSID (dummy atom with ID). The framework achieves 100% theoretical validity, higher novelty scores, and improved distribution-learning metrics compared to classical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> across ChEMBL, ZINC, and QM9 benchmarks.</p>
<h2 id="why-fragment-based-representations-matter-for-molecular-generation">Why Fragment-Based Representations Matter for Molecular Generation</h2>
<p>Classical SMILES encodes molecules via depth-first traversal of the molecular graph, requiring parentheses and ring identifiers to appear in matched pairs with deep nesting. When generative models (LSTM, Transformer) are trained on SMILES, they produce chemically invalid strings, particularly on small datasets, because they struggle to learn these long-range pairing constraints. DeepSMILES addresses some syntactic issues but still permits semantic violations (e.g., oxygen with three bonds). SELFIES guarantees 100% valid strings, but at the cost of readability and, as the authors show, lower <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a> scores, indicating that generated molecules diverge from the training distribution.</p>
<p>Fragment-based approaches reduce the search space compared to atom-level methods and can provide insights into molecular recognition (e.g., protein-ligand interactions). However, existing fragment-based deep learning methods rely on fixed dictionaries of candidate fragments, creating in-vocabulary/out-of-vocabulary problems and high-dimensional sparse representations. The encoding of fragments as SMILES-type strings, rather than dictionary IDs, had not been systematically explored before this work.</p>
<p>The authors draw on the observation that fragments in organic molecules follow a <a href="https://en.wikipedia.org/wiki/Zipf's_law">Zipf-like</a> rank distribution similar to words in natural language, motivating the use of NLP techniques for fragment-based molecular modeling.</p>
<h2 id="core-innovation-binary-tree-encoding-of-fragmented-molecules">Core Innovation: Binary Tree Encoding of Fragmented Molecules</h2>
<p>The t-SMILES algorithm proceeds in three steps:</p>
<ol>
<li><strong>Fragmentation</strong>: A molecule is decomposed into valid chemical fragments using a chosen algorithm (JTVAE, BRICS, <a href="https://en.wikipedia.org/wiki/Matched_molecular_pair_analysis">MMPA</a>, or Scaffold), producing a fragmented molecular graph.</li>
<li><strong>Tree construction</strong>: The fragmented graph is converted into an Acyclic Molecular Tree (AMT), which is a reduced graph where nodes represent fragments and edges represent bonds between them. The AMT is then transformed into a Full Binary Tree (FBT), where every internal node has exactly two children.</li>
<li><strong>String generation</strong>: The FBT is traversed using breadth-first search (BFS) to produce the t-SMILES string.</li>
</ol>
<p>The framework introduces only two new symbols beyond standard SMILES: <code>&amp;</code> marks empty tree nodes (branch terminators providing global structural information), and <code>^</code> separates adjacent substructure segments (analogous to spaces between words in English).</p>
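<p>The traversal step can be sketched as a plain breadth-first walk over a binary tree of fragment strings. This is a schematic illustration only: the node fragments below are made up, and the exact token layout of real t-SMILES strings differs from this simplified join:</p>

```python
from collections import deque

# Schematic sketch of the string-generation step: breadth-first
# traversal of a binary tree whose nodes hold fragment SMILES,
# emitting '&' for empty children and '^' between adjacent tokens.
class Node:
    def __init__(self, fragment, left=None, right=None):
        self.fragment, self.left, self.right = fragment, left, right

def bfs_encode(root) -> str:
    tokens = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            tokens.append("&")  # empty node: branch terminator
            continue
        tokens.append(node.fragment)
        queue.append(node.left)
        queue.append(node.right)
    return "^".join(tokens)

# A two-fragment toy molecule: an ethyl fragment bonded to a benzene ring.
tree = Node("CC", Node("c1ccccc1"), None)
print(bfs_encode(tree))  # -> CC^c1ccccc1^&^&^&
```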
<h3 id="three-coding-variants">Three Coding Variants</h3>
<ul>
<li><strong>TSSA</strong> (shared atom): Two fragments share a real atom at their connection point. Produces the highest novelty scores and is recommended for goal-directed tasks.</li>
<li><strong>TSDY</strong> (dummy atom, no ID): Uses dummy atoms (marked with <code>*</code>) to indicate bonding points. Provides a balanced choice between novelty and distribution fidelity.</li>
<li><strong>TSID</strong> (dummy atom with ID): Uses numbered dummy atoms (<code>[n*]</code>) for unambiguous reconstruction. Produces the most faithful distribution reproduction and is recommended for distribution-learning tasks.</li>
</ul>
<h3 id="structural-advantages">Structural Advantages</h3>
<p>The key structural benefit is a dramatic reduction in nesting depth. For TSDY_M on ChEMBL, the proportion of tokens at nesting depth 0-1-2 increases from 68.0% (SMILES) to 99.3%, while depth 3-4-5 drops from 31.9% to 0.7%, and depth 6-11 drops from 0.1% to 0.0002%. The <code>&amp;</code> symbol, which encodes molecular topology, does not need to appear in pairs (unlike parentheses in SMILES), and its high frequency means it does not create a scarcity problem for learning.</p>
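<p>The nesting-depth statistic itself is straightforward to compute; here is a sketch that tracks parenthesis depth character by character (a simplification: proper token-level counting would first tokenize the SMILES string):</p>

```python
from collections import Counter

# Sketch of how nesting-depth statistics like those above could be
# measured: scan a SMILES string, tracking parenthesis depth, and
# record the depth at which each character sits.
def depth_profile(smiles: str) -> Counter:
    depth, profile = 0, Counter()
    for ch in smiles:
        if ch == "(":
            depth += 1
        profile[depth] += 1
        if ch == ")":
            depth -= 1
    return profile

# Alanine without stereochemistry: two nested branches.
profile = depth_profile("CC(C(=O)O)N")
print(profile)  # counts per depth level; everything sits at depth 0-2
```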
<p>The framework also supports a multi-code system where classical SMILES can be integrated as a special case called TS_Vanilla, and multiple fragmentation-based codes can be combined into hybrid models.</p>
<h3 id="reconstruction-and-data-augmentation">Reconstruction and Data Augmentation</h3>
<p>Molecules can be reconstructed from t-SMILES strings by reversing the process: rebuilding the FBT from the string, converting to AMT, and assembling fragments into a molecular graph. This reconstruction process can itself generate novel molecules without any model training by randomly assembling fragments. On ChEMBL, TSSA reconstruction achieves uniqueness above 0.98 and novelty above 0.68 for all four fragmentation algorithms, with 100% validity.</p>
<p>Data augmentation in t-SMILES operates at four levels: (1) different decomposition algorithms, (2) reconstruction, (3) enumeration of fragment strings, and (4) enumeration of FBTs. Unlike <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> (which only produces different strings for the same molecule), t-SMILES reconstruction generates genuinely different molecules from the same fragment set.</p>
<h2 id="systematic-evaluation-across-multiple-benchmarks">Systematic Evaluation Across Multiple Benchmarks</h2>
<p>All experiments use MolGPT (a Transformer-decoder model) as the primary generative model. Three types of metrics are employed: distribution-learning benchmarks, goal-directed benchmarks, and Wasserstein distance metrics for physicochemical properties.</p>
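<p>The basic distribution-learning quantities reduce to simple set arithmetic once molecules are canonicalized. Here is a sketch assuming canonical SMILES are already available (validity checking and canonicalization would normally be done with RDKit; <code>distribution_metrics</code> is an illustrative helper, not from the paper's code):</p>

```python
# Validity, uniqueness, and novelty from canonical-SMILES collections.
# `valid` is the subset of generated strings that parsed successfully;
# `train` is the training set.
def distribution_metrics(generated: list[str], valid: set[str],
                         train: set[str]) -> dict:
    valid_gen = [s for s in generated if s in valid]
    unique = set(valid_gen)
    return {
        "validity": len(valid_gen) / len(generated),
        "uniqueness": len(unique) / len(valid_gen) if valid_gen else 0.0,
        "novelty": len(unique - train) / len(unique) if unique else 0.0,
    }

gen = ["CCO", "CCO", "CCN", "C(("]  # last string is unparseable
m = distribution_metrics(gen, valid={"CCO", "CCN"}, train={"CCO"})
print(m)  # validity 0.75, uniqueness 2/3, novelty 0.5
```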
<h3 id="low-resource-datasets-jnk3-and-aid1706">Low-Resource Datasets (JNK3 and AID1706)</h3>
<p>On <a href="https://en.wikipedia.org/wiki/MAPK10">JNK3</a> (923 active molecules), the authors investigate overfitting behavior across training epochs:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Novelty</th>
          <th>FCD</th>
          <th>Active Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES [R200]</td>
          <td>0.795</td>
          <td>0.120</td>
          <td>0.584</td>
          <td>0.072</td>
      </tr>
      <tr>
          <td>SMILES [R2000]</td>
          <td>1.000</td>
          <td>0.001</td>
          <td>0.765</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>SELFIES [R200]</td>
          <td>1.000</td>
          <td>0.238</td>
          <td>0.544</td>
          <td>0.148</td>
      </tr>
      <tr>
          <td>SELFIES [R2000]</td>
          <td>1.000</td>
          <td>0.008</td>
          <td>0.767</td>
          <td>0.050</td>
      </tr>
      <tr>
          <td>TSSA_S [R300]</td>
          <td>1.000</td>
          <td>0.833</td>
          <td>0.564</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>TSSA_S [R5000]</td>
          <td>1.000</td>
          <td>0.817</td>
          <td>0.608</td>
          <td>0.564</td>
      </tr>
      <tr>
          <td>TF_TSSA_S [R5]</td>
          <td>1.000</td>
          <td>0.932</td>
          <td>0.483</td>
          <td>0.710</td>
      </tr>
      <tr>
          <td>TSSA_S_Rec50 [R10]</td>
          <td>1.000</td>
          <td>0.962</td>
          <td>0.389</td>
          <td>0.829</td>
      </tr>
  </tbody>
</table>
<p>Key findings: SMILES and DeepSMILES novelty scores collapse to near zero after 200 epochs, while t-SMILES novelty stabilizes around 0.8. The highest active-novel score of 0.829 comes from t-SMILES with reconstruction-based data augmentation. Transfer learning with t-SMILES maintains novelty of 0.710 at 5 epochs versus 0.526 for SMILES, and at 100 epochs the gap widens dramatically (0.569 vs. 0.023).</p>
<h3 id="distribution-learning-on-chembl">Distribution Learning on ChEMBL</h3>
<p>t-SMILES models outperform graph baselines (Graph MCTS, hG2G, MGM) and fragment-based methods (FASMIFRA). TSID_B and TSID_S achieve FCD scores of 0.909 while maintaining novelty of 0.941 and 0.933, surpassing SMILES (FCD 0.906, novelty 0.907) in both dimensions. TSDY and TSID models consistently outperform TSSA on distribution fidelity for larger molecules.</p>
<h3 id="goal-directed-tasks-on-chembl">Goal-Directed Tasks on ChEMBL</h3>
<p>On 20 <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> subtasks, different fragmentation algorithms excel at different tasks. The goal-directed reconstruction algorithm significantly outperforms random reconstruction. On the <a href="https://en.wikipedia.org/wiki/Sitagliptin">Sitagliptin</a> MPO task (T16.SMPO), the TSDY_M model with goal-directed reconstruction achieves a score of 0.930, compared to 0.598 for SMILES and 0.708 for CReM. On <a href="https://en.wikipedia.org/wiki/Valsartan">Valsartan</a> SMARTS (T18.VS), t-SMILES models reach 0.997 versus 0.985 for SMILES.</p>
<h3 id="distribution-learning-on-zinc-and-qm9">Distribution Learning on ZINC and QM9</h3>
<p>On ZINC, t-SMILES models significantly outperform existing fragment-based baselines (JTVAE, FragDgm). Seven t-SMILES models achieve both higher FCD and novelty scores than SELFIES. On QM9 (smaller molecules), all string-based models achieve high FCD scores (above 0.960), with t-SMILES performing better than existing string and graph approaches.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Across ChEMBL and ZINC, TSDY and TSID models capture physicochemical property distributions (MolWt, LogP, SAScore, N_Atoms, N_Rings, etc.) more faithfully than TSSA models. Multiple t-SMILES models outperform SMILES in more than four out of nine property categories. Baseline models hG2G and JTVAE show the weakest pattern learning, producing molecules with fewer atoms and rings than the training data.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<ol>
<li>t-SMILES achieves 100% theoretical validity by fragmenting molecules into chemically valid pieces before encoding.</li>
<li>The framework avoids the overfitting problem on low-resource datasets, maintaining stable novelty scores where SMILES, DeepSMILES, and SELFIES collapse.</li>
<li>The multi-code system allows different coding algorithms to complement each other, with hybrid models accessing broader chemical space.</li>
<li>Goal-directed reconstruction significantly outperforms all baselines on targeted optimization tasks.</li>
<li>TSDY and TSID provide better distribution fidelity than TSSA on larger molecules, while TSSA excels at novelty generation for goal-directed tasks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Whether the tree structure of t-SMILES can be effectively learned by Large Language Models remains unexplored.</li>
<li>Only published fragmentation algorithms were tested; custom fragmentation schemes were not investigated.</li>
<li>Experiments on more complex (larger) molecules were not performed.</li>
<li>The reconstruction algorithm uses simple rules for fragment assembly; more sophisticated assembly methods (Monte Carlo tree search, CReM) could improve quality.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest exploring advanced reconstruction and optimization algorithms, improved generative models, evolutionary techniques, and extending t-SMILES to property prediction, retrosynthesis, and reaction prediction tasks. The framework is also extensible to other string representations (t-DSMILES, t-SELFIES) by changing how fragments are encoded.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Low-resource evaluation</td>
          <td>JNK3</td>
          <td>923 active molecules</td>
          <td>Kinase inhibitors</td>
      </tr>
      <tr>
          <td>Low-resource evaluation</td>
          <td>AID1706</td>
          <td>329 active molecules</td>
          <td>SARS 3CLPro inhibitors</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ChEMBL</td>
          <td>Standard split</td>
          <td>Large drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ZINC</td>
          <td>250K subset</td>
          <td>Medium drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>QM9</td>
          <td>~134K molecules</td>
          <td>Small organic molecules</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: JTVAE, BRICS, MMPA, Scaffold (all via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>)</li>
<li><strong>Tree construction</strong>: AMT from reduced graph, then FBT transformation</li>
<li><strong>Traversal</strong>: Breadth-first search on FBT</li>
<li><strong>Generative model</strong>: MolGPT (Transformer decoder)</li>
<li><strong>Discriminative model</strong>: AttentiveFP for activity prediction on JNK3/AID1706</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated strings that decode to valid molecules</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of distinct molecules among valid generations</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
      <tr>
          <td>KLD</td>
          <td>Kullback-Leibler divergence for physicochemical property distributions</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>Frechet ChemNet Distance measuring chemical similarity to training set</td>
      </tr>
      <tr>
          <td>Active Novel</td>
          <td>Novel molecules predicted active by AttentiveFP</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/juanniwu/t-SMILES">t-SMILES GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with training/generation scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/ZENODO.10991703">Zenodo deposit</a></td>
          <td>Code + Data</td>
          <td>CC-BY-4.0</td>
          <td>Archived code and data</td>
      </tr>
      <tr>
          <td><a href="https://codeocean.com/capsule/3034546/tree">Code Ocean capsule</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Certified reproducible compute capsule</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions limited computational resources but does not specify exact GPU types or training times.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, J.-N., Wang, T., Chen, Y., Tang, L.-J., Wu, H.-L., &amp; Yu, R.-Q. (2024). t-SMILES: a fragment-based molecular representation framework for de novo ligand design. <em>Nature Communications</em>, 15, 4993.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{t-SMILES: a fragment-based molecular representation framework for de novo ligand design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Juan-Ni and Wang, Tong and Chen, Yue and Tang, Li-Juan and Wu, Hai-Long and Yu, Ru-Qin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4993}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-49388-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPE: Data-Driven SMILES Substructure Tokenization</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</guid><description>SMILES Pair Encoding adapts byte pair encoding to learn chemically meaningful substructure tokens from SMILES, improving generation and QSAR prediction.</description><content:encoded><![CDATA[<h2 id="a-data-driven-tokenization-method-for-chemical-deep-learning">A Data-Driven Tokenization Method for Chemical Deep Learning</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Pair Encoding (SPE), a tokenization algorithm adapted from <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">byte pair encoding (BPE)</a> in natural language processing. The primary contribution is a data-driven approach that learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset and then uses that vocabulary to tokenize SMILES for downstream deep learning tasks. The authors provide an open-source Python package (SmilesPE) and demonstrate improvements on both molecular generation and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> prediction benchmarks.</p>
<h2 id="limitations-of-atom-level-smiles-tokenization">Limitations of Atom-Level SMILES Tokenization</h2>
<p>SMILES-based deep learning models require tokenization to convert molecular strings into sequences of discrete units. The standard approaches have well-known drawbacks:</p>
<ul>
<li><strong>Character-level tokenization</strong> breaks SMILES character by character, splitting chemically meaningful multi-character atoms. For example, <code>[C@@H]</code> becomes six separate tokens (<code>[</code>, <code>C</code>, <code>@</code>, <code>@</code>, <code>H</code>, <code>]</code>), scattering the stereochemistry annotation of a single chiral carbon across unrelated tokens.</li>
<li><strong>Atom-level tokenization</strong> addresses some of these issues by treating multi-character element symbols (Cl, Br) and bracketed atoms ([nH], [O-]) as single tokens. However, these tokens still encode only individual atoms, not substructures.</li>
<li><strong>k-mer tokenization</strong> (overlapping windows of k consecutive characters) captures some connectivity information but suffers from the out-of-vocabulary problem: the model cannot represent k-mers that were not seen during training.</li>
</ul>
<p>All three approaches produce relatively long input sequences (mean ~40 tokens per molecule on ChEMBL at the atom level), which increases computational cost for sequential architectures like RNNs and exacerbates long-range dependency issues.</p>
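<p>The difference between the first two schemes is easy to see in code. The sketch below uses an atom-level regex in the style of Schwaller et al.; the pattern is a common community reproduction, not code taken from any of the packages discussed here.</p>

```python
import re

# Atom-level SMILES regex in the style of Schwaller et al.:
# bracketed atoms, two-letter halogens (Cl, Br), and ring-bond
# digits are each matched as a single token.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def atom_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)

# Character-level splits [C@@H] into 6 tokens; atom-level keeps it whole.
print(list("O[C@@H](Cl)Br"))           # one token per character
print(atom_tokenize("O[C@@H](Cl)Br"))  # ['O', '[C@@H]', '(', 'Cl', ')', 'Br']
```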
<h2 id="core-innovation-adapting-byte-pair-encoding-for-smiles">Core Innovation: Adapting Byte Pair Encoding for SMILES</h2>
<p>SPE adapts the byte pair encoding algorithm, originally developed for data compression and later adopted for subword tokenization in NLP, to the domain of chemical strings. The algorithm has two phases:</p>
<p><strong>Vocabulary training:</strong></p>
<ol>
<li>Tokenize SMILES from a large dataset (ChEMBL) at the atom level</li>
<li>Initialize the vocabulary with all unique atom-level tokens</li>
<li>Iteratively count the frequency of all adjacent token pairs, merge the most frequent pair into a new token, and add it to the vocabulary</li>
<li>Stop when either the maximum vocabulary size (MVS) or a minimum frequency threshold (FT) is reached</li>
</ol>
<p><strong>Tokenization:</strong> Given a trained SPE vocabulary, a new SMILES string is first tokenized at the atom level, then token pairs are iteratively merged according to their frequency rank in the vocabulary until no further merges are possible.</p>
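<p>The two phases can be illustrated with a toy pair-merging implementation. The function names and the simplified stopping logic below are illustrative only; the actual SmilesPE package differs in implementation details.</p>

```python
from collections import Counter

def train_spe(corpus: list[list[str]], num_merges: int, min_freq: int = 2):
    """Learn merge rules by repeatedly merging the most frequent
    adjacent token pair (toy version of SPE vocabulary training)."""
    merges, seqs = [], [list(s) for s in corpus]
    for _ in range(num_merges):          # MVS: cap on learned merges
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < min_freq:              # FT: stop below frequency threshold
            break
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges

def merge_pair(seq, pair):
    """Replace every occurrence of the adjacent pair with one token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def spe_tokenize(atom_tokens, merges):
    """Apply learned merges in rank order to a new atom-level sequence."""
    seq = list(atom_tokens)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq
```

<p>On a toy corpus of atom-level sequences such as <code>["C", "C", "O"]</code>, the first learned merge is the most frequent pair and later merges build progressively longer substructure tokens.</p>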
<p>The key hyperparameters are MVS and FT. In the reported experiments, MVS was set to 30,000 and FT was set to 2,000. The vocabulary was trained on ~3.4 million SMILES (both canonical and one non-canonical variant per molecule) from ChEMBL25. The resulting vocabulary contained 3,002 unique SMILES substrings with lengths ranging from 1 to 22 atom-level characters.</p>
<p>The trained SPE vocabulary produces tokens that are human-readable and correspond to chemically meaningful substructures and functional groups. SPE tokenization reduces the mean sequence length from approximately 40 tokens (atom-level) to approximately 6 tokens on ChEMBL, a roughly 6-7x compression. This shorter representation directly reduces computational cost for RNN-based and other sequential models.</p>
<p>The algorithm is also compatible with other text-based molecular representations such as <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, since these share atom-level character structures that can serve as the starting point for pair merging.</p>
<h2 id="molecular-generation-and-qsar-prediction-experiments">Molecular Generation and QSAR Prediction Experiments</h2>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>The authors trained AWD-LSTM language models with SPE and atom-level tokenization on 9 million SMILES (1 canonical + 5 non-canonical per compound from ChEMBL25). Each model sampled 1 million SMILES for evaluation. The AWD-LSTM architecture used an embedding size of 400, three LSTM layers with 1,152 hidden units each, and various dropout settings (embedding: 0.1, input: 0.6, weight: 0.5, hidden: 0.2). Models were trained for 10 epochs with a base learning rate of 0.008 using one-cycle scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SPE</th>
          <th>Atom-level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>0.941</td>
          <td>0.970</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.994</td>
          <td>0.992</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.983</td>
          <td>0.978</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.897</td>
          <td>0.886</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>0.391</td>
          <td>0.386</td>
      </tr>
  </tbody>
</table>
<p>The SPE model generated a more diverse population of novel molecules at the cost of slightly lower validity (94.1% vs. 97.0%). Internal diversity is defined as:</p>
<p>$$
\text{Internal diversity} = 1 - \frac{1}{|G|^2} \sum_{(x_1, x_2) \in G \times G} T(x_1, x_2)
$$</p>
<p>where $T(x_1, x_2)$ is the Tanimoto similarity between molecules $x_1$ and $x_2$ using 1024-bit ECFP6 fingerprints. Nearest neighbor similarity (SNN) measures how well the generated set resembles the reference set:</p>
<p>$$
\text{SNN} = \frac{1}{|G|} \sum_{x_G \in G} \max_{x_R \in R} T(x_G, x_R)
$$</p>
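<p>Both metrics are straightforward to compute once pairwise similarities are available. The sketch below uses sets of &ldquo;on&rdquo; fingerprint bits as a stand-in for the 1024-bit ECFP6 fingerprints used in the paper (which would normally come from RDKit):</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def internal_diversity(gen: list[set]) -> float:
    """1 minus the mean pairwise Tanimoto over all ordered pairs in G
    (including self-pairs, which contribute similarity 1)."""
    n = len(gen)
    total = sum(tanimoto(x1, x2) for x1 in gen for x2 in gen)
    return 1.0 - total / (n * n)

def snn(gen: list[set], ref: list[set]) -> float:
    """Mean over generated molecules of the max Tanimoto to the reference set."""
    return sum(max(tanimoto(g, r) for r in ref) for g in gen) / len(gen)
```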
<p>Substructure coverage analysis showed both models recovered the same top-1000 BRICS fragments (100% coverage), but SPE consistently outperformed atom-level tokenization on top-5000 coverage across all four substructure types: BRICS fragments (0.997 vs. 0.987), functional groups (0.688 vs. 0.659), scaffolds (0.872 vs. 0.825), and ring systems (0.781 vs. 0.761).</p>
<h3 id="qsar-prediction">QSAR Prediction</h3>
<p>QSAR models were built using the <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT transfer learning framework</a>, which pre-trains a language model on ChEMBL and then fine-tunes it for specific prediction tasks. The evaluation used 24 regression benchmarks (pIC50 values) from Cortes-Ciriano et al., covering targets ranging from 199 molecules (alpha-2a adrenergic receptor) to 5,010 molecules (<a href="https://en.wikipedia.org/wiki/KCNH2">hERG</a>). Models were evaluated on 10 random 80:10:10 splits using RMSE, R-squared, and MAE. Random forest models with 1024-bit ECFP6 were included as baseline comparisons.</p>
<p><a href="https://en.wikipedia.org/wiki/Effect_size">Cohen&rsquo;s d</a> effect sizes were computed to quantify performance differences between tokenization methods. SPE performed comparably or better than atom-level tokenization on 23 out of 24 datasets. Notable results with medium or large effect sizes favoring SPE included <a href="https://en.wikipedia.org/wiki/Cannabinoid_receptor_1">cannabinoid CB1 receptor</a> (large effect), A2a adrenergic receptor, LCK, estrogen receptor, and <a href="https://en.wikipedia.org/wiki/Aurora_kinase_A">Aurora-A kinase</a> (all medium effects). Against k-mer tokenization, SPE matched or outperformed on 22 out of 24 datasets.</p>
<p>Cohen&rsquo;s d is defined as:</p>
<p>$$
\text{Cohen&rsquo;s } d = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(\text{SD}_1^2 + \text{SD}_2^2) / 2}}
$$</p>
<p>where $\bar{x}_1, \bar{x}_2$ are the group means and $\text{SD}_1, \text{SD}_2$ are the standard deviations. Thresholds of 0.2 (small), 0.5 (medium), and 0.8 (large) were used following standard recommendations.</p>
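<p>In code, the effect-size computation is a one-liner plus the thresholding. This sketch assumes equal-sized groups, matching the simple pooled-SD form above:</p>

```python
from statistics import mean, stdev

def cohens_d(group1: list[float], group2: list[float]) -> float:
    """Cohen's d with the simple pooled SD for equal-sized groups."""
    sd1, sd2 = stdev(group1), stdev(group2)
    pooled = ((sd1**2 + sd2**2) / 2) ** 0.5
    return (mean(group1) - mean(group2)) / pooled

def effect_label(d: float) -> str:
    """Map |d| onto the standard small/medium/large thresholds."""
    a = abs(d)
    if a >= 0.8:
        return "large"
    if a >= 0.5:
        return "medium"
    if a >= 0.2:
        return "small"
    return "negligible"
```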
<p>SMILES-based deep learning models generally performed on par with or better than the RF baseline, with particularly strong advantages on the four largest datasets (<a href="https://en.wikipedia.org/wiki/Cyclooxygenase-2">COX-2</a>, <a href="https://en.wikipedia.org/wiki/Acetylcholinesterase">acetylcholinesterase</a>, erbB1, and hERG).</p>
<p>In addition to performance gains, SPE-based models trained on average 5 times faster than atom-level models due to the shorter input sequences.</p>
<h2 id="results-summary-and-future-directions">Results Summary and Future Directions</h2>
<p>The main findings of this study are:</p>
<ol>
<li>
<p><strong>SPE produces chemically meaningful tokens.</strong> The learned vocabulary contains human-readable SMILES substrings that correspond to common substructures and functional groups, making model interpretations more accessible.</p>
</li>
<li>
<p><strong>SPE compresses input sequences by ~6-7x.</strong> Mean token sequence length drops from ~40 (atom-level) to ~6 (SPE) on ChEMBL, yielding a ~5x training speedup.</p>
</li>
<li>
<p><strong>SPE improves molecular generation diversity.</strong> The SPE-based generative model produces molecules with higher novelty (98.3% vs. 97.8%), internal diversity (0.897 vs. 0.886), and substructure coverage, at the cost of slightly lower validity (94.1% vs. 97.0%).</p>
</li>
<li>
<p><strong>SPE matches or outperforms atom-level and k-mer tokenization on QSAR prediction.</strong> Across 24 benchmarks, SPE showed comparable or better performance in 23/24 comparisons against atom-level and 22/24 against k-mer tokenization.</p>
</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The SPE vocabulary is trained on a specific dataset (ChEMBL25) and may not optimally represent chemical spaces that differ significantly from drug-like compounds.</li>
<li>The validity rate for molecular generation is slightly lower than atom-level tokenization (94.1% vs. 97.0%), since longer substructure tokens can introduce invalid fragments.</li>
<li>The k-mer tokenization suffers from an out-of-vocabulary problem, which the authors address by replacing unseen 4-mers with <code>[UNK]</code> tokens, but this is a limitation of the comparison rather than of SPE itself.</li>
</ul>
<p><strong>Future directions:</strong> The authors suggest SPE could serve as a general tokenization method for SMILES-based deep learning, applicable to any task where SMILES strings are used as input (<a href="/notes/chemistry/molecular-design/generation/">generation</a>, <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/">reaction prediction</a>, retrosynthesis). The algorithm can also be applied to DeepSMILES and SELFIES representations without modification.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SPE vocabulary training</td>
          <td>ChEMBL25</td>
          <td>~3.4M SMILES</td>
          <td>1 canonical + 1 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Language model training</td>
          <td>ChEMBL25 augmented</td>
          <td>~9M SMILES</td>
          <td>1 canonical + 5 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Molecular generation evaluation</td>
          <td>Sampled from model</td>
          <td>1M SMILES per model</td>
          <td>Validated with RDKit</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>Cortes-Ciriano et al.</td>
          <td>24 datasets, 199-5010 molecules</td>
          <td>pIC50 regression tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SPE vocabulary training: iterative pair merging with MVS=30,000 and FT=2,000</li>
<li>Language model: AWD-LSTM with embedding size 400, 3 LSTM layers with 1,152 hidden units</li>
<li>Dropout: embedding=0.1, input=0.6, weight=0.5, hidden=0.2</li>
<li>Training: 10 epochs, base learning rate 0.008, one-cycle policy</li>
<li>QSAR: MolPMoFiT transfer learning with 25x training augmentation and 15x validation augmentation</li>
<li>Test time augmentation: average of canonical + 4 augmented SMILES predictions</li>
<li>RF baseline: 500 trees, 1024-bit ECFP6, default scikit-learn parameters</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>AWD-LSTM architecture from Merity et al. (2018)</li>
<li>MolPMoFiT framework from Li and Fourches (2020) for transfer learning QSAR</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity, Uniqueness, Novelty</td>
          <td>Generation</td>
          <td>Basic quality metrics</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>Generation</td>
          <td>1 - mean pairwise Tanimoto (ECFP6)</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>Generation</td>
          <td>Mean max Tanimoto to reference set</td>
      </tr>
      <tr>
          <td>Substructure coverage</td>
          <td>Generation</td>
          <td>BRICS, functional groups, scaffolds, ring systems</td>
      </tr>
      <tr>
          <td>RMSE, R-squared, MAE</td>
          <td>QSAR regression</td>
          <td>10 random 80:10:10 splits</td>
      </tr>
      <tr>
          <td>Cohen&rsquo;s d</td>
          <td>QSAR comparison</td>
          <td>Effect size between tokenization methods</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/SmilesPE">SmilesPE</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>SPE tokenization Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transfer learning QSAR framework</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2021). SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 61(4), 1560-1569. <a href="https://doi.org/10.1021/acs.jcim.0c01127">https://doi.org/10.1021/acs.jcim.0c01127</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2021smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1560--1569}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01127}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Smirk: Complete Tokenization for Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</guid><description>Smirk tokenizer achieves full OpenSMILES coverage with 165 tokens by decomposing bracketed atoms into glyphs, validated via n-gram proxy models.</description><content:encoded><![CDATA[<h2 id="a-method-for-complete-chemical-tokenization">A Method for Complete Chemical Tokenization</h2>
<p>This is a <strong>Method</strong> paper that introduces two new tokenizers for molecular foundation models: Smirk and Smirk-GPE. The primary contribution is a tokenization scheme that achieves complete coverage of the OpenSMILES specification using only 165 tokens, addressing the vocabulary gaps present in existing atom-wise tokenizers. The paper also proposes n-gram language models as low-cost proxy evaluators for tokenizer quality and validates these proxies against 18 transformer-based models across multiple benchmarks.</p>
<h2 id="vocabulary-gaps-in-molecular-tokenization">Vocabulary Gaps in Molecular Tokenization</h2>
<p>Molecular foundation models overwhelmingly use &ldquo;atom-wise&rdquo; tokenization, where SMILES strings are split at atom boundaries using a regular expression first proposed by Schwaller et al. A key pattern in this regex treats all &ldquo;bracketed atoms&rdquo; (e.g., <code>[C@@H]</code>, <code>[18F]</code>, <code>[Au+]</code>) as single, irreducible tokens. Since bracketed atoms encode isotopes, chirality, charge, hydrogen count, and element identity, the number of possible permutations under the OpenSMILES specification exceeds 28 trillion. In practice, existing atom-wise tokenizers maintain vocabularies of fewer than 3,000 tokens, leaving large portions of chemical space unrepresentable.</p>
<p>This gap has real consequences. Many chemistry-specific tokenizers emit the unknown token <code>[UNK]</code> at non-negligible frequencies, particularly on datasets with diverse elements and stereochemistry. For example, <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SPE and APE</a> tokenizers produce <code>[UNK]</code> for roughly 19% of tokens on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and approximately 50% on the tmQM transition metal complex dataset. Even models like <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a> and <a href="/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/">ReactionT5</a> lack tokens for elements such as copper, ruthenium, gold, and uranium.</p>
<p>The authors also note a subtler issue: some open-vocabulary tokenizers (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa&rsquo;s</a> BPE) conflate chemically distinct entities. The same <code>Sc</code> token may represent both a sulfur-carbon bond (in organic SMILES) and the element scandium (in <code>[Sc]</code>), creating ambiguity in downstream analysis.</p>
<h2 id="smirk-glyph-level-decomposition-of-smiles">Smirk: Glyph-Level Decomposition of SMILES</h2>
<p>The core insight behind Smirk is to fully decompose bracketed atoms into their constituent &ldquo;glyphs,&rdquo; the primitive symbols defined by the OpenSMILES specification (element symbols, chirality markers, charges, isotope numbers, hydrogen counts, and brackets themselves). This transforms tokenization from a word-level scheme (one token per bracketed atom) to a character-level scheme over chemically meaningful glyphs.</p>
<p>Smirk uses a two-stage tokenization process:</p>
<ol>
<li><strong>Atom decomposition</strong>: Split a SMILES string into atom-level units using a regex (e.g., <code>OC[C@@H][OH]</code> becomes <code>O C [C@@H] [OH]</code>).</li>
<li><strong>Glyph decomposition</strong>: Further split each unit into its constituent glyphs (e.g., <code>[C@@H]</code> becomes <code>[ C @@ H ]</code>).</li>
</ol>
<p>The two-stage process is necessary to resolve ambiguities. For example, <code>Sc</code> in an unbracketed context represents a sulfur-carbon bond, while <code>[Sc]</code> denotes scandium. This ambiguity occurs over half a million times in PubChem&rsquo;s compound dataset.</p>
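<p>A simplified version of the two-stage decomposition can be written with two regular expressions. The regexes below cover only a fragment of OpenSMILES and are not the paper&rsquo;s Rust implementation; they are enough to show the atom-then-glyph split and the <code>Sc</code> disambiguation:</p>

```python
import re

# Stage 1: atom-level units (bracketed atoms kept intact).
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\d|[()=#+\-/\\.@%*]")
# Stage 2: glyphs inside a bracketed atom (digits, chirality, elements, charge).
GLYPH_RE = re.compile(r"\d+|@@|@|[A-Z][a-z]?|[a-z]|[+\-\[\]]")

def smirk_tokenize(smiles: str) -> list[str]:
    """Split into atom-level units, then decompose bracketed atoms
    into their constituent glyphs (toy Smirk-style tokenizer)."""
    tokens = []
    for unit in ATOM_RE.findall(smiles):
        if unit.startswith("["):
            tokens.extend(GLYPH_RE.findall(unit))
        else:
            tokens.append(unit)
    return tokens

# 'Sc' unbracketed is sulfur + aromatic carbon; '[Sc]' is scandium.
print(smirk_tokenize("Sc"))    # ['S', 'c']
print(smirk_tokenize("[Sc]"))  # ['[', 'Sc', ']']
```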
<p>The resulting vocabulary contains only 165 tokens, requires no training, and by construction can faithfully tokenize any molecule that conforms to the OpenSMILES specification. The implementation is written in Rust using HuggingFace&rsquo;s Tokenizers library and is available on PyPI.</p>
<p><strong>Smirk-GPE</strong> (Glyph Pair Encoding) extends Smirk with a <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">BPE</a>-like compression step. After Smirk tokenization, adjacent tokens are merged using learned rules, reducing sequence length. Unlike standard BPE, merges operate on token IDs rather than character strings, preserving the distinction between chemically different entities that happen to share the same characters. Smirk-GPE was trained on 262 million molecules from Enamine REAL Space with a target vocabulary of 50,000 tokens, though training terminated at 2,300 tokens after exhausting all possible merges.</p>
<h2 id="evaluation-framework-intrinsic-metrics-n-gram-proxies-and-transformer-benchmarks">Evaluation Framework: Intrinsic Metrics, N-Gram Proxies, and Transformer Benchmarks</h2>
<p>The evaluation covers 34 tokenizers across three datasets (Enamine REALSpace, MoleculeNet, and tmQM) using both intrinsic and extrinsic metrics.</p>
<h3 id="intrinsic-metrics">Intrinsic Metrics</h3>
<p>Four intrinsic metrics are computed for each tokenizer:</p>
<p><strong>Fertility</strong> measures the mean tokenized sequence length. Higher fertility increases computational cost due to the quadratic scaling of attention:</p>
<p>$$
\text{cost} \propto \text{fertility}^2
$$</p>
<p><strong>Normalized entropy</strong> quantifies how close a tokenizer comes to the information-theoretic ideal where all tokens are equally probable:</p>
<p>$$
\eta = \frac{-1}{\log |V|} \sum_{x \in V} p(x) \log p(x)
$$</p>
<p>where $V$ is the vocabulary and $p(x)$ is the observed token probability. Higher normalized entropy correlates with better downstream performance.</p>
<p><strong>Token imbalance</strong> measures the distance between observed token frequencies and a uniform distribution:</p>
<p>$$
D = \frac{1}{2} \sum_{x \in V} \left| p(x) - |V|^{-1} \right|
$$</p>
<p><strong>Unknown token frequency</strong> captures the fraction of emitted tokens that are <code>[UNK]</code>. This metric is particularly revealing: all existing chemistry-specific tokenizers (SPE/APE, atom-wise, BPE, and Unigram variants) emit <code>[UNK]</code> at non-negligible rates, while NLP tokenizers, Smirk, and Smirk-GPE do not.</p>
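<p>Given tokenized sequences, the first three metrics reduce to token counting. In this toy sketch, $|V|$ is taken as the observed vocabulary rather than the tokenizer&rsquo;s full vocabulary, which the paper&rsquo;s computation would use:</p>

```python
import math
from collections import Counter

def intrinsic_metrics(token_seqs: list[list[str]]) -> dict:
    """Fertility, normalized entropy, and token imbalance from
    tokenized sequences (|V| = observed vocabulary, a simplification)."""
    counts = Counter(t for seq in token_seqs for t in seq)
    total = sum(counts.values())
    v = len(counts)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return {
        "fertility": total / len(token_seqs),  # mean tokens per molecule
        "normalized_entropy": entropy / math.log(v) if v > 1 else 0.0,
        "imbalance": 0.5 * sum(abs(p - 1 / v) for p in probs),
    }
```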
<h3 id="n-gram-proxy-language-models">N-Gram Proxy Language Models</h3>
<p>The paper proposes using n-gram models as low-cost proxies for transformer-based evaluation. An n-gram estimates token likelihood with <a href="https://en.wikipedia.org/wiki/Additive_smoothing">add-one smoothing</a>:</p>
<p>$$
P_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}) = \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|}
$$</p>
<p>where $C$ is the count function and $|V|$ is the vocabulary size. N-grams were &ldquo;pretrained&rdquo; on 1.6 billion SMILES from Enamine REAL Space and evaluated on validation splits. Cross-entropy loss and information loss from unknown tokens were computed.</p>
<p>To quantify information lost to <code>[UNK]</code> tokens, the authors compute the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a> between token distributions with and without unknown tokens, using a bidirectional character n-gram model:</p>
<p>$$
B_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}, x_{i+1}, \dots, x_{i+n-1}) \propto \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|} \times \frac{C(x_{i}, \dots, x_{i+n-1}) + 1}{C(x_{i+1}, \dots, x_{i+n-1}) + |V|}
$$</p>
<h3 id="transformer-experiments">Transformer Experiments</h3>
<p>Eighteen encoder-only RoBERTa models (25M parameters each, excluding embeddings) were pretrained from scratch using masked language modeling on Enamine REAL Space (245M molecules, 30,000 steps). Each model used a different tokenizer, isolating the tokenizer&rsquo;s effect on performance. Finetuning was conducted on six regression and seven classification tasks from MoleculeNet and tmQM.</p>
<p>Linear fixed-effects models were used to estimate the standardized effect of each tokenization scheme relative to an atom-wise SMILES baseline.</p>
<h2 id="key-findings-and-practical-implications">Key Findings and Practical Implications</h2>
<h3 id="tokenizer-performance">Tokenizer Performance</h3>
<ul>
<li><strong>Smirk</strong> shows a positive effect on pretraining quality and downstream performance on tmQM (the dataset with the most bracketed atoms), but performs comparably to atom-wise tokenization on MoleculeNet tasks.</li>
<li><strong>SPE and APE</strong> tokenizers have a negative impact on both pretraining and downstream performance relative to the atom-wise baseline, likely due to their high <code>[UNK]</code> rates.</li>
<li><strong>Molecular encoding choice</strong> (<a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">SMILES vs. SELFIES</a>) has a negligible effect on performance.</li>
<li><strong>NLP tokenizers</strong> (GPT-4o, LLaMA, Gemma) score comparably to chemistry-specific tokenizers on intrinsic metrics and do not emit unknown tokens.</li>
</ul>
<h3 id="n-gram-proxy-validation">N-Gram Proxy Validation</h3>
<p>N-gram cross-entropy and information loss metrics show strong rank correlation (Spearman&rsquo;s $\rho$) with downstream transformer performance, validating their use as low-cost evaluation proxies. The effect sizes from n-gram and transformer experiments are directionally consistent.</p>
<h3 id="information-loss-from-unknown-tokens">Information Loss from Unknown Tokens</h3>
<p>Information loss is minimal for tokenizers with robust coverage but substantial for tokenizers with limited vocabularies on chemically diverse datasets. <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> incurs only 0.1 nats/molecule on MoleculeNet but 40.3 nats/molecule on tmQM. Open-vocabulary tokenizers (Smirk, Smirk-GPE, NLP tokenizers) mitigate this degradation.</p>
<h3 id="practical-recommendations">Practical Recommendations</h3>
<p>The authors argue that molecular foundation models must encode the entire breadth of chemical space or risk obscuring critical features. Bracketed atoms encode information essential to clinically relevant pharmaceuticals (e.g., <a href="https://en.wikipedia.org/wiki/Amoxicillin">Amoxicillin</a>), industrial compounds (e.g., Tricalcium Silicate), and foundational chemistry (e.g., <a href="https://en.wikipedia.org/wiki/Cisplatin">Cisplatin</a>, where omitting the chiral marker erases medically relevant stereochemical information). The paper encourages the community to adopt open-vocabulary tokenizers and develop more chemically diverse benchmarks.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The analysis uses a single-point evaluation for transformer experiments, which may underestimate performance achievable with additional hyperparameter tuning.</li>
<li>Smirk-GPE&rsquo;s learned merges from REALSpace did not fully generalize to tmQM, as indicated by the token imbalance metric.</li>
<li>Current benchmarks (MoleculeNet) lack sufficient diversity to evaluate tokenizer robustness across the full periodic table, isotopes, charged species, and uncommon bond types.</li>
<li>The downstream impact of token ambiguities in BPE-based tokenizers (e.g., ChemBERTa&rsquo;s conflation of <code>Sc</code> as both sulfur-carbon and scandium) remains unclear.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>Enamine REAL Space</td>
          <td>1.6B SMILES (n-gram), 245M molecules (transformer)</td>
          <td>80/10/10 train/val/test split</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet</td>
          <td>Multiple tasks</td>
          <td>6 regression + 7 classification tasks</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>tmQM</td>
          <td>108K transition metal complexes</td>
          <td>OpenSMILES molecular encodings</td>
      </tr>
      <tr>
          <td>Smirk-GPE training</td>
          <td>Enamine REAL Space (subset)</td>
          <td>262M molecules</td>
          <td>Training split only</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Smirk</strong>: Two-stage regex-based tokenization (atom decomposition, then glyph decomposition). No training required. Vocabulary: 165 tokens.</li>
<li><strong>Smirk-GPE</strong>: BPE-like compression on top of Smirk. Operates on token IDs (not strings) to preserve chemical disambiguation. Final vocabulary: 2,300 tokens.</li>
<li><strong>N-gram models</strong>: Add-one smoothing, bidirectional context ($2n - 2$ total context window). Implemented in Julia with exact integer arithmetic.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa-PreLayerNorm, 8 layers, 8 attention heads, hidden size 512, intermediate size 2048, max sequence length 2048. ~25M parameters (excluding embeddings).</li>
<li><strong>Pretraining</strong>: Masked language modeling, 30,000 steps, effective batch size 8192, FusedLamb optimizer, learning rate $1.6 \times 10^{-4}$.</li>
<li><strong>Finetuning</strong>: 100,000 steps, AdamW optimizer, effective batch size 128, learning rate $1.6 \times 10^{-4}$.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>MoleculeNet preferred metrics per task (AUROC for classification, MAE/RMSE for regression)</li>
<li>Fixed-effects models for standardized effect size estimation</li>
<li>Spearman&rsquo;s rank correlation between n-gram and transformer metrics</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pretraining: 2x NVIDIA A100 GPUs (Delta system at NCSA)</li>
<li>Finetuning: 1x NVIDIA A40 GPU</li>
<li>N-gram models: CPU-based (Julia implementation)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BattModels/Smirk">Smirk tokenizer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Rust implementation with Python bindings, available on PyPI</td>
      </tr>
      <tr>
          <td>Model checkpoints</td>
          <td>Model</td>
          <td>Not specified</td>
          <td>Pretrained and finetuned checkpoints included in data release</td>
      </tr>
      <tr>
          <td>N-gram code</td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Julia implementation included in data release</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wadell, A., Bhutani, A., &amp; Viswanathan, V. (2026). Tokenization for Molecular Foundation Models. <em>Journal of Chemical Information and Modeling</em>, 66(3), 1384-1393. <a href="https://doi.org/10.1021/acs.jcim.5c01856">https://doi.org/10.1021/acs.jcim.5c01856</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wadell2026tokenization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tokenization for Molecular Foundation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wadell, Alexius and Bhutani, Anoushka and Viswanathan, Venkatasubramanian}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{66}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1384--1393}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.5c01856}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES vs SELFIES Tokenization for Chemical LMs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</guid><description>Atom Pair Encoding (APE) tokenizer outperforms BPE on SMILES and SELFIES in RoBERTa-based chemical language models across MoleculeNet classification tasks.</description><content:encoded><![CDATA[<h2 id="atom-pair-encoding-for-chemical-language-modeling">Atom Pair Encoding for Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom Pair Encoding (APE), a tokenization algorithm designed specifically for chemical string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The primary contribution is demonstrating that a chemistry-aware tokenizer, which preserves atomic identity during subword merging, leads to improved molecular property classification accuracy in transformer-based models compared to the standard Byte Pair Encoding (BPE) approach.</p>
<h2 id="why-tokenization-matters-for-chemical-strings">Why Tokenization Matters for Chemical Strings</h2>
<p>Existing chemical language models based on BERT/RoBERTa architectures have typically relied on BPE for tokenizing SMILES and SELFIES strings. <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte Pair Encoding (BPE)</a> was originally designed for natural language and data compression, where it excels at breaking words into meaningful subword units. When applied to chemical strings, BPE operates at the character level without understanding chemical semantics, leading to several problems:</p>
<ul>
<li><strong>Stray characters</strong>: BPE may create tokens like &ldquo;C)(&rdquo; that have no chemical meaning.</li>
<li><strong>Element splitting</strong>: Multi-character elements like chlorine (&ldquo;Cl&rdquo;) can be split into &ldquo;C&rdquo; and &ldquo;l&rdquo;, which the model then misreads as a carbon atom followed by a meaningless character.</li>
<li><strong>Lost structural context</strong>: BPE compresses sequences without considering how character position encodes molecular structure.</li>
</ul>
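<p>The element-splitting failure is easy to demonstrate by contrasting character-level tokenization with an atom-aware regex (a sketch; the pattern below is a common SMILES tokenization regex, not the paper&rsquo;s implementation):</p>

```python
import re

# Atom-aware SMILES tokenizer: bracket atoms, two-letter elements, and
# structural symbols are kept intact instead of being split into characters.
SMILES_ATOM_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|[BCNOSPFIbcnosp]|[()=#+\-\\/:~.]|\d)"
)

smiles = "CC(=O)Cl"  # acetyl chloride
print(list(smiles))                    # character level ends ..., 'C', 'l'
print(SMILES_ATOM_RE.findall(smiles))  # atom level ends ..., 'Cl'
```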
<p>Previous work on <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a> attempted to address this by iteratively merging SMILES substrings into chemically meaningful tokens. However, SPE had practical limitations: its Python implementation did not support SELFIES, and it produced a smaller vocabulary (~3000 tokens) than what the data could support. These gaps motivated the development of APE.</p>
<h2 id="the-ape-tokenizer-chemistry-aware-subword-merging">The APE Tokenizer: Chemistry-Aware Subword Merging</h2>
<p>APE draws inspiration from both BPE and SPE but addresses their shortcomings. The key design decisions are:</p>
<ol>
<li>
<p><strong>Atom-level initialization</strong>: Instead of starting from individual characters (as BPE does), APE begins with chemically valid atomic units. For SMILES, this means recognizing multi-character elements (e.g., &ldquo;Cl&rdquo;, &ldquo;Br&rdquo;) as single tokens. For SELFIES, each bracketed string (e.g., [C], [Ring1], [=O]) serves as the fundamental unit.</p>
</li>
<li>
<p><strong>Iterative pair merging</strong>: Like BPE, APE iteratively merges the most frequent adjacent token pairs. The difference is that the initial tokenization preserves atomic boundaries, so merged tokens always represent valid chemical substructures.</p>
</li>
<li>
<p><strong>Larger vocabulary</strong>: Using the same minimum frequency threshold of 2000, APE generates approximately 5300 unique tokens from the PubChem dataset, compared to SPE&rsquo;s approximately 3000. This richer vocabulary provides more expressive power for representing chemical substructures.</p>
</li>
<li>
<p><strong>SELFIES compatibility</strong>: APE natively supports both SMILES and SELFIES, using the bracketed token structure of SELFIES as its starting point for that representation.</p>
</li>
</ol>
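<p>The merge loop itself works like BPE, only initialized from atom-level tokens rather than characters; a minimal sketch of the training pass (simplified: a fixed number of merges instead of the paper&rsquo;s frequency threshold):</p>

```python
from collections import Counter

def train_merges(corpus, num_merges):
    """BPE-style pair merging over pre-tokenized (atom-level) sequences."""
    corpus = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the winning merge everywhere it occurs.
        for seq in corpus:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus

# Atom-level initialization keeps 'Cl' whole before any merging happens.
demo = [["C", "C", "O"], ["C", "C", "Cl"], ["C", "C", "O"]]
merges, merged = train_merges(demo, 1)
print(merges)  # [('C', 'C')] -- the most frequent adjacent pair
print(merged)  # [['CC', 'O'], ['CC', 'Cl'], ['CC', 'O']]
```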
<p>The tokenizer was trained on a 2-million-molecule subset of a 10-million-SMILES <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> dataset. This produced four tokenizer variants: SMILES-BPE, SMILES-APE, SELFIES-BPE, and SELFIES-APE.</p>
<h2 id="pre-training-and-evaluation-on-moleculenet-benchmarks">Pre-training and Evaluation on MoleculeNet Benchmarks</h2>
<h3 id="model-architecture">Model architecture</h3>
<p>All four models use the RoBERTa architecture with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads. Pre-training used masked language modeling (MLM) with 15% token masking on 1 million molecules from PubChem, with a validation set of 100,000 molecules. Each model was pre-trained for 20 epochs using AdamW, with hyperparameter optimization via Optuna.</p>
<h3 id="downstream-tasks">Downstream tasks</h3>
<p>The models were fine-tuned on three <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Category</th>
          <th>Compounds</th>
          <th>Tasks</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Biophysics</td>
          <td>41,127</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Physiology</td>
          <td>7,831</td>
          <td>12</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p>Data was split 80/10/10 (train/validation/test) following MoleculeNet recommendations. Models were fine-tuned for 5 epochs with early stopping based on validation ROC-AUC.</p>
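<p>The split itself is a plain random partition; a minimal sketch (the helper and seed are illustrative, not from the paper):</p>

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle and partition into 80/10/10 train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    a, b = int(0.8 * n), int(0.9 * n)
    return items[:a], items[a:b], items[b:]

# For example, the 2,039 BBBP compounds:
train, val, test = split_80_10_10(range(2039))
print(len(train), len(val), len(test))  # 1631 204 204
```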
<h3 id="baselines">Baselines</h3>
<p>Results were compared against two text-based models (<a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a> MTR-77M and <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a>) and two graph-based models (D-MPNN from Chemprop and MoleculeNet Graph-Conv).</p>
<h3 id="main-results">Main results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>HIV ROC</th>
          <th>Tox21 ROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILYAPE-1M</td>
          <td>0.754 +/- 0.006</td>
          <td>0.772 +/- 0.010</td>
          <td>0.838 +/- 0.002</td>
      </tr>
      <tr>
          <td>SMILYBPE-1M</td>
          <td>0.746 +/- 0.006</td>
          <td>0.754 +/- 0.015</td>
          <td>0.849 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYAPE-1M</td>
          <td>0.735 +/- 0.015</td>
          <td>0.768 +/- 0.012</td>
          <td>0.842 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYBPE-1M</td>
          <td>0.676 +/- 0.014</td>
          <td>0.709 +/- 0.012</td>
          <td>0.825 +/- 0.001</td>
      </tr>
      <tr>
          <td>ChemBERTa-2-MTR-77M</td>
          <td>0.698 +/- 0.014</td>
          <td>0.735 +/- 0.008</td>
          <td>0.790 +/- 0.003</td>
      </tr>
      <tr>
          <td>SELFormer</td>
          <td>0.716 +/- 0.021</td>
          <td>0.769 +/- 0.010</td>
          <td>0.838 +/- 0.005</td>
      </tr>
      <tr>
          <td>MoleculeNet-Graph-Conv</td>
          <td>0.690</td>
          <td>0.763</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.737</td>
          <td>0.776</td>
          <td>0.851</td>
      </tr>
  </tbody>
</table>
<p>APE consistently outperforms BPE for both SMILES and SELFIES. SMILYAPE achieves the best BBBP score (0.754), beating D-MPNN (0.737). On HIV, SMILYAPE (0.772) is competitive with D-MPNN (0.776). On Tox21, D-MPNN (0.851) leads, with SMILYBPE (0.849) and SELFYAPE (0.842) close behind.</p>
<h3 id="statistical-significance">Statistical significance</h3>
<p><a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U tests</a> confirmed statistically significant differences between SMILYAPE and SMILYBPE (p &lt; 0.05 on all datasets). Cliff&rsquo;s delta values indicate large effect sizes: 0.74 (BBBP), 0.70 (HIV), and -1.00 (Tox21, favoring BPE). For SELFIES models, SELFYAPE achieved Cliff&rsquo;s delta of 1.00 across all three datasets, indicating complete separation from SELFYBPE.</p>
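<p>Cliff&rsquo;s delta is just the normalized difference between pairwise wins and losses across the two score samples; a minimal sketch (not the paper&rsquo;s implementation):</p>

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size: +1 if every x exceeds every y, -1 reversed."""
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))

# Complete separation (as reported for SELFYAPE vs SELFYBPE) yields 1.0.
print(cliffs_delta([0.84, 0.85, 0.86], [0.80, 0.81, 0.82]))  # 1.0
```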
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="ape-outperforms-bpe-by-preserving-atomic-identity">APE outperforms BPE by preserving atomic identity</h3>
<p>The consistent advantage of APE over BPE stems from APE&rsquo;s atom-level initialization. By starting with chemically valid units rather than individual characters, APE avoids creating nonsensical tokens that break chemical elements or mix structural delimiters with atoms.</p>
<h3 id="smiles-outperforms-selfies-with-ape-tokenization">SMILES outperforms SELFIES with APE tokenization</h3>
<p>SMILYAPE generally outperforms SELFYAPE across tasks. Attention weight analysis revealed that SMILYAPE assigns more weight to immediate neighboring tokens (0.108 vs. 0.096) and less to distant tokens (0.030 vs. 0.043). This pattern aligns with chemical intuition: bonding is primarily determined by directly connected atoms. SMILYAPE also produces more compact tokenizations (8.6 tokens per molecule vs. 11.9 for SELFYAPE), potentially allowing more efficient attention allocation.</p>
<h3 id="selfies-models-show-higher-inter-tokenizer-agreement">SELFIES models show higher inter-tokenizer agreement</h3>
<p>On the BBBP dataset, all true positives identified by SELFYBPE were also captured by SELFYAPE, with SELFYAPE achieving higher recall (61.68% vs. 55.14%). In contrast, SMILES-based models shared only 29.3% of true positives between APE and BPE variants, indicating that tokenization choice has a larger impact on SMILES models.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Pre-training used only 1 million molecules, compared to 77 million for ChemBERTa-2. Despite this, APE models were competitive or superior, but scaling effects remain unexplored.</li>
<li>Evaluation was limited to three binary classification tasks from MoleculeNet. Regression tasks, molecular generation, and reaction prediction were not tested.</li>
<li>The Tox21 result is notable: SMILYBPE outperforms SMILYAPE (0.849 vs. 0.838), suggesting APE&rsquo;s advantage may be task-dependent.</li>
<li>No comparison with recent atom-level tokenizers like <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES</a> or newer approaches beyond SPE.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tokenizer training</td>
          <td>PubChem subset</td>
          <td>2M molecules</td>
          <td>SMILES strings converted to SELFIES via selfies library</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>1M molecules</td>
          <td>100K validation set</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>7,831 compounds</td>
          <td>80/10/10 split, 12 tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Tokenizers: BPE (via Hugging Face), APE (custom implementation, minimum frequency 2000)</li>
<li>Pre-training: Masked Language Modeling (15% masking) for 20 epochs</li>
<li>Optimizer: AdamW with Optuna hyperparameter search</li>
<li>Fine-tuning: 5 epochs with early stopping on validation ROC-AUC</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Architecture: RoBERTa with 6 layers, hidden size 768, intermediate size 1536, 12 attention heads</li>
<li>Four variants: SMILYAPE, SMILYBPE, SELFYAPE, SELFYBPE</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILYAPE</th>
          <th>SMILYBPE</th>
          <th>SELFYAPE</th>
          <th>SELFYBPE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP ROC-AUC</td>
          <td>0.754</td>
          <td>0.746</td>
          <td>0.735</td>
          <td>0.676</td>
      </tr>
      <tr>
          <td>HIV ROC-AUC</td>
          <td>0.772</td>
          <td>0.754</td>
          <td>0.768</td>
          <td>0.709</td>
      </tr>
      <tr>
          <td>Tox21 ROC-AUC</td>
          <td>0.838</td>
          <td>0.849</td>
          <td>0.842</td>
          <td>0.825</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA RTX 3060 GPU with 12 GiB VRAM</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mikemayuare/apetokenizer">APE Tokenizer</a></td>
          <td>Code</td>
          <td>Other (unspecified SPDX)</td>
          <td>Official APE tokenizer implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/mikemayuare/PubChem10M_SMILES_SELFIES">PubChem10M SMILES/SELFIES</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>10M SMILES with SELFIES conversions</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/mikemayuare">Pre-trained and fine-tuned models</a></td>
          <td>Model</td>
          <td>Not specified</td>
          <td>All four model variants on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leon, M., Perezhohin, Y., Peres, F., Popovič, A., &amp; Castelli, M. (2024). Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. <em>Scientific Reports</em>, 14(1), 25016. <a href="https://doi.org/10.1038/s41598-024-76440-8">https://doi.org/10.1038/s41598-024-76440-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leon2024comparing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leon, Miguelangel and Perezhohin, Yuriy and Peres, Fernando and Popovi{\v{c}}, Ale{\v{s}} and Castelli, Mauro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{25016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-024-76440-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI+AIS: Hybridizing SMILES with Environment Tokens</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</guid><description>SMI+AIS hybridizes SMILES with Atom-In-SMILES tokens encoding local chemical environments, improving molecular generation binding affinity and synthesizability.</description><content:encoded><![CDATA[<h2 id="a-hybrid-molecular-representation-combining-smiles-and-chemical-environment-tokens">A Hybrid Molecular Representation Combining SMILES and Chemical-Environment Tokens</h2>
<p>This is a <strong>Method</strong> paper that introduces SMI+AIS(N), a hybrid molecular string representation combining standard <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> tokens with <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-In-SMILES (AIS)</a> tokens. AIS tokens encode local chemical environment information (central atom, ring membership, and neighboring atoms) into a single token. The key contribution is a systematic hybridization strategy that selectively replaces the most frequent SMILES tokens with AIS equivalents, preserving SMILES grammar compatibility while enriching token diversity. The method is validated on molecular structure generation via latent space optimization for drug design.</p>
<h2 id="limitations-of-standard-smiles-for-machine-learning">Limitations of Standard SMILES for Machine Learning</h2>
<p>SMILES is the most widely adopted string-based molecular representation, used in major databases like ZINC and PubChem. Despite this ubiquity, SMILES has several well-known limitations for machine learning applications:</p>
<ol>
<li><strong>Non-unique representations</strong>: The same molecule can be encoded as multiple distinct SMILES strings.</li>
<li><strong>Invalid string generation</strong>: Generative models can produce syntactically invalid SMILES that do not correspond to any molecule.</li>
<li><strong>Limited token diversity</strong>: SMILES tokens map one-to-one to atoms or bonds, so the token vocabulary is restricted to the available atom and bond types.</li>
<li><strong>Insufficient chemical context</strong>: Individual SMILES tokens carry no information about the local chemical environment of an atom.</li>
</ol>
<p>Alternative representations like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (guaranteeing validity) and <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> (guaranteeing uniqueness) address some of these issues but share the same fundamental limitation of low token diversity. The Atom-In-SMILES (AIS) representation (Ucak et al., 2023) enriches tokens with neighboring atom and ring information, but using AIS exclusively produces a large vocabulary with many infrequent tokens that can cause data sparsity problems. The authors aim to find a middle ground: adding chemical context to the most common tokens while keeping the vocabulary manageable.</p>
<h2 id="core-innovation-selective-token-hybridization-with-ais">Core Innovation: Selective Token Hybridization with AIS</h2>
<p>The SMI+AIS(N) representation hybridizes standard SMILES with AIS tokens through a frequency-based selection process:</p>
<h3 id="ais-token-structure">AIS Token Structure</h3>
<p>Each AIS token encodes three pieces of information about an atom, delimited by semicolons:</p>
<p>$$
\lbrack \text{central atom} ; \text{ring info} ; \text{neighbor atoms} \rbrack
$$</p>
<p>For example, the oxygen in a carboxyl group of benzoic acid is represented as <code>[O;!R;C]</code>, meaning: oxygen atom, not in a ring, bonded to carbon. In standard SMILES, this would simply be <code>O</code>.</p>
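<p>Given this three-field layout, an AIS token can be unpacked mechanically (a sketch that assumes exactly the format shown; the real AIS tokenizer handles additional cases, such as explicit ring-membership markers):</p>

```python
def parse_ais_token(token):
    """Split an AIS token like '[O;!R;C]' into its three semicolon fields."""
    central, ring, neighbors = token.strip("[]").split(";")
    return {
        "central_atom": central,
        "in_ring": ring != "!R",  # '!R' marks atoms outside any ring
        "neighbors": neighbors,
    }

# The carboxyl oxygen of benzoic acid from the example above:
print(parse_ais_token("[O;!R;C]"))
# {'central_atom': 'O', 'in_ring': False, 'neighbors': 'C'}
```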
<h3 id="hybridization-procedure">Hybridization Procedure</h3>
<ol>
<li>Convert all SMILES strings in the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a> to their full AIS representations.</li>
<li>Count the frequency of each AIS token across the database.</li>
<li>Select the top-N most frequent AIS tokens to form the hybrid vocabulary.</li>
<li>In the hybrid representation, atoms matching these top-N AIS tokens are written in AIS notation; all other atoms use standard SMILES notation.</li>
</ol>
<p>For benzoic acid, the hybridization produces:</p>
<p>$$
\text{SMI}: \texttt{O=C(O)c1ccccc1}
$$</p>
<p>$$
\text{SMI+AIS}: \texttt{\lbrack O;!R;C\rbrack=\lbrack C;!R;COO\rbrack(\lbrack OH;!R;C\rbrack)c1ccccc1}
$$</p>
<p>The parameter N controls vocabulary size. The authors test N = 50, 100, 150, and 200, finding that N = 100-150 provides the best balance for the ZINC database.</p>
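<p>Steps 2&ndash;4 of the procedure reduce to a frequency count, a top-N cut, and a per-atom fallback; a sketch with toy data (token names illustrative only):</p>

```python
from collections import Counter

def top_n_ais_tokens(ais_corpus, n):
    """Select the N most frequent AIS tokens to form the hybrid vocabulary."""
    counts = Counter(tok for mol in ais_corpus for tok in mol)
    return {tok for tok, _ in counts.most_common(n)}

def hybridize(ais_tokens, smiles_tokens, vocab):
    """Keep AIS notation for in-vocabulary atoms; fall back to SMILES."""
    return [a if a in vocab else s for a, s in zip(ais_tokens, smiles_tokens)]

corpus = [["[C;!R;CC]", "[C;!R;CC]", "[O;!R;C]"], ["[C;!R;CC]", "[N;!R;C]"]]
vocab = top_n_ais_tokens(corpus, 1)
print(vocab)  # {'[C;!R;CC]'}
print(hybridize(["[C;!R;CC]", "[O;!R;C]"], ["C", "O"], vocab))
# ['[C;!R;CC]', 'O'] -- frequent atom in AIS form, rare atom stays SMILES
```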
<h3 id="token-frequency-rebalancing">Token Frequency Rebalancing</h3>
<p>A key benefit of hybridization is mitigating the severe token frequency imbalance in standard SMILES. Carbon (C), the most frequent element with ~184 million occurrences in ZINC, is represented by only 16 token types in SMILES. With SMI+AIS(200), carbon is distinguished into 145 token types based on chemical environment, with 74% of carbon occurrences represented by AIS tokens. Less common elements like halogens see minimal change (only 2% AIS representation), which avoids introducing unnecessarily rare tokens.</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Frequency</th>
          <th>SMILES Types</th>
          <th>SMI+AIS(100) Types (AIS %)</th>
          <th>SMI+AIS(200) Types (AIS %)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>C</td>
          <td>183,860,954</td>
          <td>16</td>
          <td>78 (73%)</td>
          <td>145 (74%)</td>
      </tr>
      <tr>
          <td>O</td>
          <td>27,270,229</td>
          <td>8</td>
          <td>16 (11%)</td>
          <td>24 (11%)</td>
      </tr>
      <tr>
          <td>N</td>
          <td>26,022,928</td>
          <td>11</td>
          <td>32 (1%)</td>
          <td>46 (10%)</td>
      </tr>
      <tr>
          <td>X (halogens)</td>
          <td>6,137,030</td>
          <td>7</td>
          <td>10 (2%)</td>
          <td>11 (2%)</td>
      </tr>
      <tr>
          <td>S</td>
          <td>4,581,307</td>
          <td>12</td>
          <td>17 (2%)</td>
          <td>24 (2%)</td>
      </tr>
  </tbody>
</table>
<h2 id="latent-space-optimization-for-molecular-generation">Latent Space Optimization for Molecular Generation</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p>The evaluation uses a <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">conditional variational autoencoder (CVAE)</a> with:</p>
<ul>
<li><strong>Encoder</strong>: BERT-style architecture with entity and positional embeddings, 4 multi-head attention layers (8 heads each), producing mean and standard deviation vectors in latent space.</li>
<li><strong>Decoder</strong>: 4 stacked gated recurrent unit (GRU) layers that transform sampled latent vectors (conditioned) back into token sequences.</li>
<li><strong>Training</strong>: 20 epochs on 9 million compounds from the ZINC database (8:1:1 train/valid/test split) under identical conditions for all representations.</li>
</ul>
<h3 id="optimization-setup">Optimization Setup</h3>
<p><a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a> (BO) via BoTorch is applied to the CVAE <a href="/notes/chemistry/molecular-design/generation/latent-space/">latent space</a>, maximizing a multi-objective function:</p>
<p>$$
\text{Obj} = -\text{BA} - 0.5 \times \text{SA}^2
$$</p>
<p>where BA is binding affinity (docking score from QuickVina 2, lower is stronger) and SA is synthetic accessibility score (from RDKit, lower is more synthesizable). Each BO iteration generates 800 candidate latent vectors. Invalid strings receive a penalty objective value of -100.</p>
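<p>The scalarized objective with its invalid-string penalty is straightforward to express (a sketch; in the paper, BA comes from QuickVina 2 docking and SA from the RDKit SA score):</p>

```python
def objective(ba, sa, valid=True):
    """Multi-objective score: reward strong (more negative) binding affinity,
    quadratically penalize poor synthetic accessibility, and heavily punish
    invalid generated strings."""
    if not valid:
        return -100.0
    return -ba - 0.5 * sa ** 2

# A generated structure with BA = -9.5 and SA = 2.1 (typical of the PDK4 runs):
print(objective(-9.5, 2.1))               # 9.5 - 0.5 * 4.41 = 7.295
print(objective(0.0, 0.0, valid=False))   # -100.0
```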
<h3 id="protein-targets">Protein Targets</h3>
<p>Four diverse targets were used to assess generalizability:</p>
<ul>
<li><strong>PDK4</strong> (<a href="https://en.wikipedia.org/wiki/Pyruvate_dehydrogenase_kinase">Pyruvate Dehydrogenase Kinase</a> 4): narrow, deep binding pocket</li>
<li><strong>5-HT1B</strong> (<a href="https://en.wikipedia.org/wiki/5-HT1B_receptor">Serotonin Receptor 1B</a>): shallow, open <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> conformation</li>
<li><strong>PARP1</strong> (<a href="https://en.wikipedia.org/wiki/PARP1">Poly ADP-ribose Polymerase 1</a>): small, flexible molecule binding site</li>
<li><strong>CK1d</strong> (<a href="https://en.wikipedia.org/wiki/Casein_kinase_1">Casein Kinase I</a> Delta): broad, accessible conformation</li>
</ul>
<p>Protein structures were obtained from the <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a> (PDB IDs: 4V26, 4IAQ, 6I8M, 4TN6). Each optimization was run 10 times independently from the same 5 initial compounds selected from BindingDB.</p>
<h3 id="key-results">Key Results</h3>
<p>SMI+AIS(100) consistently achieved the highest objective values across protein targets.</p>
<p><strong>PDK4 Optimization</strong> (Top-1 results over 10 independent runs):</p>
<ul>
<li>SMI+AIS(100) achieved approximately 12% improvement over standard SMILES and 28% improvement over SELFIES based on median Top-1 objective values.</li>
<li>Generated structures exhibited BA scores between -10 and -9 and SA scores between 2.0 and 2.3.</li>
<li>Molecular weights clustered around 400 amu, consistent with the CVAE conditioning.</li>
</ul>
<p><strong>Validity Ratios</strong>: Standard SMILES produced approximately 40% valid structures. SMI+AIS representations showed significant improvement as N increased, though SMI+AIS(200) showed slight saturation, likely from insufficiently trained infrequent tokens.</p>
<p><strong>SELFIES</strong>: Despite achieving the highest validity ratio, SELFIES failed to generate chemically meaningful structures with desirable BA and SA scores. The authors attribute this to the SELFIES grammar, in which token meaning is highly context-dependent, so that minor variations in latent space produce large structural changes.</p>
<p><strong>Cross-target consistency</strong>: Improvements were observed across all four protein targets, with slight variation (5-HT1B showed smaller differences between SMI and SMI+AIS(100) for Top-1, while other targets showed significant improvements).</p>
<h2 id="improved-molecular-generation-through-chemical-context-enrichment">Improved Molecular Generation Through Chemical Context Enrichment</h2>
<p>The SMI+AIS(N) representation achieves consistent improvements in molecular generation quality compared to both standard SMILES and SELFIES. The core findings are:</p>
<ol>
<li><strong>Binding affinity improvement</strong>: Approximately 7% improvement over standard SMILES for the PDK4 target.</li>
<li><strong>Synthesizability improvement</strong>: Approximately 6% increase in synthetic accessibility scores.</li>
<li><strong>Target independence</strong>: Performance gains transfer across four structurally diverse protein targets.</li>
<li><strong>Preserved structural motifs</strong>: The generative model retains chemically meaningful fragments (e.g., acetamide and <a href="https://en.wikipedia.org/wiki/Piperidine">piperidine</a>) from initial compounds without explicit fragment constraints.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Stereochemistry</strong>: SMI+AIS inherits the limited stereochemistry handling of standard SMILES.</li>
<li><strong>Evaluation scope</strong>: Only molecular generation was tested; property prediction and other ML tasks remain unexplored.</li>
<li><strong>Compute constraints</strong>: The study was limited to molecular generation due to computing power and time.</li>
<li><strong>Single optimization strategy</strong>: Only latent space optimization with Bayesian optimization was evaluated; other generative approaches were not compared.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest extending SMI+AIS to diverse benchmarking tests including molecular property prediction, experimental validation, and broader applications of chemical language models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Vocab</td>
          <td>ZINC Database</td>
          <td>9M compounds</td>
          <td>Canonicalized, deduplicated, split 8:1:1</td>
      </tr>
      <tr>
          <td>Binding targets</td>
          <td>BindingDB</td>
          <td>5 initial compounds per target</td>
          <td>Selected for each protein target</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td>PDB</td>
          <td>4 structures</td>
          <td>IDs: 4V26, 4IAQ, 6I8M, 4TN6</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: AIS token frequency counting on full ZINC database, top-N selection</li>
<li><strong>Generative model</strong>: Conditional VAE with BERT encoder (4 layers, 8 heads) and GRU decoder (4 layers)</li>
<li><strong>Optimization</strong>: Bayesian Optimization via BoTorch (800 candidates per iteration)</li>
<li><strong>Docking</strong>: QuickVina 2 with 25 Å pocket size, 10 docking simulations per ligand</li>
<li><strong>SA scoring</strong>: RDKit SA score</li>
<li><strong>Training</strong>: 20 epochs for all representations under identical conditions</li>
</ul>
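<p>The top-N token selection step is simple enough to sketch with the standard library (an illustrative helper, not the authors&rsquo; code; <code>token_lists</code> stands in for the AIS-tokenized corpus):</p>

```python
from collections import Counter

def build_vocab(token_lists, top_n):
    """Frequency-based vocabulary selection: count tokens across the
    corpus and keep the top-N most frequent (sketch of top-N AIS
    token selection)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [tok for tok, _ in counts.most_common(top_n)]
```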
<h3 id="models">Models</h3>
<ul>
<li>CVAE architecture details in supplementary (Fig. S9, Tables S2, S4)</li>
<li>No pre-trained weights released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMI+AIS(100) vs SMILES</th>
          <th>SMI+AIS(100) vs SELFIES</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Median Top-1 Obj. Value</td>
          <td>+12%</td>
          <td>+28%</td>
          <td>PDK4 target</td>
      </tr>
      <tr>
          <td>Validity Ratio</td>
          <td>Higher than ~40% (SMILES)</td>
          <td>Lower than SELFIES</td>
          <td>SMI+AIS improves with N</td>
      </tr>
      <tr>
          <td>BA (binding affinity)</td>
          <td>~7% improvement</td>
          <td>Substantial</td>
          <td>Lower (more negative) is better</td>
      </tr>
      <tr>
          <td>SA (synthesizability)</td>
          <td>~6% improvement</td>
          <td>Substantial</td>
          <td>Lower is more synthesizable</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the main text. Optimization wall times are reported in supplementary Table S5.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/herim-han/AIS-Drug-Opt">AIS-Drug-Opt</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Source code and datasets for reproduction</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. Code and processed data are publicly available on GitHub, but no pre-trained model weights are released, the license is unspecified, and hardware requirements are not documented in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Han, H., Yeom, M. S., &amp; Choi, S. (2025). Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation. <em>Scientific Reports</em>, 15, 16892. <a href="https://doi.org/10.1038/s41598-025-01890-7">https://doi.org/10.1038/s41598-025-01890-7</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{han2025hybridization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Han, Herim and Yeom, Min Sun and Choi, Sunghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{16892}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-025-01890-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Randomized SMILES Improve Molecular Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/</guid><description>Randomized SMILES improve RNN molecular generative models by increasing chemical space coverage, uniformity, and completeness versus canonical SMILES.</description><content:encoded><![CDATA[<h2 id="data-augmentation-through-smiles-randomization">Data Augmentation Through SMILES Randomization</h2>
<p>This is an <strong>Empirical</strong> paper that performs an extensive benchmark of RNN-based molecular generative models trained with different SMILES string variants. The primary contribution is demonstrating that randomized SMILES (non-unique molecular string representations obtained by randomizing atom orderings) substantially improve the quality of the generated chemical space compared to canonical SMILES, without requiring any changes to the model architecture.</p>
<p>The paper evaluates three properties of generated chemical spaces: uniformity (equal probability of sampling each molecule), completeness (coverage of the target space), and closedness (generating only molecules within the target space). These are measured using a new composite metric called UC-JSD.</p>
<h2 id="canonical-smiles-bias-in-generative-models">Canonical SMILES Bias in Generative Models</h2>
<p>Recurrent Neural Networks trained on SMILES strings have shown the capacity to create large chemical spaces of valid molecules. However, when trained with canonical SMILES (the unique string representation produced by a canonicalization algorithm), these models exhibit biases. Specifically, prior work by the same group showed that models trained on one million <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> molecules could only recover 68% of GDB-13 when sampled two billion times, compared to the theoretical maximum of 87% from an ideal uniform sampler.</p>
<p>The canonical SMILES representation introduces two problems. First, the canonicalization algorithm constrains how the molecular graph is traversed (e.g., prioritizing sidechains over ring atoms), forcing the model to learn both valid SMILES syntax and the specific canonical ordering rules. Second, structurally similar molecules can have substantially different canonical SMILES, making some molecules harder to sample than others. Molecules with more ring systems and complex topologies are particularly underrepresented.</p>
<p>The authors also note that DeepSMILES, a recently proposed alternative syntax, had not been benchmarked against randomized SMILES, and that the data augmentation capabilities of randomized SMILES at different training set sizes were unexplored.</p>
<h2 id="randomized-smiles-as-non-canonical-representations">Randomized SMILES as Non-Canonical Representations</h2>
<p>The core insight is that by randomizing the atom ordering before SMILES generation, each molecule can be represented by multiple different but equally valid SMILES strings. This effectively provides data augmentation: a molecule with $n$ heavy atoms can theoretically yield up to $n$ different SMILES strings (though the actual number is typically lower due to molecular symmetry).</p>
<p>Two randomized SMILES variants are explored:</p>
<ul>
<li><strong>Restricted randomized SMILES</strong>: Atom ordering is randomized, but RDKit&rsquo;s built-in fixes are applied. These fixes prevent overly complicated traversals, such as prioritizing sidechains before completing ring atoms.</li>
<li><strong>Unrestricted randomized SMILES</strong>: Atom ordering is randomized without any RDKit restrictions, producing a superset of the restricted variant that includes more convoluted SMILES strings.</li>
</ul>
<p>For each training epoch, a new set of randomized SMILES is generated for the same molecules, so a model trained for 300 epochs on one million molecules sees approximately 300 million different SMILES strings (with some overlap due to sampling).</p>
<p>The model architecture is a standard RNN with an embedding layer, $l$ layers of LSTM or GRU cells of size $w$, optional dropout, and a linear output layer with softmax. The training objective minimizes the average negative log-likelihood (NLL):</p>
<p>$$
J(T) = -\ln P(X_{0} = x_{0}) - \sum_{t=1}^{T} \ln P(X_{t} = x_{t} \mid X_{t-1} = x_{t-1}, \dots, X_{0} = x_{0})
$$</p>
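<p>In code, the per-sequence NLL is just a sum of token log-probabilities under teacher forcing. A minimal sketch, where <code>step_probs</code> is a hypothetical stand-in for the RNN&rsquo;s softmax outputs at each step:</p>

```python
import math

def sequence_nll(step_probs, tokens):
    """Negative log-likelihood of one token sequence:
    -sum_t ln P(x_t | x_0 .. x_{t-1}).
    step_probs[t] maps each token to its model probability at step t."""
    return -sum(math.log(p[tok]) for p, tok in zip(step_probs, tokens))
```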
<p>The key metric is the Uniformity-Completeness JSD (UC-JSD), which extends the Jensen-Shannon Divergence to measure how uniform, complete, and closed the generated chemical space is:</p>
<p>$$
JSD = H\left(\sum_{d_{i} \in D} \alpha_{i} \cdot d_{i}\right) - \sum_{d_{i} \in D} \alpha_{i} H(d_{i})
$$</p>
<p>where $H(d)$ is the Shannon entropy of a probability distribution. The UC-JSD is computed over the NLL vectors of the validation, training, and sampled sets. The composite UCC score is defined as:</p>
<p>$$
UCC = \text{completeness} \times \text{uniformity} \times \text{closedness}
$$</p>
<p>where completeness measures coverage of GDB-13, uniformity measures how equal the sampling probabilities are, and closedness measures how few invalid (out-of-target-space) molecules are generated.</p>
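<p>The JSD formula above translates directly to code. In the UC-JSD the distributions are the normalized NLL vectors of the sampled, training, and validation sets, each weighted $\alpha_i = 1/3$; the sketch below assumes equal weights and pre-normalized discrete distributions:</p>

```python
import math

def entropy(dist):
    """Shannon entropy H(d) in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def jsd(dists):
    """Jensen-Shannon divergence of equally weighted discrete
    distributions: H(sum_i a_i * d_i) - sum_i a_i * H(d_i),
    with a_i = 1 / len(dists)."""
    a = 1.0 / len(dists)
    mixture = [sum(a * d[j] for d in dists) for j in range(len(dists[0]))]
    return entropy(mixture) - sum(a * entropy(d) for d in dists)
```

<p>Identical distributions give a JSD of 0 (perfectly uniform, complete sampling); fully disjoint distributions give the maximum of $\ln 2$ for two distributions.</p>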
<h2 id="benchmark-design-across-smiles-variants-training-sizes-and-architectures">Benchmark Design Across SMILES Variants, Training Sizes, and Architectures</h2>
<p>The benchmark covers a systematic grid of experimental conditions:</p>
<p><strong>SMILES variants</strong>: Canonical, restricted randomized, unrestricted randomized, and three DeepSMILES variants (branch syntax, ring syntax, both).</p>
<p><strong>Training set sizes from GDB-13</strong>: 1,000,000, 10,000, and 1,000 molecules with corresponding validation sets.</p>
<p><strong>Architecture choices</strong>: LSTM vs. GRU cells, with hyperparameter grids over number of layers ($l$), hidden size ($w$), dropout rate ($d$), and batch size ($b$).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers ($l$)</th>
          <th>Hidden ($w$)</th>
          <th>Dropout ($d$)</th>
          <th>Batch ($b$)</th>
          <th>Cell</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GDB-13 1M</td>
          <td>3</td>
          <td>512</td>
          <td>0, 25, 50</td>
          <td>64, 128, 256, 512</td>
          <td>GRU, LSTM</td>
      </tr>
      <tr>
          <td>GDB-13 10K</td>
          <td>2, 3, 4</td>
          <td>256, 384, 512</td>
          <td>0, 25, 50</td>
          <td>8, 16, 32</td>
          <td>LSTM</td>
      </tr>
      <tr>
          <td>GDB-13 1K</td>
          <td>2, 3, 4</td>
          <td>128, 192, 256</td>
          <td>0, 25, 50</td>
          <td>4, 8, 16</td>
          <td>LSTM</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>3</td>
          <td>512</td>
          <td>0, 25, 50</td>
          <td>64, 128, 256, 512</td>
          <td>LSTM</td>
      </tr>
  </tbody>
</table>
<p>Each model&rsquo;s best epoch was selected using a smoothed UC-JSD curve, and the best epoch was then sampled with replacement $k = 2 \times 10^{9}$ times for GDB-13 benchmarks.</p>
<p>For ChEMBL experiments, models were trained on 1,483,943 molecules with a validation set of 78,102 molecules. Evaluation used validity, unique molecule count, and Fréchet ChemNet Distance (FCD).</p>
<h2 id="randomized-smiles-produce-more-complete-and-uniform-chemical-spaces">Randomized SMILES Produce More Complete and Uniform Chemical Spaces</h2>
<h3 id="gdb-13-results-1m-training-set">GDB-13 results (1M training set)</h3>
<p>The restricted randomized SMILES model recovered 83.0% of GDB-13, compared to 72.8% for canonical SMILES and 68.4-72.1% for DeepSMILES variants. All three quality metrics improved substantially:</p>
<table>
  <thead>
      <tr>
          <th>SMILES Variant</th>
          <th>% GDB-13</th>
          <th>Uniformity</th>
          <th>Completeness</th>
          <th>Closedness</th>
          <th>UCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>72.8</td>
          <td>0.879</td>
          <td>0.836</td>
          <td>0.861</td>
          <td>0.633</td>
      </tr>
      <tr>
          <td>Rand. restricted</td>
          <td>83.0</td>
          <td>0.977</td>
          <td>0.953</td>
          <td>0.925</td>
          <td>0.860</td>
      </tr>
      <tr>
          <td>Rand. unrestricted</td>
          <td>80.9</td>
          <td>0.970</td>
          <td>0.929</td>
          <td>0.876</td>
          <td>0.790</td>
      </tr>
      <tr>
          <td>DeepSMILES (both)</td>
          <td>68.4</td>
          <td>0.851</td>
          <td>0.785</td>
          <td>0.796</td>
          <td>0.532</td>
      </tr>
  </tbody>
</table>
<p>The NLL distribution of GDB-13 molecules under the randomized SMILES model was centered near $NLL_{GDB13} = -\ln(1/|GDB13|) = 20.6$ with a narrow spread, indicating near-uniform sampling probability. The canonical model showed a much wider NLL distribution, meaning some molecules were orders of magnitude harder to sample.</p>
<p>Randomized SMILES without data augmentation (same SMILES each epoch) still outperformed canonical SMILES (UCC 0.712 vs. 0.633 for restricted), confirming that the non-canonical representation itself is beneficial beyond the augmentation effect.</p>
<h3 id="smaller-training-sets-amplify-the-advantage">Smaller training sets amplify the advantage</h3>
<p>With only 10,000 training molecules (0.001% of GDB-13), the randomized model generated 62.3% of GDB-13 vs. 38.8% for canonical. With 1,000 training molecules, the gap widened further: 34.1% vs. 14.5%. Validity also improved dramatically (81.2% vs. 50.4% for the 1K setting), suggesting randomized SMILES helps the model learn valid SMILES syntax more effectively from limited data.</p>
<h3 id="chembl-results">ChEMBL results</h3>
<p>On the drug-like ChEMBL dataset, the randomized SMILES model generated nearly double the number of unique molecules compared to canonical (64.09% vs. 34.67% unique in a 2B sample), with comparable validity (98.33% vs. 98.26%). The canonical model showed a lower FCD (0.0712 vs. 0.1265), but the authors argue this reflects overfitting: the canonical model&rsquo;s NLL distributions for training and validation sets overlapped tightly, while the randomized model showed more uniform coverage. Physicochemical property distributions (molecular weight, logP, SA score, QED, NP score, internal diversity) were nearly identical across both models.</p>
<h3 id="architecture-findings">Architecture findings</h3>
<p>LSTM cells consistently outperformed GRU cells across all SMILES variants. Despite GRU&rsquo;s faster per-epoch training time, LSTM models converged in fewer epochs, making them faster overall. Dropout improved canonical SMILES models but was less beneficial (or detrimental) for randomized SMILES, suggesting that randomized SMILES themselves serve as a regularization mechanism. Larger batch sizes generally improved performance across all variants.</p>
<h3 id="uc-jsd-as-a-model-selection-metric">UC-JSD as a model selection metric</h3>
<p>The UC-JSD showed strong correlation with UCC ($R^{2} = 0.931$ for canonical, $R^{2} = 0.856$ for restricted randomized, $R^{2} = 0.885$ for unrestricted randomized), validating its use as a model selection criterion without requiring expensive sampling of every model.</p>
<p>The authors interpret randomized SMILES models as occupying a hybrid space between grammar-based and action-based generative models. The vocabulary serves as a fixed action space where atom tokens are &ldquo;add atom&rdquo; actions, bond tokens are &ldquo;add bond&rdquo; actions, and ring/branching tokens enable graph traversal. Canonical SMILES constrain this action space to a single deterministic path, while randomized SMILES allow the model to explore multiple valid traversals. This perspective also explains why DeepSMILES performed worse: its altered syntax creates a more complex action space without compensating benefits.</p>
<p>The authors encourage the use of randomized SMILES across different model architectures and tasks, including classification and property prediction, and suggest that finding optimal restricted variants of randomized SMILES is a promising research direction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>GDB-13 subsets</td>
          <td>1M / 10K / 1K molecules</td>
          <td>Randomly sampled from 975M GDB-13</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>1,483,943 training / 78,102 validation</td>
          <td>Filtered subset of ChEMBL database</td>
      </tr>
  </tbody>
</table>
<p>GDB-13 is available from the <a href="http://gdb.unibe.ch/downloads">Reymond group website</a>. ChEMBL is publicly available.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Character-level tokenization with special handling for multi-character tokens (Cl, Br, bracketed atoms, %-prefixed ring numbers)</li>
<li>Teacher forcing during training with NLL loss</li>
<li>Gradient norm clipping to 1.0</li>
<li>Weight initialization from $\mathcal{U}(-\sqrt{1/w}, \sqrt{1/w})$</li>
<li>Adaptive learning rate decay based on UC-JSD</li>
<li>Best epoch selection via smoothed UC-JSD (window size 4)</li>
</ul>
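<p>The multi-character tokenization rule can be captured with a single regular expression (an illustrative pattern, not the authors&rsquo; exact implementation; bracketed atoms, two-letter halogens, and %-prefixed ring closures must be tried before the single-character fallback):</p>

```python
import re

# Order matters: bracketed atoms, Cl/Br, and %-prefixed two-digit
# ring-closure numbers are matched before the single-character fallback.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into model tokens, keeping multi-character
    tokens (Cl, Br, [..], %nn) as single units."""
    return SMILES_TOKEN.findall(smiles)
```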
<h3 id="models">Models</h3>
<p>Standard RNN architecture: embedding layer, stacked LSTM/GRU layers with optional dropout, linear output with softmax. Best models used 3 layers of 512-dimensional LSTM cells. Vocabulary sizes: 26 (GDB-13), 31 (ChEMBL).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Randomized</th>
          <th>Best Canonical</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>% GDB-13 (1M)</td>
          <td>83.0%</td>
          <td>72.8%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>UCC (1M)</td>
          <td>0.860</td>
          <td>0.633</td>
          <td>Composite score</td>
      </tr>
      <tr>
          <td>% GDB-13 (10K)</td>
          <td>62.3%</td>
          <td>38.8%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>% GDB-13 (1K)</td>
          <td>34.1%</td>
          <td>14.5%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>% Unique ChEMBL</td>
          <td>64.09%</td>
          <td>34.67%</td>
          <td>2B sample with replacement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Nvidia Tesla V100 (Volta) 16 GB VRAM with CUDA 9.1, driver 390.30. Training times ranged from 1 minute (1K canonical) to 131 hours (ChEMBL canonical). Randomized SMILES models required longer per-epoch training due to augmentation overhead but converged to better solutions.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/undeadpixel/reinvent-randomized">reinvent-randomized</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and benchmarking code</td>
      </tr>
      <tr>
          <td><a href="http://gdb.unibe.ch/downloads">GDB-13</a></td>
          <td>Dataset</td>
          <td>Academic use</td>
          <td>975 million fragment-like molecules</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molecularsets/moses">MOSES benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Used for FCD and property calculations</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Arús-Pous, J., Johansson, S. V., Prykhodko, O., Bjerrum, E. J., Tyrchan, C., Reymond, J.-L., Chen, H., &amp; Engkvist, O. (2019). Randomized SMILES strings improve the quality of molecular generative models. <em>Journal of Cheminformatics</em>, 11(1), 71. <a href="https://doi.org/10.1186/s13321-019-0393-0">https://doi.org/10.1186/s13321-019-0393-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{aruspous2019randomized,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Randomized SMILES strings improve the quality of molecular generative models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ar{\&#39;u}s-Pous, Josep and Johansson, Simon Viet and Prykhodko, Oleksii and Bjerrum, Esben Jannik and Tyrchan, Christian and Reymond, Jean-Louis and Chen, Hongming and Engkvist, Ola}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{71}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0393-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Group SELFIES: Fragment-Based Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</guid><description>Group SELFIES extends SELFIES with fragment-based group tokens for chemically robust molecular string representations that improve distribution learning.</description><content:encoded><![CDATA[<h2 id="a-fragment-aware-extension-of-selfies">A Fragment-Aware Extension of SELFIES</h2>
<p>This is a <strong>Method</strong> paper that introduces Group SELFIES, a molecular string representation extending SELFIES by incorporating group tokens that represent functional groups or entire substructures. The primary contribution is a representation that maintains the 100% chemical validity guarantee of SELFIES while enabling fragment-level molecular encoding. Group SELFIES is shorter, more human-readable, and produces better distribution learning compared to both SMILES and standard SELFIES.</p>
<h2 id="from-atoms-to-fragments-in-molecular-strings">From Atoms to Fragments in Molecular Strings</h2>
<p>Molecular string representations underpin nearly all string-based molecular generation, from chemical language models and VAEs to genetic algorithms. SMILES, the dominant representation, suffers from validity issues: generated strings frequently contain syntax errors or violate valency constraints. SELFIES solved this by guaranteeing that every string decodes to a valid molecule, but both SMILES and SELFIES operate at the atomic level. Human chemists, by contrast, think about molecules in terms of functional groups and substructures.</p>
<p>Fragment-based generative models exploit this inductive bias by constructing custom representations amenable to fragment-based molecular design. However, these approaches are typically graph-based, losing the desirable properties of string representations: easy manipulation and direct input into established language models. Historical string representations like Wiswesser Line Notation (WLN), Hayward Notation, and SYBYL Line Notation (SLN) did use non-atomic tokens, but none provided chemical robustness guarantees.</p>
<p>The gap is clear: no existing string representation combines the chemical robustness of SELFIES with the fragment-level abstraction that captures meaningful chemical motifs.</p>
<h2 id="group-tokens-with-chemical-robustness-guarantees">Group Tokens with Chemical Robustness Guarantees</h2>
<p>The core innovation is the introduction of <strong>group tokens</strong> into the SELFIES framework. Each group token represents a predefined molecular fragment (such as a benzene ring, carboxyl group, or any user-specified substructure) and is treated as a single unit during encoding and decoding.</p>
<h3 id="group-definition">Group Definition</h3>
<p>Each group is defined as a set of atoms and bonds with labeled <strong>attachment points</strong> that specify how the group participates in bonding. Each attachment point has a specified maximum valency, allowing the decoder to continue tracking available valency during string construction. Group tokens take the form <code>[:S&lt;group-name&gt;]</code>, where <code>S</code> is the starting attachment index.</p>
<h3 id="encoding">Encoding</h3>
<p>To encode a molecule, the encoder first recognizes and replaces substructure matches from the group set. By default, the encoder processes larger groups first, but users can override this with priority values. The encoder then traverses the molecular graph similarly to standard SELFIES encoding, inserting tokens that track attachment indices for entering and exiting groups.</p>
<h3 id="decoding">Decoding</h3>
<p>When the decoder encounters a group token, it looks up the corresponding group in the group set dictionary, places all atoms of the group, and connects the main chain to the starting attachment point. Navigation between attachment points is handled by reading subsequent tokens as relative indices. If an attachment point is occupied, the next available one is used. If all attachment points are exhausted, the group is immediately popped from the stack.</p>
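<p>The attachment-point fallback described above can be sketched as follows (a simplified illustration; the forward, wrapping scan order is an assumption, not taken from the paper):</p>

```python
def next_attachment(occupied, start, n_points):
    """Pick the attachment index for an incoming bond: try the requested
    start index, then scan forward (wrapping) for the next free point.
    Returning None means every point is used, so the group is popped
    from the decoder's stack."""
    for offset in range(n_points):
        idx = (start + offset) % n_points
        if idx not in occupied:
            return idx
    return None
```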
<h3 id="chemical-robustness">Chemical Robustness</h3>
<p>The key property preserved from SELFIES is that <strong>any arbitrary Group SELFIES string decodes to a molecule with valid valency</strong>. This is achieved by maintaining the same two SELFIES decoder features within the group framework:</p>
<ol>
<li>Token overloading: every token can be interpreted as a number when needed (for branch lengths, ring targets, or attachment indices).</li>
<li>Valency tracking: if adding a bond would exceed available valency, the decoder adjusts the bond order or skips the bond.</li>
</ol>
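<p>Valency tracking (point 2) amounts to clamping each requested bond to the valence both endpoints can still accept. A minimal sketch of that repair rule, under the assumption that free valences are tracked per atom:</p>

```python
def repair_bond(requested_order, free_valence_a, free_valence_b):
    """SELFIES-style valency repair: lower the requested bond order to
    what both atoms can still accept; return None (skip the bond
    entirely) if either atom is already saturated."""
    order = min(requested_order, free_valence_a, free_valence_b)
    return order if order >= 1 else None
```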
<p>The authors verified robustness by encoding and decoding 25 million molecules from the eMolecules database.</p>
<h3 id="chirality-handling">Chirality Handling</h3>
<p>Group SELFIES handles chirality differently from SMILES and SELFIES. Rather than using <code>@</code>-notation for tetrahedral chirality, all chiral centers must be specified as groups. An &ldquo;essential set&rdquo; of 23 groups covers all relevant chiral centers in the eMolecules database. This approach also supports extended chirality (axial, helical, planar) by abstracting the entire chiral substructure into a group token.</p>
<h3 id="fragment-selection">Fragment Selection</h3>
<p>The group set is a user-defined dictionary that maps group names to molecular fragments. Users can specify groups manually using SMILES-like syntax, extract them from fragment libraries, or use fragmentation algorithms such as matched molecular pair analysis. The authors tested several approaches, including a naive method that cleaves side chains from rings and methods based on cheminformatics fragmentation tools. A useful group set typically contains fragments that appear in many molecules and replace many atoms, with similar fragments merged to reduce redundancy.</p>
<h2 id="experiments-on-compactness-generation-and-distribution-learning">Experiments on Compactness, Generation, and Distribution Learning</h2>
<h3 id="compactness-section-41">Compactness (Section 4.1)</h3>
<p>Using 53 groups (30 extracted from ZINC-250k plus 23 from the essential set), Group SELFIES strings are shorter than their SMILES and SELFIES equivalents. Despite Group SELFIES having a larger alphabet, the compressed file size of the ZINC-250k dataset is smallest for Group SELFIES, indicating lower information-theoretic complexity.</p>
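<p>The compressed-file-size comparison is easy to reproduce in spirit with stdlib DEFLATE (a rough proxy for information-theoretic complexity, not necessarily the authors&rsquo; exact procedure):</p>

```python
import zlib

def compressed_size(strings, level=9):
    """DEFLATE-compressed byte size of a newline-joined dataset of
    molecular strings; smaller means lower redundancy-adjusted size."""
    return len(zlib.compress("\n".join(strings).encode("utf-8"), level))
```

<p>Comparing <code>compressed_size</code> on the same molecules encoded as SMILES, SELFIES, and Group SELFIES reproduces the kind of ranking reported for ZINC-250k.</p>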
<h3 id="random-molecular-generation-section-42">Random Molecular Generation (Section 4.2)</h3>
<p>To isolate the effect of the representation from the generative model, the authors use a primitive generative model: sample a random string length from the dataset, draw tokens uniformly from a bag of all tokens, and concatenate. From 100,000 ZINC-250k molecules:</p>
<ul>
<li>Randomly sampled Group SELFIES strings produce molecules whose SAScore and QED distributions more closely overlap with the original ZINC dataset than molecules from randomly sampled SELFIES strings.</li>
<li>The Wasserstein distances to the ZINC distribution are consistently lower for Group SELFIES.</li>
<li>On a nonfullerene acceptor (NFA) dataset, Group SELFIES preserves aromatic rings while SELFIES rarely does.</li>
</ul>
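<p>The primitive generative model described above is simple enough to state in full (a sketch; <code>dataset_tokens</code> is a hypothetical tokenized dataset from which the length distribution and token bag are built):</p>

```python
import random

def sample_strings(dataset_tokens, n, seed=0):
    """Primitive generator: draw a string length from the dataset's
    empirical length distribution, fill it with tokens drawn uniformly
    from the bag of all tokens, and concatenate."""
    rng = random.Random(seed)
    bag = [tok for toks in dataset_tokens for tok in toks]
    lengths = [len(toks) for toks in dataset_tokens]
    return ["".join(rng.choice(bag) for _ in range(rng.choice(lengths)))
            for _ in range(n)]
```

<p>Because decoding the resulting strings is the only model-free step, any difference in the property distributions of the decoded molecules is attributable to the representation itself.</p>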
<h3 id="distribution-learning-with-vaes-section-43">Distribution Learning with VAEs (Section 4.3)</h3>
<p>Using the MOSES benchmarking framework, VAEs were trained for 125 epochs on both Group SELFIES and SELFIES representations. The Group SELFIES VAE used 300 groups extracted from the MOSES training set. Results from 100,000 generated molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Group-VAE-125</th>
          <th>SELFIES-VAE-125</th>
          <th>Train (Reference)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>1.0 (0)</td>
          <td>1.0 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@1k</td>
          <td>1.0 (0)</td>
          <td>0.9996 (5)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@10k</td>
          <td>0.9985 (4)</td>
          <td>0.9986 (4)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>FCD (Test)</td>
          <td>0.1787 (29)</td>
          <td>0.6351 (43)</td>
          <td>0.008</td>
      </tr>
      <tr>
          <td>FCD (TestSF)</td>
          <td>0.734 (109)</td>
          <td>1.3136 (128)</td>
          <td>0.4755</td>
      </tr>
      <tr>
          <td>SNN (Test)</td>
          <td>0.6051 (4)</td>
          <td>0.6014 (3)</td>
          <td>0.6419</td>
      </tr>
      <tr>
          <td>Frag (Test)</td>
          <td>0.9995 (0)</td>
          <td>0.9989 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Scaf (Test)</td>
          <td>0.9649 (21)</td>
          <td>0.9588 (15)</td>
          <td>0.9907</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>0.8587 (1)</td>
          <td>0.8579 (1)</td>
          <td>0.8567</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.9623 (7)</td>
          <td>0.96 (4)</td>
          <td>1.0</td>
      </tr>
  </tbody>
</table>
<p>The most notable improvement is in Fréchet ChemNet Distance (FCD): Group SELFIES achieves 0.1787 versus 0.6351 for SELFIES on the test set. FCD measures the difference between penultimate-layer activations of ChemNet, which encode a mixture of biological and chemical properties relevant to drug-likeness. The remaining metrics are comparable, with Group SELFIES matching or slightly outperforming SELFIES.</p>
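<p>For intuition, FCD is the Fréchet distance between two Gaussians fit to ChemNet activations of generated and reference molecules. The sketch below shows the one-dimensional reduction of that formula; the real metric uses multivariate means and covariances of learned activations, not this toy version:</p>

```python
import statistics

def frechet_distance_1d(xs, ys):
    """Frechet distance between 1-D Gaussians fit to two samples.
    In general: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*(C1*C2)^(1/2));
    for scalar activations this reduces to (m1-m2)^2 + (s1-s2)^2."""
    m1, m2 = statistics.fmean(xs), statistics.fmean(ys)
    s1, s2 = statistics.pstdev(xs), statistics.pstdev(ys)
    return (m1 - m2) ** 2 + (s1 - s2) ** 2
```

<p>Lower values mean the generated distribution more closely matches the reference, which is why the drop from 0.6351 to 0.1787 is the headline result here.</p>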
<h2 id="advantages-limitations-and-future-directions">Advantages, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p>Group SELFIES provides three main advantages over standard SELFIES:</p>
<ol>
<li><strong>Substructure control</strong>: Important scaffolds, chiral centers, and charged groups can be preserved during molecular optimization.</li>
<li><strong>Compactness</strong>: Group tokens represent multiple atoms, yielding shorter strings with lower information-theoretic complexity.</li>
<li><strong>Improved distribution learning</strong>: The FCD metric shows substantial improvement, indicating generated molecules better capture biological and chemical properties of the training set.</li>
</ol>
<p>Both SELFIES and Group SELFIES achieve 100% validity, eliminating the validity issues associated with SMILES-based generation.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational speed</strong>: Encoding and decoding is slower than SELFIES due to RDKit overhead, particularly for the encoder which performs substructure matching for every group in the set.</li>
<li><strong>No group overlap</strong>: Groups cannot overlap in the current formulation, which limits expressiveness for polycyclic compounds.</li>
<li><strong>Group set design</strong>: Choosing an effective group set remains an open design choice that may require domain expertise or fragmentation algorithm tuning.</li>
<li><strong>Limited generative model evaluation</strong>: The paper focuses on random sampling and VAEs; evaluation with more sophisticated models (GANs, reinforcement learning, genetic algorithms) is left to future work.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose several extensions: flexible scaffold tokens that preserve topology while allowing atom-type variation, representations based on cellular complexes or hypergraphs to handle overlapping groups, and integration with genetic algorithms like JANUS for molecular optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compactness / Generation</td>
          <td>ZINC-250k</td>
          <td>250,000 molecules</td>
          <td>Random subset of 10,000 for fragment extraction; 100,000 for generation</td>
      </tr>
      <tr>
          <td>Distribution Learning</td>
          <td>MOSES benchmark</td>
          <td>~1.9M molecules</td>
          <td>Standard train/test split from MOSES framework</td>
      </tr>
      <tr>
          <td>Robustness Verification</td>
          <td>eMolecules</td>
          <td>25M molecules</td>
          <td>Full database encode-decode round trip</td>
      </tr>
      <tr>
          <td>NFA Generation</td>
          <td>NFA dataset</td>
          <td>Not specified</td>
          <td>Nonfullerene acceptors from Lopez et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: Naive ring-sidechain cleavage, matched molecular pair analysis, and diversity-based selection of 300 groups for VAE experiments.</li>
<li><strong>Essential set</strong>: 23 chiral groups covering all relevant chiral centers in eMolecules.</li>
<li><strong>Random generation</strong>: Bag-of-tokens sampling with length matched to dataset distribution.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>VAE</strong>: Trained for 125 epochs on MOSES dataset using both SELFIES and Group SELFIES tokenizations.</li>
<li>Architecture details follow the MOSES benchmark VAE configuration.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCD</td>
          <td>Fréchet ChemNet Distance (penultimate layer activations)</td>
      </tr>
      <tr>
          <td>SNN</td>
          <td>Average Tanimoto similarity to nearest neighbor in reference set</td>
      </tr>
      <tr>
          <td>Frag</td>
          <td>Cosine similarity of BRICS fragment distributions</td>
      </tr>
      <tr>
          <td>Scaf</td>
          <td>Cosine similarity of Bemis-Murcko scaffold distributions</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>Internal diversity via Tanimoto similarity</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Percentage passing RDKit parsing</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Percentage of non-duplicate generated molecules</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Robustness verification performed on the Niagara supercomputer (SciNet HPC Consortium).</li>
<li>VAE training hardware not specified.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/group-selfies">group-selfies</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Open-source Python implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cheng, A. H., Cai, A., Miret, S., Malkomes, G., Phielipp, M., &amp; Aspuru-Guzik, A. (2023). Group SELFIES: A robust fragment-based molecular string representation. <em>Digital Discovery</em>, 2(3), 748-758. <a href="https://doi.org/10.1039/D3DD00012E">https://doi.org/10.1039/D3DD00012E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cheng2023group,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Group SELFIES: A Robust Fragment-Based Molecular String Representation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cheng, Austin H. and Cai, Andy and Miret, Santiago and Malkomes, Gustavo and Phielipp, Mariano and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{748--758}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00012E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DeepSMILES: Adapting SMILES Syntax for Machine Learning</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</guid><description>DeepSMILES modifies SMILES syntax to eliminate unbalanced parentheses and unpaired ring closures, reducing invalid outputs from generative molecular models.</description><content:encoded><![CDATA[<h2 id="a-new-molecular-string-notation-for-generative-models">A New Molecular String Notation for Generative Models</h2>
<p>This is a <strong>Method</strong> paper that introduces DeepSMILES, a modified SMILES syntax designed to reduce the rate of syntactically invalid strings produced by machine-learning generative models. The primary contribution is a pair of string-level transformations (for ring closures and for branches) that can be applied independently and interconverted with standard SMILES without loss of information, including stereochemistry.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p>Deep neural networks for de novo molecular design commonly operate on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational autoencoders</a> (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al., 2018</a>), recurrent neural networks with LSTM (<a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al., 2018</a>; Olivecrona et al., 2017), and grammar-based approaches (<a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Kusner et al., 2017</a>) all generate molecules by sampling character sequences. A persistent problem is that many generated strings are syntactically invalid SMILES, with reported validity rates ranging from 7% to 80%.</p>
<p>Two structural features of SMILES syntax are responsible for most invalid strings:</p>
<ol>
<li><strong>Balanced parentheses</strong>: Branches require matched open/close parenthesis pairs. A generative model must track nesting state across long sequences to produce valid brackets.</li>
<li><strong>Paired ring closure symbols</strong>: Rings require two identical digit tokens at corresponding positions. The model must remember which digits are &ldquo;open&rdquo; and close them appropriately.</li>
</ol>
<p>Grammar-based approaches (e.g., <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>) can enforce balanced parentheses through a context-free grammar, but they cannot enforce the ring closure pairing constraint because that constraint is context-sensitive. Syntax-directed approaches (Dai et al., 2018) add explicit ring closure constraints but at the cost of significantly more complex decoder architectures.</p>
<h2 id="core-innovation-postfix-branch-notation-and-single-ring-closure-symbols">Core Innovation: Postfix Branch Notation and Single Ring Closure Symbols</h2>
<p>DeepSMILES addresses both syntax problems through two independent string transformations.</p>
<h3 id="ring-closure-transformation">Ring closure transformation</h3>
<p>Standard SMILES uses a pair of identical digits to mark ring openings and closings (e.g., <code>c1ccccc1</code> for benzene). DeepSMILES eliminates the ring-opening digit and replaces the ring-closing digit with the ring size, counting back along the tree path to the ring-opening atom. Benzene becomes <code>cccccc6</code>, where <code>6</code> means &ldquo;connect to the atom 6 positions back.&rdquo;</p>
<p>This transformation has three key properties:</p>
<ul>
<li>Every ring of a given size always uses the same digit, regardless of context. A phenyl ring is always <code>cccccc6</code> in DeepSMILES, whereas in SMILES it might be <code>c1ccccc1</code>, <code>c2ccccc2</code>, <code>c3ccccc3</code>, etc.</li>
<li>A single symbol cannot be &ldquo;unmatched&rdquo; since there is no corresponding opening symbol.</li>
<li>For double-digit ring sizes, the <code>%N</code> notation is used (and <code>%(N)</code> for sizes above 99).</li>
</ul>
<p>Bond stereochemistry is preserved by moving any explicit or stereo bond from the eliminated ring-opening symbol to the ring-closing symbol, with direction adjusted as needed.</p>
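<p>The digit replacement can be illustrated with a toy encoder. This sketch assumes unbranched rings and single-character atom tokens; the reference implementation in the <code>deepsmiles</code> package handles the general case, including bonds, brackets, and stereochemistry:</p>

```python
def encode_rings(smiles):
    """Toy DeepSMILES ring encoder: drop the ring-opening digit and
    replace the closing digit with the ring size (atoms counted back
    to the ring-opening atom). Assumes single-character atom tokens
    and unbranched rings."""
    open_ring = {}  # ring-closure digit -> index of the opening atom
    atoms = 0
    out = []
    for ch in smiles:
        if ch.isdigit():
            if ch in open_ring:
                out.append(str(atoms - open_ring.pop(ch)))  # close ring
            else:
                open_ring[ch] = atoms - 1  # open ring: emit nothing
        else:
            out.append(ch)
            if ch.isalpha():
                atoms += 1
    return "".join(out)
```

<p>For example, benzene <code>c1ccccc1</code> encodes to <code>cccccc6</code> and cyclopropane <code>C1CC1</code> to <code>CCC3</code>.</p>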
<h3 id="branch-parenthesis-transformation">Branch (parenthesis) transformation</h3>
<p>Standard SMILES uses matched open/close parenthesis pairs for branches (e.g., <code>C(OC)(SC)F</code>). DeepSMILES replaces this with a postfix notation inspired by Reverse Polish Notation (RPN). Only close parentheses are used, and the number of consecutive close parentheses indicates how far back on the current branch the next atom attaches.</p>
<p>For example, <code>C(OC)(SC)F</code> becomes <code>COC))SC))F</code>. The interpretation uses a stack: atoms are pushed onto the stack as they are read, each close parenthesis pops one atom from the stack, and the next atom connects to whatever is on top of the stack.</p>
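<p>The stack interpretation can be sketched as a small decoder. This toy version assumes single-character atom tokens and returns a bond list rather than a molecule (the reference <code>deepsmiles</code> package implements the full notation):</p>

```python
def decode_branches(deep):
    """Decode DeepSMILES postfix branch notation into a bond list.
    Atoms are pushed onto a stack as they are read; each ')' pops one
    atom; each new atom bonds to whatever is then on top of the stack."""
    stack, bonds, idx = [], [], 0
    for ch in deep:
        if ch == ")":
            stack.pop()
        else:  # assume a single-character atom token
            if stack:
                bonds.append((stack[-1], idx))
            stack.append(idx)
            idx += 1
    return bonds
```

<p>Decoding <code>COC))SC))F</code> yields bonds C0-O1, O1-C2, C0-S3, S3-C4, C0-F5, i.e. the connectivity of <code>C(OC)(SC)F</code>.</p>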
<h3 id="stereochemistry-preservation">Stereochemistry preservation</h3>
<p>Tetrahedral stereochemistry is fully preserved through the transformations. When ring closure symbol reordering would change the stereo configuration, the <code>@</code>/<code>@@</code> annotation is inverted during encoding to compensate.</p>
<h3 id="independence-of-transformations">Independence of transformations</h3>
<p>The two transformations are independent and can be applied separately or together. Any application of DeepSMILES should specify which transformations were applied.</p>
<h2 id="roundtrip-validation-on-chembl-23">Roundtrip Validation on ChEMBL 23</h2>
<p>The authors validated DeepSMILES by roundtripping all entries in the ChEMBL 23 database through SMILES-to-DeepSMILES-to-SMILES conversion. Canonical SMILES (including stereochemistry) were generated by four independent cheminformatics toolkits: CDK, OEChem, Open Babel, and RDKit. Using multiple toolkits ensures coverage of different traversal orders and ring closure ordering conventions.</p>
<p>All SMILES strings roundtripped without error across all three configurations (branches only, rings only, both). The exact string representation may differ in ring closure digit assignment or digit ordering, sometimes with an associated stereo inversion at tetrahedral centers, but the canonical SMILES of the original and roundtripped molecules are identical.</p>
<h3 id="performance-characteristics">Performance characteristics</h3>
<p>The following table shows the effect of DeepSMILES conversion on string length and throughput, measured on canonical SMILES from Open Babel for ChEMBL 23:</p>
<table>
  <thead>
      <tr>
          <th>Transformation</th>
          <th>Mean % change in length</th>
          <th>Encoding (per sec)</th>
          <th>Decoding (per sec)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Branches only</td>
          <td>+8.2%</td>
          <td>32,000</td>
          <td>16,000</td>
      </tr>
      <tr>
          <td>Rings only</td>
          <td>-6.4%</td>
          <td>26,000</td>
          <td>24,000</td>
      </tr>
      <tr>
          <td>Both</td>
          <td>+1.9%</td>
          <td>26,000</td>
          <td>17,500</td>
      </tr>
  </tbody>
</table>
<p>The ring transformation slightly shortens strings (by removing one digit per ring), while the branch transformation slightly lengthens them (additional close parentheses). Combined, the net effect is a small increase of about 2%. Throughput is in the tens of thousands of conversions per second in pure Python.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>DeepSMILES does not eliminate all invalid strings. Invalid DeepSMILES can still be generated, for example when there are more close parentheses than atoms on the stack, or when a ring size exceeds the number of available atoms. The reference implementation raises a <code>DecodeError</code> in these cases, though the authors note that a more tolerant decoder (ignoring extra parentheses or defaulting to the first atom for oversized rings) could be used during generation.</p>
<p>The paper assumes that input SMILES are generated by a standard cheminformatics toolkit as a depth-first traversal of the molecular graph. Non-standard SMILES (e.g., <code>CC(C1)CCCC1</code>) cannot be directly encoded.</p>
<p>The authors suggest several directions for future work:</p>
<ul>
<li>Investigating whether a preferred traversal order (e.g., shorter branches first) would make DeepSMILES even easier for models to learn.</li>
<li>Exploring notations where atoms in the organic subset explicitly list their hydrogen count, which would allow a fully parenthesis-free representation.</li>
<li>Using SMILES augmentation with random traversal orders (as explored by Bjerrum and Threlfall, 2017) in combination with DeepSMILES.</li>
<li>Designing entirely new line notations optimized for ML, where every string maps to a valid molecule, there are few duplicate representations, small string changes produce small structural changes, and string length correlates with pharmaceutical relevance.</li>
</ul>
<p>The fused ring case presents additional complexity: a bicyclic system has three cycles, and depending on traversal order, the ring size digit may not directly correspond to the ring size of any individual ring. This is an inherent limitation of depth-first traversal-based notations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation</td>
          <td>ChEMBL 23</td>
          <td>~1.7M compounds</td>
          <td>Canonical SMILES from CDK, OEChem, Open Babel, RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The DeepSMILES encoder and decoder are pure string-processing algorithms with no machine-learning components. The transformations operate on SMILES syntax tokens (atoms, bonds, parentheses, ring closure digits) without chemical interpretation.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Roundtrip accuracy</td>
          <td>100%</td>
          <td>All ChEMBL 23 entries across 4 toolkits</td>
      </tr>
      <tr>
          <td>Encoding throughput</td>
          <td>26,000-32,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
      <tr>
          <td>Decoding throughput</td>
          <td>16,000-24,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>No specific hardware requirements. The implementation is a pure Python module with no GPU dependencies.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/nextmovesoftware/deepsmiles">deepsmiles</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Pure Python encoder/decoder</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: O&rsquo;Boyle, N. M., &amp; Dalke, A. (2018). DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv.7097960.v1">https://doi.org/10.26434/chemrxiv.7097960.v1</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oboyle2018deepsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{O&#39;Boyle, Noel M. and Dalke, Andrew}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ChemRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv.7097960.v1}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Atom-in-SMILES: Better Tokens for Chemical Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</guid><description>Atom-in-SMILES replaces generic SMILES tokens with environment-aware atomic tokens, reducing token degeneration and improving chemical translation accuracy.</description><content:encoded><![CDATA[<h2 id="a-new-tokenization-method-for-chemical-language-models">A New Tokenization Method for Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom-in-SMILES (AIS), a tokenization scheme for SMILES strings that replaces generic atomic tokens with environment-aware tokens encoding each atom&rsquo;s local chemical neighborhood. The primary contribution is demonstrating that tokenization quality has a significant impact on chemical language model outcomes across multiple tasks: SMILES canonicalization, <a href="/notes/chemistry/molecular-design/reaction-prediction/">single-step retrosynthesis</a>, and <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a>.</p>
<h2 id="why-standard-smiles-tokenization-falls-short">Why Standard SMILES Tokenization Falls Short</h2>
<p>Standard atom-wise SMILES tokenization treats all atoms of the same element identically. Every carbon is tokenized as &ldquo;C&rdquo; regardless of whether it is part of an aromatic ring, a carbonyl group, or a methyl chain. This creates a highly degenerate token space where chemically distinct atoms share the same representation.</p>
<p>The authors draw an analogy between natural language and chemical language. A typical SMILES sequence is about three times longer than a natural language sentence, yet the token vocabulary is roughly 1000 times smaller. This mismatch leads to extreme token repetition: the same tokens (C, c, N, O) appear many times within a single sequence. In natural language processing, token degeneration (where models repeatedly predict the same token) is a known failure mode of autoregressive decoders. The repetitive nature of SMILES tokens exacerbates this problem in chemical language models.</p>
<p>SMILES also lacks a one-to-one correspondence between tokens and chemical meaning. Two molecules that differ in only one atom substitution (e.g., swapping a carbon for a nitrogen in a ring) produce identical token sets under atom-wise tokenization, making it harder for models to distinguish structurally similar molecules.</p>
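<p>For concreteness, atom-wise tokenization is commonly implemented with a regular expression like the one below (a widely used pattern in chemical language modeling, shown as an illustration rather than the paper's exact tokenizer). Note how every non-bracket carbon, aromatic or not, collapses to the same <code>C</code> or <code>c</code> token:</p>

```python
import re

# One alternative per token type: bracket atoms, two-letter halogens,
# organic-subset atoms, then bonds/branches/ring closures.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles):
    """Split a SMILES string into atom-wise tokens."""
    return SMILES_TOKEN.findall(smiles)
```

<p>Tokenizing benzene gives six identical <code>c</code> tokens, exactly the degeneracy AIS is designed to remove.</p>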
<h2 id="core-innovation-encoding-atom-environments-into-tokens">Core Innovation: Encoding Atom Environments into Tokens</h2>
<p>The key insight is to replace each atomic token with a richer token that encodes the atom&rsquo;s local chemical environment, inspired by the <a href="https://en.wikipedia.org/wiki/Atoms_in_molecules">atoms-in-molecules (AIM)</a> concept from quantum chemistry. For a given SMILES string, the AIS mapping function $f$ operates on the token space:</p>
<p>$$
f(X) = \begin{cases} AE|_{X_{\text{central}}} &amp; \text{if } X \text{ is an atom} \\ X &amp; \text{otherwise} \end{cases}
$$</p>
<p>where $AE|_{X_{\text{central}}}$ denotes the atomic environment centered on atom $X$. Non-atomic tokens (brackets, bond symbols, ring closures) pass through unchanged.</p>
<p>Each AIS token is formatted as <code>[Sym;Ring;Neighbors]</code> where:</p>
<ul>
<li><strong>Sym</strong> is the atomic symbol with chirality, aromaticity (lowercase for aromatic), hydrogen count, and formal charge</li>
<li><strong>Ring</strong> indicates whether the atom is in a ring (<code>R</code>) or not (<code>!R</code>)</li>
<li><strong>Neighbors</strong> lists the neighboring atoms interacting with the central atom</li>
</ul>
<p>This mapping is bijective: SMILES strings can be fully recovered from AIS strings via an inverse projection. The algorithm iterates over atoms in a molecule, computes their local environments using RDKit, and produces environment-aware token variants.</p>
<p>As a concrete example, in glycine the two carbons and two oxygens are indistinguishable under atom-wise tokenization. Under AIS, each receives a unique token reflecting its bonding environment (e.g., the carboxyl carbon is distinguished from the alpha carbon).</p>
<p>The AIS tokenization also exhibits a fingerprint-like property. Because each token encodes local structural information, the set of AIS tokens for a molecule functions similarly to circular fingerprints like ECFP2. The authors show that pairwise <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarities</a> computed from AIS token sets have resolution comparable to ECFP2 and HashAP fingerprints, and better resolution than MACCS, Avalon, and RDKit fingerprints.</p>
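<p>The token-set similarity used here is ordinary Jaccard/Tanimoto similarity over sets. A short sketch (the AIS-style tokens in the usage line are hypothetical examples, not taken from the paper):</p>

```python
def tanimoto(tokens_a, tokens_b):
    """Jaccard/Tanimoto similarity between two molecules' token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical AIS-style tokens for two small molecules.
sim = tanimoto(["[CH3;!R;C]", "[C;!R;CCO]", "[OH;!R;C]"],
               ["[CH3;!R;C]", "[C;!R;CCO]"])
```

<p>Here two of three distinct tokens are shared, giving a similarity of 2/3.</p>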
<p>Token repetition can be quantified as:</p>
<p>$$
\text{rep-}l = \sum_{t=1}^{|s|} \mathbb{1}[s_t \in s_{t-w-1:t-1}]
$$</p>
<p>where $s$ is the predicted sequence, $|s|$ is the token count, and $w$ is the window size. AIS tokens exhibit consistently lower normalized repetition rates compared to SMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> across diverse molecular datasets (drugs, natural products, steroids, lipids, metal complexes, octane isomers).</p>
<h2 id="experimental-evaluation-across-three-chemical-tasks">Experimental Evaluation Across Three Chemical Tasks</h2>
<h3 id="input-output-equivalent-mapping-smiles-canonicalization">Input-Output Equivalent Mapping (SMILES Canonicalization)</h3>
<p>The first task tests whether a model can translate non-canonical SMILES enumerations into canonical form. The authors constructed deliberately challenging datasets from <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> subsets with cumulative structural constraints (no cyclic heteroatom-heteroatom bonds, stable functional groups only, fragment-like, scaffold-like, etc.), generating training sets of 1M molecules augmented with 150K molecules from the most restrictive subset at 10x, 30x, and 50x augmentation levels.</p>
<table>
  <thead>
      <tr>
          <th>GDB-13 Subset</th>
          <th>Atom-wise (x10)</th>
          <th>Atom-wise (x50)</th>
          <th>AIS (x10)</th>
          <th>AIS (x50)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ab</td>
          <td>34.2%</td>
          <td>33.2%</td>
          <td>37.3%</td>
          <td>34.1%</td>
      </tr>
      <tr>
          <td>abc</td>
          <td>31.0%</td>
          <td>29.6%</td>
          <td>33.7%</td>
          <td>30.4%</td>
      </tr>
      <tr>
          <td>abcde</td>
          <td>48.7%</td>
          <td>45.5%</td>
          <td>53.6%</td>
          <td>47.0%</td>
      </tr>
      <tr>
          <td>abcdef</td>
          <td>41.8%</td>
          <td>39.1%</td>
          <td>52.5%</td>
          <td>46.9%</td>
      </tr>
      <tr>
          <td>abcdefg</td>
          <td>50.9%</td>
          <td>50.0%</td>
          <td>59.9%</td>
          <td>56.8%</td>
      </tr>
  </tbody>
</table>
<p>AIS outperformed atom-wise tokenization on all subsets and augmentation levels. The performance gap widened for the more restrictive (and therefore more internally similar) subsets, reaching 10.7 percentage points on the abcdef subset. This demonstrates that AIS is particularly effective when molecules are structurally similar and harder to distinguish.</p>
<h3 id="single-step-retrosynthesis">Single-Step Retrosynthesis</h3>
<p>The second task uses the USPTO-50K benchmark for single-step <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthetic prediction</a> via a template-free transformer encoder-decoder model. The model was trained for 200,000 steps with Adam optimizer, negative log-likelihood loss, and cyclic learning rate scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Tokenization</th>
          <th>rep-|P - rep-|GT &gt;= 2</th>
          <th>String Exact (%)</th>
          <th>Tc Exact (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Atom-wise baseline</td>
          <td>&ndash;</td>
          <td>42.00</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Atom-wise (reproduced)</td>
          <td>801</td>
          <td>42.05</td>
          <td>44.72</td>
      </tr>
      <tr>
          <td>SmilesPE</td>
          <td>821</td>
          <td>19.82</td>
          <td>22.74</td>
      </tr>
      <tr>
          <td>SELFIES</td>
          <td>886</td>
          <td>28.82</td>
          <td>30.76</td>
      </tr>
      <tr>
          <td>DeepSMILES</td>
          <td>902</td>
          <td>38.63</td>
          <td>41.20</td>
      </tr>
      <tr>
          <td><strong>Atom-in-SMILES</strong></td>
          <td><strong>727</strong></td>
          <td><strong>46.32</strong></td>
          <td><strong>47.62</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved 46.32% string exact accuracy (4.3 percentage points above the atom-wise baseline) and 47.62% Tanimoto exact accuracy (2.9 points above). AIS also produced the fewest degenerate token repetitions (727 vs. 801 for atom-wise, roughly a 10% reduction). DeepSMILES had the highest repetition count (902) despite reasonable overall accuracy. SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SmilesPE</a> both performed substantially worse than the atom-wise baseline on this task.</p>
<p>The authors identified six common token repetition patterns in retrosynthetic predictions: long head repetitions, long tail repetitions, repetitive rings, repetitive chains, and halogen repetitions on both aliphatic and aromatic carbons.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The third task evaluates tokenization schemes on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks using Random Forest models with 5-fold cross-validation. AIS tokens were converted to fingerprint-like feature vectors.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>SMILES</th>
          <th>DeepSMILES</th>
          <th>SELFIES</th>
          <th>SmilesPE</th>
          <th>AIS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Regression (RMSE, lower is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>0.628</td>
          <td>0.631</td>
          <td>0.675</td>
          <td>0.689</td>
          <td><strong>0.553</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>0.545</td>
          <td>0.544</td>
          <td>0.564</td>
          <td>0.761</td>
          <td><strong>0.441</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>0.924</td>
          <td>0.895</td>
          <td>0.938</td>
          <td>0.800</td>
          <td><strong>0.683</strong></td>
      </tr>
      <tr>
          <td><strong>Classification (ROC-AUC, higher is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>0.758</td>
          <td>0.777</td>
          <td>0.799</td>
          <td>0.847</td>
          <td><strong>0.885</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>0.740</td>
          <td>0.774</td>
          <td>0.746</td>
          <td><strong>0.837</strong></td>
          <td>0.835</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>0.649</td>
          <td>0.648</td>
          <td>0.653</td>
          <td><strong>0.739</strong></td>
          <td>0.729</td>
      </tr>
  </tbody>
</table>
<p>AIS achieved the best performance on all three regression datasets and on BBBP among the classification datasets; SmilesPE was narrowly ahead on BACE (0.837 vs. 0.835) and HIV (0.739 vs. 0.729). On ESOL, the RMSE improvement over standard SMILES was 12%; on Lipophilicity, it was 26%.</p>
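<p>The fingerprint-like feature vectors built from AIS tokens can be illustrated with a simple token-count featurization. This is a hypothetical sketch of the idea; the token strings and vocabulary below are invented, and the authors' exact pipeline may differ:</p>

```python
from collections import Counter

def token_count_vector(tokens, vocab):
    """Map a tokenized molecule to a fixed-length count vector.

    `vocab` fixes the feature order; tokens outside the vocabulary are
    ignored. This mimics a count-based fingerprint suitable as Random
    Forest input, not the paper's exact featurization.
    """
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

# Invented AIS-style tokens: atoms annotated with ring flags and neighbors.
vocab = ["[C;!R;CC]", "[O;!R;C]", "[c;R;cc]"]
mol = ["[C;!R;CC]", "[C;!R;CC]", "[O;!R;C]"]
print(token_count_vector(mol, vocab))  # [2, 1, 0]
```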
<h2 id="key-findings-better-tokens-yield-better-chemical-models">Key Findings: Better Tokens Yield Better Chemical Models</h2>
<p>The main findings of this work are:</p>
<ol>
<li>
<p><strong>Tokenization significantly impacts chemical language model quality.</strong> The choice of tokenization scheme alone can change prediction accuracy by more than 10 percentage points on otherwise identical sequence-to-sequence tasks.</p>
</li>
<li>
<p><strong>AIS reduces token degeneration by approximately 10%</strong> compared to atom-wise SMILES tokenization, with consistently lower normalized repetition rates across diverse molecular datasets.</p>
</li>
<li>
<p><strong>AIS outperforms all compared tokenization schemes</strong> (atom-wise SMILES, SmilesPE, SELFIES, DeepSMILES) on canonicalization, retrosynthesis, and property prediction.</p>
</li>
<li>
<p><strong>The fingerprint-like nature of AIS tokens</strong> enables direct use as molecular features for property prediction and provides resolution comparable to established circular fingerprints.</p>
</li>
<li>
<p><strong>The mapping is invertible</strong>, so AIS strings can always be converted back to valid SMILES. This is a practical advantage over approaches that may lose structural information.</p>
</li>
</ol>
<p><strong>Limitations</strong>: AIS cannot distinguish environmentally identical substructures or atoms related by a molecular symmetry plane, since it only considers nearest-neighbor environments. Performance on long-chain molecules (e.g., lipids) is similar across all tokenization schemes, suggesting that local environment encoding is less informative for repetitive linear structures.</p>
<p><strong>Future directions</strong>: The authors suggest AIS has potential for broader adoption in molecular generative models, chemical translation, and property prediction tasks across the cheminformatics community.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonicalization training</td>
          <td>GDB-13 subsets</td>
          <td>1M + 150K augmented</td>
          <td>Cumulative structural constraints a-h</td>
      </tr>
      <tr>
          <td>Canonicalization testing</td>
          <td>GDB-13 disjoint test sets</td>
          <td>20K per subset</td>
          <td>Various restriction levels</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50K</td>
          <td>~50K reactions</td>
          <td>Sequences &gt; 150 tokens removed</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipophilicity, BBBP, BACE, HIV)</td>
          <td>Varies</td>
          <td>Standard benchmark splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder architecture for canonicalization and retrosynthesis tasks</li>
<li>200,000 training steps with Adam optimizer, negative log-likelihood loss, cyclic learning rate scheduler</li>
<li>Random Forest with 5-fold cross-validation for property prediction</li>
<li>AIS tokenization implemented via RDKit for atom environment extraction</li>
</ul>
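<p>The 5-fold cross-validation used for the Random Forest experiments can be sketched as an index-splitting routine; in practice this would typically come from a library such as scikit-learn's <code>KFold</code> (an assumption, since the paper does not name its implementation):</p>

```python
def k_fold_indices(n_samples: int, k: int = 5):
    """Yield (train_idx, test_idx) pairs partitioning range(n_samples) into k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first few folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop

folds = list(k_fold_indices(10, k=5))
print([len(test) for _, test in folds])  # [2, 2, 2, 2, 2]
```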
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>String exact match (%)</td>
          <td>Canonicalization, Retrosynthesis</td>
          <td>Exact SMILES match</td>
      </tr>
      <tr>
          <td>Tanimoto exactness (Tc)</td>
          <td>Retrosynthesis</td>
          <td>Morgan FP radius 3, 2048 bits</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression property prediction</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification property prediction</td>
          <td>BBBP, BACE, HIV</td>
      </tr>
      <tr>
          <td>rep-l</td>
          <td>Token degeneration</td>
          <td>Single-token repetition count</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/snu-lcbc/atom-in-SMILES">atom-in-SMILES</a></td>
          <td>Code</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>AIS tokenization implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ucak, U. V., Ashyrmamatov, I., &amp; Lee, J. (2023). Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. <em>Journal of Cheminformatics</em>, 15, 55. <a href="https://doi.org/10.1186/s13321-023-00725-9">https://doi.org/10.1186/s13321-023-00725-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ucak2023improving,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00725-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of Molecular Representation Learning Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</guid><description>A systematic review of molecular representation learning foundation models for drug discovery, covering five modalities and four pretraining strategies.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-molecular-representation-foundation-models">A Systematization of Molecular Representation Foundation Models</h2>
<p>This paper is a <strong>Systematization</strong> that provides the first comprehensive review of foundation models for molecular representation learning (MRL). The authors classify existing models by their input modality (unimodal vs. multimodal), analyze four mainstream pretraining strategies, survey five downstream application domains, and propose practical guidelines for model selection. The review covers over 35 representative models published between 2020 and 2024, with parameter counts ranging from 2 million to over 1 trillion.</p>
<h2 id="why-a-systematic-review-of-mrl-foundation-models-is-needed">Why a Systematic Review of MRL Foundation Models Is Needed</h2>
<p>Molecular representation learning transforms molecular structures and properties into numerical vectors that serve as inputs for machine learning models. The field has evolved rapidly from molecular fingerprints through SMILES-based sequence models to graph neural networks and 3D geometry-aware architectures. Foundation models, characterized by large-scale pretraining on unlabeled molecular data followed by fine-tuning on downstream tasks, have introduced new opportunities for generalizability and transfer learning in drug discovery.</p>
<p>Despite this rapid progress, the authors identify a gap: no prior work has systematically reviewed MRL foundation models across all input modalities and pretraining paradigms. Existing surveys tend to focus on specific representations (e.g., graph-based methods) or specific applications (e.g., property prediction) without providing the cross-cutting perspective needed to guide model selection. This review fills that gap by offering a unified taxonomy and practical guidelines.</p>
<h2 id="taxonomy-of-molecular-descriptors-and-model-architectures">Taxonomy of Molecular Descriptors and Model Architectures</h2>
<p>The core organizational framework classifies models along two axes: the molecular descriptor used as input and the backbone architecture.</p>
<h3 id="molecular-descriptors">Molecular Descriptors</h3>
<p>The review identifies five primary descriptor types:</p>
<ol>
<li><strong>Molecular fingerprints</strong>: Binary vectors encoding structural features (e.g., Morgan fingerprints). Rarely used in foundation models due to information loss and dimensional complexity.</li>
<li><strong>1D sequences</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representations. SMILES is compact and widely used, but models that generate SMILES can emit invalid strings. SELFIES guarantees valid molecular strings by construction.</li>
<li><strong>2D topological graphs</strong>: Atoms as nodes, bonds as edges. Can be derived from SMILES via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, making graph datasets effectively interchangeable with SMILES datasets.</li>
<li><strong>3D geometry</strong>: Spatial coordinates capturing conformational information, energy states, and stereochemistry. Experimentally expensive to obtain, limiting dataset availability.</li>
<li><strong>Multimodal</strong>: Combinations of the above with text, IUPAC names, knowledge graphs, and molecular images.</li>
</ol>
<p>The paper also discusses mathematically abstract molecular representations. For example, the <a href="https://en.wikipedia.org/wiki/Wiener_index">Wiener index</a> quantifies structural complexity:</p>
<p>$$
W = \sum_{i &lt; j} d_{ij}
$$</p>
<p>where $d_{ij}$ is the topological distance (shortest bonding path length) between atoms $i$ and $j$.</p>
<p>Degree centrality captures local connectivity:</p>
<p>$$
C_{D}(v_{i}) = \sum_{j=1}^{n} A_{ij}
$$</p>
<p>where $A \in \mathbb{R}^{n \times n}$ is the molecular graph adjacency matrix.</p>
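<p>Both descriptors follow directly from the bond graph. A small sketch computing the topological distances by breadth-first search (generic code for intuition, not from the paper):</p>

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest bond-path lengths from `source` in an adjacency-list graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def wiener_index(adj):
    """Sum of topological distances d_ij over unordered atom pairs."""
    nodes = sorted(adj)
    return sum(bfs_distances(adj, i)[j] for i in nodes for j in nodes if i < j)

def degree_centrality(adj):
    """C_D(v_i): row sum of the adjacency matrix, i.e. the number of bonded neighbors."""
    return {v: len(adj[v]) for v in adj}

# n-butane carbon skeleton: a path graph 0-1-2-3.
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wiener_index(butane))       # 1+2+3 + 1+2 + 1 = 10
print(degree_centrality(butane))  # {0: 1, 1: 2, 2: 2, 3: 1}
```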
<h3 id="model-architectures">Model Architectures</h3>
<p>Models are classified into two primary categories:</p>
<p><strong>Unimodal-based models:</strong></p>
<ul>
<li><strong>Sequence-based</strong>: Transformer models operating on SMILES/SELFIES (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>, MolGEN, <a href="/notes/chemistry/llm-applications/llamsmol-instruction-tuning-chemistry/">LlaSMol</a>). These capture syntactic patterns but miss spatial and topological features.</li>
<li><strong>Topological graph-based</strong>: GNN variants (GIN, GCN, GAT) and Transformer-based graph models (Graphormer). GNNs capture local topology through message passing; Transformers overcome locality limitations through global self-attention.</li>
<li><strong>3D geometry-based</strong>: Models like Uni-Mol and 3D PGT that incorporate spatial coordinates. Uni-Mol uses distance-aware self-attention with an SE(3)-equivariant coordinate head for rotation/translation invariance.</li>
<li><strong>Image-based</strong>: CNN-based models (ImageMol) that process 2D molecular images using visual representation learning.</li>
</ul>
<p><strong>Multimodal-based models:</strong></p>
<ul>
<li><strong>Sequence + Graph</strong>: <a href="/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/">DVMP</a>, PanGu Drug Model. Combines the strengths of string and topological representations.</li>
<li><strong>Graph + 3D Geometry</strong>: GraphMVP, Transformer-M. Enriches topological features with spatial information.</li>
<li><strong>Text + Molecular Structure</strong>: KV-PLM, MolT5, MoleculeSTM, MolReGPT, Y-mol. Aligns molecular structural information with biomedical text through cross-modal learning.</li>
</ul>
<h2 id="four-pretraining-paradigms-for-mrl">Four Pretraining Paradigms for MRL</h2>
<p>The review systematically categorizes pretraining strategies into four paradigms:</p>
<h3 id="masked-language-modeling-mlm">Masked Language Modeling (MLM)</h3>
<p>The cornerstone strategy for sequence-based models. Randomly masks tokens in molecular sequences and trains the model to predict them. ChemBERTa pretrained on 77 million SMILES sequences from PubChem achieves 5-10% improvement in AUC-ROC on property prediction tasks compared to task-specific models. MLM captures local dependencies and global sequence patterns but cannot model spatial or topological features, making it best suited for unimodal sequence inputs.</p>
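<p>The masking step of MLM can be sketched in a few lines. The 15% mask rate and <code>[MASK]</code> token below follow the common BERT-style convention and are not quoted from any specific chemical model:</p>

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Corrupt a token sequence for MLM; return (corrupted, labels).

    `labels` maps each masked position to the original token the model
    must predict during pretraining.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = corrupted[pos]
        corrupted[pos] = mask_token
    return corrupted, labels

# Atom-wise tokenization of acetic acid, "CC(=O)O".
tokens = ["C", "C", "(", "=", "O", ")", "O"]
corrupted, labels = mask_tokens(tokens)
print(corrupted, labels)  # one position replaced by [MASK]; labels keeps the original
```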
<h3 id="contrastive-learning-cl">Contrastive Learning (CL)</h3>
<p>The dominant strategy for multimodal models. Constructs positive-negative sample pairs to align features across modalities or views. In unimodal settings, CL generates negative samples by perturbing molecular graphs. In multimodal settings, it aligns features from different modalities. GraphMVP, which contrasts 2D topological features with 3D spatial features, reduces RMSE by 15% on QM9 energy prediction compared to unimodal models. Performance depends heavily on the quality of positive sample construction.</p>
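<p>The contrastive objectives referenced here are typically InfoNCE-style losses: the positive pair's similarity is pushed up relative to the negatives. A dependency-free sketch for a single anchor (a generic formulation; the similarity function and temperature vary by model):</p>

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive similarity."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / temperature - log_denom)

# A well-aligned positive pair yields a lower loss than a poorly aligned one.
good = info_nce(pos_sim=0.9, neg_sims=[0.1, 0.0, -0.2])
bad = info_nce(pos_sim=0.2, neg_sims=[0.1, 0.0, -0.2])
print(good < bad)  # True
```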
<h3 id="reconstruction-based-pretraining-rbp">Reconstruction-Based Pretraining (RBP)</h3>
<p>Learns global molecular features by reconstructing original data from corrupted inputs. Tasks include node feature reconstruction, graph structure reconstruction, and coordinate/energy reconstruction. MGMAE masks more than 50% of nodes and edges in molecular graphs and trains the model to reconstruct them, achieving 94.2% AUC-ROC on BBBP. RBP captures global structural patterns but requires high model complexity and training cost.</p>
<h3 id="multimodal-alignment-pretraining-map">Multimodal Alignment Pretraining (MAP)</h3>
<p>Designed for multimodal inputs, aligning and fusing features from different modalities through cross-modal tasks. KV-PLM uses SMILES-to-text matching to align molecular structure and functional information. MAP fuses structural information (SMILES, graphs) with semantic information (text) but requires large-scale cross-modal labeled data, posing significant data acquisition challenges.</p>
<h2 id="downstream-applications-and-performance-benchmarks">Downstream Applications and Performance Benchmarks</h2>
<p>The review evaluates MRL foundation models across five application domains.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The most common benchmark for MRL models. The review provides comprehensive ROC-AUC comparisons across eight <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification datasets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>BBBP</th>
          <th>BACE</th>
          <th>ClinTox</th>
          <th>Tox21</th>
          <th>SIDER</th>
          <th>HIV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MGMAE</td>
          <td>Graph</td>
          <td>94.2</td>
          <td>92.7</td>
          <td>96.7</td>
          <td>86.0</td>
          <td>66.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MPG</td>
          <td>Graph</td>
          <td>92.2</td>
          <td>92.0</td>
          <td>96.3</td>
          <td>83.7</td>
          <td>66.1</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GROVER</td>
          <td>Graph+Trans.</td>
          <td>94.0</td>
          <td>89.4</td>
          <td>94.4</td>
          <td>83.1</td>
          <td>65.8</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MoLFormer</td>
          <td>Sequence</td>
          <td>93.7</td>
          <td>88.2</td>
          <td>94.8</td>
          <td>84.7</td>
          <td>69.0</td>
          <td>82.2</td>
      </tr>
      <tr>
          <td>MM-Deacon</td>
          <td>Seq.+IUPAC</td>
          <td>78.5</td>
          <td>-</td>
          <td>99.5</td>
          <td>-</td>
          <td>69.3</td>
          <td>80.1</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>3D</td>
          <td>72.9</td>
          <td>85.7</td>
          <td>91.9</td>
          <td>79.6</td>
          <td>65.9</td>
          <td>80.8</td>
      </tr>
      <tr>
          <td>DVMP</td>
          <td>Seq.+Graph</td>
          <td>77.8</td>
          <td>89.4</td>
          <td>95.6</td>
          <td>79.1</td>
          <td>69.8</td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>TxD-T-LLM</td>
          <td>Seq.+Text</td>
          <td>-</td>
          <td>-</td>
          <td>86.3</td>
          <td>88.2</td>
          <td>-</td>
          <td>73.2</td>
      </tr>
  </tbody>
</table>
<p>The table shows that no single architecture dominates across all datasets. Transformer- and GIN-based architectures with graph inputs generally perform well. The review notes that model effectiveness depends heavily on the dataset, with Mole-BERT encountering negative transfer due to a small and unbalanced atomic vocabulary.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>MolGEN (SELFIES-based, 8B parameters) achieves 100% validity on generated molecules. MolT5 excels at text-to-molecule generation. Uni-Mol generates 3D conformations with 97.95% coverage on QM9.</p>
<h3 id="drug-drug-interaction-prediction"><a href="https://en.wikipedia.org/wiki/Drug_interaction">Drug-Drug Interaction</a> Prediction</h3>
<p>MPG achieves 96.6% AUC-ROC on BIOSNAP by combining unsupervised pretraining with supervised fine-tuning and multi-task learning.</p>
<h3 id="retrosynthesis-prediction"><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a> Prediction</h3>
<p>DVMP achieves 66.5% top-1 accuracy on USPTO-50K when reaction types are provided as priors (54.2% without).</p>
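<p>Top-k accuracy, the standard retrosynthesis metric, counts a case as correct when the ground-truth reactant set appears among the model's k highest-ranked candidates. A minimal sketch (the SMILES strings below are illustrative only):</p>

```python
def top_k_accuracy(ranked_predictions, ground_truths, k=1):
    """Fraction of cases whose ground truth appears in the top-k candidates."""
    hits = sum(
        truth in candidates[:k]
        for candidates, truth in zip(ranked_predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Two reactions; candidate reactant sets ranked by model confidence.
preds = [["CC(=O)O.OCC", "CCO.CC=O"], ["c1ccccc1Br.CN", "c1ccccc1N.CBr"]]
truth = ["CC(=O)O.OCC", "c1ccccc1N.CBr"]
print(top_k_accuracy(preds, truth, k=1))  # 0.5
print(top_k_accuracy(preds, truth, k=2))  # 1.0
```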
<h3 id="drug-synergy-prediction">Drug Synergy Prediction</h3>
<p>SynerGPT (GPT-based) achieves 77.7% AUC-ROC in few-shot settings for novel drug combinations, outperforming baselines through contextual learning.</p>
<h2 id="guidelines-limitations-and-future-directions">Guidelines, Limitations, and Future Directions</h2>
<h3 id="model-selection-guidelines">Model Selection Guidelines</h3>
<p>The authors provide structured guidelines for choosing MRL foundation models based on:</p>
<ol>
<li><strong>Task objective</strong>: Property prediction favors GNNs or large pretrained frameworks (ChemBERTa-2, Uni-Mol). Generation tasks favor GPT-style autoregressive models (MolGEN). Retrosynthesis benefits from multimodal architectures.</li>
<li><strong>Data characteristics</strong>: SMILES/graph representations suit generation tasks. Knowledge graph-enhanced models benefit interaction and synergy prediction. Transfer learning helps data-limited scenarios.</li>
<li><strong>Interpretability needs</strong>: Transformer architectures are preferred when interpretability is required, as attention matrices enable visualization of learned molecular features.</li>
<li><strong>Computational budget</strong>: GIN-based models have $\mathcal{O}(|V| + |E|)$ complexity, while Transformer-based models scale as $\mathcal{O}(n^2 \cdot d)$.</li>
</ol>
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The review identifies five key challenges:</p>
<ol>
<li><strong>Multimodal data integration</strong>: Each representation paradigm has distinct limitations (1D neglects spatial configuration, 2D omits conformational details, 3D faces rotational invariance challenges). The authors propose incorporating <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> trajectories as a dynamic modality and using cross-modal data augmentation.</li>
<li><strong>Data scarcity</strong>: Semi-supervised learning can achieve more than 90% of fully supervised performance using only 10% labeled data on QM9. Cross-modal augmentation (e.g., 3D InfoMax) can generate plausible 3D conformers from 2D graphs.</li>
<li><strong>Interpretability</strong>: Current methods rely primarily on attention-based visualization, which is insufficient for multimodal models. The authors suggest assessing decision consistency across modalities and incorporating chemical knowledge graphs.</li>
<li><strong>Training efficiency</strong>: Large parameter counts demand distributed parallel training techniques, with data parallelism being the most common approach.</li>
<li><strong>Robustness and generalization</strong>: Strategies include data augmentation (multiple SMILES representations, 3D conformer generation), meta-learning for rapid adaptation, and sparse attention mechanisms to reduce sensitivity to irrelevant long-range interactions.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review paper, so standard reproducibility criteria for experimental papers do not directly apply. The review compiles results from the original publications of each surveyed model.</p>
<h3 id="data">Data</h3>
<p>The review catalogs 28 representative molecular datasets used by the surveyed foundation models:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Descriptor</th>
          <th>Primary Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>~118M</td>
          <td>SMILES, 3D, Image, IUPAC</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ZINC15</td>
          <td>~980M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>~2.4M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>133,884</td>
          <td>SMILES</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450,000</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>950,000</td>
          <td>SMILES</td>
          <td>Reaction prediction</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>4M</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Z-dot-max/MRL_Foundation_Review">Review Materials (GitHub)</a></td>
          <td>Code/Data</td>
          <td>Not specified</td>
          <td>Code and data tables for figures</td>
      </tr>
      <tr>
          <td><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12784970/">Paper (PMC)</a></td>
          <td>Paper</td>
          <td>CC-BY</td>
          <td>Open access via PubMed Central</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>All performance metrics reported in the review are directly cited from the original studies. The evaluation protocols follow each model&rsquo;s original setup. The review covers:</p>
<ul>
<li>ROC-AUC for classification tasks (property prediction, DDI, synergy)</li>
<li>RMSE/MAE for regression tasks</li>
<li>Validity and novelty for molecular generation</li>
<li>Top-k accuracy for retrosynthesis</li>
<li>COV and MAT for conformation generation</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, B., Zhang, J., Liu, Y., Liu, Y., Jiang, J., Yuan, S., Zhen, X., &amp; Liu, Y. (2025). A systematic review of molecular representation learning foundation models. <em>Briefings in Bioinformatics</em>, 27(1), bbaf703. <a href="https://doi.org/10.1093/bib/bbaf703">https://doi.org/10.1093/bib/bbaf703</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{song2025systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of molecular representation learning foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Song, Bosheng and Zhang, Jiayi and Liu, Ying and Liu, Yuansheng and Jiang, Jing and Yuan, Sisi and Zhen, Xia and Liu, Yiping}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbaf703}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbaf703}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SELFIES and the Future of Molecular String Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2022/</link><pubDate>Tue, 02 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2022/</guid><description>Perspective on SELFIES as a 100% robust SMILES alternative, with 16 future research directions for molecular AI.</description><content:encoded><![CDATA[<h2 id="position-a-roadmap-for-robust-chemical-languages">Position: A Roadmap for Robust Chemical Languages</h2>
<p>This is a <strong>Position</strong> paper (perspective) that proposes a research agenda for molecular representations in AI. It reviews the evolution of chemical notation over 250 years and argues for extending SELFIES-style robust representations beyond traditional organic chemistry into polymers, crystals, reactions, and other complex chemical systems.</p>
<h2 id="the-generative-bottleneck-in-traditional-representations">The Generative Bottleneck in Traditional Representations</h2>
<p>While SMILES has been the standard molecular representation since 1988, its fundamental weakness for machine learning is well-established: randomly generated SMILES strings are often invalid. The motivation is twofold:</p>
<ol>
<li><strong>Current problem</strong>: Traditional representations (SMILES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>, DeepSMILES) lack 100% robustness; random mutations or generations can produce invalid strings, limiting their use in generative AI models.</li>
<li><strong>Future opportunity</strong>: SELFIES solved this for small organic molecules, but many important chemical domains (polymers, crystals, reactions) still lack robust representations, creating a bottleneck for AI-driven discovery in these areas.</li>
</ol>
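<p>The robustness idea can be made concrete with a toy derivation-rule decoder: each symbol requests a bond, and the request is clipped or dropped whenever it would exceed an atom's valence, so <em>every</em> symbol stream decodes to a valid (toy) structure. This is a deliberately simplified illustration for intuition, not the actual SELFIES grammar:</p>

```python
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def toy_decode(symbols):
    """Decode (atom, requested_bond_order) pairs into a valence-legal chain.

    The bond order actually used is clipped to what the previous atom and
    the new atom can still accept; saturated positions drop the symbol.
    Because of this clipping, no input stream can produce an invalid graph.
    """
    atoms, bonds, remaining = [], [], []
    for atom, requested in symbols:
        if not atoms:
            atoms.append(atom)
            remaining.append(MAX_VALENCE[atom])
            continue
        order = min(requested, remaining[-1], MAX_VALENCE[atom])
        if order == 0:
            continue  # previous atom is saturated; skip this symbol
        remaining[-1] -= order
        atoms.append(atom)
        remaining.append(MAX_VALENCE[atom] - order)
        bonds.append((len(atoms) - 2, len(atoms) - 1, order))
    return atoms, bonds

# A stream that tries to triple-bond fluorine is silently repaired.
atoms, bonds = toy_decode([("C", 1), ("F", 3), ("O", 2)])
print(atoms, bonds)  # ['C', 'F'] [(0, 1, 1)]
```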
<h2 id="16-concrete-research-directions-for-selfies">16 Concrete Research Directions for SELFIES</h2>
<p>The novelty is in the comprehensive research roadmap. The authors propose 16 concrete research projects organized around key themes:</p>
<ul>
<li><strong>Domain extension</strong>: Includes metaSELFIES for learning graph rules directly from data, BigSELFIES for stochastic polymers, and crystal structures via labeled quotient graphs.</li>
<li><strong>Chemical reactions</strong>: Robust reaction representations that enforce conservation laws.</li>
<li><strong>Programming perspective</strong>: Treating molecular representations as programming languages, potentially achieving Turing-completeness.</li>
<li><strong>Benchmarking</strong>: Systematic comparisons across representation formats.</li>
<li><strong>Interpretability</strong>: Understanding how humans and machines actually learn from different representations.</li>
</ul>
<h2 id="evidence-from-generative-case-studies">Evidence from Generative Case Studies</h2>
<p>This perspective paper grounds its argument in two generative case studies:</p>
<ol>
<li>
<p><strong>Pasithea (Deep Molecular Dreaming)</strong>: A generative model that first learns to predict a chemical property from a one-hot encoded SELFIES, then freezes the network weights and uses gradient descent on the one-hot input encoding to optimize molecular properties (logP). The target property increases or decreases nearly monotonically, demonstrating that the model has learned meaningful structure-property relationships from the SELFIES representation.</p>
</li>
<li>
<p><strong>DECIMER and STOUT</strong>: DECIMER (Deep lEarning for Chemical ImagE Recognition) is an image-to-structure tool, and STOUT (SMILES-TO-IUPAC-name Translator) translates between IUPAC names and molecular string representations. Both show improved performance when using SELFIES as an intermediate representation. STOUT internally converts SMILES to SELFIES before processing and decodes predicted SELFIES back to SMILES. These results suggest SELFIES provides a more learnable internal representation for sequence-to-sequence models.</p>
</li>
</ol>
<h2 id="strategic-outcomes-and-future-vision">Strategic Outcomes and Future Vision</h2>
<p>The paper establishes robust representations as a fundamental bottleneck in computational chemistry and proposes a clear path forward:</p>
<p><strong>Key outcomes</strong>:</p>
<ul>
<li>Identification of 16 concrete research projects spanning domain extension, benchmarking, and interpretability</li>
<li>Evidence that SELFIES enables capabilities (like smooth property optimization) impossible with traditional formats</li>
<li>Framework for thinking about molecular representations as programming languages</li>
</ul>
<p><strong>Strategic impact</strong>: The proposed extensions could enable new applications across drug discovery (efficient exploration beyond small molecules), materials design (systematic crystal structure discovery), synthesis planning (better reaction representations), and fundamental research (new ways to understand chemical behavior).</p>
<p><strong>Future vision</strong>: The authors emphasize that robust representations could become a bridge for bidirectional learning between humans and machines, enabling humans to learn new chemical concepts from AI systems.</p>
<h2 id="the-mechanism-of-robustness">The Mechanism of Robustness</h2>
<p>The key difference between SELFIES and other representations lies in how they handle syntax:</p>
<ul>
<li><strong>SMILES/DeepSMILES</strong>: Rely on non-local markers (opening/closing parentheses or ring numbers) that must be balanced. A mutation or random generation can easily break this balance, producing invalid strings.</li>
<li><strong>SELFIES</strong>: Uses a formal grammar (automaton) where derivation rules are entirely local. The critical innovation is <strong>overloading</strong>: a state-modifying symbol like <code>[Branch1]</code> starts a branch and changes the interpretation of the <em>next</em> symbol to represent a numerical parameter (the branch length).</li>
</ul>
<p>This overloading mechanism ensures that any arbitrary sequence of SELFIES tokens can be parsed into a valid molecular graph. The derivation can never fail because every symbol either adds an atom or modifies how subsequent symbols are interpreted.</p>
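<p>To make the mechanism concrete, here is a deliberately tiny Python sketch of local derivation with overloading. It is <em>not</em> the real <code>selfies</code> library; the token set, the <code>INDEX</code> mapping, and the <code>decode</code> function are illustrative assumptions only, showing why decoding can never fail.</p>

```python
# Toy sketch of SELFIES-style local derivation (NOT the real `selfies`
# library): every token either adds an atom or locally reinterprets the
# NEXT token, so any token sequence decodes without error.

ATOMS = {"[C]": "C", "[N]": "N", "[O]": "O"}
# Overloaded reading of a token as a number (e.g. a branch length).
INDEX = {"[C]": 0, "[N]": 1, "[O]": 2, "[Branch1]": 3}

def decode(tokens):
    """Decode ANY token sequence into atoms and branch markers."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "[Branch1]" and i + 1 < len(tokens):
            # Overloading: the next symbol is read as a number, not an atom.
            length = INDEX.get(tokens[i + 1], 0) + 1
            out.append(("branch", length))
            i += 2
        elif tok in ATOMS:
            out.append(ATOMS[tok])
            i += 1
        else:
            i += 1  # unknown or dangling tokens are skipped, never an error
    return out

# A randomly mutated sequence still decodes to something sensible:
print(decode(["[C]", "[Branch1]", "[O]", "[N]", "[O]"]))
# → ['C', ('branch', 3), 'N', 'O']
```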
<h2 id="the-16-research-projects-technical-details">The 16 Research Projects: Technical Details</h2>
<p>This section provides technical details on the proposed research directions:</p>
<h3 id="extending-to-new-domains">Extending to New Domains</h3>
<p><strong>metaSELFIES (Project 1)</strong>: The authors propose learning graph construction rules automatically from data. This could enable robust representations for any graph-based system, from quantum optics to biological networks, without needing domain-specific expertise.</p>
<p><strong>Token Optimization (Project 2)</strong>: SELFIES uses &ldquo;overloading&rdquo; where a symbol&rsquo;s meaning changes based on context. This project would investigate how this affects machine learning performance and whether the approach can be optimized.</p>
<h3 id="handling-complex-molecular-systems">Handling Complex Molecular Systems</h3>
<p><strong>BigSELFIES (Project 3)</strong>: Current representations struggle with large, often stochastic structures like polymers and biomolecules. BigSELFIES would combine hierarchical notation with stochastic building blocks to handle these complex systems where traditional small-molecule representations break down.</p>
<p><strong>Crystal Structures (Projects 4-5)</strong>: Crystals present unique challenges due to their infinite, periodic arrangements. An infinite net cannot be represented by a finite string directly. The proposed approach uses <strong>labeled quotient graphs (LQGs)</strong>, which are finite graphs that uniquely determine a periodic net. However, current SELFIES cannot represent LQGs because they lack symbols for edge directions and edge labels (vector shifts encoding periodicity). Extending SELFIES to handle these structures could enable AI-driven materials design without relying on predefined crystal structures, opening up systematic exploration of theoretical materials space.</p>
<p><strong>Beyond Organic Chemistry (Project 6)</strong>: Transition metals and main-group compounds feature complex bonding that breaks the simple two-center, two-electron model. The solution: use machine learning on large structural databases to automatically learn these complex bonding rules.</p>
<h3 id="chemical-reactions-and-programming-concepts">Chemical Reactions and Programming Concepts</h3>
<p><strong>Reaction Representations (Project 7)</strong>: Moving beyond static molecules to represent chemical transformations. A robust reaction format would enforce conservation laws and could learn reactivity patterns from large reaction datasets, improving synthesis planning.</p>
<h3 id="developing-a-100-robust-programming-language">Developing a 100% Robust Programming Language</h3>
<p><strong>Programming Language Perspective (Projects 8-9)</strong>: An intriguing reframing views molecular representations as programming languages executed by chemical parsers. This opens possibilities for adding loops, logic, and other programming concepts to efficiently describe complex structures. The ambitious goal is a Turing-complete programming language that is also 100% robust. While fascinating, it is worth critically noting that enforcing 100% syntactical robustness inherently restricts grammar flexibility. Can a purely robust string representation realistically describe highly fuzzy, delocalized electron bonds (like in Project 6) without becoming impractically long or collapsing into specialized sub-languages?</p>
<p><strong>Empirical Comparisons (Projects 10-11)</strong>: With multiple representation options (strings, matrices, images), we need systematic comparisons. The proposed benchmarks would go beyond simple validity metrics to focus on real-world design objectives in drug discovery, catalysis, and materials science.</p>
<p><strong>Human Readability (Project 12)</strong>: While SMILES is often called &ldquo;human-readable,&rdquo; this claim lacks scientific validation. The proposed study would test how well humans actually understand different molecular representations.</p>
<p><strong>Machine Learning Perspectives (Projects 13-16)</strong>: These projects explore how machines interpret molecular representations:</p>
<ul>
<li>Training networks to translate between formats to find universal representations</li>
<li>Comparing learning efficiency across different formats</li>
<li>Investigating latent space smoothness in generative models</li>
<li>Visualizing what models actually learn about molecular structure</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>Since this is a position paper outlining future research directions, standard empirical reproducibility metrics do not apply. However, the foundational tools required to pursue the proposed roadmap are open-source.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">aspuru-guzik-group/selfies</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Core SELFIES Python library, installable via <code>pip install selfies</code></td>
      </tr>
      <tr>
          <td><a href="https://arxiv.org/abs/2204.00056">arXiv:2204.00056</a></td>
          <td>Paper</td>
          <td>N/A</td>
          <td>Open-access preprint of the published Patterns article</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., Friederich, P., Gaudin, T., Gayle, A. A., Jablonka, K. M., Lameiro, R. F., Lemm, D., Lo, A., Moosavi, S. M., Nápoles-Duarte, J. M., Nigam, A., Pollice, R., Rajan, K., Schatzschneider, U., &hellip; Aspuru-Guzik, A. (2022). SELFIES and the future of molecular string representations. <em>Patterns</em>, <em>3</em>(10). <a href="https://doi.org/10.1016/j.patter.2022.100588">https://doi.org/10.1016/j.patter.2022.100588</a></p>
<p><strong>Publication</strong>: Patterns 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{SELFIES and the future of molecular string representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">ISSN</span> = <span style="color:#e6db74">{2666-3899}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{http://dx.doi.org/10.1016/j.patter.2022.100588}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">DOI</span> = <span style="color:#e6db74">{10.1016/j.patter.2022.100588}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Patterns}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Elsevier BV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krenn, Mario and Ai, Qianxiang and Barthel, Senja and Carson, Nessa and Frei, Angelo and Frey, Nathan C. and Friederich, Pascal and Gaudin, Théophile and Gayle, Alberto Alexander and Jablonka, Kevin Maik and Lameiro, Rafael F. and Lemm, Dominik and Lo, Alston and Moosavi, Seyed Mohamad and Nápoles-Duarte, José Manuel and Nigam, AkshatKumar and Pollice, Robert and Rajan, Kohulan and Schatzschneider, Ulrich and Schwaller, Philippe and Skreta, Marta and Smit, Berend and Strieth-Kalthoff, Felix and Sun, Chong and Tom, Gary and von Rudorff, Guido Falk and Wang, Andrew and White, Andrew and Young, Adamo and Yu, Rose and Aspuru-Guzik, Alán}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{100588}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">Original SELFIES Paper</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES Overview</a></li>
</ul>
]]></content:encoded></item><item><title>Invalid SMILES Benefit Chemical Language Models: A Study</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</link><pubDate>Tue, 02 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</guid><description>Skinnider (2024) shows that generating invalid SMILES actually improves chemical language model performance through quality filtering.</description><content:encoded><![CDATA[<h2 id="core-contribution-repurposing-invalid-smiles">Core Contribution: Repurposing Invalid SMILES</h2>
<p>This is an <strong>Empirical</strong> paper that challenges a fundamental assumption in the field of chemical language models. Skinnider provides both empirical evidence and mechanistic explanations for why the ability to generate &ldquo;invalid&rdquo; SMILES strings is beneficial for model performance.</p>
<h2 id="the-problem-with-absolute-validity-in-chemical-lms">The Problem with Absolute Validity in Chemical LMs</h2>
<p>Prior research attempted to eliminate invalid generations using constrained representations like SELFIES. This paper demonstrates that invalid outputs serve as low-likelihood samples whose removal acts as an implicit quality filter, improving distribution learning.</p>
<h2 id="invalid-generation-as-an-implicit-quality-filter">Invalid Generation as an Implicit Quality Filter</h2>
<p>The central insight is counterintuitive: <strong>invalid SMILES generation acts as a built-in quality control mechanism</strong>. The key contributions are:</p>
<ol>
<li>
<p><strong>Empirical Evidence</strong>: Direct comparisons showing that SMILES-based models consistently outperform SELFIES-based models across multiple metrics, with performance gains strongly correlated with the proportion of invalid outputs generated.</p>
</li>
<li>
<p><strong>Mechanistic Explanation</strong>: Invalid SMILES are demonstrated to be low-likelihood samples from the model&rsquo;s probability distribution. When these are filtered out, it&rsquo;s equivalent to removing the model&rsquo;s least confident predictions, a form of automatic quality control.</p>
</li>
<li>
<p><strong>Causal Evidence</strong>: By modifying SELFIES to allow invalid generation (through relaxed constraints), the author shows that performance improves when models can generate and discard invalid outputs, directly proving the causal relationship.</p>
</li>
<li>
<p><strong>Bias Analysis</strong>: SELFIES models are shown to introduce systematic structural biases (fewer aromatic rings, more aliphatic rings) due to their validity constraints, limiting their ability to explore chemical space naturally.</p>
</li>
</ol>
<h2 id="experimental-design-and-causal-interventions">Experimental Design and Causal Interventions</h2>
<p>The paper uses a multi-pronged approach to establish both correlation and causation:</p>
<p><strong>Performance Comparisons</strong>: SMILES and SELFIES models were trained on identical datasets and evaluated using distribution-learning metrics like Fréchet ChemNet distance. The comparison was robust across different architectures, training set sizes, and chemical databases.</p>
<p><strong>Loss Analysis</strong>: The relationship between SMILES validity and model confidence was examined by analyzing the sequence loss. For a given SMILES string $S$ composed of tokens $t_1, t_2, \dots, t_N$, the negative log-likelihood acts as a proxy for the model&rsquo;s uncertainty:</p>
<p>$$ \text{NLL}(S) = -\sum_{i=1}^N \log P(t_i \mid t_1, \dots, t_{i-1}) $$</p>
<p>Invalid SMILES strings consistently register higher $\text{NLL}$ scores, meaning they represent the model&rsquo;s least confident predictions. Filtering them effectively acts as automatic quality control, providing the mechanistic explanation for why invalid filtering improves performance.</p>
<p><strong>Causal Intervention</strong>: A key experiment involved modifying the SELFIES valency constraints at two levels: first allowing pentavalent carbons (&ldquo;Texas SELFIES&rdquo;), then removing all constraints entirely (&ldquo;unconstrained SELFIES&rdquo;). This allowed direct testing of whether the ability to generate invalid outputs (which are then discarded) causally improves performance.</p>
<p><strong>Structural Bias Analysis</strong>: Generated molecules were analyzed for chemical features like ring types and bond patterns to quantify how validity constraints systematically distort the model&rsquo;s exploration of chemical space.</p>
<p><strong>Generalization Testing</strong>: Models were trained on subsets of chemical databases and tested on their ability to reproduce the broader chemical space, measuring how validity constraints affect generalization.</p>
<p><strong>Practical Application</strong>: The approach was tested on structure elucidation, using models to identify unknown molecules from minimal experimental data like mass spectrometry.</p>
<h2 id="key-findings-on-validity-constraints-and-bias">Key Findings on Validity Constraints and Bias</h2>
<p><strong>Superior Performance Across the Board</strong>: SMILES-based models consistently outperformed SELFIES models on distribution-learning tasks. Using metrics like Fréchet ChemNet distance, SMILES models generated molecules that more closely matched the statistical properties of their training data. This performance advantage was directly correlated with the proportion of invalid SMILES generated. Models that produced more invalid outputs performed better after filtering.</p>
<p><strong>Invalid SMILES Are Low-Confidence Predictions</strong>: The analysis revealed that invalid SMILES consistently have higher loss values than valid ones, meaning they represent the model&rsquo;s least confident predictions. This suggests that validity checking acts as an automatic confidence filter, removing low-quality samples without requiring explicit uncertainty estimation.</p>
<p><strong>Causal Evidence Through Unconstrained SELFIES</strong>: Direct causal evidence came from modifying SELFIES to allow invalid generation. When &ldquo;unconstrained SELFIES&rdquo; models could generate and discard invalid molecules, their performance improved, approaching that of SMILES models. This provides direct causal evidence that the ability to generate invalid outputs is what drives the performance gains.</p>
<p><strong>Validity Constraints Introduce Systematic Bias</strong>: SELFIES models showed clear structural biases compared to both training data and SMILES outputs. They generated fewer aromatic rings and more aliphatic structures, systematic distortions caused by the valency constraints used to ensure validity. These biases limit the model&rsquo;s ability to faithfully represent chemical space.</p>
<p><strong>Reduced Generalization</strong>: When trained on subsets of chemical databases, SMILES models could reproduce a larger portion of the complete chemical space compared to SELFIES models. Although SELFIES generated more valid molecules in absolute terms, their structural biases constrained exploration and limited generalization beyond the training set.</p>
<p><strong>Real-World Application Benefits</strong>: In structure elucidation tasks, identifying unknown molecules from experimental data like mass spectrometry, SMILES-based models significantly outperformed SELFIES models. This demonstrates that the benefits extend beyond academic benchmarks to practical applications.</p>
<p><strong>CASMI 2022 Benchmark</strong>: The language model trained on the LOTUS database was benchmarked against 19 submissions to the CASMI 2022 competition for structure elucidation of unknown compounds. Using only accurate mass as input (no MS/MS data), the model achieved competitive performance, highlighting the practical utility of the sampling-frequency-based approach for de novo structure elucidation.</p>
<p><strong>Computational Efficiency</strong>: Filtering invalid SMILES is computationally trivial. Parsing ten million SMILES strings with RDKit takes approximately 7.5 minutes on a single CPU, making the post-processing overhead negligible compared to model training and inference costs.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Primary Architecture (LSTM):</strong> The main results rely on a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units.</p>
<ul>
<li><strong>Structure:</strong> Three-layer LSTM with a hidden layer size of 1,024 dimensions</li>
<li><strong>Embedding:</strong> An embedding layer of 128 dimensions</li>
<li><strong>Decoder:</strong> A linear decoder layer outputs token probabilities</li>
</ul>
<p><strong>Secondary Architecture (Transformer/GPT):</strong> To confirm robustness across architectures, the author also used a Generative Pretrained Transformer (GPT) architecture adapted from MolGPT.</p>
<ul>
<li><strong>Structure:</strong> Eight transformer blocks</li>
<li><strong>Internals:</strong> Each block contains eight masked self-attention heads and a feed-forward network (1,024 dimensions) using GELU activation</li>
<li><strong>Embedding:</strong> 256 dimensions, concatenated with learned positional encodings</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Optimizer:</strong> Adam optimizer for both architectures with $\beta_1=0.9$ and $\beta_2=0.999$.</p>
<p><strong>Learning Rate:</strong></p>
<ul>
<li>LSTM: 0.001</li>
<li>Transformer: 0.0005</li>
</ul>
<p><strong>Batch Size:</strong> 64</p>
<p><strong>Loss Function:</strong> Cross-entropy loss of next-token prediction.</p>
<p><strong>Stopping Criteria:</strong> Early stopping using a validation set (10% of training data) with patience of 50,000 minibatches.</p>
<h3 id="data">Data</h3>
<p><strong>Primary Source:</strong> ChEMBL database (version 28).</p>
<p><strong>Preprocessing Pipeline:</strong></p>
<ul>
<li><strong>Cleaning:</strong> Removal of duplicate SMILES, salts, and solvents (retaining heavy fragments with $\geq 3$ heavy atoms)</li>
<li><strong>Filtering:</strong> Molecules with atoms other than {Br, C, Cl, F, H, I, N, O, P, S} were removed</li>
<li><strong>Normalization:</strong> Charged molecules were neutralized and converted to canonical SMILES</li>
</ul>
<p><strong>Training Subsets:</strong> Models were trained on random samples of 30,000, 100,000, and 300,000 molecules to test scalability.</p>
<p><strong>Generalization Data:</strong> To test generalization, models were also trained on the <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> database (enumerating drug-like molecules up to 13 heavy atoms).</p>
<p><strong>Structure Elucidation Data:</strong> For practical application tasks, models were trained on natural products (LOTUS, COCONUT), food compounds (FooDB), and environmental contaminants (NORMAN).</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric:</strong> Fréchet ChemNet Distance (FCD), measuring chemical similarity between generated molecules and the training set (lower is better).</p>
<p><strong>Secondary Metrics:</strong></p>
<ul>
<li><strong>Validity:</strong> Percentage of outputs parseable by RDKit</li>
<li><strong>Scaffold Similarity:</strong> Jensen-Shannon distances between Murcko scaffold compositions</li>
<li><strong>Physical Properties:</strong> Comparisons of molecular weight, LogP, topological polar surface area (TPSA), and ring counts (aromatic vs. aliphatic)</li>
<li><strong>Structure Elucidation:</strong> &ldquo;Top-k accuracy,&rdquo; the proportion of held-out molecules where the correct structure appeared in the model&rsquo;s top $k$ ranked outputs</li>
</ul>
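<p>The top-k metric can be sketched in a few lines, ranking candidate structures by sampling frequency as the paper describes; the example data and function names below are illustrative assumptions.</p>

```python
from collections import Counter

def top_k_accuracy(cases, k):
    """cases: list of (true_smiles, sampled_smiles_list) pairs.
    A case counts as a hit if the true structure appears among the
    top-k candidates ranked by how often the model sampled them."""
    hits = 0
    for truth, sampled in cases:
        ranked = [s for s, _ in Counter(sampled).most_common()]
        if truth in ranked[:k]:
            hits += 1
    return hits / len(cases)

cases = [
    ("CCO", ["CCO", "CCO", "CCN", "CCO"]),  # truth ranked 1st
    ("CCN", ["CCO", "CCO", "CCN"]),         # truth ranked 2nd
]
print(top_k_accuracy(cases, k=1))  # → 0.5
print(top_k_accuracy(cases, k=2))  # → 1.0
```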
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Nodes:</strong> Dell EMC C4140 GPU compute nodes</li>
<li><strong>GPUs:</strong> NVIDIA Tesla V100</li>
<li><strong>Compute Time:</strong> Parsing 10 million SMILES took ~7.5 minutes on a single CPU; SELFIES models required an average of 0.6 hours longer to train than SMILES models</li>
</ul>
<h3 id="replicability">Replicability</h3>
<p><strong>Code Availability:</strong> Source code and intermediate data are available via <a href="https://doi.org/10.5281/zenodo.10680855">Zenodo</a>. Pre-trained model weights are not provided in the archive, requiring researchers to train models from scratch using the included scripts to fully replicate the study.</p>
<p><strong>Data Availability:</strong> Training datasets and generated molecule samples (10 million from ChEMBL/GDB-13 models, 100 million from LOTUS/COCONUT/FooDB/NORMAN cross-validation folds) are available via <a href="https://doi.org/10.5281/zenodo.8321735">Zenodo</a>.</p>
<p><strong>Software Libraries:</strong></p>
<ul>
<li><strong>PyTorch:</strong> LSTM and Transformer implementations</li>
<li><strong>RDKit:</strong> SMILES parsing, validity checking, and property calculation</li>
<li><strong>SELFIES:</strong> Version 2.1.1 for conversion</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10680855">Source code (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training scripts, analysis code, and intermediate data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.8321735">Training and generated molecules (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Preprocessed training sets and sampled molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="implications-and-takeaways">Implications and Takeaways</h2>
<p>This work reframes how we think about &ldquo;errors&rdquo; in generative models. The key insight is that model outputs appearing incorrect often represent low-likelihood samples whose removal improves overall performance.</p>
<p>The findings suggest that the field&rsquo;s drive toward guaranteed validity leads to systematic biases. Letting models fail informatively and using those failures as quality signals can yield better distribution learning. This is relevant as the field moves toward larger, more capable models where such self-correction mechanisms become increasingly valuable.</p>
<p>For practitioners, the takeaway is to consider the role of invalid outputs before eliminating them. Filtering low-confidence generations provides automatic quality control that improves final results.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Machine Intelligence, 6(4), 437-448. <a href="https://doi.org/10.1038/s42256-024-00821-x">https://doi.org/10.1038/s42256-024-00821-x</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence (2024)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{skinnider2024invalid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Invalid SMILES are beneficial rather than detrimental to chemical language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Skinnider, Michael A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{437--448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group UK London}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES Notation: The Original Paper by Weininger (1988)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</guid><description>Weininger's 1988 paper introducing SMILES notation, a string-based molecular representation that became a standard in computational chemistry.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. <em>Journal of Chemical Information and Computer Sciences</em>, 28(1), 31-36. <a href="https://doi.org/10.1021/ci00057a005">https://doi.org/10.1021/ci00057a005</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1988</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation overview</a> - Modern usage summary</li>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES to 2D images</a> - Practical visualization tutorial</li>
</ul>
<h2 id="core-contribution-a-string-based-molecular-notation">Core Contribution: A String-Based Molecular Notation</h2>
<p>This is a <strong>Method</strong> paper that introduces a novel notation system for representing chemical structures as text strings. It establishes the encoding rules and input conventions for SMILES (Simplified Molecular Input Line Entry System), while explicitly deferring the canonicalization algorithm to subsequent papers in the series.</p>
<h2 id="the-computational-complexity-of-chemical-information-in-the-1980s">The Computational Complexity of Chemical Information in the 1980s</h2>
<p>As computers became central to chemical information processing in the 1980s, the field faced a fundamental problem: existing line notations were either too complex for chemists to use practically or too limited for computational applications. Previous systems required extensive training to write correctly and were prone to errors.</p>
<p>The goal was ambitious: create a system that could represent any molecule as a simple text string, making it both human-readable and machine-efficient. This would enable compact database storage, fast processing, and easy exchange between software systems.</p>
<h2 id="separating-input-rules-from-canonicalization">Separating Input Rules from Canonicalization</h2>
<p>Weininger&rsquo;s key insight was to separate the problem into two parts: create simple, flexible rules that chemists could easily learn for input, while deferring to the computer the complex task of generating a unique, canonical representation. This division of labor made SMILES both practical and powerful.</p>
<p>The specific innovations include:</p>
<ol>
<li><strong>Simple input rules</strong> - Chemists could write molecules intuitively (e.g., <code>CCO</code> or <code>OCC</code> for ethanol)</li>
<li><strong>Ring closure notation</strong> - Breaking one bond and marking ends with matching digits</li>
<li><strong>Implicit hydrogens</strong> - Automatic calculation based on standard valences keeps strings compact</li>
<li><strong>Algorithmic aromaticity detection</strong> - Automatic recognition of aromatic systems from Kekulé structures</li>
<li><strong>Human-readable output</strong> - Unlike binary formats, SMILES strings are readable and debuggable</li>
</ol>
<p><strong>Important scope note</strong>: This first paper in the series establishes the input syntax and encoding rules. The canonicalization algorithm (how to generate unique SMILES) is explicitly stated as the subject of following papers: &ldquo;specification of isomerisms, substructures, and unique SMILES generation are the subjects of following papers.&rdquo;</p>
<h2 id="demonstrating-notation-rules-across-molecular-classes">Demonstrating Notation Rules Across Molecular Classes</h2>
<p>The paper is primarily a specification document establishing notation rules. The methodology is demonstrated through worked examples showing how to encode various molecular structures:</p>
<ul>
<li><strong>Basic molecules</strong>: Ethane (<code>CC</code>), ethylene (<code>C=C</code>), acetylene (<code>C#C</code>)</li>
<li><strong>Branches</strong>: Isobutyric acid (<code>CC(C)C(=O)O</code>)</li>
<li><strong>Rings</strong>: Cyclohexane (<code>C1CCCCC1</code>), benzene (<code>c1ccccc1</code>)</li>
<li><strong>Aromatic systems</strong>: Tropone (<code>O=c1cccccc1</code>), quinone (showing exocyclic bond effects)</li>
<li><strong>Complex structures</strong>: Morphine (40 characters vs 1000-2000 for connection tables)</li>
<li><strong>Edge cases</strong>: Salts, isotopes, charged species, tautomers</li>
</ul>
<p>Performance comparisons are mentioned qualitatively: SMILES processing was approximately 100 times faster than traditional connection table methods on the hardware of the era (1988), with dramatic reductions in storage space.</p>
<h2 id="performance-and-practical-viability">Performance and Practical Viability</h2>
<p>The paper successfully establishes SMILES as a practical notation system with several key outcomes:</p>
<p><strong>Practical benefits</strong>:</p>
<ul>
<li><strong>Compactness</strong>: 40 characters for morphine vs 1000-2000 for connection tables</li>
<li><strong>Speed</strong>: ~100x faster processing than traditional methods</li>
<li><strong>Accessibility</strong>: Simple enough for chemists to learn without extensive training</li>
<li><strong>Machine-friendly</strong>: Efficient parsing and string-based operations</li>
</ul>
<p><strong>Design principles validated</strong>:</p>
<ul>
<li>Separating user input from canonical representation makes the system both usable and rigorous</li>
<li>Implicit hydrogens reduce string length without loss of information</li>
<li>Ring closure notation with digit markers is more intuitive than complex graph syntax</li>
<li>Automatic aromaticity detection handles most cases correctly</li>
</ul>
<p><strong>Acknowledged limitations</strong>:</p>
<ul>
<li>Canonicalization algorithm not included in this paper</li>
<li>Stereochemistry handling deferred to subsequent papers</li>
<li>Some edge cases (like unusual valence states) require explicit specification</li>
</ul>
<p>The paper concludes by positioning SMILES as a foundation for database storage, substructure searching, and chemical informatics applications - a vision that proved accurate as SMILES became one of the most widely used molecular representations in computational chemistry.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To implement the method described in this paper, the following look-up tables and algorithms are required. <strong>Note</strong>: These details are critical for replication but are often glossed over in high-level summaries.</p>
<h3 id="1-the-valence-look-up-table">1. The Valence Look-Up Table</h3>
<p>To calculate implicit hydrogens, the system assumes the &ldquo;lowest normal valence&rdquo; greater than or equal to the explicit bond count. The paper explicitly defines these valences:</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Allowed Valences</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>B</td>
          <td>3</td>
      </tr>
      <tr>
          <td>C</td>
          <td>4</td>
      </tr>
      <tr>
          <td>N</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>O</td>
          <td>2</td>
      </tr>
      <tr>
          <td>P</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>S (aliphatic)</td>
          <td>2, 4, 6</td>
      </tr>
      <tr>
          <td>S (aromatic)</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>F, Cl, Br, I</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p><strong>Example</strong>: For sulfur in $\text{H}_2\text{SO}_4$ written as <code>OS(=O)(=O)O</code>, the explicit bond count is 6 (two single bonds + two double bonds to four oxygens), so the system uses valence 6 with zero implicit hydrogens. Without knowing S allows valence 6, the algorithm would fail.</p>
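<p>The valence look-up and implicit-hydrogen rule can be sketched directly in code. This is a minimal illustration of the rule as stated above (aliphatic sulfur only; the function name and structure are my own, not from the paper):</p>

```python
# Implicit-hydrogen rule from the valence table: use the lowest
# "normal valence" that is >= the explicit bond-order sum.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6),  # aliphatic sulfur
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(element, bond_order_sum):
    """Number of implicit hydrogens for an organic-subset atom."""
    for valence in NORMAL_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bonds exceed every listed valence: add no hydrogens
```

<p>With this rule, a bare <code>C</code> (bond-order sum 0) receives 4 hydrogens (methane), and the sulfur in <code>OS(=O)(=O)O</code> (bond-order sum 6) receives none.</p>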
<h3 id="2-explicit-hydrogen-requirements">2. Explicit Hydrogen Requirements</h3>
<p>The paper lists exactly three cases where hydrogen atoms are retained (not suppressed):</p>
<ol>
<li><strong>Hydrogen connected to other hydrogen</strong> (molecular hydrogen, $\text{H}_2$, written as <code>[H][H]</code>)</li>
<li><strong>Hydrogen connected to zero or more than one other atom</strong> (bridging hydrogens, isolated protons)</li>
<li><strong>Isotopic hydrogen specifications</strong> in isomeric SMILES (deuterium <code>[2H]</code>, tritium <code>[3H]</code>)</li>
</ol>
<p>For all other cases, hydrogens are implicit and calculated from the valence table.</p>
<h3 id="3-ring-closure-notation">3. Ring Closure Notation</h3>
<p>Standard SMILES supports single digits <code>1-9</code> for ring closures. For rings numbered 10 and higher, the notation requires a <strong>percent sign prefix</strong>:</p>
<ul>
<li>Ring closures 1-9: <code>C1CCCCC1</code></li>
<li>Ring closures 10+: <code>C%10CCCCC%10</code>, <code>C2%13%24</code> (ring 2, ring 13, ring 24)</li>
</ul>
<p>Without this rule, a parser would fail on large polycyclic structures.</p>
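<p>A tiny scanner for ring-closure labels shows how the <code>%</code> rule resolves the ambiguity. This sketch assumes bracket atoms (e.g. <code>[13C]</code>) have already been stripped, since digits inside brackets are isotopes or charges, not ring bonds:</p>

```python
import re

# Ring-closure labels: a single digit 1-9, or %NN for two-digit labels.
# Sketch only: assumes bracket atoms were removed beforehand.
RING_LABEL = re.compile(r"%(\d\d)|(\d)")

def ring_closure_labels(smiles):
    return [int(two or one) for two, one in RING_LABEL.findall(smiles)]
```

<p>For example, <code>C2%13%24</code> yields the labels 2, 13, and 24, while a naive single-digit scan would misread it as 2, 1, 3, 2, 4.</p>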
<h3 id="4-aromaticity-detection-algorithm">4. Aromaticity Detection Algorithm</h3>
<p>The system uses an extended version of Hückel&rsquo;s Rule ($4N+2$ π-electrons). The &ldquo;excess electron&rdquo; count for the aromatic system is determined by these rules:</p>
<p><strong>Carbon contribution</strong>:</p>
<ul>
<li><strong>C in aromatic ring</strong>: Contributes 1 electron</li>
<li><strong>C double-bonded to exocyclic electronegative atom</strong> (e.g., $\text{C}=\text{O}$ in quinone): Contributes 0 electrons (the carbon &ldquo;loses&rdquo; its electron to the oxygen)</li>
</ul>
<p><strong>Heteroatom contribution</strong>:</p>
<ul>
<li><strong>O, S in ring</strong>: Contributes 2 electrons (lone pair)</li>
<li><strong>N in ring</strong>: Contributes 1 electron (pyridine-like) or 2 electrons (pyrrole-like, must have explicit hydrogen <code>[nH]</code>)</li>
</ul>
<p><strong>Charge effects</strong>:</p>
<ul>
<li><strong>Positive charge</strong>: Reduces electron count by 1</li>
<li><strong>Negative charge</strong>: Increases electron count by 1</li>
</ul>
<p><strong>Critical example - Quinone</strong>:</p>
<pre tabindex="0"><code>O=C1C=CC(=O)C=C1
</code></pre><p>Quinone has 6 carbons in the ring, but the two carbons bonded to exocyclic oxygens contribute 0 electrons each. The four remaining carbons contribute 4 electrons total (not 6), so quinone is <strong>not aromatic</strong> by this algorithm. This exocyclic bond rule is essential for correct aromaticity detection.</p>
<p><strong>Aromatic ring test</strong>:</p>
<ol>
<li>All atoms must be sp² hybridized</li>
<li>Count excess electrons using the rules above</li>
<li>Calculate whether the system complies with Hückel&rsquo;s parity rule constraint:
$$ \text{Excess Electrons} \equiv 2 \pmod 4 \iff \text{Excess Electrons} = 4N + 2 $$
If the electron count satisfies this property for some integer $N$, the ring is determined to be aromatic.</li>
</ol>
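<p>The counting rules can be condensed into a short function. The per-atom contributions below are exactly the ones listed in this section; sp&sup2; detection and ring perception are omitted (this is a sketch of the counting step, not the full algorithm):</p>

```python
# Excess-electron counting for one ring, per the contribution rules above.
# Atom kinds: 'C', 'C=X' (carbon double-bonded to an exocyclic
# electronegative atom), 'O', 'S', 'n' (pyridine-like), '[nH]' (pyrrole-like).
CONTRIBUTION = {"C": 1, "C=X": 0, "O": 2, "S": 2, "n": 1, "[nH]": 2}

def ring_excess_electrons(atoms):
    """atoms: list of (kind, formal_charge); a + charge removes an electron."""
    return sum(CONTRIBUTION[kind] - charge for kind, charge in atoms)

def is_huckel_aromatic(excess_electrons):
    # 4N + 2  <=>  excess electrons congruent to 2 (mod 4)
    return excess_electrons % 4 == 2
```

<p>Benzene counts 6 electrons (aromatic); quinone's ring counts 0 + 0 + 1 + 1 + 1 + 1 = 4 (not aromatic), reproducing the example above.</p>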
<h2 id="encoding-rules-reference">Encoding Rules Reference</h2>
<p>The following sections provide a detailed reference for the six fundamental SMILES encoding rules. These are the rules a user would apply when writing SMILES strings.</p>
<h3 id="1-atoms">1. Atoms</h3>
<p>Atoms use their standard chemical symbols. Elements in the &ldquo;organic subset&rdquo; (B, C, N, O, P, S, F, Cl, Br, I) can be written directly when they have their most common valence - so <code>C</code> automatically means a carbon with enough implicit hydrogens to satisfy its valence.</p>
<p>Everything else goes in square brackets: <code>[Au]</code> for gold, <code>[NH4+]</code> for ammonium ion, or <code>[13C]</code> for carbon-13. Aromatic atoms get lowercase letters: <code>c</code> for aromatic carbon in benzene.</p>
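<p>These two atom classes imply a simple tokenization order, sketched below: bracket atoms are taken whole, and two-letter symbols (Cl, Br) must be tried before one-letter ones. This is an illustration of the atom rule, not a complete SMILES lexer:</p>

```python
import re

# Atom tokenizer sketch: bracket atoms first, then Cl/Br before the
# one-letter organic subset, then aromatic lowercase forms.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def atom_tokens(smiles):
    return ATOM.findall(smiles)
```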
<h3 id="2-bonds">2. Bonds</h3>
<p>Bond notation is straightforward:</p>
<ul>
<li><code>-</code> for single bonds (usually omitted)</li>
<li><code>=</code> for double bonds</li>
<li><code>#</code> for triple bonds</li>
<li><code>:</code> for aromatic bonds (also usually omitted)</li>
</ul>
<p>So <code>CC</code> and <code>C-C</code> both represent ethane, while <code>C=C</code> is ethylene.</p>
<h3 id="3-branches">3. Branches</h3>
<p>Branches use parentheses, just like in mathematical expressions. Isobutyric acid becomes <code>CC(C)C(=O)O</code> - the main chain is <code>CCC(=O)O</code>, with the methyl <code>(C)</code> branch attached to the second carbon.</p>
<h3 id="4-rings">4. Rings</h3>
<p>This is where SMILES gets clever. You break one bond and mark both ends with the same digit. Cyclohexane becomes <code>C1CCCCC1</code> - the <code>1</code> connects the first and last carbon, closing the ring.</p>
<p>You can reuse digits for different rings in the same molecule, making complex structures manageable.</p>
<h3 id="5-disconnected-parts">5. Disconnected Parts</h3>
<p>Salts and other disconnected structures use periods. Sodium phenoxide: <code>[Na+].[O-]c1ccccc1</code>. The order doesn&rsquo;t matter - you&rsquo;re just listing the separate components.</p>
<h3 id="6-aromaticity">6. Aromaticity</h3>
<p>Aromatic rings can be written directly with lowercase letters. Benzoic acid becomes <code>c1ccccc1C(=O)O</code>. The system can also detect aromaticity automatically from Kekulé structures, so <code>C1=CC=CC=C1C(=O)O</code> works just as well.</p>
<h3 id="simplified-subset-for-organic-chemistry">Simplified Subset for Organic Chemistry</h3>
<p>Weininger recognized that most chemists work primarily with organic compounds, so he defined a simplified subset that covers the vast majority of cases. For organic molecules, you only need four rules:</p>
<ol>
<li><strong>Atoms</strong>: Use standard symbols (C, N, O, etc.)</li>
<li><strong>Multiple bonds</strong>: Use <code>=</code> and <code>#</code> for double and triple bonds</li>
<li><strong>Branches</strong>: Use parentheses <code>()</code></li>
<li><strong>Rings</strong>: Use matching digits</li>
</ol>
<p>This &ldquo;basic SMILES&rdquo; covers the vast majority of organic compounds, making the system immediately accessible without having to learn all the edge cases.</p>
<h2 id="design-decisions-and-edge-cases">Design Decisions and Edge Cases</h2>
<p>Beyond the basic rules, the paper established several important conventions for handling ambiguous cases:</p>
<h3 id="hydrogen-handling">Hydrogen Handling</h3>
<p>Hydrogens are usually implicit - the system calculates how many each atom needs based on standard valences. So <code>C</code> represents CH₄, <code>N</code> represents NH₃, and so on. This keeps strings compact and readable.</p>
<p>Explicit hydrogens only appear in special cases: when hydrogen connects to multiple atoms, when you need to specify an exact count, or in isotopic specifications like <code>[2H]</code> for deuterium.</p>
<h3 id="bond-representation">Bond Representation</h3>
<p>The paper made an important choice about how to represent bonds in ambiguous cases. For example, nitromethane could be written as charge-separated <code>C[N+](=O)[O-]</code> or with covalent double bonds <code>CN(=O)=O</code>. Weininger chose to prefer the covalent form when possible, because it preserves the correct topological symmetry.</p>
<p>However, when covalent representation would require unusual valences, charge separation is preferred. Diazomethane becomes <code>C=[N+]=[N-]</code> to avoid forcing carbon into an unrealistic valence state.</p>
<h3 id="tautomers">Tautomers</h3>
<p>SMILES doesn&rsquo;t try to be too clever about tautomers - it represents exactly what you specify. So 2-pyridone can be written as either the enol form <code>Oc1ncccc1</code> or the keto form <code>O=c1[nH]cccc1</code>. The system won&rsquo;t automatically convert between them.</p>
<p>This explicit approach means you need to decide which tautomeric form to represent, but it also means the notation precisely captures what you intend.</p>
<h3 id="aromaticity-detection">Aromaticity Detection</h3>
<p>One of the most sophisticated parts of the original system was automatic aromaticity detection. The algorithm uses an extended Hückel rule: a ring is aromatic if all atoms are sp² hybridized and it contains 4N+2 π-electrons.</p>
<p>This means you can input benzene as the Kekulé structure <code>C1=CC=CC=C1</code> and the system will automatically recognize it as aromatic and convert it to <code>c1ccccc1</code>. The algorithm handles complex cases like tropone (<code>O=c1cccccc1</code>) and correctly identifies them as aromatic.</p>
<h3 id="aromatic-nitrogen">Aromatic Nitrogen</h3>
<p>The system makes an important distinction for nitrogen in aromatic rings. Pyridine-type nitrogen (like in pyridine itself) is written as <code>n</code> and has no attached hydrogens. Pyrrole-type nitrogen has an attached hydrogen that must be specified explicitly: <code>[nH]1cccc1</code> for pyrrole.</p>
<p>This distinction captures the fundamental difference in electron contribution between these two nitrogen types in aromatic systems.</p>
<h2 id="impact-and-legacy">Impact and Legacy</h2>
<p>Nearly four decades later, SMILES remains one of the most widely used molecular notations in computational chemistry. The notation became the foundation for:</p>
<ul>
<li><strong>Database storage</strong> - Compact, searchable molecular representations</li>
<li><strong>Substructure searching</strong> - Pattern matching in chemical databases</li>
<li><strong>Property prediction</strong> - Input format for QSAR models</li>
<li><strong>Chemical informatics</strong> - Standard exchange format between software</li>
<li><strong>Modern ML</strong> - Text-based representation for neural networks</li>
</ul>
<p>While newer approaches like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> have addressed some limitations (like the possibility of invalid strings), SMILES&rsquo; combination of simplicity and power has made it enduringly useful.</p>
<p>The paper established both a notation system and a design philosophy: chemical informatics tools should be powerful enough for computers while remaining accessible to working chemists. That balance remains relevant today as we develop new molecular representations for machine learning and AI applications.</p>
]]></content:encoded></item><item><title>SELFIES: The Original Paper on Robust Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</guid><description>The 2020 paper introducing SELFIES, the 100% robust molecular representation that solves SMILES validity problems in ML applications.</description><content:encoded><![CDATA[<h2 id="contribution-a-100-robust-representation-for-ml">Contribution: A 100% Robust Representation for ML</h2>
<p>This is a <strong>Method</strong> paper that introduces a new molecular string representation designed specifically for machine learning applications.</p>
<h2 id="motivation-the-invalidity-bottleneck">Motivation: The Invalidity Bottleneck</h2>
<p>When neural networks generate molecules using <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation</a>, a large fraction of the output strings are invalid, containing either syntax errors or chemically impossible structures. This was a fundamental bottleneck: if your generative model produces a large fraction of invalid molecules, you are wasting computational effort and severely limiting chemical space exploration.</p>
<h2 id="novelty-a-formal-grammar-approach">Novelty: A Formal Grammar Approach</h2>
<p>The authors&rsquo; key insight was using a <strong>formal grammar approach</strong> (specifically, a Chomsky type-2, context-free grammar with self-referencing functions) where each symbol is interpreted based on chemical context. The &ldquo;state of the derivation&rdquo; tracks available valence bonds, preventing impossible structures like a carbon with five single bonds.</p>
<p>For example, generating 2-Fluoroethenimine (<code>FC=C=N</code>) follows a state derivation where each step restricts the available valency for the next element:</p>
<p>$$
\mathbf{X}_0 \xrightarrow{[F]} \text{F } \mathbf{X}_1 \xrightarrow{[=C]} \text{FC } \mathbf{X}_3 \xrightarrow{[=C]} \text{FC=C } \mathbf{X}_2 \xrightarrow{[\#N]} \text{FC=C=N}
$$</p>
<p>This approach guarantees 100% validity: every SELFIES string corresponds to a valid molecule, and every valid molecule can be represented.</p>
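<p>A toy decoder makes the state mechanics concrete. This is a deliberately simplified sketch of the idea, not the actual SELFIES grammar: it handles only linear chains of a few elements, with no rings or branches:</p>

```python
# Toy SELFIES-style decoder: the state X_n is the number of bonds still
# available at the growth point; each symbol's requested bond order is
# capped by it, so every symbol sequence yields a valid chain.
VALENCE = {"F": 1, "O": 2, "N": 3, "C": 4}
BOND = {1: "", 2: "=", 3: "#"}

def decode(symbols):
    """symbols: list of (requested_bond_order, element) -> SMILES chain."""
    out, state = [], None
    for requested, element in symbols:
        if state is None:                  # first atom has no incoming bond
            order = 0
        else:
            order = min(requested, state)  # cap by the remaining valence
            if order == 0:
                break                      # state X_0: chain cannot grow
            out.append(BOND[order])
        out.append(element)
        state = VALENCE[element] - order   # new state X_n
    return "".join(out)
```

<p>Decoding the symbol sequence as <code>decode([(1, "F"), (2, "C"), (2, "C"), (3, "N")])</code> reproduces the derivation above: the first <code>[=C]</code> request is capped to a single bond by fluorine's state X<sub>1</sub>, and the final <code>[#N]</code> request is capped to a double bond by state X<sub>2</sub>.</p>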
<h2 id="methodology--experiments-validating-robustness">Methodology &amp; Experiments: Validating Robustness</h2>
<p>The authors ran several experiments to demonstrate SELFIES&rsquo; robustness:</p>
<h3 id="random-mutation-test">Random Mutation Test</h3>
<p>They took the SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations of MDMA and introduced random changes:</p>
<ul>
<li><strong>SMILES</strong>: After just one random mutation, only 9.9% of strings remained valid (dropping to 1.1% after three mutations).</li>
<li><strong>SELFIES</strong>: 100% of mutated strings still represented valid molecules (though different from the original).</li>
</ul>
<p>This empirical difference demonstrates why SELFIES is well suited for evolutionary algorithms and genetic programming approaches to molecular design, where random mutations of strings are a core operation.</p>
<h3 id="generative-model-performance">Generative Model Performance</h3>
<p>The real test came with actual machine learning models. The authors trained Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) on both representations:</p>
<p><strong>VAE Results:</strong></p>
<ul>
<li>SMILES-based VAE: Large invalid regions scattered throughout the latent space</li>
<li>SELFIES-based VAE: Every point in the continuous latent space mapped to a valid molecule</li>
<li>The SELFIES model encoded <strong>over 100 times more diverse molecules</strong></li>
</ul>
<p><strong>GAN Results:</strong></p>
<ul>
<li>Best SMILES GAN: 18.6% diverse, valid molecules</li>
<li>Best SELFIES GAN: 78.9% diverse, valid molecules</li>
</ul>
<p><strong>Evaluation Metrics:</strong></p>
<ul>
<li><strong>Validity</strong>: Percentage of generated strings representing valid molecular structures</li>
<li><strong>Diversity</strong>: Number of unique valid molecules produced</li>
<li><strong>Reconstruction Accuracy</strong>: How well the autoencoder reproduced input molecules</li>
</ul>
<h3 id="scalability-test">Scalability Test</h3>
<p>The authors showed SELFIES works beyond toy molecules by successfully encoding and decoding all <strong>72 million molecules</strong> from the PubChem database (with fewer than 500 SMILES characters per molecule), demonstrating practical applicability to real chemical databases.</p>
<h2 id="results--conclusions-chemical-space-exploration">Results &amp; Conclusions: Chemical Space Exploration</h2>
<p><strong>Key Findings:</strong></p>
<ul>
<li>SELFIES achieves 100% validity guarantee: every string represents a valid molecule</li>
<li>SELFIES-based VAEs encode over 100x more diverse molecules than SMILES-based models</li>
<li>SELFIES-based GANs produce 78.9% diverse valid molecules vs. 18.6% for SMILES GANs</li>
<li>Successfully validated on all 72 million PubChem molecules</li>
</ul>
<p><strong>Limitations Acknowledged:</strong></p>
<ul>
<li>No standardization or canonicalization method at time of publication</li>
<li>The initial grammar covered only small biomolecules; extensions for stereochemistry, ions, polyvalency, and full periodic table coverage were planned</li>
<li>Requires community testing and adoption</li>
</ul>
<p><strong>Impact:</strong></p>
<p>This work demonstrated that designing ML-native molecular representations could enable new approaches in drug discovery and materials science. SELFIES was subsequently evaluated as an alternative input representation to SMILES in <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, a transformer pretrained on molecular strings for property prediction, where it performed comparably to SMILES on the Tox21 benchmark, though the comparison was limited to a single task.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The machine learning experiments used two distinct datasets:</p>
<ul>
<li><strong>QM9</strong> (134k molecules): Primary training dataset for VAE and GAN models</li>
<li><strong>PubChem</strong> (72M molecules): Used only to test representation coverage and scalability; not used for model training</li>
</ul>
<h3 id="models">Models</h3>
<p>The VAE implementation included:</p>
<ul>
<li><strong>Latent space</strong>: 241-dimensional with Gaussian distributions</li>
<li><strong>Input encoding</strong>: One-hot encoding of SELFIES/SMILES strings</li>
<li>Full architectural details (encoder/decoder structures, layer types) provided in Supplementary Information</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The authors found GAN performance was highly sensitive to hyperparameter selection:</p>
<ul>
<li>Searched <strong>200 different hyperparameter configurations</strong> to achieve the reported 78.9% diversity</li>
<li>Specific optimizers, learning rates, and training duration detailed in Supplementary Information</li>
<li>Full rule generation algorithm provided in Table 2</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>All models evaluated on:</p>
<ul>
<li><strong>Validity rate</strong>: Percentage of syntactically and chemically valid outputs</li>
<li><strong>Diversity</strong>: Count of unique valid molecules generated</li>
<li><strong>Reconstruction accuracy</strong>: Fidelity of autoencoder reconstruction (VAEs only)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training performed on the SciNet supercomputing infrastructure.</li>
<li>The paper does not specify GPU types or training times.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation; has evolved significantly since the original paper</td>
      </tr>
  </tbody>
</table>
<h3 id="replication-resources">Replication Resources</h3>
<p>Complete technical replication is highly accessible due to the paper being published open-access in <em>Machine Learning: Science and Technology</em>. It primarily requires:</p>
<ul>
<li>The full rule generation algorithm (Table 2 in paper)</li>
<li>Code: <a href="https://github.com/aspuru-guzik-group/selfies">https://github.com/aspuru-guzik-group/selfies</a></li>
<li>Supplementary Information for complete architectural and hyperparameter specifications</li>
</ul>
<p><strong>Note</strong>: The <a href="/notes/chemistry/molecular-representations/notations/selfies/">modern SELFIES library</a> has evolved significantly since this foundational paper, addressing many of the implementation challenges identified by the authors.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024. <a href="https://doi.org/10.1088/2632-2153/aba947">https://doi.org/10.1088/2632-2153/aba947</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn_2020,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/aba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1088%2F2632-2153%2Faba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{{IOP} Publishing}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{045024}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Mario Krenn and Florian H{\&#34;{a}}se and AkshatKumar Nigam and Pascal Friederich and Alan Aspuru-Guzik}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Self-referencing embedded strings ({SELFIES}): A 100{\%} robust molecular string representation}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">Modern SELFIES Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>RInChI: The Reaction International Chemical Identifier</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/rinchi/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/rinchi/</guid><description>RInChI extends InChI to create unique, machine-readable identifiers for chemical reactions and database searching.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-scope">Paper Classification and Scope</h2>
<p>This is an <strong>infrastructure/resource paper</strong> combined with a <strong>methods paper</strong>. It establishes a standard format, releases an open-source software library, and enables large-scale database operations. The methods component details the specific algorithmic rules for constructing identifiers through hashing, sorting, and layering.</p>
<h2 id="the-need-for-standardized-reaction-identifiers">The Need for Standardized Reaction Identifiers</h2>
<p>While we have excellent standards for identifying individual molecules (like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>), there was no equivalent for chemical reactions. This creates real problems:</p>
<ul>
<li>Different researchers working on the same reaction might describe it completely differently</li>
<li>Searching large reaction databases becomes nearly impossible</li>
<li>No way to check if two apparently different reaction descriptions are actually the same process</li>
<li>Chemical databases can&rsquo;t easily link related reactions or identify duplicates</li>
</ul>
<p>If a reaction converts &ldquo;starting material A + reagent B to product C,&rdquo; it is difficult to determine whether that is identical to another researcher&rsquo;s description of the same transformation using different names or graphical representations. A working group was established in 2008 to address this, producing prototype versions at the University of Cambridge starting in 2011. The first official release (RInChI V1.00) was funded by the InChI Trust.</p>
<h2 id="core-innovation-standardizing-reaction-strings">Core Innovation: Standardizing Reaction Strings</h2>
<p>RInChI solves this by creating a standardized, machine-readable label for any chemical reaction. The key insight is to focus on the essential chemistry while ignoring experimental details that can vary between labs.</p>
<h3 id="core-principles">Core Principles</h3>
<p>RInChI captures three fundamental pieces of information:</p>
<ol>
<li><strong>Starting materials</strong>: What molecules you begin with</li>
<li><strong>Products</strong>: What molecules you end up with</li>
<li><strong>Agents</strong>: Substances present at both the beginning and end (catalysts, solvents, etc.)</li>
</ol>
<p>Importantly, RInChI intentionally excludes experimental conditions like temperature, pressure, yield, or reaction time. These details can vary significantly even for identical chemical transformations, so including them would make it nearly impossible for different researchers to generate the same identifier.</p>
<h3 id="how-rinchi-works">How RInChI Works</h3>
<h4 id="the-rinchi-string-structure">The RInChI String Structure</h4>
<p>A RInChI string has six distinct layers. Crucially, <strong>Layers 2 and 3 are assigned alphabetically</strong>. This is essential for generating consistent identifiers.</p>
<p><strong>Layer 1: Version</strong></p>
<ul>
<li>Standard header defining the RInChI version (e.g., <code>RInChI=1.00.1S</code>)</li>
</ul>
<p><strong>Layers 2 &amp; 3: Component Molecules</strong></p>
<ul>
<li>These layers contain the InChI strings of reaction participants (reactants and products)</li>
<li><strong>Sorting Rule</strong>: The distinct groups (Reactant Group vs. Product Group) are sorted alphabetically as aggregate strings. The group that comes first alphabetically becomes <strong>Layer 2</strong>; the other becomes <strong>Layer 3</strong></li>
<li>This means if a product&rsquo;s InChI is alphabetically &ldquo;earlier&rdquo; than the reactant&rsquo;s, the product goes in Layer 2</li>
<li><strong>Formatting</strong>: Molecules within a layer are separated by <code>!</code>. The two layers are separated by <code>&lt;&gt;</code></li>
</ul>
<p><strong>Layer 4: Agents</strong></p>
<ul>
<li>Contains catalysts, solvents, and any molecule found in <em>both</em> the reactant and product input lists</li>
<li><strong>Algorithmic rule</strong>: Anything appearing in both the reactant list and product list must be removed from both and added to Layer 4</li>
</ul>
<p><strong>Layer 5: Direction (The Decoder)</strong></p>
<ul>
<li>This layer determines which component layer represents the starting material:
<ul>
<li><code>/d+</code>: Layer 2 is the Starting Material (forward direction)</li>
<li><code>/d-</code>: Layer 3 is the Starting Material (reverse direction)</li>
<li><code>/d=</code>: Equilibrium reaction</li>
</ul>
</li>
<li>Without this layer, you cannot tell which component layer holds the reactants and which holds the products</li>
</ul>
<p><strong>Layer 6: No-Structure Data</strong></p>
<ul>
<li>Format: <code>/uA-B-C</code> where the numbers indicate the count of structureless materials in Layer 2, Layer 3, and Layer 4 respectively</li>
<li>Used when substances lack defined structures and cannot be represented by InChI</li>
</ul>
<h3 id="separator-syntax">Separator Syntax</h3>
<p>For parsing or generating RInChI strings, the separator characters are:</p>
<table>
  <thead>
      <tr>
          <th>Separator</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>/</code></td>
          <td>Separates layers</td>
      </tr>
      <tr>
          <td><code>!</code></td>
          <td>Separates molecules within a layer</td>
      </tr>
      <tr>
          <td><code>&lt;&gt;</code></td>
          <td>Separates the component groups (Layers 2, 3, and 4)</td>
      </tr>
  </tbody>
</table>
<h3 id="example-structure">Example Structure</h3>
<pre><code>RInChI=1.00.1S/[Layer2 InChIs]&lt;&gt;[Layer3 InChIs]&lt;&gt;[Agent InChIs]/d+/u0-0-0
</code></pre>
<p>This systematic approach ensures that any researcher starting with the same reaction will generate an identical RInChI string.</p>
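<p>The layer-assignment rules above can be sketched in a few lines of Python. This is a simplified illustration, not the official InChI Trust algorithm: <code>build_rinchi</code> is a hypothetical helper, and real RInChI generation additionally strips InChI prefixes, handles equilibrium reactions, and counts no-structure components.</p>

```python
def build_rinchi(reactants, products, version="RInChI=1.00.1S"):
    """Sketch of RInChI layer assembly from lists of InChI strings.

    Simplified: assumes all components have structures (so the /u
    layer is always 0-0-0) and omits RAuxInfo and equilibrium cases.
    """
    reactants, products = list(reactants), list(products)

    # Layer 4 rule: anything present in BOTH input lists is an agent
    # and is removed from both sides.
    agents = sorted(set(reactants) & set(products))
    reactants = sorted(r for r in reactants if r not in agents)
    products = sorted(p for p in products if p not in agents)

    # Layers 2 & 3 rule: the two groups are ordered alphabetically as
    # aggregate strings; the direction flag then records which layer
    # holds the starting materials.
    group_r = "!".join(reactants)
    group_p = "!".join(products)
    if group_r <= group_p:
        layer2, layer3, direction = group_r, group_p, "d+"
    else:
        layer2, layer3, direction = group_p, group_r, "d-"

    body = "<>".join([layer2, layer3, "!".join(agents)])
    return f"{version}/{body}/{direction}/u0-0-0"
```

<p>With toy placeholder strings, <code>build_rinchi(["B", "A", "cat"], ["C", "cat"])</code> moves the shared component to the agent layer and sorts the rest, yielding <code>RInChI=1.00.1S/A!B&lt;&gt;C&lt;&gt;cat/d+/u0-0-0</code>.</p>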
<h3 id="rinchikeys-shorter-identifiers-for-practical-use">RInChIKeys: Shorter Identifiers for Practical Use</h3>
<p>Since full RInChI strings can become extremely long, the standard includes three types of shorter, hashed keys for different applications:</p>
<h4 id="long-rinchikey">Long-RInChIKey</h4>
<ul>
<li>Contains complete InChIKeys for every molecule in the reaction</li>
<li>Variable length, but allows searching for reactions containing specific compounds</li>
<li>Useful for component searches: &ldquo;Show me all reactions involving compound X&rdquo;</li>
</ul>
<h4 id="short-rinchikey">Short-RInChIKey</h4>
<ul>
<li>Fixed length (63 characters): 55 letters plus eight hyphens</li>
<li>Generated by separately hashing the major InChI layers (molecular formula and connectivity) of layers two, three, and four into ten-character strings, then hashing the minor layers (stereochemistry) and protonation states into five-character groups</li>
<li>Suitable for exact matching, database indexing, and linking identical reactions across different databases</li>
</ul>
<h4 id="web-rinchikey">Web-RInChIKey</h4>
<ul>
<li>Shortest format (47 characters)</li>
<li>Generated by combining all InChIs from every layer, removing duplicates, sorting alphabetically, then hashing the major layers into a seventeen-character block and the minor layers into a twelve-character block, with a protonation indicator</li>
<li>Ignores molecular roles (reactant vs. product), making it useful for finding related reactions where a molecule&rsquo;s role might differ between studies</li>
<li>Good for discovering &ldquo;reverse&rdquo; reactions, comparing databases with different drawing models, or finding alternative synthetic routes</li>
</ul>
<h2 id="experimental-validation-and-software-implementation">Experimental Validation and Software Implementation</h2>
<p>This infrastructure paper focuses on developing and validating the RInChI standard. The validation approach includes:</p>
<ul>
<li><strong>Software implementation</strong>: Development of the official RInChI software library capable of parsing reaction files and generating identifiers</li>
<li><strong>Format testing</strong>: Validation that the system correctly handles standard reaction file formats (<code>.RXN</code>, <code>.RD</code>)</li>
<li><strong>Consistency verification</strong>: Ensuring identical reactions produce identical RInChI strings regardless of input variations</li>
<li><strong>Key generation</strong>: Testing all three RInChIKey variants (Long, Short, Web) for different use cases</li>
<li><strong>Database integration</strong>: Demonstrating practical application in reaction database management. A database of over one million RInChIs was assembled using data that NextMove Software extracted from the patent literature, available at www-rinchi.ch.cam.ac.uk</li>
</ul>
<h2 id="impact-on-chemical-database-analytics">Impact on Chemical Database Analytics</h2>
<h3 id="practical-applications">Practical Applications</h3>
<p>RInChI enables systematic organization and analysis of chemical reactions:</p>
<h4 id="database-management">Database Management</h4>
<p>RInChI enables systematic organization of reaction databases. You can:</p>
<ul>
<li>Automatically identify and merge duplicate reaction entries</li>
<li>Find all variations of a particular transformation</li>
<li>Link related reactions across different data sources</li>
</ul>
<h4 id="reaction-analysis">Reaction Analysis</h4>
<p>With standardized identifiers, you can perform large-scale analysis:</p>
<ul>
<li>Identify the most commonly used reagents or catalysts</li>
<li>Find cases where identical starting materials yield different products</li>
<li>Analyze reaction trends and patterns across entire databases</li>
</ul>
<h4 id="multi-step-synthesis-representation">Multi-Step Synthesis Representation</h4>
<p>RInChI can represent complex, multi-step syntheses as single combined identifiers, making it easier to analyze and compare different synthetic routes.</p>
<h4 id="research-integration">Research Integration</h4>
<p>The standard enables better collaboration by ensuring different research groups can generate identical identifiers for the same chemical processes, facilitating data sharing and literature analysis.</p>
<h3 id="limitations-and-considerations">Limitations and Considerations</h3>
<h4 id="what-gets-lost">What Gets Lost</h4>
<p>Since RInChI builds on the Standard InChI for individual molecules, it inherits certain limitations:</p>
<ul>
<li><strong>Tautomers</strong>: Different tautomeric forms are treated as identical</li>
<li><strong>Stereochemistry</strong>: Relative stereochemical relationships aren&rsquo;t captured</li>
<li><strong>Experimental conditions</strong>: Temperature, pressure, yield, and reaction time are intentionally excluded</li>
</ul>
<h4 id="the-trade-off">The Trade-off</h4>
<p>This is an intentional feature. By focusing on core chemical identity, RInChI achieves its primary goal: ensuring that different researchers working on the same fundamental transformation generate the same identifier.</p>
<h3 id="implementation-and-tools">Implementation and Tools</h3>
<h4 id="official-software">Official Software</h4>
<p>The RInChI software, available from the InChI Trust, handles the practical details:</p>
<ul>
<li>Accepts standard reaction file formats (<code>.RXN</code>, <code>.RD</code>)</li>
<li>Generates RInChI strings, all three RInChIKey variants, and auxiliary information</li>
<li>Automates the complex process of creating consistent identifiers</li>
</ul>
<h4 id="rauxinfo-preserving-visual-information">RAuxInfo: Preserving Visual Information</h4>
<p>While RInChI discards graphical information (atom coordinates, drawing layout), the software can generate supplementary &ldquo;RAuxInfo&rdquo; strings that preserve this data. This allows reconstruction of the original visual representation when needed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>RInChI development continues to evolve:</p>
<ul>
<li><strong>Integration</strong>: Plans for compatibility with other emerging standards like <a href="/notes/chemistry/molecular-representations/notations/mixfile-minchi/">MInChI for chemical mixtures</a></li>
<li><strong>Extended applications</strong>: Work on representing complex, multi-component reaction systems</li>
<li><strong>Software development</strong>: Tools for generating graphical representations directly from RInChI without auxiliary information</li>
</ul>
<h3 id="key-takeaways">Key Takeaways</h3>
<ol>
<li>
<p><strong>Filling a critical gap</strong>: RInChI provides the first standardized way to uniquely identify chemical reactions, solving a fundamental problem in chemical informatics.</p>
</li>
<li>
<p><strong>Focus on essential chemistry</strong>: By excluding experimental variables, RInChI achieves consistent identification of core chemical transformations.</p>
</li>
<li>
<p><strong>Flexible searching</strong>: Multiple RInChIKey formats enable different types of database queries, from exact matching to similarity searching.</p>
</li>
<li>
<p><strong>Practical implementation</strong>: Official software tools make RInChI generation accessible to working chemists and database managers.</p>
</li>
<li>
<p><strong>Foundation for analysis</strong>: Standardized reaction identifiers enable large-scale analysis of chemical databases and systematic study of reaction patterns.</p>
</li>
</ol>
<p>RInChI brings to reaction data the same kind of standardization and machine-readability that SMILES and InChI provide for individual molecules.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>The RInChI software is available for download from the InChI Trust website (<a href="http://www.inchi-trust.org/downloads/">http://www.inchi-trust.org/downloads/</a>). It is also available as an Oracle cartridge and as a Pipeline Pilot component from StructurePendium. A database of over one million RInChIs is hosted at www-rinchi.ch.cam.ac.uk.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://www.inchi-trust.org/downloads/">RInChI Software (InChI Trust)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official RInChI V1.00 implementation</td>
      </tr>
      <tr>
          <td><a href="https://www-rinchi.ch.cam.ac.uk">RInChI Database</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Over 1M reactions from patent literature</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grethe, G., Blanke, G., Kraut, H., &amp; Goodman, J. M. (2018). International chemical identifier for reactions (RInChI). <em>Journal of Cheminformatics</em>, <em>10</em>(1), 22. <a href="https://doi.org/10.1186/s13321-018-0277-8">https://doi.org/10.1186/s13321-018-0277-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2018)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Grethe2018,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{International chemical identifier for reactions (RInChI)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grethe, Guenter and Blanke, Gerd and Kraut, Hans and Goodman, Jonathan M}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-018-0277-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Recent Advances in the SELFIES Library: 2023 Update</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2023/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2023/</guid><description>Major updates to the SELFIES library, improved performance, expanded chemistry support, and new customization features.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This software update paper documents major improvements to the SELFIES Python library (version 2.1.1), covering its history, underlying algorithms, design, and performance.</p>
<h2 id="limitations-in-the-original-selfies-implementation">Limitations in the Original SELFIES Implementation</h2>
<p>While the <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original SELFIES concept</a> was promising, the initial 2019 implementation had critical limitations that prevented widespread adoption:</p>
<ol>
<li><strong>Performance</strong>: Too slow for production ML workflows</li>
<li><strong>Limited chemistry</strong>: Couldn&rsquo;t represent aromatic molecules, stereochemistry, or many other important chemical features</li>
<li><strong>Poor usability</strong>: Lacked user-friendly APIs for common tasks</li>
</ol>
<p>These barriers meant that despite SELFIES&rsquo; theoretical advantages (100% validity guarantee), researchers couldn&rsquo;t practically use it for real-world applications like drug discovery or materials science.</p>
<h2 id="architectural-refactoring-and-new-ml-integrations">Architectural Refactoring and New ML Integrations</h2>
<p>The 2023 update refactors the underlying SELFIES engine with improvements to design, efficiency, and supported features. The key updates include:</p>
<ol>
<li>
<p><strong>Streamlined Grammar</strong>: The underlying context-free grammar has been generalized and streamlined, improving execution speed and extensibility while maintaining the 100% validity guarantee.</p>
</li>
<li>
<p><strong>Expanded Chemical Support</strong>: Adds support for aromatic systems (via internal kekulization), stereochemistry (chirality, cis/trans), charged species, and isotopic data, covering nearly all features supported by SMILES while preserving the validity guarantee.</p>
</li>
<li>
<p><strong>Semantic Constraint API</strong>: Introduces the <code>set_semantic_constraints()</code> function, allowing specification of custom valence definitions useful for theoretical studies or hypervalent states.</p>
</li>
<li>
<p><strong>ML Utility Functions</strong>: Provides tokenization (<code>split_selfies</code>), length estimation (<code>len_selfies</code>), label/one-hot encoding (<code>selfies_to_encoding</code>), vocabulary extraction, and attribution tracking for integration with neural network pipelines.</p>
</li>
</ol>
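<p>The kind of encoding these utilities provide can be illustrated in pure Python. The function names below are illustrative stand-ins, not the library&rsquo;s actual API (the library itself exposes <code>split_selfies</code> and <code>selfies_to_encoding</code>):</p>

```python
import re

def split_symbols(selfies: str) -> list[str]:
    # Each SELFIES symbol is a bracketed token, e.g. [C], [=O], [Branch1].
    return re.findall(r"\[[^\]]*\]", selfies)

def encode(selfies: str, vocab: dict[str, int], pad_to: int):
    """Return (label_encoding, one_hot_encoding) for one SELFIES string,
    padding shorter strings with the no-op symbol [nop]."""
    symbols = split_symbols(selfies)
    symbols += ["[nop]"] * (pad_to - len(symbols))
    labels = [vocab[s] for s in symbols]
    one_hot = [[1 if i == lab else 0 for i in range(len(vocab))]
               for lab in labels]
    return labels, one_hot
```

<p>For example, with the toy vocabulary <code>{"[nop]": 0, "[C]": 1, "[=O]": 2}</code>, encoding <code>[C][=O]</code> padded to length 3 gives the label sequence <code>[1, 2, 0]</code>.</p>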
<h2 id="performance-benchmarks--validity-testing">Performance Benchmarks &amp; Validity Testing</h2>
<p>The authors validated the library through several benchmarks:</p>
<p><strong>Performance testing</strong>: Roundtrip conversion (SMILES to SELFIES to SMILES) on the DTP open compound collection (slightly over 300K molecules) completed in 252 seconds total (136s encoding, 116s decoding), using pure Python with no external dependencies.</p>
<p><strong>Random SELFIES generation</strong>: Demonstrated that random SELFIES strings of varying lengths always decode to valid molecules, with the size distribution of generated molecules controllable by filtering the sampling alphabet (e.g., removing multi-bond and low-valence atom symbols shifts the distribution toward larger molecules).</p>
<p><strong>Validity guarantee</strong>: By construction, every SELFIES string decodes to a valid molecule. The grammar&rsquo;s bond demotion and deferred ring closure mechanisms make it impossible to generate chemically invalid structures.</p>
<p><strong>Attribution system</strong>: Showed both encoder and decoder can track which input symbols produce which output symbols, useful for property alignment.</p>
<h2 id="future-trajectories-for-general-chemical-representations">Future Trajectories for General Chemical Representations</h2>
<p>The 2023 update successfully addresses the main adoption barriers:</p>
<ol>
<li><strong>Fast enough</strong> for large-scale ML applications (300K molecules in ~4 minutes)</li>
<li><strong>Chemically comprehensive</strong> enough for drug discovery and materials science</li>
<li><strong>User-friendly</strong> enough for straightforward integration into existing workflows</li>
</ol>
<p>The validity guarantee, SELFIES&rsquo; core advantage, is now practically accessible for real-world research. The roadmap includes future extensions for polymers, crystals, chemical reactions, and non-covalent interactions, which would expand SELFIES&rsquo; applicability beyond small-molecule chemistry.</p>
<p><strong>Limitations acknowledged</strong>: The paper focuses on implementation improvements. Some advanced chemical systems (polymers, crystals) still need future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">selfies</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official Python library, installable via <code>pip install selfies</code></td>
      </tr>
  </tbody>
</table>
<h3 id="code">Code</h3>
<p>The <code>selfies</code> library is completely open-source and written in pure Python. It requires no extra dependencies and is available on GitHub, installable via <code>pip install selfies</code>. The repository includes testing suites (<code>tox</code>) and example benchmarking scripts to reproduce the translation speeds reported in the paper.</p>
<h3 id="hardware">Hardware</h3>
<p>Performance benchmarks (e.g., the 252-second roundtrip conversion on 300K molecules) were executed on Google Colaboratory using two 2.20GHz Intel Xeon CPUs.</p>
<h3 id="algorithms">Algorithms</h3>
<h4 id="technical-specification-the-grammar">Technical Specification: The Grammar</h4>
<p>The core innovation of SELFIES is a <strong>Context-Free Grammar (CFG) augmented with state-machine logic</strong> to ensure that every derived string represents a valid molecule. While the software features are important, understanding the underlying derivation rules is essential for replication or extension of the system.</p>
<p><strong>1. Derivation Rules: The Atom State Machine</strong></p>
<p>The fundamental mechanism that guarantees validity is a <strong>state machine</strong> that tracks the remaining valence of the most recently added atom:</p>
<ul>
<li><strong>State Tracking</strong>: The derivation maintains a non-terminal state $X_l$, where $l$ represents the current atom&rsquo;s remaining valence (number of bonds it can still form)</li>
<li><strong>Standard Derivation</strong>: An atom symbol $[\beta \alpha]$ (bond order + atom type) transitions the state from $S$ (start) to $X_l$, where $l$ is calculated from the atom&rsquo;s standard valence minus the incoming bond order</li>
<li><strong>Bond Demotion (The Key Rule)</strong>: When deriving atom symbol $[\beta \alpha]$ in state $X_i$, the actual bond order used is $d_0 = \min(\ell, i, d(\beta))$, where $\ell$ is the new atom&rsquo;s valence, $i$ is the previous atom&rsquo;s remaining capacity, and $d(\beta)$ is the requested bond order. This automatic downward adjustment is the mathematical core of the validity guarantee.</li>
</ul>
<p>This state machine ensures that no atom ever exceeds its allowed valence, making it impossible to generate chemically invalid structures.</p>
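<p>The demotion rule itself is a single <code>min</code>; a minimal sketch with illustrative parameter names:</p>

```python
def demoted_bond_order(new_atom_valence: int,
                       remaining_capacity: int,
                       requested_order: int) -> int:
    """Bond demotion: the realized order is d0 = min(l, i, d(beta)).

    A requested double or triple bond is silently lowered whenever
    either atom cannot support it -- the core of the validity guarantee.
    """
    return min(new_atom_valence, remaining_capacity, requested_order)
```

<p>Requesting a triple bond to an atom with only one bond slot remaining yields a single bond: <code>demoted_bond_order(4, 1, 3) == 1</code>.</p>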
<p><strong>2. Control Symbols: Branches and Rings</strong></p>
<p>Branch length calculation: SELFIES uses a <strong>hexadecimal encoding</strong> to determine branch lengths. A branch symbol <code>[Branchℓ]</code> consumes the next $\ell$ symbols from the queue and converts them to integer indices $c_1, \dots, c_\ell$ via a fixed mapping (Table III in the paper). The number of symbols $N$ to include in the branch is then:</p>
<p>$$
N = 1 + \sum_{k=1}^{\ell} 16^{\ell - k} \, c_k
$$</p>
<p>This formula interprets the indices as hexadecimal digits, allowing compact specification of branches up to hundreds of symbols long.</p>
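<p>The formula can be computed directly by treating the indices as base-16 digits; this helper is an illustration of the arithmetic, not part of the library API:</p>

```python
def branch_length(indices: list[int]) -> int:
    """N = 1 + sum_k 16^(l-k) * c_k: read the index list as
    hexadecimal digits, then add one."""
    n = 0
    for c in indices:
        n = n * 16 + c
    return 1 + n
```

<p>A single index symbol mapping to 2 yields a branch of 3 symbols; two index symbols mapping to (1, 0) yield 16 + 1 = 17.</p>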
<p>Ring closure queue system: Ring formation uses a <strong>deferred evaluation</strong> strategy to maintain validity. Ring symbols don&rsquo;t create bonds immediately; instead, they push closure candidates into a queue $R$. These candidates are resolved after the main derivation completes. A ring closure candidate is <strong>rejected</strong> if either ring atom has no remaining valence ($m_1 = 0$ or $m_2 = 0$), or if the left and right ring atoms are not distinct (to avoid self-loops). If a prior bond already exists between the two atoms, the bond order is incremented rather than duplicated. This deferred validation prevents invalid ring structures while keeping the grammar context-free during the main derivation.</p>
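<p>The deferred resolution step can be sketched as follows; the data structures here (candidate queue, valence map, bond map) are illustrative simplifications of the paper&rsquo;s formalism, not the library&rsquo;s internals:</p>

```python
def resolve_ring_closures(queue, remaining_valence, bonds):
    """Resolve queued ring-closure candidates after derivation (sketch).

    queue: list of (atom1, atom2, order) candidates collected earlier;
    remaining_valence: dict atom -> bonds it can still form;
    bonds: dict mapping frozenset({a1, a2}) -> existing bond order.
    """
    for a1, a2, order in queue:
        if a1 == a2:                        # reject self-loops
            continue
        m1, m2 = remaining_valence[a1], remaining_valence[a2]
        if m1 == 0 or m2 == 0:              # reject if no capacity left
            continue
        d = min(order, m1, m2)              # demote the order as needed
        key = frozenset((a1, a2))
        bonds[key] = bonds.get(key, 0) + d  # increment any existing bond
        remaining_valence[a1] -= d
        remaining_valence[a2] -= d
    return bonds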
<p><strong>3. Symbol Structure and Standardization</strong></p>
<p>SELFIES enforces a strict, standardized format for atom symbols to eliminate ambiguity:</p>
<ul>
<li><strong>Canonical Format</strong>: Atom symbols follow the structure <code>[Bond, Isotope, Element, Chirality, H-count, Charge]</code></li>
<li><strong>No Variation</strong>: There is only one way to write each symbol (e.g., <code>[Fe++]</code> and <code>[Fe+2]</code> are standardized to a single form)</li>
<li><strong>Order Matters</strong>: The components must appear in the specified order</li>
</ul>
<p><strong>4. Default Semantic Constraints</strong></p>
<p>By default, the library enforces standard organic chemistry valence rules:</p>
<ul>
<li><strong>Charge-Dependent Valences</strong>: Default constraints specify maximum bonds per charge state (e.g., C: 4/5/3 for neutral/+1/-1; S: 6/7/5). Unlisted atom types default to 8 maximum bonds as a catch-all.</li>
<li><strong>Preset Options</strong>: Three preset constraint sets are available: <code>default</code>, <code>octet_rule</code>, and <code>hypervalent</code>.</li>
<li><strong>Customizable</strong>: Constraints can be modified via <code>set_semantic_constraints()</code> for specialized applications (hypervalent compounds, theoretical studies, etc.)</li>
</ul>
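<p>A sketch of how such a charge-dependent lookup might work, using the default values quoted above; the table and function are illustrative, not the library&rsquo;s internal representation:</p>

```python
# Illustrative constraint table keyed by (element, charge); values are
# maximum bond counts, mirroring the defaults described above.
DEFAULT_CONSTRAINTS = {
    ("C", 0): 4, ("C", +1): 5, ("C", -1): 3,
    ("S", 0): 6, ("S", +1): 7, ("S", -1): 5,
}

def max_bonds(element: str, charge: int = 0) -> int:
    # Unlisted atom types fall back to 8 bonds, the catch-all default.
    return DEFAULT_CONSTRAINTS.get((element, charge), 8)
```

<p>In the library itself, such a table is replaced wholesale via <code>set_semantic_constraints()</code>, e.g. to permit hypervalent species.</p>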
<p>The combination of these grammar rules with the state machine ensures that <strong>every valid SELFIES string decodes to a chemically valid molecule</strong>, regardless of how the string was generated (random, ML model output, manual construction, etc.).</p>
<h3 id="data">Data</h3>
<p><strong>Benchmark dataset</strong>: DTP (Developmental Therapeutics Program) open compound collection with slightly over 300K SMILES strings, a set of molecules tested experimentally for potential treatment against cancer and AIDS.</p>
<p><strong>Random generation testing</strong>: Random SELFIES strings of varying lengths (10, 100, 250 symbols) generated from both basic and filtered alphabets to test decoding validity and molecule size distributions.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Performance metric</strong>: Roundtrip conversion time (SMILES to SELFIES to SMILES) is 252 seconds for 300K+ molecules (136s encoding, 116s decoding). Times averaged over 3 replicate trials on Google Colaboratory.</p>
<p><strong>Validity testing</strong>: Random SELFIES strings of lengths 10, 100, and 250 all decode to valid molecules. Decoding 1000 random strings of length 250 from the basic alphabet takes 0.341s; from the filtered alphabet, 1.633s.</p>
<p><strong>Attribution system</strong>: Both <code>encoder()</code> and <code>decoder()</code> support an <code>attribute</code> flag that returns <code>AttributionMap</code> objects, tracing which input symbols produce which output symbols for property alignment.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., &amp; Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. <em>Digital Discovery</em>, <em>2</em>(4), 897-908. <a href="https://doi.org/10.1039/D3DD00044C">https://doi.org/10.1039/D3DD00044C</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lo2023recent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recent advances in the self-referencing embedded strings (SELFIES) library}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lo, Alston and Pollice, Robert and Nigam, AkshatKumar and White, Andrew D and Krenn, Mario and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{897--908}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00044C}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">Original SELFIES Paper (2020)</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES Format Overview</a></li>
</ul>
]]></content:encoded></item><item><title>NInChI: Toward a Chemical Identifier for Nanomaterials</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</guid><description>NInChI (Nanomaterials InChI) extends chemical identifiers to represent complex, multi-component nanomaterials.</description><content:encoded><![CDATA[<h2 id="a-new-standard-for-nanoinformatics">A New Standard for Nanoinformatics</h2>
<p>This is a <strong>Systematization paper</strong> that proposes a new standard, the NInChI, to address a fundamental limitation in nanoinformatics: existing chemical identifiers cannot describe multi-component nanomaterials. The result of a collaborative workshop organized by the H2020 research infrastructure NanoCommons and the nanoinformatics project NanoSolveIT, this work uses <strong>six detailed case studies</strong> to systematically develop a <strong>hierarchical, machine-readable notation</strong> for complex nanomaterials that could work across experimental research, regulatory frameworks, and computational modeling.</p>
<h2 id="the-breakdown-of-traditional-chemical-identifiers">The Breakdown of Traditional Chemical Identifiers</h2>
<p>Chemoinformatics has fantastic tools for representing small molecules: SMILES strings, InChI identifiers, and standardized databases that make molecular data searchable and shareable. But when you step into nanotechnology, everything breaks down.</p>
<p>Consider trying to describe a gold nanoparticle with a silica shell and organic surface ligands. How do you capture:</p>
<ul>
<li>The gold core composition and size</li>
<li>The silica shell thickness and interface</li>
<li>The surface chemistry and ligand density</li>
<li>The overall shape and morphology</li>
</ul>
<p>There&rsquo;s simply no standardized way to represent this complexity in a machine-readable format. This creates massive problems for:</p>
<ul>
<li><strong>Data sharing</strong> between research groups</li>
<li><strong>Regulatory assessment</strong> where precise identification matters</li>
<li><strong>Computational modeling</strong> that needs structured input</li>
<li><strong>Database development</strong> and search capabilities</li>
</ul>
<p>Without a standard notation, nanomaterials research suffers from the same data fragmentation that plagued small molecule chemistry before SMILES existed.</p>
<h2 id="the-five-tier-nanomaterial-description-hierarchy">The Five-Tier Nanomaterial Description Hierarchy</h2>
<p>The authors propose NInChI (Nanomaterials InChI), a layered extension to the existing InChI system. The core insight is organizing nanomaterial description from the inside out, following the OECD&rsquo;s framework for risk assessment, with a five-tier hierarchy:</p>
<ol>
<li><strong>Tier 1: Chemical Composition</strong>: What is the core made of? This differentiates uniform compositions (Tier 1.1), randomly mixed (Tier 1.2), ordered core-shell materials (Tier 1.3), and onion-like multi-shell morphologies (Tier 1.4).</li>
<li><strong>Tier 2: Morphology</strong>: What shape, size, and dimensionality? This encodes dimension (0D-3D), size and size distribution, and shape information.</li>
<li><strong>Tier 3: Surface Properties</strong>: Physical and chemical surface parameters such as charge, roughness, and hydrophobicity. Many of these depend on external conditions (pH, solvent, temperature).</li>
<li><strong>Tier 4: Surface Functionalization</strong>: How are coatings attached to the core? This includes functionalization density, orientation, and binding type (covalent vs. non-covalent).</li>
<li><strong>Tier 5: Surface Ligands</strong>: What molecules are on the surface, their density, orientation, and distribution?</li>
</ol>
<p>This hierarchy captures the essential information needed to distinguish between different nanomaterials while building on familiar chemical concepts.</p>
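<p>One way to picture the hierarchy is as a nested record, one field per tier. A minimal Python sketch, where the class and field names are hypothetical illustrations and not part of the proposed notation:</p>

```python
from dataclasses import dataclass, field

@dataclass
class NanomaterialRecord:
    """Hypothetical container mirroring the five-tier NInChI hierarchy."""
    composition: dict          # Tier 1: core chemistry and mixing type
    morphology: dict           # Tier 2: dimensionality, size, shape
    surface_properties: dict   # Tier 3: charge, roughness, hydrophobicity
    functionalization: dict    # Tier 4: binding type, density, orientation
    ligands: list = field(default_factory=list)  # Tier 5: surface molecules

# A coated gold nanoparticle, inside out:
example = NanomaterialRecord(
    composition={"core": "Au", "type": "uniform"},
    morphology={"dimension": "0D", "diameter_nm": 15},
    surface_properties={"charge": "negative"},
    functionalization={"binding": "non-covalent"},
    ligands=["citrate"],
)
```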
<h2 id="testing-the-standard-six-case-studies">Testing the Standard: Six Case Studies</h2>
<p>The authors tested their concept against six real-world case studies to identify what actually matters in practice.</p>
<p><strong>Case Study 1: Gold Nanoparticles</strong></p>
<p>Gold NPs provided a relatively simple test case: an inert metallic core with various surface functionalizations. Key insights: core composition and size are essential, surface chemistry (what molecules are attached) matters critically, shape affects properties, and dynamic properties like protein corona formation belong outside the intrinsic NInChI representation. This established the boundary: NInChI should capture intrinsic, stable properties.</p>
<p><strong>Case Study 2: Graphene-Family NMs</strong></p>
<p>Carbon nanotubes and graphene introduced additional complexity: dimensionality (1D tubes vs 2D sheets vs 0D fullerenes), chirality (the (n,m) vector that defines a nanotube&rsquo;s structure), defects and impurities that can alter properties, and layer count (the number of graphene layers, or single-wall vs. multi-wall for nanotubes). This case showed that the notation needed to handle both topological complexity and chemical composition.</p>
<p><strong>Case Study 3: Complex Engineered (Doped and Multi-Metallic) NMs</strong></p>
<p>Doped materials, alloys, and core-shell structures revealed key requirements: the notation must distinguish true alloys (homogeneous mixing) from core-shell structures with the same overall composition, crystal structure information becomes crucial, and component ratios must be precisely specified. The case study also assessed whether the MInChI extension could represent these solid solutions.</p>
<p><strong>Case Study 4: Database Applications</strong></p>
<p>The FAIR (Findable, Accessible, Interoperable, Reusable) principles guided this analysis. NInChI addresses real database problems: it provides greater specificity than CAS numbers (which lack nanoform distinction), offers a systematic alternative to ad-hoc naming schemes, and enables machine-searchability.</p>
<p><strong>Case Study 5: Computational Modeling</strong></p>
<p>This explored several applications: automated descriptor generation from NInChI structure, read-across predictions for untested materials, and model input preparation from standardized notation. The layered structure provides structured input that computational tools need for both physics-based and data-driven nanoinformatics approaches.</p>
<p><strong>Case Study 6: Regulatory Applications</strong></p>
<p>Under frameworks like REACH, regulators need to distinguish between different &ldquo;nanoforms&rdquo;, which are materials with the same chemical composition but different sizes, shapes, or surface treatments. NInChI directly addresses this by encoding the specific properties that define regulatory categories, providing precision sufficient for legal definitions and risk assessment frameworks.</p>
<h2 id="the-ninchi-alpha-specification-in-practice">The NInChI Alpha Specification in Practice</h2>
<p>Synthesizing insights from all six case studies, the authors propose the <strong>NInChI alpha specification</strong> (version 0.00.1A), a three-layer structure. Importantly, the paper distinguishes the five-tier NM description hierarchy (described above) from the three-layer NInChI notation hierarchy. NM properties from the five tiers are encoded into these three notation layers:</p>
<p><strong>Layer 1 (Version Number)</strong>: Standard header indicating the NInChI version, denoted as <code>0.00.1A</code> for the alpha version. This follows the convention of all <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>-based notations.</p>
<p><strong>Layer 2 (Composition)</strong>: Each component (core, shell, ligands, impurities, dopants, linkers) gets described using standard InChI (or PInChI/MInChI) for chemical composition, with additional sublayers for morphology (prefix <code>m</code>, e.g., <code>sp</code> for sphere, <code>sh</code> for shell, <code>tu</code> for tube), size (prefix <code>s</code>, in scientific notation in meters), crystal structure (prefix <code>k</code>), and chirality (prefix <code>w</code> for carbon nanotubes). Components are separated by <code>!</code>.</p>
<p><strong>Layer 3 (Arrangement)</strong>: Specified with prefix <code>y</code>, this layer describes how the components from Layer 2 are combined, proceeding from inside out. A core-shell material is written as <code>y2&amp;1</code> where the numbers reference components in Layer 2. Covalent bonding between components is indicated with parentheses, e.g., <code>(1&amp;2&amp;3)</code> for a nano core with a covalently bound ligand coating.</p>
<p>The paper provides concrete worked examples from the case studies:</p>
<ul>
<li><strong>Silica with gold coating</strong> (20 nm silica, 2 nm gold shell):
<code>NInChI=0.00.1A/Au/msh/s2t10r1-9;12r2-9!/O2Si/c1-3-2/msp/s20d-9/k000/y2&amp;1</code></li>
<li><strong>CTAB-capped gold nanoparticle</strong> (20 nm diameter):
<code>NInChI=0.00.1A/Au/msp/s20d-9!C19H42N.BrH/c1-5-6-7.../y1&amp;2</code></li>
<li><strong>Chiral single-wall nanotube</strong> of the (3,1) type with 0.4 nm diameter:
<code>NInChI=0.00.1A/C/mtu/s4d-10/w(3,1)/y1</code></li>
</ul>
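<p>The layered syntax can be illustrated with a small parser. This is an unofficial sketch based only on the worked examples above: it assumes components are separated by <code>!</code> and that the arrangement layer is whatever follows the final <code>/y</code>:</p>

```python
def parse_ninchi(s):
    """Split a NInChI alpha string into version, components, and arrangement.

    Unofficial sketch based on the paper's worked examples; assumes the
    string carries an arrangement layer ('/y...') at the end.
    """
    prefix, _, body = s.partition("=")
    if prefix != "NInChI":
        raise ValueError("not a NInChI string")
    version, _, rest = body.partition("/")
    rest, _, arrangement = rest.rpartition("/y")
    # Strip the leading '/' that components carry after the '!' separator.
    components = [c.lstrip("/") for c in rest.split("!")]
    return {"version": version, "components": components,
            "arrangement": arrangement}

# The silica-core / gold-shell example from the paper:
parsed = parse_ninchi(
    "NInChI=0.00.1A/Au/msh/s2t10r1-9;12r2-9"
    "!/O2Si/c1-3-2/msp/s20d-9/k000/y2&1"
)
```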
<p><strong>Property Prioritization</strong>: The case studies produced a prioritization of NM properties into four categories (Table 3 in the paper):</p>
<table>
  <thead>
      <tr>
          <th>Category 1: Must Have</th>
          <th>Category 2a: Nice to Have</th>
          <th>Category 2b: Extrinsic</th>
          <th>Category 3: Out of Scope</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemical composition</td>
          <td>Structural defects</td>
          <td>Surface charge</td>
          <td>Optical properties</td>
      </tr>
      <tr>
          <td>Size/size distribution</td>
          <td>Density</td>
          <td>Corona</td>
          <td>Magnetic properties</td>
      </tr>
      <tr>
          <td>Shape</td>
          <td>Surface composition</td>
          <td>Agglomeration state</td>
          <td>Chemical/oxidation state</td>
      </tr>
      <tr>
          <td>Crystal structure</td>
          <td></td>
          <td>Dispersion</td>
          <td></td>
      </tr>
      <tr>
          <td>Chirality</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>Ligand and ligand binding</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p><strong>Implementation</strong>: The authors built a prototype NInChI generation tool using the ZK framework with a Java backend, available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>. The tool lets users specify core composition, morphology, size, crystal structure, and chirality, then build outward by adding shells or clusters. InChIs for shell components are retrieved via the NCI/CADD chemical structure REST API.</p>
<p><strong>Limitations</strong>: The alpha version acknowledges areas for future development: nanocomposite and nanostructured materials, inverse NMs (nano holes in bulk material), and nanoporous materials are beyond current scope. Dynamic properties such as dissolution, agglomeration, and protein corona formation are excluded. The stochastic nature of NMs (e.g., broad size distributions) is not yet fully addressed. Covalent bonding between components needs further refinement.</p>
<p><strong>Impact</strong>: For researchers, NInChI enables precise structural queries for nanomaterials data sharing. For regulators, it provides systematic identification for risk assessment and nanoform classification under frameworks like REACH. For computational modelers, it enables automated descriptor generation and read-across predictions.</p>
<p><strong>Key Conclusions</strong>: The 8-month collaborative process demonstrates that creating systematic notation for nanomaterials is feasible. The hierarchical, inside-out organization provides an approach that satisfies experimentalists, modelers, database owners, and regulators. Testing against six case studies identified the essential features that must be captured. By extending InChI and reusing conventions from MInChI, RInChI, and PInChI, the work builds on existing infrastructure. The proposed NInChI alpha is intended to stimulate further analysis and refinement with the broader community and the InChI Trust.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Paper Accessibility</strong>: The paper is fully open-access under the CC BY 4.0 license, allowing for straightforward reading and analysis.</li>
<li><strong>Tools &amp; Code</strong>: The authors provided a prototype NInChI generation tool available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>, built using the ZK framework with a Java backend. The underlying backend code was not released as an open-source library.</li>
<li><strong>Documentation</strong>: The paper serves as the first alpha specification for community discussion and refinement. No formal algorithmic pseudocode for automated string parsing or generation from structured nanomaterials files (like <code>.cif</code>) is provided.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">NInChI Generator (Enalos Cloud)</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Prototype web tool for generating NInChI strings; backend not open-source</td>
      </tr>
      <tr>
          <td><a href="https://www.mdpi.com/2079-4991/10/12/2493">Paper (MDPI)</a></td>
          <td>Other</td>
          <td>CC BY 4.0</td>
          <td>Open-access alpha specification</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lynch, I., Afantitis, A., Exner, T., Himly, M., Lobaskin, V., Doganis, P., &hellip; &amp; Melagraki, G. (2020). Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies? <em>Nanomaterials</em>, <em>10</em>(12), 2493. <a href="https://doi.org/10.3390/nano10122493">https://doi.org/10.3390/nano10122493</a></p>
<p><strong>Publication</strong>: Nanomaterials (2020)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lynch2020inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lynch, Iseult and Afantitis, Antreas and Exner, Thomas and others}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nanomaterials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2493}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{MDPI}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.3390/nano10122493}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mixfile &amp; MInChI: Machine-Readable Mixture Formats</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/mixfile-minchi/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/mixfile-minchi/</guid><description>Mixfile and MInChI provide the first standardized, machine-readable formats for representing chemical mixtures.</description><content:encoded><![CDATA[<h2 id="a-standardized-resource-for-chemical-mixtures">A Standardized Resource for Chemical Mixtures</h2>
<p>This is a <strong>Resource</strong> paper that introduces two complementary standards for representing chemical mixtures: the detailed <strong>Mixfile</strong> format for comprehensive mixture descriptions and the compact <strong>MInChI</strong> (Mixtures InChI) specification for canonical mixture identifiers.</p>
<h2 id="the-missing-format-for-complex-formulations">The Missing Format for Complex Formulations</h2>
<p>There is a fundamental gap in chemical informatics: current standards excel at representing pure individual molecules (SMILES, InChI, Molfile), but a corresponding standard for multi-component mixtures remains an open challenge. This is a major problem because real-world chemistry predominantly involves complex mixtures.</p>
<p>Everyday chemical work frequently involves:</p>
<ul>
<li>Reagents with specified purity (e.g., &ldquo;$\geq$ 97% pure&rdquo;)</li>
<li>Solutions and formulations</li>
<li>Complex mixtures like &ldquo;hexanes&rdquo; (which contains multiple isomers)</li>
<li>Drug formulations with active ingredients and excipients</li>
</ul>
<p>Without a machine-readable standard, chemists are forced to describe these mixtures in plain text that software cannot parse or analyze systematically. This creates barriers for automated safety analysis, inventory management, and data sharing.</p>
<h2 id="dual-design-comprehensive-mixfiles-and-canonical-minchis">Dual Design: Comprehensive Mixfiles and Canonical MInChIs</h2>
<p>The authors propose a two-part solution:</p>
<ol>
<li><strong>Mixfile</strong>: A detailed, hierarchical JSON format that captures the complete composition of a mixture</li>
<li><strong>MInChI</strong>: A compact, canonical string identifier derived from Mixfile data</li>
</ol>
<p>This dual approach provides both comprehensive description (Mixfile) and simple identification (MInChI), similar to having both a detailed recipe and a short name for a dish.</p>
<h3 id="what-makes-a-good-mixture-format">What Makes a Good Mixture Format?</h3>
<p>The authors identify three essential properties any mixture format must capture:</p>
<ol>
<li><strong>Compound</strong>: What molecules are present?</li>
<li><strong>Quantity</strong>: How much of each component?</li>
<li><strong>Hierarchy</strong>: How are components organized (e.g., mixtures-of-mixtures)?</li>
</ol>
<p>The hierarchical aspect is crucial. Consider &ldquo;hexanes&rdquo;: it is a named mixture containing specific proportions of n-hexane, 2-methylpentane, 3-methylpentane, etc. A mixture format needs to represent both the individual isomers and the fact that they are grouped under the umbrella term &ldquo;hexanes.&rdquo;</p>
<h3 id="mixfile-format-details">Mixfile Format Details</h3>
<p>Mixfile uses JSON as its foundation, making it both human-readable and easy to parse in modern programming languages. The core structure is a hierarchical tree where each component can contain:</p>
<ul>
<li><strong>name</strong>: Component identifier</li>
<li><strong>molfile/smiles/inchi/formula</strong>: Molecular structure (molfile is the primary source of truth)</li>
<li><strong>quantity/units/relation/ratio</strong>: Concentration data with optional relation operators</li>
<li><strong>contents</strong>: Array of sub-components for hierarchical mixtures</li>
<li><strong>identifiers</strong>: Database IDs or URLs for additional information</li>
</ul>
<h4 id="simple-example">Simple Example</h4>
<p>A basic Mixfile might look like:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Acetone, ≥99%&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;acetone&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">99</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;relation&#34;</span>: <span style="color:#e6db74">&#34;&gt;=&#34;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Note that the paper specifies distinct fields for molecular structures: <code>molfile</code> (the primary source of truth), <code>smiles</code>, <code>inchi</code>, and <code>formula</code>. Concentration data uses separate <code>quantity</code>, <code>units</code>, and <code>relation</code> fields.</p>
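<p>A minimal Python sketch (mine, not from the paper) of loading such a Mixfile and walking its component tree:</p>

```python
import json

# The Mixfile from the example above, embedded as a string for illustration.
mixfile_text = """
{
  "mixfileVersion": 0.01,
  "name": "Acetone, >=99%",
  "contents": [
    {"name": "acetone", "smiles": "CC(=O)C",
     "quantity": 99, "units": "%", "relation": ">="}
  ]
}
"""
mixfile = json.loads(mixfile_text)

def iter_components(node):
    """Yield every leaf component in a (possibly nested) Mixfile tree."""
    for child in node.get("contents", []):
        if "contents" in child:
            yield from iter_components(child)
        else:
            yield child

components = list(iter_components(mixfile))
```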
<h4 id="complex-example-mixture-of-mixtures">Complex Example: Mixture-of-Mixtures</h4>
<p>For something like &ldquo;ethyl acetate dissolved in hexanes,&rdquo; the structure would be:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Ethyl acetate in hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;ethyl acetate&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCOC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">10</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;n-hexane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCCCCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">60</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;2-methylpentane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(C)CCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">25</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>      ]
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>This hierarchical structure captures the &ldquo;recipe&rdquo; of complex mixtures while remaining machine-readable.</p>
<h3 id="minchi-canonical-mixture-identifiers">MInChI: Canonical Mixture Identifiers</h3>
<p>While Mixfiles provide comprehensive descriptions, simple identifiers are also needed for database storage and searching. This is where MInChI comes in.</p>
<p>A MInChI string is structured as:</p>
<pre><code>MInChI=0.00.1S/&lt;components&gt;/n&lt;indexing&gt;/g&lt;concentration&gt;
</code></pre>
<ul>
<li><strong>Header</strong>: Version information (<code>0.00.1S</code> in the paper&rsquo;s specification)</li>
<li><strong>Components</strong>: Standard InChI for each unique molecule, sorted alphabetically <em>by the InChI strings themselves</em>, then concatenated with <code>&amp;</code></li>
<li><strong>Indexing</strong> (prefixed with <code>/n</code>): Hierarchical structure using curly braces <code>{}</code> for branches and <code>&amp;</code> for adjacent nodes; uses 1-based integer indices referring to the sorted InChI list</li>
<li><strong>Concentration</strong> (prefixed with <code>/g</code>): Quantitative information for each component, with units converted to canonical codes</li>
</ul>
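<p>A sketch of splitting a MInChI string into these layers, assuming the markers <code>/n</code> and <code>/g</code> do not occur inside the component InChI bodies; the example string and its concentration tokens are invented for illustration:</p>

```python
def split_minchi(s):
    """Split a MInChI string into version, components, indexing, concentration.

    Illustrative sketch: the component layer sits between the version header
    and the '/n' indexing layer; concentrations follow the '/g' marker.
    """
    header, _, rest = s.partition("/")
    if not header.startswith("MInChI="):
        raise ValueError("not a MInChI string")
    body, _, conc = rest.rpartition("/g")
    comps, _, indexing = body.rpartition("/n")
    return {
        "version": header.split("=")[1],
        "components": comps.split("&"),
        "indexing": indexing,
        "concentration": conc,
    }

# Invented two-component example (methanol + water, placeholder quantities):
layers = split_minchi("MInChI=0.00.1S/CH4O/c1-2/h2H,1H3&H2O/h1H2/n{1&2}/g25wf&75wf")
```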
<h4 id="why-this-matters">Why This Matters</h4>
<p>MInChI strings enable simple database searches:</p>
<ul>
<li>Check if a specific component appears in any mixture</li>
<li>Compare different formulations of the same product</li>
<li>Identify similar mixtures based on string similarity</li>
</ul>
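<p>For example, component lookup reduces to a plain substring search over the component layer. The sample records below are invented for illustration:</p>

```python
# Two hypothetical MInChI records (layer contents abbreviated/invented).
records = {
    "methanol-water": "MInChI=0.00.1S/CH4O/c1-2/h2H,1H3&H2O/h1H2/n{1&2}/g",
    "pure-ethanol":   "MInChI=0.00.1S/C2H6O/c1-2-3/h3H,2H2,1H3/n1/g",
}

# Because each component is a canonical InChI body, membership queries are
# simple string containment checks.
water = "H2O/h1H2"
matches = [name for name, s in records.items() if water in s]
```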
<h2 id="validating-the-standard-through-practical-tooling">Validating the Standard Through Practical Tooling</h2>
<p>The paper demonstrates the format&rsquo;s capabilities through several practical applications and a proof-of-concept implementation:</p>
<h3 id="text-extraction-algorithm">Text Extraction Algorithm</h3>
<p>The authors demonstrate a proof-of-concept algorithm that uses regular expressions and chemical name recognition to parse plain-text mixture descriptions into structured Mixfile data. The algorithm:</p>
<ol>
<li>Applies regex rules to remove filler words and extract concentrations</li>
<li>Looks up cleaned names against a custom chemical database</li>
<li>Falls back to OPSIN for SMILES generation from chemical names</li>
<li>Generates 2D coordinates for molecular structures</li>
</ol>
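<p>The concentration-extraction step can be sketched with a single regular expression; the pattern below is my own illustrative rule, not the paper&rsquo;s actual rule set:</p>

```python
import re

def extract_concentration(text):
    """Pull a concentration phrase like '>=99%' or '0.1 M' out of a
    free-text mixture description (illustrative pattern only)."""
    m = re.search(r"([<>]=?|~)?\s*(\d+(?:\.\d+)?)\s*(%|mM|M|g/L)", text)
    if not m:
        return None
    relation, value, units = m.groups()
    return {"relation": relation, "quantity": float(value), "units": units}

hit = extract_concentration("Acetone, >=99% (HPLC grade)")
```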
<h3 id="graphical-editor">Graphical Editor</h3>
<p>An open-source editor provides:</p>
<ul>
<li>Tree-based interface for building and editing hierarchical structures</li>
<li>Chemical structure sketching and editing</li>
<li>Database lookup (e.g., PubChem integration)</li>
<li>Automatic MInChI generation</li>
<li>Import/export capabilities</li>
</ul>
<h3 id="example-use-cases">Example Use Cases</h3>
<p>The paper validates the format through real-world applications:</p>
<ul>
<li><strong>Safety compliance</strong>: Automated hazard assessment based on concentration-dependent properties (e.g., solid osmium tetroxide vs. 1% aqueous solution)</li>
<li><strong>Inventory management</strong>: Precise, searchable laboratory records</li>
<li><strong>Data extraction</strong>: Parsing vendor catalogs and safety data sheets</li>
</ul>
<h2 id="outcomes-and-future-extensibility">Outcomes and Future Extensibility</h2>
<p>The work successfully establishes the first standardized, machine-readable formats for chemical mixtures. Key achievements:</p>
<ul>
<li><strong>Comprehensive representation</strong>: Mixfile captures component identity, quantity, and hierarchy</li>
<li><strong>Canonical identification</strong>: MInChI provides compact, searchable identifiers</li>
<li><strong>Practical tooling</strong>: Open-source editor and text extraction demonstrate feasibility</li>
<li><strong>Real-world validation</strong>: Format handles diverse use cases from safety to inventory</li>
</ul>
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The authors acknowledge areas for improvement:</p>
<ul>
<li><strong>Machine learning improvements</strong>: Better text extraction using modern NLP techniques</li>
<li><strong>Extended coverage</strong>: Support for polymers, complex formulations, analytical results</li>
<li><strong>Community adoption</strong>: Integration with existing chemical databases and software</li>
</ul>
<p>The hierarchical design makes Mixfile suitable for both &ldquo;recipe&rdquo; descriptions (how to make something) and analytical results (what was found). This flexibility should help drive adoption across different use cases in chemistry and materials science.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="open-source-tooling--data">Open Source Tooling &amp; Data</h3>
<p>While the central repository for validating and establishing the MInChI standard is <a href="https://github.com/IUPAC/MInChI">github.com/IUPAC/MInChI</a>, the tools and datasets used to develop the paper&rsquo;s proofs of concept are hosted elsewhere:</p>
<ul>
<li><strong>Graphical Editor &amp; App codebase</strong>: The Electron application and Mixfile handling codebase (<code>console.js</code>) can be found at <a href="https://github.com/cdd/mixtures">github.com/cdd/mixtures</a>.</li>
<li><strong>Text Extraction Data</strong>: The several thousand extracted mixture records generated through the text extraction method can be accessed inside the <code>cdd/mixtures</code> repository under <a href="https://github.com/cdd/mixtures/tree/master/reference"><code>reference/gathering.zip</code></a>.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/IUPAC/MInChI">IUPAC/MInChI</a></td>
          <td style="text-align: left">Code / Data</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Validation test suite with ~150 mixture JSON files</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/cdd/mixtures">cdd/mixtures</a></td>
          <td style="text-align: left">Code / Data</td>
          <td style="text-align: left">GPL-3.0</td>
          <td style="text-align: left">Electron-based Mixfile editor, CLI tools, and reference mixture corpus</td>
      </tr>
  </tbody>
</table>
<p>The paper was funded by NIH Grant 1R43TR002528-01. No specific hardware requirements are needed, as this is a format specification with lightweight tooling.</p>
<h3 id="algorithms">Algorithms</h3>
<p>This section provides the specific algorithmic logic, schema definitions, and standardization rules needed to replicate the Mixfile parser or MInChI generator.</p>
<h4 id="the-strict-mixfile-json-schema">The Strict Mixfile JSON Schema</h4>
<p>To implement the format, a parser must recognize these specific fields:</p>
<p><strong>Root Structure</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;header&#34;</span>: {},
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: []
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Component Fields</strong>:</p>
<ul>
<li><code>name</code>: string (required if no structure is provided)</li>
<li><code>molfile</code>: string (the primary source of truth for molecular structure)</li>
<li><code>smiles</code>, <code>inchi</code>, <code>formula</code>: derived/transient fields for convenience</li>
<li><code>quantity</code>: number OR <code>[min, max]</code> array for ranges</li>
<li><code>units</code>: string (must map to supported ontology)</li>
<li><code>relation</code>: string (e.g., <code>&quot;&gt;&quot;</code>, <code>&quot;~&quot;</code>, <code>&quot;&gt;=&quot;</code>)</li>
<li><code>ratio</code>: array of two numbers <code>[numerator, denominator]</code></li>
<li><code>identifiers</code>: database assignments (e.g., CASRN, PubChem)</li>
<li><code>links</code>: URLs relevant to the component</li>
<li><code>contents</code>: recursive array for hierarchical mixtures</li>
</ul>
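<p>A parser can enforce this schema with a recursive check. The sketch below uses the field names above, but the validation rules themselves are my own illustrative choices, not part of the specification:</p>

```python
# Field names taken from the Mixfile schema above; rules are illustrative.
KNOWN_FIELDS = {"name", "molfile", "smiles", "inchi", "formula",
                "quantity", "units", "relation", "ratio",
                "identifiers", "links", "contents"}

def validate_component(comp, errors=None):
    """Collect schema problems for a component and its nested contents."""
    errors = [] if errors is None else errors
    for key in comp:
        if key not in KNOWN_FIELDS:
            errors.append(f"unknown field: {key}")
    if "name" not in comp and not any(
            k in comp for k in ("molfile", "smiles", "inchi", "formula")):
        errors.append("component needs a name or a structure")
    if "ratio" in comp and len(comp["ratio"]) != 2:
        errors.append("ratio must be [numerator, denominator]")
    for child in comp.get("contents", []):
        validate_component(child, errors)
    return errors

errs = validate_component({"name": "hexanes", "contents": [
    {"smiles": "CCCCCC", "quantity": 60, "units": "%"},
    {"color": "clear"},   # deliberately invalid: unknown field, no structure
]})
```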
<h4 id="minchi-generation-algorithm">MInChI Generation Algorithm</h4>
<p>To generate <code>MInChI=0.00.1S/...</code>, the software must follow these steps:</p>
<ol>
<li>
<p><strong>Component Layer</strong>:</p>
<ul>
<li>Calculate standard <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> for all structures in the mixture</li>
<li>Sort distinct InChIs alphabetically by the InChI string itself</li>
<li>Join with <code>&amp;</code> to form the structure layer</li>
</ul>
</li>
<li>
<p><strong>Hierarchy &amp; Concentration Layers</strong>:</p>
<ul>
<li>Traverse the Mixfile tree recursively</li>
<li><strong>Indexing</strong>: Use integer indices (1-based) referring to the sorted InChI list</li>
<li><strong>Grouping</strong>: Use <code>{}</code> to denote hierarchy branches and <code>&amp;</code> to separate nodes at the same level</li>
<li><strong>Concentration</strong>: Convert all quantities to canonical unit codes and apply scaling factors</li>
</ul>
</li>
</ol>
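<p>The steps above can be sketched for the simple flat (non-hierarchical) case. Concentrations are treated as opaque tokens here; real generation would also canonicalize units, and nested mixtures need the <code>{}</code> grouping syntax:</p>

```python
def minchi_for_flat_mixture(components):
    """Assemble a MInChI-style string for a flat mixture (sketch).

    `components` is a list of (inchi_body, concentration_token) pairs,
    where inchi_body is the InChI with its 'InChI=1S/' prefix stripped.
    """
    bodies = sorted({inchi for inchi, _ in components})        # sort alphabetically
    index = {inchi: i + 1 for i, inchi in enumerate(bodies)}   # 1-based indices
    ordered = sorted(components, key=lambda c: index[c[0]])
    n_layer = "&".join(str(index[inchi]) for inchi, _ in ordered)
    g_layer = "&".join(conc for _, conc in ordered)
    return "MInChI=0.00.1S/" + "&".join(bodies) + "/n" + n_layer + "/g" + g_layer

# Invented methanol/water example with placeholder concentration tokens:
s = minchi_for_flat_mixture([
    ("H2O/h1H2", "75wf"),
    ("CH4O/c1-2/h2H,1H3", "25wf"),
])
```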
<h4 id="unit-standardization-table">Unit Standardization Table</h4>
<p>Replication requires mapping input units to canonical MInChI codes. The full table from the paper (Table 1) includes:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Input Unit</th>
          <th style="text-align: left">MInChI Code</th>
          <th style="text-align: left">Scale Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">%</td>
          <td style="text-align: left">pp</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">w/v%</td>
          <td style="text-align: left">wv</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">w/w%</td>
          <td style="text-align: left">wf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">v/v%</td>
          <td style="text-align: left">vf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/mol%</td>
          <td style="text-align: left">mf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/L (M)</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">mmol/L</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
      <tr>
          <td style="text-align: left">g/L</td>
          <td style="text-align: left">wv</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/kg</td>
          <td style="text-align: left">mb</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">ratio</td>
          <td style="text-align: left">vp</td>
          <td style="text-align: left">1</td>
      </tr>
  </tbody>
</table>
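<p>The table translates directly into a lookup; the dictionary below is a sketch (the helper name <code>to_minchi_units</code> is invented, not from the paper):</p>

```python
# Table 1 as a lookup: input unit -> (MInChI code, scale factor)
UNIT_MAP = {
    "%":        ("pp", 1),
    "w/v%":     ("wv", 0.01),
    "w/w%":     ("wf", 0.01),
    "v/v%":     ("vf", 0.01),
    "mol/mol%": ("mf", 0.01),
    "mol/L":    ("mr", 1),
    "mmol/L":   ("mr", 1e-3),
    "g/L":      ("wv", 1e-3),
    "mol/kg":   ("mb", 1),
    "ratio":    ("vp", 1),
}

def to_minchi_units(value, unit):
    """Return (scaled value, canonical MInChI unit code)."""
    code, scale = UNIT_MAP[unit]
    return value * scale, code

print(to_minchi_units(97, "%"))  # -> (97, 'pp')
```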
<h4 id="text-extraction-logic">Text Extraction Logic</h4>
<p>The paper defines a recursive procedure for parsing plain-text mixture descriptions:</p>
<ol>
<li><strong>Input</strong>: Raw text string (e.g., &ldquo;2 M acetone in water&rdquo;)</li>
<li><strong>Rule Application</strong>: Apply RegEx rules in order:
<ul>
<li><em>Remove</em>: Delete common filler words (&ldquo;solution&rdquo;, &ldquo;in&rdquo;)</li>
<li><em>Replace</em>: Substitute known variations</li>
<li><em>Concentration</em>: Extract quantities like &ldquo;2 M&rdquo;, &ldquo;97%&rdquo;</li>
<li><em>Branch</em>: Split phrases like &ldquo;A in B&rdquo; into sub-nodes</li>
</ul>
</li>
<li><strong>Lookup</strong>: Check cleaned name against a custom table (handles cases like &ldquo;xylenes&rdquo; or specific structures)</li>
<li><strong>OPSIN</strong>: If no lookup match, send to the OPSIN tool to generate SMILES from the chemical name</li>
<li><strong>Embed</strong>: If structure found, generate 2D coordinates (Molfile) via RDKit</li>
</ol>
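<p>A toy version of this recursive parse, covering only the <em>Branch</em> and <em>Concentration</em> rules (the regexes and node fields are illustrative, and the lookup/OPSIN/embed steps are omitted):</p>

```python
import re

# Quantities like "2 M" or "97%"
CONC = re.compile(r"(\d+(?:\.\d+)?)\s*(M|%)")

def parse_mixture(text):
    node = {"name": text.strip(), "contents": []}
    # Branch rule: "A in B" splits into two sub-nodes
    if " in " in text:
        left, right = text.split(" in ", 1)
        node["name"] = ""
        node["contents"] = [parse_mixture(left), parse_mixture(right)]
        return node
    # Concentration rule: extract the quantity, keep the cleaned name
    m = CONC.search(text)
    if m:
        node["quantity"] = float(m.group(1))
        node["units"] = m.group(2)
        node["name"] = CONC.sub("", text).strip()
    return node

tree = parse_mixture("2 M acetone in water")
print(tree["contents"][0])  # {'name': 'acetone', 'contents': [], 'quantity': 2.0, 'units': 'M'}
```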
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clark, A. M., McEwen, L. R., Gedeck, P., &amp; Bunin, B. A. (2019). Capturing mixture composition: an open machine-readable format for representing mixed substances. <em>Journal of Cheminformatics</em>, <em>11</em>(1), 33. <a href="https://doi.org/10.1186/s13321-019-0357-4">https://doi.org/10.1186/s13321-019-0357-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2019)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{clark2019capturing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Capturing mixture composition: an open machine-readable format for representing mixed substances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Clark, Alex M and McEwen, Leah R and Gedeck, Peter and Bunin, Barry A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IUPAC/MInChI">Official MInChI GitHub repository</a></li>
</ul>
]]></content:encoded></item><item><title>Making InChI FAIR and Sustainable for Inorganic Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2025/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2025/</guid><description>InChI v1.07 modernizes chemical identifiers for FAIR data principles and adds comprehensive support for inorganic compounds.</description><content:encoded><![CDATA[<h2 id="paper-contribution-modernizing-chemical-identifiers">Paper Contribution: Modernizing Chemical Identifiers</h2>
<p>This is a <strong>Resource</strong> paper that describes the development and maintenance of InChI (International Chemical Identifier), a fundamental infrastructure component for chemical databases. While it includes methodological improvements to the canonicalization algorithm for inorganic compounds, its primary contribution is ensuring the sustainability and accessibility of a critical chemical informatics resource.</p>
<h2 id="motivation-the-inorganic-chemistry-problem">Motivation: The Inorganic Chemistry Problem</h2>
<p>The International Chemical Identifier (InChI) is ubiquitous in chemistry databases, identifying well over a billion structures. But the system was designed specifically for organic chemistry, and it systematically mishandles organometallic structures. The original implementation had significant limitations:</p>
<ul>
<li><strong>FAIR principles gap</strong>: Development was closed-source, documentation was inadequate, and the codebase was difficult to maintain</li>
<li><strong>Inorganic chemistry failure</strong>: Metal-ligand bonds were automatically disconnected, destroying stereochemical information for coordination complexes</li>
<li><strong>Technical debt</strong>: More than 3000 bugs and security vulnerabilities, nearly 60 Google OSS-Fuzz issues, and an unmaintainable codebase</li>
</ul>
<p>If you&rsquo;ve ever tried to search for a metal complex in a chemical database and gotten nonsense results, this is why. This paper describes the fix.</p>
<h2 id="core-innovation-smart-metal-ligand-handling">Core Innovation: Smart Metal-Ligand Handling</h2>
<p>The key innovations are:</p>
<ol>
<li>
<p><strong>Smart metal-ligand bond handling</strong>: A decision tree algorithm that uses coordination number and electronegativity to determine which bonds to keep and which to disconnect, preserving stereochemistry for coordination complexes</p>
</li>
<li>
<p><strong>Modernized development infrastructure</strong>: Migration to GitHub with open development, comprehensive testing, and maintainable documentation</p>
</li>
<li>
<p><strong>Backward compatibility</strong>: The core canonicalization algorithm remained unchanged, preserving over a billion existing InChIs for organic compounds</p>
</li>
</ol>
<p>The preprocessing step applies a two-pass iterative process for every metal in a structure:</p>
<ol>
<li><strong>Terminal metals</strong> (connected to only one other atom): check the electronegativity lookup table and disconnect if $\Delta EN \geq 1.7$</li>
<li><strong>Non-terminal metals</strong>: if coordination number exceeds the element&rsquo;s standard valence threshold, keep all bonds; otherwise, apply the same electronegativity check per bond (if at least one bond is kept, all are retained)</li>
<li>Hardcoded exceptions exist for Grignard reagents and organolithium compounds</li>
</ol>
<p>For example, $\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected as a coordination complex.</p>
<h2 id="validation-methods--experiments">Validation Methods &amp; Experiments</h2>
<p>The paper focuses on software engineering validation:</p>
<ul>
<li><strong>Bug fixing</strong>: Fixed more than 3000 bugs and security issues, plus nearly 60 Google OSS-Fuzz issues from the legacy codebase</li>
<li><strong>Backward compatibility testing</strong>: Verified that existing organic molecule InChIs remained unchanged</li>
<li><strong>Inorganic compound validation</strong>: Tested the new decision tree algorithm on coordination complexes, organometallic compounds, and ionic salts</li>
<li><strong>Documentation overhaul</strong>: Split technical documentation into Chemical Manual (for chemists) and Technical Manual (for developers)</li>
<li><strong>Web Demo</strong>: Created a browser-based <a href="https://iupac-inchi.github.io/InChI-Web-Demo/">InChI Web Demo</a> that calculates InChI, InChIKey, and AuxInfo from drawn structures or Molfiles, with all computation performed client-side</li>
</ul>
<p>The validation approach emphasizes maintaining the &ldquo;same molecule, same identifier&rdquo; principle while extending coverage to inorganic chemistry.</p>
<h2 id="key-outcomes-and-future-work">Key Outcomes and Future Work</h2>
<p>The v1.07 release successfully:</p>
<ul>
<li><strong>Modernizes infrastructure</strong>: Open development on GitHub with maintainable codebase</li>
<li><strong>Extends to inorganic chemistry</strong>: Proper handling of coordination complexes and organometallic compounds</li>
<li><strong>Maintains backward compatibility</strong>: No breaking changes for existing organic compound InChIs</li>
<li><strong>Improves database search</strong>: Metal complexes now searchable with correct stereochemistry preserved</li>
<li><strong>IUPAC approval</strong>: Version 1.07 has been approved by IUPAC&rsquo;s Committee on Publications and Cheminformatics Data Standards (CPCDS)</li>
</ul>
<p><strong>Acknowledged limitations</strong> for future work:</p>
<ul>
<li>Stereochemistry for inorganic and organometallic compounds still needs improvement, including atropisomers and MDL enhanced stereochemistry</li>
<li>Mixtures (MInChI) and nanomaterials (NInChI) remain unsolved problems</li>
<li>Chemical identifiers work best for discrete molecules and struggle with variable-composition materials</li>
</ul>
<p><strong>Impact</strong>: This update improves searchability of inorganic and organometallic compounds in major chemical databases by preserving coordination bond information that was previously discarded.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="software--data-availability">Software &amp; Data Availability</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IUPAC-InChI/InChI">IUPAC-InChI/InChI</a></td>
          <td>Code</td>
          <td>Open source (IUPAC/InChI Trust)</td>
          <td>Official C/C++ implementation of InChI v1.07</td>
      </tr>
      <tr>
          <td><a href="https://iupac-inchi.github.io/InChI-Web-Demo/">InChI Web Demo</a></td>
          <td>Other</td>
          <td>Open source</td>
          <td>Browser-based InChI/InChIKey generator for testing</td>
      </tr>
  </tbody>
</table>
<p>The InChI v1.07 codebase, primarily written in C/C++, is openly available on GitHub at <a href="https://github.com/IUPAC-InChI/InChI">IUPAC-InChI/InChI</a>. The repository includes the core canonicalization engine and the new inorganic preprocessing logic. Both the Technical Manual (for structural integration) and the Chemical Manual are maintained alongside the codebase. Compiled binaries are available for Windows, Linux, and macOS.</p>
<p><strong>Benchmarking Data</strong>: Validation of the new decision tree logic is handled through unit tests wired into the repository&rsquo;s continuous integration pipelines. Tests against existing organic compounds confirm backward compatibility, while new suites of coordination complexes and organometallic compounds verify that the v1.07 preprocessing triggers as expected.</p>
<h3 id="algorithms">Algorithms</h3>
<h4 id="the-metal-problem">The Metal Problem</h4>
<p>InChI&rsquo;s original algorithm assumed that bonds to metals were ionic and automatically disconnected them. This makes sense for something like sodium chloride (NaCl), where you have separate $\text{Na}^+$ and $\text{Cl}^-$ ions.</p>
<p>It fails for:</p>
<ul>
<li><strong>Coordination complexes</strong>: Where ligands are bonded to the metal center</li>
<li><strong>Organometallic compounds</strong>: Where carbon-metal bonds are covalent</li>
<li><strong>Sandwich compounds</strong>: Like ferrocene, where the bonding has both ionic and covalent character</li>
</ul>
<p>The result: loss of stereochemical information and identical InChIs for structurally different compounds.</p>
<h4 id="the-solution-smart-preprocessing">The Solution: Smart Preprocessing</h4>
<p>The new system uses a decision tree to figure out which metal-ligand bonds to keep and which to disconnect. The process is <strong>iterative</strong>: it runs for every metal in the structure, then checks every bond to that metal. In the C/C++ repository, this preprocessing logic acts as a filter applied <em>before</em> the traditional organic canonicalization engine (from v1.06) runs, dynamically determining whether coordination bonds are retained for downstream layer generation.</p>
<h5 id="decision-tree-logic">Decision Tree Logic</h5>
<p>The algorithm handles metals in two passes. First, <strong>terminal metals</strong> (bonded to only one atom) are checked against the electronegativity lookup table and disconnected if $\Delta EN \geq 1.7$. This preserves all metal-metal bonds.</p>
<p>Second, <strong>non-terminal metals</strong> are examined. For a metal $m$ bonded to ligand $l$:</p>
<p>$$
\begin{aligned}
B(m, l) &amp;=
\begin{cases}
\text{Connected (all bonds)} &amp; \text{if } CN(m) &gt; V(m) \\
\text{Connected} &amp; \text{if } |EN(m) - EN(l)| &lt; 1.7 \\
\text{Disconnected} &amp; \text{if } |EN(m) - EN(l)| \geq 1.7
\end{cases}
\end{aligned}
$$</p>
<p>A key rule: if at least one metal-ligand bond is kept for a given metal, all other bonds to that metal are also retained (no disconnection is carried out).</p>
<p><em>(Note: Explicit overrides exist for specific classes like Grignard reagents).</em></p>
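<p>The decision rule above can be sketched as a small function. Every number here is illustrative: the electronegativity values and valence threshold are invented so that the sketch reproduces the $\text{FeCl}_2$ / $[\text{FeCl}_4]^{2-}$ behavior; the actual lookup tables live in the C/C++ codebase.</p>

```python
# Illustrative-only numbers: invented EN values and valence thresholds,
# chosen so the FeCl2 / [FeCl4]2- example behaves as described above.
EN = {"Fe": 1.3, "Cl": 3.1}
VALENCE = {"Fe": 3}

def keep_bonds(metal, ligands):
    """True if the metal's bonds are all retained (stays connected)."""
    if len(ligands) > VALENCE[metal]:   # CN(m) > V(m): keep everything
        return True
    # If any one bond is covalent enough, all bonds are retained
    return any(abs(EN[metal] - EN[l]) < 1.7 for l in ligands)

print(keep_bonds("Fe", ["Cl", "Cl"]))              # False: FeCl2 disconnects
print(keep_bonds("Fe", ["Cl", "Cl", "Cl", "Cl"]))  # True: [FeCl4]2- stays connected
```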
<h5 id="hardcoded-chemical-exceptions">Hardcoded Chemical Exceptions</h5>
<p>The algorithm includes specific overrides based on well-established chemistry:</p>
<ul>
<li><strong>Grignard reagents (RMgX)</strong>: Explicitly configured to <strong>keep</strong> the Mg-C bond but <strong>disconnect</strong> the Mg-halide bond</li>
<li><strong>Organolithium compounds (RLi)</strong>: Explicitly configured to keep the structure intact</li>
</ul>
<p>These exceptions exist because the general electronegativity rules would give incorrect results for these compound classes.</p>
<h5 id="practical-example">Practical Example</h5>
<p>For example, $\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected because its coordination number exceeds the threshold.</p>
<h4 id="how-inchi-generation-works">How InChI Generation Works</h4>
<p>The process has six main steps:</p>
<ol>
<li><strong>Parse input</strong>: Read the structure from a file (Molfile, SDF, etc.)</li>
<li><strong>Convert to internal format</strong>: Transform into the software&rsquo;s data structures</li>
<li><strong>Normalize</strong>: Standardize tautomers, resolve ambiguities (where the new metal rules apply)</li>
<li><strong>Canonicalize</strong>: Create a unique representation independent of atom numbering</li>
<li><strong>Generate InChI string</strong>: Build the layered text identifier</li>
<li><strong>Create InChIKey</strong>: Hash the full string into a 27-character key for databases</li>
</ol>
<p>The InChI itself has separate layers for formula, connectivity, hydrogens, stereochemistry, isotopes, and charge. The InChIKey is what actually gets stored in databases for fast searching.</p>
<h5 id="inchikey-version-flag">InChIKey Version Flag</h5>
<p>The flag character of the InChIKey (the ninth character of the second block, immediately before the final version letter) indicates the version status:</p>
<ul>
<li><strong>&ldquo;S&rdquo;</strong>: Standard InChI</li>
<li><strong>&ldquo;N&rdquo;</strong>: Non-standard InChI</li>
<li><strong>&ldquo;B&rdquo;</strong>: Beta (experimental features)</li>
</ul>
<p>This flag is important for anyone parsing InChIKeys programmatically, as it tells you whether the identifier was generated using stable or experimental algorithms.</p>
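<p>A minimal sketch of reading that flag programmatically, assuming the flag sits immediately before the final version letter in the key&rsquo;s second block (ethanol&rsquo;s well-known standard InChIKey is used as the example):</p>

```python
def inchikey_status(key):
    """Classify an InChIKey by its flag character (the character just
    before the version letter at the end of the second block)."""
    block2 = key.split("-")[1]
    return {"S": "standard", "N": "non-standard", "B": "beta"}.get(block2[-2], "unknown")

print(inchikey_status("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))  # standard
```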
<h2 id="additional-context">Additional Context</h2>
<h3 id="what-inchi-actually-does">What InChI Actually Does</h3>
<p>InChI creates a unique text string for any chemical structure. SMILES has multiple vendor implementations and can represent the same molecule in different ways. InChI provides a single, standardized format controlled by IUPAC. The goal is simple: same molecule, same identifier, every time.</p>
<p>This matters for FAIR data principles:</p>
<ul>
<li><strong>Findable</strong>: You can search for a specific compound across databases</li>
<li><strong>Accessible</strong>: The standard is open and free</li>
<li><strong>Interoperable</strong>: Different systems can connect chemical knowledge</li>
<li><strong>Reusable</strong>: The identifiers work consistently across platforms</li>
</ul>
<h3 id="better-documentation">Better Documentation</h3>
<p>The technical manual has been split into two documents:</p>
<ul>
<li><strong>Chemical Manual</strong>: For chemists who need to understand what InChIs mean</li>
<li><strong>Technical Manual</strong>: For developers who need to implement the algorithms</li>
</ul>
<p>This addresses the problem of the previous documentation serving both audiences poorly.</p>
<h3 id="the-bigger-picture">The Bigger Picture</h3>
<p>InChI&rsquo;s evolution reflects chemistry&rsquo;s expansion beyond its organic roots. The fact that it took this long to properly handle inorganic compounds shows how much computational chemistry has historically focused on carbon-based molecules.</p>
<p>As the field moves into catalysis, materials science, and coordination chemistry applications, having proper chemical identifiers becomes essential. You can&rsquo;t build FAIR chemical databases if half of chemistry is represented incorrectly.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blanke, G., Brammer, J., Baljozovic, D., Khan, N. U., Lange, F., Bänsch, F., Tovee, C. A., Schatzschneider, U., Hartshorn, R. M., &amp; Herres-Pawlis, S. (2025). Making the InChI FAIR and sustainable while moving to inorganics. <em>Faraday Discussions</em>, 256, 503-519. <a href="https://doi.org/10.1039/D4FD00145A">https://doi.org/10.1039/D4FD00145A</a></p>
<p><strong>Publication</strong>: Faraday Discussions, 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blanke2025making,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Making the InChI FAIR and sustainable while moving to inorganics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blanke, G. and Brammer, J. and Baljozovic, D. and Khan, N. U. and Lange, F. and B{\&#34;a}nsch, F. and Tovee, C. A. and Schatzschneider, U. and Hartshorn, R. M. and Herres-Pawlis, S.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Faraday Discussions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{256}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{503--519}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI: The Worldwide Chemical Structure Identifier Standard</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</guid><description>Heller et al. (2013) explain how IUPAC's InChI became the global standard for representing chemical structures, its governance, and current limitations.</description><content:encoded><![CDATA[<h2 id="inchi-as-a-resource-and-systematization-standard">InChI as a Resource and Systematization Standard</h2>
<p>This is a <strong>Resource &amp; Systematization Paper</strong> that reviews the history, technical architecture, governance structure, and implementation status of the InChI standard. It documents both the institutional development of an open chemical identifier and the technical specification that enables it.</p>
<h2 id="the-motivation-interoperability-in-chemical-databases">The Motivation: Interoperability in Chemical Databases</h2>
<p>Before InChI, the chemistry community faced a fundamental interoperability problem. Chemical databases used proprietary systems like CAS Registry Numbers, or format-dependent representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. These were expensive, restricted, and relied on &ldquo;in-house&rdquo; databases.</p>
<p>The authors argue the Internet and Open Source software acted as a <strong>&ldquo;black swan&rdquo; event</strong> that disrupted this status quo. The Internet created a need to link diverse, free and fee-based resources without a central gatekeeper. InChI was designed as the solution: a non-proprietary, open-source identifier enabling linking of distinct data compilations.</p>
<h2 id="technical-and-institutional-innovations-of-inchi">Technical and Institutional Innovations of InChI</h2>
<p>InChI&rsquo;s innovation is both technical and institutional:</p>
<p><strong>Technical novelty</strong>: A hierarchical &ldquo;layered&rdquo; canonicalization system where structure representations build from basic connectivity to full stereochemistry. This allows flexible matching: a molecule with unknown stereochemistry produces an InChI that&rsquo;s a subset of the same molecule with known stereochemistry.</p>
<p><strong>Institutional novelty</strong>: Creating an open standard governed by a charitable trust (the InChI Trust) that convinced commercial competitors (publishers, databases) to adopt it as a &ldquo;pre-competitive&rdquo; necessity. This solved the political problem of maintaining an open standard in a competitive industry.</p>
<h3 id="technical-architecture-layers-and-hashing">Technical Architecture: Layers and Hashing</h3>
<h4 id="the-inchi-string">The InChI String</h4>
<p>InChI is a <strong>canonicalized structure representation</strong> derived from IUPAC conventions. It uses a hierarchical &ldquo;layered&rdquo; format where specific layers add detail. The exact technical specification includes these string segments:</p>
<ol>
<li><strong>Main Layer</strong>: Chemical Formula</li>
<li><strong>Connectivity Layer (<code>/c</code>)</strong>: Atoms and bonds (excluding bond orders)</li>
<li><strong>Hydrogen Layer (<code>/h</code>)</strong>: Tautomeric and immobile H atoms</li>
<li><strong>Charge (<code>/q</code>) &amp; Proton Balance (<code>/p</code>)</strong>: Accounting for ionization</li>
<li><strong>Stereochemistry</strong>:
<ul>
<li>Double bond (<code>/b</code>) and Tetrahedral (<code>/t</code>) parity</li>
<li>Parity inversion (<code>/m</code>)</li>
<li>Stereo type (<code>/s</code>): absolute, relative, or racemic</li>
</ul>
</li>
<li><strong>Fixed-H Layer (<code>/f</code>)</strong>: Distinguishes specific tautomers if needed</li>
</ol>
<p>This layered approach means that a molecule with unknown stereochemistry will have an InChI that&rsquo;s a subset of the same molecule with known stereochemistry. This allows for flexible matching at the connectivity level even without complete stereochemical information.</p>
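<p>A quick way to see the layers is to split a standard InChI on <code>/</code>. Using ethanol&rsquo;s well-known InChI (the one-letter prefixes key into the layer list above):</p>

```python
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"  # ethanol (standard InChI)

prefix, formula, *rest = inchi.split("/")
layers = {seg[0]: seg[1:] for seg in rest}  # one-letter tag -> payload

print(prefix)   # InChI=1S   (version + standard flag)
print(formula)  # C2H6O      (main layer: chemical formula)
print(layers)   # {'c': '1-2-3', 'h': '3H,2H2,1H3'}
```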
<h4 id="the-inchikey">The InChIKey</h4>
<p>Because InChI strings are often too long for search engines (queries tend to break beyond ~30 characters or at symbols like <code>/</code> and <code>+</code>), the InChIKey was created.</p>
<p><strong>Mechanism</strong>: A 27-character string generated via a <strong>SHA-256 hash</strong> of the InChI string. This can be represented as:</p>
<p>$$ \text{InChIKey} = f_{\text{SHA-256}}(\text{InChI}) $$</p>
<p><strong>Structure</strong>:</p>
<ul>
<li><strong>Block 1 (14 characters)</strong>: Encodes the molecular skeleton (connectivity)</li>
<li><strong>Block 2 (10 characters)</strong>: Eight letters encoding stereochemistry and isotopes, plus a flag indicating standard InChI (S) and an InChI version indicator (A for version 1)</li>
<li><strong>Block 3 (1 character)</strong>: Protonation flag (e.g., &lsquo;N&rsquo; for neutral)</li>
</ul>
<p>Because the InChIKey is a hash, it cannot be converted back to a structure (irreversible) and has a theoretical risk of collision. It is important to distinguish between <strong>InChI collisions</strong> (which are due to flaws/bugs and are very rare) and <strong>InChIKey collisions</strong> (which are mathematically inevitable due to hashing).</p>
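<p>The three blocks are easy to pull apart programmatically; a short sketch using ethanol&rsquo;s well-known InChIKey:</p>

```python
key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"  # ethanol's standard InChIKey
block1, block2, block3 = key.split("-")

print(len(block1), block1)  # 14-character skeleton (connectivity) block
print(len(block2), block2)  # 10 characters: stereo/isotopes + 'S' flag + 'A' version
print(block3)               # protonation flag: 'N' for neutral
```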
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This is a systematization paper documenting an existing standard. However, the authors provide:</p>
<p><strong>Validation evidence</strong>:</p>
<ul>
<li><strong>Certification Suite</strong>: A test suite that software vendors must pass to display the &ldquo;InChI Certified&rdquo; logo, preventing fragmentation</li>
<li><strong>Round-trip conversion testing</strong>: Demonstrated &gt;99% success rate converting InChI back to structure (100% with AuxInfo layer)</li>
<li><strong>Real-world adoption metrics</strong>: Documented integration across major chemical databases and publishers</li>
</ul>
<p><strong>Known limitations identified</strong>:</p>
<ul>
<li>Tautomer representation issues in Version 1 (different drawings of the same tautomer can generate different InChIs)</li>
<li>Edge cases in stereochemistry representation</li>
</ul>
<h3 id="institutional-history--governance">Institutional History &amp; Governance</h3>
<p><strong>Origin</strong>: The project was initiated at a March 2000 IUPAC meeting in Washington, DC. It was originally called the <strong>IUPAC Chemical Identifier Project (IChIP)</strong>.</p>
<p><strong>Development</strong>: Technical work was done by NIST (Stein, Heller, Tchekhovskoi), overseen by the IUPAC <strong>CCINS</strong> committee, which later became the <strong>InChI Subcommittee</strong> of Division VIII.</p>
<p><strong>The InChI Trust</strong>: To ensure the algorithm survived beyond a volunteer organization, the <strong>InChI Trust</strong> was formed in 2009. It is a UK charity supported by publishers and databases (e.g., Nature, RSC) to maintain the standard pre-competitively. This was a critical innovation: getting commercial publishers and software vendors to agree that a non-proprietary standard would benefit everyone.</p>
<h2 id="real-world-impact-and-future-directions">Real-World Impact and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Success through &ldquo;un-coerced adoption&rdquo;</strong>: InChI succeeded because commercial competitors viewed it as a &ldquo;pre-competitive&rdquo; necessity for the Internet age. The open governance model proved durable.</p>
<p><strong>Technical achievements</strong>:</p>
<ul>
<li>Reversible representation (&gt;99% without AuxInfo, 100% with it)</li>
<li>Hierarchical structure enables flexible matching at different levels of detail</li>
<li>InChIKey enables web search despite being a hash (with inherent collision risk)</li>
</ul>
<h3 id="limitations-acknowledged-as-of-2013">Limitations Acknowledged (as of 2013)</h3>
<ul>
<li><strong>Tautomerism Issues</strong>: Different drawings of the same tautomer (e.g., 1,4-oxime vs nitroso) can generate different InChIs in Version 1, which is targeted for Version 2</li>
<li><strong>Hash collision risk</strong>: InChIKey collisions are mathematically inevitable due to SHA-256 hashing, though InChI collisions (actual bugs) are very rare</li>
<li><strong>Certification required</strong>: To prevent fragmentation, software must pass the InChI Certification Suite</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors note that while this paper documents the state as of 2013, InChI continues to evolve. Tautomer handling and edge cases in stereochemistry representation were priorities for future versions. The governance model through the InChI Trust was designed to ensure long-term maintenance beyond the original volunteer contributors.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This systematization paper documents an existing standard. Key implementation resources are openly maintained by the InChI Trust.</p>
<h3 id="code--software">Code &amp; Software</h3>
<ul>
<li><strong>Official Open Source Implementation</strong>: The C source code and pre-compiled binaries for the InChI algorithm are freely available via the <a href="https://www.inchi-trust.org/downloads/">InChI Trust Downloads Page</a> and their <a href="https://github.com/IUPAC-InChI/InChI">official GitHub repository</a>.</li>
<li><strong>Canonicalization algorithm</strong>: Open-source implementation of IUPAC-based rules for generating unique representations from multiple possible drawings of the same molecule.</li>
</ul>
<h3 id="data--validation">Data &amp; Validation</h3>
<ul>
<li><strong>InChI Certification Suite</strong>: A test suite of chemical structures provided by the InChI Trust used to validate that third-party software implementations generate correct InChIs.</li>
<li><strong>Version 1 specification</strong>: Complete technical documentation of the layered format.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Round-trip conversion</strong>: &gt;99% success rate (100% with AuxInfo) as validated by NIST and IUPAC.</li>
<li><strong>Certification testing</strong>: Pass/fail validation for software claiming InChI compliance.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., &amp; Pletnev, I. (2013). InChI - the worldwide chemical structure identifier standard. <em>Journal of Cheminformatics</em>, <em>5</em>(1), 7. <a href="https://doi.org/10.1186/1758-2946-5-7">https://doi.org/10.1186/1758-2946-5-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{heller2013inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{InChI} - the worldwide chemical structure identifier standard}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heller, Stephen and McNaught, Alan and Stein, Stephen and Tchekhovskoi, Dmitrii and Pletnev, Igor}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/1758-2946-5-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI and Tautomerism: Toward Comprehensive Treatment</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-and-tautomers/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-and-tautomers/</guid><description>Dhaked et al. compile 86 tautomeric rules and validate them across 400M+ structures, revealing that current InChI misses half of tautomeric relationships.</description><content:encoded><![CDATA[<h2 id="paper-contribution-a-systematized-tautomer-database-resource">Paper Contribution: A Systematized Tautomer Database Resource</h2>
<p>This is a <strong>Resource</strong> paper with strong <strong>Systematization</strong> elements. It provides a comprehensive catalog of 86 tautomeric transformation rules (20 pre-existing CACTVS defaults plus 66 new rules derived from experimental literature), designed to serve as a foundational resource for chemical database systems and the InChI V2 identifier standard. The systematic validation across 400+ million structures also makes it a benchmarking study for evaluating current chemoinformatics tools.</p>
<h2 id="the-tautomerism-problem-in-chemical-databases">The Tautomerism Problem in Chemical Databases</h2>
<p>Chemical databases face a fundamental problem: the same molecule can appear multiple times under different identifiers simply because it exists in different tautomeric forms. For example, glucose&rsquo;s ring-closed and open-chain forms are the same molecule; however, current chemical identifiers (including InChI) often treat them as distinct compounds.</p>


<figure class="post-figure center ">
    <img src="/img/notes/Glucose-tautomerism.webp"
         alt="D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism"
         title="D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ring-chain tautomerism in glucose: the open-chain aldehyde form (left) and the cyclic pyranose form (right) are the same molecule in different tautomeric states.</figcaption>
    
</figure>

<p>This creates three critical problems:</p>
<ol>
<li><strong>Database redundancy</strong>: Millions of duplicate entries for the same chemical entities</li>
<li><strong>Search failures</strong>: Researchers miss relevant compounds during structure searches</li>
<li><strong>ML training issues</strong>: Machine learning models learn to treat tautomers as different molecules</li>
</ol>
<p>The motivation for this work is to provide a comprehensive, experimentally-grounded rule set that enables InChI V2 to properly recognize tautomeric relationships, eliminating these problems at the identifier level.</p>
<h2 id="86-comprehensive-tautomeric-transformation-rules">86 Comprehensive Tautomeric Transformation Rules</h2>
<p>The key contributions are:</p>
<ol>
<li>
<p><strong>Comprehensive Rule Set</strong>: Compilation of <strong>86 tautomeric transformation rules</strong> (20 pre-existing CACTVS defaults plus 66 new rules derived from experimental literature), categorized into:</p>
<ul>
<li>54 Prototropic rules (classic H-movement tautomerism)</li>
<li>21 Ring-Chain rules (cyclic/open-chain transformations)</li>
<li>11 Valence rules (structural rearrangements with valence changes)</li>
</ul>
</li>
<li>
<p><strong>Massive-Scale Validation</strong>: Testing these rules against <strong>nine major chemical databases</strong> totaling over 400 million structures to identify coverage gaps in current InChI implementations</p>
</li>
<li>
<p><strong>Quantitative Assessment</strong>: Systematic measurement showing that current InChI (even with Nonstandard 15T + KET settings) only achieves ~50% success in recognizing tautomeric relationships, with some new rules showing &lt;2% success rates</p>
</li>
<li>
<p><strong>Practical Tools</strong>: Creation of the <strong>Tautomerizer</strong> web tool for public use, demonstrating practical application of the rule set</p>
</li>
</ol>
<p>The novelty lies in the systematic compilation and validation of transformation rules at a scale that reveals critical gaps in current chemical identification systems.</p>
<h2 id="massive-scale-validation-across-400m-structures">Massive-Scale Validation Across 400M+ Structures</h2>
<h3 id="database-analysis">Database Analysis</h3>
<p>The researchers analyzed <strong>nine chemical databases</strong> totaling 400+ million structures:</p>
<ul>
<li><strong>Public databases</strong>: PubChem (largest), ChEMBL, DrugBank, PDB Ligands, SureChEMBL, AMS, ChemNavigator</li>
<li><strong>Private databases</strong>: CSD (Cambridge Structural Database), CSDB (NCI internal)</li>
</ul>
<h3 id="methodology">Methodology</h3>
<p><strong>Software</strong>: CACTVS Chemoinformatics Toolkit (versions 3.4.6.33 and 3.4.8.6)</p>
<p><strong>Tautomer Generation Protocol</strong>:</p>
<ul>
<li><strong>Algorithm</strong>: Single-step generation (apply transforms to input structure only, avoiding recursion)</li>
<li><strong>Constraints</strong>: Max 10 tautomers per structure, 30-second CPU timeout per transform</li>
<li><strong>Format</strong>: All rules expressed as SMIRKS strings</li>
<li><strong>Stereochemistry</strong>: Stereocenters involved in tautomerism were flattened during transformation</li>
</ul>
<p><strong>Success Metrics</strong> (tested against InChI V.1.05):</p>
<ul>
<li><strong>Complete InChI match</strong>: All tautomers share identical InChI</li>
<li><strong>Partial InChI match</strong>: At least two tautomers share an InChI</li>
<li>Tested against two InChI configurations: Standard InChI and Nonstandard InChI (with 15T and KET options enabled)</li>
</ul>
<h3 id="rule-coverage-analysis">Rule Coverage Analysis</h3>
<p>For each of the 86 rules, the researchers:</p>
<ol>
<li>Applied the transformation to all molecules in each database</li>
<li>Generated tautomers using the SMIRKS patterns</li>
<li>Computed InChI identifiers for each tautomer</li>
<li>Measured success rates (percentage of cases where InChI recognized the relationship)</li>
</ol>
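<p>These four steps amount to a single evaluation loop per rule. A minimal sketch, where <code>apply_smirks</code> and <code>to_inchi</code> are hypothetical stand-ins for the CACTVS and InChI toolkit calls used in the paper:</p>

```python
from collections import Counter

def evaluate_rule(molecules, smirks, apply_smirks, to_inchi):
    """Tally complete/partial/fail outcomes for one transformation rule (sketch)."""
    outcomes = Counter()
    for mol in molecules:
        tautomers = apply_smirks(mol, smirks)              # steps 1-2: generate tautomers
        if not tautomers:
            continue                                       # rule does not apply here
        inchis = [to_inchi(t) for t in (mol, *tautomers)]  # step 3: compute identifiers
        if len(set(inchis)) == 1:
            outcomes["complete"] += 1                      # step 4: InChI unifies all forms
        elif len(inchis) > len(set(inchis)):
            outcomes["partial"] += 1                       # at least two forms collide
        else:
            outcomes["fail"] += 1                          # every form gets a distinct InChI
    return outcomes
```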
<h3 id="key-findings-from-experiments">Key Findings from Experiments</h3>
<p><strong>Rule Frequency</strong>: The most common rule <code>PT_06_00</code> (1,3-heteroatom H-shift, covering keto-enol tautomerism) affects <strong>&gt;70% of molecules</strong> across databases.</p>
<p><strong>InChI Performance</strong>:</p>
<ul>
<li>Standard InChI: ~37% success rate</li>
<li>Nonstandard InChI (15T + KET): ~50% success rate</li>
<li>Many newly defined rules: &lt;2% success rate</li>
</ul>
<p><strong>Scale Impact</strong>: Implementing the full 86-rule set would approximately <strong>triple</strong> the number of compounds recognized as having tautomeric relationships relative to Standard InChI.</p>
<h2 id="outcomes-inchi-v2-requirements-and-coverage-gaps">Outcomes: InChI V2 Requirements and Coverage Gaps</h2>
<h3 id="main-findings">Main Findings</h3>
<ol>
<li>
<p><strong>Current Systems Are Inadequate</strong>: Even with the Nonstandard 15T + KET settings, InChI only achieves ~50% success in recognizing tautomeric relationships, with Standard InChI at ~37%</p>
</li>
<li>
<p><strong>Massive Coverage Gap</strong>: The new rule set reveals millions of tautomeric relationships that current InChI completely misses, particularly for ring-chain and valence tautomerism</p>
</li>
<li>
<p><strong>Implementation Requirement</strong>: InChI V2 will require a major redesign to handle the comprehensive rule set</p>
</li>
<li>
<p><strong>Rule Validation</strong>: The 86-rule set provides a validated foundation for next-generation chemical identifiers, with the new rules further confirmed against an independent ChEMBL 24.1 tautomer extraction</p>
</li>
</ol>
<h3 id="implications">Implications</h3>
<p><strong>For Chemical Databases</strong>:</p>
<ul>
<li>Reduced redundancy through proper tautomer recognition</li>
<li>Improved data quality and consistency</li>
<li>More comprehensive structure search results</li>
</ul>
<p><strong>For Machine Learning</strong>:</p>
<ul>
<li>More accurate training data (tautomers properly grouped)</li>
<li>Better molecular property prediction models</li>
<li>Reduced dataset bias from tautomeric duplicates</li>
</ul>
<p><strong>For Chemoinformatics Tools</strong>:</p>
<ul>
<li>Blueprint for InChI V2 development</li>
<li>Standardized rule set for tautomer generation</li>
<li>Public tool (Tautomerizer) for practical use</li>
</ul>
<h3 id="limitations-acknowledged">Limitations Acknowledged</h3>
<ul>
<li>Single-step generation only (omits recursive enumeration of all possible tautomers)</li>
<li>30-second timeout may miss complex transformations</li>
<li>Some tautomeric preferences are context-dependent (pH, solvent) and require more than static rules for capture</li>
</ul>
<h3 id="additional-validation">Additional Validation</h3>
<p>The authors validated their rule set against 4,158 tautomeric systems independently extracted from ChEMBL 24.1 via a SMILES-based tautomer hash (provided by Noel O&rsquo;Boyle and Roger Sayle). Their rules covered essentially all tautomeric systems in that set, with practically all cases handled by the standard CACTVS rules PT_02_00 through PT_21_00.</p>
<h3 id="companion-resource-tautomer-database">Companion Resource: Tautomer Database</h3>
<p>A companion paper describes the creation of a publicly available Tautomer Database (Tauto DB) containing over 2,800 tautomeric tuples extracted from experimental literature, available at <a href="https://cactus.nci.nih.gov/download/tautomer/">https://cactus.nci.nih.gov/download/tautomer/</a>. Data from this database informed the generation of new rules in this work.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The paper lays groundwork for InChI V2 development, emphasizing that the comprehensive rule set necessitates algorithmic redesign.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Datasets Analyzed</strong> (400M+ total structures):</p>
<p><strong>Public Databases</strong> (enable partial reproduction):</p>
<ul>
<li><strong>PubChem</strong>: Largest public chemical database</li>
<li><strong>ChEMBL</strong>: Bioactive molecules with drug-like properties</li>
<li><strong>DrugBank</strong>: FDA-approved and experimental drugs</li>
<li><strong>PDB Ligands</strong>: Small molecules from protein structures</li>
<li><strong>SureChEMBL</strong>: Chemical structures from patents</li>
<li><strong>AMS</strong>: Screening samples</li>
<li><strong>ChemNavigator</strong>: Commercial chemical database</li>
</ul>
<p><strong>Private/Proprietary Databases</strong> (preclude full-scale reproduction):</p>
<ul>
<li><strong>CSD</strong>: Cambridge Structural Database (requires commercial/academic license)</li>
<li><strong>CSDB</strong>: NCI internal database (private)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tautomer Generation</strong>:</p>
<ul>
<li><strong>Method</strong>: Single-step SMIRKS-based transformations</li>
<li><strong>Constraints</strong>:
<ul>
<li>Maximum 10 tautomers per input structure</li>
<li>30-second CPU timeout per transformation</li>
<li>Stereochemistry flattening for affected centers</li>
</ul>
</li>
<li><strong>Toolkit Dependency</strong>: The authors used the CACTVS Chemoinformatics Toolkit. Researchers attempting to reproduce this with fully open-source tools (like RDKit) may encounter differing behavior due to proprietary chemical perception logic and licensing differences.</li>
</ul>
<p><strong>Rule Categories</strong>:</p>
<ul>
<li><strong>Prototropic (PT)</strong>: 54 rules for hydrogen movement
<ul>
<li>Most common: <code>PT_06_00</code> (1,3-heteroatom H-shift, &gt;70% coverage)</li>
</ul>
</li>
<li><strong>Ring-Chain (RC)</strong>: 21 rules for cyclic/open-chain transformations
<ul>
<li>Examples: <code>RC_03_00</code> (pentose sugars), <code>RC_04_01</code> (hexose sugars)</li>
</ul>
</li>
<li><strong>Valence (VT)</strong>: 11 rules for valence changes
<ul>
<li>Notable: <code>VT_02_00</code> (tetrazole/azide, ~2.8M hits)</li>
</ul>
</li>
</ul>
<p><strong>InChI Comparison</strong>:</p>
<ul>
<li>Standard InChI (default settings)</li>
<li>Nonstandard InChI with <code>15T</code> and <code>KET</code> options (mobile H and keto-enol)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Success Metrics</strong>:</p>
<p>Let $\mathcal{T}(m)$ be the set of generated tautomers for molecule $m$.</p>
<ul>
<li><strong>Complete Match</strong>: Occurs iff $\forall t_i, t_j \in \mathcal{T}(m), \text{InChI}(t_i) = \text{InChI}(t_j)$.</li>
<li><strong>Partial Match</strong>: At least 2 tautomers share the same InChI.</li>
<li><strong>Fail</strong>: All tautomers have different InChIs.</li>
</ul>
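<p>In plain Python, this three-way classification reduces to counting distinct InChIs in $\mathcal{T}(m)$. The strings below stand in for computed identifiers; this is a sketch of the metric, not the paper's CACTVS pipeline:</p>

```python
def classify_match(inchis: list[str]) -> str:
    """Classify a tautomer set by InChI agreement (illustrative sketch)."""
    if len(set(inchis)) == 1:
        return "complete"   # every tautomer maps to the same InChI
    if len(inchis) > len(set(inchis)):
        return "partial"    # at least two tautomers share an InChI
    return "fail"           # all tautomers map to distinct InChIs

print(classify_match(["A", "A", "A"]))  # -> complete
print(classify_match(["A", "A", "B"]))  # -> partial
print(classify_match(["A", "B", "C"]))  # -> fail
```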
<p><strong>Benchmark Results</strong>:</p>
<ul>
<li>Standard InChI: ~37% success rate across all rules</li>
<li>Nonstandard (15T + KET): ~50% success rate</li>
<li>New rules: Many show &lt;2% recognition by current InChI</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Software Environment</strong>:</p>
<ul>
<li><strong>Toolkit</strong>: CACTVS Chemoinformatics Toolkit v3.4.6.33 and v3.4.8.6</li>
<li><strong>Hash Functions</strong>:
<ul>
<li><code>E_TAUTO_HASH</code> (tautomer-invariant identifier)</li>
<li><code>E_ISOTOPE_STEREO_HASH128</code> (tautomer-sensitive identifier)</li>
</ul>
</li>
</ul>
<p><strong>Note</strong>: The paper omits computational hardware specifications but acknowledges using the NIH HPC Biowulf cluster. Evaluating 400M+ structures necessitates high-throughput cluster computing, making it computationally expensive for an individual to replicate the full analysis from scratch.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://cactus.nci.nih.gov/tautomerizer/">Tautomerizer Web Tool</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Public web tool for applying tautomeric rules to user molecules</td>
      </tr>
      <tr>
          <td><a href="https://cactus.nci.nih.gov/download/tautomer/">Tautomer Database</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>2800+ experimental tautomeric tuples (companion resource)</td>
      </tr>
      <tr>
          <td><a href="https://pubs.acs.org/doi/10.1021/acs.jcim.9b01080">SMIRKS and Scripts (SI)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>CACTVS Tcl scripts and SMIRKS provided as Supporting Information</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Dhaked, D. K., Ihlenfeldt, W.-D., Patel, H., Delannée, V., &amp; Nicklaus, M. C. (2020). Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2. <em>Journal of Chemical Information and Modeling</em>, <em>60</em>(3), 1253-1275. <a href="https://doi.org/10.1021/acs.jcim.9b01080">https://doi.org/10.1021/acs.jcim.9b01080</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{dhaked2020toward,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Dhaked, Devendra K and Ihlenfeldt, Wolf-Dietrich and Patel, Hitesh and Delann{\&#39;e}e, Victorien and Nicklaus, Marc C}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1253--1275}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.9b01080}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://cactus.nci.nih.gov/tautomerizer/">Tautomerizer Tool</a> - Public web tool for testing tautomeric transformations</li>
</ul>
]]></content:encoded></item><item><title>SELFIES: A Robust Molecular String Representation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies/</link><pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies/</guid><description>SELFIES is a robust molecular string representation for ML where every string decodes to a valid molecule, implemented in the selfies Python library.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p><strong>SELFIES (SELF-referencIng Embedded Strings)</strong> is a string-based molecular representation where every possible string, even one generated randomly, corresponds to a syntactically and semantically valid molecule. This property addresses a major limitation of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, where a large fraction of strings produced by machine learning models represent invalid chemical structures.</p>
<p>The format is implemented in an open-source Python library called <code>selfies</code>. Since the <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original publication</a>, the library has undergone significant architectural changes, most notably replacing the original string-manipulation engine with a graph-based internal representation that improved both performance and extensibility (see <a href="#recent-developments">Recent Developments</a>).</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Guaranteed Validity</strong>: Every possible SELFIES string can be decoded into a valid molecular graph that obeys chemical valence rules. This is its fundamental advantage over SMILES.</li>
<li><strong>Machine Learning Friendly</strong>: Can be used directly in any machine learning model (like VAEs or GANs) without adaptation, guaranteeing that all generated outputs are valid molecules.</li>
<li><strong>Customizable Constraints</strong>: The underlying chemical rules, such as maximum valence for different atoms, can be customized by the user. The library provides presets (e.g., for hypervalent species) and allows users to define their own rule sets.</li>
<li><strong>Human-readable</strong>: With some familiarity, SELFIES strings are human-readable, allowing interpretation of functional groups and connectivity.</li>
<li><strong>Local Operations</strong>: SELFIES encodes branch length and ring size as adjacent symbols in the string (rather than requiring matched delimiters or repeated digits at distant positions, as SMILES does), preventing common syntactical errors like unmatched parentheses or mismatched ring-closure digits.</li>
<li><strong>Broad Support</strong>: The current <code>selfies</code> library supports aromatic molecules (via kekulization), isotopes, charges, radicals, and stereochemistry. It also includes a dot symbol (<code>.</code>) for representing disconnected molecular fragments.</li>
</ul>
<h2 id="basic-syntax">Basic Syntax</h2>
<p>SELFIES uses symbols enclosed in square brackets (e.g., <code>[C]</code>, <code>[O]</code>, <code>[#N]</code>). The interpretation of each symbol depends on the current <strong>state of the derivation</strong> (described below), which ensures chemical valence rules are strictly obeyed. The syntax is formally defined by a Chomsky type-2 context-free grammar.</p>
<h3 id="derivation-rules">Derivation Rules</h3>
<p>SELFIES are constructed using a table of derivation rules. The process starts in an initial state (e.g., $X_0$) and reads the SELFIES string symbol by symbol. Each symbol, combined with the current state, determines the resulting atom/bond and the next state. The derivation state $X_n$ intuitively tracks that the previously added atom can form a maximum of $n$ additional bonds.</p>
<p>For example, the string <code>[F][=C][=C][#N]</code> is derived as follows, where $X_n$ indicates the atom can form up to $n$ additional bonds. Notice how bond demotion occurs: the first <code>[=C]</code> requests a double bond, but only a single bond is formed because state $X_1$ limits the connection to one bond.</p>
<p>$$
\begin{aligned}
\text{State } X_0 + \text{[F]} &amp;\rightarrow \text{F} + \text{State } X_1 \\
\text{State } X_1 + \text{[=C]} &amp;\rightarrow \text{F-C} + \text{State } X_3 \\
\text{State } X_3 + \text{[=C]} &amp;\rightarrow \text{F-C=C} + \text{State } X_2 \\
\text{State } X_2 + [\#\text{N}] &amp;\rightarrow \text{F-C=C=N} + \text{Final}
\end{aligned}
$$</p>
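<p>This state-machine behavior, including bond demotion, can be mimicked in a few lines. The following is a toy sketch of the derivation-state idea only, not the <code>selfies</code> library's actual decoder (it ignores branches, rings, and hydrogen filling):</p>

```python
VALENCES = {"F": 1, "O": 2, "N": 3, "C": 4}
BOND_ORDER = {"": 1, "=": 2, "#": 3}

def derive(symbols):
    """Return (element, bond-order-to-previous-atom) pairs for a linear chain."""
    atoms, state = [], 0
    for sym in symbols:
        body = sym.strip("[]")
        bond_char = body[0] if body[0] in "=#" else ""
        element = body.lstrip("=#")
        # Bond demotion: the state caps the requested bond order.
        order = 0 if not atoms else min(BOND_ORDER[bond_char], state)
        atoms.append((element, order))
        state = VALENCES[element] - order   # remaining bonds = new state X_n
    return atoms

print(derive(["[F]", "[=C]", "[=C]", "[#N]"]))
# -> [('F', 0), ('C', 1), ('C', 2), ('N', 2)]   i.e. F-C=C=N
```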
<h3 id="structural-features">Structural Features</h3>
<ul>
<li><strong>Branches</strong>: Represented by a <code>[Branch]</code> symbol. The symbols immediately following it are interpreted as an index that specifies the number of SELFIES symbols belonging to that branch. This structure prevents errors like unmatched parentheses in SMILES.</li>
<li><strong>Rings</strong>: Represented by a <code>[Ring]</code> symbol. Similar to branches, subsequent symbols specify an index that indicates which previous atom to connect to, forming a ring closure. To avoid violating valence constraints, ring bond creation is postponed to a final post-processing step, where it is only completed if the target atom has available bonds.</li>
</ul>
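<p>As a concrete sketch of the index encoding: the symbol following <code>[Ring1]</code> is read as a number via an ordered index alphabet. The alphabet below is taken from the <code>selfies</code> documentation and is an assumption of this example:</p>

```python
# Ordered index alphabet (per the selfies documentation); the symbol after
# [Ring1]/[Branch1] is interpreted as its position in this list.
INDEX_ALPHABET = ["[C]", "[Ring1]", "[Ring2]", "[Branch1]", "[=Branch1]",
                  "[#Branch1]", "[Branch2]", "[=Branch2]", "[#Branch2]",
                  "[O]", "[N]", "[=N]", "[=C]", "[#C]", "[S]", "[P]"]

def ring_span(index_symbol: str) -> int:
    """How many atoms back a [Ring1] closure reaches (index + 1)."""
    return INDEX_ALPHABET.index(index_symbol) + 1

# Benzene, [C][=C][C][=C][C][=C][Ring1][=Branch1]: [=Branch1] has index 4,
# so the bond closes 5 atoms back from the sixth carbon, forming the 6-ring.
print(ring_span("[=Branch1]"))  # -> 5
```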
<h2 id="examples">Examples</h2>
<p>To see how these derivation rules work in practice, here are SELFIES representations for common molecules of increasing complexity:</p>
<figure class="post-figure center ">
    <img src="/img/selfies/ethanol.webp"
         alt="Ethanol molecule from SELFIES"
         title="Ethanol molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ethanol: <code>[C][C][O]</code></figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/selfies/benzene.webp"
         alt="Benzene molecule from SELFIES"
         title="Benzene molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Benzene: <code>[C][=C][C][=C][C][=C][Ring1][=Branch1]</code></figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/selfies/aspirin.webp"
         alt="Aspirin molecule from SELFIES"
         title="Aspirin molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Aspirin: <code>[C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][C][=O][O]</code></figcaption>
    
</figure>

<h2 id="the-selfies-python-library">The <code>selfies</code> Python Library</h2>
<p>The <code>selfies</code> library provides a dependency-free Python implementation. Here are the core operations:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMILES -&gt; SELFIES</span>
</span></span><span style="display:flex;"><span>smiles <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;c1ccc(C(=O)O)cc1&#34;</span>  <span style="color:#75715e"># benzoic acid</span>
</span></span><span style="display:flex;"><span>encoded <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>encoder(smiles)
</span></span><span style="display:flex;"><span>print(encoded)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [C][=C][C][=C][C][Branch1][C][=O][O][=C][Ring1][=Branch1]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SELFIES -&gt; SMILES</span>
</span></span><span style="display:flex;"><span>decoded <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(encoded)
</span></span><span style="display:flex;"><span>print(decoded)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C1=CC=CC(=C1)C(=O)O</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Robustness: random strings always decode to valid molecules</span>
</span></span><span style="display:flex;"><span>random_selfies <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[C][F][Ring1][O][=N][Branch1][C][S]&#34;</span>
</span></span><span style="display:flex;"><span>print(sf<span style="color:#f92672">.</span>decoder(random_selfies))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; always returns a valid molecule</span>
</span></span></code></pre></div><h3 id="tokenization-and-encoding">Tokenization and Encoding</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>selfies_str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[C][=C][C][=C][C][Branch1][C][=O][O][=C][Ring1][=Branch1]&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Tokenize into individual symbols</span>
</span></span><span style="display:flex;"><span>tokens <span style="color:#f92672">=</span> list(sf<span style="color:#f92672">.</span>split_selfies(selfies_str))
</span></span><span style="display:flex;"><span>print(tokens)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [&#39;[C]&#39;, &#39;[=C]&#39;, &#39;[C]&#39;, &#39;[=C]&#39;, &#39;[C]&#39;, &#39;[Branch1]&#39;, &#39;[C]&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#     &#39;[=O]&#39;, &#39;[O]&#39;, &#39;[=C]&#39;, &#39;[Ring1]&#39;, &#39;[=Branch1]&#39;]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Get the alphabet (unique token set) from a dataset</span>
</span></span><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;[C][C][O]&#34;</span>, <span style="color:#e6db74">&#34;[C][=C][C][=C][C][=C][Ring1][=Branch1]&#34;</span>]
</span></span><span style="display:flex;"><span>alphabet <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>get_alphabet_from_selfies(dataset)
</span></span><span style="display:flex;"><span>print(sorted(alphabet))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [&#39;[=Branch1]&#39;, &#39;[=C]&#39;, &#39;[C]&#39;, &#39;[O]&#39;, &#39;[Ring1]&#39;]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Convert to integer encoding for ML pipelines</span>
</span></span><span style="display:flex;"><span>encoding, _ <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>selfies_to_encoding(
</span></span><span style="display:flex;"><span>    selfies<span style="color:#f92672">=</span>selfies_str,
</span></span><span style="display:flex;"><span>    vocab_stoi<span style="color:#f92672">=</span>{s: i <span style="color:#66d9ef">for</span> i, s <span style="color:#f92672">in</span> enumerate(sorted(alphabet))},
</span></span><span style="display:flex;"><span>    pad_to_len<span style="color:#f92672">=</span><span style="color:#ae81ff">20</span>,
</span></span><span style="display:flex;"><span>    enc_type<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;label&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h3 id="customizing-valence-constraints">Customizing Valence Constraints</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># View current constraints</span>
</span></span><span style="display:flex;"><span>print(sf<span style="color:#f92672">.</span>get_semantic_constraints())
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Allow hypervalent sulfur (e.g., SF6) via the built-in preset</span>
</span></span><span style="display:flex;"><span>sf<span style="color:#f92672">.</span>set_semantic_constraints(sf<span style="color:#f92672">.</span>get_preset_constraints(<span style="color:#e6db74">&#34;hypervalent&#34;</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Or define custom constraints on top of the current ones</span>
</span></span><span style="display:flex;"><span>constraints <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>get_semantic_constraints()
</span></span><span style="display:flex;"><span>constraints<span style="color:#f92672">.</span>update({
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;S&#34;</span>: <span style="color:#ae81ff">6</span>,  <span style="color:#75715e"># allow hexavalent sulfur</span>
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;P&#34;</span>: <span style="color:#ae81ff">5</span>,  <span style="color:#75715e"># allow pentavalent phosphorus</span>
</span></span><span style="display:flex;"><span>})
</span></span><span style="display:flex;"><span>sf<span style="color:#f92672">.</span>set_semantic_constraints(constraints)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Reset to the default constraints</span>
</span></span><span style="display:flex;"><span>sf<span style="color:#f92672">.</span>set_semantic_constraints()
</span></span></code></pre></div><h2 id="selfies-in-machine-learning">SELFIES in Machine Learning</h2>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>SELFIES is particularly advantageous for generative models in computational chemistry. When used in a VAE, the entire continuous latent space decodes to valid molecules, unlike SMILES where large regions of the latent space are invalid. The <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original SELFIES paper</a> demonstrated this concretely: a VAE trained with SELFIES held two orders of magnitude more diverse molecules in its latent space than a SMILES-based VAE, and a GAN produced 78.9% diverse valid molecules compared to 18.6% for SMILES (Krenn et al., 2020).</p>
<p>Several generation approaches build directly on SELFIES:</p>
<ul>
<li><strong>Latent space optimization</strong>: <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a> uses a SELFIES-based VAE with gradient-based optimization to generate molecules with nanomolar binding affinities, achieving 6-8x speedup over RL baselines (Eckmann et al., 2022).</li>
<li><strong>Training-free generation</strong>: <a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a> demonstrates that simple character-level mutations in SELFIES (replacement, deletion, insertion) produce valid molecules by construction, eliminating the need for neural networks entirely. STONED achieved a GuacaMol score of 14.70, competitive with deep generative models (Nigam et al., 2021).</li>
<li><strong>Gradient-based dreaming</strong>: <a href="/notes/chemistry/molecular-design/generation/latent-space/deep-molecular-dreaming-pasithea/">PASITHEA</a> computes gradients with respect to one-hot encoded SELFIES inputs to steer molecules toward target property values. Because SELFIES&rsquo; surjective mapping guarantees every intermediate representation is a valid molecule, this continuous optimization over the input space is feasible. PASITHEA generated molecules with properties outside the training data range (logP up to 4.24 vs. a training max of 3.08), with 97.2% novelty (Shen et al., 2021).</li>
<li><strong>Large-scale pre-training</strong>: <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a> is a BART-based model pre-trained on 100M+ SELFIES molecules. It achieves 100% validity and an FCD of 0.0015 on MOSES (vs. 0.0061 for Chemformer), and introduces chemical feedback to align outputs with preference rankings (Fang et al., 2024).</li>
</ul>
<p>In benchmarks, SELFIES performs well for optimization-oriented tasks. In the <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> of 25 methods, SELFIES-REINVENT ranked 3rd and STONED ranked 5th. SELFIES-based genetic algorithms outperformed SMILES-based GAs, likely because SELFIES provides more intuitive mutation operations (Gao et al., 2022). The <a href="/notes/chemistry/molecular-design/generation/evaluation/tartarus-inverse-molecular-design/">Tartarus benchmark</a> corroborates this across more diverse real-world objectives (organic emitters, protein ligands, reaction substrates): SELFIES-VAE consistently outperforms SMILES-VAE, and the representation matters most where validity is a bottleneck (Nigam et al., 2022).</p>
<p>SELFIES mutations provide a simple but effective way to explore chemical space:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> random
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">mutate_selfies</span>(selfies_str, mutation_type<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;replace&#34;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Mutate a SELFIES string. Every output is a valid molecule.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    tokens <span style="color:#f92672">=</span> list(sf<span style="color:#f92672">.</span>split_selfies(selfies_str))
</span></span><span style="display:flex;"><span>    alphabet <span style="color:#f92672">=</span> list(sf<span style="color:#f92672">.</span>get_semantic_robust_alphabet())
</span></span><span style="display:flex;"><span>    idx <span style="color:#f92672">=</span> random<span style="color:#f92672">.</span>randint(<span style="color:#ae81ff">0</span>, len(tokens) <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> mutation_type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;replace&#34;</span>:
</span></span><span style="display:flex;"><span>        tokens[idx] <span style="color:#f92672">=</span> random<span style="color:#f92672">.</span>choice(alphabet)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> mutation_type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;insert&#34;</span>:
</span></span><span style="display:flex;"><span>        tokens<span style="color:#f92672">.</span>insert(idx, random<span style="color:#f92672">.</span>choice(alphabet))
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> mutation_type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;delete&#34;</span> <span style="color:#f92672">and</span> len(tokens) <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">1</span>:
</span></span><span style="display:flex;"><span>        tokens<span style="color:#f92672">.</span>pop(idx)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#34;&#34;</span><span style="color:#f92672">.</span>join(tokens)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Every mutation produces a valid molecule</span>
</span></span><span style="display:flex;"><span>original <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>encoder(<span style="color:#e6db74">&#34;c1ccccc1&#34;</span>)  <span style="color:#75715e"># benzene</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">5</span>):
</span></span><span style="display:flex;"><span>    mutant <span style="color:#f92672">=</span> mutate_selfies(original)
</span></span><span style="display:flex;"><span>    print(sf<span style="color:#f92672">.</span>decoder(mutant))  <span style="color:#75715e"># always valid</span>
</span></span></code></pre></div><h3 id="property-prediction-and-pretraining">Property Prediction and Pretraining</h3>
<p><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a> is a RoBERTa-based chemical language model pretrained on 2M ChEMBL compounds using SELFIES as input. Because every masked token prediction corresponds to a valid molecular fragment, the model never wastes capacity learning invalid chemistry. SELFormer outperformed <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a> by approximately 12% on average across BACE, BBBP, and HIV classification benchmarks (Yüksel et al., 2023). <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> also evaluated SELFIES as an input representation, finding comparable performance to SMILES on the Tox21 task (Chithrananda et al., 2020).</p>
<p>The <a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> demonstrated that SELFIES achieves ~100% validity vs. ~40% for SMILES in conditional molecular generation, while performing comparably for property prediction. This dual prediction-generation capability is enabled by interleaving numerical property tokens with SELFIES molecular tokens in a single sequence (Born &amp; Manica, 2023).</p>
<p>At larger scales, <a href="/notes/chemistry/molecular-representations/encoders/neural-scaling-of-deep-chemical-models/">ChemGPT</a> (up to 1B parameters) uses a GPT-Neo backbone with SELFIES tokenization for autoregressive molecular generation, demonstrating that SELFIES follows the same power-law neural scaling behavior observed in NLP (Frey et al., 2023).</p>
<h3 id="optical-chemical-structure-recognition">Optical Chemical Structure Recognition</h3>
<p>In image-to-text chemical structure recognition, <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan et al. (2022)</a> compared SMILES, DeepSMILES, SELFIES, and InChI as output formats using the same transformer architecture. SELFIES achieved 100% structural validity (every prediction could be decoded), while SMILES predictions occasionally contained syntax errors. The trade-off: SMILES achieved higher exact match accuracy (88.62%) partly because SELFIES strings are longer, producing more tokens for the decoder to predict.</p>
<h3 id="chemical-name-translation">Chemical Name Translation</h3>
<p><a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT</a> uses SELFIES as its internal representation for translating between chemical line notations and IUPAC names. All SMILES are converted to SELFIES before processing, and the model achieves a BLEU score of 0.94 for IUPAC-to-SELFIES translation and 0.98 Tanimoto similarity on valid outputs. The authors found SELFIES&rsquo; syntactic robustness particularly valuable for this sequence-to-sequence task, where the decoder must produce a chemically valid output string (Rajan et al., 2021).</p>
<h3 id="tokenization">Tokenization</h3>
<p>Converting SELFIES strings into tokens for neural models is more straightforward than SMILES tokenization. Each bracket-enclosed symbol (<code>[C]</code>, <code>[=C]</code>, <code>[Branch1]</code>) is a natural token boundary. <a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">Atom Pair Encoding (APE)</a> extends byte pair encoding with chemistry-aware constraints for both SMILES and SELFIES. For SELFIES specifically, APE preserves atomic identity during subword merging, and SELFIES models showed strong inter-tokenizer agreement: all true positives from SELFIES-BPE were captured by SELFIES-APE (Leon et al., 2024).</p>
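<p>Because every symbol is bracket-delimited, a complete character-level SELFIES tokenizer reduces to a single regular expression. The sketch below is illustrative; the <code>selfies</code> library ships the equivalent <code>split_selfies</code> helper:</p>

```python
import re

def tokenize_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracket-delimited symbols."""
    return re.findall(r"\[[^\]]*\]", selfies)

tokens = tokenize_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]")
print(tokens)
# -> ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
```

Contrast this with SMILES, where tokenizers must handle two-character elements (<code>Cl</code>, <code>Br</code>), bracket atoms, and ring-closure digits with bespoke rules.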
<h2 id="limitations-and-trade-offs">Limitations and Trade-offs</h2>
<h3 id="validity-constraints-can-introduce-bias">Validity Constraints Can Introduce Bias</h3>
<p>The guarantee that every string decodes to a valid molecule is SELFIES&rsquo; core advantage, but recent work has shown this comes with trade-offs. <a href="/notes/chemistry/molecular-representations/notations/invalid-smiles-help/">Skinnider (2024)</a> demonstrated that SMILES-based models consistently outperform SELFIES-based models on distribution-learning tasks. The mechanism: invalid SMILES represent a model&rsquo;s least confident predictions, and filtering them out acts as implicit quality control. SELFIES models, by construction, cannot discard low-confidence outputs this way. Furthermore, SELFIES validity constraints introduce systematic structural biases, generating fewer aromatic rings and more aliphatic structures compared to training data. When SELFIES constraints were relaxed to allow invalid generation (&ldquo;unconstrained SELFIES&rdquo;), performance improved, providing causal evidence that the ability to generate and discard invalid outputs benefits distribution learning.</p>
<p>This finding reframes the SMILES vs. SELFIES choice as context-dependent. As Grisoni (2023) summarizes in a <a href="/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/">review of chemical language models</a>: &ldquo;SMILES offer a richer, more interpretable language with well-studied augmentation strategies, while SELFIES guarantee validity at the cost of chemical realism and edit interpretability.&rdquo;</p>
<p>The <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> provides further nuance: SELFIES-based variants of language model methods (REINVENT, LSTM HC, VAE) generally do not outperform their SMILES counterparts, because modern language models learn SMILES grammar well enough that syntactic invalidity is no longer a practical bottleneck. The exception is genetic algorithms, where SELFIES mutations are naturally well-suited.</p>
<p>A study on <a href="/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/">complex molecular distributions</a> paints a consistent picture: SELFIES-trained RNNs achieve better standard metrics (validity, uniqueness, novelty), while SMILES-trained RNNs achieve better distributional fidelity as measured by Wasserstein distance (Flam-Shepherd et al., 2022). Taken together, these findings suggest that SELFIES and SMILES have genuinely complementary strengths, and the best choice depends on whether the task prioritizes validity/novelty or distributional faithfulness.</p>
<h3 id="degenerate-outputs">Degenerate Outputs</h3>
<p>Although every SELFIES string decodes to a valid molecule, the decoded molecule may not always be chemically meaningful in context. The <a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> reported ~1.9% defective generations where the output molecule had fewer than 50% of the seed molecule&rsquo;s atoms (Born &amp; Manica, 2023). This highlights a distinction between syntactic validity (which SELFIES guarantees) and semantic appropriateness (which it does not).</p>
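<p>A practical consequence is that SELFIES pipelines often still apply a lightweight post-hoc filter. The sketch below flags outputs that lost most of the seed&rsquo;s atoms, loosely following the Regression Transformer&rsquo;s 50% criterion; counting atom symbols at the token level is only a proxy (the decoder can silently ignore trailing symbols), so a production filter would decode and count atoms with a toolkit such as RDKit:</p>

```python
import re

def atom_tokens(selfies: str) -> int:
    """Approximate heavy-atom count: symbols that are not ring/branch
    bookkeeping. A rough proxy only."""
    tokens = re.findall(r"\[[^\]]*\]", selfies)
    return sum(1 for t in tokens if "Ring" not in t and "Branch" not in t)

def is_degenerate(seed: str, generated: str, ratio: float = 0.5) -> bool:
    """Flag generations that kept fewer than `ratio` of the seed's atoms."""
    return atom_tokens(generated) < ratio * atom_tokens(seed)

seed = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"  # benzene: 6 atom symbols
print(is_degenerate(seed, "[C][C]"))  # -> True  (2 < 3)
print(is_degenerate(seed, seed))      # -> False
```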
<h3 id="other-limitations">Other Limitations</h3>
<ul>
<li><strong>Indirect Canonicalization</strong>: A canonical SELFIES string is currently generated by first creating a canonical SMILES string and then converting it to SELFIES. Direct canonicalization is a goal for future development.</li>
<li><strong>String Length</strong>: SELFIES strings are generally longer than their corresponding SMILES strings, which can impact storage, processing times, and sequence modeling difficulty for very large datasets.</li>
<li><strong>Ongoing Standardization</strong>: While the library now supports most major features found in SMILES, work is ongoing to extend the format to more complex systems like polymers, crystals, and reactions.</li>
</ul>
<h2 id="variants-and-extensions">Variants and Extensions</h2>
<h3 id="group-selfies">Group SELFIES</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/">Group SELFIES</a> extends the representation with group tokens that represent functional groups or entire substructures (e.g., a benzene ring or carboxyl group) as single units. Each group token has labeled attachment points with specified valency, allowing the decoder to continue tracking available bonds. Group SELFIES maintains the validity guarantee while producing shorter, more human-readable strings. On MOSES VAE benchmarks, Group SELFIES achieved an FCD of 0.1787 versus 0.6351 for standard SELFIES, indicating substantially better distribution learning (Cheng et al., 2023).</p>
<h3 id="stoned-algorithms">STONED Algorithms</h3>
<p><a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a> (Superfast Traversal, Optimization, Novelty, Exploration and Discovery) is a suite of algorithms that exploit SELFIES&rsquo; validity guarantee for training-free molecular design through point mutations, interpolation, and optimization (Nigam et al., 2021). See <a href="#molecular-generation">Molecular Generation</a> above for benchmark results.</p>
<h2 id="recent-developments">Recent Developments</h2>
<p>The <a href="/notes/chemistry/molecular-representations/notations/selfies-2023/">2023 library update</a> replaced the original string-manipulation engine with a graph-based internal representation. This change resolved several long-standing limitations: the original approach could not handle aromatics (requiring kekulization), stereochemistry, or charged species. The graph-based engine now supports all of these, and processes 300K+ molecules in approximately 4 minutes in pure Python. The library has been validated on all 72 million molecules from PubChem.</p>
<p>Looking forward, researchers have outlined <a href="/notes/chemistry/molecular-representations/notations/selfies-2022/">16 future research directions</a> for extending robust representations to complex systems like polymers, crystals, and chemical reactions.</p>
<h2 id="further-reading">Further Reading</h2>
<ul>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/"><strong>Converting SELFIES Strings to 2D Molecular Images</strong></a>: Hands-on tutorial demonstrating SELFIES robustness and building visualization tools</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <a href="https://doi.org/10.1088/2632-2153/aba947"><em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024.</a></li>
<li>Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., &hellip; &amp; Aspuru-Guzik, A. (2022). SELFIES and the future of molecular string representations. <a href="https://doi.org/10.1016/j.patter.2022.100588"><em>Patterns</em>, <em>3</em>(10), 100588.</a></li>
<li>Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., &amp; Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. <a href="https://doi.org/10.1039/d3dd00044c"><em>Digital Discovery</em>, <em>2</em>, 897-908.</a></li>
<li>Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. <a href="https://doi.org/10.1038/s42256-024-00821-x"><em>Nature Machine Intelligence</em>, <em>6</em>, 437-448.</a></li>
<li>Shen, C., Krenn, M., Eppel, S., &amp; Aspuru-Guzik, A. (2021). Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations. <a href="https://doi.org/10.1088/2632-2153/ac09d6"><em>Machine Learning: Science and Technology</em>, <em>2</em>(3), 03LT02.</a></li>
<li>Fang, Y., et al. (2024). Domain-agnostic molecular generation with chemical feedback. <a href="https://openreview.net/forum?id=9rnerQyXlh"><em>ICLR 2024</em>.</a></li>
<li>Born, J., &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <a href="https://doi.org/10.1038/s42256-023-00639-z"><em>Nature Machine Intelligence</em>, <em>5</em>, 432-444.</a></li>
<li>Frey, N. C., Soklaski, R., Axelrod, S., Samsi, S., Gómez-Bombarelli, R., Coley, C. W., &amp; Gadepally, V. (2023). Neural scaling of deep chemical models. <a href="https://doi.org/10.1038/s42256-023-00740-3"><em>Nature Machine Intelligence</em>, <em>5</em>, 1297-1305.</a></li>
<li>Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. <a href="https://doi.org/10.1186/s13321-021-00512-4"><em>Journal of Cheminformatics</em>, <em>13</em>, 34.</a></li>
<li>Nigam, A., Pollice, R., &amp; Aspuru-Guzik, A. (2022). Tartarus: A benchmarking platform for realistic and practical inverse molecular design. <a href="https://openreview.net/forum?id=sLFDE2MHzHO"><em>NeurIPS 2022 Datasets and Benchmarks</em>.</a></li>
<li><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>The Number of Isomeric Hydrocarbons of the Methane Series</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/number-of-isomeric-hydrocarbons/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/number-of-isomeric-hydrocarbons/</guid><description>Henze and Blair's 1931 JACS paper deriving exact recursive formulas for counting constitutional alkane isomers.</description><content:encoded><![CDATA[<h2 id="a-theoretical-foundation-for-mathematical-chemistry">A Theoretical Foundation for Mathematical Chemistry</h2>
<p>This is a foundational <strong>theoretical paper</strong> in mathematical chemistry and chemical graph theory. It derives <strong>exact mathematical laws</strong> governing molecular topology. The paper also serves as a <strong>benchmark resource</strong>, establishing the first systematic isomer counts that corrected historical errors and whose recursive method remains the basis for modern molecular enumeration.</p>
<h2 id="historical-motivation-and-the-failure-of-centric-trees">Historical Motivation and the Failure of Centric Trees</h2>
<p>The primary motivation was the lack of a rigorous mathematical relationship between carbon content ($N$) and isomer count.</p>
<ul>
<li><strong>Previous failures</strong>: Earlier attempts by <a href="https://doi.org/10.1002/cber.187500801227">Cayley (1875)</a> (as cited by Henze and Blair, referring to the Berichte der deutschen chemischen Gesellschaft summary) and <a href="https://doi.org/10.1002/cber.187500802191">Schiff (1875)</a> used &ldquo;centric&rdquo; and &ldquo;bicentric&rdquo; symmetry tree methods that broke down as carbon content increased, producing incorrect counts as early as $N = 12$. Subsequent efforts by Tiemann (1893), Delannoy (1894), Losanitsch (1897), Goldberg (1898), and Trautz (1924), as cited in the paper, each improved on specific aspects but none achieved general accuracy beyond moderate carbon content.</li>
<li><strong>The theoretical gap</strong>: All prior formulas depended on exhaustively identifying centers of symmetry, meaning they required additional correction terms for each increase in $N$ and could not reliably predict counts for larger molecules like $C_{40}$.</li>
</ul>
<p>This work aimed to develop a theoretically sound, generalizable method that could be extended to any number of carbons.</p>
<h2 id="core-innovation-recursive-enumeration-of-graphs">Core Innovation: Recursive Enumeration of Graphs</h2>
<p>The core novelty is the proof that the count of hydrocarbons is a recursive function of the count of alkyl radicals (alcohols) of size $N/2$ or smaller. The authors rely on a preliminary calculation of the total number of isomeric alcohols (the methanol series) to make this hydrocarbon enumeration possible. By defining $T_k$ as the exact number of possible isomeric alkyl radicals strictly containing $k$ carbon atoms, graph enumeration transforms into a mathematical recurrence.</p>
<p>To rigorously prevent double-counting when functionally identical branches connect to a central carbon, Henze and Blair applied combinations with substitution. Because the chemical branches are unordered topologically, connecting $x$ branches of identical structural size $k$ results in combinations with repetition:</p>
<p>$$ \binom{T_k + x - 1}{x} $$</p>
<p>For example, if a Group B central carbon is bonded to three identical sub-branches of length $k$, the number of distinct combinations for that topological partition is:</p>
<p>$$ \frac{T_k (T_k + 1)(T_k + 2)}{6} $$</p>
<p>Summing these constrained combinatorial partitions across all valid branch sizes (governed by the Even/Odd bisection rules) yields the exact isomer count for $N$ without overestimating due to symmetric permutations.</p>
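<p>The multiset coefficient above is an ordinary binomial with the &ldquo;stars and bars&rdquo; shift, so the closed form can be checked directly (a quick sketch; the value $T_k = 8$ is the known count of pentyl radicals, $C_5H_{11}$):</p>

```python
from math import comb

def multiset_choices(n_types: int, n_slots: int) -> int:
    """Combinations with repetition: unordered choice of n_slots branches
    drawn (with replacement) from n_types distinct radical structures."""
    return comb(n_types + n_slots - 1, n_slots)

# Three identical-size branches drawn from T_k radical types reduces to
# T_k (T_k + 1)(T_k + 2) / 6, matching the closed form in the paper.
T_k = 8  # the eight pentyl radicals (C5H11)
assert multiset_choices(T_k, 3) == T_k * (T_k + 1) * (T_k + 2) // 6
print(multiset_choices(T_k, 3))  # -> 120
```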
<p><strong>The Symmetry Constraints</strong>: The paper rigorously divides the problem space to prevent double-counting:</p>
<ul>
<li><strong>Group A (Centrosymmetric)</strong>: Hydrocarbons that can be bisected into two smaller alkyl radicals.
<ul>
<li><em>Even $N$</em>: Split into two radicals of size $N/2$.</li>
<li><em>Odd $N$</em>: Split into sizes $(N+1)/2$ and $(N-1)/2$.</li>
</ul>
</li>
<li><strong>Group B (Asymmetric)</strong>: Hydrocarbons whose graphic formula cannot be symmetrically bisected. They contain exactly one central carbon atom attached to 3 or 4 branches. To prevent double-counting, Henze and Blair established strict maximum branch sizes:
<ul>
<li><em>Even $N$</em>: No branch can be larger than $(N/2 - 1)$ carbons.</li>
<li><em>Odd $N$</em>: No branch can be larger than $(N-3)/2$ carbons.</li>
<li><em>The Combinatorial Partitioning</em>: They further subdivided these 3-branch and 4-branch molecules into distinct mathematical cases based on whether the branches were structurally identical or unique, applying distinct combinatorial formulas to each scenario.</li>
</ul>
</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/notes/hexane-and-its-six-isomers-by-even-and-odd-decomposition.webp"
         alt="The five structural isomers of hexane classified into Group A and Group B based on their decomposition"
         title="The five structural isomers of hexane classified into Group A and Group B based on their decomposition"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The five isomers of hexane ($C_6$) classified by Henze and Blair&rsquo;s symmetry scheme. Group A molecules (top row) can be bisected along a bond (highlighted in red) into two $C_3$ alkyl radicals. Group B molecules (bottom row) have a central carbon atom (red circle) with 3-4 branches, preventing symmetric bisection.</figcaption>
    
</figure>

<p>This classification is the key insight that enables the recursive formulas. By exhaustively partitioning hydrocarbons into these mutually exclusive groups, the authors could derive separate combinatorial expressions for each and sum them without double-counting.</p>
<p>For each structural class, combinatorial formulas are derived that depend on the number of isomeric alcohols ($T_k$) where $k &lt; N$. This transforms the problem of counting large molecular graphs into a recurrence relation based on the counts of smaller, simpler sub-graphs.</p>
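<p>The alcohol-series recurrence that feeds these formulas is compact enough to sketch in pure Python. The version below follows the unordered-branch logic described above rather than the paper&rsquo;s exact notation (function and variable names are illustrative), and it reproduces the known radical counts:</p>

```python
from collections import Counter
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def radicals(n: int) -> int:
    """T_n: number of isomeric alkyl radicals with n carbon atoms."""
    if n == 0:
        return 1  # an empty branch, i.e. a hydrogen atom
    total = 0
    # Distribute the remaining n-1 carbons over three unordered branches
    # with sizes a >= b >= c >= 0 hanging off the attachment carbon.
    for a in range(n - 1, -1, -1):
        for b in range(min(a, n - 1 - a), -1, -1):
            c = n - 1 - a - b
            if c > b:
                continue
            term = 1
            # Branches of identical size use combinations with repetition,
            # exactly as in the Group B formulas above.
            for size, count in Counter((a, b, c)).items():
                term *= comb(radicals(size) + count - 1, count)
            total += term
    return total

print([radicals(n) for n in range(1, 9)])  # -> [1, 1, 2, 4, 8, 17, 39, 89]
```

The output matches the accepted alkyl-radical sequence: two propyl radicals, four butyls, eight pentyls, and so on.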
<h2 id="validation-via-exhaustive-hand-enumeration">Validation via Exhaustive Hand-Enumeration</h2>
<p>The experiments were computational and enumerative:</p>
<ol>
<li><strong>Derivation of the recursion formulas</strong>: The main effort was the mathematical derivation of the set of equations for each structural class of hydrocarbon.</li>
<li><strong>Calculation</strong>: They applied their formulas to calculate the number of isomers for alkanes up to $N=40$, reaching over $6.2 \times 10^{13}$ isomers. This was far beyond what was previously possible.</li>
<li><strong>Validation by exhaustive enumeration</strong>: To prove the correctness of their theory, the authors manually drew and counted all possible structural formulas for the undecanes ($C_{11}$), dodecanes ($C_{12}$), tridecanes ($C_{13}$), and tetradecanes ($C_{14}$). This brute-force check confirmed their calculated numbers and corrected long-standing errors in the literature.
<ul>
<li><em>Key correction</em>: The manual enumeration proved that the count for tetradecane ($C_{14}$) is <strong>1,858</strong>, correcting erroneous values previously published by <a href="https://doi.org/10.1002/cber.189703002144" title="Die Isomerie-Arten bei den Homologen der Paraffin-Reihe">Losanitsch (1897)</a>, whose results for $C_{12}$ and $C_{14}$ the paper identifies as incorrect.</li>
</ul>
</li>
</ol>
<h2 id="benchmark-outcomes-and-scaling-limits">Benchmark Outcomes and Scaling Limits</h2>
<ul>
<li><strong>The Constitutional Limit</strong>: The paper establishes the mathematical ground truth for organic molecular graphs by strictly counting <em>constitutional</em> (structural) isomers. The derivation completely excludes 3D stereoisomerism (enantiomers and diastereomers). For modern geometric deep learning applications (e.g., generating 3D conformers), Henze and Blair&rsquo;s scaling sequence serves as a lower bound, representing a severe underestimation of the true number of spatial configurations feasible within chemical space.</li>
<li><strong>Theoretical outcome</strong>: The paper proves that the problem&rsquo;s inherent complexity requires a recursive approach.</li>
<li><strong>Benchmark resource</strong>: The authors published a table of isomer counts up to $C_{40}$ (Table II), correcting historical errors and establishing the first systematic enumeration across this range. Later computational verification revealed that the paper&rsquo;s hand-calculated values are exact through at least $C_{14}$ (confirmed by exhaustive enumeration) but accumulate minor arithmetic errors beyond that range (e.g., at $C_{40}$). The recursive method itself is exact and remains the basis for the accepted values in <a href="https://oeis.org/A000602">OEIS A000602</a>.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/notes/number-of-isomeric-hydrocarbons-of-the-methane-series.webp"
         alt="Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40"
         title="Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The number of structural isomers grows super-exponentially with carbon content, reaching over 62 trillion for C₄₀. This plot, derived from Henze and Blair&rsquo;s Table II, illustrates the combinatorial explosion that makes direct enumeration intractable for larger molecules.</figcaption>
    
</figure>

<p>The plot above illustrates the staggering growth rate. Methane ($C_1$) through propane ($C_3$) each have exactly one isomer. Beyond this, the count accelerates rapidly: 75 isomers at $C_{10}$, nearly 37 million at $C_{25}$, and over 4 billion at $C_{30}$. By $C_{40}$, the count exceeds $6.2 \times 10^{13}$ (the paper&rsquo;s hand-calculated Table II reports 62,491,178,805,831, while the modern OEIS-verified value is 62,481,801,147,341). This exponential scaling demonstrates why brute-force enumeration quickly becomes intractable and why the recursive approach was essential.</p>
<ul>
<li><strong>Foundational impact</strong>: This work established the mathematical framework that would later evolve into modern chemical graph theory and computational chemistry approaches for molecular enumeration. In the context of AI for molecular generation, this is an early form of <strong>expressivity analysis</strong>, defining the size of the chemical space that generative models must learn to cover.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li>
<p><strong>Algorithms</strong>: The exact mathematical recursive formulas and combinatorial partitioning logic are fully provided in the text, allowing for programmatic implementation.</p>
</li>
<li>
<p><strong>Evaluation</strong>: The authors validated their recursive formulas by exhaustive hand-enumeration (brute-force drawing of structural formulas) up to $C_{14}$, establishing correctness over that range.</p>
</li>
<li>
<p><strong>Data</strong>: The paper&rsquo;s Table II provides isomer counts up to $C_{40}$. These hand-calculated values are exact through at least $C_{14}$ (validated by exhaustive enumeration) but accumulate minor arithmetic errors beyond that range. The corrected integer sequence is maintained in the On-Line Encyclopedia of Integer Sequences (OEIS) as <a href="https://oeis.org/A000602">A000602</a>.</p>
</li>
<li>
<p><strong>Code</strong>: The OEIS page provides Mathematica and Maple implementations. The following pure Python implementation uses the OEIS generating functions (which formalize Henze and Blair&rsquo;s recursive method) to compute the corrected isomer counts up to an arbitrary $N$:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">compute_alkane_isomers</span>(max_n: int) <span style="color:#f92672">-&gt;</span> list[int]:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the number of alkane structural isomers C_nH_{2n+2}
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    up to max_n using the generating functions from OEIS A000602.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> max_n <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>: <span style="color:#66d9ef">return</span> [<span style="color:#ae81ff">1</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Helper: multiply two polynomials (cap at degree max_n)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">poly_mul</span>(a: list[int], b: list[int]) <span style="color:#f92672">-&gt;</span> list[int]:
</span></span><span style="display:flex;"><span>        res <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> (max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i, v_a <span style="color:#f92672">in</span> enumerate(a):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> j, v_b <span style="color:#f92672">in</span> enumerate(b):
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> i <span style="color:#f92672">+</span> j <span style="color:#f92672">&lt;=</span> max_n: res[i <span style="color:#f92672">+</span> j] <span style="color:#f92672">+=</span> v_a <span style="color:#f92672">*</span> v_b
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">else</span>: <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> res
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Helper: evaluate P(x^k) by spacing out terms</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">poly_pow</span>(a: list[int], k: int) <span style="color:#f92672">-&gt;</span> list[int]:
</span></span><span style="display:flex;"><span>        res <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> (max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i, v <span style="color:#f92672">in</span> enumerate(a):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> i <span style="color:#f92672">*</span> k <span style="color:#f92672">&lt;=</span> max_n: res[i <span style="color:#f92672">*</span> k] <span style="color:#f92672">=</span> v
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">else</span>: <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> res
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># T represents the alkyl radicals (OEIS A000598), T[0] = 1</span>
</span></span><span style="display:flex;"><span>    T <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> (max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    T[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Iteratively build coefficients of T</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># We only need to compute the (n-1)-th degree terms at step n</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>):
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Extract previously calculated slices</span>
</span></span><span style="display:flex;"><span>        t_prev <span style="color:#f92672">=</span> T[:n]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># T(x^2) and T(x^3) terms up to n-1</span>
</span></span><span style="display:flex;"><span>        t2_term <span style="color:#f92672">=</span> T[(n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">//</span> <span style="color:#ae81ff">2</span>] <span style="color:#66d9ef">if</span> (n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">%</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>        t3_term <span style="color:#f92672">=</span> T[(n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">//</span> <span style="color:#ae81ff">3</span>] <span style="color:#66d9ef">if</span> (n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">%</span> <span style="color:#ae81ff">3</span> <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># T(x)^2 and T(x)^3 terms up to n-1</span>
</span></span><span style="display:flex;"><span>        t_squared_n_1 <span style="color:#f92672">=</span> sum(t_prev[i] <span style="color:#f92672">*</span> t_prev[n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> i] <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        t_cubed_n_1 <span style="color:#f92672">=</span> sum(
</span></span><span style="display:flex;"><span>            T[i] <span style="color:#f92672">*</span> T[j] <span style="color:#f92672">*</span> T[n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> i <span style="color:#f92672">-</span> j]
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> j <span style="color:#f92672">in</span> range(n <span style="color:#f92672">-</span> i)
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># T(x) * T(x^2) term up to n-1</span>
</span></span><span style="display:flex;"><span>        t_t2_n_1 <span style="color:#f92672">=</span> sum(
</span></span><span style="display:flex;"><span>            T[i] <span style="color:#f92672">*</span> T[j]
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> j <span style="color:#f92672">in</span> range((n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> i) <span style="color:#f92672">//</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> i <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>j <span style="color:#f92672">==</span> n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        T[n] <span style="color:#f92672">=</span> (t_cubed_n_1 <span style="color:#f92672">+</span> <span style="color:#ae81ff">3</span> <span style="color:#f92672">*</span> t_t2_n_1 <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">*</span> t3_term) <span style="color:#f92672">//</span> <span style="color:#ae81ff">6</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Calculate Alkanes (OEIS A000602) from fully populated T</span>
</span></span><span style="display:flex;"><span>    T2 <span style="color:#f92672">=</span> poly_pow(T, <span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>    T3 <span style="color:#f92672">=</span> poly_pow(T, <span style="color:#ae81ff">3</span>)
</span></span><span style="display:flex;"><span>    T4 <span style="color:#f92672">=</span> poly_pow(T, <span style="color:#ae81ff">4</span>)
</span></span><span style="display:flex;"><span>    T_squared <span style="color:#f92672">=</span> poly_mul(T, T)
</span></span><span style="display:flex;"><span>    T_cubed <span style="color:#f92672">=</span> poly_mul(T_squared, T)
</span></span><span style="display:flex;"><span>    T_fourth <span style="color:#f92672">=</span> poly_mul(T_cubed, T)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    term2 <span style="color:#f92672">=</span> [(T_squared[i] <span style="color:#f92672">-</span> T2[i]) <span style="color:#f92672">//</span> <span style="color:#ae81ff">2</span> <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    term3_inner <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>        T_fourth[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">6</span> <span style="color:#f92672">*</span> poly_mul(T_squared, T2)[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">8</span> <span style="color:#f92672">*</span> poly_mul(T, T3)[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">3</span> <span style="color:#f92672">*</span> poly_mul(T2, T2)[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">6</span> <span style="color:#f92672">*</span> T4[i]
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    alkanes <span style="color:#f92672">=</span> [<span style="color:#ae81ff">1</span>] <span style="color:#f92672">+</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> max_n
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>):
</span></span><span style="display:flex;"><span>        alkanes[n] <span style="color:#f92672">=</span> T[n] <span style="color:#f92672">-</span> term2[n] <span style="color:#f92672">+</span> term3_inner[n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>] <span style="color:#f92672">//</span> <span style="color:#ae81ff">24</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> alkanes
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate and verify</span>
</span></span><span style="display:flex;"><span>isomers <span style="color:#f92672">=</span> compute_alkane_isomers(<span style="color:#ae81ff">40</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;C_14 isomers: </span><span style="color:#e6db74">{</span>isomers[<span style="color:#ae81ff">14</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)   <span style="color:#75715e"># Output: 1858</span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;C_40 isomers: </span><span style="color:#e6db74">{</span>isomers[<span style="color:#ae81ff">40</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)   <span style="color:#75715e"># Output: 62481801147341</span>
</span></span></code></pre></div></li>
<li>
<p><strong>Hardware</strong>: Derived analytically and enumerated manually by the authors in 1931 without computational hardware.</p>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Henze, H. R., &amp; Blair, C. M. (1931). The number of isomeric hydrocarbons of the methane series. <em>Journal of the American Chemical Society</em>, 53(8), 3077-3085. <a href="https://doi.org/10.1021/ja01359a034">https://doi.org/10.1021/ja01359a034</a></p>
<p><strong>Publication</strong>: Journal of the American Chemical Society (JACS) 1931</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{henze1931number,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The number of isomeric hydrocarbons of the methane series}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Henze, Henry R and Blair, Charles M}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{53}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3077--3085}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1931}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES: A Compact Notation for Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles/</guid><description>SMILES (Simplified Molecular Input Line Entry System) represents chemical structures using compact ASCII strings.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>SMILES (Simplified Molecular Input Line Entry System), originally developed by David Weininger in the late 1980s, is a one-dimensional string format for representing chemical molecular structures. It linearizes 3D molecular structures by performing a depth-first traversal of the molecular graph, recording the atoms and bonds along the way.</p>
<p>For example, the simple molecule ethanol ($\text{C}_2\text{H}_6\text{O}$) can be represented as <code>CCO</code>, while the more complex caffeine molecule becomes <code>CN1C=NC2=C1C(=O)N(C(=O)N2C)C</code>.</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Human-readable</strong>: Designed primarily for human readability. Compare with <a href="/notes/chemistry/molecular-representations/notations/inchi/">InChI</a>, a hierarchical representation optimized for machine parsing.</li>
<li><strong>Compact</strong>: More compact than other representations (3D coordinates, connectivity tables)</li>
<li><strong>Simple syntax</strong>: A language with simple syntax and structure, making it relatively easy to learn and use for chemists and researchers</li>
<li><strong>Flexible</strong>: Both linear and cyclic structures can be represented in many different valid ways</li>
</ul>
<p>For a hands-on tutorial on visualizing SMILES strings as 2D molecular images, see <a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES Strings to 2D Molecular Images</a>.</p>
<h2 id="basic-syntax">Basic Syntax</h2>
<h3 id="atomic-symbols">Atomic Symbols</h3>
<p>SMILES uses standard atomic symbols with implied hydrogen atoms:</p>
<ul>
<li><code>C</code> (methane, $\text{CH}_4$)</li>
<li><code>N</code> (ammonia, $\text{NH}_3$)</li>
<li><code>O</code> (water, $\text{H}_2\text{O}$)</li>
<li><code>P</code> (phosphine, $\text{PH}_3$)</li>
<li><code>S</code> (hydrogen sulfide, $\text{H}_2\text{S}$)</li>
<li><code>Cl</code> (hydrogen chloride, $\text{HCl}$)</li>
</ul>
<p><strong>Bracket notation</strong>: Elements outside the organic subset must be shown in brackets, e.g., <code>[Pt]</code> for elemental platinum. The organic subset (<code>B</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>P</code>, <code>S</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, and <code>I</code>) can omit brackets.</p>
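<p>As a quick sanity check (a sketch using RDKit, which later examples in these notes also rely on), the implicit hydrogen counts and the bracket requirement can be verified directly:</p>

```python
from rdkit import Chem

# Organic-subset atoms receive implicit hydrogens to fill normal valence
for smi in ["C", "N", "O", "Cl"]:
    atom = Chem.MolFromSmiles(smi).GetAtomWithIdx(0)
    print(smi, "->", atom.GetTotalNumHs(), "implicit hydrogens")
# C -> 4, N -> 3, O -> 2, Cl -> 1

# Outside the organic subset, brackets are mandatory
print(Chem.MolFromSmiles("Pt"))                # None: parse failure without brackets
print(Chem.MolFromSmiles("[Pt]") is not None)  # True
```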
<h3 id="bond-representation">Bond Representation</h3>
<p>Bonds are represented by symbols:</p>
<ul>
<li><strong>Single bond</strong>: <code>-</code> (usually omitted)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/ethane.webp"
         alt="Ethane"
         title="Ethane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ethane ($\text{C}_2\text{H}_6$), SMILES: <code>CC</code></figcaption>
    
</figure>

<ul>
<li><strong>Double bond</strong>: <code>=</code></li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/methyl_isocyanate.webp"
         alt="Methyl Isocyanate"
         title="Methyl Isocyanate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Methyl Isocyanate ($\text{C}_2\text{H}_3\text{NO}$), SMILES: <code>CN=C=O</code></figcaption>
    
</figure>

<ul>
<li><strong>Triple bond</strong>: <code>#</code></li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/hydrogen_cyanide.webp"
         alt="Hydrogen Cyanide"
         title="Hydrogen Cyanide"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Hydrogen Cyanide (HCN), SMILES: <code>C#N</code></figcaption>
    
</figure>

<ul>
<li><strong>Aromatic bond</strong>: <code>:</code> (usually omitted when lowercase atom symbols indicate aromaticity)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/vanillin.webp"
         alt="Vanillin"
         title="Vanillin"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Vanillin ($\text{C}_8\text{H}_8\text{O}_3$), SMILES: <code>O=Cc1ccc(O)c(OC)c1</code></figcaption>
    
</figure>

<ul>
<li><strong>Disconnected structures</strong>: <code>.</code> (separates disconnected components such as salts and ionic compounds)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/copper_II_sulfate.webp"
         alt="Copper(II) Sulfate"
         title="Copper(II) Sulfate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Copper(II) Sulfate ($\text{CuSO}_4$), SMILES: <code>[Cu+2].[O-]S(=O)(=O)[O-]</code></figcaption>
    
</figure>

<h3 id="structural-features">Structural Features</h3>
<ul>
<li><strong>Branches</strong>: Enclosed in parentheses and can be nested. For example, <code>CC(C)C(=O)O</code> represents isobutyric acid, where <code>(C)</code> and <code>(=O)</code> are branches off the main chain.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/3-propyl-4-isopropyl-1-heptene.webp"
         alt="3-Propyl-4-isopropyl-1-heptene"
         title="3-Propyl-4-isopropyl-1-heptene"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">3-Propyl-4-isopropyl-1-heptene ($\text{C}_{13}\text{H}_{26}$), SMILES: <code>C=CC(CCC)C(C(C)C)CCC</code></figcaption>
    
</figure>

<ul>
<li><strong>Cyclic structures</strong>: Written by breaking bonds and using numbers to indicate bond connections. For example, <code>C1CCCCC1</code> represents cyclohexane (the <code>1</code> connects the first and last carbon).</li>
<li><strong>Aromaticity</strong>: Lower case letters are used for atoms in aromatic rings. For example, benzene is written as <code>c1ccccc1</code>.</li>
<li><strong>Formal charges</strong>: Indicated by placing the charge in brackets after the atom symbol, e.g., <code>[C+]</code>, <code>[C-]</code>, or <code>[C-2]</code></li>
</ul>
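<p>These structural rules can be exercised directly with RDKit (a short sketch; the molecules are the ones named in the bullets above):</p>

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# Ring-closure digit '1' bonds the first and last atoms of cyclohexane
cyclohexane = Chem.MolFromSmiles("C1CCCCC1")
print(cyclohexane.GetRingInfo().NumRings())  # 1

# Lowercase aromatic form and Kekulé form canonicalize to the same benzene
print(Chem.CanonSmiles("c1ccccc1") == Chem.CanonSmiles("C1=CC=CC=C1"))  # True

# Nested branches: isobutyric acid
mol = Chem.MolFromSmiles("CC(C)C(=O)O")
print(rdMolDescriptors.CalcMolFormula(mol))  # C4H8O2
```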
<h2 id="stereochemistry-and-isomers">Stereochemistry and Isomers</h2>
<h3 id="isotope-notation">Isotope Notation</h3>
<p>Isotope notation specifies the exact isotope of an element and comes before the element within square brackets, e.g., <code>[13C]</code> for carbon-13.</p>
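<p>One detail worth noting: inside brackets no hydrogens are implied, so an isotope-labeled methane must state its hydrogen count explicitly. A small RDKit sketch:</p>

```python
from rdkit import Chem

# Carbon-13 methane: isotope number precedes the element symbol
atom = Chem.MolFromSmiles("[13CH4]").GetAtomWithIdx(0)
print(atom.GetIsotope())      # 13
print(atom.GetTotalNumHs())   # 4 (from the explicit H4 in the brackets)

# [13C] alone is a bare carbon-13 atom with zero hydrogens
bare = Chem.MolFromSmiles("[13C]").GetAtomWithIdx(0)
print(bare.GetTotalNumHs())   # 0
```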
<h3 id="double-bond-stereochemistry">Double Bond Stereochemistry</h3>
<p>Directional bonds can be specified using <code>\</code> and <code>/</code> symbols to indicate the stereochemistry of double bonds:</p>
<ul>
<li><code>C/C=C\C</code> represents (Z)-2-butene (cis configuration)</li>
<li><code>C/C=C/C</code> represents (E)-2-butene (trans configuration)</li>
</ul>
<p>The direction of the slashes indicates which side of the double bond each substituent sits on: matching slashes (<code>/C=C/</code>) place the flanking substituents on opposite sides (trans), while opposed slashes (<code>/C=C\</code>) place them on the same side (cis).</p>
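<p>RDKit can be used to confirm these assignments (a sketch; note the backslash must be escaped inside a Python string literal):</p>

```python
from rdkit import Chem

def double_bond_stereo(smi: str) -> list[str]:
    """Return the stereo labels of all stereo-defined double bonds."""
    mol = Chem.MolFromSmiles(smi)
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    return [str(b.GetStereo()) for b in mol.GetBonds()
            if b.GetStereo() != Chem.BondStereo.STEREONONE]

print(double_bond_stereo("C/C=C/C"))   # ['STEREOE'] -> trans
print(double_bond_stereo("C/C=C\\C"))  # ['STEREOZ'] -> cis
```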
<h3 id="tetrahedral-chirality">Tetrahedral Chirality</h3>
<p>Chirality around tetrahedral centers uses <code>@</code> and <code>@@</code> symbols:</p>
<ul>
<li><code>N[C@](C)(F)C(=O)O</code> vs <code>N[C@@](F)(C)C(=O)O</code></li>
<li><code>@</code> means the remaining neighbors appear anticlockwise when viewed from the first-listed neighbor; <code>@@</code> means clockwise</li>
<li><code>@</code> and <code>@@</code> are shorthand for <code>@TH1</code> and <code>@TH2</code>, respectively</li>
</ul>
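<p>A useful consequence of the listing-order convention: swapping any two listed neighbors flips <code>@</code> to <code>@@</code>, so the two strings in the bullets above denote the same molecule. A sketch checking this with RDKit, along with CIP labels for the two alanine enantiomers:</p>

```python
from rdkit import Chem

# Swapping the (C) and (F) neighbors while flipping @ <-> @@
# leaves the physical arrangement unchanged
a = Chem.CanonSmiles("N[C@](C)(F)C(=O)O")
b = Chem.CanonSmiles("N[C@@](F)(C)C(=O)O")
print(a == b)  # True: both canonicalize to the same string

# CIP labels for the two alanine enantiomers
for smi in ["N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"]:
    print(smi, Chem.FindMolChiralCenters(Chem.MolFromSmiles(smi)))
# -> [(1, 'S')] (L-alanine) and [(1, 'R')] (D-alanine)
```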















<figure class="post-figure center ">
    <img src="/img/smiles2img/glucose.webp"
         alt="Glucose"
         title="Glucose"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Glucose ($\text{C}_6\text{H}_{12}\text{O}_6$), SMILES: <code>OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)1</code></figcaption>
    
</figure>

<h3 id="advanced-stereochemistry">Advanced Stereochemistry</h3>
<p>More general notation for other stereocenters:</p>
<ul>
<li><code>@AL1</code>, <code>@AL2</code> for allene-type stereocenters</li>
<li><code>@SP1</code>, <code>@SP2</code>, <code>@SP3</code> for square-planar stereocenters</li>
<li><code>@TB1</code>&hellip;<code>@TB20</code> for trigonal bipyramidal stereocenters</li>
<li><code>@OH1</code>&hellip;<code>@OH30</code> for octahedral stereocenters</li>
</ul>
<p>SMILES allows partial specification since it relies on local chirality.</p>
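<p>Partial specification in practice: RDKit reports unassigned centers with a <code>?</code> label when asked to include them (a sketch using 3-aminobutan-2-ol with only one of its two centers specified):</p>

```python
from rdkit import Chem

# 3-aminobutan-2-ol: the [C@H] center is specified, the C(C)O center is not
mol = Chem.MolFromSmiles("C[C@H](N)C(C)O")
centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
print(centers)  # one center carries a CIP label; the other is reported as '?'
```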
<h2 id="smiles-in-machine-learning">SMILES in Machine Learning</h2>
<p>Beyond its original role as a compact notation, SMILES has become the dominant molecular input format for deep learning in chemistry. Its adoption has revealed both strengths and challenges specific to neural architectures.</p>
<h3 id="canonical-vs-randomized-smiles">Canonical vs. Randomized SMILES</h3>
<p>Canonical SMILES algorithms produce a single unique string per molecule, which is valuable for database deduplication. In generative modeling, however, canonical representations introduce training bias: the canonicalization algorithm constrains how the molecular graph is traversed (e.g., prioritizing sidechains over ring atoms), forcing models to learn both valid SMILES syntax and the specific ordering rules. Structurally similar molecules can have substantially different canonical strings, making complex topologies harder to sample.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">Randomized SMILES</a> address this by generating non-unique representations through random atom orderings. Training RNN-based generative models on randomized SMILES acts as data augmentation, improving chemical space coverage, sampling uniformity, and completeness compared to canonical SMILES (Arus-Pous et al., 2019). In one benchmark, randomized SMILES recovered significantly more of <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> chemical space than canonical SMILES across all training set sizes.</p>
<p>RDKit makes it straightforward to enumerate randomized SMILES for a given molecule:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;c1ccc(C(=O)O)cc1&#34;</span>)  <span style="color:#75715e"># benzoic acid</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Canonical form (deterministic)</span>
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; O=C(O)c1ccccc1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Randomized forms (different each call)</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">5</span>):
</span></span><span style="display:flex;"><span>    print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol, doRandom<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; OC(=O)c1ccccc1</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; O=C(c1ccccc1)O</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; OC(c1ccccc1)=O</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C(O)(c1ccccc1)=O</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; c1c(C(=O)O)cccc1</span>
</span></span></code></pre></div><p>Each of these strings encodes the same molecule but presents a different traversal of the molecular graph, giving a generative model more diverse training signal per molecule.</p>
<h3 id="validity-and-the-role-of-invalid-smiles">Validity and the Role of Invalid SMILES</h3>
<p>A large fraction of SMILES strings generated by neural models are syntactically or semantically invalid. Early efforts aimed to eliminate invalid outputs entirely, either through constrained representations like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (which guarantee 100% validity) or modified syntax like <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> (which removes paired syntax; see <a href="#deepsmiles">Variants</a> below for syntax details).</p>
<p>More recent work has complicated this picture. <a href="/notes/chemistry/molecular-representations/notations/invalid-smiles-help/">Skinnider (2024)</a> demonstrated that invalid SMILES generation actually benefits chemical language models. Invalid strings tend to be low-likelihood samples from the model&rsquo;s probability distribution. Filtering them out is equivalent to removing the model&rsquo;s least confident predictions, acting as implicit quality control. Meanwhile, enforcing absolute validity (as SELFIES does) can introduce systematic structural biases that impair distribution learning. This reframes SMILES&rsquo; non-robustness as potentially advantageous in certain ML contexts.</p>
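<p>The purely syntactic failure modes are cheap to detect even without a chemistry toolkit. A minimal, illustrative check for the two most frequent ones follows; semantic errors such as impossible valences require a real parser (RDKit's <code>MolFromSmiles</code> returns <code>None</code> for invalid input):</p>

```python
def syntactically_plausible(smiles):
    """Flag the two most common syntax errors in generated SMILES:
    unbalanced parentheses and unpaired ring-closure digits.
    (Toy check: ignores bracket atoms, %nn ring numbers, and all
    semantic/valence rules, which need a real parser.)"""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:           # ')' before any matching '('
                return False
        elif ch.isdigit():
            open_rings ^= {ch}      # first sight opens a ring, second closes it
    return depth == 0 and not open_rings

print(syntactically_plausible("c1ccccc1"))  # -> True   (benzene)
print(syntactically_plausible("c1ccccc"))   # -> False  (unclosed ring)
print(syntactically_plausible("CC(=O)O)"))  # -> False  (stray parenthesis)
```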
<h3 id="tokenization-challenges">Tokenization Challenges</h3>
<p>Converting SMILES strings into token sequences for neural models is non-trivial. The two baseline approaches illustrate the problem using chloramphenicol (<code>O=C(NC([C@@H](O)c1ccc([N+](=O)[O-])cc1)CO)C(Cl)Cl</code>):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>smiles <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;O=C(NC([C@@H](O)c1ccc([N+](=O)[O-])cc1)CO)C(Cl)Cl&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Character-level: splits every character individually</span>
</span></span><span style="display:flex;"><span>char_tokens <span style="color:#f92672">=</span> list(smiles)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># [&#39;O&#39;, &#39;=&#39;, &#39;C&#39;, &#39;(&#39;, &#39;N&#39;, &#39;C&#39;, &#39;(&#39;, &#39;[&#39;, &#39;C&#39;, &#39;@&#39;, &#39;@&#39;, &#39;H&#39;, &#39;]&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;(&#39;, &#39;O&#39;, &#39;)&#39;, &#39;c&#39;, &#39;1&#39;, &#39;c&#39;, &#39;c&#39;, &#39;c&#39;, &#39;(&#39;, &#39;[&#39;, &#39;N&#39;, &#39;+&#39;, &#39;]&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;(&#39;, &#39;=&#39;, &#39;O&#39;, &#39;)&#39;, &#39;[&#39;, &#39;O&#39;, &#39;-&#39;, &#39;]&#39;, &#39;)&#39;, &#39;c&#39;, &#39;c&#39;, &#39;1&#39;, &#39;)&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;C&#39;, &#39;O&#39;, &#39;)&#39;, &#39;C&#39;, &#39;(&#39;, &#39;C&#39;, &#39;l&#39;, &#39;)&#39;, &#39;C&#39;, &#39;l&#39;]</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; 49 tokens</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Atom-level: regex groups brackets, two-char elements, and bond symbols</span>
</span></span><span style="display:flex;"><span>atom_pattern <span style="color:#f92672">=</span> (
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\\</span><span style="color:#e6db74">|\/|:|~|@|\?|&gt;&gt;?|\*|%[0-9]</span><span style="color:#e6db74">{2}</span><span style="color:#e6db74">|[0-9])&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>atom_tokens <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>findall(atom_pattern, smiles)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># [&#39;O&#39;, &#39;=&#39;, &#39;C&#39;, &#39;(&#39;, &#39;N&#39;, &#39;C&#39;, &#39;(&#39;, &#39;[C@@H]&#39;, &#39;(&#39;, &#39;O&#39;, &#39;)&#39;, &#39;c&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;1&#39;, &#39;c&#39;, &#39;c&#39;, &#39;c&#39;, &#39;(&#39;, &#39;[N+]&#39;, &#39;(&#39;, &#39;=&#39;, &#39;O&#39;, &#39;)&#39;, &#39;[O-]&#39;, &#39;)&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;c&#39;, &#39;c&#39;, &#39;1&#39;, &#39;)&#39;, &#39;C&#39;, &#39;O&#39;, &#39;)&#39;, &#39;C&#39;, &#39;(&#39;, &#39;Cl&#39;, &#39;)&#39;, &#39;Cl&#39;]</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; 36 tokens</span>
</span></span></code></pre></div><p>Character-level tokenization splits <code>Cl</code> (chlorine) into <code>C</code> + <code>l</code>, making the chlorine indistinguishable from carbon. It also fragments <code>[C@@H]</code> (a chiral carbon) into six meaningless tokens: <code>[</code>, <code>C</code>, <code>@</code>, <code>@</code>, <code>H</code>, <code>]</code>. Atom-level tokenization preserves these as single tokens but still produces long sequences (~40 tokens per molecule on average in ChEMBL).</p>
<p>Several chemistry-aware tokenizers go further:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a> adapts byte pair encoding to learn high-frequency SMILES substrings from large chemical datasets, compressing average sequence length from ~40 to ~6 tokens while preserving chemically meaningful substructures.</li>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">Atom Pair Encoding (APE)</a> preserves atomic identity during subword merging, preventing chemically meaningless token splits.</li>
<li><a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES (AIS)</a> encodes each atom&rsquo;s local chemical environment into the token itself (e.g., distinguishing a carbonyl carbon from a methyl carbon), reducing token degeneration and improving translation accuracy.</li>
<li><a href="/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/">Smirk</a> achieves full OpenSMILES coverage with only 165 tokens by decomposing bracketed atoms into glyphs.</li>
</ul>
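<p>The mechanism behind SPE is ordinary byte pair encoding applied to SMILES tokens: repeatedly merge the most frequent adjacent token pair. One merge round can be sketched as follows (character-level start tokens and a toy corpus, for illustration only):</p>

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Most common adjacent token pair across a tokenized corpus."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    result = []
    for tokens in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        result.append(out)
    return result

# One merge round on a toy corpus of character-tokenized SMILES.
corpus = [list("CCO"), list("CCN"), list("CC(=O)O")]
pair = most_frequent_pair(corpus)   # ('C', 'C') is most frequent here
print(merge_pair(corpus, pair)[0])  # -> ['CC', 'O']
```

<p>SPE repeats this loop thousands of times over a large corpus, so frequent substructures end up as single vocabulary entries.</p>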
<h3 id="smiles-based-foundation-models">SMILES-Based Foundation Models</h3>
<p>SMILES serves as the primary input format for molecular encoder models, including <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, <a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">SMILES-Transformer</a>, <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a>, <a href="/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/">SMI-TED</a>, and <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>. These models learn molecular representations from large SMILES corpora through pre-training objectives like masked language modeling.</p>
<p>A key open challenge is robustness to SMILES variants. The <a href="/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/">AMORE framework</a> revealed that current chemical language models struggle to recognize chemically equivalent SMILES representations (such as hydrogen-explicit vs. implicit forms, or different atom orderings) as encoding the same molecule.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>SMILES is the dominant representation for de novo molecular generation. The typical pipeline trains a language model on SMILES corpora, then steers sampling toward molecules with desired properties. Major architecture families include:</p>
<ul>
<li><strong>Variational autoencoders</strong>: The <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Automatic Chemical Design VAE</a> (Gomez-Bombarelli et al., 2018) encodes SMILES into a continuous latent space, enabling gradient-based optimization toward target properties.</li>
<li><strong>RL-tuned generators</strong>: <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and its successors fine-tune a pre-trained SMILES language model using reinforcement learning, rewarding molecules that satisfy multi-objective scoring functions. <a href="/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">DrugEx</a> extends this with Pareto-based multi-objective optimization.</li>
<li><strong>Adversarial approaches</strong>: <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> and <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a> apply GAN-based training to SMILES generation, using domain-specific rewards alongside the discriminator signal.</li>
</ul>
<p>The challenges of <a href="#canonical-vs-randomized-smiles">canonical vs. randomized SMILES</a> and <a href="#validity-and-the-role-of-invalid-smiles">invalid outputs</a> discussed above are particularly relevant in this generation context.</p>
<h3 id="property-prediction">Property Prediction</h3>
<p>SMILES strings serve as the primary input for quantitative structure-activity relationship (QSAR) models. <a href="/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/">SMILES2Vec</a> learns fixed-length molecular embeddings directly from SMILES for property regression and classification. <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">MaxSMI</a> demonstrates that SMILES augmentation (training on multiple randomized SMILES per molecule) improves property prediction accuracy, connecting the <a href="#canonical-vs-randomized-smiles">data augmentation benefits</a> observed in generative settings to discriminative tasks.</p>
<h3 id="optical-chemical-structure-recognition">Optical Chemical Structure Recognition</h3>
<p>SMILES is also the standard output format for <a href="/posts/what-is-ocsr/">optical chemical structure recognition (OCSR)</a> systems, which extract molecular structures from images in scientific literature. Deep learning approaches like <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a> and <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES</a> frame this as an image-to-SMILES translation problem, using encoder-decoder architectures to generate SMILES strings directly from molecular diagrams. For a taxonomy of OCSR approaches, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR methods overview</a>.</p>
<h2 id="limitations">Limitations</h2>
<h3 id="classical-limitations">Classical Limitations</h3>
<ul>
<li><strong>Non-uniqueness</strong>: Different SMILES strings can represent the same molecule (e.g., ethanol can be written as <code>CCO</code> or <code>OCC</code>). Canonical SMILES algorithms address this by producing a single unique representation.</li>
<li><strong>Non-robustness</strong>: Many possible SMILES strings do not correspond to any valid molecular structure.
<ul>
<li>Syntactically malformed strings (e.g., unmatched ring-closure digits or parentheses).</li>
<li>Strings that parse but violate chemical rules (e.g., an atom with more bonds than its valence allows).</li>
</ul>
</li>
<li><strong>Information loss</strong>: SMILES encodes only the molecular graph; 3D conformational information (coordinates, bond lengths, angles) cannot be represented.</li>
</ul>
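<p>The non-uniqueness point is easy to demonstrate with RDKit, which the examples above already use: distinct valid strings for the same molecule all collapse to one canonical form:</p>

```python
from rdkit import Chem

# Three different valid strings for ethanol.
variants = ["CCO", "OCC", "C(O)C"]
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)
# -> {'CCO'}
```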
<h3 id="machine-learning-limitations">Machine Learning Limitations</h3>
<p>The challenges described above (canonical ordering bias motivating <a href="#canonical-vs-randomized-smiles">randomized SMILES</a>, validity constraints motivating <a href="#deepsmiles">DeepSMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and tokenization ambiguity motivating <a href="#tokenization-challenges">chemistry-aware tokenizers</a>) remain active areas of research. See the linked sections for details on each.</p>
<h2 id="variants-and-standards">Variants and Standards</h2>
<h3 id="canonical-smiles">Canonical SMILES</h3>
<p>For how canonical vs. randomized SMILES affects generative modeling, see <a href="#canonical-vs-randomized-smiles">Canonical vs. Randomized SMILES</a> above.</p>
<p>Canonical SMILES algorithms produce a single unique string per molecule by assigning a deterministic rank to each atom and then traversing the molecular graph in that rank order. Most implementations build on the Morgan algorithm (extended connectivity): each atom starts with an initial invariant based on its properties (atomic number, degree, charge, hydrogen count), then iteratively updates its invariant by incorporating its neighbors&rsquo; invariants until the ranking stabilizes. The final atom ranks determine the traversal order, which determines the canonical string.</p>
<p>In practice, the Morgan algorithm alone does not fully resolve all ties. Implementations must also make choices about tie-breaking heuristics, aromaticity perception (Kekulé vs. aromatic form), and stereochemistry encoding. Because these choices differ across toolkits (RDKit, OpenBabel, Daylight, ChemAxon), the same molecule can produce different &ldquo;canonical&rdquo; SMILES depending on the software. A canonical SMILES is only guaranteed unique within a single implementation, not across implementations.</p>
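<p>The iterative refinement at the heart of this procedure can be sketched in plain Python. This is an illustrative simplification, not RDKit's actual algorithm; real canonicalizers add tie-breaking, aromaticity, and stereochemistry handling on top:</p>

```python
def morgan_ranks(adjacency, invariants):
    """Morgan-style ranking sketch: refine atom invariants until the
    partition into equivalence classes stops splitting."""
    ranks = list(invariants)
    while True:
        # New invariant: own rank plus the sorted multiset of neighbor ranks.
        refined = [
            (ranks[i], tuple(sorted(ranks[j] for j in adjacency[i])))
            for i in range(len(adjacency))
        ]
        # Dense re-ranking: atoms with equal refined invariants share a rank.
        order = sorted(set(refined))
        new_ranks = [order.index(inv) for inv in refined]
        if len(set(new_ranks)) == len(set(ranks)):
            return new_ranks
        ranks = new_ranks

# Ethanol C-C-O as an adjacency list; initial invariant = atomic number.
# Terminal carbon, central carbon, and oxygen each get a distinct rank.
print(morgan_ranks({0: [1], 1: [0, 2], 2: [1]}, [6, 6, 8]))
# -> [0, 1, 2]
```

<p>The final ranks then seed the traversal order used to emit the canonical string.</p>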
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RDKit&#39;s canonical SMILES for caffeine</span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;CN1C=NC2=C1C(=O)N(C(=O)N2C)C&#34;</span>)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; Cn1c(=O)c2c(ncn2C)n(C)c1=O</span>
</span></span></code></pre></div><h3 id="isomeric-smiles">Isomeric SMILES</h3>
<p>Isomeric SMILES incorporates isotope and stereochemistry information, providing a more detailed molecular representation than generic SMILES. Non-isomeric SMILES strips this information, collapsing stereoisomers and isotopologues into the same string:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># L-alanine (chiral center)</span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;N[C@@H](C)C(=O)O&#34;</span>)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C[C@H](N)C(=O)O    (preserves chirality)</span>
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; CC(N)C(=O)O         (chirality lost)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Deuterated water (isotope labels)</span>
</span></span><span style="display:flex;"><span>mol2 <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;[2H]O[2H]&#34;</span>)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol2, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [2H]O[2H]           (preserves isotopes)</span>
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol2, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [H]O[H]             (isotope info lost)</span>
</span></span></code></pre></div><h3 id="opensmiles-vs-proprietary">OpenSMILES vs. Proprietary</h3>
<ul>
<li><strong>Proprietary</strong>: The original SMILES specification was proprietary (Daylight Chemical Information Systems), which led to compatibility issues between different implementations.</li>
<li><strong>OpenSMILES</strong>: A community-driven, freely available specification developed to resolve these compatibility issues.</li>
</ul>
<h2 id="extensions-and-related-notations">Extensions and Related Notations</h2>
<h3 id="deepsmiles">DeepSMILES</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> modifies two aspects of SMILES syntax that cause most invalid strings in generative models, while remaining interconvertible with standard SMILES without information loss.</p>
<p><strong>Ring closures</strong>: Standard SMILES uses paired digits (<code>c1ccccc1</code> for benzene). A model must remember which digits are &ldquo;open&rdquo; and close them correctly. DeepSMILES replaces this with a single ring-size indicator at the closing position: <code>cccccc6</code> means &ldquo;connect to the atom 6 positions back.&rdquo;</p>
<p><strong>Branches</strong>: Standard SMILES uses matched parentheses (<code>C(OC)(SC)F</code>). DeepSMILES uses a postfix notation with only closing parentheses, where consecutive <code>)</code> symbols indicate how far to pop back on the atom stack: <code>COC))SC))F</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>SMILES:       c1ccccc1          C(OC)(SC)F
</span></span><span style="display:flex;"><span>DeepSMILES:   cccccc6           COC))SC))F
</span></span><span style="display:flex;"><span>              ↑                 ↑
</span></span><span style="display:flex;"><span>              single digit =    no opening parens,
</span></span><span style="display:flex;"><span>              ring size         )) pops back to C
</span></span></code></pre></div><p>A single unpaired symbol cannot be &ldquo;unmatched,&rdquo; eliminating the two main sources of syntactically invalid strings from generative models.</p>
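<p>The ring-closure rule is simple enough to invert by hand. A toy decoder for the unbranched, single-digit case (a full converter must cover the complete grammar, including branches and <code>%nn</code> ring sizes) might look like:</p>

```python
import re

def decode_rings(deep):
    """Convert DeepSMILES ring closures back to paired SMILES digits.
    Toy decoder: assumes no branches, single-digit ring sizes, and at
    most 9 rings."""
    symbols = []   # atom symbols in writing order
    closures = []  # ring-closure digits to append after each atom
    label = 0
    for tok in re.findall(r"Cl|Br|[A-Za-z]|\d", deep):
        if tok.isdigit():
            n = int(tok)  # ring size: bond to the atom n positions back
            label += 1
            closures[-1].append(str(label))
            closures[-n].append(str(label))
        else:
            symbols.append(tok)
            closures.append([])
    return "".join(s + "".join(c) for s, c in zip(symbols, closures))

print(decode_rings("cccccc6"))  # -> c1ccccc1  (benzene)
print(decode_rings("CCCC4"))    # -> C1CCC1    (cyclobutane)
```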
<h3 id="reaction-smiles">Reaction SMILES</h3>
<p>Reaction SMILES extends the notation to represent chemical reactions by separating reactants, reagents, and products with <code>&gt;</code> symbols. The general format is <code>reactants&gt;reagents&gt;products</code>, where each group can contain multiple molecules separated by <code>.</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>CC(=O)O.CCO&gt;&gt;CC(=O)OCC.O
</span></span><span style="display:flex;"><span>│         │ │            │
</span></span><span style="display:flex;"><span>│         │ │            └─ water
</span></span><span style="display:flex;"><span>│         │ └─ ethyl acetate
</span></span><span style="display:flex;"><span>│         └─ ethanol
</span></span><span style="display:flex;"><span>└─ acetic acid
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>(Fischer esterification: acetic acid + ethanol → ethyl acetate + water)
</span></span></code></pre></div><p>The <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> treats this as a machine translation problem, translating reactant SMILES to product SMILES with a Transformer encoder-decoder architecture.</p>
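<p>Parsing the three-part format needs nothing more than string splitting. A minimal sketch (no validation; the reagent slot is empty in the example above):</p>

```python
def parse_reaction_smiles(rxn):
    """Split 'reactants>reagents>products' into three molecule lists;
    '.' separates molecules within each part."""
    parts = rxn.split(">")
    return [part.split(".") if part else [] for part in parts]

reactants, reagents, products = parse_reaction_smiles("CC(=O)O.CCO>>CC(=O)OCC.O")
print(reactants)  # -> ['CC(=O)O', 'CCO']
print(reagents)   # -> []
print(products)   # -> ['CC(=O)OCC', 'O']
```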
<h3 id="smarts-and-smirks">SMARTS and SMIRKS</h3>
<p><strong>SMARTS</strong> (SMILES Arbitrary Target Specification) is a pattern language built on SMILES syntax for substructure searching. It extends SMILES with query primitives like atom environments (<code>[CX3]</code> for a carbon with three connections) and logical operators, enabling precise structural pattern matching:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMARTS pattern for a carboxylic acid group: C(=O)OH</span>
</span></span><span style="display:flex;"><span>pattern <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmarts(<span style="color:#e6db74">&#34;[CX3](=O)[OX2H1]&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> name, smi <span style="color:#f92672">in</span> [(<span style="color:#e6db74">&#34;acetic acid&#34;</span>, <span style="color:#e6db74">&#34;CC(=O)O&#34;</span>),
</span></span><span style="display:flex;"><span>                  (<span style="color:#e6db74">&#34;benzoic acid&#34;</span>, <span style="color:#e6db74">&#34;c1ccc(C(=O)O)cc1&#34;</span>),
</span></span><span style="display:flex;"><span>                  (<span style="color:#e6db74">&#34;ethanol&#34;</span>, <span style="color:#e6db74">&#34;CCO&#34;</span>),
</span></span><span style="display:flex;"><span>                  (<span style="color:#e6db74">&#34;acetone&#34;</span>, <span style="color:#e6db74">&#34;CC(=O)C&#34;</span>)]:
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(smi)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  </span><span style="color:#e6db74">{</span>name<span style="color:#e6db74">:</span><span style="color:#e6db74">15s</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> -&gt; </span><span style="color:#e6db74">{</span><span style="color:#e6db74">&#39;match&#39;</span> <span style="color:#66d9ef">if</span> mol<span style="color:#f92672">.</span>HasSubstructMatch(pattern) <span style="color:#66d9ef">else</span> <span style="color:#e6db74">&#39;no match&#39;</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; acetic acid      -&gt; match</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; benzoic acid     -&gt; match</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; ethanol          -&gt; no match</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; acetone          -&gt; no match</span>
</span></span></code></pre></div><p><strong>SMIRKS</strong> extends SMARTS to describe reaction transforms, using atom maps (<code>:1</code>, <code>:2</code>, &hellip;) to track which atoms in the reactants correspond to which atoms in the products:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> AllChem, MolFromSmiles, MolToSmiles
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMIRKS for ester hydrolysis: break the C-O ester bond</span>
</span></span><span style="display:flex;"><span>smirks <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[C:1](=[O:2])[O:3][C:4]&gt;&gt;[C:1](=[O:2])[OH:3].[C:4][OH]&#34;</span>
</span></span><span style="display:flex;"><span>rxn <span style="color:#f92672">=</span> AllChem<span style="color:#f92672">.</span>ReactionFromSmarts(smirks)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>reactant <span style="color:#f92672">=</span> MolFromSmiles(<span style="color:#e6db74">&#34;CC(=O)OCC&#34;</span>)  <span style="color:#75715e"># ethyl acetate</span>
</span></span><span style="display:flex;"><span>products <span style="color:#f92672">=</span> rxn<span style="color:#f92672">.</span>RunReactants((reactant,))
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#34; + &#34;</span><span style="color:#f92672">.</span>join(MolToSmiles(p) <span style="color:#66d9ef">for</span> p <span style="color:#f92672">in</span> products[<span style="color:#ae81ff">0</span>]))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; CC(=O)O + CCO    (acetic acid + ethanol)</span>
</span></span></code></pre></div><p>See the <a href="/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/">Smirk tokenizer</a> for a recent approach to tokenizing these extensions for molecular foundation models.</p>
<h3 id="t-smiles">t-SMILES</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/">t-SMILES</a> encodes molecules as fragment-based strings by decomposing a molecule into chemically meaningful substructures, arranging them into a full binary tree, and traversing it breadth-first. This dramatically reduces nesting depth compared to standard SMILES (99.3% of tokens at depth 0-2 vs. 68.0% for SMILES on ChEMBL).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Standard SMILES (depth-first, atom-level):
</span></span><span style="display:flex;"><span>  CC(=O)Oc1ccccc1C(=O)O                     (aspirin)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>t-SMILES pipeline:
</span></span><span style="display:flex;"><span>  1. Fragment:     [CC(=O)O*]  [*c1ccccc1*]  [*C(=O)O]
</span></span><span style="display:flex;"><span>  2. Binary tree:
</span></span><span style="display:flex;"><span>                   [*c1ccccc1*]
</span></span><span style="display:flex;"><span>                  /             \
</span></span><span style="display:flex;"><span>         [CC(=O)O*]          [*C(=O)O]
</span></span><span style="display:flex;"><span>  3. BFS string:   [*c1ccccc1*] ^ [CC(=O)O*] ^ [*C(=O)O]
</span></span></code></pre></div><p>The framework introduces two symbols beyond standard SMILES: <code>^</code> separates adjacent fragments (analogous to spaces between words), and <code>&amp;</code> marks empty tree nodes. Only single closure symbols are needed per fragment, eliminating the deep nesting that makes standard SMILES difficult for generative models on small datasets.</p>
<h2 id="further-reading">Further Reading</h2>
<p>For a more robust alternative that guarantees 100% valid molecules, see <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES (Self-Referencing Embedded Strings)</a>. For the historical context and design philosophy behind SMILES, see <a href="/notes/chemistry/molecular-representations/notations/smiles-original-paper/">SMILES: The Original Paper (Weininger 1988)</a>.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://19january2021snapshot.epa.gov/sites/static/files/2015-05/documents/appendf.pdf">Sustainable Futures / P2 Framework Manual 2012 EPA-748-B12-001: Appendix F. SMILES Notation Tutorial</a></li>
<li><a href="https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html">Daylight Chemical Information Systems, Inc. SMILES</a></li>
<li><a href="http://opensmiles.org/opensmiles.html">OpenSMILES</a></li>
<li><a href="https://arxiv.org/abs/2402.01439">From Words to Molecules: A Survey of Large Language Models in Chemistry</a></li>
</ul>
]]></content:encoded></item></channel></rss>