<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular Representations on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/</link><description>Recent content in Molecular Representations on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Thu, 09 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/molecular-representations/index.xml" rel="self" type="application/rss+xml"/><item><title>Materials Representations for ML Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</guid><description>Review of representation strategies for encoding solid-state materials as ML inputs, covering structural descriptors, crystal graphs, and generative models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-material-representations">A Systematization of Material Representations</h2>
<p>This paper is a <strong>Systematization</strong> that organizes and categorizes the strategies researchers use to convert solid-state materials into numerical representations suitable for machine learning models. Rather than proposing a new method, the review provides a structured taxonomy of existing approaches, connecting each to the practical constraints of data availability, computational cost, and prediction targets. It covers structural descriptors, graph-based learned representations, compositional features, transfer learning, and generative models for inverse design.</p>
<h2 id="why-material-representations-matter">Why Material Representations Matter</h2>
<p>Machine learning has enabled rapid property prediction for materials, but every ML pipeline depends on how the material is encoded as a numerical input. The authors identify three guiding principles for effective representations:</p>
<ol>
<li><strong>Similarity preservation</strong>: Similar materials should have similar representations, and dissimilar materials should diverge in representation space.</li>
<li><strong>Domain coverage</strong>: The representation should be constructable for every material in the target domain.</li>
<li><strong>Cost efficiency</strong>: Computing the representation should be cheaper than computing the target property directly (e.g., via <a href="https://en.wikipedia.org/wiki/Density_functional_theory">DFT</a>).</li>
</ol>
<p>In practice, materials scientists face several barriers. Atomistic structures span diverse space groups, supercell sizes, and disorder parameters. Real material performance depends on defects, microstructure, and interfaces. Structural information often requires expensive experimental or computational effort to obtain. Datasets in materials science tend to be small, sparse, and biased toward well-studied systems.</p>
<h2 id="structural-descriptors-local-global-and-topological">Structural Descriptors: Local, Global, and Topological</h2>
<p>The review covers three families of hand-crafted structural descriptors that encode atomic positions and types.</p>
<h3 id="local-descriptors">Local Descriptors</h3>
<p>Local descriptors characterize the environment around each atom. Atom-centered symmetry functions (ACSF), introduced by Behler and Parrinello, define radial and angular functions:</p>
<p>$$
G_{i}^{1} = \sum_{j \neq i}^{\text{neighbors}} e^{-\eta(R_{ij} - R_{s})^{2}} f_{c}(R_{ij})
$$</p>
<p>$$
G_{i}^{2} = 2^{1-\zeta} \sum_{j,k \neq i}^{\text{neighbors}} (1 + \lambda \cos \theta_{ijk})^{\zeta} e^{-\eta(R_{ij}^{2} + R_{ik}^{2} + R_{jk}^{2})} f_{c}(R_{ij}) f_{c}(R_{ik}) f_{c}(R_{jk})
$$</p>
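<p>As a concrete illustration, the radial function $G_i^1$ can be evaluated in a few lines of NumPy (a minimal sketch with toy parameters, not a production featurizer):</p>

```python
import numpy as np

def cutoff(r, r_c):
    """Behler cosine cutoff f_c(R): smooth decay to zero at r_c."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def acsf_radial(positions, i, eta=1.0, r_s=0.0, r_c=6.0):
    """G^1_i = sum_{j != i} exp(-eta (R_ij - R_s)^2) f_c(R_ij)."""
    r_ij = np.linalg.norm(positions - positions[i], axis=1)
    r_ij = np.delete(r_ij, i)  # exclude the central atom itself
    return float(np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_c)))

# Three atoms on a line, 1 Å apart; descriptor for the leftmost atom
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(acsf_radial(pos, i=0))
```

Because the descriptor depends only on interatomic distances, it is invariant to rigid translations and rotations of the structure.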
<p>The Smooth Overlap of Atomic Positions (SOAP), proposed by Bartók et al., defines atomic neighborhood density as a sum of Gaussians and computes a rotationally invariant kernel through expansion in radial functions and <a href="https://en.wikipedia.org/wiki/Spherical_harmonics">spherical harmonics</a>:</p>
<p>$$
\rho_{i}(\mathbf{r}) = \sum_{j} \exp\left(-\frac{|\mathbf{r} - \mathbf{r}_{ij}|^{2}}{2\sigma^{2}}\right) = \sum_{nlm} c_{nlm} g_{n}(\mathbf{r}) Y_{lm}(\hat{\mathbf{r}})
$$</p>
<p>The power spectrum $p_{nn'l} \equiv \sum_{m} c_{nlm}(c_{n'lm})^{*}$ serves as a rotationally invariant vector descriptor of the local environment. SOAP has seen wide adoption both as a similarity metric and as an input to ML models.</p>
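<p>The contraction over $m$ that produces the power spectrum is a one-line <code>einsum</code>; the sketch below uses random complex coefficients as stand-ins for a real density expansion (and, for simplicity, pads every $l$ channel to $2 l_{\max} + 1$ values of $m$):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_max, l_max = 4, 3

# Toy expansion coefficients c_{nlm}: in practice these come from
# projecting the neighborhood density onto g_n(r) Y_lm.
shape = (n_max, l_max + 1, 2 * l_max + 1)
c = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# p_{n n' l} = sum_m c_{nlm} (c_{n'lm})^*  -- the rotation-invariant contraction
p = np.einsum("nlm,klm->nkl", c, c.conj())
print(p.shape)  # one invariant per (n, n', l) triple
```

The result is Hermitian in the $(n, n')$ indices, so only the upper triangle is needed as a feature vector.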
<p><a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi tessellation</a> provides another local approach, segmenting space into cells and extracting features like effective coordination numbers, cell volumes, and neighbor properties.</p>
<h3 id="global-descriptors">Global Descriptors</h3>
<p>Global descriptors encode the full structure. The Coulomb matrix models electrostatic interactions between atoms:</p>
<p>$$
M_{i,j} = \begin{cases} \frac{1}{2} Z_{i}^{2.4} &amp; \text{for } i = j \\ \frac{Z_{i}Z_{j}}{|r_{i} - r_{j}|} &amp; \text{for } i \neq j \end{cases}
$$</p>
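<p>A minimal NumPy sketch of the Coulomb matrix, using the standard $\frac{1}{2}Z_i^{2.4}$ diagonal from Rupp et al. and a toy water geometry:</p>

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: 0.5 * Z_i^2.4 on the diagonal, Z_i Z_j / |r_i - r_j| off it."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    with np.errstate(divide="ignore"):
        M = np.outer(Z, Z) / dist          # diagonal becomes inf here...
    np.fill_diagonal(M, 0.5 * Z ** 2.4)    # ...and is overwritten here
    return M

# Water: O at the origin, two H at roughly 0.96 Å (toy geometry)
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
M = coulomb_matrix(Z, R)
print(np.round(M, 2))
```

Note the matrix is symmetric but not permutation-invariant; applications typically sort rows by norm or use the sorted eigenvalue spectrum.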
<p>Other global methods include partial radial distribution functions (PRDF), the many-body tensor representation (MBTR), and cluster expansions. The Atomic Cluster Expansion (ACE) framework generalizes cluster expansions to continuous environments and has become a foundation for modern deep learning potentials.</p>
<h3 id="topological-descriptors">Topological Descriptors</h3>
<p><a href="https://en.wikipedia.org/wiki/Persistent_homology">Persistent homology</a> from topological data analysis (TDA) identifies geometric features at multiple length scales. Topological descriptors capture pore geometries in porous materials and have outperformed traditional structural descriptors for predicting CO$_{2}$ adsorption in metal-organic frameworks and methane storage in <a href="https://en.wikipedia.org/wiki/Zeolite">zeolites</a>. A caveat is the $O(N^{3})$ worst-case computational cost per filtration.</p>
<h2 id="crystal-graph-neural-networks">Crystal Graph Neural Networks</h2>
<p>Graph neural networks bypass manual feature engineering by learning representations directly from structural data. Materials are converted to graphs $G(V, E)$ where nodes represent atoms and edges connect neighbors within a cutoff radius, with periodic boundary conditions.</p>
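<p>The graph-construction step can be sketched directly: the toy routine below enumerates directed edges within a cutoff, scanning one shell of periodic images (sufficient when the cutoff does not exceed the cell dimensions):</p>

```python
import numpy as np

def periodic_edges(frac_coords, lattice, cutoff):
    """Edge list (i, j, image) for atoms within `cutoff`, including
    periodic images in the 27 cells adjacent to the home cell."""
    cart = frac_coords @ lattice  # fractional -> Cartesian
    shifts = np.array([(a, b, c) for a in (-1, 0, 1)
                       for b in (-1, 0, 1) for c in (-1, 0, 1)])
    edges = []
    for i, ri in enumerate(cart):
        for j, rj in enumerate(cart):
            for s in shifts:
                d = np.linalg.norm(rj + s @ lattice - ri)
                if 1e-8 < d <= cutoff:  # skip the atom's own zero-shift image
                    edges.append((i, j, tuple(s)))
    return edges

# Simple cubic cell (a = 3 Å) with one atom: it bonds to its 6 face images
lattice = 3.0 * np.eye(3)
edges = periodic_edges(np.array([[0.0, 0.0, 0.0]]), lattice, cutoff=3.1)
print(len(edges))  # 6
```

Production codes replace the cubic-time double loop with cell lists or k-d trees, but the periodic-image bookkeeping is the same.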
<p>Key architectures discussed include:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Key Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>Crystal graph convolutions for broad property prediction</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>Materials graph networks with global state attributes</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>Line graph neural networks incorporating three-body angular features</td>
      </tr>
      <tr>
          <td>Equivariant GNNs</td>
          <td>E(3)-equivariant message passing for tensorial properties</td>
      </tr>
  </tbody>
</table>
<p>The review identifies several limitations. Graph convolutions based on local neighborhoods can fail to capture long-range interactions or periodicity-dependent properties (e.g., lattice parameters, phonon spectra). Strategies to address this include concatenation with hand-tuned descriptors, plane-wave periodic basis modulation, and reciprocal-space features.</p>
<p>A major practical restriction is the requirement for relaxed atomic positions. Graphs built from unrelaxed crystal prototypes lose information about geometric distortions, degrading accuracy. Approaches to mitigate this include data augmentation with perturbed structures, Bayesian optimization of prototypes, and surrogate force-field relaxation.</p>
<p>Equivariant models that introduce higher-order tensors to node and edge features, constrained to transform correctly under E(3) operations, achieve state-of-the-art accuracy and can match structural descriptor performance even in low-data (~100 datapoints) regimes.</p>
<h2 id="compositional-descriptors-without-structure">Compositional Descriptors Without Structure</h2>
<p>When crystal structures are unavailable, representations can be built purely from stoichiometry and tabulated atomic properties (radii, electronegativity, valence electrons). Despite their simplicity, these methods have distinct advantages: zero computational overhead, accessibility to non-experts, and robustness for high-throughput screening.</p>
<p>Key methods include:</p>
<ul>
<li><strong>MagPie</strong>: 145 input features derived from elemental properties</li>
<li><strong>SISSO</strong>: Compressive sensing over algebraic combinations of atomic properties, capable of discovering interpretable descriptors (e.g., a new tolerance factor $\tau$ for perovskite stability)</li>
<li><strong>ElemNet</strong>: Deep neural network using only fractional stoichiometry as input, outperforming MagPie with &gt;3,000 training points</li>
<li><strong>ROOST</strong>: Fully-connected compositional graph with attention-based message passing, achieving strong performance with only hundreds of examples</li>
<li><strong>CrabNet</strong>: Self-attention on element embeddings with fractional encoding, handling dopant-level concentrations via log-scale inputs</li>
</ul>
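<p>As an illustration of the ElemNet-style input, a fractional-stoichiometry vector can be parsed from a formula string. The tiny element vocabulary here is for brevity; real models cover most of the periodic table:</p>

```python
import re

ELEMENTS = ["H", "C", "N", "O", "Fe"]  # toy vocabulary for illustration

def fractional_composition(formula):
    """Parse a formula like 'Fe2O3' into element fractions (ElemNet-style input)."""
    counts = {}
    for sym, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[sym] = counts.get(sym, 0) + (int(num) if num else 1)
    total = sum(counts.values())
    return [counts.get(el, 0) / total for el in ELEMENTS]

print(fractional_composition("Fe2O3"))
# Fe: 2/5 = 0.4 and O: 3/5 = 0.6 in their vocabulary slots
```

Note the limitation discussed above is visible immediately: any two polymorphs of Fe<sub>2</sub>O<sub>3</sub> map to the identical vector.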
<p>Compositional models cannot distinguish polymorphs and generally underperform structural approaches. They are most valuable when atomistic resolution is unavailable.</p>
<h2 id="defects-surfaces-and-grain-boundaries">Defects, Surfaces, and Grain Boundaries</h2>
<p>The review extends beyond idealized unit cells to practical materials challenges:</p>
<p><strong>Point defects</strong>: Representations of the pristine bulk can predict vacancy formation energies through linear relationships with band structure descriptors. Frey et al. proposed using relative differences between defect and parent structure properties, requiring no DFT on the defect itself.</p>
<p><strong>Surfaces and catalysis</strong>: Binding energy prediction for catalysis requires representations beyond the bulk unit cell. The d-band center for metals and oxygen 2p-band center for metal oxides serve as simple electronic descriptors, following the <a href="https://en.wikipedia.org/wiki/Sabatier_principle">Sabatier principle</a> that optimal catalytic activity requires intermediate binding strength. Graph neural networks trained on the Open Catalyst 2020 dataset (&gt;1 million DFT energies) have enabled broader screening, though errors remain high for certain adsorbates and non-metallic surfaces.</p>
<p><strong>Grain boundaries</strong>: SOAP descriptors computed for atoms near grain boundaries and clustered into local environment classes can predict grain boundary energy, mobility, and shear coupling. This approach provides interpretable structure-property relationships.</p>
<h2 id="transfer-learning-across-representations">Transfer Learning Across Representations</h2>
<p>When target datasets are small, transfer learning leverages representations learned from large, related datasets. The standard procedure involves: (1) pretraining on a large dataset (e.g., all Materials Project formation energies), (2) freezing parameters up to a chosen depth, and (3) either fine-tuning remaining layers or extracting features for a separate model.</p>
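<p>The feature-extraction variant of steps (2)–(3) can be sketched with a toy network: random weights stand in for the pretrained model, the frozen layers act as a fixed featurizer, and a lightweight ridge head is fit on the small target set:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in "pretrained" weights (in practice: loaded from a large-dataset model)
W1 = rng.standard_normal((32, 64))
W2 = rng.standard_normal((64, 64))

def frozen_features(x):
    """Forward pass through the frozen layers only (no gradient updates)."""
    h = np.maximum(x @ W1, 0.0)     # layer 1 + ReLU
    return np.maximum(h @ W2, 0.0)  # layer 2 + ReLU

# Small target dataset: 100 samples, 32 raw descriptors (toy random data)
X = rng.standard_normal((100, 32))
y = rng.standard_normal(100)

# Fit a closed-form ridge head on the frozen features
F = frozen_features(X)
lam = 1.0
w = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ y)
pred = F @ w
print(pred.shape)
```

Fine-tuning differs only in that the later layers' weights would also be updated by gradient descent instead of being held fixed.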
<p>Key findings from the review:</p>
<ul>
<li>Transfer learning is most effective when the source dataset is orders of magnitude larger than the target</li>
<li>Physically related tasks transfer better (e.g., Open Catalyst adsorption energies transfer well to new adsorbates, less so to unrelated small molecules)</li>
<li>Earlier neural network layers learn more general representations and transfer better across properties</li>
<li>Multi-depth feature extraction, combining activations from multiple layers, can improve transfer</li>
<li>Predictions from surrogate models can serve as additional descriptors, expanding screening domains by orders of magnitude</li>
</ul>
<h2 id="generative-models-for-crystal-inverse-design">Generative Models for Crystal Inverse Design</h2>
<p>Generative models for solid-state materials face challenges beyond molecular generation: more diverse atomic species, the need to specify both positions and lattice parameters, non-unique definitions (rotations, translations, supercell scaling), and large unit cells (&gt;100 atoms for zeolites and MOFs).</p>
<p>The review traces the progression of approaches:</p>
<ol>
<li><strong>Voxel representations</strong>: Discretize unit cells into volume elements. Early work (iMatGen, Court et al.) demonstrated feasibility but was restricted to specific chemistries or cubic systems.</li>
<li><strong>Continuous coordinate models</strong>: Point cloud and invertible representations allowed broader chemical spaces but lacked symmetry invariances.</li>
<li><strong>Symmetry-aware models</strong>: Crystal Diffusion <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAE</a> (CDVAE) uses periodic graphs and SE(3)-equivariant message passing for translationally and rotationally invariant generation, establishing benchmark tasks for the field.</li>
<li><strong>Constrained models for porous materials</strong>: Approaches like SmVAE represent MOFs through their topological building blocks (RFcodes), ensuring all generated structures are physically valid.</li>
</ol>
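<p>A minimal sketch of the voxel idea from step 1, smearing atoms onto a per-species density grid using periodic Gaussians in fractional coordinates (the grid size and Gaussian width here are arbitrary illustrative choices):</p>

```python
import numpy as np

def voxelize(frac_coords, species, grid=16, sigma=0.08):
    """Smear each atom onto a (species, grid, grid, grid) density volume."""
    axes = np.linspace(0.0, 1.0, grid, endpoint=False)
    gx, gy, gz = np.meshgrid(axes, axes, axes, indexing="ij")
    channels = sorted(set(species))
    vol = np.zeros((len(channels), grid, grid, grid))
    for (fx, fy, fz), sp in zip(frac_coords, species):
        # minimum-image distances on the fractional torus (periodicity)
        dx = np.minimum(np.abs(gx - fx), 1 - np.abs(gx - fx))
        dy = np.minimum(np.abs(gy - fy), 1 - np.abs(gy - fy))
        dz = np.minimum(np.abs(gz - fz), 1 - np.abs(gz - fz))
        vol[channels.index(sp)] += np.exp(-(dx**2 + dy**2 + dz**2) / (2 * sigma**2))
    return vol

# Toy cell: Na at the origin, Cl at the body center
v = voxelize([(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)], ["Na", "Cl"], grid=8)
print(v.shape)  # (2, 8, 8, 8)
```

The non-uniqueness problems listed above are easy to see in this encoding: translating the atoms or scaling to a supercell produces a different tensor for the same material.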
<h2 id="open-problems-and-future-directions">Open Problems and Future Directions</h2>
<p>The review highlights four high-impact open questions:</p>
<ol>
<li><strong>Local vs. global descriptor trade-offs</strong>: Local descriptors (SOAP) excel for short-range interactions but struggle with long-range physics. Global descriptors model periodicity but lack generality across space groups. Combining local and long-range features could provide more universal models.</li>
<li><strong>Prediction from unrelaxed prototypes</strong>: ML force fields can relax structures at a fraction of DFT cost, potentially expanding screening domains. Key questions remain about required training data scale and generalizability.</li>
<li><strong>Applicability of compositional descriptors</strong>: The performance gap between compositional and structural models may be property-dependent, being smaller for properties like band gap that depend on global features rather than local site energies.</li>
<li><strong>Extensions of generative models</strong>: Diffusion-based architectures have improved on voxel approaches for small unit cells, but extending to microstructure, dimensionality, and surface generation remains open.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is a review and does not present new experimental results or release any novel code, data, or models. The paper is open-access (hybrid OA at Annual Reviews) and the arXiv preprint is freely available. The following artifacts table covers key publicly available resources discussed in the review.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://arxiv.org/abs/2301.08813">arXiv preprint (2301.08813)</a></td>
          <td>Other</td>
          <td>arXiv (open access)</td>
          <td>Free preprint version</td>
      </tr>
      <tr>
          <td><a href="https://materialsproject.org">Materials Project</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>DFT energies, band gaps, structures for &gt;100,000 compounds</td>
      </tr>
      <tr>
          <td><a href="https://oqmd.org">OQMD</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Open Quantum Materials Database, &gt;600,000 DFT entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Open-Catalyst-Project/ocp">Open Catalyst 2020 (OC20)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>&gt;1,000,000 DFT surface adsorption energies</td>
      </tr>
      <tr>
          <td><a href="https://aflowlib.org">AFLOW</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>High-throughput ab initio library, &gt;3,000,000 entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/hackingmaterials/matminer">Matminer</a></td>
          <td>Code</td>
          <td>BSD</td>
          <td>Open-source toolkit for materials data mining and featurization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review covers: ACSF, SOAP, Voronoi tessellation, Coulomb matrices, PRDF, MBTR, cluster expansions, ACE, persistent homology, CGCNN, MEGNet, ALIGNN, E(3)-equivariant GNNs, MagPie, SISSO, ElemNet, ROOST, CrabNet, VAE, GAN, and diffusion-based crystal generators.</p>
<h3 id="hardware">Hardware</h3>
<p>No new experiments are conducted. Hardware requirements vary by the referenced methods (DFT calculations require HPC; GNN training typically requires 1-8 GPUs).</p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Partially Reproducible</strong>: The review paper itself is open-access. All major datasets discussed (Materials Project, OQMD, OC20, AFLOW) are publicly available under permissive licenses. Most referenced model implementations (CGCNN, MEGNet, ALIGNN, ROOST, CDVAE) have open-source code. No novel artifacts are released by the authors.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Damewood, J., Karaguesian, J., Lunger, J. R., Tan, A. R., Xie, M., Peng, J., &amp; Gómez-Bombarelli, R. (2023). Representations of Materials for Machine Learning. <em>Annual Review of Materials Research</em>, 53. <a href="https://doi.org/10.1146/annurev-matsci-080921-085947">https://doi.org/10.1146/annurev-matsci-080921-085947</a></p>
<p><strong>Publication</strong>: Annual Review of Materials Research, 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{damewood2023representations,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Representations of Materials for Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Damewood, James and Karaguesian, Jessica and Lunger, Jaclyn R. and Tan, Aik Rui and Xie, Mingrou and Peng, Jiayu and G{\&#39;o}mez-Bombarelli, Rafael}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Annual Review of Materials Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{53}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1146/annurev-matsci-080921-085947}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI: The International Chemical Identifier</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi/</guid><description>InChI is IUPAC's open, layered chemical identifier that encodes molecular structure hierarchically for database interoperability and search.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p><strong>InChI (International Chemical Identifier)</strong> is an open, non-proprietary chemical structure identifier developed by <a href="https://iupac.org/">IUPAC</a> and <a href="https://www.nist.gov/">NIST</a>. Unlike <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, which linearizes a molecular graph through depth-first traversal, InChI decomposes a molecule into a hierarchy of <strong>layers</strong> (connectivity, hydrogen atoms, charge, stereochemistry) that build progressively from the molecular formula to full stereochemical detail. This layered design means that two representations of the same molecule always produce the same InChI, even if their input drawings differ in atom ordering or layout.</p>
<p>InChI was created to solve a specific problem: linking chemical information across databases on the open web. Before InChI, interoperability between chemical databases depended on proprietary identifiers (like CAS Registry Numbers) or format-dependent representations. The project began at a March 2000 IUPAC meeting and is maintained by the <a href="https://www.inchi-trust.org/">InChI Trust</a>, a UK charity supported by publishers and database providers. The algorithm&rsquo;s source code is <a href="https://github.com/IUPAC-InChI/InChI">open source</a>.</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Canonical by design</strong>: Every valid molecular structure maps to exactly one standard InChI string, regardless of how the structure was drawn or which atoms were numbered first. This uniqueness is built into the algorithm, not added as a post-processing step.</li>
<li><strong>Hierarchical layers</strong>: Information is organized from general (molecular formula) to specific (stereochemistry, isotopes). This allows matching at different levels of detail: a query with unknown stereochemistry can match against structures with known stereochemistry by comparing only the connectivity layers.</li>
<li><strong>Web-searchable via InChIKey</strong>: Because InChI strings contain characters (<code>/</code>, <code>+</code>, <code>=</code>) that break web search engines, the 27-character InChIKey hash provides a fixed-length, search-friendly identifier.</li>
<li><strong>Non-proprietary and open</strong>: Governed by IUPAC through the InChI Trust. The algorithm, source code, and specification are freely available.</li>
<li><strong>Machine-optimized</strong>: Designed for programmatic parsing and database operations rather than human readability. Compare with SMILES, which prioritizes human readability.</li>
</ul>
<h2 id="layered-structure">Layered Structure</h2>
<p>An InChI string begins with the prefix <code>InChI=</code> followed by a version number, then a series of layers separated by <code>/</code>. Each layer encodes a specific aspect of the molecular structure.</p>
<h3 id="layer-breakdown">Layer Breakdown</h3>
<p>For L-alanine (an amino acid with a chiral center):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
</span></span><span style="display:flex;"><span>       │  │      │            │                   │   │  │
</span></span><span style="display:flex;"><span>       │  │      │            │                   │   │  └─ /s: stereo type (1=absolute)
</span></span><span style="display:flex;"><span>       │  │      │            │                   │   └─ /m: parity inversion flag
</span></span><span style="display:flex;"><span>       │  │      │            │                   └─ /t: tetrahedral parity
</span></span><span style="display:flex;"><span>       │  │      │            └─ /h: hydrogen layer
</span></span><span style="display:flex;"><span>       │  │      └─ /c: connectivity layer
</span></span><span style="display:flex;"><span>       │  └─ molecular formula
</span></span><span style="display:flex;"><span>       └─ version (1S = standard InChI v1)
</span></span></code></pre></div><p>The full set of layers, in order:</p>
<ol>
<li><strong>Main layer</strong>: Molecular formula (e.g., <code>C3H7NO2</code>)</li>
<li><strong>Connectivity (<code>/c</code>)</strong>: Atom-to-atom connections, excluding bond orders. Heavy atoms are numbered canonically starting from 1, and bonds are written in a compact nested notation (e.g., <code>1-2(4)3(5)6</code>).</li>
<li><strong>Hydrogen (<code>/h</code>)</strong>: Hydrogen atom assignments, distinguishing mobile (tautomeric) from fixed hydrogens</li>
<li><strong>Charge (<code>/q</code>) and proton balance (<code>/p</code>)</strong>: Net charge and protonation state</li>
<li><strong>Double bond stereochemistry (<code>/b</code>)</strong>: E/Z configuration around double bonds</li>
<li><strong>Tetrahedral stereochemistry (<code>/t</code>)</strong>: R/S configuration at sp3 centers</li>
<li><strong>Parity inversion (<code>/m</code>)</strong>: Relates computed parity to actual configuration</li>
<li><strong>Stereo type (<code>/s</code>)</strong>: Whether stereochemistry is absolute, relative, or racemic</li>
<li><strong>Isotope layer (<code>/i</code>)</strong>: Isotopic labeling (e.g., deuterium, carbon-13)</li>
</ol>
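<p>Because the layers are delimited by <code>/</code> with single-letter prefixes, a toy parser is straightforward. This sketch handles only single-component standard InChIs; multi-component structures use additional separators it ignores:</p>

```python
def parse_inchi_layers(inchi):
    """Split an InChI string into its version, formula, and prefixed layers."""
    body = inchi.split("=", 1)[1]  # drop the "InChI=" prefix
    parts = body.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]  # e.g. "t2-" -> layers["t"] = "2-"
    return layers

layers = parse_inchi_layers(
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
)
print(layers["c"])  # 1-2(4)3(5)6
print(layers["t"])  # 2-
```

Comparing only the <code>formula</code> and <code>c</code> entries of two parsed InChIs gives exactly the stereochemistry-insensitive matching described above.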
<h3 id="standard-vs-non-standard-inchi">Standard vs. Non-Standard InChI</h3>
<p>The <code>S</code> in <code>InChI=1S/</code> indicates a <strong>Standard InChI</strong>, which uses a fixed set of normalization options to guarantee that any software producing Standard InChI will generate the same string for the same molecule. Non-standard InChI allows custom options (such as the Fixed-H layer <code>/f</code>, which distinguishes specific tautomeric forms) but sacrifices cross-implementation consistency.</p>
<h2 id="the-inchikey">The InChIKey</h2>
<p>InChI strings can be arbitrarily long for large molecules, and their <code>/</code>, <code>+</code>, and <code>=</code> characters cause problems for web search engines. The <strong>InChIKey</strong> addresses both issues by hashing the InChI into a fixed 27-character string:</p>
<p>$$
\text{InChIKey} = \text{truncate}\left(f_{\text{SHA-256}}(\text{InChI})\right)
$$</p>
<h3 id="structure">Structure</h3>
<p>An InChIKey has the format <code>XXXXXXXXXXXXXX-XXXXXXXXXX-X</code>:</p>
<ul>
<li><strong>First block (14 characters)</strong>: truncated SHA-256 hash of the connectivity layers (the molecular skeleton)</li>
<li><strong>Second block (10 characters)</strong>: 8 characters encoding stereochemistry and isotopes, plus a standard/non-standard flag (<code>S</code> or <code>N</code>) and a version indicator (<code>A</code> for v1)</li>
<li><strong>Third block (1 character)</strong>: Protonation flag (<code>N</code> for neutral)</li>
</ul>
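<p>These fields can be pulled apart with plain string slicing; a minimal sketch using the L-alanine key:</p>

```python
def split_inchikey(key):
    """Decompose a 27-character InChIKey into its documented fields."""
    assert len(key) == 27 and key[14] == "-" and key[25] == "-"
    first, second, proton = key.split("-")
    return {
        "skeleton": first,      # 14 chars: connectivity hash
        "stereo": second[:8],   # 8 chars: stereo/isotope hash
        "flag": second[8],      # 'S' standard / 'N' non-standard
        "version": second[9],   # 'A' = version 1
        "protonation": proton,  # 'N' = neutral
    }

fields = split_inchikey("QNAYBMKLOCPYGJ-REOHCLBHSA-N")
print(fields["flag"], fields["protonation"])  # S N
```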
<p>For example, L-alanine:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>InChIKey: QNAYBMKLOCPYGJ-REOHCLBHSA-N
</span></span><span style="display:flex;"><span>          │                │          │
</span></span><span style="display:flex;"><span>          └─ connectivity  └─ stereo  └─ protonation
</span></span></code></pre></div><h3 id="collision-risk">Collision Risk</h3>
<p>Because the InChIKey is a hash, collisions are theoretically possible. The first block alone provides $2^{65}$ possible values for connectivity, and the second block adds further entropy, making accidental collisions of the full key extremely unlikely for practical database sizes (estimated around 1 in $10^{12}$ for $10^9$ compounds). It is important to distinguish InChIKey collisions (a mathematical inevitability of hashing, but rare in practice) from InChI collisions (bugs in the algorithm, which are very rare and targeted by the certification suite).</p>
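<p>A back-of-the-envelope birthday estimate makes the orders of magnitude concrete. The ~102 effective hash bits assumed here for the full key (65 for the first block plus roughly 37 for the second) are an illustrative assumption, not an official figure:</p>

```python
def expected_collisions(n_keys, hash_bits):
    """Birthday approximation: expected pairwise collisions ~ k(k-1) / (2 * 2^bits)."""
    return n_keys * (n_keys - 1) / (2 * 2**hash_bits)

# First hash block alone (~65 bits): a few expected near-misses per 10^9 keys
print(f"{expected_collisions(10**9, 65):.3g}")   # ~0.014

# Full key (assuming ~102 effective bits): order 1e-13, i.e. ~1 in 10^12
print(f"{expected_collisions(10**9, 102):.3g}")
```

This is why first-block matches are treated as "same skeleton" hints to be verified against the full InChI, while full-key matches are effectively unique identifiers.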
<h2 id="working-with-inchi-in-python">Working with InChI in Python</h2>
<p>The RDKit library provides InChI support through its built-in functions:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.inchi <span style="color:#f92672">import</span> MolFromInchi, MolToInchi, InchiToInchiKey
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMILES -&gt; InChI</span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;C[C@@H](N)C(=O)O&#34;</span>)  <span style="color:#75715e"># L-alanine</span>
</span></span><span style="display:flex;"><span>inchi <span style="color:#f92672">=</span> MolToInchi(mol)
</span></span><span style="display:flex;"><span>print(inchi)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># InChI -&gt; Molecule -&gt; SMILES</span>
</span></span><span style="display:flex;"><span>mol2 <span style="color:#f92672">=</span> MolFromInchi(inchi)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol2))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C[C@@H](N)C(=O)O</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># InChI -&gt; InChIKey</span>
</span></span><span style="display:flex;"><span>key <span style="color:#f92672">=</span> InchiToInchiKey(inchi)
</span></span><span style="display:flex;"><span>print(key)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; QNAYBMKLOCPYGJ-REOHCLBHSA-N</span>
</span></span></code></pre></div><h3 id="layer-level-matching">Layer-Level Matching</h3>
<p>Because InChI is hierarchical, you can compare molecules at different levels of detail by truncating layers. Two molecules that differ only in stereochemistry will share the same connectivity layers:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.inchi <span style="color:#f92672">import</span> MolToInchi, InchiToInchiKey
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># L-alanine and D-alanine differ only in chirality</span>
</span></span><span style="display:flex;"><span>l_ala <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;C[C@@H](N)C(=O)O&#34;</span>)
</span></span><span style="display:flex;"><span>d_ala <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;C[C@H](N)C(=O)O&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>l_inchi <span style="color:#f92672">=</span> MolToInchi(l_ala)
</span></span><span style="display:flex;"><span>d_inchi <span style="color:#f92672">=</span> MolToInchi(d_ala)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Full InChIs differ (different /t and /m layers)</span>
</span></span><span style="display:flex;"><span>print(l_inchi)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1</span>
</span></span><span style="display:flex;"><span>print(d_inchi)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># First block of InChIKey is identical (same connectivity)</span>
</span></span><span style="display:flex;"><span>l_key <span style="color:#f92672">=</span> InchiToInchiKey(l_inchi)
</span></span><span style="display:flex;"><span>d_key <span style="color:#f92672">=</span> InchiToInchiKey(d_inchi)
</span></span><span style="display:flex;"><span>print(l_key[:<span style="color:#ae81ff">14</span>] <span style="color:#f92672">==</span> d_key[:<span style="color:#ae81ff">14</span>])
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; True (same molecular skeleton)</span>
</span></span><span style="display:flex;"><span>print(l_key <span style="color:#f92672">==</span> d_key)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; False (different stereochemistry)</span>
</span></span></code></pre></div><h2 id="inchi-in-machine-learning">InChI in Machine Learning</h2>
<p>InChI was designed for database interoperability, not for machine learning. Its hierarchical, layer-based structure differs fundamentally from the sequential, atom-by-atom encoding used by <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. This has practical implications for ML applications.</p>
<h3 id="optical-chemical-structure-recognition">Optical Chemical Structure Recognition</h3>
<p>InChI is widely used as an output format for <a href="/posts/what-is-ocsr/">optical chemical structure recognition (OCSR)</a> systems that extract molecular structures from images in scientific literature. Because InChI is canonical, it provides an unambiguous target for image-to-text models.</p>
<p><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI</a> uses an improved SwinTransformer encoder with attention-based feature fusion to convert molecular images directly to InChI strings, achieving 99.8% accuracy on the BMS dataset. The <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI Transformer</a> takes a similar approach with a Vision Transformer backbone.</p>
<p>In a <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">systematic comparison of string representations for OCSR</a>, Rajan et al. (2022) evaluated SMILES, DeepSMILES, SELFIES, and InChI using the same transformer architecture. InChI strings are longer than SMILES (producing more tokens for the decoder), which increases sequence modeling difficulty. SMILES achieved the highest exact match accuracy (88.62%), while SELFIES achieved 100% structural validity.</p>
<h3 id="chemical-name-translation">Chemical Name Translation</h3>
<p>InChI&rsquo;s canonical structure makes it a natural intermediate representation for translating between chemical names and structures. <a href="/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/">Handsel et al. (2021)</a> trained a sequence-to-sequence Transformer to translate InChI identifiers to IUPAC names character-by-character, achieving 91% accuracy on organic compounds from PubChem (10 million training pairs). <a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT</a> converts through SELFIES as an intermediate but validates outputs against InChI for structural equivalence.</p>
<h3 id="representation-comparison-for-ml">Representation Comparison for ML</h3>
<p>InChI&rsquo;s design trade-offs position it differently from SMILES and SELFIES for machine learning:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>InChI</th>
          <th>SMILES</th>
          <th>SELFIES</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Uniqueness</td>
          <td>Canonical by design</td>
          <td>Requires canonicalization algorithm</td>
          <td>Via SMILES roundtrip</td>
      </tr>
      <tr>
          <td>Validity guarantee</td>
          <td>N/A (not generative)</td>
          <td>No</td>
          <td>Yes (every string is valid)</td>
      </tr>
      <tr>
          <td>Human readability</td>
          <td>Low (machine-optimized)</td>
          <td>High</td>
          <td>Moderate</td>
      </tr>
      <tr>
          <td>String length</td>
          <td>Longest</td>
          <td>Shortest</td>
          <td>Moderate</td>
      </tr>
      <tr>
          <td>Primary ML use</td>
          <td>OCSR output, database linking</td>
          <td>Generation, property prediction</td>
          <td>Generation with validity</td>
      </tr>
      <tr>
          <td>Tokenization</td>
          <td>Complex (layers, separators)</td>
          <td>Regex-based atom tokens</td>
          <td>Bracket-delimited tokens</td>
      </tr>
  </tbody>
</table>
<p>InChI&rsquo;s length and structural complexity (layer separators, parenthetical groupings, comma-delimited atom lists) make it less common as a direct input representation for generative models. Most molecular language models use SMILES or SELFIES for generation tasks, and convert to InChI only for canonicalized comparison or database lookup.</p>
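The layer separators that complicate tokenization do, however, make InChI trivially easy to parse at a coarse level. As a rough sketch (the `connectivity_key` helper below is illustrative, not part of the InChI software or RDKit), splitting on `/` exposes the layers, and dropping the stereo layers reproduces the connectivity-level matching shown earlier:

```python
# Split a standard InChI into layers; each layer after the formula is
# tagged by a one-letter prefix (c = connectivity, h = hydrogens,
# t/m/s = stereochemistry).
l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

def connectivity_key(inchi: str) -> str:
    """Keep header, formula, and c/h layers; drop the stereo layers."""
    header, formula, *layers = inchi.split("/")
    return "/".join([header, formula] + [l for l in layers if l[0] in "ch"])

print(connectivity_key(l_ala))
# -> InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)

# The enantiomers agree up to connectivity, mirroring the InChIKey
# first-block comparison above.
print(connectivity_key(l_ala) == connectivity_key(d_ala))
# -> True
```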
<h2 id="limitations">Limitations</h2>
<h3 id="tautomerism">Tautomerism</h3>
<p>InChI v1 handles many tautomeric forms by normalizing mobile hydrogen atoms in the <code>/h</code> layer. However, certain tautomeric transformations (such as 1,4-oxime/nitroso conversions) can produce different InChIs for what chemists consider the same compound. This is a <a href="/notes/chemistry/molecular-representations/notations/inchi-and-tautomers/">known limitation targeted for InChI v2</a>, with 86 tautomeric transformation rules compiled and validated across 400M+ structures to inform the update.</p>
<h3 id="inorganic-and-organometallic-chemistry">Inorganic and Organometallic Chemistry</h3>
<p>The original InChI specification was designed primarily for organic molecules. Metal-ligand bonds, coordination compounds, and extended solid-state structures posed challenges. The <a href="/notes/chemistry/molecular-representations/notations/inchi-2025/">InChI v1.07 release</a> addresses this with dedicated handling for metal-ligand bonds, though complete coverage of all inorganic chemistry remains an ongoing effort.</p>
<h3 id="not-designed-for-generation">Not Designed for Generation</h3>
<p>Unlike SMILES (which can be generated token-by-token through depth-first graph traversal) or SELFIES (which guarantees validity by construction), InChI&rsquo;s layered format does not lend itself to autoregressive generation. A generative model would need to produce internally consistent layers: the connectivity layer must agree with the molecular formula, the hydrogen layer must be consistent with the connectivity, and the stereochemistry layers must reference valid atom indices. This cross-layer dependency makes InChI poorly suited as a target for token-by-token molecular generation, which is why most generative chemistry models use SMILES or SELFIES.</p>
<h3 id="irreversibility-of-inchikey">Irreversibility of InChIKey</h3>
<p>The InChIKey is a one-way hash; it cannot be converted back to an InChI or a molecular structure. It is useful for search and comparison, but recovering a structure from a key requires a lookup table.</p>
<h2 id="variants-and-extensions">Variants and Extensions</h2>
<h3 id="rinchi-reactions">RInChI: Reactions</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/rinchi/">RInChI (Reaction InChI)</a> extends InChI to represent chemical reactions by combining the InChIs of reactants, products, and agents into a single identifier. It provides a canonical identifier for reactions, enabling reaction database searching and duplicate detection (Grethe et al., 2018).</p>
<h3 id="minchi-mixtures">MInChI: Mixtures</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/mixfile-minchi/">MInChI (Mixture InChI)</a> represents mixtures of substances, combined with the Mixfile format for storing detailed mixture composition data. This extends the InChI framework to complex multi-component systems like formulations and alloys (Clark et al., 2019).</p>
<h3 id="ninchi-nanomaterials">NInChI: Nanomaterials</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/ninchi-alpha/">NInChI</a> proposes a hierarchical adaptation of InChI for nanomaterial identification. Traditional chemical identifiers break down at the nanoscale, where a single &ldquo;entity&rdquo; may consist of millions of atoms arranged in layers, coatings, and surface functionalizations (Lynch et al., 2020).</p>
<h2 id="references">References</h2>
<ul>
<li>Heller, S., McNaught, A., Pletnev, I., Stein, S., &amp; Tchekhovskoi, D. (2015). InChI, the IUPAC International Chemical Identifier. <a href="https://doi.org/10.1186/s13321-015-0068-4"><em>Journal of Cheminformatics</em>, <em>7</em>(1), 23.</a></li>
<li>Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., &amp; Pletnev, I. (2013). InChI - the worldwide chemical structure identifier standard. <a href="https://doi.org/10.1186/1758-2946-5-7"><em>Journal of Cheminformatics</em>, <em>5</em>(1), 7.</a></li>
<li>Grethe, G., Blanke, G., Kraut, H., &amp; Goodman, J. M. (2018). International Chemical Identifier for reactions (RInChI). <a href="https://doi.org/10.1186/s13321-018-0277-8"><em>Journal of Cheminformatics</em>, <em>10</em>(1), 22.</a></li>
<li>Clark, A. M., McEwen, L. R., Gedeck, P., &amp; Bunin, B. A. (2019). Capturing mixture composition: an open machine-readable format for representing mixed substances. <a href="https://doi.org/10.1186/s13321-019-0357-4"><em>Journal of Cheminformatics</em>, <em>11</em>(1), 33.</a></li>
<li>Lynch, I., et al. (2020). Can an InChI for nano address the need for a simplified representation of complex nanomaterials across experimental and nanoinformatics studies? <a href="https://doi.org/10.3390/nano10122493"><em>Nanomaterials</em>, <em>10</em>(12), 2493.</a></li>
<li><a href="https://www.inchi-trust.org/">InChI Trust</a></li>
<li><a href="https://github.com/IUPAC-InChI/InChI">InChI GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>MoMu: Bridging Molecular Graphs and Natural Language</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/</guid><description>MoMu bridges molecular graphs and natural language via contrastive pre-training, enabling cross-modal retrieval, captioning, and property prediction.</description><content:encoded><![CDATA[<h2 id="bridging-molecular-graphs-and-natural-language-through-contrastive-learning">Bridging Molecular Graphs and Natural Language Through Contrastive Learning</h2>
<p>MoMu (Molecular Multimodal foundation model) is a <strong>Method</strong> paper that proposes a multimodal pre-training approach to associate molecular graphs with natural language descriptions. The primary contribution is a dual-encoder architecture, consisting of a Graph Isomorphism Network (GIN) for molecular graphs and a BERT-based text encoder, jointly trained through contrastive learning on weakly-correlated graph-text pairs collected from scientific literature. The pre-trained model supports four downstream capabilities: cross-modal retrieval (graph-to-text and text-to-graph), molecule captioning, zero-shot text-to-graph molecule generation, and molecular property prediction.</p>
<h2 id="why-single-modality-models-are-insufficient-for-molecular-understanding">Why Single-Modality Models Are Insufficient for Molecular Understanding</h2>
<p>Existing AI models for molecular tasks generally operate on a single modality and learn a single cognitive ability. Language-based models process <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings or natural language texts and handle tasks like property prediction from strings, literature comprehension, or SMILES-based generation. Graph-based models use molecular graph representations and handle graph-level property prediction or graph generation. Neither category connects structural information from molecular graphs with the rich semantic knowledge encoded in scientific texts.</p>
<p>Prior work by Zeng et al. (KV-PLM) jointly modeled molecule-related texts and SMILES strings, but SMILES representations have inherent drawbacks: they are one-dimensional and may lose structural information, they cannot capture structural similarities between molecules, and a single molecule can have multiple valid SMILES representations. Molecular graphs, by contrast, are more intuitive and better reveal functional structures. Human experts learn molecular knowledge by associating both graphical representations and textual descriptions, yet no prior model bridged these two modalities directly.</p>
<p>The key challenge is the scarcity of paired molecular graph-text data compared to general image-text datasets. Additionally, learning specialized molecular knowledge requires foundational cognitive abilities in both the graph and text domains, making training from scratch infeasible with limited data.</p>
<h2 id="contrastive-pre-training-with-inter-modal-and-intra-modal-objectives">Contrastive Pre-Training with Inter-Modal and Intra-Modal Objectives</h2>
<p>MoMu consists of two encoders initialized from pre-trained unimodal models: a GIN graph encoder initialized from GraphCL self-supervised weights, and a BERT text encoder initialized from either Sci-BERT (yielding MoMu-S) or KV-PLM (yielding MoMu-K).</p>
<h3 id="data-collection">Data Collection</h3>
<p>The authors collect 15,613 molecular graph-document pairs by:</p>
<ol>
<li>Gathering names, synonyms, and SMILES for the top 50K compounds in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></li>
<li>Converting SMILES to molecular graphs using the OGB <code>smiles2graph</code> function</li>
<li>Retrieving related text from the S2ORC corpus (136M+ papers) by querying with molecule names, filtering to Medicine, Biology, Chemistry, and Computer Science fields</li>
<li>Restricting retrieval to abstract, introduction, and conclusion sections to avoid experimental data artifacts</li>
</ol>
<h3 id="contrastive-training-objective">Contrastive Training Objective</h3>
<p>For each graph-text pair in a mini-batch of $N$ pairs, MoMu applies two graph augmentations (node dropping and subgraph extraction) to create two augmented graphs, and randomly samples two sentences from the document. This produces $2N$ graph representations $\{z_1^G, \tilde{z}_1^G, \ldots, z_N^G, \tilde{z}_N^G\}$ and $2N$ text representations $\{z_1^T, \tilde{z}_1^T, \ldots, z_N^T, \tilde{z}_N^T\}$.</p>
<p>The cross-modal contrastive loss for a pair $(z_i^G, z_i^T)$ is:</p>
<p>$$
\ell_i^{(z_i^G, z_i^T)} = -\log \frac{\exp(\text{sim}(z_i^G, z_i^T) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, z_j^T) / \tau)}
$$</p>
<p>where $\tau$ is the temperature parameter and $\text{sim}(\cdot, \cdot)$ projects both representations into a shared 256-dimensional space before computing cosine similarity. The total cross-modal loss includes four contrastive terms for each pair: $(z_i^G, z_i^T)$, $(\tilde{z}_i^G, z_i^T)$, $(z_i^G, \tilde{z}_i^T)$, and $(\tilde{z}_i^G, \tilde{z}_i^T)$.</p>
<p>An intra-modal graph contrastive loss further strengthens the graph encoder:</p>
<p>$$
\ell_i^{(z_i^G, \tilde{z}_i^G)} = -\log \frac{\exp(\text{sim}(z_i^G, \tilde{z}_i^G) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, \tilde{z}_j^G) / \tau)}
$$</p>
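Both terms are the standard InfoNCE objective. As a minimal NumPy sketch (assuming the representations have already been projected into the shared 256-dimensional space; this is not the authors' code):

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """Mean InfoNCE loss over N positive pairs (z_a[i], z_b[i]).

    Matches the per-pair loss above: negative log-softmax of
    temperature-scaled cosine similarities across the batch.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                       # (N, N) sims / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))            # positives on diagonal

rng = np.random.default_rng(0)
z_g = rng.normal(size=(8, 256))
# Perfectly aligned pairs yield a much lower loss than random pairings.
aligned = info_nce(z_g, z_g)
random_pairs = info_nce(z_g, rng.normal(size=(8, 256)))
print(aligned < random_pairs)
# -> True
```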
<h3 id="zero-shot-text-to-graph-generation">Zero-Shot Text-to-Graph Generation</h3>
<p>MoMu enables a zero-shot generation pipeline by combining the pre-trained MoMu encoders with MoFlow, a flow-based molecular generator. Given an input text description $x^T$, the method:</p>
<ol>
<li>Samples a latent variable $q$ from MoFlow&rsquo;s Gaussian prior $P(q)$</li>
<li>Generates a molecular graph through MoFlow&rsquo;s reverse flows: $\hat{E} = f_g^{-1}(q_e)$ and $\hat{V} = f_c^{-1}(q_v \mid GN(\hat{E}))$</li>
<li>Feeds $\hat{V}$ (using soft atom type probabilities instead of hard assignments) into MoMu&rsquo;s graph encoder</li>
<li>Optimizes $q$ to maximize the cosine similarity between the resulting graph and text representations:</li>
</ol>
<p>$$
\ell_q = -\text{sim}(z^G, z^T) / \tau
$$</p>
<p>All MoMu and MoFlow parameters are frozen; only $q$ is updated via Adam for up to 500 iterations. The final molecule is obtained by applying argmax to the optimized probability matrices $\hat{V}$ and $\hat{E}$.</p>
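Structurally, this is gradient ascent on cosine similarity with respect to $q$ through a frozen pipeline. A toy sketch with random linear maps standing in for MoFlow's reverse flow and MoMu's graph encoder (everything here is a hypothetical stand-in, and plain gradient ascent replaces Adam; only the optimize-$q$-alone structure mirrors the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
D_q, D_g, D_z = 64, 128, 256
W_dec = rng.normal(size=(D_g, D_q)) / np.sqrt(D_q)  # stand-in "decoder"
W_enc = rng.normal(size=(D_z, D_g)) / np.sqrt(D_g)  # stand-in "graph encoder"
z_text = rng.normal(size=D_z)                       # fixed text representation

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def d_cosine_du(u, v):
    """Gradient of cos(u, v) with respect to u."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return v / (nu * nv) - (u @ v) * u / (nu ** 3 * nv)

q = rng.normal(size=D_q)           # sample latent from the Gaussian prior
M = W_enc @ W_dec                  # frozen encoder-decoder composition
before = cosine(M @ q, z_text)
for _ in range(500):               # up to 500 iterations; q is the only free variable
    q += 0.05 * M.T @ d_cosine_du(M @ q, z_text)  # chain rule through M
after = cosine(M @ q, z_text)
print(f"similarity: {before:.3f} -> {after:.3f}")
```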
<h2 id="evaluation-across-four-downstream-tasks">Evaluation Across Four Downstream Tasks</h2>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>MoMu is evaluated on the PCdes dataset (15K SMILES-description pairs from PubChem, split 10,500/1,500/3,000 for train/val/test). Retrieval is performed in mini-batches of 64 pairs, reporting top-1 accuracy and Recall@20.</p>
<p><strong>Graph-to-Text Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.38</td>
          <td>62.11</td>
          <td>62.57</td>
          <td>60.67</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>53.79</td>
          <td>66.63</td>
          <td>64.81</td>
          <td>63.87</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.92</td>
          <td>68.59</td>
          <td>77.92</td>
          <td>75.93</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>58.64</td>
          <td>80.59</td>
          <td>80.62</td>
          <td>79.11</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>58.74</td>
          <td>81.29</td>
          <td>81.09</td>
          <td>80.15</td>
      </tr>
  </tbody>
</table>
<p><strong>Text-to-Graph Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.12</td>
          <td>68.02</td>
          <td>61.75</td>
          <td>60.77</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>54.22</td>
          <td>71.80</td>
          <td>64.95</td>
          <td>64.27</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.61</td>
          <td>74.77</td>
          <td>77.03</td>
          <td>75.47</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>55.44</td>
          <td>76.92</td>
          <td>80.22</td>
          <td>79.02</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>54.94</td>
          <td>78.29</td>
          <td>81.45</td>
          <td>80.62</td>
      </tr>
  </tbody>
</table>
<p>In zero-shot retrieval (on a separate test set of 5,562 pairs not seen during pre-training), MoMu achieves approximately 39-46% accuracy compared to below 2% for Sci-BERT and KV-PLM, demonstrating strong generalization.</p>
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>MoMu&rsquo;s graph features are appended to MolT5&rsquo;s encoder inputs through a learned MLP mapping module on the ChEBI-20 dataset. Results show improvements in BLEU, METEOR, and Text2Mol scores when incorporating graph features, though ROUGE-L slightly drops. The graph structural information leads to more accurate captions for complex molecular structures.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The pre-trained graph encoder from MoMu is fine-tuned on eight <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> datasets using scaffold splitting and ROC-AUC evaluation (10 runs).</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>No Pre-Train</th>
          <th>GraphCL</th>
          <th>MoMu-S</th>
          <th>MoMu-K</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>65.8</td>
          <td>69.7</td>
          <td><strong>70.5</strong></td>
          <td>70.1</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>74.0</td>
          <td>73.9</td>
          <td>75.6</td>
          <td>75.6</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>63.4</td>
          <td>62.4</td>
          <td>63.4</td>
          <td>63.0</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>57.3</td>
          <td>60.5</td>
          <td>60.5</td>
          <td>60.4</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>58.0</td>
          <td>76.0</td>
          <td><strong>79.9</strong></td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>71.8</td>
          <td>69.8</td>
          <td>70.5</td>
          <td>71.1</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>75.3</td>
          <td><strong>78.5</strong></td>
          <td>75.9</td>
          <td>76.2</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>70.1</td>
          <td>75.4</td>
          <td>76.7</td>
          <td>77.1</td>
      </tr>
      <tr>
          <td><strong>Average</strong></td>
          <td>66.96</td>
          <td>70.78</td>
          <td><strong>71.63</strong></td>
          <td>71.36</td>
      </tr>
  </tbody>
</table>
<p>MoMu-S achieves the best average ROC-AUC (71.63%) across all eight datasets, outperforming GraphCL (70.78%), the self-supervised method used to initialize MoMu&rsquo;s graph encoder. MoMu outperforms GraphCL on six of eight datasets. Notably, MoMu-S and MoMu-K perform comparably, indicating that KV-PLM&rsquo;s SMILES-based knowledge does not transfer well to graph-based representations.</p>
<h3 id="zero-shot-text-to-graph-generation-1">Zero-Shot Text-to-Graph Generation</h3>
<p>The method generates molecules from three types of text descriptions:</p>
<ol>
<li><strong>High-level vague descriptions</strong> (e.g., &ldquo;The molecule is beautiful&rdquo;): MoMu generates diverse, interpretable molecules where &ldquo;beautiful&rdquo; tends to produce locally symmetric and stretched graphs, &ldquo;versatile&rdquo; produces molecules with varied elements and functional groups, and &ldquo;strange&rdquo; produces cluttered, irregular structures.</li>
<li><strong>Functional descriptions</strong> (e.g., &ldquo;fluorescent molecules&rdquo;, &ldquo;high water solubility and barrier permeability with low toxicity&rdquo;): MoMu successfully generates molecules with appropriate functional groups and properties. For the solubility/permeability/toxicity query, the generated molecules satisfy all three evaluable properties.</li>
<li><strong>Structural descriptions</strong> (e.g., &ldquo;molecules containing <a href="https://en.wikipedia.org/wiki/Nucleophile">nucleophilic</a> groups&rdquo;): MoMu generates diverse molecules with appropriate functional groups (amino, hydroxyl, carbonyl, halogen atoms).</li>
</ol>
<h2 id="promising-multimodal-transfer-with-clear-data-limitations">Promising Multimodal Transfer with Clear Data Limitations</h2>
<p>MoMu demonstrates that contrastive pre-training on weakly-correlated graph-text data can bridge molecular graphs and natural language in a shared representation space. The key findings are:</p>
<ol>
<li><strong>Cross-modal alignment works with limited data</strong>: With only 15K graph-text pairs (far fewer than the hundreds of millions of image-text pairs used to train vision-language models like CLIP), MoMu achieves meaningful cross-modal retrieval and enables zero-shot generation.</li>
<li><strong>Multimodal supervision improves graph representations</strong>: The graph encoder supervised by text descriptions outperforms self-supervised methods (GraphCL, AttrMasking, ContextPred) on average across molecular property prediction benchmarks.</li>
<li><strong>SMILES knowledge does not transfer to graphs</strong>: MoMu-S and MoMu-K perform comparably across all tasks, showing that structural information learned from one-dimensional SMILES strings does not readily generalize to graph neural networks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several important limitations:</p>
<ul>
<li><strong>Data scarcity</strong>: 15K graph-text pairs is substantially smaller than general image-text datasets, potentially leaving the common space insufficiently aligned.</li>
<li><strong>Noisy supervision</strong>: Retrieved texts may mention a molecule by name without describing its properties or structure, leading to spurious correlations.</li>
<li><strong>Generator constraints</strong>: The zero-shot generation method is limited by MoFlow&rsquo;s capacity (maximum 38 atoms, 9 element types from ZINC250K training).</li>
<li><strong>Property coverage</strong>: Generation quality degrades for molecular properties that appear infrequently or not at all in the training texts.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose four avenues: (1) collecting larger-scale multimodal molecular data including 3D conformations, (2) using strongly-correlated paired data with more advanced generators, (3) developing interpretable tools for the learned cross-modal space, and (4) wet-lab validation of generated molecules.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Collected graph-text pairs (PubChem + S2ORC)</td>
          <td>15,613 pairs</td>
          <td>~37M paragraphs total; top 50K PubChem compounds</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>15K pairs (10.5K/1.5K/3K split)</td>
          <td>SMILES-description pairs from PubChem</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>ChEBI-20</td>
          <td>~33K pairs</td>
          <td>Used with MolT5</td>
      </tr>
      <tr>
          <td>Text-to-graph generation</td>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC250K</a> (MoFlow)</td>
          <td>250K molecules</td>
          <td>Pre-trained generator, max 38 atoms</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>Varies</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Graph augmentations</strong>: Node dropping (10% ratio) and subgraph extraction (80% of original size via random walk)</li>
<li><strong>Contrastive learning</strong>: InfoNCE loss with temperature $\tau = 0.1$, following the DeClip paradigm with both inter-modal and intra-modal objectives</li>
<li><strong>Zero-shot generation</strong>: Adam optimizer on latent variable $q$ for up to 500 iterations; formal charges prohibited in output</li>
</ul>
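The node-dropping augmentation can be sketched on a plain adjacency matrix (illustrative only; MoMu operates on featurized molecular graphs, and `drop_nodes` below is a hypothetical helper using the 10% ratio from the listing above):

```python
import numpy as np

def drop_nodes(adj, drop_ratio=0.10, rng=None):
    """Return the subgraph induced by keeping (1 - drop_ratio) of the
    nodes, removing dropped nodes and all their incident edges."""
    if rng is None:
        rng = np.random.default_rng()
    n = adj.shape[0]
    keep = np.sort(rng.choice(n, size=n - int(n * drop_ratio), replace=False))
    return adj[np.ix_(keep, keep)], keep

rng = np.random.default_rng(42)
adj = (rng.random((20, 20)) < 0.2).astype(int)
adj = np.triu(adj, 1)
adj = adj + adj.T                   # undirected, no self-loops
aug, kept = drop_nodes(adj, drop_ratio=0.10, rng=rng)
print(adj.shape, "->", aug.shape)
# -> (20, 20) -> (18, 18)
```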
<h3 id="models">Models</h3>
<ul>
<li><strong>Graph encoder</strong>: GIN with 5 layers, 300-dimensional hidden size, initialized from GraphCL checkpoint</li>
<li><strong>Text encoder</strong>: BERT-base (768 hidden size), initialized from Sci-BERT or KV-PLM</li>
<li><strong>Projection heads</strong>: Two MLPs projecting graph (300-dim) and text (768-dim) features to 256-dimensional shared space</li>
<li><strong>Optimizer</strong>: AdamW, learning rate 0.0001, weight decay 1e-5, 300 epochs, batch size 256</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Best Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>G-T Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.09 / 80.15 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>T-G Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.45 / 80.62 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>Zero-shot G-T Retrieval</td>
          <td>Accuracy</td>
          <td>~46%</td>
          <td>vs. ~1.4% for baselines</td>
      </tr>
      <tr>
          <td>Property Prediction</td>
          <td>ROC-AUC (avg)</td>
          <td>71.63%</td>
          <td>MoMu-S, 8 MoleculeNet datasets</td>
      </tr>
      <tr>
          <td>Molecule Captioning</td>
          <td>Text2Mol</td>
          <td>Improved over MolT5</td>
          <td>MoMu + MolT5-large</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x NVIDIA Tesla V100 PCIe 32GB GPUs</li>
<li>Framework: PyTorch</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BingSu12/MoMu">MoMu code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Pre-training and downstream task code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/yangzhao1230/GraphTextRetrieval">GraphTextRetrieval</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Data collection and cross-modal retrieval code</td>
      </tr>
      <tr>
          <td><a href="https://pan.baidu.com/s/1aHJoYTTZWDHPCcRuu9I7Fg">Pre-training dataset</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Hosted on Baidu Pan (Chinese cloud storage)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Su, B., Du, D., Yang, Z., Zhou, Y., Li, J., Rao, A., Sun, H., Lu, Z., &amp; Wen, J.-R. (2022). A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language. arXiv preprint arXiv:2209.05481.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{su2022momu,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Su, Bing and Du, Dazhao and Yang, Zhao and Zhou, Yujie and Li, Jiangmeng and Rao, Anyi and Sun, Hao and Lu, Zhiwu and Wen, Ji-Rong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2209.05481}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolFM: Trimodal Molecular Foundation Pre-training</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/molfm-multimodal-molecular-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/molfm-multimodal-molecular-foundation/</guid><description>MolFM fuses molecular graphs, biomedical text, and knowledge graphs via cross-modal attention for joint molecular representation learning.</description><content:encoded><![CDATA[<h2 id="trimodal-pre-training-for-molecular-understanding">Trimodal Pre-training for Molecular Understanding</h2>
<p>MolFM is a <strong>Method</strong> paper that introduces a multimodal molecular foundation model integrating three distinct sources of molecular knowledge: 2D molecular graphs, biomedical text, and knowledge graphs. The primary contribution is a pre-training framework that uses fine-grained cross-modal attention to fuse information across all three modalities, combined with theoretical justification from a deep metric learning perspective. MolFM achieves the best reported results (at time of publication) on cross-modal retrieval, molecule captioning, text-based molecule generation, and molecular property prediction.</p>
<h2 id="why-existing-molecular-models-fall-short">Why Existing Molecular Models Fall Short</h2>
<p>Prior multimodal molecular foundation models operate on at most two modalities (structures and text) and suffer from two key limitations. First, generative approaches like KV-PLM and MolT5 rely on 1D <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, which cannot capture complex topological and spatial molecular properties such as macrocycles. Contrastive approaches like <a href="/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/">MoMu</a> and MoleculeSTM learn global alignment between molecule graphs and text but overlook fine-grained connections between specific substructures and textual descriptions.</p>
<p>Second, and more fundamentally, no prior model incorporates <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a> as a third modality. Knowledge graphs encode global-level relationships among molecules, target ligands, diseases, and other biomedical entities. These relationships capture functional and structural similarity patterns that cannot be learned from individual molecule-text pairs alone. MolFM addresses both gaps by introducing cross-modal attention across all three modalities and providing theoretical guarantees about what the pre-training objectives learn.</p>
<h2 id="cross-modal-attention-and-metric-learning-guarantees">Cross-Modal Attention and Metric Learning Guarantees</h2>
<h3 id="architecture">Architecture</h3>
<p>MolFM uses three pre-trained single-modal encoders:</p>
<ul>
<li><strong>Molecular graph encoder</strong>: A 5-layer GIN (1.8M parameters) initialized from GraphMVP, producing atom-level features $h_{SA}$ and a graph-level feature $h_{SM}$</li>
<li><strong>Text encoder</strong>: A 6-layer transformer (61.8M parameters) initialized from KV-PLM&rsquo;s first 6 layers, producing token features $h_T$</li>
<li><strong>Knowledge graph encoder</strong>: A TransE model (12.6M parameters) trained on the knowledge graph for 500 epochs, producing entity features $h_K$</li>
</ul>
<p>A multimodal encoder (61.8M parameters, 6 transformer layers with cross-attention) fuses the three modalities. The cross-attention uses text token features as queries and the concatenation of atom features and knowledge graph neighbor features as keys and values. For each molecule, the knowledge graph input is the molecule&rsquo;s entity and $N=4$ randomly sampled one-hop neighbors.</p>
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>MolFM combines four losses:</p>
<p><strong>Structure-text contrastive (STC)</strong> aligns the global feature spaces of structure and text encoders using a symmetric InfoNCE loss:</p>
<p>$$\mathcal{L}_{stc} = -\frac{1}{2} \left[ \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{S' \in B} \exp(s(z_{S'}, z_T) / \tau)} + \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{T' \in B} \exp(s(z_S, z_{T'}) / \tau)} \right]$$</p>
<p>where $s(\cdot, \cdot)$ is cosine similarity and $\tau = 0.1$ is a temperature parameter.</p>
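<p>The symmetric structure of this loss is easy to see in code. Below is a minimal plain-Python sketch (not the paper&rsquo;s implementation); the embeddings and batch are toy values chosen for illustration:</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def stc_loss(z_s, z_t, i, tau=0.1):
    """Symmetric InfoNCE loss for the i-th structure-text pair.

    z_s, z_t: lists of structure / text embeddings forming the batch B.
    """
    pos = math.exp(cosine(z_s[i], z_t[i]) / tau)
    # structure-to-text direction: contrast against all structures in the batch
    denom_s = sum(math.exp(cosine(s, z_t[i]) / tau) for s in z_s)
    # text-to-structure direction: contrast against all texts in the batch
    denom_t = sum(math.exp(cosine(z_s[i], t) / tau) for t in z_t)
    return -0.5 * (math.log(pos / denom_s) + math.log(pos / denom_t))

# toy batch of two well-aligned structure/text embedding pairs
batch_s = [[1.0, 0.0], [0.0, 1.0]]
batch_t = [[0.9, 0.1], [0.1, 0.9]]
loss = stc_loss(batch_s, batch_t, 0)
```

With well-aligned pairs the positive dominates both denominators and the loss is close to zero; mismatched pairs push it up.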
<p><strong>Cross-modal matching (CMM)</strong> predicts whether a structure-text-knowledge triplet corresponds to the same molecule, using cross-entropy over the multimodal encoder&rsquo;s CLS token:</p>
<p>$$\mathcal{L}_{cmm} = \sum_{(\tilde{S}, \tilde{T}, \tilde{K}) \in \tilde{B}} H\left[y_{cmm}(\tilde{S}, \tilde{T}, \tilde{K}),\; p_{cmm}\left(\mathcal{M}_\theta(h_{\tilde{S}}, h_{\tilde{T}}, h_{\tilde{K}})\right)\right]$$</p>
<p><strong>Masked language modeling (MLM)</strong> predicts masked text tokens conditioned on all three modalities:</p>
<p>$$\mathcal{L}_{mlm} = H\left[y_{mlm}(\hat{T}),\; p_{mlm}\left(\mathcal{M}_\theta(h_S, h_{\hat{T}}, h_K)\right)\right]$$</p>
<p><strong>Knowledge graph embedding (KGE)</strong> regularizes entity embeddings with a max-margin TransE loss:</p>
<p>$$\mathcal{L}_{kge} = \sum_{h \in K} \left[\max(0, d(h,r,t) - d(h,r,\tilde{t}) + \Delta) + \max(0, d(h,r,t) - d(\tilde{h},r,t) + \Delta)\right]$$</p>
<p>where $d(h,r,t) = \| f(h) + g(r) - f(t) \|_2$ and $\Delta = 0.2$.</p>
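<p>The TransE-style margin loss can be sketched directly from this definition. The following toy implementation (illustrative only; embedding values are made up) scores one triplet against a corrupted head and a corrupted tail:</p>

```python
import math

def dist(f_h, g_r, f_t):
    """TransE score: the L2 distance ||f(h) + g(r) - f(t)||_2."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(f_h, g_r, f_t)))

def kge_triplet_loss(f_h, g_r, f_t, f_t_neg, f_h_neg, margin=0.2):
    """Max-margin loss for one (h, r, t) triplet with a corrupted tail and head."""
    pos = dist(f_h, g_r, f_t)
    loss_tail = max(0.0, pos - dist(f_h, g_r, f_t_neg) + margin)
    loss_head = max(0.0, pos - dist(f_h_neg, g_r, f_t) + margin)
    return loss_tail + loss_head

# toy 2-d embeddings: h + r lands near t and far from the corrupted entities,
# so both hinge terms are satisfied and the loss vanishes
h, r, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.05]
t_neg, h_neg = [3.0, 3.0], [-2.0, 1.0]
loss = kge_triplet_loss(h, r, t, t_neg, h_neg)

# a corrupted tail near the true tail violates the margin and incurs loss
loss_violating = kge_triplet_loss(h, r, t, [1.1, 0.0], h_neg)
```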
<p>The total pre-training loss is:</p>
<p>$$\mathcal{L} = \mathbb{E}_{(S,T,K)}\left[\mathcal{L}_{stc} + \mathcal{L}_{cmm} + \mathcal{L}_{mlm} + \mathcal{L}_{kge}\right]$$</p>
<h3 id="theoretical-justifications">Theoretical Justifications</h3>
<p>The authors provide metric learning interpretations for each objective. For CMM, they show that the loss is proportional to assigning higher scores to matched triplets and lower scores to unmatched ones, aligning the feature space across all three modalities.</p>
<p>For KGE, two lemmas provide guarantees about structurally and functionally similar molecules:</p>
<p><strong>Lemma 1</strong> (Structural similarity): For a symmetric structural-similarity relation $r_s$, the KGE loss satisfies:</p>
<p>$$\mathcal{L}_{kge}(h, r_s, t) \propto 2\|f(h) - f(t)\| - \mathbb{E}_{\tilde{t}}\|f(h) - f(\tilde{t})\| - \mathbb{E}_{\tilde{h}}\|f(\tilde{h}) - f(t)\|$$</p>
<p>This shows KGE pulls structurally similar molecules closer while pushing dissimilar ones apart.</p>
<p><strong>Lemma 2</strong> (Functional similarity): For molecules $h$ and $t$ that interact with a common entity $o$, the distance between their embeddings is upper-bounded:</p>
<p>$$\|f(h) - f(t)\| \leq \alpha \, \mathbb{E}_{(e_1, r, e_2) \sim \mathcal{I}}\left[\mathcal{L}_{kge}(e_1, r, e_2)\right] + C$$</p>
<p>where $\alpha \approx 1$ and $C \approx 0$. This guarantees that minimizing KGE also brings functionally similar molecules closer in the embedding space.</p>
<h2 id="experiments-across-four-downstream-tasks">Experiments Across Four Downstream Tasks</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>MolFM pre-trains on 15K molecules from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> paired with 37M paragraphs from S2ORC. The knowledge graph contains 49K entities and 3.2M relations, constructed from <a href="https://en.wikipedia.org/wiki/DrugBank">DrugBank</a>, <a href="https://en.wikipedia.org/wiki/BindingDB">BindingDB</a>, and additional public databases with heuristic augmentation.</p>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>Evaluated on PCdes (paragraph-level) in zero-shot and fine-tuning settings. MolFM uses a re-ranking strategy that linearly combines cosine similarity with CMM logits over the top-$k$ retrieved candidates.</p>
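<p>The re-ranking idea is: rank all candidates with the cheap cosine similarity, then re-score only the top-$k$ with the more expensive CMM head. A toy sketch (the candidate names, scores, and mixing weight <code>alpha</code> are all illustrative, not the paper&rsquo;s values):</p>

```python
def rerank(cos_scores, cmm_logits, k=4, alpha=0.5):
    """Take top-k candidates by cosine similarity, then sort them by a
    linear combination of cosine score and cross-modal matching (CMM) logit.

    cos_scores: dict candidate_id -> cosine similarity (computed for all)
    cmm_logits: dict candidate_id -> CMM match logit (computed for top-k only)
    """
    top_k = sorted(cos_scores, key=cos_scores.get, reverse=True)[:k]
    combined = {c: alpha * cos_scores[c] + (1 - alpha) * cmm_logits[c] for c in top_k}
    return sorted(top_k, key=combined.get, reverse=True)

cos_scores = {"mol_a": 0.91, "mol_b": 0.88, "mol_c": 0.85, "mol_d": 0.40, "mol_e": 0.10}
cmm_logits = {"mol_a": 0.2, "mol_b": 2.5, "mol_c": 0.1, "mol_d": 3.0, "mol_e": 0.0}
ranking = rerank(cos_scores, cmm_logits)
```

Note that a strong CMM logit can promote a candidate past cosine-similarity leaders, but only candidates surviving the top-$k$ cut are ever re-scored.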
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>Model</th>
          <th>S-T MRR</th>
          <th>S-T R@1</th>
          <th>S-T R@10</th>
          <th>T-S MRR</th>
          <th>T-S R@1</th>
          <th>T-S R@10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Zero-shot</td>
          <td>MoMu</td>
          <td>9.89</td>
          <td>5.08</td>
          <td>18.93</td>
          <td>10.33</td>
          <td>4.90</td>
          <td>20.69</td>
      </tr>
      <tr>
          <td>Zero-shot</td>
          <td>MolFM</td>
          <td>21.42</td>
          <td>13.90</td>
          <td>36.21</td>
          <td>23.63</td>
          <td>16.14</td>
          <td>39.54</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MoMu</td>
          <td>34.29</td>
          <td>24.47</td>
          <td>53.84</td>
          <td>34.53</td>
          <td>24.87</td>
          <td>54.25</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MolFM</td>
          <td>39.56</td>
          <td>29.76</td>
          <td>58.63</td>
          <td>39.34</td>
          <td>29.39</td>
          <td>58.49</td>
      </tr>
  </tbody>
</table>
<p>MolFM achieves 12.13% and 5.04% absolute gains over MoMu under zero-shot and fine-tuning settings, respectively.</p>
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>Evaluated on ChEBI-20 using MolT5 decoders. MolFM&rsquo;s structure encoder features are concatenated with the MolT5 encoder outputs.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>BLEU-4</th>
          <th>ROUGE-L</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.457</td>
          <td>0.578</td>
          <td>0.569</td>
          <td>0.547</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.462</td>
          <td>0.575</td>
          <td>0.576</td>
          <td>0.558</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>GraphMVP</td>
          <td>0.491</td>
          <td>0.592</td>
          <td>0.599</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.498</td>
          <td>0.594</td>
          <td>0.607</td>
          <td>0.576</td>
      </tr>
  </tbody>
</table>
<h3 id="text-based-molecule-generation">Text-Based Molecule Generation</h3>
<p>Also on ChEBI-20 with MolT5 decoders. MolFM&rsquo;s text features are projected and fed to the decoder.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>Exact</th>
          <th>Valid</th>
          <th>Morgan FTS</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.082</td>
          <td>0.786</td>
          <td>0.601</td>
          <td>0.543</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.183</td>
          <td>0.863</td>
          <td>0.678</td>
          <td>0.580</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.210</td>
          <td>0.892</td>
          <td>0.697</td>
          <td>0.583</td>
      </tr>
  </tbody>
</table>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (8 classification datasets), MolFM concatenates the structure feature and the multimodal encoder&rsquo;s CLS feature to predict properties. Five representative datasets are shown below; the Avg column is computed over all eight.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>Avg</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GraphMVP</td>
          <td>72.4</td>
          <td>74.4</td>
          <td>77.5</td>
          <td>77.0</td>
          <td>81.2</td>
          <td>73.07</td>
      </tr>
      <tr>
          <td>DeepEIK</td>
          <td>72.1</td>
          <td>72.4</td>
          <td>89.7</td>
          <td>75.0</td>
          <td>80.5</td>
          <td>73.27</td>
      </tr>
      <tr>
          <td>MolFM (w/o T+K)</td>
          <td>72.2</td>
          <td>76.6</td>
          <td>78.6</td>
          <td>78.2</td>
          <td>82.6</td>
          <td>73.95</td>
      </tr>
      <tr>
          <td>MolFM (w/ T+K)</td>
          <td>72.9</td>
          <td>77.2</td>
          <td>79.7</td>
          <td>78.8</td>
          <td>83.9</td>
          <td>74.62</td>
      </tr>
  </tbody>
</table>
<p>With multimodal inputs, MolFM averages 74.62% ROC-AUC, a 1.55% absolute gain over GraphMVP.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Zero-shot retrieval ablations reveal that cross-modal attention to atoms and CMM are the most critical components. Removing either causes a sharp drop (approximately 3% on S-T retrieval). Knowledge graph incorporation yields a 1.5% average improvement, with both attention to neighbors and KGE contributing marginally.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>MolFM demonstrates that incorporating knowledge graphs as a third modality provides consistent improvements across all evaluated tasks. The theoretical analysis connecting pre-training objectives to deep metric learning provides interpretability for why the model works: STC and CMM align representations of the same molecule across modalities, while KGE pulls structurally and functionally similar molecules closer in the embedding space.</p>
<p>The cross-modal attention visualizations show that MolFM learns to associate specific atom substructures with relevant text tokens and knowledge graph entities. For example, the model correctly attends to functional groups mentioned in textual descriptions.</p>
<p>The authors acknowledge several limitations:</p>
<ol>
<li><strong>Data quality</strong>: The pre-training dataset (15K molecules) is small and may introduce biases</li>
<li><strong>Cold-start problem</strong>: MolFM provides limited benefit for newly emerged molecules lacking text and knowledge graph information</li>
<li><strong>Entity scope</strong>: The model focuses on molecules and does not incorporate proteins, genes, or cell lines, which could further improve biomedical understanding</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (molecules)</td>
          <td>PubChem</td>
          <td>15K molecules</td>
          <td>Follows MoMu&rsquo;s pre-training data</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>S2ORC</td>
          <td>37M paragraphs</td>
          <td>Biomedical literature paragraphs</td>
      </tr>
      <tr>
          <td>Knowledge graph</td>
          <td>DrugBank, BindingDB, public DBs</td>
          <td>49K entities, 3.2M relations</td>
          <td>Constructed with heuristics from MoCL</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>Paragraph-level</td>
          <td>Test split</td>
      </tr>
      <tr>
          <td>Captioning/Generation</td>
          <td>ChEBI-20</td>
          <td>-</td>
          <td>Following MolT5 splits</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet</td>
          <td>8 datasets</td>
          <td>Classification tasks, ROC-AUC metric</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: AdamW with weight decay $1 \times 10^{-4}$</li>
<li>Learning rate: linear warmup to $1 \times 10^{-4}$ over 2,000 iterations, cosine annealing to $1 \times 10^{-5}$</li>
<li>Batch size: 128</li>
<li>Pre-training epochs: 300</li>
<li>Knowledge graph neighbors per molecule: $N = 4$</li>
<li>Temperature: $\tau = 0.1$</li>
<li>Margin: $\Delta = 0.2$</li>
</ul>
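<p>The learning-rate schedule above (linear warmup, then cosine annealing) can be sketched as a plain function of the training step. This is an illustrative reconstruction from the listed hyperparameters, not the authors&rsquo; code, and the total step count here is arbitrary:</p>

```python
import math

def lr_at(step, total_steps, warmup=2000, peak=1e-4, floor=1e-5):
    """Linear warmup to `peak` over `warmup` iterations, then cosine
    annealing down to `floor` over the remaining steps."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

total = 100_000  # arbitrary total step count for illustration
mid_warmup = lr_at(1_000, total)   # halfway through warmup
peak_lr = lr_at(2_000, total)      # end of warmup
final_lr = lr_at(total, total)     # end of training
```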
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Initialization</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph encoder</td>
          <td>5-layer GIN</td>
          <td>1.8M</td>
          <td>GraphMVP</td>
      </tr>
      <tr>
          <td>Text encoder</td>
          <td>6-layer Transformer</td>
          <td>61.8M</td>
          <td>KV-PLM (first 6 layers)</td>
      </tr>
      <tr>
          <td>Knowledge encoder</td>
          <td>TransE</td>
          <td>12.6M</td>
          <td>Trained 500 epochs on KG</td>
      </tr>
      <tr>
          <td>Multimodal encoder</td>
          <td>6-layer Transformer + cross-attention</td>
          <td>61.8M</td>
          <td>KV-PLM (last 6 layers)</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>~138M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>MRR, Recall@1/5/10</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>BLEU-2/4, ROUGE-1/2/L, METEOR, Text2Mol</td>
      </tr>
      <tr>
          <td>Text-to-molecule generation</td>
          <td>BLEU, Exact ratio, Validity, Levenshtein, Fingerprint Tanimoto (MACCS/RDKit/Morgan), Text2Mol</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>ROC-AUC per dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 NVIDIA A100 GPUs for pre-training</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BioFM/OpenBioMed">OpenBioMed</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation including MolFM</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Yang, K., Hong, M., Liu, X. Y., &amp; Nie, Z. (2023). MolFM: A Multimodal Molecular Foundation Model. <em>arXiv preprint arXiv:2307.09484</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2023molfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolFM: A Multimodal Molecular Foundation Model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Yang, Kai and Hong, Massimo and Liu, Xing Yi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2307.09484}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BioT5: Cross-Modal Integration of Biology and Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/</guid><description>BioT5 is a T5-based pretraining framework that jointly models molecules, proteins, and natural language using SELFIES for robust molecular generation.</description><content:encoded><![CDATA[<h2 id="a-unified-pretraining-framework-for-molecules-proteins-and-text">A Unified Pretraining Framework for Molecules, Proteins, and Text</h2>
<p>BioT5 is a <strong>Method</strong> paper that introduces a comprehensive <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-based pretraining framework for cross-modal integration of molecules, proteins, and natural language. The primary contribution is a multi-task pretraining approach that uses <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (instead of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) for 100% valid molecular representations, separate tokenization for each modality, and a combination of masked language modeling and translation objectives to connect structured biological data with unstructured scientific text. After fine-tuning, BioT5 (252M parameters) achieves state-of-the-art performance on 10 out of 15 downstream tasks spanning molecule property prediction, protein property prediction, drug-target interaction, protein-protein interaction, molecule captioning, and text-based molecule generation.</p>
<h2 id="bridging-the-gap-between-molecular-sequences-and-scientific-knowledge">Bridging the Gap Between Molecular Sequences and Scientific Knowledge</h2>
<p>Prior cross-modal models in computational biology face three recurring challenges. First, models like MolT5 and MolXPT rely on SMILES to represent molecules, but SMILES strings are syntactically fragile: random perturbations or model-generated sequences frequently produce invalid molecular structures. Edwards et al. (2022) and Li et al. (2023) both highlight this validity problem as a bottleneck for text-to-molecule generation. Second, the contextual information surrounding molecular and protein names in scientific literature (e.g., mentions in <a href="https://en.wikipedia.org/wiki/PubMed">PubMed</a> abstracts that describe properties, interactions, and experimental results) remains underutilized. Most models either ignore this context or treat it identically to structured database entries. Third, existing approaches like MolT5 and <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a> share a single tokenizer and embedding space across molecules, proteins, and text. This leads to chemically incorrect tokenization: the bromine atom &ldquo;Br&rdquo; in SMILES gets split into &ldquo;B&rdquo; (boron) and &ldquo;r&rdquo;, producing erroneous downstream predictions.</p>
<p>BioT5 addresses all three issues simultaneously by adopting SELFIES for molecular representation, extracting entity-linked contextual knowledge from PubMed, and employing separate vocabularies for each modality.</p>
<h2 id="selfies-separate-tokenization-and-multi-task-pretraining">SELFIES, Separate Tokenization, and Multi-Task Pretraining</h2>
<p>The core innovations of BioT5 center on three design decisions:</p>
<h3 id="selfies-for-robust-molecular-representation">SELFIES for Robust Molecular Representation</h3>
<p>BioT5 replaces SMILES with SELFIES (Self-referencing Embedded Strings) for all molecular representations. Every permutation of symbols within the SELFIES alphabet generates a chemically valid molecular structure, guaranteeing 100% validity in generation tasks. Molecules from ZINC20 are converted from SMILES to SELFIES during data preprocessing.</p>
<h3 id="modality-specific-tokenization">Modality-Specific Tokenization</h3>
<p>Rather than sharing a single SentencePiece vocabulary across modalities, BioT5 maintains three separate dictionaries:</p>
<ul>
<li><strong>Molecules</strong>: Each SELFIES token corresponds to a chemically meaningful atom group enclosed in brackets (e.g., <code>[C]</code>, <code>[=C]</code>, <code>[Br]</code>).</li>
<li><strong>Proteins</strong>: Amino acids are prefixed with a special <code>&lt;p&gt;</code> token to distinguish them from text characters (e.g., <code>&lt;p&gt;M</code>, <code>&lt;p&gt;K</code>, <code>&lt;p&gt;R</code>).</li>
<li><strong>Text</strong>: The standard T5 vocabulary is retained.</li>
</ul>
<p>This prevents semantic conflation across modalities. The total vocabulary size is 35,073, and the model comprises 252M parameters using the T5-v1.1-base architecture.</p>
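<p>The two non-text tokenizers are simple enough to sketch directly. The following toy functions (illustrative only, not BioT5&rsquo;s actual tokenizer code) show the bracketed SELFIES split and the <code>&lt;p&gt;</code>-prefixed amino-acid scheme described above:</p>

```python
import re

def tokenize_selfies(s):
    """Split a SELFIES string into bracketed atom-group tokens, e.g. [C][=C][Br].
    Each bracket group is one vocabulary entry, so 'Br' is never split into
    'B' and 'r'."""
    return re.findall(r"\[[^\]]*\]", s)

def tokenize_protein(fasta):
    """Prefix each amino acid with <p> so protein characters never collide
    with ordinary text tokens."""
    return [f"<p>{aa}" for aa in fasta]

selfies_tokens = tokenize_selfies("[C][=C][Br]")
protein_tokens = tokenize_protein("MKR")
```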
<h3 id="multi-task-pretraining-objectives">Multi-Task Pretraining Objectives</h3>
<p>BioT5 uses six pretraining tasks organized into three categories:</p>
<ol>
<li><strong>Single-modal T5 objective</strong>: Standard span corruption and recovery applied independently to molecule SELFIES (task 1), protein <a href="https://en.wikipedia.org/wiki/FASTA_format">FASTA</a> (task 2), and general text from C4 (task 3).</li>
<li><strong>Wrapped text T5 objective</strong> (task 4): Applied to PubMed articles where molecular names are replaced with corresponding SELFIES strings and gene names are appended with protein FASTA sequences, using BERN2 for named entity recognition and entity linking.</li>
<li><strong>Bidirectional translation</strong> (tasks 5 and 6): Molecule SELFIES to text description and vice versa (using 339K pairs from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>), and protein FASTA to text description and vice versa (using 569K pairs from <a href="https://en.wikipedia.org/wiki/UniProt">Swiss-Prot</a>).</li>
</ol>
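<p>The wrapped-text construction in task 4 amounts to substituting linked entity mentions in a sentence. A minimal sketch (entity spans are supplied directly here rather than produced by BERN2, and the SELFIES and FASTA strings are placeholders, not real molecules or proteins):</p>

```python
def wrap_text(sentence, mol_selfies, gene_fasta):
    """Rewrite a sentence for the wrapped-text objective: molecule names are
    replaced by their SELFIES string, and gene/protein names get their FASTA
    sequence appended after the name."""
    for name, selfies in mol_selfies.items():
        sentence = sentence.replace(name, selfies)
    for name, fasta in gene_fasta.items():
        sentence = sentence.replace(name, f"{name} {fasta}")
    return sentence

text = "Aspirin inhibits COX1 activity."
wrapped = wrap_text(
    text,
    mol_selfies={"Aspirin": "[C][C][=O]"},  # placeholder, not aspirin's real SELFIES
    gene_fasta={"COX1": "<p>M<p>S<p>R"},    # placeholder short sequence
)
```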
<p>The translation direction is randomly sampled with probability 0.5 for each example. For downstream tasks, BioT5 uses prompt-based fine-tuning to cast all tasks into a sequence generation format, reducing the gap between pretraining and fine-tuning.</p>
<h2 id="evaluation-across-15-downstream-tasks">Evaluation Across 15 Downstream Tasks</h2>
<p>BioT5 is evaluated on 15 tasks organized into three categories: single-instance prediction, multi-instance prediction, and cross-modal generation.</p>
<h3 id="molecule-property-prediction-moleculenet">Molecule Property Prediction (MoleculeNet)</h3>
<p>BioT5 is evaluated on six binary classification tasks from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> using scaffold splitting: BBBP, Tox21, ClinTox, HIV, BACE, and SIDER. Results are averaged over three random runs.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GEM</th>
          <th>MolXPT</th>
          <th>BioT5</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>72.4</td>
          <td>80.0</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>78.1</td>
          <td>77.1</td>
          <td>77.9</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>90.1</td>
          <td>95.3</td>
          <td>95.4</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>78.1</td>
          <td><strong>81.0</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>85.6</td>
          <td>88.4</td>
          <td><strong>89.4</strong></td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>67.2</td>
          <td>71.7</td>
          <td><strong>73.2</strong></td>
      </tr>
      <tr>
          <td><strong>Avg</strong></td>
          <td>79.0</td>
          <td>81.9</td>
          <td><strong>82.4</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best average AUROC (82.4) across all six datasets, surpassing both GNN-based methods (GEM) and language model baselines (MolXPT).</p>
<h3 id="protein-property-prediction-peer-benchmark">Protein Property Prediction (PEER Benchmark)</h3>
<p>On the PEER benchmark, BioT5 is evaluated on protein solubility and subcellular localization prediction:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>Solubility (Acc)</th>
          <th>Localization (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESM-1b</td>
          <td>652.4M</td>
          <td>70.23</td>
          <td><strong>92.40</strong></td>
      </tr>
      <tr>
          <td>ProtBert</td>
          <td>419.9M</td>
          <td>68.15</td>
          <td>91.32</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252.1M</td>
          <td><strong>74.65</strong></td>
          <td>91.69</td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best solubility prediction accuracy (74.65%) despite being 2-3x smaller than dedicated protein language models like ESM-1b and ProtBert.</p>
<h3 id="drug-target-interaction-prediction">Drug-Target Interaction Prediction</h3>
<p>BioT5 is evaluated on three DTI datasets (BioSNAP, Human, BindingDB) with five random runs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BioSNAP AUROC</th>
          <th>Human AUROC</th>
          <th>BindingDB AUROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugBAN</td>
          <td>0.903</td>
          <td>0.982</td>
          <td>0.960</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.989</strong></td>
          <td><strong>0.963</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 consistently outperforms DrugBAN and other specialized DTI models across all three datasets.</p>
<h3 id="molecule-captioning-and-text-based-molecule-generation">Molecule Captioning and Text-Based Molecule Generation</h3>
<p>On the ChEBI-20 dataset, BioT5 outperforms all baselines in molecule captioning:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>BLEU-4</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-large</td>
          <td>783M</td>
          <td>0.508</td>
          <td>0.614</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>MolXPT</td>
          <td>350M</td>
          <td>0.505</td>
          <td>0.626</td>
          <td>0.594</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td><strong>0.556</strong></td>
          <td><strong>0.656</strong></td>
          <td><strong>0.603</strong></td>
      </tr>
  </tbody>
</table>
<p>For text-based molecule generation, BioT5 achieves an exact match score of 0.413 (vs. 0.311 for MolT5-large) while maintaining 100% validity, compared to 90.5% for MolT5-large. This demonstrates the direct benefit of SELFIES: every generated sequence is a valid molecule.</p>
<h3 id="protein-protein-interaction-prediction">Protein-Protein Interaction Prediction</h3>
<p>On the PEER PPI benchmarks (Yeast and Human), BioT5 achieves competitive results, outperforming fully fine-tuned ProtBert and ESM-1b on the Yeast dataset (64.89% vs. 63.72% for ProtBert) and placing second on Human (86.22% vs. 88.06% for ESM-1b with frozen weights).</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BioT5 demonstrates that integrating molecular, protein, and textual modalities within a single pretraining framework yields consistent improvements across diverse biological tasks. Three factors drive BioT5&rsquo;s performance: (1) SELFIES guarantees 100% molecular validity in generation tasks, eliminating a persistent failure mode of SMILES-based models; (2) separate tokenization preserves the semantic integrity of each modality; (3) wrapped text pretraining on PubMed provides contextual biological knowledge that pure sequence models miss.</p>
<p>The authors acknowledge several limitations. BioT5 requires full-parameter fine-tuning for each downstream task because instruction-tuning does not generalize across tasks, and combining datasets via instructions causes data leakage (the authors note overlaps between BindingDB training data and BioSNAP/Human test sets). The model only handles sequence-format bio-entities and does not incorporate 2D or 3D structural information. Additional biological modalities such as DNA/RNA sequences and cell-level data are also left for future work.</p>
<p>The authors also note risks: BioT5 could potentially be misused to generate dangerous molecules, and it may fail to generate effective therapeutic molecules or produce compounds with adverse side effects.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (molecules)</td>
          <td>ZINC20</td>
          <td>~300M molecules</td>
          <td>Converted from SMILES to SELFIES</td>
      </tr>
      <tr>
          <td>Pretraining (proteins)</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniRef50</a></td>
          <td>27M proteins</td>
          <td>Filtered by length</td>
      </tr>
      <tr>
          <td>Pretraining (text)</td>
          <td>C4</td>
          <td>Large</td>
          <td>Standard T5 corpus</td>
      </tr>
      <tr>
          <td>Pretraining (wrapped text)</td>
          <td>PubMed</td>
          <td>33M articles</td>
          <td>Entity linking via BERN2</td>
      </tr>
      <tr>
          <td>Pretraining (molecule-text pairs)</td>
          <td>PubChem</td>
          <td>339K pairs</td>
          <td>Excludes ChEBI-20 molecules</td>
      </tr>
      <tr>
          <td>Pretraining (protein-text pairs)</td>
          <td>Swiss-Prot</td>
          <td>569K pairs</td>
          <td>High-quality annotations</td>
      </tr>
      <tr>
          <td>Evaluation (molecular properties)</td>
          <td>MoleculeNet</td>
          <td>6 datasets</td>
          <td>Scaffold splitting</td>
      </tr>
      <tr>
          <td>Evaluation (protein properties)</td>
          <td>PEER</td>
          <td>2 tasks</td>
          <td>Solubility and localization</td>
      </tr>
      <tr>
          <td>Evaluation (DTI)</td>
          <td>BioSNAP, Human, BindingDB</td>
          <td>3 datasets</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Evaluation (PPI)</td>
          <td>Yeast, Human</td>
          <td>2 datasets</td>
          <td>From PEER benchmark</td>
      </tr>
      <tr>
          <td>Evaluation (generation)</td>
          <td>ChEBI-20</td>
          <td>33K pairs</td>
          <td>Molecule captioning and text-to-molecule</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5-v1.1-base (encoder-decoder transformer)</li>
<li>Optimizer: AdamW with RMS scaling</li>
<li>Learning rate: cosine annealing, base $1 \times 10^{-2}$, minimum $1 \times 10^{-5}$</li>
<li>Warmup steps: 10,000</li>
<li>Dropout: 0.0</li>
<li>Maximum input length: 512 tokens</li>
<li>Pretraining steps: 350K</li>
<li>Batch size: 96 per GPU (6 data types per batch)</li>
<li>Prompt-based fine-tuning for all downstream tasks</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Vocabulary Size</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td>35,073</td>
          <td>T5-v1.1-base</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecule property prediction: AUROC on 6 MoleculeNet tasks (scaffold split, 3 runs)</li>
<li>Protein property prediction: accuracy on PEER benchmark (3 runs)</li>
<li>Drug-target interaction: AUROC, AUPRC, accuracy on 3 DTI datasets (5 runs)</li>
<li>Protein-protein interaction: accuracy on 2 PPI datasets (3 runs)</li>
<li>Molecule captioning: BLEU, ROUGE, METEOR, Text2Mol on ChEBI-20</li>
<li>Text-based molecule generation: BLEU, exact match, fingerprint similarities, FCD, validity on ChEBI-20</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8x NVIDIA A100 80GB GPUs for pretraining</li>
<li>Codebase: nanoT5</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/QizhiPei/BioT5">BioT5 Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Pei, Q., Zhang, W., Zhu, J., Wu, K., Gao, K., Wu, L., Xia, Y., &amp; Yan, R. (2023). BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations. <em>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</em>, 1102-1123. <a href="https://doi.org/10.18653/v1/2023.emnlp-main.70">https://doi.org/10.18653/v1/2023.emnlp-main.70</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{pei2023biot5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1102--1123}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.emnlp-main.70}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mol2vec: Unsupervised ML with Chemical Intuition</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/</guid><description>Mol2vec applies Word2vec to Morgan substructures, learning dense vector representations of molecules that capture chemical similarity for property prediction.</description><content:encoded><![CDATA[<h2 id="word2vec-meets-cheminformatics">Word2vec Meets Cheminformatics</h2>
<p>Mol2vec is a <strong>Method</strong> paper that introduces an unsupervised approach for learning dense vector representations of molecular substructures. The core idea is a direct analogy to <a href="/notes/machine-learning/model-architectures/distributed-representations/">Word2vec</a> from natural language processing: molecular substructures (derived from the Morgan algorithm) are treated as &ldquo;words,&rdquo; and entire molecules are treated as &ldquo;sentences.&rdquo; By training on a large unlabeled corpus of 19.9 million compounds, Mol2vec produces embeddings where chemically related substructures occupy nearby regions of vector space. Compound-level vectors are then obtained by summing constituent substructure vectors, and these can serve as features for downstream supervised learning tasks.</p>
<h2 id="sparse-fingerprints-and-their-limitations">Sparse Fingerprints and Their Limitations</h2>
<p>Molecular fingerprints, particularly Morgan fingerprints (extended-connectivity fingerprints, ECFP), are among the most widely used molecular representations in cheminformatics. They perform well for similarity searching, virtual screening, and activity prediction. However, they suffer from several practical drawbacks:</p>
<ul>
<li><strong>High dimensionality and sparsity</strong>: Morgan fingerprints are typically hashed to fixed-length binary vectors (e.g., 2048 or 4096 bits), resulting in very sparse representations.</li>
<li><strong>Bit collisions</strong>: The hashing step can map distinct substructures to the same bit position, losing structural information.</li>
<li><strong>No learned relationships</strong>: Each bit is independent, so the representation does not encode any notion of chemical similarity between substructures.</li>
</ul>
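<p>The bit-collision problem can be illustrated with a toy folding step (pure Python; the integer identifiers below are made-up stand-ins, not real ECFP hashes):</p>

```python
# Sketch of hash-folding into a fixed-length fingerprint. Distinct
# substructure identifiers can land on the same bit position and
# become indistinguishable.

N_BITS = 2048

def fold_to_bits(identifiers, n_bits=N_BITS):
    """Fold integer substructure identifiers into a fixed-length bit vector."""
    bits = [0] * n_bits
    for ident in identifiers:
        bits[ident % n_bits] = 1  # distinct identifiers may set the same bit
    return bits

# Two hypothetical identifiers that collide after folding (they differ by 2048):
a, b = 847433, 849481
fp = fold_to_bits([a, b])
print(a % N_BITS == b % N_BITS)  # True -> collision
print(sum(fp))                   # 1 set bit: one substructure was lost
```

The example also shows the sparsity issue: even a molecule with dozens of substructures sets only a handful of bits out of 2048.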
<p>At the time of this work (2017), NLP techniques had started to appear in cheminformatics. The <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> method had been applied to Morgan fingerprints for compound-protein interaction prediction, and <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a> had been used for chemical topic modeling. The Word2vec concept had been adapted for protein sequences (ProtVec) but had not yet been applied to small molecules. Mol2vec fills this gap.</p>
<h2 id="from-substructure-identifiers-to-dense-embeddings">From Substructure Identifiers to Dense Embeddings</h2>
<p>The central insight of Mol2vec is that the Morgan algorithm already produces a natural &ldquo;vocabulary&rdquo; of molecular substructures, and the order in which these substructures appear in a molecule provides local context, analogous to word order in a sentence.</p>
<h3 id="corpus-construction">Corpus Construction</h3>
<p>The training corpus was assembled from <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> v15 and <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v23, merged and deduplicated, then filtered by molecular weight (12-600), heavy atom count (3-50), clogP (-5 to 7), and allowed elements (H, B, C, N, O, F, P, S, Cl, Br). This yielded 19.9 million compounds.</p>
<h3 id="sentence-generation">Sentence Generation</h3>
<p>For each molecule, the Morgan algorithm generates atom identifiers at radius 0 and radius 1. Each atom contributes two identifiers (one per radius), ordered according to the atom order in the canonical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>. This sequence of identifiers forms a &ldquo;sentence&rdquo; for Word2vec training.</p>
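<p>The sentence construction can be sketched without RDKit by assuming the per-atom identifiers are already computed (the integers below are hypothetical stand-ins for real Morgan identifiers):</p>

```python
# Sketch of Mol2vec "sentence" construction: each atom, taken in canonical
# SMILES order, contributes its radius-0 then radius-1 Morgan identifier.

def molecule_to_sentence(atom_identifiers):
    """atom_identifiers: list of (radius0_id, radius1_id) tuples per atom,
    already in canonical atom order."""
    sentence = []
    for r0, r1 in atom_identifiers:
        sentence.extend([str(r0), str(r1)])
    return sentence

# Hypothetical 3-atom molecule:
ids = [(2246728737, 3542456614), (864662311, 1510328189), (2246728737, 847433064)]
print(molecule_to_sentence(ids))
# -> ['2246728737', '3542456614', '864662311', '1510328189', '2246728737', '847433064']
```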
<h3 id="word2vec-training">Word2vec Training</h3>
<p>The model was trained using the gensim implementation of Word2vec. After evaluating both CBOW and Skip-gram architectures with window sizes of 5, 10, and 20, and embedding dimensions of 100 and 300, the best configuration was:</p>
<ul>
<li><strong>Architecture</strong>: Skip-gram</li>
<li><strong>Window size</strong>: 10</li>
<li><strong>Embedding dimension</strong>: 300</li>
</ul>
<p>Rare identifiers appearing fewer than 3 times in the corpus were replaced with a special &ldquo;UNSEEN&rdquo; token, which learns a near-zero vector. This allows the model to handle novel substructures at inference time.</p>
<h3 id="compound-vector-generation">Compound Vector Generation</h3>
<p>The final vector for a molecule is the sum of all its substructure vectors:</p>
<p>$$\mathbf{v}_{\text{mol}} = \sum_{i=1}^{N} \mathbf{v}_{s_i}$$</p>
<p>where $\mathbf{v}_{s_i}$ is the 300-dimensional embedding for the $i$-th substructure identifier in the molecule. Because repeated substructures contribute their vectors multiple times, this summation implicitly encodes substructure counts and their relative importance through the magnitude of the resulting vector.</p>
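<p>A minimal sketch of this summation, using hypothetical embeddings and the UNSEEN fallback for identifiers not seen during training:</p>

```python
# Sketch of compound-vector assembly: sum the embeddings of all substructure
# identifiers in the molecule's "sentence", falling back to UNSEEN for
# unknown tokens. The embeddings here are random stand-ins.
import numpy as np

DIM = 300
rng = np.random.default_rng(0)

embeddings = {
    "2246728737": rng.normal(size=DIM),
    "3542456614": rng.normal(size=DIM),
    "UNSEEN": np.zeros(DIM),  # the trained UNSEEN vector is near zero
}

def compound_vector(sentence, embeddings):
    vecs = [embeddings.get(tok, embeddings["UNSEEN"]) for tok in sentence]
    return np.sum(vecs, axis=0)

v = compound_vector(["2246728737", "3542456614", "999"], embeddings)
print(v.shape)  # (300,)
```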
<h2 id="benchmarking-across-regression-and-classification-tasks">Benchmarking Across Regression and Classification Tasks</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors evaluated Mol2vec on four datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>Regression</td>
          <td>1,144</td>
          <td>Aqueous solubility prediction</td>
      </tr>
      <tr>
          <td>Ames</td>
          <td>Classification</td>
          <td>6,511</td>
          <td><a href="https://en.wikipedia.org/wiki/Mutagen">Mutagenicity</a> (balanced: 3,481 positive, 2,990 negative)</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Classification</td>
          <td>8,192</td>
          <td>12 human toxicity targets (imbalanced)</td>
      </tr>
      <tr>
          <td>Kinase</td>
          <td>Classification</td>
          <td>284 kinases</td>
          <td>Bioactivity from ChEMBL v23</td>
      </tr>
  </tbody>
</table>
<h3 id="machine-learning-methods">Machine Learning Methods</h3>
<p>Three ML methods were compared using both Mol2vec and Morgan FP features:</p>
<ul>
<li><strong>Random Forest (RF)</strong>: scikit-learn, 500 estimators</li>
<li><strong>Gradient Boosting Machine (GBM)</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>Deep Neural Network (DNN)</strong>: Keras/TensorFlow, 4 hidden layers with 2000 neurons each for Mol2vec; 1 hidden layer with 512 neurons for Morgan FP</li>
</ul>
<p>All models were validated using 20x 5-fold cross-validation with the <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for statistical comparison.</p>
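<p>The Random Forest baseline configuration can be sketched with scikit-learn (hyperparameters as reported; the data here is random stand-in features, not the paper's datasets):</p>

```python
# Sketch of the RF baseline: 500 trees, sqrt feature sampling, balanced
# class weights, fit on fake 300-d "Mol2vec" feature vectors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))    # 200 stand-in compound vectors
y = rng.integers(0, 2, size=200)   # stand-in binary labels (e.g., Ames)

clf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",
    class_weight="balanced",
    random_state=0,
)
clf.fit(X, y)
print(clf.predict(X[:5]).shape)  # (5,)
```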
<h3 id="esol-regression-results">ESOL Regression Results</h3>
<table>
  <thead>
      <tr>
          <th>Features</th>
          <th>Method</th>
          <th>$R^2_{\text{ext}}$</th>
          <th>MSE</th>
          <th>MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>MLR</td>
          <td>0.81 +/- 0.01</td>
          <td>0.82</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>Molecular Graph</td>
          <td>CNN</td>
          <td>0.93</td>
          <td>0.31 +/- 0.03</td>
          <td>0.40 +/- 0.00</td>
      </tr>
      <tr>
          <td>Morgan FP</td>
          <td>GBM</td>
          <td>0.66 +/- 0.00</td>
          <td>1.43 +/- 0.00</td>
          <td>0.88 +/- 0.00</td>
      </tr>
      <tr>
          <td>Mol2vec</td>
          <td>GBM</td>
          <td>0.86 +/- 0.00</td>
          <td>0.62 +/- 0.00</td>
          <td>0.60 +/- 0.00</td>
      </tr>
  </tbody>
</table>
<p>Mol2vec substantially outperformed Morgan FP ($R^2_{\text{ext}}$ 0.86 vs. 0.66) but did not match the best graph convolution methods ($R^2_{\text{ext}}$ ~0.93).</p>
<h3 id="classification-results-ames-and-tox21">Classification Results (Ames and Tox21)</h3>
<p>On the Ames dataset, Mol2vec and Morgan FP performed comparably (AUC 0.87 vs. 0.88), both matching or exceeding prior SVM and Naive Bayes results. On Tox21, both achieved an average AUC of 0.83, outperforming literature results from graph convolution (0.71) and DNN/SVM approaches (0.71-0.72).</p>
<h3 id="proteochemometric-pcm-extension">Proteochemometric (PCM) Extension</h3>
<p>Mol2vec was combined with ProtVec (protein sequence embeddings using the same Word2vec approach on 3-grams) by concatenating vectors, forming PCM2vec. This was evaluated using a rigorous 4-level cross-validation scheme:</p>
<ul>
<li><strong>CV1</strong>: New compound-target pairs</li>
<li><strong>CV2</strong>: New targets</li>
<li><strong>CV3</strong>: New compounds</li>
<li><strong>CV4</strong>: New compounds and targets</li>
</ul>
<p>On Tox21, PCM2vec improved predictions for new compound-target pairs (CV1: AUC 0.87 vs. 0.79 for Morgan FP) and new compounds (CV3: AUC 0.85 vs. 0.78). On the kinase dataset, PCM2vec approached the performance of classical PCM (Morgan + z-scales) while being alignment-independent, meaning it can be applied to proteins with low sequence similarity.</p>
<h2 id="chemical-intuition-and-practical-value">Chemical Intuition and Practical Value</h2>
<h3 id="embedding-quality">Embedding Quality</h3>
<p>The learned substructure embeddings capture meaningful chemical relationships. Hierarchical clustering of the 25 most common substructures shows expected groupings: aromatic carbons cluster together, aliphatic ring carbons form a separate group, and carbonyl carbons and oxygens are closely related. Similarly, t-SNE projections of amino acid vectors encoded by Mol2vec reproduce known amino acid relationships (e.g., similar distances between Glu/Gln and Asp/Asn pairs, reflecting the carboxylic acid to amide transition).</p>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Skip-gram with 300-dimensional embeddings</strong> provides the best Mol2vec representations, consistent with NLP best practices.</li>
<li><strong>Mol2vec excels at regression tasks</strong>, substantially outperforming Morgan FP on ESOL solubility prediction ($R^2_{\text{ext}}$ 0.86 vs. 0.66).</li>
<li><strong>Classification performance is competitive</strong> with Morgan FP across Ames and Tox21 datasets.</li>
<li><strong>PCM2vec enables alignment-independent proteochemometrics</strong>, extending PCM approaches to diverse protein families with low sequence similarity.</li>
<li><strong>Tree-based methods (RF, GBM) outperformed DNNs</strong> on these tasks, though the authors note further DNN tuning could help.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The compound vector is a simple sum of substructure vectors, which discards information about substructure arrangement and molecular topology.</li>
<li>Only Morgan identifiers at radii 0 and 1 were used. Larger radii might capture more context but would increase vocabulary size.</li>
<li>DNN architectures were not extensively optimized, leaving open the question of how well Mol2vec pairs with deep learning.</li>
<li>The approach was benchmarked against Morgan FP but not against other learned representations such as graph neural networks in a controlled comparison.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC v15 + ChEMBL v23</td>
          <td>19.9M compounds</td>
          <td>Filtered by MW, atom count, clogP, element types</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,144 compounds</td>
          <td>Aqueous solubility regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Ames</td>
          <td>6,511 compounds</td>
          <td>Mutagenicity classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>8,192 compounds</td>
          <td>12 toxicity targets, retrieved via DeepChem</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Kinase (ChEMBL v23)</td>
          <td>284 kinases</td>
          <td>IC50/Kd/Ki binding assays</td>
      </tr>
      <tr>
          <td>Protein corpus</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a></td>
          <td>554,241 sequences</td>
          <td>For ProtVec training</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Word2vec</strong>: Skip-gram, window size 10, 300-dimensional embeddings, min count 3</li>
<li><strong>Morgan algorithm</strong>: Radii 0 and 1 (119 and 19,831 unique identifiers respectively)</li>
<li><strong>UNSEEN token</strong>: Replaces identifiers occurring fewer than 3 times</li>
<li><strong>Compound vector</strong>: Sum of all substructure vectors</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RF</strong>: scikit-learn, 500 estimators, sqrt features, balanced class weights</li>
<li><strong>GBM</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>DNN</strong>: Keras/TensorFlow, 4 layers x 2000 neurons (Mol2vec) or 1 layer x 512 neurons (Morgan FP), ReLU activation, dropout 0.1</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Mol2vec Best</th>
          <th>Morgan FP Best</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2_{\text{ext}}$</td>
          <td>0.86 (GBM)</td>
          <td>0.66 (GBM)</td>
          <td>ESOL regression</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.87 (RF)</td>
          <td>0.88 (RF)</td>
          <td>Ames classification</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.83 (RF)</td>
          <td>0.83 (RF)</td>
          <td>Tox21 classification</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/samoturk/mol2vec">mol2vec</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Python package with pre-trained model</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jaeger, S., Fulle, S., &amp; Turk, S. (2018). Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. <em>Journal of Chemical Information and Modeling</em>, 58(1), 27-35. <a href="https://doi.org/10.1021/acs.jcim.7b00616">https://doi.org/10.1021/acs.jcim.7b00616</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jaeger2018mol2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jaeger, Sabrina and Fulle, Simone and Turk, Samo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{58}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27--35}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.7b00616}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MG-BERT: Graph BERT for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/mg-bert-molecular-graph-bert/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/mg-bert-molecular-graph-bert/</guid><description>MG-BERT integrates graph neural network message passing into BERT with masked atom pretraining on 1.7M molecules for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-aware-bert-for-molecular-property-prediction">A Graph-Aware BERT for Molecular Property Prediction</h2>
<p>MG-BERT is a <strong>Method</strong> paper that adapts the BERT pretraining paradigm from NLP to molecular graphs. The primary contribution is a modified Transformer architecture that replaces global self-attention with bond-based local attention, allowing atoms to exchange information only through chemical bonds. This creates a deep message-passing network that avoids the oversmoothing problem of conventional graph neural networks (GNNs). Combined with a masked atom prediction pretraining strategy on 1.7 million unlabeled molecules from ChEMBL, MG-BERT learns context-sensitive atomic representations that transfer effectively to downstream property prediction tasks.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p><a href="/notes/chemistry/molecular-design/property-prediction/">Molecular property prediction</a> is central to drug discovery, particularly for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) endpoints. While deep learning has advanced many domains, molecular property prediction faces a persistent challenge: labeled data scarcity. ADMET measurements require expensive, time-consuming experiments, and typical datasets contain only hundreds to thousands of examples.</p>
<p>Prior approaches fall into three categories, each with limitations:</p>
<ol>
<li><strong>Feature engineering</strong> (molecular fingerprints, descriptors): Requires expert design, suffers from low scalability, and fixed representations cannot be optimized for specific tasks.</li>
<li><strong>SMILES-based deep learning</strong> (CNNs, LSTMs, Transformers on SMILES strings): Must learn to parse molecular information from complex string syntax, increasing learning difficulty. Autoencoder-based methods (e.g., <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a>) learn fixed representations that cannot be fine-tuned.</li>
<li><strong>Graph neural networks</strong> (GAT, GCN): Can learn directly from molecular topology, but are limited to 2-3 layers due to oversmoothing, restricting their capacity to capture deep-level patterns.</li>
</ol>
<p>The BERT model from NLP demonstrated that self-supervised pretraining on large unlabeled corpora followed by fine-tuning on small labeled datasets can substantially improve downstream performance. <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a> applied this idea to SMILES strings directly, but suffered from interpretability issues due to auxiliary characters in the SMILES syntax. MG-BERT addresses these limitations by operating directly on molecular graphs.</p>
<h2 id="bond-based-local-attention-and-masked-atom-pretraining">Bond-Based Local Attention and Masked Atom Pretraining</h2>
<p>The core innovation of MG-BERT has two components: a modified Transformer architecture for molecular graphs and a self-supervised pretraining strategy.</p>
<h3 id="architecture-modifications">Architecture Modifications</h3>
<p>The original BERT model uses three components: an embedding layer, Transformer encoder layers, and a task-specific output layer. MG-BERT makes three key modifications:</p>
<ol>
<li>
<p><strong>Atom embeddings replace word embeddings.</strong> The dictionary contains 16 tokens: 13 common atom types ([H], [C], [N], [O], [F], [S], [Cl], [P], [Br], [B], [I], [Si], [Se]), plus [UNK] for rare atoms, [MASK] for pretraining, and [GLOBAL] for graph-level readout.</p>
</li>
<li>
<p><strong>No positional encoding.</strong> Unlike sequential text, atoms in a molecular graph have no inherent ordering, so positional embeddings are removed.</p>
</li>
<li>
<p><strong>Local attention replaces global attention.</strong> The adjacency matrix of the molecular graph is used as a visibility matrix to modulate the attention scores. Each atom can only attend to atoms connected by chemical bonds. Formally, the attention is constrained so that:</p>
</li>
</ol>
<p>$$A'_{ij} = \begin{cases} A_{ij} &amp; \text{if bond exists between } i \text{ and } j \\ -\infty &amp; \text{otherwise} \end{cases}$$</p>
<p>where $A_{ij}$ is the standard scaled dot-product attention score. This local message passing makes MG-BERT a variant of GNN, but one that can stack many layers (6 in the medium configuration) without oversmoothing, thanks to the residual connections inherited from the Transformer architecture.</p>
<ol start="4">
<li><strong>Supernode for graph-level readout.</strong> A [GLOBAL] supernode is added to each molecular graph, connected to all atoms. This node aggregates information from the entire molecule and serves as the molecular representation for downstream prediction.</li>
</ol>
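<p>A numpy sketch of one masked-attention step, including a supernode connected to all atoms (toy sizes and random features; an illustration of the mechanism, not the authors' implementation):</p>

```python
# Bond-based local attention: the adjacency (visibility) matrix sets
# attention scores to -inf wherever no bond exists, so each atom only
# attends over its chemical bonds (plus itself and the supernode).
import numpy as np

def local_attention(Q, K, V, adj):
    """adj: (n, n) 0/1 visibility matrix, including self-loops."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(adj > 0, scores, -np.inf)  # mask non-bonded pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# 3-atom toy molecule (chain 0-1-2) plus a [GLOBAL] supernode at index 3.
adj = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],  # supernode sees (and is seen by) every atom
])
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))  # toy 8-d atom features
out = local_attention(H, H, H, adj)
print(out.shape)  # (4, 8)
```

Stacking several such layers with residual connections gives the deep, oversmoothing-resistant message passing described above.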
<h3 id="masked-atom-prediction">Masked Atom Prediction</h3>
<p>The pretraining strategy mirrors BERT&rsquo;s masked language model but operates on atoms:</p>
<ul>
<li>15% of atoms in each molecule are randomly selected (at least one atom per molecule)</li>
<li>Of selected atoms: 80% are replaced with [MASK], 10% are randomly replaced with another atom type, and 10% remain unchanged</li>
<li>The model is trained to predict the original atom type at masked positions</li>
<li>Loss is computed only at masked positions</li>
</ul>
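<p>The corruption scheme can be sketched in pure Python (the helper name and atom list are illustrative, not taken from the paper's code):</p>

```python
# Sketch of masked-atom corruption: select 15% of atoms (at least one),
# then mask 80% of selections, randomly replace 10%, and keep 10% unchanged.
import random

ATOM_TYPES = ["[H]", "[C]", "[N]", "[O]", "[F]", "[S]", "[Cl]", "[P]",
              "[Br]", "[B]", "[I]", "[Si]", "[Se]"]

def mask_atoms(atoms, rng, mask_rate=0.15):
    atoms = list(atoms)
    n_select = max(1, round(mask_rate * len(atoms)))
    targets = rng.sample(range(len(atoms)), n_select)
    for i in targets:
        roll = rng.random()
        if roll < 0.8:
            atoms[i] = "[MASK]"
        elif roll < 0.9:
            atoms[i] = rng.choice(ATOM_TYPES)  # random atom substitution
        # else: keep the original atom (the model must still predict it)
    return atoms, sorted(targets)  # loss is computed only at target positions

rng = random.Random(0)
corrupted, targets = mask_atoms(["[C]", "[C]", "[O]", "[N]", "[H]", "[H]"], rng)
print(len(targets))  # 1 (at least one atom per molecule is selected)
```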
<h3 id="model-configurations">Model Configurations</h3>
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MG-BERT Small</td>
          <td>3</td>
          <td>2</td>
          <td>128</td>
          <td>256</td>
          <td>95.27%</td>
      </tr>
      <tr>
          <td>MG-BERT Medium</td>
          <td>6</td>
          <td>4</td>
          <td>256</td>
          <td>512</td>
          <td>98.31%</td>
      </tr>
      <tr>
          <td>MG-BERT Large</td>
          <td>12</td>
          <td>8</td>
          <td>576</td>
          <td>1152</td>
          <td>98.35%</td>
      </tr>
  </tbody>
</table>
<p>The medium configuration was selected for all experiments because it achieved the best downstream performance, despite the large model having slightly higher pretraining recovery accuracy. The authors attribute this to overfitting risk with the larger model.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MG-BERT was pretrained on 1.7 million compounds randomly selected from ChEMBL, with 10% held out for evaluation (1.53M training molecules). Molecules were converted to 2D undirected graphs using RDKit, with hydrogen atoms explicitly included. The model was pretrained for 10 epochs using Adam with learning rate 1e-4 and batch size 256.</p>
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<p>Sixteen datasets covering ADMET endpoints and common molecular properties were collected from ADMETlab and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Dataset</th>
          <th>Category</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Regression</td>
          <td>Caco2</td>
          <td>Absorption</td>
          <td>979</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logD</td>
          <td>Physicochemical</td>
          <td>10,354</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logS</td>
          <td>Physicochemical</td>
          <td>5,045</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PPB</td>
          <td>Distribution</td>
          <td>1,480</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>tox</td>
          <td>Toxicity</td>
          <td>7,295</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>Physicochemical</td>
          <td>1,128</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>Physicochemical</td>
          <td>642</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipo</td>
          <td>Physicochemical</td>
          <td>4,200</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Ames</td>
          <td>Toxicity</td>
          <td>6,719</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBB</td>
          <td>Distribution</td>
          <td>1,855</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>FDAMDD</td>
          <td>Toxicity</td>
          <td>795</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>H_HT</td>
          <td>Toxicity</td>
          <td>2,170</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_inh</td>
          <td>Absorption</td>
          <td>2,125</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_sub</td>
          <td>Absorption</td>
          <td>1,210</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>Biophysics</td>
          <td>1,513</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
      </tr>
  </tbody>
</table>
<p>Datasets were split 8:1:1 (train:validation:test) with stratified sampling by SMILES length. Each experiment was repeated 10 times with random splits, reporting mean and standard deviation. Regression was evaluated by R-squared, classification by ROC-AUC. Early stopping with a maximum of 100 epochs was used.</p>
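<p>A minimal sketch of a length-stratified 8:1:1 split is shown below. The exact bucketing the authors used is not specified, so grouping by raw SMILES length is an assumption here, as is the <code>stratified_split</code> helper itself.</p>

```python
import random
from collections import defaultdict

def stratified_split(smiles_list, rng=random, ratios=(0.8, 0.1, 0.1)):
    """8:1:1 split, stratified so each SMILES-length bucket spans all splits."""
    buckets = defaultdict(list)
    for s in smiles_list:
        buckets[len(s)].append(s)
    train, val, test = [], [], []
    for _, group in sorted(buckets.items()):  # deterministic bucket order
        rng.shuffle(group)
        n = len(group)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

# Toy corpus: 1000 strings with lengths 1..20, 50 of each length.
mols = ["C" * (i % 20 + 1) for i in range(1000)]
train, val, test = stratified_split(mols, rng=random.Random(0))
```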
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ol>
<li><strong>ECFP4-XGBoost</strong>: Extended connectivity fingerprints (diameter 4) with gradient-boosted trees</li>
<li><strong>GAT</strong>: Graph Attention Network</li>
<li><strong>GCN</strong>: Graph Convolutional Network</li>
<li><strong>CDDD</strong>: Continuous and Data-Driven Descriptors (pretrained RNN encoder on SMILES with a fully connected network)</li>
<li><strong>SMILES-BERT</strong>: Original BERT applied directly to SMILES strings</li>
</ol>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Two ablation studies were conducted:</p>
<ol>
<li><strong>Pretraining effectiveness</strong>: Comparing pretrained vs. non-pretrained MG-BERT under identical hyperparameters</li>
<li><strong>Hydrogen atoms</strong>: Comparing MG-BERT with and without explicit hydrogen atoms in the molecular graph</li>
</ol>
<h2 id="consistent-improvements-across-admet-benchmarks">Consistent Improvements Across ADMET Benchmarks</h2>
<h3 id="main-results">Main Results</h3>
<p>MG-BERT consistently outperformed all baselines across all 16 datasets. Selected results (R2 in % for regression tasks, ROC-AUC in % for classification tasks):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ECFP4-XGBoost</th>
          <th>GAT</th>
          <th>GCN</th>
          <th>CDDD</th>
          <th>SMILES-BERT</th>
          <th>MG-BERT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caco2 (R2)</td>
          <td>61.41</td>
          <td>69.16</td>
          <td>67.15</td>
          <td>73.42</td>
          <td>72.39</td>
          <td><strong>74.68</strong></td>
      </tr>
      <tr>
          <td>logD (R2)</td>
          <td>70.84</td>
          <td>84.62</td>
          <td>86.22</td>
          <td>85.85</td>
          <td>86.31</td>
          <td><strong>87.46</strong></td>
      </tr>
      <tr>
          <td>logS (R2)</td>
          <td>73.73</td>
          <td>84.06</td>
          <td>83.47</td>
          <td>84.01</td>
          <td>85.20</td>
          <td><strong>87.66</strong></td>
      </tr>
      <tr>
          <td>PPB (R2)</td>
          <td>55.11</td>
          <td>59.96</td>
          <td>57.34</td>
          <td>54.12</td>
          <td>62.37</td>
          <td><strong>65.94</strong></td>
      </tr>
      <tr>
          <td>Ames (AUC)</td>
          <td>87.21</td>
          <td>86.38</td>
          <td>87.04</td>
          <td>86.82</td>
          <td>87.69</td>
          <td><strong>89.33</strong></td>
      </tr>
      <tr>
          <td>BBB (AUC)</td>
          <td>94.62</td>
          <td>93.03</td>
          <td>92.67</td>
          <td>94.44</td>
          <td>94.02</td>
          <td><strong>95.41</strong></td>
      </tr>
      <tr>
          <td>BBBP (AUC)</td>
          <td>89.16</td>
          <td>90.33</td>
          <td>90.74</td>
          <td>91.12</td>
          <td>91.32</td>
          <td><strong>92.08</strong></td>
      </tr>
  </tbody>
</table>
<p>The overall improvement across all datasets was 28.1% (7.02% on classification, 21.28% on regression). All improvements were statistically significant (paired t-test, P &lt;= 0.001), well beyond the 95% confidence threshold.</p>
<h3 id="pretraining-ablation">Pretraining Ablation</h3>
<p>Pretraining improved performance by more than 2% on all datasets. The benefit was largest for small datasets: Caco2 improved by approximately 10 percentage points (64.79 to 74.68 R2), and FDAMDD improved by about 7.5 points (80.76 to 88.23 AUC). This confirms that self-supervised pretraining effectively addresses the labeled data scarcity problem.</p>
<h3 id="hydrogen-atom-ablation">Hydrogen Atom Ablation</h3>
<p>Including explicit hydrogen atoms improved pretraining recovery accuracy from 92.25% to 98.31% and consistently improved downstream performance. The authors provide an intuitive explanation: hydrogen atoms help determine bond counts for neighboring atoms, which is critical for the masked atom recovery task. They also show that removing hydrogens can make structurally distinct molecules (e.g., benzene and cyclohexane) indistinguishable at the graph level.</p>
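<p>The benzene/cyclohexane point is easy to verify on toy graphs. In a representation carrying only atom types and unlabelled bonds (as in MG-BERT, which omits bond-type features), both molecules reduce to the same six-carbon ring once hydrogens are stripped; the encoding below is illustrative only.</p>

```python
def ring_with_hydrogens(n_carbons, h_per_carbon):
    """Carbon ring with h_per_carbon hydrogens attached to each carbon.
    Returns (atom list, set of undirected edges as index pairs)."""
    atoms = ["C"] * n_carbons
    edges = {frozenset((i, (i + 1) % n_carbons)) for i in range(n_carbons)}
    for i in range(n_carbons):
        for _ in range(h_per_carbon):
            atoms.append("H")
            edges.add(frozenset((i, len(atoms) - 1)))
    return atoms, edges

# Without hydrogens, both molecules collapse to the same 6-ring of carbons:
benzene_heavy = ring_with_hydrogens(6, 0)
cyclohexane_heavy = ring_with_hydrogens(6, 0)
# With explicit hydrogens (1 H vs 2 H per carbon), the graphs differ:
benzene = ring_with_hydrogens(6, 1)
cyclohexane = ring_with_hydrogens(6, 2)
```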
<h3 id="interpretability-via-attention-visualization">Interpretability via Attention Visualization</h3>
<p>The authors provide two forms of interpretability analysis:</p>
<ol>
<li>
<p><strong>t-SNE visualization of atomic representations</strong>: Pretrained atomic representations cluster by atom type and, more specifically, by local chemical environment (e.g., aromatic carbons separate from aliphatic carbons, C-N bonds from C-O bonds). This demonstrates that pretraining captures neighborhood context beyond simple atom identity.</p>
</li>
<li>
<p><strong>Attention weight visualization</strong>: On the logD task, the supernode&rsquo;s attention focuses on polar groups (which govern lipophilicity). On the Ames mutagenicity task, attention concentrates on known mutagenic structural alerts (acyl chloride, nitrosamide, and azide groups). This provides chemically meaningful explanations for predictions.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The paper does not extensively discuss limitations, but several can be identified:</p>
<ul>
<li>The model uses only 2D molecular topology (atom types and bonds) without 3D conformational information or bond-type features</li>
<li>The atom dictionary is limited to 13 common types plus [UNK], which may lose information for molecules containing rarer elements</li>
<li>Evaluation is limited to ADMET-focused datasets; broader chemical spaces (e.g., materials, catalysts) are not tested</li>
<li>The comparison baselines do not include other graph-based pretraining methods (e.g., the contemporaneous Strategies for Pre-training Graph Neural Networks by Hu et al.)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL (random subset)</td>
          <td>1.7M molecules (1.53M train)</td>
          <td>10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>16 datasets (642-10,354 molecules)</td>
          <td>8:1:1 splits, stratified by SMILES length</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam (pretraining: lr=1e-4, batch=256; fine-tuning: lr from {1e-5, 5e-5, 1e-4}, batch from {16, 32, 64})</li>
<li><strong>Pretraining epochs</strong>: 10</li>
<li><strong>Fine-tuning</strong>: Up to 100 epochs with early stopping</li>
<li><strong>Dropout</strong>: Optimized per task in range [0.0, 0.5]</li>
<li><strong>Masking</strong>: 15% of atoms (80% [MASK], 10% random, 10% unchanged)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: MG-BERT Medium (6 layers, 4 heads, embedding size 256, FFN size 512)</li>
<li><strong>Molecule processing</strong>: RDKit for graph conversion with explicit hydrogens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>R-squared (R2)</td>
          <td>Regression</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Accuracy, RMSE</td>
          <td>Both</td>
          <td>Reported in supplementary Table S1</td>
      </tr>
  </tbody>
</table>
<p>All results averaged over 10 random splits with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements (GPU type, training time, or memory usage).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/Molecular-graph-BERT">Molecular-graph-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation; last code push August 2021</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yang, Z.-J., Wu, Z.-X., Yi, J.-C., Hsieh, C.-Y., Hou, T.-J., &amp; Cao, D.-S. (2021). MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. <em>Briefings in Bioinformatics</em>, 22(6), bbab152. <a href="https://doi.org/10.1093/bib/bbab152">https://doi.org/10.1093/bib/bbab152</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2021mgbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{MG-BERT}: leveraging unsupervised atomic representation learning for molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yang, Zhi-Jiang and Wu, Zhen-Xing and Yi, Jia-Cai and Hsieh, Chang-Yu and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbab152}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbab152}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DMP: Dual-View Molecule Pre-training (SMILES+GNN)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/</guid><description>DMP pre-trains molecular encoders using both SMILES Transformer and GNN branches with a BYOL-style dual-view consistency loss for property prediction.</description><content:encoded><![CDATA[<h2 id="a-dual-branch-pre-training-method-for-molecular-property-prediction">A Dual-Branch Pre-training Method for Molecular Property Prediction</h2>
<p>DMP (Dual-view Molecule Pre-training) is a <strong>Method</strong> paper that introduces a pre-training framework combining two complementary molecular encoders: a Transformer operating on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings and a Graph Neural Network (GNN) operating on molecular graphs. The two branches are trained jointly with masked language modeling (MLM) objectives plus a BYOL-style dual-view consistency loss. After pre-training on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> molecules, either branch (or both) can be fine-tuned for downstream tasks. The authors recommend the Transformer branch based on empirical results. DMP achieves the best reported performance on 7 of 9 <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks and 3 retrosynthesis benchmarks (at the time of the 2021 arXiv version).</p>
<h2 id="why-combine-smiles-and-graph-views-for-molecules">Why Combine SMILES and Graph Views for Molecules</h2>
<p>Prior molecule pre-training methods used either graph representations with GNNs or SMILES representations with Transformers, but not both. The authors observe that the two views are complementary: Transformers handle molecules with large atom distances (long chains) well, while GNNs handle molecules with many concatenated rings better. Neither model alone captures the full range of molecular structures effectively.</p>
<p>Existing GNN-based pre-training methods (Hu et al. 2020, MolCLR, GROVER) and SMILES-based methods (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>) each have blind spots dictated by their input representation. DMP addresses this by pre-training both views simultaneously and enforcing representation consistency between them, so each branch benefits from the structural knowledge of the other.</p>
<h2 id="dual-view-consistency-with-byol-style-training">Dual-View Consistency with BYOL-Style Training</h2>
<p>The core innovation is the dual-view consistency objective, inspired by Bootstrap Your Own Latent (BYOL). Given a molecule $M$ with SMILES representation $M_s$ and graph representation $M_g$, DMP obtains high-level features from each branch:</p>
<ul>
<li><strong>Transformer branch</strong>: A RoBERTa-base model encodes the SMILES sequence. The [CLS] token output serves as the molecule representation $f_s$.</li>
<li><strong>GNN branch</strong>: A DeeperGCN network encodes the molecular graph. Mean+max pooling over atom representations yields $f_g$.</li>
</ul>
<p>The dual-view consistency loss uses nonlinear projection heads $\psi_g, \psi_s$ and prediction heads $\rho_g, \rho_s$:</p>
<p>$$
p_g = \psi_g(f_g), \quad q_g = \rho_g(p_g); \quad p_s = \psi_s(f_s), \quad q_s = \rho_s(p_s)
$$</p>
<p>The consistency loss maximizes cross-view <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> with stop-gradient (SG) on the target:</p>
<p>$$
\ell_{\text{dual}}(\tilde{M}_g, \tilde{M}_s) = -\cos(q_s, \text{SG}(p_g)) - \cos(q_g, \text{SG}(p_s))
$$</p>
<p>where $\cos(p, q) = \frac{p^\top q}{\|p\|_2 \, \|q\|_2}$ and $\tilde{M}_g, \tilde{M}_s$ are the masked versions of the inputs. The stop-gradient prevents representation collapse without requiring negative samples or a momentum encoder.</p>
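<p>The forward value of the consistency loss is straightforward to compute; a plain-Python sketch follows. The stop-gradient only matters for backpropagation (in an autodiff framework it would be something like PyTorch&rsquo;s <code>tensor.detach()</code>), so it does not change the forward value computed here.</p>

```python
import math

def cos_sim(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm

def dual_view_loss(q_s, p_g, q_g, p_s):
    """-cos(q_s, SG(p_g)) - cos(q_g, SG(p_s)); SG omitted in forward pass."""
    return -cos_sim(q_s, p_g) - cos_sim(q_g, p_s)

# Perfectly aligned cross-view predictions reach the minimum value of -2:
loss = dual_view_loss([1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0])
```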
<p>The full training objective combines three losses:</p>
<ol>
<li><strong>MLM on Transformer</strong>: Recover masked tokens in SMILES sequences</li>
<li><strong>MLM on GNN</strong>: Recover masked atoms in molecular graphs</li>
<li><strong>Dual-view consistency</strong>: The BYOL-style loss above</li>
</ol>
<p>Both MLM objectives and the consistency loss are necessary. Ablations show that removing MLM (using only dual-view loss) degrades performance, and using two branches of the same type (two Transformers or two GNNs) is less effective than the heterogeneous Transformer+GNN combination.</p>
<h2 id="experiments-on-moleculenet-and-retrosynthesis">Experiments on MoleculeNet and Retrosynthesis</h2>
<h3 id="pre-training-setup">Pre-training Setup</h3>
<p>DMP is pre-trained on 10M molecules from PubChem (matching prior work). The Transformer branch uses RoBERTa-base (12 layers, hidden dim 768, 87M parameters). The GNN branch uses DeeperGCN (12 layers, hidden dim 384, 7.4M parameters). Combined, DMP has 104.1M parameters. Training runs for 200K iterations on 8 V100 GPUs over 3.8 days with Adam optimizer (lr = 5e-4, weight decay 0.01).</p>
<h3 id="molecular-property-prediction-moleculenet">Molecular Property Prediction (MoleculeNet)</h3>
<p>DMP is evaluated on 6 binary classification tasks (BBBP, Tox21, ClinTox, HIV, BACE, SIDER) using official DeepChem splits, and on six further benchmarks (BBBP, SIDER, and ClinTox classification plus ESOL, QM7, and QM8 regression) using scaffold splits from GROVER.</p>
<p>Key results on DeepChem splits (ROC-AUC %):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolCLR</th>
          <th>TF (MLM)</th>
          <th>DMP_TF</th>
          <th>DMP_TF+GNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>73.6</td>
          <td>74.9</td>
          <td><strong>78.1</strong></td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>79.8</td>
          <td>77.6</td>
          <td><strong>78.8</strong></td>
          <td>79.1</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>93.2</td>
          <td>92.9</td>
          <td><strong>95.0</strong></td>
          <td>95.6</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>80.2</td>
          <td><strong>81.0</strong></td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>89.0</td>
          <td>88.0</td>
          <td><strong>89.3</strong></td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>68.0</td>
          <td>68.4</td>
          <td><strong>69.2</strong></td>
          <td>69.8</td>
      </tr>
  </tbody>
</table>
<p>On scaffold splits (comparison with GROVER and MPG):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GROVER</th>
          <th>MPG</th>
          <th>DMP_TF</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP (AUC)</td>
          <td>0.940</td>
          <td>0.922</td>
          <td><strong>0.945</strong></td>
      </tr>
      <tr>
          <td>SIDER (AUC)</td>
          <td>0.658</td>
          <td>0.661</td>
          <td><strong>0.695</strong></td>
      </tr>
      <tr>
          <td>ClinTox (AUC)</td>
          <td>0.944</td>
          <td>0.963</td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>ESOL (RMSE)</td>
          <td>0.831</td>
          <td>0.741</td>
          <td><strong>0.700</strong></td>
      </tr>
      <tr>
          <td>QM7 (MAE)</td>
          <td>72.6</td>
          <td>-</td>
          <td><strong>69.6</strong></td>
      </tr>
      <tr>
          <td>QM8 (MAE)</td>
          <td>0.0125</td>
          <td>-</td>
          <td><strong>0.0124</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis">Retrosynthesis</h3>
<p>DMP is tested on USPTO-50K (reaction type known/unknown) and USPTO-full. Using a &ldquo;DMP fusion&rdquo; approach (fusing pre-trained representations into a Transformer encoder-decoder for <a href="/notes/chemistry/molecular-design/reaction-prediction/">retrosynthesis</a>), DMP improves top-1 accuracy by 2-3 points over the baseline Transformer across all settings:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Transformer</th>
          <th>ChemBERTa fusion</th>
          <th>DMP fusion</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-50K (unknown)</td>
          <td>42.3</td>
          <td>43.9</td>
          <td><strong>46.1</strong></td>
      </tr>
      <tr>
          <td>USPTO-50K (known)</td>
          <td>54.2</td>
          <td>56.4</td>
          <td><strong>57.5</strong></td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>42.9</td>
          <td>-</td>
          <td><strong>45.0</strong></td>
      </tr>
  </tbody>
</table>
<p>For GNN-based retrosynthesis, replacing GLN&rsquo;s GNN modules with DMP&rsquo;s pre-trained GNN branch improves top-1 accuracy from 52.5% to 54.2% (unknown type) and from 64.2% to 66.5% (known type).</p>
<h3 id="representation-quality">Representation Quality</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pre-trained representations shows that DMP produces better scaffold-based clustering than either GNN-only or Transformer-only pre-training. The <a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> improves from 3.56 (GNN) and 3.59 (Transformer) to 2.19 (DMP), indicating much tighter within-scaffold clusters.</p>
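<p>For reference, the Davies&ndash;Bouldin index averages, over all clusters, the worst-case ratio of within-cluster scatter to between-centroid distance, so lower values mean tighter, better-separated clusters. A small self-contained sketch on toy 2-D points (not the paper&rsquo;s embeddings):</p>

```python
import math

def davies_bouldin(clusters):
    """clusters: list of point lists, each point a coordinate tuple."""
    def centroid(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))
    cents = [centroid(c) for c in clusters]
    # Mean distance of each cluster's points to its own centroid.
    scatter = [sum(math.dist(p, ce) for p in c) / len(c)
               for c, ce in zip(clusters, cents)]
    k = len(clusters)
    return sum(
        max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k

tight = davies_bouldin([[(0, 0), (0, 1)], [(10, 0), (10, 1)]])   # well separated
loose = davies_bouldin([[(0, 0), (0, 4)], [(5, 0), (5, 4)]])     # overlapping-ish
```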
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Combining heterogeneous views (SMILES + graph) during pre-training is more effective than using two branches of the same type. TF(x2) and GNN(x2) variants show smaller gains.</li>
<li>Both MLM and dual-view consistency loss contribute. Removing MLM (dual-view only) hurts performance, especially on BBBP (71.1 vs 78.1 with both losses).</li>
<li>The Transformer branch alone is recommended for downstream tasks, as it achieves strong results without adding GNN parameters at inference time.</li>
<li>Scaling pre-training data from 10M to 100M compounds yields marginal additional improvement.</li>
</ul>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ol>
<li>Training cost is higher than single-branch methods (3.8 days vs 2.5 days for TF-only on 8 V100s), since both branches must be trained jointly.</li>
<li>A fixed branch selection strategy is used at inference time. The authors note that a meta-controller for dynamic branch selection per molecule would be preferable.</li>
<li>The GNN branch uses simple atom masking without bond deletion or subgraph removal, leaving room for stronger graph-level pre-training objectives.</li>
</ol>
<p><strong>Relation to co-training:</strong> The authors clarify that DMP differs from classical <a href="https://en.wikipedia.org/wiki/Co-training">co-training</a> (Blum and Mitchell 1998) in that it does not require conditional independence between views and produces a pre-trained model rather than additional labeled data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>10M compounds</td>
          <td>Same subset as MolCLR and ChemBERTa</td>
      </tr>
      <tr>
          <td>Pre-training (large)</td>
          <td>PubChem subset</td>
          <td>100M compounds</td>
          <td>Additional scale experiment</td>
      </tr>
      <tr>
          <td>Evaluation (classification)</td>
          <td>MoleculeNet (BBBP, Tox21, ClinTox, HIV, BACE, SIDER)</td>
          <td>1.5K-41K molecules</td>
          <td>Official DeepChem splits</td>
      </tr>
      <tr>
          <td>Evaluation (regression)</td>
          <td>MoleculeNet (ESOL, QM7, QM8)</td>
          <td>Varies</td>
          <td>Scaffold splits from GROVER</td>
      </tr>
      <tr>
          <td>Evaluation (retrosynthesis)</td>
          <td>USPTO-50K, USPTO-full</td>
          <td>50K / 950K reactions</td>
          <td>Splits from Dai et al. (2019)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Transformer branch</strong>: RoBERTa-base with MLM. SMILES tokenized using regex from Schwaller et al. (2019).</li>
<li><strong>GNN branch</strong>: DeeperGCN with 12 layers, atom masking for MLM.</li>
<li><strong>Dual-view loss</strong>: BYOL-style with 3-layer MLP projection heads and 2-layer MLP prediction heads, stop-gradient on targets.</li>
<li><strong>Optimizer</strong>: Adam (lr=5e-4, beta1=0.9, beta2=0.98, epsilon=1e-6), weight decay 0.01, 10K warmup steps, linear decay.</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transformer branch</td>
          <td>RoBERTa-base (12L, 768H, 12 heads)</td>
          <td>87M</td>
      </tr>
      <tr>
          <td>GNN branch</td>
          <td>DeeperGCN (12L, 384H)</td>
          <td>7.4M</td>
      </tr>
      <tr>
          <td>DMP (total)</td>
          <td>Transformer + GNN + projection/prediction heads</td>
          <td>104.1M</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC, averaged over 3 random seeds</li>
<li>Regression: RMSE (ESOL) or MAE (QM7, QM8)</li>
<li>Retrosynthesis: Top-k exact match accuracy (k=1,3,5,10,20,50)</li>
</ul>
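<p>Top-k exact match accuracy simply asks whether the ground-truth reactant string appears among the model&rsquo;s k highest-ranked outputs (in practice both sides would be canonicalized SMILES first). A minimal sketch with made-up strings:</p>

```python
def topk_accuracy(predictions, targets, k=1):
    """Fraction of examples whose target appears in the top-k predicted list."""
    hits = sum(1 for preds, tgt in zip(predictions, targets) if tgt in preds[:k])
    return hits / len(targets)

preds = [["CCO", "OCC"], ["CC", "C=C"], ["c1ccccc1", "CCO"]]
gold = ["CCO", "C=C", "CCN"]
acc1 = topk_accuracy(preds, gold, k=1)
acc2 = topk_accuracy(preds, gold, k=2)
```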
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8 NVIDIA V100 GPUs, batch size 12288 tokens, gradient accumulation 16x</li>
<li>Pre-training time: 3.8 days (DMP), 2.5 days (TF-only), 1.7 days (GNN-only)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained model weights were identified for this paper. The paper references GLN&rsquo;s code repository (<a href="https://github.com/Hanjun-Dai/GLN">https://github.com/Hanjun-Dai/GLN</a>) for the retrosynthesis baseline but does not release DMP-specific code.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Hanjun-Dai/GLN">GLN (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Retrosynthesis baseline, not DMP code</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhu, J., Xia, Y., Wu, L., Xie, S., Zhou, W., Qin, T., Li, H., &amp; Liu, T.-Y. (2023). Dual-view Molecular Pre-training. In <em>Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</em> (pp. 3615-3627). <a href="https://doi.org/10.1145/3580305.3599317">https://doi.org/10.1145/3580305.3599317</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2023dualview,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-view Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhu, Jinhua and Xia, Yingce and Wu, Lijun and Xie, Shufang and Zhou, Wengang and Qin, Tao and Li, Houqiang and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3615--3627}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3580305.3599317}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>X-MOL: Pre-training on 1.1B Molecules for SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/</guid><description>X-MOL pre-trains a shared encoder-decoder Transformer on 1.1 billion molecules, then fine-tunes for property prediction, reaction analysis, and generation.</description><content:encoded><![CDATA[<h2 id="a-unified-molecular-pre-training-framework">A Unified Molecular Pre-training Framework</h2>
<p>X-MOL is a <strong>Method</strong> paper that introduces a large-scale pre-training framework for <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based molecular understanding. The primary contribution is a Transformer encoder-decoder model pre-trained on 1.1 billion molecules from <a href="/notes/chemistry/datasets/zinc-22/">ZINC15</a>, which is then fine-tuned across five distinct molecular analysis tasks: molecular property prediction (classification and regression), chemical reaction productivity prediction, <a href="https://en.wikipedia.org/wiki/Drug_interaction">drug-drug interaction</a> (DDI) prediction, de novo molecule generation (distribution learning and goal-directed), and molecule optimization. The paper demonstrates that a single pre-trained model can serve as a universal foundation for diverse downstream chemistry tasks.</p>
<h2 id="bridging-scale-and-understanding-in-molecular-smiles">Bridging Scale and Understanding in Molecular SMILES</h2>
<p>Prior to X-MOL, most molecular analysis tasks were investigated individually with task-specific models. SMILES-based deep learning methods existed but lacked the benefit of large-scale pre-training that had proven transformative in NLP (BERT, RoBERTa, ERNIE, XLNet, <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>). Two challenges motivated this work:</p>
<ol>
<li><strong>SMILES sacrifices structural information for simplicity.</strong> While SMILES is a convenient linear representation, it does not directly encode molecular topology, so a model must infer connectivity and structure from the string alone.</li>
<li><strong>Labelled molecular data is scarce.</strong> Most benchmark datasets (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>) contain only thousands of labelled examples, making it difficult to train large models from scratch without overfitting.</li>
</ol>
<p>The authors hypothesized that massive-scale pre-training on unlabelled SMILES could teach a model the grammar rules and implicit structural information in SMILES, providing a strong initialization for multiple downstream tasks.</p>
<h2 id="generative-pre-training-with-random-smiles">Generative Pre-training with Random SMILES</h2>
<p>The core innovation in X-MOL is a <strong>generative pre-training strategy</strong> that exploits the non-uniqueness of SMILES. A single molecule can be represented by many valid SMILES strings (<a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">random SMILES</a>), depending on the starting atom, main chain selection, and ring-opening position. X-MOL trains the model to generate a valid alternative SMILES given an input SMILES of the same molecule, forcing the model to:</p>
<ol>
<li>Reconstruct the molecular structure from the input SMILES</li>
<li>Generate a valid output SMILES following SMILES grammar rules</li>
</ol>
<p>The architecture uses a shared-parameter encoder-decoder based on the Transformer. Unlike standard encoder-decoder models (e.g., for machine translation), X-MOL shares all parameters between encoder and decoder, forcing both encoding and decoding to occur in the same semantic space. The output SMILES is fully masked during training, and only unidirectional attention is permitted within the output sequence.</p>
<p>The self-attention mechanism computes attention for each character $i$ as:</p>
<p>$$
Z_{i} = \text{SoftMax}\left(\frac{Q_{i} \cdot K^{T}}{\sqrt{D}}\right) \cdot V
$$</p>
<p>where $Q_{i}$, $K$, and $V$ are the query, key, and value matrices, and $D$ is the feature dimension. The model uses 12 attention heads to capture different relational patterns.</p>
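<p>As a concrete sketch, the attention equation above can be written in a few lines of NumPy. This is an illustrative re-implementation, not the paper's code; the <code>causal</code> flag is my own name for the unidirectional attention X-MOL permits within the output sequence.</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Z = softmax(Q K^T / sqrt(D)) V, optionally with a
    unidirectional (causal) mask as used on the output SMILES."""
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                  # (m, n) attention logits
    if causal:
        # Each position may attend only to itself and earlier positions.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilized softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
Z = scaled_dot_product_attention(Q, K, V, causal=True)
print(Z.shape)  # (5, 8)
```

<p>With the causal mask, the first output position can attend only to itself, so its output equals the first row of $V$; a full multi-head implementation would repeat this computation across 12 learned projections.</p>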
<h3 id="model-architecture">Model Architecture</h3>
<ul>
<li>12 Transformer encoder layers</li>
<li>768-dimensional hidden units</li>
<li>12 attention heads</li>
<li>Character-level SMILES tokenization (108 chemical characters plus 5 special tokens: [PAD], [CLS], [SEP], [MASK], [UNK])</li>
<li>Characters within square brackets and double digits preceded by &ldquo;%&rdquo; are treated as single tokens</li>
</ul>
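<p>The tokenization rule in the last two bullets can be sketched with a single regular expression: bracketed atoms and "%"-prefixed two-digit ring closures become single tokens, and everything else splits per character. The regex is my own illustrative reconstruction, not the paper's tokenizer.</p>

```python
import re

# Bracketed atoms (e.g. "[NH4+]") and "%"-prefixed two-digit ring
# closures stay whole; all remaining characters are single tokens.
TOKEN_RE = re.compile(r"\[[^\]]+\]|%\d{2}|.")

def tokenize_smiles(smiles: str) -> list[str]:
    return TOKEN_RE.findall(smiles)

print(tokenize_smiles("C1CC1[NH4+]%12"))
# ['C', '1', 'C', 'C', '1', '[NH4+]', '%12']
```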
<h3 id="data-augmentation-in-pre-training">Data Augmentation in Pre-training</h3>
<p>Because a molecule has many valid random SMILES, a correctly generated output may not match the single predefined target. To handle this, X-MOL generates multiple training samples per molecule with the same input SMILES but different output random SMILES, and places them in the same mini-batch.</p>
<h2 id="experimental-setup-across-five-tasks">Experimental Setup Across Five Tasks</h2>
<p>X-MOL is fine-tuned with task-specific strategies organized into two categories: prediction tasks and generation tasks.</p>
<h3 id="prediction-tasks">Prediction Tasks</h3>
<p>For prediction tasks, the [CLS] token&rsquo;s output representation is passed through a fully connected network to produce predictions. The input format varies by task:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input Format</th>
          <th>Loss Function</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Property prediction (classification)</td>
          <td>Single SMILES</td>
          <td>Cross-entropy</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Property prediction (regression)</td>
          <td>Single SMILES</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Reaction productivity prediction</td>
          <td>Four SMILES (reactant, additive, base, ligand)</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Two SMILES (drug pair)</td>
          <td>Cross-entropy</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecular Property Prediction (Classification):</strong> Four MoleculeNet benchmarks were used: HIV (41,127 compounds), BACE (1,513), <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBBP</a> (2,039), and ClinTox (1,484). Data were randomly split 20 times, and average ROC-AUC is reported.</p>
<p><strong>Molecular Property Prediction (Regression):</strong> Three MoleculeNet benchmarks: ESOL (1,128), FreeSolv (642), and Lipophilicity (4,200). Data augmentation with random SMILES was applied to the training set. Average RMSE over 20 random splits is reported.</p>
<p><strong>Chemical Reaction Productivity Prediction:</strong> The <a href="https://en.wikipedia.org/wiki/Cross-coupling_reaction">C-N cross-coupling</a> dataset (3,956 reactions) from Ahneman et al. was used with 10-fold cross-validation.</p>
<p><strong>DDI Prediction:</strong> The DeepDDI dataset (192,284 DDI pairs, 86 interaction types) was used as benchmark.</p>
<h3 id="generation-tasks">Generation Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Generation Source</th>
          <th>Sampling Strategy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Distribution learning (DL) generation</td>
          <td>Fixed initial symbol ([CLS])</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Goal-directed (GD) generation</td>
          <td>Unfixed initial symbol</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Molecule optimization</td>
          <td>Input molecule</td>
          <td>Beam search (beam size = 4)</td>
      </tr>
  </tbody>
</table>
<p><strong>DL-based Generation:</strong> Evaluated on ZINC250K (249,456 molecules) using validity, uniqueness, and novelty.</p>
<p><strong>GD Generation:</strong> Also on ZINC250K, using QED as the goal property with target QED = 0.948 (the dataset maximum). 10,000 molecules were generated for evaluation.</p>
<p><strong>Molecule Optimization:</strong> Evaluated on ZINC250K with QED as the optimization goal. Molecular pairs were constructed by selecting pairs with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> in [0.6, 0.8], where the lower-QED molecule serves as input and the higher-QED molecule as target.</p>
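<p>The pair-construction rule can be sketched directly. Here fingerprints are represented as sets of on-bit indices, and <code>make_optimization_pair</code> is a hypothetical helper name for illustration, not something from the paper:</p>

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def make_optimization_pair(mol_a, mol_b, fp_a, fp_b, qed_a, qed_b):
    """Keep a pair only if 0.6 <= Tanimoto <= 0.8; the lower-QED
    molecule becomes the input and the higher-QED one the target."""
    if not 0.6 <= tanimoto(fp_a, fp_b) <= 0.8:
        return None
    return (mol_a, mol_b) if qed_a < qed_b else (mol_b, mol_a)

pair = make_optimization_pair("mol_low", "mol_high",
                              {1, 2, 3, 4, 5}, {1, 2, 3, 4, 8},
                              qed_a=0.55, qed_b=0.90)
print(pair)  # ('mol_low', 'mol_high'): similarity 4/6 falls in [0.6, 0.8]
```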
<h3 id="key-results">Key Results</h3>
<p><strong>Classification (ROC-AUC, higher is better):</strong> X-MOL achieved state-of-the-art on all four datasets, outperforming both shallow learning methods and deep learning baselines including graph convolutional models.</p>
<p><strong>Regression (RMSE, lower is better):</strong> X-MOL achieved the best RMSE on ESOL, FreeSolv, and Lipophilicity.</p>
<p><strong>Reaction Productivity:</strong> X-MOL obtained an average RMSE of 0.0626, compared to the random forest baseline of 0.078.</p>
<p><strong>DDI Prediction:</strong> X-MOL achieved accuracy of 0.952, improving over DeepDDI&rsquo;s 0.924.</p>
<p><strong>DL-based Generation:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>20%</td>
          <td>99.97%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>MRNN</td>
          <td>65%</td>
          <td>99.89%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>GraphAF</td>
          <td>68%</td>
          <td>99.10%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td><strong>X-MOL</strong></td>
          <td><strong>85.28%</strong></td>
          <td><strong>99.91%</strong></td>
          <td><strong>100%</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>GD Generation:</strong> X-MOL generated all top-3 molecules with QED = 0.948, matching the dataset maximum. GraphAF reached 0.948/0.948/0.947, while JT-VAE and MRNN fell further behind.</p>
<h3 id="knowledge-embedding-ablation">Knowledge Embedding Ablation</h3>
<p>The paper tested three additional embedding strategies to inject structural information into the model:</p>
<ul>
<li><strong>Link embedding:</strong> Encodes connection information between atoms (position of the previous connected atom)</li>
<li><strong>Ring embedding:</strong> Encodes ring structure information from SMILES number pairs</li>
<li><strong>Type embedding:</strong> Categorizes characters into 9 types (atoms, bonds, structural symbols)</li>
</ul>
<p>None of these additional embeddings improved performance on the HIV or DDI tasks, whether with or without pre-training. The authors conclude that SMILES already contains sufficient information for molecular understanding and that pre-training effectively extracts this information, a finding they label &ldquo;SMILES is all you need.&rdquo;</p>
<h3 id="attention-visualization">Attention Visualization</h3>
<p>The authors provide attention heatmap analysis demonstrating that:</p>
<ul>
<li>Middle layers (e.g., layer 9) reconstruct molecular structure by correctly identifying atom connectivity and ring closures</li>
<li>Later layers abstract higher-level features for property prediction</li>
<li>In multi-input prediction tasks (reaction productivity), attention reveals which reaction components are most important (e.g., the ligand receives highest cross-attention)</li>
<li>In generation tasks, attention patterns differ between DL (self-focused), GD (source-constrained), and optimization (gradual shift from input to output)</li>
</ul>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>X-MOL demonstrates that large-scale pre-training on SMILES can produce a single model that achieves competitive or state-of-the-art performance across five distinct molecular analysis tasks. The key findings are:</p>
<ol>
<li><strong>Scale enables SMILES understanding.</strong> Pre-training on 1.1 billion molecules allows the model to learn SMILES grammar rules well enough to outperform graph-based methods on molecule generation validity.</li>
<li><strong>Unified framework.</strong> A single pre-trained backbone serves classification, regression, reaction prediction, DDI prediction, and generative tasks through different fine-tuning strategies.</li>
<li><strong>SMILES is sufficient.</strong> Additional knowledge embeddings (link, ring, type) do not improve performance, suggesting pre-training extracts the necessary structural information from SMILES alone.</li>
<li><strong>Interpretable attention.</strong> Attention visualization confirms that the model reconstructs molecular structure internally.</li>
</ol>
<p><strong>Limitations</strong> (observed):</p>
<ul>
<li>The paper reports only MoleculeNet benchmarks with relatively few datasets. No scaffold splits or temporal splits are used; all splits are random, which can overestimate performance on structurally novel compounds.</li>
<li>Comparison baselines are somewhat dated (2018-2019 era methods), and the paper does not compare against concurrent SMILES pre-training methods.</li>
<li>The molecule generation validity (85.28%) is much higher than graph baselines like GCPN (20%), but later work achieved near 100% validity with constrained SMILES grammars.</li>
<li>No code or model weights have been publicly released, limiting independent verification.</li>
<li>The paper remains a bioRxiv preprint and has not been published in a peer-reviewed venue.</li>
</ul>
<p><strong>Future directions</strong> proposed by the authors include: better pre-training strategies, extension to graph-based representations, and fine-tuning on additional downstream tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC15</td>
          <td>1.1 billion molecules</td>
          <td>Random SMILES augmentation</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV (MoleculeNet)</td>
          <td>41,127</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE (MoleculeNet)</td>
          <td>1,513</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP (MoleculeNet)</td>
          <td>2,039</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>ClinTox (MoleculeNet)</td>
          <td>1,484</td>
          <td>Two sub-datasets, averaged</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL (MoleculeNet)</td>
          <td>1,128</td>
          <td>Water solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv (MoleculeNet)</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity (MoleculeNet)</td>
          <td>4,200</td>
          <td>logD at pH 7.4</td>
      </tr>
      <tr>
          <td>Reaction</td>
          <td>C-N cross-coupling</td>
          <td>3,956</td>
          <td>From Ahneman et al. (2018)</td>
      </tr>
      <tr>
          <td>DDI</td>
          <td>DeepDDI</td>
          <td>192,284 DDI pairs</td>
          <td>86 interaction types</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>ZINC250K</td>
          <td>249,456</td>
          <td>For DL, GD, and optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Generative SMILES-to-SMILES with shared encoder-decoder Transformer</li>
<li>Fine-tuning prediction tasks: [CLS] token passed through fully connected layers</li>
<li>Fine-tuning generation tasks: Autoregressive generation with random sampling (DL, GD) or beam search (optimization)</li>
<li>Data augmentation: Random SMILES augmentation for regression tasks</li>
<li>Repeated training: 20 random splits with averaged results for classification/regression</li>
<li>10-fold cross-validation for reaction productivity</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>12-layer Transformer, 768 hidden dimensions, 12 attention heads</li>
<li>Character-level tokenization: 108 chemical characters + 5 special tokens</li>
<li>Implemented in PaddlePaddle framework</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>X-MOL</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BACE (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BBBP (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ClinTox (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ESOL (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>FreeSolv (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>Lipophilicity (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>C-N coupling</td>
          <td>RMSE</td>
          <td>0.0626</td>
          <td>0.078 (random forest)</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Accuracy</td>
          <td>0.952</td>
          <td>0.924 (DeepDDI)</td>
      </tr>
      <tr>
          <td>DL generation</td>
          <td>Validity</td>
          <td>85.28%</td>
          <td>68% (GraphAF)</td>
      </tr>
      <tr>
          <td>GD generation</td>
          <td>Top-3 QED</td>
          <td>All 0.948</td>
          <td>0.948/0.948/0.947 (GraphAF)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8-16 Tesla P40 GPUs (24 GB each), approximately 4 days</li>
<li>Data pre-processing: Over 1,000 CPUs with Hadoop</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No code, model weights, or pre-trained checkpoints have been publicly released. The model was implemented in Baidu&rsquo;s PaddlePaddle framework, but no repository is available.</p>
<p><strong>Reproducibility status: Closed.</strong> While the datasets are all publicly available (ZINC15, MoleculeNet, ZINC250K, DeepDDI, C-N coupling), the model implementation, pre-trained weights, and fine-tuning code are not released. The computational requirements (1,000+ CPUs for data processing, 8-16 GPUs for 4 days of pre-training) are substantial.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xue, D., Zhang, H., Xiao, D., Gong, Y., Chuai, G., Sun, Y., Tian, H., Wu, H., Li, Y., &amp; Liu, Q. (2020). X-MOL: Large-scale pre-training for molecular understanding and diverse molecular analysis. <em>bioRxiv</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xue2020xmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xue, Dongyu and Zhang, Han and Xiao, Dongling and Gong, Yukang and Chuai, Guohui and Sun, Yu and Tian, Hao and Wu, Hua and Li, Yukun and Liu, Qi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2020.12.23.424259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Cold Spring Harbor Laboratory}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer Name-to-SMILES with Atom Count Losses</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/</guid><description>A Transformer seq2seq model translates chemical compound names to SMILES, using atom-count constraints and SMILES/InChI multi-task learning.</description><content:encoded><![CDATA[<h2 id="translating-chemical-names-to-structures-with-transformers">Translating Chemical Names to Structures with Transformers</h2>
<p>This is a <strong>Method</strong> paper that proposes using Transformer-based sequence-to-sequence models to predict chemical compound structures (represented as SMILES strings) from chemical compound names. The primary contribution is the application of neural machine translation techniques to the name-to-structure problem, along with two domain-specific improvements: an atom-count constraint loss function and a multi-task learning approach that jointly predicts SMILES and InChI strings.</p>
<h2 id="why-rule-based-name-to-structure-fails-for-synonyms">Why Rule-Based Name-to-Structure Fails for Synonyms</h2>
<p>Chemical compound names come in several varieties. IUPAC names follow systematic nomenclature and are well handled by rule-based parsers like OPSIN. Database IDs (e.g., CAS registry numbers) can be resolved by dictionary lookup. The third category, Synonyms (which includes abbreviations, common names, and other informal designations), is problematic because naming patterns are complex and highly variable.</p>
<p>In preliminary experiments, rule-based tools achieved F-measures of 0.878 to 0.960 on IUPAC names but only 0.719 to 0.758 on Synonyms. This performance gap motivates a data-driven approach. The authors frame name-to-SMILES prediction as a machine translation problem: the source language is the chemical compound name and the target language is the SMILES string. A neural model trained on millions of name-SMILES pairs can learn patterns that rule-based systems miss, particularly for non-systematic nomenclature.</p>
<h2 id="atom-count-constraints-and-multi-task-learning">Atom-Count Constraints and Multi-Task Learning</h2>
<p>The paper introduces two improvements over a vanilla Transformer seq2seq model.</p>
<h3 id="atom-count-constraint-loss">Atom-Count Constraint Loss</h3>
<p>A correct structure prediction must contain the right number of atoms of each element. The authors add an auxiliary loss that penalizes the squared difference between the predicted and true atom counts for each element. The predicted atom counts are obtained by summing Gumbel-softmax outputs across all decoded positions.</p>
<p>For the $i$-th output token, the Gumbel-softmax probability vector is:</p>
<p>$$
y_{ij} = \frac{\exp\left((\log(\pi_{ij}) + g_{ij}) / \tau\right)}{\sum_{k=1}^{|\mathcal{V}|} \exp\left((\log(\pi_{ik}) + g_{ik}) / \tau\right)}
$$</p>
<p>where $\pi_{ij}$ is the model&rsquo;s softmax output, $g_{ij}$ is a Gumbel noise sample, and $\tau = 0.1$ is the temperature. The predicted token frequency vector is $\mathbf{y}^{pred} = \sum_{i=1}^{m} \mathbf{y}_i$, and the atom-count loss is:</p>
<p>$$
\mathcal{L}_{atom} = \frac{1}{|A|} \sum_{a \in A} \left(N_a(T) - y_{idx(a)}^{pred}\right)^2
$$</p>
<p>where $A$ is the set of chemical elements in the vocabulary, $N_a(T)$ returns the number of atoms of element $a$ in the correct SMILES string $T$, and $idx(a)$ returns the vocabulary index of element $a$. Only element tokens (e.g., &ldquo;C&rdquo;, &ldquo;O&rdquo;) are counted; bond symbols (e.g., &ldquo;=&rdquo;, &ldquo;#&rdquo;) are excluded.</p>
<p>The combined objective is:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{atom} \mathcal{L}_{atom}
$$</p>
<p>with $\lambda_{atom} = 0.7$.</p>
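<p>A minimal NumPy sketch of this loss follows. The toy vocabulary, the <code>element_idx</code> mapping, and the near-one-hot softmax outputs are my own illustrative setup; in training, this term would be added to the cross-entropy as $\mathcal{L}_{smiles} + \lambda_{atom} \mathcal{L}_{atom}$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(log_pi, tau=0.1):
    """y_ij = softmax((log(pi_ij) + g_ij) / tau), with g ~ Gumbel(0, 1)."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))
    z = (log_pi + g) / tau
    z -= z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def atom_count_loss(log_pi, true_counts, element_idx, tau=0.1):
    """Squared error between true atom counts N_a(T) and the soft counts
    y^pred obtained by summing Gumbel-softmax vectors over all decoded
    positions, averaged over the element set A (element tokens only)."""
    y_pred = gumbel_softmax(log_pi, tau).sum(axis=0)  # soft token frequencies
    return np.mean([(true_counts[a] - y_pred[i]) ** 2
                    for a, i in element_idx.items()])

# Toy vocabulary ["C", "O", "="]; decoded sequence "C C O".
# Rows are near-one-hot model outputs pi_i; bond token "=" is not counted.
log_pi = np.log(np.array([[0.9998, 0.0001, 0.0001],
                          [0.9998, 0.0001, 0.0001],
                          [0.0001, 0.9998, 0.0001]]))
loss = atom_count_loss(log_pi, true_counts={"C": 2, "O": 1},
                       element_idx={"C": 0, "O": 1})
print(loss)  # near zero: the soft counts match the true counts
```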
<h3 id="multi-task-smilesinchi-prediction">Multi-Task SMILES/InChI Prediction</h3>
<p>SMILES and InChI strings encode the same chemical structure in different formats. The authors hypothesize that jointly predicting both representations can improve the shared encoder. The multi-task model shares the encoder between a SMILES decoder and an InChI decoder, minimizing:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{inchi} \mathcal{L}_{inchi}
$$</p>
<p>where $\mathcal{L}_{inchi} = -\log P(I | X; \boldsymbol{\theta}_{enc}, \boldsymbol{\theta}_{inchi})$ and $\lambda_{inchi} = 0.3$.</p>
<h2 id="experimental-setup-and-evaluation">Experimental Setup and Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The dataset was constructed from PubChem dump data (97M compound records). Chemical compound names categorized as Synonyms were paired with canonical SMILES strings (converted via RDKit). Database-like IDs were filtered out using regular expressions. Duplicate names mapping to different CIDs were removed.</p>
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>5,000,000</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>1,113</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>11,194</td>
      </tr>
  </tbody>
</table>
<h3 id="model-configuration">Model Configuration</h3>
<p>The Transformer uses 6 encoder/decoder layers, 8 attention heads, 512-dimensional embeddings, and 0.1 dropout. Training used label-smoothing cross-entropy ($\epsilon = 0.1$), Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$), and a warmup schedule with peak learning rate 0.0005 over 4,000 steps followed by inverse square root decay. Models were trained for 300,000 update steps. Final predictions averaged the last 10 checkpoints and used beam search (beam size 4, length penalty $\alpha = 0.6$, max output length 200).</p>
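<p>The warmup-then-decay schedule can be sketched as follows. The paper specifies only the peak rate and warmup length, so the exact interpolation shown (linear warmup, then inverse square-root decay, as in the original Transformer recipe) is an assumption:</p>

```python
def learning_rate(step: int, peak: float = 5e-4, warmup: int = 4000) -> float:
    """Linear warmup to `peak` over `warmup` steps, then inverse
    square-root decay; the two branches meet at step == warmup."""
    step = max(step, 1)
    return peak * min(step / warmup, (warmup / step) ** 0.5)

print(learning_rate(2000))    # 0.00025 (halfway through warmup)
print(learning_rate(4000))    # 0.0005  (peak)
print(learning_rate(16000))   # 0.00025 (decayed by a factor of sqrt(4))
```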
<h3 id="tokenization">Tokenization</h3>
<p>Three tokenization strategies were compared:</p>
<ul>
<li><strong>BPE</strong>: Byte pair encoding learned on chemical compound names (500 merge operations) via fastBPE</li>
<li><strong>OPSIN-TK</strong>: The OPSIN rule-based tokenizer</li>
<li><strong>OPSIN-TK+BPE</strong>: A hybrid where OPSIN handles tokenizable names and BPE handles the rest</li>
</ul>
<p>SMILES tokens were identified by regular expressions (elements as single tokens, remaining symbols as characters). InChI strings were tokenized by SentencePiece (vocabulary size 1,000).</p>
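<p>The SMILES token rule (elements as single tokens, remaining symbols as characters) can be approximated with an alternation regex; the element list below is illustrative, since the paper does not publish its exact expression:</p>

```python
import re

# Multi-letter elements must precede single letters in the alternation
# so that "Cl" and "Br" are not split into two tokens.
SMILES_TOKEN_RE = re.compile(r"Cl|Br|Si|[BCNOSPFI]|\[[^\]]*\]|.")

def tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN_RE.findall(smiles)

print(tokenize("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
```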
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>OPSIN</strong>: Open-source rule-based parser</li>
<li><strong>Tool A</strong> and <strong>Tool B</strong>: Two commercially available name-to-structure tools</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Tokenizer</th>
          <th>Recall</th>
          <th>Precision</th>
          <th>F-measure</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OPSIN</td>
          <td>Rule-based</td>
          <td>0.693</td>
          <td>0.836</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Tool A</td>
          <td>Rule-based</td>
          <td>0.711</td>
          <td>0.797</td>
          <td>0.752</td>
      </tr>
      <tr>
          <td>Tool B</td>
          <td>Rule-based</td>
          <td>0.653</td>
          <td>0.800</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>BPE</td>
          <td>0.793</td>
          <td>0.806</td>
          <td>0.799</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>BPE</td>
          <td>0.798</td>
          <td>0.808</td>
          <td>0.803</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>BPE</td>
          <td>0.810</td>
          <td>0.819</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.763</td>
          <td>0.873</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.768</td>
          <td>0.876</td>
          <td>0.818</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.779</td>
          <td>0.886</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK</td>
          <td>0.755</td>
          <td>0.868</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK</td>
          <td>0.757</td>
          <td>0.867</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK</td>
          <td>0.754</td>
          <td>0.869</td>
          <td>0.807</td>
      </tr>
  </tbody>
</table>
<p>The best configuration (inchigen with OPSIN-TK+BPE) achieved an F-measure of 0.829, surpassing OPSIN by 0.071 points. The multi-task learning approach (inchigen) consistently outperformed the atom-count constraint alone (atomnum) across all tokenizer settings.</p>
<h2 id="key-findings-and-error-analysis">Key Findings and Error Analysis</h2>
<p>The Transformer-based approach produced grammatically correct SMILES strings (parseable by RDKit) for 99% of test examples, compared to 81.6-88.4% for the rule-based tools. Even when predictions were incorrect, they tended to be structurally similar to the correct answer. Using MACCS fingerprints and Jaccard (Tanimoto) similarity, the average similarity between incorrectly predicted and correct structures was 0.753.</p>
<p>The OPSIN-TK tokenizer yielded higher precision than BPE because approximately 11.5% (1,293 of 11,194) of test compounds could not be tokenized by OPSIN, reducing the number of outputs. BPE-based tokenizers achieved higher recall by covering all inputs. The hybrid OPSIN-TK+BPE approach balanced both, achieving the highest overall F-measure.</p>
<p><strong>Limitations</strong>: The paper does not evaluate on IUPAC names separately with the Transformer models (only comparing rule-based tools on IUPAC). The atom-count constraint and multi-task learning are not combined in a single model. The dataset is released but the training code is not. Hardware details and training times are not reported. The evaluation uses only exact-match F-measure and Jaccard similarity, without measuring partial credit for nearly-correct structures.</p>
<p><strong>Future work</strong>: The authors plan to explore additional tokenization methods, combine the atom-count constraint with multi-task learning, and apply the constraint loss to other chemistry problems including chemical reaction prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>5,000,000 pairs</td>
          <td>Chemical compound names to canonical SMILES</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>1,113 pairs</td>
          <td>Filtered for duplicates</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>11,194 pairs</td>
          <td>Filtered for duplicates; released as benchmark</td>
      </tr>
  </tbody>
</table>
<p>The authors state the dataset is released for future research. The data was constructed from the PubChem dump (97M compound records) using RDKit for SMILES canonicalization. Database-like IDs were removed with regular expressions and duplicate names across CIDs were filtered.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq2seq (6 layers, 8 heads, 512-dim embeddings)</li>
<li>BPE tokenization via fastBPE (500 merge operations)</li>
<li>SentencePiece for InChI tokenization (vocabulary size 1,000)</li>
<li>Gumbel-softmax atom-count constraint ($\tau = 0.1$, $\lambda_{atom} = 0.7$)</li>
<li>Multi-task SMILES/InChI loss ($\lambda_{inchi} = 0.3$)</li>
<li>Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-8}$)</li>
<li>Label smoothing ($\epsilon = 0.1$), 300K training steps</li>
<li>Beam search (beam size 4, length penalty $\alpha = 0.6$)</li>
</ul>
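<p>The atom-count constraint can be sketched as follows. This is one plausible reading of the mechanism (a Gumbel-softmax relaxation of the decoder's token choices, scored against a heavy-atom count implied by the input name), not the authors' implementation; the four-token toy vocabulary and the squared-error form are invented for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.1):
    """Differentiable approximation of sampling a one-hot token (Jang et al., 2017)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical 4-token vocabulary with the heavy-atom count each token adds.
vocab_atom_counts = np.array([1.0, 1.0, 0.0, 0.0])  # e.g. C, O, '(', ')'

def atom_count_loss(decoder_logits, target_count, tau=0.1):
    """Penalize deviation between the soft predicted atom total and the
    count implied by the input name (e.g. 'hexane' implies 6 carbons)."""
    soft_tokens = gumbel_softmax(decoder_logits, tau)    # (seq_len, vocab)
    predicted = (soft_tokens @ vocab_atom_counts).sum()  # expected atom total
    return (predicted - target_count) ** 2

logits = rng.normal(size=(8, 4))  # a sequence of 8 token distributions
loss = atom_count_loss(logits, target_count=5.0)
print(loss >= 0.0)
```

<p>Presumably this term is weighted against the cross-entropy objective using the reported $\lambda_{atom} = 0.7$.</p>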
<h3 id="models">Models</h3>
<p>Standard Transformer architecture following Vaswani et al. (2017). No pre-trained weights or model checkpoints are released.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Model</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>F-measure</td>
          <td>0.829</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Precision</td>
          <td>0.886</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>0.810</td>
          <td>inchigen (BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Grammatical correctness</td>
          <td>99%</td>
          <td>inchigen (BPE)</td>
          <td>SMILES parseable by RDKit</td>
      </tr>
      <tr>
          <td>Avg. Jaccard similarity (errors)</td>
          <td>0.753</td>
          <td>inchigen (BPE)</td>
          <td>On incorrect predictions only</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Omote, Y., Matsushita, K., Iwakura, T., Tamura, A., &amp; Ninomiya, T. (2020). Transformer-based Approach for Predicting Chemical Compound Structures. <em>Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</em>, 154-162. <a href="https://doi.org/10.18653/v1/2020.aacl-main.19">https://doi.org/10.18653/v1/2020.aacl-main.19</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{omote2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based Approach for Predicting Chemical Compound Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Omote, Yutaro and Matsushita, Kyoumoto and Iwakura, Tomoya and Tamura, Akihiro and Ninomiya, Takashi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{154--162}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2020.aacl-main.19}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer CLMs for SMILES: Literature Review 2024</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformer-clms-smiles-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformer-clms-smiles-review/</guid><description>Review of transformer-based chemical language models for SMILES, covering encoder, decoder, and encoder-decoder architectures for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-based-chemical-language-models">A Systematization of Transformer-Based Chemical Language Models</h2>
<p>This paper is a <strong>Systematization</strong> (literature review) that surveys the landscape of transformer-based chemical language models (CLMs) operating on SMILES representations. It organizes the field into three architectural categories (encoder-only, decoder-only, encoder-decoder), discusses tokenization strategies, pre-training and fine-tuning methodologies, and identifies open challenges and future research directions. The review covers approximately 30 distinct CLMs published through early 2024.</p>
<h2 id="why-review-transformer-clms-for-smiles">Why Review Transformer CLMs for SMILES?</h2>
<p>The chemical space is vast, with databases like ZINC20 exceeding 5.5 billion compounds, and the amount of unlabeled molecular data far outstrips available labeled data for specific tasks like toxicity prediction or binding affinity estimation. Traditional molecular representations (fingerprints, descriptors, graph-based methods) require expert-engineered features and extensive domain knowledge.</p>
<p>Transformer-based language models, originally developed for NLP, have emerged as a compelling alternative. By treating <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings as a &ldquo;chemical language,&rdquo; these models can leverage large-scale unsupervised pre-training on abundant unlabeled molecules, then fine-tune on small labeled datasets for specific downstream tasks. Earlier approaches like Seq2Seq and Seq3Seq fingerprint methods used RNN-based encoder-decoders, but these suffered from vanishing gradients and sequential processing bottlenecks when handling long SMILES sequences.</p>
<p>The authors motivate this review by noting that no prior survey has comprehensively organized transformer-based CLMs by architecture type while simultaneously covering tokenization, embedding strategies, and downstream application domains.</p>
<h2 id="architectural-taxonomy-encoder-decoder-and-encoder-decoder-models">Architectural Taxonomy: Encoder, Decoder, and Encoder-Decoder Models</h2>
<p>The core organizational contribution is a three-way taxonomy of transformer CLMs based on their architectural backbone.</p>
<h3 id="encoder-only-models-bert-family">Encoder-Only Models (BERT Family)</h3>
<p>These models capture bidirectional context, making them well suited for extracting molecular representations for property prediction tasks. The review covers:</p>
<ul>
<li><strong>BERT</strong> (Lee and Nam, 2022): Adapted for SMILES processing with linguistic knowledge infusion, using BPE tokenization</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MOLBERT</a></strong> (Fabian et al., 2020): Chemistry-specific BERT for physicochemical property and bioactivity prediction</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a></strong> (Wang et al., 2019): BERT variant designed to learn molecular representations directly from SMILES without feature engineering</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> / <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></strong> (Chithrananda et al., 2020; Ahmad et al., 2022): RoBERTa-based models optimized for chemical property prediction, with ChemBERTa-2 exploring multi-task pre-training</li>
<li><strong>GPT-MolBERTa</strong> (Balaji et al., 2023): Combines GPT molecular features with a RoBERTa backbone</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a></strong> (Ross et al., 2022): Large-scale model trained on 1.1 billion molecules, published in Nature Machine Intelligence</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></strong> (Yuksel et al., 2023): Operates on <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> representations rather than SMILES</li>
<li><strong>Mol-BERT / MolRoPE-BERT</strong> (Li and Jiang, 2021; Liu et al., 2023): Differ in positional embedding strategy, with MolRoPE-BERT using rotary position embedding to handle longer sequences</li>
<li><strong>BET</strong> (Chen et al., 2021): Extracts predictive representations from hundreds of millions of molecules</li>
</ul>
<h3 id="decoder-only-models-gpt-family">Decoder-Only Models (GPT Family)</h3>
<p>These models excel at generative tasks, including de novo molecular design:</p>
<ul>
<li><strong>GPT-2-based model</strong> (Adilov, 2021): Generative pre-training from molecules</li>
<li><strong>MolXPT</strong> (Liu et al., 2023): Wraps molecules with text for generative pre-training, connecting chemical and natural language</li>
<li><strong>BioGPT</strong> (Luo et al., 2022): Focuses on biomedical text generation and mining</li>
<li><strong>MolGPT</strong> (Haroon et al., 2023): Uses relative attention to capture token distances and relationships for de novo drug design</li>
<li><strong>Mol-Instructions</strong> (Fang et al., 2023): Large-scale biomolecular instruction dataset for LLMs</li>
</ul>
<h3 id="encoder-decoder-models">Encoder-Decoder Models</h3>
<p>These combine encoding and generation capabilities for sequence-to-sequence tasks:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a></strong> (Irwin et al., 2022): BART-based model for reaction prediction and molecular property prediction</li>
<li><strong>MolT5</strong> (adapted T5): Unified text-to-text framework for molecular tasks</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">SMILES Transformer</a></strong> (Honda et al., 2019): Pre-trained molecular fingerprints for low-data drug discovery</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/">X-MOL</a></strong> (Xue et al., 2020): Large-scale pre-training for molecular understanding</li>
<li><strong><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a></strong> (Born and Manica, 2023): Operates on <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, enabling concurrent regression and generation</li>
<li><strong>TransAntivirus</strong> (Mao et al., 2023): Specialized for antiviral drug design using IUPAC nomenclature</li>
</ul>
<h2 id="tokenization-embedding-and-pre-training-strategies">Tokenization, Embedding, and Pre-Training Strategies</h2>
<h3 id="smiles-tokenization">SMILES Tokenization</h3>
<p>The review identifies tokenization as a critical preprocessing step that affects downstream performance. SMILES tokenization differs from standard NLP tokenization because SMILES strings contain no whitespace to mark token boundaries, and parentheses denote branching in the molecular graph rather than grammatical grouping. The key approaches include:</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Source</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES (AIS)</a></td>
          <td>Ucak et al. (2023)</td>
          <td>Atom-level tokens preserving chemical identity</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a></td>
          <td>Li and Fourches (2021)</td>
          <td>BPE-inspired substructure tokenization</td>
      </tr>
      <tr>
          <td>Byte-Pair Encoding (BPE)</td>
          <td>Chithrananda et al. (2020); Lee and Nam (2022)</td>
          <td>Standard subword tokenization adapted for SMILES</td>
      </tr>
      <tr>
          <td>SMILESTokenizer</td>
          <td>Chithrananda et al. (2020)</td>
          <td>Character-level tokenization with chemical adjustments</td>
      </tr>
  </tbody>
</table>
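<p>As a concrete example of atom-level tokenization (in the spirit of the SMILESTokenizer row above), the regular expression widely used in the reaction-prediction literature keeps bracket atoms, two-letter elements (Cl, Br), and two-digit ring labels (%NN) as single tokens. This sketch is illustrative, not code from any of the reviewed papers:</p>

```python
import re

# Atom-level SMILES tokenizer; alternation order matters so that
# 'Cl' and 'Br' are matched before single-letter 'C' and 'B'.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles):
    tokens = SMILES_REGEX.findall(smiles)
    # Round-trip check: every character must belong to some token.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize("ClCCBr"))  # ['Cl', 'C', 'C', 'Br']
print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # 21 tokens for aspirin
```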
<h3 id="positional-embeddings">Positional Embeddings</h3>
<p>The models use various positional encoding strategies: absolute, relative key, relative key-query, rotary (RoPE), and sinusoidal. Notably, SMILES-based models omit segmentation embeddings since SMILES data consists of single sequences rather than sentence pairs.</p>
<h3 id="pre-training-and-fine-tuning-pipeline">Pre-Training and Fine-Tuning Pipeline</h3>
<p>The standard workflow follows two phases:</p>
<ol>
<li><strong>Pre-training</strong>: Unsupervised training on large unlabeled SMILES databases (ZINC, PubChem, ChEMBL) using masked language modeling (MLM), where the model learns to predict masked tokens within SMILES strings</li>
<li><strong>Fine-tuning</strong>: Supervised adaptation on smaller labeled datasets for specific tasks (classification or regression)</li>
</ol>
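<p>The masking step of phase 1 can be sketched in a few lines. This is generic BERT-style masking (simplified here to always substituting a mask token, whereas BERT also uses random and keep variants), not any specific model's recipe:</p>

```python
import random

random.seed(0)

MASK = "[MASK]"

def mlm_mask(tokens, mask_prob=0.15):
    """Select roughly 15% of positions as prediction targets; the model
    is trained to recover the original tokens at those positions."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if mask_prob > random.random():
            targets[i] = tok     # ground truth used by the MLM loss
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = list("CC(=O)Oc1ccccc1")
masked, targets = mlm_mask(tokens)
print(sum(t == MASK for t in masked) == len(targets))  # True
```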
<p>The self-attention mechanism, central to all transformer CLMs, is formulated as:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}}\right) XW^V
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^Q$, $W^K$, $W^V \in \mathbb{R}^{M \times d_k}$ are learnable weight matrices, and $\sqrt{d_k}$ is the scaling factor.</p>
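<p>A minimal NumPy transcription of this equation (single head, no masking or bias terms) makes the shapes concrete:</p>

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention, matching the
    equation above: Z = softmax((X Wq)(X Wk)^T / sqrt(d_k)) X Wv."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (N, N) token-pair affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                              # (N, d_k)

rng = np.random.default_rng(0)
N, M, d_k = 5, 16, 8  # 5 tokens, 16-dim input features, 8-dim projections
X = rng.normal(size=(N, M))
Wq, Wk, Wv = (rng.normal(size=(M, d_k)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)  # (5, 8)
```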
<h2 id="benchmark-datasets-and-evaluation-landscape">Benchmark Datasets and Evaluation Landscape</h2>
<p>The review catalogs the standard evaluation ecosystem for CLMs. Pre-training databases include ZINC, PubChem, and ChEMBL. Fine-tuning and evaluation rely heavily on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Datasets</th>
          <th>Task Type</th>
          <th>Example Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Regression</td>
          <td>642 to 4,200</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA, MUV, HIV, PDBbind, BACE</td>
          <td>Classification/Regression</td>
          <td>11,908 to 437,929</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox</td>
          <td>Classification</td>
          <td>1,427 to 8,575</td>
      </tr>
  </tbody>
</table>
<p>The authors also propose four new fine-tuning datasets targeting diseases: COVID-19 drug compounds, cocrystal formation, antimalarial drugs (Plasmodium falciparum targets), and cancer gene expression/drug response data.</p>
<h2 id="challenges-limitations-and-future-directions">Challenges, Limitations, and Future Directions</h2>
<h3 id="current-challenges">Current Challenges</h3>
<p>The review identifies several persistent limitations:</p>
<ol>
<li><strong>Data efficiency</strong>: Despite transfer learning, transformer CLMs still require substantial pre-training data, and labeled datasets for specific tasks remain scarce</li>
<li><strong>Interpretability</strong>: The complexity of transformer architectures makes it difficult to understand how specific molecular features contribute to predictions</li>
<li><strong>Computational cost</strong>: Training large-scale models demands significant GPU resources, limiting accessibility</li>
<li><strong>Handling rare molecules</strong>: Models struggle with molecular structures that deviate significantly from training data distributions</li>
<li><strong>SMILES limitations</strong>: Non-unique representations, invalid strings, exceeded atom valency, and inadequate spatial information capture</li>
</ol>
<h3 id="smiles-representation-issues">SMILES Representation Issues</h3>
<p>The authors highlight five specific problems with SMILES as an input representation:</p>
<ul>
<li>Non-canonical representations reduce string uniqueness for the same molecule</li>
<li>Many symbol combinations produce chemically invalid outputs</li>
<li>Valid SMILES strings can encode chemically impossible molecules (e.g., exceeded valency)</li>
<li>Spatial information is inadequately captured</li>
<li>Syntactic and semantic robustness is limited</li>
</ul>
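<p>The first and third issues are easy to demonstrate, assuming RDKit is available; the pentavalent-carbon string below is a standard illustrative example, not one taken from the review:</p>

```python
from rdkit import Chem

# Non-uniqueness: three different strings all denote ethanol,
# and canonicalization collapses them to a single representation.
variants = ["CCO", "OCC", "C(O)C"]
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(len(canonical))  # 1

# Syntactically valid, chemically impossible: a pentavalent carbon
# fails RDKit sanitization, so parsing returns None.
print(Chem.MolFromSmiles("C(C)(C)(C)(C)C") is None)  # True
```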
<h3 id="future-research-directions">Future Research Directions</h3>
<p>The review proposes several directions:</p>
<ul>
<li><strong>Alternative molecular representations</strong>: Exploring <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, IUPAC, and InChI beyond SMILES</li>
<li><strong>Role of SMILES token types</strong>: Strategic masking of metals, non-metals, bonds, and branches during MLM pre-training to identify which components are most critical</li>
<li><strong>Few-shot learning</strong>: Combining few-shot approaches with large-scale pre-trained CLMs for data-scarce scenarios</li>
<li><strong>Drug repurposing</strong>: Training CLMs to distinguish identical compounds with different biological activity profiles across therapeutic domains</li>
<li><strong>Improved benchmarks</strong>: Incorporating disease-specific datasets (malaria, cancer, COVID-19) for more realistic evaluation</li>
<li><strong>Ethical considerations</strong>: Addressing dual-use risks, data biases, and responsible open-source release of CLMs</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a literature review paper. It does not introduce new models, code, or experimental results. The reproducibility assessment focuses on the accessibility of the reviewed works and proposed datasets.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20</td>
          <td>5.5B+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>100M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL</td>
          <td>2M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>642 to 437,929</td>
          <td>Standard benchmark suite</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>COVID-19 drug compounds</td>
          <td>740</td>
          <td>From Harigua-Souiai et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cocrystal formation</td>
          <td>3,282</td>
          <td>From Mswahili et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Antimalarial drugs</td>
          <td>4,794</td>
          <td>From Mswahili et al. (2024)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cancer gene/drug response</td>
          <td>201 drugs, 734 cell lines</td>
          <td>From Kim et al. (2021)</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://dai.chungbuk.ac.kr/">DAI Lab website</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Authors&rsquo; research lab</td>
      </tr>
  </tbody>
</table>
<p>No code, models, or evaluation scripts are released with this review. The paper does not include a supplementary materials section or GitHub repository.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (literature review).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Mswahili, M. E., &amp; Jeong, Y.-S. (2024). Transformer-based models for chemical SMILES representation: A comprehensive literature review. <em>Heliyon</em>, 10(20), e39038. <a href="https://doi.org/10.1016/j.heliyon.2024.e39038">https://doi.org/10.1016/j.heliyon.2024.e39038</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mswahili2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based models for chemical {SMILES} representation: A comprehensive literature review}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Mswahili, Medard Edmund and Jeong, Young-Seob}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Heliyon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e39038}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.heliyon.2024.e39038}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>t-SMILES: Tree-Based Fragment Molecular Encoding</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</guid><description>t-SMILES encodes fragmented molecules as SMILES-type strings via breadth-first traversal of full binary trees, reducing nesting depth and improving generation.</description><content:encoded><![CDATA[<h2 id="a-fragment-based-molecular-representation-method">A Fragment-Based Molecular Representation Method</h2>
<p>This is a <strong>Method</strong> paper that proposes t-SMILES (tree-based SMILES), a framework for representing molecules as SMILES-type strings derived from fragment-based decompositions. The primary contribution is an encoding algorithm that converts fragmented molecular graphs into full binary trees (FBTs) and then traverses them breadth-first to produce linear strings. Three coding variants are introduced: TSSA (shared atom), TSDY (dummy atom without ID), and TSID (dummy atom with ID). The framework achieves 100% theoretical validity, higher novelty scores, and improved distribution-learning metrics compared to classical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> across ChEMBL, ZINC, and QM9 benchmarks.</p>
<h2 id="why-fragment-based-representations-matter-for-molecular-generation">Why Fragment-Based Representations Matter for Molecular Generation</h2>
<p>Classical SMILES encodes molecules via depth-first traversal of the molecular graph, requiring parentheses and ring identifiers to appear in matched pairs with deep nesting. When generative models (LSTM, Transformer) are trained on SMILES, they frequently produce chemically invalid strings, particularly when trained on small datasets, because they struggle to learn these long-range pairing constraints. DeepSMILES addresses some syntactical issues but still permits semantic violations (e.g., oxygen with three bonds). SELFIES guarantees 100% valid strings, but at the cost of readability and, as the authors show, lower <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a> scores, indicating that generated molecules diverge from the training distribution.</p>
<p>Fragment-based approaches reduce the search space compared to atom-level methods and can provide insights into molecular recognition (e.g., protein-ligand interactions). However, existing fragment-based deep learning methods rely on fixed dictionaries of candidate fragments, creating in-vocabulary/out-of-vocabulary problems and high-dimensional sparse representations. The encoding of fragments as SMILES-type strings, rather than dictionary IDs, had not been systematically explored before this work.</p>
<p>The authors draw on the observation that fragments in organic molecules follow a <a href="https://en.wikipedia.org/wiki/Zipf's_law">Zipf-like</a> rank distribution similar to words in natural language, motivating the use of NLP techniques for fragment-based molecular modeling.</p>
<h2 id="core-innovation-binary-tree-encoding-of-fragmented-molecules">Core Innovation: Binary Tree Encoding of Fragmented Molecules</h2>
<p>The t-SMILES algorithm proceeds in three steps:</p>
<ol>
<li><strong>Fragmentation</strong>: A molecule is decomposed into valid chemical fragments using a chosen algorithm (JTVAE, BRICS, <a href="https://en.wikipedia.org/wiki/Matched_molecular_pair_analysis">MMPA</a>, or Scaffold), producing a fragmented molecular graph.</li>
<li><strong>Tree construction</strong>: The fragmented graph is converted into an Acyclic Molecular Tree (AMT), which is a reduced graph where nodes represent fragments and edges represent bonds between them. The AMT is then transformed into a Full Binary Tree (FBT), where every internal node has exactly two children.</li>
<li><strong>String generation</strong>: The FBT is traversed using breadth-first search (BFS) to produce the t-SMILES string.</li>
</ol>
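<p>Step 3 can be illustrated with a toy traversal. This simplified sketch (the node layout and fragment strings are invented, and it is not the authors' implementation) shows how BFS order linearizes a full binary tree, with the ampersand marking empty nodes and the caret separating segments:</p>

```python
from collections import deque

EMPTY = chr(38)  # the ampersand marker written at empty tree nodes

class Node:
    """Full-binary-tree node holding one fragment's SMILES string."""
    def __init__(self, smiles, left=None, right=None):
        self.smiles, self.left, self.right = smiles, left, right

def bfs_encode(root):
    """Breadth-first traversal producing a t-SMILES-style string."""
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            out.append(EMPTY)  # absent child: emit the empty-node marker
            continue
        out.append(node.smiles)
        queue.append(node.left)
        queue.append(node.right)
    return "^".join(out)  # caret separates adjacent fragment segments

# Toy tree: a benzene scaffold with two substituent fragments.
tree = Node("c1ccccc1", Node("CC"), Node("O"))
encoded = bfs_encode(tree)
print(encoded.split("^")[:3])   # ['c1ccccc1', 'CC', 'O']
print(encoded.count(EMPTY))     # 4 empty-node markers (two per leaf)
```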
<p>The framework introduces only two new symbols beyond standard SMILES: <code>&amp;</code> marks empty tree nodes (branch terminators providing global structural information), and <code>^</code> separates adjacent substructure segments (analogous to spaces between words in English).</p>
<h3 id="three-coding-variants">Three Coding Variants</h3>
<ul>
<li><strong>TSSA</strong> (shared atom): Two fragments share a real atom at their connection point. Produces the highest novelty scores and is recommended for goal-directed tasks.</li>
<li><strong>TSDY</strong> (dummy atom, no ID): Uses dummy atoms (marked with <code>*</code>) to indicate bonding points. Provides a balanced choice between novelty and distribution fidelity.</li>
<li><strong>TSID</strong> (dummy atom with ID): Uses numbered dummy atoms (<code>[n*]</code>) for unambiguous reconstruction. Produces the most faithful distribution reproduction and is recommended for distribution-learning tasks.</li>
</ul>
<h3 id="structural-advantages">Structural Advantages</h3>
<p>The key structural benefit is a dramatic reduction in nesting depth. For TSDY_M on ChEMBL, the proportion of tokens at nesting depth 0-1-2 increases from 68.0% (SMILES) to 99.3%, while depth 3-4-5 drops from 31.9% to 0.7%, and depth 6-11 drops from 0.1% to 0.0002%. The <code>&amp;</code> symbol, which encodes molecular topology, does not need to appear in pairs (unlike parentheses in SMILES), and its high frequency means it does not create a scarcity problem for learning.</p>
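<p>The nesting-depth statistic can be computed with a short helper. The depth-counting convention here (each parenthesis counted at the deeper level) is a sketch choice, not necessarily the paper's exact definition:</p>

```python
def depth_profile(s):
    """Fraction of characters at each parenthesis nesting depth,
    the kind of statistic behind the 68.0% vs 99.3% comparison."""
    counts, depth = {}, 0
    for ch in s:
        if ch == "(":
            depth += 1
        counts[depth] = counts.get(depth, 0) + 1
        if ch == ")":
            depth -= 1
    total = len(s)
    return {d: round(c / total, 3) for d, c in sorted(counts.items())}

# A deeply nested classical SMILES vs. a flat fragment-style string.
print(depth_profile("CC(C(C(C)O)N)C"))  # mass spread over depths 0..3
print(depth_profile("CC^CO^C"))         # no parentheses: all at depth 0
```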
<p>The framework also supports a multi-code system where classical SMILES can be integrated as a special case called TS_Vanilla, and multiple fragmentation-based codes can be combined into hybrid models.</p>
<h3 id="reconstruction-and-data-augmentation">Reconstruction and Data Augmentation</h3>
<p>Molecules can be reconstructed from t-SMILES strings by reversing the process: rebuilding the FBT from the string, converting to AMT, and assembling fragments into a molecular graph. This reconstruction process can itself generate novel molecules without any model training by randomly assembling fragments. On ChEMBL, TSSA reconstruction achieves uniqueness above 0.98 and novelty above 0.68 for all four fragmentation algorithms, with 100% validity.</p>
<p>Data augmentation in t-SMILES operates at four levels: (1) different decomposition algorithms, (2) reconstruction, (3) enumeration of fragment strings, and (4) enumeration of FBTs. Unlike <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> (which only produces different strings for the same molecule), t-SMILES reconstruction generates genuinely different molecules from the same fragment set.</p>
<h2 id="systematic-evaluation-across-multiple-benchmarks">Systematic Evaluation Across Multiple Benchmarks</h2>
<p>All experiments use MolGPT (a Transformer-decoder model) as the primary generative model. Three types of metrics are employed: distribution-learning benchmarks, goal-directed benchmarks, and Wasserstein distance metrics for physicochemical properties.</p>
<h3 id="low-resource-datasets-jnk3-and-aid1706">Low-Resource Datasets (JNK3 and AID1706)</h3>
<p>On <a href="https://en.wikipedia.org/wiki/MAPK10">JNK3</a> (923 active molecules), the authors investigate overfitting behavior across training epochs:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Novelty</th>
          <th>FCD</th>
          <th>Active Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES [R200]</td>
          <td>0.795</td>
          <td>0.120</td>
          <td>0.584</td>
          <td>0.072</td>
      </tr>
      <tr>
          <td>SMILES [R2000]</td>
          <td>1.000</td>
          <td>0.001</td>
          <td>0.765</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>SELFIES [R200]</td>
          <td>1.000</td>
          <td>0.238</td>
          <td>0.544</td>
          <td>0.148</td>
      </tr>
      <tr>
          <td>SELFIES [R2000]</td>
          <td>1.000</td>
          <td>0.008</td>
          <td>0.767</td>
          <td>0.050</td>
      </tr>
      <tr>
          <td>TSSA_S [R300]</td>
          <td>1.000</td>
          <td>0.833</td>
          <td>0.564</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>TSSA_S [R5000]</td>
          <td>1.000</td>
          <td>0.817</td>
          <td>0.608</td>
          <td>0.564</td>
      </tr>
      <tr>
          <td>TF_TSSA_S [R5]</td>
          <td>1.000</td>
          <td>0.932</td>
          <td>0.483</td>
          <td>0.710</td>
      </tr>
      <tr>
          <td>TSSA_S_Rec50 [R10]</td>
          <td>1.000</td>
          <td>0.962</td>
          <td>0.389</td>
          <td>0.829</td>
      </tr>
  </tbody>
</table>
<p>Key findings: SMILES and DeepSMILES novelty scores collapse to near zero after 200 epochs, while t-SMILES novelty stabilizes around 0.8. The highest active-novel score of 0.829 comes from t-SMILES with reconstruction-based data augmentation. Transfer learning with t-SMILES maintains novelty of 0.710 at 5 epochs versus 0.526 for SMILES, and at 100 epochs the gap widens dramatically (0.569 vs. 0.023).</p>
<h3 id="distribution-learning-on-chembl">Distribution Learning on ChEMBL</h3>
<p>t-SMILES models outperform graph baselines (Graph MCTS, hG2G, MGM) and fragment-based methods (FASMIFRA). TSID_B and TSID_S achieve FCD scores of 0.909 while maintaining novelty of 0.941 and 0.933, surpassing SMILES (FCD 0.906, novelty 0.907) in both dimensions. TSDY and TSID models consistently outperform TSSA on distribution fidelity for larger molecules.</p>
<h3 id="goal-directed-tasks-on-chembl">Goal-Directed Tasks on ChEMBL</h3>
<p>On 20 <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> subtasks, different fragmentation algorithms excel at different tasks. The goal-directed reconstruction algorithm significantly outperforms random reconstruction. On the <a href="https://en.wikipedia.org/wiki/Sitagliptin">Sitagliptin</a> MPO task (T16.SMPO), the TSDY_M model with goal-directed reconstruction achieves a score of 0.930, compared to 0.598 for SMILES and 0.708 for CReM. On <a href="https://en.wikipedia.org/wiki/Valsartan">Valsartan</a> SMARTS (T18.VS), t-SMILES models reach 0.997 versus 0.985 for SMILES.</p>
<h3 id="distribution-learning-on-zinc-and-qm9">Distribution Learning on ZINC and QM9</h3>
<p>On ZINC, t-SMILES models significantly outperform existing fragment-based baselines (JTVAE, FragDgm). Seven t-SMILES models achieve both higher FCD and novelty scores than SELFIES. On QM9 (smaller molecules), all string-based models achieve high FCD scores (above 0.960), with t-SMILES performing better than existing string and graph approaches.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Across ChEMBL and ZINC, TSDY and TSID models capture physicochemical property distributions (MolWt, LogP, SAScore, N_Atoms, N_Rings, etc.) more faithfully than TSSA models. Multiple t-SMILES models outperform SMILES in more than four out of nine property categories. Baseline models hG2G and JTVAE show the weakest pattern learning, producing molecules with fewer atoms and rings than the training data.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<ol>
<li>t-SMILES achieves 100% theoretical validity by fragmenting molecules into chemically valid pieces before encoding.</li>
<li>The framework avoids the overfitting problem on low-resource datasets, maintaining stable novelty scores where SMILES, DeepSMILES, and SELFIES collapse.</li>
<li>The multi-code system allows different coding algorithms to complement each other, with hybrid models accessing broader chemical space.</li>
<li>Goal-directed reconstruction significantly outperforms all baselines on targeted optimization tasks.</li>
<li>TSDY and TSID provide better distribution fidelity than TSSA on larger molecules, while TSSA excels at novelty generation for goal-directed tasks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Whether the tree structure of t-SMILES can be effectively learned by Large Language Models remains unexplored.</li>
<li>Only published fragmentation algorithms were tested; custom fragmentation schemes were not investigated.</li>
<li>Experiments on more complex (larger) molecules were not performed.</li>
<li>The reconstruction algorithm uses simple rules for fragment assembly; more sophisticated assembly methods (Monte Carlo tree search, CReM) could improve quality.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest exploring advanced reconstruction and optimization algorithms, improved generative models, evolutionary techniques, and extending t-SMILES to property prediction, retrosynthesis, and reaction prediction tasks. The framework is also extensible to other string representations (t-DSMILES, t-SELFIES) by changing how fragments are encoded.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Low-resource evaluation</td>
          <td>JNK3</td>
          <td>923 active molecules</td>
          <td>Kinase inhibitors</td>
      </tr>
      <tr>
          <td>Low-resource evaluation</td>
          <td>AID1706</td>
          <td>329 active molecules</td>
          <td>SARS 3CLPro inhibitors</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ChEMBL</td>
          <td>Standard split</td>
          <td>Large drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ZINC</td>
          <td>250K subset</td>
          <td>Medium drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>QM9</td>
          <td>~134K molecules</td>
          <td>Small organic molecules</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: JTVAE, BRICS, MMPA, Scaffold (all via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>)</li>
<li><strong>Tree construction</strong>: AMT from reduced graph, then FBT transformation</li>
<li><strong>Traversal</strong>: Breadth-first search on FBT</li>
<li><strong>Generative model</strong>: MolGPT (Transformer decoder)</li>
<li><strong>Discriminative model</strong>: AttentiveFP for activity prediction on JNK3/AID1706</li>
</ul>
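<p>The tree-construction and traversal steps above can be illustrated with a small, dependency-free Python sketch. The fragment SMILES, the <code>&amp;</code> placeholder for empty children, and the comma separator are illustrative stand-ins for this note, not the actual t-SMILES token vocabulary:</p>

```python
from collections import deque

class Node:
    """A node in a (full) binary tree of molecular fragments."""
    def __init__(self, smiles, left=None, right=None):
        self.smiles = smiles
        self.left = left
        self.right = right

def bfs_serialize(root, empty_token="&", sep=","):
    """Serialize a fragment tree breadth-first into one string.

    Empty children are written with a placeholder token so the tree
    shape can be recovered when decoding the string back to a tree.
    """
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            out.append(empty_token)
            continue
        out.append(node.smiles)
        queue.append(node.left)
        queue.append(node.right)
    return sep.join(out)

# Toy fragment tree: a benzene "backbone" with two substituent fragments.
tree = Node("c1ccccc1", Node("CC(=O)O"), Node("CN"))
print(bfs_serialize(tree))
```

<p>Because every internal position is padded with the placeholder token, the breadth-first string uniquely determines the tree shape, which is what makes the encoding reversible.</p>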
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated strings that decode to valid molecules</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of distinct molecules among valid generations</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
      <tr>
          <td>KLD</td>
          <td>Kullback-Leibler divergence for physicochemical property distributions</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>Fréchet ChemNet Distance measuring chemical similarity to the training set</td>
      </tr>
      <tr>
          <td>Active Novel</td>
          <td>Novel molecules predicted active by AttentiveFP</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/juanniwu/t-SMILES">t-SMILES GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with training/generation scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/ZENODO.10991703">Zenodo deposit</a></td>
          <td>Code + Data</td>
          <td>CC-BY-4.0</td>
          <td>Archived code and data</td>
      </tr>
      <tr>
          <td><a href="https://codeocean.com/capsule/3034546/tree">Code Ocean capsule</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Certified reproducible compute capsule</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions limited computational resources but does not specify exact GPU types or training times.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, J.-N., Wang, T., Chen, Y., Tang, L.-J., Wu, H.-L., &amp; Yu, R.-Q. (2024). t-SMILES: a fragment-based molecular representation framework for de novo ligand design. <em>Nature Communications</em>, 15, 4993.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{t-SMILES: a fragment-based molecular representation framework for de novo ligand design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Juan-Ni and Wang, Tong and Chen, Yue and Tang, Li-Juan and Wu, Hai-Long and Yu, Ru-Qin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4993}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-49388-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Systematic Review of Deep Learning CLMs (2020-2024)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/systematic-review-deep-learning-clms/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/systematic-review-deep-learning-clms/</guid><description>Systematic review of 72 deep learning molecular generation studies using MOSES and GuacaMol benchmarks across RNNs, transformers, VAEs, and GANs.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-chemical-language-models-for-molecular-generation">A Systematization of Chemical Language Models for Molecular Generation</h2>
<p>This paper is a <strong>Systematization</strong> that provides a comprehensive, PRISMA-guided systematic review of deep learning chemical language models (CLMs) used for de novo molecular generation. The primary contribution is a structured statistical analysis of 72 retrieved articles from 2020 to June 2024, comparing architectures (RNNs, transformers, VAEs, GANs, S4 models), molecular representations, biased generation strategies, and quality metrics from the MOSES and GuacaMol benchmarking platforms. The review addresses five research questions about architecture configuration effects, best-performing architectures, impactful hyperparameters, common molecular representations, and effective biased generation methods.</p>
<h2 id="motivation-evaluating-four-years-of-generative-clm-progress">Motivation: Evaluating Four Years of Generative CLM Progress</h2>
<p>Deep learning molecular generation has expanded rapidly since 2018, when <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> and <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al.</a> demonstrated that deep generative models could learn to produce novel molecules from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations. By 2020, multiple architectures (RNNs, transformers, VAEs, GANs) were being applied to chemical language modeling, and benchmarking platforms like <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> had been introduced to enable standardized evaluation.</p>
<p>Despite this growth, existing reviews largely focused on theoretical background or drug development applications rather than systematic statistical comparison of model performance. Few studies had examined how architecture choice, training dataset size, molecular representation format, and biased learning strategies interact to affect generation quality metrics like validity, uniqueness, and novelty. This review fills that gap by restricting the analysis to papers reporting MOSES or GuacaMol metrics, enabling quantitative cross-study comparison.</p>
<h2 id="prisma-based-systematic-review-methodology">PRISMA-Based Systematic Review Methodology</h2>
<p>The review follows the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) guidelines. Articles were retrieved from Scopus, Web of Science, and Google Scholar using six Boolean search queries combining terms like &ldquo;Molecule Generation,&rdquo; &ldquo;Chemical Language Models,&rdquo; &ldquo;Deep Learning,&rdquo; and specific architecture names. The search window covered January 2020 to June 2024.</p>
<h3 id="eligibility-criteria">Eligibility Criteria</h3>
<p>Papers were included if they:</p>
<ol>
<li>Were written in English</li>
<li>Explicitly presented at least two metrics of uniqueness, validity, or novelty</li>
<li>Defined these metrics consistent with MOSES or GuacaMol concepts</li>
<li>Used deep learning generative models for de novo molecule design</li>
<li>Used conventional (non-quantum) deep learning methods</li>
<li>Were published between January 2020 and June 2024</li>
</ol>
<p>This yielded 48 articles from the query-based search and 25 from the citation search; after removing overlap, 72 articles were retained. Of these, 62 used CLM approaches (string-based molecular representations) and 10 used graph-based representations.</p>
<h3 id="data-collection">Data Collection</h3>
<p>For each article, the authors extracted: journal details, database name, training dataset size, molecular representation type (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, InChI, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>), architecture details (embedding length, layers, hidden units, trainable parameters, dropout, temperature, batch size, epochs, learning rate, optimizer), biased method usage (TL, RL, conditional learning), and generation metrics (validity, uniqueness, novelty, scaffold diversity, SNN, FCD).</p>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The review focuses on three core MOSES metrics:</p>
<p>$$
\text{Validity} = \frac{|V_m|}{\text{Molecules produced}}
$$</p>
<p>$$
\text{Uniqueness} = \frac{|\text{set}(V_m)|}{|V_m|}
$$</p>
<p>$$
\text{Novelty} = 1 - \frac{|V_m \cap T_d|}{|V_m|}
$$</p>
<p>where $V_m$ denotes the (multi)set of valid generated molecules and $T_d$ the training dataset.</p>
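<p>These three metrics are straightforward to compute once a validity check is available. A minimal sketch in plain Python, where the validity predicate is a stand-in for an RDKit round-trip parse (an assumption for illustration, not part of the MOSES definition):</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Compute MOSES-style validity, uniqueness, and novelty.

    `is_valid` is a caller-supplied predicate; in practice it would
    check that the string parses to a molecule (e.g. with RDKit).
    """
    valid = [m for m in generated if is_valid(m)]
    validity = len(valid) / len(generated)
    uniqueness = len(set(valid)) / len(valid)
    # Novelty: fraction of valid molecules absent from the training set.
    in_train = sum(1 for m in valid if m in training_set)
    novelty = 1 - in_train / len(valid)
    return validity, uniqueness, novelty

# Toy example with strings standing in for canonical SMILES.
train = {"CCO", "CCN"}
gen = ["CCO", "CCC", "CCC", "not-a-molecule"]
v, u, n = generation_metrics(gen, train, lambda s: s != "not-a-molecule")
print(v, u, n)  # validity 0.75; uniqueness and novelty both 2/3
```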
<h2 id="architecture-distribution-and-performance-comparison">Architecture Distribution and Performance Comparison</h2>
<h3 id="architecture-trends-2020-2024">Architecture Trends (2020-2024)</h3>
<p>The review found that RNNs and transformers dominate CLM usage, with a growing trend toward transformers over time. The breakdown across 62 CLM articles: 24 RNN-based, 23 transformer-based, 16 VAE-based, 8 GAN-based, and 1 S4-based model. Among RNN variants, LSTM was the most common, followed by GRU, despite GRU having fewer trainable parameters.</p>
<p>The increase in transformer adoption is attributed to self-attention mechanisms enabling parallel computation and effective long-range dependency capture. Meanwhile, GANs and VAEs saw lower adoption rates, partly due to higher memory and time complexity and reduced ability to generate large molecules.</p>
<h3 id="molecular-representations-and-databases">Molecular Representations and Databases</h3>
<p>SMILES was used exclusively in 77.27% of CLM articles, reflecting its wide database availability and compact format. <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and InChI each appeared in smaller fractions. The dominant databases were ChEMBL and ZINC (27 articles each), followed by PubChem (4 articles). Approximately 71% of reviewed articles focused on drug discovery applications.</p>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Molecules (millions)</th>
          <th>Representation</th>
          <th>Articles</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>2.4</td>
          <td>SMILES, InChI</td>
          <td>27</td>
      </tr>
      <tr>
          <td>ZINC</td>
          <td>750</td>
          <td>SMILES</td>
          <td>27</td>
      </tr>
      <tr>
          <td>PubChem</td>
          <td>115.3</td>
          <td>SMILES, InChI</td>
          <td>4</td>
      </tr>
      <tr>
          <td>COCONUT</td>
          <td>0.695</td>
          <td>SMILES, InChI</td>
          <td>1</td>
      </tr>
      <tr>
          <td>DNA-Encoded Library</td>
          <td>1,040</td>
          <td>SMILES</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<h3 id="unbiased-model-performance">Unbiased Model Performance</h3>
<p><strong>Validity</strong>: No statistically significant differences were observed across architecture families. Transformers generally achieved high validity through self-attention mechanisms that retain uncompressed sequence information. However, one transformer model (TransMol) achieved only 6.9% validity when using stochastic sampling with Gaussian noise to explore unseen chemical space. GANs showed high dispersion, with validity as low as 8.5% when learning from gene expression signatures rather than molecular structures directly.</p>
<p><strong>Uniqueness</strong>: No significant differences in median uniqueness across architectures. Transformer-based models using masked self-attention achieved near-perfect uniqueness scores. Scaffold decoration and fragment-linking approaches sometimes compromised uniqueness due to overfit-driven redundancy.</p>
<p><strong>Validity-Novelty Trade-off</strong>: The authors propose a &ldquo;Valid/Sample&rdquo; metric (Validity × Novelty) and find an inverse trend between validity and novelty (Spearman $\rho = -0.3575$, p-value = 0.0618). Only 17.9% of models achieved above-median values for both validity (95.6%) and novelty (96.5%) simultaneously. SELFIES-based models achieve 100% validity by construction, which can help address this trade-off.</p>
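<p>The Valid/Sample metric and the rank-correlation check behind the reported inverse trend can be sketched in a few lines of Python. The per-model scores below are hypothetical, chosen only to illustrate a perfectly inverse ranking, and the tie-free Spearman implementation is a simplification of what <code>scipy.stats.spearmanr</code> provides:</p>

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (no tie correction; illustrative only)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-model scores illustrating the inverse trend:
validity = [0.99, 0.97, 0.95, 0.90, 0.85]
novelty = [0.80, 0.85, 0.90, 0.95, 0.99]
valid_per_sample = [a * b for a, b in zip(validity, novelty)]
print(spearman_rho(validity, novelty))  # -1.0 for perfectly inverse ranks
```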
<h3 id="biased-model-performance">Biased Model Performance</h3>
<p>The review examines three biased generation strategies:</p>
<p><strong>Transfer Learning (TL)</strong>: The most prevalent biased method, used across all architecture types. Fine-tuning transfers pre-trained parameters to a target model, requiring significantly fewer training molecules (median ~2,507 vs. ~1.1M for unbiased). TL does not significantly affect validity (p = 0.16) or novelty (p = 0.84), but uniqueness decreases significantly (median 90.2% vs. 97.9%, p = 0.014), likely due to overfitting on small target datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>TL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training size</td>
          <td>1,128,920</td>
          <td>2,507</td>
          <td>&lt;0.0001</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>98.05%</td>
          <td>95.5%</td>
          <td>0.1602</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>97.9%</td>
          <td>90.2%</td>
          <td>0.0144</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.6%</td>
          <td>96.0%</td>
          <td>0.8438</td>
      </tr>
  </tbody>
</table>
<p><strong>Reinforcement Learning (RL)</strong>: Applied only to RNNs and transformers in the reviewed set. 90.1% of RL implementations used policy gradient methods with scoring functions for properties like synthesizability, binding affinity, and membrane permeability. No significant effects on generation metrics were observed.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>RL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>91.1%</td>
          <td>96.5%</td>
          <td>0.1289</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>89.7%</td>
          <td>0.0935</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.5%</td>
          <td>93.5%</td>
          <td>0.2500</td>
      </tr>
  </tbody>
</table>
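<p>The policy-gradient idea behind these RL pipelines can be reduced to a toy, stdlib-only sketch: a categorical &ldquo;policy&rdquo; over three hypothetical fragment tokens is pushed toward whatever a scoring function rewards. Real systems score full generated SMILES with trained property predictors; the one-token setup here only illustrates the REINFORCE update:</p>

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, score, lr=0.5, rng=random):
    """One REINFORCE update for a categorical policy over tokens.

    `score` plays the role of the scoring function (e.g. predicted
    activity or synthesizability) used to bias generation.
    """
    probs = softmax(logits)
    # Sample an action (a generated token) from the current policy.
    a = rng.choices(range(len(logits)), weights=probs)[0]
    r = score(a)
    # Gradient of log p(a) w.r.t. logit k is 1[k == a] - p_k.
    for k in range(len(logits)):
        grad = (1.0 if k == a else 0.0) - probs[k]
        logits[k] += lr * r * grad
    return logits

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
reward = lambda a: 1.0 if a == 2 else 0.0  # scoring function favors token 2
for _ in range(200):
    reinforce_step(logits, reward, rng=rng)
print(softmax(logits))  # probability mass concentrates on token 2
```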
<p><strong>Conditional Learning (CL)</strong>: Integrates domain-specific data (properties, bioactivities, functional groups) directly into training via constraint tokens or property embeddings. Used primarily with encoder-decoder architectures (ARAEs, VAEs, transformers). CL does not significantly degrade generation metrics relative to unbiased models.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>CL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>98.5%</td>
          <td>96.8%</td>
          <td>0.4648</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>97.5%</td>
          <td>0.0753</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>89.3%</td>
          <td>99.6%</td>
          <td>0.2945</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-directions-for-chemical-language-models">Key Findings and Directions for Chemical Language Models</h2>
<h3 id="main-conclusions">Main Conclusions</h3>
<ol>
<li>
<p><strong>Transformers are overtaking RNNs</strong> as the dominant CLM architecture, driven by self-attention mechanisms that capture long-range dependencies without the gradient vanishing issues of recurrent models.</p>
</li>
<li>
<p><strong>SMILES remains dominant</strong> (77% of models) despite known limitations (non-uniqueness, syntax errors). SELFIES shows promise for improving the validity-novelty trade-off.</p>
</li>
<li>
<p><strong>No architecture achieves both high validity and high novelty easily.</strong> Only 17.9% of unbiased models exceeded medians for both metrics simultaneously, highlighting a fundamental tension in generative chemistry.</p>
</li>
<li>
<p><strong>Transfer learning requires only ~2,500 molecules</strong> to generate targeted compounds, compared to ~1.1M for unbiased training, but at the cost of reduced uniqueness.</p>
</li>
<li>
<p><strong>Combining biased methods</strong> (e.g., TL + RL, CL + TL) shows promise for multi-objective optimization and exploring distant regions of chemical space.</p>
</li>
<li>
<p><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/s4-chemical-language-modeling/">S4 models</a></strong> were first applied to CLMs in 2023 and show competitive performance, owing to their dual nature: convolutional computation during training and recurrent computation during generation.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The review is restricted to papers reporting MOSES or GuacaMol metrics, which excludes many molecular generation studies that use alternative evaluation frameworks. The statistical comparisons rely on median values reported across different experimental settings, making direct architecture comparisons approximate. Graph-based approaches are included only for coarse comparison (10 of 72 articles) and are not the focus of the analysis.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a systematic review, so no new models were trained. The authors collected metadata from 72 published articles. No datasets were generated or analyzed beyond the literature corpus.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Statistical comparisons used Mann-Whitney U tests for independent samples. Spearman correlation was used to assess the validity-novelty relationship. Outlier identification used the Valid/Sample (Validity × Novelty) metric with box-plot analysis.</p>
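<p>Since the Mann-Whitney U statistic is just a pairwise comparison count, it is easy to sketch directly (in practice one would call <code>scipy.stats.mannwhitneyu</code>; the scores below are hypothetical, not values from the review):</p>

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for two independent samples.

    Counts, over all (x, y) pairs, how often x exceeds y, with ties
    contributing 1/2. Shown for illustration only.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical uniqueness scores: unbiased vs. transfer-learned models.
unbiased = [0.99, 0.98, 0.97, 0.96]
tl = [0.91, 0.90, 0.88, 0.85]
print(mann_whitney_u(unbiased, tl))  # 16.0: every unbiased score exceeds every TL score
```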
<h3 id="evaluation">Evaluation</h3>
<p>The review evaluates models using MOSES metrics: validity, uniqueness, novelty, scaffold diversity, scaffold novelty, fragment similarity, SNN, internal diversity, and <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>. Statistical tests were applied to compare medians across architecture families and between biased and unbiased models.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (systematic review, no model training performed).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flores-Hernandez, H., &amp; Martínez-Ledesma, E. (2024). A systematic review of deep learning chemical language models in recent era. <em>Journal of Cheminformatics</em>, 16(1), 129. <a href="https://doi.org/10.1186/s13321-024-00916-y">https://doi.org/10.1186/s13321-024-00916-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{floreshernandez2024systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of deep learning chemical language models in recent era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flores-Hernandez, Hector and Mart{\&#39;i}nez-Ledesma, Emmanuel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{129}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00916-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Survey of Transformer Architectures in Molecular Science</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformers-molecular-science-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformers-molecular-science-review/</guid><description>A comprehensive review of 12 transformer architectures applied to molecular science, covering GPT, BERT, BART, graph transformers, and more.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-architectures-for-molecular-science">A Systematization of Transformer Architectures for Molecular Science</h2>
<p>This paper is a <strong>Systematization</strong> review. It organizes and taxonomizes 12 families of transformer architectures that have been applied across molecular science, including chemistry, biology, and drug discovery. The primary contribution is not a new method or dataset, but a structured technical overview of the algorithmic internals of each transformer variant and their specific applications to molecular problems. The review covers 201 references and provides a unified treatment of how these architectures capture molecular patterns from sequential, graphical, and image-based data.</p>
<h2 id="bridging-the-gap-between-transformer-variants-and-molecular-applications">Bridging the Gap Between Transformer Variants and Molecular Applications</h2>
<p>Transformer-based models have become widespread in molecular science, yet the authors identify a gap: there is no organized taxonomy linking these diverse techniques in the existing literature. Individual papers introduce specific architectures or applications, but practitioners lack a unified reference that explains the technical differences between GPT, BERT, BART, graph transformers, and other variants in the context of molecular data. The review aims to fill this gap by providing an in-depth investigation of the algorithmic components of each model family, explaining how their architectural innovations contribute to processing complex molecular data.</p>
<p>The authors note that the success of transformers in molecular science stems from several factors: the sequential nature of chemical and biological molecules (DNA, RNA, proteins, SMILES strings), the attention mechanism&rsquo;s ability to capture long-range dependencies within molecular structures, and the capacity for transfer learning through pre-training on large chemical and biological datasets.</p>
<h2 id="twelve-transformer-families-and-their-molecular-mechanisms">Twelve Transformer Families and Their Molecular Mechanisms</h2>
<p>The review covers transformer preliminaries before diving into 12 specific architecture families. The core self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimension of the key vectors. The position-wise feed-forward network is:</p>
<p>$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$</p>
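<p>The attention equation above is compact enough to implement without any ML framework. A plain-Python sketch on nested lists, illustrative only (it omits the multi-head projections and masking used in practice):</p>

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on plain nested lists.

    Q has shape (n, d_k), K has shape (m, d_k), V has shape (m, d_v);
    the result has shape (n, d_v).
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

<p>The query aligns with the first key, so the output is pulled toward the first value vector while still mixing in some of the second, which is exactly the soft, differentiable lookup that lets transformers relate distant tokens in a molecular string.</p>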
<p>The 12 architecture families covered are:</p>
<ol>
<li>
<p><strong>GPT (Generative Pre-trained Transformer)</strong>: Uses the decoder part of the transformer for autoregressive generation. Applications include MolGPT for molecular generation, DrugGPT for protein-ligand binding, and cMolGPT for target-specific de novo molecular generation.</p>
</li>
<li>
<p><strong>BERT (Bidirectional Encoder Representations from Transformers)</strong>: Uses transformer encoders with masked language modeling and next-sentence prediction for pre-training. Molecular applications include FP-BERT for molecular property prediction using composite fingerprint representations, Graph-BERT for protein-protein interaction identification, SMILES-BERT, and Mol-BERT.</p>
</li>
<li>
<p><strong>BART (Bidirectional and Auto-Regressive Transformers)</strong>: Functions as a denoising autoencoder with both encoder and decoder. Molecular applications include Chemformer for sequence-to-sequence chemistry tasks, MS2Mol for mass spectrometry analysis, and MolBART for molecular feature learning.</p>
</li>
<li>
<p><strong>Graph Transformer</strong>: Leverages self-attention on graph-structured data to capture global context. Applications include GraphSite for protein-DNA binding site prediction (using AlphaFold2 structure predictions), KPGT for knowledge-guided molecular graph pre-training, and PAGTN for establishing long-range dependencies in molecular graphs.</p>
</li>
<li>
<p><strong>Transformer-XL</strong>: Incorporates relative positional encoding for modeling long sequences. Used for small-molecule retention time prediction, drug design with ChEMBL data (1.27 million molecules), and Heck reaction generation.</p>
</li>
<li>
<p><strong><a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5 (Text-to-Text Transfer Transformer)</a></strong>: Unifies NLP tasks into text-to-text mapping. T5Chem was pre-trained on 97 million molecules from PubChem and achieved 99.5% accuracy on reaction classification (USPTO 500 MT). C5T5 uses IUPAC naming for molecular optimization in drug discovery.</p>
</li>
<li>
<p><strong>Vision Transformer (ViT)</strong>: Applies transformer architecture to image patches. Used for organic molecule classification (97% accuracy with WGAN-generated data), bacterial identification via SERS, and molecular property prediction from mass spectrometry data (TransG-Net).</p>
</li>
<li>
<p><strong>DETR (Detection Transformer)</strong>: End-to-end object detection using transformers. Applied to cryo-EM particle picking (TransPicker), molecular structure image recognition (IMG2SMI), and cell segmentation (Cell-DETR).</p>
</li>
<li>
<p><strong>Conformer</strong>: Integrates convolutional modules into the transformer structure. Used for DNA storage error correction (RRCC-DNN) and drug-target affinity prediction (NG-DTA, evaluated on the Davis and Kiba datasets).</p>
</li>
<li>
<p><strong>CLIP (Contrastive Language-Image Pre-training)</strong>: Multimodal learning linking text and images. Applied to peptide design (Cut&amp;CLIP for protein degradation), gene identification (pathCLIP), and drug discovery (CLOOME for zero-shot transfer learning).</p>
</li>
<li>
<p><strong>Sparse Transformers</strong>: Use sparse attention matrices to reduce complexity to $O(n\sqrt{n})$. Applied to drug-target interaction prediction with gated cross-attention mechanisms.</p>
</li>
<li>
<p><strong>Mobile and Efficient Transformers</strong>: Compressed variants (TinyBERT, MobileBERT) for resource-constrained environments. Molormer uses ProbSparse self-attention for drug-drug interaction prediction. LOGO is a lightweight pre-trained language model for non-coding genome interpretation.</p>
</li>
</ol>
<h2 id="survey-organization-and-coverage-of-molecular-domains">Survey Organization and Coverage of Molecular Domains</h2>
<p>As a survey paper, this work does not present new experiments. Instead, it catalogues existing applications across multiple molecular domains:</p>
<p><strong>Drug Discovery and Design</strong>: GPT-based ligand design (DrugGPT), BART-based molecular generation (Chemformer, MolBART), graph transformer pre-training for molecular property prediction (KPGT), T5-based chemical reaction prediction (T5Chem), and sparse transformer methods for drug-target interactions.</p>
<p><strong>Protein Science</strong>: BERT-based protein-protein interaction prediction (Graph-BERT), graph transformer methods for protein-DNA binding (GraphSite with AlphaFold2 integration), conformer-based drug-target affinity prediction (NG-DTA), and CLIP-based peptide design (Cut&amp;CLIP).</p>
<p><strong>Molecular Property Prediction</strong>: FP-BERT for fingerprint-based prediction, SMILES-BERT and Mol-BERT for end-to-end prediction from SMILES, KPGT for knowledge-guided graph pre-training, and Transformer-XL for property modeling with relative positional encoding.</p>
<p><strong>Structural Biology</strong>: DETR-based cryo-EM particle picking (TransPicker), vision transformer applications in cell imaging, and Cell-DETR for instance segmentation in microscopy.</p>
<p><strong>Genomics</strong>: Conformer-based DNA storage error correction (RRCC-DNN), LOGO for non-coding genome interpretation, and MetaTransformer for metagenomic sequencing analysis.</p>
<h2 id="future-directions-and-limitations-of-the-survey">Future Directions and Limitations of the Survey</h2>
<p>The review concludes with four future directions:</p>
<ol>
<li>
<p><strong>ChatGPT integration into molecular science</strong>: Using LLMs for data analysis, literature review, and hypothesis generation in chemistry and biology.</p>
</li>
<li>
<p><strong>Multifunction transformers</strong>: Models that extract features across diverse molecular structures and sequences simultaneously.</p>
</li>
<li>
<p><strong>Molecular-aware transformers</strong>: Architectures that handle multiple data types (text, sequence, structure, image, energy, molecular dynamics, function) in a unified framework.</p>
</li>
<li>
<p><strong>Self-assessment transformers and superintelligence</strong>: Speculative discussion of models that learn from seemingly unrelated data sources.</p>
</li>
</ol>
<p>The review has several limitations worth noting. The coverage is broad but shallow: each architecture family receives only 1-2 pages of discussion, and the paper largely describes existing work rather than critically evaluating it. The review does not systematically compare the architectures against each other on common benchmarks. The future directions section (particularly the superintelligence discussion) is speculative and lacks concrete proposals. The paper also focuses primarily on technical architecture descriptions rather than analyzing failure modes, scalability challenges, or reproducibility concerns across the surveyed methods. As a review article, no new data were created or analyzed.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a survey paper. No new datasets were created or used. The paper reviews applications involving datasets such as PubChem (97 million molecules for T5Chem), CHEMBL (1.27 million molecules for Transformer-XL drug design), USPTO 500 MT (reaction classification), ESOL (5,328 molecules for property prediction), and Davis/Kiba (drug-target affinity).</p>
<h3 id="algorithms">Algorithms</h3>
<p>No new algorithms are introduced. The paper provides mathematical descriptions of the core transformer components (self-attention, positional encoding, feed-forward networks, layer normalization) and describes how 12 architecture families modify these components.</p>
<h3 id="models">Models</h3>
<p>No new models are presented. The paper surveys existing models including MolGPT, DrugGPT, FP-BERT, SMILES-BERT, Chemformer, MolBART, GraphSite, KPGT, T5Chem, TransPicker, Cell-DETR, CLOOME, and Molormer, among others.</p>
<h3 id="evaluation">Evaluation</h3>
<p>No new evaluation is performed. Performance numbers cited from the literature include: T5Chem reaction classification accuracy of 99.5%, ViT organic molecule classification at 97%, Transformer-XL property prediction RMSE of 0.6 on ESOL, and Heck reaction generation feasibility rate of 47.76%.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware requirements are specified, as this is a survey paper.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/wcms.1725">Paper (open access)</a></td>
          <td>Paper</td>
          <td>CC-BY-NC-ND</td>
          <td>Open access via Wiley</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jiang, J., Ke, L., Chen, L., Dou, B., Zhu, Y., Liu, J., Zhang, B., Zhou, T., &amp; Wei, G.-W. (2024). Transformer technology in molecular science. <em>WIREs Computational Molecular Science</em>, 14(4), e1725. <a href="https://doi.org/10.1002/wcms.1725">https://doi.org/10.1002/wcms.1725</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jiang2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer technology in molecular science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jiang, Jian and Ke, Lu and Chen, Long and Dou, Bozheng and Zhu, Yueying and Liu, Jie and Zhang, Bengong and Zhou, Tianshou and Wei, Guo-Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{WIREs Computational Molecular Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e1725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Wiley}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1002/wcms.1725}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPMM: A Bidirectional Molecular Foundation Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/</guid><description>SPMM is a multimodal molecular foundation model that aligns SMILES structures with property vectors for bidirectional generation and prediction tasks.</description><content:encoded><![CDATA[<h2 id="a-multimodal-foundation-model-for-structure-property-comprehension">A Multimodal Foundation Model for Structure-Property Comprehension</h2>
<p>This is a <strong>Method</strong> paper that introduces the Structure-Property Multi-Modal foundation model (SPMM), a transformer-based architecture that treats SMILES strings and molecular property vectors (PVs) as two separate modalities and learns to align them in a shared embedding space. The primary contribution is enabling bidirectional generation through a single pre-trained model: given a property vector, SPMM can generate molecules (inverse-QSAR), and given a SMILES string, it can predict all 53 properties simultaneously. The model also transfers to unimodal downstream tasks including MoleculeNet benchmarks and reaction prediction.</p>
<h2 id="bridging-the-gap-between-molecular-structure-and-properties">Bridging the Gap Between Molecular Structure and Properties</h2>
<p>Existing chemical pre-trained models typically learn representations from a single modality (SMILES, graphs, or fingerprints) and fine-tune for specific downstream tasks. While some approaches have attempted multimodal learning by combining SMILES with graph representations or InChI strings, these modalities encode nearly identical structural information, limiting the potential for emergent cross-modal knowledge.</p>
<p>The key gap SPMM addresses is the lack of multimodal pre-training that incorporates genuinely complementary modalities. Prior conditional molecule generation methods could typically control only a small number of properties simultaneously and required retraining when target properties changed. The authors draw on successes in vision-language pre-training (VLP), where aligning image and text modalities has enabled rich bidirectional understanding, and apply similar ideas to molecular structure and property domains.</p>
<h2 id="treating-property-vectors-as-a-language">Treating Property Vectors as a Language</h2>
<p>The core innovation in SPMM is treating a collection of 53 RDKit-computed molecular properties as a &ldquo;language&rdquo; where each property value is analogous to a word token. This design allows the model to attend to individual properties independently rather than treating the entire property vector as a single fixed-length condition.</p>
<h3 id="dual-stream-architecture">Dual-Stream Architecture</h3>
<p>SPMM follows the dual-stream VLP architecture. The model has three components:</p>
<ol>
<li><strong>SMILES Encoder</strong>: 6 BERT-base layers that encode tokenized SMILES (using a 300-subword BPE vocabulary) via self-attention</li>
<li><strong>PV Encoder</strong>: 6 BERT-base layers that encode the 53 normalized property values (each passed through a linear layer) with learnable positional embeddings</li>
<li><strong>Fusion Encoder</strong>: 6 BERT-base layers with cross-attention that combines both modalities, using one modality&rsquo;s features as queries and the other as keys/values</li>
</ol>
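<p>Stripped of multi-head projections and layer normalization, the cross-attention step in the fusion encoder reduces to the following sketch, where one modality's features act as queries and the other's as keys and values (the token counts and hidden size are illustrative, not taken from the paper):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # One modality supplies the queries; the other supplies keys and values.
    d = context.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

rng = np.random.default_rng(1)
d = 768                                   # BERT-base hidden size
smiles_feats = rng.normal(size=(24, d))   # 24 SMILES tokens (illustrative)
pv_feats = rng.normal(size=(53, d))       # one feature per property token

# Each SMILES token attends over the 53 properties, and vice versa.
fused_s = cross_attention(smiles_feats, pv_feats)   # (24, 768)
fused_p = cross_attention(pv_feats, smiles_feats)   # (53, 768)
```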
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>The model is pre-trained with four complementary losses:</p>
<p><strong>Contrastive Learning</strong> aligns SMILES and PV features in a shared embedding space. For [CLS] token outputs $\mathbf{S}_{cls}$ and $\mathbf{P}_{cls}$:</p>
<p>$$
\text{sim}(\mathbf{S}, \mathbf{P}) = \left(h_{S}(\mathbf{S}_{cls})\right)^{\top} h_{P}(\mathbf{P}_{cls})
$$</p>
<p>where $h_{S}$ and $h_{P}$ are projection heads that map the [CLS] outputs into the shared embedding space.</p>
<p>The intermodal similarities are computed with a learnable temperature $\tau$:</p>
<p>$$
s_{s2p} = \frac{\exp(\text{sim}(\mathbf{S}, \mathbf{P}) / \tau)}{\sum_{n=1}^{N} \exp(\text{sim}(\mathbf{S}, \mathbf{P}_{n}) / \tau)}
$$</p>
<p>The contrastive loss uses cross-entropy with one-hot labels (1 for same-molecule pairs):</p>
<p>$$
L_{\text{contrastive}} = \frac{1}{2}\left(H(y_{s2p}, s_{s2p}) + H(y_{p2s}, s_{p2s}) + H(y_{s2s}, s_{s2s}) + H(y_{p2p}, s_{p2p})\right)
$$</p>
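<p>A minimal NumPy sketch of the intermodal half of this loss (the $s2p$ and $p2s$ terms only; unit-normalization stands in for the projection heads $h_{S}$ and $h_{P}$, and the momentum queue and intramodal terms are omitted):</p>

```python
import numpy as np

def info_nce(A, B, tau=0.07):
    # Contrastive loss over a batch: A_i should match B_i, not B_j (j != i).
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    logits = A @ B.T / tau                 # pairwise sim(A_i, B_j) / tau
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # One-hot targets select the diagonal (same-molecule pairs).
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
N, d = 8, 32                               # toy batch size and embedding dim
S_cls = rng.normal(size=(N, d))            # SMILES [CLS] embeddings
P_cls = rng.normal(size=(N, d))            # PV [CLS] embeddings
loss = 0.5 * (info_nce(S_cls, P_cls) + info_nce(P_cls, S_cls))
```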
<p><strong>Next Word Prediction (NWP)</strong> trains autoregressive SMILES generation conditioned on the PV:</p>
<p>$$
L_{NWP} = \sum_{i=1}^{n} H\left(y_{i}^{NWP}, p^{NWP}(s_{i} \mid s_{0:i-1}, \mathbf{P})\right)
$$</p>
<p><strong>Next Property Prediction (NPP)</strong> applies the same autoregressive concept to property values, using mean-square-error loss:</p>
<p>$$
L_{NPP} = \sum_{i=1}^{n} \left(p_{i} - \hat{p}_{i}(p_{0:i-1}, \mathbf{S})\right)^{2}
$$</p>
<p><strong>SMILES-PV Matching (SPM)</strong> is a binary classification loss predicting whether a SMILES-PV pair originated from the same molecule, trained with hard-negative mining.</p>
<p>The overall pre-training loss combines all four:</p>
<p>$$
L = \widetilde{L}_{\text{contrastive}} + \widetilde{L}_{NWP} + L_{NPP} + L_{SPM}
$$</p>
<p>where tildes indicate the use of momentum teacher distillation to soften one-hot labels, acknowledging that multiple valid SMILES-PV pairings may exist.</p>
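<p>The momentum-distillation recipe can be sketched as follows; the EMA coefficient ($\lambda = 0.995$) and final mixing weight ($\alpha = 0.4$) are the values reported in the paper, while the arrays are illustrative:</p>

```python
import numpy as np

def ema_update(teacher_w, student_w, lam=0.995):
    # Momentum teacher: exponential moving average of the student's weights.
    return lam * teacher_w + (1 - lam) * student_w

def soften_targets(one_hot, teacher_probs, alpha):
    # Mix the hard one-hot target with the teacher's prediction, since
    # more than one SMILES-PV pairing can be chemically valid.
    return (1 - alpha) * one_hot + alpha * teacher_probs

one_hot = np.array([0.0, 1.0, 0.0, 0.0])
teacher = np.array([0.1, 0.6, 0.2, 0.1])   # teacher softmax output (illustrative)
soft = soften_targets(one_hot, teacher, alpha=0.4)   # [0.04, 0.84, 0.08, 0.04]
```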
<h3 id="random-property-masking">Random Property Masking</h3>
<p>During pre-training, 50% of property values are randomly replaced with a special [UNK] token. This serves three purposes: preventing overfitting to specific properties, augmenting data, and enabling flexible inference where users can specify any subset of the 53 properties as generation conditions. The model can handle all $2^{53}$ possible property combinations at inference time despite never seeing most of them during training.</p>
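<p>A sketch of the masking step (NaN stands in for the [UNK] token, which in the actual model is a learned embedding):</p>

```python
import numpy as np

def mask_properties(pv, mask_prob=0.5, rng=None):
    # Randomly replace property values with [UNK] so that any subset of
    # the 53 properties can be specified, or left free, at inference time.
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(pv.shape) < mask_prob
    out = pv.copy()
    out[mask] = np.nan                     # [UNK] placeholder
    return out, mask

pv = np.random.default_rng(0).normal(size=53)   # one normalized property vector
masked, mask = mask_properties(pv, mask_prob=0.5, rng=np.random.default_rng(1))
```

<p>At inference, the same mechanism covers everything from full conditioning (no [UNK]) to unconditional generation (all 53 entries [UNK]).</p>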
<h2 id="experiments-across-bidirectional-and-unimodal-tasks">Experiments Across Bidirectional and Unimodal Tasks</h2>
<h3 id="pv-to-smiles-generation-conditional-molecule-design">PV-to-SMILES Generation (Conditional Molecule Design)</h3>
<p>The authors evaluate SPMM on multiple generation scenarios using 1000 unseen PubChem PVs:</p>
<table>
  <thead>
      <tr>
          <th>Sampling</th>
          <th>Input PV</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Norm. RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deterministic</td>
          <td>1000 unseen PVs</td>
          <td>0.995</td>
          <td>0.999</td>
          <td>0.961</td>
          <td>0.216</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Full PV (molecule 1)</td>
          <td>0.974</td>
          <td>0.905</td>
          <td>0.998</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Molar mass = 150</td>
          <td>0.974</td>
          <td>0.945</td>
          <td>0.872</td>
          <td>0.192</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>4 properties controlled</td>
          <td>0.998</td>
          <td>0.981</td>
          <td>0.952</td>
          <td>0.257</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>No control (all [UNK])</td>
          <td>0.971</td>
          <td>0.991</td>
          <td>0.950</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The normalized RMSE of 0.216 across 53 properties indicates that generated molecules closely match the input property conditions. The model can also perform unconditional generation (all properties masked) where outputs follow the pre-training distribution. The authors report that SPMM outperforms benchmark models including MolGAN, GraphVAE, and scaffold-based graph generative models in both conditional and unconditional settings (Supplementary Table 1).</p>
<h3 id="smiles-to-pv-generation-multi-property-prediction">SMILES-to-PV Generation (Multi-Property Prediction)</h3>
<p>When given 1000 unseen ZINC15 molecules, SPMM predicts all 53 properties autoregressively with a mean $r^{2}$ of 0.924 across all properties.</p>
<h3 id="moleculenet-benchmarks">MoleculeNet Benchmarks</h3>
<p>Using only the SMILES encoder (6 BERT layers), SPMM achieves best or competitive performance on 9 <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SPMM</th>
          <th>Best Baseline</th>
          <th>Baseline Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.817</td>
          <td>0.798</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>LIPO</td>
          <td>RMSE</td>
          <td>0.681</td>
          <td>0.660</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>1.868</td>
          <td>1.877</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>BACE (reg)</td>
          <td>RMSE</td>
          <td>1.041</td>
          <td>1.047</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Clearance</td>
          <td>RMSE</td>
          <td>42.607</td>
          <td>43.175</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>AUROC</td>
          <td>75.1%</td>
          <td>73.6%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BACE (cls)</td>
          <td>AUROC</td>
          <td>84.4%</td>
          <td>86.3%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>AUROC</td>
          <td>92.7%</td>
          <td>91.2%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>AUROC</td>
          <td>66.9%</td>
          <td>67.2%</td>
          <td>ChemRL-GEM</td>
      </tr>
  </tbody>
</table>
<p>SPMM achieved best performance on 5 of 9 tasks, with notable gains on BBBP (75.1% vs. 73.6%) and ClinTox (92.7% vs. 91.2%). Without pre-training, all scores dropped substantially.</p>
<h3 id="dili-classification">DILI Classification</h3>
<p>On Drug-Induced Liver Injury prediction, SPMM achieved 92.6% AUROC, outperforming the 5-ensemble model of Ai et al. (90.4% AUROC) while using a single model.</p>
<h3 id="reaction-prediction">Reaction Prediction</h3>
<p>On USPTO-480k forward reaction prediction, SPMM achieved 91.5% top-1 accuracy, the highest among all models tested (including <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> at 91.3%). On USPTO-50k retro-reaction prediction, SPMM reached 53.4% top-1 accuracy, second only to Chemformer (54.3%) among string-based models.</p>
<h2 id="bidirectional-generation-from-a-single-pre-trained-model">Bidirectional Generation From a Single Pre-trained Model</h2>
<p>SPMM demonstrates that multimodal pre-training with genuinely complementary modalities (structure and properties, rather than structurally redundant representations) enables a single foundation model to handle both generation directions and downstream unimodal tasks. Key findings include:</p>
<ol>
<li><strong>Flexible conditional generation</strong>: The [UNK] masking strategy allows controlling any subset of 53 properties at inference time without retraining, a capability not demonstrated by prior methods.</li>
<li><strong>Interpretable cross-attention</strong>: Attention visualizations show that the model learns chemically meaningful structure-property relationships (e.g., hydrogen bonding properties attend to oxygen and nitrogen atoms; ring count properties attend to ring tokens).</li>
<li><strong>Competitive unimodal transfer</strong>: Despite using only 6 BERT layers and 50M pre-training molecules (smaller than <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>&rsquo;s 77M or Chemformer&rsquo;s 100M), the SMILES encoder alone achieves best or second-best results on 5 of 9 MoleculeNet tasks and the highest forward reaction prediction accuracy among tested models.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>SMILES representation constraints</strong>: Implicit connectivity information in SMILES means small structural changes can cause drastic string changes. Graph representations could be a complementary alternative.</li>
<li><strong>Stereochemistry blindness</strong>: All 53 RDKit properties used are invariant to stereochemistry, meaning different stereoisomers produce identical PVs. The contrastive loss then forces their SMILES encoder outputs to converge, which the authors identify as the primary factor limiting MoleculeNet performance on stereo-sensitive tasks.</li>
<li><strong>No wet-lab validation</strong>: Generated molecules and predicted properties are not experimentally verified.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>50M molecules</td>
          <td>SMILES + 53 RDKit properties</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>642-4200 per task</td>
          <td>Scaffold split via DeepChem (8:1:1)</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>Ai et al. dataset</td>
          <td>Not specified</td>
          <td>Following published preparation</td>
      </tr>
      <tr>
          <td>Forward reaction</td>
          <td>USPTO-480k</td>
          <td>479,035 pairs</td>
          <td>Reactant-product pairs</td>
      </tr>
      <tr>
          <td>Retro reaction</td>
          <td>USPTO-50k</td>
          <td>50,037 pairs</td>
          <td>Product-reactant pairs, no reaction types used</td>
      </tr>
      <tr>
          <td>SMILES-to-PV test</td>
          <td>ZINC15</td>
          <td>1000 molecules</td>
          <td>Not in pre-training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: BPE with 300-subword dictionary</li>
<li><strong>Property masking</strong>: 50% random replacement with [UNK] during pre-training</li>
<li><strong>Momentum distillation</strong>: EMA parameter $\lambda = 0.995$, soft-label mixing $\alpha$ linearly warmed from 0 to 0.4 over the first epoch</li>
<li><strong>Contrastive queue</strong>: Size $k = 24{,}576$ for storing recent SMILES and PV instances</li>
<li><strong>Beam search</strong>: $k = 2$ for PV-to-SMILES generation</li>
<li><strong>SMILES augmentation</strong>: Random non-canonical augmentation with probability 0.5 for reaction tasks</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: 6 BERT-base encoder layers each for SMILES encoder, PV encoder, and fusion encoder (18 total layers)</li>
<li><strong>Vocabulary</strong>: 300 BPE subwords for SMILES; 53 property tokens for PV</li>
<li><strong>Pre-trained weights</strong>: Available via GitHub</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Validity</td>
          <td>99.5%</td>
          <td>1000 unseen PubChem PVs</td>
      </tr>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Normalized RMSE</td>
          <td>0.216</td>
          <td>Across 53 properties</td>
      </tr>
      <tr>
          <td>SMILES-to-PV</td>
          <td>Mean $r^{2}$</td>
          <td>0.924</td>
          <td>1000 ZINC15 molecules</td>
      </tr>
      <tr>
          <td>Forward reaction (USPTO-480k)</td>
          <td>Top-1 accuracy</td>
          <td>91.5%</td>
          <td>Best among all tested models</td>
      </tr>
      <tr>
          <td>Retro reaction (USPTO-50k)</td>
          <td>Top-1 accuracy</td>
          <td>53.4%</td>
          <td>Second-best string-based</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>AUROC</td>
          <td>92.6%</td>
          <td>Single model vs. 5-ensemble</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Pre-training</strong>: 8 NVIDIA A100 GPUs, approximately 52,000 batch iterations, roughly 12 hours</li>
<li><strong>Batch size</strong>: 96</li>
<li><strong>Optimizer</strong>: AdamW with weight decay 0.02</li>
<li><strong>Learning rate</strong>: Warmed up to $10^{-4}$, cosine decay to $10^{-5}$</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jinhojsk515/SPMM">SPMM Source Code</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with experimental scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10567599">SPMM Zenodo Archive</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Archived version for reproducibility</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>50M molecules for pre-training</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>Varies</td>
          <td>Benchmark datasets via DeepChem</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, J., &amp; Ye, J. C. (2024). Bidirectional generation of structure and properties through a single molecular foundation model. <em>Nature Communications</em>, 15, 2323. <a href="https://doi.org/10.1038/s41467-024-46440-3">https://doi.org/10.1038/s41467-024-46440-3</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chang2024bidirectional,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Bidirectional generation of structure and properties through a single molecular foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chang, Jinho and Ye, Jong Chul}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2323}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-46440-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPE: Data-Driven SMILES Substructure Tokenization</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</guid><description>SMILES Pair Encoding adapts byte pair encoding to learn chemically meaningful substructure tokens from SMILES, improving generation and QSAR prediction.</description><content:encoded><![CDATA[<h2 id="a-data-driven-tokenization-method-for-chemical-deep-learning">A Data-Driven Tokenization Method for Chemical Deep Learning</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Pair Encoding (SPE), a tokenization algorithm adapted from <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">byte pair encoding (BPE)</a> in natural language processing. The primary contribution is a data-driven approach that learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset and then uses that vocabulary to tokenize SMILES for downstream deep learning tasks. The authors provide an open-source Python package (SmilesPE) and demonstrate improvements on both molecular generation and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> prediction benchmarks.</p>
<h2 id="limitations-of-atom-level-smiles-tokenization">Limitations of Atom-Level SMILES Tokenization</h2>
<p>SMILES-based deep learning models require tokenization to convert molecular strings into sequences of discrete units. The standard approaches have well-known drawbacks:</p>
<ul>
<li><strong>Character-level tokenization</strong> breaks SMILES character by character, splitting chemically meaningful multi-character atoms. For example, <code>[C@@H]</code> becomes six separate tokens (<code>[</code>, <code>C</code>, <code>@</code>, <code>@</code>, <code>H</code>, <code>]</code>), scattering what is a single stereocenter across unrelated tokens.</li>
<li><strong>Atom-level tokenization</strong> addresses some of these issues by treating multi-character element symbols (Cl, Br) and bracketed atoms ([nH], [O-]) as single tokens. However, these tokens still encode only individual atoms, not substructures.</li>
<li><strong>k-mer tokenization</strong> (sequences of k consecutive overlapping characters) captures some connectivity information but suffers from the out-of-vocabulary problem: the model cannot represent k-mers not seen during training.</li>
</ul>
<p>All three approaches produce relatively long input sequences (mean ~40 tokens per molecule on ChEMBL at the atom level), which increases computational cost for sequential architectures like RNNs and exacerbates long-range dependency issues.</p>
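<p>The difference between character-level and atom-level tokenization can be seen with a simplified regex-based tokenizer (the pattern below is a reduced variant of commonly used atom-level regexes, not the exact one shipped in the SmilesPE package):</p>

```python
import re

# Atom-level SMILES tokenizer: bracketed atoms, two-letter elements,
# and two-digit ring-bond labels are kept as single tokens.
SMI_REGEX = r"(\[[^\]]+\]|Br|Cl|Si|@@|%\d{2}|[B-Zb-z0-9=#\$\/\\\+\-\(\)\.~:])"

def atomwise_tokenize(smiles):
    return re.findall(SMI_REGEX, smiles)

print(list("C[C@@H]O"))               # character-level: 8 tokens
print(atomwise_tokenize("C[C@@H]O"))  # atom-level: ['C', '[C@@H]', 'O']
```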
<h2 id="core-innovation-adapting-byte-pair-encoding-for-smiles">Core Innovation: Adapting Byte Pair Encoding for SMILES</h2>
<p>SPE adapts the byte pair encoding algorithm, originally developed for data compression and later adopted for subword tokenization in NLP, to the domain of chemical strings. The algorithm has two phases:</p>
<p><strong>Vocabulary training:</strong></p>
<ol>
<li>Tokenize SMILES from a large dataset (ChEMBL) at the atom level</li>
<li>Initialize the vocabulary with all unique atom-level tokens</li>
<li>Iteratively count the frequency of all adjacent token pairs, merge the most frequent pair into a new token, and add it to the vocabulary</li>
<li>Stop when either the maximum vocabulary size (MVS) or a minimum frequency threshold (FT) is reached</li>
</ol>
<p><strong>Tokenization:</strong> Given a trained SPE vocabulary, a new SMILES string is first tokenized at the atom level, then token pairs are iteratively merged according to their frequency rank in the vocabulary until no further merges are possible.</p>
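<p>Both phases can be sketched in plain Python. This is an illustrative reimplementation of the pair-merging loop, not the SmilesPE package itself; <code>max_merges</code> and <code>min_freq</code> stand in for MVS and FT:</p>

```python
from collections import Counter

def train_spe(corpus, max_merges=1000, min_freq=2):
    """Learn merge rules over atom-level token sequences (BPE-style)."""
    seqs = [list(toks) for toks in corpus]  # each item: list of atom tokens
    merges = []
    for _ in range(max_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < min_freq:  # frequency threshold (FT analogue)
            break
        merges.append((a, b))
        # Apply the new merge to the whole corpus before the next round.
        seqs = [_merge(seq, a, b) for seq in seqs]
    return merges

def _merge(seq, a, b):
    """Replace every adjacent (a, b) pair in seq with the merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def spe_tokenize(tokens, merges):
    """Apply learned merges in rank order to a new atom-level sequence."""
    seq = list(tokens)
    for a, b in merges:
        seq = _merge(seq, a, b)
    return seq
```

<p>On a toy corpus such as <code>[['C','C','O'], ['C','C','N'], ['C','C','O']]</code>, the first learned merge is <code>('C','C')</code> and the second is <code>('CC','O')</code>, so a new sequence <code>['C','C','O']</code> tokenizes to the single token <code>CCO</code>.</p>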
<p>The key hyperparameters are MVS and FT. In the reported experiments, MVS was set to 30,000 and FT was set to 2,000. The vocabulary was trained on ~3.4 million SMILES (both canonical and one non-canonical variant per molecule) from ChEMBL25. The resulting vocabulary contained 3,002 unique SMILES substrings with lengths ranging from 1 to 22 atom-level characters.</p>
<p>The trained SPE vocabulary produces tokens that are human-readable and correspond to chemically meaningful substructures and functional groups. SPE tokenization reduces the mean sequence length from approximately 40 tokens (atom-level) to approximately 6 tokens on ChEMBL, a roughly 6-7x compression. This shorter representation directly reduces computational cost for RNN-based and other sequential models.</p>
<p>The algorithm is also compatible with other text-based molecular representations such as <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, since these share atom-level character structures that can serve as the starting point for pair merging.</p>
<h2 id="molecular-generation-and-qsar-prediction-experiments">Molecular Generation and QSAR Prediction Experiments</h2>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>The authors trained AWD-LSTM language models with SPE and atom-level tokenization on 9 million SMILES (1 canonical + 5 non-canonical per compound from ChEMBL25). Each model sampled 1 million SMILES for evaluation. The AWD-LSTM architecture used an embedding size of 400, three LSTM layers with 1,152 hidden units each, and various dropout settings (embedding: 0.1, input: 0.6, weight: 0.5, hidden: 0.2). Models were trained for 10 epochs with a base learning rate of 0.008 using one-cycle scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SPE</th>
          <th>Atom-level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>0.941</td>
          <td>0.970</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.994</td>
          <td>0.992</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.983</td>
          <td>0.978</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.897</td>
          <td>0.886</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>0.391</td>
          <td>0.386</td>
      </tr>
  </tbody>
</table>
<p>The SPE model generated a more diverse population of novel molecules at the cost of slightly lower validity (94.1% vs. 97.0%). Internal diversity is defined as:</p>
<p>$$
\text{Internal diversity} = 1 - \frac{1}{|G|^2} \sum_{(x_1, x_2) \in G \times G} T(x_1, x_2)
$$</p>
<p>where $T(x_1, x_2)$ is the Tanimoto similarity between molecules $x_1$ and $x_2$ using 1024-bit ECFP6 fingerprints. Nearest neighbor similarity (SNN) measures how well the generated set resembles the reference set:</p>
<p>$$
\text{SNN} = \frac{1}{|G|} \sum_{x_G \in G} \max_{x_R \in R} T(x_G, x_R)
$$</p>
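<p>Both metrics are easy to compute once pairwise Tanimoto similarities are available. The sketch below represents fingerprints as Python sets of on-bits rather than 1024-bit ECFP6 vectors, purely for illustration:</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def internal_diversity(gen):
    """1 minus the mean pairwise Tanimoto over all ordered pairs in G x G."""
    n = len(gen)
    total = sum(tanimoto(x1, x2) for x1 in gen for x2 in gen)
    return 1.0 - total / (n * n)

def snn(gen, ref):
    """Mean over generated molecules of the max Tanimoto to the reference set."""
    return sum(max(tanimoto(g, r) for r in ref) for g in gen) / len(gen)
```

<p>A generated set whose members are all identical scores an internal diversity of 0, while a set sharing no bits pairwise approaches 1; SNN is 1 when every generated molecule also appears in the reference set.</p>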
<p>Substructure coverage analysis showed both models recovered the same top-1000 BRICS fragments (100% coverage), but SPE consistently outperformed atom-level tokenization on top-5000 coverage across all four substructure types: BRICS fragments (0.997 vs. 0.987), functional groups (0.688 vs. 0.659), scaffolds (0.872 vs. 0.825), and ring systems (0.781 vs. 0.761).</p>
<h3 id="qsar-prediction">QSAR Prediction</h3>
<p>QSAR models were built using the <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT transfer learning framework</a>, which pre-trains a language model on ChEMBL and then fine-tunes it for specific prediction tasks. The evaluation used 24 regression benchmarks (pIC50 values) from Cortes-Ciriano et al., covering targets ranging from 199 molecules (alpha-2a adrenergic receptor) to 5,010 molecules (<a href="https://en.wikipedia.org/wiki/KCNH2">hERG</a>). Models were evaluated on 10 random 80:10:10 splits using RMSE, R-squared, and MAE. Random forest models with 1024-bit ECFP6 were included as baseline comparisons.</p>
<p><a href="https://en.wikipedia.org/wiki/Effect_size">Cohen&rsquo;s d</a> effect sizes were computed to quantify performance differences between tokenization methods. SPE performed comparably or better than atom-level tokenization on 23 out of 24 datasets. Notable results with medium or large effect sizes favoring SPE included <a href="https://en.wikipedia.org/wiki/Cannabinoid_receptor_1">cannabinoid CB1 receptor</a> (large effect), A2a adrenergic receptor, LCK, estrogen receptor, and <a href="https://en.wikipedia.org/wiki/Aurora_kinase_A">Aurora-A kinase</a> (all medium effects). Against k-mer tokenization, SPE matched or outperformed on 22 out of 24 datasets.</p>
<p>Cohen&rsquo;s d is defined as:</p>
<p>$$
\text{Cohen&rsquo;s } d = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(\text{SD}_1^2 + \text{SD}_2^2) / 2}}
$$</p>
<p>where $\bar{x}_1, \bar{x}_2$ are the group means and $\text{SD}_1, \text{SD}_2$ are the standard deviations. Thresholds of 0.2 (small), 0.5 (medium), and 0.8 (large) were used following standard recommendations.</p>
<p>SMILES-based deep learning models generally performed on par with or better than the RF baseline, with particularly strong advantages on the four largest datasets (<a href="https://en.wikipedia.org/wiki/Cyclooxygenase-2">COX-2</a>, <a href="https://en.wikipedia.org/wiki/Acetylcholinesterase">acetylcholinesterase</a>, erbB1, and hERG).</p>
<p>In addition to performance gains, SPE-based models trained on average 5 times faster than atom-level models due to the shorter input sequences.</p>
<h2 id="results-summary-and-future-directions">Results Summary and Future Directions</h2>
<p>The main findings of this study are:</p>
<ol>
<li>
<p><strong>SPE produces chemically meaningful tokens.</strong> The learned vocabulary contains human-readable SMILES substrings that correspond to common substructures and functional groups, making model interpretations more accessible.</p>
</li>
<li>
<p><strong>SPE compresses input sequences by ~6-7x.</strong> Mean token sequence length drops from ~40 (atom-level) to ~6 (SPE) on ChEMBL, yielding a ~5x training speedup.</p>
</li>
<li>
<p><strong>SPE improves molecular generation diversity.</strong> The SPE-based generative model produces molecules with higher novelty (98.3% vs. 97.8%), internal diversity (0.897 vs. 0.886), and substructure coverage, at the cost of slightly lower validity (94.1% vs. 97.0%).</p>
</li>
<li>
<p><strong>SPE matches or outperforms atom-level and k-mer tokenization on QSAR prediction.</strong> Across 24 benchmarks, SPE showed comparable or better performance in 23/24 comparisons against atom-level and 22/24 against k-mer tokenization.</p>
</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The SPE vocabulary is trained on a specific dataset (ChEMBL25) and may not optimally represent chemical spaces that differ significantly from drug-like compounds.</li>
<li>The validity rate for molecular generation is slightly lower than atom-level tokenization (94.1% vs. 97.0%), since longer substructure tokens can introduce invalid fragments.</li>
<li>The k-mer tokenization suffers from an out-of-vocabulary problem, which the authors address by replacing unseen 4-mers with <code>[UNK]</code> tokens, but this is a limitation of the comparison rather than of SPE itself.</li>
</ul>
<p><strong>Future directions:</strong> The authors suggest SPE could serve as a general tokenization method for SMILES-based deep learning, applicable to any task where SMILES strings are used as input (<a href="/notes/chemistry/molecular-design/generation/">generation</a>, <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/">reaction prediction</a>, retrosynthesis). The algorithm can also be applied to DeepSMILES and SELFIES representations without modification.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SPE vocabulary training</td>
          <td>ChEMBL25</td>
          <td>~3.4M SMILES</td>
          <td>1 canonical + 1 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Language model training</td>
          <td>ChEMBL25 augmented</td>
          <td>~9M SMILES</td>
          <td>1 canonical + 5 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Molecular generation evaluation</td>
          <td>Sampled from model</td>
          <td>1M SMILES per model</td>
          <td>Validated with RDKit</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>Cortes-Ciriano et al.</td>
          <td>24 datasets, 199-5010 molecules</td>
          <td>pIC50 regression tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SPE vocabulary training: iterative pair merging with MVS=30,000 and FT=2,000</li>
<li>Language model: AWD-LSTM with embedding size 400, 3 LSTM layers with 1,152 hidden units</li>
<li>Dropout: embedding=0.1, input=0.6, weight=0.5, hidden=0.2</li>
<li>Training: 10 epochs, base learning rate 0.008, one-cycle policy</li>
<li>QSAR: MolPMoFiT transfer learning with 25x training augmentation and 15x validation augmentation</li>
<li>Test time augmentation: average of canonical + 4 augmented SMILES predictions</li>
<li>RF baseline: 500 trees, 1024-bit ECFP6, default scikit-learn parameters</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>AWD-LSTM architecture from Merity et al. (2018)</li>
<li>MolPMoFiT framework from Li and Fourches (2020) for transfer learning QSAR</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity, Uniqueness, Novelty</td>
          <td>Generation</td>
          <td>Basic quality metrics</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>Generation</td>
          <td>1 - mean pairwise Tanimoto (ECFP6)</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>Generation</td>
          <td>Mean max Tanimoto to reference set</td>
      </tr>
      <tr>
          <td>Substructure coverage</td>
          <td>Generation</td>
          <td>BRICS, functional groups, scaffolds, ring systems</td>
      </tr>
      <tr>
          <td>RMSE, R-squared, MAE</td>
          <td>QSAR regression</td>
          <td>10 random 80:10:10 splits</td>
      </tr>
      <tr>
          <td>Cohen&rsquo;s d</td>
          <td>QSAR comparison</td>
          <td>Effect size between tokenization methods</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/SmilesPE">SmilesPE</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>SPE tokenization Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transfer learning QSAR framework</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2021). SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 61(4), 1560-1569. <a href="https://doi.org/10.1021/acs.jcim.0c01127">https://doi.org/10.1021/acs.jcim.0c01127</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2021smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1560--1569}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01127}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Smirk: Complete Tokenization for Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</guid><description>Smirk tokenizer achieves full OpenSMILES coverage with 165 tokens by decomposing bracketed atoms into glyphs, validated via n-gram proxy models.</description><content:encoded><![CDATA[<h2 id="a-method-for-complete-chemical-tokenization">A Method for Complete Chemical Tokenization</h2>
<p>This is a <strong>Method</strong> paper that introduces two new tokenizers for molecular foundation models: Smirk and Smirk-GPE. The primary contribution is a tokenization scheme that achieves complete coverage of the OpenSMILES specification using only 165 tokens, addressing the vocabulary gaps present in existing atom-wise tokenizers. The paper also proposes n-gram language models as low-cost proxy evaluators for tokenizer quality and validates these proxies against 18 transformer-based models across multiple benchmarks.</p>
<h2 id="vocabulary-gaps-in-molecular-tokenization">Vocabulary Gaps in Molecular Tokenization</h2>
<p>Molecular foundation models overwhelmingly use &ldquo;atom-wise&rdquo; tokenization, where SMILES strings are split at atom boundaries using a regular expression first proposed by Schwaller et al. A key pattern in this regex treats all &ldquo;bracketed atoms&rdquo; (e.g., <code>[C@@H]</code>, <code>[18F]</code>, <code>[Au+]</code>) as single, irreducible tokens. Since bracketed atoms encode isotopes, chirality, charge, hydrogen count, and element identity, the number of possible permutations under the OpenSMILES specification exceeds 28 trillion. In practice, existing atom-wise tokenizers maintain vocabularies of fewer than 3,000 tokens, leaving large portions of chemical space unrepresentable.</p>
<p>This gap has real consequences. Many chemistry-specific tokenizers emit the unknown token <code>[UNK]</code> at non-negligible frequencies, particularly on datasets with diverse elements and stereochemistry. For example, <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SPE and APE</a> tokenizers produce <code>[UNK]</code> for roughly 19% of tokens on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and approximately 50% on the tmQM transition metal complex dataset. Even models like <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a> and <a href="/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/">ReactionT5</a> lack tokens for elements such as copper, ruthenium, gold, and uranium.</p>
<p>The authors also note a subtler issue: some open-vocabulary tokenizers (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa&rsquo;s</a> BPE) conflate chemically distinct entities. The same <code>Sc</code> token may represent both a sulfur-carbon bond (in organic SMILES) and the element scandium (in <code>[Sc]</code>), creating ambiguity in downstream analysis.</p>
<h2 id="smirk-glyph-level-decomposition-of-smiles">Smirk: Glyph-Level Decomposition of SMILES</h2>
<p>The core insight behind Smirk is to fully decompose bracketed atoms into their constituent &ldquo;glyphs,&rdquo; the primitive symbols defined by the OpenSMILES specification (element symbols, chirality markers, charges, isotope numbers, hydrogen counts, and brackets themselves). This transforms tokenization from a word-level scheme (one token per bracketed atom) to a character-level scheme over chemically meaningful glyphs.</p>
<p>Smirk uses a two-stage tokenization process:</p>
<ol>
<li><strong>Atom decomposition</strong>: Split a SMILES string into atom-level units using a regex (e.g., <code>OC[C@@H][OH]</code> becomes <code>O C [C@@H] [OH]</code>).</li>
<li><strong>Glyph decomposition</strong>: Further split each unit into its constituent glyphs (e.g., <code>[C@@H]</code> becomes <code>[ C @@ H ]</code>).</li>
</ol>
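<p>The two stages can be illustrated with a pair of regexes. These are simplified stand-ins written for this note, not the actual Rust implementation, and they do not cover the full OpenSMILES grammar:</p>

```python
import re

# Stage 1: split into atom-level units (bracketed atoms stay whole).
ATOM_RE = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOFPSIbcnops]|[-=#$:/\\().]|%\d{2}|\d")
# Stage 2: split a bracketed atom into its constituent glyphs.
GLYPH_RE = re.compile(r"[A-Z][a-z]?|[a-z]|@{1,2}|\d+|[+\-\[\]]")

def smirk_tokenize(smiles):
    """Two-stage sketch: atom units first, then glyphs inside brackets."""
    tokens = []
    for unit in ATOM_RE.findall(smiles):
        if unit.startswith("["):
            tokens.extend(GLYPH_RE.findall(unit))  # e.g. [C@@H] -> [ C @@ H ]
        else:
            tokens.append(unit)
    return tokens
```

<p>Note how the bare substring <code>Sc</code> in <code>CSc1ccccc1</code> yields separate sulfur and aromatic-carbon tokens, while <code>[Sc]</code> yields a single scandium glyph: the bracket context collected in stage 1 is what lets stage 2 disambiguate the two.</p>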
<p>The two-stage process is necessary to resolve ambiguities. For example, <code>Sc</code> in an unbracketed context represents a sulfur-carbon bond, while <code>[Sc]</code> denotes scandium. This ambiguity occurs over half a million times in PubChem&rsquo;s compound dataset.</p>
<p>The resulting vocabulary contains only 165 tokens, requires no training, and by construction can faithfully tokenize any molecule that conforms to the OpenSMILES specification. The implementation is written in Rust using HuggingFace&rsquo;s Tokenizers library and is available on PyPI.</p>
<p><strong>Smirk-GPE</strong> (Glyph Pair Encoding) extends Smirk with a <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">BPE</a>-like compression step. After Smirk tokenization, adjacent tokens are merged using learned rules, reducing sequence length. Unlike standard BPE, merges operate on token IDs rather than character strings, preserving the distinction between chemically different entities that happen to share the same characters. Smirk-GPE was trained on 262 million molecules from Enamine REAL Space with a target vocabulary of 50,000 tokens, though training terminated at 2,300 tokens after exhausting all possible merges.</p>
<h2 id="evaluation-framework-intrinsic-metrics-n-gram-proxies-and-transformer-benchmarks">Evaluation Framework: Intrinsic Metrics, N-Gram Proxies, and Transformer Benchmarks</h2>
<p>The evaluation covers 34 tokenizers across three datasets (Enamine REALSpace, MoleculeNet, and tmQM) using both intrinsic and extrinsic metrics.</p>
<h3 id="intrinsic-metrics">Intrinsic Metrics</h3>
<p>Four intrinsic metrics are computed for each tokenizer:</p>
<p><strong>Fertility</strong> measures the mean tokenized sequence length. Higher fertility increases computational cost due to the quadratic scaling of attention:</p>
<p>$$
\text{cost} \propto \text{fertility}^2
$$</p>
<p><strong>Normalized entropy</strong> quantifies how close a tokenizer comes to the information-theoretic ideal where all tokens are equally probable:</p>
<p>$$
\eta = \frac{-1}{\log |V|} \sum_{x \in V} p(x) \log p(x)
$$</p>
<p>where $V$ is the vocabulary and $p(x)$ is the observed token probability. Higher normalized entropy correlates with better downstream performance.</p>
<p><strong>Token imbalance</strong> measures the distance between observed token frequencies and a uniform distribution:</p>
<p>$$
D = \frac{1}{2} \sum_{x \in V} \left| p(x) - \frac{1}{|V|} \right|
$$</p>
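<p>Both distributional metrics follow directly from observed token counts. The sketch below also counts vocabulary entries that never appear, each of which contributes its full uniform share to the imbalance:</p>

```python
import math
from collections import Counter

def token_stats(tokens, vocab_size):
    """Return (normalized entropy, token imbalance) for an observed stream."""
    counts = Counter(tokens)
    n = len(tokens)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    eta = entropy / math.log(vocab_size)      # 1.0 = perfectly uniform usage
    uniform = 1.0 / vocab_size
    imbalance = 0.5 * (
        sum(abs(p - uniform) for p in probs)
        + (vocab_size - len(counts)) * uniform  # unseen tokens: |0 - 1/|V||
    )
    return eta, imbalance
```

<p>A stream that uses every vocabulary entry equally often scores eta = 1.0 and imbalance = 0.0; a stream that collapses onto a single token scores eta = 0.0.</p>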
<p><strong>Unknown token frequency</strong> captures the fraction of emitted tokens that are <code>[UNK]</code>. This metric is particularly revealing: all existing chemistry-specific tokenizers (SPE/APE, atom-wise, BPE, and Unigram variants) emit <code>[UNK]</code> at non-negligible rates, while NLP tokenizers, Smirk, and Smirk-GPE do not.</p>
<h3 id="n-gram-proxy-language-models">N-Gram Proxy Language Models</h3>
<p>The paper proposes using n-gram models as low-cost proxies for transformer-based evaluation. An n-gram estimates token likelihood with <a href="https://en.wikipedia.org/wiki/Additive_smoothing">add-one smoothing</a>:</p>
<p>$$
P_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}) = \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|}
$$</p>
<p>where $C$ is the count function and $|V|$ is the vocabulary size. N-grams were &ldquo;pretrained&rdquo; on 1.6 billion SMILES from Enamine REAL Space and evaluated on validation splits. Cross-entropy loss and information loss from unknown tokens were computed.</p>
<p>To quantify information lost to <code>[UNK]</code> tokens, the authors compute the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a> between token distributions with and without unknown tokens, using a bidirectional character n-gram model:</p>
<p>$$
B_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}, x_{i+1}, \dots, x_{i+n-1}) \propto \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|} \times \frac{C(x_{i}, \dots, x_{i+n-1}) + 1}{C(x_{i+1}, \dots, x_{i+n-1}) + |V|}
$$</p>
<h3 id="transformer-experiments">Transformer Experiments</h3>
<p>Eighteen encoder-only RoBERTa models (25M parameters each, excluding embeddings) were pretrained from scratch using masked language modeling on Enamine REAL Space (245M molecules, 30,000 steps). Each model used a different tokenizer, isolating the tokenizer&rsquo;s effect on performance. Finetuning was conducted on six regression and seven classification tasks from MoleculeNet and tmQM.</p>
<p>Linear fixed-effects models were used to estimate the standardized effect of each tokenization scheme relative to an atom-wise SMILES baseline.</p>
<h2 id="key-findings-and-practical-implications">Key Findings and Practical Implications</h2>
<h3 id="tokenizer-performance">Tokenizer Performance</h3>
<ul>
<li><strong>Smirk</strong> shows a positive effect on pretraining quality and downstream performance on tmQM (the dataset with the most bracketed atoms), but performs comparably to atom-wise tokenization on MoleculeNet tasks.</li>
<li><strong>SPE and APE</strong> tokenizers have a negative impact on both pretraining and downstream performance relative to the atom-wise baseline, likely due to their high <code>[UNK]</code> rates.</li>
<li><strong>Molecular encoding choice</strong> (<a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">SMILES vs. SELFIES</a>) has a negligible effect on performance.</li>
<li><strong>NLP tokenizers</strong> (GPT-4o, LLaMA, Gemma) score comparably to chemistry-specific tokenizers on intrinsic metrics and do not emit unknown tokens.</li>
</ul>
<h3 id="n-gram-proxy-validation">N-Gram Proxy Validation</h3>
<p>N-gram cross-entropy and information loss metrics show strong rank correlation (Spearman&rsquo;s $\rho$) with downstream transformer performance, validating their use as low-cost evaluation proxies. The effect sizes from n-gram and transformer experiments are directionally consistent.</p>
<h3 id="information-loss-from-unknown-tokens">Information Loss from Unknown Tokens</h3>
<p>Information loss is minimal for tokenizers with robust coverage but substantial for tokenizers with limited vocabularies on chemically diverse datasets. <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> incurs only 0.1 nats/molecule on MoleculeNet but 40.3 nats/molecule on tmQM. Open-vocabulary tokenizers (Smirk, Smirk-GPE, NLP tokenizers) mitigate this degradation.</p>
<h3 id="practical-recommendations">Practical Recommendations</h3>
<p>The authors argue that molecular foundation models must encode the entire breadth of chemical space or risk obscuring critical features. Bracketed atoms encode information essential to clinically relevant pharmaceuticals (e.g., <a href="https://en.wikipedia.org/wiki/Amoxicillin">Amoxicillin</a>), industrial compounds (e.g., Tricalcium Silicate), and foundational chemistry (e.g., <a href="https://en.wikipedia.org/wiki/Cisplatin">Cisplatin</a>, where omitting the chiral marker erases medically relevant stereochemical information). The paper encourages the community to adopt open-vocabulary tokenizers and develop more chemically diverse benchmarks.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The analysis uses a single-point evaluation for transformer experiments, which may underestimate performance achievable with additional hyperparameter tuning.</li>
<li>Smirk-GPE&rsquo;s learned merges from REALSpace did not fully generalize to tmQM, as indicated by the token imbalance metric.</li>
<li>Current benchmarks (MoleculeNet) lack sufficient diversity to evaluate tokenizer robustness across the full periodic table, isotopes, charged species, and uncommon bond types.</li>
<li>The downstream impact of token ambiguities in BPE-based tokenizers (e.g., ChemBERTa&rsquo;s conflation of <code>Sc</code> as both sulfur-carbon and scandium) remains unclear.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>Enamine REAL Space</td>
          <td>1.6B SMILES (n-gram), 245M molecules (transformer)</td>
          <td>80/10/10 train/val/test split</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet</td>
          <td>Multiple tasks</td>
          <td>6 regression + 7 classification tasks</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>tmQM</td>
          <td>108K transition metal complexes</td>
          <td>OpenSMILES molecular encodings</td>
      </tr>
      <tr>
          <td>Smirk-GPE training</td>
          <td>Enamine REAL Space (subset)</td>
          <td>262M molecules</td>
          <td>Training split only</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Smirk</strong>: Two-stage regex-based tokenization (atom decomposition, then glyph decomposition). No training required. Vocabulary: 165 tokens.</li>
<li><strong>Smirk-GPE</strong>: BPE-like compression on top of Smirk. Operates on token IDs (not strings) to preserve chemical disambiguation. Final vocabulary: 2,300 tokens.</li>
<li><strong>N-gram models</strong>: Add-one smoothing, bidirectional context ($2n - 2$ total context window). Implemented in Julia with exact integer arithmetic.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa-PreLayerNorm, 8 layers, 8 attention heads, hidden size 512, intermediate size 2048, max sequence length 2048. ~25M parameters (excluding embeddings).</li>
<li><strong>Pretraining</strong>: Masked language modeling, 30,000 steps, effective batch size 8192, FusedLamb optimizer, learning rate $1.6 \times 10^{-4}$.</li>
<li><strong>Finetuning</strong>: 100,000 steps, AdamW optimizer, effective batch size 128, learning rate $1.6 \times 10^{-4}$.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>MoleculeNet preferred metrics per task (AUROC for classification, MAE/RMSE for regression)</li>
<li>Fixed-effects models for standardized effect size estimation</li>
<li>Spearman&rsquo;s rank correlation between n-gram and transformer metrics</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pretraining: 2x NVIDIA A100 GPUs (Delta system at NCSA)</li>
<li>Finetuning: 1x NVIDIA A40 GPU</li>
<li>N-gram models: CPU-based (Julia implementation)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BattModels/Smirk">Smirk tokenizer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Rust implementation with Python bindings, available on PyPI</td>
      </tr>
      <tr>
          <td>Model checkpoints</td>
          <td>Model</td>
          <td>Not specified</td>
          <td>Pretrained and finetuned checkpoints included in data release</td>
      </tr>
      <tr>
          <td>N-gram code</td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Julia implementation included in data release</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wadell, A., Bhutani, A., &amp; Viswanathan, V. (2026). Tokenization for Molecular Foundation Models. <em>Journal of Chemical Information and Modeling</em>, 66(3), 1384-1393. <a href="https://doi.org/10.1021/acs.jcim.5c01856">https://doi.org/10.1021/acs.jcim.5c01856</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wadell2026tokenization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tokenization for Molecular Foundation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wadell, Alexius and Bhutani, Anoushka and Viswanathan, Venkatasubramanian}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{66}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1384--1393}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.5c01856}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES-BERT: BERT-Style Pre-Training for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-bert/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-bert/</guid><description>SMILES-BERT applies BERT-style masked pre-training to SMILES strings for molecular property prediction, using Transformer encoders fine-tuned on labeled data.</description><content:encoded><![CDATA[<h2 id="pre-training-transformers-on-smiles-for-molecular-properties">Pre-Training Transformers on SMILES for Molecular Properties</h2>
<p>SMILES-BERT is a <strong>Method</strong> paper that introduces a BERT-inspired pre-training and fine-tuning framework for molecular property prediction. The primary contribution is adapting the masked language model paradigm from NLP to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a>, enabling a Transformer encoder to learn molecular representations from large-scale unlabeled data before fine-tuning on smaller labeled datasets.</p>
<h2 id="limited-labels-in-molecular-property-prediction">Limited Labels in Molecular Property Prediction</h2>
<p>Molecular property prediction is central to drug discovery and chemical design, but obtaining labeled data requires expensive biological assays. Deep learning methods for this task fall into three categories: manually designed fingerprints (e.g., ECFP), graph-based methods (GCNs operating on molecular graphs), and sequence-based methods (RNNs or CNNs operating on SMILES strings).</p>
<p>Prior unsupervised approaches like <a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Seq2seq Fingerprint</a> used an encoder-decoder architecture to learn representations from unlabeled SMILES, but the decoder acts as scaffolding that consumes GPU memory during pre-training without contributing to downstream prediction. The semi-supervised Seq3seq Fingerprint improved on this by incorporating labeled data, but retained the encoder-decoder inefficiency. RNN-based methods are also hard to parallelize during training and require careful tuning (gradient clipping, early stopping) to converge.</p>
<p>The authors identify two motivations: (1) building a semi-supervised model that effectively leverages large pools of unlabeled SMILES to improve prediction with limited labels, and (2) designing an architecture where the entire pre-trained model participates in fine-tuning (no wasted decoder parameters) and naturally supports parallel training.</p>
<h2 id="masked-smiles-recovery-with-transformer-encoders">Masked SMILES Recovery with Transformer Encoders</h2>
<p>The core innovation is the Masked SMILES Recovery pre-training task, directly analogous to BERT&rsquo;s masked language modeling. The model architecture is a stack of Transformer encoder layers, making it fully attention-based and, unlike RNN encoders, parallelizable during training.</p>
<h3 id="architecture">Architecture</h3>
<p>SMILES-BERT uses 6 Transformer encoder layers, each with 4-head multi-head self-attention and feed-forward dimension of 1024. Each Transformer layer contains three components: a pre-attention feed-forward network, a self-attention layer, and a post-attention feed-forward network, all followed by layer normalization with residual connections.</p>
<p>The self-attention mechanism uses scaled dot-product attention:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^{Q})(XW^{K})^{T}}{\sqrt{d_{k}}}\right) XW^{V}
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^{Q}$, $W^{K}$, $W^{V} \in \mathbb{R}^{M \times d_{k}}$ are the query, key, and value weight matrices, and $\sqrt{d_{k}}$ is the scaling factor.</p>
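<p>The attention update can be sketched in NumPy as follows. This is an illustrative single-head sketch of the formula above; the shapes and variable names are ours, not taken from the paper&rsquo;s implementation:</p>

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention:
    Z = softmax((X W^Q)(X W^K)^T / sqrt(d_k)) X W^V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = W_k.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, M, d_k = 5, 16, 8  # sequence length, input dim, head dim
X = rng.normal(size=(N, M))
W_q, W_k, W_v = (rng.normal(size=(M, d_k)) for _ in range(3))
Z = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(Z.shape)  # (5, 8)
```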
<p>Input SMILES are tokenized at the character level with token embeddings and positional embeddings. A special <code>&lt;GO&gt;</code> token is prepended to each SMILES, and its output representation is used for downstream classification/regression after fine-tuning.</p>
<h3 id="pre-training-masked-smiles-recovery">Pre-training: Masked SMILES Recovery</h3>
<p>Following BERT&rsquo;s masking strategy, 15% of tokens in each SMILES are selected for masking (minimum one per SMILES). Of the selected tokens:</p>
<ul>
<li>85% are replaced with a <code>&lt;MASK&gt;</code> token</li>
<li>10% are replaced with a random token from the vocabulary</li>
<li>5% are kept unchanged</li>
</ul>
<p>The model is trained to recover the original tokens at masked positions. The loss is computed only on the masked token outputs.</p>
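<p>A minimal sketch of this masking step, following the rates above; the toy vocabulary, helper names, and selection details are ours, not the paper&rsquo;s:</p>

```python
import random

# Toy character-level vocabulary (illustrative, not the paper's)
VOCAB = ["C", "c", "O", "N", "(", ")", "=", "1", "2", "<MASK>", "<GO>"]

def mask_tokens(tokens, rng, rate=0.15):
    """Select ~15% of positions (at least one); of those, replace
    85% with <MASK>, 10% with a random vocabulary token, keep 5%."""
    n_select = max(1, round(len(tokens) * rate))
    positions = rng.sample(range(len(tokens)), n_select)
    corrupted, targets = list(tokens), {}
    for pos in positions:
        targets[pos] = tokens[pos]  # loss is computed only at these positions
        r = rng.random()
        if r < 0.85:
            corrupted[pos] = "<MASK>"
        elif r < 0.95:
            corrupted[pos] = rng.choice(VOCAB)
        # else: keep the original token unchanged
    return corrupted, targets

rng = random.Random(0)
tokens = list("CC(=O)Oc1ccccc1")  # aspirin-like SMILES, character-level
corrupted, targets = mask_tokens(tokens, rng)
print(len(targets))  # 2 of 15 positions selected
```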
<h3 id="fine-tuning">Fine-tuning</h3>
<p>After pre-training, a classifier or regressor head is added to the <code>&lt;GO&gt;</code> token output. The entire model (all Transformer layers plus the new head) is fine-tuned on the labeled dataset.</p>
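<p>A minimal sketch of the head, assuming the classifier reads only the <code>&lt;GO&gt;</code> position&rsquo;s output; the dimensions and initialization are illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_classes = 256, 2
H = rng.normal(size=(17, d_model))  # encoder outputs: <GO> + 16 tokens

# Classification head on the <GO> position (index 0); the head is
# freshly initialized and trained jointly with the full encoder.
W = rng.normal(size=(d_model, n_classes)) * 0.01
b = np.zeros(n_classes)
logits = H[0] @ W + b
print(logits.shape)  # (2,)
```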
<p>Key differences from the original BERT:</p>
<ol>
<li>Only the Masked SMILES Recovery task is used (BERT&rsquo;s next sentence prediction is dropped since SMILES have no consecutive-sentence structure)</li>
<li>Segment embeddings are removed</li>
<li>The architecture is smaller (6 layers, 4 heads, 1024 FFN dim) since SMILES have a much smaller vocabulary and shorter sequences than natural language</li>
</ol>
<p>The authors compared this configuration against a larger BERT-base setup (12 layers, 12 heads, 3072 FFN dim) and found no meaningful performance difference, confirming that the smaller model is sufficient for SMILES.</p>
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>SMILES-BERT was pre-trained on the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a> with 18,671,355 training SMILES, 10,000 for validation, and 10,000 for evaluation. Pre-training ran for 10 epochs using the Adam optimizer with a warm-up strategy (learning rate from $10^{-9}$ to $10^{-4}$ over 4,000 steps, then inverse-square-root decay). Batch size was 256 and dropout was 0.1. The pre-training masked SMILES exact recovery rate reached 82.85% on the validation set.</p>
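<p>The learning-rate schedule can be sketched as follows. The paper gives the endpoints, warm-up length, and decay family; the linear interpolation during warm-up is our assumption:</p>

```python
def lr_at(step, warmup=4000, lr_init=1e-9, lr_peak=1e-4):
    """Linear warm-up from lr_init to lr_peak over `warmup` steps,
    then inverse-square-root decay from the peak."""
    if step <= warmup:
        return lr_init + (lr_peak - lr_init) * step / warmup
    return lr_peak * (warmup / step) ** 0.5

print(lr_at(4000))   # approximately the 1e-4 peak
print(lr_at(16000))  # 4x past warm-up -> half the peak (5e-5)
```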
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a></td>
          <td>NCATS/NIH</td>
          <td>10,850</td>
          <td>Classification (threshold 1.88)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PM2</td>
          <td>NCATS/NIH</td>
          <td>323,242</td>
          <td>Classification (threshold 0.024896)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PCBA-686978</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></td>
          <td>302,175</td>
          <td>Classification</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p>All datasets were split 80/10/10 for train/validation/test. Fine-tuning used Adam with a fixed learning rate for 50 epochs, selecting the best model on validation data.</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>Circular Fingerprint (CircularFP)</strong>: Manually designed hash-based fingerprint (ECFP family)</li>
<li><strong>Neural Fingerprint (NeuralFP)</strong>: Graph-based neural network replacing hash functions with learned layers</li>
<li><strong>Seq2seq Fingerprint (Seq2seqFP)</strong>: Unsupervised encoder-decoder model on SMILES</li>
<li><strong>Seq3seq Fingerprint (Seq3seqFP)</strong>: Semi-supervised encoder-decoder model on SMILES</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP</th>
          <th>PM2</th>
          <th>PCBA-686978</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CircularFP</td>
          <td>~0.90</td>
          <td>0.6858</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>NeuralFP</td>
          <td>~0.90</td>
          <td>0.6802</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>Seq2seqFP</td>
          <td>~0.87</td>
          <td>0.6112</td>
          <td>~0.80</td>
      </tr>
      <tr>
          <td>Seq3seqFP</td>
          <td>~0.90</td>
          <td>0.7038</td>
          <td>~0.84</td>
      </tr>
      <tr>
          <td><strong>SMILES-BERT</strong></td>
          <td><strong>0.9154</strong></td>
          <td><strong>0.7589</strong></td>
          <td><strong>0.8784</strong></td>
      </tr>
  </tbody>
</table>
<p>SMILES-BERT outperformed all baselines on all three datasets. The improvement over Seq3seqFP was approximately 2% on LogP, 5.5% on PM2, and 3.8% on PCBA-686978. The results on PM2 (the largest labeled dataset) show that pre-training benefits persist even with substantial labeled data.</p>
<h3 id="structure-study">Structure Study</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>FFN Dim</th>
          <th>LogP Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES-BERT</td>
          <td>6</td>
          <td>4</td>
          <td>1024</td>
          <td>0.9154</td>
      </tr>
      <tr>
          <td>SMILES-BERT (large)</td>
          <td>12</td>
          <td>12</td>
          <td>3072</td>
          <td>0.9147</td>
      </tr>
  </tbody>
</table>
<p>The larger configuration provided no improvement, supporting the choice of the smaller, more efficient architecture.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>SMILES-BERT demonstrated that BERT-style masked pre-training on SMILES strings produces transferable molecular representations that improve property prediction across datasets of varying sizes and property types.</p>
<p>Key findings:</p>
<ul>
<li>The Masked SMILES Recovery pre-training task transfers effectively to molecular property prediction</li>
<li>The full model participates in fine-tuning (no wasted decoder), making SMILES-BERT more parameter-efficient than encoder-decoder alternatives</li>
<li>A smaller Transformer configuration (6 layers, 4 heads) matches the performance of a BERT-base-sized model for SMILES data</li>
<li>Pre-training on ~18.7M SMILES from ZINC provides robust initialization across different downstream tasks</li>
</ul>
<p><strong>Limitations</strong>: The evaluation uses only classification accuracy as the metric, without reporting AUC-ROC, F1, or other metrics common in molecular property prediction. The comparison is limited to four baselines, and two of the three evaluation datasets (LogP, PM2) are non-public NIH datasets. The paper does not explore different pre-training dataset sizes or ablate the masking strategy. Only classification tasks are evaluated, though the architecture supports regression.</p>
<p><strong>Future work</strong>: The authors propose incorporating Quantitative Estimate of Druglikeness (QED) prediction as an additional pre-training task to warm up the model&rsquo;s classification capability, analogous to BERT&rsquo;s next sentence prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC</td>
          <td>18,671,355 SMILES</td>
          <td>Publicly available database</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LogP</td>
          <td>10,850</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PM2</td>
          <td>323,242</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PCBA-686978</td>
          <td>302,175</td>
          <td>Public, from PubChem BioAssay</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Adam optimizer, warm-up for 4,000 steps ($10^{-9}$ to $10^{-4}$), inverse-square-root LR schedule, batch size 256, dropout 0.1, 10 epochs</li>
<li>Fine-tuning: Adam optimizer, fixed LR (insensitive to choice among $10^{-5}$, $10^{-6}$, $10^{-7}$), 50 epochs, best model on validation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>6 Transformer encoder layers, 4-head multi-head attention, FFN dim 1024</li>
<li>Token embedding + positional embedding, <code>&lt;GO&gt;</code> special token</li>
<li>Implemented with FairSeq (Facebook AI Research Sequence-to-Sequence Toolkit)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILES-BERT</th>
          <th>Best Baseline (Seq3seqFP)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP Accuracy</td>
          <td>0.9154</td>
          <td>~0.90</td>
          <td>~2% improvement</td>
      </tr>
      <tr>
          <td>PM2 Accuracy</td>
          <td>0.7589</td>
          <td>0.7038</td>
          <td>~5.5% improvement</td>
      </tr>
      <tr>
          <td>PCBA Accuracy</td>
          <td>0.8784</td>
          <td>~0.84</td>
          <td>~3.8% improvement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions GPU training and NVIDIA GPU donation in acknowledgments but does not specify the exact GPU model or training time beyond noting that pre-training on a single GPU takes over a week for 10 epochs.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No public code or model release identified</td>
          <td>-</td>
          <td>-</td>
          <td>Paper does not provide a GitHub link or model checkpoint</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The ZINC pre-training data is public and the architecture is described in detail, but no code or pre-trained weights are released. Two of three evaluation datasets (LogP, PM2) are non-public.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, S., Guo, Y., Wang, Y., Sun, H., &amp; Huang, J. (2019). SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In <em>Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB &lsquo;19)</em>, 429-436. <a href="https://doi.org/10.1145/3307339.3342186">https://doi.org/10.1145/3307339.3342186</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2019smilesbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Sheng and Guo, Yuzhi and Wang, Yuhong and Sun, Hongmao and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{429--436}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3307339.3342186}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES vs SELFIES Tokenization for Chemical LMs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</guid><description>Atom Pair Encoding (APE) tokenizer outperforms BPE on SMILES and SELFIES in RoBERTa-based chemical language models across MoleculeNet classification tasks.</description><content:encoded><![CDATA[<h2 id="atom-pair-encoding-for-chemical-language-modeling">Atom Pair Encoding for Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom Pair Encoding (APE), a tokenization algorithm designed specifically for chemical string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The primary contribution is demonstrating that a chemistry-aware tokenizer, which preserves atomic identity during subword merging, leads to improved molecular property classification accuracy in transformer-based models compared to the standard Byte Pair Encoding (BPE) approach.</p>
<h2 id="why-tokenization-matters-for-chemical-strings">Why Tokenization Matters for Chemical Strings</h2>
<p>Existing chemical language models based on BERT/RoBERTa architectures have typically relied on BPE for tokenizing SMILES and SELFIES strings. <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte Pair Encoding (BPE)</a> was originally designed for natural language and data compression, where it excels at breaking words into meaningful subword units. When applied to chemical strings, BPE operates at the character level without understanding chemical semantics, leading to several problems:</p>
<ul>
<li><strong>Stray characters</strong>: BPE may create tokens like &ldquo;C)(&rdquo; that have no chemical meaning.</li>
<li><strong>Element splitting</strong>: Multi-character elements like chlorine (&ldquo;Cl&rdquo;) can be split into &ldquo;C&rdquo; and &ldquo;l&rdquo;, so the model misreads chlorine as a carbon atom followed by a meaningless dangling character.</li>
<li><strong>Lost structural context</strong>: BPE compresses sequences without considering how character position encodes molecular structure.</li>
</ul>
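<p>The element-splitting failure is easy to reproduce. A simplified atom-aware pattern (our approximation of the usual SMILES atom regex, not the paper&rsquo;s tokenizer) keeps &ldquo;Cl&rdquo; and &ldquo;Br&rdquo; intact where character-level splitting severs them:</p>

```python
import re

# Simplified SMILES token pattern: bracket atoms, two-letter elements,
# single-letter organic-subset atoms, then structural symbols and digits.
ATOM_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|[()=#+\-\\/@%.\d]"
)

smiles = "CC(Cl)Br"
char_level = list(smiles)                  # severs Cl into 'C', 'l'
atom_level = ATOM_PATTERN.findall(smiles)  # keeps Cl and Br whole
print(char_level)  # ['C', 'C', '(', 'C', 'l', ')', 'B', 'r']
print(atom_level)  # ['C', 'C', '(', 'Cl', ')', 'Br']
```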
<p>Previous work on <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a> attempted to address this by iteratively merging SMILES substrings into chemically meaningful tokens. However, SPE had practical limitations: its Python implementation did not support SELFIES, and it produced a smaller vocabulary (~3000 tokens) than what the data could support. These gaps motivated the development of APE.</p>
<h2 id="the-ape-tokenizer-chemistry-aware-subword-merging">The APE Tokenizer: Chemistry-Aware Subword Merging</h2>
<p>APE draws inspiration from both BPE and SPE but addresses their shortcomings. The key design decisions are:</p>
<ol>
<li>
<p><strong>Atom-level initialization</strong>: Instead of starting from individual characters (as BPE does), APE begins with chemically valid atomic units. For SMILES, this means recognizing multi-character elements (e.g., &ldquo;Cl&rdquo;, &ldquo;Br&rdquo;) as single tokens. For SELFIES, each bracketed string (e.g., [C], [Ring1], [=O]) serves as the fundamental unit.</p>
</li>
<li>
<p><strong>Iterative pair merging</strong>: Like BPE, APE iteratively merges the most frequent adjacent token pairs. The difference is that the initial tokenization preserves atomic boundaries, so merged tokens always represent valid chemical substructures.</p>
</li>
<li>
<p><strong>Larger vocabulary</strong>: Using the same minimum frequency threshold of 2000, APE generates approximately 5300 unique tokens from the PubChem dataset, compared to SPE&rsquo;s approximately 3000. This richer vocabulary provides more expressive power for representing chemical substructures.</p>
</li>
<li>
<p><strong>SELFIES compatibility</strong>: APE natively supports both SMILES and SELFIES, using the bracketed token structure of SELFIES as its starting point for that representation.</p>
</li>
</ol>
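<p>The iterative pair merging in step 2 can be sketched as follows. This is a pure-Python simplification of the procedure described above; <code>min_freq</code> stands in for the frequency threshold, and the tie-breaking and stopping details are our assumptions:</p>

```python
from collections import Counter

def merge_most_frequent(corpus, n_merges, min_freq=2):
    """Iteratively merge the most frequent adjacent token pair.
    Because sequences start from atomic units, every merged token
    is a concatenation of valid chemical substructures."""
    for _ in range(n_merges):
        pairs = Counter(
            (seq[i], seq[i + 1])
            for seq in corpus for i in range(len(seq) - 1)
        )
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < min_freq:
            break
        merged, new_corpus = a + b, []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return corpus

# Atom-level start: "Cl" is already one token, so merges never split it.
corpus = [["C", "C", "O"], ["C", "C", "Cl"], ["C", "C", "O"]]
result = merge_most_frequent(corpus, n_merges=2)
print(result)  # [['CCO'], ['CC', 'Cl'], ['CCO']]
```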
<p>The tokenizer was trained on a subset of 2 million molecules from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> (10 million SMILES total). This produced four tokenizer variants: SMILES-BPE, SMILES-APE, SELFIES-BPE, and SELFIES-APE.</p>
<h2 id="pre-training-and-evaluation-on-moleculenet-benchmarks">Pre-training and Evaluation on MoleculeNet Benchmarks</h2>
<h3 id="model-architecture">Model architecture</h3>
<p>All four models use the RoBERTa architecture with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads. Pre-training used masked language modeling (MLM) with 15% token masking on 1 million molecules from PubChem, with a validation set of 100,000 molecules. Each model was pre-trained for 20 epochs using AdamW, with hyperparameter optimization via Optuna.</p>
<h3 id="downstream-tasks">Downstream tasks</h3>
<p>The models were fine-tuned on three <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Category</th>
          <th>Compounds</th>
          <th>Tasks</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Biophysics</td>
          <td>41,127</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Physiology</td>
          <td>7,831</td>
          <td>12</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p>Data was split 80/10/10 (train/validation/test) following MoleculeNet recommendations. Models were fine-tuned for 5 epochs with early stopping based on validation ROC-AUC.</p>
<h3 id="baselines">Baselines</h3>
<p>Results were compared against two text-based models (<a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a> MTR-77M and <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a>) and two graph-based models (D-MPNN from Chemprop and MoleculeNet Graph-Conv).</p>
<h3 id="main-results">Main results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>HIV ROC</th>
          <th>Tox21 ROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILYAPE-1M</td>
          <td>0.754 +/- 0.006</td>
          <td>0.772 +/- 0.010</td>
          <td>0.838 +/- 0.002</td>
      </tr>
      <tr>
          <td>SMILYBPE-1M</td>
          <td>0.746 +/- 0.006</td>
          <td>0.754 +/- 0.015</td>
          <td>0.849 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYAPE-1M</td>
          <td>0.735 +/- 0.015</td>
          <td>0.768 +/- 0.012</td>
          <td>0.842 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYBPE-1M</td>
          <td>0.676 +/- 0.014</td>
          <td>0.709 +/- 0.012</td>
          <td>0.825 +/- 0.001</td>
      </tr>
      <tr>
          <td>ChemBERTa-2-MTR-77M</td>
          <td>0.698 +/- 0.014</td>
          <td>0.735 +/- 0.008</td>
          <td>0.790 +/- 0.003</td>
      </tr>
      <tr>
          <td>SELFormer</td>
          <td>0.716 +/- 0.021</td>
          <td>0.769 +/- 0.010</td>
          <td>0.838 +/- 0.005</td>
      </tr>
      <tr>
          <td>MoleculeNet-Graph-Conv</td>
          <td>0.690</td>
          <td>0.763</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.737</td>
          <td>0.776</td>
          <td>0.851</td>
      </tr>
  </tbody>
</table>
<p>APE outperforms BPE in five of the six notation/task combinations, the exception being SMILES on Tox21. SMILYAPE achieves the best BBBP score (0.754), beating D-MPNN (0.737). On HIV, SMILYAPE (0.772) is competitive with D-MPNN (0.776). On Tox21, D-MPNN (0.851) leads, with SMILYBPE (0.849) and SELFYAPE (0.842) close behind.</p>
<h3 id="statistical-significance">Statistical significance</h3>
<p><a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U tests</a> confirmed statistically significant differences between SMILYAPE and SMILYBPE (p &lt; 0.05 on all datasets). Cliff&rsquo;s delta values indicate large effect sizes: 0.74 (BBBP), 0.70 (HIV), and -1.00 (Tox21, favoring BPE). For SELFIES models, SELFYAPE achieved Cliff&rsquo;s delta of 1.00 across all three datasets, indicating complete separation from SELFYBPE.</p>
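<p>Cliff&rsquo;s delta is straightforward to compute from two samples of per-run scores (pure-Python sketch; the scores below are illustrative, not the paper&rsquo;s):</p>

```python
def cliffs_delta(xs, ys):
    """delta = (#{x > y} - #{x < y}) / (|xs| * |ys|);
    +1 means every x exceeds every y, -1 the reverse."""
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))

# Illustrative per-run ROC-AUC scores (made up for this example):
ape = [0.751, 0.754, 0.757, 0.760, 0.748]
bpe = [0.744, 0.746, 0.749, 0.752, 0.740]
print(cliffs_delta(ape, bpe))  # 0.76
```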
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="ape-outperforms-bpe-by-preserving-atomic-identity">APE outperforms BPE by preserving atomic identity</h3>
<p>The advantage of APE over BPE across most tasks stems from APE&rsquo;s atom-level initialization. By starting with chemically valid units rather than individual characters, APE avoids creating nonsensical tokens that break chemical elements or mix structural delimiters with atoms.</p>
<h3 id="smiles-outperforms-selfies-with-ape-tokenization">SMILES outperforms SELFIES with APE tokenization</h3>
<p>SMILYAPE generally outperforms SELFYAPE across tasks. Attention weight analysis revealed that SMILYAPE assigns more weight to immediate neighboring tokens (0.108 vs. 0.096) and less to distant tokens (0.030 vs. 0.043). This pattern aligns with chemical intuition: bonding is primarily determined by directly connected atoms. SMILYAPE also produces more compact tokenizations (8.6 tokens per molecule vs. 11.9 for SELFYAPE), potentially allowing more efficient attention allocation.</p>
<h3 id="selfies-models-show-higher-inter-tokenizer-agreement">SELFIES models show higher inter-tokenizer agreement</h3>
<p>On the BBBP dataset, all true positives identified by SELFYBPE were also captured by SELFYAPE, with SELFYAPE achieving higher recall (61.68% vs. 55.14%). In contrast, SMILES-based models shared only 29.3% of true positives between APE and BPE variants, indicating that tokenization choice has a larger impact on SMILES models.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Pre-training used only 1 million molecules, compared to 77 million for ChemBERTa-2. Despite this, APE models were competitive or superior, but scaling effects remain unexplored.</li>
<li>Evaluation was limited to three binary classification tasks from MoleculeNet. Regression tasks, molecular generation, and reaction prediction were not tested.</li>
<li>The Tox21 result is notable: SMILYBPE outperforms SMILYAPE (0.849 vs. 0.838), suggesting APE&rsquo;s advantage may be task-dependent.</li>
<li>No comparison with recent atom-level tokenizers like <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES</a> or newer approaches beyond SPE.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tokenizer training</td>
          <td>PubChem subset</td>
          <td>2M molecules</td>
          <td>SMILES strings converted to SELFIES via selfies library</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>1M molecules</td>
          <td>100K validation set</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>7,831 compounds</td>
          <td>80/10/10 split, 12 tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Tokenizers: BPE (via Hugging Face), APE (custom implementation, minimum frequency 2000)</li>
<li>Pre-training: Masked Language Modeling (15% masking) for 20 epochs</li>
<li>Optimizer: AdamW with Optuna hyperparameter search</li>
<li>Fine-tuning: 5 epochs with early stopping on validation ROC-AUC</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Architecture: RoBERTa with 6 layers, hidden size 768, intermediate size 1536, 12 attention heads</li>
<li>Four variants: SMILYAPE, SMILYBPE, SELFYAPE, SELFYBPE</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILYAPE</th>
          <th>SMILYBPE</th>
          <th>SELFYAPE</th>
          <th>SELFYBPE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP ROC-AUC</td>
          <td>0.754</td>
          <td>0.746</td>
          <td>0.735</td>
          <td>0.676</td>
      </tr>
      <tr>
          <td>HIV ROC-AUC</td>
          <td>0.772</td>
          <td>0.754</td>
          <td>0.768</td>
          <td>0.709</td>
      </tr>
      <tr>
          <td>Tox21 ROC-AUC</td>
          <td>0.838</td>
          <td>0.849</td>
          <td>0.842</td>
          <td>0.825</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA RTX 3060 GPU with 12 GiB VRAM</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mikemayuare/apetokenizer">APE Tokenizer</a></td>
          <td>Code</td>
          <td>Other (unspecified SPDX)</td>
          <td>Official APE tokenizer implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/mikemayuare/PubChem10M_SMILES_SELFIES">PubChem10M SMILES/SELFIES</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>10M SMILES with SELFIES conversions</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/mikemayuare">Pre-trained and fine-tuned models</a></td>
          <td>Model</td>
          <td>Not specified</td>
          <td>All four model variants on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leon, M., Perezhohin, Y., Peres, F., Popovič, A., &amp; Castelli, M. (2024). Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. <em>Scientific Reports</em>, 14(1), 25016. <a href="https://doi.org/10.1038/s41598-024-76440-8">https://doi.org/10.1038/s41598-024-76440-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leon2024comparing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leon, Miguelangel and Perezhohin, Yuriy and Peres, Fernando and Popovi{\v{c}}, Ale{\v{s}} and Castelli, Mauro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{25016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-024-76440-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES Transformer: Low-Data Molecular Fingerprints</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-transformer/</guid><description>SMILES Transformer uses unsupervised Transformer pre-training on SMILES strings to produce molecular fingerprints that excel in low-data drug discovery tasks.</description><content:encoded><![CDATA[<h2 id="a-transformer-approach-to-learned-molecular-fingerprints">A Transformer Approach to Learned Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Transformer (ST), a Transformer-based sequence-to-sequence model pre-trained on unlabeled SMILES strings to produce continuous, data-driven molecular fingerprints. The primary contribution is demonstrating that unsupervised pre-training on chemical text representations yields fingerprints that generalize well under low-data conditions, outperforming both rule-based fingerprints (ECFP) and graph convolution models on several <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. A secondary contribution is the Data Efficiency Metric (DEM), a scalar metric for evaluating model performance across varying training set sizes.</p>
<h2 id="the-low-data-problem-in-molecular-property-prediction">The Low-Data Problem in Molecular Property Prediction</h2>
<p>Machine learning for drug discovery depends on molecular representations, but labeled datasets of experimentally validated properties are typically small. Conventional approaches fall into two camps: rule-based fingerprints like ECFP that hash substructures into sparse binary vectors, and graph-based methods like GraphConv that learn representations end-to-end. Rule-based fingerprints perform poorly with shallow models or limited data, while graph-based methods are designed for large, fully labeled settings.</p>
<p>Pre-training on unlabeled data had shown strong results in NLP (ELMo, BERT, XLNet), and prior work in cheminformatics had explored RNN-based and VAE-based pre-training on SMILES (<a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Seq2Seq fingerprints</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>, heteroencoders). However, none of these studies systematically evaluated performance in small-data settings. Honda et al. fill this gap by applying Transformer-based pre-training to SMILES and measuring data efficiency explicitly.</p>
<h2 id="transformer-pre-training-on-smiles-with-pooled-fingerprint-extraction">Transformer Pre-training on SMILES with Pooled Fingerprint Extraction</h2>
<p>The core innovation is a Transformer encoder-decoder architecture pre-trained as an autoencoder on SMILES strings, with a specific fingerprint extraction strategy that pools the encoder outputs into a fixed-length vector.</p>
<h3 id="architecture">Architecture</h3>
<p>The model uses 4 Transformer blocks for both the encoder and decoder, each with 4-head attention and 256 embedding dimensions plus 2 linear layers. Input SMILES are tokenized at the symbol level (e.g., &lsquo;c&rsquo;, &lsquo;Br&rsquo;, &lsquo;=&rsquo;, &lsquo;(&rsquo;, &lsquo;2&rsquo;) and one-hot encoded. Following Vaswani et al. (2017), the input uses the sum of token encoding and positional encoding.</p>
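<p>The paper does not release its tokenizer, but the symbol-level scheme described above can be sketched with a regular expression. This is a minimal sketch: the exact token inventory is an assumption beyond the examples the paper lists (&lsquo;c&rsquo;, &lsquo;Br&rsquo;, &lsquo;=&rsquo;, &lsquo;(&rsquo;, &lsquo;2&rsquo;).</p>

```python
import re

# Symbol-level SMILES tokenizer: bracket atoms and two-letter elements
# (Cl, Br, Si) stay single tokens; everything else is one character.
_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|%\d{2}|[A-Za-z]|\d|[=#+\-()/\\.])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into symbol-level tokens."""
    return _TOKEN_RE.findall(smiles)

print(tokenize_smiles("O=C(O)c1ccccc1"))
# ['O', '=', 'C', '(', 'O', ')', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```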
<h3 id="pre-training">Pre-training</h3>
<p>The model is pre-trained on 861,000 unlabeled SMILES sampled from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL24</a> to minimize cross-entropy between input and output SMILES (i.e., reconstruction). <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> (Bjerrum, 2017) randomly generates non-canonical SMILES at each epoch to reduce representation bias. Training runs for 5 epochs with Adam optimization, reaching a perplexity of 1.0 (perfect decoding).</p>
<h3 id="fingerprint-extraction">Fingerprint Extraction</h3>
<p>Since the Transformer outputs symbol-level (atom-level) representations, a pooling strategy produces molecule-level fingerprints. Four vectors are concatenated:</p>
<ol>
<li>Mean-pooled output of the last encoder layer</li>
<li>Max-pooled output of the last encoder layer</li>
<li>First output token of the last encoder layer</li>
<li>First output token of the penultimate encoder layer</li>
</ol>
<p>This produces a 1024-dimensional fingerprint, matching the dimensionality of ECFP for fair comparison.</p>
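<p>A minimal NumPy sketch of this pooling scheme, assuming symbol-level encoder outputs of shape (seq_len, 256); the function name, array names, and random inputs are illustrative, not taken from the official code:</p>

```python
import numpy as np

def extract_fingerprint(last_layer, penultimate_layer):
    """Pool symbol-level encoder outputs (seq_len, 256) into one 1024-d vector."""
    parts = [
        last_layer.mean(axis=0),   # 1. mean pool over the last encoder layer
        last_layer.max(axis=0),    # 2. max pool over the last encoder layer
        last_layer[0],             # 3. first output token, last encoder layer
        penultimate_layer[0],      # 4. first output token, penultimate layer
    ]
    return np.concatenate(parts)   # 4 x 256 = 1024 dimensions

rng = np.random.default_rng(0)
fp = extract_fingerprint(rng.normal(size=(30, 256)), rng.normal(size=(30, 256)))
print(fp.shape)  # (1024,)
```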
<h3 id="data-efficiency-metric">Data Efficiency Metric</h3>
<p>The paper proposes DEM to measure how well a model performs across different training set sizes:</p>
<p>$$
M_{DE}(f, m) = \frac{1}{|I|} \sum_{i \in I} m(f_i, X_i, Y_i)
$$</p>
<p>where $f_i$ is the model trained on the fraction $i$ of training data, $m$ is the task metric, and $I = {0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8}$ doubles the training percentage at each step. This captures average performance across a range of data availability, giving a single scalar that balances accuracy and data efficiency.</p>
<h2 id="benchmarking-across-moleculenet-with-data-efficiency-focus">Benchmarking Across MoleculeNet with Data Efficiency Focus</h2>
<h3 id="datasets">Datasets</h3>
<p>The evaluation uses 10 datasets from MoleculeNet spanning three categories:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Molecules</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>Regression</td>
          <td>1,128</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>Regression</td>
          <td>643</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>1</td>
          <td>Regression</td>
          <td>4,200</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>MUV</td>
          <td>17</td>
          <td>Classification</td>
          <td>93,127</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>HIV</td>
          <td>1</td>
          <td>Classification</td>
          <td>41,913</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>BACE</td>
          <td>1</td>
          <td>Classification</td>
          <td>1,522</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>Classification</td>
          <td>2,053</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>Tox21</td>
          <td>12</td>
          <td>Classification</td>
          <td>8,014</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>SIDER</td>
          <td>27</td>
          <td>Classification</td>
          <td>1,427</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>ClinTox</td>
          <td>2</td>
          <td>Classification</td>
          <td>1,491</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>ECFP4</strong>: Rule-based extended-connectivity fingerprint with 1024 dimensions</li>
<li><strong>RNNS2S</strong>: RNN-based Seq2Seq pre-trained fingerprint (3-layer bidirectional GRU, same pre-training data as ST)</li>
<li><strong>GraphConv</strong>: Graph convolution network trained end-to-end on labeled data</li>
</ul>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>All fingerprint methods use a simple MLP classifier/regressor from scikit-learn with default hyperparameters to isolate the fingerprint quality from model capacity. Datasets are randomly split (stratified for classification), and results are averaged over 20 trials. Note that random splits are used rather than scaffold splits for the DEM experiments.</p>
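<p>The stratified random split can be sketched without scikit-learn; this is a simplified stand-in for the splitting the paper performs with scikit-learn utilities:</p>

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_frac=0.2, seed=0):
    """Random split that preserves each class's proportion in both halves."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

train_set, test_set = stratified_split(list(range(100)), [0] * 80 + [1] * 20)
print(len(train_set), len(test_set))  # 80 20
```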
<h3 id="data-efficiency-results-dem">Data Efficiency Results (DEM)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ST+MLP</th>
          <th>ECFP+MLP</th>
          <th>RNNS2S+MLP</th>
          <th>GraphConv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL (RMSE, lower is better)</td>
          <td><strong>1.144</strong></td>
          <td>1.741</td>
          <td>1.317</td>
          <td>1.673</td>
      </tr>
      <tr>
          <td>FreeSolv (RMSE, lower is better)</td>
          <td><strong>2.246</strong></td>
          <td>3.043</td>
          <td>2.987</td>
          <td>3.476</td>
      </tr>
      <tr>
          <td>Lipophilicity (RMSE, lower is better)</td>
          <td>1.169</td>
<td>1.090</td>
          <td>1.219</td>
          <td><strong>1.062</strong></td>
      </tr>
      <tr>
          <td>MUV (PRC-AUC, higher is better)</td>
          <td>0.009</td>
          <td><strong>0.036</strong></td>
          <td>0.010</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>HIV (ROC-AUC, higher is better)</td>
          <td>0.683</td>
          <td>0.697</td>
          <td>0.682</td>
          <td><strong>0.723</strong></td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC, higher is better)</td>
          <td>0.719</td>
          <td><strong>0.769</strong></td>
          <td>0.717</td>
          <td>0.744</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC, higher is better)</td>
          <td><strong>0.900</strong></td>
          <td>0.760</td>
          <td>0.884</td>
          <td>0.795</td>
      </tr>
      <tr>
          <td>Tox21 (ROC-AUC, higher is better)</td>
          <td><strong>0.706</strong></td>
          <td>0.616</td>
          <td>0.702</td>
          <td>0.687</td>
      </tr>
      <tr>
          <td>SIDER (ROC-AUC, higher is better)</td>
          <td>0.559</td>
          <td><strong>0.588</strong></td>
          <td>0.558</td>
          <td>0.557</td>
      </tr>
      <tr>
          <td>ClinTox (ROC-AUC, higher is better)</td>
          <td><strong>0.963</strong></td>
          <td>0.515</td>
          <td>0.904</td>
          <td>0.936</td>
      </tr>
  </tbody>
</table>
<p>ST achieves the best DEM in 5 of 10 datasets (ESOL, FreeSolv, BBBP, Tox21, ClinTox), with particularly strong margins on ClinTox (+0.027 over GraphConv) and BBBP (+0.016 over RNNS2S).</p>
<h3 id="linear-model-experiments">Linear Model Experiments</h3>
<p>To further isolate fingerprint quality, the authors replace MLP with ridge/logistic regression with L2 penalty. On 8 datasets (excluding MUV and SIDER due to class imbalance issues), ST achieves best DEM in 5 of 8, confirming the fingerprint quality holds regardless of downstream model.</p>
<h3 id="stratified-analysis-by-molecule-size">Stratified Analysis by Molecule Size</h3>
<p>On BBBP stratified by SMILES length, ST&rsquo;s ROC-AUC increases with longer SMILES, similar to RNNS2S but unlike GraphConv, which shows stable performance across lengths. This suggests text-based models extract richer information from longer sequences.</p>
<h3 id="comparison-with-record-scores-large-data">Comparison with Record Scores (Large Data)</h3>
<p>Under the large-data setting (80/10/10 train/val/test split with hyperparameter tuning via Optuna), ST achieves first place only in ClinTox (0.954) but performs comparably to ECFP and graph-based models on the other datasets. This confirms that ST&rsquo;s main advantage is in the low-data regime.</p>
<h2 id="strong-low-data-performance-with-caveats-on-scalability">Strong Low-Data Performance with Caveats on Scalability</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Transformer-based unsupervised pre-training on SMILES produces fingerprints that excel in low-data molecular property prediction, achieving best data efficiency on 5 of 10 MoleculeNet tasks.</li>
<li>The advantage is most pronounced on small datasets (ESOL with 1,128 molecules, FreeSolv with 643, BBBP with 2,053, ClinTox with 1,491) where pre-training enables good generalization.</li>
<li>With sufficient labeled data and hyperparameter tuning, ST fingerprints perform comparably to (but do not surpass) graph-based methods.</li>
<li>Longer SMILES provide richer information for text-based models, as shown by the stratified analysis on BBBP.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Random splits are used for most DEM experiments rather than scaffold splits, which may inflate performance estimates for drug discovery applications where training and test molecules are structurally distinct.</li>
<li>The pre-training corpus (861K SMILES from ChEMBL24) is relatively small by modern standards.</li>
<li>MUV performance is poor across all methods (PRC-AUC near zero), suggesting the DEM framework may not be informative for extremely imbalanced or noisy datasets.</li>
<li>No comparison with BERT-style masked language model pre-training, which later work (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>) would show as a viable alternative.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose three directions: (1) replacing the Transformer with Transformer-XL to handle longer SMILES, (2) multi-task pre-training that jointly predicts molecular descriptors (e.g., molecular weight, <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>) alongside SMILES reconstruction, and (3) better exploitation of enumerated SMILES to constrain the latent space.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL24</td>
          <td>861,000 SMILES</td>
          <td>Unlabeled, randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MoleculeNet (10 datasets)</td>
          <td>643 to 93,127 molecules</td>
          <td>See Table 1 for per-dataset details</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder: 4 blocks each, 4-head attention, 256 embedding dimensions</li>
<li>Pre-training: 5 epochs, Adam optimizer, cross-entropy loss, SMILES enumeration for augmentation</li>
<li>Fingerprint: 1024 dimensions from concatenated mean pool, max pool, and first-token outputs</li>
<li>Downstream: scikit-learn MLP (default hyperparameters) for DEM experiments; ridge/logistic regression for linear model experiments; Optuna for hyperparameter search in large-data comparison</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DSPsleeporg/smiles-transformer">smiles-transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Jupyter notebooks)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>DEM averaged over 7 training fractions (1.25% to 80%), 20 trials each</li>
<li>Random splits for DEM; scaffold splits for HIV, BACE, BBBP in large-data comparison</li>
<li>Metrics: RMSE (regression), ROC-AUC or PRC-AUC (classification) per MoleculeNet conventions</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU type or training time for the pre-training phase.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Honda, S., Shi, S., &amp; Ueda, H. R. (2019). SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery. <em>arXiv preprint arXiv:1911.04738</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{honda2019smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Honda, Shion and Shi, Shoi and Ueda, Hiroki R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1911.04738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI+AIS: Hybridizing SMILES with Environment Tokens</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</guid><description>SMI+AIS hybridizes SMILES with Atom-In-SMILES tokens encoding local chemical environments, improving molecular generation binding affinity and synthesizability.</description><content:encoded><![CDATA[<h2 id="a-hybrid-molecular-representation-combining-smiles-and-chemical-environment-tokens">A Hybrid Molecular Representation Combining SMILES and Chemical-Environment Tokens</h2>
<p>This is a <strong>Method</strong> paper that introduces SMI+AIS(N), a hybrid molecular string representation combining standard <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> tokens with <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-In-SMILES (AIS)</a> tokens. AIS tokens encode local chemical environment information (central atom, ring membership, and neighboring atoms) into a single token. The key contribution is a systematic hybridization strategy that selectively replaces the most frequent SMILES tokens with AIS equivalents, preserving SMILES grammar compatibility while enriching token diversity. The method is validated on molecular structure generation via latent space optimization for drug design.</p>
<h2 id="limitations-of-standard-smiles-for-machine-learning">Limitations of Standard SMILES for Machine Learning</h2>
<p>SMILES is the most widely adopted string-based molecular representation, used in major databases like ZINC and PubChem. Despite this ubiquity, SMILES has several well-known limitations for machine learning applications:</p>
<ol>
<li><strong>Non-unique representations</strong>: The same molecule can be encoded as multiple distinct SMILES strings.</li>
<li><strong>Invalid string generation</strong>: Generative models can produce syntactically invalid SMILES that do not correspond to any molecule.</li>
<li><strong>Limited token diversity</strong>: SMILES tokens map one-to-one to atoms or bonds, so the token vocabulary is restricted to the available atom and bond types.</li>
<li><strong>Insufficient chemical context</strong>: Individual SMILES tokens carry no information about the local chemical environment of an atom.</li>
</ol>
<p>Alternative representations like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (guaranteeing validity) and <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> (guaranteeing uniqueness) address some of these issues but share the same fundamental limitation of low token diversity. The Atom-In-SMILES (AIS) representation (Ucak et al., 2023) enriches tokens with neighboring atom and ring information, but using AIS exclusively produces a large vocabulary with many infrequent tokens that can cause data sparsity problems. The authors aim to find a middle ground: adding chemical context to the most common tokens while keeping the vocabulary manageable.</p>
<h2 id="core-innovation-selective-token-hybridization-with-ais">Core Innovation: Selective Token Hybridization with AIS</h2>
<p>The SMI+AIS(N) representation hybridizes standard SMILES with AIS tokens through a frequency-based selection process:</p>
<h3 id="ais-token-structure">AIS Token Structure</h3>
<p>Each AIS token encodes three pieces of information about an atom, delimited by semicolons:</p>
<p>$$
\lbrack \text{central atom} ; \text{ring info} ; \text{neighbor atoms} \rbrack
$$</p>
<p>For example, the oxygen in a carboxyl group of benzoic acid is represented as <code>[O;!R;C]</code>, meaning: oxygen atom, not in a ring, bonded to carbon. In standard SMILES, this would simply be <code>O</code>.</p>
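<p>The token layout can be illustrated with a small formatting helper. This sketch only assembles the string; the real AIS tokenizer derives the ring and neighbor fields from the molecular graph (e.g. via RDKit), which is omitted here:</p>

```python
def ais_token(atom, in_ring, neighbors):
    """Format one Atom-In-SMILES token: [central atom; ring info; neighbors]."""
    ring = "R" if in_ring else "!R"
    return f"[{atom};{ring};{''.join(neighbors)}]"

print(ais_token("O", False, ["C"]))            # [O;!R;C]   (carboxyl oxygen)
print(ais_token("C", False, ["C", "O", "O"]))  # [C;!R;COO] (carboxyl carbon)
```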
<h3 id="hybridization-procedure">Hybridization Procedure</h3>
<ol>
<li>Convert all SMILES strings in the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a> to their full AIS representations.</li>
<li>Count the frequency of each AIS token across the database.</li>
<li>Select the top-N most frequent AIS tokens to form the hybrid vocabulary.</li>
<li>In the hybrid representation, atoms matching these top-N AIS tokens are written in AIS notation; all other atoms use standard SMILES notation.</li>
</ol>
<p>For benzoic acid, the hybridization produces:</p>
<p>$$
\text{SMI}: \texttt{O=C(O)c1ccccc1}
$$</p>
<p>$$
\text{SMI+AIS}: \texttt{\lbrack O;!R;C\rbrack=\lbrack C;!R;COO\rbrack(\lbrack OH;!R;C\rbrack)c1ccccc1}
$$</p>
<p>The parameter N controls vocabulary size. The authors test N = 50, 100, 150, and 200, finding that N = 100-150 provides the best balance for the ZINC database.</p>
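<p>The four-step procedure can be sketched as follows. The position-by-position alignment of the two token streams is an illustrative simplification, and the toy corpus is invented:</p>

```python
from collections import Counter

def top_n_ais_vocab(ais_corpus, n):
    """Count AIS tokens across the corpus and keep the N most frequent (steps 1-3)."""
    counts = Counter(tok for mol in ais_corpus for tok in mol)
    return {tok for tok, _ in counts.most_common(n)}

def hybridize(ais_tokens, smiles_tokens, vocab):
    """Step 4: keep the AIS form for in-vocabulary atoms, else plain SMILES."""
    return [a if a in vocab else s for a, s in zip(ais_tokens, smiles_tokens)]

corpus = [["[C;!R;CO]", "[O;!R;C]"], ["[C;!R;CO]", "[N;!R;C]"]]
vocab = top_n_ais_vocab(corpus, 1)           # '[C;!R;CO]' is the most frequent
print(hybridize(["[C;!R;CO]", "[O;!R;C]"], ["C", "O"], vocab))
# ['[C;!R;CO]', 'O']
```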
<h3 id="token-frequency-rebalancing">Token Frequency Rebalancing</h3>
<p>A key benefit of hybridization is mitigating the severe token frequency imbalance in standard SMILES. Carbon (C), the most frequent element with ~184 million occurrences in ZINC, is represented by only 16 token types in SMILES. With SMI+AIS(200), carbon is distinguished into 145 token types based on chemical environment, with 74% of carbon occurrences represented by AIS tokens. Less common elements like halogens see minimal change (only 2% AIS representation), which avoids introducing unnecessarily rare tokens.</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Frequency</th>
          <th>SMILES Types</th>
          <th>SMI+AIS(100) Types (AIS %)</th>
          <th>SMI+AIS(200) Types (AIS %)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>C</td>
          <td>183,860,954</td>
          <td>16</td>
          <td>78 (73%)</td>
          <td>145 (74%)</td>
      </tr>
      <tr>
          <td>O</td>
          <td>27,270,229</td>
          <td>8</td>
          <td>16 (11%)</td>
          <td>24 (11%)</td>
      </tr>
      <tr>
          <td>N</td>
          <td>26,022,928</td>
          <td>11</td>
          <td>32 (1%)</td>
          <td>46 (10%)</td>
      </tr>
      <tr>
          <td>X (halogens)</td>
          <td>6,137,030</td>
          <td>7</td>
          <td>10 (2%)</td>
          <td>11 (2%)</td>
      </tr>
      <tr>
          <td>S</td>
          <td>4,581,307</td>
          <td>12</td>
          <td>17 (2%)</td>
          <td>24 (2%)</td>
      </tr>
  </tbody>
</table>
<h2 id="latent-space-optimization-for-molecular-generation">Latent Space Optimization for Molecular Generation</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p>The evaluation uses a <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">conditional variational autoencoder (CVAE)</a> with:</p>
<ul>
<li><strong>Encoder</strong>: BERT-style architecture with entity and positional embeddings, 4 multi-head attention layers (8 heads each), producing mean and standard deviation vectors in latent space.</li>
<li><strong>Decoder</strong>: 4 stacked gated recurrent unit (GRU) layers that transform sampled latent vectors (conditioned) back into token sequences.</li>
<li>Training: 20 epochs on 9 million compounds from the ZINC database (8:1:1 train/valid/test split) under identical conditions for all representations.</li>
</ul>
<h3 id="optimization-setup">Optimization Setup</h3>
<p><a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a> (BO) via BoTorch is applied to the CVAE <a href="/notes/chemistry/molecular-design/generation/latent-space/">latent space</a>, maximizing a multi-objective function:</p>
<p>$$
\text{Obj} = -\text{BA} - 0.5 \times \text{SA}^2
$$</p>
<p>where BA is binding affinity (docking score from QuickVina 2, lower is stronger) and SA is synthetic accessibility score (from RDKit, lower is more synthesizable). Each BO iteration generates 800 candidate latent vectors. Invalid strings receive a penalty objective value of -100.</p>
<h3 id="protein-targets">Protein Targets</h3>
<p>Four diverse targets were used to assess generalizability:</p>
<ul>
<li><strong>PDK4</strong> (<a href="https://en.wikipedia.org/wiki/Pyruvate_dehydrogenase_kinase">Pyruvate Dehydrogenase Kinase</a> 4): narrow, deep binding pocket</li>
<li><strong>5-HT1B</strong> (<a href="https://en.wikipedia.org/wiki/5-HT1B_receptor">Serotonin Receptor 1B</a>): shallow, open <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> conformation</li>
<li><strong>PARP1</strong> (<a href="https://en.wikipedia.org/wiki/PARP1">Poly ADP-ribose Polymerase 1</a>): small, flexible molecule binding site</li>
<li><strong>CK1d</strong> (<a href="https://en.wikipedia.org/wiki/Casein_kinase_1">Casein Kinase I</a> Delta): broad, accessible conformation</li>
</ul>
<p>Protein structures were obtained from the <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a> (PDB IDs: 4V26, 4IAQ, 6I8M, 4TN6). Each optimization was run 10 times independently from the same 5 initial compounds selected from BindingDB.</p>
<h3 id="key-results">Key Results</h3>
<p>SMI+AIS(100) consistently achieved the highest objective values across protein targets.</p>
<p><strong>PDK4 Optimization</strong> (Top-1 results over 10 independent runs):</p>
<ul>
<li>SMI+AIS(100) achieved approximately 12% improvement over standard SMILES and 28% improvement over SELFIES based on median Top-1 objective values.</li>
<li>Generated structures exhibited BA scores between -10 and -9 and SA scores between 2.0 and 2.3.</li>
<li>Molecular weights clustered around 400 amu, consistent with the CVAE conditioning.</li>
</ul>
<p><strong>Validity Ratios</strong>: Standard SMILES produced approximately 40% valid structures. SMI+AIS representations showed significant improvement as N increased, though SMI+AIS(200) showed slight saturation, likely from insufficiently trained infrequent tokens.</p>
<p><strong>SELFIES</strong>: Despite achieving the highest validity ratio, SELFIES failed to generate chemically meaningful structures with desirable BA and SA scores. The authors attribute this to SELFIES grammar where token meaning is highly context-dependent, causing minor latent space variations to produce large structural changes.</p>
<p><strong>Cross-target consistency</strong>: Improvements were observed across all four protein targets, with slight variation (5-HT1B showed smaller differences between SMI and SMI+AIS(100) for Top-1, while other targets showed significant improvements).</p>
<h2 id="improved-molecular-generation-through-chemical-context-enrichment">Improved Molecular Generation Through Chemical Context Enrichment</h2>
<p>The SMI+AIS(N) representation achieves consistent improvements in molecular generation quality compared to both standard SMILES and SELFIES. The core findings are:</p>
<ol>
<li><strong>Binding affinity improvement</strong>: Approximately 7% improvement over standard SMILES for the PDK4 target.</li>
<li><strong>Synthesizability improvement</strong>: Approximately 6% improvement in synthetic accessibility (SA) scores.</li>
<li><strong>Target independence</strong>: Performance gains transfer across four structurally diverse protein targets.</li>
<li><strong>Preserved structural motifs</strong>: The generative model retains chemically meaningful fragments (e.g., acetamide and <a href="https://en.wikipedia.org/wiki/Piperidine">piperidine</a>) from initial compounds without explicit fragment constraints.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Stereochemistry</strong>: SMI+AIS inherits the limited stereochemistry handling of standard SMILES.</li>
<li><strong>Evaluation scope</strong>: Only molecular generation was tested; property prediction and other ML tasks remain unexplored.</li>
<li><strong>Compute constraints</strong>: The study was limited to molecular generation due to computing power and time.</li>
<li><strong>Single optimization strategy</strong>: Only latent space optimization with Bayesian optimization was evaluated; other generative approaches were not compared.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest extending SMI+AIS to diverse benchmarking tests including molecular property prediction, experimental validation, and broader applications of chemical language models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Vocab</td>
          <td>ZINC Database</td>
          <td>9M compounds</td>
          <td>Canonicalized, deduplicated, split 8:1:1</td>
      </tr>
      <tr>
          <td>Binding targets</td>
          <td>BindingDB</td>
          <td>5 initial compounds per target</td>
          <td>Selected for each protein target</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td>PDB</td>
          <td>4 structures</td>
          <td>IDs: 4V26, 4IAQ, 6I8M, 4TN6</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: AIS token frequency counting on full ZINC database, top-N selection</li>
<li><strong>Generative model</strong>: Conditional VAE with BERT encoder (4 layers, 8 heads) and GRU decoder (4 layers)</li>
<li><strong>Optimization</strong>: Bayesian Optimization via BoTorch (800 candidates per iteration)</li>
<li><strong>Docking</strong>: QuickVina 2 with 25 Å pocket size, 10 docking simulations per ligand</li>
<li><strong>SA scoring</strong>: RDKit SA score</li>
<li><strong>Training</strong>: 20 epochs for all representations under identical conditions</li>
</ul>
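<p>The vocabulary construction step (count AIS token frequencies over the corpus, keep the top N) can be sketched as follows; the token strings and input format here are illustrative, not the paper's exact pipeline:</p>

```python
from collections import Counter

def build_ais_vocab(ais_token_lists, n=100):
    """Count AIS token frequencies across a corpus and keep the top N.

    `ais_token_lists` is an iterable of per-molecule token lists
    (hypothetical input format; the paper counts frequencies over the
    full ZINC database before selecting the top-N tokens).
    """
    counts = Counter()
    for tokens in ais_token_lists:
        counts.update(tokens)
    return [tok for tok, _ in counts.most_common(n)]

# Toy corpus with made-up AIS-style tokens:
corpus = [["[CH3;X4]", "[CH2;X4]", "[OH;X2]"],
          ["[CH3;X4]", "[cH;X3]", "[cH;X3]"]]
vocab = build_ais_vocab(corpus, n=2)
```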
<h3 id="models">Models</h3>
<ul>
<li>CVAE architecture details in supplementary (Fig. S9, Tables S2, S4)</li>
<li>No pre-trained weights released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMI+AIS(100) vs SMILES</th>
          <th>SMI+AIS(100) vs SELFIES</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Median Top-1 Obj. Value</td>
          <td>+12%</td>
          <td>+28%</td>
          <td>PDK4 target</td>
      </tr>
      <tr>
          <td>Validity Ratio</td>
          <td>Higher than ~40% (SMILES)</td>
          <td>Lower than SELFIES</td>
          <td>SMI+AIS improves with N</td>
      </tr>
      <tr>
          <td>BA (binding affinity)</td>
          <td>~7% improvement</td>
          <td>Substantial</td>
          <td>Lower (more negative) is better</td>
      </tr>
      <tr>
          <td>SA (synthesizability)</td>
          <td>~6% improvement</td>
          <td>Substantial</td>
          <td>Lower is more synthesizable</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the main text. Optimization wall times are reported in supplementary Table S5.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/herim-han/AIS-Drug-Opt">AIS-Drug-Opt</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Source code and datasets for reproduction</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. Code and processed data are publicly available on GitHub, but no pre-trained model weights are released, the license is unspecified, and hardware requirements are not documented in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Han, H., Yeom, M. S., &amp; Choi, S. (2025). Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation. <em>Scientific Reports</em>, 15, 16892. <a href="https://doi.org/10.1038/s41598-025-01890-7">https://doi.org/10.1038/s41598-025-01890-7</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{han2025hybridization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Han, Herim and Yeom, Min Sun and Choi, Sunghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{16892}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-025-01890-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI-TED: Encoder-Decoder Foundation Models for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/</guid><description>SMI-TED is a family of encoder-decoder transformer models pre-trained on 91M PubChem molecules for molecular property prediction and generation.</description><content:encoded><![CDATA[<h2 id="an-encoder-decoder-chemical-foundation-model-family">An Encoder-Decoder Chemical Foundation Model Family</h2>
<p>SMI-TED is a <strong>Method</strong> paper that introduces a family of encoder-decoder transformer-based foundation models for chemistry. The primary contribution is the SMI-TED289M architecture, a 289-million parameter model pre-trained on 91 million curated SMILES from PubChem, along with a Mixture-of-Experts variant (MoE-OSMI) that scales to 8x289M parameters. The models support molecular property prediction, molecule reconstruction, reaction yield prediction, and few-shot reasoning over molecular embeddings. All model weights and code are open-sourced under an Apache 2.0 license.</p>
<h2 id="bridging-encoding-and-decoding-for-molecular-representations">Bridging Encoding and Decoding for Molecular Representations</h2>
<p>Chemical language models based on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> have gained traction for molecular property prediction and generation. Most existing models, such as <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> and <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, are encoder-only architectures that produce molecular embeddings through mean pooling. While effective for downstream classification and regression, this encoder-only approach has a limitation: mean pooling has no natural inverse, meaning the model cannot reconstruct the input molecule from its latent representation. This restricts the model&rsquo;s utility for generative tasks and limits the interpretability of the learned latent space.</p>
<p>The authors argue that adding a decoder with a reconstruction objective forces the model to encode a more complete set of structural features. Prior work has shown that the quality of pre-training data matters more than the choice of SMILES vs. <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and that large-scale pre-training can yield useful chemical representations. SMI-TED builds on these observations by combining an encoder-decoder architecture with a carefully curated 91-million molecule dataset from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>.</p>
<h2 id="invertible-pooling-and-two-phase-pre-training">Invertible Pooling and Two-Phase Pre-Training</h2>
<p>The core architectural innovation in SMI-TED is a learned pooling mechanism that replaces standard mean or max pooling with an invertible projection. Given token embeddings $\mathbf{x} \in \mathbb{R}^{D \times L}$ (where $D = 202$ is the maximum token count and $L = 768$ is the embedding dimension), the submersion into the latent space $\mathbf{z} \in \mathbb{R}^{L}$ is computed as:</p>
<p>$$
\mathbf{z} = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{W}_1^T \mathbf{x} + \mathbf{b}_1\right)\right)\right) \mathbf{W}_2
$$</p>
<p>where $\mathbf{W}_1 \in \mathbb{R}^{D \times L}$, $\mathbf{b}_1 \in \mathbb{R}^{L}$, and $\mathbf{W}_2 \in \mathbb{R}^{L \times L}$. The immersion (inverse mapping) back to the token space is:</p>
<p>$$
\tilde{\mathbf{x}}^T = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{z} \mathbf{W}_3 + \mathbf{b}_3\right)\right)\right) \mathbf{W}_4
$$</p>
<p>where $\mathbf{W}_3 \in \mathbb{R}^{L \times L}$, $\mathbf{b}_3 \in \mathbb{R}^{L}$, and $\mathbf{W}_4 \in \mathbb{R}^{L \times D}$. A decoder language model then predicts the next token from $\tilde{\mathbf{x}}$.</p>
<p>The encoder uses a modified RoFormer attention mechanism with rotary position embeddings:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>where $R_m$ are position-dependent rotation matrices and $\varphi$ is a random feature map.</p>
<p><strong>Two-phase pre-training strategy:</strong></p>
<ul>
<li><strong>Phase 1</strong>: The token encoder is pre-trained on 95% of the data using masked language modeling (15% token selection, of which 80% masked, 10% random, 10% unchanged). The remaining 5% trains the encoder-decoder layer, preventing convergence issues from unstable early embeddings.</li>
<li><strong>Phase 2</strong>: After the token embeddings converge, both the encoder and decoder train on 100% of the data jointly.</li>
</ul>
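<p>The 15%/80-10-10 masking scheme in Phase 1 can be sketched as follows (a minimal stand-in for the authors' implementation; token handling details are assumptions):</p>

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", p_select=0.15, seed=0):
    """BERT-style masking: select ~15% of tokens; of those, 80% become
    the mask token, 10% a random vocabulary token, 10% stay unchanged.
    Returns (corrupted, labels): labels hold the original token at
    selected positions and None elsewhere."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < p_select:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)   # 80%: mask
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)          # 10%: unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = list("CC(=O)Nc1ccccc1")  # character-level tokens for illustration
corrupted, labels = mask_tokens(tokens, ["C", "N", "O", "c", "1"], seed=0)
```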
<p><strong><a href="https://en.wikipedia.org/wiki/Mixture_of_experts">Mixture-of-Experts</a> (MoE-OSMI):</strong> The MoE variant composes 8 fine-tuned SMI-TED289M expert models with a gating network. Given an input embedding $x$, the output is:</p>
<p>$$
y = \sum_{i=1}^{n} G(x)_i E_i(\hat{x})
$$</p>
<p>where $G(x) = \text{Softmax}(\text{TopK}(x \cdot W_g))$ selects the top $k = 2$ experts per input, setting all other gate values to zero.</p>
<h2 id="benchmarks-across-property-prediction-generation-and-reaction-yield">Benchmarks Across Property Prediction, Generation, and Reaction Yield</h2>
<h3 id="moleculenet-classification-6-datasets-roc-auc"><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification (6 datasets, ROC-AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BBBP</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
          <th>Tox21</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>73.6 +/- 0.8</td>
          <td>91.2 +/- 1.4</td>
          <td>80.5 +/- 1.65</td>
          <td>86.3 +/- 0.6</td>
          <td>65.5 +/- 0.2</td>
          <td>80.46 +/- 0.2</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>72.9 +/- 0.6</td>
          <td>91.9 +/- 1.8</td>
          <td>80.8 +/- 0.3</td>
          <td>85.7 +/- 0.2</td>
          <td>65.9 +/- 1.3</td>
          <td>79.6 +/- 0.5</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>72.4 +/- 0.4</td>
          <td>90.1 +/- 1.3</td>
          <td>80.6 +/- 0.9</td>
          <td>85.6 +/- 1.1</td>
          <td>67.2 +/- 0.4</td>
          <td>78.1 +/- 0.1</td>
      </tr>
      <tr>
          <td>SMI-TED289M (pre-trained)</td>
          <td>91.46 +/- 0.47</td>
          <td>93.49 +/- 0.85</td>
          <td>80.51 +/- 1.34</td>
          <td>85.58 +/- 0.92</td>
          <td>66.01 +/- 0.88</td>
          <td>81.53 +/- 0.45</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>92.26 +/- 0.57</strong></td>
          <td><strong>94.27 +/- 1.83</strong></td>
          <td>76.85 +/- 0.89</td>
          <td><strong>88.24 +/- 0.50</strong></td>
          <td>65.68 +/- 0.45</td>
          <td><strong>81.85 +/- 1.42</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED achieves the best results in 4 of 6 classification tasks. Notably, the pre-trained version (without fine-tuning) already matches or exceeds many baselines on BBBP, ClinTox, and Tox21.</p>
<h3 id="moleculenet-regression-5-datasets-mae-for-qm9qm8-rmse-for-esolfreesolvlipophilicity">MoleculeNet regression (5 datasets, MAE for QM9/QM8, RMSE for ESOL/FreeSolv/Lipophilicity)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>1.5894</td>
          <td>0.0102</td>
          <td>0.880</td>
          <td>2.342</td>
          <td>0.700</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>3.241</td>
          <td>0.0143</td>
          <td>0.98</td>
          <td>2.18</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>1.3246</strong></td>
          <td><strong>0.0095</strong></td>
          <td><strong>0.6112</strong></td>
          <td><strong>1.2233</strong></td>
          <td><strong>0.5522</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED289M achieves the best results across all 5 regression tasks when fine-tuned. The improvements are substantial on ESOL (0.61 vs. 0.82 for next best) and FreeSolv (1.22 vs. 1.91 for next best).</p>
<h3 id="reaction-yield-prediction-buchwald-hartwig-c-n-cross-coupling">Reaction yield prediction (<a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling)</h3>
<p>The model was tested on Pd-catalyzed Buchwald-Hartwig reactions with 3,955 reactions across varying train/test splits. Selected $R^2$ results:</p>
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Yield-BERT (Aug)</th>
          <th>DRFP</th>
          <th>SMI-TED289M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>70/30</td>
          <td>0.97</td>
          <td>0.95</td>
          <td><strong>0.984</strong></td>
      </tr>
      <tr>
          <td>10/90</td>
          <td>0.81</td>
          <td>0.81</td>
          <td><strong>0.961</strong></td>
      </tr>
      <tr>
          <td>2.5/97.5</td>
          <td>0.61</td>
          <td>0.62</td>
          <td><strong>0.875</strong></td>
      </tr>
      <tr>
          <td>Test 1-4 avg</td>
          <td>0.58</td>
          <td>0.71</td>
          <td><strong>0.983</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED shows particularly strong performance in low-data regimes. With only 2.5% training data, it achieves $R^2 = 0.875$, compared to 0.61-0.62 for competing methods.</p>
<h3 id="moses-molecular-generation-benchmarks"><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> molecular generation benchmarks</h3>
<p>SMI-TED is competitive with baselines including CharRNN, SMILES VAE, JT-VAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen-7b</a>, and <a href="/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/">GP-MoLFormer</a> on standard metrics (validity, uniqueness, novelty, FCD, internal diversity). It achieves superior scaffold cosine similarity (Scaf) and nearest-neighbor similarity (SNN) scores.</p>
<h3 id="latent-space-compositionality">Latent space compositionality</h3>
<p>Using six families of carbon chains ($\mathcal{F} = \{\text{CC}, \text{CO}, \text{CN}, \text{CS}, \text{CF}, \text{CP}\}$), the authors test whether the embedding space respects hierarchical distance structures. A linear regression on SMI-TED embeddings yields $R^2 = 0.99$ and $\text{MSE} = 0.002$, compared to $R^2 = 0.55$ and $\text{MSE} = 0.237$ for MoLFormer. This indicates that the SMI-TED latent space captures compositional chemical relationships far more faithfully.</p>
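<p>The probe itself amounts to a least-squares fit plus an $R^2$ score; a sketch of that computation (the exact regression features built from the embeddings are described in the paper, not reproduced here):</p>

```python
import numpy as np

def linear_fit_r2(X, y):
    """Fit y ~ X by least squares (with intercept) and return R^2,
    the statistic reported for the compositionality probe."""
    A = np.column_stack([X, np.ones(len(X))])   # append intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - (residual @ residual) / ((y - y.mean()) @ (y - y.mean()))
```

On perfectly linear data this returns 1.0; the closer the embedding distances track the chain-length structure, the closer the score is to the paper's reported 0.99.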
<p>For structure-property analysis on QM9, nitrogen-containing molecules represent 9.10% of the dataset but account for 32.81% of the top 10% by HOMO energy. In the SMI-TED latent space, these molecules cluster distinctly (<a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> of 2.82 vs. 4.28 for MoLFormer), suggesting the decoder objective encourages encoding of functional group information.</p>
<h2 id="strong-performance-with-a-compositional-latent-space">Strong Performance with a Compositional Latent Space</h2>
<p>SMI-TED289M demonstrates competitive or superior performance across molecular property prediction, reaction yield prediction, and molecular generation benchmarks. The key findings include:</p>
<ol>
<li><strong>Broad applicability</strong>: The single pre-trained model achieves strong results across classification (4/6 best), regression (5/5 best), reaction yield, and generation tasks.</li>
<li><strong>Low-data robustness</strong>: The pre-training on 91M molecules provides chemical knowledge that transfers well to small training sets, as shown by the reaction yield experiments where SMI-TED maintains high accuracy even at 2.5% training data.</li>
<li><strong>Compositional embeddings</strong>: The encoder-decoder architecture produces a latent space where molecular similarity follows chemical intuition, with near-perfect linear relationships between functional group families ($R^2 = 0.99$).</li>
<li><strong>Structure-property capture</strong>: The reconstruction objective appears to enforce encoding of chemically meaningful features like nitrogen substituent effects on <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO</a> energy, outperforming encoder-only models in latent space organization.</li>
</ol>
<p><strong>Limitations</strong>: The paper evaluates on MoleculeNet benchmarks, which are well-studied but may not reflect performance on more diverse chemical tasks. The BBBP classification result (92.26) shows a large jump from prior methods (73.6 for MoLFormer), which is worth scrutinizing. The MoE variant is evaluated only in supplementary materials, and scaling behavior beyond 8 experts is not explored.</p>
<p><strong>Future directions</strong>: The authors note that compositionality of the learned representations suggests potential for reasoning applications, though they acknowledge that stronger claims require further studies following compositionality analysis methodologies from natural language processing. The model has been integrated into the dZiner agent for inverse molecular design.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (curated)</td>
          <td>91M molecules, 4B tokens</td>
          <td>Deduplicated, canonicalized, validity-checked</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (BBBP, ClinTox, HIV, BACE, SIDER, Tox21)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (QM9, QM8, ESOL, FreeSolv, Lipophilicity)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>MOSES</td>
          <td>1.94M molecules</td>
          <td>Train/test/scaffold test splits</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig HTE</td>
          <td>3,955 reactions</td>
          <td>3x 1536-well plates</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Masked language modeling for token encoder (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li>Two-phase pre-training (95/5 split then 100% joint training)</li>
<li>RoFormer attention with rotary position embeddings</li>
<li>Vocabulary: 2,993 tokens (2,988 molecular + 5 special)</li>
<li>Maximum sequence length: 202 tokens (covers 99.4% of PubChem)</li>
<li>Learning rate: 1.6e-4, batch size: 288 molecules</li>
<li>40 epochs over the full PubChem corpus</li>
<li>10 random seeds per experiment for robustness</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Parameters</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMI-TED289M base</td>
          <td>289M</td>
          <td>47M</td>
          <td>242M</td>
          <td>12 layers, 12 attention heads, hidden size 768, dropout 0.2</td>
      </tr>
      <tr>
          <td>MoE-OSMI</td>
          <td>8x289M</td>
          <td>-</td>
          <td>-</td>
          <td>8 experts, top-k=2 routing, gating network</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC</li>
<li>Regression: MAE (QM9, QM8), RMSE (ESOL, FreeSolv, Lipophilicity)</li>
<li>Reaction yield: $R^2$</li>
<li>Generation: Validity, uniqueness, novelty, FCD, IntDiv, Scaf, SNN (MOSES metrics)</li>
<li>Latent space: Linear regression $R^2$, MSE, Davies-Bouldin index, t-SNE visualization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>24 NVIDIA V100 GPUs (16GB)</li>
<li>4 nodes with DDP (Distributed Data Parallel)</li>
<li>Pre-training: 40 epochs on 91M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/materials/tree/main/models/smi_ted">IBM/materials (smi_ted)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Training, fine-tuning scripts, Jupyter notebooks</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/materials.smi-ted">ibm/materials.smi-ted</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.15603701">Zenodo archive</a></td>
          <td>Code + Data</td>
          <td>Apache-2.0</td>
          <td>Archival copy of scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Soares, E., Vital Brazil, E., Shirasuna, V., Zubarev, D., Cerqueira, R., &amp; Schmidt, K. (2025). An open-source family of large encoder-decoder foundation models for chemistry. <em>Communications Chemistry</em>, 8(1). <a href="https://doi.org/10.1038/s42004-025-01585-0">https://doi.org/10.1038/s42004-025-01585-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{soares2025smited,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{An open-source family of large encoder-decoder foundation models for chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Communications Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42004-025-01585-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Seq2seq Fingerprint: Unsupervised Molecular Embedding</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/</guid><description>Seq2seq fingerprint uses a GRU encoder-decoder trained on SMILES self-translation to produce unsupervised molecular embeddings for property prediction.</description><content:encoded><![CDATA[<h2 id="an-unsupervised-seq2seq-method-for-molecular-fingerprints">An Unsupervised Seq2seq Method for Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces seq2seq fingerprint, an unsupervised molecular embedding approach based on sequence-to-sequence learning. The core idea is to train a <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> encoder-decoder network to translate <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings to themselves, then extract the intermediate fixed-length vector as a molecular fingerprint. These fingerprints are then used with standard supervised classifiers for downstream property prediction tasks such as solubility classification and promiscuity prediction.</p>
<h2 id="the-labeled-data-bottleneck-in-drug-discovery">The Labeled Data Bottleneck in Drug Discovery</h2>
<p>Machine learning approaches to molecular property prediction depend on fixed-length feature vectors as inputs. Traditional molecular fingerprints fall into two categories: hash-based methods like Extended-Connectivity Fingerprints (ECFP) that are fast but lossy and non-invertible, and biologist-guided local-feature fingerprints that require domain expertise and are task-specific. Supervised deep learning fingerprints (e.g., neural fingerprints) can learn representations from data but require large amounts of labeled data, which is expensive to obtain in drug discovery due to the cost of biological experiments.</p>
<p>The authors identify three limitations of existing approaches:</p>
<ol>
<li>Hash-based fingerprints discard information during the hashing process and cannot reconstruct the original molecule</li>
<li>Local-feature fingerprints require expert knowledge and generalize poorly across tasks</li>
<li>Supervised deep learning fingerprints are data-hungry and fail when labeled data is limited</li>
</ol>
<h2 id="self-translation-as-unsupervised-molecular-encoding">Self-Translation as Unsupervised Molecular Encoding</h2>
<p>The key insight is to adapt the <a href="https://en.wikipedia.org/wiki/Seq2seq">sequence-to-sequence</a> learning framework from machine translation (originally English-to-French) to molecular representation learning by setting both the input and output to the same SMILES string. Since the intermediate vector must contain enough information to reconstruct the original SMILES, it serves as a rich, task-agnostic molecular fingerprint.</p>
<p>The architecture consists of two components:</p>
<ul>
<li><strong>Perceiver network</strong>: A multi-layer GRU encoder that reads the SMILES string and compresses it into a fixed-length vector</li>
<li><strong>Interpreter network</strong>: A multi-layer GRU decoder that reconstructs the original SMILES from the fingerprint vector</li>
</ul>
<p>The GRU cell computes a sequence of outputs $(s_1, \ldots, s_T)$ from input sequences $(x_1, \ldots, x_T)$ by iterating:</p>
<p>$$
z_t = \sigma_g(W_z x_t + U_z s_{t-1} + b_z)
$$</p>
<p>$$
r_t = \sigma_r(W_r x_t + U_r s_{t-1} + b_r)
$$</p>
<p>$$
h_t = \tanh(W_h x_t + U_h(s_{t-1} \circ r_t) + b_h)
$$</p>
<p>$$
s_t = (1 - z_t) \circ h_t + z_t \circ s_{t-1}
$$</p>
<p>where $z_t$ is the update gate, $r_t$ is the reset gate, $\circ$ denotes element-wise multiplication, and $W$, $U$, $b$ are trainable parameters.</p>
<p>Several adaptations to the original seq2seq framework make this work for molecular data:</p>
<ol>
<li><strong>GRU instead of LSTM</strong>: GRU provides comparable performance with faster training, which is important given the large training data pool</li>
<li><strong>Attention mechanism</strong>: Establishes a stronger connection between the perceiver and interpreter networks via soft alignment, addressing the challenge of passing information through hidden memory for long sequences (SMILES can be up to 250 characters)</li>
<li><strong>Dropout layers</strong>: Added to input and output gates (but not hidden memory transfer) following the approach of Zaremba et al. to combat overfitting when training on large datasets</li>
<li><strong>Fingerprint extraction layer</strong>: A fixed-unit fully connected layer combined with a GRU cell state concatenation layer is inserted between encoder and decoder to explicitly output the fingerprint vector</li>
<li><strong>Reverse target sequence</strong>: Following Sutskever et al., the target sequence is reversed to improve SGD optimization</li>
<li><strong>Bucket training</strong>: Sequences are distributed into buckets by length and padded to enable GPU parallelization</li>
</ol>
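<p>Bucket training (item 6 above) can be sketched as grouping sequences by length and padding to the bucket boundary; the boundaries and pad character here are illustrative:</p>

```python
def bucket_and_pad(smiles_list, bucket_sizes=(60, 120, 180, 250), pad_char=" "):
    """Distribute SMILES strings into length buckets and right-pad each
    to its bucket size, so same-bucket sequences batch cleanly on a GPU."""
    buckets = {b: [] for b in bucket_sizes}
    for smi in smiles_list:
        for b in bucket_sizes:
            if len(smi) <= b:                     # first bucket that fits
                buckets[b].append(smi + pad_char * (b - len(smi)))
                break
    return buckets
```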
<h2 id="classification-experiments-on-logp-and-pm2-datasets">Classification Experiments on LogP and PM2 Datasets</h2>
<h3 id="training-setup">Training Setup</h3>
<p>The unsupervised training used 334,092 valid SMILES representations from combined LogP and PM2-full datasets obtained from the National Center for Advancing Translational Sciences (NCATS) at NIH. Three model variants were trained with fingerprint dimensions of 512, 768, and 1024, differing in the number of GRU layers (2, 3, and 4 respectively) while keeping the latent dimension at 256. Each model was trained for 24 hours on a workstation with an Intel i7-6700K CPU, 16 GB RAM, and an NVIDIA GTX 1080 GPU.</p>
<h3 id="reconstruction-performance">Reconstruction Performance</h3>
<p>The models were evaluated on their ability to reconstruct SMILES strings from their fingerprints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>GRU Layers</th>
          <th>Latent Dim</th>
          <th>Perplexity</th>
          <th>Exact Match Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>seq2seq-512</td>
          <td>2</td>
          <td>256</td>
          <td>1.00897</td>
          <td>94.24%</td>
      </tr>
      <tr>
          <td>seq2seq-768</td>
          <td>3</td>
          <td>256</td>
          <td>1.00949</td>
          <td>92.92%</td>
      </tr>
      <tr>
          <td>seq2seq-1024</td>
          <td>4</td>
          <td>256</td>
          <td>1.01472</td>
          <td>90.26%</td>
      </tr>
  </tbody>
</table>
<p>Deeper models showed lower reconstruction accuracy, possibly because larger fingerprint spaces contain more unused (null) regions and require longer training to converge.</p>
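<p>Both reconstruction metrics in the table can be computed from per-token negative log-likelihoods and the decoded strings. A minimal sketch with hypothetical helper names (not the paper&rsquo;s code):</p>

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def exact_match_accuracy(predicted, targets):
    """Fraction of SMILES strings reconstructed character-for-character."""
    matches = sum(p == t for p, t in zip(predicted, targets))
    return matches / len(targets)

# A perfectly confident model (NLL = 0 per token) has perplexity 1.0,
# the lower bound the seq2seq models approach (1.009-1.015 in the table).
print(perplexity([0.0, 0.0, 0.0]))  # 1.0
print(exact_match_accuracy(["CCO", "c1ccccc1"], ["CCO", "C1CCCCC1"]))  # 0.5
```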
<h3 id="classification-results">Classification Results</h3>
<p>Two labeled datasets were used for downstream classification:</p>
<ul>
<li><strong>LogP</strong>: 10,850 samples with <a href="https://en.wikipedia.org/wiki/Partition_coefficient">octanol-water partition coefficient</a> values, binarized at a threshold of 1.88</li>
<li><strong>PM2-10k</strong>: 10,000 samples with binary promiscuity class labels</li>
</ul>
<p>The seq2seq fingerprints were evaluated with three ensemble classifiers (<a href="https://en.wikipedia.org/wiki/AdaBoost">AdaBoost</a>, <a href="https://en.wikipedia.org/wiki/Gradient_boosting">GradientBoost</a>, RandomForest) against circular fingerprints (ECFP) and neural fingerprints. Results are 100-run averages of 5-fold cross-validation accuracy.</p>
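<p>The evaluation protocol can be sketched with scikit-learn on synthetic placeholder data; <code>GradientBoostingClassifier</code> stands in for the paper&rsquo;s GradientBoost, and the fingerprint vectors and labels here are random stand-ins, not the NCATS datasets:</p>

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))   # placeholder fingerprint vectors
y = (X[:, 0] > 0).astype(int)    # placeholder binary labels

# One 5-fold CV run; the paper averages accuracy over 100 such runs.
clf = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```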
<p><strong>LogP classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3674</td>
          <td>0.0074</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.6080</td>
          <td>0.0135</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.7664</strong></td>
          <td>0.0043</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.7342</td>
          <td>0.0042</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.7350</td>
          <td>0.0060</td>
      </tr>
  </tbody>
</table>
<p><strong>PM2-10k classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3938</td>
          <td>0.0114</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.5227</td>
          <td>0.0112</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.6206</strong></td>
          <td>0.0198</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.6036</td>
          <td>0.0147</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.5741</td>
          <td>0.0086</td>
      </tr>
  </tbody>
</table>
<p>The seq2seq fingerprint outperformed both baselines across all configurations. Despite the seq2seq-1024 model having lower reconstruction accuracy, it provided the best classification performance, suggesting that the longer fingerprint captures more discriminative information for downstream tasks even if the reconstruction is less exact.</p>
<h2 id="unsupervised-transfer-learning-for-molecular-properties">Unsupervised Transfer Learning for Molecular Properties</h2>
<p>The results demonstrate that unsupervised pretraining on large unlabeled molecular datasets can produce fingerprints that transfer well to supervised property prediction with limited labels. The key advantages confirmed by the experiments are:</p>
<ol>
<li><strong>Label-free training</strong>: The unsupervised approach uses essentially unlimited SMILES data, avoiding the expensive label collection process</li>
<li><strong>Task-agnostic representations</strong>: The same fingerprints work across different classification tasks (lipophilicity and promiscuity) without retraining</li>
<li><strong>Invertibility</strong>: The fingerprints contain enough information to reconstruct the original SMILES (up to 94.24% exact match), unlike hash-based methods</li>
</ol>
<p><strong>Limitations</strong> acknowledged by the authors include:</p>
<ul>
<li>Long training times (24 hours per model variant), motivating future work on distributed training</li>
<li>The relationship between fingerprint dimensionality and downstream performance is non-monotonic (768-dim underperforms 512-dim on some tasks), suggesting sensitivity to hyperparameter choices</li>
<li>Only classification tasks were evaluated; regression performance was not assessed</li>
<li>The comparison baselines are limited to ECFP and neural fingerprints from 2015</li>
</ul>
<p><strong>Future directions</strong> proposed include distributed training strategies, hyperparameter optimization methods, and semi-supervised extensions that incorporate label information into the fingerprint training.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unsupervised training</td>
          <td>LogP + PM2-full (combined)</td>
          <td>334,092 SMILES</td>
          <td>Obtained from NCATS at NIH</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>LogP</td>
          <td>10,850 samples</td>
          <td>Binary labels at LogP threshold 1.88</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>PM2-10k</td>
          <td>10,000 samples</td>
          <td>Binary promiscuity labels</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder-decoder: Multi-layer GRU with attention mechanism and dropout</li>
<li>Fingerprint dimensions: 512, 768, 1024 (with 2, 3, 4 GRU layers respectively)</li>
<li>Latent dimension: 256 for all variants</li>
<li>Downstream classifiers: AdaBoost, GradientBoost, RandomForest</li>
<li>Evaluation: 5-fold cross-validation, 100-run averages</li>
<li>Baselines: ECFP via RDKit, Neural Fingerprint from HIPS/neural-fingerprint</li>
</ul>
<h3 id="models">Models</h3>
<p>Three model variants trained for 24 hours each. The paper states code would become publicly available after acceptance, but no public repository has been confirmed.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Task</th>
          <th>Configuration</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification accuracy</td>
          <td>0.7664</td>
          <td>LogP</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Classification accuracy</td>
          <td>0.6206</td>
          <td>PM2-10k</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Exact match reconstruction</td>
          <td>94.24%</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
      <tr>
          <td>Perplexity</td>
          <td>1.00897</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Intel i7-6700K @ 4.00 GHz, 16 GB RAM, NVIDIA GTX 1080 GPU</li>
<li>Hyperparameter search and classifier training: TACC Lonestar 5 cluster</li>
<li>Training time: 24 hours per model variant</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HIPS/neural-fingerprint">Neural Fingerprint (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Baseline comparison code</td>
      </tr>
  </tbody>
</table>
<p>The authors indicated the seq2seq fingerprint code would be released after acceptance, but no public repository has been found as of this writing. The datasets were sourced from NCATS/NIH.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Wang, S., Zhu, F., &amp; Huang, J. (2017). Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery. <em>Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB &lsquo;17)</em>, 285-294. <a href="https://doi.org/10.1145/3107411.3107424">https://doi.org/10.1145/3107411.3107424</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xu2017seq2seq,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Zheng and Wang, Sheng and Zhu, Feiyun and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{285--294}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3107411.3107424}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Randomized SMILES Improve Molecular Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/</guid><description>Randomized SMILES improve RNN molecular generative models by increasing chemical space coverage, uniformity, and completeness versus canonical SMILES.</description><content:encoded><![CDATA[<h2 id="data-augmentation-through-smiles-randomization">Data Augmentation Through SMILES Randomization</h2>
<p>This is an <strong>Empirical</strong> paper that performs an extensive benchmark of RNN-based molecular generative models trained with different SMILES string variants. The primary contribution is demonstrating that randomized SMILES (non-unique molecular string representations obtained by randomizing atom orderings) substantially improve the quality of the generated chemical space compared to canonical SMILES, without requiring any changes to the model architecture.</p>
<p>The paper evaluates three properties of generated chemical spaces: uniformity (equal probability of sampling each molecule), completeness (coverage of the target space), and closedness (generating only molecules within the target space). These are measured using a new composite metric called UC-JSD.</p>
<h2 id="canonical-smiles-bias-in-generative-models">Canonical SMILES Bias in Generative Models</h2>
<p>Recurrent Neural Networks trained on SMILES strings have shown the capacity to create large chemical spaces of valid molecules. However, when trained with canonical SMILES (the unique string representation produced by a canonicalization algorithm), these models exhibit biases. Specifically, prior work by the same group showed that models trained on one million <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> molecules could only recover 68% of GDB-13 when sampled two billion times, compared to the theoretical maximum of 87% from an ideal uniform sampler.</p>
<p>The canonical SMILES representation introduces two problems. First, the canonicalization algorithm constrains how the molecular graph is traversed (e.g., prioritizing sidechains over ring atoms), forcing the model to learn both valid SMILES syntax and the specific canonical ordering rules. Second, structurally similar molecules can have substantially different canonical SMILES, making some molecules harder to sample than others. Molecules with more ring systems and complex topologies are particularly underrepresented.</p>
<p>The authors also note that DeepSMILES, a recently proposed alternative syntax, had not been benchmarked against randomized SMILES, and that the data augmentation capabilities of randomized SMILES at different training set sizes were unexplored.</p>
<h2 id="randomized-smiles-as-non-canonical-representations">Randomized SMILES as Non-Canonical Representations</h2>
<p>The core insight is that by randomizing the atom ordering before SMILES generation, each molecule can be represented by multiple different but equally valid SMILES strings. This effectively provides data augmentation: a molecule with $n$ heavy atoms can theoretically yield up to $n$ different SMILES strings (though the actual number is typically lower due to molecular symmetry).</p>
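<p>The augmentation effect can be illustrated without RDKit: linearize the same molecular graph with a randomized depth-first traversal, so each shuffle yields a different but chemically equivalent string. This is a toy linearization of a hypothetical C-C-C-O chain, not real SMILES syntax or RDKit&rsquo;s algorithm:</p>

```python
import random

# Toy molecular graph for a C-C-C-O chain: atom index -> neighbor list
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: "C", 1: "C", 2: "C", 3: "O"}

def random_linearization(graph, labels, rng):
    """Depth-first traversal with a random start atom and branch order."""
    start = rng.choice(list(graph))
    out, seen, stack = [], set(), [start]
    while stack:
        atom = stack.pop()
        if atom in seen:
            continue
        seen.add(atom)
        out.append(labels[atom])
        nbrs = [n for n in graph[atom] if n not in seen]
        rng.shuffle(nbrs)
        stack.extend(nbrs)
    return "".join(out)

rng = random.Random(42)
variants = {random_linearization(graph, labels, rng) for _ in range(50)}
print(variants)  # multiple distinct strings for the same molecule
```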
<p>Two randomized SMILES variants are explored:</p>
<ul>
<li><strong>Restricted randomized SMILES</strong>: Atom ordering is randomized, but RDKit&rsquo;s built-in fixes are applied. These fixes prevent overly complicated traversals, such as prioritizing sidechains before completing ring atoms.</li>
<li><strong>Unrestricted randomized SMILES</strong>: Atom ordering is randomized without any RDKit restrictions, producing a superset of the restricted variant that includes more convoluted SMILES strings.</li>
</ul>
<p>For each training epoch, a new set of randomized SMILES is generated for the same molecules, so a model trained for 300 epochs on one million molecules sees approximately 300 million different SMILES strings (with some overlap due to sampling).</p>
<p>The model architecture is a standard RNN with an embedding layer, $l$ layers of LSTM or GRU cells of size $w$, optional dropout, and a linear output layer with softmax. The training objective minimizes the average negative log-likelihood (NLL):</p>
<p>$$
J(T) = -\ln P(X_{0} = x_{0}) - \sum_{t=1}^{T} \ln P(X_{t} = x_{t} \mid X_{t-1} = x_{t-1}, \dots, X_{0} = x_{0})
$$</p>
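<p>The objective is just the summed negative log-probability of each token given its prefix. A minimal sketch under a toy next-token model (the uniform model here is illustrative, not the trained RNN):</p>

```python
import math

def sequence_nll(tokens, cond_prob):
    """J(T): negative log-likelihood of a token sequence under a
    next-token model cond_prob(prefix, token) -> probability."""
    nll = 0.0
    for t, tok in enumerate(tokens):
        nll -= math.log(cond_prob(tuple(tokens[:t]), tok))
    return nll

# Toy model: uniform over a 4-symbol vocabulary, so every token costs ln 4.
uniform = lambda prefix, tok: 0.25
print(sequence_nll(["C", "C", "O"], uniform))  # 3 * ln 4 ≈ 4.159
```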
<p>The key metric is the Uniformity-Completeness JSD (UC-JSD), which extends the Jensen-Shannon Divergence to measure how uniform, complete, and closed the generated chemical space is:</p>
<p>$$
JSD = H\left(\sum_{d_{i} \in D} \alpha_{i} \, d_{i}\right) - \sum_{d_{i} \in D} \alpha_{i} \, H(d_{i})
$$</p>
<p>where $H(d)$ is the Shannon entropy of a probability distribution. The UC-JSD is computed over the NLL vectors of the validation, training, and sampled sets. The composite UCC score is defined as:</p>
<p>$$
UCC = \text{completeness} \times \text{uniformity} \times \text{closedness}
$$</p>
<p>where completeness measures coverage of GDB-13, uniformity measures how equal the sampling probabilities are, and closedness measures how few invalid (out-of-target-space) molecules are generated.</p>
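<p>The generalized JSD above follows directly from its entropy form. A sketch for discrete distributions, assuming uniform weights $\alpha_{i} = 1/|D|$ (the paper computes it over the NLL distributions of the training, validation, and sampled sets):</p>

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jsd(dists):
    """Generalized Jensen-Shannon divergence with uniform weights."""
    alpha = 1.0 / len(dists)
    mixture = [alpha * sum(col) for col in zip(*dists)]
    return entropy(mixture) - alpha * sum(entropy(d) for d in dists)

# Identical distributions diverge by 0; disjoint ones reach ln(len(dists)).
print(jsd([[0.5, 0.5], [0.5, 0.5]]))  # 0.0
print(jsd([[1.0, 0.0], [0.0, 1.0]]))  # ln 2 ≈ 0.693
```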
<h2 id="benchmark-design-across-smiles-variants-training-sizes-and-architectures">Benchmark Design Across SMILES Variants, Training Sizes, and Architectures</h2>
<p>The benchmark covers a systematic grid of experimental conditions:</p>
<p><strong>SMILES variants</strong>: Canonical, restricted randomized, unrestricted randomized, and three DeepSMILES variants (branch syntax, ring syntax, both).</p>
<p><strong>Training set sizes from GDB-13</strong>: 1,000,000, 10,000, and 1,000 molecules with corresponding validation sets.</p>
<p><strong>Architecture choices</strong>: LSTM vs. GRU cells, with hyperparameter grids over number of layers ($l$), hidden size ($w$), dropout rate ($d$), and batch size ($b$).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers ($l$)</th>
          <th>Hidden ($w$)</th>
          <th>Dropout ($d$)</th>
          <th>Batch ($b$)</th>
          <th>Cell</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GDB-13 1M</td>
          <td>3</td>
          <td>512</td>
          <td>0, 25, 50</td>
          <td>64, 128, 256, 512</td>
          <td>GRU, LSTM</td>
      </tr>
      <tr>
          <td>GDB-13 10K</td>
          <td>2, 3, 4</td>
          <td>256, 384, 512</td>
          <td>0, 25, 50</td>
          <td>8, 16, 32</td>
          <td>LSTM</td>
      </tr>
      <tr>
          <td>GDB-13 1K</td>
          <td>2, 3, 4</td>
          <td>128, 192, 256</td>
          <td>0, 25, 50</td>
          <td>4, 8, 16</td>
          <td>LSTM</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>3</td>
          <td>512</td>
          <td>0, 25, 50</td>
          <td>64, 128, 256, 512</td>
          <td>LSTM</td>
      </tr>
  </tbody>
</table>
<p>Each model&rsquo;s best epoch was selected using a smoothed UC-JSD curve, and the best epoch was then sampled with replacement $k = 2 \times 10^{9}$ times for GDB-13 benchmarks.</p>
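<p>Best-epoch selection smooths the per-epoch UC-JSD curve with a moving average (window size 4 in the paper) before taking the minimum. A sketch of that selection rule; the authors&rsquo; exact implementation may differ in how the window is aligned:</p>

```python
def best_epoch(ucjsd_per_epoch, window=4):
    """Return the (1-indexed) epoch minimizing the moving-average UC-JSD."""
    smoothed = [
        sum(ucjsd_per_epoch[i : i + window]) / window
        for i in range(len(ucjsd_per_epoch) - window + 1)
    ]
    best = min(range(len(smoothed)), key=smoothed.__getitem__)
    return best + window  # last epoch inside the best window

curve = [0.9, 0.5, 0.3, 0.25, 0.2, 0.22, 0.21, 0.4]
print(best_epoch(curve))  # 7
```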
<p>For ChEMBL experiments, models were trained on 1,483,943 molecules with a validation set of 78,102 molecules. Evaluation used validity, unique molecule count, and Fréchet ChemNet Distance (FCD).</p>
<h2 id="randomized-smiles-produce-more-complete-and-uniform-chemical-spaces">Randomized SMILES Produce More Complete and Uniform Chemical Spaces</h2>
<h3 id="gdb-13-results-1m-training-set">GDB-13 results (1M training set)</h3>
<p>The restricted randomized SMILES model recovered 83.0% of GDB-13, compared to 72.8% for canonical SMILES and 68.4-72.1% for DeepSMILES variants. All three quality metrics improved substantially:</p>
<table>
  <thead>
      <tr>
          <th>SMILES Variant</th>
          <th>% GDB-13</th>
          <th>Uniformity</th>
          <th>Completeness</th>
          <th>Closedness</th>
          <th>UCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>72.8</td>
          <td>0.879</td>
          <td>0.836</td>
          <td>0.861</td>
          <td>0.633</td>
      </tr>
      <tr>
          <td>Rand. restricted</td>
          <td>83.0</td>
          <td>0.977</td>
          <td>0.953</td>
          <td>0.925</td>
          <td>0.860</td>
      </tr>
      <tr>
          <td>Rand. unrestricted</td>
          <td>80.9</td>
          <td>0.970</td>
          <td>0.929</td>
          <td>0.876</td>
          <td>0.790</td>
      </tr>
      <tr>
          <td>DeepSMILES (both)</td>
          <td>68.4</td>
          <td>0.851</td>
          <td>0.785</td>
          <td>0.796</td>
          <td>0.532</td>
      </tr>
  </tbody>
</table>
<p>The NLL distribution of GDB-13 molecules under the randomized SMILES model was centered near $NLL_{GDB13} = -\ln(1/|GDB13|) = 20.6$ with a narrow spread, indicating near-uniform sampling probability. The canonical model showed a much wider NLL distribution, meaning some molecules were orders of magnitude harder to sample.</p>
<p>Randomized SMILES without data augmentation (same SMILES each epoch) still outperformed canonical SMILES (UCC 0.712 vs. 0.633 for restricted), confirming that the non-canonical representation itself is beneficial beyond the augmentation effect.</p>
<h3 id="smaller-training-sets-amplify-the-advantage">Smaller training sets amplify the advantage</h3>
<p>With only 10,000 training molecules (0.001% of GDB-13), the randomized model generated 62.3% of GDB-13 vs. 38.8% for canonical. With 1,000 training molecules, the gap widened further: 34.1% vs. 14.5%. Validity also improved dramatically (81.2% vs. 50.4% for the 1K setting), suggesting randomized SMILES helps the model learn valid SMILES syntax more effectively from limited data.</p>
<h3 id="chembl-results">ChEMBL results</h3>
<p>On the drug-like ChEMBL dataset, the randomized SMILES model generated nearly double the number of unique molecules compared to canonical (64.09% vs. 34.67% unique in a 2B sample), with comparable validity (98.33% vs. 98.26%). The canonical model showed a lower FCD (0.0712 vs. 0.1265), but the authors argue this reflects overfitting: the canonical model&rsquo;s NLL distributions for training and validation sets overlapped tightly, while the randomized model showed more uniform coverage. Physicochemical property distributions (molecular weight, logP, SA score, QED, NP score, internal diversity) were nearly identical across both models.</p>
<h3 id="architecture-findings">Architecture findings</h3>
<p>LSTM cells consistently outperformed GRU cells across all SMILES variants. Despite GRU&rsquo;s faster per-epoch training time, LSTM models converged in fewer epochs, making them faster overall. Dropout improved canonical SMILES models but was less beneficial (or detrimental) for randomized SMILES, suggesting that randomized SMILES themselves serve as a regularization mechanism. Larger batch sizes generally improved performance across all variants.</p>
<h3 id="uc-jsd-as-a-model-selection-metric">UC-JSD as a model selection metric</h3>
<p>The UC-JSD showed strong correlation with UCC ($R^{2} = 0.931$ for canonical, $R^{2} = 0.856$ for restricted randomized, $R^{2} = 0.885$ for unrestricted randomized), validating its use as a model selection criterion without requiring expensive sampling of every model.</p>
<p>The authors interpret randomized SMILES models as occupying a hybrid space between grammar-based and action-based generative models. The vocabulary serves as a fixed action space where atom tokens are &ldquo;add atom&rdquo; actions, bond tokens are &ldquo;add bond&rdquo; actions, and ring/branching tokens enable graph traversal. Canonical SMILES constrain this action space to a single deterministic path, while randomized SMILES allow the model to explore multiple valid traversals. This perspective also explains why DeepSMILES performed worse: its altered syntax creates a more complex action space without compensating benefits.</p>
<p>The authors encourage the use of randomized SMILES across different model architectures and tasks, including classification and property prediction, and suggest that finding optimal restricted variants of randomized SMILES is a promising research direction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>GDB-13 subsets</td>
          <td>1M / 10K / 1K molecules</td>
          <td>Randomly sampled from 975M GDB-13</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>1,483,943 training / 78,102 validation</td>
          <td>Filtered subset of ChEMBL database</td>
      </tr>
  </tbody>
</table>
<p>GDB-13 is available from the <a href="http://gdb.unibe.ch/downloads">Reymond group website</a>. ChEMBL is publicly available.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Character-level tokenization with special handling for multi-character tokens (Cl, Br, bracketed atoms, %-prefixed ring numbers)</li>
<li>Teacher forcing during training with NLL loss</li>
<li>Gradient norm clipping to 1.0</li>
<li>Weight initialization from $\mathcal{U}(-\sqrt{1/w}, \sqrt{1/w})$</li>
<li>Adaptive learning rate decay based on UC-JSD</li>
<li>Best epoch selection via smoothed UC-JSD (window size 4)</li>
</ul>
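<p>The multi-character token handling listed above (Cl, Br, bracketed atoms, %-prefixed ring numbers) can be implemented with a single longest-match-first regex. A sketch, not the authors&rsquo; exact tokenizer:</p>

```python
import re

# Try bracket atoms first, then two-letter halogens and %nn ring
# closures, and fall back to single characters.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Cl|Br|%\d{2}|.")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

print(tokenize("CCl"))         # ['C', 'Cl']
print(tokenize("[nH]1cccc1"))  # ['[nH]', '1', 'c', 'c', 'c', 'c', '1']
print(tokenize("C%12CC%12"))   # ['C', '%12', 'C', 'C', '%12']
```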
<h3 id="models">Models</h3>
<p>Standard RNN architecture: embedding layer, stacked LSTM/GRU layers with optional dropout, linear output with softmax. Best models used 3 layers of 512-dimensional LSTM cells. Vocabulary sizes: 26 (GDB-13), 31 (ChEMBL).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Randomized</th>
          <th>Best Canonical</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>% GDB-13 (1M)</td>
          <td>83.0%</td>
          <td>72.8%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>UCC (1M)</td>
          <td>0.860</td>
          <td>0.633</td>
          <td>Composite score</td>
      </tr>
      <tr>
          <td>% GDB-13 (10K)</td>
          <td>62.3%</td>
          <td>38.8%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>% GDB-13 (1K)</td>
          <td>34.1%</td>
          <td>14.5%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>% Unique ChEMBL</td>
          <td>64.09%</td>
          <td>34.67%</td>
          <td>2B sample with replacement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Nvidia Tesla V100 (Volta) 16 GB VRAM with CUDA 9.1, driver 390.30. Training times ranged from 1 minute (1K canonical) to 131 hours (ChEMBL canonical). Randomized SMILES models required longer per-epoch training due to augmentation overhead but converged to better solutions.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/undeadpixel/reinvent-randomized">reinvent-randomized</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and benchmarking code</td>
      </tr>
      <tr>
          <td><a href="http://gdb.unibe.ch/downloads">GDB-13</a></td>
          <td>Dataset</td>
          <td>Academic use</td>
          <td>975 million fragment-like molecules</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molecularsets/moses">MOSES benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Used for FCD and property calculations</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Arús-Pous, J., Johansson, S. V., Prykhodko, O., Bjerrum, E. J., Tyrchan, C., Reymond, J.-L., Chen, H., &amp; Engkvist, O. (2019). Randomized SMILES strings improve the quality of molecular generative models. <em>Journal of Cheminformatics</em>, 11(1), 71. <a href="https://doi.org/10.1186/s13321-019-0393-0">https://doi.org/10.1186/s13321-019-0393-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{aruspous2019randomized,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Randomized SMILES strings improve the quality of molecular generative models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ar{\&#39;u}s-Pous, Josep and Johansson, Simon Viet and Prykhodko, Oleksii and Bjerrum, Esben Jannik and Tyrchan, Christian and Reymond, Jean-Louis and Chen, Hongming and Engkvist, Ola}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{71}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0393-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Machine Translation of Chemical Nomenclature</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/nmt-chemical-nomenclature-en-zh/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/nmt-chemical-nomenclature-en-zh/</guid><description>Xu et al. apply CNN and LSTM seq2seq models to translate chemical nomenclature between English and Chinese, outperforming rule-based tools.</description><content:encoded><![CDATA[<h2 id="a-method-for-neural-translation-of-chemical-names">A Method for Neural Translation of Chemical Names</h2>
<p>This is a <strong>Method</strong> paper that introduces deep learning approaches for translating chemical nomenclature between English and Chinese. The primary contribution is demonstrating that character-level sequence-to-sequence neural networks (both CNN-based and LSTM-based) can serve as viable alternatives to hand-crafted rule-based translation systems for chemical names. The work compares two neural architectures against an existing rule-based tool on bilingual chemical name datasets.</p>
<h2 id="bridging-the-english-chinese-chemical-nomenclature-gap">Bridging the English-Chinese Chemical Nomenclature Gap</h2>
<p>English and Chinese are the two most widely used languages for chemical nomenclature worldwide. Translation between them is important for chemical data processing, especially for converting Chinese chemical names extracted via named entity recognition into English names that existing name-to-structure tools can parse. Rule-based translation between these languages faces considerable challenges:</p>
<ol>
<li>Chinese chemical names lack word boundaries (no spaces), making segmentation difficult.</li>
<li>Word order is often reversed between English and Chinese chemical names (e.g., &ldquo;ethyl acetate&rdquo; maps to characters meaning &ldquo;acetate-ethyl&rdquo; in Chinese).</li>
<li>The same English morpheme can map to different Chinese characters depending on chemical context (e.g., &ldquo;ethyl&rdquo; translates differently in &ldquo;ethyl acetate&rdquo; vs. &ldquo;ethyl alcohol&rdquo;).</li>
<li>Trivial names, especially for natural products, follow irregular translation patterns or are transliterations.</li>
</ol>
<p>Building comprehensive rule sets requires a formally trained chemist fluent in both languages, making rule-based approaches expensive and fragile.</p>
<h2 id="character-level-sequence-to-sequence-translation">Character-Level Sequence-to-Sequence Translation</h2>
<p>The core idea is to treat chemical name translation as a character-level machine translation task, applying encoder-decoder architectures with attention mechanisms. Two architectures are proposed:</p>
<p><strong>CNN-based architecture</strong>: Three 1D convolutional layers encode the input character sequence. A decoder with three 1D convolutional layers processes the target sequence offset by one timestep, combined with attention mechanism layers that connect encoder and decoder outputs. Two additional 1D convolutional layers produce the final decoded output sequence.</p>
<p><strong>LSTM-based architecture</strong>: An LSTM encoder converts the input sequence into two state vectors. An LSTM decoder is trained with teacher forcing, using the encoder&rsquo;s state vectors as its initial state, and generating the target sequence offset by one timestep.</p>
<p>Both models operate at the character level. Input chemical name strings are transformed into embedding vectors, with the vocabulary size equal to the number of unique characters in the respective language (100 unique characters for English names, 2,056 unique characters for Chinese names).</p>
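<p>The character-level setup described above amounts to a small preprocessing step. The sketch below is illustrative; the function names and the tab/newline sentinel characters follow common Keras seq2seq conventions rather than the paper&rsquo;s released code:</p>

```python
# Sketch of character-level preprocessing for seq2seq chemical-name
# translation. Names and sentinels are illustrative, not the paper's code.

START, END = "\t", "\n"  # common start/end-of-sequence sentinel characters

def build_vocab(names):
    """Map each unique character (plus sentinels) to an integer index."""
    chars = sorted({ch for name in names for ch in name} | {START, END})
    return {ch: i for i, ch in enumerate(chars)}

def encode(name, vocab):
    """Wrap a name in start/end sentinels and integer-encode it."""
    return [vocab[ch] for ch in START + name + END]

names = ["ethyl acetate", "ethyl alcohol"]
vocab = build_vocab(names)  # 9 distinct characters + 2 sentinels
ids = encode("ethyl acetate", vocab)
```

<p>At corpus scale, the same procedure yields the vocabulary sizes reported in the paper: about 100 characters for English names but over 2,000 for Chinese names, which is why both models embed individual characters rather than words.</p>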
<h2 id="experimental-setup-and-comparison-with-rule-based-tool">Experimental Setup and Comparison with Rule-Based Tool</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors built two directional datasets from a manually curated corpus of scientific literature maintained at their institution:</p>
<ul>
<li><strong>En2Ch (English to Chinese)</strong>: 30,394 name pairs after deduplication</li>
<li><strong>Ch2En (Chinese to English)</strong>: 37,207 name pairs after deduplication</li>
</ul>
<p>The datasets range from systematic compound names to trivial names. For names with multiple valid translations, the most commonly used one was selected. Each dataset was split 80/20 into training and validation sets.</p>
<h3 id="model-configuration">Model Configuration</h3>
<p>Both neural network models used the following hyperparameters:</p>
<ul>
<li>Batch size: 64</li>
<li>Epochs: 100</li>
<li>Latent dimensionality: 256 (encoding and decoding space)</li>
<li>Implementation: Python 3.7 with Keras 2.3 and TensorFlow backend</li>
</ul>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The models were evaluated on five metrics across both translation directions:</p>
<ul>
<li><strong>Success Rate</strong>: Percentage of inputs that produced any output</li>
<li><strong>String Matching Accuracy</strong>: Exact match with the single target name</li>
<li><strong>Data Matching Accuracy</strong>: Exact match allowing any valid translation from the corpus</li>
<li><strong>Manual Spot Check</strong>: Blind evaluation of 100 random samples per approach</li>
<li><strong>Running Time</strong>: Wall-clock time on the same hardware</li>
</ul>
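<p>The three string-level metrics are straightforward to compute. A sketch, using invented outputs rather than names from the study&rsquo;s corpus:</p>

```python
# Illustrative implementations of the paper's automatic metrics.
# The example outputs, targets, and valid-translation sets are invented.

def success_rate(outputs):
    """Fraction of inputs that produced any (non-empty) output."""
    return sum(1 for o in outputs if o) / len(outputs)

def string_match(outputs, targets):
    """Exact match against the single reference translation."""
    return sum(o == t for o, t in zip(outputs, targets)) / len(outputs)

def data_match(outputs, valid_sets):
    """Exact match against any valid translation in the corpus."""
    return sum(o in vs for o, vs in zip(outputs, valid_sets)) / len(outputs)

outputs = ["乙酸乙酯", "乙醇", ""]          # third input produced no output
targets = ["乙酸乙酯", "酒精", "苯"]        # single reference per input
valid = [{"乙酸乙酯"}, {"酒精", "乙醇"}, {"苯"}]  # all accepted translations
```

<p>Data matching is by construction at least as high as string matching, which is consistent with the small gaps between those two rows in the results table below.</p>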
<h3 id="baseline">Baseline</h3>
<p>The rule-based comparison system operates in three steps: disassemble the input name into word fragments, translate each fragment, and reassemble into the target language. This tool had been deployed as an online service with over one million uses at the time of publication.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>CNN</th>
          <th>LSTM</th>
          <th>Rule-based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate En2Ch</td>
          <td>100%</td>
          <td>100%</td>
          <td>75.97%</td>
      </tr>
      <tr>
          <td>Success Rate Ch2En</td>
          <td>100%</td>
          <td>100%</td>
          <td>59.90%</td>
      </tr>
      <tr>
          <td>String Match En2Ch</td>
          <td>82.92%</td>
          <td>89.64%</td>
          <td>39.81%</td>
      </tr>
      <tr>
          <td>String Match Ch2En</td>
          <td>78.11%</td>
          <td>55.44%</td>
          <td>43.77%</td>
      </tr>
      <tr>
          <td>Data Match En2Ch</td>
          <td>84.44%</td>
          <td>90.82%</td>
          <td>45.15%</td>
      </tr>
      <tr>
          <td>Data Match Ch2En</td>
          <td>80.22%</td>
          <td>57.40%</td>
          <td>44.91%</td>
      </tr>
      <tr>
          <td>Manual Check En2Ch</td>
          <td>90.00%</td>
          <td>89.00%</td>
          <td>80.00%</td>
      </tr>
      <tr>
          <td>Manual Check Ch2En</td>
          <td>82.00%</td>
          <td>61.00%</td>
          <td>78.00%</td>
      </tr>
      <tr>
          <td>Time En2Ch (s)</td>
          <td>1423</td>
          <td>190</td>
          <td>288</td>
      </tr>
      <tr>
          <td>Time Ch2En (s)</td>
          <td>1876</td>
          <td>303</td>
          <td>322</td>
      </tr>
  </tbody>
</table>
<p>Both neural approaches achieved a 100% success rate (always producing output), while the rule-based tool failed on roughly 24% and 40% of inputs for En2Ch and Ch2En, respectively. The rule-based tool&rsquo;s failures were concentrated on Chinese names lacking word boundaries and on trivial names of natural products.</p>
<p>For English-to-Chinese translation, LSTM performed best at 89.64% string matching accuracy (90.82% data matching), followed by CNN at 82.92%. For Chinese-to-English, CNN substantially outperformed LSTM (78.11% vs. 55.44% string matching), suggesting that LSTM had difficulty with long-term dependencies in Chinese character sequences. The authors observed that many LSTM errors appeared at the ends of chemical names.</p>
<h3 id="analysis-by-name-type">Analysis by Name Type</h3>
<p>The CNN-based approach outperformed LSTM on CAS names (80% vs. 52% in manual checks) and was more robust for longer names. The rule-based tool showed consistent performance regardless of name length, suggesting it was more suited to regular systematic names but struggled with the diversity of real-world chemical nomenclature.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Performance depends heavily on training data quality and quantity.</li>
<li>Neither neural approach was validated on an external test set outside the institution&rsquo;s corpus.</li>
<li>The CNN model was considerably slower (roughly 5-7x) than the other two approaches.</li>
<li>No comparison against modern transformer-based NMT architectures (the study predates widespread adoption of transformers for this task).</li>
<li>The dataset is relatively small by modern NMT standards (30-37K pairs).</li>
<li>The authors noted that some neural translations were actually better than the target labels, suggesting the evaluation metrics understate true performance.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest that combining CNN and LSTM architectures could yield further improvements, and that the approach has practical applications in scientific publishing (Chinese journals requiring English abstracts) and chemical database interoperability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Validation (En2Ch)</td>
          <td>Curated bilingual corpus</td>
          <td>30,394 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Training/Validation (Ch2En)</td>
          <td>Curated bilingual corpus</td>
          <td>37,207 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Testing (En2Ch)</td>
          <td>Held-out validation split</td>
          <td>6,079 records</td>
          <td>Same source</td>
      </tr>
      <tr>
          <td>Testing (Ch2En)</td>
          <td>Held-out validation split</td>
          <td>7,441 records</td>
          <td>Same source</td>
      </tr>
  </tbody>
</table>
<p>Training data, Python code for both models, and result data are provided as supplementary files with the paper.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Character-level CNN encoder-decoder with attention (3+3+2 conv layers)</li>
<li>Character-level LSTM encoder-decoder with teacher forcing</li>
<li>Batch size: 64, epochs: 100, latent dim: 256</li>
</ul>
<h3 id="models">Models</h3>
<p>Both models implemented in Python 3.7 with Keras 2.3 / TensorFlow. No pre-trained weights are released separately, but the training code is provided as supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value (En2Ch)</th>
          <th>Best Value (Ch2En)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate</td>
          <td>100% (both DL)</td>
          <td>100% (both DL)</td>
          <td>Rule-based: 75.97% / 59.90%</td>
      </tr>
      <tr>
          <td>String Matching</td>
          <td>89.64% (LSTM)</td>
          <td>78.11% (CNN)</td>
          <td>Best neural model per direction</td>
      </tr>
      <tr>
          <td>Data Matching</td>
          <td>90.82% (LSTM)</td>
          <td>80.22% (CNN)</td>
          <td>Allows multiple valid translations</td>
      </tr>
      <tr>
          <td>Manual Spot Check</td>
          <td>90.00% (CNN)</td>
          <td>82.00% (CNN)</td>
          <td>Blind evaluation of 100 samples</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper; running times are reported, but the hardware used is not described.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.1186/s13321-020-00457-0">Supplementary files</a></td>
          <td>Code + Data</td>
          <td>CC-BY 4.0</td>
          <td>Training data, CNN/LSTM code, results (Additional files 1-6)</td>
      </tr>
      <tr>
          <td><a href="https://www.organchem.csdb.cn/translate">SIOC Translation Tool</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Rule-based baseline tool, online service</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, T., Chen, W., Zhou, J., Dai, J., Li, Y., &amp; Zhao, Y. (2020). Neural machine translation of chemical nomenclature between English and Chinese. <em>Journal of Cheminformatics</em>, 12, 50. <a href="https://doi.org/10.1186/s13321-020-00457-0">https://doi.org/10.1186/s13321-020-00457-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xu2020neural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Neural machine translation of chemical nomenclature between English and Chinese}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Tingjun and Chen, Weiming and Zhou, Junhong and Dai, Jingfang and Li, Yingyong and Zhao, Yingli}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00457-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>nach0: A Multimodal Chemical and NLP Foundation Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/</guid><description>nach0 is a T5-based encoder-decoder model pre-trained on SMILES, scientific text, and patents, then instruction-tuned for chemical and NLP tasks.</description><content:encoded><![CDATA[<h2 id="a-multi-domain-encoder-decoder-for-chemistry-and-nlp">A Multi-Domain Encoder-Decoder for Chemistry and NLP</h2>
<p>nach0 is a <strong>Method</strong> paper that introduces a unified encoder-decoder foundation model capable of handling both natural language processing (NLP) tasks and chemistry tasks within a single architecture. The primary contribution is demonstrating that a T5-based model pre-trained on scientific text, patents, and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> molecular strings can be instruction-tuned to perform molecular property prediction, reaction prediction, molecular generation, named entity recognition, question answering, and cross-domain translation (text-to-molecule and molecule-to-text) simultaneously. The model is available in base (250M parameters) and large (780M parameters) configurations.</p>
<h2 id="bridging-chemical-and-linguistic-representations">Bridging Chemical and Linguistic Representations</h2>
<p>Most existing biomedical language models (BioBERT, SciFive, BioMegatron) are trained exclusively on natural language text from sources like PubMed, omitting chemical structure information encoded in SMILES strings. Conversely, chemistry-specific models trained on SMILES data often lack the ability to process natural language instructions or perform NLP tasks. Models like <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a> and MolT5 attempted to bridge this gap by training on both natural language and chemical data, but they were not fine-tuned on a diverse set of chemical tasks using instruction tuning in a multi-task fashion.</p>
<p>nach0 addresses this by creating a shared representation space for both modalities and fine-tuning across a comprehensive set of tasks spanning three domains: NLP-only tasks, chemistry-only tasks, and cross-domain tasks that require translating between natural language and molecular representations.</p>
<h2 id="unified-text-to-text-framework-with-smiles-tokenization">Unified Text-to-Text Framework with SMILES Tokenization</h2>
<p>The core innovation in nach0 is formulating all chemical and linguistic tasks as text-to-text problems within a single encoder-decoder transformer, combined with a specialized SMILES tokenization strategy.</p>
<h3 id="smiles-token-integration">SMILES Token Integration</h3>
<p>Rather than treating SMILES as plain text, nach0 extends the T5 vocabulary with dedicated SMILES tokens. Each SMILES token is annotated with special symbols in the format <code>&lt;sm_{token}&gt;</code>, creating a distinct vocabulary space for molecular representations while preserving the natural language vocabulary from FLAN-T5. The embedding matrix reuses the pre-trained model&rsquo;s learned embeddings for original tokens, while newly added chemical tokens are initialized from the first embedding vectors.</p>
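<p>A minimal sketch of the annotation step, using a widely adopted SMILES tokenization regex from the literature (the tokenizer in the paper&rsquo;s codebase may differ in detail):</p>

```python
import re

# Split a SMILES string into chemically meaningful tokens, then wrap
# each one as <sm_{token}>. The regex is a common SMILES tokenization
# pattern, assumed here for illustration.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def annotate_smiles(smiles):
    """Give each SMILES token a dedicated vocabulary entry, kept
    distinct from the natural-language tokens."""
    return "".join(f"<sm_{t}>" for t in SMILES_REGEX.findall(smiles))

annotated = annotate_smiles("CC(=O)OC")
```

<p>Annotating tokens this way keeps, for example, the ring-bond digit <code>1</code> in a SMILES string from colliding with the numeral <code>1</code> in surrounding natural-language text.</p>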
<h3 id="architecture">Architecture</h3>
<p>Both model sizes use the standard <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a> encoder-decoder architecture:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Parameters</th>
          <th>Layers</th>
          <th>Hidden Size</th>
          <th>FFN Size</th>
          <th>Attention Heads</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>250M</td>
          <td>12</td>
          <td>768</td>
          <td>3072</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>780M</td>
          <td>24</td>
          <td>1024</td>
          <td>4096</td>
          <td>16</td>
      </tr>
  </tbody>
</table>
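<p>As a rough cross-check of the table, the parameter count can be estimated from the configuration alone. This ignores layer norms and relative-position biases and assumes T5&rsquo;s default tied 32,128-entry vocabulary, so it is an approximation rather than the paper&rsquo;s accounting:</p>

```python
# Back-of-the-envelope parameter estimate for the base configuration.
# The vocabulary size is T5's default and an assumption here; nach0
# extends it further with dedicated SMILES tokens.

d, ffn, n_layers, vocab_size = 768, 3072, 12, 32_128

enc = n_layers * (4 * d * d + 2 * d * ffn)  # self-attention + FFN per layer
dec = n_layers * (8 * d * d + 2 * d * ffn)  # self- and cross-attention + FFN
emb = vocab_size * d                        # shared input/output embeddings
total = enc + dec + emb                     # lands near 223M
```

<p>The estimate is consistent in magnitude with the reported 250M once the extended SMILES vocabulary and remaining parameters are included.</p>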
<h3 id="pre-training-data">Pre-training Data</h3>
<p>The model is pre-trained with a language modeling objective on three data sources:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Documents</th>
          <th>Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubMed abstracts (chemistry-filtered)</td>
          <td>13M</td>
          <td>355M</td>
      </tr>
      <tr>
          <td>USPTO patent descriptions</td>
          <td>119K</td>
          <td>2.9B</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a> molecular database</td>
          <td>~100M</td>
          <td>4.7B</td>
      </tr>
  </tbody>
</table>
<h3 id="instruction-tuning">Instruction Tuning</h3>
<p>Following the approach of Raffel et al. and Chung et al., nach0 uses natural language prompts to formulate each task. For example, a retrosynthesis task might be phrased as &ldquo;What reactants could be used to synthesize [SMILES]?&rdquo; and a property prediction task as &ldquo;Can [SMILES] penetrate the <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a>?&rdquo; This enables multi-task training across all domains with a single loss function and shared hyperparameters.</p>
<p>Training uses a batch size of 1024, learning rate of $1 \times 10^{-4}$, and weight decay of 0.01. Pre-training runs for one epoch, and fine-tuning for 10 epochs. Data mixing follows the examples-proportional mixing strategy from T5.</p>
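<p>Examples-proportional mixing samples each task in proportion to its dataset size, capped at an artificial limit so the largest datasets cannot dominate the mixture. A sketch with invented task sizes and an assumed cap (the paper follows Raffel et al.&rsquo;s strategy without restating the limit):</p>

```python
# T5-style examples-proportional mixing:
#   r_m = min(e_m, K) / sum_n min(e_n, K)
# where e_m is the number of examples in task m and K is the cap.
# Task names and sizes below are illustrative, not from the paper.

def mixing_rates(sizes, cap):
    capped = {task: min(n, cap) for task, n in sizes.items()}
    total = sum(capped.values())
    return {task: c / total for task, c in capped.items()}

sizes = {"ner": 5_000, "retrosynthesis": 50_000, "molecular_gen": 1_600_000}
rates = mixing_rates(sizes, cap=100_000)
```

<p>Without the cap, the generation task above would claim about 97% of training batches; with it, its share drops to roughly 65%, leaving room for the small NLP datasets in the mixture.</p>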
<h2 id="multi-task-evaluation-across-nlp-and-chemistry-benchmarks">Multi-Task Evaluation Across NLP and Chemistry Benchmarks</h2>
<p>nach0 is evaluated on a comprehensive set of benchmarks spanning three task categories.</p>
<h3 id="task-categories">Task Categories</h3>
<p><strong>NLP tasks</strong>: Named entity recognition (BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, JNLPBA), PICO extraction (EBM PICO), textual entailment (MedNLI, SciTail), relation extraction (ChemProt, DDI, GAD), sentence similarity (BIOSSES), document classification (HoC), and question answering (PubMedQA, BioASQ, MedMCQA, MMLU).</p>
<p><strong>Chemistry tasks</strong>: Molecular property prediction (ESOL, FreeSolv, Lipophilicity, BBBP, HIV, BACE from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>; QM9 from Mol-Instructions), molecular generation (<a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>), forward reaction prediction, reagent prediction, and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> (from Mol-Instructions/USPTO).</p>
<p><strong>Cross-domain tasks</strong>: Description-guided molecule design and molecular description generation (from Mol-Instructions).</p>
<h3 id="baselines">Baselines</h3>
<p>nach0 is compared against FLAN-T5 (250M), SciFive (220M), and MolT5 (220M), all trained in multi-task fashion.</p>
<h3 id="key-results">Key Results</h3>
<p>On chemistry and cross-domain tasks, nach0 base consistently outperforms all base-sized baselines. Selected highlights from Table 3:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>MolT5</th>
          <th>SciFive</th>
          <th>FLAN</th>
          <th>nach0 Base</th>
          <th>nach0 Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Forward reaction</td>
          <td>Acc@1</td>
          <td>27.0%</td>
          <td>60.0%</td>
          <td>59.0%</td>
          <td>88.0%</td>
          <td>89.9%</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>Acc@1</td>
          <td>15.0%</td>
          <td>31.0%</td>
          <td>31.0%</td>
          <td>53.0%</td>
          <td>56.3%</td>
      </tr>
      <tr>
          <td>Reagent prediction</td>
          <td>Acc@1</td>
          <td>1.1%</td>
          <td>3.8%</td>
          <td>4.0%</td>
          <td>6.3%</td>
          <td>13.1%</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>BA</td>
          <td>0.58</td>
          <td>0.65</td>
          <td>0.65</td>
          <td>0.74</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>BA</td>
          <td>0.55</td>
          <td>0.66</td>
          <td>0.60</td>
          <td>0.67</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>HFE (FreeSolv)</td>
          <td>R2</td>
          <td>-0.36</td>
          <td>0.51</td>
          <td>0.55</td>
          <td>0.77</td>
          <td>0.78</td>
      </tr>
      <tr>
          <td>MOSES (FCD)</td>
          <td>FCD/Test</td>
          <td>0.521</td>
          <td>0.578</td>
          <td>0.529</td>
          <td>0.311</td>
          <td>0.304</td>
      </tr>
      <tr>
          <td>Description-guided mol. design</td>
          <td>BLEU-2</td>
          <td>30.3%</td>
          <td>44.2%</td>
          <td>43.6%</td>
          <td>49.0%</td>
          <td>48.8%</td>
      </tr>
      <tr>
          <td>Mol. description gen.</td>
          <td>BLEU-2</td>
          <td>35.6%</td>
          <td>39.6%</td>
          <td>38.6%</td>
          <td>43.9%</td>
          <td>41.7%</td>
      </tr>
  </tbody>
</table>
<p>On NLP tasks, nach0 base performs comparably to FLAN base, with the two models trading wins across different tasks. nach0 large improves substantially over nach0 base on most tasks.</p>
<h3 id="ablation-study">Ablation Study</h3>
<p>The ablation study (Table 4) examines the impact of multi-task training across chemical task groups. Key findings:</p>
<ul>
<li>nach0 trained on all chemical tasks jointly outperforms models trained on individual task groups (prediction-only, reaction-only, or generation-only) on the total set of metrics</li>
<li>The joint model shows lower novelty scores on MOSES compared to the generation-only model, but this reflects less overfitting to training data rather than worse performance</li>
<li>nach0 consistently outperforms MolT5 across all chemical task configurations, demonstrating the benefit of pre-training on both natural language and chemical data with specialized SMILES tokens</li>
</ul>
<h3 id="case-studies">Case Studies</h3>
<p>Two applied case studies demonstrate nach0 in drug discovery scenarios:</p>
<ol>
<li>
<p><strong>End-to-end drug discovery for <a href="https://en.wikipedia.org/wiki/Diabetes">diabetes mellitus</a></strong>: Using a sequence of prompts, nach0 identifies biological targets, analyzes mechanisms of action, generates molecular structures, proposes synthesis routes, and predicts molecular properties.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Janus_kinase_3">JAK3</a> inhibitor generation with Chemistry42</strong>: nach0 replaces 42 specialized generative models in Insilico Medicine&rsquo;s Chemistry42 platform. In 45 minutes, nach0 generates 8 molecules satisfying all 2D and 3D requirements (hinge binding, active site binding), compared to a 0.04% discovery rate from a combinatorial generator over 24 hours. Chemistry42&rsquo;s full pipeline (72 hours) still produces better structures since it uses reinforcement learning feedback and explicit structural constraints.</p>
</li>
</ol>
<h3 id="comparison-with-chatgpt">Comparison with ChatGPT</h3>
<p>On a subset evaluation, fine-tuned nach0 base outperforms GPT-3.5-turbo on all tested tasks: EBM PICO (F1: 67.6% vs. 64.4%), MedMCQA-Open (BLEU-2: 6.3% vs. 1.7%), and molecular description generation (BLEU-2: 42.8% vs. 2.2%).</p>
<h2 id="competitive-multi-task-performance-with-clear-limitations">Competitive Multi-Task Performance with Clear Limitations</h2>
<p>nach0 demonstrates that a single encoder-decoder model can achieve competitive results across both chemical and NLP tasks when pre-trained on mixed-modality data and fine-tuned with instruction tuning. The model&rsquo;s strongest advantages appear on chemistry tasks (reaction prediction, property prediction, molecular generation), where specialized SMILES tokenization and chemical pre-training provide clear benefits over general-purpose models of similar scale.</p>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ol>
<li>
<p><strong>Not at chemist expert level</strong>: Human evaluations indicate the model does not match domain expert performance. Key gaps include chemical reasoning, knowledge alignment with domain-specific knowledge graphs, and the ability to learn from expert feedback.</p>
</li>
<li>
<p><strong>SMILES-only molecular representation</strong>: The model lacks 3D geometric information. SMILES notation is not one-to-one with molecular structures, and the model does not incorporate molecular graphs or 3D coordinates. The authors suggest <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> as a potential alternative representation.</p>
</li>
<li>
<p><strong>Prompt sensitivity</strong>: Performance depends on prompt quality and specificity. Over-reliance on domain-specific prompts may limit response diversity.</p>
</li>
<li>
<p><strong>Limited chemical diversity</strong>: Cross-domain datasets from Mol-Instructions primarily cover known drugs and chemical probes from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, representing only a fraction of predicted chemical space.</p>
</li>
</ol>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending nach0 with protein sequence modalities (using <a href="/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/">Group SELFIES</a>), expanding zero-shot evaluation capabilities, and integrating knowledge graph information through self-supervised approaches.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (text)</td>
          <td>PubMed abstracts</td>
          <td>13M docs, 355M tokens</td>
          <td>Filtered for chemistry-related content</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>USPTO patents</td>
          <td>119K docs, 2.9B tokens</td>
          <td>Patent descriptions</td>
      </tr>
      <tr>
          <td>Pre-training (chemical)</td>
          <td>ZINC</td>
          <td>~100M docs, 4.7B tokens</td>
          <td>Molecular SMILES strings</td>
      </tr>
      <tr>
          <td>Fine-tuning (NLP)</td>
          <td>17 NLP datasets</td>
          <td>Varies</td>
          <td>See Table 1 in paper</td>
      </tr>
      <tr>
          <td>Fine-tuning (chemistry)</td>
          <td>MoleculeNet, MOSES, Mol-Instructions</td>
          <td>Varies</td>
          <td>Predefined or random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5 encoder-decoder (base: 250M, large: 780M parameters)</li>
<li>Pre-training objective: Language modeling (masked span prediction)</li>
<li>Fine-tuning: Multi-task instruction tuning with examples-proportional mixing</li>
<li>Hyperparameters: batch size 1024, learning rate $1 \times 10^{-4}$, weight decay 0.01</li>
<li>Pre-training: 1 epoch; fine-tuning: 10 epochs</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_base">nach0 Base (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>250M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_large">nach0 Large (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>780M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://github.com/insilicomedicine/nach0">nach0 GitHub Repository</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation spans 17+ NLP benchmarks and 10+ chemistry benchmarks. Metrics include F1 (NER, RE, classification), accuracy (QA, entailment, reaction prediction), balanced accuracy (molecular property classification), R2/RMSE (regression), BLEU-2 (generation), and FCD/SNN/validity/novelty (molecular generation via MOSES).</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Base models: NVIDIA A4000 and A5000 GPUs</li>
<li>Large models: NVIDIA DGX cloud platform</li>
<li>Training used tensor and pipeline parallelism via NeMo toolkit</li>
<li>Specific GPU counts and training times not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Livne, M., Miftahutdinov, Z., Tutubalina, E., Kuznetsov, M., Polykovskiy, D., Brundyn, A., Jhunjhunwala, A., Costa, A., Aliper, A., Aspuru-Guzik, A., &amp; Zhavoronkov, A. (2024). nach0: Multimodal Natural and Chemical Languages Foundation Model. <em>Chemical Science</em>, 15(22), 8380-8389. <a href="https://doi.org/10.1039/D4SC00966E">https://doi.org/10.1039/D4SC00966E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{livne2024nach0,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{nach0: multimodal natural and chemical languages foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Livne, Micha and Miftahutdinov, Zulfat and Tutubalina, Elena and Kuznetsov, Maksim and Polykovskiy, Daniil and Brundyn, Annika and Jhunjhunwala, Aastha and Costa, Anthony and Aliper, Alex and Aspuru-Guzik, Al{\&#39;a}n and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8380--8389}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D4SC00966E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolBERT: Auxiliary Tasks for Molecular BERT Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/</guid><description>MolBERT applies BERT to SMILES with domain-relevant auxiliary tasks like physicochemical property prediction, improving virtual screening and QSAR.</description><content:encoded><![CDATA[<h2 id="bert-based-molecular-representations-with-auxiliary-pre-training-tasks">BERT-Based Molecular Representations with Auxiliary Pre-Training Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces MolBERT, a bidirectional Transformer (BERT) architecture applied to SMILES-based molecular representations for drug discovery. The primary contribution is a systematic study of how different domain-relevant self-supervised pre-training tasks affect the quality of learned molecular embeddings, paired with a model that achieves state-of-the-art performance on <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a> and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-activity relationship (QSAR)</a> benchmarks.</p>
<h2 id="why-domain-relevant-pre-training-matters-for-molecular-language-models">Why Domain-Relevant Pre-Training Matters for Molecular Language Models</h2>
<p>Molecular representations are foundational for predictive, generative, and analytical tasks in drug discovery. Language models applied to text-based molecular representations like SMILES have demonstrated strong performance across property prediction, reaction prediction, and molecular generation. However, several open questions remained at the time of this work:</p>
<ol>
<li><strong>Task selection for pre-training</strong>: Prior work explored masked token prediction, input translation, and property concatenation, but there was no systematic comparison of how different self-supervised tasks affect downstream performance.</li>
<li><strong>SMILES ambiguity</strong>: The same molecule can be encoded as many different SMILES strings depending on how the molecular graph is traversed. Canonicalization algorithms address this but introduce their own artifacts that may distract the model.</li>
<li><strong>Domain knowledge integration</strong>: Standard NLP pre-training objectives (e.g., masked language modeling) do not explicitly encode chemical knowledge. It was unclear whether incorporating chemistry-specific supervision during pre-training could improve representation quality.</li>
</ol>
<p>MolBERT addresses these gaps by evaluating three pre-training tasks, including a novel physicochemical property prediction objective, and measuring their individual and combined effects on downstream drug discovery benchmarks.</p>
<h2 id="three-auxiliary-tasks-for-chemistry-aware-pre-training">Three Auxiliary Tasks for Chemistry-Aware Pre-Training</h2>
<p>MolBERT uses the BERT-Base architecture (12 attention heads, 12 layers, 768-dimensional hidden states, approximately 85M parameters) and explores three self-supervised pre-training tasks:</p>
<p><strong>Masked Language Modeling (MaskedLM)</strong>: The standard BERT objective where 15% of input tokens are masked and the model predicts their identity. The loss is cross-entropy between predicted and true tokens.</p>
<p><strong>SMILES Equivalence (SMILES-Eq)</strong>: A binary classification task where the model receives two SMILES strings and predicts whether they represent the same molecule. The second string is either a random permutation of the first (same molecule, different traversal) or a randomly sampled molecule. This is optimized with cross-entropy loss.</p>
<p><strong>Physicochemical Property Prediction (PhysChemPred)</strong>: Using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, a set of 200 real-valued molecular descriptors are computed for each molecule. The model predicts these normalized descriptors from the SMILES input using mean squared error:</p>
<p>$$\mathcal{L}_{\text{PhysChemPred}} = \frac{1}{D} \sum_{d=1}^{D} (y_d - \hat{y}_d)^2$$</p>
<p>where $D = 200$ is the number of descriptors, $y_d$ is the true normalized descriptor value, and $\hat{y}_d$ is the model&rsquo;s prediction.</p>
<p>The final training loss is the arithmetic mean of all active task losses:</p>
<p>$$\mathcal{L}_{\text{total}} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathcal{L}_t$$</p>
<p>where $\mathcal{T}$ is the set of active pre-training tasks.</p>
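<p>A minimal Python sketch of this multi-task averaging, using made-up toy values rather than the authors' implementation (the real model computes these losses over batches of 200 descriptors and a 42-token vocabulary):</p>

```python
import math

def masked_lm_loss(probs, targets):
    """Cross-entropy over masked positions: -log p(true token)."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def physchem_loss(y_true, y_pred):
    """Mean squared error over the normalized descriptor targets."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)

def total_loss(active_losses):
    """Arithmetic mean of the active pre-training task losses."""
    return sum(active_losses) / len(active_losses)

# Toy values: two masked positions over a 3-token vocabulary,
# and three (not 200) descriptor targets.
l_mlm = masked_lm_loss([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [0, 1])
l_pc = physchem_loss([0.5, -1.0, 2.0], [0.4, -0.8, 2.1])
loss = total_loss([l_mlm, l_pc])
```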
<p>Additionally, MolBERT supports SMILES permutation augmentation during training, where each input molecule is represented by a randomly sampled non-canonical SMILES string rather than the canonical form. The model uses a fixed vocabulary of 42 tokens, a sequence length of 128, and relative positional embeddings (from Transformer-XL) to support arbitrary-length SMILES at inference time.</p>
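<p>Tokenization into such a fixed vocabulary is commonly done with a regex over SMILES syntax; the sketch below uses a generic pattern of this kind (not necessarily MolBERT's exact tokenizer or vocabulary):</p>

```python
import re

# Generic regex for splitting SMILES into chemically meaningful tokens:
# bracket atoms, two-letter halogens, two-digit ring bonds, then
# single-character organic-subset atoms, bonds, and ring digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|[=#\-\+\(\)\/\\\.@:~\*\$\d])"
)

def tokenize(smiles: str, max_len: int = 128):
    """Split a SMILES string and truncate to the model's sequence length."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble to the input string.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens[:max_len]

aspirin_tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")
```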
<h2 id="ablation-study-and-benchmark-evaluation">Ablation Study and Benchmark Evaluation</h2>
<h3 id="pre-training-setup">Pre-Training Setup</h3>
<p>All models were pre-trained on the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol benchmark dataset</a>, consisting of approximately 1.6M compounds curated from ChEMBL, using an 80%/5% train/validation split. Training used the Adam optimizer with a learning rate of $3 \times 10^{-5}$ for 20 epochs (ablation) or 100 epochs (final model).</p>
<h3 id="ablation-impact-of-task-combinations-on-virtual-screening">Ablation: Impact of Task Combinations on Virtual Screening</h3>
<p>The ablation study evaluated all seven possible task combinations on the RDKit virtual screening benchmark (69 datasets, 5 query molecules per target). Results measured by AUROC and BEDROC20 (an early enrichment metric with $\alpha = 20$):</p>
<table>
  <thead>
      <tr>
          <th style="text-align: center">MaskedLM</th>
          <th style="text-align: center">PhysChemPred</th>
          <th style="text-align: center">SMILES-Eq</th>
          <th style="text-align: center">AUROC (w/ perm)</th>
          <th style="text-align: center">BEDROC20 (w/ perm)</th>
          <th style="text-align: center">AUROC (w/o perm)</th>
          <th style="text-align: center">BEDROC20 (w/o perm)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.685 +/- 0.069</td>
          <td style="text-align: center">0.246 +/- 0.041</td>
          <td style="text-align: center">0.707 +/- 0.059</td>
          <td style="text-align: center">0.280 +/- 0.042</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.738 +/- 0.060</td>
          <td style="text-align: center">0.323 +/- 0.071</td>
          <td style="text-align: center">0.740 +/- 0.066</td>
          <td style="text-align: center">0.322 +/- 0.065</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.483 +/- 0.092</td>
          <td style="text-align: center">0.092 +/- 0.069</td>
          <td style="text-align: center">0.493 +/- 0.068</td>
          <td style="text-align: center">0.108 +/- 0.070</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.476 +/- 0.077</td>
          <td style="text-align: center">0.064 +/- 0.034</td>
          <td style="text-align: center">0.514 +/- 0.165</td>
          <td style="text-align: center">0.084 +/- 0.014</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.696 +/- 0.058</td>
          <td style="text-align: center">0.283 +/- 0.077</td>
          <td style="text-align: center">0.676 +/- 0.060</td>
          <td style="text-align: center">0.250 +/- 0.073</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.719 +/- 0.057</td>
          <td style="text-align: center">0.293 +/- 0.071</td>
          <td style="text-align: center">0.716 +/- 0.061</td>
          <td style="text-align: center">0.290 +/- 0.076</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.129 +/- 0.067</td>
          <td style="text-align: center">0.005 +/- 0.037</td>
          <td style="text-align: center">0.508 +/- 0.068</td>
          <td style="text-align: center">0.048 +/- 0.035</td>
      </tr>
  </tbody>
</table>
<p>Key findings from the ablation:</p>
<ul>
<li>PhysChemPred had the highest individual impact (average BEDROC20 of 0.292 alone vs. 0.266 for MaskedLM alone).</li>
<li>Combining MaskedLM + PhysChemPred achieved the best performance (BEDROC20 of 0.323), though the additive gain from MaskedLM was modest (+0.031).</li>
<li>The SMILES-Eq task consistently decreased performance when added to other task combinations.</li>
</ul>
<p>A further sub-ablation on PhysChemPred descriptor groups showed that surface descriptors alone (49 of 200 descriptors) achieved nearly the same performance as the full set, suggesting molecular surface properties provide particularly informative supervision.</p>
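<p>The BEDROC20 metric reported in these tables rewards early enrichment more heavily than AUROC. A pure-Python sketch of the standard Truchon and Bayly (2007) formulation (an independent re-implementation for illustration, not the paper's evaluation code):</p>

```python
import math

def bedroc(ranks, n_total, alpha=20.0):
    """Boltzmann-enhanced discrimination of ROC (Truchon & Bayly, 2007).

    ranks: 1-based ranks of the active compounds in the score-sorted list.
    n_total: total number of ranked compounds (actives + decoys).
    """
    n_act = len(ranks)
    r_a = n_act / n_total
    # Exponentially weighted sum over active ranks (early hits count more).
    s = sum(math.exp(-alpha * r / n_total) for r in ranks)
    # Expected value of that sum under a uniform random ranking.
    rand = r_a * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = s / rand
    # Rescale RIE onto the [0, 1] BEDROC range.
    factor = r_a * math.sinh(alpha / 2) / (
        math.cosh(alpha / 2) - math.cosh(alpha / 2 - alpha * r_a)
    )
    return rie * factor + 1 / (1 - math.exp(alpha * (1 - r_a)))

# Perfect early enrichment: all 10 actives in the top 10 of 100.
best = bedroc(range(1, 11), 100)
```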
<h3 id="virtual-screening-results">Virtual Screening Results</h3>
<p>Using the best task combination (MaskedLM + PhysChemPred) trained for 100 epochs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>AUROC</th>
          <th>BEDROC20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolBERT (100 epochs)</td>
          <td>0.743 +/- 0.062</td>
          <td>0.344 +/- 0.062</td>
      </tr>
      <tr>
          <td>CDDD</td>
          <td>0.725 +/- 0.057</td>
          <td>0.310 +/- 0.080</td>
      </tr>
      <tr>
          <td>RDKit descriptors</td>
          <td>0.633 +/- 0.027</td>
          <td>0.217 +/- 0.000</td>
      </tr>
      <tr>
          <td>ECFC4</td>
          <td>0.603 +/- 0.056</td>
          <td>0.170 +/- 0.079</td>
      </tr>
  </tbody>
</table>
<p>MolBERT outperformed all baselines including <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (the prior state of the art), RDKit calculated descriptors, and extended-connectivity fingerprints (ECFC4).</p>
<h3 id="qsar-results">QSAR Results</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> regression tasks (RMSE, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td style="text-align: center">0.687 +/- 0.08</td>
          <td style="text-align: center">0.902 +/- 0.06</td>
          <td style="text-align: center">0.567 +/- 0.06</td>
          <td style="text-align: center">0.552 +/- 0.07</td>
          <td style="text-align: center"><strong>0.531 +/- 0.04</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td style="text-align: center">1.671 +/- 0.45</td>
          <td style="text-align: center">2.876 +/- 0.38</td>
          <td style="text-align: center">1.456 +/- 0.43</td>
          <td style="text-align: center">1.523 +/- 0.66</td>
          <td style="text-align: center"><strong>0.948 +/- 0.33</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td style="text-align: center">0.738 +/- 0.04</td>
          <td style="text-align: center">0.770 +/- 0.03</td>
          <td style="text-align: center">0.669 +/- 0.02</td>
          <td style="text-align: center">0.602 +/- 0.01</td>
          <td style="text-align: center"><strong>0.561 +/- 0.03</strong></td>
      </tr>
  </tbody>
</table>
<p>On MoleculeNet classification tasks (AUROC, higher is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BACE</td>
          <td style="text-align: center">0.831</td>
          <td style="text-align: center">0.845</td>
          <td style="text-align: center">0.833</td>
          <td style="text-align: center">0.849</td>
          <td style="text-align: center"><strong>0.866</strong></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td style="text-align: center">0.696</td>
          <td style="text-align: center">0.678</td>
          <td style="text-align: center">0.761</td>
          <td style="text-align: center">0.750</td>
          <td style="text-align: center"><strong>0.762</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td style="text-align: center">0.708</td>
          <td style="text-align: center">0.714</td>
          <td style="text-align: center">0.753</td>
          <td style="text-align: center">0.747</td>
          <td style="text-align: center"><strong>0.783</strong></td>
      </tr>
  </tbody>
</table>
<p>Fine-tuned MolBERT achieved the best performance on all six QSAR datasets. When used as a fixed feature extractor with an SVM, MolBERT embeddings outperformed other representations on three of six tasks.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Pre-training task selection matters significantly.</strong> The choice of auxiliary tasks during pre-training has a large effect on downstream performance. PhysChemPred provides the strongest individual signal.</li>
<li><strong>Domain-relevant auxiliary tasks improve representation quality.</strong> Predicting physicochemical properties during pre-training encodes chemical knowledge directly into the embeddings, outperforming purely linguistic objectives.</li>
<li><strong>The SMILES equivalence task hurts performance.</strong> Despite being chemically motivated, the SMILES-Eq task consistently degraded results, suggesting it may introduce conflicting learning signals.</li>
<li><strong>PhysChemPred organizes the embedding space.</strong> Analysis of pairwise cosine similarities showed that models trained with PhysChemPred assign high similarity to permutations of the same molecule and low similarity to different molecules, creating a more semantically meaningful representation space.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The paper evaluates only SMILES-based representations, inheriting all limitations of string-based molecular encodings (inability to capture 3D structure, sensitivity to tokenization).</li>
<li>The virtual screening evaluation uses a fixed number of query molecules ($n = 5$), which may not reflect realistic screening scenarios.</li>
<li>Cross-validation splits from ChemBench were used for QSAR evaluation rather than scaffold splits, which may overestimate performance on structurally novel compounds.</li>
<li>The model&rsquo;s 128-token sequence length limit may truncate larger molecules, though relative positional embeddings partially address this at inference time.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending MolBERT to learn representations for other biological entities such as proteins, and developing more advanced pre-training strategies.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>GuacaMol (ChEMBL)</td>
          <td>~1.6M compounds</td>
          <td>80% train / 5% validation split</td>
      </tr>
      <tr>
          <td>Virtual Screening</td>
          <td>RDKit benchmark v1.2</td>
          <td>69 target datasets</td>
          <td>Filtered subset with active/decoy compounds</td>
      </tr>
      <tr>
          <td>QSAR (Regression)</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
      <tr>
          <td>QSAR (Classification)</td>
          <td>BACE, BBBP, HIV</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BERT-Base (12 heads, 12 layers, 768-dim hidden, ~85M params)</li>
<li>Optimizer: Adam, learning rate $3 \times 10^{-5}$</li>
<li>Vocabulary: 42 tokens, sequence length 128</li>
<li>Masking: 15% of tokenized input</li>
<li>Positional encoding: relative positional embeddings (Transformer-XL)</li>
<li>Fine-tuning SVM: $C = 5.0$, RBF kernel (from Winter et al.)</li>
<li>Fine-tuning head: single linear layer on pooled output</li>
<li>Embeddings: pooled output (or average sequence output when only MaskedLM is used)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BERT-Base with ~85M parameters</li>
<li>Pre-trained weights available at <a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>Virtual Screening, Classification QSAR</td>
          <td>Standard area under ROC curve</td>
      </tr>
      <tr>
          <td>BEDROC20</td>
          <td>Virtual Screening</td>
          <td>Early enrichment metric, $\alpha = 20$</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression QSAR</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>2 GPUs, 16 CPUs</li>
<li>Pre-training time: ~40 hours (20 epochs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Official implementation with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fabian, B., Edlich, T., Gaspar, H., Segler, M., Meyers, J., Fiscato, M., &amp; Ahmed, M. (2020). Molecular representation learning with language models and domain-relevant auxiliary tasks. <em>arXiv preprint arXiv:2011.13230</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fabian2020molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular representation learning with language models and domain-relevant auxiliary tasks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fabian, Benedek and Edlich, Thomas and Gaspar, H{\&#39;e}l{\&#39;e}na and Segler, Marwin and Meyers, Joshua and Fiscato, Marco and Ahmed, Mohamed}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2011.13230}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Group SELFIES: Fragment-Based Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</guid><description>Group SELFIES extends SELFIES with fragment-based group tokens for chemically robust molecular string representations that improve distribution learning.</description><content:encoded><![CDATA[<h2 id="a-fragment-aware-extension-of-selfies">A Fragment-Aware Extension of SELFIES</h2>
<p>This is a <strong>Method</strong> paper that introduces Group SELFIES, a molecular string representation extending SELFIES by incorporating group tokens that represent functional groups or entire substructures. The primary contribution is a representation that maintains the 100% chemical validity guarantee of SELFIES while enabling fragment-level molecular encoding. Group SELFIES is shorter, more human-readable, and produces better distribution learning compared to both SMILES and standard SELFIES.</p>
<h2 id="from-atoms-to-fragments-in-molecular-strings">From Atoms to Fragments in Molecular Strings</h2>
<p>Molecular string representations underpin nearly all string-based molecular generation, from chemical language models and VAEs to genetic algorithms. SMILES, the dominant representation, suffers from validity issues: generated strings frequently contain syntax errors or violate valency constraints. SELFIES solved this by guaranteeing that every string decodes to a valid molecule, but both SMILES and SELFIES operate at the atomic level. Human chemists, by contrast, think about molecules in terms of functional groups and substructures.</p>
<p>Fragment-based generative models exploit this inductive bias by constructing custom representations amenable to fragment-based molecular design. However, these approaches are typically graph-based, losing the desirable properties of string representations: easy manipulation and direct input into established language models. Historical string representations like Wiswesser Line Notation (WLN), Hayward Notation, and SYBYL Line Notation (SLN) did use non-atomic tokens, but none provided chemical robustness guarantees.</p>
<p>The gap is clear: no existing string representation combines the chemical robustness of SELFIES with the fragment-level abstraction that captures meaningful chemical motifs.</p>
<h2 id="group-tokens-with-chemical-robustness-guarantees">Group Tokens with Chemical Robustness Guarantees</h2>
<p>The core innovation is the introduction of <strong>group tokens</strong> into the SELFIES framework. Each group token represents a predefined molecular fragment (such as a benzene ring, carboxyl group, or any user-specified substructure) and is treated as a single unit during encoding and decoding.</p>
<h3 id="group-definition">Group Definition</h3>
<p>Each group is defined as a set of atoms and bonds with labeled <strong>attachment points</strong> that specify how the group participates in bonding. Each attachment point has a specified maximum valency, allowing the decoder to continue tracking available valency during string construction. Group tokens take the form <code>[:S&lt;group-name&gt;]</code>, where <code>S</code> is the starting attachment index.</p>
<h3 id="encoding">Encoding</h3>
<p>To encode a molecule, the encoder first recognizes and replaces substructure matches from the group set. By default, the encoder processes larger groups first, but users can override this with priority values. The encoder then traverses the molecular graph similarly to standard SELFIES encoding, inserting tokens that track attachment indices for entering and exiting groups.</p>
<h3 id="decoding">Decoding</h3>
<p>When the decoder encounters a group token, it looks up the corresponding group in the group set dictionary, places all atoms of the group, and connects the main chain to the starting attachment point. Navigation between attachment points is handled by reading subsequent tokens as relative indices. If an attachment point is occupied, the next available one is used. If all attachment points are exhausted, the group is immediately popped from the stack.</p>
<h3 id="chemical-robustness">Chemical Robustness</h3>
<p>The key property preserved from SELFIES is that <strong>any arbitrary Group SELFIES string decodes to a molecule with valid valency</strong>. This is achieved by maintaining the same two SELFIES decoder features within the group framework:</p>
<ol>
<li>Token overloading: every token can be interpreted as a number when needed (for branch lengths, ring targets, or attachment indices).</li>
<li>Valency tracking: if adding a bond would exceed available valency, the decoder adjusts the bond order or skips the bond.</li>
</ol>
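<p>The valency-tracking behavior can be illustrated with a toy bond-placement routine (a hypothetical simplification in the spirit of the decoder, not the actual Group SELFIES implementation):</p>

```python
# Toy valency tracking: requested bond orders are clamped to the remaining
# valency of both atoms, so any token stream yields a valid result.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def add_bond(remaining, atom_a, atom_b, order):
    """Clamp the requested bond order to both atoms' remaining valency.

    remaining: dict mapping atom id -> valence slots still available.
    Returns the bond order actually formed (0 means the bond is skipped).
    """
    allowed = min(order, remaining[atom_a], remaining[atom_b])
    if allowed > 0:
        remaining[atom_a] -= allowed
        remaining[atom_b] -= allowed
    return allowed

# An oxygen (max valence 2) asked to form a triple bond with carbon:
remaining = {"a0": MAX_VALENCE["C"], "a1": MAX_VALENCE["O"]}
formed = add_bond(remaining, "a0", "a1", 3)  # clamped to a double bond
```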
<p>The authors verified robustness by encoding and decoding 25 million molecules from the eMolecules database.</p>
<h3 id="chirality-handling">Chirality Handling</h3>
<p>Group SELFIES handles chirality differently from SMILES and SELFIES. Rather than using <code>@</code>-notation for tetrahedral chirality, all chiral centers must be specified as groups. An &ldquo;essential set&rdquo; of 23 groups covers all relevant chiral centers in the eMolecules database. This approach also supports extended chirality (axial, helical, planar) by abstracting the entire chiral substructure into a group token.</p>
<h3 id="fragment-selection">Fragment Selection</h3>
<p>The group set is a user-defined dictionary that maps group names to molecular fragments. Users can specify groups manually using SMILES-like syntax, extract them from fragment libraries, or use fragmentation algorithms such as matched molecular pair analysis. The authors tested several approaches, including a naive method that cleaves side chains from rings and methods based on cheminformatics fragmentation tools. A useful group set typically contains fragments that appear in many molecules and replace many atoms, with similar fragments merged to reduce redundancy.</p>
<h2 id="experiments-on-compactness-generation-and-distribution-learning">Experiments on Compactness, Generation, and Distribution Learning</h2>
<h3 id="compactness-section-41">Compactness (Section 4.1)</h3>
<p>Using 53 groups (30 extracted from ZINC-250k plus 23 from the essential set), Group SELFIES strings are shorter than their SMILES and SELFIES equivalents. Despite Group SELFIES having a larger alphabet, the compressed file size of the ZINC-250k dataset is smallest for Group SELFIES, indicating lower information-theoretic complexity.</p>
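<p>The compactness effect can be seen at the token level with a toy example (hypothetical group names and a string-level replacement, far simpler than the real graph-based encoder):</p>

```python
import re

# Hypothetical group tokens standing in for common fragments.
GROUPS = {"c1ccccc1": "[:0benzene]", "C(=O)O": "[:0carboxyl]"}

def to_group_string(smiles):
    """Greedily replace known fragments, larger fragments first."""
    for frag in sorted(GROUPS, key=len, reverse=True):
        smiles = smiles.replace(frag, GROUPS[frag])
    return smiles

def count_tokens(s):
    """Count group tokens ([...]) plus remaining single-character tokens."""
    return len(re.findall(r"\[[^\]]*\]|.", s))

benzoic_acid = "c1ccccc1C(=O)O"
plain = count_tokens(benzoic_acid)                      # 14 atomic-level tokens
grouped = count_tokens(to_group_string(benzoic_acid))   # 2 group tokens
```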
<h3 id="random-molecular-generation-section-42">Random Molecular Generation (Section 4.2)</h3>
<p>To isolate the effect of the representation from the generative model, the authors use a primitive generative model: sample a random string length from the dataset, draw tokens uniformly from a bag of all tokens, and concatenate. From 100,000 ZINC-250k molecules:</p>
<ul>
<li>Randomly sampled Group SELFIES strings produce molecules whose SAScore and QED distributions more closely overlap with the original ZINC dataset than molecules from randomly sampled SELFIES strings.</li>
<li>The Wasserstein distances to the ZINC distribution are consistently lower for Group SELFIES.</li>
<li>On a nonfullerene acceptor (NFA) dataset, Group SELFIES preserves aromatic rings while SELFIES rarely does.</li>
</ul>
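<p>The primitive generative model is simple enough to sketch directly (the token bag and length list here are hypothetical stand-ins for the full Group SELFIES alphabet and the ZINC-250k length distribution):</p>

```python
import random

def random_string(dataset_lengths, token_bag, rng=None):
    """Primitive generator: sample a length observed in the dataset,
    then draw that many tokens uniformly from the bag and concatenate."""
    rng = rng or random.Random()
    length = rng.choice(dataset_lengths)
    return "".join(rng.choice(token_bag) for _ in range(length))

# Hypothetical stand-ins for the real alphabet and length distribution.
bag = ["[C]", "[=O]", "[:0benzene]", "[Branch1]", "[Ring1]"]
lengths = [5, 8, 12]
sample = random_string(lengths, bag)
```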
<h3 id="distribution-learning-with-vaes-section-43">Distribution Learning with VAEs (Section 4.3)</h3>
<p>Using the MOSES benchmarking framework, VAEs were trained for 125 epochs on both Group SELFIES and SELFIES representations. The Group SELFIES VAE used 300 groups extracted from the MOSES training set. Results from 100,000 generated molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Group-VAE-125</th>
          <th>SELFIES-VAE-125</th>
          <th>Train (Reference)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>1.0 (0)</td>
          <td>1.0 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@1k</td>
          <td>1.0 (0)</td>
          <td>0.9996 (5)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@10k</td>
          <td>0.9985 (4)</td>
          <td>0.9986 (4)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>FCD (Test)</td>
          <td>0.1787 (29)</td>
          <td>0.6351 (43)</td>
          <td>0.008</td>
      </tr>
      <tr>
          <td>FCD (TestSF)</td>
          <td>0.734 (109)</td>
          <td>1.3136 (128)</td>
          <td>0.4755</td>
      </tr>
      <tr>
          <td>SNN (Test)</td>
          <td>0.6051 (4)</td>
          <td>0.6014 (3)</td>
          <td>0.6419</td>
      </tr>
      <tr>
          <td>Frag (Test)</td>
          <td>0.9995 (0)</td>
          <td>0.9989 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Scaf (Test)</td>
          <td>0.9649 (21)</td>
          <td>0.9588 (15)</td>
          <td>0.9907</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>0.8587 (1)</td>
          <td>0.8579 (1)</td>
          <td>0.8567</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.9623 (7)</td>
          <td>0.96 (4)</td>
          <td>1.0</td>
      </tr>
  </tbody>
</table>
<p>The most notable improvement is in Fréchet ChemNet Distance (FCD), where Group SELFIES achieves 0.1787 versus 0.6351 for SELFIES on the test set. FCD measures the difference between penultimate-layer activations of ChemNet, encoding a mixture of biological and chemical properties relevant to drug-likeness. Most other metrics are comparable, with Group SELFIES matching or slightly outperforming SELFIES across the board.</p>
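<p>FCD is the Fréchet distance between Gaussians fitted to ChemNet activations of the generated and reference sets. For diagonal covariances it reduces to a simple closed form, sketched below (an illustrative simplification: the real metric uses full covariance matrices and actual ChemNet features):</p>

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between two Gaussians with diagonal
    covariances: ||mu1 - mu2||^2 + sum(v1 + v2 - 2*sqrt(v1*v2))."""
    d_mean = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    d_cov = sum(v1 + v2 - 2 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return d_mean + d_cov

# Identical activation statistics give distance 0; shifted means do not.
same = frechet_distance_diag([0.1, 0.2], [1.0, 1.0], [0.1, 0.2], [1.0, 1.0])
```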
<h2 id="advantages-limitations-and-future-directions">Advantages, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p>Group SELFIES provides three main advantages over standard SELFIES:</p>
<ol>
<li><strong>Substructure control</strong>: Important scaffolds, chiral centers, and charged groups can be preserved during molecular optimization.</li>
<li><strong>Compactness</strong>: Group tokens represent multiple atoms, yielding shorter strings with lower information-theoretic complexity.</li>
<li><strong>Improved distribution learning</strong>: The FCD metric shows substantial improvement, indicating generated molecules better capture biological and chemical properties of the training set.</li>
</ol>
<p>Both SELFIES and Group SELFIES achieve 100% validity, eliminating the validity issues associated with SMILES-based generation.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational speed</strong>: Encoding and decoding is slower than SELFIES due to RDKit overhead, particularly for the encoder which performs substructure matching for every group in the set.</li>
<li><strong>No group overlap</strong>: Groups cannot overlap in the current formulation, which limits expressiveness for polycyclic compounds.</li>
<li><strong>Group set design</strong>: Choosing an effective group set remains an open design choice that may require domain expertise or fragmentation algorithm tuning.</li>
<li><strong>Limited generative model evaluation</strong>: The paper focuses on random sampling and VAEs; evaluation with more sophisticated models (GANs, reinforcement learning, genetic algorithms) is left to future work.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose several extensions: flexible scaffold tokens that preserve topology while allowing atom-type variation, representations based on cellular complexes or hypergraphs to handle overlapping groups, and integration with genetic algorithms like JANUS for molecular optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compactness / Generation</td>
          <td>ZINC-250k</td>
          <td>250,000 molecules</td>
          <td>Random subset of 10,000 for fragment extraction; 100,000 for generation</td>
      </tr>
      <tr>
          <td>Distribution Learning</td>
          <td>MOSES benchmark</td>
          <td>~1.9M molecules</td>
          <td>Standard train/test split from MOSES framework</td>
      </tr>
      <tr>
          <td>Robustness Verification</td>
          <td>eMolecules</td>
          <td>25M molecules</td>
          <td>Full database encode-decode round trip</td>
      </tr>
      <tr>
          <td>NFA Generation</td>
          <td>NFA dataset</td>
          <td>Not specified</td>
          <td>Nonfullerene acceptors from Lopez et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: Naive ring-sidechain cleavage, matched molecular pair analysis, and diversity-based selection of 300 groups for VAE experiments.</li>
<li><strong>Essential set</strong>: 23 chiral groups covering all relevant chiral centers in eMolecules.</li>
<li><strong>Random generation</strong>: Bag-of-tokens sampling with length matched to dataset distribution.</li>
</ul>
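<p>The bag-of-tokens baseline above is simple enough to sketch in full. This is an illustrative reimplementation, not the paper's code; the token and length tallies passed in are placeholders for the empirical counts from the dataset:</p>

```python
import random

def sample_bag_of_tokens(token_freqs, length_freqs, seed=0):
    # Baseline generator: draw a string length from the empirical
    # length distribution, then draw that many tokens i.i.d. in
    # proportion to their frequency in the dataset.
    rng = random.Random(seed)
    lengths, lw = zip(*length_freqs.items())
    tokens, tw = zip(*token_freqs.items())
    n = rng.choices(lengths, weights=lw)[0]
    return rng.choices(tokens, weights=tw, k=n)
```

<p>Because every Group SELFIES string decodes to a valid molecule, even this unconditional sampler yields 100% validity; the metrics then isolate how well the token statistics alone capture the dataset.</p>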
<h3 id="models">Models</h3>
<ul>
<li><strong>VAE</strong>: Trained for 125 epochs on MOSES dataset using both SELFIES and Group SELFIES tokenizations.</li>
<li>Architecture details follow the MOSES benchmark VAE configuration.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCD</td>
          <td>Frechet ChemNet Distance (penultimate layer activations)</td>
      </tr>
      <tr>
          <td>SNN</td>
          <td>Average Tanimoto similarity to nearest neighbor in reference set</td>
      </tr>
      <tr>
          <td>Frag</td>
          <td>Cosine similarity of BRICS fragment distributions</td>
      </tr>
      <tr>
          <td>Scaf</td>
          <td>Cosine similarity of Bemis-Murcko scaffold distributions</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>Internal diversity via Tanimoto similarity</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Percentage passing RDKit parsing</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Percentage of non-duplicate generated molecules</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Robustness verification performed on the Niagara supercomputer (SciNet HPC Consortium).</li>
<li>VAE training hardware not specified.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/group-selfies">group-selfies</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Open-source Python implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cheng, A. H., Cai, A., Miret, S., Malkomes, G., Phielipp, M., &amp; Aspuru-Guzik, A. (2023). Group SELFIES: A robust fragment-based molecular string representation. <em>Digital Discovery</em>, 2(3), 748-758. <a href="https://doi.org/10.1039/D3DD00012E">https://doi.org/10.1039/D3DD00012E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cheng2023group,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Group SELFIES: A Robust Fragment-Based Molecular String Representation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cheng, Austin H. and Cai, Andy and Miret, Santiago and Malkomes, Gustavo and Phielipp, Mariano and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{748--758}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00012E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DeepSMILES: Adapting SMILES Syntax for Machine Learning</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</guid><description>DeepSMILES modifies SMILES syntax to eliminate unbalanced parentheses and unpaired ring closures, reducing invalid outputs from generative molecular models.</description><content:encoded><![CDATA[<h2 id="a-new-molecular-string-notation-for-generative-models">A New Molecular String Notation for Generative Models</h2>
<p>This is a <strong>Method</strong> paper that introduces DeepSMILES, a modified SMILES syntax designed to reduce the rate of syntactically invalid strings produced by machine-learning generative models. The primary contribution is a pair of string-level transformations (for ring closures and for branches) that can be applied independently and interconverted with standard SMILES without loss of information, including stereochemistry.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p>Deep neural networks for de novo molecular design commonly operate on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational autoencoders</a> (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al., 2018</a>), recurrent neural networks with LSTM (<a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al., 2018</a>; Olivecrona et al., 2017), and grammar-based approaches (<a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Kusner et al., 2017</a>) all generate molecules by sampling character sequences. A persistent problem is that many generated strings are syntactically invalid SMILES, with reported validity rates ranging from 7% to 80%.</p>
<p>Two structural features of SMILES syntax are responsible for most invalid strings:</p>
<ol>
<li><strong>Balanced parentheses</strong>: Branches require matched open/close parenthesis pairs. A generative model must track nesting state across long sequences to produce valid brackets.</li>
<li><strong>Paired ring closure symbols</strong>: Rings require two identical digit tokens at corresponding positions. The model must remember which digits are &ldquo;open&rdquo; and close them appropriately.</li>
</ol>
<p>Grammar-based approaches (e.g., <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>) can enforce balanced parentheses through a context-free grammar, but they cannot enforce the ring closure pairing constraint because that constraint is context-sensitive. Syntax-directed approaches (Dai et al., 2018) add explicit ring closure constraints but at the cost of significantly more complex decoder architectures.</p>
<h2 id="core-innovation-postfix-branch-notation-and-single-ring-closure-symbols">Core Innovation: Postfix Branch Notation and Single Ring Closure Symbols</h2>
<p>DeepSMILES addresses both syntax problems through two independent string transformations.</p>
<h3 id="ring-closure-transformation">Ring closure transformation</h3>
<p>Standard SMILES uses a pair of identical digits to mark ring openings and closings (e.g., <code>c1ccccc1</code> for benzene). DeepSMILES eliminates the ring-opening digit and replaces the ring-closing digit with the ring size, counting back along the tree path to the ring-opening atom. Benzene becomes <code>cccccc6</code>, where <code>6</code> means &ldquo;connect to the atom 6 positions back.&rdquo;</p>
<p>This transformation has three key properties:</p>
<ul>
<li>Every ring of a given size always uses the same digit, regardless of context. A phenyl ring is always <code>cccccc6</code> in DeepSMILES, whereas in SMILES it might be <code>c1ccccc1</code>, <code>c2ccccc2</code>, <code>c3ccccc3</code>, etc.</li>
<li>A single symbol cannot be &ldquo;unmatched&rdquo; since there is no corresponding opening symbol.</li>
<li>For double-digit ring sizes, the <code>%N</code> notation is used (and <code>%(N)</code> for sizes above 99).</li>
</ul>
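<p>For intuition, the digit rewrite can be sketched for the simplest case: single-letter atoms, single-digit ring closures, no branches. This is a toy encoder, not the reference implementation, which handles the full SMILES grammar:</p>

```python
import re

def smiles_ring_to_deepsmiles(smiles: str) -> str:
    # Toy encoder: drop each ring-opening digit and rewrite the
    # matching ring-closing digit as the ring size.
    n_atoms = 0      # atoms emitted so far
    open_rings = {}  # ring-closure digit -> atom count at its opening
    out = []
    for token in re.findall(r"[A-Za-z]|\d", smiles):
        if token.isdigit():
            if token in open_rings:
                # Closing digit becomes the ring size: the number of
                # atoms back to, and including, the ring-opening atom.
                out.append(str(n_atoms - open_rings.pop(token) + 1))
            else:
                # Opening digit is dropped; remember where it occurred.
                open_rings[token] = n_atoms
        else:
            n_atoms += 1
            out.append(token)
    return "".join(out)
```

<p>Because the emitted digit is just the count of atoms back to the ring opening, identical rings always produce identical tokens, which is exactly the regularity that makes the notation easier for sequence models to learn.</p>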
<p>Bond stereochemistry is preserved by moving any explicit bond order or stereo-bond symbol from the eliminated ring-opening digit to the ring-closing digit, with direction adjusted as needed.</p>
<h3 id="branch-parenthesis-transformation">Branch (parenthesis) transformation</h3>
<p>Standard SMILES uses matched open/close parenthesis pairs for branches (e.g., <code>C(OC)(SC)F</code>). DeepSMILES replaces this with a postfix notation inspired by Reverse Polish Notation (RPN). Only close parentheses are used, and the number of consecutive close parentheses indicates how far back on the current branch the next atom attaches.</p>
<p>For example, <code>C(OC)(SC)F</code> becomes <code>COC))SC))F</code>. The interpretation uses a stack: atoms are pushed onto the stack as they are read, each close parenthesis pops one atom from the stack, and the next atom connects to whatever is on top of the stack.</p>
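<p>That stack interpretation can be made concrete with a minimal decoder that recovers the bond list from a branch-only DeepSMILES string (single-letter atoms assumed; atoms are tagged with their string position so repeated element symbols stay distinct):</p>

```python
def deepsmiles_bonds(s: str) -> list:
    # Minimal decoder for the postfix branch notation.
    bonds = []
    stack = []  # path from the root atom to the current attachment point
    for i, ch in enumerate(s):
        if ch == ")":
            stack.pop()  # each ')' pops one atom off the stack
        else:
            atom = f"{ch}{i}"
            if stack:
                bonds.append((stack[-1], atom))  # bond to top of stack
            stack.append(atom)
    return bonds
```

<p>Running this on <code>COC))SC))F</code> recovers exactly the five bonds of <code>C(OC)(SC)F</code>: the two close parentheses after each branch return the attachment point to the central carbon.</p>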
<h3 id="stereochemistry-preservation">Stereochemistry preservation</h3>
<p>Tetrahedral stereochemistry is fully preserved through the transformations. When ring closure symbol reordering would change the stereo configuration, the <code>@</code>/<code>@@</code> annotation is inverted during encoding to compensate.</p>
<h3 id="independence-of-transformations">Independence of transformations</h3>
<p>The two transformations are independent and can be applied separately or together. Any application of DeepSMILES should specify which transformations were applied.</p>
<h2 id="roundtrip-validation-on-chembl-23">Roundtrip Validation on ChEMBL 23</h2>
<p>The authors validated DeepSMILES by roundtripping all entries in the ChEMBL 23 database through SMILES-to-DeepSMILES-to-SMILES conversion. Canonical SMILES (including stereochemistry) were generated by four independent cheminformatics toolkits: CDK, OEChem, Open Babel, and RDKit. Using multiple toolkits ensures coverage of different traversal orders and ring closure ordering conventions.</p>
<p>All SMILES strings roundtripped without error across all three configurations (branches only, rings only, both). The exact string representation may differ in ring closure digit assignment or digit ordering, sometimes with an associated stereo inversion at tetrahedral centers, but the canonical SMILES of the original and roundtripped molecules are identical.</p>
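<p>The validation protocol reduces to a one-line invariant: because the regenerated string may legitimately differ (ring digit assignment, digit ordering), equality is tested on canonical forms rather than raw strings. Here <code>encode</code>, <code>decode</code>, and <code>canonicalize</code> are stand-ins for a DeepSMILES codec and a toolkit canonicalizer (e.g. RDKit&rsquo;s), not APIs from the paper&rsquo;s code:</p>

```python
def roundtrip_ok(smiles, encode, decode, canonicalize):
    # Molecule-level round-trip check: strings may differ after the
    # round trip, so compare canonical forms instead.
    return canonicalize(decode(encode(smiles))) == canonicalize(smiles)
```
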
<h3 id="performance-characteristics">Performance characteristics</h3>
<p>The following table shows the effect of DeepSMILES conversion on string length and throughput, measured on canonical SMILES from Open Babel for ChEMBL 23:</p>
<table>
  <thead>
      <tr>
          <th>Transformation</th>
          <th>Mean % change in length</th>
          <th>Encoding (per sec)</th>
          <th>Decoding (per sec)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Branches only</td>
          <td>+8.2%</td>
          <td>32,000</td>
          <td>16,000</td>
      </tr>
      <tr>
          <td>Rings only</td>
          <td>-6.4%</td>
          <td>26,000</td>
          <td>24,000</td>
      </tr>
      <tr>
          <td>Both</td>
          <td>+1.9%</td>
          <td>26,000</td>
          <td>17,500</td>
      </tr>
  </tbody>
</table>
<p>The ring transformation slightly shortens strings (by removing one digit per ring), while the branch transformation slightly lengthens them (additional close parentheses). Combined, the net effect is a small increase of about 2%. Throughput is in the tens of thousands of conversions per second in pure Python.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>DeepSMILES does not eliminate all invalid strings. Invalid DeepSMILES can still be generated, for example when there are more close parentheses than atoms on the stack, or when a ring size exceeds the number of available atoms. The reference implementation raises a <code>DecodeError</code> in these cases, though the authors note that a more tolerant decoder (ignoring extra parentheses or defaulting to the first atom for oversized rings) could be used during generation.</p>
<p>The paper assumes that input SMILES are generated by a standard cheminformatics toolkit as a depth-first traversal of the molecular graph. Non-standard SMILES (e.g., <code>CC(C1)CCCC1</code>) cannot be directly encoded.</p>
<p>The authors suggest several directions for future work:</p>
<ul>
<li>Investigating whether a preferred traversal order (e.g., shorter branches first) would make DeepSMILES even easier for models to learn.</li>
<li>Exploring notations where atoms in the organic subset explicitly list their hydrogen count, which would allow a fully parenthesis-free representation.</li>
<li>Using SMILES augmentation with random traversal orders (as explored by Bjerrum and Threlfall, 2017) in combination with DeepSMILES.</li>
<li>Designing entirely new line notations optimized for ML, where every string maps to a valid molecule, there are few duplicate representations, small string changes produce small structural changes, and string length correlates with pharmaceutical relevance.</li>
</ul>
<p>The fused ring case presents additional complexity: a bicyclic system has three cycles, and depending on traversal order, the ring size digit may not directly correspond to the ring size of any individual ring. This is an inherent limitation of depth-first traversal-based notations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation</td>
          <td>ChEMBL 23</td>
          <td>~1.7M compounds</td>
          <td>Canonical SMILES from CDK, OEChem, Open Babel, RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The DeepSMILES encoder and decoder are pure string-processing algorithms with no machine-learning components. The transformations operate on SMILES syntax tokens (atoms, bonds, parentheses, ring closure digits) without chemical interpretation.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Roundtrip accuracy</td>
          <td>100%</td>
          <td>All ChEMBL 23 entries across 4 toolkits</td>
      </tr>
      <tr>
          <td>Encoding throughput</td>
          <td>26,000-32,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
      <tr>
          <td>Decoding throughput</td>
          <td>16,000-24,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>No specific hardware requirements. The implementation is a pure Python module with no GPU dependencies.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/nextmovesoftware/deepsmiles">deepsmiles</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Pure Python encoder/decoder</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: O&rsquo;Boyle, N. M., &amp; Dalke, A. (2018). DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv.7097960.v1">https://doi.org/10.26434/chemrxiv.7097960.v1</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oboyle2018deepsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{O&#39;Boyle, Noel M. and Dalke, Andrew}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ChemRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv.7097960.v1}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CDDD: Learning Descriptors by Translating SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/</guid><description>CDDD learns continuous molecular descriptors by translating between SMILES and InChI representations, outperforming fingerprints in virtual screening.</description><content:encoded><![CDATA[<h2 id="a-translation-based-method-for-learned-molecular-descriptors">A Translation-Based Method for Learned Molecular Descriptors</h2>
<p>This is a <strong>Method</strong> paper that introduces Continuous and Data-Driven Descriptors (CDDD), a neural machine translation approach for learning fixed-size, continuous molecular representations. Rather than training an autoencoder to reconstruct <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, Winter et al. train an encoder-decoder model to translate between semantically equivalent but syntactically different molecular representations (e.g., randomized SMILES to canonical SMILES, or <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> to canonical SMILES). The bottleneck latent vector serves as a general-purpose molecular descriptor. Pretrained on approximately 72 million compounds from <a href="/notes/chemistry/datasets/zinc-22/">ZINC15</a> and PubChem, CDDD produces 512-dimensional descriptors that achieve competitive QSAR performance and significantly outperform all tested molecular fingerprints in ligand-based virtual screening.</p>
<h2 id="why-translation-instead-of-reconstruction">Why Translation Instead of Reconstruction?</h2>
<p>Molecular descriptors are central to cheminformatics. Traditional approaches rely on human-engineered fingerprints such as extended-connectivity fingerprints (ECFPs), which encode structural features as fixed-length bit vectors. While effective, these representations are constrained by predefined feature extraction rules.</p>
<p>Recent work applied deep neural networks directly to molecular graphs or SMILES strings to learn task-specific representations. However, these end-to-end approaches must learn features from scratch for each new dataset, making them prone to overfitting on the small bioactivity datasets typical in drug discovery.</p>
<p>Unsupervised approaches based on autoencoders (notably <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.&rsquo;s VAE</a> and <a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Xu et al.&rsquo;s seq2seq model</a>) offered a path toward general-purpose learned descriptors. These models reconstruct SMILES strings through an information bottleneck, forcing the latent space to capture molecular information. The concern with reconstruction, however, is that the model may focus on syntactic patterns of the string representation rather than the underlying chemical semantics. A model that memorizes SMILES syntax shortcuts can achieve low reconstruction error without truly encoding chemical meaning.</p>
<p>Winter et al. address this by drawing on the analogy to neural machine translation: a translator must understand the meaning of a sentence to produce a correct translation in another language. By training the model to translate between different molecular representations (which share chemical semantics but differ in syntax), the latent space is forced to capture the chemical information common to both representations, rather than representation-specific syntactic artifacts.</p>
<h2 id="translation-as-semantic-compression">Translation as Semantic Compression</h2>
<p>The core insight is that translating between two syntactically different but semantically equivalent representations forces the encoder to capture only the chemical meaning shared by both. The model architecture follows the standard encoder-decoder framework from neural machine translation.</p>
<p>The encoder reads a source molecular string (e.g., a randomized SMILES or InChI) and compresses it into a fixed-size latent vector. The decoder takes this latent vector and generates the target molecular string (canonical SMILES). The model is trained to minimize character-level cross-entropy between the decoder output and the target sequence.</p>
<p>Four translation tasks were evaluated:</p>
<ol>
<li><strong>Randomized SMILES to canonical SMILES</strong> (best performing)</li>
<li><strong>InChI to canonical SMILES</strong></li>
<li><strong>Canonical SMILES to canonical SMILES</strong> (autoencoding baseline)</li>
<li><strong>Canonical SMILES to InChI</strong> (failed to learn)</li>
</ol>
<p>The final model uses an RNN encoder with 3 stacked GRU layers (512, 1024, and 2048 units). The concatenated cell states pass through a fully connected layer with tanh activation to produce a 512-dimensional latent vector. The decoder mirrors this architecture, initializing its GRU states from the latent vector via separate fully connected layers. Teacher forcing is used during training, and left-to-right beam search is used at inference.</p>
<p>An auxiliary property prediction network takes the latent vector as input and predicts nine molecular properties (logP, partial charges, valence electrons, H-bond donors/acceptors, Balaban&rsquo;s J, <a href="https://en.wikipedia.org/wiki/Molar_refractivity">molar refractivity</a>, TPSA). This multi-task signal encourages the latent space to encode physically meaningful information. The full training objective combines the translation cross-entropy loss with the property prediction mean squared error:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{translation}} + \mathcal{L}_{\text{properties}}$$</p>
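<p>On toy values, the two terms combine as a plain sum (a minimal sketch with equal weighting, matching the objective above; the real model computes both terms over batches of sequences):</p>

```python
import math

def joint_loss(char_probs, target_ids, prop_preds, prop_targets):
    # Translation term: mean character-level cross-entropy, where
    # char_probs[i] is the decoder's distribution at step i and
    # target_ids[i] is the index of the correct character.
    ce = -sum(math.log(p[t]) for p, t in zip(char_probs, target_ids)) / len(target_ids)
    # Auxiliary term: mean squared error on the property predictions.
    mse = sum((a - b) ** 2 for a, b in zip(prop_preds, prop_targets)) / len(prop_targets)
    return ce + mse
```
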
<p>To ensure invariance to input SMILES representation at inference time, the model uses randomized SMILES as input half the time and canonical SMILES the other half during training. Input dropout (15% at the character level) and Gaussian noise (standard deviation 0.05) are applied for regularization.</p>
<h2 id="qsar-benchmarks-virtual-screening-and-latent-space-exploration">QSAR Benchmarks, Virtual Screening, and Latent Space Exploration</h2>
<h3 id="pretraining">Pretraining</h3>
<p>The model was pretrained on approximately 72 million compounds from ZINC15 and PubChem (merged, deduplicated, filtered for organic molecules with MW 12-600, &gt;3 heavy atoms, logP between -7 and 5). All evaluation compounds were removed from the pretraining set.</p>
<h3 id="qsar-experiments">QSAR Experiments</h3>
<p>Ten QSAR datasets were used, spanning classification (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames mutagenicity</a>, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG inhibition</a>, <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB penetration</a>, BACE inhibition, bee toxicity) and regression (EGFR inhibition, <a href="https://en.wikipedia.org/wiki/Plasmodium_falciparum">Plasmodium falciparum</a> inhibition, lipophilicity, aqueous solubility, melting point). Two datasets (Ames and lipophilicity) served as validation for architecture selection; the remaining eight were held out for final evaluation.</p>
<p>CDDD descriptors with an SVM were benchmarked against:</p>
<ul>
<li>Nine circular fingerprint variants (Morgan fingerprints, radius 1-3, folded to 512/1024/2048 bits) with RF, SVM, and GB</li>
<li>Graph convolution models (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">DeepChem</a>)</li>
</ul>
<p>Both random-split and cluster-split (K-means on MACCS fingerprints, K=5) cross-validation were performed.</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Split</th>
          <th>CDDD + SVM</th>
          <th>Best Fingerprint</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ames (ROC-AUC)</td>
          <td>Random</td>
          <td>0.89</td>
          <td>0.89 (ecfc2, RF)</td>
          <td>0.88</td>
      </tr>
      <tr>
          <td>hERG (ROC-AUC)</td>
          <td>Random</td>
          <td>0.86</td>
          <td>0.85 (ecfc4, RF)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC)</td>
          <td>Random</td>
          <td>0.93</td>
          <td>0.93 (ecfc2, RF)</td>
          <td>0.92</td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC)</td>
          <td>Random</td>
          <td>0.90</td>
          <td>0.91 (ecfc2, RF)</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Bee toxicity (ROC-AUC)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.91 (ecfc6, RF)</td>
          <td>0.89</td>
      </tr>
      <tr>
          <td>Lipophilicity ($r^2$)</td>
          <td>Random</td>
          <td>0.72</td>
          <td>0.69 (ecfc2, SVM)</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>ESOL ($r^2$)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.58 (ecfc6, SVM)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>Melting point ($r^2$)</td>
          <td>Random</td>
          <td>0.42</td>
          <td>0.38 (ecfc2, SVM)</td>
          <td>0.39</td>
      </tr>
  </tbody>
</table>
<p>CDDD descriptors showed competitive or better performance across all tasks. Notably, CDDD achieved substantially higher $r^2$ on aqueous solubility (0.92 vs. 0.58 for the best fingerprint). The authors emphasize that CDDD&rsquo;s feature extraction was fixed based on two validation tasks, while baseline methods selected the best fingerprint/model combination per task, making the comparison conservative for CDDD.</p>
<h3 id="virtual-screening">Virtual Screening</h3>
<p>Ligand-based virtual screening experiments followed the Riniker et al. benchmarking protocol on 40 DUD targets and 17 MUV targets. Five active compounds were randomly selected per target, and remaining compounds were ranked by similarity (cosine similarity for CDDD, <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto</a> for fingerprints). This process was repeated 50 times per target.</p>
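<p>The ranking step can be sketched in a few lines. Max-fusion (score each library compound by its similarity to the nearest query active) is shown here as one common choice in the Riniker et al. protocol; the benchmark also supports other fusion rules:</p>

```python
import math

def tanimoto(a, b):
    # Jaccard/Tanimoto similarity between fingerprint bit sets.
    return len(a & b) / len(a | b)

def cosine(u, v):
    # Cosine similarity between continuous descriptor vectors.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def screen(library, query_actives, sim):
    # Rank the library by maximum similarity to any query active.
    return sorted(library,
                  key=lambda m: max(sim(m, q) for q in query_actives),
                  reverse=True)
```

<p>ROC-AUC is then computed over the resulting ranking, with the held-out actives as positives and decoys as negatives.</p>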
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>CDDD (ROC-AUC)</th>
          <th>Second Best</th>
          <th>p-value (Wilcoxon)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DUD</td>
          <td>0.949</td>
          <td>0.899 (laval)</td>
          <td>$5 \times 10^{-38}$</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>0.679</td>
          <td>0.677 (ap)</td>
          <td>0.04</td>
      </tr>
  </tbody>
</table>
<p>CDDD significantly outperformed all 14 baseline fingerprints on both databases. The DUD improvement was particularly large (+5.0 ROC-AUC points over the next best). On MUV, which is designed to be harder, the advantage was smaller but still statistically significant. Importantly, while the best baseline fingerprint varied between DUD and MUV (laval vs. ap), CDDD ranked first on both, demonstrating consistent performance.</p>
<h3 id="latent-space-exploration">Latent Space Exploration</h3>
<p>The continuous, reversible nature of CDDD enables chemical space navigation. Shifting a molecule&rsquo;s embedding along the first principal component of the pretraining data correlates with molecular size (Spearman $r = 0.947$, $p = 0.00048$), while the second principal component correlates with polarity/logP ($r = -0.916$, $p = 0.00015$).</p>
<p>When shifting 1000 compounds along 100 random directions, the model maintained high valid SMILES generation rates (&gt;97% for the top beam search output, &gt;99% when considering the top 3 outputs). Euclidean distance in the descriptor space correlated smoothly with Tanimoto distance in fingerprint space, confirming that the latent space supports meaningful interpolation.</p>
<h2 id="consistent-learned-descriptors-for-chemistry">Consistent Learned Descriptors for Chemistry</h2>
<p>CDDD demonstrated that translation between molecular representations produces more informative latent spaces than autoencoder reconstruction. The key findings are:</p>
<ol>
<li><strong>Translation outperforms reconstruction</strong>: Models trained on translating between different representations consistently produced better downstream descriptors than autoencoding models, despite autoencoding being an easier task.</li>
<li><strong>Auxiliary property prediction helps</strong>: The auxiliary regression task on molecular properties improved descriptor quality, particularly for physicochemical endpoints correlated with the predicted properties.</li>
<li><strong>Consistent performance</strong>: Unlike baseline methods where the best fingerprint varies by task, CDDD showed consistent performance across all QSAR and VS experiments.</li>
<li><strong>Smooth latent space</strong>: The continuous descriptor space supports meaningful interpolation and chemical space exploration with high valid SMILES rates.</li>
</ol>
<p>The authors acknowledge several limitations. The InChI-to-SMILES translation worked but produced inferior descriptors compared to SMILES-to-SMILES, and SMILES-to-InChI translation failed entirely, likely because InChI&rsquo;s layered syntax involves counting and arithmetic, which are difficult for sequence decoders to learn. The approach was only tested with string-based representations; translation between conceptually different representations (e.g., 3D structures) remains future work. The QSAR evaluation, while extensive, used relatively standard datasets, and the method&rsquo;s advantage over graph convolution models was modest on tasks where end-to-end learning had sufficient data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15 + PubChem (merged)</td>
          <td>~72M compounds</td>
          <td>Filtered: organic, MW 12-600, &gt;3 heavy atoms, logP -7 to 5</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Ames mutagenicity</td>
          <td>6,130</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Lipophilicity</td>
          <td>3,817</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>hERG, BBBP, BACE, bee toxicity</td>
          <td>188-3,440</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>EGFR, Plasmodium, ESOL, melting point</td>
          <td>184-4,451</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>DUD</td>
          <td>40 targets</td>
          <td>Ligand-based virtual screening</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>MUV</td>
          <td>17 targets</td>
          <td>Maximum unbiased validation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 3 stacked GRU layers (512, 1024, 2048 units) with tanh bottleneck to 512-dim latent space</li>
<li>Decoder: Matching 3 stacked GRU layers, initialized from latent space</li>
<li>Auxiliary classifier: 3 FC layers (512, 128, 9) predicting molecular properties</li>
<li>Optimizer: Adam, initial LR $5 \times 10^{-4}$, decayed by 0.9 every 50,000 steps</li>
<li>Batch size: 64 with bucketing by sequence length</li>
<li>Input regularization: 15% character dropout + Gaussian noise (std 0.05)</li>
<li>Beam search for decoding at inference</li>
</ul>
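<p>The input regularization above can be sketched as follows. This is an assumption-laden illustration: whether dropped characters are removed or replaced by a placeholder token, and exactly where the Gaussian noise is injected, are implementation details of the original model not reproduced here.</p>

```python
import random

def char_dropout(smiles: str, p: float = 0.15, rng=None) -> str:
    """Randomly drop ~15% of input characters (sketch of the paper's
    character dropout; simply removing characters is an assumption)."""
    rng = rng or random.Random()
    return "".join(c for c in smiles if rng.random() >= p)

def add_noise(z, std: float = 0.05, rng=None):
    """Add zero-mean Gaussian noise (std 0.05) to an embedded input."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, std) for x in z]

rng = random.Random(0)
corrupted = char_dropout("CC(=O)Oc1ccccc1C(=O)O", rng=rng)  # aspirin
noisy = add_noise([0.0] * 8, rng=rng)
```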
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jrwnter/cddd">CDDD (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pretrained model and extraction code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>QSAR: 5-fold random CV and 5-fold cluster CV (K-means on MACCS, K=5)</li>
<li>Classification metric: ROC-AUC</li>
<li>Regression metric: $r^2$</li>
<li>VS: ROC-AUC averaged over 50 random active set selections per target</li>
<li>Statistical test: <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for VS comparisons</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Framework: TensorFlow 1.4.1</li>
<li>Fingerprint extraction on GPU is comparable in speed to RDKit on CPU</li>
<li>SVM training on 512-dim CDDD descriptors takes seconds (vs. minutes for 2048-dim fingerprints)</li>
<li>Graph convolution training: ~30 minutes per task on GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Montanari, F., Noe, F., &amp; Clevert, D.-A. (2019). Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. <em>Chemical Science</em>, 10(6), 1692-1701. <a href="https://doi.org/10.1039/C8SC04175J">https://doi.org/10.1039/C8SC04175J</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{winter2019learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and Montanari, Floriane and No{\&#39;e}, Frank and Clevert, Djork-Arn{\&#39;e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1692--1701}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C8SC04175J}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Atom-in-SMILES: Better Tokens for Chemical Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</guid><description>Atom-in-SMILES replaces generic SMILES tokens with environment-aware atomic tokens, reducing token degeneration and improving chemical translation accuracy.</description><content:encoded><![CDATA[<h2 id="a-new-tokenization-method-for-chemical-language-models">A New Tokenization Method for Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom-in-SMILES (AIS), a tokenization scheme for SMILES strings that replaces generic atomic tokens with environment-aware tokens encoding each atom&rsquo;s local chemical neighborhood. The primary contribution is demonstrating that tokenization quality has a significant impact on chemical language model outcomes across multiple tasks: SMILES canonicalization, <a href="/notes/chemistry/molecular-design/reaction-prediction/">single-step retrosynthesis</a>, and <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a>.</p>
<h2 id="why-standard-smiles-tokenization-falls-short">Why Standard SMILES Tokenization Falls Short</h2>
<p>Standard atom-wise SMILES tokenization treats all atoms of the same element identically. Every carbon is tokenized as &ldquo;C&rdquo; regardless of whether it is part of an aromatic ring, a carbonyl group, or a methyl chain. This creates a highly degenerate token space where chemically distinct atoms share the same representation.</p>
<p>The authors draw an analogy between natural language and chemical language. A typical SMILES sequence is about three times longer than a natural language sentence, yet the token vocabulary is roughly 1000 times smaller. This mismatch leads to extreme token repetition: the same tokens (C, c, N, O) appear many times within a single sequence. In natural language processing, token degeneration (where models repeatedly predict the same token) is a known failure mode of autoregressive decoders. The repetitive nature of SMILES tokens exacerbates this problem in chemical language models.</p>
<p>SMILES also lacks a one-to-one correspondence between tokens and chemical meaning. Two molecules that differ in only one atom substitution (e.g., swapping a carbon for a nitrogen in a ring) produce identical token sets under atom-wise tokenization, making it harder for models to distinguish structurally similar molecules.</p>
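<p>The degeneracy is easy to see with a regex-based atom-wise tokenizer. The pattern below is a common community sketch, not code from the paper:</p>

```python
import re
from collections import Counter

# Atom-wise SMILES tokenizer (common regex sketch). Multi-character tokens
# such as Br, Cl, and bracket atoms are matched before single letters so
# they are not split apart.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|\(|\)|\.|=|#|-|\+|/|\\|:|~|@|\?|>|\*|\$|\d)"
)

def tokenize(smiles: str) -> list:
    """Split a SMILES string into atom-wise tokens."""
    return SMILES_TOKEN_RE.findall(smiles)

# Ibuprofen: all 13 carbons collapse onto just two tokens, "C" and "c".
tokens = tokenize("CC(C)Cc1ccc(cc1)C(C)C(=O)O")
counts = Counter(tokens)
print(counts["C"], counts["c"])  # -> 7 6
```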
<h2 id="core-innovation-encoding-atom-environments-into-tokens">Core Innovation: Encoding Atom Environments into Tokens</h2>
<p>The key insight is to replace each atomic token with a richer token that encodes the atom&rsquo;s local chemical environment, inspired by the <a href="https://en.wikipedia.org/wiki/Atoms_in_molecules">atoms-in-molecules (AIM)</a> concept from quantum chemistry. For a given SMILES string, the AIS mapping function $f$ operates on the token space:</p>
<p>$$
f(X) = \begin{cases} AE|_{X_{\text{central}}} &amp; \text{if } X \text{ is an atom} \\ X &amp; \text{otherwise} \end{cases}
$$</p>
<p>where $AE|_{X_{\text{central}}}$ denotes the atomic environment centered on atom $X$. Non-atomic tokens (brackets, bond symbols, ring closures) pass through unchanged.</p>
<p>Each AIS token is formatted as <code>[Sym;Ring;Neighbors]</code> where:</p>
<ul>
<li><strong>Sym</strong> is the atomic symbol with chirality, aromaticity (lowercase for aromatic), hydrogen count, and formal charge</li>
<li><strong>Ring</strong> indicates whether the atom is in a ring (<code>R</code>) or not (<code>!R</code>)</li>
<li><strong>Neighbors</strong> lists the neighboring atoms interacting with the central atom</li>
</ul>
<p>This mapping is bijective: SMILES strings can be fully recovered from AIS strings via an inverse projection. The algorithm iterates over atoms in a molecule, computes their local environments using RDKit, and produces environment-aware token variants.</p>
<p>As a concrete example, in glycine the two carbons and two oxygens are indistinguishable under atom-wise tokenization. Under AIS, each receives a unique token reflecting its bonding environment (e.g., the carboxyl carbon is distinguished from the alpha carbon).</p>
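<p>The glycine case can be sketched without RDKit by hand-coding the molecular graph. The token spellings below follow the <code>[Sym;Ring;Neighbors]</code> format but are illustrative approximations, not the reference AIS output:</p>

```python
# Glycine (NCC(=O)O) with hydrogens folded into the atom symbols.
atoms = ["NH2", "CH2", "C", "O", "OH"]
bonds = [(0, 1), (1, 2), (2, 3), (2, 4)]

def element(sym: str) -> str:
    """Strip the hydrogen count to get the bare element symbol."""
    return sym.rstrip("H0123456789")

def ais_token(i: int) -> str:
    neighbors = sorted(
        element(atoms[j])
        for a, b in bonds
        for j in (a, b)
        if i in (a, b) and j != i
    )
    return f"[{atoms[i]};!R;{''.join(neighbors)}]"  # glycine is acyclic

tokens = [ais_token(i) for i in range(len(atoms))]
# Carboxyl carbon and alpha carbon now receive distinct tokens:
# [C;!R;COO] vs. [CH2;!R;CN]
```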
<p>The AIS tokenization also exhibits a fingerprint-like property. Because each token encodes local structural information, the set of AIS tokens for a molecule functions similarly to circular fingerprints like ECFP2. The authors show that pairwise <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarities</a> computed from AIS token sets have resolution comparable to ECFP2 and HashAP fingerprints, and better resolution than MACCS, Avalon, and RDKit fingerprints.</p>
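<p>Because the token sets behave like fingerprints, Tanimoto similarity reduces to set overlap (intersection over union). A minimal sketch with hand-written, hypothetical AIS-style tokens:</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity: intersection size / union size."""
    return len(a & b) / len(a | b)

# Hand-written AIS-style token sets (illustrative, not real AIS output).
ethanol  = {"[CH3;!R;C]", "[CH2;!R;CO]", "[OH;!R;C]"}
propanol = {"[CH3;!R;C]", "[CH2;!R;CC]", "[CH2;!R;CO]", "[OH;!R;C]"}
methane  = {"[CH4;!R;]"}

sim_close = tanimoto(ethanol, propanol)  # shared environments -> 0.75
sim_far = tanimoto(ethanol, methane)     # disjoint token sets -> 0.0
```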
<p>Token repetition can be quantified as:</p>
<p>$$
\text{rep-}l = \sum_{t=1}^{|s|} \mathbb{1}[s_t \in s_{t-w-1:t-1}]
$$</p>
<p>where $s$ is the predicted sequence, $|s|$ is the token count, and $w$ is the window size. AIS tokens exhibit consistently lower normalized repetition rates compared to SMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> across diverse molecular datasets (drugs, natural products, steroids, lipids, metal complexes, octane isomers).</p>
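<p>A direct reading of the repetition formula as code (the handling of the window at the start of the sequence is an interpretation; the paper may use a slightly different boundary convention):</p>

```python
# rep-l: count tokens that already occurred within the preceding window
# of size w, per the formula above.
def rep_l(s, w: int) -> int:
    return sum(1 for t in range(len(s)) if s[t] in s[max(0, t - w - 1):t])

smiles_tokens = ["C", "C", "(", "C", ")", "C", "O"]
print(rep_l(smiles_tokens, w=3))  # -> 3
```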
<h2 id="experimental-evaluation-across-three-chemical-tasks">Experimental Evaluation Across Three Chemical Tasks</h2>
<h3 id="input-output-equivalent-mapping-smiles-canonicalization">Input-Output Equivalent Mapping (SMILES Canonicalization)</h3>
<p>The first task tests whether a model can translate non-canonical SMILES enumerations into canonical form. The authors constructed deliberately challenging datasets from <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> subsets with cumulative structural constraints (no cyclic heteroatom-heteroatom bonds, stable functional groups only, fragment-like, scaffold-like, etc.), generating training sets of 1M molecules augmented with 150K molecules from the most restrictive subset at 10x, 30x, and 50x augmentation levels.</p>
<table>
  <thead>
      <tr>
          <th>GDB-13 Subset</th>
          <th>Atom-wise (x10)</th>
          <th>Atom-wise (x50)</th>
          <th>AIS (x10)</th>
          <th>AIS (x50)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ab</td>
          <td>34.2%</td>
          <td>33.2%</td>
          <td>37.3%</td>
          <td>34.1%</td>
      </tr>
      <tr>
          <td>abc</td>
          <td>31.0%</td>
          <td>29.6%</td>
          <td>33.7%</td>
          <td>30.4%</td>
      </tr>
      <tr>
          <td>abcde</td>
          <td>48.7%</td>
          <td>45.5%</td>
          <td>53.6%</td>
          <td>47.0%</td>
      </tr>
      <tr>
          <td>abcdef</td>
          <td>41.8%</td>
          <td>39.1%</td>
          <td>52.5%</td>
          <td>46.9%</td>
      </tr>
      <tr>
          <td>abcdefg</td>
          <td>50.9%</td>
          <td>50.0%</td>
          <td>59.9%</td>
          <td>56.8%</td>
      </tr>
  </tbody>
</table>
<p>AIS outperformed atom-wise tokenization on all subsets and augmentation levels. The performance gap grew for the more restrictive (hence more structurally similar) subsets, reaching 10.7 percentage points on the abcdef subset at 10x augmentation. This demonstrates that AIS is particularly effective when molecules are structurally similar and harder to distinguish.</p>
<h3 id="single-step-retrosynthesis">Single-Step Retrosynthesis</h3>
<p>The second task uses the USPTO-50K benchmark for single-step <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthetic prediction</a> via a template-free transformer encoder-decoder model. The model was trained for 200,000 steps with Adam optimizer, negative log-likelihood loss, and cyclic learning rate scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Tokenization</th>
          <th>Count: rep-l(Pred) - rep-l(GT) &gt;= 2</th>
          <th>String Exact (%)</th>
          <th>Tc Exact (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Atom-wise baseline</td>
          <td>&ndash;</td>
          <td>42.00</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Atom-wise (reproduced)</td>
          <td>801</td>
          <td>42.05</td>
          <td>44.72</td>
      </tr>
      <tr>
          <td>SmilesPE</td>
          <td>821</td>
          <td>19.82</td>
          <td>22.74</td>
      </tr>
      <tr>
          <td>SELFIES</td>
          <td>886</td>
          <td>28.82</td>
          <td>30.76</td>
      </tr>
      <tr>
          <td>DeepSMILES</td>
          <td>902</td>
          <td>38.63</td>
          <td>41.20</td>
      </tr>
      <tr>
          <td><strong>Atom-in-SMILES</strong></td>
          <td><strong>727</strong></td>
          <td><strong>46.32</strong></td>
          <td><strong>47.62</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved 46.32% string exact accuracy (4.3 percentage points above the atom-wise baseline) and 47.62% Tanimoto exact accuracy (2.9 points above). AIS also had the fewest degenerate token repetitions (727 vs. 801 for atom-wise), a reduction of roughly 10%. DeepSMILES had the highest repetition count (902) despite reasonable overall accuracy. SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SmilesPE</a> both performed substantially worse than the atom-wise baseline on this task.</p>
<p>The authors identified six common token repetition patterns in retrosynthetic predictions: long head repetitions, long tail repetitions, repetitive rings, repetitive chains, halogen repetitions on aliphatic carbons, and halogen repetitions on aromatic carbons.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The third task evaluates tokenization schemes on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks using Random Forest models with 5-fold cross-validation. AIS tokens were converted to fingerprint-like feature vectors.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>SMILES</th>
          <th>DeepSMILES</th>
          <th>SELFIES</th>
          <th>SmilesPE</th>
          <th>AIS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Regression (RMSE, lower is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>0.628</td>
          <td>0.631</td>
          <td>0.675</td>
          <td>0.689</td>
          <td><strong>0.553</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>0.545</td>
          <td>0.544</td>
          <td>0.564</td>
          <td>0.761</td>
          <td><strong>0.441</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>0.924</td>
          <td>0.895</td>
          <td>0.938</td>
          <td>0.800</td>
          <td><strong>0.683</strong></td>
      </tr>
      <tr>
          <td><strong>Classification (ROC-AUC, higher is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>0.758</td>
          <td>0.777</td>
          <td>0.799</td>
          <td>0.847</td>
          <td><strong>0.885</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>0.740</td>
          <td>0.774</td>
          <td>0.746</td>
          <td>0.837</td>
          <td><strong>0.835</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>0.649</td>
          <td>0.648</td>
          <td>0.653</td>
          <td>0.739</td>
          <td><strong>0.729</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved the best performance on all three regression datasets and two of three classification datasets. On ESOL, the RMSE improvement over standard SMILES was 12%. On lipophilicity, the improvement was 26%.</p>
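<p>The conversion of AIS tokens into fingerprint-like feature vectors for a classical model can be sketched as a simple vocabulary-indexed binarization (the paper's exact featurization scheme may differ, e.g., hashing instead of an explicit vocabulary):</p>

```python
# Turn per-molecule AIS token sets into fixed-length binary vectors,
# suitable as input to a Random Forest. Tokens are hand-written examples.
def vectorize(token_sets):
    vocab = sorted(set().union(*token_sets))
    return [[1 if tok in ts else 0 for tok in vocab] for ts in token_sets], vocab

mols = [
    {"[CH3;!R;C]", "[OH;!R;C]"},
    {"[CH3;!R;C]", "[CH2;!R;CO]", "[OH;!R;C]"},
]
features, vocab = vectorize(mols)
```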
<h2 id="key-findings-better-tokens-yield-better-chemical-models">Key Findings: Better Tokens Yield Better Chemical Models</h2>
<p>The main findings of this work are:</p>
<ol>
<li>
<p><strong>Tokenization significantly impacts chemical language model quality.</strong> The choice of tokenization scheme can change prediction accuracy by over 10 percentage points on equivalent mapping tasks.</p>
</li>
<li>
<p><strong>AIS reduces token degeneration by approximately 10%</strong> compared to atom-wise SMILES tokenization, with consistently lower normalized repetition rates across diverse molecular datasets.</p>
</li>
<li>
<p><strong>AIS outperforms all compared tokenization schemes</strong> (atom-wise SMILES, SmilesPE, SELFIES, DeepSMILES) on canonicalization, retrosynthesis, and property prediction.</p>
</li>
<li>
<p><strong>The fingerprint-like nature of AIS tokens</strong> enables direct use as molecular features for property prediction and provides resolution comparable to established circular fingerprints.</p>
</li>
<li>
<p><strong>The mapping is invertible</strong>, so AIS strings can always be converted back to valid SMILES. This is a practical advantage over approaches that may lose structural information.</p>
</li>
</ol>
<p><strong>Limitations</strong>: AIS cannot distinguish environmentally identical substructures or atoms related by a molecular symmetry plane, since it only considers nearest-neighbor environments. Performance on long-chain molecules (e.g., lipids) is similar across all tokenization schemes, suggesting that local environment encoding is less informative for repetitive linear structures.</p>
<p><strong>Future directions</strong>: The authors suggest AIS has potential for broader adoption in molecular generative models, chemical translation, and property prediction tasks across the cheminformatics community.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonicalization training</td>
          <td>GDB-13 subsets</td>
          <td>1M + 150K augmented</td>
          <td>Cumulative structural constraints a-h</td>
      </tr>
      <tr>
          <td>Canonicalization testing</td>
          <td>GDB-13 disjoint test sets</td>
          <td>20K per subset</td>
          <td>Various restriction levels</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50K</td>
          <td>~50K reactions</td>
          <td>Sequences &gt; 150 tokens removed</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipophilicity, BBBP, BACE, HIV)</td>
          <td>Varies</td>
          <td>Standard benchmark splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder architecture for canonicalization and retrosynthesis tasks</li>
<li>200,000 training steps with Adam optimizer, negative log-likelihood loss, cyclic learning rate scheduler</li>
<li>Random Forest with 5-fold cross-validation for property prediction</li>
<li>AIS tokenization implemented via RDKit for atom environment extraction</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>String exact match (%)</td>
          <td>Canonicalization, Retrosynthesis</td>
          <td>Exact SMILES match</td>
      </tr>
      <tr>
          <td>Tanimoto exactness (Tc)</td>
          <td>Retrosynthesis</td>
          <td>Morgan FP radius 3, 2048 bits</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression property prediction</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification property prediction</td>
          <td>BBBP, BACE, HIV</td>
      </tr>
      <tr>
          <td>rep-l</td>
          <td>Token degeneration</td>
          <td>Single-token repetition count</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/snu-lcbc/atom-in-SMILES">atom-in-SMILES</a></td>
          <td>Code</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>AIS tokenization implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ucak, U. V., Ashyrmamatov, I., &amp; Lee, J. (2023). Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. <em>Journal of Cheminformatics</em>, 15, 55. <a href="https://doi.org/10.1186/s13321-023-00725-9">https://doi.org/10.1186/s13321-023-00725-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ucak2023improving,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00725-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of Molecular Representation Learning Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</guid><description>A systematic review of molecular representation learning foundation models for drug discovery, covering five modalities and four pretraining strategies.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-molecular-representation-foundation-models">A Systematization of Molecular Representation Foundation Models</h2>
<p>This paper is a <strong>Systematization</strong> that provides the first comprehensive review of foundation models for molecular representation learning (MRL). The authors classify existing models by their input modality (unimodal vs. multimodal), analyze four mainstream pretraining strategies, survey five downstream application domains, and propose practical guidelines for model selection. The review covers over 35 representative models published between 2020 and 2024, with parameter counts ranging from 2 million to over 1 trillion.</p>
<h2 id="why-a-systematic-review-of-mrl-foundation-models-is-needed">Why a Systematic Review of MRL Foundation Models Is Needed</h2>
<p>Molecular representation learning transforms molecular structures and properties into numerical vectors that serve as inputs for machine learning models. The field has evolved rapidly from molecular fingerprints through SMILES-based sequence models to graph neural networks and 3D geometry-aware architectures. Foundation models, characterized by large-scale pretraining on unlabeled molecular data followed by fine-tuning on downstream tasks, have introduced new opportunities for generalizability and transfer learning in drug discovery.</p>
<p>Despite this rapid progress, the authors identify a gap: no prior work has systematically reviewed MRL foundation models across all input modalities and pretraining paradigms. Existing surveys tend to focus on specific representations (e.g., graph-based methods) or specific applications (e.g., property prediction) without providing the cross-cutting perspective needed to guide model selection. This review fills that gap by offering a unified taxonomy and practical guidelines.</p>
<h2 id="taxonomy-of-molecular-descriptors-and-model-architectures">Taxonomy of Molecular Descriptors and Model Architectures</h2>
<p>The core organizational framework classifies models along two axes: the molecular descriptor used as input and the backbone architecture.</p>
<h3 id="molecular-descriptors">Molecular Descriptors</h3>
<p>The review identifies five primary descriptor types:</p>
<ol>
<li><strong>Molecular fingerprints</strong>: Binary vectors encoding structural features (e.g., Morgan fingerprints). Rarely used in foundation models due to information loss and dimensional complexity.</li>
<li><strong>1D sequences</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representations. SMILES is compact and widely used but can produce invalid molecules. SELFIES guarantees valid molecular strings by construction.</li>
<li><strong>2D topological graphs</strong>: Atoms as nodes, bonds as edges. Can be derived from SMILES via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, making graph datasets effectively interchangeable with SMILES datasets.</li>
<li><strong>3D geometry</strong>: Spatial coordinates capturing conformational information, energy states, and stereochemistry. Experimentally expensive to obtain, limiting dataset availability.</li>
<li><strong>Multimodal</strong>: Combinations of the above with text, IUPAC names, knowledge graphs, and molecular images.</li>
</ol>
<p>The paper also discusses mathematically abstract molecular representations. For example, the <a href="https://en.wikipedia.org/wiki/Wiener_index">Wiener index</a> quantifies structural complexity:</p>
<p>$$
W = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}
$$</p>
<p>where $d_{ij}$ is the topological distance (shortest bonding path length) between atoms $i$ and $j$, $n$ is the number of atoms, and the factor of $\frac{1}{2}$ corrects for counting each unordered pair twice.</p>
<p>Degree centrality captures local connectivity:</p>
<p>$$
C_{D}(v_{i}) = \sum_{j=1}^{n} A_{ij}
$$</p>
<p>where $A \in \mathbb{R}^{n \times n}$ is the molecular graph adjacency matrix.</p>
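<p>Both quantities are straightforward to compute from an adjacency matrix; a sketch using BFS for the topological distances, with n-butane hand-coded as a 4-atom path graph:</p>

```python
from collections import deque

# Topological distances d_ij via BFS (molecular graphs are unweighted).
def bfs_distances(adj, start):
    n = len(adj)
    dist = [-1] * n
    dist[start] = 0
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] and dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def wiener_index(adj):
    # Half the sum of all ordered pairwise distances.
    return sum(d for i in range(len(adj)) for d in bfs_distances(adj, i)) // 2

def degree_centrality(adj, i):
    # Row sum of the adjacency matrix.
    return sum(adj[i])

# n-butane as a 4-atom carbon chain (hydrogens implicit)
butane = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(wiener_index(butane))          # -> 10
print(degree_centrality(butane, 1))  # -> 2
```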
<h3 id="model-architectures">Model Architectures</h3>
<p>Models are classified into two primary categories:</p>
<p><strong>Unimodal-based models:</strong></p>
<ul>
<li><strong>Sequence-based</strong>: Transformer models operating on SMILES/SELFIES (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>, MolGEN, <a href="/notes/chemistry/llm-applications/llamsmol-instruction-tuning-chemistry/">LlaSMol</a>). These capture syntactic patterns but miss spatial and topological features.</li>
<li><strong>Topological graph-based</strong>: GNN variants (GIN, GCN, GAT) and Transformer-based graph models (Graphormer). GNNs capture local topology through message passing; Transformers overcome locality limitations through global self-attention.</li>
<li><strong>3D geometry-based</strong>: Models like Uni-Mol and 3D PGT that incorporate spatial coordinates. Uni-Mol uses distance-aware self-attention with an SE(3)-equivariant coordinate head for rotation/translation invariance.</li>
<li><strong>Image-based</strong>: CNN-based models (ImageMol) that process 2D molecular images using visual representation learning.</li>
</ul>
<p><strong>Multimodal-based models:</strong></p>
<ul>
<li><strong>Sequence + Graph</strong>: <a href="/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/">DVMP</a>, PanGu Drug Model. Combines the strengths of string and topological representations.</li>
<li><strong>Graph + 3D Geometry</strong>: GraphMVP, Transformer-M. Enriches topological features with spatial information.</li>
<li><strong>Text + Molecular Structure</strong>: KV-PLM, MolT5, MoleculeSTM, MolReGPT, Y-mol. Aligns molecular structural information with biomedical text through cross-modal learning.</li>
</ul>
<h2 id="four-pretraining-paradigms-for-mrl">Four Pretraining Paradigms for MRL</h2>
<p>The review systematically categorizes pretraining strategies into four paradigms:</p>
<h3 id="masked-language-modeling-mlm">Masked Language Modeling (MLM)</h3>
<p>The cornerstone strategy for sequence-based models. Randomly masks tokens in molecular sequences and trains the model to predict them. ChemBERTa pretrained on 77 million SMILES sequences from PubChem achieves 5-10% improvement in AUC-ROC on property prediction tasks compared to task-specific models. MLM captures local dependencies and global sequence patterns but cannot model spatial or topological features, making it best suited for unimodal sequence inputs.</p>
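<p>The corruption step of MLM is simple to sketch. The snippet below masks a character-tokenized SMILES string; the character-level tokenization and 15% mask rate are illustrative assumptions, since real ChemLMs use learned subword or atom-level vocabularies:</p>

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with a mask symbol,
    returning the corrupted sequence and the prediction targets."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(mask_token)
            targets[i] = tok  # the model is trained to recover these
        else:
            corrupted.append(tok)
    return corrupted, targets

# Character-level tokenization of aspirin's SMILES (illustrative only).
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
corrupted, targets = mask_tokens(tokens, seed=0)
print(sum(t == "[MASK]" for t in corrupted), "tokens masked")
```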
<h3 id="contrastive-learning-cl">Contrastive Learning (CL)</h3>
<p>The dominant strategy for multimodal models. Constructs positive-negative sample pairs to align features across modalities or views. In unimodal settings, CL generates negative samples by perturbing molecular graphs. In multimodal settings, it aligns features from different modalities. GraphMVP, which contrasts 2D topological features with 3D spatial features, reduces RMSE by 15% on QM9 energy prediction compared to unimodal models. Performance depends heavily on the quality of positive sample construction.</p>
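<p>A common instantiation of this pairing idea is the InfoNCE objective, sketched below in numpy on toy embeddings. The temperature and data are illustrative assumptions, not values from any surveyed model:</p>

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss for a batch of paired views: row i of z1 and z2
    embed the same molecule (positive pair); all other rows of z2
    act as negatives for row i of z1."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
noisy_view = z + 0.01 * rng.normal(size=z.shape)  # well-aligned positive pairs
random_view = rng.normal(size=z.shape)            # unrelated "positives"
print(info_nce(z, noisy_view) < info_nce(z, random_view))  # True
```

The loss is small only when each anchor is closer to its own positive than to every negative, which is exactly the alignment property contrastive pretraining optimizes.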
<h3 id="reconstruction-based-pretraining-rbp">Reconstruction-Based Pretraining (RBP)</h3>
<p>Learns global molecular features by reconstructing original data from corrupted inputs. Tasks include node feature reconstruction, graph structure reconstruction, and coordinate/energy reconstruction. MGMAE masks more than 50% of nodes and edges in molecular graphs and trains the model to reconstruct them, achieving 94.2% AUC-ROC on BBBP. RBP captures global structural patterns but requires high model complexity and training cost.</p>
<h3 id="multimodal-alignment-pretraining-map">Multimodal Alignment Pretraining (MAP)</h3>
<p>Designed for multimodal inputs, aligning and fusing features from different modalities through cross-modal tasks. KV-PLM uses SMILES-to-text matching to align molecular structure and functional information. MAP fuses structural information (SMILES, graphs) with semantic information (text) but requires large-scale cross-modal labeled data, posing significant data acquisition challenges.</p>
<h2 id="downstream-applications-and-performance-benchmarks">Downstream Applications and Performance Benchmarks</h2>
<p>The review evaluates MRL foundation models across five application domains.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The most common benchmark for MRL models. The review provides comprehensive ROC-AUC comparisons across eight <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification datasets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>BBBP</th>
          <th>BACE</th>
          <th>ClinTox</th>
          <th>Tox21</th>
          <th>SIDER</th>
          <th>HIV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MGMAE</td>
          <td>Graph</td>
          <td>94.2</td>
          <td>92.7</td>
          <td>96.7</td>
          <td>86.0</td>
          <td>66.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MPG</td>
          <td>Graph</td>
          <td>92.2</td>
          <td>92.0</td>
          <td>96.3</td>
          <td>83.7</td>
          <td>66.1</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GROVER</td>
          <td>Graph+Trans.</td>
          <td>94.0</td>
          <td>89.4</td>
          <td>94.4</td>
          <td>83.1</td>
          <td>65.8</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MoLFormer</td>
          <td>Sequence</td>
          <td>93.7</td>
          <td>88.2</td>
          <td>94.8</td>
          <td>84.7</td>
          <td>69.0</td>
          <td>82.2</td>
      </tr>
      <tr>
          <td>MM-Deacon</td>
          <td>Seq.+IUPAC</td>
          <td>78.5</td>
          <td>-</td>
          <td>99.5</td>
          <td>-</td>
          <td>69.3</td>
          <td>80.1</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>3D</td>
          <td>72.9</td>
          <td>85.7</td>
          <td>91.9</td>
          <td>79.6</td>
          <td>65.9</td>
          <td>80.8</td>
      </tr>
      <tr>
          <td>DVMP</td>
          <td>Seq.+Graph</td>
          <td>77.8</td>
          <td>89.4</td>
          <td>95.6</td>
          <td>79.1</td>
          <td>69.8</td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>TxD-T-LLM</td>
          <td>Seq.+Text</td>
          <td>-</td>
          <td>-</td>
          <td>86.3</td>
          <td>88.2</td>
          <td>-</td>
          <td>73.2</td>
      </tr>
  </tbody>
</table>
<p>The table shows that no single architecture dominates across all datasets. Transformer- and GIN-based architectures with graph inputs generally perform well. The review notes that model effectiveness depends heavily on the dataset, with Mole-BERT encountering negative transfer due to a small and unbalanced atomic vocabulary.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>MolGEN (SELFIES-based, 8B parameters) achieves 100% validity on synthetic molecules. MolT5 excels at text-to-molecule generation. Uni-Mol generates 3D conformations with 97.95% coverage on QM9.</p>
<h3 id="drug-drug-interaction-prediction"><a href="https://en.wikipedia.org/wiki/Drug_interaction">Drug-Drug Interaction</a> Prediction</h3>
<p>MPG achieves 96.6% AUC-ROC on BIOSNAP by combining unsupervised pretraining with supervised fine-tuning and multi-task learning.</p>
<h3 id="retrosynthesis-prediction"><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a> Prediction</h3>
<p>DVMP achieves 66.5% top-1 accuracy on USPTO-50K when reaction types are provided as priors (54.2% without).</p>
<h3 id="drug-synergy-prediction">Drug Synergy Prediction</h3>
<p>SynerGPT (GPT-based) achieves 77.7% AUC-ROC in few-shot settings for novel drug combinations, outperforming baselines through contextual learning.</p>
<h2 id="guidelines-limitations-and-future-directions">Guidelines, Limitations, and Future Directions</h2>
<h3 id="model-selection-guidelines">Model Selection Guidelines</h3>
<p>The authors provide structured guidelines for choosing MRL foundation models based on:</p>
<ol>
<li><strong>Task objective</strong>: Property prediction favors GNNs or large pretrained frameworks (ChemBERTa-2, Uni-Mol). Generation tasks favor GPT-style autoregressive models (MolGEN). Retrosynthesis benefits from multimodal architectures.</li>
<li><strong>Data characteristics</strong>: SMILES/graph representations suit generation tasks. Knowledge graph-enhanced models benefit interaction and synergy prediction. Transfer learning helps data-limited scenarios.</li>
<li><strong>Interpretability needs</strong>: Transformer architectures are preferred when interpretability is required, as attention matrices enable visualization of learned molecular features.</li>
<li><strong>Computational budget</strong>: GIN-based models have $\mathcal{O}(|V| + |E|)$ complexity, while Transformer-based models scale as $\mathcal{O}(n^2 \cdot d)$.</li>
</ol>
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The review identifies five key challenges:</p>
<ol>
<li><strong>Multimodal data integration</strong>: Each representation paradigm has distinct limitations (1D neglects spatial configuration, 2D omits conformational details, 3D faces rotational invariance challenges). The authors propose incorporating <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> trajectories as a dynamic modality and using cross-modal data augmentation.</li>
<li><strong>Data scarcity</strong>: Semi-supervised learning can achieve more than 90% of fully supervised performance using only 10% labeled data on QM9. Cross-modal augmentation (e.g., 3D InfoMax) can generate plausible 3D conformers from 2D graphs.</li>
<li><strong>Interpretability</strong>: Current methods rely primarily on attention-based visualization, which is insufficient for multimodal models. The authors suggest assessing decision consistency across modalities and incorporating chemical knowledge graphs.</li>
<li><strong>Training efficiency</strong>: Large parameter counts demand distributed parallel training techniques, with data parallelism being the most common approach.</li>
<li><strong>Robustness and generalization</strong>: Strategies include data augmentation (multiple SMILES representations, 3D conformer generation), meta-learning for rapid adaptation, and sparse attention mechanisms to reduce sensitivity to irrelevant long-range interactions.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review paper, so standard reproducibility criteria for experimental papers do not directly apply. The review compiles results from the original publications of each surveyed model.</p>
<h3 id="data">Data</h3>
<p>The review catalogs 28 representative molecular datasets used by the surveyed foundation models:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Descriptor</th>
          <th>Primary Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>~118M</td>
          <td>SMILES, 3D, Image, IUPAC</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ZINC15</td>
          <td>~980M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>~2.4M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>133,884</td>
          <td>SMILES</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450,000</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>950,000</td>
          <td>SMILES</td>
          <td>Reaction prediction</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>4M</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Z-dot-max/MRL_Foundation_Review">Review Materials (GitHub)</a></td>
          <td>Code/Data</td>
          <td>Not specified</td>
          <td>Code and data tables for figures</td>
      </tr>
      <tr>
          <td><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12784970/">Paper (PMC)</a></td>
          <td>Paper</td>
          <td>CC-BY</td>
          <td>Open access via PubMed Central</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>All performance metrics reported in the review are directly cited from the original studies. The evaluation protocols follow each model&rsquo;s original setup. The review covers:</p>
<ul>
<li>ROC-AUC for classification tasks (property prediction, DDI, synergy)</li>
<li>RMSE/MAE for regression tasks</li>
<li>Validity and novelty for molecular generation</li>
<li>Top-k accuracy for retrosynthesis</li>
<li>COV and MAT for conformation generation</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, B., Zhang, J., Liu, Y., Liu, Y., Jiang, J., Yuan, S., Zhen, X., &amp; Liu, Y. (2025). A systematic review of molecular representation learning foundation models. <em>Briefings in Bioinformatics</em>, 27(1), bbaf703. <a href="https://doi.org/10.1093/bib/bbaf703">https://doi.org/10.1093/bib/bbaf703</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{song2025systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of molecular representation learning foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Song, Bosheng and Zhang, Jiayi and Liu, Ying and Liu, Yuansheng and Jiang, Jing and Yuan, Sisi and Zhen, Xia and Liu, Yiping}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbaf703}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbaf703}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AMORE: Testing ChemLLM Robustness to SMILES Variants</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/</guid><description>AMORE is a zero-shot framework testing whether chemical language models recognize equivalent SMILES of the same molecule via embedding retrieval.</description><content:encoded><![CDATA[<h2 id="an-empirical-framework-for-probing-chemical-understanding">An Empirical Framework for Probing Chemical Understanding</h2>
<p>This is an <strong>Empirical</strong> paper that introduces Augmented Molecular Retrieval (AMORE), a zero-shot evaluation framework for chemical language models (ChemLMs). The primary contribution is a method to assess whether ChemLMs have learned genuine molecular semantics or simply memorize textual patterns. Rather than relying on traditional NLP metrics like BLEU and ROUGE, AMORE tests whether a model&rsquo;s embedding space treats chemically equivalent <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations as similar. The authors evaluate 12 models across multiple architectures (encoder-only, encoder-decoder, decoder-only) on two datasets and five augmentation types, and extend the analysis to downstream MoleculeNet tasks.</p>
<h2 id="why-standard-nlp-metrics-fail-for-chemical-evaluation">Why Standard NLP Metrics Fail for Chemical Evaluation</h2>
<p>Chemical language models are typically evaluated using text-based metrics from NLP (BLEU, ROUGE, METEOR) on tasks like molecule captioning. These metrics compare word overlap and sentence fluency but cannot detect whether a model truly understands molecular structure. A SMILES string like <code>C(=O)O</code> and its canonicalized or kekulized form represent the same molecule, yet text-based metrics would penalize valid reformulations. Embedding-based metrics like BERTScore are also insufficient because they were trained on general text, not chemical notation.</p>
<p>The core research question is direct: do evaluation metrics used on ChemLMs reflect actual chemical knowledge, or do the models simply imitate understanding by learning textual features? This question has practical consequences in pharmaceuticals and healthcare, where missteps in chemical reasoning carry serious risks.</p>
<h2 id="embedding-based-retrieval-as-a-chemical-litmus-test">Embedding-Based Retrieval as a Chemical Litmus Test</h2>
<p>AMORE exploits a fundamental property of molecular representations: a single molecule can be written as multiple valid SMILES strings that are chemically identical. These serve as &ldquo;total synonyms,&rdquo; a concept without a true analogue in natural language.</p>
<p>The framework works in four steps:</p>
<ol>
<li>Take a set $X = (x_1, x_2, \ldots, x_n)$ of $n$ molecular representations.</li>
<li>Apply a transformation $f$ to obtain augmented representations $X' = (x'_1, x'_2, \ldots, x'_n)$, where $x'_i = f(x_i)$. The constraint is that $f$ must not change the underlying molecule.</li>
<li>Obtain vectorized embeddings $e(x_i)$ and $e(x'_j)$ from the model for each original and augmented SMILES.</li>
<li>Evaluate in a retrieval task: given $e(x_i)$, retrieve $e(x'_i)$ from the augmented set.</li>
</ol>
<p>The evaluation metrics are top-$k$ accuracy (whether the correct augmented SMILES ranks at position $\leq k$) and <a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank">Mean Reciprocal Rank</a> (MRR). Retrieval uses <a href="https://en.wikipedia.org/wiki/FAISS">FAISS</a> for efficient nearest-neighbor search. The key insight is that if a model truly understands molecular structure, it should embed different SMILES representations of the same molecule close together.</p>
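<p>The retrieval step and both metrics can be sketched with brute-force cosine search in numpy standing in for FAISS; the toy embeddings below are illustrative:</p>

```python
import numpy as np

def retrieval_metrics(E_orig, E_aug, k=5):
    """Top-k accuracy and MRR for matching each original embedding
    e(x_i) against the set of augmented embeddings e(x'_j).
    Row i of E_orig and E_aug correspond to the same molecule."""
    E_orig = E_orig / np.linalg.norm(E_orig, axis=1, keepdims=True)
    E_aug = E_aug / np.linalg.norm(E_aug, axis=1, keepdims=True)
    sims = E_orig @ E_aug.T                # cosine similarity matrix
    order = np.argsort(-sims, axis=1)      # best match first
    # Rank of the correct match (position of index i in row i's ordering):
    ranks = np.argmax(order == np.arange(len(E_orig))[:, None], axis=1) + 1
    return {
        "acc@1": float(np.mean(ranks == 1)),
        f"acc@{k}": float(np.mean(ranks <= k)),
        "mrr": float(np.mean(1.0 / ranks)),
    }

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 32))
E_aug = E + 0.1 * rng.normal(size=E.shape)  # mild, identity-preserving perturbation
print(retrieval_metrics(E, E_aug))
```

A robust model corresponds to the mild-perturbation case above: augmented embeddings stay nearest to their originals, so Acc@1 and MRR stay high.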
<h3 id="five-smiles-augmentation-types">Five SMILES Augmentation Types</h3>
<p>The framework uses five identity-preserving augmentations, all executed through <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>:</p>
<ol>
<li><strong>Canonicalization</strong>: Transform SMILES to the standardized RDKit canonical form.</li>
<li><strong>Hydrogen addition</strong>: Explicitly add hydrogen atoms that are normally implied (e.g., <code>C</code> becomes <code>[CH4]</code>). This dramatically increases string length.</li>
<li><strong>Kekulization</strong>: Convert aromatic ring notation to explicit alternating double bonds.</li>
<li><strong>Cycle renumbering</strong>: Replace ring-closure digit identifiers with random valid alternatives.</li>
<li><strong>Random atom order</strong>: Randomize the atom traversal order used to generate the SMILES string.</li>
</ol>
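<p>For intuition, cycle renumbering can be approximated with plain string manipulation. The sketch below assumes no bracket atoms and no two-digit <code>%nn</code> closures; the paper performs all five augmentations through RDKit instead:</p>

```python
import random
import re

def renumber_cycles(smiles, seed=0):
    """Replace ring-closure digits with random valid alternatives.
    Simplified sketch: assumes no bracket atoms and no %nn two-digit
    closures, so every digit in the string is a ring-closure label."""
    rng = random.Random(seed)
    old = sorted(set(re.findall(r"\d", smiles)))
    # Draw an injective mapping onto fresh (or reused) digit labels.
    pool = [d for d in "123456789" if d not in old] + old
    mapping = dict(zip(old, rng.sample(pool, len(old))))
    return re.sub(r"\d", lambda m: mapping[m.group()], smiles)

naphthalene = "c1ccc2ccccc2c1"
print(renumber_cycles(naphthalene))  # same molecule, different ring labels
```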
<h2 id="twelve-models-two-datasets-five-augmentations">Twelve Models, Two Datasets, Five Augmentations</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>The authors test 12 publicly available Transformer-based models spanning three architecture families:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Domain</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Text+Chem T5-standard</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>Text+Chem T5-augm</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-large</td>
          <td>Cross-modal</td>
          <td>770M</td>
      </tr>
      <tr>
          <td>SciFive</td>
          <td>Text-only</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>PubChemDeBERTa</td>
          <td>Chemical</td>
          <td>86M</td>
      </tr>
      <tr>
          <td>ChemBERT-ChEMBL</td>
          <td>Chemical</td>
          <td>6M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>Chemical</td>
          <td>125M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a></td>
          <td>Chemical</td>
          <td>400M</td>
      </tr>
      <tr>
          <td>ZINC-RoBERTa</td>
          <td>Chemical</td>
          <td>102M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/">nach0</a></td>
          <td>Chemical</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>ZINC-GPT</td>
          <td>Chemical</td>
          <td>87M</td>
      </tr>
  </tbody>
</table>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>ChEBI-20 test set</strong>: ~3,300 molecule-description pairs, used for both AMORE retrieval and molecule captioning comparisons.</li>
<li><strong>Isomers</strong> (QM9 subset): 918 molecules that are all isomers of C9H12N2O, making retrieval harder because all molecules share the same molecular formula.</li>
</ul>
<h3 id="key-results-on-chebi-20">Key Results on ChEBI-20</h3>
<p>On the ChEBI-20 dataset (Table 2 from the paper), top-1 accuracy varies enormously by augmentation type. Cycle renumbering is easiest (up to 98.48% Acc@1 for SciFive), while hydrogen addition is hardest (no model exceeds 5.97% Acc@1).</p>
<p>For the cross-modal Text+Chem T5-standard model:</p>
<table>
  <thead>
      <tr>
          <th>Augmentation</th>
          <th>Acc@1</th>
          <th>Acc@5</th>
          <th>MRR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>63.03</td>
          <td>82.76</td>
          <td>72.4</td>
      </tr>
      <tr>
          <td>Hydrogen</td>
          <td>5.46</td>
          <td>10.85</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>Kekulization</td>
          <td>76.76</td>
          <td>92.03</td>
          <td>83.8</td>
      </tr>
      <tr>
          <td>Cycle</td>
          <td>96.70</td>
          <td>99.82</td>
          <td>98.2</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>46.94</td>
          <td>74.18</td>
          <td>59.33</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-on-isomers">Key Results on Isomers</h3>
<p>Performance drops substantially on the Isomers dataset, where all molecules share the same formula. The best Acc@1 for hydrogen augmentation is just 1.53% (MolT5-large). Even for the relatively easy cycle augmentation, top scores drop from the high 90s to the low 90s for most models, and some models (BARTSmiles: 41.83%) struggle considerably.</p>
<h3 id="downstream-moleculenet-impact">Downstream MoleculeNet Impact</h3>
<p>The authors also fine-tuned models on original <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> training data and tested on augmented test sets across 9 tasks (regression, binary classification, multilabel classification). Results confirm that augmentations degrade downstream performance. For example, on ESOL regression, RMSE increased from 0.87 to 7.93 with hydrogen addition. Rankings computed using the Vote&rsquo;n&rsquo;Rank framework (using the <a href="https://en.wikipedia.org/wiki/Copeland%27s_method">Copeland rule</a>) show that hydrogen augmentation is the only one that substantially reshuffles model rankings; other augmentations preserve the original ordering.</p>
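<p>A minimal sketch of the Copeland rule behind these rankings: each model earns a point for every pairwise comparison it wins across tasks and loses one for every defeat. The scores below are illustrative, not the paper's:</p>

```python
def copeland_ranking(scores):
    """Copeland rule: +1 for every pairwise 'win' (better on more
    tasks than the opponent), -1 for every loss; rank by net points.
    `scores` maps model name -> per-task scores (higher is better)."""
    models = list(scores)
    points = {m: 0 for m in models}
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            wins_a = sum(sa > sb for sa, sb in zip(scores[a], scores[b]))
            wins_b = sum(sb > sa for sa, sb in zip(scores[a], scores[b]))
            if wins_a > wins_b:
                points[a] += 1; points[b] -= 1
            elif wins_b > wins_a:
                points[b] += 1; points[a] -= 1
    return sorted(models, key=lambda m: -points[m])

# Illustrative per-task scores (not values from the paper):
scores = {"A": [0.9, 0.8, 0.7], "B": [0.85, 0.82, 0.6], "C": [0.5, 0.6, 0.65]}
print(copeland_ranking(scores))  # → ['A', 'B', 'C']
```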
<h3 id="correlation-between-amore-and-captioning-metrics">Correlation Between AMORE and Captioning Metrics</h3>
<p>The differences in ROUGE/METEOR between original and augmented SMILES correlate with AMORE retrieval accuracy (Spearman correlation &gt; 0.7 with p-value = 0.003 for Acc@1). This validates AMORE as a proxy for predicting how augmentations will affect generation quality, without requiring labeled captioning data.</p>
<h2 id="current-chemlms-learn-syntax-not-chemistry">Current ChemLMs Learn Syntax, Not Chemistry</h2>
<p>The central finding is that existing ChemLMs are not robust to identity-preserving SMILES augmentations. Several specific conclusions emerge:</p>
<ol>
<li>
<p><strong>Hydrogen augmentation is catastrophic</strong>: All models fail (&lt; 6% Acc@1 on ChEBI-20, &lt; 2% on Isomers). The authors attribute this to the near-complete absence of explicit hydrogen in pretraining data, creating a distribution shift.</p>
</li>
<li>
<p><strong>Cross-modal models outperform unimodal ones</strong>: Models trained on both text and SMILES (Text+Chem T5, MolT5) consistently achieve higher retrieval accuracy on four of five augmentations.</p>
</li>
<li>
<p><strong>Augmentation difficulty follows a consistent order</strong>: For all models, hydrogen is hardest, followed by canonicalization, random ordering, kekulization, and cycle renumbering (easiest).</p>
</li>
<li>
<p><strong>Layer-wise analysis reveals instability</strong>: Retrieval accuracy across Transformer layers is correlated across augmentation types, suggesting that representations degrade at the same layers regardless of augmentation.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> partially explains difficulty</strong>: Hydrogen augmentation produces strings ~2x longer than originals (Levenshtein ratio of 1.49), but the low correlation between Levenshtein ratio and downstream metrics (ROUGE1 correlation of -0.05 for hydrogen) suggests string length alone does not explain the failure.</p>
</li>
</ol>
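<p>The Levenshtein ratio cited in point 5 follows from the standard dynamic-programming edit-distance recurrence. The benzene example below is illustrative, not a string from the paper:</p>

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Benzene vs. an explicit-hydrogen rendering (illustrative strings):
orig = "c1ccccc1"
with_h = "[H]c1c([H])c([H])c([H])c([H])c1[H]"
print(levenshtein(orig, with_h), round(len(with_h) / len(orig), 2))
```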
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations. Only publicly available HuggingFace models were evaluated, excluding models like <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">Molformer</a> that lack HF checkpoints. The study focuses exclusively on SMILES sequences, not 3D molecular structures or other formats like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. The augmentation types, while representative, do not cover all possible identity transformations.</p>
<p>The authors suggest that AMORE could serve as a regularization tool during training, for example by using metric learning to encourage models to embed SMILES variants of the same molecule close together.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Retrieval evaluation</td>
          <td>ChEBI-20 test set</td>
          <td>3,300 molecules</td>
          <td>Standard benchmark for molecule captioning</td>
      </tr>
      <tr>
          <td>Retrieval evaluation</td>
          <td>Isomers (QM9 subset)</td>
          <td>918 molecules</td>
          <td>All isomers of C9H12N2O</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>Varies</td>
          <td>ESOL, Lipophilicity, FreeSolv, HIV, BBBP, BACE, Tox21, ToxCast, SIDER</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES augmentations via RDKit (canonicalization, hydrogen addition, kekulization, cycle renumbering, random atom ordering)</li>
<li>Nearest-neighbor retrieval using FAISS with L2, cosine, inner product, and HNSW metrics</li>
<li>Model ranking via Vote&rsquo;n&rsquo;Rank (Copeland rule) on MoleculeNet tasks</li>
</ul>
<h3 id="models">Models</h3>
<p>All 12 evaluated models are publicly available on HuggingFace. No custom model training was performed for the AMORE retrieval experiments. MoleculeNet experiments used standard fine-tuning on original training splits.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Acc@1</td>
          <td>Top-1 retrieval accuracy</td>
          <td>Primary AMORE metric</td>
      </tr>
      <tr>
          <td>Acc@5</td>
          <td>Top-5 retrieval accuracy</td>
          <td>Secondary AMORE metric</td>
      </tr>
      <tr>
          <td>MRR</td>
          <td>Mean Reciprocal Rank</td>
          <td>Average rank of correct match</td>
      </tr>
      <tr>
          <td>ROUGE-2</td>
          <td>Bigram overlap for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
      <tr>
          <td>METEOR</td>
          <td>MT evaluation metric for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Experiments used computational resources from the HPC facilities at HSE University. Specific GPU types and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ChemistryLLMs/AMORE">AMORE GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Framework code and evaluation data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ganeeva, V., Khrabrov, K., Kadurin, A., &amp; Tutubalina, E. (2025). Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework. <em>Journal of Cheminformatics</em>, 17(1). <a href="https://doi.org/10.1186/s13321-025-01079-0">https://doi.org/10.1186/s13321-025-01079-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ganeeva2025measuring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ganeeva, Veronika and Khrabrov, Kuzma and Kadurin, Artur and Tutubalina, Elena}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-025-01079-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Scaling of Deep Chemical Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/neural-scaling-of-deep-chemical-models/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/neural-scaling-of-deep-chemical-models/</guid><description>Frey et al. discover neural scaling laws for chemical LLMs and GNN interatomic potentials, showing power-law loss improvements with scale.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>discovery paper</strong> that identifies empirical neural scaling laws in two distinct domains of chemical deep learning: large language models (LLMs) for generative chemistry and graph neural networks (GNNs) for machine-learned interatomic potentials. The paper also introduces training performance estimation (TPE) as a practical tool for accelerating hyperparameter optimization in these domains.</p>
<h2 id="why-scaling-laws-matter-for-chemistry">Why scaling laws matter for chemistry</h2>
<p>Neural scaling laws, first characterized for NLP models by Kaplan et al. (2020), describe how model loss decreases as a power law with increasing model size, dataset size, or compute:</p>
<p>$$
L(R) = \alpha R^{-\beta}
$$</p>
<p>where $\alpha$ is a coefficient, $\beta$ is the scaling exponent, and $R$ is the resource being scaled (parameters, data, or compute). These relationships have guided resource allocation decisions in NLP and computer vision, but their applicability to scientific deep learning was unknown.</p>
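<p>Fitting a power law of this form is typically done in log-log space, where it reduces to ordinary linear regression. A minimal sketch on synthetic data (not the paper's actual loss curves):</p>

```python
import numpy as np

# Synthetic loss curve following L(R) = alpha * R^-beta with beta = 0.17.
rng = np.random.default_rng(0)
R = np.logspace(5, 9, 20)                # resource axis: parameters, tokens, or FLOPs
alpha, beta = 3.0, 0.17
L = alpha * R**-beta * np.exp(rng.normal(0, 0.01, R.size))  # small multiplicative noise

# log L = log(alpha) - beta * log(R)  ->  least squares in log-log space
slope, intercept = np.polyfit(np.log(R), np.log(L), 1)
beta_hat, alpha_hat = -slope, np.exp(intercept)
```

The fitted exponent $\beta$ is what the paper compares across architectures and data budgets.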
<p>Chemical deep learning differs from standard NLP and vision tasks in several key ways:</p>
<ul>
<li>Physics-based priors (like symmetry constraints) may reduce the need for massive scale.</li>
<li>The heterogeneity of chemical space and molecular tasks makes general pre-training more challenging.</li>
<li>There are no established default architectures, datasets, or training recipes at large scale for chemistry.</li>
</ul>
<p>This paper asks: do the same scaling behaviors hold for chemical models, and how do physical priors affect them?</p>
<h2 id="training-performance-estimation-for-efficient-scaling">Training performance estimation for efficient scaling</h2>
<p>Before running expensive scaling experiments, the authors needed a way to efficiently select hyperparameters. They introduced TPE, a generalization of training speed estimation (TSE) to new domains. TSE computes the cumulative training loss over the first $T$ epochs:</p>
<p>$$
\text{TSE} = \sum_{t=1}^{T} \left( \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}\left(f_{\theta(t,i)}(\mathbf{X}_i), \mathbf{y}_i\right) \right)
$$</p>
<p>where $B$ is the number of training steps per epoch, $\mathcal{L}$ is the loss function, and $f_{\theta(t,i)}$ is the network at epoch $t$ and mini-batch $i$. A linear regression then predicts converged loss from early-training TSE:</p>
<p>$$
L = m \times \text{TSE} + b
$$</p>
<p>Using only 20% of the total training budget, TPE achieves $R^2 = 0.98$ and Spearman&rsquo;s $\rho = 1.0$ for ChemGPT on the MOSES dataset. For GNNs, it achieves $R^2 \geq 0.86$ and $\rho \geq 0.92$ across SchNet, PaiNN, and SpookyNet. This enables discarding suboptimal configurations early, saving up to 90% of compute.</p>
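<p>Mechanically, TPE is just a cumulative sum over early epochs plus a linear fit. A hedged sketch with synthetic per-epoch loss traces standing in for real training runs:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def tse(epoch_losses, T):
    """Cumulative mean training loss over the first T epochs (the TSE statistic)."""
    return sum(np.mean(e) for e in epoch_losses[:T])

# Synthetic runs: configurations with lower converged loss also descend faster early.
configs = []
for conv in [0.5, 0.8, 1.1, 1.4]:
    losses = [conv + 2.0 * np.exp(-0.5 * t) + rng.normal(0, 0.01, 100)
              for t in range(50)]        # 50 epochs, 100 mini-batches each
    configs.append((losses, conv))

# Fit converged loss L = m * TSE + b using only the first 10 of 50 epochs (20% budget).
x = np.array([tse(l, T=10) for l, _ in configs])
y = np.array([c for _, c in configs])
m, b = np.polyfit(x, y, 1)
pred = m * x + b
```

Configurations whose predicted converged loss ranks poorly can then be discarded without spending the remaining 80% of their budget.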
<h2 id="chemgpt-scaling-chemical-language-models">ChemGPT: scaling chemical language models</h2>
<p>ChemGPT is a GPT-3-style autoregressive transformer for molecular generation. It uses GPT-Neo as its backbone with a SELFIES tokenizer, factorizing the probability of a molecular sequence as:</p>
<p>$$
p(x) = \prod_{i=1}^{n} p\left(s_i \mid s_1, \dots, s_{i-1}\right)
$$</p>
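<p>In log space this factorization is a sum of per-token conditional log-probabilities. A toy illustration with made-up conditionals (not ChemGPT's actual outputs):</p>

```python
import math

# Hypothetical per-step conditional probabilities p(s_i | s_1, ..., s_{i-1})
# for a short SELFIES-like token sequence.
tokens = ["[C]", "[O]", "[=C]", "[EOS]"]
cond_probs = [0.30, 0.10, 0.25, 0.60]    # one conditional per generated token

# log p(x) = sum_i log p(s_i | prefix); exponentiate to recover the product.
log_p = sum(math.log(p) for p in cond_probs)
p = math.exp(log_p)
```

Training minimizes the negative of this log-likelihood over the corpus; generation samples each conditional left to right.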
<p>The authors trained ChemGPT models ranging from ~78K to over 1 billion non-embedding parameters on subsets of PubChem10M (up to ~10 million molecules, or ~300 million tokens). Key findings from the scaling experiments:</p>
<ul>
<li><strong>Pre-training loss monotonically improves</strong> with increasing dataset size up to nearly 10 million molecules, with no saturation observed.</li>
<li><strong>For a fixed data budget</strong>, increasing model size provides monotonic improvements until models reach ~1 billion parameters.</li>
<li><strong>The scaling exponent</strong> $\beta = 0.17 \pm 0.01$ for the largest dataset (after excluding the three largest models from the power-law fit), and $\beta = 0.30 \pm 0.01$ for the next largest dataset.</li>
<li><strong>Resolution-limited regimes</strong> appear where the power-law behavior breaks down, indicating either insufficient data for a given model size or vice versa. These regimes shift depending on the data budget.</li>
</ul>
<p>An interesting observation: for small datasets, large models ($10^7$ parameters and above) still provide notable loss improvements, suggesting that scaling up model size helps even when data is limited.</p>
<h2 id="neural-force-field-scaling-with-gnns">Neural force field scaling with GNNs</h2>
<p>For tasks requiring three-dimensional molecular geometry, the authors studied GNN-based neural force fields (NFFs). These models predict energies $\hat{E} = f_\theta(X)$ and derive forces by differentiation:</p>
<p>$$
\hat{F}_{ij} = -\frac{\partial \hat{E}}{\partial r_{ij}}
$$</p>
<p>Training uses an L1 loss over energies and forces:</p>
<p>$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left[ \alpha_E | E_i - \hat{E}_i | + \alpha_F | \mathbf{F}_i - \hat{\mathbf{F}}_i | \right]
$$</p>
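<p>The force relation can be checked numerically for any differentiable energy. A sketch with a toy harmonic pair energy standing in for a trained GNN, comparing the analytic force $-\partial E / \partial r$ against a central finite difference (the constants here are assumptions for illustration):</p>

```python
import numpy as np

k, r0 = 4.0, 1.2   # toy spring constant and equilibrium distance

def energy(r):
    """Toy pair energy E(r); a trained NFF plays this role in practice."""
    return 0.5 * k * (r - r0) ** 2

def force_analytic(r):
    return -k * (r - r0)            # F = -dE/dr

def force_numeric(r, h=1e-5):
    return -(energy(r + h) - energy(r - h)) / (2 * h)   # central difference

r = 1.5
```

In a deep-learning framework the same derivative is obtained by automatic differentiation of the predicted energy with respect to atomic coordinates.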
<p>Four NFF architectures were studied, spanning a range of physical priors:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Key Characteristic</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>E(3) invariant</td>
          <td>Continuous filter convolutions</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>E(3) equivariant</td>
          <td>Equivariant message passing</td>
      </tr>
      <tr>
          <td>Allegro</td>
          <td>E(3) equivariant</td>
          <td>Local, learned many-body functions</td>
      </tr>
      <tr>
          <td>SpookyNet</td>
          <td>E(3) equivariant</td>
          <td>Non-local interactions, empirical corrections</td>
      </tr>
  </tbody>
</table>
<p>Model capacity is parameterized as $c = d \times w$ (depth times width). Models were trained on subsets of the ANI-1x dataset (up to 100,000 geometries, corresponding to ~4.5 million force labels).</p>
<p>Key GNN scaling findings:</p>
<ul>
<li><strong>PaiNN shows monotonic loss improvement</strong> with increasing dataset size and strong correlation between converged loss and model capacity (Spearman&rsquo;s $\rho \geq 0.88$).</li>
<li><strong>Equivariant GNNs (PaiNN, Allegro) show better scaling efficiency</strong> than invariant GNNs (SchNet), with larger $\beta$ values.</li>
<li><strong>The scaling exponent for equivariant GNNs</strong> is $\beta = 0.26$, indicating that physics-based equivariance priors yield greater sample efficiency, and that this advantage persists to much larger and more chemically diverse datasets than previously studied.</li>
<li><strong>A transition at $10^4$ datapoints</strong> shows nearly perfect rank correlation between model capacity and converged loss ($\rho \geq 0.93$), suggesting this may be a threshold where models move from memorization to generalization.</li>
</ul>
<h2 id="results-and-practical-implications">Results and practical implications</h2>
<p>The scaling results provide actionable guidance for resource allocation:</p>
<ul>
<li>For <strong>chemical LLMs with large data budgets</strong>, the greatest loss improvements come from scaling up small models (around $10^5$ parameters).</li>
<li>For <strong>small data budgets</strong>, rapid improvements come from scaling medium-sized models ($10^7$ parameters).</li>
<li>For <strong>NFFs</strong>, low-capacity models show diminishing returns with more data, while high-capacity models show rapid improvements with increasing dataset size.</li>
<li><strong>Neither model type has saturated</strong> with respect to model size, dataset size, or compute, suggesting substantial room for improvement with further scaling.</li>
</ul>
<p>The 300-million-parameter ChemGPT trained on 300 million tokens and the PaiNN model with capacity ~1,000 trained on $10^5$ frames achieved the minimum losses in their respective scaling plots, providing concrete targets for practitioners.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Data:</strong></p>
<ul>
<li>PubChem10M (10M SMILES strings, via DeepChem)</li>
<li>MOSES (2M molecules, for TPE validation)</li>
<li>ANI-1x (5M DFT calculations, via Figshare)</li>
<li>Revised MD-17 (10 small organic molecules, 10,000 frames for TPE)</li>
</ul>
<p><strong>Models:</strong></p>
<ul>
<li>ChemGPT: GPT-Neo backbone, 24 layers, widths from 16 to 2,048, sizes from ~78K to ~1.2B non-embedding parameters</li>
<li>SchNet, PaiNN, Allegro, SpookyNet: widths of 16, 64, 256; depths of 2, 3, 4; 5 Angstrom cutoff</li>
</ul>
<p><strong>Training:</strong></p>
<ul>
<li>ChemGPT: AdamW optimizer, learning rate $2 \times 10^{-5}$, batch size 8 per GPU, 10 epochs, cross-entropy loss</li>
<li>GNNs: Adam optimizer, learning rate scheduler (halved after 30 epochs without improvement), early stopping after 50 stagnant epochs, max 1,000 epochs, L1 loss (force-only training)</li>
</ul>
<p><strong>Hardware:</strong></p>
<ul>
<li>NVIDIA Volta V100 GPUs (32 GB), 2 GPUs per node</li>
<li>PyTorch with distributed data parallel (DDP), PyTorch Lightning, LitMatter</li>
</ul>
<p><strong>Code:</strong> <a href="https://github.com/ncfrey/litmatter">LitMatter repository</a></p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation:</strong> Frey, N.C., Soklaski, R., Axelrod, S. et al. Neural scaling of deep chemical models. <em>Nat Mach Intell</em> <strong>5</strong>, 1297-1305 (2023).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{frey2023neural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Neural scaling of deep chemical models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Frey, Nathan C. and Soklaski, Ryan and Axelrod, Simon and Samsi, Siddharth and G{\&#39;o}mez-Bombarelli, Rafael and Coley, Connor W. and Gadepally, Vijay}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1297--1305}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00740-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BARTSmiles: BART Pre-Training for Molecular SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/</guid><description>BARTSmiles applies BART-style denoising pre-training to 1.7B SMILES from ZINC20, achieving top results on 11 molecular property and reaction tasks.</description><content:encoded><![CDATA[<h2 id="a-bart-based-method-for-molecular-self-supervised-learning">A BART-Based Method for Molecular Self-Supervised Learning</h2>
<p>BARTSmiles is a <strong>Method</strong> paper. It introduces a self-supervised pre-training approach for molecular representations based on the BART (Bidirectional and Auto-Regressive Transformers) architecture from Lewis et al. (2019). The primary contribution is a pre-training strategy, discovered through systematic ablations, that trains a BART-large model on 1.7 billion deduplicated <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> from the <a href="/notes/chemistry/datasets/zinc-22/">ZINC20 dataset</a>. BARTSmiles achieves the best reported results on 11 tasks spanning molecular property classification, regression, and chemical reaction generation.</p>
<h2 id="scaling-self-supervised-molecular-representations-beyond-prior-work">Scaling Self-Supervised Molecular Representations Beyond Prior Work</h2>
<p>At the time of publication, large-scale self-supervised representation learning had produced significant improvements in NLP, computer vision, and speech, but molecular representation learning had not benefited from comparable scale. Previous SMILES-based pre-trained models such as <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> (Chithrananda et al., 2020) and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a> (Irwin et al., 2022) used encoder-only or encoder-decoder architectures with substantially less compute. ChemFormer, the most closely related prior work, also trained a BART-like model but with a fraction of the compute and data.</p>
<p>The paper argues that three gaps needed to be addressed:</p>
<ol>
<li><strong>Scale</strong>: Prior molecular pre-training used orders of magnitude less compute than NLP pre-training.</li>
<li><strong>Architecture choice</strong>: Encoder-only models like ChemBERTa cannot perform generative fine-tuning (retrosynthesis, reaction prediction), limiting their applicability.</li>
<li><strong>Pre-training recipe</strong>: Standard BART hyperparameters (e.g., 30% mask token budget) were tuned for natural language and had not been validated for molecular SMILES strings.</li>
</ol>
<h2 id="core-innovation-ablation-driven-pre-training-recipe-for-smiles">Core Innovation: Ablation-Driven Pre-Training Recipe for SMILES</h2>
<p>The key insight of BARTSmiles is that the BART denoising objective, when carefully tuned for the molecular domain, learns representations that implicitly encode downstream task information. The authors discover this through a systematic three-stage ablation:</p>
<h3 id="tokenization">Tokenization</h3>
<p>Rather than using hand-crafted tokenization rules that separate individual atoms (C, N, H) and bond symbols (#, =), BARTSmiles uses a learned SentencePiece unigram tokenizer trained on 10 million random SMILES with a vocabulary size of 1,021. On matched compute budgets, learned tokenization achieves 0.801 average AUC-ROC vs. 0.779 for hand-crafted tokenization on the ablation benchmark (HIV, BBBP, ClinTox).</p>
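<p>For contrast, a hand-crafted tokenizer of the kind the ablation compares against can be written as a single regex that splits atoms, bonds, rings, and branches. The pattern below is illustrative of this family of tokenizers, not the paper's exact rules:</p>

```python
import re

# Regex splitting SMILES into bracket-atom, two-letter atom, single-atom,
# bond/branch punctuation, and ring-number tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|[=#\-\+\\/\(\)\.%:~@]|\d)"
)

def tokenize_smiles(smiles: str):
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

A learned unigram tokenizer instead discovers multi-character subwords (recurring fragments) directly from data, which is what the ablation found to perform better.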
<h3 id="masking-strategy">Masking Strategy</h3>
<p>The BART denoising objective has three main hyperparameters: the mask token budget (fraction of tokens masked), random mask probability, and the Poisson $\lambda$ controlling mask span length. The ablation results show:</p>
<ul>
<li><strong>Mask token budget</strong>: The standard BART value of 0.30 is suboptimal for molecules. A budget of 0.20 performs best (0.821 AUC-ROC), with performance degrading at both lower (0.10: 0.753) and higher (0.40: 0.701) budgets.</li>
<li><strong>Span masking</strong>: The choice of random mask probability and $\lambda$ has a minor effect once the budget is set to 0.20. Values of random mask = 0.10 and $\lambda$ = 2.5 or 3.5 all yield 0.821.</li>
<li><strong>Token randomization</strong>: Disabling the randomize-tokens noise (where some tokens are replaced with random tokens rather than masked) improves performance from 0.821 to 0.835.</li>
</ul>
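<p>The winning noise configuration (0.20 budget, Poisson span lengths, no token randomization) can be sketched as follows; span-boundary details vary across BART implementations, so treat this as illustrative:</p>

```python
import numpy as np

def span_mask(tokens, budget=0.20, lam=2.5, seed=0):
    """Mask roughly `budget` of tokens in Poisson(lam)-length spans; no random-token noise."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    target = int(round(budget * n))
    masked = np.zeros(n, dtype=bool)
    while masked.sum() < target:
        length = max(1, rng.poisson(lam))     # span length drawn from Poisson(lam)
        start = rng.integers(0, n)
        masked[start:start + length] = True
    return ["<mask>" if m else t for t, m in zip(tokens, masked)]

toks = [f"t{i}" for i in range(1000)]
noised = span_mask(toks)
frac = sum(t == "<mask>" for t in noised) / len(toks)
```

The denoiser is then trained to reconstruct the original token sequence from this corrupted input.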
<h3 id="scale">Scale</h3>
<p>Training on the full 1.7 billion molecule ZINC20 dataset (20 hours on 1,024 A100 GPUs, totaling 20,480 A100 GPU-hours) improves performance by 5 absolute AUC-ROC points over the same model trained on 100 million samples. The previous most compute-intensive molecular pre-training used 3,330 V100-hours (Ross et al., 2021).</p>
<h3 id="implicit-task-encoding">Implicit Task Encoding</h3>
<p>The paper provides a quantitative demonstration that frozen BARTSmiles representations encode task-specific information. Using L1-regularized logistic regression on frozen 1,024-dimensional mean-pooled representations, just 7 neurons are sufficient to achieve 0.987 AUC-ROC on ClinTox (within 2 percentage points of full fine-tuning). Even a single neuron achieves 0.77 AUC-ROC on ClinTox subtask 1.</p>
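<p>The probing setup is plain L1-regularized logistic regression on frozen embeddings. A sketch on synthetic 1,024-dimensional features, with scikit-learn standing in for whatever solver the authors used:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen mean-pooled representations: 1,024 dims,
# with only a handful of dimensions actually carrying the label signal.
n, d, informative = 500, 1024, 7
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:informative] = 3.0
y = (X @ w_true + rng.normal(scale=0.5, size=n) > 0).astype(int)

# The L1 penalty drives most coefficients to exactly zero, yielding a sparse
# "few-neuron" probe like the one reported for ClinTox.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_active = int((probe.coef_ != 0).sum())
acc = probe.score(X, y)
```

The number of surviving nonzero coefficients plays the role of the "7 neurons" statistic in the paper.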
<h2 id="experimental-setup-moleculenet-toxicology-and-generative-benchmarks">Experimental Setup: MoleculeNet, Toxicology, and Generative Benchmarks</h2>
<h3 id="classification-tasks">Classification Tasks</h3>
<p>BARTSmiles is evaluated on 7 classification datasets from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (SIDER, ClinTox, Tox21, ToxCast, HIV, BACE, BBBP) plus 2 toxicology datasets (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames</a>, <a href="https://en.wikipedia.org/wiki/Micronucleus_test">Micronucleus Assay</a>). All classification tasks use AUC-ROC. Baselines include both supervised graph models (D-MPNN, Attentive FP, 3D InfoMax) and self-supervised methods (ChemBERTa, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer-XL</a>, GROVER-large, MolCLR, iMolCLR).</p>
<p>Selected classification results (AUC-ROC):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td><strong>0.997</strong></td>
          <td>0.954</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td><strong>0.825</strong></td>
          <td>0.805</td>
          <td>Attentive FP</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td><strong>0.705</strong></td>
          <td>0.699</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>0.851</td>
          <td>0.858</td>
          <td>Attentive FP</td>
      </tr>
  </tbody>
</table>
<p>The authors note that three scaffold-split datasets (HIV, BACE, BBBP) are highly sensitive to the specific split used, and they suspect some baseline results use different or random splits. These results are marked with caveats in the paper.</p>
<h3 id="regression-tasks">Regression Tasks</h3>
<p>All three MoleculeNet regression tasks (ESOL, FreeSolv, Lipophilicity) are evaluated using RMSE:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td><strong>0.095</strong></td>
          <td>0.279</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td><strong>0.114</strong></td>
          <td>0.231</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td><strong>0.292</strong></td>
          <td>0.529</td>
          <td>MoLFormer-XL</td>
      </tr>
  </tbody>
</table>
<p>BARTSmiles achieves substantial improvements on all three regression tasks.</p>
<h3 id="generative-tasks">Generative Tasks</h3>
<p><strong><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></strong> (USPTO-50k): BARTSmiles achieves 55.6% Top-1 accuracy using a sample-128 + perplexity re-ranking strategy, compared to 55.3% for Dual-TF and 54.3% for ChemFormer. Top-5 and Top-10 results are 74.2% and 80.9% respectively.</p>
<p><strong>Chemical Reaction Prediction</strong> (USPTO MIT/LEF/STEREO): BARTSmiles with beam search outperforms the <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> baseline across all six evaluation settings. On USPTO-MIT (split), BARTSmiles achieves 91.8% vs. 90.4% for the Transformer baseline.</p>
<h3 id="fine-tuning-recipe">Fine-Tuning Recipe</h3>
<p>The fine-tuning approach is designed to minimize hyperparameter tuning:</p>
<ul>
<li>Batch size 16, 10 epochs, polynomial decay learning rate schedule with warmup at 16% of training</li>
<li>Grid search over dropout (0.1, 0.2, 0.3) and learning rate ($5 \times 10^{-6}$, $1 \times 10^{-5}$, $3 \times 10^{-5}$)</li>
<li>Stochastic Weight Averaging (SWA) over three sets of four checkpoints</li>
<li>For generative tasks: R3F regularization (Aghajanyan et al., 2020a) and full fp32 precision</li>
<li>For generation: beam search (beam size 10) or sample 128 sequences with perplexity re-ranking</li>
</ul>
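<p>Stochastic Weight Averaging amounts to element-wise averaging of parameter tensors across checkpoints. A minimal sketch over checkpoint state dicts (the parameter names are hypothetical):</p>

```python
import numpy as np

def swa_average(checkpoints):
    """Average a list of state dicts (param name -> array) element-wise."""
    avg = {}
    for name in checkpoints[0]:
        avg[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return avg

# Three toy checkpoints of a two-parameter "model".
ckpts = [{"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
         {"w": np.array([3.0, 4.0]), "b": np.array([1.0])},
         {"w": np.array([5.0, 6.0]), "b": np.array([2.0])}]
avg = swa_average(ckpts)
```

In the paper's recipe this averaging is applied over three sets of four checkpoints each, and the averaged weights are what get evaluated.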
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Scale matters for molecular pre-training</strong>: Training on 1.7B molecules with 20,480 A100 GPU-hours yields 5 absolute points of AUC-ROC improvement over training on 100M molecules.</li>
<li><strong>Domain-specific ablation is necessary</strong>: The optimal BART masking configuration for molecules (20% budget, no token randomization) differs from the standard NLP configuration (30% budget, with randomization).</li>
<li><strong>Frozen representations capture task structure</strong>: A small number of neurons from the frozen model can nearly match full fine-tuning performance on certain tasks, suggesting the pre-training objective implicitly encodes molecular properties.</li>
<li><strong>Interpretability aligns with domain knowledge</strong>: Integrated Gradients attribution on fine-tuned BARTSmiles highlights known structural alerts (e.g., <a href="https://en.wikipedia.org/wiki/Nitro_compound">nitro groups</a> in mutagenic compounds, hydroxyl groups in soluble compounds).</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Scaffold split sensitivity</strong>: Results on HIV, BACE, and BBBP are sensitive to the specific scaffold split, making direct comparison with baselines difficult.</li>
<li><strong>Pre-training data distribution</strong>: The <a href="https://en.wikipedia.org/wiki/Fr%C3%A9chet_distance">Fréchet distance</a> analysis shows that some downstream datasets (BBBP, SIDER) are far from ZINC20 in representation space, which may explain weaker performance on those tasks.</li>
<li><strong>Fingerprints carry complementary information</strong>: On the Ames and Micronucleus Assay datasets, BARTSmiles alone does not beat fingerprint-based baselines. Combining BARTSmiles with ECFP4 fingerprints closes the gap, implying that SMILES-based pre-training does not fully capture all structural information.</li>
<li><strong>Compute requirements</strong>: Pre-training requires 1,024 A100 GPUs, which limits accessibility.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest investigating the impact of pre-training data composition, noting that ZINC20 contains over a billion molecules but its distribution may be irrelevant for many downstream tasks. They also propose further collaboration between ML and chemistry experts to discover new molecular substructure-property relationships.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pre-training, fine-tuning, and evaluation scripts with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20 (deduplicated)</td>
          <td>~1.7B molecules</td>
          <td>Canonicalized SMILES, 10K validation holdout</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (7 datasets)</td>
          <td>1,427-41,127 compounds</td>
          <td>AUC-ROC metric</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (3 datasets)</td>
          <td>642-4,200 compounds</td>
          <td>RMSE metric</td>
      </tr>
      <tr>
          <td>Toxicology</td>
          <td>Ames, MN Assay</td>
          <td>6,512 / 641 compounds</td>
          <td>Cross-validation for Ames; external test for MN</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50k</td>
          <td>Standard split</td>
          <td>Top-K accuracy</td>
      </tr>
      <tr>
          <td>Reaction prediction</td>
          <td>USPTO (MIT/LEF/STEREO)</td>
          <td>Standard splits</td>
          <td>Top-1 accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BART-Large (pre-layer norm Transformer encoder-decoder)</li>
<li>Tokenizer: SentencePiece unigram, vocabulary size 1,021, max sequence length 128</li>
<li>Pre-training objective: BART denoising (mask token budget 0.20, Poisson span masking with $\lambda$ = 2.5, no token randomization)</li>
<li>Fine-tuning: polynomial decay LR, SWA, grid search over dropout and LR</li>
<li>Generative fine-tuning: R3F regularization, fp32 precision, Adam initialized from pre-training moving averages</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BART-Large architecture (exact parameter count not specified in paper)</li>
<li>Pre-trained checkpoint released on GitHub</li>
<li>Maximum sequence length: 128 tokens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>BARTSmiles</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td>AUC-ROC</td>
          <td>0.997</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>AUC-ROC</td>
          <td>0.825</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.095</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>0.114</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>RMSE</td>
          <td>0.292</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>USPTO-50k Retro (Top-1)</td>
          <td>Accuracy</td>
          <td>55.6%</td>
          <td>New SOTA (sample + re-rank)</td>
      </tr>
      <tr>
          <td>USPTO-MIT Rxn (Split)</td>
          <td>Accuracy</td>
          <td>91.8%</td>
          <td>New SOTA (beam-10)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 1,024 NVIDIA A100 GPUs for 20 hours (20,480 A100 GPU-hours)</li>
<li>Ablation runs: 128 A100 GPUs per run</li>
<li>Framework: FairSeq with FairScale (fully sharded data parallel), automatic mixed precision</li>
<li>Experiment tracking: Aim</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chilingaryan, G., Tamoyan, H., Tevosyan, A., Babayan, N., Khondkaryan, L., Hambardzumyan, K., Navoyan, Z., Khachatrian, H., &amp; Aghajanyan, A. (2024). BARTSmiles: Generative Masked Language Models for Molecular Representations. <em>Journal of Chemical Information and Modeling</em>, 64(15), 5832-5843. <a href="https://doi.org/10.1021/acs.jcim.4c00512">https://doi.org/10.1021/acs.jcim.4c00512</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2024 (preprint: arXiv 2022)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles GitHub Repository (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chilingaryan2024bartsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BARTSmiles: Generative Masked Language Models for Molecular Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chilingaryan, Gayane and Tamoyan, Hovhannes and Tevosyan, Ani and Babayan, Nelly and Khondkaryan, Lusine and Hambardzumyan, Karen and Navoyan, Zaven and Khachatrian, Hrant and Aghajanyan, Armen}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5832--5843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00512}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SELFormer: A SELFIES-Based Molecular Language Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/selformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/selformer/</guid><description>A SELFIES-based RoBERTa model pretrained on 2M ChEMBL molecules for molecular property prediction on MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-chemical-language-model">A SELFIES-Based Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$) with a secondary <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>SELFormer applies the RoBERTa transformer architecture to <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">SELFIES</a> molecular string representations instead of the <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> notation used by prior chemical language models. The model is pretrained via masked language modeling (MLM) on 2M drug-like compounds from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> and fine-tuned for molecular property prediction tasks on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. The authors release pretrained models, fine-tuning code, and datasets as open-source resources.</p>
<h2 id="why-selfies-over-smiles-for-pretraining">Why SELFIES Over SMILES for Pretraining?</h2>
<p>Existing chemical language models, including <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a>, all use SMILES as their input representation. SMILES has well-documented validity and robustness issues: arbitrary perturbations to a SMILES string frequently produce syntactically invalid outputs. This means a pretrained model must spend capacity learning SMILES grammar rules rather than chemical semantics.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">SELFIES</a> addresses this by construction: every possible SELFIES string decodes to a valid molecule. Despite this theoretical advantage and SELFIES&rsquo; growing adoption in generative chemistry, no prior work had systematically evaluated SELFIES as input for large-scale transformer pretraining. SELFormer fills this gap by providing a direct comparison between SELFIES-based and SMILES-based chemical language models on standard benchmarks.</p>
<h2 id="masked-language-modeling-on-guaranteed-valid-molecular-strings">Masked Language Modeling on Guaranteed-Valid Molecular Strings</h2>
<p>SELFormer uses byte-level Byte-Pair Encoding (BPE) to tokenize SELFIES strings, then pretrains a RoBERTa encoder using the standard MLM objective. 15% of input tokens are masked, and the model minimizes the cross-entropy loss over the masked positions:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $\mathcal{M}$ is the set of masked token indices, $x_i$ is the true token at position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted input context, and $\theta$ are the model parameters.</p>
<p>The key insight is that because SELFIES guarantees 100% validity, every masked token prediction corresponds to a valid molecular fragment. The model never wastes capacity predicting invalid chemistry. For fine-tuning, a two-layer classification or regression head is added on top of the encoder&rsquo;s output embedding.</p>
<p>Two model sizes were trained. Notably, the larger SELFormer uses fewer attention heads (4) but more hidden layers (12) than SELFormer-Lite (12 heads, 8 layers). This counterintuitive configuration emerged from the authors&rsquo; hyperparameter search over ~100 models, where deeper architectures with fewer heads outperformed wider, shallower ones:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>SELFormer-Lite</th>
          <th>SELFormer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Attention Heads</td>
          <td>12</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Hidden Layers</td>
          <td>8</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Batch Size</td>
          <td>16</td>
          <td>16</td>
      </tr>
      <tr>
          <td>Learning Rate</td>
          <td>5e-5</td>
          <td>5e-5</td>
      </tr>
      <tr>
          <td>Weight Decay</td>
          <td>0.01</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>Pretraining Epochs</td>
          <td>100</td>
          <td>100</td>
      </tr>
      <tr>
          <td>Parameters</td>
          <td>58.3M</td>
          <td>86.7M</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarking-against-smiles-transformers-and-graph-models">Benchmarking Against SMILES Transformers and Graph Models</h2>
<p>SELFormer was pretrained on 2.08M drug-like compounds from ChEMBL v30 (converted from SMILES to SELFIES), then fine-tuned on nine MoleculeNet tasks. The main benchmark evaluations use scaffold splitting via the Chemprop library; the ablation studies additionally report results on random splits.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BACE</th>
          <th>BBBP</th>
          <th>HIV</th>
          <th>Tox21</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td>0.832</td>
          <td><strong>0.902</strong></td>
          <td>0.681</td>
          <td>0.653</td>
          <td><strong>0.745</strong></td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>0.799</td>
          <td>0.728</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolBERT</td>
          <td><strong>0.866</strong></td>
          <td>0.762</td>
          <td><strong>0.783</strong></td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.809</td>
          <td>0.710</td>
          <td>0.771</td>
          <td>0.759</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td><strong>0.890</strong></td>
          <td>0.736</td>
          <td><strong>0.806</strong></td>
          <td><strong>0.787</strong></td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.856</td>
          <td>0.724</td>
          <td><strong>0.806</strong></td>
          <td>0.781</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.855</td>
          <td><strong>0.908</strong></td>
          <td>-</td>
          <td><strong>0.848</strong></td>
          <td>0.649</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE, scaffold split, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>PDBbind</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td><strong>0.682</strong></td>
          <td>2.797</td>
          <td>0.735</td>
          <td>1.488</td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>-</td>
          <td>-</td>
          <td>0.986</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>1.050</td>
          <td><strong>2.082</strong></td>
          <td><strong>0.683</strong></td>
          <td><strong>1.397</strong></td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.798</td>
          <td><strong>1.877</strong></td>
          <td>0.660</td>
          <td>-</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.803</td>
          <td>2.121</td>
          <td><strong>0.600</strong></td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The ablation study compared SELFormer vs. SELFormer-Lite across pretrained-only, 25-epoch, and 50-epoch fine-tuning configurations on randomly split datasets. SELFormer consistently outperformed SELFormer-Lite, confirming the benefit of the deeper (12-layer) architecture.</p>
<h2 id="strong-classification-performance-with-compact-pretraining">Strong Classification Performance with Compact Pretraining</h2>
<p>SELFormer&rsquo;s strongest results come on classification tasks where molecular substructure matters:</p>
<ul>
<li><strong>SIDER</strong>: Best overall ROC-AUC (0.745), outperforming the next best method (MolCLR at 0.652) by 9.3 percentage points. The authors attribute this to SELFIES&rsquo; ability to capture subtle structural differences relevant to drug side effects.</li>
<li><strong>BBBP</strong>: Second best (0.902), behind only KPGT (0.908). SELFormer scored 17.4 percentage points above ChemBERTa-2 (0.728) on this task.</li>
<li><strong>BACE/HIV vs. ChemBERTa-2</strong>: SELFormer outperformed ChemBERTa-2 by 3.3 points on BACE (0.832 vs 0.799), 17.4 on BBBP, and 5.9 on HIV (0.681 vs 0.622). Since both models use similar RoBERTa architectures, this comparison is suggestive of a SELFIES advantage, though differences in pretraining corpus (ChEMBL vs PubChem), corpus size, and training procedure confound a clean attribution to the input representation alone.</li>
<li><strong>ESOL regression</strong>: Best RMSE (0.682) vs GEM (0.798), a 14.5% relative improvement.</li>
</ul>
<p>Limitations are also apparent:</p>
<ul>
<li><strong>HIV and Tox21</strong>: SELFormer underperforms graph-based methods (MolCLR, GEM, KPGT) on these larger datasets. The authors attribute this to insufficient hyperparameter search given computational constraints.</li>
<li><strong>FreeSolv and Lipophilicity regression</strong>: D-MPNN and graph-based methods maintain an edge, suggesting that explicit 2D/3D structural inductive biases remain valuable for certain property types.</li>
<li><strong>Small pretraining corpus</strong>: At 2M molecules, SELFormer&rsquo;s corpus is orders of magnitude smaller than MolFormer&rsquo;s 1.1B. Despite this, SELFormer outperforms MolFormer on SIDER (0.745 vs 0.690), suggesting that the SELFIES representation can partially offset a much smaller pretraining corpus, at least on this task.</li>
<li><strong>Single-task ablation scope</strong>: Some architectural claims rest on limited task coverage, and broader benchmarking would strengthen the conclusions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v30</td>
          <td>2,084,725 compounds (2,084,472 after SELFIES conversion)</td>
          <td>Drug-like bioactive small molecules</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>1,513</td>
          <td><a href="https://en.wikipedia.org/wiki/Beta-secretase_1">Beta-secretase 1</a> inhibitor binding</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td><a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">Blood-brain barrier</a> permeability</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>HIV replication inhibition</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>SIDER</td>
          <td>1,427</td>
          <td>Drug side effects (27 classes)</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Tox21</td>
          <td>7,831</td>
          <td>Toxicity (12 targets)</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>Octanol/water distribution coefficient</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PDBbind</td>
          <td>11,908</td>
          <td>Binding affinity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (MLM), 15% token masking</li>
<li><strong>Tokenization</strong>: Byte-level Byte-Pair Encoding (BPE) on SELFIES strings</li>
<li><strong>SMILES to SELFIES conversion</strong>: SELFIES API with Pandaral.lel for parallelization</li>
<li><strong>Splitting</strong>: Scaffold splitting via Chemprop library (80/10/10 train/validation/test)</li>
<li><strong>Fine-tuning</strong>: Two-layer classification/regression head on encoder output; up to 200 epochs with hyperparameter search</li>
</ul>
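<p>The fine-tuning head listed above can be sketched as a plain feed-forward computation. Layer sizes, the tanh nonlinearity, and the initialization below are illustrative assumptions, not the released configuration:</p>

```python
import numpy as np

def two_layer_head(embedding, W1, b1, W2, b2):
    """Map an encoder output embedding to task logits (classification)
    or a scalar (regression) through one hidden layer."""
    hidden = np.tanh(embedding @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_hidden, n_out = 768, 128, 2   # illustrative sizes
W1 = rng.normal(scale=0.02, size=(d_model, d_hidden)); b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.02, size=(d_hidden, n_out)); b2 = np.zeros(n_out)
logits = two_layer_head(rng.normal(size=d_model), W1, b1, W2, b2)
```

<p>During fine-tuning, both the head parameters and the encoder weights are updated, which the ablations below show matters for downstream accuracy.</p>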
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa (HuggingFace Transformers)</li>
<li><strong>SELFormer</strong>: 12 hidden layers, 4 attention heads, 86.7M parameters</li>
<li><strong>SELFormer-Lite</strong>: 8 hidden layers, 12 attention heads, 58.3M parameters</li>
<li><strong>Hyperparameter search</strong>: Sequential search over ~100 configurations on 100K molecule subset</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Area under receiver operating characteristic curve</td>
      </tr>
      <tr>
          <td>PRC-AUC</td>
          <td>Classification</td>
          <td>Area under precision-recall curve (reported for random splits)</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<p>Results reported on scaffold split and random split datasets.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 2x NVIDIA A5000 GPUs</li>
<li><strong>Hyperparameter optimization time</strong>: ~11 days</li>
<li><strong>Full pretraining</strong>: 100 epochs on 2.08M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HUBioDataLab/SELFormer">SELFormer GitHub</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/HUBioDataLab/SELFormer">SELFormer on HuggingFace</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretrained SELFormer weights</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v30</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source pretraining data</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Benchmark</td>
          <td>Unknown</td>
          <td>Downstream evaluation tasks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yüksel, A., Ulusoy, E., Ünlü, A., &amp; Doğan, T. (2023). SELFormer: Molecular Representation Learning via SELFIES Language Models. <em>Machine Learning: Science and Technology</em>, 4(2), 025035. <a href="https://doi.org/10.1088/2632-2153/acdb30">https://doi.org/10.1088/2632-2153/acdb30</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/HUBioDataLab/SELFormer">GitHub Repository (SELFormer)</a></li>
<li><a href="https://huggingface.co/HUBioDataLab/SELFormer">HuggingFace Model Hub (SELFormer)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yuksel2023selformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{SELFormer}: Molecular Representation Learning via {SELFIES} Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Y{\&#34;u}ksel, Atakan and Ulusoy, Erva and {\&#34;U}nl{\&#34;u}, Atabey and Do{\u{g}}an, Tunca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{025035}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/acdb30}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoLFormer: Large-Scale Chemical Language Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molformer/</guid><description>A linear-attention transformer pretrained on 1.1B SMILES from PubChem and ZINC for molecular property prediction across MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-billion-scale-chemical-language-model">A Billion-Scale Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>MoLFormer is a transformer encoder pretrained via masked language modeling on 1.1 billion <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> and <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a>. The key architectural choices are linear attention (for $O(N)$ complexity instead of $O(N^2)$) and rotary positional embeddings (RoPE). The resulting model, MoLFormer-XL, produces molecular embeddings that outperform or match GNN baselines across a wide range of <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification and regression tasks, including quantum-chemical property prediction from SMILES alone.</p>
<h2 id="bridging-the-gap-between-molecular-languages-and-graph-neural-networks">Bridging the Gap Between Molecular Languages and Graph Neural Networks</h2>
<p>Prior chemical language models like <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> were pretrained on relatively small datasets (10M-77M molecules) and generally underperformed GNNs on molecular property prediction. The core question: does a transformer trained on a sufficiently large SMILES corpus learn enough chemical structure to compete with graph-based methods that have explicit topological inductive biases?</p>
<p>Two specific challenges motivated this work:</p>
<ul>
<li><strong>Scale</strong>: Chemical space is estimated to contain $10^{60}$ to $10^{100}$ plausible molecules, yet labeled property data is scarce. Self-supervised pretraining on the ~1.1B unlabeled molecules available in public databases could provide a general-purpose representation.</li>
<li><strong>Efficiency</strong>: Standard transformer attention is $O(N^2)$ in sequence length, making billion-scale pretraining impractical without architectural modifications.</li>
</ul>
<h2 id="linear-attention-with-rotary-positional-embeddings">Linear Attention with Rotary Positional Embeddings</h2>
<p>MoLFormer&rsquo;s two key architectural choices are its attention mechanism and positional encoding scheme.</p>
<p><strong>Standard attention</strong> computes:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle) v_n}{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle)}
$$</p>
<p>MoLFormer replaces this with <strong>linear attention</strong> using a generalized feature map $\varphi$, combined with <strong>rotary positional embeddings</strong> $R_m$ applied before the feature map:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>This differs from the original RoFormer formulation, which applies the rotation after the feature map. The authors found that rotating the raw queries and keys before projection led to faster convergence and lower validation loss. The combination of linear attention and adaptive sequence-length bucketing reduces GPU requirements from ~1000 to 16 for training on the full 1.1B corpus.</p>
<p>The model uses masked language modeling (15% token masking, following BERT conventions) with a vocabulary of 2,362 SMILES tokens. Sequence length is capped at 202 tokens, covering 99.4% of all molecules.</p>
<h2 id="broad-moleculenet-benchmarking-with-scaling-ablations">Broad MoleculeNet Benchmarking with Scaling Ablations</h2>
<p>MoLFormer-XL was evaluated on 11 MoleculeNet tasks against supervised GNNs, self-supervised GNNs, and prior language models.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split; values reported as percentages in the original paper, converted to proportions here for consistency):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.847</strong></td>
          <td><strong>0.948</strong></td>
          <td>0.822</td>
          <td>0.882</td>
          <td><strong>0.690</strong></td>
      </tr>
      <tr>
          <td>N-Gram</td>
          <td>0.912</td>
          <td>0.769</td>
          <td>0.855</td>
          <td>0.830</td>
          <td>0.876</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td>0.736</td>
          <td>0.798</td>
          <td>0.932</td>
          <td>0.806</td>
          <td><strong>0.890</strong></td>
          <td>0.680</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.724</td>
          <td>0.781</td>
          <td>0.901</td>
          <td>0.806</td>
          <td>0.856</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>Hu et al.</td>
          <td>0.708</td>
          <td>0.787</td>
          <td>0.789</td>
          <td>0.802</td>
          <td>0.859</td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GeomGCL</td>
          <td>-</td>
          <td>0.850</td>
          <td>0.919</td>
          <td>-</td>
          <td>-</td>
          <td>0.648</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>0.643</td>
          <td>-</td>
          <td>0.906</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE for ESOL/FreeSolv/Lipophilicity, avg MAE for QM9/QM8):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>1.5894</strong></td>
          <td><strong>0.0102</strong></td>
          <td><strong>0.2787</strong></td>
          <td><strong>0.2308</strong></td>
          <td><strong>0.5289</strong></td>
      </tr>
      <tr>
          <td>A-FP</td>
          <td>2.6355</td>
          <td>0.0282</td>
          <td>0.5030</td>
          <td>0.736</td>
          <td>0.578</td>
      </tr>
      <tr>
          <td>MPNN</td>
          <td>3.1898</td>
          <td>0.0143</td>
          <td>0.58</td>
          <td>1.150</td>
          <td>0.7190</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>4.3536</td>
          <td>0.0148</td>
          <td>0.970</td>
          <td>1.40</td>
          <td>0.655</td>
      </tr>
  </tbody>
</table>
<p>MoLFormer-XL also outperforms geometry-aware GNNs (DimeNet, GeomGCL, GEM) on ESOL (0.279 vs 0.575), FreeSolv (0.231 vs 0.866), and Lipophilicity (0.529 vs 0.541).</p>
<p><strong>Key ablation findings</strong>:</p>
<ul>
<li><strong>Data scale matters</strong>: Performance improves monotonically from 10% subsets through the full 1.1B corpus. Training on 100% ZINC alone performed worst, likely due to its smaller vocabulary and less diverse molecule lengths.</li>
<li><strong>Model depth matters</strong>: MoLFormer-Base (6 layers) underperforms MoLFormer-XL (12 layers) on most tasks.</li>
<li><strong>Fine-tuning &raquo; frozen</strong>: Fine-tuning the full encoder consistently outperforms using frozen embeddings with a downstream classifier.</li>
<li><strong>Rotary &gt; absolute at scale</strong>: Rotary embeddings underperform absolute embeddings on smaller pretraining sets but overtake them once the corpus exceeds 1B molecules.</li>
</ul>
<h2 id="smiles-transformers-learn-molecular-geometry">SMILES Transformers Learn Molecular Geometry</h2>
<p>The most striking finding is that MoLFormer&rsquo;s attention patterns correlate with 3D interatomic distances, despite training only on 1D SMILES strings.</p>
<p>Using QM9 molecules with known 3D geometries, the authors computed cosine similarity between attention maps and spatial distance matrices across three distance categories:</p>
<table>
  <thead>
      <tr>
          <th>Distance Category</th>
          <th>Range</th>
          <th>Linear Attention (Rotary)</th>
          <th>Full Attention (Rotary)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Short</td>
          <td>$\leq$ 2 Å</td>
          <td>0.594-0.602</td>
          <td>0.598-0.615</td>
      </tr>
      <tr>
          <td>Medium</td>
          <td>2-4 Å</td>
          <td>0.724-0.730</td>
          <td>0.716-0.727</td>
      </tr>
      <tr>
          <td>Long</td>
          <td>4-10 Å</td>
          <td>0.209-0.211</td>
          <td>0.204-0.210</td>
      </tr>
  </tbody>
</table>
<p>The strong correlation in the short and medium categories indicates the model captures covalent bond connectivity and near-neighbor spatial relationships. Linear attention shows marginally higher cosine similarity than full attention on medium-range distances (0.724-0.730 vs 0.716-0.727), though the differences are small.</p>
<p>MoLFormer-XL embeddings also correlate more strongly with molecular fingerprint similarity (0.64 vs 0.48 for ChemBERTa) and maximum common subgraph size (-0.60 vs -0.44), confirming that the representations encode structural information.</p>
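<p>The distance-category analysis can be sketched as follows. Binarizing the distance matrix into a per-category indicator and comparing it against the flattened attention map is my assumption about the general shape of the computation, not the paper&rsquo;s exact code:</p>

```python
import numpy as np

def attention_geometry_similarity(attn, dist, lo, hi):
    """Cosine similarity between a flattened attention map and the
    indicator of atom pairs whose 3D distance lies in [lo, hi) angstroms."""
    mask = ((dist >= lo) & (dist < hi)).astype(float).ravel()
    a = attn.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(mask)
    return float(a @ mask / denom) if denom else 0.0

# Toy 2-atom example: attention concentrated on the bonded pair.
attn = np.array([[0.9, 0.1], [0.1, 0.9]])
dist = np.array([[0.0, 1.5], [1.5, 0.0]])
short = attention_geometry_similarity(attn, dist, 0.1, 2.0)  # short-range category
```

<p>A high value in the short and medium bins, as in the table above, means attention weight concentrates on spatially close atom pairs.</p>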
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Quantum-chemical energies</strong>: SchNet and DimeNet (which encode explicit 3D geometry) outperform MoLFormer-XL on QM9 atomization energy tasks, with DimeNet achieving roughly 10x lower MAE on U0_atom (0.008 vs 0.083 eV). 3D information remains important for these properties.</li>
<li><strong>Sequence length cap</strong>: The 202-token limit excludes 0.6% of molecules, potentially limiting applicability to larger structures.</li>
<li><strong>SMILES canonicalization</strong>: The model depends on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> canonical SMILES; sensitivity to non-canonical forms is not evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>PubChem</td>
          <td>111M molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC</td>
          <td>~1B molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining (combined)</td>
          <td>PubChem + ZINC</td>
          <td>~1.1B molecules</td>
          <td>MoLFormer-XL training set</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP, Tox21, ClinTox, HIV, BACE, SIDER</td>
          <td>1,427-41,127</td>
          <td>MoleculeNet scaffold splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>QM9, QM8, ESOL, FreeSolv, Lipophilicity</td>
          <td>642-133,885</td>
          <td>MoleculeNet random splits (QM9/QM8), scaffold (others)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li><strong>Tokenization</strong>: SMILES tokenizer from Schwaller et al., vocabulary of 2,362 tokens</li>
<li><strong>Sequence length</strong>: 1-202 tokens (99.4% coverage)</li>
<li><strong>Optimizer</strong>: Fused LAMB (via APEX), chosen for stability with large batch sizes and no need for learning rate warm-up</li>
<li><strong>Adaptive bucketing</strong>: Sequences grouped by length into buckets to minimize padding waste</li>
</ul>
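<p>The idea behind adaptive bucketing is that a batch drawn from one bucket pads only to that bucket&rsquo;s bound rather than to the global maximum of 202 tokens. A stdlib-only sketch, with illustrative bucket boundaries (the paper&rsquo;s actual boundaries are not specified here):</p>

```python
def bucket_by_length(sequences, boundaries=(32, 64, 128, 202)):
    """Assign each token sequence to the smallest bucket whose bound
    covers its length, minimizing padding waste within a batch."""
    buckets = {b: [] for b in boundaries}
    for seq in sequences:
        for bound in boundaries:
            if len(seq) <= bound:
                buckets[bound].append(seq)
                break
    return buckets

seqs = [list(range(n)) for n in (10, 30, 60, 200)]
buckets = bucket_by_length(seqs)
```

<p>Batches are then sampled within a bucket, so short molecules are never padded out to the length of the longest molecule in the corpus.</p>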
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Transformer encoder with linear attention and rotary positional embeddings</li>
<li><strong>MoLFormer-XL</strong>: 12 layers, 12 attention heads, hidden size 768</li>
<li><strong>MoLFormer-Base</strong>: 6 layers (ablation only)</li>
<li><strong>Feature map size</strong>: 32 (generalized feature map for linear attention)</li>
<li><strong>Frozen head</strong>: Fully connected model with hyperparameter sweep (learning rate, batch size, hidden dim, number of layers)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Scaffold splits per MoleculeNet</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression (ESOL, FreeSolv, Lipophilicity)</td>
          <td>Scaffold splits</td>
      </tr>
      <tr>
          <td>Avg MAE</td>
          <td>Regression (QM9, QM8)</td>
          <td>Random splits per MoleculeNet</td>
      </tr>
  </tbody>
</table>
<p>QM9 results also reported with 5-fold cross-validation for robustness.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: GPU cluster with nodes containing either 8 NVIDIA Tesla V100 (32GB) or 8 Ampere A100 (40GB) GPUs connected via NVLink and InfiniBand</li>
<li><strong>GPU reduction</strong>: Linear attention + bucketing reduced GPU requirements from ~1000 to 16</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/molformer">IBM/molformer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Pretraining, fine-tuning, and attention visualization</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">MoLFormer-XL (HuggingFace)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pretrained weights (46.8M parameters)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>111M molecules</td>
      </tr>
      <tr>
          <td><a href="https://zinc.docking.org/">ZINC</a></td>
          <td>Dataset</td>
          <td>See ZINC terms</td>
          <td>~1B molecules</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Chenthamarakshan, V., Padhi, I., Mroueh, Y., &amp; Das, P. (2022). Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. <em>Nature Machine Intelligence</em>, 4, 1256-1264. <a href="https://doi.org/10.1038/s42256-022-00580-7">https://doi.org/10.1038/s42256-022-00580-7</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/molformer">GitHub Repository (MoLFormer)</a></li>
<li><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">HuggingFace Models</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2022molformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Large-Scale Chemical Language Representations Capture Molecular Structure and Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1256--1264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00580-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-3: Open Source Chemical Foundation Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-3/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-3/</guid><description>An open-source framework integrating DeepChem and Ray for training and benchmarking chemical foundation models like MoLFormer and GROVER at scale.</description><content:encoded><![CDATA[<h2 id="core-contribution-an-open-source-framework">Core Contribution: An Open-Source Framework</h2>
<p>This is primarily a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper, with secondary <strong>Method ($\Psi_{\text{Method}}$)</strong> contributions.</p>
<ul>
<li><strong>Resource Basis</strong>: The core contribution is &ldquo;ChemBERTa-3,&rdquo; an open-source framework integrated into DeepChem that standardizes the pretraining and benchmarking of chemical foundation models. The authors focus heavily on infrastructure (AWS/Ray integration) and correcting benchmarking inconsistencies in the field.</li>
<li><strong>Method Basis</strong>: It trains models like &ldquo;c3-MoLFormer&rdquo; to reproduce and validate the infrastructure.</li>
</ul>
<h2 id="the-pretraining-scalability-challenge">The Pretraining Scalability Challenge</h2>
<ul>
<li><strong>Scalability Challenges</strong>: Building robust molecular models is difficult due to the vast size of chemical space and the computational intensity of pretraining on large datasets.</li>
<li><strong>Proprietary Barriers</strong>: Many high-performing chemical foundation models (e.g., the full <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer-XL</a>) are partially closed-source or difficult to reproduce.</li>
<li><strong>Benchmarking Inconsistencies</strong>: There is a lack of systematic comparison between architectures (e.g., Graph vs. Transformer) using unified protocols. Specifically, previous comparisons relied on reported results produced with differing scaffold splitting algorithms, making direct cross-paper comparisons unreliable.</li>
</ul>
<h2 id="unified-infrastructure--standardized-benchmarking">Unified Infrastructure &amp; Standardized Benchmarking</h2>
<ul>
<li><strong>Unified Infrastructure</strong>: Integration of DeepChem with Ray for distributed, scalable pretraining and fine-tuning of both graph and transformer models.</li>
<li><strong>Standardized Benchmarking</strong>: Identification that MoLFormer&rsquo;s scaffold splitting algorithm differs from the standard DeepChem/<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> splitter, and the subsequent standardization of these benchmarks for fair comparison.</li>
<li><strong>New DeepChem Tools</strong>: Introduction of the <code>ModularTorchModel</code> class for flexible loss computation and <code>HuggingFaceModel</code> wrappers to bridge ecosystems.</li>
</ul>
<h2 id="benchmarking-transformers-vs-graph-models">Benchmarking Transformers vs. Graph Models</h2>
<ul>
<li><strong>Architecture Comparison</strong>: Benchmarked Transformers (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>) against Graph models (GROVER, InfoGraph, InfoMax3D, DMPNN, GCN) and baselines (Random Forest).</li>
<li><strong>Pretraining Scale Disparity</strong>:
<ul>
<li>Transformers were pretrained on ZINC20 subsets ranging from 10M to 1.1B molecules (combining ZINC and PubChem).</li>
<li>Graph models were limited to 250K molecule subsets due to memory and computational overhead of message passing on large graphs. While this highlights the superior scalability of Transformer architectures, comparing a 1.1B-trained Transformer to a 250K-trained Graph model provides an unbalanced evaluation of architectural capacity.</li>
</ul>
</li>
<li><strong>Reproducibility Validation</strong>: Trained &ldquo;c3-MoLFormer&rdquo; (a reproduction of MoLFormer) on 1.1B molecules using two distinct hardware setups: AWS spot instances (Ray) and a local HPC cluster.</li>
<li><strong>Scaffold Split Analysis</strong>: Compared performance metrics using &ldquo;DeepChem scaffold splits&rdquo; vs. &ldquo;MoLFormer scaffold splits&rdquo; to quantify the impact of data leakage/overlap.</li>
</ul>
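<p>The sensitivity analyzed above comes down to how molecules are grouped before splitting. Below is a minimal pure-Python sketch of a greedy scaffold split &mdash; not the actual DeepChem or MoLFormer algorithm, whose scaffold definitions and tie-breaking differ (which is precisely the source of the benchmark discrepancy):</p>

```python
from collections import defaultdict

def greedy_scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Toy scaffold split: group molecule indices by scaffold key, then
    assign whole groups (largest first) to train/valid/test. Real
    splitters differ in scaffold definition and tie-breaking, which is
    why their benchmark numbers diverge."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)
    # Largest scaffold groups fill train first, so rare scaffolds
    # land in valid/test -- the source of the distribution shift.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Example: 10 molecules drawn from 4 scaffolds
train, valid, test = greedy_scaffold_split(
    ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
)
```

<p>Note that no scaffold ever spans two partitions &mdash; the property both real splitters enforce, in different ways.</p>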
<h2 id="overcoming-scaffold-splitting-inconsistencies">Overcoming Scaffold Splitting Inconsistencies</h2>
<ul>
<li><strong>Scaling Transformers vs. Graphs</strong>: Transformer-based models are significantly easier to scale to large datasets than current graph-based approaches, though performance is comparable at small scales.</li>
<li><strong>Benchmarking sensitivity</strong>: MoLFormer&rsquo;s reported superiority over baselines was partly inflated by its specific scaffold splitting method, which had higher structural overlap between train and test sets (yielding a lower <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto distance</a>, generally quantified via $1 - \frac{|A \cap B|}{|A \cup B|}$) than DeepChem splits. When standardized, baselines like DMPNN perform more competitively.</li>
<li><strong>Infrastructure Viability</strong>: The framework successfully replicated large-scale training (MoLFormer-1.1B) on both cloud and on-premise HPC, confirming reproducibility.</li>
<li><strong>Open Source Release</strong>: All code, configurations, and the c3-MoLFormer-1.1B model weights are released to facilitate future research.</li>
</ul>
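<p>The Tanimoto distance used above to quantify train/test overlap can be sketched in a few lines over fingerprints represented as sets of on-bit indices (real pipelines would compute fingerprints with RDKit; this toy version only illustrates the formula):</p>

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto (Jaccard) distance between two fingerprints given as
    sets of on-bit indices: 1 - |A intersect B| / |A union B|. A lower
    distance between train and test molecules means higher structural
    overlap -- the leakage the scaffold-split analysis quantifies."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Two fingerprints sharing 2 of the 4 bits in their union:
d = tanimoto_distance({1, 2, 3}, {2, 3, 4})  # -> 0.5
```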
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pretraining</strong>:
<ul>
<li><strong>Source</strong>: <a href="/notes/chemistry/datasets/zinc-22/">ZINC20</a> (1.4B compounds) and PubChem.</li>
<li><strong>Scale</strong>: Subsets of 10M, 100M, and 1.1B (100% ZINC20 + 100% PubChem) were used for Transformers. Graph models used a 250K subset.</li>
</ul>
</li>
<li><strong>Fine-tuning</strong>:
<ul>
<li><strong>Suite</strong>: MoleculeNet.</li>
<li><strong>Tasks</strong>: Classification (BACE, BBBP, Tox21, HIV, SIDER, ClinTox) and Regression (ESOL, FreeSolv, Lipophilicity, QM9).</li>
<li><strong>Splits</strong>: Critical distinction made between &ldquo;DeepChem scaffold splits&rdquo; (80/10/10) and &ldquo;MoLFormer scaffold splits&rdquo; (which can be downloaded from <a href="https://ibm.ent.box.com/v/MoLFormer-data"><code>https://ibm.ent.box.com/v/MoLFormer-data</code></a>). The paper notes these algorithms differ.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DeepChem integrated with Ray for distributed training. To recreate the environment, the repository relies on a nightly version of DeepChem (<code>pip install --pre deepchem</code>) and specific dependencies found within the <code>requirements.txt</code>. Pretraining scripts are available in the <code>chemberta3_benchmarking/pretraining</code> directory of the repository.</li>
<li><strong>Data Preparation</strong>: Featurization workflows (e.g., <code>CircularFingerprint</code>, <code>RDKitConformer</code>) are documented under <code>chemberta3_benchmarking/data/data_preprocessing/</code> in the codebase.</li>
<li><strong>Modular Training</strong>: Uses <code>ModularTorchModel</code> to allow loss computation from intermediate values and flexible component connection.</li>
<li><strong>Training Brittleness</strong>:
<ul>
<li><strong>Optimizer</strong>: Linear learning rate scheduler with warmup.</li>
<li><strong>Instability Handling</strong>: The authors observed significant loss spikes during warmup. Their primary mitigation strategy involved checkpointing frequently and restarting from the last stable state upon a spike, highlighting a persistent brittleness in optimizing these large chemical foundation models.</li>
<li><strong>Numerical Issues</strong>: Addressed NaN values by pretraining on a small dataset with low LR before scaling up.</li>
</ul>
</li>
</ul>
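<p>The checkpoint-and-rollback strategy described under instability handling can be illustrated with a toy loop; <code>step_fn</code> and the scripted loss sequence below are hypothetical stand-ins for an actual optimizer step:</p>

```python
import copy

def train_with_rollback(step_fn, state, n_steps, spike_factor=3.0):
    """Toy training loop sketching the spike-recovery strategy:
    checkpoint after every stable step; if the loss jumps above
    spike_factor x the last stable loss, discard the update and
    restart from the checkpoint instead of continuing."""
    checkpoint = copy.deepcopy(state)
    last_loss = None
    for _ in range(n_steps):
        new_state, loss = step_fn(copy.deepcopy(state))
        if last_loss is not None and loss > spike_factor * last_loss:
            state = copy.deepcopy(checkpoint)  # roll back the spike
            continue
        checkpoint = copy.deepcopy(new_state)
        state, last_loss = new_state, loss
    return state, last_loss

# Scripted losses: step 3 spikes (10.0 > 3 x 0.9) and is rolled back.
losses = iter([1.0, 0.9, 10.0, 0.85])
def step_fn(state):
    state["updates"] = state.get("updates", 0) + 1
    return state, next(losses)

final, last = train_with_rollback(step_fn, {}, n_steps=4)
# The spiked update is discarded, so only 3 updates survive.
```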
<h3 id="models">Models</h3>
<ul>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></strong>: RoBERTa-based architecture trained with Masked Language Modeling (MLM) and Multitask Regression (MTR). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM"><code>DeepChem/ChemBERTa-100M-MLM</code></a>) are hosted on Hugging Face so researchers can pull them directly via the <code>transformers</code> library. The core pretraining objective minimized the standard MLM loss:
$$ \mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log \hat{y}_{i} $$
where $\mathcal{M}$ represents the set of masked SMILES token indices, and $\hat{y}_{i}$ is the model&rsquo;s predicted probability for the correct token given the corrupted sequence context.</li>
<li><strong>MoLFormer (c3-MoLFormer)</strong>: Re-implementation of the MoLFormer architecture (Rotary embeddings, linear attention). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B"><code>DeepChem/MoLFormer-c3-1.1B</code></a>) are similarly available on Hugging Face.
<ul>
<li>Tokenizer: <code>ibm/MoLFormer-XL-both-10pct</code> tokenizer.</li>
</ul>
</li>
<li><strong>Graph Models</strong>:
<ul>
<li><strong>GROVER</strong>: Graph Transformer with node/edge/graph level self-supervision.</li>
<li><strong>InfoGraph</strong>: Maximizes mutual information between graph-level and substructure representations.</li>
<li><strong>InfoMax3D</strong>: Incorporates 3D conformer data (via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> ETKDGv2) into contrastive pretraining.</li>
<li><strong>DMPNN</strong>: Directed Message Passing Neural Network (Chemprop variant).</li>
</ul>
</li>
</ul>
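<p>The MLM objective given for ChemBERTa above can be checked numerically with a short sketch; the probabilities below are hypothetical values standing in for model outputs:</p>

```python
import math

def mlm_loss(probs_correct, masked_indices):
    """Average negative log-likelihood over masked positions, matching
    L_MLM = -(1/|M|) * sum_{i in M} log(p_i), where probs_correct[i]
    is the model's probability for the true token at position i."""
    return -sum(math.log(probs_correct[i]) for i in masked_indices) / len(masked_indices)

# Toy sequence of 5 token positions; positions 1 and 3 are masked.
probs = [0.9, 0.5, 0.8, 0.25, 0.7]
loss = mlm_loss(probs, masked_indices=[1, 3])
# -(log 0.5 + log 0.25) / 2 = (0.6931 + 1.3863) / 2, about 1.0397
```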
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> for classification; RMSE for regression (MAE for QM9).</li>
<li><strong>Baselines</strong>: Random Forest, GCN, DMPNN trained on fine-tuning splits only.</li>
<li><strong>Protocol</strong>: Three independent runs per configuration to report mean and range (not a confidence interval), with the exception of the compute-heavy QM9 dataset, which only received a single run. Benchmarking execution scripts (e.g., GCN, RF, DMPNN, ChemBERTa) are stored in the repo under <code>chemberta3_benchmarking/models_benchmarking/</code> and contain the specific fine-tuning hyperparameters and optimizer configurations used for each downstream task.</li>
<li><strong>Key Results</strong>:
<ul>
<li><em>c3-MoLFormer-1.1B</em> achieved ~0.848 ROC-AUC on BACE and ~0.900 on BBBP (using MoLFormer splits). This closely matches the original IBM MoLFormer metrics, validating the reproducibility of the open-source framework.</li>
<li>When constrained to the equivalent 250K subset, Graph models (InfoGraph, GROVER) performed comparably to Transformers, indicating that Transformer superiority in chemistry is largely driven by data scalability rather than an inherent architectural advantage at small scales.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Cloud (AWS)</strong>:
<ul>
<li><strong>Compute</strong>: 40 NVIDIA T4 GPUs (<code>g4dn.12xlarge</code> spot instances for pretraining, <code>g4dn.2xlarge</code> for benchmarking).</li>
<li><strong>Cost</strong>: ~$4000 for MoLFormer 1.1B pretraining.</li>
<li><strong>Time</strong>: ~10 days (260 hours) for 1.1B model pretraining.</li>
<li><strong>Setup</strong>: Setup scripts for single-node and multi-node spot EC2 clusters are provided in the GitHub repository&rsquo;s <code>infra/</code> and <code>spot/</code> folders.</li>
</ul>
</li>
<li><strong>On-Premise HPC</strong>:
<ul>
<li><strong>Compute</strong>: 16 nodes (AMD EPYC), each with 4 AMD MI300A APUs.</li>
<li><strong>Environment</strong>: Ray multi-node multi-GPU framework.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, fine-tuning, and benchmarking framework</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B">DeepChem/MoLFormer-c3-1.1B</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer re-implementation pretrained on 1.1B molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM">DeepChem/ChemBERTa-100M-MLM</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>ChemBERTa pretrained on 100M ZINC molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-100M">DeepChem/MoLFormer-c3-100M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 100M molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-550M">DeepChem/MoLFormer-c3-550M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 550M molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Singh, R. et al. (2026). ChemBERTa-3: an open source training framework for chemical foundation models. <em>Digital Discovery</em>, 5, 662-685. <a href="https://doi.org/10.1039/D5DD00348B">https://doi.org/10.1039/D5DD00348B</a></p>
<p><strong>Publication</strong>: Digital Discovery 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></li>
<li><a href="https://deepchem.io/">DeepChem Project</a></li>
<li><a href="https://huggingface.co/DeepChem">DeepChem Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{singhChemBERTa3OpenSource2026,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Singh, Riya and Barsainyan, Aryan Amit and Irfan, Rida and Amorin, Connor Joseph and He, Stewart and Davis, Tony and Thiagarajan, Arun and Sankaran, Shiva and Chithrananda, Seyone and Ahmad, Walid and Jones, Derek and McLoughlin, Kevin and Kim, Hyojin and Bhutani, Anoushka and Sathyanarayana, Shreyas Vinaya and Viswanathan, Venkat and Allen, Jonathan E. and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-3}}: an open source training framework for chemical foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{662-685}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{The Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D5DD00348B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1039/D5DD00348B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-2: Scaling Molecular Transformers to 77M</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-2/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-2/</guid><description>Optimizing transformer pretraining for molecules using MLM vs MTR objectives, scaling to 77M compounds from PubChem for improved property prediction.</description><content:encoded><![CDATA[<h2 id="classifying-chemberta-2s-methodological-contributions">Classifying ChemBERTa-2&rsquo;s Methodological Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper with a secondary <strong>Resource</strong> contribution.</p>
<p>It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine &ldquo;how well&rdquo; these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models and establish a benchmark on a massive 77M compound dataset.</p>
<p><strong>Key methodological indicators</strong>:</p>
<ul>
<li><strong>Baseline comparison</strong>: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa-1</a>) with prominent benchmark tables</li>
<li><strong>Ablation studies</strong>: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size</li>
<li><strong>Scaling analysis</strong>: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance</li>
</ul>
<h2 id="motivations-for-scaling-molecular-transformers">Motivations for Scaling Molecular Transformers</h2>
<p>The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a &ldquo;chemical foundation model&rdquo;.</p>
<p><strong>Key motivations</strong>:</p>
<ul>
<li><strong>Label scarcity</strong>: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant</li>
<li><strong>Scaling hypothesis</strong>: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP</li>
<li><strong>Efficiency</strong>: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and weakly supervised (MTR, using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> computed properties as labels) approaches</li>
</ul>
<h2 id="novelty-in-multi-task-regression-objectives">Novelty in Multi-Task Regression Objectives</h2>
<p><strong>Scale</strong>: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>).</p>
<p><strong>Pipeline optimization</strong>: A direct, controlled comparison of <strong>Masked Language Modeling (MLM)</strong> vs. <strong>Multi-Task Regression (MTR)</strong> pretraining objectives on identical datasets.</p>
<p><strong>Proxy selection</strong>: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.</p>
<h2 id="experimental-pretraining-setup-on-77m-compounds">Experimental Pretraining Setup on 77M Compounds</h2>
<h3 id="pretraining-setup">Pretraining Setup</h3>
<p><strong>Datasets</strong>: Subsets of <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> containing 5M, 10M, and 77M unique SMILES.</p>
<p><strong>Tasks</strong>:</p>
<ul>
<li><strong>MLM</strong>: Masking 15% of tokens (following RoBERTa procedure). The model is optimized by minimizing the cross-entropy loss over the predicted masked tokens:
$$ \mathcal{L}_{MLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}) $$
where $\mathcal{M}$ represents the set of masked token indices.</li>
<li><strong>MTR</strong>: Predicting 200 calculated molecular properties (via RDKit) simultaneously using a mean squared error objective:
$$ \mathcal{L}_{MTR} = \frac{1}{200} \sum_{j=1}^{200} \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_{ij} - y_{ij} \right)^2 $$
Continuous target labels $y_{ij}$ are mean-normalized prior to training to equilibrate the disparate scales of different chemical properties.</li>
</ul>
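<p>The label mean-normalization and the MTR objective above can be sketched in pure Python; the two toy tasks below stand in for the 200 RDKit-computed properties:</p>

```python
def mean_normalize(columns):
    """Subtract each property's mean so targets with disparate scales
    (e.g. molecular weight vs. logP) contribute comparably."""
    return [[y - sum(col) / len(col) for y in col] for col in columns]

def mtr_loss(preds, targets):
    """Mean squared error averaged over tasks, then molecules:
    L_MTR = (1/T) * sum_j (1/N) * sum_i (yhat_ij - y_ij)^2."""
    T, N = len(targets), len(targets[0])
    return sum(
        sum((preds[j][i] - targets[j][i]) ** 2 for i in range(N)) / N
        for j in range(T)
    ) / T

# Two toy tasks, two molecules each (the real setup has 200 tasks).
targets = mean_normalize([[1.0, 3.0], [10.0, 30.0]])  # [[-1, 1], [-10, 10]]
preds = [[0.0, 0.0], [0.0, 0.0]]
loss = mtr_loss(preds, targets)  # (1 + 100) / 2 = 50.5
```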
<p><strong>Hyperparameter search</strong>: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.</p>
<h3 id="downstream-validation">Downstream Validation</h3>
<p><strong>Finetuning</strong>: Evaluated on 8 tasks from <strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).</p>
<p><strong>Analysis</strong>: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.</p>
<h2 id="key-performance-outcomes-and-scaling-realities">Key Performance Outcomes and Scaling Realities</h2>
<p><strong>Highly competitive performance</strong>: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks, though often by narrow margins, indicating that well-tuned task-specific baselines remain robust.</p>
<p><strong>MTR superiority</strong>: Models pretrained on Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained on MLM on every finetuning task evaluated. MTR is substantially slower than MLM due to the larger input size from the 200-element label vector, but MLM loss serves as a reliable proxy for MTR loss, enabling cheaper architecture search before committing to full MTR pretraining.</p>
<p><strong>Scaling laws versus downstream utility</strong>: Pretraining loss improved by 25-35% when increasing the dataset from 5M to 77M compounds. However, this improvement in pretraining loss does not uniformly transfer to downstream tasks. For MTR models, SR-p53 ROC-AUC decreases monotonically from 0.834 (5M) to 0.827 (10M) to 0.817 (77M), and Lipophilicity RMSE is worse at 77M (0.798) than at 5M (0.758), despite a dip at 10M (0.744). This variability in transfer challenges the assumption that pretraining improvements always yield downstream gains.</p>
<p><strong>Transfer learning</strong>: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The pretraining corpus is derived from <strong>PubChem</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>PubChem</td>
          <td>77M SMILES</td>
          <td>Canonicalized and globally shuffled. Subsets of 5M and 10M used. <strong>Note: Exact splits and datasets are not published.</strong></td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>PubChem</td>
          <td>100k SMILES</td>
          <td>A fixed set held out from the 77M corpus. <strong>Note: Exact 100k subset is not published.</strong></td>
      </tr>
      <tr>
          <td><strong>MTR Labels</strong></td>
          <td>RDKit</td>
          <td>200 props</td>
          <td>200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. <strong>Note: Calculated labels are not published and must be re-computed.</strong></td>
      </tr>
      <tr>
          <td><strong>Finetuning</strong></td>
          <td>MoleculeNet</td>
          <td>1.5k - 8k</td>
          <td>Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pretraining Objectives:</strong></p>
<ol>
<li><strong>Masked Language Modeling (MLM)</strong>: Follows RoBERTa procedure. Masks 15% of tokens. Max sequence length 512.</li>
<li><strong>Multi-Task Regression (MTR)</strong>: Predicting 200 RDKit properties. Labels are mean-normalized.</li>
</ol>
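<p>The 15% masking step can be sketched as follows. This is a simplification: the full RoBERTa procedure additionally replaces some selected positions with random tokens or leaves them unchanged, which is omitted here:</p>

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Select ~15% of positions, replace them with a mask token, and
    return the corrupted sequence plus the prediction targets.
    Simplified relative to full RoBERTa masking (no random-token or
    keep-unchanged branches)."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = mask_token
    return corrupted, targets

# A caffeine SMILES, character-tokenized (28 tokens -> 4 masked):
tokens = list("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")
corrupted, targets = mask_tokens(tokens)
```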
<p><strong>Tokenizer:</strong></p>
<ul>
<li>Dictionary of common SMILES characters</li>
<li>Maximum vocabulary size: <strong>591 tokens</strong></li>
</ul>
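<p>A minimal sketch of such a character-level vocabulary, capped at the paper&rsquo;s 591-token limit (real tokenizers also reserve special tokens such as <code>[CLS]</code>, <code>[PAD]</code>, and <code>[MASK]</code>, which this toy version omits):</p>

```python
from collections import Counter

def build_smiles_vocab(smiles_list, max_size=591):
    """Build a character-level vocabulary from SMILES strings,
    most frequent characters first, capped at max_size (591 in
    ChemBERTa-2)."""
    counts = Counter(ch for s in smiles_list for ch in s)
    chars = [c for c, _ in counts.most_common(max_size)]
    return {ch: i for i, ch in enumerate(chars)}

# Three toy SMILES: ethanol, phenol, acetic acid
vocab = build_smiles_vocab(["CCO", "c1ccccc1O", "CC(=O)O"])
ids = [vocab[ch] for ch in "CCO"]
```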
<p><strong>Optimization:</strong></p>
<ul>
<li><strong>Patience</strong>: Early stopping set to one pass through the dataset to ensure full coverage</li>
<li><strong>Hyperparameter search</strong>: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate. <strong>Note: The precise configuration of the winning models that were scaled to 77M is absent from the paper.</strong></li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Based on <strong>RoBERTa</strong> (HuggingFace implementation)</li>
<li><strong>Parameter scale</strong>: Models ranged between <strong>5M and 46M parameters</strong></li>
<li><strong>Selection</strong>: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset</li>
<li><strong>Checkpoints</strong>: Pre-trained weights are hosted by DeepChem on <a href="https://huggingface.co/DeepChem">Hugging Face</a>. Direct links include <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MTR">DeepChem/ChemBERTa-77M-MTR</a> and <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MLM">DeepChem/ChemBERTa-77M-MLM</a> (Note: Model cards are currently empty).</li>
<li><strong>Code Reference</strong>: While the <a href="https://github.com/deepchem/deepchem">DeepChem</a> repository is referenced for code, isolated training scripts tailored to recreate ChemBERTa-2&rsquo;s exact pipeline are not separated from the generalized deepchem library tooling.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Benchmarks were performed on <strong>MoleculeNet</strong> using DeepChem.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Tasks</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RMSE</strong> ($\downarrow$)</td>
          <td>Delaney, Lipo, BACE (Reg), Clearance</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.8).</td>
      </tr>
      <tr>
          <td><strong>ROC-AUC</strong> ($\uparrow$)</td>
          <td>BBBP, ClinTox, HIV, Tox21, BACE (Cls)</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: AWS EC2 instances with <strong>Nvidia T4 GPUs</strong></li>
<li><strong>Strategy</strong>: AWS Spot instances were used to reduce cost; implemented frequent checkpointing to handle interruptions.</li>
<li><strong>Note</strong>: For MTR, they wrote a custom data loader wrapper around HuggingFace&rsquo;s text loader to handle CSV parsing efficiency, as the default CSV loader was a major bottleneck for the 200-element target vectors.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. <em>arXiv preprint arXiv:2209.01712</em>. <a href="https://doi.org/10.48550/arXiv.2209.01712">https://doi.org/10.48550/arXiv.2209.01712</a></p>
<p><strong>Publication</strong>: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa-1 Paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{ahmadChemBERTa2ChemicalFoundation2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}: {{Towards Chemical Foundation Models}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa: Molecular Property Prediction via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</guid><description>A systematic evaluation of RoBERTa transformers pretrained on 77M PubChem SMILES for molecular property prediction tasks.</description><content:encoded><![CDATA[<h2 id="taxonomy-and-paper-contributions">Taxonomy and Paper Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$), with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>It is a methodological investigation because it systematically evaluates a specific architecture (Transformers/RoBERTa) against established State-of-the-Art (SOTA) baselines like directed Message Passing Neural Networks (D-MPNNs) to determine &ldquo;how well does this work?&rdquo; in the chemical domain. It ablates dataset size, tokenization, and input representation.</p>
<p>It is also a resource paper as it introduces &ldquo;PubChem-77M,&rdquo; a curated dataset of 77 million SMILES strings designed to facilitate large-scale self-supervised pretraining for the community.</p>
<h2 id="overcoming-data-scarcity-in-property-prediction">Overcoming Data Scarcity in Property Prediction</h2>
<p>The primary motivation is <strong>data scarcity</strong> in molecular property prediction. Graph Neural Networks (GNNs) achieve strong performance on property prediction tasks when provided with sufficient labeled data. Generating these labels requires costly and time-consuming laboratory testing, leading to severe data scarcity in specialized chemical domains.</p>
<p>Massive quantities of <strong>unlabeled chemical structure data</strong> exist in the form of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. Inspired by the success of Transformers in NLP, where self-supervised pretraining on large corpora yields strong transfer learning, the authors aim to use these unlabeled datasets to learn effective molecular representations. Additionally, Transformers benefit from a mature software ecosystem (HuggingFace) that offers efficiency advantages over GNNs.</p>
<h2 id="pretraining-scaling-laws-and-novelty">Pretraining Scaling Laws and Novelty</h2>
<p>Previous works applied Transformers to SMILES strings. This paper advances the field by systematically evaluating scaling laws and architectural components for this domain. Specifically:</p>
<ul>
<li><strong>Scaling Analysis</strong>: It explicitly tests how pretraining dataset size (100K to 10M) impacts downstream performance.</li>
<li><strong>Tokenizer Comparison</strong>: It compares standard NLP <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte-Pair Encoding (BPE)</a> against a chemically-aware &ldquo;SmilesTokenizer&rdquo;.</li>
<li><strong>Representation Comparison</strong>: It evaluates if the robust <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representation offers advantages over standard SMILES in a Transformer context.</li>
</ul>
<h2 id="experimental-setup-pretraining-and-finetuning">Experimental Setup: Pretraining and Finetuning</h2>
<p>The authors trained <strong>ChemBERTa</strong> (based on RoBERTa) using Masked Language Modeling (MLM) on subsets of the <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> dataset. The core training objective minimizes the cross-entropy loss over a corrupted input where a subset of basic tokens, denoted by $\mathcal{M}$, are masked:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $x_i$ is the exact masked token, $x_{\setminus \mathcal{M}}$ is the corrupted SMILES context string, and $\theta$ represents the network parameters.</p>
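<p>As a toy illustration of this objective (a sketch, not the authors&rsquo; implementation, which uses HuggingFace&rsquo;s standard RoBERTa MLM machinery), the loss simply averages the negative log-probability of each masked token given the corrupted string:</p>

```python
import math
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Corrupt a token sequence by masking ~15% of positions (the set M)."""
    rng = random.Random(seed)
    corrupted, masked = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            corrupted[i] = mask_token
            masked.append(i)
    if not masked:  # guarantee at least one masked token for this sketch
        corrupted[0], masked = mask_token, [0]
    return corrupted, masked

def mlm_loss(token_log_probs, masked_positions):
    """L_MLM: average negative log-likelihood over the masked positions."""
    return -sum(token_log_probs[i] for i in masked_positions) / len(masked_positions)

tokens = list("CC(=O)Oc1ccccc1C(=O)O")  # aspirin SMILES, one character per token
corrupted, masked = mask_tokens(tokens)
# Suppose the model assigns probability 0.8 to every true masked token:
loss = mlm_loss({i: math.log(0.8) for i in masked}, masked)  # -log(0.8) ≈ 0.223
```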
<ul>
<li><strong>Pretraining</strong>: Models were pretrained on dataset sizes of 100K, 250K, 1M, and 10M compounds.</li>
<li><strong>Baselines</strong>: Performance was compared against D-MPNN (Graph Neural Network), Random Forest (RF), and SVM using 2048-bit Morgan Fingerprints.</li>
<li><strong>Downstream Tasks</strong>: Finetuning was performed individually on small <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks: BBBP (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>), ClinTox (clinical toxicity), HIV, and Tox21 (p53 stress-response). This poses a transfer learning challenge, as the model must adapt from pretraining on 10 million molecules to classifying datasets ranging from ~1.5K to ~41K examples.</li>
<li><strong>Ablations</strong>:
<ul>
<li><strong>Tokenization</strong>: BPE vs. SmilesTokenizer on the 1M dataset, evaluated on Tox21.</li>
<li><strong>Input</strong>: SMILES vs. SELFIES strings on the Tox21 task.</li>
</ul>
</li>
</ul>
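<p>A minimal sketch of the kind of atom-level splitting the SmilesTokenizer performs, using the widely cited SMILES tokenization regex (the actual DeepChem tokenizer additionally manages a vocabulary file and special tokens). Unlike naive BPE merges, multi-character chemical tokens such as <code>Cl</code>, <code>Br</code>, and bracket atoms stay intact:</p>

```python
import re

# Widely used SMILES tokenization pattern; bracket atoms and two-letter
# elements are matched before single characters.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_REGEX.findall(smiles)

print(tokenize_smiles("ClCCBr"))   # -> ['Cl', 'C', 'C', 'Br']
print(tokenize_smiles("CC(=O)O"))  # -> ['C', 'C', '(', '=', 'O', ')', 'O']
```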
<h2 id="results-vs-graph-neural-network-baselines">Results vs. Graph Neural Network Baselines</h2>
<p>The main comparison between ChemBERTa (pretrained on 10M compounds) and Chemprop baselines on MoleculeNet tasks is summarized below (Table 1 from the paper):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>BBBP PRC</th>
          <th>ClinTox ROC</th>
          <th>ClinTox PRC</th>
          <th>HIV ROC</th>
          <th>HIV PRC</th>
          <th>Tox21 ROC</th>
          <th>Tox21 PRC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemBERTa 10M</td>
          <td>0.643</td>
          <td>0.620</td>
          <td>0.733</td>
          <td>0.975</td>
          <td>0.622</td>
          <td>0.119</td>
          <td>0.728</td>
          <td>0.207</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.708</td>
          <td>0.697</td>
          <td>0.906</td>
          <td>0.993</td>
          <td>0.752</td>
          <td>0.152</td>
          <td>0.688</td>
          <td>0.429</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>0.681</td>
          <td>0.692</td>
          <td>0.693</td>
          <td>0.968</td>
          <td>0.780</td>
          <td>0.383</td>
          <td>0.724</td>
          <td>0.335</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>0.702</td>
          <td>0.724</td>
          <td>0.833</td>
          <td>0.986</td>
          <td>0.763</td>
          <td>0.364</td>
          <td>0.708</td>
          <td>0.345</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Scaling Improvements &amp; Training Dynamics</strong>: Performance scales predictably with pretraining data size. Increasing data from 100K to 10M improved ROC-AUC by +0.110 and PRC-AUC by +0.059 on average across BBBP, ClinTox, and Tox21 (HIV was omitted due to resource constraints). Notably, the authors halted pretraining on the 10M subset after just 3 epochs due to overfitting, suggesting that simple 15% token masking may not provide a sufficiently challenging learning signal for large-scale chemical representation learning.</li>
<li><strong>Performance Limits vs. GNNs</strong>: ChemBERTa generally performs below the D-MPNN baseline. On the Tox21 dataset, ChemBERTa-10M achieved a higher ROC-AUC (0.728) than D-MPNN (0.688); nonetheless, it recorded a substantially lower PRC-AUC (0.207 vs 0.429). This gap indicates that current Transformer iterations lack the explicit inductive biases of graph algorithms and struggle with the severe class imbalances typical of chemical datasets.</li>
<li><strong>Ablation Limitations (Tokenization &amp; SELFIES)</strong>: The authors&rsquo; ablation studies for tokenization (SmilesTokenizer narrowly beating BPE) and input representation (SELFIES performing comparably to SMILES) were evaluated exclusively on the single Tox21 task. Deriving broad architectural conclusions regarding &ldquo;semantically-aware tokenization&rdquo; or string robustness from an $N=1$ empirical evaluation is a significant limitation of the study. Broader benchmarking is required to validate these findings.</li>
<li><strong>Interpretability</strong>: Attention heads organically learn to track chemically relevant substructures (like specific functional groups and aromatic rings), mimicking the inductive biases of graph convolutions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors curated a massive dataset for pretraining and utilized standard benchmarks for evaluation.</p>
<ul>
<li><strong>Pretraining Data</strong>: <strong>PubChem-77M</strong>.
<ul>
<li>Source: 77 million unique SMILES from PubChem.</li>
<li>Preprocessing: Canonicalized and globally shuffled.</li>
<li>Subsets used: 100K, 250K, 1M, and 10M subsets.</li>
<li><em>Availability Note</em>: The authors provided a direct link to the <a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">canonicalized 10M compound subset</a> used for their largest experiments. Full reproducibility of the smaller (100K, 250K, 1M) or full 77M sets may require re-extracting from PubChem.</li>
</ul>
</li>
<li><strong>Evaluation Data</strong>: <strong>MoleculeNet</strong>.
<ul>
<li>Tasks: BBBP (2,039), ClinTox (1,478), HIV (41,127), Tox21 (7,831).</li>
<li>Splitting: 80/10/10 train/valid/test split using a <strong>scaffold splitter</strong> to ensure chemical diversity between splits.</li>
</ul>
</li>
</ul>
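<p>A simplified sketch of scaffold splitting, assuming Bemis-Murcko scaffold strings have already been computed for each molecule (in practice via RDKit); the greedy largest-group-first assignment mirrors the spirit of DeepChem&rsquo;s <code>ScaffoldSplitter</code>. Whole scaffold groups go to a single split, so test molecules are structurally dissimilar from training ones:</p>

```python
from collections import defaultdict

def scaffold_split(mol_ids, scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split: largest scaffold groups are assigned first."""
    groups = defaultdict(list)
    for mol_id, scaffold in zip(mol_ids, scaffolds):
        groups[scaffold].append(mol_id)
    # Largest groups first, so train absorbs the most common scaffolds.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mol_ids)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

mols = list(range(10))
scafs = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train, valid, test = scaffold_split(mols, scafs)
```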
<h3 id="algorithms">Algorithms</h3>
<p>The core training methodology mirrors standard BERT/RoBERTa procedures adapted for chemical strings.</p>
<ul>
<li><strong>Objective</strong>: Masked Language Modeling (MLM) with <strong>15% token masking</strong>.</li>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>BPE</strong>: Byte-Pair Encoder (vocab size 52K).</li>
<li><strong>SmilesTokenizer</strong>: Regex-based custom tokenizer available in DeepChem (documented <a href="https://deepchem.readthedocs.io/en/latest/tokenizers.html#smilestokenizer">here</a>).</li>
</ul>
</li>
<li><strong>Sequence Length</strong>: Maximum sequence length of <strong>512 tokens</strong>.</li>
<li><strong>Finetuning</strong>: Appended a linear classification layer; backpropagated through the base model for up to 25 epochs with early stopping on ROC-AUC.</li>
</ul>
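<p>The early-stopping rule can be sketched as a small helper over per-epoch validation ROC-AUC scores (the <code>patience</code> value here is an assumed illustration; the paper does not specify one):</p>

```python
def early_stop_epoch(val_scores, patience=3, max_epochs=25):
    """Return the 1-indexed epoch at which training stops: either the first
    epoch after `patience` epochs without improvement in the validation
    metric (here ROC-AUC), or the cap `max_epochs`."""
    best, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return min(len(val_scores), max_epochs)

# ROC-AUC peaks at epoch 4 -> stop at epoch 7 (4 + patience)
print(early_stop_epoch([0.60, 0.68, 0.71, 0.73, 0.72, 0.73, 0.71]))  # -> 7
```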
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: <strong>RoBERTa</strong> (via HuggingFace).
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 12 per layer (72 heads in total across the 6 layers).</li>
<li><em>Implementation Note</em>: The original training notebooks and scripts are maintained in the authors&rsquo; <a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry repository</a>, alongside the primary downstream tasks integrated into DeepChem. A <a href="https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Transfer_Learning_With_ChemBERTa_Transformers.ipynb">full Tox21 transfer learning tutorial</a> has been incorporated into the DeepChem repository.</li>
</ul>
</li>
<li><strong>Baselines</strong> (via Chemprop library):
<ul>
<li><strong>D-MPNN</strong>: Directed Message Passing Neural Network with default hyperparameters.</li>
<li><strong>RF/SVM</strong>: Scikit-learn Random Forest and SVM using 2048-bit Morgan fingerprints (<a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured with two complementary metrics to account for the class imbalance common in toxicity datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ROC-AUC</strong></td>
          <td>Area Under Receiver Operating Characteristic Curve</td>
      </tr>
      <tr>
          <td><strong>PRC-AUC</strong></td>
          <td>Area Under Precision-Recall Curve (vital for imbalanced data)</td>
      </tr>
  </tbody>
</table>
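<p>ROC-AUC equals the probability that a randomly chosen positive is ranked above a randomly chosen negative (the normalized Mann-Whitney U statistic), which is why it can remain respectable on heavily imbalanced data even as PRC-AUC collapses. A dependency-free sketch:</p>

```python
def roc_auc(labels, scores):
    """ROC-AUC as the normalized Mann-Whitney U statistic.

    Counts, over all positive/negative pairs, how often the positive
    outranks the negative (ties count half). O(n^2) — fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ranked correctly:
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```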
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Single <strong>NVIDIA V100 GPU</strong>.</li>
<li><strong>Training Time</strong>: Approximately <strong>48 hours</strong> for the 10M compound subset.</li>
<li><strong>Carbon Footprint</strong>: Estimated 17.1 kg $\text{CO}_2\text{eq}$ (offset by Google Cloud).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training notebooks and finetuning scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration of ChemBERTa and SmilesTokenizer</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">ChemBERTa-zinc-base-v1</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained RoBERTa on 100K ZINC SMILES</td>
      </tr>
      <tr>
          <td><a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">PubChem-10M subset</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Canonicalized 10M compound subset used for largest experiments</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code and pre-trained models are available, and the 10M pretraining subset is downloadable. However, smaller subsets (100K, 250K, 1M) may need re-extraction from PubChem, and exact hyperparameter details for finetuning (learning rate, batch size) are not fully specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. <em>arXiv preprint arXiv:2010.09885</em>. <a href="https://doi.org/10.48550/arXiv.2010.09885">https://doi.org/10.48550/arXiv.2010.09885</a></p>
<p><strong>Publication</strong>: arXiv 2020 (Preprint)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">HuggingFace Model Hub (ChemBERTa-zinc-base-v1)</a> - <em>Additional pre-trained variations on PubChem &amp; ZINC datasets are available on the author&rsquo;s <a href="https://huggingface.co/seyonec">seyonec</a> HF profile.</em></li>
<li><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry GitHub Repository</a> - <em>Notebooks and scripts used for MLM pretraining and finetuning evaluations.</em></li>
</ul>
<h3 id="bibtex">BibTeX</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{chithranandaChemBERTaLargeScaleSelfSupervised2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa}}: {{Large-Scale Self-Supervised Pretraining}} for {{Molecular Property Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Translating InChI to IUPAC Names with Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/</guid><description>Sequence-to-sequence Transformer translating InChI identifiers to IUPAC names with 91% accuracy on organic compounds.</description><content:encoded><![CDATA[<h2 id="primary-contribution-a-transformer-based-method">Primary Contribution: A Transformer-Based Method</h2>
<p>This is primarily a <strong>Method</strong> paper. It adapts a specific architecture (Transformer) to a specific task (InChI-to-IUPAC translation) and evaluates its performance against both machine learning and commercial baselines. It also has a secondary <strong>Resource</strong> contribution, as the trained model and scripts are released as open-source software.</p>
<h2 id="motivation-the-bottleneck-in-algorithmic-iupac-nomenclature">Motivation: The Bottleneck in Algorithmic IUPAC Nomenclature</h2>
<p>Generating correct IUPAC names is difficult due to the comprehensive but complex rules defined by the International Union of Pure and Applied Chemistry. Commercial software generates names from structures but remains closed-source with opaque methodologies and frequent inter-package disagreements. Open identifiers like InChI and SMILES lack direct human readability. This creates a need for an open, automated method to generate informative IUPAC names from standard identifiers like InChI, which are ubiquitous in online chemical databases.</p>
<h2 id="novelty-treating-chemical-translation-as-a-character-level-sequence">Novelty: Treating Chemical Translation as a Character-Level Sequence</h2>
<p>The key novelty is treating chemical nomenclature translation as a character-level sequence-to-sequence problem using a Transformer architecture, specifically using <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> as the source language.</p>
<ul>
<li>Standard Neural Machine Translation (NMT) uses sub-word tokenization. This model processes InChI and predicts IUPAC names character-by-character.</li>
<li>It demonstrates that character-level tokenization outperforms both byte-pair encoding and unigram models for this specific chemical task.</li>
<li>It uses InChI&rsquo;s standardization to avoid the canonicalization issues inherent in SMILES-based approaches.</li>
<li>The attention mechanism allows the decoder to align specific parts of the generated IUPAC name with corresponding structural features in the source InChI string, operating via the standard scaled dot-product attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$</li>
</ul>
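<p>A dependency-free sketch of the scaled dot-product attention defined above, with toy 2-dimensional queries and keys (real implementations operate on batched tensors in a framework like PyTorch):</p>

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    exps = [math.exp(x - max(row)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D lists of floats."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

Q = [[1.0, 0.0]]                      # one query
K = [[1.0, 0.0], [0.0, 1.0]]          # two keys
V = [[1.0, 2.0], [3.0, 4.0]]          # two values
out = attention(Q, K, V)              # output weighted toward the first value row
```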
<h2 id="methodology--experimental-validation">Methodology &amp; Experimental Validation</h2>
<ul>
<li><strong>Training:</strong> The model was trained on 10 million InChI/IUPAC pairs sampled from PubChem using a character-level objective. The model is supervised using categorical cross-entropy loss across the vocabulary of characters:
$$ \mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) $$</li>
<li><strong>Ablation Studies:</strong> The authors experimentally validated architecture choices, finding that LSTM models and sub-word tokenization (BPE) performed worse than the Transformer with character tokenization. They also optimized dropout rates.</li>
<li><strong>Performance Benchmarking:</strong> The model was evaluated on a held-out test set of 200,000 samples. Performance was quantified primarily by Whole-Name Accuracy and Normalized Edit Distance (based on the Damerau-Levenshtein distance, scaled by the maximum string length).</li>
<li><strong>Commercial Comparison:</strong> The authors compared their model against four major commercial packages (ACD/I-Labs, ChemAxon, Mestrelab, and PubChem&rsquo;s Lexichem). However, this evaluation used a test set of only 100 molecules, limiting the statistical confidence of the comparison with external baselines.</li>
<li><strong>Error Analysis:</strong> They analyzed performance across different chemical classes (organics, charged species, macrocycles, inorganics) and visualized attention coefficients to interpret model focus.</li>
</ul>
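<p>The edit-distance metric can be sketched as the restricted Damerau-Levenshtein distance (insertions, deletions, substitutions, and adjacent transpositions), scaled by the longer string&rsquo;s length; the exact normalization used in the paper is assumed here:</p>

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance via dynamic programming."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def normalized_edit_distance(a: str, b: str) -> float:
    """Scale to [0, 1] by the longer string's length (0 = exact match)."""
    if not a and not b:
        return 0.0
    return damerau_levenshtein(a, b) / max(len(a), len(b))

print(damerau_levenshtein("butan", "buatn"))  # -> 1 (one adjacent transposition)
```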
<h2 id="key-results-and-the-inorganic-challenge">Key Results and the Inorganic Challenge</h2>
<ul>
<li><strong>High Accuracy on Organics:</strong> The model achieved 91% whole-name accuracy on the test set, performing particularly well on organic compounds.</li>
<li><strong>Comparable to Commercial Tools:</strong> On the limited 100-molecule benchmark, the edit distance between the model&rsquo;s predictions and commercial packages (15-23%) was similar to the variation found <em>between</em> the commercial packages themselves (16-21%).</li>
<li><strong>Limitations on Inorganics:</strong> The model performed poorly on inorganic (14% accuracy) and organometallic compounds (20% accuracy). This is attributed to inherent data limitations in the standard InChI format (which deliberately disconnects metal atoms from their ligands) and low training data coverage for those classes.</li>
<li><strong>Character-Level Superiority:</strong> Character-level tokenization was found to be essential; byte-pair encoding reduced accuracy significantly.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was derived from <a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem&rsquo;s public FTP server</a> (<code>CID-SMILES.gz</code> and <code>CID-IUPAC.gz</code>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Raw</strong></td>
          <td>PubChem</td>
          <td>100M pairs</td>
          <td>Filtered for length (InChI &lt; 200 chars, IUPAC &lt; 150 chars). 132k unparseable SMILES dropped.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Subsampled</td>
          <td>10M pairs</td>
          <td>Random sample from the filtered set.</td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>Held-out</td>
          <td>10,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>Held-out</td>
          <td>200,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Tokenization</strong></td>
          <td>Vocab</td>
          <td>InChI: 66 chars<br>IUPAC: 70 chars</td>
          <td>Character-level tokenization. Spaces treated as tokens.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: OpenNMT-py 2.0.0 (using PyTorch). Training scripts and vocabularies are available as supplementary files to the original publication. Pre-trained model weights are hosted on <a href="https://doi.org/10.5281/zenodo.5081159">Zenodo</a>.</li>
<li><strong>Architecture Type</strong>: Transformer Encoder-Decoder.</li>
<li><strong>Optimization</strong>: ADAM optimizer ($\beta_1=0.9, \beta_2=0.998$).</li>
<li><strong>Learning Rate</strong>: Linear warmup over 8000 steps to 0.0005, then decayed by inverse square root of iteration.</li>
<li><strong>Regularization</strong>:
<ul>
<li>Dropout: 0.1 (applied to dense and attentional layers).</li>
<li>Label Smoothing: Magnitude 0.1.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Teacher forcing used for both training and validation.</li>
<li><strong>Gradient Accumulation</strong>: Gradients accumulated over 4 batches before updating parameters.</li>
<li><strong>Inference</strong>: Beam search with width 10 and length penalty 1.0.</li>
</ul>
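<p>The learning-rate schedule above can be sketched as a small function: linear warmup to the peak rate over 8000 steps, then inverse-square-root decay (the standard &ldquo;Noam&rdquo;-style shape used by OpenNMT-py):</p>

```python
import math

def lr_at(step: int, peak: float = 5e-4, warmup: int = 8000) -> float:
    """Linear warmup to `peak` over `warmup` steps, then decay ~ 1/sqrt(step)."""
    if step <= warmup:
        return peak * step / warmup
    return peak * math.sqrt(warmup / step)

mid_warmup = lr_at(4000)   # half the peak: 0.00025
peak_rate = lr_at(8000)    # the peak: 0.0005
decayed = lr_at(32000)     # peak * sqrt(8000/32000) = 0.00025
```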
<h3 id="models">Models</h3>
<ul>
<li><strong>Structure</strong>: 6 layers in encoder, 6 layers in decoder.</li>
<li><strong>Attention</strong>: 8 heads per attention sub-layer.</li>
<li><strong>Dimensions</strong>:
<ul>
<li>Feed-forward hidden state size: 2048.</li>
<li>Embedding vector length: 512.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Glorot&rsquo;s method.</li>
<li><strong>Position</strong>: Positional encoding added to word vectors.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported include <strong>Whole-Name Accuracy</strong> (percentage of exact matches) and <strong>Normalized Edit Distance</strong> (Damerau-Levenshtein, scale 0-1).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy (All)</td>
          <td>91%</td>
          <td>N/A</td>
          <td>Test set of 200k samples.</td>
      </tr>
      <tr>
          <td>Accuracy (Inorganic)</td>
          <td>14%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Organometallic)</td>
          <td>20%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Charged)</td>
          <td>79%</td>
          <td>N/A</td>
          <td>Test set subset.</td>
      </tr>
      <tr>
          <td>Accuracy (Rajan)</td>
          <td>72%</td>
          <td>N/A</td>
          <td>Comparative ML model (STOUT).</td>
      </tr>
      <tr>
          <td>Edit Dist (Organic)</td>
          <td>$0.02 \pm 0.03$</td>
          <td>N/A</td>
          <td>Very high similarity for organics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Inorganic)</td>
          <td>$0.32 \pm 0.20$</td>
          <td>N/A</td>
          <td>Poor performance on inorganics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Organometallic)</td>
          <td>$0.37 \pm 0.24$</td>
          <td>N/A</td>
          <td>Poor performance on organometallics.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla K80.</li>
<li><strong>Training Time</strong>: 7 days.</li>
<li><strong>Throughput</strong>: ~6000 tokens/sec (InChI) and ~3800 tokens/sec (IUPAC).</li>
<li><strong>Batch Size</strong>: 4096 tokens (approx. 30 compounds).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5081159">InChI to IUPAC model</a></td>
          <td>Model</td>
          <td>CC BY 4.0</td>
          <td>Pre-trained Transformer weights (551 MB), requires OpenNMT-py 2.0.0</td>
      </tr>
      <tr>
          <td><a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem FTP</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source data: CID-SMILES.gz and CID-IUPAC.gz</td>
      </tr>
      <tr>
          <td>Training scripts &amp; vocabularies</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Included as supplementary files with the publication</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Handsel, J., Matthews, B., Knight, N. J., &amp; Coles, S. J. (2021). Translating the InChI: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier. <em>Journal of Cheminformatics</em>, 13(1), 79. <a href="https://doi.org/10.1186/s13321-021-00535-x">https://doi.org/10.1186/s13321-021-00535-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{handselTranslatingInChIAdapting2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Translating the {{InChI}}: Adapting Neural Machine Translation to Predict {{IUPAC}} Names from a Chemical Identifier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Translating the {{InChI}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Handsel, Jennifer and Matthews, Brian and Knight, Nicola J. and Coles, Simon J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00535-x}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine&#39;s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91\%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention,GPU,InChI,IUPAC,seq2seq,Transformer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Struct2IUPAC: Translating SMILES to IUPAC via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/</guid><description>A Transformer-based model for translating between SMILES strings and IUPAC names, trained on 47M PubChem examples, achieving 98.9% accuracy with verification.</description><content:encoded><![CDATA[<h2 id="struct2iupac-as-a-methodological-shift">Struct2IUPAC as a Methodological Shift</h2>
<p>This is primarily a <strong>Method</strong> paper with significant elements of <strong>Position</strong>.</p>
<ul>
<li><strong>Method</strong>: The authors propose a specific neural architecture (Transformer with custom tokenization) and a verification pipeline (round-trip check) to solve the SMILES $\leftrightarrow$ IUPAC translation task. They rigorously benchmark this against rule-based baselines (OPSIN).</li>
<li><strong>Position</strong>: The authors explicitly argue for a paradigm shift, suggesting that &ldquo;heavy&rdquo; neural architectures should replace complex, costly rule-based legacy systems even for &ldquo;exact&rdquo; algorithmic tasks.</li>
</ul>
<h2 id="the-cost-of-rule-based-chemical-naming">The Cost of Rule-Based Chemical Naming</h2>
<ul>
<li><strong>Complexity of Naming</strong>: Generating IUPAC names manually is error-prone and requires deep algorithmic knowledge.</li>
<li><strong>Lack of Open Source Tools</strong>: While open-source tools exist for Name-to-Structure (e.g., OPSIN), there were no open-source tools for the inverse &ldquo;Structure-to-Name&rdquo; conversion at the time of writing.</li>
<li><strong>Cost of Development</strong>: Developing rule-based converters &ldquo;from scratch&rdquo; is prohibitively expensive and time-consuming compared to training a neural model on existing data.</li>
</ul>
<h2 id="struct2iupac-core-innovation">Struct2IUPAC Core Innovation</h2>
<ul>
<li><strong>Struct2IUPAC</strong>: The first effective open-source neural model for <a href="/notes/chemistry/molecular-representations/name-translation/stout-v2/">converting SMILES to IUPAC names</a>, treating chemical translation as a Neural Machine Translation (NMT) problem.</li>
<li><strong>Verification Loop</strong>: A novel inference pipeline that generates multiple candidates via beam search and validates them using a reverse converter (OPSIN) to ensure the generated name maps back to the original structure.</li>
<li><strong>Custom Tokenization</strong>: A manually curated rule-based tokenizer for IUPAC names that handles specific chemical suffixes, prefixes, and stereochemical markers.</li>
</ul>
<h2 id="experimental-setup-and-stress-testing">Experimental Setup and Stress Testing</h2>
<ul>
<li><strong>Accuracy Benchmarking</strong>: The models were tested on a held-out subset of 100,000 molecules from PubChem. The authors measured accuracy across different beam sizes (1, 3, 5).</li>
<li><strong>Comparison to Rules</strong>: The neural IUPAC2Struct model was compared directly against the rule-based OPSIN tool.</li>
<li><strong>Stress Testing</strong>:
<ul>
<li><strong>Sequence Length</strong>: Evaluated performance across varying token lengths, identifying a &ldquo;sweet spot&rdquo; (10-60 tokens) and failure modes for very short (e.g., methane) or long molecules.</li>
<li><strong>Stereochemistry</strong>: Tested on &ldquo;stereo-dense&rdquo; compounds. The authors define a &ldquo;stereo-density&rdquo; index ($I$) as the ratio of stereocenters ($S$) to total tokens ($N$):
$$I = \frac{S}{N}$$
They observed a performance drop for these dense molecules, though the model still handled many stereocenters robustly.</li>
<li><strong>Tautomers</strong>: Verified the model&rsquo;s ability to handle different tautomeric forms (e.g., Guanine and Uracil variants).</li>
</ul>
</li>
<li><strong>Latency Analysis</strong>: Benchmarked inference speeds on CPU vs. GPU relative to output sequence length.</li>
</ul>
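The stereo-density index is straightforward to compute from a tokenized SMILES string. A minimal sketch follows; the regex-based token split here is an illustrative assumption, not the paper's actual tokenizer:

```python
import re

def stereo_density(smiles: str) -> float:
    """Stereo-density I = S / N: stereocenter markers over total tokens."""
    # Character-level split that keeps two-character tokens (Cl, Br, @@) intact.
    tokens = re.findall(r"Cl|Br|@@|.", smiles)
    stereocenters = sum(1 for tok in tokens if tok in ("@", "@@"))
    return stereocenters / len(tokens) if tokens else 0.0
```

For example, `C[C@H](O)[C@@H](N)C` yields 2 stereocenter markers over 18 tokens, so $I = 1/9$.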
<h2 id="benchmarks-and-outcomes">Benchmarks and Outcomes</h2>
<ul>
<li><strong>High Accuracy</strong>: The Struct2IUPAC model achieved <strong>98.9% accuracy</strong> (Beam 5 with verification). The reverse model (IUPAC2Struct) achieved <strong>99.1%</strong>, comparable to OPSIN&rsquo;s 99.4%.</li>
<li><strong>Distribution Modeling vs. Intuition</strong>: The authors claim the model infers &ldquo;chemical logic,&rdquo; because it correctly generates multiple valid IUPAC names for single molecules where naming ambiguity exists (e.g., parent group selection). However, this more likely reflects the Transformer successfully modeling the high-frequency conditional probability distribution of synonymous names present in the PubChem training data, rather than learning intrinsic chemical rules.</li>
<li><strong>Production Readiness</strong>: Inference on GPU takes less than 0.5 seconds even for long names, making it viable for production use.</li>
<li><strong>Paradigm Shift</strong>: The authors conclude that neural networks are a viable, cost-effective alternative to developing rule-based algorithms for legacy notation conversion.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized the PubChem database.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total</strong></td>
          <td>PubChem</td>
          <td>~95M</td>
          <td>Filtered for RDKit compatibility</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Split A</td>
          <td>47,312,235</td>
          <td>Random 50% split</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Split B</td>
          <td>47,413,850</td>
          <td>Random 50% split</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Cleaning</strong>: Molecules that could not be processed by RDKit were removed. Molecules containing tokens not in the tokenizer (e.g., aromatic selenium) were excluded.</li>
<li><strong>Availability</strong>: A subset of 100,000 test molecules is available on GitHub (<code>data/test_100000.csv</code>) and Zenodo. The full train/test splits are not explicitly provided.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SMILES</strong>: Character-based tokenization.</li>
<li><strong>IUPAC</strong>: Custom rule-based tokenizer splitting suffixes (<code>-one</code>, <code>-al</code>), prefixes (<code>-oxy</code>, <code>-di</code>), and special symbols (<code>(</code>, <code>)</code>, <code>R(S)</code>).</li>
</ul>
</li>
<li><strong>Verification Step</strong>:
<ol>
<li>Generate $N$ names using Beam Search ($N=5$).</li>
<li>Reverse translate the candidate name using OPSIN.</li>
<li>Check if the OPSIN structure matches the original input SMILES.</li>
<li>Display the first verified match; otherwise, report failure.</li>
</ol>
</li>
</ul>
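The four-step verification loop can be expressed compactly. In the sketch below, `beam_search_names`, `opsin_to_smiles`, and `canonicalize` are hypothetical stand-ins for the trained model, the OPSIN converter, and a SMILES canonicalizer:

```python
from typing import Callable, List, Optional

def verified_name(smiles: str,
                  beam_search_names: Callable[[str, int], List[str]],
                  opsin_to_smiles: Callable[[str], Optional[str]],
                  canonicalize: Callable[[str], str],
                  n_beams: int = 5) -> Optional[str]:
    """Round-trip verification: return the first candidate IUPAC name that
    maps back to the original (canonical) structure, else None."""
    target = canonicalize(smiles)
    for name in beam_search_names(smiles, n_beams):  # 1. beam-search candidates
        back = opsin_to_smiles(name)                 # 2. reverse translate
        if back is not None and canonicalize(back) == target:  # 3. match check
            return name                              # 4. first verified match
    return None                                      # otherwise report failure
```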
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Standard Transformer with 6 encoder layers and 6 decoder layers.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Attention Heads: 8</li>
<li>Attention Dimension ($d_{\text{model}}$): 512</li>
<li>Feed-Forward Dimension ($d_{\text{ff}}$): 2048</li>
</ul>
</li>
<li><strong>Training Objective</strong>: The models were trained using standard autoregressive cross-entropy loss over the target token sequence $y$ given the input string $x$:
$$\mathcal{L} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x)$$</li>
<li><strong>Training</strong>: Two separate models were trained: <code>Struct2IUPAC</code> (SMILES $\to$ IUPAC) and <code>IUPAC2Struct</code> (IUPAC $\to$ SMILES).</li>
<li><strong>Availability</strong>: Code for model architecture is provided in the GitHub repository. Pre-trained weights for the IUPAC2Struct model are available, but the Struct2IUPAC model weights are not publicly released, meaning researchers would need to retrain that model on their own PubChem data to reproduce those results.</li>
</ul>
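The training objective above reduces to summing the negative log-probabilities the model assigns to the reference tokens. A pure-Python sketch with toy per-step distributions:

```python
import math

def autoregressive_nll(step_probs, target_ids):
    """L = -sum_t log P(y_t | y_<t, x), where step_probs[t] is the model's
    distribution over the vocabulary at decoding step t."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))

# Two decoding steps over a 3-token vocabulary:
probs = [[0.7, 0.2, 0.1],   # step 0: model favors token 0
         [0.1, 0.8, 0.1]]   # step 1: model favors token 1
loss = autoregressive_nll(probs, [0, 1])  # -(log 0.7 + log 0.8)
```

A model that is fully certain of every reference token attains zero loss, which is the minimum.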
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a random subset of 100,000 molecules from the test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Beam Size</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>1</td>
          <td>96.1%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>5</td>
          <td>98.9%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>1</td>
          <td>96.6%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>5</td>
          <td>99.1%</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Robustness</strong>: Accuracy drops significantly for augmented (non-canonical) SMILES (37.16%) and stereo-enriched compounds (66.52%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Infrastructure</strong>: 4 $\times$ Tesla V100 GPUs and 36 CPUs.</li>
<li><strong>Training Time</strong>: Approximately 10 days under full load.</li>
<li><strong>Inference Speed</strong>: &lt;0.5s per molecule on GPU; latency scales linearly with output sequence length.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sergsb/IUPAC2Struct">IUPAC2Struct (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Transformer code and pre-trained IUPAC2Struct model</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4280814">Test data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>100k test molecules, OPSIN failure cases, model failure cases</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/smiles2iupac">Struct2IUPAC web demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online interface for SMILES to IUPAC conversion</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krasnov, L., Khokhlov, I., Fedorov, M. V., &amp; Sosnin, S. (2021). Transformer-based artificial neural networks for the conversion between chemical notations. <em>Scientific Reports</em>, 11(1), 14798. <a href="https://doi.org/10.1038/s41598-021-94082-y">https://doi.org/10.1038/s41598-021-94082-y</a></p>
<p><strong>Publication</strong>: Scientific Reports 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krasnovTransformerbasedArtificialNeural2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Transformer-Based Artificial Neural Networks for the Conversion between Chemical Notations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krasnov, Lev and Khokhlov, Ivan and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14798}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-021-94082-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/sergsb/IUPAC2Struct">GitHub Repository</a></li>
<li><a href="https://app.syntelly.com/smiles2iupac">Web Demo</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT: SMILES to IUPAC Names via Neural Machine Translation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout/</guid><description>A deep-learning neural machine translation approach to translate between SMILES strings and IUPAC names using the STOUT model.</description><content:encoded><![CDATA[<h2 id="contribution-translating-chemistry-as-a-language">Contribution: Translating Chemistry as a Language</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong secondary contribution as a <strong>Resource</strong> paper.</p>
<ul>
<li><strong>Method</strong>: It proposes a neural machine translation (NMT) architecture to approximate the complex, rule-based algorithm of IUPAC naming, treating it as a language translation task.</li>
<li><strong>Resource</strong>: It provides an open-source tool and trained models to the community, addressing a gap where such functionality was previously limited to proprietary software.</li>
</ul>
<h2 id="motivation-democratizing-iupac-nomenclature">Motivation: Democratizing IUPAC Nomenclature</h2>
<p>The International Union of Pure and Applied Chemistry (IUPAC) naming scheme is universally accepted but algorithmically complex. Generating these names correctly is challenging for humans, and automated generation is largely missing from major open-source toolkits like CDK, RDKit, or Open Babel. While reliable commercial tools exist (e.g., ChemAxon&rsquo;s <code>molconvert</code>), there was a lack of open-source alternatives for the scientific community. STOUT aims to fill this gap using a data-driven approach.</p>
<h2 id="core-innovation-sequence-to-sequence-naming">Core Innovation: Sequence-to-Sequence Naming</h2>
<ul>
<li><strong>Language Translation Approach</strong>: The authors treat chemical representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>/<a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) and IUPAC names as two different languages, applying Neural Machine Translation (NMT) to translate between them.</li>
<li><strong>Use of SELFIES</strong>: The work establishes SELFIES (Self-Referencing Embedded Strings) as a robust choice over SMILES for deep learning tokenization in this specific task, capitalizing on its syntactic robustness.</li>
<li><strong>Hardware Acceleration</strong>: The paper benchmarks GPU versus TPU training and highlights the practical necessity of Tensor Processing Units (TPUs) for training large-scale chemical language models, reducing training time by an order of magnitude.</li>
</ul>
<h2 id="methodology--translation-validation">Methodology &amp; Translation Validation</h2>
<ul>
<li><strong>Data Scale</strong>: The model was trained on datasets of 30 million and 60 million molecules derived from PubChem.</li>
<li><strong>Hardware Benchmarking</strong>: Training efficiency was compared between an nVidia Tesla V100 GPU and Google TPU v3-8/v3-32 units.</li>
<li><strong>Bidirectional Translation</strong>: The system was tested on two distinct tasks:
<ol>
<li><strong>Forward</strong>: SELFIES → IUPAC names</li>
<li><strong>Reverse</strong>: IUPAC names → SELFIES</li>
</ol>
</li>
<li><strong>Validation</strong>: Performance was evaluated on a held-out test set of 2.2 million molecules.</li>
</ul>
<h2 id="translation-accuracy--hardware-scaling">Translation Accuracy &amp; Hardware Scaling</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved an average BLEU score of ~90% and a Tanimoto similarity index &gt; 0.9 for both translation directions.</li>
<li><strong>Generalization</strong>: Even when predictions were textually mismatched (low BLEU score), the underlying chemical structures often remained highly similar (high Tanimoto similarity), suggesting the system captures fundamental chemical semantics rather than merely memorizing strings.</li>
<li><strong>Impact of Data Size</strong>: Expanding training from 30 million to 60 million molecules yielded consistent performance gains without saturating.</li>
<li><strong>Hardware Necessity</strong>: Training on TPUs proved up to 54 times faster than a standard GPU baseline (Tesla V100), making training at this scale computationally tractable.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Current repo hosts STOUT V2.0 transformer models; V1 RNN code available in earlier commits</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Public Domain</td>
          <td style="text-align: left">Source of 111M molecules; 30M/60M training subsets not directly provided</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The dataset was curated from PubChem (111 million molecules). Note that the specific 30M and 60M subsets are not directly linked in the publication repository, which means a user would have to reconstruct the filtering process.</p>
<p><strong>Preprocessing &amp; Filtering</strong>:</p>
<ul>
<li>Explicit hydrogens removed; converted to canonical SMILES.</li>
<li><strong>Filtering Rules</strong>: MW &lt; 1500 Da, no counter ions, limited element set (C, H, O, N, P, S, F, Cl, Br, I, Se, B), no hydrogen isotopes, 3-40 bonds, no charged groups.</li>
<li><strong>Ground Truth Generation</strong>: ChemAxon&rsquo;s <code>molconvert</code> (Marvin Suite 20.15) was used to generate target IUPAC names for training.</li>
<li><strong>Representation</strong>: All SMILES were converted to SELFIES for training.</li>
</ul>
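A crude string-level approximation of some of these filters can be sketched as follows. The actual pipeline used cheminformatics toolkits for molecular weight and bond counting, so treat this only as an illustrative screen for the simpler rules (aromatic lowercase atoms are not checked):

```python
import re

ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_simple_filters(smiles: str) -> bool:
    """String-level checks: no counter ions, charges, or hydrogen isotopes,
    and only allowed elements. MW and bond-count limits need a real parser."""
    if "." in smiles:                          # multi-fragment / counter ions
        return False
    if re.search(r"\[[^\]]*[+-]", smiles):     # charged groups, e.g. [NH4+]
        return False
    if re.search(r"\[\d+H", smiles):           # hydrogen isotopes, e.g. [2H]
        return False
    # Element symbols: bracket atoms plus the bare "organic subset" atoms.
    atoms = re.findall(r"\[(?:\d+)?([A-Z][a-z]?)", smiles)
    atoms += re.findall(r"(?<!\[)(Cl|Br|[BCNOPSFI])", smiles)
    return all(a in ALLOWED_ELEMENTS for a in atoms)
```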
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">PubChem Filtered</td>
          <td style="text-align: left">30M &amp; 60M</td>
          <td style="text-align: left">Two distinct training sets created.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">PubChem Held-out</td>
          <td style="text-align: left">2.2M</td>
          <td style="text-align: left">Molecules not present in training sets; uniform token frequency.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SELFIES</strong>: Split iteratively by brackets <code>[</code> and <code>]</code>.</li>
<li><strong>IUPAC</strong>: Split via punctuation (<code>(</code>, <code>)</code>, <code>{</code>, <code>}</code>, <code>[</code>, <code>]</code>, <code>-</code>, <code>.</code>, <code>,</code>) and a discrete set of sub-word chemical morphemes (e.g., <code>methyl</code>, <code>benzene</code>, <code>fluoro</code>).</li>
<li><strong>Padding</strong>: SELFIES padded to 48 tokens; IUPAC padded to 78 tokens. &ldquo;Start&rdquo; and &ldquo;End&rdquo; markers are added to each sequence.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a learning rate of $0.0005$.</li>
<li><strong>Objective Function</strong>: Sparse categorical cross-entropy, where $y$ is the one-hot target over a vocabulary of size $V$ and $\hat{y}$ the predicted distribution:
$$ \mathcal{L} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i) $$</li>
</ul>
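The SELFIES split and fixed-length padding can be sketched as follows; the marker strings are illustrative placeholders, not the paper's actual vocabulary entries:

```python
import re

def tokenize_selfies(selfies: str) -> list:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\]]*\]", selfies)

def pad(tokens, max_len, start="<start>", end="<end>", pad_tok="<pad>"):
    """Wrap the sequence in start/end markers, then pad to a fixed length."""
    seq = [start] + list(tokens) + [end]
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    return seq + [pad_tok] * (max_len - len(seq))

# Ethanol in SELFIES, padded to the paper's 48-token input length:
tokens = tokenize_selfies("[C][C][O]")
padded = pad(tokens, max_len=48)
```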
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder sequence-to-sequence network with Bahdanau attention mechanism context weighting.</li>
<li><strong>Components</strong>:
<ul>
<li><strong>Encoder/Decoder</strong>: Recurrent Neural Networks (RNN) constructed using Gated Recurrent Units (GRU).</li>
<li><strong>Attention</strong>: Bahdanau (additive) soft attention, which computes alignment scores to softly weight the encoder hidden states:
$$ e_{tj} = v_a^\top \tanh(W_a s_{t-1} + U_a h_j) $$</li>
<li><strong>Embedding</strong>: The previous decoder output token passes through an embedding layer before being concatenated with the attention context vector.</li>
</ul>
</li>
<li><strong>Implementation</strong>: Python 3 backend using TensorFlow 2.3.0. <em>Note: The linked GitHub repository currently defaults to the STOUT V2.0 transformer models, so researchers aiming to reproduce this specific V1 RNN paper should reference the older tag/commit history.</em></li>
</ul>
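The alignment computation above, followed by the softmax that turns scores into attention weights, can be sketched in pure Python (tiny dimensions and hand-picked matrices; a real implementation would use a tensor library):

```python
import math

def bahdanau_scores(s_prev, hs, W_a, U_a, v_a):
    """Additive attention: e_tj = v_a^T tanh(W_a s_{t-1} + U_a h_j),
    followed by a softmax over encoder positions j."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    Ws = matvec(W_a, s_prev)
    scores = []
    for h in hs:
        Uh = matvec(U_a, h)
        scores.append(sum(v * math.tanh(a + b) for v, a, b in zip(v_a, Ws, Uh)))
    # Numerically stable softmax over the scores.
    mx = max(scores)
    exps = [math.exp(e - mx) for e in scores]
    z = sum(exps)
    return [e / z for e in exps]  # attention weights alpha_tj

# Two encoder states, 2-dimensional hidden size, identity projections:
weights = bahdanau_scores([0.0, 0.0],
                          [[1.0, 0.0], [0.0, 0.0]],
                          [[1, 0], [0, 1]], [[1, 0], [0, 1]], [1.0, 1.0])
```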
<h3 id="evaluation">Evaluation</h3>
<p>The metrics capture both linguistic accuracy and cheminformatic structural correctness:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Details</th>
          <th style="text-align: left">Result (60M Model)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU Score</strong></td>
          <td style="text-align: left">NLTK sentence BLEU (unigram to 4-gram)</td>
          <td style="text-align: left">0.94 (IUPAC $\to$ SELFIES)</td>
          <td style="text-align: left">Exact text overlap. Serves as a strictly syntactic proxy.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto Similarity</strong></td>
          <td style="text-align: left">PubChem fingerprints via CDK</td>
          <td style="text-align: left">0.98 (Valid IUPAC names)</td>
          <td style="text-align: left">Evaluates substructure alignment over bit vectors, $T(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert}$.</td>
      </tr>
  </tbody>
</table>
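The Tanimoto computation in the table reduces to a set operation over the indices of the on-bits of two fingerprints; a minimal sketch with hypothetical bit sets:

```python
def tanimoto(a: set, b: set) -> float:
    """T(A, B) = |A ∩ B| / |A ∪ B| over the on-bit indices of two fingerprints."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)

# Two hypothetical fingerprints sharing two on-bits:
print(tanimoto({1, 4, 9, 16}, {1, 4, 25}))  # → 0.4
```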
<h3 id="hardware">Hardware</h3>
<p>Comparison of hardware efficiency for training large chemical language models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hardware</th>
          <th style="text-align: left">Batch Size</th>
          <th style="text-align: left">Time per Epoch (15M subset)</th>
          <th style="text-align: left">Speedup Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>GPU (Tesla V100)</strong></td>
          <td style="text-align: left">256</td>
          <td style="text-align: left">~27 hours</td>
          <td style="text-align: left">1x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-8</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~2 hours</td>
          <td style="text-align: left">13x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-32</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~0.5 hours</td>
          <td style="text-align: left">54x</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. <em>Journal of Cheminformatics</em>, 13(1), 34. <a href="https://doi.org/10.1186/s13321-021-00512-4">https://doi.org/10.1186/s13321-021-00512-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTSMILESIUPAC2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{STOUT: SMILES to IUPAC Names Using Neural Machine Translation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{STOUT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00512-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-09-22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention mechanism,Chemical language,Deep neural network,DeepSMILES,IUPAC names,Neural machine translation,Recurrent neural network,SELFIES,SMILES}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/stout-v2/">STOUT V2.0 Note</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/">Struct2IUPAC Note</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/">HandSEL Note (InChI to IUPAC)</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT V2.0: Transformer-Based SMILES to IUPAC Translation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout-v2/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout-v2/</guid><description>A Transformer-based model for translating SMILES to IUPAC names, trained on ~1 billion molecules, achieving ~0.99 BLEU score on benchmarks.</description><content:encoded><![CDATA[<h2 id="paper-contribution--methodological-scope">Paper Contribution &amp; Methodological Scope</h2>
<p><strong>Method (Primary) / Resource (Secondary)</strong></p>
<p>This paper presents a <strong>Methodological</strong> contribution by developing and validating a Transformer-based neural machine translation model (STOUT V2) for bidirectional chemical nomenclature (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> $\leftrightarrow$ IUPAC). It systematically compares this new architecture against previous RNN-based baselines (<a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT V1</a>) and performs ablation studies on tokenization strategies.</p>
<p>It also serves as a significant <strong>Resource</strong> contribution by generating a massive training dataset of nearly 1 billion SMILES-IUPAC pairs (curated via commercial Lexichem software) and releasing the resulting models and code as open-source tools for chemical naming.</p>
<h2 id="the-need-for-robust-open-source-iupac-nomenclature-rules">The Need for Robust Open-Source IUPAC Nomenclature Rules</h2>
<p>Assigning systematic IUPAC names to chemical structures requires adherence to complex rules, challenging human consistency. Deterministic, rule-based software options like OpenEye Lexichem and ChemAxon are reliable commercial solutions. Existing open-source tools like OPSIN focus on parsing names to structures.</p>
<p>The previous version of STOUT (V1), based on RNNs/GRUs, achieved ~90% BLEU accuracy, with known limitations in capturing long-distance dependencies required for stereochemistry handling. This work uses the sequence-learning capabilities of Transformers combined with large-scale datasets to create a competitive open-source IUPAC naming tool.</p>
<h2 id="architectural-shift-and-billion-scale-training">Architectural Shift and Billion-Scale Training</h2>
<p>The primary advancements over previous iterations address both architecture and dataset scale:</p>
<ol>
<li><strong>Architecture Shift</strong>: Moving from an RNN-based Seq2Seq model to a <strong>Transformer-based architecture</strong> (4 layers, 8 heads), which captures intricate chemical patterns better than GRUs.</li>
<li><strong>Billion-Scale Training</strong>: Training on a dataset of nearly <strong>1 billion molecules</strong> (combining PubChem and ZINC15), significantly larger than the 60 million used for STOUT V1.</li>
<li><strong>Tokenization Strategy</strong>: Determining that <strong>character-wise tokenization</strong> for IUPAC names is superior to word-wise tokenization in terms of both accuracy and training efficiency (15% faster).</li>
</ol>
<h2 id="experimental-validation-and-scaling-limits">Experimental Validation and Scaling Limits</h2>
<p>The authors conducted three primary experiments to validate bidirectional translation (SMILES $\rightarrow$ IUPAC and IUPAC $\rightarrow$ SMILES):</p>
<ul>
<li><strong>Experiment 1 (Optimization)</strong>: Assessed the impact of dataset size (1M vs 10M vs 50M) and tokenization strategy on SMILES-to-IUPAC performance.</li>
<li><strong>Experiment 2 (Scaling)</strong>: Trained models on 110 million PubChem molecules for <strong>both</strong> forward and reverse translation tasks to test performance on longer sequences.</li>
<li><strong>Experiment 3 (Generalization)</strong>: Trained on the full ~1 billion dataset (PubChem + ZINC15) for both translation directions.</li>
<li><strong>External Validation</strong>: Benchmarked against an external dataset from ChEBI (1,485 molecules) and ChEMBL34 to test generalization to unseen data.</li>
</ul>
<p><strong>Evaluation Metrics</strong>:</p>
<ul>
<li><strong>Textual Accuracy</strong>: BLEU scores (1-4) and Exact String Match.</li>
<li><strong>Chemical Validity</strong>: Retranslation of generated names back to SMILES using OPSIN, followed by Tanimoto similarity checks (PubChem fingerprints) against the original input.</li>
</ul>
<h2 id="translation-accuracy-and-structural-validity">Translation Accuracy and Structural Validity</h2>
<ul>
<li><strong>Superior Performance</strong>: STOUT V2 achieved an average BLEU score of <strong>0.99</strong> (vs 0.94 for V1). While exact string matches varied by experiment (83-89%), the model notably achieved a perfect BLEU score (1.0) on <strong>97.49%</strong> of a specific test set where STOUT V1 only reached 66.65%.</li>
<li><strong>Structural Validity (&ldquo;Near Misses&rdquo;)</strong>: When the generated name differed from the ground-truth string, the re-generated structure often remained chemically valid. For these divergent names, the average Tanimoto similarity between the bit-vector fingerprint $A$ of the input and the fingerprint $B$ of the retranslated structure was <strong>0.68</strong>, where
$$ T(A,B) = \frac{\sum (A \cap B)}{\sum (A \cup B)} $$
<em>Critique</em>: An average Tanimoto coefficient of 0.68 typically indicates moderate structural similarity, not an almost-identical &ldquo;near miss&rdquo; (which would be $&gt;0.85$). When the model fails an exact string match, it tends to construct chemically related but structurally distinct outputs.</li>
<li><strong>Tokenization</strong>: Character-level splitting for IUPAC names outperformed word-level splitting and was more computationally efficient.</li>
<li><strong>Data Imbalance &amp; Generalization</strong>: The model&rsquo;s drop in performance for sequences &gt;600 characters highlights a systemic issue in open chemical databases: long, highly complex SMILES strings are significantly underrepresented. Even billion-scale training datasets are still bound by the chemical diversity of their source material.</li>
<li><strong>Limitations</strong>:
<ul>
<li><strong>Preferred Names (PINs)</strong>: The model mimics Lexichem&rsquo;s naming conventions, generating valid IUPAC names distinct from strict <em>Preferred IUPAC Names</em> (PINs).</li>
<li><strong>Sequence Length</strong>: Performance degrades for very long SMILES (&gt;600 characters) due to scarcity in the training data.</li>
<li><strong>Algorithmic Distillation Bottleneck</strong>: Because the 1 billion training pairs were generated entirely by OpenEye&rsquo;s Lexichem, STOUT V2 acts as a knowledge distillation of that specific commercial algorithm. The model learns Lexichem’s heuristic mapping, specific dialects, and potential systematic errors, rather than deriving true nomenclature rules from first principles.</li>
</ul>
</li>
</ul>
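<p>For concreteness, the Tanimoto coefficient reported above can be computed directly on fingerprint bit sets. A minimal sketch (the bit indices are made up for illustration; the paper uses PubChem fingerprints):</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two bit-vector fingerprints,
    each given as the set of indices of its on bits."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Hypothetical on-bit indices for an input molecule and its retranslation:
fp_input = {1, 4, 9, 16, 25}
fp_retranslated = {1, 4, 9, 36}
print(tanimoto(fp_input, fp_retranslated))  # 3 shared / 6 total = 0.5
```

On this scale, the reported average of 0.68 means the divergent retranslations share roughly two-thirds of their on bits with the input.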
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was derived from PubChem and ZINC15. Ground truth IUPAC names were generated using OpenEye Lexichem TK 2.8.1 to ensure consistency.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Exp 1)</strong></td>
          <td>PubChem Subset</td>
          <td>1M, 10M, 50M</td>
          <td>Selected via MaxMin algorithm for diversity</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 2)</strong></td>
          <td>PubChem</td>
          <td>110M</td>
          <td>Filtered for SMILES length &lt; 600</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 3)</strong></td>
          <td>PubChem + ZINC15</td>
          <td>~1 Billion</td>
          <td>999,637,326 molecules total</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChEBI</td>
          <td>1,485</td>
          <td>External validation set, non-overlapping with training</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Canonicalized, isomeric, and kekulized using RDKit (v2023.03.1).</li>
<li><strong>Formatting</strong>: Converted to TFRecord format in 100 MB chunks for TPU efficiency.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting. Atoms (e.g., &ldquo;Cl&rdquo;, &ldquo;Au&rdquo;), bonds, brackets, and digits are separate tokens.</li>
<li><strong>IUPAC Tokenization</strong>: <strong>Character-wise split</strong> was selected as the optimal strategy (treating every character as a token).</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler based on model dimensions.</li>
<li><strong>Loss Function</strong>: Trained to minimize Sparse Categorical Cross-Entropy with padding tokens masked. With $y_i$ the target token at position $i$, $p_i(y_i)$ its predicted probability, and $m_i \in \{0, 1\}$ a mask that zeroes padded positions, the loss over a sequence of length $N$ is:
$$ L = - \sum_{i=1}^{N} m_i \log p_i(y_i) $$</li>
<li><strong>Code Availability</strong>: The <a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">main STOUT V2 repository</a> contains the inference package. The training pipeline/instructions (originally linked to a separate repo that is currently a 404) can still be found within the <a href="https://doi.org/10.5281/zenodo.6559438">Zenodo archive release</a>.</li>
</ul>
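<p>The two tokenization strategies can be sketched in a few lines. The regex below is an assumption: the paper does not reproduce its exact pattern, so this follows a pattern common in the SMILES language-model literature:</p>

```python
import re

# Assumed SMILES tokenization pattern (the paper does not list its exact regex):
# bracket atoms, two-letter elements, chirality/ring markers, single atoms,
# bond/branch symbols, and ring-closure digits each become one token.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|[BCNOPSFIbcnops]|[=#$/\\().+\-:~*]|\d)"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into atom, bond, branch, and ring-closure tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

def tokenize_iupac(name: str) -> list:
    """Character-wise tokenization: every character is its own token."""
    return list(name)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Note the alternation order: multi-character tokens like <code>Cl</code> must precede the single-letter atom class, otherwise chlorine would be split into carbon plus a stray character.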
<h3 id="models">Models</h3>
<p>The model follows the standard Transformer architecture from &ldquo;Attention is All You Need&rdquo; (Vaswani et al.).</p>
<ul>
<li><strong>Architecture</strong>: 4 Transformer layers (encoder/decoder stack).</li>
<li><strong>Attention</strong>: Multi-head attention with <strong>8 heads</strong>.</li>
<li><strong>Dimensions</strong>: Embedding size ($d_{model}$) = 512; Feed-forward dimension ($d_{ff}$) = 2048.</li>
<li><strong>Regularization</strong>: Dropout rate of 0.1.</li>
<li><strong>Context Window</strong>: Max input length (SMILES) = 600; Max output length (IUPAC) = 700-1000.</li>
<li><strong>Weights</strong>: Model weights for forward and reverse architectures are <a href="https://doi.org/10.5281/zenodo.13318286">available via Zenodo (v3)</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on both string similarity and chemical structural integrity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BLEU Score</strong></td>
          <td>N-gram overlap</td>
          <td>Compared predicted IUPAC string to Ground Truth.</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Accuracy</td>
          <td>Binary 1/0 check for identical strings.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Structural Similarity</td>
          <td>Predicted Name $\rightarrow$ OPSIN $\rightarrow$ SMILES $\rightarrow$ Fingerprint comparison to input.</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT V2 GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Inference package (PyPI: STOUT-pypi)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13318286">Model Weights (Zenodo v3)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Forward and reverse translation weights</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6559438">Code Snapshot (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training pipeline archive</td>
      </tr>
      <tr>
          <td><a href="https://stout.decimer.ai">Web Application</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo with Ketcher, bulk submission, DECIMER integration</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was conducted entirely on Google Cloud Platform (GCP) TPUs.</p>
<ul>
<li><strong>STOUT V1</strong>: Trained on TPU v3-8.</li>
<li><strong>STOUT V2</strong>: Trained on <strong>TPU v4-128 pod slices</strong> (128 nodes).</li>
<li><strong>Large Scale (Exp 3)</strong>: Trained on <strong>TPU v4-256 pod slice</strong> (256 nodes).</li>
<li><strong>Training Time</strong>: Average of <strong>15 hours and 2 minutes per epoch</strong> for the 1 billion dataset.</li>
<li><strong>Framework</strong>: TensorFlow 2.15.0-pjrt with Keras.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2024). STOUT V2.0: SMILES to IUPAC name conversion using transformer models. <em>Journal of Cheminformatics</em>, 16(146). <a href="https://doi.org/10.1186/s13321-024-00941-x">https://doi.org/10.1186/s13321-024-00941-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTV20SMILES2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{STOUT V2}}.0: {{SMILES}} to {{IUPAC}} Name Conversion Using Transformer Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{STOUT V2}}.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{146}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00941-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://stout.decimer.ai">Web Application</a> (Includes Ketcher drawing, bulk submission, and DECIMER integration)</li>
<li><a href="https://decimer.ai">DECIMER Project</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT V1 Note</a></li>
<li><a href="https://zenodo.org/records/6559438">Zenodo Archive (Code Snapshot)</a></li>
</ul>
]]></content:encoded></item><item><title>SELFIES and the Future of Molecular String Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2022/</link><pubDate>Tue, 02 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2022/</guid><description>Perspective on SELFIES as a 100% robust SMILES alternative, with 16 future research directions for molecular AI.</description><content:encoded><![CDATA[<h2 id="position-a-roadmap-for-robust-chemical-languages">Position: A Roadmap for Robust Chemical Languages</h2>
<p>This is a <strong>Position</strong> paper (perspective) that proposes a research agenda for molecular representations in AI. It reviews the evolution of chemical notation over 250 years and argues for extending SELFIES-style robust representations beyond traditional organic chemistry into polymers, crystals, reactions, and other complex chemical systems.</p>
<h2 id="the-generative-bottleneck-in-traditional-representations">The Generative Bottleneck in Traditional Representations</h2>
<p>While SMILES has been the standard molecular representation since 1988, its fundamental weakness for machine learning is well-established: randomly generated SMILES strings are often invalid. The motivation is twofold:</p>
<ol>
<li><strong>Current problem</strong>: Traditional representations (SMILES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>, DeepSMILES) lack 100% robustness; random mutations or generations can produce invalid strings, limiting their use in generative AI models.</li>
<li><strong>Future opportunity</strong>: SELFIES solved this for small organic molecules, but many important chemical domains (polymers, crystals, reactions) still lack robust representations, creating a bottleneck for AI-driven discovery in these areas.</li>
</ol>
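<p>The fragility is easy to demonstrate: single-character mutations of a valid SMILES string frequently break its syntax. A toy sketch using only a crude syntactic check (a real validity check would attempt a full parse, e.g. with RDKit):</p>

```python
import random

def plausibly_valid(smiles: str) -> bool:
    """Crude syntax check: balanced parentheses and paired ring-closure
    digits. (A real check would parse the full grammar, e.g. via RDKit.)"""
    depth = 0
    for ch in smiles:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    digits = [c for c in smiles if c.isdigit()]
    return depth == 0 and all(digits.count(d) % 2 == 0 for d in set(digits))

random.seed(0)
aspirin = "CC(=O)Oc1ccccc1C(=O)O"
alphabet = "CNOcn()=1"
# Replace one character at 10 random positions and count syntax breaks:
mutants = ["".join(random.choice(alphabet) if i == p else c
                   for i, c in enumerate(aspirin))
           for p in random.sample(range(len(aspirin)), 10)]
print(sum(not plausibly_valid(m) for m in mutants), "of 10 mutants fail the syntax check")
```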
<h2 id="16-concrete-research-directions-for-selfies">16 Concrete Research Directions for SELFIES</h2>
<p>The novelty is in the comprehensive research roadmap. The authors propose 16 concrete research projects organized around key themes:</p>
<ul>
<li><strong>Domain extension</strong>: Includes metaSELFIES for learning graph rules directly from data, BigSELFIES for stochastic polymers, and crystal structures via labeled quotient graphs.</li>
<li><strong>Chemical reactions</strong>: Robust reaction representations that enforce conservation laws.</li>
<li><strong>Programming perspective</strong>: Treating molecular representations as programming languages, potentially achieving Turing-completeness.</li>
<li><strong>Benchmarking</strong>: Systematic comparisons across representation formats.</li>
<li><strong>Interpretability</strong>: Understanding how humans and machines actually learn from different representations.</li>
</ul>
<h2 id="evidence-from-generative-case-studies">Evidence from Generative Case Studies</h2>
<p>This perspective paper includes case studies:</p>
<ol>
<li>
<p><strong>Pasithea (Deep Molecular Dreaming)</strong>: A generative model that first learns to predict a chemical property from a one-hot encoded SELFIES, then freezes the network weights and uses gradient descent on the one-hot input encoding to optimize molecular properties (logP). The target property increases or decreases nearly monotonically, demonstrating that the model has learned meaningful structure-property relationships from the SELFIES representation.</p>
</li>
<li>
<p><strong>DECIMER and STOUT</strong>: DECIMER (Deep lEarning for Chemical ImagE Recognition) is an image-to-structure tool, and STOUT (SMILES-TO-IUPAC-name Translator) translates between IUPAC names and molecular string representations. Both show improved performance when using SELFIES as an intermediate representation. STOUT internally converts SMILES to SELFIES before processing and decodes predicted SELFIES back to SMILES. These results suggest SELFIES provides a more learnable internal representation for sequence-to-sequence models.</p>
</li>
</ol>
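<p>The dreaming loop itself is simple to sketch. Below, a frozen linear &ldquo;predictor&rdquo; stands in for Pasithea&rsquo;s neural network, and gradient ascent runs on a continuously relaxed one-hot input; the vocabulary, weights, and step count are all illustrative assumptions, not values from the paper:</p>

```python
VOCAB = ["[C]", "[O]", "[N]"]
W = [0.2, 1.0, -0.5]  # frozen toy predictor: property = sum of W . x per position

def predict(x):
    """Property prediction for a relaxed one-hot encoding x (one vector per token)."""
    return sum(sum(w * xi for w, xi in zip(W, pos)) for pos in x)

# Start from the one-hot encoding of [C][C]; the encoding is relaxed to
# continuous values during optimization, as in input-space dreaming.
x = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
lr = 0.1
for _ in range(50):
    # For a linear predictor, d(predict)/d(x[pos][k]) = W[k] at every position.
    x = [[xi + lr * w for w, xi in zip(W, pos)] for pos in x]

# Decode by argmax: the optimization drifts toward the high-property token.
dreamed = [VOCAB[max(range(len(VOCAB)), key=lambda k: pos[k])] for pos in x]
print(dreamed, predict(x))
```

The frozen weights play the role of the trained network; the gradient flows only into the input encoding, which is then decoded back to discrete tokens.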
<h2 id="strategic-outcomes-and-future-vision">Strategic Outcomes and Future Vision</h2>
<p>The paper establishes robust representations as a fundamental bottleneck in computational chemistry and proposes a clear path forward:</p>
<p><strong>Key outcomes</strong>:</p>
<ul>
<li>Identification of 16 concrete research projects spanning domain extension, benchmarking, and interpretability</li>
<li>Evidence that SELFIES enables capabilities (like smooth property optimization) impossible with traditional formats</li>
<li>Framework for thinking about molecular representations as programming languages</li>
</ul>
<p><strong>Strategic impact</strong>: The proposed extensions could enable new applications across drug discovery (efficient exploration beyond small molecules), materials design (systematic crystal structure discovery), synthesis planning (better reaction representations), and fundamental research (new ways to understand chemical behavior).</p>
<p><strong>Future vision</strong>: The authors emphasize that robust representations could become a bridge for bidirectional learning between humans and machines, enabling humans to learn new chemical concepts from AI systems.</p>
<h2 id="the-mechanism-of-robustness">The Mechanism of Robustness</h2>
<p>The key difference between SELFIES and other representations lies in how they handle syntax:</p>
<ul>
<li><strong>SMILES/DeepSMILES</strong>: Rely on non-local markers (opening/closing parentheses or ring numbers) that must be balanced. A mutation or random generation can easily break this balance, producing invalid strings.</li>
<li><strong>SELFIES</strong>: Uses a formal grammar (automaton) where derivation rules are entirely local. The critical innovation is <strong>overloading</strong>: a state-modifying symbol like <code>[Branch1]</code> starts a branch and changes the interpretation of the <em>next</em> symbol to represent a numerical parameter (the branch length).</li>
</ul>
<p>This overloading mechanism ensures that any arbitrary sequence of SELFIES tokens can be parsed into a valid molecular graph. The derivation can never fail because every symbol either adds an atom or modifies how subsequent symbols are interpreted.</p>
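<p>The locality argument can be illustrated with a deliberately simplified automaton. This is <em>not</em> the real SELFIES grammar (which tracks derivation states, branch stacks, and rings); it is only a sketch of why purely local, state-dependent interpretation can never fail:</p>

```python
# Toy illustration of SELFIES-style LOCAL derivation (NOT the real grammar).
VALENCE = {"[C]": 4, "[O]": 2, "[N]": 3, "[F]": 1}

def derive(tokens):
    """Map ANY token sequence to a valid atom chain: each symbol is
    interpreted relative to the current state instead of being rejected."""
    atoms = []
    free = 0  # bonds still open on the previous atom
    it = iter(tokens)
    for tok in it:
        if tok == "[Branch]":
            if atoms and free > 1:
                # OVERLOADING: the next symbol is reinterpreted as a numeric
                # branch-length parameter, so it can never be "wrong".
                nxt = next(it, None)
                branch_len = VALENCE.get(nxt, 1)  # illustrative only
            # if the state forbids a branch, the symbol is simply skipped
        elif tok in VALENCE:
            if atoms and free == 0:
                break  # previous atom is saturated: derivation ends cleanly
            atoms.append(tok)
            free = VALENCE[tok] - (1 if len(atoms) > 1 else 0)
    return atoms

print(derive(["[F]", "[F]", "[C]"]))  # trailing [C] is dropped: ['[F]', '[F]']
```

Every failure mode of the automaton (out-of-place branch, exhausted valence) resolves to skipping or stopping rather than raising an error, which is the essence of 100% robustness.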
<h2 id="the-16-research-projects-technical-details">The 16 Research Projects: Technical Details</h2>
<p>This section provides technical details on the proposed research directions:</p>
<h3 id="extending-to-new-domains">Extending to New Domains</h3>
<p><strong>metaSELFIES (Project 1)</strong>: The authors propose learning graph construction rules automatically from data. This could enable robust representations for any graph-based system, from quantum optics to biological networks, without needing domain-specific expertise.</p>
<p><strong>Token Optimization (Project 2)</strong>: SELFIES uses &ldquo;overloading&rdquo; where a symbol&rsquo;s meaning changes based on context. This project would investigate how this affects machine learning performance and whether the approach can be optimized.</p>
<h3 id="handling-complex-molecular-systems">Handling Complex Molecular Systems</h3>
<p><strong>BigSELFIES (Project 3)</strong>: Current representations struggle with large, often random structures like polymers and biomolecules. BigSELFIES would combine hierarchical notation with stochastic building blocks to handle these complex systems where traditional small-molecule representations break down.</p>
<p><strong>Crystal Structures (Projects 4-5)</strong>: Crystals present unique challenges due to their infinite, periodic arrangements. An infinite net cannot be represented by a finite string directly. The proposed approach uses <strong>labeled quotient graphs (LQGs)</strong>, which are finite graphs that uniquely determine a periodic net. However, current SELFIES cannot represent LQGs because they lack symbols for edge directions and edge labels (vector shifts encoding periodicity). Extending SELFIES to handle these structures could enable AI-driven materials design without relying on predefined crystal structures, opening up systematic exploration of theoretical materials space.</p>
<p><strong>Beyond Organic Chemistry (Project 6)</strong>: Transition metals and main-group compounds feature complex bonding that breaks the simple two-center, two-electron model. The solution: use machine learning on large structural databases to automatically learn these complex bonding rules.</p>
<h3 id="chemical-reactions-and-programming-concepts">Chemical Reactions and Programming Concepts</h3>
<p><strong>Reaction Representations (Project 7)</strong>: Moving beyond static molecules to represent chemical transformations. A robust reaction format would enforce conservation laws and could learn reactivity patterns from large reaction datasets, improving synthesis planning.</p>
<h3 id="developing-a-100-robust-programming-language">Developing a 100% Robust Programming Language</h3>
<p><strong>Programming Language Perspective (Projects 8-9)</strong>: An intriguing reframing views molecular representations as programming languages executed by chemical parsers. This opens possibilities for adding loops, logic, and other programming concepts to efficiently describe complex structures. The ambitious goal is a Turing-complete programming language that is also 100% robust. While fascinating, it is worth critically noting that enforcing 100% syntactical robustness inherently restricts grammar flexibility. Can a purely robust string representation realistically describe highly fuzzy, delocalized electron bonds (like in Project 6) without becoming impractically long or collapsing into specialized sub-languages?</p>
<p><strong>Empirical Comparisons (Projects 10-11)</strong>: With multiple representation options (strings, matrices, images), we need systematic comparisons. The proposed benchmarks would go beyond simple validity metrics to focus on real-world design objectives in drug discovery, catalysis, and materials science.</p>
<p><strong>Human Readability (Project 12)</strong>: While SMILES is often called &ldquo;human-readable,&rdquo; this claim lacks scientific validation. The proposed study would test how well humans actually understand different molecular representations.</p>
<p><strong>Machine Learning Perspectives (Projects 13-16)</strong>: These projects explore how machines interpret molecular representations:</p>
<ul>
<li>Training networks to translate between formats to find universal representations</li>
<li>Comparing learning efficiency across different formats</li>
<li>Investigating latent space smoothness in generative models</li>
<li>Visualizing what models actually learn about molecular structure</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>Since this is a position paper outlining future research directions, standard empirical reproducibility metrics do not apply. However, the foundational tools required to pursue the proposed roadmap are open-source.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">aspuru-guzik-group/selfies</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Core SELFIES Python library, installable via <code>pip install selfies</code></td>
      </tr>
      <tr>
          <td><a href="https://arxiv.org/abs/2204.00056">arXiv:2204.00056</a></td>
          <td>Paper</td>
          <td>N/A</td>
          <td>Open-access preprint of the published Patterns article</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., Friederich, P., Gaudin, T., Gayle, A. A., Jablonka, K. M., Lameiro, R. F., Lemm, D., Lo, A., Moosavi, S. M., Nápoles-Duarte, J. M., Nigam, A., Pollice, R., Rajan, K., Schatzschneider, U., &hellip; Aspuru-Guzik, A. (2022). SELFIES and the future of molecular string representations. <em>Patterns</em>, <em>3</em>(10). <a href="https://doi.org/10.1016/j.patter.2022.100588">https://doi.org/10.1016/j.patter.2022.100588</a></p>
<p><strong>Publication</strong>: Patterns 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{SELFIES and the future of molecular string representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">ISSN</span> = <span style="color:#e6db74">{2666-3899}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{http://dx.doi.org/10.1016/j.patter.2022.100588}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">DOI</span> = <span style="color:#e6db74">{10.1016/j.patter.2022.100588}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Patterns}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Elsevier BV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krenn, Mario and Ai, Qianxiang and Barthel, Senja and Carson, Nessa and Frei, Angelo and Frey, Nathan C. and Friederich, Pascal and Gaudin, Théophile and Gayle, Alberto Alexander and Jablonka, Kevin Maik and Lameiro, Rafael F. and Lemm, Dominik and Lo, Alston and Moosavi, Seyed Mohamad and Nápoles-Duarte, José Manuel and Nigam, AkshatKumar and Pollice, Robert and Rajan, Kohulan and Schatzschneider, Ulrich and Schwaller, Philippe and Skreta, Marta and Smit, Berend and Strieth-Kalthoff, Felix and Sun, Chong and Tom, Gary and von Rudorff, Guido Falk and Wang, Andrew and White, Andrew and Young, Adamo and Yu, Rose and Aspuru-Guzik, Alán}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{100588}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">Original SELFIES Paper</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES Overview</a></li>
</ul>
]]></content:encoded></item><item><title>Invalid SMILES Benefit Chemical Language Models: A Study</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</link><pubDate>Tue, 02 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</guid><description>Skinnider (2024) shows that generating invalid SMILES actually improves chemical language model performance through quality filtering.</description><content:encoded><![CDATA[<h2 id="core-contribution-repurposing-invalid-smiles">Core Contribution: Repurposing Invalid SMILES</h2>
<p>This is an <strong>Empirical</strong> paper that challenges a fundamental assumption in the field of chemical language models. Skinnider provides both empirical evidence and mechanistic explanations for why the ability to generate &ldquo;invalid&rdquo; SMILES strings is beneficial for model performance.</p>
<h2 id="the-problem-with-absolute-validity-in-chemical-lms">The Problem with Absolute Validity in Chemical LMs</h2>
<p>Prior research attempted to eliminate invalid generations using constrained representations like SELFIES. This paper demonstrates that invalid outputs serve as low-likelihood samples whose removal acts as an implicit quality filter, improving distribution learning.</p>
<h2 id="invalid-generation-as-an-implicit-quality-filter">Invalid Generation as an Implicit Quality Filter</h2>
<p>The central insight is counterintuitive: <strong>invalid SMILES generation acts as a built-in quality control mechanism</strong>. The key contributions are:</p>
<ol>
<li>
<p><strong>Empirical Evidence</strong>: Direct comparisons showing that SMILES-based models consistently outperform SELFIES-based models across multiple metrics, with performance gains strongly correlated with the proportion of invalid outputs generated.</p>
</li>
<li>
<p><strong>Mechanistic Explanation</strong>: Invalid SMILES are demonstrated to be low-likelihood samples from the model&rsquo;s probability distribution. When these are filtered out, it&rsquo;s equivalent to removing the model&rsquo;s least confident predictions, a form of automatic quality control.</p>
</li>
<li>
<p><strong>Causal Evidence</strong>: By modifying SELFIES to allow invalid generation (through relaxed constraints), the author shows that performance improves when models can generate and discard invalid outputs, directly proving the causal relationship.</p>
</li>
<li>
<p><strong>Bias Analysis</strong>: SELFIES models are shown to introduce systematic structural biases (fewer aromatic rings, more aliphatic rings) due to their validity constraints, limiting their ability to explore chemical space naturally.</p>
</li>
</ol>
<h2 id="experimental-design-and-causal-interventions">Experimental Design and Causal Interventions</h2>
<p>The paper uses a multi-pronged approach to establish both correlation and causation:</p>
<p><strong>Performance Comparisons</strong>: SMILES and SELFIES models were trained on identical datasets and evaluated using distribution-learning metrics like Fréchet ChemNet distance. The comparison was robust across different architectures, training set sizes, and chemical databases.</p>
<p><strong>Loss Analysis</strong>: The relationship between SMILES validity and model confidence was examined by analyzing the sequence loss. For a given SMILES string $S$ composed of tokens $t_1, t_2, \ldots, t_N$, the negative log-likelihood acts as a proxy for the model&rsquo;s uncertainty:</p>
<p>$$ \text{NLL}(S) = -\sum_{i=1}^N \log P(t_i \mid t_1, \ldots, t_{i-1}) $$</p>
<p>Invalid SMILES strings consistently register higher $\text{NLL}$ scores, meaning they represent the model&rsquo;s least confident predictions. Filtering them effectively acts as automatic quality control, providing the mechanistic explanation for why invalid filtering improves performance.</p>
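<p>The filtering mechanism can be illustrated with a minimal sketch. The probabilities below are made up for illustration, not values from the paper; only the shape of the argument matters: low-confidence sequences accumulate a higher NLL, so discarding them removes the model&rsquo;s least confident samples.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import math

def nll(token_probs):
    """Negative log-likelihood of a sampled sequence, given the
    per-token conditional probabilities P(t_i | t_1, ..., t_{i-1})."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical generations: a high-confidence sample (typically valid)
# vs. a low-confidence one (the kind that tends to parse as invalid).
confident = [0.9, 0.85, 0.95]
uncertain = [0.4, 0.2, 0.3]

assert nll(uncertain) > nll(confident)
</code></pre>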
<p><strong>Causal Intervention</strong>: A key experiment involved modifying the SELFIES valency constraints at two levels: first allowing pentavalent carbons (&ldquo;Texas SELFIES&rdquo;), then removing all constraints entirely (&ldquo;unconstrained SELFIES&rdquo;). This allowed direct testing of whether the ability to generate invalid outputs (which are then discarded) causally improves performance.</p>
<p><strong>Structural Bias Analysis</strong>: Generated molecules were analyzed for chemical features like ring types and bond patterns to quantify how validity constraints systematically distort the model&rsquo;s exploration of chemical space.</p>
<p><strong>Generalization Testing</strong>: Models were trained on subsets of chemical databases and tested on their ability to reproduce the broader chemical space, measuring how validity constraints affect generalization.</p>
<p><strong>Practical Application</strong>: The approach was tested on structure elucidation, using models to identify unknown molecules from minimal experimental data like mass spectrometry.</p>
<h2 id="key-findings-on-validity-constraints-and-bias">Key Findings on Validity Constraints and Bias</h2>
<p><strong>Superior Performance Across the Board</strong>: SMILES-based models consistently outperformed SELFIES models on distribution-learning tasks. Using metrics like Fréchet ChemNet distance, SMILES models generated molecules that more closely matched the statistical properties of their training data. This performance advantage was directly correlated with the proportion of invalid SMILES generated. Models that produced more invalid outputs performed better after filtering.</p>
<p><strong>Invalid SMILES Are Low-Confidence Predictions</strong>: The analysis revealed that invalid SMILES consistently have higher loss values than valid ones, meaning they represent the model&rsquo;s least confident predictions. This suggests that validity checking acts as an automatic confidence filter, removing low-quality samples without requiring explicit uncertainty estimation.</p>
<p><strong>Causal Evidence Through Unconstrained SELFIES</strong>: Direct causal evidence came from modifying SELFIES to allow invalid generation. When &ldquo;unconstrained SELFIES&rdquo; models could generate and discard invalid molecules, their performance improved, approaching that of SMILES models. This provides direct causal evidence that the ability to generate invalid outputs is what drives the performance gains.</p>
<p><strong>Validity Constraints Introduce Systematic Bias</strong>: SELFIES models showed clear structural biases compared to both training data and SMILES outputs. They generated fewer aromatic rings and more aliphatic structures, systematic distortions caused by the valency constraints used to ensure validity. These biases limit the model&rsquo;s ability to faithfully represent chemical space.</p>
<p><strong>Reduced Generalization</strong>: When trained on subsets of chemical databases, SMILES models could reproduce a larger portion of the complete chemical space compared to SELFIES models. Although SELFIES generated more valid molecules in absolute terms, their structural biases constrained exploration and limited generalization beyond the training set.</p>
<p><strong>Real-World Application Benefits</strong>: In structure elucidation tasks (identifying unknown molecules from experimental data such as mass spectrometry), SMILES-based models significantly outperformed SELFIES models. This demonstrates that the benefits extend beyond academic benchmarks to practical applications.</p>
<p><strong>CASMI 2022 Benchmark</strong>: The language model trained on the LOTUS database was benchmarked against 19 submissions to the CASMI 2022 competition for structure elucidation of unknown compounds. Using only accurate mass as input (no MS/MS data), the model achieved competitive performance, highlighting the practical utility of the sampling-frequency-based approach for de novo structure elucidation.</p>
<p><strong>Computational Efficiency</strong>: Filtering invalid SMILES is computationally trivial. Parsing ten million SMILES strings with RDKit takes approximately 7.5 minutes on a single CPU, making the post-processing overhead negligible compared to model training and inference costs.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Primary Architecture (LSTM):</strong> The main results rely on a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units.</p>
<ul>
<li><strong>Structure:</strong> Three-layer LSTM with a hidden layer size of 1,024 dimensions</li>
<li><strong>Embedding:</strong> An embedding layer of 128 dimensions</li>
<li><strong>Decoder:</strong> A linear decoder layer outputs token probabilities</li>
</ul>
<p><strong>Secondary Architecture (Transformer/GPT):</strong> To confirm robustness across architectures, the author also used a Generative Pretrained Transformer (GPT) architecture adapted from MolGPT.</p>
<ul>
<li><strong>Structure:</strong> Eight transformer blocks</li>
<li><strong>Internals:</strong> Each block contains eight masked self-attention heads and a feed-forward network (1,024 dimensions) using GELU activation</li>
<li><strong>Embedding:</strong> 256 dimensions, concatenated with learned positional encodings</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Optimizer:</strong> Adam optimizer for both architectures with $\beta_1=0.9$ and $\beta_2=0.999$.</p>
<p><strong>Learning Rate:</strong></p>
<ul>
<li>LSTM: 0.001</li>
<li>Transformer: 0.0005</li>
</ul>
<p><strong>Batch Size:</strong> 64</p>
<p><strong>Loss Function:</strong> Cross-entropy loss on next-token prediction.</p>
<p><strong>Stopping Criteria:</strong> Early stopping using a validation set (10% of training data) with patience of 50,000 minibatches.</p>
<h3 id="data">Data</h3>
<p><strong>Primary Source:</strong> ChEMBL database (version 28).</p>
<p><strong>Preprocessing Pipeline:</strong></p>
<ul>
<li><strong>Cleaning:</strong> Removal of duplicate SMILES, salts, and solvents (retaining heavy fragments with $\geq 3$ heavy atoms)</li>
<li><strong>Filtering:</strong> Molecules with atoms other than {Br, C, Cl, F, H, I, N, O, P, S} were removed</li>
<li><strong>Normalization:</strong> Charged molecules were neutralized and converted to canonical SMILES</li>
</ul>
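<p>The element filter in the pipeline above amounts to an allow-list check on atom symbols. A sketch with our own helper names (in practice the molecule would be parsed with RDKit first, rather than scanning the raw string):</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Allowed elements from the preprocessing step above.
ALLOWED_ELEMENTS = {"Br", "C", "Cl", "F", "H", "I", "N", "O", "P", "S"}

def passes_element_filter(atom_symbols):
    """atom_symbols: element symbols of a parsed molecule,
    e.g. collected from RDKit atom objects."""
    return all(symbol in ALLOWED_ELEMENTS for symbol in atom_symbols)

assert passes_element_filter(["C", "C", "O", "N"])
assert not passes_element_filter(["C", "Se", "C"])
</code></pre>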
<p><strong>Training Subsets:</strong> Models were trained on random samples of 30,000, 100,000, and 300,000 molecules to test scalability.</p>
<p><strong>Generalization Data:</strong> To test generalization, models were also trained on the <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> database (enumerating drug-like molecules up to 13 heavy atoms).</p>
<p><strong>Structure Elucidation Data:</strong> For practical application tasks, models were trained on natural products (LOTUS, COCONUT), food compounds (FooDB), and environmental contaminants (NORMAN).</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric:</strong> Fréchet ChemNet Distance (FCD), measuring chemical similarity between generated molecules and the training set (lower is better).</p>
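<p>For reference, the FCD is the Fréchet distance between Gaussian fits (mean $\mu$, covariance $\Sigma$) of ChemNet activations for the generated ($g$) and reference ($r$) molecule sets. The formula below comes from the FCD literature and is not restated in the paper summary above:</p>
<p>$$ \text{FCD} = \lVert \mu_g - \mu_r \rVert^2 + \operatorname{Tr}\left( \Sigma_g + \Sigma_r - 2 \left( \Sigma_g \Sigma_r \right)^{1/2} \right) $$</p>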
<p><strong>Secondary Metrics:</strong></p>
<ul>
<li><strong>Validity:</strong> Percentage of outputs parseable by RDKit</li>
<li><strong>Scaffold Similarity:</strong> Jensen-Shannon distances between Murcko scaffold compositions</li>
<li><strong>Physical Properties:</strong> Comparisons of molecular weight, LogP, topological polar surface area (TPSA), and ring counts (aromatic vs. aliphatic)</li>
<li><strong>Structure Elucidation:</strong> &ldquo;Top-k accuracy,&rdquo; the proportion of held-out molecules where the correct structure appeared in the model&rsquo;s top $k$ ranked outputs</li>
</ul>
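<p>The top-$k$ accuracy in the last bullet can be sketched in a few lines. This is our own helper, not code from the paper&rsquo;s archive:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def top_k_accuracy(ranked_candidates, targets, k):
    """Fraction of held-out targets whose correct structure appears
    among the model's top-k ranked candidate SMILES."""
    hits = sum(1 for candidates, target in zip(ranked_candidates, targets)
               if target in candidates[:k])
    return hits / len(targets)

# Toy example: 2 of 3 targets are recovered within the top 2 candidates.
ranked = [["CCO", "CCN"], ["CCC", "CCO"], ["CNC", "CCF"]]
assert top_k_accuracy(ranked, ["CCO", "CCO", "CCC"], k=2) == 2 / 3
</code></pre>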
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Nodes:</strong> Dell EMC C4140 GPU compute nodes</li>
<li><strong>GPUs:</strong> NVIDIA Tesla V100</li>
<li><strong>Compute Time:</strong> Parsing 10 million SMILES took ~7.5 minutes on a single CPU; SELFIES models required an average of 0.6 hours longer to train than SMILES models</li>
</ul>
<h3 id="replicability">Replicability</h3>
<p><strong>Code Availability:</strong> Source code and intermediate data are available via <a href="https://doi.org/10.5281/zenodo.10680855">Zenodo</a>. Pre-trained model weights are not provided in the archive, requiring researchers to train models from scratch using the included scripts to fully replicate the study.</p>
<p><strong>Data Availability:</strong> Training datasets and generated molecule samples (10 million from ChEMBL/GDB-13 models, 100 million from LOTUS/COCONUT/FooDB/NORMAN cross-validation folds) are available via <a href="https://doi.org/10.5281/zenodo.8321735">Zenodo</a>.</p>
<p><strong>Software Libraries:</strong></p>
<ul>
<li><strong>PyTorch:</strong> LSTM and Transformer implementations</li>
<li><strong>RDKit:</strong> SMILES parsing, validity checking, and property calculation</li>
<li><strong>SELFIES:</strong> Version 2.1.1 for conversion</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10680855">Source code (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training scripts, analysis code, and intermediate data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.8321735">Training and generated molecules (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Preprocessed training sets and sampled molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="implications-and-takeaways">Implications and Takeaways</h2>
<p>This work reframes how we think about &ldquo;errors&rdquo; in generative models. The key insight is that outputs which appear incorrect are often low-likelihood samples, and removing them improves overall performance.</p>
<p>The findings suggest that the field&rsquo;s drive toward guaranteed validity leads to systematic biases. Letting models fail informatively and using those failures as quality signals can yield better distribution learning. This is relevant as the field moves toward larger, more capable models where such self-correction mechanisms become increasingly valuable.</p>
<p>For practitioners, the takeaway is to consider the role of invalid outputs before eliminating them. Filtering low-confidence generations provides automatic quality control that improves final results.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Machine Intelligence, 6(4), 437-448. <a href="https://doi.org/10.1038/s42256-024-00821-x">https://doi.org/10.1038/s42256-024-00821-x</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence (2024)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{skinnider2024invalid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Invalid SMILES are beneficial rather than detrimental to chemical language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Skinnider, Michael A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{437--448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group UK London}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES Notation: The Original Paper by Weininger (1988)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</guid><description>Weininger's 1988 paper introducing SMILES notation, a string-based molecular representation that became a standard in computational chemistry.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. <em>Journal of Chemical Information and Computer Sciences</em>, 28(1), 31-36. <a href="https://doi.org/10.1021/ci00057a005">https://doi.org/10.1021/ci00057a005</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1988</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation overview</a> - Modern usage summary</li>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES to 2D images</a> - Practical visualization tutorial</li>
</ul>
<h2 id="core-contribution-a-string-based-molecular-notation">Core Contribution: A String-Based Molecular Notation</h2>
<p>This is a <strong>Method</strong> paper that introduces a novel notation system for representing chemical structures as text strings. It establishes the encoding rules and input conventions for SMILES (Simplified Molecular Input Line Entry System), while explicitly deferring the canonicalization algorithm to subsequent papers in the series.</p>
<h2 id="the-computational-complexity-of-chemical-information-in-the-1980s">The Computational Complexity of Chemical Information in the 1980s</h2>
<p>As computers became central to chemical information processing in the 1980s, the field faced a fundamental problem: existing line notations were either too complex for chemists to use practically or too limited for computational applications. Previous systems required extensive training to write correctly and were prone to errors.</p>
<p>The goal was ambitious: create a system that could represent any molecule as a simple text string, making it both human-readable and machine-efficient. This would enable compact database storage, fast processing, and easy exchange between software systems.</p>
<h2 id="separating-input-rules-from-canonicalization">Separating Input Rules from Canonicalization</h2>
<p>Weininger&rsquo;s key insight was to separate the problem into two parts: create simple, flexible rules that chemists could easily learn for input, while deferring to the computer the complex task of generating a unique, canonical representation. This division of labor made SMILES both practical and powerful.</p>
<p>The specific innovations include:</p>
<ol>
<li><strong>Simple input rules</strong> - Chemists could write molecules intuitively (e.g., <code>CCO</code> or <code>OCC</code> for ethanol)</li>
<li><strong>Ring closure notation</strong> - Breaking one bond and marking ends with matching digits</li>
<li><strong>Implicit hydrogens</strong> - Automatic calculation based on standard valences keeps strings compact</li>
<li><strong>Algorithmic aromaticity detection</strong> - Automatic recognition of aromatic systems from Kekulé structures</li>
<li><strong>Human-readable output</strong> - Unlike binary formats, SMILES strings are readable and debuggable</li>
</ol>
<p><strong>Important scope note</strong>: This first paper in the series establishes the input syntax and encoding rules. The canonicalization algorithm (how to generate unique SMILES) is explicitly stated as the subject of following papers: &ldquo;specification of isomerisms, substructures, and unique SMILES generation are the subjects of following papers.&rdquo;</p>
<h2 id="demonstrating-notation-rules-across-molecular-classes">Demonstrating Notation Rules Across Molecular Classes</h2>
<p>The paper is primarily a specification document establishing notation rules. The methodology is demonstrated through worked examples showing how to encode various molecular structures:</p>
<ul>
<li><strong>Basic molecules</strong>: Ethane (<code>CC</code>), ethylene (<code>C=C</code>), acetylene (<code>C#C</code>)</li>
<li><strong>Branches</strong>: Isobutyric acid (<code>CC(C)C(=O)O</code>)</li>
<li><strong>Rings</strong>: Cyclohexane (<code>C1CCCCC1</code>), benzene (<code>c1ccccc1</code>)</li>
<li><strong>Aromatic systems</strong>: Tropone (<code>O=c1cccccc1</code>), quinone (showing exocyclic bond effects)</li>
<li><strong>Complex structures</strong>: Morphine (40 characters vs 1000-2000 for connection tables)</li>
<li><strong>Edge cases</strong>: Salts, isotopes, charged species, tautomers</li>
</ul>
<p>Performance comparisons are mentioned qualitatively: SMILES processing was approximately 100 times faster than traditional connection table methods on the hardware of the era (1988), with dramatic reductions in storage space.</p>
<h2 id="performance-and-practical-viability">Performance and Practical Viability</h2>
<p>The paper successfully establishes SMILES as a practical notation system with several key outcomes:</p>
<p><strong>Practical benefits</strong>:</p>
<ul>
<li><strong>Compactness</strong>: 40 characters for morphine vs 1000-2000 for connection tables</li>
<li><strong>Speed</strong>: ~100x faster processing than traditional methods</li>
<li><strong>Accessibility</strong>: Simple enough for chemists to learn without extensive training</li>
<li><strong>Machine-friendly</strong>: Efficient parsing and string-based operations</li>
</ul>
<p><strong>Design principles validated</strong>:</p>
<ul>
<li>Separating user input from canonical representation makes the system both usable and rigorous</li>
<li>Implicit hydrogens reduce string length without loss of information</li>
<li>Ring closure notation with digit markers is more intuitive than complex graph syntax</li>
<li>Automatic aromaticity detection handles most cases correctly</li>
</ul>
<p><strong>Acknowledged limitations</strong>:</p>
<ul>
<li>Canonicalization algorithm not included in this paper</li>
<li>Stereochemistry handling deferred to subsequent papers</li>
<li>Some edge cases (like unusual valence states) require explicit specification</li>
</ul>
<p>The paper concludes by positioning SMILES as a foundation for database storage, substructure searching, and chemical informatics applications - a vision that proved accurate as SMILES became one of the most widely used molecular representations in computational chemistry.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To implement the method described in this paper, the following look-up tables and algorithms are required. <strong>Note</strong>: These details are critical for replication but are often glossed over in high-level summaries.</p>
<h3 id="1-the-valence-look-up-table">1. The Valence Look-Up Table</h3>
<p>To calculate implicit hydrogens, the system assumes the &ldquo;lowest normal valence&rdquo; greater than or equal to the explicit bond count. The paper explicitly defines these valences:</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Allowed Valences</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>B</td>
          <td>3</td>
      </tr>
      <tr>
          <td>C</td>
          <td>4</td>
      </tr>
      <tr>
          <td>N</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>O</td>
          <td>2</td>
      </tr>
      <tr>
          <td>P</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>S (aliphatic)</td>
          <td>2, 4, 6</td>
      </tr>
      <tr>
          <td>S (aromatic)</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>F, Cl, Br, I</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p><strong>Example</strong>: For sulfur in $\text{H}_2\text{SO}_4$ written as <code>OS(=O)(=O)O</code>, the explicit bond count is 6 (two single bonds + two double bonds to four oxygens), so the system uses valence 6 with zero implicit hydrogens. Without knowing S allows valence 6, the algorithm would fail.</p>
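<p>The look-up itself is a short loop: take the lowest allowed valence at or above the explicit bond count, and assign the difference as implicit hydrogens. A sketch covering the aliphatic valences from the table above (aromatic sulfur is omitted for brevity; function names are our own):</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Aliphatic valences from the paper's look-up table.
VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(element, explicit_bonds):
    """Lowest normal valence at or above the explicit bond count
    determines the implicit hydrogen count."""
    for valence in VALENCES[element]:
        if valence >= explicit_bonds:
            return valence - explicit_bonds
    raise ValueError(f"no normal valence of {element} covers {explicit_bonds} bonds")

assert implicit_hydrogens("C", 1) == 3  # each carbon in ethane, CC
assert implicit_hydrogens("S", 6) == 0  # sulfur in OS(=O)(=O)O
assert implicit_hydrogens("N", 4) == 1  # nitrogen falls through to valence 5
</code></pre>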
<h3 id="2-explicit-hydrogen-requirements">2. Explicit Hydrogen Requirements</h3>
<p>The paper lists exactly three cases where hydrogen atoms are retained (not suppressed):</p>
<ol>
<li><strong>Hydrogen connected to other hydrogen</strong> (molecular hydrogen, $\text{H}_2$, written as <code>[H][H]</code>)</li>
<li><strong>Hydrogen connected to zero or more than one other atom</strong> (bridging hydrogens, isolated protons)</li>
<li><strong>Isotopic hydrogen specifications</strong> in isomeric SMILES (deuterium <code>[2H]</code>, tritium <code>[3H]</code>)</li>
</ol>
<p>For all other cases, hydrogens are implicit and calculated from the valence table.</p>
<h3 id="3-ring-closure-notation">3. Ring Closure Notation</h3>
<p>Standard SMILES supports single digits <code>1-9</code> for ring closures. For rings numbered 10 and higher, the notation requires a <strong>percent sign prefix</strong>:</p>
<ul>
<li>Ring closures 1-9: <code>C1CCCCC1</code></li>
<li>Ring closures 10+: <code>C%10CCCCC%10</code>, <code>C2%13%24</code> (ring 2, ring 13, ring 24)</li>
</ul>
<p>Without this rule, a parser would fail on large polycyclic structures.</p>
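<p>A regular expression captures both label forms. This is a deliberately simplified sketch of our own, not a full tokenizer: it would misread digits inside bracket atoms such as <code>[13C]</code>, which a real parser must handle separately.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import re

# Ring-closure labels: "%NN" for rings 10 and above, else a single digit.
RING_LABEL = re.compile(r"%\d\d|\d")

def ring_labels(smiles):
    """Ring-closure labels in order of appearance."""
    return [int(token.lstrip("%")) for token in RING_LABEL.findall(smiles)]

assert ring_labels("C1CCCCC1") == [1, 1]
assert ring_labels("C%10CCCCC%10") == [10, 10]
assert ring_labels("C2%13%24") == [2, 13, 24]
</code></pre>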
<h3 id="4-aromaticity-detection-algorithm">4. Aromaticity Detection Algorithm</h3>
<p>The system uses an extended version of Hückel&rsquo;s Rule ($4N+2$ π-electrons). The &ldquo;excess electron&rdquo; count for the aromatic system is determined by these rules:</p>
<p><strong>Carbon contribution</strong>:</p>
<ul>
<li><strong>C in aromatic ring</strong>: Contributes 1 electron</li>
<li><strong>C double-bonded to exocyclic electronegative atom</strong> (e.g., $\text{C}=\text{O}$ in quinone): Contributes 0 electrons (the carbon &ldquo;loses&rdquo; its electron to the oxygen)</li>
</ul>
<p><strong>Heteroatom contribution</strong>:</p>
<ul>
<li><strong>O, S in ring</strong>: Contributes 2 electrons (lone pair)</li>
<li><strong>N in ring</strong>: Contributes 1 electron (pyridine-like) or 2 electrons (pyrrole-like, must have explicit hydrogen <code>[nH]</code>)</li>
</ul>
<p><strong>Charge effects</strong>:</p>
<ul>
<li><strong>Positive charge</strong>: Reduces electron count by 1</li>
<li><strong>Negative charge</strong>: Increases electron count by 1</li>
</ul>
<p><strong>Critical example - Quinone</strong>:</p>
<pre tabindex="0"><code>O=C1C=CC(=O)C=C1
</code></pre><p>Quinone has 6 carbons in the ring, but the two carbons bonded to exocyclic oxygens contribute 0 electrons each. The four remaining carbons contribute 4 electrons total (not 6), so quinone is <strong>not aromatic</strong> by this algorithm. This exocyclic bond rule is essential for correct aromaticity detection.</p>
<p><strong>Aromatic ring test</strong>:</p>
<ol>
<li>All atoms must be sp² hybridized</li>
<li>Count excess electrons using the rules above</li>
<li>Calculate whether the system complies with Hückel&rsquo;s parity rule constraint:
$$ \text{Excess Electrons} \equiv 2 \pmod 4 \iff \text{Excess Electrons} = 4N + 2 $$
If the electron count satisfies this property for some integer $N$, the ring is determined to be aromatic.</li>
</ol>
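<p>Step 3&rsquo;s parity test reduces to a single modular check. A sketch (the function name is our own):</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def is_huckel_aromatic(excess_electrons):
    """True iff the excess pi-electron count equals 4N + 2
    for some non-negative integer N."""
    return excess_electrons % 4 == 2

assert is_huckel_aromatic(6)      # benzene: six ring carbons, one electron each
assert not is_huckel_aromatic(4)  # quinone ring: the two C=O carbons contribute 0
</code></pre>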
<h2 id="encoding-rules-reference">Encoding Rules Reference</h2>
<p>The following sections provide a detailed reference for the six fundamental SMILES encoding rules. These are the rules a user would apply when writing SMILES strings.</p>
<h3 id="1-atoms">1. Atoms</h3>
<p>Atoms use their standard chemical symbols. Elements in the &ldquo;organic subset&rdquo; (B, C, N, O, P, S, F, Cl, Br, I) can be written directly when they have their most common valence - so <code>C</code> automatically means a carbon with enough implicit hydrogens to satisfy its valence.</p>
<p>Everything else goes in square brackets: <code>[Au]</code> for gold, <code>[NH4+]</code> for ammonium ion, or <code>[13C]</code> for carbon-13. Aromatic atoms get lowercase letters: <code>c</code> for aromatic carbon in benzene.</p>
<h3 id="2-bonds">2. Bonds</h3>
<p>Bond notation is straightforward:</p>
<ul>
<li><code>-</code> for single bonds (usually omitted)</li>
<li><code>=</code> for double bonds</li>
<li><code>#</code> for triple bonds</li>
<li><code>:</code> for aromatic bonds (also usually omitted)</li>
</ul>
<p>So <code>CC</code> and <code>C-C</code> both represent ethane, while <code>C=C</code> is ethylene.</p>
<h3 id="3-branches">3. Branches</h3>
<p>Branches use parentheses, just like in mathematical expressions. Isobutyric acid becomes <code>CC(C)C(=O)O</code>: the main chain is <code>CCC(=O)O</code>, with a methyl <code>(C)</code> branch on the second carbon.</p>
<h3 id="4-rings">4. Rings</h3>
<p>This is where SMILES gets clever. You break one bond and mark both ends with the same digit. Cyclohexane becomes <code>C1CCCCC1</code> - the <code>1</code> connects the first and last carbon, closing the ring.</p>
<p>You can reuse digits for different rings in the same molecule, making complex structures manageable.</p>
<h3 id="5-disconnected-parts">5. Disconnected Parts</h3>
<p>Salts and other disconnected structures use periods. Sodium phenoxide: <code>[Na+].[O-]c1ccccc1</code>. The order doesn&rsquo;t matter - you&rsquo;re just listing the separate components.</p>
<h3 id="6-aromaticity">6. Aromaticity</h3>
<p>Aromatic rings can be written directly with lowercase letters. Benzoic acid becomes <code>c1ccccc1C(=O)O</code>. The system can also detect aromaticity automatically from Kekulé structures, so <code>C1=CC=CC=C1C(=O)O</code> works just as well.</p>
<h3 id="simplified-subset-for-organic-chemistry">Simplified Subset for Organic Chemistry</h3>
<p>Weininger recognized that most chemists work primarily with organic compounds, so he defined a simplified subset that covers the vast majority of cases. For organic molecules, you only need four rules:</p>
<ol>
<li><strong>Atoms</strong>: Use standard symbols (C, N, O, etc.)</li>
<li><strong>Multiple bonds</strong>: Use <code>=</code> and <code>#</code> for double and triple bonds</li>
<li><strong>Branches</strong>: Use parentheses <code>()</code></li>
<li><strong>Rings</strong>: Use matching digits</li>
</ol>
<p>This &ldquo;basic SMILES&rdquo; covers the vast majority of organic compounds, making the system immediately accessible without having to learn all the edge cases.</p>
<h2 id="design-decisions-and-edge-cases">Design Decisions and Edge Cases</h2>
<p>Beyond the basic rules, the paper established several important conventions for handling ambiguous cases:</p>
<h3 id="hydrogen-handling">Hydrogen Handling</h3>
<p>Hydrogens are usually implicit - the system calculates how many each atom needs based on standard valences. So <code>C</code> represents CH₄, <code>N</code> represents NH₃, and so on. This keeps strings compact and readable.</p>
<p>Explicit hydrogens only appear in special cases: when hydrogen connects to multiple atoms, when you need to specify an exact count, or in isotopic specifications like <code>[2H]</code> for deuterium.</p>
<h3 id="bond-representation">Bond Representation</h3>
<p>The paper made an important choice about how to represent bonds in ambiguous cases. For example, nitromethane could be written as charge-separated <code>C[N+](=O)[O-]</code> or with covalent double bonds <code>CN(=O)=O</code>. Weininger chose to prefer the covalent form when possible, because it preserves the correct topological symmetry.</p>
<p>However, when covalent representation would require unusual valences, charge separation is preferred. Diazomethane becomes <code>C=[N+]=[N-]</code> to avoid forcing carbon into an unrealistic valence state.</p>
<h3 id="tautomers">Tautomers</h3>
<p>SMILES doesn&rsquo;t try to be too clever about tautomers - it represents exactly what you specify. So 2-pyridone can be written as either the enol form <code>Oc1ncccc1</code> or the keto form <code>O=c1[nH]cccc1</code>. The system won&rsquo;t automatically convert between them.</p>
<p>This explicit approach means you need to decide which tautomeric form to represent, but it also means the notation precisely captures what you intend.</p>
<h3 id="aromaticity-detection">Aromaticity Detection</h3>
<p>One of the most sophisticated parts of the original system was automatic aromaticity detection. The algorithm uses an extended Hückel rule: a ring is aromatic if all atoms are sp² hybridized and it contains 4N+2 π-electrons.</p>
<p>This means you can input benzene as the Kekulé structure <code>C1=CC=CC=C1</code> and the system will automatically recognize it as aromatic and convert it to <code>c1ccccc1</code>. The algorithm handles complex cases like tropone (<code>O=c1cccccc1</code>) and correctly identifies them as aromatic.</p>
<h3 id="aromatic-nitrogen">Aromatic Nitrogen</h3>
<p>The system makes an important distinction for nitrogen in aromatic rings. Pyridine-type nitrogen (like in pyridine itself) is written as <code>n</code> and has no attached hydrogens. Pyrrole-type nitrogen has an attached hydrogen that must be specified explicitly: <code>[nH]1cccc1</code> for pyrrole.</p>
<p>This distinction captures the fundamental difference in electron contribution between these two nitrogen types in aromatic systems.</p>
<h2 id="impact-and-legacy">Impact and Legacy</h2>
<p>Nearly four decades later, SMILES remains one of the most widely used molecular notations in computational chemistry. The notation became the foundation for:</p>
<ul>
<li><strong>Database storage</strong> - Compact, searchable molecular representations</li>
<li><strong>Substructure searching</strong> - Pattern matching in chemical databases</li>
<li><strong>Property prediction</strong> - Input format for QSAR models</li>
<li><strong>Chemical informatics</strong> - Standard exchange format between software</li>
<li><strong>Modern ML</strong> - Text-based representation for neural networks</li>
</ul>
<p>While newer approaches like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> have addressed some limitations (like the possibility of invalid strings), SMILES&rsquo; combination of simplicity and power has made it enduringly useful.</p>
<p>The paper established both a notation system and a design philosophy: chemical informatics tools should be powerful enough for computers while remaining accessible to working chemists. That balance remains relevant today as we develop new molecular representations for machine learning and AI applications.</p>
]]></content:encoded></item><item><title>SELFIES: The Original Paper on Robust Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</guid><description>The 2020 paper introducing SELFIES, the 100% robust molecular representation that solves SMILES validity problems in ML applications.</description><content:encoded><![CDATA[<h2 id="contribution-a-100-robust-representation-for-ml">Contribution: A 100% Robust Representation for ML</h2>
<p>This is a <strong>Method</strong> paper that introduces a new molecular string representation designed specifically for machine learning applications.</p>
<h2 id="motivation-the-invalidity-bottleneck">Motivation: The Invalidity Bottleneck</h2>
<p>When neural networks generate molecules using <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation</a>, a large fraction of the output strings are invalid, containing either syntax errors or chemically impossible structures. This is a fundamental bottleneck: a generative model that wastes most of its samples on invalid molecules squanders computational effort and severely limits chemical space exploration.</p>
<h2 id="novelty-a-formal-grammar-approach">Novelty: A Formal Grammar Approach</h2>
<p>The authors&rsquo; key insight was using a <strong>formal grammar approach</strong> (specifically, a Chomsky type-2, context-free grammar with self-referencing functions) where each symbol is interpreted based on chemical context. The &ldquo;state of the derivation&rdquo; tracks available valence bonds, preventing impossible structures like a carbon with five single bonds.</p>
<p>For example, generating 2-Fluoroethenimine (<code>FC=C=N</code>) follows a state derivation where each step restricts the available valency for the next element:</p>
<p>$$
\mathbf{X}_0 \xrightarrow{[F]} \text{F } \mathbf{X}_1 \xrightarrow{[=C]} \text{FC } \mathbf{X}_3 \xrightarrow{[=C]} \text{FC=C } \mathbf{X}_2 \xrightarrow{[\#N]} \text{FC=C=N}
$$</p>
<p>This approach guarantees 100% validity: every SELFIES string corresponds to a valid molecule, and every valid molecule can be represented.</p>
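<p>The derivation-state mechanism can be illustrated with a toy decoder. This is a heavy simplification of the actual SELFIES grammar (linear chains only, our own names, no rings or branches), but it shows how bond demotion enforces valence:</p>

```python
# Toy illustration of SELFIES-style state derivation (NOT the real
# grammar): each symbol requests an element and a bond order, and the
# current state X_n caps the bond at n free valences, so an
# over-demanding symbol is demoted instead of producing an invalid atom.

VALENCE = {"F": 1, "C": 4, "N": 3, "O": 2}
BOND = {0: "", 1: "", 2: "=", 3: "#"}

def derive(symbols):
    """Decode (element, requested_bond) pairs into a linear chain,
    tracking the remaining free valence as the derivation state."""
    out, state = "", None
    for elem, bond in symbols:
        b = 0 if state is None else min(bond, state)  # bond demotion
        out += BOND[b] + elem
        state = VALENCE[elem] - b
        if state <= 0:  # no free valence left: the chain terminates
            break
    return out

# [F][=C][=C][#N] -> FC=C=N: the [#N] triple bond is demoted to a double
print(derive([("F", 1), ("C", 2), ("C", 2), ("N", 3)]))  # FC=C=N
```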
<h2 id="methodology--experiments-validating-robustness">Methodology &amp; Experiments: Validating Robustness</h2>
<p>The authors ran several experiments to demonstrate SELFIES&rsquo; robustness:</p>
<h3 id="random-mutation-test">Random Mutation Test</h3>
<p>They took the SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations of MDMA and introduced random changes:</p>
<ul>
<li><strong>SMILES</strong>: After just one random mutation, only 9.9% of strings remained valid (dropping to 1.1% after three mutations).</li>
<li><strong>SELFIES</strong>: 100% of mutated strings still represented valid molecules (though different from the original).</li>
</ul>
<p>This empirical difference demonstrates why SELFIES is well suited for evolutionary algorithms and genetic programming approaches to molecular design, where random mutations of strings are a core operation.</p>
<h3 id="generative-model-performance">Generative Model Performance</h3>
<p>The real test came with actual machine learning models. The authors trained Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) on both representations:</p>
<p><strong>VAE Results:</strong></p>
<ul>
<li>SMILES-based VAE: Large invalid regions scattered throughout the latent space</li>
<li>SELFIES-based VAE: Every point in the continuous latent space mapped to a valid molecule</li>
<li>The SELFIES model encoded <strong>over 100 times more diverse molecules</strong></li>
</ul>
<p><strong>GAN Results:</strong></p>
<ul>
<li>Best SMILES GAN: 18.6% diverse, valid molecules</li>
<li>Best SELFIES GAN: 78.9% diverse, valid molecules</li>
</ul>
<p><strong>Evaluation Metrics:</strong></p>
<ul>
<li><strong>Validity</strong>: Percentage of generated strings representing valid molecular structures</li>
<li><strong>Diversity</strong>: Number of unique valid molecules produced</li>
<li><strong>Reconstruction Accuracy</strong>: How well the autoencoder reproduced input molecules</li>
</ul>
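<p>The first two metrics are straightforward to compute over a batch of generated strings. A minimal sketch, where the validity checker is a stand-in (in practice it would be a round trip through a cheminformatics toolkit):</p>

```python
def generation_metrics(generated, is_valid):
    """Validity: fraction of generated strings that decode to a real
    molecule. Diversity: count of unique valid molecules. The checker
    `is_valid` is caller-supplied; here it is kept abstract."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated) if generated else 0.0
    diversity = len(set(valid))
    return validity, diversity

# Toy run with a stub checker that rejects unbalanced ring-closure digits:
stub = lambda s: s.count("1") % 2 == 0
v, d = generation_metrics(["c1ccccc1", "c1ccccc1", "C1CC", "CCO"], stub)
print(v, d)  # 0.75 2
```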
<h3 id="scalability-test">Scalability Test</h3>
<p>The authors showed SELFIES works beyond toy molecules by successfully encoding and decoding all <strong>72 million molecules</strong> from the PubChem database (with fewer than 500 SMILES characters per molecule), demonstrating practical applicability to real chemical databases.</p>
<h2 id="results--conclusions-chemical-space-exploration">Results &amp; Conclusions: Chemical Space Exploration</h2>
<p><strong>Key Findings:</strong></p>
<ul>
<li>SELFIES achieves 100% validity guarantee: every string represents a valid molecule</li>
<li>SELFIES-based VAEs encode over 100x more diverse molecules than SMILES-based models</li>
<li>SELFIES-based GANs produce 78.9% diverse valid molecules vs. 18.6% for SMILES GANs</li>
<li>Successfully validated on all 72 million PubChem molecules</li>
</ul>
<p><strong>Limitations Acknowledged:</strong></p>
<ul>
<li>No standardization or canonicalization method at time of publication</li>
<li>The initial grammar covered only small biomolecules; extensions for stereochemistry, ions, polyvalency, and full periodic table coverage were planned</li>
<li>Requires community testing and adoption</li>
</ul>
<p><strong>Impact:</strong></p>
<p>This work demonstrated that designing ML-native molecular representations could enable new approaches in drug discovery and materials science. SELFIES was subsequently evaluated as an alternative input representation to SMILES in <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, a transformer pretrained on molecular strings for property prediction, where it performed comparably to SMILES on the Tox21 benchmark, though the comparison was limited to a single task.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The machine learning experiments used two distinct datasets:</p>
<ul>
<li><strong>QM9</strong> (134k molecules): Primary training dataset for VAE and GAN models</li>
<li><strong>PubChem</strong> (72M molecules): Used only to test representation coverage and scalability; not used for model training</li>
</ul>
<h3 id="models">Models</h3>
<p>The VAE implementation included:</p>
<ul>
<li><strong>Latent space</strong>: 241-dimensional with Gaussian distributions</li>
<li><strong>Input encoding</strong>: One-hot encoding of SELFIES/SMILES strings</li>
<li>Full architectural details (encoder/decoder structures, layer types) provided in Supplementary Information</li>
</ul>
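<p>The one-hot input encoding can be sketched in pure Python. This is a stdlib illustration of the idea, not the paper's code; the <code>[nop]</code> padding symbol follows the SELFIES library's convention, while the function name is ours:</p>

```python
import re

def one_hot(selfies_str, alphabet, pad_to):
    """One-hot encode a SELFIES string. SELFIES symbols are bracketed
    tokens, so tokenization is a single regex; shorter strings are
    padded with the no-op symbol [nop]."""
    tokens = re.findall(r"\[[^\]]*\]", selfies_str)
    tokens += ["[nop]"] * (pad_to - len(tokens))
    index = {sym: i for i, sym in enumerate(alphabet)}
    return [[1 if index[t] == j else 0 for j in range(len(alphabet))]
            for t in tokens]

alphabet = ["[nop]", "[C]", "[=C]", "[F]", "[#N]"]
enc = one_hot("[F][=C][=C][#N]", alphabet, pad_to=6)
print(len(enc), len(enc[0]))  # 6 5
```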
<h3 id="algorithms">Algorithms</h3>
<p>The authors found GAN performance was highly sensitive to hyperparameter selection:</p>
<ul>
<li>Searched <strong>200 different hyperparameter configurations</strong> to achieve the reported 78.9% diversity</li>
<li>Specific optimizers, learning rates, and training duration detailed in Supplementary Information</li>
<li>Full rule generation algorithm provided in Table 2</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>All models evaluated on:</p>
<ul>
<li><strong>Validity rate</strong>: Percentage of syntactically and chemically valid outputs</li>
<li><strong>Diversity</strong>: Count of unique valid molecules generated</li>
<li><strong>Reconstruction accuracy</strong>: Fidelity of autoencoder reconstruction (VAEs only)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training performed on the SciNet supercomputing infrastructure.</li>
<li>The paper does not specify GPU types or training times.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation; has evolved significantly since the original paper</td>
      </tr>
  </tbody>
</table>
<h3 id="replication-resources">Replication Resources</h3>
<p>Complete technical replication is highly accessible because the paper was published open-access in <em>Machine Learning: Science and Technology</em>. Replication primarily requires:</p>
<ul>
<li>The full rule generation algorithm (Table 2 in paper)</li>
<li>Code: <a href="https://github.com/aspuru-guzik-group/selfies">https://github.com/aspuru-guzik-group/selfies</a></li>
<li>Supplementary Information for complete architectural and hyperparameter specifications</li>
</ul>
<p><strong>Note</strong>: The <a href="/notes/chemistry/molecular-representations/notations/selfies/">modern SELFIES library</a> has evolved significantly since this foundational paper, addressing many of the implementation challenges identified by the authors.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024. <a href="https://doi.org/10.1088/2632-2153/aba947">https://doi.org/10.1088/2632-2153/aba947</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn_2020,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/aba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1088%2F2632-2153%2Faba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{{IOP} Publishing}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{045024}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Mario Krenn and Florian H{\&#34;{a}}se and AkshatKumar Nigam and Pascal Friederich and Alan Aspuru-Guzik}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Self-referencing embedded strings ({SELFIES}): A 100{\%} robust molecular string representation}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">Modern SELFIES Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>RInChI: The Reaction International Chemical Identifier</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/rinchi/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/rinchi/</guid><description>RInChI extends InChI to create unique, machine-readable identifiers for chemical reactions and database searching.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-scope">Paper Classification and Scope</h2>
<p>This is an <strong>infrastructure/resource paper</strong> combined with a <strong>methods paper</strong>. It establishes a standard format, releases an open-source software library, and enables large-scale database operations. The methods component details the specific algorithmic rules for constructing identifiers through hashing, sorting, and layering.</p>
<h2 id="the-need-for-standardized-reaction-identifiers">The Need for Standardized Reaction Identifiers</h2>
<p>While we have excellent standards for identifying individual molecules (like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>), there was no equivalent for chemical reactions. This creates real problems:</p>
<ul>
<li>Different researchers working on the same reaction might describe it completely differently</li>
<li>Searching large reaction databases becomes nearly impossible</li>
<li>No way to check if two apparently different reaction descriptions are actually the same process</li>
<li>Chemical databases can&rsquo;t easily link related reactions or identify duplicates</li>
</ul>
<p>If a reaction converts &ldquo;starting material A + reagent B to product C,&rdquo; it is difficult to determine if that is identical to another researcher&rsquo;s description of the same transformation using different names or graphical representations. A working group was established in 2008 to address this, producing prototype versions at the University of Cambridge starting in 2011. The first official release (RInChI V1.00) was funded by the InChI Trust.</p>
<h2 id="core-innovation-standardizing-reaction-strings">Core Innovation: Standardizing Reaction Strings</h2>
<p>RInChI solves this by creating a standardized, machine-readable label for any chemical reaction. The key insight is to focus on the essential chemistry while ignoring experimental details that can vary between labs.</p>
<h3 id="core-principles">Core Principles</h3>
<p>RInChI captures three fundamental pieces of information:</p>
<ol>
<li><strong>Starting materials</strong>: What molecules you begin with</li>
<li><strong>Products</strong>: What molecules you end up with</li>
<li><strong>Agents</strong>: Substances present at both the beginning and end (catalysts, solvents, etc.)</li>
</ol>
<p>Importantly, RInChI intentionally excludes experimental conditions like temperature, pressure, yield, or reaction time. These details can vary significantly even for identical chemical transformations, so including them would make it nearly impossible for different researchers to generate the same identifier.</p>
<h3 id="how-rinchi-works">How RInChI Works</h3>
<h4 id="the-rinchi-string-structure">The RInChI String Structure</h4>
<p>A RInChI string has six distinct layers. Crucially, <strong>Layers 2 and 3 are assigned alphabetically</strong>. This is essential for generating consistent identifiers.</p>
<p><strong>Layer 1: Version</strong></p>
<ul>
<li>Standard header defining the RInChI version (e.g., <code>RInChI=1.00.1S</code>)</li>
</ul>
<p><strong>Layers 2 &amp; 3: Component Molecules</strong></p>
<ul>
<li>These layers contain the InChI strings of reaction participants (reactants and products)</li>
<li><strong>Sorting Rule</strong>: The distinct groups (Reactant Group vs. Product Group) are sorted alphabetically as aggregate strings. The group that comes first alphabetically becomes <strong>Layer 2</strong>; the other becomes <strong>Layer 3</strong></li>
<li>This means if a product&rsquo;s InChI is alphabetically &ldquo;earlier&rdquo; than the reactant&rsquo;s, the product goes in Layer 2</li>
<li><strong>Formatting</strong>: Molecules within a layer are separated by <code>!</code>. The two layers are separated by <code>&lt;&gt;</code></li>
</ul>
<p><strong>Layer 4: Agents</strong></p>
<ul>
<li>Contains catalysts, solvents, and any molecule found in <em>both</em> the reactant and product input lists</li>
<li><strong>Algorithmic rule</strong>: Anything appearing in both the reactant list and product list must be removed from both and added to Layer 4</li>
</ul>
<p><strong>Layer 5: Direction (The Decoder)</strong></p>
<ul>
<li>This layer determines which component layer represents the starting material:
<ul>
<li><code>/d+</code>: Layer 2 is the Starting Material (forward direction)</li>
<li><code>/d-</code>: Layer 3 is the Starting Material (reverse direction)</li>
<li><code>/d=</code>: Equilibrium reaction</li>
</ul>
</li>
<li>Without this layer, you cannot determine reactants from products</li>
</ul>
<p><strong>Layer 6: No-Structure Data</strong></p>
<ul>
<li>Format: <code>/uA-B-C</code> where the numbers indicate the count of structureless materials in Layer 2, Layer 3, and Layer 4 respectively</li>
<li>Used when substances lack defined structures and cannot be represented by InChI</li>
</ul>
<h3 id="separator-syntax">Separator Syntax</h3>
<p>For parsing or generating RInChI strings, the separator characters are:</p>
<table>
  <thead>
      <tr>
          <th>Separator</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>/</code></td>
          <td>Separates layers</td>
      </tr>
      <tr>
          <td><code>!</code></td>
          <td>Separates molecules within a layer</td>
      </tr>
      <tr>
          <td><code>&lt;&gt;</code></td>
<td>Separates the component groups (Layers 2, 3, and 4)</td>
      </tr>
  </tbody>
</table>
<h3 id="example-structure">Example Structure</h3>
<pre><code>RInChI=1.00.1S/[Layer2 InChIs]&lt;&gt;[Layer3 InChIs]&lt;&gt;[Agent InChIs]/d+/u0-0-0
</code></pre>
<p>This systematic approach ensures that any researcher starting with the same reaction will generate an identical RInChI string.</p>
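<p>The layer rules above are enough to split a RInChI string into its parts. The following is a rough stdlib sketch, not the official InChI Trust parser; it peels the trailing <code>/d</code> and <code>/u</code> layers from the right, since InChI bodies themselves contain <code>/</code> characters, and the example InChI bodies are illustrative:</p>

```python
def parse_rinchi(rinchi):
    """Split a RInChI string into version, component groups, direction,
    and no-structure counts (illustrative sketch only)."""
    body, no_struct = rinchi, None
    if "/u" in body:                            # Layer 6: no-structure counts
        body, _, u = body.rpartition("/u")
        no_struct = [int(x) for x in u.split("-")]
    direction = None
    if "/d" in body:                            # Layer 5: direction (+, -, =)
        body, _, direction = body.rpartition("/d")
    version, _, groups = body.partition("/")    # Layer 1 vs component layers
    parts = groups.split("<>") + ["", "", ""]   # pad in case agents are absent
    split = lambda s: s.split("!") if s else []
    return {"version": version,
            "layer2": split(parts[0]),
            "layer3": split(parts[1]),
            "agents": split(parts[2]),
            "direction": direction,
            "no_structures": no_struct}

example = ("RInChI=1.00.1S/C2H6O/c1-2-3/h3H,2H2,1H3!CH2O/c1-2/h1H2"
           "<>C3H6O2/c1-2-3(4)5/h2H2,1H3,(H,4,5)"
           "<>H2O/h1H2/d+/u0-0-0")
print(parse_rinchi(example)["direction"])  # +
```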
<h3 id="rinchikeys-shorter-identifiers-for-practical-use">RInChIKeys: Shorter Identifiers for Practical Use</h3>
<p>Since full RInChI strings can become extremely long, the standard includes three types of shorter, hashed keys for different applications:</p>
<h4 id="long-rinchikey">Long-RInChIKey</h4>
<ul>
<li>Contains complete InChIKeys for every molecule in the reaction</li>
<li>Variable length, but allows searching for reactions containing specific compounds</li>
<li>Useful for substructure searches: &ldquo;Show me all reactions involving compound X&rdquo;</li>
</ul>
<h4 id="short-rinchikey">Short-RInChIKey</h4>
<ul>
<li>Fixed length (63 characters): 55 letters plus eight hyphens</li>
<li>Generated by separately hashing the major InChI layers (molecular formula and connectivity) of Layers 2, 3, and 4 into ten-character strings, then hashing the minor layers (stereochemistry) and protonation states into five-character groups</li>
<li>Suitable for exact matching, database indexing, and linking identical reactions across different databases</li>
</ul>
<h4 id="web-rinchikey">Web-RInChIKey</h4>
<ul>
<li>Shortest format (47 characters)</li>
<li>Generated by combining all InChIs from every layer, removing duplicates, sorting alphabetically, then hashing the major layers into a seventeen-character block and the minor layers into a twelve-character block, with a protonation indicator</li>
<li>Ignores molecular roles (reactant vs. product), making it useful for finding related reactions where a molecule&rsquo;s role might differ between studies</li>
<li>Good for discovering &ldquo;reverse&rdquo; reactions, comparing databases with different drawing models, or finding alternative synthetic routes</li>
</ul>
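<p>The pool-deduplicate-sort-hash recipe can be illustrated with a toy key generator. This is emphatically not the official algorithm (real RInChIKeys use the InChIKey's SHA-2-based character mapping and proper layer handling), but it shows why the resulting key is independent of molecular roles:</p>

```python
import hashlib

def hash_block(text, length):
    """Map text to a fixed-length block of uppercase letters
    (toy stand-in for the official SHA-2-based encoding)."""
    digest = hashlib.sha256(text.encode()).digest()
    return "".join(chr(ord("A") + b % 26) for b in digest[:length])

def toy_web_key(inchis):
    """Pool all component InChIs regardless of role, deduplicate,
    sort, then hash into a 17-character and a 12-character block."""
    pooled = "!".join(sorted(set(inchis)))
    return hash_block(pooled, 17) + "-" + hash_block("minor|" + pooled, 12)

# Role independence: swapping reactants and products gives the same key
a = toy_web_key(["CH4O/c1-2/h2H,1H3", "CO2/c2-1-3"])
b = toy_web_key(["CO2/c2-1-3", "CH4O/c1-2/h2H,1H3", "CO2/c2-1-3"])
print(a == b, len(a))  # True 30
```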
<h2 id="experimental-validation-and-software-implementation">Experimental Validation and Software Implementation</h2>
<p>This infrastructure paper focuses on developing and validating the RInChI standard. The validation approach includes:</p>
<ul>
<li><strong>Software implementation</strong>: Development of the official RInChI software library capable of parsing reaction files and generating identifiers</li>
<li><strong>Format testing</strong>: Validation that the system correctly handles standard reaction file formats (<code>.RXN</code>, <code>.RD</code>)</li>
<li><strong>Consistency verification</strong>: Ensuring identical reactions produce identical RInChI strings regardless of input variations</li>
<li><strong>Key generation</strong>: Testing all three RInChIKey variants (Long, Short, Web) for different use cases</li>
<li><strong>Database integration</strong>: Demonstrating practical application in reaction database management. A database of over one million RInChIs was assembled using data that NextMove Software extracted from the patent literature, available at www-rinchi.ch.cam.ac.uk</li>
</ul>
<h2 id="impact-on-chemical-database-analytics">Impact on Chemical Database Analytics</h2>
<h3 id="practical-applications">Practical Applications</h3>
<p>RInChI enables systematic organization and analysis of chemical reactions:</p>
<h4 id="database-management">Database Management</h4>
<p>RInChI enables systematic organization of reaction databases. You can:</p>
<ul>
<li>Automatically identify and merge duplicate reaction entries</li>
<li>Find all variations of a particular transformation</li>
<li>Link related reactions across different data sources</li>
</ul>
<h4 id="reaction-analysis">Reaction Analysis</h4>
<p>With standardized identifiers, you can perform large-scale analysis:</p>
<ul>
<li>Identify the most commonly used reagents or catalysts</li>
<li>Find cases where identical starting materials yield different products</li>
<li>Analyze reaction trends and patterns across entire databases</li>
</ul>
<h4 id="multi-step-synthesis-representation">Multi-Step Synthesis Representation</h4>
<p>RInChI can represent complex, multi-step syntheses as single combined identifiers, making it easier to analyze and compare different synthetic routes.</p>
<h4 id="research-integration">Research Integration</h4>
<p>The standard enables better collaboration by ensuring different research groups can generate identical identifiers for the same chemical processes, facilitating data sharing and literature analysis.</p>
<h3 id="limitations-and-considerations">Limitations and Considerations</h3>
<h4 id="what-gets-lost">What Gets Lost</h4>
<p>Since RInChI builds on the Standard InChI for individual molecules, it inherits certain limitations:</p>
<ul>
<li><strong>Tautomers</strong>: Different tautomeric forms are treated as identical</li>
<li><strong>Stereochemistry</strong>: Relative stereochemical relationships aren&rsquo;t captured</li>
<li><strong>Experimental conditions</strong>: Temperature, pressure, yield, and reaction time are intentionally excluded</li>
</ul>
<h4 id="the-trade-off">The Trade-off</h4>
<p>This is an intentional feature. By focusing on core chemical identity, RInChI achieves its primary goal: ensuring that different researchers working on the same fundamental transformation generate the same identifier.</p>
<h3 id="implementation-and-tools">Implementation and Tools</h3>
<h4 id="official-software">Official Software</h4>
<p>The RInChI software, available from the InChI Trust, handles the practical details:</p>
<ul>
<li>Accepts standard reaction file formats (<code>.RXN</code>, <code>.RD</code>)</li>
<li>Generates RInChI strings, all three RInChIKey variants, and auxiliary information</li>
<li>Automates the complex process of creating consistent identifiers</li>
</ul>
<h4 id="rauxinfo-preserving-visual-information">RAuxInfo: Preserving Visual Information</h4>
<p>While RInChI discards graphical information (atom coordinates, drawing layout), the software can generate supplementary &ldquo;RAuxInfo&rdquo; strings that preserve this data. This allows reconstruction of the original visual representation when needed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>RInChI development continues to evolve:</p>
<ul>
<li><strong>Integration</strong>: Plans for compatibility with other emerging standards like <a href="/notes/chemistry/molecular-representations/notations/mixfile-minchi/">MInChI for chemical mixtures</a></li>
<li><strong>Extended applications</strong>: Work on representing complex, multi-component reaction systems</li>
<li><strong>Software development</strong>: Tools for generating graphical representations directly from RInChI without auxiliary information</li>
</ul>
<h3 id="key-takeaways">Key Takeaways</h3>
<ol>
<li>
<p><strong>Filling a critical gap</strong>: RInChI provides the first standardized way to uniquely identify chemical reactions, solving a fundamental problem in chemical informatics.</p>
</li>
<li>
<p><strong>Focus on essential chemistry</strong>: By excluding experimental variables, RInChI achieves consistent identification of core chemical transformations.</p>
</li>
<li>
<p><strong>Flexible searching</strong>: Multiple RInChIKey formats enable different types of database queries, from exact matching to similarity searching.</p>
</li>
<li>
<p><strong>Practical implementation</strong>: Official software tools make RInChI generation accessible to working chemists and database managers.</p>
</li>
<li>
<p><strong>Foundation for analysis</strong>: Standardized reaction identifiers enable large-scale analysis of chemical databases and systematic study of reaction patterns.</p>
</li>
</ol>
<p>RInChI brings to reaction data the same kind of standardization and machine-readability that SMILES and InChI provide for individual molecules.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>The RInChI software is available for download from the InChI Trust website (<a href="http://www.inchi-trust.org/downloads/">http://www.inchi-trust.org/downloads/</a>). It is also available as an Oracle cartridge and as a Pipeline Pilot component from StructurePendium. A database of over one million RInChIs is hosted at www-rinchi.ch.cam.ac.uk.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://www.inchi-trust.org/downloads/">RInChI Software (InChI Trust)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official RInChI V1.00 implementation</td>
      </tr>
      <tr>
          <td><a href="https://www-rinchi.ch.cam.ac.uk">RInChI Database</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Over 1M reactions from patent literature</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grethe, G., Blanke, G., Kraut, H., &amp; Goodman, J. M. (2018). International chemical identifier for reactions (RInChI). <em>Journal of Cheminformatics</em>, <em>10</em>(1), 22. <a href="https://doi.org/10.1186/s13321-018-0277-8">https://doi.org/10.1186/s13321-018-0277-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2018)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Grethe2018,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{International chemical identifier for reactions (RInChI)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grethe, Guenter and Blanke, Gerd and Kraut, Hans and Goodman, Jonathan M}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-018-0277-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Recent Advances in the SELFIES Library: 2023 Update</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2023/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2023/</guid><description>Major updates to the SELFIES library, improved performance, expanded chemistry support, and new customization features.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This software update paper documents major improvements to the SELFIES Python library (version 2.1.1), covering its history, underlying algorithms, design, and performance.</p>
<h2 id="limitations-in-the-original-selfies-implementation">Limitations in the Original SELFIES Implementation</h2>
<p>While the <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original SELFIES concept</a> was promising, the initial 2019 implementation had critical limitations that prevented widespread adoption:</p>
<ol>
<li><strong>Performance</strong>: Too slow for production ML workflows</li>
<li><strong>Limited chemistry</strong>: Couldn&rsquo;t represent aromatic molecules, stereochemistry, or many other important chemical features</li>
<li><strong>Poor usability</strong>: Lacked user-friendly APIs for common tasks</li>
</ol>
<p>These barriers meant that despite SELFIES&rsquo; theoretical advantages (100% validity guarantee), researchers couldn&rsquo;t practically use it for real-world applications like drug discovery or materials science.</p>
<h2 id="architectural-refactoring-and-new-ml-integrations">Architectural Refactoring and New ML Integrations</h2>
<p>The 2023 update refactors the underlying SELFIES engine with improvements to design, efficiency, and supported features. The key updates include:</p>
<ol>
<li>
<p><strong>Streamlined Grammar</strong>: The underlying context-free grammar has been generalized and streamlined, improving execution speed and extensibility while maintaining the 100% validity guarantee.</p>
</li>
<li>
<p><strong>Expanded Chemical Support</strong>: Adds support for aromatic systems (via internal kekulization), stereochemistry (chirality, cis/trans), charged species, and isotopic data, covering nearly all features supported by SMILES while preserving the validity guarantee.</p>
</li>
<li>
<p><strong>Semantic Constraint API</strong>: Introduces the <code>set_semantic_constraints()</code> function, allowing specification of custom valence definitions useful for theoretical studies or hypervalent states.</p>
</li>
<li>
<p><strong>ML Utility Functions</strong>: Provides tokenization (<code>split_selfies</code>), length estimation (<code>len_selfies</code>), label/one-hot encoding (<code>selfies_to_encoding</code>), vocabulary extraction, and attribution tracking for integration with neural network pipelines.</p>
</li>
</ol>
<h2 id="performance-benchmarks--validity-testing">Performance Benchmarks &amp; Validity Testing</h2>
<p>The authors validated the library through several benchmarks:</p>
<p><strong>Performance testing</strong>: Roundtrip conversion (SMILES to SELFIES to SMILES) on the DTP open compound collection (slightly over 300K molecules) completed in 252 seconds total (136s encoding, 116s decoding), using pure Python with no external dependencies.</p>
<p><strong>Random SELFIES generation</strong>: Demonstrated that random SELFIES strings of varying lengths always decode to valid molecules, with the size distribution of generated molecules controllable by filtering the sampling alphabet (e.g., removing multi-bond and low-valence atom symbols shifts the distribution toward larger molecules).</p>
<p><strong>Validity guarantee</strong>: By construction, every SELFIES string decodes to a valid molecule. The grammar&rsquo;s bond demotion and deferred ring closure mechanisms make it impossible to generate chemically invalid structures.</p>
<p><strong>Attribution system</strong>: Showed both encoder and decoder can track which input symbols produce which output symbols, useful for property alignment.</p>
<h2 id="future-trajectories-for-general-chemical-representations">Future Trajectories for General Chemical Representations</h2>
<p>The 2023 update successfully addresses the main adoption barriers:</p>
<ol>
<li><strong>Fast enough</strong> for large-scale ML applications (300K molecules in ~4 minutes)</li>
<li><strong>Chemically comprehensive</strong> enough for drug discovery and materials science</li>
<li><strong>User-friendly</strong> enough for straightforward integration into existing workflows</li>
</ol>
<p>The validity guarantee, SELFIES&rsquo; core advantage, is now practically accessible for real-world research. The roadmap includes future extensions for polymers, crystals, chemical reactions, and non-covalent interactions, which would expand SELFIES&rsquo; applicability beyond small-molecule chemistry.</p>
<p><strong>Limitations acknowledged</strong>: The paper focuses on implementation improvements. Some advanced chemical systems (polymers, crystals) still need future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">selfies</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official Python library, installable via <code>pip install selfies</code></td>
      </tr>
  </tbody>
</table>
<h3 id="code">Code</h3>
<p>The <code>selfies</code> library is fully open-source and written in pure Python with no external dependencies. It is available on GitHub and installable via <code>pip install selfies</code>. The repository includes testing suites (<code>tox</code>) and example benchmarking scripts to reproduce the translation speeds reported in the paper.</p>
<h3 id="hardware">Hardware</h3>
<p>Performance benchmarks (e.g., the 252-second roundtrip conversion on 300K molecules) were executed on Google Colaboratory using two 2.20GHz Intel Xeon CPUs.</p>
<h3 id="algorithms">Algorithms</h3>
<h4 id="technical-specification-the-grammar">Technical Specification: The Grammar</h4>
<p>The core innovation of SELFIES is a <strong>Context-Free Grammar (CFG) augmented with state-machine logic</strong> to ensure that every derived string represents a valid molecule. While the software features are important, understanding the underlying derivation rules is essential for replication or extension of the system.</p>
<p><strong>1. Derivation Rules: The Atom State Machine</strong></p>
<p>The fundamental mechanism that guarantees validity is a <strong>state machine</strong> that tracks the remaining valence of the most recently added atom:</p>
<ul>
<li><strong>State Tracking</strong>: The derivation maintains a non-terminal state $X_\ell$, where $\ell$ represents the current atom&rsquo;s remaining valence (the number of bonds it can still form)</li>
<li><strong>Standard Derivation</strong>: An atom symbol $[\beta \alpha]$ (bond order + atom type) transitions the state from $S$ (start) to $X_\ell$, where $\ell$ is calculated from the atom&rsquo;s standard valence minus the incoming bond order</li>
<li><strong>Bond Demotion (The Key Rule)</strong>: When deriving atom symbol $[\beta \alpha]$ in state $X_i$, the actual bond order used is $d_0 = \min(\ell, i, d(\beta))$, where $\ell$ is the new atom&rsquo;s valence, $i$ is the previous atom&rsquo;s remaining capacity, and $d(\beta)$ is the requested bond order. This automatic downward adjustment is the mathematical core of the validity guarantee.</li>
</ul>
<p>This state machine ensures that no atom ever exceeds its allowed valence, making it impossible to generate chemically invalid structures.</p>
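<p>The demotion rule itself is a single <code>min</code>; a toy illustration in plain Python (not the library&rsquo;s internal code):</p>

```python
def demoted_bond_order(new_valence: int, remaining: int, requested: int) -> int:
    """d0 = min(l, i, d(beta)): the bond order actually formed."""
    return min(new_valence, remaining, requested)

# A requested triple bond to an oxygen (valence 2) from a carbon with only
# one bond left is silently demoted to a single bond:
assert demoted_bond_order(2, 1, 3) == 1
```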
<p><strong>2. Control Symbols: Branches and Rings</strong></p>
<p>Branch length calculation: SELFIES uses a <strong>hexadecimal encoding</strong> to determine branch lengths. A branch symbol <code>[Branch l]</code> consumes the next $\ell$ symbols from the queue and converts them to integer indices $c_1, \dots, c_\ell$ via a fixed mapping (Table III in the paper). The number of symbols $N$ to include in the branch is then:</p>
<p>$$
N = 1 + \sum_{k=1}^{\ell} 16^{\ell - k} , c_k
$$</p>
<p>This formula interprets the indices as hexadecimal digits, allowing compact specification of branches up to hundreds of symbols long.</p>
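<p>The formula is equivalent to reading the indices as base-16 digits and adding one; a direct transcription in Python (illustrative, not the library&rsquo;s code):</p>

```python
def branch_length(indices):
    """N = 1 + sum_k 16^(l - k) * c_k: the indices read as hex digits, plus one."""
    n = 0
    for c in indices:
        n = 16 * n + c  # Horner's rule for base-16
    return 1 + n

assert branch_length([3]) == 4      # one index: N = 1 + 3
assert branch_length([1, 0]) == 17  # two indices: N = 1 + 0x10
```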
<p>Ring closure queue system: Ring formation uses a <strong>deferred evaluation</strong> strategy to maintain validity. Ring symbols don&rsquo;t create bonds immediately; instead, they push closure candidates into a queue $R$. These candidates are resolved after the main derivation completes. A ring closure candidate is <strong>rejected</strong> if either ring atom has no remaining valence ($m_1 = 0$ or $m_2 = 0$), or if the left and right ring atoms are not distinct (to avoid self-loops). If a prior bond already exists between the two atoms, the bond order is incremented rather than duplicated. This deferred validation prevents invalid ring structures while keeping the grammar context-free during the main derivation.</p>
<p><strong>3. Symbol Structure and Standardization</strong></p>
<p>SELFIES enforces a strict, standardized format for atom symbols to eliminate ambiguity:</p>
<ul>
<li><strong>Canonical Format</strong>: Atom symbols follow the structure <code>[Bond, Isotope, Element, Chirality, H-count, Charge]</code></li>
<li><strong>No Variation</strong>: There is only one way to write each symbol (e.g., <code>[Fe++]</code> and <code>[Fe+2]</code> are standardized to a single form)</li>
<li><strong>Order Matters</strong>: The components must appear in the specified order</li>
</ul>
<p><strong>4. Default Semantic Constraints</strong></p>
<p>By default, the library enforces standard organic chemistry valence rules:</p>
<ul>
<li><strong>Charge-Dependent Valences</strong>: Default constraints specify maximum bonds per charge state (e.g., C: 4/5/3 for neutral/+1/-1; S: 6/7/5). Unlisted atom types default to 8 maximum bonds as a catch-all.</li>
<li><strong>Preset Options</strong>: Three preset constraint sets are available: <code>default</code>, <code>octet_rule</code>, and <code>hypervalent</code>.</li>
<li><strong>Customizable</strong>: Constraints can be modified via <code>set_semantic_constraints()</code> for specialized applications (hypervalent compounds, theoretical studies, etc.)</li>
</ul>
<p>The combination of these grammar rules with the state machine ensures that <strong>every valid SELFIES string decodes to a chemically valid molecule</strong>, regardless of how the string was generated (random, ML model output, manual construction, etc.).</p>
<h3 id="data">Data</h3>
<p><strong>Benchmark dataset</strong>: DTP (Developmental Therapeutics Program) open compound collection with slightly over 300K SMILES strings, a set of molecules tested experimentally for potential treatment against cancer and AIDS.</p>
<p><strong>Random generation testing</strong>: Random SELFIES strings of varying lengths (10, 100, 250 symbols) generated from both basic and filtered alphabets to test decoding validity and molecule size distributions.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Performance metric</strong>: Roundtrip conversion time (SMILES to SELFIES to SMILES) is 252 seconds for 300K+ molecules (136s encoding, 116s decoding). Times averaged over 3 replicate trials on Google Colaboratory.</p>
<p><strong>Validity testing</strong>: Random SELFIES strings of lengths 10, 100, and 250 all decode to valid molecules. Decoding 1000 random strings of length 250 from the basic alphabet takes 0.341s; from the filtered alphabet, 1.633s.</p>
<p><strong>Attribution system</strong>: Both <code>encoder()</code> and <code>decoder()</code> support an <code>attribute</code> flag that returns <code>AttributionMap</code> objects, tracing which input symbols produce which output symbols for property alignment.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., &amp; Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. <em>Digital Discovery</em>, <em>2</em>(4), 897-908. <a href="https://doi.org/10.1039/D3DD00044C">https://doi.org/10.1039/D3DD00044C</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lo2023recent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recent advances in the self-referencing embedded strings (SELFIES) library}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lo, Alston and Pollice, Robert and Nigam, AkshatKumar and White, Andrew D and Krenn, Mario and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{897--908}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00044C}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">Original SELFIES Paper (2020)</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES Format Overview</a></li>
</ul>
]]></content:encoded></item><item><title>NInChI: Toward a Chemical Identifier for Nanomaterials</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</guid><description>NInChI (Nanomaterials InChI) extends chemical identifiers to represent complex, multi-component nanomaterials.</description><content:encoded><![CDATA[<h2 id="a-new-standard-for-nanoinformatics">A New Standard for Nanoinformatics</h2>
<p>This is a <strong>Systematization paper</strong> that proposes a new standard, the NInChI, to address a fundamental limitation in nanoinformatics: the lack of a machine-readable identifier for complex nanomaterials. The result of a collaborative workshop organized by the H2020 research infrastructure NanoCommons and the nanoinformatics project NanoSolveIT, this work uses <strong>six detailed case studies</strong> to systematically develop a <strong>hierarchical, machine-readable notation</strong> for complex nanomaterials that could work across experimental research, regulatory frameworks, and computational modeling.</p>
<h2 id="the-breakdown-of-traditional-chemical-identifiers">The Breakdown of Traditional Chemical Identifiers</h2>
<p>Chemoinformatics has fantastic tools for representing small molecules: SMILES strings, InChI identifiers, and standardized databases that make molecular data searchable and shareable. But when you step into nanotechnology, everything breaks down.</p>
<p>Consider trying to describe a gold nanoparticle with a silica shell and organic surface ligands. How do you capture:</p>
<ul>
<li>The gold core composition and size</li>
<li>The silica shell thickness and interface</li>
<li>The surface chemistry and ligand density</li>
<li>The overall shape and morphology</li>
</ul>
<p>There&rsquo;s simply no standardized way to represent this complexity in a machine-readable format. This creates massive problems for:</p>
<ul>
<li><strong>Data sharing</strong> between research groups</li>
<li><strong>Regulatory assessment</strong> where precise identification matters</li>
<li><strong>Computational modeling</strong> that needs structured input</li>
<li><strong>Database development</strong> and search capabilities</li>
</ul>
<p>Without a standard notation, nanomaterials research suffers from the same data fragmentation that plagued small molecule chemistry before SMILES existed.</p>
<h2 id="the-five-tier-nanomaterial-description-hierarchy">The Five-Tier Nanomaterial Description Hierarchy</h2>
<p>The authors propose NInChI (Nanomaterials InChI), a layered extension to the existing InChI system. The core insight is organizing nanomaterial description from the inside out, following the OECD&rsquo;s framework for risk assessment, with a five-tier hierarchy:</p>
<ol>
<li><strong>Tier 1: Chemical Composition</strong>: What is the core made of? This differentiates uniform compositions (Tier 1.1), randomly mixed (Tier 1.2), ordered core-shell materials (Tier 1.3), and onion-like multi-shell morphologies (Tier 1.4).</li>
<li><strong>Tier 2: Morphology</strong>: What shape, size, and dimensionality? This encodes dimension (0D-3D), size and size distribution, and shape information.</li>
<li><strong>Tier 3: Surface Properties</strong>: Physical and chemical surface parameters such as charge, roughness, and hydrophobicity. Many of these depend on external conditions (pH, solvent, temperature).</li>
<li><strong>Tier 4: Surface Functionalization</strong>: How are coatings attached to the core? This includes functionalization density, orientation, and binding type (covalent vs. non-covalent).</li>
<li><strong>Tier 5: Surface Ligands</strong>: What molecules are on the surface, their density, orientation, and distribution?</li>
</ol>
<p>This hierarchy captures the essential information needed to distinguish between different nanomaterials while building on familiar chemical concepts.</p>
<h2 id="testing-the-standard-six-case-studies">Testing the Standard: Six Case Studies</h2>
<p>The authors tested their concept against six real-world case studies to identify what actually matters in practice.</p>
<p><strong>Case Study 1: Gold Nanoparticles</strong></p>
<p>Gold NPs provided a relatively simple test case: an inert metallic core with various surface functionalizations. Key insights: core composition and size are essential, surface chemistry (what molecules are attached) matters critically, shape affects properties, and dynamic properties like protein corona formation belong outside the intrinsic NInChI representation. This established the boundary: NInChI should capture intrinsic, stable properties.</p>
<p><strong>Case Study 2: Graphene-Family NMs</strong></p>
<p>Carbon nanotubes and graphene introduced additional complexity: dimensionality (1D tubes vs 2D sheets vs 0D fullerenes), chirality (the (n,m) vector that defines a nanotube&rsquo;s structure), defects and impurities that can alter properties, and number of layers (for nanotubes, single-wall vs multi-wall). This case showed that the notation needed to handle both topological complexity and chemical composition.</p>
<p><strong>Case Study 3: Complex Engineered (Doped and Multi-Metallic) NMs</strong></p>
<p>Doped materials, alloys, and core-shell structures revealed key requirements: the notation must distinguish true alloys (homogeneous mixing) from core-shell structures with the same overall composition, crystal structure information becomes crucial, and component ratios must be precisely specified. The case study also assessed whether the MInChI extension could represent these solid solutions.</p>
<p><strong>Case Study 4: Database Applications</strong></p>
<p>The FAIR (Findable, Accessible, Interoperable, Reusable) principles guided this analysis. NInChI addresses real database problems: it provides greater specificity than CAS numbers (which lack nanoform distinction), offers a systematic alternative to ad-hoc naming schemes, and enables machine-searchability.</p>
<p><strong>Case Study 5: Computational Modeling</strong></p>
<p>This explored several applications: automated descriptor generation from NInChI structure, read-across predictions for untested materials, and model input preparation from standardized notation. The layered structure provides structured input that computational tools need for both physics-based and data-driven nanoinformatics approaches.</p>
<p><strong>Case Study 6: Regulatory Applications</strong></p>
<p>Under frameworks like REACH, regulators need to distinguish between different &ldquo;nanoforms&rdquo;, which are materials with the same chemical composition but different sizes, shapes, or surface treatments. NInChI directly addresses this by encoding the specific properties that define regulatory categories, providing precision sufficient for legal definitions and risk assessment frameworks.</p>
<h2 id="the-ninchi-alpha-specification-in-practice">The NInChI Alpha Specification in Practice</h2>
<p>Synthesizing insights from all six case studies, the authors propose the <strong>NInChI alpha specification</strong> (version 0.00.1A), a three-layer structure. Importantly, the paper distinguishes the five-tier NM description hierarchy (Section 1.2 above) from the three-layer NInChI notation hierarchy. NM properties from the five tiers are encoded into these three notation layers:</p>
<p><strong>Layer 1 (Version Number)</strong>: Standard header indicating the NInChI version, denoted as <code>0.00.1A</code> for the alpha version. This follows the convention of all <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>-based notations.</p>
<p><strong>Layer 2 (Composition)</strong>: Each component (core, shell, ligands, impurities, dopants, linkers) gets described using standard InChI (or PInChI/MInChI) for chemical composition, with additional sublayers for morphology (prefix <code>m</code>, e.g., <code>sp</code> for sphere, <code>sh</code> for shell, <code>tu</code> for tube), size (prefix <code>s</code>, in scientific notation in meters), crystal structure (prefix <code>k</code>), and chirality (prefix <code>w</code> for carbon nanotubes). Components are separated by <code>!</code>.</p>
<p><strong>Layer 3 (Arrangement)</strong>: Specified with prefix <code>y</code>, this layer describes how the components from Layer 2 are combined, proceeding from inside out. A core-shell material is written as <code>y2&amp;1</code> where the numbers reference components in Layer 2. Covalent bonding between components is indicated with parentheses, e.g., <code>(1&amp;2&amp;3)</code> for a nano core with a covalently bound ligand coating.</p>
<p>The paper provides concrete worked examples from the case studies:</p>
<ul>
<li><strong>Silica with gold coating</strong> (20 nm silica, 2 nm gold shell):
<code>NInChI=0.00.1A/Au/msh/s2t10r1-9;12r2-9!/O2Si/c1-3-2/msp/s20d-9/k000/y2&amp;1</code></li>
<li><strong>CTAB-capped gold nanoparticle</strong> (20 nm diameter):
<code>NInChI=0.00.1A/Au/msp/s20d-9!C19H42N.BrH/c1-5-6-7.../y1&amp;2</code></li>
<li><strong>Chiral single-wall nanotube</strong> of the (3,1) type with 0.4 nm diameter:
<code>NInChI=0.00.1A/C/mtu/s4d-10/w(3,1)/y1</code></li>
</ul>
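<p>No official parser exists yet, but the alpha syntax is regular enough to split mechanically. The sketch below is hypothetical (the function name and output shape are invented; only the <code>/</code> sublayer separator, the <code>!</code> component separator, and the one-letter prefixes come from the paper):</p>

```python
def split_ninchi(ninchi: str):
    """Split a NInChI alpha string into its version and per-component sublayers."""
    body = ninchi.split("=", 1)[1]            # drop the leading "NInChI" label
    version, _, rest = body.partition("/")    # first sublayer is the version
    components = [comp.strip("/").split("/") for comp in rest.split("!")]
    return version, components

version, components = split_ninchi("NInChI=0.00.1A/C/mtu/s4d-10/w(3,1)/y1")
assert version == "0.00.1A"
assert components == [["C", "mtu", "s4d-10", "w(3,1)", "y1"]]
```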
<p><strong>Property Prioritization</strong>: The case studies produced a prioritization of NM properties into four categories (Table 3 in the paper):</p>
<table>
  <thead>
      <tr>
          <th>Category 1: Must Have</th>
          <th>Category 2a: Nice to Have</th>
          <th>Category 2b: Extrinsic</th>
          <th>Category 3: Out of Scope</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemical composition</td>
          <td>Structural defects</td>
          <td>Surface charge</td>
          <td>Optical properties</td>
      </tr>
      <tr>
          <td>Size/size distribution</td>
          <td>Density</td>
          <td>Corona</td>
          <td>Magnetic properties</td>
      </tr>
      <tr>
          <td>Shape</td>
          <td>Surface composition</td>
          <td>Agglomeration state</td>
          <td>Chemical/oxidation state</td>
      </tr>
      <tr>
          <td>Crystal structure</td>
          <td></td>
          <td>Dispersion</td>
          <td></td>
      </tr>
      <tr>
          <td>Chirality</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>Ligand and ligand binding</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p><strong>Implementation</strong>: The authors built a prototype NInChI generation tool using the ZK framework with a Java backend, available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>. The tool lets users specify core composition, morphology, size, crystal structure, and chirality, then build outward by adding shells or clusters. InChIs for shell components are retrieved via the NCI/CADD chemical structure REST API.</p>
<p><strong>Limitations</strong>: The alpha version acknowledges areas for future development: nanocomposite and nanostructured materials, inverse NMs (nano holes in bulk material), and nanoporous materials are beyond current scope. Dynamic properties such as dissolution, agglomeration, and protein corona formation are excluded. The stochastic nature of NMs (e.g., broad size distributions) is not yet fully addressed. Covalent bonding between components needs further refinement.</p>
<p><strong>Impact</strong>: For researchers, NInChI enables precise structural queries for nanomaterials data sharing. For regulators, it provides systematic identification for risk assessment and nanoform classification under frameworks like REACH. For computational modelers, it enables automated descriptor generation and read-across predictions.</p>
<p><strong>Key Conclusions</strong>: The 8-month collaborative process demonstrates that creating systematic notation for nanomaterials is feasible. The hierarchical, inside-out organization provides an approach that satisfies experimentalists, modelers, database owners, and regulators. Testing against six case studies identified the essential features that must be captured. By extending InChI and reusing conventions from MInChI, RInChI, and PInChI, the work builds on existing infrastructure. The proposed NInChI alpha is intended to stimulate further analysis and refinement with the broader community and the InChI Trust.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Paper Accessibility</strong>: The paper is fully open-access under the CC BY 4.0 license, allowing for straightforward reading and analysis.</li>
<li><strong>Tools &amp; Code</strong>: The authors provided a prototype NInChI generation tool available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>, built using the ZK framework with a Java backend. The underlying backend code was not released as an open-source library.</li>
<li><strong>Documentation</strong>: The paper serves as the first alpha specification for community discussion and refinement. No formal algorithmic pseudocode for automated string parsing or generation from structured nanomaterials files (like <code>.cif</code>) is provided.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">NInChI Generator (Enalos Cloud)</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Prototype web tool for generating NInChI strings; backend not open-source</td>
      </tr>
      <tr>
          <td><a href="https://www.mdpi.com/2079-4991/10/12/2493">Paper (MDPI)</a></td>
          <td>Other</td>
          <td>CC BY 4.0</td>
          <td>Open-access alpha specification</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lynch, I., Afantitis, A., Exner, T., Himly, M., Lobaskin, V., Doganis, P., &hellip; &amp; Melagraki, G. (2020). Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies? <em>Nanomaterials</em>, <em>10</em>(12), 2493. <a href="https://doi.org/10.3390/nano10122493">https://doi.org/10.3390/nano10122493</a></p>
<p><strong>Publication</strong>: Nanomaterials (2020)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lynch2020inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lynch, Iseult and Afantitis, Antreas and Exner, Thomas and others}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nanomaterials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2493}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{MDPI}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.3390/nano10122493}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mixfile &amp; MInChI: Machine-Readable Mixture Formats</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/mixfile-minchi/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/mixfile-minchi/</guid><description>Mixfile and MInChI provide the first standardized, machine-readable formats for representing chemical mixtures.</description><content:encoded><![CDATA[<h2 id="a-standardized-resource-for-chemical-mixtures">A Standardized Resource for Chemical Mixtures</h2>
<p>This is a <strong>Resource</strong> paper that introduces two complementary standards for representing chemical mixtures: the detailed <strong>Mixfile</strong> format for comprehensive mixture descriptions and the compact <strong>MInChI</strong> (Mixtures InChI) specification for canonical mixture identifiers.</p>
<h2 id="the-missing-format-for-complex-formulations">The Missing Format for Complex Formulations</h2>
<p>There is a fundamental gap in chemical informatics: current standards excel at representing pure individual molecules (SMILES, InChI, Molfile), but a corresponding standard for multi-component mixtures remains an open challenge. This is a major problem because real-world chemistry predominantly involves complex mixtures.</p>
<p>Everyday chemical work frequently involves:</p>
<ul>
<li>Reagents with specified purity (e.g., &ldquo;$\geq$ 97% pure&rdquo;)</li>
<li>Solutions and formulations</li>
<li>Complex mixtures like &ldquo;hexanes&rdquo; (which contains multiple isomers)</li>
<li>Drug formulations with active ingredients and excipients</li>
</ul>
<p>Without a machine-readable standard, chemists are forced to describe these mixtures in plain text that software cannot parse or analyze systematically. This creates barriers for automated safety analysis, inventory management, and data sharing.</p>
<h2 id="dual-design-comprehensive-mixfiles-and-canonical-minchis">Dual Design: Comprehensive Mixfiles and Canonical MInChIs</h2>
<p>The authors propose a two-part solution:</p>
<ol>
<li><strong>Mixfile</strong>: A detailed, hierarchical JSON format that captures the complete composition of a mixture</li>
<li><strong>MInChI</strong>: A compact, canonical string identifier derived from Mixfile data</li>
</ol>
<p>This dual approach provides both comprehensive description (Mixfile) and simple identification (MInChI), similar to having both a detailed recipe and a short name for a dish.</p>
<h3 id="what-makes-a-good-mixture-format">What Makes a Good Mixture Format?</h3>
<p>The authors identify three essential properties any mixture format must capture:</p>
<ol>
<li><strong>Compound</strong>: What molecules are present?</li>
<li><strong>Quantity</strong>: How much of each component?</li>
<li><strong>Hierarchy</strong>: How are components organized (e.g., mixtures-of-mixtures)?</li>
</ol>
<p>The hierarchical aspect is crucial. Consider &ldquo;hexanes&rdquo;: it is a named mixture containing specific proportions of n-hexane, 2-methylpentane, 3-methylpentane, etc. A mixture format needs to represent both the individual isomers and the fact that they are grouped under the umbrella term &ldquo;hexanes.&rdquo;</p>
<h3 id="mixfile-format-details">Mixfile Format Details</h3>
<p>Mixfile uses JSON as its foundation, making it both human-readable and easy to parse in modern programming languages. The core structure is a hierarchical tree where each component can contain:</p>
<ul>
<li><strong>name</strong>: Component identifier</li>
<li><strong>molfile/smiles/inchi/formula</strong>: Molecular structure (molfile is the primary source of truth)</li>
<li><strong>quantity/units/relation/ratio</strong>: Concentration data with optional relation operators</li>
<li><strong>contents</strong>: Array of sub-components for hierarchical mixtures</li>
<li><strong>identifiers</strong>: Database IDs or URLs for additional information</li>
</ul>
<h4 id="simple-example">Simple Example</h4>
<p>A basic Mixfile might look like:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Acetone, ≥99%&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;acetone&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">99</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;relation&#34;</span>: <span style="color:#e6db74">&#34;&gt;=&#34;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Note that the paper specifies distinct fields for molecular structures: <code>molfile</code> (the primary source of truth), <code>smiles</code>, <code>inchi</code>, and <code>formula</code>. Concentration data uses separate <code>quantity</code>, <code>units</code>, and <code>relation</code> fields.</p>
<h4 id="complex-example-mixture-of-mixtures">Complex Example: Mixture-of-Mixtures</h4>
<p>For something like &ldquo;ethyl acetate dissolved in hexanes,&rdquo; the structure would be:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Ethyl acetate in hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;ethyl acetate&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCOC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">10</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;n-hexane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCCCCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">60</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;2-methylpentane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(C)CCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">25</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>      ]
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>This hierarchical structure captures the &ldquo;recipe&rdquo; of complex mixtures while remaining machine-readable.</p>
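<p>As a quick illustration of how software can consume this structure, here is a minimal Python sketch (not from the paper) that walks a Mixfile tree and enumerates its leaf components along with their hierarchy paths:</p>

```python
def leaf_components(node, path=()):
    """Recursively yield (path, component) pairs for leaf nodes of a Mixfile tree."""
    children = node.get("contents", [])
    if not children:
        yield path, node
        return
    for child in children:
        yield from leaf_components(child, path + (child.get("name", "?"),))

mixfile = {
    "mixfileVersion": 0.01,
    "name": "Ethyl acetate in hexanes",
    "contents": [
        {"name": "ethyl acetate", "smiles": "CCOC(=O)C", "quantity": 10, "units": "%"},
        {"name": "hexanes", "contents": [
            {"name": "n-hexane", "smiles": "CCCCCC", "quantity": 60, "units": "%"},
            {"name": "2-methylpentane", "smiles": "CC(C)CCC", "quantity": 25, "units": "%"},
        ]},
    ],
}

for path, comp in leaf_components(mixfile):
    print("/".join(path), comp.get("smiles"))
```

<p>Because sub-mixtures are just nested <code>contents</code> arrays, the same traversal works for arbitrarily deep &ldquo;recipes.&rdquo;</p>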
<h3 id="minchi-canonical-mixture-identifiers">MInChI: Canonical Mixture Identifiers</h3>
<p>While Mixfiles provide comprehensive descriptions, simple identifiers are also needed for database storage and searching. This is where MInChI comes in.</p>
<p>A MInChI string is structured as:</p>
<pre><code>MInChI=0.00.1S/&lt;components&gt;/n&lt;indexing&gt;/g&lt;concentration&gt;
</code></pre>
<ul>
<li><strong>Header</strong>: Version information (<code>0.00.1S</code> in the paper&rsquo;s specification)</li>
<li><strong>Components</strong>: Standard InChI for each unique molecule, sorted alphabetically <em>by the InChI strings themselves</em>, then concatenated with <code>&amp;</code></li>
<li><strong>Indexing</strong> (prefixed with <code>/n</code>): Hierarchical structure using curly braces <code>{}</code> for branches and <code>&amp;</code> for adjacent nodes; uses 1-based integer indices referring to the sorted InChI list</li>
<li><strong>Concentration</strong> (prefixed with <code>/g</code>): Quantitative information for each component, with units converted to canonical codes</li>
</ul>
<h4 id="why-this-matters">Why This Matters</h4>
<p>MInChI strings enable simple database searches:</p>
<ul>
<li>Check if a specific component appears in any mixture</li>
<li>Compare different formulations of the same product</li>
<li>Identify similar mixtures based on string similarity</li>
</ul>
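<p>For example, checking whether a particular component appears in any mixture reduces to inspecting the components layer. A toy Python sketch (the flat two-component MInChI strings below were assembled by hand for illustration, with hierarchy braces omitted):</p>

```python
# Hand-assembled, flat MInChI strings: methanol/water and ethanol/water
minchis = [
    "MInChI=0.00.1S/CH4O/c1-2/h2H,1H3&H2O/h1H2/n1&2/g10pp&90pp",
    "MInChI=0.00.1S/C2H6O/c1-2-3/h3H,2H2,1H3&H2O/h1H2/n1&2/g40pp&60pp",
]

def mixtures_containing(component_inchi, catalog):
    """Return catalog entries whose components layer lists the given InChI.

    The components layer holds bare InChI bodies joined with '&', so
    membership reduces to a string comparison on that layer (toy logic;
    a real index would parse the layers properly).
    """
    body = component_inchi.replace("InChI=1S/", "", 1)
    hits = []
    for m in catalog:
        components = m.split("/n")[0].replace("MInChI=0.00.1S/", "", 1)
        if body in components.split("&"):
            hits.append(m)
    return hits

print(len(mixtures_containing("InChI=1S/H2O/h1H2", minchis)))  # 2
```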
<h2 id="validating-the-standard-through-practical-tooling">Validating the Standard Through Practical Tooling</h2>
<p>The paper demonstrates the format&rsquo;s capabilities through several practical applications and a proof-of-concept implementation:</p>
<h3 id="text-extraction-algorithm">Text Extraction Algorithm</h3>
<p>The authors demonstrate a proof-of-concept algorithm that uses regular expressions and chemical name recognition to parse plain-text mixture descriptions into structured Mixfile data. The algorithm:</p>
<ol>
<li>Applies regex rules to remove filler words and extract concentrations</li>
<li>Looks up cleaned names against a custom chemical database</li>
<li>Falls back to OPSIN for SMILES generation from chemical names</li>
<li>Generates 2D coordinates for molecular structures</li>
</ol>
<h3 id="graphical-editor">Graphical Editor</h3>
<p>An open-source editor provides:</p>
<ul>
<li>Tree-based interface for building and editing hierarchical structures</li>
<li>Chemical structure sketching and editing</li>
<li>Database lookup (e.g., PubChem integration)</li>
<li>Automatic MInChI generation</li>
<li>Import/export capabilities</li>
</ul>
<h3 id="example-use-cases">Example Use Cases</h3>
<p>The paper validates the format through real-world applications:</p>
<ul>
<li><strong>Safety compliance</strong>: Automated hazard assessment based on concentration-dependent properties (e.g., solid osmium tetroxide vs. 1% aqueous solution)</li>
<li><strong>Inventory management</strong>: Precise, searchable laboratory records</li>
<li><strong>Data extraction</strong>: Parsing vendor catalogs and safety data sheets</li>
</ul>
<h2 id="outcomes-and-future-extensibility">Outcomes and Future Extensibility</h2>
<p>The work successfully establishes the first standardized, machine-readable formats for chemical mixtures. Key achievements:</p>
<ul>
<li><strong>Comprehensive representation</strong>: Mixfile captures component identity, quantity, and hierarchy</li>
<li><strong>Canonical identification</strong>: MInChI provides compact, searchable identifiers</li>
<li><strong>Practical tooling</strong>: Open-source editor and text extraction demonstrate feasibility</li>
<li><strong>Real-world validation</strong>: Format handles diverse use cases from safety to inventory</li>
</ul>
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The authors acknowledge areas for improvement:</p>
<ul>
<li><strong>Machine learning improvements</strong>: Better text extraction using modern NLP techniques</li>
<li><strong>Extended coverage</strong>: Support for polymers, complex formulations, analytical results</li>
<li><strong>Community adoption</strong>: Integration with existing chemical databases and software</li>
</ul>
<p>The hierarchical design makes Mixfile suitable for both &ldquo;recipe&rdquo; descriptions (how to make something) and analytical results (what was found). This flexibility should help drive adoption across different use cases in chemistry and materials science.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="open-source-tooling--data">Open Source Tooling &amp; Data</h3>
<p>While the central repository for validating and establishing the MInChI standard is <a href="https://github.com/IUPAC/MInChI">github.com/IUPAC/MInChI</a>, the tools and datasets actually used to develop the paper&rsquo;s proofs of concept are hosted elsewhere:</p>
<ul>
<li><strong>Graphical Editor &amp; App codebase</strong>: The Electron application and Mixfile handling codebase (<code>console.js</code>) can be found at <a href="https://github.com/cdd/mixtures">github.com/cdd/mixtures</a>.</li>
<li><strong>Text Extraction Data</strong>: The several thousand mixture records produced by the text extraction method are available in the <code>cdd/mixtures</code> repository under <a href="https://github.com/cdd/mixtures/tree/master/reference"><code>reference/gathering.zip</code></a>.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/IUPAC/MInChI">IUPAC/MInChI</a></td>
          <td style="text-align: left">Code / Data</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Validation test suite with ~150 mixture JSON files</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/cdd/mixtures">cdd/mixtures</a></td>
          <td style="text-align: left">Code / Data</td>
          <td style="text-align: left">GPL-3.0</td>
          <td style="text-align: left">Electron-based Mixfile editor, CLI tools, and reference mixture corpus</td>
      </tr>
  </tbody>
</table>
<p>The paper was funded by NIH Grant 1R43TR002528-01. There are no specific hardware requirements, as this is a format specification with lightweight tooling.</p>
<h3 id="algorithms">Algorithms</h3>
<p>This section provides the specific algorithmic logic, schema definitions, and standardization rules needed to replicate the Mixfile parser or MInChI generator.</p>
<h4 id="the-strict-mixfile-json-schema">The Strict Mixfile JSON Schema</h4>
<p>To implement the format, a parser must recognize these specific fields:</p>
<p><strong>Root Structure</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;header&#34;</span>: {},
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: []
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Component Fields</strong>:</p>
<ul>
<li><code>name</code>: string (required if no structure is provided)</li>
<li><code>molfile</code>: string (the primary source of truth for molecular structure)</li>
<li><code>smiles</code>, <code>inchi</code>, <code>formula</code>: derived/transient fields for convenience</li>
<li><code>quantity</code>: number OR <code>[min, max]</code> array for ranges</li>
<li><code>units</code>: string (must map to supported ontology)</li>
<li><code>relation</code>: string (e.g., <code>&quot;&gt;&quot;</code>, <code>&quot;~&quot;</code>, <code>&quot;&gt;=&quot;</code>)</li>
<li><code>ratio</code>: array of two numbers <code>[numerator, denominator]</code></li>
<li><code>identifiers</code>: database assignments (e.g., CASRN, PubChem)</li>
<li><code>links</code>: URLs relevant to the component</li>
<li><code>contents</code>: recursive array for hierarchical mixtures</li>
</ul>
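<p>A lightweight validator over these fields might look like the following Python sketch. The field names follow the list above, but the specific checks are illustrative, not the official schema:</p>

```python
KNOWN_FIELDS = {"name", "molfile", "smiles", "inchi", "formula",
                "quantity", "units", "relation", "ratio",
                "identifiers", "links", "contents"}

def check_component(comp, path="root"):
    """Collect schema problems for one Mixfile component (recursive sketch)."""
    problems = []
    for key in comp:
        if key not in KNOWN_FIELDS and key not in {"mixfileVersion", "header"}:
            problems.append(f"{path}: unknown field '{key}'")
    has_structure = any(k in comp for k in ("molfile", "smiles", "inchi", "formula"))
    if not has_structure and "name" not in comp and "contents" not in comp:
        problems.append(f"{path}: needs a name or a structure")
    q = comp.get("quantity")
    if isinstance(q, list) and len(q) != 2:
        problems.append(f"{path}: range quantity must be [min, max]")
    for i, child in enumerate(comp.get("contents", [])):
        problems.extend(check_component(child, f"{path}/contents[{i}]"))
    return problems

ok = {"name": "acetone", "smiles": "CC(=O)C", "quantity": 99, "units": "%", "relation": ">="}
bad = {"smile": "CCO", "quantity": [1, 2, 3]}
print(check_component(ok))   # []
```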
<h4 id="minchi-generation-algorithm">MInChI Generation Algorithm</h4>
<p>To generate <code>MInChI=0.00.1S/...</code>, the software must follow these steps:</p>
<ol>
<li>
<p><strong>Component Layer</strong>:</p>
<ul>
<li>Calculate standard <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> for all structures in the mixture</li>
<li>Sort distinct InChIs alphabetically by the InChI string itself</li>
<li>Join with <code>&amp;</code> to form the structure layer</li>
</ul>
</li>
<li>
<p><strong>Hierarchy &amp; Concentration Layers</strong>:</p>
<ul>
<li>Traverse the Mixfile tree recursively</li>
<li><strong>Indexing</strong>: Use integer indices (1-based) referring to the sorted InChI list</li>
<li><strong>Grouping</strong>: Use <code>{}</code> to denote hierarchy branches and <code>&amp;</code> to separate nodes at the same level</li>
<li><strong>Concentration</strong>: Convert all quantities to canonical unit codes and apply scaling factors</li>
</ul>
</li>
</ol>
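<p>The layer assembly above can be sketched in Python for a flat (non-hierarchical) mixture. The InChIs are precomputed, the concentration sub-strings are assumed to already be in canonical form, and hierarchy braces are omitted; this is an illustrative sketch, not the official generator:</p>

```python
def assemble_minchi(component_inchis, quantities):
    """Sketch of MInChI layer assembly for a flat mixture.

    component_inchis: standard InChI per component, in mixture order.
    quantities: concentration sub-strings per component, assumed to be
    already converted to canonical MInChI unit codes (e.g. "10pp").
    """
    # 1. Components layer: distinct InChIs sorted by the string itself,
    #    "InChI=1S/" prefix stripped, joined with '&'.
    distinct = sorted(set(component_inchis))
    components = "&".join(s.replace("InChI=1S/", "", 1) for s in distinct)
    # 2. Indexing layer: 1-based indices into the sorted list.
    index = {s: i + 1 for i, s in enumerate(distinct)}
    n_layer = "&".join(str(index[s]) for s in component_inchis)
    # 3. Concentration layer: one entry per component, in the same order.
    g_layer = "&".join(quantities)
    return f"MInChI=0.00.1S/{components}/n{n_layer}/g{g_layer}"

# Methanol in water (the InChIs are the real standard identifiers)
minchi = assemble_minchi(
    ["InChI=1S/CH4O/c1-2/h2H,1H3", "InChI=1S/H2O/h1H2"],
    ["10pp", "90pp"],
)
print(minchi)
```

<p>Note how the indexing layer decouples mixture order from the alphabetical component order, which is what makes the identifier canonical.</p>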
<h4 id="unit-standardization-table">Unit Standardization Table</h4>
<p>Replication requires mapping input units to canonical MInChI codes. The full table from the paper (Table 1) includes:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Input Unit</th>
          <th style="text-align: left">MInChI Code</th>
          <th style="text-align: left">Scale Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">%</td>
          <td style="text-align: left">pp</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">w/v%</td>
          <td style="text-align: left">wv</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">w/w%</td>
          <td style="text-align: left">wf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">v/v%</td>
          <td style="text-align: left">vf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/mol%</td>
          <td style="text-align: left">mf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/L (M)</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">mmol/L</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
      <tr>
          <td style="text-align: left">g/L</td>
          <td style="text-align: left">wv</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/kg</td>
          <td style="text-align: left">mb</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">ratio</td>
          <td style="text-align: left">vp</td>
          <td style="text-align: left">1</td>
      </tr>
  </tbody>
</table>
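<p>In code, this table reduces to a simple lookup plus a multiplication, as in this Python sketch of the subset shown above:</p>

```python
# Canonical unit codes and scale factors (subset of the paper's Table 1)
UNIT_MAP = {
    "%":        ("pp", 1),
    "w/v%":     ("wv", 0.01),
    "w/w%":     ("wf", 0.01),
    "v/v%":     ("vf", 0.01),
    "mol/mol%": ("mf", 0.01),
    "mol/L":    ("mr", 1),
    "mmol/L":   ("mr", 1e-3),
    "g/L":      ("wv", 1e-3),
    "mol/kg":   ("mb", 1),
    "ratio":    ("vp", 1),
}

def to_minchi_units(value, unit):
    """Map a (value, unit) pair to its canonical MInChI code and scaled value."""
    code, scale = UNIT_MAP[unit]
    return value * scale, code

print(to_minchi_units(250, "mmol/L"))  # (0.25, 'mr')
```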
<h4 id="text-extraction-logic">Text Extraction Logic</h4>
<p>The paper defines a recursive procedure for parsing plain-text mixture descriptions:</p>
<ol>
<li><strong>Input</strong>: Raw text string (e.g., &ldquo;2 M acetone in water&rdquo;)</li>
<li><strong>Rule Application</strong>: Apply RegEx rules in order:
<ul>
<li><em>Remove</em>: Delete common filler words (&ldquo;solution&rdquo;, &ldquo;in&rdquo;)</li>
<li><em>Replace</em>: Substitute known variations</li>
<li><em>Concentration</em>: Extract quantities like &ldquo;2 M&rdquo;, &ldquo;97%&rdquo;</li>
<li><em>Branch</em>: Split phrases like &ldquo;A in B&rdquo; into sub-nodes</li>
</ul>
</li>
<li><strong>Lookup</strong>: Check cleaned name against a custom table (handles cases like &ldquo;xylenes&rdquo; or specific structures)</li>
<li><strong>OPSIN</strong>: If no lookup match, send to the OPSIN tool to generate SMILES from the chemical name</li>
<li><strong>Embed</strong>: If a structure is found, generate 2D coordinates (Molfile) via RDKit</li>
</ol>
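<p>A toy version of the branching and concentration-extraction steps can be written in a few lines of Python. This sketch handles only &ldquo;A in B&rdquo; splitting and a few concentration tokens; the real pipeline adds filler-word removal, name lookup, OPSIN, and structure embedding:</p>

```python
import re

# A small, assumed set of concentration tokens for the sketch
CONC = re.compile(r"(\d+(?:\.\d+)?)\s*(M|mM|%|g/L)\b")

def parse_mixture_text(text):
    """Toy recursive parse of a plain-text mixture description into a tree."""
    text = text.strip()
    if " in " in text:                      # branch: "solute in solvent"
        solute, solvent = text.split(" in ", 1)
        return {"contents": [parse_mixture_text(solute), parse_mixture_text(solvent)]}
    node = {}
    m = CONC.search(text)                   # pull out "2 M", "97%", ...
    if m:
        node["quantity"] = float(m.group(1))
        node["units"] = m.group(2)
        text = (text[:m.start()] + text[m.end():]).strip()
    node["name"] = text
    return node

tree = parse_mixture_text("2 M acetone in water")
print(tree)
```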
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clark, A. M., McEwen, L. R., Gedeck, P., &amp; Bunin, B. A. (2019). Capturing mixture composition: an open machine-readable format for representing mixed substances. <em>Journal of Cheminformatics</em>, <em>11</em>(1), 33. <a href="https://doi.org/10.1186/s13321-019-0357-4">https://doi.org/10.1186/s13321-019-0357-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2019)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{clark2019capturing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Capturing mixture composition: an open machine-readable format for representing mixed substances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Clark, Alex M and McEwen, Leah R and Gedeck, Peter and Bunin, Barry A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IUPAC/MInChI">Official MInChI GitHub repository</a></li>
</ul>
]]></content:encoded></item><item><title>Making InChI FAIR and Sustainable for Inorganic Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2025/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2025/</guid><description>InChI v1.07 modernizes chemical identifiers for FAIR data principles and adds comprehensive support for inorganic compounds.</description><content:encoded><![CDATA[<h2 id="paper-contribution-modernizing-chemical-identifiers">Paper Contribution: Modernizing Chemical Identifiers</h2>
<p>This is a <strong>Resource</strong> paper that describes the development and maintenance of InChI (International Chemical Identifier), a fundamental infrastructure component for chemical databases. While it includes methodological improvements to the canonicalization algorithm for inorganic compounds, its primary contribution is ensuring the sustainability and accessibility of a critical chemical informatics resource.</p>
<h2 id="motivation-the-inorganic-chemistry-problem">Motivation: The Inorganic Chemistry Problem</h2>
<p>The International Chemical Identifier (InChI) is ubiquitous in chemistry databases, with over a billion structures identified by it. The system was designed specifically for organic chemistry and systematically fails to represent organometallic structures accurately. The original implementation also carried significant limitations:</p>
<ul>
<li><strong>FAIR principles gap</strong>: Development was closed-source, documentation was inadequate, and the codebase was difficult to maintain</li>
<li><strong>Inorganic chemistry failure</strong>: Metal-ligand bonds were automatically disconnected, destroying stereochemical information for coordination complexes</li>
<li><strong>Technical debt</strong>: More than 3000 bugs and security vulnerabilities, nearly 60 Google OSS-Fuzz issues, and an unmaintainable codebase</li>
</ul>
<p>If you&rsquo;ve ever tried to search for a metal complex in a chemical database and gotten nonsense results, this is why. This paper describes the fix.</p>
<h2 id="core-innovation-smart-metal-ligand-handling">Core Innovation: Smart Metal-Ligand Handling</h2>
<p>The key innovations are:</p>
<ol>
<li>
<p><strong>Smart metal-ligand bond handling</strong>: A decision tree algorithm that uses coordination number and electronegativity to determine which bonds to keep and which to disconnect, preserving stereochemistry for coordination complexes</p>
</li>
<li>
<p><strong>Modernized development infrastructure</strong>: Migration to GitHub with open development, comprehensive testing, and maintainable documentation</p>
</li>
<li>
<p><strong>Backward compatibility</strong>: The core canonicalization algorithm remained unchanged, preserving over a billion existing InChIs for organic compounds</p>
</li>
</ol>
<p>The preprocessing step applies a two-pass iterative process for every metal in a structure:</p>
<ol>
<li><strong>Terminal metals</strong> (connected to only one other atom): check the electronegativity lookup table and disconnect if $\Delta EN \geq 1.7$</li>
<li><strong>Non-terminal metals</strong>: if coordination number exceeds the element&rsquo;s standard valence threshold, keep all bonds; otherwise, apply the same electronegativity check per bond (if at least one bond is kept, all are retained)</li>
<li>Hardcoded exceptions exist for Grignard reagents and organolithium compounds</li>
</ol>
<p>For example, $\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected as a coordination complex.</p>
<h2 id="validation-methods--experiments">Validation Methods &amp; Experiments</h2>
<p>The paper focuses on software engineering validation:</p>
<ul>
<li><strong>Bug fixing</strong>: Fixed more than 3000 bugs and security issues, plus nearly 60 Google OSS-Fuzz issues from the legacy codebase</li>
<li><strong>Backward compatibility testing</strong>: Verified that existing organic molecule InChIs remained unchanged</li>
<li><strong>Inorganic compound validation</strong>: Tested the new decision tree algorithm on coordination complexes, organometallic compounds, and ionic salts</li>
<li><strong>Documentation overhaul</strong>: Split technical documentation into Chemical Manual (for chemists) and Technical Manual (for developers)</li>
<li><strong>Web Demo</strong>: Created a browser-based <a href="https://iupac-inchi.github.io/InChI-Web-Demo/">InChI Web Demo</a> that calculates InChI, InChIKey, and AuxInfo from drawn structures or Molfiles, with all computation performed client-side</li>
</ul>
<p>The validation approach emphasizes maintaining the &ldquo;same molecule, same identifier&rdquo; principle while extending coverage to inorganic chemistry.</p>
<h2 id="key-outcomes-and-future-work">Key Outcomes and Future Work</h2>
<p>The v1.07 release successfully:</p>
<ul>
<li><strong>Modernizes infrastructure</strong>: Open development on GitHub with maintainable codebase</li>
<li><strong>Extends to inorganic chemistry</strong>: Proper handling of coordination complexes and organometallic compounds</li>
<li><strong>Maintains backward compatibility</strong>: No breaking changes for existing organic compound InChIs</li>
<li><strong>Improves database search</strong>: Metal complexes now searchable with correct stereochemistry preserved</li>
<li><strong>IUPAC approval</strong>: Version 1.07 has been approved by IUPAC&rsquo;s Committee on Publications and Cheminformatics Data Standards (CPCDS)</li>
</ul>
<p><strong>Acknowledged limitations</strong> for future work:</p>
<ul>
<li>Stereochemistry for inorganic and organometallic compounds still needs improvement, including atropisomers and MDL enhanced stereochemistry</li>
<li>Mixtures (MInChI) and nanomaterials (NInChI) remain unsolved problems</li>
<li>Chemical identifiers work best for discrete molecules and struggle with variable-composition materials</li>
</ul>
<p><strong>Impact</strong>: This update improves searchability of inorganic and organometallic compounds in major chemical databases by preserving coordination bond information that was previously discarded.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="software--data-availability">Software &amp; Data Availability</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IUPAC-InChI/InChI">IUPAC-InChI/InChI</a></td>
          <td>Code</td>
          <td>Open source (IUPAC/InChI Trust)</td>
          <td>Official C/C++ implementation of InChI v1.07</td>
      </tr>
      <tr>
          <td><a href="https://iupac-inchi.github.io/InChI-Web-Demo/">InChI Web Demo</a></td>
          <td>Other</td>
          <td>Open source</td>
          <td>Browser-based InChI/InChIKey generator for testing</td>
      </tr>
  </tbody>
</table>
<p>The InChI v1.07 codebase, primarily written in C/C++, is openly available on GitHub at <a href="https://github.com/IUPAC-InChI/InChI">IUPAC-InChI/InChI</a>. The repository includes the core canonicalization engine and the new inorganic preprocessing logic. Both the Technical Manual (for structural integration) and the Chemical Manual are maintained alongside the codebase. Compiled binaries are available for Windows, Linux, and macOS.</p>
<p><strong>Benchmarking Data</strong>: Validation of the new decision tree logic is managed through rigorous unit testing built directly into the repository&rsquo;s continuous-integration pipelines. Standard tests with existing organic compounds confirm backward compatibility, while newly integrated suites of coordination complexes and organometallic compounds ensure the new v1.07 preprocessing triggers as expected.</p>
<h3 id="algorithms">Algorithms</h3>
<h4 id="the-metal-problem">The Metal Problem</h4>
<p>InChI&rsquo;s original algorithm assumed that bonds to metals were ionic and automatically disconnected them. This makes sense for something like sodium chloride (NaCl), where you have separate $\text{Na}^+$ and $\text{Cl}^-$ ions.</p>
<p>It fails for:</p>
<ul>
<li><strong>Coordination complexes</strong>: Where ligands are bonded to the metal center</li>
<li><strong>Organometallic compounds</strong>: Where carbon-metal bonds are covalent</li>
<li><strong>Sandwich compounds</strong>: Like ferrocene, where the bonding has both ionic and covalent character</li>
</ul>
<p>The result: loss of stereochemical information and identical InChIs for structurally different compounds.</p>
<h4 id="the-solution-smart-preprocessing">The Solution: Smart Preprocessing</h4>
<p>The new system uses a decision tree to figure out which metal-ligand bonds to keep and which to disconnect. The process is <strong>iterative</strong>: it runs for every metal in the structure, then checks every bond to that metal. In the C/C++ repository, this preprocessing logic acts as a filter applied <em>before</em> the traditional organic canonicalization engine (from v1.06) runs, dynamically determining whether coordination bonds are retained for downstream layer generation.</p>
<h5 id="decision-tree-logic">Decision Tree Logic</h5>
<p>The algorithm handles metals in two passes. First, <strong>terminal metals</strong> (bonded to only one atom) are checked against the electronegativity lookup table and disconnected if $\Delta EN \geq 1.7$. This preserves all metal-metal bonds.</p>
<p>Second, <strong>non-terminal metals</strong> are examined. For a metal $m$ bonded to ligand $l$:</p>
<p>$$
\begin{aligned}
B(m, l) &amp;=
\begin{cases}
\text{Connected (all bonds)} &amp; \text{if } CN(m) &gt; V(m) \\
\text{Connected} &amp; \text{if } |EN(m) - EN(l)| &lt; 1.7 \\
\text{Disconnected} &amp; \text{if } |EN(m) - EN(l)| \geq 1.7
\end{cases}
\end{aligned}
$$</p>
<p>A key rule: if at least one metal-ligand bond is kept for a given metal, all other bonds to that metal are also retained (no disconnection is carried out).</p>
<p><em>(Note: Explicit overrides exist for specific classes like Grignard reagents).</em></p>
<h5 id="hardcoded-chemical-exceptions">Hardcoded Chemical Exceptions</h5>
<p>The algorithm includes specific overrides based on well-established chemistry:</p>
<ul>
<li><strong>Grignard reagents (RMgX)</strong>: Explicitly configured to <strong>keep</strong> the Mg-C bond but <strong>disconnect</strong> the Mg-halide bond</li>
<li><strong>Organolithium compounds (RLi)</strong>: Explicitly configured to keep the structure intact</li>
</ul>
<p>These exceptions exist because the general electronegativity rules would give incorrect results for these compound classes.</p>
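<p>The decision tree can be sketched in Python as follows. The electronegativity and valence values here are placeholder assumptions for illustration (the real implementation ships its own lookup tables), and the Grignard/organolithium overrides are omitted:</p>

```python
# Assumed lookup values for the sketch only -- not the official tables
EN = {"Na": 0.93, "Cl": 3.16, "C": 2.55, "Mg": 1.31}
STD_VALENCE = {"Fe": 3, "Na": 1, "Mg": 2}
THRESHOLD = 1.7

def keep_bonds(metal, ligands):
    """Decide whether a metal keeps its bonds (True) or is disconnected (False).

    Mirrors the two rules above: terminal metals get a single
    electronegativity check; non-terminal metals keep everything when the
    coordination number exceeds the standard valence, and otherwise keep
    all bonds if any single bond passes the electronegativity check.
    """
    cn = len(ligands)
    if cn == 1:  # terminal metal: one electronegativity comparison
        return abs(EN[metal] - EN[ligands[0]]) < THRESHOLD
    if cn > STD_VALENCE.get(metal, 0):  # hypervalent coordination: keep all
        return True
    # keep everything if at least one bond looks covalent
    return any(abs(EN[metal] - EN[l]) < THRESHOLD for l in ligands)

print(keep_bonds("Na", ["Cl"]))                  # ionic: disconnect -> False
print(keep_bonds("Fe", ["Cl", "Cl", "Cl", "Cl"]))  # CN exceeds valence -> True
```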
<h5 id="practical-example">Practical Example</h5>
<p>For example, $\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected because its coordination number exceeds the threshold.</p>
<h4 id="how-inchi-generation-works">How InChI Generation Works</h4>
<p>The process has six main steps:</p>
<ol>
<li><strong>Parse input</strong>: Read the structure from a file (Molfile, SDF, etc.)</li>
<li><strong>Convert to internal format</strong>: Transform into the software&rsquo;s data structures</li>
<li><strong>Normalize</strong>: Standardize tautomers, resolve ambiguities (where the new metal rules apply)</li>
<li><strong>Canonicalize</strong>: Create a unique representation independent of atom numbering</li>
<li><strong>Generate InChI string</strong>: Build the layered text identifier</li>
<li><strong>Create InChIKey</strong>: Hash the full string into a 27-character key for databases</li>
</ol>
<p>The InChI itself has separate layers for formula, connectivity, hydrogens, stereochemistry, isotopes, and charge. The InChIKey is what actually gets stored in databases for fast searching.</p>
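<p>The layered format is easy to inspect programmatically. A minimal sketch in plain string handling (not the official InChI API), using ethanol&rsquo;s standard InChI as the example:</p>

```python
def split_layers(inchi: str) -> dict:
    """Split an InChI string into its labeled layers.

    After the version prefix and the formula, each layer starts with a
    one-letter label: c (connectivity), h (hydrogens), q/p (charge and
    proton balance), b/t/m/s (stereo), i (isotopes)."""
    parts = inchi.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers

layers = split_layers("InChI=1S/C2H6O/c1-2-3/h3H,1-2H3")  # ethanol
# layers["formula"] == "C2H6O"; layers["c"] == "1-2-3"
```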
<h5 id="inchikey-version-flag">InChIKey Version Flag</h5>
<p>The flag characters closing the second block of the InChIKey indicate version status:</p>
<ul>
<li><strong>&ldquo;S&rdquo;</strong>: Standard InChI</li>
<li><strong>&ldquo;N&rdquo;</strong>: Non-standard InChI</li>
<li><strong>&ldquo;B&rdquo;</strong>: Beta (experimental features)</li>
</ul>
<p>This flag is important for anyone parsing InChIKeys programmatically, as it tells you whether the identifier was generated using stable or experimental algorithms.</p>
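<p>These flag characters sit at the tail of the key&rsquo;s second block. A small sketch (the helper name and the water InChIKey example are mine, not from the paper):</p>

```python
def inchikey_status(key: str) -> str:
    """Return the two characters that close the second block of a
    27-character InChIKey (e.g. 'SA' for a standard, version-1 key).
    Layout assumed: 14 chars, '-', 10 chars, '-', 1 char."""
    assert len(key) == 27 and key[14] == "-" and key[25] == "-"
    return key[23:25]

# Water's standard InChIKey ends its second block with "SA":
print(inchikey_status("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))  # SA
```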
<h2 id="additional-context">Additional Context</h2>
<h3 id="what-inchi-actually-does">What InChI Actually Does</h3>
<p>InChI creates a unique text string for any chemical structure. SMILES, by contrast, has multiple vendor implementations and can write the same molecule in different ways; InChI provides a single, standardized format controlled by IUPAC. The goal is simple: same molecule, same identifier, every time.</p>
<p>This matters for FAIR data principles:</p>
<ul>
<li><strong>Findable</strong>: You can search for a specific compound across databases</li>
<li><strong>Accessible</strong>: The standard is open and free</li>
<li><strong>Interoperable</strong>: Different systems can connect chemical knowledge</li>
<li><strong>Reusable</strong>: The identifiers work consistently across platforms</li>
</ul>
<h3 id="better-documentation">Better Documentation</h3>
<p>The technical manual is being split into two documents:</p>
<ul>
<li><strong>Chemical Manual</strong>: For chemists who need to understand what InChIs mean</li>
<li><strong>Technical Manual</strong>: For developers who need to implement the algorithms</li>
</ul>
<p>This addresses the problem of current documentation serving both audiences poorly.</p>
<h3 id="the-bigger-picture">The Bigger Picture</h3>
<p>InChI&rsquo;s evolution reflects chemistry&rsquo;s expansion beyond its organic roots. The fact that it took this long to properly handle inorganic compounds shows how much computational chemistry has historically focused on carbon-based molecules.</p>
<p>As the field moves into catalysis, materials science, and coordination chemistry applications, having proper chemical identifiers becomes essential. You can&rsquo;t build FAIR chemical databases if half of chemistry is represented incorrectly.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blanke, G., Brammer, J., Baljozovic, D., Khan, N. U., Lange, F., Bänsch, F., Tovee, C. A., Schatzschneider, U., Hartshorn, R. M., &amp; Herres-Pawlis, S. (2025). Making the InChI FAIR and sustainable while moving to inorganics. <em>Faraday Discussions</em>, 256, 503-519. <a href="https://doi.org/10.1039/D4FD00145A">https://doi.org/10.1039/D4FD00145A</a></p>
<p><strong>Publication</strong>: Faraday Discussions, 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blanke2025making,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Making the InChI FAIR and sustainable while moving to inorganics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blanke, G. and Brammer, J. and Baljozovic, D. and Khan, N. U. and Lange, F. and B{\&#34;a}nsch, F. and Tovee, C. A. and Schatzschneider, U. and Hartshorn, R. M. and Herres-Pawlis, S.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Faraday Discussions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{256}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{503--519}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI: The Worldwide Chemical Structure Identifier Standard</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</guid><description>Heller et al. (2013) explain how IUPAC's InChI became the global standard for representing chemical structures, its governance, and current limitations.</description><content:encoded><![CDATA[<h2 id="inchi-as-a-resource-and-systematization-standard">InChI as a Resource and Systematization Standard</h2>
<p>This is a <strong>Resource &amp; Systematization Paper</strong> that reviews the history, technical architecture, governance structure, and implementation status of the InChI standard. It documents both the institutional development of an open chemical identifier and the technical specification that enables it.</p>
<h2 id="the-motivation-interoperability-in-chemical-databases">The Motivation: Interoperability in Chemical Databases</h2>
<p>Before InChI, the chemistry community faced a fundamental interoperability problem. Chemical databases used proprietary systems like CAS Registry Numbers, or implementation-dependent representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. These systems were variously expensive, access-restricted, or tied to &ldquo;in-house&rdquo; databases.</p>
<p>The authors argue the Internet and Open Source software acted as a <strong>&ldquo;black swan&rdquo; event</strong> that disrupted this status quo. The Internet created a need to link diverse, free and fee-based resources without a central gatekeeper. InChI was designed as the solution: a non-proprietary, open-source identifier enabling linking of distinct data compilations.</p>
<h2 id="technical-and-institutional-innovations-of-inchi">Technical and Institutional Innovations of InChI</h2>
<p>InChI&rsquo;s innovation is both technical and institutional:</p>
<p><strong>Technical novelty</strong>: A hierarchical &ldquo;layered&rdquo; canonicalization system where structure representations build from basic connectivity to full stereochemistry. This allows flexible matching: a molecule with unknown stereochemistry produces an InChI that&rsquo;s a subset of the same molecule with known stereochemistry.</p>
<p><strong>Institutional novelty</strong>: Creating an open standard governed by a charitable trust (the InChI Trust) that convinced commercial competitors (publishers, databases) to adopt it as a &ldquo;pre-competitive&rdquo; necessity. This solved the political problem of maintaining an open standard in a competitive industry.</p>
<h3 id="technical-architecture-layers-and-hashing">Technical Architecture: Layers and Hashing</h3>
<h4 id="the-inchi-string">The InChI String</h4>
<p>InChI is a <strong>canonicalized structure representation</strong> derived from IUPAC conventions. It uses a hierarchical &ldquo;layered&rdquo; format where specific layers add detail. The exact technical specification includes these string segments:</p>
<ol>
<li><strong>Main Layer</strong>: Chemical Formula</li>
<li><strong>Connectivity Layer (<code>/c</code>)</strong>: Atoms and bonds (excluding bond orders)</li>
<li><strong>Hydrogen Layer (<code>/h</code>)</strong>: Tautomeric and immobile H atoms</li>
<li><strong>Charge (<code>/q</code>) &amp; Proton Balance (<code>/p</code>)</strong>: Accounting for ionization</li>
<li><strong>Stereochemistry</strong>:
<ul>
<li>Double bond (<code>/b</code>) and Tetrahedral (<code>/t</code>) parity</li>
<li>Parity inversion (<code>/m</code>)</li>
<li>Stereo type (<code>/s</code>): absolute, relative, or racemic</li>
</ul>
</li>
<li><strong>Fixed-H Layer (<code>/f</code>)</strong>: Distinguishes specific tautomers if needed</li>
</ol>
<p>This layered approach means that a molecule with unknown stereochemistry will have an InChI that&rsquo;s a subset of the same molecule with known stereochemistry. This allows for flexible matching at the connectivity level even without complete stereochemical information.</p>
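<p>This subset relationship can be exploited directly: dropping the stereo layers from a full InChI yields the identifier a stereochemistry-free drawing would produce. A rough sketch in plain string handling (not the official API; the L-alanine InChI below is an assumed example):</p>

```python
def strip_stereo(inchi: str) -> str:
    """Remove the stereo layers (/b, /t, /m, /s) so two InChIs can be
    compared at the connectivity level. Simplified: ignores the fixed-H
    sublayer structure of non-standard InChIs."""
    parts = inchi.split("/")
    head, layers = parts[:2], parts[2:]
    kept = [p for p in layers if p[:1] not in ("b", "t", "m", "s")]
    return "/".join(head + kept)

full = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"  # L-alanine
print(strip_stereo(full))
# InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)
```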
<h4 id="the-inchikey">The InChIKey</h4>
<p>Because InChI strings can be too long for search engines (which break at ~30 characters or at symbols like <code>/</code> and <code>+</code>), the InChIKey was created.</p>
<p><strong>Mechanism</strong>: A 27-character string generated via a <strong>SHA-256 hash</strong> of the InChI string. This can be represented as:</p>
<p>$$ \text{InChIKey} = f_{\text{SHA-256}}(\text{InChI}) $$</p>
<p><strong>Structure</strong>:</p>
<ul>
<li><strong>Block 1 (14 characters)</strong>: Encodes the molecular skeleton (connectivity)</li>
<li><strong>Block 2 (10 characters)</strong>: Eight letters encoding stereochemistry and isotopes, plus a flag indicating standard InChI (S) and an InChI version indicator (A for version 1)</li>
<li><strong>Block 3 (1 character)</strong>: Protonation flag (e.g., &lsquo;N&rsquo; for neutral)</li>
</ul>
<p>Because the InChIKey is a hash, it cannot be converted back to a structure (irreversible) and has a theoretical risk of collision. It is important to distinguish between <strong>InChI collisions</strong> (which are due to flaws/bugs and are very rare) and <strong>InChIKey collisions</strong> (which are mathematically inevitable due to hashing).</p>
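<p>The one-way nature of the key is easy to demonstrate with the standard library. The sketch below only mimics the <em>shape</em> of an InChIKey block; the official algorithm uses its own SHA-256 truncation and base-26 packing:</p>

```python
import hashlib
import string

def toy_key_block(text: str, n_letters: int = 14) -> str:
    """Hash a string with SHA-256 and render the first bytes as
    uppercase letters. Illustrative only -- NOT the official
    InChIKey encoding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "".join(string.ascii_uppercase[b % 26] for b in digest[:n_letters])

block = toy_key_block("InChI=1S/C2H6O/c1-2-3/h3H,1-2H3")
assert len(block) == 14 and block.isupper()
# Deterministic, but irreversible: nothing about `block` lets you
# recover the InChI, and distinct inputs can in principle collide.
```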
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This is a systematization paper documenting an existing standard. However, the authors provide:</p>
<p><strong>Validation evidence</strong>:</p>
<ul>
<li><strong>Certification Suite</strong>: A test suite that software vendors must pass to display the &ldquo;InChI Certified&rdquo; logo, preventing fragmentation</li>
<li><strong>Round-trip conversion testing</strong>: Demonstrated &gt;99% success rate converting InChI back to structure (100% with AuxInfo layer)</li>
<li><strong>Real-world adoption metrics</strong>: Documented integration across major chemical databases and publishers</li>
</ul>
<p><strong>Known limitations identified</strong>:</p>
<ul>
<li>Tautomer representation issues in Version 1 (different drawings of same tautomer can generate different InChIs)</li>
<li>Edge cases in stereochemistry representation</li>
</ul>
<h3 id="institutional-history--governance">Institutional History &amp; Governance</h3>
<p><strong>Origin</strong>: The project was initiated at a March 2000 IUPAC meeting in Washington, DC. It was originally called the <strong>IUPAC Chemical Identifier Project (IChIP)</strong>.</p>
<p><strong>Development</strong>: Technical work was done by NIST (Stein, Heller, Tchekhovskoi), overseen by the IUPAC <strong>CCINS</strong> committee, which later became the <strong>InChI Subcommittee</strong> of Division VIII.</p>
<p><strong>The InChI Trust</strong>: To ensure the algorithm survived beyond a volunteer organization, the <strong>InChI Trust</strong> was formed in 2009. It is a UK charity supported by publishers and databases (e.g., Nature, RSC) to maintain the standard pre-competitively. This was a critical innovation: getting commercial publishers and software vendors to agree that a non-proprietary standard would benefit everyone.</p>
<h2 id="real-world-impact-and-future-directions">Real-World Impact and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Success through &ldquo;un-coerced adoption&rdquo;</strong>: InChI succeeded because commercial competitors viewed it as a &ldquo;pre-competitive&rdquo; necessity for the Internet age. The open governance model proved durable.</p>
<p><strong>Technical achievements</strong>:</p>
<ul>
<li>Reversible representation (&gt;99% without AuxInfo, 100% with it)</li>
<li>Hierarchical structure enables flexible matching at different levels of detail</li>
<li>InChIKey enables web search despite being a hash (with inherent collision risk)</li>
</ul>
<h3 id="limitations-acknowledged-as-of-2013">Limitations Acknowledged (as of 2013)</h3>
<ul>
<li><strong>Tautomerism Issues</strong>: Different drawings of the same tautomer (e.g., 1,4-oxime vs nitroso) can generate different InChIs in Version 1, which is targeted for Version 2</li>
<li><strong>Hash collision risk</strong>: InChIKey collisions are mathematically inevitable due to SHA-256 hashing, though InChI collisions (actual bugs) are very rare</li>
<li><strong>Certification required</strong>: To prevent fragmentation, software must pass the InChI Certification Suite</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors note that while this paper documents the state as of 2013, InChI continues to evolve. Tautomer handling and edge cases in stereochemistry representation were priorities for future versions. The governance model through the InChI Trust was designed to ensure long-term maintenance beyond the original volunteer contributors.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This systematization paper documents an existing standard. Key implementation resources are openly maintained by the InChI Trust.</p>
<h3 id="code--software">Code &amp; Software</h3>
<ul>
<li><strong>Official Open Source Implementation</strong>: The C source code and pre-compiled binaries for the InChI algorithm are freely available via the <a href="https://www.inchi-trust.org/downloads/">InChI Trust Downloads Page</a> and their <a href="https://github.com/IUPAC-InChI/InChI">official GitHub repository</a>.</li>
<li><strong>Canonicalization algorithm</strong>: Open-source implementation of IUPAC-based rules for generating unique representations from multiple possible drawings of the same molecule.</li>
</ul>
<h3 id="data--validation">Data &amp; Validation</h3>
<ul>
<li><strong>InChI Certification Suite</strong>: A test suite of chemical structures provided by the InChI Trust used to validate that third-party software implementations generate correct InChIs.</li>
<li><strong>Version 1 specification</strong>: Complete technical documentation of the layered format.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Round-trip conversion</strong>: &gt;99% success rate (100% with AuxInfo) as validated by NIST and IUPAC.</li>
<li><strong>Certification testing</strong>: Pass/fail validation for software claiming InChI compliance.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., &amp; Pletnev, I. (2013). InChI - the worldwide chemical structure identifier standard. <em>Journal of Cheminformatics</em>, <em>5</em>(1), 7. <a href="https://doi.org/10.1186/1758-2946-5-7">https://doi.org/10.1186/1758-2946-5-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{heller2013inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{InChI} - the worldwide chemical structure identifier standard}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heller, Stephen and McNaught, Alan and Stein, Stephen and Tchekhovskoi, Dmitrii and Pletnev, Igor}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/1758-2946-5-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI and Tautomerism: Toward Comprehensive Treatment</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-and-tautomers/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-and-tautomers/</guid><description>Dhaked et al. compile 86 tautomeric rules and validate them across 400M+ structures, revealing that current InChI misses half of tautomeric relationships.</description><content:encoded><![CDATA[<h2 id="paper-contribution-a-systematized-tautomer-database-resource">Paper Contribution: A Systematized Tautomer Database Resource</h2>
<p>This is a <strong>Resource</strong> paper with strong <strong>Systematization</strong> elements. It provides a comprehensive catalog of 86 tautomeric transformation rules (20 pre-existing CACTVS defaults plus 66 new rules derived from experimental literature), designed to serve as a foundational resource for chemical database systems and the InChI V2 identifier standard. The systematic validation across 400+ million structures also makes it a benchmarking study for evaluating current chemoinformatics tools.</p>
<h2 id="the-tautomerism-problem-in-chemical-databases">The Tautomerism Problem in Chemical Databases</h2>
<p>Chemical databases face a fundamental problem: the same molecule can appear multiple times under different identifiers simply because it exists in different tautomeric forms. For example, glucose&rsquo;s ring-closed and open-chain forms are the same molecule; however, current chemical identifiers (including InChI) often treat them as distinct compounds.</p>
<figure class="post-figure center ">
    <img src="/img/notes/Glucose-tautomerism.webp"
         alt="D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism"
         title="D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ring-chain tautomerism in glucose: the open-chain aldehyde form (left) and the cyclic pyranose form (right) are the same molecule in different tautomeric states.</figcaption>
    
</figure>

<p>This creates three critical problems:</p>
<ol>
<li><strong>Database redundancy</strong>: Millions of duplicate entries for the same chemical entities</li>
<li><strong>Search failures</strong>: Researchers miss relevant compounds during structure searches</li>
<li><strong>ML training issues</strong>: Machine learning models learn to treat tautomers as different molecules</li>
</ol>
<p>The motivation for this work is to provide a comprehensive, experimentally-grounded rule set that enables InChI V2 to properly recognize tautomeric relationships, eliminating these problems at the identifier level.</p>
<h2 id="86-comprehensive-tautomeric-transformation-rules">86 Comprehensive Tautomeric Transformation Rules</h2>
<p>The key contributions are:</p>
<ol>
<li>
<p><strong>Comprehensive Rule Set</strong>: Compilation of <strong>86 tautomeric transformation rules</strong> (20 pre-existing CACTVS defaults plus 66 new rules derived from experimental literature), categorized into:</p>
<ul>
<li>54 Prototropic rules (classic H-movement tautomerism)</li>
<li>21 Ring-Chain rules (cyclic/open-chain transformations)</li>
<li>11 Valence rules (structural rearrangements with valence changes)</li>
</ul>
</li>
<li>
<p><strong>Massive-Scale Validation</strong>: Testing these rules against <strong>nine major chemical databases</strong> totaling over 400 million structures to identify coverage gaps in current InChI implementations</p>
</li>
<li>
<p><strong>Quantitative Assessment</strong>: Systematic measurement showing that current InChI (even with Nonstandard 15T + KET settings) only achieves ~50% success in recognizing tautomeric relationships, with some new rules showing &lt;2% success rates</p>
</li>
<li>
<p><strong>Practical Tools</strong>: Creation of the <strong>Tautomerizer</strong> web tool for public use, demonstrating practical application of the rule set</p>
</li>
</ol>
<p>The novelty lies in the systematic compilation and validation of transformation rules at a scale that reveals critical gaps in current chemical identification systems.</p>
<h2 id="massive-scale-validation-across-400m-structures">Massive-Scale Validation Across 400M+ Structures</h2>
<h3 id="database-analysis">Database Analysis</h3>
<p>The researchers analyzed <strong>9 chemical databases</strong> totaling 400+ million structures:</p>
<ul>
<li><strong>Public databases</strong>: PubChem (largest), ChEMBL, DrugBank, PDB Ligands, SureChEMBL, AMS, ChemNavigator</li>
<li><strong>Private databases</strong>: CSD (Cambridge Structural Database), CSDB (NCI internal)</li>
</ul>
<h3 id="methodology">Methodology</h3>
<p><strong>Software</strong>: CACTVS Chemoinformatics Toolkit (versions 3.4.6.33 and 3.4.8.6)</p>
<p><strong>Tautomer Generation Protocol</strong>:</p>
<ul>
<li><strong>Algorithm</strong>: Single-step generation (apply transforms to input structure only, avoiding recursion)</li>
<li><strong>Constraints</strong>: Max 10 tautomers per structure, 30-second CPU timeout per transform</li>
<li><strong>Format</strong>: All rules expressed as SMIRKS strings</li>
<li><strong>Stereochemistry</strong>: Stereocenters involved in tautomerism were flattened during transformation</li>
</ul>
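<p>The single-step protocol above can be summarized in a few lines. The transforms here are placeholder callables standing in for SMIRKS execution (the paper uses the CACTVS toolkit; <code>generate_tautomers</code> is a hypothetical name):</p>

```python
from typing import Callable, Iterable, List

def generate_tautomers(structure: str,
                       transforms: Iterable[Callable[[str], List[str]]],
                       max_tautomers: int = 10) -> List[str]:
    """Apply every transform to the *input* structure only (single-step,
    no recursion), deduplicate, and cap the output at 10 tautomers,
    mirroring the constraints listed above."""
    seen, out = {structure}, []
    for transform in transforms:
        for tautomer in transform(structure):
            if tautomer in seen:
                continue
            seen.add(tautomer)
            out.append(tautomer)
            if len(out) >= max_tautomers:
                return out
    return out
```

<p>Because each transform sees only the original structure, a tautomer of a tautomer is never produced &mdash; one reason the authors later list single-step generation as a limitation.</p>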
<p><strong>Success Metrics</strong> (tested against InChI V.1.05):</p>
<ul>
<li><strong>Complete InChI match</strong>: All tautomers share identical InChI</li>
<li><strong>Partial InChI match</strong>: At least two tautomers share an InChI</li>
<li>Tested against two InChI configurations: Standard InChI and Nonstandard InChI (with 15T and KET options enabled)</li>
</ul>
<h3 id="rule-coverage-analysis">Rule Coverage Analysis</h3>
<p>For each of the 86 rules, the researchers:</p>
<ol>
<li>Applied the transformation to all molecules in each database</li>
<li>Generated tautomers using the SMIRKS patterns</li>
<li>Computed InChI identifiers for each tautomer</li>
<li>Measured success rates (percentage of cases where InChI recognized the relationship)</li>
</ol>
<h3 id="key-findings-from-experiments">Key Findings from Experiments</h3>
<p><strong>Rule Frequency</strong>: The most common rule <code>PT_06_00</code> (1,3-heteroatom H-shift, covering keto-enol tautomerism) affects <strong>&gt;70% of molecules</strong> across databases.</p>
<p><strong>InChI Performance</strong>:</p>
<ul>
<li>Standard InChI: ~37% success rate</li>
<li>Nonstandard InChI (15T + KET): ~50% success rate</li>
<li>Many newly defined rules: &lt;2% success rate</li>
</ul>
<p><strong>Scale Impact</strong>: Implementing the full 86-rule set would approximately <strong>triple</strong> the number of compounds recognized as having tautomeric relationships relative to Standard InChI.</p>
<h2 id="outcomes-inchi-v2-requirements-and-coverage-gaps">Outcomes: InChI V2 Requirements and Coverage Gaps</h2>
<h3 id="main-findings">Main Findings</h3>
<ol>
<li>
<p><strong>Current Systems Are Inadequate</strong>: Even with the Nonstandard 15T + KET settings, InChI only achieves ~50% success in recognizing tautomeric relationships, with Standard InChI at ~37%</p>
</li>
<li>
<p><strong>Massive Coverage Gap</strong>: The new rule set reveals millions of tautomeric relationships that current InChI completely misses, particularly for ring-chain and valence tautomerism</p>
</li>
<li>
<p><strong>Implementation Requirement</strong>: InChI V2 will require a major redesign to handle the comprehensive rule set</p>
</li>
<li>
<p><strong>Rule Validation</strong>: The 86-rule set provides a validated foundation for next-generation chemical identifiers, with the new rules further confirmed against an independent ChEMBL 24.1 tautomer extraction</p>
</li>
</ol>
<h3 id="implications">Implications</h3>
<p><strong>For Chemical Databases</strong>:</p>
<ul>
<li>Reduced redundancy through proper tautomer recognition</li>
<li>Improved data quality and consistency</li>
<li>More comprehensive structure search results</li>
</ul>
<p><strong>For Machine Learning</strong>:</p>
<ul>
<li>More accurate training data (tautomers properly grouped)</li>
<li>Better molecular property prediction models</li>
<li>Reduced dataset bias from tautomeric duplicates</li>
</ul>
<p><strong>For Chemoinformatics Tools</strong>:</p>
<ul>
<li>Blueprint for InChI V2 development</li>
<li>Standardized rule set for tautomer generation</li>
<li>Public tool (Tautomerizer) for practical use</li>
</ul>
<h3 id="limitations-acknowledged">Limitations Acknowledged</h3>
<ul>
<li>Single-step generation only (omits recursive enumeration of all possible tautomers)</li>
<li>30-second timeout may miss complex transformations</li>
<li>Some tautomeric preferences are context-dependent (pH, solvent) and require more than static rules for capture</li>
</ul>
<h3 id="additional-validation">Additional Validation</h3>
<p>The authors validated their rule set against 4,158 tautomeric systems independently extracted from ChEMBL 24.1 via a SMILES-based tautomer hash (provided by Noel O&rsquo;Boyle and Roger Sayle). Their rules covered essentially all tautomeric systems in that set, with practically all cases handled by the standard CACTVS rules PT_02_00 through PT_21_00.</p>
<h3 id="companion-resource-tautomer-database">Companion Resource: Tautomer Database</h3>
<p>A companion paper describes the creation of a publicly available Tautomer Database (Tauto DB) containing over 2,800 tautomeric tuples extracted from experimental literature, available at <a href="https://cactus.nci.nih.gov/download/tautomer/">https://cactus.nci.nih.gov/download/tautomer/</a>. Data from this database informed the generation of new rules in this work.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The paper lays groundwork for InChI V2 development, emphasizing that the comprehensive rule set necessitates algorithmic redesign.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Datasets Analyzed</strong> (400M+ total structures):</p>
<p><strong>Public Databases</strong> (Enable partial reproduction):</p>
<ul>
<li><strong>PubChem</strong>: Largest public chemical database</li>
<li><strong>ChEMBL</strong>: Bioactive molecules with drug-like properties</li>
<li><strong>DrugBank</strong>: FDA-approved and experimental drugs</li>
<li><strong>PDB Ligands</strong>: Small molecules from protein structures</li>
<li><strong>SureChEMBL</strong>: Chemical structures from patents</li>
<li><strong>AMS</strong>: Screening samples</li>
<li><strong>ChemNavigator</strong>: Commercial chemical database</li>
</ul>
<p><strong>Private/Proprietary Databases</strong> (Prevent 100% full-scale reproduction):</p>
<ul>
<li><strong>CSD</strong>: Cambridge Structural Database (requires commercial/academic license)</li>
<li><strong>CSDB</strong>: NCI internal database (private)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tautomer Generation</strong>:</p>
<ul>
<li><strong>Method</strong>: Single-step SMIRKS-based transformations</li>
<li><strong>Constraints</strong>:
<ul>
<li>Maximum 10 tautomers per input structure</li>
<li>30-second CPU timeout per transformation</li>
<li>Stereochemistry flattening for affected centers</li>
</ul>
</li>
<li><strong>Toolkit Dependency</strong>: The authors used the CACTVS Chemoinformatics Toolkit. Researchers attempting to reproduce this with fully open-source tools (like RDKit) may encounter differing behavior due to proprietary chemical perception logic and licensing differences.</li>
</ul>
<p><strong>Rule Categories</strong>:</p>
<ul>
<li><strong>Prototropic (PT)</strong>: 54 rules for hydrogen movement
<ul>
<li>Most common: <code>PT_06_00</code> (1,3-heteroatom H-shift, &gt;70% coverage)</li>
</ul>
</li>
<li><strong>Ring-Chain (RC)</strong>: 21 rules for cyclic/open-chain transformations
<ul>
<li>Examples: <code>RC_03_00</code> (pentose sugars), <code>RC_04_01</code> (hexose sugars)</li>
</ul>
</li>
<li><strong>Valence (VT)</strong>: 11 rules for valence changes
<ul>
<li>Notable: <code>VT_02_00</code> (tetrazole/azide, ~2.8M hits)</li>
</ul>
</li>
</ul>
<p><strong>InChI Comparison</strong>:</p>
<ul>
<li>Standard InChI (default settings)</li>
<li>Nonstandard InChI with <code>15T</code> and <code>KET</code> options (mobile H and keto-enol)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Success Metrics</strong>:</p>
<p>Let $\mathcal{T}(m)$ be the set of generated tautomers for molecule $m$.</p>
<ul>
<li><strong>Complete Match</strong>: Occurs iff $\forall t_i, t_j \in \mathcal{T}(m), \text{InChI}(t_i) = \text{InChI}(t_j)$.</li>
<li><strong>Partial Match</strong>: At least 2 tautomers share the same InChI.</li>
<li><strong>Fail</strong>: All tautomers have different InChIs.</li>
</ul>
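<p>These metrics reduce to counting distinct InChIs per tautomer set. A minimal sketch (the function name is mine, not from the paper):</p>

```python
def classify_match(inchis: list) -> str:
    """Label a tautomer set by the paper's success metrics:
    'complete' -- every tautomer yields the same InChI;
    'partial'  -- at least two tautomers share an InChI;
    'fail'     -- every tautomer yields a different InChI."""
    distinct = len(set(inchis))
    if distinct == 1:
        return "complete"
    if distinct < len(inchis):
        return "partial"
    return "fail"

print(classify_match(["X", "X", "X"]))  # complete
print(classify_match(["X", "X", "Y"]))  # partial
print(classify_match(["X", "Y", "Z"]))  # fail
```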
<p><strong>Benchmark Results</strong>:</p>
<ul>
<li>Standard InChI: ~37% success rate across all rules</li>
<li>Nonstandard (15T + KET): ~50% success rate</li>
<li>New rules: Many show &lt;2% recognition by current InChI</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Software Environment</strong>:</p>
<ul>
<li><strong>Toolkit</strong>: CACTVS Chemoinformatics Toolkit v3.4.6.33 and v3.4.8.6</li>
<li><strong>Hash Functions</strong>:
<ul>
<li><code>E_TAUTO_HASH</code> (tautomer-invariant identifier)</li>
<li><code>E_ISOTOPE_STEREO_HASH128</code> (tautomer-sensitive identifier)</li>
</ul>
</li>
</ul>
<p><strong>Note</strong>: The paper omits computational hardware specifications but acknowledges using the NIH HPC Biowulf cluster. Evaluating 400M+ structures necessitates high-throughput cluster computing, making it computationally expensive for an individual to replicate the full analysis from scratch.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://cactus.nci.nih.gov/tautomerizer/">Tautomerizer Web Tool</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Public web tool for applying tautomeric rules to user molecules</td>
      </tr>
      <tr>
          <td><a href="https://cactus.nci.nih.gov/download/tautomer/">Tautomer Database</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>2800+ experimental tautomeric tuples (companion resource)</td>
      </tr>
      <tr>
          <td><a href="https://pubs.acs.org/doi/10.1021/acs.jcim.9b01080">SMIRKS and Scripts (SI)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>CACTVS Tcl scripts and SMIRKS provided as Supporting Information</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Dhaked, D. K., Ihlenfeldt, W.-D., Patel, H., Delannée, V., &amp; Nicklaus, M. C. (2020). Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2. <em>Journal of Chemical Information and Modeling</em>, <em>60</em>(3), 1253-1275. <a href="https://doi.org/10.1021/acs.jcim.9b01080">https://doi.org/10.1021/acs.jcim.9b01080</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{dhaked2020toward,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Dhaked, Devendra K and Ihlenfeldt, Wolf-Dietrich and Patel, Hitesh and Delann{\&#39;e}e, Victorien and Nicklaus, Marc C}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1253--1275}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.9b01080}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://cactus.nci.nih.gov/tautomerizer/">Tautomerizer Tool</a> - Public web tool for testing tautomeric transformations</li>
</ul>
]]></content:encoded></item><item><title>SELFIES: A Robust Molecular String Representation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies/</link><pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies/</guid><description>SELFIES is a robust molecular string representation for ML where every string decodes to a valid molecule, implemented in the selfies Python library.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p><strong>SELFIES (SELF-referencIng Embedded Strings)</strong> is a string-based molecular representation where every possible string, even one generated randomly, corresponds to a syntactically and semantically valid molecule. This property addresses a major limitation of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, where a large fraction of strings produced by machine learning models represent invalid chemical structures.</p>
<p>The format is implemented in an open-source Python library called <code>selfies</code>. Since the <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original publication</a>, the library has undergone significant architectural changes, most notably replacing the original string-manipulation engine with a graph-based internal representation that improved both performance and extensibility (see <a href="#recent-developments">Recent Developments</a>).</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Guaranteed Validity</strong>: Every possible SELFIES string can be decoded into a valid molecular graph that obeys chemical valence rules. This is its fundamental advantage over SMILES.</li>
<li><strong>Machine Learning Friendly</strong>: Can be used directly in any machine learning model (like VAEs or GANs) without adaptation, guaranteeing that all generated outputs are valid molecules.</li>
<li><strong>Customizable Constraints</strong>: The underlying chemical rules, such as maximum valence for different atoms, can be customized by the user. The library provides presets (e.g., for hypervalent species) and allows users to define their own rule sets.</li>
<li><strong>Human-readable</strong>: With some familiarity, the strings can be read directly, allowing interpretation of functional groups and connectivity.</li>
<li><strong>Local Operations</strong>: SELFIES encodes branch length and ring size as adjacent symbols in the string (rather than requiring matched delimiters or repeated digits at distant positions, as SMILES does), preventing common syntactical errors like unmatched parentheses or mismatched ring-closure digits.</li>
<li><strong>Broad Support</strong>: The current <code>selfies</code> library supports aromatic molecules (via kekulization), isotopes, charges, radicals, and stereochemistry. It also includes a dot symbol (<code>.</code>) for representing disconnected molecular fragments.</li>
</ul>
<h2 id="basic-syntax">Basic Syntax</h2>
<p>SELFIES uses symbols enclosed in square brackets (e.g., <code>[C]</code>, <code>[O]</code>, <code>[#N]</code>). The interpretation of each symbol depends on the current <strong>state of the derivation</strong> (described below), which ensures chemical valence rules are strictly obeyed. The syntax is formally defined by a Chomsky type-2 context-free grammar.</p>
<h3 id="derivation-rules">Derivation Rules</h3>
<p>SELFIES are constructed using a table of derivation rules. The process starts in an initial state (e.g., $X_0$) and reads the SELFIES string symbol by symbol. Each symbol, combined with the current state, determines the resulting atom/bond and the next state. The derivation state $X_n$ intuitively tracks that the previously added atom can form a maximum of $n$ additional bonds.</p>
<p>For example, the string <code>[F][=C][=C][#N]</code> is derived as follows, where $X_n$ indicates the atom can form up to $n$ additional bonds. Notice how bond demotion occurs: the first <code>[=C]</code> requests a double bond, but only a single bond is formed because state $X_1$ limits the connection to one bond.</p>
<p>$$
\begin{aligned}
\text{State } X_0 + \text{[F]} &amp;\rightarrow \text{F} + \text{State } X_1 \\
\text{State } X_1 + \text{[=C]} &amp;\rightarrow \text{F-C} + \text{State } X_3 \\
\text{State } X_3 + \text{[=C]} &amp;\rightarrow \text{F-C=C} + \text{State } X_2 \\
\text{State } X_2 + [\#\text{N}] &amp;\rightarrow \text{F-C=C=N} + \text{Final}
\end{aligned}
$$</p>
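<p>The same derivation can be traced with a small state machine. The toy decoder below handles linear chains only and is not the full <code>selfies</code> grammar; the valence table is an assumption for illustration. It reproduces the bond demotion shown above:</p>

```python
import re

# Maximum bonds each atom may form (assumed default-like valences)
VALENCE = {"F": 1, "C": 4, "N": 3, "O": 2}
BOND_ORDER = {"": 1, "=": 2, "#": 3}

def decode_linear(symbols):
    """Decode a linear SELFIES-like chain, demoting bonds to respect state X_n."""
    atoms, bonds = [], []
    state = None  # X_n: bonds still available on the previous atom
    for sym in symbols:
        prefix, atom = re.match(r"\[([=#]?)(\w+)\]", sym).groups()
        if state is None:
            order = 0  # first atom: derivation starts in state X_0
        else:
            # Requested bond order is capped by the state (bond demotion)
            order = min(BOND_ORDER[prefix], state, VALENCE[atom])
            bonds.append(order)
        atoms.append(atom)
        state = VALENCE[atom] - order  # next state X_n
        if state == 0:
            break  # no bonds left: the chain terminates
    return atoms, bonds

# [F][=C][=C][#N]: the first [=C] is demoted to a single bond by state X_1,
# and the final [#N] is demoted to a double bond by state X_2, giving F-C=C=N.
print(decode_linear(["[F]", "[=C]", "[=C]", "[#N]"]))
# -> (['F', 'C', 'C', 'N'], [1, 2, 2])
```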
<h3 id="structural-features">Structural Features</h3>
<ul>
<li><strong>Branches</strong>: Represented by a <code>[Branch]</code> symbol. The symbols immediately following it are interpreted as an index that specifies the number of SELFIES symbols belonging to that branch. This structure prevents errors like unmatched parentheses in SMILES.</li>
<li><strong>Rings</strong>: Represented by a <code>[Ring]</code> symbol. Similar to branches, subsequent symbols specify an index that indicates which previous atom to connect to, forming a ring closure. To avoid violating valence constraints, ring bond creation is postponed to a final post-processing step, where it is only completed if the target atom has available bonds.</li>
</ul>
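<p>As a concrete illustration of these index symbols: the <code>selfies</code> library maps a fixed 16-symbol alphabet to the digits 0&ndash;15 and reads the symbols after <code>[Branch]</code>/<code>[Ring]</code> as a base-16 number. The sketch below reproduces that index alphabet from the library&rsquo;s documentation (treat the exact ordering as an assumption) and shows how benzene&rsquo;s <code>[Ring1][=Branch1]</code> closes the six-membered ring:</p>

```python
# Index alphabet used by the selfies library (ordering assumed from its docs):
INDEX_ALPHABET = [
    "[C]", "[Ring1]", "[Ring2]",
    "[Branch1]", "[=Branch1]", "[#Branch1]",
    "[Branch2]", "[=Branch2]", "[#Branch2]",
    "[O]", "[N]", "[=N]", "[=C]", "[#C]", "[S]", "[P]",
]

def symbol_index(symbol):
    """Read one overloaded symbol as a base-16 digit."""
    return INDEX_ALPHABET.index(symbol)

# Benzene: [C][=C][C][=C][C][=C][Ring1][=Branch1]
# The symbol after [Ring1] encodes Q; the ring bond connects the current
# atom to the (Q + 1)-th preceding atom.
q = symbol_index("[=Branch1]")    # 4
current_atom = 5                  # the sixth carbon (0-indexed)
partner = current_atom - (q + 1)  # 0: closes the 6-membered ring
print(q, partner)

# Branches work the same way: a branch spans Q + 1 following symbols,
# so [Branch1][C] (as in aspirin's [Branch1][C][=O]) opens a 1-symbol branch.
branch_len = symbol_index("[C]") + 1  # 1
```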
<h2 id="examples">Examples</h2>
<p>To see how these derivation rules work in practice, here are SELFIES representations for common molecules of increasing complexity:</p>
<figure class="post-figure center ">
    <img src="/img/selfies/ethanol.webp"
         alt="Ethanol molecule from SELFIES"
         title="Ethanol molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ethanol: <code>[C][C][O]</code></figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/selfies/benzene.webp"
         alt="Benzene molecule from SELFIES"
         title="Benzene molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Benzene: <code>[C][=C][C][=C][C][=C][Ring1][=Branch1]</code></figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/selfies/aspirin.webp"
         alt="Aspirin molecule from SELFIES"
         title="Aspirin molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Aspirin: <code>[C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][C][=O][O]</code></figcaption>
    
</figure>

<h2 id="the-selfies-python-library">The <code>selfies</code> Python Library</h2>
<p>The <code>selfies</code> library provides a dependency-free Python implementation. Here are the core operations:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMILES -&gt; SELFIES</span>
</span></span><span style="display:flex;"><span>smiles <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;c1ccc(C(=O)O)cc1&#34;</span>  <span style="color:#75715e"># benzoic acid</span>
</span></span><span style="display:flex;"><span>encoded <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>encoder(smiles)
</span></span><span style="display:flex;"><span>print(encoded)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [C][=C][C][=C][C][Branch1][C][=O][O][=C][Ring1][=Branch1]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SELFIES -&gt; SMILES</span>
</span></span><span style="display:flex;"><span>decoded <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(encoded)
</span></span><span style="display:flex;"><span>print(decoded)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C1=CC=CC(=C1)C(=O)O</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Robustness: random strings always decode to valid molecules</span>
</span></span><span style="display:flex;"><span>random_selfies <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[C][F][Ring1][O][=N][Branch1][C][S]&#34;</span>
</span></span><span style="display:flex;"><span>print(sf<span style="color:#f92672">.</span>decoder(random_selfies))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; always returns a valid molecule</span>
</span></span></code></pre></div><h3 id="tokenization-and-encoding">Tokenization and Encoding</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>selfies_str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[C][=C][C][=C][C][Branch1][C][=O][O][=C][Ring1][=Branch1]&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Tokenize into individual symbols</span>
</span></span><span style="display:flex;"><span>tokens <span style="color:#f92672">=</span> list(sf<span style="color:#f92672">.</span>split_selfies(selfies_str))
</span></span><span style="display:flex;"><span>print(tokens)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [&#39;[C]&#39;, &#39;[=C]&#39;, &#39;[C]&#39;, &#39;[=C]&#39;, &#39;[C]&#39;, &#39;[Branch1]&#39;, &#39;[C]&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#     &#39;[=O]&#39;, &#39;[O]&#39;, &#39;[=C]&#39;, &#39;[Ring1]&#39;, &#39;[=Branch1]&#39;]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Get the alphabet (unique token set) from a dataset</span>
</span></span><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;[C][C][O]&#34;</span>, <span style="color:#e6db74">&#34;[C][=C][C][=C][C][=C][Ring1][=Branch1]&#34;</span>]
</span></span><span style="display:flex;"><span>alphabet <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>get_alphabet_from_selfies(dataset)
</span></span><span style="display:flex;"><span>print(sorted(alphabet))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [&#39;[=Branch1]&#39;, &#39;[=C]&#39;, &#39;[C]&#39;, &#39;[O]&#39;, &#39;[Ring1]&#39;]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Convert to integer encoding for ML pipelines</span>
</span></span><span style="display:flex;"><span>encoding, _ <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>selfies_to_encoding(
</span></span><span style="display:flex;"><span>    selfies<span style="color:#f92672">=</span>selfies_str,
</span></span><span style="display:flex;"><span>    vocab_stoi<span style="color:#f92672">=</span>{s: i <span style="color:#66d9ef">for</span> i, s <span style="color:#f92672">in</span> enumerate(sorted(alphabet))},
</span></span><span style="display:flex;"><span>    pad_to_len<span style="color:#f92672">=</span><span style="color:#ae81ff">20</span>,
</span></span><span style="display:flex;"><span>    enc_type<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;label&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h3 id="customizing-valence-constraints">Customizing Valence Constraints</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># View current constraints</span>
</span></span><span style="display:flex;"><span>print(sf<span style="color:#f92672">.</span>get_semantic_constraints())
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Allow hypervalent sulfur (e.g., SF6)</span>
</span></span><span style="display:flex;"><span>sf<span style="color:#f92672">.</span>set_semantic_constraints(<span style="color:#e6db74">&#34;hypervalent&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Or define custom constraints</span>
</span></span><span style="display:flex;"><span>sf<span style="color:#f92672">.</span>set_semantic_constraints({
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;S&#34;</span>: <span style="color:#ae81ff">6</span>,  <span style="color:#75715e"># allow hexavalent sulfur</span>
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;P&#34;</span>: <span style="color:#ae81ff">5</span>,  <span style="color:#75715e"># allow pentavalent phosphorus</span>
</span></span><span style="display:flex;"><span>})
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Reset to defaults</span>
</span></span><span style="display:flex;"><span>sf<span style="color:#f92672">.</span>set_semantic_constraints(<span style="color:#e6db74">&#34;default&#34;</span>)
</span></span></code></pre></div><h2 id="selfies-in-machine-learning">SELFIES in Machine Learning</h2>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>SELFIES is particularly advantageous for generative models in computational chemistry. When used in a VAE, the entire continuous latent space decodes to valid molecules, unlike SMILES where large regions of the latent space are invalid. The <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original SELFIES paper</a> demonstrated this concretely: a VAE trained with SELFIES stored two orders of magnitude more diverse molecules than a SMILES-based VAE, and a GAN produced 78.9% diverse valid molecules compared to 18.6% for SMILES (Krenn et al., 2020).</p>
<p>Several generation approaches build directly on SELFIES:</p>
<ul>
<li><strong>Latent space optimization</strong>: <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a> uses a SELFIES-based VAE with gradient-based optimization to generate molecules with nanomolar binding affinities, achieving 6-8x speedup over RL baselines (Eckmann et al., 2022).</li>
<li><strong>Training-free generation</strong>: <a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a> demonstrates that simple character-level mutations in SELFIES (replacement, deletion, insertion) produce valid molecules by construction, eliminating the need for neural networks entirely. STONED achieved a GuacaMol score of 14.70, competitive with deep generative models (Nigam et al., 2021).</li>
<li><strong>Gradient-based dreaming</strong>: <a href="/notes/chemistry/molecular-design/generation/latent-space/deep-molecular-dreaming-pasithea/">PASITHEA</a> computes gradients with respect to one-hot encoded SELFIES inputs to steer molecules toward target property values. Because SELFIES&rsquo; surjective mapping guarantees every intermediate representation is a valid molecule, this continuous optimization over the input space is feasible. PASITHEA generated molecules with properties outside the training data range (logP up to 4.24 vs. a training max of 3.08), with 97.2% novelty (Shen et al., 2021).</li>
<li><strong>Large-scale pre-training</strong>: <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a> is a BART-based model pre-trained on 100M+ SELFIES molecules. It achieves 100% validity and an FCD of 0.0015 on MOSES (vs. 0.0061 for Chemformer), and introduces chemical feedback to align outputs with preference rankings (Fang et al., 2024).</li>
</ul>
<p>In benchmarks, SELFIES performs well for optimization-oriented tasks. In the <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> of 25 methods, SELFIES-REINVENT ranked 3rd and STONED ranked 5th. SELFIES-based genetic algorithms outperformed SMILES-based GAs, likely because SELFIES provides more intuitive mutation operations (Gao et al., 2022). The <a href="/notes/chemistry/molecular-design/generation/evaluation/tartarus-inverse-molecular-design/">Tartarus benchmark</a> corroborates this across more diverse real-world objectives (organic emitters, protein ligands, reaction substrates): SELFIES-VAE consistently outperforms SMILES-VAE, and the representation matters most where validity is a bottleneck (Nigam et al., 2022).</p>
<p>SELFIES mutations provide a simple but effective way to explore chemical space:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> random
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">mutate_selfies</span>(selfies_str, mutation_type<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;replace&#34;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Mutate a SELFIES string. Every output is a valid molecule.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    tokens <span style="color:#f92672">=</span> list(sf<span style="color:#f92672">.</span>split_selfies(selfies_str))
</span></span><span style="display:flex;"><span>    alphabet <span style="color:#f92672">=</span> list(sf<span style="color:#f92672">.</span>get_semantic_robust_alphabet())
</span></span><span style="display:flex;"><span>    idx <span style="color:#f92672">=</span> random<span style="color:#f92672">.</span>randint(<span style="color:#ae81ff">0</span>, len(tokens) <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> mutation_type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;replace&#34;</span>:
</span></span><span style="display:flex;"><span>        tokens[idx] <span style="color:#f92672">=</span> random<span style="color:#f92672">.</span>choice(alphabet)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> mutation_type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;insert&#34;</span>:
</span></span><span style="display:flex;"><span>        tokens<span style="color:#f92672">.</span>insert(idx, random<span style="color:#f92672">.</span>choice(alphabet))
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> mutation_type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;delete&#34;</span> <span style="color:#f92672">and</span> len(tokens) <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">1</span>:
</span></span><span style="display:flex;"><span>        tokens<span style="color:#f92672">.</span>pop(idx)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#34;&#34;</span><span style="color:#f92672">.</span>join(tokens)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Every mutation produces a valid molecule</span>
</span></span><span style="display:flex;"><span>original <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>encoder(<span style="color:#e6db74">&#34;c1ccccc1&#34;</span>)  <span style="color:#75715e"># benzene</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">5</span>):
</span></span><span style="display:flex;"><span>    mutant <span style="color:#f92672">=</span> mutate_selfies(original)
</span></span><span style="display:flex;"><span>    print(sf<span style="color:#f92672">.</span>decoder(mutant))  <span style="color:#75715e"># always valid</span>
</span></span></code></pre></div><h3 id="property-prediction-and-pretraining">Property Prediction and Pretraining</h3>
<p><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a> is a RoBERTa-based chemical language model pretrained on 2M ChEMBL compounds using SELFIES as input. Because every masked token prediction corresponds to a valid molecular fragment, the model never wastes capacity learning invalid chemistry. SELFormer outperformed <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a> by approximately 12% on average across BACE, BBBP, and HIV classification benchmarks (Yüksel et al., 2023). <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> also evaluated SELFIES as an input representation, finding comparable performance to SMILES on the Tox21 task (Chithrananda et al., 2020).</p>
<p>The <a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> demonstrated that SELFIES achieves ~100% validity vs. ~40% for SMILES in conditional molecular generation, while performing comparably for property prediction. This dual prediction-generation capability is enabled by interleaving numerical property tokens with SELFIES molecular tokens in a single sequence (Born &amp; Manica, 2023).</p>
<p>At larger scales, <a href="/notes/chemistry/molecular-representations/encoders/neural-scaling-of-deep-chemical-models/">ChemGPT</a> (up to 1B parameters) uses a GPT-Neo backbone with SELFIES tokenization for autoregressive molecular generation, demonstrating that SELFIES follows the same power-law neural scaling behavior observed in NLP (Frey et al., 2023).</p>
<h3 id="optical-chemical-structure-recognition">Optical Chemical Structure Recognition</h3>
<p>In image-to-text chemical structure recognition, <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan et al. (2022)</a> compared SMILES, DeepSMILES, SELFIES, and InChI as output formats using the same transformer architecture. SELFIES achieved 100% structural validity (every prediction could be decoded), while SMILES predictions occasionally contained syntax errors. The trade-off: SMILES achieved higher exact match accuracy (88.62%) partly because SELFIES strings are longer, producing more tokens for the decoder to predict.</p>
<h3 id="chemical-name-translation">Chemical Name Translation</h3>
<p><a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT</a> uses SELFIES as its internal representation for translating between chemical line notations and IUPAC names. All SMILES are converted to SELFIES before processing, and the model achieves a BLEU score of 0.94 for IUPAC-to-SELFIES translation and 0.98 Tanimoto similarity on valid outputs. The authors found SELFIES&rsquo; syntactic robustness particularly valuable for this sequence-to-sequence task, where the decoder must produce a chemically valid output string (Rajan et al., 2021).</p>
<h3 id="tokenization">Tokenization</h3>
<p>Converting SELFIES strings into tokens for neural models is more straightforward than SMILES tokenization. Each bracket-enclosed symbol (<code>[C]</code>, <code>[=C]</code>, <code>[Branch1]</code>) is a natural token boundary. <a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">Atom Pair Encoding (APE)</a> extends byte pair encoding with chemistry-aware constraints for both SMILES and SELFIES. For SELFIES specifically, APE preserves atomic identity during subword merging, and SELFIES models showed strong inter-tokenizer agreement: all true positives from SELFIES-BPE were captured by SELFIES-APE (Leon et al., 2024).</p>
<h2 id="limitations-and-trade-offs">Limitations and Trade-offs</h2>
<h3 id="validity-constraints-can-introduce-bias">Validity Constraints Can Introduce Bias</h3>
<p>The guarantee that every string decodes to a valid molecule is SELFIES&rsquo; core advantage, but recent work has shown this comes with trade-offs. <a href="/notes/chemistry/molecular-representations/notations/invalid-smiles-help/">Skinnider (2024)</a> demonstrated that SMILES-based models consistently outperform SELFIES-based models on distribution-learning tasks. The mechanism: invalid SMILES represent a model&rsquo;s least confident predictions, and filtering them out acts as implicit quality control. SELFIES models, by construction, cannot discard low-confidence outputs this way. Furthermore, SELFIES validity constraints introduce systematic structural biases, generating fewer aromatic rings and more aliphatic structures compared to training data. When SELFIES constraints were relaxed to allow invalid generation (&ldquo;unconstrained SELFIES&rdquo;), performance improved, providing causal evidence that the ability to generate and discard invalid outputs benefits distribution learning.</p>
<p>This finding reframes the SMILES vs. SELFIES choice as context-dependent. As Grisoni (2023) summarizes in a <a href="/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/">review of chemical language models</a>: &ldquo;SMILES offer a richer, more interpretable language with well-studied augmentation strategies, while SELFIES guarantee validity at the cost of chemical realism and edit interpretability.&rdquo;</p>
<p>The <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> provides further nuance: SELFIES-based variants of language model methods (REINVENT, LSTM HC, VAE) generally do not outperform their SMILES counterparts, because modern language models learn SMILES grammar well enough that syntactic invalidity is no longer a practical bottleneck. The exception is genetic algorithms, where SELFIES mutations are naturally well-suited.</p>
<p>A study on <a href="/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/">complex molecular distributions</a> paints a consistent picture: SELFIES-trained RNNs achieve better standard metrics (validity, uniqueness, novelty), while SMILES-trained RNNs achieve better distributional fidelity as measured by Wasserstein distance (Flam-Shepherd et al., 2022). Taken together, these findings suggest that SELFIES and SMILES have genuinely complementary strengths, and the best choice depends on whether the task prioritizes validity/novelty or distributional faithfulness.</p>
<h3 id="degenerate-outputs">Degenerate Outputs</h3>
<p>Although every SELFIES string decodes to a valid molecule, the decoded molecule may not always be chemically meaningful in context. The <a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> reported ~1.9% defective generations where the output molecule had fewer than 50% of the seed molecule&rsquo;s atoms (Born &amp; Manica, 2023). This highlights a distinction between syntactic validity (which SELFIES guarantees) and semantic appropriateness (which it does not).</p>
<h3 id="other-limitations">Other Limitations</h3>
<ul>
<li><strong>Indirect Canonicalization</strong>: A canonical SELFIES string is currently generated by first creating a canonical SMILES string and then converting it to SELFIES. Direct canonicalization is a goal for future development.</li>
<li><strong>String Length</strong>: SELFIES strings are generally longer than their corresponding SMILES strings, which can impact storage, processing times, and sequence modeling difficulty for very large datasets.</li>
<li><strong>Ongoing Standardization</strong>: While the library now supports most major features found in SMILES, work is ongoing to extend the format to more complex systems like polymers, crystals, and reactions.</li>
</ul>
<h2 id="variants-and-extensions">Variants and Extensions</h2>
<h3 id="group-selfies">Group SELFIES</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/">Group SELFIES</a> extends the representation with group tokens that represent functional groups or entire substructures (e.g., a benzene ring or carboxyl group) as single units. Each group token has labeled attachment points with specified valency, allowing the decoder to continue tracking available bonds. Group SELFIES maintains the validity guarantee while producing shorter, more human-readable strings. On MOSES VAE benchmarks, Group SELFIES achieved an FCD of 0.1787 versus 0.6351 for standard SELFIES, indicating substantially better distribution learning (Cheng et al., 2023).</p>
<h3 id="stoned-algorithms">STONED Algorithms</h3>
<p><a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a> (Superfast Traversal, Optimization, Novelty, Exploration and Discovery) is a suite of algorithms that exploit SELFIES&rsquo; validity guarantee for training-free molecular design through point mutations, interpolation, and optimization (Nigam et al., 2021). See <a href="#molecular-generation">Molecular Generation</a> above for benchmark results.</p>
<h2 id="recent-developments">Recent Developments</h2>
<p>The <a href="/notes/chemistry/molecular-representations/notations/selfies-2023/">2023 library update</a> replaced the original string-manipulation engine with a graph-based internal representation. This change resolved several long-standing limitations: the original approach could not handle aromatics (requiring kekulization), stereochemistry, or charged species. The graph-based engine now supports all of these, and processes 300K+ molecules in approximately 4 minutes in pure Python. The library has been validated on all 72 million molecules from PubChem.</p>
<p>Looking forward, researchers have outlined <a href="/notes/chemistry/molecular-representations/notations/selfies-2022/">16 future research directions</a> for extending robust representations to complex systems like polymers, crystals, and chemical reactions.</p>
<h2 id="further-reading">Further Reading</h2>
<ul>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/"><strong>Converting SELFIES Strings to 2D Molecular Images</strong></a>: Hands-on tutorial demonstrating SELFIES robustness and building visualization tools</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <a href="https://doi.org/10.1088/2632-2153/aba947"><em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024.</a></li>
<li>Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., &hellip; &amp; Aspuru-Guzik, A. (2022). SELFIES and the future of molecular string representations. <a href="https://doi.org/10.1016/j.patter.2022.100588"><em>Patterns</em>, <em>3</em>(10), 100588.</a></li>
<li>Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., &amp; Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. <a href="https://doi.org/10.1039/d3dd00044c"><em>Digital Discovery</em>, <em>2</em>, 897-908.</a></li>
<li>Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. <a href="https://doi.org/10.1038/s42256-024-00821-x"><em>Nature Machine Intelligence</em>, <em>6</em>, 437-448.</a></li>
<li>Shen, C., Krenn, M., Eppel, S., &amp; Aspuru-Guzik, A. (2021). Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations. <a href="https://doi.org/10.1088/2632-2153/ac09d6"><em>Machine Learning: Science and Technology</em>, <em>2</em>(3), 03LT02.</a></li>
<li>Fang, Y., et al. (2024). Domain-agnostic molecular generation with chemical feedback. <a href="https://openreview.net/forum?id=9rnerQyXlh"><em>ICLR 2024</em>.</a></li>
<li>Born, J., &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <a href="https://doi.org/10.1038/s42256-023-00639-z"><em>Nature Machine Intelligence</em>, <em>5</em>, 432-444.</a></li>
<li>Frey, N. C., Soklaski, R., Axelrod, S., Samsi, S., Gómez-Bombarelli, R., Coley, C. W., &amp; Gadepally, V. (2023). Neural scaling of deep chemical models. <a href="https://doi.org/10.1038/s42256-023-00740-3"><em>Nature Machine Intelligence</em>, <em>5</em>, 1297-1305.</a></li>
<li>Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. <a href="https://doi.org/10.1186/s13321-021-00512-4"><em>Journal of Cheminformatics</em>, <em>13</em>, 34.</a></li>
<li>Nigam, A., Pollice, R., &amp; Aspuru-Guzik, A. (2022). Tartarus: A benchmarking platform for realistic and practical inverse molecular design. <a href="https://openreview.net/forum?id=sLFDE2MHzHO"><em>NeurIPS 2022 Datasets and Benchmarks</em>.</a></li>
<li><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>The Number of Isomeric Hydrocarbons of the Methane Series</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/number-of-isomeric-hydrocarbons/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/number-of-isomeric-hydrocarbons/</guid><description>Henze and Blair's 1931 JACS paper deriving exact recursive formulas for counting constitutional alkane isomers.</description><content:encoded><![CDATA[<h2 id="a-theoretical-foundation-for-mathematical-chemistry">A Theoretical Foundation for Mathematical Chemistry</h2>
<p>This is a foundational <strong>theoretical paper</strong> in mathematical chemistry and chemical graph theory. It derives <strong>exact mathematical laws</strong> governing molecular topology. The paper also serves as a <strong>benchmark resource</strong>, establishing the first systematic isomer counts that corrected historical errors and whose recursive method remains the basis for modern molecular enumeration.</p>
<h2 id="historical-motivation-and-the-failure-of-centric-trees">Historical Motivation and the Failure of Centric Trees</h2>
<p>The primary motivation was the lack of a rigorous mathematical relationship between carbon content ($N$) and isomer count.</p>
<ul>
<li><strong>Previous failures</strong>: Earlier attempts by <a href="https://doi.org/10.1002/cber.187500801227">Cayley (1875)</a> (as cited by Henze and Blair, referring to the Berichte der deutschen chemischen Gesellschaft summary) and <a href="https://doi.org/10.1002/cber.187500802191">Schiff (1875)</a> used &ldquo;centric&rdquo; and &ldquo;bicentric&rdquo; symmetry tree methods that broke down as carbon content increased, producing incorrect counts as early as $N = 12$. Subsequent efforts by Tiemann (1893), Delannoy (1894), Losanitsch (1897), Goldberg (1898), and Trautz (1924), as cited in the paper, each improved on specific aspects but none achieved general accuracy beyond moderate carbon content.</li>
<li><strong>The theoretical gap</strong>: All prior formulas depended on exhaustively identifying centers of symmetry, meaning they required additional correction terms for each increase in $N$ and could not reliably predict counts for larger molecules like $C_{40}$.</li>
</ul>
<p>This work aimed to develop a theoretically sound, generalizable method that could be extended to any number of carbons.</p>
<h2 id="core-innovation-recursive-enumeration-of-graphs">Core Innovation: Recursive Enumeration of Graphs</h2>
<p>The core novelty is the proof that the count of hydrocarbons with $N$ carbons is a recursive function of the counts of alkyl radicals of at most $\lceil N/2 \rceil$ carbons. The authors rely on a preliminary enumeration of the isomeric alcohols (the methanol series), which correspond one-to-one with alkyl radicals, to make the hydrocarbon enumeration possible. By defining $T_k$ as the exact number of isomeric alkyl radicals containing exactly $k$ carbon atoms, graph enumeration becomes a mathematical recurrence.</p>
<p>To prevent double-counting when structurally identical branches connect to a central carbon, Henze and Blair applied combinations with repetition. Because the branches attached to a carbon are topologically unordered, choosing $x$ branches of identical size $k$ from the $T_k$ possible structures yields:</p>
<p>$$ \binom{T_k + x - 1}{x} $$</p>
<p>For example, if a Group B central carbon bears three branches of the same size $k$, the number of distinct structures for that topological partition is:</p>
<p>$$ \frac{T_k (T_k + 1)(T_k + 2)}{6} $$</p>
<p>Summing these constrained combinatorial partitions across all valid branch sizes (governed by the Even/Odd bisection rules) yields the exact isomer count for $N$ without overestimating due to symmetric permutations.</p>
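<p>The closed form above is the standard stars-and-bars count, easy to sanity-check with Python&rsquo;s <code>math.comb</code> (the helper name <code>branch_combinations</code> is illustrative, not the paper&rsquo;s notation):</p>

```python
from math import comb

def branch_combinations(t_k: int, x: int) -> int:
    """Ways to attach x unordered branches, each chosen (with repetition)
    from t_k structurally distinct radicals of the same size k."""
    return comb(t_k + x - 1, x)

# For x = 3 this reproduces the cubic closed form T_k (T_k + 1)(T_k + 2) / 6
for t_k in range(1, 10):
    assert branch_combinations(t_k, 3) == t_k * (t_k + 1) * (t_k + 2) // 6

# With T_3 = 2 propyl radicals (n-propyl, isopropyl), three size-3 branches
# can be attached in C(4, 3) = 4 distinct ways
print(branch_combinations(2, 3))  # 4
```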
<p><strong>The Symmetry Constraints</strong>: The paper rigorously divides the problem space to prevent double-counting:</p>
<ul>
<li><strong>Group A (Centrosymmetric)</strong>: Hydrocarbons that can be bisected into two smaller alkyl radicals.
<ul>
<li><em>Even $N$</em>: Split into two radicals of size $N/2$.</li>
<li><em>Odd $N$</em>: Split into sizes $(N+1)/2$ and $(N-1)/2$.</li>
</ul>
</li>
<li><strong>Group B (Asymmetric)</strong>: Hydrocarbons whose graphic formula cannot be symmetrically bisected. They contain exactly one central carbon atom attached to 3 or 4 branches. To prevent double-counting, Henze and Blair established strict maximum branch sizes:
<ul>
<li><em>Even $N$</em>: No branch can be larger than $(N/2 - 1)$ carbons.</li>
<li><em>Odd $N$</em>: No branch can be larger than $(N-3)/2$ carbons.</li>
<li><em>The Combinatorial Partitioning</em>: They further subdivided these 3-branch and 4-branch molecules into distinct mathematical cases based on whether the branches were structurally identical or unique, applying distinct combinatorial formulas to each scenario.</li>
</ul>
</li>
</ul>
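<p>The even/odd case analysis above is mechanical enough to capture in a few lines. Below is a sketch of that bookkeeping (the helper name and dictionary layout are illustrative, not the paper&rsquo;s own notation):</p>

```python
def bisection_rules(n: int) -> dict:
    """Group A split sizes and Group B maximum branch size for an alkane
    with n carbons, following Henze and Blair's even/odd case analysis."""
    if n % 2 == 0:
        # Even N: bisect into two N/2 radicals; Group B branches capped at N/2 - 1
        return {"group_a_split": (n // 2, n // 2),
                "group_b_max_branch": n // 2 - 1}
    # Odd N: bisect into (N+1)/2 and (N-1)/2; Group B branches capped at (N-3)/2
    return {"group_a_split": ((n + 1) // 2, (n - 1) // 2),
            "group_b_max_branch": (n - 3) // 2}

print(bisection_rules(6))  # hexane: Group A splits into two C3 radicals
```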
<figure class="post-figure center ">
    <img src="/img/notes/hexane-and-its-six-isomers-by-even-and-odd-decomposition.webp"
         alt="The five structural isomers of hexane classified into Group A and Group B based on their decomposition"
         title="The five structural isomers of hexane classified into Group A and Group B based on their decomposition"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The five isomers of hexane ($C_6$) classified by Henze and Blair&rsquo;s symmetry scheme. Group A molecules (top row) can be bisected along a bond (highlighted in red) into two $C_3$ alkyl radicals. Group B molecules (bottom row) have a central carbon atom (red circle) with 3-4 branches, preventing symmetric bisection.</figcaption>
    
</figure>

<p>This classification is the key insight that enables the recursive formulas. By exhaustively partitioning hydrocarbons into these mutually exclusive groups, the authors could derive separate combinatorial expressions for each and sum them without double-counting.</p>
<p>For each structural class, combinatorial formulas are derived that depend on the number of isomeric alcohols ($T_k$) where $k &lt; N$. This transforms the problem of counting large molecular graphs into a recurrence relation based on the counts of smaller, simpler sub-graphs.</p>
<h2 id="validation-via-exhaustive-hand-enumeration">Validation via Exhaustive Hand-Enumeration</h2>
<p>The experiments were computational and enumerative:</p>
<ol>
<li><strong>Derivation of the recursion formulas</strong>: The main effort was the mathematical derivation of the set of equations for each structural class of hydrocarbon.</li>
<li><strong>Calculation</strong>: They applied their formulas to calculate the number of isomers for alkanes up to $N=40$, reaching over $6.2 \times 10^{13}$ isomers. This was far beyond what was previously possible.</li>
<li><strong>Validation by exhaustive enumeration</strong>: To prove the correctness of their theory, the authors manually drew and counted all possible structural formulas for the undecanes ($C_{11}$), dodecanes ($C_{12}$), tridecanes ($C_{13}$), and tetradecanes ($C_{14}$). This brute-force check confirmed their calculated numbers and corrected long-standing errors in the literature.
<ul>
<li><em>Key correction</em>: The manual enumeration proved that the count for tetradecane ($C_{14}$) is <strong>1,858</strong>, correcting erroneous values previously published by <a href="https://doi.org/10.1002/cber.189703002144" title="Die Isomerie-Arten bei den Homologen der Paraffin-Reihe">Losanitsch (1897)</a>, whose results for $C_{12}$ and $C_{14}$ the paper identifies as incorrect.</li>
</ul>
</li>
</ol>
<h2 id="benchmark-outcomes-and-scaling-limits">Benchmark Outcomes and Scaling Limits</h2>
<ul>
<li><strong>The Constitutional Limit</strong>: The paper establishes the mathematical ground truth for organic molecular graphs by strictly counting <em>constitutional</em> (structural) isomers. The derivation completely excludes 3D stereoisomerism (enantiomers and diastereomers). For modern geometric deep learning applications (e.g., generating 3D conformers), Henze and Blair&rsquo;s scaling sequence serves as a lower bound, representing a severe underestimation of the true number of spatial configurations feasible within chemical space.</li>
<li><strong>Theoretical outcome</strong>: The paper proves that the problem&rsquo;s inherent complexity requires a recursive approach.</li>
<li><strong>Benchmark resource</strong>: The authors published a table of isomer counts up to $C_{40}$ (Table II), correcting historical errors and establishing the first systematic enumeration across this range. Later computational verification revealed that the paper&rsquo;s hand-calculated values are exact through at least $C_{14}$ (confirmed by exhaustive enumeration) but accumulate minor arithmetic errors beyond that range (e.g., at $C_{40}$). The recursive method itself is exact and remains the basis for the accepted values in <a href="https://oeis.org/A000602">OEIS A000602</a>.</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/notes/number-of-isomeric-hydrocarbons-of-the-methane-series.webp"
         alt="Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40"
         title="Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The number of structural isomers grows exponentially with carbon content, reaching over 62 trillion for C₄₀. This plot, derived from Henze and Blair&rsquo;s Table II, illustrates the combinatorial explosion that makes direct enumeration intractable for larger molecules.</figcaption>
    
</figure>

<p>The plot above illustrates the staggering growth rate. Methane ($C_1$) through propane ($C_3$) each have exactly one isomer. Beyond this, the count accelerates rapidly: 75 isomers at $C_{10}$, nearly 37 million at $C_{25}$, and over 4 billion at $C_{30}$. By $C_{40}$, the count exceeds $6.2 \times 10^{13}$ (the paper&rsquo;s hand-calculated Table II reports 62,491,178,805,831, while the modern OEIS-verified value is 62,481,801,147,341). This exponential scaling demonstrates why brute-force enumeration quickly becomes intractable and why the recursive approach was essential.</p>
<ul>
<li><strong>Foundational impact</strong>: This work established the mathematical framework that would later evolve into modern chemical graph theory and computational chemistry approaches for molecular enumeration. In the context of AI for molecular generation, this is an early form of <strong>expressivity analysis</strong>, defining the size of the chemical space that generative models must learn to cover.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li>
<p><strong>Algorithms</strong>: The exact mathematical recursive formulas and combinatorial partitioning logic are fully provided in the text, allowing for programmatic implementation.</p>
</li>
<li>
<p><strong>Evaluation</strong>: The authors validated their recursive formulas by exhaustive hand-enumeration (brute-force drawing of structural formulas) up to $C_{14}$ to establish correctness.</p>
</li>
<li>
<p><strong>Data</strong>: The paper&rsquo;s Table II provides isomer counts up to $C_{40}$. These hand-calculated values are exact through at least $C_{14}$ (validated by exhaustive enumeration) but accumulate minor arithmetic errors beyond that range. The corrected integer sequence is maintained in the On-Line Encyclopedia of Integer Sequences (OEIS) as <a href="https://oeis.org/A000602">A000602</a>.</p>
</li>
<li>
<p><strong>Code</strong>: The OEIS page provides Mathematica and Maple implementations. The following pure Python implementation uses the OEIS generating functions (which formalize Henze and Blair&rsquo;s recursive method) to compute the corrected isomer counts up to any arbitrary $N$:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">compute_alkane_isomers</span>(max_n: int) <span style="color:#f92672">-&gt;</span> list[int]:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the number of alkane structural isomers C_nH_{2n+2}
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    up to max_n using the generating functions from OEIS A000602.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> max_n <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>: <span style="color:#66d9ef">return</span> [<span style="color:#ae81ff">1</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Helper: multiply two polynomials (cap at degree max_n)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">poly_mul</span>(a: list[int], b: list[int]) <span style="color:#f92672">-&gt;</span> list[int]:
</span></span><span style="display:flex;"><span>        res <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> (max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i, v_a <span style="color:#f92672">in</span> enumerate(a):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> j, v_b <span style="color:#f92672">in</span> enumerate(b):
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> i <span style="color:#f92672">+</span> j <span style="color:#f92672">&lt;=</span> max_n: res[i <span style="color:#f92672">+</span> j] <span style="color:#f92672">+=</span> v_a <span style="color:#f92672">*</span> v_b
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">else</span>: <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> res
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Helper: evaluate P(x^k) by spacing out terms</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">poly_pow</span>(a: list[int], k: int) <span style="color:#f92672">-&gt;</span> list[int]:
</span></span><span style="display:flex;"><span>        res <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> (max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i, v <span style="color:#f92672">in</span> enumerate(a):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> i <span style="color:#f92672">*</span> k <span style="color:#f92672">&lt;=</span> max_n: res[i <span style="color:#f92672">*</span> k] <span style="color:#f92672">=</span> v
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">else</span>: <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> res
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># T represents the alkyl radicals (OEIS A000598), T[0] = 1</span>
</span></span><span style="display:flex;"><span>    T <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> (max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    T[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Iteratively build coefficients of T</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># We only need to compute the (n-1)-th degree terms at step n</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>):
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># T(x^3) term at degree n-1 (nonzero only when 3 divides n-1)</span>
</span></span><span style="display:flex;"><span>        t3_term <span style="color:#f92672">=</span> T[(n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">//</span> <span style="color:#ae81ff">3</span>] <span style="color:#66d9ef">if</span> (n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">%</span> <span style="color:#ae81ff">3</span> <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># T(x)^3 and T(x)*T(x^2) coefficients at degree n-1</span>
</span></span><span style="display:flex;"><span>        t_cubed_n_1 <span style="color:#f92672">=</span> sum(
</span></span><span style="display:flex;"><span>            T[i] <span style="color:#f92672">*</span> T[j] <span style="color:#f92672">*</span> T[n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> i <span style="color:#f92672">-</span> j]
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> j <span style="color:#f92672">in</span> range(n <span style="color:#f92672">-</span> i)
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># T(x) * T(x^2) term up to n-1</span>
</span></span><span style="display:flex;"><span>        t_t2_n_1 <span style="color:#f92672">=</span> sum(
</span></span><span style="display:flex;"><span>            T[i] <span style="color:#f92672">*</span> T[j]
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> j <span style="color:#f92672">in</span> range((n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> i) <span style="color:#f92672">//</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> i <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>j <span style="color:#f92672">==</span> n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        T[n] <span style="color:#f92672">=</span> (t_cubed_n_1 <span style="color:#f92672">+</span> <span style="color:#ae81ff">3</span> <span style="color:#f92672">*</span> t_t2_n_1 <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">*</span> t3_term) <span style="color:#f92672">//</span> <span style="color:#ae81ff">6</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Calculate alkanes (OEIS A000602) from fully populated T.</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># T2/T3/T4 are substitutions T(x^2)/T(x^3)/T(x^4); T_squared etc. are powers of T(x).</span>
</span></span><span style="display:flex;"><span>    T2 <span style="color:#f92672">=</span> poly_pow(T, <span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>    T3 <span style="color:#f92672">=</span> poly_pow(T, <span style="color:#ae81ff">3</span>)
</span></span><span style="display:flex;"><span>    T4 <span style="color:#f92672">=</span> poly_pow(T, <span style="color:#ae81ff">4</span>)
</span></span><span style="display:flex;"><span>    T_squared <span style="color:#f92672">=</span> poly_mul(T, T)
</span></span><span style="display:flex;"><span>    T_cubed <span style="color:#f92672">=</span> poly_mul(T_squared, T)
</span></span><span style="display:flex;"><span>    T_fourth <span style="color:#f92672">=</span> poly_mul(T_cubed, T)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    term2 <span style="color:#f92672">=</span> [(T_squared[i] <span style="color:#f92672">-</span> T2[i]) <span style="color:#f92672">//</span> <span style="color:#ae81ff">2</span> <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    term3_inner <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>        T_fourth[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">6</span> <span style="color:#f92672">*</span> poly_mul(T_squared, T2)[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">8</span> <span style="color:#f92672">*</span> poly_mul(T, T3)[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">3</span> <span style="color:#f92672">*</span> poly_mul(T2, T2)[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">6</span> <span style="color:#f92672">*</span> T4[i]
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    alkanes <span style="color:#f92672">=</span> [<span style="color:#ae81ff">1</span>] <span style="color:#f92672">+</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> max_n
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>):
</span></span><span style="display:flex;"><span>        alkanes[n] <span style="color:#f92672">=</span> T[n] <span style="color:#f92672">-</span> term2[n] <span style="color:#f92672">+</span> term3_inner[n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>] <span style="color:#f92672">//</span> <span style="color:#ae81ff">24</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> alkanes
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate and verify</span>
</span></span><span style="display:flex;"><span>isomers <span style="color:#f92672">=</span> compute_alkane_isomers(<span style="color:#ae81ff">40</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;C_14 isomers: </span><span style="color:#e6db74">{</span>isomers[<span style="color:#ae81ff">14</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)   <span style="color:#75715e"># Output: 1858</span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;C_40 isomers: </span><span style="color:#e6db74">{</span>isomers[<span style="color:#ae81ff">40</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)   <span style="color:#75715e"># Output: 62481801147341</span>
</span></span></code></pre></div></li>
<li>
<p><strong>Hardware</strong>: Derived analytically and enumerated manually by the authors in 1931 without computational hardware.</p>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Henze, H. R., &amp; Blair, C. M. (1931). The number of isomeric hydrocarbons of the methane series. <em>Journal of the American Chemical Society</em>, 53(8), 3077-3085. <a href="https://doi.org/10.1021/ja01359a034">https://doi.org/10.1021/ja01359a034</a></p>
<p><strong>Publication</strong>: Journal of the American Chemical Society (JACS) 1931</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{henze1931number,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The number of isomeric hydrocarbons of the methane series}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Henze, Henry R and Blair, Charles M}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{53}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3077--3085}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1931}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES: A Compact Notation for Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles/</guid><description>SMILES (Simplified Molecular Input Line Entry System) represents chemical structures using compact ASCII strings.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>SMILES (Simplified Molecular Input Line Entry System), originally developed by David Weininger in the late 1980s, is a one-dimensional string format for representing chemical structures. It linearizes the molecular graph by performing a depth-first traversal, recording the atoms and bonds along the way.</p>
<p>For example, the simple molecule ethanol ($\text{C}_2\text{H}_6\text{O}$) can be represented as <code>CCO</code>, while the more complex caffeine molecule becomes <code>CN1C=NC2=C1C(=O)N(C(=O)N2C)C</code>.</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Human-readable</strong>: Designed primarily for human readability. Compare with <a href="/notes/chemistry/molecular-representations/notations/inchi/">InChI</a>, a hierarchical representation optimized for machine parsing.</li>
<li><strong>Compact</strong>: More compact than other representations (3D coordinates, connectivity tables)</li>
<li><strong>Simple syntax</strong>: A language with simple syntax and structure, making it relatively easy to learn and use for chemists and researchers</li>
<li><strong>Flexible</strong>: Both linear and cyclic structures can be represented in many different valid ways</li>
</ul>
<p>For a hands-on tutorial on visualizing SMILES strings as 2D molecular images, see <a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES Strings to 2D Molecular Images</a>.</p>
<h2 id="basic-syntax">Basic Syntax</h2>
<h3 id="atomic-symbols">Atomic Symbols</h3>
<p>SMILES uses standard atomic symbols with implied hydrogen atoms:</p>
<ul>
<li><code>C</code> (methane, $\text{CH}_4$)</li>
<li><code>N</code> (ammonia, $\text{NH}_3$)</li>
<li><code>O</code> (water, $\text{H}_2\text{O}$)</li>
<li><code>P</code> (phosphine, $\text{PH}_3$)</li>
<li><code>S</code> (hydrogen sulfide, $\text{H}_2\text{S}$)</li>
<li><code>Cl</code> (hydrogen chloride, $\text{HCl}$)</li>
</ul>
<p><strong>Bracket notation</strong>: Elements outside the organic subset must be shown in brackets, e.g., <code>[Pt]</code> for elemental platinum. The organic subset (<code>B</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>P</code>, <code>S</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, and <code>I</code>) can omit brackets.</p>
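<p>These valence rules are what RDKit (used throughout the examples below) applies when it infers hydrogen counts; a small sanity check, illustrative rather than part of any standard workflow:</p>

```python
from rdkit import Chem

# Hydrogen counts for the organic subset are implied by normal valence
for smi in ["C", "N", "O", "P", "S", "Cl"]:
    atom = Chem.MolFromSmiles(smi).GetAtomWithIdx(0)
    print(f"{smi}: {atom.GetTotalNumHs()} implicit hydrogens")
# C: 4, N: 3, O: 2, P: 3, S: 2, Cl: 1

# Bracket atoms carry no implicit hydrogens unless written explicitly
pt = Chem.MolFromSmiles("[Pt]").GetAtomWithIdx(0)
print(pt.GetTotalNumHs())  # 0
```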
<h3 id="bond-representation">Bond Representation</h3>
<p>Bonds are represented by symbols:</p>
<ul>
<li><strong>Single bond</strong>: <code>-</code> (usually omitted)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/ethane.webp"
         alt="Ethane"
         title="Ethane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ethane ($\text{C}_2\text{H}_6$), SMILES: <code>CC</code></figcaption>
    
</figure>

<ul>
<li><strong>Double bond</strong>: <code>=</code></li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/methyl_isocyanate.webp"
         alt="Methyl Isocyanate"
         title="Methyl Isocyanate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Methyl Isocyanate ($\text{C}_2\text{H}_3\text{NO}$), SMILES: <code>CN=C=O</code></figcaption>
    
</figure>

<ul>
<li><strong>Triple bond</strong>: <code>#</code></li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/hydrogen_cyanide.webp"
         alt="Hydrogen Cyanide"
         title="Hydrogen Cyanide"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Hydrogen Cyanide (HCN), SMILES: <code>C#N</code></figcaption>
    
</figure>

<ul>
<li><strong>Aromatic bond</strong>: <code>:</code> (usually omitted when lowercase atom symbols indicate aromaticity)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/vanillin.webp"
         alt="Vanillin"
         title="Vanillin"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Vanillin ($\text{C}_8\text{H}_8\text{O}_3$), SMILES: <code>O=Cc1ccc(O)c(OC)c1</code></figcaption>
    
</figure>

<ul>
<li><strong>Disconnected structures</strong>: <code>.</code> (separates disconnected components such as salts and ionic compounds)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/copper_II_sulfate.webp"
         alt="Copper(II) Sulfate"
         title="Copper(II) Sulfate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Copper(II) Sulfate ($\text{CuSO}_4$), SMILES: <code>[Cu+2].[O-]S(=O)(=O)[O-]</code></figcaption>
    
</figure>

<h3 id="structural-features">Structural Features</h3>
<ul>
<li><strong>Branches</strong>: Enclosed in parentheses and can be nested. For example, <code>CC(C)C(=O)O</code> represents isobutyric acid, where <code>(C)</code> and <code>(=O)</code> are branches off the main chain.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/3-propyl-4-isopropyl-1-heptene.webp"
         alt="3-Propyl-4-isopropyl-1-heptene"
         title="3-Propyl-4-isopropyl-1-heptene"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">3-Propyl-4-isopropyl-1-heptene ($\text{C}_{13}\text{H}_{26}$), SMILES: <code>C=CC(CCC)C(C(C)C)CCC</code></figcaption>
    
</figure>

<ul>
<li><strong>Cyclic structures</strong>: Written by breaking bonds and using numbers to indicate bond connections. For example, <code>C1CCCCC1</code> represents cyclohexane (the <code>1</code> connects the first and last carbon).</li>
<li><strong>Aromaticity</strong>: Lower case letters are used for atoms in aromatic rings. For example, benzene is written as <code>c1ccccc1</code>.</li>
<li><strong>Formal charges</strong>: Indicated by placing the charge in brackets after the atom symbol, e.g., <code>[C+]</code>, <code>[C-]</code>, or <code>[C-2]</code></li>
</ul>
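<p>RDKit makes these equivalences easy to verify. The sketch below also uses the standard <code>%nn</code> syntax for two-digit ring-closure labels, which the text above does not cover:</p>

```python
from rdkit import Chem

# Kekulé and aromatic spellings of benzene canonicalize to the same string
print(Chem.CanonSmiles("C1=CC=CC=C1"))  # c1ccccc1
print(Chem.CanonSmiles("c1ccccc1"))     # c1ccccc1

# Ring-closure digits are labels, not counts: %10 is a two-digit label
mol = Chem.MolFromSmiles("C%10CCCCC%10")  # also cyclohexane
print(Chem.MolToSmiles(mol))              # C1CCCCC1
```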
<h2 id="stereochemistry-and-isomers">Stereochemistry and Isomers</h2>
<h3 id="isotope-notation">Isotope Notation</h3>
<p>Isotope notation specifies the exact isotope of an element and comes before the element within square brackets, e.g., <code>[13C]</code> for carbon-13.</p>
<h3 id="double-bond-stereochemistry">Double Bond Stereochemistry</h3>
<p>Directional bonds can be specified using <code>\</code> and <code>/</code> symbols to indicate the stereochemistry of double bonds:</p>
<ul>
<li><code>C/C=C/C</code> represents (E)-2-butene (trans configuration)</li>
<li><code>C/C=C\C</code> represents (Z)-2-butene (cis configuration)</li>
</ul>
<p>The direction of the slashes indicates which side of the double bond each substituent is on: slashes in the same direction (<code>/C=C/</code>) place the flanking atoms on opposite sides (trans), while opposing slashes (<code>/C=C\</code>) place them on the same side (cis).</p>
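<p>RDKit records the perceived configuration on the double bond itself; a quick illustrative check:</p>

```python
from rdkit import Chem

# The stereo flag lives on the double bond after parsing
for smi in ["C/C=C/C", "C/C=C\\C"]:
    bond = Chem.MolFromSmiles(smi).GetBondBetweenAtoms(1, 2)
    print(smi, bond.GetStereo())
# C/C=C/C  STEREOE  (trans)
# C/C=C\C  STEREOZ  (cis)
```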
<h3 id="tetrahedral-chirality">Tetrahedral Chirality</h3>
<p>Chirality around tetrahedral centers uses <code>@</code> and <code>@@</code> symbols:</p>
<ul>
<li><code>N[C@](C)(F)C(=O)O</code> and <code>N[C@@](F)(C)C(=O)O</code> encode the same stereocenter: swapping two neighbors while flipping the chirality symbol preserves the configuration</li>
<li><code>@</code> lists the remaining neighbors anti-clockwise as viewed from the first neighbor; <code>@@</code> lists them clockwise</li>
<li><code>@</code> and <code>@@</code> are shorthand for <code>@TH1</code> and <code>@TH2</code>, respectively</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/glucose.webp"
         alt="Glucose"
         title="Glucose"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Glucose ($\text{C}_6\text{H}_{12}\text{O}_6$), SMILES: <code>OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)1</code></figcaption>
    
</figure>
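<p>RDKit can confirm the equivalence of the two bracket spellings and report CIP (R/S) labels for each chiral center; a minimal sketch:</p>

```python
from rdkit import Chem

# Swapping two neighbors and flipping @ <-> @@ leaves the center unchanged
a = Chem.CanonSmiles("N[C@](C)(F)C(=O)O")
b = Chem.CanonSmiles("N[C@@](F)(C)C(=O)O")
print(a == b)  # True: the same molecule

# (atom index, CIP label) pairs for each chiral center
mol = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")  # L-alanine
print(Chem.FindMolChiralCenters(mol))  # [(1, 'S')]
```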

<h3 id="advanced-stereochemistry">Advanced Stereochemistry</h3>
<p>More general notation for other stereocenters:</p>
<ul>
<li><code>@AL1</code>, <code>@AL2</code> for allene-type stereocenters</li>
<li><code>@SP1</code>, <code>@SP2</code>, <code>@SP3</code> for square-planar stereocenters</li>
<li><code>@TB1</code>&hellip;<code>@TB20</code> for trigonal bipyramidal stereocenters</li>
<li><code>@OH1</code>&hellip;<code>@OH30</code> for octahedral stereocenters</li>
</ul>
<p>SMILES allows partial specification since it relies on local chirality.</p>
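<p>Partial specification shows up directly when enumerating stereocenters. In the illustrative example below (an arbitrary molecule, not from the original text), one center is written with <code>@</code> and the other is left unspecified:</p>

```python
from rdkit import Chem

# One center set explicitly, one left unspecified
mol = Chem.MolFromSmiles("C[C@H](O)C(N)CC")
centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
print(centers)  # the specified center gets 'R' or 'S'; the other shows '?'
```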
<h2 id="smiles-in-machine-learning">SMILES in Machine Learning</h2>
<p>Beyond its original role as a compact notation, SMILES has become the dominant molecular input format for deep learning in chemistry. Its adoption has revealed both strengths and challenges specific to neural architectures.</p>
<h3 id="canonical-vs-randomized-smiles">Canonical vs. Randomized SMILES</h3>
<p>Canonical SMILES algorithms produce a single unique string per molecule, which is valuable for database deduplication. In generative modeling, however, canonical representations introduce training bias: the canonicalization algorithm constrains how the molecular graph is traversed (e.g., prioritizing sidechains over ring atoms), forcing models to learn both valid SMILES syntax and the specific ordering rules. Structurally similar molecules can have substantially different canonical strings, making complex topologies harder to sample.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">Randomized SMILES</a> address this by generating non-unique representations through random atom orderings. Training RNN-based generative models on randomized SMILES acts as data augmentation, improving chemical space coverage, sampling uniformity, and completeness compared to canonical SMILES (Arus-Pous et al., 2019). In one benchmark, randomized SMILES recovered significantly more of <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> chemical space than canonical SMILES across all training set sizes.</p>
<p>RDKit makes it straightforward to enumerate randomized SMILES for a given molecule:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;c1ccc(C(=O)O)cc1&#34;</span>)  <span style="color:#75715e"># benzoic acid</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Canonical form (deterministic)</span>
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; O=C(O)c1ccccc1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Randomized forms (different each call)</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">5</span>):
</span></span><span style="display:flex;"><span>    print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol, doRandom<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; OC(=O)c1ccccc1</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; O=C(c1ccccc1)O</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; OC(c1ccccc1)=O</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C(O)(c1ccccc1)=O</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; c1c(C(=O)O)cccc1</span>
</span></span></code></pre></div><p>Each of these strings encodes the same molecule but presents a different traversal of the molecular graph, giving a generative model more diverse training signal per molecule.</p>
<h3 id="validity-and-the-role-of-invalid-smiles">Validity and the Role of Invalid SMILES</h3>
<p>A large fraction of SMILES strings generated by neural models are syntactically or semantically invalid. Early efforts aimed to eliminate invalid outputs entirely, either through constrained representations like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (which guarantee 100% validity) or modified syntax like <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> (which removes paired syntax; see <a href="#deepsmiles">Variants</a> below for syntax details).</p>
<p>More recent work has complicated this picture. <a href="/notes/chemistry/molecular-representations/notations/invalid-smiles-help/">Skinnider (2024)</a> demonstrated that invalid SMILES generation actually benefits chemical language models. Invalid strings tend to be low-likelihood samples from the model&rsquo;s probability distribution. Filtering them out is equivalent to removing the model&rsquo;s least confident predictions, acting as implicit quality control. Meanwhile, enforcing absolute validity (as SELFIES does) can introduce systematic structural biases that impair distribution learning. This reframes SMILES&rsquo; non-robustness as potentially advantageous in certain ML contexts.</p>
<h3 id="tokenization-challenges">Tokenization Challenges</h3>
<p>Converting SMILES strings into token sequences for neural models is non-trivial. The two baseline approaches illustrate the problem using chloramphenicol (<code>O=C(NC([C@@H](O)c1ccc([N+](=O)[O-])cc1)CO)C(Cl)Cl</code>):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>smiles <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;O=C(NC([C@@H](O)c1ccc([N+](=O)[O-])cc1)CO)C(Cl)Cl&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Character-level: splits every character individually</span>
</span></span><span style="display:flex;"><span>char_tokens <span style="color:#f92672">=</span> list(smiles)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># [&#39;O&#39;, &#39;=&#39;, &#39;C&#39;, &#39;(&#39;, &#39;N&#39;, &#39;C&#39;, &#39;(&#39;, &#39;[&#39;, &#39;C&#39;, &#39;@&#39;, &#39;@&#39;, &#39;H&#39;, &#39;]&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;(&#39;, &#39;O&#39;, &#39;)&#39;, &#39;c&#39;, &#39;1&#39;, &#39;c&#39;, &#39;c&#39;, &#39;c&#39;, &#39;(&#39;, &#39;[&#39;, &#39;N&#39;, &#39;+&#39;, &#39;]&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;(&#39;, &#39;=&#39;, &#39;O&#39;, &#39;)&#39;, &#39;[&#39;, &#39;O&#39;, &#39;-&#39;, &#39;]&#39;, &#39;)&#39;, &#39;c&#39;, &#39;c&#39;, &#39;1&#39;, &#39;)&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;C&#39;, &#39;O&#39;, &#39;)&#39;, &#39;C&#39;, &#39;(&#39;, &#39;C&#39;, &#39;l&#39;, &#39;)&#39;, &#39;C&#39;, &#39;l&#39;]</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; 49 tokens</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Atom-level: regex groups brackets, two-char elements, and bond symbols</span>
</span></span><span style="display:flex;"><span>atom_pattern <span style="color:#f92672">=</span> (
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\\</span><span style="color:#e6db74">|\/|:|~|@|\?|&gt;&gt;?|\*|%[0-9]</span><span style="color:#e6db74">{2}</span><span style="color:#e6db74">|[0-9])&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>atom_tokens <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>findall(atom_pattern, smiles)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># [&#39;O&#39;, &#39;=&#39;, &#39;C&#39;, &#39;(&#39;, &#39;N&#39;, &#39;C&#39;, &#39;(&#39;, &#39;[C@@H]&#39;, &#39;(&#39;, &#39;O&#39;, &#39;)&#39;, &#39;c&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;1&#39;, &#39;c&#39;, &#39;c&#39;, &#39;c&#39;, &#39;(&#39;, &#39;[N+]&#39;, &#39;(&#39;, &#39;=&#39;, &#39;O&#39;, &#39;)&#39;, &#39;[O-]&#39;, &#39;)&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;c&#39;, &#39;c&#39;, &#39;1&#39;, &#39;)&#39;, &#39;C&#39;, &#39;O&#39;, &#39;)&#39;, &#39;C&#39;, &#39;(&#39;, &#39;Cl&#39;, &#39;)&#39;, &#39;Cl&#39;]</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; 36 tokens</span>
</span></span></code></pre></div><p>Character-level tokenization splits <code>Cl</code> (chlorine) into <code>C</code> + <code>l</code>, making the chlorine indistinguishable from carbon. It also fragments <code>[C@@H]</code> (a chiral carbon) into six meaningless tokens: <code>[</code>, <code>C</code>, <code>@</code>, <code>@</code>, <code>H</code>, <code>]</code>. Atom-level tokenization preserves these as single tokens but still produces long sequences (~40 tokens per molecule on average in ChEMBL).</p>
<p>Several chemistry-aware tokenizers go further:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a> adapts byte pair encoding to learn high-frequency SMILES substrings from large chemical datasets, compressing average sequence length from ~40 to ~6 tokens while preserving chemically meaningful substructures.</li>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">Atom Pair Encoding (APE)</a> preserves atomic identity during subword merging, preventing chemically meaningless token splits.</li>
<li><a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES (AIS)</a> encodes each atom&rsquo;s local chemical environment into the token itself (e.g., distinguishing a carbonyl carbon from a methyl carbon), reducing token degeneration and improving translation accuracy.</li>
<li><a href="/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/">Smirk</a> achieves full OpenSMILES coverage with only 165 tokens by decomposing bracketed atoms into glyphs.</li>
</ul>
<h3 id="smiles-based-foundation-models">SMILES-Based Foundation Models</h3>
<p>SMILES serves as the primary input format for molecular encoder models, including <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, <a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">SMILES-Transformer</a>, <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a>, <a href="/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/">SMI-TED</a>, and <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>. These models learn molecular representations from large SMILES corpora through pre-training objectives like masked language modeling.</p>
<p>A key open challenge is robustness to SMILES variants. The <a href="/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/">AMORE framework</a> revealed that current chemical language models struggle to recognize chemically equivalent SMILES representations (such as hydrogen-explicit vs. implicit forms, or different atom orderings) as encoding the same molecule.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>SMILES is the dominant representation for de novo molecular generation. The typical pipeline trains a language model on SMILES corpora, then steers sampling toward molecules with desired properties. Major architecture families include:</p>
<ul>
<li><strong>Variational autoencoders</strong>: The <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Automatic Chemical Design VAE</a> (Gomez-Bombarelli et al., 2018) encodes SMILES into a continuous latent space, enabling gradient-based optimization toward target properties.</li>
<li><strong>RL-tuned generators</strong>: <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and its successors fine-tune a pre-trained SMILES language model using reinforcement learning, rewarding molecules that satisfy multi-objective scoring functions. <a href="/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">DrugEx</a> extends this with Pareto-based multi-objective optimization.</li>
<li><strong>Adversarial approaches</strong>: <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> and <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a> apply GAN-based training to SMILES generation, using domain-specific rewards alongside the discriminator signal.</li>
</ul>
<p>The challenges of <a href="#canonical-vs-randomized-smiles">canonical vs. randomized SMILES</a> and <a href="#validity-and-the-role-of-invalid-smiles">invalid outputs</a> discussed above are particularly relevant in this generation context.</p>
<h3 id="property-prediction">Property Prediction</h3>
<p>SMILES strings serve as the primary input for quantitative structure-activity relationship (QSAR) models. <a href="/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/">SMILES2Vec</a> learns fixed-length molecular embeddings directly from SMILES for property regression and classification. <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">MaxSMI</a> demonstrates that SMILES augmentation (training on multiple randomized SMILES per molecule) improves property prediction accuracy, connecting the <a href="#canonical-vs-randomized-smiles">data augmentation benefits</a> observed in generative settings to discriminative tasks.</p>
<h3 id="optical-chemical-structure-recognition">Optical Chemical Structure Recognition</h3>
<p>SMILES is also the standard output format for <a href="/posts/what-is-ocsr/">optical chemical structure recognition (OCSR)</a> systems, which extract molecular structures from images in scientific literature. Deep learning approaches like <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a> and <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES</a> frame this as an image-to-SMILES translation problem, using encoder-decoder architectures to generate SMILES strings directly from molecular diagrams. For a taxonomy of OCSR approaches, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR methods overview</a>.</p>
<h2 id="limitations">Limitations</h2>
<h3 id="classical-limitations">Classical Limitations</h3>
<ul>
<li><strong>Non-uniqueness</strong>: Different SMILES strings can represent the same molecule (e.g., ethanol can be written as <code>CCO</code> or <code>OCC</code>). Canonical SMILES algorithms address this by producing a single unique representation.</li>
<li><strong>Non-robustness</strong>: Not every SMILES string corresponds to a valid molecule.
<ul>
<li>Syntactically invalid strings, e.g., unmatched parentheses or unclosed ring-closure digits.</li>
<li>Syntactically valid strings that violate chemical constraints, e.g., an atom with more bonds than its valence allows.</li>
</ul>
</li>
<li><strong>Information loss</strong>: SMILES encodes only the molecular graph (connectivity plus stereo annotations); 3D conformational information such as atomic coordinates cannot be represented.</li>
</ul>
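<p>Both classical failure modes are easy to demonstrate with RDKit, which returns <code>None</code> for strings it cannot parse into a valid molecule:</p>

```python
from rdkit import Chem

# Non-uniqueness: two spellings, one canonical form
print(Chem.CanonSmiles("CCO"), Chem.CanonSmiles("OCC"))  # CCO CCO

# Syntactically broken: ring bond 1 is never closed
print(Chem.MolFromSmiles("C1CC"))  # None

# Syntactically fine, chemically impossible: a five-bond carbon
print(Chem.MolFromSmiles("C(C)(C)(C)(C)C"))  # None
```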
<h3 id="machine-learning-limitations">Machine Learning Limitations</h3>
<p>The challenges described above (canonical ordering bias motivating <a href="#canonical-vs-randomized-smiles">randomized SMILES</a>, validity constraints motivating <a href="#deepsmiles">DeepSMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and tokenization ambiguity motivating <a href="#tokenization-challenges">chemistry-aware tokenizers</a>) remain active areas of research. See the linked sections for details on each.</p>
<h2 id="variants-and-standards">Variants and Standards</h2>
<h3 id="canonical-smiles">Canonical SMILES</h3>
<p>For how canonical vs. randomized SMILES affects generative modeling, see <a href="#canonical-vs-randomized-smiles">Canonical vs. Randomized SMILES</a> above.</p>
<p>Canonical SMILES algorithms produce a single unique string per molecule by assigning a deterministic rank to each atom and then traversing the molecular graph in that rank order. Most implementations build on the Morgan algorithm (extended connectivity): each atom starts with an initial invariant based on its properties (atomic number, degree, charge, hydrogen count), then iteratively updates its invariant by incorporating its neighbors&rsquo; invariants until the ranking stabilizes. The final atom ranks determine the traversal order, which determines the canonical string.</p>
<p>In practice, the Morgan algorithm alone does not fully resolve all ties. Implementations must also make choices about tie-breaking heuristics, aromaticity perception (Kekulé vs. aromatic form), and stereochemistry encoding. Because these choices differ across toolkits (RDKit, OpenBabel, Daylight, ChemAxon), the same molecule can produce different &ldquo;canonical&rdquo; SMILES depending on the software. A canonical SMILES is only guaranteed unique within a single implementation, not across implementations.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RDKit&#39;s canonical SMILES for caffeine</span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;CN1C=NC2=C1C(=O)N(C(=O)N2C)C&#34;</span>)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; Cn1c(=O)c2c(ncn2C)n(C)c1=O</span>
</span></span></code></pre></div><h3 id="isomeric-smiles">Isomeric SMILES</h3>
<p>Isomeric SMILES incorporates isotopes and stereochemistry information, providing more detailed molecular representations than generic SMILES. Non-isomeric SMILES strip this information, collapsing stereoisomers and isotopologues into the same string:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># L-alanine (chiral center)</span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;N[C@@H](C)C(=O)O&#34;</span>)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C[C@H](N)C(=O)O    (preserves chirality)</span>
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; CC(N)C(=O)O         (chirality lost)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Deuterated water (isotope labels)</span>
</span></span><span style="display:flex;"><span>mol2 <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;[2H]O[2H]&#34;</span>)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol2, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [2H]O[2H]           (preserves isotopes)</span>
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol2, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [H]O[H]             (isotope info lost)</span>
</span></span></code></pre></div><h3 id="opensmiles-vs-proprietary">OpenSMILES vs. Proprietary</h3>
<ul>
<li><strong>Proprietary</strong>: The original SMILES specification was proprietary (Daylight Chemical Information Systems), which led to compatibility issues between different implementations.</li>
<li><strong>OpenSMILES</strong>: A community standardization effort that addresses these compatibility concerns by providing a freely available, openly developed specification.</li>
</ul>
<h2 id="extensions-and-related-notations">Extensions and Related Notations</h2>
<h3 id="deepsmiles">DeepSMILES</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> modifies two aspects of SMILES syntax that cause most invalid strings in generative models, while remaining interconvertible with standard SMILES without information loss.</p>
<p><strong>Ring closures</strong>: Standard SMILES uses paired digits (<code>c1ccccc1</code> for benzene). A model must remember which digits are &ldquo;open&rdquo; and close them correctly. DeepSMILES replaces this with a single ring-size indicator at the closing position: <code>cccccc6</code> means &ldquo;connect to the atom 6 positions back.&rdquo;</p>
<p><strong>Branches</strong>: Standard SMILES uses matched parentheses (<code>C(OC)(SC)F</code>). DeepSMILES uses a postfix notation with only closing parentheses, where consecutive <code>)</code> symbols indicate how far to pop back on the atom stack: <code>COC))SC))F</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>SMILES:       c1ccccc1          C(OC)(SC)F
</span></span><span style="display:flex;"><span>DeepSMILES:   cccccc6           COC))SC))F
</span></span><span style="display:flex;"><span>              ↑                 ↑
</span></span><span style="display:flex;"><span>              single digit =    no opening parens,
</span></span><span style="display:flex;"><span>              ring size         )) pops back to C
</span></span></code></pre></div><p>A single unpaired symbol cannot be &ldquo;unmatched,&rdquo; eliminating the two main sources of syntactically invalid strings from generative models.</p>
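<p>Those two failure modes can be made concrete with a toy syntax check (a hypothetical helper, not part of any toolkit): it flags exactly the errors that DeepSMILES makes unrepresentable, while ignoring bracket atoms and <code>%nn</code> ring closures for simplicity.</p>

```python
def smiles_syntax_ok(s):
    """Toy check for the two syntax errors DeepSMILES eliminates:
    unmatched parentheses and unpaired ring-closure digits.
    (Simplification: ignores bracket atoms and %nn ring closures.)"""
    depth = 0
    ring_digits = {}
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # ")" with no matching "("
                return False
        elif ch.isdigit():
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in ring_digits.values())

print(smiles_syntax_ok("c1ccccc1"))    # True  (benzene)
print(smiles_syntax_ok("c1ccccc"))     # False (ring bond 1 never closed)
print(smiles_syntax_ok("C(OC)(SC)F"))  # True
print(smiles_syntax_ok("C(OC)(SCF"))   # False (unmatched "(")
```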
<h3 id="reaction-smiles">Reaction SMILES</h3>
<p>Reaction SMILES extends the notation to represent chemical reactions by separating reactants, reagents, and products with <code>&gt;</code> symbols. The general format is <code>reactants&gt;reagents&gt;products</code>, where each group can contain multiple molecules separated by <code>.</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>CC(=O)O.CCO&gt;&gt;CC(=O)OCC.O
</span></span><span style="display:flex;"><span>│       │    │         │
</span></span><span style="display:flex;"><span>│       │    │         └─ water
</span></span><span style="display:flex;"><span>│       │    └─ ethyl acetate
</span></span><span style="display:flex;"><span>│       └─ ethanol
</span></span><span style="display:flex;"><span>└─ acetic acid
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>(Fischer esterification: acetic acid + ethanol → ethyl acetate + water)
</span></span></code></pre></div><p>The <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> treats this as a machine translation problem, translating reactant SMILES to product SMILES with a Transformer encoder-decoder architecture.</p>
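<p>Because the format is purely delimiter-based, a reaction SMILES can be pulled apart with plain string splitting (a minimal sketch, not a full parser):</p>

```python
rxn = "CC(=O)O.CCO>>CC(=O)OCC.O"  # Fischer esterification; no listed reagents

# Split into the three ">"-delimited fields, then "."-separated molecules.
reactants, reagents, products = rxn.split(">")
split = lambda field: field.split(".") if field else []

print(split(reactants))  # ['CC(=O)O', 'CCO']
print(split(reagents))   # []  (empty middle field)
print(split(products))   # ['CC(=O)OCC', 'O']
```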
<h3 id="smarts-and-smirks">SMARTS and SMIRKS</h3>
<p><strong>SMARTS</strong> (SMILES Arbitrary Target Specification) is a pattern language built on SMILES syntax for substructure searching. It extends SMILES with query primitives like atom environments (<code>[CX3]</code> for a carbon with three connections) and logical operators, enabling precise structural pattern matching:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMARTS pattern for a carboxylic acid group: C(=O)OH</span>
</span></span><span style="display:flex;"><span>pattern <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmarts(<span style="color:#e6db74">&#34;[CX3](=O)[OX2H1]&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> name, smi <span style="color:#f92672">in</span> [(<span style="color:#e6db74">&#34;acetic acid&#34;</span>, <span style="color:#e6db74">&#34;CC(=O)O&#34;</span>),
</span></span><span style="display:flex;"><span>                  (<span style="color:#e6db74">&#34;benzoic acid&#34;</span>, <span style="color:#e6db74">&#34;c1ccc(C(=O)O)cc1&#34;</span>),
</span></span><span style="display:flex;"><span>                  (<span style="color:#e6db74">&#34;ethanol&#34;</span>, <span style="color:#e6db74">&#34;CCO&#34;</span>),
</span></span><span style="display:flex;"><span>                  (<span style="color:#e6db74">&#34;acetone&#34;</span>, <span style="color:#e6db74">&#34;CC(=O)C&#34;</span>)]:
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(smi)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  </span><span style="color:#e6db74">{</span>name<span style="color:#e6db74">:</span><span style="color:#e6db74">15s</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> -&gt; </span><span style="color:#e6db74">{</span><span style="color:#e6db74">&#39;match&#39;</span> <span style="color:#66d9ef">if</span> mol<span style="color:#f92672">.</span>HasSubstructMatch(pattern) <span style="color:#66d9ef">else</span> <span style="color:#e6db74">&#39;no match&#39;</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; acetic acid      -&gt; match</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; benzoic acid     -&gt; match</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; ethanol          -&gt; no match</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; acetone          -&gt; no match</span>
</span></span></code></pre></div><p><strong>SMIRKS</strong> extends SMARTS to describe reaction transforms, using atom maps (<code>:1</code>, <code>:2</code>, &hellip;) to track which atoms in the reactants correspond to which atoms in the products:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> AllChem, MolFromSmiles, MolToSmiles
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMIRKS for ester hydrolysis: break the C-O ester bond</span>
</span></span><span style="display:flex;"><span>smirks <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[C:1](=[O:2])[O:3][C:4]&gt;&gt;[C:1](=[O:2])[OH:3].[C:4][OH]&#34;</span>
</span></span><span style="display:flex;"><span>rxn <span style="color:#f92672">=</span> AllChem<span style="color:#f92672">.</span>ReactionFromSmarts(smirks)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>reactant <span style="color:#f92672">=</span> MolFromSmiles(<span style="color:#e6db74">&#34;CC(=O)OCC&#34;</span>)  <span style="color:#75715e"># ethyl acetate</span>
</span></span><span style="display:flex;"><span>products <span style="color:#f92672">=</span> rxn<span style="color:#f92672">.</span>RunReactants((reactant,))
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#34; + &#34;</span><span style="color:#f92672">.</span>join(MolToSmiles(p) <span style="color:#66d9ef">for</span> p <span style="color:#f92672">in</span> products[<span style="color:#ae81ff">0</span>]))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; CC(=O)O + CCO    (acetic acid + ethanol)</span>
</span></span></code></pre></div><p>See the <a href="/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/">Smirk tokenizer</a> for a recent approach to tokenizing these extensions for molecular foundation models.</p>
<h3 id="t-smiles">t-SMILES</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/">t-SMILES</a> encodes molecules as fragment-based strings by decomposing a molecule into chemically meaningful substructures, arranging them into a full binary tree, and traversing it breadth-first. This dramatically reduces nesting depth compared to standard SMILES (99.3% of tokens at depth 0-2 vs. 68.0% for SMILES on ChEMBL).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Standard SMILES (depth-first, atom-level):
</span></span><span style="display:flex;"><span>  CC(=O)Oc1ccccc1C(=O)O                     (aspirin)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>t-SMILES pipeline:
</span></span><span style="display:flex;"><span>  1. Fragment:     [CC(=O)O*]  [*c1ccccc1*]  [*C(=O)O]
</span></span><span style="display:flex;"><span>  2. Binary tree:
</span></span><span style="display:flex;"><span>                   [*c1ccccc1*]
</span></span><span style="display:flex;"><span>                  /             \
</span></span><span style="display:flex;"><span>         [CC(=O)O*]          [*C(=O)O]
</span></span><span style="display:flex;"><span>  3. BFS string:   [*c1ccccc1*] ^ [CC(=O)O*] ^ [*C(=O)O]
</span></span></code></pre></div><p>The framework introduces two symbols beyond standard SMILES: <code>^</code> separates adjacent fragments (analogous to spaces between words), and <code>&amp;</code> marks empty tree nodes. Only single closure symbols are needed per fragment, eliminating the deep nesting that makes standard SMILES difficult for generative models on small datasets.</p>
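<p>The serialize step of this pipeline can be sketched with a toy breadth-first traversal (a simplified illustration under assumed conventions; the actual t-SMILES implementation handles fragmentation and <code>&amp;</code> placement more carefully):</p>

```python
from collections import deque

# Hypothetical node layout: (fragment, left_child, right_child); None = empty slot.
aspirin_tree = ("[*c1ccccc1*]",
                ("[CC(=O)O*]", None, None),
                ("[*C(=O)O]", None, None))

def bfs_serialize(root):
    """BFS-serialize a fragment binary tree into a t-SMILES-like string:
    "^" separates fragments, "&" marks empty child slots."""
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            out.append("&")
            continue
        frag, left, right = node
        out.append(frag)
        # every non-empty node contributes both of its child slots
        queue.append(left)
        queue.append(right)
    while out and out[-1] == "&":  # trim trailing empty markers for readability
        out.pop()
    return "^".join(out)

print(bfs_serialize(aspirin_tree))
# -> [*c1ccccc1*]^[CC(=O)O*]^[*C(=O)O]
```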
<h2 id="further-reading">Further Reading</h2>
<p>For a more robust alternative that guarantees 100% valid molecules, see <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES (Self-Referencing Embedded Strings)</a>. For the historical context and design philosophy behind SMILES, see <a href="/notes/chemistry/molecular-representations/notations/smiles-original-paper/">SMILES: The Original Paper (Weininger 1988)</a>.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://19january2021snapshot.epa.gov/sites/static/files/2015-05/documents/appendf.pdf">Sustainable Futures / P2 Framework Manual 2012 EPA-748-B12-001: Appendix F. SMILES Notation Tutorial</a></li>
<li><a href="https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html">Daylight Chemical Information Systems, Inc. SMILES</a></li>
<li><a href="http://opensmiles.org/opensmiles.html">OpenSMILES</a></li>
<li><a href="https://arxiv.org/abs/2402.01439">From Words to Molecules: A Survey of Large Language Models in Chemistry</a></li>
</ul>
]]></content:encoded></item></channel></rss>