Molecular Representations

How a molecule is encoded as a string or graph determines what a model can learn from it. This section traces the evolution of molecular string representations: from the original SMILES notation and InChI identifiers, through the SELFIES format designed to guarantee every generated string maps to a valid molecule, to newer extensions like RInChI for reactions and MInChI for mixtures. Notes cover both the technical specifications of each format and the practical tradeoffs (validity, expressiveness, canonicality) that matter when choosing a representation for machine learning.

Computational Chemistry

D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism

InChI and Tautomerism: Toward Comprehensive Treatment

A comprehensive 2020 analysis of the tautomerism problem in chemical databases, compiling 86 tautomeric transformation rules (20 existing, 66 new) and validating them across 400M+ structures to inform algorithmic improvements for InChI V2.

2D molecular structure diagram of tricyclohexylphosphine showing a central phosphorus atom bonded to three cyclohexyl groups

InChI: The Worldwide Chemical Structure Identifier Standard

A comprehensive 2013 review explaining how InChI emerged as the global standard for chemical structure identifiers, covering its history as a response to the Internet’s need for non-proprietary molecular linking, its governance under IUPAC, and the technical layers that ensure uniqueness across diverse chemical databases.
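To make the idea of layers concrete, here is a minimal sketch that splits an InChI string into its slash-separated layers. The function name `split_inchi_layers` is hypothetical and this is not the official InChI software. It assumes a simple standard InChI whose first layer after the version is the chemical formula and whose layer prefixes do not repeat, which does not hold for every InChI.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into a dict of layers (illustrative sketch only).

    Assumptions: the string starts with "InChI=", the first layer after the
    version is the chemical formula, and each later layer begins with a
    one-letter prefix such as "c" (connectivity) or "h" (hydrogens).
    Repeated prefixes would overwrite each other in this simple version.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version}
    if rest:
        layers["formula"] = rest[0]
    for part in rest[1:]:
        if part:  # skip empty segments
            layers[part[0]] = part[1:]
    return layers


# Standard InChI for ethanol
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
print(split_inchi_layers(ethanol))
```

For ethanol this separates the version (`1S`), the formula (`C2H6O`), the connectivity layer (`c`), and the hydrogen layer (`h`).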

Crystal structure of Na8Si46 clathrate displaying dodecahedral and tetrakaidecahedral coordination polyhedra

Making InChI FAIR and Sustainable for Inorganic Chemistry

A 2025 Faraday Discussions paper describing the major overhaul delivered in InChI v1.07, which fixed more than 3000 bugs, added support for inorganic and organometallic compounds, and modernized the system to align with FAIR data principles for chemistry databases.

A cobalt sulfate and ethylenediamine mixture being prepared

Mixfile & MInChI: Machine-Readable Mixture Formats

A 2019 format specification introducing two complementary standards for chemical mixtures. Mixfile provides comprehensive mixture descriptions and MInChI provides compact canonical identifiers. This addresses the long-standing lack of standardized machine-readable formats for multi-component chemical systems.

Colorized electron microscope image of nanostructured indium phosphide surface showing spatially oriented cubic crystallites

NInChI: Toward a Chemical Identifier for Nanomaterials

Can we create a SMILES-like line notation for nanomaterials? A collaborative workshop tackles the challenge of representing complex, multi-component nanomaterials and proposes NInChI, an extension of the established InChI system.

Recent Advances in the SELFIES Library: 2023 Update

A 2023 software update paper documenting improvements to the SELFIES Python library (v2.1.1), including a streamlined context-free grammar, expanded support for aromatic systems and stereochemistry, customizable semantic constraints, ML utility functions, and performance benchmarks on 300K+ molecules.

Chemical diagram showing a generalized Grignard reaction

RInChI: The Reaction International Chemical Identifier

A 2018 infrastructure paper introducing RInChI (Reaction InChI), the first standardized format for uniquely identifying chemical reactions through algorithmic hashing and layering, enabling reaction database searching and duplicate detection analogous to how InChI works for individual molecules.

SELFIES molecular representation overview

SELFIES: The Original Paper on Robust Molecular Strings

The 2020 paper that introduced SELFIES, in which Mario Krenn and colleagues present a molecular string representation that solves the validity problems of SMILES by guaranteeing that every string corresponds to a valid chemical structure.

SMILES Notation: The Original Paper by Weininger (1988)

David Weininger introduced SMILES notation in 1988, establishing encoding rules for representing chemical structures as compact, human-readable strings.

SELFIES representation of 2-Fluoroethenimine molecule

SELFIES: The 100% Robust Molecular String Representation

An in-depth overview of SELFIES, the 100% robust molecular string representation designed to overcome SMILES limitations in machine learning, where every possible string (even random ones) decodes to a valid molecule through local operations, customizable valence rules, and graph-based internal representations.
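The robustness idea can be illustrated with a deliberately tiny toy decoder. This is not the `selfies` library and not its actual grammar: the function `toy_decode`, the four-element valence table, and the restriction to unbranched, ring-free chains are all assumptions made for illustration. It shows only one local rule, that a requested bond order is lowered until it fits the free valence on both atoms and decoding stops once no valence is left, so any token sequence yields a chain that respects the valence table.

```python
# Toy illustration only: NOT the selfies library, no branches, no rings.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}  # assumed valence rules
BOND_ORDER = {"": 1, "=": 2, "#": 3}
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def toy_decode(tokens):
    """Decode tokens such as "[C]", "[=C]", "[#N]" into a linear SMILES chain."""
    smiles = []
    remaining = None  # free valence on the previous atom; None before the first atom
    for token in tokens:
        body = token.strip("[]")
        if not body:
            continue  # empty token: ignore
        prefix = body[0] if body[0] in "=#" else ""
        element = body[len(prefix):]
        if element not in MAX_VALENCE:
            continue  # unknown token: ignore instead of failing
        if remaining is None:
            # First atom: there is nothing to bond to, so any bond prefix is dropped.
            smiles.append(element)
            remaining = MAX_VALENCE[element]
            continue
        if remaining == 0:
            break  # previous atom is saturated, so the chain ends here
        # Lower the requested bond order until both atoms can accommodate it.
        order = min(BOND_ORDER[prefix], remaining, MAX_VALENCE[element])
        smiles.append(BOND_SYMBOL[order] + element)
        remaining = MAX_VALENCE[element] - order
    return "".join(smiles)


print(toy_decode(["[C]", "[=C]", "[F]"]))  # C=CF
print(toy_decode(["[F]", "[#N]", "[C]"]))  # FNC: the triple bond is reduced to a single bond
```

The second call requests a triple bond to a fluorine that has only one free valence, so the toy decoder falls back to a single bond instead of producing an invalid structure.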

SMILES: A Compact Notation for Chemical Structures

Comprehensive overview of SMILES notation for chemical structures, covering syntax for atoms, bonds, branches, rings, and stereochemistry, plus its key limitations for machine learning.
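One limitation for machine learning is that a generated SMILES string can be syntactically broken. The sketch below checks only two of the syntax rules mentioned above, balanced branch parentheses and paired ring-closure labels. The function name `smiles_syntax_problems` is hypothetical, and this is far from a full SMILES parser: it does not check atoms, valences, aromaticity, or stereochemistry.

```python
def smiles_syntax_problems(smiles):
    """Return a list of branch/ring syntax problems found in a SMILES string.

    Illustrative sketch: checks only parentheses and ring-closure labels.
    """
    problems = []
    depth = 0           # current branch nesting depth
    open_rings = set()  # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms; digits inside are isotopes, H counts, or charges.
            end = smiles.find("]", i)
            if end == -1:
                problems.append("unclosed bracket atom")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched ')'")
                depth = 0
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12
            open_rings ^= {smiles[i + 1:i + 3]}
            i += 2
        elif ch.isdigit():
            open_rings ^= {ch}  # first sight opens the ring, second closes it
        i += 1
    if depth > 0:
        problems.append("unclosed '('")
    if open_rings:
        problems.append("unclosed ring bond(s): " + ", ".join(sorted(open_rings)))
    return problems


print(smiles_syntax_problems("c1ccccc1"))  # benzene: no problems found
print(smiles_syntax_problems("C1CCCC"))    # ring label 1 is never closed
```

An empty list means only that these two checks passed, not that the string describes a chemically valid molecule.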

Invalid SMILES Benefit Chemical Language Models: A Study

A 2024 Nature Machine Intelligence paper providing causal evidence that the ability to generate invalid SMILES improves chemical language model performance: discarding invalid outputs filters out low-likelihood samples, while validity constraints (as in SELFIES) introduce structural biases that impair distribution learning.