Overview

InChI (International Chemical Identifier) is an open, non-proprietary chemical structure identifier developed by IUPAC and NIST. Unlike SMILES, which linearizes a molecular graph through depth-first traversal, InChI decomposes a molecule into a hierarchy of layers (connectivity, hydrogen atoms, charge, stereochemistry) that build progressively from the molecular formula to full stereochemical detail. Combined with canonical atom numbering, this design means that two drawings of the same molecule produce the same InChI even if they differ in atom ordering or layout.

InChI was created to solve a specific problem: linking chemical information across databases on the open web. Before InChI, interoperability between chemical databases depended on proprietary identifiers (like CAS Registry Numbers) or format-dependent representations. The project began at a March 2000 IUPAC meeting and is maintained by the InChI Trust, a UK charity supported by publishers and database providers. The reference implementation is open source.

Key Characteristics

  • Canonical by design: Every valid molecular structure maps to exactly one standard InChI string, regardless of how the structure was drawn or which atoms were numbered first. This uniqueness is built into the algorithm, not added as a post-processing step.
  • Hierarchical layers: Information is organized from general (molecular formula) to specific (stereochemistry, isotopes). This allows matching at different levels of detail: a query with unknown stereochemistry can match against structures with known stereochemistry by comparing only the connectivity layers.
  • Web-searchable via InChIKey: Because InChI strings contain characters (/, +, =) that break web search engines, the 27-character InChIKey hash provides a fixed-length, search-friendly identifier.
  • Non-proprietary and open: Governed by IUPAC through the InChI Trust. The algorithm, source code, and specification are freely available.
  • Machine-optimized: Designed for programmatic parsing and database operations rather than human readability. Compare with SMILES, which prioritizes human readability.

Layered Structure

An InChI string begins with the prefix InChI= followed by a version number, then a series of layers separated by /. Each layer encodes a specific aspect of the molecular structure.

Layer Breakdown

For L-alanine (an amino acid with a chiral center):

InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
       │  │      │            │                   │   │  │
       │  │      │            │                   │   │  └─ /s: stereo type (1=absolute)
       │  │      │            │                   │   └─ /m: parity inversion flag
       │  │      │            │                   └─ /t: tetrahedral parity
       │  │      │            └─ /h: hydrogen layer
       │  │      └─ /c: connectivity layer
       │  └─ molecular formula
       └─ version (1S = standard InChI v1)

The full set of layers, in order:

  1. Main layer: Molecular formula (e.g., C3H7NO2)
  2. Connectivity (/c): Atom-to-atom connections between non-hydrogen atoms, excluding bond orders. Atoms are numbered starting from 1, and connections are written as chains of atom numbers with branches in parentheses (e.g., 1-2(4)3(5)6).
  3. Hydrogen (/h): Hydrogen atom assignments, distinguishing mobile (tautomeric) from fixed hydrogens
  4. Charge (/q) and proton balance (/p): Net charge and protonation state
  5. Double bond stereochemistry (/b): E/Z configuration around double bonds
  6. Tetrahedral stereochemistry (/t): Configuration at tetrahedral (sp3) stereocenters, recorded as a parity mark (+ or -) per atom rather than as R/S labels
  7. Parity inversion (/m): Relates computed parity to actual configuration
  8. Stereo type (/s): Whether stereochemistry is absolute, relative, or racemic
  9. Isotope layer (/i): Isotopic labeling (e.g., deuterium, carbon-13)
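
The layer order above can be made concrete with a small string-splitting sketch. This is an illustration only, using the Python standard library: the function name `split_inchi_layers` is hypothetical, and the code assumes a well-formed InChI string without validating any layer's contents. It returns ordered pairs instead of a dictionary because a prefix may appear more than once in a full InChI.

```python
from __future__ import annotations


def split_inchi_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI string into ordered (layer_name, value) pairs.

    Hypothetical helper for illustration; it does not validate layer contents.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = [("version", parts[0])]
    for position, part in enumerate(parts[1:], start=1):
        if not part:
            continue  # skip empty segments such as a trailing "/"
        if position == 1 and not part[0].islower():
            # The formula is the only layer without a lowercase prefix letter.
            layers.append(("formula", part))
        else:
            layers.append((part[0], part[1:]))
    return layers


l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
for name, value in split_inchi_layers(l_alanine):
    print(f"{name:8s} {value}")
```

Each printed row corresponds to one entry in the numbered list: the version, the formula, then the prefixed layers in the order they occur in the string.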

Standard vs. Non-Standard InChI

The S in InChI=1S/ indicates a Standard InChI, which uses a fixed set of normalization options to guarantee that any software producing Standard InChI will generate the same string for the same molecule. Non-standard InChI allows custom options (such as the Fixed-H layer /f, which distinguishes specific tautomeric forms) but sacrifices cross-implementation consistency.
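
As a rough sketch of how the prefix distinguishes the two flavors, the version number and the standard flag can be read with a regular expression. The helper name `inchi_version` is an assumption made for this example; the check looks only at the prefix and says nothing about which options produced a non-standard string.

```python
import re

# "InChI=" + version digits + optional "S" (standard flag) + "/"
_PREFIX = re.compile(r"^InChI=(\d+)(S?)/")


def inchi_version(inchi: str) -> tuple:
    """Return (version_number, is_standard) from an InChI prefix.

    Hypothetical helper for illustration only.
    """
    match = _PREFIX.match(inchi)
    if match is None:
        raise ValueError("missing or malformed InChI prefix")
    return int(match.group(1)), match.group(2) == "S"


print(inchi_version("InChI=1S/CH4/h1H4"))  # standard prefix
print(inchi_version("InChI=1/CH4/h1H4"))   # non-standard prefix
```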

The InChIKey

InChI strings can be arbitrarily long for large molecules, and their /, +, and = characters cause problems for web search engines. The InChIKey addresses both issues by hashing the InChI into a fixed 27-character string:

$$ \text{InChIKey} = f_{\text{SHA-256}}(\text{InChI}) $$

Structure

An InChIKey has the format XXXXXXXXXXXXXX-XXXXXXXXXX-X:

  • First block (14 characters): derived from a truncated SHA-256 hash of the connectivity information (molecular skeleton)
  • Second block (10 characters): 8 characters encoding stereochemistry and isotopes, plus a standard/non-standard flag (S or N) and a version indicator (A for v1)
  • Third block (1 character): Protonation flag (N for neutral)

For example, L-alanine:

InChIKey: QNAYBMKLOCPYGJ-REOHCLBHSA-N
          │                │          │
          └─ connectivity  └─ stereo  └─ protonation
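
The block layout described above can be checked and unpacked with a short parser. This is a minimal sketch: the names `InChIKeyParts` and `parse_inchikey` are invented for this example, and the pattern tests only the shape of the key (uppercase letters, block lengths, S/N flag), not whether it matches any real structure.

```python
import re
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    skeleton: str        # first block, 14 characters
    stereo_isotope: str  # first 8 characters of the second block
    is_standard: bool    # True for flag "S", False for "N"
    version: str         # version indicator, "A" for v1
    protonation: str     # third block, "N" for neutral


_KEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its documented fields (shape check only)."""
    match = _KEY.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, stereo, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, stereo, flag == "S", version, protonation)


parts = parse_inchikey("QNAYBMKLOCPYGJ-REOHCLBHSA-N")
print(parts.skeleton)        # -> QNAYBMKLOCPYGJ
print(parts.stereo_isotope)  # -> REOHCLBH
print(parts.is_standard)     # -> True
```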

Collision Risk

Because the InChIKey is a hash, collisions are theoretically possible. The first block encodes 65 bits, giving $2^{65} \approx 3.7 \times 10^{19}$ possible values for connectivity. Under a birthday-bound estimate, $p \approx n^2 / (2 \cdot 2^{65})$, the chance of at least one first-block collision somewhere in a database of $n = 10^9$ compounds is roughly 1%, so accidental collisions are rare for practical database sizes. It is important to distinguish InChIKey collisions (a mathematical inevitability of hashing, but rare in practice) from InChI collisions (bugs in the algorithm, which are very rare and targeted by the certification suite).

Working with InChI in Python

The RDKit library provides InChI support through its built-in functions:

from rdkit import Chem
from rdkit.Chem.inchi import MolFromInchi, MolToInchi, InchiToInchiKey

# SMILES -> InChI
mol = Chem.MolFromSmiles("C[C@H](N)C(=O)O")  # L-alanine
inchi = MolToInchi(mol)
print(inchi)
# -> InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1

# InChI -> Molecule -> SMILES
mol2 = MolFromInchi(inchi)
print(Chem.MolToSmiles(mol2))
# -> C[C@H](N)C(=O)O

# InChI -> InChIKey
key = InchiToInchiKey(inchi)
print(key)
# -> QNAYBMKLOCPYGJ-REOHCLBHSA-N

Layer-Level Matching

Because InChI is hierarchical, you can compare molecules at different levels of detail by truncating layers. Two molecules that differ only in stereochemistry will share the same connectivity layers:

from rdkit import Chem
from rdkit.Chem.inchi import MolToInchi, InchiToInchiKey

# L-alanine and D-alanine differ only in chirality
l_ala = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
d_ala = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")

l_inchi = MolToInchi(l_ala)
d_inchi = MolToInchi(d_ala)

# Full InChIs differ (here only in the /m layer)
print(l_inchi)
# -> InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
print(d_inchi)
# -> InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1

# First block of InChIKey is identical (same connectivity)
l_key = InchiToInchiKey(l_inchi)
d_key = InchiToInchiKey(d_inchi)
print(l_key[:14] == d_key[:14])
# -> True (same molecular skeleton)
print(l_key == d_key)
# -> False (different stereochemistry)
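
The same comparison can be made on the InChI strings themselves by dropping the stereo layers. The sketch below uses only the standard library. The helper `strip_stereo_layers` is hypothetical, and its output is meant as a comparison key under the assumption of well-formed input; it is not guaranteed to equal the InChI that the reference software would generate for a stereo-free drawing.

```python
# Layer prefixes that carry stereochemistry: /b, /t, /m, /s
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi: str) -> str:
    """Remove stereo layers so structures can be compared by connectivity.

    Hypothetical helper: isotopic stereo sublayers are removed as well,
    because they use the same prefixes.
    """
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([head, *kept])


l_inchi = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_inchi = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(strip_stereo_layers(l_inchi))
# -> InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)
print(strip_stereo_layers(l_inchi) == strip_stereo_layers(d_inchi))
# -> True
```

The formula layer is never removed by accident, since it starts with an uppercase letter or a digit rather than one of the lowercase stereo prefixes.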

InChI in Machine Learning

InChI was designed for database interoperability, not for machine learning. Its hierarchical, layer-based structure differs fundamentally from the sequential, atom-by-atom encoding used by SMILES and SELFIES. This has practical implications for ML applications.

Optical Chemical Structure Recognition

InChI is widely used as an output format for optical chemical structure recognition (OCSR) systems that extract molecular structures from images in scientific literature. Because InChI is canonical, it provides an unambiguous target for image-to-text models.

Image2InChI uses an improved Swin Transformer encoder with attention-based feature fusion to convert molecular images directly to InChI strings, achieving 99.8% accuracy on the BMS dataset. The ViT-InChI Transformer takes a similar approach with a Vision Transformer backbone.

In a systematic comparison of string representations for OCSR, Rajan et al. (2022) evaluated SMILES, DeepSMILES, SELFIES, and InChI using the same transformer architecture. InChI strings are longer than SMILES (producing more tokens for the decoder), which increases sequence modeling difficulty. SMILES achieved the highest exact match accuracy (88.62%), while SELFIES achieved 100% structural validity.

Chemical Name Translation

InChI’s canonical structure makes it a natural intermediate representation for translating between chemical names and structures. Handsel et al. (2021) trained a sequence-to-sequence Transformer to translate InChI identifiers to IUPAC names character-by-character, achieving 91% accuracy on organic compounds from PubChem (10 million training pairs). STOUT converts through SELFIES as an intermediate but validates outputs against InChI for structural equivalence.

Representation Comparison for ML

InChI’s design trade-offs position it differently from SMILES and SELFIES for machine learning:

| Property | InChI | SMILES | SELFIES |
|---|---|---|---|
| Uniqueness | Canonical by design | Requires canonicalization algorithm | Via SMILES roundtrip |
| Validity guarantee | N/A (not generative) | No | Yes (every string is valid) |
| Human readability | Low (machine-optimized) | High | Moderate |
| String length | Longest | Shortest | Moderate |
| Primary ML use | OCSR output, database linking | Generation, property prediction | Generation with validity |
| Tokenization | Complex (layers, separators) | Regex-based atom tokens | Bracket-delimited tokens |

InChI’s length and structural complexity (layer separators, parenthetical groupings, comma-delimited atom lists) make it less common as a direct input representation for generative models. Most molecular language models use SMILES or SELFIES for generation tasks, and convert to InChI only for canonicalized comparison or database lookup.
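
To give a feel for the tokenization difference, the sketch below splits an InChI and a SMILES string for the same molecule with two simple regular expressions. Both patterns are assumptions written for this illustration, not the tokenizers used in the studies cited above, and the SMILES pattern covers only a small subset of the grammar.

```python
import re

# Hypothetical InChI tokenizer: prefix, layer markers, element symbols,
# multi-digit numbers, then any remaining single character.
INCHI_TOKEN = re.compile(r"InChI=\d+S?|/[a-z]|[A-Z][a-z]?|\d+|.")

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# two-digit ring closures, then any remaining single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(text: str, pattern: re.Pattern) -> list:
    """Return the non-overlapping tokens found by the given pattern."""
    return pattern.findall(text)


inchi = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
smiles = "C[C@H](N)C(=O)O"  # L-alanine

inchi_tokens = tokenize(inchi, INCHI_TOKEN)
smiles_tokens = tokenize(smiles, SMILES_TOKEN)
print(len(inchi_tokens), inchi_tokens[:8])
print(len(smiles_tokens), smiles_tokens)
```

Even for a molecule this small, the InChI yields several times as many tokens as the SMILES under these patterns, and the tokens mix layer markers, atom indices, and punctuation.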

Limitations

Tautomerism

InChI v1 handles many tautomeric forms by normalizing mobile hydrogen atoms in the /h layer. However, certain tautomeric transformations (such as 1,4-oxime/nitroso conversions) can produce different InChIs for what chemists consider the same compound. This is a known limitation targeted for InChI v2, with 86 tautomeric transformation rules compiled and validated across 400M+ structures to inform the update.

Inorganic and Organometallic Chemistry

The original InChI specification was designed primarily for organic molecules. Metal-ligand bonds, coordination compounds, and extended solid-state structures posed challenges. The InChI v1.07 release addresses this with dedicated handling for metal-ligand bonds, though complete coverage of all inorganic chemistry remains an ongoing effort.

Not Designed for Generation

Unlike SMILES (which can be generated token-by-token through depth-first graph traversal) or SELFIES (which guarantees validity by construction), InChI’s layered format does not lend itself to autoregressive generation. A generative model would need to produce internally consistent layers: the connectivity layer must agree with the molecular formula, the hydrogen layer must be consistent with the connectivity, and the stereochemistry layers must reference valid atom indices. This cross-layer dependency makes InChI poorly suited as a target for token-by-token molecular generation, which is why most generative chemistry models use SMILES or SELFIES.
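
One of these cross-layer constraints can be sketched as a check that the atom indices in the connectivity layer agree with the number of non-hydrogen atoms in the formula. This is an illustration under stated assumptions: the function names are hypothetical, and the check handles only single-component InChIs (no "." in the formula, no ";" in /c). It is far weaker than the validation a real InChI parser performs.

```python
import re


def heavy_atom_count(formula: str) -> int:
    """Count non-hydrogen atoms in a simple formula such as C3H7NO2."""
    total = 0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element != "H":
            total += int(count) if count else 1
    return total


def connectivity_matches_formula(inchi: str) -> bool:
    """Check that /c uses exactly the atom indices 1..N implied by the formula.

    Hypothetical check; assumes a single-component InChI string.
    """
    parts = inchi.split("/")
    formula = parts[1]
    c_layer = next((p[1:] for p in parts[2:] if p.startswith("c")), "")
    indices = {int(n) for n in re.findall(r"\d+", c_layer)}
    if not indices:
        return True  # no connectivity layer, nothing to compare
    return indices == set(range(1, heavy_atom_count(formula) + 1))


valid = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
broken = valid.replace("3(5)6", "3(5)7")  # one wrong atom index

print(connectivity_matches_formula(valid))   # -> True
print(connectivity_matches_formula(broken))  # -> False
```

A single wrong token in the connectivity layer is enough to break agreement with the formula, which is the kind of long-range dependency a token-by-token generator would have to learn.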

Irreversibility of InChIKey

The InChIKey is a one-way hash. An InChIKey cannot be converted back to an InChI or a molecular structure. It is useful only for search and comparison, not for structure retrieval (without a lookup table).

Variants and Extensions

RInChI: Reactions

RInChI (Reaction InChI) extends InChI to represent chemical reactions by combining the InChIs of reactants, products, and agents into a single identifier. It provides a canonical identifier for reactions, enabling reaction database searching and duplicate detection (Grethe et al., 2018).

MInChI: Mixtures

MInChI (Mixture InChI) represents mixtures of substances, combined with the Mixfile format for storing detailed mixture composition data. This extends the InChI framework to complex multi-component systems like formulations and alloys (Clark et al., 2019).

NInChI: Nanomaterials

NInChI proposes a hierarchical adaptation of InChI for nanomaterial identification. Traditional chemical identifiers break down at the nanoscale, where a single “entity” may consist of millions of atoms arranged in layers, coatings, and surface functionalizations (Lynch et al., 2020).

References