InChI: The Worldwide Chemical Structure Identifier Standard

InChI as a Resource and Systematization Standard

This is a Resource & Systematization Paper that reviews the history, technical architecture, governance structure, and implementation status of the InChI standard. It documents both the institutional development of an open chemical identifier and the technical specification that enables it.

The Motivation: Interoperability in Chemical Databases

Before InChI, the chemistry community faced a fundamental interoperability problem. Chemical databases relied either on proprietary identifiers such as CAS Registry Numbers or on format-dependent representations such as SMILES strings. The proprietary systems were expensive, restricted, and tied to “in-house” databases.

The authors argue the Internet and Open Source software acted as a “black swan” event that disrupted this status quo. The Internet created a need to link diverse, free and fee-based resources without a central gatekeeper. InChI was designed as the solution: a non-proprietary, open-source identifier enabling linking of distinct data compilations.

Technical and Institutional Innovations of InChI

InChI’s innovation is both technical and institutional:

Technical novelty: A hierarchical “layered” canonicalization system in which the representation builds from basic connectivity up to full stereochemistry. This allows flexible matching: a molecule with unknown stereochemistry produces an InChI whose layers are a subset of those produced for the same molecule with known stereochemistry.

Institutional novelty: Creating an open standard governed by a charitable trust (the InChI Trust) that convinced commercial competitors (publishers, databases) to adopt it as a “pre-competitive” necessity. This solved the political problem of maintaining an open standard in a competitive industry.

Technical Architecture: Layers and Hashing

The InChI String

InChI is a canonicalized structure representation derived from IUPAC conventions. It uses a hierarchical “layered” format in which each successive layer adds detail. The string is built from the following segments:

Main Layer: Chemical Formula
Connectivity Layer (/c): Atoms and bonds (excluding bond orders)
Hydrogen Layer (/h): Mobile (tautomeric) and immobile H atoms
Charge (/q) & Proton Balance (/p): Accounting for ionization
Stereochemistry:
- Double bond (/b) and Tetrahedral (/t) parity
- Parity inversion (/m)
- Stereo type (/s): absolute, relative, or racemic
Fixed-H Layer (/f): Distinguishes specific tautomers if needed

Because the layers are written in a fixed order, a molecule with unknown stereochemistry has an InChI whose layers are a subset of those for the same molecule with known stereochemistry. Two records can therefore be matched at the connectivity level even when one of them lacks complete stereochemical information.
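The layered format lends itself to a short illustration. The following is a minimal sketch, not the official parser: the function names are hypothetical, and it assumes a well-formed InChI string. The L-alanine string used in the example is the commonly listed standard InChI for that compound and is included only for illustration.

```python
from typing import List, Tuple

# Layer prefixes that carry stereochemistry (see the list above).
STEREO_PREFIXES = ("b", "t", "m", "s")


def split_layers(inchi: str) -> List[Tuple[str, str]]:
    """Split an InChI string into ordered (layer name, layer body) pairs."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len("InChI="):].split("/")
    layers = [("version", version)]
    for index, segment in enumerate(segments):
        # The formula is the only layer without a lowercase letter prefix.
        if index == 0 and not segment[:1].islower():
            layers.append(("formula", segment))
        else:
            layers.append((segment[:1], segment[1:]))
    return layers


def without_stereo(inchi: str) -> str:
    """Rebuild the InChI with the stereochemistry layers dropped."""
    kept = []
    for name, body in split_layers(inchi):
        if name in STEREO_PREFIXES:
            continue
        if name == "version":
            kept.append("InChI=" + body)
        elif name == "formula":
            kept.append(body)
        else:
            kept.append(name + body)
    return "/".join(kept)


# Illustrative input: L-alanine with full stereochemistry.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
FLAT_ALANINE = without_stereo(L_ALANINE)
```

In this example the stereo-free string is a prefix of the full one, which is what makes matching at the connectivity level possible. The sketch keeps layers as an ordered list because some prefixes can recur in InChIs with isotopic or Fixed-H layers.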

The InChIKey

Because InChI strings can be too long for search engines (which break at ~30 characters or at symbols like / and +), the InChIKey was created.

Mechanism: A 27-character string (25 letters plus two hyphens) derived from truncated SHA-256 hashes of the InChI string. Schematically:

$$ \text{InChIKey} = f_{\text{SHA-256}}(\text{InChI}) $$

Structure:

Block 1 (14 characters): Encodes the molecular skeleton (connectivity)
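Block 2 (10 characters): 8 hash characters encoding stereochemistry and isotopes, followed by a standard/non-standard flag (‘S’ or ‘N’) and a version character (‘A’ for Version 1)
Block 3 (1 character): Protonation flag (e.g., ‘N’ for neutral)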
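The block structure can be sketched as a small parser. This is an illustrative helper under stated assumptions, not part of the official InChI software: the names `InChIKeyParts`, `parse_inchikey`, and `same_skeleton` are hypothetical, and the layout assumed is 14 letters, a hyphen, 8 hash letters plus a standard flag and a version letter, a hyphen, and one protonation letter. The two example keys are the ones commonly listed for L-alanine and for alanine without stereochemistry; treat them as illustrative.

```python
import re
from typing import NamedTuple

# Assumed layout: AAAAAAAAAAAAAA-BBBBBBBBFV-P
_KEY_PATTERN = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


class InChIKeyParts(NamedTuple):
    skeleton: str      # block 1: hash of the connectivity
    detail: str        # first 8 letters of block 2: stereo/isotope hash
    standard: bool     # True if the flag letter is 'S' (standard InChI)
    version: str       # version letter, 'A' for Version 1
    protonation: str   # final letter, 'N' for neutral


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, rejecting malformed input."""
    match = _KEY_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"not a valid InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, detail, flag == "S", version, protonation)


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True if two keys share the connectivity block (block 1)."""
    return parse_inchikey(key_a).skeleton == parse_inchikey(key_b).skeleton


L_ALANINE_KEY = "QNAYBMKLOCPYGJ-REOHGBMASA-N"
FLAT_ALANINE_KEY = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"
parts = parse_inchikey(L_ALANINE_KEY)
```

Comparing only the first block gives a connectivity-level search that ignores stereochemistry, mirroring the layered matching of the full InChI string.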

Because the InChIKey is a hash, it cannot be converted back to a structure (irreversible) and has a theoretical risk of collision. It is important to distinguish between InChI collisions (which are due to flaws/bugs and are very rare) and InChIKey collisions (which are mathematically inevitable due to hashing).
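To make "mathematically inevitable" concrete, the collision risk can be estimated with the standard birthday approximation. The sketch below is a back-of-the-envelope calculation, not a figure from the paper. It assumes that the 14-letter first block carries 65 bits of hash, a number taken from InChI technical documentation rather than from the text summarized here, and that hash values are uniformly distributed. The function names are hypothetical.

```python
import math

# Assumption: the first block retains 65 bits of the SHA-256 hash.
SKELETON_BITS = 65


def expected_collisions(n_structures: int, bits: int = SKELETON_BITS) -> float:
    """Expected number of colliding pairs among n uniformly random hashes."""
    pairs = n_structures * (n_structures - 1) / 2
    return pairs / 2 ** bits


def structures_for_even_odds(bits: int = SKELETON_BITS) -> float:
    """Collection size at which a collision becomes about 50% likely.

    Uses the birthday approximation n ~ sqrt(2 * ln(2) * 2**bits).
    """
    return math.sqrt(2 * math.log(2) * 2 ** bits)


# Example: a hypothetical collection of one billion distinct skeletons.
estimate = expected_collisions(10 ** 9)
threshold = structures_for_even_odds()
```

Under these assumptions, a first-block collision stays unlikely for collections of up to a few billion distinct skeletons and becomes probable beyond that, which is why the key suits lookup and linking but cannot serve as a guarantee of identity.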

What experiments were performed?

As a systematization paper documenting an existing standard, the work reports no new experiments. The authors do, however, provide the following:

Validation evidence:

Certification Suite: A test suite that software vendors must pass to display the “InChI Certified” logo, preventing fragmentation
Round-trip conversion testing: Demonstrated >99% success rate converting InChI back to structure (100% with AuxInfo layer)
Real-world adoption metrics: Documented integration across major chemical databases and publishers

Known limitations identified:

Tautomer representation issues in Version 1 (different tautomeric drawings of the same compound can generate different InChIs)
Edge cases in stereochemistry representation

Institutional History & Governance

Origin: The project was initiated at a March 2000 IUPAC meeting in Washington, DC. It was originally called the IUPAC Chemical Identifier Project (IChIP).

Development: Technical work was done by NIST (Stein, Heller, Tchekhovskoi), overseen by the IUPAC CCINS committee, which later became the InChI Subcommittee of Division VIII.

The InChI Trust: To ensure the algorithm survived beyond a volunteer organization, the InChI Trust was formed in 2009. It is a UK charity supported by publishers and databases (e.g., Nature, RSC) to maintain the standard pre-competitively. This was a critical innovation: getting commercial publishers and software vendors to agree that a non-proprietary standard would benefit everyone.

Real-World Impact and Future Directions

Key Findings

Success through “un-coerced adoption”: InChI succeeded because commercial competitors viewed it as a “pre-competitive” necessity for the Internet age. The open governance model proved durable.

Technical achievements:

Reversible representation (>99% without AuxInfo, 100% with it)
Hierarchical structure enables flexible matching at different levels of detail
InChIKey enables web search despite being a hash (with inherent collision risk)

Limitations Acknowledged (as of 2013)

Tautomerism issues: Different tautomeric drawings of the same compound (e.g., 1,4-oxime vs. nitroso forms) can generate different InChIs in Version 1; a fix is targeted for Version 2
Hash collision risk: InChIKey collisions are mathematically inevitable due to SHA-256 hashing, though InChI collisions (actual bugs) are very rare
Certification required: To prevent fragmentation, software must pass the InChI Certification Suite

Future Directions

The authors note that while this paper documents the state as of 2013, InChI continues to evolve. Tautomer handling and edge cases in stereochemistry representation were priorities for future versions. The governance model through the InChI Trust was designed to ensure long-term maintenance beyond the original volunteer contributors.

Reproducibility Details

This systematization paper documents an existing standard. Key implementation resources are openly maintained by the InChI Trust.

Code & Software

Official Open Source Implementation: The C source code and pre-compiled binaries for the InChI algorithm are freely available via the InChI Trust Downloads Page and their official GitHub repository.
Canonicalization algorithm: Open-source implementation of IUPAC-based rules for generating unique representations from multiple possible drawings of the same molecule.

Data & Validation

InChI Certification Suite: A test suite of chemical structures provided by the InChI Trust used to validate that third-party software implementations generate correct InChIs.
Version 1 specification: Complete technical documentation of the layered format.

Evaluation

Round-trip conversion: >99% success rate (100% with AuxInfo) as validated by NIST and IUPAC.
Certification testing: Pass/fail validation for software claiming InChI compliance.

Paper Information

Citation: Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., & Pletnev, I. (2013). InChI: the worldwide chemical structure identifier standard. Journal of Cheminformatics, 5(1), 7. https://doi.org/10.1186/1758-2946-5-7

Publication: Journal of Cheminformatics, 2013

@article{heller2013inchi,
  title={InChI, the worldwide chemical structure identifier standard},
  author={Heller, Stephen and McNaught, Alan and Stein, Stephen and Tchekhovskoi, Dmitrii and Pletnev, Igor},
  journal={Journal of cheminformatics},
  volume={5},
  number={1},
  pages={1--9},
  year={2013},
  publisher={Springer}
}
