Paper Information
Citation: Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., & Pletnev, I. (2013). InChI - the worldwide chemical structure identifier standard. Journal of Cheminformatics, 5(1), 7. https://doi.org/10.1186/1758-2946-5-7
Publication: Journal of Cheminformatics, 2013
What kind of paper is this?
This is a Resource & Systematization Paper that reviews the history, technical architecture, governance structure, and implementation status of the InChI standard. It documents both the institutional development of an open chemical identifier and the technical specification that enables it.
What is the motivation?
Before InChI, the chemistry community faced a fundamental interoperability problem. Chemical databases used proprietary systems like CAS Registry Numbers, or format-dependent representations like SMILES strings. These were expensive, restricted, and relied on “in-house” databases.
The authors argue the Internet and Open Source software acted as a “black swan” event that disrupted this status quo. The Internet created a need to link diverse, free and fee-based resources without a central gatekeeper. InChI was designed as the solution: a non-proprietary, open-source identifier enabling linking of distinct data compilations.
What is the novelty here?
InChI’s innovation is both technical and institutional:
Technical novelty: A hierarchical “layered” canonicalization system where structure representations build from basic connectivity to full stereochemistry. This allows flexible matching: a molecule with unknown stereochemistry produces an InChI whose layers are a subset of those for the same molecule with known stereochemistry.
Institutional novelty: Creating an open standard governed by a charitable trust (the InChI Trust) that convinced commercial competitors (publishers, databases) to adopt it as a “pre-competitive” necessity. This solved the political problem of maintaining an open standard in a competitive industry.
Technical Architecture: Layers and Hashing
The InChI String
InChI is not a registry number; it is a canonicalized structure representation derived from IUPAC conventions. It uses a hierarchical “layered” format where specific layers add detail. The exact technical specification includes these string segments:
- Main Layer: Chemical Formula
- Connectivity Layer (/c): Atoms and bonds (excluding bond orders)
- Hydrogen Layer (/h): Tautomeric and immobile H atoms
- Charge (/q) & Proton Balance (/p): Accounting for ionization
- Stereochemistry:
  - Double bond (/b) and Tetrahedral (/t) parity
  - Parity inversion (/m)
  - Stereo type (/s): absolute, relative, or racemic
- Fixed-H Layer (/f): Distinguishes specific tautomers if needed
This layered approach is clever: a molecule with unknown stereochemistry will have an InChI that’s a subset of the same molecule with known stereochemistry. This allows for flexible matching at the connectivity level even without complete stereochemical information.
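The layer structure can be sketched with plain string handling. The snippet below is a minimal illustration, not part of the InChI software: `parse_inchi` and `same_connectivity` are hypothetical helper names, and the parser assumes a simple standard InChI with a formula and no fixed-H (/f) or isotopic sublayers, where layer tags can repeat.

```python
# Stereo layer tags as listed above: double bond, tetrahedral, inversion, type.
STEREO_TAGS = {"b", "t", "m", "s"}


def parse_inchi(inchi: str) -> dict[str, str]:
    """Split a simple InChI string into layers (illustrative, not a validator)."""
    prefix, sep, body = inchi.partition("=")
    if prefix != "InChI" or not sep:
        raise ValueError("not an InChI string")
    # The first two segments are the version and the chemical formula.
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for segment in rest:
        # Each later layer starts with a one-letter tag such as c, h, t, m, s.
        layers[segment[0]] = segment[1:]
    return layers


def same_connectivity(a: str, b: str) -> bool:
    """Compare two InChIs while ignoring their stereochemistry layers."""
    def core(inchi: str) -> dict[str, str]:
        return {k: v for k, v in parse_inchi(inchi).items() if k not in STEREO_TAGS}
    return core(a) == core(b)


l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
alanine_no_stereo = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

print(parse_inchi(l_alanine))
print(same_connectivity(l_alanine, alanine_no_stereo))
```

Here the stereo-free alanine string contains exactly the layers of the L-alanine string minus /t, /m, and /s, which is the subset relationship the layered design relies on.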
The InChIKey
Because InChI strings can be too long for search engines (which break at ~30 characters or at symbols like / and +), the InChIKey was created.
Mechanism: A 27-character string generated via SHA-256 hash of the InChI string.
Structure:
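- Block 1 (14 characters): Encodes the molecular skeleton (connectivity)
- Block 2 (8 hash characters plus 2 flag characters): The hash characters encode stereochemistry and isotopes; the flags mark standard vs. non-standard InChI and the InChI version
- Block 3 (1 character): Protonation flag (e.g., ‘N’ for neutral)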
Trade-off: InChIKey is a hash, so it cannot be converted back to a structure (irreversible) and has a theoretical risk of collision. It is important to distinguish between InChI collisions (which are due to flaws/bugs and are very rare) and InChIKey collisions (which are mathematically inevitable due to hashing).
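As a rough sketch of how the blocks can be used, the snippet below splits an InChIKey into its parts and compares skeletons. It assumes the layout 14 letters, hyphen, 8 hash letters plus a standard flag and a version letter, hyphen, one protonation letter; `split_inchikey` and `same_skeleton` are hypothetical helper names. It only parses existing keys and does not compute the hash.

```python
import re

# Assumed layout: 14 letters - 8 letters + standard flag + version - protonation.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


def split_inchikey(key: str) -> dict[str, str]:
    """Split an InChIKey into named parts; raises ValueError on a bad shape."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    skeleton, detail, standard_flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,            # hash of the connectivity layers
        "detail": detail,                # hash of stereo/isotope layers
        "standard_flag": standard_flag,  # 'S' for standard InChI
        "version": version,              # 'A' for InChI version 1
        "protonation": protonation,      # 'N' for neutral
    }


def same_skeleton(a: str, b: str) -> bool:
    """True when two keys share the connectivity block."""
    return split_inchikey(a)["skeleton"] == split_inchikey(b)["skeleton"]


l_alanine_key = "QNAYBMKLOCPYGJ-REOHGBFMSA-N"
alanine_no_stereo_key = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"
ethanol_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

print(split_inchikey(l_alanine_key))
print(same_skeleton(l_alanine_key, alanine_no_stereo_key))
print(same_skeleton(l_alanine_key, ethanol_key))
```

Matching on the first block alone gives a connectivity-level search across stereoisomers, which mirrors the layered matching of the full InChI string.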
What experiments were performed?
This is a systematization paper documenting an existing standard, not an experimental research paper. However, the authors provide:
Validation evidence:
- Certification Suite: A test suite that software vendors must pass to display the “InChI Certified” logo, preventing fragmentation
- Round-trip conversion testing: Demonstrated >99% success rate converting InChI back to structure (100% with AuxInfo layer)
- Real-world adoption metrics: Documented integration across major chemical databases and publishers
Known limitations identified:
- Tautomer representation issues in Version 1 (different drawings of the same tautomer can generate different InChIs)
- Edge cases in stereochemistry representation
Institutional History & Governance
Origin: The project was initiated at a March 2000 IUPAC meeting in Washington, DC. It was originally called the IUPAC Chemical Identifier Project (IChIP).
Development: Technical work was done by NIST (Stein, Heller, Tchekhovskoi), overseen by the IUPAC CCINS committee, which later became the InChI Subcommittee of Division VIII.
The InChI Trust: To ensure the algorithm survived beyond a volunteer organization, the InChI Trust was formed in 2009. It is a UK charity supported by publishers and databases (e.g., Nature, RSC) to maintain the standard pre-competitively. This was a critical innovation: getting commercial publishers and software vendors to agree that a non-proprietary standard would benefit everyone, rather than trying to lock in users with proprietary formats.
What outcomes/conclusions?
Key Findings
Success through “un-coerced adoption”: Unlike failed standardization efforts (e.g., metric system in the US), InChI succeeded because commercial competitors viewed it as a “pre-competitive” necessity for the Internet age. The open governance model proved durable.
Technical achievements:
- Reversible representation (>99% without AuxInfo, 100% with it)
- Hierarchical structure enables flexible matching at different levels of detail
- InChIKey enables web search despite being a hash (with inherent collision risk)
Limitations Acknowledged (as of 2013)
- Tautomerism Issues: Different drawings of the same tautomer (e.g., 1,4-oxime vs nitroso) can generate different InChIs in Version 1; a fix was targeted for Version 2
- Hash collision risk: InChIKey collisions are mathematically inevitable due to SHA-256 hashing, though InChI collisions (actual bugs) are very rare
- Certification required: To prevent fragmentation, software must pass the InChI Certification Suite
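Why collisions are unavoidable for any fixed-length hash can be shown with a toy pigeonhole experiment. The sketch below is not the InChIKey algorithm: it truncates SHA-256 to 2 bytes (a deliberately tiny size chosen for illustration, far smaller than what an InChIKey retains) and hashes made-up placeholder strings until two of them share a truncated digest.

```python
import hashlib


def truncated_digest(text: str, n_bytes: int = 2) -> bytes:
    """First n_bytes of the SHA-256 digest of text (toy stand-in for a short key)."""
    return hashlib.sha256(text.encode("utf-8")).digest()[:n_bytes]


def find_collision(n_bytes: int = 2) -> tuple[str, str]:
    """Return two distinct inputs whose truncated digests are equal."""
    seen: dict[bytes, str] = {}
    # Pigeonhole: with 256**n_bytes possible digests, one more input than
    # that is guaranteed to repeat a digest.
    limit = 256 ** n_bytes + 1
    for i in range(limit):
        text = f"structure-{i}"  # hypothetical placeholder, not a real InChI
        digest = truncated_digest(text, n_bytes)
        if digest in seen:
            return seen[digest], text
        seen[digest] = text
    raise AssertionError("unreachable: pigeonhole guarantees a collision")


first, second = find_collision()
print(first, second, truncated_digest(first).hex())
```

The two inputs differ and their full SHA-256 digests differ, yet the shortened values agree. A longer key makes such an event far less likely but can never rule it out, which is why an InChIKey match should be confirmed against the full InChI when it matters.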
Future Directions
The authors note that while this paper documents the state as of 2013, InChI continues to evolve. Tautomer handling and edge cases in stereochemistry representation were priorities for future versions. The governance model through the InChI Trust was designed to ensure long-term maintenance beyond the original volunteer contributors.
Reproducibility Details
This systematization paper documents an existing standard rather than presenting novel experimental results. Key implementation resources:
Data
- InChI Certification Suite: A test suite of chemical structures used to validate software implementations
- Version 1 specification: Complete technical documentation of the layered format
Algorithms
- Canonicalization algorithm: IUPAC-based rules for generating unique representations from multiple possible drawings of the same molecule
- InChIKey generation: SHA-256 hash of InChI string, structured as 14-character (connectivity) + 8-character (stereochemistry) + version/protonation flags
Evaluation
- Round-trip conversion: >99% success rate (100% with AuxInfo)
- Adoption metrics: Integration across major publishers (Nature, RSC) and databases by 2013
- Certification testing: Pass/fail validation for software claiming InChI compliance
