Paper Contribution: Modernizing Chemical Identifiers

This is a Resource paper that describes the development and maintenance of InChI (International Chemical Identifier), a fundamental infrastructure component for chemical databases. While it includes methodological improvements to the canonicalization algorithm for inorganic compounds, its primary contribution is ensuring the sustainability and accessibility of a critical chemical informatics resource.

Motivation: The Inorganic Chemistry Problem

The International Chemical Identifier (InChI) is used throughout chemistry databases, with over a billion structures carrying one. The original system was designed for organic chemistry and systematically mishandled organometallic structures. It had three significant limitations:

  • FAIR principles gap: Development was closed-source, documentation was inadequate, and the codebase was difficult to maintain
  • Inorganic chemistry failure: Metal-ligand bonds were automatically disconnected, destroying stereochemical information for coordination complexes
  • Technical debt: Thousands of bugs, security vulnerabilities, and an unmaintainable codebase

If you’ve ever tried to search for a metal complex in a chemical database and gotten nonsense results, this is why. This paper describes the fix.

Core Innovation: Smart Metal-Ligand Handling

The key innovations are:

  1. Smart metal-ligand bond handling: A decision tree algorithm that uses coordination number and electronegativity to determine which bonds to keep and which to disconnect, preserving stereochemistry for coordination complexes

  2. Modernized development infrastructure: Migration to GitHub with open development, comprehensive testing, and maintainable documentation

  3. Backward compatibility: The core canonicalization algorithm remained unchanged, preserving over a billion existing InChIs for organic compounds

The preprocessing step applies an iterative decision tree for every metal in a structure:

  • If coordination number exceeds standard valence: keep all bonds (bypass electronegativity check)
  • If terminal metal: check electronegativity table (threshold: $\Delta EN = 1.7$)
  • Include hardcoded exceptions for Grignard reagents and organolithium compounds

For example, $\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected as a coordination complex.

Validation Methods & Experiments

The paper focuses on software engineering validation:

  • Bug fixing: Addressed thousands of bugs from the legacy codebase, including security vulnerabilities
  • Backward compatibility testing: Verified that existing organic molecule InChIs remained unchanged
  • Inorganic compound validation: Tested the new decision tree algorithm on coordination complexes, organometallic compounds, and ionic salts
  • Documentation overhaul: Split technical documentation into Chemical Manual (for chemists) and Technical Manual (for developers)

The validation approach emphasizes maintaining the “same molecule, same identifier” principle while extending coverage to inorganic chemistry.

Key Outcomes and Future Work

The v1.07 release successfully:

  • Modernizes infrastructure: Open development on GitHub with maintainable codebase
  • Extends to inorganic chemistry: Proper handling of coordination complexes and organometallic compounds
  • Maintains backward compatibility: No breaking changes for existing organic compound InChIs
  • Improves database search: Metal complexes now searchable with correct stereochemistry preserved

Acknowledged limitations for future work:

  • Stereochemistry representation still needs improvement
  • Mixtures (MInChI) and nanomaterials (NInChI) remain unsolved problems
  • Chemical identifiers work best for discrete molecules and struggle with variable-composition materials

Impact: This update should dramatically improve searchability of inorganic and organometallic compounds in major chemical databases, addressing a critical gap in computational chemistry workflows.

Reproducibility Details

Software & Data Availability

The InChI v1.07 codebase, primarily written in C/C++, is openly available on GitHub at IUPAC-InChI/InChI. The repository includes the core canonicalization engine and the new inorganic preprocessing logic. Both the Technical Manual (for structural integration) and the Chemical Manual are maintained alongside the codebase.

Benchmarking Data: The new decision tree logic is validated through unit tests that run in the repository’s continuous integration pipelines. Tests on existing organic compounds confirm backward compatibility, while newly added suites of coordination complexes and organometallic compounds check that the v1.07 preprocessing behaves as expected.
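The backward-compatibility check described above can be pictured as a small regression test. The sketch below is hypothetical: `find_regressions` and `stub_generator` are illustrative names, not part of the InChI repository or its test suite. Only the three reference strings (methane, water, ethanol) are real standard InChIs.

```python
from typing import Callable, Dict, List, Tuple

# Reference identifiers as produced by an earlier release (real standard InChIs).
REFERENCE_INCHIS: Dict[str, str] = {
    "methane": "InChI=1S/CH4/h1H4",
    "water": "InChI=1S/H2O/h1H2",
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}


def find_regressions(
    reference: Dict[str, str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (name, expected, actual) for every compound whose InChI changed."""
    regressions = []
    for name, expected in sorted(reference.items()):
        actual = generate(name)
        if actual != expected:
            regressions.append((name, expected, actual))
    return regressions


def stub_generator(name: str) -> str:
    """Hypothetical stand-in for a call into the new InChI build."""
    return REFERENCE_INCHIS[name]


if __name__ == "__main__":
    # An empty list means no organic identifier changed between releases.
    print(find_regressions(REFERENCE_INCHIS, stub_generator))
```

In a real pipeline the stub would be replaced by a call into the compiled library, and the reference table by a large corpus of previously generated identifiers.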

Algorithms

The Metal Problem

InChI’s original algorithm assumed that bonds to metals were ionic and automatically disconnected them. This makes sense for something like sodium chloride (NaCl), where you have separate $\text{Na}^+$ and $\text{Cl}^-$ ions.

It fails for:

  • Coordination complexes: Where ligands are bonded to the metal center
  • Organometallic compounds: Where carbon-metal bonds are covalent
  • Sandwich compounds: Like ferrocene, where the bonding has both ionic and covalent character

The result: loss of stereochemical information and identical InChIs for structurally different compounds.
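A toy model makes the information loss concrete. The sketch below assumes a square-planar metal center whose four ligand positions are numbered 0 to 3 around the metal, so that positions 0/2 and 1/3 are trans to each other. The data model and function names are illustrative and are not taken from the InChI code.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (metal, ligand at each square-planar position 0..3 going around the metal)
Complex = Tuple[str, Dict[int, str]]

CISPLATIN: Complex = ("Pt", {0: "Cl", 1: "Cl", 2: "NH3", 3: "NH3"})
TRANSPLATIN: Complex = ("Pt", {0: "Cl", 1: "NH3", 2: "Cl", 3: "NH3"})


def disconnect_all(complex_: Complex) -> Counter:
    """Mimic unconditional disconnection: only the bag of fragments survives."""
    metal, ligands = complex_
    return Counter([metal, *ligands.values()])


def trans_pairs(complex_: Complex) -> List[Tuple[str, str]]:
    """Geometry-aware description: which ligands sit opposite each other."""
    _, ligands = complex_
    pairs = [(ligands[0], ligands[2]), (ligands[1], ligands[3])]
    return sorted((min(a, b), max(a, b)) for a, b in pairs)


if __name__ == "__main__":
    # Same fragments after disconnection, different geometry before it.
    print(disconnect_all(CISPLATIN) == disconnect_all(TRANSPLATIN))
    print(trans_pairs(CISPLATIN), trans_pairs(TRANSPLATIN))
```

Once every metal-ligand bond is cut, the two isomers reduce to the same set of fragments, so any identifier built from those fragments cannot tell them apart.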

The Solution: Smart Preprocessing

The new system uses a decision tree to figure out which metal-ligand bonds to keep and which to disconnect. The process is iterative: it runs for every metal in the structure, then checks every bond to that metal. In the C/C++ repository, this preprocessing logic acts as a filter applied before the traditional organic canonicalization engine (from v1.06) runs, dynamically determining whether coordination bonds are retained for downstream layer generation.

Decision Tree Logic

The algorithm determines the bond state $B(m, l)$ between a metal $m$ and ligand $l$ based on coordination number $CN(m)$, standard valence $V(m)$, and Pauling electronegativity $EN$:

$$ B(m, l) = \begin{cases} \text{Connected} & \text{if } CN(m) > V(m) \\ \text{Connected} & \text{if } CN(m) \leq V(m) \text{ and } |EN(m) - EN(l)| < 1.7 \\ \text{Disconnected} & \text{if } CN(m) \leq V(m) \text{ and } |EN(m) - EN(l)| \geq 1.7 \end{cases} $$

(Note: Explicit overrides exist for specific classes such as Grignard reagents and organolithium compounds.)
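The rule can be sketched in a few lines of Python. This is a minimal illustration, not the v1.07 implementation: the function name `bond_state` is made up, the electronegativities are Pauling values for two elements only, and the standard valences are assumed values chosen for the example. The tables InChI actually uses are not reproduced here, so results for individual compounds may differ from those of the real software.

```python
from typing import Dict

# Pauling electronegativities (deliberately minimal, illustrative table).
ELECTRONEGATIVITY: Dict[str, float] = {"Na": 0.93, "Cl": 3.16}

# ASSUMED standard valences, chosen only to drive the examples below.
STANDARD_VALENCE: Dict[str, int] = {"Na": 1, "Fe": 3}

EN_THRESHOLD = 1.7  # threshold on the electronegativity difference


def bond_state(metal: str, ligand: str, coordination_number: int) -> str:
    """Return 'Connected' or 'Disconnected' for one metal-ligand bond."""
    # Branch 1: coordination number exceeds standard valence, so every
    # bond is kept and the electronegativity check is bypassed.
    if coordination_number > STANDARD_VALENCE[metal]:
        return "Connected"
    # Branch 2: otherwise decide on the electronegativity difference.
    delta = abs(ELECTRONEGATIVITY[metal] - ELECTRONEGATIVITY[ligand])
    return "Connected" if delta < EN_THRESHOLD else "Disconnected"


if __name__ == "__main__":
    print(bond_state("Na", "Cl", coordination_number=1))  # Disconnected
    print(bond_state("Fe", "Cl", coordination_number=4))  # Connected
```

For Na-Cl the difference is 3.16 - 0.93 = 2.23, which is above the threshold, so the bond is cut. The four-coordinate iron center never reaches the electronegativity check because its coordination number already exceeds the assumed valence.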

Hardcoded Chemical Exceptions

The algorithm includes specific overrides based on well-established chemistry:

  • Grignard reagents (RMgX): Explicitly configured to keep the Mg-C bond but disconnect the Mg-halide bond
  • Organolithium compounds (RLi): Explicitly configured to keep the structure intact

These exceptions exist because the general electronegativity rules would give incorrect results for these compound classes.
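One way to picture such overrides is as a lookup that runs before the general rules and either forces a bond state or declines to decide. The sketch below is an assumption about structure, not a description of the actual code: the source does not say how the compound classes are recognized, so the neighbor-based pattern match and the name `exception_override` are hypothetical.

```python
from typing import List, Optional

HALOGENS = {"F", "Cl", "Br", "I"}


def exception_override(metal: str, ligand: str, neighbors: List[str]) -> Optional[str]:
    """Return a forced bond state, or None to fall through to the general rules.

    `neighbors` lists the element symbols of all atoms bonded to the metal.
    """
    # Grignard pattern RMgX (assumed test): Mg bonded to a carbon and a halogen.
    is_grignard = (
        metal == "Mg"
        and "C" in neighbors
        and any(n in HALOGENS for n in neighbors)
    )
    if is_grignard:
        if ligand == "C":
            return "Connected"      # keep the Mg-C bond
        if ligand in HALOGENS:
            return "Disconnected"   # cut the Mg-halide bond
    # Organolithium pattern RLi: keep the Li-C bond.
    if metal == "Li" and ligand == "C":
        return "Connected"
    return None


if __name__ == "__main__":
    print(exception_override("Mg", "C", ["C", "Br"]))
    print(exception_override("Mg", "Br", ["C", "Br"]))
    print(exception_override("Na", "Cl", ["Cl"]))
```

Returning `None` for everything else keeps the exceptions separate from the coordination-number and electronegativity checks, which only run when no override applies.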

Practical Example

$\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected because its coordination number of four exceeds the standard valence $V(m)$ of iron.

How InChI Generation Works

The process has six main steps:

  1. Parse input: Read the structure from a file (Molfile, SDF, etc.)
  2. Convert to internal format: Transform into the software’s data structures
  3. Normalize: Standardize tautomers and resolve ambiguities (this is the step where the new metal rules apply)
  4. Canonicalize: Create a unique representation independent of atom numbering
  5. Generate InChI string: Build the layered text identifier
  6. Create InChIKey: Hash the full string into a 27-character key for databases
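Step 4 is the heart of the "same molecule, same identifier" idea. The sketch below illustrates the concept with a brute-force search over all atom orderings of a tiny heavy-atom graph. It is not InChI's canonicalization algorithm, which is far more efficient; the graph encoding and the name `canonical_form` are illustrative.

```python
from itertools import permutations
from typing import List, Tuple

# (element symbol per atom, list of bonds as pairs of atom indices)
Graph = Tuple[List[str], List[Tuple[int, int]]]


def canonical_form(graph: Graph):
    """Return the lexicographically smallest relabeling of the graph."""
    elements, bonds = graph
    n = len(elements)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] is the new index assigned to that atom.
        new_elements = [""] * n
        for old, new in enumerate(perm):
            new_elements[new] = elements[old]
        new_bonds = sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b])) for a, b in bonds
        )
        candidate = (tuple(new_elements), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


# Ethanol drawn with two different atom numberings (heavy atoms only).
ETHANOL_A: Graph = (["C", "C", "O"], [(0, 1), (1, 2)])
ETHANOL_B: Graph = (["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether has the same formula but a different connectivity.
DIMETHYL_ETHER: Graph = (["C", "O", "C"], [(0, 1), (1, 2)])

if __name__ == "__main__":
    print(canonical_form(ETHANOL_A) == canonical_form(ETHANOL_B))
    print(canonical_form(ETHANOL_A) == canonical_form(DIMETHYL_ETHER))
```

Both ethanol numberings collapse to one representation, while the isomer does not. The factorial cost of trying every ordering is the reason production canonicalizers rely on refinement techniques instead.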

The InChI itself has separate layers for formula, connectivity, hydrogens, stereochemistry, isotopes, and charge. The InChIKey is what actually gets stored in databases for fast searching.
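The layered structure is visible in the string itself: layers are separated by `/`, the first segment after the version is the formula, and later segments start with a one-letter prefix. The following sketch splits a standard InChI into named layers. The function name `split_inchi` and the human-readable layer labels are illustrative, and the prefix table covers only the common layers; fixed-hydrogen and reconnected layers, which can repeat prefixes, are not handled.

```python
from typing import Dict

# One-letter layer prefixes for the common layers (labels are informal).
LAYER_NAMES: Dict[str, str] = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion marker",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi(inchi: str) -> Dict[str, str]:
    """Split an InChI string into a dict of layer name -> layer content."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError(f"not an InChI string: {inchi!r}")
    version, *segments = inchi[len(prefix):].split("/")
    layers = {"version": version}
    # The formula layer has no prefix letter; prefixed layers start lowercase.
    if segments and not segments[0][:1].islower():
        layers["formula"] = segments.pop(0)
    for segment in segments:
        key, body = segment[:1], segment[1:]
        layers[LAYER_NAMES.get(key, key)] = body
    return layers


if __name__ == "__main__":
    # Standard InChI of ethanol.
    print(split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

For ethanol this yields the version `1S`, the formula `C2H6O`, the connectivity `1-2-3`, and the hydrogen layer `3H,2H2,1H3`.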

InChIKey Version Flag

Character 24 of the 27-character InChIKey (counting hyphens; the ninth character of the second block) indicates the version status:

  • “S”: Standard InChI
  • “N”: Non-standard InChI
  • “B”: Beta (experimental features)

This flag is important for anyone parsing InChIKeys programmatically, as it tells you whether the identifier was generated using stable or experimental algorithms.
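A parser can read the flag by splitting the key into its three hyphen-separated blocks rather than counting characters by hand. The sketch below assumes the usual 14-10-1 block layout, with the flag as the second-to-last character of the middle block. The function name `inchikey_flag` is illustrative, and the "B" entry follows the list above.

```python
FLAG_MEANINGS = {
    "S": "standard",
    "N": "non-standard",
    "B": "beta",
}


def inchikey_flag(key: str) -> str:
    """Return the meaning of the flag character in an InChIKey."""
    blocks = key.split("-")
    if [len(block) for block in blocks] != [14, 10, 1]:
        raise ValueError(f"unexpected InChIKey layout: {key!r}")
    flag = blocks[1][8]  # ninth character of the second block
    if flag not in FLAG_MEANINGS:
        raise ValueError(f"unknown flag character {flag!r} in {key!r}")
    return FLAG_MEANINGS[flag]


if __name__ == "__main__":
    # Standard InChIKey of water.
    print(inchikey_flag("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))  # standard
```

Validating the block lengths first keeps truncated or malformed keys from being silently misread.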

Additional Context

What InChI Actually Does

InChI derives a unique text string from a chemical structure. Unlike SMILES, which has multiple vendor implementations and can represent the same molecule in different ways, InChI provides a single, standardized format controlled by IUPAC. The goal is simple: same molecule, same identifier, every time.

This matters for FAIR data principles:

  • Findable: You can search for a specific compound across databases
  • Accessible: The standard is open and free
  • Interoperable: Different systems can connect chemical knowledge
  • Reusable: The identifiers work consistently across platforms

Better Documentation

The technical manual is being split into two documents:

  • Chemical Manual: For chemists who need to understand what InChIs mean
  • Technical Manual: For developers who need to implement the algorithms

This addresses the problem of current documentation serving both audiences poorly.

The Bigger Picture

InChI’s evolution reflects chemistry’s expansion beyond its organic roots. The fact that it took this long to properly handle inorganic compounds shows how much computational chemistry has historically focused on carbon-based molecules.

As the field moves into catalysis, materials science, and coordination chemistry applications, having proper chemical identifiers becomes essential. You can’t build FAIR chemical databases if half of chemistry is represented incorrectly.

Paper Information

Citation: Blanke, G., Brammer, J., Baljozovic, D., Khan, N. U., Lange, F., Bänsch, F., Tovee, C. A., Schatzschneider, U., Hartshorn, R. M., & Herres-Pawlis, S. (2025). Making the InChI FAIR and sustainable while moving to inorganics. Faraday Discussions, 256, 503–519. https://doi.org/10.1039/D4FD00145A

Publication: Faraday Discussions, 2025

@article{blanke2025making,
  title={Making the InChI FAIR and sustainable while moving to inorganics},
  author={Blanke, G. and Brammer, J. and Baljozovic, D. and Khan, N. U. and Lange, F. and B{\"a}nsch, F. and Tovee, C. A. and Schatzschneider, U. and Hartshorn, R. M. and Herres-Pawlis, S.},
  journal={Faraday Discussions},
  volume={256},
  pages={503--519},
  year={2025},
  publisher={Royal Society of Chemistry}
}