Paper Information
Citation: Blanke, G., Brammer, J., Baljozovic, D., Khan, N. U., Lange, F., Bänsch, F., Tovee, C. A., Schatzschneider, U., Hartshorn, R. M., & Herres-Pawlis, S. (2025). Making the InChI FAIR and sustainable while moving to inorganics. Faraday Discussions, 256(0), 503-519. https://doi.org/10.1039/D4FD00145A
Publication: Faraday Discussions, 2025
What kind of paper is this?
This is a Resource paper that describes the development and maintenance of InChI (International Chemical Identifier), a fundamental infrastructure component for chemical databases. While it includes methodological improvements to the canonicalization algorithm for inorganic compounds, its primary contribution is ensuring the sustainability and accessibility of a critical chemical informatics resource.
What is the motivation?
The International Chemical Identifier (InChI) is everywhere in chemistry databases, with over a billion structures using it. But there’s a problem: the system was designed for organic chemistry and breaks down on anything containing a metal. The original implementation had significant limitations:
- FAIR principles gap: Development was closed-source, documentation was inadequate, and the codebase was difficult to maintain
- Inorganic chemistry failure: Metal-ligand bonds were automatically disconnected, destroying stereochemical information for coordination complexes
- Technical debt: Thousands of bugs, security vulnerabilities, and an unmaintainable codebase
If you’ve ever tried to search for a metal complex in a chemical database and gotten nonsense results, this is why. This paper describes the fix.
What is the novelty here?
The key innovations are:
- Smart metal-ligand bond handling: A decision tree algorithm that uses coordination number and electronegativity to determine which bonds to keep and which to disconnect, preserving stereochemistry for coordination complexes
- Modernized development infrastructure: Migration to GitHub with open development, comprehensive testing, and maintainable documentation
- Backward compatibility: The core canonicalization algorithm remained unchanged, preserving over a billion existing InChIs for organic compounds
The preprocessing step applies an iterative decision tree for every metal in a structure:
- If coordination number exceeds standard valence: keep all bonds (bypass electronegativity check)
- If terminal metal: check electronegativity table (threshold: $\Delta EN = 1.7$)
- Include hardcoded exceptions for Grignard reagents and organolithium compounds
This means $\text{FeCl}_2$ gets disconnected (ionic) while $[\text{FeCl}_4]^{2-}$ stays connected (coordination complex).
What experiments were performed?
The paper focuses on software engineering validation rather than traditional chemical experiments:
- Bug fixing: Addressed thousands of bugs from the legacy codebase, including security vulnerabilities
- Backward compatibility testing: Verified that existing organic molecule InChIs remained unchanged
- Inorganic compound validation: Tested the new decision tree algorithm on coordination complexes, organometallic compounds, and ionic salts
- Documentation overhaul: Split technical documentation into Chemical Manual (for chemists) and Technical Manual (for developers)
The validation approach emphasizes maintaining the “same molecule, same identifier” principle while extending coverage to inorganic chemistry.
What outcomes/conclusions?
The v1.07 release successfully:
- Modernizes infrastructure: Open development on GitHub with maintainable codebase
- Extends to inorganic chemistry: Proper handling of coordination complexes and organometallic compounds
- Maintains backward compatibility: No breaking changes for existing organic compound InChIs
- Improves database search: Metal complexes now searchable with correct stereochemistry preserved
Acknowledged limitations for future work:
- Stereochemistry representation still needs improvement
- Mixtures (MInChI) and nanomaterials (NInChI) remain unsolved problems
- Chemical identifiers work best for discrete molecules, not variable-composition materials
Impact: This update should dramatically improve searchability of inorganic and organometallic compounds in major chemical databases, addressing a critical gap in computational chemistry workflows.
Reproducibility Details
Algorithms
The Metal Problem
InChI’s original algorithm assumed that bonds to metals were ionic and automatically disconnected them. This makes sense for something like sodium chloride (NaCl), where you have separate $\text{Na}^+$ and $\text{Cl}^-$ ions.
But it completely fails for:
- Coordination complexes: Where ligands are definitely bonded to the metal center
- Organometallic compounds: Where carbon-metal bonds are covalent
- Sandwich compounds: Like ferrocene, where the metal binds to a delocalized π system and the bonding is neither purely ionic nor purely covalent
The result: loss of stereochemical information and identical InChIs for structurally different compounds.
The Solution: Smart Preprocessing
The new system uses a decision tree to figure out which metal-ligand bonds to keep and which to disconnect. The process is iterative: it runs for every metal in the structure, then checks every bond to that metal.
Decision Tree Logic
Is the metal terminal? (connected to only one other atom)
- Yes: Check the electronegativity table to decide whether to disconnect
- No: Proceed to coordination number check
Coordination number check (for non-terminal metals):
- If coordination number > X (where X is the standard valence from a per-element lookup table): Keep the bond (bypasses electronegativity check)
- If coordination number ≤ X: Check the electronegativity table
Electronegativity table check:
- Calculate the electronegativity difference ($\Delta EN$) between the metal and the bonded atom
- Compare $\Delta EN$ against the threshold $Z = 1.7$ (Pauling scale)
- If $\Delta EN < 1.7$: Keep the bond (covalent character)
- If $\Delta EN > 1.7$: Disconnect the bond (ionic character)
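The decision tree above can be sketched in a few lines of Python. This is a minimal illustration, not the InChI implementation: the function name `decide_bond`, its parameters, and the string return values are hypothetical, and the caller supplies the standard valence and $\Delta EN$ because the actual lookup tables are not reproduced here. The summary does not say what happens at exactly $\Delta EN = 1.7$, so the sketch assumes the bond is kept in that case.

```python
def decide_bond(coordination_number: int, standard_valence: int,
                delta_en: float, threshold: float = 1.7) -> str:
    """Return "keep" or "disconnect" for one metal-ligand bond.

    Hypothetical helper: the real code reads standard valence and
    electronegativity from per-element tables; here the caller passes them in.
    """
    is_terminal = coordination_number == 1
    # Non-terminal metal whose coordination number exceeds its standard
    # valence: keep the bond and skip the electronegativity check.
    if not is_terminal and coordination_number > standard_valence:
        return "keep"
    # Terminal metal, or coordination number within the standard valence:
    # decide by the electronegativity difference.
    # Assumption: a difference of exactly `threshold` keeps the bond.
    return "disconnect" if delta_en > threshold else "keep"


# Na-Cl: terminal metal, Pauling difference 3.16 - 0.93
print(decide_bond(1, 1, 3.16 - 0.93))  # disconnect
# Metal with four ligands and an assumed (illustrative) standard valence of 3
print(decide_bond(4, 3, 1.0))          # keep
```

In a full preprocessing pass this function would be called once per bond, for every metal in the structure, which matches the iterative description above.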
Hardcoded Chemical Exceptions
The algorithm includes specific overrides based on well-established chemistry:
- Grignard reagents (RMgX): Explicitly configured to keep the Mg-C bond but disconnect the Mg-halide bond
- Organolithium compounds (RLi): Explicitly configured to keep the structure intact
These exceptions exist because the general electronegativity rules would give incorrect results for these compound classes.
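One way to picture these overrides is as a lookup that runs before the general rules and only falls through when no exception matches. The sketch below is an assumption about structure, not the actual InChI code: `exception_override` and `HALOGENS` are hypothetical names, and matching on element pairs alone is a simplification, since a real implementation would need to recognize the whole RMgX or RLi pattern.

```python
from typing import Optional

HALOGENS = {"F", "Cl", "Br", "I"}


def exception_override(metal: str, partner: str) -> Optional[str]:
    """Return "keep" or "disconnect" if a hardcoded exception applies, else None."""
    if metal == "Mg" and partner == "C":
        return "keep"        # Grignard reagent: the Mg-C bond stays
    if metal == "Mg" and partner in HALOGENS:
        return "disconnect"  # Grignard reagent: the Mg-halide bond is cut
    if metal == "Li" and partner == "C":
        return "keep"        # organolithium: the structure stays intact
    return None              # no exception: fall through to the general rules


print(exception_override("Mg", "Br"))  # disconnect
print(exception_override("Fe", "Cl"))  # None
```

Returning `None` for the no-match case keeps the exceptions separate from the coordination number and electronegativity checks, so the table of overrides can grow without touching the general logic.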
Practical Example
For example, $\text{FeCl}_2$ (probably ionic) gets disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ (definitely a coordination complex) stays connected because its coordination number exceeds the standard valence listed for iron.
How InChI Generation Works
The process has six main steps:
- Parse input: Read the structure from a file (Molfile, SDF, etc.)
- Convert to internal format: Transform into the software’s data structures
- Normalize: Standardize tautomers and resolve ambiguities (this is the step where the new metal rules apply)
- Canonicalize: Create a unique representation independent of atom numbering
- Generate InChI string: Build the layered text identifier
- Create InChIKey: Hash the full string into a 27-character key for databases
The InChI itself has separate layers for formula, connectivity, hydrogens, stereochemistry, isotopes, and charge. The InChIKey is what actually gets stored in databases for fast searching.
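To make the layered structure concrete, here is a small sketch that splits an InChI string into its layers using only the standard library. The function `split_inchi` and the `LAYER_NAMES` grouping are illustrative choices, not part of any InChI API; the grouping simply maps the one-letter layer prefixes onto the layer names used above. The example string is the standard InChI for ethanol.

```python
# Map one-letter layer prefixes onto the layer names used in this summary.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "charge (protons)",
    "b": "stereochemistry",
    "t": "stereochemistry",
    "m": "stereochemistry",
    "s": "stereochemistry",
    "i": "isotopes",
}


def split_inchi(inchi: str) -> list:
    """Split an InChI string into (layer name, content) pairs, in order."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    parts = body.split("/")
    layers = [("version", parts[0])]
    rest = parts[1:]
    # The formula layer carries no prefix letter, so it never starts lowercase.
    if rest and rest[0] and not rest[0][0].islower():
        layers.append(("formula", rest[0]))
        rest = rest[1:]
    for part in rest:
        # Unknown prefixes are passed through under their own letter.
        layers.append((LAYER_NAMES.get(part[:1], part[:1]), part[1:]))
    return layers


for name, content in split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"):
    print(f"{name}: {content}")
```

A list of pairs is used instead of a dictionary because the same prefix letter can appear more than once in longer identifiers, and a dictionary would silently overwrite the earlier entry.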
InChIKey Version Flag
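The flag character of the InChIKey indicates the version status. It is the ninth character of the second block, directly before the version letter, which makes it character 24 when the hyphens are counted: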
- “S”: Standard InChI
- “N”: Non-standard InChI
- “B”: Beta (experimental features)
This flag is important for anyone parsing InChIKeys programmatically, as it tells you whether the identifier was generated using stable or experimental algorithms.
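A minimal sketch of reading the flag programmatically is shown below. The function name `inchikey_status` and the `FLAG_MEANINGS` table are hypothetical. The sketch locates the flag structurally, as the ninth character of the second hyphen-separated block (the letter immediately before the version letter), rather than by a fixed absolute index. The example key is the widely published InChIKey for aspirin.

```python
FLAG_MEANINGS = {"S": "standard", "N": "non-standard", "B": "beta"}


def inchikey_status(key: str) -> str:
    """Return the status encoded by the flag character of an InChIKey."""
    blocks = key.split("-")
    # Expected layout: 14 letters, 10 letters, 1 letter (27 characters total).
    if [len(block) for block in blocks] != [14, 10, 1]:
        raise ValueError("unexpected InChIKey layout")
    flag = blocks[1][8]  # ninth character of the second block
    return FLAG_MEANINGS.get(flag, "unknown")


print(inchikey_status("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))  # standard
```

Checking the block lengths first means a truncated or malformed key raises an error instead of returning a misleading status.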
Additional Context
What InChI Actually Does
InChI creates a unique text string for any chemical structure. Unlike SMILES, which has multiple vendor implementations and can represent the same molecule in different ways, InChI is a single, standardized format controlled by IUPAC. The goal is simple: same molecule, same identifier, every time.
This matters for FAIR data principles:
- Findable: You can search for a specific compound across databases
- Accessible: The standard is open and free
- Interoperable: Different systems can connect chemical knowledge
- Reusable: The identifiers work consistently across platforms
Better Documentation
The technical manual is being split into two documents:
- Chemical Manual: For chemists who need to understand what InChIs mean
- Technical Manual: For developers who need to implement the algorithms
This addresses the problem that current documentation tries to serve both audiences without doing either particularly well.
The Bigger Picture
InChI’s evolution reflects chemistry’s expansion beyond its organic roots. The fact that it took this long to properly handle inorganic compounds shows how much computational chemistry has historically focused on carbon-based molecules.
As the field moves into catalysis, materials science, and coordination chemistry applications, having proper chemical identifiers becomes essential. You can’t build FAIR chemical databases if half of chemistry is represented incorrectly.
