Paper Contribution: A Systematized Tautomer Database Resource

This is a Resource paper with strong Systematization elements. It provides a comprehensive catalog of 86 tautomeric transformation rules derived from experimental literature, designed to serve as a foundational resource for chemical database systems and the InChI V2 identifier standard. The systematic validation across 400+ million structures also makes it a benchmarking study for evaluating current chemoinformatics tools.

The Tautomerism Problem in Chemical Databases

Chemical databases face a fundamental problem: the same molecule can appear multiple times under different identifiers simply because it exists in different tautomeric forms. For example, the ring-closed and open-chain forms of glucose are the same compound, yet current chemical identifiers (including InChI) often treat them as distinct.

Figure: Ring-chain tautomerism in glucose. The open-chain aldehyde form of D-glucose (left) and the cyclic beta-D-glucopyranose form (right) are the same molecule in different tautomeric states.

This creates three critical problems:

  1. Database redundancy: Millions of duplicate entries for the same chemical entities
  2. Search failures: Researchers miss relevant compounds during structure searches
  3. ML training issues: Machine learning models learn to treat tautomers as different molecules

This work aims to provide a comprehensive, experimentally grounded rule set that enables InChI V2 to properly recognize tautomeric relationships, eliminating these problems at the identifier level.

86 Comprehensive Tautomeric Transformation Rules

The key contributions are:

  1. Comprehensive Rule Set: Development of 86 tautomeric transformation rules based on experimental literature, categorized into:

    • 54 Prototropic rules (classic H-movement tautomerism)
    • 21 Ring-Chain rules (cyclic/open-chain transformations)
    • 11 Valence rules (structural rearrangements with valence changes)
  2. Massive-Scale Validation: Testing these rules against nine major chemical databases totaling over 400 million structures to identify coverage gaps in current InChI implementations

  3. Quantitative Assessment: Systematic measurement showing that current InChI, even with flexible settings, achieves only ~50% success in recognizing tautomeric relationships, with many of the newly defined rules showing <2% success rates

  4. Practical Tools: Creation of the Tautomerizer web tool for public use, demonstrating practical application of the rule set

The novelty lies in the systematic compilation and validation of transformation rules at a scale that reveals critical gaps in current chemical identification systems.

Massive-Scale Validation Across 400M+ Structures

Database Analysis

The researchers analyzed 9 chemical databases totaling 400+ million structures:

  • Public databases: PubChem (largest), ChEMBL, DrugBank, PDB Ligands, SureChEMBL, AMS, ChemNavigator
  • Private databases: CSD (Cambridge Structural Database), CSDB (NCI internal)

Methodology

Software: CACTVS Chemoinformatics Toolkit (versions 3.4.6.33 and 3.4.8.6)

Tautomer Generation Protocol:

  • Algorithm: Single-step generation (apply transforms to input structure only, avoiding recursion)
  • Constraints: Max 10 tautomers per structure, 30-second CPU timeout per transform
  • Format: All rules expressed as SMIRKS strings
  • Stereochemistry: Stereocenters involved in tautomerism were flattened during transformation
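
The single-step protocol above can be illustrated with a minimal sketch. It is not the authors' CACTVS implementation: structures are plain strings, each transform is a hypothetical Python callable standing in for a SMIRKS rule, the cap of 10 is assumed to include the input structure, and the 30-second CPU timeout is not modelled.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-in for a SMIRKS rule: maps one structure to its products.
Transform = Callable[[str], Iterable[str]]

MAX_TAUTOMERS = 10  # cap from the protocol; assumed here to include the input


def single_step_tautomers(structure: str,
                          transforms: List[Transform],
                          max_tautomers: int = MAX_TAUTOMERS) -> List[str]:
    """Apply every transform to the input only, never to generated products."""
    seen = {structure}
    results = [structure]
    for transform in transforms:
        for product in transform(structure):
            if product in seen:
                continue  # skip duplicates produced by different rules
            seen.add(product)
            results.append(product)
            if len(results) >= max_tautomers:
                return results  # stop as soon as the cap is reached
    return results


# Toy transforms on string labels (illustrative only, not real chemistry).
def toy_keto_enol(s: str) -> List[str]:
    return ["enol"] if s == "keto" else []


def toy_enol_followup(s: str) -> List[str]:
    return ["second_generation"] if s == "enol" else []


def toy_many(s: str) -> List[str]:
    return [f"{s}_{i}" for i in range(20)]


no_recursion = single_step_tautomers("keto", [toy_keto_enol, toy_enol_followup])
capped = single_step_tautomers("keto", [toy_keto_enol, toy_many])
```

In this sketch `no_recursion` contains only the input and its direct product, because `toy_enol_followup` is never applied to the generated "enol" form, and `capped` stops at ten entries.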

Success Metrics:

  • Complete InChI match: All tautomers share identical InChI
  • Partial InChI match: At least two tautomers share an InChI
  • Tested against three InChI configurations: Standard, Nonstandard (15T), Nonstandard (15T + KET)

Rule Coverage Analysis

For each of the 86 rules, the researchers:

  1. Applied the transformation to all molecules in each database
  2. Generated tautomers using the SMIRKS patterns
  3. Computed InChI identifiers for each tautomer
  4. Measured success rates (percentage of cases where InChI recognized the relationship)
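
Step 4 amounts to a per-rule aggregation. The following sketch assumes each case has already been reduced to a `(rule_id, recognized)` pair; how "recognized" is decided (for example, whether a partial match counts) is left to the caller, and the sample records are invented for illustration rather than taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_rule(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return, per rule, the percentage of cases flagged as recognized."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for rule_id, recognized in records:
        totals[rule_id] += 1
        if recognized:
            hits[rule_id] += 1
    return {rule: 100.0 * hits[rule] / totals[rule] for rule in totals}


# Invented sample data: two cases for one rule, two for another.
sample_records = [
    ("PT_06_00", True),
    ("PT_06_00", False),
    ("VT_02_00", False),
    ("VT_02_00", False),
]
sample_rates = success_rate_by_rule(sample_records)
```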

Key Findings from Experiments

Rule Frequency: The most common rule PT_06_00 (1,3-heteroatom H-shift, covering keto-enol tautomerism) affects >70% of molecules across databases.

InChI Performance:

  • Standard InChI: ~37% success rate
  • Nonstandard InChI (15T): ~50% success rate
  • Many newly defined rules: <2% success rate

Scale Impact: Implementing the full 86-rule set would triple the number of compounds recognized as having tautomeric relationships compared to current InChI.

Outcomes: InChI V2 Requirements and Coverage Gaps

Main Findings

  1. Current Systems Are Inadequate: Even the most flexible InChI settings only achieve ~50% success in recognizing tautomeric relationships, with Standard InChI at ~37%

  2. Massive Coverage Gap: The new rule set reveals millions of tautomeric relationships that current InChI completely misses, particularly for ring-chain and valence tautomerism

  3. Implementation Requirement: InChI V2 will require a major redesign to handle the comprehensive rule set

  4. Rule Validation: The 86-rule set, derived from experimental literature, provides a validated foundation for next-generation chemical identifiers

Implications

For Chemical Databases:

  • Reduced redundancy through proper tautomer recognition
  • Improved data quality and consistency
  • More comprehensive structure search results

For Machine Learning:

  • More accurate training data (tautomers properly grouped)
  • Better molecular property prediction models
  • Reduced dataset bias from tautomeric duplicates

For Chemoinformatics Tools:

  • Blueprint for InChI V2 development
  • Standardized rule set for tautomer generation
  • Public tool (Tautomerizer) for practical use

Limitations Acknowledged

  • Single-step generation only (omits recursive enumeration of all possible tautomers)
  • 30-second timeout may miss complex transformations
  • Some tautomeric preferences depend on context (pH, solvent), which static transformation rules cannot capture

Future Directions

The paper lays groundwork for InChI V2 development, emphasizing that the comprehensive rule set necessitates algorithmic redesign.

Reproducibility Details

Data

Datasets Analyzed (400M+ total structures):

Public Databases (allow partial reproduction):

  • PubChem: Largest public chemical database
  • ChEMBL: Bioactive molecules with drug-like properties
  • DrugBank: FDA-approved and experimental drugs
  • PDB Ligands: Small molecules from protein structures
  • SureChEMBL: Chemical structures from patents
  • AMS: Screening samples
  • ChemNavigator: Commercial chemical database

Private/Proprietary Databases (prevent full-scale reproduction):

  • CSD: Cambridge Structural Database (requires commercial/academic license)
  • CSDB: NCI internal database (private)

Algorithms

Tautomer Generation:

  • Method: Single-step SMIRKS-based transformations
  • Constraints:
    • Maximum 10 tautomers per input structure
    • 30-second CPU timeout per transformation
    • Stereochemistry flattening for affected centers
  • Toolkit Dependency: The authors used the CACTVS Chemoinformatics Toolkit. Researchers attempting to reproduce the analysis with fully open-source tools (such as RDKit) may see different results, because each toolkit applies its own chemical perception logic (for example, aromaticity and SMIRKS handling).

Rule Categories:

  • Prototropic (PT): 54 rules for hydrogen movement
    • Most common: PT_06_00 (1,3-heteroatom H-shift, >70% coverage)
  • Ring-Chain (RC): 21 rules for cyclic/open-chain transformations
    • Examples: RC_03_00 (pentose sugars), RC_04_01 (hexose sugars)
  • Valence (VT): 11 rules for valence changes
    • Notable: VT_02_00 (tetrazole/azide, ~2.8M hits)
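
The rule identifiers above follow a `PREFIX_NN_NN` pattern, which makes them easy to handle programmatically. The sketch below parses an identifier and checks that the category counts add up to 86. Reading the two numeric fields as a rule number and a variant number is an assumption made for illustration; the source does not define them.

```python
from collections import Counter
from dataclasses import dataclass

CATEGORY_NAMES = {"PT": "Prototropic", "RC": "Ring-Chain", "VT": "Valence"}
CATEGORY_COUNTS = {"PT": 54, "RC": 21, "VT": 11}  # counts stated in the text


@dataclass(frozen=True)
class RuleId:
    category: str  # "PT", "RC", or "VT"
    number: int    # assumed meaning: rule number within the category
    variant: int   # assumed meaning: variant of that rule


def parse_rule_id(text: str) -> RuleId:
    """Split an identifier such as 'PT_06_00' into its three fields."""
    prefix, number, variant = text.split("_")
    if prefix not in CATEGORY_NAMES:
        raise ValueError(f"unknown rule category: {prefix!r}")
    return RuleId(prefix, int(number), int(variant))


# The four identifiers named in this summary.
named_rules = ["PT_06_00", "RC_03_00", "RC_04_01", "VT_02_00"]
per_category = Counter(CATEGORY_NAMES[parse_rule_id(r).category] for r in named_rules)
total_rules = sum(CATEGORY_COUNTS.values())
```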

InChI Comparison:

  • Standard InChI (default settings)
  • Nonstandard InChI with 15T option (1,5-tautomerism)
  • Nonstandard InChI with 15T and KET options (keto-enol)

Evaluation

Success Metrics:

Let $\mathcal{T}(m)$ be the set of generated tautomers for molecule $m$.

  • Complete Match: Occurs iff $\forall t_i, t_j \in \mathcal{T}(m), \text{InChI}(t_i) = \text{InChI}(t_j)$.
  • Partial Match: At least two, but not all, tautomers in $\mathcal{T}(m)$ share the same InChI.
  • Fail: All tautomers have different InChIs.
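
The three outcomes can be expressed as a small classifier over the InChI strings computed for one tautomer set. This is a sketch of the definitions as stated here, not the authors' code; it treats "partial" as excluding the complete case so that the three outcomes are disjoint, and the sample strings are placeholders rather than real InChIs.

```python
from typing import Sequence


def classify_inchi_match(inchis: Sequence[str]) -> str:
    """Classify one tautomer set as 'complete', 'partial', or 'fail'."""
    if len(inchis) < 2:
        raise ValueError("need at least two tautomers to compare")
    distinct = len(set(inchis))
    if distinct == 1:
        return "complete"  # every tautomer maps to the same InChI
    if distinct < len(inchis):
        return "partial"   # some, but not all, tautomers share an InChI
    return "fail"          # all InChIs are pairwise different


complete_case = classify_inchi_match(["A", "A", "A"])
partial_case = classify_inchi_match(["A", "A", "B"])
fail_case = classify_inchi_match(["A", "B", "C"])
```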

Benchmark Results:

  • Standard InChI: ~37% success rate across all rules
  • Nonstandard (15T): ~50% success rate
  • New rules: Many show <2% recognition by current InChI

Hardware

Software Environment:

  • Toolkit: CACTVS Chemoinformatics Toolkit v3.4.6.33 and v3.4.8.6
  • Hash Functions:
    • E_TAUTO_HASH (tautomer-invariant identifier)
    • E_ISOTOPE_STEREO_HASH128 (tautomer-sensitive identifier)
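
One plausible way to combine a tautomer-invariant and a tautomer-sensitive identifier is to flag records that agree on the former but differ on the latter. The sketch below shows that idea with placeholder hash strings; it does not call CACTVS, and whether the authors used the two hashes in exactly this way is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (record_id, tautomer_invariant_hash, tautomer_sensitive_hash)
Record = Tuple[str, str, str]


def group_tautomer_duplicates(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Return record ids that share an invariant hash but differ in sensitive hash."""
    by_invariant: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for record_id, invariant, sensitive in records:
        by_invariant[invariant].setdefault(sensitive, []).append(record_id)
    return {
        invariant: sorted(rid for ids in by_sensitive.values() for rid in ids)
        for invariant, by_sensitive in by_invariant.items()
        if len(by_sensitive) > 1  # more than one distinct tautomeric form
    }


# Placeholder hashes: r1/r2 are tautomers of each other, r3/r4 are exact duplicates.
sample = [
    ("r1", "T1", "S1"),
    ("r2", "T1", "S2"),
    ("r3", "T2", "S3"),
    ("r4", "T2", "S3"),
]
tautomer_groups = group_tautomer_duplicates(sample)
```

Exact duplicates (same sensitive hash) are deliberately excluded, since they are redundant entries of one form rather than distinct tautomers.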

Note: The paper does not report computational hardware specifications. Processing 400M+ structures with a 30-second CPU timeout per transform is nonetheless likely to require substantial compute, so replicating the full analysis from scratch would be expensive for an individual researcher.

Paper Information

Citation: Dhaked, D. K., Ihlenfeldt, W.-D., Patel, H., Delannée, V., & Nicklaus, M. C. (2020). Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2. Journal of Chemical Information and Modeling, 60(3), 1253-1275. https://doi.org/10.1021/acs.jcim.9b01080

Publication: Journal of Chemical Information and Modeling, 2020

@article{dhaked2020toward,
  title={Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2},
  author={Dhaked, Devendra K and Ihlenfeldt, Wolf-Dietrich and Patel, Hitesh and Delann{\'e}e, Vincent and Nicklaus, Marc C},
  journal={Journal of Chemical Information and Modeling},
  volume={60},
  number={3},
  pages={1253--1275},
  year={2020},
  publisher={ACS Publications}
}

Additional Resources: