Paper Information

Citation: Park, J., Rosania, G. R., Shedden, K. A., Nguyen, M., Lyu, N., & Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. Chemistry Central Journal, 3(1), 4. https://doi.org/10.1186/1752-153X-3-4

Publication: Chemistry Central Journal 2009

What kind of paper is this?

This is a Method paper.

It proposes a software system, ChemReader, that automates the conversion of chemical structure diagrams from raster images into machine-readable structures. The paper focuses on the algorithmic pipeline, in particular on adapting standard computer vision techniques such as the Hough Transform to chemical graphs, and validates the method through direct performance comparisons against existing tools (OSRA, CLiDE, Kekule).

What is the motivation?

There is a massive amount of chemical information (molecular interactions, pathways, disease processes) locked in scientific literature. However, this information is typically encoded as “analog diagrams” (raster images) embedded in text. Existing text-based search engines cannot index these structures effectively.

Earlier tools (Kekule, OROCS, CLiDE) often required high-resolution images (150-300 dpi) or manual intervention to separate diagrams from text, which made fully automated, large-scale database annotation impractical.

What is the novelty here?

The authors introduce ChemReader, a fully automated toolkit with several specific algorithmic innovations tailored for chemical diagrams:

  • Modified Hough Transform (HT): Unlike standard HT, which treats all pixels equally, ChemReader uses a modified weight function that accounts for pixel connectivity and line thickness to better detect chemical bonds.
  • Chemical Spell Checker: A post-processing step that uses a dictionary of common chemical abbreviations (770 entries) and n-gram probabilities to correct Optical Character Recognition (OCR) errors (e.g., correcting specific atom labels based on valence rules). It improves character recognition accuracy from 66% to 87%.
  • Specific Substructure Detection: Dedicated algorithms for detecting stereochemical “wedge” bonds using corner detection and aromatic rings using the Generalized Hough Transform.

What experiments were performed?

The authors compared ChemReader against three other systems: OSRA V1.0.1, CLiDE V2.1 Lite, and Kekule V2.0 demo.

They used three distinct datasets to test robustness:

  1. Set I (50 images): Diverse drawing styles and fonts collected via Google Image Search.
  2. Set II (100 images): Ligand images from the GLIDA database, linked to PubChem for ground truth.
  3. Set III (212 images): Low-resolution images embedded in 121 scanned journal articles from PubMed.

What outcomes/conclusions?

  • Accuracy: ChemReader outperformed the compared tools. On the difficult Set III (journal articles), it produced an exactly correct structure for 30.2% of images, compared with 17% for OSRA and 6.6% for CLiDE.
  • Similarity: Even when exact matches failed, ChemReader maintained high Tanimoto similarity scores (0.74-0.86), indicating it successfully captured the majority of chemically significant features.
  • Substructure Recognition: ChemReader demonstrated higher recall rates across all PubChem fingerprint categories (rings, atom pairs, SMARTS patterns) compared to other tools.
  • Error Correction: The “Chemical Spell Checker” improved character recognition accuracy from 66% to 87%.

Reproducibility Details

Data

The study utilized three test sets collected from public sources.

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Evaluation | Set I | 50 images | Sourced from Google Image Search to vary styles/fonts. |
| Evaluation | Set II | 100 images | Randomly selected ligands from the GLIDA database; ground truth via PubChem. |
| Evaluation | Set III | 212 images | Extracted from 121 PubMed journal articles; specifically excludes non-chemical figures. |

Algorithms

The pipeline consists of several sequential processing steps:

  • De-noising: Uses GREYCstoration, an anisotropic smoothing algorithm, to reduce image noise.
  • Segmentation: Uses an 8-connectivity algorithm to group pixels. Components are classified as text or graphics based on height/area ratios.
  • Line Detection (Modified Hough Transform):
    • Standard Hough Transform is modified to weight pixel pairs $(P_i, P_j)$ based on connectivity.
    • Weight Function ($W_{ij}$): $$W_{ij} = \begin{cases} n_{ij}(P_0 - x_{ij}) & \text{if } x_{ij}/n_{ij} > P_0 \\ 0 & \text{otherwise} \end{cases}$$ where $n_{ij}$ is the pixel count between points, $x_{ij}$ is the count of black pixels, and $P_0$ is a density threshold.
  • Wedge Bond Detection: Uses corner detection to find triangles where the area equals the number of black pixels (isosceles shape check).
  • Chemical Spell Checker:
    • Calculates the Maximum Likelihood ($ML$) of a character string being a valid chemical word $T$ from a dictionary.
    • Similarity Metric: $$Sim(S_i, T_i) = 1 - \sqrt{\sum_{j=1}^{M} [I^{S_i}(j) - I^{T_i}(j)]^2}$$ This is based on the pixel-by-pixel intensity difference between the input character segment $S_i$ and the candidate template $T_i$, summed over $M$ pixels.
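
The segmentation step listed above groups foreground pixels by 8-connectivity. The following is a minimal sketch of that idea, not ChemReader's implementation (which is in C++): the function names `label_components` and `bounding_box`, the 0/1 list-of-rows image format, and the toy image are all assumptions made for illustration. The text/graphics classification by height/area ratio is left out because its thresholds are not given here.

```python
from collections import deque


def label_components(image):
    """Group foreground pixels (value 1) into 8-connected components.

    `image` is a list of rows of 0/1 ints (hypothetical input format).
    Returns a list of components, each a list of (row, col) coordinates.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    # 8-connectivity: horizontal, vertical, and diagonal neighbours.
    neighbours = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                  if (dr, dc) != (0, 0)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for dr, dc in neighbours:
                    nr, nc = pr + dr, pc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and image[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of a component."""
    rs = [r for r, _ in pixels]
    cs = [c for _, c in pixels]
    return min(rs), min(cs), max(rs), max(cs)


# Toy image: two pixels touching diagonally, plus a short vertical stroke.
example = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
example_components = label_components(example)
```

Under 8-connectivity the two diagonal pixels in the toy image form a single component; a 4-connectivity rule would split them. The bounding box of each component gives the height and area that a later text/graphics classifier could use.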

Models

  • Character Recognition: Uses the open-source GOCR library. It employs template matching based on features like holes, pixel densities, and transitions.
  • Chemical Dictionary: A lookup table containing 770 frequently used chemical abbreviations and fundamental valence rules.
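
To make the template comparison concrete, here is a small sketch of the similarity metric from the Algorithms section applied to a lookup over candidate templates. It is illustrative only: the names `similarity` and `best_match`, the flattened-intensity-list representation, and the two toy 2x2 "glyphs" are assumptions, and the sketch does not reproduce the n-gram probabilities or valence rules of the actual spell checker.

```python
import math


def similarity(segment, template):
    """Sim = 1 - sqrt(sum_j (I_S(j) - I_T(j))^2) over M pixels.

    Both arguments are flat lists of pixel intensities of equal length M.
    As written in this summary the metric is not normalized by M, so the
    score can fall below zero for very different images.
    """
    if len(segment) != len(template):
        raise ValueError("segment and template must have the same size")
    squared = sum((s - t) ** 2 for s, t in zip(segment, template))
    return 1.0 - math.sqrt(squared)


def best_match(segment, templates):
    """Return (label, score) for the template most similar to `segment`."""
    scored = ((label, similarity(segment, tpl))
              for label, tpl in templates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical 2x2 templates, flattened row by row (not real glyph data).
TOY_TEMPLATES = {
    "A": [1, 0, 0, 1],
    "B": [1, 1, 0, 0],
}

label, score = best_match([1, 0, 0, 1], TOY_TEMPLATES)
```

In this toy case the input matches template "A" exactly, so its score is 1.0, while template "B" differs in two pixels and scores 1 - sqrt(2).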

Evaluation

Performance was measured using exact structure matching and fingerprint similarity.

| Metric | ChemReader (Set III) | OSRA (Set III) | Notes |
|---|---|---|---|
| % Correct | 30.2% | 17% | Exact structure match using ChemAxon JChem. |
| Avg Similarity | 0.740 | 0.526 | Tanimoto similarity on PubChem Substructure Fingerprints. |
| Precision (Rings) | 0.87 | 0.84 | Precision rate for recognizing ring systems. |
| Recall (Rings) | 0.83 | 0.73 | Recall rate for recognizing ring systems. |
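
For readers unfamiliar with these metrics, the sketch below shows how a Tanimoto coefficient and a precision/recall pair are typically computed from fingerprint bits. It assumes fingerprints are represented as sets of "on" bit indices; the function names and the example bit sets are hypothetical, and how the paper aggregates these values over images is not specified in this summary.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: |A & B| / |A | B| for sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


def precision_recall(predicted, reference):
    """Precision and recall of predicted features against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical on-bit indices for an extracted and a ground-truth structure.
extracted_bits = {1, 2, 3}
truth_bits = {2, 3, 4}

example_similarity = tanimoto(extracted_bits, truth_bits)  # 2 shared / 4 total
example_pr = precision_recall({1, 2, 3}, {2, 3, 4, 5})
```

Precision penalizes features that were reported but are absent from the reference, while recall penalizes reference features that were missed, which is why the table reports both for ring systems.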

Hardware

  • Platform: C++ implementation running on MS Windows.
  • Dependencies: GOCR (OCR), GREYCstoration (Image processing).