Paper Information
Citation: Park, J., Rosania, G. R., Shedden, K. A., Nguyen, M., Lyu, N., & Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. Chemistry Central Journal, 3(1), 4. https://doi.org/10.1186/1752-153X-3-4
Publication: Chemistry Central Journal 2009
What kind of paper is this?
This is a Method paper.
It proposes a novel software system, ChemReader, designed to automate the analog-to-digital conversion of chemical structure diagrams. The paper focuses on the algorithmic pipeline, specifically the adaptation of standard computer vision techniques such as the Hough Transform to chemical graphs, and validates the method through direct performance comparisons against existing state-of-the-art tools (OSRA, CLiDE, Kekule).
What is the motivation?
There is a massive amount of chemical information (molecular interactions, pathways, disease processes) locked in scientific literature. However, this information is typically encoded as “analog diagrams” (raster images) embedded in text. Existing text-based search engines cannot index these structures effectively.
While previous tools existed (Kekule, OROCS, CLiDE), they often required high-resolution images (150-300 dpi) or manual intervention to separate diagrams from text, making fully automated, large-scale database annotation impractical.
What is the novelty here?
The authors introduce ChemReader, a fully automated toolkit with several specific algorithmic innovations tailored for chemical diagrams:
- Modified Hough Transform (HT): Unlike standard HT, which treats all pixels equally, ChemReader uses a modified weight function that accounts for pixel connectivity and line thickness to better detect chemical bonds.
- Chemical Spell Checker: A post-processing step that uses a dictionary of common chemical abbreviations (770 entries) and n-gram probabilities to correct Optical Character Recognition (OCR) errors (e.g., correcting specific atom labels based on valence rules), improving character recognition accuracy from 66% to 87%.
- Specific Substructure Detection: Dedicated algorithms for detecting stereochemical “wedge” bonds using corner detection and aromatic rings using the Generalized Hough Transform.
What experiments were performed?
The authors compared ChemReader against three other systems: OSRA V1.0.1, CLiDE V2.1 Lite, and Kekule V2.0 demo.
They used three distinct datasets to test robustness:
- Set I (50 images): Diverse drawing styles and fonts collected via Google Image Search.
- Set II (100 images): Ligand images from the GLIDA database, linked to PubChem for ground truth.
- Set III (212 images): Low-resolution images embedded in 121 scanned journal articles from PubMed.
What outcomes/conclusions?
- Accuracy: ChemReader outperformed the other tools on exact structure matches. On the difficult Set III (journal articles), ChemReader produced a correct exact output for 30.2% of images, compared with 17% for OSRA and 6.6% for CLiDE.
- Similarity: Even when exact matches failed, ChemReader maintained high Tanimoto similarity scores (0.74-0.86), indicating it successfully captured the majority of chemically significant features.
- Substructure Recognition: ChemReader demonstrated higher recall rates across all PubChem fingerprint categories (rings, atom pairs, SMARTS patterns) compared to other tools.
- Error Correction: The “Chemical Spell Checker” improved character recognition accuracy from 66% to 87%.
Reproducibility Details
Data
The study utilized three test sets collected from public sources.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Evaluation | Set I | 50 images | Sourced from Google Image Search to vary styles/fonts. |
| Evaluation | Set II | 100 images | Randomly selected ligands from the GLIDA database; ground truth via PubChem. |
| Evaluation | Set III | 212 images | Extracted from 121 PubMed journal articles; specifically excludes non-chemical figures. |
Algorithms
The pipeline consists of several sequential processing steps:
- De-noising: Uses GREYCstoration, an anisotropic smoothing algorithm, to regulate image noise.
- Segmentation: Uses an 8-connectivity algorithm to group pixels. Components are classified as text or graphics based on height/area ratios.
- Line Detection (Modified Hough Transform):
  - The standard Hough Transform is modified to weight pixel pairs $(P_i, P_j)$ based on connectivity.
  - Weight Function ($W_{ij}$): $$W_{ij} = \begin{cases} n_{ij}(P_0 - x_{ij}) & \text{if } x_{ij}/n_{ij} > P_0 \\ 0 & \text{otherwise} \end{cases}$$ where $n_{ij}$ is the pixel count between points, $x_{ij}$ is the count of black pixels, and $P_0$ is a density threshold.
- Wedge Bond Detection: Uses corner detection to find triangles where the area equals the number of black pixels (isosceles shape check).
- Chemical Spell Checker:
  - Calculates the Maximum Likelihood ($ML$) of a character string being a valid chemical word $T$ from a dictionary.
  - Similarity Metric: $$Sim(S_i, T_i) = 1 - \sqrt{\sum_{j=1}^{M} [I^{S_i}(j) - I^{T_i}(j)]^2}$$ The metric is based on the pixel-by-pixel intensity difference between the input segment $S_i$ and the candidate template $T_i$, summed over $M$ pixels.
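The two quantities that feed the line-detection weight function, $n_{ij}$ and $x_{ij}$, and the density test $x_{ij}/n_{ij} > P_0$ can be illustrated with a minimal sketch. It is not ChemReader's implementation: the function names `pair_counts` and `passes_density`, the choice to sample the segment by linear interpolation, and the decision to count both endpoints in $n_{ij}$ are assumptions made for illustration.

```python
def pair_counts(img, p_i, p_j):
    """Return (n_ij, x_ij) for the straight segment between two pixels.

    img is a 2D list of booleans (True = black pixel); points are (row, col).
    Assumption: n_ij counts every sampled pixel on the segment, endpoints
    included, and the segment is sampled by simple linear interpolation.
    """
    (r0, c0), (r1, c1) = p_i, p_j
    n = max(abs(r1 - r0), abs(c1 - c0)) + 1
    x = 0
    for k in range(n):
        t = k / (n - 1) if n > 1 else 0.0
        r = round(r0 + t * (r1 - r0))
        c = round(c0 + t * (c1 - c0))
        if img[r][c]:
            x += 1
    return n, x


def passes_density(n, x, p0):
    """Density test from the weight function: x_ij / n_ij > P_0."""
    return x / n > p0


# Toy image: one horizontal black line on row 2 of a 5x5 grid.
img = [[False] * 5 for _ in range(5)]
img[2] = [True] * 5

n_line, x_line = pair_counts(img, (2, 0), (2, 4))   # pair lying on the line
n_diag, x_diag = pair_counts(img, (0, 0), (4, 4))   # pair crossing the line once
```

Pairs that fail the density test receive a weight of zero, so only pixel pairs joined by a mostly black segment contribute votes to the Hough accumulator.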
Models
- Character Recognition: Uses the open-source GOCR library. It employs template matching based on features like holes, pixel densities, and transitions.
- Chemical Dictionary: A lookup table containing 770 frequently used chemical abbreviations and fundamental valence rules.
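The interaction between the template similarity metric and the dictionary lookup can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: the word score is taken to be the product of per-character similarities, negative similarities are clamped to zero, n-gram probabilities and valence rules are left out, and the names `sim`, `best_word`, and the four-pixel toy templates are hypothetical.

```python
import math


def sim(segment, template):
    """Sim(S, T) = 1 - Euclidean distance between two intensity vectors."""
    if len(segment) != len(template):
        raise ValueError("segment and template must have the same length")
    return 1.0 - math.sqrt(sum((s - t) ** 2 for s, t in zip(segment, template)))


def best_word(segments, dictionary, templates):
    """Return (word, score) for the best-scoring dictionary word.

    Assumption: a word's score is the product of per-character similarities,
    with each similarity clamped at zero. Only words whose length matches the
    number of segments are considered.
    """
    best, best_score = None, float("-inf")
    for word in dictionary:
        if len(word) != len(segments):
            continue
        score = 1.0
        for seg, ch in zip(segments, word):
            score *= max(sim(seg, templates[ch]), 0.0)
        if score > best_score:
            best, best_score = word, score
    return best, best_score


# Hypothetical 4-pixel templates; real templates would be full glyph images.
templates = {
    "O": [0.5, 0.0, 0.0, 0.5],
    "H": [0.0, 0.0, 0.5, 0.5],
    "M": [0.5, 0.5, 0.0, 0.0],
    "e": [0.0, 0.5, 0.5, 0.0],
}
dictionary = ["OH", "Me", "OMe"]

# Two noisy character segments that resemble "O" and "H".
segments = [[0.5, 0.1, 0.0, 0.5], [0.0, 0.1, 0.5, 0.5]]
word, score = best_word(segments, dictionary, templates)
```

Restricting candidates to dictionary entries is what lets a noisy OCR reading be pulled back to a valid abbreviation, at the cost of missing labels that are not in the dictionary.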
Evaluation
Performance was measured using exact structure matching and fingerprint similarity.
| Metric | Value (Set III) | Baseline (OSRA) | Notes |
|---|---|---|---|
| % Correct | 30.2% | 17% | Exact structure match using ChemAxon JChem. |
| Avg Similarity | 0.740 | 0.526 | Tanimoto similarity on PubChem Substructure Fingerprints. |
| Precision (Rings) | 0.87 | 0.84 | Precision rate for recognizing ring systems. |
| Recall (Rings) | 0.83 | 0.73 | Recall rate for recognizing ring systems. |
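For reference, the similarity and precision/recall figures above can be related to fingerprint bits with a minimal sketch. It assumes each structure is represented by the set of "on" bit indices of its substructure fingerprint and that precision and recall are computed over those bits; the function names and the convention that two empty fingerprints have similarity 1.0 are assumptions, and the paper's exact aggregation over images may differ.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty fingerprints are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def precision_recall(predicted, reference):
    """Precision and recall of predicted feature bits against the reference."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


# Toy example: bits set in the recognized structure vs. the ground truth.
recognized = {1, 2, 3}
ground_truth = {2, 3, 4, 5}

similarity = tanimoto(recognized, ground_truth)            # 2 shared / 5 total
precision, recall = precision_recall(recognized, ground_truth)
```

A recognized structure can miss an exact match yet still share most of its bits with the reference, which is why the similarity scores stay high where the exact-match rate is low.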
Hardware
- Platform: C++ implementation running on MS Windows.
- Dependencies: GOCR (OCR), GREYCstoration (Image processing).
