Paper Information
Citation: Park, J., Rosania, G. R., Shedden, K. A., Nguyen, M., Lyu, N., & Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. Chemistry Central Journal, 3(1), 4. https://doi.org/10.1186/1752-153X-3-4
Publication: Chemistry Central Journal 2009
What kind of paper is this?
This is a Method paper.
It proposes a novel software system, ChemReader, designed to automate the analog-to-digital conversion of chemical structure diagrams. The paper focuses on the algorithmic pipeline, in particular on adapting standard computer vision techniques such as the Hough Transform to chemical graphs, and validates the method through direct performance comparisons against existing state-of-the-art tools (OSRA, CLiDE, Kekule).
What is the motivation?
A large body of chemical information (molecular interactions, pathways, disease processes) is locked in the scientific literature. This information is typically encoded as “analog diagrams” (raster images) embedded in text rather than in machine-readable formats, so existing text-based search engines cannot index these structures effectively.
While previous tools existed (Kekule, OROCS, CLiDE), they often required high-resolution images (150-300 dpi) or manual intervention to separate diagrams from text, making fully automated, large-scale database annotation impractical.
What is the novelty here?
The authors introduce ChemReader, a fully automated toolkit with several specific algorithmic innovations tailored for chemical diagrams:
- Modified Hough Transform (HT): Unlike standard HT, which treats all pixels equally, ChemReader uses a modified weight function that accounts for pixel connectivity and line thickness to better detect chemical bonds.
- Chemical Spell Checker: A post-processing step that uses a dictionary of common chemical abbreviations (770 entries) and n-gram probabilities to correct Optical Character Recognition (OCR) errors (e.g., correcting specific atom labels based on valence rules), improving character recognition accuracy from 66% to 87%.
- Specific Substructure Detection: Dedicated algorithms for detecting stereochemical “wedge” bonds using corner detection and aromatic rings using the Generalized Hough Transform.
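
To make the connectivity idea behind the modified Hough Transform concrete, the sketch below counts, for a pair of pixels, how many pixels lie on the straight segment between them and how many of those are black, then compares the resulting density to a threshold. It is only an illustration under stated assumptions: the segment is enumerated with Bresenham's line algorithm, the image is a binary grid with 1 for black, and the names `pixels_between`, `segment_density`, `is_connected` and the default `p0=0.9` are hypothetical rather than taken from ChemReader. The weight itself is not computed here.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumption: 1 = black (foreground), 0 = white
Point = Tuple[int, int]  # (x, y) = (column, row)


def pixels_between(p: Point, q: Point) -> List[Point]:
    """Enumerate grid pixels on the segment p -> q (Bresenham), endpoints included."""
    (x0, y0), (x1, y1) = p, q
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    points = []
    while True:
        points.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:  # step in x
            err += dy
            x0 += sx
        if e2 <= dx:  # step in y
            err += dx
            y0 += sy
    return points


def segment_density(img: Image, p: Point, q: Point) -> Tuple[int, int, float]:
    """Return (n, x, x / n): pixel count, black-pixel count, and black density."""
    pts = pixels_between(p, q)
    n = len(pts)
    x = sum(img[row][col] for (col, row) in pts)
    return n, x, x / n


def is_connected(img: Image, p: Point, q: Point, p0: float = 0.9) -> bool:
    """True when the black-pixel density on the segment exceeds the threshold p0.

    The default p0 is a hypothetical value chosen for illustration.
    """
    _, _, density = segment_density(img, p, q)
    return density > p0


solid = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
broken = [[0, 0, 0, 0, 0], [1, 1, 0, 0, 1], [0, 0, 0, 0, 0]]
print(segment_density(solid, (0, 1), (4, 1)))
print(is_connected(broken, (0, 1), (4, 1)))
```

A pair of pixels joined by an unbroken stroke passes the density test, while a pair separated by a gap does not, which is how a vote can be restricted to pixel pairs that plausibly belong to the same bond line.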
What experiments were performed?
The authors compared ChemReader against three other systems: OSRA V1.0.1, CLiDE V2.1 Lite, and Kekule V2.0 demo.
They used three distinct datasets to test robustness:
- Set I (50 images): Diverse drawing styles and fonts collected via Google Image Search.
- Set II (100 images): Ligand images from the GLIDA database, linked to PubChem for ground truth.
- Set III (212 images): Low-resolution images embedded in 121 scanned journal articles from PubMed.
What outcomes/conclusions?
- Accuracy: ChemReader outperformed the other tools on exact-match accuracy. On the difficult Set III (journal articles), it produced an exactly correct structure for 30.2% of images, compared to 17% for OSRA and 6.6% for CLiDE.
- Similarity: Even when exact matches failed, ChemReader maintained high Tanimoto similarity scores (0.74-0.86), indicating it successfully captured the majority of chemically significant features.
- Substructure Recognition: ChemReader demonstrated higher recall rates across all PubChem fingerprint categories (rings, atom pairs, SMARTS patterns) compared to other tools.
- Error Correction: The “Chemical Spell Checker” improved character recognition accuracy from 66% to 87%.
Reproducibility Details
Data
The study utilized three test sets collected from public sources.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Evaluation | Set I | 50 images | Sourced from Google Image Search to vary styles/fonts. |
| Evaluation | Set II | 100 images | Randomly selected ligands from the GLIDA database; ground truth via PubChem. |
| Evaluation | Set III | 212 images | Extracted from 121 PubMed journal articles; specifically excludes non-chemical figures. |
Algorithms
The pipeline consists of several sequential processing steps:
- De-noising: Uses GREYCstoration, an anisotropic smoothing algorithm, to reduce image noise.
- Segmentation: Groups foreground pixels into connected components using 8-connectivity. Components are then classified as text or graphics based on height/area ratios.
- Line Detection (Modified Hough Transform):
  - Standard Hough Transform is modified to weight pixel pairs $(P_i, P_j)$ based on connectivity.
  - Weight Function ($W_{ij}$): $$W_{ij} = \begin{cases} n_{ij}(P_0 - x_{ij}) & \text{if } x_{ij}/n_{ij} > P_0 \\ 0 & \text{otherwise} \end{cases}$$ where $n_{ij}$ is the number of pixels between the two points, $x_{ij}$ is the number of black pixels among them, and $P_0$ is a density threshold.
- Wedge Bond Detection: Uses corner detection to find candidate triangles and accepts those whose area equals the number of black pixels they contain, i.e. solid wedges (isosceles shape check).
- Chemical Spell Checker:
  - Calculates the maximum likelihood ($ML$) of a character string being a valid chemical word $T$ from a dictionary.
  - Similarity Metric: $$Sim(S_i, T_i) = 1 - \sqrt{\sum_{j=1}^{M} [I^{S_i}(j) - I^{T_i}(j)]^2}$$ This compares the input segment $S_i$ with the candidate template $T_i$ pixel by pixel, where $I(j)$ is the intensity of pixel $j$ and $M$ is the number of pixels compared.
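
The similarity metric above can be sketched in a few lines. This is an illustrative reading of the formula, not ChemReader's code: it assumes both glyphs have already been rescaled to the same size and flattened into equal-length intensity vectors, and the function names `pixel_similarity` and `best_template` as well as the toy 2x2 templates are hypothetical. As written, the formula returns 1 for identical inputs and can drop below 0 unless the intensities are normalized.

```python
import math
from typing import Dict, Sequence


def pixel_similarity(segment: Sequence[float], template: Sequence[float]) -> float:
    """Sim(S, T) = 1 - sqrt(sum_j (I_S(j) - I_T(j))^2) over M aligned pixels."""
    if len(segment) != len(template):
        raise ValueError("segment and template must have the same number of pixels")
    squared_diff = sum((s - t) ** 2 for s, t in zip(segment, template))
    return 1.0 - math.sqrt(squared_diff)


def best_template(segment: Sequence[float], templates: Dict[str, Sequence[float]]) -> str:
    """Return the label of the template with the highest similarity to the segment."""
    return max(templates, key=lambda name: pixel_similarity(segment, templates[name]))


# Toy 2x2 glyphs flattened row by row (hypothetical data, not real character templates).
templates = {
    "diag": [1.0, 0.0, 0.0, 1.0],
    "bar": [1.0, 1.0, 0.0, 0.0],
}
observed = [1.0, 0.0, 0.0, 0.5]
print(best_template(observed, templates))
```

In a spell-checking setting, scores like these would be combined with dictionary and n-gram evidence rather than used alone; the sketch covers only the pixel comparison.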
Models
- Character Recognition: Uses the open-source GOCR library. It employs template matching based on features like holes, pixel densities, and transitions.
- Chemical Dictionary: A lookup table containing 770 frequently used chemical abbreviations and fundamental valence rules.
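
As a rough illustration of how a dictionary lookup can repair OCR output, the sketch below tries substitutions of commonly confused characters and keeps the first variant found in the dictionary. This is a deliberately simpler, swapped-in technique: the paper describes a likelihood-based ranking with n-gram probabilities, which is not reproduced here. The seven-entry `ABBREVIATIONS` set, the `CONFUSIONS` table, and the name `correct_label` are all hypothetical.

```python
from itertools import product

# Tiny hypothetical stand-in for the 770-entry abbreviation dictionary.
ABBREVIATIONS = {"OMe", "OEt", "Ph", "Cl", "NH2", "OH", "Boc"}

# Hypothetical table of characters that OCR engines often confuse.
# Each key maps to the characters to try at that position (itself first).
CONFUSIONS = {
    "0": "0O", "O": "O0",
    "1": "1lI", "l": "l1I", "I": "Il1",
    "5": "5S", "S": "S5",
}


def correct_label(raw: str) -> str:
    """Return a dictionary word reachable by confusable-character swaps, else raw."""
    if raw in ABBREVIATIONS:
        return raw
    options = [CONFUSIONS.get(ch, ch) for ch in raw]
    for candidate in product(*options):
        word = "".join(candidate)
        if word in ABBREVIATIONS:
            return word
    return raw  # no dictionary entry found; leave the OCR output unchanged


for label in ["0Me", "C1", "Ph", "Xyz"]:
    print(label, "->", correct_label(label))
```

Returning the first dictionary hit is the main simplification; a likelihood-based checker would instead score every candidate and pick the most probable one.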
Evaluation
Performance was measured using exact structure matching and fingerprint similarity.
| Metric | Value (Set III) | Baseline (OSRA) | Notes |
|---|---|---|---|
| % Correct | 30.2% | 17% | Exact structure match using ChemAxon JChem. |
| Avg Similarity | 0.740 | 0.526 | Tanimoto similarity on PubChem Substructure Fingerprints. |
| Precision (Rings) | 0.87 | 0.84 | Precision rate for recognizing ring systems. |
| Recall (Rings) | 0.83 | 0.73 | Recall rate for recognizing ring systems. |
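
For readers unfamiliar with the metrics in the table, the sketch below shows how Tanimoto similarity and precision/recall can be computed when a fingerprint is represented as the set of its "on" bit positions. The bit indices are made up for illustration, the function names are hypothetical, and the conventions for empty sets are assumptions rather than anything stated in the paper.

```python
from typing import Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity: |A & B| / |A | B| over the 'on' bits."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(a & b) / union


def precision_recall(predicted: Set[int], reference: Set[int]) -> Tuple[float, float]:
    """Precision = TP / |predicted|, recall = TP / |reference|."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical bit positions for a recognized structure and its ground truth.
recognized = {1, 2, 3}
ground_truth = {2, 3, 4}
print(tanimoto(recognized, ground_truth))
print(precision_recall({1, 2, 3, 4}, {2, 3}))
```

Precision penalizes features the tool reports that are absent from the true structure, while recall penalizes true features the tool misses, which is why the table reports both for ring systems.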
Hardware
- Platform: C++ implementation running on MS Windows.
- Dependencies: GOCR (OCR), GREYCstoration (Image processing).