Paper Information

Citation: Filippov, I. V., Katsubo, D., & Nicklaus, M. C. (2011). Optical Structure Recognition Application entry in Image2Structure task. TREC-CHEM.

Publication: TREC-CHEM 2011

What kind of paper is this?

This is primarily a Methodological Paper ($\Psi_{\text{Method}}$), with a secondary Resource ($\Psi_{\text{Resource}}$) component.

It is Methodological because it details the specific algorithmic workflow (segmentation, binarization, vectorization, and rule-based recognition) used to translate pixel data into chemical semantics, specifically for the “Image2Structure” task. It is also a Resource contribution because it introduces OSRA as a free, open-source utility available to the community.

What is the motivation?

A vast body of chemical information exists in journal publications and patents as two-dimensional structure diagrams. While human-readable, these images are inaccessible to machine data mining techniques like virtual screening. Standard Optical Character Recognition (OCR) or face recognition techniques (like neural networks or wavelets) are insufficient because chemical diagrams contain far more structural complexity than alphabet characters, and misinterpretation of a single element can yield a valid but incorrect molecule.

What is the novelty here?

The authors present a specialized pipeline distinct from standard OCR that combines image processing with domain-specific chemical logic. Key technical contributions include:

  • Entropy-based Page Segmentation: A statistical method using row entropy to distinguish between pages with mixed text/graphics and pages with single structures.
  • Custom Binarization: A specific grayscale conversion ($Gr=\min(R,G,B)$) rather than standard linear combinations.
  • Heuristic Confidence Scoring: A linear “confidence function” derived from atom and ring counts to select the best structure resolution.
  • Specialized Bond Recognition: Algorithms to detect bridge bonds, wedge/dashed bonds (3D info), and aromatic rings via inner circles.

What experiments were performed?

The system was validated through submission to the Image2Structure task of TREC-CHEM.

  • Setup: Two runs were submitted: one with default settings (automatic scale selection) and one fixed at 300 dpi.
  • Data: The evaluation used a “Training set” and a “Challenge Set” provided by the task organizers.
  • Metric: Recall rates were measured for both sets.

What were the outcomes and conclusions drawn?

  • Performance: The default settings achieved an 84.3% recall on the training set and 84.8% on the challenge set. The 300 dpi run performed slightly better (86.1% training, 85.6% challenge).
  • Utility: The tool is widely used by academic and commercial researchers to extract data from patents (USPTO, JPO).
  • Validation: Recognition rates have shown steady improvement over a 3-year development period.

Reproducibility Details

Data

  • Source: The primary evaluation data came from the TREC-CHEM Image2Structure task.
  • Reference Datasets: The paper references the “Chem-Infty Dataset” as a source of ground-truthed chemical structure images.

Algorithms

The OSRA pipeline is heuristic-heavy. Key implementation details for replication include:

1. Page Segmentation

  • Entropy Calculation: Used to detect text vs. graphics. Shannon entropy $E = -\sum_i p_i \log p_i$ is calculated for each row in a feature matrix of component distances.
  • Thresholds: A maximum row entropy > 6 is typical of pages with mixed text/graphics, while a value $\le$ 3 is typical of a single structure; a threshold of 4 is used to separate the two cases.
  • Text Removal: Text blocks are identified if a group of segments (distance determined by local minima in distance matrix) contains > 8 segments, has a fill ratio > 0.2, or aspect ratio > 10.
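
The entropy test above can be sketched as follows. This is a minimal illustration, not OSRA's code: it assumes each row of the feature matrix holds non-negative counts that are normalized to probabilities, and it assumes base-2 logarithms, which the summary does not specify. The names `row_entropy` and `is_mixed_page` are hypothetical.

```python
import math

def row_entropy(row):
    """Shannon entropy of one row of non-negative counts (base 2 is an assumption)."""
    total = sum(row)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in row:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

def is_mixed_page(feature_matrix, threshold=4.0):
    """True if the maximum row entropy exceeds the threshold of 4 quoted above."""
    max_entropy = max((row_entropy(row) for row in feature_matrix), default=0.0)
    return max_entropy > threshold
```

A row whose mass is spread evenly over many bins has high entropy, which is the signature attributed above to pages that mix text and graphics; a row concentrated in a few bins stays below the threshold.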

2. Image Preprocessing

  • Grayscale: $Gr = \min(R, G, B)$.
  • Resolutions: Processed at 72, 150, 300 dpi, and a dynamic resolution between 500-1200 dpi.
  • Noise Factor: Ratio of 2-pixel line segments to 3-pixel line segments. If this factor is between 0.5 and 1.0, anisotropic smoothing (GREYCstoration) is applied.
  • Thinning: Uses the method by J. M. Cychosz to reduce lines to 1 pixel width.
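
The grayscale rule and the noise-factor test can be sketched as below. This is an illustration under stated assumptions: the segment counts are taken as given inputs (how OSRA counts 2-pixel and 3-pixel segments is not described here), the interval 0.5 to 1.0 is treated as inclusive, and the function names are hypothetical.

```python
def to_gray(pixel):
    """Grayscale value Gr = min(R, G, B) for one (R, G, B) pixel."""
    r, g, b = pixel
    return min(r, g, b)

def needs_smoothing(n_two_pixel, n_three_pixel, low=0.5, high=1.0):
    """Decide whether anisotropic smoothing would be applied.

    The noise factor is the ratio of 2-pixel to 3-pixel line segments.
    Treating the bounds as inclusive is an assumption.
    """
    if n_three_pixel == 0:
        return False  # ratio undefined; skipping smoothing here is an assumption
    noise_factor = n_two_pixel / n_three_pixel
    return low <= noise_factor <= high
```

With the minimum rule, any saturated color on a white background maps to a dark value: pure blue `(0, 0, 255)` becomes 0, whereas a weighted linear combination of the channels would leave it a nonzero gray. That keeps colored atom labels and bonds in the foreground after thresholding.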

3. Vectorization & Atom Detection

  • Library: Potrace is used for vectorization.
  • Atom Identification: Atoms are detected at Bezier curve control points if:
    • Potrace classifies it as a corner.
    • Direction change normal component is $\ge$ 2 pixels.
    • Distance checks regarding the previous atom are satisfied.
  • OCR: GOCR and OCRAD are used for label recognition on connected sets smaller than max character dimensions. Tesseract was tested but did not improve results.

4. Chemical Logic

  • Average Bond Length: Defined as the value at the 75th percentile of the sorted bond length list (to avoid bias from small artifacts).
  • Aromaticity: Flagged if a circle is found inside a ring, atoms are within half the average bond length of the circle, and bond angles to the center are < 90 degrees.
  • Bridge Bonds: Detected if an atom connected to 4 pairwise collinear single bonds can be removed without changing fragment count, rotatable bonds, or reducing ring counts by 2.
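
The 75th-percentile definition of average bond length can be illustrated with a short sketch. The exact percentile convention is not given in this summary, so the index rule below (no interpolation, position `0.75 * (n - 1)` rounded down) is an assumption, and `average_bond_length` is a hypothetical name.

```python
def average_bond_length(bond_lengths):
    """Return the bond length at the 75th percentile of the sorted list.

    Index rule (an assumption): position int(0.75 * (n - 1)) in sorted order.
    """
    if not bond_lengths:
        raise ValueError("at least one bond length is required")
    ordered = sorted(bond_lengths)
    index = int(0.75 * (len(ordered) - 1))
    return ordered[index]
```

For lengths `[1.0, 2.0, 20.0, 21.0, 22.0]`, where the first two values stand in for small artifacts, this returns `21.0`, while the arithmetic mean would be pulled down to 13.2.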

Models

This is a non-learning, rule-based (heuristic) system. It does, however, use a tuned linear function for confidence estimation.

Confidence Function: Used to select the best resolution result.

$$ \begin{aligned} \text{confidence} &= 0.316030 - 0.016315 N_C + 0.034336 N_N + 0.066810 N_O \\ &+ 0.035674 N_F + 0.065504 N_S + 0.04 N_{Cl} + 0.066811 N_{Br} \\ &+ 0.01 N_R - 0.02 N_{Xx} - 0.212739 N_{rings} + 0.071300 N_{aromatic} \\ &+ 0.329922 N_{rings5} + 0.342865 N_{rings6} - 0.037796 N_{fragments} \end{aligned} $$

Where $N_C$ is carbon count, $N_{rings}$ is ring count, etc.
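
As a sketch of how the confidence function could be evaluated and used to choose among resolutions: the coefficients below are copied from the equation above, while the dictionary keys, the function names, and the rule "pick the highest score" are assumptions made for illustration.

```python
# Coefficients copied from the confidence equation; keys are hypothetical
# names for the corresponding N_* counts.
INTERCEPT = 0.316030
COEFFICIENTS = {
    "C": -0.016315, "N": 0.034336, "O": 0.066810, "F": 0.035674,
    "S": 0.065504, "Cl": 0.04, "Br": 0.066811, "R": 0.01, "Xx": -0.02,
    "rings": -0.212739, "aromatic": 0.071300,
    "rings5": 0.329922, "rings6": 0.342865, "fragments": -0.037796,
}

def confidence(counts):
    """Evaluate the linear confidence function; missing counts default to 0."""
    return INTERCEPT + sum(
        weight * counts.get(name, 0) for name, weight in COEFFICIENTS.items()
    )

def pick_best(candidates):
    """Return the resolution (dict key) whose counts score highest.

    Choosing the maximum is an assumption about how the score is used.
    """
    return max(candidates, key=lambda dpi: confidence(candidates[dpi]))
```

Here `candidates` would map each processed resolution (for example 72, 150, 300) to the counts extracted from the structure recognized at that resolution. The negative weight on fragments and the positive weights on 5- and 6-membered rings mean that a result broken into pieces scores below one with intact rings.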

Evaluation

| Metric | Run | Training Set | Challenge Set |
|--------|-----|--------------|---------------|
| Recall | Default Settings | 84.3% | 84.8% |
| Recall | Fixed 300 dpi | 86.1% | 85.6% |

Citation

@techreport{filippovOpticalStructureRecognition2011,
  title = {Optical {{Structure Recognition Application}} Entry in {{Image2Structure}} Task},
  author = {Filippov, Igor V. and Katsubo, Dmitry and Nicklaus, Marc C.},
  year = {2011},
  month = oct,
  institution = {National Cancer Institute},
  note = {TREC-CHEM Entry}
}