OSRA at TREC-CHEM 2011: Optical Structure Recognition

Paper Information

Citation: Filippov, I. V., Katsubo, D., & Nicklaus, M. C. (2011). Optical Structure Recognition Application entry in Image2Structure task. TREC-CHEM.

Publication: TREC-CHEM 2011

Additional Resources:

Contribution: Method and Resource

This is primarily a Methodological Paper ($\Psi_{\text{Method}}$), with a secondary Resource ($\Psi_{\text{Resource}}$) component.

It is Methodological because it details the specific algorithmic workflow (segmentation, binarization, vectorization, and rule-based recognition) used to translate pixel data into chemical semantics. It specifically addresses the “Image2Structure” task. It also serves as a Resource contribution by introducing OSRA as a free, open-source utility available to the community.

Motivation: Limitations of Standard OCR in Chemistry

A vast body of chemical information exists in journal publications and patents as two-dimensional structure diagrams. These images are readable by humans but inaccessible to machine data mining techniques such as virtual screening. Standard Optical Character Recognition (OCR) is insufficient for them. Widely used pattern-recognition techniques such as wavelet transforms or neural networks (as used in face recognition) are also not applicable, for two reasons: chemical diagrams contain far more structural complexity than alphabet characters, and misinterpreting a single element can yield a valid but incorrect molecule.

Core Innovation: Chemistry-Aware Heuristic Pipeline

The authors present a specialized pipeline distinct from standard OCR that combines image processing with domain-specific chemical logic. Key technical contributions include:

Entropy-based Page Segmentation: A statistical method using row entropy to distinguish between pages with mixed text/graphics and pages with single structures.
Custom Binarization: A specific grayscale conversion ($Gr=\min(R,G,B)$).
Heuristic Confidence Scoring: A linear “confidence function” derived from atom and ring counts to select the best structure resolution.
Specialized Bond Recognition: Algorithms to detect bridge bonds, wedge/dashed bonds (3D info), and aromatic rings via inner circles.

Methodology: Evaluation on TREC-CHEM Image2Structure

The system was validated through submission to the Image2Structure task of TREC-CHEM.

Version: OSRA version 1.3.8 was used without modifications.
Setup: Two runs were submitted: one with default settings (automatic scale selection) and one fixed at 300 dpi.
Data: The evaluation used a “Training set” and a “Challenge Set” provided by the task organizers.
Metric: Recall rates were measured for both sets.

Results and Real-World Impact

Performance: The default settings achieved an 84.3% recall on the training set and 84.8% on the challenge set. The 300 dpi run performed slightly better (86.1% training, 85.6% challenge).
Utility: The tool is widely used by academic and commercial researchers to extract data from patents (USPTO, JPO).
Validation: Recognition rates have shown steady improvement over a 3-year development period.

Reproducibility Details

Artifacts

Artifact	Type	License	Notes
OSRA (SourceForge)	Code	Unknown	Open-source OCSR tool

Data

Source: The primary evaluation data came from the TREC-CHEM Image2Structure task.
Reference Datasets: The paper references the “Chem-Infty Dataset” as a source of ground-truthed chemical structure images.

Algorithms

The OSRA pipeline is heuristic-heavy. Key implementation details for replication include:

1. Page Segmentation

Entropy Calculation: Used to detect text vs. graphics. Entropy $E = -p \log p$ is calculated for rows in a feature matrix of component distances.
Thresholds: The maximum row entropy is above 6 for pages with mixed text and graphics and at or below 3 for pages with a single structure, so a threshold of 4 is used to separate the two cases.
Separator Removal: Linear separators (aspect ratio above 100 or below 0.01, size above 300 pixels) are deleted early. Table frames are identified as connected components with aspect ratio between 0.1 and 10, with at least 300 pixels lying on the surrounding rectangle.
Text Removal: Text blocks are identified if a group of segments (distance determined by local minima in distance matrix) contains > 8 segments, has a fill ratio > 0.2, or aspect ratio > 10.
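The entropy test above can be illustrated with a minimal sketch. It rests on assumptions the summary does not confirm: each row of the feature matrix is treated as a histogram of non-negative counts, the per-bin terms $-p \log p$ are summed over the row, and the logarithm is base 2. The function names are hypothetical and are not taken from the OSRA source code.

```python
import math

MIXED_PAGE_THRESHOLD = 4.0  # threshold quoted in the text above


def row_entropy(row):
    """Sum of -p*log2(p) over one row, treating the row as a histogram.

    Assumption: the row holds non-negative counts and base-2 logs are used.
    """
    total = sum(row)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log2(v / total) for v in row if v > 0)


def max_row_entropy(matrix):
    """Largest row entropy found in the feature matrix."""
    return max(row_entropy(row) for row in matrix)


def is_mixed_page(matrix, threshold=MIXED_PAGE_THRESHOLD):
    """True if the page looks like mixed text/graphics, False for a single structure."""
    return max_row_entropy(matrix) > threshold
```

A row spread evenly over many bins has high entropy and pushes the page toward the mixed text/graphics class, while rows concentrated in a few bins keep it in the single-structure class.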

2. Image Preprocessing

Grayscale: $Gr = \min(R, G, B)$.
Resolutions: Processed at 72, 150, 300 dpi, and a dynamic resolution between 500-1200 dpi.
Noise Factor: Ratio of 2-pixel line segments to 3-pixel line segments. If this factor is between 0.5 and 1.0, anisotropic smoothing (GREYCstoration) is applied.
Thinning: Uses the method by J. M. Cychosz to reduce lines to 1 pixel width.
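As a rough illustration of the grayscale rule and the noise-factor test, the sketch below works on plain nested lists of RGB tuples. The function names are hypothetical, the way 2-pixel and 3-pixel segments are counted is left to the caller because the summary does not describe it, and treating the 0.5 and 1.0 bounds as inclusive is an assumption.

```python
def to_grayscale(pixels):
    """Apply Gr = min(R, G, B) to every pixel.

    pixels: list of rows, each row a list of (R, G, B) tuples.
    Returns a list of rows of integers.
    """
    return [[min(r, g, b) for (r, g, b) in row] for row in pixels]


def needs_smoothing(n_two_pixel, n_three_pixel, low=0.5, high=1.0):
    """Decide whether anisotropic smoothing would be applied.

    The noise factor is the ratio of 2-pixel to 3-pixel line segments.
    Assumptions: bounds are inclusive, and an image with no 3-pixel
    segments is not smoothed.
    """
    if n_three_pixel == 0:
        return False
    noise_factor = n_two_pixel / n_three_pixel
    return low <= noise_factor <= high
```

With this rule a saturated color such as pure red, (255, 0, 0), maps to 0, so colored strokes end up as dark as black ink on a white background.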

3. Vectorization & Atom Detection

Library: Potrace is used for vectorization.
Atom Identification: Atoms are detected at Bezier curve control points if:
- Potrace classifies it as a corner.
- Direction change normal component is $\ge$ 2 pixels.
- The distance from the last atom to the next control point is less than the distance from the last atom to the current control point.
OCR: GOCR and OCRAD are used for label recognition on connected sets smaller than max character dimensions. Tesseract and Cuneiform were also tested but did not improve recognition results.

4. Chemical Logic

Average Bond Length: Defined as the value at the 75th percentile of the sorted bond length list (to avoid bias from small artifacts).
Aromaticity: Flagged if a circle is found inside a ring, atoms are within half the average bond length of the circle, and bond angles to the center are less than 90 degrees.
Bridge Bonds: An atom connected to 4 pairwise collinear single bonds (none of them terminal) is treated as a bridge crossing if removing it changes neither the fragment count nor the number of rotatable bonds, and does not reduce the number of 5- and 6-membered rings by 2.
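The percentile-based average bond length is simple enough to sketch directly. The function name and the exact index rounding (floor of 0.75 times the list length, clamped to the last element) are assumptions; the summary only states that the 75th-percentile value of the sorted list is used.

```python
def average_bond_length(bond_lengths, quantile=0.75):
    """Return the value at the given quantile of the sorted bond lengths.

    Assumption: the index is floor(quantile * n), clamped to n - 1.
    """
    if not bond_lengths:
        raise ValueError("bond_lengths must not be empty")
    ordered = sorted(bond_lengths)
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[index]
```

For the lengths 1, 2, 19, 20, 20, 20, 21, 22, this returns 21, whereas the arithmetic mean would be pulled down to about 15.6 by the two short artifact segments.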

5. Connection Table Compilation

Library: OpenBabel is used for conversion into SMILES or SDF formats.
Process: A molecular object is constructed from connectivity information along with stereo- and aromaticity flags. Superatom fragments are added at this stage using a user-modifiable dictionary.

Models

This is a rule-based, heuristic system with no learned components. It does, however, use a tuned linear function for confidence estimation.

Confidence Function: Used to select the best resolution result.

$$ \begin{aligned} \text{confidence} &= 0.316030 - 0.016315 N_C + 0.034336 N_N + 0.066810 N_O \\ &+ 0.035674 N_F + 0.065504 N_S + 0.04 N_{Cl} + 0.066811 N_{Br} \\ &+ 0.01 N_R - 0.02 N_{Xx} - 0.212739 N_{rings} + 0.071300 N_{aromatic} \\ &+ 0.329922 N_{rings5} + 0.342865 N_{rings6} - 0.037796 N_{fragments} \end{aligned} $$

Here $N_C$ is the carbon count, $N_{rings}$ is the ring count, and the remaining terms follow the same naming pattern.
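The confidence function is a plain linear combination, so it can be transcribed as a short sketch. The coefficients are copied from the equation above; the dictionary keys, the function names, and the rule of picking the resolution with the highest confidence are assumptions made for illustration.

```python
INTERCEPT = 0.316030

# Coefficients keyed by the subscript of the corresponding N term.
COEFFICIENTS = {
    "C": -0.016315, "N": 0.034336, "O": 0.066810, "F": 0.035674,
    "S": 0.065504, "Cl": 0.04, "Br": 0.066811, "R": 0.01, "Xx": -0.02,
    "rings": -0.212739, "aromatic": 0.071300,
    "rings5": 0.329922, "rings6": 0.342865, "fragments": -0.037796,
}


def confidence(counts):
    """Evaluate the linear confidence function; missing counts default to 0."""
    return INTERCEPT + sum(
        coeff * counts.get(name, 0) for name, coeff in COEFFICIENTS.items()
    )


def pick_best_resolution(results):
    """results maps a resolution (dpi) to its counts dictionary.

    Assumption: the result with the highest confidence is kept.
    """
    return max(results, key=lambda dpi: confidence(results[dpi]))


# Worked example: a benzene-like result (6 carbons, one aromatic 6-ring).
benzene = {"C": 6, "rings": 1, "aromatic": 1, "rings6": 1, "fragments": 1}
noisy = {"C": 6, "fragments": 3}  # same atoms, but broken into pieces
best = pick_best_resolution({72: noisy, 300: benzene})
```

In this example the benzene-like result scores about 0.382 and the fragmented one about 0.105, so `best` is 300. The large positive weights on 5- and 6-membered rings and the penalty on fragments favor results that look like chemically plausible, connected molecules.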

Evaluation

Metric	Run	Training Set	Challenge Set
Recall	Default Settings	84.3%	84.8%
Recall	Fixed 300 dpi	86.1%	85.6%

Citation

@techreport{filippovOpticalStructureRecognition2011,
  title = {Optical {{Structure Recognition Application}} Entry in {{Image2Structure}} Task},
  author = {Filippov, Igor V. and Katsubo, Dmitry and Nicklaus, Marc C.},
  year = {2011},
  month = oct,
  institution = {National Cancer Institute},
  note = {TREC-CHEM Entry}
}
