OSRA at CLEF-IP 2012

Paper Information

Citation: Filippov, I. V., Katsubo, D., & Nicklaus, M. C. (2012). Optical Structure Recognition Application entry to CLEF-IP 2012. CLEF 2012 Evaluation Labs and Workshop, Online Working Notes.

Publication: CLEF 2012

Additional Resources:

Project Home Page

Contribution: Evaluating Native Processing in OSRA

This is a Method paper ($\Psi_{\text{Method}}$).

It focuses on evaluating the algorithmic performance of the Optical Structure Recognition Application (OSRA) and justifies specific implementation details (such as pairwise distance clustering) through comparative analysis. The paper systematically compares preprocessing workflows (native vs. tiffsplit) to demonstrate how implementation choices impact precision, recall, and F1 scores.

Motivation: Advancing Chemical Structure Recognition

The primary motivation is to solve the Chemical Structure Recognition task within the context of the CLEF-IP 2012 challenge. The goal is to accurately convert images of chemical structures found in patent documents into established computerized molecular formats (connection tables).

A secondary technical motivation is to address issues in page segmentation where standard bounding box approaches fail to separate overlapping or nested molecular structures.

Core Innovation: Pairwise Distance Segmentation

The core novelty lies in the algorithmic approach to object detection and page segmentation:

Rejection of Bounding Boxes: Unlike standard OCR approaches, OSRA does not use a bounding box paradigm internally. Instead, it relies on the minimum pairwise distance between points of different connected components. This allows the system to correctly handle cases where a larger molecule “surrounds” a smaller one, which bounding boxes would incorrectly merge.
Native TIFF Processing: The authors identify that external tools (specifically tiffsplit) introduce artifacts when splitting multi-page TIFF files. They implement native splitting facilities within OSRA, which substantially improves the page segmentation results.

Experimental Setup: Segmentation and Recognition Tracks

The authors submitted runs for two tasks of the CLEF-IP 2012 challenge:

Page Segmentation:
- Dataset: 5,421 ground truth structures.
- Comparison: Run 1 used tiffsplit (external tool) to separate pages; Run 2 used OSRA’s native internal page splitting.
- Metrics: Precision, Recall, and F1 scores calculated at varying pixel tolerances (0, 10, 20, 40, 55 pixels).
Structure Recognition:
- Dataset: A test set split into an “Automatic” evaluation set (865 structures checkable via InChI keys) and a “Manual” evaluation set (95 structures requiring human review due to Markush labels).
- Metric: Recognition rate (Recalled %).
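
The tolerance-based scoring described above can be sketched as follows. This is a minimal illustration, not the official CLEF-IP evaluation script: the matching rule (every box coordinate within `tol` pixels of the ground truth, each ground-truth box matched at most once) and the names `boxes_match` and `score` are assumptions made for the example, and the boxes are toy data.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed layout


def boxes_match(pred: Box, truth: Box, tol: int) -> bool:
    """Assumed rule: every coordinate differs by at most `tol` pixels."""
    return all(abs(p - t) <= tol for p, t in zip(pred, truth))


def score(preds: List[Box], truths: List[Box], tol: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using greedy one-to-one matching."""
    unmatched = list(truths)
    true_positives = 0
    for pred in preds:
        for i, truth in enumerate(unmatched):
            if boxes_match(pred, truth, tol):
                true_positives += 1
                del unmatched[i]  # each ground-truth box is used once
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: one exact hit, one near miss, one spurious detection.
truths = [(10, 10, 100, 100), (200, 200, 300, 300)]
preds = [(12, 8, 105, 98), (200, 200, 300, 300), (400, 400, 450, 450)]

for tol in (0, 10):
    print(tol, score(preds, truths, tol))
```

Raising the tolerance turns the near miss into a match, which is the same effect that drives the scores upward across the 0 to 55 pixel range used in the challenge.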

Results and Conclusions: The Superiority of Native Processing

Native Processing Superiority: The native OSRA page splitting significantly outperformed the external tiffsplit tool. For example, at tolerance 0, native processing achieved a precision of 0.71 compared to 0.43 for tiffsplit.
High Recall: The system recovered 799 of the 960 test structures, an overall recognition rate of 83% (88% on the automatic set and 40% on the manual set).
Validation: The results were consistent with previous top-tier performance (2nd place) at TREC-CHEM 2011.

Reproducibility Details

Data

The experiments used the CLEF-IP 2012 benchmark datasets.

Purpose	Set	Size	Notes
Segmentation	Ground Truth	5,421 structures	Used to evaluate bounding box/coordinate accuracy.
Recognition	Automatic	865 structures	Evaluated via InChI key matching.
Recognition	Manual	95 structures	Evaluated manually due to Markush-style labels.

Algorithms

1. Component Clustering (Pairwise Distance)

The segmentation algorithm avoids bounding boxes.

Logic: Calculate the minimum pairwise distance between points of distinct graphical components.
Criterion: If the distance $d < \text{threshold}$, the components are grouped.
Advantage: Enables separation of complex geometries where a bounding box $B_1$ fully encompasses another bounding box $B_2$ (e.g., a large ring surrounding a salt ion) even though the underlying pixels are disjoint.
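
The grouping rule above can be illustrated with a small pure-Python sketch. It is not OSRA's implementation: the union-find grouping, the helper names (`min_pairwise_distance`, `cluster`, `boxes_overlap`), the threshold of 5 pixels, and the toy point sets are all assumptions chosen to show why a distance criterion separates a surrounded component that a bounding-box test would merge.

```python
from itertools import combinations
from math import hypot


def min_pairwise_distance(a, b):
    """Smallest Euclidean distance between any point of a and any point of b."""
    return min(hypot(ax - bx, ay - by) for ax, ay in a for bx, by in b)


def bounding_box(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(b1, b2):
    """True if two axis-aligned boxes intersect (the check OSRA avoids)."""
    return b1[0] <= b2[2] and b2[0] <= b1[2] and b1[1] <= b2[3] and b2[1] <= b1[3]


def cluster(components, threshold):
    """Group components whose minimum pairwise distance is below threshold."""
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(components)), 2):
        if min_pairwise_distance(components[i], components[j]) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data: a square "ring" outline, a single-pixel "ion" at its centre,
# and a small fragment just outside the ring.
ring = [(x, y) for x in range(21) for y in range(21) if x in (0, 20) or y in (0, 20)]
ion = [(10, 10)]
fragment = [(22, 0), (22, 1)]

print(boxes_overlap(bounding_box(ring), bounding_box(ion)))  # boxes intersect
print(cluster([ring, ion, fragment], threshold=5))
```

In this toy case the ion's bounding box lies entirely inside the ring's, yet the ion stays in its own group because its nearest ring pixel is 10 pixels away, while the nearby fragment is merged with the ring.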

2. Image Pre-processing

Workflow A (Inferior): Multi-page TIFF → tiffsplit binary → Single TIFFs → OSRA.
Workflow B (Superior): Multi-page TIFF → OSRA Internal Split → Recognition.

Evaluation

Page Segmentation Results (Native Split)

Performance improves as the pixel tolerance for coordinate matching increases: F1 rises from 0.697 at 0 pixels to 0.873 at 55 pixels.

Tolerance (px)	Precision	Recall	F1
0	0.708	0.686	0.697
10	0.793	0.769	0.781
20	0.821	0.795	0.808
40	0.867	0.840	0.853
55	0.887	0.860	0.873
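
As a consistency check, the F1 column can be recomputed from the precision and recall columns using the standard harmonic-mean definition $F_1 = 2PR / (P + R)$. The snippet below is only a verification sketch over the values in the table; the variable names are illustrative.

```python
# (tolerance_px, precision, recall, reported_f1) copied from the table above.
rows = [
    (0, 0.708, 0.686, 0.697),
    (10, 0.793, 0.769, 0.781),
    (20, 0.821, 0.795, 0.808),
    (40, 0.867, 0.840, 0.853),
    (55, 0.887, 0.860, 0.873),
]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for tol, p, r, reported in rows:
    print(f"tol={tol:>2}px  computed F1={f1(p, r):.3f}  reported F1={reported:.3f}")
```

Each recomputed value agrees with the reported F1 to three decimal places.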

Structure Recognition Results

Set	Count	Recalled	Percentage
Automatic	865	761	88%
Manual	95	38	40%
Total	960	799	83%
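
The percentages and totals in the table follow directly from the counts. A short arithmetic check, with illustrative variable names:

```python
# (count, recalled) per evaluation set, copied from the table above.
results = {"Automatic": (865, 761), "Manual": (95, 38)}


def pct(recalled: int, count: int) -> int:
    """Recognition rate rounded to the nearest whole percent."""
    return round(100 * recalled / count)


total_count = sum(count for count, _ in results.values())
total_recalled = sum(recalled for _, recalled in results.values())

for name, (count, recalled) in results.items():
    print(f"{name}: {recalled}/{count} = {pct(recalled, count)}%")
print(f"Total: {total_recalled}/{total_count} = {pct(total_recalled, total_count)}%")
```

The overall rate is dominated by the automatic set, which contributes 865 of the 960 structures.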

Citation

@inproceedings{filippovOpticalStructureRecognition2012,
  title = {Optical {{Structure Recognition Application}} Entry to {{CLEF-IP}} 2012},
  author = {Filippov, Igor V and Katsubo, Dmitry and Nicklaus, Marc C},
  year = {2012},
  booktitle = {CLEF 2012 Evaluation Labs and Workshop, Online Working Notes},
  url = {https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf},
  abstract = {We present our entry to CLEF 2012 Chemical Structure Recognition task. Our submission includes runs for both bounding box extraction and molecule structure recognition tasks using Optical Structure Recognition Application. OSRA is an open source utility to convert images of chemical structures to connection tables into established computerized molecular formats. It has been under constant development since 2007.}
}
