Paper Information
Citation: Filippov, I. V., Katsubo, D., & Nicklaus, M. C. (2012). Optical Structure Recognition Application entry to CLEF-IP 2012. CLEF 2012 Evaluation Labs and Workshop, Online Working Notes.
Publication: CLEF 2012
Contribution: Evaluating Native Processing in OSRA
This is a Method paper ($\Psi_{\text{Method}}$).
It focuses on evaluating the algorithmic performance of the Optical Structure Recognition Application (OSRA) and justifies specific implementation details (such as pairwise distance clustering) through comparative analysis. The paper systematically compares preprocessing workflows (native vs. tiffsplit) to demonstrate how implementation choices impact precision, recall, and F1 scores.
Motivation: Advancing Chemical Structure Recognition
The primary motivation is to solve the Chemical Structure Recognition task within the context of the CLEF-IP 2012 challenge. The goal is to accurately convert images of chemical structures found in patent documents into established computerized molecular formats (connection tables).
A secondary technical motivation is to address issues in page segmentation where standard bounding box approaches fail to separate overlapping or nested molecular structures.
Core Innovation: Pairwise Distance Segmentation
The core novelty lies in the algorithmic approach to object detection and page segmentation:
- Rejection of Bounding Boxes: Unlike standard OCR approaches, OSRA does not use a bounding box paradigm internally. Instead, it relies on the minimum pairwise distance between points of different connected components. This allows the system to correctly handle cases where a larger molecule “surrounds” a smaller one, which bounding boxes would incorrectly merge.
- Native TIFF Processing: The authors identify that external tools (specifically `tiffsplit`) introduce artifacts during multi-page TIFF conversion. They implement native splitting facilities within OSRA, which substantially improves precision (from 0.433 to 0.708 at tolerance 0).
Experimental Setup: Segmentation and Recognition Tracks
The authors performed two specific tracks for the CLEF-IP 2012 challenge:
Page Segmentation:
- Dataset: 5,421 ground truth structures.
- Comparison: Run 1 used `tiffsplit` (an external tool) to separate pages; Run 2 used OSRA’s native internal page splitting.
- Metrics: Precision, Recall, and F1 scores calculated at varying pixel tolerances (0, 10, 20, 40, 55 pixels).
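
The exact rule for applying the pixel tolerance is not described here, so the following Python sketch assumes one plausible reading: a predicted box counts as correct when each of its four coordinates lies within the tolerance of an unmatched ground-truth box. The function names and the toy boxes are hypothetical and are not taken from the CLEF-IP evaluation tooling.

```python
def boxes_match(pred, truth, tolerance):
    """Assumed rule: every coordinate (x1, y1, x2, y2) differs by at most `tolerance` pixels."""
    return all(abs(p - t) <= tolerance for p, t in zip(pred, truth))


def score_segmentation(predicted, ground_truth, tolerance):
    """Return (precision, recall, f1), matching each ground-truth box at most once."""
    unmatched = list(ground_truth)
    true_positives = 0
    for box in predicted:
        for truth in unmatched:
            if boxes_match(box, truth, tolerance):
                unmatched.remove(truth)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (hypothetical): two true structures, three returned records.
ground_truth = [(10, 10, 110, 60), (200, 40, 320, 150)]
predicted = [(12, 9, 108, 63), (200, 40, 320, 150), (400, 400, 420, 420)]

for tolerance in (0, 10):
    print(tolerance, score_segmentation(predicted, ground_truth, tolerance))
```

In this toy case the slightly offset first box only counts once the tolerance is raised, which is the same mechanism that makes the reported scores rise with tolerance.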
Structure Recognition:
- Dataset: A test set split into an “Automatic” evaluation set (865 structures checkable via InChI keys) and a “Manual” evaluation set (95 structures requiring human review due to Markush labels).
- Metric: Recognition rate (Recalled %).
Results and Conclusions: Native Processing Gains
- Native vs. External Splitting: The native OSRA page splitting outperformed the external `tiffsplit` tool on precision and F1, at a small cost in recall. At tolerance 0, native processing achieved 0.708 Precision compared to 0.433 for `tiffsplit` (F1 of 0.697 vs. 0.536), while Recall dropped slightly (0.686 vs. 0.703). The authors attribute this gap to artifacts introduced during `tiffsplit`’s internal TIFF format conversion. The native run also returned far fewer records (5,254 vs. 8,800 for `tiffsplit`), indicating fewer false detections.
- Recognition Rate: Across 960 total structures, the system achieved an 83% recognition rate (88% on the automatic set, 40% on the manual Markush set).
- Context: The results were consistent with OSRA’s second-place finish (out of 6 participants) at TREC-CHEM 2011.
Reproducibility Details
Data
The experiments used the CLEF-IP 2012 benchmark datasets.
| Purpose | Set | Size | Notes |
|---|---|---|---|
| Segmentation | Ground Truth | 5,421 structures | Used to evaluate bounding box/coordinate accuracy. |
| Recognition | Automatic | 865 structures | Evaluated via InChI key matching. |
| Recognition | Manual | 95 structures | Evaluated manually due to Markush-style labels. |
Algorithms
1. Component Clustering (Pairwise Distance)
The segmentation algorithm avoids bounding boxes.
- Logic: Calculate the minimum pairwise distance between points of distinct graphical components.
- Criterion: If distance $d < \text{threshold}$, components are grouped.
- Advantage: Keeps structures separate when one bounding box $B_1$ fully encloses another, $B_2$ (e.g., a large ring surrounding a salt ion), even though the underlying pixels are disjoint. A bounding-box overlap test would merge the two.
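
The grouping criterion above can be illustrated with a minimal Python sketch. It is not OSRA's implementation: the union-find grouping, the brute-force distance computation, the function names, and the threshold value of 5 pixels are all assumptions chosen for illustration. Components are represented as lists of `(x, y)` pixel coordinates.

```python
from itertools import combinations
from math import hypot


def min_pairwise_distance(a, b):
    """Smallest Euclidean distance between any pixel of `a` and any pixel of `b`."""
    return min(hypot(ax - bx, ay - by) for ax, ay in a for bx, by in b)


def cluster_components(components, threshold):
    """Group component indices whose minimum pairwise distance is below `threshold`."""
    parent = list(range(len(components)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(components)), 2):
        if min_pairwise_distance(components[i], components[j]) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def bounding_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max), used only for comparison."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


# Toy image (hypothetical): a square ring, a small ion inside it, a fragment beside it.
ring = [(x, y) for x in range(21) for y in range(21) if x in (0, 20) or y in (0, 20)]
ion = [(10, 10), (10, 11)]
fragment = [(22, 10), (23, 10)]

groups = cluster_components([ring, ion, fragment], threshold=5)
print(groups)
```

Here the ion's bounding box lies entirely inside the ring's, yet the ion stays in its own group because its nearest ring pixel is 9 pixels away, while the adjacent fragment (2 pixels from the ring) is merged with it. The brute-force distance is quadratic in the number of pixels and is meant only to show the criterion.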
2. Image Pre-processing
- Workflow A (Run 1): Multi-page TIFF → `tiffsplit` binary → Single TIFFs → OSRA.
- Workflow B (Run 2): Multi-page TIFF → OSRA Internal Split → Recognition.
Evaluation
Page Segmentation Results (tiffsplit, Run 1)
Using tiffsplit for page splitting returned 8,800 records against 5,421 ground truth structures.
| Tolerance (px) | Precision | Recall | F1 |
|---|---|---|---|
| 0 | 0.433 | 0.703 | 0.536 |
| 10 | 0.490 | 0.795 | 0.606 |
| 20 | 0.507 | 0.823 | 0.627 |
| 40 | 0.536 | 0.870 | 0.663 |
| 55 | 0.549 | 0.891 | 0.679 |
Page Segmentation Results (Native Split, Run 2)
Using OSRA’s native TIFF reading returned 5,254 records, with much higher precision (0.708 vs. 0.433 at tolerance 0) and slightly lower recall (0.686 vs. 0.703).
| Tolerance (px) | Precision | Recall | F1 |
|---|---|---|---|
| 0 | 0.708 | 0.686 | 0.697 |
| 10 | 0.793 | 0.769 | 0.781 |
| 20 | 0.821 | 0.795 | 0.808 |
| 40 | 0.867 | 0.840 | 0.853 |
| 55 | 0.887 | 0.860 | 0.873 |
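
As a consistency check on the two tables, the short Python sketch below recomputes $F_1 = 2PR/(P+R)$ from the reported precision and recall, and estimates the number of correct records at tolerance 0 as precision times records returned. The data structure and function names are illustrative, and the true-positive counts are only estimates because the published values are rounded to three decimals.

```python
# Values copied from the two tables: (tolerance_px, precision, recall, reported_f1).
RUNS = {
    "tiffsplit": {
        "records": 8800,
        "rows": [
            (0, 0.433, 0.703, 0.536),
            (10, 0.490, 0.795, 0.606),
            (20, 0.507, 0.823, 0.627),
            (40, 0.536, 0.870, 0.663),
            (55, 0.549, 0.891, 0.679),
        ],
    },
    "native": {
        "records": 5254,
        "rows": [
            (0, 0.708, 0.686, 0.697),
            (10, 0.793, 0.769, 0.781),
            (20, 0.821, 0.795, 0.808),
            (40, 0.867, 0.840, 0.853),
            (55, 0.887, 0.860, 0.873),
        ],
    },
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def max_f1_deviation(rows):
    """Largest gap between the recomputed and the reported F1 in a table."""
    return max(abs(f1_score(p, r) - f1) for _, p, r, f1 in rows)


def implied_true_positives(run, tolerance=0):
    """Estimate of correct records: precision times records returned."""
    for tol, precision, _, _ in run["rows"]:
        if tol == tolerance:
            return round(precision * run["records"])
    raise ValueError(f"no row for tolerance {tolerance}")


for name, run in RUNS.items():
    print(name, round(max_f1_deviation(run["rows"]), 4), implied_true_positives(run))
```

Under this estimate both runs recover a similar number of correct boxes at tolerance 0 (roughly 3,810 for `tiffsplit` and 3,720 for the native split), so the precision gap comes mainly from the extra records returned in Run 1.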
Structure Recognition Results
| Set | Count | Recalled | Percentage |
|---|---|---|---|
| Automatic | 865 | 761 | 88% |
| Manual | 95 | 38 | 40% |
| Total | 960 | 799 | 83% |
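
The percentages in this table follow directly from the counts. The brief Python sketch below reproduces them; the variable and function names are illustrative only.

```python
# (structures in set, structures recalled), copied from the table above.
RESULTS = {"Automatic": (865, 761), "Manual": (95, 38)}


def recognition_rate(count, recalled):
    """Recognition rate as a whole-number percentage."""
    return round(100 * recalled / count)


total_count = sum(count for count, _ in RESULTS.values())
total_recalled = sum(recalled for _, recalled in RESULTS.values())

rates = {name: recognition_rate(*counts) for name, counts in RESULTS.items()}
rates["Total"] = recognition_rate(total_count, total_recalled)
print(total_count, total_recalled, rates)
```

Because the manual set is small (95 of 960 structures), its 40% rate lowers the overall figure only from 88% to 83%.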
Artifacts
| Artifact | Type | License | Notes |
|---|---|---|---|
| OSRA | Code | Open Source | Official project page at NCI/NIH |
OSRA is described as an open source utility. The CLEF-IP 2012 benchmark datasets were provided as part of the shared task. No hardware or compute requirements are specified in the paper.
Citation
@inproceedings{filippovOpticalStructureRecognition2012,
title = {Optical {{Structure Recognition Application}} Entry to {{CLEF-IP}} 2012},
author = {Filippov, Igor V and Katsubo, Dmitry and Nicklaus, Marc C},
year = {2012},
booktitle = {CLEF 2012 Evaluation Labs and Workshop, Online Working Notes},
url = {https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf},
abstract = {We present our entry to CLEF 2012 Chemical Structure Recognition task. Our submission includes runs for both bounding box extraction and molecule structure recognition tasks using Optical Structure Recognition Application. OSRA is an open source utility to convert images of chemical structures to connection tables into established computerized molecular formats. It has been under constant development since 2007.}
}