MolRec: Chemical Structure Recognition at CLEF 2012

Paper Information

Citation: Sadawi, N. M., Sexton, A. P., & Sorge, V. (2012). MolRec at CLEF 2012: Overview and Analysis of Results. Working Notes of CLEF 2012 Evaluation Labs and Workshop. CLEF. https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf

Publication: CLEF 2012 Workshop (CLEF-IP Track)

Systematization of Rule-Based OCSR

This is a Systematization paper that evaluates and analyzes MolRec’s performance in the CLEF 2012 chemical structure recognition competition. The work provides systematic insights into how the improved MolRec system performed on different types of molecular diagrams and reveals structural challenges facing rule-based OCSR approaches through comprehensive failure analysis.

Investigating the Limits of Rule-Based Recognition

This work builds on the TREC 2011 competition, where a previous implementation of MolRec already performed well. The CLEF 2012 competition provided an opportunity to test an improved, more computationally efficient version of MolRec on different datasets and understand how performance varies across complexity levels.

The motivation is to understand exactly where rule-based chemical structure recognition breaks down. Examining the specific types of structures that cause failures provides necessary context for the high accuracy rates achieved on simpler structures.

The Two-Stage MolRec Architecture

The novelty lies in the systematic evaluation across two different difficulty levels and the comprehensive failure analysis. The authors tested an improved MolRec implementation that was more efficient than the TREC 2011 version, providing insights into both system evolution and the inherent challenges of chemical structure recognition.

MolRec Architecture Overview: The system follows a two-stage pipeline approach:

Vectorization Stage: The system preprocesses input images through three steps:
- Image binarization using Otsu’s method to convert grayscale images to black and white, followed by labelling of connected components
- OCR processing using nearest neighbor classification with a Euclidean metric to identify and remove text components (atom labels, charges, etc.)
- Separation of bond elements: thinning connected components to single-pixel width, building polyline representations, detecting circles, arrows, and solid triangles, and applying the Douglas-Peucker line simplification algorithm to clean up vectorized bonds
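
The binarization step can be illustrated with a minimal pure-Python sketch of Otsu's method. The paper does not publish MolRec's implementation, so the function names `otsu_threshold` and `binarize`, the nested-list image representation, and the convention that ink pixels map to 1 are all assumptions made for illustration. Connected component labelling is not shown.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to 1 for ink and 0 for background."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]


# Hypothetical 2x3 image: dark strokes (10) on a light background (200).
image = [[200, 200, 10],
         [200, 10, 200]]
binary = binarize(image)
```

In this toy image the two grey levels are cleanly separated, so any threshold between them yields the same binary result; real patent scans are noisier, which is where a data-driven threshold helps.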
Rule Engine Stage: A set of 18 chemical rules converts geometric primitives into molecular graphs:
- Bridge bond recognition (2 rules applied before all others, handling structures with multiple connection paths depicted in 2.5-dimensional perspective drawings)
- Standard bond and atom recognition (16 rules applied in arbitrary order)
- Context-aware disambiguation resolving ambiguities using the full graph structure and character groups
- Superatom expansion looking up character groups identifying more than one atom in a dictionary and replacing them with molecule subgraphs
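
The superatom expansion step lends itself to a short sketch. The dictionary contents, the graph representation (a list of atom labels plus a list of `(i, j, order)` bonds), and the function name `expand_superatoms` are hypothetical; the paper only states that multi-atom character groups are looked up in a dictionary and replaced by molecule subgraphs.

```python
# Hypothetical superatom dictionary: label -> (atoms, internal bonds).
# Atom 0 of each entry is assumed to be the attachment point.
SUPERATOMS = {
    "OMe": (["O", "C"], [(0, 1, 1)]),
    "Et": (["C", "C"], [(0, 1, 1)]),
    "CF3": (["C", "F", "F", "F"], [(0, 1, 1), (0, 2, 1), (0, 3, 1)]),
}


def expand_superatoms(atoms, bonds):
    """Replace every superatom label by its subgraph.

    atoms: list of labels, e.g. ["C", "OMe"]
    bonds: list of (i, j, order) tuples indexing into atoms
    Returns new (atoms, bonds); existing indices stay valid because the
    attachment atom takes over the position of the superatom label.
    """
    new_atoms = list(atoms)
    new_bonds = list(bonds)
    for idx, label in enumerate(atoms):
        if label not in SUPERATOMS:
            continue
        sub_atoms, sub_bonds = SUPERATOMS[label]
        new_atoms[idx] = sub_atoms[0]
        offset = len(new_atoms)
        new_atoms.extend(sub_atoms[1:])
        # Map subgraph-local indices to indices in the full graph.
        mapping = [idx] + list(range(offset, offset + len(sub_atoms) - 1))
        for a, b, order in sub_bonds:
            new_bonds.append((mapping[a], mapping[b], order))
    return new_atoms, new_bonds


# A carbon bonded to a methoxy group written as the label "OMe".
atoms, bonds = expand_superatoms(["C", "OMe"], [(0, 1, 1)])
```

Keeping the attachment atom at the original index means bonds recognized earlier in the rule engine do not need to be rewritten, which is one plausible reason to run expansion as a late step.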

The system can output results in standard formats like MOL files or SMILES strings.

CLEF 2012 Experimental Design

The CLEF 2012 organizers provided a set of 961 test images clipped from patent documents, split into two sets:

- Automatic Evaluation Set (865 images): Images selected for automatic evaluation by comparison of generated MOL files with ground truth MOL files using the OpenBabel toolkit.
- Manual Evaluation Set (95 images): A more challenging collection of images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation. This set was intentionally included to provide a greater challenge.

The authors ran MolRec four times with slightly adjusted internal parameters, then manually examined every incorrect recognition to categorize error types.

Performance Divergence and Critical Failure Modes

The results reveal a stark performance gap between simple and complex molecular structures:

Performance on Automatic Evaluation Set: On the 865-image set, MolRec achieved 94.91% to 96.18% accuracy across four runs with different parameter settings. A total of 46 different diagrams were mis-recognized across all runs.

Performance on Manual Evaluation Set: On the 95-image set, accuracy dropped to 46.32% to 58.95%. A total of 52 different diagrams were mis-recognized across all runs. Some diagrams failed for multiple reasons.
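
The reported percentages can be converted back into image counts as a sanity check. This is a small arithmetic sketch; the helper names are illustrative and the counts are derived here from the percentages above, not quoted from the paper.

```python
def correct_count(accuracy_percent, n_images):
    """Number of correctly recognised images implied by a rounded accuracy."""
    return round(accuracy_percent / 100 * n_images)


# (set name, number of images, lowest and highest reported accuracy in %)
REPORTED = [
    ("automatic", 865, 94.91, 96.18),
    ("manual", 95, 46.32, 58.95),
]


def implied_counts(reported=REPORTED):
    rows = []
    for name, n, low, high in reported:
        worst, best = correct_count(low, n), correct_count(high, n)
        # Round-trip check: the counts reproduce the reported percentages.
        assert round(100 * worst / n, 2) == low
        assert round(100 * best / n, 2) == high
        rows.append((name, worst, best, n - worst, n - best))
    return rows


for name, worst, best, err_worst, err_best in implied_counts():
    print(f"{name}: {worst}-{best} correct, {err_best}-{err_worst} errors per run")
# automatic: 821-832 correct, 33-44 errors per run
# manual: 44-56 correct, 39-51 errors per run
```

The per-run error counts (at most 44 and 51) are slightly below the 46 and 52 distinct mis-recognized diagrams, which is consistent with those totals being taken across all four runs.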

Key Failure Modes Identified (with counts from the paper’s Table 3):

- Character Grouping (26 manual, 0 automatic): An implementation bug caused the digit “1” to be repeated within atom groups, so $R_{21}$ was incorrectly recognized as $R_{211}$. A separate problem was the difficulty of correctly separating closely spaced atom groups.
- Touching Characters (8 manual, 1 automatic): The system does not handle touching characters, so overlapping characters cause mis-recognition.
- Four-Way Junction Failures (6 manual, 7 automatic): The vectorization process could not handle junctions where four lines meet, leading to incorrect connectivity.
- OCR Errors (5 manual, 11 automatic): Character recognition errors included “G” interpreted as “O”, “alkyl” being mis-recognized, and “I” interpreted as a vertical single bond.
- Missed Solid and Dashed Wedge Bonds (0 manual; 6 solid and 6 dashed in automatic): The system incorrectly recognized a number of solid wedge and dashed wedge bonds.
- Missed Wavy Bonds (2 manual, 1 automatic): Some wavy bonds were not recognized despite the dedicated wavy bond rule.
- Missed Charge Signs (1 manual, 2 automatic): While it correctly recognized positive charge signs, MolRec missed three negative charge signs, including one placed at the top left of an atom name.
- Other Errors: An atom too close to a bond endpoint was erroneously considered connected, a solid wedge bond too close to a closed node was incorrectly connected, and a dashed bold bond had its stereocentre incorrectly determined.

Dataset Quality Issues: The authors discovered 11 images where the ground truth MOL files were incorrect and MolRec’s recognition was actually correct. As the authors note, such ground truth errors are very difficult to avoid in this complex task.

Key Insights:

- Performance gap between simple and complex structures: While MolRec achieved over 94% accuracy on standard molecular diagrams, the performance drop on the more challenging manual evaluation set (down to 46-59%) highlights the difficulty of real patent document images.
- Many errors are fixable: The authors note that many mis-recognition problems (such as the character grouping bug and four-way junction vectorization) can be solved with relatively simple enhancements.
- Touching character segmentation remains a notoriously difficult open problem that the authors plan to explore further.
- Evaluation challenges: The discovery of 11 incorrect ground truth MOL files illustrates how difficult it is to create reliable benchmarks for chemical structure recognition.

The authors conclude that despite the high recognition rates on simpler structures, there is still plenty of room for improvement. They identify future work areas including recognition of more general Markush structures, robust charge sign spotting, and accurate identification of wedge bonds.

Reproducibility Details

System Architecture

Model Type: Non-neural, Rule-Based System (Vectorization Pipeline + Logic Engine)

Data

Evaluation Datasets (CLEF 2012): 961 total test images clipped from patent documents:

- Automatic Evaluation Set: 865 images evaluated automatically using OpenBabel for exact structural matching of generated MOL files against ground truth
- Manual Evaluation Set: 95 images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation

Training Data: The paper does not describe the reference set used to build OCR character prototypes for the nearest neighbor classifier.

Algorithms

Vectorization Pipeline (three steps):

- Image Binarization: Otsu’s method, followed by connected component labelling
- OCR: Nearest neighbor classification with Euclidean distance metric; recognized characters are removed from the image
- Bond Element Separation: Thinning to single-pixel width, polyline construction, Douglas-Peucker line simplification (threshold set to 1-2x average line width), detection of circles, arrows, and solid triangles
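
The line simplification step can be sketched with a standard recursive Douglas-Peucker implementation. This is a generic textbook version, not MolRec's code; the 2-pixel average line width and the factor of 1.0 used to derive the threshold are hypothetical values chosen to fall inside the 1-2x range mentioned above.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)  # a and b coincide
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)


def douglas_peucker(points, epsilon):
    """Simplify a polyline, dropping points closer than epsilon to the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax > epsilon:
        # Keep the farthest point and simplify both halves recursively.
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right
    return [first, last]


# Hypothetical values: average stroke width of 2 px, factor 1.0.
avg_line_width = 2.0
epsilon = 1.0 * avg_line_width

# A slightly noisy horizontal bond followed by a vertical bond.
polyline = [(0, 0), (5, 1), (10, 0), (10, 5), (10, 10)]
simplified = douglas_peucker(polyline, epsilon)
# simplified == [(0, 0), (10, 0), (10, 10)]
```

Tying the threshold to the stroke width means wobble introduced by thinning is smoothed away, while genuine corners between bonds, which deviate by much more than a stroke width, survive.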

Rule Engine: 18 chemical structure rules converting geometric primitives to molecular graphs:

- Bridge Bond Rules (2 rules): Applied before all other rules, handling bridge bond structures depicted in 2.5-dimensional perspective drawings
- Wavy Bond Rule: Detailed in paper, identifies approximately collinear connected line segments with zig-zag patterns ($n \geq 3$ segments)
- Standard Recognition Rules: 16 rules for bonds, atoms, and chemical features (applied in arbitrary order; most not detailed in this paper)
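
To make the wavy bond description more concrete, here is one possible reading of it as a geometric test. The exact conditions and thresholds of the paper's rule are not reproduced here: the alternating-turn test, the midpoint collinearity test, the `tolerance` parameter, and the function name `looks_like_wavy_bond` are all assumptions for illustration.

```python
import math


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); its sign gives the turn direction."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def looks_like_wavy_bond(points, tolerance=1.0, min_segments=3):
    """Heuristic check that a polyline is a zig-zag along a straight axis.

    points: polyline vertices; len(points) - 1 connected segments.
    tolerance: assumed maximum distance of segment midpoints from the axis.
    """
    n_segments = len(points) - 1
    if n_segments < min_segments:
        return False
    # Zig-zag: every pair of consecutive turns must change direction.
    turns = [cross(points[i], points[i + 1], points[i + 2])
             for i in range(n_segments - 1)]
    if any(t == 0 for t in turns):
        return False
    if any(turns[i] * turns[i + 1] > 0 for i in range(len(turns) - 1)):
        return False
    # Approximately collinear: segment midpoints lie near a common axis.
    mids = [((points[i][0] + points[i + 1][0]) / 2,
             (points[i][1] + points[i + 1][1]) / 2)
            for i in range(n_segments)]
    (ax, ay), (bx, by) = mids[0], mids[-1]
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        return False
    return all(
        abs((by - ay) * (mx - ax) - (bx - ax) * (my - ay)) / length <= tolerance
        for mx, my in mids
    )


zigzag = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]  # four alternating segments
arc = [(0, 0), (1, 1), (2, 1), (3, 0)]             # turns all in one direction
```

Here `looks_like_wavy_bond(zigzag)` is true while `looks_like_wavy_bond(arc)` is false. A purely geometric test of this kind depends on the vectorization producing clean segments, which may help explain why some wavy bonds were still missed in the competition runs.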

Optimization: Performance tuned via manual adjustment of fuzzy and strict geometric threshold parameters.

Evaluation

Metrics:

- Automated: Exact structural match via OpenBabel MOL file comparison
- Manual: Visual inspection by human experts for structures where OpenBabel fails

Results:

- Automatic Evaluation Set (865 images): 94.91% to 96.18% accuracy across four runs
- Manual Evaluation Set (95 images): 46.32% to 58.95% accuracy across four runs

Hardware

Not specified in the paper. Given the era (2012) and the algorithmic nature of the system (Otsu thresholding, thinning, geometric analysis), it presumably ran on a standard CPU.

Reproducibility Assessment

Closed. No public code, data, or models are available. The paper outlines high-level logic (Otsu binarization, Douglas-Peucker simplification, 18-rule system) but does not provide:

- The complete specification of all 18 rules (only Rule 2.2 for wavy bonds is detailed)
- Exact numerical threshold values for fuzzy/strict parameters used in the CLEF runs
- OCR training data or character prototype specifications

The authors refer readers to a separate 2012 DRR (SPIE) paper [5] for a more detailed overview of the MolRec system architecture.
