MolRec at CLEF 2012: Rule-Based Structure Recognition

Paper Information

Citation: Sadawi, N. M., Sexton, A. P., & Sorge, V. (2012). MolRec at CLEF 2012 – Overview and Analysis of Results. CLEF 2012 Evaluation Labs and Workshop, Online Working Notes. https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf

Publication: CLEF 2012 Evaluation Labs and Workshop, Online Working Notes

Contribution to Chemical Structure Recognition

This is a Method paper. It describes the architecture of an engineered artifact (the “MolRec” system) and evaluates its efficacy on a specific task (Chemical Structure Recognition) using a standardized benchmark. It focuses on the mechanisms of vectorization and rule-based rewriting.

Motivation and CLEF 2012 Context

The work was motivated by the CLEF 2012 chemical structure recognition task. The goal was to automatically interpret chemical diagram images clipped from patent documents. This is challenging because real-world patent images contain complex structures, such as bridge bonds and elements not supported by standard conversion tools like OpenBabel.

Novelty in Rule-Based Vectorization

The primary contribution is a rule-based rewrite engine that improves on the authors’ previous TREC 2011 submission. The implementation was fully overhauled, which improved both recognition performance and computational efficiency. The system uses a two-stage approach:

Vectorization: Extracts geometric primitives (lines, circles, arrows) and characters.
Rule Engine: Applies 18 specific geometric rewriting rules to transform primitives into a chemical graph, which can then be exported to MOL or SMILES format.

Notably, the system explicitly handles “bridge bonds” (3D perspective structures) by applying specific recognition rules before general bond detection.
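The ordering constraint can be pictured as a two-phase rewrite loop. The following is a minimal sketch and not MolRec's actual implementation: the `Rule` signature, the toy rules, and the string tokens standing in for geometric primitives are all hypothetical.

```python
from typing import Callable, List, Optional

# A rule returns a rewritten primitive list, or None if it does not match.
# (Hypothetical signature; the paper does not publish MolRec's rule API.)
Rule = Callable[[List[str]], Optional[List[str]]]


def apply_until_fixpoint(primitives: List[str], rules: List[Rule]) -> List[str]:
    """Apply the rules repeatedly until none of them matches any more."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            result = rule(primitives)
            if result is not None:
                primitives = result
                changed = True
    return primitives


def run_engine(primitives: List[str], bridge_rules: List[Rule],
               standard_rules: List[Rule]) -> List[str]:
    """Phase 1: bridge-bond rules. Phase 2: all remaining rules."""
    primitives = apply_until_fixpoint(primitives, bridge_rules)
    return apply_until_fixpoint(primitives, standard_rules)


# Toy rules on string tokens (placeholders for real geometric primitives).
def bridge_rule(prims: List[str]) -> Optional[List[str]]:
    # Rewrite a line interrupted by a crossing line into a bridge bond.
    if "broken_line" in prims:
        return ["bridge_bond" if p == "broken_line" else p for p in prims]
    return None


def single_bond_rule(prims: List[str]) -> Optional[List[str]]:
    # Treat any remaining line-like primitive as a single bond.
    if "line" in prims or "broken_line" in prims:
        return ["single_bond" if p in ("line", "broken_line") else p for p in prims]
    return None


result = run_engine(["line", "broken_line", "line"], [bridge_rule], [single_bond_rule])
print(result)  # ['single_bond', 'bridge_bond', 'single_bond']
```

If the general rule ran first in this toy setup, the interrupted line would be consumed as a single bond and the bridge rule would never see it, which is the reason for giving the bridge rules priority.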

Experimental Setup on the CLEF 2012 Corpus

The system was evaluated on the CLEF 2012 corpus of 960 test images, split into two distinct sets to test different capabilities:

Automatic Set: 865 images evaluated automatically using OpenBabel to compare generated MOL files against ground truth.
Manual Set: 95 “challenging” images containing elements beyond OpenBabel’s scope (e.g., Markush structures), evaluated via manual visual inspection.

The authors performed four runs with slightly different internal parameters to test system stability.

Performance Outcomes and Failure Analysis

Performance:

Automatic Set: Accuracy between 94.91% and 96.18% across the four runs.
Manual Set: Accuracy between 46.32% and 58.95% across the four runs, reflecting the difficulty of complex patent diagrams containing Markush structures and other elements beyond OpenBabel’s scope.

Failure Analysis:

The authors conducted a detailed error analysis on 52 distinct mis-recognized diagrams from the manual set and 46 from the automatic set. Key failure modes include:

Character Grouping: The largest error source in the manual set (26 images). A bug caused the digit “1” to be repeated within atom groups, and closely-spaced atom groups were incorrectly merged.
Touching Characters: 8 images in the manual set and 1 in the automatic set. The system lacks segmentation for characters that touch, causing OCR failure.
Four-way Junctions: 6 manual and 7 automatic images. Vectorization failed to correctly identify junctions where four lines meet.
Missed Wedge Bonds: In the automatic set, 6 images had missed solid wedge bonds and 6 had missed dashed wedge bonds.
OCR Errors: 5 manual and 11 automatic images, including misrecognition of “G” as “O” and “I” interpreted as a vertical single bond.
Charge Signs: MolRec correctly recognized positive charge signs but missed three negative charge signs, including one placed at the top left of an atom name.
Dataset Errors: The authors identified 11 images where the ground truth MOL files were incorrect, but MolRec’s recognition was actually correct.

Reproducibility Details

Data

The dataset was provided by CLEF 2012 organizers and consists of images clipped from patent documents.

Purpose	Dataset	Size	Notes
Evaluation (Auto)	CLEF 2012 Set 1	865 images	Evaluated via OpenBabel
Evaluation (Manual)	CLEF 2012 Set 2	95 images	Complex/Markush structures

Algorithms

The MolRec pipeline consists of two primary modules:

1. Vectorization Module

Binarization: Uses Otsu’s method.
OCR: Extracts connected components and classifies them using nearest neighbor classification with a Euclidean metric. Detected characters are removed from the image.
Bond Separation:
- Thins remaining components to single-pixel width.
- Builds polyline representations.
- Splits polylines at junctions (3+ lines meeting).
- Simplification: Applies the Douglas-Peucker algorithm with a threshold of 1-2 average line widths to remove scanning artifacts while preserving corners. The threshold is based on measured average line width, allowing adaptation to different line styles.
- Also detects circles, arrow heads, and solid triangles (annotated with direction).
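The simplification step can be illustrated with a textbook Douglas-Peucker routine whose tolerance is derived from the average line width. This is a sketch under assumptions: the function names, the sample polyline, and the measured width of 2 pixels are illustrative and do not come from the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points: List[Point], epsilon: float) -> List[Point]:
    """Drop points that deviate from the simplified polyline by at most epsilon."""
    if len(points) < 3:
        return list(points)
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= epsilon:
        return [points[0], points[-1]]
    # Keep the farthest point and simplify both halves recursively.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right


avg_line_width = 2.0            # assumed measurement, in pixels
epsilon = 1.5 * avg_line_width  # inside the 1-2 line-width range
polyline = [(0, 0), (5, 1), (10, 0), (15, -1), (20, 0), (20, 10), (20, 20)]
simplified = douglas_peucker(polyline, epsilon)
print(simplified)  # [(0, 0), (20, 0), (20, 20)]
```

The one-pixel wobble along the horizontal stroke is removed while the right-angle corner at (20, 0) survives, which matches the stated goal of removing scanning artifacts while preserving corners.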

2. Rule Engine

Input: Geometric primitives (segments, circles, triangles, arrows, character groups).
Structure: 18 rewrite rules.
Priority: Two rules for Bridge Bonds (Open/Closed) are applied first.
Standard Rules: 16 rules applied in arbitrary order for standard bonds (Single, Double, Triple, Wedge, Dative, etc.).
Implicit Nodes: Some rules handle cases where carbon atoms are implicit at bond junctions. These rules detect double or triple bonds while producing new geometric objects by splitting bonds at implicit nodes for further processing.
Example Rule (Wavy Bond):
- Condition 1: A set $L$ of $n$ line segments with $n \ge 3$.
- Condition 2: Segment lengths match “dash length” parameter.
- Condition 3: All elements are connected.
- Condition 4: Center points are approximately collinear.
- Condition 5: Endpoints form a single sequence (end elements have 1 neighbor, internal have 2).
- Condition 6: Two unconnected endpoints must be the pair of endpoints that are furthest apart.
- Consequence: Replace $L$ with a Wavy Bond between the furthest two endpoints. The bond has unknown direction.
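The six conditions of the wavy bond rule can be checked mechanically. The sketch below is one possible reading of them and not the authors' code: segments are modelled as pairs of end points, and the tolerances `join_tol` and `line_tol` as well as the function name are assumptions.

```python
import math
from itertools import combinations


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def match_wavy_bond(segments, dash_range, join_tol=0.5, line_tol=1.0):
    """Return the bond's two end points if all six conditions hold, else None."""
    n = len(segments)
    if n < 3:                                                  # condition 1
        return None
    lo, hi = dash_range
    if not all(lo <= dist(a, b) <= hi for a, b in segments):   # condition 2
        return None

    # Two segments are neighbours if any of their end points nearly coincide.
    nbrs = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if any(dist(p, q) <= join_tol for p in segments[i] for q in segments[j]):
            nbrs[i].add(j)
            nbrs[j].add(i)

    # Condition 5: a chain has two ends (1 neighbour) and n-2 inner links (2).
    if sorted(len(v) for v in nbrs.values()) != [1, 1] + [2] * (n - 2):
        return None

    # Condition 3: every segment is reachable from the first one.
    seen, stack = {0}, [0]
    while stack:
        for j in nbrs[stack.pop()]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    if len(seen) != n:
        return None

    # Condition 4: centres lie close to the line through the two end centres.
    centres = [((a[0] + b[0]) / 2, (a[1] + b[1]) / 2) for a, b in segments]
    e0, e1 = [i for i in range(n) if len(nbrs[i]) == 1]
    (ax, ay), (bx, by) = centres[e0], centres[e1]
    base = dist((ax, ay), (bx, by))
    if base == 0:
        return None
    for cx, cy in centres:
        if abs((bx - ax) * (cy - ay) - (by - ay) * (cx - ax)) / base > line_tol:
            return None

    # Condition 6: the two free end points are the farthest-apart pair overall.
    free = [p for i, seg in enumerate(segments) for p in seg
            if all(dist(p, q) > join_tol
                   for j, other in enumerate(segments) if j != i for q in other)]
    points = [p for seg in segments for p in seg]
    farthest = max(dist(p, q) for p, q in combinations(points, 2))
    if len(free) != 2 or dist(free[0], free[1]) < farthest - 1e-9:
        return None
    # Consequence: a wavy bond between these points, direction unknown.
    return free[0], free[1]


zigzag = [((0, 0), (2, 2)), ((2, 2), (4, 0)), ((4, 0), (6, 2)), ((6, 2), (8, 0))]
dashes = [((0, 0), (2, 0)), ((3, 0), (5, 0)), ((6, 0), (8, 0))]
print(match_wavy_bond(zigzag, dash_range=(2.0, 4.0)))  # ((0, 0), (8, 0))
print(match_wavy_bond(dashes, dash_range=(1.5, 4.0)))  # None
```

The second call fails because the dashes do not touch each other, so the chain test of condition 5 rejects them; a separate rule would have to handle that pattern.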

Models

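MolRec is a rule-based system and does not use trained deep learning models or weights. Its only data-driven component is the nearest neighbor character classifier described under Algorithms.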

Superatoms: Uses a dictionary look-up to resolve character groups representing superatoms into subgraphs.
Disambiguation: Context-based logic is applied after graph construction to resolve ambiguities (e.g., distinguishing vertical bond | from letter I or digit 1).
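A dictionary look-up for superatoms might look like the sketch below. The dictionary entries, the `(atoms, bonds)` representation, and the convention that the first atom is the attachment point are assumptions made for illustration; the paper does not list MolRec's actual table.

```python
from typing import Dict, List, Optional, Tuple

# (atoms, bonds); each bond is (index_a, index_b, bond_order).
# By assumption, atom 0 is the point of attachment to the rest of the molecule.
Subgraph = Tuple[List[str], List[Tuple[int, int, int]]]

# Hypothetical entries for illustration; not MolRec's published dictionary.
SUPERATOMS: Dict[str, Subgraph] = {
    "Me": (["C"], []),
    "Et": (["C", "C"], [(0, 1, 1)]),
    "OMe": (["O", "C"], [(0, 1, 1)]),
    "COOH": (["C", "O", "O"], [(0, 1, 2), (0, 2, 1)]),
}


def expand_superatom(label: str) -> Optional[Subgraph]:
    """Return the subgraph for a character group, or None if it is unknown."""
    return SUPERATOMS.get(label)


print(expand_superatom("OMe"))  # (['O', 'C'], [(0, 1, 1)])
print(expand_superatom("Xyz"))  # None
```

Returning `None` for unknown labels leaves the decision of how to treat an unrecognized character group to the caller.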

Evaluation

Set	Run 1	Run 2	Run 3	Run 4
Auto (865 images)	96.18% (832/865)	94.91% (821/865)	94.91% (821/865)	96.18% (832/865)
Manual (95 images)	46.32% (44/95)	58.95% (56/95)	46.32% (44/95)	56.84% (54/95)
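The percentages in the table follow directly from the correct-image counts. A short check, using only the counts reported above (the variable names are illustrative):

```python
# Correct-image counts per run, as reported in the table above.
results = {
    "Auto": {"total": 865, "correct": [832, 821, 821, 832]},
    "Manual": {"total": 95, "correct": [44, 56, 44, 54]},
}

accuracy = {
    name: [round(100 * c / r["total"], 2) for c in r["correct"]]
    for name, r in results.items()
}

for name, pcts in accuracy.items():
    print(name, pcts)
# Auto [96.18, 94.91, 94.91, 96.18]
# Manual [46.32, 58.95, 46.32, 56.84]
```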

Key Parameters:

Dash Length: Range of acceptable values for dashed lines.
Simplification Threshold: 1-2x average line width for Douglas-Peucker.

Artifacts

Artifact	Type	License	Notes
CLEF 2012 Workshop Paper	Other	Open Access	CEUR Workshop Proceedings

Reproducibility Classification: Closed

No source code for the MolRec system has been publicly released. The CLEF 2012 evaluation dataset was distributed to task participants and is not openly available. The rule-based algorithm is described in sufficient detail to re-implement, but exact parameter values and the character classification training set are not fully specified. No hardware or compute requirements are reported.

Citation

@inproceedings{sadawi2012molrec,
  title={MolRec at CLEF 2012--Overview and Analysis of Results},
  author={Sadawi, Noureddin M and Sexton, Alan P and Sorge, Volker},
  booktitle={CLEF 2012 Evaluation Labs and Workshop, Online Working Notes},
  year={2012},
  url={https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf}
}
