Paper Information

Citation: Sadawi, N. M., Sexton, A. P., & Sorge, V. (2012). Chemical structure recognition: A rule based approach. Proceedings of SPIE, 8297, 82970E. https://doi.org/10.1117/12.912185

Publication: IS&T/SPIE Electronic Imaging 2012

What kind of paper is this?

Methodological Paper ($\Psi_{\text{Method}}$)

This paper proposes a novel mechanism (MolRec) for Optical Chemical Structure Recognition (OCSR). It focuses on defining a “strictly rule based system” to transform vectorised molecule images into graph representations, contrasting this declarative approach with procedural or heuristic-heavy methods. The contribution is validated through direct comparison with the leading open-source tool (OSRA).

What is the motivation?

Chemical literature contains vast amounts of information locked in 2D diagrams. This visual data is generally inaccessible to search tools or electronic processing. While commercial and academic tools existed (e.g., OSRA, Kekulé), they typically relied on procedural heuristics that required experimental tuning and were difficult to extend. The authors sought to create a system based on precise, declarative rewrite rules to handle the ambiguity inherent in chemical drawing conventions.

What is the novelty here?

The core novelty is the geometric rewrite rule system (MolRec).

  • Geometric Primitives: Instead of raw pixels, the system operates on five high-level primitives: Line Segment, Arrow, Circle, Triangle, and Character Group.
  • Fuzzy Parameters: It introduces formal definitions for “fuzzy” relationships (e.g., dash-neighbouring, approximate collinearity) to handle drawing irregularities and scanning artifacts.
  • Ambiguity Resolution: Specific rules (R4-R6) are designed to disambiguate visual homoglyphs, such as distinguishing a “triple bond” from a “dashed bold bond” based on context (connected atoms) rather than just pixel density.
  • Explicit “Cutting”: A mechanism to identify implicit carbon nodes within continuous line segments (e.g., cutting a long line at the ends of a shorter parallel line beside it, so that the overlapping stretch becomes a double bond).

What experiments were performed?

The authors compared their system (MolRec) against OSRA (the leading open-source system) on two datasets:

  1. OSRA Benchmark: 5,735 computer-generated diagrams with ground truth MOL files.
  2. Maybridge Dataset: 5,730 images scanned at 300 dpi from a drug catalogue, with ground truth MOL files obtained via InChI lookups.

Evaluation was semantic: output MOL files were compared with the ground truth using OpenBabel to check for structural equivalence, so purely syntactic file differences were ignored.

What outcomes/conclusions?

MolRec outperformed OSRA on both datasets:

  • OSRA Benchmark: MolRec achieved 88.46% accuracy vs. OSRA’s 77.23%.
  • Maybridge Dataset: MolRec achieved 83.84% accuracy vs. OSRA’s 72.57%.

Key Findings:

  • Robustness: The line thinning + Douglas-Peucker vectorization approach was found to be more robust than Hough transform approaches used by other tools.
  • Failure Modes: Major remaining errors were caused by “touching components” (ligatures, characters touching bonds) and complex “superatoms” (abbreviations like “-Ph” or “-COOH”) with ambiguous connection points.
  • Triangle Detection: The “expanding disc” method for identifying wedge bonds was highly effective.

Reproducibility Details

Data

Two distinct datasets were used for validation:

| Dataset | Type | Size | Notes |
|---|---|---|---|
| OSRA Benchmark | Synthetic | 5,735 | Computer-generated diagrams provided by the OSRA project. |
| Maybridge | Scanned | 5,730 | Scanned at 300 dpi from the Maybridge drug catalogue. Ground truth generated via CAS Registry Number $\to$ InChI $\to$ OpenBabel. |

Algorithms

The recognition pipeline consists of three stages: Vectorization, Geometric Processing, and Rule Application.

1. Vectorization & Primitives

  • Binarization & OCR: Connected components are labelled and passed to an OCR engine to extract “Character Groups”.
  • Thinning: Image is thinned to unit width.
  • Simplification: Douglas-Peucker algorithm converts pixel paths into straight Line Segments.
  • Triangle Detection: A disc-growing algorithm walks inside black regions to identify Triangles (wedges). If the disc cannot grow as it moves along the region, the region is treated as a thick line (Bold Bond).
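
The simplification step can be illustrated with a textbook recursive Douglas-Peucker routine. This is a minimal sketch, not MolRec's code: the function names, the point representation, and the tolerance `epsilon` are assumptions made for the example, and the paper's actual tolerance is not given here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def distance_to_line(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        # Degenerate chord: fall back to point-to-point distance.
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / norm


def douglas_peucker(path: List[Point], epsilon: float) -> List[Point]:
    """Reduce a pixel path to the vertices of its straight-line approximation.

    epsilon is a hypothetical tolerance in pixels.
    """
    if len(path) < 3:
        return list(path)
    first, last = path[0], path[-1]
    # Find the interior point farthest from the chord first-last.
    index, max_dist = 0, 0.0
    for i in range(1, len(path) - 1):
        d = distance_to_line(path[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= epsilon:
        # Every interior point is close enough: keep only the chord.
        return [first, last]
    # Otherwise split at the farthest point and simplify both halves.
    left = douglas_peucker(path[: index + 1], epsilon)
    right = douglas_peucker(path[index:], epsilon)
    return left[:-1] + right


# A slightly noisy horizontal stroke followed by a vertical one.
path = [(0.0, 0.0), (1.0, 0.05), (2.0, 0.0), (3.0, 0.02),
        (4.0, 0.0), (4.0, 1.0), (4.0, 2.0)]
vertices = douglas_peucker(path, epsilon=0.5)
```

Consecutive pairs of the returned vertices would then serve as Line Segment primitives; here the noisy path collapses to two segments that meet at the corner.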

2. Fuzzy Parameters

The rules rely on tolerating drawing imperfections using defined parameters:

  • re: Radius of collinearity (strict).
  • dl / ds: Dash length / Dash separation (fuzzy).
  • bdl / bdw: Bold dash length / width (fuzzy).
  • bs: Bond separation (max distance between parallel bonds).
  • ol: Minimal overlap.
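
To make the role of these tolerances concrete, the sketch below shows one possible reading of two fuzzy predicates. It is illustrative only: the numeric defaults are placeholders rather than values from the paper, and the predicate definitions (`approx_collinear`, `dash_neighbouring`) are simplified assumptions, not the paper's formal definitions.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


@dataclass(frozen=True)
class FuzzyParams:
    # Placeholder pixel values for illustration; not taken from the paper.
    re: float = 2.0  # radius of collinearity
    dl: float = 6.0  # maximum dash length
    ds: float = 8.0  # maximum separation between neighbouring dashes


def length(s: Segment) -> float:
    return math.dist(s[0], s[1])


def distance_to_line(p: Point, s: Segment) -> float:
    """Perpendicular distance from p to the line through segment s."""
    (ax, ay), (bx, by) = s
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.dist(p, s[0])
    return abs(dy * (p[0] - ax) - dx * (p[1] - ay)) / norm


def approx_collinear(s1: Segment, s2: Segment, params: FuzzyParams) -> bool:
    """Assumed reading: both endpoints of s2 lie within re of the line through s1."""
    return all(distance_to_line(p, s1) <= params.re for p in s2)


def dash_neighbouring(s1: Segment, s2: Segment, params: FuzzyParams) -> bool:
    """Assumed reading: both segments are dash-sized and their closest endpoints are within ds."""
    if length(s1) > params.dl or length(s2) > params.dl:
        return False
    gap = min(math.dist(p, q) for p in s1 for q in s2)
    return gap <= params.ds


params = FuzzyParams()
a = ((0.0, 0.0), (4.0, 0.0))
b = ((8.0, 0.5), (12.0, 1.0))   # slightly skewed continuation of a
c = ((0.0, 10.0), (4.0, 10.0))  # parallel to a but far away
```

With these placeholder values, `a` and `b` count as collinear, neighbouring dashes despite the small skew, while `c` fails both tests.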

3. The Rule System (R1-R18)

The core logic uses 18 mutually exclusive rules to rewrite geometric primitives into chemical graph edges.

  • Planar Bonds:
    • R1-R3 (Single/Double/Triple): Identifies parallel lines based on bs and ol. Uses “cutting” to split lines at implicit nodes.
  • Ambiguity Resolution (Stereo vs. Planar):
    • R4 (Dashed Bold vs. Triple): Checks context. If purely geometric measures match both, it defaults to Triple unless specific dash constraints are met.
    • R5 (Dashed Wedge vs. Triple): Similar disambiguation based on length monotonicity.
    • R6 (Dashed Wedge vs. Double): Differentiates based on line length differences ($l_1 > l_2$).
  • Stereo Bonds:
    • R7-R9 (Dashed Types): Identifies collinear segments with specific neighbor patterns (1 neighbor for ends, 2 for internal).
    • R10-R11 (Hollow Wedge): Detects triangles formed by 3 or 4 lines.
    • R14 (Solid Wedge): Direct mapping from Triangle primitive.
  • Special Structures:
    • R12 (Wavy Bond): Zig-zag line segments.
    • R13 (Arrow): Dative bond.
    • R16 (Aromatic Ring): Circle inside a cycle of >5 lines.
    • R17-R18 (Bridge Bonds): Handles 2.5D crossing bonds (open or closed gaps).
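
As an illustration of how a geometric precondition for the planar-bond rules might look, the sketch below tests whether two line segments qualify as a double-bond pair using a bond separation `bs` and a minimal overlap `ol`. The helper names, the angle tolerance `max_angle_deg`, and the exact geometric definitions are assumptions for the example; the paper's rules are stated more formally and include conditions not modelled here.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def _unit(s: Segment) -> Tuple[float, float]:
    """Unit direction vector of a non-degenerate segment."""
    (x1, y1), (x2, y2) = s
    n = math.hypot(x2 - x1, y2 - y1)
    return ((x2 - x1) / n, (y2 - y1) / n)


def separation(s1: Segment, s2: Segment) -> float:
    """Perpendicular distance from the midpoint of s2 to the line through s1."""
    ux, uy = _unit(s1)
    mx = (s2[0][0] + s2[1][0]) / 2 - s1[0][0]
    my = (s2[0][1] + s2[1][1]) / 2 - s1[0][1]
    return abs(ux * my - uy * mx)


def overlap(s1: Segment, s2: Segment) -> float:
    """Length of the overlap of s2 with s1 when both are projected onto s1."""
    ux, uy = _unit(s1)

    def proj(p: Point) -> float:
        return (p[0] - s1[0][0]) * ux + (p[1] - s1[0][1]) * uy

    a_lo, a_hi = sorted((proj(s1[0]), proj(s1[1])))
    b_lo, b_hi = sorted((proj(s2[0]), proj(s2[1])))
    return max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))


def is_double_bond_candidate(s1: Segment, s2: Segment, bs: float, ol: float,
                             max_angle_deg: float = 5.0) -> bool:
    """Roughly parallel, at most bs apart, and overlapping by at least ol.

    max_angle_deg is a hypothetical extra tolerance, not a parameter from the paper.
    """
    u1, u2 = _unit(s1), _unit(s2)
    cos_angle = abs(u1[0] * u2[0] + u1[1] * u2[1])
    angle = math.degrees(math.acos(min(1.0, cos_angle)))
    return (angle <= max_angle_deg
            and separation(s1, s2) <= bs
            and overlap(s1, s2) >= ol)


long_line = ((0.0, 0.0), (20.0, 0.0))
short_line = ((5.0, 3.0), (15.0, 3.0))     # runs beside the middle of long_line
far_line = ((5.0, 12.0), (15.0, 12.0))     # parallel but too far away
crossing = ((10.0, -5.0), (10.0, 5.0))     # perpendicular to long_line
```

In this example the projected overlap of `short_line` on `long_line` spans positions 5 to 15, which is where a cutting step would plausibly place the implicit nodes on the longer line.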

Evaluation

Metric: Semantic graph matching. The output MOL file is compared to the ground truth MOL file using OpenBabel. A result counts as a success only when the two molecular graphs are isomorphic.
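
The difference between semantic and syntactic comparison can be shown with a toy isomorphism check on labelled graphs. This is a stand-in for illustration only: it does not use OpenBabel, the data layout (`atoms` list plus `bonds` dictionary) is hypothetical, and the brute-force search over permutations is only practical for very small molecules.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Atoms = List[str]                    # element symbol per atom index
Bonds = Dict[Tuple[int, int], int]   # (i, j) -> bond order


def normalise(bonds: Bonds) -> Bonds:
    """Store each bond with its smaller atom index first."""
    return {(min(i, j), max(i, j)): order for (i, j), order in bonds.items()}


def same_structure(atoms_a: Atoms, bonds_a: Bonds,
                   atoms_b: Atoms, bonds_b: Bonds) -> bool:
    """True if some atom renumbering maps molecule A exactly onto molecule B."""
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    reference = normalise(bonds_b)
    n = len(atoms_a)
    for perm in permutations(range(n)):
        # The renumbering must preserve element labels.
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue
        mapped = normalise({(perm[i], perm[j]): order
                            for (i, j), order in bonds_a.items()})
        if mapped == reference:
            return True
    return False


# Ethanol heavy atoms written in two different atom orders, plus dimethyl ether.
ethanol_1 = (["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
ethanol_2 = (["O", "C", "C"], {(0, 1): 1, (1, 2): 1})
ether = (["C", "O", "C"], {(0, 1): 1, (1, 2): 1})
```

The two ethanol records differ as files (different atom order) but describe the same graph, so they match; dimethyl ether has the same atoms and bond count but a different connectivity, so it does not.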

Results Table:

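| Dataset | System | Success Rate | Fail Rate |
|---|---|---|---|
| OSRA Benchmark | MolRec | 88.46% | 11.54% |
| OSRA Benchmark | OSRA | 77.23% | 22.77% |
| Maybridge | MolRec | 83.84% | 16.16% |
| Maybridge | OSRA | 72.57% | 27.43% |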

Hardware

  • Compute: Requirements not specified, but the approach (vectorization + rule matching) is computationally lightweight compared to modern deep learning methods.