Chemical Structure Recognition (Rule-Based)

Paper Information

Citation: Sadawi, N. M., Sexton, A. P., & Sorge, V. (2012). Chemical structure recognition: A rule based approach. Proceedings of SPIE, 8297, 82970E. https://doi.org/10.1117/12.912185

Publication: IS&T/SPIE Electronic Imaging 2012

Methodological Contribution

Methodological Paper ($\Psi_{\text{Method}}$)

This paper proposes a novel mechanism (MolRec) for Optical Chemical Structure Recognition (OCSR). It focuses on defining a “strictly rule based system” to transform vectorised molecule images into graph representations, contrasting this declarative approach with procedural or heuristic-heavy methods. The contribution is validated through direct comparison with the leading open-source tool (OSRA).

Motivation: Overcoming Procedural Heuristics

Chemical literature contains vast amounts of information locked in 2D diagrams. This visual data is generally inaccessible to search tools or electronic processing. While commercial and academic tools existed (e.g., OSRA, Kekulé), they typically relied on procedural heuristics that required experimental tuning and were difficult to extend. The authors sought to create a system based on precise, declarative rewrite rules to handle the ambiguity inherent in chemical drawing conventions.

Core Innovation: Geometric Rewrite Rules

The core novelty is the geometric rewrite rule system (MolRec).

Geometric Primitives: The system operates on five high-level primitives: Line Segment, Arrow, Circle, Triangle, and Character Group.
Fuzzy Parameters: It introduces formal definitions for “fuzzy” relationships (e.g., dash-neighbouring, approximate collinearity) to handle drawing irregularities and scanning artifacts.
Ambiguity Resolution: Specific rules (R4-R6) are designed to disambiguate visual homoglyphs, such as distinguishing a “triple bond” from a “dashed bold bond” based on context (connected atoms).
Explicit “Cutting”: A mechanism to identify implicit carbon nodes within continuous line segments (e.g., splitting a long line at the point where a shorter parallel line ends, so that the overlapping part can be read as a double bond).

Experimental Setup vs. Baselines

The authors compared their system (MolRec) against OSRA (the leading open-source system) on two datasets:

OSRA Benchmark: 5,735 computer-generated diagrams with ground truth MOL files.
Maybridge Dataset: 5,730 scanned images (300 dpi) from a drug catalogue, converted to ground truth MOL files via InChI lookups.

Evaluation was semantic: the output MOL files were compared using OpenBabel to check for structural equivalence, ignoring syntactic differences between files.

Results and Key Findings

MolRec outperformed OSRA on both datasets:

OSRA Benchmark: MolRec achieved 88.46% accuracy vs. OSRA’s 77.23%.
Maybridge Dataset: MolRec achieved 83.84% accuracy vs. OSRA’s 72.57%.

Key Findings:

Robustness: The line thinning + Douglas-Peucker vectorization approach was found to be more robust than Hough transform approaches used by other tools.
Failure Modes: Major remaining errors were caused by “touching components” (ligatures, characters touching bonds) and complex “superatoms” (abbreviations like “-Ph” or “-COOH”) with ambiguous connection points.
Triangle Detection: The “expanding disc” method for identifying wedge bonds was highly effective.
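The expanding-disc idea can be illustrated with a small sketch. It is not the authors' implementation: the pixel-set representation, the helper names `max_disc_radius` and `classify_region`, and the `min_growth` threshold are all assumptions made for illustration. The sketch measures the largest disc that fits inside the black region at several points along its axis; a radius that grows along the axis suggests a wedge, while a constant radius suggests a bold line.

```python
def max_disc_radius(black, cx, cy, limit=50):
    """Largest integer r such that every pixel within distance r of (cx, cy)
    is in the set `black`. Returns -1 if the centre itself is not black."""
    for r in range(limit + 1):
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                inside_disc = dx * dx + dy * dy <= r * r
                if inside_disc and (cx + dx, cy + dy) not in black:
                    return r - 1
    return limit


def classify_region(black, axis_points, min_growth=2):
    """Label a filled region as 'wedge' or 'bold line' from its radius profile.

    `axis_points` are sample points along the region's axis and `min_growth`
    is a hypothetical threshold, not a value taken from the paper.
    """
    radii = [max_disc_radius(black, x, y) for x, y in axis_points]
    if max(radii) - min(radii) >= min_growth:
        return "wedge"
    return "bold line"


# Synthetic shapes: a wedge whose half-height grows with x, and a uniform bar.
wedge = {(x, y) for x in range(30) for y in range(-(x // 5), x // 5 + 1)}
bar = {(x, y) for x in range(30) for y in range(-2, 3)}
axis = [(x, 0) for x in (5, 10, 15, 20, 25)]

print(classify_region(wedge, axis))
print(classify_region(bar, axis))
```

A real system would also have to find the axis itself and tolerate scanning noise; both are omitted here.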

Reproducibility Details

Data

Two distinct datasets were used for validation:

Dataset	Type	Size	Notes
OSRA Benchmark	Synthetic	5,735	Computer-generated diagrams provided by the OSRA project.
Maybridge	Scanned	5,730	Scanned at 300 dpi from the Maybridge drug catalogue. Ground truth generated via CAS Registry Number $\to$ InChI $\to$ OpenBabel.

Algorithms

The recognition pipeline consists of three stages: Vectorization, Geometric Processing, and Rule Application.

1. Vectorization & Primitives

Binarization & OCR: Connected components are labelled and passed to an OCR engine to extract “Character Groups”.
Thinning: Image is thinned to unit width.
Simplification: Douglas-Peucker algorithm converts pixel paths into straight Line Segments.
Triangle Detection: A disc growing algorithm walks inside black regions to identify Triangles (wedges). If the disc cannot grow, it is a thick line (Bold Bond).
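The simplification step can be sketched with a textbook Douglas-Peucker implementation. The tolerance `epsilon` and the sample path are illustrative assumptions; the summary does not state which tolerance MolRec uses.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify a polyline, keeping points that deviate more than epsilon."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= epsilon:
        return [first, last]
    # Split at the farthest point and simplify both halves recursively.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right


# A slightly jittery L-shaped pixel path, as thinning might produce.
path = [(0, 0), (1, 0.1), (2, -0.1), (3, 0), (3, 1), (3.1, 2), (3, 3)]
print(douglas_peucker(path, epsilon=0.5))  # [(0, 0), (3, 0), (3, 3)]
```

The corner at (3, 0) survives because it lies far from the chord joining the endpoints, which is how a bend in a bond line becomes a shared endpoint of two Line Segments.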

2. Fuzzy Parameters

The rules rely on tolerating drawing imperfections using defined parameters:

$r_e$: Radius of collinearity (strict).
$d_l$ / $d_s$: Dash length / Dash separation (fuzzy).
$bdl$ / $bdw$: Bold dash length / width (fuzzy).
$bs$: Bond separation (max distance between parallel bonds).
$ol$: Minimal overlap.
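One way to picture how such parameters feed fuzzy predicates is the sketch below. The numeric defaults are placeholders, and the two predicate definitions (`approx_collinear`, `dash_neighbours`) are simplified assumptions; the paper's formal definitions are not reproduced in this summary.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyParams:
    # All values are hypothetical pixel units chosen for illustration only.
    r_e: float = 2.0   # radius of collinearity
    d_l: float = 8.0   # dash length
    d_s: float = 6.0   # dash separation
    bdl: float = 10.0  # bold dash length
    bdw: float = 4.0   # bold dash width
    bs: float = 12.0   # bond separation
    ol: float = 5.0    # minimal overlap


def distance_to_line(p, a, b):
    """Distance from p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (py - ay) - (by - ay) * (px - ax)) / length


def approx_collinear(seg_a, seg_b, params):
    """Assumed reading of r_e: both endpoints of seg_b lie within r_e
    of the line through seg_a."""
    a0, a1 = seg_a
    return all(distance_to_line(p, a0, a1) <= params.r_e for p in seg_b)


def dash_neighbours(seg_a, seg_b, params):
    """Assumed reading of d_l and d_s: both segments are short enough to be
    dashes and their nearest endpoints are at most d_s apart."""
    short = all(math.dist(s[0], s[1]) <= params.d_l for s in (seg_a, seg_b))
    gap = min(math.dist(p, q) for p in seg_a for q in seg_b)
    return short and gap <= params.d_s


params = FuzzyParams()
dash1 = ((0, 0), (5, 0))
dash2 = ((9, 1), (14, 1))   # slightly offset, as in a hand-drawn dashed bond
far = ((30, 0), (35, 0))
print(approx_collinear(dash1, dash2, params), dash_neighbours(dash1, dash2, params))
print(dash_neighbours(dash1, far, params))
```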

3. The Rule System (R1-R18)

The core logic uses 18 mutually exclusive rules to rewrite geometric primitives into chemical graph edges.

Planar Bonds:
- R1-R3 (Single/Double/Triple): Identifies parallel lines based on $bs$ and $ol$. Uses “cutting” to split lines at implicit nodes.
Ambiguity Resolution (Stereo vs. Planar):
- R4 (Dashed Bold vs. Triple): Checks context. If purely geometric measures match both, it defaults to Triple unless specific dash constraints are met.
- R5 (Dashed Wedge vs. Triple): Similar disambiguation based on length monotonicity.
- R6 (Dashed Wedge vs. Double): Differentiates based on line length differences ($l_1 > l_2$).
Stereo Bonds:
- R7-R9 (Dashed Types): Identifies collinear segments with specific neighbor patterns (1 neighbor for ends, 2 for internal).
- R10-R11 (Hollow Wedge): Detects triangles formed by 3 or 4 lines.
- R14 (Solid Wedge): Direct mapping from Triangle primitive.
Special Structures:
- R12 (Wavy Bond): Zig-zag line segments.
- R13 (Arrow): Dative bond.
- R16 (Aromatic Ring): Circle inside a cycle of >5 lines.
- R17-R18 (Bridge Bonds): Handles 2.5D crossing bonds (open or closed gaps).
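As a rough illustration of the planar-bond rules (R1-R3), the sketch below counts how many near-parallel, overlapping lines sit within $bs$ of one another and maps that count to a bond order. It is a simplification under stated assumptions: the function names and the 5-degree angle tolerance are invented here, and cutting and the context checks of R4-R6 are left out entirely.

```python
import math


def unit_direction(seg):
    (x0, y0), (x1, y1) = seg
    length = math.hypot(x1 - x0, y1 - y0)
    return ((x1 - x0) / length, (y1 - y0) / length)


def is_parallel(seg_a, seg_b, angle_tol_deg=5.0):
    """True if the angle between the two segments is within the tolerance."""
    ax, ay = unit_direction(seg_a)
    bx, by = unit_direction(seg_b)
    return abs(ax * by - ay * bx) <= math.sin(math.radians(angle_tol_deg))


def projected_overlap(seg_a, seg_b):
    """Length shared by both segments when projected onto seg_a's direction."""
    ux, uy = unit_direction(seg_a)
    ox, oy = seg_a[0]

    def span(seg):
        ts = [(x - ox) * ux + (y - oy) * uy for x, y in seg]
        return min(ts), max(ts)

    (a_lo, a_hi), (b_lo, b_hi) = span(seg_a), span(seg_b)
    return max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))


def perpendicular_offset(seg_a, seg_b):
    """Signed distance of seg_b's midpoint from the line through seg_a."""
    ux, uy = unit_direction(seg_a)
    mx = (seg_b[0][0] + seg_b[1][0]) / 2 - seg_a[0][0]
    my = (seg_b[0][1] + seg_b[1][1]) / 2 - seg_a[0][1]
    return ux * my - uy * mx


def classify_planar_bond(ref, others, bs, ol):
    """Count lines stacked next to `ref` (adjacent gaps <= bs, overlap >= ol)."""
    candidates = [(0.0, True)]  # (offset, is_reference)
    for seg in others:
        if is_parallel(ref, seg) and projected_overlap(ref, seg) >= ol:
            candidates.append((perpendicular_offset(ref, seg), False))
    candidates.sort()
    lo = hi = candidates.index((0.0, True))
    while lo > 0 and candidates[lo][0] - candidates[lo - 1][0] <= bs:
        lo -= 1
    while hi < len(candidates) - 1 and candidates[hi + 1][0] - candidates[hi][0] <= bs:
        hi += 1
    return {1: "single", 2: "double", 3: "triple"}.get(hi - lo + 1, "unknown")


ref = ((0, 0), (20, 0))
upper = ((0, 4), (20, 4))
lower = ((0, -4), (20, -4))
crossing = ((10, -5), (10, 5))

print(classify_planar_bond(ref, [crossing], bs=6, ol=10))      # single
print(classify_planar_bond(ref, [upper], bs=6, ol=10))         # double
print(classify_planar_bond(ref, [upper, lower], bs=6, ol=10))  # triple
```

A purely geometric count like this is exactly what makes a triple bond and a three-dash stereo bond look alike, which is why the summary's R4-R6 add contextual checks on top.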

Evaluation

Metric: Semantic graph matching. The output MOL file is compared to the ground truth MOL file using OpenBabel. A result counts as a success only when the recognised molecular graph is isomorphic to the ground truth graph.
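To show why a semantic comparison differs from a file-level one, here is a toy labelled-graph isomorphism check in plain Python. It does not use OpenBabel and is not how the authors ran their evaluation; the atom-list and bond-dictionary encoding is an assumption, and the brute-force search is only practical for very small molecules.

```python
from itertools import permutations


def normalise(bonds):
    """Turn {(i, j): order} into {frozenset({i, j}): order}."""
    return {frozenset(pair): order for pair, order in bonds.items()}


def molecules_match(atoms_a, bonds_a, atoms_b, bonds_b):
    """True if some renumbering of A's atoms reproduces B exactly."""
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    target = normalise(bonds_b)
    n = len(atoms_a)
    for perm in permutations(range(n)):
        # perm[i] is the index in B that atom i of A is mapped to.
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue
        mapped = {
            frozenset((perm[i], perm[j])): order
            for (i, j), order in bonds_a.items()
        }
        if mapped == target:
            return True
    return False


# Ethanol heavy atoms written with two different atom numberings.
ethanol_1 = (["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
ethanol_2 = (["O", "C", "C"], {(0, 1): 1, (1, 2): 1})
# Dimethyl ether has the same atoms but a different connectivity.
ether = (["C", "O", "C"], {(0, 1): 1, (1, 2): 1})

print(molecules_match(*ethanol_1, *ethanol_2))  # True
print(molecules_match(*ethanol_1, *ether))      # False
```

The two ethanol encodings would differ line by line in a MOL file, yet they describe the same molecule, which is the distinction the metric is meant to capture.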

Results Table:

Dataset	System	Success Rate	Fail Rate
OSRA Benchmark	MolRec	88.46%	11.54%
OSRA Benchmark	OSRA	77.23%	22.77%
Maybridge	MolRec	83.84%	16.16%
Maybridge	OSRA	72.57%	27.43%
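The gap between the two systems can be read either as an absolute gain in success rate or as a relative reduction in failures. The short calculation below derives both from the reported percentages; the derived figures are arithmetic on the table, not numbers quoted from the paper.

```python
# Success rates (%) as reported in the results table.
results = {
    "OSRA Benchmark": {"MolRec": 88.46, "OSRA": 77.23},
    "Maybridge": {"MolRec": 83.84, "OSRA": 72.57},
}


def absolute_gain(rates):
    """Difference in success rate, in percentage points."""
    return rates["MolRec"] - rates["OSRA"]


def relative_error_reduction(rates):
    """Fraction of OSRA's failures that MolRec avoids."""
    osra_fail = 100.0 - rates["OSRA"]
    molrec_fail = 100.0 - rates["MolRec"]
    return (osra_fail - molrec_fail) / osra_fail


for dataset, rates in results.items():
    gain = absolute_gain(rates)
    reduction = relative_error_reduction(rates)
    print(f"{dataset}: +{gain:.2f} points, {reduction:.1%} fewer failures")
```

On both datasets the absolute gain is a little over 11 percentage points, which corresponds to removing roughly half of OSRA's failures on the synthetic benchmark and about two fifths on the scanned images.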

Hardware

Compute: Requirements not specified, but the approach (vectorization + rule matching) is computationally lightweight compared to modern deep learning methods.
