<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Rule-Based Systems on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/</link><description>Recent content in Rule-Based Systems on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/index.xml" rel="self" type="application/rss+xml"/><item><title>GraphReco: Probabilistic Structure Recognition (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</guid><description>GraphReco is a rule-based OCSR system using Markov networks for probabilistic atom/bond ambiguity resolution during graph assembly.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, H., Yu, Y., &amp; Liu, J.-C. (2026). GraphReco: Probabilistic Structure Recognition for Chemical Molecules. <em>ChemistryOpen</em>, e202500537. <a href="https://doi.org/10.1002/open.202500537">https://doi.org/10.1002/open.202500537</a></p>
<p><strong>Publication</strong>: ChemistryOpen 2026 (Open Access)</p>
<h2 id="a-rule-based-ocsr-system-with-probabilistic-graph-assembly">A Rule-Based OCSR System with Probabilistic Graph Assembly</h2>
<p>GraphReco tackles a challenge that is rarely addressed explicitly in rule-based OCSR: the ambiguity that arises during graph assembly when lower-level component extraction results are imprecise. Small deviations in bond endpoint locations, false positive detections, and spatial proximity between elements all create uncertainty about which atoms and bonds should be connected, merged, or discarded.</p>
<p>The system introduces two main contributions:</p>
<ol>
<li><strong>Fragment Merging (FM) line detection</strong>: An adaptive three-stage algorithm for precise bond line identification across images of variable resolution</li>
<li><strong>Probabilistic ambiguity resolution</strong>: A Markov network that infers the most likely existence and merging state of atom and bond candidates</li>
</ol>
<h2 id="three-stage-pipeline">Three-Stage Pipeline</h2>
<p>GraphReco follows a three-stage workflow:</p>
<ol>
<li>
<p><strong>Component Extraction</strong>: Detects circles (aromatic bonds), bond lines (via the FM algorithm), and chemical symbols (via Tesseract OCR). Includes detection of solid wedge, dashed wedge, dashed line, and wavy bond styles. A semi-open-loop correction step resolves cases where symbols are misclassified as bonds and vice versa.</p>
</li>
<li>
<p><strong>Atom and Bond Ambiguity Resolution</strong>: Creates atom and bond candidates from detected components, builds a Markov network to infer their most probable states, and resolves candidates through existence and merging decisions.</p>
</li>
<li>
<p><strong>Graph Reconstruction</strong>: Assembles resolved atoms and bonds into a molecule graph, selects the largest connected component, and exports as MDL Molfile.</p>
</li>
</ol>
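<p>The three stages can be sketched as a minimal skeleton. All function names and placeholder data here are invented for exposition; the authors have not released source code.</p>

```python
# Hypothetical skeleton of GraphReco's three-stage workflow (names invented;
# no official implementation is available).

def extract_components(image):
    # Stage 1: circles (aromatic rings), bond lines via the FM algorithm,
    # and chemical symbols via OCR. Placeholder values stand in for detectors.
    return {"circles": [], "lines": [((0, 0), (10, 0))], "symbols": []}

def resolve_ambiguity(components):
    # Stage 2: each bond-line endpoint spawns an atom candidate; a Markov
    # network (not shown) decides which candidates exist or merge.
    atoms = [p for line in components["lines"] for p in line]
    bonds = [(2 * i, 2 * i + 1) for i in range(len(components["lines"]))]
    return atoms, bonds

def reconstruct_graph(atoms, bonds):
    # Stage 3: assemble the molecule graph; the real system keeps only the
    # largest connected component and writes an MDL Molfile.
    return {"atoms": atoms, "bonds": bonds}

graph = reconstruct_graph(*resolve_ambiguity(extract_components(None)))
print(graph)
```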
<h2 id="fragment-merging-line-detection">Fragment Merging Line Detection</h2>
<p>Classical Line Hough Transform (LHT) struggles with chemical structure images because bond lines suffer from pixelization, and algorithm parameters that work for one image resolution fail at others. The FM algorithm addresses this with three stages:</p>
<ol>
<li>
<p><strong>Fragment extraction</strong>: Apply LHT with fine-resolution parameters (distance resolution $r = 2$, angular resolution $\theta = 2°$) to detect fine line fragments. Walk along detected theoretical lines to find actual black pixels and group them by connectivity.</p>
</li>
<li>
<p><strong>Fragment grouping</strong>: Pair fragments that share similar angles, are close in the perpendicular direction, and are either overlapping or connected by a path of black pixels.</p>
</li>
<li>
<p><strong>Fragment merging</strong>: Merge grouped fragments into single line segments using the two border pixels farthest from the centroid.</p>
</li>
</ol>
<p>The FM algorithm effectively handles the tradeoff that plagues standard LHT: coarse parameters miss short lines and produce overlaps, while fine parameters return many fragments shorter than actual bonds.</p>
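<p>The grouping and merging stages can be sketched as follows. The thresholds are assumptions (the paper's exact criteria, including the black-pixel connectivity check, are not reproduced), and the merge step uses the farthest-apart endpoint pair as a stand-in for the paper's "two border pixels farthest from the centroid".</p>

```python
import math
from itertools import combinations

ANGLE_TOL = math.radians(5)   # assumed angular tolerance, not from the paper
PERP_TOL = 3.0                # assumed perpendicular-distance tolerance (px)

def angle(frag):
    # Undirected line angle of a fragment ((x1, y1), (x2, y2)), in [0, pi).
    (x1, y1), (x2, y2) = frag
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def perpendicular_gap(a, b):
    # Midpoint offset of b from a, projected perpendicular to a's direction.
    t = angle(a)
    mid = lambda f: ((f[0][0] + f[1][0]) / 2, (f[0][1] + f[1][1]) / 2)
    (mx1, my1), (mx2, my2) = mid(a), mid(b)
    # Perpendicular unit vector to (cos t, sin t) is (-sin t, cos t).
    return abs(-(mx2 - mx1) * math.sin(t) + (my2 - my1) * math.cos(t))

def same_group(a, b):
    da = abs(angle(a) - angle(b))
    da = min(da, math.pi - da)  # wrap-around for near-horizontal lines
    return da < ANGLE_TOL and perpendicular_gap(a, b) < PERP_TOL

def merge(fragments):
    # Merge a group into one segment: keep the two endpoints farthest apart.
    pts = [p for f in fragments for p in f]
    return max(combinations(pts, 2), key=lambda pq: math.dist(*pq))

frags = [((0, 0), (4, 0)), ((5, 0.5), (9, 0.5))]
print(same_group(*frags))  # near-parallel, close in the perpendicular direction
print(merge(frags))
```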
<h2 id="probabilistic-ambiguity-resolution-via-markov-network">Probabilistic Ambiguity Resolution via Markov Network</h2>
<p>After component extraction, GraphReco creates atom and bond candidates rather than directly assembling the graph. Each bond endpoint generates an atom candidate with a circular bounding area of radius:</p>
<p>$$r_b = \min(l_{\text{bond}}, l_{\text{med}}) / 4$$</p>
<p>where $l_{\text{bond}}$ is the bond length and $l_{\text{med}}$ is the median bond length.</p>
<p>A Markov network is constructed with four types of nodes:</p>
<ul>
<li><strong>Atom nodes</strong>: Boolean existence variables for each atom candidate</li>
<li><strong>Bond nodes</strong>: Boolean existence variables for each bond candidate</li>
<li><strong>Atom merge nodes</strong>: Boolean variables for pairs of overlapping atom candidates</li>
<li><strong>Bond merge nodes</strong>: Boolean variables for pairs of nearby bond candidates</li>
</ul>
<p>Potential functions encode rules about when candidates should exist or merge, with merging likelihood between two bond-ending atom candidates defined as a piecewise function of center distance $d$:</p>
<p>$$P(a_1, a_2) = \begin{cases} 0.9, &amp; \text{if } d \leq Q \\ 0.7 - 0.4(d - Q)/(R - Q), &amp; \text{if } Q &lt; d \leq R \\ 0.1, &amp; \text{if } d &gt; R \end{cases}$$</p>
<p>where $Q = \max(r_1, r_2)$ and $R = \min(1.5Q, r_1 + r_2)$. MAP inference determines the final state of all candidates.</p>
<h2 id="evaluation-results">Evaluation Results</h2>
<p>GraphReco is evaluated on USPTO benchmarks with InChI string comparison (stereochemistry removed):</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td><strong>94.2</strong></td>
          <td><strong>86.7</strong></td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>92.4</td>
          <td>70.3</td>
          <td>89.1</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>89.9</td>
          <td>63.0</td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>89.7</td>
          <td>63.9</td>
          <td>89.3</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>93.3</td>
          <td>82.8</td>
          <td><strong>91.5</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>35.4</td>
          <td>13.8</td>
          <td>25.2</td>
      </tr>
  </tbody>
</table>
<p>GraphReco outperforms all rule-based systems and most ML systems, with a particularly large margin on USPTO-10K-Abb (abbreviation-heavy molecules). MolGrapher achieves slightly higher accuracy on the USPTO dataset.</p>
<h3 id="robustness-on-perturbed-images">Robustness on Perturbed Images</h3>
<p>On USPTO-perturbed (rotation and shearing applied), rule-based methods degrade substantially:</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-perturbed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolGrapher</td>
          <td><strong>86.7</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>42.3</td>
      </tr>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td>40.6</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>30.7</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>6.4</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>5.1</td>
      </tr>
  </tbody>
</table>
<p>GraphReco performs better than other rule-based systems on perturbed inputs (40.6% vs. under 31%) thanks to its probabilistic assembly, but still falls far behind MolGrapher (86.7%), demonstrating the robustness advantage of learned approaches.</p>
<h2 id="ablation-study">Ablation Study</h2>
<p>Each component contributes substantially to overall performance on USPTO-10K:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full system</td>
          <td>94.2</td>
          <td>86.7</td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>Without FM line detection</td>
          <td>2.9</td>
          <td>5.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>Without atom candidates</td>
          <td>9.8</td>
          <td>0.4</td>
          <td>5.0</td>
      </tr>
      <tr>
          <td>Without bond candidates</td>
          <td>79.1</td>
          <td>75.8</td>
          <td>75.0</td>
      </tr>
      <tr>
          <td>Without Markov network</td>
          <td>88.2</td>
          <td>81.4</td>
          <td>84.2</td>
      </tr>
  </tbody>
</table>
<p>The FM algorithm and atom candidate mechanism are both critical (accuracy drops below 10% without either). Bond candidates provide a moderate improvement (~15 percentage points), and the Markov network adds ~6 points over hard-threshold alternatives.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Deterministic expert rules limit robustness on perturbed or noisy images, as evidenced by the large accuracy gap with MolGrapher on USPTO-perturbed</li>
<li>The system relies on Tesseract OCR for symbol recognition, which may struggle with unusual fonts or degraded image quality</li>
<li>Only handles single 2D molecule structures per image</li>
<li>Stereochemistry is removed during evaluation, so performance on stereo-bond recognition is not assessed</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>GraphReco is implemented in Python and relies on Tesseract OCR, OpenCV, and RDKit. The authors provided an online demo for testing, but have released neither the source code nor a public repository.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Online Demo</td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Google Cloud Run deployment (no longer available)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components for full reproduction:</strong></p>
<ul>
<li>Source code is not publicly available</li>
<li>No pre-built package or installable library</li>
<li>Hyperparameters for Markov network potential functions are given in the paper (Equations 8-11), but full implementation details are not released</li>
</ul>
<p><strong>Hardware/compute requirements:</strong> Not specified in the paper. The system uses classical computer vision (Hough transforms, thinning) and probabilistic inference (Markov networks), so GPU hardware is likely not required.</p>
]]></content:encoded></item><item><title>Research on Chemical Expression Images Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</guid><description>A 2015 methodology for Optical Chemical Structure Recognition (OCSR) focusing on improved handling of adhesive symbols and wedge bonds.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hong, C., Du, X., &amp; Zhang, L. (2015). Research on Chemical Expression Images Recognition. <em>Proceedings of the 2015 Joint International Mechanical, Electronic and Information Technology Conference</em>, 267-271. <a href="https://doi.org/10.2991/jimet-15.2015.50">https://doi.org/10.2991/jimet-15.2015.50</a></p>
<p><strong>Publication</strong>: JIMET 2015 (Atlantis Press)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://jsme-editor.github.io/">JSME Editor (used for visualization)</a></li>
</ul>
<h2 id="contribution-new-ocsr-workflow-for-adhesion-and-wedge-bonds">Contribution: New OCSR Workflow for Adhesion and Wedge Bonds</h2>
<p><strong>Method</strong>. The paper proposes a novel rule-based pipeline for recognizing 2D organic chemical structures from images. It validates the method by comparing it against an existing tool (OSRA) using a quantitative metric (the Tanimoto coefficient) on a test set of 200 images.</p>
<h2 id="motivation-challenges-with-connecting-symbols-and-stereochemistry">Motivation: Challenges with Connecting Symbols and Stereochemistry</h2>
<p>A vast amount of chemical structural information exists in scientific literature (PDFs/images) that is not machine-readable. Manually converting these images to formats like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or CML is labor-intensive. Existing tools face challenges with:</p>
<ol>
<li><strong>Adhesion</strong>: Poor separation when chemical symbols touch or overlap with bonds.</li>
<li><strong>Stereochemistry</strong>: Incomplete identification of &ldquo;real&rdquo; (solid) and &ldquo;virtual&rdquo; (dashed/hashed) wedge bonds.</li>
</ol>
<h2 id="core-innovation-vector-based-separation-and-stereochemical-logic">Core Innovation: Vector-Based Separation and Stereochemical Logic</h2>
<p>The authors propose a specific <strong>OCSR (Optical Chemical Structure Recognition)</strong> workflow with two key technical improvements:</p>
<ol>
<li><strong>Vector-based Separation</strong>: The method vectorizes the image (using Potrace) to extract straight lines and curves, allowing better separation of &ldquo;adhesive&rdquo; chemical symbols (like H, N, O attached to bonds).</li>
<li><strong>Stereochemical Logic</strong>: Specific rules for identifying wedge bonds:
<ul>
<li><strong>Virtual (Dashed) Wedges</strong>: Identified by grouping connected domains and checking linear correlation of their center points.</li>
<li><strong>Real (Solid) Wedges</strong>: Identified after thinning by analyzing linear correlation and width variance of line segments.</li>
</ul>
</li>
</ol>
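<p>The virtual-wedge test can be sketched as a collinearity check on the fragment centers. The correlation threshold and minimum fragment count are assumptions; the paper does not publish its exact values.</p>

```python
import math

def linear_correlation(points):
    # Absolute Pearson correlation of (x, y) center points; 1.0 = collinear.
    n = len(points)
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:  # perfectly vertical/horizontal arrangement
        return 1.0
    return abs(sxy) / math.sqrt(sxx * syy)

def is_dashed_bond(centers, threshold=0.98):  # threshold is an assumption
    # Small connected domains whose centers are nearly collinear are
    # grouped into a single "virtual" (dashed) wedge bond.
    return len(centers) >= 3 and linear_correlation(centers) >= threshold

dashes = [(0, 0), (2, 1.02), (4, 2.0), (6, 2.97)]  # centers of dash fragments
print(is_dashed_bond(dashes))
```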
<h2 id="methodology--experimental-setup">Methodology &amp; Experimental Setup</h2>
<ul>
<li>
<p><strong>Dataset</strong>: 200 chemical structure images collected from the network.</p>
</li>
<li>
<p><strong>Baselines</strong>: Compared against <strong>OSRA</strong> (Optical Structure Recognition Application), a free online tool.</p>
</li>
<li>
<p><strong>Metric</strong>: <strong>Tanimoto Coefficient</strong>, measuring the similarity of the set of recognized bonds and symbols against the ground truth. The similarity $T(A, B)$ is defined as:</p>
<p>$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</p>
</li>
</ul>
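<p>For the set-based formulation used here (recognized bonds and symbols vs. ground truth), the Tanimoto coefficient is a few lines:</p>

```python
def tanimoto(a: set, b: set) -> float:
    # T(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|); two empty sets count as identical.
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if (a or b) else 1.0

# Illustrative element sets (not from the paper's test data):
truth = {"C-C", "C=O", "O-H", "C-N"}
pred = {"C-C", "C=O", "O-H"}
print(tanimoto(pred, truth))  # 3 / (3 + 4 - 3) = 0.75
```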
<h2 id="results--conclusions">Results &amp; Conclusions</h2>
<ul>
<li><strong>Performance</strong>: The proposed OCSR method achieved higher recognition rates than OSRA.
<ul>
<li><strong>Exact Match (100%)</strong>: OCSR achieved 90.0% vs. OSRA&rsquo;s 82.2%.</li>
<li><strong>High Similarity (&gt;85%)</strong>: OCSR recognized 157 of the 200 structures at this level vs. OSRA&rsquo;s 114.</li>
</ul>
</li>
<li><strong>Limitations</strong>: The paper notes that &ldquo;real wedge&rdquo; and &ldquo;virtual wedge&rdquo; identification was a primary focus, but general recognition effectiveness still &ldquo;has room for improvement&rdquo;.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom collection of images, not a standard benchmark.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Web-crawled chemical images</td>
          <td>200 structures</td>
          <td>Images containing 2D organic structures; specific source URLs not provided.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows these specific steps:</p>
<ol>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Grayscale</strong>: via <code>cvCvtColor</code> (OpenCV).</li>
<li><strong>Binarization</strong>: via Otsu&rsquo;s method.</li>
</ul>
</li>
<li><strong>Isolated Symbol Removal</strong>:
<ul>
<li>Identifies connected domains with aspect ratios in <code>[0.8, 3.0]</code>.</li>
<li>Recognizes them using OCR (GOCR, OCRAD, Tesseract) and removes them from the image.</li>
</ul>
</li>
<li><strong>Virtual Wedge Recognition</strong>:
<ul>
<li>Groups small connected domains (points/clumps).</li>
<li>Calculates linear correlation of center points; if collinear, treats as a dashed bond.</li>
</ul>
</li>
<li><strong>Vectorization &amp; Thinning</strong>:
<ul>
<li><strong>Thinning</strong>: Rosenfeld algorithm (optimized) to reduce lines to single pixel width.</li>
<li><strong>Vectorization</strong>: Uses <strong>Potrace</strong> to convert pixels to vector segments.</li>
<li><strong>Merging</strong>: Combines split vector segments based on angle thresholds to form long straight lines.</li>
</ul>
</li>
<li><strong>Adhesive Symbol Separation</strong>:
<ul>
<li>Identifies curves (short segments after vectorization) attached to long lines.</li>
<li>Separates these domains and re-runs OCR.</li>
</ul>
</li>
<li><strong>&ldquo;Super Atom&rdquo; Merging</strong>:
<ul>
<li>Merges adjacent vertical/horizontal symbols (e.g., &ldquo;HO&rdquo;, &ldquo;CH3&rdquo;) based on distance thresholds between bounding boxes.</li>
</ul>
</li>
</ol>
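<p>The binarization step (step 1) can be illustrated with a pure-NumPy Otsu threshold. This is a stand-in for the OpenCV calls the paper uses (<code>cvCvtColor</code> plus Otsu thresholding), written out to show the between-class-variance criterion.</p>

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    # Pick the threshold t maximizing between-class variance
    # w0 * w1 * (m0 - m1)^2 over the grayscale histogram.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_w = np.cumsum(hist)                       # pixel count at or below t
    cum_mean = np.cumsum(hist * np.arange(256))   # intensity mass at or below t
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0, w1 = cum_w[t], total - cum_w[t]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t] / w0
        m1 = (cum_mean[-1] - cum_mean[t]) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic bimodal image: dark strokes (~30) on a light page (~220).
img = np.full((32, 32), 220, dtype=np.uint8)
img[10:12, :] = 30                # a "bond line"
t = otsu_threshold(img)
binary = img <= t                 # True = foreground (ink)
print(t, binary.sum())
```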
<h3 id="models">Models</h3>
<p>The system relies on off-the-shelf OCR tools for character recognition; no custom ML models were trained.</p>
<ul>
<li><strong>OCR Engines</strong>: GOCR, OCRAD, Tesseract.</li>
<li><strong>Visualization</strong>: JSME (JavaScript Molecule Editor) used to render output strings.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (OCSR)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Exact Match (100%)</td>
          <td><strong>90.0%</strong></td>
          <td>82.2%</td>
          <td>Percentage of 200 images perfectly recognized.</td>
      </tr>
      <tr>
          <td>&gt;95% Similarity</td>
          <td><strong>95 images</strong></td>
          <td>71 images</td>
          <td>Count of images with Tanimoto &gt; 0.95.</td>
      </tr>
      <tr>
          <td>&gt;85% Similarity</td>
          <td><strong>157 images</strong></td>
          <td>114 images</td>
          <td>Count of images with Tanimoto &gt; 0.85.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; runs on standard CPU architecture (implied by use of standard libraries like OpenCV and Potrace).</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{hongResearchChemicalExpression2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Research on {{Chemical Expression Images Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2015 {{Joint International Mechanical}}, {{Electronic}} and {{Information Technology Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hong, Chen and Du, Xiaoping and Zhang, Lu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Atlantis Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Chongqing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2991/jimet-15.2015.50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-94-6252-129-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Probabilistic OCSR with Markov Logic Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</guid><description>A probabilistic approach using Markov Logic Networks to recognize chemical structures from images, improving robustness over rule-based systems.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Frasconi, P., Gabbrielli, F., Lippi, M., &amp; Marinai, S. (2014). Markov Logic Networks for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 54(8), 2380-2390. <a href="https://doi.org/10.1021/ci5002197">https://doi.org/10.1021/ci5002197</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2014</p>
<h2 id="contribution-probabilistic-method-for-ocsr">Contribution: Probabilistic Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a novel algorithmic architecture (<strong>MLOCSR</strong>) that integrates low-level pattern recognition with a high-level probabilistic reasoning engine based on Markov Logic Networks (MLNs). While it contributes to resources by creating a clustered dataset for evaluation, the primary focus is on demonstrating that probabilistic inference offers a superior methodology to the deterministic, rule-based heuristics employed by previous state-of-the-art systems like OSRA and CLiDE.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for converting the vast archive of chemical literature (bitmap images in patents and papers) into machine-readable formats.</p>
<ul>
<li><strong>Limitation of Prior Work</strong>: Existing systems (OSRA, CLiDE, ChemReader) rely on &ldquo;empirical hard-coded geometrical rules&rdquo; to assemble atoms and bonds. These heuristics are brittle, requiring manual tuning of parameters for different image resolutions and failing when images are degraded or noisy.</li>
<li><strong>Gap</strong>: Chemical knowledge is typically used only in post-processing (e.g., to fix valency errors).</li>
<li><strong>Goal</strong>: To create a resolution-independent system that uses probabilistic reasoning to handle noise and ambiguity in graphical primitives.</li>
</ul>
<h2 id="core-innovation-markov-logic-networks-for-diagram-interpretation">Core Innovation: Markov Logic Networks for Diagram Interpretation</h2>
<p>The core novelty is the application of <strong>Markov Logic Networks (MLNs)</strong> to the problem of diagram interpretation.</p>
<ul>
<li><strong>Probabilistic Reasoning</strong>: The system treats extracted visual elements (lines, text boxes) as &ldquo;evidence&rdquo; and uses weighted first-order logic formulas to infer the most likely molecular graph (Maximum A Posteriori inference). The probability of a state $x$ is defined by the MLN log-linear model:
$$ P(X=x) = \frac{1}{Z} \exp\left(\sum_{i} w_i n_i(x)\right) $$
where $w_i$ is the weight of the $i$-th formula and $n_i(x)$ is the number of true groundings in $x$.</li>
<li><strong>Unified Knowledge Representation</strong>: Geometric constraints (e.g., collinearity) and chemical rules (e.g., valency) are encoded in the same logic framework.</li>
<li><strong>Resolution Independence</strong>: The low-level extraction module dynamically estimates character size ($T$) and stroke width ($S$) to normalize parameters, removing the dependence on image DPI metadata.</li>
</ul>
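<p>The log-linear model above can be made concrete on a toy network with two Boolean query atoms and two weighted formulas. The formulas and weights are illustrative only, not the 128 formulas or learned weights of MLOCSR.</p>

```python
import math
from itertools import product

# Formulas over state x = (bond_exists, atoms_merged):
#   f1 (w = 1.5): bond_exists                     (evidence favors the bond)
#   f2 (w = 0.8): bond_exists => not atoms_merged
formulas = [
    (1.5, lambda x: x[0]),
    (0.8, lambda x: (not x[0]) or (not x[1])),
]

def score(x):
    # sum_i w_i * n_i(x); each toy formula has a single grounding here.
    return sum(w for w, f in formulas if f(x))

states = list(product([False, True], repeat=2))
Z = sum(math.exp(score(x)) for x in states)          # partition function
probs = {x: math.exp(score(x)) / Z for x in states}  # P(X = x)
map_state = max(states, key=score)                   # MAP inference by enumeration
print(map_state)                                     # bond exists, atoms not merged
print(round(sum(probs.values()), 6))                 # normalization check
```

Real MLN inference enumerates nothing, of course; MAP is found by local search (MaxWalkSAT) over groundings.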
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated the system on recognition accuracy against the leading open-source baseline, <strong>OSRA (v1.4.0)</strong>.</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>USPTO Clustered</strong>: A non-redundant subset of 937 images derived from a larger set of 5,719 US Patent Office images.</li>
<li><strong>ChemInfty</strong>: 869 images from Japanese patents.</li>
<li><strong>Degraded Images</strong>: The USPTO set was synthetically degraded at three resampling levels (Low, Medium, High degradation) to test robustness.</li>
</ul>
</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Geometric</strong>: Precision, Recall, and $F_1$ scores for individual atoms and bonds.</li>
<li><strong>Chemical</strong>: Tanimoto similarity (using path fingerprints) and InChI string matching (basic and full stereochemistry).</li>
</ul>
</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Superior Robustness</strong>: MLOCSR significantly outperformed OSRA on degraded images. On high-degradation images, MLOCSR achieved an atom $F_1$ of 80.3% compared to OSRA&rsquo;s 76.0%.</li>
<li><strong>Geometric Accuracy</strong>: In clean datasets (USPTO cluster), MLOCSR achieved higher $F_1$ scores for atoms (99.1% vs 97.5%) and bonds (98.8% vs 97.8%).</li>
<li><strong>Chemical Fidelity</strong>: The system achieved comparable Tanimoto similarity scores (0.948 vs 0.940 for OSRA).</li>
<li><strong>Limitation</strong>: OSRA slightly outperformed MLOCSR on &ldquo;Full <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>&rdquo; matching (81.4% vs 79.4%), indicating the probabilistic model still needs improvement in handling complex stereochemistry.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized public datasets, with specific preprocessing to ensure non-redundancy.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>USPTO Clustered</strong></td>
          <td>937 images</td>
          <td>Selected via spectral clustering from 5,719 raw images to remove near-duplicates.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>ChemInfty</strong></td>
          <td>869 images</td>
          <td>Ground-truthed dataset from Japanese patent applications (2008).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of two distinct phases: Low-Level Vectorization and High-Level Inference.</p>
<p><strong>1. Low-Level Extraction (Image Processing)</strong></p>
<ul>
<li><strong>Binarization</strong>: Global thresholding followed by morphological closing.</li>
<li><strong>Text/Stroke Estimation</strong>:
<ul>
<li>Finds text height ($T$) by looking for &ldquo;N&rdquo; or &ldquo;H&rdquo; characters via OCR, or averaging compatible connected components.</li>
<li>Estimates stroke width ($S$) by inspecting pixel density on potential segments identified by Hough transform.</li>
</ul>
</li>
<li><strong>Vectorization</strong>:
<ul>
<li><strong>Canny Edge Detection</strong> + <strong>Hough Transform</strong> to find lines.</li>
<li><strong>Douglas-Peucker algorithm</strong> for polygonal approximation of contours.</li>
<li><strong>Circle Detection</strong>: Finds aromatic rings by checking for circular arrangements of carbon candidates.</li>
</ul>
</li>
</ul>
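<p>The Douglas-Peucker step above admits a compact recursive sketch (the tolerance value is illustrative):</p>

```python
import math

def point_line_dist(p, a, b):
    # Perpendicular distance from point p to the line through a and b.
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
    return num / math.dist(a, b)

def douglas_peucker(points, eps):
    # Recursively keep the point farthest from the chord; if all points lie
    # within eps of the chord, collapse the run to its two endpoints.
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_line_dist(points[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= eps:
        return [a, b]
    left = douglas_peucker(points[: idx + 1], eps)
    right = douglas_peucker(points[idx:], eps)
    return left[:-1] + right  # drop the duplicated split point

# A noisy horizontal stroke followed by a bend, as from a traced contour:
contour = [(0, 0), (1, 0.05), (2, -0.04), (3, 0.03), (4, 0), (5, 2), (6, 4)]
print(douglas_peucker(contour, eps=0.1))
```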
<p><strong>2. High-Level Inference (Markov Logic)</strong></p>
<ul>
<li><strong>Evidence Generation</strong>: Visual primitives (lines, text boxes, circles) are converted into logical ground atoms (e.g., <code>LineBetweenCpoints(c1, c2)</code>).</li>
<li><strong>Inference Engine</strong>: Uses <strong>MaxWalkSAT</strong> for Maximum A Posteriori (MAP) inference to determine the most probable state of query predicates (e.g., <code>DoubleBond(c1, c2)</code>).</li>
<li><strong>Parameters</strong>: MaxWalkSAT run with 3 tries and 1,000,000 steps per try.</li>
</ul>
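<p>A toy MaxWalkSAT in the spirit of the MAP inference above: flip variables to minimize the total weight of unsatisfied weighted clauses. The clause encoding and parameters are illustrative (the paper runs Alchemy's implementation with 3 tries and 10<sup>6</sup> steps).</p>

```python
import random

def maxwalksat(clauses, n_vars, tries=3, steps=1000, p_noise=0.5, seed=0):
    # clauses: list of (weight, [signed literals]); literal l > 0 means
    # "var |l| is True", l < 0 means "var |l| is False".
    rng = random.Random(seed)
    def unsat(assign):
        return [(w, lits) for w, lits in clauses
                if not any(assign[abs(l)] == (l > 0) for l in lits)]
    best, best_cost = None, float("inf")
    for _ in range(tries):
        assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(steps):
            broken = unsat(assign)
            cost = sum(w for w, _ in broken)
            if cost < best_cost:
                best, best_cost = dict(assign), cost
            if not broken:
                break
            _, lits = rng.choice(broken)        # pick an unsatisfied clause
            if rng.random() < p_noise:
                v = abs(rng.choice(lits))       # random-walk move
            else:                               # greedy: flip the cheapest var
                def cost_if_flipped(var):
                    assign[var] = not assign[var]
                    c = sum(w for w, _ in unsat(assign))
                    assign[var] = not assign[var]
                    return c
                v = min((abs(l) for l in lits), key=cost_if_flipped)
            assign[v] = not assign[v]
    return best, best_cost

# Weighted clauses over vars 1..3; the optimum satisfies all but the last:
clauses = [(2.0, [1]), (1.5, [-1, 2]), (1.0, [-2, 3]), (0.5, [-3])]
assign, cost = maxwalksat(clauses, 3)
print(assign, cost)
```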
<h3 id="models">Models</h3>
<ul>
<li><strong>Markov Logic Network (MLN)</strong>:
<ul>
<li>Contains <strong>128 first-order logic formulas</strong>.</li>
<li><strong>Geometric Rules</strong>: Example: <code>VeryCloseCpoints(c1, c2) =&gt; SameCarbon(c1, c2)</code> (weighted rule to merge close nodes).</li>
<li><strong>Chemical Rules</strong>: Example: <code>IsHydroxyl(t) ^ Connected(c,t) =&gt; SingleBond(c,t)</code> (imposes valency constraints).</li>
</ul>
</li>
<li><strong>OCR Engine</strong>: Tesseract is used for character recognition on text connected components.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The authors introduced a bipartite graph matching method to evaluate geometric accuracy when superatoms (e.g., &ldquo;COOH&rdquo;) are not expanded.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Atom/Bond $F_1$</strong></td>
          <td>Calculated via minimum-weight bipartite matching between predicted graph and ground truth, weighted by Euclidean distance.</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>Standard unique identifier string. &ldquo;Basic&rdquo; ignores stereochemistry; &ldquo;Full&rdquo; includes it.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Jaccard index of path fingerprints between predicted and ground truth molecules.</td>
      </tr>
  </tbody>
</table>
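<p>The matching-based $F_1$ can be illustrated on a toy case. This brute-forces the minimum-distance assignment over permutations (fine for a handful of atoms; a real implementation would use the Hungarian algorithm), then counts a prediction as correct within an assumed distance tolerance.</p>

```python
import math
from itertools import permutations

def match_and_f1(pred, truth, tol=2.0):
    # Minimum-total-distance injective assignment of predicted atoms onto
    # ground-truth atoms; tol (assumed) decides which matches count as hits.
    k = min(len(pred), len(truth))
    best = None
    for perm in permutations(range(len(truth)), k):
        cost = sum(math.dist(pred[i], truth[j]) for i, j in enumerate(perm))
        if best is None or cost < best[0]:
            best = (cost, perm)
    tp = sum(math.dist(pred[i], truth[j]) <= tol
             for i, j in enumerate(best[1]))
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

truth = [(0, 0), (10, 0), (20, 0)]
pred = [(0.5, 0.2), (10.3, -0.1)]   # one ground-truth atom missed
print(round(match_and_f1(pred, truth), 3))  # precision 1, recall 2/3 -> F1 = 0.8
```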
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software</strong>: Logic inference performed using the <strong>Alchemy</strong> software package (University of Washington).</li>
<li><strong>Web Server</strong>: The system was made available at <code>http://mlocsr.dinfo.unifi.it</code> (Note: URL likely inactive).</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{frasconiMarkovLogicNetworks2014,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Markov {{Logic Networks}} for {{Optical Chemical Structure Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Frasconi, Paolo and Gabbrielli, Francesco and Lippi, Marco and Marinai, Simone}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2014</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{54}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2380--2390}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596, 1549-960X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci5002197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé-1 System for Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</guid><description>Foundational OCSR method combining neural OCR with chemical rule-based post-processing for automated structure interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1996). Automatic Interpretation of Chemical Structure Diagrams. <em>Graphics Recognition. Methods and Applications</em>, 148-158. <a href="https://doi.org/10.1007/3-540-61226-2_13">https://doi.org/10.1007/3-540-61226-2_13</a></p>
<p><strong>Publication</strong>: Lecture Notes in Computer Science (LNCS), Vol. 1072, Springer, 1996.</p>
<h2 id="system-architecture-and-contribution">System Architecture and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel software architecture (&ldquo;Kekulé-1&rdquo;) designed to solve the specific technical problem of converting rasterized chemical diagrams into machine-readable connection tables. The paper is characterized by:</p>
<ul>
<li><strong>Algorithmic Specification</strong>: It details specific algorithms for vectorization, polygon approximation, and character recognition.</li>
<li><strong>Performance Metrics</strong>: It validates the method using quantitative accuracy (98.9%) and speed comparisons against manual entry.</li>
<li><strong>System Architecture</strong>: It describes the integration of typically disparate components (OCR, vectorization, chemical rules) into a cohesive pipeline.</li>
</ul>
<h2 id="motivation-the-chemical-data-entry-bottleneck">Motivation: The Chemical Data Entry Bottleneck</h2>
<p>Chemical structure diagrams are the primary medium for communication between chemists, but computers cannot natively &ldquo;read&rdquo; these raster images.</p>
<ul>
<li><strong>Efficiency Gap</strong>: Manual redrawing of structures into chemical databases takes 6 to 10 minutes per structure.</li>
<li><strong>Technical Challenge</strong>: Existing commercial OCR systems failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), nor could they recognize small fonts (3-7 points) or chemical symbols accurately.</li>
<li><strong>Goal</strong>: To create an &ldquo;Optical Chemical Structure Recognition&rdquo; (OCSR) system that reduces processing time to seconds while handling complex notation like stereochemistry and group formulas.</li>
</ul>
<h2 id="core-innovations-in-chemical-ocr">Core Innovations in Chemical OCR</h2>
<p>Kekulé-1 represents the &ldquo;first successful attempt&rdquo; to integrate image processing, OCR, and structure editing into a single workflow. Key innovations include:</p>
<ul>
<li><strong>Context-Aware OCR</strong>: Unlike standard OCR, Kekulé-1 uses &ldquo;chemical spell checking&rdquo; by applying valence rules and chemical context to correct raw character recognition errors (e.g., distinguishing &lsquo;5&rsquo; from &lsquo;S&rsquo; based on bonding).</li>
<li><strong>Adaptive Polygon Approximation</strong>: A modified vectorization algorithm that partitions objects at the farthest node to prevent artifact nodes in U-shaped structures.</li>
<li><strong>Hybrid Parsing</strong>: It treats the diagram as a graph where nodes can be explicit atoms or geometric intersections, using rule-based logic to parse &ldquo;group formulas&rdquo; (like $\text{COOH}$) recursively.</li>
</ul>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The authors evaluated the system on a private test set to validate robustness and speed.</p>
<ul>
<li><strong>Dataset</strong>: 524 chemical structures chosen from a &ldquo;wide variety of sources&rdquo; specifically to test the system&rsquo;s limits.</li>
<li><strong>Metrics</strong>: Success rate (percentage of structures processed with minimal editing) and processing time per structure.</li>
<li><strong>Comparators</strong>: Performance was compared against the &ldquo;manual redrawing&rdquo; baseline.</li>
</ul>
<h2 id="results-performance-and-conclusions">Results, Performance, and Conclusions</h2>
<ul>
<li><strong>High Accuracy</strong>: 98.9% of the test structures were successfully processed (with an average of 0.74 user prompts per structure).</li>
<li><strong>Speedup</strong>: Processing took 7 to 30 seconds per structure, a significant improvement over the 6 to 10 minute manual baseline.</li>
<li><strong>Robustness</strong>: The system successfully handled pathological cases like broken characters, skew (rotation), and touching characters.</li>
<li><strong>Impact</strong>: The authors conclude that the techniques are generalizable to other domains like electrical circuits and utility maps.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: The evaluation used 524 chemical structures. These were not released publicly but were selected to represent &ldquo;limit&rdquo; cases.</li>
<li><strong>Input format</strong>: Scanned images at 300-400 dpi. The authors note that higher resolutions do not add information due to ink wicking and paper limitations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details several specific algorithmic implementations:</p>
<p><strong>Vectorization (Polygon Approximation)</strong>:</p>
<ul>
<li>Standard thinning and raster-to-vector translation are used.</li>
<li><strong>Innovation</strong>: The algorithm searches for the node <em>farthest</em> from the current start node to partition the object. This prevents artifact nodes in curved lines.</li>
<li><strong>Threshold Formula</strong>: The allowed deviation ($dist$) from a straight line is adaptive based on segment length ($length$):</li>
</ul>
<p>$$dist = \max(1, \frac{length}{10.0} + 0.4)$$</p>
<p>(Units in pixels)</p>
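<p>The threshold formula can be stated directly in code (a trivial sketch; the constants are those reported in the paper):</p>

```python
def polygon_deviation_threshold(length: float) -> float:
    """Adaptive deviation allowance (in pixels) for Kekulé-1's polygon
    approximation: longer segments tolerate more deviation from a line."""
    return max(1.0, length / 10.0 + 0.4)

for length in (5, 20, 100):
    print(length, polygon_deviation_threshold(length))
```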
<p><strong>Rotation Correction</strong>:</p>
<ul>
<li>The system computes the angle of all &ldquo;long&rdquo; line segments modulo 15 degrees.</li>
<li>These angles are binned; the angle of the most populated bin (the skew is assumed to be below 4 degrees) is taken as the scan skew and corrected.</li>
</ul>
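<p>A sketch of the skew-estimation procedure under stated assumptions: the paper fixes only the 15-degree modulus, so the 1-degree bin width below is an assumption.</p>

```python
from math import atan2, degrees

def estimate_skew(segments, bin_width=1.0):
    """Estimate scan skew in the spirit of Kekulé-1: take each long
    segment's angle modulo 15 degrees, histogram it, and return the
    most populated bin's angle."""
    bins = {}
    for (x1, y1), (x2, y2) in segments:
        angle = degrees(atan2(y2 - y1, x2 - x1)) % 15.0
        key = round(angle / bin_width) * bin_width
        bins[key] = bins.get(key, 0) + 1
    return max(bins, key=bins.get)

# Bonds drawn at multiples of 15 degrees, all rotated by ~2 degrees of skew:
segs = [((0, 0), (10, 0.35)), ((0, 0), (8.5, 5.31)), ((0, 0), (5.0, 9.4))]
print(estimate_skew(segs))  # 2.0
```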
<p><strong>Optical Character Recognition (OCR)</strong>:</p>
<ul>
<li>Uses a neural network with linked/shared weights (similar to Convolutional Neural Networks, though not named as such) acting as a feature detector.</li>
<li><strong>Training</strong>: Trained on specific chemical fonts.</li>
<li><strong>Inference</strong>: Outputs are ranked; if multiple characters (e.g., &lsquo;5&rsquo; and &lsquo;S&rsquo;) exceed a threshold, both are kept, and chemical context resolves the ambiguity later.</li>
</ul>
<p><strong>Chemical Parsing</strong>:</p>
<ul>
<li>Group formulas (e.g., $\text{COOH}$) are parsed left-to-right by subtracting valences.</li>
<li>Example: For $\text{COOH}$, the external bond reduces Carbon&rsquo;s valence to 3. The first Oxygen takes 2, leaving 1. The final Oxygen takes 1 (attaching to Carbon), and the Hydrogen takes 1 (attaching to Oxygen).</li>
</ul>
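<p>The left-to-right valence-subtraction parse can be sketched as follows. The attachment heuristic (bond each new atom to the most recent earlier atom with free valence, filling as much valence as both can give) is inferred from the COOH walk-through and is an assumption, as is the restriction to single-letter symbols:</p>

```python
VALENCE = {"C": 4, "O": 2, "H": 1, "N": 3, "S": 2}

def parse_group(formula, external_bonds=1):
    """Left-to-right valence subtraction for a linear group formula.
    Returns (atom_index, atom_index, bond_order) triples.
    Sketch only: single-letter symbols, no digits or charges."""
    atoms = list(formula)
    free = [VALENCE[a] for a in atoms]
    free[0] -= external_bonds          # bond attaching the group to the skeleton
    bonds = []
    for i in range(1, len(atoms)):
        # most recent earlier atom with remaining valence
        j = max(k for k in range(i) if free[k] > 0)
        order = min(free[j], free[i])
        free[j] -= order
        free[i] -= order
        bonds.append((j, i, order))
    return bonds

# COOH: C=O double bond, C-O single bond, O-H single bond
print(parse_group("COOH"))  # [(0, 1, 2), (0, 2, 1), (2, 3, 1)]
```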
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Model</strong>: A neural network with a &ldquo;shared weights&rdquo; paradigm, effectively creating a learned convolution map. It achieves ~99.9% raw accuracy on isolated test sets of chemical fonts.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: The evaluation was performed on an <strong>80486 processor at 33 MHz</strong>.</li>
<li><strong>Time</strong>: Average processing time was 9 seconds per structure.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{mcdanielAutomaticInterpretationChemical1996,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Interpretation of Chemical Structure Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Graphics Recognition. Methods and Applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{O&#39;Gorman, Lawrence and Kasturi, Rangachar}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{Lecture Notes in Computer Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1072}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{148--158}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1996}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/3-540-61226-2_13}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CLiDE Pro: Optical Chemical Structure Recognition Tool</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</guid><description>A methodological paper presenting CLiDE Pro, an OCSR system for reconstructing chemical graphs from images with ~90% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Valko, A. T., &amp; Johnson, A. P. (2009). CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 49(4), 780-787. <a href="https://doi.org/10.1021/ci800449t">https://doi.org/10.1021/ci800449t</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2009</p>
<h2 id="contribution-robust-algorithmic-pipeline-for-ocsr">Contribution: Robust Algorithmic Pipeline for OCSR</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper, as it proposes a specific algorithmic architecture (CLiDE Pro) for converting raster images of chemical structures into connection tables. It details the procedural steps for segmentation, vectorization, and graph reconstruction.</p>
<p>It also has a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution, as the authors compile and release a validation set of 454 real-world images to serve as a community benchmark for OCSR systems.</p>
<h2 id="motivation-bridging-the-gap-between-legacy-document-images-and-machine-readable-chemistry">Motivation: Bridging the Gap Between Legacy Document Images and Machine-Readable Chemistry</h2>
<p>While modern chemical drawing software captures structural information explicitly, the vast majority of legacy and current chemical literature (journals, patents, reports) exists as images or PDF documents. These images are human-readable but lack the semantic &ldquo;connection table&rdquo; data required for chemical databases and software. Manual redrawing is time-consuming and error-prone. Therefore, there is a critical need for efficient Optical Chemical Structure Recognition (OCSR) systems to automate this extraction.</p>
<h2 id="novelty-integrated-document-segmentation-and-ambiguity-resolution-heuristics">Novelty: Integrated Document Segmentation and Ambiguity Resolution Heuristics</h2>
<p>CLiDE Pro introduces several algorithmic improvements over its predecessor (CLiDE) and contemporary tools:</p>
<ul>
<li><strong>Integrated Document Segmentation</strong>: Unlike page-oriented systems, it processes whole documents to link information across pages.</li>
<li><strong>Robust &ldquo;Difficult Feature&rdquo; Handling</strong>: It implements specific heuristic rules to resolve ambiguities in crossing bonds (bridged structures), which are often misinterpreted as carbon atoms in other systems.</li>
<li><strong>Generic Structure Interpretation</strong>: It includes a module to parse &ldquo;generic&rdquo; (Markush) structures by matching R-group labels in the diagram with text-based definitions found in the document.</li>
<li><strong>Ambiguity Resolution</strong>: It uses context-aware rules to distinguish between geometrically similar features, such as vertical lines representing bonds vs. the letter &rsquo;l&rsquo; in &lsquo;Cl&rsquo;.</li>
</ul>
<h2 id="methodology-and-benchmarking-on-real-world-data">Methodology and Benchmarking on Real-World Data</h2>
<p>The authors conducted a systematic validation on a dataset of <strong>454 images</strong> containing <strong>519 structure diagrams</strong>.</p>
<ul>
<li><strong>Source Material</strong>: Images were extracted from published materials (journals, patents), ensuring &ldquo;real artifacts&rdquo; like noise and scanning distortions were present.</li>
<li><strong>Automation</strong>: The test was fully automated without human intervention.</li>
<li><strong>Metrics</strong>: The primary metric was the &ldquo;success rate,&rdquo; defined as the correct reconstruction of the molecule&rsquo;s connection table. They also performed fine-grained error analysis on specific features (e.g., atom labels, dashed bonds, wavy bonds).</li>
</ul>
<h2 id="results-high-topological-accuracy-and-persistent-ocr-challenges">Results: High Topological Accuracy and Persistent OCR Challenges</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved an <strong>89.79%</strong> retrieval rate (466/519 molecules correctly reconstructed).</li>
<li><strong>Robustness on Primitives</strong>: Solid straight bonds were recognized with 99.92% accuracy.</li>
<li><strong>Key Failure Modes</strong>: The majority of errors (58 cases) occurred in atom label construction, specifically when labels touched nearby bonds or other artifacts, causing OCR failures.</li>
<li><strong>Impact</strong>: The study demonstrated that handling &ldquo;difficult&rdquo; drawing features like crossing bonds and bridged structures significantly reduces topological errors. The authors released the test set to encourage standardized benchmarking in the OCSR field.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized a custom dataset designed to reflect real-world noise.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>CLiDE Pro Validation Set</td>
          <td>454 images (519 structures)</td>
          <td>Extracted from scanned journals and PDFs. Includes noise/artifacts. Available in Supporting Information.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The CLiDE Pro pipeline consists of five distinct phases. To replicate this system, one would need to implement:</p>
<ol>
<li>
<p><strong>Image Binarization</strong>:</p>
<ul>
<li>Input images are binarized using a threshold-based technique to separate foreground (molecule) from background.</li>
<li><strong>Connected Component Analysis (CCA)</strong>: A non-recursive scan identifies connected components (CCs) and generates interpixel contours (using N, S, E, W directions).</li>
</ul>
</li>
<li>
<p><strong>Document Segmentation</strong>:</p>
<ul>
<li><strong>Layout Analysis</strong>: Uses a bottom-up approach building a tree structure. It treats CCs as graph vertices and distances as edges.</li>
<li><strong>Clustering</strong>: A minimum-cost spanning tree (Kruskal&rsquo;s algorithm) groups CCs into words, lines, and blocks.</li>
<li><strong>Classification</strong>: CCs are classified (Character, Dash, Line, Graphics, Noise) based on size thresholds derived from statistical image analysis.</li>
</ul>
</li>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li><strong>Contour Approximation</strong>: Uses a method similar to <strong>Sklansky and Gonzalez</strong> to approximate contours into polygons.</li>
<li><strong>Vector Formation</strong>: Long polygon sides become straight lines; short consecutive sides become curves. Opposing borders of a line are matched to define the bond vector.</li>
<li><strong>Wavy Bonds</strong>: Detected by finding groups of short vectors lying on a straight line.</li>
<li><strong>Dashed Bonds</strong>: Detected using the <strong>Hough transform</strong> to find collinear or parallel dashes.</li>
</ul>
</li>
<li>
<p><strong>Atom Label Construction</strong>:</p>
<ul>
<li><strong>OCR</strong>: An OCR engine (filtering + topological analysis) interprets characters.</li>
<li><strong>Grouping</strong>: Characters are grouped into words based on horizontal and vertical proximity (for vertical labels).</li>
<li><strong>Superatom Lookup</strong>: Labels are matched against a database of elements, functional groups, and R-groups. Unknown linear formulas (e.g., $\text{CH}_2\text{CH}_2\text{OH}$) are parsed.</li>
</ul>
</li>
<li>
<p><strong>Graph Reconstruction</strong>:</p>
<ul>
<li><strong>Connection Logic</strong>: Bond endpoints are joined to atoms if they are within a distance threshold and &ldquo;point toward&rdquo; the label.</li>
<li><strong>Implicit Carbons</strong>: Unconnected bond ends are joined if close; parallel bonds merge into double/triple bonds.</li>
<li><strong>Crossing Bonds</strong>: Rules check proximity, length, and ring membership to determine if crossing lines are valid atoms or 3D visual artifacts.</li>
</ul>
</li>
<li>
<p><strong>Generic Structure Interpretation</strong>:</p>
<ul>
<li><strong>Text Mining</strong>: A lexical/syntactic analyzer extracts R-group definitions (e.g., &ldquo;R = Me or H&rdquo;) from text blocks.</li>
<li><strong>Matching</strong>: The system attempts to match R-group labels in the diagram with the parsed text definitions.</li>
</ul>
</li>
</ol>
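<p>The bottom-up clustering step (phase 2) can be sketched with Kruskal&rsquo;s algorithm over component centroids. The cut distance and the use of centroids are assumptions here; the paper clusters on inter-component distances without publishing exact parameters.</p>

```python
def cluster_components(centroids, cut_distance):
    """Group connected-component centroids by building Kruskal's minimum
    spanning tree while skipping edges longer than `cut_distance` — a sketch
    of CLiDE Pro's bottom-up layout clustering (parameter values assumed)."""
    n = len(centroids)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    edges = sorted(
        (((centroids[i][0] - centroids[j][0]) ** 2 +
          (centroids[i][1] - centroids[j][1]) ** 2) ** 0.5, i, j)
        for i in range(n) for j in range(i + 1, n))
    for dist, i, j in edges:
        if dist <= cut_distance:            # Kruskal step; long edges cut clusters
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two glyphs close together ("Cl"), one far-away bond fragment:
print(cluster_components([(0, 0), (1, 0), (10, 0)], cut_distance=2.0))  # [[0, 1], [2]]
```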
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Engine</strong>: The system relies on a customized OCR engine capable of handling rotation and chemical symbols, though the specific architecture (neural vs. feature-based) is not detailed beyond &ldquo;topological and geometrical feature analysis&rdquo;.</li>
<li><strong>Superatom Database</strong>: A lookup table containing elements, common functional groups, and R-group labels.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The evaluation focused on the topological correctness of the output.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total Success Rate</strong></td>
          <td>89.79%</td>
          <td>466/519 structures perfectly reconstructed.</td>
      </tr>
      <tr>
          <td><strong>Atom Label Accuracy</strong></td>
          <td>98.54%</td>
          <td>3923/3981 labels correct. Main error source: labels touching bonds.</td>
      </tr>
      <tr>
          <td><strong>Solid Bond Accuracy</strong></td>
          <td>99.92%</td>
          <td>16061/16074 solid bonds correct.</td>
      </tr>
      <tr>
          <td><strong>Dashed Bond Accuracy</strong></td>
          <td>98.37%</td>
          <td>303/308 dashed bonds correct.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; described as efficient.</li>
<li><strong>Performance</strong>: The system processed the complex Palytoxin structure &ldquo;within a few seconds&rdquo;. This implies low computational overhead suitable for standard desktop hardware of the 2009 era.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{valkoCLiDEProLatest2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Valko, Aniko T. and Johnson, A. Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{780--787}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800449t}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Recognition (Rule-Based)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</guid><description>A strictly rule-based expert system (MolRec) for converting raster chemical diagrams into graph representations.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). Chemical structure recognition: A rule based approach. <em>Proceedings of SPIE</em>, 8297, 82970E. <a href="https://doi.org/10.1117/12.912185">https://doi.org/10.1117/12.912185</a></p>
<p><strong>Publication</strong>: IS&amp;T/SPIE Electronic Imaging 2012</p>
<h2 id="methodological-contribution">Methodological Contribution</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper proposes a novel mechanism (MolRec) for Optical Chemical Structure Recognition (OCSR). It focuses on defining a &ldquo;strictly rule based system&rdquo; to transform vectorised molecule images into graph representations, contrasting this declarative approach with procedural or heuristic-heavy methods. The contribution is validated through direct comparison with the leading open-source tool (OSRA).</p>
<h2 id="motivation-overcoming-procedural-heuristics">Motivation: Overcoming Procedural Heuristics</h2>
<p>Chemical literature contains vast amounts of information locked in 2D diagrams. This visual data is generally inaccessible to search tools or electronic processing. While commercial and academic tools existed (e.g., OSRA, Kekulé), they typically relied on procedural heuristics that required experimental tuning and were difficult to extend. The authors sought to create a system based on precise, declarative rewrite rules to handle the ambiguity inherent in chemical drawing conventions.</p>
<h2 id="core-innovation-geometric-rewrite-rules">Core Innovation: Geometric Rewrite Rules</h2>
<p>The core novelty is the <strong>geometric rewrite rule system</strong> (MolRec).</p>
<ul>
<li><strong>Geometric Primitives</strong>: The system operates on five high-level primitives: Line Segment, Arrow, Circle, Triangle, and Character Group.</li>
<li><strong>Fuzzy Parameters</strong>: It introduces formal definitions for &ldquo;fuzzy&rdquo; relationships (e.g., <code>dash-neighbouring</code>, <code>approximate collinearity</code>) to handle drawing irregularities and scanning artifacts.</li>
<li><strong>Ambiguity Resolution</strong>: Specific rules (R4-R6) are designed to disambiguate visual homoglyphs, such as distinguishing a &ldquo;triple bond&rdquo; from a &ldquo;dashed bold bond&rdquo; based on context (connected atoms).</li>
<li><strong>Explicit &ldquo;Cutting&rdquo;</strong>: A mechanism to identify implicit carbon nodes within continuous line segments (e.g., splitting a long line intersected by parallel lines into a double bond).</li>
</ul>
<h2 id="experimental-setup-vs-baselines">Experimental Setup vs. Baselines</h2>
<p>The authors compared their system (MolRec) against <strong>OSRA</strong> (the leading open-source system) on two datasets:</p>
<ol>
<li><strong>OSRA Benchmark</strong>: 5,735 computer-generated diagrams with ground truth MOL files.</li>
<li><strong>Maybridge Dataset</strong>: 5,730 scanned images (300dpi) from a drug catalogue, converted to ground truth MOL files via InChI lookups.</li>
</ol>
<p>Evaluation was semantic: the output MOL files were compared using OpenBabel to check for structural equivalence, ignoring syntactic file differences.</p>
<h2 id="results-and-key-findings">Results and Key Findings</h2>
<p><strong>MolRec outperformed OSRA</strong> on both datasets:</p>
<ul>
<li><strong>OSRA Benchmark</strong>: MolRec achieved <strong>88.46%</strong> accuracy vs. OSRA&rsquo;s 77.23%.</li>
<li><strong>Maybridge Dataset</strong>: MolRec achieved <strong>83.84%</strong> accuracy vs. OSRA&rsquo;s 72.57%.</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>Robustness</strong>: The line thinning + Douglas-Peucker vectorization approach was found to be more robust than Hough transform approaches used by other tools.</li>
<li><strong>Failure Modes</strong>: Major remaining errors were caused by &ldquo;touching components&rdquo; (ligatures, characters touching bonds) and complex &ldquo;superatoms&rdquo; (abbreviations like &ldquo;-Ph&rdquo; or &ldquo;-COOH&rdquo;) with ambiguous connection points.</li>
<li><strong>Triangle Detection</strong>: The &ldquo;expanding disc&rdquo; method for identifying wedge bonds was highly effective.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>Two distinct datasets were used for validation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA Benchmark</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">5,735</td>
          <td style="text-align: left">Computer-generated diagrams provided by the OSRA project.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">5,730</td>
          <td style="text-align: left">Scanned at 300dpi from the Maybridge drug catalogue. Ground truth generated via CAS Registry Number $\to$ <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> $\to$ OpenBabel.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline consists of three stages: <strong>Vectorization</strong>, <strong>Geometric Processing</strong>, and <strong>Rule Application</strong>.</p>
<p><strong>1. Vectorization &amp; Primitives</strong></p>
<ul>
<li><strong>Binarization &amp; OCR</strong>: Connected components are labelled and passed to an OCR engine to extract &ldquo;Character Groups&rdquo;.</li>
<li><strong>Thinning</strong>: Image is thinned to unit width.</li>
<li><strong>Simplification</strong>: Douglas-Peucker algorithm converts pixel paths into straight <strong>Line Segments</strong>.</li>
<li><strong>Triangle Detection</strong>: A disc growing algorithm walks inside black regions to identify <strong>Triangles</strong> (wedges). If the disc cannot grow, it is a thick line (Bold Bond).</li>
</ul>
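<p>The Douglas-Peucker simplification used in the vectorization stage is the standard recursive algorithm; a compact sketch (the epsilon tolerance value is an assumption):</p>

```python
def douglas_peucker(points, epsilon):
    """Classic Douglas-Peucker: keep the endpoints, recurse on the point
    farthest from the chord if it deviates by more than epsilon."""
    def point_line_dist(p, a, b):
        (px, py), (ax, ay), (bx, by) = p, a, b
        dx, dy = bx - ax, by - ay
        if dx == dy == 0:
            return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
        # perpendicular distance from p to the infinite line through a, b
        return abs(dy * px - dx * py + bx * ay - by * ax) / (dx * dx + dy * dy) ** 0.5

    if len(points) < 3:
        return list(points)
    dmax, index = max(
        (point_line_dist(points[i], points[0], points[-1]), i)
        for i in range(1, len(points) - 1))
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right

# A noisy thinned pixel path collapses to a single line segment:
print(douglas_peucker([(0, 0), (1, 0.1), (2, -0.1), (3, 0)], epsilon=0.5))  # [(0, 0), (3, 0)]
```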
<p><strong>2. Fuzzy Parameters</strong></p>
<p>The rules rely on tolerating drawing imperfections using defined parameters:</p>
<ul>
<li>$r_e$: Radius of collinearity (strict).</li>
<li>$d_l$ / $d_s$: Dash length / Dash separation (fuzzy).</li>
<li>$bdl$ / $bdw$: Bold dash length / width (fuzzy).</li>
<li>$bs$: Bond separation (max distance between parallel bonds).</li>
<li>$ol$: Minimal overlap.</li>
</ul>
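<p>A sketch of one such fuzzy predicate, approximate collinearity, under an assumed radius $r_e$ (the paper defines the predicates formally but the parameter value below is illustrative):</p>

```python
from math import hypot

def approximately_collinear(seg1, seg2, r_e=2.0):
    """MolRec-style fuzzy test: seg2 is approximately collinear with seg1
    if each of its endpoints lies within radius r_e of the infinite line
    through seg1. The r_e value here is an assumption, not the paper's."""
    (ax, ay), (bx, by) = seg1
    dx, dy = bx - ax, by - ay
    length = hypot(dx, dy)
    for (px, py) in seg2:
        dist = abs(dy * px - dx * py + bx * ay - by * ax) / length
        if dist > r_e:
            return False
    return True

dash1 = ((0, 0), (3, 0))
dash2 = ((5, 0.4), (8, 0.6))   # slightly offset dash of the same dashed bond
print(approximately_collinear(dash1, dash2))           # True
print(approximately_collinear(dash1, ((5, 4), (8, 4))))  # False
```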
<p><strong>3. The Rule System (R1-R18)</strong></p>
<p>The core logic uses 18 mutual-exclusion rules to rewrite geometric primitives into chemical graph edges.</p>
<ul>
<li><strong>Planar Bonds</strong>:
<ul>
<li><strong>R1-R3 (Single/Double/Triple)</strong>: Identifies parallel lines based on <code>bs</code> and <code>ol</code>. Uses &ldquo;cutting&rdquo; to split lines at implicit nodes.</li>
</ul>
</li>
<li><strong>Ambiguity Resolution (Stereo vs. Planar)</strong>:
<ul>
<li><strong>R4 (Dashed Bold vs. Triple)</strong>: Checks context. If purely geometric measures match both, it defaults to Triple unless specific dash constraints are met.</li>
<li><strong>R5 (Dashed Wedge vs. Triple)</strong>: Similar disambiguation based on length monotonicity.</li>
<li><strong>R6 (Dashed Wedge vs. Double)</strong>: Differentiates based on line length differences ($l_1 &gt; l_2$).</li>
</ul>
</li>
<li><strong>Stereo Bonds</strong>:
<ul>
<li><strong>R7-R9 (Dashed Types)</strong>: Identifies collinear segments with specific neighbor patterns (1 neighbor for ends, 2 for internal).</li>
<li><strong>R10-R11 (Hollow Wedge)</strong>: Detects triangles formed by 3 or 4 lines.</li>
<li><strong>R14 (Solid Wedge)</strong>: Direct mapping from Triangle primitive.</li>
</ul>
</li>
<li><strong>Special Structures</strong>:
<ul>
<li><strong>R12 (Wavy Bond)</strong>: Zig-zag line segments.</li>
<li><strong>R13 (Arrow)</strong>: Dative bond.</li>
<li><strong>R16 (Aromatic Ring)</strong>: Circle inside a cycle of &gt;5 lines.</li>
<li><strong>R17-R18 (Bridge Bonds)</strong>: Handles 2.5D crossing bonds (open or closed gaps).</li>
</ul>
</li>
</ul>
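<p>A minimal sketch of how an R2-style parallelism test could combine <code>bs</code> and <code>ol</code>; the thresholds and the exact predicate here are illustrative assumptions, not the paper's rules:</p>

```python
import math

def _angle(seg):
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi  # undirected angle in [0, pi)

def is_double_bond(s1, s2, bs=8.0, ol=0.5, angle_tol=0.1):
    """Hypothetical R2-style test: s2 pairs with s1 as a double bond if
    the segments are near-parallel, separated by less than bs pixels,
    and s2 overlaps at least fraction ol of s1's length."""
    da = abs(_angle(s1) - _angle(s2))
    if min(da, math.pi - da) > angle_tol:
        return False  # not parallel
    (x1, y1), (x2, y2) = s1
    ux, uy = x2 - x1, y2 - y1
    length = math.hypot(ux, uy)
    ux, uy = ux / length, uy / length
    # Perpendicular separation of s2's midpoint from the line through s1.
    mx = (s2[0][0] + s2[1][0]) / 2 - x1
    my = (s2[0][1] + s2[1][1]) / 2 - y1
    sep = abs(-uy * mx + ux * my)
    if sep == 0 or sep > bs:
        return False  # collinear continuation, or too far apart
    # Overlap of s2's projection onto s1, as a fraction of s1's length.
    t1 = (s2[0][0] - x1) * ux + (s2[0][1] - y1) * uy
    t2 = (s2[1][0] - x1) * ux + (s2[1][1] - y1) * uy
    lo, hi = sorted((t1, t2))
    return (min(hi, length) - max(lo, 0.0)) / length >= ol
```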
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Semantic graph matching. The output MOL file is compared to the ground truth MOL file using OpenBabel. Success = correct graph isomorphism.</p>
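<p>The paper performs this comparison with OpenBabel. As a self-contained illustration, a brute-force isomorphism check over labelled molecular graphs (practical only for small molecules) might look like:</p>

```python
from itertools import permutations

def graphs_match(atoms1, bonds1, atoms2, bonds2):
    """Brute-force labelled-graph isomorphism for small molecules -- a
    toy stand-in for the paper's OpenBabel comparison. atoms: element
    symbols by index; bonds: set of (i, j, order) tuples."""
    if sorted(atoms1) != sorted(atoms2) or len(bonds1) != len(bonds2):
        return False
    n = len(atoms1)
    target = {(min(i, j), max(i, j), o) for i, j, o in bonds2}
    for perm in permutations(range(n)):
        if any(atoms1[i] != atoms2[perm[i]] for i in range(n)):
            continue  # element labels must be preserved
        mapped = {(min(perm[i], perm[j]), max(perm[i], perm[j]), o)
                  for i, j, o in bonds1}
        if mapped == target:
            return True
    return False
```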
<p><strong>Results Table</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">System</th>
          <th style="text-align: left">Success Rate</th>
          <th style="text-align: left">Fail Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>88.46%</strong></td>
          <td style="text-align: left">11.54%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">77.23%</td>
          <td style="text-align: left">22.77%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>83.84%</strong></td>
          <td style="text-align: left">16.16%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">72.57%</td>
          <td style="text-align: left">27.43%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Requirements not specified, but the approach (vectorization + rule matching) is computationally lightweight compared to modern deep learning methods.</li>
</ul>
]]></content:encoded></item><item><title>Reconstruction of Chemical Molecules from Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</guid><description>A 5-module system converting raster images of chemical structures into machine-readable SDF files with custom vectorization.</description><content:encoded><![CDATA[<h2 id="methodological-basis">Methodological Basis</h2>
<p>This paper is a clear methodological contribution describing a novel system architecture. It proposes a five-stage pipeline to solve a specific engineering problem: converting rasterized chemical images into structured chemical files (SDF). The authors validate the method by benchmarking it against a commercial product (CLIDE) and analyzing performance across multiple databases.</p>
<h2 id="the-inaccessibility-of-raster-chemical-images">The Inaccessibility of Raster Chemical Images</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A massive amount of chemical knowledge (scientific articles, patents) exists only as raster images, rendering it inaccessible to computational analysis.</li>
<li><strong>Inefficiency of Manual Entry</strong>: Manual replication of molecules into CAD programs is the standard but unscalable solution for extracting this information.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous academic and commercial attempts (early 90s systems like CLIDE) had faded or remained limited in robustness, leaving the problem &ldquo;wide open&rdquo;.</li>
</ul>
<h2 id="topology-preserving-chemical-vectorization">Topology-Preserving Chemical Vectorization</h2>
<p>The core novelty is the <strong>topology-preserving vectorization</strong> strategy designed specifically for chemical graphs.</p>
<ul>
<li><strong>Graph-Centric Vectorizer</strong>: This system prioritizes graph characteristics over the pixel precision of traditional CAD vectorizers, ensuring one line in the image becomes exactly one vector, regardless of line width or vertex thickness.</li>
<li><strong>Chemical Knowledge Module</strong>: The inclusion of a final validation step that applies chemical rules (valence, charge) to detect and potentially correct reconstruction errors.</li>
<li><strong>Hybrid Recognition</strong>: The separation of the pipeline into a &ldquo;Body&rdquo; path (vectorizer for bonds) and an &ldquo;OCR&rdquo; path (SVM for atomic symbols), which are re-integrated in a reconstruction phase.</li>
</ul>
<h2 id="validating-reconstruction-accuracy">Validating Reconstruction Accuracy</h2>
<p>The authors performed a quantitative validation using <strong>ground-truth SDF files</strong> to verify reconstruction accuracy. The success rate metric evaluated whether the reconstructed graph perfectly matched the true SDF:</p>
<p>$$ \text{Accuracy} = \frac{\text{Correctly Reconstructed SDFs}}{\text{Total Images Evaluated}} $$</p>
<ul>
<li><strong>Baselines</strong>: The system was benchmarked against the commercial software <strong>CLIDE</strong> on &ldquo;Database 1&rdquo;.</li>
<li><strong>Datasets</strong>: Three distinct databases were used:
<ul>
<li><strong>Database 1</strong>: 100 images (varied fonts/line widths).</li>
<li><strong>Database 2</strong>: 100 images.</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale test).</li>
</ul>
</li>
</ul>
<h2 id="system-performance-and-scalability">System Performance and Scalability</h2>
<ul>
<li><strong>Superior Performance</strong>: On Database 1, the proposed system correctly reconstructed <strong>97%</strong> of images, whereas the commercial CLIDE system only reconstructed <strong>25%</strong> (after parameter tuning).</li>
<li><strong>Scalability</strong>: The system maintained reasonable performance on the large dataset (Database 3), achieving <strong>67%</strong> accuracy.</li>
<li><strong>Robustness</strong>: The system can handle varying fonts and line widths via parameterization.</li>
<li><strong>Future Work</strong>: The authors plan to implement a feedback loop where the Chemical Knowledge Module can send error signals back to earlier modules to correct inconsistencies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Reproducible (Paywalled paper, no public code or data).</p>
<h3 id="data">Data</h3>
<p>The paper utilizes three databases for validation. The authors note that for these images, the correct SDF files were already available, allowing for direct automated checking.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Varied line widths, fonts, symbols; used for CLIDE comparison.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>General chemical database.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale database.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The system is composed of five distinct modules executed in sequence:</p>
<p><strong>1. Binarization &amp; Segmentation</strong></p>
<ul>
<li><strong>Preprocessing</strong>: Removal of anti-aliasing effects followed by <strong>adaptive histogram binarization</strong>.</li>
<li><strong>Connected Components</strong>: A non-recursive raster-scan algorithm identifies connected Run-Length Encoded (RLE) segments.</li>
</ul>
<p><strong>2. Optical Character Recognition (OCR)</strong></p>
<ul>
<li><strong>Feature Extraction</strong>: Uses functions similar to <strong>Zernike moments</strong> and a <strong>wavelet transform strategy</strong>.</li>
<li><strong>Classification</strong>: Identifies isolated characters/symbols and separates them from the molecular &ldquo;body&rdquo;.</li>
</ul>
<p><strong>3. Vectorizer</strong></p>
<ul>
<li><strong>Logic</strong>: Assigns local directions to RLE segments based on neighbors, then groups segments with similar local direction patterns.</li>
<li><strong>Constraint</strong>: Enforces a 1-to-1 mapping between visual lines and graph vectors to prevent spurious small vectors at thick joints.</li>
</ul>
<p><strong>4. Reconstruction (Heuristics)</strong></p>
<p>This module annotates vectors with chemical significance:</p>
<ul>
<li><strong>Chiral Bonds (Wedges)</strong>: Identified by registering vectors against original pixel density. If a vector corresponds to a thick geometric form (triangle/rectangle), it is labeled chiral.</li>
<li><strong>Dotted Chiral Bonds</strong>: Identified by clustering isolated vectors (no neighbors) using <strong>quadtree clustering</strong> on geometric centers. Coherent parallel clusters are fused into a single bond.</li>
<li><strong>Double/Triple Bonds</strong>: Detected by checking for parallel vectors within a <strong>Region of Interest (ROI)</strong> defined as the vector&rsquo;s bounding box <strong>dilated by a factor of 2</strong>.</li>
<li><strong>Superatoms</strong>: OCR results are clustered by dilating bounding boxes; overlapping boxes are grouped into names (e.g., &ldquo;COOH&rdquo;).</li>
</ul>
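<p>The superatom grouping step can be sketched as box dilation plus union-find merging; the dilation factor below is an assumed placeholder, since the paper does not specify its value:</p>

```python
def dilate(box, factor):
    """Grow a bounding box about its center by the given factor."""
    xmin, ymin, xmax, ymax = box
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    hw, hh = (xmax - xmin) / 2 * factor, (ymax - ymin) / 2 * factor
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def overlap(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def group_characters(boxes, factor=1.5):
    """Merge OCR character boxes into superatom groups: dilate each
    box, then union boxes whose dilations overlap (union-find)."""
    big = [dilate(b, factor) for b in boxes]
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlap(big[i], big[j]):
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```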
<p><strong>5. Chemical Knowledge</strong></p>
<p>Validates the generated graph against rules for valences and charges. If valid, an SDF file is generated.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Support Vector Machine)</strong>: Used within the OCR module to classify connected components as characters or symbols. It is trained to be tolerant to rotation and font variations.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is binary success rate per molecule (perfect reconstruction of the SDF).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (DB1)</th>
          <th>Value (DB3)</th>
          <th>Baseline (CLIDE on DB1)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Reconstruction</td>
          <td><strong>97%</strong></td>
          <td>67%</td>
          <td>25%</td>
          <td>CLIDE required significant parameter tuning to reach 25%.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., Friedrich, C. M., Akle, S., &amp; Hofmann-Apitius, M. (2007). Reconstruction of Chemical Molecules from Images. <em>Proceedings of the 29th Annual International Conference of the IEEE EMBS</em>, 4609-4612. <a href="https://doi.org/10.1109/IEMBS.2007.4353366">https://doi.org/10.1109/IEMBS.2007.4353366</a></p>
<p><strong>Publication venue</strong>: IEEE EMBS 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriReconstructionChemicalMolecules2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Reconstruction of {{Chemical Molecules}} from {{Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 29th Annual International Conference of the IEEE EMBS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and Friedrich, Christoph M. and Akle, Santiago and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4609--4612}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IEMBS.2007.4353366}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA: Open Source Optical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/osra/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/osra/</guid><description>The first open-source optical structure recognition (OSR) utility for converting chemical images into SMILES/SD formats.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., &amp; Nicklaus, M. C. (2009). Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. <em>Journal of Chemical Information and Modeling</em>, 49(3), 740-743. <a href="https://doi.org/10.1021/ci800067r">https://doi.org/10.1021/ci800067r</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Model. 2009</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://sourceforge.net/projects/osra/">SourceForge Project</a></li>
<li><a href="http://cactus.nci.nih.gov/osra">Web Interface (Historical)</a></li>
</ul>
<h2 id="overview-and-motivation">Overview and Motivation</h2>
<p><strong>Resource</strong></p>
<p>This paper is a quintessential <strong>Infrastructure</strong> contribution ($\Psi_{\text{Resource}}$). While it contains significant algorithmic detail, the rhetorical structure and primary goal place it squarely as an infrastructure paper. The dominant contribution is the creation, release, and documentation of a software tool (OSRA).</p>
<p>A vast amount of chemical knowledge is locked in scientific and patent documents as graphical images (Kekulé structures). This is the classic chemical informatics challenge: decades of chemical knowledge are trapped in visual form.</p>
<ul>
<li><strong>Legacy Data Gap</strong>: Historical literature does not use computer-parsable formats, making automated processing of millions of documents impossible without optical recognition. Scientific papers and patents have historically depicted molecules as 2D structural diagrams.</li>
<li><strong>Need for Automation</strong>: Manual transcription is not scalable for the hundreds of thousands of documents available. While modern standards like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> and CML exist, the vast majority of chemical literature remains inaccessible for computational analysis.</li>
<li><strong>Open Source Gap</strong>: Before OSRA, only commercial software like CLiDE existed for this task, creating a barrier for academic researchers and limiting reproducibility; no universal or open-source solution was available prior to this work.</li>
</ul>
<h2 id="core-innovations-and-pipeline">Core Innovations and Pipeline</h2>
<p>OSRA is claimed to be the <strong>first open-source optical structure recognition (OSR) program</strong>. The novelty lies in creating an accessible OCSR system with a practical, multi-stage pipeline that combines classical image processing techniques with chemical knowledge.</p>
<p><strong>Key contributions:</strong></p>
<ol>
<li>
<p><strong>Integrated Pipeline</strong>: It uniquely combines existing open-source image processing tools (ImageMagick for formats, Potrace for vectorization, GOCR/OCRAD for text) into a chemical recognition workflow. The value is in the assembly and integration.</p>
</li>
<li>
<p><strong>Vectorization-Based Approach</strong>: OSRA uses the Potrace library to convert bitmap images into vector graphics (Bezier curves), then analyzes the geometry of these curves to identify bonds and atoms. This is more robust than angle-based detection methods because it leverages continuous mathematical properties of curves.</p>
</li>
<li>
<p><strong>Multi-Resolution Processing with Confidence Estimation</strong>: The system automatically processes each image at three different resolutions (72, 150, and 300 dpi), generating up to three candidate structures. A confidence function fit by linear regression on chemical features (heteroatom count, ring patterns, fragment count) selects the most chemically sensible result.</p>
</li>
<li>
<p><strong>Resolution Independence</strong>: Unlike some predecessors, it is designed to handle over 90 image formats and works independently of specific resolutions or fonts.</p>
</li>
<li>
<p><strong>Comprehensive Chemical Rules</strong>: OSRA implements sophisticated heuristics for chemical structure interpretation:</p>
<ul>
<li>Distinguishes bridge bond crossings from tetrahedral carbon centers using graph connectivity rules</li>
<li>Recognizes stereochemistry from wedge bonds (detected via line thickness gradients)</li>
<li>Handles old-style aromatic notation (circles inside rings)</li>
<li>Expands common chemical abbreviations (superatoms like &ldquo;COOH&rdquo; or &ldquo;CF₃&rdquo;)</li>
<li>Uses the 75th percentile of bond lengths as the reference length to avoid outlier bias</li>
</ul>
</li>
</ol>
<h2 id="methodology-and-validation">Methodology and Validation</h2>
<p>The authors validated OSRA against both commercial software and manual curation:</p>
<ol>
<li>
<p><strong>Commercial Comparison</strong>: They compared OSRA against CLiDE (a commercial OSR tool) using a &ldquo;small test set&rdquo; of 11 files provided by Simbiosys containing 42 structures. Performance was measured both as exact match accuracy and as Tanimoto similarity using molecular fingerprints.</p>
</li>
<li>
<p><strong>Internal Validation</strong>: They tested on an internal set of 66 images containing 215 structures, covering various resolutions, color depths, and drawing styles to assess performance at scale and characterize typical error patterns.</p>
</li>
<li>
<p><strong>Metric Definition</strong>: They defined recognition success using both exact matches (&ldquo;Perfect by InChI&rdquo;) and Tanimoto similarity (using CACTVS fingerprints). The authors explicitly argued for using Tanimoto similarity as the primary evaluation metric, reasoning that partial recognition (e.g., missing a methyl group) still provides useful chemical information, which binary &ldquo;correct/incorrect&rdquo; judgments fail to capture.</p>
</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Competitive Accuracy</strong>: On the small comparative set, OSRA recognized 26 structures perfectly (by InChI) versus CLiDE&rsquo;s 11, demonstrating that an open-source, rule-based approach could outperform established commercial systems.</p>
</li>
<li>
<p><strong>Robustness</strong>: On the internal diverse set (215 structures), OSRA achieved a 93% average Tanimoto similarity and perfectly recognized 107 structures (50%). Tanimoto similarity above 85% was achieved for 182 structures (85%). This established OSRA as a competitive tool for practical use.</p>
</li>
<li>
<p><strong>Multi-Resolution Success</strong>: The multi-resolution strategy allowed OSRA to handle images with varying quality and formats. The confidence function (with correlation coefficient $r=0.89$) successfully identified which resolution produced the most chemically plausible structure.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge issues with:</p>
<ul>
<li>&ldquo;Imperfect segmentation&rdquo; leading to missed structures (3 missed in internal set) and false positives (7 in internal set)</li>
<li>Novel drawing conventions not covered by the implemented heuristics</li>
<li>Highly degraded or noisy images where vectorization fails</li>
<li>Hand-drawn structures that deviate significantly from standard chemical drawing practices</li>
<li>Complex reaction schemes with multiple molecules and arrows</li>
</ul>
</li>
<li>
<p><strong>Open-Source Impact</strong>: By releasing OSRA as open-source software, the authors enabled widespread adoption and community contribution. This established a foundation for future OCSR research and made the technology accessible to researchers without commercial software budgets.</p>
</li>
</ul>
<p>The work established that rule-based OCSR systems could achieve competitive performance when carefully engineered with chemical knowledge. OSRA became a standard baseline for the field and remained the dominant open-source solution until the rise of deep learning methods over a decade later. The vectorization-based approach and the emphasis on Tanimoto similarity as an evaluation metric influenced subsequent work in the area.</p>
<h2 id="technical-details">Technical Details</h2>
<p><strong>Grayscale Conversion</strong></p>
<p>OSRA uses a non-standard grayscale conversion to preserve light-colored atoms (e.g., yellow sulfur):</p>
<p>$$\text{Gray} = \min(R, G, B)$$</p>
<p>This prevents light colors from being washed out during binarization, unlike the standard weighted formula ($0.3R + 0.59G + 0.11B$).</p>
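<p>A two-line comparison shows why the min-channel rule keeps yellow atoms visible (the RGB value for pure yellow is illustrative):</p>

```python
def gray_min(r, g, b):
    """OSRA's conversion: a pixel is only as light as its darkest channel."""
    return min(r, g, b)

def gray_weighted(r, g, b):
    """Standard luminance-style conversion, for comparison."""
    return 0.3 * r + 0.59 * g + 0.11 * b

# Pure yellow (255, 255, 0): the weighted formula yields ~227
# (near-white, lost at binarization); min-channel yields 0 (kept as ink).
```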
<p><strong>Image Segmentation</strong></p>
<p>Chemical structures are identified within a page using specific bounding box criteria:</p>
<ul>
<li><strong>Black pixel density</strong>: Must be between 0.0 and 0.2</li>
<li><strong>Aspect ratio</strong>: Height-to-width ratio must be between 0.2 and 5.0</li>
<li><strong>Minimum size</strong>: Width and height must be &gt;50 pixels at resolutions &gt;150 dpi</li>
</ul>
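<p>Put together, the bounding-box criteria amount to a short predicate. Treating the density bound as exclusive and applying the size rule only above 150 dpi is one interpretation of the listed criteria:</p>

```python
def is_structure_candidate(bbox, black_pixels, dpi):
    """Filter page regions by ink density, aspect ratio, and
    minimum size at higher resolutions (sketch of OSRA's criteria)."""
    xmin, ymin, xmax, ymax = bbox
    w, h = xmax - xmin, ymax - ymin
    if w <= 0 or h <= 0:
        return False
    density = black_pixels / (w * h)
    if not (0.0 < density < 0.2):
        return False  # too sparse or too inky to be a structure
    if not (0.2 <= h / w <= 5.0):
        return False  # implausible aspect ratio
    if dpi > 150 and (w <= 50 or h <= 50):
        return False  # too small at this resolution
    return True
```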
<p><strong>Noise Detection and Smoothing</strong></p>
<p>A &ldquo;noise factor&rdquo; determines whether anisotropic smoothing is applied:</p>
<p>$$\text{Noise Factor} = \frac{\text{Count of 2-pixel line segments}}{\text{Count of 3-pixel line segments}}$$</p>
<p>Smoothing is applied only if this ratio is between 0.5 and 1.0.</p>
<p><strong>Atom Detection from Bezier Curves</strong></p>
<p>Potrace Bezier control points are flagged as potential atoms if:</p>
<ol>
<li>The point is classified as a &ldquo;corner&rdquo; by Potrace</li>
<li>The vector direction change has a <strong>normal component</strong> of at least 2 pixels</li>
</ol>
<p>The normal component criterion is more robust than angle-based detection because angles are difficult to measure accurately in pixelated environments where line thickness is non-zero.</p>
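<p>The normal-component test reduces to a small vector computation; the function below is a sketch, with direction vectors in pixels and the 2-pixel threshold taken from the text:</p>

```python
import math

def is_atom_corner(incoming, outgoing, min_normal=2.0):
    """Flag a Potrace corner as a potential atom when the direction
    change has a normal component of at least min_normal pixels."""
    ix, iy = incoming
    length = math.hypot(ix, iy)
    ux, uy = ix / length, iy / length   # unit direction of incoming vector
    ox, oy = outgoing
    normal = abs(-uy * ox + ux * oy)    # outgoing component perpendicular to incoming
    return normal >= min_normal
```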
<p><strong>Bond Length Estimation</strong></p>
<p>The reference bond length is computed as the <strong>75th percentile</strong> of all detected bond lengths. This avoids bias from outlier bonds (e.g., extremely short or long bonds from recognition errors).</p>
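<p>A nearest-rank sketch of the percentile rule (the interpolation scheme is an assumption; the paper does not specify one):</p>

```python
import math

def reference_bond_length(lengths):
    """75th percentile (nearest rank) of detected bond lengths, used
    as the reference length so outliers carry no weight."""
    s = sorted(lengths)
    k = max(0, math.ceil(0.75 * len(s)) - 1)
    return s[k]
```

<p>For example, a spurious 100-pixel segment among bonds of ~10-12 pixels does not shift the reference length.</p>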
<p><strong>Confidence Function</strong></p>
<p>A linear regression function selects the best result from the multi-scale processing:</p>
<p>$$\text{confidence} = 0.316 - 0.016N_C + 0.034N_N + 0.067N_O + \ldots + 0.330N_{\text{rings5}} + \ldots$$</p>
<p>where $N_C$, $N_N$, $N_O$ represent counts of carbon, nitrogen, and oxygen atoms, respectively. The function prioritizes structures with more recognized heteroatoms and rings while penalizing fragment counts; additional terms account for ring patterns and fragment counts.</p>
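<p>Selection among the three resolution candidates then reduces to scoring feature dictionaries. The weights below are the few coefficients quoted above; the rest of the published weight table is omitted:</p>

```python
WEIGHTS = {"C": -0.016, "N": 0.034, "O": 0.067, "rings5": 0.330}
BIAS = 0.316

def confidence(features):
    """Linear confidence score over atom/ring counts (partial weight set)."""
    return BIAS + sum(w * features.get(k, 0) for k, w in WEIGHTS.items())

def pick_best(candidates):
    """candidates: {dpi: feature dict}; return the dpi whose candidate
    structure scores highest."""
    return max(candidates, key=lambda dpi: confidence(candidates[dpi]))
```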
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Comparison</td>
          <td>&ldquo;Small test set&rdquo; (Simbiosys)</td>
          <td>11 files (42 structures)</td>
          <td>Used to compare vs. CLiDE</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Internal Test Set</td>
          <td>66 images (215 structures)</td>
          <td>Various resolutions, color depths, styles</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used to define &ldquo;Success&rdquo;:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Perfect by InChI</strong></td>
          <td>Exact match of the InChI string to the human-curated structure.</td>
      </tr>
      <tr>
          <td><strong>Average Tanimoto</strong></td>
          <td>Tanimoto similarity (CACTVS fingerprints) between OSRA output and ground truth.</td>
      </tr>
      <tr>
          <td><strong>uuuuu</strong></td>
          <td>NCI CADD identifier match (topology only; ignores stereochem/charge/tautomers).</td>
      </tr>
  </tbody>
</table>
<p><strong>Results Table (Comparison)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Perfect (InChI)</th>
          <th>T &gt; 85%</th>
          <th>uuuuu Match</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>26 / 42</td>
          <td>39 / 42</td>
          <td>28 / 42</td>
      </tr>
      <tr>
          <td><strong>CLiDE</strong></td>
          <td>11 / 42</td>
          <td>26 / 42</td>
          <td>12 / 42</td>
      </tr>
  </tbody>
</table>
<h3 id="softwaredependencies">Software/Dependencies</h3>
<p>The system relies on external libraries:</p>
<ul>
<li><strong>ImageMagick</strong>: Image format parsing (supports 90+ formats)</li>
<li><strong>Ghostscript</strong>: PDF/PS interpretation</li>
<li><strong>Potrace</strong>: Vectorization (converts bitmap to Bezier curves)</li>
<li><strong>GOCR / OCRAD</strong>: Optical Character Recognition (heteroatom label recognition)</li>
<li><strong>OpenBabel / RDKit</strong>: Chemical backends for connection table compilation</li>
<li><strong>Output Formats</strong>: SMILES strings and SD files</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{filippovOpticalStructureRecognition2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Software To Recover Chemical Information}}: {{OSRA}}, {{An Open Source Solution}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Software To Recover Chemical Information}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V. and Nicklaus, Marc C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = mar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{740--743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800067r}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
]]></content:encoded></item><item><title>Optical Recognition of Chemical Graphics</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</guid><description>A 1993 prototype system for converting scanned chemical diagrams into connection tables using vectorization and heuristic-based structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-early-ocsr-pipeline-methodology">Contribution: Early OCSR Pipeline Methodology</h2>
<p><strong>Method</strong>. This paper proposes a novel architectural pipeline for the automatic recognition of chemical structure diagrams. It defines a specific sequence of algorithmic steps, including diagram separation, vectorization, segmentation, and structural analysis, which converts pixel data into a semantic chemical representation (MDL Molfile).</p>
<h2 id="motivation-digitizing-legacy-chemical-data">Motivation: Digitizing Legacy Chemical Data</h2>
<p><strong>Problem</strong>: In 1993, vast databases of chemical information existed, but the entry of graphical data was significantly less advanced than the facilities for manipulating it.</p>
<p><strong>Gap</strong>: Creating digital chemical structures required trained operators to manually redraw diagrams that already existed in printed journals and catalogs, leading to a costly duplication of effort.</p>
<p><strong>Goal</strong>: To automate the creation of coded representations (connection tables) directly from optically scanned diagrams on printed pages.</p>
<h2 id="novelty-general-document-analysis-integrated-with-chemical-rules">Novelty: General Document Analysis Integrated with Chemical Rules</h2>
<p><strong>Pipeline Approach</strong>: The authors present a complete end-to-end system that integrates general document analysis with domain-specific chemical rules.</p>
<p><strong>Convex Bounding Separation</strong>: A novel use of &ldquo;bounding polygons&rdquo; defined by 8 fixed-direction bands to distinguish diagram components from text with linear computational cost.</p>
<p><strong>Vector-Based Segmentation</strong>: The system uses the output of a vectorizer (GIFTS) to classify diagram elements. It relies on the observation that vectorizers approximate characters with sets of short vectors to distinguish them from bonds.</p>
<h2 id="methodology-and-system-evaluation">Methodology and System Evaluation</h2>
<p><strong>System Implementation</strong>: The algorithm was implemented in &lsquo;C&rsquo; on IBM PS/2 personal computers running OS/2 Presentation Manager.</p>
<p><strong>Input Specification</strong>: The system was tested on documents scanned at 300 dpi using an IBM 3119 scanner.</p>
<p><strong>Qualitative Evaluation</strong>: The authors evaluated the system on &ldquo;typical scanned structures&rdquo; and &ldquo;simple planar diagrams&rdquo;. Large-scale quantitative benchmarking was not conducted in this work.</p>
<h2 id="results-performance-and-limitations">Results, Performance, and Limitations</h2>
<p><strong>Performance</strong>: The prototype processes a typical structure (after extraction) in less than one minute.</p>
<p><strong>Accuracy</strong>: It is reported to be accurate for simple planar diagrams.</p>
<p><strong>Output Format</strong>: The system successfully generates MDL Molfiles that interface with standard chemistry software like REACCS, MACCS, and modeling tools.</p>
<p><strong>Limitations</strong>: The system struggles with broken lines, characters touching bond structures, and requires manual intervention for complex errors.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed (Historical). As an early prototype from 1993, no source code, datasets, or digital models were publicly released. Reproducing this exact system would require recreating the pipeline from the described heuristics and sourcing vintage OCR software.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No digital artifacts were released with this 1993 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The paper does not release a dataset but specifies the input requirements for the system.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input</td>
          <td>Scanned Documents</td>
          <td>N/A</td>
          <td>Black ink on white paper; scanned at 300 dpi bi-level.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper relies on a pipeline of specific heuristics and geometric rules.</p>
<p><strong>1. Diagram Separation (Region Growing)</strong></p>
<ul>
<li><strong>Bounding Polygons</strong>: Uses convex polygons defined by pairs of parallel sides in 8 fixed directions. This approximation improves distance estimation compared to bounding rectangles.</li>
<li><strong>Seed Detection</strong>: Finds a connected component with bounding dimension $D &gt; d_{\max}$, where $d_{\max}$ is the maximum expected character size.</li>
<li><strong>Aggregation</strong>: Iteratively searches for neighboring components within a specific distance threshold $d_t$ (where $d_t$ is smaller than the whitespace margin) and merges them into the bounding polygon.</li>
</ul>
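<p>The region-growing separation above can be sketched as follows. This is a minimal illustration that substitutes axis-aligned bounding boxes for the paper&rsquo;s 8-direction bounding polygons; the component tuples and thresholds are hypothetical, not the original implementation.</p>

```python
# Sketch of region-growing diagram separation: seed on a component
# larger than the max character size, then absorb neighbors that lie
# within the distance threshold d_t. Axis-aligned boxes stand in for
# the paper's 8-direction bounding polygons.

def box_distance(a, b):
    """Gap between two boxes (x0, y0, x1, y1); 0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)

def grow_diagram(components, d_max_char, d_t):
    """Return (diagram region, leftover components)."""
    def size(c):
        return max(c[2] - c[0], c[3] - c[1])
    seeds = [c for c in components if size(c) > d_max_char]
    if not seeds:
        return None, components
    region = list(seeds[:1])          # grow from the first seed only
    rest = [c for c in components if c not in region]
    changed = True
    while changed:                    # iterate until no neighbor merges
        changed = False
        for c in rest[:]:
            if any(box_distance(c, r) <= d_t for r in region):
                region.append(c)
                rest.remove(c)
                changed = True
    return region, rest
```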
<p><strong>2. Vectorization &amp; Segmentation</strong></p>
<ul>
<li><strong>Vectorization</strong>: Uses the GIFTS system (IBM Tokyo) to fit lines to pixels.</li>
<li><strong>Classification Heuristics</strong>:
<ul>
<li><strong>Ratio Test</strong>: If the ratio of a group&rsquo;s dimension to the full diagram dimension is below a threshold $\tau$, it is classified as a <strong>Symbol</strong>:
$$ \frac{D_{\text{group}}}{D_{\text{diagram}}} &lt; \tau $$</li>
<li><strong>Context Rule</strong>: Small vector groups near letters are classified as <strong>Characters</strong> (handles &rsquo;l&rsquo; in &lsquo;Cl&rsquo;).</li>
<li><strong>Circle Rule</strong>: A group is a <strong>Circle</strong> (aromatic ring) if it contains $N \ge 8$ vectors in a roughly circular arrangement.</li>
<li><strong>Default</strong>: Otherwise, classified as <strong>Bond Structure</strong>.</li>
</ul>
</li>
</ul>
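<p>The classification heuristics above can be condensed into a single decision function. This is a sketch under assumed inputs; the boolean flags and the default $\tau$ are illustrative, not values from the paper:</p>

```python
# Sketch of the vector-group classification heuristics: circle rule
# first (N >= 8 roughly circular vectors), then the ratio test, with
# the context rule promoting small groups near letters to characters.

def classify_group(group_dim, diagram_dim, n_vectors,
                   roughly_circular, near_letter, tau=0.1):
    """Classify a vector group as circle, character, symbol, or bond."""
    if roughly_circular and n_vectors >= 8:
        return "circle"                      # aromatic-ring circle
    if group_dim / diagram_dim < tau:        # ratio test: small group
        # context rule: small groups near letters are characters,
        # e.g. the short strokes of the 'l' in 'Cl'
        return "character" if near_letter else "symbol"
    return "bond"                            # default: bond structure
```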
<p><strong>3. Cleanup &amp; Structure Recognition</strong></p>
<ul>
<li><strong>Short Vector Removal</strong>: Vectors shorter than a fraction of the median line length $L_{\text{median}}$ are shrunk to their midpoint (fixing broken junctions).</li>
<li><strong>Vertex Merging</strong>: If two vectors meet at an angle $\theta &lt; 35^{\circ}$, the vertex is removed (fixing single lines broken into two).</li>
<li><strong>Aromatic Processing</strong>: If a circle is detected, the system identifies the 6 closest atoms and adds double bonds to every second bond in the ring.</li>
</ul>
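<p>The vertex-merging rule reduces to an angle test at each vertex. A minimal sketch, assuming vectors are stored as endpoint pairs (the paper&rsquo;s internal representation is not specified):</p>

```python
import math

# Sketch of the vertex-merging cleanup: if two vectors meeting at a
# vertex deviate by less than 35 degrees from a straight line, the
# vertex is removed and the two vectors fuse into one segment.

def deflection_deg(p, v, q):
    """Deflection at vertex v for the polyline p -> v -> q (degrees)."""
    a = (v[0] - p[0], v[1] - p[1])
    b = (q[0] - v[0], q[1] - v[1])
    dot = a[0] * b[0] + a[1] * b[1]
    cos_t = dot / (math.hypot(*a) * math.hypot(*b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def merge_vertex(p, v, q, theta_max=35.0):
    """Return the fused segment (p, q) if the bend is small enough."""
    if deflection_deg(p, v, q) < theta_max:
        return (p, q)
    return None  # genuine corner: keep the vertex
```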
<h3 id="models">Models</h3>
<p><strong>OCR</strong>:</p>
<ul>
<li>The system uses a feature-based, single-font OCR engine.</li>
<li>It assumes non-serif, plain styles typical of drafting standards.</li>
<li>Character images are normalized for size before recognition.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Scanner</strong>: IBM 3119 (300 dpi).</li>
<li><strong>Compute</strong>: IBM PS/2 series running OS/2.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Casey, R., et al. (1993). Optical Recognition of Chemical Graphics. <em>Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR &lsquo;93)</em>, 627-631. <a href="https://doi.org/10.1109/ICDAR.1993.395658">https://doi.org/10.1109/ICDAR.1993.395658</a></p>
<p><strong>Publication</strong>: ICDAR 1993</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{caseyOpticalRecognitionChemical1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical Recognition of Chemical Graphics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of 2nd {{International Conference}} on {{Document Analysis}} and {{Recognition}} ({{ICDAR}} &#39;93)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Casey, R. and Boyer, S. and Healey, P. and Miller, A. and Oudot, B. and Zilles, K.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{627--631}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE Comput. Soc. Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tsukuba Science City, Japan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1993.395658}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé: OCR-Optical Chemical Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</guid><description>A seminal 1992 system for Optical Chemical Structure Recognition (OCSR) using neural networks and heuristic graph compilation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1992). Kekulé: OCR-Optical Chemical (Structure) Recognition. <em>Journal of Chemical Information and Computer Sciences</em>, 32(4), 373-378. <a href="https://doi.org/10.1021/ci00008a018">https://doi.org/10.1021/ci00008a018</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1992</p>
<h2 id="system-architecture-and-methodological-approach">System Architecture and Methodological Approach</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$). It proposes a novel software architecture (&ldquo;Kekulé&rdquo;) designed to solve a specific technical problem: the automatic conversion of printed chemical structure diagrams into computer-readable connection tables. The paper focuses on the &ldquo;how&rdquo; of the system by detailing the seven-step pipeline from scanning to graph compilation, validating the method through performance testing on a specific dataset.</p>
<h2 id="motivation-bridging-visual-diagrams-and-connection-tables">Motivation: Bridging Visual Diagrams and Connection Tables</h2>
<p>The primary motivation is to bridge the gap between how chemists communicate (structural diagrams) and how chemical databases store information (connection tables like MOLfiles).</p>
<ul>
<li><strong>Inefficiency of Manual Entry</strong>: Manual compilation of structural descriptions is &ldquo;tedious and highly prone to error&rdquo;.</li>
<li><strong>Redrawing Costs</strong>: Even using drawing programs (like ChemDraw ancestors) to capture connectivity is inefficient; redrawing a complex molecule like vitamin $B_{12}$ takes ~20 minutes.</li>
<li><strong>Lack of Existing Solutions</strong>: Existing OCR systems at the time failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), and struggled with small, mixed fonts.</li>
</ul>
<h2 id="novelty-a-hybrid-ocr-and-heuristic-approach">Novelty: A Hybrid OCR and Heuristic Approach</h2>
<p>Kekulé represents the first successful attempt to integrate all of the required elements of image processing, OCR, structure editing, and database communication into a complete system.</p>
<ul>
<li><strong>Hybrid OCR Approach</strong>: Unlike commercial OCR of the time, it used a custom implementation combining rotation correction (for skew) with a <strong>multilayer perceptron neural network</strong> trained specifically on small fonts (down to 3.2 points).</li>
<li><strong>Heuristic Feature Extraction</strong>: The authors developed specific heuristics to handle chemical artifacts, such as an exhaustive search for dashed lines, explicitly rejecting Hough transforms as unreliable for short segments.</li>
<li><strong>Contextual &ldquo;Spell Checking&rdquo;</strong>: The system uses chemical context to verify OCR results, such as checking atom symbols against a valid list and using bond connections to disambiguate characters.</li>
</ul>
<h2 id="experimental-setup-and-dataset-validation">Experimental Setup and Dataset Validation</h2>
<p>The authors performed a validation study on a diverse set of chemical structures to stress-test the system:</p>
<ul>
<li><strong>Dataset</strong>: 444 chemical structures were selected from a wide variety of sources, including the <em>Merck Index</em>, <em>Aldrich Handbook</em>, and <em>ACS Nomenclature Guide</em>, specifically chosen to &ldquo;test Kekulé&rsquo;s limits&rdquo;.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Processing Success</strong>: Percentage of structures processed.</li>
<li><strong>User Intervention</strong>: Average number of prompts per structure for verification.</li>
<li><strong>Editing Time</strong>: Time required to correct interpretation errors (the authors set an arbitrary threshold of 30 seconds for a &ldquo;good&rdquo; result).</li>
</ul>
</li>
</ul>
<h2 id="results-and-system-performance">Results and System Performance</h2>
<ul>
<li><strong>High Success Rate</strong>: 98.9% of the 444 structures were processed successfully.</li>
<li><strong>Performance Speed</strong>: The average processing time was 9 seconds per structure on an 80486 (33 MHz) processor.</li>
<li><strong>Error Modes</strong>: The primary bottleneck was broken characters in scanned images (e.g., breaks in &lsquo;H&rsquo; or &lsquo;N&rsquo; crossbars), which slowed down the OCR significantly.</li>
<li><strong>Impact</strong>: The system demonstrated that automated interpretation was faster and less error-prone than manual redrawing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The following details outline the specific technical implementation described in the 1992 paper.</p>
<h3 id="data">Data</h3>
<p>The authors did not release a public dataset but described their test set sources in detail.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Mixed Chemical Sources</td>
          <td>444 structures</td>
          <td>Sourced from <em>Merck Index</em>, <em>Aldrich Handbook</em>, <em>ACS Nomenclature Guide</em>, etc.</td>
      </tr>
      <tr>
          <td>Training (OCR)</td>
          <td>Font Exemplars</td>
          <td>Unknown</td>
          <td>&ldquo;Exemplars of characters from numerous serif and sanserif fonts&rdquo;.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a 7-step pipeline. Key algorithmic choices include:</p>
<ul>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li>Images are reduced to 1-pixel width using <strong>thinning</strong> and <strong>raster-to-vector translation</strong>.</li>
<li>An <strong>adaptive smoothing algorithm</strong> is applied to remove pixel-level jitter.</li>
</ul>
</li>
<li>
<p><strong>Feature Extraction (Dashed Lines)</strong>:</p>
<ul>
<li><strong>Hough Transforms</strong> were rejected due to poor performance on short line segments.</li>
<li><strong>Slope sorting</strong> was rejected due to variance in short dashes.</li>
<li><strong>Chosen Method</strong>: Exhaustive search/testing of all features that <em>might</em> be dashed lines (subset of features).</li>
</ul>
</li>
<li>
<p><strong>Graph Compilation</strong>:</p>
<ul>
<li><strong>Character Grouping</strong>: Characters are assembled into strings based on XY adjacency.</li>
<li><strong>Node Creation</strong>: Character strings become nodes. Vectors with endpoints &ldquo;too far&rdquo; from strings create new nodes.</li>
<li><strong>Heuristics</strong>: Circles are converted to alternating single-double bonds; &ldquo;thick&rdquo; bonds between wedges are automatically generated.</li>
</ul>
</li>
</ul>
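<p>The grouping and node-creation steps can be sketched as below. The data layout, gap threshold, and snapping distance are illustrative assumptions, not the paper&rsquo;s implementation:</p>

```python
import math

# Sketch of graph compilation: character boxes group into label
# strings by XY adjacency; vector endpoints snap to a nearby label
# node, or spawn an implicit-carbon node when "too far" from any.

def group_characters(chars, gap):
    """chars: list of (x, y, symbol); returns (labels, centers)."""
    chars = sorted(chars)
    strings, current = [], [chars[0]]
    for c in chars[1:]:
        prev = current[-1]
        if abs(c[0] - prev[0]) <= gap and abs(c[1] - prev[1]) <= gap:
            current.append(c)          # adjacent: same label string
        else:
            strings.append(current)
            current = [c]
    strings.append(current)
    labels = ["".join(s[2] for s in grp) for grp in strings]
    centers = [(sum(s[0] for s in grp) / len(grp),
                sum(s[1] for s in grp) / len(grp)) for grp in strings]
    return labels, centers

def endpoint_to_node(pt, centers, labels, too_far):
    """Snap a vector endpoint to a label node, or make a carbon node."""
    if centers:
        d, i = min((math.dist(pt, c), i) for i, c in enumerate(centers))
        if d <= too_far:
            return labels[i]
    return "C"  # bare line endpoint: implicit carbon
```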
<h3 id="models">Models</h3>
<p>The core machine learning component is the OCR engine.</p>
<ul>
<li><strong>Architecture</strong>: A <strong>multilayer perceptron neural network</strong> (fully connected).</li>
<li><strong>Input</strong>: Normalized characters. Normalization involves rotation (for skew), scaling, under-sampling, and contrast/density adjustments.</li>
<li><strong>Output</strong>: Ranked probability matches. Outputs above an experimental threshold are retained. If a character is ambiguous (e.g., &lsquo;5&rsquo; vs &lsquo;S&rsquo;), both are kept and resolved via chemical context.</li>
<li><strong>Performance</strong>: Raw accuracy ~96% on small fonts (compared to ~85% for commercial OCR of the era).</li>
</ul>
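<p>The thresholding-plus-context step can be sketched as follows; the threshold value and the valid-atom set are assumptions for illustration, not the paper&rsquo;s tuned values:</p>

```python
# Sketch of OCR ambiguity resolution: keep all candidates above a
# score threshold, then let chemical context (a set of valid atom
# symbols) settle ambiguous cases such as '5' vs 'S'.

VALID_ATOMS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"}

def resolve_character(ranked, threshold=0.2, context=VALID_ATOMS):
    """ranked: list of (symbol, score) pairs from the OCR net."""
    kept = [(sym, p) for sym, p in ranked if p >= threshold]
    if not kept:
        return None                      # reject: no confident match
    chemical = [c for c in kept if c[0] in context]
    pool = chemical or kept              # prefer chemically valid symbols
    return max(pool, key=lambda c: c[1])[0]
```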
<h3 id="hardware">Hardware</h3>
<p>The system was developed and tested on hardware typical of the early 1990s.</p>
<ul>
<li><strong>Processor</strong>: Intel 80486 at 33 MHz.</li>
<li><strong>Scanners</strong>: Hewlett-Packard ScanJet (300 dpi) and Logitech ScanMan (400 dpi hand-held).</li>
<li><strong>Platform</strong>: Microsoft Windows.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mcdanielKekuleOCRopticalChemical1992,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Kekulé: {{OCR-optical}} Chemical (Structure) Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Kekulé}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1992</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{32}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{373--378}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00008a018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph Perception for Chemical Structure OCR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</guid><description>A 1990 methodological paper presenting an early OCR system for digitizing chemical structure images into connectivity tables using C and Prolog.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Contreras, M. L., Allendes, C., Alvarez, L. T., &amp; Rozas, R. (1990). Computational perception and recognition of digitized molecular structures. <em>Journal of Chemical Information and Computer Sciences</em>, 30(3), 302-307. <a href="https://doi.org/10.1021/ci00067a014">https://doi.org/10.1021/ci00067a014</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1990</p>
<h2 id="contribution-graph-perception-and-character-recognition">Contribution: Graph Perception and Character Recognition</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>.</p>
<p>It proposes a specific algorithmic pipeline (&ldquo;graph perception and character recognition&rdquo;) to solve the technical problem of converting pixelated images of molecules into machine-readable connectivity tables. The dominant contribution is the novel set of algorithms (contour search, circular inspection, matrix parametrization).</p>
<h2 id="motivation-automating-chemical-database-entry">Motivation: Automating Chemical Database Entry</h2>
<p>The primary motivation is to automate the input of chemical structures into databases.</p>
<ul>
<li><strong>Problem</strong>: Manual input of structures (especially large ones with stereochemistry) is time-consuming and prone to human error.</li>
<li><strong>Gap</strong>: Existing methods required significant human intervention. The authors created a system that effectively handles both the molecular &ldquo;graph/skeleton&rdquo; and the &ldquo;alphanumeric characters&rdquo; to speed up entry into systems like ARIUSA or CAD tools.</li>
</ul>
<h2 id="algorithmic-novelty-circular-inspection-processing">Algorithmic Novelty: Circular Inspection Processing</h2>
<p>The paper introduces a unified &ldquo;capture-to-recognition&rdquo; system written in C that handles both type-printed and hand-printed structures. Key novelties include:</p>
<ul>
<li><strong>Circular Inspection Algorithm</strong>: A specific technique for detecting internal rings and multiple bonds by sweeping a radius of 0.3 bond lengths around atoms.</li>
<li><strong>Hybrid Recognition</strong>: Combining &ldquo;graph perception&rdquo; (vectorizing the lines) with &ldquo;character recognition&rdquo; (OCR for atom labels) in a single pipeline.</li>
<li><strong>Matrix Parametrization for OCR</strong>: A feature extraction method that assigns hexadecimal IDs to character matrices based on pixel gradients and &ldquo;semibytes&rdquo;.</li>
</ul>
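<p>The matrix-parametrization idea can be sketched as below, assuming a simple 2&times;2 zone layout and half-plane ink tests; the paper&rsquo;s exact zone divisions and gradient directions differ:</p>

```python
# Sketch of "matrix parametrization" for OCR features: a binarized
# character matrix splits into zones, each zone yields a 4-bit
# "semibyte" of ink-presence tests, and the semibytes concatenate
# into a hexadecimal ID for the character.

def ink(zone, r0, r1, c0, c1):
    """True if any pixel in the given sub-rectangle is set."""
    return any(zone[r][c] for r in range(r0, r1) for c in range(c0, c1))

def semibyte(zone):
    """4-bit code: ink in top/bottom/left/right halves of the zone."""
    h, w = len(zone), len(zone[0])
    bits = (ink(zone, 0, h // 2, 0, w),
            ink(zone, h // 2, h, 0, w),
            ink(zone, 0, h, 0, w // 2),
            ink(zone, 0, h, w // 2, w))
    return sum(b << i for i, b in enumerate(bits))

def hex_id(matrix):
    """Concatenate per-quadrant semibytes into a hex ID string."""
    h, w = len(matrix), len(matrix[0])
    zones = [
        [row[:w // 2] for row in matrix[:h // 2]],   # top-left
        [row[w // 2:] for row in matrix[:h // 2]],   # top-right
        [row[:w // 2] for row in matrix[h // 2:]],   # bottom-left
        [row[w // 2:] for row in matrix[h // 2:]],   # bottom-right
    ]
    return "".join(format(semibyte(z), "x") for z in zones)
```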
<h2 id="methodology-validation-via-custom-structure-dataset">Methodology: Validation via Custom Structure Dataset</h2>
<p>The authors validated the system by digitizing and recognizing a set of test structures:</p>
<ul>
<li><strong>Dataset</strong>: 200 type-printed structures and 50 hand-printed structures.</li>
<li><strong>Metric</strong>: &ldquo;Reliability&rdquo; percentage (correct recognition of the connectivity table).</li>
<li><strong>Speed Comparison</strong>: Measured processing time against a &ldquo;qualified person&rdquo; performing manual input for an average 20-atom molecule.</li>
</ul>
<h2 id="results-speed-and-file-size-efficiency">Results: Speed and File Size Efficiency</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>94% reliability</strong> for both type- and hand-printed graphs.</li>
<li><strong>Character Recognition</strong>: Isolated character recognition achieved <strong>&gt;99% reliability</strong>.</li>
<li><strong>Speed</strong>: The system was <strong>3-5 times faster</strong> than manual human input.</li>
<li><strong>Efficiency</strong>: The storage required for a recognized molecule (e.g., $C_{19}H_{31}N$) was significantly smaller (4.1 kb) than the raw image bitmap.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a standard external dataset but rather a custom set of structures for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Type-printed structures</td>
          <td style="text-align: left">200 images</td>
          <td style="text-align: left">Used to test reliability</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Hand-printed structures</td>
          <td style="text-align: left">50 images</td>
          <td style="text-align: left">&ldquo;Straight enough&rdquo; drawings required</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details three specific algorithmic components crucial for replication:</p>
<ol>
<li>
<p><strong>Graph Perception (Contour Search)</strong>:</p>
<ul>
<li><strong>Sweep</strong>: Left-to-right horizontal sweep to find the first pixel.</li>
<li><strong>Contour Follow</strong>: Counter-clockwise algorithm used to trace borders.</li>
<li><strong>Vertex Detection</strong>: A vertex is flagged if the linear trajectory deflection angle is $&gt;18^\circ$.</li>
<li><strong>Atom Localization</strong>: Two or more vertices in a small space indicate an atom position.</li>
</ul>
</li>
<li>
<p><strong>Circular Inspection (Branching/Rings)</strong>:</p>
<ul>
<li><strong>Radius</strong>: A circle is inspected around each atom with $r = 0.3 \times \text{single bond length}$.</li>
<li><strong>Branch Detection</strong>: &ldquo;Unknown border pixels&rdquo; found on this circle trigger new contour searches to find attached bonds or rings.</li>
</ul>
</li>
<li>
<p><strong>Character Recognition (Matrix Feature Extraction)</strong>:</p>
<ul>
<li><strong>Separation</strong>: Characters are separated into isolated matrices and &ldquo;relocated&rdquo; to the top-left corner.</li>
<li><strong>Parametrization</strong>: The matrix is divided into zones. A &ldquo;semibyte&rdquo; (4-bit code) is generated by checking for pixel density in specific directions.</li>
<li><strong>ID Assignment</strong>: Matrices are assigned a Hex ID (e.g., <code>8</code>, <code>1</code>, <code>0</code>, <code>6</code>) based on these semibytes.</li>
<li><strong>Differentiation</strong>: Secondary parameters (concavities, vertical lines) resolve conflicts (e.g., between &lsquo;b&rsquo; and &lsquo;h&rsquo;).</li>
</ul>
</li>
</ol>
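<p>The circular-inspection step can be sketched by sampling points on the $r = 0.3 \times \text{bond length}$ circle around an atom; the pixel-lookup callable and sampling density here are assumptions for illustration:</p>

```python
import math

# Sketch of circular inspection: sample the circle of radius
# 0.3 * bond_length around an atom and report directions where
# unvisited ink appears; each hit seeds a new contour search for
# an attached bond or ring.

def circular_inspection(atom, bond_length, is_dark, visited, n=72):
    """Return angles (degrees) on the circle that hit unvisited ink."""
    r = 0.3 * bond_length
    hits = []
    for k in range(n):
        theta = 2 * math.pi * k / n
        x = round(atom[0] + r * math.cos(theta))
        y = round(atom[1] + r * math.sin(theta))
        if is_dark(x, y) and (x, y) not in visited:
            hits.append(round(math.degrees(theta)))
    return hits
```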
<h3 id="models">Models</h3>
<p>The system does not use learned weights (neural networks). It relies on <strong>rule-based topological recognition</strong>.</p>
<ul>
<li><strong>Representation</strong>: The final output is a Prolog data structure converted into a connectivity table.</li>
<li><strong>Atom Recognition</strong>: Terminal atoms are identified by linear projection; if no pixels are found, it defaults to Carbon.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The performance metrics reflect 1990s hardware, useful for historical context or low-resource reimplementation.</p>
<ul>
<li><strong>Capture</strong>: PC-AT microcomputer with HP-Scanjet.</li>
<li><strong>Processing</strong>: MicroVax II (8 MB real memory, 159 MB hard disc) running Ultrix-32.</li>
<li><strong>Memory Usage</strong>: A $300 \times 300$ dpi image required ~175 kb; a recognized graph required ~1.6 kb.</li>
<li><strong>Time</strong>: Processing time per molecule was 0.7 - 1.0 minutes.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{contrerasComputationalPerceptionRecognition1990,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Computational Perception and Recognition of Digitized Molecular Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Contreras, M. Leonor and Allendes, Carlos and Alvarez, L. Tomas and Rozas, Roberto}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1990</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{30}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{302--307}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00067a014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader: Automated Structure Extraction</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</guid><description>ChemReader extracts chemical structures from raster images using modified Hough transform and chemical spell checking for improved accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Rosania, G. R., Shedden, K. A., Nguyen, M., Lyu, N., &amp; Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. <em>Chemistry Central Journal</em>, 3(1), 4. <a href="https://doi.org/10.1186/1752-153X-3-4">https://doi.org/10.1186/1752-153X-3-4</a></p>
<p><strong>Publication</strong>: Chemistry Central Journal 2009</p>
<h2 id="paper-contribution-method--pipeline">Paper Contribution: Method &amp; Pipeline</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel software system, <strong>ChemReader</strong>, designed to automate the analog-to-digital conversion of chemical structure diagrams. The paper focuses on the algorithmic pipeline, specifically modifying standard computer vision techniques like the Hough Transform to suit chemical graphs. It validates the method through direct performance comparisons against existing State-of-the-Art tools (OSRA, CLiDE, Kekule).</p>
<h2 id="motivation-unlocking-analog-chemical-information">Motivation: Unlocking Analog Chemical Information</h2>
<p>There is a massive amount of chemical information (molecular interactions, pathways, disease processes) locked in scientific literature. However, this information is typically encoded as &ldquo;analog diagrams&rdquo; (raster images) embedded in text. Existing text-based search engines cannot index these structures effectively.</p>
<p>While previous tools existed (Kekule, OROCS, CLiDE), they often required high-resolution images (150-300 dpi) or manual intervention to separate diagrams from text, making fully automated, large-scale database annotation impractical.</p>
<h2 id="core-innovation-modified-transforms-and-spell-checking">Core Innovation: Modified Transforms and Spell Checking</h2>
<p>The authors introduce <strong>ChemReader</strong>, a fully automated toolkit with several specific algorithmic innovations tailored for chemical diagrams:</p>
<ul>
<li><strong>Modified Hough Transform (HT):</strong> Unlike standard HT, which treats all pixels equally, ChemReader uses a modified weight function that accounts for pixel connectivity and line thickness to better detect chemical bonds.</li>
<li><strong>Chemical Spell Checker:</strong> A post-processing step that uses a dictionary of common chemical abbreviations (770 entries) and n-gram probabilities to correct Optical Character Recognition (OCR) errors (e.g., correcting specific atom labels based on valence rules), improving accuracy from 66% to 87%.</li>
<li><strong>Specific Substructure Detection:</strong> Dedicated algorithms for detecting stereochemical &ldquo;wedge&rdquo; bonds using corner detection and aromatic rings using the Generalized Hough Transform.</li>
</ul>
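The spell-checking idea above can be sketched in a few lines: normalize common OCR confusions, then fall back to the nearest dictionary entry. This is a minimal illustration, not the paper's actual 770-entry dictionary, n-gram model, or pixel-intensity similarity; the abbreviation set and confusion map below are assumptions.

```python
# Minimal sketch of a dictionary-based "chemical spell checker" for OCR output.
# The dictionary entries and confusion substitutions are illustrative only.

from difflib import SequenceMatcher

CHEMICAL_DICTIONARY = {"OMe", "OEt", "OAc", "NH2", "OH", "CO2H", "CF3", "Ph"}

# Hypothetical OCR confusions: digit zero read for letter O, etc.
CONFUSIONS = {"0": "O", "5": "S"}

def correct_label(ocr_text: str, dictionary=CHEMICAL_DICTIONARY) -> str:
    """Return the dictionary entry most similar to the OCR'd atom label."""
    # First apply single-character confusion substitutions.
    normalized = "".join(CONFUSIONS.get(c, c) for c in ocr_text)
    if normalized in dictionary:
        return normalized
    # Otherwise pick the closest dictionary word by string similarity.
    return max(dictionary,
               key=lambda t: SequenceMatcher(None, normalized, t).ratio())

print(correct_label("0Me"))  # "0" corrected to "O" -> "OMe"
```

In the real system, a valence check on the corrected label would further filter candidates that cannot take the observed number of bonds.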
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared ChemReader against three other systems: <strong>OSRA V1.0.1</strong>, <strong>CLiDE V2.1 Lite</strong>, and <strong>Kekule V2.0 demo</strong>.</p>
<p>They used three distinct datasets to test robustness:</p>
<ol>
<li><strong>Set I (50 images):</strong> Diverse drawing styles and fonts collected via Google Image Search.</li>
<li><strong>Set II (100 images):</strong> Ligand images from the GLIDA database, linked to PubChem for ground truth.</li>
<li><strong>Set III (212 images):</strong> Low-resolution images embedded in 121 scanned journal articles from PubMed.</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Accuracy:</strong> ChemReader significantly outperformed competitors. In the difficult <strong>Set III</strong> (journal articles), ChemReader achieved <strong>30.2%</strong> correct exact output, compared to 17% for OSRA and 6.6% for CLiDE.</li>
<li><strong>Similarity:</strong> Even when exact matches failed, ChemReader maintained high Tanimoto similarity scores (0.74-0.86), indicating it successfully captured the majority of chemically significant features.</li>
<li><strong>Substructure Recognition:</strong> ChemReader demonstrated higher recall rates across all PubChem fingerprint categories (rings, atom pairs, SMARTS patterns) compared to other tools.</li>
<li><strong>Error Correction:</strong> The &ldquo;Chemical Spell Checker&rdquo; improved character recognition accuracy from <strong>66% to 87%</strong>.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized three test sets collected from public sources.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set I</strong></td>
          <td>50 images</td>
          <td>Sourced from Google Image Search to vary styles/fonts.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set II</strong></td>
          <td>100 images</td>
          <td>Randomly selected ligands from the GLIDA database; ground truth via PubChem.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set III</strong></td>
          <td>212 images</td>
          <td>Extracted from 121 PubMed journal articles; specifically excludes non-chemical figures.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of several sequential processing steps:</p>
<ul>
<li><strong>De-noising:</strong> Uses <strong>GREYCstoration</strong>, an anisotropic smoothing algorithm, to regulate image noise.</li>
<li><strong>Segmentation:</strong> Uses an <strong>8-connectivity algorithm</strong> to group pixels. Components are classified as text or graphics based on height/area ratios.</li>
<li><strong>Line Detection (Modified Hough Transform):</strong>
<ul>
<li>Standard Hough Transform is modified to weight pixel pairs $(P_i, P_j)$ based on connectivity.</li>
<li><strong>Weight Function ($W_{ij}$):</strong>
$$W_{ij} = \begin{cases} n_{ij}\left(\frac{x_{ij}}{n_{ij}} - P_0\right) &amp; \text{if } x_{ij}/n_{ij} &gt; P_0 \\ 0 &amp; \text{otherwise} \end{cases}$$
Where $n_{ij}$ is the pixel count between points, $x_{ij}$ is the count of black pixels, and $P_0$ is a density threshold.</li>
</ul>
</li>
<li><strong>Wedge Bond Detection:</strong> Uses corner detection to find filled triangles, checking that the triangle&rsquo;s geometric area matches the number of black pixels it encloses and that its shape is isosceles.</li>
<li><strong>Chemical Spell Checker:</strong>
<ul>
<li>Calculates the Maximum Likelihood ($ML$) of a character string being a valid chemical word $T$ from a dictionary.</li>
<li><strong>Similarity Metric:</strong>
$$Sim(S_i, T_i) = 1 - \sqrt{\sum_{j=1}^{M} [I^{S_i}(j) - I^{T_i}(j)]^2}$$
Uses pixel-by-pixel intensity difference between the input segment $S$ and candidate template $T$.</li>
</ul>
</li>
</ul>
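The template similarity used by the spell checker can be sketched directly from the formula above: a segment and a candidate template are compared intensity-by-intensity, with identical images scoring 1. The tiny flattened "glyph" below is an illustrative stand-in for a real character bitmap.

```python
# Sketch of the spell checker's template similarity:
#   Sim(S, T) = 1 - sqrt(sum_j (I_S(j) - I_T(j))^2)
# comparing normalized pixel intensities of segment S against template T.

import math

def similarity(segment, template):
    """Pixel-by-pixel intensity similarity between two equal-size images."""
    assert len(segment) == len(template)
    sq_diff = sum((s - t) ** 2 for s, t in zip(segment, template))
    return 1.0 - math.sqrt(sq_diff)

# A "+"-shaped 3x3 glyph, flattened row by row; identical images score 1.0.
S = [0.0, 1.0, 0.0,
     1.0, 1.0, 1.0,
     0.0, 1.0, 0.0]
print(similarity(S, S))  # 1.0
```

Any intensity mismatch lowers the score, so the best-matching dictionary template is simply the one maximizing `similarity`.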
<h3 id="models">Models</h3>
<ul>
<li><strong>Character Recognition:</strong> Uses the open-source <strong>GOCR</strong> library. It employs template matching based on features like holes, pixel densities, and transitions.</li>
<li><strong>Chemical Dictionary:</strong> A lookup table containing <strong>770</strong> frequently used chemical abbreviations and fundamental valence rules.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using exact structure matching and fingerprint similarity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Set III)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>% Correct</strong></td>
          <td><strong>30.2%</strong></td>
          <td>17%</td>
          <td>Exact structure match using ChemAxon JChem.</td>
      </tr>
      <tr>
          <td><strong>Avg Similarity</strong></td>
          <td><strong>0.740</strong></td>
          <td>0.526</td>
          <td>Tanimoto similarity on PubChem Substructure Fingerprints.</td>
      </tr>
      <tr>
          <td><strong>Precision (Rings)</strong></td>
          <td><strong>0.87</strong></td>
          <td>0.84</td>
          <td>Precision rate for recognizing ring systems.</td>
      </tr>
      <tr>
          <td><strong>Recall (Rings)</strong></td>
          <td><strong>0.83</strong></td>
          <td>0.73</td>
          <td>Recall rate for recognizing ring systems.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> C++ implementation running on MS Windows.</li>
<li><strong>Dependencies:</strong> GOCR (OCR), GREYCstoration (Image processing).</li>
</ul>
]]></content:encoded></item><item><title>Chemical Machine Vision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</guid><description>Machine vision approach using Gabor wavelets and Kohonen networks to classify chemical raster images and extract structural metadata.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gkoutos, G. V., Rzepa, H., Clark, R. M., Adjei, O., &amp; Johal, H. (2003). Chemical Machine Vision: Automated Extraction of Chemical Metadata from Raster Images. <em>Journal of Chemical Information and Computer Sciences</em>, 43(5), 1342-1355. <a href="https://doi.org/10.1021/ci034017n">https://doi.org/10.1021/ci034017n</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 2003</p>
<h2 id="paper-classification-methodological-approach">Paper Classification: Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline applying &ldquo;machine vision&rdquo; techniques (Gabor wavelets and Kohonen networks) to the problem of identifying chemical diagrams in low-resolution raster images. The paper focuses on the &ldquo;how&rdquo; (the algorithm and its parameters) and validates the method through quantitative experiments optimizing feature vectors and masks.</p>
<h2 id="motivation-extracting-legacy-chemical-data">Motivation: Extracting Legacy Chemical Data</h2>
<p>The primary motivation is to unlock the &ldquo;large amount of data&rdquo; trapped in legacy raster images (GIF, JPEG) on the Web that lack semantic metadata.</p>
<ul>
<li><strong>Legacy Data Problem</strong>: Most chemical structural information on the Web is embedded in raster images, not machine-readable formats like Molfiles.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous tools like Kekule and CLiDE acted as &ldquo;Chemical OCR,&rdquo; attempting to reconstruct exact atom-bond connections. This required high-resolution images (&gt;300 dpi) and human intervention, making them unsuitable for automated Web crawling of low-resolution (72-96 dpi) images.</li>
<li><strong>Goal</strong>: To create a low-cost, automated tool for a &ldquo;robot-based Internet resource discovery tool&rdquo; that can classify images (e.g., &ldquo;is this a molecule?&rdquo;).</li>
</ul>
<h2 id="core-innovation-texture-recognition-over-structural-ocr">Core Innovation: Texture Recognition over Structural OCR</h2>
<p>The core novelty is the shift from &ldquo;Optical Character Recognition&rdquo; (exact reconstruction) to <strong>&ldquo;Texture Recognition&rdquo;</strong> (classification).</p>
<ul>
<li><strong>Texture-Based Approach</strong>: The authors treat chemical diagrams as textures. They use <strong>Gabor wavelets</strong> to extract texture features. <strong>Crucially, this system does not recognize specific chemical structures</strong> (i.e., atom-bond connectivity tables, <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, or Molfiles). It only classifies images into broad categories.</li>
<li><strong>Incremental Learning</strong>: The system uses a <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong> combined with Class Boundary Analysis (CBA). This allows for &ldquo;incremental learning,&rdquo; where new classes (e.g., aromatic vs. non-aromatic) can be added without retraining the entire system.</li>
<li><strong>Optimization for Chemistry</strong>: The authors identify specific parameters (frequency channels, mask sizes) that are optimal for the &ldquo;texture&rdquo; of chemical diagrams.</li>
<li><strong>Integration with ChemDig</strong>: The method was designed to feed into ChemDig, a robot-based index engine for automated web crawling and metadata generation.</li>
</ul>
<h2 id="experimental-setup-parameter-optimization">Experimental Setup: Parameter Optimization</h2>
<p>The authors performed optimization and validation experiments using a dataset of <strong>300 images</strong> divided into three classes: Ring Systems, Non-Ring Systems, and Non-Chemistry (textures, biological figures, etc.).</p>
<ol>
<li><strong>Parameter Optimization</strong>: They systematically varied hyperparameters to find the optimal configuration:
<ul>
<li><strong>Feature Vector Size</strong>: Tested sizes from 100 to 4000 elements.</li>
<li><strong>Energy Mask Size</strong>: Tested windows from $3 \times 3$ to $15 \times 15$ pixels.</li>
<li><strong>Frequency Channels</strong>: Tested seven spatial frequencies ($\sqrt{2}$ to $64\sqrt{2}$).</li>
</ul>
</li>
<li><strong>Classification Performance</strong>: Evaluated the system&rsquo;s ability to classify unseen test images using a 50:50 training/test split.</li>
<li><strong>Comparison</strong>: Qualitatively compared the approach against vectorization tools (Autotrace, CR2V).</li>
</ol>
<h2 id="results-robust-classification-of-low-resolution-images">Results: Robust Classification of Low-Resolution Images</h2>
<ul>
<li><strong>Optimal Configuration</strong>: The system performed best with a feature vector size of ~1500 elements, a $9 \times 9$ energy mask, and frequency channel $4\sqrt{2}$.</li>
<li><strong>High Accuracy</strong>: Achieved a recognition rate of <strong>91%</strong> with a 50:50 training/test split, and up to <strong>92%</strong> with a 70:30 split.</li>
<li><strong>Robustness</strong>: The system successfully distinguished between chemical and non-chemical images (zero false negatives for chemical images).</li>
<li><strong>Limitations</strong>: Misclassifications occurred between &ldquo;ring&rdquo; and &ldquo;non-ring&rdquo; systems when structures had similar visual &ldquo;textures&rdquo; (e.g., similar density or layout).</li>
<li><strong>Impact</strong>: The method is viable for automating metadata generation (e.g., <code>alt</code> tags) for web crawlers, functioning as a coarse-grained filter before more expensive processing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom dataset of raster images collected from the Web.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><strong>Custom Web Dataset</strong></td>
          <td>300 images</td>
          <td>Split into 3 classes: Ring Systems, Non-Ring Systems, Non-Chemistry.</td>
      </tr>
      <tr>
          <td>Resolution</td>
          <td><strong>Low-Res Web Images</strong></td>
          <td>72-96 dpi</td>
          <td>Deliberately chosen to mimic Web conditions where OCR fails.</td>
      </tr>
      <tr>
          <td>Format</td>
          <td><strong>Raster</strong></td>
          <td>GIF, JPEG</td>
          <td>Typical web formats.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The core pipeline consists of a <strong>Gabor Transform Unit</strong> followed by a <strong>Training/Classification Unit</strong>.</p>
<ul>
<li><strong>Gabor Wavelets</strong>: Used for feature extraction. The 2D Gabor wavelet equation is:
$$h(x,y)=\exp\left\{-\frac{1}{2}\left[\frac{x^{2}}{\sigma_{x}^{2}}+\frac{y^{2}}{\sigma_{y}^{2}}\right]\right\}\cos(2\pi\mu_{\sigma}x+\phi)$$
<ul>
<li><strong>Bank Structure</strong>: 28 filters total (4 orientations $\times$ 7 radial frequencies).</li>
<li><strong>Orientations</strong>: $0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}$.</li>
<li><strong>Frequencies</strong>: 1 octave apart, specifically $\sqrt{2}, 2\sqrt{2}, 4\sqrt{2}, \dots, 64\sqrt{2}$.</li>
<li><strong>Selected Frequency</strong>: $4\sqrt{2}$ was found to be optimal for chemistry.</li>
</ul>
</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Buffer Mounting</strong>: Images are mounted in a buffer (set to 0) to handle edge artifacts.</li>
<li><strong>Look-Up-Tables (LUT/LUF)</strong>: A binary Look-Up-Frame (LUF) indicates Regions of Interest (ROI) to avoid computing empty space; values are stored in a Look-Up-Table (LUT) to prevent re-computation of overlapping windows.</li>
</ul>
</li>
<li><strong>Feature Extraction</strong>:
<ul>
<li><strong>Non-linear Thresholding</strong>: $\psi(t) = \tanh(\alpha t)$ with $\alpha = 0.25$.</li>
<li><strong>Energy Function</strong>: Calculated as average absolute deviation from the mean using a window $W_{xy}$.
$$e_{k}(x,y)=\frac{1}{M^{2}}\sum_{(a,b)\in W_{xy}}|\psi(r_{k}(a,b))|$$</li>
<li><strong>Optimal Window</strong>: $9 \times 9$ pixels.</li>
</ul>
</li>
</ul>
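The feature-extraction step above can be sketched as follows: apply the tanh non-linearity to a Gabor-filtered response, then average its absolute value over a sliding $9 \times 9$ window. The random input here is a stand-in for a real filter response, not an actual Gabor output.

```python
# Sketch of the energy feature map:
#   e_k(x, y) = (1/M^2) * sum over an MxM window of |tanh(alpha * r_k)|
# with alpha = 0.25 and M = 9 as reported in the paper.

import numpy as np

ALPHA = 0.25  # non-linearity gain
M = 9         # optimal energy-window size (9x9)

def energy_map(r: np.ndarray, m: int = M, alpha: float = ALPHA) -> np.ndarray:
    """Windowed mean-absolute energy of a thresholded filter response."""
    psi = np.abs(np.tanh(alpha * r))
    h, w = psi.shape
    out = np.zeros((h - m + 1, w - m + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Mean over an MxM window equals (1/M^2) * sum.
            out[y, x] = psi[y:y + m, x:x + m].mean()
    return out

rng = np.random.default_rng(0)
r = rng.normal(size=(32, 32))  # stand-in for one Gabor channel response
e = energy_map(r)
print(e.shape)  # (24, 24)
```

Concatenating such energy maps across the 28 filter-bank channels (and subsampling) would yield the feature vector fed to the classifier.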
<h3 id="models">Models</h3>
<p>The classification model relies on competitive learning.</p>
<ul>
<li><strong>Architecture</strong>: <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong>.</li>
<li><strong>Training</strong>:
<ul>
<li><strong>Learning Rate</strong>: Starts at 1.0, decreases to 0.1.</li>
<li><strong>Class Boundary Analysis (CBA)</strong>: Computes the centroid (mean) and variance of each cluster. The variance defines the class boundary.</li>
</ul>
</li>
<li><strong>Classification Metric</strong>: <strong>Euclidean Distance Norm</strong>. An unknown vector is classified based on the shortest distance to a cluster center, provided it falls within the variance boundary.
$$D_{ij}=||x_{i}-x_{j}||$$</li>
</ul>
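The decision rule above can be sketched as nearest-centroid classification with a variance-derived rejection boundary. The centroids and radii below are toy values standing in for trained KSOFM clusters, and the two-class setup is an assumption for illustration.

```python
# Sketch of Class Boundary Analysis (CBA) classification: assign an unknown
# vector to the nearest cluster centroid by Euclidean distance, but only if
# it falls inside that class's variance-derived boundary.

import math

CLASSES = {
    # name: (centroid, boundary radius derived from cluster variance)
    "ring":     ([1.0, 1.0], 0.5),
    "non-ring": ([3.0, 3.0], 0.5),
}

def classify(x):
    """Nearest-centroid label, or 'unclassified' if outside all boundaries."""
    name, (centroid, radius) = min(CLASSES.items(),
                                   key=lambda kv: math.dist(x, kv[1][0]))
    return name if math.dist(x, centroid) <= radius else "unclassified"

print(classify([1.1, 0.9]))  # near the "ring" centroid, inside its boundary
print(classify([2.0, 2.0]))  # between clusters, outside both boundaries
```

The rejection option is what lets new classes be added incrementally: vectors that fall outside all existing boundaries can seed a new cluster without retraining.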
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using recognition rate ($R_s$) and misclassification error ($E_s$).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recognition Rate</td>
          <td><strong>91%</strong></td>
          <td>N/A</td>
          <td>Achieved with 50:50 split. 92% with 70:30 split.</td>
      </tr>
      <tr>
          <td>Feature Size</td>
          <td><strong>~1500</strong></td>
          <td>4000</td>
          <td>Reducing vector size from 4000 to 1500 maintained ~80% accuracy while improving speed.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gkoutosChemicalMachineVision2003,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}: {{Automated Extraction}} of {{Chemical Metadata}} from {{Raster Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Gkoutos, Georgios V. and Rzepa, Henry and Clark, Richard M. and Adjei, Osei and Johal, Harpal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2003</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{43}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1342--1355}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci034017n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Literature Data Extraction: The CLiDE Project</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</guid><description>Seminal OCSR system converting scanned chemical diagrams into connection tables via primitive recognition and semantic interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ibison, P., Jacquot, M., Kam, F., Neville, A. G., Simpson, R. W., Tonnelier, C., Venczel, T., &amp; Johnson, A. P. (1993). Chemical Literature Data Extraction: The CLiDE Project. <em>Journal of Chemical Information and Computer Sciences</em>, 33(3), 338-344. <a href="https://doi.org/10.1021/ci00013a010">https://doi.org/10.1021/ci00013a010</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 1993</p>
<h2 id="contribution-and-taxonomy">Contribution and Taxonomy</h2>
<p><strong>Classification: Method ($\Psi_{\text{Method}}$)</strong></p>
<p>This methodological paper proposes a novel software architecture for Optical Chemical Structure Recognition (OCSR). It details specific algorithms for image segmentation, vectorization, and chemical interpretation, validated through the successful extraction of complex structures from literature.</p>
<h2 id="motivation-automating-literature-extraction">Motivation: Automating Literature Extraction</h2>
<p>The manual creation of chemical reaction databases is a time-consuming and expensive process requiring trained chemists to abstract information from literature. While commercial tools existed for interpreting isolated scanned structures (like Kekulé), there was a lack of systems capable of processing whole pages of journals (including embedded text, reaction schemes, and structures) without significant human intervention.</p>
<h2 id="core-innovation-a-three-phase-hybrid-architecture">Core Innovation: A Three-Phase Hybrid Architecture</h2>
<p>CLiDE introduces a comprehensive <strong>three-phase architecture</strong> (Recognition, Grouping, Interpretation) that integrates computer vision with chemical knowledge. Key novelties include:</p>
<ul>
<li><strong>Context-Aware Interpretation:</strong> The use of an extendable <strong>superatom database</strong> to resolve ambiguities in chemical text (e.g., expanding &ldquo;OAc&rdquo; or &ldquo;Me&rdquo; into connection tables).</li>
<li><strong>Hybrid Primitive Detection:</strong> A combination of contour coding for solid lines and a modified Hough transform specifically tuned for detecting dashed chemical bonds.</li>
<li><strong>Semantic Re-construction:</strong> A scoring system for bond-atom association that considers both distance and vector direction to handle poorly drawn structures.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<p>The authors validated the system on a set of &ldquo;difficult cases&rdquo; selected to test specific capabilities. These included:</p>
<ul>
<li><strong>Crossing Bonds:</strong> Structures where bonds intersect without a central atom (Fig. 9d, 9e).</li>
<li><strong>Stereochemistry:</strong> Identification of wedged, dashed, and wavy bonds.</li>
<li><strong>Generic Structures:</strong> Parsing generic text blocks (e.g., $R^1 = Me$) and performing substitutions.</li>
<li><strong>Accuracy Estimation:</strong> The authors report an approximate 90% recognition rate for distinct characters in literature scans.</li>
</ul>
<h2 id="results-and-structural-reconstruction">Results and Structural Reconstruction</h2>
<p>The system successfully generates connection tables (exported as MOLfiles or ChemDraw files) from scanned bitmaps. It effectively distinguishes between graphical primitives (wedges, lines) and text, accurately reconstructing stereochemistry and resolving superatom synonyms (e.g., converting &ldquo;MeO&rdquo; to &ldquo;OMe&rdquo;). The authors conclude that while character recognition depends heavily on image quality, the graphic primitive recognition is robust for lines above a threshold length.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Input Format:</strong> Binary bitmaps scanned from journal pages.</li>
<li><strong>Resolution:</strong> 300 dpi (generating ~1 MB per page).</li>
<li><strong>Superatom Database:</strong> A lookup table containing ~200 entries. Each entry includes:
<ul>
<li><strong>Valency/Charge:</strong> Explicit constraints (e.g., &ldquo;HO&rdquo; takes 1 bond, &ldquo;CO2&rdquo; takes 2).</li>
<li><strong>Bonding Index:</strong> Specifies which letter in the string serves as the attachment point (e.g., letter 2 for &ldquo;HO&rdquo;, letters 1 and 2 for &ldquo;CO2&rdquo;).</li>
<li><strong>Sub-Connection Table:</strong> The internal atomic representation of the group.</li>
</ul>
</li>
</ul>
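A superatom entry in the spirit of the table above can be sketched as a label mapped to its valency and attachment-letter indices. The field layout and the handful of entries below are illustrative assumptions, not CLiDE's actual ~200-entry database.

```python
# Sketch of a superatom lookup table: each abbreviation records how many
# bonds it accepts and which 1-based letter positions serve as attachment
# points (per the examples in the paper: letter 2 for "HO", letters 1 and
# 2 for "CO2"). Entries and encoding are illustrative.

SUPERATOMS = {
    # label: (bonds accepted, 1-based indices of attachment letters)
    "HO":  (1, (2,)),
    "OMe": (1, (1,)),
    "CO2": (2, (1, 2)),
}

def attachment_atoms(label):
    """Return the characters that serve as bond attachment points."""
    _valency, indices = SUPERATOMS[label]
    return [label[i - 1] for i in indices]

print(attachment_atoms("HO"))   # ['O']
print(attachment_atoms("CO2"))  # ['C', 'O']
```

In the full system, each entry would also carry a sub-connection table so the abbreviation can be expanded into explicit atoms and bonds.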
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Primitive Recognition (Vectorization)</strong></p>
<ul>
<li><strong>Contour Coding:</strong> Uses the <strong>Ahronovitz-Bertier-Habib</strong> method to generate interpixel contours (directions N, S, E, W) for connected components.</li>
<li><strong>Polygonal Approximation:</strong> A method similar to <strong>Sklansky and Gonzalez</strong> breaks contours into &ldquo;fractions&rdquo;.
<ul>
<li><em>Rule:</em> Long sides are &ldquo;straight fractions&rdquo;; consecutive short sides are &ldquo;curved fractions&rdquo;.</li>
<li><em>Reconstruction:</em> Parallel fractions are paired to form bond borders. If a border is split (due to noise or crossing lines), the system attempts to merge collinear segments.</li>
</ul>
</li>
<li><strong>Dash Detection:</strong> A <strong>modified Hough transform</strong> is applied to small connected components. It requires at least <strong>three collinear dashes</strong> to classify a sequence as a dashed bond.</li>
</ul>
<p><strong>2. Interpretation Rules</strong></p>
<ul>
<li><strong>Bond-Atom Association:</strong>
<ul>
<li><em>Candidate Selection:</em> The system identifies $m$ closest bonds for a superatom requiring $n$ connections ($m \ge n$).</li>
<li><em>Scoring Function:</em> Connections are selected based on minimizing <strong>perpendicular distance</strong> (alignment).</li>
</ul>
</li>
<li><strong>Crossing Bonds:</strong> Resolved using rules based on <strong>proximity, length, collinearity, and ring membership</strong> to distinguish actual crossings from central carbon atoms.</li>
</ul>
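The alignment score used in bond-atom association can be sketched as a perpendicular-distance test: among candidate bonds near a label, prefer the one whose supporting line points at the label's centre. The geometry helpers and toy coordinates below are assumptions, not CLiDE's exact scoring function.

```python
# Sketch of bond-to-atom association by minimal perpendicular distance
# from the label centre to each candidate bond's supporting line.

import math

def perpendicular_distance(point, a, b):
    """Distance from `point` to the infinite line through endpoints a, b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    dx, dy = bx - ax, by - ay
    # |cross product| / segment length = perpendicular distance.
    return abs(dx * (py - ay) - dy * (px - ax)) / math.hypot(dx, dy)

def best_bond(label_centre, bonds):
    """Pick the candidate bond best aligned with the atom label."""
    return min(bonds, key=lambda ab: perpendicular_distance(label_centre, *ab))

bonds = [((0, 0), (10, 0)),    # horizontal bond pointing at the label
         ((0, 5), (10, 15))]   # diagonal bond passing well above it
print(best_bond((12, 0), bonds))  # the horizontal bond wins
```

A distance-only score would also accept a nearby but misaligned bond; weighting by alignment is what makes the association robust to poorly drawn structures.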
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR:</strong> A neural network trained on alphanumeric characters.
<ul>
<li><strong>Input Representation:</strong> Density matrices derived from character bitmaps.</li>
<li><strong>Post-processing:</strong> Unrecognized characters are flagged for manual correction.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> SUN SPARC workstation.</li>
<li><strong>Scanner:</strong> Agfa Focus S 800GS.</li>
<li><strong>Implementation Language:</strong> C++.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ibisonChemicalLiteratureData1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction: {{The CLiDE Project}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ibison, P. and Jacquot, M. and Kam, F. and Neville, A. G. and Simpson, R. W. and Tonnelier, C. and Venczel, T. and Johnson, A. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{338--344}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00013a010}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Automatic Recognition of Chemical Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</guid><description>A rule-based system for extracting chemical structure information from raster images, validated against commercial baselines.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-image-mining-architecture">Contribution: Rule-Based Image Mining Architecture</h2>
<p><strong>$\Psi_{\text{Method}}$ (Methodological Basis)</strong></p>
<p>This is a methodological paper describing a system architecture for <strong>image mining</strong> in the chemical domain. It focuses on the engineering challenge of converting rasterized depictions of molecules into computer-readable SDF files. The paper details the algorithmic pipeline and validates it through quantitative benchmarking against a commercial alternative.</p>
<h2 id="motivation-digitizing-chemical-literature">Motivation: Digitizing Chemical Literature</h2>
<ul>
<li><strong>Loss of Information</strong>: Chemical software produces structure diagrams as images; once published in the scientific literature, their chemical significance is lost, leaving the data &ldquo;dead&rdquo; to computers.</li>
<li><strong>Gap in Technology</strong>: Image mining lags behind advances in text mining. Earlier commercial solutions (such as CLIDE) were discontinued or remained limited in scope.</li>
<li><strong>Scale of Problem</strong>: The colossal production of chemical documents requires automated tools to exploit this information at large scale.</li>
</ul>
<h2 id="core-innovation-graph-preserving-vectorization">Core Innovation: Graph-Preserving Vectorization</h2>
<ul>
<li><strong>Graph-Preserving Vectorization</strong>: The system uses a custom vectorizer designed to preserve the &ldquo;graph characteristics&rdquo; of chemical diagrams (1 vector = 1 line), which avoids creating spurious vectors at thick joints. It aims to generate a mathematical graph, $G = (V, E)$, mapped geometrically to the image lines.</li>
<li><strong>Chemical Knowledge Integration</strong>: A distinct module validates the reconstructed graph against chemical rules (valences, charges) to ensure the output is chemically valid.</li>
<li><strong>Hybrid Processing</strong>: The system splits the image into &ldquo;connected components&rdquo; for an OCR path (text/symbols) and a &ldquo;body&rdquo; path (bonds), reassembling them later.</li>
</ul>
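The chemical-knowledge validation step can be sketched as a valence check over the reconstructed graph $G = (V, E)$: the summed bond orders at each atom must not exceed that element's allowed valence. The valence table and graph encoding below are illustrative assumptions, not the paper's rule set.

```python
# Sketch of valence validation for a reconstructed molecular graph.
# atoms: {atom_id: element symbol}; bonds: [(i, j, bond_order)].

MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def valid_graph(atoms, bonds):
    """True if every atom's total bond order respects its maximum valence."""
    degree = {i: 0 for i in atoms}
    for i, j, order in bonds:
        degree[i] += order
        degree[j] += order
    return all(degree[i] <= MAX_VALENCE[el] for i, el in atoms.items())

# Ethanol heavy-atom skeleton: C-C-O, all single bonds.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1, 1), (1, 2, 1)]
print(valid_graph(atoms, bonds))                             # True
print(valid_graph(atoms, bonds + [(0, 2, 2), (0, 2, 2)]))    # False: C overloaded
```

A graph that fails such a check signals a recognition error upstream (e.g., a missed atom label or a spurious bond), which is exactly what the validation module is meant to catch.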
<h2 id="methodology--experiments-benchmark-validation">Methodology &amp; Experiments: Benchmark Validation</h2>
<p>The authors performed a quantitative validation using <strong>three different databases</strong> where ground-truth SDF files were available. They also compared their system against the commercial tool <strong>CLIDE</strong> (Chemical Literature Data Extraction).</p>
<ul>
<li><strong>Database 1</strong>: 100 images (varied line widths/fonts)</li>
<li><strong>Database 2</strong>: 100 images</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale batch processing)</li>
</ul>
<h2 id="results--conclusions-superior-accuracy-over-baselines">Results &amp; Conclusions: Superior Accuracy over Baselines</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved <strong>94%</strong> correct reconstruction on Database 1 and <strong>77%</strong> on Database 2. Accuracy was measured as correct recovery of identical geometry and connections.</li>
</ul>
<p>$$ \text{Acc} = \frac{\text{Correct Images}}{\text{Total Images}} $$</p>
<ul>
<li><strong>Baseline Superiority</strong>: The commercial tool CLIDE successfully reconstructed only ~50% of the images in Database 1, compared to the authors&rsquo; 94%.</li>
<li><strong>Scalability</strong>: On the large dataset (Database 3), the system achieved <strong>67%</strong> accuracy in batch mode.</li>
<li><strong>Robustness</strong>: The authors state the system uses a handful of parameters and works robustly across different image types. CLIDE lacked flexibility and required manual intervention.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Formally Reproducible. As is common with applied research from this era, the source code, the trained SVM classifier, and the specific benchmark datasets do not appear to be publicly maintained or available.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">No public code, models, or datasets were released with this 2007 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Used for comparison with CLIDE; 94% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>77% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale test; 67% success rate</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper outlines a 5-module pipeline:</p>
<ol>
<li><strong>Pre-processing</strong>: Adaptive histogram binarization and non-recursive connected component labeling using RLE segments.</li>
<li><strong>OCR</strong>: A &ldquo;chemically oriented OCR&rdquo; using wavelet functions for feature extraction and a <strong>Support Vector Machine (SVM)</strong> for classification. It distinguishes characters from molecular structure.</li>
<li><strong>Vectorizer</strong>: Assigns local directions to RLE segments and groups them into patterns. Crucially, it enforces a one-to-one mapping between image lines and graph vectors.</li>
<li><strong>Reconstruction</strong>: A rule-based module that annotates vectors:
<ul>
<li><strong>Stereochemistry</strong>: Registers vectors against original pixels; thick geometric forms (triangles) become chiral wedges.</li>
<li><strong>Dotted Bonds</strong>: Identifies isolated vectors and clusters them using <strong>quadtree clustering</strong>.</li>
<li><strong>Multi-bonds</strong>: Identifies parallel vectors within a dilated bounding box (factor of 2).</li>
</ul>
</li>
<li><strong>Chemical Knowledge</strong>: Validates the graph valences and properties before exporting SDF.</li>
</ol>
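<p>As an illustration of the multi-bond rule in step 4, the sketch below tests whether two vectors are near-parallel and whether one lies inside the other's bounding box dilated by the stated factor of 2. The angle tolerance and the minimum padding (needed so axis-aligned segments do not produce a degenerate box) are assumptions, not values from the paper.</p>

```python
import math

# Illustrative test for merging parallel vectors into a multi-bond: two
# vectors pair up if they are near-parallel and one fits inside the other's
# bounding box dilated by a factor of 2. Angle tolerance and minimum
# padding are assumptions made for this sketch.

def direction(seg):
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi  # orientation, sense ignored

def dilated_bbox(seg, factor=2.0):
    (x1, y1), (x2, y2) = seg
    pad = 0.3 * math.dist((x1, y1), (x2, y2))  # keeps the box non-degenerate
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw = max(factor * abs(x2 - x1) / 2, pad)
    hh = max(factor * abs(y2 - y1) / 2, pad)
    return cx - hw, cy - hh, cx + hw, cy + hh

def is_multibond_pair(a, b, angle_tol=0.15):
    d = abs(direction(a) - direction(b))
    parallel = min(d, math.pi - d) < angle_tol
    xmin, ymin, xmax, ymax = dilated_bbox(a)
    return parallel and all(xmin <= x <= xmax and ymin <= y <= ymax
                            for x, y in b)

# Two closely spaced parallel strokes read as a double bond:
assert is_multibond_pair(((0, 0), (10, 0)), ((0, 2), (10, 2)))
# A perpendicular stroke does not:
assert not is_multibond_pair(((0, 0), (10, 0)), ((0, 0), (0, 10)))
```

<p>The same pairwise test, applied transitively, would distinguish double from triple bonds by the number of merged vectors.</p>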
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM</strong>: Used in the OCR module to classify text/symbols. It supports dynamic training to correct classification mistakes.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the percentage of correctly reconstructed images (generating a valid, matching SDF file).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>System Value (DB1)</th>
          <th>Baseline (CLIDE)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reconstruction Accuracy</td>
          <td><strong>94%</strong></td>
          <td>~50%</td>
          <td>CLIDE noted as unsuitable for batch processing</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., &amp; Hofmann-Apitius, M. (2007). Automatic Recognition of Chemical Images. <em>Eighth Mexican International Conference on Current Trends in Computer Science</em>, 41-46. <a href="https://doi.org/10.1109/ENC.2007.25">https://doi.org/10.1109/ENC.2007.25</a></p>
<p><strong>Publication</strong>: ENC 2007 (IEEE Computer Society)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriAutomaticRecognitionChemical2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic {{Recognition}} of {{Chemical Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Eighth {{Mexican International Conference}} on {{Current Trends}} in {{Computer Science}} ({{ENC}} 2007)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{41--46}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ENC.2007.25}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInfty: Chemical Structure Recognition in Patent Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</guid><description>Fujiyoshi et al.'s segment-based approach for recognizing chemical structures in challenging Japanese patent images with touching characters and broken lines.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fujiyoshi, A., Nakagawa, K., &amp; Suzuki, M. (2011). Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty. <em>Pre-Proceedings of the 9th IAPR International Workshop on Graphics Recognition, GREC.</em></p>
<p><strong>Publication</strong>: GREC 2011 (Graphics Recognition Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.sciaccess.net/en/InftyReader/">InftyReader Project</a></li>
</ul>
<h2 id="contribution-segment-based-ocsr-method">Contribution: Segment-Based OCSR Method</h2>
<p>This is a <strong>method paper</strong> that introduces ChemInfty, a rule-based system for Optical Chemical Structure Recognition (OCSR) specifically designed to handle the challenging, low-quality images found in Japanese patent applications.</p>
<h2 id="motivation-the-challenge-of-degraded-patent-images">Motivation: The Challenge of Degraded Patent Images</h2>
<p>The motivation is straightforward: Japanese patent applications contain a massive amount of chemical knowledge, but the images are remarkably poor quality. Unlike the relatively clean molecular diagrams in scientific papers, patent images suffer from multiple problems that break conventional OCSR systems.</p>
<p>The authors quantified these issues in a sample of 200 patent images and found that 22% contained touching characters (where atom labels merge together), 19.5% had characters touching bond lines, and 8.5% had broken lines. These are not edge cases; they are pervasive enough to cripple existing recognition tools. Established systems like CLIDE, ChemReader, and OSRA struggle significantly with line-touching characters and broken lines, leading to recognition failures.</p>
<p>The challenge is compounded by the sheer diversity of creation methods. Some structures are drawn with sophisticated molecular editors, others with basic paint programs, and some are even handwritten. This means there&rsquo;s no standardization in fonts, character sizes, or line thickness. Add in the effects of scanning and faxing, and you have images with significant noise, distortion, and degradation.</p>
<p>The goal of ChemInfty is to build a system robust enough to handle these messy real-world conditions and make Japanese patent chemistry computer-searchable.</p>
<h2 id="core-innovation-segment-decomposition-and-dynamic-programming">Core Innovation: Segment Decomposition and Dynamic Programming</h2>
<p>The novelty lies in a segment-based decomposition approach that separates the recognition problem into manageable pieces before attempting to classify them. The key insight is that traditional OCR fails on these images because characters and lines are physically merged. You cannot recognize a character if you cannot cleanly separate it from the surrounding bonds first.</p>
<p>ChemInfty&rsquo;s approach has several distinctive elements:</p>
<ol>
<li>
<p><strong>Line and Curve Segmentation</strong>: The system first decomposes the image into smaller line and curve segments. The decomposition happens at natural breakpoints&mdash;crossings, sharp bends, and other locations where touching is likely to occur. This creates a set of primitive elements that can be recombined in different ways.</p>
</li>
<li>
<p><strong>Linear Order Assumption for Scalability</strong>: To make the dynamic programming approach computationally tractable and avoid combinatorial explosion, the system assumes that segments to be combined are adjacent when sorted in one of four directional orderings ($\perp, \setminus, \triangle, \rightarrow$). This constraint dramatically reduces the search space while still capturing the natural spatial relationships in chemical diagrams.</p>
</li>
<li>
<p><strong>Dynamic Programming for Segment Combination</strong>: Once the image is decomposed, the system faces a combinatorial problem: which segments should be grouped together to form characters, and which should be classified as bonds? The authors use dynamic programming to efficiently search for the &ldquo;most suitable combination&rdquo; of segments. This optimization finds the configuration that maximizes the likelihood of valid chemical structure elements.</p>
</li>
<li>
<p><strong>Two-Pass OCR Strategy</strong>: ChemInfty integrates with InftyReader, a powerful OCR engine. The system uses OCR twice in the pipeline:</p>
<ul>
<li><strong>First pass</strong>: High-confidence character recognition removes obvious atom labels early, simplifying the remaining image</li>
<li><strong>Second pass</strong>: After the segment-based method identifies and reconstructs difficult character regions, OCR is applied again to the cleaned-up character image</li>
</ul>
<p>This two-stage approach handles both easy and hard cases effectively: simple characters are recognized immediately, while complex cases get special treatment.</p>
</li>
<li>
<p><strong>Image Thinning for Structure Analysis</strong>: Before segmentation, the system thins the remaining graphical elements (after removing high-confidence characters) to skeleton lines. This thinning operation reveals the underlying topological structure&mdash;crossings, bends, and endpoints&mdash;making it easier to detect where segments should be divided.</p>
</li>
<li>
<p><strong>Proximity-Based Grouping</strong>: After identifying potential character segments, the system groups nearby segments together. This spatial clustering ensures that parts of the same character that were separated by bonds get recombined correctly.</p>
</li>
</ol>
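<p>The grouping search in items 2 and 3 above can be sketched as a standard dynamic program over a sorted segment list: every partition into contiguous runs is considered, and the partition with the highest summed score wins. Since the paper never defines its scoring heuristic, the <code>measure</code> function here is a caller-supplied stand-in.</p>

```python
def best_grouping(segments, measure):
    """Partition `segments` (pre-sorted along one of the four directional
    orderings) into contiguous groups, maximizing the summed group score.
    `measure(group)` stands in for the paper's undefined Measure(S'),
    which scores a candidate character/bond."""
    n = len(segments)
    best = [(0.0, [])] + [None] * n  # best[i]: optimum over segments[:i]
    for i in range(1, n + 1):
        for j in range(i):  # candidate last group: segments[j:i]
            prev_score, prev_groups = best[j]
            score = prev_score + measure(segments[j:i])
            if best[i] is None or score > best[i][0]:
                best[i] = (score, prev_groups + [segments[j:i]])
    return best[n]

# Toy measure: strongly favor two-segment groups (e.g. two strokes of a
# character), weakly score anything else.
pairs_score = lambda g: 100 if len(g) == 2 else 10
score, groups = best_grouping(["a", "b", "c", "d"], pairs_score)
assert score == 200 and groups == [["a", "b"], ["c", "d"]]
```

<p>In the full system this search would be repeated once per directional ordering with the best result kept; the linear-order assumption is exactly what makes a contiguous-run DP valid in the first place.</p>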
<h2 id="methodology-real-world-patent-evaluation">Methodology: Real-World Patent Evaluation</h2>
<p>The evaluation focused on demonstrating that ChemInfty could handle real-world patent images at scale:</p>
<ol>
<li>
<p><strong>Large-Scale Patent Dataset</strong>: The system was tested on chemical structure images from Japanese patent applications published in 2008. This represents a realistic deployment scenario with all the messiness of actual documents.</p>
</li>
<li>
<p><strong>Touching Character Separation</strong>: The authors specifically measured the system&rsquo;s ability to separate characters from bonds when they were touching. Success was defined as cleanly extracting the character region so that OCR could recognize it.</p>
</li>
<li>
<p><strong>Recognition Accuracy by Object Type</strong>: Performance was broken down by element type (characters, line segments, solid wedges, and hashed wedges). This granular analysis revealed which components were easier or harder for the system to handle.</p>
</li>
<li>
<p><strong>End-to-End Performance</strong>: The overall recognition ratio was calculated across all object types to establish the system&rsquo;s practical utility for automated patent processing.</p>
</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Effective Separation for Line-Touching Characters</strong>: The segment-based method successfully separated 63.5% of characters that were touching bond lines. This is a substantial improvement over standard OCR, which typically fails completely on such cases. The authors note that when image quality is reasonable, the separation method works well.</p>
</li>
<li>
<p><strong>Strong Overall Character Recognition</strong>: Character recognition achieved 85.86% accuracy, which is respectable given the poor quality of the input images. Combined with the 90.73% accuracy for line segments, this demonstrates the system can reliably reconstruct the core molecular structure.</p>
</li>
<li>
<p><strong>Weak Performance on Wedges</strong>: The system struggled significantly with stereochemistry notation. Solid wedges were correctly recognized only 52.54% of the time, and hashed wedges fared even worse at 23.63%. This is a critical limitation since stereochemistry is often essential for understanding molecular properties.</p>
</li>
<li>
<p><strong>Image Quality Dependency</strong>: The authors acknowledge that the method&rsquo;s effectiveness is ultimately limited by image quality. When images are severely degraded (blurred to the point where even humans struggle to distinguish characters from noise), the segmentation approach cannot reliably separate touching elements.</p>
</li>
<li>
<p><strong>Overall System Performance</strong>: The combined recognition ratio of 86.58% for all objects indicates that ChemInfty is a working system but not yet production-ready. The authors conclude that further refinement is necessary, particularly for wedge recognition and handling extremely low-quality images.</p>
</li>
</ul>
<p>The work establishes that segment-based decomposition with dynamic programming is a viable approach for handling the specific challenges of patent image OCSR. The two-pass OCR strategy and the use of image thinning to reveal structure are practical engineering solutions that improve robustness. However, the results also highlight that rule-based methods are fundamentally limited by image quality. There is only so much you can do with algorithmic cleverness when the input is severely degraded. This limitation would motivate later work on deep learning approaches that can learn robust feature representations from large datasets.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="technical-paradigm">Technical Paradigm</h3>
<p><strong>This is a pre-deep learning (2011) classical computer vision paper.</strong> The system uses rule-based methods and traditional OCR engines, not neural networks.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>InftyReader</strong>: A mathematical OCR engine used for the initial high-confidence character recognition pass. This is a pre-existing external tool.</li>
<li><strong>DEF-based OCR</strong>: A standard OCR engine based on Directional Element Features (DEF). These are manually engineered statistical features (histograms of edge directions), not learned neural network features.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a multi-step recognition pipeline:</p>
<ol>
<li><strong>Preprocessing</strong>: Binarization and smoothing</li>
<li><strong>Initial Character Removal</strong>: High-confidence characters are recognized by the InftyReader OCR engine and removed from the image to simplify segmentation</li>
<li><strong>Skeletonization</strong>: Thinning using <strong>Hilditch&rsquo;s algorithm</strong> to skeletonize graphical elements, revealing topological structure (crossings, bends, endpoints)</li>
<li><strong>Feature Point Detection</strong>:
<ul>
<li><strong>Crossing points</strong>: Direct detection on skeleton</li>
<li><strong>Bending points</strong>: Detected using the <strong>Hough transformation</strong></li>
</ul>
</li>
<li><strong>Dynamic Programming Search</strong>:
<ul>
<li><strong>Input</strong>: Set of line/curve segments $S$</li>
<li><strong>Procedure</strong>: Sort segments in 4 directions ($\perp, \setminus, \triangle, \rightarrow$). For each direction, use DP to find the grouping that minimizes a heuristic score</li>
<li><strong>Complexity</strong>: $O(n^2)$ where $n$ is the number of segments</li>
<li><strong>Scoring</strong>: Uses a function <code>Measure(S')</code> that returns a score (0-100) indicating if a subset of segments forms a valid character or bond</li>
</ul>
</li>
</ol>
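<p>Step 4's crossing detection can be approximated by counting 8-connected neighbors on an already-thinned skeleton (the thinning itself, e.g. Hilditch's algorithm, is assumed to have run upstream): a skeleton pixel with three or more foreground neighbors is a branch/crossing candidate, and one with exactly one neighbor is an endpoint. This grid-based sketch is illustrative, not the paper's implementation.</p>

```python
# Crossing/endpoint detection on an already-thinned binary skeleton
# (thinning, e.g. Hilditch's algorithm, is assumed done upstream).
# Illustrative sketch, not the paper's implementation.

def neighbors8(img, r, c):
    """Count foreground pixels among the 8 neighbors of (r, c)."""
    h, w = len(img), len(img[0])
    return sum(
        1
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
        and 0 <= r + dr < h and 0 <= c + dc < w
        and img[r + dr][c + dc]
    )

def feature_points(img):
    """Classify skeleton pixels: >= 3 neighbors -> crossing candidate,
    exactly 1 neighbor -> endpoint."""
    crossings, endpoints = [], []
    for r, row in enumerate(img):
        for c, v in enumerate(row):
            if v:
                n = neighbors8(img, r, c)
                if n >= 3:
                    crossings.append((r, c))
                elif n == 1:
                    endpoints.append((r, c))
    return crossings, endpoints

# A straight 1-pixel line has two endpoints and no crossings:
assert feature_points([[1, 1, 1, 1]]) == ([], [(0, 0), (0, 3)])
```

<p>On real skeletons, pixels adjacent to a junction can also exceed the neighbor threshold, so crossing candidates cluster and must be merged; bending points, which a neighbor count cannot see, are what the Hough-transform step handles.</p>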
<p>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined in the paper, limiting replicability.</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Dataset</strong>: Chemical structure images from Japanese patent applications published in 2008. The complete 2008 dataset contains 229,969 total images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Japanese Published Patent Applications (2008)</td>
          <td>1,599 images</td>
          <td>Contains 229,969 total images for the year. Format: TIFF, 200-400 dpi.</td>
      </tr>
      <tr>
          <td>Analysis</td>
          <td>Random subset for frequency analysis</td>
          <td>200 images</td>
          <td>Used to estimate frequency of touching/broken characters (found in ~20% of images).</td>
      </tr>
  </tbody>
</table>
<p><strong>No Training Set</strong>: The system is rule-based and uses pre-built OCR engines, so no model training was performed.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: Recognition ratio (percentage of correctly recognized objects)</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Line-touching Separation</td>
          <td>63.5%</td>
          <td>Success rate for separating text glued to lines</td>
      </tr>
      <tr>
          <td>Character Recognition</td>
          <td>85.86%</td>
          <td>For all character sizes</td>
      </tr>
      <tr>
          <td>Line segments</td>
          <td>90.73%</td>
          <td>Standard bond recognition</td>
      </tr>
      <tr>
          <td>Solid Wedge Recognition</td>
          <td>52.54%</td>
          <td>Low performance noted as area for improvement</td>
      </tr>
      <tr>
          <td>Hashed Wedges</td>
          <td>23.63%</td>
          <td>Poorest performing element type</td>
      </tr>
      <tr>
          <td>Overall</td>
          <td>86.58%</td>
          <td>Combined across all object types</td>
      </tr>
  </tbody>
</table>
<p><strong>Total Objects Evaluated</strong>: 742,287 objects (characters, line segments, solid wedges, hashed wedges) extracted from the patent images.</p>
<h3 id="hardware">Hardware</h3>
<p>Not reported. Computational cost was not a primary concern for this classical CV system.</p>
<h3 id="replicability">Replicability</h3>
<p><strong>Low.</strong> The paper does not provide sufficient detail for full replication:</p>
<ul>
<li>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined</li>
<li>Dependency on the proprietary/specialized InftyReader engine</li>
<li>No pseudocode provided for the segment decomposition heuristics</li>
</ul>
<h3 id="notes-on-wedge-recognition">Notes on Wedge Recognition</h3>
<p>The system&rsquo;s poor performance on solid wedges (52.54%) and hashed wedges (23.63%) reflects a fundamental challenge for classical thinning algorithms. Wedge bonds are dense triangular regions that indicate 3D stereochemistry. When skeletonized using algorithms like Hilditch&rsquo;s method, these &ldquo;blob&rdquo; shapes often distort into unrecognizable patterns, unlike the clean thin lines that represent regular bonds.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fujiyoshiRobustMethodSegmentation2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Robust {{Method}} of {{Segmentation}} and {{Recognition}} of {{Chemical Structure Images}} in {{ChemInfty}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fujiyoshi, Akio and Nakagawa, Koji and Suzuki, Masakazu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2011</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Pre-proceedings of the 9th IAPR international workshop on graphics recognition, GREC}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>