<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Rule-Based Systems on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/</link><description>Recent content in Rule-Based Systems on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/index.xml" rel="self" type="application/rss+xml"/><item><title>GraphReco: Probabilistic Structure Recognition (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</guid><description>GraphReco is a rule-based OCSR system using Markov networks for probabilistic atom/bond ambiguity resolution during graph assembly.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, H., Yu, Y., &amp; Liu, J.-C. (2026). GraphReco: Probabilistic Structure Recognition for Chemical Molecules. <em>ChemistryOpen</em>, e202500537. <a href="https://doi.org/10.1002/open.202500537">https://doi.org/10.1002/open.202500537</a></p>
<p><strong>Publication</strong>: ChemistryOpen 2026 (Open Access)</p>
<h2 id="a-rule-based-ocsr-system-with-probabilistic-graph-assembly">A Rule-Based OCSR System with Probabilistic Graph Assembly</h2>
<p>GraphReco tackles a challenge that is rarely addressed explicitly in rule-based OCSR: the ambiguity that arises during graph assembly when lower-level component extraction results are imprecise. Small deviations in bond endpoint locations, false positive detections, and spatial proximity between elements all create uncertainty about which atoms and bonds should be connected, merged, or discarded.</p>
<p>The system introduces two main contributions:</p>
<ol>
<li><strong>Fragment Merging (FM) line detection</strong>: An adaptive three-stage algorithm for precise bond line identification across images of variable resolution</li>
<li><strong>Probabilistic ambiguity resolution</strong>: A Markov network that infers the most likely existence and merging state of atom and bond candidates</li>
</ol>
<h2 id="three-stage-pipeline">Three-Stage Pipeline</h2>
<p>GraphReco follows a three-stage workflow:</p>
<ol>
<li>
<p><strong>Component Extraction</strong>: Detects circles (aromatic bonds), bond lines (via the FM algorithm), and chemical symbols (via Tesseract OCR). Includes detection of solid wedge, dashed wedge, dashed line, and wavy bond styles. A semi-open-loop correction step resolves cases where symbols are misclassified as bonds and vice versa.</p>
</li>
<li>
<p><strong>Atom and Bond Ambiguity Resolution</strong>: Creates atom and bond candidates from detected components, builds a Markov network to infer their most probable states, and resolves candidates through existence and merging decisions.</p>
</li>
<li>
<p><strong>Graph Reconstruction</strong>: Assembles resolved atoms and bonds into a molecule graph, selects the largest connected component, and exports as MDL Molfile.</p>
</li>
</ol>
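<p>The three stages can be sketched as a minimal skeleton. All function names and placeholder data here are invented for exposition; the authors have not released source code.</p>

```python
# Hypothetical skeleton of GraphReco's three-stage workflow (names invented;
# no official implementation is available).

def extract_components(image):
    # Stage 1: circles (aromatic rings), bond lines via the FM algorithm,
    # and chemical symbols via OCR. Placeholder values stand in for detectors.
    return {"circles": [], "lines": [((0, 0), (10, 0))], "symbols": []}

def resolve_ambiguity(components):
    # Stage 2: each bond-line endpoint spawns an atom candidate; a Markov
    # network (not shown) decides which candidates exist or merge.
    atoms = [p for line in components["lines"] for p in line]
    bonds = [(2 * i, 2 * i + 1) for i in range(len(components["lines"]))]
    return atoms, bonds

def reconstruct_graph(atoms, bonds):
    # Stage 3: assemble the molecule graph; the real system keeps only the
    # largest connected component and writes an MDL Molfile.
    return {"atoms": atoms, "bonds": bonds}

graph = reconstruct_graph(*resolve_ambiguity(extract_components(None)))
print(graph)
```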
<h2 id="fragment-merging-line-detection">Fragment Merging Line Detection</h2>
<p>Classical Line Hough Transform (LHT) struggles with chemical structure images because bond lines suffer from pixelization, and algorithm parameters that work for one image resolution fail at others. The FM algorithm addresses this with three stages:</p>
<ol>
<li>
<p><strong>Fragment extraction</strong>: Apply LHT with fine-resolution parameters (distance resolution $r = 2$, angular resolution $\theta = 2°$) to detect fine line fragments. Walk along detected theoretical lines to find actual black pixels and group them by connectivity.</p>
</li>
<li>
<p><strong>Fragment grouping</strong>: Pair fragments that share similar angles, are close in the perpendicular direction, and are either overlapping or connected by a path of black pixels.</p>
</li>
<li>
<p><strong>Fragment merging</strong>: Merge grouped fragments into single line segments using the two border pixels farthest from the centroid.</p>
</li>
</ol>
<p>The FM algorithm effectively handles the tradeoff that plagues standard LHT: coarse parameters miss short lines and produce overlaps, while fine parameters return many fragments shorter than actual bonds.</p>
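<p>The grouping and merging stages can be sketched as follows. The thresholds are assumptions (the paper's exact criteria, including the black-pixel connectivity check, are not reproduced), and the merge step uses the farthest-apart endpoint pair as a stand-in for the paper's "two border pixels farthest from the centroid".</p>

```python
import math
from itertools import combinations

ANGLE_TOL = math.radians(5)   # assumed angular tolerance, not from the paper
PERP_TOL = 3.0                # assumed perpendicular-distance tolerance (px)

def angle(frag):
    # Undirected line angle of a fragment ((x1, y1), (x2, y2)), in [0, pi).
    (x1, y1), (x2, y2) = frag
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def perpendicular_gap(a, b):
    # Midpoint offset of b from a, projected perpendicular to a's direction.
    t = angle(a)
    mid = lambda f: ((f[0][0] + f[1][0]) / 2, (f[0][1] + f[1][1]) / 2)
    (mx1, my1), (mx2, my2) = mid(a), mid(b)
    # Perpendicular unit vector to (cos t, sin t) is (-sin t, cos t).
    return abs(-(mx2 - mx1) * math.sin(t) + (my2 - my1) * math.cos(t))

def same_group(a, b):
    da = abs(angle(a) - angle(b))
    da = min(da, math.pi - da)  # wrap-around for near-horizontal lines
    return da < ANGLE_TOL and perpendicular_gap(a, b) < PERP_TOL

def merge(fragments):
    # Merge a group into one segment: keep the two endpoints farthest apart.
    pts = [p for f in fragments for p in f]
    return max(combinations(pts, 2), key=lambda pq: math.dist(*pq))

frags = [((0, 0), (4, 0)), ((5, 0.5), (9, 0.5))]
print(same_group(*frags))  # near-parallel, close in the perpendicular direction
print(merge(frags))
```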
<h2 id="probabilistic-ambiguity-resolution-via-markov-network">Probabilistic Ambiguity Resolution via Markov Network</h2>
<p>After component extraction, GraphReco creates atom and bond candidates rather than directly assembling the graph. Each bond endpoint generates an atom candidate with a circular bounding area of radius:</p>
<p>$$r_b = \min(l_{\text{bond}}, l_{\text{med}}) / 4$$</p>
<p>where $l_{\text{bond}}$ is the bond length and $l_{\text{med}}$ is the median bond length.</p>
<p>A Markov network is constructed with four types of nodes:</p>
<ul>
<li><strong>Atom nodes</strong>: Boolean existence variables for each atom candidate</li>
<li><strong>Bond nodes</strong>: Boolean existence variables for each bond candidate</li>
<li><strong>Atom merge nodes</strong>: Boolean variables for pairs of overlapping atom candidates</li>
<li><strong>Bond merge nodes</strong>: Boolean variables for pairs of nearby bond candidates</li>
</ul>
<p>Potential functions encode rules about when candidates should exist or merge, with merging likelihood between two bond-ending atom candidates defined as a piecewise function of center distance $d$:</p>
<p>$$P(a_1, a_2) = \begin{cases} 0.9, &amp; \text{if } d \leq Q \\ 0.7 - 0.4(d - Q)/(R - Q), &amp; \text{if } Q &lt; d \leq R \\ 0.1, &amp; \text{if } d &gt; R \end{cases}$$</p>
<p>where $Q = \max(r_1, r_2)$ and $R = \min(1.5Q, r_1 + r_2)$. MAP inference determines the final state of all candidates.</p>
<h2 id="evaluation-results">Evaluation Results</h2>
<p>GraphReco is evaluated on USPTO benchmarks with InChI string comparison (stereochemistry removed):</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td><strong>94.2</strong></td>
          <td><strong>86.7</strong></td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>92.4</td>
          <td>70.3</td>
          <td>89.1</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>89.9</td>
          <td>63.0</td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>89.7</td>
          <td>63.9</td>
          <td>89.3</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>93.3</td>
          <td>82.8</td>
          <td><strong>91.5</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>35.4</td>
          <td>13.8</td>
          <td>25.2</td>
      </tr>
  </tbody>
</table>
<p>GraphReco outperforms all rule-based systems and most ML systems, with a particularly large margin on USPTO-10K-Abb (abbreviation-heavy molecules). MolGrapher achieves slightly higher accuracy on the USPTO dataset.</p>
<h3 id="robustness-on-perturbed-images">Robustness on Perturbed Images</h3>
<p>On USPTO-perturbed (rotation and shearing applied), rule-based methods degrade substantially:</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-perturbed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolGrapher</td>
          <td><strong>86.7</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>42.3</td>
      </tr>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td>40.6</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>30.7</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>6.4</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>5.1</td>
      </tr>
  </tbody>
</table>
<p>GraphReco performs better than other rule-based systems on perturbed inputs (40.6% vs. under 31%) thanks to its probabilistic assembly, but still falls far behind MolGrapher (86.7%), demonstrating the robustness advantage of learned approaches.</p>
<h2 id="ablation-study">Ablation Study</h2>
<p>Each component contributes substantially to overall performance on USPTO-10K:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full system</td>
          <td>94.2</td>
          <td>86.7</td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>Without FM line detection</td>
          <td>2.9</td>
          <td>5.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>Without atom candidates</td>
          <td>9.8</td>
          <td>0.4</td>
          <td>5.0</td>
      </tr>
      <tr>
          <td>Without bond candidates</td>
          <td>79.1</td>
          <td>75.8</td>
          <td>75.0</td>
      </tr>
      <tr>
          <td>Without Markov network</td>
          <td>88.2</td>
          <td>81.4</td>
          <td>84.2</td>
      </tr>
  </tbody>
</table>
<p>The FM algorithm and atom candidate mechanism are both critical (accuracy drops below 10% without either). Bond candidates provide a moderate improvement (~15 percentage points), and the Markov network adds ~6 points over hard-threshold alternatives.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Deterministic expert rules limit robustness on perturbed or noisy images, as evidenced by the large accuracy gap with MolGrapher on USPTO-perturbed</li>
<li>The system relies on Tesseract OCR for symbol recognition, which may struggle with unusual fonts or degraded image quality</li>
<li>Only handles single 2D molecule structures per image</li>
<li>Stereochemistry is removed during evaluation, so performance on stereo-bond recognition is not assessed</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>GraphReco is implemented in Python and relies on Tesseract OCR, OpenCV, and RDKit. The authors provided an online demo for testing, but have released neither the source code nor a public repository.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Online Demo</td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Google Cloud Run deployment (no longer available)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components for full reproduction:</strong></p>
<ul>
<li>Source code is not publicly available</li>
<li>No pre-built package or installable library</li>
<li>Hyperparameters for Markov network potential functions are given in the paper (Equations 8-11), but full implementation details are not released</li>
</ul>
<p><strong>Hardware/compute requirements:</strong> Not specified in the paper. The system uses classical computer vision (Hough transforms, thinning) and probabilistic inference (Markov networks), so GPU hardware is likely not required.</p>
]]></content:encoded></item><item><title>Research on Chemical Expression Images Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</guid><description>A 2015 methodology for Optical Chemical Structure Recognition (OCSR) focusing on improved handling of adhesive symbols and wedge bonds.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hong, C., Du, X., &amp; Zhang, L. (2015). Research on Chemical Expression Images Recognition. <em>Proceedings of the 2015 Joint International Mechanical, Electronic and Information Technology Conference</em>, 267-271. <a href="https://doi.org/10.2991/jimet-15.2015.50">https://doi.org/10.2991/jimet-15.2015.50</a></p>
<p><strong>Publication</strong>: JIMET 2015 (Atlantis Press)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://jsme-editor.github.io/">JSME Editor (used for visualization)</a></li>
</ul>
<h2 id="contribution-new-ocsr-workflow-for-adhesion-and-wedge-bonds">Contribution: New OCSR Workflow for Adhesion and Wedge Bonds</h2>
<p><strong>Method</strong>. The paper proposes a novel rule-based pipeline for recognizing 2D organic chemical structures from images. It validates the method by comparing it against an existing tool (OSRA) using a quantitative metric (the Tanimoto coefficient) on a test set of 200 images.</p>
<h2 id="motivation-challenges-with-connecting-symbols-and-stereochemistry">Motivation: Challenges with Connecting Symbols and Stereochemistry</h2>
<p>A vast amount of chemical structural information exists in scientific literature (PDFs/images) that is not machine-readable. Manually converting these images to formats like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or CML is labor-intensive. Existing tools face challenges with:</p>
<ol>
<li><strong>Adhesion</strong>: Poor separation when chemical symbols touch or overlap with bonds.</li>
<li><strong>Stereochemistry</strong>: Incomplete identification of &ldquo;real&rdquo; (solid) and &ldquo;virtual&rdquo; (dashed/hashed) wedge bonds.</li>
</ol>
<h2 id="core-innovation-vector-based-separation-and-stereochemical-logic">Core Innovation: Vector-Based Separation and Stereochemical Logic</h2>
<p>The authors propose a specific <strong>OCSR (Optical Chemical Structure Recognition)</strong> workflow with two key technical improvements:</p>
<ol>
<li><strong>Vector-based Separation</strong>: The method vectorizes the image (using Potrace) to extract straight lines and curves, allowing better separation of &ldquo;adhesive&rdquo; chemical symbols (like H, N, O attached to bonds).</li>
<li><strong>Stereochemical Logic</strong>: Specific rules for identifying wedge bonds:
<ul>
<li><strong>Virtual (Dashed) Wedges</strong>: Identified by grouping connected domains and checking linear correlation of their center points.</li>
<li><strong>Real (Solid) Wedges</strong>: Identified after thinning by analyzing linear correlation and width variance of line segments.</li>
</ul>
</li>
</ol>
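<p>The virtual-wedge test can be sketched as a collinearity check on the fragment centers. The correlation threshold and minimum fragment count are assumptions; the paper does not publish its exact values.</p>

```python
import math

def linear_correlation(points):
    # Absolute Pearson correlation of (x, y) center points; 1.0 = collinear.
    n = len(points)
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:  # perfectly vertical/horizontal arrangement
        return 1.0
    return abs(sxy) / math.sqrt(sxx * syy)

def is_dashed_bond(centers, threshold=0.98):  # threshold is an assumption
    # Small connected domains whose centers are nearly collinear are
    # grouped into a single "virtual" (dashed) wedge bond.
    return len(centers) >= 3 and linear_correlation(centers) >= threshold

dashes = [(0, 0), (2, 1.02), (4, 2.0), (6, 2.97)]  # centers of dash fragments
print(is_dashed_bond(dashes))
```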
<h2 id="methodology--experimental-setup">Methodology &amp; Experimental Setup</h2>
<ul>
<li>
<p><strong>Dataset</strong>: 200 chemical structure images collected from the network.</p>
</li>
<li>
<p><strong>Baselines</strong>: Compared against <strong>OSRA</strong> (Optical Structure Recognition Application), a free online tool.</p>
</li>
<li>
<p><strong>Metric</strong>: <strong>Tanimoto Coefficient</strong>, measuring the similarity of the set of recognized bonds and symbols against the ground truth. The similarity $T(A, B)$ is defined as:</p>
<p>$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</p>
</li>
</ul>
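<p>For the set-based formulation used here (recognized bonds and symbols vs. ground truth), the Tanimoto coefficient is a few lines:</p>

```python
def tanimoto(a: set, b: set) -> float:
    # T(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|); two empty sets count as identical.
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if (a or b) else 1.0

# Illustrative element sets (not from the paper's test data):
truth = {"C-C", "C=O", "O-H", "C-N"}
pred = {"C-C", "C=O", "O-H"}
print(tanimoto(pred, truth))  # 3 / (3 + 4 - 3) = 0.75
```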
<h2 id="results--conclusions">Results &amp; Conclusions</h2>
<ul>
<li><strong>Performance</strong>: The proposed OCSR method achieved higher recognition rates than OSRA.
<ul>
<li><strong>Exact Match (100%)</strong>: OCSR achieved 90.0% vs. OSRA&rsquo;s 82.2%.</li>
<li><strong>High Similarity (&gt;85%)</strong>: OCSR recognized 157 of the 200 structures at this level vs. OSRA&rsquo;s 114.</li>
</ul>
</li>
<li><strong>Limitations</strong>: The paper notes that &ldquo;real wedge&rdquo; and &ldquo;virtual wedge&rdquo; identification was a primary focus, but general recognition effectiveness still &ldquo;has room for improvement&rdquo;.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom collection of images, not a standard benchmark.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Web-crawled chemical images</td>
          <td>200 structures</td>
          <td>Images containing 2D organic structures; specific source URLs not provided.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows these specific steps:</p>
<ol>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Grayscale</strong>: via <code>cvCvtColor</code> (OpenCV).</li>
<li><strong>Binarization</strong>: via Otsu&rsquo;s method.</li>
</ul>
</li>
<li><strong>Isolated Symbol Removal</strong>:
<ul>
<li>Identifies connected domains with aspect ratios in <code>[0.8, 3.0]</code>.</li>
<li>Recognizes them using OCR (GOCR, OCRAD, Tesseract) and removes them from the image.</li>
</ul>
</li>
<li><strong>Virtual Wedge Recognition</strong>:
<ul>
<li>Groups small connected domains (points/clumps).</li>
<li>Calculates linear correlation of center points; if collinear, treats as a dashed bond.</li>
</ul>
</li>
<li><strong>Vectorization &amp; Thinning</strong>:
<ul>
<li><strong>Thinning</strong>: Rosenfeld algorithm (optimized) to reduce lines to single pixel width.</li>
<li><strong>Vectorization</strong>: Uses <strong>Potrace</strong> to convert pixels to vector segments.</li>
<li><strong>Merging</strong>: Combines split vector segments based on angle thresholds to form long straight lines.</li>
</ul>
</li>
<li><strong>Adhesive Symbol Separation</strong>:
<ul>
<li>Identifies curves (short segments after vectorization) attached to long lines.</li>
<li>Separates these domains and re-runs OCR.</li>
</ul>
</li>
<li><strong>&ldquo;Super Atom&rdquo; Merging</strong>:
<ul>
<li>Merges adjacent vertical/horizontal symbols (e.g., &ldquo;HO&rdquo;, &ldquo;CH3&rdquo;) based on distance thresholds between bounding boxes.</li>
</ul>
</li>
</ol>
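<p>The binarization step (step 1) can be illustrated with a pure-NumPy Otsu threshold. This is a stand-in for the OpenCV calls the paper uses (<code>cvCvtColor</code> plus Otsu thresholding), written out to show the between-class-variance criterion.</p>

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    # Pick the threshold t maximizing between-class variance
    # w0 * w1 * (m0 - m1)^2 over the grayscale histogram.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_w = np.cumsum(hist)                       # pixel count at or below t
    cum_mean = np.cumsum(hist * np.arange(256))   # intensity mass at or below t
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0, w1 = cum_w[t], total - cum_w[t]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t] / w0
        m1 = (cum_mean[-1] - cum_mean[t]) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic bimodal image: dark strokes (~30) on a light page (~220).
img = np.full((32, 32), 220, dtype=np.uint8)
img[10:12, :] = 30                # a "bond line"
t = otsu_threshold(img)
binary = img <= t                 # True = foreground (ink)
print(t, binary.sum())
```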
<h3 id="models">Models</h3>
<p>The system relies on off-the-shelf OCR tools for character recognition; no custom ML models were trained.</p>
<ul>
<li><strong>OCR Engines</strong>: GOCR, OCRAD, Tesseract.</li>
<li><strong>Visualization</strong>: JSME (JavaScript Molecule Editor) used to render output strings.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (OCSR)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Exact Match (100%)</td>
          <td><strong>90.0%</strong></td>
          <td>82.2%</td>
          <td>Percentage of 200 images perfectly recognized.</td>
      </tr>
      <tr>
          <td>&gt;95% Similarity</td>
          <td><strong>95 images</strong></td>
          <td>71 images</td>
          <td>Count of images with Tanimoto &gt; 0.95.</td>
      </tr>
      <tr>
          <td>&gt;85% Similarity</td>
          <td><strong>157 images</strong></td>
          <td>114 images</td>
          <td>Count of images with Tanimoto &gt; 0.85.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; runs on standard CPU architecture (implied by use of standard libraries like OpenCV and Potrace).</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{hongResearchChemicalExpression2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Research on {{Chemical Expression Images Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2015 {{Joint International Mechanical}}, {{Electronic}} and {{Information Technology Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hong, Chen and Du, Xiaoping and Zhang, Lu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Atlantis Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Chongqing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2991/jimet-15.2015.50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-94-6252-129-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Probabilistic OCSR with Markov Logic Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</guid><description>A probabilistic approach using Markov Logic Networks to recognize chemical structures from images, improving robustness over rule-based systems.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Frasconi, P., Gabbrielli, F., Lippi, M., &amp; Marinai, S. (2014). Markov Logic Networks for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 54(8), 2380-2390. <a href="https://doi.org/10.1021/ci5002197">https://doi.org/10.1021/ci5002197</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2014</p>
<h2 id="contribution-probabilistic-method-for-ocsr">Contribution: Probabilistic Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a novel algorithmic architecture (<strong>MLOCSR</strong>) that integrates low-level pattern recognition with a high-level probabilistic reasoning engine based on Markov Logic Networks (MLNs). While it contributes to resources by creating a clustered dataset for evaluation, the primary focus is on demonstrating that probabilistic inference offers a superior methodology to the deterministic, rule-based heuristics employed by previous state-of-the-art systems like OSRA and CLiDE.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for converting the vast archive of chemical literature (bitmap images in patents and papers) into machine-readable formats.</p>
<ul>
<li><strong>Limitation of Prior Work</strong>: Existing systems (OSRA, CLiDE, ChemReader) rely on &ldquo;empirical hard-coded geometrical rules&rdquo; to assemble atoms and bonds. These heuristics are brittle, requiring manual tuning of parameters for different image resolutions and failing when images are degraded or noisy.</li>
<li><strong>Gap</strong>: Chemical knowledge is typically used only in post-processing (e.g., to fix valency errors).</li>
<li><strong>Goal</strong>: To create a resolution-independent system that uses probabilistic reasoning to handle noise and ambiguity in graphical primitives.</li>
</ul>
<h2 id="core-innovation-markov-logic-networks-for-diagram-interpretation">Core Innovation: Markov Logic Networks for Diagram Interpretation</h2>
<p>The core novelty is the application of <strong>Markov Logic Networks (MLNs)</strong> to the problem of diagram interpretation.</p>
<ul>
<li><strong>Probabilistic Reasoning</strong>: The system treats extracted visual elements (lines, text boxes) as &ldquo;evidence&rdquo; and uses weighted first-order logic formulas to infer the most likely molecular graph (Maximum A Posteriori inference). The probability of a state $x$ is defined by the MLN log-linear model:
$$ P(X=x) = \frac{1}{Z} \exp\left(\sum_{i} w_i n_i(x)\right) $$
where $w_i$ is the weight of the $i$-th formula and $n_i(x)$ is the number of true groundings in $x$.</li>
<li><strong>Unified Knowledge Representation</strong>: Geometric constraints (e.g., collinearity) and chemical rules (e.g., valency) are encoded in the same logic framework.</li>
<li><strong>Resolution Independence</strong>: The low-level extraction module dynamically estimates character size ($T$) and stroke width ($S$) to normalize parameters, removing the dependence on image DPI metadata.</li>
</ul>
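<p>The log-linear model above can be made concrete on a toy network with two Boolean query atoms and two weighted formulas. The formulas and weights are illustrative only, not the 128 formulas or learned weights of MLOCSR.</p>

```python
import math
from itertools import product

# Formulas over state x = (bond_exists, atoms_merged):
#   f1 (w = 1.5): bond_exists                     (evidence favors the bond)
#   f2 (w = 0.8): bond_exists => not atoms_merged
formulas = [
    (1.5, lambda x: x[0]),
    (0.8, lambda x: (not x[0]) or (not x[1])),
]

def score(x):
    # sum_i w_i * n_i(x); each toy formula has a single grounding here.
    return sum(w for w, f in formulas if f(x))

states = list(product([False, True], repeat=2))
Z = sum(math.exp(score(x)) for x in states)          # partition function
probs = {x: math.exp(score(x)) / Z for x in states}  # P(X = x)
map_state = max(states, key=score)                   # MAP inference by enumeration
print(map_state)                                     # bond exists, atoms not merged
print(round(sum(probs.values()), 6))                 # normalization check
```

Real MLN inference enumerates nothing, of course; MAP is found by local search (MaxWalkSAT) over groundings.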
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated the system on recognition accuracy against the leading open-source baseline, <strong>OSRA (v1.4.0)</strong>.</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>USPTO Clustered</strong>: A non-redundant subset of 937 images derived from a larger set of 5,719 US Patent Office images.</li>
<li><strong>ChemInfty</strong>: 869 images from Japanese patents.</li>
<li><strong>Degraded Images</strong>: The USPTO set was synthetically degraded at three resampling levels (Low, Medium, High degradation) to test robustness.</li>
</ul>
</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Geometric</strong>: Precision, Recall, and $F_1$ scores for individual atoms and bonds.</li>
<li><strong>Chemical</strong>: Tanimoto similarity (using path fingerprints) and InChI string matching (basic and full stereochemistry).</li>
</ul>
</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Superior Robustness</strong>: MLOCSR significantly outperformed OSRA on degraded images. On high-degradation images, MLOCSR achieved an atom $F_1$ of 80.3% compared to OSRA&rsquo;s 76.0%.</li>
<li><strong>Geometric Accuracy</strong>: In clean datasets (USPTO cluster), MLOCSR achieved higher $F_1$ scores for atoms (99.1% vs 97.5%) and bonds (98.8% vs 97.8%).</li>
<li><strong>Chemical Fidelity</strong>: The system achieved comparable Tanimoto similarity scores (0.948 vs 0.940 for OSRA).</li>
<li><strong>Limitation</strong>: OSRA slightly outperformed MLOCSR on &ldquo;Full <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>&rdquo; matching (81.4% vs 79.4%), indicating the probabilistic model still needs improvement in handling complex stereochemistry.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized public datasets, with specific preprocessing to ensure non-redundancy.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>USPTO Clustered</strong></td>
          <td>937 images</td>
          <td>Selected via spectral clustering from 5,719 raw images to remove near-duplicates.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>ChemInfty</strong></td>
          <td>869 images</td>
          <td>Ground-truthed dataset from Japanese patent applications (2008).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of two distinct phases: Low-Level Vectorization and High-Level Inference.</p>
<p><strong>1. Low-Level Extraction (Image Processing)</strong></p>
<ul>
<li><strong>Binarization</strong>: Global thresholding followed by morphological closing.</li>
<li><strong>Text/Stroke Estimation</strong>:
<ul>
<li>Finds text height ($T$) by looking for &ldquo;N&rdquo; or &ldquo;H&rdquo; characters via OCR, or averaging compatible connected components.</li>
<li>Estimates stroke width ($S$) by inspecting pixel density on potential segments identified by Hough transform.</li>
</ul>
</li>
<li><strong>Vectorization</strong>:
<ul>
<li><strong>Canny Edge Detection</strong> + <strong>Hough Transform</strong> to find lines.</li>
<li><strong>Douglas-Peucker algorithm</strong> for polygonal approximation of contours.</li>
<li><strong>Circle Detection</strong>: Finds aromatic rings by checking for circular arrangements of carbon candidates.</li>
</ul>
</li>
</ul>
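<p>The Douglas-Peucker step above admits a compact recursive sketch (the tolerance value is illustrative):</p>

```python
import math

def point_line_dist(p, a, b):
    # Perpendicular distance from point p to the line through a and b.
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
    return num / math.dist(a, b)

def douglas_peucker(points, eps):
    # Recursively keep the point farthest from the chord; if all points lie
    # within eps of the chord, collapse the run to its two endpoints.
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_line_dist(points[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= eps:
        return [a, b]
    left = douglas_peucker(points[: idx + 1], eps)
    right = douglas_peucker(points[idx:], eps)
    return left[:-1] + right  # drop the duplicated split point

# A noisy horizontal stroke followed by a bend, as from a traced contour:
contour = [(0, 0), (1, 0.05), (2, -0.04), (3, 0.03), (4, 0), (5, 2), (6, 4)]
print(douglas_peucker(contour, eps=0.1))
```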
<p><strong>2. High-Level Inference (Markov Logic)</strong></p>
<ul>
<li><strong>Evidence Generation</strong>: Visual primitives (lines, text boxes, circles) are converted into logical ground atoms (e.g., <code>LineBetweenCpoints(c1, c2)</code>).</li>
<li><strong>Inference Engine</strong>: Uses <strong>MaxWalkSAT</strong> for Maximum A Posteriori (MAP) inference to determine the most probable state of query predicates (e.g., <code>DoubleBond(c1, c2)</code>).</li>
<li><strong>Parameters</strong>: MaxWalkSAT run with 3 tries and 1,000,000 steps per try.</li>
</ul>
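<p>A toy MaxWalkSAT in the spirit of the MAP inference above: flip variables to minimize the total weight of unsatisfied weighted clauses. The clause encoding and parameters are illustrative (the paper runs Alchemy's implementation with 3 tries and 10<sup>6</sup> steps).</p>

```python
import random

def maxwalksat(clauses, n_vars, tries=3, steps=1000, p_noise=0.5, seed=0):
    # clauses: list of (weight, [signed literals]); literal l > 0 means
    # "var |l| is True", l < 0 means "var |l| is False".
    rng = random.Random(seed)
    def unsat(assign):
        return [(w, lits) for w, lits in clauses
                if not any(assign[abs(l)] == (l > 0) for l in lits)]
    best, best_cost = None, float("inf")
    for _ in range(tries):
        assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(steps):
            broken = unsat(assign)
            cost = sum(w for w, _ in broken)
            if cost < best_cost:
                best, best_cost = dict(assign), cost
            if not broken:
                break
            _, lits = rng.choice(broken)        # pick an unsatisfied clause
            if rng.random() < p_noise:
                v = abs(rng.choice(lits))       # random-walk move
            else:                               # greedy: flip the cheapest var
                def cost_if_flipped(var):
                    assign[var] = not assign[var]
                    c = sum(w for w, _ in unsat(assign))
                    assign[var] = not assign[var]
                    return c
                v = min((abs(l) for l in lits), key=cost_if_flipped)
            assign[v] = not assign[v]
    return best, best_cost

# Weighted clauses over vars 1..3; the optimum satisfies all but the last:
clauses = [(2.0, [1]), (1.5, [-1, 2]), (1.0, [-2, 3]), (0.5, [-3])]
assign, cost = maxwalksat(clauses, 3)
print(assign, cost)
```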
<h3 id="models">Models</h3>
<ul>
<li><strong>Markov Logic Network (MLN)</strong>:
<ul>
<li>Contains <strong>128 first-order logic formulas</strong>.</li>
<li><strong>Geometric Rules</strong>: Example: <code>VeryCloseCpoints(c1, c2) =&gt; SameCarbon(c1, c2)</code> (weighted rule to merge close nodes).</li>
<li><strong>Chemical Rules</strong>: Example: <code>IsHydroxyl(t) ^ Connected(c,t) =&gt; SingleBond(c,t)</code> (imposes valency constraints).</li>
</ul>
</li>
<li><strong>OCR Engine</strong>: Tesseract is used for character recognition on text connected components.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The authors introduced a bipartite graph matching method to evaluate geometric accuracy when superatoms (e.g., &ldquo;COOH&rdquo;) are not expanded.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Atom/Bond $F_1$</strong></td>
          <td>Calculated via minimum-weight bipartite matching between predicted graph and ground truth, weighted by Euclidean distance.</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>Standard unique identifier string. &ldquo;Basic&rdquo; ignores stereochemistry; &ldquo;Full&rdquo; includes it.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Jaccard index of path fingerprints between predicted and ground truth molecules.</td>
      </tr>
  </tbody>
</table>
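<p>The matching-based $F_1$ can be illustrated on a toy case. This brute-forces the minimum-distance assignment over permutations (fine for a handful of atoms; a real implementation would use the Hungarian algorithm), then counts a prediction as correct within an assumed distance tolerance.</p>

```python
import math
from itertools import permutations

def match_and_f1(pred, truth, tol=2.0):
    # Minimum-total-distance injective assignment of predicted atoms onto
    # ground-truth atoms; tol (assumed) decides which matches count as hits.
    k = min(len(pred), len(truth))
    best = None
    for perm in permutations(range(len(truth)), k):
        cost = sum(math.dist(pred[i], truth[j]) for i, j in enumerate(perm))
        if best is None or cost < best[0]:
            best = (cost, perm)
    tp = sum(math.dist(pred[i], truth[j]) <= tol
             for i, j in enumerate(best[1]))
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

truth = [(0, 0), (10, 0), (20, 0)]
pred = [(0.5, 0.2), (10.3, -0.1)]   # one ground-truth atom missed
print(round(match_and_f1(pred, truth), 3))  # precision 1, recall 2/3 -> F1 = 0.8
```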
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software</strong>: Logic inference performed using the <strong>Alchemy</strong> software package (University of Washington).</li>
<li><strong>Web Server</strong>: The system was made available at <code>http://mlocsr.dinfo.unifi.it</code> (Note: URL likely inactive).</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{frasconiMarkovLogicNetworks2014,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Markov {{Logic Networks}} for {{Optical Chemical Structure Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Frasconi, Paolo and Gabbrielli, Francesco and Lippi, Marco and Marinai, Simone}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2014</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{54}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2380--2390}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596, 1549-960X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci5002197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé-1 System for Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</guid><description>Foundational OCSR method combining neural OCR with chemical rule-based post-processing for automated structure interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1996). Automatic Interpretation of Chemical Structure Diagrams. <em>Graphics Recognition. Methods and Applications</em>, 148-158. <a href="https://doi.org/10.1007/3-540-61226-2_13">https://doi.org/10.1007/3-540-61226-2_13</a></p>
<p><strong>Publication</strong>: Lecture Notes in Computer Science (LNCS), Vol. 1072, Springer, 1996.</p>
<h2 id="system-architecture-and-contribution">System Architecture and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel software architecture (&ldquo;Kekulé-1&rdquo;) designed to solve the specific technical problem of converting rasterized chemical diagrams into machine-readable connection tables. The paper is characterized by:</p>
<ul>
<li><strong>Algorithmic Specification</strong>: It details specific algorithms for vectorization, polygon approximation, and character recognition.</li>
<li><strong>Performance Metrics</strong>: It validates the method using quantitative accuracy (98.9%) and speed comparisons against manual entry.</li>
<li><strong>System Architecture</strong>: It describes the integration of typically disparate components (OCR, vectorization, chemical rules) into a cohesive pipeline.</li>
</ul>
<h2 id="motivation-the-chemical-data-entry-bottleneck">Motivation: The Chemical Data Entry Bottleneck</h2>
<p>Chemical structure diagrams are the primary medium for communication between chemists, but computers cannot natively &ldquo;read&rdquo; these raster images.</p>
<ul>
<li><strong>Efficiency Gap</strong>: Manual redrawing of structures into chemical databases takes 6 to 10 minutes per structure.</li>
<li><strong>Technical Challenge</strong>: Existing commercial OCR systems failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), nor could they recognize small fonts (3-7 points) or chemical symbols accurately.</li>
<li><strong>Goal</strong>: To create an &ldquo;Optical Chemical Structure Recognition&rdquo; (OCSR) system that reduces processing time to seconds while handling complex notation like stereochemistry and group formulas.</li>
</ul>
<h2 id="core-innovations-in-chemical-ocr">Core Innovations in Chemical OCR</h2>
<p>Kekulé-1 represents the &ldquo;first successful attempt&rdquo; to integrate image processing, OCR, and structure editing into a single workflow. Key innovations include:</p>
<ul>
<li><strong>Context-Aware OCR</strong>: Unlike standard OCR, Kekulé-1 uses &ldquo;chemical spell checking&rdquo; by applying valence rules and chemical context to correct raw character recognition errors (e.g., distinguishing &lsquo;5&rsquo; from &lsquo;S&rsquo; based on bonding).</li>
<li><strong>Adaptive Polygon Approximation</strong>: A modified vectorization algorithm that partitions objects at the farthest node to prevent artifact nodes in U-shaped structures.</li>
<li><strong>Hybrid Parsing</strong>: It treats the diagram as a graph where nodes can be explicit atoms or geometric intersections, using rule-based logic to parse &ldquo;group formulas&rdquo; (like $\text{COOH}$) recursively.</li>
</ul>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The authors evaluated the system on a private test set to validate robustness and speed.</p>
<ul>
<li><strong>Dataset</strong>: 524 chemical structures chosen from a &ldquo;wide variety of sources&rdquo; specifically to test the system&rsquo;s limits.</li>
<li><strong>Metrics</strong>: Success rate (percentage of structures processed with minimal editing) and processing time per structure.</li>
<li><strong>Comparators</strong>: Performance was compared against the &ldquo;manual redrawing&rdquo; baseline.</li>
</ul>
<h2 id="results-performance-and-conclusions">Results, Performance, and Conclusions</h2>
<ul>
<li><strong>High Accuracy</strong>: 98.9% of the test structures were successfully processed (with an average of 0.74 user prompts per structure).</li>
<li><strong>Speedup</strong>: Processing took 7 to 30 seconds per structure, a significant improvement over the 6 to 10 minute manual baseline.</li>
<li><strong>Robustness</strong>: The system successfully handled pathological cases like broken characters, skew (rotation), and touching characters.</li>
<li><strong>Impact</strong>: The authors conclude that the techniques are generalizable to other domains like electrical circuits and utility maps.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: The evaluation used 524 chemical structures. These were not released publicly but were selected to represent &ldquo;limit&rdquo; cases.</li>
<li><strong>Input format</strong>: Scanned images at 300-400 dpi. The authors note that higher resolutions do not add information due to ink wicking and paper limitations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details several specific algorithmic implementations:</p>
<p><strong>Vectorization (Polygon Approximation)</strong>:</p>
<ul>
<li>Standard thinning and raster-to-vector translation are used.</li>
<li><strong>Innovation</strong>: The algorithm searches for the node <em>farthest</em> from the current start node to partition the object. This prevents artifact nodes in curved lines.</li>
<li><strong>Threshold Formula</strong>: The allowed deviation ($dist$) from a straight line is adaptive based on segment length ($length$):</li>
</ul>
<p>$$dist = \max(1, \frac{length}{10.0} + 0.4)$$</p>
<p>(Units in pixels)</p>
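<p>The threshold formula can be stated directly in code (a trivial sketch; the constants are those reported in the paper):</p>

```python
def polygon_deviation_threshold(length: float) -> float:
    """Adaptive deviation allowance (in pixels) for Kekulé-1's polygon
    approximation: longer segments tolerate more deviation from a line."""
    return max(1.0, length / 10.0 + 0.4)

for length in (5, 20, 100):
    print(length, polygon_deviation_threshold(length))
```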
<p><strong>Rotation Correction</strong>:</p>
<ul>
<li>The system computes the angle of all &ldquo;long&rdquo; line segments modulo 15 degrees.</li>
<li>These angles are binned; the angle of the most populated bin (the skew is assumed to be below 4 degrees) is taken as the scan skew and corrected.</li>
</ul>
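<p>A sketch of the skew-estimation procedure under stated assumptions: the paper fixes only the 15-degree modulus, so the 1-degree bin width below is an assumption.</p>

```python
from math import atan2, degrees

def estimate_skew(segments, bin_width=1.0):
    """Estimate scan skew in the spirit of Kekulé-1: take each long
    segment's angle modulo 15 degrees, histogram it, and return the
    most populated bin's angle."""
    bins = {}
    for (x1, y1), (x2, y2) in segments:
        angle = degrees(atan2(y2 - y1, x2 - x1)) % 15.0
        key = round(angle / bin_width) * bin_width
        bins[key] = bins.get(key, 0) + 1
    return max(bins, key=bins.get)

# Bonds drawn at multiples of 15 degrees, all rotated by ~2 degrees of skew:
segs = [((0, 0), (10, 0.35)), ((0, 0), (8.5, 5.31)), ((0, 0), (5.0, 9.4))]
print(estimate_skew(segs))  # 2.0
```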
<p><strong>Optical Character Recognition (OCR)</strong>:</p>
<ul>
<li>Uses a neural network with linked/shared weights (similar to Convolutional Neural Networks, though not named as such) acting as a feature detector.</li>
<li><strong>Training</strong>: Trained on specific chemical fonts.</li>
<li><strong>Inference</strong>: Outputs are ranked; if multiple characters (e.g., &lsquo;5&rsquo; and &lsquo;S&rsquo;) exceed a threshold, both are kept, and chemical context resolves the ambiguity later.</li>
</ul>
<p><strong>Chemical Parsing</strong>:</p>
<ul>
<li>Group formulas (e.g., $\text{COOH}$) are parsed left-to-right by subtracting valences.</li>
<li>Example: For $\text{COOH}$, the external bond reduces Carbon&rsquo;s valence to 3. The first Oxygen takes 2, leaving 1. The final Oxygen takes 1 (attaching to Carbon), and the Hydrogen takes 1 (attaching to Oxygen).</li>
</ul>
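<p>The left-to-right valence-subtraction parse can be sketched as follows. The attachment heuristic (bond each new atom to the most recent earlier atom with free valence, filling as much valence as both can give) is inferred from the COOH walk-through and is an assumption, as is the restriction to single-letter symbols:</p>

```python
VALENCE = {"C": 4, "O": 2, "H": 1, "N": 3, "S": 2}

def parse_group(formula, external_bonds=1):
    """Left-to-right valence subtraction for a linear group formula.
    Returns (atom_index, atom_index, bond_order) triples.
    Sketch only: single-letter symbols, no digits or charges."""
    atoms = list(formula)
    free = [VALENCE[a] for a in atoms]
    free[0] -= external_bonds          # bond attaching the group to the skeleton
    bonds = []
    for i in range(1, len(atoms)):
        # most recent earlier atom with remaining valence
        j = max(k for k in range(i) if free[k] > 0)
        order = min(free[j], free[i])
        free[j] -= order
        free[i] -= order
        bonds.append((j, i, order))
    return bonds

# COOH: C=O double bond, C-O single bond, O-H single bond
print(parse_group("COOH"))  # [(0, 1, 2), (0, 2, 1), (2, 3, 1)]
```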
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Model</strong>: A neural network with a &ldquo;shared weights&rdquo; paradigm, effectively creating a learned convolution map. It achieves ~99.9% raw accuracy on isolated test sets of chemical fonts.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: The evaluation was performed on an <strong>80486 processor at 33 MHz</strong>.</li>
<li><strong>Time</strong>: Average processing time was 9 seconds per structure.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{mcdanielAutomaticInterpretationChemical1996,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Interpretation of Chemical Structure Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Graphics Recognition. Methods and Applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{O&#39;Gorman, Lawrence and Kasturi, Rangachar}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{Lecture Notes in Computer Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1072}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{148--158}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1996}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/3-540-61226-2_13}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CLiDE Pro: Optical Chemical Structure Recognition Tool</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</guid><description>A methodological paper presenting CLiDE Pro, an OCSR system for reconstructing chemical graphs from images with ~90% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Valko, A. T., &amp; Johnson, A. P. (2009). CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 49(4), 780-787. <a href="https://doi.org/10.1021/ci800449t">https://doi.org/10.1021/ci800449t</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2009</p>
<h2 id="contribution-robust-algorithmic-pipeline-for-ocsr">Contribution: Robust Algorithmic Pipeline for OCSR</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper, as it proposes a specific algorithmic architecture (CLiDE Pro) for converting raster images of chemical structures into connection tables. It details the procedural steps for segmentation, vectorization, and graph reconstruction.</p>
<p>It also has a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution, as the authors compile and release a validation set of 454 real-world images to serve as a community benchmark for OCSR systems.</p>
<h2 id="motivation-bridging-the-gap-between-legacy-document-images-and-machine-readable-chemistry">Motivation: Bridging the Gap Between Legacy Document Images and Machine-Readable Chemistry</h2>
<p>While modern chemical drawing software captures structural information explicitly, the vast majority of legacy and current chemical literature (journals, patents, reports) exists as images or PDF documents. These images are human-readable but lack the semantic &ldquo;connection table&rdquo; data required for chemical databases and software. Manual redrawing is time-consuming and error-prone. Therefore, there is a critical need for efficient Optical Chemical Structure Recognition (OCSR) systems to automate this extraction.</p>
<h2 id="novelty-integrated-document-segmentation-and-ambiguity-resolution-heuristics">Novelty: Integrated Document Segmentation and Ambiguity Resolution Heuristics</h2>
<p>CLiDE Pro introduces several algorithmic improvements over its predecessor (CLiDE) and contemporary tools:</p>
<ul>
<li><strong>Integrated Document Segmentation</strong>: Unlike page-oriented systems, it processes whole documents to link information across pages.</li>
<li><strong>Robust &ldquo;Difficult Feature&rdquo; Handling</strong>: It implements specific heuristic rules to resolve ambiguities in crossing bonds (bridged structures), which are often misinterpreted as carbon atoms in other systems.</li>
<li><strong>Generic Structure Interpretation</strong>: It includes a module to parse &ldquo;generic&rdquo; (Markush) structures by matching R-group labels in the diagram with text-based definitions found in the document.</li>
<li><strong>Ambiguity Resolution</strong>: It uses context-aware rules to distinguish between geometrically similar features, such as vertical lines representing bonds vs. the letter &rsquo;l&rsquo; in &lsquo;Cl&rsquo;.</li>
</ul>
<h2 id="methodology-and-benchmarking-on-real-world-data">Methodology and Benchmarking on Real-World Data</h2>
<p>The authors conducted a systematic validation on a dataset of <strong>454 images</strong> containing <strong>519 structure diagrams</strong>.</p>
<ul>
<li><strong>Source Material</strong>: Images were extracted from published materials (journals, patents), ensuring &ldquo;real artifacts&rdquo; like noise and scanning distortions were present.</li>
<li><strong>Automation</strong>: The test was fully automated without human intervention.</li>
<li><strong>Metrics</strong>: The primary metric was the &ldquo;success rate,&rdquo; defined as the correct reconstruction of the molecule&rsquo;s connection table. They also performed fine-grained error analysis on specific features (e.g., atom labels, dashed bonds, wavy bonds).</li>
</ul>
<h2 id="results-high-topological-accuracy-and-persistent-ocr-challenges">Results: High Topological Accuracy and Persistent OCR Challenges</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved an <strong>89.79%</strong> retrieval rate (466/519 molecules correctly reconstructed).</li>
<li><strong>Robustness on Primitives</strong>: Solid straight bonds were recognized with 99.92% accuracy.</li>
<li><strong>Key Failure Modes</strong>: The majority of errors (58 cases) occurred in atom label construction, specifically when labels touched nearby bonds or other artifacts, causing OCR failures.</li>
<li><strong>Impact</strong>: The study demonstrated that handling &ldquo;difficult&rdquo; drawing features like crossing bonds and bridged structures significantly reduces topological errors. The authors released the test set to encourage standardized benchmarking in the OCSR field.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized a custom dataset designed to reflect real-world noise.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>CLiDE Pro Validation Set</td>
          <td>454 images (519 structures)</td>
          <td>Extracted from scanned journals and PDFs. Includes noise/artifacts. Available in Supporting Information.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The CLiDE Pro pipeline consists of five distinct phases. To replicate this system, one would need to implement:</p>
<ol>
<li>
<p><strong>Image Binarization</strong>:</p>
<ul>
<li>Input images are binarized using a threshold-based technique to separate foreground (molecule) from background.</li>
<li><strong>Connected Component Analysis (CCA)</strong>: A non-recursive scan identifies connected components (CCs) and generates interpixel contours (using N, S, E, W directions).</li>
</ul>
</li>
<li>
<p><strong>Document Segmentation</strong>:</p>
<ul>
<li><strong>Layout Analysis</strong>: Uses a bottom-up approach building a tree structure. It treats CCs as graph vertices and distances as edges.</li>
<li><strong>Clustering</strong>: A minimum-cost spanning tree (Kruskal&rsquo;s algorithm) groups CCs into words, lines, and blocks.</li>
<li><strong>Classification</strong>: CCs are classified (Character, Dash, Line, Graphics, Noise) based on size thresholds derived from statistical image analysis.</li>
</ul>
</li>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li><strong>Contour Approximation</strong>: Uses a method similar to <strong>Sklansky and Gonzalez</strong> to approximate contours into polygons.</li>
<li><strong>Vector Formation</strong>: Long polygon sides become straight lines; short consecutive sides become curves. Opposing borders of a line are matched to define the bond vector.</li>
<li><strong>Wavy Bonds</strong>: Detected by finding groups of short vectors lying on a straight line.</li>
<li><strong>Dashed Bonds</strong>: Detected using the <strong>Hough transform</strong> to find collinear or parallel dashes.</li>
</ul>
</li>
<li>
<p><strong>Atom Label Construction</strong>:</p>
<ul>
<li><strong>OCR</strong>: An OCR engine (filtering + topological analysis) interprets characters.</li>
<li><strong>Grouping</strong>: Characters are grouped into words based on horizontal and vertical proximity (for vertical labels).</li>
<li><strong>Superatom Lookup</strong>: Labels are matched against a database of elements, functional groups, and R-groups. Unknown linear formulas (e.g., $\text{CH}_2\text{CH}_2\text{OH}$) are parsed.</li>
</ul>
</li>
<li>
<p><strong>Graph Reconstruction</strong>:</p>
<ul>
<li><strong>Connection Logic</strong>: Bond endpoints are joined to atoms if they are within a distance threshold and &ldquo;point toward&rdquo; the label.</li>
<li><strong>Implicit Carbons</strong>: Unconnected bond ends are joined if close; parallel bonds merge into double/triple bonds.</li>
<li><strong>Crossing Bonds</strong>: Rules check proximity, length, and ring membership to determine if crossing lines are valid atoms or 3D visual artifacts.</li>
</ul>
</li>
<li>
<p><strong>Generic Structure Interpretation</strong>:</p>
<ul>
<li><strong>Text Mining</strong>: A lexical/syntactic analyzer extracts R-group definitions (e.g., &ldquo;R = Me or H&rdquo;) from text blocks.</li>
<li><strong>Matching</strong>: The system attempts to match R-group labels in the diagram with the parsed text definitions.</li>
</ul>
</li>
</ol>
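<p>The bottom-up clustering step (phase 2) can be sketched with Kruskal&rsquo;s algorithm over component centroids. The cut distance and the use of centroids are assumptions here; the paper clusters on inter-component distances without publishing exact parameters.</p>

```python
def cluster_components(centroids, cut_distance):
    """Group connected-component centroids by building Kruskal's minimum
    spanning tree while skipping edges longer than `cut_distance` — a sketch
    of CLiDE Pro's bottom-up layout clustering (parameter values assumed)."""
    n = len(centroids)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    edges = sorted(
        (((centroids[i][0] - centroids[j][0]) ** 2 +
          (centroids[i][1] - centroids[j][1]) ** 2) ** 0.5, i, j)
        for i in range(n) for j in range(i + 1, n))
    for dist, i, j in edges:
        if dist <= cut_distance:            # Kruskal step; long edges cut clusters
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two glyphs close together ("Cl"), one far-away bond fragment:
print(cluster_components([(0, 0), (1, 0), (10, 0)], cut_distance=2.0))  # [[0, 1], [2]]
```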
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Engine</strong>: The system relies on a customized OCR engine capable of handling rotation and chemical symbols, though the specific architecture (neural vs. feature-based) is not detailed beyond &ldquo;topological and geometrical feature analysis&rdquo;.</li>
<li><strong>Superatom Database</strong>: A lookup table containing elements, common functional groups, and R-group labels.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The evaluation focused on the topological correctness of the output.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total Success Rate</strong></td>
          <td>89.79%</td>
          <td>466/519 structures perfectly reconstructed.</td>
      </tr>
      <tr>
          <td><strong>Atom Label Accuracy</strong></td>
          <td>98.54%</td>
          <td>3923/3981 labels correct. Main error source: labels touching bonds.</td>
      </tr>
      <tr>
          <td><strong>Solid Bond Accuracy</strong></td>
          <td>99.92%</td>
          <td>16061/16074 solid bonds correct.</td>
      </tr>
      <tr>
          <td><strong>Dashed Bond Accuracy</strong></td>
          <td>98.37%</td>
          <td>303/308 dashed bonds correct.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; described as efficient.</li>
<li><strong>Performance</strong>: The system processed the complex Palytoxin structure &ldquo;within a few seconds&rdquo;. This implies low computational overhead suitable for standard desktop hardware of the 2009 era.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{valkoCLiDEProLatest2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Valko, Aniko T. and Johnson, A. Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{780--787}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800449t}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Recognition (Rule-Based)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</guid><description>A strictly rule-based expert system (MolRec) for converting raster chemical diagrams into graph representations.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). Chemical structure recognition: A rule based approach. <em>Proceedings of SPIE</em>, 8297, 82970E. <a href="https://doi.org/10.1117/12.912185">https://doi.org/10.1117/12.912185</a></p>
<p><strong>Publication</strong>: IS&amp;T/SPIE Electronic Imaging 2012</p>
<h2 id="methodological-contribution">Methodological Contribution</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper proposes a novel mechanism (MolRec) for Optical Chemical Structure Recognition (OCSR). It focuses on defining a &ldquo;strictly rule based system&rdquo; to transform vectorised molecule images into graph representations, contrasting this declarative approach with procedural or heuristic-heavy methods. The contribution is validated through direct comparison with the leading open-source tool (OSRA).</p>
<h2 id="motivation-overcoming-procedural-heuristics">Motivation: Overcoming Procedural Heuristics</h2>
<p>Chemical literature contains vast amounts of information locked in 2D diagrams. This visual data is generally inaccessible to search tools or electronic processing. While commercial and academic tools existed (e.g., OSRA, Kekulé), they typically relied on procedural heuristics that required experimental tuning and were difficult to extend. The authors sought to create a system based on precise, declarative rewrite rules to handle the ambiguity inherent in chemical drawing conventions.</p>
<h2 id="core-innovation-geometric-rewrite-rules">Core Innovation: Geometric Rewrite Rules</h2>
<p>The core novelty is the <strong>geometric rewrite rule system</strong> (MolRec).</p>
<ul>
<li><strong>Geometric Primitives</strong>: The system operates on five high-level primitives: Line Segment, Arrow, Circle, Triangle, and Character Group.</li>
<li><strong>Fuzzy Parameters</strong>: It introduces formal definitions for &ldquo;fuzzy&rdquo; relationships (e.g., <code>dash-neighbouring</code>, <code>approximate collinearity</code>) to handle drawing irregularities and scanning artifacts.</li>
<li><strong>Ambiguity Resolution</strong>: Specific rules (R4-R6) are designed to disambiguate visual homoglyphs, such as distinguishing a &ldquo;triple bond&rdquo; from a &ldquo;dashed bold bond&rdquo; based on context (connected atoms).</li>
<li><strong>Explicit &ldquo;Cutting&rdquo;</strong>: A mechanism to identify implicit carbon nodes within continuous line segments (e.g., splitting a long line intersected by parallel lines into a double bond).</li>
</ul>
<h2 id="experimental-setup-vs-baselines">Experimental Setup vs. Baselines</h2>
<p>The authors compared their system (MolRec) against <strong>OSRA</strong> (the leading open-source system) on two datasets:</p>
<ol>
<li><strong>OSRA Benchmark</strong>: 5,735 computer-generated diagrams with ground truth MOL files.</li>
<li><strong>Maybridge Dataset</strong>: 5,730 scanned images (300dpi) from a drug catalogue, converted to ground truth MOL files via InChI lookups.</li>
</ol>
<p>Evaluation was semantic: the output MOL files were compared using OpenBabel to check for structural equivalence, ignoring syntactic file differences.</p>
<h2 id="results-and-key-findings">Results and Key Findings</h2>
<p><strong>MolRec outperformed OSRA</strong> on both datasets:</p>
<ul>
<li><strong>OSRA Benchmark</strong>: MolRec achieved <strong>88.46%</strong> accuracy vs. OSRA&rsquo;s 77.23%.</li>
<li><strong>Maybridge Dataset</strong>: MolRec achieved <strong>83.84%</strong> accuracy vs. OSRA&rsquo;s 72.57%.</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>Robustness</strong>: The line thinning + Douglas-Peucker vectorization approach was found to be more robust than Hough transform approaches used by other tools.</li>
<li><strong>Failure Modes</strong>: Major remaining errors were caused by &ldquo;touching components&rdquo; (ligatures, characters touching bonds) and complex &ldquo;superatoms&rdquo; (abbreviations like &ldquo;-Ph&rdquo; or &ldquo;-COOH&rdquo;) with ambiguous connection points.</li>
<li><strong>Triangle Detection</strong>: The &ldquo;expanding disc&rdquo; method for identifying wedge bonds was highly effective.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>Two distinct datasets were used for validation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA Benchmark</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">5,735</td>
          <td style="text-align: left">Computer-generated diagrams provided by the OSRA project.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">5,730</td>
          <td style="text-align: left">Scanned at 300dpi from the Maybridge drug catalogue. Ground truth generated via CAS Registry Number $\to$ <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> $\to$ OpenBabel.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline consists of three stages: <strong>Vectorization</strong>, <strong>Geometric Processing</strong>, and <strong>Rule Application</strong>.</p>
<p><strong>1. Vectorization &amp; Primitives</strong></p>
<ul>
<li><strong>Binarization &amp; OCR</strong>: Connected components are labelled and passed to an OCR engine to extract &ldquo;Character Groups&rdquo;.</li>
<li><strong>Thinning</strong>: Image is thinned to unit width.</li>
<li><strong>Simplification</strong>: Douglas-Peucker algorithm converts pixel paths into straight <strong>Line Segments</strong>.</li>
<li><strong>Triangle Detection</strong>: A disc growing algorithm walks inside black regions to identify <strong>Triangles</strong> (wedges). If the disc cannot grow, it is a thick line (Bold Bond).</li>
</ul>
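<p>The Douglas-Peucker simplification used in the vectorization stage is the standard recursive algorithm; a compact sketch (the epsilon tolerance value is an assumption):</p>

```python
def douglas_peucker(points, epsilon):
    """Classic Douglas-Peucker: keep the endpoints, recurse on the point
    farthest from the chord if it deviates by more than epsilon."""
    def point_line_dist(p, a, b):
        (px, py), (ax, ay), (bx, by) = p, a, b
        dx, dy = bx - ax, by - ay
        if dx == dy == 0:
            return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
        # perpendicular distance from p to the infinite line through a, b
        return abs(dy * px - dx * py + bx * ay - by * ax) / (dx * dx + dy * dy) ** 0.5

    if len(points) < 3:
        return list(points)
    dmax, index = max(
        (point_line_dist(points[i], points[0], points[-1]), i)
        for i in range(1, len(points) - 1))
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right

# A noisy thinned pixel path collapses to a single line segment:
print(douglas_peucker([(0, 0), (1, 0.1), (2, -0.1), (3, 0)], epsilon=0.5))  # [(0, 0), (3, 0)]
```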
<p><strong>2. Fuzzy Parameters</strong></p>
<p>The rules rely on tolerating drawing imperfections using defined parameters:</p>
<ul>
<li>$r_e$: Radius of collinearity (strict).</li>
<li>$d_l$ / $d_s$: Dash length / Dash separation (fuzzy).</li>
<li>$bdl$ / $bdw$: Bold dash length / width (fuzzy).</li>
<li>$bs$: Bond separation (max distance between parallel bonds).</li>
<li>$ol$: Minimal overlap.</li>
</ul>
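<p>A sketch of one such fuzzy predicate, approximate collinearity, under an assumed radius $r_e$ (the paper defines the predicates formally but the parameter value below is illustrative):</p>

```python
from math import hypot

def approximately_collinear(seg1, seg2, r_e=2.0):
    """MolRec-style fuzzy test: seg2 is approximately collinear with seg1
    if each of its endpoints lies within radius r_e of the infinite line
    through seg1. The r_e value here is an assumption, not the paper's."""
    (ax, ay), (bx, by) = seg1
    dx, dy = bx - ax, by - ay
    length = hypot(dx, dy)
    for (px, py) in seg2:
        dist = abs(dy * px - dx * py + bx * ay - by * ax) / length
        if dist > r_e:
            return False
    return True

dash1 = ((0, 0), (3, 0))
dash2 = ((5, 0.4), (8, 0.6))   # slightly offset dash of the same dashed bond
print(approximately_collinear(dash1, dash2))           # True
print(approximately_collinear(dash1, ((5, 4), (8, 4))))  # False
```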
<p><strong>3. The Rule System (R1-R18)</strong></p>
<p>The core logic uses 18 mutual-exclusion rules to rewrite geometric primitives into chemical graph edges.</p>
<ul>
<li><strong>Planar Bonds</strong>:
<ul>
<li><strong>R1-R3 (Single/Double/Triple)</strong>: Identifies parallel lines based on <code>bs</code> and <code>ol</code>. Uses &ldquo;cutting&rdquo; to split lines at implicit nodes.</li>
</ul>
</li>
<li><strong>Ambiguity Resolution (Stereo vs. Planar)</strong>:
<ul>
<li><strong>R4 (Dashed Bold vs. Triple)</strong>: Checks context. If purely geometric measures match both, it defaults to Triple unless specific dash constraints are met.</li>
<li><strong>R5 (Dashed Wedge vs. Triple)</strong>: Similar disambiguation based on length monotonicity.</li>
<li><strong>R6 (Dashed Wedge vs. Double)</strong>: Differentiates based on line length differences ($l_1 &gt; l_2$).</li>
</ul>
</li>
<li><strong>Stereo Bonds</strong>:
<ul>
<li><strong>R7-R9 (Dashed Types)</strong>: Identifies collinear segments with specific neighbor patterns (1 neighbor for ends, 2 for internal).</li>
<li><strong>R10-R11 (Hollow Wedge)</strong>: Detects triangles formed by 3 or 4 lines.</li>
<li><strong>R14 (Solid Wedge)</strong>: Direct mapping from Triangle primitive.</li>
</ul>
</li>
<li><strong>Special Structures</strong>:
<ul>
<li><strong>R12 (Wavy Bond)</strong>: Zig-zag line segments.</li>
<li><strong>R13 (Arrow)</strong>: Dative bond.</li>
<li><strong>R16 (Aromatic Ring)</strong>: Circle inside a cycle of &gt;5 lines.</li>
<li><strong>R17-R18 (Bridge Bonds)</strong>: Handles 2.5D crossing bonds (open or closed gaps).</li>
</ul>
</li>
</ul>
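<p>A minimal sketch of how an R2-style parallelism test could combine <code>bs</code> and <code>ol</code>; the thresholds and the exact predicate here are illustrative assumptions, not the paper's rules:</p>

```python
import math

def _angle(seg):
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi  # undirected angle in [0, pi)

def is_double_bond(s1, s2, bs=8.0, ol=0.5, angle_tol=0.1):
    """Hypothetical R2-style test: s2 pairs with s1 as a double bond if
    the segments are near-parallel, separated by less than bs pixels,
    and s2 overlaps at least fraction ol of s1's length."""
    da = abs(_angle(s1) - _angle(s2))
    if min(da, math.pi - da) > angle_tol:
        return False  # not parallel
    (x1, y1), (x2, y2) = s1
    ux, uy = x2 - x1, y2 - y1
    length = math.hypot(ux, uy)
    ux, uy = ux / length, uy / length
    # Perpendicular separation of s2's midpoint from the line through s1.
    mx = (s2[0][0] + s2[1][0]) / 2 - x1
    my = (s2[0][1] + s2[1][1]) / 2 - y1
    sep = abs(-uy * mx + ux * my)
    if sep == 0 or sep > bs:
        return False  # collinear continuation, or too far apart
    # Overlap of s2's projection onto s1, as a fraction of s1's length.
    t1 = (s2[0][0] - x1) * ux + (s2[0][1] - y1) * uy
    t2 = (s2[1][0] - x1) * ux + (s2[1][1] - y1) * uy
    lo, hi = sorted((t1, t2))
    return (min(hi, length) - max(lo, 0.0)) / length >= ol
```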
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Semantic graph matching. The output MOL file is compared to the ground truth MOL file using OpenBabel. Success = correct graph isomorphism.</p>
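<p>The paper performs this comparison with OpenBabel. As a self-contained illustration, a brute-force isomorphism check over labelled molecular graphs (practical only for small molecules) might look like:</p>

```python
from itertools import permutations

def graphs_match(atoms1, bonds1, atoms2, bonds2):
    """Brute-force labelled-graph isomorphism for small molecules -- a
    toy stand-in for the paper's OpenBabel comparison. atoms: element
    symbols by index; bonds: set of (i, j, order) tuples."""
    if sorted(atoms1) != sorted(atoms2) or len(bonds1) != len(bonds2):
        return False
    n = len(atoms1)
    target = {(min(i, j), max(i, j), o) for i, j, o in bonds2}
    for perm in permutations(range(n)):
        if any(atoms1[i] != atoms2[perm[i]] for i in range(n)):
            continue  # element labels must be preserved
        mapped = {(min(perm[i], perm[j]), max(perm[i], perm[j]), o)
                  for i, j, o in bonds1}
        if mapped == target:
            return True
    return False
```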
<p><strong>Results Table</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">System</th>
          <th style="text-align: left">Success Rate</th>
          <th style="text-align: left">Fail Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>88.46%</strong></td>
          <td style="text-align: left">11.54%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">77.23%</td>
          <td style="text-align: left">22.77%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>83.84%</strong></td>
          <td style="text-align: left">16.16%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">72.57%</td>
          <td style="text-align: left">27.43%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Requirements not specified, but the approach (vectorization + rule matching) is computationally lightweight compared to modern deep learning methods.</li>
</ul>
]]></content:encoded></item><item><title>Reconstruction of Chemical Molecules from Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</guid><description>A 5-module system converting raster images of chemical structures into machine-readable SDF files with custom vectorization.</description><content:encoded><![CDATA[<h2 id="methodological-basis">Methodological Basis</h2>
<p>This paper is a clear methodological contribution describing a novel system architecture. It proposes a five-stage pipeline to solve a specific engineering problem: converting rasterized chemical images into structured chemical files (SDF). The authors validate the method by benchmarking it against a commercial product (CLIDE) and analyzing performance across multiple databases.</p>
<h2 id="the-inaccessibility-of-raster-chemical-images">The Inaccessibility of Raster Chemical Images</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A massive amount of chemical knowledge (scientific articles, patents) exists only as raster images, rendering it inaccessible to computational analysis.</li>
<li><strong>Inefficiency of Manual Entry</strong>: Manual replication of molecules into CAD programs is the standard but unscalable solution for extracting this information.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous academic and commercial attempts (early 90s systems like CLIDE) had faded or remained limited in robustness, leaving the problem &ldquo;wide open&rdquo;.</li>
</ul>
<h2 id="topology-preserving-chemical-vectorization">Topology-Preserving Chemical Vectorization</h2>
<p>The core novelty is the <strong>topology-preserving vectorization</strong> strategy designed specifically for chemical graphs.</p>
<ul>
<li><strong>Graph-Centric Vectorizer</strong>: This system prioritizes graph characteristics over the pixel precision of traditional CAD vectorizers, ensuring one line in the image becomes exactly one vector, regardless of line width or vertex thickness.</li>
<li><strong>Chemical Knowledge Module</strong>: The inclusion of a final validation step that applies chemical rules (valence, charge) to detect and potentially correct reconstruction errors.</li>
<li><strong>Hybrid Recognition</strong>: The separation of the pipeline into a &ldquo;Body&rdquo; path (vectorizer for bonds) and an &ldquo;OCR&rdquo; path (SVM for atomic symbols), which are re-integrated in a reconstruction phase.</li>
</ul>
<h2 id="validating-reconstruction-accuracy">Validating Reconstruction Accuracy</h2>
<p>The authors performed a quantitative validation using <strong>ground-truth SDF files</strong> to verify reconstruction accuracy. The success rate metric evaluated whether the reconstructed graph perfectly matched the true SDF:</p>
<p>$$ \text{Accuracy} = \frac{\text{Correctly Reconstructed SDFs}}{\text{Total Images Evaluated}} $$</p>
<ul>
<li><strong>Baselines</strong>: The system was benchmarked against the commercial software <strong>CLIDE</strong> on &ldquo;Database 1&rdquo;.</li>
<li><strong>Datasets</strong>: Three distinct databases were used:
<ul>
<li><strong>Database 1</strong>: 100 images (varied fonts/line widths).</li>
<li><strong>Database 2</strong>: 100 images.</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale test).</li>
</ul>
</li>
</ul>
<h2 id="system-performance-and-scalability">System Performance and Scalability</h2>
<ul>
<li><strong>Superior Performance</strong>: On Database 1, the proposed system correctly reconstructed <strong>97%</strong> of images, whereas the commercial CLIDE system only reconstructed <strong>25%</strong> (after parameter tuning).</li>
<li><strong>Scalability</strong>: The system maintained reasonable performance on the large dataset (Database 3), achieving <strong>67%</strong> accuracy.</li>
<li><strong>Robustness</strong>: The system can handle varying fonts and line widths via parameterization.</li>
<li><strong>Future Work</strong>: The authors plan to implement a feedback loop where the Chemical Knowledge Module can send error signals back to earlier modules to correct inconsistencies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Reproducible (Paywalled paper, no public code or data).</p>
<h3 id="data">Data</h3>
<p>The paper utilizes three databases for validation. The authors note that for these images, the correct SDF files were already available, allowing for direct automated checking.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Varied line widths, fonts, symbols; used for CLIDE comparison.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>General chemical database.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale database.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The system is composed of five distinct modules executed in sequence:</p>
<p><strong>1. Binarization &amp; Segmentation</strong></p>
<ul>
<li><strong>Preprocessing</strong>: Removal of anti-aliasing effects followed by <strong>adaptive histogram binarization</strong>.</li>
<li><strong>Connected Components</strong>: A non-recursive raster-scan algorithm identifies connected Run-Length Encoded (RLE) segments.</li>
</ul>
<p><strong>2. Optical Character Recognition (OCR)</strong></p>
<ul>
<li><strong>Feature Extraction</strong>: Uses functions similar to <strong>Zernike moments</strong> and a <strong>wavelet transform strategy</strong>.</li>
<li><strong>Classification</strong>: Identifies isolated characters/symbols and separates them from the molecular &ldquo;body&rdquo;.</li>
</ul>
<p><strong>3. Vectorizer</strong></p>
<ul>
<li><strong>Logic</strong>: Assigns local directions to RLE segments based on neighbors, then groups segments with similar local direction patterns.</li>
<li><strong>Constraint</strong>: Enforces a 1-to-1 mapping between visual lines and graph vectors to prevent spurious small vectors at thick joints.</li>
</ul>
<p><strong>4. Reconstruction (Heuristics)</strong></p>
<p>This module annotates vectors with chemical significance:</p>
<ul>
<li><strong>Chiral Bonds (Wedges)</strong>: Identified by registering vectors against original pixel density. If a vector corresponds to a thick geometric form (triangle/rectangle), it is labeled chiral.</li>
<li><strong>Dotted Chiral Bonds</strong>: Identified by clustering isolated vectors (no neighbors) using <strong>quadtree clustering</strong> on geometric centers. Coherent parallel clusters are fused into a single bond.</li>
<li><strong>Double/Triple Bonds</strong>: Detected by checking for parallel vectors within a <strong>Region of Interest (ROI)</strong> defined as the vector&rsquo;s bounding box <strong>dilated by a factor of 2</strong>.</li>
<li><strong>Superatoms</strong>: OCR results are clustered by dilating bounding boxes; overlapping boxes are grouped into names (e.g., &ldquo;COOH&rdquo;).</li>
</ul>
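<p>The superatom grouping step can be sketched as box dilation plus union-find merging; the dilation factor below is an assumed placeholder, since the paper does not specify its value:</p>

```python
def dilate(box, factor):
    """Grow a bounding box about its center by the given factor."""
    xmin, ymin, xmax, ymax = box
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    hw, hh = (xmax - xmin) / 2 * factor, (ymax - ymin) / 2 * factor
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def overlap(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def group_characters(boxes, factor=1.5):
    """Merge OCR character boxes into superatom groups: dilate each
    box, then union boxes whose dilations overlap (union-find)."""
    big = [dilate(b, factor) for b in boxes]
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlap(big[i], big[j]):
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```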
<p><strong>5. Chemical Knowledge</strong></p>
<p>Validates the generated graph against rules for valences and charges. If valid, an SDF file is generated.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Support Vector Machine)</strong>: Used within the OCR module to classify connected components as characters or symbols. It is trained to be tolerant to rotation and font variations.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is binary success rate per molecule (perfect reconstruction of the SDF).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (DB1)</th>
          <th>Value (DB3)</th>
          <th>Baseline (CLIDE on DB1)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Reconstruction</td>
          <td><strong>97%</strong></td>
          <td>67%</td>
          <td>25%</td>
          <td>CLIDE required significant parameter tuning to reach 25%.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., Friedrich, C. M., Akle, S., &amp; Hofmann-Apitius, M. (2007). Reconstruction of Chemical Molecules from Images. <em>Proceedings of the 29th Annual International Conference of the IEEE EMBS</em>, 4609-4612. <a href="https://doi.org/10.1109/IEMBS.2007.4353366">https://doi.org/10.1109/IEMBS.2007.4353366</a></p>
<p><strong>Publication venue</strong>: IEEE EMBS 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriReconstructionChemicalMolecules2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Reconstruction of {{Chemical Molecules}} from {{Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 29th Annual International Conference of the IEEE EMBS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and Friedrich, Christoph M. and Akle, Santiago and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4609--4612}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IEMBS.2007.4353366}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA: Open Source Optical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/osra/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/osra/</guid><description>The first open-source optical structure recognition (OSR) utility for converting chemical images into SMILES/SD formats.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., &amp; Nicklaus, M. C. (2009). Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. <em>Journal of Chemical Information and Modeling</em>, 49(3), 740-743. <a href="https://doi.org/10.1021/ci800067r">https://doi.org/10.1021/ci800067r</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Model. 2009</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://sourceforge.net/projects/osra/">SourceForge Project</a></li>
<li><a href="http://cactus.nci.nih.gov/osra">Web Interface (Historical)</a></li>
</ul>
<h2 id="overview-and-motivation">Overview and Motivation</h2>
<p><strong>Resource</strong></p>
<p>This paper is a quintessential <strong>Infrastructure</strong> contribution ($\Psi_{\text{Resource}}$). While it contains significant algorithmic detail, the rhetorical structure and primary goal place it squarely as an infrastructure paper. The dominant contribution is the creation, release, and documentation of a software tool (OSRA).</p>
<p>A vast amount of chemical knowledge is locked in scientific and patent documents as graphical images (Kekulé structures). This is the classic chemical informatics challenge: decades of chemical knowledge are trapped in visual form.</p>
<ul>
<li><strong>Legacy Data Gap</strong>: Historical literature does not use computer-parsable formats, making automated processing of millions of documents impossible without optical recognition. Scientific papers and patents have historically depicted molecules as 2D structural diagrams.</li>
<li><strong>Need for Automation</strong>: Manual transcription is not scalable for the hundreds of thousands of documents available. While modern standards like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> and CML exist, the vast majority of chemical literature remains inaccessible for computational analysis.</li>
<li><strong>Open Source Gap</strong>: Before OSRA, only commercial software like CLiDE existed for this task, creating a barrier for academic researchers and limiting reproducibility; no universal or open-source solution was available prior to this work.</li>
</ul>
<h2 id="core-innovations-and-pipeline">Core Innovations and Pipeline</h2>
<p>OSRA is claimed to be the <strong>first open-source optical structure recognition (OSR) program</strong>. The novelty lies in creating an accessible OCSR system with a practical, multi-stage pipeline that combines classical image processing techniques with chemical knowledge.</p>
<p><strong>Key contributions:</strong></p>
<ol>
<li>
<p><strong>Integrated Pipeline</strong>: It uniquely combines existing open-source image processing tools (ImageMagick for formats, Potrace for vectorization, GOCR/OCRAD for text) into a chemical recognition workflow. The value is in the assembly and integration.</p>
</li>
<li>
<p><strong>Vectorization-Based Approach</strong>: OSRA uses the Potrace library to convert bitmap images into vector graphics (Bezier curves), then analyzes the geometry of these curves to identify bonds and atoms. This is more robust than angle-based detection methods because it leverages continuous mathematical properties of curves.</p>
</li>
<li>
<p><strong>Multi-Resolution Processing with Confidence Estimation</strong>: The system automatically processes each image at three different resolutions (72, 150, and 300 dpi), generating up to three candidate structures. A confidence function fit by linear regression on chemical features (heteroatom count, ring patterns, fragment count) selects the most chemically sensible result.</p>
</li>
<li>
<p><strong>Resolution Independence</strong>: Unlike some predecessors, it is designed to handle over 90 image formats and works independently of specific resolutions or fonts.</p>
</li>
<li>
<p><strong>Comprehensive Chemical Rules</strong>: OSRA implements sophisticated heuristics for chemical structure interpretation:</p>
<ul>
<li>Distinguishes bridge bond crossings from tetrahedral carbon centers using graph connectivity rules</li>
<li>Recognizes stereochemistry from wedge bonds (detected via line thickness gradients)</li>
<li>Handles old-style aromatic notation (circles inside rings)</li>
<li>Expands common chemical abbreviations (superatoms like &ldquo;COOH&rdquo; or &ldquo;CF₃&rdquo;)</li>
<li>Uses the 75th percentile of bond lengths as the reference length to avoid outlier bias</li>
</ul>
</li>
</ol>
<h2 id="methodology-and-validation">Methodology and Validation</h2>
<p>The authors validated OSRA against both commercial software and manual curation:</p>
<ol>
<li>
<p><strong>Commercial Comparison</strong>: They compared OSRA against CLiDE (a commercial OSR tool) using a &ldquo;small test set&rdquo; of 11 files provided by Simbiosys containing 42 structures. Performance was measured both as exact match accuracy and as Tanimoto similarity using molecular fingerprints.</p>
</li>
<li>
<p><strong>Internal Validation</strong>: They tested on an internal set of 66 images containing 215 structures, covering various resolutions, color depths, and drawing styles to assess performance at scale and characterize typical error patterns.</p>
</li>
<li>
<p><strong>Metric Definition</strong>: They defined recognition success using both exact matches (&ldquo;Perfect by InChI&rdquo;) and Tanimoto similarity (using CACTVS fingerprints). The authors explicitly argued for using Tanimoto similarity as the primary evaluation metric, reasoning that partial recognition (e.g., missing a methyl group) still provides useful chemical information, which binary &ldquo;correct/incorrect&rdquo; judgments fail to capture.</p>
</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Competitive Accuracy</strong>: On the small comparative set, OSRA recognized 26 structures perfectly (by InChI) versus CLiDE&rsquo;s 11, demonstrating that an open-source, rule-based approach could outperform established commercial systems.</p>
</li>
<li>
<p><strong>Robustness</strong>: On the internal diverse set (215 structures), OSRA achieved a 93% average Tanimoto similarity and perfectly recognized 107 structures (50%). Tanimoto similarity above 85% was achieved for 182 structures (85%). This established OSRA as a competitive tool for practical use.</p>
</li>
<li>
<p><strong>Multi-Resolution Success</strong>: The multi-resolution strategy allowed OSRA to handle images with varying quality and formats. The confidence function (with correlation coefficient $r=0.89$) successfully identified which resolution produced the most chemically plausible structure.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge issues with:</p>
<ul>
<li>&ldquo;Imperfect segmentation&rdquo; leading to missed structures (3 missed in internal set) and false positives (7 in internal set)</li>
<li>Novel drawing conventions not covered by the implemented heuristics</li>
<li>Highly degraded or noisy images where vectorization fails</li>
<li>Hand-drawn structures that deviate significantly from standard chemical drawing practices</li>
<li>Complex reaction schemes with multiple molecules and arrows</li>
</ul>
</li>
<li>
<p><strong>Open-Source Impact</strong>: By releasing OSRA as open-source software, the authors enabled widespread adoption and community contribution. This established a foundation for future OCSR research and made the technology accessible to researchers without commercial software budgets.</p>
</li>
</ul>
<p>The work established that rule-based OCSR systems could achieve competitive performance when carefully engineered with chemical knowledge. OSRA became a standard baseline for the field and remained the dominant open-source solution until the rise of deep learning methods over a decade later. The vectorization-based approach and the emphasis on Tanimoto similarity as an evaluation metric influenced subsequent work in the area.</p>
<h2 id="technical-details">Technical Details</h2>
<p><strong>Grayscale Conversion</strong></p>
<p>OSRA uses a non-standard grayscale conversion to preserve light-colored atoms (e.g., yellow sulfur):</p>
<p>$$\text{Gray} = \min(R, G, B)$$</p>
<p>This prevents light colors from being washed out during binarization, unlike the standard weighted formula ($0.3R + 0.59G + 0.11B$).</p>
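<p>A two-line comparison shows why the min-channel rule keeps yellow atoms visible (the RGB value for pure yellow is illustrative):</p>

```python
def gray_min(r, g, b):
    """OSRA's conversion: a pixel is only as light as its darkest channel."""
    return min(r, g, b)

def gray_weighted(r, g, b):
    """Standard luminance-style conversion, for comparison."""
    return 0.3 * r + 0.59 * g + 0.11 * b

# Pure yellow (255, 255, 0): the weighted formula yields ~227
# (near-white, lost at binarization); min-channel yields 0 (kept as ink).
```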
<p><strong>Image Segmentation</strong></p>
<p>Chemical structures are identified within a page using specific bounding box criteria:</p>
<ul>
<li><strong>Black pixel density</strong>: Must be between 0.0 and 0.2</li>
<li><strong>Aspect ratio</strong>: Height-to-width ratio must be between 0.2 and 5.0</li>
<li><strong>Minimum size</strong>: Width and height must be &gt;50 pixels at resolutions &gt;150 dpi</li>
</ul>
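<p>Put together, the bounding-box criteria amount to a short predicate. Treating the density bound as exclusive and applying the size rule only above 150 dpi is one interpretation of the listed criteria:</p>

```python
def is_structure_candidate(bbox, black_pixels, dpi):
    """Filter page regions by ink density, aspect ratio, and
    minimum size at higher resolutions (sketch of OSRA's criteria)."""
    xmin, ymin, xmax, ymax = bbox
    w, h = xmax - xmin, ymax - ymin
    if w <= 0 or h <= 0:
        return False
    density = black_pixels / (w * h)
    if not (0.0 < density < 0.2):
        return False  # too sparse or too inky to be a structure
    if not (0.2 <= h / w <= 5.0):
        return False  # implausible aspect ratio
    if dpi > 150 and (w <= 50 or h <= 50):
        return False  # too small at this resolution
    return True
```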
<p><strong>Noise Detection and Smoothing</strong></p>
<p>A &ldquo;noise factor&rdquo; determines whether anisotropic smoothing is applied:</p>
<p>$$\text{Noise Factor} = \frac{\text{Count of 2-pixel line segments}}{\text{Count of 3-pixel line segments}}$$</p>
<p>Smoothing is applied only if this ratio is between 0.5 and 1.0.</p>
<p><strong>Atom Detection from Bezier Curves</strong></p>
<p>Potrace Bezier control points are flagged as potential atoms if:</p>
<ol>
<li>The point is classified as a &ldquo;corner&rdquo; by Potrace</li>
<li>The vector direction change has a <strong>normal component</strong> of at least 2 pixels</li>
</ol>
<p>The normal component criterion is more robust than angle-based detection because angles are difficult to measure accurately in pixelated environments where line thickness is non-zero.</p>
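<p>The normal-component test reduces to a small vector computation; the function below is a sketch, with direction vectors in pixels and the 2-pixel threshold taken from the text:</p>

```python
import math

def is_atom_corner(incoming, outgoing, min_normal=2.0):
    """Flag a Potrace corner as a potential atom when the direction
    change has a normal component of at least min_normal pixels."""
    ix, iy = incoming
    length = math.hypot(ix, iy)
    ux, uy = ix / length, iy / length   # unit direction of incoming vector
    ox, oy = outgoing
    normal = abs(-uy * ox + ux * oy)    # outgoing component perpendicular to incoming
    return normal >= min_normal
```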
<p><strong>Bond Length Estimation</strong></p>
<p>The reference bond length is computed as the <strong>75th percentile</strong> of all detected bond lengths. This avoids bias from outlier bonds (e.g., extremely short or long bonds from recognition errors).</p>
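<p>A nearest-rank sketch of the percentile rule (the interpolation scheme is an assumption; the paper does not specify one):</p>

```python
import math

def reference_bond_length(lengths):
    """75th percentile (nearest rank) of detected bond lengths, used
    as the reference length so outliers carry no weight."""
    s = sorted(lengths)
    k = max(0, math.ceil(0.75 * len(s)) - 1)
    return s[k]
```

<p>For example, a spurious 100-pixel segment among bonds of ~10-12 pixels does not shift the reference length.</p>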
<p><strong>Confidence Function</strong></p>
<p>A linear regression function selects the best result from the multi-scale processing:</p>
<p>$$\text{confidence} = 0.316 - 0.016N_C + 0.034N_N + 0.067N_O + \ldots + 0.330N_{\text{rings5}} + \ldots$$</p>
<p>where $N_C$, $N_N$, $N_O$ represent counts of carbon, nitrogen, and oxygen atoms, respectively. The function prioritizes structures with more recognized heteroatoms and rings while penalizing fragment counts; additional terms account for ring patterns and fragment counts.</p>
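<p>Selection among the three resolution candidates then reduces to scoring feature dictionaries. The weights below are the few coefficients quoted above; the rest of the published weight table is omitted:</p>

```python
WEIGHTS = {"C": -0.016, "N": 0.034, "O": 0.067, "rings5": 0.330}
BIAS = 0.316

def confidence(features):
    """Linear confidence score over atom/ring counts (partial weight set)."""
    return BIAS + sum(w * features.get(k, 0) for k, w in WEIGHTS.items())

def pick_best(candidates):
    """candidates: {dpi: feature dict}; return the dpi whose candidate
    structure scores highest."""
    return max(candidates, key=lambda dpi: confidence(candidates[dpi]))
```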
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Comparison</td>
          <td>&ldquo;Small test set&rdquo; (Simbiosys)</td>
          <td>11 files (42 structures)</td>
          <td>Used to compare vs. CLiDE</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Internal Test Set</td>
          <td>66 images (215 structures)</td>
          <td>Various resolutions, color depths, styles</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used to define &ldquo;Success&rdquo;:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Perfect by InChI</strong></td>
          <td>Exact match of the InChI string to the human-curated structure.</td>
      </tr>
      <tr>
          <td><strong>Average Tanimoto</strong></td>
          <td>Tanimoto similarity (CACTVS fingerprints) between OSRA output and ground truth.</td>
      </tr>
      <tr>
          <td><strong>uuuuu</strong></td>
          <td>NCI CADD identifier match (topology only; ignores stereochem/charge/tautomers).</td>
      </tr>
  </tbody>
</table>
<p><strong>Results Table (Comparison)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Perfect (InChI)</th>
          <th>T &gt; 85%</th>
          <th>uuuuu Match</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>26 / 42</td>
          <td>39 / 42</td>
          <td>28 / 42</td>
      </tr>
      <tr>
          <td><strong>CLiDE</strong></td>
          <td>11 / 42</td>
          <td>26 / 42</td>
          <td>12 / 42</td>
      </tr>
  </tbody>
</table>
<h3 id="softwaredependencies">Software/Dependencies</h3>
<p>The system relies on external libraries:</p>
<ul>
<li><strong>ImageMagick</strong>: Image format parsing (supports 90+ formats)</li>
<li><strong>Ghostscript</strong>: PDF/PS interpretation</li>
<li><strong>Potrace</strong>: Vectorization (converts bitmap to Bezier curves)</li>
<li><strong>GOCR / OCRAD</strong>: Optical Character Recognition (heteroatom label recognition)</li>
<li><strong>OpenBabel / RDKit</strong>: Chemical backends for connection table compilation</li>
<li><strong>Output Formats</strong>: SMILES strings and SD files</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{filippovOpticalStructureRecognition2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Software To Recover Chemical Information}}: {{OSRA}}, {{An Open Source Solution}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Software To Recover Chemical Information}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V. and Nicklaus, Marc C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = mar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{740--743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800067r}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
]]></content:encoded></item><item><title>Optical Recognition of Chemical Graphics</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</guid><description>A 1993 prototype system for converting scanned chemical diagrams into connection tables using vectorization and heuristic-based structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-early-ocsr-pipeline-methodology">Contribution: Early OCSR Pipeline Methodology</h2>
<p><strong>Method</strong>. This paper proposes a novel architectural pipeline for the automatic recognition of chemical structure diagrams. It defines a specific sequence of algorithmic steps, including diagram separation, vectorization, segmentation, and structural analysis, which converts pixel data into a semantic chemical representation (MDL Molfile).</p>
<h2 id="motivation-digitizing-legacy-chemical-data">Motivation: Digitizing Legacy Chemical Data</h2>
<p><strong>Problem</strong>: In 1993, vast databases of chemical information existed, but the entry of graphical data was significantly less advanced than the facilities for manipulating it.</p>
<p><strong>Gap</strong>: Creating digital chemical structures required trained operators to manually redraw diagrams that already existed in printed journals and catalogs, leading to a costly duplication of effort.</p>
<p><strong>Goal</strong>: To automate the creation of coded representations (connection tables) directly from optically scanned diagrams on printed pages.</p>
<h2 id="novelty-general-document-analysis-integrated-with-chemical-rules">Novelty: General Document Analysis Integrated with Chemical Rules</h2>
<p><strong>Pipeline Approach</strong>: The authors present a complete end-to-end system that integrates general document analysis with domain-specific chemical rules.</p>
<p><strong>Convex Bounding Separation</strong>: A novel use of &ldquo;bounding polygons&rdquo; defined by 8 fixed-direction bands to distinguish diagram components from text with linear computational cost.</p>
<p><strong>Vector-Based Segmentation</strong>: The system uses the output of a vectorizer (GIFTS) to classify diagram elements. It relies on the observation that vectorizers approximate characters with sets of short vectors to distinguish them from bonds.</p>
<h2 id="methodology-and-system-evaluation">Methodology and System Evaluation</h2>
<p><strong>System Implementation</strong>: The algorithm was implemented in &lsquo;C&rsquo; on IBM PS/2 personal computers running OS/2 Presentation Manager.</p>
<p><strong>Input Specification</strong>: The system was tested on documents scanned at 300 dpi using an IBM 3119 scanner.</p>
<p><strong>Qualitative Evaluation</strong>: The authors evaluated the system on &ldquo;typical scanned structures&rdquo; and &ldquo;simple planar diagrams&rdquo;. Large-scale quantitative benchmarking was not conducted in this work.</p>
<h2 id="results-performance-and-limitations">Results, Performance, and Limitations</h2>
<p><strong>Performance</strong>: The prototype processes a typical structure (after extraction) in less than one minute.</p>
<p><strong>Accuracy</strong>: It is reported to be accurate for simple planar diagrams.</p>
<p><strong>Output Format</strong>: The system successfully generates MDL Molfiles that interface with standard chemistry software like REACCS, MACCS, and modeling tools.</p>
<p><strong>Limitations</strong>: The system struggles with broken lines, characters touching bond structures, and requires manual intervention for complex errors.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed (Historical). As an early prototype from 1993, no source code, datasets, or digital models were publicly released. Reproducing this exact system would require recreating the pipeline from the described heuristics and sourcing vintage OCR software.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No digital artifacts were released with this 1993 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The paper does not release a dataset but specifies the input requirements for the system.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input</td>
          <td>Scanned Documents</td>
          <td>N/A</td>
          <td>Black ink on white paper; scanned at 300 dpi bi-level.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper relies on a pipeline of specific heuristics and geometric rules.</p>
<p><strong>1. Diagram Separation (Region Growing)</strong></p>
<ul>
<li><strong>Bounding Polygons</strong>: Uses convex polygons defined by pairs of parallel sides in 8 fixed directions. This approximation improves distance estimation compared to bounding rectangles.</li>
<li><strong>Seed Detection</strong>: Finds a connected component with bounding dimension $D &gt; d_{\max}$, where $d_{\max}$ is the maximum expected character size.</li>
<li><strong>Aggregation</strong>: Iteratively searches for neighboring components within a specific distance threshold $d_t$ (where $d_t$ is smaller than the whitespace margin) and merges them into the bounding polygon.</li>
</ul>
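<p>The region-growing separation above can be sketched as follows. This is a minimal illustration that substitutes axis-aligned bounding boxes for the paper&rsquo;s 8-direction bounding polygons; the component tuples and thresholds are hypothetical, not the original implementation.</p>

```python
# Sketch of region-growing diagram separation: seed on a component
# larger than the max character size, then absorb neighbors that lie
# within the distance threshold d_t. Axis-aligned boxes stand in for
# the paper's 8-direction bounding polygons.

def box_distance(a, b):
    """Gap between two boxes (x0, y0, x1, y1); 0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)

def grow_diagram(components, d_max_char, d_t):
    """Return (diagram region, leftover components)."""
    def size(c):
        return max(c[2] - c[0], c[3] - c[1])
    seeds = [c for c in components if size(c) > d_max_char]
    if not seeds:
        return None, components
    region = list(seeds[:1])          # grow from the first seed only
    rest = [c for c in components if c not in region]
    changed = True
    while changed:                    # iterate until no neighbor merges
        changed = False
        for c in rest[:]:
            if any(box_distance(c, r) <= d_t for r in region):
                region.append(c)
                rest.remove(c)
                changed = True
    return region, rest
```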
<p><strong>2. Vectorization &amp; Segmentation</strong></p>
<ul>
<li><strong>Vectorization</strong>: Uses the GIFTS system (IBM Tokyo) to fit lines to pixels.</li>
<li><strong>Classification Heuristics</strong>:
<ul>
<li><strong>Ratio Test</strong>: If the ratio of a group&rsquo;s dimension to the full diagram dimension is below a threshold $\tau$, it is classified as a <strong>Symbol</strong>:
$$ \frac{D_{\text{group}}}{D_{\text{diagram}}} &lt; \tau $$</li>
<li><strong>Context Rule</strong>: Small vector groups near letters are classified as <strong>Characters</strong> (handles &rsquo;l&rsquo; in &lsquo;Cl&rsquo;).</li>
<li><strong>Circle Rule</strong>: A group is a <strong>Circle</strong> (aromatic ring) if it contains $N \ge 8$ vectors in a roughly circular arrangement.</li>
<li><strong>Default</strong>: Otherwise, classified as <strong>Bond Structure</strong>.</li>
</ul>
</li>
</ul>
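<p>The classification heuristics above can be condensed into a single decision function. This is a sketch under assumed inputs; the boolean flags and the default $\tau$ are illustrative, not values from the paper:</p>

```python
# Sketch of the vector-group classification heuristics: circle rule
# first (N >= 8 roughly circular vectors), then the ratio test, with
# the context rule promoting small groups near letters to characters.

def classify_group(group_dim, diagram_dim, n_vectors,
                   roughly_circular, near_letter, tau=0.1):
    """Classify a vector group as circle, character, symbol, or bond."""
    if roughly_circular and n_vectors >= 8:
        return "circle"                      # aromatic-ring circle
    if group_dim / diagram_dim < tau:        # ratio test: small group
        # context rule: small groups near letters are characters,
        # e.g. the short strokes of the 'l' in 'Cl'
        return "character" if near_letter else "symbol"
    return "bond"                            # default: bond structure
```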
<p><strong>3. Cleanup &amp; Structure Recognition</strong></p>
<ul>
<li><strong>Short Vector Removal</strong>: Vectors shorter than a fraction of the median line length $L_{\text{median}}$ are shrunk to their midpoint (fixing broken junctions).</li>
<li><strong>Vertex Merging</strong>: If two vectors meet at an angle $\theta &lt; 35^{\circ}$, the vertex is removed (fixing single lines broken into two).</li>
<li><strong>Aromatic Processing</strong>: If a circle is detected, the system identifies the 6 closest atoms and adds double bonds to every second bond in the ring.</li>
</ul>
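<p>The vertex-merging rule reduces to an angle test at each vertex. A minimal sketch, assuming vectors are stored as endpoint pairs (the paper&rsquo;s internal representation is not specified):</p>

```python
import math

# Sketch of the vertex-merging cleanup: if two vectors meeting at a
# vertex deviate by less than 35 degrees from a straight line, the
# vertex is removed and the two vectors fuse into one segment.

def deflection_deg(p, v, q):
    """Deflection at vertex v for the polyline p -> v -> q (degrees)."""
    a = (v[0] - p[0], v[1] - p[1])
    b = (q[0] - v[0], q[1] - v[1])
    dot = a[0] * b[0] + a[1] * b[1]
    cos_t = dot / (math.hypot(*a) * math.hypot(*b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def merge_vertex(p, v, q, theta_max=35.0):
    """Return the fused segment (p, q) if the bend is small enough."""
    if deflection_deg(p, v, q) < theta_max:
        return (p, q)
    return None  # genuine corner: keep the vertex
```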
<h3 id="models">Models</h3>
<p><strong>OCR</strong>:</p>
<ul>
<li>The system uses a feature-based, single-font OCR engine.</li>
<li>It assumes non-serif, plain styles typical of drafting standards.</li>
<li>Character images are normalized for size before recognition.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Scanner</strong>: IBM 3119 (300 dpi).</li>
<li><strong>Compute</strong>: IBM PS/2 series running OS/2.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Casey, R., et al. (1993). Optical Recognition of Chemical Graphics. <em>Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR &lsquo;93)</em>, 627-631. <a href="https://doi.org/10.1109/ICDAR.1993.395658">https://doi.org/10.1109/ICDAR.1993.395658</a></p>
<p><strong>Publication</strong>: ICDAR 1993</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{caseyOpticalRecognitionChemical1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical Recognition of Chemical Graphics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of 2nd {{International Conference}} on {{Document Analysis}} and {{Recognition}} ({{ICDAR}} &#39;93)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Casey, R. and Boyer, S. and Healey, P. and Miller, A. and Oudot, B. and Zilles, K.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{627--631}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE Comput. Soc. Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tsukuba Science City, Japan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1993.395658}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé: OCR-Optical Chemical Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</guid><description>A seminal 1992 system for Optical Chemical Structure Recognition (OCSR) using neural networks and heuristic graph compilation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1992). Kekulé: OCR-Optical Chemical (Structure) Recognition. <em>Journal of Chemical Information and Computer Sciences</em>, 32(4), 373-378. <a href="https://doi.org/10.1021/ci00008a018">https://doi.org/10.1021/ci00008a018</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1992</p>
<h2 id="system-architecture-and-methodological-approach">System Architecture and Methodological Approach</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$). It proposes a novel software architecture (&ldquo;Kekulé&rdquo;) designed to solve a specific technical problem: the automatic conversion of printed chemical structure diagrams into computer-readable connection tables. The paper focuses on the &ldquo;how&rdquo; of the system by detailing the seven-step pipeline from scanning to graph compilation, validating the method through performance testing on a specific dataset.</p>
<h2 id="motivation-bridging-visual-diagrams-and-connection-tables">Motivation: Bridging Visual Diagrams and Connection Tables</h2>
<p>The primary motivation is to bridge the gap between how chemists communicate (structural diagrams) and how chemical databases store information (connection tables like MOLfiles).</p>
<ul>
<li><strong>Inefficiency of Manual Entry</strong>: Manual compilation of structural descriptions is &ldquo;tedious and highly prone to error&rdquo;.</li>
<li><strong>Redrawing Costs</strong>: Even using drawing programs (like ChemDraw ancestors) to capture connectivity is inefficient; redrawing a complex molecule like vitamin $B_{12}$ takes ~20 minutes.</li>
<li><strong>Lack of Existing Solutions</strong>: Existing OCR systems at the time failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), and struggled with small, mixed fonts.</li>
</ul>
<h2 id="novelty-a-hybrid-ocr-and-heuristic-approach">Novelty: A Hybrid OCR and Heuristic Approach</h2>
<p>Kekulé represents the first successful attempt to integrate all of the required elements of image processing, OCR, structure editing, and database communication into a complete system.</p>
<ul>
<li><strong>Hybrid OCR Approach</strong>: Unlike commercial OCR of the time, it used a custom implementation combining rotation correction (for skew) with a <strong>multilayer perceptron neural network</strong> trained specifically on small fonts (down to 3.2 points).</li>
<li><strong>Heuristic Feature Extraction</strong>: The authors developed specific heuristics to handle chemical artifacts, such as an exhaustive search for dashed lines, explicitly rejecting Hough transforms as unreliable for short segments.</li>
<li><strong>Contextual &ldquo;Spell Checking&rdquo;</strong>: The system uses chemical context to verify OCR results, such as checking atom symbols against a valid list and using bond connections to disambiguate characters.</li>
</ul>
<h2 id="experimental-setup-and-dataset-validation">Experimental Setup and Dataset Validation</h2>
<p>The authors performed a validation study on a diverse set of chemical structures to stress-test the system:</p>
<ul>
<li><strong>Dataset</strong>: 444 chemical structures were selected from a wide variety of sources, including the <em>Merck Index</em>, <em>Aldrich Handbook</em>, and <em>ACS Nomenclature Guide</em>, specifically chosen to &ldquo;test Kekulé&rsquo;s limits&rdquo;.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Processing Success</strong>: Percentage of structures processed.</li>
<li><strong>User Intervention</strong>: Average number of prompts per structure for verification.</li>
<li><strong>Editing Time</strong>: Time required to correct interpretation errors (the authors set an arbitrary threshold of 30 seconds for a &ldquo;good&rdquo; result).</li>
</ul>
</li>
</ul>
<h2 id="results-and-system-performance">Results and System Performance</h2>
<ul>
<li><strong>High Success Rate</strong>: 98.9% of the 444 structures were processed successfully.</li>
<li><strong>Performance Speed</strong>: The average processing time was 9 seconds per structure on an 80486 (33 MHz) processor.</li>
<li><strong>Error Modes</strong>: The primary bottleneck was broken characters in scanned images (e.g., breaks in &lsquo;H&rsquo; or &lsquo;N&rsquo; crossbars), which slowed down the OCR significantly.</li>
<li><strong>Impact</strong>: The system demonstrated that automated interpretation was faster and less error-prone than manual redrawing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The following details outline the specific technical implementation described in the 1992 paper.</p>
<h3 id="data">Data</h3>
<p>The authors did not release a public dataset but described their test set sources in detail.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Mixed Chemical Sources</td>
          <td>444 structures</td>
          <td>Sourced from <em>Merck Index</em>, <em>Aldrich Handbook</em>, <em>ACS Nomenclature Guide</em>, etc.</td>
      </tr>
      <tr>
          <td>Training (OCR)</td>
          <td>Font Exemplars</td>
          <td>Unknown</td>
          <td>&ldquo;Exemplars of characters from numerous serif and sanserif fonts&rdquo;.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a 7-step pipeline. Key algorithmic choices include:</p>
<ul>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li>Images are reduced to 1-pixel width using <strong>thinning</strong> and <strong>raster-to-vector translation</strong>.</li>
<li>An <strong>adaptive smoothing algorithm</strong> is applied to remove pixel-level jitter.</li>
</ul>
</li>
<li>
<p><strong>Feature Extraction (Dashed Lines)</strong>:</p>
<ul>
<li><strong>Hough Transforms</strong> were rejected due to poor performance on short line segments.</li>
<li><strong>Slope sorting</strong> was rejected due to variance in short dashes.</li>
<li><strong>Chosen Method</strong>: Exhaustive search/testing of all features that <em>might</em> be dashed lines (subset of features).</li>
</ul>
</li>
<li>
<p><strong>Graph Compilation</strong>:</p>
<ul>
<li><strong>Character Grouping</strong>: Characters are assembled into strings based on XY adjacency.</li>
<li><strong>Node Creation</strong>: Character strings become nodes. Vectors with endpoints &ldquo;too far&rdquo; from strings create new nodes.</li>
<li><strong>Heuristics</strong>: Circles are converted to alternating single-double bonds; &ldquo;thick&rdquo; bonds between wedges are automatically generated.</li>
</ul>
</li>
</ul>
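<p>The grouping and node-creation steps can be sketched as below. The data layout, gap threshold, and snapping distance are illustrative assumptions, not the paper&rsquo;s implementation:</p>

```python
import math

# Sketch of graph compilation: character boxes group into label
# strings by XY adjacency; vector endpoints snap to a nearby label
# node, or spawn an implicit-carbon node when "too far" from any.

def group_characters(chars, gap):
    """chars: list of (x, y, symbol); returns (labels, centers)."""
    chars = sorted(chars)
    strings, current = [], [chars[0]]
    for c in chars[1:]:
        prev = current[-1]
        if abs(c[0] - prev[0]) <= gap and abs(c[1] - prev[1]) <= gap:
            current.append(c)          # adjacent: same label string
        else:
            strings.append(current)
            current = [c]
    strings.append(current)
    labels = ["".join(s[2] for s in grp) for grp in strings]
    centers = [(sum(s[0] for s in grp) / len(grp),
                sum(s[1] for s in grp) / len(grp)) for grp in strings]
    return labels, centers

def endpoint_to_node(pt, centers, labels, too_far):
    """Snap a vector endpoint to a label node, or make a carbon node."""
    if centers:
        d, i = min((math.dist(pt, c), i) for i, c in enumerate(centers))
        if d <= too_far:
            return labels[i]
    return "C"  # bare line endpoint: implicit carbon
```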
<h3 id="models">Models</h3>
<p>The core machine learning component is the OCR engine.</p>
<ul>
<li><strong>Architecture</strong>: A <strong>multilayer perceptron neural network</strong> (fully connected).</li>
<li><strong>Input</strong>: Normalized characters. Normalization involves rotation (for skew), scaling, under-sampling, and contrast/density adjustments.</li>
<li><strong>Output</strong>: Ranked probability matches. Outputs above an experimental threshold are retained. If a character is ambiguous (e.g., &lsquo;5&rsquo; vs &lsquo;S&rsquo;), both are kept and resolved via chemical context.</li>
<li><strong>Performance</strong>: Raw accuracy ~96% on small fonts (compared to ~85% for commercial OCR of the era).</li>
</ul>
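<p>The thresholding-plus-context step can be sketched as follows; the threshold value and the valid-atom set are assumptions for illustration, not the paper&rsquo;s tuned values:</p>

```python
# Sketch of OCR ambiguity resolution: keep all candidates above a
# score threshold, then let chemical context (a set of valid atom
# symbols) settle ambiguous cases such as '5' vs 'S'.

VALID_ATOMS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"}

def resolve_character(ranked, threshold=0.2, context=VALID_ATOMS):
    """ranked: list of (symbol, score) pairs from the OCR net."""
    kept = [(sym, p) for sym, p in ranked if p >= threshold]
    if not kept:
        return None                      # reject: no confident match
    chemical = [c for c in kept if c[0] in context]
    pool = chemical or kept              # prefer chemically valid symbols
    return max(pool, key=lambda c: c[1])[0]
```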
<h3 id="hardware">Hardware</h3>
<p>The system was developed and tested on hardware typical of the early 1990s.</p>
<ul>
<li><strong>Processor</strong>: Intel 80486 at 33 MHz.</li>
<li><strong>Scanners</strong>: Hewlett-Packard ScanJet (300 dpi) and Logitech ScanMan (400 dpi hand-held).</li>
<li><strong>Platform</strong>: Microsoft Windows.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mcdanielKekuleOCRopticalChemical1992,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Kekulé: {{OCR-optical}} Chemical (Structure) Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Kekulé}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1992</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{32}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{373--378}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00008a018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph Perception for Chemical Structure OCR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</guid><description>A 1990 methodological paper presenting an early OCR system for digitizing chemical structure images into connectivity tables using C and Prolog.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Contreras, M. L., Allendes, C., Alvarez, L. T., &amp; Rozas, R. (1990). Computational perception and recognition of digitized molecular structures. <em>Journal of Chemical Information and Computer Sciences</em>, 30(3), 302-307. <a href="https://doi.org/10.1021/ci00067a014">https://doi.org/10.1021/ci00067a014</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1990</p>
<h2 id="contribution-graph-perception-and-character-recognition">Contribution: Graph Perception and Character Recognition</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>.</p>
<p>It proposes a specific algorithmic pipeline (&ldquo;graph perception and character recognition&rdquo;) to solve the technical problem of converting pixelated images of molecules into machine-readable connectivity tables. The dominant contribution is the novel set of algorithms (contour search, circular inspection, matrix parametrization).</p>
<h2 id="motivation-automating-chemical-database-entry">Motivation: Automating Chemical Database Entry</h2>
<p>The primary motivation is to automate the input of chemical structures into databases.</p>
<ul>
<li><strong>Problem</strong>: Manual input of structures (especially large ones with stereochemistry) is time-consuming and prone to human error.</li>
<li><strong>Gap</strong>: Existing methods required significant human intervention. The authors created a system that effectively handles both the molecular &ldquo;graph/skeleton&rdquo; and the &ldquo;alphanumeric characters&rdquo; to speed up entry into systems like ARIUSA or CAD tools.</li>
</ul>
<h2 id="algorithmic-novelty-circular-inspection-processing">Algorithmic Novelty: Circular Inspection Processing</h2>
<p>The paper introduces a unified &ldquo;capture-to-recognition&rdquo; system written in C that handles both type-printed and hand-printed structures. Key novelties include:</p>
<ul>
<li><strong>Circular Inspection Algorithm</strong>: A specific technique for detecting internal rings and multiple bonds by sweeping a radius of 0.3 bond lengths around atoms.</li>
<li><strong>Hybrid Recognition</strong>: Combining &ldquo;graph perception&rdquo; (vectorizing the lines) with &ldquo;character recognition&rdquo; (OCR for atom labels) in a single pipeline.</li>
<li><strong>Matrix Parametrization for OCR</strong>: A feature extraction method that assigns hexadecimal IDs to character matrices based on pixel gradients and &ldquo;semibytes&rdquo;.</li>
</ul>
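<p>The matrix-parametrization idea can be sketched as below, assuming a simple 2&times;2 zone layout and half-plane ink tests; the paper&rsquo;s exact zone divisions and gradient directions differ:</p>

```python
# Sketch of "matrix parametrization" for OCR features: a binarized
# character matrix splits into zones, each zone yields a 4-bit
# "semibyte" of ink-presence tests, and the semibytes concatenate
# into a hexadecimal ID for the character.

def ink(zone, r0, r1, c0, c1):
    """True if any pixel in the given sub-rectangle is set."""
    return any(zone[r][c] for r in range(r0, r1) for c in range(c0, c1))

def semibyte(zone):
    """4-bit code: ink in top/bottom/left/right halves of the zone."""
    h, w = len(zone), len(zone[0])
    bits = (ink(zone, 0, h // 2, 0, w),
            ink(zone, h // 2, h, 0, w),
            ink(zone, 0, h, 0, w // 2),
            ink(zone, 0, h, w // 2, w))
    return sum(b << i for i, b in enumerate(bits))

def hex_id(matrix):
    """Concatenate per-quadrant semibytes into a hex ID string."""
    h, w = len(matrix), len(matrix[0])
    zones = [
        [row[:w // 2] for row in matrix[:h // 2]],   # top-left
        [row[w // 2:] for row in matrix[:h // 2]],   # top-right
        [row[:w // 2] for row in matrix[h // 2:]],   # bottom-left
        [row[w // 2:] for row in matrix[h // 2:]],   # bottom-right
    ]
    return "".join(format(semibyte(z), "x") for z in zones)
```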
<h2 id="methodology-validation-via-custom-structure-dataset">Methodology: Validation via Custom Structure Dataset</h2>
<p>The authors validated the system by digitizing and recognizing a set of test structures:</p>
<ul>
<li><strong>Dataset</strong>: 200 type-printed structures and 50 hand-printed structures.</li>
<li><strong>Metric</strong>: &ldquo;Reliability&rdquo; percentage (correct recognition of the connectivity table).</li>
<li><strong>Speed Comparison</strong>: Measured processing time against a &ldquo;qualified person&rdquo; performing manual input for an average 20-atom molecule.</li>
</ul>
<h2 id="results-speed-and-file-size-efficiency">Results: Speed and File Size Efficiency</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>94% reliability</strong> for both type- and hand-printed graphs.</li>
<li><strong>Character Recognition</strong>: Isolated character recognition achieved <strong>&gt;99% reliability</strong>.</li>
<li><strong>Speed</strong>: The system was <strong>3-5 times faster</strong> than manual human input.</li>
<li><strong>Efficiency</strong>: The storage required for a recognized molecule (e.g., $C_{19}H_{31}N$) was significantly smaller (4.1 kb) than the raw image bitmap.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a standard external dataset but rather a custom set of structures for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Type-printed structures</td>
          <td style="text-align: left">200 images</td>
          <td style="text-align: left">Used to test reliability</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Hand-printed structures</td>
          <td style="text-align: left">50 images</td>
          <td style="text-align: left">&ldquo;Straight enough&rdquo; drawings required</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details three specific algorithmic components crucial for replication:</p>
<ol>
<li>
<p><strong>Graph Perception (Contour Search)</strong>:</p>
<ul>
<li><strong>Sweep</strong>: Left-to-right horizontal sweep to find the first pixel.</li>
<li><strong>Contour Follow</strong>: Counter-clockwise algorithm used to trace borders.</li>
<li><strong>Vertex Detection</strong>: A vertex is flagged if the linear trajectory deflection angle is $&gt;18^\circ$.</li>
<li><strong>Atom Localization</strong>: Two or more vertices in a small space indicate an atom position.</li>
</ul>
</li>
<li>
<p><strong>Circular Inspection (Branching/Rings)</strong>:</p>
<ul>
<li><strong>Radius</strong>: A circle is inspected around each atom with $r = 0.3 \times \text{single bond length}$.</li>
<li><strong>Branch Detection</strong>: &ldquo;Unknown border pixels&rdquo; found on this circle trigger new contour searches to find attached bonds or rings.</li>
</ul>
</li>
<li>
<p><strong>Character Recognition (Matrix Feature Extraction)</strong>:</p>
<ul>
<li><strong>Separation</strong>: Characters are separated into isolated matrices and &ldquo;relocated&rdquo; to the top-left corner.</li>
<li><strong>Parametrization</strong>: The matrix is divided into zones. A &ldquo;semibyte&rdquo; (4-bit code) is generated by checking for pixel density in specific directions.</li>
<li><strong>ID Assignment</strong>: Matrices are assigned a Hex ID (e.g., <code>8</code>, <code>1</code>, <code>0</code>, <code>6</code>) based on these semibytes.</li>
<li><strong>Differentiation</strong>: Secondary parameters (concavities, vertical lines) resolve conflicts (e.g., between &lsquo;b&rsquo; and &lsquo;h&rsquo;).</li>
</ul>
</li>
</ol>
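<p>The circular-inspection step can be sketched by sampling points on the $r = 0.3 \times \text{bond length}$ circle around an atom; the pixel-lookup callable and sampling density here are assumptions for illustration:</p>

```python
import math

# Sketch of circular inspection: sample the circle of radius
# 0.3 * bond_length around an atom and report directions where
# unvisited ink appears; each hit seeds a new contour search for
# an attached bond or ring.

def circular_inspection(atom, bond_length, is_dark, visited, n=72):
    """Return angles (degrees) on the circle that hit unvisited ink."""
    r = 0.3 * bond_length
    hits = []
    for k in range(n):
        theta = 2 * math.pi * k / n
        x = round(atom[0] + r * math.cos(theta))
        y = round(atom[1] + r * math.sin(theta))
        if is_dark(x, y) and (x, y) not in visited:
            hits.append(round(math.degrees(theta)))
    return hits
```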
<h3 id="models">Models</h3>
<p>The system does not use learned weights (neural networks). It relies on <strong>rule-based topological recognition</strong>.</p>
<ul>
<li><strong>Representation</strong>: The final output is a Prolog data structure converted into a connectivity table.</li>
<li><strong>Atom Recognition</strong>: Terminal atoms are identified by linear projection; if no pixels are found, it defaults to Carbon.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The performance metrics reflect 1990s hardware, useful for historical context or low-resource reimplementation.</p>
<ul>
<li><strong>Capture</strong>: PC-AT microcomputer with HP-Scanjet.</li>
<li><strong>Processing</strong>: MicroVax II (8 MB real memory, 159 MB hard disc) running Ultrix-32.</li>
<li><strong>Memory Usage</strong>: A $300 \times 300$ dpi image required ~175 kb; a recognized graph required ~1.6 kb.</li>
<li><strong>Time</strong>: Processing time per molecule was 0.7 - 1.0 minutes.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{contrerasComputationalPerceptionRecognition1990,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Computational Perception and Recognition of Digitized Molecular Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Contreras, M. Leonor and Allendes, Carlos and Alvarez, L. Tomas and Rozas, Roberto}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1990</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{30}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{302--307}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00067a014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader: Automated Structure Extraction</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</guid><description>ChemReader extracts chemical structures from raster images using modified Hough transform and chemical spell checking for improved accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Rosania, G. R., Shedden, K. A., Nguyen, M., Lyu, N., &amp; Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. <em>Chemistry Central Journal</em>, 3(1), 4. <a href="https://doi.org/10.1186/1752-153X-3-4">https://doi.org/10.1186/1752-153X-3-4</a></p>
<p><strong>Publication</strong>: Chemistry Central Journal 2009</p>
<h2 id="paper-contribution-method--pipeline">Paper Contribution: Method &amp; Pipeline</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel software system, <strong>ChemReader</strong>, designed to automate the analog-to-digital conversion of chemical structure diagrams. The paper focuses on the algorithmic pipeline, specifically modifying standard computer vision techniques like the Hough Transform to suit chemical graphs. It validates the method through direct performance comparisons against existing State-of-the-Art tools (OSRA, CLiDE, Kekule).</p>
<h2 id="motivation-unlocking-analog-chemical-information">Motivation: Unlocking Analog Chemical Information</h2>
<p>There is a massive amount of chemical information (molecular interactions, pathways, disease processes) locked in scientific literature. However, this information is typically encoded as &ldquo;analog diagrams&rdquo; (raster images) embedded in text. Existing text-based search engines cannot index these structures effectively.</p>
<p>While previous tools existed (Kekule, OROCS, CLiDE), they often required high-resolution images (150-300 dpi) or manual intervention to separate diagrams from text, making fully automated, large-scale database annotation impractical.</p>
<h2 id="core-innovation-modified-transforms-and-spell-checking">Core Innovation: Modified Transforms and Spell Checking</h2>
<p>The authors introduce <strong>ChemReader</strong>, a fully automated toolkit with several specific algorithmic innovations tailored for chemical diagrams:</p>
<ul>
<li><strong>Modified Hough Transform (HT):</strong> Unlike standard HT, which treats all pixels equally, ChemReader uses a modified weight function that accounts for pixel connectivity and line thickness to better detect chemical bonds.</li>
<li><strong>Chemical Spell Checker:</strong> A post-processing step that uses a dictionary of common chemical abbreviations (770 entries) and n-gram probabilities to correct Optical Character Recognition (OCR) errors (e.g., correcting specific atom labels based on valence rules), improving accuracy from 66% to 87%.</li>
<li><strong>Specific Substructure Detection:</strong> Dedicated algorithms for detecting stereochemical &ldquo;wedge&rdquo; bonds using corner detection and aromatic rings using the Generalized Hough Transform.</li>
</ul>
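The spell-checking idea above can be sketched in a few lines: normalize common OCR confusions, then fall back to the nearest dictionary entry. This is a minimal illustration, not the paper's actual 770-entry dictionary, n-gram model, or pixel-intensity similarity; the abbreviation set and confusion map below are assumptions.

```python
# Minimal sketch of a dictionary-based "chemical spell checker" for OCR output.
# The dictionary entries and confusion substitutions are illustrative only.

from difflib import SequenceMatcher

CHEMICAL_DICTIONARY = {"OMe", "OEt", "OAc", "NH2", "OH", "CO2H", "CF3", "Ph"}

# Hypothetical OCR confusions: digit zero read for letter O, etc.
CONFUSIONS = {"0": "O", "5": "S"}

def correct_label(ocr_text: str, dictionary=CHEMICAL_DICTIONARY) -> str:
    """Return the dictionary entry most similar to the OCR'd atom label."""
    # First apply single-character confusion substitutions.
    normalized = "".join(CONFUSIONS.get(c, c) for c in ocr_text)
    if normalized in dictionary:
        return normalized
    # Otherwise pick the closest dictionary word by string similarity.
    return max(dictionary,
               key=lambda t: SequenceMatcher(None, normalized, t).ratio())

print(correct_label("0Me"))  # "0" corrected to "O" -> "OMe"
```

In the real system, a valence check on the corrected label would further filter candidates that cannot take the observed number of bonds.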
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared ChemReader against three other systems: <strong>OSRA V1.0.1</strong>, <strong>CLiDE V2.1 Lite</strong>, and <strong>Kekule V2.0 demo</strong>.</p>
<p>They used three distinct datasets to test robustness:</p>
<ol>
<li><strong>Set I (50 images):</strong> Diverse drawing styles and fonts collected via Google Image Search.</li>
<li><strong>Set II (100 images):</strong> Ligand images from the GLIDA database, linked to PubChem for ground truth.</li>
<li><strong>Set III (212 images):</strong> Low-resolution images embedded in 121 scanned journal articles from PubMed.</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Accuracy:</strong> ChemReader significantly outperformed competitors. In the difficult <strong>Set III</strong> (journal articles), ChemReader achieved <strong>30.2%</strong> correct exact output, compared to 17% for OSRA and 6.6% for CLiDE.</li>
<li><strong>Similarity:</strong> Even when exact matches failed, ChemReader maintained high Tanimoto similarity scores (0.74-0.86), indicating it successfully captured the majority of chemically significant features.</li>
<li><strong>Substructure Recognition:</strong> ChemReader demonstrated higher recall rates across all PubChem fingerprint categories (rings, atom pairs, SMARTS patterns) compared to other tools.</li>
<li><strong>Error Correction:</strong> The &ldquo;Chemical Spell Checker&rdquo; improved character recognition accuracy from <strong>66% to 87%</strong>.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized three test sets collected from public sources.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set I</strong></td>
          <td>50 images</td>
          <td>Sourced from Google Image Search to vary styles/fonts.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set II</strong></td>
          <td>100 images</td>
          <td>Randomly selected ligands from the GLIDA database; ground truth via PubChem.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set III</strong></td>
          <td>212 images</td>
          <td>Extracted from 121 PubMed journal articles; specifically excludes non-chemical figures.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of several sequential processing steps:</p>
<ul>
<li><strong>De-noising:</strong> Uses <strong>GREYCstoration</strong>, an anisotropic smoothing algorithm, to regulate image noise.</li>
<li><strong>Segmentation:</strong> Uses an <strong>8-connectivity algorithm</strong> to group pixels. Components are classified as text or graphics based on height/area ratios.</li>
<li><strong>Line Detection (Modified Hough Transform):</strong>
<ul>
<li>Standard Hough Transform is modified to weight pixel pairs $(P_i, P_j)$ based on connectivity.</li>
<li><strong>Weight Function ($W_{ij}$):</strong>
$$W_{ij} = \begin{cases} n_{ij}\left(\frac{x_{ij}}{n_{ij}} - P_0\right) &amp; \text{if } x_{ij}/n_{ij} &gt; P_0 \\ 0 &amp; \text{otherwise} \end{cases}$$
Where $n_{ij}$ is the pixel count between points, $x_{ij}$ is the count of black pixels, and $P_0$ is a density threshold.</li>
</ul>
</li>
<li><strong>Wedge Bond Detection:</strong> Uses corner detection to find filled triangles, checking that the triangle&rsquo;s geometric area matches the number of black pixels it encloses and that its shape is isosceles.</li>
<li><strong>Chemical Spell Checker:</strong>
<ul>
<li>Calculates the Maximum Likelihood ($ML$) of a character string being a valid chemical word $T$ from a dictionary.</li>
<li><strong>Similarity Metric:</strong>
$$Sim(S_i, T_i) = 1 - \sqrt{\sum_{j=1}^{M} [I^{S_i}(j) - I^{T_i}(j)]^2}$$
Uses pixel-by-pixel intensity difference between the input segment $S$ and candidate template $T$.</li>
</ul>
</li>
</ul>
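The template similarity used by the spell checker can be sketched directly from the formula above: a segment and a candidate template are compared intensity-by-intensity, with identical images scoring 1. The tiny flattened "glyph" below is an illustrative stand-in for a real character bitmap.

```python
# Sketch of the spell checker's template similarity:
#   Sim(S, T) = 1 - sqrt(sum_j (I_S(j) - I_T(j))^2)
# comparing normalized pixel intensities of segment S against template T.

import math

def similarity(segment, template):
    """Pixel-by-pixel intensity similarity between two equal-size images."""
    assert len(segment) == len(template)
    sq_diff = sum((s - t) ** 2 for s, t in zip(segment, template))
    return 1.0 - math.sqrt(sq_diff)

# A "+"-shaped 3x3 glyph, flattened row by row; identical images score 1.0.
S = [0.0, 1.0, 0.0,
     1.0, 1.0, 1.0,
     0.0, 1.0, 0.0]
print(similarity(S, S))  # 1.0
```

Any intensity mismatch lowers the score, so the best-matching dictionary template is simply the one maximizing `similarity`.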
<h3 id="models">Models</h3>
<ul>
<li><strong>Character Recognition:</strong> Uses the open-source <strong>GOCR</strong> library. It employs template matching based on features like holes, pixel densities, and transitions.</li>
<li><strong>Chemical Dictionary:</strong> A lookup table containing <strong>770</strong> frequently used chemical abbreviations and fundamental valence rules.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using exact structure matching and fingerprint similarity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Set III)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>% Correct</strong></td>
          <td><strong>30.2%</strong></td>
          <td>17%</td>
          <td>Exact structure match using ChemAxon JChem.</td>
      </tr>
      <tr>
          <td><strong>Avg Similarity</strong></td>
          <td><strong>0.740</strong></td>
          <td>0.526</td>
          <td>Tanimoto similarity on PubChem Substructure Fingerprints.</td>
      </tr>
      <tr>
          <td><strong>Precision (Rings)</strong></td>
          <td><strong>0.87</strong></td>
          <td>0.84</td>
          <td>Precision rate for recognizing ring systems.</td>
      </tr>
      <tr>
          <td><strong>Recall (Rings)</strong></td>
          <td><strong>0.83</strong></td>
          <td>0.73</td>
          <td>Recall rate for recognizing ring systems.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> C++ implementation running on MS Windows.</li>
<li><strong>Dependencies:</strong> GOCR (OCR), GREYCstoration (Image processing).</li>
</ul>
]]></content:encoded></item><item><title>Chemical Machine Vision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</guid><description>Machine vision approach using Gabor wavelets and Kohonen networks to classify chemical raster images and extract structural metadata.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gkoutos, G. V., Rzepa, H., Clark, R. M., Adjei, O., &amp; Johal, H. (2003). Chemical Machine Vision: Automated Extraction of Chemical Metadata from Raster Images. <em>Journal of Chemical Information and Computer Sciences</em>, 43(5), 1342-1355. <a href="https://doi.org/10.1021/ci034017n">https://doi.org/10.1021/ci034017n</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 2003</p>
<h2 id="paper-classification-methodological-approach">Paper Classification: Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline applying &ldquo;machine vision&rdquo; techniques (Gabor wavelets and Kohonen networks) to the problem of identifying chemical diagrams in low-resolution raster images. The paper focuses on the &ldquo;how&rdquo; (the algorithm and its parameters) and validates the method through quantitative experiments optimizing feature vectors and masks.</p>
<h2 id="motivation-extracting-legacy-chemical-data">Motivation: Extracting Legacy Chemical Data</h2>
<p>The primary motivation is to unlock the &ldquo;large amount of data&rdquo; trapped in legacy raster images (GIF, JPEG) on the Web that lack semantic metadata.</p>
<ul>
<li><strong>Legacy Data Problem</strong>: Most chemical structural information on the Web is embedded in raster images, not machine-readable formats like Molfiles.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous tools like Kekule and CLiDE acted as &ldquo;Chemical OCR,&rdquo; attempting to reconstruct exact atom-bond connections. This required high-resolution images (&gt;300 dpi) and human intervention, making them unsuitable for automated Web crawling of low-resolution (72-96 dpi) images.</li>
<li><strong>Goal</strong>: To create a low-cost, automated tool for a &ldquo;robot-based Internet resource discovery tool&rdquo; that can classify images (e.g., &ldquo;is this a molecule?&rdquo;).</li>
</ul>
<h2 id="core-innovation-texture-recognition-over-structural-ocr">Core Innovation: Texture Recognition over Structural OCR</h2>
<p>The core novelty is the shift from &ldquo;Optical Character Recognition&rdquo; (exact reconstruction) to <strong>&ldquo;Texture Recognition&rdquo;</strong> (classification).</p>
<ul>
<li><strong>Texture-Based Approach</strong>: The authors treat chemical diagrams as textures. They use <strong>Gabor wavelets</strong> to extract texture features. <strong>Crucially, this system does not recognize specific chemical structures</strong> (i.e., atom-bond connectivity tables, <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, or Molfiles). It only classifies images into broad categories.</li>
<li><strong>Incremental Learning</strong>: The system uses a <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong> combined with Class Boundary Analysis (CBA). This allows for &ldquo;incremental learning,&rdquo; where new classes (e.g., aromatic vs. non-aromatic) can be added without retraining the entire system.</li>
<li><strong>Optimization for Chemistry</strong>: The authors identify specific parameters (frequency channels, mask sizes) that are optimal for the &ldquo;texture&rdquo; of chemical diagrams.</li>
<li><strong>Integration with ChemDig</strong>: The method was designed to feed into ChemDig, a robot-based index engine for automated web crawling and metadata generation.</li>
</ul>
<h2 id="experimental-setup-parameter-optimization">Experimental Setup: Parameter Optimization</h2>
<p>The authors performed optimization and validation experiments using a dataset of <strong>300 images</strong> divided into three classes: Ring Systems, Non-Ring Systems, and Non-Chemistry (textures, biological figures, etc.).</p>
<ol>
<li><strong>Parameter Optimization</strong>: They systematically varied hyperparameters to find the optimal configuration:
<ul>
<li><strong>Feature Vector Size</strong>: Tested sizes from 100 to 4000 elements.</li>
<li><strong>Energy Mask Size</strong>: Tested windows from $3 \times 3$ to $15 \times 15$ pixels.</li>
<li><strong>Frequency Channels</strong>: Tested seven spatial frequencies ($\sqrt{2}$ to $64\sqrt{2}$).</li>
</ul>
</li>
<li><strong>Classification Performance</strong>: Evaluated the system&rsquo;s ability to classify unseen test images using a 50:50 training/test split.</li>
<li><strong>Comparison</strong>: Qualitatively compared the approach against vectorization tools (Autotrace, CR2V).</li>
</ol>
<h2 id="results-robust-classification-of-low-resolution-images">Results: Robust Classification of Low-Resolution Images</h2>
<ul>
<li><strong>Optimal Configuration</strong>: The system performed best with a feature vector size of ~1500 elements, a $9 \times 9$ energy mask, and frequency channel $4\sqrt{2}$.</li>
<li><strong>High Accuracy</strong>: Achieved a recognition rate of <strong>91%</strong> with a 50:50 training/test split, and up to <strong>92%</strong> with a 70:30 split.</li>
<li><strong>Robustness</strong>: The system successfully distinguished between chemical and non-chemical images (zero false negatives for chemical images).</li>
<li><strong>Limitations</strong>: Misclassifications occurred between &ldquo;ring&rdquo; and &ldquo;non-ring&rdquo; systems when structures had similar visual &ldquo;textures&rdquo; (e.g., similar density or layout).</li>
<li><strong>Impact</strong>: The method is viable for automating metadata generation (e.g., <code>alt</code> tags) for web crawlers, functioning as a coarse-grained filter before more expensive processing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom dataset of raster images collected from the Web.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><strong>Custom Web Dataset</strong></td>
          <td>300 images</td>
          <td>Split into 3 classes: Ring Systems, Non-Ring Systems, Non-Chemistry.</td>
      </tr>
      <tr>
          <td>Resolution</td>
          <td><strong>Low-Res Web Images</strong></td>
          <td>72-96 dpi</td>
          <td>Deliberately chosen to mimic Web conditions where OCR fails.</td>
      </tr>
      <tr>
          <td>Format</td>
          <td><strong>Raster</strong></td>
          <td>GIF, JPEG</td>
          <td>Typical web formats.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The core pipeline consists of a <strong>Gabor Transform Unit</strong> followed by a <strong>Training/Classification Unit</strong>.</p>
<ul>
<li><strong>Gabor Wavelets</strong>: Used for feature extraction. The 2D Gabor wavelet equation is:
$$h(x,y)=\exp\left\{-\frac{1}{2}\left[\frac{x^{2}}{\sigma_{x}^{2}}+\frac{y^{2}}{\sigma_{y}^{2}}\right]\right\}\cos(2\pi\mu_{\sigma}x+\phi)$$
<ul>
<li><strong>Bank Structure</strong>: 28 filters total (4 orientations $\times$ 7 radial frequencies).</li>
<li><strong>Orientations</strong>: $0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}$.</li>
<li><strong>Frequencies</strong>: 1 octave apart, specifically $\sqrt{2}, 2\sqrt{2}, 4\sqrt{2}, \dots, 64\sqrt{2}$.</li>
<li><strong>Selected Frequency</strong>: $4\sqrt{2}$ was found to be optimal for chemistry.</li>
</ul>
</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Buffer Mounting</strong>: Images are mounted in a buffer (set to 0) to handle edge artifacts.</li>
<li><strong>Look-Up-Tables (LUT/LUF)</strong>: A binary Look-Up-Frame (LUF) indicates Regions of Interest (ROI) to avoid computing empty space; values are stored in a Look-Up-Table (LUT) to prevent re-computation of overlapping windows.</li>
</ul>
</li>
<li><strong>Feature Extraction</strong>:
<ul>
<li><strong>Non-linear Thresholding</strong>: $\psi(t) = \tanh(\alpha t)$ with $\alpha = 0.25$.</li>
<li><strong>Energy Function</strong>: Calculated as average absolute deviation from the mean using a window $W_{xy}$.
$$e_{k}(x,y)=\frac{1}{M^{2}}\sum_{(a,b)\in W_{xy}}|\psi(r_{k}(a,b))|$$</li>
<li><strong>Optimal Window</strong>: $9 \times 9$ pixels.</li>
</ul>
</li>
</ul>
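The feature-extraction step above can be sketched as follows: apply the tanh non-linearity to a Gabor-filtered response, then average its absolute value over a sliding $9 \times 9$ window. The random input here is a stand-in for a real filter response, not an actual Gabor output.

```python
# Sketch of the energy feature map:
#   e_k(x, y) = (1/M^2) * sum over an MxM window of |tanh(alpha * r_k)|
# with alpha = 0.25 and M = 9 as reported in the paper.

import numpy as np

ALPHA = 0.25  # non-linearity gain
M = 9         # optimal energy-window size (9x9)

def energy_map(r: np.ndarray, m: int = M, alpha: float = ALPHA) -> np.ndarray:
    """Windowed mean-absolute energy of a thresholded filter response."""
    psi = np.abs(np.tanh(alpha * r))
    h, w = psi.shape
    out = np.zeros((h - m + 1, w - m + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Mean over an MxM window equals (1/M^2) * sum.
            out[y, x] = psi[y:y + m, x:x + m].mean()
    return out

rng = np.random.default_rng(0)
r = rng.normal(size=(32, 32))  # stand-in for one Gabor channel response
e = energy_map(r)
print(e.shape)  # (24, 24)
```

Concatenating such energy maps across the 28 filter-bank channels (and subsampling) would yield the feature vector fed to the classifier.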
<h3 id="models">Models</h3>
<p>The classification model relies on competitive learning.</p>
<ul>
<li><strong>Architecture</strong>: <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong>.</li>
<li><strong>Training</strong>:
<ul>
<li><strong>Learning Rate</strong>: Starts at 1.0, decreases to 0.1.</li>
<li><strong>Class Boundary Analysis (CBA)</strong>: Computes the centroid (mean) and variance of each cluster. The variance defines the class boundary.</li>
</ul>
</li>
<li><strong>Classification Metric</strong>: <strong>Euclidean Distance Norm</strong>. An unknown vector is classified based on the shortest distance to a cluster center, provided it falls within the variance boundary.
$$D_{ij}=||x_{i}-x_{j}||$$</li>
</ul>
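The decision rule above can be sketched as nearest-centroid classification with a variance-derived rejection boundary. The centroids and radii below are toy values standing in for trained KSOFM clusters, and the two-class setup is an assumption for illustration.

```python
# Sketch of Class Boundary Analysis (CBA) classification: assign an unknown
# vector to the nearest cluster centroid by Euclidean distance, but only if
# it falls inside that class's variance-derived boundary.

import math

CLASSES = {
    # name: (centroid, boundary radius derived from cluster variance)
    "ring":     ([1.0, 1.0], 0.5),
    "non-ring": ([3.0, 3.0], 0.5),
}

def classify(x):
    """Nearest-centroid label, or 'unclassified' if outside all boundaries."""
    name, (centroid, radius) = min(CLASSES.items(),
                                   key=lambda kv: math.dist(x, kv[1][0]))
    return name if math.dist(x, centroid) <= radius else "unclassified"

print(classify([1.1, 0.9]))  # near the "ring" centroid, inside its boundary
print(classify([2.0, 2.0]))  # between clusters, outside both boundaries
```

The rejection option is what lets new classes be added incrementally: vectors that fall outside all existing boundaries can seed a new cluster without retraining.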
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using recognition rate ($R_s$) and misclassification error ($E_s$).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recognition Rate</td>
          <td><strong>91%</strong></td>
          <td>N/A</td>
          <td>Achieved with 50:50 split. 92% with 70:30 split.</td>
      </tr>
      <tr>
          <td>Feature Size</td>
          <td><strong>~1500</strong></td>
          <td>4000</td>
          <td>Reducing vector size from 4000 to 1500 maintained ~80% accuracy while improving speed.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gkoutosChemicalMachineVision2003,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}: {{Automated Extraction}} of {{Chemical Metadata}} from {{Raster Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Gkoutos, Georgios V. and Rzepa, Henry and Clark, Richard M. and Adjei, Osei and Johal, Harpal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2003</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{43}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1342--1355}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci034017n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Literature Data Extraction: The CLiDE Project</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</guid><description>Seminal OCSR system converting scanned chemical diagrams into connection tables via primitive recognition and semantic interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ibison, P., Jacquot, M., Kam, F., Neville, A. G., Simpson, R. W., Tonnelier, C., Venczel, T., &amp; Johnson, A. P. (1993). Chemical Literature Data Extraction: The CLiDE Project. <em>Journal of Chemical Information and Computer Sciences</em>, 33(3), 338-344. <a href="https://doi.org/10.1021/ci00013a010">https://doi.org/10.1021/ci00013a010</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 1993</p>
<h2 id="contribution-and-taxonomy">Contribution and Taxonomy</h2>
<p><strong>Classification: Method ($\Psi_{\text{Method}}$)</strong></p>
<p>This methodological paper proposes a novel software architecture for Optical Chemical Structure Recognition (OCSR). It details specific algorithms for image segmentation, vectorization, and chemical interpretation, validated through the successful extraction of complex structures from literature.</p>
<h2 id="motivation-automating-literature-extraction">Motivation: Automating Literature Extraction</h2>
<p>The manual creation of chemical reaction databases is a time-consuming and expensive process requiring trained chemists to abstract information from literature. While commercial tools existed for interpreting isolated scanned structures (like Kekulé), there was a lack of systems capable of processing whole pages of journals (including embedded text, reaction schemes, and structures) without significant human intervention.</p>
<h2 id="core-innovation-a-three-phase-hybrid-architecture">Core Innovation: A Three-Phase Hybrid Architecture</h2>
<p>CLiDE introduces a comprehensive <strong>three-phase architecture</strong> (Recognition, Grouping, Interpretation) that integrates computer vision with chemical knowledge. Key novelties include:</p>
<ul>
<li><strong>Context-Aware Interpretation:</strong> The use of an extendable <strong>superatom database</strong> to resolve ambiguities in chemical text (e.g., expanding &ldquo;OAc&rdquo; or &ldquo;Me&rdquo; into connection tables).</li>
<li><strong>Hybrid Primitive Detection:</strong> A combination of contour coding for solid lines and a modified Hough transform specifically tuned for detecting dashed chemical bonds.</li>
<li><strong>Semantic Re-construction:</strong> A scoring system for bond-atom association that considers both distance and vector direction to handle poorly drawn structures.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<p>The authors validated the system on a set of &ldquo;difficult cases&rdquo; selected to test specific capabilities. These included:</p>
<ul>
<li><strong>Crossing Bonds:</strong> Structures where bonds intersect without a central atom (Fig. 9d, 9e).</li>
<li><strong>Stereochemistry:</strong> Identification of wedged, dashed, and wavy bonds.</li>
<li><strong>Generic Structures:</strong> Parsing generic text blocks (e.g., $R^1 = Me$) and performing substitutions.</li>
<li><strong>Accuracy Estimation:</strong> The authors report an approximate 90% recognition rate for distinct characters in literature scans.</li>
</ul>
<h2 id="results-and-structural-reconstruction">Results and Structural Reconstruction</h2>
<p>The system successfully generates connection tables (exported as MOLfiles or ChemDraw files) from scanned bitmaps. It effectively distinguishes between graphical primitives (wedges, lines) and text, accurately reconstructing stereochemistry and resolving superatom synonyms (e.g., converting &ldquo;MeO&rdquo; to &ldquo;OMe&rdquo;). The authors conclude that while character recognition depends heavily on image quality, the graphic primitive recognition is robust for lines above a threshold length.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Input Format:</strong> Binary bitmaps scanned from journal pages.</li>
<li><strong>Resolution:</strong> 300 dpi (generating ~1 MB per page).</li>
<li><strong>Superatom Database:</strong> A lookup table containing ~200 entries. Each entry includes:
<ul>
<li><strong>Valency/Charge:</strong> Explicit constraints (e.g., &ldquo;HO&rdquo; takes 1 bond, &ldquo;CO2&rdquo; takes 2).</li>
<li><strong>Bonding Index:</strong> Specifies which letter in the string serves as the attachment point (e.g., letter 2 for &ldquo;HO&rdquo;, letters 1 and 2 for &ldquo;CO2&rdquo;).</li>
<li><strong>Sub-Connection Table:</strong> The internal atomic representation of the group.</li>
</ul>
</li>
</ul>
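A superatom entry in the spirit of the table above can be sketched as a label mapped to its valency and attachment-letter indices. The field layout and the handful of entries below are illustrative assumptions, not CLiDE's actual ~200-entry database.

```python
# Sketch of a superatom lookup table: each abbreviation records how many
# bonds it accepts and which 1-based letter positions serve as attachment
# points (per the examples in the paper: letter 2 for "HO", letters 1 and
# 2 for "CO2"). Entries and encoding are illustrative.

SUPERATOMS = {
    # label: (bonds accepted, 1-based indices of attachment letters)
    "HO":  (1, (2,)),
    "OMe": (1, (1,)),
    "CO2": (2, (1, 2)),
}

def attachment_atoms(label):
    """Return the characters that serve as bond attachment points."""
    _valency, indices = SUPERATOMS[label]
    return [label[i - 1] for i in indices]

print(attachment_atoms("HO"))   # ['O']
print(attachment_atoms("CO2"))  # ['C', 'O']
```

In the full system, each entry would also carry a sub-connection table so the abbreviation can be expanded into explicit atoms and bonds.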
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Primitive Recognition (Vectorization)</strong></p>
<ul>
<li><strong>Contour Coding:</strong> Uses the <strong>Ahronovitz-Bertier-Habib</strong> method to generate interpixel contours (directions N, S, E, W) for connected components.</li>
<li><strong>Polygonal Approximation:</strong> A method similar to <strong>Sklansky and Gonzalez</strong> breaks contours into &ldquo;fractions&rdquo;.
<ul>
<li><em>Rule:</em> Long sides are &ldquo;straight fractions&rdquo;; consecutive short sides are &ldquo;curved fractions&rdquo;.</li>
<li><em>Reconstruction:</em> Parallel fractions are paired to form bond borders. If a border is split (due to noise or crossing lines), the system attempts to merge collinear segments.</li>
</ul>
</li>
<li><strong>Dash Detection:</strong> A <strong>modified Hough transform</strong> is applied to small connected components. It requires at least <strong>three collinear dashes</strong> to classify a sequence as a dashed bond.</li>
</ul>
<p><strong>2. Interpretation Rules</strong></p>
<ul>
<li><strong>Bond-Atom Association:</strong>
<ul>
<li><em>Candidate Selection:</em> The system identifies $m$ closest bonds for a superatom requiring $n$ connections ($m \ge n$).</li>
<li><em>Scoring Function:</em> Connections are selected based on minimizing <strong>perpendicular distance</strong> (alignment).</li>
</ul>
</li>
<li><strong>Crossing Bonds:</strong> Resolved using rules based on <strong>proximity, length, collinearity, and ring membership</strong> to distinguish actual crossings from central carbon atoms.</li>
</ul>
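The alignment score used in bond-atom association can be sketched as a perpendicular-distance test: among candidate bonds near a label, prefer the one whose supporting line points at the label's centre. The geometry helpers and toy coordinates below are assumptions, not CLiDE's exact scoring function.

```python
# Sketch of bond-to-atom association by minimal perpendicular distance
# from the label centre to each candidate bond's supporting line.

import math

def perpendicular_distance(point, a, b):
    """Distance from `point` to the infinite line through endpoints a, b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    dx, dy = bx - ax, by - ay
    # |cross product| / segment length = perpendicular distance.
    return abs(dx * (py - ay) - dy * (px - ax)) / math.hypot(dx, dy)

def best_bond(label_centre, bonds):
    """Pick the candidate bond best aligned with the atom label."""
    return min(bonds, key=lambda ab: perpendicular_distance(label_centre, *ab))

bonds = [((0, 0), (10, 0)),    # horizontal bond pointing at the label
         ((0, 5), (10, 15))]   # diagonal bond passing well above it
print(best_bond((12, 0), bonds))  # the horizontal bond wins
```

A distance-only score would also accept a nearby but misaligned bond; weighting by alignment is what makes the association robust to poorly drawn structures.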
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR:</strong> A neural network trained on alphanumeric characters.
<ul>
<li><strong>Input Representation:</strong> Density matrices derived from character bitmaps.</li>
<li><strong>Post-processing:</strong> Unrecognized characters are flagged for manual correction.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> SUN SPARC workstation.</li>
<li><strong>Scanner:</strong> Agfa Focus S 800GS.</li>
<li><strong>Implementation Language:</strong> C++.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ibisonChemicalLiteratureData1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction: {{The CLiDE Project}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ibison, P. and Jacquot, M. and Kam, F. and Neville, A. G. and Simpson, R. W. and Tonnelier, C. and Venczel, T. and Johnson, A. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{338--344}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00013a010}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Automatic Recognition of Chemical Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</guid><description>A rule-based system for extracting chemical structure information from raster images, validated against commercial baselines.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-image-mining-architecture">Contribution: Rule-Based Image Mining Architecture</h2>
<p><strong>$\Psi_{\text{Method}}$ (Methodological Basis)</strong></p>
<p>This is a methodological paper describing a system architecture for <strong>image mining</strong> in the chemical domain. It focuses on the engineering challenge of converting rasterized depictions of molecules into computer-readable SDF files. The paper details the algorithmic pipeline and validates it through quantitative benchmarking against a commercial alternative.</p>
<h2 id="motivation-digitizing-chemical-literature">Motivation: Digitizing Chemical Literature</h2>
<ul>
<li><strong>Loss of Information</strong>: Chemical software produces structure diagrams as images; once published in the scientific literature, their chemical significance is lost, leaving the data &ldquo;dead&rdquo; to computers.</li>
<li><strong>Gap in Technology</strong>: Image mining lags behind advances in text mining. Earlier commercial solutions (such as CLIDE) were discontinued or remained limited in scope.</li>
<li><strong>Scale of Problem</strong>: The colossal production of chemical documents requires automated tools to exploit this information at large scale.</li>
</ul>
<h2 id="core-innovation-graph-preserving-vectorization">Core Innovation: Graph-Preserving Vectorization</h2>
<ul>
<li><strong>Graph-Preserving Vectorization</strong>: The system uses a custom vectorizer designed to preserve the &ldquo;graph characteristics&rdquo; of chemical diagrams (1 vector = 1 line), which avoids creating spurious vectors at thick joints. It aims to generate a mathematical graph, $G = (V, E)$, mapped geometrically to the image lines.</li>
<li><strong>Chemical Knowledge Integration</strong>: A distinct module validates the reconstructed graph against chemical rules (valences, charges) to ensure the output is chemically valid.</li>
<li><strong>Hybrid Processing</strong>: The system splits the image into &ldquo;connected components&rdquo; for an OCR path (text/symbols) and a &ldquo;body&rdquo; path (bonds), reassembling them later.</li>
</ul>
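The chemical-knowledge validation step can be sketched as a valence check over the reconstructed graph $G = (V, E)$: the summed bond orders at each atom must not exceed that element's allowed valence. The valence table and graph encoding below are illustrative assumptions, not the paper's rule set.

```python
# Sketch of valence validation for a reconstructed molecular graph.
# atoms: {atom_id: element symbol}; bonds: [(i, j, bond_order)].

MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def valid_graph(atoms, bonds):
    """True if every atom's total bond order respects its maximum valence."""
    degree = {i: 0 for i in atoms}
    for i, j, order in bonds:
        degree[i] += order
        degree[j] += order
    return all(degree[i] <= MAX_VALENCE[el] for i, el in atoms.items())

# Ethanol heavy-atom skeleton: C-C-O, all single bonds.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1, 1), (1, 2, 1)]
print(valid_graph(atoms, bonds))                             # True
print(valid_graph(atoms, bonds + [(0, 2, 2), (0, 2, 2)]))    # False: C overloaded
```

A graph that fails such a check signals a recognition error upstream (e.g., a missed atom label or a spurious bond), which is exactly what the validation module is meant to catch.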
<h2 id="methodology--experiments-benchmark-validation">Methodology &amp; Experiments: Benchmark Validation</h2>
<p>The authors performed a quantitative validation using <strong>three different databases</strong> where ground-truth SDF files were available. They also compared their system against the commercial tool <strong>CLIDE</strong> (Chemical Literature Data Extraction).</p>
<ul>
<li><strong>Database 1</strong>: 100 images (varied line widths/fonts)</li>
<li><strong>Database 2</strong>: 100 images</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale batch processing)</li>
</ul>
<h2 id="results--conclusions-superior-accuracy-over-baselines">Results &amp; Conclusions: Superior Accuracy over Baselines</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved <strong>94%</strong> correct reconstruction on Database 1 and <strong>77%</strong> on Database 2. Accuracy was measured as correct recovery of identical geometry and connections.</li>
</ul>
<p>$$ \text{Acc} = \frac{\text{Correct Images}}{\text{Total Images}} $$</p>
<ul>
<li><strong>Baseline Superiority</strong>: The commercial tool CLIDE successfully reconstructed only ~50% of the images in Database 1, compared to the authors&rsquo; 94%.</li>
<li><strong>Scalability</strong>: On the large dataset (Database 3), the system achieved <strong>67%</strong> accuracy in batch mode.</li>
<li><strong>Robustness</strong>: The authors state the system uses a handful of parameters and works robustly across different image types. CLIDE lacked flexibility and required manual intervention.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Formally Reproducible. As is common with applied research from this era, the source code, the trained SVM classifier, and the specific benchmark datasets do not appear to be publicly maintained or available.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">No public code, models, or datasets were released with this 2007 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Used for comparison with CLIDE; 94% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>77% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale test; 67% success rate</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper outlines a 5-module pipeline:</p>
<ol>
<li><strong>Pre-processing</strong>: Adaptive histogram binarization and non-recursive connected component labeling using RLE segments.</li>
<li><strong>OCR</strong>: A &ldquo;chemically oriented OCR&rdquo; using wavelet functions for feature extraction and a <strong>Support Vector Machine (SVM)</strong> for classification. It distinguishes characters from molecular structure.</li>
<li><strong>Vectorizer</strong>: Assigns local directions to RLE segments and groups them into patterns. Crucially, it enforces a one-to-one mapping between image lines and graph vectors.</li>
<li><strong>Reconstruction</strong>: A rule-based module that annotates vectors:
<ul>
<li><strong>Stereochemistry</strong>: Registers vectors against original pixels; thick geometric forms (triangles) become chiral wedges.</li>
<li><strong>Dotted Bonds</strong>: Identifies isolated vectors and clusters them using <strong>quadtree clustering</strong>.</li>
<li><strong>Multi-bonds</strong>: Identifies parallel vectors within a dilated bounding box (factor of 2).</li>
</ul>
</li>
<li><strong>Chemical Knowledge</strong>: Validates the graph valences and properties before exporting SDF.</li>
</ol>
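<p>As an illustration of the multi-bond rule in step 4, the sketch below tests whether two vectors are near-parallel and whether one lies inside the other's bounding box dilated by the stated factor of 2. The angle tolerance and the minimum padding (needed so axis-aligned segments do not produce a degenerate box) are assumptions, not values from the paper.</p>

```python
import math

# Illustrative test for merging parallel vectors into a multi-bond: two
# vectors pair up if they are near-parallel and one fits inside the other's
# bounding box dilated by a factor of 2. Angle tolerance and minimum
# padding are assumptions made for this sketch.

def direction(seg):
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi  # orientation, sense ignored

def dilated_bbox(seg, factor=2.0):
    (x1, y1), (x2, y2) = seg
    pad = 0.3 * math.dist((x1, y1), (x2, y2))  # keeps the box non-degenerate
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw = max(factor * abs(x2 - x1) / 2, pad)
    hh = max(factor * abs(y2 - y1) / 2, pad)
    return cx - hw, cy - hh, cx + hw, cy + hh

def is_multibond_pair(a, b, angle_tol=0.15):
    d = abs(direction(a) - direction(b))
    parallel = min(d, math.pi - d) < angle_tol
    xmin, ymin, xmax, ymax = dilated_bbox(a)
    return parallel and all(xmin <= x <= xmax and ymin <= y <= ymax
                            for x, y in b)

# Two closely spaced parallel strokes read as a double bond:
assert is_multibond_pair(((0, 0), (10, 0)), ((0, 2), (10, 2)))
# A perpendicular stroke does not:
assert not is_multibond_pair(((0, 0), (10, 0)), ((0, 0), (0, 10)))
```

<p>The same pairwise test, applied transitively, would distinguish double from triple bonds by the number of merged vectors.</p>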
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM</strong>: Used in the OCR module to classify text/symbols. It supports dynamic training to correct classification mistakes.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the percentage of correctly reconstructed images (generating a valid, matching SDF file).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>System Value (DB1)</th>
          <th>Baseline (CLIDE)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reconstruction Accuracy</td>
          <td><strong>94%</strong></td>
          <td>~50%</td>
          <td>CLIDE noted as unsuitable for batch processing</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., &amp; Hofmann-Apitius, M. (2007). Automatic Recognition of Chemical Images. <em>Eighth Mexican International Conference on Current Trends in Computer Science</em>, 41-46. <a href="https://doi.org/10.1109/ENC.2007.25">https://doi.org/10.1109/ENC.2007.25</a></p>
<p><strong>Publication</strong>: ENC 2007 (IEEE Computer Society)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriAutomaticRecognitionChemical2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic {{Recognition}} of {{Chemical Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Eighth {{Mexican International Conference}} on {{Current Trends}} in {{Computer Science}} ({{ENC}} 2007)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{41--46}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ENC.2007.25}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInfty: Chemical Structure Recognition in Patent Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</guid><description>Fujiyoshi et al.'s segment-based approach for recognizing chemical structures in challenging Japanese patent images with touching characters and broken lines.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fujiyoshi, A., Nakagawa, K., &amp; Suzuki, M. (2011). Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty. <em>Pre-Proceedings of the 9th IAPR International Workshop on Graphics Recognition, GREC.</em></p>
<p><strong>Publication</strong>: GREC 2011 (Graphics Recognition Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.sciaccess.net/en/InftyReader/">InftyReader Project</a></li>
</ul>
<h2 id="contribution-segment-based-ocsr-method">Contribution: Segment-Based OCSR Method</h2>
<p>This is a <strong>method paper</strong> that introduces ChemInfty, a rule-based system for Optical Chemical Structure Recognition (OCSR) specifically designed to handle the challenging, low-quality images found in Japanese patent applications.</p>
<h2 id="motivation-the-challenge-of-degraded-patent-images">Motivation: The Challenge of Degraded Patent Images</h2>
<p>The motivation is straightforward: Japanese patent applications contain a massive amount of chemical knowledge, but the images are remarkably poor quality. Unlike the relatively clean molecular diagrams in scientific papers, patent images suffer from multiple problems that break conventional OCSR systems.</p>
<p>The authors quantified these issues in a sample of 200 patent images and found that 22% contained touching characters (where atom labels merge together), 19.5% had characters touching bond lines, and 8.5% had broken lines. These are not edge cases; they are pervasive enough to cripple existing recognition tools. Established systems like CLIDE, ChemReader, and OSRA struggle significantly with line-touching characters and broken lines, leading to recognition failures.</p>
<p>The challenge is compounded by the sheer diversity of creation methods. Some structures are drawn with sophisticated molecular editors, others with basic paint programs, and some are even handwritten. This means there&rsquo;s no standardization in fonts, character sizes, or line thickness. Add in the effects of scanning and faxing, and you have images with significant noise, distortion, and degradation.</p>
<p>The goal of ChemInfty is to build a system robust enough to handle these messy real-world conditions and make Japanese patent chemistry computer-searchable.</p>
<h2 id="core-innovation-segment-decomposition-and-dynamic-programming">Core Innovation: Segment Decomposition and Dynamic Programming</h2>
<p>The novelty lies in a segment-based decomposition approach that separates the recognition problem into manageable pieces before attempting to classify them. The key insight is that traditional OCR fails on these images because characters and lines are physically merged. You cannot recognize a character if you cannot cleanly separate it from the surrounding bonds first.</p>
<p>ChemInfty&rsquo;s approach has several distinctive elements:</p>
<ol>
<li>
<p><strong>Line and Curve Segmentation</strong>: The system first decomposes the image into smaller line and curve segments. The decomposition happens at natural breakpoints&mdash;crossings, sharp bends, and other locations where touching is likely to occur. This creates a set of primitive elements that can be recombined in different ways.</p>
</li>
<li>
<p><strong>Linear Order Assumption for Scalability</strong>: To make the dynamic programming approach computationally tractable and avoid combinatorial explosion, the system assumes that segments to be combined are adjacent when sorted in one of four directional orderings ($\perp, \setminus, \triangle, \rightarrow$). This constraint dramatically reduces the search space while still capturing the natural spatial relationships in chemical diagrams.</p>
</li>
<li>
<p><strong>Dynamic Programming for Segment Combination</strong>: Once the image is decomposed, the system faces a combinatorial problem: which segments should be grouped together to form characters, and which should be classified as bonds? The authors use dynamic programming to efficiently search for the &ldquo;most suitable combination&rdquo; of segments. This optimization finds the configuration that maximizes the likelihood of valid chemical structure elements.</p>
</li>
<li>
<p><strong>Two-Pass OCR Strategy</strong>: ChemInfty integrates with InftyReader, a powerful OCR engine. The system uses OCR twice in the pipeline:</p>
<ul>
<li><strong>First pass</strong>: High-confidence character recognition removes obvious atom labels early, simplifying the remaining image</li>
<li><strong>Second pass</strong>: After the segment-based method identifies and reconstructs difficult character regions, OCR is applied again to the cleaned-up character image</li>
</ul>
<p>This two-stage approach handles both easy and hard cases effectively: simple characters are recognized immediately, while complex cases get special treatment.</p>
</li>
<li>
<p><strong>Image Thinning for Structure Analysis</strong>: Before segmentation, the system thins the remaining graphical elements (after removing high-confidence characters) to skeleton lines. This thinning operation reveals the underlying topological structure&mdash;crossings, bends, and endpoints&mdash;making it easier to detect where segments should be divided.</p>
</li>
<li>
<p><strong>Proximity-Based Grouping</strong>: After identifying potential character segments, the system groups nearby segments together. This spatial clustering ensures that parts of the same character that were separated by bonds get recombined correctly.</p>
</li>
</ol>
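<p>The grouping search in items 2 and 3 above can be sketched as a standard dynamic program over a sorted segment list: every partition into contiguous runs is considered, and the partition with the highest summed score wins. Since the paper never defines its scoring heuristic, the <code>measure</code> function here is a caller-supplied stand-in.</p>

```python
def best_grouping(segments, measure):
    """Partition `segments` (pre-sorted along one of the four directional
    orderings) into contiguous groups, maximizing the summed group score.
    `measure(group)` stands in for the paper's undefined Measure(S'),
    which scores a candidate character/bond."""
    n = len(segments)
    best = [(0.0, [])] + [None] * n  # best[i]: optimum over segments[:i]
    for i in range(1, n + 1):
        for j in range(i):  # candidate last group: segments[j:i]
            prev_score, prev_groups = best[j]
            score = prev_score + measure(segments[j:i])
            if best[i] is None or score > best[i][0]:
                best[i] = (score, prev_groups + [segments[j:i]])
    return best[n]

# Toy measure: strongly favor two-segment groups (e.g. two strokes of a
# character), weakly score anything else.
pairs_score = lambda g: 100 if len(g) == 2 else 10
score, groups = best_grouping(["a", "b", "c", "d"], pairs_score)
assert score == 200 and groups == [["a", "b"], ["c", "d"]]
```

<p>In the full system this search would be repeated once per directional ordering with the best result kept; the linear-order assumption is exactly what makes a contiguous-run DP valid in the first place.</p>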
<h2 id="methodology-real-world-patent-evaluation">Methodology: Real-World Patent Evaluation</h2>
<p>The evaluation focused on demonstrating that ChemInfty could handle real-world patent images at scale:</p>
<ol>
<li>
<p><strong>Large-Scale Patent Dataset</strong>: The system was tested on chemical structure images from Japanese patent applications published in 2008. This represents a realistic deployment scenario with all the messiness of actual documents.</p>
</li>
<li>
<p><strong>Touching Character Separation</strong>: The authors specifically measured the system&rsquo;s ability to separate characters from bonds when they were touching. Success was defined as cleanly extracting the character region so that OCR could recognize it.</p>
</li>
<li>
<p><strong>Recognition Accuracy by Object Type</strong>: Performance was broken down by element type (characters, line segments, solid wedges, and hashed wedges). This granular analysis revealed which components were easier or harder for the system to handle.</p>
</li>
<li>
<p><strong>End-to-End Performance</strong>: The overall recognition ratio was calculated across all object types to establish the system&rsquo;s practical utility for automated patent processing.</p>
</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Effective Separation for Line-Touching Characters</strong>: The segment-based method successfully separated 63.5% of characters that were touching bond lines. This is a substantial improvement over standard OCR, which typically fails completely on such cases. The authors note that when image quality is reasonable, the separation method works well.</p>
</li>
<li>
<p><strong>Strong Overall Character Recognition</strong>: Character recognition achieved 85.86% accuracy, which is respectable given the poor quality of the input images. Combined with the 90.73% accuracy for line segments, this demonstrates the system can reliably reconstruct the core molecular structure.</p>
</li>
<li>
<p><strong>Weak Performance on Wedges</strong>: The system struggled significantly with stereochemistry notation. Solid wedges were correctly recognized only 52.54% of the time, and hashed wedges fared even worse at 23.63%. This is a critical limitation since stereochemistry is often essential for understanding molecular properties.</p>
</li>
<li>
<p><strong>Image Quality Dependency</strong>: The authors acknowledge that the method&rsquo;s effectiveness is ultimately limited by image quality. When images are severely degraded (blurred to the point where even humans struggle to distinguish characters from noise), the segmentation approach cannot reliably separate touching elements.</p>
</li>
<li>
<p><strong>Overall System Performance</strong>: The combined recognition ratio of 86.58% for all objects indicates that ChemInfty is a working system but not yet production-ready. The authors conclude that further refinement is necessary, particularly for wedge recognition and handling extremely low-quality images.</p>
</li>
</ul>
<p>The work establishes that segment-based decomposition with dynamic programming is a viable approach for handling the specific challenges of patent image OCSR. The two-pass OCR strategy and the use of image thinning to reveal structure are practical engineering solutions that improve robustness. However, the results also highlight that rule-based methods are fundamentally limited by image quality. There is only so much you can do with algorithmic cleverness when the input is severely degraded. This limitation would motivate later work on deep learning approaches that can learn robust feature representations from large datasets.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="technical-paradigm">Technical Paradigm</h3>
<p><strong>This is a pre-deep learning (2011) classical computer vision paper.</strong> The system uses rule-based methods and traditional OCR engines, not neural networks.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>InftyReader</strong>: A mathematical OCR engine used for the initial high-confidence character recognition pass. This is a pre-existing external tool.</li>
<li><strong>DEF-based OCR</strong>: A standard OCR engine based on Directional Element Features (DEF). These are manually engineered statistical features (histograms of edge directions), not learned neural network features.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a multi-step recognition pipeline:</p>
<ol>
<li><strong>Preprocessing</strong>: Binarization and smoothing</li>
<li><strong>Initial Character Removal</strong>: High-confidence characters are recognized by the InftyReader OCR engine and removed from the image to simplify segmentation</li>
<li><strong>Skeletonization</strong>: Thinning using <strong>Hilditch&rsquo;s algorithm</strong> to skeletonize graphical elements, revealing topological structure (crossings, bends, endpoints)</li>
<li><strong>Feature Point Detection</strong>:
<ul>
<li><strong>Crossing points</strong>: Direct detection on skeleton</li>
<li><strong>Bending points</strong>: Detected using the <strong>Hough transformation</strong></li>
</ul>
</li>
<li><strong>Dynamic Programming Search</strong>:
<ul>
<li><strong>Input</strong>: Set of line/curve segments $S$</li>
<li><strong>Procedure</strong>: Sort segments in 4 directions ($\perp, \setminus, \triangle, \rightarrow$). For each direction, use DP to find the grouping that minimizes a heuristic score</li>
<li><strong>Complexity</strong>: $O(n^2)$ where $n$ is the number of segments</li>
<li><strong>Scoring</strong>: Uses a function <code>Measure(S')</code> that returns a score (0-100) indicating if a subset of segments forms a valid character or bond</li>
</ul>
</li>
</ol>
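<p>Step 4's crossing detection can be approximated by counting 8-connected neighbors on an already-thinned skeleton (the thinning itself, e.g. Hilditch's algorithm, is assumed to have run upstream): a skeleton pixel with three or more foreground neighbors is a branch/crossing candidate, and one with exactly one neighbor is an endpoint. This grid-based sketch is illustrative, not the paper's implementation.</p>

```python
# Crossing/endpoint detection on an already-thinned binary skeleton
# (thinning, e.g. Hilditch's algorithm, is assumed done upstream).
# Illustrative sketch, not the paper's implementation.

def neighbors8(img, r, c):
    """Count foreground pixels among the 8 neighbors of (r, c)."""
    h, w = len(img), len(img[0])
    return sum(
        1
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
        and 0 <= r + dr < h and 0 <= c + dc < w
        and img[r + dr][c + dc]
    )

def feature_points(img):
    """Classify skeleton pixels: >= 3 neighbors -> crossing candidate,
    exactly 1 neighbor -> endpoint."""
    crossings, endpoints = [], []
    for r, row in enumerate(img):
        for c, v in enumerate(row):
            if v:
                n = neighbors8(img, r, c)
                if n >= 3:
                    crossings.append((r, c))
                elif n == 1:
                    endpoints.append((r, c))
    return crossings, endpoints

# A straight 1-pixel line has two endpoints and no crossings:
assert feature_points([[1, 1, 1, 1]]) == ([], [(0, 0), (0, 3)])
```

<p>On real skeletons, pixels adjacent to a junction can also exceed the neighbor threshold, so crossing candidates cluster and must be merged; bending points, which a neighbor count cannot see, are what the Hough-transform step handles.</p>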
<p>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined in the paper, limiting replicability.</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Dataset</strong>: Chemical structure images from Japanese patent applications published in 2008. The complete 2008 dataset contains 229,969 total images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Japanese Published Patent Applications (2008)</td>
          <td>1,599 images</td>
          <td>Contains 229,969 total images for the year. Format: TIFF, 200-400 dpi.</td>
      </tr>
      <tr>
          <td>Analysis</td>
          <td>Random subset for frequency analysis</td>
          <td>200 images</td>
          <td>Used to estimate frequency of touching/broken characters (found in ~20% of images).</td>
      </tr>
  </tbody>
</table>
<p><strong>No Training Set</strong>: The system is rule-based and uses pre-built OCR engines, so no model training was performed.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: Recognition ratio (percentage of correctly recognized objects)</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Line-touching Separation</td>
          <td>63.5%</td>
          <td>Success rate for separating text glued to lines</td>
      </tr>
      <tr>
          <td>Character Recognition</td>
          <td>85.86%</td>
          <td>For all character sizes</td>
      </tr>
      <tr>
          <td>Line segments</td>
          <td>90.73%</td>
          <td>Standard bond recognition</td>
      </tr>
      <tr>
          <td>Solid Wedge Recognition</td>
          <td>52.54%</td>
          <td>Low performance noted as area for improvement</td>
      </tr>
      <tr>
          <td>Hashed Wedges</td>
          <td>23.63%</td>
          <td>Poorest performing element type</td>
      </tr>
      <tr>
          <td>Overall</td>
          <td>86.58%</td>
          <td>Combined across all object types</td>
      </tr>
  </tbody>
</table>
<p><strong>Total Objects Evaluated</strong>: 742,287 objects (characters, line segments, solid wedges, hashed wedges) extracted from the patent images.</p>
<h3 id="hardware">Hardware</h3>
<p>Not reported. Computational cost was not a primary concern for this classical CV system.</p>
<h3 id="replicability">Replicability</h3>
<p><strong>Low.</strong> The paper does not provide sufficient detail for full replication:</p>
<ul>
<li>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined</li>
<li>Dependency on the proprietary/specialized InftyReader engine</li>
<li>No pseudocode provided for the segment decomposition heuristics</li>
</ul>
<h3 id="notes-on-wedge-recognition">Notes on Wedge Recognition</h3>
<p>The system&rsquo;s poor performance on solid wedges (52.54%) and hashed wedges (23.63%) reflects a fundamental challenge for classical thinning algorithms. Wedge bonds are dense triangular regions that indicate 3D stereochemistry. When skeletonized using algorithms like Hilditch&rsquo;s method, these &ldquo;blob&rdquo; shapes often distort into unrecognizable patterns, unlike the clean thin lines that represent regular bonds.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fujiyoshiRobustMethodSegmentation2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Robust {{Method}} of {{Segmentation}} and {{Recognition}} of {{Chemical Structure Images}} in {{ChemInfty}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fujiyoshi, Akio and Nakagawa, Koji and Suzuki, Masakazu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2011</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Pre-proceedings of the 9th IAPR international workshop on graphics recognition, GREC}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>