<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Optical Chemical Structure Recognition on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/</link><description>Recent content in Optical Chemical Structure Recognition on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/index.xml" rel="self" type="application/rss+xml"/><item><title>MarkushGrapher-2: End-to-End Markush Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</guid><description>MarkushGrapher-2 fuses vision, text, and layout encoders with a dedicated OCR module for end-to-end Markush structure recognition from patent images.</description><content:encoded><![CDATA[<h2 id="a-multimodal-method-for-markush-structure-recognition">A Multimodal Method for Markush Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper that introduces MarkushGrapher-2, a universal encoder-decoder model for recognizing both standard molecular structures and multimodal Markush structures from chemical images. The primary contribution is a dual-encoder architecture that fuses a pretrained OCSR (Optical Chemical Structure Recognition) vision encoder with a Vision-Text-Layout (VTL) encoder, connected through a dedicated ChemicalOCR module for end-to-end processing. The paper also introduces two new resources: a large-scale training dataset (USPTO-MOL-M) of real-world Markush structures extracted from USPTO patent MOL files, and IP5-M, a manually annotated benchmark of 1,000 Markush structures from five major patent offices.</p>
<h2 id="why-markush-structure-recognition-remains-challenging">Why Markush Structure Recognition Remains Challenging</h2>
<p><a href="https://en.wikipedia.org/wiki/Markush_structure">Markush structures</a> are compact representations used in patent documents to describe families of related molecules. They combine a visual backbone (atoms, bonds, variable regions) with textual definitions of substituents that can replace those variable regions. This multimodal nature makes them harder to parse than standard molecular diagrams.</p>
<p>Three factors limit automatic Markush recognition. First, visual styles vary across patent offices and publication years. Second, textual definitions lack standardization and often contain conditional or recursive descriptions. Third, real-world training data with comprehensive annotations is scarce. As a result, Markush structures are currently indexed only in two proprietary, manually curated databases: MARPAT and DWPIM.</p>
<p>Prior work, including the original <a href="/notes/chemistry/optical-structure-recognition/markush/markushgrapher/">MarkushGrapher</a>, required pre-annotated OCR outputs at inference time, limiting practical deployment. General-purpose models like GPT-5 and DeepSeek-OCR produce mostly chemically invalid outputs on Markush images, suggesting these lie outside their training distribution.</p>
<h2 id="dual-encoder-architecture-with-dedicated-chemicalocr">Dual-Encoder Architecture with Dedicated ChemicalOCR</h2>
<p>MarkushGrapher-2 uses two complementary encoding pipelines:</p>
<ol>
<li>
<p><strong>Vision encoder pipeline</strong>: The input image passes through a Swin-B Vision Transformer (taken from <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a>) pretrained for OCSR. This encoder extracts visual features representing molecular structures and remains frozen during training.</p>
</li>
<li>
<p><strong>Vision-Text-Layout (VTL) pipeline</strong>: The same image goes through ChemicalOCR, a compact 256M-parameter vision-language model fine-tuned from SmolDocling for OCR on chemical images. ChemicalOCR extracts character-level text and bounding boxes. These, combined with image patches, feed into a T5-base VTL encoder following the UDOP fusion paradigm, where visual and textual tokens are spatially aligned by bounding box overlap.</p>
</li>
</ol>
<p>The VTL encoder output is concatenated with projected embeddings from the vision encoder. This joint representation feeds a text decoder that auto-regressively generates a CXSMILES (ChemAxon Extended <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) string describing the backbone structure and a substituent table listing variable group definitions.</p>
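<p>To make the output format concrete, here is a minimal, illustrative parser for the CXSMILES surface syntax. It assumes only the common <code>|$...$|</code> atom-label extension (real CXSMILES defines many more extension fields, and this is not the paper's code):</p>

```python
def split_cxsmiles(cxsmiles: str):
    """Split a CXSMILES string into its core SMILES and extension block.

    CXSMILES appends extensions after the SMILES, wrapped in '|...|',
    e.g. '*C1=CC=CC=C1 |$R1;;;;;;$|' labels the '*' atom as R1.
    """
    smiles, sep, rest = cxsmiles.partition(" |")
    if not sep or not rest.endswith("|"):
        return cxsmiles, None  # plain SMILES, no extension block
    return smiles, rest[:-1]


def atom_labels(extension: str):
    """Extract per-atom labels from the '$...$' section of an extension."""
    if extension is None or "$" not in extension:
        return []
    inner = extension.split("$")[1]
    return inner.split(";")  # one entry per atom, empty when unlabeled


core, ext = split_cxsmiles("*C1=CC=CC=C1 |$R1;;;;;;$|")
print(core)              # *C1=CC=CC=C1
print(atom_labels(ext))  # ['R1', '', '', '', '', '', '']
```

<p>The substituent table generated alongside the CXSMILES then maps labels such as <code>R1</code> to their textual definitions.</p>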
<h3 id="two-stage-training-strategy">Two-Stage Training Strategy</h3>
<p>Training proceeds in two phases:</p>
<ul>
<li>
<p><strong>Phase 1 (Adaptation)</strong>: The vision encoder is frozen. The MLP projector and text decoder train on 243K real-world image-SMILES pairs from MolScribe&rsquo;s USPTO dataset (3 epochs). This aligns the decoder to the pretrained OCSR feature space.</p>
</li>
<li>
<p><strong>Phase 2 (Fusion)</strong>: The vision encoder, projector, and ChemicalOCR are all frozen. The VTL encoder and text decoder train on a mix of 235K synthetic and 145K real-world Markush samples (2 epochs). The VTL encoder learns the features needed for CXSMILES and substituent table prediction without disrupting the established OCSR representations.</p>
</li>
</ul>
<p>The total model has 831M parameters, of which 744M are trainable.</p>
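<p>The schedule above can be summarized as a small configuration table. The component names and the "unused" entries are my reading of the paper, not the authors' code:</p>

```python
# Illustrative summary of the two-phase freeze schedule described above.
FREEZE_SCHEDULE = {
    "phase_1": {  # adaptation: align projector + decoder to frozen OCSR features
        "vision_encoder": "frozen",
        "mlp_projector": "trained",
        "text_decoder": "trained",
        "vtl_encoder": "unused",
        "chemical_ocr": "unused",
    },
    "phase_2": {  # fusion: learn VTL features without disturbing OCSR ones
        "vision_encoder": "frozen",
        "mlp_projector": "frozen",
        "text_decoder": "trained",
        "vtl_encoder": "trained",
        "chemical_ocr": "frozen",
    },
}

def trained_components(phase: str):
    """Components updated by the optimizer in a given phase."""
    return sorted(k for k, v in FREEZE_SCHEDULE[phase].items() if v == "trained")

print(trained_components("phase_1"))  # ['mlp_projector', 'text_decoder']
print(trained_components("phase_2"))  # ['text_decoder', 'vtl_encoder']
```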
<h2 id="datasets-and-evaluation-benchmarks">Datasets and Evaluation Benchmarks</h2>
<h3 id="training-data">Training Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical structures</td>
          <td>235K</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> SMILES augmented to CXSMILES, rendered with annotations</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>Manual OCR annotations</td>
          <td>7K</td>
          <td>IP5 patent document crops</td>
      </tr>
      <tr>
          <td>Phase 1 (OCSR)</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>Synthetic CXSMILES</td>
          <td>235K</td>
          <td>Same as OCR pretraining set</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>MolParser dataset</td>
          <td>91K</td>
          <td>Real-world Markush, converted to CXSMILES</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>USPTO-MOL-M</td>
          <td>54K</td>
          <td>Real-world, auto-extracted from USPTO MOL files (2010-2025)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-benchmarks">Evaluation Benchmarks</h3>
<p><strong>Markush benchmarks</strong>: M2S (103 samples), USPTO-M (74), WildMol-M (10K, semi-manually annotated), and the new IP5-M (1,000 samples manually annotated from USPTO, JPO, KIPO, CNIPA, and EPO patents, 1980-2025).</p>
<p><strong>OCSR benchmarks</strong>: USPTO (5,719), JPO (450), UOB (5,740), WildMol (10K).</p>
<p>The primary metric is <strong>CXSMILES Accuracy (A)</strong>: a prediction is correct when (1) the predicted SMILES matches the ground truth by <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChIKey</a> equivalence, and (2) all Markush features (variable groups, positional and frequency variation indicators) are correctly represented. Stereochemistry is ignored during evaluation.</p>
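<p>A sketch of how this metric could be computed. The key and feature extractors are caller-supplied stand-ins for InChIKey conversion and Markush-feature extraction (both would come from a cheminformatics toolkit in practice; this is an assumption, not the paper's evaluation code):</p>

```python
def cxsmiles_accuracy(predictions, references, backbone_key, markush_features):
    """CXSMILES accuracy: a prediction counts as correct only when the
    backbone matches by key equivalence AND the Markush feature sets match.
    `backbone_key` and `markush_features` are stand-in callables."""
    correct = sum(
        backbone_key(p) == backbone_key(r) and markush_features(p) == markush_features(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Toy stand-ins: backbone = text before the extension block, features =
# the labels inside it. Real evaluation would use InChIKey equivalence.
key = lambda s: s.split(" |")[0]
feats = lambda s: frozenset(s.split(" |")[1].strip("|$").split(";")) if " |" in s else frozenset()

preds = ["*CC |$R1$|", "CCO"]
refs = ["*CC |$R1$|", "CCC"]
print(cxsmiles_accuracy(preds, refs, key, feats))  # 0.5
```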
<h3 id="results-markush-structure-recognition">Results: Markush Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S</th>
          <th>USPTO-M</th>
          <th>WildMol-M</th>
          <th>IP5-M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>39</td>
          <td>30</td>
          <td>38.1</td>
          <td>47.7</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>21</td>
          <td>7</td>
          <td>28.1</td>
          <td>22.3</td>
      </tr>
      <tr>
          <td>GPT-5</td>
          <td>3</td>
          <td>0</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>0</td>
          <td>0</td>
          <td>1.9</td>
          <td>0.0</td>
      </tr>
      <tr>
          <td>MarkushGrapher-1</td>
          <td>38</td>
          <td>10</td>
          <td>32</td>
          <td>-</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td><strong>56</strong></td>
          <td><strong>13</strong></td>
          <td><strong>55</strong></td>
          <td><strong>48.0</strong></td>
      </tr>
  </tbody>
</table>
<p>On M2S, MarkushGrapher-2 achieves 56% CXSMILES accuracy vs. 38% for MarkushGrapher-1, a relative improvement of 47%. On WildMol-M (the largest benchmark at 10K samples), MarkushGrapher-2 reaches 55% vs. 38.1% for MolParser-Base and 32% for MarkushGrapher-1. GPT-5 and DeepSeek-OCR generate mostly chemically invalid outputs on Markush images: only 30% and 15% of their predictions are valid CXSMILES on M2S, respectively.</p>
<h3 id="results-standard-molecular-structure-recognition">Results: Standard Molecular Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WildMol</th>
          <th>JPO</th>
          <th>UOB</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>76.9</td>
          <td>78.9</td>
          <td>91.8</td>
          <td>93.0</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>66.4</td>
          <td>76.2</td>
          <td>87.4</td>
          <td>93.1</td>
      </tr>
      <tr>
          <td>DECIMER 2.7</td>
          <td>56.0</td>
          <td>64.0</td>
          <td>88.3</td>
          <td>59.9</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/">MolGrapher</a></td>
          <td>45.5</td>
          <td>67.5</td>
          <td>94.9</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>25.8</td>
          <td>31.6</td>
          <td>78.7</td>
          <td>36.9</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td>68.4</td>
          <td>71.0</td>
          <td><strong>96.6</strong></td>
          <td>89.8</td>
      </tr>
  </tbody>
</table>
<p>MarkushGrapher-2 achieves the highest score on UOB (96.6%) and remains competitive on other OCSR benchmarks, despite being primarily optimized for Markush recognition.</p>
<h3 id="chemicalocr-vs-general-ocr">ChemicalOCR vs. General OCR</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S F1</th>
          <th>USPTO-M F1</th>
          <th>IP5-M F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PaddleOCR v5</td>
          <td>7.7</td>
          <td>1.2</td>
          <td>1.9</td>
      </tr>
      <tr>
          <td>EasyOCR</td>
          <td>10.2</td>
          <td>18.0</td>
          <td>18.4</td>
      </tr>
      <tr>
          <td><strong>ChemicalOCR</strong></td>
          <td><strong>87.2</strong></td>
          <td><strong>93.0</strong></td>
          <td><strong>86.5</strong></td>
      </tr>
  </tbody>
</table>
<p>General-purpose OCR tools fail on chemical images because they misinterpret bonds as characters and cannot parse chemical abbreviations. ChemicalOCR outperforms both by a large margin.</p>
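<p>The OCR comparison uses bounding-box-level F1 at IoU &gt; 0.5. A minimal sketch of such a metric with greedy one-to-one box matching (the greedy strategy is an assumption; the paper only states the IoU threshold):</p>

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def box_f1(pred_boxes, gold_boxes, thr=0.5):
    """F1 from greedy one-to-one matching of predicted to gold boxes
    at IoU > thr; each gold box can be matched at most once."""
    unmatched = list(range(len(gold_boxes)))
    tp = 0
    for p in pred_boxes:
        for j in unmatched:
            if iou(p, gold_boxes[j]) > thr:
                tp += 1
                unmatched.remove(j)
                break
    prec = tp / len(pred_boxes) if pred_boxes else 0.0
    rec = tp / len(gold_boxes) if gold_boxes else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [(0, 0, 10, 10), (20, 0, 30, 10)]
pred = [(1, 0, 10, 10), (50, 0, 60, 10)]  # one good match, one spurious box
print(box_f1(pred, gold))  # 0.5
```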
<h2 id="ablation-results-and-key-findings">Ablation Results and Key Findings</h2>
<p><strong>OCR input is critical for Markush features.</strong> Without OCR, CXSMILES accuracy drops from 56% to 4% on M2S, and from 53.7% to 15.4% on IP5-M. The backbone structure accuracy ($A_{\text{InChIKey}}$) also drops substantially (from 80% to 39% on M2S), though the vision encoder alone can still recover some structural information. This confirms that textual cues (brackets, indices, variable definitions) are essential for Markush feature prediction.</p>
<p><strong>Two-phase training improves both tasks.</strong> Compared to single-phase (fusion only) training, the two-phase strategy improves CXSMILES accuracy from 44% to 50% on M2S and from 53.0% to 61.5% on JPO after the same number of epochs. Adapting the decoder to OCSR features before introducing the VTL encoder prevents the fusion process from degrading learned visual representations.</p>
<p><strong>Frequency variation indicators remain the hardest feature.</strong> On IP5-M, the per-feature breakdown shows 73.3% accuracy for backbone InChI, 74.8% for variable groups, 78.8% for positional variation, but only 30.7% for frequency variation (Sg groups). These repeating structural units are particularly challenging to represent and predict.</p>
<p><strong>Limitations</strong>: The model relies on accurate OCR as a prerequisite. Performance on USPTO-M (13% CXSMILES accuracy) lags behind other benchmarks, likely due to the older patent styles in that dataset. The paper does not report inference latency.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical images</td>
          <td>235K</td>
          <td>Generated from PubChem SMILES, augmented to CXSMILES</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>IP5 patent crops</td>
          <td>7K</td>
          <td>Manually annotated</td>
      </tr>
      <tr>
          <td>Phase 1 training</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Public, real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 training</td>
          <td>Synthetic + MolParser + USPTO-MOL-M</td>
          <td>380K</td>
          <td>Mix of synthetic (235K), MolParser (91K), USPTO-MOL-M (54K)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>M2S, USPTO-M, WildMol-M, IP5-M</td>
          <td>103 to 10K</td>
          <td>Markush benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>WildMol, JPO, UOB, USPTO</td>
          <td>450 to 10K</td>
          <td>OCSR benchmarks</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vision encoder</td>
          <td>Swin-B ViT (from MolScribe)</td>
          <td>~87M</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td>VTL encoder + decoder</td>
          <td>T5-base</td>
          <td>~744M trainable</td>
          <td>Trained</td>
      </tr>
      <tr>
          <td>ChemicalOCR</td>
          <td>SmolDocling-based VLM</td>
          <td>256M</td>
          <td>Fine-tuned, frozen in Phase 2</td>
      </tr>
      <tr>
          <td>MLP projector</td>
          <td>Linear projection</td>
          <td>-</td>
          <td>Trained in Phase 1, frozen in Phase 2</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>831M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CXSMILES Accuracy (A)</td>
          <td>Percentage of samples where InChIKey matches AND all Markush features correct</td>
      </tr>
      <tr>
          <td>$A_{\text{InChIKey}}$</td>
          <td>Backbone structure accuracy only (ignoring Markush features)</td>
      </tr>
      <tr>
          <td>Table Accuracy</td>
          <td>Percentage of correctly predicted substituent tables</td>
      </tr>
      <tr>
          <td>Markush Accuracy</td>
          <td>Joint CXSMILES + Table accuracy</td>
      </tr>
      <tr>
          <td>OCR F1</td>
          <td>Bounding-box-level precision/recall at IoU &gt; 0.5</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: NVIDIA A100 GPU</li>
<li>Phase 1: 3 epochs, Adam optimizer, lr 5e-4, 1000 warmup steps, batch size 10, weight decay 1e-3</li>
<li>Phase 2: 2 epochs, batch size 8</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation of MarkushGrapher-2 with models and datasets</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Code, models, and datasets are all publicly released under an MIT license with documented training hyperparameters and a single A100 GPU requirement.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Strohmeyer, T., Morin, L., Meijer, G. I., Weber, V., Nassar, A., &amp; Staar, P. (2026). MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>.</p>
<p><strong>Publication</strong>: CVPR 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository (MIT License)</a></li>
<li><a href="https://arxiv.org/abs/2603.28550">arXiv Preprint</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{strohmeyer2026markushgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Val\&#39;{e}ry and Nassar, Ahmed and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2603.28550}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Uni-Parser: Industrial-Grade Multi-Modal PDF Parsing (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</guid><description>Uni-Parser is a modular, multi-expert PDF parsing engine for scientific documents with integrated OCSR and chemical structure recognition.</description><content:encoded><![CDATA[<h2 id="an-industrial-grade-multi-modal-document-parser">An Industrial-Grade Multi-Modal Document Parser</h2>
<p>Uni-Parser is a modular, loosely coupled PDF parsing engine built for scientific literature and patents. It routes different content types (text, equations, tables, figures, chemical structures) to specialized expert models, then reassembles the parsed outputs into structured formats (JSON, Markdown, HTML) for downstream consumption by LLMs and other applications.</p>
<p>The system processes up to 20 PDF pages per second on 8 NVIDIA RTX 4090D GPUs and supports over 80 languages for OCR.</p>
<h2 id="a-five-stage-pipeline-architecture">A Five-Stage Pipeline Architecture</h2>
<p>The system is organized into five sequential stages:</p>
<ol>
<li><strong>Document Pre-Processing</strong>: Validates PDFs, extracts metadata, checks text accessibility, and identifies language.</li>
<li><strong>Group-based Layout Detection</strong>: Locates semantic blocks and identifies their categories using a novel tree-structured layout representation. Groups naturally paired elements (image-caption, table-title, molecule-identifier).</li>
<li><strong>Semantic Contents Parsing</strong>: Routes each block to a specialized expert: OCR for text, formula recognition for equations, table structure recognition for tables, OCSR for chemical structures, reaction extraction for reaction schemes, and chart parsing for charts. Over ten sub-models operate in parallel.</li>
<li><strong>Semantic Contents Gathering</strong>: Filters non-essential elements, reconstructs reading order, merges cross-page and multi-column content, and reintegrates inline multimodal elements.</li>
<li><strong>Output Formatting and Semantic Chunking</strong>: Exports parsed documents in task-specific formats with proper chunking for RAG and other downstream tasks.</li>
</ol>
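<p>Stage 3's expert routing amounts to dispatching each detected block to a category-specific parser. A toy sketch, with hypothetical category names and parser stubs standing in for Uni-Parser's internal experts:</p>

```python
# Hypothetical expert stubs; Uni-Parser's real experts are full models.
def parse_text(block):     return {"kind": "text", "content": block.upper()}
def parse_table(block):    return {"kind": "table", "content": block}
def parse_molecule(block): return {"kind": "molecule", "content": block}

EXPERTS = {
    "text": parse_text,
    "table": parse_table,
    "molecule": parse_molecule,
}

def route(blocks):
    """Send each (category, payload) block to its expert; unknown
    categories fall back to the text parser here."""
    return [EXPERTS.get(category, parse_text)(payload) for category, payload in blocks]

page = [("text", "abstract"), ("molecule", "c1ccccc1"), ("chart", "bars")]
print([b["kind"] for b in route(page)])  # ['text', 'molecule', 'text']
```

<p>The loose coupling means an expert can be swapped (e.g. upgrading the OCSR model) without touching the rest of the pipeline.</p>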
<h2 id="group-based-layout-detection">Group-Based Layout Detection</h2>
<p>A key contribution is the group-based layout detection model (Uni-Parser-LD), which uses a hierarchical tree structure to represent page layouts. Elements are organized into a bottom layer (parent nodes like paragraphs, tables, images) and a top layer (child nodes like captions, footnotes, identifiers). This preserves semantic associations between paired elements, such as molecules and their identifiers.</p>
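<p>The two-layer tree can be sketched as parent nodes holding their paired children, so that associations like molecule-identifier survive later pipeline stages. The node fields here are hypothetical, not Uni-Parser-LD's actual representation:</p>

```python
from dataclasses import dataclass, field

@dataclass
class LayoutNode:
    """One element in the two-layer layout tree: bottom-layer parents
    (paragraph, table, image, molecule) hold top-layer children
    (caption, footnote, identifier)."""
    category: str
    bbox: tuple          # (x1, y1, x2, y2) on the page
    children: list = field(default_factory=list)

molecule = LayoutNode("molecule", (40, 120, 260, 300))
molecule.children.append(LayoutNode("identifier", (40, 305, 260, 325)))

# The grouping keeps the molecule and its identifier attached, so the
# pairing is preserved through reading-order reconstruction downstream.
print([c.category for c in molecule.children])  # ['identifier']
```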
<p>The model is trained on 500k pages, including 220k human-annotated pages from scientific journals and patents across 85 languages. A modified DETR-based architecture was selected as the backbone after finding that RT-DETRv2, YOLOv12, and D-FINE exhibited training instability for this task.</p>
<h2 id="chemical-structure-recognition-with-molparser-15">Chemical Structure Recognition with MolParser 1.5</h2>
<p>Uni-Parser integrates MolParser 1.5 for OCSR, an end-to-end model that directly generates molecular representations from images. The authors explicitly note that graph-based (atom-bond) methods were the first direction they explored but ultimately abandoned because of:</p>
<ul>
<li>Strong reliance on rigid, hand-crafted rules that limit scalability</li>
<li>Substantially higher annotation costs (over 20x compared to end-to-end approaches)</li>
<li>Lower performance ceilings despite increasing training data</li>
</ul>
<h3 id="molecule-localization">Molecule Localization</h3>
<p>Uni-Parser-LD achieves strong molecule detection performance:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>mAP@50</th>
          <th>mAP@50-95</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (Uni-Parser Bench)</td>
          <td><strong>0.994</strong></td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.983</td>
          <td>0.919</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.974</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (BioVista Bench)</td>
          <td><strong>0.981</strong></td>
          <td><strong>0.844</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.961</td>
          <td>0.871</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.945</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td>BioMiner</td>
          <td>0.929</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.899</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h3 id="ocsr-accuracy">OCSR Accuracy</h3>
<p>MolParser 1.5 consistently outperforms prior methods across molecule types:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Full</th>
          <th>Chiral</th>
          <th>Markush</th>
          <th>All</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser 1.5</strong> (Uni-Parser Bench)</td>
          <td><strong>0.979</strong></td>
          <td><strong>0.809</strong></td>
          <td><strong>0.805</strong></td>
          <td><strong>0.886</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.953</td>
          <td>0.676</td>
          <td>0.664</td>
          <td>0.800</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.617</td>
          <td>0.274</td>
          <td>0.168</td>
          <td>0.417</td>
      </tr>
      <tr>
          <td><strong>MolParser 1.5</strong> (BioVista Bench)</td>
          <td><strong>0.795</strong></td>
          <td><strong>0.604</strong></td>
          <td><strong>0.761</strong></td>
          <td><strong>0.780</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.669</td>
          <td>0.352</td>
          <td>0.733</td>
          <td>0.703</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.774</td>
          <td>0.497</td>
          <td>0.185</td>
          <td>0.507</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.703</td>
          <td>0.481</td>
          <td>0.156</td>
          <td>0.455</td>
      </tr>
      <tr>
          <td>MolNexTR</td>
          <td>0.695</td>
          <td>0.419</td>
          <td>0.045</td>
          <td>0.401</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>0.545</td>
          <td>0.326</td>
          <td>0.000</td>
          <td>0.298</td>
      </tr>
  </tbody>
</table>
<p>Chiral molecule recognition remains a significant challenge and is identified as a key area for future work.</p>
<h2 id="document-parsing-benchmarks">Document Parsing Benchmarks</h2>
<p>On the Uni-Parser Benchmark (150 PDFs, 2,887 pages from patents and scientific articles), Uni-Parser (HQ mode) achieves an overall score of 89.74 (excluding molecules), outperforming both pipeline tools (MinerU, PP-StructureV3) and specialized VLMs (MinerU2-VLM, DeepSeek-OCR, PaddleOCR-VL). Competing systems score zero on molecule localization and OCSR because they lack molecular recognition capabilities.</p>
<p>On the general-document OmniDocBench-1.5, a variant (Uni-Parser-G) using a swapped layout module achieves 89.75 overall, competitive with top-performing specialized VLMs.</p>
<h2 id="comparison-with-ocsr-enabled-pdf-parsers">Comparison with OCSR-Enabled PDF Parsers</h2>
<p>On a controlled test set of 141 simple molecules, Uni-Parser outperforms other PDF parsing systems with OCSR support:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Recall</th>
          <th>OCSR Success</th>
          <th>OCSR Acc</th>
          <th>Id Match</th>
          <th>Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>96.5%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>1.8s</strong></td>
      </tr>
      <tr>
          <td>MathPix</td>
          <td>100%</td>
          <td>75.9%</td>
          <td>59.6%</td>
          <td>-</td>
          <td>66.1s</td>
      </tr>
      <tr>
          <td>MinerU.Chem</td>
          <td>66.7%</td>
          <td>63.1%</td>
          <td>22.7%</td>
          <td>-</td>
          <td>~7 min</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/UniParser">HuggingFace Models</a></td>
          <td>Model/Dataset</td>
          <td>Unknown</td>
          <td>MolDet models and MolParser-7M dataset available</td>
      </tr>
      <tr>
          <td><a href="https://uni-parser.github.io">Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Project website with documentation</td>
      </tr>
  </tbody>
</table>
<p>The Uni-Parser system is deployed on a cluster of 240 NVIDIA L40 GPUs (48 GB each), with 22 CPU cores and 90 GB of host memory per GPU. The reference throughput benchmark (20 pages/second) uses 8 NVIDIA RTX 4090D GPUs.</p>
<p>The HuggingFace organization hosts MolDet detection models and several datasets (MolParser-7M, RxnBench, OmniScience), but the full Uni-Parser system code and end-to-end inference pipeline do not appear to be publicly released. MolParser 1.5 model weights are also not publicly available as of this writing.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li>Chiral molecule recognition remains a challenge for end-to-end OCSR models</li>
<li>Chemical reaction understanding in real-world literature has substantial room for improvement</li>
<li>Layout models are primarily tailored to scientific and patent documents, with plans to expand to newspapers, slides, books, and financial statements</li>
<li>Chart parsing falls short of industrial-level requirements across the diversity of chart types in scientific literature</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Tao, H., Yang, S., Huang, C., Zhong, S., Lu, H., Lyu, H., Li, X., Zhang, L., &amp; Ke, G. (2025). Uni-Parser Technical Report. <em>arXiv preprint arXiv:2512.15098</em>. <a href="https://arxiv.org/abs/2512.15098">https://arxiv.org/abs/2512.15098</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://uni-parser.github.io">Project Page</a></li>
<li><a href="https://huggingface.co/UniParser">HuggingFace Models</a></li>
</ul>
]]></content:encoded></item><item><title>GraSP: Graph Recognition via Subgraph Prediction (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</guid><description>GraSP is a general image-to-graph framework using sequential subgraph prediction, applied to OCSR with 67.5% accuracy on QM9.</description><content:encoded><![CDATA[<h2 id="a-general-framework-for-visual-graph-recognition">A General Framework for Visual Graph Recognition</h2>
<p>GraSP (Graph Recognition via Subgraph Prediction) addresses a fundamental limitation in image-to-graph methods: existing solutions are task-specific and do not transfer between domains. Whether the task is OCSR, scene graph recognition, music notation parsing, or road network extraction, each domain has developed independent solutions despite solving the same conceptual problem of extracting a graph from an image.</p>
<p>The key insight is that graph recognition can be reformulated as sequential subgraph prediction using a binary classifier, sidestepping two core difficulties of using graphs as neural network outputs:</p>
<ol>
<li><strong>Graph isomorphism</strong>: An uncolored graph with $n$ nodes has up to $n!$ equivalent adjacency representations, making direct output comparison intractable</li>
<li><strong>Compositional outputs</strong>: Nodes, edges, and features are interdependent, so standard i.i.d. loss functions are insufficient</li>
</ol>
<h2 id="sequential-subgraph-prediction-as-an-mdp">Sequential Subgraph Prediction as an MDP</h2>
<p>GraSP formulates graph recognition as a Markov Decision Process. Starting from an empty graph, the method iteratively expands the current graph by adding one edge at a time (connecting either a new node or two existing nodes). At each step, a binary classifier predicts whether each candidate successor graph is a subgraph of the target graph shown in the image.</p>
<p>The critical observation is that the optimal value function $V^{\pi^*}$ satisfies:</p>
<p>$$V^{\pi^*}(\mathcal{G}_t | \mathcal{I}) = 1 \iff \mathcal{G}_t \subseteq \mathcal{G}_{\mathcal{I}}$$</p>
<p>This means the value function reduces to a subgraph membership test, which can be learned as a binary classifier rather than requiring reinforcement learning. Greedy decoding then suffices: at each step, select any successor that the classifier predicts is a valid subgraph, and terminate when the classifier indicates the current graph is complete.</p>
<p>This formulation decouples <strong>decision</strong> (what to add) from <strong>generation</strong> (in what order), making the same model applicable across different graph types without modification.</p>
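<p>The greedy decoding loop can be sketched with the learned classifier replaced by its oracle: since the optimal value function reduces to a subgraph-membership test, a set-inclusion check stands in for the model (the toy target graph, node universe, and function names below are illustrative, not from the paper):</p>

```python
from itertools import combinations

# Toy stand-in for GraSP's learned binary classifier: the optimal value
# function reduces to a subgraph-membership test against the target graph.
TARGET = frozenset({(0, 1), (1, 2), (1, 3)})   # illustrative target "in the image"

def classifier(graph, terminal=False):
    """Terminal check: is the graph complete? Otherwise: is it a valid subgraph?"""
    return graph == TARGET if terminal else graph <= TARGET

def greedy_decode(num_nodes, classify):
    graph = frozenset()
    while not classify(graph, terminal=True):
        nodes = {v for e in graph for v in e}
        progressed = False
        for u, v in combinations(range(num_nodes), 2):
            edge = (u, v)
            # each step adds one edge, touching the current graph once non-empty
            if edge in graph or (graph and u not in nodes and v not in nodes):
                continue
            if classify(graph | {edge}):   # accept any valid successor
                graph = graph | {edge}
                progressed = True
                break
        if not progressed:
            break                          # classifier rejected every successor
    return graph

print(greedy_decode(4, classifier) == TARGET)  # → True
```

<p>With an imperfect learned classifier the same loop applies unchanged; only the quality of the accept/reject decisions differs.</p>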
<h2 id="architecture-gnn--film-conditioned-cnn">Architecture: GNN + FiLM-Conditioned CNN</h2>
<p>The architecture has three components:</p>
<ol>
<li>
<p><strong>GNN encoder</strong>: A Message Passing Neural Network processes the candidate subgraph, producing a graph embedding. Messages are constructed as concatenations of source node features, target node features, and connecting edge features.</p>
</li>
<li>
<p><strong>FiLM-conditioned CNN</strong>: A ResNet-v2 processes the image, with FiLM layers placed after every normalization layer within each block. The graph embedding conditions the image processing, producing a joint graph-image representation.</p>
</li>
<li>
<p><strong>MLP classification head</strong>: Takes the conditioned image embedding plus a binary terminal flag (indicating whether this is a termination check) and predicts subgraph membership.</p>
</li>
</ol>
<p>The model uses only 7.25M parameters. Group Normalization is used in the CNN (8 groups per layer), Layer Normalization in the GNN and MLP.</p>
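<p>The FiLM conditioning step can be sketched in NumPy: GroupNorm over channels followed by a per-channel scale and shift computed from the graph embedding (the weights and tensor sizes here are random stand-ins for the learned projections):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def group_norm(x, groups=8, eps=1e-5):
    """GroupNorm over the channel axis; x has shape (C, H, W)."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(c, h, w)

def film(x, graph_emb, w_gamma, w_beta):
    """FiLM: per-channel scale and shift computed from the graph embedding."""
    gamma = w_gamma @ graph_emb
    beta = w_beta @ graph_emb
    return gamma[:, None, None] * x + beta[:, None, None]

C, D = 16, 32                                  # channels, embedding dim (illustrative)
x = rng.standard_normal((C, 8, 8))             # feature map inside a ResNet-v2 block
g_emb = rng.standard_normal(D)                 # GNN embedding of the candidate subgraph
w_g, w_b = rng.standard_normal((C, D)), rng.standard_normal((C, D))

y = film(group_norm(x), g_emb, w_g, w_b)       # FiLM right after the normalization layer
print(y.shape)  # (16, 8, 8)
```
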
<h2 id="training-via-streaming-data-generation">Training via Streaming Data Generation</h2>
<p>Training uses a streaming architecture rather than a fixed dataset:</p>
<ul>
<li>For each iteration, a target graph $\mathcal{G}_T$ is sampled and rendered as an image</li>
<li><strong>Positive samples</strong> are generated by deleting edges that do not disconnect the graph (yielding valid subgraphs)</li>
<li><strong>Negative samples</strong> are generated by expanding successor states and checking via approximate subgraph matching</li>
<li>Two FIFO buffers (one for positives, one for negatives), each holding up to 25,000 images, maintain diverse and balanced mini-batches of 1024 samples</li>
<li>Training uses the RAdam optimizer with a cosine learning rate schedule (warmup over 50M samples, cycle of 250M samples) on 4 A100 GPUs with a 24h budget</li>
</ul>
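<p>The positive-sample step can be sketched as follows, assuming "does not disconnect the graph" means the remaining edge-induced graph stays connected (the deletion probability and helper names are illustrative):</p>

```python
import random

def edge_connected(edges):
    """True if the graph induced by `edges` is connected (DFS over its nodes)."""
    if not edges:
        return True
    nodes = {v for e in edges for v in e}
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack += [w for (a, b) in edges if u in (a, b) for w in (a, b) if w != u]
    return seen == nodes

def positive_sample(target_edges, rng):
    """Randomly delete edges whose removal keeps the remaining graph connected,
    yielding a valid subgraph to render and label as a positive sample."""
    edges = set(target_edges)
    for e in sorted(edges, key=lambda _: rng.random()):
        if len(edges) > 1 and rng.random() < 0.5 and edge_connected(edges - {e}):
            edges.remove(e)
    return edges

rng = random.Random(0)
target = {(0, 1), (1, 2), (2, 3), (3, 4)}   # a small path graph
sub = positive_sample(target, rng)           # a connected sub-path of the target
```

<p>Negative samples would instead expand successor states and reject those failing the (approximate) subgraph-matching check.</p>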
<h2 id="synthetic-benchmarks-on-colored-trees">Synthetic Benchmarks on Colored Trees</h2>
<p>GraSP is evaluated on increasingly complex synthetic tasks involving colored tree graphs:</p>
<ul>
<li><strong>Small trees (6-9 nodes)</strong>: Tasks with varying numbers of node colors (1, 3, 5) and edge colors (1, 3, 5). The model works well across all configurations, with simpler tasks (fewer colors) converging faster.</li>
<li><strong>Larger trees (10-15 nodes)</strong>: The same trends hold but convergence is slower due to increased structural complexity.</li>
<li><strong>Out-of-distribution generalization</strong>: Models trained on 6-9 node trees show zero-shot generalization to 10-node trees, indicating learned patterns are size-independent.</li>
</ul>
<h2 id="ocsr-evaluation-on-qm9">OCSR Evaluation on QM9</h2>
<p>For the real-world OCSR evaluation, GraSP is applied to QM9 molecular images (grayscale, no stereo-bonds) with a 10,000-molecule held-out test set:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OSRA</td>
          <td>45.61%</td>
      </tr>
      <tr>
          <td>GraSP</td>
          <td>67.51%</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>88.36%</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>92.08%</td>
      </tr>
  </tbody>
</table>
<p>GraSP does not match state-of-the-art OCSR tools, but the authors emphasize that the same model architecture and training procedure transfers directly from synthetic tree tasks to molecular graphs with no task-specific modifications. The only domain knowledge incorporated is a simple chemistry rule: not extending nodes that already have degree four.</p>
<p>The method highlights the practical advantage of decoupling decision from generation. Functional groups can be represented at different granularities (as single nodes to reduce trajectory depth, or expanded to reduce trajectory breadth) without changing the model.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/c72bcbf4/grasp">GraSP Code</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation with pre-trained models</td>
      </tr>
  </tbody>
</table>
<p>The repository includes pre-trained models and example trajectories for interactive exploration. Training requires 4 A100 GPUs with a 24h time budget. The QM9 dataset used for OCSR evaluation is publicly available. No license file is included in the repository.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li><strong>Finite type assumption</strong>: The current framework assumes a finite set of node and edge types, limiting applicability to open-vocabulary tasks like scene graph recognition</li>
<li><strong>Scaling to large graphs</strong>: For very large graphs, the branching factor of successor states becomes expensive. Learned filters to prune irrelevant successor states could help</li>
<li><strong>OCSR performance gap</strong>: While GraSP demonstrates transferability, it falls short of specialized OCSR tools that use domain-specific encodings (SMILES) or pixel-level supervision</li>
<li><strong>Modality extension</strong>: The framework could extend beyond images to other input modalities, such as vector embeddings of graphs</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eberhard, A., Neumann, G., &amp; Friederich, P. (2026). Graph Recognition via Subgraph Prediction. <em>arXiv preprint arXiv:2601.15133</em>. <a href="https://arxiv.org/abs/2601.15133">https://arxiv.org/abs/2601.15133</a></p>
<p><strong>Publication</strong>: arXiv 2026</p>
]]></content:encoded></item><item><title>GraphReco: Probabilistic Structure Recognition (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</guid><description>GraphReco is a rule-based OCSR system using Markov networks for probabilistic atom/bond ambiguity resolution during graph assembly.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, H., Yu, Y., &amp; Liu, J.-C. (2026). GraphReco: Probabilistic Structure Recognition for Chemical Molecules. <em>ChemistryOpen</em>, e202500537. <a href="https://doi.org/10.1002/open.202500537">https://doi.org/10.1002/open.202500537</a></p>
<p><strong>Publication</strong>: ChemistryOpen 2026 (Open Access)</p>
<h2 id="a-rule-based-ocsr-system-with-probabilistic-graph-assembly">A Rule-Based OCSR System with Probabilistic Graph Assembly</h2>
<p>GraphReco tackles a challenge that is rarely addressed explicitly in rule-based OCSR: the ambiguity that arises during graph assembly when lower-level component extraction results are imprecise. Small deviations in bond endpoint locations, false positive detections, and spatial proximity between elements all create uncertainty about which atoms and bonds should be connected, merged, or discarded.</p>
<p>The system introduces two main contributions:</p>
<ol>
<li><strong>Fragment Merging (FM) line detection</strong>: An adaptive three-stage algorithm for precise bond line identification across images of variable resolution</li>
<li><strong>Probabilistic ambiguity resolution</strong>: A Markov network that infers the most likely existence and merging state of atom and bond candidates</li>
</ol>
<h2 id="three-stage-pipeline">Three-Stage Pipeline</h2>
<p>GraphReco follows a three-stage workflow:</p>
<ol>
<li>
<p><strong>Component Extraction</strong>: Detects circles (aromatic bonds), bond lines (via the FM algorithm), and chemical symbols (via Tesseract OCR). Includes detection of solid wedge, dashed wedge, dashed line, and wavy bond styles. A semi-open-loop correction step resolves cases where symbols are misclassified as bonds and vice versa.</p>
</li>
<li>
<p><strong>Atom and Bond Ambiguity Resolution</strong>: Creates atom and bond candidates from detected components, builds a Markov network to infer their most probable states, and resolves candidates through existence and merging decisions.</p>
</li>
<li>
<p><strong>Graph Reconstruction</strong>: Assembles resolved atoms and bonds into a molecule graph, selects the largest connected component, and exports as MDL Molfile.</p>
</li>
</ol>
<h2 id="fragment-merging-line-detection">Fragment Merging Line Detection</h2>
<p>Classical Line Hough Transform (LHT) struggles with chemical structure images because bond lines suffer from pixelization, and algorithm parameters that work for one image resolution fail at others. The FM algorithm addresses this with three stages:</p>
<ol>
<li>
<p><strong>Fragment extraction</strong>: Apply LHT with fine-grained parameters (distance resolution $r = 2$, angle resolution $\theta = 2°$) to detect fine line fragments. Walk along detected theoretical lines to find actual black pixels and group them by connectivity.</p>
</li>
<li>
<p><strong>Fragment grouping</strong>: Pair fragments that share similar angles, are close in the perpendicular direction, and are either overlapping or connected by a path of black pixels.</p>
</li>
<li>
<p><strong>Fragment merging</strong>: Merge grouped fragments into single line segments using the two border pixels farthest from the centroid.</p>
</li>
</ol>
<p>The FM algorithm effectively handles the tradeoff that plagues standard LHT: coarse parameters miss short lines and produce overlaps, while fine parameters return many fragments shorter than actual bonds.</p>
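<p>Stages 2 and 3 can be approximated in a few lines (a sketch: the tolerances and greedy grouping are illustrative simplifications, the black-pixel connectivity test is omitted, and angle wrap-around near 0°/180° is ignored for brevity):</p>

```python
import math

def merge_fragments(fragments, angle_tol=5.0, perp_tol=2.0):
    """Greedy sketch of FM stages 2-3: group near-collinear line fragments,
    then merge each group into one segment spanning its extreme endpoints.
    Fragments are ((x1, y1), (x2, y2)) endpoint pairs from stage-1 fine LHT."""
    def angle(f):
        (x1, y1), (x2, y2) = f
        return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0

    groups = []
    for f in fragments:
        placed = False
        for grp in groups:
            (x1, y1), (x2, y2) = grp[0]                    # reference fragment
            if abs(angle(f) - angle(grp[0])) > angle_tol:  # ignores 0°/180° wrap
                continue
            mx = ((f[0][0] + f[1][0]) / 2, (f[0][1] + f[1][1]) / 2)
            dx, dy = x2 - x1, y2 - y1
            # perpendicular offset of f's midpoint from the reference line
            perp = abs(dy * (mx[0] - x1) - dx * (mx[1] - y1)) / math.hypot(dx, dy)
            if perp <= perp_tol:
                grp.append(f)
                placed = True
                break
        if not placed:
            groups.append([f])

    merged = []
    for grp in groups:
        pts = [p for f in grp for p in f]
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        # keep the two endpoints farthest from the group centroid
        pts.sort(key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2, reverse=True)
        merged.append((pts[0], pts[1]))
    return merged

frags = [((0, 0), (4, 0)), ((5, 0), (9, 0)), ((0, 0), (0, 5))]
print(merge_fragments(frags))  # the joined horizontal segment plus the vertical one
```
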
<h2 id="probabilistic-ambiguity-resolution-via-markov-network">Probabilistic Ambiguity Resolution via Markov Network</h2>
<p>After component extraction, GraphReco creates atom and bond candidates rather than directly assembling the graph. Each bond endpoint generates an atom candidate with a circular bounding area of radius:</p>
<p>$$r_b = \min(l_{\text{bond}}, l_{\text{med}}) / 4$$</p>
<p>where $l_{\text{bond}}$ is the bond length and $l_{\text{med}}$ is the median bond length.</p>
<p>A Markov network is constructed with four types of nodes:</p>
<ul>
<li><strong>Atom nodes</strong>: Boolean existence variables for each atom candidate</li>
<li><strong>Bond nodes</strong>: Boolean existence variables for each bond candidate</li>
<li><strong>Atom merge nodes</strong>: Boolean variables for pairs of overlapping atom candidates</li>
<li><strong>Bond merge nodes</strong>: Boolean variables for pairs of nearby bond candidates</li>
</ul>
<p>Potential functions encode rules about when candidates should exist or merge, with merging likelihood between two bond-ending atom candidates defined as a piecewise function of center distance $d$:</p>
<p>$$P(a_1, a_2) = \begin{cases} 0.9, &amp; \text{if } d \leq Q \\ 0.7 - 0.4(d - Q)/(R - Q), &amp; \text{if } Q &lt; d \leq R \\ 0.1, &amp; \text{if } d &gt; R \end{cases}$$</p>
<p>where $Q = \max(r_1, r_2)$ and $R = \min(1.5Q, r_1 + r_2)$. MAP inference determines the final state of all candidates.</p>
<h2 id="evaluation-results">Evaluation Results</h2>
<p>GraphReco is evaluated on USPTO benchmarks with InChI string comparison (stereochemistry removed):</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td><strong>94.2</strong></td>
          <td><strong>86.7</strong></td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>92.4</td>
          <td>70.3</td>
          <td>89.1</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>89.9</td>
          <td>63.0</td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>89.7</td>
          <td>63.9</td>
          <td>89.3</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>93.3</td>
          <td>82.8</td>
          <td><strong>91.5</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>35.4</td>
          <td>13.8</td>
          <td>25.2</td>
      </tr>
  </tbody>
</table>
<p>GraphReco outperforms all rule-based systems and most ML systems, with a particularly large margin on USPTO-10K-Abb (abbreviation-heavy molecules). MolGrapher achieves slightly higher accuracy on the USPTO dataset.</p>
<h3 id="robustness-on-perturbed-images">Robustness on Perturbed Images</h3>
<p>On USPTO-perturbed (rotation and shearing applied), rule-based methods degrade substantially:</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-perturbed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolGrapher</td>
          <td><strong>86.7</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>42.3</td>
      </tr>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td>40.6</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>30.7</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>6.4</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>5.1</td>
      </tr>
  </tbody>
</table>
<p>GraphReco performs better than other rule-based systems on perturbed inputs (40.6% vs. under 31%) thanks to its probabilistic assembly, but still falls far behind MolGrapher (86.7%), demonstrating the robustness advantage of learned approaches.</p>
<h2 id="ablation-study">Ablation Study</h2>
<p>Each component contributes substantially to overall performance on USPTO-10K:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full system</td>
          <td>94.2</td>
          <td>86.7</td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>Without FM line detection</td>
          <td>2.9</td>
          <td>5.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>Without atom candidates</td>
          <td>9.8</td>
          <td>0.4</td>
          <td>5.0</td>
      </tr>
      <tr>
          <td>Without bond candidates</td>
          <td>79.1</td>
          <td>75.8</td>
          <td>75.0</td>
      </tr>
      <tr>
          <td>Without Markov network</td>
          <td>88.2</td>
          <td>81.4</td>
          <td>84.2</td>
      </tr>
  </tbody>
</table>
<p>The FM algorithm and atom candidate mechanism are both critical (accuracy drops below 10% without either). Bond candidates provide a moderate improvement (~15 percentage points), and the Markov network adds ~6 points over hard-threshold alternatives.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Deterministic expert rules limit robustness on perturbed or noisy images, as evidenced by the large accuracy gap with MolGrapher on USPTO-perturbed</li>
<li>The system relies on Tesseract OCR for symbol recognition, which may struggle with unusual fonts or degraded image quality</li>
<li>Only handles single 2D molecule structures per image</li>
<li>Stereochemistry is removed during evaluation, so performance on stereo-bond recognition is not assessed</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>GraphReco is implemented in Python and relies on Tesseract OCR, OpenCV, and RDKit. The authors provided an online demo for testing, but it is no longer available, and neither source code nor a public repository has been released.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Online Demo</td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Google Cloud Run deployment (no longer available)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components for full reproduction:</strong></p>
<ul>
<li>Source code is not publicly available</li>
<li>No pre-built package or installable library</li>
<li>Hyperparameters for Markov network potential functions are given in the paper (Equations 8-11), but full implementation details are not released</li>
</ul>
<p><strong>Hardware/compute requirements:</strong> Not specified in the paper. The system uses classical computer vision (Hough transforms, thinning) and probabilistic inference (Markov networks), so GPU hardware is likely not required.</p>
]]></content:encoded></item><item><title>AdaptMol: Domain Adaptation for Molecular OCSR (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</guid><description>AdaptMol is an image-to-graph OCSR model using MMD-based domain adaptation and self-training for hand-drawn molecule recognition.</description><content:encoded><![CDATA[<h2 id="bridging-the-synthetic-to-real-gap-in-graph-based-ocsr">Bridging the Synthetic-to-Real Gap in Graph-Based OCSR</h2>
<p>Most OCSR methods are trained on synthetic molecular images and evaluated on high-quality literature figures, both exhibiting relatively uniform styles. Hand-drawn molecules represent a particularly challenging domain with irregular bond lengths, variable stroke widths, and inconsistent atom symbols. Prior graph reconstruction methods like MolScribe and MolGrapher drop below 15% accuracy on hand-drawn images, despite achieving over 65% on literature datasets.</p>
<p>AdaptMol addresses this with a three-stage pipeline that enables effective transfer from synthetic to real-world data without requiring graph annotations in the target domain:</p>
<ol>
<li><strong>Base model training</strong> on synthetic data with comprehensive augmentation and dual position representation</li>
<li><strong>MMD alignment</strong> of bond-level features between source and target domains</li>
<li><strong>Self-training</strong> with SMILES-validated pseudo-labels on unlabeled target images</li>
</ol>
<h2 id="end-to-end-graph-reconstruction-architecture">End-to-End Graph Reconstruction Architecture</h2>
<p>AdaptMol builds on MolScribe&rsquo;s architecture, using a Swin Transformer base encoder ($384 \times 384$ input) with a 6-layer Transformer decoder (8 heads, hidden dim 256). The model jointly predicts atoms and bonds:</p>
<p><strong>Atom prediction</strong> follows the Pix2Seq approach, autoregressively generating a sequence of atom tokens:</p>
<p>$$S_N = [l_1, x_1, y_1, l_2, x_2, y_2, \dots, l_n, x_n, y_n]$$</p>
<p>where $l_i$ is the atom label and $(x_i, y_i)$ are discretized coordinate bin indices.</p>
<p><strong>Dual position representation</strong> adds a 2D spatial heatmap on top of token-based coordinate prediction. The heatmap aggregates joint spatial distributions of all atoms:</p>
<p>$$\mathbf{H} = \text{Upsample}\left(\sum_{i=1}^{n} P_y^{(i)} \otimes P_x^{(i)}\right)$$</p>
<p>where $P_x^{(i)}$ and $P_y^{(i)}$ are coordinate probability distributions from the softmax logits. During training, this heatmap is supervised with Gaussian kernels at ground-truth atom positions. This reduces false positive atom predictions substantially (from 356 to 33 false positives at IoU 0.05).</p>
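<p>The heatmap aggregation can be sketched in NumPy (the bin count, atom count, and nearest-neighbour upsampling below are illustrative stand-ins for the model's exact configuration):</p>

```python
import numpy as np

def atom_heatmap(p_x, p_y, upsample=4):
    """Sum the joint spatial distribution P_y ⊗ P_x of every atom, then upsample.
    p_x, p_y: (n_atoms, bins) coordinate distributions from the softmax logits."""
    heat = np.einsum('ny,nx->yx', p_y, p_x)              # sum of outer products
    return np.kron(heat, np.ones((upsample, upsample)))  # nearest-neighbour upsample

softmax = lambda z: np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
rng = np.random.default_rng(0)
n_atoms, bins = 3, 8                                     # illustrative sizes
H = atom_heatmap(softmax(rng.standard_normal((n_atoms, bins))),
                 softmax(rng.standard_normal((n_atoms, bins))))
print(H.shape)  # (32, 32)
```

<p>During training this map would be compared against Gaussian kernels placed at the ground-truth atom positions.</p>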
<p><strong>Bond prediction</strong> extracts atom-level features from decoder hidden states and enriches them with encoder visual features via multi-head attention with a learnable residual weight $\alpha$:</p>
<p>$$\mathbf{F}_{\text{enriched}} = \text{LayerNorm}(\mathbf{F}_{\text{atom}} + \alpha \cdot \text{MHA}(\mathbf{F}_{\text{atom}}, \mathbf{E}_{\text{vis}}))$$</p>
<p>A feed-forward network then predicts bond types between all atom pairs.</p>
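<p>A minimal single-head sketch of the enrichment step (the model uses multi-head attention and a learned $\alpha$; the dimensions and $\alpha$ value here are illustrative):</p>

```python
import numpy as np

def attend(q, k, v):
    """Single-head scaled dot-product attention (stand-in for the model's MHA)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def enrich(f_atom, e_vis, alpha):
    """F_enriched = LayerNorm(F_atom + alpha * Attention(F_atom, E_vis))."""
    return layer_norm(f_atom + alpha * attend(f_atom, e_vis, e_vis))

rng = np.random.default_rng(0)
f_atom = rng.standard_normal((5, 32))   # atom features from decoder hidden states
e_vis = rng.standard_normal((9, 32))    # encoder visual features
out = enrich(f_atom, e_vis, alpha=0.1)  # alpha is learnable in the actual model
print(out.shape)  # (5, 32)
```
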
<h2 id="bond-level-domain-adaptation-via-mmd">Bond-Level Domain Adaptation via MMD</h2>
<p>The key insight is that bond features are domain-invariant: they encode structural relationships (single, double, triple, aromatic) independent of visual style. Atom-level alignment is problematic due to class imbalance (carbon dominates), multi-token spanning (functional groups), and position-dependent features.</p>
<p>AdaptMol aligns bond-level feature distributions via class-conditional Maximum Mean Discrepancy:</p>
<p>$$L_{\text{MMD}} = \frac{1}{|\mathcal{C}'|} \sum_{c \in \mathcal{C}'} \text{MMD}(F_c^{\text{src}}, F_c^{\text{tgt}})$$</p>
<p>where $\mathcal{C}'$ contains classes with sufficient samples in both domains. Confidence-based filtering retains only high-confidence predictions (confidence &gt; 0.95, entropy &lt; 0.1) for alignment, tightening to 0.98 and 0.05 after the first epoch. Progressive loss weighting follows a schedule of 0.1 (epoch 0), 0.075 (epoch 1), and 0.05 thereafter.</p>
<p>An important side effect: MMD alignment improves inter-class bond discrimination, reducing confusion between visually similar bond types (e.g., jagged double bonds vs. aromatic bonds).</p>
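<p>A sketch of the class-conditional MMD loss, assuming an RBF kernel (the paper's kernel choice, bandwidth, and feature dimensions may differ; the dictionary keys below are illustrative):</p>

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD with an RBF kernel; X: (n, d), Y: (m, d)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

def class_conditional_mmd(src, tgt, shared_classes):
    """L_MMD: average per-class MMD over bond classes present in both domains."""
    return sum(rbf_mmd2(src[c], tgt[c]) for c in shared_classes) / len(shared_classes)

rng = np.random.default_rng(0)
single = rng.standard_normal((64, 16))
src = {"single": single, "double": rng.standard_normal((64, 16))}
tgt = {"single": single.copy(), "double": rng.standard_normal((64, 16)) + 3.0}
# matched features give a smaller loss than shifted ones
print(class_conditional_mmd(src, tgt, ["single"]) < class_conditional_mmd(src, tgt, ["double"]))  # → True
```
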
<h2 id="self-training-with-smiles-validation">Self-Training with SMILES Validation</h2>
<p>After MMD alignment, the model generates predictions on unlabeled target images. Predicted molecular graphs are converted to SMILES and validated against ground-truth SMILES annotations. Only exact matches are retained as pseudo-labels, providing complete graph supervision (atom coordinates, element types, bond types) that was previously unavailable in the target domain.</p>
<p>This approach is far more data-efficient than alternatives: AdaptMol uses only 4,080 real hand-drawn images vs. DECIMER-Handdraw&rsquo;s 38 million synthetic hand-drawn images.</p>
<h2 id="comprehensive-data-augmentation">Comprehensive Data Augmentation</h2>
<p>Two categories of augmentation are applied during synthetic data generation:</p>
<ul>
<li><strong>Structure-rendering augmentation</strong>: Functional group abbreviation substitution, bond type conversions (single to wavy/aromatic, Kekule to aromatic rings), R-group insertion, and rendering parameter randomization (font family/size, bond width/spacing)</li>
<li><strong>Image-level augmentation</strong>: Geometric operations, quality degradation, layout variations, and chemical document artifacts (caption injection, arrows, marginal annotations)</li>
</ul>
<p>Structure-rendering augmentation provides the larger benefit, contributing ~20% accuracy improvement on JPO and ~30% on ACS benchmarks.</p>
<h2 id="results">Results</h2>
<h3 id="hand-drawn-molecule-recognition">Hand-Drawn Molecule Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>DECIMER test (Acc)</th>
          <th>ChemPix (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>AdaptMol</strong></td>
          <td><strong>82.6</strong></td>
          <td><strong>60.5</strong></td>
      </tr>
      <tr>
          <td>DECIMER v2.2</td>
          <td>71.9</td>
          <td>51.4</td>
      </tr>
      <tr>
          <td>AtomLenz</td>
          <td>30.0</td>
          <td>48.4</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>10.1</td>
          <td>26.1</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>10.7</td>
          <td>14.5</td>
      </tr>
  </tbody>
</table>
<h3 id="literature-and-synthetic-benchmarks">Literature and Synthetic Benchmarks</h3>
<p>AdaptMol achieves state-of-the-art on 4 of 6 literature benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>AdaptMol</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>DECIMER v2.2</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CLEF</td>
          <td><strong>92.7</strong></td>
          <td>87.5</td>
          <td>57.2</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td><strong>88.2</strong></td>
          <td>78.8</td>
          <td>73.0</td>
          <td>75.7</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td><strong>89.3</strong></td>
          <td>88.2</td>
          <td>85.1</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td><strong>75.5</strong></td>
          <td>72.8</td>
          <td>41.0</td>
          <td>37.7</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>90.9</td>
          <td><strong>92.6</strong></td>
          <td>74.9</td>
          <td>59.6</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>84.0</td>
          <td><strong>84.4</strong></td>
          <td>0.0</td>
          <td>66.3</td>
      </tr>
  </tbody>
</table>
<p>MolScribe edges out AdaptMol on USPTO and Staker. The authors attribute this to MolScribe training directly on all 680K USPTO samples, which may cause it to specialize to that distribution.</p>
<h3 id="pipeline-ablation">Pipeline Ablation</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Hand-drawn</th>
          <th>ChemDraw</th>
          <th>JPO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base model</td>
          <td>10.4</td>
          <td>92.3</td>
          <td>82.7</td>
      </tr>
      <tr>
          <td>+ Font augmentation</td>
          <td>30.2</td>
          <td>92.5</td>
          <td>82.8</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD</td>
          <td>42.1</td>
          <td>94.0</td>
          <td>83.0</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD + Self-training</td>
          <td><strong>82.6</strong></td>
          <td><strong>95.9</strong></td>
          <td><strong>88.2</strong></td>
      </tr>
  </tbody>
</table>
<p>Each component contributes meaningfully: font augmentation (+19.8), MMD alignment (+11.9), and self-training (+40.5) on hand-drawn accuracy.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fffh1/AdaptMol">AdaptMol Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">Model + Data</a></td>
          <td>Model/Dataset</td>
          <td>MIT</td>
          <td>Pretrained checkpoint and datasets</td>
      </tr>
  </tbody>
</table>
<p>Training uses 2 NVIDIA A100 GPUs (40GB each). Base model trains for 30 epochs on 1M synthetic samples. Domain adaptation involves 3 steps: USPTO self-training (3 iterations of 3 epochs), MMD alignment on hand-drawn data (5 epochs), and hand-drawn self-training (5 iterations).</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Sequence length constraints prevent accurate prediction of very large molecules (&gt;120 atoms), where resizing causes significant information loss</li>
<li>Cannot recognize Markush structures with repeating unit notation (parentheses/brackets), as synthetic training data lacks such cases</li>
<li>Stereochemistry information is lost when stereo bonds connect to abbreviated functional groups due to RDKit post-processing limitations</li>
<li>The retrained baseline (30 epochs from scratch on synthetic + pseudo-labels) achieves higher hand-drawn accuracy (87.2%) but at the cost of cross-domain robustness on literature benchmarks</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, F., He, E., &amp; Verspoor, K. (2026). AdaptMol: Domain Adaptation for Molecular Image Recognition with Limited Supervision. <em>Research Square preprint</em>. <a href="https://doi.org/10.21203/rs.3.rs-8365561/v1">https://doi.org/10.21203/rs.3.rs-8365561/v1</a></p>
<p><strong>Publication</strong>: Research Square preprint, February 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/fffh1/AdaptMol">GitHub</a></li>
<li><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">HuggingFace (model + data)</a></li>
</ul>
]]></content:encoded></item><item><title>OCSU: Optical Chemical Structure Understanding (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</guid><description>OCSU task for translating molecular images into multi-level descriptions. Introduces Vis-CheBI20 dataset and DoubleCheck/Mol-VL for molecular understanding.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, S., Xie, Y., Cai, B., Xie, A., Liu, G., Qiao, M., Xing, J., &amp; Nie, Z. (2025). OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery. <em>arXiv preprint arXiv:2501.15415</em>. <a href="https://doi.org/10.48550/arXiv.2501.15415">https://doi.org/10.48550/arXiv.2501.15415</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/PharMolix/OCSU">Code and Dataset (GitHub)</a></li>
</ul>
<h2 id="multi-level-chemical-understanding-method-and-resource">Multi-Level Chemical Understanding (Method and Resource)</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution.</p>
<ul>
<li><strong>Methodological</strong>: It proposes two novel architectures, <strong>DoubleCheck</strong> (an enhanced recognition model) and <strong>Mol-VL</strong> (an end-to-end vision-language model), to solve the newly formulated OCSU task.</li>
<li><strong>Resource</strong>: It constructs and releases <strong>Vis-CheBI20</strong>, the first large-scale dataset specifically designed for optical chemical structure understanding, containing 29.7K images and 117.7K image-text pairs.</li>
</ul>
<h2 id="the-motivation-for-ocsu-beyond-basic-graph-recognition">The Motivation for OCSU Beyond Basic Graph Recognition</h2>
<p>Existing methods for processing molecular images focus narrowly on <strong>Optical Chemical Structure Recognition (OCSR)</strong>, which translates an image solely into a machine-readable graph or SMILES string. However, SMILES strings are not chemist-friendly and lack high-level semantic context.</p>
<ul>
<li><strong>Gap</strong>: There is a lack of systems that can translate chemical diagrams into human-readable descriptions (e.g., functional groups, IUPAC names) alongside the graph structure.</li>
<li><strong>Goal</strong>: To enable <strong>Optical Chemical Structure Understanding (OCSU)</strong>, bridging the gap between visual representations and both machine/chemist-readable descriptions to support drug discovery and property prediction.</li>
</ul>
<h2 id="key-innovations-doublecheck-mol-vl-and-the-vis-chebi20-dataset">Key Innovations: DoubleCheck, Mol-VL, and the Vis-CheBI20 Dataset</h2>
<p>The paper introduces the <strong>OCSU task</strong>, enabling multi-level understanding (motif, molecule, and abstract levels). To solve this, it introduces two distinct paradigms:</p>
<ol>
<li><strong>DoubleCheck (OCSR-based)</strong>: An enhancement to standard OCSR models (like MolScribe) that performs a &ldquo;second look&rdquo; at locally ambiguous atoms. It uses attentive feature enhancement to fuse global molecular features with local features from ambiguous regions.</li>
<li><strong>Mol-VL (OCSR-free)</strong>: An end-to-end Vision-Language Model (VLM) based on Qwen2-VL. It uses multi-task learning to directly generate text descriptions from molecular images without an intermediate SMILES step.</li>
<li><strong>Vis-CheBI20 Dataset</strong>: A new benchmark specifically constructed for OCSU, deriving captions and functional group data from ChEBI-20 and PubChem.</li>
</ol>
<h2 id="methodology-and-experimental-evaluation">Methodology and Experimental Evaluation</h2>
<p>The authors evaluated both paradigms on <strong>Vis-CheBI20</strong> and existing benchmarks (USPTO, ACS) across four subtasks:</p>
<ol>
<li><strong>Functional Group Caption</strong>: Retrieval/F1 score evaluation.</li>
<li><strong>Molecule Description</strong>: Natural language generation metrics (BLEU, ROUGE, METEOR).</li>
<li><strong>IUPAC Naming</strong>: Text generation metrics (BLEU, ROUGE).</li>
<li><strong>SMILES Naming (OCSR)</strong>: Exact matching accuracy ($Acc_s$).</li>
</ol>
<p><strong>Baselines</strong>:</p>
<ul>
<li><strong>Task-Specific</strong>: MolScribe, MolVec, OSRA.</li>
<li><strong>LLM/VLM</strong>: Qwen2-VL, BioT5+, Mol-Instructions.</li>
<li><strong>Ablation</strong>: DoubleCheck vs. MolScribe backbone to test the &ldquo;feature enhancement&rdquo; mechanism.</li>
</ul>
<h2 id="results-and-conclusions-paradigm-trade-offs">Results and Conclusions: Paradigm Trade-Offs</h2>
<ul>
<li><strong>DoubleCheck Superiority</strong>: DoubleCheck outperformed MolScribe on OCSR tasks across all benchmarks. On USPTO, it achieved <strong>92.85%</strong> $Acc_s$ (vs. 92.57%), and on the ACS dataset it showed a <strong>+3.12%</strong> gain on chiral molecules. On Vis-CheBI20, DoubleCheck improved over MolScribe by an average of 2.27% across all metrics.</li>
<li><strong>Paradigm Trade-offs</strong>:
<ul>
<li><strong>Mol-VL (OCSR-free)</strong> excelled at semantic tasks like <strong>Functional Group Captioning</strong>, achieving <strong>97.32%</strong> F1 (vs. 93.63% for DoubleCheck &amp; RDKit and 89.60% for MolScribe &amp; RDKit). It benefits from end-to-end learning of structural context.</li>
<li><strong>DoubleCheck (OCSR-based)</strong> performed better on <strong>IUPAC naming recall</strong> and exact SMILES recovery, as explicit graph reconstruction is more precise for rigid nomenclature than VLM generation.</li>
</ul>
</li>
<li><strong>Conclusion</strong>: Enhancing submodules improves OCSR-based paradigms, while end-to-end VLMs offer stronger semantic understanding but struggle with exact syntax generation (SMILES/IUPAC).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Vis-CheBI20 Dataset</strong></p>
<ul>
<li><strong>Source</strong>: Derived from ChEBI-20 and PubChem.</li>
<li><strong>Size</strong>: 29,700 molecular diagrams, 117,700 image-text pairs.</li>
<li><strong>Generation</strong>: Images generated from SMILES using RDKit to simulate real-world journal/patent styles.</li>
<li><strong>Splits</strong> (vary by task, see table below):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Train Size</th>
          <th style="text-align: left">Test Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Functional Group</td>
          <td style="text-align: left">26,144</td>
          <td style="text-align: left">3,269</td>
      </tr>
      <tr>
          <td style="text-align: left">Description</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
      <tr>
          <td style="text-align: left">IUPAC Naming</td>
          <td style="text-align: left">26,200</td>
          <td style="text-align: left">2,680</td>
      </tr>
      <tr>
          <td style="text-align: left">SMILES Naming</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>DoubleCheck (Attentive Feature Enhancement)</strong></p>
<ol>
<li><strong>Ambiguity Detection</strong>: Uses atom prediction confidence to identify &ldquo;ambiguous atoms&rdquo;.</li>
<li><strong>Masking</strong>: Applies a 2D Gaussian mask to the image centered on the ambiguous atom.</li>
<li><strong>Local Encoding</strong>: A Swin-B encoder ($\Phi_l$) encodes the masked image region.</li>
<li><strong>Fusion</strong>: Aligns local features ($\mathcal{F}_l$) with global features ($\mathcal{F}_g$) using a 2-layer MLP and fuses them via weighted summation.</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{F}_e = \mathcal{F}_g + \text{MLP}(\mathcal{F}_g \oplus \hat{\mathcal{F}}_l) \cdot \hat{\mathcal{F}}_l
\end{aligned}
$$</p>
<ol start="5">
<li><strong>Two-Stage Training</strong>:
<ul>
<li>Stage 1: Train atom/bond predictors (30 epochs).</li>
<li>Stage 2: Train alignment/fusion modules with random Gaussian mask noise (10 epochs).</li>
</ul>
</li>
</ol>
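<p>The fusion step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the feature shapes, the toy 2-layer MLP, and the random weights are assumptions standing in for the Swin-B features and trained alignment modules.</p>

```python
import numpy as np

def mlp_2layer(x, W1, b1, W2, b2):
    """Toy 2-layer MLP: linear -> ReLU -> linear."""
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

def fuse_features(F_g, F_l_hat, params):
    """Attentive feature enhancement (sketch):
    F_e = F_g + MLP(F_g concat F_l_hat) * F_l_hat
    The MLP maps the concatenated global/local features to
    per-channel fusion weights for the aligned local features."""
    concat = np.concatenate([F_g, F_l_hat], axis=-1)  # (N, 2d)
    weights = mlp_2layer(concat, *params)             # (N, d)
    return F_g + weights * F_l_hat

# Toy example: N=4 feature tokens, d=8 channels
rng = np.random.default_rng(0)
d = 8
F_g = rng.normal(size=(4, d))   # global image features
F_l = rng.normal(size=(4, d))   # aligned local (masked-region) features
params = (rng.normal(size=(2 * d, d)), np.zeros(d),
          rng.normal(size=(d, d)), np.zeros(d))
F_e = fuse_features(F_g, F_l, params)
print(F_e.shape)  # (4, 8)
```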
<p><strong>Mol-VL (Multi-Task VLM)</strong></p>
<ul>
<li><strong>Prompting</strong>: System prompt: &ldquo;You are working as an excellent assistant in chemistry&hellip;&rdquo;</li>
<li><strong>Tokens</strong>: Uses <code>&lt;image&gt;</code> and <code>&lt;/image&gt;</code> special tokens.</li>
<li><strong>Auxiliary Task</strong>: Functional group recognition (identifying highlighted groups) added to training to improve context learning.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>DoubleCheck</strong>:
<ul>
<li><strong>Backbone</strong>: MolScribe architecture.</li>
<li><strong>Encoders</strong>: Swin-B for both global and local atom encoding.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>:
<ul>
<li><strong>Base Model</strong>: Qwen2-VL (2B and 7B versions).</li>
<li><strong>Vision Encoder</strong>: ViT with naive dynamic resolution and M-RoPE.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Metrics</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Exact Match Accuracy ($Acc_s$), Chiral Accuracy ($Acc_c$).</li>
<li><strong>Functional Groups</strong>: F1 Score (Information Retrieval task).</li>
<li><strong>Text Generation</strong>: BLEU-2/4, METEOR, ROUGE-L.</li>
</ul>
<p><strong>Selected Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left"><strong>92.85%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>MolScribe</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left">92.57%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mol-VL-7B</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left"><strong>97.32%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck &amp; RDKit</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left">93.63%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>DoubleCheck</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>4 days</strong>.
<ul>
<li>Max LR: 4e-4.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>10 days</strong>.
<ul>
<li>Max LR: 1e-5, 50 epochs.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/PharMolix/OCSU">PharMolix/OCSU (GitHub)</a></td>
          <td style="text-align: left">Code, Model, Dataset</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation, Mol-VL-7B weights, and Vis-CheBI20 dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>The long-tail distribution of functional groups in training data limits performance on uncommon chemical structures.</li>
<li>Mol-VL struggles with exact syntax generation (SMILES and IUPAC) compared to explicit graph-reconstruction approaches.</li>
<li>Vis-CheBI20 images are synthetically generated via RDKit, which may not fully capture the diversity of real-world journal and patent images.</li>
<li>The authors note that OCSU technologies should be restricted to research purposes, as downstream molecule discovery applications could potentially generate harmful molecules.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{fanOCSUOpticalChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSU}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fan, Siqi and Xie, Yuguang and Cai, Bowen and Xie, Ailin and Liu, Gaochao and Qiao, Mu and Xing, Jie and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GTR-CoT: Graph Traversal Chain-of-Thought for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</guid><description>GTR-VL uses graph traversal chain-of-thought and two-stage training to improve optical chemical structure recognition on printed and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, J., He, Y., Yang, H., Wu, J., Ge, L., Wei, X., Wang, Y., Li, L., Ao, H., Liu, C., Wang, B., Wu, L., &amp; He, C. (2025). GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition (arXiv:2506.07553). arXiv. <a href="https://doi.org/10.48550/arXiv.2506.07553">https://doi.org/10.48550/arXiv.2506.07553</a></p>
<p><strong>Publication</strong>: arXiv preprint (2025)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2506.07553">Paper on arXiv</a></li>
</ul>
<h2 id="contribution-vision-language-modeling-for-ocsr">Contribution: Vision-Language Modeling for OCSR</h2>
<p>This is a <strong>method paper</strong> that introduces GTR-VL, a Vision-Language Model for Optical Chemical Structure Recognition (OCSR). The work addresses the persistent challenge of converting molecular structure images into machine-readable formats, with a particular focus on handling chemical abbreviations that cause errors in existing systems.</p>
<h2 id="motivation-the-abbreviation-bottleneck">Motivation: The Abbreviation Bottleneck</h2>
<p>The motivation tackles a long-standing bottleneck in chemical informatics: most existing OCSR systems produce incorrect structures when they encounter abbreviated functional groups. When a chemist draws &ldquo;Ph&rdquo; for phenyl or &ldquo;Et&rdquo; for ethyl, current models fail because they have been trained on data where images contain abbreviations but the ground-truth labels contain fully expanded molecular graphs.</p>
<p>This creates a fundamental mismatch. The model sees &ldquo;Ph&rdquo; in the image but is told the &ldquo;correct&rdquo; answer is a full benzene ring. The supervision signal is inconsistent with what is actually visible.</p>
<p>Beyond this data problem, existing graph-parsing methods use a two-stage approach: predict all atoms first, then predict all bonds. This is inefficient and ignores the structural constraints that could help during prediction. The authors argue that mimicking how humans analyze molecular structures, following bonds from atom to atom in a connected traversal, would be more effective.</p>
<h2 id="novelty-graph-traversal-as-visual-chain-of-thought">Novelty: Graph Traversal as Visual Chain-of-Thought</h2>
<p>The novelty lies in combining two key insights about how to properly train and architect OCSR systems. The main contributions are:</p>
<ol>
<li>
<p><strong>Graph Traversal as Visual Chain of Thought</strong>: GTR-VL generates molecular graphs by traversing them sequentially, predicting an atom, then its connected bond, then the next atom, and so on. This mimics how a human chemist would trace through a structure and allows the model to use previously predicted atoms and bonds as context for subsequent predictions.</p>
<p>Formally, the model output sequence for image $I_m$ is generated as:</p>
<p>$$ R_m = \text{concat}(CoT_m, S_m) $$</p>
<p>where $CoT_m$ represents the deterministic graph traversal steps (atoms and bonds) and $S_m$ is the final SMILES representation. This intermediate reasoning step makes the model more interpretable and helps it learn the structural logic of molecules.</p>
</li>
<li>
<p><strong>&ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; Principle</strong>: This addresses the abbreviation problem head-on. The authors correct the ground-truth annotations to match what&rsquo;s actually visible in the image.</p>
<p>They treat abbreviations like &ldquo;Ph&rdquo; as single &ldquo;superatoms&rdquo; and build a pipeline to automatically detect and correct training data. Using OCR to extract visible text from molecular images, they replace the corresponding expanded substructures in the ground-truth with the appropriate abbreviation tokens. This ensures the supervision signal is consistent with the visual input.</p>
</li>
<li>
<p><strong>Large-Scale Dataset (GTR-1.3M)</strong>: To support this approach, the authors created a large-scale dataset combining 1M synthetic molecules from PubChem with 351K corrected real-world patent images from USPTO. The key innovation is the correction pipeline that identifies abbreviations in patent images and fixes the inconsistent ground-truth labels.</p>
</li>
<li>
<p><strong>GRPO for Hand-Drawn OCSR</strong>: Hand-drawn molecular data lacks fine-grained atom/bond coordinate annotations, making SFT-based graph parsing inapplicable. The authors use Group Relative Policy Optimization (GRPO) with a composite reward function that combines format, SMILES, and graph-level rewards. The graph reward computes the maximum common subgraph (MCS) between predicted and ground-truth molecular graphs:</p>
<p>$$ R_{\text{graph}} = \frac{|N_m^a|}{|N_g^a| + |N_p^a|} + \frac{|N_m^b|}{|N_g^b| + |N_p^b|} $$</p>
<p>where $N_m^a$, $N_g^a$, $N_p^a$ are atom counts in the MCS, ground truth, and prediction, and $N_m^b$, $N_g^b$, $N_p^b$ are the corresponding bond counts.</p>
</li>
<li>
<p><strong>Two-Stage Training</strong>: Stage 1 performs SFT on GTR-1.3M for printed molecule recognition. Stage 2 applies GRPO on a mixture of printed data (GTR-USPTO-4K) and hand-drawn data (DECIMER-HD-Train, 4,070 samples) to extend capabilities to hand-drawn structures.</p>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: Traditional SMILES-based evaluation fails for molecules with abbreviations because canonicalization breaks down. The authors created a new benchmark that evaluates graph structure directly, providing three metrics: direct SMILES generation, graph-derived SMILES, and exact graph matching.</p>
</li>
</ol>
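<p>The graph reward can be sketched with label multisets standing in for the maximum-common-subgraph computation. This is a deliberate simplification for illustration: exact MCS is expensive and is handled by cheminformatics tooling in practice, whereas multiset intersection ignores connectivity.</p>

```python
from collections import Counter

def graph_reward(pred_atoms, pred_bonds, gt_atoms, gt_bonds):
    """Graph-level reward (sketch):
    R = |N_m^a| / (|N_g^a| + |N_p^a|) + |N_m^b| / (|N_g^b| + |N_p^b|)
    The common-subgraph counts are approximated here by multiset
    intersection of atom and bond labels."""
    common_a = sum((Counter(pred_atoms) & Counter(gt_atoms)).values())
    common_b = sum((Counter(pred_bonds) & Counter(gt_bonds)).values())
    r_a = common_a / (len(gt_atoms) + len(pred_atoms)) if (gt_atoms or pred_atoms) else 0.0
    r_b = common_b / (len(gt_bonds) + len(pred_bonds)) if (gt_bonds or pred_bonds) else 0.0
    return r_a + r_b

# A perfect prediction scores 0.5 + 0.5 = 1.0
atoms = ["C", "C", "O"]
bonds = [("C", "C", 1), ("C", "O", 1)]
print(graph_reward(atoms, bonds, atoms, bonds))  # 1.0
```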
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The evaluation focused on demonstrating that GTR-VL&rsquo;s design principles solve real problems that plague existing OCSR systems:</p>
<ol>
<li>
<p><strong>Comprehensive Baseline Comparison</strong>: GTR-VL was tested against three categories of models:</p>
<ul>
<li><strong>Specialist OCSR systems</strong>: MolScribe and MolNexTR</li>
<li><strong>Chemistry-focused VLMs</strong>: ChemVLM, ChemDFM-X, OCSU</li>
<li><strong>General-purpose VLMs</strong>: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: The new benchmark includes two subsets of patent images:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 standard patent images similar to existing benchmarks</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<p>This design directly tests whether models can handle the abbreviation problem that breaks existing systems.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Systematic experiments isolated the contribution of key design choices:</p>
<ul>
<li><strong>Chain-of-Thought vs. Direct</strong>: Comparing graph traversal CoT against direct SMILES prediction</li>
<li><strong>Traversal Strategy</strong>: Graph traversal vs. the traditional &ldquo;atoms-then-bonds&rdquo; approach</li>
<li><strong>Dataset Quality</strong>: Training on corrected vs. uncorrected data</li>
</ul>
</li>
<li>
<p><strong>Retraining Experiments</strong>: Existing specialist models (MolScribe, MolNexTR) were retrained from scratch on the corrected GTR-1.3M dataset to isolate the effect of data quality from architectural improvements.</p>
</li>
<li>
<p><strong>Hand-Drawn OCSR Evaluation</strong>: GTR-VL was also evaluated on the DECIMER Hand-drawn test set and ChemPix dataset, comparing against DECIMER and AtomLenz+EditKT baselines.</p>
</li>
<li>
<p><strong>Qualitative Analysis</strong>: Visual inspection of predictions on challenging cases with heavy abbreviation usage, complex structures, and edge cases to understand failure modes.</p>
</li>
</ol>
<h2 id="results--conclusions-resolving-the-abbreviation-bottleneck">Results &amp; Conclusions: Resolving the Abbreviation Bottleneck</h2>
<ul>
<li>
<p><strong>Performance Gains on Abbreviations</strong>: On MolRec-Abb, GTR-VL-Stage1 achieves 85.49% Graph accuracy compared to around 20% for MolScribe and MolNexTR with their original checkpoints. On MolRec-USPTO, GTR-VL-Stage1 reaches 93.45% Graph accuracy. Existing specialist models see their accuracy drop below 20% on MolRec-Abb when abbreviations are present.</p>
</li>
<li>
<p><strong>Data Correction is Critical</strong>: When MolScribe and MolNexTR were retrained on GTR-1.3M, their MolRec-Abb Graph accuracy jumped from around 20% to 70.60% and 71.85% respectively. GTR-VL-Stage1 still outperformed these retrained baselines at 85.49%, confirming that both data correction and the graph traversal approach contribute.</p>
</li>
<li>
<p><strong>Chain-of-Thought Helps</strong>: Ablation on GTR-USPTO-351K shows that CoT yields 68.85% Gen-SMILES vs. 66.54% without CoT, a 2.31 percentage point improvement.</p>
</li>
<li>
<p><strong>Graph Traversal Beats Traditional Parsing</strong>: Graph traversal achieves 83.26% Graph accuracy vs. 80.15% for the atoms-then-bonds approach, and 81.88% vs. 79.02% on Gra-SMILES.</p>
</li>
<li>
<p><strong>General VLMs Still Struggle</strong>: General-purpose VLMs like GPT-4o scored near 0% on MolRec-Bench across all metrics, highlighting the importance of domain-specific training for OCSR.</p>
</li>
<li>
<p><strong>Hand-Drawn Recognition via GRPO</strong>: GTR-VL-Stage1 (SFT only) achieves only 9.53% Graph accuracy on DECIMER-HD-Test, but after GRPO training in Stage 2, performance jumps to 75.44%. On ChemPix, Graph accuracy rises from 22.02% to 86.13%. The graph reward is essential: GRPO without graph supervision achieves only 11.00% SMILES accuracy on DECIMER-HD-Test, while adding the graph reward reaches 75.64%.</p>
</li>
<li>
<p><strong>Evaluation Methodology Matters</strong>: The new graph-based evaluation metrics revealed problems with traditional SMILES-based evaluation that previous work had missed. Many &ldquo;failures&rdquo; in existing benchmarks were actually correct graph predictions that got marked wrong due to canonicalization issues with abbreviations.</p>
</li>
</ul>
<p>The work establishes that addressing the abbreviation problem requires both correcting the training data and rethinking the model architecture. The combination of faithful data annotation and sequential graph generation improves OCSR performance on molecules with abbreviations by a large margin over previous methods.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Base Model</strong>: GTR-VL fine-tunes <strong>Qwen2.5-VL</strong>.</p>
<p><strong>Input/Output Mechanism</strong>:</p>
<ul>
<li><strong>Input</strong>: The model takes an image $I_m$ and a text prompt</li>
<li><strong>Output</strong>: The model generates $R_m = \text{concat}(CoT_m, S_m)$, where it first produces the Chain-of-Thought (the graph traversal steps) followed immediately by the final SMILES string</li>
<li><strong>Traversal Strategy</strong>: Uses <strong>depth-first traversal</strong> to alternately predict atoms and bonds</li>
</ul>
<p><strong>Prompt Structure</strong>: The model is prompted to &ldquo;list the types of atomic elements&hellip; the coordinates&hellip; and the chemical bonds&hellip; then&hellip; output a canonical SMILES&rdquo;. The CoT output is formatted as a JSON list of atoms (with coordinates) and bonds (with indices referring to previous atoms), interleaved.</p>
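<p>The interleaved atom/bond sequence can be sketched as a depth-first walk that emits an atom, then the bond to the next unvisited neighbor, and so on. This is an illustrative sketch only; the tuple format below is hypothetical and not the paper's exact JSON schema.</p>

```python
def traverse_cot(atoms, bonds):
    """Depth-first traversal emitting an interleaved atom/bond sequence.
    atoms: list of (symbol, (x, y)); bonds: dict {(i, j): bond_type}.
    Each emitted bond refers back to already-emitted atom indices, so
    every prediction step can condition on the partial graph so far."""
    adj = {}
    for (i, j), t in bonds.items():
        adj.setdefault(i, []).append((j, t))
        adj.setdefault(j, []).append((i, t))
    seq, visited = [], set()

    def dfs(i):
        visited.add(i)
        symbol, coords = atoms[i]
        seq.append(("atom", i, symbol, coords))
        for j, t in sorted(adj.get(i, [])):
            if j not in visited:
                seq.append(("bond", i, j, t))
                dfs(j)

    dfs(0)
    return seq

# Toy 3-atom chain C-O-N: atoms alternate with the bonds that connect them
atoms = [("C", (0, 0)), ("O", (1, 0)), ("N", (2, 0))]
bonds = {(0, 1): "single", (1, 2): "single"}
for step in traverse_cot(atoms, bonds):
    print(step)
```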
<h3 id="data">Data</h3>
<p><strong>Training Dataset (GTR-1.3M)</strong>:</p>
<ul>
<li><strong>Synthetic Component</strong>: 1 million molecular SMILES from PubChem, converted to images using Indigo</li>
<li><strong>Real Component</strong>: 351,000 samples from USPTO patents (filtered from an original 680,000)
<ul>
<li>Processed using an OCR pipeline to detect abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Et&rdquo;)</li>
<li>Ground truth expanded structures replaced with superatoms to match visible abbreviations in images</li>
<li>This &ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; correction ensures training supervision matches visual input</li>
</ul>
</li>
</ul>
<p><strong>Evaluation Dataset (MolRec-Bench)</strong>:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 molecular images from USPTO patents</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Graph Traversal Algorithm</strong>:</p>
<ul>
<li>Depth-first traversal strategy</li>
<li>Alternating atom-bond prediction sequence</li>
<li>Each step uses previously predicted atoms and bonds as context</li>
</ul>
<p><strong>Two-Stage Training</strong>:</p>
<ul>
<li><strong>Stage 1 (SFT)</strong>: Train on GTR-1.3M to learn visual CoT mechanism for printed molecules (produces GTR-VL-Stage1)</li>
<li><strong>Stage 2 (GRPO)</strong>: Apply GRPO on GTR-USPTO-4K + DECIMER-HD-Train (4,070 samples) for hand-drawn recognition (produces GTR-VL-Stage2, i.e., GTR-VL)</li>
</ul>
<p><strong>Training Procedure</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate (SFT)</strong>: Peak learning rate of $1.6 \times 10^{-4}$ with cosine decay</li>
<li><strong>Learning Rate (GRPO)</strong>: Peak learning rate of $1 \times 10^{-5}$ with cosine decay</li>
<li><strong>Warm-up</strong>: Linear warm-up for the first 10% of iterations</li>
<li><strong>Batch Size (SFT)</strong>: 2 per GPU with gradient accumulation over 16 steps, yielding <strong>effective batch size of 1024</strong></li>
<li><strong>Batch Size (GRPO)</strong>: 4 per GPU with gradient accumulation of 1, yielding <strong>effective batch size of 128</strong></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong> (three complementary measures to handle abbreviation issues):</p>
<ul>
<li><strong>Gen-SMILES</strong>: Exact match ratio of SMILES strings directly generated by the VLM (image-captioning style)</li>
<li><strong>Gra-SMILES</strong>: Exact match ratio of SMILES strings derived from the predicted graph structure (graph-parsing style)</li>
<li><strong>Graph</strong>: Exact match ratio between ground truth and predicted graphs (node/edge comparison, bypassing SMILES canonicalization issues)</li>
</ul>
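<p>A minimal sketch of the Graph metric, under the simplifying assumption that predicted and ground-truth graphs share an atom indexing (real evaluation must compare graphs up to isomorphism):</p>

```python
def graph_exact_match(pred, gt):
    """Graph metric (sketch): exact match of node and edge sets,
    bypassing SMILES canonicalization.  A graph is (atoms, bonds):
    atoms maps index -> symbol; bonds is a set of (i, j, order), i < j."""
    pred_atoms, pred_bonds = pred
    gt_atoms, gt_bonds = gt
    return pred_atoms == gt_atoms and pred_bonds == gt_bonds

# Toy C-C-O chain; a single wrong atom label breaks the match
gt = ({0: "C", 1: "C", 2: "O"}, {(0, 1, 1), (1, 2, 1)})
pred_ok = ({0: "C", 1: "C", 2: "O"}, {(0, 1, 1), (1, 2, 1)})
pred_bad = ({0: "C", 1: "N", 2: "O"}, {(0, 1, 1), (1, 2, 1)})
print(graph_exact_match(pred_ok, gt), graph_exact_match(pred_bad, gt))  # True False
```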
<p><strong>Baselines Compared</strong>:</p>
<ul>
<li>Specialist OCSR systems: MolScribe, MolNexTR</li>
<li>Chemistry-focused VLMs: ChemVLM, ChemDFM-X, OCSU</li>
<li>General-purpose VLMs: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute</strong>: Training performed on <strong>32 NVIDIA A100 GPUs</strong></p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Status</strong>: Closed. As of the paper&rsquo;s publication, no source code, pre-trained model weights, or dataset downloads (GTR-1.3M, MolRec-Bench) have been publicly released. The paper does not mention plans for open-source release. The training data pipeline relies on PubChem SMILES (public), USPTO patent images (publicly available through prior work), the Indigo rendering tool (open-source), and an unspecified OCR system for abbreviation detection. Without the released code and data corrections, reproducing the full pipeline would require substantial re-implementation effort.</p>
]]></content:encoded></item><item><title>OCSAug: Diffusion-Based Augmentation for Hand-Drawn OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</guid><description>A diffusion-based data augmentation pipeline (OCSAug) using DDPM and RePaint to improve optical chemical structure recognition on hand-drawn images.</description><content:encoded><![CDATA[<h2 id="document-taxonomy-ocsaug-as-a-novel-method">Document Taxonomy: OCSAug as a Novel Method</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>. It proposes a novel data augmentation pipeline (<strong>OCSAug</strong>) that integrates Denoising Diffusion Probabilistic Models (DDPM) and the RePaint algorithm to address the data scarcity problem in hand-drawn optical chemical structure recognition (OCSR). The contribution is validated through systematic benchmarking against existing augmentation techniques (RDKit, Randepict) and ablation studies on mask design.</p>
<h2 id="expanding-hand-drawn-training-data-for-ocsr">Expanding Hand-Drawn Training Data for OCSR</h2>
<p>A vast amount of molecular structure data exists in analog formats, such as hand-drawn diagrams in research notes or older literature. While OCSR models perform well on digitally rendered images, they struggle with hand-drawn images due to noise, varying handwriting styles, and distortions. Current datasets for hand-drawn images (e.g., DECIMER) are too small to train effective models, and existing augmentation tools (RDKit, Randepict) fail to generate sufficiently realistic hand-drawn variations.</p>
<h2 id="ocsaug-pipeline-masked-repaint-via-generative-ai">OCSAug Pipeline: Masked RePaint via Generative AI</h2>
<p>The core novelty is <strong>OCSAug</strong>, a three-phase pipeline that uses generative AI to synthesize training data:</p>
<ol>
<li><strong>DDPM + RePaint</strong>: It utilizes a DDPM to learn the distribution of hand-drawn images and the RePaint algorithm for inpainting.</li>
<li><strong>Structural Masking</strong>: It introduces <strong>vertical and horizontal stripe pattern masks</strong>. These masks selectively obscure parts of atoms or bonds, forcing the diffusion model to reconstruct them with irregular &ldquo;hand-drawn&rdquo; styles while preserving the underlying chemical topology.</li>
<li><strong>Label Transfer</strong>: Because the chemical structure is preserved during inpainting, the SMILES label from the original image is directly transferred to the augmented image, bypassing the need for re-annotation.</li>
</ol>
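<p>The stripe masks can be sketched as binary arrays over the image. The stripe width and spacing below are illustrative assumptions; the paper ablates mask thickness (quantified via RMSE) rather than fixing these values.</p>

```python
import numpy as np

def stripe_mask(h, w, stripe_width=4, spacing=16, vertical=True):
    """Binary stripe mask (sketch): 1 = keep pixel, 0 = region the
    diffusion model repaints via RePaint.  Periodic stripes cut across
    bonds/atoms so they are redrawn in an irregular hand-drawn style
    while the overall chemical topology stays intact."""
    mask = np.ones((h, w), dtype=np.uint8)
    limit = w if vertical else h
    for start in range(0, limit, spacing):
        if vertical:
            mask[:, start:start + stripe_width] = 0
        else:
            mask[start:start + stripe_width, :] = 0
    return mask

m = stripe_mask(64, 64)
print(m.shape, int(m.min()), int(m.max()))  # (64, 64) 0 1
```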
<h2 id="benchmarking-diffusion-augmentations-on-decimer">Benchmarking Diffusion Augmentations on DECIMER</h2>
<p>The authors evaluated OCSAug using the <strong>DECIMER dataset</strong>, specifically a &ldquo;drug-likeness&rdquo; subset filtered by Lipinski&rsquo;s and Veber&rsquo;s rules.</p>
<ul>
<li><strong>Baselines</strong>: The method was compared against <strong>RDKit</strong> (digital generation) and <strong>Randepict</strong> (rule-based augmentation).</li>
<li><strong>Models</strong>: Four recent OCSR models were fine-tuned: <strong>MolScribe</strong>, <strong>DECIMER 1.0 (I2S)</strong>, <strong>MolNexTR</strong>, and <strong>MPOCSR</strong>.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Tanimoto Similarity</strong>: To measure prediction accuracy against ground truth.</li>
<li><strong>Fréchet Inception Distance (FID)</strong>: To measure the distributional similarity between generated and real hand-drawn images.</li>
<li><strong>RMSE</strong>: To quantify pixel-level structural preservation across different mask thicknesses.</li>
</ul>
</li>
</ul>
<h2 id="improved-generalization-capabilities-and-fid-scores">Improved Generalization Capabilities and FID Scores</h2>
<ul>
<li><strong>Performance Boost</strong>: OCSAug improved recognition accuracy by an Improvement Ratio (fine-tuned vs. non-fine-tuned Tanimoto similarity) of <strong>1.918-3.820x</strong>, outperforming traditional augmentation techniques such as RDKit and Randepict (1.570-3.523x).</li>
<li><strong>Data Quality</strong>: OCSAug achieved the lowest FID score (0.471) compared to Randepict (4.054) and RDKit (10.581), indicating its generated images are much closer to the real hand-drawn distribution.</li>
<li><strong>Resolution Mixing</strong>: Training MolScribe and MolNexTR with a mix of $128 \times 128$, $256 \times 256$, and $512 \times 512$ resolution images improved Tanimoto similarity (e.g., MolScribe from 0.585 to 0.640), though this strategy did not help I2S or MPOCSR.</li>
<li><strong>Real-World Evaluation</strong>: On a newly collected dataset of 463 hand-drawn images from 6 volunteers (88 drug compounds), the MPOCSR model fine-tuned with OCSAug achieved 0.367 exact-match accuracy (Tanimoto = 1.0), versus 0.365 for non-augmented fine-tuning and 0.037 without fine-tuning. The gap was more pronounced in the area under the accuracy curve, indicating fewer severe misrecognitions.</li>
<li><strong>Limitations</strong>: The generation process is slow (3 weeks for 10k images on a single GPU). The fixed stripe masks may struggle with highly complex, non-drug-like geometries: when evaluated on the full DECIMER dataset (without drug-likeness filtering), OCSAug did not yield uniform improvements across all models.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jjjabcd/OCSAug">OCSAug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using guided-diffusion and RePaint</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6456306">DECIMER Hand-Drawn Dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY 4.0</td>
          <td>5,088 hand-drawn molecular structure images from 24 individuals</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: DECIMER dataset (hand-drawn images).</li>
<li><strong>Filtering</strong>: A &ldquo;drug-likeness&rdquo; filter was applied (Lipinski&rsquo;s rule of 5 + Veber&rsquo;s rules) along with an atom filter (C, H, O, S, F, Cl, Br, N, P only).</li>
<li><strong>Final Size</strong>: 3,194 samples, split into:
<ul>
<li><strong>Training</strong>: 2,604 samples.</li>
<li><strong>Validation</strong>: 290 samples.</li>
<li><strong>Test</strong>: 300 samples.</li>
</ul>
</li>
<li><strong>Resolution</strong>: All images resized to $256 \times 256$ pixels.</li>
</ul>
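<p>The drug-likeness filter can be sketched as follows. The thresholds are the standard Lipinski/Veber cutoffs; in practice the descriptor values (MW, logP, H-bond donors/acceptors, rotatable bonds, TPSA) would come from RDKit rather than being passed in by hand, so this function only applies the cutoffs:</p>

```python
ALLOWED_ELEMENTS = {"C", "H", "O", "S", "F", "Cl", "Br", "N", "P"}

def passes_druglikeness(mw, logp, hbd, hba, rot_bonds, tpsa, elements):
    """Lipinski's rule of 5 + Veber's rules + the paper's element whitelist.

    Descriptor values are assumed precomputed (e.g., with RDKit's
    Descriptors module).
    """
    lipinski = mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10
    veber = rot_bonds <= 10 and tpsa <= 140
    return lipinski and veber and set(elements) <= ALLOWED_ELEMENTS
```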
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DDPM implemented using <code>guided-diffusion</code>.</li>
<li><strong>RePaint Settings</strong>:
<ul>
<li>Total time steps: 250.</li>
<li>Jump length: 10.</li>
<li>Resampling counts: 10.</li>
</ul>
</li>
<li><strong>Masking Strategy</strong>:
<ul>
<li><strong>Vertical Stripes</strong>: Obscure atom symbols to vary handwriting style.</li>
<li><strong>Horizontal Stripes</strong>: Obscure bonds to vary length/thickness/alignment.</li>
<li><strong>Optimal Thickness</strong>: A stripe thickness of <strong>4 pixels</strong> was found to be optimal for balancing diversity and structural preservation.</li>
</ul>
</li>
</ul>
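<p>A minimal sketch of the stripe masks (function and parameter names are mine; the paper specifies the 4-pixel thickness, while the stripe spacing here is an assumed placeholder). A value of 1 marks pixels to keep and 0 marks pixels for the diffusion model to repaint:</p>

```python
def stripe_mask(height, width, thickness=4, spacing=16, orientation="vertical"):
    """Binary stripe mask: 1 = keep pixel, 0 = inpaint with RePaint.

    Vertical stripes cut across atom symbols, horizontal stripes across
    bonds, so the model redraws those slices in a hand-drawn style while
    the untouched pixels pin down the chemical topology.
    """
    mask = [[1] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            pos = x if orientation == "vertical" else y
            if pos % spacing < thickness:
                mask[y][x] = 0
    return mask
```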
<h3 id="models">Models</h3>
<p>The OCSR models were pretrained on PubChem (digital images) and then fine-tuned on the OCSAug-augmented dataset.</p>
<ul>
<li><strong>MolScribe</strong>: Swin Transformer encoder, Transformer decoder. Fine-tuned (all layers) for 30 epochs, batch size 16-128, LR 2e-5.</li>
<li><strong>I2S (DECIMER 1.0)</strong>: Inception V3 encoder (frozen), FC/Decoder fine-tuned. 25 epochs, batch size 64, LR 1e-5.</li>
<li><strong>MolNexTR</strong>: Dual-stream encoder (Swin + CNN). Fine-tuned (all layers) for 30 epochs, batch size 16-64, LR 2e-5.</li>
<li><strong>MPOCSR</strong>: MPViT backbone. Fine-tuned (all layers) for 25 epochs, batch size 16-32, LR 4e-5.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>
<p><strong>Metric</strong>: Improvement Ratio (IR) of Tanimoto Similarity (TS), computed per model as:</p>
<p>$$
\text{IR} = \frac{\text{TS}_{\text{finetuned}}}{\text{TS}_{\text{non-finetuned}}}
$$</p>
</li>
<li>
<p><strong>Validation</strong>: Cross-validation on the split DECIMER dataset.</p>
</li>
</ul>
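<p>A minimal sketch of the metric, assuming fingerprints are represented as sets of on-bits (in practice these would be RDKit Morgan fingerprints computed from the predicted and ground-truth SMILES):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints (sets of on-bits)."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def improvement_ratio(preds_finetuned, preds_baseline, references):
    """IR = mean TS of the fine-tuned model / mean TS of the non-fine-tuned model."""
    def mean_ts(preds):
        return sum(tanimoto(p, r) for p, r in zip(preds, references)) / len(references)
    return mean_ts(preds_finetuned) / mean_ts(preds_baseline)
```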
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: NVIDIA GeForce RTX 4090.</li>
<li><strong>Training Time</strong>: DDPM training took ~6 days.</li>
<li><strong>Generation Time</strong>: Generating 2,600 augmented images took ~70 hours.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, J. H., &amp; Choi, J. (2025). OCSAug: diffusion-based optical chemical structure data augmentation for improved hand-drawn chemical structure image recognition. <em>The Journal of Supercomputing</em>, 81(8), 926.</p>
<p><strong>Publication</strong>: The Journal of Supercomputing 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/jjjabcd/OCSAug">Official Repository</a></li>
<li><a href="https://zenodo.org/records/6456306">DECIMER Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimOCSAugDiffusionbasedOptical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSAug: Diffusion-Based Optical Chemical Structure Data Augmentation for Improved Hand-Drawn Chemical Structure Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSAug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kim, Jin Hyuk and Choi, Jonghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Supercomputing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{81}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{926}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11227-025-07406-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image-to-Sequence OCSR: A Comparative Analysis</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/image-to-sequence-comparison/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/image-to-sequence-comparison/</guid><description>Comparative analysis of image-to-sequence OCSR methods across architecture, output format, training data, and compute requirements.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This note provides a comparative analysis of image-to-sequence methods for Optical Chemical Structure Recognition (OCSR). These methods treat molecular structure recognition as an image captioning task, using encoder-decoder architectures to generate sequential molecular representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) directly from pixels.</p>
<p>For the full taxonomy of OCSR approaches including image-to-graph and rule-based methods, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR Methods taxonomy</a>.</p>
<h2 id="architectural-evolution-2019-2025">Architectural Evolution (2019-2025)</h2>
<p>The field has undergone rapid architectural evolution, with clear generational shifts in both encoder and decoder design.</p>
<h3 id="timeline">Timeline</h3>
<table>
  <thead>
      <tr>
          <th>Era</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Representative Methods</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2019-2020</strong></td>
          <td>CNN (Inception V3, ResNet)</td>
          <td>LSTM/GRU with Attention</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/">Staker et al.</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a></td>
      </tr>
      <tr>
          <td><strong>2021</strong></td>
          <td>EfficientNet, ViT</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/">Img2Mol</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI</a></td>
      </tr>
      <tr>
          <td><strong>2022</strong></td>
          <td>Swin Transformer, ResNet</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/">SwinOCSR</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/">MICER</a></td>
      </tr>
      <tr>
          <td><strong>2023-2024</strong></td>
          <td>EfficientNetV2, SwinV2</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/">DECIMER.ai</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/">MMSSC-Net</a></td>
      </tr>
      <tr>
          <td><strong>2025</strong></td>
          <td>EfficientViT, VLMs (Qwen2-VL)</td>
          <td>LLM decoders, RL fine-tuning</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/">MolSight</a>, <a href="/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/">GTR-CoT</a>, <a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU</a></td>
      </tr>
  </tbody>
</table>
<h3 id="encoder-architectures">Encoder Architectures</h3>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Methods Using It</th>
          <th>Key Characteristics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Inception V3</strong></td>
          <td>DECIMER (2020)</td>
          <td>Early CNN approach, 299x299 input</td>
      </tr>
      <tr>
          <td><strong>ResNet-50/101</strong></td>
          <td>IMG2SMI, Image2SMILES, MICER, DGAT</td>
          <td>Strong baseline, well-understood</td>
      </tr>
      <tr>
          <td><strong>EfficientNet-B3</strong></td>
          <td>DECIMER 1.0</td>
          <td>Efficient scaling, compound coefficients</td>
      </tr>
      <tr>
          <td><strong>EfficientNet-V2-M</strong></td>
          <td>DECIMER.ai, DECIMER-Hand-Drawn</td>
          <td>Improved training efficiency</td>
      </tr>
      <tr>
          <td><strong>EfficientViT-L1</strong></td>
          <td>MolSight</td>
          <td>Optimized for deployment</td>
      </tr>
      <tr>
          <td><strong>Swin Transformer</strong></td>
          <td>SwinOCSR, MolParser</td>
          <td>Hierarchical vision transformer</td>
      </tr>
      <tr>
          <td><strong>SwinV2</strong></td>
          <td>MMSSC-Net, Image2InChI</td>
          <td>Improved training stability</td>
      </tr>
      <tr>
          <td><strong>Vision Transformer (ViT)</strong></td>
          <td>ViT-InChI</td>
          <td>Pure attention encoder</td>
      </tr>
      <tr>
          <td><strong>DenseNet</strong></td>
          <td>RFL, Hu et al. RCGD</td>
          <td>Dense connections, feature reuse</td>
      </tr>
      <tr>
          <td><strong>Deep TNT</strong></td>
          <td>ICMDT</td>
          <td>Transformer-in-Transformer</td>
      </tr>
      <tr>
          <td><strong>Qwen2-VL</strong></td>
          <td>OCSU, GTR-CoT</td>
          <td>Vision-language model encoder</td>
      </tr>
  </tbody>
</table>
<h3 id="decoder-architectures">Decoder Architectures</h3>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Methods Using It</th>
          <th>Output Format</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GRU with Attention</strong></td>
          <td>DECIMER, RFL, Hu et al. RCGD</td>
          <td>SMILES, RFL, SSML</td>
      </tr>
      <tr>
          <td><strong>LSTM with Attention</strong></td>
          <td>Staker et al., ChemPix, MICER</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><strong>Transformer</strong></td>
          <td>Most 2021+ methods</td>
          <td>SMILES, SELFIES, InChI</td>
      </tr>
      <tr>
          <td><strong>GPT-2</strong></td>
          <td>MMSSC-Net</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><strong>BART</strong></td>
          <td>MolParser</td>
          <td>E-SMILES</td>
      </tr>
      <tr>
          <td><strong>Pre-trained CDDD</strong></td>
          <td>Img2Mol</td>
          <td>Continuous embedding → SMILES</td>
      </tr>
  </tbody>
</table>
<h2 id="output-representation-comparison">Output Representation Comparison</h2>
<p>The choice of molecular string representation significantly impacts model performance. Representations fall into three categories: core molecular formats for single structures, extended formats for molecular families and variable structures (primarily Markush structures in patents), and specialized representations optimizing for specific recognition challenges.</p>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan et al. 2022 ablation study</a> provides a comparison of core formats.</p>
<h3 id="core-molecular-formats">Core Molecular Formats</h3>
<p>These represent specific, concrete molecular structures.</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Validity Guarantee</th>
          <th>Sequence Length</th>
          <th>Key Characteristic</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>No</td>
          <td>Shortest (baseline)</td>
          <td>Standard, highest accuracy</td>
          <td>DECIMER.ai, MolSight, DGAT, most 2023+</td>
      </tr>
      <tr>
          <td><strong>DeepSMILES</strong></td>
          <td>Partial</td>
          <td>~1.1x SMILES</td>
          <td>Reduces non-local dependencies</td>
          <td>SwinOCSR</td>
      </tr>
      <tr>
          <td><strong>SELFIES</strong></td>
          <td>Yes (100%)</td>
          <td>~1.5x SMILES</td>
          <td>Guaranteed valid molecules</td>
          <td>DECIMER 1.0, IMG2SMI</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>N/A (canonical)</td>
          <td>Variable (long)</td>
          <td>Unique identifiers, layered syntax</td>
          <td>ViT-InChI, ICMDT, Image2InChI</td>
      </tr>
      <tr>
          <td><strong>FG-SMILES</strong></td>
          <td>No</td>
          <td>Similar to SMILES</td>
          <td>Functional group-aware tokenization</td>
          <td>Image2SMILES</td>
      </tr>
  </tbody>
</table>
<h4 id="smiles-and-variants">SMILES and Variants</h4>
<p><strong>SMILES</strong> remains the dominant format due to its compactness and highest accuracy on clean data. Standard SMILES uses single characters for ring closures and branches that may appear far apart in the sequence, creating learning challenges for sequence models.</p>
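<p>The non-local dependency is easy to quantify. The toy function below (mine, not from any of the surveyed papers) measures how many characters separate matching single-digit ring-closure labels in a SMILES string; it ignores <code>%nn</code> two-digit closures and bracket atoms for brevity:</p>

```python
def ring_closure_spans(smiles):
    """Character distance between matching single-digit ring-closure labels."""
    open_at, spans = {}, []
    for i, ch in enumerate(smiles):
        if ch.isdigit():
            if ch in open_at:
                spans.append(i - open_at.pop(ch))  # closure: record the gap
            else:
                open_at[ch] = i                    # opening occurrence
    return spans
```

<p>For benzene (<code>c1ccccc1</code>) the gap is only 6 characters, but in fused or bridged systems the matching digit can sit dozens of tokens away; that is exactly the dependency DeepSMILES reworks.</p>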
<p><strong>DeepSMILES</strong> addresses these non-local syntax dependencies by modifying how branches and ring closures are encoded, making sequences more learnable for neural models. Despite this modification, DeepSMILES sequences are ~1.1x longer than standard SMILES (not shorter). The format offers partial validity improvements through regex-based tokenization with a compact 76-token vocabulary, providing a middle ground between SMILES accuracy and guaranteed validity.</p>
<p><strong>SELFIES</strong> guarantees 100% valid molecules by design through a context-free grammar, eliminating invalid outputs entirely. This comes at the cost of ~1.5x longer sequences and a typical 2-5% accuracy drop compared to SMILES on exact-match metrics. The validity guarantee makes SELFIES particularly attractive for generative modeling applications.</p>
<p><strong>InChI</strong> uses a layered canonical syntax fundamentally different from SMILES-based formats. While valuable for unique molecular identification, its complex multi-layer structure (formula, connectivity, stereochemistry, isotopes, etc.) and longer sequences make it less suitable for image-to-sequence learning, resulting in lower recognition accuracy.</p>
<h4 id="key-findings-from-rajan-et-al-2022">Key Findings from Rajan et al. 2022</h4>
<ol>
<li><strong>SMILES achieves highest exact-match accuracy</strong> on clean synthetic data</li>
<li><strong>SELFIES guarantees 100% valid molecules</strong> but at cost of ~2-5% accuracy drop</li>
<li><strong>InChI is problematic</strong> due to complex layered syntax and longer sequences</li>
<li><strong>DeepSMILES offers middle ground</strong> with partial validity improvements through modified syntax</li>
</ol>
<h3 id="extended-formats-for-variable-structures">Extended Formats for Variable Structures</h3>
<p><strong>Markush structures</strong> represent families of molecules, using variable groups (R1, R2, etc.) with textual definitions. They are ubiquitous in patent documents for intellectual property protection. Standard SMILES cannot represent these variable structures.</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Base Format</th>
          <th>Key Feature</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>E-SMILES</strong></td>
          <td>SMILES + XML annotations</td>
          <td>Backward-compatible with separator token</td>
          <td>MolParser</td>
      </tr>
      <tr>
          <td><strong>CXSMILES</strong></td>
          <td>SMILES + extension block</td>
          <td>Substituent tables, compression</td>
          <td>MarkushGrapher</td>
      </tr>
  </tbody>
</table>
<p><strong>E-SMILES</strong> (Extended SMILES) maintains backward compatibility by using a <code>&lt;sep&gt;</code> token to separate core SMILES from XML-like annotations. Annotations encode Markush substituents (<code>&lt;a&gt;index:group&lt;/a&gt;</code>), polymer structures (<code>&lt;p&gt;polymer_info&lt;/p&gt;</code>), and abstract ring patterns (<code>&lt;r&gt;abstract_ring&lt;/r&gt;</code>). The core structure remains parseable by standard RDKit.</p>
<p><strong>CXSMILES</strong> optimizes representation by moving variable groups directly into the main SMILES string as special atoms with explicit atom indexing (e.g., <code>C:1</code>) to link to an extension block containing substituent tables. This handles both frequency variation and position variation in Markush structures.</p>
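<p>Based on the description above, a toy parser for the E-SMILES shape (the exact grammar is defined by MolParser; the tag names here follow the examples in the text) might look like:</p>

```python
import re

def split_esmiles(esmiles):
    """Split an E-SMILES string into its RDKit-parseable core and the
    XML-like annotation tail that follows the <sep> token."""
    core, _, ext = esmiles.partition("<sep>")
    return core.strip(), ext.strip()

def parse_substituents(ext):
    """Extract Markush substituent annotations of the form <a>index:group</a>."""
    return {int(i): g for i, g in re.findall(r"<a>(\d+):([^<]+)</a>", ext)}
```

<p>The design point is backward compatibility: everything before <code>&lt;sep&gt;</code> is plain SMILES, so existing cheminformatics tooling still works on the core structure.</p>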
<h3 id="specialized-representations">Specialized Representations</h3>
<p>These formats optimize for specific recognition challenges beyond standard single-molecule tasks.</p>
<h4 id="rfl-ring-free-language">RFL: Ring-Free Language</h4>
<p><strong>RFL</strong> fundamentally restructures molecular serialization through hierarchical ring decomposition, addressing a core challenge: standard 1D formats (SMILES, SSML) flatten complex 2D molecular graphs, losing explicit spatial relationships.</p>
<p><strong>Mechanism</strong>: RFL decomposes molecules into three explicit components:</p>
<ul>
<li><strong>Molecular Skeleton (𝒮)</strong>: Main graph with rings &ldquo;collapsed&rdquo;</li>
<li><strong>Ring Structures (ℛ)</strong>: Individual ring components stored separately</li>
<li><strong>Branch Information (ℱ)</strong>: Connectivity between skeleton and rings</li>
</ul>
<p><strong>Technical approach</strong>:</p>
<ol>
<li>Detect all non-nested rings using DFS</li>
<li>Calculate adjacency ($\gamma$) between rings based on shared edges</li>
<li>Merge isolated rings ($\gamma=0$) into <strong>SuperAtoms</strong> (single node placeholders)</li>
<li>Merge adjacent rings ($\gamma&gt;0$) into <strong>SuperBonds</strong> (edge placeholders)</li>
<li>Progressive decoding: predict skeleton first, then conditionally decode rings using stored hidden states</li>
</ol>
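<p>Step 2 reduces to counting shared edges between ring cycles. A minimal sketch, assuming rings are given as ordered atom-index lists (in practice these would come from something like RDKit's <code>GetRingInfo()</code>):</p>

```python
def ring_adjacency(ring_a, ring_b):
    """Gamma = number of bonds (edges) shared by two rings.

    Each ring is an ordered cycle of atom indices; consecutive pairs
    (wrapping around) are its edges. gamma == 0 -> isolated ring
    (merged to a SuperAtom); gamma > 0 -> adjacent rings (SuperBond).
    """
    def edges(ring):
        return {frozenset((ring[i], ring[(i + 1) % len(ring)]))
                for i in range(len(ring))}
    return len(edges(ring_a) & edges(ring_b))
```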
<p><strong>Performance</strong>: RFL achieves SOTA results on both handwritten (95.38% EM) and printed (95.58% EM) structures, with particular strength on high-complexity molecules, where standard baselines fail completely (0% exact match vs. ~30% for RFL on the hardest tier).</p>
<p><strong>Note</strong>: RFL does not preserve original drawing orientation; it&rsquo;s focused on computational efficiency through hierarchical decomposition.</p>
<h4 id="ssml-structure-specific-markup-language">SSML: Structure-Specific Markup Language</h4>
<p><strong>SSML</strong> is the primary orientation-preserving format in OCSR. Based on Chemfig (LaTeX chemical drawing package), it provides step-by-step drawing instructions.</p>
<p><strong>Key characteristics</strong>:</p>
<ul>
<li>Describes <em>how to draw</em> the molecule alongside its graph structure</li>
<li>Uses &ldquo;reconnection marks&rdquo; for cyclic structures</li>
<li>Preserves branch angles and spatial relationships</li>
<li>Significantly outperformed SMILES for handwritten recognition: 92.09% vs 81.89% EM (Hu et al. RCGD 2023)</li>
</ul>
<p><strong>Use case</strong>: Particularly valuable for hand-drawn structure recognition where visual alignment between image and reconstruction sequence aids model learning.</p>
<h2 id="training-data-comparison">Training Data Comparison</h2>
<p>Training data scale has grown dramatically, with a shift toward combining synthetic and real-world images.</p>
<h3 id="data-scale-evolution">Data Scale Evolution</h3>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Typical Scale</th>
          <th>Maximum Reported</th>
          <th>Primary Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2019-2020</td>
          <td>1-15M</td>
          <td>57M (Staker)</td>
          <td>Synthetic (RDKit, CDK)</td>
      </tr>
      <tr>
          <td>2021-2022</td>
          <td>5-35M</td>
          <td>35M (DECIMER 1.0)</td>
          <td>Synthetic with augmentation</td>
      </tr>
      <tr>
          <td>2023-2024</td>
          <td>100-150M</td>
          <td>450M+ (DECIMER.ai)</td>
          <td>Synthetic + real patents</td>
      </tr>
      <tr>
          <td>2025</td>
          <td>1-10M + real</td>
          <td>7.7M (MolParser)</td>
          <td>Curated real + synthetic</td>
      </tr>
  </tbody>
</table>
<h3 id="synthetic-vs-real-data">Synthetic vs Real Data</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Training Data</th>
          <th>Real-World Performance Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>450M+ synthetic (RanDepict)</td>
          <td>Strong generalization via domain randomization</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>7.7M with active learning</td>
          <td>Explicitly targets &ldquo;in the wild&rdquo; images</td>
      </tr>
      <tr>
          <td><strong>GTR-CoT</strong></td>
          <td>Real patent/paper images</td>
          <td>Chain-of-thought improves reasoning</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>Multi-stage curriculum</td>
          <td>RL fine-tuning for stereochemistry</td>
      </tr>
  </tbody>
</table>
<h3 id="data-augmentation-strategies">Data Augmentation Strategies</h3>
<p>Common augmentation techniques across methods:</p>
<table>
  <thead>
      <tr>
          <th>Technique</th>
          <th>Purpose</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Rotation</strong></td>
          <td>Orientation invariance</td>
          <td>Nearly all methods</td>
      </tr>
      <tr>
          <td><strong>Gaussian blur</strong></td>
          <td>Image quality variation</td>
          <td>DECIMER, MolParser</td>
      </tr>
      <tr>
          <td><strong>Salt-and-pepper noise</strong></td>
          <td>Scan artifact simulation</td>
          <td>DECIMER, Image2SMILES</td>
      </tr>
      <tr>
          <td><strong>Affine transforms</strong></td>
          <td>Perspective variation</td>
          <td>ChemPix, MolParser</td>
      </tr>
      <tr>
          <td><strong>Font/style variation</strong></td>
          <td>Rendering diversity</td>
          <td>RanDepict (DECIMER.ai)</td>
      </tr>
      <tr>
          <td><strong>Hand-drawn simulation</strong></td>
          <td>Sketch-like inputs</td>
          <td>ChemPix, ChemReco, DECIMER-Hand-Drawn</td>
      </tr>
      <tr>
          <td><strong>Background variation</strong></td>
          <td>Document context</td>
          <td>MolParser, DECIMER.ai</td>
      </tr>
  </tbody>
</table>
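<p>Most of these transforms are one-liners in imaging libraries; salt-and-pepper noise, for instance, reduces to flipping a random fraction of pixels to black or white. A dependency-free sketch on a grayscale image stored as nested lists (parameter values are illustrative, not taken from any particular method):</p>

```python
import random

def salt_and_pepper(image, amount=0.05, seed=0):
    """Set roughly `amount` of the pixels to 0 (pepper) or 255 (salt),
    mimicking binarization and scan artifacts. Returns a new image."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the original is untouched
    for _ in range(int(amount * h * w)):
        y, x = rng.randrange(h), rng.randrange(w)
        out[y][x] = rng.choice((0, 255))
    return out
```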
<h2 id="hardware-and-compute-requirements">Hardware and Compute Requirements</h2>
<p>Hardware requirements span several orders of magnitude, from consumer GPUs to TPU pods.</p>
<h3 id="training-hardware-comparison">Training Hardware Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Hardware</th>
          <th>Training Time</th>
          <th>Dataset Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Staker et al. (2019)</strong></td>
          <td>8x GPUs</td>
          <td>26 days</td>
          <td>57M</td>
      </tr>
      <tr>
          <td><strong>IMG2SMI (2021)</strong></td>
          <td>1x RTX 2080 Ti</td>
          <td>5 epochs</td>
          <td>~10M</td>
      </tr>
      <tr>
          <td><strong>Image2SMILES (2022)</strong></td>
          <td>4x V100</td>
          <td>2 weeks</td>
          <td>30M</td>
      </tr>
      <tr>
          <td><strong>MICER (2022)</strong></td>
          <td>4x V100</td>
          <td>42 hours</td>
          <td>10M</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0 (2021)</strong></td>
          <td>TPU v3-8</td>
          <td>Not reported</td>
          <td>35M</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai (2023)</strong></td>
          <td>TPU v3-256</td>
          <td>Not reported</td>
          <td>450M+</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR (2022)</strong></td>
          <td>4x RTX 3090</td>
          <td>5 days</td>
          <td>5M</td>
      </tr>
      <tr>
          <td><strong>MolParser (2025)</strong></td>
          <td>8x A100</td>
          <td>Curriculum learning</td>
          <td>7.7M</td>
      </tr>
      <tr>
          <td><strong>MolSight (2025)</strong></td>
          <td>Not specified</td>
          <td>RL fine-tuning (GRPO)</td>
          <td>Multi-stage</td>
      </tr>
  </tbody>
</table>
<h3 id="inference-considerations">Inference Considerations</h3>
<p>Few papers report inference speed consistently. Available data:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Inference Speed</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>4x faster than DECIMER</td>
          <td>TensorFlow Lite optimization</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong> (baseline)</td>
          <td>~1 image/sec</td>
          <td>CPU-based rule system</td>
      </tr>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>Real-time capable</td>
          <td>Optimized Swin encoder</td>
      </tr>
  </tbody>
</table>
<h3 id="accessibility-tiers">Accessibility Tiers</h3>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Hardware</th>
          <th>Representative Methods</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Consumer</strong></td>
          <td>1x RTX 2080/3090</td>
          <td>IMG2SMI, ChemPix</td>
      </tr>
      <tr>
          <td><strong>Workstation</strong></td>
          <td>4x V100/A100</td>
          <td>Image2SMILES, MICER, SwinOCSR</td>
      </tr>
      <tr>
          <td><strong>Cloud/HPC</strong></td>
          <td>TPU pods, 8+ A100</td>
          <td>DECIMER.ai, MolParser</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmark-performance">Benchmark Performance</h2>
<h3 id="common-evaluation-datasets">Common Evaluation Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Type</th>
          <th>Size</th>
          <th>Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>Patent images</td>
          <td>~5K test</td>
          <td>Real-world complexity</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>Scanned images</td>
          <td>~5K test</td>
          <td>Scan artifacts</td>
      </tr>
      <tr>
          <td><strong>Staker</strong></td>
          <td>Synthetic</td>
          <td>Variable</td>
          <td>Baseline synthetic</td>
      </tr>
      <tr>
          <td><strong>CLEF</strong></td>
          <td>Patent images</td>
          <td>~1K test</td>
          <td>Markush structures</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>Japanese patents</td>
          <td>~1K test</td>
          <td>Different rendering styles</td>
      </tr>
  </tbody>
</table>
<h3 id="accuracy-comparison-exact-match-">Accuracy Comparison (Exact Match %)</h3>
<p>Methods are roughly grouped by evaluation era; direct comparison is complicated by different test sets.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>USPTO</th>
          <th>UOB</th>
          <th>Staker</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>OSRA</strong> (baseline)</td>
          <td>~70%</td>
          <td>~65%</td>
          <td>~80%</td>
          <td>Rule-based reference</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>~85%</td>
          <td>~80%</td>
          <td>~90%</td>
          <td>First transformer-based</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>~88%</td>
          <td>~82%</td>
          <td>~92%</td>
          <td>Swin encoder advantage</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>~90%</td>
          <td>~85%</td>
          <td>~95%</td>
          <td>Scale + augmentation</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>~92%</td>
          <td>~88%</td>
          <td>~96%</td>
          <td>Real-world focus</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>~93%+</td>
          <td>~89%+</td>
          <td>~97%+</td>
          <td>RL fine-tuning boost</td>
      </tr>
  </tbody>
</table>
<p><em>Note: Numbers are approximate and may vary by specific test split. See individual paper notes for precise figures.</em></p>
<h3 id="stereochemistry-recognition">Stereochemistry Recognition</h3>
<p>Stereochemistry remains a persistent challenge across all methods:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Approach</th>
          <th>Stereo Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Most methods</strong></td>
          <td>Standard SMILES</td>
          <td>Lower than non-stereo</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>RL (GRPO) specifically for stereo</td>
          <td>Improved</td>
      </tr>
      <tr>
          <td><strong>MolNexTR</strong></td>
          <td>Graph-based explicit stereo</td>
          <td>Better handling</td>
      </tr>
      <tr>
          <td><strong>Image2InChI</strong></td>
          <td>InChI stereo layers</td>
          <td>Mixed results</td>
      </tr>
  </tbody>
</table>
<h2 id="hand-drawn-recognition">Hand-Drawn Recognition</h2>
<p>A distinct sub-lineage focuses on hand-drawn/sketched chemical structures.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Target Domain</th>
          <th>Key Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ChemPix (2021)</strong></td>
          <td>Hand-drawn hydrocarbons</td>
          <td>First deep learning for sketches</td>
      </tr>
      <tr>
          <td><strong>Hu et al. RCGD (2023)</strong></td>
          <td>Hand-drawn structures</td>
          <td>Random conditional guided decoder</td>
      </tr>
      <tr>
          <td><strong>ChemReco (2024)</strong></td>
          <td>Hand-drawn C-H-O structures</td>
          <td>EfficientNet + curriculum learning</td>
      </tr>
      <tr>
          <td><strong>DECIMER-Hand-Drawn (2024)</strong></td>
          <td>General hand-drawn</td>
          <td>Enhanced DECIMER architecture</td>
      </tr>
  </tbody>
</table>
<h3 id="hand-drawn-vs-printed-trade-offs">Hand-Drawn vs Printed Trade-offs</h3>
<ul>
<li>Hand-drawn methods sacrifice some accuracy on clean printed images</li>
<li>Require specialized training data (synthetic hand-drawn simulation)</li>
<li>Generally smaller training sets due to data collection difficulty</li>
<li>Better suited for educational and lab notebook applications</li>
</ul>
<h2 id="key-innovations-by-method">Key Innovations by Method</h2>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Primary Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Staker et al.</strong></td>
          <td>First end-to-end deep learning OCSR</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>Transformer decoder + SELFIES</td>
      </tr>
      <tr>
          <td><strong>Img2Mol</strong></td>
          <td>Continuous embedding space (CDDD)</td>
      </tr>
      <tr>
          <td><strong>Image2SMILES</strong></td>
          <td>Functional group-aware SMILES (FG-SMILES)</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>Hierarchical vision transformer encoder</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>Massive scale + RanDepict augmentation</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>Extended SMILES + active learning</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>RL fine-tuning (GRPO) for accuracy</td>
      </tr>
      <tr>
          <td><strong>GTR-CoT</strong></td>
          <td>Chain-of-thought graph traversal</td>
      </tr>
      <tr>
          <td><strong>OCSU</strong></td>
          <td>Multi-task vision-language understanding</td>
      </tr>
      <tr>
          <td><strong>RFL</strong></td>
          <td>Hierarchical ring decomposition with SuperAtoms/SuperBonds</td>
      </tr>
  </tbody>
</table>
<h2 id="open-challenges">Open Challenges</h2>
<ol>
<li><strong>Stereochemistry</strong>: Consistent challenge across all methods; RL approaches (MolSight) show promise</li>
<li><strong>Abbreviations/R-groups</strong>: E-SMILES and Markush-specific methods emerging</li>
<li><strong>Real-world robustness</strong>: Gap between synthetic training and patent/paper images</li>
<li><strong>Inference speed</strong>: Rarely reported; important for production deployment</li>
<li><strong>Memory efficiency</strong>: Almost never documented; limits accessibility</li>
<li><strong>Multi-molecule images</strong>: Most methods assume single isolated structure</li>
</ol>
<h2 id="references">References</h2>
<p>Individual paper notes linked throughout. For the complete method listing, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR Methods taxonomy</a>.</p>
]]></content:encoded></item><item><title>MolSight: OCSR with RL and Multi-Granularity Learning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</guid><description>A three-stage OCSR framework using SMILES pretraining, auxiliary bond/coordinate tasks, and reinforcement learning to master stereochemistry recognition.</description><content:encoded><![CDATA[<h2 id="contribution-a-framework-for-optical-chemical-structure-recognition">Contribution: A Framework for Optical Chemical Structure Recognition</h2>
<p>This is primarily a <strong>Method</strong> paper. It proposes a novel three-stage training framework (Pretraining → Fine-tuning → RL Post-training) to improve Optical Chemical Structure Recognition (OCSR). Specifically, it introduces the use of Group Relative Policy Optimization (GRPO) to solve non-differentiable chemical validity issues.</p>
<p>It also has a <strong>Resource</strong> component, as the authors construct and release <em>Stereo-200k</em>, a dataset specifically designed to train models on challenging stereoisomeric molecules.</p>
<h2 id="motivation-resolving-stereochemical-cues">Motivation: Resolving Stereochemical Cues</h2>
<p>Existing OCSR systems struggle to accurately recognize stereochemical information (e.g., chirality, geometric isomerism) because the visual cues distinguishing stereoisomers (such as wedge and dash bonds) are subtle. Current methods often fail to capture the geometric relationships required to distinguish molecules with identical connectivity but different spatial arrangements. Accurate recognition is critical for downstream tasks like drug discovery where stereochemistry determines pharmacological effects.</p>
<h2 id="core-innovations-grpo-and-multi-granularity-learning">Core Innovations: GRPO and Multi-Granularity Learning</h2>
<p>MolSight introduces three key technical innovations:</p>
<ol>
<li><strong>Reinforcement Learning for OCSR</strong>: It is the first OCSR system to incorporate RL (specifically GRPO) to directly optimize for chemical semantic correctness.</li>
<li><strong>Multi-Granularity Learning</strong>: It employs auxiliary heads for chemical bond classification and atom localization. Unlike previous approaches that optimize these jointly, MolSight decouples the coordinate head to prevent interference with SMILES generation.</li>
<li><strong>SMILES-M Notation</strong>: A lightweight extension to SMILES to handle Markush structures (common in patents) without significant sequence length increase.</li>
</ol>
<h2 id="experimental-methodology">Experimental Methodology</h2>
<p>The authors evaluated MolSight using a rigorous mix of real and synthetic benchmarks:</p>
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec, Imago) and deep learning methods (MolScribe, MolGrapher, DECIMER).</li>
<li><strong>Benchmarks</strong>: Evaluated on real-world datasets (USPTO, Maybridge UoB, CLEF-2012, JPO) and synthetic datasets (Staker, ChemDraw, Indigo, Stereo-2K).</li>
<li><strong>Ablation Studies</strong>: Tested the impact of the bond head, coordinate head, and RL stages separately.</li>
<li><strong>Transfer Learning</strong>: Assessed the quality of learned representations by using the frozen encoder for molecular property prediction on MoleculeNet.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>SOTA Performance</strong>: MolSight achieved 85.1% stereochemical accuracy on the USPTO dataset, significantly outperforming the previous SOTA (MolScribe) which achieved 69.0%.</li>
<li><strong>RL Effectiveness</strong>: Reinforcement learning post-training specifically improved performance on stereoisomers, raising Tanimoto similarity and exact match rates on the Stereo-2k test set.</li>
<li><strong>Robustness</strong>: On perturbed USPTO images (random rotations and shearing), MolSight achieved 92.3% exact match accuracy (vs. the original 92.0%), while rule-based methods like OSRA dropped from 83.5% to 6.7%. On the low-resolution Staker dataset, MolSight reached 82.1% exact match.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training pipeline uses three distinct data sources:</p>
<ol>
<li><strong>Pre-training</strong>: <em>MolParser-7M</em>. Contains diverse images but requires the <strong>SMILES-M</strong> extension to handle Markush structures.</li>
<li><strong>Fine-tuning</strong>: <em>PubChem-1M</em> and <em>USPTO-680K</em>. Used for multi-granularity learning with bond and coordinate labels.</li>
<li><strong>RL Post-training</strong>: <em>Stereo-200k</em>. A self-collected dataset from the first 2M compounds in PubChem, filtered for chirality (&lsquo;@&rsquo;) and cis-trans isomerism (&lsquo;/&rsquo;, &lsquo;\&rsquo;). It uses 5 different RDKit drawing styles to ensure robustness.</li>
</ol>
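<p>The Stereo-200k filtering step can be sketched as a simple SMILES token check (a hedged illustration, not the authors' code; a production filter would parse the SMILES rather than match raw substrings):</p>

```python
def has_stereo_markers(smiles: str) -> bool:
    """Keep molecules whose SMILES carries chirality ('@') or
    cis/trans markers ('/', '\\'), mirroring the Stereo-200k filter."""
    return any(tok in smiles for tok in ("@", "/", "\\"))
```

<p>For example, <code>C[C@H](N)C(=O)O</code> (L-alanine) passes, while <code>CCO</code> (ethanol) is filtered out.</p>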
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Reinforcement Learning</strong>: Uses <strong>GRPO (Group Relative Policy Optimization)</strong>.
<ul>
<li><strong>Reward Function</strong>: A linear combination of Tanimoto similarity and a graded stereochemistry reward.
$$ R = w_t \cdot r_{\text{tanimoto}} + w_s \cdot r_{\text{stereo}} $$
where $w_t=0.4$ and $w_s=0.6$. The stereochemistry reward $r_{\text{stereo}}$ is 1.0 for an InChIKey exact match, 0.3 if the atom count matches, and 0.1 otherwise.</li>
<li><strong>Sampling</strong>: Samples 4 completions per image with temperature 1.0 during RL training.</li>
</ul>
</li>
<li><strong>Auxiliary Tasks</strong>:
<ul>
<li><strong>Bond Classification</strong>: Concatenates hidden states of two atom queries to predict bond type via MLP.</li>
<li><strong>Atom Localization</strong>: Treated as a classification task (SimCC) but optimized using <strong>Maximum Likelihood Estimation (MLE)</strong> to account for uncertainty.</li>
</ul>
</li>
</ul>
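<p>The reward described above reduces to a few lines. This is a minimal sketch of the graded reward logic as stated in the paper (the actual Tanimoto term would come from fingerprint comparison, e.g. via RDKit):</p>

```python
def stereo_reward(pred_inchikey, ref_inchikey, pred_atoms, ref_atoms):
    # Graded stereochemistry reward from the paper:
    # 1.0 for an InChIKey exact match, 0.3 if atom counts match, 0.1 otherwise.
    if pred_inchikey == ref_inchikey:
        return 1.0
    return 0.3 if pred_atoms == ref_atoms else 0.1

def grpo_reward(tanimoto_sim, stereo, w_t=0.4, w_s=0.6):
    # Linear combination R = w_t * r_tanimoto + w_s * r_stereo.
    return w_t * tanimoto_sim + w_s * stereo
```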
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer. Input images are preprocessed to $512 \times 512$ resolution.
<ul>
<li><strong>Encoder</strong>: <strong>EfficientViT-L1</strong> (~53M params), chosen for linear attention efficiency.</li>
<li><strong>Decoder</strong>: 6-layer Transformer with <strong>RoPE</strong>, <strong>SwiGLU</strong>, and <strong>RMSNorm</strong>. Randomly initialized (no LLM weights) due to vocabulary mismatch.</li>
<li><strong>Coordinate Head</strong>: Separated from the main decoder. It adds 2 extra Transformer layers to process atom queries before prediction to improve accuracy.</li>
</ul>
</li>
<li><strong>Parameter Tuning</strong>:
<ul>
<li>Stage 3 (RL) uses <strong>LoRA</strong> (Rank=8, Alpha=16) to optimize the decoder.</li>
</ul>
</li>
</ul>
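<p>For readers unfamiliar with LoRA, the Stage 3 update has this form (a numpy sketch under the paper's rank-8, alpha-16 configuration; shapes and names are illustrative, not from the paper's code):</p>

```python
import numpy as np

def lora_forward(x, W, A, B, r=8, alpha=16):
    # LoRA adds a low-rank update to a frozen weight W:
    #   y = W x + (alpha / r) * B (A x)
    # Only A (r x d_in) and B (d_out x r) are trained.
    return W @ x + (alpha / r) * (B @ (A @ x))
```

<p>With <code>A</code> initialized to zero (the standard LoRA initialization), the adapter starts as a no-op and learns a small correction on top of the frozen decoder weights.</p>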
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Exact Match</strong>: Exact recognition accuracy for the full molecular structure.</li>
<li><strong>Tanimoto Coefficient</strong>: Fingerprint similarity for chemical semantics.</li>
<li><strong>OKS (Object Keypoint Similarity)</strong>: Used specifically for evaluating atom localization accuracy.</li>
</ul>
</li>
<li><strong>Perturbation</strong>: Robustness tested with random rotations [-5°, 5°] and xy-shearing [-0.1, 0.1].</li>
</ul>
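<p>The Tanimoto coefficient over fingerprints is the standard set-overlap measure; a minimal sketch over "on"-bit sets (a real pipeline would use RDKit Morgan fingerprints rather than raw sets):</p>

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto coefficient over fingerprint "on" bits: |A & B| / |A | B|.
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0
```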
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training and inference performed on a single node.</li>
<li><strong>Processors</strong>: Intel Xeon Silver 4210R CPU.</li>
<li><strong>Accelerators</strong>: 4x <strong>NVIDIA GeForce RTX 3090/4090</strong> GPUs.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Stage 1: Batch size 512, LR $4 \times 10^{-4}$.</li>
<li>Stage 2: Batch size 256, Bond head LR $4 \times 10^{-4}$, Coord head LR $4 \times 10^{-5}$.</li>
<li>Stage 3 (RL): Batch size 64, Base LR $1 \times 10^{-4}$.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/hustvl/MolSight">MolSight (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch implementation with training and inference code</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, W., Wang, X., Feng, B., &amp; Liu, W. (2025). MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning. In <em>Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026)</em>. <a href="https://doi.org/10.48550/arXiv.2511.17300">https://doi.org/10.48550/arXiv.2511.17300</a></p>
<p><strong>Publication</strong>: AAAI 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/hustvl/MolSight">Official Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2025molsight,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wenrui Zhang and Xinggang Wang and Bin Feng and Wenyu Liu}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2511.17300}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2511.17300}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolScribe: Robust Image-to-Graph Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</guid><description>Image-to-graph generation model for OCSR that predicts atoms, bonds, and coordinates jointly to better handle stereochemistry and abbreviations.</description><content:encoded><![CDATA[<h2 id="contribution-generative-image-to-graph-modelling">Contribution: Generative Image-to-Graph Modelling</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) with a secondary contribution to Resources ($\Psi_{\text{Resource}}$).</p>
<p>It proposes a novel architecture (image-to-graph generation) to solve the Optical Chemical Structure Recognition (OCSR) task, validating it through extensive ablation studies and comparisons against strong baselines like MolVec and DECIMER. It also contributes a new benchmark dataset of annotated images from ACS journals.</p>
<h2 id="motivation-limitations-in-existing-ocsr-pipelines">Motivation: Limitations in Existing OCSR Pipelines</h2>
<p>Translating molecular images into machine-readable graphs (OCSR) is challenging due to the high variance in drawing styles, stereochemistry conventions, and abbreviated structures found in literature.</p>
<p>Existing solutions face structural bottlenecks:</p>
<ul>
<li><strong>Rule-based systems</strong> (e.g., OSRA) rely on rigid heuristics that fail on diverse styles.</li>
<li><strong>Image-to-SMILES neural models</strong> treat the problem as captioning. They struggle with geometric reasoning (which is required for resolving chirality) and, because they omit explicit atom locations, cannot easily incorporate chemical constraints or verify correctness.</li>
</ul>
<h2 id="core-innovation-joint-graph-and-coordinate-prediction">Core Innovation: Joint Graph and Coordinate Prediction</h2>
<p>MolScribe introduces an <strong>Image-to-Graph</strong> generation paradigm that combines the flexibility of neural networks with the precision of symbolic constraints. It frames the task probabilistically as:</p>
<p>$$
P(G | I) = P(A | I) P(B | A, I)
$$</p>
<p>Where the model predicts a sequence of atoms $A$ given an image $I$, followed by the bonds $B$ given both the atoms and the image.</p>
<ol>
<li><strong>Explicit Graph Prediction</strong>: It predicts a sequence of atoms (with 2D coordinates) and then predicts bonds between them.</li>
<li><strong>Symbolic Constraints</strong>: It uses the predicted graph structure and coordinates to strictly determine chirality and cis/trans isomerism.</li>
<li><strong>Abbreviation Expansion</strong>: It employs a greedy algorithm to parse and expand &ldquo;superatoms&rdquo; (e.g., &ldquo;CO2Et&rdquo;) into their full atomic structure.</li>
<li><strong>Dynamic Augmentation</strong>: It introduces a data augmentation strategy that randomly substitutes functional groups with abbreviations and adds R-groups during training to improve generalization.</li>
</ol>
<h2 id="methodology-autoregressive-atoms-and-pairwise-bonds">Methodology: Autoregressive Atoms and Pairwise Bonds</h2>
<p>The authors evaluate MolScribe on synthetic and real-world datasets, focusing on <strong>Exact Match Accuracy</strong> of the canonical SMILES string. The model generates atom sequences autoregressively:</p>
<p>$$
P(A | I) = \prod_{i=1}^n P(a_i | A_{&lt;i}, I)
$$</p>
<p>To handle continuous spatial locations, atom coordinates are mapped to discrete bins (e.g., $\hat{x}_i = \lfloor \frac{x_i}{W} \times n_{\text{bins}} \rfloor$) and decoded alongside the element labels. Bond types are then predicted by a pairwise classifier over the hidden states of every atom pair:</p>
<p>$$
P(B | A, I) = \prod_{i=1}^n \prod_{j=1}^n P(b_{i,j} | A, I)
$$</p>
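<p>The coordinate binning above can be sketched in a few lines ($n_{\text{bins}} = 64$ per the paper; the clamp for the edge case $x = W$ is my addition, not stated in the source):</p>

```python
def discretize_coord(x, width, n_bins=64):
    # Map a continuous pixel coordinate to a discrete bin token:
    #   x_hat = floor(x / W * n_bins), clamped so x == W stays in range.
    return min(int(x / width * n_bins), n_bins - 1)
```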
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (MolVec, OSRA) and neural (Img2Mol, DECIMER, SwinOCSR) systems.</li>
<li><strong>Benchmarks</strong>:
<ul>
<li><strong>Synthetic</strong>: Indigo (in-domain) and ChemDraw (out-of-domain).</li>
<li><strong>Realistic</strong>: Five public benchmarks (CLEF, JPO, UOB, USPTO, Staker).</li>
<li><strong>New Dataset</strong>: 331 images from ACS Publications (journal articles).</li>
</ul>
</li>
<li><strong>Ablations</strong>: Tested performance without data augmentation, with continuous vs. discrete coordinates, and without non-atom tokens.</li>
<li><strong>Human Eval</strong>: Measured the time reduction for chemists using MolScribe to digitize molecules vs. drawing from scratch.</li>
</ul>
<h2 id="results-robust-exact-match-accuracy">Results: Robust Exact Match Accuracy</h2>
<ul>
<li><strong>Strong Performance</strong>: MolScribe achieved <strong>76-93% accuracy</strong> across public benchmarks, outperforming baselines on most datasets. On the ACS dataset of journal article images, MolScribe achieved 71.9% compared to the next best 55.3% (OSRA). On the large Staker patent dataset, MolScribe achieved 86.9%, surpassing MSE-DUDL (77.0%) while using far less training data (1.68M vs. 68M examples).</li>
<li><strong>Chirality Verification</strong>: Explicit geometric reasoning allowed MolScribe to predict chiral molecules significantly better than image-to-SMILES baselines. When chirality is ignored, the performance gap narrows (e.g., on Indigo, baseline accuracy rises from 94.1% to 96.3%), isolating MolScribe&rsquo;s primary advantage to geometric reasoning for stereochemistry.</li>
<li><strong>Hand-Drawn Generalization</strong>: The model achieved <strong>11.2% exact match accuracy</strong> on the DECIMER-HDM dataset, despite having no hand-drawn images in its training set, with many errors limited to a few atom-level mismatches.</li>
<li><strong>Robustness</strong>: The model maintained high performance on perturbed images (rotation/shear), whereas rule-based systems degraded severely.</li>
<li><strong>Usability</strong>: The atom-level alignment allows for confidence visualization, and human evaluation showed it reduced digitization time from <strong>137s to 20s</strong> per molecule.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a mix of synthetic and patent data with extensive dynamic augmentation:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem (Synthetic)</strong></td>
          <td>1M</td>
          <td>Molecules randomly sampled from PubChem and rendered via Indigo toolkit; includes atom coords.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO (Patents)</strong></td>
          <td>680K</td>
          <td>Patent data lacks exact atom coordinates; relative coordinates normalized from MOLfiles to image dimensions (often introduces coordinate shifts).</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecule Augmentation</strong>:</p>
<ul>
<li><strong>Functional Groups</strong>: Randomly substituted using 53 common substitution rules (e.g., replacing substructures with &ldquo;Et&rdquo; or &ldquo;Ph&rdquo;).</li>
<li><strong>R-Groups</strong>: Randomly added using vocabulary: <code>[R, R1...R12, Ra, Rb, Rc, Rd, X, Y, Z, A, Ar]</code>.</li>
<li><strong>Styles</strong>: Random variation of aromaticity (circle vs. bonds) and explicit hydrogens.</li>
</ul>
<p><strong>Image Augmentation</strong>:</p>
<ul>
<li><strong>Rendering</strong>: Randomized font (Arial, Times, Courier, Helvetica), line width, and label modes during synthetic generation.</li>
<li><strong>Perturbations</strong>: Applied rotation ($\pm 90^{\circ}$), cropping (1%), padding (40%), downscaling, blurring, and Salt-and-Pepper/Gaussian noise.</li>
</ul>
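<p>As one concrete example, the Salt-and-Pepper perturbation listed above might be implemented as follows (illustrative only; the paper does not specify its exact noise parameters):</p>

```python
import numpy as np

def salt_and_pepper(img, amount=0.01, seed=0):
    # Flip a random fraction of pixels to 0 (pepper) or 255 (salt).
    rng = np.random.default_rng(seed)
    out = img.copy()
    mask = rng.random(img.shape) < amount
    out[mask] = rng.choice(np.array([0, 255], dtype=img.dtype), size=int(mask.sum()))
    return out
```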
<p><strong>Preprocessing</strong>: Input images are resized to $384 \times 384$.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Atom Prediction (Pix2Seq-style)</strong>:
<ul>
<li>The model generates a sequence of tokens: $S^A = [l_1, \hat{x}_1, \hat{y}_1, \dots, l_n, \hat{x}_n, \hat{y}_n]$.</li>
<li><strong>Discretization</strong>: Coordinates are binned into integer tokens ($n_{bins} = 64$).</li>
<li><strong>Tokenizer</strong>: Atom-wise tokenizer splits SMILES into atoms; non-atom tokens (parentheses, digits) are kept to help structure learning.</li>
</ul>
</li>
<li><strong>Bond Prediction</strong>:
<ul>
<li>Format: Pairwise classification for every pair of predicted atoms.</li>
<li>Symmetry: For symmetric bonds (single/double), the probability is averaged as:
$$
\hat{P}(b_{i,j} = t) = \frac{1}{2} \big( P(b_{i,j} = t) + P(b_{j,i} = t) \big)
$$
Wedge bonds are directional, so their probabilities are not averaged.</li>
</ul>
</li>
<li><strong>Abbreviation Expansion (Algorithm 1)</strong>:
<ul>
<li>A greedy algorithm connects atoms within an expanded abbreviation (e.g., &ldquo;COOH&rdquo;) until valences are full, avoiding the need for a fixed dictionary.</li>
<li><strong>Carbon Chains</strong>: Splits condensed chains like $C_aX_b$ into explicit sequences ($CX_q \dots CX_{q+r}$).</li>
<li><strong>Nested Formulas</strong>: Recursively parses nested structures like $N(CH_3)_2$ by treating them as superatoms attached to the current backbone.</li>
<li><strong>Valence Handling</strong>: Iterates through common valences first to resolve ambiguities.</li>
</ul>
</li>
</ul>
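<p>The symmetrization step for bond probabilities is a direction-average over the pairwise prediction tensor. A numpy sketch (applies to symmetric bond types only; per the paper, wedge bonds keep their direction):</p>

```python
import numpy as np

def symmetrize_bond_probs(P):
    # P: (n_atoms, n_atoms, n_bond_types) pairwise bond-type probabilities.
    # Average the two directions: P_hat(b_ij = t) = (P(b_ij = t) + P(b_ji = t)) / 2.
    return 0.5 * (P + P.transpose(1, 0, 2))
```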
<h3 id="models">Models</h3>
<p>The architecture is an encoder-decoder with a classification head:</p>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer (Swin-B)</strong>, pre-trained on ImageNet-22K (88M params).</li>
<li><strong>Decoder</strong>: 6-layer Transformer, 8 heads, hidden dimension 256.</li>
<li><strong>Bond Predictor</strong>: 2-layer MLP (Feedforward) with ReLU, taking concatenated atom hidden states as input.</li>
<li><strong>Training</strong>: Teacher forcing, Cross-Entropy Loss, Batch size 128, 30 epochs.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Exact Match of Canonical SMILES.</p>
<ul>
<li>Stereochemistry: Must match tetrahedral chirality; cis-trans ignored.</li>
<li>R-groups: Replaced with wildcards <code>*</code> or <code>[d*]</code> for evaluation.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training performed on Linux server with <strong>96 CPUs</strong> and <strong>500GB RAM</strong>.</li>
<li><strong>GPUs</strong>: <strong>4x NVIDIA A100 GPUs</strong>.</li>
<li><strong>Training Time</strong>: Unspecified; comparative models on large datasets took &ldquo;more than one day&rdquo;.</li>
<li><strong>Inference</strong>: Requires autoregressive decoding for atoms, followed by a single forward pass for bonds.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/thomas0809/MolScribe">MolScribe (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training, inference, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/yujieq/MolScribe">MolScribe (Hugging Face)</a></td>
          <td>Demo</td>
          <td>MIT</td>
          <td>Interactive web demo for molecular image recognition</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Scoped to single-molecule images only; does not handle multi-molecule diagrams or reaction schemes.</li>
<li>Hand-drawn molecule recognition remains weak (the model was not trained on hand-drawn data).</li>
<li>Complex Markush structures (positional variation, frequency variation) are not supported, as these cannot be represented in SMILES or MOLfiles.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, Y., Guo, J., Tu, Z., Li, Z., Coley, C. W., &amp; Barzilay, R. (2023). MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation. <em>Journal of Chemical Information and Modeling</em>, 63(7), 1925-1934. <a href="https://doi.org/10.1021/acs.jcim.2c01480">https://doi.org/10.1021/acs.jcim.2c01480</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/spaces/yujieq/MolScribe">Hugging Face Space</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qianMolScribeRobustMolecular2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolScribe}}: {{Robust Molecular Structure Recognition}} with {{Image-To-Graph Generation}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolScribe}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Qian, Yujie and Guo, Jiang and Tu, Zhengkai and Li, Zhening and Coley, Connor W. and Barzilay, Regina}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1925--1934}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.2c01480}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolMole: Unified Vision Pipeline for Molecule Mining</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</guid><description>A vision-based deep learning framework that unifies molecule detection, reaction parsing, and OCSR for page-level chemical data extraction.</description><content:encoded><![CDATA[<h2 id="molmoles-dual-contribution-unified-ocsr-method-and-page-level-benchmarks">MolMole&rsquo;s Dual Contribution: Unified OCSR Method and Page-Level Benchmarks</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong <strong>Resource</strong> contribution.</p>
<p>It functions as a <strong>Method</strong> paper because it introduces &ldquo;MolMole,&rdquo; a unified deep learning framework that integrates molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline. It validates this method through extensive comparisons against state-of-the-art baselines like DECIMER and OpenChemIE.</p>
<p>It also serves as a <strong>Resource</strong> paper because the authors construct and release a novel page-level benchmark dataset of 550 annotated pages (patents and articles) to address the lack of standardized evaluation metrics for full-page chemical extraction.</p>
<h2 id="addressing-the-limitations-of-fragmented-processing">Addressing the Limitations of Fragmented Processing</h2>
<p>The rapid accumulation of chemical literature has trapped valuable molecular and reaction data in unstructured formats like images and PDFs. Extracting this manually is time-consuming, while existing AI frameworks have significant limitations:</p>
<ul>
<li><strong>DECIMER</strong>: Cannot process reaction diagrams at all.</li>
<li><strong>OpenChemIE</strong>: Relies on external layout parser models to crop elements before processing. This dependence often leads to detection failures in documents with complex layouts.</li>
<li><strong>Generative Hallucination</strong>: Existing generative OCSR models (like MolScribe) are prone to &ldquo;hallucinating&rdquo; structures or failing on complex notations like polymers.</li>
</ul>
<h2 id="a-unified-vision-pipeline-for-layout-aware-detection">A Unified Vision Pipeline for Layout-Aware Detection</h2>
<p>MolMole introduces several architectural and workflow innovations:</p>
<ul>
<li><strong>Direct Page-Level Processing</strong>: Unlike OpenChemIE, MolMole processes full document pages directly without requiring an external layout parser, which improves robustness on complex layouts like two-column patents.</li>
<li><strong>Unified Vision Pipeline</strong>: It integrates three specialized vision models into one workflow:
<ul>
<li><strong>ViDetect</strong>: A DINO-based object detector for identifying molecular regions.</li>
<li><strong>ViReact</strong>: An RxnScribe-based model adapted for full-page reaction parsing.</li>
<li><strong>ViMore</strong>: A detection-based OCSR model that explicitly predicts atoms and bonds.</li>
</ul>
</li>
<li><strong>Hallucination Mitigation</strong>: By using a detection-based approach (ViMore), the model avoids hallucinating chemical structures and provides confidence scores.</li>
<li><strong>Advanced Notation Support</strong>: The system explicitly handles &ldquo;wavy bonds&rdquo; (variable attachments in patents) and polymer bracket notations, which confuse standard SMILES-based models.</li>
</ul>
<h2 id="page-level-benchmark-evaluation-and-unified-metrics">Page-Level Benchmark Evaluation and Unified Metrics</h2>
<p>The authors evaluated the framework on both a newly curated benchmark and existing public datasets:</p>
<ul>
<li><strong>New Benchmark Creation</strong>: They curated 550 pages (300 patents, 250 articles) fully annotated with bounding boxes, reaction roles (reactant, product, condition), and MOLfiles.</li>
<li><strong>Baselines</strong>: MolMole was compared against <strong>DECIMER 2.0</strong>, <strong>OpenChemIE</strong>, and <strong>ReactionDataExtractor 2.0</strong>.</li>
<li><strong>OCSR Benchmarking</strong>: ViMore was evaluated against DECIMER, MolScribe, and MolGrapher on four public datasets: <strong>USPTO</strong>, <strong>UOB</strong>, <strong>CLEF</strong>, and <strong>JPO</strong>.</li>
<li><strong>Metric Proposal</strong>: They introduced a combined &ldquo;End-to-End&rdquo; metric that modifies standard object detection Precision/Recall to strictly require correct SMILES conversion for a &ldquo;True Positive&rdquo;.</li>
</ul>
<p>$$ \text{True Positive (End-to-End)} = ( \text{IoU} \geq 0.5 ) \land ( \text{SMILES}_{\text{gt}} == \text{SMILES}_{\text{pred}} ) $$</p>
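<p>As a rough illustration, the combined criterion can be sketched in Python (function names here are mine, not from the paper; SMILES strings are assumed to be canonicalized upstream, e.g. with RDKit):</p>

```python
def iou(box_a, box_b):
    # Boxes are (x_min, y_min, x_max, y_max) tuples.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_end_to_end_tp(pred_box, gt_box, pred_smiles, gt_smiles, iou_thresh=0.5):
    # A detection counts as a true positive only if the box overlaps the
    # ground truth sufficiently AND the recognized structure matches.
    return iou(pred_box, gt_box) >= iou_thresh and pred_smiles == gt_smiles
```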
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Page-Level Performance</strong>: On the new benchmark, MolMole achieved F1 scores of <strong>89.1%</strong> (Patents) and <strong>86.8%</strong> (Articles) for the combined detection-to-conversion task, compared to 73.8% and 67.3% for DECIMER and 68.8% and 70.6% for OpenChemIE (Table 4).</li>
<li><strong>Reaction Parsing</strong>: ViReact achieved soft-match F1 scores of <strong>98.0%</strong> on patents and <strong>97.0%</strong> on articles, compared to 82.2% and 82.9% for the next best model, RxnScribe (w/o LP). Hard-match F1 scores were 92.5% (patents) and 84.6% (articles).</li>
<li><strong>Public Benchmarks</strong>: ViMore outperformed competitors on 3 out of 4 public OCSR datasets (CLEF, JPO, USPTO).</li>
<li><strong>Layout Handling</strong>: The authors demonstrated that MolMole successfully handles multi-column reaction diagrams where cropping-based models fail and faithfully preserves layout geometry in generated MOLfiles.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://lgai-ddu.github.io/molmole/">MolMole Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo and project information</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training Data</strong>: The models (ViDetect and ViMore) were trained on <strong>private/proprietary datasets</strong>, which is a limitation for full reproducibility from scratch.</li>
<li><strong>Benchmark Data</strong>: The authors introduce a test set of <strong>550 pages</strong> (3,897 molecules, 1,022 reactions) derived from patents and scientific articles. This dataset is stated to be made &ldquo;publicly available&rdquo;.</li>
<li><strong>Public Evaluation Data</strong>: Standard OCSR datasets used include USPTO (5,719 images), UOB (5,740 images), CLEF (992 images), and JPO (450 images).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pipeline Workflow</strong>: PDF → PNG Images → Parallel execution of <strong>ViDetect</strong> and <strong>ViReact</strong> → Cropping of molecular regions → <strong>ViMore</strong> conversion → Output (JSON/Excel).</li>
<li><strong>Post-Processing</strong>:
<ul>
<li><em>ViDetect</em>: Removes overlapping proposals based on confidence scores and size constraints.</li>
<li><em>ViReact</em>: Refines predictions by correcting duplicates and removing empty entities.</li>
<li><em>ViMore</em>: Assembles detected atom/bond information into structured representations (MOLfile).</li>
</ul>
</li>
</ul>
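<p>Since the code is not yet released, the workflow above can only be sketched with placeholder callables standing in for ViDetect, ViReact, and ViMore (the names, signatures, and return shapes here are illustrative assumptions, not the actual API):</p>

```python
def process_page(page_image, vi_detect, vi_react, vi_more):
    """Sketch of the MolMole page-level workflow with placeholder models."""
    # Stage 1: detection and reaction parsing both run on the full page,
    # with no external layout parser in between.
    mol_boxes = vi_detect(page_image)
    reactions = vi_react(page_image)
    # Stage 2: each detected molecular region is cropped and converted
    # to a structured representation (MOLfile) by the OCSR model.
    molecules = [{"bbox": box, "molfile": vi_more(page_image, box)}
                 for box in mol_boxes]
    return {"molecules": molecules, "reactions": reactions}
```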
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture Basis</th>
          <th>Task</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ViDetect</strong></td>
          <td>DINO (DETR-based)</td>
          <td>Molecule Detection</td>
          <td>End-to-end training; avoids slow autoregressive methods.</td>
      </tr>
      <tr>
          <td><strong>ViReact</strong></td>
          <td>RxnScribe</td>
          <td>Reaction Parsing</td>
          <td>Operates on full pages; autoregressive decoder for structured sequence generation.</td>
      </tr>
      <tr>
          <td><strong>ViMore</strong></td>
          <td>Custom Vision Model</td>
          <td>OCSR</td>
          <td>Detection-based (predicts atom/bond regions).</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Molecule Detection</strong>: Evaluated using COCO metrics (AP, AR, F1) at IoU thresholds 0.50-0.95.</li>
<li><strong>Molecule Conversion</strong>: Evaluated using SMILES exact match accuracy and Tanimoto similarity.</li>
<li><strong>Combined Metric</strong>: A custom metric where a True Positive requires both IoU $\geq$ 0.5 and a correct SMILES string match where $\text{SMILES}_{\text{gt}} == \text{SMILES}_{\text{pred}}$.</li>
<li><strong>Reaction Parsing</strong>: Evaluated using <strong>Hard Match</strong> (all components correct) and <strong>Soft Match</strong> (molecular entities only, ignoring text labels).</li>
</ul>
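<p>The Tanimoto similarity used for molecule conversion reduces to a set operation on fingerprint bits. A minimal sketch, with fingerprints as plain sets of &ldquo;on&rdquo; bit indices (in practice one would compute RDKit Morgan fingerprints first):</p>

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto similarity between two fingerprints given as sets of
    # "on" bit indices: |A intersect B| / |A union B|.
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # two empty fingerprints are trivially identical
    return len(a & b) / len(a | b)
```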
<h3 id="missing-components">Missing Components</h3>
<ul>
<li><strong>Source code</strong>: Not publicly released. The paper states the toolkit &ldquo;will be accessible soon through an interactive demo on the LG AI Research website.&rdquo; For commercial use, the authors direct inquiries to contact <a href="mailto:ddu@lgresearch.ai">ddu@lgresearch.ai</a>.</li>
<li><strong>Training data</strong>: ViDetect and ViMore are trained on proprietary datasets. Training code and data are not available.</li>
<li><strong>Hardware requirements</strong>: Not specified in the paper.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chun, S., Kim, J., Jo, A., Jo, Y., Oh, S., et al. (2025). MolMole: Molecule Mining from Scientific Literature. <em>arXiv preprint arXiv:2505.03777</em>. <a href="https://doi.org/10.48550/arXiv.2505.03777">https://doi.org/10.48550/arXiv.2505.03777</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://lgai-ddu.github.io/molmole/">Project Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chun2025molmole,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolMole: Molecule Mining from Scientific Literature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chun, Sehyun and Kim, Jiye and Jo, Ahra and Jo, Yeonsik and Oh, Seungyul and Lee, Seungjun and Ryoo, Kwangrok and Lee, Jongmin and Kim, Seung Hwan and Kang, Byung Jun and Lee, Soonyoung and Park, Jun Ha and Moon, Chanwoo and Ham, Jiwon and Lee, Haein and Han, Heejae and Byun, Jaeseung and Do, Soojong and Ha, Minju and Kim, Dongyun and Bae, Kyunghoon and Lim, Woohyung and Lee, Edward Hwayoung and Park, Yongmin and Yu, Jeongsang and Jo, Gerrard Jeongwon and Hong, Yeonjung and Yoo, Kyungjae and Han, Sehui and Lee, Jaewan and Park, Changyoung and Jeon, Kijeong and Yi, Sihyuk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2505.03777}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGrapher: Graph-based Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</guid><description>A graph-based deep learning approach for optical chemical structure recognition that outperforms image captioning methods.</description><content:encoded><![CDATA[<h2 id="1-contribution--type">1. Contribution / Type</h2>
<p>This is primarily a <strong>Methodological</strong> paper that proposes a novel neural architecture (MolGrapher), shifting the paradigm of Optical Chemical Structure Recognition (OCSR) from image captioning back to graph reconstruction. It also has a significant <strong>Resource</strong> component, releasing a synthetic data generation pipeline and a new large-scale benchmark (USPTO-30K) to address the scarcity of annotated real-world data.</p>
<h2 id="2-motivation">2. Motivation</h2>
<p>The automatic analysis of chemical literature is critical for accelerating drug and material discovery, but much of this information is locked in 2D images of molecular structures.</p>
<ul>
<li><strong>Problem:</strong> Existing rule-based methods are rigid, while recent deep learning methods based on &ldquo;image captioning&rdquo; (predicting <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings) struggle with complex molecules and fail to exploit the natural graph structure of molecules.</li>
<li><strong>Gap:</strong> There is a lack of diverse, annotated real-world training data, and captioning models suffer from &ldquo;hallucinations&rdquo; where they predict valid SMILES that do not match the image.</li>
</ul>
<h2 id="3-novelty--core-innovation">3. Novelty / Core Innovation</h2>
<p>MolGrapher introduces a <strong>graph-based deep learning pipeline</strong> that explicitly models the molecule&rsquo;s geometry and topology.</p>
<ul>
<li><strong>Supergraph Concept:</strong> It first detects all atom keypoints and builds a &ldquo;supergraph&rdquo; of all plausible bonds.</li>
<li><strong>Hybrid Approach:</strong> It combines a ResNet-based keypoint detector with a Graph Neural Network (GNN) that classifies both atom nodes and bond nodes within the supergraph context. Both atoms and bonds are represented as nodes, with edges only connecting atom nodes to bond nodes.</li>
<li><strong>Synthetic Pipeline:</strong> A data generation pipeline that renders molecules with varying styles (fonts, bond widths) and augmentations (pepper patches, random lines, captions) to simulate real document noise.</li>
</ul>
<p>At the core of the Keypoint Detector&rsquo;s performance is the <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss. Since pixels without an atom drastically outnumber pixels containing an atom, WAHR loss is designed to counter the class imbalance. For ground-truth heatmap $y$ and prediction $p$:</p>
<p>$$ L_{WAHR}(p, y) = \sum_i \alpha_{y_i} (p_i - y_i)^2 $$</p>
<p>where $\alpha_{y_i}$ dynamically down-weights easily classified background pixels.</p>
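<p>The idea can be illustrated with a weighted MSE over heatmap pixels; note that the fixed background weight used below is a deliberate simplification of the paper's adaptive scheme, chosen only to show why re-weighting counters the class imbalance:</p>

```python
def weighted_heatmap_loss(pred, target, bg_weight=0.1):
    # Illustrative weighted MSE in the spirit of WAHR: pixels with no atom
    # (target == 0) are down-weighted so the abundant background does not
    # dominate the loss over the rare atom pixels.
    loss = 0.0
    for p, y in zip(pred, target):
        alpha = 1.0 if y > 0 else bg_weight
        loss += alpha * (p - y) ** 2
    return loss
```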
<h2 id="4-methodology--experiments">4. Methodology &amp; Experiments</h2>
<p>The authors evaluated MolGrapher against both rule-based (OSRA, MolVec) and deep learning baselines (DECIMER, Img2Mol, Image2Graph).</p>
<ul>
<li><strong>Benchmarks:</strong> Evaluated on standard datasets: USPTO, Maybridge UoB, CLEF-2012, and JPO.</li>
<li><strong>New Benchmark:</strong> Introduced and tested on <strong>USPTO-30K</strong>, split into clean, abbreviated, and large molecule subsets.</li>
<li><strong>Ablations:</strong> Analyzed the impact of synthetic augmentations, keypoint loss functions, supergraph connectivity radius, and GNN layers.</li>
<li><strong>Robustness:</strong> Tested on perturbed images (rotations, shearing) to mimic scanned patent quality.</li>
</ul>
<p>The GNN iteratively updates node embeddings through layers $\{g^k\}_{k \in [1, N]}$, where $e^{k+1} = g^k(e^k)$. Final predictions are obtained via two MLPs (one for atoms, one for bonds): $p_i = \text{MLP}_t(e_i^N)$, where $p_i \in \mathbb{R}^{C_t}$ contains the logits for atom or bond classes.</p>
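<p>The update-and-readout scheme can be sketched abstractly (the layer and head callables below are placeholders, not the paper's actual architecture):</p>

```python
def gnn_forward(embeddings, layers, atom_head, bond_head, node_types):
    # Iteratively refine node embeddings: e^{k+1} = g^k(e^k).
    for g in layers:
        embeddings = g(embeddings)
    # Type-specific readout: one MLP head for atom nodes, one for bond nodes.
    logits = []
    for e, t in zip(embeddings, node_types):
        head = atom_head if t == "atom" else bond_head
        logits.append(head(e))
    return logits
```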
<h2 id="5-results--conclusions">5. Results &amp; Conclusions</h2>
<p>MolGrapher achieved the highest accuracy among synthetic-only deep learning methods on most benchmarks tested.</p>
<ul>
<li><strong>Accuracy:</strong> It achieved <strong>91.5%</strong> accuracy on USPTO, outperforming all other synthetic-only deep learning methods including ChemGrapher (80.9%), Graph Generation (67.0%), and DECIMER 2.0 (61.0%).</li>
<li><strong>Large Molecules:</strong> It demonstrated superior scaling, correctly recognizing large molecules (USPTO-10K-L) where image captioning methods like Img2Mol failed completely (0.0% accuracy).</li>
<li><strong>Generalization:</strong> The method proved robust to image perturbations and style variations without requiring fine-tuning on real data. The paper acknowledges that MolGrapher cannot recognize Markush structures (depictions of sets of molecules with positional and frequency variation indicators).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model relies on synthetic data for training due to the scarcity of annotated real-world images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic Data</td>
          <td>300,000 images</td>
          <td>Generated from PubChem SMILES using RDKit. Augmentations include pepper patches, random lines, and variable bond styles.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>USPTO-30K</td>
          <td>30,000 images</td>
          <td>Created by authors from USPTO patents (2001-2020). Subsets: 10K clean, 10K abbreviated, 10K large (&gt;70 atoms).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Standard Benchmarks</td>
          <td>Various</td>
          <td>USPTO (5,719), Maybridge UoB (5,740), CLEF-2012 (992), JPO (450).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of three distinct algorithmic stages:</p>
<ol>
<li>
<p><strong>Keypoint Detection</strong>:</p>
<ul>
<li>Predicts a heatmap of atom locations using a CNN.</li>
<li>Thresholds heatmaps at the bottom 10th percentile and uses a $5\times5$ window for local maxima.</li>
<li>Uses <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss to handle class imbalance (background vs. atoms).</li>
</ul>
</li>
<li>
<p><strong>Supergraph Construction</strong>:</p>
<ul>
<li>Connects every detected keypoint to neighbors within a radius of $3 \times$ the estimated bond length.</li>
<li>Prunes edges with no filled pixels or if obstructed by a third keypoint.</li>
<li>Keeps a maximum of 6 bond candidates per atom.</li>
</ul>
</li>
<li>
<p><strong>Superatom Recognition</strong>:</p>
<ul>
<li>Detects &ldquo;superatom&rdquo; nodes (abbreviations like <code>COOH</code>).</li>
<li>Uses <strong>PP-OCR</strong> to transcribe the text at these node locations.</li>
</ul>
</li>
</ol>
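<p>The supergraph construction step can be approximated in a few lines. This sketch implements only the radius and per-atom candidate-count rules; the paper's pixel-based pruning of empty or obstructed edges is omitted:</p>

```python
import math

def candidate_bonds(keypoints, bond_length, radius_factor=3.0, max_per_atom=6):
    # Connect each detected atom keypoint to neighbours within
    # radius_factor x the estimated bond length, keeping at most
    # max_per_atom nearest candidates per atom.
    radius = radius_factor * bond_length
    edges = set()
    for i, p in enumerate(keypoints):
        dists = sorted(
            (math.dist(p, q), j)
            for j, q in enumerate(keypoints) if j != i
        )
        for d, j in dists[:max_per_atom]:
            if d <= radius:
                edges.add((min(i, j), max(i, j)))  # undirected edge
    return sorted(edges)
```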
<h3 id="models">Models</h3>
<p>The architecture utilizes standard backbones tailored for specific sub-tasks:</p>
<ul>
<li><strong>Keypoint Detector</strong>: <strong>ResNet-18</strong> backbone with $8\times$ dilation to preserve spatial resolution.</li>
<li><strong>Node Classifier</strong>: <strong>ResNet-50</strong> backbone with $2\times$ dilation for extracting visual features at node locations.</li>
<li><strong>Graph Neural Network</strong>: A custom GNN that updates node embeddings based on visual features and neighborhood context. The initial node embedding combines the visual feature vector $v_i$ and a learnable type encoding $w_{t_i}$.</li>
<li><strong>Readout</strong>: MLPs classify nodes into atom types (e.g., C, O, N) and bond types (No Bond, Single, Double, Triple).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Accuracy is defined strictly: the predicted molecule must have an identical <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> string to the ground truth. Stereochemistry and Markush structures are excluded from evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>MolGrapher Score</th>
          <th>Best DL Baseline (Synthetic)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>USPTO</td>
          <td><strong>91.5%</strong></td>
          <td>80.9% (ChemGrapher)</td>
          <td>Full USPTO benchmark</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>USPTO-10K-L</td>
          <td><strong>31.4%</strong></td>
          <td>0.0% (Img2Mol)</td>
          <td>Large molecules (&gt;70 atoms)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>JPO</td>
          <td><strong>67.5%</strong></td>
          <td>64.0% (DECIMER 2.0)</td>
          <td>Challenging, low-quality images</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: Trained on 3 NVIDIA A100 GPUs.</li>
<li><strong>Training Time</strong>: 20 epochs.</li>
<li><strong>Optimization</strong>: Adam optimizer, learning rate 0.0001, decayed by 0.8 after 5000 iterations.</li>
<li><strong>Loss Weighting</strong>: Atom classifier loss weighted by 1; bond classifier loss weighted by 3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MolGrapher">DS4SD/MolGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training and inference scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Title</strong>: MolGrapher: Graph-based Visual Recognition of Chemical Structures</p>
<p><strong>Authors</strong>: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valéry Weber, Ingmar Meijer, Peter Staar, Fisher Yu</p>
<p><strong>Citation</strong>: Morin, L., Danelljan, M., Agea, M. I., Nassar, A., Weber, V., Meijer, I., Staar, P., &amp; Yu, F. (2023). MolGrapher: Graph-based Visual Recognition of Chemical Structures. <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, 19552-19561.</p>
<p><strong>Publication</strong>: ICCV 2023</p>
<p><strong>Links</strong>:</p>
<ul>
<li><a href="https://openaccess.thecvf.com/content/ICCV2023/html/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.html">Paper</a></li>
<li><a href="https://github.com/DS4SD/MolGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMolGrapherGraphbasedVisual2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolGrapher}}: {{Graph-based Visual Recognition}} of {{Chemical Structures}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolGrapher}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{IEEE}}/{{CVF International Conference}} on {{Computer Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valéry and Meijer, Ingmar and Staar, Peter and Yu, Fisher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{19552--19561}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICCV51070.2023.01791}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-18}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MMSSC-Net: Multi-Stage Sequence Cognitive Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</guid><description>A deep learning model for Optical Chemical Structure Recognition (OCSR) using SwinV2 and GPT-2 to convert molecular images to SMILES.</description><content:encoded><![CDATA[<h2 id="contribution-a-multi-stage-architectural-pipeline">Contribution: A Multi-Stage Architectural Pipeline</h2>
<p><strong>Methodological Paper</strong>.
The paper proposes a deep learning architecture (<strong>MMSSC-Net</strong>) for Optical Chemical Structure Recognition (OCSR). It focuses on architectural innovation, specifically combining a SwinV2 visual encoder with a GPT-2 decoder, and validates this method through extensive benchmarking against existing rule-based and deep-learning baselines. It includes ablation studies to justify the choice of the visual encoder.</p>
<h2 id="motivation-addressing-noise-and-rigid-image-recognition">Motivation: Addressing Noise and Rigid Image Recognition</h2>
<ul>
<li><strong>Data Usage Gap</strong>: Drug discovery relies heavily on scientific literature, but molecular structures are often locked in vector graphics or images that computers cannot easily process.</li>
<li><strong>Limitations of Prior Work</strong>: Existing Rule-based methods are rigid and sensitive to noise. Previous Deep Learning approaches (Encoder-Decoder &ldquo;Image Captioning&rdquo; styles) often lack precision, interpretability, and struggle with varying image resolutions or large molecules.</li>
<li><strong>Need for &ldquo;Cognition&rdquo;</strong>: The authors argue that treating the image as a single isolated whole is insufficient; a model needs to &ldquo;perceive&rdquo; fine-grained details (atoms and bonds) to handle noise and varying pixel qualities effectively.</li>
</ul>
<h2 id="novelty-a-fine-grained-perception-pipeline">Novelty: A Fine-Grained Perception Pipeline</h2>
<ul>
<li><strong>Multi-Stage Cognitive Architecture</strong>: MMSSC-Net splits the task into stages:
<ol>
<li><strong>Fine-grained Perception</strong>: Detecting atom and bond sequences (including spatial coordinates) using SwinV2.</li>
<li><strong>Graph Construction</strong>: Assembling these into a molecular graph.</li>
<li><strong>Sequence Evolution</strong>: Converting the graph into a machine-readable format (SMILES).</li>
</ol>
</li>
<li><strong>Hybrid Transformer Model</strong>: It combines a hierarchical vision transformer (<strong>SwinV2</strong>) for encoding with a generative pre-trained transformer (<strong>GPT-2</strong>) and MLPs for decoding atomic and bond targets.</li>
<li><strong>Robustness Mechanisms</strong>: The inclusion of random noise sequences during training to improve generalization to new molecular targets.</li>
</ul>
<h2 id="methodology-and-benchmarks">Methodology and Benchmarks</h2>
<ul>
<li><strong>Baselines</strong>: Compared against eight other tools:
<ul>
<li><em>Rule-based</em>: MolVec, OSRA.</li>
<li><em>Image-Smiles (DL)</em>: ABC-Net, Img2Mol, MolMiner.</li>
<li><em>Image-Graph-Smiles (DL)</em>: Image-To-Graph, MolScribe, ChemGrapher.</li>
</ul>
</li>
<li><strong>Datasets</strong>: Evaluated on 5 diverse datasets: STAKER (synthetic), USPTO, CLEF, JPO, and UOB (real-world).</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Accuracy</strong>: Exact string match of the predicted SMILES.</li>
<li><strong>Tanimoto Similarity</strong>: Chemical similarity using Morgan fingerprints.</li>
</ul>
</li>
<li><strong>Ablation Study</strong>: Tested different visual encoders (Swin Transformer, ViT-B, ResNet-50) to validate the choice of SwinV2.</li>
<li><strong>Resolution Sensitivity</strong>: Tested model performance across image resolutions from 256px to 2048px.</li>
</ul>
<h2 id="results-and-core-outcomes">Results and Core Outcomes</h2>
<ul>
<li><strong>Strong Performance</strong>: MMSSC-Net achieved 75-98% accuracy across datasets, outperforming baselines on most benchmarks, and exceeded 94% accuracy on the first three (intra-domain and real-world) datasets.</li>
<li><strong>Resolution Robustness</strong>: The model maintained relatively stable accuracy across varying image resolutions, whereas baselines like Img2Mol showed greater sensitivity to resolution changes (Fig. 4 in the paper).</li>
<li><strong>Efficiency</strong>: The SwinV2 encoder was noted to be more efficient than ViT-B in this context.</li>
<li><strong>Limitations</strong>: The model struggles with stereochemistry, specifically confusing dashed wedge bonds with solid wedge bonds and misclassifying single bonds as solid wedge bonds. It also has difficulty with &ldquo;irrelevant text&rdquo; noise (e.g., unexpected symbols in JPO and DECIMER datasets).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combination of PubChem and USPTO data, augmented to handle visual variability.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>PubChem</strong></td>
          <td>1,000,000</td>
          <td>Converted from <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> to SMILES; random sampling.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO</strong></td>
          <td>600,000</td>
          <td>Patent images; converted from MOL to SMILES.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>STAKER</strong></td>
          <td>40,000</td>
          <td>Synthetic; Avg res $256 \times 256$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>USPTO</strong></td>
          <td>4,862</td>
          <td>Real; Avg res $721 \times 432$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>CLEF</strong></td>
          <td>881</td>
          <td>Real; Avg res $1245 \times 412$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>JPO</strong></td>
          <td>380</td>
          <td>Real; Avg res $614 \times 367$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>UOB</strong></td>
          <td>5,720</td>
          <td>Real; Avg res $759 \times 416$.</td>
      </tr>
  </tbody>
</table>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Image</strong>: Random perturbations using RDKit/Indigo (rotation, filling, cropping, bond thickness/length, font size, Gaussian noise).</li>
<li><strong>Molecular</strong>: Introduction of functional group abbreviations and R-substituents (dummy atoms) using SMARTS templates.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Target Sequence Formulation</strong>: The model predicts a sequence containing bounding box coordinates and type labels: $\{y_{\text{min}}, x_{\text{min}}, y_{\text{max}}, x_{\text{max}}, C_{n}\}$.</li>
<li><strong>Loss Function</strong>: Token-level cross-entropy, i.e. maximum-likelihood estimation of the target sequence:
$$ \max \sum_{i=1}^{N} \sum_{j=1}^{L} \omega_{j} \log P(t_{j}^{i} \mid x_{1}^{i}, x_{2}^{i}, \dots, x_{M}^{i}, t_{1}^{i}, \dots, t_{j-1}^{i}) $$</li>
<li><strong>Noise Injection</strong>: A random sequence $T_r$ is appended to the target sequence during training to improve robustness and generalization to unseen targets.</li>
<li><strong>Graph Construction</strong>: Atoms ($v$) and bonds ($e$) are recognized separately; bonds are defined by connecting spatial atomic coordinates.</li>
</ul>
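<p>The graph-construction step can be sketched as nearest-atom matching: each predicted bond endpoint snaps to the closest predicted atom center. A minimal illustration (not the authors' code; the coordinates below are invented):</p>

```python
import math

def build_graph(atoms, bonds):
    """Connect predicted bonds to their nearest predicted atoms.

    atoms: list of (symbol, (x, y)) box centers from the atom decoder.
    bonds: list of (bond_type, (x1, y1), (x2, y2)) endpoint pairs.
    Returns edges as (atom_index_a, atom_index_b, bond_type).
    """
    def nearest(pt):
        # Index of the atom center closest to this bond endpoint.
        return min(range(len(atoms)),
                   key=lambda i: math.dist(atoms[i][1], pt))

    edges = []
    for btype, p1, p2 in bonds:
        a, b = nearest(p1), nearest(p2)
        if a != b:  # discard degenerate bonds that snap to one atom
            edges.append((a, b, btype))
    return edges

atoms = [("C", (0.0, 0.0)), ("C", (1.0, 0.0)), ("O", (2.0, 0.1))]
bonds = [("single", (0.1, 0.0), (0.9, 0.0)),
         ("double", (1.1, 0.0), (1.9, 0.1))]
print(build_graph(atoms, bonds))  # [(0, 1, 'single'), (1, 2, 'double')]
```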
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer V2</strong>.
<ul>
<li>Pre-trained on ImageNet-1K.</li>
<li>Window size: $16 \times 16$.</li>
<li>Parameters: 88M.</li>
<li>Input resolution: $256 \times 256$.</li>
<li>Features: Scaled cosine attention; log-space continuous position bias.</li>
</ul>
</li>
<li><strong>Decoder</strong>: <strong>GPT-2</strong> + <strong>MLP</strong>.
<ul>
<li><strong>GPT-2</strong>: Used for recognizing atom types.
<ul>
<li>Layers: 24.</li>
<li>Attention Heads: 12.</li>
<li>Hidden Dimension: 768.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
<li><strong>MLP</strong>: Used for classifying bond types (single, double, triple, aromatic, solid wedge, dashed wedge).</li>
</ul>
</li>
<li><strong>Vocabulary</strong>:
<ul>
<li>Standard: 95 common numbers/characters ([0], [C], [=], etc.).</li>
<li>Extended: 2000 SMARTS-based characters for isomers/groups (e.g., &ldquo;[C2F5]&rdquo;, &ldquo;[halo]&rdquo;).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ol>
<li><strong>Accuracy</strong>: Exact match of the generated SMILES string.</li>
<li><strong>Tanimoto Similarity</strong>: Similarity of Morgan fingerprints between predicted and ground truth molecules.</li>
</ol>
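<p>Tanimoto similarity on fingerprints reduces to set overlap of on-bits. A stdlib-only sketch (in practice RDKit Morgan fingerprints would supply the bit sets; the bit indices below are made up):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(a & b) / len(a | b)

# Hypothetical on-bit indices for a prediction and its ground truth.
pred  = {3, 17, 42, 99, 255}
truth = {3, 17, 42, 128, 255}
print(round(tanimoto(pred, truth), 3))  # 0.667
```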
<p><strong>Key Results (Accuracy, %)</strong>; entries marked * are not reported:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MMSSC-Net</th>
          <th>MolVec (Rule)</th>
          <th>ABC-Net (DL)</th>
          <th>MolScribe (DL)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Indigo</strong></td>
          <td>98.14</td>
          <td>95.63</td>
          <td>96.4</td>
          <td>97.5</td>
      </tr>
      <tr>
          <td><strong>RDKit</strong></td>
          <td>94.91</td>
          <td>86.7</td>
          <td>98.3</td>
          <td>93.8</td>
      </tr>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>94.24</td>
          <td>88.47</td>
          <td>*</td>
          <td>92.6</td>
      </tr>
      <tr>
          <td><strong>CLEF</strong></td>
          <td>91.26</td>
          <td>81.61</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>92.71</td>
          <td>81.32</td>
          <td>96.1</td>
          <td>87.9</td>
      </tr>
      <tr>
          <td><strong>Staker</strong></td>
          <td>89.44</td>
          <td>4.49</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>75.48</td>
          <td>66.8</td>
          <td>*</td>
          <td>76.2</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch Size: 128.</li>
<li>Learning Rate: $4 \times 10^{-5}$.</li>
<li>Epochs: 40.</li>
</ul>
</li>
<li><strong>Inference Speed</strong>: The SwinV2 encoder demonstrated higher efficiency (faster inference time) compared to ViT-B and ResNet-50 baselines during ablation.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Wzew5Lp/MMSSCNet">MMSSCNet (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation; includes training and prediction scripts</td>
      </tr>
  </tbody>
</table>
<p>The paper is published in RSC Advances (open access). Source code is available on GitHub, though the repository has minimal documentation and no explicit license. The training data comes from PubChem (public) and USPTO (public patent data). Pre-trained model weights do not appear to be released. No specific GPU hardware or training time is reported in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, D., Zhao, D., Wang, Z., Li, J., &amp; Li, J. (2024). MMSSC-Net: multi-stage sequence cognitive networks for drug molecule recognition. <em>RSC Advances</em>, 14(26), 18182-18191. <a href="https://doi.org/10.1039/D4RA02442G">https://doi.org/10.1039/D4RA02442G</a></p>
<p><strong>Publication</strong>: RSC Advances 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangMMSSCNetMultistageSequence2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MMSSC-Net: Multi-Stage Sequence Cognitive Networks for Drug Molecule Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MMSSC-Net}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Dehai and Zhao, Di and Wang, Zhengwu and Li, Junhui and Li, Jin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{RSC Advances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{26}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{18182--18191}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D4RA02442G}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.rsc.org/en/content/articlelanding/2024/ra/d4ra02442g}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MarkushGrapher: Multi-modal Markush Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</guid><description>Multi-modal transformer combining vision, text, and layout encoding to extract complex Markush structures from patent documents with OCSR.</description><content:encoded><![CDATA[<h2 id="overcoming-unimodal-limitations-for-markush-structures">Overcoming Unimodal Limitations for Markush Structures</h2>
<p>The automated analysis of chemical literature, particularly patents, is critical for drug discovery and material science. A major bottleneck is the extraction of <strong>Markush structures</strong>, which are complex chemical templates that represent families of molecules using a core backbone image and textual variable definitions. Existing methods are limited because they either rely solely on images (OCSR) and miss the textual context, or focus solely on text and miss the structural backbone. This creates a practical need for a unified, multi-modal approach that jointly interprets visual and textual data to accurately extract these structures for prior-art search and database construction. This paper proposes a <strong>Method</strong> and introduces a new <strong>Resource</strong> (M2S dataset) to bridge this gap.</p>
<h2 id="markushgrapher-the-multi-modal-architecture">MarkushGrapher: The Multi-Modal Architecture</h2>
<p>The core innovation is <strong>MarkushGrapher</strong>, a multi-modal architecture that jointly encodes image, text, and layout information. Key contributions include:</p>
<ul>
<li><strong>Dual-Encoder Architecture</strong>: Combines a Vision-Text-Layout (VTL) encoder (based on UDOP) with a specialized, pre-trained Optical Chemical Structure Recognition (OCSR) encoder (MolScribe). Let $E_{\text{VTL}}$ represent the combined sequence embedding and $E_{\text{OCSR}}$ represent the domain-specific visual embeddings.</li>
<li><strong>Joint Recognition</strong>: The model autoregressively generates a sequential graph representation (Optimized CXSMILES) and a substituent table simultaneously. It uses cross-modal dependencies, allowing text to clarify ambiguous visual details like bond types.</li>
<li><strong>Synthetic Data Pipeline</strong>: A comprehensive pipeline generates realistic synthetic Markush structures (images and text) from PubChem data, overcoming the lack of labeled training data.</li>
<li><strong>Optimized Representation</strong>: A compacted version of CXSMILES moves variable groups into the SMILES string and adds explicit atom indexing to handle complex &ldquo;frequency&rdquo; and &ldquo;position&rdquo; variation indicators.</li>
</ul>
<h2 id="experimental-validation-on-the-new-m2s-benchmark">Experimental Validation on the New M2S Benchmark</h2>
<p>The authors validated their approach using the following setup:</p>
<ul>
<li><strong>Baselines</strong>: Compared against image-only chemistry models (DECIMER, MolScribe) and general-purpose multi-modal models (Uni-SMART, GPT-4o, Pixtral, Llama-3.2).</li>
<li><strong>Datasets</strong>: Evaluated on three benchmarks:
<ol>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 generated samples.</li>
<li><strong>M2S</strong>: A new benchmark of 103 manually annotated real-world patent images.</li>
<li><strong>USPTO-Markush</strong>: 74 Markush backbone images from USPTO patents.</li>
</ol>
</li>
<li><strong>Ablation Studies</strong>: Analyzed the impact of the OCSR encoder, late fusion strategies, and the optimized CXSMILES format. Late fusion improved USPTO-Markush EM from 23% (VTL only) to 32% (Table 3). Removing R-group compression dropped M2S EM from 38% to 30%, and removing atom indexing dropped USPTO-Markush EM from 32% to 24% (Table 4).</li>
</ul>
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Performance</strong>: MarkushGrapher outperformed all baselines. On the M2S benchmark, it achieved 38% Exact Match on CXSMILES (compared to 21% for MolScribe) and 29% Exact Match on tables. On USPTO-Markush, it reached 32% CXSMILES EM versus 7% for MolScribe.</li>
<li><strong>Markush Feature Recognition</strong>: The model can recognize complex Markush features like frequency variation (&lsquo;Sg&rsquo;) and position variation (&lsquo;m&rsquo;) indicators. DECIMER and MolScribe scored 0% on both &lsquo;m&rsquo; and &lsquo;Sg&rsquo; sections (Table 2), while MarkushGrapher achieved 76% on &lsquo;m&rsquo; and 31% on &lsquo;Sg&rsquo; sections on M2S.</li>
<li><strong>Cross-Modal Reasoning</strong>: Qualitative analysis showed the model can correctly infer visual details (such as bond order) that appear ambiguous in the image but become apparent with the text description.</li>
<li><strong>Robustness</strong>: The model generalizes well to real-world data despite being trained purely on synthetic data. On augmented versions of M2S and USPTO-Markush simulating low-quality scanned documents, it maintained 31% and 32% CXSMILES EM respectively (Table 6).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>The authors note several limitations:</p>
<ul>
<li>MarkushGrapher does not currently handle abbreviations in chemical structures (e.g., &lsquo;OG&rsquo; for oxygen connected to a variable group).</li>
<li>The model relies on ground-truth OCR cells as input, requiring an external OCR model for practical deployment.</li>
<li>Substituent definitions that combine text with interleaved chemical structure drawings are not supported.</li>
<li>The model is trained to predict &lsquo;m&rsquo; sections connecting to all atoms in a cycle, which can technically violate valence constraints, though the output contains enough information to reconstruct only valid connections.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong></p>
<ul>
<li><strong>Source</strong>: Synthetic dataset generated from PubChem SMILES.</li>
<li><strong>Size</strong>: 210,000 synthetic images.</li>
<li><strong>Pipeline</strong>:
<ol>
<li><strong>Selection</strong>: Sampled SMILES from PubChem based on substructure diversity.</li>
<li><strong>Augmentation</strong>: SMILES augmented to artificial CXSMILES using RDKit (inserting variable groups, frequency indicators).</li>
<li><strong>Rendering</strong>: Images rendered using Chemistry Development Kit (CDK) with randomized drawing parameters (font, bond width, spacing).</li>
<li><strong>Text Generation</strong>: Textual definitions generated using manual templates extracted from patents; 10% were paraphrased using Mistral-7B-Instruct-v0.3 to increase diversity.</li>
<li><strong>OCR</strong>: Bounding boxes extracted via a custom SVG parser aligned with MOL files.</li>
</ol>
</li>
</ul>
<p><strong>Evaluation Data</strong></p>
<ul>
<li><strong>M2S Dataset</strong>: 103 images from USPTO, EPO, and WIPO patents (1999-2023), manually annotated with CXSMILES and substituent tables.</li>
<li><strong>USPTO-Markush</strong>: 74 images from USPTO patents (2010-2016).</li>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 samples generated via the pipeline.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimized CXSMILES</strong>:
<ul>
<li><strong>Compression</strong>: Variable groups moved from the extension block to the main SMILES string as special atoms to reduce sequence length.</li>
<li><strong>Indexing</strong>: Atom indices appended to each atom (e.g., <code>C:1</code>) to explicitly link the graph to the extension block (crucial for <code>m</code> and <code>Sg</code> sections).</li>
<li><strong>Vocabulary</strong>: Specific tokens used for atoms and bonds.</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Standard image augmentations (shift, scale, blur, pepper noise, random lines) and OCR text augmentations (character substitution/insertion/deletion).</li>
</ul>
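<p>The atom-indexing idea can be illustrated with a toy SMILES tokenizer that appends <code>:i</code> to each atom, matching the <code>C:1</code> style quoted above (illustrative only; real CXSMILES handling is considerably richer):</p>

```python
import re

# Minimal SMILES atom tokenizer: bracket atoms, two-letter halogens,
# then single-letter organic-subset / aromatic atoms.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOSPFI]|[bcnops]")

def index_atoms(smiles):
    """Append ':i' indices to each atom, in the spirit of the paper's
    optimized CXSMILES (a simplified sketch, not the authors' scheme)."""
    out, i, pos = [], 0, 0
    for m in ATOM.finditer(smiles):
        out.append(smiles[pos:m.start()])  # keep bonds, rings, branches
        i += 1
        out.append(f"{m.group()}:{i}")
        pos = m.end()
    out.append(smiles[pos:])
    return "".join(out)

print(index_atoms("CC(=O)O"))  # C:1C:2(=O:3)O:4
```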
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer.
<ul>
<li><strong>VTL Encoder</strong>: T5-large encoder (initialized from UDOP) that processes image patches, text tokens, and layout (bounding boxes).</li>
<li><strong>OCSR Encoder</strong>: Vision encoder from MolScribe (Swin Transformer), frozen during training.</li>
<li><strong>Text Decoder</strong>: T5-large decoder.</li>
</ul>
</li>
<li><strong>Fusion Strategy</strong>: <strong>Late Fusion</strong>. The VTL output $e_1$ (over vision $v$, text $t$, and layout $l$) is concatenated with the MLP-projected OCSR output $e_2$ before decoding:
$$ e = e_1(v, t, l) \oplus \text{MLP}(e_2(v)) $$</li>
<li><strong>Parameters</strong>: 831M total (744M trainable).</li>
</ul>
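<p>The fusion equation can be sketched with toy embeddings, reading $\oplus$ as concatenation along the sequence dimension (one plausible interpretation; the dimensions and weights below are invented):</p>

```python
def mlp_project(vec, weights, bias):
    """Toy single linear layer standing in for the MLP projection."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def late_fusion(e1, e2, weights, bias):
    """e = e1 (VTL tokens) concatenated with MLP(e2) (OCSR tokens),
    per the fusion equation; toy dimensions, not the paper's."""
    projected = [mlp_project(tok, weights, bias) for tok in e2]
    return e1 + projected  # joined token sequence fed to the decoder

e1 = [[1.0, 0.0], [0.0, 1.0]]            # two VTL token embeddings (d=2)
e2 = [[2.0, 2.0, 2.0]]                   # one OCSR token embedding (d=3)
W  = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]]  # 3 -> 2 projection
b  = [0.0, 0.0]
print(late_fusion(e1, e2, W, b))  # [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```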
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>CXSMILES Exact Match (EM)</strong>: Requires perfect match of SMILES string, variable groups, <code>m</code> sections, and <code>Sg</code> sections (ignoring stereochemistry).</li>
<li><strong>Tanimoto Score</strong>: Similarity of RDKit DayLight fingerprints (Markush features removed).</li>
<li><strong>Table Exact Match</strong>: All variable groups and substituents must match.</li>
<li><strong>Table F1-Score</strong>: Aggregated recall and precision of substituents per variable group.</li>
</ul>
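<p>The Table F1-Score can be read as micro-averaged precision and recall over the substituent sets of each variable group. A hedged sketch (the paper's exact aggregation may differ; the example groups are invented):</p>

```python
def table_f1(pred, truth):
    """Micro-averaged F1 over substituent sets per variable group.

    pred/truth: dict mapping a variable group (e.g. 'R1') to a set of
    substituent strings. One plausible reading of the paper's metric.
    """
    tp = fp = fn = 0
    for group in set(pred) | set(truth):
        p, t = pred.get(group, set()), truth.get(group, set())
        tp += len(p & t)  # substituents predicted and present
        fp += len(p - t)  # spurious predictions
        fn += len(t - p)  # missed substituents
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = {"R1": {"H", "CH3", "OCH3"}, "R2": {"Cl", "F"}}
pred  = {"R1": {"H", "CH3"},         "R2": {"Cl", "Br"}}
print(round(table_f1(pred, truth), 3))  # 0.667
```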
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Trained on a single NVIDIA H100 GPU.</li>
<li><strong>Training Config</strong>: 10 epochs, batch size of 10, Adam optimizer, learning rate 5e-4, 100 warmup steps, weight decay 1e-3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Weber, V., Nassar, A., Meijer, G. I., Van Gool, L., Li, Y., &amp; Staar, P. (2025). MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures. <em>2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 14505-14515. <a href="https://doi.org/10.1109/CVPR52734.2025.01352">https://doi.org/10.1109/CVPR52734.2025.01352</a></p>
<p><strong>Publication</strong>: CVPR 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMarkushGrapherJointVisual2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MarkushGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Weber, Valéry and Nassar, Ahmed and Meijer, Gerhard Ingmar and Van Gool, Luc and Li, Yawei and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14505--14515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/CVPR52734.2025.01352}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2InChI: SwinTransformer for Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</guid><description>Deep learning model using improved SwinTransformer encoder and attention-based feature fusion to convert molecular images to InChI strings.</description><content:encoded><![CDATA[<h2 id="image2inchi-as-a-methodological-innovation">Image2InChI as a Methodological Innovation</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>. It proposes a specific new deep learning architecture (&ldquo;Image2InChI&rdquo;) to solve the task of Optical Chemical Structure Recognition (OCSR). The rhetorical focus is on engineering a system that outperforms baselines on specific metrics (InChI accuracy, MCS accuracy) and providing a valuable reference for future algorithmic work.</p>
<h2 id="bottlenecks-in-chemical-literature-digitization">Bottlenecks in Chemical Literature Digitization</h2>
<p>The accurate digitization of chemical literature is a bottleneck in AI-driven drug discovery. Chemical structures in patents and papers exist as optical images (pixels), but machine learning models require machine-readable string representations (like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>). Efficiently and automatically bridging this gap is a prerequisite for large-scale data mining in chemistry.</p>
<h2 id="hierarchical-swintransformer-and-attention-integration">Hierarchical SwinTransformer and Attention Integration</h2>
<p>The core novelty is the <strong>Image2InChI</strong> architecture, which integrates:</p>
<ol>
<li><strong>Improved SwinTransformer Encoder</strong>: Uses a hierarchical vision transformer to capture image features.</li>
<li><strong>Feature Fusion with Attention</strong>: A novel network designed to integrate image patch features with InChI prediction steps.</li>
<li><strong>End-to-End InChI Prediction</strong>: The architecture frames the problem as a direct image-to-sequence translation targeting InChI strings directly, diverging from techniques predicting independent graph components. The model is optimized using a standard Cross-Entropy Loss over the token vocabulary:
$$ \mathcal{L}_{\text{CE}} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{X}) $$
where $\mathbf{X}$ represents the input image features, $y_t$ is the predicted token, and $T$ is the sequence length.</li>
</ol>
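<p>Numerically, the loss above is just a sum of negative log-probabilities of the reference tokens. A toy illustration with invented per-step distributions:</p>

```python
import math

def sequence_ce(probs, target):
    """Cross-entropy of a target token sequence.

    probs: per-step dicts mapping token -> model probability.
    target: the reference token sequence.
    """
    return -sum(math.log(p_t[tok]) for p_t, tok in zip(probs, target))

# Toy 3-step distributions over a tiny vocabulary.
probs = [{"C": 0.9, "O": 0.1},
         {"C": 0.2, "O": 0.8},
         {"<eos>": 0.95, "C": 0.05}]
target = ["C", "O", "<eos>"]
print(round(sequence_ce(probs, target), 4))  # 0.3798
```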
<h2 id="benchmarking-on-the-bms-dataset">Benchmarking on the BMS Dataset</h2>
<ul>
<li><strong>Benchmark Validation</strong>: The model was trained and tested on the <strong>BMS1000 (Bristol-Myers Squibb)</strong> dataset from a Kaggle competition.</li>
<li><strong>Ablation/Comparative Analysis</strong>: The authors compared their method against other models in the supplement.</li>
<li><strong>Preprocessing Validation</strong>: They justified their choice of denoising algorithms (8-neighborhood vs. Gaussian/Mean) to ensure preservation of bond lines while removing &ldquo;spiky point noise&rdquo;.</li>
</ul>
<h2 id="high-inchi-recognition-metrics">High InChI Recognition Metrics</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved <strong>99.8% InChI accuracy</strong>, 94.8% Maximum Common Substructure (MCS) accuracy, and 96.2% Longest Common Subsequence (LCS) accuracy on the benchmarked dataset. It remains to be seen how well these models generalize to heavily degraded real-world patent images.</li>
<li><strong>Effective Denoising</strong>: The authors concluded that <strong>eight-neighborhood filtering</strong> is superior to mean or Gaussian filtering for this specific domain because it removes isolated noise points without blurring the fine edges of chemical bonds.</li>
<li><strong>Open Source</strong>: The authors stated their intention to release the code, though no public repository has been identified.</li>
</ul>
<hr>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition</td>
          <td>Bristol-Myers Squibb Molecular Translation competition dataset</td>
      </tr>
  </tbody>
</table>
<p>No public code repository has been identified for Image2InChI despite the authors&rsquo; stated intent to release it.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The primary dataset used is the <strong>BMS (Bristol-Myers Squibb) Dataset</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>Kaggle Competition (BMS-Molecular-Translation)</td>
      </tr>
      <tr>
          <td><strong>Total Size</strong></td>
          <td>2.4 million images</td>
      </tr>
      <tr>
          <td><strong>Training Set</strong></td>
          <td>1.8 million images</td>
      </tr>
      <tr>
          <td><strong>Test Set</strong></td>
          <td>0.6 million images</td>
      </tr>
      <tr>
          <td><strong>Content</strong></td>
          <td>Each image corresponds to a unique International Chemical Identifier (<a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>)</td>
      </tr>
  </tbody>
</table>
<p><strong>Other Datasets</strong>: The authors also utilized JPO (Japanese Patent Office), CLEF (CLEF-IP 2012), UOB (MolrecUOB), and USPTO datasets for broader benchmarking.</p>
<p><strong>Preprocessing Pipeline</strong>:</p>
<ol>
<li><strong>Denoising</strong>: <strong>Eight-neighborhood filtering</strong> (threshold &lt; 4 non-white pixels) is used to remove salt-and-pepper noise while preserving bond lines. Mean and Gaussian filtering were rejected due to blurring.</li>
<li><strong>Sequence Padding</strong>:
<ul>
<li>Analysis showed max InChI length &lt; 270.</li>
<li>Fixed sequence length set to <strong>300</strong>.</li>
<li>Tokens: <code>&lt;sos&gt;</code> (190), <code>&lt;eos&gt;</code> (191), <code>&lt;pad&gt;</code> (192) used for padding/framing.</li>
</ul>
</li>
<li><strong>Numerization</strong>: Characters are mapped to integers based on a fixed vocabulary (e.g., &lsquo;C&rsquo; -&gt; 178, &lsquo;H&rsquo; -&gt; 182).</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Eight-Neighborhood Filtering (Denoising)</strong>:</p>
<p>Pseudocode logic:</p>
<ul>
<li>Iterate through every pixel.</li>
<li>Count non-white neighbors in the 3x3 grid (8 neighbors).</li>
<li>If count &lt; threshold (default 4), treat as noise and remove.</li>
</ul>
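<p>The pseudocode above translates directly into a grid filter (a self-contained sketch on a binary list-of-lists image, not the authors' implementation):</p>

```python
def eight_neighborhood_filter(img, threshold=4):
    """Remove isolated dark pixels from a binary image (1 = ink, 0 = white).

    A dark pixel survives only if at least `threshold` of its 8
    neighbors are also dark, matching the denoising rule described above.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            neighbors = sum(img[ny][nx]
                            for ny in range(max(0, y - 1), min(h, y + 2))
                            for nx in range(max(0, x - 1), min(w, x + 2))
                            if (ny, nx) != (y, x))
            if neighbors < threshold:
                out[y][x] = 0  # isolated speck: treat as noise
    return out

noisy = [[0, 0, 0, 0, 1],   # lone speck at top-right
         [1, 1, 1, 1, 0],   # interior of a solid bond line survives
         [1, 1, 1, 1, 0],
         [1, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
clean = eight_neighborhood_filter(noisy)
print(clean[0][4], clean[2][1])  # 0 1
```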
<p><strong>InChI Tokenization</strong>:</p>
<ul>
<li>InChI strings are split into character arrays.</li>
<li>Example: Vitamin C <code>InChI=1S/C6H8O6...</code> becomes <code>[&lt;sos&gt;, C, 6, H, 8, O, 6, ..., &lt;eos&gt;, &lt;pad&gt;...]</code>.</li>
<li>Mapped to integer tensor for model input.</li>
</ul>
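<p>Putting tokenization, framing, and padding together (the <code>&lt;sos&gt;</code>/<code>&lt;eos&gt;</code>/<code>&lt;pad&gt;</code> ids and the &lsquo;C&rsquo;/&lsquo;H&rsquo; mappings come from the paper; the remaining vocabulary entries are hypothetical):</p>

```python
def tokenize_inchi(inchi, vocab, max_len=300, sos=190, eos=191, pad=192):
    """Map an InChI string to a fixed-length integer sequence.

    Character-level split, framed with <sos>/<eos> and padded to
    max_len, using the token ids quoted in the paper. `vocab` maps
    characters to integers (the full table is not reproduced here,
    so part of this toy vocab is illustrative).
    """
    ids = [sos] + [vocab[ch] for ch in inchi] + [eos]
    if len(ids) > max_len:
        raise ValueError("sequence longer than fixed length")
    return ids + [pad] * (max_len - len(ids))

# 'C' -> 178 and 'H' -> 182 per the paper; the digit/'O' ids are invented.
vocab = {"C": 178, "H": 182, "6": 6, "8": 8, "O": 15}
row = tokenize_inchi("C6H8O6", vocab, max_len=12)
print(row)  # [190, 178, 6, 182, 8, 15, 6, 191, 192, 192, 192, 192]
```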
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image2InChI</p>
<ul>
<li><strong>Encoder</strong>: Improved SwinTransformer (Hierarchical Vision Transformer).</li>
<li><strong>Decoder</strong>: Transformer Decoder with patch embedding.</li>
<li><strong>Fusion</strong>: A novel &ldquo;feature fusion network with attention&rdquo; integrates the visual tokens with the sequence generation process.</li>
<li><strong>Framework</strong>: PyTorch 1.8.1.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>InChI Acc</strong>: Exact match accuracy of the predicted InChI string (Reported: 99.8%).</li>
<li><strong>MCS Acc</strong>: Maximum Common Substructure accuracy (structural similarity) (Reported: 94.8%).</li>
<li><strong>LCS Acc</strong>: Longest Common Subsequence accuracy (string similarity) (Reported: 96.2%).</li>
<li><strong>Morgan FP</strong>: Morgan Fingerprint similarity (Reported: 94.1%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Specification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GPU</strong></td>
          <td>NVIDIA Tesla P100 (16GB VRAM)</td>
      </tr>
      <tr>
          <td><strong>Platform</strong></td>
          <td>MatPool cloud platform</td>
      </tr>
      <tr>
          <td><strong>CPU</strong></td>
          <td>Intel Xeon Gold 6271</td>
      </tr>
      <tr>
          <td><strong>RAM</strong></td>
          <td>32GB System Memory</td>
      </tr>
      <tr>
          <td><strong>Driver</strong></td>
          <td>NVIDIA-SMI 440.100</td>
      </tr>
      <tr>
          <td><strong>OS</strong></td>
          <td>Ubuntu 18.04</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, D., Xu, X., Pan, J., Gao, W., &amp; Zhang, S. (2024). Image2InChI: Automated Molecular Optical Image Recognition. <em>Journal of Chemical Information and Modeling</em>, 64(9), 3640-3649. <a href="https://doi.org/10.1021/acs.jcim.3c02082">https://doi.org/10.1021/acs.jcim.3c02082</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></li>
</ul>
<p><strong>Note</strong>: These notes are based on the Abstract and Supporting Information files only.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2024image2inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Image2InChI: Automated Molecular Optical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Da-zhou and Xu, Xin and Pan, Jia-heng and Gao, Wei and Zhang, Shi-rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3640--3649}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.3c02082}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Enhanced DECIMER for Hand-Drawn Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</guid><description>An improved encoder-decoder model (EfficientNetV2 + Transformer) converts hand-drawn chemical structures into SMILES strings using synthetic training data.</description><content:encoded><![CDATA[<h2 id="method-contribution-architectural-optimization">Method Contribution: Architectural Optimization</h2>
<p>This is a <strong>Method</strong> paper. It proposes an enhanced neural network architecture (EfficientNetV2 + Transformer) specifically designed to solve the problem of recognizing hand-drawn chemical structures. The primary contribution is architectural optimization and a data-driven training strategy, validated through ablation studies (comparing encoders) and benchmarked against existing rule-based and deep learning tools.</p>
<h2 id="motivation-digitizing-dark-chemical-data">Motivation: Digitizing &ldquo;Dark&rdquo; Chemical Data</h2>
<p>Chemical information in legacy laboratory notebooks and modern tablet-based inputs often exists as hand-drawn sketches.</p>
<ul>
<li><strong>Gap:</strong> Existing Optical Chemical Structure Recognition (OCSR) tools (particularly rule-based ones) lack robustness and fail when images have variability in style, line thickness, or noise.</li>
<li><strong>Need:</strong> There is a critical need for automated tools to digitize this &ldquo;dark data&rdquo; effectively to preserve it and make it machine-readable and searchable.</li>
</ul>
<h2 id="core-innovation-decoder-only-design-and-synthetic-scaling">Core Innovation: Decoder-Only Design and Synthetic Scaling</h2>
<p>The core novelty is the <strong>architectural enhancement</strong> and <strong>synthetic training strategy</strong>:</p>
<ol>
<li><strong>Decoder-Only Transformer:</strong> Using only the decoder part of the Transformer (instead of a full encoder-decoder Transformer) improved average accuracy across OCSR benchmarks from 61.28% to 69.27% (Table 3 in the paper).</li>
<li><strong>EfficientNetV2 Integration:</strong> Replacing standard CNNs or EfficientNetV1 with <strong>EfficientNetV2-M</strong> provided better feature extraction and 2x faster training speeds.</li>
<li><strong>Scale of Synthetic Data:</strong> The authors demonstrate that scaling synthetic training data (up to 152 million images generated by RanDepict) directly correlates with improved generalization to real-world hand-drawn images, without ever training on real hand-drawn data.</li>
</ol>
<h2 id="experimental-setup-ablation-and-real-world-baselines">Experimental Setup: Ablation and Real-World Baselines</h2>
<ul>
<li><strong>Model Selection (Ablation):</strong> Tested three architectures (EfficientNetV2-M + Full Transformer, EfficientNetV1-B7 + Decoder-only, EfficientNetV2-M + Decoder-only) on standard benchmarks (JPO, CLEF, USPTO, UOB).</li>
<li><strong>Data Scaling:</strong> Trained the best model on four progressively larger datasets (from 4M to 152M images) to measure performance gains.</li>
<li><strong>Real-World Benchmarking:</strong> Validated the final model on the <strong>DECIMER Hand-drawn dataset</strong> (5,088 real images drawn by volunteers) and compared against 9 other tools (OSRA, MolVec, Img2Mol, MolScribe, etc.).</li>
</ul>
<h2 id="results-and-conclusions-strong-accuracy-on-hand-drawn-scans">Results and Conclusions: Strong Accuracy on Hand-Drawn Scans</h2>
<ul>
<li><strong>Strong Performance:</strong> The final DECIMER model achieved <strong>99.72% valid predictions</strong> and <strong>73.25% exact accuracy</strong> on the hand-drawn benchmark. The next best non-DECIMER tool was MolGrapher at 10.81% accuracy, followed by MolScribe at 7.65%.</li>
<li><strong>Robustness:</strong> Deep learning methods outperform rule-based methods (which scored 3% or less accuracy) on hand-drawn data.</li>
<li><strong>Data Saturation:</strong> Quadrupling the dataset from 38M to 152M images yielded only marginal gains (about 3 percentage points in accuracy), suggesting current synthetic data strategies may be hitting a plateau.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official TensorFlow implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained hand-drawn model weights</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/decimer/">DECIMER PyPi Package</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Installable Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic hand-drawn image generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The model was trained entirely on <strong>synthetic data</strong> generated using the <a href="https://github.com/OBrink/RanDepict">RanDepict</a> toolkit. No real hand-drawn images were used for training.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Molecules</th>
          <th>Total Images</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>4,375,338</td>
          <td>1 augmented + 1 clean per molecule</td>
      </tr>
      <tr>
          <td>2</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>13,126,014</td>
          <td>2 augmented + 4 clean per molecule</td>
      </tr>
      <tr>
          <td>3</td>
          <td>PubChem</td>
          <td>9,510,000</td>
          <td>38,040,000</td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
      <tr>
          <td>4</td>
          <td>PubChem</td>
          <td>38,040,000</td>
          <td><strong>152,160,000</strong></td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
  </tbody>
</table>
<p>A separate <strong>model selection</strong> experiment used a 1,024,000-molecule subset of ChEMBL to compare the three architectures (Table 1 in the paper). The <strong>DECIMER Hand-Drawn</strong> evaluation dataset consists of 5,088 real hand-drawn images from 23 volunteers.</p>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings filtered to fewer than 300 characters.</li>
<li>Images resized to $512 \times 512$.</li>
<li>Images generated with and without &ldquo;hand-drawn style&rdquo; augmentations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization:</strong> SMILES split by heavy atoms, brackets, bond symbols, and special characters. Start <code>&lt;start&gt;</code> and end <code>&lt;end&gt;</code> tokens added; padded with <code>&lt;pad&gt;</code>.</li>
<li><strong>Optimization:</strong> Adam optimizer with a custom learning rate schedule (as specified in the original Transformer paper). A dropout rate of 0.1 was used.</li>
<li><strong>Loss Function:</strong> Trained using focal loss to address class imbalance for rare tokens. The focal loss formulation reduces the relative loss for well-classified examples:
$$
\text{FL}(p_{\text{t}}) = -\alpha_{\text{t}} (1 - p_{\text{t}})^\gamma \log(p_{\text{t}})
$$</li>
<li><strong>Augmentations:</strong> RanDepict applied synthetic distortions to mimic handwriting (wobbly lines, variable thickness, etc.).</li>
</ul>
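<p>The tokenization step described above can be sketched with a regular expression. This is an illustrative reconstruction, not the paper's exact token inventory: the pattern below covers bracket atoms, common two-letter atoms, bonds, and ring/branch symbols, and the <code>pad</code> helper mirrors the padding scheme:</p>

```python
import re

# Hypothetical token pattern: bracket atoms kept whole, then two-letter
# atoms, then single-character atoms, bonds, branches, and ring digits.
TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|[BCNOPSFIbcnops]|"
    r"[=#\-\+\\/\(\)\.%@]|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens and add start/end markers."""
    tokens = TOKEN_PATTERN.findall(smiles)
    return ["<start>"] + tokens + ["<end>"]

def pad(tokens: list[str], max_len: int) -> list[str]:
    """Right-pad a token sequence with <pad> up to max_len."""
    return tokens + ["<pad>"] * (max_len - len(tokens))
```

<p>Alternation order matters: multi-character tokens such as <code>Br</code> and <code>@@</code> must precede the single-character alternatives, or they would be split apart.</p>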
<h3 id="models">Models</h3>
<p>The final architecture (Model 3) is an Encoder-Decoder structure:</p>
<ul>
<li><strong>Encoder:</strong> <strong>EfficientNetV2-M</strong> (pretrained ImageNet backbone).
<ul>
<li>Input: $512 \times 512 \times 3$ image.</li>
<li>Output Features: $16 \times 16 \times 512$ (reshaped to sequence length 256, dimension 512).</li>
<li><em>Note:</em> The final fully connected layer of the CNN is removed.</li>
</ul>
</li>
<li><strong>Decoder:</strong> <strong>Transformer (Decoder-only)</strong>.
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 8</li>
<li>Embedding Dimension: 512</li>
</ul>
</li>
<li><strong>Output:</strong> Predicted SMILES string token by token.</li>
</ul>
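<p>The reshape from the encoder's feature map to the decoder's input sequence is purely mechanical; a minimal sketch with nested lists, using the shapes listed above:</p>

```python
def features_to_sequence(fmap):
    """Flatten an H x W x C feature map into an (H*W) x C token sequence."""
    return [channel_vec for row in fmap for channel_vec in row]

# A 16 x 16 grid of 512-dim feature vectors becomes a sequence of 256 tokens.
fmap = [[[0.0] * 512 for _ in range(16)] for _ in range(16)]
seq = features_to_sequence(fmap)
assert len(seq) == 256 and len(seq[0]) == 512
```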
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used for evaluation:</p>
<ol>
<li><strong>Valid Predictions (%):</strong> Percentage of outputs that are syntactically valid SMILES.</li>
<li><strong>Exact Match Accuracy (%):</strong> Canonical SMILES string identity.</li>
<li><strong>Tanimoto Similarity:</strong> Fingerprint similarity (PubChem fingerprints) between ground truth and prediction.</li>
</ol>
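<p>The first two metrics can be computed in a few lines, assuming a canonicalization backend is available (in practice RDKit; here <code>canonicalize</code> is an injected callback so the sketch stays dependency-free):</p>

```python
from typing import Callable, Optional

def evaluate(preds: list, refs: list,
             canonicalize: Callable[[str], Optional[str]]):
    """Return (valid %, exact match %) for predicted vs. reference SMILES.

    `canonicalize` maps a valid SMILES to its canonical form and returns
    None for invalid input (e.g., RDKit MolFromSmiles + MolToSmiles).
    """
    valid = exact = 0
    for pred, ref in zip(preds, refs):
        canon = canonicalize(pred)
        if canon is not None:
            valid += 1
            if canon == canonicalize(ref):
                exact += 1
    n = len(refs)
    return 100.0 * valid / n, 100.0 * exact / n
```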
<p><strong>Data Scaling Results (Hand-Drawn Dataset, Table 4 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Training Images</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (ChEMBL)</td>
          <td>4,375,338</td>
          <td>96.21%</td>
          <td>5.09%</td>
          <td>0.490</td>
      </tr>
      <tr>
          <td>2 (ChEMBL)</td>
          <td>13,126,014</td>
          <td>97.41%</td>
          <td>26.08%</td>
          <td>0.690</td>
      </tr>
      <tr>
          <td>3 (PubChem)</td>
          <td>38,040,000</td>
          <td>99.67%</td>
          <td>70.34%</td>
          <td>0.939</td>
      </tr>
      <tr>
          <td>4 (PubChem)</td>
          <td>152,160,000</td>
          <td>99.72%</td>
          <td>73.25%</td>
          <td>0.942</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with Other Tools (Hand-Drawn Dataset, Table 5 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>OCSR Tool</th>
          <th>Method</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER (Ours)</strong></td>
          <td>Deep Learning</td>
          <td><strong>99.72%</strong></td>
          <td><strong>73.25%</strong></td>
          <td><strong>0.94</strong></td>
      </tr>
      <tr>
          <td>DECIMER.ai</td>
          <td>Deep Learning</td>
          <td>96.07%</td>
          <td>26.98%</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>Deep Learning</td>
          <td>99.94%</td>
          <td>10.81%</td>
          <td>0.51</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>Deep Learning</td>
          <td>95.66%</td>
          <td>7.65%</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>Deep Learning</td>
          <td>98.96%</td>
          <td>5.25%</td>
          <td>0.52</td>
      </tr>
      <tr>
          <td>SwinOCSR</td>
          <td>Deep Learning</td>
          <td>97.37%</td>
          <td>5.11%</td>
          <td>0.64</td>
      </tr>
      <tr>
          <td>ChemGrapher</td>
          <td>Deep Learning</td>
          <td>69.56%</td>
          <td>N/A</td>
          <td>0.09</td>
      </tr>
      <tr>
          <td>Imago</td>
          <td>Rule-based</td>
          <td>43.14%</td>
          <td>2.99%</td>
          <td>0.22</td>
      </tr>
      <tr>
          <td>MolVec</td>
          <td>Rule-based</td>
          <td>71.86%</td>
          <td>1.30%</td>
          <td>0.23</td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td>Rule-based</td>
          <td>54.66%</td>
          <td>0.57%</td>
          <td>0.17</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Google Cloud TPU v4-128 pod slice.</li>
<li><strong>Training Time:</strong>
<ul>
<li>EfficientNetV2-M model trained ~2x faster than EfficientNetV1-B7.</li>
<li>Average training time per epoch: 34 minutes (for Model 3 on 1M dataset subset).</li>
</ul>
</li>
<li><strong>Epochs:</strong> Models trained for 25 epochs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H.O., Zielesny, A., &amp; Steinbeck, C. (2024). Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. <em>Journal of Cheminformatics</em>, 16, 78. <a href="https://doi.org/10.1186/s13321-024-00872-7">https://doi.org/10.1186/s13321-024-00872-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://pypi.org/project/decimer/">PyPi Package</a></li>
<li><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanAdvancementsHanddrawnChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Advancements in Hand-Drawn Chemical Structure Recognition through an Enhanced {{DECIMER}} Architecture}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{78}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00872-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Dual-Path Global Awareness Transformer (DGAT) for OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</guid><description>A Transformer-based OCSR model introducing dual-path modules (CGFE and SDGLA) to improve global context awareness and complex motif recognition.</description><content:encoded><![CDATA[<h2 id="contribution-type-deep-learning-method-for-ocsr">Contribution Type: Deep Learning Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>The classification is based on the proposal of a novel deep learning architecture (DGAT) designed to address specific limitations in existing Optical Chemical Structure Recognition (OCSR) systems. The contribution is validated through benchmarking against external baselines (DeepOCSR, DECIMER, SwinOCSR) and ablation studies that isolate the impact of the new modules.</p>
<h2 id="motivation-addressing-global-context-loss">Motivation: Addressing Global Context Loss</h2>
<p>Existing multimodal fusion methods for OCSR suffer from limited awareness of global context.</p>
<ul>
<li><strong>Problem</strong>: Models often generate erroneous sequences when processing complex motifs, such as rings or long chains, due to a disconnect between local feature extraction and global structural understanding.</li>
<li><strong>Gap</strong>: Current architectures struggle to capture the &ldquo;fine-grained differences between global and local features,&rdquo; leading to topological errors.</li>
<li><strong>Practical Need</strong>: Accurate translation of chemical images to machine-readable sequences (SMILES/SELFIES) is critical for materials science and AI-guided chemical research.</li>
</ul>
<h2 id="core-innovation-dual-path-global-awareness-transformer">Core Innovation: Dual-Path Global Awareness Transformer</h2>
<p>The authors propose the <strong>Dual-Path Global Awareness Transformer (DGAT)</strong>, which redesigns the decoder with two novel mechanisms to better handle global context:</p>
<ol>
<li>
<p><strong>Cascaded Global Feature Enhancement (CGFE)</strong>: This module bridges cross-modal gaps by emphasizing global context. It concatenates global visual features with sequence features and processes them through a Cross-Modal Assimilation MLP and an Adaptive Alignment MLP to align multimodal representations. The feature enhancement conceptually computes:</p>
<p>$$ f_{\text{enhanced}} = \text{MLP}_{\text{align}}(\text{MLP}_{\text{assimilate}}([f_{\text{global}}, f_{\text{seq}}])) $$</p>
</li>
<li>
<p><strong>Sparse Differential Global-Local Attention (SDGLA)</strong>: A module that dynamically captures fine-grained differences between global and local features. It uses sequence features (embedded with global info) as queries, while utilizing local and global visual features as keys/values in parallel attention heads to generate initial multimodal features.</p>
</li>
</ol>
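<p>A dimension-level sketch of the CGFE computation above, with plain-Python linear layers standing in for the Cross-Modal Assimilation and Adaptive Alignment MLPs (layer sizes and the ReLU activation are illustrative assumptions, not the paper's specification):</p>

```python
def linear(x, w, b):
    """y = xW + b for a single feature vector x (plain-Python matmul).

    `w` is a list of weight columns, one per output unit.
    """
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def cgfe(f_global, f_seq, w_assim, b_assim, w_align, b_align):
    """f_enhanced = MLP_align(MLP_assimilate([f_global, f_seq]))."""
    fused = f_global + f_seq          # channel-wise concatenation
    hidden = relu(linear(fused, w_assim, b_assim))
    return linear(hidden, w_align, b_align)
```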
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was evaluated on a newly constructed dataset and compared against five major baselines.</p>
<ul>
<li><strong>Baselines</strong>: DeepOCSR, DECIMER 1.0, DECIMER V2, SwinOCSR, and MPOCSR.</li>
<li><strong>Ablation Studies</strong>:
<ul>
<li><strong>Layer Depth</strong>: Tested Transformer depths from 1 to 5 layers; 3 layers proved optimal for balancing gradient flow and parameter sufficiency.</li>
<li><strong>Beam Size</strong>: Tested inference beam sizes 1-5; size 3 achieved the best balance between search depth and redundancy.</li>
<li><strong>Module Contribution</strong>: Validated that removing CGFE results in a drop in structural similarity (Tanimoto), proving the need for pre-fusion alignment.</li>
</ul>
</li>
<li><strong>Robustness Analysis</strong>: Performance broken down by molecule complexity (atom count, ring count, bond count).</li>
<li><strong>Chirality Validation</strong>: Qualitative analysis of attention maps on chiral molecules to verify the model learns stereochemical cues implicitly.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Performance Over Baselines</strong>: DGAT outperformed the MPOCSR baseline across all metrics:
<ul>
<li><strong>BLEU-4</strong>: 84.0% (+5.3 points over MPOCSR)</li>
<li><strong>ROUGE</strong>: 90.8% (+1.9 points)</li>
<li><strong>Tanimoto Similarity</strong>: 98.8% (+1.2 points)</li>
<li><strong>Exact Match Accuracy</strong>: 54.6% (+10.9 points over SwinOCSR)</li>
</ul>
</li>
<li><strong>Chiral Recognition</strong>: The model implicitly recognizes chiral centers (e.g., generating <code>[C@@H1]</code> tokens correctly) based on 2D wedge cues without direct stereochemical supervision.</li>
<li><strong>Limitations</strong>: Performance drops for extreme cases, such as molecules with 4+ rings or 4+ double/triple bonds, due to dataset imbalance. The model still hallucinates branches in highly complex topologies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is primarily drawn from PubChem and augmented to improve robustness.</p>
<ul>
<li><strong>Augmentation Strategy</strong>: Each sequence is rendered as three images, each with randomly sampled depiction parameters.
<ul>
<li><strong>Rotation</strong>: 0, 90, 180, 270, or random [0, 360)</li>
<li><strong>Bond Width</strong>: 1, 2, or 3 pixels</li>
<li><strong>Bond Offset</strong>: Sampled from 0.08-0.18 (inherited from Image2SMILES)</li>
<li><strong>CoordGen</strong>: Enabled with 20% probability</li>
</ul>
</li>
<li><strong>Evaluation Set</strong>: A newly constructed benchmark dataset was used for final reporting.</li>
</ul>
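<p>The parameter sampling can be sketched as follows; the value ranges come from the bullets above, while the function name and the exact sampling scheme are assumptions for illustration:</p>

```python
import random

def sample_render_params(rng: random.Random) -> dict:
    """Sample one set of depiction parameters for a rendered image.

    Ranges follow the augmentation strategy described above; the sampling
    scheme itself (uniform choices) is an assumption.
    """
    rotation = rng.choice([0, 90, 180, 270, rng.uniform(0, 360)])
    return {
        "rotation_deg": rotation,
        "bond_width_px": rng.choice([1, 2, 3]),
        "bond_offset": rng.uniform(0.08, 0.18),
        "use_coordgen": rng.random() < 0.20,  # enabled with 20% probability
    }
```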
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Encoder LR</strong>: $5 \times 10^{-5}$ (Pretrained ResNet-101)</li>
<li><strong>Decoder LR</strong>: $1 \times 10^{-4}$ (Randomly initialized Transformer)</li>
<li><strong>Optimizer</strong>: Not stated explicitly; the reported momentum (0.9) and weight decay (0.0001) suggest SGD with momentum</li>
<li><strong>Batch Size</strong>: 256</li>
</ul>
</li>
<li><strong>Inference</strong>:
<ul>
<li><strong>Beam Search</strong>: A beam size of <strong>3</strong> is used. Larger beam sizes (4-5) degraded BLEU/ROUGE scores due to increased redundancy.</li>
</ul>
</li>
</ul>
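<p>The decoding step can be illustrated with a generic beam-search sketch; the scoring model here is an arbitrary stand-in (in DGAT, next-token log-probabilities come from the Transformer decoder):</p>

```python
def beam_search(step_log_probs, beam_size=3, max_len=10, eos="<end>"):
    """Generic beam search over an autoregressive scoring function.

    `step_log_probs(prefix)` returns a dict {token: log_prob} for the next
    token given the current prefix.
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished beam kept as-is
                continue
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the top `beam_size` hypotheses by cumulative log-prob.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]
```

<p>Widening the beam grows the candidate pool per step, which is why sizes beyond 3 added redundant near-duplicates rather than better hypotheses in the ablation.</p>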
<h3 id="models">Models</h3>
<ul>
<li><strong>Visual Encoder</strong>:
<ul>
<li><strong>Backbone</strong>: ResNet-101 initialized with ImageNet weights</li>
<li><strong>Structure</strong>: Convolutional layers preserved up to the final module. Classification head removed.</li>
<li><strong>Pooling</strong>: A $7 \times 7$ average pooling layer is used to extract global visual features.</li>
</ul>
</li>
<li><strong>Sequence Decoder</strong>:
<ul>
<li><strong>Architecture</strong>: Transformer-based with CGFE and SDGLA modules.</li>
<li><strong>Depth</strong>: 3 Transformer layers</li>
<li><strong>Dropout</strong>: Not utilized</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is reported using sequence-level and structure-level metrics.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">DGAT Score</th>
          <th style="text-align: left">Baseline (MPOCSR)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU-4</strong></td>
          <td style="text-align: left"><strong>84.0%</strong></td>
          <td style="text-align: left">78.7%</td>
          <td style="text-align: left">Measures n-gram precision</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ROUGE</strong></td>
          <td style="text-align: left"><strong>90.8%</strong></td>
          <td style="text-align: left">88.9%</td>
          <td style="text-align: left">Sequence recall metric</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">97.6%</td>
          <td style="text-align: left">Structural similarity fingerprint</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Accuracy</strong></td>
          <td style="text-align: left"><strong>54.6%</strong></td>
          <td style="text-align: left">35.7%</td>
          <td style="text-align: left">Exact structure match rate</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/Drwr97/DGAT">DGAT</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation with training and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, R., Ji, Y., Li, Y., &amp; Lee, S.-T. (2025). Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition. <em>The Journal of Physical Chemistry Letters</em>, 16(50), 12787-12795. <a href="https://doi.org/10.1021/acs.jpclett.5c03057">https://doi.org/10.1021/acs.jpclett.5c03057</a></p>
<p><strong>Publication</strong>: The Journal of Physical Chemistry Letters 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Drwr97/DGAT">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wang2025dgat,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Rui and Ji, Yujin and Li, Youyong and Lee, Shuit-Tong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{The Journal of Physical Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{12787--12795}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jpclett.5c03057}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER.ai: Optical Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/</guid><description>Open-source OCSR platform combining Mask R-CNN segmentation and Transformer recognition, trained on 450M+ synthetic images from RanDepict.</description><content:encoded><![CDATA[<h2 id="project-scope-and-contribution-type">Project Scope and Contribution Type</h2>
<p>This is primarily a <strong>Resource</strong> paper (Infrastructure Basis) with a significant <strong>Method</strong> component.</p>
<p>The primary contribution is DECIMER.ai, a fully open-source platform (web app and Python packages) for the entire chemical structure mining pipeline, filling a gap where most tools were proprietary or fragmented. It also contributes the RanDepict toolkit for massive synthetic data generation.</p>
<p>The secondary methodological contribution proposes and validates a specific deep learning architecture (EfficientNet-V2 encoder + Transformer decoder) that treats chemical structure recognition as an image-to-text translation task (SMILES generation).</p>
<h2 id="the-scarcity-of-machine-readable-chemical-data">The Scarcity of Machine-Readable Chemical Data</h2>
<p><strong>Data Scarcity</strong>: While the number of chemical publications is increasing, most chemical information is locked in non-machine-readable formats (images in PDFs) and is not available in public databases.</p>
<p><strong>Limitations of Existing Tools</strong>: Prior OCSR (Optical Chemical Structure Recognition) tools were largely rule-based (fragile to noise) or proprietary.</p>
<p><strong>Lack of Integration</strong>: There was no existing open-source system that combined segmentation (finding the molecule on a page), classification (confirming it is a molecule), and recognition (translating it to SMILES) into a single workflow.</p>
<h2 id="decimer-architecture-and-novel-image-to-smiles-approach">DECIMER Architecture and Novel Image-to-SMILES Approach</h2>
<p><strong>Comprehensive Workflow</strong>: It is the first open-source platform to integrate segmentation (Mask R-CNN), classification (EfficientNet), and recognition (Transformer) into a unified pipeline.</p>
<p><strong>Data-Driven Approach</strong>: Unlike tools like MolScribe which use intermediate graph representations and rules, DECIMER uses a purely data-driven &ldquo;image-to-SMILES&rdquo; translation approach without hard-coded chemical rules. The core recognition model operates as a sequence-to-sequence generator, mathematically formalizing the task as maximizing the conditional probability of a SMILES sequence given an image.</p>
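<p>In the standard sequence-to-sequence factorization (notation ours), this objective decomposes autoregressively: given an image $I$ and a SMILES token sequence $S = (s_1, \dots, s_T)$,
$$ P(S \mid I) = \prod_{t=1}^{T} P(s_t \mid s_{&lt;t}, I) $$
so the decoder predicts each token conditioned on the image features and all previously generated tokens.</p>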
<p><strong>Massive Synthetic Training</strong>: The use of RanDepict to generate over 450 million synthetic images, covering diverse depiction styles and augmentations (including Markush structures), to train the model from scratch.</p>
<h2 id="benchmarking-and-evaluation-methodology">Benchmarking and Evaluation Methodology</h2>
<p><strong>Benchmarking</strong>: The system was tested against openly available tools (OSRA, MolVec, Imago, Img2Mol, SwinOCSR, MolScribe) on standard datasets: USPTO, UOB, CLEF, JPO, and a custom &ldquo;Hand-drawn&rdquo; dataset.</p>
<p><strong>Robustness Testing</strong>: Performance was evaluated on both clean images and images with added distortions (rotation, shearing) to test the fragility of rule-based systems vs. DECIMER.</p>
<p><strong>Markush Structure Analysis</strong>: Specific evaluation of the model&rsquo;s ability to interpret Markush structures (generic structures with R-groups).</p>
<p><strong>Comparison of Approaches</strong>: A direct comparison with MolScribe by training DECIMER on MolScribe&rsquo;s smaller training set to isolate the impact of architecture vs. data volume.</p>
<h2 id="performance-outcomes-and-key-findings">Performance Outcomes and Key Findings</h2>
<p><strong>Comparative Performance</strong>: DECIMER Image Transformer consistently produced average Tanimoto similarities above 0.95 on in-domain test data and achieved competitive or leading results across external benchmarks, with extremely low rates of catastrophic failure. Tanimoto similarity is calculated based on molecular fingerprints $A$ and $B$ as:
$$ T(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</p>
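<p>For binary fingerprints this formula reduces to $c / (a + b - c)$, where $a$ and $b$ are the on-bit counts of each fingerprint and $c$ the number of shared on-bits; a minimal dependency-free sketch over sets of on-bit indices:</p>

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto similarity of two binary fingerprints given as on-bit sets.

    Equivalent to (A . B) / (|A|^2 + |B|^2 - A . B) for 0/1 vectors.
    """
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)
```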
<p><strong>Data Volume Necessity</strong>: When trained on small datasets, MolScribe (graph/rule-based) outperformed DECIMER. DECIMER&rsquo;s performance advantage relies heavily on its massive training scale (&gt;400M images).</p>
<p><strong>Robustness</strong>: The model showed no performance degradation on distorted images, unlike rule-based legacy tools.</p>
<p><strong>Generalization</strong>: Despite having no hand-drawn images in the training set, the base model recognized 27% of hand-drawn structures perfectly (average Tanimoto 0.69), outperforming all alternative open tools. After fine-tuning with synthetic hand-drawn-like images from RanDepict, perfect predictions increased to 60% (average Tanimoto 0.89).</p>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/OBrink/DECIMER.ai">DECIMER.ai Web App</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Laravel-based web application for the full pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Core OCSR Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image-Segmentation">DECIMER Image Segmentation</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Mask R-CNN segmentation for chemical structures in documents</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Iagea/DECIMER-Image-Classifier">DECIMER Image Classifier</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>EfficientNet-based chemical structure image classifier</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic training data generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The models were trained on synthetic data generated from PubChem molecules.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Generation/Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_1</code></td>
          <td>~108M mols</td>
          <td>PubChem molecules (mass &lt; 1500 Da), processed with RanDepict (v1.0.5). Included image augmentations.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_2</code></td>
          <td>~126M mols</td>
          <td>Included Markush structures generated by pseudo-randomly replacing atoms with R-groups. Image size 299x299.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_3</code></td>
          <td>&gt;453M images</td>
          <td>Re-depicted <code>pubchem_2</code> molecules at <strong>512x512</strong> resolution. Used RanDepict v1.0.8.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>In-domain</td>
          <td>250,000</td>
          <td>Held-out set generated similarly to training data.</td>
      </tr>
      <tr>
          <td><strong>Benchmark</strong></td>
          <td>External</td>
          <td>Various</td>
          <td>USPTO (5719), UOB (5740), CLEF (992), JPO (450), Indigo (50k), Hand-drawn (5088).</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: RanDepict (uses CDK, RDKit, Indigo, PIKAChU)</li>
<li><strong>Augmentations</strong>: Rotation, shearing, noise, pixelation, curved arrows, text labels</li>
<li><strong>Format</strong>: Data saved as TFRecord files for TPU training</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting (atoms, brackets, bonds). Added <code>&lt;start&gt;</code>, <code>&lt;end&gt;</code>, and padded with <code>&lt;pad&gt;</code>. <code>&lt;unk&gt;</code> used for unknown tokens.</li>
<li><strong>Markush Token Handling</strong>: To avoid ambiguity, digits following &lsquo;R&rsquo; (e.g., R1) were replaced with unique non-digit characters during training to distinguish them from ring-closure numbers.</li>
<li><strong>Image Augmentation Pipeline</strong>: Custom RanDepict features (v1.1.4) were used to simulate &ldquo;hand-drawn-like&rdquo; styles, following ChemPix&rsquo;s implementation.</li>
</ul>
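<p>The tokenization and R-group replacement steps above can be sketched as follows. The paper does not publish its exact regular expression or the replacement characters, so both are illustrative assumptions:</p>

```python
import re

# Illustrative SMILES token pattern (the paper's exact regex is not given):
# bracket atoms, two-letter elements, organic-subset atoms, bonds, ring digits.
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|[BCNOPSFIbcnops]|[=#$/\\().+-]|%\d{2}|\d)"
)

def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens and add sequence markers."""
    return ["<start>"] + TOKEN_RE.findall(smiles) + ["<end>"]

def encode_r_groups(smiles: str) -> str:
    """Replace the digit after 'R' (e.g. R1) with a unique non-digit
    character so it cannot be confused with a ring-closure number.
    The placeholder alphabet here is hypothetical."""
    placeholders = "abcdefghij"  # stands in for the paper's unique characters
    return re.sub(r"R(\d)", lambda m: "R" + placeholders[int(m.group(1))], smiles)

print(tokenize("C1=CC=CC=C1"))
print(encode_r_groups("[R1]C[R2]"))  # [Rb]C[Rc]
```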
<h3 id="models">Models</h3>
<p>The platform consists of three distinct models:</p>
<ol>
<li>
<p><strong>DECIMER Segmentation</strong>:</p>
<ul>
<li><strong>Architecture</strong>: Mask R-CNN (TensorFlow 2.10.0 implementation)</li>
<li><strong>Purpose</strong>: Detects and cuts chemical structures from full PDF pages</li>
</ul>
</li>
<li>
<p><strong>DECIMER Image Classifier</strong>:</p>
<ul>
<li><strong>Architecture</strong>: EfficientNet-V1-B0</li>
<li><strong>Input</strong>: 224x224 pixels</li>
<li><strong>Training</strong>: Fine-tuned on ~10.9M images (balanced chemical/non-chemical)</li>
<li><strong>Performance</strong>: AUC 0.99 on in-domain test set</li>
</ul>
</li>
<li>
<p><strong>DECIMER Image Transformer (OCSR Engine)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-V2-M (CNN). Input size <strong>512x512</strong>. 52M parameters</li>
<li><strong>Decoder</strong>: Transformer. 4 encoder blocks, 4 decoder blocks, 8 attention heads. d_model=512, d_ff=2048. 59M parameters</li>
<li><strong>Total Params</strong>: ~111 Million</li>
</ul>
</li>
</ol>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Tanimoto Similarity (calculated on PubChem fingerprints of the predicted vs. ground truth SMILES)</li>
<li><strong>Secondary Metrics</strong>: Exact Match (Identity), BLEU score (for string similarity, esp. Markush)</li>
<li><strong>Failure Analysis</strong>: &ldquo;Catastrophic failure&rdquo; defined as Tanimoto similarity of 0 or invalid SMILES</li>
</ul>
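<p>Under this definition, the failure rate can be computed from per-image scores in which an invalid SMILES is recorded as <code>None</code>. A small illustrative helper (not the paper's code):</p>

```python
def catastrophic_failure_rate(scores: list) -> float:
    """Fraction of predictions that fail catastrophically: an
    invalid SMILES (None) or a Tanimoto similarity of exactly 0."""
    failures = sum(1 for s in scores if s is None or s == 0.0)
    return failures / len(scores)

# Five predictions: one invalid SMILES, one with zero similarity
print(catastrophic_failure_rate([0.97, None, 1.0, 0.0, 0.88]))  # 0.4
```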
<h3 id="hardware">Hardware</h3>
<p>Training was performed on Google Cloud TPUs due to the massive dataset size.</p>
<ul>
<li><strong><code>pubchem_1</code>/<code>pubchem_2</code></strong>: Trained on TPU v3-32 pod slice</li>
<li><strong><code>pubchem_3</code> (Final Model)</strong>: Trained on <strong>TPU v3-256</strong> pod slice</li>
<li><strong>Training Time</strong>:
<ul>
<li>Data generation (512x512): ~2 weeks on cluster (20 threads, 36 cores)</li>
<li>Model Training (EffNet-V2-M): <strong>1 day and 7 hours per epoch</strong> on TPU v3-256</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H. O., Agea, M. I., Zielesny, A., &amp; Steinbeck, C. (2023). DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. <em>Nature Communications</em>, 14(1), 5045. <a href="https://doi.org/10.1038/s41467-023-40782-0">https://doi.org/10.1038/s41467-023-40782-0</a></p>
<p><strong>Publication</strong>: Nature Communications 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://decimer.ai">Web Application</a></li>
<li><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer GitHub</a></li>
<li><a href="https://github.com/OBrink/RanDepict">RanDepict GitHub</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMERaiOpenPlatform2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Agea, M. Isabel and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5045}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41467-023-40782-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReco: Hand-Drawn Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</guid><description>A deep learning method using EfficientNet and Transformer to convert hand-drawn chemical structures into SMILES codes, achieving 96.9% accuracy.</description><content:encoded><![CDATA[<h2 id="research-contribution--classification">Research Contribution &amp; Classification</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: The primary contribution is &ldquo;ChemReco,&rdquo; a specific deep learning pipeline (EfficientNet + Transformer) designed to solve the Optical Chemical Structure Recognition (OCSR) task for hand-drawn images. The authors conduct extensive ablation studies on architecture and data mixing ratios to validate performance.</li>
<li><strong>Resource</strong>: The authors explicitly state that &ldquo;the primary focus of this paper is constructing datasets&rdquo; due to the scarcity of hand-drawn molecular data. They introduce a comprehensive synthetic data generation pipeline involving RDKit modifications and image degradation to create training data.</li>
</ul>
<h2 id="motivation-digitizing-hand-drawn-chemical-sketches">Motivation: Digitizing Hand-Drawn Chemical Sketches</h2>
<p>Hand-drawing is the most intuitive method for chemists and students to record molecular structures. However, digitizing these drawings into machine-readable formats (like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) usually requires time-consuming manual entry or specialized software.</p>
<ul>
<li><strong>Gap</strong>: Existing OCSR tools and rule-based methods often fail on hand-drawn sketches due to diverse writing styles, poor image quality, and the absence of labeled data.</li>
<li><strong>Application</strong>: Automated recognition enables efficient chemical research and allows for automatic grading in educational settings.</li>
</ul>
<h2 id="core-innovation-synthetic-pipeline-and-hybrid-architecture">Core Innovation: Synthetic Pipeline and Hybrid Architecture</h2>
<p>The paper introduces <strong>ChemReco</strong>, an end-to-end system for recognizing C-H-O structures. Key novelties include:</p>
<ol>
<li><strong>Synthetic Data Pipeline</strong>: A multi-stage generation method that modifies RDKit source code to randomize bond/angle parameters, followed by OpenCV-based augmentation, degradation, and background addition to simulate realistic hand-drawn artifacts.</li>
<li><strong>Architectural Choice</strong>: The specific application of <strong>EfficientNet</strong> (encoder) combined with a <strong>Transformer</strong> (decoder) for this domain, which the authors demonstrate outperforms the more common ResNet+LSTM baselines.</li>
<li><strong>Hybrid Training Strategy</strong>: Finding that a mix of 90% synthetic and 10% real data yields optimal performance, superior to using either dataset alone.</li>
</ol>
<h2 id="methodology--ablation-studies">Methodology &amp; Ablation Studies</h2>
<p>The authors performed a series of ablation studies and comparisons:</p>
<ul>
<li><strong>Synthesis Ablation</strong>: Evaluated the impact of each step in the generation pipeline (RDKit only $\rightarrow$ Augmentation $\rightarrow$ Degradation $\rightarrow$ Background) on validation loss and accuracy.</li>
<li><strong>Dataset Size Ablation</strong>: Tested model performance when trained on synthetic datasets ranging from 100,000 to 1,000,000 images.</li>
<li><strong>Real/Synthetic Ratio</strong>: Investigated the optimal mixing ratio of synthetic to real hand-drawn images (100:0, 90:10, 50:50, 10:90, 0:100), finding that the 90:10 ratio achieved 93.81% exact match, compared to 63.33% for synthetic-only and 65.83% for real-only.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked four encoder-decoder combinations: ResNet vs. EfficientNet encoders paired with LSTM vs. Transformer decoders.</li>
<li><strong>Baseline Comparison</strong>: Compared results against a related study utilizing a CNN+LSTM framework.</li>
</ul>
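<p>The mixing-ratio experiment can be sketched as follows. Whether the authors sample real images with or without replacement is not stated, so oversampling with replacement (the real set is far smaller than the synthetic one) and the helper names are assumptions:</p>

```python
import random

def mix_datasets(synthetic, real, real_fraction, total, seed=0):
    """Compose a training set whose composition follows a fixed
    synthetic-to-real ratio, e.g. real_fraction=0.10 for 90:10."""
    rng = random.Random(seed)
    n_real = int(total * real_fraction)
    # Real images are oversampled with replacement; synthetic without.
    batch = rng.choices(real, k=n_real) + rng.sample(synthetic, total - n_real)
    rng.shuffle(batch)
    return batch

# Toy "datasets" of file identifiers
synthetic = [f"syn_{i}" for i in range(1000)]
real = [f"real_{i}" for i in range(26)]
train = mix_datasets(synthetic, real, real_fraction=0.10, total=100)
print(sum(1 for x in train if x.startswith("real")))  # 10
```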
<h2 id="results--interpretations">Results &amp; Interpretations</h2>
<ul>
<li><strong>Best Performance</strong>: The EfficientNet + Transformer model trained on a 90:10 synthetic-to-real ratio achieved a <strong>96.90% Exact Match</strong> rate on the test set.</li>
<li><strong>Background Robustness</strong>: When training on synthetic data alone (no real images), the best accuracy on background-free test images was approximately 46% (using RDKit-aug-deg), while background test images reached approximately 53% (using RDKit-aug-bkg-deg). Adding random backgrounds during training helped prevent the model from overfitting to clean white backgrounds.</li>
<li><strong>Data Volume</strong>: Increasing the synthetic dataset size from 100k to 1M consistently improved accuracy (average exact match: 49.40% at 100k, 54.29% at 200k, 61.31% at 500k, 63.33% at 1M, all without real images in training).</li>
<li><strong>Encoder-Decoder Comparison</strong> (at 90:10 mix with 1M images):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Encoder</th>
          <th style="text-align: left">Decoder</th>
          <th style="text-align: left">Avg. Exact Match (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">93.81</td>
      </tr>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left">94.76</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">96.31</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left"><strong>96.90</strong></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Superiority over Baselines</strong>: The model outperformed the cited CNN+LSTM baseline from ChemPix (93% vs 76% on the ChemPix test set).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Restricted atom types</strong>: The system only handles molecules composed of carbon, hydrogen, and oxygen (C-H-O), excluding nitrogen, sulfur, halogens, and other heteroatoms commonly found in organic chemistry.</li>
<li><strong>Structural complexity</strong>: Only structures with at most one ring are supported. Complex multi-ring systems and fused ring structures are not covered.</li>
<li><strong>Dataset availability</strong>: The real hand-drawn dataset (2,598 images) is not publicly released and is only available upon request from the corresponding author.</li>
<li><strong>Future directions</strong>: The authors suggest expanding to more heteroatoms, complex ring structures, and applications in automated grading of chemistry exams.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/a-die/hdr-DeepLearning">hdr-DeepLearning</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation in PyTorch</td>
      </tr>
      <tr>
          <td style="text-align: left">Paper</td>
          <td style="text-align: left">Publication</td>
          <td style="text-align: left">CC-BY-4.0</td>
          <td style="text-align: left">Open access via Nature</td>
      </tr>
  </tbody>
</table>
<p>The real hand-drawn dataset (2,598 images) is available on request from the corresponding author rather than as a public download. The synthetic data generation pipeline is described in detail and relies on modified RDKit source code, which is included in the repository.</p>
<h3 id="data">Data</h3>
<p>The study utilizes a combination of collected SMILES data, real hand-drawn images, and generated synthetic images.</p>
<ul>
<li><strong>Source Data</strong>: SMILES codes collected from PubChem, ZINC, <a href="/notes/chemistry/datasets/gdb-11/">GDB-11</a>, and <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>. Filtered for C, H, O atoms and max 1 ring.</li>
<li><strong>Real Dataset</strong>: 670 selected SMILES codes drawn by multiple volunteers, totaling <strong>2,598 images</strong>.</li>
<li><strong>Synthetic Dataset</strong>: Generated up to <strong>1,000,000 images</strong> using the pipeline below.</li>
<li><strong>Training Mix</strong>: The optimal training set used 1 million images with a <strong>90:10 ratio</strong> of synthetic to real images.</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset Type</th>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Real</strong></td>
          <td style="text-align: left">Volunteer Drawings</td>
          <td style="text-align: left">2,598 images</td>
          <td style="text-align: left">Used for mixed training and testing</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Synthetic</strong></td>
          <td style="text-align: left">Generated</td>
          <td style="text-align: left">100k - 1M</td>
          <td style="text-align: left">Generated via modified RDKit + OpenCV augmentation/degradation; optionally enhanced with Stable Diffusion</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>Synthetic Image Generation Pipeline</strong> is critical for reproduction:</p>
<ol>
<li><strong>RDKit Modification</strong>: Modify source code to introduce random keys, character width, length, and bond angles.</li>
<li><strong>Augmentation (OpenCV)</strong>: Apply sequence: Resize ($p=0.5$), Blur ($p=0.4$), Erode/Dilate ($p=0.2$), Distort ($p=0.8$), Flip ($p=0.5$), Affine ($p=0.7$).</li>
<li><strong>Degradation</strong>: Apply sequence: Salt+pepper noise ($p=0.1$), Contrast ($p=0.7$), Sharpness ($p=0.5$), Invert ($p=0.3$).</li>
<li><strong>Background Addition</strong>: Random backgrounds are augmented (Crop, Distort, Flip) and added to the molecular image to prevent background overfitting.</li>
<li><strong>Diffusion Enhancement</strong>: Stable Diffusion (v1-4) is used for image-to-image enhancement to better simulate hand-drawn styles (prompt: &ldquo;A pencil sketch of [Formula]&hellip; without charge distribution&rdquo;).</li>
</ol>
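<p>Steps 2 and 3 apply each transform independently with a fixed probability. The control flow can be sketched generically; the identity lambdas below are placeholders for the actual OpenCV operations:</p>

```python
import random

def apply_pipeline(image, steps, rng=random):
    """Apply each (name, probability, transform) step independently
    with its configured probability, preserving the given order."""
    applied = []
    for name, p, fn in steps:
        if rng.random() < p:
            image = fn(image)
            applied.append(name)
    return image, applied

identity = lambda img: img  # placeholder for the real OpenCV transforms
augmentation = [
    ("resize", 0.5, identity),
    ("blur", 0.4, identity),
    ("erode_dilate", 0.2, identity),
    ("distort", 0.8, identity),
    ("flip", 0.5, identity),
    ("affine", 0.7, identity),
]

random.seed(0)
_, applied = apply_pipeline("molecule.png", augmentation)
print(applied)  # the subset of steps sampled for this image
```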
<h3 id="models">Models</h3>
<p>The system uses an encoder-decoder architecture:</p>
<ul>
<li><strong>Encoder</strong>: <strong>EfficientNet</strong> (pre-trained on ImageNet). The final classification layer is removed, and the extracted features are passed on as a NumPy array.</li>
<li><strong>Decoder</strong>: <strong>Transformer</strong>. Utilizes self-attention to generate the SMILES sequence. Chosen over LSTM for better handling of long-range dependencies.</li>
<li><strong>Output</strong>: Canonical SMILES string.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: <strong>Exact Match (EM)</strong>. A strict binary evaluation checking whether the complete generated SMILES perfectly replicates the target string.</li>
<li><strong>Other Metrics</strong>: <strong>Levenshtein Distance</strong> measures edit-level character proximity, while the <strong>Tanimoto coefficient</strong> evaluates structural similarity based on chemical fingerprints. Both were monitored during validation ablation runs.</li>
</ul>
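<p>Exact Match is a plain string comparison; the Levenshtein distance that complements it follows the standard dynamic-programming recurrence. A self-contained sketch:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

# One substitution separates these SMILES strings; Exact Match would score 0
print(levenshtein("CCO", "CCN"))  # 1
```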
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Baseline (CNN+LSTM)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Exact Match</strong></td>
          <td style="text-align: left"><strong>96.90%</strong></td>
          <td style="text-align: left">76%</td>
          <td style="text-align: left">Tested on the provided test set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>CPU</strong>: Intel(R) Xeon(R) Gold 6130 (40 GB RAM).</li>
<li><strong>GPU</strong>: NVIDIA Tesla V100 (32 GB video memory).</li>
<li><strong>Framework</strong>: PyTorch 1.9.1.</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Optimizer: Adam (learning rate 1e-4).</li>
<li>Batch size: 32.</li>
<li>Epochs: 100.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, H., Liu, W., Tao, J., et al. (2024). ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning. <em>Scientific Reports</em>, 14, 17126. <a href="https://doi.org/10.1038/s41598-024-67496-7">https://doi.org/10.1038/s41598-024-67496-7</a></p>
<p><strong>Publication</strong>: Scientific Reports 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/a-die/hdr-DeepLearning">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ouyangChemRecoAutomatedRecognition2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemReco: Automated Recognition of Hand-Drawn Carbon--Hydrogen--Oxygen Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Hengjie and Liu, Wei and Tao, Jiajun and Luo, Yanghong and Zhang, Wanjia and Zhou, Jiayu and Geng, Shuqi and Zhang, Chengpeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{17126}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-024-67496-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Benchmarking Eight OCSR Tools on Patent Images (2024)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/krasnov-ocsr-benchmark-2024/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/krasnov-ocsr-benchmark-2024/</guid><description>Benchmark of 8 open-access OCSR methods on 2702 manually curated patent images, with ChemIC classifier for hybrid approach.</description><content:encoded><![CDATA[<h2 id="contribution-benchmarking-general-and-specialized-ocsr-tools">Contribution: Benchmarking General and Specialized OCSR Tools</h2>
<p>This paper is primarily a <strong>Resource</strong> contribution ($0.7 \Psi_{\text{Resource}}$) with a secondary <strong>Method</strong> component ($0.3 \Psi_{\text{Method}}$).</p>
<p>It establishes a new, independent benchmark dataset of 2,702 manually selected patent images to evaluate existing Optical Chemical Structure Recognition (OCSR) tools. The authors rigorously compare 8 different methods using this dataset to determine the state-of-the-art. The Resource contribution is evidenced by the creation of this curated benchmark, explicit evaluation metrics (exact connectivity table matching), and public release of datasets, processing scripts, and evaluation tools on Zenodo.</p>
<p>The secondary Method contribution comes through the development of &ldquo;ChemIC,&rdquo; a ResNet-50 image classifier designed to categorize images (Single vs. Multiple vs. Reaction) to enable a modular processing pipeline. However, this method serves to support the insights gained from the benchmarking resource.</p>
<h2 id="motivation-the-need-for-realistic-modality-diverse-patent-benchmarks">Motivation: The Need for Realistic, Modality-Diverse Patent Benchmarks</h2>
<p><strong>Lack of Standardization</strong>: A universally accepted standard set of images for OCSR quality measurement is currently missing; existing tools are often evaluated on synthetic data or limited datasets.</p>
<p><strong>Industrial Relevance</strong>: Patents contain diverse and &ldquo;noisy&rdquo; image modalities (Markush structures, salts, reactions, hand-drawn styles) that are critical for Freedom to Operate (FTO) and novelty checks in the pharmaceutical industry. These real-world complexities are often missing from existing benchmarks.</p>
<p><strong>Modality Gaps</strong>: Different tools excel at different tasks (e.g., single molecules vs. reactions). Monolithic approaches frequently break down on complex patent documents, and there was minimal systematic understanding of which tools perform best for which image types.</p>
<p><strong>Integration Needs</strong>: The authors aimed to identify tools to replace or augment their existing rule-based system (OSRA) within the SciWalker application, requiring a rigorous comparative study.</p>
<h2 id="core-innovation-a-curated-multi-modality-dataset-and-hybrid-classification-pipeline">Core Innovation: A Curated Multi-Modality Dataset and Hybrid Classification Pipeline</h2>
<p><strong>Independent Benchmark</strong>: Creation of a manually curated test set of 2,702 images from real-world patents (WO, EP, US), specifically selected to include &ldquo;problematic&rdquo; edge cases like inorganic complexes, peptides, and Markush structures, providing a more realistic evaluation environment than synthetic datasets.</p>
<p><strong>Comprehensive Comparison</strong>: Side-by-side evaluation of 8 open-access tools: DECIMER, ReactionDataExtractor, MolScribe, RxnScribe, SwinOCSR, OCMR, MolVec, and OSRA, using identical test conditions and evaluation criteria.</p>
<p><strong>ChemIC Classifier</strong>: Implementation of a specialized image classifier (ResNet-50) to distinguish between single molecules, multiple molecules, reactions, and non-chemical images, facilitating a &ldquo;hybrid&rdquo; pipeline that routes images to the most appropriate tool.</p>
<p><strong>Strict Evaluation Logic</strong>: Utilization of an exact match criterion for connectivity tables (ignoring partial similarity scores like Tanimoto) to reflect rigorous industrial requirements for novelty checking in patent applications.</p>
<h2 id="methodology-exact-match-evaluation-across-eight-open-source-systems">Methodology: Exact-Match Evaluation Across Eight Open-Source Systems</h2>
<p><strong>Tool Selection</strong>: Installed and tested 8 tools: DECIMER v2.4.0, ReactionDataExtractor v2.0.0, MolScribe v1.1.1, RxnScribe v1.0, MolVec v0.9.8, OCMR, SwinOCSR, and OSRA v2.1.5.</p>
<p><strong>Dataset Construction</strong>:</p>
<ul>
<li><strong>Test Set</strong>: 2,702 patent images split into three &ldquo;buckets&rdquo;: A (Single structure - 1,454 images), B (Multiple structures - 661 images), C (Reactions - 481 images).</li>
<li><strong>Training Set (for ChemIC)</strong>: 16,000 images from various sources (Patents, Im2Latex, etc.) split into 12,804 training, 1,604 validation, and 1,604 test images.</li>
</ul>
<p><strong>Evaluation Protocol</strong>:</p>
<ul>
<li>Calculated Precision, Recall, and F1 scores based on an <em>exact connectivity table structure matching</em> (rejecting Tanimoto similarity as industrially insufficient). The metrics follow standard formulations where true positives ($\text{TP}$) represent perfectly assembled structures:
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \qquad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \qquad \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$</li>
<li>Manual inspection by four chemists to verify predictions.</li>
<li>Developed custom tools (<code>ImageComparator</code> and <code>ExcelConstructor</code>) to facilitate visual comparison and result aggregation.</li>
</ul>
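<p>With this protocol, each prediction either reproduces the connectivity table exactly (a true positive) or counts against the tool. A minimal sketch of the metric computation (how FP and FN are tallied per bucket is the paper's procedure, not reproduced here):</p>

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from exact-match counts, where a
    true positive is a perfectly reconstructed connectivity table."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy counts: 80 exact matches, 10 wrong predictions, 10 missed structures
p, r, f = prf1(80, 10, 10)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.89 0.89 0.89
```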
<p><strong>Segmentation Test</strong>: Applied DECIMER segmentation to multi-structure images to see if splitting them before processing improved results, combining segmentation with MolScribe for final predictions.</p>
<h2 id="key-findings-modality-specialization-outperforms-monolithic-approaches">Key Findings: Modality Specialization Outperforms Monolithic Approaches</h2>
<p><strong>Single Molecules</strong>: <strong>MolScribe</strong> achieved the highest performance (Precision: 87%, F1: 93%), followed closely by <strong>DECIMER</strong> (Precision: 84%, F1: 91%). These transformer-based approaches outperformed rule-based methods on single-structure images (e.g., MolScribe F1: 93% vs. OSRA F1: 78%).</p>
<p><strong>Reactions</strong>: Evaluated on 103 randomly selected reaction images containing 284 total reactions, <strong>RxnScribe</strong> outperformed others (Recall: 97%, F1: 86%), demonstrating the value of specialized architectures for reaction diagrams. General-purpose tools struggled with reaction recognition.</p>
<p><strong>Multiple Structures</strong>: Evaluated on 20 multi-structure images containing 146 single structures, all AI-based tools struggled. <strong>OSRA</strong> (rule-based) performed best here but still had low precision (58%). Combining DECIMER segmentation (with the <code>expand</code> option) with MolScribe on these same 20 images improved precision to 82% and F1 to 90%, showing that image segmentation as a preprocessing step can boost multi-structure performance.</p>
<p><strong>Failures</strong>: Current tools fail on polymers, large oligomers, and complex Markush structures. Most tools (except MolVec) correctly recognize cis-trans and tetrahedral stereochemistry, but other forms (e.g., octahedral, axial, helical) are not recognized. None of the evaluated tools can reliably recognize dative/coordinate bonds in metal complexes, indicating gaps in training data coverage.</p>
<p><strong>Classifier Utility</strong>: The ChemIC model achieved 99.62% accuracy on the test set, validating the feasibility of a modular pipeline where images are routed to the specific tool best suited for that modality. The authors estimate that a hybrid system (MolScribe + OSRA + RxnScribe) routed by ChemIC would achieve an average F1 of 80%, compared to 68% for OSRA alone across all modalities.</p>
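<p>The routing idea behind the estimated hybrid system can be sketched as a simple dispatch table; <code>classify</code> stands in for the ChemIC ResNet-50 classifier, and the tool assignments follow the per-modality winners reported above:</p>

```python
# Dispatch table mapping a ChemIC class label to the best-performing tool
# for that modality (per the paper's per-modality results); `classify` is
# a stand-in for the ResNet-50 classifier.
ROUTES = {
    "single_structure": "MolScribe",
    "multi_structure": "OSRA",
    "reaction": "RxnScribe",
    "no_structure": None,  # non-chemical images: skip OCSR entirely
}

def route(image, classify):
    """Return the name of the tool to run on `image`, or None to skip."""
    return ROUTES[classify(image)]

# Stub classifier for illustration.
tool = route("page_042.png", classify=lambda img: "reaction")
```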
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Benchmark (Test)</strong></td>
          <td>Manual Patent Selection</td>
          <td>2,702 Images</td>
          <td>Sources: WO, EP, US patents<br><strong>Bucket A</strong>: Single structures (1,454)<br><strong>Bucket B</strong>: Multi-structures (661)<br><strong>Bucket C</strong>: Reactions (481)</td>
      </tr>
      <tr>
          <td><strong>ChemIC Training</strong></td>
          <td>Aggregated Sources</td>
          <td>16,000 Images</td>
          <td>Sources: Patents (OntoChem), MolScribe dataset, DECIMER dataset, RxnScribe dataset, Im2Latex-100k<br><strong>Split</strong>: 12,804 Train / 1,604 Val / 1,604 Test</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Scoring Logic</strong>:</p>
<ul>
<li><strong>Single Molecules</strong>: Score = 1 if the connectivity table matches exactly (all atoms, valencies, bonds, superatom abbreviations, and charges correct), 0 otherwise. Stereochemistry correctness was not a scoring criterion. Tanimoto similarity was explicitly rejected as too lenient.</li>
<li><strong>Reactions</strong>: Considered correct if at least one reactant and one product are correct and capture the main features. Stoichiometry and conditions were ignored.</li>
</ul>
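<p>The single-molecule scoring rule can be sketched as a direct connectivity-table comparison. The table representation below is a simplified stand-in for a MOL-style connectivity table, and a shared canonical atom ordering is assumed (a full comparison would need graph isomorphism to handle atom renumbering):</p>

```python
def tables_match(pred, truth):
    """Score 1 iff the atom list (element, charge) and the bond list agree
    exactly; stereochemistry is deliberately ignored, matching the paper's
    scoring rule. Assumes both tables use the same canonical atom order."""
    same_atoms = pred["atoms"] == truth["atoms"]
    norm = lambda bonds: sorted((min(a, b), max(a, b), order)
                                for a, b, order in bonds)
    same_bonds = norm(pred["bonds"]) == norm(truth["bonds"])
    return int(same_atoms and same_bonds)

# Ethanol as a toy connectivity table: atoms indexed 0..2, bonds given as
# (atom_i, atom_j, bond_order) triples.
ethanol = {"atoms": [("C", 0), ("C", 0), ("O", 0)],
           "bonds": [(0, 1, 1), (1, 2, 1)]}
reordered = {"atoms": [("C", 0), ("C", 0), ("O", 0)],
             "bonds": [(2, 1, 1), (1, 0, 1)]}  # same bonds, listed differently
wrong = {"atoms": [("C", 0), ("C", 0), ("O", 0)],
         "bonds": [(0, 1, 1), (1, 2, 2)]}      # C=O instead of C-O
```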
<p><strong>Image Segmentation</strong>: Used DECIMER segmentation (with <code>expand</code> option) to split multi-structure images into single images before passing to MolScribe.</p>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Version</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER</strong></td>
          <td>v2.4.0</td>
          <td>EfficientNet-V2-M encoder + Transformer decoder</td>
      </tr>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>v1.1.1</td>
          <td>Swin Transformer encoder + Transformer decoder</td>
      </tr>
      <tr>
          <td><strong>RxnScribe</strong></td>
          <td>v1.0</td>
          <td>Specialized for reaction diagrams</td>
      </tr>
      <tr>
          <td><strong>ReactionDataExtractor</strong></td>
          <td>v2.0.0</td>
          <td>Deep learning-based extraction</td>
      </tr>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>v0.9.8</td>
          <td>Rule-based vectorization</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>v2.1.5</td>
          <td>Rule-based recognition</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>-</td>
          <td>Swin Transformer encoder-decoder</td>
      </tr>
      <tr>
          <td><strong>OCMR</strong></td>
          <td>-</td>
          <td>CNN-based framework</td>
      </tr>
      <tr>
          <td><strong>ChemIC (New)</strong></td>
          <td>-</td>
          <td>ResNet-50 CNN in PyTorch for 4-class classification</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Results on Single Structures (Bucket A - 400 random sample):</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1 Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>87%</td>
          <td>100%</td>
          <td>93%</td>
      </tr>
      <tr>
          <td><strong>DECIMER</strong></td>
          <td>84%</td>
          <td>100%</td>
          <td>91%</td>
      </tr>
      <tr>
          <td><strong>OCMR</strong></td>
          <td>77%</td>
          <td>100%</td>
          <td>87%</td>
      </tr>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>74%</td>
          <td>100%</td>
          <td>85%</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>64%</td>
          <td>100%</td>
          <td>78%</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>65%</td>
          <td>95%</td>
          <td>77%</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Results on Reactions (Bucket C):</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1 Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RxnScribe</strong></td>
          <td>77%</td>
          <td>97%</td>
          <td>86%</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>64%</td>
          <td>65%</td>
          <td>64%</td>
      </tr>
      <tr>
          <td><strong>ReactionDataExtractor</strong></td>
          <td>49%</td>
          <td>62%</td>
          <td>55%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p><strong>ChemIC Training</strong>: Trained on a machine with 40 Intel(R) Xeon(R) Gold 6226 CPUs. Training time approximately 6 hours for 100 epochs (early stopping at epoch 26).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10546827">Zenodo Repository (Code &amp; Data)</a></td>
          <td>Code, Dataset</td>
          <td>Unknown</td>
          <td>Benchmark images, processing scripts, evaluation tools, ChemIC classifier code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ontochem/ImageComparator">ImageComparator</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Java tool for visual comparison of OCSR predictions</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krasnov, A., Barnabas, S. J., Boehme, T., Boyer, S. K., &amp; Weber, L. (2024). Comparing software tools for optical chemical structure recognition. <em>Digital Discovery</em>, 3(4), 681-693. <a href="https://doi.org/10.1039/D3DD00228D">https://doi.org/10.1039/D3DD00228D</a></p>
<p><strong>Publication</strong>: Digital Discovery 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.5281/zenodo.10546827">Zenodo Repository (Code &amp; Data)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krasnovComparingSoftwareTools2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Comparing Software Tools for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krasnov, Aleksei and Barnabas, Shadrack J. and Boehme, Timo and Boyer, Stephen K. and Weber, Lutz}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{681--693}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D3DD00228D}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AtomLenz: Atom-Level OCSR with Limited Supervision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</guid><description>Weakly supervised OCSR framework combining object detection and graph construction to recognize chemical structures from hand-drawn images using SMILES.</description><content:encoded><![CDATA[<h2 id="dual-contribution-method-and-data-resource">Dual Contribution: Method and Data Resource</h2>
<p>The paper proposes an architecture (AtomLenz) and training framework (ProbKT* + Edit-Correction) to solve the problem of Optical Chemical Structure Recognition (OCSR) in data-sparse domains. It also releases a curated, relabeled dataset of hand-drawn molecules with atom-level bounding box annotations.</p>
<h2 id="overcoming-annotation-bottlenecks-in-ocsr">Overcoming Annotation Bottlenecks in OCSR</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for digitizing chemical literature and lab notes. However, existing methods face three main limitations:</p>
<ol>
<li><strong>Generalization Limits:</strong> They struggle with sparse or stylistically unique domains, such as hand-drawn images, where massive datasets for pretraining are unavailable.</li>
<li><strong>Annotation Cost:</strong> &ldquo;Atom-level&rdquo; methods (which detect individual atoms and bonds) require expensive bounding box annotations, which are rarely available for real-world sketch data.</li>
<li><strong>Lack of Interpretability/Localization:</strong> Pure &ldquo;Image-to-SMILES&rdquo; models (like DECIMER) work well but fail to localize the atoms or bonds in the original image, limiting human-in-the-loop review and mechanistic interpretability.</li>
</ol>
<h2 id="atomlenz-probkt-and-graph-edit-correction">AtomLenz, ProbKT*, and Graph Edit-Correction</h2>
<p>The core contribution is <strong>AtomLenz</strong>, an OCSR framework that achieves atom-level entity detection using <strong>only SMILES supervision</strong> on target domains. The authors construct an explicit object detection pipeline using Faster R-CNN trained via a composite multi-task loss. The objective aims to optimize a multi-class log loss $L_{cls}$ for predicted class $\hat{c}$ and a regression loss $L_{reg}$ for predicted bounding box coordinates $\hat{b}$:</p>
<p>$$ \mathcal{L} = L_{cls}(c, \hat{c}) + L_{reg}(b, \hat{b}) $$</p>
<p>To bridge the gap between image inputs and the weakly supervised SMILES labels, the system leverages:</p>
<ul>
<li><strong>ProbKT* (Probabilistic Knowledge Transfer):</strong> Uses probabilistic logic and Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; derived from the SMILES strings, enabling backpropagation without explicit bounding boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Generates pseudo-labels by solving an optimization problem that finds the smallest edit on the predicted graph such that the corrected graph and the ground-truth SMILES graph become isomorphic, which forces fine-tuning on less frequent atom types. The combination of ProbKT* and Edit-Correction is abbreviated as <strong>EditKT*</strong>.</li>
<li><strong>ChemExpert:</strong> A chemically sound ensemble strategy that cascades predictions from multiple models (e.g., passing through DECIMER, then AtomLenz), halting at the first output that clears basic RDKit chemical validity checks.</li>
</ul>
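<p>The ChemExpert cascade can be sketched as a priority loop over models; the RDKit validity check is stubbed behind an <code>is_valid</code> callable, and the toy check below is purely illustrative, not RDKit's:</p>

```python
def chem_expert(image, models, is_valid):
    """Run models in priority order and return the first prediction that
    passes the validity check (in the paper, basic RDKit sanity checks,
    stubbed here behind `is_valid`). Returns None if every model fails."""
    for model in models:
        smiles = model(image)
        if smiles is not None and is_valid(smiles):
            return smiles
    return None

# Illustration: the first model emits an invalid string, the second a
# valid one, so the cascade falls through to the second model.
decimer = lambda img: "C1CC"   # unclosed ring -> invalid
atomlenz = lambda img: "CCO"   # ethanol -> valid
# Toy validity check (unpaired ring-closure digit), NOT a real SMILES parser.
valid = lambda s: not ("1" in s and s.count("1") % 2)
result = chem_expert("sketch.png", [decimer, atomlenz], valid)
```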
<h2 id="data-efficiency-and-domain-adaptation-experiments">Data Efficiency and Domain Adaptation Experiments</h2>
<p>The authors evaluated the model specifically on domain adaptation and sample efficiency, treating hand-drawn molecules as the primary low-data target distribution:</p>
<ul>
<li><strong>Pretraining:</strong> Initially trained on ~214k synthetic images from ChEMBL explicitly labeled with bounding boxes (generated via RDKit).</li>
<li><strong>Target Domain Adaptation:</strong> Fine-tuned on the Brinkhaus hand-drawn dataset (4,070 images) using purely SMILES supervision.</li>
<li><strong>Evaluation Sets:</strong>
<ul>
<li><strong>Hand-drawn test set</strong>: 1,018 images.</li>
<li><strong>ChemPix</strong>: 614 out-of-domain hand-drawn images.</li>
<li><strong>Atom Localization set</strong>: 1,000 synthetic images to evaluate precise bounding box capabilities.</li>
</ul>
</li>
<li><strong>Baselines:</strong> Compared against leading OCSR methods, including DECIMER (v2.2.0), Img2Mol, MolScribe, ChemGrapher, and OSRA.</li>
</ul>
<h2 id="state-of-the-art-ensembles-vs-standalone-limitations">State-of-the-Art Ensembles vs. Standalone Limitations</h2>
<ul>
<li><strong>SOTA Ensemble Performance:</strong> The <strong>ChemExpert</strong> module (combining AtomLenz and DECIMER) achieved state-of-the-art accuracy on both hand-drawn (63.5%) and ChemPix (51.8%) test sets.</li>
<li><strong>Data Efficiency under Bottleneck Regimes:</strong> AtomLenz effectively bypassed the massive data constraints of competing models. When all methods were retrained from scratch on the same 4,070-sample hand-drawn training set (enriched with atom-level annotations from EditKT*), AtomLenz achieved 33.8% exact accuracy, outperforming baselines like Img2Mol (0.0%), MolScribe (1.3%), and DECIMER (0.1%), illustrating its sample efficiency.</li>
<li><strong>Localization Success:</strong> The base framework achieved strong localization (mAP 0.801), a capability not provided by end-to-end transformers like DECIMER.</li>
<li><strong>Methodological Tradeoffs:</strong> While AtomLenz is highly sample efficient, its standalone performance when fine-tuned on the target domain (33.8% accuracy) underperforms fine-tuned models trained on larger datasets like DECIMER (62.2% accuracy). AtomLenz achieves state-of-the-art results primarily when deployed as part of the ChemExpert ensemble alongside DECIMER, since errors from the two approaches tend to occur on different samples, allowing them to complement each other.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz">Official Repository (AtomLenz)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Complete pipeline for AtomLenz, ProbKT*, and Graph Edit-Correction.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz/tree/main/models">Pre-trained Models</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Downloadable weights for Faster R-CNN detection backbones.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset (Brinkhaus)</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Images and SMILES used for target domain fine-tuning and evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599172">Relabeled Hand-drawn Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">1,417 images with bounding box annotations generated via EditKT*.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/spaces/moldenhof/atomlenz">AtomLenz Web Demo</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Interactive Hugging Face space for testing model inference.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilizes a mix of large synthetic datasets and smaller curated hand-drawn datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>Synthetic ChEMBL</td>
          <td>~214,000</td>
          <td>Generated via RDKit/Indigo. Annotated with atoms, bonds, charges, stereocenters.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-drawn (Brinkhaus)</td>
          <td>4,070</td>
          <td>Used for weakly supervised adaptation (SMILES only).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Hand-drawn Test</td>
          <td>1,018</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChemPix</td>
          <td>614</td>
          <td>Out-of-distribution hand-drawn images.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Atom Localization</td>
          <td>1,000</td>
          <td>Synthetic images with ground truth bounding boxes.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Molecular Graph Constructor (Algorithm 1):</strong> A rule-based system to assemble the graph from detected objects:
<ol>
<li><strong>Filtering:</strong> Removes overlapping atom boxes (IoU threshold).</li>
<li><strong>Node Creation:</strong> Merges overlapping charge and stereocenter objects with their corresponding atom objects.</li>
<li><strong>Edge Creation:</strong> Iterates over bond objects; if a bond overlaps with exactly two atoms, an edge is added. If &gt;2, it selects the most probable pair.</li>
<li><strong>Validation:</strong> Checks valency constraints; removes bonds iteratively if constraints are violated.</li>
</ol>
</li>
<li><strong>Weakly Supervised Training:</strong>
<ul>
<li><strong>ProbKT*:</strong> Uses Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; implied by the SMILES string, allowing backpropagation without explicit boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Finds the smallest edit on the predicted graph such that the corrected and true SMILES graphs become isomorphic, then uses the correction to generate pseudo-labels for retraining.</li>
</ul>
</li>
</ul>
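<p>The alignment step in ProbKT* can be sketched as a minimum-cost assignment between detected objects and SMILES-derived targets. The brute-force search below is a stdlib-only stand-in for the Hungarian algorithm (in practice one would use <code>scipy.optimize.linear_sum_assignment</code>), and the cost matrix is a toy example:</p>

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one matching between predictions (rows) and
    SMILES-derived targets (columns). Brute force over permutations --
    fine for toy sizes; ProbKT-style training would use the Hungarian
    algorithm instead."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(enumerate(best))

# Toy cost matrix: entry [i][j] is 0 if predicted object i matches target
# atom j (e.g. same element type), 1 otherwise.
cost = [[0, 1, 1],
        [1, 1, 0],
        [1, 0, 1]]
pairs = best_assignment(cost)
```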
<h3 id="models">Models</h3>
<ul>
<li><strong>Object Detection Backbone:</strong> <strong>Faster R-CNN</strong>.
<ul>
<li>Four distinct models are trained for different entity types: Atoms ($O^a$), Bonds ($O^b$), Charges ($O^c$), and Stereocenters ($O^s$).</li>
<li><strong>Loss Function:</strong> Multi-task loss combining Multi-class Log Loss ($L_{cls}$) and Regression Loss ($L_{reg}$).</li>
</ul>
</li>
<li><strong>ChemExpert:</strong> An ensemble wrapper that prioritizes models based on user preference (e.g., DECIMER first, then AtomLenz). It accepts the first prediction that passes RDKit chemical validity checks.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Primary metrics focused on structural correctness and localization accuracy.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Hand-drawn)</th>
          <th>Baseline (DECIMER FT)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (T=1)</strong></td>
          <td>33.8% (AtomLenz+EditKT*)</td>
          <td>62.2%</td>
          <td>Exact ECFP6 fingerprint match.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto Sim.</strong></td>
          <td>0.484</td>
          <td>0.727</td>
          <td>Average similarity.</td>
      </tr>
      <tr>
          <td><strong>mAP</strong></td>
          <td>0.801</td>
          <td>N/A</td>
          <td>Localization accuracy (IoU 0.05-0.35).</td>
      </tr>
      <tr>
          <td><strong>Ensemble Acc.</strong></td>
          <td><strong>63.5%</strong></td>
          <td>62.2%</td>
          <td>ChemExpert (DECIMER + AtomLenz).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Experiments utilized the Flemish Supercomputer Center (VSC) resources.</li>
<li><strong>Note:</strong> Specific GPU models (e.g., A100/V100) are not explicitly detailed in the text, but Faster R-CNN training is standard on consumer or enterprise GPUs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., De Brouwer, E., Arany, Á., &amp; Moreau, Y. (2024). Atom-Level Optical Chemical Structure Recognition with Limited Supervision. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2024.</p>
<p><strong>Publication venue/year</strong>: CVPR 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molden/atomlenz">Official Repository</a></li>
<li><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset on Figshare</a></li>
</ul>
<p><strong>BibTeX</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{oldenhofAtomLevelOpticalChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Atom-Level Optical Chemical Structure Recognition with Limited Supervision}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Oldenhof, Martijn and De Brouwer, Edward and Arany, {\&#39;A}d{\&#39;a}m and Moreau, Yves}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2404.01743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SwinOCSR: End-to-End Chemical OCR with Swin Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</guid><description>Deep learning model using Swin Transformer and Focal Loss for OCSR, achieving 98.58% accuracy on synthetic benchmarks.</description><content:encoded><![CDATA[<h2 id="contribution-methodological-architecture-and-datasets">Contribution: Methodological Architecture and Datasets</h2>
<p>This is a <strong>Methodological Paper</strong> with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).</li>
<li><strong>Resource</strong>: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.</li>
</ul>
<h2 id="motivation-addressing-visual-context-and-data-imbalance">Motivation: Addressing Visual Context and Data Imbalance</h2>
<ul>
<li><strong>Problem</strong>: OCSR (converting images of chemical structures to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.</li>
<li><strong>Technical Gap</strong>: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and miss global dependencies required for interpreting complex molecular diagrams.</li>
<li><strong>Data Imbalance</strong>: Chemical strings suffer from severe class imbalance (e.g., &lsquo;C&rsquo; and &lsquo;H&rsquo; are frequent; &lsquo;Br&rsquo; or &lsquo;Cl&rsquo; are rare), which causes standard Cross Entropy loss to underperform.</li>
</ul>
<h2 id="core-innovation-swin-transformers-and-focal-loss">Core Innovation: Swin Transformers and Focal Loss</h2>
<ul>
<li><strong>Swin Transformer Backbone</strong>: SwinOCSR replaces the standard CNN backbone with a <strong>Swin Transformer</strong>, using shifted window attention to capture both local and global image features more effectively.</li>
<li><strong>Multi-label Focal Loss (MFL)</strong>: The paper introduces a modified Focal Loss for OCSR, which the authors describe as the first explicit attempt to address token imbalance in this task. It penalizes the model for errors on rare tokens, addressing the &ldquo;long-tail&rdquo; distribution of chemical elements. The standard Focal Loss formulation heavily weights hard-to-classify examples:
$$ FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) $$</li>
<li><strong>Structured Synthetic Dataset</strong>: Creation of a dataset explicitly balanced across four structural categories: Kekule rings, Aromatic rings, and their combinations with substituents.</li>
</ul>
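<p>The per-token Focal Loss above is straightforward to sketch directly; the $\alpha_t$ and $\gamma$ defaults below are common choices from the Focal Loss literature, not values reported in the paper:</p>

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t). For a confidently
    correct token (p_t near 1) the modulating factor (1 - p_t)**gamma
    shrinks the loss, so rare, hard tokens dominate the gradient."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95)  # frequent token, confidently right
hard = focal_loss(0.30)  # rare token, mostly wrong
```

With <code>gamma=0</code> the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which is the sense in which Focal Loss generalizes CE.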
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Backbone Comparison</strong>: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).</li>
<li><strong>Loss Function Ablation</strong>: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).</li>
<li><strong>Category Stress Test</strong>: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.</li>
<li><strong>Real-world Evaluation</strong>: The model was tested on 100 images manually extracted from the literature (with manually labeled SMILES), and separately on 100 CDK-generated images from those same SMILES, to measure the domain gap between synthetic and real-world data.</li>
</ul>
<h2 id="results-and-limitations">Results and Limitations</h2>
<ul>
<li><strong>Synthetic test set performance</strong>: With Multi-label Focal Loss (MFL), SwinOCSR achieved <strong>98.58% accuracy</strong> on the synthetic test set, compared to 97.36% with standard CE loss. Both ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones scored lower when using CE loss (Table 3).</li>
<li><strong>Handling of long sequences</strong>: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating effective global feature extraction.</li>
<li><strong>Per-category results</strong>: Performance was consistent across molecule categories: Category 1 (Kekule, 98.20%), Category 2 (Aromatic, 98.46%), Category 3 (Kekule + Substituents, 98.76%), Category 4 (Aromatic + Substituents, 98.89%). The model performed slightly better on molecules with substituents and aromatic rings.</li>
<li><strong>Domain shift</strong>: While performance on synthetic data was strong, accuracy dropped to <strong>25%</strong> on 100 real-world literature images. On 100 CDK-generated images from the same SMILES strings, accuracy was 94%, confirming that the gap stems from stylistic differences between CDK-rendered and real-world images. The authors attribute this to noise, low resolution, and variations such as condensed structural formulas and abbreviations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The first 8.5 million structures from <strong>PubChem</strong> were downloaded, yielding ~6.9 million unique SMILES.</li>
<li><strong>Generation Pipeline</strong>:
<ul>
<li><strong>Tools</strong>: <strong>CDK</strong> (Chemistry Development Kit) for image rendering; <strong>RDKit</strong> for SMILES canonicalization.</li>
<li><strong>Augmentation</strong>: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekule, (2) Aromatic, (3) Kekule + Substituents, (4) Aromatic + Substituents. Substituents were randomly added from a list of 224 common patent substituents.</li>
<li><strong>Preprocessing</strong>: Images rendered as binary, resized to <strong>224x224</strong>, and copied to 3 channels (RGB simulation).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>4,500,000</td>
          <td>18:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
      <tr>
          <td>Test</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Loss Function</strong>: <strong>Multi-label Focal Loss (MFL)</strong>. The single-label classification task was cast as multi-label to apply Focal Loss, using a sigmoid activation on logits.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Optimizer</strong>: <strong>Adam</strong> with initial learning rate <code>5e-4</code>.</li>
<li><strong>Schedulers</strong>: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.</li>
<li><strong>Regularization</strong>: Dropout rate of <code>0.1</code>.</li>
</ul>
</li>
</ul>
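<p>A minimal sketch of the multi-label focal loss idea: each logit gets a sigmoid, and easy examples are down-weighted. The $\gamma = 2$, $\alpha = 0.25$ defaults follow the original focal-loss paper and are an assumption here, not values restated from SwinOCSR:</p>

```python
import numpy as np

def multilabel_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss applied per class with a sigmoid, in the spirit of MFL.

    `targets` is a 0/1 array of the same shape as `logits`.
    """
    p = 1.0 / (1.0 + np.exp(-logits))           # sigmoid on each logit
    pt = np.where(targets == 1, p, 1.0 - p)     # probability of the true label
    at = np.where(targets == 1, alpha, 1.0 - alpha)
    # Down-weight easy examples by (1 - pt)^gamma before the log term.
    return float(np.mean(-at * (1.0 - pt) ** gamma * np.log(pt + 1e-12)))
```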
<h3 id="models">Models</h3>
<ul>
<li><strong>Backbone (Encoder 1)</strong>: <strong>Swin Transformer</strong>.
<ul>
<li>Patch size: $4 \times 4$.</li>
<li>Linear embedding dimension: 192.</li>
<li>Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).</li>
<li>Output: Flattened patch sequence $S_b$.</li>
</ul>
</li>
<li><strong>Transformer Encoder (Encoder 2)</strong>: 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.</li>
<li><strong>Transformer Decoder</strong>: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).</li>
<li><strong>Tokenization</strong>: <strong>DeepSMILES</strong> format used (syntactically more robust than SMILES). Vocabulary size: <strong>76 tokens</strong> (the unique characters found in the dataset). Embedding dimension: 256.</li>
</ul>
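<p>A character-level vocabulary of this kind can be sketched as follows; the special-token names (<code>&lt;pad&gt;</code>, <code>&lt;sos&gt;</code>, <code>&lt;eos&gt;</code>) are illustrative assumptions, not taken from the paper:</p>

```python
def build_vocab(deepsmiles_strings):
    """Build a character-level vocabulary with special tokens.

    SwinOCSR reports 76 unique characters in its dataset; here the
    vocabulary is simply whatever characters appear in the input.
    """
    chars = sorted({ch for s in deepsmiles_strings for ch in s})
    tokens = ["<pad>", "<sos>", "<eos>"] + chars
    return {tok: i for i, tok in enumerate(tokens)}

def encode(s, vocab):
    # Wrap each string in start/end markers for the decoder.
    return [vocab["<sos>"]] + [vocab[c] for c in s] + [vocab["<eos>"]]
```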
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy (Exact Match), Tanimoto Similarity (PubChem fingerprints), BLEU, ROUGE.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SwinOCSR (CE)</th>
          <th>SwinOCSR (MFL)</th>
          <th>ResNet-50 (CE)</th>
          <th>EfficientNet-B3 (CE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>97.36%</td>
          <td><strong>98.58%</strong></td>
          <td>89.17%</td>
          <td>86.70%</td>
      </tr>
      <tr>
          <td>Tanimoto</td>
          <td>99.65%</td>
          <td><strong>99.77%</strong></td>
          <td>98.79%</td>
          <td>98.46%</td>
      </tr>
      <tr>
          <td>BLEU</td>
          <td>99.46%</td>
          <td><strong>99.59%</strong></td>
          <td>98.62%</td>
          <td>98.37%</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>99.64%</td>
          <td><strong>99.78%</strong></td>
          <td>98.87%</td>
          <td>98.66%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Trained on <strong>NVIDIA Tesla V100-PCIE</strong>.</li>
<li><strong>Training Duration</strong>: 30 epochs.</li>
<li><strong>Batch Size</strong>: 256 images ($224 \times 224$ pixels).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/suanfaxiaohuo/SwinOCSR">SwinOCSR</a></td>
          <td>Code + Data</td>
          <td>Unknown</td>
          <td>Official implementation with dataset and trained models</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. <em>Journal of Cheminformatics</em>, 14(41). <a href="https://doi.org/10.1186/s13321-022-00624-5">https://doi.org/10.1186/s13321-022-00624-5</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/suanfaxiaohuo/SwinOCSR">GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>String Representations for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</guid><description>Ablation study comparing SMILES, DeepSMILES, SELFIES, and InChI for OCSR. SMILES achieves highest accuracy; SELFIES guarantees validity.</description><content:encoded><![CDATA[<h2 id="empirical-focus-and-resource-contributions">Empirical Focus and Resource Contributions</h2>
<p>This is an <strong>Empirical Paper</strong> ($\Psi_{\text{Empirical}}$) with a secondary contribution as a <strong>Resource Paper</strong> ($\Psi_{\text{Resource}}$).</p>
<p>It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, DeepSMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.</p>
<h2 id="the-syntax-challenge-in-chemical-image-recognition">The Syntax Challenge in Chemical Image Recognition</h2>
<p>Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. Specifically, the tokenization of SMILES strings (where ring closures and branches are marked by single characters potentially far apart in the sequence) creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.</p>
<h2 id="isolating-string-representation-variables">Isolating String Representation Variables</h2>
<p>The core novelty is the <strong>comparative isolation of the string representation variable</strong> in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between <em>validity</em> (guaranteed by SELFIES) and <em>accuracy</em> (highest with SMILES).</p>
<h2 id="large-scale-image-to-text-translation-experiments">Large-Scale Image-to-Text Translation Experiments</h2>
<p>The authors performed a large-scale image-to-text translation experiment:</p>
<ul>
<li><strong>Task</strong>: Converting 2D chemical structure images into text strings.</li>
<li><strong>Data</strong>:
<ul>
<li><strong>ChEMBL</strong>: ~1.6M molecules, split into two datasets (with and without stereochemistry).</li>
<li><strong>PubChem</strong>: ~3M molecules, split similarly, to test performance scaling with data size.</li>
</ul>
</li>
<li><strong>Representations</strong>: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI.</li>
<li><strong>Metric</strong>: The models were evaluated on:
<ul>
<li><strong>Validity</strong>: Can the predicted string be decoded back to a molecule?</li>
<li><strong>Exact Match</strong>: Is the predicted string identical to the ground truth?</li>
<li><strong>Tanimoto Similarity</strong>: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)? The similarity $\mathcal{T}$ between two molecular fingerprints $A$ and $B$ is calculated as:
$$ \mathcal{T}(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</li>
</ul>
</li>
</ul>
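<p>For binary fingerprints, the Tanimoto formula above reduces to bit counting: the dot product counts common on-bits and each squared norm counts a vector's own on-bits. A minimal sketch:</p>

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto similarity between two binary (0/1) fingerprint vectors.

    Matches T(A, B) = A.B / (|A|^2 + |B|^2 - A.B) for bit vectors.
    """
    ab = float(np.dot(a, b))            # common on-bits
    return ab / (float(np.dot(a, a)) + float(np.dot(b, b)) - ab)
```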
<h2 id="comparative-performance-and-validity-trade-offs">Comparative Performance and Validity Trade-offs</h2>
<ul>
<li><strong>SMILES is the most accurate</strong>: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.</li>
<li><strong>SELFIES guarantees validity</strong>: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.</li>
<li><strong>InChI is unsuitable</strong>: InChI performed significantly worse (approx. 64% exact match) due to extreme maximum string lengths (up to 273 characters).</li>
<li><strong>Stereochemistry adds difficulty</strong>: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.</li>
<li><strong>Recommendation</strong>: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (Dataset 1/2)</td>
          <td>~1.5M</td>
          <td>Filtered for MW &lt; 1500, specific elements (C,H,O,N,P,S,F,Cl,Br,I,Se,B).</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem (Dataset 3/4)</td>
          <td>~3.0M</td>
          <td>Same filtering rules, used to test scaling.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Test Split</td>
          <td>~120k - 250k</td>
          <td>Created using RDKit MaxMin algorithm to ensure chemical diversity.</td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: CDK Structure Diagram Generator (SDG).</li>
<li><strong>Specs</strong>: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tokenization Rules</strong> (Critical for replication):</p>
<ul>
<li><strong>SELFIES</strong>: Split at every <code>][</code> (e.g., <code>[C][N]</code> $\rightarrow$ <code>[C]</code>, <code>[N]</code>).</li>
<li><strong>SMILES / DeepSMILES</strong>: Regex-based splitting:
<ul>
<li>Every heavy atom (e.g., <code>C</code>, <code>N</code>).</li>
<li>Every parenthesis <code>(</code> and <code>)</code>.</li>
<li>Every bond symbol <code>=</code> and <code>#</code>.</li>
<li>Every single-digit number.</li>
<li>Everything inside square brackets <code>[]</code> is kept as a single token.</li>
</ul>
</li>
<li><strong>InChI</strong>: The prefix <code>InChI=1S/</code> was treated as a single token and removed during training, then re-added for evaluation.</li>
</ul>
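<p>The tokenization rules above can be expressed compactly with regular expressions. This sketch adds <code>Cl</code>/<code>Br</code> alternatives (so two-letter elements stay one token) and a catch-all for stray symbols — small extensions beyond the rules listed, flagged here as assumptions:</p>

```python
import re

# One alternative per rule: bracket expressions first (kept whole), then
# two-letter elements, single letters, parentheses, bonds, ring digits,
# and a catch-all for any remaining character.
_TOKEN_RE = re.compile(r"\[[^\]]*\]|Cl|Br|[A-Za-z]|[()=#]|\d|.")

def tokenize_smiles(smiles: str):
    return _TOKEN_RE.findall(smiles)

def tokenize_selfies(selfies: str):
    # SELFIES splits at every '][' boundary, keeping the brackets.
    return re.findall(r"\[[^\]]*\]", selfies)
```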
<h3 id="models">Models</h3>
<p>The model follows the <strong>DECIMER</strong> architecture.</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-B3 (pre-trained with &ldquo;Noisy Student&rdquo; weights).
<ul>
<li>Output: Image feature vectors of shape $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder</strong>: Transformer (similar to the &ldquo;Base&rdquo; model from <em>Attention Is All You Need</em>).
<ul>
<li>Layers: 4 encoder-decoder layers.</li>
<li>Attention Heads: 8.</li>
<li>Dimension ($d_{\text{model}}$): 512.</li>
<li>Feed-forward ($d_{\text{ff}}$): 2048.</li>
<li>Dropout: 10%.</li>
</ul>
</li>
<li><strong>Loss</strong>: Sparse categorical cross-entropy.</li>
<li><strong>Optimizer</strong>: Adam with custom learning rate scheduler.</li>
</ul>
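<p>The &ldquo;custom learning rate scheduler&rdquo; is not specified here; a standard choice for this exact Transformer configuration is the Noam schedule from <em>Attention Is All You Need</em> (linear warmup, then inverse-square-root decay), shown below as an assumption:</p>

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Noam learning-rate schedule: warm up linearly, then decay as 1/sqrt(step).

    d_model matches the paper's decoder dimension; warmup=4000 is the
    original Transformer default, not a value reported by this study.
    """
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```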
<h3 id="evaluation">Evaluation</h3>
<p>Metrics were calculated after converting all predictions back to standard SMILES.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Baseline (SMILES)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Identical Match</strong></td>
          <td>88.62% (PubChem)</td>
          <td>Strict character-for-character equality.</td>
      </tr>
      <tr>
          <td><strong>Valid Structure</strong></td>
          <td>99.78%</td>
          <td>SMILES had rare syntax errors; SELFIES achieved 100%.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (Avg)</strong></td>
          <td>0.98</td>
          <td>Calculated using PubChem fingerprints via CDK.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: Google Cloud TPUs (v3-8).</li>
<li><strong>Format</strong>: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency.</li>
<li><strong>Batch Size</strong>: 1024.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER_Short_Communication">DECIMER Short Communication</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts (Python, Java)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5155037">Datasets on Zenodo</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>SMILES data and processing scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Steinbeck, C., &amp; Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. <em>Digital Discovery</em>, 1(2), 84-90. <a href="https://doi.org/10.1039/D1DD00013F">https://doi.org/10.1039/D1DD00013F</a></p>
<p><strong>Publication</strong>: Digital Discovery 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://chemrxiv.org/doi/pdf/10.26434/chemrxiv-2021-7c9wf">ChemRxiv Preprint (PDF)</a></li>
<li><a href="https://github.com/Kohulan/DECIMER_Short_Communication">Official Code Repository</a></li>
<li><a href="https://doi.org/10.5281/zenodo.5155037">Data on Zenodo</a></li>
<li>Related work: <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/">IMG2SMI</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanPerformanceChemicalStructure2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{84--90}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D1DD00013F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of OCSR Techniques and Models (Musazade 2022)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</guid><description>Systematization of OCSR evolution from rule-based systems to deep learning, highlighting the paradigm shift to image captioning approaches.</description><content:encoded><![CDATA[<h2 id="systematization-of-ocsr-evolution">Systematization of OCSR Evolution</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$). It organizes existing literature into two distinct evolutionary phases: <strong>Rule-based systems</strong> (1990s-2010s) and <strong>Machine Learning-based systems</strong> (2015-present). It synthesizes performance metrics across these paradigms to highlight the shift from simple classification to &ldquo;image captioning&rdquo; (sequence generation).</p>
<p><strong>Justification</strong>: The paper focuses on &ldquo;organizing and synthesizing existing literature&rdquo; and answers the core question: &ldquo;What do we know?&rdquo; The dominant contribution is systematization based on several key indicators:</p>
<ol>
<li>
<p><strong>Survey Structure</strong>: The paper explicitly structures content by categorizing the field into two distinct historical and methodological groups: &ldquo;Rule-based systems&rdquo; and &ldquo;ML-based systems&rdquo;. It traces the &ldquo;evolution of approaches from rule-based structure analyses to complex statistical models&rdquo;, moving chronologically from early tools like OROCS and OSRA (1990s-2000s) to modern Deep Learning approaches like DECIMER and Vision Transformers.</p>
</li>
<li>
<p><strong>Synthesis of Knowledge</strong>: The paper aggregates performance metrics from various distinct studies into unified comparison tables (Table 1 for rule-based and Table 2 for ML-based). It synthesizes technical details of different models, explaining how specific architectures (CNNs, LSTMs, Attention mechanisms) are applied to the specific problem of Optical Chemical Structure Recognition (OCSR).</p>
</li>
<li>
<p><strong>Identification of Gaps</strong>: The authors dedicate specific sections to &ldquo;Gaps of rule-based systems&rdquo; and &ldquo;Gaps of ML-based systems&rdquo;. It concludes with recommendations for future development, such as the need for &ldquo;standardized datasets&rdquo; and specific improvements in image augmentation and evaluation metrics.</p>
</li>
</ol>
<h2 id="motivation-for-digitization-in-cheminformatics">Motivation for Digitization in Cheminformatics</h2>
<p>The primary motivation is the need to digitize vast amounts of chemical knowledge locked in non-digital formats (e.g., scanned PDFs, older textbooks). This is challenging because:</p>
<ol>
<li><strong>Representational Variety</strong>: A single chemical formula can be drawn in many visually distinct ways (e.g., different orientations, bond styles, fonts).</li>
<li><strong>Legacy Data</strong>: Older documents contain noise, low resolution, and disconnected strokes that confuse standard computer vision models.</li>
<li><strong>Lack of Standardization</strong>: There is no centralized database or standardized benchmark for evaluating OCSR performance, making comparison difficult.</li>
</ol>
<h2 id="key-insights-and-the-paradigm-shift">Key Insights and the Paradigm Shift</h2>
<p>The paper provides a structured comparison of the &ldquo;evolution&rdquo; of OCSR, specifically identifying the pivot point where the field moved from object detection to <strong>NLP-inspired sequence generation</strong>.</p>
<p>Key insights include:</p>
<ul>
<li><strong>The Paradigm Shift</strong>: Identifying that OCSR has effectively become an &ldquo;image captioning&rdquo; problem where the &ldquo;caption&rdquo; is a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> string.</li>
<li><strong>Metric Critique</strong>: It critically analyzes the flaws in current evaluation metrics, noting that Levenshtein Distance (LD) is better than simple accuracy but still fails to capture semantic chemical severity (e.g., mistaking &ldquo;F&rdquo; for &ldquo;S&rdquo; is worse than a wrong digit).</li>
<li><strong>Hybrid Potential</strong>: Despite the dominance of ML, the authors argue that rule-based heuristics are still valuable for post-processing validation (e.g., checking element order, sequence structure, and formula correspondence).</li>
</ul>
<h2 id="comparative-analysis-of-rule-based-vs-ml-systems">Comparative Analysis of Rule-Based vs. ML Systems</h2>
<p>As a review paper, it aggregates experimental results from primary sources. It compares:</p>
<ul>
<li><strong>Rule-based systems</strong>: OSRA, chemoCR, Imago, Markov Logic OCSR, and various heuristic approaches.</li>
<li><strong>ML-based systems</strong>: DECIMER (multiple versions), MSE-DUDL, ICMDT (Image Captioning Model based on Deep Transformer-in-Transformer), and other BMS Kaggle competition solutions.</li>
</ul>
<p>It contrasts these systems using:</p>
<ul>
<li><strong>Datasets</strong>: BMS (synthetic, 4M images), PubChem (synthetic), U.S. Patents (real-world scanned).</li>
<li><strong>Metrics</strong>: Tanimoto similarity (structural overlap) and Levenshtein distance (string edit distance).</li>
</ul>
<h2 id="outcomes-critical-gaps-and-recommendations">Outcomes, Critical Gaps, and Recommendations</h2>
<ol>
<li><strong>Transformers are SOTA</strong>: Attention-based encoder-decoder models outperform CNN-RNN hybrids. DECIMER 1.0 achieved 96.47% Tanimoto $= 1.0$ on its test set using an EfficientNet-B3 encoder and Transformer decoder.</li>
<li><strong>Data Hungry</strong>: Modern approaches require massive datasets (millions of images) and significant compute. DECIMER 1.0 trained on 39M images for 14 days on TPU, while the original DECIMER took 27 days on a single GPU. Rule-based systems required neither large data nor heavy compute but hit a performance ceiling.</li>
<li><strong>Critical Gaps</strong>:
<ul>
<li><strong>Super-atoms</strong>: Current models struggle with abbreviated super-atoms (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;).</li>
<li><strong>Stereochemistry</strong>: 3D information (wedges/dashes) is often lost or misinterpreted.</li>
<li><strong>Resolution</strong>: Models are brittle to resolution changes; some require high-res, others fail if images aren&rsquo;t downscaled.</li>
</ul>
</li>
<li><strong>Recommendation</strong>: Future systems should integrate &ldquo;smart&rdquo; pre-processing (denoising without cropping) and use domain-specific distance metrics. The authors also note that post-processing formula validation (checking element order, sequence structure, and formula correspondence) increases accuracy by around 5-6% on average. They suggest exploring Capsule Networks as an alternative to CNNs, since capsules add position invariance through routing-by-agreement rather than max-pooling.</li>
</ol>
<h2 id="reproducibility">Reproducibility</h2>
<p>As a review paper, this work does not introduce original code, models, or datasets. The paper itself is open access via the Journal of Cheminformatics. This section summarizes the technical details of the systems reviewed.</p>
<h3 id="data">Data</h3>
<p>The review identifies the following key datasets used for training OCSR models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BMS (Bristol-Myers Squibb)</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~4M images</td>
          <td style="text-align: left">2.4M train / 1.6M test. Used for Kaggle competition. Test images contain noise (salt &amp; pepper, blur) and rotations absent from training images.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>PubChem</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~39M</td>
          <td style="text-align: left">Generated via CDK (Chemistry Development Kit). Used by DECIMER 1.0 (90/10 train/test split).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>U.S. Patents (USPTO)</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">Variable</td>
          <td style="text-align: left">Real-world noise, often low resolution. One of several training sources for MSE-DUDL (alongside PubChem and Indigo, totaling 50M+ samples).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ChemInfty</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">869 images</td>
          <td style="text-align: left">Older benchmark used to evaluate rule-based systems (e.g., Markov Logic OCSR).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review highlights the progression of algorithms:</p>
<ul>
<li><strong>Rule-Based</strong>: Hough transforms for bond detection, vectorization/skeletonization, and OCR for atom labels.</li>
<li><strong>Sequence Modeling</strong>:
<ul>
<li><strong>Image Captioning</strong>: Encoder (CNN/ViT) → Decoder (RNN/Transformer).</li>
<li><strong>Tokenization</strong>: Parsing InChI/SMILES into discrete tokens (e.g., splitting <code>C13</code> into <code>C</code>, <code>13</code>).</li>
<li><strong>Beam Search</strong>: Used at inference (typically $k = 15$&ndash;$20$) to find the most likely chemical string.</li>
</ul>
</li>
</ul>
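<p>The beam-search decoding step can be sketched as follows; <code>step_fn</code> stands in for the decoder's next-token distribution and is an assumption of this sketch, not an API from any reviewed system:</p>

```python
import math

def beam_search(step_fn, sos, eos, k=15, max_len=50):
    """Keep the k highest log-probability sequences at each decoding step.

    `step_fn(prefix)` returns a list of (token, probability) pairs for
    the next token given the prefix so far.
    """
    beams = [([sos], 0.0)]                        # (sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                    # finished beams pass through
                candidates.append((seq, score))
                continue
            for tok, p in step_fn(seq):
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]                            # best-scoring sequence
```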
<h3 id="models">Models</h3>
<p>Key architectures reviewed:</p>
<ul>
<li><strong>DECIMER 1.0</strong>: Uses <strong>EfficientNet-B3</strong> (Encoder) and <strong>Transformer</strong> (Decoder). Predicts <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings (more robust than <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>).</li>
<li><strong>Swin Transformer</strong>: Often used in Kaggle ensembles as the visual encoder due to better handling of variable image sizes.</li>
<li><strong>Grid LSTM</strong>: Used in older deep learning approaches (MSE-DUDL) to capture spatial dependencies.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics standard in the field:</p>
<ul>
<li><strong>Levenshtein Distance (LD)</strong>: Edit distance between predicted and ground truth strings. Lower is better. Formally, for two sequences $a$ and $b$ (e.g. SMILES strings) of lengths $|a|$ and $|b|$, the distance $LD(a, b)$ is bounded below by $\big||a| - |b|\big|$ and above by $\max(|a|, |b|)$.</li>
<li><strong>Tanimoto Similarity</strong>: Measures overlap of molecular fingerprints ($0.0 - 1.0$). Higher is better. DECIMER 1.0 achieved a Tanimoto of 0.99 on PubChem data (Table 2). Calculated as:
$$
\begin{aligned}
T(A, B) = \frac{N_c}{N_a + N_b - N_c}
\end{aligned}
$$
where $N_a$ and $N_b$ are the number of bits set to 1 in fingerprints $A$ and $B$, and $N_c$ is the number of common bits set to 1.</li>
<li><strong>1-1 Match Rate</strong>: Exact string matching (accuracy). For DECIMER 1.0, 96.47% of results achieved Tanimoto $= 1.0$.</li>
</ul>
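<p>The LD metric above is the standard dynamic-programming edit distance; a minimal row-by-row implementation:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (e.g. predicted vs. true SMILES).

    Classic DP over a (|a|+1) x (|b|+1) grid, keeping only one row at a
    time; O(|a| * |b|) time, O(|b|) space.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```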
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Cost</strong>: High for SOTA. DECIMER 1.0 required ~14 days on TPU. The original DECIMER took ~27 days on a single NVIDIA GPU.</li>
<li><strong>Inference</strong>: Transformer models are heavy; rule-based systems run on standard CPUs but with lower accuracy.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Musazade, F., Jamalova, N., &amp; Hasanov, J. (2022). Review of techniques and models used in optical chemical structure recognition in images and scanned documents. <em>Journal of Cheminformatics</em>, 14(1), 61. <a href="https://doi.org/10.1186/s13321-022-00642-3">https://doi.org/10.1186/s13321-022-00642-3</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{musazadeReviewTechniquesModels2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Review of Techniques and Models Used in Optical Chemical Structure Recognition in Images and Scanned Documents}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Musazade, Fidan and Jamalova, Narmin and Hasanov, Jamaladdin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-022-00642-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>One Strike, You're Out: Detecting Markush Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/</guid><description>Patch-based CNN method for detecting Markush structures in chemical images, addressing low signal-to-noise ratios in OCSR.</description><content:encoded><![CDATA[<h2 id="methodology-and-classification">Methodology and Classification</h2>
<p>This is a <strong>Method</strong> paper (Classification: $\Psi_{\text{Method}}$).</p>
<p>It proposes a patch-based classification pipeline to solve a technical failure mode in Optical Chemical Structure Recognition (OCSR). Distinct rhetorical indicators include a baseline comparison (CNN vs. traditional ORB), ablation studies (architecture, pretraining), and a focus on evaluating the filtering efficacy against a known failure mode.</p>
<h2 id="the-markush-structure-challenge">The Markush Structure Challenge</h2>
<p><strong>The Problem</strong>: Optical Chemical Structure Recognition (OCSR) tools convert 2D images of molecules into machine-readable formats. These tools struggle with &ldquo;Markush structures,&rdquo; generic structural templates used frequently in patents that contain variables rather than specific atoms (e.g., $R$, $X$, $Y$).</p>
<p><strong>The Gap</strong>: Markush structures are difficult to detect because they often appear as small indicators (a single &ldquo;R&rdquo; or variable) within a large image, resulting in a very low Signal-to-Noise Ratio (SNR). Existing OCSR research pipelines typically bypass this by manually excluding these structures from their datasets.</p>
<p><strong>The Goal</strong>: To build an automated filter that can identify images containing Markush structures so they can be removed from OCSR pipelines, improving overall database quality without requiring manual data curation.</p>
<h2 id="patch-based-classification-pipeline">Patch-Based Classification Pipeline</h2>
<p>The core technical contribution is an end-to-end deep learning pipeline tailored for low-SNR chemical images where standard global resizing or cropping fails due to large variations in image resolution and pixel scales.</p>
<ul>
<li><strong>Patch Generation</strong>: The system slices input images into overlapping patches generated from two offset grids, ensuring that variables falling on boundaries are fully captured in at least one crop.</li>
<li><strong>Targeted Annotation</strong>: The labels rely on pixel-level bounding boxes around Markush indicators, minimizing the noise that would otherwise overwhelm a full-image classification attempt.</li>
<li><strong>Inference Strategy</strong>: During inference, the query image is broken into patches, each patch is classified individually, and the patch scores are aggregated into a single image-level prediction via the maximum-pooling rule $X = \max_{i=1}^{n} \{ x_i \}$.</li>
<li><strong>Evaluation</strong>: Provides the first systematic comparison between fixed-feature extraction (ORB + XGBoost) and end-to-end deep learning for this specific domain.</li>
</ul>
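The two-grid slicing and max-pooling aggregation above can be sketched in pure Python (the half-patch offset and max rule follow the paper; function names and the stride-equals-patch-size choice are illustrative):

```python
def patch_origins(w, h, p):
    """Top-left corners of patches from two overlapping grids: a base
    grid with stride p, plus a grid shifted by half a patch so that an
    indicator cut by one grid's edge is whole in the other's crop."""
    origins = []
    for off in (0, p // 2):  # base grid, then half-patch-offset grid
        for x in range(off, w, p):
            for y in range(off, h, p):
                origins.append((x, y))
    return origins


def image_score(patch_scores):
    """Image-level aggregation rule from the paper: X = max_i x_i."""
    return max(patch_scores)
```

A single confident patch is thus enough to flag the whole image — the "one strike, you're out" behavior in the title.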
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared two distinct paradigms on a manually annotated dataset:</p>
<ol>
<li>
<p><strong>Fixed-Feature Baseline</strong>: Used <strong>ORB</strong> (Oriented FAST and Rotated BRIEF) to detect keypoints and match them against a template bank of known Markush symbols. Features (match counts, Hamming distances) were fed into an <strong>XGBoost</strong> model.</p>
</li>
<li>
<p><strong>Deep Learning Method</strong>: Fine-tuned <strong>ResNet18</strong> and <strong>Inception V3</strong> models on the generated image patches.</p>
<ul>
<li><strong>Ablations</strong>: Contrasted pretraining sources, evaluating general domain (ImageNet) against chemistry-specific domain (USPTO images).</li>
<li><strong>Fine-tuning</strong>: Compared full-network fine-tuning against freezing all but the fully connected layers.</li>
</ul>
</li>
</ol>
<p>To handle significant class imbalance, the primary evaluation metric was the Macro F1 score, defined as:</p>
<p>$$ \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{precision}_i \cdot \text{recall}_i}{\text{precision}_i + \text{recall}_i} $$</p>
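A minimal implementation of this metric, assuming per-class true-positive, false-positive, and false-negative counts have already been tallied:

```python
def macro_f1(per_class_counts):
    """Macro F1: the unweighted mean of per-class F1 scores, so the
    rare 'Markush' class counts as much as the majority class.
    per_class_counts is a list of (tp, fp, fn) tuples, one per class."""
    f1s = []
    for tp, fp, fn in per_class_counts:
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```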
<h2 id="performance-outcomes">Performance Outcomes</h2>
<ul>
<li>
<p><strong>CNN vs. ORB</strong>: Deep learning architectures outperformed the fixed-feature baseline. The best model (<strong>Inception V3</strong> pretrained on ImageNet) achieved an image-level Macro F1 of <strong>0.928</strong>, compared to <strong>0.701</strong> (image-level) for the ORB baseline, and a patch-level Macro F1 of <strong>0.917</strong>.</p>
</li>
<li>
<p><strong>The Pretraining Surprise</strong>: Counterintuitively, ImageNet pretraining consistently outperformed the domain-specific USPTO pretraining. The authors hypothesize that the filters learned from ImageNet pretraining generalize well outside the ImageNet domain, though why the USPTO-pretrained filters underperform remains unclear.</p>
</li>
<li>
<p><strong>Full Model Tuning</strong>: Unfreezing the entire network yielded higher performance than tuning only the classifier head, indicating that standard low-level visual filters require substantial adaptation to reliably distinguish chemical line drawings.</p>
</li>
<li>
<p><strong>Limitations and Edge Cases</strong>: The best CNN achieved an ROC AUC of <strong>0.97</strong> on the primary patch test set, while the ORB baseline scored <strong>0.81</strong> on the auxiliary dataset (the paper notes these ROC curves are not directly comparable due to different evaluation sets). The aggregation metric ($X = \max \{ x_i \}$) is naive and has not been optimized. Furthermore, the patching approach creates inherent label noise when a Markush indicator is cleanly bisected by a patch edge, potentially forcing the network to learn incomplete visual features.</p>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a primary dataset labeled by domain experts and a larger auxiliary dataset for evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Val</strong></td>
          <td><strong>Primary Dataset</strong></td>
          <td>272 Images</td>
          <td>Manually annotated with bounding boxes for Markush indicators. Split 60/20/20.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>Auxiliary Dataset</strong></td>
          <td>~5.4k Images</td>
          <td>5117 complete structures, 317 Markush. Used for image-level testing only (no bbox).</td>
      </tr>
  </tbody>
</table>
<p><strong>Patch Generation</strong>:</p>
<ul>
<li>Images are cropped into patches of size <strong>224x224</strong> (ResNet) or <strong>299x299</strong> (Inception).</li>
<li>Patches are generated from 2 grids offset by half the patch width/height to ensure annotations aren&rsquo;t lost on edges.</li>
<li><strong>Labeling Rule</strong>: A patch is labeled &ldquo;Markush&rdquo; if &gt;50% of an annotation&rsquo;s pixels fall inside it.</li>
</ul>
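The &gt;50% labeling rule can be sketched with axis-aligned boxes (a simplification: the paper counts annotation pixels, approximated here by box area; names are illustrative):

```python
def overlap_fraction(ann, patch):
    """Fraction of an annotation box (x1, y1, x2, y2) covered by a patch box."""
    ax1, ay1, ax2, ay2 = ann
    px1, py1, px2, py2 = patch
    iw = max(0, min(ax2, px2) - max(ax1, px1))  # intersection width
    ih = max(0, min(ay2, py2) - max(ay1, py1))  # intersection height
    ann_area = (ax2 - ax1) * (ay2 - ay1)
    return iw * ih / ann_area if ann_area else 0.0


def patch_label(annotations, patch):
    """A patch is positive ('Markush') if any indicator annotation has
    more than half of its area inside the patch."""
    return any(overlap_fraction(a, patch) > 0.5 for a in annotations)
```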
<h3 id="algorithms">Algorithms</h3>
<p><strong>ORB (Baseline)</strong>:</p>
<ul>
<li>Matches query images against a bank of template patches containing Markush indicators.</li>
<li><strong>Features</strong>: Number of keypoints, number of matches, Hamming distance of best 5 matches.</li>
<li><strong>Classifier</strong>: XGBoost trained on these features.</li>
<li><strong>Hyperparameters</strong>: Search over number of features (500-2000) and template patches (50-250).</li>
</ul>
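How the ORB-derived feature vector might be assembled, shown in pure Python over integer-encoded binary descriptors (the actual baseline uses OpenCV's ORB; the 64-bit match threshold and the padding value for images with few keypoints are illustrative assumptions, not from the paper):

```python
def hamming(d1, d2):
    """Hamming distance between two binary descriptors encoded as ints."""
    return bin(d1 ^ d2).count("1")


def orb_features(query_desc, template_desc, k=5):
    """Assemble an XGBoost feature vector in the spirit of the baseline:
    keypoint count, match count under a distance threshold, and the k
    smallest Hamming distances against the Markush template bank."""
    dists = sorted(min(hamming(q, t) for t in template_desc)
                   for q in query_desc)
    n_matches = sum(d <= 64 for d in dists)            # assumed threshold
    best_k = dists[:k] + [256] * (k - len(dists[:k]))  # pad short lists
    return [len(query_desc), n_matches] + best_k
```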
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Framework</strong>: PyTorch with Optuna for optimization.</li>
<li><strong>Optimization</strong>: 25 trials per configuration.</li>
<li><strong>Augmentations</strong>: Random perspective shift, posterization, sharpness/blur.</li>
</ul>
<h3 id="models">Models</h3>
<p>Two main architectures were compared.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Input Size</th>
          <th>Parameters</th>
          <th>Pretraining Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ResNet18</strong></td>
          <td>224x224</td>
          <td>11.5M</td>
          <td>ImageNet</td>
      </tr>
      <tr>
          <td><strong>Inception V3</strong></td>
          <td>299x299</td>
          <td>23.8M</td>
          <td>ImageNet &amp; USPTO</td>
      </tr>
  </tbody>
</table>
<p><strong>Best Configuration</strong>: Inception V3, ImageNet weights, Full Model fine-tuning (all layers unfrozen).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Primary metric was <strong>Macro F1</strong> due to class imbalance.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best CNN (Inception V3)</th>
          <th>Baseline (ORB)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Patch Test F1</strong></td>
          <td>$0.917 \pm 0.014$</td>
          <td>N/A</td>
          <td>ORB does not support patch-level</td>
      </tr>
      <tr>
          <td><strong>Image Test F1</strong></td>
          <td>$0.928 \pm 0.035$</td>
          <td>$0.701 \pm 0.052$</td>
          <td>CNN aggregates patch predictions</td>
      </tr>
      <tr>
          <td><strong>Aux Test F1</strong></td>
          <td>0.914</td>
          <td>0.533</td>
          <td>Evaluation on large secondary dataset</td>
      </tr>
      <tr>
          <td><strong>ROC AUC</strong></td>
          <td>0.97</td>
          <td>0.81</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla V100-SXM2-16GB</li>
<li><strong>CPU</strong>: Intel Xeon E5-2686 @ 2.30GHz</li>
<li><strong>RAM</strong>: 64 GB</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Thomasjurriaans/markush-recognition-msc-thesis">GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>MSc thesis code: CNN training, ORB baseline, evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p>The primary dataset was manually annotated by Elsevier domain experts and is not publicly available. The auxiliary dataset (from Elsevier) is also not public. Pre-trained model weights are not released in the repository.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jurriaans, T., Szarkowska, K., Nalisnick, E., Schwörer, M., Thorne, C., &amp; Akhondi, S. (2023). One Strike, You&rsquo;re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images. <em>arXiv preprint arXiv:2311.14633</em>. <a href="https://doi.org/10.48550/arXiv.2311.14633">https://doi.org/10.48550/arXiv.2311.14633</a></p>
<p><strong>Publication</strong>: arXiv 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Thomasjurriaans/markush-recognition-msc-thesis">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{jurriaansOneStrikeYoure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{One {{Strike}}, {{You}}&#39;re {{Out}}: {{Detecting Markush Structures}} in {{Low Signal-to-Noise Ratio Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{One {{Strike}}, {{You}}&#39;re {{Out}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Jurriaans, Thomas and Szarkowska, Kinga and Nalisnick, Eric and Schwoerer, Markus and Thorne, Camilo and Akhondi, Saber}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolMiner: Deep Learning OCSR with YOLOv5 Detection</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/</guid><description>Deep learning OCSR tool using YOLOv5 and MobileNetV2 to extract machine-readable molecular structures from scientific documents and PDFs.</description><content:encoded><![CDATA[<h2 id="classification-and-contribution">Classification and Contribution</h2>
<p>This is primarily a <strong>Resource</strong> paper ($\Psi_{\text{Resource}}$) with a strong <strong>Method</strong> component ($\Psi_{\text{Method}}$).</p>
<ul>
<li><strong>Resource</strong>: It presents a complete software application (published as an &ldquo;Application Note&rdquo;) for Optical Chemical Structure Recognition (OCSR), including a graphical user interface (GUI) and a new curated &ldquo;Real-World&rdquo; dataset of 3,040 molecular images.</li>
<li><strong>Method</strong>: It proposes a novel &ldquo;rule-free&rdquo; pipeline that replaces traditional vectorization algorithms with deep learning object detection (YOLOv5) and segmentation models.</li>
</ul>
<h2 id="motivation-bottlenecks-in-rule-based-systems">Motivation: Bottlenecks in Rule-Based Systems</h2>
<ul>
<li><strong>Legacy Backlog</strong>: Decades of scientific literature contain chemical structures only as 2D images (PDFs), which are not machine-readable.</li>
<li><strong>Limitations of Legacy Architecture</strong>: Existing tools (like OSRA, CLIDE, MolVec) rely on rule-based vectorization (interpreting vectors and nodes) which struggle with noise, low resolution, and complex drawing styles found in scanned documents.</li>
<li><strong>Deep Learning Gap</strong>: While deep learning (DL) has advanced computer vision, few practical, end-to-end DL tools existed for OCSR that could handle the full pipeline from PDF extraction to graph generation with high accuracy.</li>
</ul>
<h2 id="core-innovation-object-detection-paradigm-for-ocsr">Core Innovation: Object Detection Paradigm for OCSR</h2>
<ul>
<li><strong>Object Detection Paradigm</strong>: MolMiner shifts away from the strategy of line-tracing (vectorization), opting to treat atoms and bonds directly as objects to be detected using <strong>YOLOv5</strong>. This allows it to &ldquo;look once&rdquo; at the image.</li>
<li><strong>End-to-End Pipeline</strong>: Integration of three specialized modules:
<ol>
<li><strong>MobileNetV2</strong> for segmenting molecular figures from PDF pages.</li>
<li><strong>YOLOv5</strong> for detecting chemical elements (atoms/bonds) as bounding boxes.</li>
<li><strong>EasyOCR</strong> for recognizing text labels and resolving abbreviations (supergroups) to full explicit structures.</li>
</ol>
</li>
<li><strong>Synthetic Training Strategy</strong>: The authors bypassed manual labeling by building a data generation module that uses RDKit to create chemically valid images with perfect ground-truth annotations automatically.</li>
</ul>
<h2 id="methodology-end-to-end-object-detection-pipeline">Methodology: End-to-End Object Detection Pipeline</h2>
<ul>
<li><strong>Benchmarks</strong>: Evaluated on four standard OCSR datasets: <strong>USPTO</strong> (5,719 images), <strong>UOB</strong> (5,740 images), <strong>CLEF2012</strong> (992 images), and <strong>JPO</strong> (450 images).</li>
<li><strong>New External Dataset</strong>: Collected and annotated a &ldquo;Real-World&rdquo; dataset of <strong>3,040 images</strong> from 239 scientific papers to test generalization beyond synthetic benchmarks.</li>
<li><strong>Baselines</strong>: Compared against open-source tools: <strong>MolVec</strong> (v0.9.8), <strong>OSRA</strong> (v2.1.0), and <strong>Imago</strong> (v2.0).</li>
<li><strong>Qualitative Tests</strong>: Tested on difficult cases like hand-drawn molecules and large-sized scans (e.g., Palytoxin).</li>
</ul>
<h2 id="results-speed-and-generalization-metrics">Results: Speed and Generalization Metrics</h2>
<ul>
<li><strong>Benchmark Performance</strong>: MolMiner outperformed open-source baselines on standard validation splits.
<ul>
<li><em>USPTO</em>: 93.3% MCS accuracy (vs. 89% for MolVec, per Table 2). The commercial CLiDE Pro tool reports a slightly higher 93.8% on USPTO.</li>
<li><em>Real-World Set</em>: 87.8% MCS accuracy (vs. 50.1% for MolVec, 8.9% for OSRA, and 10.3% for Imago).</li>
</ul>
</li>
<li><strong>Inference Velocity</strong>: The architecture allows for faster processing compared to CPU rule-based systems. On JPO (450 images), MolMiner finishes in under 1 minute versus 8-23 minutes for rule-based tools (Table 3).</li>
<li><strong>Robustness</strong>: Demonstrated ability to handle hand-drawn sketches and noisy scans, though limitations remain with crossing bonds, colorful backgrounds, crowded layout segmentation, and Markush structures.</li>
<li><strong>Software Release</strong>: Released as a free desktop application for Mac and Windows with a Ketcher-based editing plugin.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The system relies heavily on synthetic data for training, while evaluation uses both standard and novel real-world datasets.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left"><strong>Synthetic RDKit</strong></td>
          <td style="text-align: left">Large-scale</td>
          <td style="text-align: left">Generated using RDKit v2021.09.1 and ReportLab v3.5.0. Includes augmentations (rotation, thinning, noise).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>USPTO</strong></td>
          <td style="text-align: left">5,719</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 380.0.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>UOB</strong></td>
          <td style="text-align: left">5,740</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 213.5.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>CLEF2012</strong></td>
          <td style="text-align: left">992</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 401.2.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>JPO</strong></td>
          <td style="text-align: left">450</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 360.3.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>Real-World</strong></td>
          <td style="text-align: left">3,040</td>
          <td style="text-align: left"><strong>New Contribution</strong>. Collected from 239 scientific papers. <a href="https://zenodo.org/records/6973361">Download Link</a>.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Data Generation</strong>:
<ul>
<li>Uses <strong>RDKit</strong> <code>MolDraw2DSVG</code> and <code>CondenseMolAbbreviations</code> to generate images and ground truth.</li>
<li><strong>Augmentation</strong>: Rotation, line thinning/thickness variation, noise injection.</li>
</ul>
</li>
<li><strong>Graph Construction</strong>:
<ul>
<li>A distance-based algorithm connects recognized &ldquo;Atom&rdquo; and &ldquo;Bond&rdquo; objects into a molecular graph.</li>
<li><strong>Supergroup Parser</strong>: Matches detected text against a dictionary collected from RDKit, ChemAxon, and OSRA to resolve abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Me&rdquo;).</li>
</ul>
</li>
<li><strong>Image Preprocessing</strong>:
<ul>
<li><strong>Resizing</strong>: Images with max dim &gt; 2560 are resized to 2560. Small images (&lt; 640) resized to 640.</li>
<li><strong>Padding</strong>: Images padded to nearest upper bound (640, 1280, 1920, 2560) with white background (255, 255, 255).</li>
<li><strong>Dilation</strong>: For thick-line images, <code>cv2.dilate</code> (3x3 or 2x2 kernel) is applied to estimate median line width.</li>
</ul>
</li>
</ul>
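The resize-then-pad rules above reduce to a small dimension calculation (a sketch of the stated rules only; the actual tool operates on pixel arrays via OpenCV):

```python
def resize_and_pad_dims(w, h):
    """MolMiner-style target dims: scale so the longest side lands in
    [640, 2560], then pick the nearest upper bound in {640, 1280, 1920,
    2560} to pad to with a white background."""
    longest = max(w, h)
    if longest > 2560:
        scale = 2560 / longest
    elif longest < 640:
        scale = 640 / longest
    else:
        scale = 1.0
    rw, rh = round(w * scale), round(h * scale)
    bound = next(b for b in (640, 1280, 1920, 2560) if b >= max(rw, rh))
    return rw, rh, bound
```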
<h3 id="models">Models</h3>
<p>The system is a cascade of three distinct deep learning models:</p>
<ol>
<li><strong>MolMiner-ImgDet</strong> (Page Segmentation):
<ul>
<li><strong>Architecture</strong>: <strong>MobileNetV2</strong>.</li>
<li><strong>Task</strong>: Semantic segmentation to identify and crop chemical figures from full PDF pages.</li>
<li><strong>Classes</strong>: Background vs. Compound.</li>
<li><strong>Performance</strong>: Recall 95.5%.</li>
</ul>
</li>
<li><strong>MolMiner-ImgRec</strong> (Structure Recognition):
<ul>
<li><strong>Architecture</strong>: <strong>YOLOv5</strong> (One-stage object detector). Selected over MaskRCNN/EfficientDet for speed/accuracy trade-off.</li>
<li><strong>Task</strong>: Detects atoms and bonds as bounding boxes.</li>
<li><strong>Labels</strong>:
<ul>
<li><em>Atoms</em>: Si, N, Br, S, I, Cl, H, P, O, C, B, F, Text.</li>
<li><em>Bonds</em>: Single, Double, Triple, Wedge, Dash, Wavy.</li>
</ul>
</li>
<li><strong>Performance</strong>: mAP@0.5 = 97.5%.</li>
</ul>
</li>
<li><strong>MolMiner-TextOCR</strong> (Character Recognition):
<ul>
<li><strong>Architecture</strong>: <strong>EasyOCR</strong> (fine-tuned).</li>
<li><strong>Task</strong>: Recognize specific characters in &ldquo;Text&rdquo; regions identified by YOLO (e.g., supergroups, complex labels).</li>
<li><strong>Performance</strong>: ~96.4% accuracy.</li>
</ul>
</li>
</ol>
<h2 id="performance-evaluation--accuracy-metrics">Performance Evaluation &amp; Accuracy Metrics</h2>
<p>The paper argues that Maximum Common Substructure (MCS) accuracy is superior to string comparisons of canonical identifiers such as InChI or SMILES, because the InChI string is highly sensitive to small canonicalization or tautomerization discrepancies (e.g., differing aromaticity models). Therefore, for comparing structural isomorphism:</p>
<p>$$ \text{MCS Accuracy} = \frac{|\text{Edges}_{\text{MCS}}| + |\text{Nodes}_{\text{MCS}}|}{|\text{Edges}_{\text{Ground Truth}}| + |\text{Nodes}_{\text{Ground Truth}}|} $$</p>
<p>Using this metric to evaluate bond- and atom-level recall directly measures OCSR extraction fidelity.</p>
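The metric itself is a simple ratio of shared graph size to ground-truth graph size (in practice the MCS would be computed with a cheminformatics toolkit such as RDKit's FMCS module; only the ratio is sketched here):

```python
def mcs_accuracy(mcs_edges, mcs_nodes, gt_edges, gt_nodes):
    """MCS accuracy: size of the maximum common substructure (edges +
    nodes) over the size of the ground-truth molecular graph.
    1.0 means the prediction recovers the full ground-truth structure."""
    return (mcs_edges + mcs_nodes) / (gt_edges + gt_nodes)
```

For instance, if a prediction drops one ring bond of benzene, the MCS is a 6-atom, 5-bond chain, giving (5 + 6) / (6 + 6) ≈ 0.917 rather than the all-or-nothing 0 of an exact InChI match.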
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">MolMiner (Real-World)</th>
          <th style="text-align: left">MolVec</th>
          <th style="text-align: left">OSRA</th>
          <th style="text-align: left">Imago</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>MCS Accuracy</strong></td>
          <td style="text-align: left"><strong>87.8%</strong></td>
          <td style="text-align: left">50.1%</td>
          <td style="text-align: left">8.9%</td>
          <td style="text-align: left">10.3%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>InChI Accuracy</strong></td>
          <td style="text-align: left"><strong>88.9%</strong></td>
          <td style="text-align: left">62.6%</td>
          <td style="text-align: left">64.5%</td>
          <td style="text-align: left">10.8%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Inference Hardware</strong>: Tested on Intel Xeon Gold 6230R CPU @ 2.10 GHz.</li>
<li><strong>Acceleration</strong>: Supports batch inference on GPU, which provides the reported speedups over rule-based CPU tools.</li>
<li><strong>Runtime</strong>: Under 1 minute on JPO (450 images), 7 minutes on USPTO (5,719 images), compared to 29-148 minutes for baseline tools on USPTO (Table 3).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/iipharma/pharmamind-molminer">pharmamind-molminer</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">GitHub repo with user guides and release downloads</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://zenodo.org/records/6973361">Real-World Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">3,040 molecular images from 239 papers</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Y., Xiao, J., Chou, C.-H., Zhang, J., Zhu, J., Hu, Q., Li, H., Han, N., Liu, B., Zhang, S., Han, J., Zhang, Z., Zhang, S., Zhang, W., Lai, L., &amp; Pei, J. (2022). MolMiner: You only look once for chemical structure recognition. <em>Journal of Chemical Information and Modeling</em>, 62(22), 5321&ndash;5328. <a href="https://doi.org/10.1021/acs.jcim.2c00733">https://doi.org/10.1021/acs.jcim.2c00733</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iipharma/pharmamind-molminer">Github Repository</a></li>
<li><a href="https://zenodo.org/records/6973361">Zenodo Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xuMolMinerYouOnly2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MolMiner: You only look once for chemical structure recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MolMiner}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Xu, Youjun and Xiao, Jinchuan and Chou, Chia-Han and Zhang, Jianhang and Zhu, Jintao and Hu, Qiwan and Li, Hemin and Han, Ningsheng and Liu, Bingyu and Zhang, Shuaipeng and Han, Jinyu and Zhang, Zhen and Zhang, Shuhao and Zhang, Weilin and Lai, Luhua and Pei, Jianfeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5321--5328}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.2c00733}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MICER: Molecular Image Captioning with Transfer Learning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/</guid><description>Encoder-decoder model using pre-trained ResNet and attention-based LSTM to translate molecular images into SMILES strings, reaching 97.54% sequence accuracy.</description><content:encoded><![CDATA[<h2 id="micers-contribution-to-optical-structure-recognition">MICER&rsquo;s Contribution to Optical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper according to the AI for Physical Sciences taxonomy. It proposes MICER, an encoder-decoder architecture that integrates transfer learning (fine-tuning pre-trained models) and attention mechanisms for Optical Chemical Structure Recognition (OCSR). The study includes rigorous benchmarking comparing MICER against three rule-based tools (OSRA, MolVec, Imago) and existing deep learning methods (DECIMER). The authors conduct extensive factor comparison experiments to isolate the effects of stereochemistry, molecular complexity, data volume, and encoder backbone choices.</p>
<h2 id="the-challenge-of-generalizing-in-ocsr">The Challenge of Generalizing in OCSR</h2>
<p>Chemical structures in scientific literature are valuable for drug discovery, but they are locked in image formats that are difficult to mine automatically. Traditional OCSR tools (like OSRA) rely on hand-crafted rules and expert knowledge. They are brittle, struggle with stylistic variations, and have low generalization ability. While deep learning has been applied (e.g., DECIMER), previous attempts often used frozen pre-trained feature extractors (without fine-tuning) or failed to fully exploit transfer learning, leading to suboptimal performance. The goal of this work is to build an end-to-end &ldquo;image captioning&rdquo; system that translates molecular images directly into <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings without intermediate segmentation steps.</p>
<h2 id="integrating-fine-tuning-and-attention-for-chemistry">Integrating Fine-Tuning and Attention for Chemistry</h2>
<p>The core novelty lies in the specific architectural integration of transfer learning with fine-tuning for the chemical domain. Unlike DECIMER, which used a frozen network, MICER fine-tunes a pre-trained ResNet on molecular images. This allows the encoder to adapt from general object recognition to specific chemical feature extraction.</p>
<p>The model incorporates an attention mechanism into the LSTM decoder, allowing the model to focus on specific image regions (atoms and bonds) when generating each character of the SMILES string. The paper explicitly analyzes &ldquo;intrinsic features&rdquo; of molecular data (stereochemistry, complexity) to guide the design of the training dataset, combining multiple chemical toolkits (Indigo, RDKit) to generate diverse styles.</p>
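The per-character focusing step can be sketched as a score-plus-softmax over encoder feature vectors (a deliberate simplification: MICER's exact attention parameterization is not reproduced here, and the dot-product scoring, shapes, and names are illustrative):

```python
import math


def attention_weights(decoder_state, encoder_feats):
    """Softmax over similarity scores between the current decoder state
    (length-d vector) and each encoder feature (n vectors of length d),
    yielding one weight per image region for the next SMILES character."""
    scores = [sum(s * f for s, f in zip(decoder_state, feat))
              for feat in encoder_feats]
    m = max(scores)                              # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```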
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The authors performed two types of experiments: Factor Comparison (ablations) and Benchmarking.</p>
<p><strong>Factor Comparisons</strong>: They evaluated how performance is affected by:</p>
<ul>
<li><strong>Stereochemistry (SI)</strong>: Comparing models trained on data with and without stereochemical information.</li>
<li><strong>Molecular Complexity (MC)</strong>: Analyzing performance across 5 molecular weight intervals.</li>
<li><strong>Data Volume (DV)</strong>: Training on datasets ranging from 0.64 million to 10 million images.</li>
<li><strong>Pre-trained Models (PTMs)</strong>: Comparing 8 different backbones (e.g., ResNet, VGG, Inception, MobileNet) versus a base CNN.</li>
</ul>
<p><strong>Benchmarking</strong>:</p>
<ul>
<li><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based); Base CNN, DECIMER (deep learning).</li>
<li><strong>Datasets</strong>: Four test sets (100k images each, except UOB): Uni-style, Multi-style, Noisy, and Real-world (UOB dataset).</li>
<li><strong>Metrics</strong>: Sequence Accuracy (Exact Match), Levenshtein Distance (ALD), and Tanimoto Similarity (Fingerprint match).</li>
</ul>
<h2 id="results-and-core-insights">Results and Core Insights</h2>
<p>MICER achieved 97.54% Sequence Accuracy on uni-style data and 82.33% on the real-world UOB dataset, outperforming rule-based and deep learning baselines across all four test sets.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Method</th>
          <th>SA (%)</th>
          <th>AMFTS (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Uni-style</td>
          <td>OSRA</td>
          <td>23.14</td>
          <td>56.83</td>
      </tr>
      <tr>
          <td>Uni-style</td>
          <td>DECIMER</td>
          <td>35.32</td>
          <td>86.92</td>
      </tr>
      <tr>
          <td>Uni-style</td>
          <td><strong>MICER</strong></td>
          <td><strong>97.54</strong></td>
          <td><strong>99.74</strong></td>
      </tr>
      <tr>
          <td>Multi-style</td>
          <td>OSRA</td>
          <td>15.68</td>
          <td>44.50</td>
      </tr>
      <tr>
          <td>Multi-style</td>
          <td><strong>MICER</strong></td>
          <td><strong>95.09</strong></td>
          <td><strong>99.28</strong></td>
      </tr>
      <tr>
          <td>Noisy</td>
          <td><strong>MICER</strong></td>
          <td><strong>94.95</strong></td>
          <td><strong>99.25</strong></td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td>OSRA</td>
          <td>80.24</td>
          <td>91.17</td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td>DECIMER</td>
          <td>21.75</td>
          <td>65.15</td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td><strong>MICER</strong></td>
          <td><strong>82.33</strong></td>
          <td><strong>94.47</strong></td>
      </tr>
  </tbody>
</table>
<p>ResNet101 was identified as the most effective encoder (87.58% SA in preliminary tests on 0.8M images), outperforming deeper (DenseNet121 at 81.41%) and lighter (MobileNetV2 at 39.83%) networks. Performance saturates around 6 million training samples, reaching 98.84% SA. Stereochemical information drops accuracy by approximately 6.1% (from 87.61% to 81.50%), indicating wedge and dash bonds are harder to recognize. Visualizing attention maps showed the model correctly attends to specific atoms (e.g., focusing on &lsquo;S&rsquo; or &lsquo;Cl&rsquo; pixels) when generating the corresponding character.</p>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations. MICER struggles with superatoms, R-groups, text labels, and uncommon atoms (e.g., Sn) that were not seen during training. On noisy data, noise spots near Cl atoms can cause them to be misread as O atoms; in complex images, noise points may be misrecognized as single bonds and wedge bonds as double bonds. All methods, including MICER, have substantial room for improvement on real-world datasets that contain these challenging elements.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was curated from the <strong>ZINC20</strong> database.</p>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Filtering</strong>: Removed organometallics, mixtures, and invalid molecules.</li>
<li><strong>Standardization</strong>: SMILES were canonicalized and de-duplicated.</li>
<li><strong>Generation</strong>: Images generated using <strong>Indigo</strong> and <strong>RDKit</strong> toolkits to vary styles.</li>
</ul>
<p><strong>Dataset Size</strong>:</p>
<ul>
<li><strong>Total</strong>: 10 million images selected for the final model.</li>
<li><strong>Composition</strong>: 6 million &ldquo;default style&rdquo; (Indigo) + 4 million &ldquo;multi-style&rdquo; (Indigo + RDKit).</li>
<li><strong>Splits</strong>: 8:1:1 ratio for Training/Validation/Test.</li>
</ul>
<p><strong>Vocabulary</strong>: A token dictionary of 39 SMILES characters plus 3 special tokens (<code>[pad]</code>, <code>[sos]</code>, <code>[eos]</code>): <code>[0]</code>-<code>[9]</code>, <code>[C]</code>, <code>[l]</code>, <code>[c]</code>, <code>[O]</code>, <code>[N]</code>, <code>[n]</code>, <code>[F]</code>, <code>[H]</code>, <code>[o]</code>, <code>[S]</code>, <code>[s]</code>, <code>[B]</code>, <code>[r]</code>, <code>[I]</code>, <code>[i]</code>, <code>[P]</code>, <code>[p]</code>, <code>(</code>, <code>)</code>, <code>[</code>, <code>]</code>, <code>@</code>, <code>=</code>, <code>#</code>, <code>/</code>, <code>-</code>, <code>+</code>, <code>\</code>, <code>%</code>. Two-letter atoms like &lsquo;Br&rsquo; are tokenized as distinct characters <code>[B]</code>, <code>[r]</code>, and &lsquo;Cl&rsquo; as <code>[C]</code>, <code>[l]</code>.</p>
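<p>A minimal sketch of this character-level vocabulary, reconstructed from the description above (the paper's actual vocabulary file is not released, so the ordering here is an assumption):</p>

```python
# Hedged sketch: character-level SMILES tokenization as described for MICER.
# The vocabulary is reconstructed from the paper's description
# (39 SMILES characters + [pad]/[sos]/[eos]); ordering is assumed.
VOCAB = (
    ["[pad]", "[sos]", "[eos]"]
    + list("0123456789")
    + list("ClcONnFHoSsBrIiPp")
    + list("()[]@=#/-+\\%")
)
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(smiles: str) -> list:
    """Map a SMILES string to token ids character by character.
    Two-letter atoms ('Cl', 'Br') become two tokens, e.g. 'C' + 'l'."""
    return ([TOKEN_TO_ID["[sos]"]]
            + [TOKEN_TO_ID[ch] for ch in smiles]
            + [TOKEN_TO_ID["[eos]"]])
```

<p>Note that the model, not the tokenizer, is responsible for learning that adjacent <code>[C]</code> and <code>[l]</code> tokens denote a chlorine atom.</p>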
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Character-level tokenization (not atom-level); the model learns to assemble &lsquo;C&rsquo; and &lsquo;l&rsquo; into &lsquo;Cl&rsquo;.</li>
<li><strong>Attention Mechanism</strong>: Uses a soft attention mechanism where the decoder calculates an attention score between the encoder&rsquo;s feature map ($8 \times 8 \times 512$) and the current hidden vector. Formula:
$$
\begin{aligned}
\text{att\_score} &amp;= \text{softmax}(L_a(\tanh(L_f(F) + L_b(b_t))))
\end{aligned}
$$</li>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Loss Function</strong>: Cross-entropy loss</li>
<li><strong>Optimizer</strong>: Adam optimizer</li>
<li><strong>Learning Rate</strong>: 2e-5</li>
<li><strong>Batch Size</strong>: 256</li>
<li><strong>Epochs</strong>: 15</li>
</ul>
</li>
</ul>
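<p>The attention score above can be sketched numerically. This is an illustrative reconstruction, not the released implementation: the linear maps $L_f$, $L_b$, $L_a$ are learned in the paper, and random matrices stand in for them here; the attention dimension (256) is also an assumption.</p>

```python
import numpy as np

# Sketch of MICER's soft attention: score each of the 64 encoder patches
# (8x8x512 feature map, flattened to 64x512) against the decoder state b_t.
rng = np.random.default_rng(0)
d_att, d_feat, d_hid = 256, 512, 512   # d_att is an assumed hyperparameter
L_f = rng.normal(size=(d_feat, d_att)) # projects the feature map
L_b = rng.normal(size=(d_hid, d_att))  # projects the hidden state
L_a = rng.normal(size=(d_att, 1))      # scores each patch

def attention_weights(F, b_t):
    # att_score = softmax(L_a(tanh(L_f(F) + L_b(b_t))))
    scores = np.tanh(F @ L_f + b_t @ L_b) @ L_a  # (64, 1)
    e = np.exp(scores - scores.max())
    return e / e.sum()

F = rng.normal(size=(64, d_feat))      # flattened encoder feature matrix
b_t = rng.normal(size=(d_hid,))        # current LSTM hidden vector
w = attention_weights(F, b_t)          # one weight per image patch
context = (w * F).sum(axis=0)          # weighted sum fed to the LSTM step
```

<p>The resulting weights form a distribution over the 64 image regions, which is what the paper visualizes as attention maps over atoms.</p>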
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: Pre-trained <strong>ResNet101</strong> (trained on ImageNet).</li>
<li><strong>Modifications</strong>: The final layer is removed to output a Feature Map of size $8 \times 8 \times 512$.</li>
<li><strong>Flattening</strong>: Reshaped to a $64 \times 512$ feature matrix for the decoder.</li>
</ul>
<p><strong>Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Long Short-Term Memory (LSTM) with Attention.</li>
<li><strong>Dropout</strong>: 0.3 applied to minimize overfitting.</li>
</ul>
<p>The encoder uses a pilot network (for universal feature extraction), a max-pooling layer, and multiple feature extraction layers containing convolutional blocks (CBs), feeding into the attention LSTM.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>SA (Sequence Accuracy)</strong>: Strict exact match of SMILES strings.</li>
<li><strong>ALD (Average Levenshtein Distance)</strong>: Edit distance for character-level error analysis.</li>
<li><strong>AMFTS / MFTS@1.0</strong>: Tanimoto similarity of ECFP4 fingerprints to measure structural similarity.</li>
</ul>
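<p>Two of these metrics are straightforward to state in code; a minimal reference sketch (the fingerprint-based AMFTS metric needs a cheminformatics toolkit such as RDKit and is omitted):</p>

```python
# Reference implementations of Sequence Accuracy (strict exact match)
# and Levenshtein edit distance between predicted and ground-truth SMILES.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def sequence_accuracy(preds, targets):
    """Fraction of predictions that exactly match the target string."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)
```
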
<p><strong>Test Sets</strong>:</p>
<ul>
<li><strong>Uni-style</strong>: 100,000 images (Indigo default).</li>
<li><strong>Multi-style</strong>: 100,000 images (&gt;10 styles).</li>
<li><strong>Noisy</strong>: 100,000 images with noise added.</li>
<li><strong>UOB</strong>: 5,575 real-world images from literature.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 GPUs</li>
<li><strong>Training Time</strong>: Approximately 42 hours for the final model</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Jiacai-Yi/MICER">MICER</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<p>The training data (generated from ZINC20) and pre-trained model weights are not publicly released. The repository contains code but has minimal documentation (2 commits, no description).</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yi, J., Wu, C., Zhang, X., Xiao, X., Qiu, Y., Zhao, W., Hou, T., &amp; Cao, D. (2022). MICER: a pre-trained encoder-decoder architecture for molecular image captioning. <em>Bioinformatics</em>, 38(19), 4562-4572. <a href="https://doi.org/10.1093/bioinformatics/btac545">https://doi.org/10.1093/bioinformatics/btac545</a></p>
<p><strong>Publication</strong>: Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Jiacai-Yi/MICER">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yiMICERPretrainedEncoder2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MICER}}: A Pre-Trained Encoder--Decoder Architecture for Molecular Image Captioning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MICER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yi, Jiacai and Wu, Chengkun and Zhang, Xiaochen and Xiao, Xinyi and Qiu, Yanlong and Zhao, Wentao and Hou, Tingjun and Cao, Dongsheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{19}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4562--4572}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1367-4811}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bioinformatics/btac545}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2SMILES: Transformer OCSR with Synthetic Data Pipeline</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</guid><description>Transformer-based OCSR using a novel synthetic data generation pipeline for robust molecular image interpretation across diverse drawing styles.</description><content:encoded><![CDATA[<h2 id="contribution-image2smiles-as-a-method-and-resource">Contribution: Image2SMILES as a Method and Resource</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a specific neural architecture (ResNet backbone and Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering &ldquo;How well does this work?&rdquo; with extensive benchmarks against rule-based systems like OSRA.</li>
<li><strong>Resource</strong>: A core contribution is the &ldquo;Generate and Train!&rdquo; paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.</li>
</ul>
<h2 id="motivation-bottlenecks-in-recognizing-trapped-chemical-structures">Motivation: Bottlenecks in Recognizing Trapped Chemical Structures</h2>
<p>Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.</p>
<ul>
<li><strong>Problem</strong>: Chemical structures are often &ldquo;trapped&rdquo; in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, &ldquo;Markush&rdquo; structures (templates), or visual contamination.</li>
<li><strong>Gap</strong>: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.</li>
<li><strong>Goal</strong>: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).</li>
</ul>
<h2 id="core-innovation-the-generate-and-train-pipeline-and-fg-smiles">Core Innovation: The &ldquo;Generate and Train!&rdquo; Pipeline and FG-SMILES</h2>
<ul>
<li><strong>&ldquo;Generate and Train!&rdquo; Paradigm</strong>: The authors assert that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates geometry (rotation, bonds) alongside specific chemical drawing artifacts like &ldquo;Markush&rdquo; variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual &ldquo;contamination&rdquo; (stray text, arrows).</li>
<li><strong>FG-SMILES</strong>: A modified SMILES syntax designed to handle functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds.</li>
<li><strong>Encoder-Free Architecture</strong>: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.</li>
</ul>
<h2 id="methodology-and-benchmarking-against-osra">Methodology and Benchmarking Against OSRA</h2>
<ul>
<li><strong>Training</strong>: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.</li>
<li><strong>Validation (Synthetic)</strong>: Evaluated on a hold-out set of 1M synthetic images.</li>
<li><strong>Validation (Real World)</strong>:
<ul>
<li><strong>Dataset A</strong>: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.</li>
<li><strong>Dataset B</strong>: 296 structures systematically extracted from <em>Journal of Organic Chemistry</em> (one paper per issue from 2020) to reduce selection bias.</li>
</ul>
</li>
<li><strong>Comparison</strong>: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.</li>
</ul>
<h2 id="results-high-precision-extraction-and-key-limitations">Results: High-Precision Extraction and Key Limitations</h2>
<ul>
<li><strong>Performance</strong>:
<ul>
<li><strong>Synthetic</strong>: 90.7% exact match accuracy.</li>
<li><strong>Real Data (Dataset A)</strong>: Image2SMILES achieved <strong>79.2%</strong> accuracy compared to OSRA&rsquo;s <strong>62.1%</strong>.</li>
<li><strong>Real Data (Dataset B)</strong>: Image2SMILES achieved <strong>62.5%</strong> accuracy compared to OSRA&rsquo;s <strong>24.0%</strong>.</li>
</ul>
</li>
<li><strong>Confidence Correlation</strong>: There is a strong correlation between the model&rsquo;s confidence score and prediction validity. Thresholding at 0.995 yields 99.85% accuracy while ignoring 22.5% of data, enabling high-precision automated pipelines.</li>
<li><strong>Key Failures</strong>: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confusion of R-group indices ($R'$ vs $R_1$), and explicit hydrogens rendered as groups.</li>
</ul>
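<p>The confidence-thresholding analysis can be sketched as follows; the data here are illustrative stand-ins (the paper reports 99.85% accuracy at a 0.995 cutoff while discarding 22.5% of predictions):</p>

```python
# Sketch: keep only predictions whose model confidence clears a cutoff,
# then measure accuracy on the retained subset and the retained fraction.
def threshold_tradeoff(confidences, correct, cutoff):
    kept = [(c, ok) for c, ok in zip(confidences, correct) if c >= cutoff]
    coverage = len(kept) / len(confidences)
    accuracy = sum(ok for _, ok in kept) / len(kept) if kept else 0.0
    return accuracy, coverage

# Toy example: five predictions with confidences and correctness flags.
conf = [0.999, 0.997, 0.990, 0.970, 0.996]
ok = [True, True, False, False, True]
acc, cov = threshold_tradeoff(conf, ok, 0.995)  # acc=1.0, cov=0.6
```

<p>Sweeping the cutoff trades coverage for precision, which is what enables the high-precision automated pipelines described above.</p>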
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: A subset of 10 million molecules sampled from PubChem.</li>
<li><strong>Selection Logic</strong>: Bias towards complex/rare structures using a &ldquo;Full Coefficient&rdquo; (FC) probability metric based on molecule size and ring/atom rarity.
<ul>
<li>Formula: $FC=0.1+1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$ where $n_{\max}=60$.</li>
</ul>
</li>
<li><strong>Generation</strong>: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).</li>
<li><strong>Contamination</strong>: &ldquo;Visual noise&rdquo; is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.</li>
<li><strong>Target Format</strong>: <strong>FG-SMILES</strong> (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a <code>v</code> token.</li>
</ul>
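<p>The size-dependent coefficient in the sampling formula above is simple to state directly. This is one factor of the overall selection probability (the ring/atom-rarity terms are not reproduced here), and the clamping of $n$ at $n_{\max}$ is an assumption:</p>

```python
# Sketch of the size-dependent sampling coefficient:
# 0.1 + 1.2 * ((n_max - n) / n_max)^3, with n_max = 60.
N_MAX = 60

def size_coefficient(n_atoms: int) -> float:
    n = min(n_atoms, N_MAX)  # assumed clamp for molecules above n_max atoms
    return 0.1 + 1.2 * ((N_MAX - n) / N_MAX) ** 3
```
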
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Contamination Augmentation</strong>: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.</li>
<li><strong>Functional Group Resolution</strong>: An algorithm identifies overlapping functional group templates (SMARTS) and resolves them to prevent nested group conflicts (e.g., resolving Methyl vs Methoxy).</li>
<li><strong>Markush Support</strong>: Stochastic replacement of substituents with R-group labels ($R_1$, $R'$, etc.) based on a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$).</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;Image-to-Sequence&rdquo; hybrid model.
<ul>
<li><strong>Backbone</strong>: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.</li>
<li><strong>Neck</strong>: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.</li>
<li><strong>Decoder</strong>: Standard Transformer Decoder with parameters from the original Transformer architecture.</li>
</ul>
</li>
<li><strong>Input</strong>: Images resized to $384 \times 384 \times 3$.</li>
<li><strong>Output</strong>: Sequence of FG-SMILES tokens.</li>
</ul>
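<p>The &ldquo;encoder-free&rdquo; interface between CNN and decoder reduces to a reshape; a shape-level sketch using the dimensions stated above (the projection into the decoder's model dimension, if any, is not specified and is omitted):</p>

```python
import numpy as np

# Sketch: the truncated ResNet-50 output (512 x 48 x 48) is flattened into
# a sequence of 48*48 = 2304 spatial tokens of dimension 512, which feed the
# Transformer decoder's cross-attention directly -- no Transformer encoder.
cnn_features = np.zeros((512, 48, 48))       # C x H x W stand-in
tokens = cnn_features.reshape(512, -1).T     # (2304, 512) token sequence
```
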
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Binary &ldquo;Exact Match&rdquo; (valid/invalid).
<ul>
<li>Strict criteria: Stereo and R-group indices must match exactly (e.g., $R'$ vs $R_1$ is a failure).</li>
</ul>
</li>
<li><strong>Datasets</strong>:
<ul>
<li><strong>Internal</strong>: 5% random split of generated data (500k samples).</li>
<li><strong>External (Dataset A &amp; B)</strong>: Manually cropped real-world images from specified journals.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.</li>
<li><strong>Duration</strong>: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.</li>
<li><strong>Optimizer</strong>: RAdam with learning rate $3 \cdot 10^{-4}$.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/syntelly/img2smiles_generator">Data Generator (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic training data generator</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5069806">1M Generated Samples (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Randomly generated image-SMILES pairs</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5356500">Real-World Test Images (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Cropped structures from real papers with target FG-SMILES</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></td>
          <td>Other</td>
          <td>Proprietary</td>
          <td>Web demo for PDF-to-SMILES extraction</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Khokhlov, I., Krasnov, L., Fedorov, M. V., &amp; Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. <em>Chemistry-Methods</em>, 2(1), e202100069. <a href="https://doi.org/10.1002/cmtd.202100069">https://doi.org/10.1002/cmtd.202100069</a></p>
<p><strong>Publication</strong>: Chemistry-Methods 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/syntelly/img2smiles_generator">Official Code (Data Generator)</a></li>
<li><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{khokhlovImage2SMILESTransformerBasedMolecular2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image2SMILES: Transformer-Based Molecular Optical Recognition Engine}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Image2SMILES}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Chemistry-Methods}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{e202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2628-9725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1002/cmtd.202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image-to-Graph Transformers for Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</guid><description>A deep learning model that converts molecular images directly into graph structures, enabling recognition of abbreviated non-atomic symbols.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-classification">Contribution and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture designed to extract molecular structures from images by directly predicting the graph topology. The paper validates this approach through ablation studies (comparing ResNet-only baselines to the Transformer-augmented model) and extensive benchmarking against existing tools.</p>
<h2 id="the-challenge-with-smiles-and-non-atomic-symbols">The Challenge with SMILES and Non-Atomic Symbols</h2>
<ul>
<li><strong>Handling Abbreviations:</strong> Chemical structures in scientific literature often use non-atomic symbols (superatoms like &ldquo;R&rdquo; or &ldquo;Ph&rdquo;) to reduce complexity. Standard tools that generate SMILES strings fail here because SMILES syntax does not support arbitrary non-atomic symbols.</li>
<li><strong>Robustness to Style:</strong> Existing rule-based tools are brittle to the diverse drawing styles found in literature.</li>
<li><strong>Data Utilization:</strong> Pixel-wise graph recognition tools (like ChemGrapher) require expensive pixel-level labeling. An end-to-end approach can utilize massive amounts of image-molecule pairs (like USPTO data) without needing exact coordinate labels.</li>
</ul>
<h2 id="the-image-to-graph-i2g-architecture">The Image-to-Graph (I2G) Architecture</h2>
<p>The core novelty is the <strong>Image-to-Graph (I2G)</strong> architecture that bypasses string representations entirely:</p>
<ul>
<li><strong>Hybrid Encoder:</strong> Combines a ResNet backbone (for locality) with a Transformer encoder (for global context), allowing the model to capture relationships between atoms that are far apart in the image.</li>
<li><strong>Graph Decoder (GRAT):</strong> A modified Transformer decoder that generates the graph auto-regressively. It uses feature-wise transformations to modulate attention weights based on edge information (bond types).</li>
<li><strong>Coordinate-Aware Training:</strong> The model is forced to predict the exact 2D coordinates of atoms in the source image. Combined with auxiliary losses, this boosts SMI accuracy from 0.009 to 0.567 on the UoB ablation (Table 1 in the paper).</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Baselines:</strong> The model was compared against OSRA (rule-based), MolVec (rule-based), and ChemGrapher (deep learning pixel-wise).</li>
<li><strong>Benchmarks:</strong> Evaluated on four standard datasets: UoB, USPTO, CLEF, and JPO. Images were converted to PDF and back to simulate degradation.</li>
<li><strong>Large Molecule Test:</strong> A custom dataset (<strong>OLED</strong>) was created from 12 journal papers (434 images) to test performance on larger, more complex structures (average 52.8 atoms).</li>
<li><strong>Ablations:</strong> The authors tested the impact of the Transformer encoder, auxiliary losses, and coordinate prediction.</li>
</ul>
<h2 id="empirical-results-and-robustness">Empirical Results and Robustness</h2>
<ul>
<li><strong>Benchmark Performance:</strong> The proposed model outperformed existing models with a 17.1% relative improvement on benchmark datasets.</li>
<li><strong>Robustness:</strong> On large molecules (OLED dataset), it achieved a 12.8% relative improvement over MolVec (and 20.0% over OSRA).</li>
<li><strong>Data Scaling:</strong> Adding real-world USPTO data to the synthetic training set improved performance by 20.5%, demonstrating the model&rsquo;s ability to learn from noisy real images that lack atom-coordinate labels.</li>
<li><strong>Handling Superatoms:</strong> The model successfully recognized pseudo-atoms (e.g., $R_1$, $R_2$, $R_3$) as distinct nodes. OSRA, which outputs SMILES, collapsed them into generic &ldquo;Any&rdquo; atoms since SMILES does not support non-atomic symbols. MolVec could not recognize them properly at all.</li>
</ul>
<h2 id="limitations-and-error-analysis">Limitations and Error Analysis</h2>
<p>The paper identifies two main failure modes on the USPTO, CLEF, and JPO benchmarks:</p>
<ol>
<li><strong>Unrecognized superatoms:</strong> The model struggles with complex multi-character superatoms not seen during training (e.g., NHNHCOCH$_3$ or H$_3$CO$_2$S). The authors propose character-level atom decoding as a future solution.</li>
<li><strong>Caption interference:</strong> The model sometimes misidentifies image captions as atoms, particularly on the JPO dataset. Data augmentation with arbitrary caption text or a dedicated image segmentation step could mitigate this.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors used a combination of synthetic and real-world data for training.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem</strong></td>
          <td>4.6M</td>
          <td>Synthetic images generated using RDKit. Random superatoms (e.g., $CF_3$, $NO_2$) were substituted to simulate abbreviations.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO</strong></td>
          <td>2.5M</td>
          <td>Real image-molecule pairs from patents. Used for robustness; lacks coordinate labels.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Benchmarks</strong></td>
          <td>~5.7k</td>
          <td>UoB, USPTO, CLEF, JPO. Average ~15.8 atoms per molecule.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>OLED</strong></td>
          <td>434</td>
          <td>Manually segmented from 12 journal papers. Large molecules (avg 52.8 atoms).</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing:</strong></p>
<ul>
<li>Input resolution is fixed at $800 \times 800$ pixels.</li>
<li>Images are virtually split into a $25 \times 25$ grid (625 patches total), where each patch is $32 \times 32$ pixels.</li>
</ul>
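<p>The virtual grid arithmetic above can be sketched directly; the row-major serialization order is an assumption:</p>

```python
# Sketch of the virtual patch grid: an 800x800 image divided into a 25x25
# grid of 32x32-pixel patches, serialized row-major for the Transformer.
IMAGE_SIZE, PATCH_SIZE = 800, 32
GRID = IMAGE_SIZE // PATCH_SIZE  # 25 patches per side, 625 total

def patch_index(x: int, y: int) -> int:
    """Map a pixel coordinate to its flattened patch index (0..624)."""
    return (y // PATCH_SIZE) * GRID + (x // PATCH_SIZE)
```
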
<h3 id="algorithms">Algorithms</h3>
<p><strong>Encoder Logic:</strong></p>
<ul>
<li><strong>Grid Serialization:</strong> The $25 \times 25$ grid is flattened into a 1D sequence. 2D position information is concatenated to ResNet features before the Transformer.</li>
<li><strong>Auxiliary Losses:</strong> To aid convergence, classifiers on the encoder predict three things <em>per patch</em>: (1) number of atoms, (2) characters in atom labels, and (3) edge-sharing neighbors. These losses decrease to zero during training.</li>
</ul>
<p><strong>Decoder Logic:</strong></p>
<ul>
<li><strong>Auto-regressive Generation:</strong> At step $t$, the decoder generates a new node and connects it to existing nodes.</li>
<li><strong>Attention Modulation:</strong> Attention weights are transformed using bond information:
$$
\begin{aligned}
\text{Att}(Q, K, V) = \text{softmax} \left( \frac{\Gamma \odot (QK^T) + B}{\sqrt{d_k}} \right) V
\end{aligned}
$$
where $(\gamma_{ij}, \beta_{ij}) = f(e_{ij})$, with $e_{ij}$ being the edge type (in one-hot representation) between nodes $i$ and $j$, and $f$ is a multi-layer perceptron. $\Gamma$ and $B$ are matrices whose elements at position $(i, j)$ are $\gamma_{ij}$ and $\beta_{ij}$, respectively.</li>
<li><strong>Coordinate Prediction:</strong> The decoder outputs coordinates for each atom, which acts as a mechanism to track attention history.</li>
</ul>
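<p>As a sketch, the modulated attention above can be written in a few lines of NumPy. The MLP $f$ that maps edge types to $(\gamma_{ij}, \beta_{ij})$ is stubbed out with fixed matrices (setting $\Gamma$ to all ones and $B$ to zeros recovers vanilla attention); shapes and names are illustrative, not the paper's implementation.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def edge_modulated_attention(Q, K, V, Gamma, B):
    """Attention whose logits are scaled (Gamma) and shifted (B) per node pair:
    Att(Q, K, V) = softmax((Gamma * QK^T + B) / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    logits = (Gamma * (Q @ K.T) + B) / np.sqrt(d_k)
    return softmax(logits) @ V

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
# In the paper, (gamma_ij, beta_ij) = f(e_ij) for an MLP f over one-hot edge
# types; here Gamma = 1 and B = 0, which reduces to standard attention.
out = edge_modulated_attention(Q, K, V, np.ones((n, n)), np.zeros((n, n)))
```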
<h3 id="models">Models</h3>
<ul>
<li><strong>Image Encoder:</strong> ResNet-34 backbone followed by a Transformer encoder.</li>
<li><strong>Graph Decoder:</strong> A &ldquo;Graph-Aware Transformer&rdquo; (GRAT) that outputs nodes (atom labels, coordinates) and edges (bond types).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on structural identity, as standard string matching (SMILES) is insufficient for graphs with superatoms.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMI</strong></td>
          <td>Canonical SMILES Match</td>
          <td>Correct if predicted SMILES is identical to ground truth.</td>
      </tr>
      <tr>
          <td><strong>TS 1</strong></td>
          <td>Tanimoto Similarity = 1.0</td>
          <td>Ratio of predictions with perfect fingerprint overlap.</td>
      </tr>
      <tr>
          <td><strong>Sim.</strong></td>
          <td>Average Tanimoto Similarity</td>
          <td>Measures average structural overlap across all predictions.</td>
      </tr>
  </tbody>
</table>
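<p>In practice, fingerprints and Tanimoto similarity are computed with a cheminformatics toolkit such as RDKit; on raw bit sets the similarity is simply intersection over union. A minimal pure-Python version:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints represented as
    sets of on-bit indices: |A & B| / |A | B|."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Two fingerprints sharing 2 of 4 distinct bits -> similarity 0.5.
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```

The <strong>TS 1</strong> metric above counts the fraction of predictions for which this value is exactly 1.0, while <strong>Sim.</strong> averages it over all predictions.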
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper does not release source code, pre-trained models, or the custom OLED evaluation dataset. The training data sources (PubChem, USPTO) are publicly available, but the specific image generation pipeline (modified RDKit with coordinate extraction and superatom substitution) is not released. Key architectural details (ResNet-34 backbone, Transformer encoder/decoder configuration) and training techniques are described, but exact hyperparameters for full reproduction are limited.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source of 4.6M molecules for synthetic image generation</td>
      </tr>
      <tr>
          <td><a href="https://www.uspto.gov/">USPTO</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>2.5M real image-molecule pairs from patents</td>
      </tr>
      <tr>
          <td><a href="https://www.rdkit.org/">RDKit</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Used (with modifications) for synthetic image generation</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoo, S., Kwon, O., &amp; Lee, H. (2022). Image-to-Graph Transformers for Chemical Structure Recognition. <em>ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 3393-3397. <a href="https://doi.org/10.1109/ICASSP43922.2022.9746088">https://doi.org/10.1109/ICASSP43922.2022.9746088</a></p>
<p><strong>Publication</strong>: ICASSP 2022</p>
]]></content:encoded></item><item><title>ICMDT: Automated Chemical Structure Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</guid><description>A Transformer-based model (ICMDT) for converting chemical structure images into InChI text strings using a novel Deep TNT block.</description><content:encoded><![CDATA[<h2 id="contribution-image-to-text-translation-for-chemical-structures">Contribution: Image-to-Text Translation for Chemical Structures</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel neural network architecture, the <strong>Image Captioning Model based on Deep TNT (ICMDT)</strong>, to solve the specific problem of &ldquo;molecular translation&rdquo; (image-to-text). The classification is supported by the following rhetorical indicators:</p>
<ul>
<li><strong>Novel Mechanism:</strong> It introduces the &ldquo;Deep TNT block&rdquo; to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).</li>
<li><strong>Baseline Comparison:</strong> The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).</li>
<li><strong>Ablation Study:</strong> Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.</li>
</ul>
<h2 id="motivation-digitizing-historical-chemical-literature">Motivation: Digitizing Historical Chemical Literature</h2>
<p>The primary motivation is to speed up chemical research by digitizing historical chemical literature.</p>
<ul>
<li><strong>Problem:</strong> Historical sources often contain corrupted or noisy images, making automated recognition difficult.</li>
<li><strong>Gap:</strong> Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.</li>
<li><strong>Goal:</strong> To build a dependable generative model that can accurately translate these noisy images into <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> (International Chemical Identifier) text strings.</li>
</ul>
<h2 id="novelty-multi-level-feature-fusion-with-deep-tnt">Novelty: Multi-Level Feature Fusion with Deep TNT</h2>
<p>The core contribution is the <strong>Deep TNT block</strong> and the resulting <strong>ICMDT</strong> architecture.</p>
<ul>
<li><strong>Deep TNT Block:</strong> The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
<ol>
<li><strong>Internal Transformer:</strong> Processes pixel embeddings.</li>
<li><strong>Middle Transformer:</strong> Processes small patch embeddings.</li>
<li><strong>Exterior Transformer:</strong> Processes large patch embeddings.</li>
</ol>
</li>
<li><strong>Multi-level Fusion:</strong> The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.</li>
<li><strong>Position Encoding:</strong> A specific strategy of applying shared position encodings to small patches and pixels, while using a learnable 1D encoding for large patches.</li>
</ul>
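<p>A rough NumPy sketch of the multi-level fusion: flattened finer-grained tokens are linearly projected and added into the next coarser embedding (pixels into small patches, small patches into large patches). Dimensions are arbitrary and the transformer blocks between fusion steps are omitted; this illustrates only the data flow, not the paper's implementation.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_large, small_per_large, pix_per_small = 2, 4, 16
d_pix, d_small, d_large = 8, 16, 32          # illustrative embedding sizes

pix = rng.normal(size=(n_large, small_per_large, pix_per_small, d_pix))
small = rng.normal(size=(n_large, small_per_large, d_small))
large = rng.normal(size=(n_large, d_large))

# Projections that fold the flattened finer tokens into the coarser embedding.
W_pix_to_small = rng.normal(size=(pix_per_small * d_pix, d_small)) * 0.01
W_small_to_large = rng.normal(size=(small_per_large * d_small, d_large)) * 0.01

small = small + pix.reshape(n_large, small_per_large, -1) @ W_pix_to_small
large = large + small.reshape(n_large, -1) @ W_small_to_large
```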
<h2 id="methodology-benchmarking-on-the-bms-dataset">Methodology: Benchmarking on the BMS Dataset</h2>
<p>The authors evaluated the model on the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset.</p>
<ul>
<li><strong>Baselines:</strong> They constructed four comparative models:
<ul>
<li>EfficientNetb0 + RNN (Bi-LSTM)</li>
<li>ResNet50d + RNN (Bi-LSTM)</li>
<li>EfficientNetb0 + Transformer</li>
<li>ResNet101d + Transformer</li>
</ul>
</li>
<li><strong>Ablation:</strong> They tested the impact of removing the large patch position encoding (ICMDT*), reverting the encoder to a standard TNT-S (TNTD), and setting the patch size to 32 directly on TNT-S without the exterior transformer block (TNTD-B).</li>
<li><strong>Pre-processing Study:</strong> They experimented with denoising ratios and cropping strategies.</li>
</ul>
<h2 id="results--conclusions-improved-inchi-translation-accuracy">Results &amp; Conclusions: Improved InChI Translation Accuracy</h2>
<ul>
<li><strong>Performance:</strong> ICMDT achieved the lowest <strong>Levenshtein distance (0.69)</strong> among all five models tested (Table 3). The best-performing baseline was ResNet101d+Transformer.</li>
<li><strong>Convergence:</strong> The model converged significantly faster than the baselines, outperforming others as early as epoch 6.7.</li>
<li><strong>Ablation Results:</strong> The full Deep TNT block reduced error by nearly half compared to the standard TNT encoder (0.69 vs 1.29 Levenshtein distance). Removing large patch position encoding (ICMDT*) degraded performance to 1.04, and directly using patch size 32 on TNT-S (TNTD-B) scored 1.37.</li>
<li><strong>Limitations:</strong> The model struggles with <strong>stereochemical layers</strong> (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.</li>
<li><strong>Inference &amp; Fusion:</strong> The multi-model inference and fusion pipeline (beam search, TTA, step-wise logit ensemble, and voting) reduced Levenshtein distance by 0.24 to 2.5 relative to single models.</li>
<li><strong>Future Work:</strong> Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p><strong>Status: Partially Reproducible.</strong> The dataset is publicly available through Kaggle, and the paper provides detailed hyperparameters and architecture specifications. However, no source code or pretrained model weights have been released.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Molecular Translation (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition Terms</td>
          <td>Training/test images with InChI labels</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components:</strong> No official code repository or pretrained weights. Reimplementation requires reconstructing the Deep TNT block, training pipeline, and inference/fusion strategy from the paper description alone.</p>
<p><strong>Hardware/compute requirements:</strong> Not explicitly stated in the paper.</p>
<h3 id="data">Data</h3>
<p>The experiments used the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset from Kaggle.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BMS Training Set</td>
          <td>2,424,186 images</td>
          <td>Supervised; contains noise and blur</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BMS Test Set</td>
          <td>1,616,107 images</td>
          <td>Higher noise variation than training set</td>
      </tr>
  </tbody>
</table>
<p><strong>Pre-processing Strategy</strong>:</p>
<ul>
<li><strong>Effective:</strong> Padding resizing (reshaping to square using the longer edge, padding insufficient parts with pixels from the middle of the image).</li>
<li><strong>Ineffective:</strong> Smart cropping (removing white borders degraded performance).</li>
<li><strong>Augmentation:</strong> GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).</li>
<li><strong>Denoising:</strong> Best results found by mixing denoised and original data (Ratio 2:13) during training.</li>
</ul>
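<p>A minimal sketch of the padding-resize step for a grayscale image. The paper fills the padded region with pixels taken from the middle of the image; a constant white background is used here as a simplification:</p>

```python
import numpy as np

def pad_to_square(img: np.ndarray, fill: int = 255) -> np.ndarray:
    """Pad a grayscale H x W image to max(H, W) x max(H, W), centering the
    original content (constant fill stands in for the paper's middle-pixel
    fill)."""
    h, w = img.shape
    side = max(h, w)
    out = np.full((side, side), fill, dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    out[top:top + h, left:left + w] = img
    return out
```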
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer:</strong> Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).</li>
<li><strong>Loss Function:</strong> Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. Standard Focal Loss adds a modulating factor $(1-p_t)^\gamma$ to cross-entropy to focus on hard negatives. Anti-Focal Loss (Raunak et al., 2020) modifies this factor to reduce the disparity between training and inference distributions in Seq2Seq models.</li>
<li><strong>Training Schedule:</strong>
<ul>
<li>Initial resolution: $224 \times 224$</li>
<li>Fine-tuning: Resolution $384 \times 384$ for labels longer than 150 characters.</li>
<li>Batch size: Dynamic, increasing from 16 to 1024 (with proportional learning rate scaling).</li>
<li>Noisy Labels: Randomly replacing chemical elements in labels with a certain probability to improve robustness during inference.</li>
</ul>
</li>
<li><strong>Inference Strategy:</strong>
<ul>
<li>Beam Search ($k=16$ initially, $k=64$ if failing InChI validation).</li>
<li>Test Time Augmentation (TTA): Rotations of $90^\circ$.</li>
<li>Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.</li>
</ul>
</li>
</ul>
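<p>For reference, the standard focal loss that Anti-Focal loss modifies can be sketched as below; the exact Anti-Focal modulating factor is defined in Raunak et al. (2020) and is not reproduced here:</p>

```python
import math

def focal_loss(p_t: float, gamma: float = 0.5) -> float:
    """Cross-entropy -log(p_t) scaled by the modulating factor (1 - p_t)**gamma,
    which down-weights well-classified examples (p_t near 1)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# gamma = 0 recovers plain cross-entropy; larger gamma shrinks easy examples.
```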
<h3 id="models">Models</h3>
<p><strong>ICMDT Architecture:</strong></p>
<ul>
<li><strong>Encoder (Deep TNT)</strong> (Depth: 12 layers):
<ul>
<li><strong>Internal Block:</strong> Dim 160, Heads 4, Hidden size 640, MLP act GELU, Pixel patch size 4.</li>
<li><strong>Middle Block:</strong> Dim 10, Heads 6, Hidden size 128, MLP act GELU, Small patch size 16.</li>
<li><strong>Exterior Block:</strong> Dim 2560, Heads 10, Hidden size 5120, MLP act GELU, Large patch size 32.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Decoder dim: 2560, FFN dim: 1024.</li>
<li>Depth: 3 layers, Heads: 8.</li>
<li>Vocab size: 193 (InChI tokens), text_dim: 384.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric:</strong> Levenshtein Distance (measures single-character edit operations between generated and ground truth InChI strings).</p>
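<p>The metric can be computed with the classic dynamic-programming recurrence over the two strings:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```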
<p><strong>Ablation Results (Table 3 from paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Levenshtein Distance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td><strong>0.69</strong></td>
      </tr>
      <tr>
          <td>ICMDT*</td>
          <td>138.16</td>
          <td>1.04</td>
      </tr>
      <tr>
          <td>TNTD</td>
          <td>114.36</td>
          <td>1.29</td>
      </tr>
      <tr>
          <td>TNTD-B</td>
          <td>114.36</td>
          <td>1.37</td>
      </tr>
  </tbody>
</table>
<p><strong>Baseline Comparison (from convergence curves, Figure 9):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Convergence (Epochs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td>~9.76</td>
      </tr>
      <tr>
          <td>ResNet101d + Transformer</td>
          <td>302.02</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + Transformer</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>ResNet50d + RNN</td>
          <td>90.6</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + RNN</td>
          <td>46.3</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, Y., Chen, G., &amp; Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. <em>Applied Sciences</em>, 12(2), 680. <a href="https://doi.org/10.3390/app12020680">https://doi.org/10.3390/app12020680</a></p>
<p><strong>Publication</strong>: MDPI Applied Sciences 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">Kaggle Competition: BMS Molecular Translation</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liAutomatedRecognitionChemical2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Li, Yanchi and Chen, Guanyu and Li, Xiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Applied Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{680}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Multidisciplinary Digital Publishing Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2076-3417}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3390/app12020680}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Structure Recognition with RCGD</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</guid><description>An end-to-end framework (RCGD) and unambiguous markup language (SSML) for recognizing complex handwritten chemical structures with guided graph traversal.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-framework">Contribution and Methodological Framework</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architectural framework (<strong>RCGD</strong>) and a new representation syntax (<strong>SSML</strong>) to solve the specific problem of handwritten chemical structure recognition.</li>
<li><strong>Resource</strong>: It introduces a new benchmark dataset, <strong>EDU-CHEMC</strong>, containing 50,000 handwritten images to address the lack of public data in this domain.</li>
</ul>
<h2 id="the-ambiguity-of-handwritten-chemical-structures">The Ambiguity of Handwritten Chemical Structures</h2>
<p>Recognizing handwritten chemical structures is significantly harder than printed ones due to:</p>
<ol>
<li><strong>Inherent Ambiguity</strong>: Handwritten atoms and bonds vary greatly in appearance.</li>
<li><strong>Projection Complexity</strong>: Converting 2D projected layouts (like Natta or Fischer projections) into linear strings is difficult.</li>
<li><strong>Limitations of Existing Formats</strong>: Standard formats like SMILES require domain knowledge (valence rules) and have a high semantic gap with the visual image. They often fail to represent &ldquo;invalid&rdquo; structures commonly found in educational/student work.</li>
</ol>
<h2 id="bridging-the-semantic-gap-with-ssml-and-rcgd">Bridging the Semantic Gap with SSML and RCGD</h2>
<p>The paper introduces two core contributions to bridge the semantic gap between image and markup:</p>
<ol>
<li>
<p><strong>Structure-Specific Markup Language (SSML)</strong>: An extension of Chemfig that provides an unambiguous, visual-based graph representation. Unlike SMILES, it describes <em>how to draw</em> the molecule step-by-step, making it easier for models to learn visual alignments. It supports &ldquo;reconnection marks&rdquo; to handle cyclic structures explicitly.</p>
</li>
<li>
<p><strong>Random Conditional Guided Decoder (RCGD)</strong>: A decoder that treats recognition as a graph traversal problem. It introduces three novel mechanisms:</p>
<ul>
<li><strong>Conditional Attention Guidance</strong>: Uses branch angle directions to guide the attention mechanism, preventing the model from getting lost in complex structures.</li>
<li><strong>Memory Classification</strong>: A module that explicitly stores and classifies &ldquo;unexplored&rdquo; branch points to handle ring closures (reconnections).</li>
<li><strong>Path Selection</strong>: A training strategy that randomly samples traversal paths to prevent overfitting to a specific serialization order.</li>
</ul>
</li>
</ol>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p><strong>Datasets</strong>:</p>
<ul>
<li><strong>Mini-CASIA-CSDB</strong> (Printed): A subset of 97,309 printed molecular structure images, upscaled to $500 \times 500$ resolution.</li>
<li><strong>EDU-CHEMC</strong> (Handwritten): A new dataset of 52,987 images collected from educational settings (cameras, scanners, screens), including erroneous/non-existent structures.</li>
</ul>
<p><strong>Baselines</strong>:</p>
<ul>
<li>Compared against standard <strong>String Decoders (SD)</strong> (based on DenseWAP), tested with both SMILES and SSML on Mini-CASIA-CSDB and exclusively with SSML on EDU-CHEMC.</li>
<li>Compared against <strong>BTTR</strong> and <strong>ABM</strong> (recent mathematical expression recognition models) adapted for the chemical structure task, both using SSML on EDU-CHEMC.</li>
<li>On Mini-CASIA-CSDB, also compared against <strong>WYGIWYS</strong> (a SMILES-based string decoder at 300x300 resolution).</li>
</ul>
<p><strong>Ablation Studies</strong>:</p>
<ul>
<li>Evaluated the impact of removing Path Selection (PS) and Memory Classification (MC) mechanisms on EDU-CHEMC.</li>
<li>Tested robustness to image rotation ($180^{\circ}$) on Mini-CASIA-CSDB.</li>
</ul>
<h2 id="recognition-performance-and-robustness">Recognition Performance and Robustness</h2>
<ul>
<li><strong>Superiority of SSML</strong>: Models trained with SSML significantly outperformed those trained with SMILES (92.09% vs 81.89% EM on printed data) due to reduced semantic gap.</li>
<li><strong>Best Performance</strong>: RCGD achieved the highest Exact Match (EM) scores on both datasets:
<ul>
<li><strong>Mini-CASIA-CSDB</strong>: 95.01% EM.</li>
<li><strong>EDU-CHEMC</strong>: 62.86% EM.</li>
</ul>
</li>
<li><strong>EDU-CHEMC Baselines</strong>: On the handwritten dataset, SD (DenseWAP) achieved 61.35% EM, outperforming both BTTR (58.21% EM) and ABM (58.78% EM). The authors note that BTTR and ABM&rsquo;s reverse training mode, which helps in regular formula recognition, does not transfer well to graph-structured molecular data.</li>
<li><strong>Ablation Results</strong> (Table 5, EDU-CHEMC): Removing Path Selection alone dropped EM from 62.86% to 62.15%. Removing both Path Selection and Memory Classification dropped EM further to 60.31%, showing that memory classification has a larger impact.</li>
<li><strong>Robustness</strong>: RCGD showed minimal performance drop (0.85%) on rotated images compared to SMILES-based methods (10.36% drop). The SD with SSML dropped by 2.19%, confirming that SSML itself improves rotation invariance.</li>
<li><strong>Educational Utility</strong>: The method can recognize and reconstruct chemically invalid structures (e.g., a Carbon atom with 5 bonds), making it applicable for correcting and revising handwritten answers in chemistry education.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>1. EDU-CHEMC (Handwritten)</strong></p>
<ul>
<li><strong>Total Size</strong>: 52,987 images.</li>
<li><strong>Splits</strong>: Training (48,998), Validation (999), Test (2,992).</li>
<li><strong>Characteristics</strong>: Real-world educational data, mixture of isolated molecules and reaction equations, includes invalid chemical structures.</li>
</ul>
<p><strong>2. Mini-CASIA-CSDB (Printed)</strong></p>
<ul>
<li><strong>Total Size</strong>: 97,309 images.</li>
<li><strong>Splits</strong>: Training (80,781), Validation (8,242), Test (8,286).</li>
<li><strong>Preprocessing</strong>: Original $300 \times 300$ images were upscaled to $500 \times 500$ RGB to resolve blurring issues.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SSML Generation</strong></p>
<p>To convert a molecular graph to SSML:</p>
<ol>
<li><strong>Traverse</strong>: Start from the left-most atom.</li>
<li><strong>Bonds/Atoms</strong>: Output atom text and bond format <code>&lt;bond&gt;[:&lt;angle&gt;]</code>.</li>
<li><strong>Branches</strong>: At branch points, use phantom symbols <code>(</code> and <code>)</code> to enclose branches, ordered by ascending bond angle.</li>
<li><strong>Reconnections</strong>: Use <code>?[tag]</code> and <code>?[tag, bond]</code> to mark start/end of ring closures.</li>
</ol>
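<p>Using the syntax described above, a hypothetical SSML-style string for a six-membered carbon ring might look like the following (angles in degrees after <code>:</code>, with <code>?[a]</code> / <code>?[a,-]</code> marking the start and end of the ring reconnection). This is a Chemfig-flavored sketch; the paper's exact token inventory may differ:</p>

```text
C?[a]-[:60]C-[:0]C-[:300]C-[:240]C-[:180]C?[a,-]
```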
<p><strong>2. RCGD Specifics</strong></p>
<ul>
<li><strong>RCGD-SSML</strong>: Modified version of SSML for the decoder. Removes <code>(</code> <code>)</code> delimiters; adds <code>\eob</code> (end of branch). Maintains a dynamic <strong>Branch Angle Set ($M$)</strong>.</li>
<li><strong>Path Selection</strong>: During training, when multiple branches exist in $M$, the model randomly selects one to traverse next. During inference, it uses beam search to score candidate paths.</li>
<li><strong>Loss Function</strong>:
$$
\begin{aligned}
L_{\text{total}} = L_{\text{ce}} + L_{\text{bc}}
\end{aligned}
$$
<ul>
<li>$L_{\text{ce}}$: Cross-entropy loss for character sequence generation.</li>
<li>$L_{\text{bc}}$: Multi-label classification loss for the memory module (predicting reconnection bond types for stored branch states).</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>: DenseNet</p>
<ul>
<li><strong>Structure</strong>: 3 dense blocks.</li>
<li><strong>Growth Rate</strong>: 24.</li>
<li><strong>Depth</strong>: 32 per block.</li>
<li><strong>Output</strong>: High-dimensional feature map $x \in \mathbb{R}^{d_x \times h \times w}$.</li>
</ul>
<p><strong>Decoder</strong>: GRU with Attention</p>
<ul>
<li><strong>Hidden State Dimension</strong>: 256.</li>
<li><strong>Embedding Dimension</strong>: 256.</li>
<li><strong>Attention Projection</strong>: 128.</li>
<li><strong>Memory Classification Projection</strong>: 256.</li>
</ul>
<p><strong>Training Config</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: Adam.</li>
<li><strong>Learning Rate</strong>: 2e-4 with multi-step decay (gamma 0.5).</li>
<li><strong>Dropout</strong>: 15%.</li>
<li><strong>Strategy</strong>: Teacher-forcing used for validation selection.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Exact Match (EM)</strong>: Percentage of samples where the predicted graph structure perfectly matches the label. For SMILES, string comparison; for SSML, converted to graph for isomorphism check.</li>
<li><strong>Structure EM</strong>: Auxiliary metric for samples with mixed content (text + molecules), counting samples where <em>all</em> molecular structures are correct.</li>
</ul>
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">EDU-CHEMC</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Dataset annotations and download links (actual data hosted on Google Drive)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing Components</strong>:</p>
<ul>
<li>No training or inference code is publicly released; only the dataset is available.</li>
<li>Pre-trained model weights are not provided.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, J., Wu, H., Chen, M., Liu, C., Wu, J., Yin, S., Yin, B., Yin, B., Liu, C., Du, J., &amp; Dai, L. (2023). Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder. <em>Proceedings of the 31st ACM International Conference on Multimedia</em> (pp. 8114-8124). <a href="https://doi.org/10.1145/3581783.3612573">https://doi.org/10.1145/3581783.3612573</a></p>
<p><strong>Publication</strong>: ACM Multimedia 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">GitHub Repository / EDU-CHEMC Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{huHandwrittenChemicalStructure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 31st ACM International Conference on Multimedia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hu, Jinshui and Wu, Hao and Chen, Mingjun and Liu, Chenyu and Wu, Jiajia and Yin, Shi and Yin, Baocai and Yin, Bing and Liu, Cong and Du, Jun and Dai, Lirong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{8114--8124}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Ottawa ON Canada}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3581783.3612573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{979-8-4007-0108-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>End-to-End Transformer for Molecular Image Captioning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</guid><description>Vision Transformer encoder with Transformer decoder for molecular image-to-InChI translation, outperforming CNN baselines on noisy molecular datasets.</description><content:encoded><![CDATA[<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological Paper</strong>. It proposes a novel architectural approach to molecular image translation by replacing the standard CNN encoder with a Vision Transformer (ViT). The authors validate this method through comparative benchmarking against standard CNN+RNN baselines (e.g., ResNet+LSTM) and provide optimizations for inference speed.</p>
<h2 id="motivation-and-problem-statement">Motivation and Problem Statement</h2>
<p>The core problem addressed is that existing molecular translation methods (extracting chemical structures from images into the computer-readable InChI format) rely heavily on rule-based systems or CNN+RNN architectures. These approaches often underperform on noisy images (common in scanned old journals) or images with few distinguishable features. There is a significant need in drug discovery to digitize and analyze legacy experimental data locked in image format within scientific publications.</p>
<h2 id="core-innovations-end-to-end-vit-encoder">Core Innovations: End-to-End ViT Encoder</h2>
<p>The primary contribution is the use of a completely convolution-free Vision Transformer (ViT) as the encoder, allowing the model to utilize long-range dependencies among image patches from the very beginning via self-attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
The architecture is a pure Transformer (Encoder-Decoder), treating the molecular image similarly to a sequence of tokens (patches). Furthermore, the authors implement a specific caching strategy for the decoder to avoid recomputing embeddings for previously decoded tokens, reducing the time complexity of the decoding step.</p>
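<p>The effect of this caching can be made concrete with a toy cost model (an illustrative sketch, not the authors&rsquo; implementation) that counts attention operations per decoding step for $M$ image tokens and $N$ decoded tokens:</p>

```python
def decode_cost(n_steps: int, m_image_tokens: int, cached: bool) -> int:
    """Count attention 'operations' for autoregressive decoding.

    Without caching, every step re-encodes all previously decoded tokens,
    so step t pays roughly t * (m + t) work. With caching, only the newest
    token attends over the m image features and the t cached states: m + t.
    """
    total = 0
    for t in range(1, n_steps + 1):
        if cached:
            total += m_image_tokens + t          # O(M + N) per step
        else:
            total += t * (m_image_tokens + t)    # O(t(M + t)) per step
    return total

# Summed over N steps this gives O(MN + N^2) with caching
# versus O(MN^2 + N^3) without, matching the paper's analysis.
print(decode_cost(10, 5, cached=True), decode_cost(10, 5, cached=False))
```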
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was compared against a standard CNN + RNN and ResNet (18, 34, 50) + LSTM with attention. Ablation studies varied the number of transformer layers (3, 6, 12, 24) and the image resolution ($224 \times 224$ vs. $384 \times 384$). The model was trained on a large combined dataset, including Bristol Myers Squibb data, SMILES, GDB-13, and synthetically augmented images containing noise and artifacts. Performance was evaluated using the Levenshtein distance, the minimum number of single-character edits needed to transform the predicted string into the ground truth.</p>
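<p>For reference, the evaluation metric can be computed with a standard dynamic-programming routine (a generic implementation, not the authors&rsquo; code):</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("InChI=1S/C2H6O", "InChI=1S/C2H6S"))  # 1
```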
<h2 id="performance-outcomes-and-capabilities">Performance Outcomes and Capabilities</h2>
<p>The proposed 24-layer ViT model (input size 384) achieved the lowest Levenshtein distance of <strong>6.95</strong>, outperforming the ResNet50+LSTM baseline (7.49) and the standard CNN+RNN (103.7). Increasing the number of layers had a strong positive impact, with the 24-layer model becoming competitive with current approaches. The authors note the model was evaluated on noisy datasets with few distinguishable features, where the ViT encoder&rsquo;s self-attention over all patches from the first layer helped capture relevant structure. The proposed caching optimization reduced the total decoding time complexity from $O(MN^2 + N^3)$ to $O(MN + N^2)$ for $N$ timesteps, by reducing the per-timestep cost to $O(M + N)$.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combined dataset randomly split into 70% training, 20% validation, and 10% test.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Bristol Myers Squibb</strong></td>
          <td>~2.4 million synthetic images with InChI labels.</td>
          <td>Provided by Bristol Myers Squibb, a global biopharmaceutical company.</td>
      </tr>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>Kaggle contest data converted to InChI.</td>
          <td>Images generated using RDKit.</td>
      </tr>
      <tr>
          <td><strong><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></strong></td>
          <td>Subset of 977 million small organic molecules (up to 13 atoms).</td>
          <td>Converted from SMILES using RDKit.</td>
      </tr>
      <tr>
          <td><strong>Augmented Images</strong></td>
          <td>Synthetic images with salt/pepper noise, dropped atoms, and bond modifications.</td>
          <td>Used to improve robustness against noise.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Objective</strong>: Cross-entropy loss minimization.</li>
<li><strong>Inference Decoding</strong>: Autoregressive decoding predicting the next character of the InChI string.</li>
<li><strong>Positional Encoding</strong>: Standard sine and cosine functions of different frequencies.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Caching</strong>: Caches the output of each layer during decoding to avoid recomputing embeddings for already decoded tokens.</li>
<li><strong>JIT</strong>: PyTorch JIT compiler used for graph optimization (1.2-1.5x speed increase on GPU).</li>
<li><strong>Self-Critical Training</strong>: Finetuning performed using self-critical sequence training (SCST).</li>
</ul>
</li>
</ul>
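<p>The sinusoidal encoding listed above follows the original Transformer formulation; a plain-Python sketch (illustrative, using a nested-list layout rather than tensors):</p>

```python
import math

def positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Standard sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # i is the even column index (= 2i above)
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=64, d_model=512)
```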
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder (Vision Transformer)</strong>:
<ul>
<li>Input: Flattened 2D patches of the image. Patch size: $16 \times 16$.</li>
<li>Projection: Trainable linear projection to latent vector size $D$.</li>
<li>Structure: Alternating layers of Multi-Head Self-Attention (MHSA) and MLP blocks.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Input: Tokenized InChI string + sinusoidal positional embedding.</li>
<li>Vocabulary: 275 tokens (including <code>&lt;SOS&gt;</code>, <code>&lt;PAD&gt;</code>, <code>&lt;EOS&gt;</code>).</li>
</ul>
</li>
<li><strong>Hyperparameters (Best Model)</strong>:
<ul>
<li>Image Size: $384 \times 384$.</li>
<li>Layers: 24.</li>
<li>Feature Dimension: 512.</li>
<li>Attention Heads: 12.</li>
<li>Optimizer: Adam.</li>
<li>Learning Rate: $3 \times 10^{-5}$ (decayed by 0.5 in last 2 epochs).</li>
<li>Batch Size: Varied [64-512].</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Levenshtein Distance (lower is better).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Image Size</th>
          <th>Layers</th>
          <th>Epochs</th>
          <th>Levenshtein Dist.</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard CNN+RNN</td>
          <td>224</td>
          <td>3</td>
          <td>10</td>
          <td>103.7</td>
      </tr>
      <tr>
          <td>ResNet18 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>75.03</td>
      </tr>
      <tr>
          <td>ResNet34 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>45.72</td>
      </tr>
      <tr>
          <td>ResNet50 + LSTM</td>
          <td>224</td>
          <td>5</td>
          <td>10</td>
          <td>7.49</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>3</td>
          <td>5</td>
          <td>79.82</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>6</td>
          <td>5</td>
          <td>54.58</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>12</td>
          <td>5</td>
          <td>31.30</td>
      </tr>
      <tr>
          <td>ViT Transformers (Best)</td>
          <td>384</td>
          <td>24</td>
          <td>10</td>
          <td><strong>6.95</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>System</strong>: 70GB GPU system.</li>
<li><strong>Framework</strong>: PyTorch and PyTorch Lightning.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sundaramoorthy, C., Kelvin, L. Z., Sarin, M., &amp; Gupta, S. (2021). End-to-End Attention-based Image Captioning. <em>arXiv preprint arXiv:2104.14721</em>. <a href="https://doi.org/10.48550/arXiv.2104.14721">https://doi.org/10.48550/arXiv.2104.14721</a></p>
<p><strong>Publication</strong>: arXiv 2021 (preprint)</p>
<p><strong>Note</strong>: This is an arXiv preprint and has not undergone formal peer review.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{sundaramoorthyEndtoEndAttentionbasedImage2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{End-to-{{End Attention-based Image Captioning}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sundaramoorthy, Carola and Kelvin, Lin Ziwen and Sarin, Mahak and Gupta, Shubham}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER 1.0: Transformers for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</guid><description>Transformer-based approach for Optical Chemical Structure Recognition converting chemical images to SELFIES strings with 96% accuracy.</description><content:encoded><![CDATA[<h2 id="evaluating-the-contribution-a-methodological-shift">Evaluating the Contribution: A Methodological Shift</h2>
<p><strong>Method (Dominant)</strong> with strong <strong>Resource</strong> elements.</p>
<p>This is primarily a <strong>Method</strong> paper because it proposes a specific architectural evolution. It replaces CNN-RNN/Encoder-Decoder models with a <strong>Transformer-based network</strong> to solve the problem of image-to-structure translation. It validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.</p>
<p>It also serves as a <strong>Resource</strong> contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (&gt;35 million molecules).</p>
<h2 id="motivation-inaccessible-chemical-knowledge">Motivation: Inaccessible Chemical Knowledge</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.</li>
<li><strong>Manual Bottlenecks</strong>: Manual curation and extraction of this data is tedious, slow, and error-prone.</li>
<li><strong>Limitations of Prior Tools</strong>: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.</li>
</ul>
<h2 id="key-innovation-transformer-based-molecular-translation">Key Innovation: Transformer-Based Molecular Translation</h2>
<ul>
<li><strong>Transformer Architecture</strong>: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a <strong>Transformer-based decoder</strong>, significantly improving accuracy.</li>
<li><strong>EfficientNet Backbone</strong>: Replaces the standard InceptionV3 feature extractor with <strong>EfficientNet-B3</strong>, which improved feature extraction quality for chemical images.</li>
<li><strong>SELFIES Representation</strong>: Utilizes <a href="/notes/chemistry/molecular-representations/notations/selfies/"><strong>SELFIES</strong></a> (SELF-referencing Embedded Strings) as the target output. This guarantees 100% robust molecular strings and eliminates the &ldquo;invalid SMILES&rdquo; problem common in generative models.</li>
<li><strong>Massive Scaling</strong>: Trains on synthetic datasets derived from PubChem (up to <strong>39 million molecules</strong> total, with the largest training subset at ~35 million), demonstrating that scaling data size directly correlates with improved model performance.</li>
</ul>
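<p>Because every SELFIES symbol is a self-contained bracketed token, tokenization reduces to splitting on brackets; a minimal regex-based sketch (illustrative; the paper itself uses the Keras tokenizer):</p>

```python
import re

def selfies_tokens(s: str) -> list[str]:
    """Split a SELFIES string into its bracketed tokens.
    Each token is an atom or a syntactic symbol like [Branch1] or [Ring1];
    every sequence of such tokens decodes to *some* valid molecule, which
    is why SELFIES output can never be an invalid structure string."""
    tokens = re.findall(r"\[[^\]]*\]", s)
    # Sanity check: the tokens must cover the whole string.
    assert "".join(tokens) == s, "not a well-formed SELFIES string"
    return tokens

print(selfies_tokens("[C][C][=Branch1][C][=O][O]"))
# ['[C]', '[C]', '[=Branch1]', '[C]', '[=O]', '[O]']
```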
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<ul>
<li><strong>Feature Extractor Ablation</strong>: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics. The structural similarity between predicted and ground truth molecules was measured via Tanimoto similarity over molecular fingerprints:
$$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$</li>
<li><strong>Data Scaling</strong>: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.</li>
<li><strong>Stereochemistry &amp; Ions</strong>: Tested the model&rsquo;s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.</li>
<li><strong>Augmentation Robustness</strong>: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.</li>
</ul>
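<p>For binary fingerprints the dot products in the Tanimoto formula reduce to set intersections, so the metric can be sketched in a few lines (illustrative code; in practice one would use RDKit&rsquo;s fingerprint routines):</p>

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    represented here as sets of 'on' bit positions:
    T = |A n B| / (|A| + |B| - |A n B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Hypothetical bit positions for a predicted vs. ground-truth fingerprint:
fp_pred = {1, 5, 8, 42, 77}
fp_true = {1, 5, 8, 42, 99}
print(round(tanimoto(fp_pred, fp_true), 4))  # 0.6667
```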
<h2 id="results-and-scaling-observations">Results and Scaling Observations</h2>
<ul>
<li><strong>Architecture Comparison</strong>: The Transformer model with EfficientNet-B3 features outperformed the Encoder-Decoder baseline by a wide margin. On the 1M dataset, the Transformer achieved <strong>74.57%</strong> exact matches (Tanimoto 1.0) compared to only <strong>7.03%</strong> for the Encoder-Decoder (Table 4 in the paper).</li>
<li><strong>High Accuracy at Scale</strong>: With the full 35-million molecule training set (Dataset 1), the model achieved a <strong>Tanimoto 1.0 score of 96.47%</strong> and an average Tanimoto similarity of 0.99.</li>
<li><strong>Isomorphism</strong>: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>).</li>
<li><strong>Stereochemistry Costs</strong>: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).</li>
<li><strong>Hardware Efficiency</strong>: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs. For the 1M molecule model, convergence took ~8h 41min on TPU v3-8 vs ~29h 48min on V100 GPU. The largest model (35M) took less than 14 days on TPU.</li>
<li><strong>Augmentation Robustness (Dataset 3)</strong>: When trained on augmented images and tested on non-augmented images, the model achieved 86.43% Tanimoto 1.0. Using a pre-trained model from Dataset 2 and refitting on augmented images improved this to 88.04% on non-augmented test images and 80.87% on augmented test images, retaining above 97% isomorphism rates.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors generated synthetic data from PubChem.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 1 (Clean)</td>
          <td>39M total (35M train)</td>
          <td>No stereo/ions. Filtered for MW &lt; 1500, bond count 3-40, SMILES len &lt; 40.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 2 (Complex)</td>
          <td>37M total (33M train)</td>
          <td>Includes stereochemistry and charged groups (ions).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 3 (Augmented)</td>
          <td>37M total (33M train)</td>
          <td>Dataset 2 with image augmentations applied.</td>
      </tr>
      <tr>
          <td><strong>Preprocessing</strong></td>
          <td>N/A</td>
          <td>N/A</td>
          <td>Molecules converted to <strong>SELFIES</strong>. Images generated via CDK Structure Diagram Generator (SDG) as $299 \times 299$ 8-bit PNGs.</td>
      </tr>
      <tr>
          <td><strong>Format</strong></td>
          <td>TFRecords</td>
          <td>75 MB chunks</td>
          <td>128 Data points (image vector + tokenized string) per record.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Text Representation</strong>: <strong>SELFIES</strong> used to avoid invalid intermediate strings. Tokenized via Keras tokenizer.
<ul>
<li><em>Dataset 1 Tokens</em>: 27 unique tokens. Max length 47.</li>
<li><em>Dataset 2/3 Tokens</em>: 61 unique tokens (due to stereo/ion tokens).</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Implemented using <code>imgaug</code> python package. Random application of:
<ul>
<li>Gaussian/Average Blur, Additive Gaussian Noise, Salt &amp; Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler (following the &ldquo;Attention is all you need&rdquo; paper).</li>
</ul>
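<p>The &ldquo;Attention is all you need&rdquo; learning-rate schedule referenced above can be written directly (a sketch using that paper&rsquo;s default warmup of 4,000 steps; DECIMER&rsquo;s exact warmup setting is an assumption here):</p>

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Learning-rate schedule from the original Transformer paper:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    Rises linearly for `warmup` steps, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Linear warmup, then decay: the peak learning rate occurs at step == warmup.
```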
<h3 id="models">Models</h3>
<p>The final architecture is an <strong>Image-to-SELFIES Transformer</strong>.</p>
<ul>
<li><strong>Encoder (Feature Extractor)</strong>:
<ul>
<li><strong>EfficientNet-B3</strong> (pre-trained on Noisy-student).</li>
<li>Input: $299 \times 299 \times 3$ images (normalized -1 to 1).</li>
<li>Output Feature Vector: $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder (Transformer)</strong>:
<ul>
<li>4 Encoder-Decoder layers.</li>
<li>8 Parallel Attention Heads.</li>
<li>Dimension size: 512.</li>
<li>Feed-forward size: 2048.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td><strong>96.47%</strong></td>
          <td>74.57% (1M subset)</td>
          <td>Percentage of predictions with perfect fingerprint match (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Avg Tanimoto</strong></td>
          <td><strong>0.9923</strong></td>
          <td>0.9371 (1M subset)</td>
          <td>Average similarity score (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Isomorphism</strong></td>
          <td><strong>99.75%</strong></td>
          <td>-</td>
          <td>Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Hardware</strong>: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.</li>
<li><strong>Comparison Hardware</strong>: Nvidia Tesla V100 (32GB GPU).</li>
<li><strong>Performance</strong>:
<ul>
<li>TPU v3-8 was ~4x faster than V100 GPU.</li>
<li>1 Million molecule model convergence: 8h 41min on TPU vs ~29h 48min on GPU.</li>
<li>Largest model (35M) took less than 14 days on TPU.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper is open-access, and both code and data are publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER-TPU (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using TensorFlow and TPU training</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archival snapshot of the codebase</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>SMILES data used for training (images generated via CDK SDG)</td>
      </tr>
      <tr>
          <td><a href="https://decimer.ai/">DECIMER Project Page</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Project landing page</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Hardware Requirements</strong>: Training requires TPU v3-8 (Google Cloud) or Nvidia V100 GPU. The largest model (35M molecules) took less than 14 days on TPU v3-8.</li>
<li><strong>Missing Components</strong>: Augmentation parameters are documented in the paper (Table 14). Pre-trained model weights are available through the GitHub repository.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. <em>Journal of Cheminformatics</em>, 13(1), 61. <a href="https://doi.org/10.1186/s13321-021-00538-8">https://doi.org/10.1186/s13321-021-00538-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">GitHub Repository</a></li>
<li><a href="https://decimer.ai/">DECIMER Project Page</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMER10Deep2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{DECIMER 1.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00538-8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1186/s13321-021-00538-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemPix: Hand-Drawn Hydrocarbon Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</guid><description>Deep learning framework using CNN-LSTM image captioning to convert hand-drawn hydrocarbon structures into SMILES strings with 76% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-core-contribution">Paper Classification and Core Contribution</h2>
<p>This is primarily a <strong>Method</strong> paper, with a secondary contribution as a <strong>Resource</strong> paper.</p>
<p>The paper&rsquo;s core contribution is the <strong>ChemPix architecture and training strategy</strong> using neural image captioning (CNN-LSTM) to convert hand-drawn chemical structures to SMILES. The extensive ablation studies on synthetic data generation (augmentation, degradation, backgrounds) and ensemble learning strategies confirm the methodological focus. The secondary resource contribution includes releasing a curated dataset of hand-drawn hydrocarbons and code for generating synthetic training data.</p>
<h2 id="the-structural-input-bottleneck-in-computational-chemistry">The Structural Input Bottleneck in Computational Chemistry</h2>
<p>Inputting molecular structures into computational chemistry software for quantum calculations is often a bottleneck, requiring domain expertise and cumbersome manual entry in drawing software. While optical chemical structure recognition (OCSR) tools exist, they typically struggle with the noise and variability of hand-drawn sketches. There is a practical need for a tool that allows chemists to simply photograph a hand-drawn sketch and immediately convert it into a machine-readable format (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>), making computational workflows more accessible.</p>
<h2 id="cnn-lstm-image-captioning-and-synthetic-generalization">CNN-LSTM Image Captioning and Synthetic Generalization</h2>
<ol>
<li><strong>Image Captioning Paradigm</strong>: The authors treat the problem as <strong>neural image captioning</strong>, using an encoder-decoder (CNN-LSTM) framework to &ldquo;translate&rdquo; an image directly to a SMILES string. This avoids the complexity of explicit atom/bond detection and graph assembly.</li>
<li><strong>Synthetic Data Engineering</strong>: The paper introduces a rigorous synthetic data generation pipeline that transforms clean RDKit-generated images into &ldquo;pseudo-hand-drawn&rdquo; images via randomized backgrounds, degradation, and heavy augmentation. This allows the model to achieve &gt;50% accuracy on real hand-drawn data without ever seeing it during training.</li>
<li><strong>Ensemble Uncertainty Estimation</strong>: The method utilizes a &ldquo;committee&rdquo; (ensemble) of networks to improve accuracy and estimate confidence based on vote agreement, providing users with reliability indicators for predictions.</li>
</ol>
<h2 id="extensive-ablation-and-real-world-evaluation">Extensive Ablation and Real-World Evaluation</h2>
<ol>
<li><strong>Ablation Studies on Data Pipeline</strong>: The authors trained models on datasets generated at different stages of the pipeline (Clean RDKit $\rightarrow$ Augmented $\rightarrow$ Backgrounds $\rightarrow$ Degraded) to quantify the value of each transformation in bridging the synthetic-to-real domain gap.</li>
<li><strong>Sample Size Scaling</strong>: They analyzed performance scaling by training on synthetic dataset sizes ranging from 10,000 to 500,000 images to understand data requirements.</li>
<li><strong>Real-world Validation</strong>: The model was evaluated on a held-out test set of hand-drawn images collected via a custom web app, providing genuine out-of-distribution testing.</li>
<li><strong>Fine-tuning Experiments</strong>: Comparisons of synthetic-only training versus fine-tuning with a small fraction of real hand-drawn data to assess the value of limited real-world supervision.</li>
</ol>
<h2 id="state-of-the-art-hand-drawn-ocsr-performance">State-of-the-Art Hand-Drawn OCSR Performance</h2>
<ol>
<li>
<p><strong>Pipeline Efficacy</strong>: Augmentation and image degradation were the most critical factors for generalization, achieving over 50% accuracy on hand-drawn data when training with 500,000 synthetic images. Adding backgrounds had a negligible effect on accuracy compared to degradation.</p>
</li>
<li>
<p><strong>State-of-the-Art Performance</strong>: The final ensemble model (5 out of 17 trained NNs, selected for achieving &gt;50% individual accuracy) achieved <strong>76% accuracy</strong> (top-1) and <strong>85.5% accuracy</strong> (top-3) on the hand-drawn test set, a significant improvement over the best single model&rsquo;s 67.5%.</p>
</li>
<li>
<p><strong>Synthetic Generalization</strong>: A model trained on 500,000 synthetic images achieved &gt;50% accuracy on real hand-drawn data without any fine-tuning, validating the synthetic data generation strategy as a viable alternative to expensive manual labeling.</p>
</li>
<li>
<p><strong>Ensemble Benefits</strong>: The voting committee approach improved accuracy and provided interpretable uncertainty estimates through vote distributions. When all five committee members agree ($V=5$), the confidence value reaches 98%.</p>
</li>
</ol>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations of the current system:</p>
<ul>
<li><strong>Hydrocarbons only</strong>: The model is restricted to hydrocarbon structures and does not handle heteroatoms or functional groups.</li>
<li><strong>No conjoined rings</strong>: Molecules with multiple conjoined rings are excluded due to limitations of RDKit&rsquo;s image generation, which depicts bridges differently from standard chemistry drawing conventions.</li>
<li><strong>Resonance hybrid notation</strong>: The network struggles with benzene rings drawn in the resonance hybrid style (with a circle) compared to the Kekulé structure, since the RDKit training images use exclusively Kekulé representations.</li>
<li><strong>Challenging backgrounds</strong>: Lined and squared paper increase recognition difficulty, and structures bleeding through from the opposite side of the page can confuse the network.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study relies on two primary data sources: a massive synthetic dataset generated procedurally and a smaller collected dataset of real drawings.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic (RDKit)</td>
          <td>500,000 images</td>
          <td>Generated via RDKit with &ldquo;heavy&rdquo; augmentation: rotation ($0-360°$), blur, salt+pepper noise, and background texture addition.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-Drawn (Real)</td>
          <td>613 images</td>
          <td>Crowdsourced via a web app from over 100 unique users; split into 200-image test set and 413 training/validation images.</td>
      </tr>
      <tr>
          <td><strong>Backgrounds</strong></td>
          <td>Texture Images</td>
          <td>1,052 images</td>
          <td>A pool of unlabeled texture photos (paper, desks, shadows) used to generate synthetic backgrounds.</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation Parameters</strong>:</p>
<ul>
<li><strong>Augmentations</strong>: Rotation, Resize ($200-300px$), Blur, Dilate, Erode, Aspect Ratio, Affine transform ($\pm 20px$), Contrast, Quantize, Sharpness</li>
<li><strong>Backgrounds</strong>: Randomly translated $\pm 100$ pixels and reflected</li>
</ul>
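<p>A minimal sketch of sampling one augmentation configuration from the ranges above. The dictionary keys and the choice of uniform sampling are illustrative assumptions; the paper does not specify its exact sampling distributions.</p>

```python
import random

def sample_augmentation(rng=random):
    """Sample one augmentation configuration from the ranges listed above.

    A sketch: the key names and uniform distributions are assumptions,
    not the authors' pipeline.
    """
    return {
        "rotation_deg": rng.uniform(0, 360),
        "resize_px": rng.randint(200, 300),
        # Affine transform bounded by +/- 20 px
        "affine_shift_px": (rng.uniform(-20, 20), rng.uniform(-20, 20)),
        # Background randomly translated +/- 100 px and reflected
        "background_shift_px": (rng.uniform(-100, 100), rng.uniform(-100, 100)),
        "reflect_background": rng.random() < 0.5,
    }
```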
<h3 id="algorithms">Algorithms</h3>
<p><strong>Ensemble Voting</strong><br>
A committee of networks casts votes for the predicted SMILES string. The final prediction is the one with the highest vote count. Validity of SMILES is checked using RDKit.</p>
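<p>The voting rule can be sketched in a few lines. The <code>is_valid</code> hook stands in for RDKit's validity check (in practice, <code>Chem.MolFromSmiles(s) is not None</code>); the placeholder default keeps the sketch dependency-free.</p>

```python
from collections import Counter

def committee_vote(predictions, is_valid=lambda s: True):
    """Majority vote over an ensemble's SMILES predictions.

    `is_valid` would be an RDKit parse check in practice; here it is a
    pluggable placeholder.
    """
    # Discard predictions that fail the validity check, then count votes.
    valid = [p for p in predictions if is_valid(p)]
    if not valid:
        return None, 0
    smiles, votes = Counter(valid).most_common(1)[0]
    return smiles, votes
```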
<p><strong>Beam Search</strong><br>
Used in the decoding layer with a beam width of $k=5$ to explore multiple potential SMILES strings. It approximates the sequence $\mathbf{\hat{y}}$ that maximizes the joint probability:</p>
<p>$$ \mathbf{\hat{y}} = \arg\max_{\mathbf{y}} \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
<p><strong>Optimization</strong>:</p>
<ul>
<li>
<p><strong>Optimizer</strong>: Adam</p>
</li>
<li>
<p><strong>Learning Rate</strong>: $1 \times 10^{-4}$</p>
</li>
<li>
<p><strong>Batch Size</strong>: 20</p>
</li>
<li>
<p><strong>Loss Function</strong>: Cross-entropy loss across the sequence of $T$ tokens, computed as:</p>
<p>$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
<p>where $\mathbf{x}$ is the image representation and $y_t$ is the predicted SMILES character. For validation, this loss is reported as perplexity.</p>
</li>
</ul>
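<p>The validation metric follows directly from the loss: perplexity is the exponentiated mean negative log-likelihood per token. A minimal sketch:</p>

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the sequence.

    `token_logprobs` holds log P(y_t | y_<t, x) for each of the T tokens,
    as produced by the decoder's softmax.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```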
<h3 id="models">Models</h3>
<p>The architecture is a standard image captioning model (Show, Attend and Tell style) adapted for chemical structures.</p>
<p><strong>Encoder (CNN)</strong>:</p>
<ul>
<li><strong>Input</strong>: 256x256 pixel PNG images</li>
<li><strong>Structure</strong>: 4 blocks of Conv2D + MaxPool
<ul>
<li>Block 1: 64 filters, (3,3) kernel</li>
<li>Block 2: 128 filters, (3,3) kernel</li>
<li>Block 3: 256 filters, (3,3) kernel</li>
<li>Block 4: 512 filters, (3,3) kernel</li>
</ul>
</li>
<li><strong>Activation</strong>: ReLU throughout</li>
</ul>
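<p>Assuming &ldquo;same&rdquo; convolution padding and $2 \times 2$ max-pooling (details the paper does not state explicitly), the feature-map sizes through the four encoder blocks can be traced as follows:</p>

```python
def encoder_shapes(input_hw=256, blocks=(64, 128, 256, 512)):
    """Trace feature-map sizes through the 4 Conv2D + MaxPool blocks.

    Assumes 'same' conv padding and 2x2 pooling, so only pooling changes
    the spatial size. A sketch of the listed architecture, not the
    authors' code.
    """
    shapes, hw = [], input_hw
    for filters in blocks:
        hw //= 2  # each 2x2 max-pool halves both spatial dimensions
        shapes.append((hw, hw, filters))
    return shapes
```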
<p><strong>Decoder (LSTM)</strong>:</p>
<ul>
<li><strong>Hidden Units</strong>: 512</li>
<li><strong>Embedding Dimension</strong>: 80</li>
<li><strong>Attention</strong>: Attention mechanism with an intermediary vector dimension of 512</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Exact SMILES match accuracy (character-by-character identity between predicted and ground truth SMILES)</li>
<li><strong>Perplexity</strong>: Used for saving model checkpoints (minimizing uncertainty)</li>
<li><strong>Top-k Accuracy</strong>: Reported for $k=1$ (76%) and $k=3$ (85.5%)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mtzgroup/ChemPixCH">ChemPixCH</a></td>
          <td>Code + Dataset</td>
          <td>Apache-2.0</td>
          <td>Official implementation with synthetic data generation pipeline and collected hand-drawn dataset</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weir, H., Thompson, K., Woodward, A., Choi, B., Braun, A., &amp; Martínez, T. J. (2021). ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning. <em>Chemical Science</em>, 12(31), 10622-10633. <a href="https://doi.org/10.1039/D1SC02957F">https://doi.org/10.1039/D1SC02957F</a></p>
<p><strong>Publication</strong>: Chemical Science 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/mtzgroup/ChemPixCH">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{weir2021chempix,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Weir, Hayley and Thompson, Keiran and Woodward, Amelia and Choi, Benjamin and Braun, Augustin and Mart{\&#39;i}nez, Todd J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{10622--10633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D1SC02957F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ABC-Net: Keypoint-Based Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</guid><description>Deep learning OCSR model using keypoint estimation to detect atom and bond centers for graph-based molecular structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-and-paper-type">Contribution and Paper Type</h2>
<p><strong>Method</strong>. The paper proposes a novel architectural framework (ABC-Net) for Optical Chemical Structure Recognition (OCSR). It reformulates the problem from image captioning (sequence generation) to keypoint estimation (pixel-wise detection), backed by ablation studies on noise and comparative benchmarks against state-of-the-art tools.</p>
<h2 id="motivation-for-keypoint-based-ocsr">Motivation for Keypoint-Based OCSR</h2>
<ul>
<li><strong>Inefficiency of Rule-Based Methods</strong>: Traditional tools (OSRA, MolVec) rely on hand-coded rules that are brittle, require domain expertise, and fail to handle the wide variance in molecular drawing styles.</li>
<li><strong>Data Inefficiency of Captioning Models</strong>: Recent Deep Learning approaches (like DECIMER, Img2mol) treat OCSR as image captioning (Image-to-SMILES). This is data-inefficient because canonical SMILES require learning traversal orders, necessitating millions of training examples.</li>
<li><strong>Goal</strong>: To create a scalable, data-efficient model that predicts graph structures directly by detecting atomic/bond primitives.</li>
</ul>
<h2 id="abc-nets-divide-and-conquer-architecture">ABC-Net&rsquo;s Divide-and-Conquer Architecture</h2>
<ul>
<li><strong>Divide-and-Conquer Strategy</strong>: ABC-Net breaks the problem down into detecting <strong>atom centers</strong> and <strong>bond centers</strong> as independent keypoints.</li>
<li><strong>Keypoint Estimation</strong>: A Fully Convolutional Network (FCN) generates heatmaps for object centers. This is inspired by computer vision techniques like CornerNet and CenterNet.</li>
<li><strong>Angle-Based Bond Detection</strong>: To handle overlapping bonds, the model classifies bond angles into 60 distinct bins ($0-360°$) at detected bond centers, allowing separation of intersecting bonds.</li>
<li><strong>Implicit Hydrogen Prediction</strong>: The model explicitly predicts the number of implicit hydrogens for heterocyclic atoms to resolve ambiguity in dearomatization.</li>
</ul>
<h2 id="experimental-setup-and-synthetic-data">Experimental Setup and Synthetic Data</h2>
<ul>
<li><strong>Dataset Construction</strong>: Synthetic dataset of 100,000 molecules from ChEMBL, rendered using two different engines (RDKit and Indigo) to ensure style diversity.</li>
<li><strong>Baselines</strong>: Compared against two rule-based methods (MolVec, OSRA) and one deep learning method (Img2mol).</li>
<li><strong>Robustness Testing</strong>: Evaluated on the external UOB dataset (real-world images) and synthetic images with varying levels of salt-and-pepper noise (up to $p=0.6$).</li>
<li><strong>Data Efficiency</strong>: Analyzed performance scaling with training set size (10k to 160k images).</li>
</ul>
<h2 id="results-generalization-and-noise-robustness">Results, Generalization, and Noise Robustness</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ABC-Net achieved <strong>94-98% accuracy</strong> across all test sets (Table 1), outperforming MolVec (12-45% on synthetic data, ~83% on UOB), OSRA (26-62% on synthetic, ~82% on UOB), and Img2mol (78-93% on non-stereo subsets).</li>
<li><strong>Generalization</strong>: On the external UOB benchmark, ABC-Net achieved <strong>&gt;95% accuracy</strong>, whereas the deep learning baseline (Img2mol) dropped to 78.2%, indicating better generalization.</li>
<li><strong>Data Efficiency</strong>: The model reached ~95% performance with only 80,000 training images, requiring roughly an order of magnitude less data than captioning-based models like Img2mol (which use millions of training examples).</li>
<li><strong>Noise Robustness</strong>: Performance remained stable (&lt;2% drop) with noise levels up to $p=0.1$. Even at extreme noise ($p=0.6$), Tanimoto similarity remained high, suggesting the model recovers most substructures even when exact matches fail.</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Drawing style coverage</strong>: The synthetic training data covers only styles available through RDKit and Indigo renderers. Many real-world styles (e.g., hand-drawn structures, atomic group abbreviations) are not represented.</li>
<li><strong>No stereo baseline from Img2mol</strong>: The Img2mol comparison only covers non-stereo subsets because stereo results were not available from the original Img2mol paper.</li>
<li><strong>Scalability to large molecules</strong>: Molecules with more than 50 non-hydrogen atoms are excluded from the dataset, and performance on such large structures is untested.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/zhang-xuan1314/ABC-Net">ABC-Net Repository</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation. Missing requirements.txt and pre-trained weights.</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status: Partially Reproducible</strong>. The code is provided, but key components like the pre-trained weights, exact training environment dependencies, and the generated synthetic datasets are missing from the open-source release, making exact reproduction difficult.</p>
<h3 id="data">Data</h3>
<p>The authors constructed a synthetic dataset because labeled pixel-wise OCSR data is unavailable.</p>
<ul>
<li><strong>Source</strong>: ChEMBL database</li>
<li><strong>Filtering</strong>: Excluded molecules with &gt;50 non-H atoms or rare atom types/charges (&lt;1000 occurrences).</li>
<li><strong>Sampling</strong>: 100,000 unique SMILES selected such that every atom type/charge appears in at least 1,000 compounds.</li>
<li><strong>Generation</strong>: Images generated via <strong>RDKit</strong> and <strong>Indigo</strong> libraries.
<ul>
<li><em>Augmentation</em>: Varied bond thickness, label mode, orientation, and aromaticity markers.</li>
<li><em>Resolution</em>: $512 \times 512$ pixels.</li>
<li><em>Noise</em>: Salt-and-pepper noise added during training ($P$ = prob of background flip, $Q = 50P$).</li>
</ul>
</li>
</ul>
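<p>The salt-and-pepper scheme can be sketched as below. Reading &ldquo;$Q = 50P$&rdquo; as the foreground flip probability (clipped to 1) is an interpretation of the note above, not the authors' code:</p>

```python
import random

def salt_and_pepper(img, p):
    """Apply salt-and-pepper noise to a binary image (0 = background, 1 = ink).

    Background pixels flip with probability P; foreground pixels flip with
    Q = 50P, clipped to 1 (interpretation of the note above).
    """
    q = min(50 * p, 1.0)
    return [
        [1 - px if random.random() < (q if px else p) else px for px in row]
        for row in img
    ]
```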
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (RDKit/Indigo)</td>
          <td>80k</td>
          <td>8:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>UOB Dataset</td>
          <td>~5.7k images</td>
          <td>External benchmark from Univ. of Birmingham</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Keypoint Detection (Heatmaps)</strong></p>
<ul>
<li>
<p><strong>Down-sampling</strong>: Input $512 \times 512$ → Output $128 \times 128$ (stride 4).</p>
</li>
<li>
<p><strong>Label Softening</strong>: To handle discretization error, ground truth peaks are set to 1, first-order neighbors to 0.95, others to 0.</p>
</li>
<li>
<p><strong>Loss Function</strong>: Penalty-reduced pixel-wise binary focal loss (variants of CornerNet loss). The loss formulation is given as:</p>
<p>$$ L_{det} = - \frac{1}{N} \sum_{x,y} \begin{cases} (1 - \hat{A}_{x,y})^{\alpha} \log(\hat{A}_{x,y}) &amp; \text{if } A_{x,y} = 1 \\ (1 - A_{x,y}) (\hat{A}_{x,y})^{\alpha} \log(1 - \hat{A}_{x,y}) &amp; \text{otherwise} \end{cases} $$</p>
<ul>
<li>$\alpha=2$ (focal parameter). The $(1 - A_{x,y})$ term reduces the penalty for first-order neighbors of ground truth locations.</li>
<li>Property classification losses use a separate focal parameter $\beta=2$ with weight balancing: classes with &lt;10% frequency are weighted 10x.</li>
</ul>
</li>
</ul>
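<p>A dependency-free sketch of the penalty-reduced focal loss above, operating on flat lists of heatmap values:</p>

```python
import math

def penalty_reduced_focal_loss(pred, target, alpha=2):
    """Pixel-wise penalty-reduced focal loss, as in the equation above.

    `pred` holds predicted heatmap values A_hat in (0, 1); `target` holds
    softened ground-truth values A (1 at peaks, 0.95 at first-order
    neighbors, 0 elsewhere). N is the number of ground-truth peaks.
    """
    n = sum(1 for t in target if t == 1) or 1  # avoid division by zero
    loss = 0.0
    for a_hat, a in zip(pred, target):
        if a == 1:
            loss += (1 - a_hat) ** alpha * math.log(a_hat)
        else:
            # The (1 - A) factor reduces the penalty near ground-truth peaks.
            loss += (1 - a) * a_hat ** alpha * math.log(1 - a_hat)
    return -loss / n
```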
<p><strong>2. Bond Direction Classification</strong></p>
<ul>
<li><strong>Angle Binning</strong>: $360°$ divided into 60 intervals.</li>
<li><strong>Inference</strong>: A bond is detected if the angle probability is a local maximum and exceeds a threshold.</li>
<li><strong>Non-Maximum Suppression (NMS)</strong>: Required for opposite angles (e.g., $30°$ and $210°$) representing the same non-stereo bond.</li>
</ul>
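<p>The binning and NMS rules can be sketched as follows; the threshold value and the circular local-maximum test are illustrative choices:</p>

```python
def opposite_bin(b, n_bins=60):
    """Index of the bin 180 degrees away: the two directions of one
    non-stereo bond land in opposite bins (e.g. 30 and 210 degrees)."""
    return (b + n_bins // 2) % n_bins

def detect_bond_bins(probs, threshold=0.5):
    """Detect bond angles from the 60 bin probabilities at a bond center.

    A bin fires if it is a (circular) local maximum above the threshold;
    opposite-bin duplicates are then suppressed, keeping one per bond.
    """
    n = len(probs)
    peaks = [
        i for i, p in enumerate(probs)
        if p > threshold and p >= probs[(i - 1) % n] and p >= probs[(i + 1) % n]
    ]
    kept = []
    for i in peaks:
        if opposite_bin(i, n) not in kept:  # NMS over opposite angles
            kept.append(i)
    return kept
```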
<p><strong>3. Multi-Task Weighting</strong></p>
<ul>
<li>Uses Kendall&rsquo;s uncertainty weighting to balance 8 different loss terms (atom det, bond det, atom type, charge, H-count, bond angle, bond type, bond length).</li>
</ul>
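<p>A common form of this weighting (from Kendall et al.) learns a log-variance $s_i$ per task and combines losses as $\sum_i e^{-s_i} L_i + s_i$. The sketch below shows that form, not necessarily the paper's exact formulation:</p>

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Homoscedastic uncertainty weighting over the 8 task losses.

    Each s_i = log(sigma_i^2) is a learned per-task parameter; exp(-s_i)
    down-weights noisy tasks while the + s_i term stops s_i from growing
    without bound.
    """
    return sum(math.exp(-s) * l + s for l, s in zip(losses, log_vars))
```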
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: ABC-Net (Custom U-Net / FCN)</p>
<ul>
<li><strong>Input</strong>: $512 \times 512 \times 1$ (Grayscale).</li>
<li><strong>Contracting Path</strong>: 5 steps. Each step has conv-blocks + $2 \times 2$ MaxPool.</li>
<li><strong>Expansive Path</strong>: 3 steps. Transpose-Conv upsampling + Concatenation (Skip Connections).</li>
<li><strong>Heads</strong>: Separate $1 \times 1$ convs for each task map (Atom Heatmap, Bond Heatmap, Property Maps).</li>
<li><strong>Output Dimensions</strong>:
<ul>
<li>Heatmaps: $(1, 128, 128)$</li>
<li>Bond Angles: $(60, 128, 128)$</li>
</ul>
</li>
<li><strong>Pre-trained Weights</strong>: Not included in the public <a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub repository</a>. The paper&rsquo;s availability statement mentions code and training datasets but not weights.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Detection</strong>: Precision &amp; Recall (Object detection level).</li>
<li><strong>Regression</strong>: Mean Absolute Error (MAE) for bond lengths.</li>
<li><strong>Structure Recovery</strong>:
<ul>
<li><em>Accuracy</em>: Exact SMILES match rate.</li>
<li><em>Tanimoto</em>: ECFP similarity (fingerprint overlap).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>ABC-Net</th>
          <th>Img2mol (Baseline)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (UOB)</strong></td>
          <td><strong>96.1%</strong></td>
          <td>78.2%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Accuracy (Indigo)</strong></td>
          <td><strong>96.4%</strong></td>
          <td>89.5%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (UOB)</strong></td>
          <td><strong>0.989</strong></td>
          <td>0.953</td>
          <td>Higher substructure recovery</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>: 15 epochs, Batch size 64.</li>
<li><strong>Optimization</strong>: Adam Optimizer. LR $2.5 \times 10^{-4}$ (first 5 epochs) → $2.5 \times 10^{-5}$ (last 10).</li>
<li><strong>Repetition</strong>: Every experiment was repeated 3 times with random dataset splitting; mean values are reported.</li>
<li><strong>Compute</strong>: High-Performance Computing Center of Central South University. Specific GPU model not listed.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Yi, J.-C., Yang, G.-P., Wu, C.-K., Hou, T.-J., &amp; Cao, D.-S. (2022). ABC-Net: A divide-and-conquer based deep learning architecture for SMILES recognition from molecular images. <em>Briefings in Bioinformatics</em>, 23(2), bbac033. <a href="https://doi.org/10.1093/bib/bbac033">https://doi.org/10.1093/bib/bbac033</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangABCNetDivideandconquerBased2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ABC-Net: A Divide-and-Conquer Based Deep Learning Architecture for {SMILES} Recognition from Molecular Images}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Xiao-Chen and Yi, Jia-Cai and Yang, Guo-Ping and Wu, Cheng-Kun and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{bbac033}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bib/bbac033}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Unified Framework for Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</guid><description>A 2009 unified framework for inorganic/organic chemical handwriting recognition using graph search and statistical symbol grouping.</description><content:encoded><![CDATA[<h2 id="addressing-the-complexity-of-handwritten-organic-chemistry">Addressing the Complexity of Handwritten Organic Chemistry</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) from Microsoft Research Asia that addresses the challenge of recognizing complex 2D organic chemistry structures. By 2009, math expression recognition had seen significant commercial progress, but chemical expression recognition remained less developed.</p>
<p>The specific gap addressed is the geometric complexity of organic formulas. While inorganic formulas typically follow a linear, equation-like structure, organic formulas present complex 2D diagrammatic structures with various bond types and rings. Existing work often relied on strong assumptions (like single-stroke symbols) or failed to handle arbitrary compounds. There was a clear need for a unified solution capable of handling both inorganic and organic domains consistently.</p>
<h2 id="the-chemical-expression-structure-graph-cesg">The Chemical Expression Structure Graph (CESG)</h2>
<p>The core innovation is a unified statistical framework that processes inorganic and organic expressions within the same pipeline. Key technical novelties include:</p>
<ol>
<li><strong>Unified Bond Modeling</strong>: Bonds are treated as special symbols. The framework detects &ldquo;extended bond symbols&rdquo; (multi-stroke bonds) and splits them into single, double, or triple bonds using corner detection for consistent processing.</li>
<li><strong>Chemical Expression Structure Graph (CESG)</strong>: A defined graph representation for generic chemical expressions where nodes represent symbols and edges represent bonds or spatial relations.</li>
<li><strong>Non-Symbol Modeling</strong>: During the symbol grouping phase, the system explicitly models &ldquo;invalid groups&rdquo; to reduce over-grouping errors.</li>
<li><strong>Global Graph Search</strong>: Structure analysis is formulated as finding the optimal CESG by searching over a Weighted Direction Graph ($G_{WD}$).</li>
</ol>
<h2 id="graph-search-and-statistical-validation">Graph Search and Statistical Validation</h2>
<p>The authors validated the framework on a proprietary database of 35,932 handwritten chemical expressions collected from 300 writers.</p>
<ul>
<li><strong>Setup</strong>: The data was split into roughly 26,000 training and 6,400 testing samples.</li>
<li><strong>Metric</strong>: Recognition accuracy was measured strictly by expression (all symbols and the complete structure must be correct).</li>
<li><strong>Ablations</strong>: The team evaluated the performance contribution of symbol grouping, structure analysis, and full semantic verification.</li>
</ul>
<h2 id="recognition-accuracy-and-outcomes">Recognition Accuracy and Outcomes</h2>
<p>The full framework achieved a Top-1 accuracy of 75.4% and a Top-5 accuracy of 83.1%.</p>
<ul>
<li><strong>Component Contribution</strong>: Structure analysis is the primary bottleneck. Adding it drops the theoretical &ldquo;perfect grouping&rdquo; performance from 85.9% to 74.1% (Top-1) due to structural errors.</li>
<li><strong>Semantic Verification</strong>: Checking valence and grammar improved relative accuracy by 1.7%.</li>
</ul>
<p>The unified framework effectively handles the variance in 2D space for chemical expressions, demonstrating that delayed decision-making (keeping top-N candidates) improves robustness.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No public artifacts (code, data, models) were released by the authors.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used a private Microsoft Research Asia dataset, making direct reproduction difficult.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total</td>
          <td>Proprietary MSRA DB</td>
          <td>35,932 expressions</td>
          <td>Written by 300 people</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Subset</td>
          <td>25,934 expressions</td>
          <td></td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Subset</td>
          <td>6,398 expressions</td>
          <td></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Content</strong>: 2,000 unique expressions from high school/college textbooks.</li>
<li><strong>Composition</strong>: ~25% of samples are organic expressions.</li>
<li><strong>Vocabulary</strong>: 163 symbol classes (elements, digits, <code>+</code>, <code>↑</code>, <code>%</code>, bonds, etc.).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Symbol Grouping (Dynamic Programming)</strong></p>
<ul>
<li>Objective: Find the optimal symbol sequence $G_{max}$ maximizing the posterior probability given the ink strokes:
$$ G_{max} = \arg\max_{G} P(G | \text{Ink}) $$</li>
<li><strong>Non-symbol modeling</strong>: Models are iteratively retrained on &ldquo;incorrect grouping results&rdquo; so they learn to reject invalid stroke groups.</li>
<li><strong>Inter-group modeling</strong>: Uses Gaussian Mixture Models (GMM) to model spatial relations ($R_j$) between groups.</li>
</ul>
<p><strong>2. Bond Processing</strong></p>
<ul>
<li><strong>Extended Bond Symbol</strong>: Recognizes connected strokes (e.g., a messy double bond written in one stroke) as a single &ldquo;extended&rdquo; symbol.</li>
<li><strong>Splitting</strong>: Uses <strong>Curvature Scale Space (CSS)</strong> corner detection to split extended symbols into primitive lines.</li>
<li><strong>Classification</strong>: A Neural Network verifies if the split lines form valid single, double, or triple bonds.</li>
</ul>
<p><strong>3. Structure Analysis (Graph Search)</strong></p>
<ul>
<li><strong>Graph Construction</strong>: Builds a Weighted Direction Graph ($G_{WD}$) where nodes are symbol candidates and edges are potential relationships ($E_{c}, E_{nc}, E_{peer}, E_{sub}$).</li>
<li><strong>Edge Weights</strong>: Calculated as the product of observation, spatial, and contextual probabilities:
$$ W(S, O, R) = P(O|S) \times P(\text{Spatial}|R) \times P(\text{Context}|S, R) $$
<ul>
<li>Spatial probability uses rectangular control regions and distance functions.</li>
<li>Contextual probability uses statistical co-occurrence (e.g., &lsquo;C&rsquo; often appears with &lsquo;H&rsquo;).</li>
</ul>
</li>
<li><strong>Search</strong>: Breadth-first search with pruning to find the top-N optimal CESGs.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Symbol Recognition</strong>: Implementation details not specified, but likely HMM or NN based on the era. Bond verification explicitly uses a <strong>Neural Network</strong>.</li>
<li><strong>Spatial Models</strong>: <strong>Gaussian Mixture Models (GMM)</strong> are used to model the 9 spatial relations (e.g., Left-super, Above, Subscript).</li>
<li><strong>Semantic Model</strong>: A <strong>Context-Free Grammar (CFG)</strong> parser is used for final verification (e.g., ensuring digits aren&rsquo;t reactants).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation is performed using &ldquo;Expression-level accuracy&rdquo;.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Top-1)</th>
          <th>Value (Top-5)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full Framework</td>
          <td>75.4%</td>
          <td>83.1%</td>
          <td></td>
      </tr>
      <tr>
          <td>Without Semantics</td>
          <td>74.1%</td>
          <td>83.0%</td>
          <td></td>
      </tr>
      <tr>
          <td>Grouping Only</td>
          <td>85.9%</td>
          <td>95.6%</td>
          <td>Theoretical max if structure analysis was perfect</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, M., Han, S., &amp; Zhang, D. (2009). A Unified Framework for Recognizing Handwritten Chemical Expressions. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1345&ndash;1349. <a href="https://doi.org/10.1109/ICDAR.2009.64">https://doi.org/10.1109/ICDAR.2009.64</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{changUnifiedFrameworkRecognizing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A {{Unified Framework}} for {{Recognizing Handwritten Chemical Expressions}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Ming and Han, Shi and Zhang, Dongmei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2009</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1345--1349}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.64}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SVM-HMM Online Classifier for Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</guid><description>A dual-stage classifier combining SVM and HMM to recognize online handwritten chemical symbols, introducing a reordering algorithm for organic rings.</description><content:encoded><![CDATA[<h2 id="contribution-double-stage-classification-method">Contribution: Double-Stage Classification Method</h2>
<p><strong>Method</strong>.
This paper is a methodological contribution that proposes a novel &ldquo;double-stage classifier&rdquo; architecture. It fits the taxonomy by introducing a specific algorithmic pipeline (SVM rough classification followed by HMM fine classification) and a novel pre-processing algorithm (Point Sequence Reordering) to solve technical limitations in recognizing organic ring structures. The contribution is validated through ablation studies (comparing SVM kernels and HMM state/Gaussian counts) and performance benchmarks.</p>
<h2 id="motivation-recognizing-complex-organic-ring-structures">Motivation: Recognizing Complex Organic Ring Structures</h2>
<p>The primary motivation is the complexity of recognizing handwritten chemical symbols, specifically the distinction between <strong>Organic Ring Structures (ORS)</strong> and <strong>Non-Ring Structures (NRS)</strong>. Existing single-stage classifiers are unreliable for ORS because these symbols have arbitrary writing styles, variable stroke numbers, and inconsistent stroke orders due to their 2D hexagonal structure. A robust system is needed to handle this uncertainty and achieve high accuracy.</p>
<h2 id="core-innovation-point-sequence-reordering-psr">Core Innovation: Point Sequence Reordering (PSR)</h2>
<p>The authors introduce two main novelties:</p>
<ol>
<li><strong>Double-Stage Architecture</strong>: A hybrid system where an SVM (using RBF kernel) first roughly classifies inputs as either ORS or NRS, followed by specialized HMMs for fine-grained recognition.</li>
<li><strong>Point Sequence Reordering (PSR) Algorithm</strong>: A stroke-order independent algorithm designed specifically for ORS. It reorders the point sequence of a symbol based on a counter-clockwise scan from the centroid, effectively eliminating the uncertainty caused by variations in stroke number and writing order.</li>
</ol>
<h2 id="methodology--experimental-design">Methodology &amp; Experimental Design</h2>
<p>The authors collected a custom dataset and performed sequential optimizations:</p>
<ul>
<li><strong>SVM Optimization</strong>: Compared Polynomial, RBF, and Sigmoid kernels to find the best rough classifier.</li>
<li><strong>HMM Optimization</strong>: Tested multiple combinations of states (4, 6, 8) and Gaussians (3, 4, 6, 8, 9, 12) to maximize fine classification accuracy.</li>
<li><strong>PSR Validation</strong>: Conducted an ablation study comparing HMM accuracy on ORS symbols &ldquo;Before PSR&rdquo; vs &ldquo;After PSR&rdquo; to quantify the algorithm&rsquo;s impact.</li>
</ul>
<h2 id="results--final-conclusions">Results &amp; Final Conclusions</h2>
<ul>
<li><strong>Architecture Performance</strong>: The RBF-based SVM achieved 99.88% accuracy in differentiating ORS from NRS.</li>
<li><strong>HMM Configuration</strong>: The optimal HMM topology was found to be <strong>8 states and 12 Gaussians</strong> for both symbol types.</li>
<li><strong>PSR Impact</strong>: The PSR algorithm dramatically improved ORS recognition, raising Top-1 accuracy from <strong>49.84% (before PSR)</strong> to <strong>98.36% (after PSR)</strong>.</li>
<li><strong>Overall Accuracy</strong>: The final integrated system achieved a Top-1 accuracy of <strong>93.10%</strong> and Top-3 accuracy of <strong>98.08%</strong> on the test set.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study defined 101 chemical symbols split into two categories.</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Count</th>
          <th>Content</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>NRS</strong> (Non-Ring)</td>
          <td>63</td>
          <td>Digits 0-9, 44 letters, 9 operators</td>
          <td>Operators include +, -, =, $\rightarrow$, etc.</td>
      </tr>
      <tr>
          <td><strong>ORS</strong> (Organic Ring)</td>
          <td>38</td>
          <td>2D hexagonal structures</td>
          <td>Benzene rings, cyclohexane, etc.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Collection</strong>: 12,322 total samples (122 per symbol) collected from 20 writers (teachers and students).</li>
<li><strong>Split</strong>: 9,090 training samples and 3,232 test samples.</li>
<li><strong>Constraints</strong>: Samples were collected under three writing specifications: normal, standard, and freestyle.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SVM Feature Extraction (Rough Classification)</strong>
The input strokes are scaled, and a 58-dimensional feature vector is calculated:</p>
<ul>
<li><strong>Mesh ($4 \times 4$)</strong>: Ratio of points in 16 grids (16 features).</li>
<li><strong>Outline</strong>: Normalized scan distance from 4 edges with 5 scan lines each (20 features).</li>
<li><strong>Projection</strong>: Point density in 5 bins per edge (20 features).</li>
<li><strong>Aspect Ratio</strong>: Height/Width ratios (2 features).</li>
</ul>
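<p>As a concrete illustration, the first of these four feature groups (the $4 \times 4$ mesh) can be sketched as follows. This is not the authors' code; the grid indexing and normalization are a plain reading of the description above, and the remaining outline, projection, and aspect-ratio features would be appended to reach the full 58 dimensions:</p>

```python
def mesh_features(points, grid=4):
    """4x4 mesh feature: fraction of sample points falling in each of the
    16 cells of the symbol's bounding box (one of the four feature groups;
    a sketch, not the paper's implementation)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    w = (max(xs) - x0) or 1.0  # guard against degenerate (flat) symbols
    h = (max(ys) - y0) or 1.0
    counts = [0] * (grid * grid)
    for x, y in points:
        col = min(int((x - x0) / w * grid), grid - 1)
        row = min(int((y - y0) / h * grid), grid - 1)
        counts[row * grid + col] += 1
    return [c / len(points) for c in counts]  # ratios sum to 1
```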
<p><strong>2. Point Sequence Reordering (PSR)</strong>
Used strictly for ORS preprocessing:</p>
<ol>
<li>Calculate the centroid $(x_c, y_c)$ of the symbol.</li>
<li>Initialize a scan line at angle $\theta = 0$.</li>
<li>Traverse points; if a point $p_i = (x_i, y_i)$ satisfies the distance threshold to the scan line, add it to the reordered list. Distance $d_i$ is calculated as:
$$ d_i = |(y_i - y_c)\cos(\theta) - (x_i - x_c)\sin(\theta)| $$</li>
<li>Increment $\theta$ by $\Delta\theta$ and repeat until a full circle ($2\pi$) is completed.</li>
</ol>
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Stage 1)</strong>: RBF Kernel was selected as optimal with parameters $C=512$ and $\gamma=0.5$.</li>
<li><strong>HMM (Stage 2)</strong>: Left-right continuous HMM trained via Baum-Welch algorithm. The topology is one model per symbol using <strong>8 states and 12 Gaussians</strong>.</li>
</ul>
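<p>The stage-1 decision function is built on the standard RBF kernel with the reported $\gamma$; a full reimplementation would train an SVM (e.g. via libsvm or scikit-learn) with these hyperparameters, which this minimal sketch only illustrates:</p>

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2), using the paper's
    reported gamma = 0.5; the SVM stage pairs this kernel with C = 512."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq)
```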
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported are Top-1, Top-2, and Top-3 accuracy on the held-out test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NRS Accuracy</th>
          <th>ORS Accuracy</th>
          <th>Overall Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Top-1</strong></td>
          <td>91.91%</td>
          <td>97.53%</td>
          <td>93.10%</td>
      </tr>
      <tr>
          <td><strong>Top-3</strong></td>
          <td>99.12%</td>
          <td>99.34%</td>
          <td>98.08%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: HP Pavilion tx1000 Tablet PC.</li>
<li><strong>Processor</strong>: 2.00GHz CPU.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Wang, K. (2010). A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols. <em>2010 International Conference on Pattern Recognition</em>, 1888&ndash;1891. <a href="https://doi.org/10.1109/ICPR.2010.465">https://doi.org/10.1109/ICPR.2010.465</a></p>
<p><strong>Publication</strong>: ICPR 2010</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2010svm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2010 International Conference on Pattern Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Wang, Kai}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2010}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1888--1891}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2010.465}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Recognition of On-line Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</guid><description>A two-level recognition algorithm for on-line handwritten chemical expressions using structural and syntactic features.</description><content:encoded><![CDATA[<h2 id="contribution-on-line-chemical-expression-recognition-framework">Contribution: On-line Chemical Expression Recognition Framework</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline (&ldquo;Algorithm Model&rdquo;) for recognizing on-line handwritten chemical expressions. The paper focuses on detailing the specific mechanisms of this pipeline (pre-processing, segmentation, two-level recognition, and HCI) and validates its effectiveness through quantitative comparison against a conventional baseline. The rhetorical structure aligns with the &ldquo;Methodological Basis&rdquo; of the taxonomy, prioritizing the &ldquo;how well does this work?&rdquo; question over theoretical derivation or dataset curation.</p>
<h2 id="motivation-the-hci-gap-in-chemical-drawing">Motivation: The HCI Gap in Chemical Drawing</h2>
<p>The authors identify a gap in existing human-computer interaction (HCI) for chemistry. While mathematical formula recognition had seen progress, chemical expression recognition was under-researched. Existing tools relied on keyboard/mouse input, which was time-consuming and inefficient for the complex, variable nature of chemical structures. Previous attempts were either too slow (vectorization-based) or failed to leverage specific chemical knowledge effectively. There was a practical need for a system that could handle the specific syntactic rules of chemistry in an on-line (real-time) handwriting setting.</p>
<h2 id="novelty-two-level-recognition-architecture">Novelty: Two-Level Recognition Architecture</h2>
<p>The core contribution is a <strong>two-level recognition algorithm</strong> that integrates chemical domain knowledge.</p>
<ul>
<li><strong>Level 1 (Substance Level):</strong> Treats connected strokes as a potential &ldquo;substance unit&rdquo; (e.g., &ldquo;H2O&rdquo;) and matches them against a dictionary using a modified edit distance algorithm.</li>
<li><strong>Level 2 (Character Level):</strong> If the substance match fails, it falls back to segmenting the unit into isolated characters and reconstructing them using syntactic rules.</li>
<li><strong>Hybrid Segmentation:</strong> Combines structural analysis (using bounding box geometry for super/subscript detection) with &ldquo;partial recognition&rdquo; (identifying special symbols like <code>+</code>, <code>=</code>, <code>-&gt;</code> early to split the expression).</li>
</ul>
<h2 id="methodology-custom-dataset-and-baseline-comparisons">Methodology: Custom Dataset and Baseline Comparisons</h2>
<p>The authors conducted a validation experiment in a laboratory environment with 20 participants (chemistry students and teachers).</p>
<ul>
<li><strong>Dataset:</strong> 1,197 total samples (983 from a standard set of 341 expressions, 214 arbitrary expressions written by users).</li>
<li><strong>Baselines:</strong> They compared their &ldquo;Two-Level&rdquo; algorithm against a &ldquo;Conventional&rdquo; algorithm that skips the substance-level check and directly recognizes characters (&ldquo;Recognize Character Directly&rdquo;).</li>
<li><strong>Conditions:</strong> They also tested the impact of their Human-Computer Interaction (HCI) module which allows user corrections.</li>
</ul>
<h2 id="results-high-accuracy-and-hci-corrections">Results: High Accuracy and HCI Corrections</h2>
<ul>
<li><strong>Accuracy:</strong> The proposed two-level algorithm achieved significantly higher accuracy (<strong>96.4%</strong> for expression recognition) compared to the conventional baseline (<strong>91.5%</strong>).</li>
<li><strong>Robustness:</strong> The method performed well even on &ldquo;arbitrary&rdquo; expressions not in the standard set (92.5% accuracy vs 88.2% baseline).</li>
<li><strong>HCI Impact:</strong> Allowing users to modify results via the HCI module pushed final accuracy to high levels (<strong>98.8%</strong>).</li>
<li><strong>Conclusion:</strong> The authors concluded the algorithm is reliable for real applications and flexible enough to be extended to other domains like physics or engineering.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a public benchmark but collected its own data for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Custom Lab Dataset</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">Collected from 20 chemistry students/teachers using Tablet PCs. Includes 341 standard expressions + arbitrary user inputs.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct phases with specific algorithmic choices:</p>
<p><strong>1. Pre-processing</strong></p>
<ul>
<li><strong>Smoothing:</strong> Uses a 5-tap Gaussian low-pass filter (Eq. 1) with specific coefficients to smooth stroke data.</li>
<li><strong>Redundancy:</strong> Merges redundant points and removes &ldquo;prickles&rdquo; (isolated noise).</li>
<li><strong>Re-ordering:</strong> Strokes are spatially re-sorted left-to-right, top-to-down to correct for arbitrary writing order.</li>
</ul>
<p><strong>2. Segmentation</strong></p>
<ul>
<li><strong>Structural Analysis:</strong> Distinguishes relationships (Superscript vs. Subscript vs. Horizontal) using a geometric feature vector $(T, B)$ based on bounding box heights ($h$), vertical centers ($C$), and barycenters ($B_{bary}$):
$$
\begin{aligned}
d &amp;= 0.7 \cdot y_{12} - y_{22} + 0.3 \cdot y_{11} \\
T &amp;= 1000 \cdot \frac{d}{h_1} \\
B &amp;= 1000 \cdot \frac{B_{bary1} - B_{bary2}}{h_1}
\end{aligned}
$$</li>
<li><strong>Partial Recognition:</strong> Detects special symbols (<code>+</code>, <code>=</code>, <code>-&gt;</code>) early to break expressions into &ldquo;super-substance units&rdquo; (e.g., separating reactants from products).</li>
</ul>
<p><strong>3. Recognition (Two-Level)</strong></p>
<ul>
<li><strong>Level 1 (Dictionary Match):</strong>
<ul>
<li>Uses a modified <strong>Edit Distance</strong> (Eq. 6) incorporating a specific distance matrix based on chemical syntax.</li>
<li>Similarity $\lambda_{ij}$ is weighted by stroke credibility $\mu_i$ and normalized by string length.</li>
</ul>
</li>
<li><strong>Level 2 (Character Segmentation):</strong>
<ul>
<li>Falls back to this if Level 1 fails.</li>
<li>Segments characters by analyzing pixel density in horizontal/vertical/diagonal directions to find concave/convex points.</li>
<li>Recombines characters using syntactic rules (e.g., valency checks) to verify validity.</li>
</ul>
</li>
</ul>
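<p>The Level-1 dictionary match can be sketched with plain Levenshtein distance. This is a simplified stand-in: the paper's Eq. 6 additionally weights substitutions by a chemical-syntax distance matrix and by stroke credibility $\mu_i$, and normalizes by string length:</p>

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two recognized strings."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def best_dictionary_match(candidate, dictionary):
    """Level 1: return the substance-dictionary entry closest to the
    recognized character string."""
    return min(dictionary, key=lambda entry: edit_distance(candidate, entry))
```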
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on recognition accuracy at both the character and expression level.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Proposed)</th>
          <th style="text-align: left">Value (Baseline)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>96.4%</strong></td>
          <td style="text-align: left">91.5%</td>
          <td style="text-align: left">&ldquo;Standard&rdquo; dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>92.5%</strong></td>
          <td style="text-align: left">88.2%</td>
          <td style="text-align: left">&ldquo;Other&rdquo; (arbitrary) dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>HCI-Assisted Accuracy</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Accuracy after user correction.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Input Devices:</strong> Tablet PCs were used for data collection and testing.</li>
<li><strong>Compute:</strong> Specific training hardware is not listed, but the algorithm is designed for real-time interaction on standard 2008-era computing devices.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, Q., &amp; Zhang, Y. (2008). Recognition of On-line Handwritten Chemical Expressions. <em>2008 IEEE International Joint Conference on Neural Networks</em>, 2360&ndash;2365. <a href="https://doi.org/10.1109/IJCNN.2008.4634125">https://doi.org/10.1109/IJCNN.2008.4634125</a></p>
<p><strong>Publication</strong>: IJCNN 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{jufengyangRecognitionOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Recognition of On-Line Handwritten Chemical Expressions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 {{IEEE International Joint Conference}} on {{Neural Networks}} ({{IEEE World Congress}} on {{Computational Intelligence}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{{Jufeng Yang} and {Guangshun Shi} and {Qingren Wang} and {Yong Zhang}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2360--2365}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Hong Kong, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IJCNN.2008.4634125}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-1820-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Online Handwritten Chemical Formula Structure Analysis</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</guid><description>A hierarchical grammar-based approach for recognizing and analyzing online handwritten chemical formulas in mobile education contexts.</description><content:encoded><![CDATA[<h2 id="hierarchical-grammatical-framework-contribution">Hierarchical Grammatical Framework Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for processing chemical formulas by decomposing them into three hierarchical levels (Formula, Molecule, Text). The contribution is defined by a specific set of formal grammatical rules and parsing algorithms used to construct a &ldquo;grammar spanning tree&rdquo; and &ldquo;molecule spanning graph&rdquo; from online handwritten strokes.</p>
<h2 id="motivation-for-online-formula-recognition">Motivation for Online Formula Recognition</h2>
<p>The primary motivation is the application of mobile computing in chemistry education, where precise comprehension of casual, <em>online</em> handwritten formulas is a significant challenge.</p>
<ul>
<li><strong>2D Complexity</strong>: Unlike 1D text, chemical formulas utilize complex 2D spatial relationships that convey specific chemical meaning (e.g., bonds, rings).</li>
<li><strong>Format Limitations</strong>: Existing storage formats like CML (Chemical Markup Language) or MDL MOLFILE do not natively record the layout or abbreviated information necessary for recognizing handwritten input.</li>
<li><strong>Online Gap</strong>: Previous research focused heavily on <em>offline</em> (image-based) recognition, lacking solutions for <em>online</em> (stroke-based) handwritten chemical formulas (OHCF).</li>
</ul>
<h2 id="core-novelty-in-three-level-grammatical-analysis">Core Novelty in Three-Level Grammatical Analysis</h2>
<p>The core novelty is the <strong>Three-Level Grammatical Analysis</strong> approach:</p>
<ol>
<li><strong>Formula Level (1D)</strong>: Treats the reaction equation as a linear sequence of components (Reactants, Products, Separators), parsed via a context-free grammar to build a spanning tree.</li>
<li><strong>Molecule Level (2D)</strong>: Treats molecules as graphs where &ldquo;text groups&rdquo; are vertices and &ldquo;bonds&rdquo; are edges. It introduces specific handling for &ldquo;hidden Carbon dots&rdquo; (intersections of bonds without text).</li>
<li><strong>Text Level (1D)</strong>: Analyzes the internal structure of text groups (atoms, subscripts).</li>
</ol>
<p>Unique to this approach is the <strong>formal definition of the chemical grammar</strong> as a 5-tuple $G=(T,N,P,M,S)$ and the generation of an <strong>Adjacency Matrix</strong> directly from the handwritten sketch to represent chemical connectivity.</p>
<h2 id="experimental-validation-on-handwritten-strokes">Experimental Validation on Handwritten Strokes</h2>
<p>The authors validated their model using a custom dataset of online handwritten formulas.</p>
<ul>
<li><strong>Data Source</strong>: 25 formulas were randomly selected from a larger pool of 1,250 samples.</li>
<li><strong>Scope</strong>: The test set included 484 total symbols, comprising generators, separators, text symbols, rings, and various bond types.</li>
<li><strong>Granular Validation</strong>: The system was tested at multiple distinct stages:
<ul>
<li>Key Symbol Extraction (Formula Level)</li>
<li>Text Localization (Molecule Level)</li>
<li>Bond End Grouping (Molecule Level)</li>
<li>Text Recognition (Text Level)</li>
</ul>
</li>
</ul>
<h2 id="downstream-impact-and-parsing-accuracy">Downstream Impact and Parsing Accuracy</h2>
<p>The system achieved high accuracy across all sub-tasks, demonstrating that the hierarchical grammar approach is effective for both inorganic and organic formulas.</p>
<ul>
<li><strong>Formula Level</strong>: 98.3% accuracy for Key Symbols; 100% for State-assisted symbols.</li>
<li><strong>Molecule Level</strong>: 98.8% accuracy for Bond End Grouping; 100% for Free End-Text connection detection.</li>
<li><strong>Text Recognition</strong>: 98.7% accuracy (Top-3) using HMMs.</li>
<li><strong>Impact</strong>: The method successfully preserves the writer&rsquo;s &ldquo;online information&rdquo; (habits/intentions) while converting the handwritten input into standard formats suitable for visual editing or data retrieval.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To replicate this work, one would need to implement the specific grammatical production rules and the geometric thresholds defined for bond analysis.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Symbol HMMs</td>
          <td>5,670 samples</td>
          <td>Used to train the text recognition module</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Text Recognition</td>
          <td>2,016 samples</td>
          <td>Test set for character HMMs</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Formula Analysis</td>
          <td>25 formulas</td>
          <td>Random subset of 1,250 collected samples; contains 484 symbols</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Formula Level Parsing</strong></p>
<ul>
<li><strong>HBL Analysis</strong>: Identify the &ldquo;Horizontal Baseline&rdquo; (HBL) containing the most symbols to locate key operators (e.g., $+$, $\rightarrow$).</li>
<li><strong>Grammar</strong>: Use the productions defined in Figure 4. Example rules include:
<ul>
<li>$Reaction ::= ReactantList \ Generator \ ProductList$</li>
<li>$Reactant ::= BalancingNum \ Molecule \ IonicCharacter$</li>
</ul>
</li>
</ul>
<p><strong>2. Molecule Level Analysis (Bond Grouping)</strong></p>
<ul>
<li><strong>Endpoint Classification</strong>: Points are classified as <em>free ends</em>, <em>junctions</em> (3+ bonds), or <em>connections</em> (2 bonds).</li>
<li><strong>Grouping Equation</strong>: An endpoint $(x_k, y_k)$ belongs to group $A$ based on distance thresholding:
$$
\begin{aligned}
Include(x_k, y_k) = \begin{cases} 1, &amp; d_k &lt; t \cdot \max_j d_j + \partial \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$
where $d_k$ is the Euclidean distance from the endpoint to the group center $(x_a, y_a)$, and $t$ and $\partial$ are a scale factor and offset on the threshold.</li>
</ul>
<p><strong>3. Connection Detection</strong></p>
<ul>
<li><strong>Text-Bond Connection</strong>: A text group is connected to a bond if the free end falls within a bounding box expanded by thresholds $t_W$ and $t_H$:
$$
\begin{aligned}
Con(x,y) = \begin{cases} 1, &amp; \min x - t_W &lt; x &lt; \max x + t_W \text{ AND } \min y - t_H &lt; y &lt; \max y + t_H \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Text Recognition</strong>: Hidden Markov Models (HMM) are used for recognizing individual text symbols.</li>
<li><strong>Grammar</strong>: Context-Free Grammar (CFG) designed with ambiguity elimination to ensure a single valid parse tree for any valid formula.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured by recognition accuracy at specific processing stages:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>F1 (Key Symbol Extraction)</td>
          <td>98.3%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>F2 (State-assisted Symbol)</td>
          <td>100%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M2 (Bond End Grouping)</td>
          <td>98.8%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M3 (Free End-Text Conn)</td>
          <td>100%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>T1 (Text Recognition)</td>
          <td>98.7%</td>
          <td>Top-3 Accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, X., Shi, G., &amp; Yang, J. (2009). The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1056&ndash;1060. <a href="https://doi.org/10.1109/ICDAR.2009.70">https://doi.org/10.1109/ICDAR.2009.70</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wangUnderstandingStructureAnalyzing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The {{Understanding}} and {{Structure Analyzing}} for {{Online Handwritten Chemical Formulas}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Wang, Xin and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1056--1060}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.70}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-4500-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>On-line Handwritten Chemical Expression Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</guid><description>Two-level algorithm for recognizing on-line handwritten chemical expressions using structural analysis, ANNs, and string edit distance.</description><content:encoded><![CDATA[<h2 id="a-methodological-approach-to-chemical-recognition">A Methodological Approach to Chemical Recognition</h2>
<p>This is a <strong>Method</strong> paper. It proposes a specific &ldquo;novel two-level algorithm&rdquo; and a &ldquo;System model&rdquo; for recognizing chemical expressions. The paper focuses on the architectural design of the recognition pipeline (segmentation, substance recognition, symbol recognition) and validates it against a &ldquo;conventional algorithm&rdquo; baseline, fitting the standard profile of a methodological contribution.</p>
<h2 id="bridging-the-gap-in-pen-based-chemical-input">Bridging the Gap in Pen-Based Chemical Input</h2>
<p>While pen-based computing has advanced for text and mathematical formulas, inputting chemical expressions remains &ldquo;time-consuming&rdquo;. Existing research often lacks &ldquo;adequate chemical knowledge&rdquo; or relies on algorithms that are too slow (global optimization) or structurally weak (local optimization). The authors aim to bridge this gap by integrating chemical domain knowledge into the recognition process to improve speed and accuracy.</p>
<h2 id="two-level-recognition-strategy-for-formulas">Two-Level Recognition Strategy for Formulas</h2>
<p>The core novelty is a <strong>two-level recognition strategy</strong>:</p>
<ol>
<li><strong>Level 1 (Substance Recognition)</strong>: Uses global structural information to identify entire &ldquo;substance units&rdquo; (e.g., $H_2SO_4$) by matching against a dictionary.</li>
<li><strong>Level 2 (Symbol Recognition)</strong>: If Level 1 fails, the system falls back to segmenting the substance into isolated characters and recognizing them individually.</li>
</ol>
<p>Additionally, the method integrates <strong>syntactic features</strong> (chemical knowledge) such as element conservation to validate and correct results and uses specific geometric features to distinguish superscript/subscript relationships.</p>
<h2 id="dataset-collection-and-baseline-comparisons">Dataset Collection and Baseline Comparisons</h2>
<ul>
<li><strong>Dataset Collection</strong>: The authors collected 1,197 handwritten expression samples from 20 chemistry professionals and students. This included 983 &ldquo;standard&rdquo; expressions (from 341 templates) and 214 &ldquo;arbitrary&rdquo; expressions written freely.</li>
<li><strong>Comparison</strong>: They compared their &ldquo;Two-level recognition&rdquo; approach against a &ldquo;conventional algorithm&rdquo; that bypasses the first level and segments directly into characters.</li>
<li><strong>Metrics</strong>: They measured Material Accuracy (MA), the number of correctly recognized expressions (AEN), and Expression Accuracy (EA).</li>
</ul>
<h2 id="high-accuracy-in-formula-recognition">High Accuracy in Formula Recognition</h2>
<ul>
<li><strong>High Accuracy</strong>: The proposed algorithm achieved <strong>96.4% Material Accuracy (MA)</strong> and <strong>95.7% Expression Accuracy (EA)</strong> on the total test set.</li>
<li><strong>Robustness</strong>: The method performed well on both standard (96.3% EA) and arbitrary (92.5% EA) expressions.</li>
<li><strong>Validation</strong>: The authors conclude the algorithm is &ldquo;reliable,&rdquo; &ldquo;flexible,&rdquo; and suitable for real-time applications compared to prior work.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors constructed two distinct datasets for training and evaluation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Symbol Training</strong></td>
          <td style="text-align: left">ISF Files</td>
          <td style="text-align: left">12,240 files</td>
          <td style="text-align: left">Used to train the ANN classifier. Covers 102 symbol classes (numerals, letters, operators, organic loops).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Testing</strong></td>
          <td style="text-align: left">Handwritten Expressions</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">983 standard + 214 arbitrary expressions collected from 20 chemistry teachers/students.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Structural Segmentation (Superscript/Subscript)</strong></p>
<p>To distinguish relationships (superscript, subscript, in-line), the authors define geometric parameters based on the bounding boxes of adjacent symbols ($x_{i1}, y_{i1}, x_{i2}, y_{i2}$):</p>
<p>$$d = 0.7 \times y_{12} - y_{22} + 0.3 \times y_{11}$$
$$T = 1000 \times d/h$$
$$B = 1000 \times (B_1 - B_2)/h_1$$</p>
<p>Where $B_1, B_2$ are the vertical barycenters of the two symbols and $h, h_1$ are symbol heights. $(T, B)$ serves as the feature vector for classification.</p>
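<p>Taken at face value, the feature computation might look like the sketch below. The helper name, the test inputs, and the choice of the first symbol's height for $h$ are assumptions, not details from the paper:</p>

```python
# Compute the (T, B) superscript/subscript feature pair from two symbol
# bounding boxes and vertical barycenters, following the formulas above.

def spatial_features(box1, box2, bary1, bary2):
    """box = (x1, y1, x2, y2); bary = vertical barycenter of the symbol."""
    x11, y11, x12, y12 = box1
    x21, y21, x22, y22 = box2
    h = y12 - y11  # height of the first symbol (assumed interpretation of h)
    d = 0.7 * y12 - y22 + 0.3 * y11
    T = 1000 * d / h
    B = 1000 * (bary1 - bary2) / h
    return T, B
```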
<p><strong>2. Segmentation Reliability</strong></p>
<p>For segmenting strokes into units, the reliability of a segmentation path is calculated as:</p>
<p>$$Cof(K_{i},n)=\sum_{j=0}^{N}P(k_{j},k_{j+1})+P(S_{K_{i}})+\delta(N)$$</p>
<p>Where $P(k_j, k_{j+1})$ is the reliability of strokes being recognized as symbol $S_{k_j}$.</p>
<p><strong>3. Substance Matching (Level 1)</strong></p>
<p>A modified string edit distance is used to match handwritten input against a dictionary:</p>
<p>$$\lambda_{\overline{u}}=\mu_{i} \times f(Dis(i,j,r)/\sqrt{Max(Len_{i},Len_{j})})$$</p>
<p>Where $\mu_i$ is the recognizer credibility and $Dis(i,j,r)$ is the edit distance.</p>
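<p>A minimal sketch of length-normalized edit-distance matching in this spirit; the credibility weight $\mu$ is kept as a plain multiplier and the paper's function $f$ is simplified to a linear score, so this is an approximation of the formula rather than its exact form:</p>

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match_score(query: str, entry: str, mu: float = 1.0) -> float:
    """Higher is better; normalizes the distance by sqrt of the max length."""
    norm = edit_distance(query, entry) / math.sqrt(max(len(query), len(entry)))
    return mu * (1.0 - norm)
```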
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: An ANN-based classifier is used for isolated symbol recognition.</li>
<li><strong>Input Features</strong>: A set of ~30 features is extracted from strokes, including writing time, interval time, elastic mesh, and stroke outline.</li>
<li><strong>Performance</strong>: The classifier achieved 92.1% accuracy on a test set of 2,702 isolated symbols.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The system was evaluated on the 1,197 expression samples.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Total)</th>
          <th style="text-align: left">Value (Standard)</th>
          <th style="text-align: left">Value (Other)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Material Accuracy (MA)</strong></td>
          <td style="text-align: left">96.4%</td>
          <td style="text-align: left">97.7%</td>
          <td style="text-align: left">94%</td>
          <td style="text-align: left">Accuracy of substance recognition.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left">95.7%</td>
          <td style="text-align: left">96.3%</td>
          <td style="text-align: left">92.5%</td>
          <td style="text-align: left">Accuracy of full expression recognition.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, K., Geng, Q., &amp; Wang, Q. (2008). A Study of On-Line Handwritten Chemical Expressions Recognition. <em>2008 19th International Conference on Pattern Recognition</em>, 1&ndash;4. <a href="https://doi.org/10.1109/ICPR.2008.4761824">https://doi.org/10.1109/ICPR.2008.4761824</a></p>
<p><strong>Publication</strong>: ICPR 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{yangStudyOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Study of On-Line Handwritten Chemical Expressions Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 19th {{International Conference}} on {{Pattern Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yang, Jufeng and Shi, Guangshun and Wang, Kai and Geng, Qian and Wang, Qingren}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1--4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tampa, FL, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2008.4761824}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Img2Mol: Accurate SMILES Recognition from Depictions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/</guid><description>Two-stage CNN approach for converting molecular images to SMILES using CDDD embeddings and extensive data augmentation.</description><content:encoded><![CDATA[<h2 id="method-classification">Method Classification</h2>
<p>This is a <strong>method paper</strong> that introduces Img2Mol, a deep learning system for Optical Chemical Structure Recognition (OCSR). The work focuses on building a fast, accurate, and robust system for converting molecular structure depictions into machine-readable SMILES strings.</p>
<h2 id="systematization-and-motivation">Systematization and Motivation</h2>
<p>Vast amounts of chemical knowledge exist only as images in scientific literature and patents, making this data inaccessible for computational analysis, database searches, or machine learning pipelines. Manually extracting this information is slow and error-prone, creating a bottleneck for drug discovery and chemical research.</p>
<p>While rule-based OCSR systems like OSRA, MolVec, and Imago exist, they are brittle. Small variations in drawing style or image quality can cause them to fail. The authors argue that a deep learning approach, trained on diverse synthetic data, can generalize better across different depiction styles and handle the messiness of real-world images more reliably.</p>
<h2 id="two-stage-architecture-and-core-novelty">Two-Stage Architecture and Core Novelty</h2>
<p>The novelty lies in a two-stage architecture that separates perception from decoding, combined with aggressive data augmentation to ensure robustness. The key contributions are:</p>
<p><strong>1. Two-Stage Architecture with CDDD Embeddings</strong></p>
<p>Img2Mol uses an intermediate representation to predict SMILES from pixels. A <strong>custom CNN encoder</strong> maps the input image to a 512-dimensional <strong>Continuous and Data-Driven Molecular Descriptor (CDDD)</strong> embedding - a pre-trained, learned molecular representation that smoothly captures chemical similarity. A <strong>pre-trained decoder</strong> then converts this CDDD vector into the final canonical SMILES string.</p>
<p>This two-stage design has several advantages:</p>
<ul>
<li>The CDDD space is continuous and chemically meaningful, so nearby embeddings correspond to structurally similar molecules. This makes the regression task easier than learning discrete token sequences directly.</li>
<li>The decoder is pre-trained and fixed, so the CNN only needs to learn the image → CDDD mapping. This decouples the visual recognition problem from the sequence generation problem.</li>
<li>CDDD embeddings naturally enforce chemical validity constraints, reducing the risk of generating nonsensical structures.</li>
</ul>
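<p>The decoupling can be illustrated with stubs. None of this is the actual Img2Mol code: the toy embeddings and the nearest-neighbor &ldquo;decoder&rdquo; merely stand in for the CNN and the pre-trained CDDD decoder to show which part is trainable and which is frozen:</p>

```python
# Two-stage design: a trainable encoder regresses the image to a 512-d
# embedding; a frozen decoder maps the embedding to SMILES. Training
# only ever updates the encoder side.

EMBED_DIM = 512

def encoder(image, weights):
    """Stub CNN: reduces the image to EMBED_DIM values via a weighted sum."""
    s = sum(image)
    return [w * s for w in weights]

def frozen_decoder(embedding):
    """Stub for the fixed CDDD decoder: nearest known embedding wins."""
    known = {"CCO": [0.1] * EMBED_DIM, "c1ccccc1": [0.9] * EMBED_DIM}
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(embedding, v))
    return min(known, key=lambda smi: dist(known[smi]))
```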
<p><strong>2. Extensive Data Augmentation for Robustness</strong></p>
<p>The model was trained on 11.1 million unique molecules from ChEMBL and PubChem, but the critical insight is how the training images were generated. To expose the CNN to maximum variation in depiction styles, the authors:</p>
<ul>
<li>Used <strong>three different cheminformatics libraries</strong> (RDKit, OEChem, Indigo) to render images, each with its own drawing conventions</li>
<li>Applied <strong>wide-ranging augmentations</strong>: varying bond thickness, font size, rotation, resolution (originally 192-256 px; expanded to 190-2500 px in the final model), and other stylistic parameters</li>
<li><strong>Over-sampled larger molecules</strong> to improve performance on complex structures, which are underrepresented in chemical databases</li>
</ul>
<p>This ensures the network rarely sees the same depiction of a molecule twice, forcing it to learn invariant features.</p>
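<p>A sketch of sampling per-image rendering parameters in this spirit. The three library names and the 190&ndash;2500 px resolution range come from the text; the parameter names, the uniform distributions, and the bond-thickness/font ranges are assumptions:</p>

```python
import random

RENDERERS = ["rdkit", "oechem", "indigo"]  # the three libraries used

def sample_augmentation(rng: random.Random) -> dict:
    """Draw one random rendering configuration for a training image."""
    return {
        "renderer": rng.choice(RENDERERS),
        "resolution": rng.randint(190, 2500),   # final model's pixel range
        "rotation_deg": rng.uniform(0.0, 360.0),
        "bond_thickness": rng.uniform(0.5, 2.0),  # assumed range
        "font_scale": rng.uniform(0.7, 1.3),      # assumed range
    }
```

<p>Sampling a fresh configuration per epoch is what makes it unlikely the network ever sees the same depiction of a molecule twice.</p>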
<p><strong>3. Fast Inference</strong></p>
<p>Because the architecture is a simple CNN followed by a fixed decoder, inference is very fast - especially compared to rule-based systems that rely on iterative graph construction algorithms. This makes Img2Mol practical for large-scale document mining.</p>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating that Img2Mol is more accurate, robust, and generalizable than existing rule-based systems:</p>
<ol>
<li>
<p><strong>Benchmark Comparisons</strong>: Img2Mol was tested on several standard OCSR benchmarks, including USPTO (patent images), University of Birmingham (UoB), CLEF, and JPO (Japanese Patent Office) datasets, against three open-source baselines: <strong>OSRA, MolVec, and Imago</strong>. No deep learning baselines were available at the time for comparison.</p>
</li>
<li>
<p><strong>Resolution and Molecular Size Analysis</strong>: The initial model, <code>Img2Mol(no aug.)</code>, was evaluated across different image resolutions and molecule sizes (measured by number of atoms) to understand failure modes. This revealed that:</p>
<ul>
<li>Performance degraded for molecules with &gt;35 atoms</li>
<li>Very high-resolution images lost detail when downscaled to the fixed input size</li>
<li>Low-resolution images (where rule-based methods failed completely) were handled well</li>
</ul>
</li>
<li>
<p><strong>Data Augmentation Ablation</strong>: A final model, <strong>Img2Mol</strong>, was trained with the full augmentation pipeline (wider resolution range, over-sampling of large molecules). Performance was compared to the initial version to quantify the effect of augmentation.</p>
</li>
<li>
<p><strong>Depiction Library Robustness</strong>: The model was tested on images generated by each of the three rendering libraries separately to confirm that training on diverse styles improved generalization.</p>
</li>
<li>
<p><strong>Input Perturbation for Benchmark Fairness</strong>: For the smaller benchmark datasets (USPTO, UoB, CLEF, JPO), the authors applied slight random rotation (within &plusmn;5&deg;) and shearing to each image five times to detect potential overfitting of rule-based methods to well-known benchmarks.</p>
</li>
<li>
<p><strong>Generalization Tests</strong>: Img2Mol was evaluated on real-world patent images from the <strong>STAKER</strong> dataset, which were not synthetically generated. This tested whether the model could transfer from synthetic training data to real documents.</p>
</li>
<li>
<p><strong>Hand-Drawn Molecule Recognition</strong>: As an exploratory test, the authors evaluated performance on hand-drawn molecular structures, a task the model was never trained for, to see if the learned features could generalize to completely different visual styles.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured and compared to rule-based baselines to demonstrate the practical efficiency of the approach.</p>
</li>
</ol>
<h2 id="results-conclusions-and-limitations">Results, Conclusions, and Limitations</h2>
<p>Key benchmark results from Table 1 of the paper (accuracy / Tanimoto similarity, in %):</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Img2Mol</th>
          <th>MolVec 0.9.8</th>
          <th>Imago 2.0</th>
          <th>OSRA 2.1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Img2Mol test set</td>
          <td>88.25 / 95.27</td>
          <td>2.59 / 13.03</td>
          <td>0.02 / 4.74</td>
          <td>2.59 / 13.03</td>
      </tr>
      <tr>
          <td>STAKER</td>
          <td>64.33 / 83.76</td>
          <td>5.32 / 31.78</td>
          <td>0.07 / 5.06</td>
          <td>5.23 / 26.98</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>42.29 / 73.07</td>
          <td>30.68 / 65.50</td>
          <td>5.07 / 7.28</td>
          <td>6.37 / 44.21</td>
      </tr>
      <tr>
          <td>UoB</td>
          <td>78.18 / 88.51</td>
          <td>75.01 / 86.88</td>
          <td>5.12 / 7.19</td>
          <td>70.89 / 85.27</td>
      </tr>
      <tr>
          <td>CLEF</td>
          <td>48.84 / 78.04</td>
          <td>44.48 / 76.61</td>
          <td>26.72 / 41.29</td>
          <td>17.04 / 58.84</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>45.14 / 69.43</td>
          <td>49.48 / 66.46</td>
          <td>23.18 / 37.47</td>
          <td>33.04 / 49.62</td>
      </tr>
  </tbody>
</table>
<p>Per-library accuracy on a 5,000-compound subset (depicted five times each):</p>
<table>
  <thead>
      <tr>
          <th>Library</th>
          <th>Img2Mol</th>
          <th>MolVec</th>
          <th>Imago</th>
          <th>OSRA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RDKit</td>
          <td>93.4%</td>
          <td>3.7%</td>
          <td>0.3%</td>
          <td>4.4%</td>
      </tr>
      <tr>
          <td>OEChem</td>
          <td>89.5%</td>
          <td>33.4%</td>
          <td>12.3%</td>
          <td>26.3%</td>
      </tr>
      <tr>
          <td>Indigo</td>
          <td>79.0%</td>
          <td>22.2%</td>
          <td>4.2%</td>
          <td>22.6%</td>
      </tr>
  </tbody>
</table>
<ul>
<li>
<p><strong>Substantial Performance Gains</strong>: Img2Mol outperformed all three rule-based baselines on nearly every benchmark; the one exception was JPO, where MolVec scored higher (49.48% vs. 45.14% accuracy). Accuracy was measured both as exact SMILES match and as <strong>Tanimoto similarity</strong> (using ECFP6 1024-bit fingerprints). Even when Img2Mol did not predict the exact molecule, it often predicted a chemically similar one.</p>
</li>
<li>
<p><strong>Robustness Across Conditions</strong>: The full Img2Mol model (with aggressive augmentation) showed consistent performance across all image resolutions and molecule sizes. In contrast, rule-based systems were &ldquo;brittle&rdquo; - performance dropped sharply with minor perturbations to image quality or style.</p>
</li>
<li>
<p><strong>Depiction Library Invariance</strong>: Img2Mol&rsquo;s performance was stable across all three rendering libraries (RDKit, OEChem, Indigo), validating the multi-library training strategy. Rule-based methods struggled particularly with RDKit-generated images.</p>
</li>
<li>
<p><strong>Strong Generalization to Real-World Data</strong>: Despite being trained exclusively on synthetic images, Img2Mol performed well on real patent images from the STAKER dataset. This suggests the augmentation strategy successfully captured the diversity of real-world depictions.</p>
</li>
<li>
<p><strong>Overfitting in Baselines</strong>: Rule-based methods performed surprisingly well on older benchmarks (USPTO, UoB, CLEF) but failed on newer datasets (Img2Mol&rsquo;s test set, STAKER). This suggests they may be implicitly tuned to specific drawing conventions in legacy datasets.</p>
</li>
<li>
<p><strong>Limited Hand-Drawn Recognition</strong>: Img2Mol could recognize simple hand-drawn structures but struggled with complex or large molecules. This is unsurprising given the lack of hand-drawn data in training, but it highlights a potential avenue for future work.</p>
</li>
<li>
<p><strong>Speed Advantage</strong>: Img2Mol processed 5,000 images in approximately 4 minutes at the smallest input size, with compute time mostly independent of input resolution due to the fixed 224x224 rescaling. Rule-based methods showed sharply increasing compute times at higher resolutions.</p>
</li>
</ul>
<p>The work establishes that deep learning can outperform traditional rule-based OCSR systems when combined with a principled two-stage architecture and comprehensive data augmentation. The CDDD embedding acts as a bridge between visual perception and chemical structure, providing a chemically meaningful intermediate representation that improves both accuracy and robustness. The focus on synthetic data diversity proves to be an effective strategy for generalizing to real-world documents.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Custom 8-layer Convolutional Neural Network (CNN) encoder</p>
<ul>
<li><strong>Input</strong>: $224 \times 224$ pixel grayscale images</li>
<li><strong>Backbone Structure</strong>: 8 convolutional layers organized into 3 stacks, followed by 3 fully connected layers
<ul>
<li><strong>Stack 1</strong>: 3 Conv layers ($7 \times 7$ filters, stride 3, padding 4) + Max Pooling</li>
<li><strong>Stack 2</strong>: 2 Conv layers + Max Pooling</li>
<li><strong>Stack 3</strong>: 3 Conv layers + Max Pooling</li>
<li><strong>Head</strong>: 3 fully connected layers</li>
</ul>
</li>
<li><strong>Output</strong>: 512-dimensional CDDD embedding vector</li>
</ul>
<p><strong>Decoder</strong>: Pre-trained CDDD decoder (from Winter et al.) - fixed during training, not updated</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Loss Function</strong>: Mean Squared Error (MSE) regression minimizing the distance between the predicted and true embeddings:</p>
<p>$$
\mathcal{L}_{\text{MSE}} = \frac{1}{512} \sum_{i=1}^{512} \left( \text{cddd}_{\text{true},i} - \text{cddd}_{\text{pred},i} \right)^2
$$</p>
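<p>Numerically, the objective is ordinary MSE over the 512 embedding dimensions. The hand-rolled gradient step below stands in for AdamW updating the CNN weights; it is a toy illustration of the objective, not the training code:</p>

```python
# Minimal illustration: MSE between a predicted and a target CDDD
# embedding, plus repeated gradient steps on the prediction itself.

DIM = 512

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def gradient_step(pred, target, lr=0.1):
    # d(MSE)/d(pred_i) = 2 * (pred_i - target_i) / DIM
    return [p - lr * 2 * (p - t) / len(pred) for p, t in zip(pred, target)]

target = [0.5] * DIM   # hypothetical "true" embedding
pred = [0.0] * DIM     # hypothetical network output
for _ in range(100):
    pred = gradient_step(pred, target)  # loss shrinks toward zero
```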
<p><strong>Optimizer</strong>: AdamW with initial learning rate $10^{-4}$</p>
<p><strong>Training Schedule</strong>:</p>
<ul>
<li>Batch size: 256</li>
<li>Training duration: 300 epochs</li>
<li>Plateau scheduler: Multiplies learning rate by 0.7 if validation loss plateaus for 10 epochs</li>
<li>Early stopping: Triggered if no improvement in validation loss for 50 epochs</li>
</ul>
<p><strong>Noise Tolerance</strong>: The decoder requires the CNN to predict embeddings with noise level $\sigma \le 0.15$ to achieve &gt;90% accuracy</p>
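<p>This tolerance can be illustrated numerically: perturb a &ldquo;true&rdquo; embedding with Gaussian noise of a given $\sigma$ and measure how far the result drifts. The embedding values and the RMS-error framing are assumptions for the sake of the example; only $\sigma = 0.15$ comes from the paper:</p>

```python
import math
import random

def perturb(embedding, sigma, rng):
    """Add i.i.d. Gaussian noise of scale sigma to every dimension."""
    return [v + rng.gauss(0.0, sigma) for v in embedding]

def rms_error(a, b):
    """Root-mean-square deviation between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
```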
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>: 11.1 million unique molecules from ChEMBL and PubChem</p>
<p><strong>Splits</strong>: Approximately 50,000 examples each for validation and test sets</p>
<p><strong>Synthetic Image Generation</strong>:</p>
<ul>
<li>Three cheminformatics libraries: RDKit, OEChem, and Indigo</li>
<li>Augmentations: Resolution (190-2500 pixels), rotation, bond thickness, font size</li>
<li>Salt stripping: Keep only the largest fragment</li>
<li>Over-sampling: Larger molecules (&gt;35 atoms) over-sampled to improve performance</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li>Exact SMILES match accuracy</li>
<li>Tanimoto similarity (chemical fingerprint-based structural similarity)</li>
</ul>
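<p>Both metrics are easy to state precisely. The sketch below computes Tanimoto similarity on toy on-bit sets; the paper uses ECFP6 1024-bit fingerprints, for which these made-up sets merely stand in:</p>

```python
def exact_match(pred_smiles: str, true_smiles: str) -> bool:
    """Exact-match accuracy assumes both SMILES are canonicalized first."""
    return pred_smiles == true_smiles

def tanimoto(fp_a: set, fp_b: set) -> float:
    """|A intersect B| / |A union B| over the sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```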
<p><strong>Benchmarks</strong>:</p>
<ul>
<li>Img2Mol test set (25,000 synthetic images at 224x224 px)</li>
<li>STAKER (30,000 real-world USPTO patent images at 256x256 px)</li>
<li>USPTO (4,852 patent images, avg. 649x417 px)</li>
<li>UoB (5,716 images from University of Birmingham, avg. 762x412 px)</li>
<li>CLEF (711 images, avg. 1243x392 px)</li>
<li>JPO (365 Japanese Patent Office images, avg. 607x373 px)</li>
<li>Hand-drawn molecular structures (exploratory, no defined benchmark)</li>
</ul>
<p><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based systems)</p>
<h3 id="hardware">Hardware</h3>
<p>⚠️ <strong>Unspecified in paper or supplementary materials.</strong> Inference speed reported as ~4 minutes for 5000 images; training hardware (GPU model, count) is undocumented.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">Img2Mol GitHub</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">Img2Mol model weights</a></td>
          <td>Model</td>
          <td>CC BY-NC 4.0</td>
          <td>Non-commercial use only</td>
      </tr>
  </tbody>
</table>
<h3 id="known-limitations">Known Limitations</h3>
<p><strong>Molecular Size</strong>: Performance degrades for molecules with &gt;35 atoms. This is partly a property of the CDDD latent space itself: for larger molecules, the &ldquo;volume of decodable latent space&rdquo; shrinks, making the decoder more sensitive to small noise perturbations in the predicted embedding.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clevert, D.-A., Le, T., Winter, R., &amp; Montanari, F. (2021). Img2Mol &ndash; accurate SMILES recognition from molecular graphical depictions. <em>Chemical Science</em>, 12(42), 14174&ndash;14181. <a href="https://doi.org/10.1039/d1sc01839f">https://doi.org/10.1039/d1sc01839f</a></p>
<p><strong>Publication</strong>: Chemical Science (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">GitHub Repository</a></li>
<li><a href="https://doi.org/10.1039/d1sc01839f">Paper on Royal Society of Chemistry</a></li>
</ul>
]]></content:encoded></item><item><title>HMM-based Online Recognition of Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</guid><description>Online recognition of handwritten chemical symbols using Hidden Markov Models with 11-dimensional local features, achieving 89.5% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper that proposes a specific algorithmic pipeline for the online recognition of handwritten chemical symbols. The core contribution is the engineering of an 11-dimensional feature vector combined with a Hidden Markov Model (HMM) architecture. The paper validates this method through quantitative experiments on a custom dataset, focusing on recognition accuracy as the primary metric.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Recognizing chemical symbols is uniquely challenging due to the complex structure of chemical expressions and the nature of pen-based input, which often produces broken or conglutinated (merged) strokes. Variations in writing style and random noise add further difficulty. While online recognition of Western characters and CJK scripts is well developed, work specifically targeting online chemical symbol recognition is scarce; most prior research focuses on offline recognition or global optimization.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The primary novelty is the application of continuous HMMs specifically to the domain of <strong>online</strong> chemical symbol recognition, utilizing a specialized set of <strong>11-dimensional local features</strong>. While HMMs have been used for other scripts, this paper tailors the feature extraction (including curliness, linearity, and writing direction) to capture the specific geometric properties of chemical symbols.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors constructed a specific dataset for this task involving 20 participants (college teachers and students).</p>
<ul>
<li><strong>Dataset</strong>: 64 distinct symbols (digits, English letters, Greek letters, operators)</li>
<li><strong>Volume</strong>: 7,808 total samples (122 per symbol), split into 5,670 training samples and 2,016 testing samples</li>
<li><strong>Model Sweeps</strong>: They evaluated the HMM performance by varying the number of states (4, 6, 8) and the number of Gaussians per state (3, 4, 6, 9, 12)</li>
</ul>
<h2 id="what-were-the-outcomes-and-conclusions-drawn">What were the outcomes and conclusions drawn?</h2>
<ul>
<li><strong>Performance</strong>: The best configuration (6 states, 9 Gaussians) achieved a <strong>top-1 accuracy of 89.5%</strong> and a <strong>top-3 accuracy of 98.7%</strong></li>
<li><strong>Scaling</strong>: Results showed that generally, increasing the number of states and Gaussians improved accuracy, though at the cost of computational efficiency</li>
<li><strong>Error Analysis</strong>: The primary sources of error were shape similarities between specific characters (e.g., &lsquo;0&rsquo; vs &lsquo;O&rsquo; vs &lsquo;o&rsquo;, and &lsquo;C&rsquo; vs &lsquo;c&rsquo; vs &lsquo;(&rsquo;)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed / Very Low Reproducibility. This 2009 study relies on a private, custom-collected dataset and does not provide source code, model weights, or an open-access preprint.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None publicly available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No open source code, open datasets, or open-access preprints were released with this publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilized a custom dataset collected in a laboratory environment.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">5,670 samples</td>
          <td style="text-align: left">90 samples per symbol</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">2,016 samples</td>
          <td style="text-align: left">32 samples per symbol</td>
      </tr>
  </tbody>
</table>
<p><strong>Dataset Composition</strong>: The set includes <strong>64 symbols</strong>: Digits (0-9), Uppercase (A-Z, missing Q), Lowercase (a-z, selected), Greek letters ($\alpha$, $\beta$, $\gamma$, $\pi$), and operators ($+$, $=$, $\rightarrow$, $\uparrow$, $\downarrow$, $($ , $)$).</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Preprocessing</strong></p>
<p>The raw tablet data undergoes a 6-step pipeline:</p>
<ol>
<li><strong>Duplicate Point Elimination</strong>: Removing sequential points with identical coordinates</li>
<li><strong>Broken Stroke Connection</strong>: Using Bezier curves to interpolate missing points/connect broken strokes</li>
<li><strong>Hook Elimination</strong>: Removing artifacts at the start/end of strokes characterized by short length and sharp angle changes</li>
<li><strong>Smoothing</strong>: Reducing noise from erratic pen movement</li>
<li><strong>Re-sampling</strong>: Spacing points equidistantly to remove temporal variation</li>
<li><strong>Size Normalization</strong>: Removing variation in writing scale</li>
</ol>
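<p>The last two steps of the pipeline are simple geometric transforms. A minimal sketch of re-sampling and size normalization (function names, point counts, and the interpolation scheme are illustrative, not taken from the paper):</p>

```python
import numpy as np

def resample_equidistant(stroke: np.ndarray, n_points: int = 64) -> np.ndarray:
    """Step 5: space points equidistantly along the stroke's arc length."""
    deltas = np.diff(stroke, axis=0)
    seg_len = np.hypot(deltas[:, 0], deltas[:, 1])
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])  # cumulative arc length
    targets = np.linspace(0.0, arc[-1], n_points)
    x = np.interp(targets, arc, stroke[:, 0])
    y = np.interp(targets, arc, stroke[:, 1])
    return np.stack([x, y], axis=1)

def normalize_size(stroke: np.ndarray) -> np.ndarray:
    """Step 6: scale the stroke into a unit box, preserving aspect ratio."""
    mins, maxs = stroke.min(axis=0), stroke.max(axis=0)
    span = np.maximum(maxs - mins, 1e-9)  # guard against degenerate strokes
    return (stroke - mins) / span.max()
```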
<p><strong>2. Feature Extraction (11 Dimensions)</strong></p>
<p>Features are extracted from a 5-point window centered on $t$ ($t-2$ to $t+2$). The 11 dimensions are:</p>
<ol>
<li><strong>Normalized Vertical Position</strong>: $y(t)$ mapped to $[0,1]$</li>
<li><strong>Normalized First Derivative ($x'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized First Derivative ($y'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized Second Derivative ($x''$)</strong>: Computed using $x'$ values</li>
<li><strong>Normalized Second Derivative ($y''$)</strong>: Computed using $y'$ values</li>
<li><strong>Curvature</strong>: $\frac{x'y'' - x''y'}{(x'^2 + y'^2)^{3/2}}$</li>
<li><strong>Writing Direction (Cos)</strong>: $\cos \alpha(t)$ based on vector from $t-1$ to $t+1$</li>
<li><strong>Writing Direction (Sin)</strong>: $\sin \alpha(t)$</li>
<li><strong>Aspect Ratio</strong>: Ratio of height to width in the 5-point window</li>
<li><strong>Curliness</strong>: Deviation from the straight line connecting the first and last point of the window</li>
<li><strong>Linearity</strong>: Average squared distance of points in the window to the straight line connecting start/end points</li>
</ol>
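<p>A few of the windowed features can be sketched directly from their definitions. The sketch below computes writing direction (features 7-8), aspect ratio (feature 9), and linearity (feature 11) for one window; the function name and return format are illustrative:</p>

```python
import numpy as np

def window_features(pts: np.ndarray, t: int) -> dict:
    """Features over the 5-point window [t-2, t+2] centered on point t."""
    w = pts[t - 2:t + 3]
    # Writing direction: angle of the vector from point t-1 to t+1.
    dx, dy = pts[t + 1] - pts[t - 1]
    norm = np.hypot(dx, dy) or 1e-9
    cos_a, sin_a = dx / norm, dy / norm
    # Aspect ratio: window height over window width.
    width = np.ptp(w[:, 0]) or 1e-9
    aspect = np.ptp(w[:, 1]) / width
    # Linearity: mean squared distance to the chord joining window endpoints.
    p0, p1 = w[0], w[-1]
    chord = p1 - p0
    chord_len = np.hypot(*chord) or 1e-9
    dists = np.abs(chord[0] * (w[:, 1] - p0[1])
                   - chord[1] * (w[:, 0] - p0[0])) / chord_len
    return {"cos": cos_a, "sin": sin_a,
            "aspect": aspect, "linearity": float(np.mean(dists ** 2))}
```

<p>On a perfectly straight horizontal window, this yields $\cos\alpha = 1$, $\sin\alpha = 0$, and zero linearity, as expected.</p>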
<p><strong>3. Feature Normalization</strong></p>
<p>The final feature matrix $V$ is normalized to zero mean and unit standard deviation using the covariance matrix: $o_t = \Sigma^{-1/2}(v_t - \mu)$.</p>
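<p>This is standard whitening. A minimal sketch via the eigendecomposition of the covariance matrix (the eigendecomposition route is one common way to form $\Sigma^{-1/2}$; the paper does not specify how it is computed):</p>

```python
import numpy as np
from numpy.linalg import eigh

def whiten(V: np.ndarray) -> np.ndarray:
    """o_t = Sigma^{-1/2} (v_t - mu): zero-mean, identity-covariance features."""
    mu = V.mean(axis=0)
    sigma = np.cov(V, rowvar=False)
    vals, vecs = eigh(sigma)                       # Sigma = Q diag(vals) Q^T
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return (V - mu) @ inv_sqrt.T
```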
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Continuous Hidden Markov Models (HMM)</li>
<li><strong>Topology</strong>: Left-to-right (Bakis model)</li>
<li><strong>Initialization</strong>: Initial distribution $\pi = \{1, 0, \ldots, 0\}$; uniform transition matrix $A$; segmental k-means for observation matrix $B$</li>
<li><strong>Training</strong>: Baum-Welch re-estimation</li>
<li><strong>Decision</strong>: Maximum likelihood classification ($\hat{\lambda} = \arg \max P(O|\lambda)$)</li>
</ul>
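<p>The decision rule scores each sequence under every symbol's HMM with the forward algorithm and picks the maximum. A log-domain sketch with a single diagonal Gaussian per state standing in for the paper's 9-component mixtures (the Bakis topology is enforced by placing $-\infty$ in the transition log-matrix):</p>

```python
import numpy as np

def forward_loglik(obs, means, variances, log_A, log_pi):
    """log P(O | lambda) for a left-to-right HMM with diagonal-Gaussian
    emissions, computed with the forward algorithm in the log domain."""
    def log_gauss(x, mu, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)
    log_b = np.array([log_gauss(obs, m, v) for m, v in zip(means, variances)])  # (S, T)
    alpha = log_pi + log_b[:, 0]
    for t in range(1, obs.shape[0]):
        # logsumexp over previous states; -inf entries encode the Bakis mask
        alpha = log_b[:, t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

def classify(obs, models):
    """Maximum-likelihood decision: argmax over lambda of P(O | lambda)."""
    scores = {name: forward_loglik(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get)
```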
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Best Value</th>
          <th style="text-align: left">Configuration</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Accuracy</strong></td>
          <td style="text-align: left"><strong>89.5%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Highest reported accuracy</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-3 Accuracy</strong></td>
          <td style="text-align: left"><strong>98.7%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Top-3 candidate accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Yang, J. (2009). HMM-Based Online Recognition of Handwritten Chemical Symbols. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1255&ndash;1259. <a href="https://doi.org/10.1109/ICDAR.2009.99">https://doi.org/10.1109/ICDAR.2009.99</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2009hmm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{HMM-Based Online Recognition of Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th International Conference on Document Analysis and Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{75}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1255--1259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.99}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Symbol Recognition Using SVMs</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</guid><description>A hybrid SVM and elastic matching approach for recognizing handwritten chemical symbols drawn on touch devices, achieving 89.7% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-taxonomy">Paper Contribution and Taxonomy</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI for Physical Sciences taxonomy</a>.</p>
<ul>
<li><strong>Dominant Basis</strong>: The authors propose a novel hybrid architecture (SVM-EM) that combines two existing techniques to solve a specific recognition problem.</li>
<li><strong>Rhetorical Indicators</strong>: The paper explicitly defines algorithms (Algorithm 1 &amp; 2), presents a system architecture, and validates the method via ablation studies comparing the hybrid approach against its individual components.</li>
</ul>
<h2 id="motivation-for-pen-based-input">Motivation for Pen-Based Input</h2>
<p>Entering chemical expressions on digital devices is difficult due to their complex 2D spatial structure.</p>
<ul>
<li><strong>The Problem</strong>: While handwriting recognition for text and math is mature, chemical structures involve unique symbols and spatial arrangements that existing tools struggle to process efficiently.</li>
<li><strong>Existing Solutions</strong>: Standard tools (like ChemDraw) rely on point-and-click interactions, which are described as complicated and non-intuitive compared to direct handwriting.</li>
<li><strong>Goal</strong>: To enable fluid handwriting input on pen/touch-based devices (like iPads) by accurately recognizing individual chemical symbols in real-time.</li>
</ul>
<h2 id="novelty-hybrid-svm-and-elastic-matching">Novelty: Hybrid SVM and Elastic Matching</h2>
<p>The core contribution is the <strong>Hybrid SVM-EM</strong> approach, which splits recognition into a coarse classification stage and a fine-grained verification stage.</p>
<ul>
<li><strong>Two-Stage Pipeline</strong>:
<ol>
<li><strong>SVM Recognition</strong>: Uses statistical features (stroke count, turning angles) to generate a short-list of candidate symbols.</li>
<li><strong>Elastic Matching (EM)</strong>: Uses a geometric point-to-point distance metric to re-rank these candidates against a library of stored symbol prototypes.</li>
</ol>
</li>
<li><strong>Online Stroke Partitioning</strong>: A heuristic-based method to group strokes into symbols in real-time based on time adjacency (grouping the last $n$ strokes) and spatial intersection checks, without waiting for the user to finish the entire drawing.</li>
</ul>
<h2 id="experimental-design-and-data-collection">Experimental Design and Data Collection</h2>
<p>The authors conducted a user study to collect data and evaluate the system:</p>
<ul>
<li><strong>Participants</strong>: 10 users were recruited to write chemical symbols on an iPad.</li>
<li><strong>Task</strong>: Each user wrote 78 distinct chemical symbols (digits, alphabets, bonds) 3 times each.</li>
<li><strong>Baselines</strong>: The hybrid method was compared against two baselines:
<ol>
<li>SVM only</li>
<li>Elastic Matching only</li>
</ol>
</li>
<li><strong>Metrics</strong>: Evaluation focused on <strong>Precision@k</strong> (where $k=1, 3, 5$), measuring how often the correct symbol appeared in the top-$k$ suggestions.</li>
</ul>
<h2 id="recognition-performance-and-outcomes">Recognition Performance and Outcomes</h2>
<p>The hybrid approach demonstrated improved performance compared to using either technique in isolation.</p>
<ul>
<li><strong>Key Results</strong>:
<ul>
<li><strong>Hybrid SVM-EM</strong>: 89.7% Precision@1 (Top-1 accuracy).</li>
<li><strong>SVM Only</strong>: 85.1% Precision@1.</li>
<li><strong>EM Only</strong>: 76.7% Precision@1.</li>
</ul>
</li>
<li><strong>Category Performance</strong>: The system performed best on Operators (91.9%) and Digits (91.3%), with slightly lower performance on Alphabetic characters (88.6%).</li>
<li><strong>Impact</strong>: The system was successfully implemented as a real-time iOS application, allowing users to draw complex structures like $C\#CC(O)$ which are then converted to SMILES strings.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study generated a custom dataset for training and evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Stats</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>2,340 samples</td>
          <td>Collected from 10 users. Consists of <strong>78 unique symbols</strong>: 10 digits (0-9), 52 letters (A-Z, a-z), and 16 bonds/operators (e.g., $=$, $+$, hash bonds).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Unspecified size</td>
          <td>A &ldquo;Chemical Elastic Symbol Library&rdquo; was created containing samples of all supported symbols to serve as prototypes for the Elastic Matching step.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct algorithmic steps:</p>
<p><strong>1. Stroke Partitioning</strong></p>
<ul>
<li><strong>Logic</strong>: Groups the most recently written stroke with up to the last 4 previous strokes.</li>
<li><strong>Filtering</strong>: Invalid groups are removed using &ldquo;Spatial Distance Checking&rdquo; (strokes too far apart) and &ldquo;Stroke Intersection Checking&rdquo; (strokes that don&rsquo;t intersect where expected).</li>
</ul>
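<p>The partitioning heuristic can be sketched as follows. The bounding-box gap test stands in for the paper's &ldquo;Spatial Distance Checking&rdquo;; the threshold value and the omission of the intersection check are illustrative simplifications:</p>

```python
import numpy as np

def candidate_groups(strokes, max_prev=4, max_gap=30.0):
    """Group the newest stroke with up to the last 4 previous strokes,
    discarding groups whose strokes lie too far from the newest one."""
    def bbox(pts):
        a = np.asarray(pts)
        return a.min(axis=0), a.max(axis=0)

    def bbox_gap(s1, s2):
        (lo1, hi1), (lo2, hi2) = bbox(s1), bbox(s2)
        # Per-axis separation between the two boxes; 0 when they overlap.
        gap = np.maximum(np.maximum(lo1 - hi2, lo2 - hi1), 0.0)
        return np.hypot(*gap)

    newest = strokes[-1]
    groups = [[newest]]  # the newest stroke alone is always a candidate
    for k in range(1, min(max_prev, len(strokes) - 1) + 1):
        group = strokes[-1 - k:]
        if all(bbox_gap(newest, s) <= max_gap for s in group[:-1]):
            groups.append(list(group))
    return groups
```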
<p><strong>2. Preprocessing</strong></p>
<ul>
<li><strong>Size Normalization</strong>: Scales symbol to a standard size based on its bounding box.</li>
<li><strong>Smoothing</strong>: Uses average smoothing (replacing mid-points with the average of neighbors) to remove jitter.</li>
<li><strong>Sampling</strong>: Resamples valid strokes to a fixed number of <strong>50 points</strong>.</li>
</ul>
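<p>A sketch of the preprocessing steps. The index-based resampling (rather than arc-length resampling) is an assumption; the paper only specifies a fixed output of 50 points:</p>

```python
import numpy as np

def smooth(stroke: np.ndarray) -> np.ndarray:
    """Average smoothing: each interior point becomes the mean of itself
    and its two neighbors; endpoints are kept unchanged."""
    out = stroke.copy()
    out[1:-1] = (stroke[:-2] + stroke[1:-1] + stroke[2:]) / 3.0
    return out

def normalize_and_sample(stroke: np.ndarray, n: int = 50) -> np.ndarray:
    """Scale to a unit bounding box, then resample to n points."""
    mins = stroke.min(axis=0)
    span = max(np.ptp(stroke[:, 0]), np.ptp(stroke[:, 1]), 1e-9)
    s = (stroke - mins) / span
    idx = np.linspace(0, len(s) - 1, n)
    x = np.interp(idx, np.arange(len(s)), s[:, 0])
    y = np.interp(idx, np.arange(len(s)), s[:, 1])
    return np.stack([x, y], axis=1)
```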
<p><strong>3. SVM Feature Extraction</strong></p>
<ul>
<li><strong>Horizontal Angle</strong>: Calculated between two consecutive points ($P_1, P_2$). Values are binned into 12 groups ($30^{\circ}$ each).</li>
<li><strong>Turning Angle</strong>: The difference between two consecutive horizontal angles. Values are binned into 18 groups ($10^{\circ}$ each).</li>
<li><strong>Features</strong>: Input vector consists of stroke count, normalized coordinates, and the percentage of angles falling into the histograms described above.</li>
</ul>
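<p>The two angle histograms can be sketched directly. The wrap-around convention for the turning angle (folding into $(-90^{\circ}, 90^{\circ}]$ so that 18 bins of $10^{\circ}$ cover it) is a guess about the paper's binning, not something it states:</p>

```python
import numpy as np

def angle_histograms(pts: np.ndarray):
    """Horizontal-angle histogram (12 bins of 30 deg) and turning-angle
    histogram (18 bins of 10 deg), each as fractions of all angles."""
    d = np.diff(pts, axis=0)
    horiz = np.degrees(np.arctan2(d[:, 1], d[:, 0])) % 360.0  # [0, 360)
    h_hist = np.histogram(horiz, bins=12, range=(0, 360))[0]
    turn = np.diff(horiz)
    turn = (turn + 90.0) % 180.0 - 90.0  # fold into (-90, 90] (assumption)
    t_hist = np.histogram(turn, bins=18, range=(-90, 90))[0]
    return h_hist / max(len(horiz), 1), t_hist / max(len(turn), 1)
```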
<p><strong>4. Elastic Matching (Verification)</strong></p>
<ul>
<li><strong>Distance Function</strong>: Euclidean distance summation between the points of the candidate symbol ($s$) and the partitioned input ($s_p$).
$$
\begin{aligned}
D(s, s_p) = \sum_{j=1}^{n} \sqrt{(x_{s,j} - x_{p,j})^2 + (y_{s,j} - y_{p,j})^2}
\end{aligned}
$$
<em>Note: The paper formula sums the distances; $n$ is the number of points (50).</em></li>
<li><strong>Ranking</strong>: Candidates are re-ranked in ascending order of this elastic distance.</li>
</ul>
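<p>The distance and re-ranking steps are straightforward to sketch, assuming both symbols have already been resampled to the same 50 points (the dictionary-based library and function names are illustrative):</p>

```python
import numpy as np

def elastic_distance(s: np.ndarray, s_p: np.ndarray) -> float:
    """Summed point-to-point Euclidean distance D(s, s_p)."""
    return float(np.sqrt(((s - s_p) ** 2).sum(axis=1)).sum())

def rerank(candidates, library, s_p):
    """Re-rank the SVM's candidate symbols by ascending elastic distance."""
    return sorted(candidates, key=lambda c: elastic_distance(library[c], s_p))
```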
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: Linear Support Vector Machine (SVM) implemented using <strong>LibSVM</strong>.</li>
<li><strong>Symbol Library</strong>: A &ldquo;Chemical Elastic Symbol Library&rdquo; stores the raw stroke point sequences for all 78 supported symbols to enable the elastic matching comparison.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using precision at different ranks (Top-N accuracy).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision@1</strong></td>
          <td><strong>89.7%</strong></td>
          <td>85.1% (SVM)</td>
          <td>Hybrid model reduces error rate significantly over baselines.</td>
      </tr>
      <tr>
          <td><strong>Precision@3</strong></td>
          <td><strong>94.1%</strong></td>
          <td>N/A</td>
          <td>High recall in top 3 allows users to quickly correct errors via UI selection.</td>
      </tr>
      <tr>
          <td><strong>Precision@5</strong></td>
          <td><strong>94.6%</strong></td>
          <td>N/A</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: Apple iPad (iOS platform).</li>
<li><strong>Input</strong>: Touch/Pen-based input recording digital ink (x, y coordinates and pen-up/down events).</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, P., Hui, S. C., &amp; Fu, C. W. (2013). Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition. <em>2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)</em>, 535&ndash;540. <a href="https://doi.org/10.1109/ICIS.2013.6607894">https://doi.org/10.1109/ICIS.2013.6607894</a></p>
<p><strong>Publication</strong>: IEEE ICIS 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{tangOnlineChemicalSymbol2013,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Tang, Peng and Hui, Siu Cheung and Fu, Chi-Wing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2013</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{535--540}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICIS.2013.6607894}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Ring Recognition with Neural Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</guid><description>A two-phase Classifier-Recognizer neural network pipeline for recognizing 23 types of handwritten heterocyclic chemical rings, achieving ~94% accuracy.</description><content:encoded><![CDATA[<h2 id="contribution-recognition-architecture-for-heterocyclic-rings">Contribution: Recognition Architecture for Heterocyclic Rings</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a specific algorithmic architecture (the &ldquo;Classifier-Recognizer Approach&rdquo;) to solve a pattern recognition problem. The rhetorical structure centers on defining three variations of a method, performing ablation-like comparisons between them (Whole Image vs. Lower Part), and demonstrating superior performance metrics (~94% accuracy) for the proposed technique.</p>
<h2 id="motivation-enabling-sketch-based-chemical-search">Motivation: Enabling Sketch-Based Chemical Search</h2>
<p>The authors identify a gap in existing OCR and handwriting recognition research, which typically focuses on alphanumeric characters or whole words.</p>
<ul>
<li><strong>Missing Capability</strong>: Recognition of specific <em>heterocyclic chemical rings</em> (23 types) had not been performed previously.</li>
<li><strong>Practical Utility</strong>: Existing chemical search engines require text-based queries (names); this work enables &ldquo;backward&rdquo; search where a user can draw a ring to find its information.</li>
<li><strong>Educational/Professional Aid</strong>: Useful for chemistry departments and mobile applications where chemists can sketch formulas on screens.</li>
</ul>
<h2 id="innovation-the-classifier-recognizer-pipeline">Innovation: The Classifier-Recognizer Pipeline</h2>
<p>The core novelty is the <strong>two-phase &ldquo;Classifier-Recognizer&rdquo; architecture</strong> designed to handle the visual similarity of heterocyclic rings:</p>
<ol>
<li><strong>Phase 1 (Classifier)</strong>: A neural network classifies the ring into one of four broad categories (S, N, O, Others) based solely on the <em>upper part</em> of the image (40x15 pixels).</li>
<li><strong>Phase 2 (Recognizer)</strong>: A class-specific neural network identifies the exact ring.</li>
<li><strong>Optimization</strong>: The most successful variation (&ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo;) uses only the <em>lower part</em> of the image and <em>odd rows</em> (half-grid) to reduce input dimensionality and computation time while improving accuracy. This effectively subsamples the input grid matrix $M \in \mathbb{R}^{H \times W}$ to a reduced matrix $M_{\text{sub}}$:
$$ M_{\text{sub}} = \{ m_{i,j} \in M \mid i \text{ is odd} \} $$</li>
</ol>
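<p>The odd-row subsampling is a one-line array operation. A sketch (which rows count as &ldquo;odd&rdquo; depends on whether the paper indexes from 0 or 1; this version assumes 1-based indexing, i.e. it keeps rows 1, 3, 5, &hellip;):</p>

```python
import numpy as np

def half_grid(image: np.ndarray) -> np.ndarray:
    """Keep only the odd rows of the 40x40 grid, halving the network
    input from 1600 to 800 values."""
    return image[::2]  # 0-based even indices == 1-based odd rows
```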
<h2 id="failed-preliminary-approaches">Failed Preliminary Approaches</h2>
<p>Before arriving at the Classifier-Recognizer architecture, the authors tried three simpler methods that all failed:</p>
<ol>
<li><strong>Ordinary NN</strong>: A single neural network with 1600 inputs (40x40 grid), 1600 hidden units, and 23 outputs. This standard approach achieved only 7% accuracy.</li>
<li><strong>Row/Column pixel counts</strong>: Using the number of black pixels per row and per column as features ($N_c + N_r$ inputs), which dramatically reduced dimensionality. This performed even worse, below 1% accuracy.</li>
<li><strong>Midline crossing count</strong>: Drawing a horizontal midline and counting the number of line crossings. This failed because the crossing count varies between writers for the same ring.</li>
</ol>
<p>These failures motivated the two-phase Classifier-Recognizer design.</p>
<h2 id="experimental-setup-and-network-variations">Experimental Setup and Network Variations</h2>
<p>The authors conducted a comparative study of three methodological variations:</p>
<ol>
<li><strong>Whole Image Recognizer</strong>: Uses the full image.</li>
<li><strong>Whole Image (Half Size Grid)</strong>: Uses only odd rows ($20 \times 40$ pixels).</li>
<li><strong>Lower Part (Half Size Grid)</strong>: Uses the lower part of the image with odd rows (the proposed method).</li>
</ol>
<p><strong>Setup</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 23 types of heterocyclic rings.</li>
<li><strong>Training</strong>: 1500 samples (distributed across S, N, O, and Others classes).</li>
<li><strong>Testing</strong>: 1150 samples.</li>
<li><strong>Metric</strong>: Recognition accuracy (Performance %) and Error %.</li>
</ul>
<h2 id="results-high-accuracy-via-dimension-reduction">Results: High Accuracy via Dimension Reduction</h2>
<ul>
<li><strong>Superior Method</strong>: The &ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo; achieved the best performance (~94% overall).</li>
<li><strong>High Classifier Accuracy</strong>: The first phase (classification into S/N/O/Other) achieves 100% accuracy for class S, 98.67% for O, 97.75% for N, and 97.67% for Others (Table 3).</li>
<li><strong>Class &lsquo;Others&rsquo; Difficulty</strong>: The &lsquo;Others&rsquo; class showed lower performance (~90-93%) compared to S/N/O due to the higher complexity and similarity of rings in that category.</li>
<li><strong>Efficiency</strong>: The half-grid approach reduced training time from ~53 hours (Whole Image) to ~35 hours (Lower Part Half Size Grid) while improving accuracy from 87% to 94%.</li>
</ul>
<p><strong>Training/Testing comparison across the three Classifier-Recognizer variations (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Method</th>
          <th style="text-align: left">Hidden Nodes</th>
          <th style="text-align: left">Iterations</th>
          <th style="text-align: left">Training Time (hrs)</th>
          <th style="text-align: left">Error</th>
          <th style="text-align: left">Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Whole Image</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~53</td>
          <td style="text-align: left">13.0%</td>
          <td style="text-align: left">87.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Whole Image (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~41</td>
          <td style="text-align: left">9.0%</td>
          <td style="text-align: left">91.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Lower Part (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~35</td>
          <td style="text-align: left">6.0%</td>
          <td style="text-align: left">94.0%</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset consists of handwritten samples of 23 specific heterocyclic rings.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1500 samples</td>
          <td style="text-align: left">Split: 300 (S), 400 (N), 400 (O), 400 (Others)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1150 samples</td>
          <td style="text-align: left">Split: 150 (S), 300 (O), 400 (N), 300 (Others)</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing Steps</strong>:</p>
<ol>
<li><strong>Monochrome Conversion</strong>: Convert image to monochrome bitmap.</li>
<li><strong>Grid Scaling</strong>: Convert drawing area (regardless of original size) to a fixed <strong>40x40</strong> grid.</li>
<li><strong>Bounding</strong>: Scale the ring shape itself to fit the 40x40 grid.</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<p><strong>The &ldquo;Lower Part with Half Size&rdquo; Pipeline</strong>:</p>
<ol>
<li><strong>Cut Point</strong>: A horizontal midline is defined; the algorithm separates the &ldquo;Upper Part&rdquo; and &ldquo;Lower Part&rdquo;.</li>
<li><strong>Phase 1 Input</strong>: The <strong>Upper Part</strong> (rows 0-15 approx, scaled) is fed to the Classifier NN to determine the class (S, N, O, or Others).</li>
<li><strong>Phase 2 Input</strong>:
<ul>
<li>For classes <strong>S, N, O</strong>: The <strong>Lower Part</strong> of the image is used.</li>
<li>For class <strong>Others</strong>: The <strong>Whole Ring</strong> is used.</li>
</ul>
</li>
<li><strong>Dimensionality Reduction</strong>: For the recognizer networks, only <strong>odd rows</strong> are used (effectively a 20x40 input grid) to reduce inputs from 1600 to 800.</li>
</ol>
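<p>The routing logic of the two-phase pipeline can be sketched as below. The paper is ambiguous about exact input sizes after combining region selection with the half-grid, so this sketch simply halves the rows of whatever region phase 2 receives; <code>classifier</code> and <code>recognizers</code> are stand-ins for the trained networks:</p>

```python
import numpy as np

def recognize(image: np.ndarray, classifier, recognizers):
    """Phase 1 classifies from the upper part; phase 2 uses the lower part
    (or the whole ring for 'Others'), subsampled to odd rows."""
    upper, lower = image[:20], image[20:]   # split the 40x40 grid at the midline
    ring_class = classifier(upper)          # phase 1: S, N, O, or Others
    region = image if ring_class == "Others" else lower
    return ring_class, recognizers[ring_class](region[::2].ravel())  # phase 2
```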
<h3 id="models">Models</h3>
<p>The system uses multiple distinct feed-forward neural networks (backpropagation training is implied by the reported &ldquo;training&rdquo; iterations, though the paper never names the algorithm explicitly):</p>
<ul>
<li><strong>Structure</strong>: 1 Classifier NN + 4 Recognizer NNs (one for each class).</li>
<li><strong>Hidden Layers</strong>: The preliminary &ldquo;ordinary method&rdquo; experiment used 1600 hidden units. The Classifier-Recognizer methods all used 50 hidden nodes per Table 2. The paper also notes that the ordinary approach tried various hidden layer sizes.</li>
<li><strong>Input Nodes</strong>:
<ul>
<li>Standard: 1600 (40x40).</li>
<li>Optimized: ~800 (20x40 via half-grid).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Classifier Phase Testing Results (Table 3)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">150</td>
          <td style="text-align: left"><strong>100.00%</strong></td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">296</td>
          <td style="text-align: left"><strong>98.67%</strong></td>
          <td style="text-align: left">1.33%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">391</td>
          <td style="text-align: left"><strong>97.75%</strong></td>
          <td style="text-align: left">2.25%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">293</td>
          <td style="text-align: left"><strong>97.67%</strong></td>
          <td style="text-align: left">2.33%</td>
      </tr>
  </tbody>
</table>
<p><strong>Recognizer Phase Testing Results (Lower Part Image Recognizer with Half Size Grid, Table 4)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">147</td>
          <td style="text-align: left"><strong>98.00%</strong></td>
          <td style="text-align: left">2.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">289</td>
          <td style="text-align: left"><strong>96.33%</strong></td>
          <td style="text-align: left">3.67%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">386</td>
          <td style="text-align: left"><strong>96.50%</strong></td>
          <td style="text-align: left">3.50%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">279</td>
          <td style="text-align: left"><strong>93.00%</strong></td>
          <td style="text-align: left">7.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Overall</strong></td>
          <td style="text-align: left"><strong>1150</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
          <td style="text-align: left"><strong>~94.0%</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>No source code, trained models, or datasets were released with this paper. The handwritten ring samples were collected by the authors, and the software described (a desktop application) is not publicly available. The neural network architecture details (50 hidden nodes, 1000 iterations) and preprocessing pipeline are described in sufficient detail for reimplementation, but reproducing results would require collecting a new handwritten dataset of heterocyclic rings.</p>
<p><strong>Status</strong>: Closed (no public code, data, or models).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hewahi, N., Nounou, M. N., Nassar, M. S., Abu-Hamad, M. I., &amp; Abu-Hamad, H. I. (2008). Chemical Ring Handwritten Recognition Based on Neural Networks. <em>Ubiquitous Computing and Communication Journal</em>, 3(3).</p>
<p><strong>Publication</strong>: Ubiquitous Computing and Communication Journal 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{hewahiCHEMICALRINGHANDWRITTEN2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CHEMICAL RING HANDWRITTEN RECOGNITION BASED ON NEURAL NETWORKS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hewahi, Nabil and Nounou, Mohamed N and Nassar, Mohamed S and Abu-Hamad, Mohamed I and Abu-Hamad, Husam I}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2008}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Ubiquitous Computing and Communication Journal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Deep Learning for Molecular Structure Extraction (2019)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</guid><description>An end-to-end deep learning approach using U-Net segmentation and a CNN encoder with GridLSTM decoder to predict chemical structures from document images.</description><content:encoded><![CDATA[<h2 id="contribution-type-method-and-resource">Contribution Type: Method and Resource</h2>
<p>This is primarily a <strong>methodological</strong> paper with a secondary <strong>resource</strong> contribution.</p>
<p><strong>Method</strong>: It proposes a novel end-to-end deep learning architecture (Segmentation U-Net + Recognition Encoder-Decoder) to replace traditional rule-based optical chemical structure recognition (OCSR) systems.</p>
<p><strong>Resource</strong>: It details a pipeline for generating large-scale synthetic datasets (images overlaying patent/journal backgrounds) necessary to train the deep learning models.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Existing tools for extracting chemical structures from literature (e.g., OSRA, CLIDE) rely on complex, handcrafted rules and heuristics (edge detection, vectorization). These systems suffer from:</p>
<ol>
<li><strong>Brittleness</strong>: They fail when image quality is low (low resolution, noise) or when artistic styles vary (wavy bonds, crossing lines).</li>
<li><strong>Maintenance difficulty</strong>: Improvements require manual codification of new rules for every edge case, which is difficult to scale.</li>
<li><strong>Data volume</strong>: The explosion of published life science papers (2000+ per day in Medline) creates a curation workload that manual effort cannot keep pace with, demanding automated, robust extraction tools.</li>
</ol>
<h2 id="core-innovation-end-to-end-pixel-to-smiles-recognition">Core Innovation: End-to-End Pixel-to-SMILES Recognition</h2>
<p>The authors present an <strong>end-to-end deep learning approach</strong> for this task that operates directly on raw pixels without explicit subcomponent recognition (e.g., detecting atoms and bonds separately). Key innovations include:</p>
<ol>
<li><strong>Pixel-to-SMILES</strong>: Treating structure recognition as an image captioning problem using an encoder-decoder architecture with attention, generating SMILES directly.</li>
<li><strong>Low-Resolution Robustness</strong>: The model is trained on aggressively downsampled images (~60 dpi for segmentation, 256x256 for prediction), making it robust to poor quality and noisy inputs from legacy PDF extractions.</li>
<li><strong>Implicit Superatom Handling</strong>: The model learns to recognize and generate sequences for superatoms (e.g., &ldquo;OTBS&rdquo;) contextually.</li>
</ol>
<h2 id="experimental-setup-and-large-scale-synthetic-data">Experimental Setup and Large-Scale Synthetic Data</h2>
<p>The authors validated their approach using a mix of large-scale synthetic training sets and real-world test sets:</p>
<ol>
<li><strong>Synthetic Generation</strong>: They created a segmentation dataset by overlaying USPTO molecules onto &ldquo;whited-out&rdquo; journal pages.</li>
<li><strong>Ablation/Training</strong>: Metrics were tracked on Indigo (synthetic) and USPTO (real patent images) datasets.</li>
<li><strong>External Validation</strong>:
<ul>
<li><strong>Valko Dataset</strong>: A standard benchmark of 454 heterogeneous images from literature.</li>
<li><strong>Proprietary Dataset</strong>: A collection of images from 47 articles and 5 patents to simulate real-world drug discovery curation.</li>
</ul>
</li>
<li><strong>Stress Testing</strong>: They analyzed performance distributions across molecular weight, heavy atom count, and rare elements (e.g., Uranium, Vanadium).</li>
</ol>
<h2 id="results-and-limitations-in-complex-structures">Results and Limitations in Complex Structures</h2>
<ul>
<li><strong>High Accuracy on Standard Sets</strong>: The model achieved <strong>82% accuracy</strong> on the Indigo validation set and <strong>77%</strong> on the USPTO validation set. No apparent overfitting was observed on the Indigo data (57M training examples), though some overfitting occurred on the smaller USPTO set (1.7M training examples).</li>
<li><strong>Real-World Viability</strong>: It achieved <strong>83% accuracy</strong> on the proprietary internal test set, with validation and proprietary accuracies ranging from 77-83%, indicating the training sets reasonably approximate real drug discovery data.</li>
<li><strong>Segmentation Quality</strong>: Low segmentation error rates were observed: only 3.3% of the Valko dataset and 6.6% of the proprietary images failed to segment properly.</li>
<li><strong>Limitations on Complexity</strong>: Performance dropped to <strong>41% on the Valko test set</strong>. Superatoms were the single largest contributor to prediction errors, with 21% of Valko samples containing one or more incorrectly predicted superatoms. Only 6.6% of total training images contained any superatom, limiting the model&rsquo;s exposure.</li>
<li><strong>Stereochemistry Challenges</strong>: 60% of compounds with incorrectly predicted stereochemistry had explicit stereochemistry in both the ground truth and the prediction, but with wrong configurations assigned (e.g., predicting R instead of S). The model often correctly identified which atoms have stereocenters but assigned the wrong direction, suggesting the architecture may not incorporate sufficient spatial context for configuration assignment.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized three primary sources for generating training data. All inputs were strictly downsampled to improve robustness.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>Indigo Set</strong></td>
          <td>57M</td>
          <td>PubChem molecules rendered via Indigo (256x256).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO Set</strong></td>
          <td>1.7M</td>
          <td>Image/SMILES pairs from public patent data.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>OS X Indigo</strong></td>
          <td>10M</td>
          <td>Additional Indigo renders from Mac OS for style diversity.</td>
      </tr>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td><strong>Synthetic Pages</strong></td>
          <td>N/A</td>
          <td>Generated by overlaying USPTO images on text-cleared PDF pages.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Segmentation Inputs</strong>: Grayscale, downsampled to ~60 dpi.</li>
<li><strong>Prediction Inputs</strong>: Resized to 256x256 such that bond lengths are approximately 3-12 pixels.</li>
<li><strong>Augmentation</strong>: Random affine transforms, brightness scaling, and binarization applied during training.</li>
</ul>
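<p>A minimal sketch of the augmentation step described above, assuming grayscale uint8 images; the transform parameters and thresholds are illustrative choices, not the authors' published values, and the random affine transform is omitted for brevity:</p>

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Illustrative training-time augmentation (brightness scaling and
    random binarization). Parameter ranges are assumptions, not the
    paper's values; the affine transform is omitted."""
    out = img.astype(np.float32)
    # Random brightness scaling.
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 255.0)
    # Random binarization: snap to pure black/white half the time.
    if rng.random() < 0.5:
        out = np.where(out > 128, 255.0, 0.0)
    return out.astype(np.uint8)
```

<p>Applying several such randomized transforms per training example is what lets the model tolerate the noisy, low-resolution inputs of legacy PDF extractions.</p>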
<h3 id="algorithms">Algorithms</h3>
<p><strong>Segmentation Pipeline</strong>:</p>
<ul>
<li><strong>Multi-scale Inference</strong>: Masks generated at resolutions from 30 to 60 dpi (3 dpi increments) and averaged for the final mask.</li>
<li><strong>Post-processing</strong>: Hough transform used to remove long straight lines (table borders). Mask blobs filtered by pixel count thresholds.</li>
</ul>
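<p>The multi-scale averaging step can be sketched as follows; <code>predict_mask</code> is a hypothetical stand-in for rescaling the page to a given dpi, running the U-Net, and resampling the mask back to the original shape (the paper's actual inference code is not public):</p>

```python
import numpy as np

def multiscale_mask(image, predict_mask, dpis=range(30, 61, 3)):
    """Average segmentation masks predicted at 30-60 dpi in 3 dpi steps,
    then threshold, mirroring the multi-scale inference described in the
    paper. `predict_mask(image, dpi)` must return a float mask in [0, 1]
    already resampled to `image`'s shape (assumed interface)."""
    masks = [predict_mask(image, dpi) for dpi in dpis]
    avg = np.mean(masks, axis=0)
    return avg > 0.5  # final binary mask
```

<p>Averaging over resolutions makes the final mask less sensitive to any single scale at which a structure happens to segment poorly.</p>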
<p><strong>Prediction Pipeline</strong>:</p>
<ul>
<li><strong>Sequence Generation</strong>: SMILES generated character-by-character via greedy decoding. During inference, predictions are made at several low resolutions and the sequence with the highest confidence (product of per-character softmax outputs) is returned.</li>
<li><strong>Attention-based Verification</strong>: Attention weights used to re-project predicted atoms back into 2D space to visually verify alignment with the input image.</li>
</ul>
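<p>The confidence-based selection over multiple resolutions can be sketched as below; the data structures are illustrative assumptions, but the scoring rule (product of per-character softmax probabilities, computed in log space for stability) matches the description above:</p>

```python
import math

def sequence_confidence(char_probs):
    """Confidence of a greedily decoded string: the product of the
    per-character softmax probabilities (log-space for stability)."""
    return math.exp(sum(math.log(p) for p in char_probs))

def best_prediction(candidates):
    """Among decodes produced at several input resolutions, return the
    sequence with the highest confidence. `candidates` maps each decoded
    SMILES string to its per-character probabilities (assumed layout)."""
    return max(candidates, key=lambda s: sequence_confidence(candidates[s]))
```

<p>For example, a decode with per-character probabilities of 0.9 throughout beats one that starts with an uncertain 0.5, even if the rest of its characters are confident.</p>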
<h3 id="models">Models</h3>
<p><strong>1. Segmentation Model (U-Net Variant)</strong>:</p>
<ul>
<li><strong>Architecture</strong>: U-Net style with skip connections.</li>
<li><strong>Input</strong>: 128x128x1 grayscale image.</li>
<li><strong>Layers</strong>: Alternating 3x3 Conv and 2x2 Max Pool.</li>
<li><strong>Activation</strong>: Parametric ReLU (pReLU).</li>
<li><strong>Parameters</strong>: ~380,000.</li>
</ul>
<p><strong>2. Structure Prediction Model (Encoder-Decoder)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: CNN with 5x5 convolutions, 2x2 Max Pooling, pReLU. No pooling in first layers to preserve fine features.</li>
<li><strong>Decoder</strong>: 3 layers of <strong>GridLSTM</strong> cells.</li>
<li><strong>Attention</strong>: Soft/Global attention mechanism conditioned on the encoder state.</li>
<li><strong>Input</strong>: 256x256x1 image.</li>
<li><strong>Output</strong>: Sequence of characters (vocab size 65).</li>
<li><strong>Parameters</strong>: ~46.3 million.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation required an exact string match of the Canonical SMILES (including stereochemistry) to the ground truth.</p>
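<p>A minimal sketch of this metric; the <code>canonicalize</code> hook is where a toolkit such as RDKit (e.g. <code>Chem.MolToSmiles(Chem.MolFromSmiles(s))</code>) would normalize both strings before comparison, though the paper does not specify its exact tooling:</p>

```python
def exact_match_accuracy(pairs, canonicalize=lambda s: s):
    """Exact-string-match accuracy over (prediction, ground-truth) pairs
    of canonical SMILES, the evaluation criterion used in the paper.
    `canonicalize` is a hypothetical hook for toolkit normalization."""
    if not pairs:
        return 0.0
    hits = sum(canonicalize(p) == canonicalize(t) for p, t in pairs)
    return hits / len(pairs)
```

<p>Because the match must include stereochemistry, a prediction with one flipped stereocenter scores zero under this metric, which is why the stereochemistry errors discussed above are so costly.</p>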
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Dataset</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td><strong>82%</strong></td>
          <td>Indigo Val</td>
          <td>Synthetic validation set</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>77%</strong></td>
          <td>USPTO Val</td>
          <td>Real patent images</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>83%</strong></td>
          <td>Proprietary</td>
          <td>Internal pharma dataset (real world)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>41%</strong></td>
          <td>Valko Test</td>
          <td>External benchmark; difficult due to superatoms</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Segmentation Training</strong>: 1 GPU, ~4 days (650k steps).</li>
<li><strong>Prediction Training</strong>: 8 NVIDIA Pascal GPUs, ~26 days (1M steps).</li>
<li><strong>Framework</strong>: TensorFlow.</li>
<li><strong>Optimizer</strong>: Adam.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code, pre-trained models, or generated datasets were released with this paper. The training pipeline relies on publicly available molecular databases (PubChem, USPTO) and open-source rendering tools (Indigo), but the specific training sets, model weights, and inference code remain unavailable.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Staker, J., Marshall, K., Abel, R., &amp; McQuaw, C. (2019). Molecular Structure Extraction From Documents Using Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 59(3), 1017-1029. <a href="https://doi.org/10.1021/acs.jcim.8b00669">https://doi.org/10.1021/acs.jcim.8b00669</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.schrodinger.com/publications/">Schrödinger Publication Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{stakerMolecularStructureExtraction2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Structure Extraction From Documents Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Staker, Joshua and Marshall, Kyle and Abel, Robert and McQuaw, Carolyn}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{feb}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{59}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1017--1029}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.8b00669}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1021/acs.jcim.8b00669}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER: Deep Learning for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/</guid><description>Deep learning method for optical chemical structure recognition using image captioning networks trained on millions of synthetic molecular images.</description><content:encoded><![CDATA[<h2 id="contribution-method-for-optical-chemical-entity-recognition">Contribution: Method for Optical Chemical Entity Recognition</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper with a strong <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (DECIMER) that repurposes &ldquo;show-and-tell&rdquo; image captioning networks for Optical Chemical Entity Recognition (OCER), providing an alternative to traditional rule-based segmentation pipelines.</li>
<li><strong>Resource</strong>: It establishes a framework for generating large-scale synthetic training data using open-source cheminformatics tools (CDK) and databases (PubChem), circumventing the scarcity of manually annotated chemical images.</li>
</ul>
<h2 id="motivation-brittleness-of-heuristic-pipelines">Motivation: Brittleness of Heuristic Pipelines</h2>
<p>The extraction of chemical structures from scientific literature (OCER) is critical for populating open-access databases. Traditional OCER systems (like OSRA or CLiDE) rely on complex multi-step pipelines involving vectorization, character recognition, and graph compilation. These systems are brittle, and incorporating new structural features requires laborious engineering. Inspired by the success of deep neural network approaches like AlphaGo Zero, the authors sought to formulate an end-to-end deep learning approach that learns directly from data with minimal prior assumptions.</p>
<h2 id="novelty-image-captioning-for-molecular-graphs">Novelty: Image Captioning for Molecular Graphs</h2>
<ul>
<li><strong>Image-to-Text Formulation</strong>: The paper frames chemical structure recognition as an image captioning problem, translating a bitmap image directly into a SMILES string using an encoder-decoder network. This bypasses explicit segmentation of atoms and bonds entirely.</li>
<li><strong>Synthetic Data Strategy</strong>: The authors generate synthetic images from PubChem using the CDK Structure Diagram Generator, scaling the dataset size to 15 million.</li>
<li><strong>Robust String Representations</strong>: The study performs key ablation experiments on string representations, comparing standard SMILES against DeepSMILES to evaluate how syntactic validity affects the network&rsquo;s learning capability.</li>
</ul>
<h2 id="experimental-setup-and-validation-strategies">Experimental Setup and Validation Strategies</h2>
<ul>
<li><strong>Data Scaling</strong>: Models were trained on dataset sizes ranging from 54,000 to 15 million synthetic images to observe empirical scaling laws regarding accuracy and compute time.</li>
<li><strong>Representation Comparison</strong>: The authors compared the validity of predicted strings and recognition accuracy when training on SMILES versus DeepSMILES. The cross-entropy loss formulation for sequence generation can be represented as:
$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$
where $\mathbf{x}$ is the image representation and $y_t$ are the tokens of the SMILES/DeepSMILES string.</li>
<li><strong>Metric Evaluation</strong>: Performance was measured using Validity (syntactic correctness) and Tanimoto Similarity $T$, computed on molecular fingerprints to capture partial correctness even if the exact string prediction failed:
$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</li>
</ul>
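<p>The Tanimoto formula above has a direct implementation when fingerprints are represented as sets of on-bit indices; this sketch uses that representation for clarity, whereas the paper computes it on PubChem bit-vector fingerprints:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as
    collections of on-bit indices:
        T(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)
    Two empty fingerprints are treated as identical (similarity 1.0)."""
    a, b = set(fp_a), set(fp_b)
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 1.0
```

<p>Unlike exact-match accuracy, this score rewards predictions that recover most of a molecule's substructure, which is why the authors use it to track partial correctness.</p>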
<h2 id="results-and-critical-conclusions">Results and Critical Conclusions</h2>
<ul>
<li><strong>Data Representation</strong>: DeepSMILES proved superior to standard SMILES for training stability and output validity. Preliminary tests suggested SELFIES performs even better (0.78 Tanimoto vs 0.53 for DeepSMILES at 6M images).</li>
<li><strong>Scaling Behavior</strong>: Accuracy improves linearly with dataset size. The authors extrapolate that near-perfect detection would require training on 50 to 100 million structures.</li>
<li><strong>Current Limitations</strong>: At the reported training scale (up to 15M), the model does not yet rival traditional heuristic approaches, but the learning curve suggests it is a viable trajectory given sufficient compute and data.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is synthetic, generated using the Chemistry Development Kit (CDK) Structure Diagram Generator (SDG) based on molecules from PubChem.</p>
<p><strong>Curation Rules</strong> (applied to PubChem data):</p>
<ul>
<li>Molecular weight &lt; 1500 Daltons.</li>
<li>Elements restricted to: C, H, O, N, P, S, F, Cl, Br, I, Se, B.</li>
<li>No counter ions or charged groups.</li>
<li>No isotopes (e.g., D, T).</li>
<li>Bond count between 5 and 40.</li>
<li>SMILES length &lt; 40 characters.</li>
<li>Implicit hydrogens only (except in functional groups).</li>
</ul>
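<p>The curation rules above can be sketched as a single filter predicate. The dict schema here is an assumption (a stand-in for properties one would compute with CDK or RDKit), and the counter-ion, isotope, and implicit-hydrogen rules are reduced to precomputed boolean/numeric fields:</p>

```python
# Elements permitted by the paper's curation rules.
ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_curation(mol: dict) -> bool:
    """Apply the paper's PubChem curation rules to a molecule record.
    `mol` is a hypothetical dict of precomputed properties, not an
    actual CDK/RDKit object."""
    return (
        mol["mol_weight"] < 1500                      # < 1500 Daltons
        and set(mol["elements"]) <= ALLOWED_ELEMENTS  # allowed elements only
        and mol["formal_charge"] == 0                 # no charged groups
        and not mol["has_isotopes"]                   # no isotopes (D, T)
        and 5 <= mol["num_bonds"] <= 40               # bond count in range
        and len(mol["smiles"]) < 40                   # SMILES length < 40
    )
```

<p>Filtering at the data-generation stage keeps the training distribution tractable, at the cost of excluding charged species and very large molecules from what the model can recognize.</p>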
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Images</strong>: Generated as 299x299 bitmaps to match Inception V3 input requirements.</li>
<li><strong>Augmentation</strong>: One random rotation applied per molecule; no noise or blurring added in this iteration.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem)</td>
          <td>54k - 15M</td>
          <td>Scaled across 12 experiments</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Independent Set</td>
          <td>6k - 1.6M</td>
          <td>10% of training size</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: <code>&quot;Show, Attend and Tell&quot;</code> (Attention-based Image Captioning).</li>
<li><strong>Optimization</strong>: Adam optimizer with learning rate 0.0005.</li>
<li><strong>Loss Function</strong>: Sparse Categorical Crossentropy.</li>
<li><strong>Training Loop</strong>: Trained for 25 epochs per model. Batch size of 640 images.</li>
</ul>
<h3 id="models">Models</h3>
<p>The network is implemented in TensorFlow 2.0.</p>
<ul>
<li><strong>Encoder</strong>: Inception V3 (Convolutional NN), used unaltered. Extracts feature vectors saved as NumPy arrays.</li>
<li><strong>Decoder</strong>: Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) with soft attention mechanism.</li>
<li><strong>Embeddings</strong>: Image embedding dimension size of 600.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is Tanimoto similarity (Jaccard index) on PubChem fingerprints, which is robust for measuring structural similarity even when exact identity is not reached.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td>Percentage of predictions that are chemically identical to ground truth (isomorphic).</td>
      </tr>
      <tr>
          <td><strong>Average Tanimoto</strong></td>
          <td>Mean similarity score across the test set (captures partial correctness).</td>
      </tr>
      <tr>
          <td><strong>Validity</strong></td>
          <td>Percentage of predicted strings that are valid DeepSMILES/SMILES.</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER">DECIMER (Java utilities)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>CDK-based data generation and conversion tools</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image-to-SMILES">DECIMER-Image-to-SMILES</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>TensorFlow training and inference scripts (archived)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source of molecular structures for synthetic training data</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a single node.</p>
<ul>
<li><strong>GPU</strong>: 1x NVIDIA Tesla V100.</li>
<li><strong>CPU</strong>: 2x Intel Xeon Gold 6230.</li>
<li><strong>RAM</strong>: 384 GB.</li>
<li><strong>Compute Time</strong>:
<ul>
<li>Linear scaling with data size.</li>
<li>15 million structures took ~27 days (91,881s per epoch).</li>
<li>Projected time for 100M structures: ~4 months on single GPU.</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2020). DECIMER: towards deep learning for chemical image recognition. <em>Journal of Cheminformatics</em>, 12(1), 65. <a href="https://doi.org/10.1186/s13321-020-00469-w">https://doi.org/10.1186/s13321-020-00469-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER">Official GitHub Repository</a></li>
<li><a href="https://github.com/Kohulan/DECIMER-Image-to-SMILES">DECIMER Image-to-SMILES Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMERDeepLearning2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{DECIMER}}: Towards Deep Learning for Chemical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{DECIMER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{65}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00469-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGrapher: Deep Learning for Chemical Graph OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</guid><description>Deep learning OCSR method using semantic segmentation and classification CNNs to reconstruct chemical graphs with improved stereochemistry.</description><content:encoded><![CDATA[<h2 id="classifying-the-methodology">Classifying the Methodology</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture and a specific graph-reconstruction algorithm to solve the problem of Optical Chemical Structure Recognition (OCSR). It validates this method by comparing it against the existing standard tool (OSRA), demonstrating superior performance on specific technical challenges like stereochemistry.</p>
<h2 id="the-ocr-stereochemistry-challenge">The OCR Stereochemistry Challenge</h2>
<p>Chemical knowledge is frequently locked in static images within scientific publications. Extracting this structure into machine-readable formats (graphs, SMILES) is essential for drug discovery and database querying. Existing tools, such as OSRA, rely on optical character recognition (OCR) and expert systems or hand-coded rules. These tools struggle with bond multiplicity and stereochemical information, often missing atoms or misinterpreting 3D cues (wedges and dashes). A machine learning approach, by contrast, can improve by scaling training data rather than by hand-coding new rules.</p>
<h2 id="decoupled-semantic-segmentation-and-classification-pipeline">Decoupled Semantic Segmentation and Classification Pipeline</h2>
<p>The core novelty is the <strong>segmentation-classification pipeline</strong> which decouples object detection from type assignment:</p>
<ol>
<li><strong>Semantic Segmentation</strong>: The model first predicts pixel-wise maps for atoms, bonds, and charges using a Dense Prediction Convolutional Network built on dilated convolutions.</li>
<li><strong>Graph Building Algorithm</strong>: A specific algorithm iterates over the segmentation maps to generate candidate locations for atoms and bonds.</li>
<li><strong>Refinement via Classification</strong>: Dedicated classification networks take cutouts of the original image combined with the segmentation mask to verify and classify each candidate (e.g., distinguishing a single bond from a double bond, or a wedge from a dash).</li>
</ol>
<p>Additionally, the authors developed a novel method for <strong>synthetic data generation</strong> by modifying the source code of RDKit to output pixel-wise labels during the image drawing process. This solves the lack of labeled training data.</p>
<h2 id="evaluating-synthetics-and-benchmarks">Evaluating Synthetics and Benchmarks</h2>
<ul>
<li><strong>Synthetic Benchmarking</strong>: The authors generated test sets in 3 different stylistic variations. For each style, they tested on both stereo (complex 3D information) and non-stereo compounds.</li>
<li><strong>Baseline Comparison</strong>: They compared the error rates of ChemGrapher against <strong>OSRA</strong> (Optical Structure Recognition Application).</li>
<li><strong>Component-level Evaluation</strong>: They analyzed the F1 scores of the segmentation networks versus the classification networks independently to understand where errors propagated.</li>
<li><strong>Real-world Case Study</strong>: They manually curated 61 images cut from journal articles to test performance on real, non-synthetic data.</li>
</ul>
<h2 id="advancements-over-osra">Advancements Over OSRA</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ChemGrapher consistently achieved lower error rates than OSRA across all synthetic styles, particularly for stereochemical information (wedge and dash bonds).</li>
<li><strong>Component Performance</strong>: The classification networks showed higher F1 scores than the segmentation networks across all prediction types (Figure 4 in the paper). This suggests the two-stage approach allows the classifier to correct segmentation noise.</li>
<li><strong>Real-world Viability</strong>: In the manual case study, ChemGrapher correctly predicted 46 of 61 images, compared to 42 of 61 for OSRA.</li>
<li><strong>Limitations</strong>: The model struggles with thick bond lines in real-world images. Performance is stronger on carbon-only compounds, where no letters appear in the image.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors created a custom synthetic dataset using ChEMBL and RDKit, as no pixel-wise labeled dataset existed.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>ChEMBL</td>
          <td>1.9M</td>
<td>Split into a training pool (1.5M), a validation pool (300K), and test pools (35K each).</td>
      </tr>
      <tr>
          <td><strong>Segmentation Train</strong></td>
          <td>Synthetic</td>
          <td>~114K</td>
          <td>Sampled from ChEMBL pool such that every atom type appears in &gt;1000 compounds.</td>
      </tr>
      <tr>
          <td><strong>Labels</strong></td>
          <td>Pixel-wise</td>
          <td>N/A</td>
          <td>Generated by modifying <strong>RDKit</strong> source code to output label masks (atom type, bond type, charge) during drawing.</td>
      </tr>
      <tr>
          <td><strong>Candidates (Val)</strong></td>
          <td>Cutouts</td>
          <td>~27K (Atom)<br>~55K (Bond)</td>
          <td>Validation candidates generated from ~450 compounds for evaluating the classification networks.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Algorithm 1: Graph Building</strong></p>
<ol>
<li><strong>Segment</strong>: Apply segmentation network $s(x)$ to get maps $S^a$ (atoms), $S^b$ (bonds), $S^c$ (charges).</li>
<li><strong>Atom Candidates</strong>: Identify candidate blobs in $S^a$.</li>
<li><strong>Classify Atoms</strong>: For each candidate, crop the input image and segmentation map. Feed to $c_A$ and $c_C$ to predict Atom Type and Charge. Add to Vertex set $V$ if valid.</li>
<li><strong>Bond Candidates</strong>: Generate all pairs of nodes in $V$ within $2 \times$ bond length distance.</li>
<li><strong>Classify Bonds</strong>: For each pair, create a candidate mask (two rectangles meeting in the middle to encode directionality). Feed to $c_B$ to predict Bond Type (single, double, wedge, etc.). Add to Edge set $E$.</li>
</ol>
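<p>The candidate-generation step (step 4) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors&rsquo; code: it assumes atom candidates are already reduced to 2D center coordinates and that an estimated bond length is available.</p>

```python
import math

def bond_candidates(atoms, bond_length):
    """Step 4 of Algorithm 1 (sketch): pair every two detected atom
    centers that lie within 2x the estimated bond length."""
    pairs = []
    for i, (xi, yi) in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            xj, yj = atoms[j]
            if math.hypot(xi - xj, yi - yj) <= 2 * bond_length:
                pairs.append((i, j))
    return pairs

# Toy example: three atoms on a line, bond length ~10 px.
print(bond_candidates([(0, 0), (10, 0), (30, 0)], 10))  # [(0, 1), (1, 2)]
```

<p>Each surviving pair is then rendered as a directional candidate mask and passed to $c_B$ in step 5.</p>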
<h3 id="models">Models</h3>
<p>The pipeline uses four distinct Convolutional Neural Networks (CNNs).</p>
<p><strong>1. Semantic Segmentation Network ($s$)</strong></p>
<ul>
<li><strong>Architecture</strong>: 8 convolutional layers (3x3) plus a final 1x1 linear layer (Dense Prediction Convolutional Network).</li>
<li><strong>Kernels</strong>: $3 \times 3$ for all convolutional layers; $1 \times 1$ for the final linear layer.</li>
<li><strong>Dilation</strong>: Uses dilated convolutions to expand receptive field without losing resolution. Six of the eight convolutional layers use dilation (factors: 2, 4, 8, 8, 4, 2); the first and last convolutional layers have no dilation.</li>
<li><strong>Input</strong>: Binary B/W image.</li>
<li><strong>Output</strong>: Multi-channel probability maps for Atom Types ($S^a$), Bond Types ($S^b$), and Charges ($S^c$).</li>
</ul>
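<p>For stacked stride-1 convolutions, the receptive field grows as $1 + \sum_i d_i (k - 1)$ over layers with kernel size $k$ and dilation $d_i$. A small helper (written for this note, not from the paper) shows what the dilation schedule above buys:</p>

```python
def receptive_field(kernel, dilations):
    """Receptive field of stacked stride-1 convolutions:
    RF = 1 + sum(d * (k - 1)) over all layers."""
    return 1 + sum(d * (kernel - 1) for d in dilations)

# First and last conv undilated; the middle six dilated 2,4,8,8,4,2.
print(receptive_field(3, [1, 2, 4, 8, 8, 4, 2, 1]))  # 61
```

<p>A 61-pixel receptive field at full output resolution is what lets the network classify each pixel with bond-scale context while keeping the dense prediction maps $S^a$, $S^b$, $S^c$ at input size.</p>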
<p><strong>2. Classification Networks ($c_A, c_B, c_C$)</strong></p>
<ul>
<li><strong>Purpose</strong>: Refines predictions on small image patches.</li>
<li><strong>Architecture</strong>: 5 convolutional layers, followed by a MaxPool layer and a final linear (1x1) layer.
<ul>
<li>Layer 1: <strong>Depthwise separable convolution</strong> (no dilation).</li>
<li>Layers 2-4: Dilated convolutions (factors 2, 4, 8).</li>
<li>Layer 5: Standard convolution (no dilation).</li>
<li>MaxPool: $124 \times 124$.</li>
<li>Final: 1x1 linear layer.</li>
</ul>
</li>
<li><strong>Inputs</strong>:
<ul>
<li>Crop of the binary image ($x^{cut}$).</li>
<li>Crop of the segmentation map ($S^{cut}$).</li>
<li>&ldquo;Highlight&rdquo; mask ($h_L$) indicating the specific candidate location (e.g., a dot for atoms, two rectangles for bonds).</li>
</ul>
</li>
</ul>
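<p>The &ldquo;two rectangles&rdquo; bond highlight can be sketched as a toy rasterizer: pixels near the first half of the segment get one label and pixels near the second half another, so the crop encodes which endpoint is which. The exact encoding in the paper may differ; this is only meant to make the idea concrete.</p>

```python
def bond_highlight(shape, a, b, half_width=2, samples=200):
    """Toy directional highlight mask h_L: label 1 near endpoint a,
    label 2 near endpoint b, zero elsewhere."""
    h, w = shape
    mask = [[0] * w for _ in range(h)]
    (ax, ay), (bx, by) = a, b
    for s in range(samples + 1):
        t = s / samples
        x = round(ax + t * (bx - ax))
        y = round(ay + t * (by - ay))
        label = 1 if t < 0.5 else 2
        for dy in range(-half_width, half_width + 1):
            for dx in range(-half_width, half_width + 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    mask[yy][xx] = label
    return mask

mask = bond_highlight((16, 16), (2, 8), (13, 8))
print(mask[8][3], mask[8][12])  # 1 2
```

<p>Stacking $x^{cut}$, $S^{cut}$, and $h_L$ as input channels is what lets one classifier architecture serve atoms, bonds, and charges alike.</p>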
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: <strong>F1 Score</strong> for individual network performance (segmentation pixels and classification accuracy).</li>
<li><strong>Metric</strong>: <strong>Error Rate</strong> (percentage of incorrect graphs) for overall system. A graph is &ldquo;incorrect&rdquo; if there is at least one mistake in atoms or bonds.</li>
<li><strong>Baselines</strong>: Compared against <strong>OSRA</strong>.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Training and inference performed on a single <strong>NVIDIA Titan Xp</strong> (donated by NVIDIA).</li>
</ul>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Closed.</strong> The authors did not release source code, pre-trained models, or the synthetic dataset. The data generation pipeline requires modifications to RDKit&rsquo;s internal drawing code, which are not publicly available. The ChEMBL source compounds are public, but the pixel-wise labeling procedure cannot be reproduced without the modified RDKit code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., Arany, Á., Moreau, Y., &amp; Simm, J. (2020). ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 60(10), 4506-4517. <a href="https://doi.org/10.1021/acs.jcim.0c00459">https://doi.org/10.1021/acs.jcim.0c00459</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2020 (arXiv preprint Feb 2020)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2002.09914">arXiv Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oldenhof2020chemgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Oldenhof, Martijn and Arany, Ádám and Moreau, Yves and Simm, Jaak}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4506--4517}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c00459}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>A Review of Optical Chemical Structure Recognition Tools</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</guid><description>Comprehensive review and benchmarking of 30 years of Optical Chemical Structure Recognition (OCSR) methods and tools.</description><content:encoded><![CDATA[<h2 id="systematization-and-benchmarking-of-ocsr">Systematization and Benchmarking of OCSR</h2>
<p>This is primarily a <strong>Systematization</strong> paper ($0.7 \Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($0.3 \Psi_{\text{Resource}}$).</p>
<p>It serves as a <strong>Systematization</strong> because it organizes nearly three decades of research in Optical Chemical Structure Recognition (OCSR), categorizing methods into rule-based systems (e.g., Kekulé, CLiDE, OSRA) and emerging machine-learning approaches (e.g., MSE-DUDL, Chemgrapher). It synthesizes information on 16 distinct tools, many of which are commercial or no longer available.</p>
<p>It acts as a <strong>Resource</strong> by defining a benchmark for the field. The authors evaluate the three available open-source tools (Imago, MolVec, OSRA) against four distinct datasets to establish baseline performance metrics for accuracy and speed.</p>
<h2 id="motivation-digitizing-legacy-chemical-literature">Motivation: Digitizing Legacy Chemical Literature</h2>
<p>A vast amount of chemical knowledge remains &ldquo;hidden&rdquo; in the primary scientific literature (printed or PDF), conveyed as 2D images. Because these depictions are not machine-readable, there is a &ldquo;backlog of decades of chemical literature&rdquo; that cannot be easily indexed or searched in open-access databases.</p>
<p>While Chemical Named Entity Recognition (NER) exists for text, translating graphical depictions into formats like SMILES or SDfiles requires specialized OCSR tools. The motivation is to enable the automated curation of this legacy data to feed public databases.</p>
<h2 id="core-innovations-historical-taxonomy-and-open-standards">Core Innovations: Historical Taxonomy and Open Standards</h2>
<p>The primary novelty is the comprehensive aggregation of the history of the field, which had not been thoroughly reviewed recently. It details the algorithmic evolution from the first work in 1990 to deep learning methods in 2019.</p>
<p>Specific contributions include:</p>
<ul>
<li><strong>Historical Taxonomy</strong>: Classification of tools into rule-based vs. machine-learning, and open-source vs. commercial/unavailable.</li>
<li><strong>Open Source Benchmark</strong>: A comparative performance analysis of the only three open-source tools available at the time (Imago, MolVec, OSRA) on standardized datasets.</li>
<li><strong>Algorithmic Breakdown</strong>: Detailed summaries of the workflows for closed-source or lost tools (e.g., Kekulé, OROCS, ChemReader) based on their original publications.</li>
</ul>
<h2 id="benchmarking-methodology-and-open-source-evaluation">Benchmarking Methodology and Open-Source Evaluation</h2>
<p>The authors performed a benchmark study to evaluate the accuracy and speed of three open-source OCSR tools: <strong>MolVec (0.9.7)</strong>, <strong>Imago (2.0)</strong>, and <strong>OSRA (2.1.0)</strong>.</p>
<p>They tested these tools on four datasets of varying quality and origin:</p>
<ol>
<li><strong>USPTO</strong>: 5,719 images from US patents (high quality).</li>
<li><strong>UOB</strong>: 5,740 images from the University of Birmingham, published alongside MolRec.</li>
<li><strong>CLEF 2012</strong>: 961 images from the CLEF-IP evaluation (well-segmented, clean).</li>
<li><strong>JPO</strong>: 450 images from Japanese patents (low quality, noise, Japanese characters).</li>
</ol>
<p>Evaluation metrics were:</p>
<ul>
<li><strong>Accuracy</strong>: Percentage of perfectly recognized structures, defined as $\text{Accuracy} = \frac{\text{Correct InChI Matches}}{\text{Total Images}}$. Each tool&rsquo;s output was converted to an <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> string and required to match the reference InChI exactly.</li>
<li><strong>Speed</strong>: Total processing time for the dataset.</li>
</ul>
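<p>The accuracy metric reduces to exact string comparison once both sides are canonicalized to InChI. A minimal sketch (with hypothetical predictions, and assuming the InChI conversion has already happened):</p>

```python
def exact_match_accuracy(predicted, reference):
    """Fraction of images whose generated InChI string exactly
    equals the reference InChI; any deviation counts as a failure."""
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

pred = ["InChI=1S/CH4/h1H4", "InChI=1S/H2O/h1H2", "InChI=1S/CO2/c2-1-3"]
ref  = ["InChI=1S/CH4/h1H4", "InChI=1S/H2O/h1H2", "InChI=1S/CO/c1-2"]
print(exact_match_accuracy(pred, ref))  # 2/3
```

<p>Note how unforgiving this is: a single wrong charge or stereo flag anywhere in the structure flips the whole image to incorrect, which partly explains the low JPO numbers below.</p>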
<h2 id="results-and-general-conclusions">Results and General Conclusions</h2>
<p><strong>Benchmark Results (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>MolVec 0.9.7</th>
          <th>Imago 2.0</th>
          <th>OSRA 2.1.0</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO (5,719 images)</td>
          <td>Time (min)</td>
          <td>28.65</td>
          <td>72.83</td>
          <td>145.04</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.41%</td>
          <td>87.20%</td>
          <td>87.69%</td>
      </tr>
      <tr>
          <td>UOB (5,740 images)</td>
          <td>Time (min)</td>
          <td>28.42</td>
          <td>152.52</td>
          <td>125.78</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.39%</td>
          <td>63.54%</td>
          <td>86.50%</td>
      </tr>
      <tr>
          <td>CLEF 2012 (961 images)</td>
          <td>Time (min)</td>
          <td>4.41</td>
          <td>16.03</td>
          <td>21.33</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>80.96%</td>
          <td>65.45%</td>
          <td>94.90%</td>
      </tr>
      <tr>
          <td>JPO (450 images)</td>
          <td>Time (min)</td>
          <td>7.50</td>
          <td>22.55</td>
          <td>16.68</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>66.67%</td>
          <td>40.00%</td>
          <td>57.78%</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Observations</strong>:</p>
<ul>
<li><strong>MolVec</strong> was the fastest tool, processing datasets significantly quicker than competitors (e.g., 28.65 min for USPTO vs. 145.04 min for OSRA).</li>
<li><strong>OSRA</strong> performed exceptionally well on clean, well-segmented data (94.90% on CLEF 2012) but was slower.</li>
<li><strong>Imago</strong> generally lagged in accuracy compared to the other two, particularly on the UOB dataset (63.54% vs. 88.39% for MolVec and 86.50% for OSRA).</li>
<li><strong>JPO Difficulty</strong>: All tools struggled with the noisy Japanese Patent Office dataset (accuracies ranged from 40.00% to 66.67%), highlighting issues with noise and non-standard labels.</li>
</ul>
<p><strong>General Conclusions</strong>:</p>
<ul>
<li>No &ldquo;gold standard&rdquo; tool existed (as of 2020) that solved all problems (page segmentation, R-groups, NLP integration).</li>
<li>Rule-based approaches dominate the history of the field, but deep learning methods (MSE-DUDL, Chemgrapher) were emerging, though they were closed-source at the time of writing.</li>
<li>There was a critical need for tools that could handle full-page recognition (combining segmentation and recognition).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The authors provided sufficient detail to replicate the benchmarking study.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/OCSR_Review">OCSR_Review (GitHub)</a></td>
          <td>Code / Data</td>
          <td>MIT</td>
          <td>Benchmark images (PNG, 72 dpi) and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/p/osra/wiki/Download/">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.1.0 tested; precompiled binaries are commercial</td>
      </tr>
      <tr>
          <td><a href="https://lifescience.opensource.epam.com/download/imago.html">Imago</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.0 tested; no longer actively developed</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ncats/molvec">MolVec</a></td>
          <td>Code</td>
          <td>LGPL-2.1</td>
          <td>Version 0.9.7 tested; Java-based standalone tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used four public datasets. Images were converted to PNG (72 dpi) to ensure compatibility across all tools.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
          <th>Characteristics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>5,719</td>
          <td>OSRA Validation Set</td>
          <td>US Patent images, generally clean.</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>5,740</td>
          <td>Univ. of Birmingham</td>
          <td>Published alongside MolRec.</td>
      </tr>
      <tr>
          <td><strong>CLEF 2012</strong></td>
          <td>961</td>
          <td>CLEF-IP 2012</td>
          <td>Well-segmented, high quality.</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>450</td>
          <td>Japanese Patent Office</td>
          <td>Low quality, noisy, contains Japanese text.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper does not propose a new algorithm but benchmarks existing ones. The execution commands for reproducibility were:</p>
<ul>
<li><strong>Imago</strong>: Executed via command line without installation.
<code>./imago_console -dir /image/directory/path</code></li>
<li><strong>MolVec</strong>: Executed as a JAR file.
<code>java -cp [dependencies] gov.nih.ncats.molvec.Main -dir [input_dir] -outDir [output_dir]</code></li>
<li><strong>OSRA</strong>: Installed via Conda (PyOSRA) due to compilation complexity. Required dictionaries for superatoms and spelling.
<code>osra -f sdf -a [superatom_dict] -l [spelling_dict] -w [output_file] [input_file]</code></li>
</ul>
<h3 id="models">Models</h3>
<p>The specific versions of the open-source software tested were:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Version</th>
          <th>Technology</th>
          <th>License</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>0.9.7</td>
          <td>Java-based, rule-based</td>
          <td>LGPL-2.1</td>
      </tr>
      <tr>
          <td><strong>Imago</strong></td>
          <td>2.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>2.1.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Perfect structural match. The output SDfile/SMILES was converted to a Standard InChI string and compared to the ground truth InChI. Any deviation counted as a failure.</li>
<li><strong>Environment</strong>: Linux workstation (Ubuntu 20.04 LTS).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The benchmark was performed on a high-end workstation to measure processing time.</p>
<ul>
<li><strong>CPUs</strong>: 2x Intel Xeon Silver 4114 (40 threads total).</li>
<li><strong>RAM</strong>: 64 GB.</li>
<li><strong>Parallelization</strong>: MolVec had pre-implemented parallelization features that contributed to its speed.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H. O., Zielesny, A., &amp; Steinbeck, C. (2020). A review of optical chemical structure recognition tools. <em>Journal of Cheminformatics</em>, 12(1), 60. <a href="https://doi.org/10.1186/s13321-020-00465-0">https://doi.org/10.1186/s13321-020-00465-0</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanReviewOpticalChemical2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Review of Optical Chemical Structure Recognition Tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00465-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Research on Chemical Expression Images Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</guid><description>A 2015 methodology for Optical Chemical Structure Recognition (OCSR) focusing on improved handling of adhesive symbols and wedge bonds.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hong, C., Du, X., &amp; Zhang, L. (2015). Research on Chemical Expression Images Recognition. <em>Proceedings of the 2015 Joint International Mechanical, Electronic and Information Technology Conference</em>, 267-271. <a href="https://doi.org/10.2991/jimet-15.2015.50">https://doi.org/10.2991/jimet-15.2015.50</a></p>
<p><strong>Publication</strong>: JIMET 2015 (Atlantis Press)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://jsme-editor.github.io/">JSME Editor (used for visualization)</a></li>
</ul>
<h2 id="contribution-new-ocsr-workflow-for-adhesion-and-wedge-bonds">Contribution: New OCSR Workflow for Adhesion and Wedge Bonds</h2>
<p><strong>Method</strong>. The paper proposes a novel algorithmic pipeline (OCSR) for recognizing 2D organic chemical structures from images. It validates this method by comparing it against an existing tool (OSRA) using a quantitative metric (Tanimoto Coefficient) on a test set of 200 images.</p>
<h2 id="motivation-challenges-with-connecting-symbols-and-stereochemistry">Motivation: Challenges with Connecting Symbols and Stereochemistry</h2>
<p>A vast amount of chemical structural information exists in scientific literature (PDFs/images) that is not machine-readable. Manually converting these images to formats like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or CML is labor-intensive. Existing tools face challenges with:</p>
<ol>
<li><strong>Adhesion</strong>: Poor separation when chemical symbols touch or overlap with bonds.</li>
<li><strong>Stereochemistry</strong>: Incomplete identification of &ldquo;real&rdquo; (solid) and &ldquo;virtual&rdquo; (dashed/hashed) wedge bonds.</li>
</ol>
<h2 id="core-innovation-vector-based-separation-and-stereochemical-logic">Core Innovation: Vector-Based Separation and Stereochemical Logic</h2>
<p>The authors propose a specific <strong>OCSR (Optical Chemical Structure Recognition)</strong> workflow with two key technical improvements:</p>
<ol>
<li><strong>Vector-based Separation</strong>: The method vectorizes the image (using Potrace) to extract straight lines and curves, allowing better separation of &ldquo;adhesive&rdquo; chemical symbols (like H, N, O attached to bonds).</li>
<li><strong>Stereochemical Logic</strong>: Specific rules for identifying wedge bonds:
<ul>
<li><strong>Virtual (Dashed) Wedges</strong>: Identified by grouping connected domains and checking linear correlation of their center points.</li>
<li><strong>Real (Solid) Wedges</strong>: Identified after thinning by analyzing linear correlation and width variance of line segments.</li>
</ul>
</li>
</ol>
<h2 id="methodology--experimental-setup">Methodology &amp; Experimental Setup</h2>
<ul>
<li>
<p><strong>Dataset</strong>: 200 chemical structure images collected from the network.</p>
</li>
<li>
<p><strong>Baselines</strong>: Compared against <strong>OSRA</strong> (Optical Structure Recognition Application), a free online tool.</p>
</li>
<li>
<p><strong>Metric</strong>: <strong>Tanimoto Coefficient</strong>, measuring the similarity of the set of recognized bonds and symbols against the ground truth. The similarity $T(A, B)$ is defined as:</p>
<p>$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</p>
</li>
</ul>
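<p>The Tanimoto coefficient above is just set overlap. A minimal sketch, using hypothetical labels for recognized primitives (the paper does not specify its exact primitive encoding):</p>

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of recognized primitives:
    |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

truth = {"C-C", "C=O", "O-H", "wedge:C-N"}
recognized = {"C-C", "C=O", "O-H"}
print(tanimoto(recognized, truth))  # 0.75
```

<p>A value of 1.0 corresponds to the &ldquo;exact match&rdquo; bucket in the results table; the &gt;0.95 and &gt;0.85 buckets admit progressively more missed or spurious primitives.</p>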
<h2 id="results--conclusions">Results &amp; Conclusions</h2>
<ul>
<li><strong>Performance</strong>: The proposed OCSR method achieved higher recognition rates than OSRA.
<ul>
<li><strong>Exact Match (100%)</strong>: OCSR achieved 90.0% vs. OSRA&rsquo;s 82.2%.</li>
<li><strong>High Similarity (&gt;85%)</strong>: OCSR recognized 157 structures vs. OSRA&rsquo;s 114.</li>
</ul>
</li>
<li><strong>Limitations</strong>: The paper notes that &ldquo;real wedge&rdquo; and &ldquo;virtual wedge&rdquo; identification was a primary focus, but general recognition effectiveness still &ldquo;has room for improvement&rdquo;.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom collection of images, not a standard benchmark.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Web-crawled chemical images</td>
          <td>200 structures</td>
          <td>Images containing 2D organic structures; specific source URLs not provided.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows these specific steps:</p>
<ol>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Grayscale</strong>: via <code>cvCvtColor</code> (OpenCV).</li>
<li><strong>Binarization</strong>: via Otsu&rsquo;s method.</li>
</ul>
</li>
<li><strong>Isolated Symbol Removal</strong>:
<ul>
<li>Identifies connected domains with aspect ratios in <code>[0.8, 3.0]</code>.</li>
<li>Recognizes them using OCR (GOCR, OCRAD, Tesseract) and removes them from the image.</li>
</ul>
</li>
<li><strong>Virtual Wedge Recognition</strong>:
<ul>
<li>Groups small connected domains (points/clumps).</li>
<li>Calculates linear correlation of center points; if collinear, treats as a dashed bond.</li>
</ul>
</li>
<li><strong>Vectorization &amp; Thinning</strong>:
<ul>
<li><strong>Thinning</strong>: Rosenfeld algorithm (optimized) to reduce lines to single pixel width.</li>
<li><strong>Vectorization</strong>: Uses <strong>Potrace</strong> to convert pixels to vector segments.</li>
<li><strong>Merging</strong>: Combines split vector segments based on angle thresholds to form long straight lines.</li>
</ul>
</li>
<li><strong>Adhesive Symbol Separation</strong>:
<ul>
<li>Identifies curves (short segments after vectorization) attached to long lines.</li>
<li>Separates these domains and re-runs OCR.</li>
</ul>
</li>
<li><strong>&ldquo;Super Atom&rdquo; Merging</strong>:
<ul>
<li>Merges adjacent vertical/horizontal symbols (e.g., &ldquo;HO&rdquo;, &ldquo;CH3&rdquo;) based on distance thresholds between bounding boxes.</li>
</ul>
</li>
</ol>
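<p>The virtual-wedge test in step 3 checks whether the dash centroids are collinear. The paper phrases this as linear correlation of the center points; a geometrically equivalent sketch (an assumption of this note, not the authors&rsquo; code) uses the perpendicular distance of each centroid from the line through the two extreme points:</p>

```python
def are_collinear(points, tol=1.5):
    """Virtual-wedge test sketch: treat dash centroids as one dashed
    bond if every point lies within `tol` pixels of the line through
    the first and last points."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5
    for x, y in points[1:-1]:
        # Perpendicular distance via the 2D cross product.
        dist = abs(dx * (y - y0) - dy * (x - x0)) / norm
        if dist > tol:
            return False
    return True

dash_centers = [(0, 0), (5, 1), (10, 2), (15, 3)]
print(are_collinear(dash_centers))  # True
```

<p>Unlike a raw correlation coefficient, the distance test also behaves sensibly for vertical dashed bonds, where the x-coordinates carry no variance.</p>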
<h3 id="models">Models</h3>
<p>The system relies on off-the-shelf OCR tools for character recognition; no custom ML models were trained.</p>
<ul>
<li><strong>OCR Engines</strong>: GOCR, OCRAD, Tesseract.</li>
<li><strong>Visualization</strong>: JSME (JavaScript Molecule Editor) used to render output strings.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (OCSR)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Exact Match (100%)</td>
          <td><strong>90.0%</strong></td>
          <td>82.2%</td>
          <td>Percentage of 200 images perfectly recognized.</td>
      </tr>
      <tr>
          <td>&gt;95% Similarity</td>
          <td><strong>95 images</strong></td>
          <td>71 images</td>
          <td>Count of images with Tanimoto &gt; 0.95.</td>
      </tr>
      <tr>
          <td>&gt;85% Similarity</td>
          <td><strong>157 images</strong></td>
          <td>114 images</td>
          <td>Count of images with Tanimoto &gt; 0.85.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; runs on standard CPU architecture (implied by use of standard libraries like OpenCV and Potrace).</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{hongResearchChemicalExpression2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Research on {{Chemical Expression Images Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2015 {{Joint International Mechanical}}, {{Electronic}} and {{Information Technology Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hong, Chen and Du, Xiaoping and Zhang, Lu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Atlantis Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Chongqing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2991/jimet-15.2015.50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-94-6252-129-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Probabilistic OCSR with Markov Logic Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</guid><description>A probabilistic approach using Markov Logic Networks to recognize chemical structures from images, improving robustness over rule-based systems.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Frasconi, P., Gabbrielli, F., Lippi, M., &amp; Marinai, S. (2014). Markov Logic Networks for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 54(8), 2380-2390. <a href="https://doi.org/10.1021/ci5002197">https://doi.org/10.1021/ci5002197</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2014</p>
<h2 id="contribution-probabilistic-method-for-ocsr">Contribution: Probabilistic Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a novel algorithmic architecture (<strong>MLOCSR</strong>) that integrates low-level pattern recognition with a high-level probabilistic reasoning engine based on Markov Logic Networks (MLNs). While it contributes to resources by creating a clustered dataset for evaluation, the primary focus is on demonstrating that probabilistic inference offers a superior methodology to the deterministic, rule-based heuristics employed by previous state-of-the-art systems like OSRA and CLiDE.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for converting the vast archive of chemical literature (bitmap images in patents and papers) into machine-readable formats.</p>
<ul>
<li><strong>Limitation of Prior Work</strong>: Existing systems (OSRA, CLiDE, ChemReader) rely on &ldquo;empirical hard-coded geometrical rules&rdquo; to assemble atoms and bonds. These heuristics are brittle, requiring manual tuning of parameters for different image resolutions and failing when images are degraded or noisy.</li>
<li><strong>Gap</strong>: Chemical knowledge is typically used only in post-processing (e.g., to fix valency errors).</li>
<li><strong>Goal</strong>: To create a resolution-independent system that uses probabilistic reasoning to handle noise and ambiguity in graphical primitives.</li>
</ul>
<h2 id="core-innovation-markov-logic-networks-for-diagram-interpretation">Core Innovation: Markov Logic Networks for Diagram Interpretation</h2>
<p>The core novelty is the application of <strong>Markov Logic Networks (MLNs)</strong> to the problem of diagram interpretation.</p>
<ul>
<li><strong>Probabilistic Reasoning</strong>: The system treats extracted visual elements (lines, text boxes) as &ldquo;evidence&rdquo; and uses weighted first-order logic formulas to infer the most likely molecular graph (Maximum A Posteriori inference). The probability of a state $x$ is defined by the MLN log-linear model:
$$ P(X=x) = \frac{1}{Z} \exp\left(\sum_{i} w_i n_i(x)\right) $$
where $w_i$ is the weight of the $i$-th formula and $n_i(x)$ is the number of true groundings in $x$.</li>
<li><strong>Unified Knowledge Representation</strong>: Geometric constraints (e.g., collinearity) and chemical rules (e.g., valency) are encoded in the same logic framework.</li>
<li><strong>Resolution Independence</strong>: The low-level extraction module dynamically estimates character size ($T$) and stroke width ($S$) to normalize parameters, removing the dependence on image DPI metadata.</li>
</ul>
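<p>The MAP objective above can be made concrete with a toy sketch. This is not the paper's implementation (inference there runs in Alchemy); it simply enumerates a tiny world space to show how the weights $w_i$ and grounding counts $n_i(x)$ of the log-linear model yield a distribution over interpretations. The two formulas and their weights below are hypothetical:</p>

```python
import math

def mln_probability(weights, count_fn, worlds):
    """P(X = x) = exp(sum_i w_i * n_i(x)) / Z over an explicitly
    enumerated world set -- tractable only at toy scale."""
    scores = {x: math.exp(sum(w * n for w, n in zip(weights, count_fn(x))))
              for x in worlds}
    z = sum(scores.values())  # partition function Z
    return {x: s / z for x, s in scores.items()}

# Hypothetical mini-MLN: two query atoms, SingleBond and DoubleBond,
# for one pair of carbon candidates with a detected line between them.
weights = [2.0, 1.5]

def counts(world):
    single, double = world
    n1 = 1 if single else 0               # "LineBetween => SingleBond"
    n2 = 0 if (single and double) else 1  # "not (SingleBond and DoubleBond)"
    return (n1, n2)

worlds = [(s, d) for s in (True, False) for d in (True, False)]
probs = mln_probability(weights, counts, worlds)
map_state = max(probs, key=probs.get)
```

<p>Here the single-bond-only world wins, mirroring how MAP inference selects the interpretation that satisfies the most heavily weighted formulas.</p>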
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated the system on recognition accuracy against the leading open-source baseline, <strong>OSRA (v1.4.0)</strong>.</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>USPTO Clustered</strong>: A non-redundant subset of 937 images derived from a larger set of 5,719 US Patent Office images.</li>
<li><strong>ChemInfty</strong>: 869 images from Japanese patents.</li>
<li><strong>Degraded Images</strong>: The USPTO set was synthetically degraded at three resampling levels (Low, Medium, High degradation) to test robustness.</li>
</ul>
</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Geometric</strong>: Precision, Recall, and $F_1$ scores for individual atoms and bonds.</li>
<li><strong>Chemical</strong>: Tanimoto similarity (using path fingerprints) and InChI string matching (basic and full stereochemistry).</li>
</ul>
</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Superior Robustness</strong>: MLOCSR significantly outperformed OSRA on degraded images. On high-degradation images, MLOCSR achieved an atom $F_1$ of 80.3% compared to OSRA&rsquo;s 76.0%.</li>
<li><strong>Geometric Accuracy</strong>: In clean datasets (USPTO cluster), MLOCSR achieved higher $F_1$ scores for atoms (99.1% vs 97.5%) and bonds (98.8% vs 97.8%).</li>
<li><strong>Chemical Fidelity</strong>: The system achieved comparable Tanimoto similarity scores (0.948 vs 0.940 for OSRA).</li>
<li><strong>Limitation</strong>: OSRA slightly outperformed MLOCSR on &ldquo;Full <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>&rdquo; matching (81.4% vs 79.4%), indicating the probabilistic model still needs improvement in handling complex stereochemistry.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized public datasets, with specific preprocessing to ensure non-redundancy.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>USPTO Clustered</strong></td>
          <td>937 images</td>
          <td>Selected via spectral clustering from 5,719 raw images to remove near-duplicates.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>ChemInfty</strong></td>
          <td>869 images</td>
          <td>Ground-truthed dataset from Japanese patent applications (2008).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of two distinct phases: Low-Level Vectorization and High-Level Inference.</p>
<p><strong>1. Low-Level Extraction (Image Processing)</strong></p>
<ul>
<li><strong>Binarization</strong>: Global thresholding followed by morphological closing.</li>
<li><strong>Text/Stroke Estimation</strong>:
<ul>
<li>Finds text height ($T$) by looking for &ldquo;N&rdquo; or &ldquo;H&rdquo; characters via OCR, or averaging compatible connected components.</li>
<li>Estimates stroke width ($S$) by inspecting pixel density on potential segments identified by Hough transform.</li>
</ul>
</li>
<li><strong>Vectorization</strong>:
<ul>
<li><strong>Canny Edge Detection</strong> + <strong>Hough Transform</strong> to find lines.</li>
<li><strong>Douglas-Peucker algorithm</strong> for polygonal approximation of contours.</li>
<li><strong>Circle Detection</strong>: Finds aromatic rings by checking for circular arrangements of carbon candidates.</li>
</ul>
</li>
</ul>
<p><strong>2. High-Level Inference (Markov Logic)</strong></p>
<ul>
<li><strong>Evidence Generation</strong>: Visual primitives (lines, text boxes, circles) are converted into logical ground atoms (e.g., <code>LineBetweenCpoints(c1, c2)</code>).</li>
<li><strong>Inference Engine</strong>: Uses <strong>MaxWalkSAT</strong> for Maximum A Posteriori (MAP) inference to determine the most probable state of query predicates (e.g., <code>DoubleBond(c1, c2)</code>).</li>
<li><strong>Parameters</strong>: MaxWalkSAT run with 3 tries and 1,000,000 steps per try.</li>
</ul>
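<p>A minimal sketch of the MaxWalkSAT loop used for MAP inference, in its usual formulation: weighted clauses, random restarts, and a noise parameter trading random flips against greedy cost-reducing flips. The clause encoding and parameters below are illustrative, not taken from the paper's Alchemy configuration:</p>

```python
import random

def cost(clauses, assign):
    """Total weight of unsatisfied weighted clauses; MAP minimizes this."""
    return sum(w for w, lits in clauses
               if not any(assign[v] == sign for v, sign in lits))

def flip_cost(clauses, assign, var):
    """Cost after tentatively flipping `var` (state restored afterwards)."""
    assign[var] = not assign[var]
    c = cost(clauses, assign)
    assign[var] = not assign[var]
    return c

def maxwalksat(clauses, variables, tries=3, steps=10_000, noise=0.5, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(tries):
        assign = {v: rng.random() < 0.5 for v in variables}  # random restart
        c = cost(clauses, assign)
        if c < best_cost:
            best, best_cost = dict(assign), c
        for _ in range(steps):
            unsat = [lits for w, lits in clauses
                     if not any(assign[v] == sign for v, sign in lits)]
            if not unsat:
                break  # all clauses satisfied: cost 0 is optimal
            lits = rng.choice(unsat)
            if rng.random() < noise:  # noisy (random) flip
                var = rng.choice(lits)[0]
            else:                     # greedy flip within the clause
                var = min((v for v, _ in lits),
                          key=lambda v: flip_cost(clauses, assign, v))
            assign[var] = not assign[var]
            c = cost(clauses, assign)
            if c < best_cost:
                best, best_cost = dict(assign), c
    return best, best_cost
```

<p>Each clause is a <code>(weight, [(variable, wanted_truth_value), ...])</code> pair; the paper's run used 3 tries of 1,000,000 steps each.</p>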
<h3 id="models">Models</h3>
<ul>
<li><strong>Markov Logic Network (MLN)</strong>:
<ul>
<li>Contains <strong>128 first-order logic formulas</strong>.</li>
<li><strong>Geometric Rules</strong>: Example: <code>VeryCloseCpoints(c1, c2) =&gt; SameCarbon(c1, c2)</code> (weighted rule to merge close nodes).</li>
<li><strong>Chemical Rules</strong>: Example: <code>IsHydroxyl(t) ^ Connected(c,t) =&gt; SingleBond(c,t)</code> (imposes valency constraints).</li>
</ul>
</li>
<li><strong>OCR Engine</strong>: Tesseract is used for character recognition on text connected components.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The authors introduced a bipartite graph matching method to evaluate geometric accuracy when superatoms (e.g., &ldquo;COOH&rdquo;) are not expanded.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Atom/Bond $F_1$</strong></td>
          <td>Calculated via minimum-weight bipartite matching between predicted graph and ground truth, weighted by Euclidean distance.</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>Standard unique identifier string. &ldquo;Basic&rdquo; ignores stereochemistry; &ldquo;Full&rdquo; includes it.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Jaccard index of path fingerprints between predicted and ground truth molecules.</td>
      </tr>
  </tbody>
</table>
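<p>The atom-level scoring can be sketched with SciPy's assignment solver, assuming matched coordinate pairs count as hits when they fall within a pixel tolerance (the paper weights the matching by Euclidean distance; the tolerance value here is illustrative). The Tanimoto score is the Jaccard index over fingerprint sets:</p>

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def atom_f1(pred, truth, tol=5.0):
    """Minimum-weight bipartite matching of predicted vs. ground-truth
    atom coordinates, scoring pairs within `tol` pixels as hits."""
    if len(pred) == 0 or len(truth) == 0:
        return 0.0, 0.0, 0.0
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    d = np.linalg.norm(pred[:, None, :] - truth[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(d)       # Hungarian matching
    hits = int((d[rows, cols] <= tol).sum())
    p, r = hits / len(pred), hits / len(truth)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def tanimoto(fp_a, fp_b):
    """Jaccard index of two path-fingerprint sets."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```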
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software</strong>: Logic inference performed using the <strong>Alchemy</strong> software package (University of Washington).</li>
<li><strong>Web Server</strong>: The system was made available at <code>http://mlocsr.dinfo.unifi.it</code> (Note: URL likely inactive).</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{frasconiMarkovLogicNetworks2014,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Markov {{Logic Networks}} for {{Optical Chemical Structure Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Frasconi, Paolo and Gabbrielli, Francesco and Lippi, Marco and Marinai, Simone}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2014</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{54}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2380--2390}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596, 1549-960X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci5002197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Overview of the TREC 2011 Chemical IR Track Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/trec-chem-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/trec-chem-2011/</guid><description>Overview of the 2011 TREC Chemical IR track, establishing benchmarks for patent prior art, technology surveys, and chemical image recognition.</description><content:encoded><![CDATA[<h2 id="contribution-establishing-chemical-ir-benchmarks">Contribution: Establishing Chemical IR Benchmarks</h2>
<p>This is a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper with a secondary contribution in <strong>Systematization ($\Psi_{\text{Systematization}}$)</strong>.</p>
<p>It serves as an infrastructural foundation for the field by establishing the &ldquo;yardstick&rdquo; for chemical information retrieval. It defines three distinct tasks, curates the necessary datasets (text and image), and creates the evaluation metrics required to measure progress. Secondarily, it systematizes the field by analyzing 36 different runs from 9 research groups, categorizing the performance of various approaches against these new benchmarks.</p>
<h2 id="motivation-bridging-text-and-image-search-in-chemistry">Motivation: Bridging Text and Image Search in Chemistry</h2>
<p>The primary motivation is to bridge the gap between distinct research communities (text mining and image understanding), which are both essential for chemical information retrieval but rarely interact. Professional searchers in chemistry rely heavily on non-textual information (structures), yet prior evaluation efforts lacked specific tasks to handle image data. The track aims to provide professional searchers with a clear understanding of the limits of current tools while stimulating research interest in both patent retrieval and chemical image recognition.</p>
<h2 id="novelty-the-image-to-structure-i2s-task">Novelty: The Image-to-Structure (I2S) Task</h2>
<p>The core novelty is the introduction of the <strong>Image-to-Structure (I2S)</strong> task. While previous years provided image data, this was the first specific task requiring participants to translate a raster image of a molecule into a chemical structure file. Additionally, the Technology Survey (TS) task shifted its focus specifically to <strong>biomedical and pharmaceutical topics</strong> to investigate how general IR systems handle the high terminological diversity (synonyms, abbreviations) typical of biomedical patents.</p>
<h2 id="methodology-trec-2011-task-formulations">Methodology: TREC 2011 Task Formulations</h2>
<p>The organizers conducted a large-scale benchmarking campaign across three specific tasks:</p>
<ol>
<li><strong>Prior Art (PA) Task</strong>: A patent retrieval task using 1,000 topics distributed among the EPO, USPTO, and WIPO.</li>
<li><strong>Technology Survey (TS) Task</strong>: An ad-hoc retrieval task focused on 6 specific biomedical/pharmaceutical information needs (e.g., &ldquo;Tests for HCG hormone&rdquo;).</li>
<li><strong>Image-to-Structure (I2S) Task</strong>: A recognition task using 1,000 training images and 1,000 evaluation images from USPTO patents, where systems had to generate the correct chemical structure (MOL file).</li>
</ol>
<p>A total of 9 groups submitted 36 runs across these tasks. Relevance judgments were performed using stratified sampling and a dual-evaluator system (junior and senior experts) for the TS task.</p>
<h2 id="outcomes-task-achievements-and-limitations">Outcomes: Task Achievements and Limitations</h2>
<ul>
<li><strong>Image-to-Structure Success</strong>: The new I2S task was the most successful task that year, with 5 participating groups submitting 11 runs. All participants recognized over 60% of the structures.</li>
<li><strong>Prior Art Saturation</strong>: Only 2 groups participated in the PA task. The organizers concluded that this task had reached its &ldquo;final point,&rdquo; having learned the extent to which relevant documents can be retrieved in one pass for chemical patent applications.</li>
<li><strong>Biomedical Complexity</strong>: Four teams submitted 14 runs for the TS task, which highlighted the complexity of biomedical queries. The use of specialized domain experts (senior evaluators) and students (junior evaluators) provided high-quality relevance data, though the small number of topics (6) limits broad generalization.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The following details describe the benchmark environment established by the organizers, allowing for the replication of the evaluation.</p>
<h3 id="data">Data</h3>
<p>The track utilized a large collection of approximately 500GB of compressed text and image data.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Dataset / Source</th>
          <th style="text-align: left">Size / Split</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Prior Art (PA)</strong></td>
          <td style="text-align: left">EPO, USPTO, WIPO patents</td>
          <td style="text-align: left">1,000 Topics</td>
          <td style="text-align: left">Distributed: 334 EPO, 333 USPTO, 333 WIPO.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tech Survey (TS)</strong></td>
          <td style="text-align: left">Biomedical patents/articles</td>
          <td style="text-align: left">6 Topics</td>
          <td style="text-align: left">Topics formulated by domain experts; focused on complexity (synonyms, abbreviations).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Image (I2S)</strong></td>
          <td style="text-align: left">USPTO patent images</td>
          <td style="text-align: left">1,000 Train / 1,000 Eval</td>
          <td style="text-align: left">Criteria: No polymers, &ldquo;organic&rdquo; elements only, MW &lt; 1000, single fragment.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper defines specific <strong>evaluation algorithms</strong> used to ground-truth the submissions:</p>
<ul>
<li><strong>Stratified Sampling (TS)</strong>: Pools were generated using the method from Yilmaz et al. (2008). The pool included the top 10 documents from all runs, 30% of the top 30, and 10% of the rest down to rank 1000.</li>
<li><strong>InChI Matching (I2S)</strong>: Evaluation relied on generating <strong>Standard <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> Keys</strong> from both the ground truth MOL files and the participant submissions. Success was defined by exact string matching of these keys. This provided a relatively controversy-free measure of chemical identity.</li>
</ul>
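<p>The I2S scoring rule reduces to exact string matching of Standard InChIKeys, one per image. A sketch, assuming predictions and ground-truth MOL files have already been converted to keys with standard InChI software and collected into hypothetical <code>image_id → key</code> dicts:</p>

```python
def i2s_recall(predicted, truth):
    """Fraction of ground-truth images whose predicted Standard
    InChIKey matches exactly (TREC 2011 I2S scoring)."""
    correct = sum(1 for image_id, key in truth.items()
                  if predicted.get(image_id) == key)
    return correct / len(truth)
```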
<h3 id="models">Models</h3>
<p>While the paper does not propose a single model, it evaluates several distinct approaches submitted by participants. Notable systems mentioned include:</p>
<ul>
<li><strong>OSRA</strong> (SAIC-Frederick / NIH)</li>
<li><strong>ChemReader</strong> (University of Michigan)</li>
<li><strong>ChemOCR</strong> (Fraunhofer SCAI)</li>
<li><strong>UoB</strong> (University of Birmingham)</li>
<li><strong>GGA</strong> (GGA Software)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using standard IR metrics for text and exact matching for images.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>MAP / xinfAP</strong></td>
          <td style="text-align: left">Prior Art / Tech Survey</td>
          <td style="text-align: left">Mean Average Precision ($\text{MAP}$) and Extended Inferred AP ($\text{xinfAP}$) were used to measure retrieval quality.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>infNDCG</strong></td>
          <td style="text-align: left">Tech Survey</td>
          <td style="text-align: left">Used to account for graded relevance (highly relevant vs relevant, formalized as $\text{infNDCG}$).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Recall</strong></td>
          <td style="text-align: left">Image-to-Structure</td>
          <td style="text-align: left">Percentage of images where the generated InChI key matched exactly ($R = \frac{\text{Correct}}{\text{Total}}$).</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://trec.nist.gov/data/chemical11.html">TREC 2011 Chemistry Track Data</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Topics, relevance judgments, and image sets for all three tasks</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://trec.nist.gov/pubs/trec20/t20.proceedings.html">TREC 2011 Proceedings</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Full proceedings including participant system descriptions</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Specific hardware requirements for the participating systems are not detailed in this overview, but the dataset size (500GB) implies significant storage and I/O throughput requirements.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lupu, M., Gurulingappa, H., Filippov, I., Zhao, J., Fluck, J., Zimmermann, M., Huang, J., &amp; Tait, J. (2011). Overview of the TREC 2011 Chemical IR Track. In <em>Proceedings of the Twentieth Text REtrieval Conference (TREC 2011)</em>.</p>
<p><strong>Publication</strong>: Text REtrieval Conference (TREC) 2011</p>
<p><strong>Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/t20.proceedings.html">TREC 2011 Proceedings</a></li>
<li><a href="https://trec.nist.gov/data/chemical11.html">TREC 2011 Chemistry Track Data</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lupuOverviewTREC20112011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Overview of the {{TREC}} 2011 {{Chemical IR Track}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Lupu, Mihai and Gurulingappa, Harsha and Filippov, Igor and Zhao, Jiashu and Fluck, Juliane and Zimmermann, Marc and Huang, Jimmy and Tait, John}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the Twentieth Text REtrieval Conference (TREC 2011)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{NIST}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{The third year of the Chemical IR evaluation track benefitted from the support of many more people interested in the domain, as shown by the number of co-authors of this overview paper. We continued the two tasks we had before, and introduced a new task focused on chemical image recognition. The objective is to gradually move towards systems really useful to the practitioners, and in chemistry, this involves both text and images. The track had a total of 9 groups participating, submitting a total of 36 runs.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at CLEF-IP 2012: Native TIFF Processing for Patents</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</guid><description>Evaluation of OSRA on CLEF-IP 2012 patent data showing native TIFF processing outperforms external splitting tools and pairwise-distance segmentation.</description><content:encoded><![CDATA[<h2 id="contribution-evaluating-native-processing-in-osra">Contribution: Evaluating Native Processing in OSRA</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It focuses on evaluating the algorithmic performance of the Optical Structure Recognition Application (OSRA) and justifies specific implementation details (such as pairwise distance clustering) through comparative analysis. The paper systematically compares preprocessing workflows (native vs. <code>tiffsplit</code>) to demonstrate how implementation choices impact precision, recall, and F1 scores.</p>
<h2 id="motivation-advancing-chemical-structure-recognition">Motivation: Advancing Chemical Structure Recognition</h2>
<p>The primary motivation is to solve the <strong>Chemical Structure Recognition</strong> task within the context of the CLEF-IP 2012 challenge. The goal is to accurately convert images of chemical structures found in patent documents into established computerized molecular formats (connection tables).</p>
<p>A secondary technical motivation is to address issues in page segmentation where standard bounding box approaches fail to separate overlapping or nested molecular structures.</p>
<h2 id="core-innovation-pairwise-distance-segmentation">Core Innovation: Pairwise Distance Segmentation</h2>
<p>The core novelty lies in the algorithmic approach to object detection and page segmentation:</p>
<ol>
<li>
<p><strong>Rejection of Bounding Boxes</strong>: Unlike standard OCR approaches, OSRA does not use a bounding box paradigm internally. Instead, it relies on the <strong>minimum pairwise distance</strong> between points of different connected components. This allows the system to correctly handle cases where a larger molecule &ldquo;surrounds&rdquo; a smaller one, which bounding boxes would incorrectly merge.</p>
</li>
<li>
<p><strong>Native TIFF Processing</strong>: The authors identify that external tools (specifically <code>tiffsplit</code>) introduce artifacts during multi-page TIFF conversion. They implement native splitting facilities within OSRA, which substantially improves precision (from 0.433 to 0.708 at tolerance 0).</p>
</li>
</ol>
<h2 id="experimental-setup-segmentation-and-recognition-tracks">Experimental Setup: Segmentation and Recognition Tracks</h2>
<p>The authors performed two specific tracks for the CLEF-IP 2012 challenge:</p>
<ol>
<li>
<p><strong>Page Segmentation</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 5421 ground truth structures.</li>
<li><strong>Comparison</strong>: Run 1 used <code>tiffsplit</code> (external tool) to separate pages; Run 2 used OSRA&rsquo;s native internal page splitting.</li>
<li><strong>Metrics</strong>: Precision, Recall, and F1 scores calculated at varying pixel tolerances (0, 10, 20, 40, 55 pixels).</li>
</ul>
</li>
<li>
<p><strong>Structure Recognition</strong>:</p>
<ul>
<li><strong>Dataset</strong>: A test set split into an &ldquo;Automatic&rdquo; evaluation set (865 structures checkable via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> keys) and a &ldquo;Manual&rdquo; evaluation set (95 structures requiring human review due to Markush labels).</li>
<li><strong>Metric</strong>: Recognition rate (Recalled %).</li>
</ul>
</li>
</ol>
<h2 id="results-and-conclusions-native-processing-gains">Results and Conclusions: Native Processing Gains</h2>
<ul>
<li><strong>Native vs. External Splitting</strong>: The native OSRA page splitting outperformed the external <code>tiffsplit</code> tool by a wide margin. At tolerance 0, native processing achieved <strong>0.708 Precision</strong> compared to <strong>0.433</strong> for <code>tiffsplit</code>. The authors attribute this gap to artifacts introduced during <code>tiffsplit</code>&rsquo;s internal TIFF format conversion. The native run also returned far fewer records (5,254 vs. 8,800 for <code>tiffsplit</code>), indicating fewer false detections.</li>
<li><strong>Recognition Rate</strong>: Across 960 total structures, the system achieved an <strong>83% recognition rate</strong> (88% on the automatic set, 40% on the manual Markush set).</li>
<li><strong>Context</strong>: The results were consistent with OSRA&rsquo;s second-place finish (out of 6 participants) at TREC-CHEM 2011.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The experiments used the CLEF-IP 2012 benchmark datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Set</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td>Ground Truth</td>
          <td>5,421 structures</td>
          <td>Used to evaluate bounding box/coordinate accuracy.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Automatic</td>
          <td>865 structures</td>
          <td>Evaluated via InChI key matching.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Manual</td>
          <td>95 structures</td>
          <td>Evaluated manually due to Markush-style labels.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Component Clustering (Pairwise Distance)</strong></p>
<p>The segmentation algorithm avoids bounding boxes.</p>
<ul>
<li><strong>Logic</strong>: Calculate the minimum pairwise distance between points of distinct graphical components.</li>
<li><strong>Criterion</strong>: If distance $d &lt; \text{threshold}$, components are grouped.</li>
<li><strong>Advantage</strong>: Enables separation of complex geometries where a bounding box $B_1$ might fully encompass $B_2$ (e.g., a large ring surrounding a salt ion), whereas the actual pixels are disjoint.</li>
</ul>
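<p>The grouping criterion can be sketched as union-find over connected components, with the minimum pixel-to-pixel distance as the linkage test. The threshold and point sets in the test are illustrative; the point is that a ring whose bounding box encloses a distant ion still lands in a separate cluster:</p>

```python
import numpy as np

def min_pairwise_distance(a, b):
    """Minimum Euclidean distance between any pixel of a and any of b."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min()

def group_components(components, threshold):
    """Union-find grouping: two components merge when the minimum
    distance between their pixel sets falls below `threshold`."""
    n = len(components)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if min_pairwise_distance(components[i], components[j]) < threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

<p>A bounding-box test would merge the surrounding ring with the enclosed ion; the pixel-distance test keeps them apart while still joining genuinely adjacent components.</p>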
<p><strong>2. Image Pre-processing</strong></p>
<ul>
<li><strong>Workflow A (Run 1)</strong>: Multi-page TIFF → <code>tiffsplit</code> binary → Single TIFFs → OSRA.</li>
<li><strong>Workflow B (Run 2)</strong>: Multi-page TIFF → OSRA Internal Split → Recognition.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Page Segmentation Results (tiffsplit, Run 1)</strong></p>
<p>Using <code>tiffsplit</code> for page splitting returned 8,800 records against 5,421 ground truth structures.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.433</td>
          <td>0.703</td>
          <td>0.536</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.490</td>
          <td>0.795</td>
          <td>0.606</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.507</td>
          <td>0.823</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.536</td>
          <td>0.870</td>
          <td>0.663</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.549</td>
          <td>0.891</td>
          <td>0.679</td>
      </tr>
  </tbody>
</table>
<p><strong>Page Segmentation Results (Native Split, Run 2)</strong></p>
<p>Using OSRA&rsquo;s native TIFF reading returned 5,254 records, with much higher precision.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.708</td>
          <td>0.686</td>
          <td>0.697</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.793</td>
          <td>0.769</td>
          <td>0.781</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.821</td>
          <td>0.795</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.867</td>
          <td>0.840</td>
          <td>0.853</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.887</td>
          <td>0.860</td>
          <td>0.873</td>
      </tr>
  </tbody>
</table>
<p><strong>Structure Recognition Results</strong></p>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Count</th>
          <th>Recalled</th>
          <th>Percentage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Automatic</td>
          <td>865</td>
          <td>761</td>
          <td>88%</td>
      </tr>
      <tr>
          <td>Manual</td>
          <td>95</td>
          <td>38</td>
          <td>40%</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>960</strong></td>
          <td><strong>799</strong></td>
          <td><strong>83%</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://cactus.nci.nih.gov/osra">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Official project page at NCI/NIH</td>
      </tr>
  </tbody>
</table>
<p>OSRA is described as an open source utility. The CLEF-IP 2012 benchmark datasets were provided as part of the shared task. No hardware or compute requirements are specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2012). Optical Structure Recognition Application entry to CLEF-IP 2012. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>.</p>
<p><strong>Publication</strong>: CLEF 2012</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="http://cactus.nci.nih.gov/osra">Project Home Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{filippovOpticalStructureRecognition2012,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry to {{CLEF-IP}} 2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V and Katsubo, Dmitry and Nicklaus, Marc C}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present our entry to CLEF 2012 Chemical Structure Recognition task. Our submission includes runs for both bounding box extraction and molecule structure recognition tasks using Optical Structure Recognition Application. OSRA is an open source utility to convert images of chemical structures to connection tables into established computerized molecular formats. It has been under constant development since 2007.}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolRec at CLEF 2012: Rule-Based Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</guid><description>Overview and failure analysis of the MolRec rule-based chemical structure recognition system evaluated on the CLEF 2012 chemical structure recognition task.</description><content:encoded><![CDATA[<h2 id="contribution-to-chemical-structure-recognition">Contribution to Chemical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper. It describes the architecture of an engineered artifact (the &ldquo;MolRec&rdquo; system) and evaluates its efficacy on a specific task (Chemical Structure Recognition) using a standardized benchmark. It focuses on the mechanisms of vectorization and rule-based rewriting.</p>
<h2 id="motivation-and-clef-2012-context">Motivation and CLEF 2012 Context</h2>
<p>The work was motivated by the <strong>CLEF 2012 chemical structure recognition task</strong>. The goal was to automatically interpret chemical diagram images clipped from patent documents. This is challenging because real-world patent images contain complex structures, such as bridge bonds and elements not supported by standard conversion tools like OpenBabel.</p>
<h2 id="novelty-in-rule-based-vectorization">Novelty in Rule-Based Vectorization</h2>
<p>The primary contribution is an <strong>improved rule-based rewrite engine</strong> compared to the authors&rsquo; previous TREC 2011 submission, featuring a fully overhauled implementation that improves both recognition performance and computational efficiency. The system uses a two-stage approach:</p>
<ol>
<li><strong>Vectorization</strong>: Extracts geometric primitives (lines, circles, arrows) and characters.</li>
<li><strong>Rule Engine</strong>: Applies 18 specific geometric rewriting rules to transform primitives into a chemical graph, which can then be exported to MOL or SMILES format.</li>
</ol>
<p>Notably, the system explicitly handles &ldquo;bridge bonds&rdquo; (3D perspective structures) by applying specific recognition rules before general bond detection.</p>
<h2 id="experimental-setup-on-the-clef-2012-corpus">Experimental Setup on the CLEF 2012 Corpus</h2>
<p>The system was evaluated on the <strong>CLEF 2012 corpus</strong> of 961 test images, split into two distinct sets to test different capabilities:</p>
<ul>
<li><strong>Automatic Set</strong>: 865 images evaluated automatically using OpenBabel to compare generated MOL files against ground truth.</li>
<li><strong>Manual Set</strong>: 95 &ldquo;challenging&rdquo; images containing elements beyond OpenBabel&rsquo;s scope (e.g., Markush structures), evaluated via manual visual inspection.</li>
</ul>
<p>The authors performed <strong>four runs</strong> with slightly different internal parameters to test system stability.</p>
<h2 id="performance-outcomes-and-failure-analysis">Performance Outcomes and Failure Analysis</h2>
<p><strong>Performance:</strong></p>
<ul>
<li><strong>Automatic Set</strong>: High performance, achieving accuracy between <strong>94.91% and 96.18%</strong>.</li>
<li><strong>Manual Set</strong>: Lower performance, with accuracy between <strong>46.32% and 58.95%</strong>, reflecting the difficulty of complex patent diagrams containing Markush structures and other elements beyond OpenBabel&rsquo;s scope.</li>
</ul>
<p><strong>Failure Analysis:</strong></p>
<p>The authors conducted a detailed error analysis on 52 distinct mis-recognized diagrams from the manual set and 46 from the automatic set. Key failure modes include:</p>
<ul>
<li><strong>Character Grouping</strong>: The largest error source in the manual set (26 images). A bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, and closely spaced atom groups were incorrectly merged.</li>
<li><strong>Touching Characters</strong>: 8 images in the manual set and 1 in the automatic set. The system lacks segmentation for characters that touch, causing OCR failure.</li>
<li><strong>Four-way Junctions</strong>: 6 manual and 7 automatic images. Vectorization failed to correctly identify junctions where four lines meet.</li>
<li><strong>Missed Wedge Bonds</strong>: 6 images each for missed solid wedge and dashed wedge bonds in the automatic set.</li>
<li><strong>OCR Errors</strong>: 5 manual and 11 automatic images, including misrecognition of &ldquo;G&rdquo; as &ldquo;O&rdquo; and &ldquo;I&rdquo; interpreted as a vertical single bond.</li>
<li><strong>Charge Signs</strong>: MolRec correctly recognized positive charge signs but missed three negative charge signs, including one placed at the top left of an atom name.</li>
<li><strong>Dataset Errors</strong>: The authors identified 11 images where the ground truth MOL files were incorrect, but MolRec&rsquo;s recognition was actually correct.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was provided by CLEF 2012 organizers and consists of images clipped from patent documents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation (Auto)</td>
          <td>CLEF 2012 Set 1</td>
          <td>865 images</td>
          <td>Evaluated via OpenBabel</td>
      </tr>
      <tr>
          <td>Evaluation (Manual)</td>
          <td>CLEF 2012 Set 2</td>
          <td>95 images</td>
          <td>Complex/Markush structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>MolRec</strong> pipeline consists of two primary modules:</p>
<p><strong>1. Vectorization Module</strong></p>
<ul>
<li><strong>Binarization</strong>: Uses <strong>Otsu&rsquo;s method</strong>.</li>
<li><strong>OCR</strong>: Extracts connected components and classifies them using <strong>nearest neighbor classification</strong> with a Euclidean metric. Detected characters are removed from the image.</li>
<li><strong>Bond Separation</strong>:
<ul>
<li>Thins remaining components to single-pixel width.</li>
<li>Builds polyline representations.</li>
<li>Splits polylines at junctions (3+ lines meeting).</li>
<li><strong>Simplification</strong>: Applies the <strong>Douglas-Peucker algorithm</strong> with a threshold of 1-2 average line widths to remove scanning artifacts while preserving corners. The threshold is based on measured average line width, allowing adaptation to different line styles.</li>
<li>Also detects circles, arrow heads, and solid triangles (annotated with direction).</li>
</ul>
</li>
</ul>
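<p>The Douglas-Peucker step can be sketched with the textbook recursive formulation (MolRec&rsquo;s actual implementation and threshold handling are not published; the tolerance here stands in for the 1&ndash;2&times; average line width):</p>

```python
import math

def perp_dist(pt, a, b):
    # perpendicular distance from pt to the line through a and b
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, eps):
    # keep the interior point farthest from the chord if it exceeds eps;
    # otherwise collapse the run to its endpoints (removes scan noise
    # while preserving genuine corners)
    if len(points) < 3:
        return points
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        return [points[0], points[-1]]
    left = douglas_peucker(points[: idx + 1], eps)
    right = douglas_peucker(points[idx:], eps)
    return left[:-1] + right
```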
<p><strong>2. Rule Engine</strong></p>
<ul>
<li><strong>Input</strong>: Geometric primitives (segments, circles, triangles, arrows, character groups).</li>
<li><strong>Structure</strong>: 18 rewrite rules.</li>
<li><strong>Priority</strong>: Two rules for <strong>Bridge Bonds</strong> (Open/Closed) are applied <em>first</em>.</li>
<li><strong>Standard Rules</strong>: 16 rules applied in arbitrary order for standard bonds (Single, Double, Triple, Wedge, Dative, etc.).</li>
<li><strong>Implicit Nodes</strong>: Some rules handle carbon atoms that are implicit at bond junctions: they detect double or triple bonds and, by splitting bonds at the implicit nodes, emit new geometric objects for further processing.</li>
<li><strong>Example Rule (Wavy Bond)</strong>:
<ul>
<li><em>Condition 1</em>: Set of line segments $L$ where $n \ge 3$.</li>
<li><em>Condition 2</em>: Segment lengths match &ldquo;dash length&rdquo; parameter.</li>
<li><em>Condition 3</em>: All elements are connected.</li>
<li><em>Condition 4</em>: Center points are approximately collinear.</li>
<li><em>Condition 5</em>: Endpoints form a single sequence (end elements have 1 neighbor, internal have 2).</li>
<li><em>Condition 6</em>: The two unconnected endpoints must be the pair of endpoints that are furthest apart.</li>
<li><em>Consequence</em>: Replace $L$ with a Wavy Bond between the furthest two endpoints. The bond has unknown direction.</li>
</ul>
</li>
</ul>
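<p>Condition 4 (approximately collinear center points) can be checked with a perpendicular-distance test; a sketch under the assumption that the reference line runs through the first and last centers, since the paper does not give the exact tolerance:</p>

```python
import math

def approx_collinear(centers, tol):
    # every interior center must lie within tol of the line through the
    # first and last centers (distance via the 2D cross product)
    (x1, y1), (x2, y2) = centers[0], centers[-1]
    norm = math.hypot(x2 - x1, y2 - y1) or 1.0
    return all(
        abs((x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)) / norm <= tol
        for x, y in centers[1:-1]
    )
```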
<h3 id="models">Models</h3>
<p>MolRec is a <strong>rule-based system</strong> and does not use trained deep learning models or weights.</p>
<ul>
<li><strong>Superatoms</strong>: Uses a dictionary look-up to resolve character groups representing superatoms into subgraphs.</li>
<li><strong>Disambiguation</strong>: Context-based logic is applied <em>after</em> graph construction to resolve ambiguities (e.g., distinguishing vertical bond <code>|</code> from letter <code>I</code> or digit <code>1</code>).</li>
</ul>
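<p>An illustrative superatom look-up: the entries below are common abbreviations written as SMILES fragments, not MolRec&rsquo;s actual dictionary, which is unpublished.</p>

```python
# hypothetical dictionary mapping character groups to substructures
SUPERATOMS = {
    "Me": "C",          # methyl
    "Et": "CC",         # ethyl
    "OMe": "OC",        # methoxy
    "Ph": "c1ccccc1",   # phenyl
    "COOH": "C(=O)O",   # carboxyl
}

def expand_superatom(label):
    # returns the subgraph for a known superatom, None for plain labels
    return SUPERATOMS.get(label)
```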
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Run 3</th>
          <th>Run 4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Auto (865 images)</td>
          <td>96.18% (832/865)</td>
          <td>94.91% (821/865)</td>
          <td>94.91% (821/865)</td>
          <td>96.18% (832/865)</td>
      </tr>
      <tr>
          <td>Manual (95 images)</td>
          <td>46.32% (44/95)</td>
          <td>58.95% (56/95)</td>
          <td>46.32% (44/95)</td>
          <td>56.84% (54/95)</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Parameters</strong>:</p>
<ul>
<li><strong>Dash Length</strong>: Range of acceptable values for dashed lines.</li>
<li><strong>Simplification Threshold</strong>: 1-2x average line width for Douglas-Peucker.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">CLEF 2012 Workshop Paper</a></td>
          <td>Other</td>
          <td>Open Access</td>
          <td>CEUR Workshop Proceedings</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-classification-closed">Reproducibility Classification: Closed</h3>
<p>No source code for the MolRec system has been publicly released. The CLEF 2012 evaluation dataset was distributed to task participants and is not openly available. The rule-based algorithm is described in sufficient detail to re-implement, but exact parameter values and the character classification training set are not fully specified. No hardware or compute requirements are reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012 &ndash; Overview and Analysis of Results. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawi2012molrec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolRec at CLEF 2012--Overview and Analysis of Results}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sadawi, Noureddin M and Sexton, Alan P and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CLEF-IP 2012: Patent and Chemical Structure Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/clef-ip-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/clef-ip-2012/</guid><description>Overview of the CLEF-IP 2012 benchmarking lab focusing on patent passage retrieval, flowchart recognition, and chemical structure extraction.</description><content:encoded><![CDATA[<h2 id="patent-retrieval-and-the-clef-ip-2012-benchmark">Patent Retrieval and the CLEF-IP 2012 Benchmark</h2>
<p>This is a <strong>Resource</strong> paper (benchmark infrastructure). It establishes a standardized test bed for the Intellectual Property (IP) Information Retrieval community by defining tasks, curating datasets (topics and relevance judgments), and establishing evaluation protocols. The paper does not propose a new method itself but aggregates and analyzes the performance of participant systems on these shared tasks.</p>
<h2 id="motivation-for-standardized-ip-information-retrieval">Motivation for Standardized IP Information Retrieval</h2>
<p>The volume of patent applications is increasing rapidly, necessitating automated methods to help patent experts find prior art and classify documents.</p>
<ul>
<li><strong>Economic Impact:</strong> Thorough searches are critical due to the high economic value of granted patents.</li>
<li><strong>Complexity:</strong> Patent work-flows are specific; examiners need to find prior art for specific <em>claims</em> alongside whole documents, and often rely on non-textual data like flowcharts and chemical diagrams.</li>
<li><strong>Gap:</strong> Existing general IR tools are insufficient for the specific granularity (passages, images, structures) required in the IP domain.</li>
</ul>
<h2 id="novel-multi-modal-tasks-claims-flowcharts-and-chemicals">Novel Multi-modal Tasks: Claims, Flowcharts, and Chemicals</h2>
<p>The 2012 edition of the lab introduced three specific tasks targeting different modalities of patent data:</p>
<ol>
<li><strong>Passage Retrieval starting from Claims:</strong> Moving beyond document-level retrieval to identifying specific relevant passages based on claim text.</li>
<li><strong>Flowchart Recognition:</strong> A new image analysis task requiring the extraction of structural information (nodes, edges, text) from patent images.</li>
<li><strong>Chemical Structure Recognition:</strong> A dual task of segmenting molecular diagrams from full pages and recognizing them into structural files (MOL), specifically addressing the challenge of Markush structures in patents.</li>
</ol>
<h2 id="benchmarking-setup-and-evaluation">Benchmarking Setup and Evaluation</h2>
<p>The &ldquo;experiments&rdquo; were the benchmarking tasks themselves, performed by participants (e.g., University of Birmingham, SAIC, TU Vienna).</p>
<ul>
<li><strong>Passage Retrieval:</strong> Participants retrieved documents and passages for 105 test topics (sets of claims) from a corpus of 1.5 million patents. Performance was measured using PRES, Recall, and MAP at the document level, and AP/Precision at the passage level.</li>
<li><strong>Flowchart Recognition:</strong> Participants extracted graph structures from 100 test images. Evaluation compared the submitted graphs to ground truth using a distance metric based on the Maximum Common Subgraph (MCS).</li>
<li><strong>Chemical Structure:</strong>
<ul>
<li><em>Segmentation:</em> Identifying bounding boxes of chemical structures in 30 multipage TIFF patents.</li>
<li><em>Recognition:</em> Converting 865 &ldquo;automatic&rdquo; (standard MOL) and 95 &ldquo;manual&rdquo; (Markush/complex) diagrams into structure files.</li>
</ul>
</li>
</ul>
<h2 id="key-findings-and-baseline-results">Key Findings and Baseline Results</h2>
<ul>
<li><strong>Passage Retrieval:</strong> Approaches varied from two-step retrieval (document then passage) to full NLP techniques. Translation tools were universally used due to the multilingual corpus (English, German, French).</li>
<li><strong>Chemical Recognition:</strong> The best performing system (UoB, run uob-4) achieved 92% recall on total structures (886/960), with 96% on the automatic set and 57% on the manual set. SAIC achieved 83% total recall. The manual evaluation highlighted a critical need for standards extending MOL files to support Markush structures, which are common in patents but poorly supported by current tools.</li>
<li><strong>Flowchart Recognition:</strong> The evaluation was not completed at the time the workshop notes were written. It required a combination of structural matching and edit distance for text labels because OCR outputs rarely &ldquo;hard-matched&rdquo; the gold standard.</li>
</ul>
<h3 id="chemical-structure-recognition-results">Chemical Structure Recognition Results</h3>
<p><strong>Segmentation</strong> (SAIC, best run using OSRA native rendering):</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>$F_1$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.708</td>
          <td>0.686</td>
          <td>0.697</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.793</td>
          <td>0.769</td>
          <td>0.781</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.821</td>
          <td>0.795</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.867</td>
          <td>0.840</td>
          <td>0.853</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.887</td>
          <td>0.860</td>
          <td>0.873</td>
      </tr>
  </tbody>
</table>
<p><strong>Recognition</strong> (automatic and manual sets):</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Auto (#/865)</th>
          <th>Auto %</th>
          <th>Manual (#/95)</th>
          <th>Manual %</th>
          <th>Total (#/960)</th>
          <th>Total %</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SAIC</td>
          <td>761</td>
          <td>88%</td>
          <td>38</td>
          <td>40%</td>
          <td>799</td>
          <td>83%</td>
      </tr>
      <tr>
          <td>UoB-1</td>
          <td>832</td>
          <td>96%</td>
          <td>44</td>
          <td>46%</td>
          <td>876</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>UoB-2</td>
          <td>821</td>
          <td>95%</td>
          <td>56</td>
          <td>59%</td>
          <td>877</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>UoB-3</td>
          <td>821</td>
          <td>95%</td>
          <td>44</td>
          <td>46%</td>
          <td>865</td>
          <td>90%</td>
      </tr>
      <tr>
          <td>UoB-4</td>
          <td>832</td>
          <td>96%</td>
          <td>54</td>
          <td>57%</td>
          <td>886</td>
          <td>92%</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The collection focuses on European Patent Office (EPO) and WIPO documents published up to 2002.</p>
<p><strong>1. Passage Retrieval Data</strong></p>
<ul>
<li><strong>Corpus:</strong> &gt;1.5 million XML patent documents (EP and WO sources).</li>
<li><strong>Training Set:</strong> 51 topics (sets of claims) with relevance judgments (18 DE, 21 EN, 12 FR).</li>
<li><strong>Test Set:</strong> 105 topics (35 per language).</li>
<li><strong>Topic Source:</strong> Extracted manually from search reports listing &ldquo;X&rdquo; or &ldquo;Y&rdquo; citations (highly relevant prior art).</li>
</ul>
<p><strong>2. Flowchart Data</strong></p>
<ul>
<li><strong>Format:</strong> Black and white TIFF images.</li>
<li><strong>Training Set:</strong> 50 images with textual graph representations.</li>
<li><strong>Test Set:</strong> 100 images.</li>
<li><strong>Ground Truth:</strong> A defined textual format describing nodes (<code>NO</code>), directed edges (<code>DE</code>), undirected edges (<code>UE</code>), and meta-data (<code>MT</code>).</li>
</ul>
<p><strong>3. Chemical Structure Data</strong></p>
<ul>
<li><strong>Segmentation:</strong> 30 patent files rendered as 300dpi monochrome multipage TIFFs.</li>
<li><strong>Recognition (Automatic Set):</strong> 865 diagram images fully representable in standard MOL format.</li>
<li><strong>Recognition (Manual Set):</strong> 95 diagram images containing Markush structures or variability not supported by standard MOL.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Ground Truth Generation:</strong></p>
<ul>
<li><strong>Qrels Generator:</strong> An in-house tool was used to manually map search report citations to specific XML passages (XPaths) for the passage retrieval task.</li>
<li><strong>McGregor Algorithm:</strong> Used for the flowchart evaluation to compute the Maximum Common Subgraph (MCS) between participant submissions and ground truth.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Passage Retrieval Metrics:</strong></p>
<ul>
<li><strong>Document Level:</strong> PRES (Patent Retrieval Evaluation Score), Recall, MAP. Cut-off at 100 documents.</li>
<li><strong>Passage Level:</strong> $AP(D)$ (Average Precision at document level) and $Precision(D)$ (Precision at document level), averaged across all relevant documents for a topic.</li>
</ul>
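<p>For illustration, standard average precision over a ranked list; the organizers&rsquo; exact per-document variant $AP(D)$ may differ in how it restricts the pool, so treat this as the generic formula:</p>

```python
def average_precision(ranked, relevant):
    # ranked: retrieved items in rank order; relevant: set of relevant items
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at each relevant hit
    return score / len(relevant) if relevant else 0.0
```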
<p><strong>Flowchart Recognition Metric:</strong></p>
<ul>
<li><strong>Graph Distance ($d$):</strong> Defined quantitatively based on the Maximum Common Subgraph (MCS) between a target flowchart ($F_t$) and a submitted flowchart ($F_s$):
$$
\begin{aligned}
d(F_t, F_s) &amp;= 1 - \frac{|mcs(F_t, F_s)|}{|F_t| + |F_s| - |mcs(F_t, F_s)|}
\end{aligned}
$$
where $|F|$ represents the size of the graph (nodes + edges).</li>
<li><strong>Levels:</strong> Evaluated at three levels: Basic (structure only), Intermediate (structure + node types), and Complete (structure + types + text labels).</li>
</ul>
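<p>The distance is a Jaccard-style dissimilarity over graph sizes; a direct transcription of the definition above, with sizes taken as node + edge counts:</p>

```python
def graph_distance(mcs_size, ft_size, fs_size):
    # d = 1 - |mcs| / (|Ft| + |Fs| - |mcs|): 0 for identical graphs,
    # approaching 1 as the maximum common subgraph shrinks
    return 1 - mcs_size / (ft_size + fs_size - mcs_size)
```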
<p><strong>Chemical Structure Metrics:</strong></p>
<ul>
<li><strong>Segmentation:</strong> Precision, Recall, and $F_1$ based on bounding box matches. A match is valid if borders are within a tolerance (0 to 55 pixels).</li>
<li><strong>Recognition:</strong>
<ul>
<li><em>Automatic:</em> Comparison of InChI strings generated by Open Babel.</li>
<li><em>Manual:</em> Visual comparison of images rendered by MarvinView.</li>
</ul>
</li>
</ul>
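<p>The segmentation match criterion can be sketched as a per-border tolerance test. The box representation (left, top, right, bottom) is an assumption; the organizers&rsquo; in-house comparator was not released.</p>

```python
def boxes_match(pred, gold, tol):
    # a predicted box matches the ground truth when every border deviates
    # by at most tol pixels (tol swept from 0 to 55 in the evaluation)
    return all(abs(p - g) <= tol for p, g in zip(pred, gold))
```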
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p>The CLEF-IP 2012 benchmark data was distributed to registered participants through the CLEF evaluation framework. The patent corpus is derived from the MAREC dataset (EPO and WIPO documents published until 2002). Evaluation tools for segmentation (bounding box comparison) and recognition (InChI comparison via Open Babel) were developed in-house by the organizers. The McGregor algorithm implementation for flowchart evaluation was also custom-built.</p>
<p>No public code repositories or pre-trained models are associated with this paper, as it is a benchmarking infrastructure paper. The evaluation protocols and data formats are fully described in the paper.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://www.ifs.tuwien.ac.at/~clef-ip">CLEF-IP 2012 data</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Distributed to registered CLEF participants; no persistent public archive</td>
      </tr>
      <tr>
          <td><a href="https://www.ir-facility.org/prototypes/marec">MAREC corpus</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Source patent corpus (EPO/WIPO documents up to 2002)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Status</strong>: Partially Reproducible</li>
<li><strong>Missing components</strong>: The benchmark datasets were distributed to participants and are not hosted on a persistent public repository. The in-house evaluation tools (qrels generator, segmentation comparator, flowchart distance calculator) are not publicly released.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Piroi, F., Lupu, M., Hanbury, A., Magdy, W., Sexton, A. P., &amp; Filippov, I. (2012). CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain. <em>CLEF 2012 Working Notes</em>, CEUR Workshop Proceedings, Vol. 1178.</p>
<p><strong>Publication</strong>: CLEF 2012 Working Notes (CEUR-WS Vol. 1178)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{piroi2012clefip,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Piroi, Florina and Lupu, Mihai and Hanbury, Allan and Magdy, Walid and Sexton, Alan P. and Filippov, Igor}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{CLEF 2012 Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{CEUR Workshop Proceedings}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1178}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{CEUR-WS.org}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-PiroiEt2012.pdf}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader Image-to-Structure OCR at TREC 2011 Chemical IR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</guid><description>ChemReader OCR software evaluation on TREC 2011 Chemical IR campaign achieving 93% accuracy on image-to-structure task.</description><content:encoded><![CDATA[<h2 id="methodological-application-applying-chemreader-to-chemical-ocr">Methodological Application: Applying ChemReader to Chemical OCR</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$).</p>
<p>The dominant vector is $\Psi_{\text{Method}}$ because the paper&rsquo;s core contribution is the empirical evaluation and refinement of ChemReader on the Image-to-Structure (I2S) task. The rhetorical indicators align with this classification: quantitative performance metrics, a detailed <strong>error analysis</strong>, and a focus on <strong>how well the system works</strong> and how its underlying algorithms need refinement.</p>
<h2 id="motivation-bridging-the-gap-in-image-to-structure-tasks">Motivation: Bridging the Gap in Image-to-Structure Tasks</h2>
<p>The motivation is two-fold:</p>
<ol>
<li>
<p><strong>Scientific Need</strong>: Traditional text-based chemical mining methods cannot utilize image data in scientific literature. Chemical OCR software is required to extract 2D chemical structure diagrams from raster images and convert them into a machine-readable chemical file format, paving the way for advanced chemical literature mining.</p>
</li>
<li>
<p><strong>Benchmark Participation</strong>: The immediate motivation was participation in the <strong>TREC Chemical IR campaign&rsquo;s Image-to-Structure (I2S) task</strong>, which was designed to evaluate existing chemical OCR software and establish a platform for developing chemical information retrieval techniques utilizing image data.</p>
</li>
</ol>
<h2 id="novelty-benchmark-evaluation-and-error-analysis-of-chemreader">Novelty: Benchmark Evaluation and Error Analysis of ChemReader</h2>
<p>ChemReader was previously introduced in earlier publications and is a chemical OCR system tailored to a chemical database annotation scheme. The novelty of this paper lies in <strong>evaluating ChemReader within the formal I2S benchmark setting</strong> and conducting a detailed <strong>error analysis</strong> of its performance. After fixing a stereo bond omission and a corner detection bug discovered during the evaluation, ChemReader achieved 93% accuracy (930/1000) on the benchmark test set.</p>
<h2 id="experimental-setup-the-trec-2011-i2s-challenge">Experimental Setup: The TREC 2011 I2S Challenge</h2>
<p>The experiment was the application of the ChemReader software to the <strong>Image-to-Structure (I2S) task</strong> of the TREC Chemical IR campaign.</p>
<ul>
<li><strong>Setup</strong>: The software was used to process image data provided for the I2S task.</li>
<li><strong>Evaluation</strong>: The system was initially evaluated, revealing two issues: the omission of <strong>bond stereo types</strong> in the output structures and a bug in the <strong>corner detection</strong> code that failed on lines touching the image boundary. Each issue lowered accuracy by approximately 10%.</li>
<li><strong>Analysis</strong>: After fixing these issues, ChemReader was re-evaluated on the full 1000-image test set (<strong>Test III</strong>). A detailed error analysis was then conducted on 20 randomly selected samples from Test III results.</li>
</ul>
<h2 id="training-progress">Training Progress</h2>
<p>The paper reports three rounds of major training, each improving accuracy by approximately 15% (Figure 1 in the paper shows the progression):</p>
<ul>
<li><strong>Initial (untrained)</strong>: 57% accuracy on 100 selected training images</li>
<li><strong>Key changes</strong>: deactivating unnecessary heuristic algorithms (resizing, de-noising, line merging), limiting the character set, switching to a lightweight chemical dictionary, and fixing precision loss from type conversions</li>
</ul>
<h2 id="outcomes-high-accuracy-hindered-by-complex-connectivity-rules">Outcomes: High Accuracy Hindered by Complex Connectivity Rules</h2>
<ul>
<li>
<p><strong>Submitted Results</strong>: Test I achieved 691/1000 correct outputs (avg. Tanimoto similarity 0.9769), and Test II achieved 689/1000 (avg. Tanimoto similarity 0.9823). Both scored lower than training accuracy due to the stereo bond omission and corner detection bug.</p>
</li>
<li>
<p><strong>Key Finding</strong>: After fixing these two issues, ChemReader achieved <strong>93% accuracy</strong> (930/1000) on the I2S task (Test III), comparable to the highest accuracy among participants.</p>
</li>
<li>
<p><strong>Limitation/Future Direction</strong>: A detailed <strong>error analysis</strong> on 20 randomly selected samples from Test III (Table 2) showed that the software requires the incorporation of <strong>more chemical intelligence in its algorithms</strong> to address remaining systematic errors. The most frequent errors were:</p>
<ul>
<li>Wrongly merged nodes: 6 samples (30%), caused by nodes too close to be distinguished by a distance threshold</li>
<li>Missed bonds: 4 samples (20%), caused by filtering out short line segments</li>
<li>Nonstandard representations: noise symbols confusing the system, nonstandard wedge/hatched bond styles, and 3D crossing bonds that ChemReader cannot interpret</li>
</ul>
</li>
</ul>
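<p>The average Tanimoto similarity reported above is the standard set-overlap measure computed on molecular fingerprint bits. A minimal set-based sketch (the campaign&rsquo;s exact fingerprint type is not specified here):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Example: fingerprints sharing 2 of 4 total bits
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # → 0.5
```

<p>An average similarity near 0.99, as in Test III, suggests that even the 70 incorrect outputs were typically close to the ground-truth structure rather than wholly wrong.</p>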
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Training</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Training Set</td>
          <td style="text-align: left">1000 images (100 used for quick eval)</td>
          <td style="text-align: left">TIF format, one chemical structure per image</td>
      </tr>
      <tr>
          <td style="text-align: left">Evaluation</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Test Set</td>
          <td style="text-align: left">1000 images (20 sampled for error analysis)</td>
          <td style="text-align: left">Same format constraints; 930/1000 correct in Test III</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>ChemReader is a <strong>chemical Optical Character Recognition (OCR) system</strong> with a 17-step pipeline:</p>
<ol>
<li><strong>Pixel clustering</strong>: Region-growing to identify the chemical structure region</li>
<li><strong>Preprocessing</strong>: Resizing, de-noising, and bond length estimation (deactivated for I2S task)</li>
<li><strong>Text identification</strong>: Connected components with similar heights/areas labeled as characters</li>
<li><strong>Benzene ring detection</strong>: Identifying circles representing aromatic bonds</li>
<li><strong>Hatched bond detection</strong>: Finding short collinear line segments of uniform length</li>
<li><strong>Skeletonization</strong>: Thinning bond pixels for downstream processing</li>
<li><strong>Ring structure detection</strong>: Pentagonal/hexagonal rings via Generalized Hough Transformation (GHT)</li>
<li><strong>Line detection</strong>: Modified Hough Transformation with corner detection for bond extraction</li>
<li><strong>Line filtering</strong>: Removing spurious short segments</li>
<li><strong>Secondary text identification</strong>: Re-examining unidentified fragments for text</li>
<li><strong>Character recognition</strong>: Dual-engine approach (GOCR template matching + Euclidean distance-based engine)</li>
<li><strong>Chemical spell checker</strong>: Matching against a dictionary of 770 chemical abbreviations</li>
<li><strong>Secondary line detection</strong>: Re-running line detection on remaining pixels</li>
<li><strong>Line merging/breaking</strong>: Combining fragmented bonds or splitting at junction nodes</li>
<li><strong>Graph construction</strong>: Creating nodes from bond endpoints and chemical symbol centers, merging nearby nodes</li>
<li><strong>Connected component selection</strong>: Selecting the largest graph component</li>
<li><strong>Output</strong>: Connection table in machine-readable format</li>
</ol>
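<p>Step 15&rsquo;s node merging is driven by a distance threshold, which is also the source of the most frequent error in the paper&rsquo;s analysis (&ldquo;wrongly merged nodes&rdquo;). The sketch below is a hypothetical stand-in for that step, not ChemReader&rsquo;s actual (closed-source) code:</p>

```python
import math

def merge_nearby_nodes(nodes, threshold):
    """Greedily merge 2-D points closer than `threshold` into centroids.
    Illustrative sketch of a distance-threshold graph-construction step:
    atoms drawn too close collapse into a single node."""
    merged = []  # entries are (x, y, member_count)
    for x, y in nodes:
        for i, (mx, my, count) in enumerate(merged):
            if math.hypot(x - mx, y - my) < threshold:
                # fold the point into the existing cluster centroid
                merged[i] = ((mx * count + x) / (count + 1),
                             (my * count + y) / (count + 1),
                             count + 1)
                break
        else:
            merged.append((x, y, 1))
    return [(x, y) for x, y, _ in merged]

nodes = [(0, 0), (0.5, 0.0), (10, 10)]
print(merge_nearby_nodes(nodes, threshold=2.0))  # → [(0.25, 0.0), (10, 10)]
```

<p>Two distinct atoms drawn closer than the threshold collapse into one centroid, which is exactly the failure mode behind 30% of the analyzed errors.</p>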
<h3 id="models">Models</h3>
<p>ChemReader is a rule-based system relying on traditional computer vision (Hough Transformation, region growing, skeletonization) and template-based character recognition. It does not use machine-learned model architectures such as CNNs or other neural networks.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Test</th>
          <th style="text-align: left">Correct Outputs</th>
          <th style="text-align: left">Avg. Tanimoto Similarity</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Test I (submitted)</td>
          <td style="text-align: left">691/1000</td>
          <td style="text-align: left">0.9769</td>
          <td style="text-align: left">Original submission</td>
      </tr>
      <tr>
          <td style="text-align: left">Test II (submitted)</td>
          <td style="text-align: left">689/1000</td>
          <td style="text-align: left">0.9823</td>
          <td style="text-align: left">Alternative parameter setting</td>
      </tr>
      <tr>
          <td style="text-align: left">Test III (post-fix)</td>
          <td style="text-align: left">930/1000 (93%)</td>
          <td style="text-align: left">0.9913</td>
          <td style="text-align: left">After fixing stereo bond omission and corner detection bug</td>
      </tr>
  </tbody>
</table>
<p><strong>Error Breakdown</strong> (from 20-sample analysis of Test III):</p>
<ul>
<li>Wrongly merged nodes: 6 (30%)</li>
<li>Missed bonds: 4 (20%)</li>
<li>Nonstandard representations (noise symbols, nonstandard wedge/hatched bonds, 3D crossing bonds): remaining errors</li>
</ul>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>ChemReader&rsquo;s source code is not publicly available. The TREC 2011 Chemical IR I2S image sets were distributed to task participants but are not openly hosted. No pre-trained models apply (rule-based system). The paper provides a detailed algorithmic description (17-step pipeline) and parameter values, but full reproduction requires access to both the ChemReader codebase and the TREC image sets.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Li, Y., Rosania, G. R., &amp; Saitou, K. (2011). Image-to-Structure Task by ChemReader. <em>TREC 2011 Chemical IR Track Report</em>.</p>
<p><strong>Publication</strong>: TREC 2011 Chemical IR Track</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/papers/CHEM.OVERVIEW.pdf">TREC 2011 Chemical IR Track Overview</a></li>
<li><a href="/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/">ChemReader 2009 original paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{parkImagetoStructureTaskChemReader2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image-to-Structure Task by {ChemReader}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Park, Jungkap and Li, Ye and Rosania, Gus R. and Saitou, Kazuhiro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{University of Michigan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">type</span> = <span style="color:#e6db74">{TREC 2011 Chemical IR Track Report}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Reconstruction with chemoCR (2011)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</guid><description>A hybrid system combining pattern recognition and rule-based expert systems to reconstruct chemical structures from bitmap images.</description><content:encoded><![CDATA[<h2 id="contribution-the-chemocr-architecture">Contribution: The chemoCR Architecture</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper focuses entirely on the architecture and workflow of the <strong>chemoCR</strong> system. It proposes specific algorithmic innovations (texture-based vectorization, graph constraint exploration) and defines a comprehensive pipeline for converting raster images into semantic chemical graphs. The primary contribution is the system design and its operational efficacy.</p>
<h2 id="motivation-digitizing-image-locked-chemical-structures">Motivation: Digitizing Image-Locked Chemical Structures</h2>
<p>Chemical structures are the preferred language of chemistry, yet they are frequently locked in non-machine-readable formats (bitmap images like GIF, BMP) within patents and journals.</p>
<ul>
<li><strong>The Problem:</strong> Once published as images, chemical structure information is &ldquo;dead&rdquo; to analysis software.</li>
<li><strong>The Gap:</strong> Manual reconstruction is slow and error-prone. Existing tools struggled with the diversity of drawing styles (e.g., varying line thickness, font types, and non-standard bond representations).</li>
<li><strong>The Goal:</strong> To automate the conversion of these depictions into connection tables (SDF/MOL files) to make the data accessible for computational chemistry applications.</li>
</ul>
<h2 id="core-innovation-rule-based-semantic-object-identification">Core Innovation: Rule-Based Semantic Object Identification</h2>
<p>The system is based on a &ldquo;Semantic Entity&rdquo; approach that identifies chemically significant objects (chiral bonds, superatoms, reaction arrows) from structural formula depictions. Key technical innovations include:</p>
<ol>
<li><strong>Texture-based Vectorization:</strong> A new algorithm that computes local directions to vectorize lines, robust against varying drawing styles.</li>
<li><strong>Expert System Integration:</strong> A graph constraint exploration algorithm that applies an XML-based rule set to classify geometric primitives into chemical classes such as <code>BOND</code>, <code>DOUBLEBOND</code>, <code>TRIPLEBOND</code>, <code>BONDSET</code>, <code>DOTTED CHIRAL</code>, <code>STRINGASSOCIATION</code>, <code>DOT</code>, <code>RADICAL</code>, <code>REACTION</code>, <code>REACTION ARROW</code>, <code>REACTION PLUS</code>, <code>CHARGE</code>, and <code>UNKNOWN</code>.</li>
<li><strong>Validation Scoring:</strong> A built-in validation module that tests valences, bond lengths and angles, typical atom types, and fragments to assign a confidence score (0 to 1) to the reconstruction.</li>
</ol>
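<p>The validation module&rsquo;s 0-to-1 confidence score can be illustrated with the valence test alone. Everything below is an assumption-laden sketch: the <code>MAX_VALENCE</code> table and the fraction-of-valid-atoms scoring are illustrative, since chemoCR&rsquo;s actual rules and weighting are not published:</p>

```python
# Typical maximum valences for a few common organic atoms (illustrative subset).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def valence_score(atoms, bonds):
    """Fraction of atoms whose total bond order stays within typical
    valence -- a hypothetical stand-in for chemoCR's confidence score.
    atoms: dict id -> element symbol; bonds: list of (id_a, id_b, order)."""
    degree = {a: 0 for a in atoms}
    for a, b, order in bonds:
        degree[a] += order
        degree[b] += order
    ok = sum(1 for a, elem in atoms.items()
             if degree[a] <= MAX_VALENCE.get(elem, 4))
    return ok / len(atoms) if atoms else 0.0

# Ethanol-like fragment C-C-O: every atom within its valence limit
atoms = {1: "C", 2: "C", 3: "O"}
bonds = [(1, 2, 1), (2, 3, 1)]
print(valence_score(atoms, bonds))  # → 1.0
```

<p>A reconstruction that gives oxygen three bonds, say, would lose score here, flagging a likely misread bond or label.</p>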
<h2 id="experiments-the-trec-2011-image-to-structure-task">Experiments: The TREC 2011 Image-to-Structure Task</h2>
<p>The system was evaluated as part of the <strong>TREC 2011 Image-to-Structure (I2S) Task</strong>.</p>
<ul>
<li><strong>Dataset:</strong> 1,000 unique chemical structure images provided by USPTO.</li>
<li><strong>Configuration:</strong> The authors used chemoCR v0.93 in batch mode with a single pre-configured parameter set (&ldquo;Houben-Weyl&rdquo;), originally developed for the Houben-Weyl book series of organic chemistry reactions published by Thieme.</li>
<li><strong>Process:</strong> The workflow included image binarization, connected component analysis, OCR for atom labels, and final molecule assembly.</li>
<li><strong>Metric:</strong> Perfect match recall against ground-truth MOL files.</li>
</ul>
<h2 id="results-and-conclusions-expert-systems-vs-dirty-data">Results and Conclusions: Expert Systems vs. &ldquo;Dirty&rdquo; Data</h2>
<ul>
<li><strong>Performance:</strong> The system achieved a <strong>perfect match for 656 out of 1,000 structures (65.6%)</strong>.</li>
<li><strong>Error Analysis:</strong> Failures were primarily attributed to &ldquo;unclear semantics&rdquo; in drawing styles, such as:
<ul>
<li>Overlapping objects (e.g., atom labels clashing with bonds).</li>
<li>Ambiguous primitives (dots interpreted as both radicals and chiral centers).</li>
<li>Markush structures (variable groups), which were excluded from the I2S task definition. A prototype for Markush detection existed but was not used.</li>
</ul>
</li>
<li><strong>Limitations:</strong> The vectorizer cannot recognize curves and circles, only straight lines. Aromatic ring detection (via a heuristic that looks for a large &ldquo;O&rdquo; character in the center of a ring system) was switched off for the I2S task. The system maintained 12 different parameter sets for various drawing styles, and selecting the correct set was critical.</li>
<li><strong>Impact:</strong> Demonstrated that rule-based expert systems combined with standard pattern recognition could handle high-quality datasets effectively, though non-standard drawing styles remain a challenge.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper relies on the TREC 2011 I2S dataset, comprising images extracted from USPTO patents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 I2S</td>
          <td>1,000 images</td>
          <td>Binarized bitmaps from USPTO patents.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Internal Training Set</td>
          <td>Unknown</td>
          <td>Used to optimize parameter sets (e.g., &ldquo;Houben-Weyl&rdquo; set).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper describes three main workflow phases (preprocessing, semantic entity recognition, and molecule reconstruction plus validation), organized into four pipeline sections:</p>
<ol>
<li>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><em>Vaporizer Unit:</em> Erases parts of the image that are presumably not structure diagrams (e.g., text or other human-readable information), isolating the chemical depictions.</li>
<li><em>Connected Components:</em> Groups all foreground pixels that are 8-connected into components.</li>
<li><em>Text Tagging and OCR:</em> Identifies components that map to text areas and converts bitmap letters into characters.</li>
</ul>
</li>
<li>
<p><strong>Vectorization:</strong></p>
<ul>
<li><em>Algorithm:</em> <strong>Compute Local Directions</strong>. It analyzes segment clusters to detect ascending, descending, horizontal, and vertical trends in pixel data, converting them into vectors.</li>
<li><em>Feature:</em> Explicitly handles &ldquo;thick chirals&rdquo; (wedges) by computing orientation.</li>
</ul>
</li>
<li>
<p><strong>Reconstruction (Expert System):</strong></p>
<ul>
<li><em>Core Logic:</em> <strong>Graph Constraint Exploration</strong>. It visits connected components and evaluates them against an XML Rule Set.</li>
<li><em>Classification:</em> Objects are tagged with chemical keywords (e.g., <code>BONDSET</code> for ring systems and chains, <code>STRINGASSOCIATION</code> for atom labels, <code>DOTTED CHIRAL</code> for chiral bonds).</li>
<li><em>Rules:</em> Configurable via <code>chemoCRSettings.xml</code>. The successful rule with the highest priority value defines the annotation for each component.</li>
</ul>
</li>
<li>
<p><strong>Assembly &amp; Validation:</strong></p>
<ul>
<li>Combines classified vectors and OCR text into a semantic graph.</li>
<li><em>Superatoms:</em> Matches text groups against a loaded superatom database (e.g., &ldquo;COOH&rdquo;, &ldquo;Boc&rdquo;).</li>
<li><em>Validation:</em> Calculates a score (0-1) based on chemical feasibility (valences, bond lengths and angles, typical atom types, and fragments).</li>
</ul>
</li>
</ol>
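<p>The connected-components step groups all 8-connected foreground pixels into components. A self-contained sketch of that grouping (illustrative, not chemoCR&rsquo;s implementation):</p>

```python
def connected_components_8(grid):
    """Group 8-connected foreground pixels (value 1) into components
    via iterative flood fill. grid is a rectangular list of lists."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and (r, c) not in seen:
                stack, comp = [(r, c)], []
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    # examine the full 8-neighbourhood, diagonals included
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and grid[ny][nx] == 1
                                    and (ny, nx) not in seen):
                                seen.add((ny, nx))
                                stack.append((ny, nx))
                components.append(comp)
    return components

grid = [[1, 0, 0],
        [0, 1, 0],  # diagonal neighbours: one component under 8-connectivity
        [0, 0, 0]]
print(len(connected_components_8(grid)))  # → 1
```

<p>Under 4-connectivity the two diagonal pixels above would split into separate components; 8-connectivity keeps thin diagonal bond strokes intact.</p>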
<h3 id="models">Models</h3>
<p>The system is primarily rule-based but utilizes machine learning components for specific sub-tasks:</p>
<ul>
<li><strong>OCR:</strong> A trainable OCR module using supervised machine learning to recognize atom labels (H, C, N, O). The specific classifier is not detailed in the paper.</li>
<li><strong>Rule Base:</strong> An XML file containing the expert system logic. This is the &ldquo;model&rdquo; for structural interpretation.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed strictly within the context of the TREC competition.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall (Perfect Match)</td>
          <td>656 / 1000</td>
          <td>N/A</td>
          <td>Strict structural identity required.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software Stack:</strong> Platform-independent JAVA libraries.</li>
<li><strong>Compute:</strong> Batch mode processing supported; specific hardware requirements (CPU/RAM) were not disclosed.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>chemoCR (Fraunhofer SCAI)</td>
          <td>Software</td>
          <td>Unknown</td>
          <td>Project page defunct; tool was proprietary</td>
      </tr>
      <tr>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemoCR.chem.update.pdf">TREC 2011 Proceedings Paper</a></td>
          <td>Paper</td>
          <td>Public</td>
          <td>Official NIST proceedings</td>
      </tr>
  </tbody>
</table>
<p>No source code was publicly released. The chemoCR system was a proprietary tool from Fraunhofer SCAI. The TREC 2011 I2S dataset was distributed to competition participants and is not independently hosted.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zimmermann, M. (2011). Chemical Structure Reconstruction with chemoCR. <em>TREC 2011 Proceedings</em>.</p>
<p><strong>Publication</strong>: Text REtrieval Conference (TREC) 2011</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zimmermannChemicalStructureReconstruction2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Structure Reconstruction with {{chemoCR}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Text {{REtrieval Conference}} ({{TREC}}) 2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zimmermann, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Structural Analysis of Handwritten Chemical Formulas</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</guid><description>A 1999 methodology for recognizing handwritten chemical structures using a structural graph representation and recursive specialists.</description><content:encoded><![CDATA[<h2 id="contribution-structural-approach-to-document-analysis">Contribution: Structural Approach to Document Analysis</h2>
<p><strong>Method</strong>.
This paper proposes a system architecture for document analysis. It introduces a specific pipeline (Global Perception followed by Incremental Extraction) and validates this strategy with recognition rates on specific tasks. The core contribution is the shift from bitmap-based processing to a <strong>structural graph representation</strong> of graphical primitives.</p>
<h2 id="motivation-overcoming-bitmap-limitations-in-freehand-drawings">Motivation: Overcoming Bitmap Limitations in Freehand Drawings</h2>
<ul>
<li><strong>Complexity of Freehand</strong>: Freehand drawings contain fluctuating lines and noise that make standard vectorization techniques difficult to apply directly.</li>
<li><strong>Limitation of Bitmap Analysis</strong>: Most existing systems at the time attempted to interpret the document by working directly on the static bitmap image throughout the process.</li>
<li><strong>Need for Context</strong>: Interpretation requires a dynamic resource that can evolve as knowledge is extracted (e.g., recognizing a polygon changes the context for its neighbors).</li>
</ul>
<h2 id="novelty-dynamic-structural-graphs-and-recursive-specialists">Novelty: Dynamic Structural Graphs and Recursive Specialists</h2>
<p>The authors propose a <strong>Structural Representation</strong> as the unique resource for interpretation.</p>
<ul>
<li><strong>Quadrilateral Primitives</strong>: The system builds Quadrilaterals (pairs of vectors) to represent thin shapes, which are robust to handwriting fluctuations.</li>
<li><strong>Structural Graph</strong>: These primitives are organized into a graph where arcs represent geometric relationships (T-junctions, L-junctions, parallels).</li>
<li><strong>Specialist Agents</strong>: Interpretation is driven by independent modules (specialists) that browse this graph recursively to identify high-level chemical entities like rings (polygons) or chains.</li>
</ul>
<h2 id="experimental-setup-and-outcomes">Experimental Setup and Outcomes</h2>
<ul>
<li><strong>Validation Set</strong>: The system was tested on 20 handwritten off-line documents containing chemical formulas at 300 dpi resolution.</li>
<li><strong>Text Database</strong>: A separate base of 328 models was used for the text recognition component.</li>
<li><strong>High Graphical Accuracy</strong>: The system achieved a $\approx 97\%$ recognition rate for graphical parts (chemical elements like rings and bonds).</li>
<li><strong>Text Recognition</strong>: The text recognition module achieved a $\approx 93\%$ success rate.</li>
<li><strong>Robustness</strong>: The structural graph approach successfully handled multiple liaisons, polygons, and chains, and allowed for the progressive construction of a solution consistent with the context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Handwritten Documents</td>
          <td>20 docs</td>
          <td>Off-line documents at 300 dpi</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Character Models</td>
          <td>328 models</td>
          <td>Used for the Pattern Matching text recognition base</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The interpretation process is divided into two distinct phases:</p>
<p><strong>1. Global Perception (Graph Construction)</strong></p>
<ul>
<li><strong>Vectorization</strong>: Contour tracking produces a chain of vectors, which are simplified via iterative polygonal approximation until fusion stabilizes (2-5 iterations).</li>
<li><strong>Quadrilateral Formation</strong>: Vectors are paired to form quadrilaterals based on Euclidean distance and &ldquo;empirical&rdquo; alignment criteria.</li>
<li><strong>Graph Generation</strong>: Quadrilaterals become nodes. Arcs are created based on &ldquo;zones of influence&rdquo; and classified into 5 types: T-junction, Intersection (X), Parallel (//), L-junction, and Successive (S).</li>
<li><strong>Redraw Heuristic</strong>: A pre-processing step transforms T, X, and S junctions into L or // relations, as chemical drawings primarily consist of L-junctions and parallels.</li>
</ul>
<p><strong>2. Specialists (Interpretation)</strong></p>
<ul>
<li><strong>Liaison Specialist</strong>: Scans the graph for // arcs or quadrilaterals with free extremities to identify bonds.</li>
<li><strong>Polygon/Chain Specialist</strong>: Uses recursive <code>look-left</code> and <code>look-right</code> procedures. If a search returns to the start node after $n$ steps, a polygon is detected.</li>
<li><strong>Text Localization</strong>: Clusters &ldquo;short&rdquo; quadrilaterals by physical proximity into &ldquo;focus zones&rdquo;. Zones are classified as text/non-text based on connected components.</li>
</ul>
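<p>The polygon specialist&rsquo;s &ldquo;return to the start node&rdquo; criterion is essentially cycle detection in the quadrilateral graph. A hedged sketch of that idea as a plain depth-first walk (not the paper&rsquo;s exact <code>look-left</code>/<code>look-right</code> procedures):</p>

```python
def find_polygon(adj, start, max_steps=8):
    """Walk the graph from `start`; report a polygon when a simple path
    of >= 3 nodes closes back on the start node. Illustrative sketch;
    adj maps each node to its neighbours (e.g. L-junction arcs)."""
    def walk(node, path):
        for nxt in adj.get(node, []):
            if nxt == start and len(path) >= 3:
                return path  # closed circuit of >= 3 sides: a polygon
            if nxt not in path and len(path) < max_steps:
                found = walk(nxt, path + [nxt])
                if found:
                    return found
        return None
    return walk(start, [start])

# A hexagonal ring drawn as six bond segments a-b-c-d-e-f-a
ring = {"a": ["b"], "b": ["c"], "c": ["d"],
        "d": ["e"], "e": ["f"], "f": ["a"]}
print(find_polygon(ring, "a"))  # → ['a', 'b', 'c', 'd', 'e', 'f']
```

<p>An open chain never closes back on its start node, so the same walk distinguishes rings from chains.</p>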
<h3 id="models">Models</h3>
<p><strong>Text Recognition Hybrid</strong>:</p>
<ol>
<li><strong>Normalization &amp; Pattern Matching</strong>: A classic method using the database of 328 models.</li>
<li><strong>Structural Rule Base</strong>: Uses &ldquo;significant&rdquo; quadrilaterals (length $\ge 1/3$ of zone dimension) to verify characters. A rule base defines the expected count of horizontal, vertical, right-diagonal, and left-diagonal lines for each character.</li>
</ol>
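<p>The structural rule base amounts to orientation-binned stroke counting. Everything below is illustrative: the <code>RULES</code> entries and the 2:1 orientation-binning ratio are assumptions, as the paper does not reproduce its actual rule base:</p>

```python
# Hypothetical expected stroke counts per character, in the order
# (horizontal, vertical, right-diagonal, left-diagonal).
RULES = {"H": (1, 2, 0, 0), "O": (0, 0, 0, 0)}

def classify_direction(dx, dy):
    """Bin a significant quadrilateral's axis vector into one of the
    four orientation classes used by the rule base (assumed 2:1 ratio)."""
    if abs(dx) > 2 * abs(dy):
        return 0  # horizontal
    if abs(dy) > 2 * abs(dx):
        return 1  # vertical
    return 2 if dx * dy > 0 else 3  # right- vs left-diagonal

def verify(char, strokes):
    """Check stroke-orientation counts against the rule for `char`."""
    counts = [0, 0, 0, 0]
    for dx, dy in strokes:
        counts[classify_direction(dx, dy)] += 1
    return tuple(counts) == RULES.get(char)

# 'H': two vertical bars plus one horizontal crossbar
strokes_H = [(0, 10), (0, 10), (10, 0)]
print(verify("H", strokes_H))  # → True
```

<p>Pattern matching proposes a character; this structural check then confirms or rejects it, which is the hybrid verification the paper describes.</p>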
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graphical Element Recognition</td>
          <td>~97%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents (Fig. 7 examples)</td>
      </tr>
      <tr>
          <td>Text Recognition</td>
          <td>~93%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ramel, J.-Y., Boissier, G., &amp; Emptoz, H. (1999). Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image. <em>Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR &lsquo;99)</em>, 83-86. <a href="https://doi.org/10.1109/ICDAR.1999.791730">https://doi.org/10.1109/ICDAR.1999.791730</a></p>
<p><strong>Publication</strong>: ICDAR 1999</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ramelAutomaticReadingHandwritten1999,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{Fifth International Conference}} on {{Document Analysis}} and {{Recognition}}. {{ICDAR}} &#39;99 ({{Cat}}. {{No}}.{{PR00318}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ramel, J.-Y. and Boissier, G. and Emptoz, H.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1999</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{83--86}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Bangalore, India}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1999.791730}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-0-7695-0318-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at TREC-CHEM 2011: Optical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</guid><description>A methodological overview of OSRA, an open-source pipeline for converting chemical structure images into machine-readable formats.</description><content:encoded><![CDATA[<h2 id="contribution-method-and-resource">Contribution: Method and Resource</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>, with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<p>It is Methodological because it details the specific algorithmic workflow (segmentation, binarization, vectorization, and rule-based recognition) used to translate pixel data into chemical semantics. It specifically addresses the &ldquo;Image2Structure&rdquo; task. It also serves as a Resource contribution by introducing OSRA as a free, open-source utility available to the community.</p>
<h2 id="motivation-limitations-of-standard-ocr-in-chemistry">Motivation: Limitations of Standard OCR in Chemistry</h2>
<p>A vast body of chemical information exists in journal publications and patents as two-dimensional structure diagrams. While human-readable, these images are inaccessible to machine data mining techniques like virtual screening. Standard Optical Character Recognition (OCR) is insufficient for this task. Widely used pattern-recognition techniques such as wavelet transforms or neural networks (as applied in face recognition) also fall short: chemical diagrams contain far more structural complexity than alphabet characters, and misinterpretation of a single element can yield a valid but incorrect molecule.</p>
<h2 id="core-innovation-chemistry-aware-heuristic-pipeline">Core Innovation: Chemistry-Aware Heuristic Pipeline</h2>
<p>The authors present a specialized pipeline distinct from standard OCR that combines image processing with domain-specific chemical logic. Key technical contributions include:</p>
<ul>
<li><strong>Entropy-based Page Segmentation</strong>: A statistical method using row entropy to distinguish between pages with mixed text/graphics and pages with single structures.</li>
<li><strong>Custom Binarization</strong>: A specific grayscale conversion ($Gr=\min(R,G,B)$).</li>
<li><strong>Heuristic Confidence Scoring</strong>: A linear &ldquo;confidence function&rdquo; derived from atom and ring counts to select the best structure resolution.</li>
<li><strong>Specialized Bond Recognition</strong>: Algorithms to detect bridge bonds, wedge/dashed bonds (3D info), and aromatic rings via inner circles.</li>
</ul>
<h2 id="methodology-evaluation-on-trec-chem-image2structure">Methodology: Evaluation on TREC-CHEM Image2Structure</h2>
<p>The system was validated through submission to the <strong>Image2Structure task of TREC-CHEM</strong>.</p>
<ul>
<li><strong>Version</strong>: OSRA version 1.3.8 was used without modifications.</li>
<li><strong>Setup</strong>: Two runs were submitted: one with default settings (automatic scale selection) and one fixed at 300 dpi.</li>
<li><strong>Data</strong>: The evaluation used a &ldquo;Training set&rdquo; and a &ldquo;Challenge Set&rdquo; provided by the task organizers.</li>
<li><strong>Metric</strong>: Recall rates were measured for both sets.</li>
</ul>
<h2 id="results-and-real-world-impact">Results and Real-World Impact</h2>
<ul>
<li><strong>Performance</strong>: The default settings achieved an <strong>84.3%</strong> recall on the training set and <strong>84.8%</strong> on the challenge set. The 300 dpi run performed slightly better (86.1% training, 85.6% challenge).</li>
<li><strong>Utility</strong>: The tool is widely used by academic and commercial researchers to extract data from patents (USPTO, JPO).</li>
<li><strong>Validation</strong>: Recognition rates have shown steady improvement over a 3-year development period.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://osra.sourceforge.net">OSRA (SourceForge)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Open-source OCSR tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The primary evaluation data came from the TREC-CHEM Image2Structure task.</li>
<li><strong>Reference Datasets</strong>: The paper references the &ldquo;Chem-Infty Dataset&rdquo; as a source of ground-truthed chemical structure images.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The OSRA pipeline is heuristic-heavy. Key implementation details for replication include:</p>
<p><strong>1. Page Segmentation</strong></p>
<ul>
<li><strong>Entropy Calculation</strong>: Used to detect text vs. graphics. Entropy $E = -p \log p$ is calculated for rows in a feature matrix of component distances.</li>
<li><strong>Thresholds</strong>: Max entropy &gt; 6 indicates mixed text/graphics; $\le$ 3 indicates a single structure. A threshold of <strong>4</strong> is used to distinguish the two.</li>
<li><strong>Separator Removal</strong>: Linear separators (aspect ratio above 100 or below 0.01, size above 300 pixels) are deleted early. Table frames are identified as connected components with aspect ratio between 0.1 and 10, with at least 300 pixels lying on the surrounding rectangle.</li>
<li><strong>Text Removal</strong>: Text blocks are identified if a group of segments (distance determined by local minima in distance matrix) contains &gt; 8 segments, has a fill ratio &gt; 0.2, or aspect ratio &gt; 10.</li>
</ul>
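<p>The entropy decision above can be sketched in Python. This is illustrative only: OSRA is implemented in C++, and its actual feature matrix of component distances is more involved than the precomputed rows assumed here.</p>

```python
import math
from collections import Counter

def row_entropy(row):
    """Shannon entropy (in bits) of the value distribution in one feature row."""
    total = len(row)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(row).values())

def classify_page(feature_rows, threshold=4.0):
    """Mixed text/graphics if the maximum row entropy exceeds the threshold,
    otherwise a single structure (paper: >6 mixed, <=3 single, 4 as cutoff)."""
    max_e = max(row_entropy(r) for r in feature_rows)
    return "mixed" if max_e > threshold else "single"
```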
<p><strong>2. Image Preprocessing</strong></p>
<ul>
<li><strong>Grayscale</strong>: $Gr = \min(R, G, B)$.</li>
<li><strong>Resolutions</strong>: Processed at 72, 150, 300 dpi, and a dynamic resolution between 500-1200 dpi.</li>
<li><strong>Noise Factor</strong>: Ratio of 2-pixel line segments to 3-pixel line segments. If this factor is between <strong>0.5 and 1.0</strong>, anisotropic smoothing (GREYCstoration) is applied.</li>
<li><strong>Thinning</strong>: Uses the method by J. M. Cychosz to reduce lines to 1 pixel width.</li>
</ul>
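<p>A minimal sketch of the grayscale conversion and the noise-factor test, assuming per-pixel RGB tuples and precomputed segment counts (the real pipeline applies GREYCstoration smoothing to whole images):</p>

```python
def to_grayscale(pixel):
    """OSRA's grayscale conversion: Gr = min(R, G, B)."""
    r, g, b = pixel
    return min(r, g, b)

def needs_smoothing(n2, n3):
    """Noise factor = ratio of 2-pixel to 3-pixel line segments; the paper
    applies anisotropic smoothing when this factor lies in [0.5, 1.0]."""
    if n3 == 0:
        return False
    factor = n2 / n3
    return 0.5 <= factor <= 1.0
```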
<p><strong>3. Vectorization &amp; Atom Detection</strong></p>
<ul>
<li><strong>Library</strong>: Potrace is used for vectorization.</li>
<li><strong>Atom Identification</strong>: Atoms are detected at Bezier curve control points if:
<ul>
<li>Potrace classifies it as a corner.</li>
<li>Direction change normal component is $\ge$ 2 pixels.</li>
<li>The distance from the last atom to the next control point is less than the distance from the last atom to the current control point.</li>
</ul>
</li>
<li><strong>OCR</strong>: GOCR and OCRAD are used for label recognition on connected sets smaller than max character dimensions. Tesseract and Cuneiform were also tested but did not improve recognition results.</li>
</ul>
<p><strong>4. Chemical Logic</strong></p>
<ul>
<li><strong>Average Bond Length</strong>: Defined as the value at the <strong>75th percentile</strong> of the sorted bond length list (to avoid bias from small artifacts).</li>
<li><strong>Aromaticity</strong>: Flagged if a circle is found inside a ring, atoms are within half the average bond length of the circle, and bond angles to the center are less than 90 degrees.</li>
<li><strong>Bridge Bonds</strong>: Detected if an atom connected to 4 pairwise collinear single bonds (none terminal) can be removed without changing fragment count, rotatable bonds, or reducing the number of 5- and 6-membered rings by 2.</li>
</ul>
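<p>The percentile-based bond-length estimate can be sketched as follows. The exact percentile-rank convention is an assumption; the paper only specifies the 75th-percentile value of the sorted length list.</p>

```python
def average_bond_length(lengths):
    """OSRA's 'average' bond length: the 75th-percentile value of the
    sorted length list, which suppresses bias from small artifacts."""
    if not lengths:
        raise ValueError("no bonds")
    s = sorted(lengths)
    # nearest-rank style index for the 75th percentile (convention assumed)
    idx = min(len(s) - 1, int(0.75 * len(s)))
    return s[idx]
```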
<p><strong>5. Connection Table Compilation</strong></p>
<ul>
<li><strong>Library</strong>: OpenBabel is used for conversion into SMILES or SDF formats.</li>
<li><strong>Process</strong>: A molecular object is constructed from connectivity information along with stereo- and aromaticity flags. Superatom fragments are added at this stage using a user-modifiable dictionary.</li>
</ul>
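<p>OSRA delegates this step to OpenBabel. As a dependency-free illustration of what a connection table serializes to, here is a minimal MDL V2000 molfile writer (coordinates, charges, and stereo flags simplified; not OSRA's code):</p>

```python
def to_molfile(atoms, bonds, title="osra-sketch"):
    """Serialize a connection table to a minimal MDL V2000 molfile.
    atoms: list of (symbol, x, y); bonds: list of (i, j, order), 1-indexed."""
    lines = [title, "  sketch", "",
             f"{len(atoms):>3}{len(bonds):>3}  0  0  0  0  0  0  0  0999 V2000"]
    for sym, x, y in atoms:
        # atom block: x, y, z coordinates followed by the element symbol
        lines.append(f"{x:>10.4f}{y:>10.4f}{0.0:>10.4f} {sym:<3} 0  0  0  0  0  0  0  0  0  0  0  0")
    for i, j, order in bonds:
        # bond block: begin atom, end atom, bond order
        lines.append(f"{i:>3}{j:>3}{order:>3}  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines)
```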
<h3 id="models">Models</h3>
<p>This is a non-learning based system (Rule-based/Heuristic). However, it uses a tuned linear function for confidence estimation.</p>
<p><strong>Confidence Function</strong>: Used to select the best resolution result.</p>
<p>$$
\begin{aligned}
\text{confidence} &amp;= 0.316030 - 0.016315 N_C + 0.034336 N_N + 0.066810 N_O \\
&amp;+ 0.035674 N_F + 0.065504 N_S + 0.04 N_{Cl} + 0.066811 N_{Br} \\
&amp;+ 0.01 N_R - 0.02 N_{Xx} - 0.212739 N_{rings} + 0.071300 N_{aromatic} \\
&amp;+ 0.329922 N_{rings5} + 0.342865 N_{rings6} - 0.037796 N_{fragments}
\end{aligned}
$$</p>
<p>Where $N_C$ is carbon count, $N_{rings}$ is ring count, etc.</p>
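<p>The function is straightforward to evaluate once the feature counts are known. The coefficients below are a direct transcription from the paper; the dictionary interface is only a convenience for this sketch.</p>

```python
# Coefficients of OSRA's tuned linear confidence function (from the paper).
COEFFS = {
    "C": -0.016315, "N": 0.034336, "O": 0.066810, "F": 0.035674,
    "S": 0.065504, "Cl": 0.04, "Br": 0.066811, "R": 0.01, "Xx": -0.02,
    "rings": -0.212739, "aromatic": 0.071300,
    "rings5": 0.329922, "rings6": 0.342865, "fragments": -0.037796,
}

def confidence(counts):
    """Linear confidence score used to select the best-resolution result;
    counts maps feature names (e.g. 'C', 'rings6') to integer counts."""
    return 0.316030 + sum(COEFFS[k] * v for k, v in counts.items())
```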
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run</th>
          <th>Training Set</th>
          <th>Challenge Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall</td>
          <td>Default Settings</td>
          <td>84.3%</td>
          <td>84.8%</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>Fixed 300 dpi</td>
          <td>86.1%</td>
          <td>85.6%</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2011). Optical Structure Recognition Application entry in Image2Structure task. <em>TREC-CHEM</em>.</p>
<p><strong>Publication</strong>: TREC-CHEM 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://osra.sourceforge.net">SourceForge Project</a></li>
<li><a href="https://launchpad.net/cuneiform-linux">Cuneiform Linux Port</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{filippovOpticalStructureRecognition2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry in {{Image2Structure}} Task}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V. and Katsubo, Dmitry and Nicklaus, Marc C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{National Cancer Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">note</span> = <span style="color:#e6db74">{TREC-CHEM Entry}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé-1 System for Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</guid><description>Foundational OCSR method combining neural OCR with chemical rule-based post-processing for automated structure interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1996). Automatic Interpretation of Chemical Structure Diagrams. <em>Graphics Recognition. Methods and Applications</em>, 148-158. <a href="https://doi.org/10.1007/3-540-61226-2_13">https://doi.org/10.1007/3-540-61226-2_13</a></p>
<p><strong>Publication</strong>: Lecture Notes in Computer Science (LNCS), Vol. 1072, Springer, 1996.</p>
<h2 id="system-architecture-and-contribution">System Architecture and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel software architecture (&ldquo;Kekulé-1&rdquo;) designed to solve the specific technical problem of converting rasterized chemical diagrams into machine-readable connection tables. The paper is characterized by:</p>
<ul>
<li><strong>Algorithmic Specification</strong>: It details specific algorithms for vectorization, polygon approximation, and character recognition.</li>
<li><strong>Performance Metrics</strong>: It validates the method using quantitative accuracy (98.9%) and speed comparisons against manual entry.</li>
<li><strong>System Architecture</strong>: It describes the integration of typically disparate components (OCR, vectorization, chemical rules) into a cohesive pipeline.</li>
</ul>
<h2 id="motivation-the-chemical-data-entry-bottleneck">Motivation: The Chemical Data Entry Bottleneck</h2>
<p>Chemical structure diagrams are the primary medium for communication between chemists, but computers cannot natively &ldquo;read&rdquo; these raster images.</p>
<ul>
<li><strong>Efficiency Gap</strong>: Manual redrawing of structures into chemical databases takes 6 to 10 minutes per structure.</li>
<li><strong>Technical Challenge</strong>: Existing commercial OCR systems failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), nor could they recognize small fonts (3-7 points) or chemical symbols accurately.</li>
<li><strong>Goal</strong>: To create an &ldquo;Optical Chemical Structure Recognition&rdquo; (OCSR) system that reduces processing time to seconds while handling complex notation like stereochemistry and group formulas.</li>
</ul>
<h2 id="core-innovations-in-chemical-ocr">Core Innovations in Chemical OCR</h2>
<p>Kekulé-1 represents the &ldquo;first successful attempt&rdquo; to integrate image processing, OCR, and structure editing into a single workflow. Key innovations include:</p>
<ul>
<li><strong>Context-Aware OCR</strong>: Unlike standard OCR, Kekulé-1 uses &ldquo;chemical spell checking&rdquo; by applying valence rules and chemical context to correct raw character recognition errors (e.g., distinguishing &lsquo;5&rsquo; from &lsquo;S&rsquo; based on bonding).</li>
<li><strong>Adaptive Polygon Approximation</strong>: A modified vectorization algorithm that partitions objects at the farthest node to prevent artifact nodes in U-shaped structures.</li>
<li><strong>Hybrid Parsing</strong>: It treats the diagram as a graph where nodes can be explicit atoms or geometric intersections, using rule-based logic to parse &ldquo;group formulas&rdquo; (like $COOH$) recursively.</li>
</ul>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The authors evaluated the system on a private test set to validate robustness and speed.</p>
<ul>
<li><strong>Dataset</strong>: 524 chemical structures chosen from a &ldquo;wide variety of sources&rdquo; specifically to test the system&rsquo;s limits.</li>
<li><strong>Metrics</strong>: Success rate (percentage of structures processed with minimal editing) and processing time per structure.</li>
<li><strong>Comparators</strong>: Performance was compared against the &ldquo;manual redrawing&rdquo; baseline.</li>
</ul>
<h2 id="results-performance-and-conclusions">Results, Performance, and Conclusions</h2>
<ul>
<li><strong>High Accuracy</strong>: 98.9% of the test structures were successfully processed (with an average of 0.74 user prompts per structure).</li>
<li><strong>Speedup</strong>: Processing took 7 to 30 seconds per structure, a significant improvement over the 6 to 10 minute manual baseline.</li>
<li><strong>Robustness</strong>: The system successfully handled pathological cases like broken characters, skew (rotation), and touching characters.</li>
<li><strong>Impact</strong>: The authors conclude that the techniques are generalizable to other domains like electrical circuits and utility maps.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: The evaluation used 524 chemical structures. These were not released publicly but were selected to represent &ldquo;limit&rdquo; cases.</li>
<li><strong>Input format</strong>: Scanned images at 300-400 dpi. The authors note that higher resolutions do not add information due to ink wicking and paper limitations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details several specific algorithmic implementations:</p>
<p><strong>Vectorization (Polygon Approximation)</strong>:</p>
<ul>
<li>Standard thinning and raster-to-vector translation are used.</li>
<li><strong>Innovation</strong>: The algorithm searches for the node <em>farthest</em> from the current start node to partition the object. This prevents artifact nodes in curved lines.</li>
<li><strong>Threshold Formula</strong>: The allowed deviation ($\mathrm{dist}$) from a straight line is adaptive based on segment length ($\mathrm{length}$):</li>
</ul>
<p>$$\mathrm{dist} = \max\left(1,\ \frac{\mathrm{length}}{10.0} + 0.4\right)$$</p>
<p>(Units in pixels)</p>
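<p>A direct transcription of the adaptive threshold:</p>

```python
def max_deviation(length):
    """Adaptive allowed deviation (in pixels) from a straight line during
    polygon approximation: dist = max(1, length/10.0 + 0.4)."""
    return max(1.0, length / 10.0 + 0.4)
```

Short segments are held to the 1-pixel floor, while longer segments tolerate proportionally more deviation before being split.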
<p><strong>Rotation Correction</strong>:</p>
<ul>
<li>The system computes the angle of all &ldquo;long&rdquo; line segments modulo 15 degrees.</li>
<li>It bins these angles; the bin with the highest count (representing &lt; 4 degrees rotation) is treated as the scan skew and corrected.</li>
</ul>
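<p>A sketch of the skew estimate under an assumed 1-degree binning scheme; the paper specifies the modulo-15 fold and the winning-bin rule, but not the bin width:</p>

```python
from collections import Counter

def estimate_skew(angles_deg, bin_width=1.0):
    """Fold long-segment angles modulo 15 degrees, bin them, and take the
    most populated bin as the scan skew (the paper corrects winners
    representing under ~4 degrees of rotation)."""
    folded = [a % 15.0 for a in angles_deg]
    bins = Counter(int(a / bin_width) for a in folded)
    best_bin, _ = bins.most_common(1)[0]
    return (best_bin + 0.5) * bin_width  # bin-center estimate
```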
<p><strong>Optical Character Recognition (OCR)</strong>:</p>
<ul>
<li>Uses a neural network with linked/shared weights (similar to Convolutional Neural Networks, though not named as such) acting as a feature detector.</li>
<li><strong>Training</strong>: Trained on specific chemical fonts.</li>
<li><strong>Inference</strong>: Outputs are ranked; if multiple characters (e.g., &lsquo;5&rsquo; and &lsquo;S&rsquo;) exceed a threshold, both are kept, and chemical context resolves the ambiguity later.</li>
</ul>
<p><strong>Chemical Parsing</strong>:</p>
<ul>
<li>Group formulas (e.g., $COOH$) are parsed left-to-right by subtracting valences.</li>
<li>Example: For $COOH$, the external bond reduces Carbon&rsquo;s valence to 3. The first Oxygen takes 2, leaving 1. The final Oxygen takes 1 (attaching to Carbon), and the Hydrogen takes 1 (attaching to Oxygen).</li>
</ul>
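<p>The $COOH$ walkthrough above can be reproduced with a small valence-subtraction parser. The attach-to-earliest-free-atom rule used here is an assumption consistent with the example, not the paper's exact algorithm, and atom counts like $CH_3$ are not handled.</p>

```python
VALENCE = {"C": 4, "O": 2, "N": 3, "S": 2, "H": 1}

def parse_group(symbols, external_bonds=1):
    """Left-to-right valence-subtraction parse of a group formula: each new
    atom bonds to the earliest placed atom that still has free valence,
    taking as much of it as its own valence allows.
    Returns bonds as (i, j, order) index triples."""
    free, bonds = [], []
    for idx, sym in enumerate(symbols):
        v = VALENCE[sym]
        if idx == 0:
            v -= external_bonds  # the bond leaving the group uses one valence
        else:
            j = next(k for k in range(idx) if free[k] > 0)
            order = min(free[j], v)
            free[j] -= order
            v -= order
            bonds.append((j, idx, order))
        free.append(v)
    return bonds
```

For <code>["C", "O", "O", "H"]</code> this yields a C=O double bond, a C-O single bond, and an O-H single bond, matching the worked example.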
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Model</strong>: A neural network with a &ldquo;shared weights&rdquo; paradigm, effectively creating a learned convolution map. It achieves ~99.9% raw accuracy on isolated test sets of chemical fonts.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: The evaluation was performed on an <strong>80486 processor at 33 MHz</strong>.</li>
<li><strong>Time</strong>: Average processing time was 9 seconds per structure.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{mcdanielAutomaticInterpretationChemical1996,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Interpretation of Chemical Structure Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Graphics Recognition. Methods and Applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{O&#39;Gorman, Lawrence and Kasturi, Rangachar}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{Lecture Notes in Computer Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1072}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{148--158}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1996}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/3-540-61226-2_13}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Imago: Open-Source Chemical Structure Recognition (2011)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/imago-trec-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/imago-trec-2011/</guid><description>Open-source C++ toolkit for extracting 2D chemical structures from scientific literature using heuristic image processing methods.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-resource-utility">Paper Contribution and Resource Utility</h2>
<p>This is primarily a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper, with a secondary <strong>Method ($\Psi_{\text{Method}}$)</strong> component.</p>
<p><strong>Resource:</strong> The paper&rsquo;s main contribution is the release of the &ldquo;Imago&rdquo; open-source toolkit. It emphasizes infrastructure utility: cross-platform C++ implementation, a core written from scratch without third-party code, and the inclusion of both GUI and command-line tools.</p>
<p><strong>Method:</strong> It provides a detailed description of the recognition pipeline (filtering, segmentation, vectorization) to document the resource.</p>
<h2 id="motivation-the-deep-web-of-chemical-structures">Motivation: The Deep Web of Chemical Structures</h2>
<p>Chemical databases (like PubChem or PubMed) allow text and substructure searches, but a vast amount of chemical data remains &ldquo;locked&rdquo; in the images of scientific articles and patents. This is described as a &ldquo;Deep Web indexing problem&rdquo;. To bridge this gap, the authors identify a need for efficient, accurate, and portable algorithms to convert 2D raster images of molecules into graph representations suitable for indexing and search.</p>
<h2 id="core-innovation-a-dependency-free-c-architecture">Core Innovation: A Dependency-Free C++ Architecture</h2>
<p>The novelty lies in the <strong>open-source, dependency-free implementation</strong>.</p>
<p><strong>Portability:</strong> The core of the toolkit is written from scratch in modern C++ without third-party libraries, specifically targeting portability to mobile devices and various platforms.</p>
<p><strong>Integration:</strong> It combines optical character recognition (OCR) with specific chemical heuristics (like identifying stereochemistry and abbreviations) into a single usable workflow.</p>
<h2 id="methodology-and-experimental-validation-at-trec-chem">Methodology and Experimental Validation at TREC-CHEM</h2>
<p>The paper describes the algorithm used in Imago and reflects on its participation in the <strong>Image2Structure task at TREC-CHEM 2011</strong>. No quantitative results are reported; the &ldquo;Discussion&rdquo; section instead reflects on qualitative performance issues observed during the task, such as handling low resolution, noise, and connected atom labels.</p>
<h2 id="outcomes-limitations-and-future-directions">Outcomes, Limitations, and Future Directions</h2>
<p><strong>Release:</strong> The authors successfully released Imago under the GPLv3 license, including an API for developers. The toolkit outputs recognized structures in MDL Molfile format.</p>
<p><strong>Limitations Identified:</strong> The straightforward pipeline fails when images have low resolution (atom labels merge with bonds), high noise, or tight character spacing (symbols rendered without space pixels between them). Additionally, when few symbols are present, the average bond length estimate can have large error, causing atom symbols to be misidentified as bond chains.</p>
<p><strong>Future Directions:</strong> The authors propose moving from a linear pipeline to an &ldquo;optimization procedure&rdquo; that maximizes a confidence score, using probabilistic mapping of image primitives to chemical entities. They also argue that recognition programs should output a confidence score to enable automatic batch processing (only images with low confidence need manual review). They suggest a multi-pass workflow where each iteration adjusts parameters to improve the confidence level, and they note the additional challenge of separating molecule images from text in real articles and patents.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not specify a training dataset for the core logic (which appears heuristic-based), but references testing context:</p>
<ul>
<li><strong>Domain:</strong> Images from scientific articles and patents.</li>
<li><strong>Validation:</strong> TREC-CHEM 2011 Image2Structure task data.</li>
<li><strong>Databases:</strong> Mentions PubMed and PubChem as context for the type of data being indexed.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows a strict linear sequence:</p>
<ol>
<li>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><strong>Binarization:</strong> Threshold-based.</li>
<li><strong>Supersegmentation:</strong> Locates the chemical structure using a $15 \times 15$ window neighbor search.</li>
<li><strong>Filtering:</strong> Removes single-down stereo bonds (dashed triangles) early to prevent incorrect recognition of the small line segments during classification.</li>
</ul>
</li>
<li>
<p><strong>Separation (Symbols vs. Graphics):</strong></p>
<ul>
<li><strong>Heuristic:</strong> Estimates &ldquo;capital letter height&rdquo;.</li>
<li><strong>Criteria:</strong> Groups segments by height and aspect ratio range $[\text{MinSymRatio}, \text{MaxSymRatio}]$.</li>
</ul>
</li>
<li>
<p><strong>Skeleton Construction (Vectorization):</strong></p>
<ul>
<li><strong>Thinning:</strong> Uses neighborhood maps to reduce lines to 1-pixel thickness.</li>
<li><strong>De-crossing:</strong> Each black pixel with more than 2 black pixels in its 8-neighborhood becomes white, isolating polylines.</li>
<li><strong>Smoothing:</strong> Uses the <strong>Douglas-Peucker algorithm</strong>.</li>
<li><strong>Graph Adjustment:</strong> Merges close vertices and detects bond orders based on parallel edges.</li>
</ul>
</li>
<li>
<p><strong>Symbol Recognition:</strong></p>
<ul>
<li><strong>Grouping:</strong> Uses a <strong>Relative Neighborhood Graph</strong> to group characters into superatoms/labels.</li>
<li><strong>OCR:</strong> Classification based on <strong>Fourier descriptors</strong> of outer/inner contours.</li>
</ul>
</li>
<li>
<p><strong>Chemical Expansion:</strong></p>
<ul>
<li><strong>Abbreviation:</strong> Expands common groups (e.g., Ph, COOH) stored as SMILES notation, using the <strong>Indigo toolkit</strong> for 2D coordinate generation of the expanded structures.</li>
</ul>
</li>
</ol>
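<p>The de-crossing rule in step 3 is simple enough to state exactly. A sketch on a skeleton represented as a set of black-pixel coordinates:</p>

```python
def decross(pixels):
    """Imago's de-crossing rule on a 1-pixel-wide skeleton: any black pixel
    with more than two black 8-neighbors becomes white, splitting the
    skeleton at junctions into isolated polylines. pixels: set of (x, y)."""
    def neighbors(p):
        x, y = p
        # count black pixels among the 8 surrounding positions
        return sum((x + dx, y + dy) in pixels
                   for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0))
    return {p for p in pixels if neighbors(p) <= 2}
```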
<h3 id="models">Models</h3>
<ul>
<li><strong>Type:</strong> Heuristic-based computer vision pipeline; no learned deep learning weights mentioned.</li>
<li><strong>Stereo Recognition:</strong>
<ul>
<li><strong>Single Down:</strong> Identified as $k \ge 3$ parallel equidistant lines.</li>
<li><strong>Single Up:</strong> Identified by checking if a bond was a solid triangle before thinning.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics:</strong> None quantitatively reported in the text; discussion focuses on qualitative failure modes (low resolution, noise).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/epam/Imago">Imago GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0 (current); GPLv3 (as published)</td>
          <td>Official C++ implementation</td>
      </tr>
      <tr>
          <td><a href="https://lifescience.opensource.epam.com/imago/">Imago Project Page</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Documentation and downloads</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements:</strong> Designed to be lightweight and portable (mobile-device capable), written in C++. No specific GPU/TPU requirements.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Smolov, V., Zentsev, F., &amp; Rybalkin, M. (2011). Imago: Open-Source Toolkit for 2D Chemical Structure Image Recognition. <em>TREC-CHEM 2011</em>.</p>
<p><strong>Publication</strong>: TREC-CHEM 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/t20.proceedings.html">TREC-CHEM 2011 Proceedings</a></li>
<li><a href="https://lifescience.opensource.epam.com/imago/">Project Website</a></li>
<li><a href="https://github.com/epam/Imago">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{smolovImagoOpenSourceToolkit2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Imago: {{Open-Source Toolkit}} for {{2D Chemical Structure Image Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Smolov, Viktor and Zentsev, Fedor and Rybalkin, Mikhail}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{{GGA Software Services LLC}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">note</span> = <span style="color:#e6db74">{TREC-CHEM 2011}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CLiDE Pro: Optical Chemical Structure Recognition Tool</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</guid><description>A methodological paper presenting CLiDE Pro, an OCSR system for reconstructing chemical graphs from images with ~90% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Valko, A. T., &amp; Johnson, A. P. (2009). CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 49(4), 780-787. <a href="https://doi.org/10.1021/ci800449t">https://doi.org/10.1021/ci800449t</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2009</p>
<h2 id="contribution-robust-algorithmic-pipeline-for-ocsr">Contribution: Robust Algorithmic Pipeline for OCSR</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper, as it proposes a specific algorithmic architecture (CLiDE Pro) for converting raster images of chemical structures into connection tables. It details the procedural steps for segmentation, vectorization, and graph reconstruction.</p>
<p>It also has a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution, as the authors compile and release a validation set of 454 real-world images to serve as a community benchmark for OCSR systems.</p>
<h2 id="motivation-bridging-the-gap-between-legacy-document-images-and-machine-readable-chemistry">Motivation: Bridging the Gap Between Legacy Document Images and Machine-Readable Chemistry</h2>
<p>While modern chemical drawing software captures structural information explicitly, the vast majority of legacy and current chemical literature (journals, patents, reports) exists as images or PDF documents. These images are human-readable but lack the semantic &ldquo;connection table&rdquo; data required for chemical databases and software. Manual redrawing is time-consuming and error-prone. Therefore, there is a critical need for efficient Optical Chemical Structure Recognition (OCSR) systems to automate this extraction.</p>
<h2 id="novelty-integrated-document-segmentation-and-ambiguity-resolution-heuristics">Novelty: Integrated Document Segmentation and Ambiguity Resolution Heuristics</h2>
<p>CLiDE Pro introduces several algorithmic improvements over its predecessor (CLiDE) and contemporary tools:</p>
<ul>
<li><strong>Integrated Document Segmentation</strong>: Unlike page-oriented systems, it processes whole documents to link information across pages.</li>
<li><strong>Robust &ldquo;Difficult Feature&rdquo; Handling</strong>: It implements specific heuristic rules to resolve ambiguities in crossing bonds (bridged structures), which are often misinterpreted as carbon atoms in other systems.</li>
<li><strong>Generic Structure Interpretation</strong>: It includes a module to parse &ldquo;generic&rdquo; (Markush) structures by matching R-group labels in the diagram with text-based definitions found in the document.</li>
<li><strong>Ambiguity Resolution</strong>: It uses context-aware rules to distinguish between geometrically similar features, such as vertical lines representing bonds vs. the letter &rsquo;l&rsquo; in &lsquo;Cl&rsquo;.</li>
</ul>
<h2 id="methodology-and-benchmarking-on-real-world-data">Methodology and Benchmarking on Real-World Data</h2>
<p>The authors conducted a systematic validation on a dataset of <strong>454 images</strong> containing <strong>519 structure diagrams</strong>.</p>
<ul>
<li><strong>Source Material</strong>: Images were extracted from published materials (journals, patents), ensuring &ldquo;real artifacts&rdquo; like noise and scanning distortions were present.</li>
<li><strong>Automation</strong>: The test was fully automated without human intervention.</li>
<li><strong>Metrics</strong>: The primary metric was the &ldquo;success rate,&rdquo; defined as the correct reconstruction of the molecule&rsquo;s connection table. They also performed fine-grained error analysis on specific features (e.g., atom labels, dashed bonds, wavy bonds).</li>
</ul>
<h2 id="results-high-topological-accuracy-and-persistent-ocr-challenges">Results: High Topological Accuracy and Persistent OCR Challenges</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved an <strong>89.79%</strong> success rate (466/519 molecules correctly reconstructed).</li>
<li><strong>Robustness on Primitives</strong>: Solid straight bonds were recognized with 99.92% accuracy.</li>
<li><strong>Key Failure Modes</strong>: The majority of errors (58 cases) occurred in atom label construction, specifically when labels touched nearby bonds or other artifacts, causing OCR failures.</li>
<li><strong>Impact</strong>: The study demonstrated that handling &ldquo;difficult&rdquo; drawing features like crossing bonds and bridged structures significantly reduces topological errors. The authors released the test set to encourage standardized benchmarking in the OCSR field.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized a custom dataset designed to reflect real-world noise.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>CLiDE Pro Validation Set</td>
          <td>454 images (519 structures)</td>
          <td>Extracted from scanned journals and PDFs. Includes noise/artifacts. Available in Supporting Information.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The CLiDE Pro pipeline consists of five distinct phases. To replicate this system, one would need to implement:</p>
<ol>
<li>
<p><strong>Image Binarization</strong>:</p>
<ul>
<li>Input images are binarized using a threshold-based technique to separate foreground (molecule) from background.</li>
<li><strong>Connected Component Analysis (CCA)</strong>: A non-recursive scan identifies connected components (CCs) and generates interpixel contours (using N, S, E, W directions).</li>
</ul>
</li>
<li>
<p><strong>Document Segmentation</strong>:</p>
<ul>
<li><strong>Layout Analysis</strong>: Uses a bottom-up approach building a tree structure. It treats CCs as graph vertices and distances as edges.</li>
<li><strong>Clustering</strong>: A minimal-cost spanning tree (Kruskal&rsquo;s algorithm) groups CCs into words, lines, and blocks.</li>
<li><strong>Classification</strong>: CCs are classified (Character, Dash, Line, Graphics, Noise) based on size thresholds derived from statistical image analysis.</li>
</ul>
</li>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li><strong>Contour Approximation</strong>: Uses a method similar to <strong>Sklansky and Gonzalez</strong> to approximate contours into polygons.</li>
<li><strong>Vector Formation</strong>: Long polygon sides become straight lines; short consecutive sides become curves. Opposing borders of a line are matched to define the bond vector.</li>
<li><strong>Wavy Bonds</strong>: Detected by finding groups of short vectors lying on a straight line.</li>
<li><strong>Dashed Bonds</strong>: Detected using the <strong>Hough transform</strong> to find collinear or parallel dashes.</li>
</ul>
</li>
<li>
<p><strong>Atom Label Construction</strong>:</p>
<ul>
<li><strong>OCR</strong>: An OCR engine (filtering + topological analysis) interprets characters.</li>
<li><strong>Grouping</strong>: Characters are grouped into words based on horizontal and vertical proximity (for vertical labels).</li>
<li><strong>Superatom Lookup</strong>: Labels are matched against a database of elements, functional groups, and R-groups. Unknown linear formulas (e.g., $\text{CH}_2\text{CH}_2\text{OH}$) are parsed.</li>
</ul>
</li>
<li>
<p><strong>Graph Reconstruction</strong>:</p>
<ul>
<li><strong>Connection Logic</strong>: Bond endpoints are joined to atoms if they are within a distance threshold and &ldquo;point toward&rdquo; the label.</li>
<li><strong>Implicit Carbons</strong>: Unconnected bond ends are joined if close; parallel bonds merge into double/triple bonds.</li>
<li><strong>Crossing Bonds</strong>: Rules check proximity, length, and ring membership to determine if crossing lines are valid atoms or 3D visual artifacts.</li>
</ul>
</li>
<li>
<p><strong>Generic Structure Interpretation</strong>:</p>
<ul>
<li><strong>Text Mining</strong>: A lexical/syntactic analyzer extracts R-group definitions (e.g., &ldquo;R = Me or H&rdquo;) from text blocks.</li>
<li><strong>Matching</strong>: The system attempts to match R-group labels in the diagram with the parsed text definitions.</li>
</ul>
</li>
</ol>
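<p>The phase-5 connection logic can be sketched as a simple geometric test. This is a minimal illustration with simplified 2D primitives; the function name, thresholds, and the exact &ldquo;points toward&rdquo; test are ours, not CLiDE Pro&rsquo;s implementation:</p>

```python
import math

def attach_bonds(bonds, labels, dist_thresh, angle_thresh_deg=30.0):
    """Join a bond endpoint to an atom label when the endpoint lies within
    dist_thresh of the label centre AND the bond direction points toward it.
    bonds: list of ((x1, y1), (x2, y2)) with nonzero length;
    labels: list of (name, (cx, cy)).
    Returns (bond_index, endpoint_index, label_name) attachments."""
    attachments = []
    for bi, (p, q) in enumerate(bonds):
        for end_idx, (tip, tail) in enumerate([(p, q), (q, p)]):
            for name, c in labels:
                d = math.dist(tip, c)
                if d > dist_thresh or d == 0.0:
                    continue
                # Bond direction at this endpoint (tail toward tip) ...
                bx, by = tip[0] - tail[0], tip[1] - tail[1]
                # ... versus the direction from the tip to the label centre.
                lx, ly = c[0] - tip[0], c[1] - tip[1]
                cos = (bx * lx + by * ly) / (math.hypot(bx, by) * d)
                ang = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
                if ang > angle_thresh_deg:
                    continue
                attachments.append((bi, end_idx, name))
    return attachments
```

<p>Endpoints left unattached by a pass like this would then fall to the implicit-carbon and crossing-bond rules.</p>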
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Engine</strong>: The system relies on a customized OCR engine capable of handling rotation and chemical symbols, though the specific architecture (neural vs. feature-based) is not detailed beyond &ldquo;topological and geometrical feature analysis&rdquo;.</li>
<li><strong>Superatom Database</strong>: A lookup table containing elements, common functional groups, and R-group labels.</li>
</ul>
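<p>The superatom lookup and linear-formula parsing from phase 4 can be sketched as follows; the table entries (including the SMILES values) and the regex-based parser are hypothetical stand-ins for the paper&rsquo;s database and parser:</p>

```python
import re

# Hypothetical miniature stand-in for CLiDE Pro's superatom database;
# the SMILES expansions are illustrative only.
SUPERATOMS = {"Ph": "c1ccccc1", "Me": "C", "OMe": "OC", "Ac": "C(C)=O"}

TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def expand_label(label):
    """Resolve an atom label: known superatoms come from the lookup table;
    anything else is parsed as a linear formula into (element, count) pairs.
    Returns None if the label cannot be parsed."""
    if label in SUPERATOMS:
        return SUPERATOMS[label]
    atoms = []
    pos = 0
    for m in TOKEN.finditer(label):
        if m.start() != pos:      # unparseable gap: reject the label
            return None
        pos = m.end()
        atoms.append((m.group(1), int(m.group(2) or "1")))
    if pos != len(label):
        return None
    return atoms
```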
<h3 id="evaluation">Evaluation</h3>
<p>The evaluation focused on the topological correctness of the output.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total Success Rate</strong></td>
          <td>89.79%</td>
          <td>466/519 structures perfectly reconstructed.</td>
      </tr>
      <tr>
          <td><strong>Atom Label Accuracy</strong></td>
          <td>98.54%</td>
          <td>3923/3981 labels correct. Main error source: labels touching bonds.</td>
      </tr>
      <tr>
          <td><strong>Solid Bond Accuracy</strong></td>
<td>99.92%</td>
          <td>16061/16074 solid bonds correct.</td>
      </tr>
      <tr>
          <td><strong>Dashed Bond Accuracy</strong></td>
          <td>98.37%</td>
          <td>303/308 dashed bonds correct.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; described as efficient.</li>
<li><strong>Performance</strong>: The system processed the complex Palytoxin structure &ldquo;within a few seconds&rdquo;. This implies low computational overhead suitable for standard desktop hardware of the 2009 era.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{valkoCLiDEProLatest2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Valko, Aniko T. and Johnson, A. Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{780--787}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800449t}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInk: Real-Time Recognition for Chemical Drawings</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</guid><description>A sketch recognition framework for chemical diagrams using a joint CRF model to combine multi-level visual features for real-time interpretation.</description><content:encoded><![CDATA[<h2 id="contribution-real-time-sketch-recognition-method">Contribution: Real-Time Sketch Recognition Method</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for sketch recognition that integrates visual features at three distinct levels (inkpoints, segments, symbols) into a single probabilistic model. The rhetorical structure centers on the proposal of this new architecture, the introduction of a specific &ldquo;trainable corner detector&rdquo; algorithm, and the validation of these methods against existing benchmarks and an alternative authoring tool (ChemDraw).</p>
<h2 id="motivation-bridging-the-gap-between-sketching-and-cad">Motivation: Bridging the Gap Between Sketching and CAD</h2>
<p>The primary motivation is to bridge the gap between the natural, efficient process of drawing chemical diagrams by hand and the cumbersome &ldquo;point-click-and-drag&rdquo; interactions required by CAD tools like ChemDraw. While chemists prefer sketching for communication, existing digital tools do not offer the same speed or ease of use. The goal is to build an intelligent system that understands freehand sketches in real-time, converting them into structured data suitable for analysis or search.</p>
<h2 id="core-innovation-hierarchical-joint-crf-model">Core Innovation: Hierarchical Joint CRF Model</h2>
<p>The core novelty lies in the <strong>hierarchical joint model</strong>. Unlike previous approaches that might treat stroke segmentation and symbol recognition as separate, isolated steps, ChemInk uses a <strong>Conditional Random Field (CRF)</strong> to jointly model dependencies across three levels:</p>
<ol>
<li><strong>Inkpoints</strong>: Local visual appearance.</li>
<li><strong>Segments</strong>: Stroke fragments separated by corners.</li>
<li><strong>Candidates</strong>: Potential symbol groupings.</li>
</ol>
<p>Additionally, the paper introduces a <strong>trainable corner detector</strong> that learns domain-specific corner definitions from data.</p>
<h2 id="experimental-design-and-baselines">Experimental Design and Baselines</h2>
<p>The authors conducted two primary evaluations:</p>
<ol>
<li><strong>Off-line Accuracy Evaluation</strong>:
<ul>
<li><strong>Dataset</strong>: 12 real-world organic compounds drawn by 10 participants.</li>
<li><strong>Metric</strong>: Recognition accuracy (Recall and Precision).</li>
<li><strong>Baseline</strong>: Comparison against their own previous work (O&amp;D 2009) and ablations (with/without context).</li>
</ul>
</li>
<li><strong>On-line User Study</strong>:
<ul>
<li><strong>Task</strong>: 9 participants (chemistry students) drew 5 diagrams using both ChemInk (Tablet PC) and ChemDraw (Mouse/Keyboard).</li>
<li><strong>Metric</strong>: Time to completion and subjective user ratings (speed/ease of use).</li>
</ul>
</li>
</ol>
<h2 id="results-accuracy-and-user-study-outcomes">Results: Accuracy and User Study Outcomes</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>97.4% symbol recognition accuracy</strong>, slightly outperforming the best prior result (97.1%). The trainable corner detector achieved <strong>99.91% recall</strong>.</li>
<li><strong>Speed</strong>: Users were <strong>twice as fast</strong> using ChemInk (avg. 36s) compared to ChemDraw (avg. 79s).</li>
<li><strong>Usability</strong>: Participants rated ChemInk significantly higher for speed (6.3 vs 4.5) and ease of use (6.3 vs 4.7) on a 7-point scale.</li>
<li><strong>Conclusion</strong>: Sketch recognition is a viable, superior alternative to standard CAD tools for authoring chemical diagrams.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: 12 real-world organic compounds (e.g., Aspirin, Penicillin) drawn by 10 participants (organic chemistry familiar).</li>
<li><strong>Evaluation Split</strong>: User-independent cross-validation (training on 9 users, testing on 1).</li>
<li><strong>Input</strong>: Raw digital ink (strokes) collected on a Tablet PC.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Corner Detection (Trainable)</strong></p>
<ul>
<li><strong>Method</strong>: Iterative vertex elimination.</li>
<li><strong>Cost Function</strong>: $cost(p_{i}) = \sqrt{mse(s_{i}; p_{i-1}, p_{i+1})} \cdot dist(p_{i}; p_{i-1}, p_{i+1})$</li>
<li><strong>Procedure</strong>: Repeatedly remove the vertex with the lowest cost until the classifier (trained on features like cost, diagonal length, ink density) predicts the remaining vertices are corners.</li>
</ul>
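<p>The elimination loop follows directly from the cost function above. In this sketch a fixed cost threshold stands in for the paper&rsquo;s trained classifier (which also uses features like diagonal length and ink density), and the mse is approximated over the three vertices involved:</p>

```python
import math

def point_line_dist(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    L = math.hypot(bx - ax, by - ay)
    if L == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / L

def corner_cost(pts, i):
    """cost(p_i) = sqrt(mse(s_i; p_(i-1), p_(i+1))) * dist(p_i; p_(i-1), p_(i+1)),
    with the mse approximated over the three vertices themselves."""
    a, b = pts[i - 1], pts[i + 1]
    seg = pts[i - 1 : i + 2]
    mse = sum(point_line_dist(q, a, b) ** 2 for q in seg) / len(seg)
    return math.sqrt(mse) * point_line_dist(pts[i], a, b)

def find_corners(pts, threshold):
    """Iteratively remove the cheapest interior vertex until every remaining
    vertex costs at least `threshold` (a fixed cutoff standing in for the
    learned classifier). Endpoints are always kept."""
    pts = list(pts)
    while len(pts) > 2:
        costs = [corner_cost(pts, i) for i in range(1, len(pts) - 1)]
        i_min = min(range(len(costs)), key=costs.__getitem__)
        if costs[i_min] >= threshold:
            break
        del pts[i_min + 1]
    return pts
```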
<p><strong>2. Feature Extraction</strong></p>
<ul>
<li><strong>Inkpoints</strong>: Sampled at regular intervals. Features = $10 \times 10$ pixel orientation filters (0, 45, 90, 135 degrees) at two scales ($L/2$, $L$), smoothed and downsampled to $5 \times 5$. Total 400 features.</li>
<li><strong>Segments</strong>: Similar image features centered at segment midpoint, plus geometric features (length, ink density).</li>
<li><strong>Candidates</strong>: 5 feature images ($20 \times 20$) including an &ldquo;endpoint&rdquo; image, stretched to normalize aspect ratio.</li>
<li><strong>Dimensionality Reduction</strong>: PCA used to compress feature images to 256 components.</li>
</ul>
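<p>A much-simplified sketch of the inkpoint features: instead of smoothed 10&nbsp;&times;&nbsp;10 filter images at two scales (400 features), this toy version bins local stroke direction into four orientation channels (0, 45, 90, 135 degrees) on a single 5&nbsp;&times;&nbsp;5 grid around the inkpoint:</p>

```python
import math

def inkpoint_features(stroke, center, window, grid=5):
    """Toy orientation-channel features for one inkpoint: bin the direction
    of nearby stroke segments into four channels (0/45/90/135 degrees) on a
    grid x grid image centred on `center`, weighting by ink amount.
    Returns 4 * grid * grid values (the real system uses 400 features)."""
    half = window / 2.0
    cell = window / grid
    feats = [[[0.0] * grid for _ in range(grid)] for _ in range(4)]
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        # Skip stroke segments outside the window around the inkpoint.
        if abs(mx - center[0]) >= half or abs(my - center[1]) >= half:
            continue
        col = int((mx - center[0] + half) / cell)
        row = int((my - center[1] + half) / cell)
        ang = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 180.0
        chan = int(((ang + 22.5) % 180.0) / 45.0)   # nearest of 0/45/90/135
        feats[chan][row][col] += math.hypot(x1 - x0, y1 - y0)  # ink amount
    return [v for plane in feats for rowv in plane for v in rowv]
```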
<p><strong>3. Structure Generation</strong></p>
<ul>
<li><strong>Clustering</strong>: Agglomerative clustering with a complete-link metric to connect symbols.</li>
<li><strong>Threshold</strong>: Stop clustering at distance $0.4L$.</li>
</ul>
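<p>The clustering step can be sketched as a naive complete-link agglomeration; the stopping threshold is passed in directly (the paper uses 0.4L, with L a characteristic symbol size):</p>

```python
import math

def cluster_symbols(symbols, max_dist):
    """Agglomerative clustering with a complete-link metric: two clusters
    merge only when EVERY cross-pair of members is within max_dist. Symbols
    are (x, y) centroids here; the real system links recognised symbols."""
    def complete_link(a, b):
        return max(math.dist(p, q) for p in a for q in b)
    clusters = [[s] for s in symbols]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j])
                if d > max_dist:
                    continue
                if best is None or best[0] > d:
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i].extend(clusters[j])   # always merge the closest pair
        del clusters[j]
    return clusters
```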
<h3 id="models">Models</h3>
<p><strong>Conditional Random Field (CRF)</strong></p>
<ul>
<li><strong>Structure</strong>: 3-level hierarchy (Inkpoints $V_p$, Segments $V_s$, Candidates $V_c$).</li>
<li><strong>Nodes</strong>:
<ul>
<li>$V_p, V_s$ labels: &ldquo;bond&rdquo;, &ldquo;hash&rdquo;, &ldquo;wedge&rdquo;, &ldquo;text&rdquo;.</li>
<li>$V_c$ labels: specific candidate interpretations.</li>
</ul>
</li>
<li><strong>Edges/Potentials</strong>:
<ul>
<li><strong>Entity-Feature</strong>: $\phi(y, x)$ (Linear classifier).</li>
<li><strong>Consistency</strong>: $\psi(y_i, y_j)$ (Hard constraint: child must match parent label).</li>
<li><strong>Spatial Context</strong>: $\psi_{ss}(y_i, y_j)$ (Pairwise geometric relationships between segments: angle, distance).</li>
<li><strong>Overlap</strong>: Prevents conflicting candidates from sharing segments.</li>
</ul>
</li>
<li><strong>Inference</strong>: Loopy Belief Propagation (up to 100 iterations).</li>
<li><strong>Training</strong>: Maximum Likelihood via gradient ascent (L-BFGS).</li>
</ul>
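<p>How the potentials combine can be illustrated for a single candidate. Because the consistency potential is a hard constraint, a candidate&rsquo;s member segments must all carry its label, so its score reduces to the segments&rsquo; unary scores plus spatial-context terms; the exhaustive argmax below stands in for loopy belief propagation, which is needed once candidates overlap and share segments:</p>

```python
LABELS = ("bond", "hash", "wedge", "text")

def candidate_score(unary, pairwise, segs, label):
    """Score one candidate under the hard consistency constraint: sum the
    member segments' unary scores for the candidate's label, plus any
    spatial-context terms between member pairs. unary: {seg: {label: score}};
    pairwise: {(seg_a, seg_b, label): score}. Names are illustrative."""
    score = sum(unary[s][label] for s in segs)
    for i, a in enumerate(segs):
        for b in segs[i + 1:]:
            score += pairwise.get((a, b, label), 0.0)
    return score

def best_interpretation(unary, pairwise, segs):
    """Pick the label maximising the joint score (exhaustive search rather
    than the paper's loopy belief propagation)."""
    return max(LABELS, key=lambda lab: candidate_score(unary, pairwise, segs, lab))
```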
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Accuracy (Recall/Precision) of symbol detection.</li>
<li><strong>Comparison</strong>: Compared against Ouyang &amp; Davis 2009 (previous SOTA).</li>
<li><strong>Speed Metric</strong>: Wall-clock time for diagram creation (ChemInk vs. ChemDraw).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Processor</strong>: 3.7 GHz processor (single thread) for base benchmarking (approx. 1 sec/sketch).</li>
<li><strong>Deployment</strong>: Validated on 1.8 GHz Tablet PCs using multi-core parallelization for real-time feedback.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2011). ChemInk: A Natural Real-Time Recognition System for Chemical Drawings. <em>Proceedings of the 16th International Conference on Intelligent User Interfaces</em>, 267&ndash;276. <a href="https://doi.org/10.1145/1943403.1943444">https://doi.org/10.1145/1943403.1943444</a></p>
<p><strong>Publication</strong>: IUI &lsquo;11</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyangChemInkNaturalRealtime2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemInk: A Natural Real-Time Recognition System for Chemical Drawings}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{ChemInk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 16th International Conference on Intelligent User Interfaces}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Tom Y. and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{267--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Palo Alto, CA, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/1943403.1943444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4503-0419-1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{http://hdl.handle.net/1721.1/78898}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Recognition (Rule-Based)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</guid><description>A strictly rule-based expert system (MolRec) for converting raster chemical diagrams into graph representations.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). Chemical structure recognition: A rule based approach. <em>Proceedings of SPIE</em>, 8297, 82970E. <a href="https://doi.org/10.1117/12.912185">https://doi.org/10.1117/12.912185</a></p>
<p><strong>Publication</strong>: IS&amp;T/SPIE Electronic Imaging 2012</p>
<h2 id="methodological-contribution">Methodological Contribution</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper proposes a novel mechanism (MolRec) for Optical Chemical Structure Recognition (OCSR). It focuses on defining a &ldquo;strictly rule based system&rdquo; to transform vectorised molecule images into graph representations, contrasting this declarative approach with procedural or heuristic-heavy methods. The contribution is validated through direct comparison with the leading open-source tool (OSRA).</p>
<h2 id="motivation-overcoming-procedural-heuristics">Motivation: Overcoming Procedural Heuristics</h2>
<p>Chemical literature contains vast amounts of information locked in 2D diagrams. This visual data is generally inaccessible to search tools or electronic processing. While commercial and academic tools existed (e.g., OSRA, Kekulé), they typically relied on procedural heuristics that required experimental tuning and were difficult to extend. The authors sought to create a system based on precise, declarative rewrite rules to handle the ambiguity inherent in chemical drawing conventions.</p>
<h2 id="core-innovation-geometric-rewrite-rules">Core Innovation: Geometric Rewrite Rules</h2>
<p>The core novelty is the <strong>geometric rewrite rule system</strong> (MolRec).</p>
<ul>
<li><strong>Geometric Primitives</strong>: The system operates on five high-level primitives: Line Segment, Arrow, Circle, Triangle, and Character Group.</li>
<li><strong>Fuzzy Parameters</strong>: It introduces formal definitions for &ldquo;fuzzy&rdquo; relationships (e.g., <code>dash-neighbouring</code>, <code>approximate collinearity</code>) to handle drawing irregularities and scanning artifacts.</li>
<li><strong>Ambiguity Resolution</strong>: Specific rules (R4-R6) are designed to disambiguate visual homoglyphs, such as distinguishing a &ldquo;triple bond&rdquo; from a &ldquo;dashed bold bond&rdquo; based on context (connected atoms).</li>
<li><strong>Explicit &ldquo;Cutting&rdquo;</strong>: A mechanism to identify implicit carbon nodes within continuous line segments (e.g., splitting a long line intersected by parallel lines into a double bond).</li>
</ul>
<h2 id="experimental-setup-vs-baselines">Experimental Setup vs. Baselines</h2>
<p>The authors compared their system (MolRec) against <strong>OSRA</strong> (the leading open-source system) on two datasets:</p>
<ol>
<li><strong>OSRA Benchmark</strong>: 5,735 computer-generated diagrams with ground truth MOL files.</li>
<li><strong>Maybridge Dataset</strong>: 5,730 scanned images (300 dpi) from a drug catalogue, converted to ground truth MOL files via InChI lookups.</li>
</ol>
<p>Evaluation was semantic: The output MOL files were compared using OpenBabel to check for structural equivalence, ignoring syntactic file differences.</p>
<h2 id="results-and-key-findings">Results and Key Findings</h2>
<p><strong>MolRec outperformed OSRA</strong> on both datasets:</p>
<ul>
<li><strong>OSRA Benchmark</strong>: MolRec achieved <strong>88.46%</strong> accuracy vs. OSRA&rsquo;s 77.23%.</li>
<li><strong>Maybridge Dataset</strong>: MolRec achieved <strong>83.84%</strong> accuracy vs. OSRA&rsquo;s 72.57%.</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>Robustness</strong>: The line thinning + Douglas-Peucker vectorization approach was found to be more robust than Hough transform approaches used by other tools.</li>
<li><strong>Failure Modes</strong>: Major remaining errors were caused by &ldquo;touching components&rdquo; (ligatures, characters touching bonds) and complex &ldquo;superatoms&rdquo; (abbreviations like &ldquo;-Ph&rdquo; or &ldquo;-COOH&rdquo;) with ambiguous connection points.</li>
<li><strong>Triangle Detection</strong>: The &ldquo;expanding disc&rdquo; method for identifying wedge bonds was highly effective.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>Two distinct datasets were used for validation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA Benchmark</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">5,735</td>
          <td style="text-align: left">Computer-generated diagrams provided by the OSRA project.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">5,730</td>
<td style="text-align: left">Scanned at 300 dpi from the Maybridge drug catalogue. Ground truth generated via CAS Registry Number $\to$ <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> $\to$ OpenBabel.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline consists of three stages: <strong>Vectorization</strong>, <strong>Geometric Processing</strong>, and <strong>Rule Application</strong>.</p>
<p><strong>1. Vectorization &amp; Primitives</strong></p>
<ul>
<li><strong>Binarization &amp; OCR</strong>: Connected components are labelled and passed to an OCR engine to extract &ldquo;Character Groups&rdquo;.</li>
<li><strong>Thinning</strong>: Image is thinned to unit width.</li>
<li><strong>Simplification</strong>: Douglas-Peucker algorithm converts pixel paths into straight <strong>Line Segments</strong>.</li>
<li><strong>Triangle Detection</strong>: A disc growing algorithm walks inside black regions to identify <strong>Triangles</strong> (wedges). If the disc cannot grow, it is a thick line (Bold Bond).</li>
</ul>
<p><strong>2. Fuzzy Parameters</strong></p>
<p>The rules rely on tolerating drawing imperfections using defined parameters:</p>
<ul>
<li>$r_e$: Radius of collinearity (strict).</li>
<li>$d_l$ / $d_s$: Dash length / Dash separation (fuzzy).</li>
<li>$bdl$ / $bdw$: Bold dash length / width (fuzzy).</li>
<li>$bs$: Bond separation (max distance between parallel bonds).</li>
<li>$ol$: Minimal overlap.</li>
</ul>
<p><strong>3. The Rule System (R1-R18)</strong></p>
<p>The core logic uses 18 mutual-exclusion rules to rewrite geometric primitives into chemical graph edges.</p>
<ul>
<li><strong>Planar Bonds</strong>:
<ul>
<li><strong>R1-R3 (Single/Double/Triple)</strong>: Identifies parallel lines based on <code>bs</code> and <code>ol</code>. Uses &ldquo;cutting&rdquo; to split lines at implicit nodes.</li>
</ul>
</li>
<li><strong>Ambiguity Resolution (Stereo vs. Planar)</strong>:
<ul>
<li><strong>R4 (Dashed Bold vs. Triple)</strong>: Checks context. If purely geometric measures match both, it defaults to Triple unless specific dash constraints are met.</li>
<li><strong>R5 (Dashed Wedge vs. Triple)</strong>: Similar disambiguation based on length monotonicity.</li>
<li><strong>R6 (Dashed Wedge vs. Double)</strong>: Differentiates based on line length differences ($l_1 &gt; l_2$).</li>
</ul>
</li>
<li><strong>Stereo Bonds</strong>:
<ul>
<li><strong>R7-R9 (Dashed Types)</strong>: Identifies collinear segments with specific neighbor patterns (1 neighbor for ends, 2 for internal).</li>
<li><strong>R10-R11 (Hollow Wedge)</strong>: Detects triangles formed by 3 or 4 lines.</li>
<li><strong>R14 (Solid Wedge)</strong>: Direct mapping from Triangle primitive.</li>
</ul>
</li>
<li><strong>Special Structures</strong>:
<ul>
<li><strong>R12 (Wavy Bond)</strong>: Zig-zag line segments.</li>
<li><strong>R13 (Arrow)</strong>: Dative bond.</li>
<li><strong>R16 (Aromatic Ring)</strong>: Circle inside a cycle of &gt;5 lines.</li>
<li><strong>R17-R18 (Bridge Bonds)</strong>: Handles 2.5D crossing bonds (open or closed gaps).</li>
</ul>
</li>
</ul>
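<p>The parallelism test underlying R1&ndash;R3 can be sketched as follows; the angle tolerance and helper names are illustrative, and the real rules additionally perform the cutting step on longer lines:</p>

```python
import math

def double_bond(seg_a, seg_b, bs, ol, ang_tol_deg=5.0):
    """Two nonzero-length line segments form a double bond when they are
    near-parallel, separated by at most the bond separation bs, and their
    projections onto a shared axis overlap by at least ol."""
    def angle(seg):
        (ax, ay), (bx, by) = seg
        return math.degrees(math.atan2(by - ay, bx - ax)) % 180.0
    diff = abs(angle(seg_a) - angle(seg_b))
    if min(diff, 180.0 - diff) > ang_tol_deg:
        return False
    # Project both segments onto seg_a's direction and measure overlap.
    (ax, ay), (bx, by) = seg_a
    ux, uy = bx - ax, by - ay
    L = math.hypot(ux, uy)
    ux, uy = ux / L, uy / L
    def span(seg):
        ts = [(p[0] - ax) * ux + (p[1] - ay) * uy for p in seg]
        return min(ts), max(ts)
    lo_a, hi_a = span(seg_a)
    lo_b, hi_b = span(seg_b)
    overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
    if ol > overlap:
        return False
    # Perpendicular separation of seg_b's midpoint from seg_a's line.
    mx = (seg_b[0][0] + seg_b[1][0]) / 2.0
    my = (seg_b[0][1] + seg_b[1][1]) / 2.0
    sep = abs((mx - ax) * uy - (my - ay) * ux)
    return bs >= sep
```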
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Semantic graph matching. The output MOL file is compared to the ground truth MOL file using OpenBabel. Success = correct graph isomorphism.</p>
<p><strong>Results Table</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">System</th>
          <th style="text-align: left">Success Rate</th>
          <th style="text-align: left">Fail Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>88.46%</strong></td>
          <td style="text-align: left">11.54%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">77.23%</td>
          <td style="text-align: left">22.77%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>83.84%</strong></td>
          <td style="text-align: left">16.16%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">72.57%</td>
          <td style="text-align: left">27.43%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Requirements not specified, but the approach (vectorization + rule matching) is computationally lightweight compared to modern deep learning methods.</li>
</ul>
]]></content:encoded></item><item><title>Reconstruction of Chemical Molecules from Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</guid><description>A 5-module system converting raster images of chemical structures into machine-readable SDF files with custom vectorization.</description><content:encoded><![CDATA[<h2 id="methodological-basis">Methodological Basis</h2>
<p>This paper is a clear methodological contribution describing a novel system architecture. It proposes a five-stage pipeline to solve a specific engineering problem: converting rasterized chemical images into structured chemical files (SDF). The authors validate the method by benchmarking it against a commercial product (CLIDE) and analyzing performance across multiple databases.</p>
<h2 id="the-inaccessibility-of-raster-chemical-images">The Inaccessibility of Raster Chemical Images</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A massive amount of chemical knowledge (scientific articles, patents) exists only as raster images, rendering it inaccessible to computational analysis.</li>
<li><strong>Inefficiency of Manual Entry</strong>: Manual replication of molecules into CAD programs is the standard but unscalable solution for extracting this information.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous academic and commercial attempts (early 90s systems like CLIDE) had faded or remained limited in robustness, leaving the problem &ldquo;wide open&rdquo;.</li>
</ul>
<h2 id="topology-preserving-chemical-vectorization">Topology-Preserving Chemical Vectorization</h2>
<p>The core novelty is the <strong>topology-preserving vectorization</strong> strategy designed specifically for chemical graphs.</p>
<ul>
<li><strong>Graph-Centric Vectorizer</strong>: This system prioritizes graph characteristics over the pixel precision of traditional CAD vectorizers, ensuring one line in the image becomes exactly one vector, regardless of line width or vertex thickness.</li>
<li><strong>Chemical Knowledge Module</strong>: The inclusion of a final validation step that applies chemical rules (valence, charge) to detect and potentially correct reconstruction errors.</li>
<li><strong>Hybrid Recognition</strong>: The separation of the pipeline into a &ldquo;Body&rdquo; path (vectorizer for bonds) and an &ldquo;OCR&rdquo; path (SVM for atomic symbols), which are re-integrated in a reconstruction phase.</li>
</ul>
<h2 id="validating-reconstruction-accuracy">Validating Reconstruction Accuracy</h2>
<p>The authors performed a quantitative validation using <strong>ground-truth SDF files</strong> to verify reconstruction accuracy. The success rate metric evaluated whether the reconstructed graph perfectly matched the true SDF:</p>
<p>$$ \text{Accuracy} = \frac{\text{Correctly Reconstructed SDFs}}{\text{Total Images Evaluated}} $$</p>
<ul>
<li><strong>Baselines</strong>: The system was benchmarked against the commercial software <strong>CLIDE</strong> on &ldquo;Database 1&rdquo;.</li>
<li><strong>Datasets</strong>: Three distinct databases were used:
<ul>
<li><strong>Database 1</strong>: 100 images (varied fonts/line widths).</li>
<li><strong>Database 2</strong>: 100 images.</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale test).</li>
</ul>
</li>
</ul>
<h2 id="system-performance-and-scalability">System Performance and Scalability</h2>
<ul>
<li><strong>Superior Performance</strong>: On Database 1, the proposed system correctly reconstructed <strong>97%</strong> of images, whereas the commercial CLIDE system only reconstructed <strong>25%</strong> (after parameter tuning).</li>
<li><strong>Scalability</strong>: The system maintained reasonable performance on the large dataset (Database 3), achieving <strong>67%</strong> accuracy.</li>
<li><strong>Robustness</strong>: The system can handle varying fonts and line widths via parameterization.</li>
<li><strong>Future Work</strong>: The authors plan to implement a feedback loop where the Chemical Knowledge Module can send error signals back to earlier modules to correct inconsistencies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Reproducible (Paywalled paper, no public code or data).</p>
<h3 id="data">Data</h3>
<p>The paper utilizes three databases for validation. The authors note that for these images, the correct SDF files were already available, allowing for direct automated checking.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Varied line widths, fonts, symbols; used for CLIDE comparison.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>General chemical database.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale database.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The system is composed of five distinct modules executed in sequence:</p>
<p><strong>1. Binarization &amp; Segmentation</strong></p>
<ul>
<li><strong>Preprocessing</strong>: Removal of anti-aliasing effects followed by <strong>adaptive histogram binarization</strong>.</li>
<li><strong>Connected Components</strong>: A non-recursive raster-scan algorithm identifies connected Run-Length Encoded (RLE) segments.</li>
</ul>
<p><strong>2. Optical Character Recognition (OCR)</strong></p>
<ul>
<li><strong>Feature Extraction</strong>: Uses functions similar to <strong>Zernike moments</strong> and a <strong>wavelet transform strategy</strong>.</li>
<li><strong>Classification</strong>: Identifies isolated characters/symbols and separates them from the molecular &ldquo;body&rdquo;.</li>
</ul>
<p><strong>3. Vectorizer</strong></p>
<ul>
<li><strong>Logic</strong>: Assigns local directions to RLE segments based on neighbors, then groups segments with similar local direction patterns.</li>
<li><strong>Constraint</strong>: Enforces a 1-to-1 mapping between visual lines and graph vectors to prevent spurious small vectors at thick joints.</li>
</ul>
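<p>The 1-to-1 line/vector constraint can be sketched as a merge pass over chained vectors. This is an assumption-laden simplification (axis-aligned chains, illustrative tolerance), not the paper's algorithm:</p>

```python
import math

def merge_collinear(vectors, angle_tol=0.1):
    """Sketch of the 1-to-1 line/vector constraint: successive vectors whose
    directions agree within `angle_tol` radians and share an endpoint are
    fused into a single vector, so a thick or noisy line does not yield
    spurious small fragments. Vectors are (x0, y0, x1, y1) tuples."""
    merged = []
    for (x0, y0, x1, y1) in vectors:
        if merged:
            px0, py0, px1, py1 = merged[-1]
            a_prev = math.atan2(py1 - py0, px1 - px0)
            a_cur = math.atan2(y1 - y0, x1 - x0)
            # Fuse if the new vector continues the previous one.
            if (px1, py1) == (x0, y0) and abs(a_cur - a_prev) < angle_tol:
                merged[-1] = (px0, py0, x1, y1)
                continue
        merged.append((x0, y0, x1, y1))
    return merged
```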
<p><strong>4. Reconstruction (Heuristics)</strong></p>
<p>This module annotates vectors with chemical significance:</p>
<ul>
<li><strong>Chiral Bonds (Wedges)</strong>: Identified by registering vectors against original pixel density. If a vector corresponds to a thick geometric form (triangle/rectangle), it is labeled chiral.</li>
<li><strong>Dotted Chiral Bonds</strong>: Identified by clustering isolated vectors (no neighbors) using <strong>quadtree clustering</strong> on geometric centers. Coherent parallel clusters are fused into a single bond.</li>
<li><strong>Double/Triple Bonds</strong>: Detected by checking for parallel vectors within a <strong>Region of Interest (ROI)</strong> defined as the vector&rsquo;s bounding box <strong>dilated by a factor of 2</strong>.</li>
<li><strong>Superatoms</strong>: OCR results are clustered by dilating bounding boxes; overlapping boxes are grouped into names (e.g., &ldquo;COOH&rdquo;).</li>
</ul>
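<p>The double/triple-bond heuristic can be sketched as a parallel-vector search inside a dilated region of interest. Padding the box by half the vector length is an illustrative reading of the "dilated by a factor of 2" rule, since the paper does not spell out how degenerate (axis-aligned) boxes are handled:</p>

```python
import math

def roi(v, factor=2.0):
    """Dilated bounding box of vector v = (x0, y0, x1, y1). The paper
    dilates by a factor of 2; padding every side by half the vector
    length is an illustrative interpretation of that rule."""
    x0, y0, x1, y1 = v
    pad = math.hypot(x1 - x0, y1 - y0) * (factor - 1) / 2
    return (min(x0, x1) - pad, min(y0, y1) - pad,
            max(x0, x1) + pad, max(y0, y1) + pad)

def is_multiple_bond(v, others, angle_tol=0.1):
    """Count vectors roughly parallel to v whose midpoints fall inside
    v's ROI; one hit suggests a double bond, two a triple bond.
    (Angle wrap-around near pi is ignored for brevity.)"""
    ax = math.atan2(v[3] - v[1], v[2] - v[0]) % math.pi
    x_min, y_min, x_max, y_max = roi(v)
    hits = 0
    for u in others:
        au = math.atan2(u[3] - u[1], u[2] - u[0]) % math.pi
        mx, my = (u[0] + u[2]) / 2, (u[1] + u[3]) / 2
        if abs(au - ax) < angle_tol and x_min <= mx <= x_max and y_min <= my <= y_max:
            hits += 1
    return hits >= 1
```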
<p><strong>5. Chemical Knowledge</strong></p>
<p>Validates the generated graph against rules for valences and charges. If valid, an SDF file is generated.</p>
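<p>A minimal sketch of such a valence check, assuming a table of typical covalent valences (the paper's full rule set also covers charges):</p>

```python
# Typical neutral covalent valences (illustrative subset).
VALENCES = {"C": 4, "N": 3, "O": 2, "H": 1, "S": 2, "F": 1, "Cl": 1, "Br": 1}

def valences_ok(atoms, bonds):
    """Reject reconstructions where any atom's total bond order exceeds
    its allowed valence; implicit hydrogens fill any remainder.
    Atoms are element symbols; bonds are (i, j, order) triples."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return all(used[k] <= VALENCES.get(atoms[k], 8) for k in range(len(atoms)))
```

For example, a C=O fragment passes, while an O≡O reconstruction is flagged as an error.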
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Support Vector Machine)</strong>: Used within the OCR module to classify connected components as characters or symbols. It is trained to be tolerant to rotation and font variations.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is binary success rate per molecule (perfect reconstruction of the SDF).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (DB1)</th>
          <th>Value (DB3)</th>
          <th>Baseline (CLIDE on DB1)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Reconstruction</td>
          <td><strong>97%</strong></td>
          <td>67%</td>
          <td>25%</td>
          <td>CLIDE required significant parameter tuning to reach 25%.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., Friedrich, C. M., Akle, S., &amp; Hofmann-Apitius, M. (2007). Reconstruction of Chemical Molecules from Images. <em>Proceedings of the 29th Annual International Conference of the IEEE EMBS</em>, 4609-4612. <a href="https://doi.org/10.1109/IEMBS.2007.4353366">https://doi.org/10.1109/IEMBS.2007.4353366</a></p>
<p><strong>Publication venue</strong>: IEEE EMBS 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriReconstructionChemicalMolecules2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Reconstruction of {{Chemical Molecules}} from {{Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 29th Annual International Conference of the IEEE EMBS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and Friedrich, Christoph M. and Akle, Santiago and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4609--4612}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IEMBS.2007.4353366}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA: Open Source Optical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/osra/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/osra/</guid><description>The first open-source optical structure recognition (OSR) utility for converting chemical images into SMILES/SD formats.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., &amp; Nicklaus, M. C. (2009). Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. <em>Journal of Chemical Information and Modeling</em>, 49(3), 740-743. <a href="https://doi.org/10.1021/ci800067r">https://doi.org/10.1021/ci800067r</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Model. 2009</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://sourceforge.net/projects/osra/">SourceForge Project</a></li>
<li><a href="http://cactus.nci.nih.gov/osra">Web Interface (Historical)</a></li>
</ul>
<h2 id="overview-and-motivation">Overview and Motivation</h2>
<p><strong>Resource</strong></p>
<p>This paper is a quintessential <strong>Infrastructure</strong> contribution ($\Psi_{\text{Resource}}$). While it contains significant algorithmic detail, the rhetorical structure and primary goal place it squarely as an infrastructure paper. The dominant contribution is the creation, release, and documentation of a software tool (OSRA).</p>
<p>A vast amount of chemical knowledge is locked in scientific and patent documents as graphical images (Kekulé structures). This is the classic chemical informatics challenge: decades of chemical knowledge are trapped in visual form.</p>
<ul>
<li><strong>Legacy Data Gap</strong>: Historical literature does not use computer-parsable formats, making automated processing of millions of documents impossible without optical recognition. Scientific papers and patents have historically depicted molecules as 2D structural diagrams.</li>
<li><strong>Need for Automation</strong>: Manual transcription is not scalable for the hundreds of thousands of documents available. While modern standards like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> and CML exist, the vast majority of chemical literature remains inaccessible for computational analysis.</li>
<li><strong>Open Source Gap</strong>: Before OSRA, only commercial software like CLiDE existed for this task, creating a barrier for academic researchers and limiting reproducibility; no universal, open-source solution was available prior to this work.</li>
</ul>
<h2 id="core-innovations-and-pipeline">Core Innovations and Pipeline</h2>
<p>OSRA is claimed to be the <strong>first open-source optical structure recognition (OSR) program</strong>. The novelty lies in creating an accessible OCSR system with a practical, multi-stage pipeline that combines classical image processing techniques with chemical knowledge.</p>
<p><strong>Key contributions:</strong></p>
<ol>
<li>
<p><strong>Integrated Pipeline</strong>: It uniquely combines existing open-source image processing tools (ImageMagick for formats, Potrace for vectorization, GOCR/OCRAD for text) into a chemical recognition workflow. The value is in the assembly and integration.</p>
</li>
<li>
<p><strong>Vectorization-Based Approach</strong>: OSRA uses the Potrace library to convert bitmap images into vector graphics (Bezier curves), then analyzes the geometry of these curves to identify bonds and atoms. This is more robust than angle-based detection methods because it leverages continuous mathematical properties of curves.</p>
</li>
<li>
<p><strong>Multi-Resolution Processing with Confidence Estimation</strong>: The system automatically processes each image at three different resolutions (72, 150, and 300 dpi), generating up to three candidate structures. A learned confidence function trained via linear regression on chemical features (heteroatom count, ring patterns, fragment count) selects the most chemically sensible result.</p>
</li>
<li>
<p><strong>Resolution Independence</strong>: Unlike some predecessors, it is designed to handle over 90 image formats and works independently of specific resolutions or fonts.</p>
</li>
<li>
<p><strong>Comprehensive Chemical Rules</strong>: OSRA implements sophisticated heuristics for chemical structure interpretation:</p>
<ul>
<li>Distinguishes bridge bond crossings from tetrahedral carbon centers using graph connectivity rules</li>
<li>Recognizes stereochemistry from wedge bonds (detected via line thickness gradients)</li>
<li>Handles old-style aromatic notation (circles inside rings)</li>
<li>Expands common chemical abbreviations (superatoms like &ldquo;COOH&rdquo; or &ldquo;CF₃&rdquo;)</li>
<li>Uses the 75th percentile of bond lengths as the reference length to avoid outlier bias</li>
</ul>
</li>
</ol>
<h2 id="methodology-and-validation">Methodology and Validation</h2>
<p>The authors validated OSRA against both commercial software and manual curation:</p>
<ol>
<li>
<p><strong>Commercial Comparison</strong>: They compared OSRA against CLiDE (a commercial OSR tool) using a &ldquo;small test set&rdquo; of 11 files provided by Simbiosys containing 42 structures. Performance was measured both as exact match accuracy and as Tanimoto similarity using molecular fingerprints.</p>
</li>
<li>
<p><strong>Internal Validation</strong>: They tested on an internal set of 66 images containing 215 structures, covering various resolutions, color depths, and drawing styles to assess performance at scale and characterize typical error patterns.</p>
</li>
<li>
<p><strong>Metric Definition</strong>: They defined recognition success using both exact matches (&ldquo;Perfect by InChI&rdquo;) and Tanimoto similarity (using CACTVS fingerprints). The authors explicitly argued for using Tanimoto similarity as the primary evaluation metric, reasoning that partial recognition (e.g., missing a methyl group) still provides useful chemical information, which binary &ldquo;correct/incorrect&rdquo; judgments fail to capture.</p>
</li>
</ol>
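<p>The Tanimoto metric rewards partial recognition. A minimal sketch, representing each fingerprint as a set of on-bit indices (the paper uses CACTVS fingerprints; the set representation is just for illustration):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two bit-set fingerprints:
    |A ∩ B| / |A ∪ B|. A near-miss structure (e.g., one missing methyl)
    scores high instead of the zero a binary match would give it."""
    if not fp_a and not fp_b:
        return 1.0  # convention for two empty fingerprints (illustrative)
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```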
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Competitive Accuracy</strong>: On the small comparative set, OSRA recognized 26 structures perfectly (by InChI) versus CLiDE&rsquo;s 11, demonstrating that an open-source, rule-based approach could outperform established commercial systems.</p>
</li>
<li>
<p><strong>Robustness</strong>: On the internal diverse set (215 structures), OSRA achieved a 93% average Tanimoto similarity and perfectly recognized 107 structures (50%). Tanimoto similarity above 85% was achieved for 182 structures (85%). This established OSRA as a competitive tool for practical use.</p>
</li>
<li>
<p><strong>Multi-Resolution Success</strong>: The multi-resolution strategy allowed OSRA to handle images with varying quality and formats. The confidence function (with correlation coefficient $r=0.89$) successfully identified which resolution produced the most chemically plausible structure.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge issues with:</p>
<ul>
<li>&ldquo;Imperfect segmentation&rdquo; leading to missed structures (3 missed in internal set) and false positives (7 in internal set)</li>
<li>Novel drawing conventions not covered by the implemented heuristics</li>
<li>Highly degraded or noisy images where vectorization fails</li>
<li>Hand-drawn structures that deviate significantly from standard chemical drawing practices</li>
<li>Complex reaction schemes with multiple molecules and arrows</li>
</ul>
</li>
<li>
<p><strong>Open-Source Impact</strong>: By releasing OSRA as open-source software, the authors enabled widespread adoption and community contribution. This established a foundation for future OCSR research and made the technology accessible to researchers without commercial software budgets.</p>
</li>
</ul>
<p>The work established that rule-based OCSR systems could achieve competitive performance when carefully engineered with chemical knowledge. OSRA became a standard baseline for the field and remained the dominant open-source solution until the rise of deep learning methods over a decade later. The vectorization-based approach and the emphasis on Tanimoto similarity as an evaluation metric influenced subsequent work in the area.</p>
<h2 id="technical-details">Technical Details</h2>
<p><strong>Grayscale Conversion</strong></p>
<p>OSRA uses a non-standard grayscale conversion to preserve light-colored atoms (e.g., yellow sulfur):</p>
<p>$$\text{Gray} = \min(R, G, B)$$</p>
<p>This prevents light colors from being washed out during binarization, unlike the standard weighted formula ($0.3R + 0.59G + 0.11B$).</p>
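<p>The two conversions side by side, as a minimal sketch:</p>

```python
def to_gray_min(r, g, b):
    """OSRA-style conversion: keep the darkest channel so light-colored
    atom labels (e.g., yellow sulfur) survive binarization."""
    return min(r, g, b)

def to_gray_weighted(r, g, b):
    """Standard luminance formula, which washes pure yellow out to
    near-white before binarization."""
    return 0.3 * r + 0.59 * g + 0.11 * b
```

For pure yellow (255, 255, 0), the min rule yields 0 (kept as ink) while the weighted formula yields about 227 (lost as background).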
<p><strong>Image Segmentation</strong></p>
<p>Chemical structures are identified within a page using specific bounding box criteria:</p>
<ul>
<li><strong>Black pixel density</strong>: Must be between 0.0 and 0.2</li>
<li><strong>Aspect ratio</strong>: Height-to-width ratio must be between 0.2 and 5.0</li>
<li><strong>Minimum size</strong>: Width and height must be &gt;50 pixels at resolutions &gt;150 dpi</li>
</ul>
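<p>These criteria combine into a simple bounding-box filter; a minimal sketch (function name hypothetical):</p>

```python
def looks_like_structure(black_pixels, width, height, dpi=300):
    """Sketch of OSRA's page-segmentation filter using the criteria above:
    black-pixel density in (0, 0.2), aspect ratio in (0.2, 5.0), and a
    50-pixel minimum size at resolutions above 150 dpi."""
    area = width * height
    density = black_pixels / area if area else 0.0
    aspect = height / width if width else 0.0
    big_enough = (width > 50 and height > 50) if dpi > 150 else True
    return 0.0 < density < 0.2 and 0.2 < aspect < 5.0 and big_enough
```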
<p><strong>Noise Detection and Smoothing</strong></p>
<p>A &ldquo;noise factor&rdquo; determines whether anisotropic smoothing is applied:</p>
<p>$$\text{Noise Factor} = \frac{\text{Count of 2-pixel line segments}}{\text{Count of 3-pixel line segments}}$$</p>
<p>Smoothing is applied only if this ratio is between 0.5 and 1.0.</p>
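<p>As a one-function sketch (the zero-denominator guard is an assumption, not specified in the paper):</p>

```python
def needs_smoothing(two_px_segments, three_px_segments):
    """Noise factor = (#2-pixel segments) / (#3-pixel segments);
    anisotropic smoothing triggers only in the 0.5-1.0 band."""
    if three_px_segments == 0:
        return False  # illustrative guard; not specified in the paper
    noise_factor = two_px_segments / three_px_segments
    return 0.5 <= noise_factor <= 1.0
```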
<p><strong>Atom Detection from Bezier Curves</strong></p>
<p>Potrace Bezier control points are flagged as potential atoms if:</p>
<ol>
<li>The point is classified as a &ldquo;corner&rdquo; by Potrace</li>
<li>The vector direction change has a <strong>normal component</strong> of at least 2 pixels</li>
</ol>
<p>The normal component criterion is more robust than angle-based detection because angles are difficult to measure accurately in pixelated environments where line thickness is non-zero.</p>
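<p>The normal-component test can be sketched with a 2D cross product; the direction vectors and threshold handling here are illustrative:</p>

```python
import math

def is_corner_atom(d_in, d_out, min_normal=2.0):
    """Sketch of the atom-candidate test: the component of the outgoing
    direction d_out perpendicular to the incoming direction d_in must be
    at least 2 px. Thresholding this length is more stable than measuring
    the turn angle in a pixelated image with thick lines."""
    nx, ny = d_in
    norm = math.hypot(nx, ny)
    if norm == 0:
        return False
    # Perpendicular component of d_out relative to d_in (2D cross product).
    normal_component = abs(nx * d_out[1] - ny * d_out[0]) / norm
    return normal_component >= min_normal
```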
<p><strong>Bond Length Estimation</strong></p>
<p>The reference bond length is computed as the <strong>75th percentile</strong> of all detected bond lengths. This avoids bias from outlier bonds (e.g., extremely short or long bonds from recognition errors).</p>
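<p>A nearest-rank sketch of the percentile computation (the paper does not specify its interpolation scheme):</p>

```python
import math

def reference_bond_length(lengths):
    """Nearest-rank 75th percentile of detected bond lengths; a high
    percentile keeps a few spuriously short or long vectors from
    skewing the reference length."""
    ranked = sorted(lengths)
    idx = max(0, math.ceil(0.75 * len(ranked)) - 1)
    return ranked[idx]
```

With lengths [10, 10, 10, 100], the reference stays at 10 rather than being dragged up by the outlier.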
<p><strong>Confidence Function</strong></p>
<p>A linear regression function selects the best result from the multi-scale processing:</p>
<p>$$\text{confidence} = 0.316 - 0.016N_C + 0.034N_N + 0.067N_O + \ldots + 0.330N_{\text{rings5}} + \ldots$$</p>
<p>where $N_C$, $N_N$, and $N_O$ are the counts of carbon, nitrogen, and oxygen atoms, respectively. The function rewards structures with more recognized heteroatoms and rings while penalizing disconnected fragments; additional terms cover further element, ring-pattern, and fragment counts.</p>
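<p>A minimal sketch of the selection step, using only the published weights quoted above (an illustrative subset; the full regression includes more element, ring, and fragment terms):</p>

```python
# Illustrative subset of the published regression weights.
WEIGHTS = {"bias": 0.316, "C": -0.016, "N": 0.034, "O": 0.067, "rings5": 0.330}

def confidence(counts):
    """Score one candidate structure from its feature counts."""
    return WEIGHTS["bias"] + sum(
        WEIGHTS[k] * counts.get(k, 0) for k in WEIGHTS if k != "bias")

def pick_best(candidates):
    """candidates: (structure, feature_counts) pairs, one per resolution
    (72/150/300 dpi); the highest-scoring structure is kept."""
    return max(candidates, key=lambda sc: confidence(sc[1]))[0]
```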
<p><strong>Test Data</strong></p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Comparison</td>
          <td>&ldquo;Small test set&rdquo; (Simbiosys)</td>
          <td>11 files (42 structures)</td>
          <td>Used to compare vs. CLiDE</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Internal Test Set</td>
          <td>66 images (215 structures)</td>
          <td>Various resolutions, color depths, styles</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used to define &ldquo;Success&rdquo;:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Perfect by InChI</strong></td>
          <td>Exact match of the InChI string to the human-curated structure.</td>
      </tr>
      <tr>
          <td><strong>Average Tanimoto</strong></td>
          <td>Tanimoto similarity (CACTVS fingerprints) between OSRA output and ground truth.</td>
      </tr>
      <tr>
          <td><strong>uuuuu</strong></td>
          <td>NCI CADD identifier match (topology only; ignores stereochem/charge/tautomers).</td>
      </tr>
  </tbody>
</table>
<p><strong>Results Table (Comparison)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Perfect (InChI)</th>
          <th>T &gt; 85%</th>
          <th>uuuuu Match</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>26 / 42</td>
          <td>39 / 42</td>
          <td>28 / 42</td>
      </tr>
      <tr>
          <td><strong>CLiDE</strong></td>
          <td>11 / 42</td>
          <td>26 / 42</td>
          <td>12 / 42</td>
      </tr>
  </tbody>
</table>
<h3 id="softwaredependencies">Software/Dependencies</h3>
<p>The system relies on external libraries:</p>
<ul>
<li><strong>ImageMagick</strong>: Image format parsing (supports 90+ formats)</li>
<li><strong>Ghostscript</strong>: PDF/PS interpretation</li>
<li><strong>Potrace</strong>: Vectorization (converts bitmap to Bezier curves)</li>
<li><strong>GOCR / OCRAD</strong>: Optical Character Recognition (heteroatom label recognition)</li>
<li><strong>OpenBabel / RDKit</strong>: Chemical backends for connection table compilation</li>
<li><strong>Output Formats</strong>: SMILES strings and SD files</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{filippovOpticalStructureRecognition2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Software To Recover Chemical Information}}: {{OSRA}}, {{An Open Source Solution}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Software To Recover Chemical Information}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V. and Nicklaus, Marc C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = mar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{740--743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800067r}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
]]></content:encoded></item><item><title>Optical Recognition of Chemical Graphics</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</guid><description>A 1993 prototype system for converting scanned chemical diagrams into connection tables using vectorization and heuristic-based structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-early-ocsr-pipeline-methodology">Contribution: Early OCSR Pipeline Methodology</h2>
<p><strong>Method</strong>. This paper proposes a novel architectural pipeline for the automatic recognition of chemical structure diagrams. It defines a specific sequence of algorithmic steps, including diagram separation, vectorization, segmentation, and structural analysis, which converts pixel data into a semantic chemical representation (MDL Molfile).</p>
<h2 id="motivation-digitizing-legacy-chemical-data">Motivation: Digitizing Legacy Chemical Data</h2>
<p><strong>Problem</strong>: In 1993, vast databases of chemical information existed, but the entry of graphical data was significantly less advanced than the facilities for manipulating it.</p>
<p><strong>Gap</strong>: Creating digital chemical structures required trained operators to manually redraw diagrams that already existed in printed journals and catalogs, leading to a costly duplication of effort.</p>
<p><strong>Goal</strong>: To automate the creation of coded representations (connection tables) directly from optically scanned diagrams on printed pages.</p>
<h2 id="novelty-general-document-analysis-integrated-with-chemical-rules">Novelty: General Document Analysis Integrated with Chemical Rules</h2>
<p><strong>Pipeline Approach</strong>: The authors present a complete end-to-end system that integrates general document analysis with domain-specific chemical rules.</p>
<p><strong>Convex Bounding Separation</strong>: A novel use of &ldquo;bounding polygons&rdquo; defined by 8 fixed-direction bands to distinguish diagram components from text with linear computational cost.</p>
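<p>The linear cost follows because each pixel is projected once per fixed axis. A minimal sketch, assuming the 8 side directions correspond to four projection axes at 45&deg; increments (function name illustrative):</p>

```python
import math

def octagonal_bound(points):
    """Sketch of the convex bounding polygon: project every pixel onto four
    fixed axes (0°, 45°, 90°, 135°) and keep the min/max projection per
    axis. The resulting 8-sided bound is computed in one linear pass and
    approximates distances better than an axis-aligned rectangle."""
    axes = [(1.0, 0.0), (math.sqrt(0.5), math.sqrt(0.5)),
            (0.0, 1.0), (-math.sqrt(0.5), math.sqrt(0.5))]
    extents = []
    for ax, ay in axes:
        projections = [x * ax + y * ay for x, y in points]
        extents.append((min(projections), max(projections)))
    return extents
```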
<p><strong>Vector-Based Segmentation</strong>: The system uses the output of a vectorizer (GIFTS) to classify diagram elements. It relies on the observation that vectorizers approximate characters with sets of short vectors to distinguish them from bonds.</p>
<h2 id="methodology-and-system-evaluation">Methodology and System Evaluation</h2>
<p><strong>System Implementation</strong>: The algorithm was implemented in &lsquo;C&rsquo; on IBM PS/2 personal computers running OS/2 Presentation Manager.</p>
<p><strong>Input Specification</strong>: The system was tested on documents scanned at 300 dpi using an IBM 3119 scanner.</p>
<p><strong>Qualitative Evaluation</strong>: The authors evaluated the system on &ldquo;typical scanned structures&rdquo; and &ldquo;simple planar diagrams&rdquo;. Large-scale quantitative benchmarking was not conducted in this work.</p>
<h2 id="results-performance-and-limitations">Results, Performance, and Limitations</h2>
<p><strong>Performance</strong>: The prototype processes a typical structure (after extraction) in less than one minute.</p>
<p><strong>Accuracy</strong>: It is reported to be accurate for simple planar diagrams.</p>
<p><strong>Output Format</strong>: The system successfully generates MDL Molfiles that interface with standard chemistry software like REACCS, MACCS, and modeling tools.</p>
<p><strong>Limitations</strong>: The system struggles with broken lines, characters touching bond structures, and requires manual intervention for complex errors.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed (Historical). As an early prototype from 1993, no source code, datasets, or digital models were publicly released. Reproducing this exact system would require recreating the pipeline from the described heuristics and sourcing vintage OCR software.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No digital artifacts were released with this 1993 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The paper does not release a dataset but specifies the input requirements for the system.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input</td>
          <td>Scanned Documents</td>
          <td>N/A</td>
          <td>Black ink on white paper; scanned at 300 dpi bi-level.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper relies on a pipeline of specific heuristics and geometric rules.</p>
<p><strong>1. Diagram Separation (Region Growing)</strong></p>
<ul>
<li><strong>Bounding Polygons</strong>: Uses convex polygons defined by pairs of parallel sides in 8 fixed directions. This approximation improves distance estimation compared to bounding rectangles.</li>
<li><strong>Seed Detection</strong>: Finds a connected component with bounding dimension $D &gt; d_{\text{max char size}}$.</li>
<li><strong>Aggregation</strong>: Iteratively searches for neighboring components within a specific distance threshold $d_t$ (where $d_t$ is smaller than the whitespace margin) and merges them into the bounding polygon.</li>
</ul>
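<p>The region-growing loop above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: axis-aligned bounding boxes stand in for the 8-direction bounding polygons (so gap distances are only approximate), and the thresholds <code>d_max_char</code> and <code>d_t</code> are parameters the caller must supply.</p>

```python
# Hypothetical sketch of the region-growing aggregation step.
# Axis-aligned boxes approximate the paper's 8-direction polygons.

def box_distance(a, b):
    """Gap between two boxes (x0, y0, x1, y1); 0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5

def merge(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def grow_diagram(components, d_max_char, d_t):
    """Seed on a component larger than any character, then absorb
    neighbours closer than the distance threshold d_t."""
    comps = list(components)
    seed = next(c for c in comps
                if max(c[2] - c[0], c[3] - c[1]) > d_max_char)
    comps.remove(seed)
    region, changed = seed, True
    while changed:
        changed = False
        for c in comps[:]:
            if box_distance(region, c) < d_t:
                region = merge(region, c)
                comps.remove(c)
                changed = True
    return region, comps  # diagram region, leftover (text) components
```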
<p><strong>2. Vectorization &amp; Segmentation</strong></p>
<ul>
<li><strong>Vectorization</strong>: Uses the GIFTS system (IBM Tokyo) to fit lines to pixels.</li>
<li><strong>Classification Heuristics</strong>:
<ul>
<li><strong>Ratio Test</strong>: If the ratio of a group&rsquo;s dimension to the full diagram dimension is below a threshold $\tau$, it is classified as a <strong>Symbol</strong>:
$$ \frac{D_{\text{group}}}{D_{\text{diagram}}} &lt; \tau $$</li>
<li><strong>Context Rule</strong>: Small vector groups near letters are classified as <strong>Characters</strong> (handles the &lsquo;l&rsquo; in &lsquo;Cl&rsquo;).</li>
<li><strong>Circle Rule</strong>: A group is a <strong>Circle</strong> (aromatic ring) if it contains $N \ge 8$ vectors in a roughly circular arrangement.</li>
<li><strong>Default</strong>: Otherwise, classified as <strong>Bond Structure</strong>.</li>
</ul>
</li>
</ul>
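<p>A minimal sketch of how these heuristics compose into a single classifier. The ratio test and circle rule follow the description above, but the threshold values and the circularity check are placeholders, and the context rule for characters (which needs OCR neighbours) is omitted.</p>

```python
# Illustrative re-implementation of the segmentation heuristics;
# thresholds (tau, n_circle, radial tolerance) are placeholders.
import math

def classify_group(vectors, diagram_dim, tau=0.1, n_circle=8):
    """vectors: list of ((x0, y0), (x1, y1)) line segments."""
    xs = [p[0] for v in vectors for p in v]
    ys = [p[1] for v in vectors for p in v]
    group_dim = max(max(xs) - min(xs), max(ys) - min(ys))

    # Ratio test: groups small relative to the diagram are symbols.
    if group_dim / diagram_dim < tau:
        return "symbol"

    # Circle rule: many vectors whose endpoints lie near a common
    # radius around the centroid indicate an aromatic-ring circle.
    if len(vectors) >= n_circle:
        cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
        radii = [math.hypot(p[0] - cx, p[1] - cy)
                 for v in vectors for p in v]
        mean_r = sum(radii) / len(radii)
        if all(abs(r - mean_r) < 0.2 * mean_r for r in radii):
            return "circle"

    # Default: everything else is bond structure.
    return "bond structure"
```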
<p><strong>3. Cleanup &amp; Structure Recognition</strong></p>
<ul>
<li><strong>Short Vector Removal</strong>: Vectors shorter than a fraction of the median line length $L_{\text{median}}$ are shrunk to their midpoint (fixing broken junctions).</li>
<li><strong>Vertex Merging</strong>: If two vectors meet at an angle $\theta &lt; 35^{\circ}$, the vertex is removed (fixing single lines broken into two).</li>
<li><strong>Aromatic Processing</strong>: If a circle is detected, the system identifies the 6 closest atoms and adds double bonds to every second bond in the ring.</li>
</ul>
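<p>The two line-cleanup rules can be illustrated as below, under an assumed data layout of vectors as endpoint pairs. The paper states the 35&deg; vertex rule without specifying how the angle is measured; here it is taken as the deviation from a straight continuation, which matches the stated goal of rejoining single lines broken in two.</p>

```python
# Sketch of the two line-cleanup rules; thresholds are illustrative.
import math

def shrink_short_vectors(vectors, median_len, frac=0.25):
    """Collapse vectors much shorter than the median line length to
    their midpoint, repairing broken junctions."""
    out = []
    for (x0, y0), (x1, y1) in vectors:
        if math.hypot(x1 - x0, y1 - y0) < frac * median_len:
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            out.append(((mx, my), (mx, my)))
        else:
            out.append(((x0, y0), (x1, y1)))
    return out

def should_merge(v1, v2, max_angle_deg=35.0):
    """True if two vectors meeting end-to-start deviate from a
    straight continuation by less than the threshold, i.e. they are
    likely one line broken in two and the vertex should be removed."""
    (ax, ay), (bx, by) = v1
    _, (cx, cy) = v2
    u = (bx - ax, by - ay)
    w = (cx - bx, cy - by)
    dot = u[0] * w[0] + u[1] * w[1]
    norm = math.hypot(*u) * math.hypot(*w)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    return angle < max_angle_deg
```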
<h3 id="models">Models</h3>
<p><strong>OCR</strong>:</p>
<ul>
<li>The system uses a feature-based, single-font OCR engine.</li>
<li>It assumes non-serif, plain styles typical of drafting standards.</li>
<li>Character images are normalized for size before recognition.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Scanner</strong>: IBM 3119 (300 dpi).</li>
<li><strong>Compute</strong>: IBM PS/2 series running OS/2.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Casey, R., et al. (1993). Optical Recognition of Chemical Graphics. <em>Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR &lsquo;93)</em>, 627-631. <a href="https://doi.org/10.1109/ICDAR.1993.395658">https://doi.org/10.1109/ICDAR.1993.395658</a></p>
<p><strong>Publication</strong>: ICDAR 1993</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{caseyOpticalRecognitionChemical1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical Recognition of Chemical Graphics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of 2nd {{International Conference}} on {{Document Analysis}} and {{Recognition}} ({{ICDAR}} &#39;93)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Casey, R. and Boyer, S. and Healey, P. and Miller, A. and Oudot, B. and Zilles, K.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{627--631}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE Comput. Soc. Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tsukuba Science City, Japan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1993.395658}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OCSR Methods: A Taxonomy of Approaches</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/</guid><description>Overview of optical chemical structure recognition methods organized by approach, from deep learning to rule-based systems.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>Optical Chemical Structure Recognition (OCSR) aims to automatically extract machine-readable molecular representations (e.g., SMILES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>, mol files) from images of chemical structures. Methods have evolved from early rule-based systems to modern deep learning approaches.</p>
<p>This note organizes OCSR methods by their fundamental approach, providing a framework for understanding the landscape of techniques.</p>
<h2 id="common-limitations-and-failure-modes">Common Limitations and Failure Modes</h2>
<p>Regardless of the underlying paradigm, most OCSR systems struggle with a common set of challenges:</p>
<ol>
<li><strong>Stereochemistry</strong>: Ambiguous wedge/dash bonds, varying drawing conventions, and implicit stereocenters frequently lead to incorrect isomer generation.</li>
<li><strong>Markush Structures</strong>: Generic structures with variable R-groups (common in patents) require complex subgraph mapping that sequence-based models often fail to capture.</li>
<li><strong>Image Degradation</strong>: Artifacts, low resolution, skewed scans, and hand-drawn irregularities degrade the performance of both rule-based heuristics and CNN feature extractors.</li>
<li><strong>Superatoms and Abbreviations</strong>: Textual abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;t-Bu&rdquo;, &ldquo;Boc&rdquo;) embedded within the image require joint optical character recognition (OCR) and structural parsing.</li>
</ol>
<h2 id="review--survey-papers">Review &amp; Survey Papers</h2>
<p>Comprehensive surveys and systematization of knowledge papers that organize and synthesize the OCSR literature.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Focus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2020</td>
          <td><a href="https://doi.org/10.1186/s13321-020-00465-0">A review of optical chemical structure recognition tools</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/">Rajan et al. 2020</a></td>
          <td>Survey of 30 years of OCSR development (1990-2019); benchmark of three open-source tools (OSRA, Imago, MolVec) on four datasets</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1186/s13321-022-00642-3">Review of techniques and models used in optical chemical structure recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/">Musazade et al. 2022</a></td>
          <td>Systematization of OCSR evolution from rule-based systems to modern deep learning; identifies paradigm shift to image captioning and critiques evaluation metrics</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1039/D3DD00228D">Comparing software tools for optical chemical structure recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/krasnov-ocsr-benchmark-2024/">Krasnov et al. 2024</a></td>
          <td>Benchmark of 8 open-access tools on 2,702 manually curated patent images; proposes ChemIC classifier for hybrid routing approach</td>
      </tr>
  </tbody>
</table>
<h2 id="deep-learning-methods">Deep Learning Methods</h2>
<p>End-to-end neural network architectures that learn to map images directly to molecular representations.</p>
<p><strong>Note on Paper Types</strong>: Papers listed below are primarily <strong>Method</strong> ($\Psi_{\text{Method}}$) papers focused on novel architectures and performance improvements. Some also have secondary <strong>Resource</strong> ($\Psi_{\text{Resource}}$) contributions through released tools or datasets. See the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> for classification details.</p>
<h3 id="image-to-sequence-paradigm">Image-to-Sequence Paradigm</h3>
<p>Treating chemical structure recognition as an image captioning task, these methods use encoder-decoder architectures (often with attention mechanisms) to generate sequential molecular representations like SMILES directly from pixels. Formally, given an image $I$, the model learns to sequentially output tokens $y_t$ to maximize the conditional probability:
$$ p(Y|I) = \prod_{t=1}^{T} p(y_t | y_{&lt;t}, I; \theta) $$
where $\theta$ represents the model parameters. This paradigm is powerful but can hallucinate chemically invalid structures if the decoder fails to learn chemical syntax rules.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2019</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.8b00669">Molecular Structure Extraction From Documents Using Deep Learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/">Staker et al. Notes</a></td>
          <td>U-Net segmentation + CNN-GridLSTM encoder-decoder with attention</td>
      </tr>
      <tr>
          <td>2020</td>
          <td><a href="https://doi.org/10.1186/s13321-020-00469-w">DECIMER: towards deep learning for chemical image recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER Notes</a></td>
          <td>Inception V3 encoder + GRU decoder with attention</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.1039/D1SC02957F">ChemPix: automated recognition of hand-drawn hydrocarbon structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/">ChemPix Notes</a></td>
          <td>CNN encoder + LSTM decoder with attention</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.1186/s13321-021-00538-8">DECIMER 1.0: deep learning for chemical image recognition using transformers</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0 Notes</a></td>
          <td>EfficientNet-B3 encoder + Transformer decoder with SELFIES output</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.48550/arXiv.2104.14721">End-to-End Attention-based Image Captioning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI Transformer Notes</a></td>
          <td>Vision Transformer encoder + Transformer decoder with InChI output</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.1039/D1SC01839F">Img2Mol - accurate SMILES recognition from molecular graphical depictions</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/">Img2Mol Notes</a></td>
          <td>CNN encoder + pre-trained CDDD decoder for continuous embedding</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.48550/arXiv.2109.04202">IMG2SMI: Translating Molecular Structure Images to SMILES</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/">IMG2SMI Notes</a></td>
          <td>ResNet-101 encoder + Transformer decoder with SELFIES output</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.3390/app12020680">Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/">ICMDT Notes</a></td>
          <td>Deep TNT encoder + Transformer decoder with InChI output</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1002/cmtd.202100069">Image2SMILES: Transformer-Based Molecular Optical Recognition Engine</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES Notes</a></td>
          <td>ResNet-50 encoder + Transformer decoder with FG-SMILES output</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1093/bioinformatics/btac545">MICER: a pre-trained encoder-decoder architecture for molecular image captioning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/">MICER Notes</a></td>
          <td>Fine-tuned ResNet101 encoder + LSTM decoder with attention</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1039/D1DD00013F">Performance of chemical structure string representations for chemical image recognition using transformers</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan String Representations</a></td>
          <td>Comparative ablation: SMILES vs DeepSMILES vs SELFIES vs InChI</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1186/s13321-022-00624-5">SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/">SwinOCSR Notes</a></td>
          <td>Swin Transformer encoder + Transformer decoder with DeepSMILES output</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.1145/3581783.3612573">Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/">Hu et al. RCGD Notes</a></td>
          <td>DenseNet encoder + GRU decoder with attention and SSML output</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.1038/s41467-023-40782-0">DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/">DECIMER.ai Notes</a></td>
          <td>EfficientNet-V2-M encoder + Transformer decoder with SMILES output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1038/s41598-024-67496-7">ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/">ChemReco Notes</a></td>
          <td>EfficientNet encoder + Transformer decoder with SMILES output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1186/s13321-024-00872-7">Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/">Enhanced DECIMER Notes</a></td>
          <td>EfficientNet-V2-M encoder + Transformer decoder with SMILES output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.3c02082">Image2InChI: Automated Molecular Optical Image Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI Notes</a></td>
          <td>Improved SwinTransformer encoder + Transformer decoder with InChI output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1039/D4RA02442G">MMSSC-Net: multi-stage sequence cognitive networks for drug molecule recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/">MMSSC-Net Notes</a></td>
          <td>SwinV2 encoder + GPT-2 decoder with MLP for multi-stage cognition</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.48550/arXiv.2412.07594">RFL: Simplifying Chemical Structure Recognition with Ring-Free Language</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/">RFL Notes</a></td>
          <td>DenseNet encoder + GRU decoder with hierarchical ring decomposition</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.1021/acs.jpclett.5c03057">Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/">DGAT Notes</a></td>
          <td>ResNet-101 encoder + Transformer with CGFE/SDGLA modules and SMILES output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2506.07553">GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/">GTR-CoT Notes</a></td>
          <td>Qwen-VL 2.5 3B encoder-decoder with graph traversal chain-of-thought and SMILES output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2411.11098">MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/">MolParser Notes</a></td>
          <td>Swin Transformer encoder + BART decoder with Extended SMILES (E-SMILES) output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2511.17300">MolSight: OCSR with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/">MolSight Notes</a></td>
          <td>EfficientViT-L1 encoder + Transformer decoder with RL (GRPO) and SMILES output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2501.15415">OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU Notes</a></td>
          <td>Mol-VL: Qwen2-VL encoder-decoder with multi-task learning for multi-level understanding</td>
      </tr>
  </tbody>
</table>
<h3 id="image-to-graph-paradigm">Image-to-Graph Paradigm</h3>
<p>Methods that explicitly construct molecular graphs as intermediate representations, identifying atoms as vertices $V$ and bonds as edges $E$ before converting to standard molecular formats. Graph approaches construct an adjacency matrix $A$ and feature vectors, effectively turning OCSR into a joint probability model over nodes, edges, and their spatial coordinates:
$$ p(G|I) = \prod_{v \in V} p(v|I) \prod_{(u,v) \in V \times V} p(e_{uv}|v_u, v_v, I) $$
This avoids hallucinating invalid character strings and explicitly grounds the predictions to the image space (via bounding boxes/segmentation), improving interpretability.</p>
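<p>Decoding under this factorization can be as simple as thresholding independent node predictions, then pairwise edge predictions, into an explicit graph. The sketch below is a toy illustration with made-up probability inputs, not any particular model's decoder.</p>

```python
# Toy decoding of the joint node/edge factorization: keep confident
# atoms, then keep confident bonds whose endpoints both survived.
# Indices stand in for detected image locations (bounding boxes).

def decode_graph(atom_probs, bond_probs, atom_thresh=0.5, bond_thresh=0.5):
    """atom_probs: {idx: (symbol, p)}; bond_probs: {(i, j): (order, p)}.
    Returns the retained atoms and an explicit bond dictionary."""
    atoms = {i: sym for i, (sym, p) in atom_probs.items()
             if p >= atom_thresh}
    bonds = {}
    for (i, j), (order, p) in bond_probs.items():
        # An edge survives only if both endpoints survived; this is
        # what keeps the output a valid graph, unlike free-form strings.
        if p >= bond_thresh and i in atoms and j in atoms:
            bonds[(i, j)] = order
    return atoms, bonds
```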
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2020</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.0c00459">ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/">ChemGrapher Notes</a></td>
          <td>U-Net-based semantic segmentation + graph building algorithm + classification CNNs</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1093/bib/bbac033">ABC-Net: A divide-and-conquer based deep learning architecture for SMILES recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/">ABC-Net Notes</a></td>
          <td>U-Net-style FCN with keypoint detection heatmaps + multi-task property prediction</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.48550/arXiv.2202.09580">Image-to-Graph Transformers for Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/">Image-to-Graph Transformers Notes</a></td>
          <td>ResNet-34 encoder + Transformer encoder + Graph-Aware Transformer (GRAT) decoder</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.2c00733">MolMiner: You Only Look Once for Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/">MolMiner Notes</a></td>
          <td>MobileNetV2 segmentation + YOLOv5 object detection + EasyOCR + graph construction</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://openaccess.thecvf.com/content/ICCV2023/html/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.html">MolGrapher: Graph-based Visual Recognition of Chemical Structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/">MolGrapher Notes</a></td>
          <td>ResNet-18 keypoint detector + supergraph construction + GNN classifier</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.2c01480">MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe Notes</a></td>
          <td>Swin Transformer encoder + Transformer decoder with explicit atom coordinates and bond prediction</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.48550/arXiv.2404.01743">Atom-Level Optical Chemical Structure Recognition with Limited Supervision</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/">AtomLenz Notes</a></td>
          <td>Faster R-CNN object detection + graph constructor with weakly supervised training (ProbKT*)</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1186/s13321-024-00926-w">MolNexTR: a generalized deep learning model for molecular image recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/molnextr/">MolNexTR Notes</a></td>
          <td>Dual-stream (ConvNext + ViT) encoder + Transformer decoder with graph generation</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.1109/CVPR52734.2025.01352">MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/markush/markushgrapher/">MarkushGrapher Notes</a></td>
          <td>UDOP VTL encoder + MolScribe OCSR encoder + T5 decoder with CXSMILES + substituent table</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2505.03777">MolMole: Molecule Mining from Scientific Literature</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/">MolMole Notes</a></td>
          <td>ViDetect (DINO) + ViReact (RxnScribe) + ViMore (detection-based) unified page-level pipeline</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2501.15415">OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU Notes</a></td>
          <td>DoubleCheck: MolScribe + attentive feature enhancement with local ambiguous atom refinement</td>
      </tr>
  </tbody>
</table>
<h3 id="image-to-fingerprint-paradigm">Image-to-Fingerprint Paradigm</h3>
<p>Methods that bypass molecular graph reconstruction entirely, generating molecular fingerprints directly from images through functional group recognition and spatial analysis. These approaches prioritize retrieval and similarity search over exact structure reconstruction.</p>
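<p>Retrieval over such fingerprints reduces to bit-set similarity, which is why exact graph reconstruction can be skipped. A minimal sketch using Tanimoto (Jaccard) similarity, with Python sets standing in for fingerprint bit vectors:</p>

```python
# Fingerprint retrieval needs only set comparisons, never a
# reconstructed molecular graph.

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def rank_library(query_fp, library):
    """Rank a {name: fingerprint} library by similarity to the query."""
    return sorted(library.items(),
                  key=lambda kv: tanimoto(query_fp, kv[1]),
                  reverse=True)
```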
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.1186/s13321-025-01091-4">SubGrapher: visual fingerprinting of chemical structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/">SubGrapher Notes</a></td>
          <td>Dual Mask-RCNN instance segmentation (1,534 groups + 27 backbones) + substructure-graph + SVMF fingerprint</td>
      </tr>
  </tbody>
</table>
<h3 id="image-classification-and-filtering">Image Classification and Filtering</h3>
<p>Methods that classify chemical structure images for preprocessing purposes, such as detecting Markush structures or other problematic inputs that should be filtered before full OCSR processing.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.48550/arXiv.2311.14633">One Strike, You&rsquo;re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/">Jurriaans et al. Notes</a></td>
          <td>Patch-based pipeline with Inception V3 or ResNet18 for binary classification</td>
      </tr>
  </tbody>
</table>
<h2 id="traditional-machine-learning-methods">Traditional Machine Learning Methods</h2>
<p>Hybrid approaches combining classical machine learning algorithms (neural networks, SVMs, CRFs) with domain-specific heuristics and image processing. These methods (primarily from 1992-2014) used ML for specific subtasks like character recognition or symbol classification while relying on rule-based systems for chemical structure interpretation.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Key ML Component</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1992</td>
          <td><a href="https://doi.org/10.1021/ci00008a018">Kekulé: OCR-Optical Chemical (Structure) Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/">Kekulé Notes</a></td>
          <td>Multilayer perceptron for OCR</td>
      </tr>
      <tr>
          <td>1996</td>
          <td><a href="https://doi.org/10.1007/3-540-61226-2_14">Automatic Interpretation of Chemical Structure Diagrams</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/">Kekulé-1 Notes</a></td>
          <td>Neural network with shared weights (proto-CNN)</td>
      </tr>
      <tr>
          <td>2007</td>
          <td><a href="https://cdn.aaai.org/AAAI/2007/AAAI07-134.pdf">Recognition of Hand Drawn Chemical Diagrams</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/">Ouyang-Davis Notes</a></td>
          <td>SVM for symbol classification</td>
      </tr>
      <tr>
          <td>2008</td>
          <td><a href="https://static.aminer.org/pdf/PDF/000/295/640/neural_versus_syntactic_recognition_of_handwritten_numerals.pdf">Chemical Ring Handwritten Recognition Based on Neural Networks</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/">Hewahi et al. Notes</a></td>
          <td>Two-phase classifier-recognizer with feed-forward NNs</td>
      </tr>
      <tr>
          <td>2008</td>
          <td><a href="https://doi.org/10.1109/IJCNN.2008.4634125">Recognition of On-line Handwritten Chemical Expressions</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/">Yang et al. Notes</a></td>
          <td>Two-level algorithm with edit distance matching</td>
      </tr>
      <tr>
          <td>2008</td>
          <td><a href="https://doi.org/10.1109/ICPR.2008.4761824">A Study of On-Line Handwritten Chemical Expressions Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/">Yang et al. Notes</a></td>
          <td>ANN with two-level substance recognition</td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1109/ICDAR.2009.64">A Unified Framework for Recognizing Handwritten Chemical Expressions</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/">Chang et al. Notes</a></td>
          <td>GMM for spatial relations, NN for bond verification</td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1109/ICDAR.2009.99">HMM-Based Online Recognition of Handwritten Chemical Symbols</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/">Zhang et al. Notes</a></td>
          <td>Hidden Markov Model for online handwriting</td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1109/ICDAR.2009.70">The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/">Wang et al. Notes</a></td>
          <td>HMM for text recognition + CFG for structure parsing</td>
      </tr>
      <tr>
          <td>2010</td>
          <td><a href="https://doi.org/10.1109/ICPR.2010.465">A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/">Zhang et al. Notes</a></td>
          <td>Dual-stage SVM-HMM with PSR algorithm</td>
      </tr>
      <tr>
          <td>2011</td>
          <td><a href="https://doi.org/10.1145/1943403.1943444">ChemInk: A Natural Real-Time Recognition System for Chemical Drawings</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/">ChemInk Notes</a></td>
          <td>Conditional Random Field (CRF) joint model</td>
      </tr>
      <tr>
          <td>2013</td>
          <td><a href="https://doi.org/10.1109/ICIS.2013.6607894">Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/">Tang et al. Notes</a></td>
          <td>SVM with elastic matching for handwriting</td>
      </tr>
      <tr>
          <td>2014</td>
          <td><a href="https://doi.org/10.1021/ci5002197">Markov Logic Networks for Optical Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/">MLOCSR Notes</a></td>
          <td>Markov Logic Network for probabilistic inference</td>
      </tr>
  </tbody>
</table>
<h2 id="rule-based-methods">Rule-Based Methods</h2>
<p>Classic approaches using heuristics, image processing, and domain-specific rules. While some systems use traditional OCR engines (which may contain ML components), the chemical structure recognition itself is purely algorithmic.</p>
<p><strong>Note</strong>: The chemoCR systems use SVM-based OCR but employ rule-based topology-preserving vectorization for core structure reconstruction, placing them primarily in this category.</p>
<h3 id="core-methods">Core Methods</h3>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1990</td>
          <td><a href="https://doi.org/10.1021/ci00067a014">Computational Perception and Recognition of Digitized Molecular Structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/">Contreras et al. Notes</a></td>
      </tr>
      <tr>
          <td>1993</td>
          <td><a href="https://doi.org/10.1021/ci00013a010">Chemical Literature Data Extraction: The CLiDE Project</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/">CLiDE Notes</a></td>
      </tr>
      <tr>
          <td>1993</td>
          <td><a href="https://doi.org/10.1109/ICDAR.1993.395658">Optical Recognition of Chemical Graphics</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/">Casey et al. Notes</a></td>
      </tr>
      <tr>
          <td>1999</td>
          <td><a href="https://doi.org/10.1109/ICDAR.1999.791730">Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/">Ramel et al. Notes</a></td>
      </tr>
      <tr>
          <td>2007</td>
          <td><a href="https://doi.org/10.1109/ENC.2007.25">Automatic Recognition of Chemical Images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/">chemoCR Notes</a></td>
      </tr>
      <tr>
          <td>2007</td>
          <td><a href="https://doi.org/10.1109/IEMBS.2007.4353366">Reconstruction of Chemical Molecules from Images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/">chemoCR Notes</a></td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1186/1752-153X-3-4">Automated extraction of chemical structure information from digital raster images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/">ChemReader Notes</a></td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1021/ci800449t">CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/">CLiDE Pro Notes</a></td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1021/ci800067r">Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/osra/">OSRA Notes</a></td>
      </tr>
      <tr>
          <td>2012</td>
          <td><a href="https://doi.org/10.1117/12.912185">Chemical Structure Recognition: A Rule Based Approach</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/">MolRec Notes</a></td>
      </tr>
      <tr>
          <td>2015</td>
          <td><a href="https://doi.org/10.2991/jimet-15.2015.50">Research on Chemical Expression Images Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/">Hong et al. Notes</a></td>
      </tr>
  </tbody>
</table>
<h3 id="trec-2011-chemistry-track">TREC 2011 Chemistry Track</h3>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/trec-chem-2011/">TREC 2011 Chemistry Track</a> provided a standardized benchmark for comparing OCSR systems, introducing the novel Image-to-Structure task alongside Prior Art and Technology Survey tasks. Papers from this evaluation are grouped here.</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Paper</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>chemoCR</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemoCR.chem.update.pdf">Chemical Structure Reconstruction with chemoCR</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/">chemoCR Notes</a></td>
      </tr>
      <tr>
          <td>ChemReader</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemreader.chem.update.pdf">Image-to-Structure Task by ChemReader</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/">ChemReader at TREC 2011 Notes</a></td>
      </tr>
      <tr>
          <td>Imago</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/GGA.chemical.pdf">Imago: open-source toolkit for 2D chemical structure image recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/imago-trec-2011/">Imago Notes</a></td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/saic-frederick.chem.pdf">Optical Structure Recognition Application entry in Image2Structure task</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/">OSRA at TREC 2011 Notes</a></td>
      </tr>
      <tr>
          <td>MolRec</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf">Performance of MolRec at TREC 2011 Overview and Analysis of Results</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/">MolRec at TREC Notes</a></td>
      </tr>
      <tr>
          <td>ChemInfty</td>
          <td><a href="https://www.inftyreader.org/inftyreader-contents/about-inftyreader/list-of-academic-papers/2011_GREC_ChemInfty.pdf">Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/">ChemInfty Notes</a></td>
      </tr>
  </tbody>
</table>
<h3 id="clef-2012-chemistry-track">CLEF 2012 Chemistry Track</h3>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/clef-ip-2012/">CLEF-IP 2012 benchmarking lab</a> introduced three specific IR tasks in the intellectual property domain: claims-based passage retrieval, flowchart recognition, and chemical structure recognition. The chemical structure recognition task included both segmentation (identifying bounding boxes) and recognition (converting to MOL format) subtasks, with a particular focus on challenging Markush structures common in patents.</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Paper</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolRec</td>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">MolRec at CLEF 2012 - Overview and Analysis of Results</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/">MolRec at CLEF 2012 Notes</a></td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf">Optical Structure Recognition Application entry to CLEF-IP 2012</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/">OSRA at CLEF-IP 2012 Notes</a></td>
      </tr>
  </tbody>
</table>
]]></content:encoded></item><item><title>Kekulé: OCR-Optical Chemical Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</guid><description>A seminal 1992 system for Optical Chemical Structure Recognition (OCSR) using neural networks and heuristic graph compilation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1992). Kekulé: OCR-Optical Chemical (Structure) Recognition. <em>Journal of Chemical Information and Computer Sciences</em>, 32(4), 373-378. <a href="https://doi.org/10.1021/ci00008a018">https://doi.org/10.1021/ci00008a018</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1992</p>
<h2 id="system-architecture-and-methodological-approach">System Architecture and Methodological Approach</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$). It proposes a novel software architecture (&ldquo;Kekulé&rdquo;) designed to solve a specific technical problem: the automatic conversion of printed chemical structure diagrams into computer-readable connection tables. The paper focuses on the &ldquo;how&rdquo; of the system by detailing the seven-step pipeline from scanning to graph compilation, validating the method through performance testing on a specific dataset.</p>
<h2 id="motivation-bridging-visual-diagrams-and-connection-tables">Motivation: Bridging Visual Diagrams and Connection Tables</h2>
<p>The primary motivation is to bridge the gap between how chemists communicate (structural diagrams) and how chemical databases store information (connection tables like MOLfiles).</p>
<ul>
<li><strong>Inefficiency of Manual Entry</strong>: Manual compilation of structural descriptions is &ldquo;tedious and highly prone to error&rdquo;.</li>
<li><strong>Redrawing Costs</strong>: Even capturing connectivity with a drawing program (a precursor of tools like ChemDraw) is inefficient; redrawing a complex molecule like vitamin $B_{12}$ takes ~20 minutes.</li>
<li><strong>Lack of Existing Solutions</strong>: Existing OCR systems at the time failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), and struggled with small, mixed fonts.</li>
</ul>
<h2 id="novelty-a-hybrid-ocr-and-heuristic-approach">Novelty: A Hybrid OCR and Heuristic Approach</h2>
<p>Kekulé represents the first successful attempt to integrate image processing, OCR, structure editing, and database communication into a single complete system.</p>
<ul>
<li><strong>Hybrid OCR Approach</strong>: Unlike commercial OCR of the time, it used a custom implementation combining rotation correction (for skew) with a <strong>multilayer perceptron neural network</strong> trained specifically on small fonts (down to 3.2 points).</li>
<li><strong>Heuristic Feature Extraction</strong>: The authors developed specific heuristics to handle chemical artifacts, such as an exhaustive search for dashed lines, explicitly rejecting Hough transforms as unreliable for short segments.</li>
<li><strong>Contextual &ldquo;Spell Checking&rdquo;</strong>: The system uses chemical context to verify OCR results, such as checking atom symbols against a valid list and using bond connections to disambiguate characters.</li>
</ul>
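<p>As a toy sketch (not the paper&rsquo;s implementation), the contextual &ldquo;spell check&rdquo; can be pictured as filtering ranked OCR candidates against a table of valid atom symbols; the symbol list and decision rule below are illustrative assumptions:</p>

```python
# Toy sketch of Kekule-style contextual "spell checking" of OCR output.
# The paper describes resolving ambiguous characters (e.g. '5' vs 'S')
# by checking candidates against a list of valid atom symbols; the
# symbol table and decision rule here are illustrative, not the
# original implementation.

VALID_ATOM_SYMBOLS = {"C", "N", "O", "S", "P", "F", "Cl", "Br", "I", "H"}

def resolve_ambiguous(candidates, has_bond_attachment=True):
    """Pick the first OCR candidate that is a chemically valid atom symbol.

    `candidates` is the ranked list of characters the OCR network kept
    above its confidence threshold. A label attached to bonds must name
    an atom, so non-symbols like '5' are rejected in that context.
    """
    if has_bond_attachment:
        for ch in candidates:
            if ch in VALID_ATOM_SYMBOLS:
                return ch
    # No chemical constraint applies: keep the top-ranked guess.
    return candidates[0]

print(resolve_ambiguous(["5", "S"]))  # ambiguous glyph on a bond -> 'S'
```

<p>The same idea generalizes to multi-character labels: once characters are grouped into strings, the string itself is validated against known atom and superatom labels.</p>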
<h2 id="experimental-setup-and-dataset-validation">Experimental Setup and Dataset Validation</h2>
<p>The authors performed a validation study on a diverse set of chemical structures to stress-test the system:</p>
<ul>
<li><strong>Dataset</strong>: 444 chemical structures were selected from a wide variety of sources, including the <em>Merck Index</em>, <em>Aldrich Handbook</em>, and <em>ACS Nomenclature Guide</em>, specifically chosen to &ldquo;test Kekulé&rsquo;s limits&rdquo;.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Processing Success</strong>: Percentage of structures processed.</li>
<li><strong>User Intervention</strong>: Average number of prompts per structure for verification.</li>
<li><strong>Editing Time</strong>: Time required to correct interpretation errors (with an arbitrary &ldquo;good&rdquo; threshold set at 30 seconds).</li>
</ul>
</li>
</ul>
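<p>These three metrics are straightforward aggregates over per-structure runs. A minimal sketch, assuming a hypothetical log format (the field names are illustrative, not from the paper):</p>

```python
# Sketch of computing the paper's three metrics from hypothetical
# per-structure processing logs (field names are illustrative).

logs = [
    {"processed": True, "prompts": 2, "edit_seconds": 12},
    {"processed": True, "prompts": 0, "edit_seconds": 5},
    {"processed": False, "prompts": 0, "edit_seconds": 0},
    {"processed": True, "prompts": 1, "edit_seconds": 40},
]

n = len(logs)
success_rate = 100 * sum(r["processed"] for r in logs) / n
avg_prompts = sum(r["prompts"] for r in logs) / n

# Share of successfully processed structures edited within the
# paper's arbitrary 30-second "good" limit.
ok = [r for r in logs if r["processed"]]
within_limit = 100 * sum(r["edit_seconds"] <= 30 for r in ok) / len(ok)

print(f"{success_rate:.1f}% processed, {avg_prompts:.2f} prompts/structure, "
      f"{within_limit:.1f}% edited within 30 s")
```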
<h2 id="results-and-system-performance">Results and System Performance</h2>
<ul>
<li><strong>High Success Rate</strong>: 98.9% of the 444 structures were processed successfully.</li>
<li><strong>Performance Speed</strong>: The average processing time was 9 seconds per structure on an 80486 (33 MHz) processor.</li>
<li><strong>Error Modes</strong>: The primary bottleneck was broken characters in scanned images (e.g., breaks in &lsquo;H&rsquo; or &lsquo;N&rsquo; crossbars), which slowed down the OCR significantly.</li>
<li><strong>Impact</strong>: The system demonstrated that automated interpretation was faster and less error-prone than manual redrawing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The following details outline the specific technical implementation described in the 1992 paper.</p>
<h3 id="data">Data</h3>
<p>The authors did not release a public dataset but described their test set sources in detail.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Mixed Chemical Sources</td>
          <td>444 structures</td>
          <td>Sourced from <em>Merck Index</em>, <em>Aldrich Handbook</em>, <em>ACS Nomenclature Guide</em>, etc.</td>
      </tr>
      <tr>
          <td>Training (OCR)</td>
          <td>Font Exemplars</td>
          <td>Unknown</td>
          <td>&ldquo;Exemplars of characters from numerous serif and sanserif fonts&rdquo;.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a 7-step pipeline. Key algorithmic choices include:</p>
<ul>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li>Images are reduced to 1-pixel width using <strong>thinning</strong> and <strong>raster-to-vector translation</strong>.</li>
<li>An <strong>adaptive smoothing algorithm</strong> is applied to remove pixel-level jitter.</li>
</ul>
</li>
<li>
<p><strong>Feature Extraction (Dashed Lines)</strong>:</p>
<ul>
<li><strong>Hough Transforms</strong> were rejected due to poor performance on short line segments.</li>
<li><strong>Slope sorting</strong> was rejected due to variance in short dashes.</li>
<li><strong>Chosen Method</strong>: Exhaustive search/testing of all features that <em>might</em> be dashed lines (subset of features).</li>
</ul>
</li>
<li>
<p><strong>Graph Compilation</strong>:</p>
<ul>
<li><strong>Character Grouping</strong>: Characters are assembled into strings based on XY adjacency.</li>
<li><strong>Node Creation</strong>: Character strings become nodes. Vectors with endpoints &ldquo;too far&rdquo; from strings create new nodes.</li>
<li><strong>Heuristics</strong>: Circles are converted to alternating single-double bonds; &ldquo;thick&rdquo; bonds between wedges are automatically generated.</li>
</ul>
</li>
</ul>
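<p>The character-grouping step above can be sketched as a left-to-right scan that merges characters closer than a gap threshold. The thresholds and the exact adjacency rule are assumptions; the paper states only that strings are assembled by XY adjacency:</p>

```python
# Illustrative sketch of grouping recognized characters into label
# strings by XY adjacency, as in Kekule's graph-compilation step.
# The adjacency rule (horizontal gap and baseline offset below fixed
# thresholds) is an assumption; the paper gives no exact values.

def group_characters(chars, max_gap=5, max_dy=3):
    """chars: list of (symbol, x, y) tuples.

    Returns label strings whose consecutive characters (scanned left to
    right) are within `max_gap` horizontally and `max_dy` vertically.
    """
    chars = sorted(chars, key=lambda c: c[1])
    groups, current = [], [chars[0]]
    for ch in chars[1:]:
        prev = current[-1]
        if ch[1] - prev[1] <= max_gap and abs(ch[2] - prev[2]) <= max_dy:
            current.append(ch)
        else:
            groups.append(current)
            current = [ch]
    groups.append(current)
    return ["".join(sym for sym, _, _ in g) for g in groups]

# 'C' and 'l' are adjacent -> one "Cl" label; 'O' is far away -> its own node.
print(group_characters([("C", 0, 0), ("l", 4, 1), ("O", 40, 0)]))
```

<p>Each resulting string then becomes a graph node, with bond vectors snapped to the nearest node as described above.</p>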
<h3 id="models">Models</h3>
<p>The core machine learning component is the OCR engine.</p>
<ul>
<li><strong>Architecture</strong>: A <strong>multilayer perceptron neural network</strong> (fully connected).</li>
<li><strong>Input</strong>: Normalized characters. Normalization involves rotation (for skew), scaling, under-sampling, and contrast/density adjustments.</li>
<li><strong>Output</strong>: Ranked probability matches. Outputs above an experimental threshold are retained. If a character is ambiguous (e.g., &lsquo;5&rsquo; vs &lsquo;S&rsquo;), both are kept and resolved via chemical context.</li>
<li><strong>Performance</strong>: Raw accuracy ~96% on small fonts (compared to ~85% for commercial OCR of the era).</li>
</ul>
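<p>The under-sampling part of the normalization can be pictured as nearest-neighbor scaling of a binary glyph onto a fixed grid. Grid size and method are illustrative; the paper lists rotation, scaling, under-sampling, and contrast steps without exact parameters:</p>

```python
# Sketch of the under-sampling step in Kekule's character
# normalization: scale an arbitrary binary glyph onto a fixed grid by
# nearest-neighbor sampling. The grid size is an illustrative choice;
# the paper does not specify the network's input resolution.

def normalize_glyph(pixels, size=8):
    """pixels: 2D list of 0/1 values; returns a size x size 0/1 grid."""
    h, w = len(pixels), len(pixels[0])
    return [
        [pixels[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]

# A 16x16 diagonal stroke under-sampled to a 4x4 diagonal.
glyph = [[1 if r == c else 0 for c in range(16)] for r in range(16)]
grid = normalize_glyph(glyph, size=4)
```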
<h3 id="hardware">Hardware</h3>
<p>The system was developed and tested on hardware typical of the early 1990s.</p>
<ul>
<li><strong>Processor</strong>: Intel 80486 at 33 MHz.</li>
<li><strong>Scanners</strong>: Hewlett-Packard ScanJet (300 dpi) and Logitech ScanMan (400 dpi hand-held).</li>
<li><strong>Platform</strong>: Microsoft Windows.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mcdanielKekuleOCRopticalChemical1992,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Kekulé: {{OCR-optical}} Chemical (Structure) Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Kekulé}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1992</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{32}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{373--378}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00008a018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>IMG2SMI: Translating Molecular Structure Images to SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</guid><description>Campos &amp; Ji's method for converting 2D molecular images to SMILES strings using Transformers and SELFIES representation.</description><content:encoded><![CDATA[<h2 id="contributions--taxonomy">Contributions &amp; Taxonomy</h2>
<p>This is both a <strong>Method</strong> and <strong>Resource</strong> paper:</p>
<ul>
<li><strong>Method</strong>: It adapts standard image captioning architectures (encoder-decoder) to the domain of Optical Chemical Structure Recognition (OCSR), treating molecule recognition as a translation task.</li>
<li><strong>Resource</strong>: It introduces <strong>MOLCAP</strong>, a large-scale dataset of 81 million molecules aggregated from public chemical databases, addressing the data scarcity that previously hindered deep learning approaches to OCSR.</li>
</ul>
<h2 id="the-bottleneck-in-chemical-literature-translation">The Bottleneck in Chemical Literature Translation</h2>
<p>Chemical literature is &ldquo;full of recipes written in a language computers cannot understand&rdquo; because molecules are depicted as 2D images. This creates a fundamental bottleneck:</p>
<ul>
<li><strong>The Problem</strong>: Chemists must manually redraw molecular structures to search for related compounds or reactions. This is slow, error-prone, and makes large-scale literature mining impossible.</li>
<li><strong>Existing Tools</strong>: Legacy systems like OSRA (Optical Structure Recognition Application) rely on handcrafted rules and often require human correction, making them unfit for unsupervised, high-throughput processing.</li>
<li><strong>The Goal</strong>: An automated system that can translate structure images directly to machine-readable strings (SMILES/SELFIES) without human supervision, enabling large-scale knowledge extraction from decades of chemistry literature and patents.</li>
</ul>
<h2 id="core-innovation-selfies-and-image-captioning">Core Innovation: SELFIES and Image Captioning</h2>
<p>The core novelty is demonstrating that <strong>how you represent the output text is as important as the model architecture itself</strong>. Key contributions:</p>
<ol>
<li>
<p><strong>Image Captioning Framework</strong>: Applies modern encoder-decoder architectures (ResNet-101 + Transformer) to OCSR, treating it as an image-to-text translation problem with a standard cross-entropy loss objective over the generation sequence:
$$ \mathcal{L} = -\sum\limits_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x) $$</p>
</li>
<li>
<p><strong>SELFIES as Target Representation</strong>: The key mechanism relies on using <strong>SELFIES</strong> (Self-Referencing Embedded Strings) as the output format. SELFIES is based on a formal grammar where every possible string corresponds to a valid molecule, eliminating the syntactic invalidity problems (unmatched parentheses, invalid characters) that plague SMILES generation.</p>
</li>
<li>
<p><strong>MOLCAP Dataset</strong>: Created a comprehensive dataset of 81 million unique molecules from PubChem, ChEMBL, <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>, and other sources. Generated 256x256 pixel images using RDKit for 1 million training samples and 5,000 validation samples.</p>
</li>
<li>
<p><strong>Task-Specific Evaluation</strong>: Demonstrated that traditional NLP metrics (BLEU) are poor indicators of scientific utility. Introduced evaluation based on <strong>molecular fingerprints</strong> (MACCS, RDK, Morgan) and <strong>Tanimoto similarity</strong>:
$$ T(a, b) = \frac{c}{a + b - c} $$
where $c$ is the number of common fingerprint bits, and $a$ and $b$ are the number of set bits in each respective molecule&rsquo;s fingerprint. This formulation reliably measures functional chemical similarity.</p>
</li>
</ol>
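<p>The Tanimoto formula above can be computed directly over sets of &ldquo;on&rdquo; fingerprint bits. The bit sets below are toy values; in practice RDKit&rsquo;s MACCS, RDK, or Morgan fingerprints would supply them:</p>

```python
# Tanimoto similarity over fingerprint bit sets, matching the formula
# above: T(a, b) = c / (a + b - c). Fingerprints are represented as
# sets of "on" bit indices; the toy bit sets below are illustrative
# stand-ins for real MACCS/Morgan fingerprints.

def tanimoto(fp_a, fp_b):
    """fp_a, fp_b: sets of set-bit indices for two molecules."""
    c = len(fp_a & fp_b)                 # bits common to both fingerprints
    denom = len(fp_a) + len(fp_b) - c
    return c / denom if denom else 1.0   # two empty fingerprints: identical

fp_pred = {1, 4, 7, 9}   # hypothetical predicted-molecule fingerprint
fp_true = {1, 4, 8, 9}   # hypothetical ground-truth fingerprint
print(round(tanimoto(fp_pred, fp_true), 3))  # 3 common / (4 + 4 - 3) = 0.6
```

<p>Because near-miss predictions share most substructure bits with the ground truth, this score degrades gracefully where exact-match accuracy drops to zero.</p>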
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The evaluation focused on comparing IMG2SMI to existing systems and identifying which design choices matter most:</p>
<ol>
<li>
<p><strong>Baseline Comparisons</strong>: Benchmarked against OSRA (rule-based system) and DECIMER (first deep learning approach) on the MOLCAP dataset to establish whether modern architectures could surpass traditional methods.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Extensive ablations isolating key factors:</p>
<ul>
<li><strong>Decoder Architecture</strong>: Transformer vs. RNN/LSTM decoders</li>
<li><strong>Encoder Fine-tuning</strong>: Fine-tuned vs. frozen pre-trained ResNet weights</li>
<li><strong>Output Representation</strong>: SELFIES vs. character-level SMILES vs. BPE-tokenized SMILES (the most critical ablation)</li>
</ul>
</li>
</ol>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>MACCS FTS</th>
          <th>Valid Captions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN + Fixed Encoder</td>
          <td>0.1526</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RNN + Fine-tuned Encoder</td>
          <td>0.4180</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Transformer + Fixed Encoder</td>
          <td>0.7674</td>
          <td>61.1%</td>
      </tr>
      <tr>
          <td>Transformer + Fine-tuned Encoder</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>Character-level SMILES (fine-tuned)</td>
          <td>N/A</td>
          <td>2.1%</td>
      </tr>
      <tr>
          <td>BPE SMILES (2000 vocab, fine-tuned)</td>
          <td>N/A</td>
          <td>20.0%</td>
      </tr>
      <tr>
          <td>SELFIES (fine-tuned)</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
  </tbody>
</table>
<ol start="3">
<li><strong>Metric Analysis</strong>: Systematic comparison of evaluation metrics including BLEU, ROUGE, Levenshtein distance, exact match accuracy, and molecular fingerprint-based similarity measures.</li>
</ol>
<h2 id="results-findings-and-limitations">Results, Findings, and Limitations</h2>
<p><strong>Performance Gains</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>Random Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>0.0000</td>
          <td>0.3378</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>0.0000</td>
          <td>0.2229</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>0.0000</td>
          <td>0.1081</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>0.0000</td>
          <td>0.0422</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>0.00%</td>
          <td>0.00%</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<ul>
<li>163% improvement over OSRA on MACCS Tanimoto similarity.</li>
<li>Nearly 10x improvement on ROUGE scores (0.6240 vs. 0.0684).</li>
<li>Average Tanimoto similarity exceeds 0.85 (functionally similar molecules even when not exact matches).</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>SELFIES is Critical</strong>: Using SELFIES yields <strong>99.4% valid molecules</strong>, compared to only ~2% validity for character-level SMILES.</li>
<li><strong>Architecture Matters</strong>: Transformer decoder significantly outperforms RNN/LSTM approaches. Fine-tuning the ResNet encoder (vs. frozen weights) yields substantial performance gains (e.g., MACCS FTS: 0.7674 to 0.9475).</li>
<li><strong>Metric Insights</strong>: BLEU is a poor metric for this task. Molecular fingerprint-based Tanimoto similarity is most informative because it measures functional chemical similarity.</li>
</ul>
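<p>A toy illustration of why free-form SMILES generation fails so often: parentheses and ring-closure digits must be paired, so a single wrong character yields an invalid string. The checker below tests only these two necessary conditions (real validity means RDKit can parse the string); SELFIES avoids the problem by construction, since every token sequence decodes to some molecule:</p>

```python
# Toy check of two necessary syntactic conditions on SMILES strings:
# balanced parentheses and paired ring-closure digits. This is an
# illustration only -- real validity checking parses with RDKit, and
# this helper ignores multi-digit (%nn) ring labels and atom semantics.

def looks_valid_smiles(s):
    depth = 0
    ring_open = set()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():      # ring-bond labels must appear in pairs
            ring_open ^= {ch}
    return depth == 0 and not ring_open

print(looks_valid_smiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin -> True
print(looks_valid_smiles("CC(=O)Oc1ccccc1C(=O)O)"))  # stray ')' -> False
```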
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Low Exact Match</strong>: Only <strong>7.24%</strong> exact matches. The model captures the overarching functional groups and structure but misses fine details like exact double bond placement.</li>
<li><strong>Complexity Bias</strong>: Trained on large molecules (average length &gt;40 tokens), so it performs poorly on very simple structures where OSRA still excels.</li>
</ul>
<p><strong>Conclusion</strong>: The work shows that modern encoder-decoder architectures combined with valid-by-construction molecular representations (SELFIES) can outperform traditional rule-based systems by large margins on fingerprint-based similarity metrics. The system is useful for literature mining where functional similarity matters more than exact matches, though 7.24% exact match accuracy and poor performance on simple molecules indicate clear directions for future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image captioning system based on DETR (Detection Transformer) framework.</p>
<p><strong>Visual Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: ResNet-101 pre-trained on ImageNet</li>
<li><strong>Feature Extraction</strong>: 4th layer extraction (convolutions only)</li>
<li><strong>Output</strong>: 2048-dimensional dense feature vector</li>
</ul>
<p><strong>Caption Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Transformer encoder-decoder</li>
<li><strong>Layers</strong>: 3 stacked encoder layers, 3 stacked decoder layers</li>
<li><strong>Attention Heads</strong>: 8</li>
<li><strong>Hidden Dimensions</strong>: 2048 (feed-forward networks)</li>
<li><strong>Dropout</strong>: 0.1</li>
<li><strong>Layer Normalization Epsilon</strong>: 1e-12</li>
</ul>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate</strong>: 5e-5 (selected after sweep from 1e-4 to 1e-6)</li>
<li><strong>Weight Decay</strong>: 1e-4</li>
<li><strong>Batch Size</strong>: 32</li>
<li><strong>Epochs</strong>: 5</li>
<li><strong>Codebase</strong>: Built on open-source DETR implementation</li>
</ul>
<h3 id="data">Data</h3>
<p><strong>MOLCAP Dataset</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total Size</td>
          <td>81,230,291 molecules</td>
          <td>Aggregated from PubChem, ChEMBL, GDB13</td>
      </tr>
      <tr>
          <td>Training Split</td>
          <td>1,000,000 molecules</td>
          <td>Randomly selected unique molecules</td>
      </tr>
      <tr>
          <td>Validation Split</td>
          <td>5,000 molecules</td>
          <td>Randomly selected for evaluation</td>
      </tr>
      <tr>
          <td>Image Resolution</td>
          <td>256x256 pixels</td>
          <td>Generated using RDKit</td>
      </tr>
      <tr>
          <td>Median SELFIES Length</td>
          <td>&gt;45 characters</td>
          <td>More complex than typical benchmarks</td>
      </tr>
      <tr>
          <td>Full Dataset Storage</td>
          <td>~16.24 TB</td>
          <td>Necessitated use of 1M subset</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>None</td>
          <td>No cropping, rotation, or other augmentation</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li>Images generated using RDKit at 256x256 resolution</li>
<li>Molecules converted to canonical representations</li>
<li>SELFIES tokenization for model output</li>
</ul>
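<p>SELFIES symbols are bracket-delimited, which makes the decoder's output easy to split into vocabulary tokens. A minimal stdlib sketch of such a tokenizer (illustrative; the paper presumably relies on the <code>selfies</code> library's own utilities):</p>

```python
import re

def tokenize_selfies(selfies: str) -> list:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\]]*\]", selfies)

# Benzene in SELFIES splits into 8 symbols:
print(tokenize_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]"))
```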
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metrics</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI Value</th>
          <th>OSRA Baseline</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>Fingerprint Tanimoto Similarity (functional groups)</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>RDKit fingerprint similarity</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>Morgan fingerprint similarity (circular)</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>Text overlap metric</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>Structural identity (strict)</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>Syntactic validity (with SELFIES)</td>
      </tr>
      <tr>
          <td>Levenshtein Distance</td>
          <td>21.13</td>
          <td>32.76</td>
          <td>String edit distance (lower is better)</td>
      </tr>
  </tbody>
</table>
<p><strong>Secondary Metrics</strong> (shown to be less informative for chemical tasks):</p>
<ul>
<li>BLEU, ROUGE (better suited for natural language)</li>
<li>Levenshtein distance (doesn&rsquo;t capture chemical similarity)</li>
</ul>
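<p>All three fingerprint metrics reduce to the Tanimoto (Jaccard) coefficient over fingerprint bit sets. A minimal stdlib sketch (the reported metrics use RDKit's MACCS/Morgan/RDK fingerprints; the bit sets below are purely illustrative):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity: |A & B| / |A | B| over fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are trivially identical
    intersection = len(fp_a & fp_b)
    return intersection / (len(fp_a) + len(fp_b) - intersection)

# Illustrative bit sets; real fingerprints come from RDKit (e.g. MACCS keys):
print(tanimoto({1, 2, 3, 4}, {3, 4, 5, 6}))  # 2 shared bits / 6 total = 0.333...
```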
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Single NVIDIA GeForce RTX 2080 Ti</li>
<li><strong>Training Time</strong>: ~5 hours per epoch, approximately 25 hours total for 5 epochs</li>
<li><strong>Memory</strong>: Sufficient for batch size 32 with ResNet-101 + Transformer architecture</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>The paper mentions releasing both code and the MOLCAP dataset, but no public repository or download link has been confirmed as available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MOLCAP dataset</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>81M molecules; claimed released but no public URL found</td>
      </tr>
      <tr>
          <td>IMG2SMI code</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Built on DETR; claimed released but no public URL found</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Campos, D., &amp; Ji, H. (2021). IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System (No. arXiv:2109.04202). arXiv. <a href="https://doi.org/10.48550/arXiv.2109.04202">https://doi.org/10.48550/arXiv.2109.04202</a></p>
<p><strong>Publication</strong>: arXiv preprint (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2109.04202">Paper on arXiv</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{campos2021img2smi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Campos, Daniel and Ji, Heng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2109.04202}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2109.04202}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Hand-Drawn Chemical Diagram Recognition (AAAI 2007)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</guid><description>A sketch recognition system for organic chemistry that uses domain knowledge (chemical valence) to correct recognition errors.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-approach">Contribution and Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a multi-stage pipeline for interpreting hand-drawn diagrams that integrates a trainable symbol recognizer with a domain-specific verification step. The authors validate the method through an ablation study comparing the full system against a baseline lacking domain knowledge.</p>
<h2 id="motivation-for-sketch-based-interfaces">Motivation for Sketch-Based Interfaces</h2>
<p>Current software for specifying chemical structures (e.g., ChemDraw, IsisDraw) relies on mouse and keyboard interfaces, which lack the speed, ease of use, and naturalness of drawing on paper. The goal is to bridge the gap between natural expression and computer interpretation by building a system that understands freehand chemical sketches.</p>
<h2 id="novel-integration-of-chemical-domain-knowledge">Novel Integration of Chemical Domain Knowledge</h2>
<p>The primary novelty is the integration of <strong>domain knowledge</strong> (specifically chemical valence rules) directly into the interpretation loop to resolve ambiguities and correct errors.</p>
<p>Specific technical contributions include:</p>
<ul>
<li><strong>Hybrid Recognizer</strong>: Combines feature-based SVMs, image-based template matching (modified Tanimoto), and off-the-shelf handwriting recognition to handle the mix of geometry and text.</li>
<li><strong>Domain Verification Loop</strong>: A post-processing step that checks the chemical validity of the structure (e.g., nitrogen must have 3 bonds). If an inconsistency is found, the system searches the space of alternative hypotheses generated during the initial parsing phase to find a valid interpretation.</li>
<li><strong>Contextual Parsing</strong>: Uses a sliding window (up to 7 strokes) and spatial context to parse interspersed symbols.</li>
<li><strong>Implicit Structure Handling</strong>: Supports two common chemistry notations: (1) implicit elements, where carbon and hydrogen atoms are omitted and inferred from bond connectivity and valence rules, and (2) aromatic rings, detected as a circle drawn inside a hexagonal 6-carbon cycle.</li>
</ul>
<h2 id="experimental-design-and-user-study">Experimental Design and User Study</h2>
<p>The authors conducted a user study to evaluate the system&rsquo;s robustness on unconstrained sketches.</p>
<ul>
<li><strong>Participants</strong>: 6 users familiar with organic chemistry.</li>
<li><strong>Task</strong>: Each user drew 12 pre-specified molecular compounds on a Tablet PC.</li>
<li><strong>Conditions</strong>: The system was evaluated in two modes:
<ol>
<li><strong>Domain</strong>: The full system with chemical valence checks.</li>
<li><strong>Baseline</strong>: A simplified version with no knowledge of chemical valence/verification.</li>
</ol>
</li>
<li><strong>Data Split</strong>: Evaluated on collected sketches using a leave-one-out style approach (training on 11 examples from the same users).</li>
</ul>
<h2 id="results-and-error-reduction-analysis">Results and Error Reduction Analysis</h2>
<ul>
<li><strong>Performance</strong>: The full system achieved an overall <strong>F-measure of 0.87</strong> (Precision 0.86, Recall 0.89).</li>
<li><strong>Impact of Domain Knowledge</strong>: Using domain knowledge reduced the overall error rate (measured by recall) by <strong>27%</strong> compared to the baseline. The improvement was statistically significant ($p &lt; .05$).</li>
<li><strong>Error Recovery</strong>: The system successfully recovered from interpretations that were geometrically plausible but chemically impossible (e.g., misinterpreting &ldquo;N&rdquo; as bonds), as illustrated in their qualitative analysis.</li>
<li><strong>Output Integration</strong>: Once interpreted, the resulting structure is expressed in a standard chemical specification format that can be passed to tools such as ChemDraw (for rendering) or SciFinder (for database queries).</li>
<li><strong>Limitations</strong>: The system struggled with &ldquo;messy&rdquo; sketches where users drew single bonds with multiple strokes or over-traced lines, as the current bond recognizer assumes single-stroke straight bonds.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study collected a custom dataset of hand-drawn diagrams.</p>
<ul>
<li><strong>Volume</strong>: 6 participants $\times$ 12 molecules = 72 total sketches (implied).</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Scale Normalization</strong>: The system estimates scale based on the average length of straight bonds (chosen because they are easy to identify). This normalizes geometric features for the classifier.</li>
<li><strong>Stroke Segmentation</strong>: Poly-line approximation using recursive splitting (minimizing least squared error) to break multi-segment strokes (e.g., connected bonds) into primitives.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Ink Parsing (Sliding Window)</strong></p>
<ul>
<li>Examines all combinations of up to <strong>$n=7$</strong> sequential strokes.</li>
<li>Classifies each group as a valid symbol or invalid garbage.</li>
</ul>
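<p>The windowing step can be sketched as follows (hypothetical names; strokes are assumed to be in drawing order, and each group is then scored by the classifier as a valid symbol or rejected as garbage):</p>

```python
def candidate_groups(strokes, max_n=7):
    """Enumerate every contiguous group of up to max_n sequential strokes."""
    for start in range(len(strokes)):
        for end in range(start + 1, min(start + max_n, len(strokes)) + 1):
            yield strokes[start:end]

# 3 strokes with max_n=2 yields 5 candidates: 3 singletons + 2 adjacent pairs.
groups = list(candidate_groups(["s1", "s2", "s3"], max_n=2))
print(len(groups))  # 5
```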
<p><strong>2. Template Matching (Image-based)</strong></p>
<ul>
<li>Used for resolving ambiguities in text/symbols (e.g., &lsquo;H&rsquo; vs &lsquo;N&rsquo;).</li>
<li><strong>Metric</strong>: Modified <strong>Tanimoto coefficient</strong>. Unlike standard Tanimoto (point overlap), this version accounts for relative angle and curvature at each point.</li>
</ul>
<p><strong>3. Domain Verification</strong></p>
<ul>
<li><strong>Trigger</strong>: An element with incorrect valence (e.g., Hydrogen with &gt;1 bond).</li>
<li><strong>Resolution</strong>: Searches stored alternative hypotheses for the affected strokes. It accepts a new hypothesis if it resolves the valence error without introducing new ones.</li>
<li><strong>Constraint</strong>: It keeps an inconsistent structure if the original confidence score is significantly higher than alternatives (assuming user is still drawing or intentionally left it incomplete).</li>
</ul>
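<p>The valence trigger amounts to a bond-count check against each element's allowed valence. A minimal sketch (illustrative valence table and data layout; the paper's search over stored alternative hypotheses is omitted):</p>

```python
VALENCE = {"H": 1, "O": 2, "N": 3, "C": 4}

def valence_violations(atoms, bonds):
    """Return indices of atoms whose total bond order exceeds their valence.

    atoms: dict mapping atom index -> element symbol
    bonds: list of (i, j, order) tuples
    """
    degree = {i: 0 for i in atoms}
    for i, j, order in bonds:
        degree[i] += order
        degree[j] += order
    return [i for i in atoms if degree[i] > VALENCE[atoms[i]]]

# A hydrogen with two bonds is chemically impossible and triggers the search:
atoms = {0: "C", 1: "H", 2: "O"}
bonds = [(0, 1, 1), (1, 2, 1)]
print(valence_violations(atoms, bonds))  # [1]
```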
<h3 id="models">Models</h3>
<p><strong>Symbol Recognizer (Discriminative Classifier)</strong></p>
<ul>
<li><strong>Type</strong>: Support Vector Machine (SVM).</li>
<li><strong>Classes</strong>: Element letters, straight bonds, hash bonds, wedge bonds, invalid groups.</li>
<li><strong>Input Features</strong>:
<ol>
<li>Number of strokes</li>
<li>Bounding-box dimensions (width, height, diagonal)</li>
<li>Ink density (ink length / diagonal length)</li>
<li>Inter-stroke distance (max distance between strokes in group)</li>
<li>Inter-stroke orientation (vector of relative orientations)</li>
</ol>
</li>
</ul>
<p><strong>Text Recognition</strong></p>
<ul>
<li><strong>Microsoft Tablet PC SDK</strong>: Used for recognizing alphanumeric characters (elements and subscripts).</li>
<li>Integrated with the SVM and Template Matcher via a combined scoring mechanism.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Overall)</th>
          <th>Baseline Comparison</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision</strong></td>
          <td>0.86</td>
          <td>0.81 (Baseline)</td>
          <td>Full system vs. no domain knowledge</td>
      </tr>
      <tr>
          <td><strong>Recall</strong></td>
          <td>0.89</td>
          <td>0.85 (Baseline)</td>
          <td>27% error reduction</td>
      </tr>
      <tr>
          <td><strong>F-Measure</strong></td>
          <td>0.87</td>
          <td>0.83 (Baseline)</td>
          <td>Statistically significant ($p &lt; .05$)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>True Positive Definition</strong>: Match in both location (stroke grouping) and classification (label).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: 1.5GHz Tablet PC.</li>
<li><strong>Performance</strong>: Real-time feedback.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<p>No source code, trained models, or collected sketch data were publicly released. The paper is openly available through the AAAI digital library. The system depends on the Microsoft Tablet PC SDK (a proprietary, now-discontinued component), which would make exact replication difficult even with the algorithm descriptions provided.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2007). Recognition of Hand Drawn Chemical Diagrams. <em>Proceedings of the 22nd National Conference on Artificial Intelligence</em> (AAAI-07), 846-851.</p>
<p><strong>Publication</strong>: AAAI 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyang2007recognition,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recognition of Hand Drawn Chemical Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ouyang, Tom Y and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 22nd National Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{846--851}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph Perception for Chemical Structure OCR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</guid><description>A 1990 methodological paper presenting an early OCR system for digitizing chemical structure images into connectivity tables using C and Prolog.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Contreras, M. L., Allendes, C., Alvarez, L. T., &amp; Rozas, R. (1990). Computational perception and recognition of digitized molecular structures. <em>Journal of Chemical Information and Computer Sciences</em>, 30(3), 302-307. <a href="https://doi.org/10.1021/ci00067a014">https://doi.org/10.1021/ci00067a014</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1990</p>
<h2 id="contribution-graph-perception-and-character-recognition">Contribution: Graph Perception and Character Recognition</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>.</p>
<p>It proposes a specific algorithmic pipeline (&ldquo;graph perception and character recognition&rdquo;) to solve the technical problem of converting pixelated images of molecules into machine-readable connectivity tables. The dominant contribution is the novel set of algorithms (contour search, circular inspection, matrix parametrization).</p>
<h2 id="motivation-automating-chemical-database-entry">Motivation: Automating Chemical Database Entry</h2>
<p>The primary motivation is to automate the input of chemical structures into databases.</p>
<ul>
<li><strong>Problem</strong>: Manual input of structures (especially large ones with stereochemistry) is time-consuming and prone to human error.</li>
<li><strong>Gap</strong>: Existing methods required significant human intervention. The authors created a system that handles the &ldquo;graph/skeleton&rdquo; and the &ldquo;alphanumeric characters&rdquo; effectively to speed up entry into systems like ARIUSA or CAD tools.</li>
</ul>
<h2 id="algorithmic-novelty-circular-inspection-processing">Algorithmic Novelty: Circular Inspection Processing</h2>
<p>The paper introduces a unified &ldquo;capture-to-recognition&rdquo; system written in C that handles both type-printed and hand-printed structures. Key novelties include:</p>
<ul>
<li><strong>Circular Inspection Algorithm</strong>: A specific technique for detecting internal rings and multiple bonds by sweeping a radius of 0.3 bond lengths around atoms.</li>
<li><strong>Hybrid Recognition</strong>: Combining &ldquo;graph perception&rdquo; (vectorizing the lines) with &ldquo;character recognition&rdquo; (OCR for atom labels) in a single pipeline.</li>
<li><strong>Matrix Parametrization for OCR</strong>: A feature extraction method that assigns hexadecimal IDs to character matrices based on pixel gradients and &ldquo;semibytes&rdquo;.</li>
</ul>
<h2 id="methodology-validation-via-custom-structure-dataset">Methodology: Validation via Custom Structure Dataset</h2>
<p>The authors validated the system by digitizing and recognizing a set of test structures:</p>
<ul>
<li><strong>Dataset</strong>: 200 type-printed structures and 50 hand-printed structures.</li>
<li><strong>Metric</strong>: &ldquo;Reliability&rdquo; percentage (correct recognition of the connectivity table).</li>
<li><strong>Speed Comparison</strong>: Measured processing time against a &ldquo;qualified person&rdquo; performing manual input for an average 20-atom molecule.</li>
</ul>
<h2 id="results-speed-and-file-size-efficiency">Results: Speed and File Size Efficiency</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>94% reliability</strong> for both type- and hand-printed graphs.</li>
<li><strong>Character Recognition</strong>: Isolated character recognition achieved <strong>&gt;99% reliability</strong>.</li>
<li><strong>Speed</strong>: The system was <strong>3-5 times faster</strong> than manual human input.</li>
<li><strong>Efficiency</strong>: The storage required for a recognized molecule (e.g., $C_{19}H_{31}N$) was significantly smaller (4.1 kb) than the raw image bitmap.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a standard external dataset but rather a custom set of structures for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Type-printed structures</td>
          <td style="text-align: left">200 images</td>
          <td style="text-align: left">Used to test reliability</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Hand-printed structures</td>
          <td style="text-align: left">50 images</td>
          <td style="text-align: left">&ldquo;Straight enough&rdquo; drawings required</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details three specific algorithmic components crucial for replication:</p>
<ol>
<li>
<p><strong>Graph Perception (Contour Search)</strong>:</p>
<ul>
<li><strong>Sweep</strong>: Left-to-right horizontal sweep to find the first pixel.</li>
<li><strong>Contour Follow</strong>: Counter-clockwise algorithm used to trace borders.</li>
<li><strong>Vertex Detection</strong>: A vertex is flagged if the linear trajectory deflection angle is $&gt;18^\circ$.</li>
<li><strong>Atom Localization</strong>: Two or more vertices in a small space indicate an atom position.</li>
</ul>
</li>
<li>
<p><strong>Circular Inspection (Branching/Rings)</strong>:</p>
<ul>
<li><strong>Radius</strong>: A circle is inspected around each atom with $r = 0.3 \times \text{single bond length}$.</li>
<li><strong>Branch Detection</strong>: &ldquo;Unknown border pixels&rdquo; found on this circle trigger new contour searches to find attached bonds or rings.</li>
</ul>
</li>
<li>
<p><strong>Character Recognition (Matrix Feature Extraction)</strong>:</p>
<ul>
<li><strong>Separation</strong>: Characters are separated into isolated matrices and &ldquo;relocated&rdquo; to the top-left corner.</li>
<li><strong>Parametrization</strong>: The matrix is divided into zones. A &ldquo;semibyte&rdquo; (4-bit code) is generated by checking for pixel density in specific directions.</li>
<li><strong>ID Assignment</strong>: Matrices are assigned a Hex ID (e.g., <code>8</code>, <code>1</code>, <code>0</code>, <code>6</code>) based on these semibytes.</li>
<li><strong>Differentiation</strong>: Secondary parameters (concavities, vertical lines) resolve conflicts (e.g., between &lsquo;b&rsquo; and &lsquo;h&rsquo;).</li>
</ul>
</li>
</ol>
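<p>The vertex-detection rule from the contour search can be sketched as a deflection-angle test between consecutive trajectory segments (illustrative point representation; the $18^\circ$ threshold is the paper's):</p>

```python
import math

def is_vertex(p_prev, p, p_next, threshold_deg=18.0):
    """Flag point p as a vertex if the trajectory deflects by more than threshold_deg."""
    heading_in = math.atan2(p[1] - p_prev[1], p[0] - p_prev[0])
    heading_out = math.atan2(p_next[1] - p[1], p_next[0] - p[0])
    deflection = abs(math.degrees(heading_out - heading_in)) % 360
    deflection = min(deflection, 360 - deflection)  # fold into [0, 180]
    return deflection > threshold_deg

print(is_vertex((0, 0), (1, 0), (2, 0)))  # False: straight line, no deflection
print(is_vertex((0, 0), (1, 0), (1, 1)))  # True: 90-degree turn
```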
<h3 id="models">Models</h3>
<p>The system does not use learned weights (neural networks). It relies on <strong>rule-based topological recognition</strong>.</p>
<ul>
<li><strong>Representation</strong>: The final output is a Prolog data structure converted into a connectivity table.</li>
<li><strong>Atom Recognition</strong>: Terminal atoms are identified by linear projection; if no pixels are found, it defaults to Carbon.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The performance metrics reflect 1990s hardware, useful for historical context or low-resource reimplementation.</p>
<ul>
<li><strong>Capture</strong>: PC-AT microcomputer with HP-Scanjet.</li>
<li><strong>Processing</strong>: MicroVax II (8 MB real memory, 159 MB hard disc) running Ultrix-32.</li>
<li><strong>Memory Usage</strong>: A $300 \times 300$ dpi image required ~175 kb; a recognized graph required ~1.6 kb.</li>
<li><strong>Time</strong>: Processing time per molecule was 0.7 - 1.0 minutes.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{contrerasComputationalPerceptionRecognition1990,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Computational Perception and Recognition of Digitized Molecular Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Contreras, M. Leonor and Allendes, Carlos and Alvarez, L. Tomas and Rozas, Roberto}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1990</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{30}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{302--307}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00067a014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader: Automated Structure Extraction</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</guid><description>ChemReader extracts chemical structures from raster images using modified Hough transform and chemical spell checking for improved accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Rosania, G. R., Shedden, K. A., Nguyen, M., Lyu, N., &amp; Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. <em>Chemistry Central Journal</em>, 3(1), 4. <a href="https://doi.org/10.1186/1752-153X-3-4">https://doi.org/10.1186/1752-153X-3-4</a></p>
<p><strong>Publication</strong>: Chemistry Central Journal 2009</p>
<h2 id="paper-contribution-method--pipeline">Paper Contribution: Method &amp; Pipeline</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel software system, <strong>ChemReader</strong>, designed to automate the analog-to-digital conversion of chemical structure diagrams. The paper focuses on the algorithmic pipeline, specifically modifying standard computer vision techniques like the Hough Transform to suit chemical graphs. It validates the method through direct performance comparisons against existing State-of-the-Art tools (OSRA, CLiDE, Kekule).</p>
<h2 id="motivation-unlocking-analog-chemical-information">Motivation: Unlocking Analog Chemical Information</h2>
<p>There is a massive amount of chemical information (molecular interactions, pathways, disease processes) locked in scientific literature. However, this information is typically encoded as &ldquo;analog diagrams&rdquo; (raster images) embedded in text. Existing text-based search engines cannot index these structures effectively.</p>
<p>While previous tools existed (Kekule, OROCS, CLiDE), they often required high-resolution images (150-300 dpi) or manual intervention to separate diagrams from text, making fully automated, large-scale database annotation impractical.</p>
<h2 id="core-innovation-modified-transforms-and-spell-checking">Core Innovation: Modified Transforms and Spell Checking</h2>
<p>The authors introduce <strong>ChemReader</strong>, a fully automated toolkit with several specific algorithmic innovations tailored for chemical diagrams:</p>
<ul>
<li><strong>Modified Hough Transform (HT):</strong> Unlike standard HT, which treats all pixels equally, ChemReader uses a modified weight function that accounts for pixel connectivity and line thickness to better detect chemical bonds.</li>
<li><strong>Chemical Spell Checker:</strong> A post-processing step that uses a dictionary of common chemical abbreviations (770 entries) and n-gram probabilities to correct Optical Character Recognition (OCR) errors (e.g., correcting specific atom labels based on valence rules), improving accuracy from 66% to 87%.</li>
<li><strong>Specific Substructure Detection:</strong> Dedicated algorithms for detecting stereochemical &ldquo;wedge&rdquo; bonds using corner detection and aromatic rings using the Generalized Hough Transform.</li>
</ul>
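<p>The spell-checking idea can be sketched as a nearest-neighbor lookup in an abbreviation dictionary under edit distance (tiny illustrative dictionary and correction policy; the paper's 770-entry dictionary also weighs n-gram probabilities and valence rules):</p>

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Tiny illustrative dictionary of common abbreviations:
DICTIONARY = {"OH", "NH2", "CH3", "OCH3", "COOH", "OEt"}

def spell_check(label: str, max_dist: int = 1) -> str:
    """Return the closest dictionary entry within max_dist edits, else the label unchanged."""
    best = min(DICTIONARY, key=lambda w: edit_distance(label, w))
    return best if edit_distance(label, best) <= max_dist else label

print(spell_check("0H"))  # 'OH': one substitution fixes the common OCR confusion of 0/O
```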
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared ChemReader against three other systems: <strong>OSRA V1.0.1</strong>, <strong>CLiDE V2.1 Lite</strong>, and <strong>Kekule V2.0 demo</strong>.</p>
<p>They used three distinct datasets to test robustness:</p>
<ol>
<li><strong>Set I (50 images):</strong> Diverse drawing styles and fonts collected via Google Image Search.</li>
<li><strong>Set II (100 images):</strong> Ligand images from the GLIDA database, linked to PubChem for ground truth.</li>
<li><strong>Set III (212 images):</strong> Low-resolution images embedded in 121 scanned journal articles from PubMed.</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Accuracy:</strong> ChemReader significantly outperformed competitors. In the difficult <strong>Set III</strong> (journal articles), ChemReader achieved <strong>30.2%</strong> correct exact output, compared to 17% for OSRA and 6.6% for CLiDE.</li>
<li><strong>Similarity:</strong> Even when exact matches failed, ChemReader maintained high Tanimoto similarity scores (0.74-0.86), indicating it successfully captured the majority of chemically significant features.</li>
<li><strong>Substructure Recognition:</strong> ChemReader demonstrated higher recall rates across all PubChem fingerprint categories (rings, atom pairs, SMARTS patterns) compared to other tools.</li>
<li><strong>Error Correction:</strong> The &ldquo;Chemical Spell Checker&rdquo; improved character recognition accuracy from <strong>66% to 87%</strong>.</li>
</ul>
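<p>The Tanimoto comparison above can be sketched over binary fingerprints treated as sets of &ldquo;on&rdquo; bits (a minimal illustration with hypothetical bit sets, not the actual PubChem fingerprint vectors):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: |A & B| / |A | B| over fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are trivially identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Hypothetical bit sets: indices of substructure features detected.
truth = {1, 4, 7, 9, 12}    # ground-truth structure
parsed = {1, 4, 7, 12, 15}  # structure recovered by the OCSR tool
print(tanimoto(truth, parsed))  # 4 shared / 6 distinct bits = 0.666...
```

<p>Scores near 1.0 indicate that most chemically significant features survived recognition even when the exact-match test failed.</p>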
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized three test sets collected from public sources.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set I</strong></td>
          <td>50 images</td>
          <td>Sourced from Google Image Search to vary styles/fonts.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set II</strong></td>
          <td>100 images</td>
          <td>Randomly selected ligands from the GLIDA database; ground truth via PubChem.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set III</strong></td>
          <td>212 images</td>
          <td>Extracted from 121 PubMed journal articles; specifically excludes non-chemical figures.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of several sequential processing steps:</p>
<ul>
<li><strong>De-noising:</strong> Uses <strong>GREYCstoration</strong>, an anisotropic smoothing algorithm, to regulate image noise.</li>
<li><strong>Segmentation:</strong> Uses an <strong>8-connectivity algorithm</strong> to group pixels. Components are classified as text or graphics based on height/area ratios.</li>
<li><strong>Line Detection (Modified Hough Transform):</strong>
<ul>
<li>Standard Hough Transform is modified to weight pixel pairs $(P_i, P_j)$ based on connectivity.</li>
<li><strong>Weight Function ($W_{ij}$):</strong>
$$W_{ij} = \begin{cases} n_{ij}(P_0 - x_{ij}) &amp; \text{if } x_{ij}/n_{ij} &gt; P_0 \\ 0 &amp; \text{otherwise} \end{cases}$$
Where $n_{ij}$ is the pixel count between points, $x_{ij}$ is the count of black pixels, and $P_0$ is a density threshold.</li>
</ul>
</li>
<li><strong>Wedge Bond Detection:</strong> Uses corner detection to identify candidate triangles, verifying that the geometric area matches the number of black pixels (i.e., the triangle is solid) and that the shape is isosceles.</li>
<li><strong>Chemical Spell Checker:</strong>
<ul>
<li>Calculates the Maximum Likelihood ($ML$) of a character string being a valid chemical word $T$ from a dictionary.</li>
<li><strong>Similarity Metric:</strong>
$$Sim(S_i, T_i) = 1 - \sqrt{\sum_{j=1}^{M} [I^{S_i}(j) - I^{T_i}(j)]^2}$$
Uses pixel-by-pixel intensity difference between the input segment $S$ and candidate template $T$.</li>
</ul>
</li>
</ul>
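<p>The spell checker&rsquo;s template matching can be sketched as follows: a toy version over per-position intensity values in $[0, 1]$, selecting the dictionary entry that maximizes the paper&rsquo;s similarity score (the profiles and entries here are hypothetical, and this stands in for the full maximum-likelihood selection):</p>

```python
import math

def similarity(seg, tmpl):
    """Sim(S, T) = 1 - sqrt(sum_j (I_S(j) - I_T(j))^2), per the paper,
    over per-position intensities in [0, 1]."""
    return 1.0 - math.sqrt(sum((s - t) ** 2 for s, t in zip(seg, tmpl)))

def correct(segment, dictionary):
    """Pick the dictionary template most similar to the OCR'd segment."""
    return max(dictionary, key=lambda word: similarity(segment, dictionary[word]))

# Toy intensity profiles (hypothetical, 4 positions per glyph string).
dictionary = {"OMe": [0.9, 0.1, 0.8, 0.2], "OEt": [0.9, 0.7, 0.1, 0.6]}
noisy = [0.85, 0.15, 0.75, 0.25]
print(correct(noisy, dictionary))  # -> OMe
```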
<h3 id="models">Models</h3>
<ul>
<li><strong>Character Recognition:</strong> Uses the open-source <strong>GOCR</strong> library. It employs template matching based on features like holes, pixel densities, and transitions.</li>
<li><strong>Chemical Dictionary:</strong> A lookup table containing <strong>770</strong> frequently used chemical abbreviations and fundamental valence rules.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using exact structure matching and fingerprint similarity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Set III)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>% Correct</strong></td>
          <td><strong>30.2%</strong></td>
          <td>17%</td>
          <td>Exact structure match using ChemAxon JChem.</td>
      </tr>
      <tr>
          <td><strong>Avg Similarity</strong></td>
          <td><strong>0.740</strong></td>
          <td>0.526</td>
          <td>Tanimoto similarity on PubChem Substructure Fingerprints.</td>
      </tr>
      <tr>
          <td><strong>Precision (Rings)</strong></td>
          <td><strong>0.87</strong></td>
          <td>0.84</td>
          <td>Precision rate for recognizing ring systems.</td>
      </tr>
      <tr>
          <td><strong>Recall (Rings)</strong></td>
          <td><strong>0.83</strong></td>
          <td>0.73</td>
          <td>Recall rate for recognizing ring systems.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> C++ implementation running on MS Windows.</li>
<li><strong>Dependencies:</strong> GOCR (OCR), GREYCstoration (Image processing).</li>
</ul>
]]></content:encoded></item><item><title>Chemical Machine Vision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</guid><description>Machine vision approach using Gabor wavelets and Kohonen networks to classify chemical raster images and extract structural metadata.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gkoutos, G. V., Rzepa, H., Clark, R. M., Adjei, O., &amp; Johal, H. (2003). Chemical Machine Vision: Automated Extraction of Chemical Metadata from Raster Images. <em>Journal of Chemical Information and Computer Sciences</em>, 43(5), 1342-1355. <a href="https://doi.org/10.1021/ci034017n">https://doi.org/10.1021/ci034017n</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 2003</p>
<h2 id="paper-classification-methodological-approach">Paper Classification: Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline applying &ldquo;machine vision&rdquo; techniques (Gabor wavelets and Kohonen networks) to the problem of identifying chemical diagrams in low-resolution raster images. The paper focuses on the &ldquo;how&rdquo; (the algorithm and its parameters) and validates the method through quantitative experiments optimizing feature vectors and masks.</p>
<h2 id="motivation-extracting-legacy-chemical-data">Motivation: Extracting Legacy Chemical Data</h2>
<p>The primary motivation is to unlock the &ldquo;large amount of data&rdquo; trapped in legacy raster images (GIF, JPEG) on the Web that lack semantic metadata.</p>
<ul>
<li><strong>Legacy Data Problem</strong>: Most chemical structural information on the Web is embedded in raster images, not machine-readable formats like Molfiles.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous tools like Kekule and CLiDE acted as &ldquo;Chemical OCR,&rdquo; attempting to reconstruct exact atom-bond connections. This required high-resolution images (&gt;300 dpi) and human intervention, making them unsuitable for automated Web crawling of low-resolution (72-96 dpi) images.</li>
<li><strong>Goal</strong>: To create a low-cost, automated tool for a &ldquo;robot-based Internet resource discovery tool&rdquo; that can classify images (e.g., &ldquo;is this a molecule?&rdquo;).</li>
</ul>
<h2 id="core-innovation-texture-recognition-over-structural-ocr">Core Innovation: Texture Recognition over Structural OCR</h2>
<p>The core novelty is the shift from &ldquo;Optical Character Recognition&rdquo; (exact reconstruction) to <strong>&ldquo;Texture Recognition&rdquo;</strong> (classification).</p>
<ul>
<li><strong>Texture-Based Approach</strong>: The authors treat chemical diagrams as textures. They use <strong>Gabor wavelets</strong> to extract texture features. <strong>Crucially, this system does not recognize specific chemical structures</strong> (i.e., atom-bond connectivity tables, <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, or Molfiles). It only classifies images into broad categories.</li>
<li><strong>Incremental Learning</strong>: The system uses a <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong> combined with Class Boundary Analysis (CBA). This allows for &ldquo;incremental learning,&rdquo; where new classes (e.g., aromatic vs. non-aromatic) can be added without retraining the entire system.</li>
<li><strong>Optimization for Chemistry</strong>: The authors identify specific parameters (frequency channels, mask sizes) that are optimal for the &ldquo;texture&rdquo; of chemical diagrams.</li>
<li><strong>Integration with ChemDig</strong>: The method was designed to feed into ChemDig, a robot-based index engine for automated web crawling and metadata generation.</li>
</ul>
<h2 id="experimental-setup-parameter-optimization">Experimental Setup: Parameter Optimization</h2>
<p>The authors performed optimization and validation experiments using a dataset of <strong>300 images</strong> divided into three classes: Ring Systems, Non-Ring Systems, and Non-Chemistry (textures, biological figures, etc.).</p>
<ol>
<li><strong>Parameter Optimization</strong>: They systematically varied hyperparameters to find the optimal configuration:
<ul>
<li><strong>Feature Vector Size</strong>: Tested sizes from 100 to 4000 elements.</li>
<li><strong>Energy Mask Size</strong>: Tested windows from $3 \times 3$ to $15 \times 15$ pixels.</li>
<li><strong>Frequency Channels</strong>: Tested seven spatial frequencies ($\sqrt{2}$ to $64\sqrt{2}$).</li>
</ul>
</li>
<li><strong>Classification Performance</strong>: Evaluated the system&rsquo;s ability to classify unseen test images using a 50:50 training/test split.</li>
<li><strong>Comparison</strong>: Qualitatively compared the approach against vectorization tools (Autotrace, CR2V).</li>
</ol>
<h2 id="results-robust-classification-of-low-resolution-images">Results: Robust Classification of Low-Resolution Images</h2>
<ul>
<li><strong>Optimal Configuration</strong>: The system performed best with a feature vector size of ~1500 elements, a $9 \times 9$ energy mask, and frequency channel $4\sqrt{2}$.</li>
<li><strong>High Accuracy</strong>: Achieved a recognition rate of <strong>91%</strong> with a 50:50 training/test split, and up to <strong>92%</strong> with a 70:30 split.</li>
<li><strong>Robustness</strong>: The system successfully distinguished between chemical and non-chemical images (zero false negatives for chemical images).</li>
<li><strong>Limitations</strong>: Misclassifications occurred between &ldquo;ring&rdquo; and &ldquo;non-ring&rdquo; systems when structures had similar visual &ldquo;textures&rdquo; (e.g., similar density or layout).</li>
<li><strong>Impact</strong>: The method is viable for automating metadata generation (e.g., <code>alt</code> tags) for web crawlers, functioning as a coarse-grained filter before more expensive processing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom dataset of raster images collected from the Web.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><strong>Custom Web Dataset</strong></td>
          <td>300 images</td>
          <td>Split into 3 classes: Ring Systems, Non-Ring Systems, Non-Chemistry.</td>
      </tr>
      <tr>
          <td>Resolution</td>
          <td><strong>Low-Res Web Images</strong></td>
          <td>72-96 dpi</td>
          <td>Deliberately chosen to mimic Web conditions where OCR fails.</td>
      </tr>
      <tr>
          <td>Format</td>
          <td><strong>Raster</strong></td>
          <td>GIF, JPEG</td>
          <td>Typical web formats.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The core pipeline consists of a <strong>Gabor Transform Unit</strong> followed by a <strong>Training/Classification Unit</strong>.</p>
<ul>
<li><strong>Gabor Wavelets</strong>: Used for feature extraction. The 2D Gabor wavelet equation is:
$$h(x,y)=\exp\left\{-\frac{1}{2}\left[\frac{x^{2}}{\sigma_{x}^{2}}+\frac{y^{2}}{\sigma_{y}^{2}}\right]\right\}\cos(2\pi\mu_{\sigma}x+\phi)$$
<ul>
<li><strong>Bank Structure</strong>: 28 filters total (4 orientations $\times$ 7 radial frequencies).</li>
<li><strong>Orientations</strong>: $0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}$.</li>
<li><strong>Frequencies</strong>: 1 octave apart, specifically $\sqrt{2}, 2\sqrt{2}, \dots, 64\sqrt{2}$.</li>
<li><strong>Selected Frequency</strong>: $4\sqrt{2}$ was found to be optimal for chemistry.</li>
</ul>
</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Buffer Mounting</strong>: Images are mounted in a buffer (set to 0) to handle edge artifacts.</li>
<li><strong>Look-Up-Tables (LUT/LUF)</strong>: A binary Look-Up-Frame (LUF) indicates Regions of Interest (ROI) to avoid computing empty space; values are stored in a Look-Up-Table (LUT) to prevent re-computation of overlapping windows.</li>
</ul>
</li>
<li><strong>Feature Extraction</strong>:
<ul>
<li><strong>Non-linear Thresholding</strong>: $\psi(t) = \tanh(\alpha t)$ with $\alpha = 0.25$.</li>
<li><strong>Energy Function</strong>: Calculated as average absolute deviation from the mean using a window $W_{xy}$.
$$e_{k}(x,y)=\frac{1}{M^{2}}\sum_{(a,b)\in W_{xy}}|\psi(r_{k}(a,b))|$$</li>
<li><strong>Optimal Window</strong>: $9 \times 9$ pixels.</li>
</ul>
</li>
</ul>
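<p>The feature pipeline above can be sketched in pure Python. This is a toy illustration only: the per-pixel filter response is crudely approximated by a pointwise product rather than a full convolution, and the $\sigma$ and frequency values are assumptions, not the paper&rsquo;s tuned parameters.</p>

```python
import math

ALPHA = 0.25  # non-linear threshold parameter from the paper

def gabor(x, y, sigma_x, sigma_y, freq, phi=0.0):
    """2D Gabor wavelet h(x, y) per the paper's equation."""
    envelope = math.exp(-0.5 * (x**2 / sigma_x**2 + y**2 / sigma_y**2))
    return envelope * math.cos(2.0 * math.pi * freq * x + phi)

def energy(responses, m):
    """e_k: mean of |psi(r)| over an M x M window, psi(t) = tanh(alpha * t)."""
    return sum(abs(math.tanh(ALPHA * r)) for r in responses) / (m * m)

# Toy 9x9 patch (the optimal window size): a vertical line of ink.
M = 9
patch = [[1.0 if c == M // 2 else 0.0 for c in range(M)] for r in range(M)]

# Stand-in per-pixel responses for one filter of the bank
# (4 orientations x 7 frequencies in the real system).
responses = [patch[r][c] * gabor(c - M // 2, r - M // 2, 2.0, 2.0, 0.25)
             for r in range(M) for c in range(M)]
print(round(energy(responses, M), 4))
```

<p>Concatenating such energies across all 28 filters yields the texture feature vector fed to the classifier.</p>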
<h3 id="models">Models</h3>
<p>The classification model relies on competitive learning.</p>
<ul>
<li><strong>Architecture</strong>: <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong>.</li>
<li><strong>Training</strong>:
<ul>
<li><strong>Learning Rate</strong>: Starts at 1.0, decreases to 0.1.</li>
<li><strong>Class Boundary Analysis (CBA)</strong>: Computes the centroid (mean) and variance of each cluster. The variance defines the class boundary.</li>
</ul>
</li>
<li><strong>Classification Metric</strong>: <strong>Euclidean Distance Norm</strong>. An unknown vector is classified based on the shortest distance to a cluster center, provided it falls within the variance boundary.
$$D_{ij}=||x_{i}-x_{j}||$$</li>
</ul>
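<p>The classification rule can be sketched as nearest-centroid matching gated by each cluster&rsquo;s variance boundary (the labels and toy clusters below are illustrative, not the paper&rsquo;s trained map):</p>

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(vec, clusters):
    """clusters: {label: (centroid, boundary)}. Assign to the nearest
    centroid whose variance boundary contains the vector; None means
    'unknown' -- a candidate class for incremental learning."""
    best, best_d = None, float("inf")
    for label, (centroid, boundary) in clusters.items():
        d = euclidean(vec, centroid)
        if d <= boundary and d < best_d:
            best, best_d = label, d
    return best

clusters = {"ring": ((0.0, 0.0), 1.5), "non-ring": ((5.0, 5.0), 1.5)}
print(classify((0.5, 0.5), clusters))    # -> ring
print(classify((10.0, 10.0), clusters))  # -> None (outside all boundaries)
```

<p>The boundary gate is what makes incremental learning possible: vectors falling outside every boundary can seed a new class without disturbing existing clusters.</p>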
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using recognition rate ($R_s$) and misclassification error ($E_s$).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recognition Rate</td>
          <td><strong>91%</strong></td>
          <td>N/A</td>
          <td>Achieved with 50:50 split. 92% with 70:30 split.</td>
      </tr>
      <tr>
          <td>Feature Size</td>
          <td><strong>~1500</strong></td>
          <td>4000</td>
          <td>Reducing vector size from 4000 to 1500 maintained ~80% accuracy while improving speed.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gkoutosChemicalMachineVision2003,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}: {{Automated Extraction}} of {{Chemical Metadata}} from {{Raster Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Gkoutos, Georgios V. and Rzepa, Henry and Clark, Richard M. and Adjei, Osei and Johal, Harpal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2003</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{43}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1342--1355}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci034017n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Literature Data Extraction: The CLiDE Project</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</guid><description>Seminal OCSR system converting scanned chemical diagrams into connection tables via primitive recognition and semantic interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ibison, P., Jacquot, M., Kam, F., Neville, A. G., Simpson, R. W., Tonnelier, C., Venczel, T., &amp; Johnson, A. P. (1993). Chemical Literature Data Extraction: The CLiDE Project. <em>Journal of Chemical Information and Computer Sciences</em>, 33(3), 338-344. <a href="https://doi.org/10.1021/ci00013a010">https://doi.org/10.1021/ci00013a010</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 1993</p>
<h2 id="contribution-and-taxonomy">Contribution and Taxonomy</h2>
<p><strong>Classification: Method ($\Psi_{\text{Method}}$)</strong></p>
<p>This methodological paper proposes a novel software architecture for Optical Chemical Structure Recognition (OCSR). It details specific algorithms for image segmentation, vectorization, and chemical interpretation, validated through the successful extraction of complex structures from literature.</p>
<h2 id="motivation-automating-literature-extraction">Motivation: Automating Literature Extraction</h2>
<p>The manual creation of chemical reaction databases is a time-consuming and expensive process requiring trained chemists to abstract information from literature. While commercial tools existed for interpreting isolated scanned structures (like Kekulé), there was a lack of systems capable of processing whole pages of journals (including embedded text, reaction schemes, and structures) without significant human intervention.</p>
<h2 id="core-innovation-a-three-phase-hybrid-architecture">Core Innovation: A Three-Phase Hybrid Architecture</h2>
<p>CLiDE introduces a comprehensive <strong>three-phase architecture</strong> (Recognition, Grouping, Interpretation) that integrates computer vision with chemical knowledge. Key novelties include:</p>
<ul>
<li><strong>Context-Aware Interpretation:</strong> The use of an extendable <strong>superatom database</strong> to resolve ambiguities in chemical text (e.g., expanding &ldquo;OAc&rdquo; or &ldquo;Me&rdquo; into connection tables).</li>
<li><strong>Hybrid Primitive Detection:</strong> A combination of contour coding for solid lines and a modified Hough transform specifically tuned for detecting dashed chemical bonds.</li>
<li><strong>Semantic Re-construction:</strong> A scoring system for bond-atom association that considers both distance and vector direction to handle poorly drawn structures.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<p>The authors validated the system on a set of &ldquo;difficult cases&rdquo; selected to test specific capabilities. These included:</p>
<ul>
<li><strong>Crossing Bonds:</strong> Structures where bonds intersect without a central atom (Fig. 9d, 9e).</li>
<li><strong>Stereochemistry:</strong> Identification of wedged, dashed, and wavy bonds.</li>
<li><strong>Generic Structures:</strong> Parsing generic text blocks (e.g., $R^1 = Me$) and performing substitutions.</li>
<li><strong>Accuracy Estimation:</strong> The authors report an approximate 90% recognition rate for distinct characters in literature scans.</li>
</ul>
<h2 id="results-and-structural-reconstruction">Results and Structural Reconstruction</h2>
<p>The system successfully generates connection tables (exported as MOLfiles or ChemDraw files) from scanned bitmaps. It effectively distinguishes between graphical primitives (wedges, lines) and text, accurately reconstructing stereochemistry and resolving superatom synonyms (e.g., converting &ldquo;MeO&rdquo; to &ldquo;OMe&rdquo;). The authors conclude that while character recognition depends heavily on image quality, the graphic primitive recognition is robust for lines above a threshold length.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Input Format:</strong> Binary bitmaps scanned from journal pages.</li>
<li><strong>Resolution:</strong> 300 dpi (generating ~1 MB per page).</li>
<li><strong>Superatom Database:</strong> A lookup table containing ~200 entries. Each entry includes:
<ul>
<li><strong>Valency/Charge:</strong> Explicit constraints (e.g., &ldquo;HO&rdquo; takes 1 bond, &ldquo;CO2&rdquo; takes 2).</li>
<li><strong>Bonding Index:</strong> Specifies which letter in the string serves as the attachment point (e.g., letter 2 for &ldquo;HO&rdquo;, letters 1 and 2 for &ldquo;CO2&rdquo;).</li>
<li><strong>Sub-Connection Table:</strong> The internal atomic representation of the group.</li>
</ul>
</li>
</ul>
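<p>A minimal sketch of such a lookup table (the field names, internal bond tables, and synonym map are illustrative reconstructions of the record structure described above; the real database held roughly 200 entries):</p>

```python
# Each record: valency, 1-based letter position(s) of the attachment
# point, and a sub-connection table (atom list + internal bonds as
# (atom_i, atom_j, order) triples). Values here are illustrative.
SUPERATOMS = {
    "HO":  {"valency": 1, "bond_at": [2], "atoms": ["H", "O"], "bonds": [(0, 1, 1)]},
    "CO2": {"valency": 2, "bond_at": [1, 2], "atoms": ["C", "O", "O"],
            "bonds": [(0, 1, 2), (0, 2, 1)]},
    "OMe": {"valency": 1, "bond_at": [1], "atoms": ["O", "C"], "bonds": [(0, 1, 1)]},
}
SYNONYMS = {"MeO": "OMe"}  # e.g. the group drawn left of its bond

def expand(label):
    """Resolve an abbreviation to its superatom record, normalizing synonyms."""
    entry = SUPERATOMS.get(SYNONYMS.get(label, label))
    if entry is None:
        raise KeyError(f"unknown superatom: {label}")
    return entry

print(expand("MeO")["valency"])  # -> 1
```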
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Primitive Recognition (Vectorization)</strong></p>
<ul>
<li><strong>Contour Coding:</strong> Uses the <strong>Ahronovitz-Bertier-Habib</strong> method to generate interpixel contours (directions N, S, E, W) for connected components.</li>
<li><strong>Polygonal Approximation:</strong> A method similar to <strong>Sklansky and Gonzalez</strong> breaks contours into &ldquo;fractions&rdquo;.
<ul>
<li><em>Rule:</em> Long sides are &ldquo;straight fractions&rdquo;; consecutive short sides are &ldquo;curved fractions&rdquo;.</li>
<li><em>Reconstruction:</em> Parallel fractions are paired to form bond borders. If a border is split (due to noise or crossing lines), the system attempts to merge collinear segments.</li>
</ul>
</li>
<li><strong>Dash Detection:</strong> A <strong>modified Hough transform</strong> is applied to small connected components. It requires at least <strong>three collinear dashes</strong> to classify a sequence as a dashed bond.</li>
</ul>
<p><strong>2. Interpretation Rules</strong></p>
<ul>
<li><strong>Bond-Atom Association:</strong>
<ul>
<li><em>Candidate Selection:</em> The system identifies $m$ closest bonds for a superatom requiring $n$ connections ($m \ge n$).</li>
<li><em>Scoring Function:</em> Connections are selected based on minimizing <strong>perpendicular distance</strong> (alignment).</li>
</ul>
</li>
<li><strong>Crossing Bonds:</strong> Resolved using rules based on <strong>proximity, length, collinearity, and ring membership</strong> to distinguish actual crossings from central carbon atoms.</li>
</ul>
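<p>The bond-atom scoring step can be sketched as ranking candidate bonds by the perpendicular distance from the superatom&rsquo;s attachment point to each bond&rsquo;s supporting line (a simplified stand-in for CLiDE&rsquo;s full distance-plus-direction score; the geometry helpers are assumptions):</p>

```python
import math

def perp_distance(atom, bond_end, bond_dir):
    """Perpendicular distance from the atom's attachment point to the
    bond's supporting line: small when the bond 'points at' the atom."""
    ax, ay = atom[0] - bond_end[0], atom[1] - bond_end[1]
    dx, dy = bond_dir
    return abs(ax * dy - ay * dx) / math.hypot(dx, dy)

def associate(atom, bonds, n):
    """From m >= n candidate bonds (each (free_end, direction)), keep the
    n best aligned with the atom."""
    return sorted(bonds, key=lambda b: perp_distance(atom, b[0], b[1]))[:n]

aligned = ((1.0, 0.0), (-1.0, 0.0))  # bond ending near and pointing at the atom
offset = ((1.0, 1.0), (-1.0, 0.0))   # parallel bond passing the atom by
print(associate((0.0, 0.0), [offset, aligned], 1))  # picks the aligned bond
```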
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR:</strong> A neural network trained on alphanumeric characters.
<ul>
<li><strong>Input Representation:</strong> Density matrices derived from character bitmaps.</li>
<li><strong>Post-processing:</strong> Unrecognized characters are flagged for manual correction.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> SUN SPARC workstation.</li>
<li><strong>Scanner:</strong> Agfa Focus S 800GS.</li>
<li><strong>Implementation Language:</strong> C++.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ibisonChemicalLiteratureData1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction: {{The CLiDE Project}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ibison, P. and Jacquot, M. and Kam, F. and Neville, A. G. and Simpson, R. W. and Tonnelier, C. and Venczel, T. and Johnson, A. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{338--344}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00013a010}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Automatic Recognition of Chemical Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</guid><description>A rule-based system for extracting chemical structure information from raster images, validated against commercial baselines.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-image-mining-architecture">Contribution: Rule-Based Image Mining Architecture</h2>
<p><strong>$\Psi_{\text{Method}}$ (Methodological Basis)</strong></p>
<p>This is a methodological paper describing a system architecture for <strong>image mining</strong> in the chemical domain. It focuses on the engineering challenge of converting rasterized depictions of molecules into computer-readable SDF files. The paper details the algorithmic pipeline and validates it through quantitative benchmarking against a commercial alternative.</p>
<h2 id="motivation-digitizing-chemical-literature">Motivation: Digitizing Chemical Literature</h2>
<ul>
<li><strong>Loss of Information</strong>: Chemical drawing software produces structures as images; when these are published in the scientific literature, their chemical significance is lost, leaving the data &ldquo;dead&rdquo; to computers.</li>
<li><strong>Gap in Technology</strong>: Image mining lags behind advances in text mining, and earlier commercial solutions (such as CLIDE) either faded away or remained limited.</li>
<li><strong>Scale of Problem</strong>: The colossal production of chemical documents requires automated tools to exploit this information at large scale.</li>
</ul>
<h2 id="core-innovation-graph-preserving-vectorization">Core Innovation: Graph-Preserving Vectorization</h2>
<ul>
<li><strong>Graph-Preserving Vectorization</strong>: The system uses a custom vectorizer designed to preserve the &ldquo;graph characteristics&rdquo; of chemical diagrams (1 vector = 1 line), which avoids creating spurious vectors at thick joints. It aims to generate a mathematical graph, $G = (V, E)$, mapped geometrically to the image lines.</li>
<li><strong>Chemical Knowledge Integration</strong>: A distinct module validates the reconstructed graph against chemical rules (valences, charges) to ensure the output is chemically valid.</li>
<li><strong>Hybrid Processing</strong>: The system splits the image into &ldquo;connected components&rdquo; for an OCR path (text/symbols) and a &ldquo;body&rdquo; path (bonds), reassembling them later.</li>
</ul>
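<p>The &ldquo;1 vector = 1 line&rdquo; idea can be sketched as follows: endpoints that coincide within a tolerance merge into shared graph vertices, so three drawn sides of a ring fragment yield exactly three edges. The tolerance and coordinates below are illustrative, not the paper&rsquo;s parameters.</p>

```python
def build_graph(vectors, tol=2.0):
    """Map line vectors to a graph G = (V, E): endpoints become vertices
    (merged when closer than tol), vectors become edges."""
    verts, edges = [], []

    def vertex_id(p):
        for i, q in enumerate(verts):
            if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= tol ** 2:
                return i
        verts.append(p)
        return len(verts) - 1

    for a, b in vectors:
        edges.append((vertex_id(a), vertex_id(b)))
    return verts, edges

# Three slightly misaligned segments forming a closed triangle.
segments = [((0, 0), (10, 0)), ((10, 1), (5, 8)), ((5, 9), (0, 1))]
verts, edges = build_graph(segments)
print(len(verts), edges)  # 3 vertices; edges (0,1), (1,2), (2,0)
```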
<h2 id="methodology--experiments-benchmark-validation">Methodology &amp; Experiments: Benchmark Validation</h2>
<p>The authors performed a quantitative validation using <strong>three different databases</strong> where ground-truth SDF files were available. They also compared their system against the commercial tool <strong>CLIDE</strong> (Chemical Literature Data Extraction).</p>
<ul>
<li><strong>Database 1</strong>: 100 images (varied line widths/fonts)</li>
<li><strong>Database 2</strong>: 100 images</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale batch processing)</li>
</ul>
<h2 id="results--conclusions-superior-accuracy-over-baselines">Results &amp; Conclusions: Superior Accuracy over Baselines</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved <strong>94%</strong> correct reconstruction on Database 1 and <strong>77%</strong> on Database 2. Accuracy was measured as correct recovery of identical geometry and connections.</li>
</ul>
<p>$$ \text{Acc} = \frac{\text{Correct Images}}{\text{Total Images}} $$</p>
<ul>
<li><strong>Baseline Superiority</strong>: The commercial tool CLIDE successfully reconstructed only ~50% of the images in Database 1, compared to the authors&rsquo; 94%.</li>
<li><strong>Scalability</strong>: On the large dataset (Database 3), the system achieved <strong>67%</strong> accuracy in batch mode.</li>
<li><strong>Robustness</strong>: The authors state the system uses a handful of parameters and works robustly across different image types. CLIDE lacked flexibility and required manual intervention.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Formally Reproducible. As is common with applied research from this era, the source code, training models (SVM), and specific datasets used for benchmarking do not appear to be publicly maintained or available.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">No public code, models, or datasets were released with this 2007 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Used for comparison with CLIDE; 94% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>77% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale test; 67% success rate</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper outlines a 5-module pipeline:</p>
<ol>
<li><strong>Pre-processing</strong>: Adaptive histogram binarization and non-recursive connected component labeling using RLE segments.</li>
<li><strong>OCR</strong>: A &ldquo;chemically oriented OCR&rdquo; using wavelet functions for feature extraction and a <strong>Support Vector Machine (SVM)</strong> for classification. It distinguishes characters from molecular structure.</li>
<li><strong>Vectorizer</strong>: Assigns local directions to RLE segments and groups them into patterns. Crucially, it enforces a one-to-one mapping between image lines and graph vectors.</li>
<li><strong>Reconstruction</strong>: A rule-based module that annotates vectors:
<ul>
<li><strong>Stereochemistry</strong>: Registers vectors against original pixels; thick geometric forms (triangles) become chiral wedges.</li>
<li><strong>Dotted Bonds</strong>: Identifies isolated vectors and clusters them using <strong>quadtree clustering</strong>.</li>
<li><strong>Multi-bonds</strong>: Identifies parallel vectors within a dilated bounding box (factor of 2).</li>
</ul>
</li>
<li><strong>Chemical Knowledge</strong>: Validates the graph valences and properties before exporting SDF.</li>
</ol>
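<p>The non-recursive connected-component labeling over RLE segments (module 1) can be sketched with a union-find that merges overlapping runs in adjacent rows. This is a minimal illustration under assumed conventions, not the authors&rsquo; implementation; all names are invented:</p>

```python
def runs(row):
    """Extract (start, end) runs of 1-pixels from a binary row (RLE segments)."""
    out, start = [], None
    for x, v in enumerate(list(row) + [0]):
        if v and start is None:
            start = x
        elif not v and start is not None:
            out.append((start, x - 1))
            start = None
    return out


def label_components(image):
    """Non-recursive connected-component labeling over row-wise RLE segments.

    Two runs in vertically adjacent rows are joined when their column spans
    overlap. Returns the number of connected components.
    """
    parent = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    prev = []  # (start, end, label) runs of the previous row
    for row in image:
        cur = []
        for s, e in runs(row):
            lbl = len(parent)
            parent.append(lbl)
            for ps, pe, pl in prev:
                if s <= pe and e >= ps:  # column spans overlap
                    union(pl, lbl)
            cur.append((s, e, lbl))
        prev = cur
    return len({find(i) for i in range(len(parent))})
```

<p>Processing runs instead of individual pixels is what keeps the pass non-recursive and memory-light, which matters for large scanned pages.</p>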
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM</strong>: Used in the OCR module to classify text/symbols. It supports dynamic training to correct classification mistakes.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the percentage of correctly reconstructed images (generating a valid, matching SDF file).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>System Value (DB1)</th>
          <th>Baseline (CLIDE)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reconstruction Accuracy</td>
          <td><strong>94%</strong></td>
          <td>~50%</td>
          <td>CLIDE noted as unsuitable for batch processing</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., &amp; Hofmann-Apitius, M. (2007). Automatic Recognition of Chemical Images. <em>Eighth Mexican International Conference on Current Trends in Computer Science</em>, 41-46. <a href="https://doi.org/10.1109/ENC.2007.25">https://doi.org/10.1109/ENC.2007.25</a></p>
<p><strong>Publication</strong>: ENC 2007 (IEEE Computer Society)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriAutomaticRecognitionChemical2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic {{Recognition}} of {{Chemical Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Eighth {{Mexican International Conference}} on {{Current Trends}} in {{Computer Science}} ({{ENC}} 2007)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{41--46}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ENC.2007.25}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>αExtractor: Chemical Info from Biomedical Literature</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</guid><description>αExtractor uses ResNet-Transformer to extract chemical structures from literature images, including noisy and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="methodological-contribution-a-robust-optical-recognition-system">Methodological Contribution: A Robust Optical Recognition System</h2>
<p>This is primarily a <strong>Method</strong> ($\Psi_{\text{Method}}$) paper with a significant secondary <strong>Resource</strong> ($\Psi_{\text{Resource}}$) contribution (see the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> for more on these categories).</p>
<p>The dominant methodological contribution is a ResNet-Transformer recognition architecture that outperforms existing OCSR tools across multiple benchmarks through robustness engineering: training on 20 million synthetic images with aggressive augmentation to handle degraded image conditions. The work answers the core methodological question &ldquo;How well does this work?&rdquo; through extensive benchmarking against existing OCSR tools and ablation studies validating architectural choices.</p>
<p>The secondary resource contribution comes from releasing αExtractor as a freely available web service, correcting labeling errors in standard benchmarks (CLEF, UOB, JPO), and providing an end-to-end document processing pipeline for biomedical literature mining.</p>
<h2 id="motivation-extracting-visual-chemical-knowledge-from-biomedical-literature">Motivation: Extracting Visual Chemical Knowledge from Biomedical Literature</h2>
<p>The motivation addresses a familiar pain point in chemical informatics within a biomedical context. Vast amounts of chemical knowledge in biomedical literature exist only as images, such as molecular structures embedded in figures, chemical synthesis schemes, and compound diagrams. This visual knowledge remains effectively invisible to computational methods, which creates a massive bottleneck for drug discovery research, systematic reviews, and large-scale chemical database construction.</p>
<p>Existing OCSR tools face two critical problems when applied to biomedical literature:</p>
<ol>
<li>
<p><strong>Real-world image quality</strong>: Biomedical papers often contain low-resolution figures, images with complex backgrounds, noise from scanning/digitization, and inconsistent drawing styles across different journals and decades of publications.</p>
</li>
<li>
<p><strong>End-to-end extraction</strong>: Most OCSR systems assume the presence of clean, cropped molecular images. In practice, you need to first find the molecular structures within multi-panel figures, reaction schemes, and dense document layouts before you can recognize them.</p>
</li>
</ol>
<p>The authors argue that a practical literature mining system needs to solve both problems simultaneously via robust recognition under noisy conditions and automated detection of molecular images within complex documents.</p>
<h2 id="core-innovation-robust-resnet-transformer-architecture">Core Innovation: Robust ResNet-Transformer Architecture</h2>
<p>The core innovation lies in combining a competition-winning recognition architecture with extensive robustness engineering and end-to-end document processing. The key contributions include:</p>
<ol>
<li>
<p><strong>ResNet-Transformer Recognition Model</strong>: The core recognition system uses a <strong>Residual Neural Network (ResNet)</strong> encoder paired with a <strong>Transformer decoder</strong> in an image-captioning framework. This architecture won first place in a Kaggle molecular translation competition, which provided a strong foundation for the recognition task. Let the input image be $I$. The model maximizes the joint likelihood of the SMILES tokens $T$ and coordinate sequences $X, Y$:
$$
\begin{aligned}
\mathcal{L}_{\text{total}} = - \sum_{i=1}^{L} \log P(T_i \mid I, T_{&lt;i}) - \lambda \sum_{i=1}^{L} \big(\log P(X_i \mid I, X_{&lt;i}) + \log P(Y_i \mid I, Y_{&lt;i})\big)
\end{aligned}
$$
Here, the continuous atom coordinates $X$ and $Y$ are discretized into 200 bins, casting coordinate prediction as a standard classification task alongside SMILES generation.</p>
</li>
<li>
<p><strong>Enhanced Molecular Representation</strong>: The model produces an augmented representation that encompasses:</p>
<ul>
<li>Standard molecular connectivity information</li>
<li><strong>Bond type tokens</strong> (solid wedge bonds, dashed bonds, etc.) that preserve 3D stereochemical information</li>
<li><strong>Atom coordinate predictions</strong> that allow reconstruction of the exact molecular pose from the original image</li>
</ul>
<p>This dual prediction of discrete structure and continuous coordinates keeps the output faithful to the source depiction and enables better quality assessment.</p>
</li>
<li>
<p><strong>Massive Synthetic Training Dataset</strong>: The model was trained on approximately <strong>20 million synthetic molecular images</strong> generated from PubChem SMILES with aggressive data augmentation. The augmentation strategy randomized visual styles, image quality, and rendering parameters to create maximum diversity, ensuring the network rarely saw the same molecular depiction twice. This forces the model to learn robust, style-invariant features.</p>
</li>
<li>
<p><strong>End-to-End Document Processing Pipeline</strong>: αExtractor integrates <strong>object detection</strong> and <strong>structure recognition</strong> into a complete document mining system:</p>
<ul>
<li>An object detection model automatically locates molecular images within PDF documents</li>
<li>The recognition model converts detected images to structured representations</li>
<li>A web service interface makes the entire pipeline accessible to researchers without machine learning expertise</li>
</ul>
</li>
<li>
<p><strong>Robustness-First Design</strong>: The system was explicitly designed to handle degraded image conditions that break traditional OCSR tools, including low resolution, background interference, color variations, and scanning artifacts commonly found in legacy biomedical literature.</p>
</li>
</ol>
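<p>The joint objective quoted above can be sketched as a sum of negative log-likelihoods over the three token streams. This is a schematic with invented function names; $\lambda$ is treated as an assumed weighting hyperparameter:</p>

```python
import math


def sequence_nll(logps):
    """Negative log-likelihood of one token sequence from per-step log-probs."""
    return -sum(logps)


def total_loss(smiles_logps, x_logps, y_logps, lam=1.0):
    """L_total = NLL(SMILES) + lam * (NLL(X-coords) + NLL(Y-coords)),
    mirroring the objective above; lam is an assumed hyperparameter."""
    return sequence_nll(smiles_logps) + lam * (
        sequence_nll(x_logps) + sequence_nll(y_logps)
    )
```

<p>Because the coordinate bins are discrete classes, all three terms reduce to ordinary cross-entropy in training, so no separate regression head is needed.</p>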
<h2 id="experimental-methodology-stress-testing-under-real-world-conditions">Experimental Methodology: Stress Testing under Real-World Conditions</h2>
<p>The evaluation focused on demonstrating robust performance across diverse image conditions, from pristine benchmarks to challenging real-world scenarios:</p>
<ol>
<li>
<p><strong>Benchmark Dataset Evaluation</strong>: αExtractor was tested on four standard OCSR benchmarks:</p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
</ul>
<p>Performance was measured using exact SMILES match accuracy.</p>
</li>
<li>
<p><strong>Error Analysis and Dataset Correction</strong>: During evaluation, the researchers discovered numerous labeling errors in the original benchmark datasets. They systematically identified and corrected these errors, then re-evaluated all methods on the cleaned datasets to get more accurate performance measurements.</p>
</li>
<li>
<p><strong>Robustness Stress Testing</strong>: The system was evaluated on two challenging datasets specifically designed to test robustness:</p>
<ul>
<li><strong>Color background images</strong> (200 samples): Molecular structures on complex, colorful backgrounds that simulate real figure conditions</li>
<li><strong>Low-quality images</strong> (200 samples): Degraded images with noise, blur, and artifacts typical of scanned documents</li>
</ul>
<p>These tests compared αExtractor against three open-source tools (OSRA, MolVec, and Imago) under realistic degradation conditions.</p>
</li>
<li>
<p><strong>Generalization Testing</strong>: In the most challenging experiment, αExtractor was tested on the <strong>DECIMER hand-drawn molecule images dataset</strong> (Brinkhaus et al., 2022), representing a completely different visual domain not represented in the training data. This tested whether the learned features could generalize beyond digital rendering styles to human-drawn chemistry.</p>
</li>
<li>
<p><strong>End-to-End Document Extraction</strong>: The complete pipeline was evaluated on 50 PDF files containing 2,336 molecular images. This tested both the object detection component (finding molecules in complex documents) and the recognition component (converting them to SMILES) in a realistic literature mining scenario.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured to demonstrate the practical efficiency needed for large-scale document processing.</p>
</li>
</ol>
<h2 id="results--conclusions-strong-performance-on-degraded-images">Results &amp; Conclusions: Strong Performance on Degraded Images</h2>
<ul>
<li>
<p><strong>Substantial Accuracy Gains</strong>: On the four benchmark datasets, αExtractor achieved accuracies of 91.83% (CLEF), 98.47% (UOB), 88.67% (JPO), and 93.64% (USPTO), compared to previous best results of 84.6%, 90.0%, 72.2%, and 89.9% respectively. After correcting dataset labeling errors, the true accuracies were even higher, reaching <strong>95.77% on CLEF, 99.86% on UOB, and 92.44% on JPO</strong>.</p>
</li>
<li>
<p><strong>Robustness on Degraded Images</strong>: Open-source competitors struggled on degraded images (achieving 5.5% accuracy at best). αExtractor maintained <strong>over 90% accuracy</strong> on both color background and low-quality image datasets, demonstrating the effectiveness of the synthetic training strategy.</p>
</li>
<li>
<p><strong>Generalization to Hand-Drawn Molecules</strong>: On hand-drawn molecules, a domain completely absent from training data, αExtractor achieved <strong>61.4% accuracy</strong> while other tools scored between 0.69% and 2.93%. This suggests the model learned genuinely chemical features rather than style-specific patterns.</p>
</li>
<li>
<p><strong>Practical End-to-End Performance</strong>: In the complete document processing evaluation, αExtractor detected <strong>95.1% of molecular images</strong> (2,221 out of 2,336) and correctly recognized <strong>94.5% of detected structures</strong> (2,098 correct predictions). This demonstrates the system&rsquo;s readiness for real-world literature mining applications.</p>
</li>
<li>
<p><strong>Ablation Results</strong>: Ablation experiments confirmed that each architectural component (ResNet backbone, Transformer encoder, Transformer decoder) contributes to performance, with the Transformer decoder having the largest impact. Replacing the Transformer decoder with an LSTM decoder substantially reduced accuracy (Table S6 in the paper).</p>
</li>
<li>
<p><strong>Dataset Quality Issues</strong>: The systematic discovery of labeling errors in standard benchmarks highlights a broader problem in OCSR evaluation. The corrected datasets provide more reliable baselines for future method development.</p>
</li>
<li>
<p><strong>Spatial Layout Limitation</strong>: αExtractor correctly identifies molecular connectivity, but the re-rendered structures may have different spatial layouts than the originals. This could complicate visual verification for complex molecules, even if the chemical information remains accurate.</p>
</li>
<li>
<p><strong>Non-Standard Depiction Handling</strong>: For images with non-standard bond depictions or atomic valences, αExtractor correctly identifies and normalizes them to standard representations. While chemically accurate, this means the re-rendered structure may visually differ from the original image.</p>
</li>
</ul>
<p>Overall, αExtractor combines accurate recognition (over 90% on degraded images), end-to-end document processing, and strong generalization across image conditions. It targets large-scale literature mining tasks where previous tools struggled with degraded inputs. The focus on real-world robustness over benchmark optimization reflects a practical approach to deploying machine learning in scientific workflows.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is <strong>Partially Reproducible</strong>. While the authors detail the model architectures and training techniques, the source code, training dataset (20M synthetic images), and pre-trained weights remain closed-source and proprietary. The authors released a sample of their test data and host an online web server for running inference.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/CLEF_corrected">Corrected CLEF Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the CLEF benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/UOB_corrected">Corrected UOB Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the UOB benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/JPO_corrected">Corrected JPO Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the JPO benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Colored_Background">Color Background Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of molecular structures on complex, colorful backgrounds.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Low_Quality">Low Quality Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of degraded images with noise, blur, and artifacts.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/PDF">PDF Test Set</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Sample PDF files for end-to-end document extraction evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://extractor.alphama.com.cn/csr">αExtractor Web Server</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Online service for running inference using the proprietary system.</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Backbone:</strong> ResNet50 producing output of shape $2048 \times 19 \times 19$, projected to 512 channels via a feed-forward layer</li>
<li><strong>Transformer Architecture:</strong> 3 encoder layers and 3 decoder layers with hidden dimension of 512</li>
<li><strong>Output Format:</strong> Generates SMILES tokens plus two auxiliary coordinate sequences (X-axis and Y-axis) that are length-aligned with the SMILES tokens via padding</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Architecture:</strong> DETR (Detection Transformer) with ResNet101 backbone</li>
<li><strong>Transformer Architecture:</strong> 6 encoder layers and 6 decoder layers with hidden dimension of 256</li>
<li><strong>Purpose:</strong> Locates molecular images within PDF pages before recognition</li>
</ul>
<p><strong>Coordinate Prediction:</strong></p>
<ul>
<li>Continuous X/Y coordinates are discretized into <strong>200 discrete bins</strong></li>
<li>Padding tokens added to coordinate sequences to align perfectly with SMILES token sequence, enabling simultaneous structure and pose prediction</li>
</ul>
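<p>The 200-bin coordinate discretization can be illustrated as follows, assuming coordinates are first normalized to $[0, 1]$; the exact binning scheme is our assumption:</p>

```python
def discretize(coord, n_bins=200):
    """Map a coordinate normalized to [0, 1] onto one of n_bins class
    indices, turning coordinate regression into classification."""
    idx = int(coord * n_bins)
    return min(n_bins - 1, max(0, idx))  # clamp the upper boundary into range
```

<p>For example, <code>discretize(0.5)</code> yields bin 100, and the boundary value 1.0 is clamped into the last bin rather than overflowing.</p>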
<h3 id="data">Data</h3>
<p><strong>Training Data:</strong></p>
<ul>
<li><strong>Synthetic Generation:</strong> Python script rendering PubChem SMILES into 2D images</li>
<li><strong>Dataset Size:</strong> Approximately 20.3 million synthetic molecular images from PubChem</li>
<li><strong>Superatom Handling:</strong> 50% of molecules had functional groups replaced with superatoms (e.g., &ldquo;COOH&rdquo;) or generic labels (R1, X1) to match literature drawing conventions</li>
<li><strong>Rendering Augmentation:</strong> Randomized bond thickness, bond spacing, font size, font color, and padding size</li>
</ul>
<p><strong>Geometric Augmentation:</strong></p>
<ul>
<li>Shear along x-axis: $\pm 15^\circ$</li>
<li>Rotation: $\pm 15^\circ$</li>
<li>Piecewise affine scaling</li>
</ul>
<p><strong>Noise Injection:</strong></p>
<ul>
<li>Pepper noise: 0-2%</li>
<li>Salt noise: 0-40%</li>
<li>Gaussian noise: scale 0-0.16</li>
</ul>
<p><strong>Destructive Augmentation:</strong></p>
<ul>
<li>JPEG compression: severity levels 2-5</li>
<li>Random masking</li>
</ul>
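<p>The salt-and-pepper noise injection above can be sketched in plain Python; the fractions follow the ranges quoted, but the sampling details are assumptions, not the authors&rsquo; pipeline:</p>

```python
import random


def add_salt_pepper(pixels, salt_frac=0.4, pepper_frac=0.02, seed=0):
    """Apply salt (white) and pepper (black) noise to a flat grayscale
    pixel list in [0, 255]. Each pixel is independently flipped to black
    with probability pepper_frac, to white with probability salt_frac."""
    rng = random.Random(seed)
    out = []
    for p in pixels:
        r = rng.random()
        if r < pepper_frac:
            out.append(0)      # pepper: black speck
        elif r < pepper_frac + salt_frac:
            out.append(255)    # salt: white speck
        else:
            out.append(p)      # unchanged
    return out
```

<p>Applying this kind of corruption at training time is what forces the recognizer to stay accurate on scanned, artifact-laden figures.</p>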
<p><strong>Evaluation Datasets:</strong></p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
<li><strong>Color background images</strong>: 200 samples</li>
<li><strong>Low-quality images</strong>: 200 samples</li>
<li><strong>Hand-drawn structures</strong>: Test set for generalization</li>
<li><strong>End-to-end document extraction</strong>: 50 PDFs (567 pages, 2,336 molecular images)</li>
</ul>
<h3 id="training">Training</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 100</li>
<li><strong>Epochs:</strong> 5</li>
<li><strong>Loss Function:</strong> Cross-entropy loss for both SMILES prediction and coordinate prediction</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 24</li>
<li><strong>Training Strategy:</strong> Pre-trained on synthetic &ldquo;Lower Quality&rdquo; data for 5 epochs, then fine-tuned on annotated real &ldquo;High Quality&rdquo; data for 30 epochs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics:</strong></p>
<ul>
<li><strong>Recognition</strong>: SMILES accuracy (exact match)</li>
<li><strong>End-to-End Pipeline</strong>:
<ul>
<li><strong>Recall</strong>: 95.1% for detection</li>
<li><strong>Accuracy</strong>: 94.5% for recognition</li>
</ul>
</li>
</ul>
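<p>Exact-match SMILES accuracy reduces to string equality after canonicalization. The sketch below uses an identity placeholder for the canonicalizer; a real pipeline would parse each string and re-emit canonical SMILES with a cheminformatics toolkit such as RDKit:</p>

```python
def exact_match_accuracy(preds, refs, canonicalize=lambda s: s):
    """Fraction of predictions whose canonical form equals the reference's.
    `canonicalize` is identity here; swap in a real SMILES canonicalizer
    so chemically equivalent strings are not counted as mismatches."""
    assert len(preds) == len(refs)
    hits = sum(canonicalize(p) == canonicalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)
```

<p>Without canonicalization, two valid SMILES for the same molecule (e.g. different atom orderings) would be scored as errors, understating true accuracy.</p>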
<h3 id="hardware">Hardware</h3>
<p><strong>Inference Hardware:</strong></p>
<ul>
<li>Cloud CPU server (8 CPUs, 64 GB RAM)</li>
<li><strong>Throughput:</strong> Processed 50 PDFs (567 pages) in 40 minutes</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xiong, J., Liu, X., Li, Z., Xiao, H., Wang, G., Niu, Z., Fei, C., Zhong, F., Wang, G., Zhang, W., Fu, Z., Liu, Z., Chen, K., Jiang, H., &amp; Zheng, M. (2023). αExtractor: a system for automatic extraction of chemical information from biomedical literature. <em>Science China Life Sciences</em>, 67(3), 618-621. <a href="https://doi.org/10.1007/s11427-023-2388-x">https://doi.org/10.1007/s11427-023-2388-x</a></p>
<p><strong>Publication</strong>: Science China Life Sciences (2023)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.1007/s11427-023-2388-x">Paper on Springer</a></li>
</ul>
]]></content:encoded></item><item><title>MolRec: Rule-Based OCSR System at TREC 2011 Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</guid><description>Rule-based system for optical chemical structure recognition using vectorization and geometric analysis, achieving 95% accuracy on TREC 2011.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-ocsr-system">Contribution: Rule-Based OCSR System</h2>
<p>This is a <strong>Method</strong> paper that presents and validates MolRec, a rule-based system for Optical Chemical Structure Recognition (OCSR). While the paper emphasizes performance analysis on the TREC 2011 benchmark, the core contribution is the system architecture itself: a multi-stage pipeline using vectorization, geometric rule-based analysis, and graph construction to convert chemical diagram images into machine-readable MOL files.</p>
<h2 id="motivation-robust-conversion-of-chemical-diagrams">Motivation: Robust Conversion of Chemical Diagrams</h2>
<p>Chemical molecular diagrams are ubiquitous in scientific documents across chemistry and life sciences. Converting these static raster images into machine-readable formats (like MOL files) that encode precise spatial and connectivity information is important for cheminformatics applications such as database indexing, similarity searching, and automated literature mining.</p>
<p>While pixel-based pattern matching approaches exist, they struggle with variations in drawing style, image quality, and diagram complexity. An approach that can handle the geometric and topological diversity of real-world chemical diagrams is needed.</p>
<h2 id="novelty-vectorization-and-geometric-rules">Novelty: Vectorization and Geometric Rules</h2>
<p>MolRec uses a <strong>vectorization and geometric rule-based pipeline</strong>. Key technical innovations include:</p>
<p><strong>Disk-Growing Heuristic for Wedge Bonds</strong>: A novel dynamic algorithm to distinguish wedge bonds from bold lines. A disk with radius greater than the average line width is placed inside the connected component and grown to the largest size that still covers only foreground pixels. The disk is then walked in the direction that allows it to continue growing. When it can grow no more, the base of the triangle (stereo-center) has been located, identifying the wedge orientation.</p>
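<p>The core primitive of the disk-growing heuristic, the largest foreground-only disk at a point, can be sketched directly. This is a simplified brute-force version, not MolRec&rsquo;s implementation; walking the disk toward where it keeps growing would then locate the wedge base:</p>

```python
def max_disk_radius(mask, cx, cy):
    """Largest integer radius r such that the disk centred at (cx, cy)
    contains only foreground pixels of `mask` (a 2D list of 0/1 rows)."""
    h, w = len(mask), len(mask[0])
    r = 0
    while True:
        nr = r + 1
        for dy in range(-nr, nr + 1):
            for dx in range(-nr, nr + 1):
                if dx * dx + dy * dy <= nr * nr:
                    y, x = cy + dy, cx + dx
                    # any out-of-bounds or background pixel stops the growth
                    if not (0 <= y < h and 0 <= x < w) or not mask[y][x]:
                        return r
        r = nr
```

<p>Inside a wedge the inscribed disk keeps growing toward the wide base, so the point of maximal radius marks the stereo-center end; over a thin bond line the radius stays near the stroke width everywhere, which is what separates the two cases.</p>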
<p><strong>Joint Breaking Strategy</strong>: Explicitly breaking all connected joints in the vectorization stage to avoid combinatorial connection complexity. This allows uniform treatment of all line segment connections regardless of junction complexity.</p>
<p><strong>Superatom Dictionary Mining</strong>: The system mines MOL files from the OSRA dataset to build a comprehensive superatom dictionary (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;), supplemented by the Marvin abbreviation collection.</p>
<p><strong>Comprehensive Failure Analysis</strong>: Unlike most OCSR papers that report only aggregate accuracy, this work provides a detailed categorization of all 55 failures, identifying 61 specific error reasons and their root causes.</p>
<h2 id="methodology-and-trec-2011-experiments">Methodology and TREC 2011 Experiments</h2>
<p><strong>Benchmark</strong>: The system was evaluated on the <strong>TREC 2011 Chemical Track</strong> test set consisting of 1,000 molecular diagram images. The authors performed two independent runs with slightly different internal parameter settings to assess reproducibility.</p>
<p><strong>Evaluation Metric</strong>: Correct recall of chemical structures. Output MOL files were compared semantically to ground truth using <strong>OpenBabel</strong>, which treats syntactically different but chemically equivalent representations as matches.</p>
<p><strong>Failure Analysis</strong>: Across both runs, 55 unique diagrams were misrecognized (50 in run 1, 51 in run 2, with significant overlap). The authors manually examined all 55 and categorized them, identifying 61 specific reasons for mis-recognition. This analysis provides insight into systematic limitations of the rule-based approach.</p>
<h2 id="results-and-top-failure-modes">Results and Top Failure Modes</h2>
<p><strong>High Accuracy</strong>: MolRec achieved a <strong>95% correct recovery rate</strong> on the TREC 2011 benchmark:</p>
<ul>
<li>Run 1: 950/1000 structures correctly recognized (95.0%)</li>
<li>Run 2: 949/1000 structures correctly recognized (94.9%)</li>
</ul>
<p>The near-identical results across runs with slightly different internal parameters demonstrate the stability of the rule-based approach.</p>
<p><strong>Top Failure Modes</strong> (from detailed analysis of 55 unique misrecognized diagrams, yielding 61 total error reasons):</p>
<ul>
<li><strong>Dashed wedge bond misidentification (15 cases)</strong>: Most common failure. Short dashes at the narrow end were interpreted as a separate dashed bond while longer dashes were treated as a dashed wedge or dashed bold bond, splitting one bond into two with a spurious node.</li>
<li><strong>Incorrect stereochemistry (10 cases)</strong>: Heuristics guessed wrong 3D orientations for ambiguous bold/dashed bonds where syntax alone is insufficient.</li>
<li><strong>Touching components (6 cases)</strong>: Characters touching bonds, letters touching symbols, or ink bleed between close parallel lines caused segmentation failures.</li>
<li><strong>Incorrect character grouping (5 cases)</strong>: Characters too close together for reliable separation.</li>
<li><strong>Solid circles without 3D hydrogen bond (5 cases)</strong>: MolRec correctly interprets solid circles as implying a hydrogen atom via a solid wedge bond, but some solution MOL files in the test set omit this bond, causing a mismatch.</li>
<li><strong>Diagram caption confusion (5 cases)</strong>: Captions appearing within images are mistakenly parsed as part of the molecular structure.</li>
<li><strong>Unrecognised syntax (5 cases)</strong>: User annotations, unusual notations (e.g., wavy line crossing a dashed wedge), and repetition structures.</li>
<li><strong>Broken characters (3 cases)</strong>: Degraded or partial characters without recovery mechanisms.</li>
<li><strong>Connectivity of superatoms (3 cases)</strong>: Ambiguous permutation of connection points for multi-bonded superatoms.</li>
<li><strong>Problematic bridge bonds (3 cases)</strong>: Extreme perspective or angles outside MolRec&rsquo;s thresholds.</li>
<li><strong>Unhandled bond type (1 case)</strong>: A dashed dative bond not previously encountered.</li>
</ul>
<p><strong>System Strengths</strong>:</p>
<ul>
<li>Douglas-Peucker line simplification proves faster and more robust than Hough transforms across different drawing styles</li>
<li>Disk-growing wedge bond detection effectively distinguishes 3D orientations in most cases</li>
<li>Mining MOL files for superatom dictionary captures real-world chemical abbreviation usage patterns</li>
</ul>
<p><strong>Fundamental Limitations Revealed</strong>:</p>
<ul>
<li><strong>Brittleness</strong>: Small variations in drawing style or image quality can cause cascading failures</li>
<li><strong>Stereochemistry ambiguity</strong>: Even humans disagree on ambiguous cases; automated resolution based purely on syntax is inherently limited</li>
<li><strong>Segmentation dependence</strong>: Most failures trace back to incorrect separation of text, bonds, and graphical elements</li>
<li><strong>No error recovery</strong>: Early-stage mistakes propagate through the pipeline with no mechanism for correction</li>
</ul>
<p><strong>Test Set Quality Issues</strong>: The paper also highlights several cases where the TREC 2011 ground truth itself was questionable. Some solution MOL files omitted stereo bond information for solid circle notations, dative (polar) bonds were inconsistently interpreted as either double bonds or single bonds across the training and test sets, and one diagram contained over-connected carbon atoms (5 bonds without the required positive charge indication) that the solution MOL file did not flag.</p>
<p>The systematic error analysis reveals what 95% accuracy means in practice. The failure modes highlight scalability challenges for rule-based systems when applied to diverse real-world documents with noise, artifacts, and non-standard conventions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dictionary Mining</td>
          <td>OSRA Dataset</td>
          <td>Unknown</td>
          <td>Mined to create superatom dictionary for abbreviations like &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;</td>
      </tr>
      <tr>
          <td>Dictionary</td>
          <td>Marvin Collection</td>
          <td>N/A</td>
          <td>Integrated Marvin abbreviation group collection for additional superatoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 Test Set</td>
          <td>1,000 images</td>
          <td>Standard benchmark for Text REtrieval Conference Chemical Track</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The MolRec pipeline consists of sequential image processing and graph construction stages:</p>
<p><strong>1. Preprocessing</strong></p>
<ul>
<li><strong>Binarization</strong>: Input image converted to binary</li>
<li><strong>Connected Component Labeling</strong>: Identifies distinct graphical elements</li>
<li><strong>OCR</strong>: Simple metric space-based engine identifies characters (letters $L$, digits $N$, symbols $S$)</li>
<li><strong>Character Grouping</strong>: Spatial proximity and type-based heuristics group characters:
<ul>
<li>Horizontal: Letter-Letter, Digit-Digit, Letter-Symbol</li>
<li>Vertical: Letter-Letter only</li>
<li>Diagonal: Letter-Digit, Letter-Charge</li>
</ul>
</li>
</ul>
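<p>The grouping rules above can be sketched as a small lookup over character classes and relative placement. This is an illustrative reconstruction: the class codes follow the paper, but the box-center geometry and the 0.3 slope cutoffs are assumptions, not MolRec&rsquo;s actual thresholds.</p>

```python
# Hypothetical sketch of the type-based grouping heuristics described above.
# Classes: L = letter, N = digit, S = symbol, C = charge sign.
ALLOWED = {
    "horizontal": {("L", "L"), ("N", "N"), ("L", "S")},
    "vertical":   {("L", "L")},
    "diagonal":   {("L", "N"), ("L", "C")},
}

def direction(a, b):
    """Classify the relative placement of two character box centers (cx, cy)."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    if dy < 0.3 * dx:
        return "horizontal"
    if dx < 0.3 * dy:
        return "vertical"
    return "diagonal"

def may_group(char_a, char_b):
    """char = (cx, cy, cls); True if the pair may be merged into one group."""
    d = direction(char_a[:2], char_b[:2])
    return (char_a[2], char_b[2]) in ALLOWED[d]
```

<p>For example, a digit diagonally below-right of a letter (a subscript, as in &ldquo;H2&rdquo; written with a lowered 2) is groupable, while two vertically stacked digits are not.</p>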
<p><strong>2. Vectorization (Line Finding)</strong></p>
<ul>
<li><strong>Image Thinning</strong>: Reduce lines to unit width</li>
<li><strong>Douglas-Peucker Algorithm</strong>: Simplify polylines into straight line segments</li>
<li><strong>Joint Breaking</strong>: Explicitly split lines at junctions where $&gt;2$ segments meet, avoiding combinatorial connection complexity</li>
</ul>
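<p>The Douglas-Peucker step is a standard algorithm; a minimal recursive version (not MolRec&rsquo;s implementation) keeps the polyline&rsquo;s endpoints and recurses on the farthest outlier until every dropped point lies within <code>eps</code> of the simplified line:</p>

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    num = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)
    den = math.hypot(x2 - x1, y2 - y1)
    return num / den if den else math.hypot(x - x1, y - y1)

def douglas_peucker(points, eps):
    """Simplify a polyline: keep endpoints, recurse on the farthest outlier."""
    if len(points) < 3:
        return list(points)
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:idx + 1], eps)
    right = douglas_peucker(points[idx:], eps)
    return left[:-1] + right  # drop the duplicated split point
```

<p>On a thinned bond rendered as many tiny segments, this collapses the polyline to its corner points, which is why the CLEF paper reports tuning the tolerance to roughly 1-2x the average line width.</p>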
<p><strong>3. Bond Recognition Rules</strong></p>
<p>After erasing text from the image, remaining line segments are analyzed:</p>
<ul>
<li><strong>Double/Triple Bonds</strong>: Cluster segments with same slope within threshold distance</li>
<li><strong>Dashed Bonds</strong>: Identify repeated short segments of similar length with collinear center points</li>
<li><strong>Wedge/Bold Bonds</strong>: Dynamic disk algorithm:
<ul>
<li>Place disk with radius $&gt;$ average line width inside component</li>
<li>Grow disk to maximum size to locate triangle base (stereo-center)</li>
<li>&ldquo;Walk&rdquo; disk to find narrow end, distinguishing wedge orientation</li>
</ul>
</li>
<li><strong>Wavy Bonds</strong>: Identify sawtooth pattern polylines after thinning</li>
<li><strong>Implicit Nodes</strong>: Split longer segments at points where parallel shorter segments terminate (carbon atoms in chains)</li>
</ul>
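<p>The dashed-bond rule can be approximated in a few lines: several short segments of similar length whose center points fall on a common line. The length and collinearity tolerances here are illustrative placeholders; the paper does not publish its exact thresholds.</p>

```python
import math

def _centers_collinear(segs, tol=0.1):
    """Check the segment midpoints lie (nearly) on one line."""
    mids = [((x1 + x2) / 2, (y1 + y2) / 2) for (x1, y1), (x2, y2) in segs]
    (ax, ay), (bx, by) = mids[0], mids[-1]
    span = math.hypot(bx - ax, by - ay)
    if span == 0:
        return False
    for mx, my in mids[1:-1]:
        # distance of a midpoint from the line through the first/last midpoints
        d = abs((by - ay) * mx - (bx - ax) * my + bx * ay - by * ax) / span
        if d > tol * span:
            return False
    return True

def is_dashed_bond(segs, len_tol=0.3):
    """Paper's rule: repeated short segments of similar length with
    collinear center points (illustrative tolerances)."""
    if len(segs) < 3:
        return False
    lengths = [math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in segs]
    mean = sum(lengths) / len(lengths)
    if any(abs(l - mean) > len_tol * mean for l in lengths):
        return False
    return _centers_collinear(segs)
```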
<p><strong>4. Graph Construction</strong></p>
<ul>
<li><strong>Node Formation</strong>: Group line segment endpoints by distance threshold</li>
<li><strong>Disambiguation</strong>: Logic separates lowercase &ldquo;l&rdquo;, uppercase &ldquo;I&rdquo;, digit &ldquo;1&rdquo;, and vertical bonds</li>
<li><strong>Superatom Expansion</strong>: Replace abbreviations with full structures using mined dictionary</li>
<li><strong>Stereochemistry Resolution</strong>: Heuristics based on neighbor counts determine direction for ambiguous bold/dashed bonds (known limitation)</li>
</ul>
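<p>The node-formation step reduces to single-link clustering of segment endpoints under a distance threshold. A minimal union-find sketch with averaged node positions (the threshold value is an assumption, and MolRec&rsquo;s actual grouping logic is not published):</p>

```python
import math

def cluster_endpoints(points, thresh):
    """Merge endpoints closer than `thresh` into one graph node,
    returning the averaged position of each node."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= thresh:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return [
        (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
        for g in groups.values()
    ]
```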
<p><strong>5. MOL File Generation</strong></p>
<ul>
<li>Final graph structure converted to standard MOL file format</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Recall</td>
          <td>950/1000</td>
          <td>949/1000</td>
          <td>Slightly different internal parameters between runs</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>95.0%</td>
          <td>94.9%</td>
          <td>Semantic comparison using OpenBabel</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison Method</strong>: OpenBabel converts graphs to MOL files and compares them semantically to ground truth, ignoring syntactic variations that don&rsquo;t affect chemical meaning.</p>
<p><strong>Failure Categorization</strong>: 55 unique misrecognized diagrams analyzed across both runs, identifying 61 specific error reasons across 11 categories including dashed wedge bond misidentification (15), incorrect stereochemistry (10), touching components (6), incorrect character grouping (5), solid circles (5), diagram caption confusion (5), unrecognised syntax (5), broken characters (3), superatom connectivity (3), problematic bridge bonds (3), and unhandled bond type (1).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://openbabel.org/">Open Babel</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Used for semantic MOL file comparison</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/projects/osra/">OSRA</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Source of superatom dictionary data (MOL files mined)</td>
      </tr>
      <tr>
          <td>TREC 2011 Chemical Track</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>1,000 molecular diagram images (available via NIST)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The MolRec source code is not publicly available. The evaluation dataset (TREC 2011) is accessible through NIST, and the tools used for comparison (OpenBabel) are open source. However, full reproduction of MolRec&rsquo;s pipeline would require reimplementation from the paper&rsquo;s descriptions.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Details</strong>: Not explicitly specified in the paper</li>
<li><strong>Performance Note</strong>: Vectorization approach noted as &ldquo;proven to be fast&rdquo; compared to Hough transform alternatives</li>
</ul>
<h3 id="references">References</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawiPerformanceMolRecTREC2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of {{MolRec}} at {{TREC}} 2011 {{Overview}} and {{Analysis}} of {{Results}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 20th {{Text REtrieval Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sadawi, Noureddin M. and Sexton, Alan P. and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2011). Performance of MolRec at TREC 2011 Overview and Analysis of Results. <em>Proceedings of the 20th Text REtrieval Conference</em>. <a href="https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf">https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf</a></p>
<p><strong>Publication</strong>: TREC 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openbabel.org/">Open Babel</a> - Used for semantic MOL file comparison</li>
<li><a href="https://sourceforge.net/projects/osra/">OSRA Project</a> - Source of superatom dictionary data</li>
</ul>
]]></content:encoded></item><item><title>MolRec: Chemical Structure Recognition at CLEF 2012</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</guid><description>MolRec achieves 95%+ accuracy on simple structures but struggles with complex diagrams, revealing rule-based OCSR limits and systematic failures.</description><content:encoded><![CDATA[<h2 id="systematization-of-rule-based-ocsr">Systematization of Rule-Based OCSR</h2>
<p>This is a <strong>Systematization</strong> paper that evaluates and analyzes MolRec&rsquo;s performance in the CLEF 2012 chemical structure recognition competition. The work provides systematic insights into how the improved MolRec system performed on different types of molecular diagrams and reveals structural challenges facing rule-based OCSR approaches through comprehensive failure analysis.</p>
<h2 id="investigating-the-limits-of-rule-based-recognition">Investigating the Limits of Rule-Based Recognition</h2>
<p>This work builds on the TREC 2011 competition, where a previous implementation of MolRec already performed well. The CLEF 2012 competition provided an opportunity to test an improved, more computationally efficient version of MolRec on different datasets and understand how performance varies across complexity levels.</p>
<p>The motivation is to understand exactly where rule-based chemical structure recognition breaks down. Examining the specific types of structures that cause failures provides necessary context for the high accuracy rates achieved on simpler structures.</p>
<h2 id="the-two-stage-molrec-architecture">The Two-Stage MolRec Architecture</h2>
<p>The novelty lies in the systematic evaluation across two different difficulty levels and the comprehensive failure analysis. The authors tested an improved MolRec implementation that was more efficient than the TREC 2011 version, providing insights into both system evolution and the inherent challenges of chemical structure recognition.</p>
<p><strong>MolRec Architecture Overview</strong>: The system follows a two-stage pipeline approach:</p>
<ol>
<li>
<p><strong>Vectorization Stage</strong>: The system preprocesses input images through three steps:</p>
<ul>
<li><strong>Image binarization</strong> using Otsu&rsquo;s method to convert grayscale images to black and white, followed by labelling of connected components</li>
<li><strong>OCR processing</strong> using nearest neighbor classification with a Euclidean metric to identify and remove text components (atom labels, charges, etc.)</li>
<li><strong>Separation of bond elements</strong>: thinning connected components to single-pixel width, building polyline representations, detecting circles, arrows, and solid triangles, and applying the Douglas-Peucker line simplification algorithm to clean up vectorized bonds</li>
</ul>
</li>
<li>
<p><strong>Rule Engine Stage</strong>: A set of 18 chemical rules converts geometric primitives into molecular graphs:</p>
<ul>
<li><strong>Bridge bond recognition</strong> (2 rules applied before all others, handling structures with multiple connection paths depicted in 2.5-dimensional perspective drawings)</li>
<li><strong>Standard bond and atom recognition</strong> (16 rules applied in arbitrary order)</li>
<li><strong>Context-aware disambiguation</strong> resolving ambiguities using the full graph structure and character groups</li>
<li><strong>Superatom expansion</strong> looking up character groups identifying more than one atom in a dictionary and replacing them with molecule subgraphs</li>
</ul>
</li>
</ol>
<p>The system can output results in standard formats like MOL files or SMILES strings.</p>
<h2 id="clef-2012-experimental-design">CLEF 2012 Experimental Design</h2>
<p>The CLEF 2012 organizers provided a set of 961 test images clipped from patent documents, split into two sets:</p>
<ol>
<li>
<p><strong>Automatic Evaluation Set (865 images)</strong>: Images selected for automatic evaluation by comparison of generated MOL files with ground truth MOL files using the OpenBabel toolkit.</p>
</li>
<li>
<p><strong>Manual Evaluation Set (95 images)</strong>: A more challenging collection of images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation. This set was intentionally included to provide a greater challenge.</p>
</li>
</ol>
<p>The authors ran MolRec four times with slightly adjusted internal parameters, then manually examined every incorrect recognition to categorize error types.</p>
<h2 id="performance-divergence-and-critical-failure-modes">Performance Divergence and Critical Failure Modes</h2>
<p>The results reveal a stark performance gap between simple and complex molecular structures:</p>
<p><strong>Performance on Automatic Evaluation Set</strong>: On the 865-image set, MolRec achieved <strong>94.91% to 96.18% accuracy</strong> across four runs with different parameter settings. A total of 46 different diagrams were mis-recognized across all runs.</p>
<p><strong>Performance on Manual Evaluation Set</strong>: On the 95-image set, accuracy dropped to <strong>46.32% to 58.95%</strong>. A total of 52 different diagrams were mis-recognized across all runs. Some diagrams failed for multiple reasons.</p>
<p><strong>Key Failure Modes Identified</strong> (with counts from the paper&rsquo;s Table 3):</p>
<ul>
<li>
<p><strong>Character Grouping</strong> (26 manual, 0 automatic): An implementation bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, so $R_{21}$ was incorrectly recognized as $R_{211}$. A separate problem was difficulty correctly separating closely spaced atom groups.</p>
</li>
<li>
<p><strong>Touching Characters</strong> (8 manual, 1 automatic): The system does not handle touching characters, so overlapping characters cause mis-recognition.</p>
</li>
<li>
<p><strong>Four-Way Junction Failures</strong> (6 manual, 7 automatic): The vectorization process could not handle junctions where four lines meet, leading to incorrect connectivity.</p>
</li>
<li>
<p><strong>OCR Errors</strong> (5 manual, 11 automatic): Character recognition errors included &ldquo;G&rdquo; interpreted as &ldquo;O&rdquo;, &ldquo;alkyl&rdquo; being mis-recognized, and &ldquo;I&rdquo; interpreted as a vertical single bond.</p>
</li>
<li>
<p><strong>Missed Solid and Dashed Wedge Bonds</strong> (6 cases each in the automatic set, 0 manual): The system incorrectly recognized a number of solid wedge and dashed wedge bonds.</p>
</li>
<li>
<p><strong>Missed Wavy Bonds</strong> (2 manual, 1 automatic): Some wavy bonds were not recognized despite the dedicated wavy bond rule.</p>
</li>
<li>
<p><strong>Missed Charge Signs</strong> (1 manual, 2 automatic): While correctly recognizing positive charge signs, MolRec missed three negative charge signs, including one placed at the top left of an atom name.</p>
</li>
<li>
<p><strong>Other Errors</strong>: An atom too close to a bond endpoint was erroneously considered connected, a solid wedge bond too close to a closed node was incorrectly connected, and a dashed bold bond had its stereocentre incorrectly determined.</p>
</li>
</ul>
<p><strong>Dataset Quality Issues</strong>: The authors discovered 11 images where the ground truth MOL files were incorrect and MolRec&rsquo;s recognition was actually correct. As the authors note, such ground truth errors are very difficult to avoid in this complex task.</p>
<p><strong>Key Insights</strong>:</p>
<ul>
<li>
<p><strong>Performance gap between simple and complex structures</strong>: While MolRec achieved over 94% accuracy on standard molecular diagrams, the performance drop on the more challenging manual evaluation set (down to 46-59%) highlights the difficulty of real patent document images.</p>
</li>
<li>
<p><strong>Many errors are fixable</strong>: The authors note that many mis-recognition problems (such as the character grouping bug and four-way junction vectorization) can be solved with relatively simple enhancements.</p>
</li>
<li>
<p><strong>Touching character segmentation</strong> remains a notoriously difficult open problem that the authors plan to explore further.</p>
</li>
<li>
<p><strong>Evaluation challenges</strong>: The discovery of 11 incorrect ground truth MOL files illustrates how difficult it is to create reliable benchmarks for chemical structure recognition.</p>
</li>
</ul>
<p>The authors conclude that despite the high recognition rates on simpler structures, there is still plenty of room for improvement. They identify future work areas including recognition of more general Markush structures, robust charge sign spotting, and accurate identification of wedge bonds.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="system-architecture">System Architecture</h3>
<p><strong>Model Type</strong>: Non-neural, Rule-Based System (Vectorization Pipeline + Logic Engine)</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Datasets (CLEF 2012)</strong>: 961 total test images clipped from patent documents:</p>
<ul>
<li><strong>Automatic Evaluation Set</strong>: 865 images evaluated automatically using OpenBabel for exact structural matching of generated MOL files against ground truth</li>
<li><strong>Manual Evaluation Set</strong>: 95 images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation</li>
</ul>
<p><strong>Training Data</strong>: The paper does not describe the reference set used to build OCR character prototypes for the nearest neighbor classifier.</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Vectorization Pipeline</strong> (three steps):</p>
<ul>
<li><strong>Image Binarization</strong>: Otsu&rsquo;s method, followed by connected component labelling</li>
<li><strong>OCR</strong>: Nearest neighbor classification with Euclidean distance metric; recognized characters are removed from the image</li>
<li><strong>Bond Element Separation</strong>: Thinning to single-pixel width, polyline construction, Douglas-Peucker line simplification (threshold set to 1-2x average line width), detection of circles, arrows, and solid triangles</li>
</ul>
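<p>Otsu&rsquo;s binarization step is fully standard and can be reproduced independently of MolRec. A dependency-free sketch that picks the gray level maximizing between-class variance:</p>

```python
def otsu_threshold(pixels, levels=256):
    """Pick the threshold maximizing between-class variance (Otsu's method).
    `pixels` is a flat iterable of integer gray levels in [0, levels)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0 = sum0 / w0                    # mean of the dark class
        mu1 = (sum_all - sum0) / w1        # mean of the bright class
        var = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t  # binarize as: pixel > best_t -> foreground/background split
```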
<p><strong>Rule Engine</strong>: 18 chemical structure rules converting geometric primitives to molecular graphs:</p>
<ul>
<li><strong>Bridge Bond Rules (2 rules)</strong>: Applied before all other rules, handling bridge bond structures depicted in 2.5-dimensional perspective drawings</li>
<li><strong>Wavy Bond Rule</strong>: Detailed in paper, identifies approximately collinear connected line segments with zig-zag patterns ($n \geq 3$ segments)</li>
<li><strong>Standard Recognition Rules</strong>: 16 rules for bonds, atoms, and chemical features (applied in arbitrary order; most not detailed in this paper)</li>
</ul>
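<p>The wavy-bond rule (Rule 2.2, the only rule the paper details) can be sketched as follows: a connected polyline of at least three segments whose turn direction alternates at every interior vertex. The paper&rsquo;s additional requirement that the segments be approximately collinear overall is omitted here for brevity.</p>

```python
def is_wavy_bond(polyline, min_segments=3):
    """Hedged sketch of the wavy-bond test: a sawtooth polyline whose
    turn direction alternates at every interior vertex."""
    if len(polyline) - 1 < min_segments:
        return False
    crosses = []
    for (ax, ay), (bx, by), (cx, cy) in zip(polyline, polyline[1:], polyline[2:]):
        # z-component of the cross product of consecutive segment vectors:
        # sign gives the turn direction at vertex (bx, by)
        crosses.append((bx - ax) * (cy - by) - (by - ay) * (cx - bx))
    return all(c != 0 for c in crosses) and \
        all(c1 * c2 < 0 for c1, c2 in zip(crosses, crosses[1:]))
```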
<p><strong>Optimization</strong>: Performance tuned via manual adjustment of fuzzy and strict geometric threshold parameters.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Automated</strong>: Exact structural match via OpenBabel MOL file comparison</li>
<li><strong>Manual</strong>: Visual inspection by human experts for structures where OpenBabel fails</li>
</ul>
<p><strong>Results</strong>:</p>
<ul>
<li><strong>Automatic Evaluation Set (865 images)</strong>: 94.91% to 96.18% accuracy across four runs</li>
<li><strong>Manual Evaluation Set (95 images)</strong>: 46.32% to 58.95% accuracy across four runs</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Given the era (2012) and the algorithmic nature of the pipeline (Otsu binarization, thinning, geometric analysis), the system likely ran on standard CPUs.</p>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p><strong>Closed.</strong> No public code, data, or models are available. The paper outlines high-level logic (Otsu binarization, Douglas-Peucker simplification, 18-rule system) but does not provide:</p>
<ul>
<li>The complete specification of all 18 rules (only Rule 2.2 for wavy bonds is detailed)</li>
<li>Exact numerical threshold values for fuzzy/strict parameters used in the CLEF runs</li>
<li>OCR training data or character prototype specifications</li>
</ul>
<p>The authors refer readers to a separate 2012 DRR (SPIE) paper [5] for a more detailed overview of the MolRec system architecture.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012: Overview and Analysis of Results. Working Notes of CLEF 2012 Evaluation Labs and Workshop. CLEF. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Workshop (ImageCLEF Track)</p>
]]></content:encoded></item><item><title>MolNexTR: A Dual-Stream Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</guid><description>Dual-stream encoder combining ConvNext and ViT for robust optical chemical structure recognition across diverse molecular drawing styles.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Leung, C. T., Huang, Y., Sun, J., Chen, H., &amp; Gao, H. (2024). MolNexTR: a generalized deep learning model for molecular image recognition. <em>Journal of Cheminformatics</em>, 16(141). <a href="https://doi.org/10.1186/s13321-024-00926-w">https://doi.org/10.1186/s13321-024-00926-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/CYF2000127/MolNexTR">GitHub Repository</a></li>
<li><a href="https://huggingface.co/datasets/CYF200127/MolNexTR/tree/main">HuggingFace Dataset/Model</a></li>
</ul>
<h2 id="methodology-overview-and-taxonomic-classification">Methodology Overview and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$). It proposes a neural network architecture (MolNexTR) that integrates ConvNext and Vision Transformers to solve the Optical Chemical Structure Recognition (OCSR) task. The paper validates this method through ablation studies and benchmarking against existing methods including MolScribe and DECIMER.</p>
<h2 id="the-challenge-of-domain-specific-drawing-styles-in-ocsr">The Challenge of Domain-Specific Drawing Styles in OCSR</h2>
<p>Converting molecular images from chemical literature into machine-readable formats (SMILES) is critical but challenging due to the high variance in drawing styles, fonts, and conventions (e.g., Markush structures, abbreviations). Existing methods have limitations:</p>
<ul>
<li>CNN-based and ViT-based models often struggle to generalize across diverse, non-standard drawing styles found in real literature.</li>
<li>Pure ViT methods lack translation invariance and local feature representation, while pure CNNs struggle with global dependencies.</li>
<li>Many models predict SMILES strings directly, making it difficult to enforce chemical validity or resolve complex stereochemistry and abbreviations.</li>
</ul>
<h2 id="core-innovation-dual-stream-encoding-and-image-contamination">Core Innovation: Dual-Stream Encoding and Image Contamination</h2>
<p>MolNexTR introduces three main innovations:</p>
<ol>
<li><strong>Dual-Stream Encoder</strong>: A hybrid architecture processing images simultaneously through a ConvNext stream (for local features) and a Vision Transformer stream (for long-range dependencies), fusing them to capture multi-scale information.</li>
<li><strong>Image Contamination Augmentation</strong>: A specialized data augmentation algorithm that simulates real-world &ldquo;noise&rdquo; found in literature, such as overlapping text, arrows, and partial molecular fragments, to improve robustness.</li>
<li><strong>Graph-Based Decoding with Post-Processing</strong>: Unlike pure image-to-SMILES translation, it predicts atoms and bonds (graph generation) and uses a stereochemical discrimination and abbreviation self-correction module to enforce chemical rules (e.g., chirality) and resolve superatoms (e.g., &ldquo;Ph&rdquo;, &ldquo;Bn&rdquo;).</li>
</ol>
<p>The prediction of atom labels and coordinates is formulated as a conditional autoregressive generation task, optimized via a cross-entropy loss:
$$ \mathcal{L}_{\text{atom}} = -\sum_{t=1}^{T} \log P(x_t \mid \text{Image}, x_{&lt;t}) $$</p>
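<p>In code, this loss is just the summed negative log-probability the model assigns to each ground-truth token. A minimal framework-free sketch (a real implementation would use a batched cross-entropy over decoder logits):</p>

```python
import math

def autoregressive_nll(token_probs):
    """Negative log-likelihood of a token sequence under a conditional
    autoregressive model, matching the loss above. token_probs[t] is
    P(x_t | Image, x_<t) for the ground-truth token at step t."""
    return -sum(math.log(p) for p in token_probs)
```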
<h2 id="experimental-setup-benchmarking-on-synthetic-and-real-data">Experimental Setup: Benchmarking on Synthetic and Real Data</h2>
<p>The model was trained on synthetic data (PubChem) and real patent data (USPTO). It was evaluated on nine benchmarks (three synthetic, six real-world):</p>
<ul>
<li><strong>Synthetic</strong>: Indigo, ChemDraw, RDKit (rendered from 5,719 molecules)</li>
<li><strong>Real-World</strong>: CLEF, UOB, JPO, USPTO, Staker, and a newly curated ACS dataset (diverse styles)</li>
</ul>
<p><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec) and deep learning models (MolScribe, DECIMER, SwinOCSR, Img2Mol).</p>
<p><strong>Ablations</strong>: Tested the impact of the dual-stream encoder vs. single streams, and the contribution of individual augmentation strategies.</p>
<h2 id="empirical-results-and-robustness-findings">Empirical Results and Robustness Findings</h2>
<ul>
<li><strong>Performance</strong>: MolNexTR achieved 81-97% accuracy across test sets, outperforming the second-best method (often MolScribe) by margins of 0.3% to 10.0% (on the difficult ACS dataset).</li>
<li><strong>Perturbation resilience</strong>: The model maintained higher accuracy under image perturbations (rotation, noise) and &ldquo;curved arrow&rdquo; noise common in reaction mechanisms compared to MolScribe and DECIMER (Table 3).</li>
<li><strong>Ablation Results</strong>: The dual-stream encoder consistently outperformed single CNN or ViT baselines, and the image contamination algorithm significantly boosted performance on noisy real-world data (ACS).</li>
<li><strong>Limitations</strong>: The model still struggles with extremely complex hand-drawn molecules and mechanism diagrams where arrows or text are conflated with structure. The authors also note that R-group information in real literature often appears in separate text or tables, which the model does not incorporate.</li>
</ul>
<p><strong>Key Results (Table 2, SMILES exact match accuracy %)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolScribe</th>
          <th>MolNexTR</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Indigo</td>
          <td>97.5</td>
          <td>97.8</td>
          <td>+0.3</td>
      </tr>
      <tr>
          <td>ChemDraw</td>
          <td>93.8</td>
          <td>95.1</td>
          <td>+1.3</td>
      </tr>
      <tr>
          <td>RDKit</td>
          <td>94.6</td>
          <td>96.4</td>
          <td>+1.8</td>
      </tr>
      <tr>
          <td>CLEF</td>
          <td>88.3</td>
          <td>90.4</td>
          <td>+2.1</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td>87.9</td>
          <td>88.5</td>
          <td>+0.6</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>77.7</td>
          <td>82.1</td>
          <td>+4.4</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>92.6</td>
          <td>93.8</td>
          <td>+1.2</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>86.9</td>
          <td>88.3</td>
          <td>+1.4</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td>71.9</td>
          <td>81.9</td>
          <td>+10.0</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: ~1M molecules randomly selected from PubChem, rendered using RDKit and Indigo with varied styles (thickness, fonts, bond width)</li>
<li><strong>Real</strong>: 0.68M images from USPTO, with coordinates normalized from MOLfiles</li>
</ul>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Render Augmentation</strong>: Randomized drawing styles (line width, font size, label modes)</li>
<li><strong>Image Augmentation</strong>: Rotation, cropping, blurring, noise (Gaussian, salt-and-pepper)</li>
<li><strong>Molecular Augmentation</strong>: Randomly replacing functional groups with abbreviations (from a list of &gt;100) or complex chains (e.g., CH3CH2NH2); adding R-groups</li>
<li><strong>Image Contamination</strong>: Adding &ldquo;noise&rdquo; objects (arrows, lines, text, partial structures) at a minimum distance from the main molecule to simulate literature artifacts</li>
</ul>
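<p>As one concrete example of the image-augmentation step, a minimal salt-and-pepper corruption might look like the following sketch (the 5% default and the list-of-rows image format are illustrative assumptions, not the paper&rsquo;s settings):</p>

```python
import random

def salt_and_pepper(image, amount=0.05, rng=None):
    """Flip roughly a fraction `amount` of pixels to pure black (0) or
    white (255). `image` is a list of rows of gray values; the input is
    left untouched and a corrupted copy is returned."""
    rng = rng or random.Random(0)
    out = [row[:] for row in image]
    h, w = len(out), len(out[0])
    for _ in range(int(amount * h * w)):
        y, x = rng.randrange(h), rng.randrange(w)
        out[y][x] = rng.choice((0, 255))
    return out
```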
<h3 id="algorithms">Algorithms</h3>
<p><strong>Dual-Stream Encoder</strong>:</p>
<ul>
<li><strong>CNN Stream</strong>: ConvNext backbone (pre-trained on ImageNet), generating feature maps at scales $H/4$ to $H/32$</li>
<li><strong>ViT Stream</strong>: Parallel transformer blocks receiving patches of sizes $p=4, 8, 16, 32$. Uses Multi-Head Self-Attention (MHSA) and Feed-Forward Networks (FFN)</li>
<li><strong>Fusion</strong>: Outputs from both streams are concatenated</li>
</ul>
<p><strong>Decoder (Graph Generation)</strong>:</p>
<ul>
<li><strong>Transformer Decoder</strong>: 6 layers, 8 heads, hidden dim 256</li>
<li><strong>Task 1 (Atoms)</strong>: Autoregressive prediction of atom tokens $(l, x, y)$ (label + coordinates)</li>
<li><strong>Task 2 (Bonds)</strong>: Prediction of bond types between atom pairs (None, Single, Double, Triple, Aromatic, Solid Wedge, Dashed Wedge)</li>
</ul>
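<p>The two decoding tasks combine naturally into a graph-assembly step. The sketch below is a hypothetical illustration (not the authors' code) of how predicted atom tokens and a pairwise bond-type matrix could be merged into an explicit molecular graph:</p>

```python
BOND_TYPES = ("None", "Single", "Double", "Triple", "Aromatic",
              "Solid Wedge", "Dashed Wedge")

def assemble_graph(atom_tokens, bond_matrix):
    """Merge the decoder's outputs into an explicit molecular graph.

    atom_tokens: list of (label, x, y) triples from Task 1.
    bond_matrix: bond_matrix[i][j] indexes BOND_TYPES for atom pair
    (i, j); only the upper triangle is read (bonds are undirected).
    """
    atoms = [{"label": label, "pos": (x, y)} for label, x, y in atom_tokens]
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            bond = BOND_TYPES[bond_matrix[i][j]]
            if bond != "None":
                bonds.append((i, j, bond))
    return {"atoms": atoms, "bonds": bonds}
```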
<p><strong>Post-Processing</strong>:</p>
<ul>
<li><strong>Stereochemistry</strong>: Uses predicted coordinates and bond types (wedge/dash) to resolve chirality using RDKit logic</li>
<li><strong>Abbreviation Correction</strong>: Matches superatoms to a dictionary; if unknown, attempts to greedily connect atoms based on valence or finds the nearest match ($\sigma=0.8$ similarity threshold)</li>
</ul>
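<p>The abbreviation-correction step can be sketched with a toy dictionary and Python's <code>difflib</code> as a stand-in similarity measure; the paper reports a &gt;100-entry dictionary and the $\sigma=0.8$ threshold but does not specify the similarity function, so both the entries and the metric below are assumptions:</p>

```python
import difflib

# Toy subset of a superatom-to-SMILES dictionary (the real one has >100 entries).
ABBREVIATIONS = {"Me": "C", "Et": "CC", "Ph": "c1ccccc1", "OMe": "OC"}

def expand_superatom(label, sigma=0.8):
    """Expand a superatom label via dictionary lookup; for unknown labels,
    fall back to the most similar known entry at or above the `sigma`
    threshold. Returns None when nothing is close enough."""
    if label in ABBREVIATIONS:
        return ABBREVIATIONS[label]
    best, best_score = None, sigma
    for known in ABBREVIATIONS:
        score = difflib.SequenceMatcher(None, label, known).ratio()
        if score >= best_score:
            best, best_score = known, score
    return ABBREVIATIONS[best] if best is not None else None
```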
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder (ConvNeXt + ViT Encoder -&gt; Transformer Decoder)</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Optimizer: Adam (max lr 3e-4, linear warmup for 5% of steps)</li>
<li>Batch Size: 256</li>
<li>Image Size: $384 \times 384$</li>
<li>Dropout: 0.1</li>
</ul>
</li>
<li><strong>Training</strong>: Fine-tuned CNN backbone for 40 epochs on 10 NVIDIA RTX 3090 GPUs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: SMILES sequence exact matching accuracy (canonicalized)</p>
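<p>The metric itself is simple once a canonicalizer is fixed. In practice the canonicalization would be done with RDKit's <code>Chem.CanonSmiles</code>; the sketch below keeps the canonicalizer pluggable (identity by default) so the accuracy logic stays dependency-free:</p>

```python
def exact_match_accuracy(predictions, references, canonicalize=lambda s: s):
    """Fraction of predicted SMILES that exactly match their reference
    after both sides are canonicalized. Pass e.g. rdkit.Chem.CanonSmiles
    as `canonicalize` for a chemically meaningful comparison."""
    assert len(predictions) == len(references)
    hits = sum(canonicalize(p) == canonicalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```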
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: Indigo (5,719), ChemDraw (5,719), RDKit (5,719)</li>
<li><strong>Real</strong>: CLEF (992), UOB (5,740), JPO (450), USPTO (5,719), Staker (50,000), ACS (331)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: 10 NVIDIA RTX 3090 GPUs</li>
<li><strong>Cluster</strong>: HPC3 Cluster at HKUST (ITSC)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CYF2000127/MolNexTR">MolNexTR GitHub</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation (PyTorch, Jupyter notebooks)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/CYF200127/MolNexTR">MolNexTR HuggingFace</a></td>
          <td>Dataset/Model</td>
          <td>Apache-2.0</td>
          <td>Training data and model checkpoint</td>
      </tr>
  </tbody>
</table>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chenMolNexTRGeneralizedDeep2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{MolNexTR}: a generalized deep learning model for molecular image recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chen, Yufan and Leung, Ching Ting and Huang, Yong and Sun, Jianwei and Chen, Hao and Gao, Hanyu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{141}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00926-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInfty: Chemical Structure Recognition in Patent Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</guid><description>Fujiyoshi et al.'s segment-based approach for recognizing chemical structures in challenging Japanese patent images with touching characters and broken lines.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fujiyoshi, A., Nakagawa, K., &amp; Suzuki, M. (2011). Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty. <em>Pre-Proceedings of the 9th IAPR International Workshop on Graphics Recognition, GREC.</em></p>
<p><strong>Publication</strong>: GREC 2011 (Graphics Recognition Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.sciaccess.net/en/InftyReader/">InftyReader Project</a></li>
</ul>
<h2 id="contribution-segment-based-ocsr-method">Contribution: Segment-Based OCSR Method</h2>
<p>This is a <strong>method paper</strong> that introduces ChemInfty, a rule-based system for Optical Chemical Structure Recognition (OCSR) specifically designed to handle the challenging, low-quality images found in Japanese patent applications.</p>
<h2 id="motivation-the-challenge-of-degraded-patent-images">Motivation: The Challenge of Degraded Patent Images</h2>
<p>The motivation is straightforward: Japanese patent applications contain a massive amount of chemical knowledge, but the images are remarkably poor quality. Unlike the relatively clean molecular diagrams in scientific papers, patent images suffer from multiple problems that break conventional OCSR systems.</p>
<p>The authors quantified these issues in a sample of 200 patent images and found that 22% contained touching characters (where atom labels merge together), 19.5% had characters touching bond lines, and 8.5% had broken lines. These are not edge cases; they are pervasive enough to cripple existing recognition tools. Established systems like CLIDE, ChemReader, and OSRA struggle significantly with line-touching characters and broken lines, leading to recognition failures.</p>
<p>The challenge is compounded by the sheer diversity of creation methods. Some structures are drawn with sophisticated molecular editors, others with basic paint programs, and some are even handwritten. This means there&rsquo;s no standardization in fonts, character sizes, or line thickness. Add in the effects of scanning and faxing, and you have images with significant noise, distortion, and degradation.</p>
<p>The goal of ChemInfty is to build a system robust enough to handle these messy real-world conditions and make Japanese patent chemistry computer-searchable.</p>
<h2 id="core-innovation-segment-decomposition-and-dynamic-programming">Core Innovation: Segment Decomposition and Dynamic Programming</h2>
<p>The novelty lies in a segment-based decomposition approach that separates the recognition problem into manageable pieces before attempting to classify them. The key insight is that traditional OCR fails on these images because characters and lines are physically merged. You cannot recognize a character if you cannot cleanly separate it from the surrounding bonds first.</p>
<p>ChemInfty&rsquo;s approach has several distinctive elements:</p>
<ol>
<li>
<p><strong>Line and Curve Segmentation</strong>: The system first decomposes the image into smaller line and curve segments. The decomposition happens at natural breakpoints&mdash;crossings, sharp bends, and other locations where touching is likely to occur. This creates a set of primitive elements that can be recombined in different ways.</p>
</li>
<li>
<p><strong>Linear Order Assumption for Scalability</strong>: To make the dynamic programming approach computationally tractable and avoid combinatorial explosion, the system assumes that segments to be combined are adjacent when sorted in one of four directional orderings ($\perp, \setminus, \triangle, \rightarrow$). This constraint dramatically reduces the search space while still capturing the natural spatial relationships in chemical diagrams.</p>
</li>
<li>
<p><strong>Dynamic Programming for Segment Combination</strong>: Once the image is decomposed, the system faces a combinatorial problem: which segments should be grouped together to form characters, and which should be classified as bonds? The authors use dynamic programming to efficiently search for the &ldquo;most suitable combination&rdquo; of segments. This optimization finds the configuration that maximizes the likelihood of valid chemical structure elements.</p>
</li>
<li>
<p><strong>Two-Pass OCR Strategy</strong>: ChemInfty integrates with InftyReader, a powerful OCR engine. The system uses OCR twice in the pipeline:</p>
<ul>
<li><strong>First pass</strong>: High-confidence character recognition removes obvious atom labels early, simplifying the remaining image</li>
<li><strong>Second pass</strong>: After the segment-based method identifies and reconstructs difficult character regions, OCR is applied again to the cleaned-up character image</li>
</ul>
<p>This two-stage approach handles both easy and hard cases effectively: simple characters are recognized immediately, while complex cases get special treatment.</p>
</li>
<li>
<p><strong>Image Thinning for Structure Analysis</strong>: Before segmentation, the system thins the remaining graphical elements (after removing high-confidence characters) to skeleton lines. This thinning operation reveals the underlying topological structure&mdash;crossings, bends, and endpoints&mdash;making it easier to detect where segments should be divided.</p>
</li>
<li>
<p><strong>Proximity-Based Grouping</strong>: After identifying potential character segments, the system groups nearby segments together. This spatial clustering ensures that parts of the same character that were separated by bonds get recombined correctly.</p>
</li>
</ol>
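<p>The proximity-based grouping in step 6 can be sketched as a union-find clustering over segment centroids; the actual distance measure and threshold used by ChemInfty are not reported, so both are assumptions here:</p>

```python
from math import hypot

def group_by_proximity(centers, max_dist):
    """Union-find clustering: two segments whose centers lie within
    `max_dist` of each other land in the same group (transitively)."""
    parent = list(range(len(centers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, (xi, yi) in enumerate(centers):
        for j in range(i + 1, len(centers)):
            xj, yj = centers[j]
            if hypot(xi - xj, yi - yj) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(centers)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```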
<h2 id="methodology-real-world-patent-evaluation">Methodology: Real-World Patent Evaluation</h2>
<p>The evaluation focused on demonstrating that ChemInfty could handle real-world patent images at scale:</p>
<ol>
<li>
<p><strong>Large-Scale Patent Dataset</strong>: The system was tested on chemical structure images from Japanese patent applications published in 2008. This represents a realistic deployment scenario with all the messiness of actual documents.</p>
</li>
<li>
<p><strong>Touching Character Separation</strong>: The authors specifically measured the system&rsquo;s ability to separate characters from bonds when they were touching. Success was defined as cleanly extracting the character region so that OCR could recognize it.</p>
</li>
<li>
<p><strong>Recognition Accuracy by Object Type</strong>: Performance was broken down by element type (characters, line segments, solid wedges, and hashed wedges). This granular analysis revealed which components were easier or harder for the system to handle.</p>
</li>
<li>
<p><strong>End-to-End Performance</strong>: The overall recognition ratio was calculated across all object types to establish the system&rsquo;s practical utility for automated patent processing.</p>
</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Effective Separation for Line-Touching Characters</strong>: The segment-based method successfully separated 63.5% of characters that were touching bond lines. This is a substantial improvement over standard OCR, which typically fails completely on such cases. The authors note that when image quality is reasonable, the separation method works well.</p>
</li>
<li>
<p><strong>Strong Overall Character Recognition</strong>: Character recognition achieved 85.86% accuracy, which is respectable given the poor quality of the input images. Combined with the 90.73% accuracy for line segments, this demonstrates the system can reliably reconstruct the core molecular structure.</p>
</li>
<li>
<p><strong>Weak Performance on Wedges</strong>: The system struggled significantly with stereochemistry notation. Solid wedges were correctly recognized only 52.54% of the time, and hashed wedges fared even worse at 23.63%. This is a critical limitation since stereochemistry is often essential for understanding molecular properties.</p>
</li>
<li>
<p><strong>Image Quality Dependency</strong>: The authors acknowledge that the method&rsquo;s effectiveness is ultimately limited by image quality. When images are severely degraded (blurred to the point where even humans struggle to distinguish characters from noise), the segmentation approach cannot reliably separate touching elements.</p>
</li>
<li>
<p><strong>Overall System Performance</strong>: The combined recognition ratio of 86.58% for all objects indicates that ChemInfty is a working system but not yet production-ready. The authors conclude that further refinement is necessary, particularly for wedge recognition and handling extremely low-quality images.</p>
</li>
</ul>
<p>The work establishes that segment-based decomposition with dynamic programming is a viable approach for handling the specific challenges of patent image OCSR. The two-pass OCR strategy and the use of image thinning to reveal structure are practical engineering solutions that improve robustness. However, the results also highlight that rule-based methods are fundamentally limited by image quality. There is only so much you can do with algorithmic cleverness when the input is severely degraded. This limitation would motivate later work on deep learning approaches that can learn robust feature representations from large datasets.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="technical-paradigm">Technical Paradigm</h3>
<p><strong>This is a pre-deep learning (2011) classical computer vision paper.</strong> The system uses rule-based methods and traditional OCR engines, not neural networks.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>InftyReader</strong>: A mathematical OCR engine used for the initial high-confidence character recognition pass. This is a pre-existing external tool.</li>
<li><strong>DEF-based OCR</strong>: A standard OCR engine based on Directional Element Features (DEF). These are manually engineered statistical features (histograms of edge directions), not learned neural network features.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a multi-step recognition pipeline:</p>
<ol>
<li><strong>Preprocessing</strong>: Binarization and smoothing</li>
<li><strong>Initial Character Removal</strong>: High-confidence characters are recognized by the InftyReader OCR engine and removed from the image to simplify segmentation</li>
<li><strong>Skeletonization</strong>: Thinning using <strong>Hilditch&rsquo;s algorithm</strong> to skeletonize graphical elements, revealing topological structure (crossings, bends, endpoints)</li>
<li><strong>Feature Point Detection</strong>:
<ul>
<li><strong>Crossing points</strong>: Direct detection on skeleton</li>
<li><strong>Bending points</strong>: Detected using the <strong>Hough transformation</strong></li>
</ul>
</li>
<li><strong>Dynamic Programming Search</strong>:
<ul>
<li><strong>Input</strong>: Set of line/curve segments $S$</li>
<li><strong>Procedure</strong>: Sort segments in 4 directions ($\perp, \setminus, \triangle, \rightarrow$). For each direction, use DP to find the grouping that minimizes a heuristic score</li>
<li><strong>Complexity</strong>: $O(n^2)$ where $n$ is the number of segments</li>
<li><strong>Scoring</strong>: Uses a function <code>Measure(S')</code> that returns a score (0-100) indicating if a subset of segments forms a valid character or bond</li>
</ul>
</li>
</ol>
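<p>The dynamic programming search restricted to contiguous runs in the sorted order can be sketched as a classic sequence-partition DP. The scoring lambda in the usage below is a toy stand-in for the paper's undefined <code>Measure(S')</code>, charging each group a fixed overhead plus its spread:</p>

```python
def best_grouping(segments, measure):
    """Partition a directionally sorted segment list into contiguous
    groups minimizing the summed group score. Contiguity in the sorted
    order is the paper's linear-order assumption; it keeps the search
    at O(n^2) subproblems instead of exponentially many subsets."""
    n = len(segments)
    best = [0.0] + [float("inf")] * n  # best[i]: optimal cost of first i segments
    cut = [0] * (n + 1)                # cut[i]: start index of the last group
    for i in range(1, n + 1):
        for j in range(i):
            cost = best[j] + measure(segments[j:i])
            if cost < best[i]:
                best[i], cut[i] = cost, j
    groups, i = [], n
    while i > 0:
        groups.append(segments[cut[i]:i])
        i = cut[i]
    return best[n], groups[::-1]
```

<p>With a spread-plus-overhead score, nearby segments are merged while distant ones stay separate, which is the behavior the character/bond grouping needs.</p>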
<h3 id="data">Data</h3>
<p><strong>Evaluation Dataset</strong>: Chemical structure images from Japanese patent applications published in 2008. The complete 2008 dataset contains 229,969 total images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Japanese Published Patent Applications (2008)</td>
          <td>1,599 images</td>
          <td>Contains 229,969 total images for the year. Format: TIFF, 200-400 dpi.</td>
      </tr>
      <tr>
          <td>Analysis</td>
          <td>Random subset for frequency analysis</td>
          <td>200 images</td>
          <td>Used to estimate frequency of touching/broken characters (found in ~20% of images).</td>
      </tr>
  </tbody>
</table>
<p><strong>No Training Set</strong>: The system is rule-based and uses pre-built OCR engines, so no model training was performed.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: Recognition ratio (percentage of correctly recognized objects)</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Line-touching Separation</td>
          <td>63.5%</td>
          <td>Success rate for separating text glued to lines</td>
      </tr>
      <tr>
          <td>Character Recognition</td>
          <td>85.86%</td>
          <td>For all character sizes</td>
      </tr>
      <tr>
          <td>Line segments</td>
          <td>90.73%</td>
          <td>Standard bond recognition</td>
      </tr>
      <tr>
          <td>Solid Wedge Recognition</td>
          <td>52.54%</td>
          <td>Low performance noted as area for improvement</td>
      </tr>
      <tr>
          <td>Hashed Wedges</td>
          <td>23.63%</td>
          <td>Poorest performing element type</td>
      </tr>
      <tr>
          <td>Overall</td>
          <td>86.58%</td>
          <td>Combined across all object types</td>
      </tr>
  </tbody>
</table>
<p><strong>Total Objects Evaluated</strong>: 742,287 objects (characters, line segments, solid wedges, hashed wedges) extracted from the patent images.</p>
<h3 id="hardware">Hardware</h3>
<p>Not reported. Computational cost was not a primary concern for this classical CV system.</p>
<h3 id="replicability">Replicability</h3>
<p><strong>Low.</strong> The paper does not provide sufficient detail for full replication:</p>
<ul>
<li>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined</li>
<li>Dependency on the proprietary/specialized InftyReader engine</li>
<li>No pseudocode provided for the segment decomposition heuristics</li>
</ul>
<h3 id="notes-on-wedge-recognition">Notes on Wedge Recognition</h3>
<p>The system&rsquo;s poor performance on solid wedges (52.54%) and hashed wedges (23.63%) reflects a fundamental challenge for classical thinning algorithms. Wedge bonds are dense triangular regions that indicate 3D stereochemistry. When skeletonized using algorithms like Hilditch&rsquo;s method, these &ldquo;blob&rdquo; shapes often distort into unrecognizable patterns, unlike the clean thin lines that represent regular bonds.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fujiyoshiRobustMethodSegmentation2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Robust {{Method}} of {{Segmentation}} and {{Recognition}} of {{Chemical Structure Images}} in {{ChemInfty}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fujiyoshi, Akio and Nakagawa, Koji and Suzuki, Masakazu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2011</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Pre-proceedings of the 9th IAPR international workshop on graphics recognition, GREC}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser: End-to-End Molecular Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</guid><description>MolParser converts molecular images from scientific documents to machine-readable formats using end-to-end learning with Extended SMILES representation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Wang, J., Cai, X., Chen, S., Yang, S., Tao, H., Wang, N., Yao, L., Zhang, L., &amp; Ke, G. (2025). MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild. In <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em> (pp. 24528-24538). <a href="https://doi.org/10.48550/arXiv.2411.11098">https://doi.org/10.48550/arXiv.2411.11098</a></p>
<p><strong>Publication</strong>: ICCV 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/">MolParser-7M Dataset</a> - 7M+ image-text pairs for OCSR</li>
<li><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M on HuggingFace</a> - Dataset repository</li>
<li><a href="https://huggingface.co/UniParser/MolDet">MolDet YOLO Detector</a> - Object detection model for extracting molecular images from documents</li>
</ul>
<h2 id="contribution-end-to-end-ocsr-and-real-world-resources">Contribution: End-to-End OCSR and Real-World Resources</h2>
<p>This is primarily a <strong>Method</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a>), with a significant secondary contribution as a <strong>Resource</strong> paper.</p>
<p><strong>Method contribution ($\Psi_{\text{Method}}$)</strong>: The paper proposes a novel end-to-end architecture combining a Swin Transformer encoder with a BART decoder, and crucially introduces <strong>Extended SMILES (E-SMILES)</strong>, a new syntactic extension to standard SMILES notation that enables representation of Markush structures, abstract rings, and variable attachment points found in patents. The work validates this method through extensive ablation studies, achieving the highest accuracy among tested OCSR systems on WildMol-10k (76.9%).</p>
<p><strong>Resource contribution ($\Psi_{\text{Resource}}$)</strong>: The paper introduces <strong>MolParser-7M</strong>, the largest OCSR dataset to date (7.7M image-text pairs), and <strong>WildMol</strong>, a challenging benchmark of 20,000 manually annotated real-world molecular images. The construction of these datasets through an active learning data engine with human-in-the-loop validation represents significant infrastructure that enables future OCSR research.</p>
<h2 id="motivation-extracting-chemistry-from-real-world-documents">Motivation: Extracting Chemistry from Real-World Documents</h2>
<p>The motivation stems from a practical problem in chemical informatics: vast amounts of chemical knowledge remain embedded in unstructured formats. Patents, research papers, and legacy documents depict molecular structures as images. This creates a barrier for large-scale data analysis and prevents Large Language Models from effectively understanding scientific literature in chemistry and drug discovery.</p>
<p>Existing OCSR methods struggle with real-world documents for two fundamental reasons:</p>
<ol>
<li><strong>Representational limitations</strong>: Standard SMILES notation cannot capture complex structural templates like <strong>Markush structures</strong>, which are ubiquitous in patents. These structures define entire families of compounds using variable R-groups and abstract patterns, making them essential for intellectual property but impossible to represent with conventional methods.</li>
<li><strong>Data distribution mismatch</strong>: Real-world molecular images suffer from noise, inconsistent drawing styles, variable resolution, and interference from surrounding text. Models trained exclusively on clean, synthetically rendered molecules fail to generalize when applied to actual documents.</li>
</ol>
<h2 id="novelty-e-smiles-and-human-in-the-loop-data-engine">Novelty: E-SMILES and Human-in-the-Loop Data Engine</h2>
<p>The novelty lies in a comprehensive system that addresses both representation and data quality challenges through four integrated contributions:</p>
<ol>
<li>
<p><strong>Extended SMILES (E-SMILES)</strong>: A backward-compatible extension to the SMILES format that can represent complex structures previously inexpressible in standard chemical notations. E-SMILES uses a separator token <code>&lt;sep&gt;</code> to delineate the core molecular structure from supplementary annotations. These annotations employ XML-like tags to encode Markush structures, polymers, abstract rings, and other complex patterns. Critically, the core structure remains parseable by standard cheminformatics tools like RDKit, while the supplementary tags provide a structured, LLM-friendly format for capturing edge cases.</p>
</li>
<li>
<p><strong>MolParser-7M Dataset</strong>: The largest publicly available OCSR dataset, containing over 7 million image-text pairs. What distinguishes this dataset is both its scale and its composition. It includes 400,000 &ldquo;in-the-wild&rdquo; samples (molecular images extracted from actual patents and scientific papers) that were subsequently curated by human annotators. This real-world data addresses the distribution mismatch problem directly by exposing the model to the same noise, artifacts, and stylistic variations it encounters in production.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Data Engine</strong>: A systematic approach to collecting and annotating real-world training data. The pipeline begins with an object detection model that extracts molecular images from over a million PDF documents. An active learning algorithm then identifies the most informative samples (those where the current model struggles) for human annotation. The model pre-annotates these images, and human experts review and correct them. This creates an iterative improvement cycle: annotate, train, identify new challenging cases, repeat.</p>
</li>
<li>
<p><strong>Efficient End-to-End Architecture</strong>: The model treats OCSR as an image captioning problem. A Swin-Transformer vision encoder extracts visual features, a simple MLP compresses them, and a BART decoder generates the E-SMILES string autoregressively. The model minimizes the standard negative log-likelihood of the target E-SMILES token sequence $y$ given the sequence history and input image $x$:</p>
</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x; \theta)
\end{aligned}
$$</p>
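<p>This is ordinary teacher-forced cross-entropy; given the per-step probabilities assigned to the target tokens, the loss reduces to a negative log-sum:</p>

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a target token sequence, where
    token_probs[t] = P(y_t | y_<t, x; theta) as produced by the
    decoder's softmax at step t."""
    return -sum(math.log(p) for p in token_probs)
```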
<p>The training strategy employs curriculum learning, starting with simple molecules and gradually introducing complexity and heavier data augmentation.</p>
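<p>The <code>&lt;sep&gt;</code> convention described above also makes E-SMILES easy to post-process. The sketch below splits the core SMILES from the annotations; the <code>&lt;rgroup&gt;</code> tag and its payload are hypothetical examples, since the full E-SMILES tag vocabulary is defined by the authors and not reproduced here:</p>

```python
import re

def split_esmiles(esmiles, sep="<sep>"):
    """Split an E-SMILES string into its RDKit-parseable core SMILES and
    a list of (tag, payload) pairs from the XML-like annotations that
    follow the <sep> token. Strings without <sep> yield no annotations."""
    core, _, extra = esmiles.partition(sep)
    annotations = re.findall(r"<(\w+)>(.*?)</\1>", extra)
    return core, annotations
```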
<h2 id="experimental-setup-two-stage-training-and-benchmarking">Experimental Setup: Two-Stage Training and Benchmarking</h2>
<p>The evaluation focused on demonstrating that MolParser generalizes to real-world documents:</p>
<ol>
<li>
<p><strong>Two-Stage Training Protocol</strong>: The model underwent a systematic training process:</p>
<ul>
<li><strong>Pre-training</strong>: Initial training on millions of synthetic molecular images using curriculum learning. The curriculum progresses from simple molecules to complex structures while gradually increasing data augmentation intensity (blur, noise, perspective transforms).</li>
<li><strong>Fine-tuning</strong>: Subsequent training on 400,000 curated real-world samples extracted from patents and papers. This fine-tuning phase is critical for adapting to the noise and stylistic variations of actual documents.</li>
</ul>
</li>
<li>
<p><strong>Benchmark Evaluation</strong>: The model was evaluated on multiple standard OCSR benchmarks to establish baseline performance on clean data. These benchmarks test recognition accuracy on well-formatted molecular diagrams.</p>
</li>
<li>
<p><strong>Real-World Document Analysis</strong>: The critical test involved applying MolParser to molecular structures extracted directly from scientific documents. This evaluation measures the gap between synthetic benchmark performance and real-world applicability (the core problem the paper addresses).</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Experiments isolating the contribution of each component:</p>
<ul>
<li>The impact of real-world training data versus synthetic-only training</li>
<li>The effectiveness of curriculum learning versus standard training</li>
<li>The value of the human-in-the-loop annotation pipeline versus random sampling</li>
<li>The necessity of E-SMILES extensions for capturing complex structures</li>
</ul>
</li>
</ol>
<h2 id="outcomes-and-empirical-findings">Outcomes and Empirical Findings</h2>
<ul>
<li>
<p><strong>Performance on Benchmarks</strong>: MolParser achieves competitive results on standard benchmarks and the best performance on real-world documents. On clean benchmarks like USPTO-10K, MolScribe (96.0%) slightly edges MolParser-Base (94.5%), but on WildMol-10k, MolParser-Base achieved 76.9% accuracy, significantly outperforming MolScribe (66.4%) and MolGrapher (45.5%). This gap validates the core hypothesis that training on actual document images is essential for practical deployment.</p>
</li>
<li>
<p><strong>Real-World Data is Critical</strong>: Models trained exclusively on synthetic data show substantial performance degradation when applied to real documents. The 400,000 in-the-wild training samples bridge this gap, demonstrating that data quality and distribution matching matter as much as model architecture. Ablation experiments showed that pretraining on MolParser-7M synthetic data alone achieved 51.9% accuracy on WildMol, while adding real-world fine-tuning raised this to 76.9%. Using the smaller MolGrapher-300k synthetic dataset without fine-tuning yielded only 22.4%.</p>
</li>
<li>
<p><strong>E-SMILES Enables Broader Coverage</strong>: The extended representation successfully captures molecular structures that were previously inexpressible, particularly Markush structures from patents. This expands the scope of what can be automatically extracted from chemical literature to include patent-style structural templates.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Scales Efficiently</strong>: The active learning pipeline reduces annotation time by approximately 90% while maintaining high quality. This approach makes it feasible to curate large-scale, high-quality datasets for specialized domains where expert knowledge is expensive.</p>
</li>
<li>
<p><strong>Speed and Accuracy</strong>: The end-to-end architecture achieves both high accuracy and fast inference, making it practical for large-scale document processing. MolParser-Base processes 40 images per second on RTX 4090D, while the Tiny variant achieves 131 FPS. The direct image-to-text approach avoids the error accumulation of multi-stage pipelines.</p>
</li>
<li>
<p><strong>Downstream Applications</strong>: The Swin Transformer encoder, once trained on MolParser-7M, serves as an effective molecular fingerprint for property prediction. Paired with a simple two-layer MLP on MoleculeNet benchmarks, MolParser-pretrained features achieved an average ROC-AUC of 73.7% across five tasks, compared to 68.9% for ImageNet-pretrained Swin-T features. The authors also demonstrate chemical reaction parsing by feeding MolDet detections and MolParser E-SMILES into GPT-4o.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge that molecular chirality is not yet fully exploited by the system. The E-SMILES format does not currently support dashed abstract rings, coordination bonds, special symbol Markush patterns, or replication of long structural segments. Additionally, scaling up the volume of real annotated training data could further improve performance.</p>
</li>
</ul>
<p>The work establishes that practical OCSR requires more than architectural innovations. It demands careful attention to data quality, representation design, and the distribution mismatch between synthetic training data and real-world applications. The combination of E-SMILES, the MolParser-7M dataset, and the human-in-the-loop data engine provides a template for building reliable vision systems in scientific domains where clean training data is scarce but expert knowledge is available.</p>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>7.7M image-SMILES pairs for OCSR pretraining and fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet</a></td>
          <td>Model</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>YOLO11-based molecule detector for PDF documents</td>
      </tr>
  </tbody>
</table>
<p>No official source code repository has been released. Model weights for MolParser itself are not publicly available as of the dataset release.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is split into a massive synthetic pre-training set and a curated fine-tuning set.</p>
<p><strong>Training Data Composition (MolParser-7M)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Name</th>
          <th>Size</th>
          <th>Composition / Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pre-training</strong></td>
          <td>MolParser-7M (Synthetic)</td>
          <td>~7.7M</td>
          <td><strong>Markush-3M</strong> (40%), <strong>ChEMBL-2M</strong> (27%), <strong>Polymer-1M</strong> (14%), PAH-600k (8%), BMS-360k (5%), MolGrapher-300K (4%), Pauling-100k (2%). Generated via RDKit/Indigo with randomized styles.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-SFT-400k</td>
          <td>400k</td>
          <td>Real images from patents/papers selected via active learning (confidence filtering 0.6-0.9) and manually annotated. 66% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-Gen-200k</td>
          <td>200k</td>
          <td>Subset of synthetic data kept to prevent catastrophic forgetting. 32% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Handwrite-5k</td>
          <td>5k</td>
          <td>Handwritten molecules from Img2Mol to support hand-drawn queries. 1% of fine-tuning mix.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Sources</strong>: 1.2M patents and scientific papers (PDF documents)</li>
<li><strong>Extraction</strong>: MolDet (YOLO11-based detector) identified ~20M molecular images, deduplicated to ~4M candidates</li>
<li><strong>Selection</strong>: Active learning ensemble (5-fold models) identified high-uncertainty samples for annotation</li>
<li><strong>Annotation</strong>: Human experts corrected model pre-annotations (90% time savings vs. from-scratch annotation)</li>
</ul>
<p><strong>Test Benchmarks</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-10k</td>
          <td>10,000</td>
          <td>Standard synthetic benchmark</td>
      </tr>
      <tr>
          <td>Maybridge UoB</td>
          <td>-</td>
          <td>Synthetic molecules</td>
      </tr>
      <tr>
          <td>CLEF-2012</td>
          <td>-</td>
          <td>Patent images</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>-</td>
          <td>Japanese patent office</td>
      </tr>
      <tr>
          <td>ColoredBG</td>
          <td>-</td>
          <td>Colored background molecules</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td>10,000</td>
          <td>Ordinary molecules cropped from real PDFs (new)</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k-M</strong></td>
          <td>10,000</td>
          <td>Markush structures (significantly harder, new)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Extended SMILES (E-SMILES) Encoding</strong>:</p>
<ul>
<li><strong>Format</strong>: <code>SMILES&lt;sep&gt;EXTENSION</code> where <code>&lt;sep&gt;</code> separates core structure from supplementary annotations</li>
<li><strong>Extensions use XML-like tags</strong>:
<ul>
<li><code>&lt;a&gt;index:group&lt;/a&gt;</code> for substituents/variable groups (Markush structures)</li>
<li><code>&lt;r&gt;</code> for groups connected at any ring position</li>
<li><code>&lt;c&gt;</code> for abstract rings</li>
<li><code>&lt;dum&gt;</code> for connection points</li>
</ul>
</li>
<li><strong>Backward compatible</strong>: Core SMILES parseable by RDKit; extensions provide structured format for edge cases</li>
</ul>
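<p>As a concrete illustration, a minimal parser for the <code>SMILES&lt;sep&gt;EXTENSION</code> layout described above might look as follows. The tag names (<code>&lt;a&gt;</code>, <code>&lt;c&gt;</code>, <code>&lt;dum&gt;</code>) come from the paper, but the exact payload syntax (<code>index:group</code>) and the helper name are illustrative assumptions; a real pipeline would additionally validate the core SMILES with RDKit.</p>

```python
import re

# Hypothetical E-SMILES string: core SMILES, then <sep>, then XML-like extension
# tags. Tag names follow the paper; the "index:group" payload format is an
# assumption for illustration.

def parse_esmiles(esmiles: str) -> dict:
    """Split an E-SMILES string into its core SMILES and extension annotations."""
    core, _, extension = esmiles.partition("<sep>")
    # Substituent definitions: <a>index:group</a>
    substituents = {}
    for idx, group in re.findall(r"<a>(\d+):([^<]+)</a>", extension):
        substituents[int(idx)] = group
    return {
        "core_smiles": core,                # parseable by RDKit on its own
        "substituents": substituents,       # Markush R-group definitions
        "has_abstract_ring": "<c>" in extension,
        "connection_points": extension.count("<dum>"),
    }

parsed = parse_esmiles("c1ccccc1[*:1]<sep><a>1:OMe</a><a>2:halogen</a>")
print(parsed["core_smiles"])    # the core stays valid SMILES on its own
print(parsed["substituents"])   # {1: 'OMe', 2: 'halogen'}
```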
<p><strong>Curriculum Learning Strategy</strong>:</p>
<ul>
<li><strong>Phase 1</strong>: No augmentation, simple molecules (&lt;60 tokens)</li>
<li><strong>Phase 2</strong>: Gradually increase augmentation intensity and sequence length</li>
<li>Progressive complexity allows stable training on diverse molecular structures</li>
</ul>
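<p>A minimal sketch of such a two-phase schedule, assuming a linear ramp after the warm-up phase (the paper does not specify the exact ramp shape or phase boundaries, so those are placeholders):</p>

```python
def curriculum_batch_filter(samples, step, total_steps,
                            simple_token_limit=60, warmup_frac=0.25):
    """Phase 1: only short, unaugmented sequences. Later: admit longer molecules
    and raise augmentation strength (the linear ramp is an assumption)."""
    progress = step / total_steps
    if progress < warmup_frac:
        max_tokens, aug_strength = simple_token_limit, 0.0
    else:
        # Ramp both the sequence-length cap and augmentation intensity to full.
        ramp = (progress - warmup_frac) / (1.0 - warmup_frac)
        max_tokens = int(simple_token_limit + ramp * (512 - simple_token_limit))
        aug_strength = ramp
    selected = [s for s in samples if s["n_tokens"] <= max_tokens]
    return selected, aug_strength

samples = [{"id": 0, "n_tokens": 40}, {"id": 1, "n_tokens": 200}]
early, strength = curriculum_batch_filter(samples, step=10, total_steps=100)
print([s["id"] for s in early], strength)   # only the short molecule, no augmentation
```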
<p><strong>Active Learning Data Selection</strong>:</p>
<ol>
<li>Train 5 model folds on current dataset</li>
<li>Compute pairwise Tanimoto similarity of predictions on candidate images</li>
<li>Select samples with confidence scores <strong>0.6-0.9</strong> for human review (highest learning value)</li>
<li>Human experts correct model pre-annotations</li>
<li>Iteratively expand training set with hard samples</li>
</ol>
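<p>The selection step above can be sketched as follows. Confidence is approximated as the mean pairwise Tanimoto similarity across fold predictions (three folds here for brevity; the paper uses five), with fingerprints reduced to plain Python sets for illustration:</p>

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two substructure-fingerprint bit sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def ensemble_confidence(fold_fingerprints):
    """Mean pairwise Tanimoto over fold predictions, used as a confidence proxy."""
    pairs = list(combinations(fold_fingerprints, 2))
    return sum(tanimoto(x, y) for x, y in pairs) / len(pairs)

def select_for_annotation(candidates, lo=0.6, hi=0.9):
    """Keep images in the moderate-confidence band (0.6-0.9): hard but learnable."""
    return [cid for cid, folds in candidates.items()
            if lo <= ensemble_confidence(folds) <= hi]

candidates = {
    "img_a": [{1, 2, 3}] * 3,                              # folds agree -> conf 1.0
    "img_b": [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}],   # partial agreement -> 0.6
    "img_c": [{1}, {2}, {3}],                              # folds disagree -> 0.0
}
print(select_for_annotation(candidates))  # only the moderate-confidence image
```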
<p><strong>Data Augmentations</strong>:</p>
<ul>
<li>RandomAffine (rotation, scale, translation)</li>
<li>JPEGCompress (compression artifacts)</li>
<li>InverseColor (color inversion)</li>
<li>SurroundingCharacters (text interference)</li>
<li>RandomCircle (circular artifacts)</li>
<li>ColorJitter (brightness, contrast variations)</li>
<li>Downscale (resolution reduction)</li>
<li>Bounds (boundary cropping variations)</li>
</ul>
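<p>A pipeline composing these augmentations might be structured as below. The transform names mirror the list above, but the bodies are stand-ins that merely record what was applied; real implementations would operate on image arrays, with intensity scaled by the curriculum's strength knob:</p>

```python
import random

# Stand-in transforms: each appends a record of what it would have done.
def random_affine(img, s):    return img + [("affine", s)]
def jpeg_compress(img, s):    return img + [("jpeg", s)]
def inverse_color(img, s):    return img + [("invert", s)]
def color_jitter(img, s):     return img + [("jitter", s)]

PIPELINE = [random_affine, jpeg_compress, inverse_color, color_jitter]

def augment(img, strength, p=0.5, rng=None):
    """Apply each transform independently with probability p; `strength` is the
    curriculum knob scaling intensity (a simplification of the paper's schedule)."""
    rng = rng or random.Random(0)
    out = list(img)
    for t in PIPELINE:
        if rng.random() < p:
            out = t(out, strength)
    return out

aug = augment([], strength=0.3, rng=random.Random(42))
print(aug)  # a random subset of the four transforms, each at strength 0.3
```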
<h3 id="models">Models</h3>
<p>The architecture follows a standard <strong>Image Captioning</strong> (Encoder-Decoder) paradigm.</p>
<p><strong>Architecture Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Vision Encoder</strong></td>
          <td>Swin Transformer (ImageNet pretrained)</td>
      </tr>
      <tr>
          <td>- Tiny variant</td>
          <td>66M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Small variant</td>
          <td>108M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Base variant</td>
          <td>216M parameters, $384 \times 384$ input</td>
      </tr>
      <tr>
          <td><strong>Connector</strong></td>
          <td>2-layer MLP reducing channel dimension by half</td>
      </tr>
      <tr>
          <td><strong>Text Decoder</strong></td>
          <td>BART-Decoder (12 layers, 16 attention heads)</td>
      </tr>
  </tbody>
</table>
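<p>To make the shape flow concrete: a Swin encoder downsamples the input by 32x overall, and the connector halves the channel dimension before the visual tokens reach the BART decoder. The per-variant channel widths used below (768 for Tiny/Small, 1024 for Base) are standard Swin values, not stated in the paper:</p>

```python
def encoder_output_shape(img_size, final_dim):
    """Swin downsamples 32x overall: tokens = (H/32) * (W/32), width = final_dim.
    (final_dim per variant is a standard Swin value, assumed here.)"""
    side = img_size // 32
    return side * side, final_dim

def connector_shape(tokens, dim):
    """The 2-layer MLP connector halves the channel dimension (per the paper)."""
    return tokens, dim // 2

tokens, dim = encoder_output_shape(384, 1024)   # Swin-Base at 384x384 input
tokens, dim = connector_shape(tokens, dim)
print(tokens, dim)  # 144 visual tokens of width 512 fed to the BART decoder
```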
<p><strong>Training Configuration</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Pre-training</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Hardware</strong></td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
      </tr>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>AdamW</td>
          <td>AdamW</td>
      </tr>
      <tr>
          <td><strong>Learning Rate</strong></td>
          <td>$1 \times 10^{-4}$</td>
          <td>$5 \times 10^{-5}$</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong></td>
          <td>$1 \times 10^{-2}$</td>
          <td>$1 \times 10^{-2}$</td>
      </tr>
      <tr>
          <td><strong>Scheduler</strong></td>
          <td>Cosine with warmup</td>
          <td>Cosine with warmup</td>
      </tr>
      <tr>
          <td><strong>Epochs</strong></td>
          <td>20</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Label Smoothing</strong></td>
          <td>0.01</td>
          <td>0.005</td>
      </tr>
  </tbody>
</table>
<p><strong>Curriculum Learning Schedule</strong> (Pre-training):</p>
<ul>
<li>Starts with simple molecules (&lt;60 tokens, no augmentation)</li>
<li>Gradually adds complexity and augmentation (blur, noise, perspective transforms)</li>
<li>Enables stable learning across diverse molecular structures</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>: Exact match accuracy on predicted E-SMILES strings (molecule-level exact match)</p>
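<p>A minimal sketch of molecule-level exact-match scoring. In the actual evaluation the predicted and reference strings would be canonicalized (e.g. via RDKit for the core SMILES); the whitespace-stripping normalizer here is a stand-in:</p>

```python
def exact_match_accuracy(preds, refs, canonicalize=None):
    """Molecule-level exact match: 1 if the (canonicalized) predicted E-SMILES
    equals the reference, else 0, averaged over the set. `canonicalize`
    defaults to whitespace stripping as a stand-in for true canonicalization."""
    canon = canonicalize or (lambda s: s.strip())
    hits = sum(canon(p) == canon(r) for p, r in zip(preds, refs))
    return hits / len(refs)

preds = ["CCO", "c1ccccc1 ", "CC(=O)O"]
refs  = ["CCO", "c1ccccc1",  "CC(C)O"]
print(f"{exact_match_accuracy(preds, refs):.3f}")  # 2 of 3 match -> 0.667
```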
<p><strong>Key Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MolParser-Base</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td><strong>76.9%</strong></td>
          <td>66.4%</td>
          <td>45.5%</td>
          <td>Real-world patent/paper crops</td>
      </tr>
      <tr>
          <td><strong>USPTO-10k</strong></td>
          <td>94.5%</td>
          <td><strong>96.0%</strong></td>
          <td>93.3%</td>
          <td>Synthetic benchmark</td>
      </tr>
      <tr>
          <td><strong>Throughput (FPS)</strong></td>
          <td><strong>39.8</strong></td>
          <td>16.5</td>
          <td>2.2</td>
          <td>Measured on RTX 4090D</td>
      </tr>
  </tbody>
</table>
<p><strong>Additional Performance</strong>:</p>
<ul>
<li>MolParser-Tiny: 131 FPS on RTX 4090D (66M params)</li>
<li>Real-world vs. synthetic gap: Fine-tuning on MolParser-SFT-400k closed the performance gap between clean benchmarks and in-the-wild documents</li>
</ul>
<p><strong>Ablation Findings</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Real-world training data</td>
          <td>Fine-tuning on real data raised accuracy from 51.9% to 76.9% on WildMol-10k</td>
      </tr>
      <tr>
          <td>Curriculum learning</td>
          <td>Augmentation alone raised WildMol-10k from 40.1% to 69.5%; adding curriculum learning further raised it to 76.9%</td>
      </tr>
      <tr>
          <td>Active learning selection</td>
          <td>More effective than random sampling for annotation budget</td>
      </tr>
      <tr>
          <td>E-SMILES extensions</td>
          <td>Essential for Markush structure recognition (impossible with standard SMILES)</td>
      </tr>
      <tr>
          <td>Dataset scale</td>
          <td>Larger pre-training dataset (7M vs 300k) improved WildMol-10k accuracy from 22.4% to 51.9% before fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 8x NVIDIA RTX 4090D GPUs</li>
<li><strong>Inference</strong>: Single RTX 4090D sufficient for real-time processing</li>
<li><strong>Training time</strong>: 20 epochs pre-training + 4 epochs fine-tuning (specific duration not reported)</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2025molparser,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Xi and Wang, Jiankun and Cai, Xiaochen and Chen, Shangqian and Yang, Shuwen and Tao, Haoyi and Wang, Nan and Yao, Lin and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24528--24538}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2411.11098}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2411.11098}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser-7M &amp; WildMol: Large-Scale OCSR Datasets</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</guid><description>MolParser-7M is the largest open-source OCSR dataset with 7.7M image-SMILES pairs including 400k real-world annotated samples.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/molparser-markush-example.webp"
         alt="Example of a complex Markush structure"
         title="Example of a complex Markush structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example of a complex Markush structure that can be represented by the E-SMILES format but not by standard SMILES or FG-SMILES.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/molparser-low-quality-example.webp"
         alt="Sample from the WildMol benchmark"
         title="Sample from the WildMol benchmark"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A sample from the WildMol benchmark, showing a low-quality, noisy molecular image cropped from real-world literature that challenges OCSR systems.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/molparser-colored-example.webp"
         alt="Colored molecule with annotations"
         title="Colored molecule with annotations"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser-7M (Training Set)</strong></td>
          <td>7,740,871</td>
          <td>A large-scale dataset for training OCSR models, split into pre-training and fine-tuning stages.</td>
      </tr>
      <tr>
          <td><strong>WildMol (Test Set)</strong></td>
          <td>20,000</td>
          <td>A benchmark of 20,000 human-annotated samples cropped from real PDF files to evaluate OCSR models in &lsquo;in-the-wild&rsquo; scenarios. Comprises WildMol-10k (10k ordinary molecules) and WildMol-10k-M (10k Markush structures).</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="wildmol-10k-accuracy">WildMol-10K Accuracy<a hidden class="anchor" aria-hidden="true" href="#wildmol-10k-accuracy">#</a></h3>
    <p class="benchmark-description">Evaluation of OCSR models on 10,000 real-world molecular images cropped from scientific literature and patents</p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Accuracy (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>MolParser-Base</strong><br><small>End-to-end visual recognition trained on MolParser-7M</small>
          </td>
          <td>76.9</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>MolScribe</strong><br><small>Transformer-based OCSR system</small>
          </td>
          <td>66.4</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>DECIMER 2.7</strong><br><small>Deep learning for chemical image recognition</small>
          </td>
          <td>56.0</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>MolGrapher</strong><br><small>Graph-based molecular structure recognition</small>
          </td>
          <td>45.5</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>MolVec 0.9.7</strong><br><small>Vector-based structure recognition</small>
          </td>
          <td>26.4</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>OSRA 2.1</strong><br><small>Optical Structure Recognition Application</small>
          </td>
          <td>26.3</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Img2Mol</strong><br><small>Image-to-molecule translation</small>
          </td>
          <td>24.4</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Imago 2.0</strong><br><small>Chemical structure recognition toolkit</small>
          </td>
          <td>6.9</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="key-contribution">Key Contribution</h2>
<p>Introduces MolParser-7M, the largest open-source Optical Chemical Structure Recognition (OCSR) dataset, uniquely combining diverse synthetic data with a large volume of manually annotated, &ldquo;in-the-wild&rdquo; images from real scientific documents to improve model robustness. Also introduces WildMol, a new challenging benchmark for evaluating OCSR performance on real-world data, including Markush structures.</p>
<h2 id="overview">Overview</h2>
<p>The MolParser project addresses the challenge of recognizing molecular structures from images found in real-world scientific documents. Unlike existing OCSR datasets that rely primarily on synthetically generated images, MolParser-7M incorporates 400,000 manually annotated images cropped from actual patents and scientific papers, making it the first large-scale dataset to bridge the gap between synthetic training data and real-world deployment scenarios.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Largest open-source OCSR dataset with over 7.7 million pairs</li>
<li>The only large-scale OCSR training set that includes a significant amount (400k) of &ldquo;in-the-wild&rdquo; data cropped from real patents and literature</li>
<li>High diversity of molecular structures from numerous sources (PubChem, ChEMBL, polymers, etc.)</li>
<li>Introduces the WildMol benchmark for evaluating performance on challenging, real-world data, including Markush structures</li>
<li>The &ldquo;in-the-wild&rdquo; fine-tuning data (MolParser-SFT-400k) was curated via an efficient active learning data engine with human-in-the-loop validation</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The E-SMILES format cannot represent certain complex cases, such as coordination bonds, dashed abstract rings, Markush structures depicted with special patterns, and replication of long structural segments on the skeleton</li>
<li>The model and data do not yet fully exploit molecular chirality, which is critical for chemical properties</li>
<li>Performance could be further improved by scaling up the amount of real annotated training data</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="synthetic-data-generation">Synthetic Data Generation</h3>
<p>To ensure diversity, molecular structures were collected from databases like ChEMBL, PubChem, and Kaggle BMS. A significant number of Markush, polymer, and fused-ring structures were also randomly generated. Images were rendered using RDKit and epam.indigo with randomized parameters (e.g., bond width, font size, rotation) to increase visual diversity. The pretraining dataset is composed of the following subsets:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Markush-3M</td>
          <td>40%</td>
          <td>Random groups replacement from PubChem</td>
      </tr>
      <tr>
          <td>ChEMBL-2M</td>
          <td>27%</td>
          <td>Molecules selected from ChEMBL</td>
      </tr>
      <tr>
          <td>Polymer-1M</td>
          <td>14%</td>
          <td>Randomly generated polymer molecules</td>
      </tr>
      <tr>
          <td>PAH-600k</td>
          <td>8%</td>
          <td>Randomly generated fused-ring molecules</td>
      </tr>
      <tr>
          <td>BMS-360k</td>
          <td>5%</td>
          <td>Molecules with long carbon chains from BMS</td>
      </tr>
      <tr>
          <td>MolGrapher-300K</td>
          <td>4%</td>
          <td>Training data from MolGrapher</td>
      </tr>
      <tr>
          <td>Pauling-100k</td>
          <td>2%</td>
          <td>Pauling-style images drawn using epam.indigo</td>
      </tr>
  </tbody>
</table>
<h3 id="in-the-wild-data-engine-molparser-sft-400k">In-the-Wild Data Engine (MolParser-SFT-400k)</h3>
<p>A YOLO11 object detection model (MolDet) located and cropped over 20 million molecule images from 1.22 million real PDFs (patents and papers). After de-duplication via p-hash similarity, 4 million unique images remained.</p>
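<p>The de-duplication step can be sketched with Hamming distance over 64-bit perceptual hashes. The distance threshold and the greedy single-pass strategy below are assumptions; the paper states only that p-hash similarity was used:</p>

```python
def hamming(h1: int, h2: int) -> int:
    """Bit-level Hamming distance between two perceptual hashes."""
    return bin(h1 ^ h2).count("1")

def dedupe_by_phash(hashes, max_dist=4):
    """Greedy de-duplication: keep an image only if its p-hash is farther than
    `max_dist` bits from every hash already kept. O(n^2), fine for a sketch;
    at the paper's 20M-image scale a BK-tree or LSH index would be needed."""
    kept = []
    for h in hashes:
        if all(hamming(h, k) > max_dist for k in kept):
            kept.append(h)
    return kept

crops = [0b1111000011110000, 0b1111000011110001,  # near-duplicates (1 bit apart)
         0b0000111100001111]                       # a visually distinct image
print(len(dedupe_by_phash(crops)))  # 2 unique images survive
```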
<p>An active learning algorithm was used to select the most informative samples for annotation, targeting images where an ensemble of 5-fold models showed moderate confidence (0.6-0.9 Tanimoto similarity), indicating they were challenging but learnable.</p>
<p>This active learning approach with model pre-annotations reduced manual annotation time per molecule to 30 seconds, approximately 90% savings compared to annotating from scratch. In the final fine-tuning dataset, 56.04% of annotations directly utilized raw model pre-annotations, 20.97% passed review after a single manual correction, 13.87% were accepted after a second round of annotation, and 9.13% required three or more rounds.</p>
<p>The fine-tuning dataset is composed of:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-SFT-400k</td>
          <td>66%</td>
          <td>Manually annotated data obtained via data engine</td>
      </tr>
      <tr>
          <td>MolParser-Gen-200k</td>
          <td>32%</td>
          <td>Synthetic data selected from pretraining stage</td>
      </tr>
      <tr>
          <td>Handwrite-5k</td>
          <td>1%</td>
          <td>Handwritten molecules selected from Img2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="e-smiles-specification">E-SMILES Specification</h3>
<p>To accommodate complex patent structures that standard SMILES cannot support, the authors introduced an Extended SMILES format (<code>SMILES&lt;sep&gt;EXTENSION</code>). The <code>EXTENSION</code> component uses XML-like tokens to manage complexities:</p>
<ul>
<li><code>&lt;a&gt;...&lt;/a&gt;</code> encapsulates Markush R-groups and abbreviation groups.</li>
<li><code>&lt;r&gt;...&lt;/r&gt;</code> denotes ring attachments with uncertainty positions.</li>
<li><code>&lt;c&gt;...&lt;/c&gt;</code> defines abstract rings.</li>
<li><code>&lt;dum&gt;</code> identifies a connection point.</li>
</ul>
<p>This format enables Markush-molecule matching and LLM integration, while retaining RDKit compatibility for the standard SMILES portion.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>Training and test data on HuggingFace. SFT subset is partially released.</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet (YOLO11)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Molecule detection model on HuggingFace</td>
      </tr>
      <tr>
          <td><a href="https://ocsr.dp.tech/">MolParser Demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online OCSR demo using MolParser-Base</td>
      </tr>
  </tbody>
</table>
<p>The dataset is publicly available on HuggingFace under a CC-BY-NC-SA-4.0 (non-commercial) license. The MolParser-SFT-400k subset is only partially released. The YOLO11-based MolDet detection model is also available on HuggingFace. No public code repository is provided for the MolParser recognition model itself. All experiments were conducted on 8 NVIDIA RTX 4090D GPUs, and throughput benchmarks were measured on a single RTX 4090D GPU.</p>
]]></content:encoded></item><item><title>SubGrapher: Visual Fingerprinting of Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</link><pubDate>Mon, 28 Apr 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</guid><description>SubGrapher creates molecular fingerprints directly from chemical structure images through functional group segmentation for database retrieval.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-taxonomy">Paper Classification and Taxonomy</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution. Using the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> framework:</p>
<p><strong>Primary Classification: Method</strong></p>
<p>The dominant basis vector is Methodological because SubGrapher introduces an architecture that replaces the two-step OCSR workflow (image → structure → fingerprint) with single-step fingerprinting (image → visual fingerprint). The paper validates this approach through systematic comparison against state-of-the-art methods (MolGrapher, OSRA, DECIMER, MolScribe), demonstrating superior performance on specific tasks like retrieval and robustness to image quality degradation.</p>
<p><strong>Secondary Classification: Resource</strong></p>
<p>The paper makes non-negligible resource contributions by releasing:</p>
<ul>
<li>Code and model weights on <a href="https://github.com/DS4SD/SubGrapher">GitHub</a> and <a href="https://huggingface.co/docling-project/SubGrapher">HuggingFace</a></li>
<li>Five new visual fingerprinting benchmark datasets for molecule retrieval tasks</li>
<li>Comprehensive functional group knowledge base (1,534 substructures)</li>
</ul>
<h2 id="motivation-extracting-complex-structures-from-noisy-images">Motivation: Extracting Complex Structures from Noisy Images</h2>
<p>The motivation tackles a fundamental challenge in chemical informatics: extracting molecular information from the vast amounts of unstructured scientific literature, particularly patents. Millions of molecular structures exist only as images in these documents, making them inaccessible for computational analysis, database searches, or machine learning applications.</p>
<p>Traditional Optical Chemical Structure Recognition (OCSR) tools attempt to fully reconstruct molecular graphs from images, converting them into machine-readable formats like SMILES. However, these approaches face two critical limitations:</p>
<ol>
<li><strong>Brittleness to image quality</strong>: Poor resolution, noise, or unconventional drawing styles frequently degrade recognition accuracy</li>
<li><strong>Limited handling of complex structures</strong>: Markush structures, generic molecular templates with variable R-groups commonly used in patents, are poorly supported by most conventional OCSR methods</li>
</ol>
<p>The key insight driving SubGrapher is that full molecular reconstruction may be unnecessary for many applications. For tasks like database searching, similarity analysis, or document retrieval, a molecular fingerprint (a vectorized representation capturing structural features) is often sufficient. This realization opens up a new approach: bypass the fragile reconstruction step and create fingerprints directly from visual information.</p>
<h2 id="key-innovation-direct-visual-fingerprinting">Key Innovation: Direct Visual Fingerprinting</h2>
<p>SubGrapher takes a different approach to extracting chemical information from images. It creates &ldquo;visual fingerprints&rdquo; through functional group recognition. The key innovations are:</p>
<ol>
<li>
<p><strong>Direct Image-to-Fingerprint Pipeline</strong>: SubGrapher eliminates the traditional two-step process (image → structure → fingerprint) by generating fingerprints directly from pixels. This single-stage approach avoids error accumulation from failed structure reconstructions and can handle images where conventional OCSR tools produce invalid outputs.</p>
</li>
<li>
<p><strong>Dual Instance Segmentation Architecture</strong>: The system employs two specialized Mask-RCNN networks working in parallel:</p>
<ul>
<li><strong>Functional group detector</strong>: Trained to identify 1,534 expert-defined functional groups using pixel-level segmentation masks</li>
<li><strong>Carbon backbone detector</strong>: Recognizes 27 common carbon chain patterns to capture the molecular scaffold</li>
</ul>
<p>Using instance segmentation provides detailed spatial information and higher accuracy through richer supervision during training.</p>
</li>
<li>
<p><strong>Extensive Functional Group Knowledge Base</strong>: The method uses one of the most comprehensive open-source collections of functional groups, encompassing 1,534 substructures. These were systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Additional halogen substituents and organometallic groups relevant to EUV photoresists</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li>
<p><strong>Substructure-Graph Construction</strong>: After detecting functional groups and carbon backbones, SubGrapher builds a connectivity graph where:</p>
<ul>
<li>Each node represents an identified substructure</li>
<li>Edges connect substructures whose bounding boxes overlap (with 10% margin expansion)</li>
<li>This graph captures both the chemical components and their spatial relationships</li>
</ul>
</li>
<li>
<p><strong>Substructure-based Visual Molecular Fingerprint (SVMF)</strong>: The final output is a continuous, count-based fingerprint formally defined as a matrix $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$ (1,534 functional groups + 27 carbon backbones). The matrix is stored as a compressed upper triangular form:</p>
<p><strong>Diagonal elements</strong> ($i = j$): Weighted count of substructure $i$ plus self-intersection
$$SVMF_{ii}(m) = h_1 \cdot n_i + g_{ii}$$
where $h_1 = 10$ is the diagonal weight hyperparameter, $n_i$ is the instance count, and $g_{ii}$ is the self-intersection coefficient.</p>
<p><strong>Off-diagonal elements</strong> ($i \neq j$): Intersection coefficient based on shortest path distance $d$ in the substructure graph
$$SVMF_{ij}(m) = h_2(d) \cdot \text{intersection}(s_i, s_j)$$
where the distance decay function $h_2(d)$ is:</p>
<ul>
<li>$d \leq 1$: weight = 2</li>
<li>$d = 2$: weight = 2/4 = 0.5</li>
<li>$d = 3$: weight = 2/16 = 0.125</li>
<li>$d = 4$: weight = $2/256 \approx 0.0078$</li>
<li>$d &gt; 4$: weight = 0</li>
</ul>
<p><strong>Key properties</strong>:</p>
<ul>
<li>Carbon chain intersection coefficients are divided by 2, giving functional groups higher effective weight</li>
<li>Similarity between fingerprints is calculated using a normalized Euclidean distance (the L2 norm of the difference divided by the L2 norm of the sum)</li>
<li>Resulting fingerprints are highly sparse (average 0.001% non-zero elements)</li>
<li>Compressed storage enables efficient database searches</li>
</ul>
</li>
<li>
<p><strong>Markush Structure Compatibility</strong>: SubGrapher processes Markush structures by recognizing their constituent functional groups and creating meaningful fingerprints for similarity searches, achieving higher accuracy than existing OCSR methods on the USPTO-Markush benchmark (S-F1: 88).</p>
</li>
</ol>
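<p>The fingerprint construction above can be sketched compactly. This is a hypothetical toy implementation, not the authors&rsquo; code: the vocabulary size, instance counts, and intersection coefficients are illustrative stand-ins, while <code>H1</code> and the <code>h2(d)</code> decay follow the values stated in the paper.</p>

```python
import numpy as np

# Toy sketch of the SVMF construction. The real vocabulary has n = 1561
# substructure classes (1,534 functional groups + 27 carbon backbones);
# we use a tiny stand-in vocabulary here.
N_TYPES = 5
H1 = 10  # diagonal weight hyperparameter from the paper


def h2(d):
    """Distance-decay weight for off-diagonal entries (paper's values)."""
    table = {0: 2.0, 1: 2.0, 2: 2.0 / 4, 3: 2.0 / 16, 4: 2.0 / 256}
    return table.get(d, 0.0)


def svmf(counts, pair_info):
    """Build a dense SVMF matrix from detections.

    counts:    {type_index: instance_count}
    pair_info: {(i, j): (graph_distance, intersection_coeff)} with i <= j
    """
    m = np.zeros((N_TYPES, N_TYPES))
    for i, n_i in counts.items():
        g_ii = pair_info.get((i, i), (0, 0.0))[1]  # self-intersection term
        m[i, i] = H1 * n_i + g_ii
    for (i, j), (d, inter) in pair_info.items():
        if i != j:
            m[i, j] = m[j, i] = h2(d) * inter
    return m


# Two instances of type 0, one of type 3, adjacent in the graph (d = 1):
fp = svmf({0: 2, 3: 1}, {(0, 3): (1, 1.0), (0, 0): (0, 0.5)})
```

<p>In practice the matrix would then be stored in its compressed upper-triangular form before any database comparison.</p>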
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating SubGrapher&rsquo;s effectiveness across two critical tasks: accurate substructure detection and robust molecule retrieval from diverse image collections.</p>
<h4 id="substructure-detection-performance">Substructure Detection Performance</h4>
<p>SubGrapher&rsquo;s ability to identify functional groups was tested on three challenging benchmarks that expose different failure modes of OCSR systems:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Description</th>
          <th>Key Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>JPO</strong></td>
          <td>341 images</td>
          <td>Japanese Patent Office images (molecules with abbreviations removed)</td>
          <td>Low quality, noise, artifacts, non-standard drawing styles</td>
      </tr>
      <tr>
          <td><strong>USPTO-10K-L</strong></td>
          <td>1,000 images</td>
          <td>Large molecules (&gt;70 atoms)</td>
          <td>Scale variation, structural complexity, many functional groups</td>
      </tr>
      <tr>
          <td><strong>USPTO-Markush</strong></td>
          <td>74 images</td>
          <td>Generic Markush structures</td>
          <td>Variable R-groups, abstract patterns, template representation</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings:</strong></p>
<ol>
<li>
<p><strong>JPO Dataset (Low-Quality Patent Images)</strong>: SubGrapher achieved the highest Molecule Exact Match rate (83%), demonstrating robustness to image quality degradation where rule-based methods like OSRA scored lower (67% M-EM).</p>
</li>
<li>
<p><strong>USPTO-10K-L (Large Molecules)</strong>: SubGrapher achieved an S-F1 of 97, matching the rule-based OSRA and outperforming all other learning-based methods (MolScribe: 90, DECIMER: 86, MolGrapher: 56). The object detection approach handled scale variation better than other deep-learning OCSR tools on these challenging targets.</p>
</li>
<li>
<p><strong>USPTO-Markush (Generic Structures)</strong>: SubGrapher achieved the highest Substructure F1-score (88) on this benchmark, outperforming MolScribe (86), OSRA (74), and DECIMER (10). While other OCSR tools can attempt these images, they have limited support for Markush features. SubGrapher&rsquo;s instance segmentation approach handles complex Markush structures more effectively by focusing on relevant image regions.</p>
</li>
</ol>
<p>Qualitative analysis revealed that SubGrapher correctly identified functional groups in scenarios where other methods failed completely: images with captions, unconventional drawing styles, or significant quality degradation.</p>
<h4 id="visual-fingerprinting-for-molecule-retrieval">Visual Fingerprinting for Molecule Retrieval</h4>
<p>The core application was evaluated using a retrieval task designed to simulate real-world database searching:</p>
<ol>
<li>
<p><strong>Benchmark Creation</strong>: Five benchmark datasets were constructed around structurally similar molecules (adenosine, camphor, cholesterol, limonene, and pyridine), each containing 500 molecules sampled from PubChem with at least 90% Tanimoto similarity to the reference molecule, rendered as augmented images.</p>
</li>
<li>
<p><strong>Retrieval Task</strong>: Given a SMILES string as a query, the goal was to find the corresponding molecular image within the dataset of 500 visually similar structures. This tests whether the visual fingerprint can distinguish between closely related molecules.</p>
</li>
<li>
<p><strong>Performance Comparison</strong>: SubGrapher significantly outperformed baseline methods, retrieving the correct molecule at an average rank of 95 out of 500. The key advantage was robustness: SubGrapher generates a unique fingerprint for every image, even with partial or uncertain predictions. In contrast, OCSR-based methods frequently fail to produce valid SMILES, making them unable to generate fingerprints for comparison.</p>
</li>
<li>
<p><strong>Real-World Case Study</strong>: A practical demonstration involved searching a 54-page patent document containing 356 chemical images for a specific Markush structure. SubGrapher successfully located the target structure, highlighting its utility for large-scale document mining.</p>
</li>
</ol>
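<p>The retrieval protocol can be sketched as follows. This is a hedged illustration with random stand-in fingerprints (not real SVMFs), using the paper&rsquo;s normalized Euclidean distance as the ranking metric:</p>

```python
import numpy as np

# Sketch of the retrieval task: rank 500 candidate fingerprints by the
# normalized Euclidean distance ||a - b|| / ||a + b|| and report the
# rank of the ground-truth molecule. Fingerprints are random stand-ins.
rng = np.random.default_rng(0)


def ned(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(a + b)


candidates = rng.random((500, 64))
truth_idx = 123
query = candidates[truth_idx] + 0.01 * rng.random(64)  # near-duplicate query

dists = [ned(query, c) for c in candidates]
rank = 1 + sorted(range(500), key=lambda i: dists[i]).index(truth_idx)
```

<p>Averaging this rank over 50 queries per benchmark yields the reported average retrieval rank.</p>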
<h4 id="training-data-generation">Training Data Generation</h4>
<p>Since no public datasets existed with the required pixel-level mask annotations for functional groups, the researchers developed a comprehensive synthetic data generation pipeline:</p>
<ol>
<li>
<p><strong>Extended MolDepictor</strong>: They enhanced existing molecular rendering tools to create images from SMILES strings and generate corresponding segmentation masks for all substructures present in each molecule.</p>
</li>
<li>
<p><strong>Markush Structure Rendering</strong>: The pipeline was extended to handle complex generic structures using CXSMILES representations and the CDK library for rendering, creating training data for molecular templates with structural, positional, and frequency variation indicators.</p>
</li>
<li>
<p><strong>Diverse Molecular Sources</strong>: Training molecules were sourced from PubChem to ensure broad chemical diversity and coverage of different structural families.</p>
</li>
</ol>
<h2 id="results-impact-and-limitations">Results, Impact, and Limitations</h2>
<ul>
<li><strong>Superior Robustness to Image Quality</strong>: SubGrapher consistently outperformed traditional OCSR methods on degraded images, particularly the JPO patent dataset. SubGrapher&rsquo;s learned representations proved more resilient to noise, artifacts, and unconventional drawing styles than rule-based alternatives like OSRA (M-EM: 83 vs. 67 on JPO).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SubGrapher</th>
          <th>MolScribe</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>MolGrapher</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>S-F1</strong> (JPO)</td>
          <td>92</td>
          <td><strong>94</strong></td>
          <td>81</td>
          <td>86</td>
          <td>89</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (JPO)</td>
          <td><strong>83</strong></td>
          <td>82</td>
          <td>67</td>
          <td>79</td>
          <td>80</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-10K-L)</td>
          <td><strong>97</strong></td>
          <td>90</td>
          <td><strong>97</strong></td>
          <td>86</td>
          <td>56</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-10K-L)</td>
          <td>55</td>
          <td>55</td>
          <td><strong>75</strong></td>
          <td>66</td>
          <td>31</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-Markush)</td>
          <td><strong>88</strong></td>
          <td>86</td>
          <td>74</td>
          <td>10</td>
          <td>35</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-Markush)</td>
          <td>82</td>
          <td><strong>86</strong></td>
          <td>70</td>
          <td>11</td>
          <td>30</td>
      </tr>
      <tr>
          <td><strong>Avg Retrieval Rank</strong></td>
          <td><strong>95/500</strong></td>
          <td>181-241/500</td>
          <td>138-185/500</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<p>Note: Retrieval rank ranges reflect the best and worst fingerprint-method pairings (RDKit Daylight or MHFP) for each OCSR model.</p>
<ul>
<li>
<p><strong>Effective Handling of Scale and Complexity</strong>: The instance segmentation approach successfully managed large molecules and complex structures where traditional graph-reconstruction methods struggled. The Substructure F1-scores on USPTO-10K-L and USPTO-Markush benchmarks demonstrated clear advantages for challenging molecular targets.</p>
</li>
<li>
<p><strong>Markush Structure Processing</strong>: SubGrapher achieves the highest Substructure F1-score on Markush structures (88 vs. MolScribe&rsquo;s 86 and OSRA&rsquo;s 74). While other OCSR methods can attempt Markush images, they support only limited features such as abbreviation-based variable groups. SubGrapher handles complex Markush features more effectively, expanding the scope of automatically extractable chemical information from patent literature.</p>
</li>
<li>
<p><strong>Robust Molecule Retrieval Performance</strong>: The visual fingerprinting approach achieved reliable retrieval performance (average rank 95/500) across diverse molecular families. The key advantage was consistency: SubGrapher generates meaningful fingerprints even from partial or uncertain predictions, while OCSR-based methods often fail to produce any usable output.</p>
</li>
<li>
<p><strong>Practical Document Mining Capability</strong>: The successful identification of specific Markush structures within large patent documents (54 pages, 356 images) demonstrates real-world applicability for large-scale literature mining and intellectual property analysis.</p>
</li>
<li>
<p><strong>Single-Stage Architecture Benefits</strong>: By eliminating the traditional image → structure → fingerprint pipeline, SubGrapher avoids error accumulation from failed molecular reconstructions. Every input image produces a fingerprint, making the system more reliable for batch processing of diverse document collections.</p>
</li>
<li>
<p><strong>Limitations and Scope</strong>: The method remains focused on common organic functional groups and may struggle with inorganic chemistry, organometallic complexes, or highly specialized molecular classes not well-represented in the training data. The 1,534 functional groups, while extensive, represent a curated subset of chemical space. SubGrapher also cannot distinguish enantiomers, as the detected substructures lack stereochemistry information. Additionally, the method currently cannot recognize substructures in abbreviations or single-atom fragments.</p>
</li>
</ul>
<p>The work demonstrates that direct fingerprint generation can be more robust and practical than traditional structure reconstruction approaches. SubGrapher&rsquo;s ability to handle Markush structures and degraded images makes it particularly valuable for patent analysis and large-scale document mining, where traditional OCSR methods frequently fail. The approach suggests that task-specific learning (fingerprints for retrieval) can outperform general-purpose reconstruction methods in many practical applications.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data Generation</strong>: The paper developed a custom synthetic data pipeline since no public datasets existed with pixel-level mask annotations for functional groups:</p>
<ul>
<li><strong>Extended MolDepictor</strong>: Enhanced molecular rendering tool to generate both images and corresponding segmentation masks for all substructures</li>
<li><strong>Markush Structure Rendering</strong>: Pipeline extended to handle complex generic structures</li>
<li><strong>Source Molecules</strong>: PubChem for broad chemical diversity</li>
</ul>
<p><strong>Evaluation Benchmarks</strong>:</p>
<ul>
<li><strong>JPO Dataset</strong>: Real patent images with poor resolution, noise, and artifacts</li>
<li><strong>USPTO-10K-L</strong>: Large complex molecular structures</li>
<li><strong>USPTO-Markush</strong>: Generic structures with variable R-groups</li>
<li><strong>Retrieval Benchmarks</strong>: Five datasets (adenosine, camphor, cholesterol, limonene, pyridine), each with 500 similar molecular images</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Dual instance segmentation system using Mask-RCNN</p>
<ul>
<li><strong>Functional Group Detector</strong>: Mask-RCNN trained to identify 1,534 expert-defined functional groups</li>
<li><strong>Carbon Backbone Detector</strong>: Mask-RCNN trained to recognize 27 common carbon chain patterns</li>
<li><strong>Backbone Network</strong>: Not specified in the paper</li>
</ul>
<p><strong>Functional Group Knowledge Base</strong>: 1,534 substructures systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Definition</strong>:</p>
<ul>
<li><strong>1,534 Functional Groups</strong>: Defined by manually curated SMARTS patterns
<ul>
<li>Must contain heteroatoms (O, N, S, P, B)</li>
<li>Frequency threshold: ~1,000+ occurrences in PubChem</li>
<li>Systematically constructed from chemically logical atom combinations</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li><strong>27 Carbon Backbones</strong>: Patterns of 3-6 carbon atoms (rings and chains) to capture molecular scaffolds</li>
</ul>
<p><strong>Substructure-Graph Construction</strong>:</p>
<ol>
<li>Detect functional groups and carbon backbones using Mask-RCNN models</li>
<li>Build connectivity graph:
<ul>
<li>Each node represents an identified substructure instance</li>
<li>Edges connect substructures whose bounding boxes overlap</li>
<li>Bounding boxes expanded by 10% of smallest box&rsquo;s diagonal to ensure connectivity between adjacent groups</li>
<li>Carbon chain intersection coefficients divided by 2, giving functional groups higher effective weight</li>
</ul>
</li>
</ol>
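<p>The bounding-box expansion and overlap test above can be sketched directly. The box coordinates here are illustrative, not real detector output; only the 10%-of-smallest-diagonal margin rule comes from the paper:</p>

```python
import math
from itertools import combinations

# Toy sketch of the substructure-graph step: expand each detection's box
# by 10% of the smallest box's diagonal, then connect any two
# substructures whose expanded boxes overlap. Boxes are (x0, y0, x1, y1).


def expand(box, margin):
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def build_graph(boxes):
    diag = min(math.hypot(x1 - x0, y1 - y0) for x0, y0, x1, y1 in boxes)
    margin = 0.10 * diag
    grown = [expand(b, margin) for b in boxes]
    return {(i, j) for i, j in combinations(range(len(boxes)), 2)
            if overlaps(grown[i], grown[j])}


# Two nearly-touching groups plus one distant backbone:
edges = build_graph([(0, 0, 10, 10), (10.5, 0, 20, 10), (100, 100, 110, 110)])
```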
<p><strong>SVMF Fingerprint Generation</strong>:</p>
<ul>
<li>Matrix form: $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$</li>
<li>Stored as compressed sparse upper triangular matrix</li>
<li><strong>Diagonal elements</strong>: $SVMF_{ii} = h_1 \cdot n_i + g_{ii}$ where $h_1 = 10$</li>
<li><strong>Off-diagonal elements</strong>: $SVMF_{ij} = h_2(d) \cdot \text{intersection}(s_i, s_j)$ where:
<ul>
<li>$h_2(d) = 2$ for $d = 0, 1$</li>
<li>$h_2(2) = 2/4$, $h_2(3) = 2/16$, $h_2(4) = 2/256$</li>
<li>$h_2(d) = 0$ for $d &gt; 4$</li>
</ul>
</li>
<li>Average sparsity: 0.001% non-zero elements</li>
<li>Similarity metric: Normalized Euclidean distance (L2 norm of difference divided by L2 norm of sum)</li>
</ul>
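<p>Because the matrix is symmetric and overwhelmingly sparse, only non-zero upper-triangular entries need storing. A minimal sketch of that compressed representation (toy 4&times;4 matrix standing in for the 1561&times;1561 fingerprint):</p>

```python
import numpy as np

# Sketch of compressed sparse upper-triangular storage for a symmetric,
# highly sparse fingerprint matrix.


def compress(m):
    """Return {(i, j): value} for non-zero entries with i <= j."""
    n = m.shape[0]
    return {(i, j): m[i, j] for i in range(n) for j in range(i, n)
            if m[i, j] != 0}


def decompress(sparse, n):
    m = np.zeros((n, n))
    for (i, j), v in sparse.items():
        m[i, j] = m[j, i] = v
    return m


m = np.zeros((4, 4))
m[0, 0], m[0, 2] = 20.0, 0.5
m[2, 0] = 0.5  # symmetric counterpart
sparse = compress(m)
```

<p>Storing only these few entries is what makes database-scale fingerprint comparison tractable.</p>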
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Substructure F1-score (S-F1)</strong>: Harmonic mean of precision and recall for individual substructure detection across all molecules in the dataset</li>
<li><strong>Molecule Exact Match (M-EM)</strong>: Percentage of molecules where S-F1 = 1.0 (all substructures correctly identified)</li>
<li><strong>Retrieval Rank</strong>: Rank of the ground-truth molecule in a candidate list of 500 similar structures when querying with a SMILES-derived fingerprint, averaged across 50 queries per benchmark</li>
</ul>
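<p>The two detection metrics can be sketched as below, assuming per-image sets of predicted and ground-truth substructure labels (multisets in general; plain sets here for simplicity):</p>

```python
# Hedged sketch of S-F1 and M-EM over a toy dataset of (pred, truth) pairs.


def s_f1(pred, truth):
    """Substructure F1 for one image."""
    if not pred or not truth:
        return 0.0
    tp = len(pred & truth)
    precision, recall = tp / len(pred), tp / len(truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def dataset_scores(pairs):
    f1s = [s_f1(p, t) for p, t in pairs]
    avg_s_f1 = sum(f1s) / len(f1s)
    m_em = sum(f == 1.0 for f in f1s) / len(f1s)  # fraction with S-F1 = 1
    return avg_s_f1, m_em


scores = dataset_scores([({"OH", "C6"}, {"OH", "C6"}),  # exact match
                         ({"OH"}, {"OH", "NH2"})])      # partial match
```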
<p><strong>Baselines</strong>: Compared against SOTA OCSR methods:</p>
<ul>
<li>Deep learning: MolScribe, MolGrapher, DECIMER</li>
<li>Rule-based: OSRA</li>
<li>Fingerprint methods: RDKit Daylight, MHFP (applied to OCSR outputs)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Training and inference hardware details are not provided in the main text; they may be available in the code repository or supplementary materials.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/SubGrapher">SubGrapher (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official inference code</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/docling-project/SubGrapher">SubGrapher (HuggingFace)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/docling-project/SubGrapher-Datasets">SubGrapher-Datasets (HuggingFace)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Visual fingerprinting benchmark datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="implementation-gaps">Implementation Gaps</h3>
<p>The following details are not available in the paper and would require access to the code repository or supplementary information:</p>
<ul>
<li>Specific backbone architecture for Mask-RCNN (ResNet variant, Swin Transformer, etc.)</li>
<li>Optimizer type (AdamW, SGD, etc.)</li>
<li>Learning rate and scheduler</li>
<li>Batch size and number of training epochs</li>
<li>Loss function weights (box loss vs. mask loss)</li>
<li>GPU/TPU specifications used for training</li>
<li>Training time and computational requirements</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Meijer, G. I., Weber, V., Van Gool, L., &amp; Staar, P. W. J. (2025). SubGrapher: Visual fingerprinting of chemical structures. Journal of Cheminformatics, 17(1), 149. <a href="https://doi.org/10.1186/s13321-025-01091-4">https://doi.org/10.1186/s13321-025-01091-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{morinSubGrapherVisualFingerprinting2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{SubGrapher: Visual Fingerprinting of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{SubGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valéry and Van Gool, Luc and Staar, Peter W. J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{149}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-025-01091-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RFL: Simplifying Chemical Structure Recognition (AAAI 2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/</link><pubDate>Thu, 19 Dec 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/</guid><description>Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD) for improved optical chemical structure recognition from molecular images.</description><content:encoded><![CDATA[<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological</strong> paper ($\Psi_{\text{Method}}$). It introduces a novel representation system (Ring-Free Language) and a specialized neural architecture (Molecular Skeleton Decoder) designed to solve specific limitations in converting 2D images to 1D chemical strings. The paper validates this method through direct comparison with existing baselines and ablation studies.</p>
<h2 id="motivation-limitations-of-1d-serialization">Motivation: Limitations of 1D Serialization</h2>
<p>Current Optical Chemical Structure Recognition (OCSR) methods typically rely on &ldquo;unstructured modeling,&rdquo; where 2D molecular graphs are flattened into 1D strings like SMILES or SSML. While simple, these linear formats struggle to explicitly capture complex spatial relationships, particularly in molecules with multiple rings and branches. End-to-end models often fail to &ldquo;understand&rdquo; the graph structure when forced to predict these implicit 1D sequences, leading to error accumulation in complex scenarios.</p>
<h2 id="innovation-ring-free-language-rfl-and-molecular-skeleton-decoder-msd">Innovation: Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD)</h2>
<p>The authors propose two primary contributions to decouple spatial complexity:</p>
<ol>
<li><strong>Ring-Free Language (RFL)</strong>: A divide-and-conquer representation that splits a molecular graph $G$ into three explicit components: a molecular skeleton $\mathcal{S}$, individual ring structures $\mathcal{R}$, and branch information $\mathcal{F}$. This allows rings to be collapsed into &ldquo;SuperAtoms&rdquo; or &ldquo;SuperBonds&rdquo; during initial parsing.</li>
<li><strong>Molecular Skeleton Decoder (MSD)</strong>: A hierarchical architecture that progressively predicts the skeleton first, then the individual rings (using SuperAtom features as conditions), and finally classifies the branch connections.</li>
</ol>
<h2 id="methodology-and-experiments">Methodology and Experiments</h2>
<p>The method was evaluated on both handwritten and printed chemical structures against two baselines: DenseWAP (Zhang et al. 2018) and RCGD (Hu et al. 2023).</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>EDU-CHEMC</strong>: ~49k handwritten samples (challenging, diverse styles)</li>
<li><strong>Mini-CASIA-CSDB</strong>: ~89k printed samples (from ChEMBL)</li>
<li><strong>Synthetic Complexity Dataset</strong>: A custom split of ChEMBL data grouped by structural complexity (atoms + bonds + rings) to test generalization</li>
</ul>
</li>
<li><strong>Ablation Studies</strong> (Table 2, on EDU-CHEMC with MSD-DenseWAP): Without MSD or <code>[conn]</code>, EM=38.70%. Adding <code>[conn]</code> alone raised EM to 44.02%. Adding MSD alone raised EM to 52.76%. Both together achieved EM=64.96%, confirming each component&rsquo;s contribution.</li>
</ul>
<h2 id="outcomes-and-conclusions">Outcomes and Conclusions</h2>
<ul>
<li><strong>New best results</strong>: MSD-RCGD achieved 65.39% EM on EDU-CHEMC (handwritten) and 95.23% EM on Mini-CASIA-CSDB (printed), outperforming the RCGD baseline (62.86% and 95.01%, respectively). MSD-DenseWAP surpassed the previous best on EDU-CHEMC by 2.06% EM (64.92% vs. 62.86%).</li>
<li><strong>Universal improvement</strong>: Applying MSD/RFL to DenseWAP improved its accuracy from 61.35% to 64.92% EM on EDU-CHEMC and from 92.09% to 94.10% EM on Mini-CASIA-CSDB, demonstrating the method is model-agnostic.</li>
<li><strong>Complexity handling</strong>: When trained on low-complexity molecules only (levels 1-2), MSD-DenseWAP still recognized higher-complexity unseen structures, while standard DenseWAP could hardly recognize them at all (Figure 6 in the paper).</li>
</ul>
<p>The authors note that this is the first end-to-end solution that decouples and models chemical structures in a structured form. Future work aims to extend structured-based modeling to other tasks such as tables, flowcharts, and diagrams.</p>
<hr>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/JingMog/RFL-MSD">RFL-MSD</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized one handwritten and one printed dataset, plus a synthetic set for stress-testing complexity.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>EDU-CHEMC</strong></td>
          <td>48,998 Train / 2,992 Test</td>
          <td>Handwritten images from educational scenarios</td>
      </tr>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>Mini-CASIA-CSDB</strong></td>
          <td>89,023 Train / 8,287 Test</td>
          <td>Printed images rendered from ChEMBL using RDKit</td>
      </tr>
      <tr>
          <td><strong>Generalization</strong></td>
          <td><strong>ChEMBL Subset</strong></td>
          <td>5 levels of complexity</td>
          <td>Custom split by complexity score $N_{atom} + N_{bond} + 12 \times N_{ring}$</td>
      </tr>
  </tbody>
</table>
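<p>The complexity score from the table is straightforward to compute. The score formula is from the paper; the bucket edges used to form the five levels are illustrative assumptions, since the paper&rsquo;s exact thresholds are not reproduced here:</p>

```python
# Sketch of the complexity split used for the generalization study,
# with toy counts rather than values parsed from ChEMBL.


def complexity(n_atom, n_bond, n_ring):
    return n_atom + n_bond + 12 * n_ring


def level(score, edges=(30, 60, 90, 120)):
    """Bucket a score into 5 levels; edges here are illustrative only."""
    return 1 + sum(score > e for e in edges)


score = complexity(n_atom=20, n_bond=21, n_ring=2)  # 20 + 21 + 24 = 65
```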
<h3 id="algorithms">Algorithms</h3>
<p><strong>RFL Splitting (Encoding)</strong>:</p>
<ol>
<li><strong>Detect Rings</strong>: Use DFS to find all non-nested rings $\mathcal{R}$.</li>
<li><strong>Determine Adjacency ($\gamma$)</strong>: Calculate shared edges between rings.</li>
<li><strong>Merge</strong>:
<ul>
<li>If $\gamma(r_i) = 0$ (isolated), merge ring into a <strong>SuperAtom</strong> node.</li>
<li>If $\gamma(r_i) &gt; 0$ (adjacent), merge ring into a <strong>SuperBond</strong> edge.</li>
</ul>
</li>
<li><strong>Update</strong>: Record connection info in $\mathcal{F}$ and remove ring details from the main graph to form Skeleton $\mathcal{S}$.</li>
</ol>
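<p>The merge rule in steps 2&ndash;3 can be sketched once rings are in hand. This toy version represents each detected ring as a set of edges rather than a real molecular graph; only the $\gamma$-based SuperAtom/SuperBond decision follows the paper:</p>

```python
# Toy sketch of the RFL merge rule: count edges a ring shares with any
# other ring (gamma); isolated rings collapse to SuperAtom nodes, fused
# rings to SuperBond edges.


def gamma(ring, others):
    """Number of edges this ring shares with any other ring."""
    shared = set()
    for other in others:
        shared |= ring & other
    return len(shared)


def merge_kind(ring, all_rings):
    others = [r for r in all_rings if r is not ring]
    return "SuperAtom" if gamma(ring, others) == 0 else "SuperBond"


benzene = frozenset({(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)})
fused_a = frozenset({(7, 8), (8, 9), (9, 10), (10, 7)})
fused_b = frozenset({(9, 10), (10, 11), (11, 12), (12, 9)})  # shares (9, 10)
rings = [benzene, fused_a, fused_b]
kinds = [merge_kind(r, rings) for r in rings]
```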
<p><strong>MSD Decoding</strong>:</p>
<ul>
<li><strong>Hierarchical Prediction</strong>: The model predicts the Skeleton $\mathcal{S}$ first.</li>
<li><strong>Contextual Ring Prediction</strong>: When a SuperAtom/Bond token is predicted, its hidden state $f^s$ is stored. After the skeleton is finished, $f^s$ is used as a condition to autoregressively decode the specific ring structure.</li>
<li><strong>Token <code>[conn]</code></strong>: A special token separates connected ring bonds from unconnected ones to sparsify the branch classification task.</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture follows a standard Image-to-Sequence pattern but with a forked decoder.</p>
<ul>
<li><strong>Encoder</strong>: DenseNet (Growth rate=24, Depth=32 per block)</li>
<li><strong>Decoder (MSD)</strong>:
<ul>
<li><strong>Core</strong>: GRU with Attention (Hidden dim=256, Embedding dim=256, Dropout=0.15)</li>
<li><strong>Skeleton Module</strong>: Autoregressively predicts sequence tokens. Uses Maxout activation.</li>
<li><strong>Branch Module</strong>: A binary classifier (MLP) taking concatenated features of skeleton bonds $f_{bs}$ and ring bonds $f_{br}$ to predict connectivity matrix $\mathcal{F}$.</li>
</ul>
</li>
<li><strong>Loss Function</strong>: $\mathcal{O} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{cls}$ (where $\lambda_1 = \lambda_2 = 1$)</li>
</ul>
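<p>The objective $\mathcal{O} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{cls}$ with $\lambda_1 = \lambda_2 = 1$ can be illustrated with toy probabilities in place of real decoder outputs; this is a hedged sketch, not the authors&rsquo; training code:</p>

```python
import math

# Minimal sketch of the combined loss: token-level cross-entropy for the
# skeleton/ring decoder plus binary cross-entropy for branch connectivity.


def cross_entropy(probs, target_idx):
    """Token-level CE for one decoding step."""
    return -math.log(probs[target_idx])


def binary_ce(p, y):
    """Branch-connectivity classification loss for one bond pair."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


l_ce = cross_entropy([0.1, 0.7, 0.2], target_idx=1)  # skeleton token
l_cls = binary_ce(0.9, 1)                            # connected branch
total = 1.0 * l_ce + 1.0 * l_cls                     # lambda_1 = lambda_2 = 1
```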
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on exact image reconstruction and structural validity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>EM (Exact Match)</strong></td>
          <td>% of images where predicted graph exactly matches ground truth.</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td><strong>Struct-EM</strong></td>
          <td>% of correctly identified chemical structures (ignoring non-chemical text).</td>
          <td>Auxiliary metric</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 (32GB RAM)</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch size: 8 (Handwritten), 32 (Printed)</li>
<li>Epochs: 50</li>
<li>Optimizer: Adam ($lr=2\times10^{-4}$, decayed by 0.5 via MultiStepLR)</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, Q., Chen, M., Pi, C., Hu, P., Zhang, Z., Ma, J., Du, J., Yin, B., &amp; Hu, J. (2025). RFL: Simplifying Chemical Structure Recognition with Ring-Free Language. In <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(2), 2007-2015. <a href="https://doi.org/10.1609/aaai.v39i2.32197">https://doi.org/10.1609/aaai.v39i2.32197</a></p>
<p><strong>Publication</strong>: AAAI 2025 (Oral)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/JingMog/RFL-MSD">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{changRFLSimplifyingChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{RFL: Simplifying Chemical Structure Recognition with Ring-Free Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{RFL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Qikai and Chen, Mingjun and Pi, Changpeng and Hu, Pengfei and Zhang, Zhenrong and Ma, Jiefeng and Du, Jun and Yin, Baocai and Hu, Jinshui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2007--2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2412.07594}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1609/aaai.v39i2.32197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>