<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Benchmarks and Reviews on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/</link><description>Recent content in Benchmarks and Reviews on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/index.xml" rel="self" type="application/rss+xml"/><item><title>Uni-Parser: Industrial-Grade Multi-Modal PDF Parsing (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</guid><description>Uni-Parser is a modular, multi-expert PDF parsing engine for scientific documents with integrated OCSR and chemical structure recognition.</description><content:encoded><![CDATA[<h2 id="an-industrial-grade-multi-modal-document-parser">An Industrial-Grade Multi-Modal Document Parser</h2>
<p>Uni-Parser is a modular, loosely coupled PDF parsing engine built for scientific literature and patents. It routes different content types (text, equations, tables, figures, chemical structures) to specialized expert models, then reassembles the parsed outputs into structured formats (JSON, Markdown, HTML) for downstream consumption by LLMs and other applications.</p>
<p>The system processes up to 20 PDF pages per second on 8 NVIDIA RTX 4090D GPUs and supports over 80 languages for OCR.</p>
<h2 id="a-five-stage-pipeline-architecture">A Five-Stage Pipeline Architecture</h2>
<p>The system is organized into five sequential stages:</p>
<ol>
<li><strong>Document Pre-Processing</strong>: Validates PDFs, extracts metadata, checks text accessibility, and identifies language.</li>
<li><strong>Group-based Layout Detection</strong>: Locates semantic blocks and identifies their categories using a novel tree-structured layout representation. Groups naturally paired elements (image-caption, table-title, molecule-identifier).</li>
<li><strong>Semantic Contents Parsing</strong>: Routes each block to a specialized model: OCR for text, formula recognition for equations, table structure recognition, OCSR for chemical structures, reaction extraction, and chart parsing. Over ten sub-models operate in parallel.</li>
<li><strong>Semantic Contents Gathering</strong>: Filters non-essential elements, reconstructs reading order, merges cross-page and multi-column content, and reintegrates inline multimodal elements.</li>
<li><strong>Output Formatting and Semantic Chunking</strong>: Exports parsed documents in task-specific formats with proper chunking for RAG and other downstream tasks.</li>
</ol>
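<p>Stage 3&rsquo;s routing step can be sketched as a simple dispatch table. The expert functions below are illustrative stand-ins (the actual Uni-Parser sub-models are not public), but they show the loose coupling that lets experts be swapped independently:</p>

```python
from typing import Callable, Dict

# Hypothetical expert stubs -- the real OCR, formula, and OCSR models
# are separate networks; here each just tags the block it received.
def ocr_expert(block: dict) -> dict:
    return {"type": "text", "content": f"<ocr:{block['id']}>"}

def formula_expert(block: dict) -> dict:
    return {"type": "equation", "content": f"<latex:{block['id']}>"}

def ocsr_expert(block: dict) -> dict:
    return {"type": "molecule", "content": f"<smiles:{block['id']}>"}

# Dispatch table: layout category -> expert model (Stage 3 in miniature).
EXPERTS: Dict[str, Callable[[dict], dict]] = {
    "text": ocr_expert,
    "equation": formula_expert,
    "molecule": ocsr_expert,
}

def parse_blocks(blocks):
    """Route each detected layout block to its specialized expert."""
    return [EXPERTS[b["category"]](b) for b in blocks if b["category"] in EXPERTS]
```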
<h2 id="group-based-layout-detection">Group-Based Layout Detection</h2>
<p>A key contribution is the group-based layout detection model (Uni-Parser-LD), which uses a hierarchical tree structure to represent page layouts. Elements are organized into a bottom layer (parent nodes like paragraphs, tables, images) and a top layer (child nodes like captions, footnotes, identifiers). This preserves semantic associations between paired elements, such as molecules and their identifiers.</p>
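<p>The two-layer tree can be illustrated with a minimal data structure; the class and field names here are assumptions for illustration, not Uni-Parser&rsquo;s actual schema:</p>

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LayoutNode:
    """One node in the two-layer layout tree: bottom-layer parents
    (paragraphs, tables, molecules) hold top-layer children
    (captions, footnotes, identifiers)."""
    category: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) on the page
    children: List["LayoutNode"] = field(default_factory=list)

def paired_elements(parent: LayoutNode):
    """Yield (parent, child) category pairs, preserving the semantic
    association between, e.g., a molecule and its identifier."""
    for child in parent.children:
        yield parent.category, child.category

# A molecule block grouped with the identifier printed beneath it.
molecule = LayoutNode("molecule", (100, 200, 300, 400))
molecule.children.append(LayoutNode("identifier", (150, 405, 250, 420)))
```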
<p>The model is trained on 500k pages, including 220k human-annotated pages from scientific journals and patents across 85 languages. A modified DETR-based architecture was selected as the backbone after finding that RT-DETRv2, YOLOv12, and D-FINE exhibited training instability for this task.</p>
<h2 id="chemical-structure-recognition-with-molparser-15">Chemical Structure Recognition with MolParser 1.5</h2>
<p>Uni-Parser integrates MolParser 1.5 for OCSR, an end-to-end model that directly generates molecular representations from images. The authors explicitly note that graph-based (atom-bond) methods were the first direction they explored but ultimately abandoned because of:</p>
<ul>
<li>Strong reliance on rigid, hand-crafted rules that limit scalability</li>
<li>Substantially higher annotation costs (over 20x compared to end-to-end approaches)</li>
<li>Lower performance ceilings despite increasing training data</li>
</ul>
<h3 id="molecule-localization">Molecule Localization</h3>
<p>Uni-Parser-LD achieves strong molecule detection performance:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>mAP@50</th>
          <th>mAP@50-95</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (Uni-Parser Bench)</td>
          <td><strong>0.994</strong></td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.983</td>
          <td>0.919</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.974</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (BioVista Bench)</td>
          <td><strong>0.981</strong></td>
          <td><strong>0.844</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.961</td>
          <td>0.871</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.945</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td>BioMiner</td>
          <td>0.929</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.899</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h3 id="ocsr-accuracy">OCSR Accuracy</h3>
<p>MolParser 1.5 consistently outperforms prior methods across molecule types:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Full</th>
          <th>Chiral</th>
          <th>Markush</th>
          <th>All</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser 1.5</strong> (Uni-Parser Bench)</td>
          <td><strong>0.979</strong></td>
          <td><strong>0.809</strong></td>
          <td><strong>0.805</strong></td>
          <td><strong>0.886</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.953</td>
          <td>0.676</td>
          <td>0.664</td>
          <td>0.800</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.617</td>
          <td>0.274</td>
          <td>0.168</td>
          <td>0.417</td>
      </tr>
      <tr>
          <td><strong>MolParser 1.5</strong> (BioVista Bench)</td>
          <td><strong>0.795</strong></td>
          <td><strong>0.604</strong></td>
          <td><strong>0.761</strong></td>
          <td><strong>0.780</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.669</td>
          <td>0.352</td>
          <td>0.733</td>
          <td>0.703</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.774</td>
          <td>0.497</td>
          <td>0.185</td>
          <td>0.507</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.703</td>
          <td>0.481</td>
          <td>0.156</td>
          <td>0.455</td>
      </tr>
      <tr>
          <td>MolNexTR</td>
          <td>0.695</td>
          <td>0.419</td>
          <td>0.045</td>
          <td>0.401</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>0.545</td>
          <td>0.326</td>
          <td>0.000</td>
          <td>0.298</td>
      </tr>
  </tbody>
</table>
<p>Chiral molecule recognition remains a significant challenge and is identified as a key area for future work.</p>
<h2 id="document-parsing-benchmarks">Document Parsing Benchmarks</h2>
<p>On the Uni-Parser Benchmark (150 PDFs, 2,887 pages from patents and scientific articles), Uni-Parser (HQ mode) achieves an overall score of 89.74 (excluding molecules), outperforming both pipeline tools (MinerU, PP-StructureV3) and specialized VLMs (MinerU2-VLM, DeepSeek-OCR, PaddleOCR-VL). Competing systems score zero on molecule localization and OCSR because they lack molecular recognition capabilities.</p>
<p>On the general-document OmniDocBench-1.5, a variant (Uni-Parser-G) using a swapped layout module achieves 89.75 overall, competitive with top-performing specialized VLMs.</p>
<h2 id="comparison-with-ocsr-enabled-pdf-parsers">Comparison with OCSR-Enabled PDF Parsers</h2>
<p>On a controlled test set of 141 simple molecules, Uni-Parser outperforms other PDF parsing systems with OCSR support:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Recall</th>
          <th>OCSR Success</th>
          <th>OCSR Acc</th>
          <th>Id Match</th>
          <th>Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>96.5%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>1.8s</strong></td>
      </tr>
      <tr>
          <td>MathPix</td>
          <td>100%</td>
          <td>75.9%</td>
          <td>59.6%</td>
          <td>-</td>
          <td>66.1s</td>
      </tr>
      <tr>
          <td>MinerU.Chem</td>
          <td>66.7%</td>
          <td>63.1%</td>
          <td>22.7%</td>
          <td>-</td>
          <td>~7 min</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/UniParser">HuggingFace Models</a></td>
          <td>Model/Dataset</td>
          <td>Unknown</td>
          <td>MolDet models and MolParser-7M dataset available</td>
      </tr>
      <tr>
          <td><a href="https://uni-parser.github.io">Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Project website with documentation</td>
      </tr>
  </tbody>
</table>
<p>The Uni-Parser system is deployed on a cluster of 240 NVIDIA L40 GPUs (48 GB each) with 22 CPU cores and 90 GB of host memory per GPU. The reference throughput benchmark (20 pages/second) uses 8 NVIDIA RTX 4090D GPUs. The HuggingFace organization hosts MolDet detection models and several datasets (MolParser-7M, RxnBench, OmniScience), but the full Uni-Parser system code and end-to-end inference pipeline do not appear to be publicly released. MolParser 1.5 model weights are not publicly available as of this writing.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li>Chiral molecule recognition remains a challenge for end-to-end OCSR models</li>
<li>Chemical reaction understanding in real-world literature has substantial room for improvement</li>
<li>Layout models are primarily tailored to scientific and patent documents, with plans to expand to newspapers, slides, books, and financial statements</li>
<li>Chart parsing falls short of industrial-level requirements across the diversity of chart types in scientific literature</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Tao, H., Yang, S., Huang, C., Zhong, S., Lu, H., Lyu, H., Li, X., Zhang, L., &amp; Ke, G. (2025). Uni-Parser Technical Report. <em>arXiv preprint arXiv:2512.15098</em>. <a href="https://arxiv.org/abs/2512.15098">https://arxiv.org/abs/2512.15098</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://uni-parser.github.io">Project Page</a></li>
<li><a href="https://huggingface.co/UniParser">HuggingFace Models</a></li>
</ul>
]]></content:encoded></item><item><title>Image-to-Sequence OCSR: A Comparative Analysis</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/image-to-sequence-comparison/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/image-to-sequence-comparison/</guid><description>Comparative analysis of image-to-sequence OCSR methods across architecture, output format, training data, and compute requirements.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This note provides a comparative analysis of image-to-sequence methods for Optical Chemical Structure Recognition (OCSR). These methods treat molecular structure recognition as an image captioning task, using encoder-decoder architectures to generate sequential molecular representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) directly from pixels.</p>
<p>For the full taxonomy of OCSR approaches including image-to-graph and rule-based methods, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR Methods taxonomy</a>.</p>
<h2 id="architectural-evolution-2019-2025">Architectural Evolution (2019-2025)</h2>
<p>The field has undergone rapid architectural evolution, with clear generational shifts in both encoder and decoder design.</p>
<h3 id="timeline">Timeline</h3>
<table>
  <thead>
      <tr>
          <th>Era</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Representative Methods</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2019-2020</strong></td>
          <td>CNN (Inception V3, ResNet)</td>
          <td>LSTM/GRU with Attention</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/">Staker et al.</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a></td>
      </tr>
      <tr>
          <td><strong>2021</strong></td>
          <td>EfficientNet, ViT</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/">Img2Mol</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI</a></td>
      </tr>
      <tr>
          <td><strong>2022</strong></td>
          <td>Swin Transformer, ResNet</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/">SwinOCSR</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/">MICER</a></td>
      </tr>
      <tr>
          <td><strong>2023-2024</strong></td>
          <td>EfficientNetV2, SwinV2</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/">DECIMER.ai</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/">MMSSC-Net</a></td>
      </tr>
      <tr>
          <td><strong>2025</strong></td>
          <td>EfficientViT, VLMs (Qwen2-VL)</td>
          <td>LLM decoders, RL fine-tuning</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/">MolSight</a>, <a href="/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/">GTR-CoT</a>, <a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU</a></td>
      </tr>
  </tbody>
</table>
<h3 id="encoder-architectures">Encoder Architectures</h3>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Methods Using It</th>
          <th>Key Characteristics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Inception V3</strong></td>
          <td>DECIMER (2020)</td>
          <td>Early CNN approach, 299x299 input</td>
      </tr>
      <tr>
          <td><strong>ResNet-50/101</strong></td>
          <td>IMG2SMI, Image2SMILES, MICER, DGAT</td>
          <td>Strong baseline, well-understood</td>
      </tr>
      <tr>
          <td><strong>EfficientNet-B3</strong></td>
          <td>DECIMER 1.0</td>
          <td>Efficient scaling, compound coefficients</td>
      </tr>
      <tr>
          <td><strong>EfficientNet-V2-M</strong></td>
          <td>DECIMER.ai, DECIMER-Hand-Drawn</td>
          <td>Improved training efficiency</td>
      </tr>
      <tr>
          <td><strong>EfficientViT-L1</strong></td>
          <td>MolSight</td>
          <td>Optimized for deployment</td>
      </tr>
      <tr>
          <td><strong>Swin Transformer</strong></td>
          <td>SwinOCSR, MolParser</td>
          <td>Hierarchical vision transformer</td>
      </tr>
      <tr>
          <td><strong>SwinV2</strong></td>
          <td>MMSSC-Net, Image2InChI</td>
          <td>Improved training stability</td>
      </tr>
      <tr>
          <td><strong>Vision Transformer (ViT)</strong></td>
          <td>ViT-InChI</td>
          <td>Pure attention encoder</td>
      </tr>
      <tr>
          <td><strong>DenseNet</strong></td>
          <td>RFL, Hu et al. RCGD</td>
          <td>Dense connections, feature reuse</td>
      </tr>
      <tr>
          <td><strong>Deep TNT</strong></td>
          <td>ICMDT</td>
          <td>Transformer-in-Transformer</td>
      </tr>
      <tr>
          <td><strong>Qwen2-VL</strong></td>
          <td>OCSU, GTR-CoT</td>
          <td>Vision-language model encoder</td>
      </tr>
  </tbody>
</table>
<h3 id="decoder-architectures">Decoder Architectures</h3>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Methods Using It</th>
          <th>Output Format</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GRU with Attention</strong></td>
          <td>DECIMER, RFL, Hu et al. RCGD</td>
          <td>SMILES, RFL, SSML</td>
      </tr>
      <tr>
          <td><strong>LSTM with Attention</strong></td>
          <td>Staker et al., ChemPix, MICER</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><strong>Transformer</strong></td>
          <td>Most 2021+ methods</td>
          <td>SMILES, SELFIES, InChI</td>
      </tr>
      <tr>
          <td><strong>GPT-2</strong></td>
          <td>MMSSC-Net</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><strong>BART</strong></td>
          <td>MolParser</td>
          <td>E-SMILES</td>
      </tr>
      <tr>
          <td><strong>Pre-trained CDDD</strong></td>
          <td>Img2Mol</td>
          <td>Continuous embedding → SMILES</td>
      </tr>
  </tbody>
</table>
<h2 id="output-representation-comparison">Output Representation Comparison</h2>
<p>The choice of molecular string representation significantly impacts model performance. Representations fall into three categories: core molecular formats for single concrete structures, extended formats for molecular families and variable structures (primarily Markush structures in patents), and specialized representations optimized for specific recognition challenges.</p>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan et al. 2022 ablation study</a> provides a comparison of core formats.</p>
<h3 id="core-molecular-formats">Core Molecular Formats</h3>
<p>These represent specific, concrete molecular structures.</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Validity Guarantee</th>
          <th>Sequence Length</th>
          <th>Key Characteristic</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>No</td>
          <td>Shortest (baseline)</td>
          <td>Standard, highest accuracy</td>
          <td>DECIMER.ai, MolSight, DGAT, most 2023+</td>
      </tr>
      <tr>
          <td><strong>DeepSMILES</strong></td>
          <td>Partial</td>
          <td>~1.1x SMILES</td>
          <td>Reduces non-local dependencies</td>
          <td>SwinOCSR</td>
      </tr>
      <tr>
          <td><strong>SELFIES</strong></td>
          <td>Yes (100%)</td>
          <td>~1.5x SMILES</td>
          <td>Guaranteed valid molecules</td>
          <td>DECIMER 1.0, IMG2SMI</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>N/A (canonical)</td>
          <td>Variable (long)</td>
          <td>Unique identifiers, layered syntax</td>
          <td>ViT-InChI, ICMDT, Image2InChI</td>
      </tr>
      <tr>
          <td><strong>FG-SMILES</strong></td>
          <td>No</td>
          <td>Similar to SMILES</td>
          <td>Functional group-aware tokenization</td>
          <td>Image2SMILES</td>
      </tr>
  </tbody>
</table>
<h4 id="smiles-and-variants">SMILES and Variants</h4>
<p><strong>SMILES</strong> remains the dominant format due to its compactness and highest accuracy on clean data. Standard SMILES encodes ring closures and branches with paired symbols whose partners may sit far apart in the sequence, creating long-range dependencies that are difficult for sequence models to learn.</p>
<p><strong>DeepSMILES</strong> addresses these non-local syntax dependencies by modifying how branches and ring closures are encoded, making sequences more learnable for neural models. Despite this modification, DeepSMILES sequences are ~1.1x <em>longer</em> than standard SMILES. The altered syntax yields partial validity improvements, and the format is typically paired with regex-based tokenization over a compact 76-token vocabulary, providing a middle ground between SMILES accuracy and guaranteed validity.</p>
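<p>The regex tokenization used by these sequence models can be sketched for plain SMILES. The pattern below is a simplified illustration, not any published model&rsquo;s exact vocabulary (for instance, a bare <code>@@</code> outside brackets would be split into two tokens):</p>

```python
import re

# Order matters: bracket atoms and two-letter symbols must match before
# single characters; %NN handles two-digit ring-closure labels.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"                    # bracket atoms, e.g. [nH], [C@@H]
    r"|Br|Cl"                        # two-letter organic-subset elements
    r"|%\d{2}"                       # ring closures beyond 9, e.g. %12
    r"|[A-Za-z\d@=#()+\-.\\/:~*]"    # any remaining single symbol
)

def tokenize(smiles: str):
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble the input exactly.
    assert "".join(tokens) == smiles, f"untokenizable input: {smiles}"
    return tokens
```

<p>Tokenizing phenyl acetate, <code>CC(=O)Oc1ccccc1</code>, yields 15 tokens, with the two ring-closure <code>1</code> tokens eight positions apart&mdash;exactly the non-local dependency DeepSMILES is designed to remove.</p>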
<p><strong>SELFIES</strong> guarantees 100% valid molecules by design through a context-free grammar, eliminating invalid outputs entirely. This comes at the cost of ~1.5x longer sequences and a typical 2-5% accuracy drop compared to SMILES on exact-match metrics. The validity guarantee makes SELFIES particularly attractive for generative modeling applications.</p>
<p><strong>InChI</strong> uses a layered canonical syntax fundamentally different from SMILES-based formats. While valuable for unique molecular identification, its complex multi-layer structure (formula, connectivity, stereochemistry, isotopes, etc.) and longer sequences make it less suitable for image-to-sequence learning, resulting in lower recognition accuracy.</p>
<h4 id="key-findings-from-rajan-et-al-2022">Key Findings from Rajan et al. 2022</h4>
<ol>
<li><strong>SMILES achieves highest exact-match accuracy</strong> on clean synthetic data</li>
<li><strong>SELFIES guarantees 100% valid molecules</strong> but at the cost of a ~2-5% accuracy drop</li>
<li><strong>InChI is problematic</strong> due to complex layered syntax and longer sequences</li>
<li><strong>DeepSMILES offers middle ground</strong> with partial validity improvements through modified syntax</li>
</ol>
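<p>The exact-match metric behind these numbers is worth making concrete. In practice, predictions and references are canonicalized first (e.g. via RDKit&rsquo;s <code>MolToSmiles</code>) so that equivalent SMILES spellings compare equal; the sketch below keeps the canonicalizer pluggable and defaults to identity:</p>

```python
def exact_match_accuracy(predictions, references, canonicalize=lambda s: s):
    """Fraction of predictions equal to their reference after canonicalization.

    In real evaluations `canonicalize` would be something like
    lambda s: Chem.MolToSmiles(Chem.MolFromSmiles(s)) from RDKit.
    Invalid predictions (possible with raw SMILES, impossible with
    SELFIES) simply count as misses.
    """
    hits = 0
    for pred, ref in zip(predictions, references):
        try:
            hits += canonicalize(pred) == canonicalize(ref)
        except Exception:
            pass  # canonicalization failed -> invalid prediction -> miss
    return hits / len(references)
```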
<h3 id="extended-formats-for-variable-structures">Extended Formats for Variable Structures</h3>
<p><strong>Markush structures</strong> represent families of molecules, using variable groups (R1, R2, etc.) with textual definitions. They are ubiquitous in patent documents for intellectual property protection. Standard SMILES cannot represent these variable structures.</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Base Format</th>
          <th>Key Feature</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>E-SMILES</strong></td>
          <td>SMILES + XML annotations</td>
          <td>Backward-compatible with separator token</td>
          <td>MolParser</td>
      </tr>
      <tr>
          <td><strong>CXSMILES</strong></td>
          <td>SMILES + extension block</td>
          <td>Substituent tables, compression</td>
          <td>MarkushGrapher</td>
      </tr>
  </tbody>
</table>
<p><strong>E-SMILES</strong> (Extended SMILES) maintains backward compatibility by using a <code>&lt;sep&gt;</code> token to separate core SMILES from XML-like annotations. Annotations encode Markush substituents (<code>&lt;a&gt;index:group&lt;/a&gt;</code>), polymer structures (<code>&lt;p&gt;polymer_info&lt;/p&gt;</code>), and abstract ring patterns (<code>&lt;r&gt;abstract_ring&lt;/r&gt;</code>). The core structure remains parseable by standard RDKit.</p>
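<p>A toy parser shows how the separator keeps the core RDKit-parseable. The example string and helper below are hypothetical, following only the annotation tags named above, not a published grammar:</p>

```python
import re

def parse_esmiles(esmiles: str):
    """Split an E-SMILES-style string into its core SMILES and the
    Markush substituent annotations that follow the separator token.

    Illustrative only: assumes the <sep> / <a>index:group</a>
    convention described in the text.
    """
    core, _, annotations = esmiles.partition("<sep>")
    substituents = {
        int(idx): group
        for idx, group in re.findall(r"<a>(\d+):([^<]+)</a>", annotations)
    }
    return core, substituents

# Hypothetical Markush structure: atom 7 of the core carries group R1.
core, subs = parse_esmiles("c1ccccc1C<sep><a>7:R1</a>")
```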
<p><strong>CXSMILES</strong> optimizes representation by moving variable groups directly into the main SMILES string as special atoms with explicit atom indexing (e.g., <code>C:1</code>) to link to an extension block containing substituent tables. This handles both frequency variation and position variation in Markush structures.</p>
<h3 id="specialized-representations">Specialized Representations</h3>
<p>These formats optimize for specific recognition challenges beyond standard single-molecule tasks.</p>
<h4 id="rfl-ring-free-language">RFL: Ring-Free Language</h4>
<p><strong>RFL</strong> fundamentally restructures molecular serialization through hierarchical ring decomposition, addressing a core challenge: standard 1D formats (SMILES, SSML) flatten complex 2D molecular graphs, losing explicit spatial relationships.</p>
<p><strong>Mechanism</strong>: RFL decomposes molecules into three explicit components:</p>
<ul>
<li><strong>Molecular Skeleton (𝒮)</strong>: Main graph with rings &ldquo;collapsed&rdquo;</li>
<li><strong>Ring Structures (ℛ)</strong>: Individual ring components stored separately</li>
<li><strong>Branch Information (ℱ)</strong>: Connectivity between skeleton and rings</li>
</ul>
<p><strong>Technical approach</strong>:</p>
<ol>
<li>Detect all non-nested rings using DFS</li>
<li>Calculate adjacency ($\gamma$) between rings based on shared edges</li>
<li>Merge isolated rings ($\gamma=0$) into <strong>SuperAtoms</strong> (single node placeholders)</li>
<li>Merge adjacent rings ($\gamma&gt;0$) into <strong>SuperBonds</strong> (edge placeholders)</li>
<li>Progressive decoding: predict skeleton first, then conditionally decode rings using stored hidden states</li>
</ol>
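<p>Steps 1&ndash;4 amount to finding ring systems and collapsing them into placeholder nodes. A minimal stdlib sketch, assuming a single connected ring system so that all ring atoms collapse into one SuperAtom (the actual RFL algorithm, with SuperBonds for adjacent rings, is more involved):</p>

```python
def ring_atoms(adj):
    """Atoms lying on rings: the graph's 2-core, obtained by repeatedly
    pruning vertices of degree <= 1."""
    adj = {u: set(vs) for u, vs in adj.items()}
    pruned = True
    while pruned:
        pruned = False
        for u in [u for u in adj if len(adj[u]) <= 1]:
            for v in adj.pop(u):
                adj[v].discard(u)
            pruned = True
    return set(adj)

def collapse_rings(adj):
    """Skeleton with the ring system replaced by one 'SuperAtom' node
    (single-ring-system assumption; RFL treats adjacent rings separately)."""
    rings = ring_atoms(adj)
    skeleton = {}
    for u, vs in adj.items():
        src = "SuperAtom" if u in rings else u
        skeleton.setdefault(src, set())
        for v in vs:
            dst = "SuperAtom" if v in rings else v
            if src != dst:
                skeleton[src].add(dst)
    return skeleton

# Methylcyclohexane as an adjacency map: ring atoms 0-5, methyl carbon 6.
mol = {0: {1, 5, 6}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4},
       4: {3, 5}, 5: {0, 4}, 6: {0}}
```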
<p><strong>Performance</strong>: RFL achieves SOTA results on both handwritten (95.38% EM) and printed (95.58% EM) structures, with particular strength on high-complexity molecules where standard baselines fail completely (exact match rises from 0% to roughly 30% on the hardest tier).</p>
<p><strong>Note</strong>: RFL does not preserve original drawing orientation; it&rsquo;s focused on computational efficiency through hierarchical decomposition.</p>
<h4 id="ssml-structure-specific-markup-language">SSML: Structure-Specific Markup Language</h4>
<p><strong>SSML</strong> is the primary orientation-preserving format in OCSR. Based on Chemfig (LaTeX chemical drawing package), it provides step-by-step drawing instructions.</p>
<p><strong>Key characteristics</strong>:</p>
<ul>
<li>Describes <em>how to draw</em> the molecule alongside its graph structure</li>
<li>Uses &ldquo;reconnection marks&rdquo; for cyclic structures</li>
<li>Preserves branch angles and spatial relationships</li>
<li>Significantly outperformed SMILES for handwritten recognition: 92.09% vs 81.89% EM (Hu et al. RCGD 2023)</li>
</ul>
<p><strong>Use case</strong>: Particularly valuable for hand-drawn structure recognition where visual alignment between image and reconstruction sequence aids model learning.</p>
<h2 id="training-data-comparison">Training Data Comparison</h2>
<p>Training data scale has grown dramatically, with a shift toward combining synthetic and real-world images.</p>
<h3 id="data-scale-evolution">Data Scale Evolution</h3>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Typical Scale</th>
          <th>Maximum Reported</th>
          <th>Primary Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2019-2020</td>
          <td>1-15M</td>
          <td>57M (Staker)</td>
          <td>Synthetic (RDKit, CDK)</td>
      </tr>
      <tr>
          <td>2021-2022</td>
          <td>5-35M</td>
          <td>35M (DECIMER 1.0)</td>
          <td>Synthetic with augmentation</td>
      </tr>
      <tr>
          <td>2023-2024</td>
          <td>100-150M</td>
          <td>450M+ (DECIMER.ai)</td>
          <td>Synthetic + real patents</td>
      </tr>
      <tr>
          <td>2025</td>
          <td>1-10M + real</td>
          <td>7.7M (MolParser)</td>
          <td>Curated real + synthetic</td>
      </tr>
  </tbody>
</table>
<h3 id="synthetic-vs-real-data">Synthetic vs Real Data</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Training Data</th>
          <th>Real-World Performance Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>450M+ synthetic (RanDepict)</td>
          <td>Strong generalization via domain randomization</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>7.7M with active learning</td>
          <td>Explicitly targets &ldquo;in the wild&rdquo; images</td>
      </tr>
      <tr>
          <td><strong>GTR-CoT</strong></td>
          <td>Real patent/paper images</td>
          <td>Chain-of-thought improves reasoning</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>Multi-stage curriculum</td>
          <td>RL fine-tuning for stereochemistry</td>
      </tr>
  </tbody>
</table>
<h3 id="data-augmentation-strategies">Data Augmentation Strategies</h3>
<p>Common augmentation techniques across methods:</p>
<table>
  <thead>
      <tr>
          <th>Technique</th>
          <th>Purpose</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Rotation</strong></td>
          <td>Orientation invariance</td>
          <td>Nearly all methods</td>
      </tr>
      <tr>
          <td><strong>Gaussian blur</strong></td>
          <td>Image quality variation</td>
          <td>DECIMER, MolParser</td>
      </tr>
      <tr>
          <td><strong>Salt-and-pepper noise</strong></td>
          <td>Scan artifact simulation</td>
          <td>DECIMER, Image2SMILES</td>
      </tr>
      <tr>
          <td><strong>Affine transforms</strong></td>
          <td>Perspective variation</td>
          <td>ChemPix, MolParser</td>
      </tr>
      <tr>
          <td><strong>Font/style variation</strong></td>
          <td>Rendering diversity</td>
          <td>RanDepict (DECIMER.ai)</td>
      </tr>
      <tr>
          <td><strong>Hand-drawn simulation</strong></td>
          <td>Sketch-like inputs</td>
          <td>ChemPix, ChemReco, DECIMER-Hand-Drawn</td>
      </tr>
      <tr>
          <td><strong>Background variation</strong></td>
          <td>Document context</td>
          <td>MolParser, DECIMER.ai</td>
      </tr>
  </tbody>
</table>
<h2 id="hardware-and-compute-requirements">Hardware and Compute Requirements</h2>
<p>Hardware requirements span several orders of magnitude, from consumer GPUs to TPU pods.</p>
<h3 id="training-hardware-comparison">Training Hardware Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Hardware</th>
          <th>Training Time</th>
          <th>Dataset Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Staker et al. (2019)</strong></td>
          <td>8x GPUs</td>
          <td>26 days</td>
          <td>57M</td>
      </tr>
      <tr>
          <td><strong>IMG2SMI (2021)</strong></td>
          <td>1x RTX 2080 Ti</td>
          <td>5 epochs</td>
          <td>~10M</td>
      </tr>
      <tr>
          <td><strong>Image2SMILES (2022)</strong></td>
          <td>4x V100</td>
          <td>2 weeks</td>
          <td>30M</td>
      </tr>
      <tr>
          <td><strong>MICER (2022)</strong></td>
          <td>4x V100</td>
          <td>42 hours</td>
          <td>10M</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0 (2021)</strong></td>
          <td>TPU v3-8</td>
          <td>Not reported</td>
          <td>35M</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai (2023)</strong></td>
          <td>TPU v3-256</td>
          <td>Not reported</td>
          <td>450M+</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR (2022)</strong></td>
          <td>4x RTX 3090</td>
          <td>5 days</td>
          <td>5M</td>
      </tr>
      <tr>
          <td><strong>MolParser (2025)</strong></td>
          <td>8x A100</td>
          <td>Curriculum learning</td>
          <td>7.7M</td>
      </tr>
      <tr>
          <td><strong>MolSight (2025)</strong></td>
          <td>Not specified</td>
          <td>RL fine-tuning (GRPO)</td>
          <td>Multi-stage</td>
      </tr>
  </tbody>
</table>
<h3 id="inference-considerations">Inference Considerations</h3>
<p>Few papers report inference speed consistently. Available data:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Inference Speed</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>~4x faster than the original DECIMER</td>
          <td>TensorFlow Lite optimization</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong> (baseline)</td>
          <td>~1 image/sec</td>
          <td>CPU-based rule system</td>
      </tr>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>Real-time capable</td>
          <td>Optimized Swin encoder</td>
      </tr>
  </tbody>
</table>
<h3 id="accessibility-tiers">Accessibility Tiers</h3>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Hardware</th>
          <th>Representative Methods</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Consumer</strong></td>
          <td>1x RTX 2080/3090</td>
          <td>IMG2SMI, ChemPix</td>
      </tr>
      <tr>
          <td><strong>Workstation</strong></td>
          <td>4x V100/A100</td>
          <td>Image2SMILES, MICER, SwinOCSR</td>
      </tr>
      <tr>
          <td><strong>Cloud/HPC</strong></td>
          <td>TPU pods, 8+ A100</td>
          <td>DECIMER.ai, MolParser</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmark-performance">Benchmark Performance</h2>
<h3 id="common-evaluation-datasets">Common Evaluation Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Type</th>
          <th>Size</th>
          <th>Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>Patent images</td>
          <td>~5K test</td>
          <td>Real-world complexity</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>Scanned images</td>
          <td>~5K test</td>
          <td>Scan artifacts</td>
      </tr>
      <tr>
          <td><strong>Staker</strong></td>
          <td>Synthetic</td>
          <td>Variable</td>
          <td>Baseline synthetic</td>
      </tr>
      <tr>
          <td><strong>CLEF</strong></td>
          <td>Patent images</td>
          <td>~1K test</td>
          <td>Markush structures</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>Japanese patents</td>
          <td>~1K test</td>
          <td>Different rendering styles</td>
      </tr>
  </tbody>
</table>
<h3 id="accuracy-comparison-exact-match-">Accuracy Comparison (Exact Match %)</h3>
<p>Methods are roughly grouped by evaluation era; direct comparison is complicated by different test sets.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>USPTO</th>
          <th>UOB</th>
          <th>Staker</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>OSRA</strong> (baseline)</td>
          <td>~70%</td>
          <td>~65%</td>
          <td>~80%</td>
          <td>Rule-based reference</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>~85%</td>
          <td>~80%</td>
          <td>~90%</td>
          <td>First transformer-based</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>~88%</td>
          <td>~82%</td>
          <td>~92%</td>
          <td>Swin encoder advantage</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>~90%</td>
          <td>~85%</td>
          <td>~95%</td>
          <td>Scale + augmentation</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>~92%</td>
          <td>~88%</td>
          <td>~96%</td>
          <td>Real-world focus</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>~93%+</td>
          <td>~89%+</td>
          <td>~97%+</td>
          <td>RL fine-tuning boost</td>
      </tr>
  </tbody>
</table>
<p><em>Note: Numbers are approximate and may vary by specific test split. See individual paper notes for precise figures.</em></p>
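<p>The exact-match rates in the table reduce to string equality after canonicalization. A minimal scorer is sketched below, assuming both predictions and references have already been converted to a canonical form (in practice with a toolkit such as RDKit), so trivially different but equivalent SMILES do not count as errors:</p>

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to the reference string.
    Both sides are assumed to be pre-canonicalized."""
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

# Illustrative strings only, not data from any benchmark.
preds = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
refs  = ["CCO", "c1ccccc1", "CC(=O)N", "CCN"]
acc = exact_match_accuracy(preds, refs)  # 3 of 4 correct -> 0.75
```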
<h3 id="stereochemistry-recognition">Stereochemistry Recognition</h3>
<p>Stereochemistry remains a persistent challenge across all methods:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Approach</th>
          <th>Stereo Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Most methods</strong></td>
          <td>Standard SMILES</td>
          <td>Lower than non-stereo</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>RL (GRPO) specifically for stereo</td>
          <td>Improved</td>
      </tr>
      <tr>
          <td><strong>MolNexTR</strong></td>
          <td>Graph-based explicit stereo</td>
          <td>Better handling</td>
      </tr>
      <tr>
          <td><strong>Image2InChI</strong></td>
          <td>InChI stereo layers</td>
          <td>Mixed results</td>
      </tr>
  </tbody>
</table>
<h2 id="hand-drawn-recognition">Hand-Drawn Recognition</h2>
<p>A distinct sub-lineage focuses on hand-drawn/sketched chemical structures.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Target Domain</th>
          <th>Key Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ChemPix (2021)</strong></td>
          <td>Hand-drawn hydrocarbons</td>
          <td>First deep learning for sketches</td>
      </tr>
      <tr>
          <td><strong>Hu et al. RCGD (2023)</strong></td>
          <td>Hand-drawn structures</td>
          <td>Random conditional guided decoder</td>
      </tr>
      <tr>
          <td><strong>ChemReco (2024)</strong></td>
          <td>Hand-drawn C-H-O structures</td>
          <td>EfficientNet + curriculum learning</td>
      </tr>
      <tr>
          <td><strong>DECIMER-Hand-Drawn (2024)</strong></td>
          <td>General hand-drawn</td>
          <td>Enhanced DECIMER architecture</td>
      </tr>
  </tbody>
</table>
<h3 id="hand-drawn-vs-printed-trade-offs">Hand-Drawn vs Printed Trade-offs</h3>
<ul>
<li>Hand-drawn methods sacrifice some accuracy on clean printed images</li>
<li>Require specialized training data (synthetic hand-drawn simulation)</li>
<li>Generally smaller training sets due to data collection difficulty</li>
<li>Better suited for educational and lab notebook applications</li>
</ul>
<h2 id="key-innovations-by-method">Key Innovations by Method</h2>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Primary Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Staker et al.</strong></td>
          <td>First end-to-end deep learning OCSR</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>Transformer decoder + SELFIES</td>
      </tr>
      <tr>
          <td><strong>Img2Mol</strong></td>
          <td>Continuous embedding space (CDDD)</td>
      </tr>
      <tr>
          <td><strong>Image2SMILES</strong></td>
          <td>Functional group-aware SMILES (FG-SMILES)</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>Hierarchical vision transformer encoder</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>Massive scale + RanDepict augmentation</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>Extended SMILES + active learning</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>RL fine-tuning (GRPO) for accuracy</td>
      </tr>
      <tr>
          <td><strong>GTR-CoT</strong></td>
          <td>Chain-of-thought graph traversal</td>
      </tr>
      <tr>
          <td><strong>OCSU</strong></td>
          <td>Multi-task vision-language understanding</td>
      </tr>
      <tr>
          <td><strong>RFL</strong></td>
          <td>Hierarchical ring decomposition with SuperAtoms/SuperBonds</td>
      </tr>
  </tbody>
</table>
<h2 id="open-challenges">Open Challenges</h2>
<ol>
<li><strong>Stereochemistry</strong>: Consistent challenge across all methods; RL approaches (MolSight) show promise</li>
<li><strong>Abbreviations/R-groups</strong>: E-SMILES and Markush-specific methods emerging</li>
<li><strong>Real-world robustness</strong>: Gap between synthetic training and patent/paper images</li>
<li><strong>Inference speed</strong>: Rarely reported; important for production deployment</li>
<li><strong>Memory efficiency</strong>: Almost never documented; limits accessibility</li>
<li><strong>Multi-molecule images</strong>: Most methods assume single isolated structure</li>
</ol>
<h2 id="references">References</h2>
<p>Individual paper notes linked throughout. For the complete method listing, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR Methods taxonomy</a>.</p>
]]></content:encoded></item><item><title>Benchmarking Eight OCSR Tools on Patent Images (2024)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/krasnov-ocsr-benchmark-2024/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/krasnov-ocsr-benchmark-2024/</guid><description>Benchmark of 8 open-access OCSR methods on 2702 manually curated patent images, with ChemIC classifier for hybrid approach.</description><content:encoded><![CDATA[<h2 id="contribution-benchmarking-general-and-specialized-ocsr-tools">Contribution: Benchmarking General and Specialized OCSR Tools</h2>
<p>This paper is primarily a <strong>Resource</strong> contribution ($0.7 \Psi_{\text{Resource}}$) with a secondary <strong>Method</strong> component ($0.3 \Psi_{\text{Method}}$).</p>
<p>It establishes a new, independent benchmark dataset of 2,702 manually selected patent images to evaluate existing Optical Chemical Structure Recognition (OCSR) tools. The authors rigorously compare 8 different methods using this dataset to determine the state-of-the-art. The Resource contribution is evidenced by the creation of this curated benchmark, explicit evaluation metrics (exact connectivity table matching), and public release of datasets, processing scripts, and evaluation tools on Zenodo.</p>
<p>The secondary Method contribution comes through the development of &ldquo;ChemIC,&rdquo; a ResNet-50 image classifier designed to categorize images (Single vs. Multiple vs. Reaction) to enable a modular processing pipeline. However, this method serves to support the insights gained from the benchmarking resource.</p>
<h2 id="motivation-the-need-for-realistic-modality-diverse-patent-benchmarks">Motivation: The Need for Realistic, Modality-Diverse Patent Benchmarks</h2>
<p><strong>Lack of Standardization</strong>: A universally accepted standard set of images for OCSR quality measurement is currently missing; existing tools are often evaluated on synthetic data or limited datasets.</p>
<p><strong>Industrial Relevance</strong>: Patents contain diverse and &ldquo;noisy&rdquo; image modalities (Markush structures, salts, reactions, hand-drawn styles) that are critical for Freedom to Operate (FTO) and novelty checks in the pharmaceutical industry. These real-world complexities are often missing from existing benchmarks.</p>
<p><strong>Modality Gaps</strong>: Different tools excel at different tasks (e.g., single molecules vs. reactions). Monolithic approaches frequently break down on complex patent documents, and there was minimal systematic understanding of which tools perform best for which image types.</p>
<p><strong>Integration Needs</strong>: The authors aimed to identify tools to replace or augment their existing rule-based system (OSRA) within the SciWalker application, requiring a rigorous comparative study.</p>
<h2 id="core-innovation-a-curated-multi-modality-dataset-and-hybrid-classification-pipeline">Core Innovation: A Curated Multi-Modality Dataset and Hybrid Classification Pipeline</h2>
<p><strong>Independent Benchmark</strong>: Creation of a manually curated test set of 2,702 images from real-world patents (WO, EP, US), specifically selected to include &ldquo;problematic&rdquo; edge cases like inorganic complexes, peptides, and Markush structures, providing a more realistic evaluation environment than synthetic datasets.</p>
<p><strong>Comprehensive Comparison</strong>: Side-by-side evaluation of 8 open-access tools: DECIMER, ReactionDataExtractor, MolScribe, RxnScribe, SwinOCSR, OCMR, MolVec, and OSRA, using identical test conditions and evaluation criteria.</p>
<p><strong>ChemIC Classifier</strong>: Implementation of a specialized image classifier (ResNet-50) to distinguish between single molecules, multiple molecules, reactions, and non-chemical images, facilitating a &ldquo;hybrid&rdquo; pipeline that routes images to the most appropriate tool.</p>
<p><strong>Strict Evaluation Logic</strong>: Utilization of an exact match criterion for connectivity tables (ignoring partial similarity scores like Tanimoto) to reflect rigorous industrial requirements for novelty checking in patent applications.</p>
<h2 id="methodology-exact-match-evaluation-across-eight-open-source-systems">Methodology: Exact-Match Evaluation Across Eight Open-Source Systems</h2>
<p><strong>Tool Selection</strong>: Installed and tested 8 tools: DECIMER v2.4.0, ReactionDataExtractor v2.0.0, MolScribe v1.1.1, RxnScribe v1.0, MolVec v0.9.8, OCMR, SwinOCSR, and OSRA v2.1.5.</p>
<p><strong>Dataset Construction</strong>:</p>
<ul>
<li><strong>Test Set</strong>: 2,702 patent images split into three &ldquo;buckets&rdquo;: A (Single structure - 1,454 images), B (Multiple structures - 661 images), C (Reactions - 481 images).</li>
<li><strong>Training Set (for ChemIC)</strong>: 16,000 images from various sources (Patents, Im2Latex, etc.) split into 12,804 training, 1,604 validation, and 1,604 test images.</li>
</ul>
<p><strong>Evaluation Protocol</strong>:</p>
<ul>
<li>Calculated Precision, Recall, and F1 scores based on an <em>exact connectivity table structure matching</em> (rejecting Tanimoto similarity as industrially insufficient). The metrics follow standard formulations where true positives ($\text{TP}$) represent perfectly assembled structures:
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \qquad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \qquad \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$</li>
<li>Manual inspection by four chemists to verify predictions.</li>
<li>Developed custom tools (<code>ImageComparator</code> and <code>ExcelConstructor</code>) to facilitate visual comparison and result aggregation.</li>
</ul>
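<p>The metric definitions above translate directly into code, with a true positive being a perfectly assembled structure. The counts below are illustrative, not figures from the paper:</p>

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision/recall/F1 from counts, where a true positive
    is a structure whose connectivity table exactly matches ground truth."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=87, fp=13, fn=0)
```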
<p><strong>Segmentation Test</strong>: Applied DECIMER segmentation to multi-structure images to see if splitting them before processing improved results, combining segmentation with MolScribe for final predictions.</p>
<h2 id="key-findings-modality-specialization-outperforms-monolithic-approaches">Key Findings: Modality Specialization Outperforms Monolithic Approaches</h2>
<p><strong>Single Molecules</strong>: <strong>MolScribe</strong> achieved the highest performance (Precision: 87%, F1: 93%), followed closely by <strong>DECIMER</strong> (Precision: 84%, F1: 91%). These transformer-based approaches outperformed rule-based methods on single-structure images (e.g., MolScribe F1: 93% vs. OSRA F1: 78%).</p>
<p><strong>Reactions</strong>: Evaluated on 103 randomly selected reaction images containing 284 total reactions, <strong>RxnScribe</strong> outperformed others (Recall: 97%, F1: 86%), demonstrating the value of specialized architectures for reaction diagrams. General-purpose tools struggled with reaction recognition.</p>
<p><strong>Multiple Structures</strong>: Evaluated on 20 multi-structure images containing 146 single structures, all AI-based tools struggled. <strong>OSRA</strong> (rule-based) performed best here but still had low precision (58%). Combining DECIMER segmentation (with the <code>expand</code> option) with MolScribe on these same 20 images improved precision to 82% and F1 to 90%, showing that image segmentation as a preprocessing step can boost multi-structure performance.</p>
<p><strong>Failures</strong>: Current tools fail on polymers, large oligomers, and complex Markush structures. Most tools (except MolVec) correctly recognize cis-trans and tetrahedral stereochemistry, but other forms (e.g., octahedral, axial, helical) are not recognized. None of the evaluated tools can reliably recognize dative/coordinate bonds in metal complexes, indicating gaps in training data coverage.</p>
<p><strong>Classifier Utility</strong>: The ChemIC model achieved 99.62% accuracy on the test set, validating the feasibility of a modular pipeline where images are routed to the specific tool best suited for that modality. The authors estimate that a hybrid system (MolScribe + OSRA + RxnScribe) routed by ChemIC would achieve an average F1 of 80%, compared to 68% for OSRA alone across all modalities.</p>
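<p>The hybrid pipeline the authors estimate amounts to a dispatch on the ChemIC label. The sketch below uses stub recognizers; the function names and label strings are placeholders for illustration, not the real tool APIs:</p>

```python
def route_image(label, image):
    """Send an image to the tool best suited for its modality,
    mirroring the ChemIC-routed hybrid pipeline."""
    tools = {
        "single_molecule": run_molscribe,
        "multiple_molecules": run_osra,
        "reaction": run_rxnscribe,
    }
    if label == "non_chemical":
        return None  # skip images with no chemical content
    return tools[label](image)

# Stand-in stubs for the actual recognizers.
def run_molscribe(image):  return ("MolScribe", image)
def run_osra(image):       return ("OSRA", image)
def run_rxnscribe(image):  return ("RxnScribe", image)

result = route_image("reaction", "page_12_fig_3.png")
```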
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Benchmark (Test)</strong></td>
          <td>Manual Patent Selection</td>
          <td>2,702 Images</td>
          <td>Sources: WO, EP, US patents<br><strong>Bucket A</strong>: Single structures (1,454)<br><strong>Bucket B</strong>: Multi-structures (661)<br><strong>Bucket C</strong>: Reactions (481)</td>
      </tr>
      <tr>
          <td><strong>ChemIC Training</strong></td>
          <td>Aggregated Sources</td>
          <td>16,000 Images</td>
          <td>Sources: Patents (OntoChem), MolScribe dataset, DECIMER dataset, RxnScribe dataset, Im2Latex-100k<br><strong>Split</strong>: 12,804 Train / 1,604 Val / 1,604 Test</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Scoring Logic</strong>:</p>
<ul>
<li><strong>Single Molecules</strong>: Score = 1 if exact match of connectivity table (all atoms, valencies, bonds, superatom abbreviations, and charge correct), 0 otherwise. Stereochemistry correctness was not considered a scoring criterion. Tanimoto similarity explicitly rejected as too lenient.</li>
<li><strong>Reactions</strong>: Considered correct if at least one reactant and one product are correct and capture main features. Stoichiometry and conditions ignored.</li>
</ul>
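<p>The single-molecule scoring rule can be illustrated with a toy connectivity-table comparison. This simplification represents a molecule as indexed atom labels plus a set of (atom, atom, bond-order) triples and ignores stereochemistry, as the paper's rule does; the real evaluation compares full toolkit connectivity tables (atoms, valencies, charges, superatom abbreviations) rather than this reduced form:</p>

```python
def connectivity_table(atoms, bonds):
    """Order-independent form of a toy connectivity table: sorted atom
    labels plus bonds as unordered (i, j, order) triples. Not a full
    graph-isomorphism check -- real tools compare canonical structures."""
    canon_bonds = frozenset(
        (min(i, j), max(i, j), order) for i, j, order in bonds
    )
    return (tuple(sorted(atoms)), canon_bonds)

def exact_match_score(pred, truth):
    """1 if the connectivity tables agree exactly, else 0."""
    return int(connectivity_table(*pred) == connectivity_table(*truth))

ethanol   = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
reordered = (["C", "C", "O"], [(2, 1, 1), (1, 0, 1)])  # same bonds, shuffled
double    = (["C", "C", "O"], [(0, 1, 1), (1, 2, 2)])  # C=O instead of C-O

score_same = exact_match_score(ethanol, reordered)  # 1
score_diff = exact_match_score(ethanol, double)     # 0
```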
<p><strong>Image Segmentation</strong>: Used DECIMER segmentation (with <code>expand</code> option) to split multi-structure images into single images before passing to MolScribe.</p>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Version</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER</strong></td>
          <td>v2.4.0</td>
          <td>EfficientNet-V2-M encoder + Transformer decoder</td>
      </tr>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>v1.1.1</td>
          <td>Swin Transformer encoder + Transformer decoder</td>
      </tr>
      <tr>
          <td><strong>RxnScribe</strong></td>
          <td>v1.0</td>
          <td>Specialized for reaction diagrams</td>
      </tr>
      <tr>
          <td><strong>ReactionDataExtractor</strong></td>
          <td>v2.0.0</td>
          <td>Deep learning-based extraction</td>
      </tr>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>v0.9.8</td>
          <td>Rule-based vectorization</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>v2.1.5</td>
          <td>Rule-based recognition</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>-</td>
          <td>Swin Transformer encoder-decoder</td>
      </tr>
      <tr>
          <td><strong>OCMR</strong></td>
          <td>-</td>
          <td>CNN-based framework</td>
      </tr>
      <tr>
          <td><strong>ChemIC (New)</strong></td>
          <td>-</td>
          <td>ResNet-50 CNN in PyTorch for 4-class classification</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Results on Single Structures (Bucket A - 400 random sample):</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1 Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>87%</td>
          <td>100%</td>
          <td>93%</td>
      </tr>
      <tr>
          <td><strong>DECIMER</strong></td>
          <td>84%</td>
          <td>100%</td>
          <td>91%</td>
      </tr>
      <tr>
          <td><strong>OCMR</strong></td>
          <td>77%</td>
          <td>100%</td>
          <td>87%</td>
      </tr>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>74%</td>
          <td>100%</td>
          <td>85%</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>64%</td>
          <td>100%</td>
          <td>78%</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>65%</td>
          <td>95%</td>
          <td>77%</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Results on Reactions (Bucket C):</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1 Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RxnScribe</strong></td>
          <td>77%</td>
          <td>97%</td>
          <td>86%</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>64%</td>
          <td>65%</td>
          <td>64%</td>
      </tr>
      <tr>
          <td><strong>ReactionDataExtractor</strong></td>
          <td>49%</td>
          <td>62%</td>
          <td>55%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p><strong>ChemIC Training</strong>: Trained on a machine with 40 Intel(R) Xeon(R) Gold 6226 CPUs. Training time approximately 6 hours for 100 epochs (early stopping at epoch 26).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10546827">Zenodo Repository (Code &amp; Data)</a></td>
          <td>Code, Dataset</td>
          <td>Unknown</td>
          <td>Benchmark images, processing scripts, evaluation tools, ChemIC classifier code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ontochem/ImageComparator">ImageComparator</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Java tool for visual comparison of OCSR predictions</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krasnov, A., Barnabas, S. J., Boehme, T., Boyer, S. K., &amp; Weber, L. (2024). Comparing software tools for optical chemical structure recognition. <em>Digital Discovery</em>, 3(4), 681-693. <a href="https://doi.org/10.1039/D3DD00228D">https://doi.org/10.1039/D3DD00228D</a></p>
<p><strong>Publication</strong>: Digital Discovery 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.5281/zenodo.10546827">Zenodo Repository (Code &amp; Data)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krasnovComparingSoftwareTools2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Comparing Software Tools for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krasnov, Aleksei and Barnabas, Shadrack J. and Boehme, Timo and Boyer, Stephen K. and Weber, Lutz}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{681--693}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D3DD00228D}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>String Representations for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</guid><description>Ablation study comparing SMILES, DeepSMILES, SELFIES, and InChI for OCSR. SMILES achieves highest accuracy; SELFIES guarantees validity.</description><content:encoded><![CDATA[<h2 id="empirical-focus-and-resource-contributions">Empirical Focus and Resource Contributions</h2>
<p>This is an <strong>Empirical Paper</strong> ($\Psi_{\text{Empirical}}$) with a secondary contribution as a <strong>Resource Paper</strong> ($\Psi_{\text{Resource}}$).</p>
<p>It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, DeepSMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.</p>
<h2 id="the-syntax-challenge-in-chemical-image-recognition">The Syntax Challenge in Chemical Image Recognition</h2>
<p>Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. Specifically, the tokenization of SMILES strings (where ring closures and branches are marked by single characters potentially far apart in the sequence) creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.</p>
<h2 id="isolating-string-representation-variables">Isolating String Representation Variables</h2>
<p>The core novelty is the <strong>comparative isolation of the string representation variable</strong> in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between <em>validity</em> (guaranteed by SELFIES) and <em>accuracy</em> (highest with SMILES).</p>
<h2 id="large-scale-image-to-text-translation-experiments">Large-Scale Image-to-Text Translation Experiments</h2>
<p>The authors performed a large-scale image-to-text translation experiment:</p>
<ul>
<li><strong>Task</strong>: Converting 2D chemical structure images into text strings.</li>
<li><strong>Data</strong>:
<ul>
<li><strong>ChEMBL</strong>: ~1.6M molecules, split into two datasets (with and without stereochemistry).</li>
<li><strong>PubChem</strong>: ~3M molecules, split similarly, to test performance scaling with data size.</li>
</ul>
</li>
<li><strong>Representations</strong>: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI.</li>
<li><strong>Metric</strong>: The models were evaluated on:
<ul>
<li><strong>Validity</strong>: Can the predicted string be decoded back to a molecule?</li>
<li><strong>Exact Match</strong>: Is the predicted string identical to the ground truth?</li>
<li><strong>Tanimoto Similarity</strong>: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)? The similarity $\mathcal{T}$ between two molecular fingerprints $A$ and $B$ is calculated as:
$$ \mathcal{T}(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</li>
</ul>
</li>
</ul>
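<p>Specialized to binary fingerprints stored as sets of on-bit indices, the Tanimoto formula above reduces to intersection over union, since for bit vectors $A \cdot B$ is the count of shared bits and $|A|^2$ the count of set bits:</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity for binary fingerprints given as sets of
    on-bit indices: |A & B| / (|A| + |B| - |A & B|)."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 1.0

# Illustrative fingerprints only (tiny, hand-picked bit sets).
fp_truth = {1, 4, 7, 9, 12}
fp_pred  = {1, 4, 7, 12}
sim = tanimoto(fp_truth, fp_pred)  # 4 shared / 5 total bits -> 0.8
```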
<h2 id="comparative-performance-and-validity-trade-offs">Comparative Performance and Validity Trade-offs</h2>
<ul>
<li><strong>SMILES is the most accurate</strong>: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.</li>
<li><strong>SELFIES guarantees validity</strong>: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.</li>
<li><strong>InChI is unsuitable</strong>: InChI performed significantly worse (approx. 64% exact match), largely because its strings are far longer (up to 273 characters in this data).</li>
<li><strong>Stereochemistry adds difficulty</strong>: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.</li>
<li><strong>Recommendation</strong>: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (Dataset 1/2)</td>
          <td>~1.5M</td>
          <td>Filtered for MW &lt; 1500, specific elements (C,H,O,N,P,S,F,Cl,Br,I,Se,B).</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem (Dataset 3/4)</td>
          <td>~3.0M</td>
          <td>Same filtering rules, used to test scaling.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Test Split</td>
          <td>~120k - 250k</td>
          <td>Created using RDKit MaxMin algorithm to ensure chemical diversity.</td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: CDK Structure Diagram Generator (SDG).</li>
<li><strong>Specs</strong>: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tokenization Rules</strong> (Critical for replication):</p>
<ul>
<li><strong>SELFIES</strong>: Split at every <code>][</code> (e.g., <code>[C][N]</code> $\rightarrow$ <code>[C]</code>, <code>[N]</code>).</li>
<li><strong>SMILES / DeepSMILES</strong>: Regex-based splitting:
<ul>
<li>Every heavy atom (e.g., <code>C</code>, <code>N</code>).</li>
<li>Every branch parenthesis <code>(</code> and <code>)</code>.</li>
<li>Every bond symbol <code>=</code> and <code>#</code>.</li>
<li>Every single-digit number.</li>
<li>Everything inside square brackets <code>[]</code> is kept as a single token.</li>
</ul>
</li>
<li><strong>InChI</strong>: The prefix <code>InChI=1S/</code> was treated as a single token and removed during training, then re-added for evaluation.</li>
</ul>
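<p>A minimal Python sketch of these tokenization rules; the exact set of element symbols is an assumption based on the element filter used for the datasets (C, H, O, N, P, S, F, Cl, Br, I, Se, B), and two-letter symbols such as <code>Cl</code>, <code>Br</code>, and <code>Se</code> must be tried before their one-letter prefixes:</p>

```python
import re

def tokenize_selfies(s: str) -> list[str]:
    """SELFIES: split at every '][' boundary, keeping the brackets."""
    return re.findall(r"\[[^\]]*\]", s)

# SMILES / DeepSMILES: bracket expressions are kept as single tokens;
# two-letter element symbols must precede their one-letter prefixes.
SMILES_TOKEN = re.compile(
    r"\[[^\]]*\]"     # anything inside square brackets, as one token
    r"|Cl|Br|Se"      # two-letter element symbols first
    r"|[BCNOPSFI]"    # remaining heavy atoms from the dataset's element set
    r"|[bcnops]"      # aromatic (lowercase) atoms
    r"|[()=#]"        # branch parentheses and bond symbols
    r"|[0-9]"         # single-digit ring-closure numbers
)

def tokenize_smiles(s: str) -> list[str]:
    return SMILES_TOKEN.findall(s)

print(tokenize_selfies("[C][N][=O]"))         # ['[C]', '[N]', '[=O]']
print(tokenize_smiles("CC(=O)Nc1ccc(O)cc1"))  # paracetamol
```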
<h3 id="models">Models</h3>
<p>The model follows the <strong>DECIMER</strong> architecture.</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-B3 (pre-trained with &ldquo;Noisy Student&rdquo; weights).
<ul>
<li>Output: Image feature vectors of shape $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder</strong>: Transformer (similar to the &ldquo;Base&rdquo; model from <em>Attention Is All You Need</em>).
<ul>
<li>Layers: 4 encoder-decoder layers.</li>
<li>Attention Heads: 8.</li>
<li>Dimension ($d_{\text{model}}$): 512.</li>
<li>Feed-forward ($d_{\text{ff}}$): 2048.</li>
<li>Dropout: 10%.</li>
</ul>
</li>
<li><strong>Loss</strong>: Sparse categorical cross-entropy.</li>
<li><strong>Optimizer</strong>: Adam with custom learning rate scheduler.</li>
</ul>
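<p>For scale, a back-of-the-envelope weight count for the decoder stack described above, assuming each of the 4 layers contains self-attention, cross-attention over the image features, and a feed-forward block (biases, layer norms, embeddings, and the output projection are ignored; this estimate is not a figure from the paper):</p>

```python
# Rough weight count for a "Base"-style Transformer decoder with the
# hyperparameters listed above.
d_model, d_ff, n_layers = 512, 2048, 4

self_attn = 4 * d_model * d_model   # Q, K, V, and output projections
cross_attn = 4 * d_model * d_model  # same shapes, attending to image features
ffn = 2 * d_model * d_ff            # two feed-forward weight matrices

total = n_layers * (self_attn + cross_attn + ffn)
print(f"~{total / 1e6:.1f}M weights")  # ~16.8M weights
```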
<h3 id="evaluation">Evaluation</h3>
<p>Metrics were calculated after converting all predictions back to standard SMILES.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Baseline (SMILES)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Identical Match</strong></td>
          <td>88.62% (PubChem)</td>
          <td>Strict character-for-character equality.</td>
      </tr>
      <tr>
          <td><strong>Valid Structure</strong></td>
          <td>99.78%</td>
          <td>SMILES had rare syntax errors; SELFIES achieved 100%.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (Avg)</strong></td>
          <td>0.98</td>
          <td>Calculated using PubChem fingerprints via CDK.</td>
      </tr>
  </tbody>
</table>
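<p>The evaluation loop can be sketched as follows; <code>is_valid</code> and <code>similarity</code> are hypothetical stand-ins for real toolkit calls (the paper decodes predictions and computes fingerprint similarity via CDK):</p>

```python
from statistics import mean
from typing import Callable

def score_predictions(
    preds: list[str],
    refs: list[str],
    is_valid: Callable[[str], bool],
    similarity: Callable[[str, str], float],
) -> dict:
    """Validity %, exact-match %, and mean similarity over valid predictions."""
    valid = [is_valid(p) for p in preds]
    exact = [p == r for p, r in zip(preds, refs)]
    sims = [similarity(p, r) for p, r, v in zip(preds, refs, valid) if v]
    return {
        "valid_pct": 100 * mean(valid),
        "exact_match_pct": 100 * mean(exact),
        "avg_tanimoto": mean(sims) if sims else 0.0,
    }

# Toy stand-ins: a string containing "((" fails to parse; similarity is 1.0
# on an exact match and 0.5 otherwise.
out = score_predictions(
    ["CCO", "C1CC1", "C(("],
    ["CCO", "C1CC1", "CCN"],
    is_valid=lambda s: "((" not in s,
    similarity=lambda p, r: 1.0 if p == r else 0.5,
)
print(out)
```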
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: Google Cloud TPUs (v3-8).</li>
<li><strong>Format</strong>: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency.</li>
<li><strong>Batch Size</strong>: 1024.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER_Short_Communication">DECIMER Short Communication</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts (Python, Java)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5155037">Datasets on Zenodo</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>SMILES data and processing scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Steinbeck, C., &amp; Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. <em>Digital Discovery</em>, 1(2), 84-90. <a href="https://doi.org/10.1039/D1DD00013F">https://doi.org/10.1039/D1DD00013F</a></p>
<p><strong>Publication</strong>: Digital Discovery 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://chemrxiv.org/doi/pdf/10.26434/chemrxiv-2021-7c9wf">ChemRxiv Preprint (PDF)</a></li>
<li><a href="https://github.com/Kohulan/DECIMER_Short_Communication">Official Code Repository</a></li>
<li><a href="https://doi.org/10.5281/zenodo.5155037">Data on Zenodo</a></li>
<li>Related work: <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/">IMG2SMI</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanPerformanceChemicalStructure2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{84--90}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D1DD00013F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of OCSR Techniques and Models (Musazade 2022)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</guid><description>Systematization of OCSR evolution from rule-based systems to deep learning, highlighting the paradigm shift to image captioning approaches.</description><content:encoded><![CDATA[<h2 id="systematization-of-ocsr-evolution">Systematization of OCSR Evolution</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$). It organizes existing literature into two distinct evolutionary phases: <strong>Rule-based systems</strong> (1990s-2010s) and <strong>Machine Learning-based systems</strong> (2015-present). It synthesizes performance metrics across these paradigms to highlight the shift from simple classification to &ldquo;image captioning&rdquo; (sequence generation).</p>
<p><strong>Justification</strong>: The paper focuses on &ldquo;organizing and synthesizing existing literature&rdquo; and answers the core question: &ldquo;What do we know?&rdquo; The dominant contribution is systematization based on several key indicators:</p>
<ol>
<li>
<p><strong>Survey Structure</strong>: The paper explicitly structures content by categorizing the field into two distinct historical and methodological groups: &ldquo;Rule-based systems&rdquo; and &ldquo;ML-based systems&rdquo;. It traces the &ldquo;evolution of approaches from rule-based structure analyses to complex statistical models&rdquo;, moving chronologically from early tools like OROCS and OSRA (1990s-2000s) to modern Deep Learning approaches like DECIMER and Vision Transformers.</p>
</li>
<li>
<p><strong>Synthesis of Knowledge</strong>: The paper aggregates performance metrics from various distinct studies into unified comparison tables (Table 1 for rule-based and Table 2 for ML-based). It synthesizes technical details of different models, explaining how specific architectures (CNNs, LSTMs, Attention mechanisms) are applied to the specific problem of Optical Chemical Structure Recognition (OCSR).</p>
</li>
<li>
<p><strong>Identification of Gaps</strong>: The authors dedicate specific sections to &ldquo;Gaps of rule-based systems&rdquo; and &ldquo;Gaps of ML-based systems&rdquo;. It concludes with recommendations for future development, such as the need for &ldquo;standardized datasets&rdquo; and specific improvements in image augmentation and evaluation metrics.</p>
</li>
</ol>
<h2 id="motivation-for-digitization-in-cheminformatics">Motivation for Digitization in Cheminformatics</h2>
<p>The primary motivation is the need to digitize vast amounts of chemical knowledge locked in non-digital formats (e.g., scanned PDFs, older textbooks). This is challenging because:</p>
<ol>
<li><strong>Representational Variety</strong>: A single chemical formula can be drawn in many visually distinct ways (e.g., different orientations, bond styles, fonts).</li>
<li><strong>Legacy Data</strong>: Older documents contain noise, low resolution, and disconnected strokes that confuse standard computer vision models.</li>
<li><strong>Lack of Standardization</strong>: There is no centralized database or standardized benchmark for evaluating OCSR performance, making comparison difficult.</li>
</ol>
<h2 id="key-insights-and-the-paradigm-shift">Key Insights and the Paradigm Shift</h2>
<p>The paper provides a structured comparison of the &ldquo;evolution&rdquo; of OCSR, specifically identifying the pivot point where the field moved from object detection to <strong>NLP-inspired sequence generation</strong>.</p>
<p>Key insights include:</p>
<ul>
<li><strong>The Paradigm Shift</strong>: Identifying that OCSR has effectively become an &ldquo;image captioning&rdquo; problem where the &ldquo;caption&rdquo; is a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> string.</li>
<li><strong>Metric Critique</strong>: It critically analyzes the flaws in current evaluation metrics, noting that Levenshtein Distance (LD) is better than simple accuracy but still fails to capture semantic chemical severity (e.g., mistaking &ldquo;F&rdquo; for &ldquo;S&rdquo; is worse than a wrong digit).</li>
<li><strong>Hybrid Potential</strong>: Despite the dominance of ML, the authors argue that rule-based heuristics are still valuable for post-processing validation (e.g., checking element order, sequence structure, and formula correspondence).</li>
</ul>
<h2 id="comparative-analysis-of-rule-based-vs-ml-systems">Comparative Analysis of Rule-Based vs. ML Systems</h2>
<p>As a review paper, it aggregates experimental results from primary sources. It compares:</p>
<ul>
<li><strong>Rule-based systems</strong>: OSRA, chemoCR, Imago, Markov Logic OCSR, and various heuristic approaches.</li>
<li><strong>ML-based systems</strong>: DECIMER (multiple versions), MSE-DUDL, ICMDT (Image Captioning Model based on Deep Transformer-in-Transformer), and other BMS Kaggle competition solutions.</li>
</ul>
<p>It contrasts these systems using:</p>
<ul>
<li><strong>Datasets</strong>: BMS (synthetic, 4M images), PubChem (synthetic), U.S. Patents (real-world scanned).</li>
<li><strong>Metrics</strong>: Tanimoto similarity (structural overlap) and Levenshtein distance (string edit distance).</li>
</ul>
<h2 id="outcomes-critical-gaps-and-recommendations">Outcomes, Critical Gaps, and Recommendations</h2>
<ol>
<li><strong>Transformers are SOTA</strong>: Attention-based encoder-decoder models outperform CNN-RNN hybrids. DECIMER 1.0 achieved 96.47% Tanimoto $= 1.0$ on its test set using an EfficientNet-B3 encoder and Transformer decoder.</li>
<li><strong>Data Hungry</strong>: Modern approaches require massive datasets (millions of images) and significant compute. DECIMER 1.0 trained on 39M images for 14 days on TPU, while the original DECIMER took 27 days on a single GPU. Rule-based systems required neither large data nor heavy compute but hit a performance ceiling.</li>
<li><strong>Critical Gaps</strong>:
<ul>
<li><strong>Super-atoms</strong>: Current models struggle with abbreviated super-atoms (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;).</li>
<li><strong>Stereochemistry</strong>: 3D information (wedges/dashes) is often lost or misinterpreted.</li>
<li><strong>Resolution</strong>: Models are brittle to resolution changes; some require high-res, others fail if images aren&rsquo;t downscaled.</li>
</ul>
</li>
<li><strong>Recommendation</strong>: Future systems should integrate &ldquo;smart&rdquo; pre-processing (denoising without cropping) and use domain-specific distance metrics. The authors also note that post-processing formula validation (checking element order, sequence structure, and formula correspondence) increases accuracy by around 5-6% on average. They suggest exploring Capsule Networks as an alternative to CNNs, since capsules preserve positional relationships through routing-by-agreement rather than discarding them via max-pooling.</li>
</ol>
<h2 id="reproducibility">Reproducibility</h2>
<p>As a review paper, this work does not introduce original code, models, or datasets. The paper itself is open access via the Journal of Cheminformatics. This section summarizes the technical details of the systems reviewed.</p>
<h3 id="data">Data</h3>
<p>The review identifies the following key datasets used for training OCSR models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BMS (Bristol-Myers Squibb)</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~4M images</td>
          <td style="text-align: left">2.4M train / 1.6M test. Used for Kaggle competition. Test images contain noise (salt &amp; pepper, blur) and rotations absent from training images.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>PubChem</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~39M</td>
          <td style="text-align: left">Generated via CDK (Chemistry Development Kit). Used by DECIMER 1.0 (90/10 train/test split).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>U.S. Patents (USPTO)</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">Variable</td>
          <td style="text-align: left">Real-world noise, often low resolution. One of several training sources for MSE-DUDL (alongside PubChem and Indigo, totaling 50M+ samples).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ChemInfty</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">869 images</td>
          <td style="text-align: left">Older benchmark used to evaluate rule-based systems (e.g., Markov Logic OCSR).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review highlights the progression of algorithms:</p>
<ul>
<li><strong>Rule-Based</strong>: Hough transforms for bond detection, vectorization/skeletonization, and OCR for atom labels.</li>
<li><strong>Sequence Modeling</strong>:
<ul>
<li><strong>Image Captioning</strong>: Encoder (CNN/ViT) → Decoder (RNN/Transformer).</li>
<li><strong>Tokenization</strong>: Parsing InChI/SMILES into discrete tokens (e.g., splitting <code>C13</code> into <code>C</code>, <code>13</code>).</li>
<li><strong>Beam Search</strong>: Used at inference time (typically $k = 15$&ndash;$20$) to find the most likely chemical string.</li>
</ul>
</li>
</ul>
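<p>The decoding step can be sketched with a generic beam search; the toy <code>step_logprobs</code> model and the token names below are invented for illustration (in DECIMER-style models this function would be the Transformer decoder&rsquo;s softmax over the token vocabulary):</p>

```python
import math

def beam_search(step_logprobs, bos, eos, k=3, max_len=50):
    """Keep the k highest-scoring partial sequences at each decoding step."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:  # finished beams pass through unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]

# Toy model that prefers spelling out "C", "C", "O" and then stopping:
TABLE = [{"C": 0.6, "O": 0.4}, {"C": 0.7, "O": 0.3},
         {"O": 0.8, "C": 0.2}, {"<eos>": 1.0}]

def step_logprobs(prefix):
    dist = TABLE[min(len(prefix) - 1, len(TABLE) - 1)]
    return {t: math.log(p) for t, p in dist.items()}

print(beam_search(step_logprobs, "<bos>", "<eos>", k=2))
# ['<bos>', 'C', 'C', 'O', '<eos>']
```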
<h3 id="models">Models</h3>
<p>Key architectures reviewed:</p>
<ul>
<li><strong>DECIMER 1.0</strong>: Uses <strong>EfficientNet-B3</strong> (Encoder) and <strong>Transformer</strong> (Decoder). Predicts <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings (more robust than <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>).</li>
<li><strong>Swin Transformer</strong>: Often used in Kaggle ensembles as the visual encoder due to better handling of variable image sizes.</li>
<li><strong>Grid LSTM</strong>: Used in older deep learning approaches (MSE-DUDL) to capture spatial dependencies.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics standard in the field:</p>
<ul>
<li><strong>Levenshtein Distance (LD)</strong>: Edit distance between predicted and ground truth strings. Lower is better. Formally, for two sequences $a$ and $b$ (e.g., SMILES strings) of lengths $|a|$ and $|b|$, the distance $LD(a, b)$ is bounded between $0$ and $\max(|a|, |b|)$.</li>
<li><strong>Tanimoto Similarity</strong>: Measures overlap of molecular fingerprints ($0.0 - 1.0$). Higher is better. DECIMER 1.0 achieved a Tanimoto of 0.99 on PubChem data (Table 2). Calculated as:
$$
\begin{aligned}
T(A, B) = \frac{N_c}{N_a + N_b - N_c}
\end{aligned}
$$
where $N_a$ and $N_b$ are the number of bits set to 1 in fingerprints $A$ and $B$, and $N_c$ is the number of common bits set to 1.</li>
<li><strong>1-1 Match Rate</strong>: Exact string matching (accuracy). For DECIMER 1.0, 96.47% of results achieved Tanimoto $= 1.0$.</li>
</ul>
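<p>The LD recurrence can be sketched in a few lines; the SMILES pair in the example (an assumption, not from the paper) illustrates the metric critique above: a chemically severe O$\rightarrow$S substitution costs exactly one edit, the same as a trivial typo:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard DP recurrence, keeping one row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# Aspirin vs. its thio variant: one substitution, one unit of edit distance.
print(levenshtein("CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Sc1ccccc1C(=O)O"))  # 1
```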
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Cost</strong>: High for SOTA. DECIMER 1.0 required ~14 days on TPU. The original DECIMER took ~27 days on a single NVIDIA GPU.</li>
<li><strong>Inference</strong>: Transformer models are heavy; rule-based systems run on standard CPUs but with lower accuracy.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Musazade, F., Jamalova, N., &amp; Hasanov, J. (2022). Review of techniques and models used in optical chemical structure recognition in images and scanned documents. <em>Journal of Cheminformatics</em>, 14(1), 61. <a href="https://doi.org/10.1186/s13321-022-00642-3">https://doi.org/10.1186/s13321-022-00642-3</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{musazadeReviewTechniquesModels2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Review of Techniques and Models Used in Optical Chemical Structure Recognition in Images and Scanned Documents}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Musazade, Fidan and Jamalova, Narmin and Hasanov, Jamaladdin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-022-00642-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>A Review of Optical Chemical Structure Recognition Tools</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</guid><description>Comprehensive review and benchmarking of 30 years of Optical Chemical Structure Recognition (OCSR) methods and tools.</description><content:encoded><![CDATA[<h2 id="systematization-and-benchmarking-of-ocsr">Systematization and Benchmarking of OCSR</h2>
<p>This is primarily a <strong>Systematization</strong> paper ($0.7 \Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($0.3 \Psi_{\text{Resource}}$).</p>
<p>It serves as a <strong>Systematization</strong> because it organizes nearly three decades of research in Optical Chemical Structure Recognition (OCSR), categorizing methods into rule-based systems (e.g., Kekulé, CLiDE, OSRA) and emerging machine-learning approaches (e.g., MSE-DUDL, Chemgrapher). It synthesizes information on 16 distinct tools, many of which are commercial or no longer available.</p>
<p>It acts as a <strong>Resource</strong> by defining a benchmark for the field. The authors evaluate the three available open-source tools (Imago, MolVec, OSRA) against four distinct datasets to establish baseline performance metrics for accuracy and speed.</p>
<h2 id="motivation-digitizing-legacy-chemical-literature">Motivation: Digitizing Legacy Chemical Literature</h2>
<p>A vast amount of chemical knowledge remains &ldquo;hidden&rdquo; in the primary scientific literature (printed or PDF), conveyed as 2D images. Because these depictions are not machine-readable, there is a &ldquo;backlog of decades of chemical literature&rdquo; that cannot be easily indexed or searched in open-access databases.</p>
<p>While Chemical Named Entity Recognition (NER) exists for text, translating graphical depictions into formats like SMILES or SDfiles requires specialized OCSR tools. The motivation is to enable the automated curation of this legacy data to feed public databases.</p>
<h2 id="core-innovations-historical-taxonomy-and-open-standards">Core Innovations: Historical Taxonomy and Open Standards</h2>
<p>The primary novelty is the comprehensive aggregation of the history of the field, which had not been thoroughly reviewed recently. It details the algorithmic evolution from the first work in 1990 to deep learning methods in 2019.</p>
<p>Specific contributions include:</p>
<ul>
<li><strong>Historical Taxonomy</strong>: Classification of tools into rule-based vs. machine-learning, and open-source vs. commercial/unavailable.</li>
<li><strong>Open Source Benchmark</strong>: A comparative performance analysis of the only three open-source tools available at the time (Imago, MolVec, OSRA) on standardized datasets.</li>
<li><strong>Algorithmic Breakdown</strong>: Detailed summaries of the workflows for closed-source or lost tools (e.g., Kekulé, OROCS, ChemReader) based on their original publications.</li>
</ul>
<h2 id="benchmarking-methodology-and-open-source-evaluation">Benchmarking Methodology and Open-Source Evaluation</h2>
<p>The authors performed a benchmark study to evaluate the accuracy and speed of three open-source OCSR tools: <strong>MolVec (0.9.7)</strong>, <strong>Imago (2.0)</strong>, and <strong>OSRA (2.1.0)</strong>.</p>
<p>They tested these tools on four datasets of varying quality and origin:</p>
<ol>
<li><strong>USPTO</strong>: 5,719 images from US patents (high quality).</li>
<li><strong>UOB</strong>: 5,740 images from the University of Birmingham, published alongside MolRec.</li>
<li><strong>CLEF 2012</strong>: 961 images from the CLEF-IP evaluation (well-segmented, clean).</li>
<li><strong>JPO</strong>: 450 images from Japanese patents (low quality, noise, Japanese characters).</li>
</ol>
<p>Evaluation metrics were:</p>
<ul>
<li><strong>Accuracy</strong>: Percentage of perfectly recognized structures, defined as an exact string match between the generated and the reference Standard InChI: $\text{Accuracy} = \frac{\text{Correct InChI Matches}}{\text{Total Images}}$. Each tool&rsquo;s output was converted to an <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> string and compared against the reference InChI.</li>
<li><strong>Speed</strong>: Total processing time for the dataset.</li>
</ul>
<h2 id="results-and-general-conclusions">Results and General Conclusions</h2>
<p><strong>Benchmark Results (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>MolVec 0.9.7</th>
          <th>Imago 2.0</th>
          <th>OSRA 2.1.0</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO (5,719 images)</td>
          <td>Time (min)</td>
          <td>28.65</td>
          <td>72.83</td>
          <td>145.04</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.41%</td>
          <td>87.20%</td>
          <td>87.69%</td>
      </tr>
      <tr>
          <td>UOB (5,740 images)</td>
          <td>Time (min)</td>
          <td>28.42</td>
          <td>152.52</td>
          <td>125.78</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.39%</td>
          <td>63.54%</td>
          <td>86.50%</td>
      </tr>
      <tr>
          <td>CLEF 2012 (961 images)</td>
          <td>Time (min)</td>
          <td>4.41</td>
          <td>16.03</td>
          <td>21.33</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>80.96%</td>
          <td>65.45%</td>
          <td>94.90%</td>
      </tr>
      <tr>
          <td>JPO (450 images)</td>
          <td>Time (min)</td>
          <td>7.50</td>
          <td>22.55</td>
          <td>16.68</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>66.67%</td>
          <td>40.00%</td>
          <td>57.78%</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Observations</strong>:</p>
<ul>
<li><strong>MolVec</strong> was the fastest tool, processing datasets significantly quicker than competitors (e.g., 28.65 min for USPTO vs. 145.04 min for OSRA).</li>
<li><strong>OSRA</strong> performed exceptionally well on clean, well-segmented data (94.90% on CLEF 2012) but was slower.</li>
<li><strong>Imago</strong> generally lagged in accuracy compared to the other two, particularly on the UOB dataset (63.54% vs. 88.39% for MolVec and 86.50% for OSRA).</li>
<li><strong>JPO Difficulty</strong>: All tools struggled with the noisy Japanese Patent Office dataset (accuracies ranged from 40.00% to 66.67%), highlighting issues with noise and non-standard labels.</li>
</ul>
<p><strong>General Conclusions</strong>:</p>
<ul>
<li>No &ldquo;gold standard&rdquo; tool existed (as of 2020) that solved all problems (page segmentation, R-groups, NLP integration).</li>
<li>Rule-based approaches dominate the history of the field, but deep learning methods (MSE-DUDL, Chemgrapher) were emerging, though they were closed-source at the time of writing.</li>
<li>There was a critical need for tools that could handle full-page recognition (combining segmentation and recognition).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The authors provided sufficient detail to replicate the benchmarking study.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/OCSR_Review">OCSR_Review (GitHub)</a></td>
          <td>Code / Data</td>
          <td>MIT</td>
          <td>Benchmark images (PNG, 72 dpi) and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/p/osra/wiki/Download/">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.1.0 tested; precompiled binaries are commercial</td>
      </tr>
      <tr>
          <td><a href="https://lifescience.opensource.epam.com/download/imago.html">Imago</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.0 tested; no longer actively developed</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ncats/molvec">MolVec</a></td>
          <td>Code</td>
          <td>LGPL-2.1</td>
          <td>Version 0.9.7 tested; Java-based standalone tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used four public datasets. Images were converted to PNG (72 dpi) to ensure compatibility across all tools.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
          <th>Characteristics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>5,719</td>
          <td>OSRA Validation Set</td>
          <td>US Patent images, generally clean.</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>5,740</td>
          <td>Univ. of Birmingham</td>
          <td>Published alongside MolRec.</td>
      </tr>
      <tr>
          <td><strong>CLEF 2012</strong></td>
          <td>961</td>
          <td>CLEF-IP 2012</td>
          <td>Well-segmented, high quality.</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>450</td>
          <td>Japanese Patent Office</td>
          <td>Low quality, noisy, contains Japanese text.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper does not propose a new algorithm but benchmarks existing ones. The execution commands for reproducibility were:</p>
<ul>
<li><strong>Imago</strong>: Executed via command line without installation.
<code>./imago_console -dir /image/directory/path</code></li>
<li><strong>MolVec</strong>: Executed as a JAR file.
<code>java -cp [dependencies] gov.nih.ncats.molvec.Main -dir [input_dir] -outDir [output_dir]</code></li>
<li><strong>OSRA</strong>: Installed via Conda (PyOSRA) due to compilation complexity. Required dictionaries for superatoms and spelling.
<code>osra -f sdf -a [superatom_dict] -l [spelling_dict] -w [output_file] [input_file]</code></li>
</ul>
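<p>For batch benchmarking, these invocations can be assembled programmatically. The sketch below builds the argument vectors without executing anything; the dictionary file names, the <code>molvec.jar</code> classpath, and the <code>*.png</code> extension are placeholders, not values from the paper.</p>

```python
from pathlib import Path


def build_commands(image_dir, output_dir,
                   superatom_dict="superatom.txt",
                   spelling_dict="spelling.txt"):
    """Assemble argument vectors mirroring the commands reported in the paper.

    Placeholder values (dictionary paths, classpath, extension) must be
    adjusted to the local installation.
    """
    return {
        # Imago consumes a whole directory in a single call.
        "imago": ["./imago_console", "-dir", image_dir],
        # MolVec is a Java tool; the classpath must list its dependencies.
        "molvec": ["java", "-cp", "molvec.jar", "gov.nih.ncats.molvec.Main",
                   "-dir", image_dir, "-outDir", output_dir],
        # OSRA is invoked once per image, so build one command per input file.
        "osra": [
            ["osra", "-f", "sdf", "-a", superatom_dict, "-l", spelling_dict,
             "-w", str(Path(output_dir) / (img.stem + ".sdf")), str(img)]
            for img in sorted(Path(image_dir).glob("*.png"))
        ],
    }
```

<p>Each vector can then be handed to <code>subprocess.run</code>, which avoids shell-quoting issues with patent file names.</p>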
<h3 id="models">Models</h3>
<p>The specific versions of the open-source software tested were:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Version</th>
          <th>Technology</th>
          <th>License</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>0.9.7</td>
          <td>Java-based, rule-based</td>
          <td>LGPL-2.1</td>
      </tr>
      <tr>
          <td><strong>Imago</strong></td>
          <td>2.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>2.1.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Perfect structural match. The output SDfile/SMILES was converted to a Standard InChI string and compared to the ground truth InChI. Any deviation counted as a failure.</li>
<li><strong>Environment</strong>: Linux workstation (Ubuntu 20.04 LTS).</li>
</ul>
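<p>Once both sides are converted to Standard InChI, the perfect-match metric reduces to exact string comparison. A minimal sketch in pure Python (the InChI generation itself, done with a cheminformatics toolkit, is omitted here):</p>

```python
def perfect_match_rate(predicted, truth):
    """Fraction of images whose predicted InChI exactly equals the ground
    truth. Missing or empty predictions count as failures, mirroring the
    paper's all-or-nothing criterion: any deviation is a miss."""
    if not truth:
        return 0.0
    correct = sum(
        1 for name, ref in truth.items()
        if predicted.get(name, "").strip() == ref
    )
    return correct / len(truth)
```
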
<h3 id="hardware">Hardware</h3>
<p>The benchmark was performed on a high-end workstation to measure processing time.</p>
<ul>
<li><strong>CPUs</strong>: 2x Intel Xeon Silver 4114 (40 threads total).</li>
<li><strong>RAM</strong>: 64 GB.</li>
<li><strong>Parallelization</strong>: MolVec had pre-implemented parallelization features that contributed to its speed.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H. O., Zielesny, A., &amp; Steinbeck, C. (2020). A review of optical chemical structure recognition tools. <em>Journal of Cheminformatics</em>, 12(1), 60. <a href="https://doi.org/10.1186/s13321-020-00465-0">https://doi.org/10.1186/s13321-020-00465-0</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanReviewOpticalChemical2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Review of Optical Chemical Structure Recognition Tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00465-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Overview of the TREC 2011 Chemical IR Track Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/trec-chem-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/trec-chem-2011/</guid><description>Overview of the 2011 TREC Chemical IR track, establishing benchmarks for patent prior art, technology surveys, and chemical image recognition.</description><content:encoded><![CDATA[<h2 id="contribution-establishing-chemical-ir-benchmarks">Contribution: Establishing Chemical IR Benchmarks</h2>
<p>This is a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper with a secondary contribution in <strong>Systematization ($\Psi_{\text{Systematization}}$)</strong>.</p>
<p>It serves as an infrastructural foundation for the field by establishing the &ldquo;yardstick&rdquo; for chemical information retrieval. It defines three distinct tasks, curates the necessary datasets (text and image), and creates the evaluation metrics required to measure progress. Secondarily, it systematizes the field by analyzing 36 different runs from 9 research groups, categorizing the performance of various approaches against these new benchmarks.</p>
<h2 id="motivation-bridging-text-and-image-search-in-chemistry">Motivation: Bridging Text and Image Search in Chemistry</h2>
<p>The primary motivation is to bridge the gap between distinct research communities (text mining and image understanding), which are both essential for chemical information retrieval but rarely interact. Professional searchers in chemistry rely heavily on non-textual information (structures), yet prior evaluation efforts lacked specific tasks to handle image data. The track aims to provide professional searchers with a clear understanding of the limits of current tools while stimulating research interest in both patent retrieval and chemical image recognition.</p>
<h2 id="novelty-the-image-to-structure-i2s-task">Novelty: The Image-to-Structure (I2S) Task</h2>
<p>The core novelty is the introduction of the <strong>Image-to-Structure (I2S)</strong> task. While previous years provided image data, this was the first specific task requiring participants to translate a raster image of a molecule into a chemical structure file. Additionally, the Technology Survey (TS) task shifted its focus specifically to <strong>biomedical and pharmaceutical topics</strong> to investigate how general IR systems handle the high terminological diversity (synonyms, abbreviations) typical of biomedical patents.</p>
<h2 id="methodology-trec-2011-task-formulations">Methodology: TREC 2011 Task Formulations</h2>
<p>The organizers conducted a large-scale benchmarking campaign across three specific tasks:</p>
<ol>
<li><strong>Prior Art (PA) Task</strong>: A patent retrieval task using 1,000 topics distributed among the EPO, USPTO, and WIPO.</li>
<li><strong>Technology Survey (TS) Task</strong>: An ad-hoc retrieval task focused on 6 specific biomedical/pharmaceutical information needs (e.g., &ldquo;Tests for HCG hormone&rdquo;).</li>
<li><strong>Image-to-Structure (I2S) Task</strong>: A recognition task using 1,000 training images and 1,000 evaluation images from USPTO patents, where systems had to generate the correct chemical structure (MOL file).</li>
</ol>
<p>A total of 9 groups submitted 36 runs across these tasks. Relevance judgments were performed using stratified sampling and a dual-evaluator system (junior and senior experts) for the TS task.</p>
<h2 id="outcomes-task-achievements-and-limitations">Outcomes: Task Achievements and Limitations</h2>
<ul>
<li><strong>Image-to-Structure Success</strong>: The new I2S task was the most successful task that year, with 5 participating groups submitting 11 runs. All participants recognized over 60% of the structures.</li>
<li><strong>Prior Art Saturation</strong>: Only 2 groups participated in the PA task. The organizers concluded that this task had reached its &ldquo;final point,&rdquo; having learned the extent to which relevant documents can be retrieved in one pass for chemical patent applications.</li>
<li><strong>Biomedical Complexity</strong>: Four teams submitted 14 runs for the TS task, which highlighted the complexity of biomedical queries. The use of specialized domain experts (senior evaluators) and students (junior evaluators) provided high-quality relevance data, though the small number of topics (6) limits broad generalization.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The following details describe the benchmark environment established by the organizers, allowing for the replication of the evaluation.</p>
<h3 id="data">Data</h3>
<p>The track utilized a large collection of approximately 500GB of compressed text and image data.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Dataset / Source</th>
          <th style="text-align: left">Size / Split</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Prior Art (PA)</strong></td>
          <td style="text-align: left">EPO, USPTO, WIPO patents</td>
          <td style="text-align: left">1,000 Topics</td>
          <td style="text-align: left">Distributed: 334 EPO, 333 USPTO, 333 WIPO.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tech Survey (TS)</strong></td>
          <td style="text-align: left">Biomedical patents/articles</td>
          <td style="text-align: left">6 Topics</td>
          <td style="text-align: left">Topics formulated by domain experts; focused on complexity (synonyms, abbreviations).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Image (I2S)</strong></td>
          <td style="text-align: left">USPTO patent images</td>
          <td style="text-align: left">1,000 Train / 1,000 Eval</td>
          <td style="text-align: left">Criteria: No polymers, &ldquo;organic&rdquo; elements only, MW &lt; 1000, single fragment.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper defines specific <strong>evaluation algorithms</strong> used to ground-truth the submissions:</p>
<ul>
<li><strong>Stratified Sampling (TS)</strong>: Pools were generated using the method from Yilmaz et al. (2008). The pool included the top 10 documents from all runs, 30% of the top 30, and 10% of the rest down to rank 1000.</li>
<li><strong>InChI Matching (I2S)</strong>: Evaluation relied on generating <strong>Standard <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> Keys</strong> from both the ground truth MOL files and the participant submissions. Success was defined by exact string matching of these keys. This provided a relatively controversy-free measure of chemical identity.</li>
</ul>
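<p>The stratified pooling scheme can be sketched as follows. This is an illustrative reading of the strata, assuming the 30% sample covers ranks 11&ndash;30 and the 10% sample ranks 31&ndash;1000; the exact procedure of Yilmaz et al. (2008) should be consulted for the estimator details.</p>

```python
import random


def build_pool(runs, seed=0):
    """Assemble a judgment pool from ranked runs using the strata described
    in the track overview: all of ranks 1-10, a 30% sample of ranks 11-30,
    and a 10% sample of ranks 31-1000.

    `runs` is a list of ranked document-id lists, one per submitted run.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible pool
    pool = set()
    for ranked in runs:
        pool.update(ranked[:10])                                  # take all
        pool.update(d for d in ranked[10:30] if rng.random() < 0.30)
        pool.update(d for d in ranked[30:1000] if rng.random() < 0.10)
    return pool
```
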
<h3 id="models">Models</h3>
<p>While the paper does not propose a single model, it evaluates several distinct approaches submitted by participants. Notable systems mentioned include:</p>
<ul>
<li><strong>OSRA</strong> (SAIC-Frederick / NIH)</li>
<li><strong>ChemReader</strong> (University of Michigan)</li>
<li><strong>ChemOCR</strong> (Fraunhofer SCAI)</li>
<li><strong>UoB</strong> (University of Birmingham)</li>
<li><strong>GGA</strong> (GGA Software)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using standard IR metrics for text and exact matching for images.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>MAP / xinfAP</strong></td>
          <td style="text-align: left">Prior Art / Tech Survey</td>
          <td style="text-align: left">Mean Average Precision ($\text{MAP}$) and Extended Inferred AP ($\text{xinfAP}$) were used to measure retrieval quality.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>infNDCG</strong></td>
          <td style="text-align: left">Tech Survey</td>
          <td style="text-align: left">Used to account for graded relevance (highly relevant vs relevant, formalized as $\text{infNDCG}$).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Recall</strong></td>
          <td style="text-align: left">Image-to-Structure</td>
          <td style="text-align: left">Percentage of images where the generated InChI key matched exactly ($R = \frac{\text{Correct}}{\text{Total}}$).</td>
      </tr>
  </tbody>
</table>
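<p>infNDCG estimates NDCG from a sampled pool; the fully-judged quantity it approximates can be sketched as plain nDCG over graded gains (this is the underlying formula, not the inferred estimator itself):</p>

```python
import math


def ndcg(gains, k=None):
    """Plain nDCG over a ranked list of graded relevance gains
    (e.g. 2 = highly relevant, 1 = relevant, 0 = not relevant)."""
    gains = gains[:k] if k else gains
    # Discounted cumulative gain of the run's ranking.
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal DCG: the same gains sorted best-first.
    idcg = sum(g / math.log2(i + 2)
               for i, g in enumerate(sorted(gains, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0
```
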
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://trec.nist.gov/data/chemical11.html">TREC 2011 Chemistry Track Data</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Topics, relevance judgments, and image sets for all three tasks</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://trec.nist.gov/pubs/trec20/t20.proceedings.html">TREC 2011 Proceedings</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Full proceedings including participant system descriptions</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Specific hardware requirements for the participating systems are not detailed in this overview, but the dataset size (500GB) implies significant storage and I/O throughput requirements.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lupu, M., Gurulingappa, H., Filippov, I., Zhao, J., Fluck, J., Zimmermann, M., Huang, J., &amp; Tait, J. (2011). Overview of the TREC 2011 Chemical IR Track. In <em>Proceedings of the Twentieth Text REtrieval Conference (TREC 2011)</em>.</p>
<p><strong>Publication</strong>: Text REtrieval Conference (TREC) 2011</p>
<p><strong>Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/t20.proceedings.html">TREC 2011 Proceedings</a></li>
<li><a href="https://trec.nist.gov/data/chemical11.html">TREC 2011 Chemistry Track Data</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lupuOverviewTREC20112011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Overview of the {{TREC}} 2011 {{Chemical IR Track}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Lupu, Mihai and Gurulingappa, Harsha and Filippov, Igor and Zhao, Jiashu and Fluck, Juliane and Zimmermann, Marc and Huang, Jimmy and Tait, John}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the Twentieth Text REtrieval Conference (TREC 2011)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{NIST}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{The third year of the Chemical IR evaluation track benefitted from the support of many more people interested in the domain, as shown by the number of co-authors of this overview paper. We continued the two tasks we had before, and introduced a new task focused on chemical image recognition. The objective is to gradually move towards systems really useful to the practitioners, and in chemistry, this involves both text and images. The track had a total of 9 groups participating, submitting a total of 36 runs.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at CLEF-IP 2012: Native TIFF Processing for Patents</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</guid><description>Evaluation of OSRA on CLEF-IP 2012 patent data showing native TIFF processing outperforms external splitting tools and pairwise-distance segmentation.</description><content:encoded><![CDATA[<h2 id="contribution-evaluating-native-processing-in-osra">Contribution: Evaluating Native Processing in OSRA</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It focuses on evaluating the algorithmic performance of the Optical Structure Recognition Application (OSRA) and justifies specific implementation details (such as pairwise distance clustering) through comparative analysis. The paper systematically compares preprocessing workflows (native vs. <code>tiffsplit</code>) to demonstrate how implementation choices impact precision, recall, and F1 scores.</p>
<h2 id="motivation-advancing-chemical-structure-recognition">Motivation: Advancing Chemical Structure Recognition</h2>
<p>The primary motivation is to solve the <strong>Chemical Structure Recognition</strong> task within the context of the CLEF-IP 2012 challenge. The goal is to accurately convert images of chemical structures found in patent documents into established computerized molecular formats (connection tables).</p>
<p>A secondary technical motivation is to address issues in page segmentation where standard bounding box approaches fail to separate overlapping or nested molecular structures.</p>
<h2 id="core-innovation-pairwise-distance-segmentation">Core Innovation: Pairwise Distance Segmentation</h2>
<p>The core novelty lies in the algorithmic approach to object detection and page segmentation:</p>
<ol>
<li>
<p><strong>Rejection of Bounding Boxes</strong>: Unlike standard OCR approaches, OSRA does not use a bounding box paradigm internally. Instead, it relies on the <strong>minimum pairwise distance</strong> between points of different connected components. This allows the system to correctly handle cases where a larger molecule &ldquo;surrounds&rdquo; a smaller one, which bounding boxes would incorrectly merge.</p>
</li>
<li>
<p><strong>Native TIFF Processing</strong>: The authors identify that external tools (specifically <code>tiffsplit</code>) introduce artifacts during multi-page TIFF conversion. They implement native splitting facilities within OSRA, which substantially improves precision (from 0.433 to 0.708 at tolerance 0).</p>
</li>
</ol>
<h2 id="experimental-setup-segmentation-and-recognition-tracks">Experimental Setup: Segmentation and Recognition Tracks</h2>
<p>The authors performed two specific tracks for the CLEF-IP 2012 challenge:</p>
<ol>
<li>
<p><strong>Page Segmentation</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 5421 ground truth structures.</li>
<li><strong>Comparison</strong>: Run 1 used <code>tiffsplit</code> (external tool) to separate pages; Run 2 used OSRA&rsquo;s native internal page splitting.</li>
<li><strong>Metrics</strong>: Precision, Recall, and F1 scores calculated at varying pixel tolerances (0, 10, 20, 40, 55 pixels).</li>
</ul>
</li>
<li>
<p><strong>Structure Recognition</strong>:</p>
<ul>
<li><strong>Dataset</strong>: A test set split into an &ldquo;Automatic&rdquo; evaluation set (865 structures checkable via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> keys) and a &ldquo;Manual&rdquo; evaluation set (95 structures requiring human review due to Markush labels).</li>
<li><strong>Metric</strong>: Recognition rate (Recalled %).</li>
</ul>
</li>
</ol>
<h2 id="results-and-conclusions-native-processing-gains">Results and Conclusions: Native Processing Gains</h2>
<ul>
<li><strong>Native vs. External Splitting</strong>: The native OSRA page splitting outperformed the external <code>tiffsplit</code> tool by a wide margin. At tolerance 0, native processing achieved <strong>0.708 Precision</strong> compared to <strong>0.433</strong> for <code>tiffsplit</code>. The authors attribute this gap to artifacts introduced during <code>tiffsplit</code>&rsquo;s internal TIFF format conversion. The native run also returned far fewer records (5,254 vs. 8,800 for <code>tiffsplit</code>), indicating fewer false detections.</li>
<li><strong>Recognition Rate</strong>: Across 960 total structures, the system achieved an <strong>83% recognition rate</strong> (88% on the automatic set, 40% on the manual Markush set).</li>
<li><strong>Context</strong>: The results were consistent with OSRA&rsquo;s second-place finish (out of 6 participants) at TREC-CHEM 2011.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The experiments used the CLEF-IP 2012 benchmark datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Set</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td>Ground Truth</td>
          <td>5,421 structures</td>
          <td>Used to evaluate bounding box/coordinate accuracy.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Automatic</td>
          <td>865 structures</td>
          <td>Evaluated via InChI key matching.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Manual</td>
          <td>95 structures</td>
          <td>Evaluated manually due to Markush-style labels.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Component Clustering (Pairwise Distance)</strong></p>
<p>The segmentation algorithm avoids bounding boxes.</p>
<ul>
<li><strong>Logic</strong>: Calculate the minimum pairwise distance between points of distinct graphical components.</li>
<li><strong>Criterion</strong>: If distance $d &lt; \text{threshold}$, components are grouped.</li>
<li><strong>Advantage</strong>: Enables separation of complex geometries where a bounding box $B_1$ might fully encompass $B_2$ (e.g., a large ring surrounding a salt ion), whereas the actual pixels are disjoint.</li>
</ul>
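<p>The minimum-pairwise-distance criterion can be sketched as a brute-force distance computation with union-find grouping. This is illustrative only; OSRA&rsquo;s actual implementation is not specified in the paper.</p>

```python
from itertools import combinations
import math


def min_pairwise_distance(a, b):
    """Minimum Euclidean distance between any point of component a and any
    point of component b (brute force)."""
    return min(math.dist(p, q) for p in a for q in b)


def cluster_components(components, threshold):
    """Union components whose minimum pairwise distance falls below the
    threshold; returns a list of index sets, one per recovered molecule."""
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(components)), 2):
        if min_pairwise_distance(components[i], components[j]) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())
```

<p>For example, a single point centered inside a large ring of points stays a separate component under a small threshold, even though the ring&rsquo;s bounding box fully contains it &mdash; precisely the case where the bounding-box paradigm fails.</p>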
<p><strong>2. Image Pre-processing</strong></p>
<ul>
<li><strong>Workflow A (Run 1)</strong>: Multi-page TIFF → <code>tiffsplit</code> binary → Single TIFFs → OSRA.</li>
<li><strong>Workflow B (Run 2)</strong>: Multi-page TIFF → OSRA Internal Split → Recognition.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Page Segmentation Results (tiffsplit, Run 1)</strong></p>
<p>Using <code>tiffsplit</code> for page splitting returned 8,800 records against 5,421 ground truth structures.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.433</td>
          <td>0.703</td>
          <td>0.536</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.490</td>
          <td>0.795</td>
          <td>0.606</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.507</td>
          <td>0.823</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.536</td>
          <td>0.870</td>
          <td>0.663</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.549</td>
          <td>0.891</td>
          <td>0.679</td>
      </tr>
  </tbody>
</table>
<p><strong>Page Segmentation Results (Native Split, Run 2)</strong></p>
<p>Using OSRA&rsquo;s native TIFF reading returned 5,254 records, with much higher precision.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.708</td>
          <td>0.686</td>
          <td>0.697</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.793</td>
          <td>0.769</td>
          <td>0.781</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.821</td>
          <td>0.795</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.867</td>
          <td>0.840</td>
          <td>0.853</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.887</td>
          <td>0.860</td>
          <td>0.873</td>
      </tr>
  </tbody>
</table>
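<p>One way to realize the tolerance-based scoring used in the tables above is edge-wise tolerance with greedy one-to-one matching. Both choices are assumptions for illustration; the official CLEF-IP scorer may differ in details.</p>

```python
def box_matches(pred, truth, tol):
    """A predicted box (x0, y0, x1, y1) matches a ground-truth box if every
    edge coordinate deviates by at most `tol` pixels."""
    return all(abs(p - t) <= tol for p, t in zip(pred, truth))


def segmentation_prf(predicted, ground_truth, tol):
    """Greedily match predicted boxes to unmatched ground-truth boxes, then
    report precision, recall, and F1 at the given pixel tolerance."""
    unmatched = list(ground_truth)
    hits = 0
    for box in predicted:
        for gt in unmatched:
            if box_matches(box, gt, tol):
                unmatched.remove(gt)  # each ground-truth box matches once
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```
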
<p><strong>Structure Recognition Results</strong></p>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Count</th>
          <th>Recalled</th>
          <th>Percentage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Automatic</td>
          <td>865</td>
          <td>761</td>
          <td>88%</td>
      </tr>
      <tr>
          <td>Manual</td>
          <td>95</td>
          <td>38</td>
          <td>40%</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>960</strong></td>
          <td><strong>799</strong></td>
          <td><strong>83%</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://cactus.nci.nih.gov/osra">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Official project page at NCI/NIH</td>
      </tr>
  </tbody>
</table>
<p>OSRA is described as an open source utility. The CLEF-IP 2012 benchmark datasets were provided as part of the shared task. No hardware or compute requirements are specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2012). Optical Structure Recognition Application entry to CLEF-IP 2012. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>.</p>
<p><strong>Publication</strong>: CLEF 2012</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="http://cactus.nci.nih.gov/osra">Project Home Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{filippovOpticalStructureRecognition2012,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry to {{CLEF-IP}} 2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V and Katsubo, Dmitry and Nicklaus, Marc C}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present our entry to CLEF 2012 Chemical Structure Recognition task. Our submission includes runs for both bounding box extraction and molecule structure recognition tasks using Optical Structure Recognition Application. OSRA is an open source utility to convert images of chemical structures to connection tables into established computerized molecular formats. It has been under constant development since 2007.}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolRec at CLEF 2012: Rule-Based Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</guid><description>Overview and failure analysis of the MolRec rule-based chemical structure recognition system evaluated on the CLEF 2012 chemical structure recognition task.</description><content:encoded><![CDATA[<h2 id="contribution-to-chemical-structure-recognition">Contribution to Chemical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper. It describes the architecture of an engineered artifact (the &ldquo;MolRec&rdquo; system) and evaluates its efficacy on a specific task (Chemical Structure Recognition) using a standardized benchmark. It focuses on the mechanisms of vectorization and rule-based rewriting.</p>
<h2 id="motivation-and-clef-2012-context">Motivation and CLEF 2012 Context</h2>
<p>The work was motivated by the <strong>CLEF 2012 chemical structure recognition task</strong>. The goal was to automatically interpret chemical diagram images clipped from patent documents. This is challenging because real-world patent images contain complex structures, such as bridge bonds and elements not supported by standard conversion tools like OpenBabel.</p>
<h2 id="novelty-in-rule-based-vectorization">Novelty in Rule-Based Vectorization</h2>
<p>The primary contribution is an <strong>improved rule-based rewrite engine</strong> compared to the authors&rsquo; previous TREC 2011 submission, featuring a fully overhauled implementation that improves both recognition performance and computational efficiency. The system uses a two-stage approach:</p>
<ol>
<li><strong>Vectorization</strong>: Extracts geometric primitives (lines, circles, arrows) and characters.</li>
<li><strong>Rule Engine</strong>: Applies 18 specific geometric rewriting rules to transform primitives into a chemical graph, which can then be exported to MOL or SMILES format.</li>
</ol>
<p>Notably, the system explicitly handles &ldquo;bridge bonds&rdquo; (3D perspective structures) by applying specific recognition rules before general bond detection.</p>
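<p>The rule-engine stage can be pictured as a fixpoint rewrite loop over the extracted primitives. The sketch below is purely schematic: the string encoding of primitives and the toy double-bond rule are invented for illustration, while MolRec&rsquo;s 18 rules operate on real geometry (lines, circles, characters).</p>

```python
def run_rewrite_engine(primitives, rules):
    """Apply rewrite rules until no rule changes the primitive set (a fixed
    point). Each rule returns a rewritten set, or None when it does not
    apply. Real engines need confluent rules for this loop to terminate."""
    state = frozenset(primitives)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            result = rule(state)
            if result is not None and frozenset(result) != state:
                state = frozenset(result)
                changed = True
    return state


def double_bond_rule(state):
    # Toy stand-in for a geometric rule: two parallel line primitives
    # rewrite to a single double-bond edge in the chemical graph.
    if {"line_a", "line_b"} <= state:
        return (state - {"line_a", "line_b"}) | {"double_bond"}
    return None
```
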
<h2 id="experimental-setup-on-the-clef-2012-corpus">Experimental Setup on the CLEF 2012 Corpus</h2>
<p>The system was evaluated on the <strong>CLEF 2012 corpus</strong> of 961 test images, split into two distinct sets to test different capabilities:</p>
<ul>
<li><strong>Automatic Set</strong>: 865 images evaluated automatically using OpenBabel to compare generated MOL files against ground truth.</li>
<li><strong>Manual Set</strong>: 95 &ldquo;challenging&rdquo; images containing elements beyond OpenBabel&rsquo;s scope (e.g., Markush structures), evaluated via manual visual inspection.</li>
</ul>
<p>The authors performed <strong>four runs</strong> with slightly different internal parameters to test system stability.</p>
<h2 id="performance-outcomes-and-failure-analysis">Performance Outcomes and Failure Analysis</h2>
<p><strong>Performance:</strong></p>
<ul>
<li><strong>Automatic Set</strong>: High performance, achieving accuracy between <strong>94.91% and 96.18%</strong>.</li>
<li><strong>Manual Set</strong>: Lower performance, with accuracy between <strong>46.32% and 58.95%</strong>, reflecting the difficulty of complex patent diagrams containing Markush structures and other elements beyond OpenBabel&rsquo;s scope.</li>
</ul>
<p><strong>Failure Analysis:</strong></p>
<p>The authors conducted a detailed error analysis on 52 distinct mis-recognized diagrams from the manual set and 46 from the automatic set. Key failure modes include:</p>
<ul>
<li><strong>Character Grouping</strong>: The largest error source in the manual set (26 images). A bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, and closely spaced atom groups were incorrectly merged.</li>
<li><strong>Touching Characters</strong>: 8 images in the manual set and 1 in the automatic set. The system lacks segmentation for characters that touch, causing OCR failure.</li>
<li><strong>Four-way Junctions</strong>: 6 manual and 7 automatic images. Vectorization failed to correctly identify junctions where four lines meet.</li>
<li><strong>Missed Wedge Bonds</strong>: 6 images each for missed solid wedge and dashed wedge bonds in the automatic set.</li>
<li><strong>OCR Errors</strong>: 5 manual and 11 automatic images, including misrecognition of &ldquo;G&rdquo; as &ldquo;O&rdquo; and &ldquo;I&rdquo; interpreted as a vertical single bond.</li>
<li><strong>Charge Signs</strong>: MolRec correctly recognized positive charge signs but missed three negative charge signs, including one placed at the top left of an atom name.</li>
<li><strong>Dataset Errors</strong>: The authors identified 11 images where the ground truth MOL files were incorrect, but MolRec&rsquo;s recognition was actually correct.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was provided by CLEF 2012 organizers and consists of images clipped from patent documents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation (Auto)</td>
          <td>CLEF 2012 Set 1</td>
          <td>865 images</td>
          <td>Evaluated via OpenBabel</td>
      </tr>
      <tr>
          <td>Evaluation (Manual)</td>
          <td>CLEF 2012 Set 2</td>
          <td>95 images</td>
          <td>Complex/Markush structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>MolRec</strong> pipeline consists of two primary modules:</p>
<p><strong>1. Vectorization Module</strong></p>
<ul>
<li><strong>Binarization</strong>: Uses <strong>Otsu&rsquo;s method</strong>.</li>
<li><strong>OCR</strong>: Extracts connected components and classifies them using <strong>nearest neighbor classification</strong> with a Euclidean metric. Detected characters are removed from the image.</li>
<li><strong>Bond Separation</strong>:
<ul>
<li>Thins remaining components to single-pixel width.</li>
<li>Builds polyline representations.</li>
<li>Splits polylines at junctions (3+ lines meeting).</li>
<li><strong>Simplification</strong>: Applies the <strong>Douglas-Peucker algorithm</strong> with a threshold of 1-2 average line widths to remove scanning artifacts while preserving corners. The threshold is based on measured average line width, allowing adaptation to different line styles.</li>
<li>Also detects circles, arrow heads, and solid triangles (annotated with direction).</li>
</ul>
</li>
</ul>
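<p>The Douglas-Peucker simplification above can be sketched as follows; this is the textbook algorithm, not MolRec&rsquo;s code, and the <code>epsilon</code> value stands in for the paper&rsquo;s 1&ndash;2&times; average-line-width threshold:</p>

```python
import math

def _perp_dist(pt, a, b):
    """Perpendicular distance from pt to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / norm

def douglas_peucker(points, epsilon):
    """Drop points closer than epsilon to the chord, keeping corners."""
    if len(points) < 3:
        return list(points)
    # Find the point furthest from the chord between the endpoints.
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > epsilon:  # A real corner: recurse on both halves.
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]  # Noise only: keep the chord.

# Small wiggles (scanning artifacts) vanish; the corner at (3, 0) survives.
polyline = [(0, 0), (1, 0.05), (2, -0.05), (3, 0), (3.05, 1), (3, 2)]
print(douglas_peucker(polyline, epsilon=1.0))  # → [(0, 0), (3, 0), (3, 2)]
```

<p>In MolRec the threshold is tied to the measured average line width, so the same rule adapts across drawing styles.</p>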
<p><strong>2. Rule Engine</strong></p>
<ul>
<li><strong>Input</strong>: Geometric primitives (segments, circles, triangles, arrows, character groups).</li>
<li><strong>Structure</strong>: 18 rewrite rules.</li>
<li><strong>Priority</strong>: Two rules for <strong>Bridge Bonds</strong> (Open/Closed) are applied <em>first</em>.</li>
<li><strong>Standard Rules</strong>: 16 rules applied in arbitrary order for standard bonds (Single, Double, Triple, Wedge, Dative, etc.).</li>
<li><strong>Implicit Nodes</strong>: Some rules handle cases where carbon atoms are implicit at bond junctions. These rules detect double or triple bonds and, by splitting bonds at the implicit nodes, produce new geometric objects for further processing.</li>
<li><strong>Example Rule (Wavy Bond)</strong>:
<ul>
<li><em>Condition 1</em>: A set of line segments $L$ with $|L| \ge 3$.</li>
<li><em>Condition 2</em>: Segment lengths match &ldquo;dash length&rdquo; parameter.</li>
<li><em>Condition 3</em>: All elements are connected.</li>
<li><em>Condition 4</em>: Center points are approximately collinear.</li>
<li><em>Condition 5</em>: Endpoints form a single sequence (end elements have 1 neighbor, internal have 2).</li>
<li><em>Condition 6</em>: Two unconnected endpoints must be the pair of endpoints that are furthest apart.</li>
<li><em>Consequence</em>: Replace $L$ with a Wavy Bond between the furthest two endpoints. The bond has unknown direction.</li>
</ul>
</li>
</ul>
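<p>The geometric side of the wavy-bond rule can be sketched as below. Conditions 3, 5, and 6 need the connectivity structure and are elided; the segment representation and tolerance are assumptions, not MolRec&rsquo;s actual data structures:</p>

```python
import math

def is_wavy_bond(segments, dash_len, tol=0.3):
    """Check conditions 1, 2, and 4 of the wavy-bond rule on segments
    given as ((x1, y1), (x2, y2)) endpoint pairs (sketch only)."""
    # Condition 1: at least three segments.
    if len(segments) < 3:
        return False
    # Condition 2: each length matches the dash-length parameter.
    for a, b in segments:
        if abs(math.dist(a, b) - dash_len) > tol * dash_len:
            return False
    # Condition 4: midpoints approximately collinear.
    mids = sorted(((a[0] + b[0]) / 2, (a[1] + b[1]) / 2) for a, b in segments)
    (x0, y0), (xn, yn) = mids[0], mids[-1]
    chord = math.dist(mids[0], mids[-1]) or 1.0
    for x, y in mids[1:-1]:
        # Perpendicular distance from the chord through the end midpoints.
        d = abs((yn - y0) * x - (xn - x0) * y + xn * y0 - yn * x0) / chord
        if d > tol * dash_len:
            return False
    return True

segs = [((0, 0), (1, 0.2)), ((1.2, 0), (2.2, -0.2)), ((2.4, 0.1), (3.4, -0.1))]
print(is_wavy_bond(segs, dash_len=1.0))  # → True
```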
<h3 id="models">Models</h3>
<p>MolRec is a <strong>rule-based system</strong> and does not use trained deep learning models or weights.</p>
<ul>
<li><strong>Superatoms</strong>: Uses a dictionary look-up to resolve character groups representing superatoms into subgraphs.</li>
<li><strong>Disambiguation</strong>: Context-based logic is applied <em>after</em> graph construction to resolve ambiguities (e.g., distinguishing vertical bond <code>|</code> from letter <code>I</code> or digit <code>1</code>).</li>
</ul>
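<p>The nearest-neighbour character classifier is simple enough to sketch. The templates below are hypothetical 3&times;3 glyphs; MolRec uses its own (unreleased) character training set:</p>

```python
import math

# Hypothetical templates: each character maps to a flattened binary glyph.
TEMPLATES = {
    "O": [1, 1, 1, 1, 0, 1, 1, 1, 1],
    "I": [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "N": [1, 0, 1, 1, 1, 1, 1, 0, 1],
}

def classify(glyph):
    """1-nearest-neighbour classification under the Euclidean metric."""
    return min(TEMPLATES, key=lambda c: math.dist(TEMPLATES[c], glyph))

# A noisy 'O' (one pixel flipped) still lands on the closest template.
print(classify([1, 1, 1, 1, 0, 1, 1, 1, 0]))  # → O
```

<p>Ambiguous glyphs such as <code>|</code> vs. <code>I</code> vs. <code>1</code> cannot be settled by this metric alone, which is why MolRec defers them to context-based disambiguation after graph construction.</p>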
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Run 3</th>
          <th>Run 4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Auto (865 images)</td>
          <td>96.18% (832/865)</td>
          <td>94.91% (821/865)</td>
          <td>94.91% (821/865)</td>
          <td>96.18% (832/865)</td>
      </tr>
      <tr>
          <td>Manual (95 images)</td>
          <td>46.32% (44/95)</td>
          <td>58.95% (56/95)</td>
          <td>46.32% (44/95)</td>
          <td>56.84% (54/95)</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Parameters</strong>:</p>
<ul>
<li><strong>Dash Length</strong>: Range of acceptable values for dashed lines.</li>
<li><strong>Simplification Threshold</strong>: 1-2x average line width for Douglas-Peucker.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">CLEF 2012 Workshop Paper</a></td>
          <td>Other</td>
          <td>Open Access</td>
          <td>CEUR Workshop Proceedings</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-classification-closed">Reproducibility Classification: Closed</h3>
<p>No source code for the MolRec system has been publicly released. The CLEF 2012 evaluation dataset was distributed to task participants and is not openly available. The rule-based algorithm is described in sufficient detail to re-implement, but exact parameter values and the character classification training set are not fully specified. No hardware or compute requirements are reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012 &ndash; Overview and Analysis of Results. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawi2012molrec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolRec at CLEF 2012--Overview and Analysis of Results}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sadawi, Noureddin M and Sexton, Alan P and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CLEF-IP 2012: Patent and Chemical Structure Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/clef-ip-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/clef-ip-2012/</guid><description>Overview of the CLEF-IP 2012 benchmarking lab focusing on patent passage retrieval, flowchart recognition, and chemical structure extraction.</description><content:encoded><![CDATA[<h2 id="patent-retrieval-and-the-clef-ip-2012-benchmark">Patent Retrieval and the CLEF-IP 2012 Benchmark</h2>
<p>This is a <strong>Resource</strong> paper (benchmark infrastructure). It establishes a standardized test bed for the Intellectual Property (IP) Information Retrieval community by defining tasks, curating datasets (topics and relevance judgments), and establishing evaluation protocols. The paper does not propose a new method itself but aggregates and analyzes the performance of participant systems on these shared tasks.</p>
<h2 id="motivation-for-standardized-ip-information-retrieval">Motivation for Standardized IP Information Retrieval</h2>
<p>The volume of patent applications is increasing rapidly, necessitating automated methods to help patent experts find prior art and classify documents.</p>
<ul>
<li><strong>Economic Impact:</strong> Thorough searches are critical due to the high economic value of granted patents.</li>
<li><strong>Complexity:</strong> Patent workflows are highly specialized; examiners need to find prior art for specific <em>claims</em> alongside whole documents, and often rely on non-textual data like flowcharts and chemical diagrams.</li>
<li><strong>Gap:</strong> Existing general IR tools are insufficient for the specific granularity (passages, images, structures) required in the IP domain.</li>
</ul>
<h2 id="novel-multi-modal-tasks-claims-flowcharts-and-chemicals">Novel Multi-modal Tasks: Claims, Flowcharts, and Chemicals</h2>
<p>The 2012 edition of the lab introduced three specific tasks targeting different modalities of patent data:</p>
<ol>
<li><strong>Passage Retrieval starting from Claims:</strong> Moving beyond document-level retrieval to identifying specific relevant passages based on claim text.</li>
<li><strong>Flowchart Recognition:</strong> A new image analysis task requiring the extraction of structural information (nodes, edges, text) from patent images.</li>
<li><strong>Chemical Structure Recognition:</strong> A dual task of segmenting molecular diagrams from full pages and recognizing them into structural files (MOL), specifically addressing the challenge of Markush structures in patents.</li>
</ol>
<h2 id="benchmarking-setup-and-evaluation">Benchmarking Setup and Evaluation</h2>
<p>The &ldquo;experiments&rdquo; were the benchmarking tasks themselves, performed by participants (e.g., University of Birmingham, SAIC, TU Vienna).</p>
<ul>
<li><strong>Passage Retrieval:</strong> Participants retrieved documents and passages for 105 test topics (sets of claims) from a corpus of 1.5 million patents. Performance was measured using PRES, Recall, and MAP at the document level, and AP/Precision at the passage level.</li>
<li><strong>Flowchart Recognition:</strong> Participants extracted graph structures from 100 test images. Evaluation compared the submitted graphs to ground truth using a distance metric based on the Maximum Common Subgraph (MCS).</li>
<li><strong>Chemical Structure:</strong>
<ul>
<li><em>Segmentation:</em> Identifying bounding boxes of chemical structures in 30 multipage TIFF patents.</li>
<li><em>Recognition:</em> Converting 865 &ldquo;automatic&rdquo; (standard MOL) and 95 &ldquo;manual&rdquo; (Markush/complex) diagrams into structure files.</li>
</ul>
</li>
</ul>
<h2 id="key-findings-and-baseline-results">Key Findings and Baseline Results</h2>
<ul>
<li><strong>Passage Retrieval:</strong> Approaches varied from two-step retrieval (document then passage) to full NLP techniques. Translation tools were universally used due to the multilingual corpus (English, German, French).</li>
<li><strong>Chemical Recognition:</strong> The best performing system (UoB, run uob-4) achieved 92% recall on total structures (886/960), with 96% on the automatic set and 57% on the manual set. SAIC achieved 83% total recall. The manual evaluation highlighted a critical need for standards extending MOL files to support Markush structures, which are common in patents but poorly supported by current tools.</li>
<li><strong>Flowchart Recognition:</strong> The evaluation was not completed at the time of writing the workshop notes. The evaluation required a combination of structural matching and edit-distance for text labels because OCR outputs rarely &ldquo;hard-matched&rdquo; the gold standard.</li>
</ul>
<h3 id="chemical-structure-recognition-results">Chemical Structure Recognition Results</h3>
<p><strong>Segmentation</strong> (SAIC, best run using OSRA native rendering):</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>$F_1$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.708</td>
          <td>0.686</td>
          <td>0.697</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.793</td>
          <td>0.769</td>
          <td>0.781</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.821</td>
          <td>0.795</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.867</td>
          <td>0.840</td>
          <td>0.853</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.887</td>
          <td>0.860</td>
          <td>0.873</td>
      </tr>
  </tbody>
</table>
<p><strong>Recognition</strong> (automatic and manual sets):</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Auto (#/865)</th>
          <th>Auto %</th>
          <th>Manual (#/95)</th>
          <th>Manual %</th>
          <th>Total (#/960)</th>
          <th>Total %</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SAIC</td>
          <td>761</td>
          <td>88%</td>
          <td>38</td>
          <td>40%</td>
          <td>799</td>
          <td>83%</td>
      </tr>
      <tr>
          <td>UoB-1</td>
          <td>832</td>
          <td>96%</td>
          <td>44</td>
          <td>46%</td>
          <td>876</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>UoB-2</td>
          <td>821</td>
          <td>95%</td>
          <td>56</td>
          <td>59%</td>
          <td>877</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>UoB-3</td>
          <td>821</td>
          <td>95%</td>
          <td>44</td>
          <td>46%</td>
          <td>865</td>
          <td>90%</td>
      </tr>
      <tr>
          <td>UoB-4</td>
          <td>832</td>
          <td>96%</td>
          <td>54</td>
          <td>57%</td>
          <td>886</td>
          <td>92%</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The collection focuses on European Patent Office (EPO) and WIPO documents published up to 2002.</p>
<p><strong>1. Passage Retrieval Data</strong></p>
<ul>
<li><strong>Corpus:</strong> &gt;1.5 million XML patent documents (EP and WO sources).</li>
<li><strong>Training Set:</strong> 51 topics (sets of claims) with relevance judgments (18 DE, 21 EN, 12 FR).</li>
<li><strong>Test Set:</strong> 105 topics (35 per language).</li>
<li><strong>Topic Source:</strong> Extracted manually from search reports listing &ldquo;X&rdquo; or &ldquo;Y&rdquo; citations (highly relevant prior art).</li>
</ul>
<p><strong>2. Flowchart Data</strong></p>
<ul>
<li><strong>Format:</strong> Black and white TIFF images.</li>
<li><strong>Training Set:</strong> 50 images with textual graph representations.</li>
<li><strong>Test Set:</strong> 100 images.</li>
<li><strong>Ground Truth:</strong> A defined textual format describing nodes (<code>NO</code>), directed edges (<code>DE</code>), undirected edges (<code>UE</code>), and meta-data (<code>MT</code>).</li>
</ul>
<p><strong>3. Chemical Structure Data</strong></p>
<ul>
<li><strong>Segmentation:</strong> 30 patent files rendered as 300dpi monochrome multipage TIFFs.</li>
<li><strong>Recognition (Automatic Set):</strong> 865 diagram images fully representable in standard MOL format.</li>
<li><strong>Recognition (Manual Set):</strong> 95 diagram images containing Markush structures or variability not supported by standard MOL.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Ground Truth Generation:</strong></p>
<ul>
<li><strong>Qrels Generator:</strong> An in-house tool was used to manually map search report citations to specific XML passages (XPaths) for the passage retrieval task.</li>
<li><strong>McGregor Algorithm:</strong> Used for the flowchart evaluation to compute the Maximum Common Subgraph (MCS) between participant submissions and ground truth.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Passage Retrieval Metrics:</strong></p>
<ul>
<li><strong>Document Level:</strong> PRES (Patent Retrieval Evaluation Score), Recall, MAP. Cut-off at 100 documents.</li>
<li><strong>Passage Level:</strong> $AP(D)$ (Average Precision at document level) and $Precision(D)$ (Precision at document level), averaged across all relevant documents for a topic.</li>
</ul>
<p><strong>Flowchart Recognition Metric:</strong></p>
<ul>
<li><strong>Graph Distance ($d$):</strong> Defined quantitatively based on the Maximum Common Subgraph (MCS) between a target flowchart ($F_t$) and a submitted flowchart ($F_s$):
$$
\begin{aligned}
d(F_t, F_s) &amp;= 1 - \frac{|mcs(F_t, F_s)|}{|F_t| + |F_s| - |mcs(F_t, F_s)|}
\end{aligned}
$$
where $|F|$ represents the size of the graph (nodes + edges).</li>
<li><strong>Levels:</strong> Evaluated at three levels: Basic (structure only), Intermediate (structure + node types), and Complete (structure + types + text labels).</li>
</ul>
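<p>Once the MCS size is known (computing it, e.g. with the McGregor algorithm, is the expensive part), the distance itself reduces to one line:</p>

```python
def graph_distance(size_t, size_s, size_mcs):
    """CLEF-IP 2012 flowchart distance; sizes count nodes plus edges."""
    return 1 - size_mcs / (size_t + size_s - size_mcs)

print(graph_distance(10, 10, 10))  # identical graphs → 0.0
print(graph_distance(10, 8, 0))    # disjoint graphs  → 1.0
print(graph_distance(10, 8, 6))    # partial overlap  → 0.5
```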
<p><strong>Chemical Structure Metrics:</strong></p>
<ul>
<li><strong>Segmentation:</strong> Precision, Recall, and $F_1$ based on bounding box matches. A match is valid if borders are within a tolerance (0 to 55 pixels).</li>
<li><strong>Recognition:</strong>
<ul>
<li><em>Automatic:</em> Comparison of InChI strings generated by Open Babel.</li>
<li><em>Manual:</em> Visual comparison of images rendered by MarvinView.</li>
</ul>
</li>
</ul>
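<p>One plausible reading of the segmentation protocol is sketched below; the organizers&rsquo; comparator is in-house and unreleased, so the greedy one-to-one matching here is an assumption:</p>

```python
def box_match(gt, pred, tol):
    """Boxes as (left, top, right, bottom); a match requires every
    border to lie within tol pixels of its counterpart."""
    return all(abs(g - p) <= tol for g, p in zip(gt, pred))

def segmentation_prf1(gt_boxes, pred_boxes, tol):
    """Greedy one-to-one matching, then precision/recall/F1."""
    matched, tp = set(), 0
    for p in pred_boxes:
        for i, g in enumerate(gt_boxes):
            if i not in matched and box_match(g, p, tol):
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

gt = [(0, 0, 100, 100), (200, 0, 300, 80)]
pred = [(5, -3, 104, 98)]  # off by a few pixels on every border
print(segmentation_prf1(gt, pred, tol=10))  # matched at this tolerance
print(segmentation_prf1(gt, pred, tol=0))   # → (0.0, 0.0, 0.0)
```

<p>Loosening <code>tol</code> monotonically raises all three scores, which matches the 0&ndash;55 px progression in the segmentation table above.</p>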
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p>The CLEF-IP 2012 benchmark data was distributed to registered participants through the CLEF evaluation framework. The patent corpus is derived from the MAREC dataset (EPO and WIPO documents published until 2002). Evaluation tools for segmentation (bounding box comparison) and recognition (InChI comparison via Open Babel) were developed in-house by the organizers. The McGregor algorithm implementation for flowchart evaluation was also custom-built.</p>
<p>No public code repositories or pre-trained models are associated with this paper, as it is a benchmarking infrastructure paper. The evaluation protocols and data formats are fully described in the paper.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://www.ifs.tuwien.ac.at/~clef-ip">CLEF-IP 2012 data</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Distributed to registered CLEF participants; no persistent public archive</td>
      </tr>
      <tr>
          <td><a href="https://www.ir-facility.org/prototypes/marec">MAREC corpus</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Source patent corpus (EPO/WIPO documents up to 2002)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Status</strong>: Partially Reproducible</li>
<li><strong>Missing components</strong>: The benchmark datasets were distributed to participants and are not hosted on a persistent public repository. The in-house evaluation tools (qrels generator, segmentation comparator, flowchart distance calculator) are not publicly released.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Piroi, F., Lupu, M., Hanbury, A., Magdy, W., Sexton, A. P., &amp; Filippov, I. (2012). CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain. <em>CLEF 2012 Working Notes</em>, CEUR Workshop Proceedings, Vol. 1178.</p>
<p><strong>Publication</strong>: CLEF 2012 Working Notes (CEUR-WS Vol. 1178)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{piroi2012clefip,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Piroi, Florina and Lupu, Mihai and Hanbury, Allan and Magdy, Walid and Sexton, Alan P. and Filippov, Igor}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{CLEF 2012 Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{CEUR Workshop Proceedings}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1178}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{CEUR-WS.org}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-PiroiEt2012.pdf}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader Image-to-Structure OCR at TREC 2011 Chemical IR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</guid><description>ChemReader OCR software evaluation on TREC 2011 Chemical IR campaign achieving 93% accuracy on image-to-structure task.</description><content:encoded><![CDATA[<h2 id="methodological-application-applying-chemreader-to-chemical-ocr">Methodological Application: Applying ChemReader to Chemical OCR</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$).</p>
<p>The dominant vector is $\Psi_{\text{Method}}$ because the paper&rsquo;s core contribution is the empirical evaluation and refinement of ChemReader on the Image-to-Structure (I2S) task. The rhetorical indicators support this classification: quantitative performance metrics, a detailed <strong>error analysis</strong>, and a focus on <strong>how well the system works</strong> and how its underlying algorithms need refinement.</p>
<h2 id="motivation-bridging-the-gap-in-image-to-structure-tasks">Motivation: Bridging the Gap in Image-to-Structure Tasks</h2>
<p>The motivation is two-fold:</p>
<ol>
<li>
<p><strong>Scientific Need</strong>: Traditional text-based chemical mining methods cannot utilize image data in scientific literature. Chemical OCR software is required to extract 2D chemical structure diagrams from raster images and convert them into a machine-readable chemical file format, paving the way for advanced chemical literature mining.</p>
</li>
<li>
<p><strong>Benchmark Participation</strong>: The immediate motivation was participation in the <strong>TREC Chemical IR campaign&rsquo;s Image-to-Structure (I2S) task</strong>, which was designed to evaluate existing chemical OCR software and establish a platform for developing chemical information retrieval techniques utilizing image data.</p>
</li>
</ol>
<h2 id="novelty-benchmark-evaluation-and-error-analysis-of-chemreader">Novelty: Benchmark Evaluation and Error Analysis of ChemReader</h2>
<p>ChemReader was previously introduced in earlier publications and is a chemical OCR system tailored to a chemical database annotation scheme. The novelty of this paper lies in <strong>evaluating ChemReader within the formal I2S benchmark setting</strong> and conducting a detailed <strong>error analysis</strong> of its performance. After fixing a stereo bond omission and a corner detection bug discovered during the evaluation, ChemReader achieved 93% accuracy (930/1000) on the benchmark test set.</p>
<h2 id="experimental-setup-the-trec-2011-i2s-challenge">Experimental Setup: The TREC 2011 I2S Challenge</h2>
<p>The experiment was the application of the ChemReader software to the <strong>Image-to-Structure (I2S) task</strong> of the TREC Chemical IR campaign.</p>
<ul>
<li><strong>Setup</strong>: The software was used to process image data provided for the I2S task.</li>
<li><strong>Evaluation</strong>: The system was initially evaluated, revealing two issues: the omission of <strong>bond stereo types</strong> in the output structures and a bug in the <strong>corner detection</strong> code that failed on lines touching the image boundary. Each issue lowered accuracy by approximately 10%.</li>
<li><strong>Analysis</strong>: After fixing these issues, ChemReader was re-evaluated on the full 1000-image test set (<strong>Test III</strong>). A detailed error analysis was then conducted on 20 randomly selected samples from Test III results.</li>
</ul>
<h2 id="training-progress">Training Progress</h2>
<p>The paper reports three rounds of major training, each improving accuracy by roughly 15% (Figure 1 in the paper shows the progression):</p>
<ul>
<li><strong>Initial (untrained)</strong>: 57% accuracy on 100 selected training images.</li>
<li><strong>Key changes</strong>: deactivating unnecessary heuristic algorithms (resizing, de-noising, line merging), limiting the character set, switching to a lightweight chemical dictionary, and fixing precision loss from type conversions.</li>
</ul>
<h2 id="outcomes-high-accuracy-hindered-by-complex-connectivity-rules">Outcomes: High Accuracy Hindered by Complex Connectivity Rules</h2>
<ul>
<li>
<p><strong>Submitted Results</strong>: Test I achieved 691/1000 correct outputs (avg. Tanimoto similarity 0.9769), and Test II achieved 689/1000 (avg. Tanimoto similarity 0.9823). Both scored lower than training accuracy due to the stereo bond omission and corner detection bug.</p>
</li>
<li>
<p><strong>Key Finding</strong>: After fixing these two issues, ChemReader achieved <strong>93% accuracy</strong> (930/1000) on the I2S task (Test III), comparable to the highest accuracy among participants.</p>
</li>
<li>
<p><strong>Limitation/Future Direction</strong>: A detailed <strong>error analysis</strong> on 20 randomly selected samples from Test III (Table 2) showed that the software requires the incorporation of <strong>more chemical intelligence in its algorithms</strong> to address remaining systematic errors. The most frequent errors were:</p>
<ul>
<li>Wrongly merged nodes: 6 samples (30%), caused by nodes too close to be distinguished by a distance threshold</li>
<li>Missed bonds: 4 samples (20%), caused by filtering out short line segments</li>
<li>Nonstandard representations: noise symbols confusing the system, nonstandard wedge/hatched bond styles, and 3D crossing bonds that ChemReader cannot interpret</li>
</ul>
</li>
</ul>
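<p>The Tanimoto similarity reported above is the intersection-over-union of structural fingerprints. A sketch on fingerprints represented as sets of on-bit indices (the campaign&rsquo;s concrete fingerprint type is not detailed in the notes):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| on bit-index sets."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if a | b else 1.0

print(tanimoto({1, 2, 3, 4}, {2, 3, 4, 5}))  # 3 shared of 5 total → 0.6
print(tanimoto({1, 2}, {1, 2}))              # identical → 1.0
```

<p>Averages near 0.98, as in Tests I and II, therefore indicate that even the incorrect outputs were structurally close to the ground truth.</p>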
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Training</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Training Set</td>
          <td style="text-align: left">1000 images (100 used for quick eval)</td>
          <td style="text-align: left">TIF format, one chemical structure per image</td>
      </tr>
      <tr>
          <td style="text-align: left">Evaluation</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Test Set</td>
          <td style="text-align: left">1000 images (20 sampled for error analysis)</td>
          <td style="text-align: left">Same format constraints; 930/1000 correct in Test III</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>ChemReader is a <strong>chemical Optical Character Recognition (OCR) system</strong> with a 17-step pipeline:</p>
<ol>
<li><strong>Pixel clustering</strong>: Region-growing to identify the chemical structure region</li>
<li><strong>Preprocessing</strong>: Resizing, de-noising, and bond length estimation (deactivated for I2S task)</li>
<li><strong>Text identification</strong>: Connected components with similar heights/areas labeled as characters</li>
<li><strong>Benzene ring detection</strong>: Identifying circles representing aromatic bonds</li>
<li><strong>Hatched bond detection</strong>: Finding short collinear line segments of uniform length</li>
<li><strong>Skeletonization</strong>: Thinning bond pixels for downstream processing</li>
<li><strong>Ring structure detection</strong>: Pentagonal/hexagonal rings via Generalized Hough Transformation (GHT)</li>
<li><strong>Line detection</strong>: Modified Hough Transformation with corner detection for bond extraction</li>
<li><strong>Line filtering</strong>: Removing spurious short segments</li>
<li><strong>Secondary text identification</strong>: Re-examining unidentified fragments for text</li>
<li><strong>Character recognition</strong>: Dual-engine approach (GOCR template matching + Euclidean distance-based engine)</li>
<li><strong>Chemical spell checker</strong>: Matching against a dictionary of 770 chemical abbreviations</li>
<li><strong>Secondary line detection</strong>: Re-running line detection on remaining pixels</li>
<li><strong>Line merging/breaking</strong>: Combining fragmented bonds or splitting at junction nodes</li>
<li><strong>Graph construction</strong>: Creating nodes from bond endpoints and chemical symbol centers, merging nearby nodes</li>
<li><strong>Connected component selection</strong>: Selecting the largest graph component</li>
<li><strong>Output</strong>: Connection table in machine-readable format</li>
</ol>
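<p>Steps 15&ndash;16 above can be sketched as follows. This is a minimal illustration under assumed data structures (bonds as endpoint pairs, a fixed merge tolerance), not ChemReader&rsquo;s actual implementation:</p>

```python
from collections import defaultdict
from itertools import combinations
from math import hypot

def build_graph(segments, tol=5.0):
    """Merge nearby segment endpoints into nodes, then keep the largest
    connected component (sketch of ChemReader steps 15-16)."""
    points = [p for seg in segments for p in seg]
    # Union-find over endpoints closer than `tol`
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(points)), 2):
        if hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]) <= tol:
            parent[find(i)] = find(j)
    # One edge per input segment, between the merged endpoint nodes
    edges = {(find(2 * k), find(2 * k + 1)) for k in range(len(segments))}
    # Select the largest connected component of the merged graph
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            seen.add(n)
            stack.extend(adj[n] - comp)
        if len(comp) > len(best):
            best = comp
    return best, edges

# Two bonds sharing a (nearly coincident) endpoint, plus one stray segment
segs = [((0, 0), (10, 0)), ((10, 1), (20, 0)), ((100, 100), (110, 100))]
comp, edges = build_graph(segs)  # comp has 3 nodes: the stray segment is dropped
```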
<h3 id="models">Models</h3>
<p>ChemReader is a rule-based system relying on traditional computer vision (Hough Transformation, region growing, skeletonization) and template-based character recognition. It does not use learned model architectures such as CNNs or other neural networks.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Test</th>
          <th style="text-align: left">Correct Outputs</th>
          <th style="text-align: left">Avg. Tanimoto Similarity</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Test I (submitted)</td>
          <td style="text-align: left">691/1000</td>
          <td style="text-align: left">0.9769</td>
          <td style="text-align: left">Original submission</td>
      </tr>
      <tr>
          <td style="text-align: left">Test II (submitted)</td>
          <td style="text-align: left">689/1000</td>
          <td style="text-align: left">0.9823</td>
          <td style="text-align: left">Alternative parameter setting</td>
      </tr>
      <tr>
          <td style="text-align: left">Test III (post-fix)</td>
          <td style="text-align: left">930/1000 (93%)</td>
          <td style="text-align: left">0.9913</td>
          <td style="text-align: left">After fixing stereo bond omission and corner detection bug</td>
      </tr>
  </tbody>
</table>
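<p>The average Tanimoto similarity reported above measures fingerprint overlap between predicted and reference structures. A minimal sketch, with fingerprints represented as Python sets of hypothetical on-bits (the track&rsquo;s actual fingerprint type is not detailed here):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A & B| / |A | B|."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Hypothetical on-bits of a predicted vs. reference molecular fingerprint
predicted = {3, 17, 42, 101, 256}
reference = {3, 17, 42, 101, 300}
sim = tanimoto(predicted, reference)  # 4 shared bits / 6 total = 0.666...
```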
<p><strong>Error Breakdown</strong> (from 20-sample analysis of Test III):</p>
<ul>
<li>Wrongly merged nodes: 6 (30%)</li>
<li>Missed bonds: 4 (20%)</li>
<li>Nonstandard representations (noise symbols, nonstandard wedge/hatched bonds, 3D crossing bonds): remaining 10 errors (50%)</li>
</ul>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>ChemReader&rsquo;s source code is not publicly available. The TREC 2011 Chemical IR I2S image sets were distributed to task participants but are not openly hosted. No pre-trained models apply (rule-based system). The paper provides a detailed algorithmic description (17-step pipeline) and parameter values, but full reproduction requires access to both the ChemReader codebase and the TREC image sets.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Li, Y., Rosania, G. R., &amp; Saitou, K. (2011). Image-to-Structure Task by ChemReader. <em>TREC 2011 Chemical IR Track Report</em>.</p>
<p><strong>Publication</strong>: TREC 2011 Chemical IR Track</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/papers/CHEM.OVERVIEW.pdf">TREC 2011 Chemical IR Track Overview</a></li>
<li><a href="/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/">ChemReader 2009 original paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{parkImagetoStructureTaskChemReader2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image-to-Structure Task by {ChemReader}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Park, Jungkap and Li, Ye and Rosania, Gus R. and Saitou, Kazuhiro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{University of Michigan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">type</span> = <span style="color:#e6db74">{TREC 2011 Chemical IR Track Report}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Reconstruction with chemoCR (2011)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</guid><description>A hybrid system combining pattern recognition and rule-based expert systems to reconstruct chemical structures from bitmap images.</description><content:encoded><![CDATA[<h2 id="contribution-the-chemocr-architecture">Contribution: The chemoCR Architecture</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper focuses entirely on the architecture and workflow of the <strong>chemoCR</strong> system. It proposes specific algorithmic innovations (texture-based vectorization, graph constraint exploration) and defines a comprehensive pipeline for converting raster images into semantic chemical graphs. The primary contribution is the system design and its operational efficacy.</p>
<h2 id="motivation-digitizing-image-locked-chemical-structures">Motivation: Digitizing Image-Locked Chemical Structures</h2>
<p>Chemical structures are the preferred language of chemistry, yet they are frequently locked in non-machine-readable formats (bitmap images like GIF, BMP) within patents and journals.</p>
<ul>
<li><strong>The Problem:</strong> Once published as images, chemical structure information is &ldquo;dead&rdquo; to analysis software.</li>
<li><strong>The Gap:</strong> Manual reconstruction is slow and error-prone. Existing tools struggled with the diversity of drawing styles (e.g., varying line thickness, font types, and non-standard bond representations).</li>
<li><strong>The Goal:</strong> To automate the conversion of these depictions into connection tables (SDF/MOL files) to make the data accessible for computational chemistry applications.</li>
</ul>
<h2 id="core-innovation-rule-based-semantic-object-identification">Core Innovation: Rule-Based Semantic Object Identification</h2>
<p>The system is based on a &ldquo;Semantic Entity&rdquo; approach that identifies chemically significant objects (chiral bonds, superatoms, reaction arrows) from structural formula depictions. Key technical innovations include:</p>
<ol>
<li><strong>Texture-based Vectorization:</strong> A new algorithm that computes local directions to vectorize lines, robust against varying drawing styles.</li>
<li><strong>Expert System Integration:</strong> A graph constraint exploration algorithm that applies an XML-based rule set to classify geometric primitives into chemical classes such as <code>BOND</code>, <code>DOUBLEBOND</code>, <code>TRIPLEBOND</code>, <code>BONDSET</code>, <code>DOTTED CHIRAL</code>, <code>STRINGASSOCIATION</code>, <code>DOT</code>, <code>RADICAL</code>, <code>REACTION</code>, <code>REACTION ARROW</code>, <code>REACTION PLUS</code>, <code>CHARGE</code>, and <code>UNKNOWN</code>.</li>
<li><strong>Validation Scoring:</strong> A built-in validation module that tests valences, bond lengths and angles, typical atom types, and fragments to assign a confidence score (0 to 1) to the reconstruction.</li>
</ol>
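<p>The rule-based classification in point 2 can be sketched as follows. The schema of <code>chemoCRSettings.xml</code> is not published, so the rule attributes and the single <code>parallel_lines</code> feature below are invented for illustration; only the priority-ordered matching idea comes from the paper:</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical rule format -- chemoCR's actual chemoCRSettings.xml schema is
# not published; these attributes are invented for illustration only.
RULES_XML = """
<rules>
  <rule class="DOUBLEBOND" priority="20" min_parallel="2" max_parallel="2"/>
  <rule class="TRIPLEBOND" priority="30" min_parallel="3" max_parallel="3"/>
  <rule class="BOND"       priority="10" min_parallel="1" max_parallel="1"/>
</rules>
"""

def classify(component, rules_xml=RULES_XML):
    """Return the chemical class of the highest-priority matching rule,
    or UNKNOWN when no rule's constraints are satisfied."""
    rules = ET.fromstring(rules_xml)
    best = ("UNKNOWN", -1)
    for rule in rules:
        lo = int(rule.get("min_parallel"))
        hi = int(rule.get("max_parallel"))
        prio = int(rule.get("priority"))
        if lo <= component["parallel_lines"] <= hi and prio > best[1]:
            best = (rule.get("class"), prio)
    return best[0]

label = classify({"parallel_lines": 2})  # -> "DOUBLEBOND"
```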
<h2 id="experiments-the-trec-2011-image-to-structure-task">Experiments: The TREC 2011 Image-to-Structure Task</h2>
<p>The system was evaluated as part of the <strong>TREC 2011 Image-to-Structure (I2S) Task</strong>.</p>
<ul>
<li><strong>Dataset:</strong> 1,000 unique chemical structure images provided by USPTO.</li>
<li><strong>Configuration:</strong> The authors used chemoCR v0.93 in batch mode with a single pre-configured parameter set (&ldquo;Houben-Weyl&rdquo;), originally developed for the Houben-Weyl book series of organic chemistry reactions published by Thieme.</li>
<li><strong>Process:</strong> The workflow included image binarization, connected component analysis, OCR for atom labels, and final molecule assembly.</li>
<li><strong>Metric:</strong> Perfect match recall against ground-truth MOL files.</li>
</ul>
<h2 id="results-and-conclusions-expert-systems-vs-dirty-data">Results and Conclusions: Expert Systems vs. &ldquo;Dirty&rdquo; Data</h2>
<ul>
<li><strong>Performance:</strong> The system achieved a <strong>perfect match for 656 out of 1,000 structures (65.6%)</strong>.</li>
<li><strong>Error Analysis:</strong> Failures were primarily attributed to &ldquo;unclear semantics&rdquo; in drawing styles, such as:
<ul>
<li>Overlapping objects (e.g., atom labels clashing with bonds).</li>
<li>Ambiguous primitives (dots interpreted as both radicals and chiral centers).</li>
<li>Markush structures (variable groups), which were excluded from the I2S task definition. A prototype for Markush detection existed but was not used.</li>
</ul>
</li>
<li><strong>Limitations:</strong> The vectorizer cannot recognize curves and circles, only straight lines. Aromatic ring detection (via a heuristic that looks for a large &ldquo;O&rdquo; character in the center of a ring system) was switched off for the I2S task. The system maintained 12 different parameter sets for various drawing styles, and selecting the correct set was critical.</li>
<li><strong>Impact:</strong> Demonstrated that rule-based expert systems combined with standard pattern recognition could handle high-quality datasets effectively, though non-standard drawing styles remain a challenge.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper relies on the TREC 2011 I2S dataset, comprising images extracted from USPTO patents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 I2S</td>
          <td>1,000 images</td>
          <td>Binarized bitmaps from USPTO patents.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Internal Training Set</td>
          <td>Unknown</td>
          <td>Used to optimize parameter sets (e.g., &ldquo;Houben-Weyl&rdquo; set).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper describes three main workflow phases (preprocessing, semantic entity recognition, and molecule reconstruction plus validation), organized into four pipeline sections:</p>
<ol>
<li>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><em>Vaporizer Unit:</em> Erases parts of the image that are presumably not structure diagrams (e.g., text or other human-readable information), isolating the chemical depictions.</li>
<li><em>Connected Components:</em> Groups all foreground pixels that are 8-connected into components.</li>
<li><em>Text Tagging and OCR:</em> Identifies components that map to text areas and converts bitmap letters into characters.</li>
</ul>
</li>
<li>
<p><strong>Vectorization:</strong></p>
<ul>
<li><em>Algorithm:</em> <strong>Compute Local Directions</strong>. It analyzes segment clusters to detect ascending, descending, horizontal, and vertical trends in pixel data, converting them into vectors.</li>
<li><em>Feature:</em> Explicitly handles &ldquo;thick chirals&rdquo; (wedges) by computing orientation.</li>
</ul>
</li>
<li>
<p><strong>Reconstruction (Expert System):</strong></p>
<ul>
<li><em>Core Logic:</em> <strong>Graph Constraint Exploration</strong>. It visits connected components and evaluates them against an XML Rule Set.</li>
<li><em>Classification:</em> Objects are tagged with chemical keywords (e.g., <code>BONDSET</code> for ring systems and chains, <code>STRINGASSOCIATION</code> for atom labels, <code>DOTTED CHIRAL</code> for chiral bonds).</li>
<li><em>Rules:</em> Configurable via <code>chemoCRSettings.xml</code>. The successful rule with the highest priority value defines the annotation for each component.</li>
</ul>
</li>
<li>
<p><strong>Assembly &amp; Validation:</strong></p>
<ul>
<li>Combines classified vectors and OCR text into a semantic graph.</li>
<li><em>Superatoms:</em> Matches text groups against a loaded superatom database (e.g., &ldquo;COOH&rdquo;, &ldquo;Boc&rdquo;).</li>
<li><em>Validation:</em> Calculates a score (0-1) based on chemical feasibility (valences, bond lengths and angles, typical atom types, and fragments).</li>
</ul>
</li>
</ol>
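<p>The connected-components step in phase 1 groups all foreground pixels that are 8-connected. A minimal flood-fill sketch on a binary grid (not chemoCR&rsquo;s code, which is proprietary):</p>

```python
def connected_components(grid):
    """Label 8-connected foreground (1) pixels in a binary grid.
    Returns a list of components, each a set of (row, col) pixels."""
    h, w = len(grid), len(grid[0])
    seen, comps = set(), []
    for r in range(h):
        for c in range(w):
            if grid[r][c] != 1 or (r, c) in seen:
                continue
            comp, stack = set(), [(r, c)]
            while stack:
                y, x = stack.pop()
                if (y, x) in comp:
                    continue
                comp.add((y, x))
                seen.add((y, x))
                for dy in (-1, 0, 1):          # scan the 8-neighborhood
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == 1:
                            stack.append((ny, nx))
            comps.append(comp)
    return comps

grid = [[1, 0, 0, 1],
        [0, 1, 0, 1],
        [0, 0, 0, 0]]
# (0,0) and (1,1) touch only diagonally, so 8-connectivity joins them;
# the right column forms a second component
parts = connected_components(grid)
```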
<h3 id="models">Models</h3>
<p>The system is primarily rule-based but utilizes machine learning components for specific sub-tasks:</p>
<ul>
<li><strong>OCR:</strong> A trainable OCR module using supervised machine learning to recognize atom labels ($H, C, N, O$). The specific classifier is not detailed in the paper.</li>
<li><strong>Rule Base:</strong> An XML file containing the expert system logic. This is the &ldquo;model&rdquo; for structural interpretation.</li>
</ul>
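<p>Since the paper does not identify the OCR classifier, the sketch below uses a nearest-centroid template matcher over pixel-density features purely as a plausible minimal baseline; it is an assumption, not chemoCR&rsquo;s actual OCR module:</p>

```python
from math import dist

def train_centroids(samples):
    """samples: {label: [feature vectors]} -> mean vector per label.
    (Nearest-centroid is an illustrative stand-in; the paper does not
    specify chemoCR's actual classifier.)"""
    centroids = {}
    for label, vecs in samples.items():
        n = len(vecs)
        centroids[label] = [sum(v[i] for v in vecs) / n
                            for i in range(len(vecs[0]))]
    return centroids

def predict(centroids, vec):
    """Assign the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda lab: dist(centroids[lab], vec))

# Toy 4-dimensional "pixel density" features for two atom labels
train = {"O": [[0.9, 0.1, 0.1, 0.9], [0.8, 0.2, 0.2, 0.8]],
         "N": [[0.2, 0.9, 0.9, 0.2], [0.3, 0.8, 0.8, 0.3]]}
cents = train_centroids(train)
label = predict(cents, [0.85, 0.15, 0.15, 0.85])  # -> "O"
```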
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed strictly within the context of the TREC competition.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall (Perfect Match)</td>
          <td>656 / 1000</td>
          <td>N/A</td>
          <td>Strict structural identity required.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software Stack:</strong> Platform-independent JAVA libraries.</li>
<li><strong>Compute:</strong> Batch mode processing supported; specific hardware requirements (CPU/RAM) were not disclosed.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>chemoCR (Fraunhofer SCAI)</td>
          <td>Software</td>
          <td>Unknown</td>
          <td>Project page defunct; tool was proprietary</td>
      </tr>
      <tr>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemoCR.chem.update.pdf">TREC 2011 Proceedings Paper</a></td>
          <td>Paper</td>
          <td>Public</td>
          <td>Official NIST proceedings</td>
      </tr>
  </tbody>
</table>
<p>No source code was publicly released. The chemoCR system was a proprietary tool from Fraunhofer SCAI. The TREC 2011 I2S dataset was distributed to competition participants and is not independently hosted.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zimmermann, M. (2011). Chemical Structure Reconstruction with chemoCR. <em>TREC 2011 Proceedings</em>.</p>
<p><strong>Publication</strong>: Text REtrieval Conference (TREC) 2011</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zimmermannChemicalStructureReconstruction2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Structure Reconstruction with {{chemoCR}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Text {{REtrieval Conference}} ({{TREC}}) 2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zimmermann, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at TREC-CHEM 2011: Optical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</guid><description>A methodological overview of OSRA, an open-source pipeline for converting chemical structure images into machine-readable formats.</description><content:encoded><![CDATA[<h2 id="contribution-method-and-resource">Contribution: Method and Resource</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>, with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<p>It is Methodological because it details the specific algorithmic workflow (segmentation, binarization, vectorization, and rule-based recognition) used to translate pixel data into chemical semantics. It specifically addresses the &ldquo;Image2Structure&rdquo; task. It also serves as a Resource contribution by introducing OSRA as a free, open-source utility available to the community.</p>
<h2 id="motivation-limitations-of-standard-ocr-in-chemistry">Motivation: Limitations of Standard OCR in Chemistry</h2>
<p>A vast body of chemical information exists in journal publications and patents as two-dimensional structure diagrams. While human-readable, these images are inaccessible to machine data mining techniques like virtual screening. Standard Optical Character Recognition (OCR) is insufficient, and widely used techniques such as wavelet transforms or neural networks (as used in face recognition) are not applicable here because chemical diagrams contain far more structural complexity than alphabet characters, and misinterpretation of a single element can yield a valid but incorrect molecule.</p>
<h2 id="core-innovation-chemistry-aware-heuristic-pipeline">Core Innovation: Chemistry-Aware Heuristic Pipeline</h2>
<p>The authors present a specialized pipeline distinct from standard OCR that combines image processing with domain-specific chemical logic. Key technical contributions include:</p>
<ul>
<li><strong>Entropy-based Page Segmentation</strong>: A statistical method using row entropy to distinguish between pages with mixed text/graphics and pages with single structures.</li>
<li><strong>Custom Binarization</strong>: A specific grayscale conversion ($Gr=\min(R,G,B)$).</li>
<li><strong>Heuristic Confidence Scoring</strong>: A linear &ldquo;confidence function&rdquo; derived from atom and ring counts to select the best structure resolution.</li>
<li><strong>Specialized Bond Recognition</strong>: Algorithms to detect bridge bonds, wedge/dashed bonds (3D info), and aromatic rings via inner circles.</li>
</ul>
<h2 id="methodology-evaluation-on-trec-chem-image2structure">Methodology: Evaluation on TREC-CHEM Image2Structure</h2>
<p>The system was validated through submission to the <strong>Image2Structure task of TREC-CHEM</strong>.</p>
<ul>
<li><strong>Version</strong>: OSRA version 1.3.8 was used without modifications.</li>
<li><strong>Setup</strong>: Two runs were submitted: one with default settings (automatic scale selection) and one fixed at 300 dpi.</li>
<li><strong>Data</strong>: The evaluation used a &ldquo;Training set&rdquo; and a &ldquo;Challenge Set&rdquo; provided by the task organizers.</li>
<li><strong>Metric</strong>: Recall rates were measured for both sets.</li>
</ul>
<h2 id="results-and-real-world-impact">Results and Real-World Impact</h2>
<ul>
<li><strong>Performance</strong>: The default settings achieved an <strong>84.3%</strong> recall on the training set and <strong>84.8%</strong> on the challenge set. The 300 dpi run performed slightly better (86.1% training, 85.6% challenge).</li>
<li><strong>Utility</strong>: The tool is widely used by academic and commercial researchers to extract data from patents (USPTO, JPO).</li>
<li><strong>Validation</strong>: Recognition rates have shown steady improvement over a 3-year development period.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://osra.sourceforge.net">OSRA (SourceForge)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Open-source OCSR tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The primary evaluation data came from the TREC-CHEM Image2Structure task.</li>
<li><strong>Reference Datasets</strong>: The paper references the &ldquo;Chem-Infty Dataset&rdquo; as a source of ground-truthed chemical structure images.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The OSRA pipeline is heuristic-heavy. Key implementation details for replication include:</p>
<p><strong>1. Page Segmentation</strong></p>
<ul>
<li><strong>Entropy Calculation</strong>: Used to detect text vs. graphics. Entropy $E = -p \log p$ is calculated for rows in a feature matrix of component distances.</li>
<li><strong>Thresholds</strong>: Max entropy &gt; 6 indicates mixed text/graphics; $\le$ 3 indicates a single structure. A threshold of <strong>4</strong> is used to distinguish the two.</li>
<li><strong>Separator Removal</strong>: Linear separators (aspect ratio above 100 or below 0.01, size above 300 pixels) are deleted early. Table frames are identified as connected components with aspect ratio between 0.1 and 10, with at least 300 pixels lying on the surrounding rectangle.</li>
<li><strong>Text Removal</strong>: Text blocks are identified if a group of segments (distance determined by local minima in distance matrix) contains &gt; 8 segments, has a fill ratio &gt; 0.2, or aspect ratio &gt; 10.</li>
</ul>
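<p>The entropy test can be sketched as follows. The paper does not state the logarithm base or the exact construction of the feature matrix, so base 2 and a simple row of inter-component distances are assumed here:</p>

```python
from collections import Counter
from math import log2

def row_entropy(distances):
    """Shannon entropy E = -sum(p * log p) of one row of inter-component
    distances (base 2 assumed; the paper does not state the base)."""
    counts = Counter(distances)
    total = len(distances)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def classify_page(feature_rows, threshold=4.0):
    """OSRA-style heuristic: maximum row entropy above the threshold
    suggests mixed text/graphics; below it, a single structure."""
    max_e = max(row_entropy(row) for row in feature_rows)
    return "mixed text/graphics" if max_e > threshold else "single structure"

uniform = [[5, 5, 5, 5, 5, 5, 5, 5]]  # regular spacing -> zero entropy
varied = [list(range(32))]            # 32 distinct distances -> 5 bits
```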
<p><strong>2. Image Preprocessing</strong></p>
<ul>
<li><strong>Grayscale</strong>: $Gr = \min(R, G, B)$.</li>
<li><strong>Resolutions</strong>: Processed at 72, 150, 300 dpi, and a dynamic resolution between 500-1200 dpi.</li>
<li><strong>Noise Factor</strong>: Ratio of 2-pixel line segments to 3-pixel line segments. If this factor is between <strong>0.5 and 1.0</strong>, anisotropic smoothing (GREYCstoration) is applied.</li>
<li><strong>Thinning</strong>: Uses the method by J. M. Cychosz to reduce lines to 1 pixel width.</li>
</ul>
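<p>The grayscale conversion and noise-factor test stated above are simple enough to sketch directly (toy pixel lists, not OSRA&rsquo;s image pipeline):</p>

```python
def to_grayscale(rgb_pixels):
    """OSRA's grayscale rule: Gr = min(R, G, B) per pixel."""
    return [min(p) for p in rgb_pixels]

def noise_factor(segment_lengths):
    """Ratio of 2-pixel to 3-pixel line segments; values in [0.5, 1.0]
    trigger anisotropic smoothing (GREYCstoration) in OSRA."""
    two = sum(1 for s in segment_lengths if s == 2)
    three = sum(1 for s in segment_lengths if s == 3)
    return two / three if three else float("inf")

gray = to_grayscale([(255, 128, 0), (10, 200, 30)])  # -> [0, 10]
nf = noise_factor([2, 2, 3, 3, 3])                   # -> 2/3
needs_smoothing = 0.5 <= nf <= 1.0                   # True
```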
<p><strong>3. Vectorization &amp; Atom Detection</strong></p>
<ul>
<li><strong>Library</strong>: Potrace is used for vectorization.</li>
<li><strong>Atom Identification</strong>: Atoms are detected at Bezier curve control points if:
<ul>
<li>Potrace classifies it as a corner.</li>
<li>Direction change normal component is $\ge$ 2 pixels.</li>
<li>The distance from the last atom to the next control point is less than the distance from the last atom to the current control point.</li>
</ul>
</li>
<li><strong>OCR</strong>: GOCR and OCRAD are used for label recognition on connected sets smaller than max character dimensions. Tesseract and Cuneiform were also tested but did not improve recognition results.</li>
</ul>
<p><strong>4. Chemical Logic</strong></p>
<ul>
<li><strong>Average Bond Length</strong>: Defined as the value at the <strong>75th percentile</strong> of the sorted bond length list (to avoid bias from small artifacts).</li>
<li><strong>Aromaticity</strong>: Flagged if a circle is found inside a ring, atoms are within half the average bond length of the circle, and bond angles to the center are less than 90 degrees.</li>
<li><strong>Bridge Bonds</strong>: Detected if an atom connected to 4 pairwise collinear single bonds (none terminal) can be removed without changing fragment count, rotatable bonds, or reducing the number of 5- and 6-membered rings by 2.</li>
</ul>
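<p>The 75th-percentile bond-length rule can be sketched as below. The exact index convention is an assumption, since the paper gives no formula beyond &ldquo;the value at the 75th percentile of the sorted list&rdquo;:</p>

```python
def average_bond_length(lengths):
    """OSRA's 'average' bond length: the 75th-percentile value of the
    sorted length list, which down-weights small artifact segments."""
    ordered = sorted(lengths)
    idx = int(0.75 * (len(ordered) - 1))  # index rule is an assumption;
    return ordered[idx]                   # the paper gives no exact formula

lengths = [3, 4, 30, 31, 32, 33, 34, 35]  # two short artifacts, six real bonds
avg = average_bond_length(lengths)        # -> 33, ignoring the artifacts
```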
<p><strong>5. Connection Table Compilation</strong></p>
<ul>
<li><strong>Library</strong>: OpenBabel is used for conversion into SMILES or SDF formats.</li>
<li><strong>Process</strong>: A molecular object is constructed from connectivity information along with stereo- and aromaticity flags. Superatom fragments are added at this stage using a user-modifiable dictionary.</li>
</ul>
<h3 id="models">Models</h3>
<p>OSRA is a non-learning, rule-based/heuristic system. However, it uses a tuned linear function for confidence estimation.</p>
<p><strong>Confidence Function</strong>: Used to select the best resolution result.</p>
<p>$$
\begin{aligned}
\text{confidence} &amp;= 0.316030 - 0.016315 N_C + 0.034336 N_N + 0.066810 N_O \\
&amp;+ 0.035674 N_F + 0.065504 N_S + 0.04 N_{Cl} + 0.066811 N_{Br} \\
&amp;+ 0.01 N_R - 0.02 N_{Xx} - 0.212739 N_{rings} + 0.071300 N_{aromatic} \\
&amp;+ 0.329922 N_{rings5} + 0.342865 N_{rings6} - 0.037796 N_{fragments}
\end{aligned}
$$</p>
<p>Where $N_C$ is carbon count, $N_{rings}$ is ring count, etc.</p>
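<p>The confidence function above translates directly into code; the weights are taken verbatim from the paper, and only the dictionary packaging and the benzene-like example counts are my own:</p>

```python
def osra_confidence(counts):
    """OSRA's linear confidence function over atom/ring counts (weights
    copied from the paper); missing counts default to zero."""
    w = {"C": -0.016315, "N": 0.034336, "O": 0.066810, "F": 0.035674,
         "S": 0.065504, "Cl": 0.04, "Br": 0.066811, "R": 0.01,
         "Xx": -0.02, "rings": -0.212739, "aromatic": 0.071300,
         "rings5": 0.329922, "rings6": 0.342865, "fragments": -0.037796}
    return 0.316030 + sum(w[k] * counts.get(k, 0) for k in w)

# Hypothetical benzene-like result: 6 carbons, one aromatic 6-ring, one fragment
score = osra_confidence({"C": 6, "rings": 1, "aromatic": 1,
                         "rings6": 1, "fragments": 1})
```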
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run</th>
          <th>Training Set</th>
          <th>Challenge Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall</td>
          <td>Default Settings</td>
          <td>84.3%</td>
          <td>84.8%</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>Fixed 300 dpi</td>
          <td>86.1%</td>
          <td>85.6%</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2011). Optical Structure Recognition Application entry in Image2Structure task. <em>TREC-CHEM</em>.</p>
<p><strong>Publication</strong>: TREC-CHEM 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://osra.sourceforge.net">SourceForge Project</a></li>
<li><a href="https://launchpad.net/cuneiform-linux">Cuneiform Linux Port</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{filippovOpticalStructureRecognition2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry in {{Image2Structure}} Task}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V. and Katsubo, Dmitry and Nicklaus, Marc C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{National Cancer Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">note</span> = <span style="color:#e6db74">{TREC-CHEM Entry}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Imago: Open-Source Chemical Structure Recognition (2011)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/imago-trec-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/imago-trec-2011/</guid><description>Open-source C++ toolkit for extracting 2D chemical structures from scientific literature using heuristic image processing methods.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-resource-utility">Paper Contribution and Resource Utility</h2>
<p>This is primarily a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper, with a secondary <strong>Method ($\Psi_{\text{Method}}$)</strong> component.</p>
<p><strong>Resource:</strong> The paper&rsquo;s main contribution is the release of the &ldquo;Imago&rdquo; open-source toolkit. It emphasizes infrastructure utility: cross-platform C++ implementation, a core written from scratch without third-party code, and the inclusion of both GUI and command-line tools.</p>
<p><strong>Method:</strong> It provides a detailed description of the recognition pipeline (filtering, segmentation, vectorization) to document the resource.</p>
<h2 id="motivation-the-deep-web-of-chemical-structures">Motivation: The Deep Web of Chemical Structures</h2>
<p>Chemical databases (like PubChem or PubMed) allow text and substructure searches, but a vast amount of chemical data remains &ldquo;locked&rdquo; in the images of scientific articles and patents. This is described as a &ldquo;Deep Web indexing problem&rdquo;. To bridge this gap, the authors identify a need for efficient, accurate, and portable algorithms to convert 2D raster images of molecules into graph representations suitable for indexing and search.</p>
<h2 id="core-innovation-a-dependency-free-c-architecture">Core Innovation: A Dependency-Free C++ Architecture</h2>
<p>The novelty lies in the <strong>open-source, dependency-free implementation</strong>.</p>
<p><strong>Portability:</strong> The core of the toolkit is written from scratch in modern C++ without third-party libraries, specifically targeting portability to mobile devices and various platforms.</p>
<p><strong>Integration:</strong> It combines optical character recognition (OCR) with specific chemical heuristics (like identifying stereochemistry and abbreviations) into a single usable workflow.</p>
<h2 id="methodology-and-experimental-validation-at-trec-chem">Methodology and Experimental Validation at TREC-CHEM</h2>
<p>The paper describes the algorithm used in Imago and reflects on its participation in the <strong>Image2Structure task at TREC-CHEM 2011</strong>. No quantitative results are reported; the &ldquo;Discussion&rdquo; section instead reflects on qualitative performance issues observed during the task, such as handling low resolution, noise, and connected atom labels.</p>
<h2 id="outcomes-limitations-and-future-directions">Outcomes, Limitations, and Future Directions</h2>
<p><strong>Release:</strong> The authors successfully released Imago under the GPLv3 license, including an API for developers. The toolkit outputs recognized structures in MDL Molfile format.</p>
<p><strong>Limitations Identified:</strong> The straightforward pipeline fails when images have low resolution (atom labels merge with bonds), high noise, or tight character spacing (symbols rendered without space pixels between them). Additionally, when few symbols are present, the average bond length estimate can have large error, causing atom symbols to be misidentified as bond chains.</p>
<p><strong>Future Directions:</strong> The authors propose moving from a linear pipeline to an &ldquo;optimization procedure&rdquo; that maximizes a confidence score, using probabilistic mapping of image primitives to chemical entities. They also argue that recognition programs should output a confidence score to enable automatic batch processing (only images with low confidence need manual review). They suggest a multi-pass workflow where each iteration adjusts parameters to improve the confidence level, and they note the additional challenge of separating molecule images from text in real articles and patents.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not specify a training dataset for the core logic (which appears heuristic-based), but it does provide evaluation context:</p>
<ul>
<li><strong>Domain:</strong> Images from scientific articles and patents.</li>
<li><strong>Validation:</strong> TREC-CHEM 2011 Image2Structure task data.</li>
<li><strong>Databases:</strong> Mentions PubMed and PubChem as context for the type of data being indexed.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows a strict linear sequence:</p>
<ol>
<li>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><strong>Binarization:</strong> Threshold-based.</li>
<li><strong>Supersegmentation:</strong> Locates the chemical structure using a $15 \times 15$ window neighbor search.</li>
<li><strong>Filtering:</strong> Removes single-down stereo bonds (dashed triangles) early to prevent incorrect recognition of the small line segments during classification.</li>
</ul>
</li>
<li>
<p><strong>Separation (Symbols vs. Graphics):</strong></p>
<ul>
<li><strong>Heuristic:</strong> Estimates &ldquo;capital letter height&rdquo;.</li>
<li><strong>Criteria:</strong> Groups segments by height and aspect ratio range $[\text{MinSymRatio}, \text{MaxSymRatio}]$.</li>
</ul>
</li>
<li>
<p><strong>Skeleton Construction (Vectorization):</strong></p>
<ul>
<li><strong>Thinning:</strong> Uses neighborhood maps to reduce lines to 1-pixel thickness.</li>
<li><strong>De-crossing:</strong> Each black pixel with more than 2 black pixels in its 8-neighborhood becomes white, isolating polylines.</li>
<li><strong>Smoothing:</strong> Uses the <strong>Douglas-Peucker algorithm</strong>.</li>
<li><strong>Graph Adjustment:</strong> Merges close vertices and detects bond orders based on parallel edges.</li>
</ul>
</li>
<li>
<p><strong>Symbol Recognition:</strong></p>
<ul>
<li><strong>Grouping:</strong> Uses a <strong>Relative Neighborhood Graph</strong> to group characters into superatoms/labels.</li>
<li><strong>OCR:</strong> Classification based on <strong>Fourier descriptors</strong> of outer/inner contours.</li>
</ul>
</li>
<li>
<p><strong>Chemical Expansion:</strong></p>
<ul>
<li><strong>Abbreviation:</strong> Expands common groups (e.g., Ph, COOH) stored as SMILES notation, using the <strong>Indigo toolkit</strong> for 2D coordinate generation of the expanded structures.</li>
</ul>
</li>
</ol>
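<p>The de-crossing rule above is mechanical enough to sketch directly; assuming the thinned skeleton is a binary NumPy array (1 = black), a minimal version might look like this (function name and representation are illustrative, not from the paper):</p>

```python
import numpy as np

def decross(binary: np.ndarray) -> np.ndarray:
    """Remove junction pixels from a 1-pixel-wide skeleton.

    Any black pixel (value 1) with more than 2 black pixels in its
    8-neighborhood becomes white, splitting the skeleton into simple
    polylines, as in Imago's de-crossing step.
    """
    padded = np.pad(binary, 1)  # zero border so edge pixels have 8 neighbors
    # Sum the 8 neighbors of every pixel via shifted views of the padded image.
    neighbors = sum(
        padded[1 + dy : padded.shape[0] - 1 + dy,
               1 + dx : padded.shape[1] - 1 + dx]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    out = binary.copy()
    out[(binary == 1) & (neighbors > 2)] = 0
    return out
```

<p>Pixels adjacent to a junction can also satisfy the rule, so the split is aggressive; the later graph-adjustment step (merging close vertices) compensates for this.</p>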
<h3 id="models">Models</h3>
<ul>
<li><strong>Type:</strong> Heuristic-based computer vision pipeline; no learned deep learning weights mentioned.</li>
<li><strong>Stereo Recognition:</strong>
<ul>
<li><strong>Single Down:</strong> Identified as $k \ge 3$ parallel equidistant lines.</li>
<li><strong>Single Up:</strong> Identified by checking if a bond was a solid triangle before thinning.</li>
</ul>
</li>
</ul>
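<p>The &ldquo;single down&rdquo; heuristic ($k \ge 3$ parallel, equidistant lines) can be sketched as a small geometric test; everything below, including the function name and tolerances, is illustrative rather than taken from the paper:</p>

```python
import math

def is_hash_wedge(segments, angle_tol=0.15, spacing_tol=0.25):
    """Heuristic check for a dashed ('single down') stereo bond:
    k >= 3 short parallel lines at roughly equal spacing.

    `segments` is a list of ((x1, y1), (x2, y2)) endpoint pairs.
    Angles near the 0/pi wrap-around are not handled, to keep the
    sketch short.
    """
    if len(segments) < 3:
        return False
    angles = [math.atan2(y2 - y1, x2 - x1) % math.pi
              for (x1, y1), (x2, y2) in segments]
    if max(angles) - min(angles) > angle_tol:
        return False  # not parallel enough
    # Project segment midpoints onto the normal of the shared direction.
    theta = angles[0]
    nx, ny = -math.sin(theta), math.cos(theta)
    offsets = sorted(((x1 + x2) / 2 * nx + (y1 + y2) / 2 * ny)
                     for (x1, y1), (x2, y2) in segments)
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    mean = sum(gaps) / len(gaps)
    return all(abs(g - mean) <= spacing_tol * mean for g in gaps)
```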
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics:</strong> None quantitatively reported in the text; discussion focuses on qualitative failure modes (low resolution, noise).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/epam/Imago">Imago GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0 (current); GPLv3 (as published)</td>
          <td>Official C++ implementation</td>
      </tr>
      <tr>
          <td><a href="https://lifescience.opensource.epam.com/imago/">Imago Project Page</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Documentation and downloads</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements:</strong> Designed to be lightweight and portable (mobile-device capable), written in C++. No specific GPU/TPU requirements.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Smolov, V., Zentsev, F., &amp; Rybalkin, M. (2011). Imago: Open-Source Toolkit for 2D Chemical Structure Image Recognition. <em>TREC-CHEM 2011</em>.</p>
<p><strong>Publication</strong>: TREC-CHEM 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/t20.proceedings.html">TREC-CHEM 2011 Proceedings</a></li>
<li><a href="https://lifescience.opensource.epam.com/imago/">Project Website</a></li>
<li><a href="https://github.com/epam/Imago">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{smolovImagoOpenSourceToolkit2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Imago: {{Open-Source Toolkit}} for {{2D Chemical Structure Image Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Smolov, Viktor and Zentsev, Fedor and Rybalkin, Mikhail}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{{GGA Software Services LLC}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">note</span> = <span style="color:#e6db74">{TREC-CHEM 2011}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OCSR Methods: A Taxonomy of Approaches</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/</guid><description>Overview of optical chemical structure recognition methods organized by approach, from deep learning to rule-based systems.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>Optical Chemical Structure Recognition (OCSR) aims to automatically extract machine-readable molecular representations (e.g., SMILES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>, mol files) from images of chemical structures. Methods have evolved from early rule-based systems to modern deep learning approaches.</p>
<p>This note organizes OCSR methods by their fundamental approach, providing a framework for understanding the landscape of techniques.</p>
<h2 id="common-limitations-and-failure-modes">Common Limitations and Failure Modes</h2>
<p>Regardless of the underlying paradigm, most OCSR systems struggle with a common set of challenges:</p>
<ol>
<li><strong>Stereochemistry</strong>: Ambiguous wedge/dash bonds, varying drawing conventions, and implicit stereocenters frequently lead to incorrect isomer generation.</li>
<li><strong>Markush Structures</strong>: Generic structures with variable R-groups (common in patents) require complex subgraph mapping that sequence-based models often fail to capture.</li>
<li><strong>Image Degradation</strong>: Artifacts, low resolution, skewed scans, and hand-drawn irregularities degrade the performance of both rule-based heuristics and CNN feature extractors.</li>
<li><strong>Superatoms and Abbreviations</strong>: Textual abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;t-Bu&rdquo;, &ldquo;Boc&rdquo;) embedded within the image require joint optical character recognition (OCR) and structural parsing.</li>
</ol>
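<p>Superatom handling is typically implemented as a lookup from the OCR&rsquo;d label to a SMILES fragment that is then spliced into the molecular graph; a minimal sketch, with a deliberately tiny illustrative table (production tools ship dictionaries with hundreds of entries):</p>

```python
# Illustrative abbreviation -> SMILES fragment table (not a complete list).
SUPERATOMS = {
    "Ph": "c1ccccc1",         # phenyl
    "t-Bu": "C(C)(C)C",       # tert-butyl
    "Boc": "C(=O)OC(C)(C)C",  # tert-butyloxycarbonyl
    "COOH": "C(=O)O",         # carboxyl
}

def expand_superatom(label):
    """Map a recognized text label to its SMILES fragment, or return
    None when the label is not a known abbreviation (e.g. a bare element)."""
    return SUPERATOMS.get(label)
```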
<h2 id="review--survey-papers">Review &amp; Survey Papers</h2>
<p>Comprehensive surveys and systematization of knowledge papers that organize and synthesize the OCSR literature.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Focus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2020</td>
          <td><a href="https://doi.org/10.1186/s13321-020-00465-0">A review of optical chemical structure recognition tools</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/">Rajan et al. 2020</a></td>
          <td>Survey of 30 years of OCSR development (1990-2019); benchmark of three open-source tools (OSRA, Imago, MolVec) on four datasets</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1186/s13321-022-00642-3">Review of techniques and models used in optical chemical structure recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/">Musazade et al. 2022</a></td>
          <td>Systematization of OCSR evolution from rule-based systems to modern deep learning; identifies paradigm shift to image captioning and critiques evaluation metrics</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1039/D3DD00228D">Comparing software tools for optical chemical structure recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/krasnov-ocsr-benchmark-2024/">Krasnov et al. 2024</a></td>
          <td>Benchmark of 8 open-access tools on 2,702 manually curated patent images; proposes ChemIC classifier for hybrid routing approach</td>
      </tr>
  </tbody>
</table>
<h2 id="deep-learning-methods">Deep Learning Methods</h2>
<p>End-to-end neural network architectures that learn to map images directly to molecular representations.</p>
<p><strong>Note on Paper Types</strong>: Papers listed below are primarily <strong>Method</strong> ($\Psi_{\text{Method}}$) papers focused on novel architectures and performance improvements. Some also have secondary <strong>Resource</strong> ($\Psi_{\text{Resource}}$) contributions through released tools or datasets. See the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> for classification details.</p>
<h3 id="image-to-sequence-paradigm">Image-to-Sequence Paradigm</h3>
<p>Treating chemical structure recognition as an image captioning task, these methods use encoder-decoder architectures (often with attention mechanisms) to generate sequential molecular representations like SMILES directly from pixels. Formally, given an image $I$, the model learns to sequentially output tokens $y_t$ to maximize the conditional probability:
$$ p(Y|I) = \prod_{t=1}^{T} p(y_t | y_{&lt;t}, I; \theta) $$
where $\theta$ represents the model parameters. This paradigm is powerful but can hallucinate chemically invalid structures if the decoder fails to learn chemical syntax rules.</p>
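<p>The factorization above corresponds directly to an autoregressive decoding loop; a model-agnostic greedy-decoding sketch, where the <code>step</code> callable stands in for any trained encoder-decoder and all names are illustrative:</p>

```python
def greedy_decode(step, image, bos="<bos>", eos="<eos>", max_len=128):
    """Greedy decoding of p(Y|I) = prod_t p(y_t | y_<t, I).

    `step(image, prefix)` is any function returning a dict of
    token -> probability for the next token given the image and the
    tokens decoded so far.
    """
    tokens = [bos]
    for _ in range(max_len):
        probs = step(image, tokens)
        nxt = max(probs, key=probs.get)  # argmax over the vocabulary
        if nxt == eos:
            break
        tokens.append(nxt)
    return "".join(tokens[1:])  # drop <bos>, join into a SMILES string
```

<p>Beam search replaces the argmax with a ranked set of partial sequences; either way, nothing in the loop enforces chemical validity, which is the hallucination risk noted above.</p>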
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2019</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.8b00669">Molecular Structure Extraction From Documents Using Deep Learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/">Staker et al. Notes</a></td>
          <td>U-Net segmentation + CNN-GridLSTM encoder-decoder with attention</td>
      </tr>
      <tr>
          <td>2020</td>
          <td><a href="https://doi.org/10.1186/s13321-020-00469-w">DECIMER: towards deep learning for chemical image recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER Notes</a></td>
          <td>Inception V3 encoder + GRU decoder with attention</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.1039/D1SC02957F">ChemPix: automated recognition of hand-drawn hydrocarbon structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/">ChemPix Notes</a></td>
          <td>CNN encoder + LSTM decoder with attention</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.1186/s13321-021-00538-8">DECIMER 1.0: deep learning for chemical image recognition using transformers</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0 Notes</a></td>
          <td>EfficientNet-B3 encoder + Transformer decoder with SELFIES output</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.48550/arXiv.2104.14721">End-to-End Attention-based Image Captioning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI Transformer Notes</a></td>
          <td>Vision Transformer encoder + Transformer decoder with InChI output</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.1039/D1SC01839F">Img2Mol - accurate SMILES recognition from molecular graphical depictions</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/">Img2Mol Notes</a></td>
          <td>CNN encoder + pre-trained CDDD decoder for continuous embedding</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.48550/arXiv.2109.04202">IMG2SMI: Translating Molecular Structure Images to SMILES</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/">IMG2SMI Notes</a></td>
          <td>ResNet-101 encoder + Transformer decoder with SELFIES output</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.3390/app12020680">Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/">ICMDT Notes</a></td>
          <td>Deep TNT encoder + Transformer decoder with InChI output</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1002/cmtd.202100069">Image2SMILES: Transformer-Based Molecular Optical Recognition Engine</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES Notes</a></td>
          <td>ResNet-50 encoder + Transformer decoder with FG-SMILES output</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1093/bioinformatics/btac545">MICER: a pre-trained encoder-decoder architecture for molecular image captioning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/">MICER Notes</a></td>
          <td>Fine-tuned ResNet101 encoder + LSTM decoder with attention</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1039/D1DD00013F">Performance of chemical structure string representations for chemical image recognition using transformers</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan String Representations</a></td>
          <td>Comparative ablation: SMILES vs DeepSMILES vs SELFIES vs InChI</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1186/s13321-022-00624-5">SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/">SwinOCSR Notes</a></td>
          <td>Swin Transformer encoder + Transformer decoder with DeepSMILES output</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.1145/3581783.3612573">Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/">Hu et al. RCGD Notes</a></td>
          <td>DenseNet encoder + GRU decoder with attention and SSML output</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.1038/s41467-023-40782-0">DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/">DECIMER.ai Notes</a></td>
          <td>EfficientNet-V2-M encoder + Transformer decoder with SMILES output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1038/s41598-024-67496-7">ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/">ChemReco Notes</a></td>
          <td>EfficientNet encoder + Transformer decoder with SMILES output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1186/s13321-024-00872-7">Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/">Enhanced DECIMER Notes</a></td>
          <td>EfficientNet-V2-M encoder + Transformer decoder with SMILES output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.3c02082">Image2InChI: Automated Molecular Optical Image Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI Notes</a></td>
          <td>Improved SwinTransformer encoder + Transformer decoder with InChI output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1039/D4RA02442G">MMSSC-Net: multi-stage sequence cognitive networks for drug molecule recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/">MMSSC-Net Notes</a></td>
          <td>SwinV2 encoder + GPT-2 decoder with MLP for multi-stage cognition</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.48550/arXiv.2412.07594">RFL: Simplifying Chemical Structure Recognition with Ring-Free Language</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/">RFL Notes</a></td>
          <td>DenseNet encoder + GRU decoder with hierarchical ring decomposition</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.1021/acs.jpclett.5c03057">Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/">DGAT Notes</a></td>
          <td>ResNet-101 encoder + Transformer with CGFE/SDGLA modules and SMILES output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2506.07553">GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/">GTR-CoT Notes</a></td>
          <td>Qwen-VL 2.5 3B encoder-decoder with graph traversal chain-of-thought and SMILES output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2411.11098">MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/">MolParser Notes</a></td>
          <td>Swin Transformer encoder + BART decoder with Extended SMILES (E-SMILES) output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2511.17300">MolSight: OCSR with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/">MolSight Notes</a></td>
          <td>EfficientViT-L1 encoder + Transformer decoder with RL (GRPO) and SMILES output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2501.15415">OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU Notes</a></td>
          <td>Mol-VL: Qwen2-VL encoder-decoder with multi-task learning for multi-level understanding</td>
      </tr>
  </tbody>
</table>
<h3 id="image-to-graph-paradigm">Image-to-Graph Paradigm</h3>
<p>Methods that explicitly construct molecular graphs as intermediate representations, identifying atoms as vertices $V$ and bonds as edges $E$ before converting to standard molecular formats. Graph approaches construct an adjacency matrix $A$ and feature vectors, effectively turning OCSR into a joint probability model over nodes, edges, and their spatial coordinates:
$$ p(G|I) = \prod_{v \in V} p(v|I) \prod_{u &lt; v} p(e_{uv}|v_u, v_v, I) $$
This avoids hallucinating invalid character strings and explicitly grounds the predictions to the image space (via bounding boxes/segmentation), improving interpretability.</p>
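<p>Concretely, the edge term amounts to thresholding per-pair bond probabilities into a symmetric adjacency matrix; a schematic sketch (names and the 0.5 threshold are illustrative, not from any specific paper):</p>

```python
import numpy as np

def build_adjacency(atoms, bond_probs, thresh=0.5):
    """Discretize per-pair bond probabilities into a symmetric
    adjacency matrix over the detected atoms.

    `atoms` is a list of element symbols (the predicted nodes);
    `bond_probs` maps (i, j) index pairs to bond probabilities.
    """
    n = len(atoms)
    adj = np.zeros((n, n), dtype=int)
    for (i, j), p in bond_probs.items():
        if p >= thresh:
            adj[i, j] = adj[j, i] = 1  # bonds are undirected
    return adj
```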
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2020</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.0c00459">ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/">ChemGrapher Notes</a></td>
          <td>U-Net-based semantic segmentation + graph building algorithm + classification CNNs</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1093/bib/bbac033">ABC-Net: A divide-and-conquer based deep learning architecture for SMILES recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/">ABC-Net Notes</a></td>
          <td>U-Net-style FCN with keypoint detection heatmaps + multi-task property prediction</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.48550/arXiv.2202.09580">Image-to-Graph Transformers for Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/">Image-to-Graph Transformers Notes</a></td>
          <td>ResNet-34 encoder + Transformer encoder + Graph-Aware Transformer (GRAT) decoder</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.2c00733">MolMiner: You Only Look Once for Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/">MolMiner Notes</a></td>
          <td>MobileNetV2 segmentation + YOLOv5 object detection + EasyOCR + graph construction</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://openaccess.thecvf.com/content/ICCV2023/html/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.html">MolGrapher: Graph-based Visual Recognition of Chemical Structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/">MolGrapher Notes</a></td>
          <td>ResNet-18 keypoint detector + supergraph construction + GNN classifier</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.2c01480">MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe Notes</a></td>
          <td>Swin Transformer encoder + Transformer decoder with explicit atom coordinates and bond prediction</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.48550/arXiv.2404.01743">Atom-Level Optical Chemical Structure Recognition with Limited Supervision</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/">AtomLenz Notes</a></td>
          <td>Faster R-CNN object detection + graph constructor with weakly supervised training (ProbKT*)</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1186/s13321-024-00926-w">MolNexTR: a generalized deep learning model for molecular image recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/molnextr/">MolNexTR Notes</a></td>
          <td>Dual-stream (ConvNext + ViT) encoder + Transformer decoder with graph generation</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.1109/CVPR52734.2025.01352">MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/markush/markushgrapher/">MarkushGrapher Notes</a></td>
          <td>UDOP VTL encoder + MolScribe OCSR encoder + T5 decoder with CXSMILES + substituent table</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2505.03777">MolMole: Molecule Mining from Scientific Literature</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/">MolMole Notes</a></td>
          <td>ViDetect (DINO) + ViReact (RxnScribe) + ViMore (detection-based) unified page-level pipeline</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2501.15415">OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU Notes</a></td>
          <td>DoubleCheck: MolScribe + attentive feature enhancement with local ambiguous atom refinement</td>
      </tr>
  </tbody>
</table>
<h3 id="image-to-fingerprint-paradigm">Image-to-Fingerprint Paradigm</h3>
<p>Methods that bypass molecular graph reconstruction entirely, generating molecular fingerprints directly from images through functional group recognition and spatial analysis. These approaches prioritize retrieval and similarity search over exact structure reconstruction.</p>
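<p>Retrieval with such fingerprints typically reduces to Tanimoto (Jaccard) similarity between sets of on-bits; a minimal sketch:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints
    represented as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```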
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.1186/s13321-025-01091-4">SubGrapher: visual fingerprinting of chemical structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/">SubGrapher Notes</a></td>
          <td>Dual Mask-RCNN instance segmentation (1,534 groups + 27 backbones) + substructure-graph + SVMF fingerprint</td>
      </tr>
  </tbody>
</table>
<h3 id="image-classification-and-filtering">Image Classification and Filtering</h3>
<p>Methods that classify chemical structure images for preprocessing purposes, such as detecting Markush structures or other problematic inputs that should be filtered before full OCSR processing.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.48550/arXiv.2311.14633">One Strike, You&rsquo;re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/">Jurriaans et al. Notes</a></td>
          <td>Patch-based pipeline with Inception V3 or ResNet18 for binary classification</td>
      </tr>
  </tbody>
</table>
<h2 id="traditional-machine-learning-methods">Traditional Machine Learning Methods</h2>
<p>Hybrid approaches combining classical machine learning algorithms (neural networks, SVMs, CRFs) with domain-specific heuristics and image processing. These methods (primarily from 1992-2014) used ML for specific subtasks like character recognition or symbol classification while relying on rule-based systems for chemical structure interpretation.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Key ML Component</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1992</td>
          <td><a href="https://doi.org/10.1021/ci00008a018">Kekulé: OCR-Optical Chemical (Structure) Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/">Kekulé Notes</a></td>
          <td>Multilayer perceptron for OCR</td>
      </tr>
      <tr>
          <td>1996</td>
          <td><a href="https://doi.org/10.1007/3-540-61226-2_14">Automatic Interpretation of Chemical Structure Diagrams</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/">Kekulé-1 Notes</a></td>
          <td>Neural network with shared weights (proto-CNN)</td>
      </tr>
      <tr>
          <td>2007</td>
          <td><a href="https://cdn.aaai.org/AAAI/2007/AAAI07-134.pdf">Recognition of Hand Drawn Chemical Diagrams</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/">Ouyang-Davis Notes</a></td>
          <td>SVM for symbol classification</td>
      </tr>
      <tr>
          <td>2008</td>
          <td><a href="https://static.aminer.org/pdf/PDF/000/295/640/neural_versus_syntactic_recognition_of_handwritten_numerals.pdf">Chemical Ring Handwritten Recognition Based on Neural Networks</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/">Hewahi et al. Notes</a></td>
          <td>Two-phase classifier-recognizer with feed-forward NNs</td>
      </tr>
      <tr>
          <td>2008</td>
          <td><a href="https://doi.org/10.1109/IJCNN.2008.4634125">Recognition of On-line Handwritten Chemical Expressions</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/">Yang et al. Notes</a></td>
          <td>Two-level algorithm with edit distance matching</td>
      </tr>
      <tr>
          <td>2008</td>
          <td><a href="https://doi.org/10.1109/ICPR.2008.4761824">A Study of On-Line Handwritten Chemical Expressions Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/">Yang et al. Notes</a></td>
          <td>ANN with two-level substance recognition</td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1109/ICDAR.2009.64">A Unified Framework for Recognizing Handwritten Chemical Expressions</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/">Chang et al. Notes</a></td>
          <td>GMM for spatial relations, NN for bond verification</td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1109/ICDAR.2009.99">HMM-Based Online Recognition of Handwritten Chemical Symbols</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/">Zhang et al. Notes</a></td>
          <td>Hidden Markov Model for online handwriting</td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1109/ICDAR.2009.70">The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/">Wang et al. Notes</a></td>
          <td>HMM for text recognition + CFG for structure parsing</td>
      </tr>
      <tr>
          <td>2010</td>
          <td><a href="https://doi.org/10.1109/ICPR.2010.465">A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/">Zhang et al. Notes</a></td>
          <td>Dual-stage SVM-HMM with PSR algorithm</td>
      </tr>
      <tr>
          <td>2011</td>
          <td><a href="https://doi.org/10.1145/1943403.1943444">ChemInk: A Natural Real-Time Recognition System for Chemical Drawings</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/">ChemInk Notes</a></td>
          <td>Conditional Random Field (CRF) joint model</td>
      </tr>
      <tr>
          <td>2013</td>
          <td><a href="https://doi.org/10.1109/ICIS.2013.6607894">Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/">Tang et al. Notes</a></td>
          <td>SVM with elastic matching for handwriting</td>
      </tr>
      <tr>
          <td>2014</td>
          <td><a href="https://doi.org/10.1021/ci5002197">Markov Logic Networks for Optical Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/">MLOCSR Notes</a></td>
          <td>Markov Logic Network for probabilistic inference</td>
      </tr>
  </tbody>
</table>
<h2 id="rule-based-methods">Rule-Based Methods</h2>
<p>Classic approaches using heuristics, image processing, and domain-specific rules. While some systems use traditional OCR engines (which may contain ML components), the chemical structure recognition itself is purely algorithmic.</p>
<p><strong>Note</strong>: The chemoCR systems use SVM-based OCR but employ rule-based topology-preserving vectorization for core structure reconstruction, placing them primarily in this category.</p>
<h3 id="core-methods">Core Methods</h3>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1990</td>
          <td><a href="https://doi.org/10.1021/ci00067a014">Computational Perception and Recognition of Digitized Molecular Structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/">Contreras et al. Notes</a></td>
      </tr>
      <tr>
          <td>1993</td>
          <td><a href="https://doi.org/10.1021/ci00013a010">Chemical Literature Data Extraction: The CLiDE Project</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/">CLiDE Notes</a></td>
      </tr>
      <tr>
          <td>1993</td>
          <td><a href="https://doi.org/10.1109/ICDAR.1993.395658">Optical Recognition of Chemical Graphics</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/">Casey et al. Notes</a></td>
      </tr>
      <tr>
          <td>1999</td>
          <td><a href="https://doi.org/10.1109/ICDAR.1999.791730">Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/">Ramel et al. Notes</a></td>
      </tr>
      <tr>
          <td>2007</td>
          <td><a href="https://doi.org/10.1109/ENC.2007.25">Automatic Recognition of Chemical Images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/">chemoCR Notes</a></td>
      </tr>
      <tr>
          <td>2007</td>
          <td><a href="https://doi.org/10.1109/IEMBS.2007.4353366">Reconstruction of Chemical Molecules from Images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/">chemoCR Notes</a></td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1186/1752-153X-3-4">Automated extraction of chemical structure information from digital raster images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/">ChemReader Notes</a></td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1021/ci800449t">CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/">CLiDE Pro Notes</a></td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1021/ci800067r">Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/osra/">OSRA Notes</a></td>
      </tr>
      <tr>
          <td>2012</td>
          <td><a href="https://doi.org/10.1117/12.912185">Chemical Structure Recognition: A Rule Based Approach</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/">MolRec Notes</a></td>
      </tr>
      <tr>
          <td>2015</td>
          <td><a href="https://doi.org/10.2991/jimet-15.2015.50">Research on Chemical Expression Images Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/">Hong et al. Notes</a></td>
      </tr>
  </tbody>
</table>
<h3 id="trec-2011-chemistry-track">TREC 2011 Chemistry Track</h3>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/trec-chem-2011/">TREC 2011 Chemistry Track</a> provided a standardized benchmark for comparing OCSR systems, introducing the novel Image-to-Structure task alongside Prior Art and Technology Survey tasks. Papers from this evaluation are grouped here.</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Paper</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>chemoCR</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemoCR.chem.update.pdf">Chemical Structure Reconstruction with chemoCR</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/">chemoCR Notes</a></td>
      </tr>
      <tr>
          <td>ChemReader</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemreader.chem.update.pdf">Image-to-Structure Task by ChemReader</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/">ChemReader at TREC 2011 Notes</a></td>
      </tr>
      <tr>
          <td>Imago</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/GGA.chemical.pdf">Imago: open-source toolkit for 2D chemical structure image recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/imago-trec-2011/">Imago Notes</a></td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/saic-frederick.chem.pdf">Optical Structure Recognition Application entry in Image2Structure task</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/">OSRA at TREC 2011 Notes</a></td>
      </tr>
      <tr>
          <td>MolRec</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf">Performance of MolRec at TREC 2011 Overview and Analysis of Results</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/">MolRec at TREC Notes</a></td>
      </tr>
      <tr>
          <td>ChemInfty</td>
          <td><a href="https://www.inftyreader.org/inftyreader-contents/about-inftyreader/list-of-academic-papers/2011_GREC_ChemInfty.pdf">Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/">ChemInfty Notes</a></td>
      </tr>
  </tbody>
</table>
<h3 id="clef-2012-chemistry-track">CLEF 2012 Chemistry Track</h3>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/clef-ip-2012/">CLEF-IP 2012 benchmarking lab</a> introduced three specific IR tasks in the intellectual property domain: claims-based passage retrieval, flowchart recognition, and chemical structure recognition. The chemical structure recognition task included both segmentation (identifying bounding boxes) and recognition (converting to MOL format) subtasks, with a particular focus on challenging Markush structures common in patents.</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Paper</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolRec</td>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">MolRec at CLEF 2012 - Overview and Analysis of Results</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/">MolRec at CLEF 2012 Notes</a></td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf">Optical Structure Recognition Application entry to CLEF-IP 2012</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/">OSRA at CLEF-IP 2012 Notes</a></td>
      </tr>
  </tbody>
</table>
]]></content:encoded></item><item><title>MolRec: Rule-Based OCSR System at TREC 2011 Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</guid><description>Rule-based system for optical chemical structure recognition using vectorization and geometric analysis, achieving 95% accuracy on TREC 2011.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-ocsr-system">Contribution: Rule-Based OCSR System</h2>
<p>This is a <strong>Method</strong> paper that presents and validates MolRec, a rule-based system for Optical Chemical Structure Recognition (OCSR). While the paper emphasizes performance analysis on the TREC 2011 benchmark, the core contribution is the system architecture itself: a multi-stage pipeline using vectorization, geometric rule-based analysis, and graph construction to convert chemical diagram images into machine-readable MOL files.</p>
<h2 id="motivation-robust-conversion-of-chemical-diagrams">Motivation: Robust Conversion of Chemical Diagrams</h2>
<p>Chemical molecular diagrams are ubiquitous in scientific documents across chemistry and life sciences. Converting these static raster images into machine-readable formats (like MOL files) that encode precise spatial and connectivity information is important for cheminformatics applications such as database indexing, similarity searching, and automated literature mining.</p>
<p>While pixel-based pattern matching approaches exist, they struggle with variations in drawing style, image quality, and diagram complexity. An approach that can handle the geometric and topological diversity of real-world chemical diagrams is needed.</p>
<h2 id="novelty-vectorization-and-geometric-rules">Novelty: Vectorization and Geometric Rules</h2>
<p>MolRec uses a <strong>vectorization and geometric rule-based pipeline</strong>. Key technical innovations include:</p>
<p><strong>Disk-Growing Heuristic for Wedge Bonds</strong>: A novel dynamic algorithm to distinguish wedge bonds from bold lines. A disk with radius greater than the average line width is placed inside the connected component and grown to the largest size that still covers only foreground pixels. The disk is then walked in the direction that allows it to continue growing. When it can grow no more, the base of the triangle (stereo-center) has been located, identifying the wedge orientation.</p>
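<p>A toy sketch of the disk-growing idea (a hypothetical reimplementation on a synthetic raster; MolRec&rsquo;s source is not public, and the pixel-set representation and radii below are assumptions for illustration):</p>

```python
# Hypothetical sketch of disk-growing wedge detection: grow a disk at each end
# of a connected component; the end supporting the larger inscribed disk is the
# wide triangle base (the stereo-center).

def max_disk_radius(foreground, cx, cy, r_max=50):
    """Largest integer radius r such that every pixel within distance r
    of (cx, cy) belongs to the foreground set."""
    for r in range(1, r_max):
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                if dx * dx + dy * dy <= r * r and (cx + dx, cy + dy) not in foreground:
                    return r - 1
    return r_max

def wedge_base(foreground, end_a, end_b):
    """Return whichever end of the component supports the larger disk."""
    ra = max_disk_radius(foreground, *end_a)
    rb = max_disk_radius(foreground, *end_b)
    return end_a if ra >= rb else end_b

# Raster a solid wedge: wide near x=0, tapering to a point at x=30.
wedge = {(x, y) for x in range(31) for y in range(-6 + x // 5, 7 - x // 5)}
print(wedge_base(wedge, (2, 0), (28, 0)))  # → (2, 0), the wide end
```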
<p><strong>Joint Breaking Strategy</strong>: Explicitly breaking all connected joints in the vectorization stage to avoid combinatorial connection complexity. This allows uniform treatment of all line segment connections regardless of junction complexity.</p>
<p><strong>Superatom Dictionary Mining</strong>: The system mines MOL files from the OSRA dataset to build a comprehensive superatom dictionary (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;), supplemented by the Marvin abbreviation collection.</p>
<p><strong>Comprehensive Failure Analysis</strong>: Unlike most OCSR papers that report only aggregate accuracy, this work provides a detailed categorization of all 55 failures, identifying 61 specific error reasons and their root causes.</p>
<h2 id="methodology-and-trec-2011-experiments">Methodology and TREC 2011 Experiments</h2>
<p><strong>Benchmark</strong>: The system was evaluated on the <strong>TREC 2011 Chemical Track</strong> test set consisting of 1,000 molecular diagram images. The authors performed two independent runs with slightly different internal parameter settings to assess reproducibility.</p>
<p><strong>Evaluation Metric</strong>: Correct recall of chemical structures. Output MOL files were compared semantically to ground truth using <strong>OpenBabel</strong>, which treats syntactically different but chemically equivalent representations as the same structure.</p>
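<p>To illustrate what &ldquo;semantic&rdquo; comparison means here, the following toy canonicaliser treats two molecular graphs as equal regardless of atom ordering by iteratively refining atom invariants (a simplified Morgan-style scheme). This is an illustrative stand-in only, not how OpenBabel is implemented:</p>

```python
# Toy "semantic" graph comparison: two MOL-style graphs match when they are the
# same molecule regardless of atom numbering. Crude Morgan-like refinement for
# illustration only; a real pipeline would use a cheminformatics toolkit.

def canonical_signature(atoms, bonds, rounds=4):
    """atoms: list of element symbols; bonds: list of (i, j, order) tuples."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))
    inv = list(atoms)  # initial invariant: the element symbol itself
    for _ in range(rounds):  # refine with sorted neighbour invariants
        inv = [(inv[i], tuple(sorted((order, inv[j]) for j, order in neighbours[i])))
               for i in range(len(atoms))]
    return sorted(map(repr, inv))

# Ethanol written with two different atom orderings compares equal:
a = canonical_signature(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
b = canonical_signature(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
print(a == b)  # → True
```

In practice one would parse both MOL files with a toolkit and compare canonical SMILES or InChI strings rather than roll a custom invariant scheme.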
<p><strong>Failure Analysis</strong>: Across both runs, 55 unique diagrams were misrecognized (50 in run 1, 51 in run 2, with significant overlap). The authors manually examined all 55 and categorized them, identifying 61 specific reasons for mis-recognition. This analysis provides insight into systematic limitations of the rule-based approach.</p>
<h2 id="results-and-top-failure-modes">Results and Top Failure Modes</h2>
<p><strong>High Accuracy</strong>: MolRec achieved a <strong>95% correct recovery rate</strong> on the TREC 2011 benchmark:</p>
<ul>
<li>Run 1: 950/1000 structures correctly recognized (95.0%)</li>
<li>Run 2: 949/1000 structures correctly recognized (94.9%)</li>
</ul>
<p>The near-identical results across runs with slightly different internal parameters demonstrate the stability of the rule-based approach.</p>
<p><strong>Top Failure Modes</strong> (from detailed analysis of 55 unique misrecognized diagrams, yielding 61 total error reasons):</p>
<ul>
<li><strong>Dashed wedge bond misidentification (15 cases)</strong>: Most common failure. Short dashes at the narrow end were interpreted as a separate dashed bond while longer dashes were treated as a dashed wedge or dashed bold bond, splitting one bond into two with a spurious node.</li>
<li><strong>Incorrect stereochemistry (10 cases)</strong>: Heuristics guessed wrong 3D orientations for ambiguous bold/dashed bonds where syntax alone is insufficient.</li>
<li><strong>Touching components (6 cases)</strong>: Characters touching bonds, letters touching symbols, or ink bleed between close parallel lines caused segmentation failures.</li>
<li><strong>Incorrect character grouping (5 cases)</strong>: Characters too close together for reliable separation.</li>
<li><strong>Solid circles without 3D hydrogen bond (5 cases)</strong>: MolRec correctly interprets solid circles as implying a hydrogen atom via a solid wedge bond, but some solution MOL files in the test set omit this bond, causing a mismatch.</li>
<li><strong>Diagram caption confusion (5 cases)</strong>: Captions appearing within images are mistakenly parsed as part of the molecular structure.</li>
<li><strong>Unrecognised syntax (5 cases)</strong>: User annotations, unusual notations (e.g., wavy line crossing a dashed wedge), and repetition structures.</li>
<li><strong>Broken characters (3 cases)</strong>: Degraded or partial characters without recovery mechanisms.</li>
<li><strong>Connectivity of superatoms (3 cases)</strong>: Ambiguous permutation of connection points for multi-bonded superatoms.</li>
<li><strong>Problematic bridge bonds (3 cases)</strong>: Extreme perspective or angles outside MolRec&rsquo;s thresholds.</li>
<li><strong>Unhandled bond type (1 case)</strong>: A dashed dative bond not previously encountered.</li>
</ul>
<p><strong>System Strengths</strong>:</p>
<ul>
<li>Douglas-Peucker line simplification proves faster and more robust than Hough transforms across different drawing styles</li>
<li>Disk-growing wedge bond detection effectively distinguishes 3D orientations in most cases</li>
<li>Mining MOL files for superatom dictionary captures real-world chemical abbreviation usage patterns</li>
</ul>
<p><strong>Fundamental Limitations Revealed</strong>:</p>
<ul>
<li><strong>Brittleness</strong>: Small variations in drawing style or image quality can cause cascading failures</li>
<li><strong>Stereochemistry ambiguity</strong>: Even humans disagree on ambiguous cases; automated resolution based purely on syntax is inherently limited</li>
<li><strong>Segmentation dependence</strong>: Most failures trace back to incorrect separation of text, bonds, and graphical elements</li>
<li><strong>No error recovery</strong>: Early-stage mistakes propagate through the pipeline with no mechanism for correction</li>
</ul>
<p><strong>Test Set Quality Issues</strong>: The paper also highlights several cases where the TREC 2011 ground truth itself was questionable. Some solution MOL files omitted stereo bond information for solid circle notations, dative (polar) bonds were inconsistently interpreted as either double bonds or single bonds across the training and test sets, and one diagram contained over-connected carbon atoms (5 bonds without the required positive charge indication) that the solution MOL file did not flag.</p>
<p>The systematic error analysis reveals what 95% accuracy means in practice. The failure modes highlight scalability challenges for rule-based systems when applied to diverse real-world documents with noise, artifacts, and non-standard conventions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dictionary Mining</td>
          <td>OSRA Dataset</td>
          <td>Unknown</td>
          <td>Mined to create superatom dictionary for abbreviations like &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;</td>
      </tr>
      <tr>
          <td>Dictionary</td>
          <td>Marvin Collection</td>
          <td>N/A</td>
          <td>Integrated Marvin abbreviation group collection for additional superatoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 Test Set</td>
          <td>1,000 images</td>
          <td>Standard benchmark for Text REtrieval Conference Chemical Track</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The MolRec pipeline consists of sequential image processing and graph construction stages:</p>
<p><strong>1. Preprocessing</strong></p>
<ul>
<li><strong>Binarization</strong>: Input image converted to binary</li>
<li><strong>Connected Component Labeling</strong>: Identifies distinct graphical elements</li>
<li><strong>OCR</strong>: Simple metric space-based engine identifies characters (letters $L$, digits $N$, symbols $S$)</li>
<li><strong>Character Grouping</strong>: Spatial proximity and type-based heuristics group characters:
<ul>
<li>Horizontal: Letter-Letter, Digit-Digit, Letter-Symbol</li>
<li>Vertical: Letter-Letter only</li>
<li>Diagonal: Letter-Digit, Letter-Charge</li>
</ul>
</li>
</ul>
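<p>The grouping constraints above can be encoded as a small lookup table (a hypothetical encoding; the paper states the rules but not their implementation, and the type codes are assumptions):</p>

```python
# Allowed character-pair groupings by direction, per the heuristics above.
# L = letter, N = digit, S = symbol/charge.
ALLOWED = {
    "horizontal": {("L", "L"), ("N", "N"), ("L", "S")},
    "vertical":   {("L", "L")},
    "diagonal":   {("L", "N"), ("L", "S")},  # Letter-Digit, Letter-Charge
}

def may_group(direction, first_type, second_type):
    return (first_type, second_type) in ALLOWED[direction]

print(may_group("horizontal", "L", "L"))  # → True, e.g. "C" followed by "l"
print(may_group("vertical", "L", "N"))    # → False
```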
<p><strong>2. Vectorization (Line Finding)</strong></p>
<ul>
<li><strong>Image Thinning</strong>: Reduce lines to unit width</li>
<li><strong>Douglas-Peucker Algorithm</strong>: Simplify polylines into straight line segments</li>
<li><strong>Joint Breaking</strong>: Explicitly split lines at junctions where $&gt;2$ segments meet, avoiding combinatorial connection complexity</li>
</ul>
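<p>The Douglas-Peucker step is a standard algorithm; a generic reimplementation (not MolRec&rsquo;s code) looks like this:</p>

```python
# Douglas-Peucker polyline simplification: the step that turns thinned pixel
# chains into straight bond segments.
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / norm

def douglas_peucker(points, epsilon):
    """Recursively drop points closer than epsilon to the chord."""
    if len(points) < 3:
        return list(points)
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right

# A noisy but essentially straight pixel chain collapses to its two endpoints:
chain = [(0, 0), (1, 0.1), (2, -0.1), (3, 0.05), (4, 0)]
print(douglas_peucker(chain, 0.5))  # → [(0, 0), (4, 0)]
```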
<p><strong>3. Bond Recognition Rules</strong></p>
<p>After erasing text from the image, remaining line segments are analyzed:</p>
<ul>
<li><strong>Double/Triple Bonds</strong>: Cluster segments with same slope within threshold distance</li>
<li><strong>Dashed Bonds</strong>: Identify repeated short segments of similar length with collinear center points</li>
<li><strong>Wedge/Bold Bonds</strong>: Dynamic disk algorithm:
<ul>
<li>Place disk with radius $&gt;$ average line width inside component</li>
<li>Grow disk to maximum size to locate triangle base (stereo-center)</li>
<li>&ldquo;Walk&rdquo; disk to find narrow end, distinguishing wedge orientation</li>
</ul>
</li>
<li><strong>Wavy Bonds</strong>: Identify sawtooth pattern polylines after thinning</li>
<li><strong>Implicit Nodes</strong>: Split longer segments at points where parallel shorter segments terminate (carbon atoms in chains)</li>
</ul>
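<p>The double/triple-bond rule can be sketched as pairing segments with matching slope and nearby midpoints (the thresholds below are illustrative, not the paper&rsquo;s values):</p>

```python
# Sketch of the double/triple-bond rule: cluster segments whose directions
# agree within a tolerance and whose midpoints lie within a distance threshold.
import math

def angle(seg):
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi  # direction modulo 180 deg

def midpoint(seg):
    (x1, y1), (x2, y2) = seg
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def parallel_pairs(segments, angle_tol=0.1, dist_tol=10.0):
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            da = abs(angle(segments[i]) - angle(segments[j]))
            da = min(da, math.pi - da)  # wrap-around at 180 deg
            if da < angle_tol and math.dist(midpoint(segments[i]),
                                            midpoint(segments[j])) < dist_tol:
                pairs.append((i, j))
    return pairs

segs = [((0, 0), (20, 0)),    # bond line
        ((0, 3), (20, 3)),    # close parallel partner → double bond
        ((30, 0), (40, 15))]  # unrelated bond
print(parallel_pairs(segs))   # → [(0, 1)]
```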
<p><strong>4. Graph Construction</strong></p>
<ul>
<li><strong>Node Formation</strong>: Group line segment endpoints by distance threshold</li>
<li><strong>Disambiguation</strong>: Logic separates lowercase &ldquo;l&rdquo;, uppercase &ldquo;I&rdquo;, digit &ldquo;1&rdquo;, and vertical bonds</li>
<li><strong>Superatom Expansion</strong>: Replace abbreviations with full structures using mined dictionary</li>
<li><strong>Stereochemistry Resolution</strong>: Heuristics based on neighbor counts determine direction for ambiguous bold/dashed bonds (known limitation)</li>
</ul>
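<p>Node formation can be sketched as greedy clustering of segment endpoints under a distance threshold (an illustrative reconstruction; the paper does not give the exact procedure or threshold value):</p>

```python
# Endpoints within `threshold` of an existing node's centroid merge into that
# node; each bond then connects two node indices.
import math

def cluster_endpoints(segments, threshold=4.0):
    nodes = []           # [sum_x, sum_y, count] accumulators per node
    segment_nodes = []   # (node_a, node_b) per segment
    def assign(p):
        for k, (sx, sy, n) in enumerate(nodes):
            if math.dist(p, (sx / n, sy / n)) < threshold:
                nodes[k] = [sx + p[0], sy + p[1], n + 1]
                return k
        nodes.append([p[0], p[1], 1])
        return len(nodes) - 1
    for a, b in segments:
        segment_nodes.append((assign(a), assign(b)))
    return segment_nodes, [(sx / n, sy / n) for sx, sy, n in nodes]

# Three bonds meeting near one point collapse to a shared node:
edges, atoms = cluster_endpoints([((0, 0), (20, 0)),
                                  ((20, 1), (35, 12)),
                                  ((19, -1), (30, -15))])
print(edges)  # → [(0, 1), (1, 2), (1, 3)]
```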
<p><strong>5. MOL File Generation</strong></p>
<ul>
<li>Final graph structure converted to standard MOL file format</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Recall</td>
          <td>950/1000</td>
          <td>949/1000</td>
          <td>Slightly different internal parameters between runs</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>95.0%</td>
          <td>94.9%</td>
          <td>Semantic comparison using OpenBabel</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison Method</strong>: OpenBabel converts graphs to MOL files and compares them semantically to ground truth, ignoring syntactic variations that don&rsquo;t affect chemical meaning.</p>
<p><strong>Failure Categorization</strong>: 55 unique misrecognized diagrams analyzed across both runs, identifying 61 specific error reasons across 11 categories including dashed wedge bond misidentification (15), incorrect stereochemistry (10), touching components (6), incorrect character grouping (5), solid circles (5), diagram caption confusion (5), unrecognised syntax (5), broken characters (3), superatom connectivity (3), problematic bridge bonds (3), and unhandled bond type (1).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://openbabel.org/">Open Babel</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Used for semantic MOL file comparison</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/projects/osra/">OSRA</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Source of superatom dictionary data (MOL files mined)</td>
      </tr>
      <tr>
          <td>TREC 2011 Chemical Track</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>1,000 molecular diagram images (available via NIST)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The MolRec source code is not publicly available. The evaluation dataset (TREC 2011) is accessible through NIST, and the tools used for comparison (OpenBabel) are open source. However, full reproduction of MolRec&rsquo;s pipeline would require reimplementation from the paper&rsquo;s descriptions.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Details</strong>: Not explicitly specified in the paper</li>
<li><strong>Performance Note</strong>: Vectorization approach noted as &ldquo;proven to be fast&rdquo; compared to Hough transform alternatives</li>
</ul>
<h3 id="references">References</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawiPerformanceMolRecTREC2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of {{MolRec}} at {{TREC}} 2011 {{Overview}} and {{Analysis}} of {{Results}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 20th {{Text REtrieval Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sadawi, Noureddin M. and Sexton, Alan P. and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2011). Performance of MolRec at TREC 2011 Overview and Analysis of Results. <em>Proceedings of the 20th Text REtrieval Conference</em>. <a href="https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf">https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf</a></p>
<p><strong>Publication</strong>: TREC 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openbabel.org/">Open Babel</a> - Used for semantic MOL file comparison</li>
<li><a href="https://sourceforge.net/projects/osra/">OSRA Project</a> - Source of superatom dictionary data</li>
</ul>
]]></content:encoded></item><item><title>MolRec: Chemical Structure Recognition at CLEF 2012</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</guid><description>MolRec achieves 95%+ accuracy on simple structures but struggles with complex diagrams, revealing rule-based OCSR limits and systematic failures.</description><content:encoded><![CDATA[<h2 id="systematization-of-rule-based-ocsr">Systematization of Rule-Based OCSR</h2>
<p>This is a <strong>Systematization</strong> paper that evaluates and analyzes MolRec&rsquo;s performance in the CLEF 2012 chemical structure recognition competition. The work provides systematic insights into how the improved MolRec system performed on different types of molecular diagrams and reveals structural challenges facing rule-based OCSR approaches through comprehensive failure analysis.</p>
<h2 id="investigating-the-limits-of-rule-based-recognition">Investigating the Limits of Rule-Based Recognition</h2>
<p>This work builds on the TREC 2011 competition, where a previous implementation of MolRec already performed well. The CLEF 2012 competition provided an opportunity to test an improved, more computationally efficient version of MolRec on different datasets and understand how performance varies across complexity levels.</p>
<p>The motivation is to understand exactly where rule-based chemical structure recognition breaks down. Examining the specific types of structures that cause failures provides necessary context for the high accuracy rates achieved on simpler structures.</p>
<h2 id="the-two-stage-molrec-architecture">The Two-Stage MolRec Architecture</h2>
<p>The novelty lies in the systematic evaluation across two different difficulty levels and the comprehensive failure analysis. The authors tested an improved MolRec implementation that was more efficient than the TREC 2011 version, providing insights into both system evolution and the inherent challenges of chemical structure recognition.</p>
<p><strong>MolRec Architecture Overview</strong>: The system follows a two-stage pipeline approach:</p>
<ol>
<li>
<p><strong>Vectorization Stage</strong>: The system preprocesses input images through three steps:</p>
<ul>
<li><strong>Image binarization</strong> using Otsu&rsquo;s method to convert grayscale images to black and white, followed by labelling of connected components</li>
<li><strong>OCR processing</strong> using nearest neighbor classification with a Euclidean metric to identify and remove text components (atom labels, charges, etc.)</li>
<li><strong>Separation of bond elements</strong>: thinning connected components to single-pixel width, building polyline representations, detecting circles, arrows, and solid triangles, and applying the Douglas-Peucker line simplification algorithm to clean up vectorized bonds</li>
</ul>
</li>
<li>
<p><strong>Rule Engine Stage</strong>: A set of 18 chemical rules converts geometric primitives into molecular graphs:</p>
<ul>
<li><strong>Bridge bond recognition</strong> (2 rules applied before all others, handling structures with multiple connection paths depicted in 2.5-dimensional perspective drawings)</li>
<li><strong>Standard bond and atom recognition</strong> (16 rules applied in arbitrary order)</li>
<li><strong>Context-aware disambiguation</strong> resolving ambiguities using the full graph structure and character groups</li>
<li><strong>Superatom expansion</strong> looking up character groups identifying more than one atom in a dictionary and replacing them with molecule subgraphs</li>
</ul>
</li>
</ol>
<p>The system can output results in standard formats like MOL files or SMILES strings.</p>
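<p>The first preprocessing step, Otsu&rsquo;s method, chooses the binarization threshold that maximises between-class variance of the grey-level histogram; a generic textbook implementation (not MolRec&rsquo;s code):</p>

```python
# Otsu's threshold: split the grey-level histogram into two classes so that
# the weighted between-class variance is maximised.
def otsu_threshold(histogram):
    """histogram: counts for grey levels 0..255; returns the optimal level."""
    total = sum(histogram)
    sum_all = sum(i * h for i, h in enumerate(histogram))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += histogram[t]          # class-0 (dark) pixel count
        if w0 == 0:
            continue
        w1 = total - w0             # class-1 (light) pixel count
        if w1 == 0:
            break
        sum0 += t * histogram[t]
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# A bimodal histogram: dark ink around level 20, light paper around level 220.
hist = [0] * 256
hist[20] = 300   # ink pixels
hist[220] = 700  # background pixels
print(otsu_threshold(hist))  # → 20 (separates ink from paper)
```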
<h2 id="clef-2012-experimental-design">CLEF 2012 Experimental Design</h2>
<p>The CLEF 2012 organizers provided a set of 961 test images clipped from patent documents, split into two sets:</p>
<ol>
<li>
<p><strong>Automatic Evaluation Set (865 images)</strong>: Images selected for automatic evaluation by comparison of generated MOL files with ground truth MOL files using the OpenBabel toolkit.</p>
</li>
<li>
<p><strong>Manual Evaluation Set (95 images)</strong>: A more challenging collection of images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation. This set was intentionally included to provide a greater challenge.</p>
</li>
</ol>
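<p>The automatic evaluation compared generated MOL files against ground truth with OpenBabel. The underlying idea of exact structural matching can be sketched with a simplified order-independent hash of molecular graphs (a Weisfeiler-Lehman-style refinement, assumed here for illustration; it is not OpenBabel&rsquo;s actual algorithm and, like all such refinements, is not a full canonicalization):</p>

```python
def canonical_hash(atoms, bonds, rounds=3):
    """Order-independent fingerprint of a molecular graph.

    `atoms` maps atom index -> element symbol; `bonds` is a list of
    (i, j, order) tuples. Iterative neighborhood refinement in the
    spirit of Morgan/Weisfeiler-Lehman; a sketch, not a complete
    canonicalization.
    """
    adj = {i: [] for i in atoms}
    for i, j, order in bonds:
        adj[i].append((j, order))
        adj[j].append((i, order))
    labels = {i: atoms[i] for i in atoms}
    for _ in range(rounds):
        labels = {
            i: labels[i] + "|" + ",".join(
                sorted(f"{order}:{labels[j]}" for j, order in adj[i])
            )
            for i in atoms
        }
    return tuple(sorted(labels.values()))

def same_structure(mol_a, mol_b):
    """Exact-match check: True iff the two canonical hashes agree."""
    return canonical_hash(*mol_a) == canonical_hash(*mol_b)
```

<p>Two depictions of the same molecule with different atom numbering hash identically, which is the property the automatic evaluation relies on.</p>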
<p>The authors ran MolRec four times with slightly adjusted internal parameters, then manually examined every incorrect recognition to categorize error types.</p>
<h2 id="performance-divergence-and-critical-failure-modes">Performance Divergence and Critical Failure Modes</h2>
<p>The results reveal a stark performance gap between simple and complex molecular structures:</p>
<p><strong>Performance on Automatic Evaluation Set</strong>: On the 865-image set, MolRec achieved <strong>94.91% to 96.18% accuracy</strong> across four runs with different parameter settings. A total of 46 different diagrams were mis-recognized across all runs.</p>
<p><strong>Performance on Manual Evaluation Set</strong>: On the 95-image set, accuracy dropped to <strong>46.32% to 58.95%</strong>. A total of 52 different diagrams were mis-recognized across all runs. Some diagrams failed for multiple reasons.</p>
<p><strong>Key Failure Modes Identified</strong> (with counts from the paper&rsquo;s Table 3):</p>
<ul>
<li>
<p><strong>Character Grouping</strong> (26 manual, 0 automatic): An implementation bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, so $R_{21}$ was incorrectly recognized as $R_{211}$. A separate problem was difficulty correctly separating closely spaced atom groups.</p>
</li>
<li>
<p><strong>Touching Characters</strong> (8 manual, 1 automatic): The system does not handle touching characters, so overlapping characters cause mis-recognition.</p>
</li>
<li>
<p><strong>Four-Way Junction Failures</strong> (6 manual, 7 automatic): The vectorization process could not handle junctions where four lines meet, leading to incorrect connectivity.</p>
</li>
<li>
<p><strong>OCR Errors</strong> (5 manual, 11 automatic): Character recognition errors included &ldquo;G&rdquo; interpreted as &ldquo;O&rdquo;, &ldquo;alkyl&rdquo; being mis-recognized, and &ldquo;I&rdquo; interpreted as a vertical single bond.</p>
</li>
<li>
<p><strong>Mis-recognized Solid and Dashed Wedge Bonds</strong> (0 manual, 6 automatic for each bond type): The system incorrectly recognized several solid wedge and dashed wedge bonds.</p>
</li>
<li>
<p><strong>Missed Wavy Bonds</strong> (2 manual, 1 automatic): Some wavy bonds were not recognized despite the dedicated wavy bond rule.</p>
</li>
<li>
<p><strong>Missed Charge Signs</strong> (1 manual, 2 automatic): While correctly recognizing positive charge signs, MolRec missed three negative charge signs, including one placed at the top left of an atom name.</p>
</li>
<li>
<p><strong>Other Errors</strong>: An atom too close to a bond endpoint was erroneously considered connected, a solid wedge bond too close to a closed node was incorrectly connected, and a dashed bold bond had its stereocentre incorrectly determined.</p>
</li>
</ul>
<p><strong>Dataset Quality Issues</strong>: The authors discovered 11 images where the ground truth MOL files were incorrect and MolRec&rsquo;s recognition was actually correct. As the authors note, such ground truth errors are very difficult to avoid in this complex task.</p>
<p><strong>Key Insights</strong>:</p>
<ul>
<li>
<p><strong>Performance gap between simple and complex structures</strong>: While MolRec achieved over 94% accuracy on standard molecular diagrams, the performance drop on the more challenging manual evaluation set (down to 46-59%) highlights the difficulty of real patent document images.</p>
</li>
<li>
<p><strong>Many errors are fixable</strong>: The authors note that many mis-recognition problems (such as the character grouping bug and four-way junction vectorization) can be solved with relatively simple enhancements.</p>
</li>
<li>
<p><strong>Touching character segmentation</strong> remains a notoriously difficult open problem that the authors plan to explore further.</p>
</li>
<li>
<p><strong>Evaluation challenges</strong>: The discovery of 11 incorrect ground truth MOL files illustrates how difficult it is to create reliable benchmarks for chemical structure recognition.</p>
</li>
</ul>
<p>The authors conclude that despite the high recognition rates on simpler structures, there is still plenty of room for improvement. They identify future work areas including recognition of more general Markush structures, robust charge sign spotting, and accurate identification of wedge bonds.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="system-architecture">System Architecture</h3>
<p><strong>Model Type</strong>: Non-neural, Rule-Based System (Vectorization Pipeline + Logic Engine)</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Datasets (CLEF 2012)</strong>: 960 total test images clipped from patent documents:</p>
<ul>
<li><strong>Automatic Evaluation Set</strong>: 865 images evaluated automatically using OpenBabel for exact structural matching of generated MOL files against ground truth</li>
<li><strong>Manual Evaluation Set</strong>: 95 images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation</li>
</ul>
<p><strong>Training Data</strong>: The paper does not describe the reference set used to build OCR character prototypes for the nearest neighbor classifier.</p>
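<p>The OCR step can be sketched as a 1-nearest-neighbor classifier over flattened glyph bitmaps with a Euclidean metric (a generic sketch; the prototype set and bitmap size below are assumptions, since the paper does not publish them):</p>

```python
import numpy as np

def nn_classify(glyph, prototypes):
    """Classify a binarized glyph bitmap by its nearest prototype.

    `glyph` is a 2-D 0/1 array; `prototypes` maps character -> bitmap of
    the same shape. Returns the character whose prototype is closest in
    Euclidean distance to the flattened glyph.
    """
    v = glyph.astype(float).ravel()
    best, best_d = None, float("inf")
    for char, proto in prototypes.items():
        d = np.linalg.norm(v - proto.astype(float).ravel())
        if d < best_d:
            best, best_d = char, d
    return best
```

<p>Recognized characters are then removed from the image so that only bond strokes remain for vectorization.</p>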
<h3 id="algorithms">Algorithms</h3>
<p><strong>Vectorization Pipeline</strong> (three steps):</p>
<ul>
<li><strong>Image Binarization</strong>: Otsu&rsquo;s method, followed by connected component labelling</li>
<li><strong>OCR</strong>: Nearest neighbor classification with Euclidean distance metric; recognized characters are removed from the image</li>
<li><strong>Bond Element Separation</strong>: Thinning to single-pixel width, polyline construction, Douglas-Peucker line simplification (threshold set to 1-2x average line width), detection of circles, arrows, and solid triangles</li>
</ul>
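<p>The Douglas-Peucker step can be sketched in a few lines; the epsilon parameter plays the role of the paper&rsquo;s 1-2x average line width threshold (a generic textbook implementation, not MolRec&rsquo;s code):</p>

```python
import math

def _point_line_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / length

def douglas_peucker(points, epsilon):
    """Simplify a polyline, keeping vertices farther than epsilon from
    the chord between the current endpoints (recursive formulation)."""
    if len(points) < 3:
        return list(points)
    # Find the interior vertex farthest from the chord.
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_line_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax > epsilon:
        left = douglas_peucker(points[: idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]
```

<p>Jitter from thinning collapses to a single straight segment, while genuine corners (and hence real bond junctions) survive simplification.</p>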
<p><strong>Rule Engine</strong>: 18 chemical structure rules converting geometric primitives to molecular graphs:</p>
<ul>
<li><strong>Bridge Bond Rules (2 rules)</strong>: Applied before all other rules, handling bridge bond structures depicted in 2.5-dimensional perspective drawings</li>
<li><strong>Wavy Bond Rule</strong>: Detailed in paper, identifies approximately collinear connected line segments with zig-zag patterns ($n \geq 3$ segments)</li>
<li><strong>Standard Recognition Rules</strong>: 16 rules for bonds, atoms, and chemical features (applied in arbitrary order; most not detailed in this paper)</li>
</ul>
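<p>The wavy bond rule, the only rule detailed in the paper, can be sketched as a geometric check: the polyline must contain at least three connected segments that alternate turn direction while its vertices stay close to the overall axis. The tolerance values below are illustrative assumptions, not MolRec&rsquo;s actual thresholds:</p>

```python
import math

def looks_like_wavy_bond(points, n_min=3, tol=0.35):
    """Heuristic wavy-bond check, per the paper's description: at least
    `n_min` connected segments that zig-zag (alternate turn direction)
    while remaining approximately collinear overall.
    `tol` is an illustrative collinearity tolerance, not MolRec's value."""
    if len(points) < n_min + 1:
        return False
    (x0, y0), (x1, y1) = points[0], points[-1]
    ax_dx, ax_dy = x1 - x0, y1 - y0
    axis_len = math.hypot(ax_dx, ax_dy)
    if axis_len == 0.0:
        return False
    # Interior vertices must stay close to the overall axis (collinearity).
    for (px, py) in points[1:-1]:
        dist = abs(ax_dx * (y0 - py) - ax_dy * (x0 - px)) / axis_len
        if dist > tol * axis_len:
            return False
    # Consecutive segments must alternate turn direction (zig-zag):
    # cross products of successive segment pairs must alternate in sign.
    crosses = []
    for a, b, c in zip(points, points[1:], points[2:]):
        cross = (b[0] - a[0]) * (c[1] - b[1]) - (b[1] - a[1]) * (c[0] - b[0])
        crosses.append(cross)
    return all(c1 * c2 < 0 for c1, c2 in zip(crosses, crosses[1:]))
```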
<p><strong>Optimization</strong>: Performance tuned via manual adjustment of fuzzy and strict geometric threshold parameters.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Automated</strong>: Exact structural match via OpenBabel MOL file comparison</li>
<li><strong>Manual</strong>: Visual inspection by human experts for structures where OpenBabel fails</li>
</ul>
<p><strong>Results</strong>:</p>
<ul>
<li><strong>Automatic Evaluation Set (865 images)</strong>: 94.91% to 96.18% accuracy across four runs</li>
<li><strong>Manual Evaluation Set (95 images)</strong>: 46.32% to 58.95% accuracy across four runs</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Given the era (2012) and algorithmic nature (Otsu, thinning, geometric analysis), likely ran on standard CPUs.</p>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p><strong>Closed.</strong> No public code, data, or models are available. The paper outlines high-level logic (Otsu binarization, Douglas-Peucker simplification, 18-rule system) but does not provide:</p>
<ul>
<li>The complete specification of all 18 rules (only Rule 2.2 for wavy bonds is detailed)</li>
<li>Exact numerical threshold values for fuzzy/strict parameters used in the CLEF runs</li>
<li>OCR training data or character prototype specifications</li>
</ul>
<p>The authors refer readers to a separate 2012 DRR (SPIE) paper [5] for a more detailed overview of the MolRec system architecture.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012: Overview and Analysis of Results. Working Notes of CLEF 2012 Evaluation Labs and Workshop. CLEF. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Workshop (ImageCLEF Track)</p>
]]></content:encoded></item></channel></rss>