Optical Chemical Structure Recognition

A substantial fraction of chemical knowledge is recorded as 2D diagrams in journals, patents, and textbooks. Optical Chemical Structure Recognition (OCSR) is the task of extracting machine-readable molecular representations from those images: strings like SMILES (a compact text encoding of molecular structure) and InChI (a standardized identifier for chemical substances), or molecular graphs that encode atoms as nodes and bonds as edges. For a longer introduction to the field and its motivations, see the What is OCSR? post.
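
To make the graph representation concrete, here is a minimal Python sketch of a molecular graph for ethanol (SMILES "CCO"). The MolGraph class and its method names are illustrative assumptions, not part of any OCSR tool covered in these notes; real systems would normally rely on a cheminformatics toolkit for this.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Hypothetical minimal molecular graph: atoms are nodes, bonds are edges."""

    atoms: list[str] = field(default_factory=list)  # element symbol per node
    bonds: dict[tuple[int, int], int] = field(default_factory=dict)  # (i, j) -> bond order

    def add_atom(self, symbol: str) -> int:
        """Append an atom and return its node index."""
        self.atoms.append(symbol)
        return len(self.atoms) - 1

    def add_bond(self, i: int, j: int, order: int = 1) -> None:
        """Store an undirected bond under a sorted index pair."""
        self.bonds[(min(i, j), max(i, j))] = order

    def neighbors(self, i: int) -> list[int]:
        """Return the indices of atoms bonded to atom i."""
        return sorted(b if a == i else a for (a, b) in self.bonds if i in (a, b))


# Ethanol: SMILES "CCO", InChI "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3".
# Hydrogens are left implicit, as they usually are in SMILES.
ethanol = MolGraph()
c1 = ethanol.add_atom("C")
c2 = ethanol.add_atom("C")
o = ethanol.add_atom("O")
ethanol.add_bond(c1, c2)
ethanol.add_bond(c2, o)

print(ethanol.atoms)          # element symbols in insertion order
print(ethanol.neighbors(c2))  # the middle carbon is bonded to both other atoms
```

The string forms and the graph carry the same connectivity; the graph simply makes atoms and bonds directly addressable, which is what image-to-graph models predict.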

These notes trace the field from its origins in the early 1990s through to current vision-language approaches. Three broad eras give the collection its shape. The rule-based pioneers (1990s to mid-2010s), including tools like OSRA, MolVec, CLiDE, and Imago, vectorized images and applied hand-coded rules to classify bonds and atoms; their brittleness came from the difficulty of encoding every edge case explicitly. The deep learning transition (roughly 2015 to 2020) replaced those hand-coded rules with models that learned recognition patterns from large synthetic datasets, yielding both image-to-sequence architectures (DECIMER, Img2Mol, Image2SMILES) and image-to-graph architectures (MolGrapher, MolScribe). The current vision-language era (2021 onward), with models like MolParser, GTR-Mol-VLM, and Subgrapher, builds on large pretrained vision-language models to improve generalization across diverse diagram styles and chemical notation conventions.

Beyond the core recognition systems, the collection includes review papers, benchmark and competition write-ups (TREC-CHEM 2011, CLEF-IP 2012), and notes on specialized sub-tasks: hand-drawn structure recognition, Markush structure detection, and component-level problems like ring and bond parsing.

For orientation, the two survey papers are the best starting points: rajan-ocsr-review-2020 covers the rule-based era and benchmarks the transition period, while musazade-ocsr-review-2022 picks up the thread with deep learning methods.

Computational Chemistry

OSRA at CLEF-IP 2012

Benchmarks OSRA on CLEF-IP 2012 patent data, demonstrating that native image processing significantly outperforms external splitting tools. Introduces a pairwise distance algorithm for segmentation that handles overlapping molecules better than bounding boxes.
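
The idea of segmenting by pairwise distance instead of bounding boxes can be sketched as follows. This is not OSRA's actual algorithm: it is a generic single-linkage grouping of connected components, where two components are merged when their closest pixels are within a threshold. The function names, the point-list input format, and the threshold value are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def min_pairwise_distance(a, b):
    """Smallest Euclidean distance between any pixel of a and any pixel of b."""
    return min(dist(p, q) for p in a for q in b)


def group_components(components, threshold):
    """Merge components whose closest pixels lie within threshold (union-find)."""
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(components)), 2):
        if min_pairwise_distance(components[i], components[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# An L-shaped stroke, a fragment lying inside the L's bounding box, and a
# fragment just to the right of the L's horizontal arm.
l_shape = [(0, y) for y in range(11)] + [(x, 0) for x in range(11)]
inner = [(7, 7), (8, 8)]
label = [(12, 0), (13, 0)]

groups = group_components([l_shape, inner, label], threshold=3)
print(groups)
```

A bounding-box test would attach the inner fragment to the L because their boxes overlap, while the distance test keeps it separate (its nearest pixel is 7 units away) and still joins the nearby fragment on the right.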

Overview of TREC 2011 Chemical IR Track

This resource paper details the third TREC Chemical IR campaign, introducing a novel Image-to-Structure task and analyzing 36 runs from 9 groups to benchmark chemical information retrieval.

Probabilistic OCSR with Markov Logic Networks

This paper introduces MLOCSR, a system that feeds the output of low-level image vectorization into a high-level probabilistic Markov Logic Network to recognize chemical structures. It replaces brittle heuristics with weighted logic rules, significantly outperforming state-of-the-art systems like OSRA on degraded or low-resolution images.
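
As a rough illustration of what "weighted logic rules" means, a Markov Logic Network scores each candidate interpretation by the exponentiated sum of the weights of the rule groundings it satisfies, then normalizes. The sketch below applies that scoring to two candidate readings of the same pair of lines. The rule names, weights, and counts are invented for illustration and are not taken from MLOCSR.

```python
from math import exp


def world_probabilities(worlds, weights):
    """Log-linear scoring: P(world) is proportional to exp(sum_i w_i * n_i(world)).

    worlds:  candidate name -> {rule name: number of satisfied groundings}
    weights: rule name -> weight
    """
    scores = {
        name: exp(sum(weights[rule] * count for rule, count in counts.items()))
        for name, counts in worlds.items()
    }
    z = sum(scores.values())  # partition function over the listed candidates
    return {name: score / z for name, score in scores.items()}


# Hypothetical rules and weights (assumptions, not MLOCSR's actual rule set).
weights = {
    "parallel_close_lines_form_double_bond": 1.5,
    "each_line_is_its_own_single_bond": 0.5,
}
worlds = {
    "double_bond": {
        "parallel_close_lines_form_double_bond": 1,
        "each_line_is_its_own_single_bond": 0,
    },
    "two_single_bonds": {
        "parallel_close_lines_form_double_bond": 0,
        "each_line_is_its_own_single_bond": 1,
    },
}

probs = world_probabilities(worlds, weights)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

Because a violated rule lowers a candidate's score instead of rejecting it outright, noisy evidence can be outweighed by other rules, which is the property that matters on degraded images.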

Research on Chemical Expression Images Recognition

Proposes a new OCSR workflow that improves recognition rates by separating touching chemical symbols and explicitly handling dashed and solid wedge bonds using vectorization, achieving a 90% exact-match rate versus 82.2% for the OSRA baseline.
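
For reference, an exact-match rate of this kind is usually the fraction of images whose predicted structure string equals the reference string. The sketch below assumes both lists already hold strings canonicalized by the same toolkit (that step is not shown), and the example strings are made up, not data from the paper.

```python
def exact_match_rate(predicted, reference):
    """Fraction of positions where the predicted string equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Illustrative strings only; assumed to be canonical SMILES.
predicted = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
reference = ["CCO", "c1ccccc1", "CC(=O)O", "CCC"]

rate = exact_match_rate(predicted, reference)
print(f"{rate:.0%}")
```

The metric is strict: a single wrong atom or bond makes the whole structure count as a miss, which is why small gains in bond or symbol handling can move it noticeably.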

Chemical Structure Recognition (Rule-Based)

This paper introduces MolRec, a rule-based system for Optical Chemical Structure Recognition (OCSR). It defines a set of 18 geometric rewrite rules to disambiguate bonds and atoms in vectorised diagram images, demonstrating higher accuracy than the contemporary state-of-the-art (OSRA).
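
To give a flavor of what a geometric rule over vectorized line segments can look like, the sketch below tests whether two segments plausibly form a double bond: nearly parallel, similar in length, side by side, and close together. It is a hypothetical rule written for illustration, not one of MolRec's 18 rules, and every threshold is an assumed value.

```python
from math import atan2, hypot, pi

Segment = tuple[tuple[float, float], tuple[float, float]]


def angle(seg: Segment) -> float:
    """Undirected orientation of a segment, in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return atan2(y2 - y1, x2 - x1) % pi


def length(seg: Segment) -> float:
    (x1, y1), (x2, y2) = seg
    return hypot(x2 - x1, y2 - y1)


def is_double_bond(a: Segment, b: Segment, max_angle: float = 0.1,
                   min_len_ratio: float = 0.7, max_gap_ratio: float = 0.3) -> bool:
    """Hypothetical rule: two close, nearly parallel, side-by-side segments."""
    la, lb = length(a), length(b)
    if la == 0 or lb == 0:
        return False
    # Nearly parallel, treating angles near 0 and near pi as the same direction.
    diff = abs(angle(a) - angle(b))
    if min(diff, pi - diff) > max_angle:
        return False
    # Similar length.
    if min(la, lb) / max(la, lb) < min_len_ratio:
        return False
    (x1, y1), (x2, y2) = a
    mx = (b[0][0] + b[1][0]) / 2
    my = (b[0][1] + b[1][1]) / 2
    # The midpoint of b must project onto a (side by side, not end to end).
    t = ((mx - x1) * (x2 - x1) + (my - y1) * (y2 - y1)) / (la * la)
    if not 0.0 <= t <= 1.0:
        return False
    # Perpendicular distance from b's midpoint to the line through a.
    gap = abs((x2 - x1) * (y1 - my) - (x1 - mx) * (y2 - y1)) / la
    return 0.0 < gap <= max_gap_ratio * max(la, lb)


base = ((0, 0), (10, 0))
print(is_double_bond(base, ((0, 1.5), (10, 1.5))))  # close and parallel
print(is_double_bond(base, ((0, 0), (0, 10))))      # perpendicular
print(is_double_bond(base, ((0, 8), (10, 8))))      # parallel but too far apart
```

A rule-based system chains many such tests, which is also where the brittleness comes from: each threshold has to hold across drawing styles, resolutions, and scan quality.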

ChemInk: Real-Time Recognition for Chemical Drawings

ChemInk introduces a sketch recognition system for chemical diagrams that combines multi-level visual features via a joint Conditional Random Field (CRF), achieving 97.4% accuracy and outperforming CAD tools in user speed.

CLiDE Pro: Optical Chemical Structure Recognition Tool

This paper introduces CLiDE Pro, an advanced OCSR system that segments document images and reconstructs chemical connection tables. It features novel handling for crossing bonds and generic structures, validating performance on a publicly released benchmark of 454 scanned images.

Imago: Structure Recognition at TREC-CHEM 2011

Imago is an open-source, cross-platform C++ toolkit designed to recognize 2D chemical structure images from scientific papers and convert them into machine-readable molecule formats using a rule-based pipeline.

Kekulé-1 System for Chemical Structure Recognition

This paper introduces Kekulé-1, one of the first successful Optical Chemical Structure Recognition (OCSR) systems. It details a hybrid approach using neural networks for character recognition and heuristic vectorization for bond detection, achieving 98.9% accuracy on a test set of 524 structures.

OSRA: Optical Structure Recognition Application

This paper details the algorithmic pipeline of OSRA, an open-source tool that converts raster images of chemical diagrams into connection tables (SMILES/SDF). It outlines specific heuristics for page segmentation, vectorization, and atom recognition used in the TREC-CHEM Image2Structure task.

Structural Analysis of Handwritten Chemical Formulas

This paper proposes a strategy for interpreting handwritten chemical formulas by converting bitmap images into a dynamic structural graph of quadrilaterals. It achieves ~97% recognition on graphical elements by using recursive ‘specialists’ to identify chemical bonds and rings.

Automatic Recognition of Chemical Images

This methodological paper presents a system for digitizing chemical images into SDF files. It utilizes a custom vectorization algorithm and chemical rule validation, achieving 94% accuracy on benchmark datasets compared to 50% for commercial tools.