Notes on recognizing molecular structures from images, covering 35 years of methods: from rule-based vectorization to vision-language models.
A substantial fraction of chemical knowledge is recorded as 2D diagrams in journals, patents, and textbooks. Optical Chemical Structure Recognition (OCSR) is the task of extracting machine-readable molecular representations from those images: strings like SMILES (a compact text encoding of molecular structure) and InChI (a standardized identifier for chemical substances), or molecular graphs that encode atoms as nodes and bonds as edges. For a longer introduction to the field and its motivations, see the "What is OCSR?" post.
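To make the output formats concrete, here is a minimal sketch of the same molecule (ethanol) as a SMILES string, an InChI string, and a small atom/bond graph. The `MolGraph` class and `chain_smiles_to_graph` helper are hypothetical names invented for illustration. The parser is a toy that only handles unbranched chains of single-letter atoms, nowhere near a full SMILES implementation; real pipelines would rely on a cheminformatics toolkit instead.

```python
from dataclasses import dataclass, field

# Ethanol in the two string formats mentioned above.
ETHANOL_SMILES = "CCO"
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


@dataclass
class MolGraph:
    """Hypothetical minimal molecular graph: atoms are nodes, bonds are edges."""
    atoms: list = field(default_factory=list)  # element symbols, indexed by position
    bonds: list = field(default_factory=list)  # (atom_i, atom_j, bond_order) tuples


# Assumption: only these single-letter atoms and explicit bond symbols are supported.
SINGLE_LETTER_ATOMS = {"B", "C", "N", "O", "P", "S", "F", "I"}
BOND_ORDERS = {"=": 2, "#": 3}


def chain_smiles_to_graph(smiles: str) -> MolGraph:
    """Toy converter for unbranched SMILES chains (no rings, branches, or charges)."""
    graph = MolGraph()
    pending_order = 1  # bond order to use when the next atom is attached
    for ch in smiles:
        if ch in BOND_ORDERS:
            pending_order = BOND_ORDERS[ch]
        elif ch in SINGLE_LETTER_ATOMS:
            graph.atoms.append(ch)
            if len(graph.atoms) > 1:
                i = len(graph.atoms) - 1
                graph.bonds.append((i - 1, i, pending_order))
            pending_order = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return graph


ethanol = chain_smiles_to_graph(ETHANOL_SMILES)
print(ethanol.atoms)  # element symbols in chain order
print(ethanol.bonds)  # single bonds between consecutive atoms
```

Hydrogens are left implicit here, as they usually are in SMILES; the graph only stores heavy atoms and the bonds between them.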
These notes trace the field from its origins in the early 1990s through to current vision-language approaches. Three broad eras give the collection its shape. The rule-based pioneers (1990s to mid-2010s), including tools like OSRA, MolVec, CLiDE, and Imago, vectorized images and applied hand-coded rules to classify bonds and atoms; their brittleness came from the difficulty of encoding every edge case explicitly. The deep learning transition (roughly 2015 to 2020) replaced those hand-coded rules with models that learned recognition patterns from large synthetic datasets, yielding both image-to-sequence architectures (DECIMER, Img2Mol, Image2SMILES) and image-to-graph architectures (MolGrapher, MolScribe). The current vision-language era (2021 onward), with models like MolParser, GTR-Mol-VLM, and Subgrapher, builds on large pretrained vision-language models to improve generalization across diverse diagram styles and chemical notation conventions.
Beyond the core recognition systems, the collection includes review papers, benchmark and competition write-ups (TREC-Chem 2011, CLEF-IP 2012), and notes on specialized sub-tasks: hand-drawn structure recognition, Markush structure detection, and component-level problems like ring and bond parsing.
For orientation, the two survey papers are the best starting points: rajan-ocsr-review-2020 covers the rule-based era and benchmarks the transition period, while musazade-ocsr-review-2022 picks up the thread with deep learning methods.
HMM-based method for recognizing online handwritten chemical symbols using 11-dimensional local features including derivatives, curvature, and linearity. Achieves 89.5% top-1 accuracy and 98.7% top-3 accuracy on a custom dataset of 64 chemical symbols.
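Several summaries in this collection report top-k accuracy, so a short sketch of the metric may help. It counts a sample as correct when the true label appears anywhere in the recognizer's k highest-ranked candidates. The function name and the toy predictions below are assumptions made for illustration; they are not data from any of the papers.

```python
def top_k_accuracy(ranked_predictions, labels, k):
    """Fraction of samples whose true label is among the first k ranked candidates."""
    if len(ranked_predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, labels)
        if truth in ranked[:k]
    )
    return hits / len(labels)


# Hypothetical ranked candidates for four handwritten symbols (best guess first).
predictions = [
    ["C", "G", "O"],
    ["O", "C", "Q"],
    ["N", "H", "M"],
    ["Cl", "C", "Ci"],
]
labels = ["C", "C", "M", "S"]

print(top_k_accuracy(predictions, labels, k=1))  # 0.25
print(top_k_accuracy(predictions, labels, k=3))  # 0.75
```

The gap between top-1 and top-3 in this toy example mirrors the pattern in the reported figures: the correct symbol is often among the leading candidates even when it is not ranked first, which is why a downstream structural or grammatical stage can recover many of those cases.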
Img2Mol: Accurate SMILES from Molecular Depictions
A 2021 deep learning system using a two-stage approach for OCSR, encoding images into continuous CDDD embeddings before decoding to SMILES. It leverages extensive data augmentation to handle rotations, distortions, and rendering variations for fast and robust molecular structure recognition.
On-line Handwritten Chemical Expression Recognition
Yang et al. propose a two-level recognition system for handwritten chemical formulas, combining global structural analysis to identify substances with local character recognition using ANNs, achieving ~96% accuracy on a dataset of 1197 expressions.
Online Handwritten Chemical Formula Structure Analysis
A three-level grammatical framework (formula, molecule, text) for parsing online handwritten chemical formulas, generating semantic graphs that capture both connectivity and layout using context-free grammars and HMMs.
Recognition of On-line Handwritten Chemical Expressions
Proposes a novel two-level algorithm for on-line handwritten chemical expression recognition, combining substance-level matching with character-level segmentation to achieve 96% accuracy.
This paper reviews three decades of OCSR development, transitioning from rule-based heuristics to early deep learning approaches. It includes a benchmark study comparing the performance of three open-source tools (OSRA, Imago, MolVec) on four diverse datasets.
This paper proposes a double-stage architecture using SVM for rough classification and HMM for fine recognition. It features a novel Point Sequence Reordering (PSR) algorithm that significantly improves accuracy on organic ring structures.
Unified Framework for Handwritten Chemical Expressions
Proposes a unified statistical framework for recognizing both inorganic and organic handwritten chemical expressions. Introduces the Chemical Expression Structure Graph (CESG) and uses a weighted direction graph search for structural analysis, achieving 83.1% top-5 accuracy on a large proprietary dataset.
Describes chemoCR, a system that converts bitmap chemical diagrams into connection tables using a pipeline of texture-based vectorization, OCR, and a rule-based expert system, achieving 65.6% perfect recall on the TREC 2011 task.
ChemReader achieved 93% accuracy on the TREC 2011 Image-to-Structure task, with detailed error analysis revealing the need for improved chemical intelligence in bond recognition and node merging algorithms.
Describes the MolRec system’s performance in the CLEF 2012 Chemical Structure Recognition task, detailing its rule-based vectorization engine and analyzing failure modes like touching characters and complex bond types.