OCSR methods using vectorization, heuristics, and hand-coded rules to extract molecular structures from images (1990-2015).
The earliest OCSR systems converted raster images into vector primitives (lines, arcs, characters) and applied hand-coded chemical rules to assemble those primitives into molecular graphs. Pioneering tools like Kekulé (1992), CLiDE (1993), and the Contreras system (1990) established the core pipeline: binarize, thin, vectorize, classify atoms and bonds, then compile a connection table. Later systems such as OSRA, ChemReader, and CLiDE Pro refined each stage with better segmentation, chemical spell-checking, and support for superatom labels. The approach dominated the field for over two decades, but brittleness in the face of diverse drawing styles, noisy scans, and edge-case notation ultimately motivated the shift to learned representations.
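The final stage of that pipeline, compiling vector primitives into a connection table, can be illustrated with a minimal sketch. It assumes the vectorizer has already produced straight line segments, and it snaps nearby endpoints together to form atom nodes. The function name `build_connection_table` and the `tol` snapping distance are hypothetical choices for illustration and are not taken from any of the systems listed here.

```python
from math import hypot

def build_connection_table(segments, tol=3.0):
    """Merge nearby segment endpoints into atoms and list the bonds between them.

    segments: list of ((x1, y1), (x2, y2)) line segments from a vectorizer.
    tol: hypothetical snapping distance in pixels (an assumption, not a published value).
    """
    atoms = []  # one representative (x, y) per atom node
    bonds = []  # (atom_index_a, atom_index_b) pairs

    def atom_index(point):
        # Reuse an existing atom if the endpoint lies within tol of it.
        for i, (ax, ay) in enumerate(atoms):
            if hypot(point[0] - ax, point[1] - ay) <= tol:
                return i
        atoms.append(point)
        return len(atoms) - 1

    for start, end in segments:
        a, b = atom_index(start), atom_index(end)
        if a != b:  # ignore degenerate segments that collapse onto one atom
            bonds.append((min(a, b), max(a, b)))
    return atoms, sorted(set(bonds))

# Three segments forming a triangle, with slightly noisy endpoints.
segments = [((0, 0), (10, 0)), ((10.5, 0.4), (5, 8)), ((5.2, 8.1), (0.3, -0.2))]
atoms, bonds = build_connection_table(segments)
```

Real systems layer many more checks on top of this step (bond orders, character labels, stereo bonds), which is where most of the hand-coded rules live.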
GraphReco presents a rule-based OCSR system with two key innovations: a Fragment Merging line detection algorithm for precise bond identification and a Markov network for probabilistic resolution of atom/bond ambiguity during graph assembly. It achieves 94.2% accuracy on USPTO-10K, outperforming traditional rule-based methods and some ML-based methods.
This paper introduces MLOCSR, a system that combines low-level image vectorization with a high-level probabilistic Markov Logic Network to recognize chemical structures. It replaces brittle heuristics with weighted logic rules, significantly outperforming state-of-the-art systems such as OSRA on degraded or low-resolution images.
Research on Chemical Expression Images Recognition
This paper proposes a new OCSR workflow that improves recognition rates by separating touching ("adhesive") chemical symbols and by using vectorization to handle dashed and solid ("virtual/real") wedge bonds. It achieves a 90% exact-match rate, compared with 82.2% for the OSRA baseline.
This paper introduces MolRec, a rule-based system for Optical Chemical Structure Recognition (OCSR). It defines a set of 18 geometric rewrite rules to disambiguate bonds and atoms in vectorised diagram images, demonstrating higher accuracy than the contemporary state-of-the-art (OSRA).
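To give a sense of what a geometric rule over vectorized segments can look like, here is a minimal sketch of one such check: treating two near-parallel, closely spaced line segments as a double bond. This is an illustrative rule only and is not claimed to be one of MolRec's 18 rules. The function name `is_double_bond` and the thresholds `max_gap` and `max_angle` are assumptions.

```python
from math import atan2, hypot, pi

def is_double_bond(seg_a, seg_b, max_gap=4.0, max_angle=0.1):
    """Return True if two segments are near-parallel and their midpoints are close.

    max_gap (pixels) and max_angle (radians) are hypothetical thresholds.
    """
    def angle(seg):
        (x1, y1), (x2, y2) = seg
        return atan2(y2 - y1, x2 - x1) % pi  # undirected orientation

    def midpoint(seg):
        (x1, y1), (x2, y2) = seg
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    diff = abs(angle(seg_a) - angle(seg_b))
    diff = min(diff, pi - diff)  # orientations wrap around at pi
    (mx1, my1), (mx2, my2) = midpoint(seg_a), midpoint(seg_b)
    # Midpoint distance is used as a simple proxy for the gap between the lines.
    return diff <= max_angle and hypot(mx2 - mx1, my2 - my1) <= max_gap

parallel = is_double_bond(((0, 0), (10, 0)), ((0, 3), (10, 3)))
crossing = is_double_bond(((0, 0), (10, 0)), ((5, -5), (5, 5)))
```

A rule-based system would apply a set of checks like this in a fixed order, rewriting the segment list each time a pattern matches.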
CLiDE Pro: Optical Chemical Structure Recognition Tool
This paper introduces CLiDE Pro, an advanced OCSR system that segments document images and reconstructs chemical connection tables. It features novel handling for crossing bonds and generic structures, validating performance on a publicly released benchmark of 454 scanned images.
Kekulé-1 System for Chemical Structure Recognition
This paper introduces Kekulé-1, one of the first successful Optical Chemical Structure Recognition (OCSR) systems. It details a hybrid approach using neural networks for character recognition and heuristic vectorization for bond detection, achieving 98.9% accuracy on a test set of 524 structures.
This methodological paper presents a system for digitizing chemical images into SDF files. It utilizes a custom vectorization algorithm and chemical rule validation, achieving 94% accuracy on benchmark datasets compared to 50% for commercial tools.
Chemical Literature Data Extraction: The CLiDE Project
The CLiDE project presents a foundational architecture for Optical Chemical Structure Recognition (OCSR). It details a three-phase pipeline to convert bitmapped journal pages into chemically significant connection tables, handling complex features like stereochemistry.
This 2003 paper introduces a machine vision approach for extracting chemical metadata from raster images. By using Gabor wavelets for feature extraction and Kohonen networks for classification, it distinguishes between chemical and non-chemical images, as well as ring and non-ring systems, without requiring high-resolution inputs.
This paper presents ChemReader, a fully automated optical structure recognition tool that converts raster images of chemical diagrams into machine-readable formats. It introduces a modified Hough transform for bond detection and a chemical spell checker that improves OCR accuracy from 66% to 87%.
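The idea of a chemical spell checker can be sketched as fuzzy matching of OCR output against a vocabulary of valid labels. The example below uses `difflib.get_close_matches` from the Python standard library. It is a simplified stand-in and not ChemReader's actual algorithm. The vocabulary `KNOWN_LABELS`, the function name `correct_label`, and the `cutoff` value are assumptions for illustration.

```python
from difflib import get_close_matches

# A small, hypothetical vocabulary of atom and superatom labels.
KNOWN_LABELS = ["C", "N", "O", "S", "Cl", "Br", "OH", "NH2", "OMe", "COOH", "Ph"]

def correct_label(raw, cutoff=0.5):
    """Map an OCR'd label to the closest known label, or return it unchanged."""
    if raw in KNOWN_LABELS:
        return raw
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = get_close_matches(raw, KNOWN_LABELS, n=1, cutoff=cutoff)
    return matches[0] if matches else raw

fixed_amine = correct_label("NHz")    # '2' misread as 'z'
fixed_methoxy = correct_label("OMc")  # 'e' misread as 'c'
unknown = correct_label("Xyz")        # nothing similar in the vocabulary
```

Plain string similarity cannot tell visually confusable glyphs apart on its own (for example `1` versus `l`), so a practical checker would likely also weight substitutions by how often the OCR engine confuses each character pair.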
This 1990 paper presents an early OCR pipeline for converting hand-drawn or printed chemical structures into connectivity tables. It introduces novel sweeping algorithms for graph perception and a matrix-based feature extraction method for character recognition.
This 1992 paper introduces Kekulé, one of the first complete Optical Chemical Structure Recognition (OCSR) systems. It details a pipeline integrating raster-to-vector conversion, neural network-based OCR, and rule-based logic to convert printed chemical diagrams into connection tables.