Rule-Based Systems

The earliest OCSR systems converted raster images into vector primitives (lines, arcs, characters) and applied hand-coded chemical rules to assemble those primitives into molecular graphs. Pioneering tools such as the Contreras system (1990), Kekulé (1992), and CLiDE (1993) established the core pipeline: binarize, thin, vectorize, classify atoms and bonds, then compile a connection table. Later systems such as OSRA, ChemReader, and CLiDE Pro refined each stage with better segmentation, chemical spell-checking, and support for superatom labels. The approach dominated the field for over two decades, but its brittleness when faced with diverse drawing styles, noisy scans, and edge-case notation ultimately motivated the shift to learned representations.
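
The final stage of that pipeline, compiling a connection table from vectorized primitives, can be illustrated with a minimal sketch. It is not taken from any of the systems named above: the `Segment` and `Label` types, the `compile_connection_table` function, and the pixel tolerance `tol` are all hypothetical names chosen for illustration. The sketch assumes the earlier stages have already produced straight bond segments and recognized atom labels, merges segment endpoints that lie close together into a single atom, and treats unlabeled vertices as implicit carbons.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Segment:
    """A vectorized bond line (hypothetical type, pixel coordinates)."""
    x1: float
    y1: float
    x2: float
    y2: float
    order: int = 1  # 1 = single, 2 = double, 3 = triple


@dataclass(frozen=True)
class Label:
    """A recognized atom label and its position (hypothetical type)."""
    text: str
    x: float
    y: float


def compile_connection_table(segments, labels, tol=3.0):
    """Return (atoms, bonds) built from bond segments and atom labels.

    Endpoints within `tol` pixels of an existing node are merged into it.
    Unlabeled nodes default to carbon, as in skeletal formulas.
    """
    nodes = []  # one representative (x, y) per atom

    def node_index(x, y):
        for i, (nx, ny) in enumerate(nodes):
            if hypot(nx - x, ny - y) <= tol:
                return i
        nodes.append((x, y))
        return len(nodes) - 1

    bonds = []
    for s in segments:
        a = node_index(s.x1, s.y1)
        b = node_index(s.x2, s.y2)
        if a != b:  # skip degenerate segments that collapse to one node
            bonds.append((a, b, s.order))

    atoms = ["C"] * len(nodes)
    for lab in labels:
        def dist(i):
            return hypot(nodes[i][0] - lab.x, nodes[i][1] - lab.y)
        nearest = min(range(len(nodes)), key=dist, default=None)
        if nearest is not None and dist(nearest) <= tol:
            atoms[nearest] = lab.text
    return atoms, bonds


# Two slightly misaligned bond lines and one "O" label (a C-C-O skeleton).
segments = [Segment(0, 0, 10, 0), Segment(10.5, 0.4, 20, 0)]
labels = [Label("O", 21, 1)]
atoms, bonds = compile_connection_table(segments, labels)
print(atoms)
print(bonds)
```

A fixed distance tolerance is the simplest possible merging rule. It also hints at the brittleness described above, since a threshold tuned for one drawing style or scan resolution can easily fail on another.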

Computational Chemistry

Optical Recognition of Chemical Graphics

This paper describes an early prototype system that digitizes chemical structure diagrams from scanned documents. It employs a multi-stage pipeline involving convex bounding polygon extraction, vectorization, and rule-based heuristics to generate MDL Molfiles.

OSRA: Open Source Optical Structure Recognition

This paper presents OSRA, the first open-source utility for converting graphical chemical structures from documents into machine-readable formats (SMILES or SD files). It outlines a pipeline combining existing image processing tools with custom heuristics for bond and atom detection, establishing a foundation for accessible chemical information extraction.

Reconstruction of Chemical Molecules from Images

This methodological paper proposes a five-stage pipeline for digitizing chemical structure images. It reports 97% reconstruction accuracy on benchmarks by combining a topology-preserving vectorizer with a chemical knowledge validation module.
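
To give a rough sense of what chemical knowledge validation can mean in practice, the sketch below flags atoms whose total bond order exceeds a typical valence. It is an illustrative assumption rather than the module described in the paper: the `MAX_VALENCE` table and the `valence_violations` function are hypothetical, and the table covers only a few neutral elements with a single common valence.

```python
# Typical maximum valences for a few neutral atoms (simplified assumption;
# charged atoms and multi-valent elements such as S and P are left out).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2,
               "F": 1, "Cl": 1, "Br": 1, "I": 1}


def valence_violations(atoms, bonds):
    """Return (index, symbol, bond_order_sum) for atoms over their valence.

    `atoms` is a list of element symbols. `bonds` is a list of
    (atom_index_a, atom_index_b, bond_order) tuples.
    Symbols missing from MAX_VALENCE are skipped rather than flagged.
    """
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [
        (i, symbol, used[i])
        for i, symbol in enumerate(atoms)
        if symbol in MAX_VALENCE and used[i] > MAX_VALENCE[symbol]
    ]


# An oxygen with three single bonds, as might result from a spurious line
# being read as a bond.
atoms = ["C", "O", "C", "C"]
bonds = [(0, 1, 1), (1, 2, 1), (1, 3, 1)]
violations = valence_violations(atoms, bonds)
print(violations)
```

A check like this only detects an implausible reconstruction. Deciding which bond or label to correct would require further rules or a confidence score from the earlier stages.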

ChemInfty: Chemical Structure Recognition in Patent Images

A 2011 rule-based OCSR system designed specifically for the low-quality images found in Japanese patent applications. It uses segment-based methods to handle pervasive problems such as touching characters, atom labels merged with bonds, and broken lines.