Computational Chemistry
Early optical recognition system converts scanned chemical diagrams to connection tables

Optical Recognition of Chemical Graphics

This paper describes an early prototype system that digitizes chemical structure diagrams from scanned documents. It employs a multi-stage pipeline involving convex bounding polygon extraction, vectorization, and rule-based heuristics to generate MDL Molfiles.

Chemical structure diagram for optical recognition

OSRA: Open Source Optical Structure Recognition

This paper presents OSRA, the first open-source utility for converting graphical chemical structures from documents into machine-readable formats (SMILES/SD). It outlines a pipeline combining existing image processing tools with custom heuristics for bond and atom detection, establishing a foundation for accessible chemical information extraction.

Five-stage pipeline for reconstructing chemical molecules from raster images

Reconstruction of Chemical Molecules from Images

This methodological paper proposes a comprehensive pipeline to digitize chemical structure images. It achieves 97% reconstruction accuracy on benchmarks by combining a topology-preserving vectorizer with a chemical knowledge validation module.

Optical chemical structure recognition example

MolRec: Chemical Structure Recognition at CLEF 2012

An evaluation of MolRec at the CLEF 2012 competition reveals a large gap between the automatic evaluation set (94-96% accuracy) and the manual evaluation set of complex patent structures (46-59% accuracy). The paper systematically analyzes failure modes, including character grouping bugs, touching characters, and four-way junction vectorization.

MolRec: Rule-Based OCSR System at TREC 2011 Benchmark

This paper details the MolRec system, which converts chemical diagram images into MOL files using vectorization, geometric rules, and graph construction. It achieved 95% accuracy on 1,000 TREC 2011 benchmark images, and the paper includes a comprehensive analysis of the failure cases.

The transformation from a 2D chemical structure image to a SMILES representation

What is Optical Chemical Structure Recognition (OCSR)?

Discover how OCSR technology bridges the gap between molecular images and machine-readable data, evolving from rule-based systems to modern deep learning models for chemical knowledge extraction.

αExtractor extracts structured chemical information from biomedical literature

αExtractor: Chemical Info from Biomedical Literature

A 2024 deep learning system for optical chemical structure recognition designed specifically for biomedical literature mining. It uses a ResNet-Transformer architecture to handle challenging conditions in scientific documents, including low-resolution images, noise, distortions, and even hand-drawn molecular diagrams.

ChemInfty: Chemical Structure Recognition in Patent Images

A 2011 rule-based OCSR system designed specifically for the low-quality images found in Japanese patent applications. It uses segment-based methods to handle pervasive problems such as touching characters, atom labels merged with bonds, and broken lines.

Diagram showing MolNexTR's dual-stream architecture: a molecular image feeds into parallel ConvNeXt and Vision Transformer encoders, producing a SMILES string.

MolNexTR: A Dual-Stream Molecular Image Recognition

MolNexTR proposes a dual-stream architecture that combines ConvNeXt and Vision Transformer encoders to improve molecular image recognition (OCSR). It achieves 81-97% accuracy across diverse benchmarks by extracting local and global features simultaneously and by training with specialized image contamination augmentations.

A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

MolParser-7M & WildMol: Large-Scale OCSR Datasets

The MolParser project introduces two key datasets: MolParser-7M, the largest training dataset for Optical Chemical Structure Recognition (OCSR) with 7.7M pairs of images and E-SMILES strings, and WildMol, a new 20k-sample benchmark for evaluating models on challenging real-world data. The training data uniquely combines millions of diverse synthetic molecules with 400,000 manually annotated in-the-wild samples.

MolParser: End-to-End Molecular Structure Recognition

A 2025 end-to-end OCSR system that addresses both technical and data challenges. It introduces MolParser-7M (7M+ image-text pairs) and MolDet (a YOLO-based detector) for extracting and recognizing molecular structures from real-world documents of varying quality and style.

Diagram showing graph traversal chain-of-thought parsing of a molecular structure image into atom and bond predictions

GTR-CoT: Graph Traversal Chain-of-Thought for Molecules

A 2025 Vision-Language Model for OCSR that uses graph traversal chain-of-thought reasoning and a two-stage training scheme (SFT followed by GRPO). It handles both printed molecules, including chemical abbreviations such as Ph and Et, and hand-drawn structures, and achieves strong performance on the new MolRec-Bench benchmark.