Optical-Chemical-Structure-Recognition

IMG2SMI: Translating Molecular Structure Images to SMILES

A 2021 image-to-text approach treating OCSR as an image captioning task. It uses Transformers with SELFIES representation to convert molecular structure diagrams into SMILES strings, enabling extraction of visual chemical knowledge from scientific literature.

Computational Chemistry

Kekulé: OCR-Optical Chemical Recognition

This 1992 paper introduces Kekulé, one of the first complete Optical Chemical Structure Recognition (OCSR) systems. It details a pipeline integrating raster-to-vector conversion, neural network-based OCR, and rule-based logic to convert printed chemical diagrams into connection tables.

Computational Chemistry

OCSR Methods: A Taxonomy of Approaches

A comprehensive categorization of OCSR methods, organizing techniques by their fundamental approach: deep learning, traditional ML, and rule-based systems.

Computational Chemistry

Early optical recognition system converts scanned chemical diagrams to connection tables

Optical Recognition of Chemical Graphics

This paper describes an early prototype system that digitizes chemical structure diagrams from scanned documents. It employs a multi-stage pipeline involving convex bounding polygon extraction, vectorization, and rule-based heuristics to generate MDL Molfiles.

Computational Chemistry

Chemical structure diagram for optical recognition

OSRA: Open Source Optical Structure Recognition

This paper presents OSRA, the first open-source utility for converting graphical chemical structures from documents into machine-readable formats (SMILES/SD). It outlines a pipeline combining existing image processing tools with custom heuristics for bond and atom detection, establishing a foundation for accessible chemical information extraction.

Computational Chemistry

Five-stage pipeline for reconstructing chemical molecules from raster images

Reconstruction of Chemical Molecules from Images

This methodological paper proposes a comprehensive pipeline to digitize chemical structure images. It achieves 97% reconstruction accuracy on benchmarks by combining a topology-preserving vectorizer with a chemical knowledge validation module.

Computational Chemistry

MolRec: Chemical Structure Recognition at CLEF 2012

Performance evaluation of MolRec at the CLEF 2012 competition reveals a large performance gap between the automatic evaluation set (94-96% accuracy) and the manual evaluation set of complex patent structures (46-59% accuracy), with systematic analysis of failure modes including character grouping bugs, touching characters, and four-way junction vectorization.

Computational Chemistry

MolRec: Rule-Based OCSR System at TREC 2011 Benchmark

Details the MolRec system for converting chemical diagram images into MOL files using vectorization, geometric rules, and graph construction. Achieved 95% accuracy on 1000 TREC 2011 benchmark images with comprehensive failure analysis of limitations.

Computational Chemistry

The transformation from a 2D chemical structure image to a SMILES representation

What is Optical Chemical Structure Recognition (OCSR)?

Discover how OCSR technology bridges the gap between molecular images and machine-readable data, evolving from rule-based systems to modern deep learning models for chemical knowledge extraction.

Computational Chemistry

αExtractor extracts structured chemical information from biomedical literature

αExtractor: Chemical Info from Biomedical Literature

A 2024 deep learning system for optical chemical structure recognition designed specifically for biomedical literature mining, using ResNet-Transformer architecture to handle challenging conditions including low-resolution images, noise, distortions, and even hand-drawn molecular diagrams from scientific documents.

Computational Chemistry

Segment-based chemical structure recognition pipeline for low-quality patent images with touching characters and broken lines

ChemInfty: Chemical Structure Recognition in Patent Images

A 2011 rule-based OCSR system designed specifically for the challenging low-quality images in Japanese patent applications, using segment-based methods to handle pervasive problems like touching characters, merged atom labels with bonds, and broken lines.

Computational Chemistry

Diagram showing MolNexTR's dual-stream architecture: a molecular image feeds into parallel ConvNext and Vision Transformer encoders, producing a SMILES string.

MolNexTR: A Dual-Stream Molecular Image Recognition

MolNexTR proposes a dual-stream architecture combining ConvNext and Vision Transformers to improve molecular image recognition (OCSR). It achieves 81-97% accuracy across diverse benchmarks utilizing simultaneous local and global feature extraction alongside specialized image contamination augmentations.