Computational Chemistry
Five-stage pipeline for reconstructing chemical molecules from raster images

Reconstruction of Chemical Molecules from Images

This methodological paper proposes a comprehensive pipeline to digitize chemical structure images. It achieves 97% reconstruction accuracy on benchmarks by combining a topology-preserving vectorizer with a chemical knowledge validation module.

Computational Chemistry

GTR-CoT: Graph Traversal Chain-of-Thought for Molecules

A 2025 vision-language model for OCSR that uses graph traversal chain-of-thought reasoning to handle chemical abbreviations (Ph, Et, etc.) that break existing systems. It reports state-of-the-art performance after training on data in which abbreviations are preserved.

Computational Chemistry
Optical chemical structure recognition example

MolRec: Chemical Structure Recognition at CLEF 2012

An evaluation of MolRec at the CLEF 2012 competition reveals a stark gap between simple structures (95%+ accuracy) and complex ones (46-59% accuracy). The paper systematically analyzes the limitations of rule-based OCSR, including touching characters, stereochemistry recognition, and four-way junction failures.

Computational Chemistry
Optical chemical structure recognition example

MolRec: Rule-Based OCSR System

Details the MolRec system, which converts chemical diagram images into MOL files using vectorization, geometric rules, and graph construction. MolRec achieved 95% accuracy on 1000 TREC 2011 benchmark images, and the paper includes a comprehensive failure analysis.

Computational Chemistry
Markush structure diagram

SubGrapher: Visual Fingerprinting of Chemical Structures

SubGrapher introduces a visual fingerprinting approach to Optical Chemical Structure Recognition that detects functional groups directly from images, enabling chemical database searches without full structure reconstruction and handling complex patent images including Markush structures.

Computational Chemistry
The transformation from a 2D chemical structure image to a SMILES representation

What is Optical Chemical Structure Recognition (OCSR)?

Discover how OCSR technology bridges the gap between molecular images and machine-readable data, evolving from rule-based systems to modern deep learning models for chemical knowledge extraction.

Computational Chemistry
αExtractor extracts structured chemical information from biomedical literature

αExtractor: Chemical Info from Biomedical Literature

A 2024 deep learning system for optical chemical structure recognition designed specifically for biomedical literature mining. It uses a ResNet-Transformer architecture to handle challenging conditions in scientific documents, including low-resolution images, noise, distortions, and even hand-drawn molecular diagrams.

Computational Chemistry

ChemInfty: Chemical Structure Recognition in Patent Images

A 2011 rule-based OCSR system designed specifically for the challenging low-quality images in Japanese patent applications, using segment-based methods to handle pervasive problems like touching characters (22% of images), merged atom labels with bonds (19.5%), and broken lines (8.5%).

Computational Chemistry

MolNexTR: Dual-Stream Molecular Image Recognition

MolNexTR combines ConvNeXt and Vision Transformers to improve molecular image recognition (OCSR), achieving 81-97% accuracy across diverse benchmarks through a dual-stream encoder and new data augmentation strategies.

Computational Chemistry
A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

MolParser-7M & WildMol: Large-Scale OCSR Datasets

The MolParser project introduces two key datasets: MolParser-7M, the largest training dataset for Optical Chemical Structure Recognition (OCSR) with 7.7M pairs of images and E-SMILES strings, and WildMol, a new 20k-sample benchmark for evaluating models on challenging real-world data. The training data uniquely combines millions of diverse synthetic molecules with 400,000 manually annotated in-the-wild samples to significantly enhance model robustness.

Computational Chemistry
Optical chemical structure recognition example

MolParser: End-to-End Molecular Structure Recognition

A 2025 end-to-end OCSR system addressing both technical and data challenges, introducing MolParser-7M (7M+ image-text pairs) and MolDet (YOLO-based detector) for extracting and recognizing molecular structures from real-world documents with diverse quality and styles.

Machine Learning Fundamentals
Comparison of standard 3D CNN versus 3D Steerable CNN for handling rotational symmetry

3D Steerable CNNs: Rotationally Equivariant Features

Weiler et al.’s NeurIPS 2018 paper introduces 3D Steerable CNNs, which achieve SE(3) equivariance through group representation theory and convolution kernels built from spherical harmonics. This removes the need for rotational data augmentation and improves data efficiency for scientific applications with rotational symmetry, such as molecular and protein structure modeling.
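
For readers unfamiliar with the term, the equivariance property mentioned in this summary can be written as follows. This is the standard textbook definition, not a formula quoted from the paper; the symbols $\Phi$, $\pi_1$, and $\pi_2$ are notation assumed here for illustration.

```latex
% Equivariance of a network layer \Phi under rigid motions g in SE(3):
% transforming the input and then applying \Phi gives the same result as
% applying \Phi first and then transforming the output.
\[
  \Phi\bigl(\pi_1(g)\, f\bigr) \;=\; \pi_2(g)\, \Phi(f)
  \qquad \text{for all } g \in \mathrm{SE}(3),
\]
% where f is an input feature field, and \pi_1, \pi_2 denote the (assumed)
% actions of SE(3) on the input and output feature fields, respectively.
```

When a layer satisfies this identity by construction, a rotated or translated input yields a correspondingly transformed output, which is why rotated copies of the training data are not needed.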