Deep learning OCSR methods that treat molecular recognition as image captioning, producing SMILES, InChI, or SELFIES strings.
Image-to-sequence models reframe OCSR as an image captioning problem: an encoder (typically a CNN or Vision Transformer) extracts visual features, and an autoregressive decoder generates a string representation of the molecule, most commonly SMILES, InChI, or SELFIES. DECIMER pioneered this approach with CNN-based encoders trained on millions of synthetic images; subsequent work explored Transformer encoders (SwinOCSR, ICMDT), improved training objectives (MolSight’s reinforcement learning for stereochemistry), and alternative string targets (RFL’s ring-free language). These models benefit from large-scale synthetic data and generally handle diverse drawing styles better than rule-based predecessors, though they can hallucinate tokens for structures outside their training distribution.
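The autoregressive decoding step described above can be illustrated with a minimal sketch. It is not taken from any of the listed systems: `greedy_decode`, `next_token_scores`, and `toy_scores` are hypothetical names, the "decoder" is a hard-coded stand-in for a trained network, and greedy selection is assumed for simplicity (real systems often use beam search).

```python
from typing import Callable, Dict, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end-of-sequence markers


def greedy_decode(
    image_features: Sequence[float],
    next_token_scores: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_len: int = 64,
) -> str:
    """Generate a string one token at a time, conditioning each step on the
    image features and on all tokens emitted so far."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = next_token_scores(image_features, tokens)
        best = max(scores, key=scores.get)  # greedy: pick the top-scoring token
        if best == EOS:
            break
        tokens.append(best)
    return "".join(tokens[1:])  # drop the start marker


def toy_scores(features: Sequence[float], prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a trained decoder: always spells out ethanol ("CCO")."""
    target = ["C", "C", "O", EOS]
    step = len(prefix) - 1  # number of tokens emitted so far
    want = target[min(step, len(target) - 1)]
    return {tok: (1.0 if tok == want else 0.0) for tok in ["C", "O", EOS]}


smiles = greedy_decode([0.0], toy_scores)
print(smiles)
```

In a real model, `image_features` would come from the CNN or Vision Transformer encoder and `next_token_scores` would be a learned decoder. The `max_len` cap matters because a model that never emits the end marker would otherwise loop indefinitely.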
Deep Learning for Molecular Structure Extraction (2019)
This paper presents a two-stage deep learning pipeline to extract chemical structures from documents and convert them to SMILES strings. By training on large-scale synthetic data, the method overcomes the brittleness of rule-based systems and demonstrates high accuracy even on low-resolution and noisy input images.
Img2Mol: Accurate SMILES Recognition from Depictions
A 2021 deep learning system using a two-stage approach for OCSR, encoding images into continuous CDDD embeddings before decoding to SMILES. It leverages extensive data augmentation to handle rotations, distortions, and rendering variations for fast and robust molecular structure recognition.
IMG2SMI: Translating Molecular Structure Images to SMILES
A 2021 image-to-text approach treating OCSR as an image captioning task. It uses a Transformer-based model that decodes SELFIES tokens, which are then converted to SMILES strings, enabling extraction of visual chemical knowledge from scientific literature.
αExtractor: Chemical Info from Biomedical Literature
A 2024 deep learning system for optical chemical structure recognition designed specifically for biomedical literature mining. It uses a ResNet-Transformer architecture to handle challenging conditions in scientific documents, including low-resolution images, noise, distortions, and even hand-drawn molecular diagrams.
RFL: Simplifying Chemical Structure Recognition (AAAI 2025)
Proposes Ring-Free Language (RFL) to hierarchically decouple molecular graphs into skeletons, rings, and branches, solving issues with 1D serialization of complex 2D structures. Introduces the Molecular Skeleton Decoder (MSD) to progressively predict these components, achieving strong results on handwritten and printed chemical structure recognition benchmarks.