Image-to-sequence models reframe OCSR as an image captioning problem: an encoder (typically a CNN or Vision Transformer) extracts visual features, and an autoregressive decoder generates a string representation of the molecule, most commonly SMILES, InChI, or SELFIES. These models benefit from large-scale synthetic data and generally handle diverse drawing styles better than rule-based predecessors, though they can hallucinate tokens for structures outside their training distribution.
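The autoregressive decoding step common to all of these models can be sketched in a few lines. Below is a minimal, self-contained illustration of greedy token-by-token SMILES generation; `next_token_logits`, `SMILES_VOCAB`, and the hard-coded scores are hypothetical stand-ins for a real encoder-decoder network, not any specific model from the tables below.

```python
# Toy SMILES vocabulary with special begin/end-of-sequence tokens.
SMILES_VOCAB = ["<bos>", "<eos>", "C", "O", "N", "=", "(", ")", "1"]

def next_token_logits(image_features, prefix):
    # Stub scorer: a real model would run a transformer decoder over the
    # prefix while cross-attending to the encoder's image features.
    # Here we hard-code scores that spell out ethanol ("CCO").
    script = ["C", "C", "O", "<eos>"]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if tok == target else 0.0 for tok in SMILES_VOCAB]

def greedy_decode(image_features, max_len=32):
    # Start from <bos> and repeatedly append the highest-scoring token
    # until <eos> is emitted or the length budget runs out.
    prefix = ["<bos>"]
    while len(prefix) < max_len:
        logits = next_token_logits(image_features, prefix)
        tok = SMILES_VOCAB[max(range(len(SMILES_VOCAB)), key=logits.__getitem__)]
        if tok == "<eos>":
            break
        prefix.append(tok)
    return "".join(prefix[1:])

print(greedy_decode(image_features=None))  # -> "CCO"
```

In practice the decoder also carries a probability distribution per step, which is why out-of-distribution drawings can yield confidently "hallucinated" but chemically plausible token sequences.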

## CNN-Based Pioneers

| Year | Paper | Key Idea |
|------|-------|----------|
| 2019 | Staker et al. | Early CNN encoder-decoder for molecular structure extraction |
| 2020 | DECIMER | CNN encoder trained on millions of synthetic images |
| 2021 | DECIMER 1.0 | Transformer decoder upgrade with improved accuracy |
| 2023 | DECIMER.ai | Web platform integrating the DECIMER segmentation and OCSR models |

## Transformer & ViT Architectures

| Year | Paper | Key Idea |
|------|-------|----------|
| 2021 | Img2Mol | CDDD molecular fingerprint prediction from depictions |
| 2021 | IMG2SMI | Translating molecular images to SMILES strings |
| 2021 | ViT-InChI | End-to-end Vision Transformer for InChI generation |
| 2022 | Image2SMILES | Transformer OCSR with a synthetic data pipeline |
| 2022 | SwinOCSR | Swin Transformer encoder for end-to-end chemical OCR |
| 2022 | ICMDT | Automated recognition with interactive correction |
| 2022 | MICER | Transfer learning from ImageNet for molecular captioning |
| 2024 | Image2InChI | Swin Transformer encoder for InChI generation |
| 2024 | MMSSC-Net | Multi-stage sequence cognitive networks |

## Advanced Training & Novel Targets

| Year | Paper | Key Idea |
|------|-------|----------|
| 2023 | αExtractor | ResNet-Transformer for noisy and hand-drawn structures in biomedical literature |
| 2025 | DGAT | Dual-path global awareness transformer |
| 2025 | MolSight | RL-based training with multi-granularity learning for stereochemistry |
| 2025 | RFL | Ring-free language target simplifying structure recognition |