Image-to-sequence models reframe OCSR as an image captioning problem: an encoder (typically a CNN or Vision Transformer) extracts visual features, and an autoregressive decoder generates a string representation of the molecule, most commonly SMILES, InChI, or SELFIES. These models benefit from large-scale synthetic data and generally handle diverse drawing styles better than rule-based predecessors, though they can hallucinate tokens for structures outside their training distribution.
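The autoregressive decoding step common to all of these models can be sketched in a few lines. Below is a minimal, self-contained illustration of greedy token-by-token SMILES generation; `next_token_logits`, `SMILES_VOCAB`, and the hard-coded scores are hypothetical stand-ins for a real encoder-decoder network, not any specific model from the tables below.

```python
# Toy SMILES vocabulary with special begin/end-of-sequence tokens.
SMILES_VOCAB = ["<bos>", "<eos>", "C", "O", "N", "=", "(", ")", "1"]

def next_token_logits(image_features, prefix):
    # Stub scorer: a real model would run a transformer decoder over the
    # prefix while cross-attending to the encoder's image features.
    # Here we hard-code scores that spell out ethanol ("CCO").
    script = ["C", "C", "O", "<eos>"]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if tok == target else 0.0 for tok in SMILES_VOCAB]

def greedy_decode(image_features, max_len=32):
    # Start from <bos> and repeatedly append the highest-scoring token
    # until <eos> is emitted or the length budget runs out.
    prefix = ["<bos>"]
    while len(prefix) < max_len:
        logits = next_token_logits(image_features, prefix)
        tok = SMILES_VOCAB[max(range(len(SMILES_VOCAB)), key=logits.__getitem__)]
        if tok == "<eos>":
            break
        prefix.append(tok)
    return "".join(prefix[1:])

print(greedy_decode(image_features=None))  # -> "CCO"
```

In practice the decoder also carries a probability distribution per step, which is why out-of-distribution drawings can yield confidently "hallucinated" but chemically plausible token sequences.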

## CNN-Based Pioneers

| Year | Paper | Key Idea |
|------|-------|----------|
| 2019 | Staker et al. | Early CNN encoder-decoder for molecular structure extraction |
| 2020 | DECIMER | CNN encoder trained on millions of synthetic images |
| 2021 | DECIMER 1.0 | Transformer decoder upgrade with improved accuracy |
| 2023 | DECIMER.ai | Web platform integrating the DECIMER segmentation and OCSR models |

## Transformer & ViT Architectures

| Year | Paper | Key Idea |
|------|-------|----------|
| 2021 | Img2Mol | CDDD molecular fingerprint prediction from depictions |
| 2021 | IMG2SMI | Translating molecular images to SMILES strings |
| 2021 | ViT-InChI | End-to-end Vision Transformer for InChI generation |
| 2022 | Image2SMILES | Transformer OCSR with a synthetic data pipeline |
| 2022 | SwinOCSR | Swin Transformer encoder for end-to-end chemical OCR |
| 2022 | ICMDT | Automated recognition with interactive correction |
| 2022 | MICER | Transfer learning from ImageNet for molecular captioning |
| 2024 | Image2InChI | Swin Transformer encoder for InChI generation |
| 2024 | MMSSC-Net | Multi-stage sequence cognitive networks |

## Advanced Training & Novel Targets

| Year | Paper | Key Idea |
|------|-------|----------|
| 2023 | αExtractor | ResNet-Transformer for noisy and hand-drawn structures in biomedical literature |
| 2025 | DGAT | Dual-path global awareness transformer |
| 2025 | MolSight | RL-based training with multi-granularity learning for stereochemistry |
| 2025 | RFL | Ring-free language target simplifying structure recognition |