Deep learning OCSR methods that treat molecular recognition as image captioning, producing SMILES, InChI, or SELFIES strings.
Image-to-sequence models reframe OCSR as an image captioning problem: an encoder (typically a CNN or Vision Transformer) extracts visual features, and an autoregressive decoder generates a string representation of the molecule, most commonly SMILES, InChI, or SELFIES. DECIMER pioneered this approach with CNN-based encoders trained on millions of synthetic images; subsequent work explored Transformer encoders (SwinOCSR, ICMDT), improved training objectives (MolSight’s reinforcement learning for stereochemistry), and alternative string targets (RFL’s ring-free language). These models benefit from large-scale synthetic data and generally handle diverse drawing styles better than rule-based predecessors, though they can hallucinate tokens for structures outside their training distribution.
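The encoder–decoder loop described above can be sketched as greedy autoregressive decoding. Everything in this sketch (the toy vocabulary, `encode_image`, `decoder_step`, the fake weights `W`) is an illustrative stand-in, not the API of any model surveyed here:

```python
import numpy as np

# Toy vocabulary of SMILES tokens; real models use hundreds of tokens.
VOCAB = ["<bos>", "<eos>", "C", "O", "N", "=", "(", ")"]
rng = np.random.default_rng(0)
W = rng.normal(size=(len(VOCAB), 2))  # fake decoder weights

def encode_image(image):
    """Stand-in for a CNN/ViT encoder: image -> pooled feature vector."""
    return image.mean(axis=(0, 1))

def decoder_step(features, prefix_ids):
    """Stand-in for one autoregressive decoder step; returns logits."""
    h = features.sum() + 0.1 * len(prefix_ids)  # fake hidden state
    return W @ np.array([h, 1.0])

def greedy_decode(image, max_len=16):
    """Greedy decoding: feed the argmax token back in until <eos>."""
    ids = [VOCAB.index("<bos>")]
    features = encode_image(image)
    for _ in range(max_len):
        next_id = int(np.argmax(decoder_step(features, ids)))
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":
            break
    return "".join(t for t in (VOCAB[i] for i in ids[1:]) if t != "<eos>")
```

In practice the decoder conditions on the full prefix through attention and is sampled with beam search rather than pure argmax, but the control flow is the same.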
DECIMER.ai: Optical Chemical Structure Recognition
DECIMER.ai addresses the lack of open tools for Optical Chemical Structure Recognition (OCSR) by providing a comprehensive, deep-learning-based workflow. It features a novel data generation pipeline (RanDepict), a web application, and models for segmentation and recognition that rival or exceed proprietary solutions.
Dual-Path Global Awareness Transformer (DGAT) for OCSR
Proposes a new architecture (DGAT) to resolve global context loss in chemical structure recognition. Introduces Cascaded Global Feature Enhancement and Sparse Differential Global-Local Attention, achieving 84.0% BLEU-4 and handling complex chiral structures implicitly.
Image2InChI: SwinTransformer for Molecular Recognition
Proposes Image2InChI, an OCSR model with an improved SwinTransformer encoder and a novel feature-fusion network with attention mechanisms, achieving 99.8% InChI accuracy on the BMS dataset.
MMSSC-Net: Multi-Stage Cognitive OCSR
MMSSC-Net introduces a multi-stage cognitive approach for OCSR, utilizing a SwinV2 encoder and GPT-2 decoder to recognize atomic and bond sequences. It achieves 75-98% accuracy across benchmark datasets by handling varying image resolutions and noise through fine-grained perception of atoms and bonds.
MolSight: OCSR with RL and Multi-Granularity Learning
MolSight introduces a three-stage training paradigm for Optical Chemical Structure Recognition (OCSR): large-scale pretraining, multi-granularity fine-tuning with auxiliary bond and coordinate prediction tasks, and reinforcement learning (GRPO). It achieves 85.1% stereochemical accuracy on USPTO, recognizing complex stereochemical features such as chiral centers and cis-trans isomers.
DECIMER 1.0: Transformers for Chemical Image Recognition
DECIMER 1.0 introduces a Transformer-based architecture coupled with EfficientNet-B3 to solve Optical Chemical Structure Recognition. By using the SELFIES representation (which guarantees 100% valid output strings) and scaling training to over 35 million molecules, it achieves 96.47% exact match accuracy on synthetic benchmarks, offering an open-source solution for mining chemical data from legacy literature.
End-to-End Transformer for Molecular Image Captioning
This paper introduces a convolution-free, end-to-end transformer model for molecular image translation. By replacing CNN encoders with Vision Transformers, it achieves a Levenshtein distance of 6.95 on noisy datasets, compared to 7.49 for ResNet50-LSTM baselines.
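The Levenshtein distance used as the evaluation metric here is the minimum number of single-character insertions, deletions, and substitutions needed to turn the predicted string into the reference. A minimal dynamic-programming implementation (not the paper's code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum edits (insert/delete/substitute) turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

# Two SMILES strings differing by one atom symbol:
print(levenshtein("CC(=O)O", "CC(=O)N"))  # -> 1
```

A lower average distance therefore means predictions that are character-for-character closer to the ground-truth string, which is why 6.95 versus 7.49 favors the ViT encoder.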
ICMDT: Automated Chemical Structure Image Recognition
This paper introduces ICMDT, a Transformer-based architecture for molecular translation (image-to-InChI). By enhancing the TNT block to fuse pixel, small patch, and large patch embeddings, the model achieves superior accuracy on the Bristol-Myers Squibb dataset compared to CNN-RNN and standard Transformer baselines.
Image2SMILES: Transformer OCSR with Synthetic Data Pipeline
A Transformer-based system for optical chemical structure recognition that introduces a comprehensive data generation pipeline (FG-SMILES, Markush structures, visual contamination) and achieves 79% accuracy on real-world images, outperforming rule-based systems such as OSRA.
MICER: Molecular Image Captioning with Transfer Learning
MICER treats optical chemical structure recognition as an image captioning task, using transfer learning with a fine-tuned ResNet encoder and attention-based LSTM decoder to convert molecular images into SMILES strings, reaching 97.54% sequence accuracy on synthetic data and 82.33% on real-world images.
SwinOCSR: End-to-End Chemical OCR with Swin Transformers
Proposes an end-to-end architecture replacing standard CNN backbones with Swin Transformer to capture global image context. Introduces Multi-label Focal Loss to handle severe token imbalance in chemical datasets.
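Focal loss addresses token imbalance by down-weighting tokens the model already predicts confidently via the standard factor (1 − p_t)^γ. SwinOCSR's exact multi-label formulation is the paper's; a generic per-token focal cross-entropy can be sketched as:

```python
import numpy as np

def focal_loss(logits, target, gamma=2.0):
    """Per-token focal cross-entropy: -(1 - p_t)^gamma * log(p_t).

    logits: (vocab,) raw scores; target: index of the true token.
    With gamma=0 this reduces to ordinary cross-entropy.
    """
    z = logits - logits.max()           # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum() # softmax
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * np.log(p_t)
```

Frequent, easy tokens (carbon atoms, single bonds) end up with p_t near 1 and contribute almost nothing, so the gradient is dominated by the rare bond and stereochemistry tokens that cause the imbalance.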
DECIMER: Deep Learning for Chemical Image Recognition
DECIMER adapts the “Show, Attend and Tell” image captioning architecture to translate chemical structure images into SMILES strings. By leveraging massive synthetic datasets generated from PubChem, it demonstrates that deep learning can perform optical chemical recognition without complex, hand-engineered rule systems.