Deep learning OCSR methods that treat molecular recognition as image captioning, producing SMILES, InChI, or SELFIES strings.
Image-to-sequence models reframe OCSR as an image captioning problem: an encoder (typically a CNN or Vision Transformer) extracts visual features, and an autoregressive decoder generates a string representation of the molecule, most commonly SMILES, InChI, or SELFIES. DECIMER pioneered this approach with CNN-based encoders trained on millions of synthetic images; subsequent work explored Transformer encoders (SwinOCSR, ICMDT), improved training objectives (MolSight’s reinforcement learning for stereochemistry), and alternative string targets (RFL’s ring-free language). These models benefit from large-scale synthetic data and generally handle diverse drawing styles better than rule-based predecessors, though they can hallucinate tokens for structures outside their training distribution.
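The encoder–decoder loop described above can be sketched as greedy autoregressive decoding. Everything in this sketch (the toy vocabulary, `encode_image`, `decoder_step`, the fake weights `W`) is an illustrative stand-in, not the API of any model surveyed here:

```python
import numpy as np

# Toy vocabulary of SMILES tokens; real models use hundreds of tokens.
VOCAB = ["<bos>", "<eos>", "C", "O", "N", "=", "(", ")"]
rng = np.random.default_rng(0)
W = rng.normal(size=(len(VOCAB), 2))  # fake decoder weights

def encode_image(image):
    """Stand-in for a CNN/ViT encoder: image -> pooled feature vector."""
    return image.mean(axis=(0, 1))

def decoder_step(features, prefix_ids):
    """Stand-in for one autoregressive decoder step; returns logits."""
    h = features.sum() + 0.1 * len(prefix_ids)  # fake hidden state
    return W @ np.array([h, 1.0])

def greedy_decode(image, max_len=16):
    """Greedy decoding: feed the argmax token back in until <eos>."""
    ids = [VOCAB.index("<bos>")]
    features = encode_image(image)
    for _ in range(max_len):
        next_id = int(np.argmax(decoder_step(features, ids)))
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":
            break
    return "".join(t for t in (VOCAB[i] for i in ids[1:]) if t != "<eos>")
```

In practice the decoder conditions on the full prefix through attention and is sampled with beam search rather than pure argmax, but the control flow is the same.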
DECIMER.ai: Optical Chemical Structure Recognition
DECIMER.ai addresses the lack of open tools for Optical Chemical Structure Recognition (OCSR) by providing a comprehensive, deep-learning-based workflow. It features a novel data generation pipeline (RanDepict), a web application, and models for segmentation and recognition that rival or exceed proprietary solutions.
Dual-Path Global Awareness Transformer (DGAT) for OCSR
Proposes a new architecture (DGAT) to resolve global context loss in chemical structure recognition. Introduces Cascaded Global Feature Enhancement and Sparse Differential Global-Local Attention, achieving 84.0% BLEU-4 and handling complex chiral structures implicitly.
Image2InChI: SwinTransformer for Molecular Recognition
Proposes Image2InChI, an OCSR model with an improved SwinTransformer encoder and a novel feature-fusion network with attention mechanisms, achieving 99.8% InChI accuracy on the BMS dataset.
MMSSC-Net: Multi-Stage Cognitive OCSR
MMSSC-Net introduces a multi-stage cognitive approach for OCSR, utilizing a SwinV2 encoder and GPT-2 decoder to recognize atomic and bond sequences. It achieves 75-98% accuracy across benchmark datasets by handling varying image resolutions and noise through fine-grained perception of atoms and bonds.
MolSight: OCSR with RL and Multi-Granularity Learning
MolSight introduces a three-stage training paradigm for Optical Chemical Structure Recognition (OCSR): large-scale pretraining, multi-granularity fine-tuning with auxiliary bond and coordinate prediction tasks, and reinforcement learning (GRPO). It achieves 85.1% stereochemical accuracy on USPTO, recognizing complex stereochemical features such as chiral centers and cis-trans isomers.
DECIMER 1.0: Transformers for Chemical Image Recognition
DECIMER 1.0 introduces a Transformer-based architecture coupled with EfficientNet-B3 to solve Optical Chemical Structure Recognition. By using the SELFIES representation (which guarantees 100% valid output strings) and scaling training to over 35 million molecules, it achieves 96.47% exact match accuracy on synthetic benchmarks, offering an open-source solution for mining chemical data from legacy literature.
End-to-End Transformer for Molecular Image Captioning
This paper introduces a convolution-free, end-to-end transformer model for molecular image translation. By replacing CNN encoders with Vision Transformers, it achieves a Levenshtein distance of 6.95 on noisy datasets, compared to 7.49 for ResNet50-LSTM baselines.
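The Levenshtein distance used as the evaluation metric here is the minimum number of single-character insertions, deletions, and substitutions needed to turn the predicted string into the reference. A minimal dynamic-programming implementation (not the paper's code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum edits (insert/delete/substitute) turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

# Two SMILES strings differing by one atom symbol:
print(levenshtein("CC(=O)O", "CC(=O)N"))  # -> 1
```

A lower average distance therefore means predictions that are character-for-character closer to the ground-truth string, which is why 6.95 versus 7.49 favors the ViT encoder.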
ICMDT: Automated Chemical Structure Image Recognition
This paper introduces ICMDT, a Transformer-based architecture for molecular translation (image-to-InChI). By enhancing the TNT block to fuse pixel, small patch, and large patch embeddings, the model achieves superior accuracy on the Bristol-Myers Squibb dataset compared to CNN-RNN and standard Transformer baselines.
Image2SMILES: Transformer OCSR with Synthetic Data Pipeline
A Transformer-based system for optical chemical structure recognition that introduces a comprehensive data generation pipeline (FG-SMILES, Markush structures, visual contamination) and achieves 79% accuracy on real-world images, outperforming rule-based systems such as OSRA.
MICER: Molecular Image Captioning with Transfer Learning
MICER treats optical chemical structure recognition as an image captioning task, using transfer learning with a fine-tuned ResNet encoder and attention-based LSTM decoder to convert molecular images into SMILES strings, reaching 97.54% sequence accuracy on synthetic data and 82.33% on real-world images.
SwinOCSR: End-to-End Chemical OCR with Swin Transformers
Proposes an end-to-end architecture replacing standard CNN backbones with Swin Transformer to capture global image context. Introduces Multi-label Focal Loss to handle severe token imbalance in chemical datasets.
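Focal loss addresses token imbalance by down-weighting tokens the model already predicts confidently via the standard factor (1 − p_t)^γ. SwinOCSR's exact multi-label formulation is the paper's; a generic per-token focal cross-entropy can be sketched as:

```python
import numpy as np

def focal_loss(logits, target, gamma=2.0):
    """Per-token focal cross-entropy: -(1 - p_t)^gamma * log(p_t).

    logits: (vocab,) raw scores; target: index of the true token.
    With gamma=0 this reduces to ordinary cross-entropy.
    """
    z = logits - logits.max()           # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum() # softmax
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * np.log(p_t)
```

Frequent, easy tokens (carbon atoms, single bonds) end up with p_t near 1 and contribute almost nothing, so the gradient is dominated by the rare bond and stereochemistry tokens that cause the imbalance.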
DECIMER: Deep Learning for Chemical Image Recognition
DECIMER adapts the “Show, Attend and Tell” image captioning architecture to translate chemical structure images into SMILES strings. By leveraging massive synthetic datasets generated from PubChem, it demonstrates that deep learning can perform optical chemical recognition without complex, hand-engineered rule systems.