Overview

This note provides a comparative analysis of image-to-sequence methods for Optical Chemical Structure Recognition (OCSR). These methods treat molecular structure recognition as an image captioning task, using encoder-decoder architectures to generate sequential molecular representations (SMILES, SELFIES, InChI) directly from pixels.
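Whatever the encoder and output format, all of these methods share the same inference skeleton: an autoregressive loop that emits the molecular string one token at a time. A minimal sketch, where the hypothetical `step_fn(prefix) -> next_token` stands in for a trained encoder-decoder conditioned on the image:

```python
def greedy_decode(step_fn, start_token="<sos>", end_token="<eos>", max_len=128):
    """Generic autoregressive decoding loop used by image-to-sequence OCSR
    models: predict the next token of the molecular string from the image
    features and the prefix generated so far, until the end token."""
    tokens = [start_token]
    for _ in range(max_len):
        nxt = step_fn(tokens)
        if nxt == end_token:
            break
        tokens.append(nxt)
    return tokens[1:]

# Toy "model" that emits ethanol's SMILES token by token.
script = iter(["C", "C", "O", "<eos>"])
print(greedy_decode(lambda prefix: next(script)))  # ['C', 'C', 'O']
```

Real systems replace greedy selection with beam search, but the loop structure is the same.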

For the full taxonomy of OCSR approaches including image-to-graph and rule-based methods, see the OCSR Methods taxonomy.

Architectural Evolution (2019-2025)

The field has undergone rapid architectural evolution, with clear generational shifts in both encoder and decoder design.

Timeline

| Era | Encoder | Decoder | Representative Methods |
| --- | --- | --- | --- |
| 2019-2020 | CNN (Inception V3, ResNet) | LSTM/GRU with Attention | Staker et al., DECIMER |
| 2021 | EfficientNet, ViT | Transformer | DECIMER 1.0, Img2Mol, ViT-InChI |
| 2022 | Swin Transformer, ResNet | Transformer | SwinOCSR, Image2SMILES, MICER |
| 2023-2024 | EfficientNetV2, SwinV2 | Transformer | DECIMER.ai, Image2InChI, MMSSC-Net |
| 2025 | EfficientViT, VLMs (Qwen2-VL) | LLM decoders, RL fine-tuning | MolSight, GTR-CoT, OCSU |

Encoder Architectures

| Architecture | Methods Using It | Key Characteristics |
| --- | --- | --- |
| Inception V3 | DECIMER (2020) | Early CNN approach, 299×299 input |
| ResNet-50/101 | IMG2SMI, Image2SMILES, MICER, DGAT | Strong baseline, well-understood |
| EfficientNet-B3 | DECIMER 1.0 | Efficient scaling, compound coefficients |
| EfficientNet-V2-M | DECIMER.ai, DECIMER-Hand-Drawn | Improved training efficiency |
| EfficientViT-L1 | MolSight | Optimized for deployment |
| Swin Transformer | SwinOCSR, MolParser | Hierarchical vision transformer |
| SwinV2 | MMSSC-Net, Image2InChI | Improved training stability |
| Vision Transformer (ViT) | ViT-InChI | Pure attention encoder |
| DenseNet | RFL, Hu et al. RCGD | Dense connections, feature reuse |
| Deep TNT | ICMDT | Transformer-in-Transformer |
| Qwen2-VL | OCSU, GTR-CoT | Vision-language model encoder |

Decoder Architectures

| Architecture | Methods Using It | Output Format |
| --- | --- | --- |
| GRU with Attention | DECIMER, RFL, Hu et al. RCGD | SMILES, RFL, SSML |
| LSTM with Attention | Staker et al., ChemPix, MICER | SMILES |
| Transformer | Most 2021+ methods | SMILES, SELFIES, InChI |
| GPT-2 | MMSSC-Net | SMILES |
| BART | MolParser | E-SMILES |
| Pre-trained CDDD | Img2Mol | Continuous embedding → SMILES |

Output Representation Comparison

The choice of molecular string representation significantly impacts model performance. Representations fall into three categories: core molecular formats for single structures; extended formats for molecular families and variable structures (primarily Markush structures in patents); and specialized representations optimized for specific recognition challenges.

The Rajan et al. 2022 ablation study provides a comparison of core formats.

Core Molecular Formats

These represent specific, concrete molecular structures.

| Format | Validity Guarantee | Sequence Length | Key Characteristic | Used By |
| --- | --- | --- | --- | --- |
| SMILES | No | Shortest (baseline) | Standard, highest accuracy | DECIMER.ai, MolSight, DGAT, most 2023+ |
| DeepSMILES | Partial | ~1.1× SMILES | Reduces non-local dependencies | SwinOCSR |
| SELFIES | Yes (100%) | ~1.5× SMILES | Guaranteed valid molecules | DECIMER 1.0, IMG2SMI |
| InChI | N/A (canonical) | Variable (long) | Unique identifiers, layered syntax | ViT-InChI, ICMDT, Image2InChI |
| FG-SMILES | No | Similar to SMILES | Functional group-aware tokenization | Image2SMILES |

SMILES and Variants

SMILES remains the dominant format due to its compactness and highest accuracy on clean data. However, standard SMILES uses paired symbols for ring closures (matching digits) and branches (matching parentheses) whose partners may appear far apart in the sequence, creating long-range dependencies that are difficult for sequence models to learn.

DeepSMILES addresses these non-local syntax dependencies by changing how branches and ring closures are encoded (closing parentheses only, and a single ring-size token in place of paired digits), making sequences more learnable for neural models. Despite this modification, DeepSMILES sequences are ~1.1× longer than standard SMILES, not shorter. Combined with regex-based tokenization and a compact 76-token vocabulary, as in SwinOCSR, the format offers partial validity improvements and a middle ground between SMILES accuracy and guaranteed validity.
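The regex-based tokenization mentioned above is shared by most SMILES-family decoders. A minimal sketch, with the pattern adapted from the commonly used Schwaller et al. SMILES regex (individual papers derive different final vocabularies from it):

```python
import re

# Atom-level SMILES tokenization: bracket atoms, two-letter halogens,
# two-digit ring closures, then single-character tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFI]|[bcnops]|[()=#\-+\\/:~@?>*$.]|\d)"
)

def tokenize(smiles):
    """Split a SMILES string into model tokens; must round-trip exactly."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "untokenizable character present"
    return tokens

print(tokenize("Cc1ccc(Br)cc1"))
# ['C', 'c', '1', 'c', 'c', 'c', '(', 'Br', ')', 'c', 'c', '1']
```

Ordering matters: `Br`/`Cl` must precede the single-letter atom classes so that bromine is not split into boron plus a stray character.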

SELFIES guarantees 100% valid molecules by design through a context-free grammar, eliminating invalid outputs entirely. This comes at the cost of ~1.5× longer sequences and a typical 2-5% accuracy drop compared to SMILES on exact-match metrics. The validity guarantee makes SELFIES particularly attractive for generative modeling applications.

InChI uses a layered canonical syntax fundamentally different from SMILES-based formats. While valuable for unique molecular identification, its complex multi-layer structure (formula, connectivity, stereochemistry, isotopes, etc.) and longer sequences make it less suitable for image-to-sequence learning, resulting in lower recognition accuracy.

Key Findings from Rajan et al. 2022

  1. SMILES achieves highest exact-match accuracy on clean synthetic data
  2. SELFIES guarantees 100% valid molecules but at cost of ~2-5% accuracy drop
  3. InChI is problematic due to complex layered syntax and longer sequences
  4. DeepSMILES offers middle ground with partial validity improvements through modified syntax
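The validity problem behind findings 1-2 is concrete: a SMILES decoder can emit strings that are not even syntactically well formed. A cheap sanity check (illustrative sketch, not any paper's evaluation code) catches the most common failure modes; full valence checking needs a toolkit such as RDKit, and SELFIES sidesteps the issue entirely by construction:

```python
import re
from collections import Counter

def quick_smiles_sanity(s):
    """Syntactic checks for common invalid decoder outputs: unbalanced
    branch parentheses and unpaired ring-closure digits."""
    if s.count("(") != s.count(")"):
        return False
    # Ignore digits inside bracket atoms (isotopes, charges, H counts).
    stripped = re.sub(r"\[[^\]]*\]", "", s)
    closures = Counter(re.findall(r"%\d{2}|\d", stripped))
    return all(count % 2 == 0 for count in closures.values())

print(quick_smiles_sanity("c1ccccc1O"))  # True  (ring closed, no branches)
print(quick_smiles_sanity("c1ccccc(O"))  # False (open branch, open ring)
```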

Extended Formats for Variable Structures

Markush structures represent families of molecules rather than single compounds, using variable groups (R1, R2, etc.) with textual definitions. They are ubiquitous in patent documents for intellectual property protection. Standard SMILES cannot represent these variable structures.

| Format | Base Format | Key Feature | Used By |
| --- | --- | --- | --- |
| E-SMILES | SMILES + XML annotations | Backward-compatible with separator token | MolParser |
| CXSMILES | SMILES + extension block | Substituent tables, compression | MarkushGrapher |

E-SMILES (Extended SMILES) maintains backward compatibility by using a <sep> token to separate core SMILES from XML-like annotations. Annotations encode Markush substituents (<a>index:group</a>), polymer structures (<p>polymer_info</p>), and abstract ring patterns (<r>abstract_ring</r>). The core structure remains parseable by standard RDKit.
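The backward-compatibility claim can be sketched directly from the layout described above: split at the separator, keep the core for RDKit, and read the annotations separately. The function and example string here are hypothetical illustrations of that layout, not MolParser's actual API:

```python
import re

def split_e_smiles(e_smiles, sep="<sep>"):
    """Split an E-SMILES-style string into its standard-SMILES core and
    its Markush substituent annotations (<a>index:group</a> tags,
    following the description in the text)."""
    core, _, annotations = e_smiles.partition(sep)
    substituents = re.findall(r"<a>(\d+):([^<]+)</a>", annotations)
    return core, [(int(idx), group) for idx, group in substituents]

core, subs = split_e_smiles("c1ccccc1C<sep><a>6:OMe</a><a>6:Cl</a>")
print(core)  # 'c1ccccc1C' -- the core stays parseable as plain SMILES
print(subs)  # [(6, 'OMe'), (6, 'Cl')]
```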

CXSMILES optimizes representation by moving variable groups directly into the main SMILES string as special atoms with explicit atom indexing (e.g., C:1) to link to an extension block containing substituent tables. This handles both frequency variation and position variation in Markush structures.

Specialized Representations

These formats optimize for specific recognition challenges beyond standard single-molecule tasks.

RFL: Ring-Free Language

RFL fundamentally restructures molecular serialization through hierarchical ring decomposition, addressing a core challenge: standard 1D formats (SMILES, SSML) flatten complex 2D molecular graphs, losing explicit spatial relationships.

Mechanism: RFL decomposes molecules into three explicit components:

  • Molecular Skeleton (𝒮): Main graph with rings “collapsed”
  • Ring Structures (ℛ): Individual ring components stored separately
  • Branch Information (ℱ): Connectivity between skeleton and rings

Technical approach:

  1. Detect all non-nested rings using DFS
  2. Calculate adjacency (γ) between rings based on shared edges
  3. Merge isolated rings (γ=0) into SuperAtoms (single node placeholders)
  4. Merge adjacent rings (γ>0) into SuperBonds (edge placeholders)
  5. Progressive decoding: predict skeleton first, then conditionally decode rings using stored hidden states
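Steps 2-4 hinge on the ring-adjacency count γ. A minimal sketch of that computation under an assumed list-of-cycles input (not the authors' implementation): rings sharing at least one edge are merged as SuperBonds, isolated rings collapse to SuperAtoms.

```python
from itertools import combinations

def ring_edges(ring):
    """Undirected edge set of a ring given as an ordered atom-index cycle."""
    return {frozenset((ring[i], ring[(i + 1) % len(ring)]))
            for i in range(len(ring))}

def classify_rings(rings):
    """gamma(r1, r2) = number of shared edges; gamma > 0 with any
    neighbor -> SuperBond, otherwise the ring is isolated -> SuperAtom."""
    fused = set()
    for (i, r1), (j, r2) in combinations(enumerate(rings), 2):
        if ring_edges(r1) & ring_edges(r2):
            fused.update((i, j))
    return ["SuperBond" if i in fused else "SuperAtom"
            for i in range(len(rings))]

# Naphthalene-like fused pair (shared edge 4-5) plus an isolated ring.
rings = [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9], [10, 11, 12]]
print(classify_rings(rings))  # ['SuperBond', 'SuperBond', 'SuperAtom']
```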

Performance: RFL achieves SOTA results on both handwritten (95.38% EM) and printed (95.58% EM) structures, with particular strength on high-complexity molecules where standard baselines fail completely (0% → ~30% on hardest tier).

Note: RFL does not preserve original drawing orientation; it’s focused on computational efficiency through hierarchical decomposition.

SSML: Structure-Specific Markup Language

SSML is the primary orientation-preserving format in OCSR. Based on Chemfig (LaTeX chemical drawing package), it provides step-by-step drawing instructions rather than abstract molecular graphs.

Key characteristics:

  • Describes how to draw the molecule, not just its graph structure
  • Uses “reconnection marks” for cyclic structures
  • Preserves branch angles and spatial relationships
  • Significantly outperformed SMILES for handwritten recognition: 92.09% vs 81.89% EM (Hu et al. RCGD 2023)

Use case: Particularly valuable for hand-drawn structure recognition where visual alignment between image and reconstruction sequence aids model learning.

Training Data Comparison

Training data scale has grown dramatically, with a shift toward combining synthetic and real-world images.

Data Scale Evolution

| Year | Typical Scale | Maximum Reported | Primary Source |
| --- | --- | --- | --- |
| 2019-2020 | 1-15M | 57M (Staker) | Synthetic (RDKit, CDK) |
| 2021-2022 | 5-35M | 35M (DECIMER 1.0) | Synthetic with augmentation |
| 2023-2024 | 100-150M | 450M+ (DECIMER.ai) | Synthetic + real patents |
| 2025 | 1-10M + real | 7.7M (MolParser) | Curated real + synthetic |

Synthetic vs Real Data

| Method | Training Data | Real-World Performance Notes |
| --- | --- | --- |
| DECIMER.ai | 450M+ synthetic (RanDepict) | Strong generalization via domain randomization |
| MolParser | 7.7M with active learning | Explicitly targets “in the wild” images |
| GTR-CoT | Real patent/paper images | Chain-of-thought improves reasoning |
| MolSight | Multi-stage curriculum | RL fine-tuning for stereochemistry |

Data Augmentation Strategies

Common augmentation techniques across methods:

| Technique | Purpose | Used By |
| --- | --- | --- |
| Rotation | Orientation invariance | Nearly all methods |
| Gaussian blur | Image quality variation | DECIMER, MolParser |
| Salt-and-pepper noise | Scan artifact simulation | DECIMER, Image2SMILES |
| Affine transforms | Perspective variation | ChemPix, MolParser |
| Font/style variation | Rendering diversity | RanDepict (DECIMER.ai) |
| Hand-drawn simulation | Sketch-like inputs | ChemPix, ChemReco, DECIMER-Hand-Drawn |
| Background variation | Document context | MolParser, DECIMER.ai |
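For concreteness, salt-and-pepper noise is among the simplest of these to implement. A minimal stdlib sketch on a grayscale image represented as a list of rows (illustrative only, not any specific paper's pipeline, which typically operates on NumPy arrays):

```python
import random

def salt_and_pepper(img, p=0.05, seed=0):
    """Scan-artifact simulation: each 0-255 pixel independently becomes
    black (pepper) or white (salt) with probability p/2 each."""
    rng = random.Random(seed)
    out = []
    for row in img:
        new_row = []
        for px in row:
            r = rng.random()
            if r < p / 2:
                new_row.append(0)      # pepper
            elif r < p:
                new_row.append(255)    # salt
            else:
                new_row.append(px)
        out.append(new_row)
    return out

clean = [[128] * 8 for _ in range(8)]
noisy = salt_and_pepper(clean, p=0.2)
```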

Hardware and Compute Requirements

Hardware requirements span several orders of magnitude, from consumer GPUs to TPU pods.

Training Hardware Comparison

| Method | Hardware | Training Time | Dataset Size |
| --- | --- | --- | --- |
| Staker et al. (2019) | 8× GPUs | 26 days | 57M |
| IMG2SMI (2021) | 1× RTX 2080 Ti | 5 epochs | ~10M |
| Image2SMILES (2022) | 4× V100 | 2 weeks | 30M |
| MICER (2022) | 4× V100 | 42 hours | 10M |
| DECIMER 1.0 (2021) | TPU v3-8 | Not reported | 35M |
| DECIMER.ai (2023) | TPU v3-256 | Not reported | 450M+ |
| SwinOCSR (2022) | 4× RTX 3090 | 5 days | 5M |
| MolParser (2025) | 8× A100 | Curriculum learning | 7.7M |
| MolSight (2025) | Not specified | RL fine-tuning (GRPO) | Multi-stage |

Inference Considerations

Few papers report inference speed consistently. Available data:

| Method | Inference Speed | Notes |
| --- | --- | --- |
| DECIMER 1.0 | 4× faster than DECIMER | TensorFlow Lite optimization |
| OSRA (baseline) | ~1 image/sec | CPU-based rule system |
| MolScribe | Real-time capable | Optimized Swin encoder |

Accessibility Tiers

| Tier | Hardware | Representative Methods |
| --- | --- | --- |
| Consumer | 1× RTX 2080/3090 | IMG2SMI, ChemPix |
| Workstation | 4× V100/A100 | Image2SMILES, MICER, SwinOCSR |
| Cloud/HPC | TPU pods, 8+ A100 | DECIMER.ai, MolParser |

Benchmark Performance

Common Evaluation Datasets

| Dataset | Type | Size | Challenge |
| --- | --- | --- | --- |
| USPTO | Patent images | ~5K test | Real-world complexity |
| UOB | Scanned images | ~5K test | Scan artifacts |
| Staker | Synthetic | Variable | Baseline synthetic |
| CLEF | Patent images | ~1K test | Markush structures |
| JPO | Japanese patents | ~1K test | Different rendering styles |

Accuracy Comparison (Exact Match %)

Methods are roughly grouped by evaluation era; direct comparison is complicated by different test sets.

| Method | USPTO | UOB | Staker | Notes |
| --- | --- | --- | --- | --- |
| OSRA (baseline) | ~70% | ~65% | ~80% | Rule-based reference |
| DECIMER 1.0 | ~85% | ~80% | ~90% | First transformer-based |
| SwinOCSR | ~88% | ~82% | ~92% | Swin encoder advantage |
| DECIMER.ai | ~90% | ~85% | ~95% | Scale + augmentation |
| MolParser | ~92% | ~88% | ~96% | Real-world focus |
| MolSight | ~93%+ | ~89%+ | ~97%+ | RL fine-tuning boost |

Note: Numbers are approximate and may vary by specific test split. See individual paper notes for precise figures.
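The exact-match metric behind these tables reduces to string equality after canonicalization. A minimal sketch; real evaluations first canonicalize both sides (e.g. with RDKit's `Chem.MolToSmiles`) so that chemically identical SMILES with different atom orderings count as matches:

```python
def exact_match(predictions, references):
    """Exact-match accuracy: fraction of predictions identical to the
    reference string. Canonicalize both sides before calling this."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match(["CCO", "c1ccccc1", "CC(=O)O"],
                  ["CCO", "c1ccccc1", "CC(=O)N"]))  # -> 2/3 ≈ 0.667
```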

Stereochemistry Recognition

Stereochemistry remains a persistent challenge across all methods:

| Method | Approach | Stereo Accuracy |
| --- | --- | --- |
| Most methods | Standard SMILES | Lower than non-stereo |
| MolSight | RL (GRPO) specifically for stereo | Improved |
| MolNexTR | Graph-based explicit stereo | Better handling |
| Image2InChI | InChI stereo layers | Mixed results |

Hand-Drawn Recognition

A distinct sub-lineage focuses on hand-drawn/sketched chemical structures.

| Method | Target Domain | Key Innovation |
| --- | --- | --- |
| ChemPix (2021) | Hand-drawn hydrocarbons | First deep learning for sketches |
| Hu et al. RCGD (2023) | Hand-drawn structures | Random conditional guided decoder |
| ChemReco (2024) | Hand-drawn C-H-O structures | EfficientNet + curriculum learning |
| DECIMER-Hand-Drawn (2024) | General hand-drawn | Enhanced DECIMER architecture |

Hand-Drawn vs Printed Trade-offs

  • Hand-drawn methods sacrifice some accuracy on clean printed images
  • Require specialized training data (synthetic hand-drawn simulation)
  • Generally smaller training sets due to data collection difficulty
  • Better suited for educational and lab notebook applications

Key Innovations by Method

| Method | Primary Innovation |
| --- | --- |
| Staker et al. | First end-to-end deep learning OCSR |
| DECIMER 1.0 | Transformer decoder + SELFIES |
| Img2Mol | Continuous embedding space (CDDD) |
| Image2SMILES | Functional group-aware SMILES (FG-SMILES) |
| SwinOCSR | Hierarchical vision transformer encoder |
| DECIMER.ai | Massive scale + RanDepict augmentation |
| MolParser | Extended SMILES + active learning |
| MolSight | RL fine-tuning (GRPO) for accuracy |
| GTR-CoT | Chain-of-thought graph traversal |
| OCSU | Multi-task vision-language understanding |
| RFL | Hierarchical ring decomposition with SuperAtoms/SuperBonds |

Open Challenges

  1. Stereochemistry: Consistent challenge across all methods; RL approaches (MolSight) show promise
  2. Abbreviations/R-groups: E-SMILES and Markush-specific methods emerging
  3. Real-world robustness: Gap between synthetic training and patent/paper images
  4. Inference speed: Rarely reported; important for production deployment
  5. Memory efficiency: Almost never documented; limits accessibility
  6. Multi-molecule images: Most methods assume single isolated structure

References

Individual paper notes linked throughout. For the complete method listing, see the OCSR Methods taxonomy.