Overview

This note provides a comparative analysis of image-to-sequence methods for Optical Chemical Structure Recognition (OCSR). These methods treat molecular structure recognition as an image captioning task, using encoder-decoder architectures to generate sequential molecular representations (SMILES, SELFIES, InChI) directly from pixels.
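Whatever the encoder and output format, all of these methods share the same inference skeleton: an autoregressive loop that emits the molecular string one token at a time. A minimal sketch, where the hypothetical `step_fn(prefix) -> next_token` stands in for a trained encoder-decoder conditioned on the image:

```python
def greedy_decode(step_fn, start_token="<sos>", end_token="<eos>", max_len=128):
    """Generic autoregressive decoding loop used by image-to-sequence OCSR
    models: predict the next token of the molecular string from the image
    features and the prefix generated so far, until the end token."""
    tokens = [start_token]
    for _ in range(max_len):
        nxt = step_fn(tokens)
        if nxt == end_token:
            break
        tokens.append(nxt)
    return tokens[1:]

# Toy "model" that emits ethanol's SMILES token by token.
script = iter(["C", "C", "O", "<eos>"])
print(greedy_decode(lambda prefix: next(script)))  # ['C', 'C', 'O']
```

Real systems replace greedy selection with beam search, but the loop structure is the same.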

For the full taxonomy of OCSR approaches including image-to-graph and rule-based methods, see the OCSR Methods taxonomy.

Architectural Evolution (2019-2025)

The field has undergone rapid architectural evolution, with clear generational shifts in both encoder and decoder design.

Timeline

| Era | Encoder | Decoder | Representative Methods |
| --- | --- | --- | --- |
| 2019-2020 | CNN (Inception V3, ResNet) | LSTM/GRU with Attention | Staker et al., DECIMER |
| 2021 | EfficientNet, ViT | Transformer | DECIMER 1.0, Img2Mol, ViT-InChI |
| 2022 | Swin Transformer, ResNet | Transformer | SwinOCSR, Image2SMILES, MICER |
| 2023-2024 | EfficientNetV2, SwinV2 | Transformer | DECIMER.ai, Image2InChI, MMSSC-Net |
| 2025 | EfficientViT, VLMs (Qwen2-VL) | LLM decoders, RL fine-tuning | MolSight, GTR-CoT, OCSU |

Encoder Architectures

| Architecture | Methods Using It | Key Characteristics |
| --- | --- | --- |
| Inception V3 | DECIMER (2020) | Early CNN approach, 299×299 input |
| ResNet-50/101 | IMG2SMI, Image2SMILES, MICER, DGAT | Strong baseline, well-understood |
| EfficientNet-B3 | DECIMER 1.0 | Efficient scaling, compound coefficients |
| EfficientNet-V2-M | DECIMER.ai, DECIMER-Hand-Drawn | Improved training efficiency |
| EfficientViT-L1 | MolSight | Optimized for deployment |
| Swin Transformer | SwinOCSR, MolParser | Hierarchical vision transformer |
| SwinV2 | MMSSC-Net, Image2InChI | Improved training stability |
| Vision Transformer (ViT) | ViT-InChI | Pure attention encoder |
| DenseNet | RFL, Hu et al. RCGD | Dense connections, feature reuse |
| Deep TNT | ICMDT | Transformer-in-Transformer |
| Qwen2-VL | OCSU, GTR-CoT | Vision-language model encoder |

Decoder Architectures

| Architecture | Methods Using It | Output Format |
| --- | --- | --- |
| GRU with Attention | DECIMER, RFL, Hu et al. RCGD | SMILES, RFL, SSML |
| LSTM with Attention | Staker et al., ChemPix, MICER | SMILES |
| Transformer | Most 2021+ methods | SMILES, SELFIES, InChI |
| GPT-2 | MMSSC-Net | SMILES |
| BART | MolParser | E-SMILES |
| Pre-trained CDDD | Img2Mol | Continuous embedding → SMILES |

Output Representation Comparison

The choice of molecular string representation significantly impacts model performance. Representations fall into three categories: core molecular formats for single structures; extended formats for molecular families and variable structures (primarily Markush structures in patents); and specialized representations optimized for specific recognition challenges.

The Rajan et al. 2022 ablation study provides a comparison of core formats.

Core Molecular Formats

These represent specific, concrete molecular structures.

| Format | Validity Guarantee | Sequence Length | Key Characteristic | Used By |
| --- | --- | --- | --- | --- |
| SMILES | No | Shortest (baseline) | Standard, highest accuracy | DECIMER.ai, MolSight, DGAT, most 2023+ |
| DeepSMILES | Partial | ~1.1× SMILES | Reduces non-local dependencies | SwinOCSR |
| SELFIES | Yes (100%) | ~1.5× SMILES | Guaranteed valid molecules | DECIMER 1.0, IMG2SMI |
| InChI | N/A (canonical) | Variable (long) | Unique identifiers, layered syntax | ViT-InChI, ICMDT, Image2InChI |
| FG-SMILES | No | Similar to SMILES | Functional group-aware tokenization | Image2SMILES |

SMILES and Variants

SMILES remains the dominant format due to its compactness and highest accuracy on clean data. However, standard SMILES uses paired symbols for ring closures (matching digits) and branches (matching parentheses) whose partners may appear far apart in the sequence, creating long-range dependencies that are difficult for sequence models to learn.

DeepSMILES addresses these non-local syntax dependencies by changing how branches and ring closures are encoded (closing parentheses only, and a single ring-size token in place of paired digits), making sequences more learnable for neural models. Despite this modification, DeepSMILES sequences are ~1.1× longer than standard SMILES, not shorter. Combined with regex-based tokenization and a compact 76-token vocabulary, as in SwinOCSR, the format offers partial validity improvements and a middle ground between SMILES accuracy and guaranteed validity.
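The regex-based tokenization mentioned above is shared by most SMILES-family decoders. A minimal sketch, with the pattern adapted from the commonly used Schwaller et al. SMILES regex (individual papers derive different final vocabularies from it):

```python
import re

# Atom-level SMILES tokenization: bracket atoms, two-letter halogens,
# two-digit ring closures, then single-character tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFI]|[bcnops]|[()=#\-+\\/:~@?>*$.]|\d)"
)

def tokenize(smiles):
    """Split a SMILES string into model tokens; must round-trip exactly."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "untokenizable character present"
    return tokens

print(tokenize("Cc1ccc(Br)cc1"))
# ['C', 'c', '1', 'c', 'c', 'c', '(', 'Br', ')', 'c', 'c', '1']
```

Ordering matters: `Br`/`Cl` must precede the single-letter atom classes so that bromine is not split into boron plus a stray character.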

SELFIES guarantees 100% valid molecules by design through a context-free grammar, eliminating invalid outputs entirely. This comes at the cost of ~1.5× longer sequences and a typical 2-5% accuracy drop compared to SMILES on exact-match metrics. The validity guarantee makes SELFIES particularly attractive for generative modeling applications.

InChI uses a layered canonical syntax fundamentally different from SMILES-based formats. While valuable for unique molecular identification, its complex multi-layer structure (formula, connectivity, stereochemistry, isotopes, etc.) and longer sequences make it less suitable for image-to-sequence learning, resulting in lower recognition accuracy.

Key Findings from Rajan et al. 2022

  1. SMILES achieves highest exact-match accuracy on clean synthetic data
  2. SELFIES guarantees 100% valid molecules but at cost of ~2-5% accuracy drop
  3. InChI is problematic due to complex layered syntax and longer sequences
  4. DeepSMILES offers middle ground with partial validity improvements through modified syntax
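The validity problem behind findings 1-2 is concrete: a SMILES decoder can emit strings that are not even syntactically well formed. A cheap sanity check (illustrative sketch, not any paper's evaluation code) catches the most common failure modes; full valence checking needs a toolkit such as RDKit, and SELFIES sidesteps the issue entirely by construction:

```python
import re
from collections import Counter

def quick_smiles_sanity(s):
    """Syntactic checks for common invalid decoder outputs: unbalanced
    branch parentheses and unpaired ring-closure digits."""
    if s.count("(") != s.count(")"):
        return False
    # Ignore digits inside bracket atoms (isotopes, charges, H counts).
    stripped = re.sub(r"\[[^\]]*\]", "", s)
    closures = Counter(re.findall(r"%\d{2}|\d", stripped))
    return all(count % 2 == 0 for count in closures.values())

print(quick_smiles_sanity("c1ccccc1O"))  # True  (ring closed, no branches)
print(quick_smiles_sanity("c1ccccc(O"))  # False (open branch, open ring)
```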

Extended Formats for Variable Structures

Markush structures represent families of molecules rather than single compounds, using variable groups (R1, R2, etc.) with textual definitions. They are ubiquitous in patent documents for intellectual property protection. Standard SMILES cannot represent these variable structures.

| Format | Base Format | Key Feature | Used By |
| --- | --- | --- | --- |
| E-SMILES | SMILES + XML annotations | Backward-compatible with separator token | MolParser |
| CXSMILES | SMILES + extension block | Substituent tables, compression | MarkushGrapher |

E-SMILES (Extended SMILES) maintains backward compatibility by using a <sep> token to separate core SMILES from XML-like annotations. Annotations encode Markush substituents (<a>index:group</a>), polymer structures (<p>polymer_info</p>), and abstract ring patterns (<r>abstract_ring</r>). The core structure remains parseable by standard RDKit.
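The backward-compatibility claim can be sketched directly from the layout described above: split at the separator, keep the core for RDKit, and read the annotations separately. The function and example string here are hypothetical illustrations of that layout, not MolParser's actual API:

```python
import re

def split_e_smiles(e_smiles, sep="<sep>"):
    """Split an E-SMILES-style string into its standard-SMILES core and
    its Markush substituent annotations (<a>index:group</a> tags,
    following the description in the text)."""
    core, _, annotations = e_smiles.partition(sep)
    substituents = re.findall(r"<a>(\d+):([^<]+)</a>", annotations)
    return core, [(int(idx), group) for idx, group in substituents]

core, subs = split_e_smiles("c1ccccc1C<sep><a>6:OMe</a><a>6:Cl</a>")
print(core)  # 'c1ccccc1C' -- the core stays parseable as plain SMILES
print(subs)  # [(6, 'OMe'), (6, 'Cl')]
```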

CXSMILES optimizes representation by moving variable groups directly into the main SMILES string as special atoms with explicit atom indexing (e.g., C:1) to link to an extension block containing substituent tables. This handles both frequency variation and position variation in Markush structures.

Specialized Representations

These formats optimize for specific recognition challenges beyond standard single-molecule tasks.

RFL: Ring-Free Language

RFL fundamentally restructures molecular serialization through hierarchical ring decomposition, addressing a core challenge: standard 1D formats (SMILES, SSML) flatten complex 2D molecular graphs, losing explicit spatial relationships.

Mechanism: RFL decomposes molecules into three explicit components:

  • Molecular Skeleton (𝒮): Main graph with rings “collapsed”
  • Ring Structures (ℛ): Individual ring components stored separately
  • Branch Information (ℱ): Connectivity between skeleton and rings

Technical approach:

  1. Detect all non-nested rings using DFS
  2. Calculate adjacency (γ) between rings based on shared edges
  3. Merge isolated rings (γ=0) into SuperAtoms (single node placeholders)
  4. Merge adjacent rings (γ>0) into SuperBonds (edge placeholders)
  5. Progressive decoding: predict skeleton first, then conditionally decode rings using stored hidden states
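Steps 2-4 hinge on the ring-adjacency count γ. A minimal sketch of that computation under an assumed list-of-cycles input (not the authors' implementation): rings sharing at least one edge are merged as SuperBonds, isolated rings collapse to SuperAtoms.

```python
from itertools import combinations

def ring_edges(ring):
    """Undirected edge set of a ring given as an ordered atom-index cycle."""
    return {frozenset((ring[i], ring[(i + 1) % len(ring)]))
            for i in range(len(ring))}

def classify_rings(rings):
    """gamma(r1, r2) = number of shared edges; gamma > 0 with any
    neighbor -> SuperBond, otherwise the ring is isolated -> SuperAtom."""
    fused = set()
    for (i, r1), (j, r2) in combinations(enumerate(rings), 2):
        if ring_edges(r1) & ring_edges(r2):
            fused.update((i, j))
    return ["SuperBond" if i in fused else "SuperAtom"
            for i in range(len(rings))]

# Naphthalene-like fused pair (shared edge 4-5) plus an isolated ring.
rings = [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9], [10, 11, 12]]
print(classify_rings(rings))  # ['SuperBond', 'SuperBond', 'SuperAtom']
```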

Performance: RFL achieves SOTA results on both handwritten (95.38% EM) and printed (95.58% EM) structures, with particular strength on high-complexity molecules where standard baselines fail completely (0% → ~30% on hardest tier).

Note: RFL does not preserve original drawing orientation; it’s focused on computational efficiency through hierarchical decomposition.

SSML: Structure-Specific Markup Language

SSML is the primary orientation-preserving format in OCSR. Based on Chemfig (LaTeX chemical drawing package), it provides step-by-step drawing instructions rather than abstract molecular graphs.

Key characteristics:

  • Describes how to draw the molecule, not just its graph structure
  • Uses “reconnection marks” for cyclic structures
  • Preserves branch angles and spatial relationships
  • Significantly outperformed SMILES for handwritten recognition: 92.09% vs 81.89% EM (Hu et al. RCGD 2023)

Use case: Particularly valuable for hand-drawn structure recognition where visual alignment between image and reconstruction sequence aids model learning.

Training Data Comparison

Training data scale has grown dramatically, with a shift toward combining synthetic and real-world images.

Data Scale Evolution

| Year | Typical Scale | Maximum Reported | Primary Source |
| --- | --- | --- | --- |
| 2019-2020 | 1-15M | 57M (Staker) | Synthetic (RDKit, CDK) |
| 2021-2022 | 5-35M | 35M (DECIMER 1.0) | Synthetic with augmentation |
| 2023-2024 | 100-150M | 450M+ (DECIMER.ai) | Synthetic + real patents |
| 2025 | 1-10M + real | 7.7M (MolParser) | Curated real + synthetic |

Synthetic vs Real Data

| Method | Training Data | Real-World Performance Notes |
| --- | --- | --- |
| DECIMER.ai | 450M+ synthetic (RanDepict) | Strong generalization via domain randomization |
| MolParser | 7.7M with active learning | Explicitly targets “in the wild” images |
| GTR-CoT | Real patent/paper images | Chain-of-thought improves reasoning |
| MolSight | Multi-stage curriculum | RL fine-tuning for stereochemistry |

Data Augmentation Strategies

Common augmentation techniques across methods:

| Technique | Purpose | Used By |
| --- | --- | --- |
| Rotation | Orientation invariance | Nearly all methods |
| Gaussian blur | Image quality variation | DECIMER, MolParser |
| Salt-and-pepper noise | Scan artifact simulation | DECIMER, Image2SMILES |
| Affine transforms | Perspective variation | ChemPix, MolParser |
| Font/style variation | Rendering diversity | RanDepict (DECIMER.ai) |
| Hand-drawn simulation | Sketch-like inputs | ChemPix, ChemReco, DECIMER-Hand-Drawn |
| Background variation | Document context | MolParser, DECIMER.ai |
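For concreteness, salt-and-pepper noise is among the simplest of these to implement. A minimal stdlib sketch on a grayscale image represented as a list of rows (illustrative only, not any specific paper's pipeline, which typically operates on NumPy arrays):

```python
import random

def salt_and_pepper(img, p=0.05, seed=0):
    """Scan-artifact simulation: each 0-255 pixel independently becomes
    black (pepper) or white (salt) with probability p/2 each."""
    rng = random.Random(seed)
    out = []
    for row in img:
        new_row = []
        for px in row:
            r = rng.random()
            if r < p / 2:
                new_row.append(0)      # pepper
            elif r < p:
                new_row.append(255)    # salt
            else:
                new_row.append(px)
        out.append(new_row)
    return out

clean = [[128] * 8 for _ in range(8)]
noisy = salt_and_pepper(clean, p=0.2)
```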

Hardware and Compute Requirements

Hardware requirements span several orders of magnitude, from consumer GPUs to TPU pods.

Training Hardware Comparison

| Method | Hardware | Training Time | Dataset Size |
| --- | --- | --- | --- |
| Staker et al. (2019) | 8× GPUs | 26 days | 57M |
| IMG2SMI (2021) | 1× RTX 2080 Ti | 5 epochs | ~10M |
| Image2SMILES (2022) | 4× V100 | 2 weeks | 30M |
| MICER (2022) | 4× V100 | 42 hours | 10M |
| DECIMER 1.0 (2021) | TPU v3-8 | Not reported | 35M |
| DECIMER.ai (2023) | TPU v3-256 | Not reported | 450M+ |
| SwinOCSR (2022) | 4× RTX 3090 | 5 days | 5M |
| MolParser (2025) | 8× A100 | Curriculum learning | 7.7M |
| MolSight (2025) | Not specified | RL fine-tuning (GRPO) | Multi-stage |

Inference Considerations

Few papers report inference speed consistently. Available data:

| Method | Inference Speed | Notes |
| --- | --- | --- |
| DECIMER 1.0 | 4× faster than DECIMER | TensorFlow Lite optimization |
| OSRA (baseline) | ~1 image/sec | CPU-based rule system |
| MolScribe | Real-time capable | Optimized Swin encoder |

Accessibility Tiers

| Tier | Hardware | Representative Methods |
| --- | --- | --- |
| Consumer | 1× RTX 2080/3090 | IMG2SMI, ChemPix |
| Workstation | 4× V100/A100 | Image2SMILES, MICER, SwinOCSR |
| Cloud/HPC | TPU pods, 8+ A100 | DECIMER.ai, MolParser |

Benchmark Performance

Common Evaluation Datasets

| Dataset | Type | Size | Challenge |
| --- | --- | --- | --- |
| USPTO | Patent images | ~5K test | Real-world complexity |
| UOB | Scanned images | ~5K test | Scan artifacts |
| Staker | Synthetic | Variable | Baseline synthetic |
| CLEF | Patent images | ~1K test | Markush structures |
| JPO | Japanese patents | ~1K test | Different rendering styles |

Accuracy Comparison (Exact Match %)

Methods are roughly grouped by evaluation era; direct comparison is complicated by different test sets.

| Method | USPTO | UOB | Staker | Notes |
| --- | --- | --- | --- | --- |
| OSRA (baseline) | ~70% | ~65% | ~80% | Rule-based reference |
| DECIMER 1.0 | ~85% | ~80% | ~90% | First transformer-based |
| SwinOCSR | ~88% | ~82% | ~92% | Swin encoder advantage |
| DECIMER.ai | ~90% | ~85% | ~95% | Scale + augmentation |
| MolParser | ~92% | ~88% | ~96% | Real-world focus |
| MolSight | ~93%+ | ~89%+ | ~97%+ | RL fine-tuning boost |

Note: Numbers are approximate and may vary by specific test split. See individual paper notes for precise figures.
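The exact-match metric behind these tables reduces to string equality after canonicalization. A minimal sketch; real evaluations first canonicalize both sides (e.g. with RDKit's `Chem.MolToSmiles`) so that chemically identical SMILES with different atom orderings count as matches:

```python
def exact_match(predictions, references):
    """Exact-match accuracy: fraction of predictions identical to the
    reference string. Canonicalize both sides before calling this."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match(["CCO", "c1ccccc1", "CC(=O)O"],
                  ["CCO", "c1ccccc1", "CC(=O)N"]))  # -> 2/3 ≈ 0.667
```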

Stereochemistry Recognition

Stereochemistry remains a persistent challenge across all methods:

| Method | Approach | Stereo Accuracy |
| --- | --- | --- |
| Most methods | Standard SMILES | Lower than non-stereo |
| MolSight | RL (GRPO) specifically for stereo | Improved |
| MolNexTR | Graph-based explicit stereo | Better handling |
| Image2InChI | InChI stereo layers | Mixed results |

Hand-Drawn Recognition

A distinct sub-lineage focuses on hand-drawn/sketched chemical structures.

| Method | Target Domain | Key Innovation |
| --- | --- | --- |
| ChemPix (2021) | Hand-drawn hydrocarbons | First deep learning for sketches |
| Hu et al. RCGD (2023) | Hand-drawn structures | Random conditional guided decoder |
| ChemReco (2024) | Hand-drawn C-H-O structures | EfficientNet + curriculum learning |
| DECIMER-Hand-Drawn (2024) | General hand-drawn | Enhanced DECIMER architecture |

Hand-Drawn vs Printed Trade-offs

  • Hand-drawn methods sacrifice some accuracy on clean printed images
  • Require specialized training data (synthetic hand-drawn simulation)
  • Generally smaller training sets due to data collection difficulty
  • Better suited for educational and lab notebook applications

Key Innovations by Method

| Method | Primary Innovation |
| --- | --- |
| Staker et al. | First end-to-end deep learning OCSR |
| DECIMER 1.0 | Transformer decoder + SELFIES |
| Img2Mol | Continuous embedding space (CDDD) |
| Image2SMILES | Functional group-aware SMILES (FG-SMILES) |
| SwinOCSR | Hierarchical vision transformer encoder |
| DECIMER.ai | Massive scale + RanDepict augmentation |
| MolParser | Extended SMILES + active learning |
| MolSight | RL fine-tuning (GRPO) for accuracy |
| GTR-CoT | Chain-of-thought graph traversal |
| OCSU | Multi-task vision-language understanding |
| RFL | Hierarchical ring decomposition with SuperAtoms/SuperBonds |

Open Challenges

  1. Stereochemistry: Consistent challenge across all methods; RL approaches (MolSight) show promise
  2. Abbreviations/R-groups: E-SMILES and Markush-specific methods emerging
  3. Real-world robustness: Gap between synthetic training and patent/paper images
  4. Inference speed: Rarely reported; important for production deployment
  5. Memory efficiency: Almost never documented; limits accessibility
  6. Multi-molecule images: Most methods assume single isolated structure

References

Individual paper notes linked throughout. For the complete method listing, see the OCSR Methods taxonomy.