Paper Information
Citation: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. Journal of Cheminformatics, 14, Article 41. https://doi.org/10.1186/s13321-022-00624-5
Publication: Journal of Cheminformatics 2022
What kind of paper is this?
This is a Methodological Paper with a significant Resource component.
- Method: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).
- Resource: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.
What is the motivation?
- Problem: OCSR (converting images of chemical structures to SMILES) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.
- Technical Gap: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and may miss global dependencies required for interpreting complex molecular diagrams.
- Data Imbalance: Chemical strings suffer from severe class imbalance (e.g., ‘C’ and ‘H’ are frequent; ‘Br’ or ‘Cl’ are rare), which causes standard Cross Entropy loss to underperform.
What is the novelty here?
- Swin Transformer Backbone: SwinOCSR is the first approach to replace the standard CNN backbone with a Swin Transformer. This allows the model to leverage shifted window attention to capture both local and global image features more effectively.
- Multi-label Focal Loss (MFL): The paper introduces a modified Focal Loss to OCSR to explicitly penalize the model for errors on rare tokens, addressing the “long-tail” distribution of chemical elements.
- Structured Synthetic Dataset: Creation of a dataset explicitly balanced across four structural categories: Kekulé-drawn rings, aromatic rings, and each of those two combined with substituents.
What experiments were performed?
- Backbone Comparison: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).
- Loss Function Ablation: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).
- Category Stress Test: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.
- Real-world Evaluation: The model was tested on a small set of 100 images manually extracted from literature vs. 100 generated images to measure domain shift.
What outcomes/conclusions?
- SOTA Performance: SwinOCSR achieved 98.58% accuracy on the synthetic test set, significantly outperforming ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones.
- Effective Handling of Length: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating superior global feature extraction.
- Domain Shift Issues: While performance on synthetic data was near-perfect, accuracy dropped to 25% on real-world literature images. The authors attribute this to noise, low resolution, and stylistic variations (e.g., abbreviations) not present in the training set.
Reproducibility Details
Data
- Source: The first 8.5 million structures from PubChem were downloaded, yielding ~6.9 million unique SMILES.
- Generation Pipeline:
- Tools: CDK (Chemistry Development Kit) for image rendering; RDKit for SMILES canonicalization.
- Augmentation: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekulé, (2) Aromatic, (3) Kekulé + Substituents, (4) Aromatic + Substituents. Substituents were randomly added from a list of 224 common patent substituents.
- Preprocessing: Images are rendered as binary, resized to 224×224, and copied to 3 channels to simulate RGB input (see the sketch below).
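A minimal Python sketch of this generation pipeline, under stated assumptions: the paper renders images with CDK (a Java toolkit), so RDKit's drawer here is only a convenient stand-in with a different depiction style, and `make_example` is a hypothetical helper name. The `deepsmiles` package handles label encoding.

```python
import numpy as np
import deepsmiles                      # pip install deepsmiles
from rdkit import Chem
from rdkit.Chem import Draw

converter = deepsmiles.Converter(rings=True, branches=True)

def make_example(smiles: str):
    """Build one (image, label) training pair.

    Rendering stand-in: the paper uses CDK; RDKit is substituted here
    purely so the sketch stays in Python.
    """
    mol = Chem.MolFromSmiles(smiles)
    canonical = Chem.MolToSmiles(mol)          # RDKit canonicalization
    label = converter.encode(canonical)        # SMILES -> DeepSMILES
    img = Draw.MolToImage(mol, size=(224, 224)).convert("L")
    arr = (np.asarray(img) < 128).astype(np.float32)   # binarize (ink = 1)
    arr = np.stack([arr] * 3, axis=-1)                 # copy to 3 channels
    return arr, label

image, label = make_example("c1ccccc1O")       # phenol
```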
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | Synthetic (PubChem-derived) | 4,500,000 | 18:1:1 split (Train/Val/Test) |
| Validation | Synthetic (PubChem-derived) | 250,000 | |
| Test | Synthetic (PubChem-derived) | 250,000 | |
Algorithms
- Loss Function: Multi-label Focal Loss (MFL). The single-label token classification task is cast as multi-label so that Focal Loss can be applied with a sigmoid activation on the logits (see the sketch after this list).
- Optimization:
- Optimizer: Adam with initial learning rate
5e-4. - Schedulers: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.
- Regularization: Dropout rate of
0.1.
- Optimizer: Adam with initial learning rate
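A minimal PyTorch sketch of the multi-label focal loss follows. The γ and α defaults are assumptions taken from the original Focal Loss formulation (Lin et al.), since the paper's exact values are not recorded in these notes. The (1 − p_t)^γ factor is what down-weights easy, abundant tokens and shifts gradient toward rare ones.

```python
import torch
import torch.nn.functional as F

def multilabel_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss over sigmoid activations, one binary problem per token class.

    logits:  (batch, seq_len, vocab_size) raw decoder outputs
    targets: (batch, seq_len) integer token ids
    gamma/alpha defaults follow Lin et al.; the paper's exact
    values are not recorded in these notes.
    """
    vocab_size = logits.size(-1)
    onehot = F.one_hot(targets, vocab_size).float()   # cast single-label as multi-label
    p = torch.sigmoid(logits)
    p_t = p * onehot + (1 - p) * (1 - onehot)         # prob. of the correct state
    alpha_t = alpha * onehot + (1 - alpha) * (1 - onehot)
    bce = F.binary_cross_entropy_with_logits(logits, onehot, reduction="none")
    # (1 - p_t)^gamma down-weights easy, frequent tokens (e.g. 'C'),
    # focusing the gradient on rare ones (e.g. 'Br').
    return (alpha_t * (1 - p_t) ** gamma * bce).sum(-1).mean()
```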
Models
- Backbone (Encoder 1): Swin Transformer.
- Patch size: $4 \times 4$.
- Linear embedding dimension: 192.
- Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).
- Output: Flattened patch sequence $S_b$.
- Transformer Encoder (Encoder 2): 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.
- Transformer Decoder: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).
- Tokenization: DeepSMILES format (syntactically more robust than SMILES for sequence generation, since it avoids unmatched ring-closure digits and unbalanced parentheses). Vocabulary: 76 unique tokens found in the dataset. Embedding dimension: 256. A sketch of the full stack follows.
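The PyTorch sketch below wires these pieces together under stated assumptions: the 6-layer encoder/decoder, 76-token vocabulary, and 256-d token embedding come from the notes above, while `backbone_dim`, `n_heads`, and `max_len` are illustrative. Any Swin implementation (e.g. timm's) is assumed to supply the flattened patch sequence $S_b$.

```python
import torch
import torch.nn as nn

class SwinOCSRHead(nn.Module):
    """Transformer encoder/decoder stack on top of a Swin backbone.

    Expects the backbone's flattened patch sequence S_b of shape
    (batch, n_patches, backbone_dim). Positional embedding on S_b is
    omitted here for brevity.
    """

    def __init__(self, backbone_dim=1536, d_model=256, n_heads=8,
                 vocab_size=76, max_len=100):
        super().__init__()
        self.in_proj = nn.Linear(backbone_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, s_b, tgt_tokens):
        s_e = self.encoder(self.in_proj(s_b))        # encoder output S_e
        t = tgt_tokens.size(1)
        tgt = self.tok_emb(tgt_tokens) + self.pos_emb[:, :t]
        # Causal mask: masked multi-head attention prevents look-ahead.
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=s_b.device), diagonal=1)
        h = self.decoder(tgt, s_e, tgt_mask=causal)
        return self.out(h)                           # (batch, t, vocab_size)
```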
Evaluation
- Metrics: Accuracy (exact match), Tanimoto similarity (PubChem fingerprints), BLEU, ROUGE. A scoring sketch follows the results table.
| Metric | SwinOCSR (MFL) | ResNet-50 | EfficientNet-B3 | Notes |
|---|---|---|---|---|
| Accuracy | 98.58% | 89.17% | 86.70% | MFL provides a ~1% boost over CE |
| Tanimoto | 99.77% | 98.79% | 98.46% | High similarity even when exact match fails |
| BLEU | 99.59% | 98.62% | 98.37% | |
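As a sanity-check implementation of the scoring step, the sketch below decodes a predicted DeepSMILES string back to SMILES and computes a Tanimoto similarity. RDKit does not ship PubChem fingerprints, so its topological fingerprint stands in here; absolute values will therefore differ slightly from the paper's.

```python
import deepsmiles
from rdkit import Chem, DataStructs

converter = deepsmiles.Converter(rings=True, branches=True)

def tanimoto(pred_deepsmiles: str, ref_smiles: str) -> float:
    """Score one prediction against its reference structure.

    Stand-in: RDKit's topological fingerprint replaces the PubChem
    fingerprints used in the paper.
    """
    try:
        pred_smiles = converter.decode(pred_deepsmiles)
    except deepsmiles.DecodeError:
        return 0.0                       # syntactically invalid prediction
    pred, ref = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:
        return 0.0
    return DataStructs.TanimotoSimilarity(
        Chem.RDKFingerprint(pred), Chem.RDKFingerprint(ref)
    )

print(tanimoto("cccccc6O", "Oc1ccccc1"))   # phenol round-trip -> 1.0
```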
Hardware
- GPU: Trained on NVIDIA Tesla V100-PCIE.
- Training Duration: 30 epochs.
- Batch Size: 256 images ($224 \times 224$ pixels).
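A short sketch tying the reported training configuration together (Adam at 5e-4, cosine decay for the backbone, step decay for the head, 30 epochs). The placeholder modules and the StepLR step_size/gamma values are assumptions; the two-optimizer split is one straightforward way to give each sub-network its own schedule.

```python
import torch
import torch.nn as nn

# Tiny placeholder modules so the sketch runs; substitute the real
# Swin backbone and Transformer encoder/decoder head.
backbone, head = nn.Linear(8, 8), nn.Linear(8, 8)

# Separate optimizers let each sub-network follow its own schedule,
# mirroring the cosine decay (backbone) / step decay (head) split above.
opt_backbone = torch.optim.Adam(backbone.parameters(), lr=5e-4)
opt_head = torch.optim.Adam(head.parameters(), lr=5e-4)

epochs = 30
sched_backbone = torch.optim.lr_scheduler.CosineAnnealingLR(opt_backbone, T_max=epochs)
# step_size/gamma are not recorded in these notes; values are assumptions.
sched_head = torch.optim.lr_scheduler.StepLR(opt_head, step_size=10, gamma=0.5)

for epoch in range(epochs):
    # ... one pass over the 4.5M training images in batches of 256 ...
    sched_backbone.step()
    sched_head.step()
```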