Paper Information
Citation: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. Journal of Cheminformatics, 14, Article 41. https://doi.org/10.1186/s13321-022-00624-5
Publication: Journal of Cheminformatics 2022
What kind of paper is this?
This is a Methodological Paper with a significant Resource component.
- Method: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).
- Resource: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.
What is the motivation?
- Problem: OCSR (converting images of chemical structures to SMILES) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.
- Technical Gap: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and may miss global dependencies required for interpreting complex molecular diagrams.
- Data Imbalance: Chemical strings suffer from severe class imbalance (e.g., ‘C’ and ‘H’ are frequent; ‘Br’ or ‘Cl’ are rare), which causes standard Cross Entropy loss to underperform.
What is the novelty here?
- Swin Transformer Backbone: SwinOCSR is the first approach to replace the standard CNN backbone with a Swin Transformer. This allows the model to leverage shifted window attention to capture both local and global image features more effectively.
- Multi-label Focal Loss (MFL): The paper introduces a modified Focal Loss to OCSR to explicitly penalize the model for errors on rare tokens, addressing the “long-tail” distribution of chemical elements.
- Structured Synthetic Dataset: Creation of a dataset explicitly balanced across four structural categories: Kekulé-drawn rings, aromatic rings, and each of those two combined with substituents.
What experiments were performed?
- Backbone Comparison: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).
- Loss Function Ablation: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).
- Category Stress Test: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.
- Real-world Evaluation: The model was tested on a small set of 100 images manually extracted from literature vs. 100 generated images to measure domain shift.
What outcomes/conclusions?
- SOTA Performance: SwinOCSR achieved 98.58% accuracy on the synthetic test set, significantly outperforming ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones.
- Effective Handling of Length: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating superior global feature extraction.
- Domain Shift Issues: While performance on synthetic data was near-perfect, accuracy dropped to 25% on real-world literature images. The authors attribute this to noise, low resolution, and stylistic variations (e.g., abbreviations) not present in the training set.
Reproducibility Details
Data
- Source: The first 8.5 million structures from PubChem were downloaded, yielding ~6.9 million unique SMILES.
- Generation Pipeline:
- Tools: CDK (Chemistry Development Kit) for image rendering; RDKit for SMILES canonicalization.
- Augmentation: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekulé, (2) Aromatic, (3) Kekulé + Substituents, (4) Aromatic + Substituents. Substituents were randomly added from a list of 224 common patent substituents.
- Preprocessing: Images are rendered as binary, resized to 224×224, and copied to 3 channels to simulate RGB input (see the sketch below).
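A minimal Python sketch of this generation pipeline, under stated assumptions: the paper renders images with CDK (a Java toolkit), so RDKit's drawer here is only a convenient stand-in with a different depiction style, and `make_example` is a hypothetical helper name. The `deepsmiles` package handles label encoding.

```python
import numpy as np
import deepsmiles                      # pip install deepsmiles
from rdkit import Chem
from rdkit.Chem import Draw

converter = deepsmiles.Converter(rings=True, branches=True)

def make_example(smiles: str):
    """Build one (image, label) training pair.

    Rendering stand-in: the paper uses CDK; RDKit is substituted here
    purely so the sketch stays in Python.
    """
    mol = Chem.MolFromSmiles(smiles)
    canonical = Chem.MolToSmiles(mol)          # RDKit canonicalization
    label = converter.encode(canonical)        # SMILES -> DeepSMILES
    img = Draw.MolToImage(mol, size=(224, 224)).convert("L")
    arr = (np.asarray(img) < 128).astype(np.float32)   # binarize (ink = 1)
    arr = np.stack([arr] * 3, axis=-1)                 # copy to 3 channels
    return arr, label

image, label = make_example("c1ccccc1O")       # phenol
```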
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | Synthetic (PubChem-derived) | 4,500,000 | 18:1:1 split (Train/Val/Test) |
| Validation | Synthetic (PubChem-derived) | 250,000 | |
| Test | Synthetic (PubChem-derived) | 250,000 | |
Algorithms
- Loss Function: Multi-label Focal Loss (MFL). The single-label token classification task is cast as multi-label so that Focal Loss can be applied with a sigmoid activation on the logits (see the sketch after this list).
- Optimization:
- Optimizer: Adam with initial learning rate
5e-4. - Schedulers: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.
- Regularization: Dropout rate of
0.1.
- Optimizer: Adam with initial learning rate
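A minimal PyTorch sketch of the multi-label focal loss follows. The γ and α defaults are assumptions taken from the original Focal Loss formulation (Lin et al.), since the paper's exact values are not recorded in these notes. The (1 − p_t)^γ factor is what down-weights easy, abundant tokens and shifts gradient toward rare ones.

```python
import torch
import torch.nn.functional as F

def multilabel_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss over sigmoid activations, one binary problem per token class.

    logits:  (batch, seq_len, vocab_size) raw decoder outputs
    targets: (batch, seq_len) integer token ids
    gamma/alpha defaults follow Lin et al.; the paper's exact
    values are not recorded in these notes.
    """
    vocab_size = logits.size(-1)
    onehot = F.one_hot(targets, vocab_size).float()   # cast single-label as multi-label
    p = torch.sigmoid(logits)
    p_t = p * onehot + (1 - p) * (1 - onehot)         # prob. of the correct state
    alpha_t = alpha * onehot + (1 - alpha) * (1 - onehot)
    bce = F.binary_cross_entropy_with_logits(logits, onehot, reduction="none")
    # (1 - p_t)^gamma down-weights easy, frequent tokens (e.g. 'C'),
    # focusing the gradient on rare ones (e.g. 'Br').
    return (alpha_t * (1 - p_t) ** gamma * bce).sum(-1).mean()
```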
Models
- Backbone (Encoder 1): Swin Transformer.
- Patch size: $4 \times 4$.
- Linear embedding dimension: 192.
- Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).
- Output: Flattened patch sequence $S_b$.
- Transformer Encoder (Encoder 2): 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.
- Transformer Decoder: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).
- Tokenization: DeepSMILES format (syntactically more robust than SMILES for sequence generation, since it avoids unmatched ring-closure digits and unbalanced parentheses). Vocabulary: 76 unique tokens found in the dataset. Embedding dimension: 256. A sketch of the full stack follows.
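The PyTorch sketch below wires these pieces together under stated assumptions: the 6-layer encoder/decoder, 76-token vocabulary, and 256-d token embedding come from the notes above, while `backbone_dim`, `n_heads`, and `max_len` are illustrative. Any Swin implementation (e.g. timm's) is assumed to supply the flattened patch sequence $S_b$.

```python
import torch
import torch.nn as nn

class SwinOCSRHead(nn.Module):
    """Transformer encoder/decoder stack on top of a Swin backbone.

    Expects the backbone's flattened patch sequence S_b of shape
    (batch, n_patches, backbone_dim). Positional embedding on S_b is
    omitted here for brevity.
    """

    def __init__(self, backbone_dim=1536, d_model=256, n_heads=8,
                 vocab_size=76, max_len=100):
        super().__init__()
        self.in_proj = nn.Linear(backbone_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, s_b, tgt_tokens):
        s_e = self.encoder(self.in_proj(s_b))        # encoder output S_e
        t = tgt_tokens.size(1)
        tgt = self.tok_emb(tgt_tokens) + self.pos_emb[:, :t]
        # Causal mask: masked multi-head attention prevents look-ahead.
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=s_b.device), diagonal=1)
        h = self.decoder(tgt, s_e, tgt_mask=causal)
        return self.out(h)                           # (batch, t, vocab_size)
```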
Evaluation
- Metrics: Accuracy (exact match), Tanimoto similarity (PubChem fingerprints), BLEU, ROUGE. A scoring sketch follows the results table.
| Metric | SwinOCSR (MFL) | ResNet-50 | EfficientNet-B3 | Notes |
|---|---|---|---|---|
| Accuracy | 98.58% | 89.17% | 86.70% | MFL provides a ~1% boost over CE |
| Tanimoto | 99.77% | 98.79% | 98.46% | High similarity even when exact match fails |
| BLEU | 99.59% | 98.62% | 98.37% | |
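As a sanity-check implementation of the scoring step, the sketch below decodes a predicted DeepSMILES string back to SMILES and computes a Tanimoto similarity. RDKit does not ship PubChem fingerprints, so its topological fingerprint stands in here; absolute values will therefore differ slightly from the paper's.

```python
import deepsmiles
from rdkit import Chem, DataStructs

converter = deepsmiles.Converter(rings=True, branches=True)

def tanimoto(pred_deepsmiles: str, ref_smiles: str) -> float:
    """Score one prediction against its reference structure.

    Stand-in: RDKit's topological fingerprint replaces the PubChem
    fingerprints used in the paper.
    """
    try:
        pred_smiles = converter.decode(pred_deepsmiles)
    except deepsmiles.DecodeError:
        return 0.0                       # syntactically invalid prediction
    pred, ref = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:
        return 0.0
    return DataStructs.TanimotoSimilarity(
        Chem.RDKFingerprint(pred), Chem.RDKFingerprint(ref)
    )

print(tanimoto("cccccc6O", "Oc1ccccc1"))   # phenol round-trip -> 1.0
```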
Hardware
- GPU: Trained on NVIDIA Tesla V100-PCIE.
- Training Duration: 30 epochs.
- Batch Size: 256 images ($224 \times 224$ pixels).
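A short sketch tying the reported training configuration together (Adam at 5e-4, cosine decay for the backbone, step decay for the head, 30 epochs). The placeholder modules and the StepLR step_size/gamma values are assumptions; the two-optimizer split is one straightforward way to give each sub-network its own schedule.

```python
import torch
import torch.nn as nn

# Tiny placeholder modules so the sketch runs; substitute the real
# Swin backbone and Transformer encoder/decoder head.
backbone, head = nn.Linear(8, 8), nn.Linear(8, 8)

# Separate optimizers let each sub-network follow its own schedule,
# mirroring the cosine decay (backbone) / step decay (head) split above.
opt_backbone = torch.optim.Adam(backbone.parameters(), lr=5e-4)
opt_head = torch.optim.Adam(head.parameters(), lr=5e-4)

epochs = 30
sched_backbone = torch.optim.lr_scheduler.CosineAnnealingLR(opt_backbone, T_max=epochs)
# step_size/gamma are not recorded in these notes; values are assumptions.
sched_head = torch.optim.lr_scheduler.StepLR(opt_head, step_size=10, gamma=0.5)

for epoch in range(epochs):
    # ... one pass over the 4.5M training images in batches of 256 ...
    sched_backbone.step()
    sched_head.step()
```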