Paper Information

Citation: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. Journal of Cheminformatics, 14, Article 41. https://doi.org/10.1186/s13321-022-00624-5

Publication: Journal of Cheminformatics 2022

Additional Resources:

What kind of paper is this?

This is a Methodological Paper with a significant Resource component.

  • Method: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).
  • Resource: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.

What is the motivation?

  • Problem: OCSR (converting images of chemical structures to SMILES) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.
  • Technical Gap: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and may miss global dependencies required for interpreting complex molecular diagrams.
  • Data Imbalance: Chemical strings suffer from severe class imbalance (e.g., ‘C’ and ‘H’ are frequent; ‘Br’ or ‘Cl’ are rare), which causes standard Cross Entropy loss to underperform.

What is the novelty here?

  • Swin Transformer Backbone: SwinOCSR is the first OCSR approach to replace the standard CNN backbone with a Swin Transformer, allowing the model to leverage shifted window attention to capture both local and global image features more effectively (see the sketch after this list).
  • Multi-label Focal Loss (MFL): The paper introduces a modified Focal Loss to OCSR to explicitly penalize the model for errors on rare tokens, addressing the “long-tail” distribution of chemical elements.
  • Structured Synthetic Dataset: Creation of a dataset explicitly balanced across four structural categories: Kekulé rings, aromatic rings, and each of the two combined with substituents.
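To make the shifted-window idea concrete, here is a minimal PyTorch sketch of window partitioning. The helper name `window_partition`, the tensor sizes, and the omission of the attention computation and shift mask are illustrative simplifications, not the paper's implementation.

```python
# Minimal sketch of (shifted) window partitioning, the mechanism behind Swin attention.
# Self-attention itself and the attention mask for shifted windows are omitted for brevity.
import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows of
    shape (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

B, H, W, C, ws = 2, 8, 8, 96, 4
feat = torch.randn(B, H, W, C)

# Regular windows: attention is computed independently inside each 4x4 window (local features).
local_windows = window_partition(feat, ws)                    # (8, 16, 96)

# Shifted windows: cyclically shift by half a window first, so the next block's windows
# straddle the previous block's boundaries and information can propagate globally across stages.
shifted = torch.roll(feat, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
cross_windows = window_partition(shifted, ws)                 # (8, 16, 96)
```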

What experiments were performed?

  • Backbone Comparison: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).
  • Loss Function Ablation: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).
  • Category Stress Test: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.
  • Real-world Evaluation: The model was tested on a small set of 100 images manually extracted from literature vs. 100 generated images to measure domain shift.

What outcomes/conclusions?

  • SOTA Performance: SwinOCSR achieved 98.58% accuracy on the synthetic test set, significantly outperforming ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones.
  • Effective Handling of Length: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating superior global feature extraction.
  • Domain Shift Issues: While performance on synthetic data was near-perfect, accuracy dropped to 25% on real-world literature images. The authors attribute this to noise, low resolution, and stylistic variations (e.g., abbreviations) not present in the training set.

Reproducibility Details

Data

  • Source: The first 8.5 million structures from PubChem were downloaded, yielding ~6.9 million unique SMILES.
  • Generation Pipeline:
    • Tools: CDK (Chemistry Development Kit) for image rendering; RDKit for SMILES canonicalization.
    • Augmentation: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekulé, (2) aromatic, (3) Kekulé + substituents, (4) aromatic + substituents. Substituents were drawn at random from a list of 224 common patent substituents.
    • Preprocessing: Images were rendered as binary, resized to 224×224 pixels, and replicated across 3 channels to simulate RGB (see the sketch after the dataset table below).
| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Training | Synthetic (PubChem-derived) | 4,500,000 | 18:1:1 Train/Val/Test split of the 5M set |
| Validation | Synthetic (PubChem-derived) | 250,000 | |
| Test | Synthetic (PubChem-derived) | 250,000 | |
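A minimal sketch of this preparation pipeline, assuming Python with RDKit and the `deepsmiles` package. The paper renders depictions with CDK; RDKit's drawing module and the binarization threshold below are stand-ins for illustration only.

```python
# Hedged sketch of the dataset-preparation steps described above (not the paper's code).
import numpy as np
import deepsmiles
from rdkit import Chem
from rdkit.Chem import Draw

converter = deepsmiles.Converter(rings=True, branches=True)

def prepare_example(smiles: str, size: int = 224):
    """Canonicalize a SMILES, build the DeepSMILES label, and render a
    binary size x size depiction replicated to 3 channels."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    canonical = Chem.MolToSmiles(mol)                 # RDKit canonicalization
    label = converter.encode(canonical)               # DeepSMILES target string

    img = Draw.MolToImage(mol, size=(size, size))     # PIL depiction (CDK in the paper)
    gray = np.asarray(img.convert("L"))
    binary = (gray < 128).astype(np.float32)          # crude binarization (threshold is an assumption)
    image = np.repeat(binary[..., None], 3, axis=-1)  # copy to 3 channels (RGB simulation)
    return image, label

example = prepare_example("c1ccccc1C(=O)O")           # benzoic acid
```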

Algorithms

  • Loss Function: Multi-label Focal Loss (MFL). The single-label token classification task is cast as multi-label so that Focal Loss can be applied, using a sigmoid activation on the logits (a minimal sketch follows this list).
  • Optimization:
    • Optimizer: Adam with initial learning rate 5e-4.
    • Schedulers: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.
    • Regularization: Dropout rate of 0.1.
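As a rough illustration, here is a hedged PyTorch sketch of a multi-label focal loss applied to decoder logits. The gamma/alpha values, the padding index, and the one-hot casting are assumptions for the example, not the paper's exact settings.

```python
# Hedged sketch of a multi-label focal loss over decoder token logits (PyTorch).
# gamma, alpha, and the padding index are illustrative choices, not the paper's values.
import torch
import torch.nn.functional as F

def multilabel_focal_loss(logits, targets, gamma=2.0, alpha=0.25, pad_id=0):
    """logits: (batch, seq_len, vocab) raw scores; targets: (batch, seq_len) token ids."""
    one_hot = F.one_hot(targets, logits.size(-1)).float()   # cast each step to a multi-label target
    p = torch.sigmoid(logits)                                # per-class probabilities (sigmoid, not softmax)
    pt = p * one_hot + (1 - p) * (1 - one_hot)               # probability assigned to the correct choice
    alpha_t = alpha * one_hot + (1 - alpha) * (1 - one_hot)
    loss = -alpha_t * (1 - pt).pow(gamma) * pt.clamp_min(1e-8).log()
    mask = (targets != pad_id).float().unsqueeze(-1)         # do not penalize padding positions
    return (loss * mask).sum() / mask.sum().clamp_min(1)

# Example: batch of 2 sequences of 10 tokens over the 76-token DeepSMILES vocabulary.
loss = multilabel_focal_loss(torch.randn(2, 10, 76), torch.randint(0, 76, (2, 10)))
```

The benefit for rare tokens comes from the (1 - pt)^gamma factor, which down-weights easy, frequent tokens (e.g. carbon) so the gradient is dominated by hard, infrequent ones.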

Models

  • Backbone (Encoder 1): Swin Transformer.
    • Patch size: $4 \times 4$.
    • Linear embedding dimension: 192.
    • Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).
    • Output: Flattened patch sequence $S_b$.
  • Transformer Encoder (Encoder 2): 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.
  • Transformer Decoder: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).
  • Tokenization: DeepSMILES format (syntactically more robust for generation than raw SMILES). Vocabulary size: 76 tokens, i.e. the unique symbols found in the dataset's DeepSMILES strings. Embedding dimension: 256.
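The overall wiring can be sketched as follows, assuming PyTorch. The Swin backbone is replaced by a crude 4×4 patch-embedding stub, and the single 256-dimensional model width, 8 attention heads, and 100-token maximum length are assumptions made for brevity; only the 6+6 layer arrangement, the 76-token vocabulary, and the masked decoder follow the description above.

```python
# Hedged sketch of the encoder-decoder arrangement; the real backbone is a Swin Transformer.
import torch
import torch.nn as nn

class OCSRSketch(nn.Module):
    def __init__(self, vocab_size=76, d_model=256, n_layers=6, n_heads=8, max_len=100):
        super().__init__()
        # Stand-in for the Swin backbone (Encoder 1): 4x4 patchify + linear embedding.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=4, stride=4)
        self.pos_embed = nn.Parameter(torch.zeros(1, (224 // 4) ** 2, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, dropout=0.1, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)   # Encoder 2 (6 layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)   # Decoder (6 layers)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.tok_pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        s_b = self.patch_embed(images).flatten(2).transpose(1, 2)   # flattened patch sequence S_b
        s_e = self.encoder(s_b + self.pos_embed)                    # encoder output S_e
        tgt = self.tok_embed(tokens) + self.tok_pos[:, : tokens.size(1)]
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)  # no look-ahead
        out = self.decoder(tgt, s_e, tgt_mask=causal)               # masked self-attn + cross-attn to S_e
        return self.head(out)                                       # (batch, seq_len, vocab) logits

logits = OCSRSketch()(torch.randn(2, 3, 224, 224), torch.randint(0, 76, (2, 20)))
```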

Evaluation

  • Metrics: Accuracy (exact string match), Tanimoto similarity (PubChem fingerprints), BLEU, ROUGE (a reproduction sketch for the similarity metrics follows the results table).
| Metric | SwinOCSR (MFL) | ResNet-50 | EfficientNet-B3 | Notes |
| --- | --- | --- | --- | --- |
| Accuracy | 98.58% | 89.17% | 86.70% | MFL loss provides ~1% boost over CE |
| Tanimoto | 99.77% | 98.79% | 98.46% | High similarity even when exact match fails |
| BLEU | 99.59% | 98.62% | 98.37% | |
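A hedged sketch of the similarity metrics follows. The paper computes Tanimoto over PubChem fingerprints; RDKit's default RDKFingerprint and NLTK's sentence-level BLEU are used here only as stand-ins, and predicted DeepSMILES would first be decoded back to SMILES (e.g. with `deepsmiles.Converter(...).decode`) before fingerprinting.

```python
# Hedged sketch of Tanimoto and BLEU evaluation, assuming RDKit and NLTK are installed.
# RDKit's RDKFingerprint stands in for the PubChem fingerprints used in the paper.
from rdkit import Chem, DataStructs
from nltk.translate.bleu_score import sentence_bleu

def tanimoto(pred_smiles: str, true_smiles: str) -> float:
    fp_pred = Chem.RDKFingerprint(Chem.MolFromSmiles(pred_smiles))
    fp_true = Chem.RDKFingerprint(Chem.MolFromSmiles(true_smiles))
    return DataStructs.TanimotoSimilarity(fp_pred, fp_true)

def bleu(pred_tokens, true_tokens) -> float:
    return sentence_bleu([true_tokens], pred_tokens)

print(tanimoto("c1ccccc1O", "c1ccccc1N"))                # phenol vs. aniline
print(bleu(list("C1=CC=CC=C1"), list("C1=CC=CC=C1")))    # identical strings -> 1.0
```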

Hardware

  • GPU: Trained on NVIDIA Tesla V100-PCIE.
  • Training Schedule: 30 epochs.
  • Batch Size: 256 images ($224 \times 224$ pixels).