Paper Information
Citation: Rajan, K., Steinbeck, C., & Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. Digital Discovery, 1(2), 84-90. https://doi.org/10.1039/D1DD00013F
Publication: Digital Discovery 2022
Additional Resources:
- ChemRxiv Preprint (PDF)
- Official Code Repository
- Data on Zenodo
- Related work: DECIMER, DECIMER 1.0, IMG2SMI
What kind of paper is this?
This is a Methodological Paper ($\Psi_{\text{Method}}$) with a secondary contribution as a Resource Paper ($\Psi_{\text{Resource}}$).
It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (SMILES, DeepSMILES, SELFIES, InChI) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.
What is the motivation?
Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. Specifically, the tokenization of SMILES strings - where ring closures and branches are marked by single characters potentially far apart in the sequence - creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.
What is the novelty here?
The core novelty is the comparative isolation of the string representation variable in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between validity (guaranteed by SELFIES) and accuracy (highest with SMILES).
What experiments were performed?
The authors performed a large-scale image-to-text translation experiment:
- Task: Converting 2D chemical structure images into text strings.
- Data:
- ChEMBL: ~1.6M molecules, split into two datasets (with and without stereochemistry).
- PubChem: ~3M molecules, split similarly, to test performance scaling with data size.
- Representations: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI (see the conversion sketch after this list).
- Metric: The models were evaluated on:
- Validity: Can the predicted string be decoded back to a molecule?
- Exact Match: Is the predicted string identical to the ground truth?
- Tanimoto Similarity: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)?
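The format conversion can be sketched as follows, assuming the `selfies` and `deepsmiles` Python packages plus RDKit for InChI generation; the authors' exact conversion toolchain is not reproduced here.

```python
# Sketch: converting one SMILES into the four string formats compared in the
# paper. Assumes the rdkit, selfies, and deepsmiles Python packages; the
# authors' exact toolchain may differ.
from rdkit import Chem
import selfies
import deepsmiles

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, illustrative example

mol = Chem.MolFromSmiles(smiles)
canonical_smiles = Chem.MolToSmiles(mol)

# DeepSMILES: removes paired parentheses / ring-closure digit pairs
converter = deepsmiles.Converter(rings=True, branches=True)
deep = converter.encode(canonical_smiles)

# SELFIES: every token is a self-contained [bracket] symbol
slf = selfies.encoder(canonical_smiles)

# InChI: layered identifier with a fixed "InChI=1S/" prefix
inchi = Chem.MolToInchi(mol)

print(canonical_smiles, deep, slf, inchi, sep="\n")
```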
What were the outcomes and conclusions drawn?
- SMILES is the most accurate: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.
- SELFIES guarantees validity: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.
- InChI is unsuitable: InChI performed significantly worse (approx. 64% exact match) due to extreme string lengths (up to 273 tokens).
- Stereochemistry adds difficulty: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.
- Recommendation: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).
Reproducibility Details
Data
The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | ChEMBL (Dataset 1/2) | ~1.5M | Filtered for MW < 1500, specific elements (C,H,O,N,P,S,F,Cl,Br,I,Se,B). |
| Training | PubChem (Dataset 3/4) | ~3.0M | Same filtering rules, used to test scaling. |
| Evaluation | Test Split | ~120k - 250k | Created using RDKit MaxMin algorithm to ensure chemical diversity. |
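A minimal sketch of the diversity-driven test split with RDKit's MaxMin picker; the use of Morgan fingerprints and the pick size are illustrative assumptions, as the text only states that the RDKit MaxMin algorithm was used.

```python
# Sketch: selecting a chemically diverse test set with RDKit's MaxMin picker.
# Morgan fingerprints are an illustrative assumption, not taken from the paper.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

def maxmin_test_split(smiles_list, n_test, seed=42):
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in smiles_list
    ]
    picker = MaxMinPicker()
    test_idx = set(picker.LazyBitVectorPick(fps, len(fps), n_test, seed=seed))
    test = [smiles_list[i] for i in sorted(test_idx)]
    train = [s for i, s in enumerate(smiles_list) if i not in test_idx]
    return train, test
```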
Image Generation:
- Tool: CDK Structure Diagram Generator (SDG).
- Specs: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.
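The depictions were produced with the CDK (a Java library); as a purely illustrative stand-in, the sketch below reproduces the stated image specification (300 × 300 px, random rotation, 8-bit PNG) with RDKit and Pillow.

```python
# Sketch: reproducing the image specification (300x300 px, random rotation,
# 8-bit grayscale PNG). The paper uses the CDK Structure Diagram Generator
# (Java); RDKit + Pillow serve here only as an illustrative stand-in.
import random
from rdkit import Chem
from rdkit.Chem import Draw

def render_structure(smiles, out_path):
    mol = Chem.MolFromSmiles(smiles)
    img = Draw.MolToImage(mol, size=(300, 300))         # PIL image, RGB
    angle = random.uniform(0, 360)                      # random rotation angle
    img = img.rotate(angle, fillcolor=(255, 255, 255))  # keep white background
    img.convert("L").save(out_path, format="PNG")       # 8-bit grayscale PNG

render_structure("CC(=O)Oc1ccccc1C(=O)O", "aspirin.png")
```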
Algorithms
Tokenization Rules (Critical for replication):
- SELFIES: Split at every `][` boundary (e.g., `[C][N]` $\rightarrow$ `[C]`, `[N]`).
- SMILES / DeepSMILES: Regex-based splitting into:
  - Every heavy atom (e.g., `C`, `N`).
  - Every bracket `(` and `)`.
  - Every bond symbol `=` and `#`.
  - Every single-digit number.
  - Everything inside square brackets `[ ]` is kept as a single token.
- InChI: The prefix `InChI=1S/` was treated as a single token and removed during training, then re-added for evaluation.
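A minimal sketch of tokenizers following these rules; the exact regular expression used by the authors may differ in detail (e.g., handling of stereo bond symbols).

```python
# Sketch: tokenizers following the splitting rules above. The exact regular
# expression used by the authors may differ in detail.
import re

def tokenize_selfies(selfies_string):
    # Split at every "][" boundary, keeping the brackets on each token.
    return re.findall(r"\[[^\]]*\]", selfies_string)

# Bracketed atoms stay whole; two-letter atoms, single-letter atoms,
# parentheses, bond symbols (= and #), and single digits become tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]*\]|Br|Cl|Si|Se|[BCNOPSFI]|[bcnops]|\(|\)|=|#|%\d{2}|\d)"
)

def tokenize_smiles(smiles_string):
    return SMILES_TOKEN.findall(smiles_string)

# InChI: the fixed prefix is a single token, dropped during training and
# re-added before evaluation.
INCHI_PREFIX = "InChI=1S/"

def strip_inchi_prefix(inchi_string):
    return inchi_string.removeprefix(INCHI_PREFIX)

print(tokenize_selfies("[C][N]"))                # ['[C]', '[N]']
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # ['C', 'C', '(', '=', 'O', ...]
```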
Models
The model follows the DECIMER architecture.
- Encoder: EfficientNet-B3 (pre-trained with “Noisy Student” weights).
- Output: Image feature vectors of shape $10 \times 10 \times 1536$.
- Decoder: Transformer (similar to the “Base” model from Attention Is All You Need).
- Layers: 4 encoder-decoder layers.
- Attention Heads: 8.
- Dimension ($d_{\text{model}}$): 512.
- Feed-forward ($d_{\text{ff}}$): 2048.
- Dropout: 10%.
- Loss: Sparse categorical cross-entropy.
- Optimizer: Adam with custom learning rate scheduler.
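The scheduler itself is not specified above; the sketch below assumes the warm-up schedule from "Attention Is All You Need", together with the listed decoder hyperparameters and transformer-paper Adam settings.

```python
# Sketch: the decoder hyperparameters listed above, plus an assumed
# "Attention Is All You Need"-style warm-up learning-rate schedule for Adam.
# The authors' exact scheduler is not reproduced here.
import tensorflow as tf

D_MODEL = 512      # transformer model dimension
NUM_LAYERS = 4     # encoder-decoder layers
NUM_HEADS = 8      # attention heads
D_FF = 2048        # feed-forward dimension
DROPOUT = 0.1      # 10% dropout

class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""
    def __init__(self, d_model=D_MODEL, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5
        )

# Adam betas/epsilon follow the transformer paper (an assumption here).
optimizer = tf.keras.optimizers.Adam(
    WarmupSchedule(), beta_1=0.9, beta_2=0.98, epsilon=1e-9
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none"   # mask padding tokens before reducing
)
```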
Evaluation
Metrics were calculated after converting all predictions back to standard SMILES.
| Metric | Baseline (SMILES) | Notes |
|---|---|---|
| Identical Match | 88.62% (PubChem) | Strict character-for-character equality. |
| Valid Structure | 99.78% | SMILES had rare syntax errors; SELFIES achieved 100%. |
| Tanimoto (Avg) | 0.98 | Calculated using PubChem fingerprints via CDK. |
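A minimal per-prediction sketch of these three metrics, with RDKit canonical SMILES and Morgan fingerprints standing in for the CDK/PubChem fingerprints actually used, so absolute Tanimoto values will differ slightly.

```python
# Sketch: the three evaluation metrics. RDKit canonical SMILES and Morgan
# fingerprints stand in for the CDK / PubChem fingerprints used in the paper.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def evaluate_pair(predicted_smiles, reference_smiles):
    pred_mol = Chem.MolFromSmiles(predicted_smiles)
    ref_mol = Chem.MolFromSmiles(reference_smiles)

    if pred_mol is None:                          # "Valid Structure" check
        return {"valid": False, "identical": False, "tanimoto": 0.0}

    # "Identical Match": string equality after canonicalization
    identical = Chem.MolToSmiles(pred_mol) == Chem.MolToSmiles(ref_mol)

    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred_mol, 2, nBits=2048)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref_mol, 2, nBits=2048)
    tanimoto = DataStructs.TanimotoSimilarity(fp_pred, fp_ref)

    return {"valid": True, "identical": identical, "tanimoto": tanimoto}
```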
Hardware
- Training: Google Cloud TPUs (v3-8).
- Format: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency.
- Batch Size: 1024.
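A minimal sketch of the TFRecord packing with 128 image/text pairs per shard; the feature names and file naming are illustrative assumptions.

```python
# Sketch: packing image/token-string pairs into TFRecord shards of 128
# examples each for TPU training. Feature names are illustrative assumptions.
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_shards(pairs, prefix, pairs_per_record=128):
    """pairs: list of (png_bytes, token_string) tuples."""
    for start in range(0, len(pairs), pairs_per_record):
        path = f"{prefix}-{start // pairs_per_record:05d}.tfrecord"
        with tf.io.TFRecordWriter(path) as writer:
            for png_bytes, tokens in pairs[start:start + pairs_per_record]:
                example = tf.train.Example(features=tf.train.Features(feature={
                    "image": _bytes_feature(png_bytes),
                    "tokens": _bytes_feature(tokens.encode("utf-8")),
                }))
                writer.write(example.SerializeToString())
```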
Citation
@article{rajanPerformanceChemicalStructure2022,
title = {Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers},
author = {Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim},
year = {2022},
journal = {Digital Discovery},
volume = {1},
number = {2},
pages = {84--90},
publisher = {Royal Society of Chemistry},
doi = {10.1039/D1DD00013F}
}