Paper Information
Citation: Rajan, K., Steinbeck, C., & Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. Digital Discovery, 1(2), 84-90. https://doi.org/10.1039/D1DD00013F
Publication: Digital Discovery 2022
Additional Resources:
- ChemRxiv Preprint (PDF)
- Official Code Repository
- Data on Zenodo
- Related work: DECIMER, DECIMER 1.0, IMG2SMI
What kind of paper is this?
This is a Methodological Paper ($\Psi_{\text{Method}}$) with a secondary contribution as a Resource Paper ($\Psi_{\text{Resource}}$).
It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (SMILES, DeepSMILES, SELFIES, InChI) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.
What is the motivation?
Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. Specifically, the tokenization of SMILES strings - where ring closures and branches are marked by single characters potentially far apart in the sequence - creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.
What is the novelty here?
The core novelty is the comparative isolation of the string representation variable in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between validity (guaranteed by SELFIES) and accuracy (highest with SMILES).
What experiments were performed?
The authors performed a large-scale image-to-text translation experiment:
- Task: Converting 2D chemical structure images into text strings.
- Data:
- ChEMBL: ~1.6M molecules, split into two datasets (with and without stereochemistry).
- PubChem: ~3M molecules, split similarly, to test performance scaling with data size.
- Representations: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI (see the conversion sketch after this list).
- Metric: The models were evaluated on:
- Validity: Can the predicted string be decoded back to a molecule?
- Exact Match: Is the predicted string identical to the ground truth?
- Tanimoto Similarity: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)?
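The format conversion can be sketched as follows, assuming the `selfies` and `deepsmiles` Python packages plus RDKit for InChI generation; the authors' exact conversion toolchain is not reproduced here.

```python
# Sketch: converting one SMILES into the four string formats compared in the
# paper. Assumes the rdkit, selfies, and deepsmiles Python packages; the
# authors' exact toolchain may differ.
from rdkit import Chem
import selfies
import deepsmiles

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, illustrative example

mol = Chem.MolFromSmiles(smiles)
canonical_smiles = Chem.MolToSmiles(mol)

# DeepSMILES: removes paired parentheses / ring-closure digit pairs
converter = deepsmiles.Converter(rings=True, branches=True)
deep = converter.encode(canonical_smiles)

# SELFIES: every token is a self-contained [bracket] symbol
slf = selfies.encoder(canonical_smiles)

# InChI: layered identifier with a fixed "InChI=1S/" prefix
inchi = Chem.MolToInchi(mol)

print(canonical_smiles, deep, slf, inchi, sep="\n")
```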
What were the outcomes and conclusions drawn?
- SMILES is the most accurate: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.
- SELFIES guarantees validity: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.
- InChI is unsuitable: InChI performed significantly worse (approx. 64% exact match) due to extreme string lengths (up to 273 tokens).
- Stereochemistry adds difficulty: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.
- Recommendation: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).
Reproducibility Details
Data
The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | ChEMBL (Dataset 1/2) | ~1.5M | Filtered for MW < 1500, specific elements (C,H,O,N,P,S,F,Cl,Br,I,Se,B). |
| Training | PubChem (Dataset 3/4) | ~3.0M | Same filtering rules, used to test scaling. |
| Evaluation | Test Split | ~120k - 250k | Created using RDKit MaxMin algorithm to ensure chemical diversity. |
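A minimal sketch of the diversity-driven test split with RDKit's MaxMin picker; the use of Morgan fingerprints and the pick size are illustrative assumptions, as the text only states that the RDKit MaxMin algorithm was used.

```python
# Sketch: selecting a chemically diverse test set with RDKit's MaxMin picker.
# Morgan fingerprints are an illustrative assumption, not taken from the paper.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

def maxmin_test_split(smiles_list, n_test, seed=42):
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in smiles_list
    ]
    picker = MaxMinPicker()
    test_idx = set(picker.LazyBitVectorPick(fps, len(fps), n_test, seed=seed))
    test = [smiles_list[i] for i in sorted(test_idx)]
    train = [s for i, s in enumerate(smiles_list) if i not in test_idx]
    return train, test
```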
Image Generation:
- Tool: CDK Structure Diagram Generator (SDG).
- Specs: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.
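The depictions were produced with the CDK (a Java library); as a purely illustrative stand-in, the sketch below reproduces the stated image specification (300 × 300 px, random rotation, 8-bit PNG) with RDKit and Pillow.

```python
# Sketch: reproducing the image specification (300x300 px, random rotation,
# 8-bit grayscale PNG). The paper uses the CDK Structure Diagram Generator
# (Java); RDKit + Pillow serve here only as an illustrative stand-in.
import random
from rdkit import Chem
from rdkit.Chem import Draw

def render_structure(smiles, out_path):
    mol = Chem.MolFromSmiles(smiles)
    img = Draw.MolToImage(mol, size=(300, 300))         # PIL image, RGB
    angle = random.uniform(0, 360)                      # random rotation angle
    img = img.rotate(angle, fillcolor=(255, 255, 255))  # keep white background
    img.convert("L").save(out_path, format="PNG")       # 8-bit grayscale PNG

render_structure("CC(=O)Oc1ccccc1C(=O)O", "aspirin.png")
```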
Algorithms
Tokenization Rules (Critical for replication):
- SELFIES: Split at every `][` boundary (e.g., `[C][N]` $\rightarrow$ `[C]`, `[N]`).
- SMILES / DeepSMILES: Regex-based splitting into:
  - Every heavy atom (e.g., `C`, `N`).
  - Every bracket `(` and `)`.
  - Every bond symbol `=` and `#`.
  - Every single-digit number.
  - Everything inside square brackets `[ ]` is kept as a single token.
- InChI: The prefix `InChI=1S/` was treated as a single token and removed during training, then re-added for evaluation.
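A minimal sketch of tokenizers following these rules; the exact regular expression used by the authors may differ in detail (e.g., handling of stereo bond symbols).

```python
# Sketch: tokenizers following the splitting rules above. The exact regular
# expression used by the authors may differ in detail.
import re

def tokenize_selfies(selfies_string):
    # Split at every "][" boundary, keeping the brackets on each token.
    return re.findall(r"\[[^\]]*\]", selfies_string)

# Bracketed atoms stay whole; two-letter atoms, single-letter atoms,
# parentheses, bond symbols (= and #), and single digits become tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]*\]|Br|Cl|Si|Se|[BCNOPSFI]|[bcnops]|\(|\)|=|#|%\d{2}|\d)"
)

def tokenize_smiles(smiles_string):
    return SMILES_TOKEN.findall(smiles_string)

# InChI: the fixed prefix is a single token, dropped during training and
# re-added before evaluation.
INCHI_PREFIX = "InChI=1S/"

def strip_inchi_prefix(inchi_string):
    return inchi_string.removeprefix(INCHI_PREFIX)

print(tokenize_selfies("[C][N]"))                # ['[C]', '[N]']
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # ['C', 'C', '(', '=', 'O', ...]
```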
Models
The model follows the DECIMER architecture.
- Encoder: EfficientNet-B3 (pre-trained with “Noisy Student” weights).
- Output: Image feature vectors of shape $10 \times 10 \times 1536$.
- Decoder: Transformer (similar to the “Base” model from Attention Is All You Need).
- Layers: 4 encoder-decoder layers.
- Attention Heads: 8.
- Dimension ($d_{\text{model}}$): 512.
- Feed-forward ($d_{\text{ff}}$): 2048.
- Dropout: 10%.
- Loss: Sparse categorical cross-entropy.
- Optimizer: Adam with custom learning rate scheduler.
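The scheduler itself is not specified above; the sketch below assumes the warm-up schedule from "Attention Is All You Need", together with the listed decoder hyperparameters and transformer-paper Adam settings.

```python
# Sketch: the decoder hyperparameters listed above, plus an assumed
# "Attention Is All You Need"-style warm-up learning-rate schedule for Adam.
# The authors' exact scheduler is not reproduced here.
import tensorflow as tf

D_MODEL = 512      # transformer model dimension
NUM_LAYERS = 4     # encoder-decoder layers
NUM_HEADS = 8      # attention heads
D_FF = 2048        # feed-forward dimension
DROPOUT = 0.1      # 10% dropout

class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""
    def __init__(self, d_model=D_MODEL, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5
        )

# Adam betas/epsilon follow the transformer paper (an assumption here).
optimizer = tf.keras.optimizers.Adam(
    WarmupSchedule(), beta_1=0.9, beta_2=0.98, epsilon=1e-9
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none"   # mask padding tokens before reducing
)
```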
Evaluation
Metrics were calculated after converting all predictions back to standard SMILES.
| Metric | Baseline (SMILES) | Notes |
|---|---|---|
| Identical Match | 88.62% (PubChem) | Strict character-for-character equality. |
| Valid Structure | 99.78% | SMILES had rare syntax errors; SELFIES achieved 100%. |
| Tanimoto (Avg) | 0.98 | Calculated using PubChem fingerprints via CDK. |
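A minimal per-prediction sketch of these three metrics, with RDKit canonical SMILES and Morgan fingerprints standing in for the CDK/PubChem fingerprints actually used, so absolute Tanimoto values will differ slightly.

```python
# Sketch: the three evaluation metrics. RDKit canonical SMILES and Morgan
# fingerprints stand in for the CDK / PubChem fingerprints used in the paper.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def evaluate_pair(predicted_smiles, reference_smiles):
    pred_mol = Chem.MolFromSmiles(predicted_smiles)
    ref_mol = Chem.MolFromSmiles(reference_smiles)

    if pred_mol is None:                          # "Valid Structure" check
        return {"valid": False, "identical": False, "tanimoto": 0.0}

    # "Identical Match": string equality after canonicalization
    identical = Chem.MolToSmiles(pred_mol) == Chem.MolToSmiles(ref_mol)

    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred_mol, 2, nBits=2048)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref_mol, 2, nBits=2048)
    tanimoto = DataStructs.TanimotoSimilarity(fp_pred, fp_ref)

    return {"valid": True, "identical": identical, "tanimoto": tanimoto}
```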
Hardware
- Training: Google Cloud TPUs (v3-8).
- Format: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency.
- Batch Size: 1024.
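A minimal sketch of the TFRecord packing with 128 image/text pairs per shard; the feature names and file naming are illustrative assumptions.

```python
# Sketch: packing image/token-string pairs into TFRecord shards of 128
# examples each for TPU training. Feature names are illustrative assumptions.
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_shards(pairs, prefix, pairs_per_record=128):
    """pairs: list of (png_bytes, token_string) tuples."""
    for start in range(0, len(pairs), pairs_per_record):
        path = f"{prefix}-{start // pairs_per_record:05d}.tfrecord"
        with tf.io.TFRecordWriter(path) as writer:
            for png_bytes, tokens in pairs[start:start + pairs_per_record]:
                example = tf.train.Example(features=tf.train.Features(feature={
                    "image": _bytes_feature(png_bytes),
                    "tokens": _bytes_feature(tokens.encode("utf-8")),
                }))
                writer.write(example.SerializeToString())
```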
Citation
@article{rajanPerformanceChemicalStructure2022,
title = {Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers},
author = {Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim},
year = {2022},
journal = {Digital Discovery},
volume = {1},
number = {2},
pages = {84--90},
publisher = {Royal Society of Chemistry},
doi = {10.1039/D1DD00013F}
}