Paper Information

Citation: Rajan, K., Steinbeck, C., & Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. Digital Discovery, 1(2), 84-90. https://doi.org/10.1039/D1DD00013F

Publication: Digital Discovery 2022

What kind of paper is this?

This is a Methodological Paper ($\Psi_{\text{Method}}$) with a secondary contribution as a Resource Paper ($\Psi_{\text{Resource}}$).

It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (SMILES, DeepSMILES, SELFIES, InChI) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.

What is the motivation?

Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. In particular, the tokenization of SMILES strings, where ring closures and branches are marked by single characters that can sit far apart in the sequence (in c1ccccc1, the digit 1 opens the benzene ring near the start of the string and closes it at the end), creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.

What is the novelty here?

The core novelty is the comparative isolation of the string representation variable in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between validity (guaranteed by SELFIES) and accuracy (highest with SMILES).

What experiments were performed?

The authors performed a large-scale image-to-text translation experiment:

  • Task: Converting 2D chemical structure images into text strings.
  • Data:
    • ChEMBL: ~1.6M molecules, split into two datasets (with and without stereochemistry).
    • PubChem: ~3M molecules, split similarly, to test performance scaling with data size.
  • Representations: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI (a conversion sketch follows this list).
  • Metric: The models were evaluated on:
    • Validity: Can the predicted string be decoded back to a molecule?
    • Exact Match: Is the predicted string identical to the ground truth?
    • Tanimoto Similarity: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)?
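
To make the representation variable concrete, here is a minimal conversion sketch. It assumes the rdkit, selfies, and deepsmiles Python packages; the paper itself used CDK-based tooling, and the example molecule is arbitrary.

```python
from rdkit import Chem
import selfies
import deepsmiles

smiles = "c1ccccc1O"  # phenol, an arbitrary example molecule
mol = Chem.MolFromSmiles(smiles)

canonical_smiles = Chem.MolToSmiles(mol)             # SMILES
selfies_str = selfies.encoder(canonical_smiles)      # SELFIES
converter = deepsmiles.Converter(rings=True, branches=True)
deepsmiles_str = converter.encode(canonical_smiles)  # DeepSMILES
inchi_str = Chem.MolToInchi(mol)                     # InChI

for label, s in [("SMILES", canonical_smiles), ("DeepSMILES", deepsmiles_str),
                 ("SELFIES", selfies_str), ("InChI", inchi_str)]:
    print(f"{label}: {s}")
```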

What were the outcomes and conclusions drawn?

  • SMILES is the most accurate: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.
  • SELFIES guarantees validity: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.
  • InChI is unsuitable: InChI performed significantly worse (approx. 64% exact match) due to extreme string lengths (up to 273 tokens).
  • Stereochemistry adds difficulty: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.
  • Recommendation: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).

Reproducibility Details

Data

The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Training | ChEMBL (Dataset 1/2) | ~1.5M | Filtered for MW < 1500 and the elements C, H, O, N, P, S, F, Cl, Br, I, Se, B. |
| Training | PubChem (Dataset 3/4) | ~3.0M | Same filtering rules; used to test scaling. |
| Evaluation | Test split | ~120k–250k | Created using the RDKit MaxMin algorithm to ensure chemical diversity. |

Image Generation:

  • Tool: CDK Structure Diagram Generator (SDG).
  • Specs: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.
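
As a hedged illustration of this step, the sketch below uses RDKit depiction as a stand-in for the CDK SDG actually used in the paper; the rotation and 8-bit grayscale output follow the specs above.

```python
from rdkit import Chem
from rdkit.Chem import Draw
import random

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, example input
img = Draw.MolToImage(mol, size=(300, 300))        # 300 x 300 px depiction
img = img.rotate(random.uniform(0, 360), fillcolor="white")  # random rotation
img.convert("L").save("molecule.png")              # 8-bit grayscale PNG
```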

Algorithms

Tokenization Rules (Critical for replication):

  • SELFIES: Split at every ][ (e.g., [C][N] $\rightarrow$ [C], [N]).
  • SMILES / DeepSMILES: Regex-based splitting (one reading of these rules is sketched in code after this list):
    • Every heavy atom (e.g., C, N).
    • Every bracket ( and ).
    • Every bond symbol = and #.
    • Every single-digit number.
    • Everything inside square brackets [] is kept as a single token.
  • InChI: The prefix InChI=1S/ was treated as a single token and removed during training, then re-added for evaluation.
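
The paper's exact regular expression is not reproduced here, so the following is one plausible Python reading of the rules above; keeping the two-letter halogens Cl and Br whole is implied by the rules but not stated.

```python
import re

# Bracket atoms stay whole; Cl/Br stay whole; every other letter atom,
# parenthesis, bond symbol (= or #), or single digit is its own token.
# The trailing "." is a catch-all for any remaining character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|[()]|[=#]|\d|.")

def tokenize_smiles(s):
    return SMILES_TOKEN.findall(s)

def tokenize_selfies(s):
    # SELFIES: split at every "][" boundary, keeping each bracketed symbol.
    return re.findall(r"\[[^\]]*\]", s)

print(tokenize_smiles("CC(=O)O[C@H]1Cl"))
# ['C', 'C', '(', '=', 'O', ')', 'O', '[C@H]', '1', 'Cl']
print(tokenize_selfies("[C][C][Branch1][C][O]"))
# ['[C]', '[C]', '[Branch1]', '[C]', '[O]']
```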

Models

The model follows the DECIMER architecture.

  • Encoder: EfficientNet-B3 (pre-trained with “Noisy Student” weights).
    • Output: Image feature vectors of shape $10 \times 10 \times 1536$.
  • Decoder: Transformer (similar to the “Base” model from Attention Is All You Need).
    • Layers: 4 encoder-decoder layers.
    • Attention Heads: 8.
    • Dimension ($d_{\text{model}}$): 512.
    • Feed-forward ($d_{\text{ff}}$): 2048.
    • Dropout: 10%.
  • Loss: Sparse categorical cross-entropy.
  • Optimizer: Adam with custom learning rate scheduler.
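
The scheduler is not spelled out above, so here is a sketch assuming the standard warmup schedule from "Attention Is All You Need", which the decoder is modeled on; treat the warmup_steps value as a placeholder.

```python
import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""

    def __init__(self, d_model=512, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps**-1.5)

optimizer = tf.keras.optimizers.Adam(
    TransformerSchedule(d_model=512), beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```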

Evaluation

Metrics were calculated after converting all predictions back to standard SMILES.

| Metric | Result (SMILES) | Notes |
| --- | --- | --- |
| Identical match | 88.62% (PubChem) | Strict character-for-character equality. |
| Valid structures | 99.78% | SMILES had rare syntax errors; SELFIES achieved 100%. |
| Tanimoto (avg.) | 0.98 | Calculated using PubChem fingerprints via CDK. |
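
As a concrete illustration of the similarity metric, the sketch below uses RDKit Morgan fingerprints as a stand-in for the PubChem fingerprints the authors computed via the CDK; absolute values will differ between fingerprint types.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_pred, smiles_true):
    # Compare prediction and ground truth as 2048-bit Morgan fingerprints.
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
           for s in (smiles_pred, smiles_true)]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

print(tanimoto("c1ccccc1O", "c1ccccc1N"))  # similar, but not an exact match
```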

Hardware

  • Training: Google Cloud TPUs (v3-8).
  • Format: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency; see the packing sketch after this list.
  • Batch Size: 1024.
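
A minimal sketch of that packing step, assuming TensorFlow and hypothetical feature names (the actual DECIMER schema may differ):

```python
import tensorflow as tf

def write_shard(path, pairs):
    """Write a list of (png_bytes, token_string) pairs to one TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for png_bytes, tokens in pairs:
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[png_bytes])),
                "text": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[tokens.encode()])),
            }))
            writer.write(example.SerializeToString())

# 128 image/text pairs per record file, as described above:
# for i in range(0, len(all_pairs), 128):
#     write_shard(f"train-{i // 128:05d}.tfrecord", all_pairs[i : i + 128])
```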

Citation

@article{rajanPerformanceChemicalStructure2022,
  title = {Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers},
  author = {Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim},
  year = {2022},
  journal = {Digital Discovery},
  volume = {1},
  number = {2},
  pages = {84--90},
  publisher = {Royal Society of Chemistry},
  doi = {10.1039/D1DD00013F}
}