Paper Information
Citation: Rajan, K., Zielesny, A. & Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. Journal of Cheminformatics, 13(1), 61. https://doi.org/10.1186/s13321-021-00538-8
Publication: Journal of Cheminformatics 2021
Additional Resources:
What kind of paper is this?
Method (Dominant) with strong Resource elements.
This is primarily a Method paper because it proposes a specific architectural shift, replacing CNN-RNN encoder-decoder models with a Transformer-based network, to solve the problem of image-to-structure translation. It validates this methodological shift through ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.
It also serves as a Resource contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (>35 million molecules).
What is the motivation?
- Data Inaccessibility: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.
- Manual Bottlenecks: Manual curation and extraction of this data is tedious, slow, and error-prone.
- Limitations of Prior Tools: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.
What is the novelty here?
- Transformer Architecture: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a Transformer-based decoder, significantly improving accuracy.
- EfficientNet Backbone: Replaces the standard InceptionV3 feature extractor with EfficientNet-B3, which improved feature extraction quality for chemical images.
- SELFIES Representation: Uses SELFIES (SELF-referencing Embedded Strings) instead of SMILES as the target output. Every well-formed SELFIES string decodes to a valid molecule, which eliminates the "invalid SMILES" problem common in generative models (see the encode/decode sketch after this list).
- Massive Scaling: Trains on synthetic datasets of 33-35 million molecules, showing that model performance improves directly with training set size.
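A minimal sketch of the SELFIES round trip, assuming the open-source `selfies` Python package; the example molecule is illustrative and the authors' exact encoding settings are not restated here.

```python
# pip install selfies
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"                  # aspirin, illustrative input
selfies_str = sf.encoder(smiles)                   # SMILES -> SELFIES target string
tokens = list(sf.split_selfies(selfies_str))       # token stream for the decoder

# Any syntactically well-formed SELFIES string decodes to a valid molecule,
# which is why a generative decoder cannot emit "invalid SMILES".
decoded = sf.decoder(selfies_str)
print(selfies_str, tokens, decoded, sep="\n")
```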
What experiments were performed?
- Feature Extractor Ablation: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.
- Architecture Comparison: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics.
- Data Scaling: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.
- Stereochemistry & Ions: Tested the model’s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.
- Augmentation Robustness: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.
What were the outcomes and conclusions?
- Superior Architecture: The Transformer model with EfficientNet-B3 features significantly outperformed the Encoder-Decoder baseline. On the 1M dataset, the Transformer achieved 74.57% exact matches (Tanimoto 1.0) compared to only 7.03% for the Encoder-Decoder.
- High Accuracy at Scale: With the full 35-million molecule training set (Dataset 1), the model achieved a Tanimoto 1.0 score of 96.47% and an average Tanimoto similarity of 0.99.
- Isomorphism: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth.
- Stereochemistry Costs: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).
- Hardware Efficiency: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs, reducing training time for the largest models from months to under 14 days.
Reproducibility Details
Data
The authors generated synthetic data from PubChem.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | Dataset 1 (Clean) | 35M molecules | No stereo/ions. Filtered for MW < 1500, bond count 3-40, SMILES len < 40. |
| Training | Dataset 2 (Complex) | 33M molecules | Includes stereochemistry and charged groups (ions). |
| Training | Dataset 3 (Augmented) | 33M molecules | Dataset 2 with image augmentations applied. |
| Preprocessing | - | - | Molecules converted to SELFIES. Images generated via CDK Structure Diagram Generator (SDG) as $299\times299$ 8-bit PNGs. |
| Format | TFRecords | 75 MB chunks | 128 data points (image vector + tokenized string) per record. |
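A minimal sketch of how one image/token pair might be serialized into the TFRecord shards described above, assuming TensorFlow; the field names (`image`, `tokens`) and the `batch_of_128_pairs` iterable are illustrative assumptions, not the authors' exact schema.

```python
import tensorflow as tf

def serialize_example(png_bytes: bytes, token_ids: list) -> bytes:
    """Pack one training pair (PNG depiction + tokenized SELFIES) into a tf.train.Example."""
    features = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[png_bytes])),
        "tokens": tf.train.Feature(int64_list=tf.train.Int64List(value=token_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=features)).SerializeToString()

# Write ~128 pairs per shard, mirroring the layout in the table above.
with tf.io.TFRecordWriter("shard-00000.tfrecord") as writer:
    for png_bytes, token_ids in batch_of_128_pairs:   # hypothetical iterable of pairs
        writer.write(serialize_example(png_bytes, token_ids))
```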
Algorithms
- Text Representation: SELFIES used to avoid invalid intermediate strings; tokenized via the Keras tokenizer (see the tokenization sketch after this list).
- Dataset 1 Tokens: 27 unique tokens. Max length 47.
- Dataset 2/3 Tokens: 61 unique tokens (due to stereo/ion tokens).
- Augmentation: Implemented with the imgaug Python package (a pipeline sketch follows this list). Randomly applied: Gaussian/Average Blur, Additive Gaussian Noise, Salt & Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.
- Optimization: Adam optimizer with the custom learning rate schedule from "Attention Is All You Need" (sketched below).
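A minimal sketch of SELFIES tokenization with the Keras tokenizer, as referenced in the Text Representation item; `smiles_list` and the start/end markers are assumptions for illustration, not the authors' exact configuration.

```python
import selfies as sf
import tensorflow as tf

selfies_strings = [sf.encoder(s) for s in smiles_list]   # smiles_list: hypothetical input SMILES
# Split each SELFIES string into whitespace-separated tokens with start/end markers.
texts = ["<start> " + " ".join(sf.split_selfies(s)) + " <end>" for s in selfies_strings]

tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="", lower=False, oov_token="<unk>")
tokenizer.fit_on_texts(texts)                             # ~27 unique tokens for Dataset 1
sequences = tokenizer.texts_to_sequences(texts)
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=47, padding="post")                 # max length 47 for Dataset 1
```

A minimal sketch of an imgaug pipeline covering the corruptions listed above; the parameter ranges are illustrative assumptions, not the paper's values.

```python
import imgaug.augmenters as iaa

# Randomly apply one to three of the listed corruptions to each depiction.
augmenter = iaa.SomeOf((1, 3), [
    iaa.OneOf([iaa.GaussianBlur(sigma=(0.0, 1.5)), iaa.AverageBlur(k=(2, 5))]),
    iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),
    iaa.SaltAndPepper(0.01),
    iaa.CoarseDropout(0.02, size_percent=0.5),
    iaa.GammaContrast(gamma=(0.5, 2.0)),
    iaa.Sharpen(alpha=(0.0, 0.5)),
    iaa.Multiply((0.8, 1.2)),    # brightness variation
], random_order=True)

augmented = augmenter(images=image_batch)  # image_batch: hypothetical NumPy array of depictions
```

A minimal sketch of the "Attention Is All You Need" learning rate schedule wired into Adam, assuming TensorFlow; warmup_steps and the Adam beta values follow the original Transformer paper and may differ from the authors' settings.

```python
import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    def __init__(self, d_model=512, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

optimizer = tf.keras.optimizers.Adam(
    TransformerSchedule(d_model=512), beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```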
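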
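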
Models
The final architecture is an Image-to-SELFIES Transformer.
- Encoder (Feature Extractor):
- EfficientNet-B3 (initialized with Noisy Student pre-trained weights).
- Input: $299 \times 299 \times 3$ images (normalized -1 to 1).
- Output Feature Vector: $10 \times 10 \times 1536$.
- Decoder (Transformer):
- 4 Encoder-Decoder layers.
- 8 Parallel Attention Heads.
- Dimension size: 512.
- Feed-forward size: 2048.
- Dropout: 0.1.
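A minimal sketch of the encoder/decoder wiring, assuming TensorFlow/Keras: EfficientNet-B3 features are flattened to a sequence and cross-attended by a Transformer decoder block built from stock Keras layers. Keras ships ImageNet rather than Noisy Student weights, and positional encodings, masking, residual connections, layer normalization, and the full 4-layer stack are omitted for brevity.

```python
import tensorflow as tf

D_MODEL, NUM_HEADS, DFF, DROPOUT = 512, 8, 2048, 0.1

# Encoder: EfficientNet-B3 backbone; with 299x299x3 inputs the feature map is 10x10x1536.
# (The paper normalizes pixels to [-1, 1]; adjust preprocessing to the chosen weights.)
backbone = tf.keras.applications.EfficientNetB3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
images = tf.keras.Input(shape=(299, 299, 3))
features = backbone(images)                                  # (batch, 10, 10, 1536)
features = tf.keras.layers.Reshape((100, 1536))(features)    # flatten to a 100-step sequence
enc_out = tf.keras.layers.Dense(D_MODEL)(features)           # project to d_model

# One decoder block (the paper stacks four): self-attention over token embeddings,
# cross-attention over the image features, then a feed-forward network.
tokens = tf.keras.Input(shape=(None,), dtype=tf.int32)
x = tf.keras.layers.Embedding(input_dim=61, output_dim=D_MODEL)(tokens)  # 61-token vocab (Datasets 2/3)
x = tf.keras.layers.MultiHeadAttention(NUM_HEADS, D_MODEL // NUM_HEADS, dropout=DROPOUT)(x, x)
x = tf.keras.layers.MultiHeadAttention(NUM_HEADS, D_MODEL // NUM_HEADS, dropout=DROPOUT)(x, enc_out)
x = tf.keras.layers.Dense(DFF, activation="relu")(x)
x = tf.keras.layers.Dense(D_MODEL)(x)
logits = tf.keras.layers.Dense(61)(x)                        # one logit per SELFIES token

model = tf.keras.Model(inputs=[images, tokens], outputs=logits)
```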
Evaluation
Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.
| Metric | 35M model (Dataset 1) | 1M subset | Notes |
|---|---|---|---|
| Tanimoto 1.0 | 96.47% | 74.57% | Percentage of predictions with a perfect fingerprint match. |
| Avg. Tanimoto | 0.9923 | 0.9371 | Average similarity score. |
| Isomorphism | 99.75% | - | Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI). |
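A minimal sketch of the two evaluation checks, assuming RDKit; Morgan/ECFP-style fingerprints stand in here for whatever fingerprint the authors used, while the InChI comparison mirrors the isomorphism check reported above.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def evaluate_pair(true_smiles: str, predicted_smiles: str):
    """Return (Tanimoto similarity, is_isomorphic) for one test prediction."""
    mol_true = Chem.MolFromSmiles(true_smiles)
    mol_pred = Chem.MolFromSmiles(predicted_smiles)
    if mol_pred is None:                                  # unparseable prediction
        return 0.0, False
    fp_true = AllChem.GetMorganFingerprintAsBitVect(mol_true, 2, nBits=2048)
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(mol_pred, 2, nBits=2048)
    tanimoto = DataStructs.TanimotoSimilarity(fp_true, fp_pred)
    # Tanimoto 1.0 only guarantees identical fingerprints; a matching InChI
    # confirms the two structures are actually identical (the isomorphism check).
    isomorphic = Chem.MolToInchi(mol_true) == Chem.MolToInchi(mol_pred)
    return tanimoto, isomorphic
```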
Hardware
- Training Hardware: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.
- Comparison Hardware: Nvidia Tesla V100 (32GB GPU).
- Performance:
- TPU v3-8 was ~4x faster than V100 GPU.
- 1 Million molecule model convergence: ~8.5 hours on TPU vs ~30 hours on GPU.
- Largest model (35M) took < 14 days on TPU.
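A minimal sketch of the standard TensorFlow TPU initialization such a training run relies on; the TPU address and `build_model()` are placeholders, not the authors' code.

```python
import tensorflow as tf

# Connect to a Cloud TPU (e.g. a v3-8) and build a distribution strategy;
# the model and optimizer are then constructed inside strategy.scope().
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://tpu-address")  # placeholder
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = build_model()                     # hypothetical constructor for the model sketched above
    optimizer = tf.keras.optimizers.Adam()
```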
Citation
```bibtex
@article{rajanDECIMER10Deep2021,
title = {DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers},
shorttitle = {DECIMER 1.0},
author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
year = {2021},
month = {dec},
journal = {Journal of Cheminformatics},
volume = {13},
number = {1},
pages = {61},
issn = {1758-2946},
doi = {10.1186/s13321-021-00538-8},
url = {https://doi.org/10.1186/s13321-021-00538-8}
}
```