DECIMER 1.0: Transformers for Chemical Image Recognition

Paper Information

Citation: Rajan, K., Zielesny, A. & Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. Journal of Cheminformatics, 13(1), 61. https://doi.org/10.1186/s13321-021-00538-8

Publication: Journal of Cheminformatics 2021

Additional Resources:

Evaluating the Contribution: A Methodological Shift

Method (Dominant) with strong Resource elements.

This is primarily a Method paper because it proposes a specific architectural evolution. It replaces CNN-RNN/Encoder-Decoder models with a Transformer-based network to solve the problem of image-to-structure translation. It validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.

It also serves as a Resource contribution by releasing open-source software and trained models, and by describing the curation of a massive synthetic training dataset (>35 million molecules).

Motivation: Inaccessible Chemical Knowledge

Data Inaccessibility: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.
Manual Bottlenecks: Manual curation and extraction of this data is tedious, slow, and error-prone.
Limitations of Prior Tools: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.

Key Innovation: Transformer-Based Molecular Translation

Transformer Architecture: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a Transformer-based decoder, raising exact-match accuracy on the 1M subset from 7.03% to 74.57%.
EfficientNet Backbone: Replaces the standard InceptionV3 feature extractor with EfficientNet-B3, which improved feature extraction quality for chemical images.
SELFIES Representation: Uses SELFIES (SELF-referencing Embedded Strings) as the target output. Every SELFIES string decodes to a valid molecule, which eliminates the “invalid SMILES” problem common in generative models.
Massive Scaling: Trains on synthetic datasets derived from PubChem (up to 39 million molecules total, with the largest training subset at ~35 million) and shows that accuracy improves steadily as the training set grows.

Methodology and Experimental Validation

Feature Extractor Ablation: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.
Architecture Comparison: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model. Structural similarity between predicted and ground-truth molecules was measured as the Tanimoto similarity of their molecular fingerprints $\mathbf{A}$ and $\mathbf{B}$: $$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$
Data Scaling: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to measure how accuracy changes with the amount of training data.
Stereochemistry & Ions: Tested the model’s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.
Augmentation Robustness: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.
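For binary fingerprints, the Tanimoto formula in the architecture comparison above reduces to the number of shared on-bits divided by the number of bits set in either fingerprint. A minimal sketch in plain Python, assuming each fingerprint is given as a set of on-bit indices; the function name and the example bit sets are made up for illustration and are not taken from the paper:

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity of two binary fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    shared = len(a & b)  # A . B for binary vectors
    # For binary vectors |A|^2 is simply the number of on-bits in A.
    return shared / (len(a) + len(b) - shared)


# Hypothetical fingerprints, for illustration only
truth = {1, 4, 7, 9}
pred_exact = {1, 4, 7, 9}
pred_close = {1, 4, 7, 12}

print(tanimoto(truth, pred_exact))  # 1.0
print(tanimoto(truth, pred_close))  # 0.6
```

A score of 1.0 means the two fingerprints are identical, which is what the "Tanimoto 1.0" percentages in the results refer to.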

Results and Scaling Observations

Architecture Comparison: The Transformer model with EfficientNet-B3 features outperformed the Encoder-Decoder baseline by a wide margin. On the 1M dataset, the Transformer achieved 74.57% exact matches (Tanimoto 1.0) compared to only 7.03% for the Encoder-Decoder (Table 4 in the paper).
High Accuracy at Scale: With the full 35-million molecule training set (Dataset 1), the model achieved a Tanimoto 1.0 score of 96.47% and an average Tanimoto similarity of 0.99.
Isomorphism: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via InChI).
Stereochemistry Costs: Including stereochemistry and ions increased the token count and the difficulty of the task, resulting in lower accuracy (89.87% exact match on Dataset 2).
Hardware Efficiency: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs. For the 1M molecule model, convergence took ~8h 41min on TPU v3-8 vs ~29h 48min on V100 GPU. The largest model (35M) took less than 14 days on TPU.
Augmentation Robustness (Dataset 3): When trained on augmented images and tested on non-augmented images, the model achieved 86.43% Tanimoto 1.0. Using a pre-trained model from Dataset 2 and refitting on augmented images improved this to 88.04% on non-augmented test images and 80.87% on augmented test images, retaining above 97% isomorphism rates.

Reproducibility Details

Data

The authors generated synthetic data from PubChem.

Purpose	Dataset	Size	Notes
Training	Dataset 1 (Clean)	39M total (35M train)	No stereo/ions. Filtered for MW < 1500 Da, bond count 3-40, SMILES length < 40 characters.
Training	Dataset 2 (Complex)	37M total (33M train)	Includes stereochemistry and charged groups (ions).
Training	Dataset 3 (Augmented)	37M total (33M train)	Dataset 2 with image augmentations applied.
Preprocessing	N/A	N/A	Molecules converted to SELFIES. Images generated via CDK Structure Diagram Generator (SDG) as $299 \times 299$ 8-bit PNGs.
Format	TFRecords	75 MB chunks	128 data points (image vector + tokenized string) per record.
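The "tokenized string" stored with each image can be pictured as a fixed-length sequence of integer IDs, one per SELFIES symbol. A minimal sketch in plain Python, assuming bracket-delimited SELFIES symbols; the `<pad>`, `<start>` and `<end>` conventions and all function names are assumptions for illustration, and this does not reproduce the Keras tokenizer the authors used:

```python
import re


def split_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into bracketed symbols, e.g. '[C][=O]' -> ['[C]', '[=O]']."""
    return re.findall(r"\[[^\]]*\]", selfies)


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer ID to every symbol; IDs 0-2 are reserved (an assumption here)."""
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2}
    for symbol in sorted({s for text in corpus for s in split_selfies(text)}):
        vocab[symbol] = len(vocab)
    return vocab


def encode(selfies: str, vocab: dict[str, int], max_len: int) -> list[int]:
    """Wrap the symbols in start/end markers and pad with zeros up to max_len."""
    ids = [vocab["<start>"]] + [vocab[s] for s in split_selfies(selfies)] + [vocab["<end>"]]
    if len(ids) > max_len:
        raise ValueError("sequence longer than max_len")
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


corpus = ["[C][C][O]", "[C][=O]"]  # ethanol and formaldehyde as SELFIES
vocab = build_vocab(corpus)
print(encode("[C][C][O]", vocab, max_len=8))  # [1, 4, 4, 5, 2, 0, 0, 0]
```

Padding every sequence to a common length is what lets the strings be batched alongside the image features.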

Algorithms

Text Representation: SELFIES is used so that every predicted string decodes to a valid molecule. Strings are tokenized with the Keras tokenizer.
- Dataset 1 Tokens: 27 unique tokens. Max length 47.
- Dataset 2/3 Tokens: 61 unique tokens (due to stereo/ion tokens).
Augmentation: Implemented using the imgaug Python package. Random application of:
- Gaussian/Average Blur, Additive Gaussian Noise, Salt & Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.
Optimization: Adam optimizer with a custom learning rate scheduler (following the “Attention Is All You Need” paper).
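The schedule from "Attention Is All You Need" increases the learning rate linearly during a warmup phase and then decays it with the inverse square root of the step number. A minimal sketch of that formula, assuming the model dimension of 512 used by DECIMER and a warmup of 4000 steps; the warmup value is the default from the original Transformer paper and is an assumption here, not a setting confirmed for DECIMER:

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given training step (steps are counted from 1)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warmup while step < warmup_steps, inverse-square-root decay afterwards.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000):
    print(step, f"{transformer_lr(step):.2e}")
```

With these assumed values the rate peaks at the end of warmup and has halved again by step 16000, since the decay term scales with the inverse square root of the step.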

Models

The final architecture is an Image-to-SELFIES Transformer.

Encoder (Feature Extractor):
- EfficientNet-B3 (initialized with Noisy Student pre-trained weights).
- Input: $299 \times 299 \times 3$ images (normalized -1 to 1).
- Output Feature Vector: $10 \times 10 \times 1536$.
Decoder (Transformer):
- 4 Encoder-Decoder layers.
- 8 Parallel Attention Heads.
- Dimension size: 512.
- Feed-forward size: 2048.
- Dropout: 0.1.
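The numbers above fit together as simple shape arithmetic. The sketch below checks them in plain Python; treating the $10 \times 10$ feature map as a flattened sequence of 100 positions is an assumption about how the features reach the Transformer, and the helper name is illustrative:

```python
def normalize_pixel(value: int) -> float:
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return value / 127.5 - 1.0


feature_map = (10, 10, 1536)                # height, width, channels from the encoder
seq_len = feature_map[0] * feature_map[1]   # positions if the spatial grid is flattened (assumed)
d_model = 512                               # Transformer dimension size
num_heads = 8                               # parallel attention heads
head_dim = d_model // num_heads             # dimension handled by each head

print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
print(seq_len, head_dim)                         # 100 64
```

Under this reading, each image becomes 100 feature vectors that the decoder attends over, with every attention head working on a 64-dimensional slice.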

Evaluation

Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.
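The idea behind a MaxMin selection is to repeatedly pick the molecule that is farthest from everything already picked, so the test set covers chemical space broadly. A simplified greedy sketch in plain Python, assuming fingerprints are sets of on-bit indices and Tanimoto distance is the dissimilarity measure; this is an illustrative stand-in, not RDKit's implementation, and the starting index and example fingerprints are arbitrary:

```python
def tanimoto_distance(a: set[int], b: set[int]) -> float:
    """1 - Tanimoto similarity for fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    similarity = len(a & b) / union if union else 1.0
    return 1.0 - similarity


def maxmin_pick(fps: list[set[int]], n_pick: int, first: int = 0) -> list[int]:
    """Greedily pick n_pick indices, each maximizing its minimum distance to those already picked."""
    if not 1 <= n_pick <= len(fps):
        raise ValueError("n_pick must be between 1 and len(fps)")
    picked = [first]
    # min_dist[i] = distance from item i to its nearest already-picked item
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < n_pick:
        remaining = (i for i in range(len(fps)) if i not in picked)
        best = max(remaining, key=lambda i: min_dist[i])
        picked.append(best)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[best]))
    return picked


# Hypothetical fingerprints, for illustration only
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
print(maxmin_pick(fps, n_pick=3))  # [0, 2, 1]
```

The near-duplicate of the starting molecule (index 3) is picked last, which is the behaviour that makes the resulting subset diverse.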

Metric	Value	Baseline	Notes
Tanimoto 1.0	96.47%	74.57% (1M subset)	Percentage of predictions with perfect fingerprint match (Dataset 1, 35M training).
Avg Tanimoto	0.9923	0.9371 (1M subset)	Average similarity score (Dataset 1, 35M training).
Isomorphism	99.75%	-	Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI).
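The first two rows of the table are aggregates over per-molecule Tanimoto scores. A minimal sketch of that aggregation, assuming one similarity score per test prediction; the function name, dictionary keys and example scores are hypothetical:

```python
def summarize(scores: list[float]) -> dict[str, float]:
    """Return the share of perfect matches (in percent) and the mean similarity."""
    if not scores:
        raise ValueError("scores must not be empty")
    exact = sum(1 for s in scores if s == 1.0)
    return {
        "tanimoto_1_pct": 100.0 * exact / len(scores),
        "avg_tanimoto": sum(scores) / len(scores),
    }


scores = [1.0, 1.0, 0.5, 1.0]  # made-up scores, not results from the paper
print(summarize(scores))  # {'tanimoto_1_pct': 75.0, 'avg_tanimoto': 0.875}
```

The isomorphism row is computed only over the predictions counted in `tanimoto_1_pct`, since a perfect fingerprint match does not by itself guarantee an identical structure.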

Hardware

Training Hardware: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.
Comparison Hardware: Nvidia Tesla V100 (32GB GPU).
Performance:
- TPU v3-8 was ~4x faster than V100 GPU.
- 1 Million molecule model convergence: 8h 41min on TPU vs ~29h 48min on GPU.
- Largest model (35M) took less than 14 days on TPU.
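The 1M-molecule timings give a concrete check on the quoted speedup. A small worked calculation using only the two convergence times listed above (the helper name is illustrative):

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an hours/minutes duration to total minutes."""
    return hours * 60 + minutes


tpu_minutes = to_minutes(8, 41)    # TPU v3-8, 1M-molecule model
gpu_minutes = to_minutes(29, 48)   # V100 GPU, 1M-molecule model
speedup = gpu_minutes / tpu_minutes

print(tpu_minutes, gpu_minutes)  # 521 1788
print(round(speedup, 2))         # 3.43
```

For this dataset size the ratio is about 3.4x, in line with the rounded ~4x figure.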

Reproducibility

The paper is open-access, and both code and data are publicly available.

Artifact	Type	License	Notes
DECIMER-TPU (GitHub)	Code	MIT	Official implementation using TensorFlow and TPU training
Code Archive (Zenodo)	Code	MIT	Archival snapshot of the codebase
Training Data (Zenodo)	Dataset	Unknown	SMILES data used for training (images generated via CDK SDG)
DECIMER Project Page	Other	N/A	Project landing page

Hardware Requirements: Training was carried out on a TPU v3-8 (Google Cloud), with an Nvidia V100 GPU used for comparison. The largest model (35M molecules) took less than 14 days on TPU v3-8.
Missing Components: None identified. Augmentation parameters are documented in the paper (Table 14), and pre-trained model weights are available through the GitHub repository.

Citation

@article{rajanDECIMER10Deep2021,
  title = {DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers},
  shorttitle = {DECIMER 1.0},
  author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
  year = {2021},
  month = {aug},
  journal = {Journal of Cheminformatics},
  volume = {13},
  number = {1},
  pages = {61},
  issn = {1758-2946},
  doi = {10.1186/s13321-021-00538-8},
  url = {https://doi.org/10.1186/s13321-021-00538-8}
}
