Paper Information

Citation: Rajan, K., Zielesny, A. & Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. Journal of Cheminformatics, 13(1), 61. https://doi.org/10.1186/s13321-021-00538-8

Publication: Journal of Cheminformatics 2021

What kind of paper is this?

Method (Dominant) with strong Resource elements.

This is primarily a Method paper because it proposes a specific architectural shift, replacing CNN-RNN encoder-decoder models with a Transformer-based network, to solve the image-to-structure translation problem. It validates this shift through ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.

It also serves as a Resource contribution by releasing the open-source software and trained models and by describing the curation of a massive synthetic training dataset (about 35 million molecules).

What is the motivation?

  • Data Inaccessibility: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.
  • Manual Bottlenecks: Manual curation and extraction of this data are tedious, slow, and error-prone.
  • Limitations of Prior Tools: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.

What is the novelty here?

  • Transformer Architecture: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a Transformer-based decoder, significantly improving accuracy.
  • EfficientNet Backbone: Replaces the standard InceptionV3 feature extractor with EfficientNet-B3, which improved feature extraction quality for chemical images.
  • SELFIES Representation: Uses SELFIES (SELF-referencing Embedded Strings) instead of SMILES as the target output. Every well-formed SELFIES string decodes to a valid molecule, which eliminates the "invalid SMILES" problem common in generative models (see the sketch after this list).
  • Massive Scaling: Trains on synthetic datasets of 33-35 million molecules, demonstrating that model performance improves directly with training-set size.
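
A minimal round-trip sketch using the open-source selfies Python package illustrates why the representation cannot produce invalid structures; the example molecule is arbitrary and this is not DECIMER code:

```python
# pip install selfies
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"      # aspirin, as an arbitrary example
selfies_str = sf.encoder(smiles)       # SMILES -> SELFIES
recovered = sf.decoder(selfies_str)    # SELFIES -> SMILES

print(selfies_str)   # bracketed SELFIES token string
print(recovered)     # a valid SMILES for the same molecule

# Every syntactically well-formed SELFIES token sequence decodes to a valid
# molecule, so a generative decoder cannot emit an "invalid SMILES".
```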

What experiments were performed?

  • Feature Extractor Ablation: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.
  • Architecture Comparison: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics.
  • Data Scaling: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.
  • Stereochemistry & Ions: Tested the model’s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.
  • Augmentation Robustness: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.
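
For a sense of what these perturbations look like in code, here is a small imgaug pipeline covering the augmentations listed in the Algorithms section below; the parameter ranges are illustrative guesses, not the authors' settings:

```python
# pip install imgaug
import numpy as np
import imgaug.augmenters as iaa

# Apply 1-3 of the listed perturbations per image, in random order.
augmenter = iaa.SomeOf((1, 3), [
    iaa.OneOf([iaa.GaussianBlur(sigma=(0.0, 1.5)), iaa.AverageBlur(k=(2, 4))]),
    iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),
    iaa.SaltAndPepper(0.02),
    iaa.CoarseDropout(0.02, size_percent=0.5),
    iaa.GammaContrast((0.5, 2.0)),
    iaa.Sharpen(alpha=(0.0, 0.5)),
    iaa.Multiply((0.7, 1.3)),  # brightness
], random_order=True)

images = np.random.randint(0, 256, size=(8, 299, 299, 3), dtype=np.uint8)  # stand-in batch
augmented = augmenter(images=images)
```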

What were the outcomes and conclusions?

  • Superior Architecture: The Transformer model with EfficientNet-B3 features significantly outperformed the Encoder-Decoder baseline. On the 1M dataset, the Transformer achieved 74.57% exact matches (Tanimoto 1.0) compared to only 7.03% for the Encoder-Decoder.
  • High Accuracy at Scale: With the full 35-million molecule training set (Dataset 1), the model achieved a Tanimoto 1.0 score of 96.47% and an average Tanimoto similarity of 0.99.
  • Isomorphism: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth.
  • Stereochemistry Costs: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).
  • Hardware Efficiency: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs, reducing training time for the largest models from months to under 14 days.

Reproducibility Details

Data

The authors generated synthetic training data (2D structure depictions paired with SELFIES strings) from PubChem molecules.

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Training | Dataset 1 (Clean) | 35M molecules | No stereochemistry or ions. Filtered for MW < 1500, bond count 3-40, SMILES length < 40. |
| Training | Dataset 2 (Complex) | 33M molecules | Includes stereochemistry and charged groups (ions). |
| Training | Dataset 3 (Augmented) | 33M molecules | Dataset 2 with image augmentations applied. |
| Preprocessing | - | - | Molecules converted to SELFIES. Images generated with the CDK Structure Diagram Generator (SDG) as $299\times299$ 8-bit PNGs. |
| Format | TFRecords | 75 MB chunks | 128 data points (image vector + tokenized string) per record. |
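
A rough sketch of how an (image vector, tokenized SELFIES) pair could be serialized into TFRecord shards of this kind. The feature names and the 128-pairs-per-shard grouping follow the table; treating the image vector as a flattened float array is an assumption, not the authors' exact serialization code:

```python
import tensorflow as tf

def to_example(image_vector, token_ids):
    """Serialize one (image vector, tokenized SELFIES) pair as a tf.train.Example."""
    features = {
        "image_vector": tf.train.Feature(
            float_list=tf.train.FloatList(value=image_vector.flatten().tolist())
        ),
        "tokens": tf.train.Feature(int64_list=tf.train.Int64List(value=token_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))

def write_shard(path, pairs):
    """Write up to 128 pairs into one TFRecord shard (roughly the 75 MB chunks above)."""
    with tf.io.TFRecordWriter(path) as writer:
        for vec, token_ids in pairs[:128]:
            writer.write(to_example(vec, token_ids).SerializeToString())
```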

Algorithms

  • Text Representation: SELFIES is used to avoid invalid intermediate strings; strings are tokenized with the Keras tokenizer.
    • Dataset 1 Tokens: 27 unique tokens. Max length 47.
    • Dataset 2/3 Tokens: 61 unique tokens (due to stereo/ion tokens).
  • Augmentation: Implemented with the imgaug Python package. The following are applied at random:
    • Gaussian/Average Blur, Additive Gaussian Noise, Salt & Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.
  • Optimization: Adam optimizer with a custom learning-rate schedule (following the "Attention Is All You Need" paper), sketched below.
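
The schedule from "Attention Is All You Need" warms up linearly and then decays with the inverse square root of the step count. A minimal TensorFlow sketch, assuming d_model = 512 and the original paper's 4000 warm-up steps (the warm-up value is not stated in this summary):

```python
import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""

    def __init__(self, d_model=512, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5
        )

optimizer = tf.keras.optimizers.Adam(
    TransformerSchedule(d_model=512), beta_1=0.9, beta_2=0.98, epsilon=1e-9
)
```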

Models

The final architecture is an Image-to-SELFIES Transformer.

  • Encoder (Feature Extractor):
    • EfficientNet-B3 (pretrained with Noisy Student weights).
    • Input: $299 \times 299 \times 3$ images (pixel values normalized to $[-1, 1]$).
    • Output Feature Vector: $10 \times 10 \times 1536$.
  • Decoder (Transformer):
    • 4 Encoder-Decoder layers.
    • 8 Parallel Attention Heads.
    • Dimension size: 512.
    • Feed-forward size: 2048.
    • Dropout: 0.1.
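
To make the shapes concrete, a short sketch of the encoder side in tf.keras; keras.applications only ships ImageNet weights, so the Noisy Student checkpoint used in the paper would have to be loaded separately, and the decoder hyperparameters are simply collected in a dict here:

```python
import tensorflow as tf

# EfficientNet-B3 without its classification head as the image encoder.
backbone = tf.keras.applications.EfficientNetB3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3)
)

images = tf.zeros((1, 299, 299, 3))             # dummy batch, just to show shapes
features = backbone(images)                     # (1, 10, 10, 1536)
sequence = tf.reshape(features, (1, -1, 1536))  # (1, 100, 1536), fed to the Transformer decoder

# Decoder hyperparameters reported above.
decoder_config = dict(num_layers=4, num_heads=8, d_model=512, dff=2048, dropout_rate=0.1)
print(features.shape, sequence.shape)
```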

Evaluation

Evaluation was performed on a held-out test set (10% of the total data) selected with the RDKit MaxMin algorithm to ensure structural diversity.

| Metric | Value | Baseline | Notes |
| --- | --- | --- | --- |
| Tanimoto 1.0 | 96.47% | 74.57% (1M subset) | Percentage of predictions with a perfect fingerprint match (Dataset 1, 35M training). |
| Avg. Tanimoto | 0.9923 | 0.9371 (1M subset) | Average similarity score (Dataset 1, 35M training). |
| Isomorphism | 99.75% | - | Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI). |
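
As a stand-in for the evaluation above: the summary does not state which fingerprint type was used, so this sketch substitutes RDKit Morgan fingerprints for the Tanimoto comparison, while the identity check via InChI follows the table:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def compare(pred_smiles: str, true_smiles: str):
    """Tanimoto similarity plus an InChI-based identity check for one prediction."""
    pred = Chem.MolFromSmiles(pred_smiles)
    true = Chem.MolFromSmiles(true_smiles)
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, 2, nBits=2048)
    fp_true = AllChem.GetMorganFingerprintAsBitVect(true, 2, nBits=2048)
    tanimoto = DataStructs.TanimotoSimilarity(fp_pred, fp_true)
    isomorphic = Chem.MolToInchi(pred) == Chem.MolToInchi(true)
    return tanimoto, isomorphic

print(compare("Oc1ccccc1", "c1ccccc1O"))  # same molecule, different SMILES -> (1.0, True)
```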

Hardware

  • Training Hardware: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.
  • Comparison Hardware: Nvidia Tesla V100 (32GB GPU).
  • Performance:
    • TPU v3-8 was ~4x faster than V100 GPU.
    • 1-million-molecule model convergence: ~8.5 hours on TPU vs. ~30 hours on GPU.
    • Largest model (35M) took < 14 days on TPU.

Citation

@article{rajanDECIMER10Deep2021,
  title = {DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers},
  shorttitle = {DECIMER 1.0},
  author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
  year = {2021},
  month = {dec},
  journal = {Journal of Cheminformatics},
  volume = {13},
  number = {1},
  pages = {61},
  issn = {1758-2946},
  doi = {10.1186/s13321-021-00538-8},
  url = {https://doi.org/10.1186/s13321-021-00538-8}
}