Paper Information

Citation: Rajan, K., Zielesny, A. & Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. Journal of Cheminformatics, 13(1), 61. https://doi.org/10.1186/s13321-021-00538-8

Publication: Journal of Cheminformatics 2021

Evaluating the Contribution: A Methodological Shift

Method (Dominant) with strong Resource elements.

This is primarily a Method paper because it proposes a specific architectural evolution: it replaces CNN-RNN encoder-decoder models with a Transformer-based network for image-to-structure translation, and it validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.

It also serves as a Resource contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (>35 million molecules).

Motivation: Inaccessible Chemical Knowledge

  • Data Inaccessibility: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.
  • Manual Bottlenecks: Manual curation and extraction of this data is tedious, slow, and error-prone.
  • Limitations of Prior Tools: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.

Key Innovation: Transformer-Based Molecular Translation

  • Transformer Architecture: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a Transformer-based decoder, significantly improving accuracy.
  • EfficientNet Backbone: Replaces the standard InceptionV3 feature extractor with EfficientNet-B3, which improved feature extraction quality for chemical images.
  • SELFIES Representation: Utilizes SELFIES (SELF-referencing Embedded Strings) as the target output. Because every syntactically valid SELFIES string decodes to a valid molecule, this eliminates the “invalid SMILES” problem common in generative models.
  • Massive Scaling: Trains on synthetic datasets derived from PubChem (up to 39 million molecules total, with the largest training subset at ~35 million), demonstrating that accuracy improves consistently with training-set size.

Methodology and Experimental Validation

  • Feature Extractor Ablation: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.
  • Architecture Comparison: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics. The structural similarity between predicted and ground truth molecules was measured via Tanimoto similarity over molecular fingerprints: $$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$
  • Data Scaling: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.
  • Stereochemistry & Ions: Tested the model’s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.
  • Augmentation Robustness: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.
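For binary fingerprints, the dot-product form of the Tanimoto formula above reduces to |A ∩ B| / |A ∪ B| over the sets of on bits. A toy pure-Python sketch (illustrative bit indices only; the paper computes this over PubChem molecular fingerprints):

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two binary fingerprints given as sets
    of on-bit indices: |A ∩ B| / |A ∪ B|. For 0/1 vectors this equals
    A·B / (|A|^2 + |B|^2 - A·B)."""
    if not a and not b:
        return 1.0  # two empty fingerprints are identical by convention
    return len(a & b) / len(a | b)

# Toy on-bit sets standing in for a predicted and a ground-truth fingerprint
# (hypothetical indices, not real PubChem bits):
fp_pred = {1, 4, 7, 9}
fp_true = {1, 4, 7, 10}
print(tanimoto(fp_pred, fp_true))  # 3 shared bits / 5 total bits = 0.6
```

A score of 1.0 (the paper's "Tanimoto 1.0" metric) means the two fingerprints are identical, though not necessarily the molecules themselves, which is why the isomorphism check via InChI is reported separately.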

Results and Scaling Observations

  • Architecture Comparison: The Transformer model with EfficientNet-B3 features outperformed the Encoder-Decoder baseline by a wide margin. On the 1M dataset, the Transformer achieved 74.57% exact matches (Tanimoto 1.0) compared to only 7.03% for the Encoder-Decoder (Table 4 in the paper).
  • High Accuracy at Scale: With the full 35-million molecule training set (Dataset 1), the model achieved a Tanimoto 1.0 score of 96.47% and an average Tanimoto similarity of 0.99.
  • Isomorphism: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via InChI).
  • Stereochemistry Costs: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).
  • Hardware Efficiency: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs. For the 1M molecule model, convergence took ~8h 41min on TPU v3-8 vs ~29h 48min on V100 GPU. The largest model (35M) took less than 14 days on TPU.
  • Augmentation Robustness (Dataset 3): When trained on augmented images and tested on non-augmented images, the model achieved 86.43% Tanimoto 1.0. Using a pre-trained model from Dataset 2 and refitting on augmented images improved this to 88.04% on non-augmented test images and 80.87% on augmented test images, retaining above 97% isomorphism rates.

Reproducibility Details

Data

The authors generated synthetic data from PubChem.

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | Dataset 1 (Clean) | 39M total (35M train) | No stereo/ions. Filtered for MW < 1500, bond count 3-40, SMILES length < 40. |
| Training | Dataset 2 (Complex) | 37M total (33M train) | Includes stereochemistry and charged groups (ions). |
| Training | Dataset 3 (Augmented) | 37M total (33M train) | Dataset 2 with image augmentations applied. |
| Preprocessing | N/A | N/A | Molecules converted to SELFIES. Images generated via the CDK Structure Diagram Generator (SDG) as $299 \times 299$ 8-bit PNGs. |
| Format | TFRecords | 75 MB chunks | 128 data points (image vector + tokenized string) per record. |
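The Dataset 1 filter criteria can be sketched as a simple predicate (a toy illustration of the thresholds listed above, not the authors' code; in practice the molecular weight and bond count would come from a cheminformatics toolkit such as CDK or RDKit):

```python
def passes_dataset1_filters(mol_weight: float, bond_count: int, smiles: str) -> bool:
    """Dataset 1 criteria from the paper: molecular weight < 1500 Da,
    3-40 bonds, and a SMILES string shorter than 40 characters."""
    return (
        mol_weight < 1500
        and 3 <= bond_count <= 40
        and len(smiles) < 40
    )

# Aspirin (MW ~180, 13 bonds between heavy atoms) passes; an oversized
# hypothetical molecule fails:
print(passes_dataset1_filters(180.16, 13, "CC(=O)OC1=CC=CC=C1C(=O)O"))  # True
print(passes_dataset1_filters(1800.0, 120, "C" * 60))                   # False
```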

Algorithms

  • Text Representation: SELFIES used to avoid invalid intermediate strings. Tokenized via Keras tokenizer.
    • Dataset 1 Tokens: 27 unique tokens. Max length 47.
    • Dataset 2/3 Tokens: 61 unique tokens (due to stereo/ion tokens).
  • Augmentation: Implemented using the `imgaug` Python package. Transforms applied at random:
    • Gaussian/Average Blur, Additive Gaussian Noise, Salt & Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.
  • Optimization: Adam optimizer with a custom learning rate scheduler (following the “Attention is all you need” paper).
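The learning-rate schedule from “Attention Is All You Need” scales inversely with the square root of the model dimension and the step count, after a linear warmup. A minimal sketch (d_model=512 matches the paper's decoder; warmup_steps=4000 is the original Transformer default and is an assumption here, as DECIMER's exact warmup value is not restated in these notes):

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning-rate schedule from "Attention Is All You Need":
    linear warmup for `warmup_steps` steps, then inverse-square-root decay.
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly during warmup, peaks at step == warmup_steps,
# then decays as 1/sqrt(step):
print(transformer_lr(1))     # small, early in warmup
print(transformer_lr(4000))  # peak
print(transformer_lr(8000))  # decaying
```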

Models

The final architecture is an Image-to-SELFIES Transformer.

  • Encoder (Feature Extractor):
    • EfficientNet-B3 (pre-trained on Noisy-student).
    • Input: $299 \times 299 \times 3$ images (normalized -1 to 1).
    • Output Feature Vector: $10 \times 10 \times 1536$.
  • Decoder (Transformer):
    • 4 Encoder-Decoder layers.
    • 8 Parallel Attention Heads.
    • Dimension size: 512.
    • Feed-forward size: 2048.
    • Dropout: 0.1.
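The encoder output feeds the decoder as a sequence: the $10 \times 10 \times 1536$ feature map is flattened into 100 spatial positions of 1536 features each, which the transformer attends over via cross-attention. A shape-only NumPy sketch (random placeholder values; in the actual network a learned dense layer then projects each position to the 512-dim model size):

```python
import numpy as np

# Dummy EfficientNet-B3 output for one image: a 10 x 10 spatial grid with
# 1536 channels (shapes from the paper; values are random placeholders).
features = np.random.rand(10, 10, 1536)

# Flatten the spatial grid into a sequence of 100 "image tokens" for the
# transformer decoder's cross-attention.
sequence = features.reshape(-1, 1536)
print(sequence.shape)  # (100, 1536)
```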

Evaluation

Evaluation was performed on a held-out test set (10% of the total data), selected for structural diversity via the RDKit MaxMin algorithm.
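MaxMin selection can be sketched as a greedy loop: start from a seed item, then repeatedly add the candidate whose minimum distance to everything already picked is largest. A toy pure-Python version over generic distances (the paper uses RDKit's MaxMin implementation over molecular fingerprints; the 1-D points here are purely illustrative):

```python
def maxmin_pick(items, dist, n_pick, seed=0):
    """Greedy MaxMin diversity selection: at each step, add the item that
    maximises its minimum distance to the already-picked set."""
    picked = [seed]
    while len(picked) < n_pick:
        candidates = [i for i in range(len(items)) if i not in picked]
        best = max(candidates,
                   key=lambda i: min(dist(items[i], items[j]) for j in picked))
        picked.append(best)
    return picked

# Toy 1-D example: points on a line, absolute difference as "distance".
# The picker spreads its choices across the range rather than clustering.
points = [0.0, 0.1, 0.2, 5.0, 9.9, 10.0]
print(maxmin_pick(points, lambda a, b: abs(a - b), 3))  # [0, 5, 3]
```

Picking the test set this way ensures the held-out molecules cover structurally diverse regions of chemical space rather than near-duplicates of the training data.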

| Metric | Value | Baseline | Notes |
|---|---|---|---|
| Tanimoto 1.0 | 96.47% | 74.57% (1M subset) | Percentage of predictions with a perfect fingerprint match (Dataset 1, 35M training). |
| Avg Tanimoto | 0.9923 | 0.9371 (1M subset) | Average similarity score (Dataset 1, 35M training). |
| Isomorphism | 99.75% | – | Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI). |

Hardware

  • Training Hardware: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.
  • Comparison Hardware: Nvidia Tesla V100 (32GB GPU).
  • Performance:
    • TPU v3-8 was ~4x faster than V100 GPU.
    • 1 Million molecule model convergence: 8h 41min on TPU vs ~29h 48min on GPU.
    • Largest model (35M) took less than 14 days on TPU.

Reproducibility

The paper is open-access, and both code and data are publicly available.

| Artifact | Type | License | Notes |
|---|---|---|---|
| DECIMER-TPU (GitHub) | Code | MIT | Official implementation using TensorFlow and TPU training |
| Code Archive (Zenodo) | Code | MIT | Archival snapshot of the codebase |
| Training Data (Zenodo) | Dataset | Unknown | SMILES data used for training (images generated via CDK SDG) |
| DECIMER Project Page | Other | N/A | Project landing page |

  • Hardware Requirements: Training requires TPU v3-8 (Google Cloud) or Nvidia V100 GPU. The largest model (35M molecules) took less than 14 days on TPU v3-8.
  • Missing Components: Augmentation parameters are documented in the paper (Table 14). Pre-trained model weights are available through the GitHub repository.

Citation

@article{rajanDECIMER10Deep2021,
  title = {DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers},
  shorttitle = {DECIMER 1.0},
  author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
  year = {2021},
  month = {aug},
  journal = {Journal of Cheminformatics},
  volume = {13},
  number = {1},
  pages = {61},
  issn = {1758-2946},
  doi = {10.1186/s13321-021-00538-8},
  url = {https://doi.org/10.1186/s13321-021-00538-8}
}