Paper Information
Citation: Rajan, K., Zielesny, A. & Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. Journal of Cheminformatics, 13(1), 61. https://doi.org/10.1186/s13321-021-00538-8
Publication: Journal of Cheminformatics 2021
Evaluating the Contribution: A Methodological Shift
Method (Dominant) with strong Resource elements.
This is primarily a Method paper because it proposes a specific architectural evolution. It replaces CNN-RNN/Encoder-Decoder models with a Transformer-based network to solve the problem of image-to-structure translation. It validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.
It also serves as a Resource contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (>35 million molecules).
Motivation: Inaccessible Chemical Knowledge
- Data Inaccessibility: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.
- Manual Bottlenecks: Manual curation and extraction of this data is tedious, slow, and error-prone.
- Limitations of Prior Tools: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.
Key Innovation: Transformer-Based Molecular Translation
- Transformer Architecture: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a Transformer-based decoder, significantly improving accuracy.
- EfficientNet Backbone: Replaces the standard InceptionV3 feature extractor with EfficientNet-B3, which improved feature extraction quality for chemical images.
- SELFIES Representation: Uses SELFIES (SELF-referencing Embedded Strings) as the target output. Every SELFIES string decodes to a valid molecule, which eliminates the “invalid SMILES” problem common in generative models.
- Massive Scaling: Trains on synthetic datasets derived from PubChem (up to 39 million molecules total, with the largest training subset at ~35 million), demonstrating that scaling data size directly correlates with improved model performance.
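To make the SELFIES target concrete, the following is a minimal sketch of how a bracketed SELFIES string can be split into tokens and mapped to padded integer IDs for a decoder. It is not the authors' pipeline (they use the Keras tokenizer). The special tokens `<pad>`, `<start>`, `<end>` and the helper names `split_selfies`, `build_vocab`, and `encode` are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List

# Each SELFIES symbol is written inside square brackets, e.g. "[C]" or "[=O]".
TOKEN_PATTERN = re.compile(r"\[[^\[\]]+\]")


def split_selfies(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols."""
    tokens = TOKEN_PATTERN.findall(selfies)
    # Reject anything that is not fully covered by bracketed symbols
    # (this sketch does not handle "." separated fragments).
    if "".join(tokens) != selfies:
        raise ValueError(f"not a plain bracketed SELFIES string: {selfies!r}")
    return tokens


def build_vocab(strings: Iterable[str]) -> Dict[str, int]:
    """Assign an integer ID to every symbol seen in the corpus."""
    # Hypothetical special tokens; the real tokenizer setup may differ.
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2}
    for s in strings:
        for token in split_selfies(s):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(selfies: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert a SELFIES string into a fixed-length list of token IDs."""
    ids = [vocab["<start>"]]
    ids += [vocab[token] for token in split_selfies(selfies)]
    ids.append(vocab["<end>"])
    if len(ids) > max_len:
        raise ValueError("sequence longer than max_len")
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


corpus = ["[C][C][O]", "[C][C][=O]"]  # "[C][C][O]" is ethanol
vocab = build_vocab(corpus)
print(encode("[C][C][O]", vocab, max_len=8))
```

Because the vocabulary consists of whole bracketed symbols, the decoder only ever has to emit a short sequence drawn from a small, closed token set.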
Methodology and Experimental Validation
- Feature Extractor Ablation: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.
- Architecture Comparison: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics. The structural similarity between predicted and ground truth molecules was measured via Tanimoto similarity over molecular fingerprints: $$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$
- Data Scaling: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe how accuracy scales with the amount of training data.
- Stereochemistry & Ions: Tested the model’s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.
- Augmentation Robustness: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.
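The Tanimoto formula used in the architecture comparison can be sketched for binary fingerprints as follows. This is a plain-Python illustration of the formula only; the fingerprint type and length used in the paper are not assumed here, and the four-bit vectors are toy values.

```python
from typing import Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity T(A, B) = A.B / (|A|^2 + |B|^2 - A.B)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    denominator = norm_a + norm_b - dot
    if denominator == 0:
        # Both vectors are all zeros; treating them as identical is an
        # assumption of this sketch, not something stated in the paper.
        return 1.0
    return dot / denominator


reference = [1, 1, 0, 1]   # toy fingerprint of the ground-truth molecule
prediction = [1, 0, 0, 1]  # toy fingerprint of the predicted molecule
print(tanimoto(reference, prediction))  # two shared bits out of three set overall
print(tanimoto(reference, reference))   # identical fingerprints
```

For binary vectors the dot product counts shared "on" bits, so the formula reduces to the size of the intersection divided by the size of the union.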
Results and Scaling Observations
- Architecture Comparison: The Transformer model with EfficientNet-B3 features outperformed the Encoder-Decoder baseline by a wide margin. On the 1M dataset, the Transformer achieved 74.57% exact matches (Tanimoto 1.0) compared to only 7.03% for the Encoder-Decoder (Table 4 in the paper).
- High Accuracy at Scale: With the full 35-million molecule training set (Dataset 1), the model achieved a Tanimoto 1.0 score of 96.47% and an average Tanimoto similarity of 0.99.
- Isomorphism: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via InChI).
- Stereochemistry Costs: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).
- Hardware Efficiency: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs. For the 1M molecule model, convergence took ~8h 41min on TPU v3-8 vs ~29h 48min on V100 GPU. The largest model (35M) took less than 14 days on TPU.
- Augmentation Robustness (Dataset 3): When trained on augmented images and tested on non-augmented images, the model achieved 86.43% Tanimoto 1.0. Using a pre-trained model from Dataset 2 and refitting on augmented images improved this to 88.04% on non-augmented test images and 80.87% on augmented test images, retaining above 97% isomorphism rates.
Reproducibility Details
Data
The authors generated synthetic data from PubChem.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | Dataset 1 (Clean) | 39M total (35M train) | No stereo/ions. Filtered for MW < 1500, bond count 3-40, SMILES len < 40. |
| Training | Dataset 2 (Complex) | 37M total (33M train) | Includes stereochemistry and charged groups (ions). |
| Training | Dataset 3 (Augmented) | 37M total (33M train) | Dataset 2 with image augmentations applied. |
| Preprocessing | N/A | N/A | Molecules converted to SELFIES. Images generated via CDK Structure Diagram Generator (SDG) as $299 \times 299$ 8-bit PNGs. |
| Format | TFRecords | 75 MB chunks | 128 Data points (image vector + tokenized string) per record. |
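The Dataset 1 filter rules in the table can be expressed as a simple predicate. This is a sketch under stated assumptions: the `MoleculeRecord` class is hypothetical, its property values are assumed to be precomputed by a cheminformatics toolkit, the bond-count range is taken as inclusive, and the example values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoleculeRecord:
    """Hypothetical container for precomputed molecule properties."""
    smiles: str
    molecular_weight: float  # g/mol
    bond_count: int


def passes_dataset1_filters(mol: MoleculeRecord) -> bool:
    """Apply the MW < 1500, bond count 3-40, SMILES length < 40 rules."""
    return (
        mol.molecular_weight < 1500
        and 3 <= mol.bond_count <= 40  # inclusive bounds are an assumption
        and len(mol.smiles) < 40
    )


# Illustrative records; the numbers are example inputs, not paper data.
ring = MoleculeRecord(smiles="c1ccccc1", molecular_weight=78.11, bond_count=6)
tiny = MoleculeRecord(smiles="CCO", molecular_weight=46.07, bond_count=2)

print(passes_dataset1_filters(ring))  # kept
print(passes_dataset1_filters(tiny))  # dropped: fewer than 3 bonds
```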
Algorithms
- Text Representation: SELFIES used to avoid invalid intermediate strings. Tokenized via Keras tokenizer.
- Dataset 1 Tokens: 27 unique tokens. Max length 47.
- Dataset 2/3 Tokens: 61 unique tokens (due to stereo/ion tokens).
- Augmentation: Implemented using the `imgaug` Python package, with random application of Gaussian/Average Blur, Additive Gaussian Noise, Salt & Pepper, Coarse Dropout, Gamma Contrast, Sharpen, and Brightness.
- Optimization: Adam optimizer with a custom learning rate scheduler (following the “Attention is all you need” paper).
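The learning rate schedule from “Attention is all you need” increases linearly during warmup and then decays with the inverse square root of the step. A minimal sketch is shown below. The `d_model=512` default matches the dimension size listed for this model, while `warmup_steps=4000` is the value from the original Transformer paper and is an assumption here, since the summary does not state the value used for DECIMER.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises until `warmup_steps`, then decays.
for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

With these settings the peak rate occurs at the end of warmup, and quadrupling the step count after that point halves the rate.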
Models
The final architecture is an Image-to-SELFIES Transformer.
- Encoder (Feature Extractor):
- EfficientNet-B3 (pre-trained with Noisy Student weights).
- Input: $299 \times 299 \times 3$ images (normalized -1 to 1).
- Output Feature Vector: $10 \times 10 \times 1536$.
- Decoder (Transformer):
- 4 Encoder-Decoder layers.
- 8 Parallel Attention Heads.
- Dimension size: 512.
- Feed-forward size: 2048.
- Dropout: 0.1.
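The hyperparameters above can be gathered into a small configuration object that also derives a few shapes. This is an illustrative sketch, not the authors' code: the class name `DecimerConfig`, the assumption that the $10 \times 10$ feature map is flattened into a sequence of 100 positions, and the `pixel / 127.5 - 1` normalization are choices made here to match the listed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimerConfig:
    """Hypothetical container for the hyperparameters listed above."""
    image_size: int = 299
    feature_grid: int = 10        # spatial size of the encoder output
    feature_channels: int = 1536  # channels of the encoder output
    num_layers: int = 4
    num_heads: int = 8
    d_model: int = 512
    d_ff: int = 2048
    dropout: float = 0.1

    @property
    def sequence_length(self) -> int:
        # Assumes the grid is flattened row by row into one sequence.
        return self.feature_grid * self.feature_grid

    @property
    def head_dim(self) -> int:
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        return self.d_model // self.num_heads


def normalize_pixel(value: int) -> float:
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return value / 127.5 - 1.0


config = DecimerConfig()
print(config.sequence_length, config.head_dim)
print(normalize_pixel(0), normalize_pixel(255))
```

Under these assumptions, each image becomes 100 feature vectors of width 1536, and each of the 8 attention heads works on a 64-dimensional slice of the 512-dimensional model space.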
Evaluation
Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.
| Metric | Value | Baseline | Notes |
|---|---|---|---|
| Tanimoto 1.0 | 96.47% | 74.57% (1M subset) | Percentage of predictions with perfect fingerprint match (Dataset 1, 35M training). |
| Avg Tanimoto | 0.9923 | 0.9371 (1M subset) | Average similarity score (Dataset 1, 35M training). |
| Isomorphism | 99.75% | - | Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI). |
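The three metrics in the table can be computed from per-molecule results roughly as follows. This is a sketch with a hypothetical `Prediction` record; it assumes the Tanimoto score and both InChI strings have already been produced by a cheminformatics toolkit, and the four example records are toy data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Prediction:
    """Hypothetical per-molecule evaluation record."""
    tanimoto: float
    predicted_inchi: str
    reference_inchi: str


def summarize(predictions: List[Prediction]) -> Dict[str, float]:
    """Compute the Tanimoto 1.0 rate, average Tanimoto, and isomorphism rate."""
    if not predictions:
        raise ValueError("no predictions to summarize")
    total = len(predictions)
    exact = [p for p in predictions if p.tanimoto == 1.0]
    # Isomorphism is only checked for predictions with a perfect score.
    isomorphic = [p for p in exact if p.predicted_inchi == p.reference_inchi]
    return {
        "tanimoto_1_0_percent": 100.0 * len(exact) / total,
        "average_tanimoto": sum(p.tanimoto for p in predictions) / total,
        "isomorphic_percent_of_exact": (
            100.0 * len(isomorphic) / len(exact) if exact else 0.0
        ),
    }


methane = "InChI=1S/CH4/h1H4"
ethane = "InChI=1S/C2H6/c1-2/h1-2H3"
toy_results = [
    Prediction(1.0, methane, methane),
    Prediction(1.0, ethane, ethane),
    Prediction(1.0, methane, ethane),  # same fingerprint, different structure
    Prediction(0.5, methane, ethane),
]
print(summarize(toy_results))
```

The third toy record shows why the isomorphism check matters: a perfect fingerprint match does not by itself prove that two structures are identical.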
Hardware
- Training Hardware: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.
- Comparison Hardware: Nvidia Tesla V100 (32GB GPU).
- Performance:
- TPU v3-8 was ~4x faster than V100 GPU.
- 1 Million molecule model convergence: 8h 41min on TPU vs ~29h 48min on GPU.
- Largest model (35M) took less than 14 days on TPU.
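As a quick check of the reported timings, the speedup for the 1M-molecule model can be computed directly from the two convergence times. The helper `to_minutes` is introduced here for illustration.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an hours/minutes duration to total minutes."""
    return 60 * hours + minutes


tpu_minutes = to_minutes(8, 41)   # TPU v3-8 convergence time
gpu_minutes = to_minutes(29, 48)  # V100 GPU convergence time
speedup = gpu_minutes / tpu_minutes

print(f"TPU: {tpu_minutes} min, GPU: {gpu_minutes} min, speedup: {speedup:.2f}x")
```

For this particular run the ratio works out to about 3.4x, so the "~4x" figure is best read as a rounded estimate.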
Reproducibility
The paper is open-access, and both code and data are publicly available.
| Artifact | Type | License | Notes |
|---|---|---|---|
| DECIMER-TPU (GitHub) | Code | MIT | Official implementation using TensorFlow and TPU training |
| Code Archive (Zenodo) | Code | MIT | Archival snapshot of the codebase |
| Training Data (Zenodo) | Dataset | Unknown | SMILES data used for training (images generated via CDK SDG) |
| DECIMER Project Page | Other | N/A | Project landing page |
- Hardware Requirements: Training requires TPU v3-8 (Google Cloud) or Nvidia V100 GPU. The largest model (35M molecules) took less than 14 days on TPU v3-8.
- Missing Components: None identified. Augmentation parameters are documented in the paper (Table 14), and pre-trained model weights are available through the GitHub repository.
Citation
@article{rajanDECIMER10Deep2021,
title = {DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers},
shorttitle = {DECIMER 1.0},
author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
year = {2021},
month = {aug},
journal = {Journal of Cheminformatics},
volume = {13},
number = {1},
pages = {61},
issn = {1758-2946},
doi = {10.1186/s13321-021-00538-8},
url = {https://doi.org/10.1186/s13321-021-00538-8}
}
