Paper Information
Citation: Rajan, K., Zielesny, A. & Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. Journal of Cheminformatics, 13(1), 61. https://doi.org/10.1186/s13321-021-00538-8
Publication: Journal of Cheminformatics 2021
Additional Resources:
What kind of paper is this?
Method (Dominant) with strong Resource elements.
This is primarily a Method paper because it proposes a specific architectural shift, replacing CNN-RNN encoder-decoder models with a Transformer-based network, to solve the problem of image-to-structure translation. It validates this methodological shift through ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.
It also serves as a Resource contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (>35 million molecules).
What is the motivation?
- Data Inaccessibility: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.
- Manual Bottlenecks: Manual curation and extraction of this data is tedious, slow, and error-prone.
- Limitations of Prior Tools: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.
What is the novelty here?
- Transformer Architecture: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a Transformer-based decoder, significantly improving accuracy.
- EfficientNet Backbone: Replaces the standard InceptionV3 feature extractor with EfficientNet-B3, which improved feature extraction quality for chemical images.
- SELFIES Representation: Uses SELFIES (SELF-referencing Embedded Strings) instead of SMILES as the target output. Every well-formed SELFIES string decodes to a valid molecule, which eliminates the "invalid SMILES" problem common in generative models (see the encode/decode sketch after this list).
- Massive Scaling: Trains on synthetic datasets of 33-35 million molecules, showing that model performance improves directly with training set size.
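A minimal sketch of the SELFIES round trip, assuming the open-source `selfies` Python package; the example molecule is illustrative and the authors' exact encoding settings are not restated here.

```python
# pip install selfies
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"                  # aspirin, illustrative input
selfies_str = sf.encoder(smiles)                   # SMILES -> SELFIES target string
tokens = list(sf.split_selfies(selfies_str))       # token stream for the decoder

# Any syntactically well-formed SELFIES string decodes to a valid molecule,
# which is why a generative decoder cannot emit "invalid SMILES".
decoded = sf.decoder(selfies_str)
print(selfies_str, tokens, decoded, sep="\n")
```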
What experiments were performed?
- Feature Extractor Ablation: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.
- Architecture Comparison: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics.
- Data Scaling: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.
- Stereochemistry & Ions: Tested the model’s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.
- Augmentation Robustness: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.
What were the outcomes and conclusions?
- Superior Architecture: The Transformer model with EfficientNet-B3 features significantly outperformed the Encoder-Decoder baseline. On the 1M dataset, the Transformer achieved 74.57% exact matches (Tanimoto 1.0) compared to only 7.03% for the Encoder-Decoder.
- High Accuracy at Scale: With the full 35-million molecule training set (Dataset 1), the model achieved a Tanimoto 1.0 score of 96.47% and an average Tanimoto similarity of 0.99.
- Isomorphism: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth.
- Stereochemistry Costs: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).
- Hardware Efficiency: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs, reducing training time for the largest models from months to under 14 days.
Reproducibility Details
Data
The authors generated synthetic data from PubChem.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | Dataset 1 (Clean) | 35M molecules | No stereo/ions. Filtered for MW < 1500, bond count 3-40, SMILES len < 40. |
| Training | Dataset 2 (Complex) | 33M molecules | Includes stereochemistry and charged groups (ions). |
| Training | Dataset 3 (Augmented) | 33M molecules | Dataset 2 with image augmentations applied. |
| Preprocessing | - | - | Molecules converted to SELFIES. Images generated via CDK Structure Diagram Generator (SDG) as $299\times299$ 8-bit PNGs. |
| Format | TFRecords | 75 MB chunks | 128 data points (image vector + tokenized string) per record. |
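A minimal sketch of how one image/token pair might be serialized into the TFRecord shards described above, assuming TensorFlow; the field names (`image`, `tokens`) and the `batch_of_128_pairs` iterable are illustrative assumptions, not the authors' exact schema.

```python
import tensorflow as tf

def serialize_example(png_bytes: bytes, token_ids: list) -> bytes:
    """Pack one training pair (PNG depiction + tokenized SELFIES) into a tf.train.Example."""
    features = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[png_bytes])),
        "tokens": tf.train.Feature(int64_list=tf.train.Int64List(value=token_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=features)).SerializeToString()

# Write ~128 pairs per shard, mirroring the layout in the table above.
with tf.io.TFRecordWriter("shard-00000.tfrecord") as writer:
    for png_bytes, token_ids in batch_of_128_pairs:   # hypothetical iterable of pairs
        writer.write(serialize_example(png_bytes, token_ids))
```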
Algorithms
- Text Representation: SELFIES used to avoid invalid intermediate strings; tokenized via the Keras tokenizer (see the tokenization sketch after this list).
- Dataset 1 Tokens: 27 unique tokens. Max length 47.
- Dataset 2/3 Tokens: 61 unique tokens (due to stereo/ion tokens).
- Augmentation: Implemented with the imgaug Python package (a pipeline sketch follows this list). Randomly applied: Gaussian/Average Blur, Additive Gaussian Noise, Salt & Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.
- Optimization: Adam optimizer with the custom learning rate schedule from "Attention Is All You Need" (sketched below).
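A minimal sketch of SELFIES tokenization with the Keras tokenizer, as referenced in the Text Representation item; `smiles_list` and the start/end markers are assumptions for illustration, not the authors' exact configuration.

```python
import selfies as sf
import tensorflow as tf

selfies_strings = [sf.encoder(s) for s in smiles_list]   # smiles_list: hypothetical input SMILES
# Split each SELFIES string into whitespace-separated tokens with start/end markers.
texts = ["<start> " + " ".join(sf.split_selfies(s)) + " <end>" for s in selfies_strings]

tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="", lower=False, oov_token="<unk>")
tokenizer.fit_on_texts(texts)                             # ~27 unique tokens for Dataset 1
sequences = tokenizer.texts_to_sequences(texts)
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=47, padding="post")                 # max length 47 for Dataset 1
```

A minimal sketch of an imgaug pipeline covering the corruptions listed above; the parameter ranges are illustrative assumptions, not the paper's values.

```python
import imgaug.augmenters as iaa

# Randomly apply one to three of the listed corruptions to each depiction.
augmenter = iaa.SomeOf((1, 3), [
    iaa.OneOf([iaa.GaussianBlur(sigma=(0.0, 1.5)), iaa.AverageBlur(k=(2, 5))]),
    iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),
    iaa.SaltAndPepper(0.01),
    iaa.CoarseDropout(0.02, size_percent=0.5),
    iaa.GammaContrast(gamma=(0.5, 2.0)),
    iaa.Sharpen(alpha=(0.0, 0.5)),
    iaa.Multiply((0.8, 1.2)),    # brightness variation
], random_order=True)

augmented = augmenter(images=image_batch)  # image_batch: hypothetical NumPy array of depictions
```

A minimal sketch of the "Attention Is All You Need" learning rate schedule wired into Adam, assuming TensorFlow; warmup_steps and the Adam beta values follow the original Transformer paper and may differ from the authors' settings.

```python
import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    def __init__(self, d_model=512, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

optimizer = tf.keras.optimizers.Adam(
    TransformerSchedule(d_model=512), beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```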
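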
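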
Models
The final architecture is an Image-to-SELFIES Transformer.
- Encoder (Feature Extractor):
- EfficientNet-B3 (initialized with Noisy Student pre-trained weights).
- Input: $299 \times 299 \times 3$ images (normalized -1 to 1).
- Output Feature Vector: $10 \times 10 \times 1536$.
- Decoder (Transformer):
- 4 Encoder-Decoder layers.
- 8 Parallel Attention Heads.
- Dimension size: 512.
- Feed-forward size: 2048.
- Dropout: 0.1.
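A minimal sketch of the encoder/decoder wiring, assuming TensorFlow/Keras: EfficientNet-B3 features are flattened to a sequence and cross-attended by a Transformer decoder block built from stock Keras layers. Keras ships ImageNet rather than Noisy Student weights, and positional encodings, masking, residual connections, layer normalization, and the full 4-layer stack are omitted for brevity.

```python
import tensorflow as tf

D_MODEL, NUM_HEADS, DFF, DROPOUT = 512, 8, 2048, 0.1

# Encoder: EfficientNet-B3 backbone; with 299x299x3 inputs the feature map is 10x10x1536.
# (The paper normalizes pixels to [-1, 1]; adjust preprocessing to the chosen weights.)
backbone = tf.keras.applications.EfficientNetB3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
images = tf.keras.Input(shape=(299, 299, 3))
features = backbone(images)                                  # (batch, 10, 10, 1536)
features = tf.keras.layers.Reshape((100, 1536))(features)    # flatten to a 100-step sequence
enc_out = tf.keras.layers.Dense(D_MODEL)(features)           # project to d_model

# One decoder block (the paper stacks four): self-attention over token embeddings,
# cross-attention over the image features, then a feed-forward network.
tokens = tf.keras.Input(shape=(None,), dtype=tf.int32)
x = tf.keras.layers.Embedding(input_dim=61, output_dim=D_MODEL)(tokens)  # 61-token vocab (Datasets 2/3)
x = tf.keras.layers.MultiHeadAttention(NUM_HEADS, D_MODEL // NUM_HEADS, dropout=DROPOUT)(x, x)
x = tf.keras.layers.MultiHeadAttention(NUM_HEADS, D_MODEL // NUM_HEADS, dropout=DROPOUT)(x, enc_out)
x = tf.keras.layers.Dense(DFF, activation="relu")(x)
x = tf.keras.layers.Dense(D_MODEL)(x)
logits = tf.keras.layers.Dense(61)(x)                        # one logit per SELFIES token

model = tf.keras.Model(inputs=[images, tokens], outputs=logits)
```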
Evaluation
Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.
| Metric | 35M model (Dataset 1) | 1M subset | Notes |
|---|---|---|---|
| Tanimoto 1.0 | 96.47% | 74.57% | Percentage of predictions with a perfect fingerprint match. |
| Avg. Tanimoto | 0.9923 | 0.9371 | Average similarity score. |
| Isomorphism | 99.75% | - | Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI). |
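A minimal sketch of the two evaluation checks, assuming RDKit; Morgan/ECFP-style fingerprints stand in here for whatever fingerprint the authors used, while the InChI comparison mirrors the isomorphism check reported above.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def evaluate_pair(true_smiles: str, predicted_smiles: str):
    """Return (Tanimoto similarity, is_isomorphic) for one test prediction."""
    mol_true = Chem.MolFromSmiles(true_smiles)
    mol_pred = Chem.MolFromSmiles(predicted_smiles)
    if mol_pred is None:                                  # unparseable prediction
        return 0.0, False
    fp_true = AllChem.GetMorganFingerprintAsBitVect(mol_true, 2, nBits=2048)
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(mol_pred, 2, nBits=2048)
    tanimoto = DataStructs.TanimotoSimilarity(fp_true, fp_pred)
    # Tanimoto 1.0 only guarantees identical fingerprints; a matching InChI
    # confirms the two structures are actually identical (the isomorphism check).
    isomorphic = Chem.MolToInchi(mol_true) == Chem.MolToInchi(mol_pred)
    return tanimoto, isomorphic
```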
Hardware
- Training Hardware: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.
- Comparison Hardware: Nvidia Tesla V100 (32GB GPU).
- Performance:
- TPU v3-8 was ~4x faster than V100 GPU.
- 1 Million molecule model convergence: ~8.5 hours on TPU vs ~30 hours on GPU.
- Largest model (35M) took < 14 days on TPU.
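A minimal sketch of the standard TensorFlow TPU initialization such a training run relies on; the TPU address and `build_model()` are placeholders, not the authors' code.

```python
import tensorflow as tf

# Connect to a Cloud TPU (e.g. a v3-8) and build a distribution strategy;
# the model and optimizer are then constructed inside strategy.scope().
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://tpu-address")  # placeholder
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = build_model()                     # hypothetical constructor for the model sketched above
    optimizer = tf.keras.optimizers.Adam()
```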
Citation
```bibtex
@article{rajanDECIMER10Deep2021,
title = {DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers},
shorttitle = {DECIMER 1.0},
author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
year = {2021},
month = {dec},
journal = {Journal of Cheminformatics},
volume = {13},
number = {1},
pages = {61},
issn = {1758-2946},
doi = {10.1186/s13321-021-00538-8},
url = {https://doi.org/10.1186/s13321-021-00538-8}
}
```