Paper Information
Citation: Li, Y., Chen, G., & Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. Applied Sciences, 12(2), 680. https://doi.org/10.3390/app12020680
Publication: MDPI Applied Sciences 2022
Additional Resources:
Contribution: Image-to-Text Translation for Chemical Structures
This is a Method paper.
It proposes a novel neural network architecture, the Image Captioning Model based on Deep TNT (ICMDT), to solve the specific problem of “molecular translation” (image-to-text). The classification is supported by the following rhetorical indicators:
- Novel Mechanism: It introduces the “Deep TNT block” to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).
- Baseline Comparison: The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).
- Ablation Study: Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.
Motivation: Digitizing Historical Chemical Literature
The primary motivation is to speed up chemical research by digitizing historical chemical literature.
- Problem: Historical sources often contain corrupted or noisy images, making automated recognition difficult.
- Gap: Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.
- Goal: To build a dependable generative model that can accurately translate these noisy images into InChI (International Chemical Identifier) text strings.
Novelty: Multi-Level Feature Fusion with Deep TNT
The core contribution is the Deep TNT block and the resulting ICMDT architecture.
- Deep TNT Block: The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
- Internal Transformer: Processes pixel embeddings.
- Middle Transformer: Processes small patch embeddings.
- Exterior Transformer: Processes large patch embeddings.
- Multi-level Fusion: The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.
- Position Encoding: A specific strategy of applying shared position encodings to small patches and pixels, while using a learnable 1D encoding for large patches.
Methodology: Benchmarking on the BMS Dataset
The authors evaluated the model on the Bristol-Myers Squibb Molecular Translation dataset.
- Baselines: They constructed four comparative models:
- EfficientNetb0 + RNN (Bi-LSTM)
- ResNet50d + RNN (Bi-LSTM)
- EfficientNetb0 + Transformer
- ResNet101d + Transformer
- Ablation: They tested the impact of removing the large patch position encoding (ICMDT*), reverting the encoder to a standard TNT-S (TNTD), and setting the patch size to 32 directly on TNT-S without the exterior transformer block (TNTD-B).
- Pre-processing Study: They experimented with denoising ratios and cropping strategies.
Results & Conclusions: Improved InChI Translation Accuracy
- Performance: ICMDT achieved the lowest Levenshtein distance (0.69) among all five models tested (Table 3). The best-performing baseline was ResNet101d+Transformer.
- Convergence: The model converged significantly faster than the baselines, outperforming others as early as epoch 6.7.
- Ablation Results: The full Deep TNT block reduced error by nearly half compared to the standard TNT encoder (0.69 vs 1.29 Levenshtein distance). Removing large patch position encoding (ICMDT*) degraded performance to 1.04, and directly using patch size 32 on TNT-S (TNTD-B) scored 1.37.
- Limitations: The model struggles with stereochemical layers (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.
- Inference & Fusion: The multi-model inference and fusion pipeline (beam search, TTA, step-wise logit ensemble, and voting) improved results by 0.24 to 2.5 Levenshtein distance reduction over single models.
- Future Work: Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.
Reproducibility
Status: Partially Reproducible. The dataset is publicly available through Kaggle, and the paper provides detailed hyperparameters and architecture specifications. However, no source code or pretrained model weights have been released.
| Artifact | Type | License | Notes |
|---|---|---|---|
| BMS Molecular Translation (Kaggle) | Dataset | Competition Terms | Training/test images with InChI labels |
Missing components: No official code repository or pretrained weights. Reimplementation requires reconstructing the Deep TNT block, training pipeline, and inference/fusion strategy from the paper description alone.
Hardware/compute requirements: Not explicitly stated in the paper.
Data
The experiments used the Bristol-Myers Squibb Molecular Translation dataset from Kaggle.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | BMS Training Set | 2,424,186 images | Supervised; contains noise and blur |
| Evaluation | BMS Test Set | 1,616,107 images | Higher noise variation than training set |
Pre-processing Strategy:
- Effective: Padding resizing (reshaping to square using the longer edge, padding insufficient parts with pixels from the middle of the image).
- Ineffective: Smart cropping (removing white borders degraded performance).
- Augmentation: GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).
- Denoising: Best results found by mixing denoised and original data (Ratio 2:13) during training.
Algorithms
- Optimizer: Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).
- Loss Function: Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. Standard Focal Loss adds a modulating factor $(1-p_t)^\gamma$ to cross-entropy to focus on hard negatives. Anti-Focal Loss (Raunak et al., 2020) modifies this factor to reduce the disparity between training and inference distributions in Seq2Seq models.
- Training Schedule:
- Initial resolution: $224 \times 224$
- Fine-tuning: Resolution $384 \times 384$ for labels $>150$ length.
- Batch size: Dynamic, increasing from 16 to 1024 (with proportional learning rate scaling).
- Noisy Labels: Randomly replacing chemical elements in labels with a certain probability to improve robustness during inference.
- Inference Strategy:
- Beam Search ($k=16$ initially, $k=64$ if failing InChI validation).
- Test Time Augmentation (TTA): Rotations of $90^\circ$.
- Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.
Models
ICMDT Architecture:
- Encoder (Deep TNT) (Depth: 12 layers):
- Internal Block: Dim 160, Heads 4, Hidden size 640, MLP act GELU, Pixel patch size 4.
- Middle Block: Dim 10, Heads 6, Hidden size 128, MLP act GELU, Small patch size 16.
- Exterior Block: Dim 2560, Heads 10, Hidden size 5120, MLP act GELU, Large patch size 32.
- Decoder (Vanilla Transformer):
- Decoder dim: 2560, FFN dim: 1024.
- Depth: 3 layers, Heads: 8.
- Vocab size: 193 (InChI tokens), text_dim: 384.
Evaluation
Metric: Levenshtein Distance (measures single-character edit operations between generated and ground truth InChI strings).
Ablation Results (Table 3 from paper):
| Model | Params (M) | Levenshtein Distance |
|---|---|---|
| ICMDT | 138.16 | 0.69 |
| ICMDT* | 138.16 | 1.04 |
| TNTD | 114.36 | 1.29 |
| TNTD-B | 114.36 | 1.37 |
Baseline Comparison (from convergence curves, Figure 9):
| Model | Params (M) | Convergence (Epochs) |
|---|---|---|
| ICMDT | 138.16 | ~9.76 |
| ResNet101d + Transformer | 302.02 | 14+ |
| EfficientNetb0 + Transformer | - | - |
| ResNet50d + RNN | 90.6 | 14+ |
| EfficientNetb0 + RNN | 46.3 | - |
Citation
@article{liAutomatedRecognitionChemical2022,
title = {Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}},
author = {Li, Yanchi and Chen, Guanyu and Li, Xiang},
year = 2022,
month = jan,
journal = {Applied Sciences},
volume = {12},
number = {2},
pages = {680},
publisher = {Multidisciplinary Digital Publishing Institute},
issn = {2076-3417},
doi = {10.3390/app12020680}
}
