ICMDT: Automated Chemical Structure Image Recognition

Paper Information

Citation: Li, Y., Chen, G., & Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. Applied Sciences, 12(2), 680. https://doi.org/10.3390/app12020680

Publication: MDPI Applied Sciences 2022

Additional Resources:

Kaggle Competition: BMS Molecular Translation

Contribution: Image-to-Text Translation for Chemical Structures

This is a Method paper.

It proposes a novel neural network architecture, the Image Captioning Model based on Deep TNT (ICMDT), to solve the specific problem of “molecular translation” (image-to-text). The classification is supported by the following rhetorical indicators:

Novel Mechanism: It introduces the “Deep TNT block” to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).
Baseline Comparison: The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).
Ablation Study: Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.

Motivation: Digitizing Historical Chemical Literature

The primary motivation is to speed up chemical research by digitizing historical chemical literature.

Problem: Historical sources often contain corrupted or noisy images, making automated recognition difficult.
Gap: Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.
Goal: To build a dependable generative model that can accurately translate these noisy images into InChI (International Chemical Identifier) text strings.

Novelty: Multi-Level Feature Fusion with Deep TNT

The core contribution is the Deep TNT block and the resulting ICMDT architecture.

Deep TNT Block: The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
1. Internal Transformer: Processes pixel embeddings.
2. Middle Transformer: Processes small patch embeddings.
3. Exterior Transformer: Processes large patch embeddings.
Multi-level Fusion: The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.
Position Encoding: Small patches and pixels use shared position encodings, while large patches use a learnable 1D encoding.
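To make the three granularities concrete, the sketch below counts how many tokens each level would see. It assumes non-overlapping square patches that nest exactly (pixel patches inside small patches inside large patches), which is an interpretation of the description above and not a detail confirmed by the paper. The names `Granularity` and `token_counts` are hypothetical; the sizes (224, 32, 16, 4) are the ones listed elsewhere in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Granularity:
    image_size: int = 224  # initial training resolution
    large_patch: int = 32  # handled by the exterior transformer
    small_patch: int = 16  # handled by the middle transformer
    pixel_patch: int = 4   # handled by the internal transformer


def token_counts(g: Granularity) -> dict:
    """Count tokens per level, assuming each size divides the next one evenly."""
    pairs = (
        (g.image_size, g.large_patch),
        (g.large_patch, g.small_patch),
        (g.small_patch, g.pixel_patch),
    )
    for outer, inner in pairs:
        if outer % inner != 0:
            raise ValueError(f"{inner} does not divide {outer}")

    large = (g.image_size // g.large_patch) ** 2
    small_per_large = (g.large_patch // g.small_patch) ** 2
    pixel_per_small = (g.small_patch // g.pixel_patch) ** 2
    return {
        "large_patches": large,
        "small_patches": large * small_per_large,
        "pixel_patches": large * small_per_large * pixel_per_small,
        "small_per_large": small_per_large,
        "pixel_per_small": pixel_per_small,
    }


if __name__ == "__main__":
    print(token_counts(Granularity()))
    print(token_counts(Granularity(image_size=384)))
```

Under these assumptions, fusion proceeds bottom-up: each small patch aggregates its 16 pixel-patch tokens, and each large patch aggregates its 4 small-patch tokens.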

Methodology: Benchmarking on the BMS Dataset

The authors evaluated the model on the Bristol-Myers Squibb Molecular Translation dataset.

Baselines: They constructed four comparative models:
- EfficientNetb0 + RNN (Bi-LSTM)
- ResNet50d + RNN (Bi-LSTM)
- EfficientNetb0 + Transformer
- ResNet101d + Transformer
Ablation: They tested the impact of removing the large patch position encoding (ICMDT*), reverting the encoder to a standard TNT-S (TNTD), and setting the patch size to 32 directly on TNT-S without the exterior transformer block (TNTD-B).
Pre-processing Study: They experimented with denoising ratios and cropping strategies.

Results & Conclusions: Improved InChI Translation Accuracy

Performance: ICMDT achieved the lowest Levenshtein distance (0.69) among all five models tested (Table 3). The best-performing baseline was ResNet101d+Transformer.
Convergence: The model converged faster than the baselines (around epoch 9.76, versus 14 or more for the ResNet-based baselines) and was already ahead of them by epoch 6.7.
Ablation Results: The full Deep TNT block cut the Levenshtein distance by about 47% relative to the standard TNT encoder (0.69 vs. 1.29). Removing large patch position encoding (ICMDT*) degraded performance to 1.04, and directly using patch size 32 on TNT-S (TNTD-B) scored 1.37.
Limitations: The model struggles with stereochemical layers (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.
Inference & Fusion: The multi-model inference and fusion pipeline (beam search, TTA, step-wise logit ensemble, and voting) lowered the Levenshtein distance by 0.24 to 2.5 relative to single models.
Future Work: Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.

Reproducibility

Status: Partially Reproducible. The dataset is publicly available through Kaggle, and the paper provides detailed hyperparameters and architecture specifications. However, no source code or pretrained model weights have been released.

Artifact	Type	License	Notes
BMS Molecular Translation (Kaggle)	Dataset	Competition Terms	Training/test images with InChI labels

Missing components: No official code repository or pretrained weights. Reimplementation requires reconstructing the Deep TNT block, training pipeline, and inference/fusion strategy from the paper description alone.

Hardware/compute requirements: Not explicitly stated in the paper.

Data

The experiments used the Bristol-Myers Squibb Molecular Translation dataset from Kaggle.

Purpose	Dataset	Size	Notes
Training	BMS Training Set	2,424,186 images	Supervised; contains noise and blur
Evaluation	BMS Test Set	1,616,107 images	Higher noise variation than training set

Pre-processing Strategy:

Effective: Padded resizing (the image is made square with a side equal to its longer edge, and the missing area is filled with pixels taken from the middle of the image).
Ineffective: Smart cropping (removing white borders degraded performance).
Augmentation: GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).
Denoising: Best results found by mixing denoised and original data (Ratio 2:13) during training.
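As an illustration of the PepperNoise augmentation listed above, here is a minimal sketch. It assumes that SNR denotes the fraction of pixels left untouched, so that each pixel is set to black with probability $1 - SNR$; the notes do not define SNR, so this reading is an assumption. The function name `add_pepper_noise` and the list-of-rows image representation are hypothetical.

```python
import random


def add_pepper_noise(image, snr=0.996, rng=None):
    """Return a copy of a grayscale image (list of rows, values 0-255)
    in which each pixel is set to 0 with probability 1 - snr."""
    rng = rng or random.Random()
    return [
        [px if rng.random() < snr else 0 for px in row]
        for row in image
    ]


if __name__ == "__main__":
    white = [[255] * 100 for _ in range(100)]
    noisy = add_pepper_noise(white, rng=random.Random(0))
    flipped = sum(px == 0 for row in noisy for px in row)
    print(f"{flipped} of 10000 pixels set to black")
```

With $SNR=0.996$, roughly 0.4% of pixels are affected on average, which keeps the molecule drawing legible while imitating scanning speckle.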

Algorithms

Optimizer: Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).
Loss Function: Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. Standard Focal Loss multiplies the cross-entropy by a modulating factor $(1-p_t)^\gamma$, which down-weights easy examples so that training focuses on hard ones. Anti-Focal Loss (Raunak et al., 2020) modifies this factor to reduce the disparity between training and inference distributions in Seq2Seq models.
Training Schedule:
- Initial resolution: $224 \times 224$
- Fine-tuning: Resolution $384 \times 384$ for samples whose label length exceeds 150.
- Batch size: Dynamic, increasing from 16 to 1024 (with proportional learning rate scaling).
- Noisy Labels: Randomly replacing chemical elements in labels with a certain probability to improve robustness during inference.
Inference Strategy:
- Beam Search ($k=16$ initially, $k=64$ if failing InChI validation).
- Test Time Augmentation (TTA): Rotations of $90^\circ$.
- Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.
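The difference between the Focal and Anti-Focal losses mentioned above can be sketched per token as follows. The Anti-Focal form used here, $-(1 + \alpha p_t)^\gamma \log p_t$, is my reading of Raunak et al. (2020) and should be treated as an assumption; the value $\alpha = 1$ is hypothetical since the notes only give $\gamma = 0.5$, and label smoothing is omitted.

```python
import math


def cross_entropy(p_t: float) -> float:
    """Plain negative log-likelihood of the correct token."""
    return -math.log(p_t)


def focal_loss(p_t: float, gamma: float = 0.5) -> float:
    """Focal loss: the factor (1 - p_t)^gamma shrinks the loss on
    confidently correct tokens."""
    return ((1.0 - p_t) ** gamma) * cross_entropy(p_t)


def anti_focal_loss(p_t: float, gamma: float = 0.5, alpha: float = 1.0) -> float:
    """Anti-Focal loss (assumed form): the factor (1 + alpha * p_t)^gamma
    grows with confidence instead of shrinking."""
    return ((1.0 + alpha * p_t) ** gamma) * cross_entropy(p_t)


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(
            f"p_t={p:.1f}  CE={cross_entropy(p):.4f}  "
            f"focal={focal_loss(p):.4f}  anti-focal={anti_focal_loss(p):.4f}"
        )
```

In this form the modulating factor stays above 1, so high-confidence tokens are not suppressed the way they are under Focal loss.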

Models

ICMDT Architecture:

Encoder (Deep TNT) (Depth: 12 layers):
- Internal Block: Dim 160, Heads 4, Hidden size 640, MLP act GELU, Pixel patch size 4.
- Middle Block: Dim 10, Heads 6, Hidden size 128, MLP act GELU, Small patch size 16.
- Exterior Block: Dim 2560, Heads 10, Hidden size 5120, MLP act GELU, Large patch size 32.
Decoder (Vanilla Transformer):
- Decoder dim: 2560, FFN dim: 1024.
- Depth: 3 layers, Heads: 8.
- Vocab size: 193 (InChI tokens), text_dim: 384.

Evaluation

Metric: Levenshtein Distance (the minimum number of single-character insertions, deletions, and substitutions needed to turn the generated InChI string into the ground truth).
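For reference, the metric can be computed with the standard dynamic-programming recurrence. This is a generic sketch and not the authors' or the competition's scoring code; `mean_levenshtein` is a hypothetical helper that assumes the reported score is the average distance over all test images.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions that turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def mean_levenshtein(predictions, targets) -> float:
    """Average distance over paired predicted and ground-truth strings."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    total = sum(levenshtein(p, t) for p, t in zip(predictions, targets))
    return total / len(targets)


if __name__ == "__main__":
    truth = "InChI=1S/CH4/h1H4"
    guess = "InChI=1S/CH4/h1H3"
    print(levenshtein(guess, truth))  # one substituted character
```

A mean distance below 1, such as the 0.69 reported for ICMDT, implies that many predictions match their target exactly.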

Ablation Results (Table 3 from paper):

Model	Params (M)	Levenshtein Distance
ICMDT	138.16	0.69
ICMDT*	138.16	1.04
TNTD	114.36	1.29
TNTD-B	114.36	1.37
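The relative gains implied by the table above can be checked with a few lines of arithmetic. The scores are copied from the table; the helper name `relative_reduction` is hypothetical.

```python
# Levenshtein distances from the ablation table (lower is better).
scores = {"ICMDT": 0.69, "ICMDT*": 1.04, "TNTD": 1.29, "TNTD-B": 1.37}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error of `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    best = scores["ICMDT"]
    for name, score in scores.items():
        if name != "ICMDT":
            gain = relative_reduction(score, best)
            print(f"ICMDT vs {name}: {gain:.1%} lower distance")
```

Relative to TNTD, the full model lowers the distance by about 46.5%, consistent with the description of the error being reduced by nearly half.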

Baseline Comparison (from convergence curves, Figure 9):

Model	Params (M)	Convergence (Epochs)
ICMDT	138.16	~9.76
ResNet101d + Transformer	302.02	14+
EfficientNetb0 + Transformer	-	-
ResNet50d + RNN	90.6	14+
EfficientNetb0 + RNN	46.3	-

Citation

@article{liAutomatedRecognitionChemical2022,
  title = {Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}},
  author = {Li, Yanchi and Chen, Guanyu and Li, Xiang},
  year = 2022,
  month = jan,
  journal = {Applied Sciences},
  volume = {12},
  number = {2},
  pages = {680},
  publisher = {Multidisciplinary Digital Publishing Institute},
  issn = {2076-3417},
  doi = {10.3390/app12020680}
}
