Paper Information

Citation: Li, Y., Chen, G., & Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. Applied Sciences, 12(2), 680. https://doi.org/10.3390/app12020680

Publication: MDPI Applied Sciences 2022

Contribution: Image-to-Text Translation for Chemical Structures

This is a Method paper.

It proposes a novel neural network architecture, the Image Captioning Model based on Deep TNT (ICMDT), to solve the specific problem of “molecular translation” (image-to-text). The classification is supported by the following rhetorical indicators:

  • Novel Mechanism: It introduces the “Deep TNT block” to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).
  • Baseline Comparison: The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).
  • Ablation Study: Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.

Motivation: Digitizing Historical Chemical Literature

The primary motivation is to speed up chemical research by digitizing historical chemical literature.

  • Problem: Historical sources often contain corrupted or noisy images, making automated recognition difficult.
  • Gap: Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.
  • Goal: To build a dependable generative model that can accurately translate these noisy images into InChI (International Chemical Identifier) text strings.

Novelty: Multi-Level Feature Fusion with Deep TNT

The core contribution is the Deep TNT block and the resulting ICMDT architecture.

  • Deep TNT Block: The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
    1. Internal Transformer: Processes pixel embeddings.
    2. Middle Transformer: Processes small patch embeddings.
    3. Exterior Transformer: Processes large patch embeddings.
  • Multi-level Fusion: The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.
  • Position Encoding: Small patches and pixels use shared position encodings, while large patches use a learnable 1D encoding.
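
The multi-level fusion described above can be illustrated with a toy data-flow sketch. It assumes a flatten, project, and add fusion in the style of TNT. The widths, group sizes, random weights, and the helper names `linear`, `fuse`, and `random_matrix` are all hypothetical, and the three transformer blocks themselves are omitted, so this shows only how features move between granularities and is not the authors' implementation.

```python
import random

def linear(vec, weight):
    """Multiply a vector (length n_in) by a weight matrix (n_out rows x n_in columns)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def fuse(children, parent, weight):
    """Flatten the child embeddings, project them to the parent width, and add them to the parent."""
    flat = [x for child in children for x in child]
    projected = linear(flat, weight)
    return [p + q for p, q in zip(parent, projected)]

def random_matrix(n_out, n_in, rng):
    """Small random projection matrix (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)

# Toy widths and group sizes (hypothetical; the summary lists 160 / 640 / 2560).
D_PIXEL, D_SMALL, D_LARGE = 2, 4, 8
PIXELS_PER_SMALL, SMALL_PER_LARGE = 4, 4

# Embeddings for a single large patch and everything nested inside it.
pixels = [[[rng.random() for _ in range(D_PIXEL)] for _ in range(PIXELS_PER_SMALL)]
          for _ in range(SMALL_PER_LARGE)]
small = [[rng.random() for _ in range(D_SMALL)] for _ in range(SMALL_PER_LARGE)]
large = [rng.random() for _ in range(D_LARGE)]

w_pixel_to_small = random_matrix(D_SMALL, PIXELS_PER_SMALL * D_PIXEL, rng)
w_small_to_large = random_matrix(D_LARGE, SMALL_PER_LARGE * D_SMALL, rng)

# (Internal transformer over pixel embeddings would run here.)
small_fused = [fuse(px, sp, w_pixel_to_small) for px, sp in zip(pixels, small)]
# (Middle transformer over small-patch embeddings would run here.)
large_fused = fuse(small_fused, large, w_small_to_large)
# (Exterior transformer over large-patch embeddings would run here.)

print(len(small_fused), len(small_fused[0]), len(large_fused))
```

The point of the sketch is the ordering: pixel-level detail reaches the large-patch embedding only through the small-patch level, which is the extra step the Deep TNT block adds to the two-level TNT design.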

Methodology: Benchmarking on the BMS Dataset

The authors evaluated the model on the Bristol-Myers Squibb Molecular Translation dataset.

  • Baselines: They constructed four comparative models:
    • EfficientNetb0 + RNN (Bi-LSTM)
    • ResNet50d + RNN (Bi-LSTM)
    • EfficientNetb0 + Transformer
    • ResNet101d + Transformer
  • Ablation: They tested the impact of removing the large patch position encoding (ICMDT*) and reverting the encoder to a standard TNT-S (TNTD).
  • Pre-processing Study: They experimented with denoising ratios and cropping strategies.

Results & Conclusions: State-of-the-Art InChI Translation

  • Performance: ICMDT achieved the lowest Levenshtein distance (0.69), compared with 1.45 for the best baseline (ResNet101d + Transformer).
  • Convergence: The model converged faster than the baselines, outperforming them as early as epoch 6.7.
  • Ablation Results: The full Deep TNT block reduced error by nearly half compared to the standard TNT encoder (0.69 vs 1.29 Levenshtein distance).
  • Limitations: The model struggles with stereochemical layers (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.
  • Future Work: Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.
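
As a quick check on the comparisons above, the relative reductions in Levenshtein distance follow directly from the reported scores. This arithmetic is derived here from the summary's numbers and is not quoted from the paper:

$$ 1 - \frac{0.69}{1.29} \approx 0.465 \qquad \text{(ICMDT vs. TNTD)} $$

$$ 1 - \frac{0.69}{1.45} \approx 0.524 \qquad \text{(ICMDT vs. ResNet101d + Transformer)} $$

That is roughly a 46.5% reduction relative to the standard TNT encoder and a 52.4% reduction relative to the best baseline.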

Reproducibility Details

Data

The experiments used the Bristol-Myers Squibb Molecular Translation dataset from Kaggle.

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | BMS Training Set | 2,424,186 images | Supervised; contains noise and blur |
| Evaluation | BMS Test Set | 1,616,107 images | Higher noise variation than training set |

Pre-processing Strategy:

  • Effective: Padding resizing (reshaping to square, padding with border pixels).
  • Ineffective: Smart cropping (removing white borders degraded performance).
  • Augmentation: GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).
  • Denoising: Best results found by mixing denoised and original data (Ratio 2:13) during training.
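
A minimal sketch of the padding step, under one possible reading of "padding with border pixels": the shorter side is extended symmetrically by replicating the outermost row or column until the image is square. The function name `pad_to_square`, the centered placement, and the list-of-rows image representation are assumptions for illustration. The subsequent resize to the training resolution is left out.

```python
def pad_to_square(image):
    """Pad a 2D grayscale image (list of rows) to a square by replicating edge pixels.

    Assumption: padding is split as evenly as possible between the two sides.
    """
    height, width = len(image), len(image[0])
    side = max(height, width)

    top = (side - height) // 2
    bottom = side - height - top
    left = (side - width) // 2
    right = side - width - left

    # Extend each row horizontally using its first and last pixel.
    rows = [[row[0]] * left + list(row) + [row[-1]] * right for row in image]

    # Extend vertically by copying the first and last (already widened) rows.
    above = [list(rows[0]) for _ in range(top)]
    below = [list(rows[-1]) for _ in range(bottom)]
    return above + rows + below


wide = [[1, 2, 3, 4]]
tall = [[1], [2], [3]]
print(pad_to_square(wide))
print(pad_to_square(tall))
```

Unlike smart cropping, this keeps every original pixel and its relative position, which may be part of why the summary reports it as the more effective of the two strategies.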

Algorithms

  • Optimizer: Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).
  • Loss Function: Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. Standard Focal Loss multiplies cross-entropy by a modulating factor $(1-p_t)^\gamma$ to focus training on hard, low-confidence examples. Anti-Focal Loss replaces this factor with $(1+p_t)^\gamma$ to reduce the disparity between training and inference distributions: $$ \text{Anti-Focal Loss}(p_t) = - (1 + p_t)^\gamma \log(p_t) $$
  • Training Schedule:
    • Initial resolution: $224 \times 224$
    • Fine-tuning: Resolution $384 \times 384$ for samples with label length $>150$.
    • Batch size: Dynamic, increasing from 16 to 1024.
  • Inference Strategy:
    • Beam Search (beam width $k=16$ initially, increased to $k=64$ when the output fails InChI validation).
    • Test Time Augmentation (TTA): Rotations of $90^\circ$.
    • Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.
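
To make the loss function listed above concrete, the sketch below evaluates cross-entropy, Focal Loss, and Anti-Focal Loss for a single target-token probability $p_t$, using the formula given in this section with $\gamma=0.5$. The function names are illustrative, label smoothing is omitted, and a real training loop would apply this per token over batched tensors.

```python
import math

def cross_entropy(p_t):
    """Plain cross-entropy for the probability assigned to the correct token."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=0.5):
    """Focal Loss: down-weights confident predictions via (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def anti_focal_loss(p_t, gamma=0.5):
    """Anti-Focal Loss as written above: weights the term by (1 + p_t)^gamma."""
    return -((1.0 + p_t) ** gamma) * math.log(p_t)

for p in (0.1, 0.5, 0.9):
    print(f"p_t={p:.1f}  CE={cross_entropy(p):.4f}  "
          f"focal={focal_loss(p):.4f}  anti_focal={anti_focal_loss(p):.4f}")
```

For any $0 < p_t < 1$ the three values are ordered focal < cross-entropy < anti-focal, because the modulating factor is below 1 in the first case and above 1 in the last. The anti-focal factor also grows with $p_t$, so confident predictions keep relatively more weight than they do under Focal Loss.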

Models

ICMDT Architecture:

  • Encoder (Deep TNT):
    • Internal Block: Hidden size 160, Heads 4, MLP act GELU.
    • Middle Block: Hidden size 640, Heads 10.
    • Exterior Block: Hidden size 2560, Heads 16.
    • Patch Sizes: Small patch $16 \times 16$, Pixel patch $4 \times 4$.
  • Decoder (Vanilla Transformer):
    • Dimensions: 5120 (Hidden), 1024 (FFN).
    • Depth: 3 layers, Heads: 8.
    • Vocab Size: 193 (InChI tokens).
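
The hyperparameters above can be collected into a small configuration sketch. The dictionary layout and the `head_dim` helper are hypothetical, and the per-head width assumes the usual multi-head attention convention of splitting the hidden size evenly across heads, which the summary does not state explicitly.

```python
# Values copied from the architecture list above; the structure is illustrative.
ICMDT_BLOCKS = {
    "internal": {"hidden_size": 160, "num_heads": 4},
    "middle": {"hidden_size": 640, "num_heads": 10},
    "exterior": {"hidden_size": 2560, "num_heads": 16},
    "decoder": {"hidden_size": 5120, "num_heads": 8},
}

def head_dim(hidden_size, num_heads):
    """Per-head width, assuming the hidden size is split evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError(f"{hidden_size} is not divisible by {num_heads} heads")
    return hidden_size // num_heads

for name, cfg in ICMDT_BLOCKS.items():
    print(f"{name}: head_dim={head_dim(cfg['hidden_size'], cfg['num_heads'])}")
```

Each listed hidden size is divisible by its head count, so the stated values are at least internally consistent under that assumption.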

Evaluation

Metric: Levenshtein distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn the generated InChI string into the ground-truth string.
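
For reference, the metric can be computed with the standard dynamic-programming recurrence. This is a generic sketch, not the evaluation script used by the authors or by Kaggle. The example strings are illustrative, and the reported scores are presumably averages of this distance over the evaluation set.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


truth = "InChI=1S/CH4/h1H4"      # methane
predicted = "InChI=1S/CH4/h1H3"  # one wrong character
print(levenshtein(predicted, truth))
```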

| Model | Levenshtein Distance | Params (M) | Convergence (Epochs) |
|---|---|---|---|
| ICMDT (Ours) | 0.69 | 138.16 | ~9.76 |
| TNTD (Ablation) | 1.29 | 114.36 | - |
| ResNet101d + Transformer | 1.45 | 302.02 | 14+ |
| ResNet50d + RNN | 3.86 | 90.6 | 14+ |
| EfficientNetb0 + RNN | 2.95 | 46.3 | 11+ |

Citation

@article{liAutomatedRecognitionChemical2022,
  title = {Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}},
  author = {Li, Yanchi and Chen, Guanyu and Li, Xiang},
  year = 2022,
  month = jan,
  journal = {Applied Sciences},
  volume = {12},
  number = {2},
  pages = {680},
  publisher = {Multidisciplinary Digital Publishing Institute},
  issn = {2076-3417},
  doi = {10.3390/app12020680}
}