Paper Information

Citation: Yi, J., Wu, C., Zhang, X., Xiao, X., Qiu, Y., Zhao, W., Hou, T., & Cao, D. (2022). MICER: a pre-trained encoder-decoder architecture for molecular image captioning. Bioinformatics, 38(19), 4562-4572. https://doi.org/10.1093/bioinformatics/btac545

Publication: Bioinformatics 2022

MICER’s Contribution to Optical Structure Recognition

This is a Method paper according to the AI for Physical Sciences taxonomy. It proposes MICER, an encoder-decoder architecture that integrates transfer learning (fine-tuning pre-trained models) and attention mechanisms for Optical Chemical Structure Recognition (OCSR). The study includes rigorous benchmarking comparing MICER against three rule-based tools (OSRA, MolVec, Imago) and existing deep learning methods (DECIMER). The authors conduct extensive factor comparison experiments to isolate the effects of stereochemistry, molecular complexity, data volume, and encoder backbone choices.

The Challenge of Generalizing in OCSR

Chemical structures in scientific literature are valuable for drug discovery, but they are locked in image formats that are difficult to mine automatically. Traditional OCSR tools (like OSRA) rely on hand-crafted rules and expert knowledge. They are brittle, struggle with stylistic variations, and have low generalization ability. While deep learning has been applied (e.g., DECIMER), previous attempts often used frozen pre-trained feature extractors (without fine-tuning) or failed to fully exploit transfer learning, leading to suboptimal performance. The goal of this work is to build an end-to-end “image captioning” system that translates molecular images directly into SMILES strings without intermediate segmentation steps.

Integrating Fine-Tuning and Attention for Chemistry

The core novelty lies in the specific architectural integration of transfer learning with fine-tuning for the chemical domain. Unlike DECIMER, which used a frozen network, MICER fine-tunes a pre-trained ResNet on molecular images. This allows the encoder to adapt from general object recognition to specific chemical feature extraction.

The model incorporates an attention mechanism into the LSTM decoder, allowing the model to focus on specific image regions (atoms and bonds) when generating each character of the SMILES string. The paper explicitly analyzes “intrinsic features” of molecular data (stereochemistry, complexity) to guide the design of a robust training dataset, combining multiple chemical toolkits (Indigo, RDKit) to generate diverse styles.

Experimental Setup and Ablation Studies

The authors performed two types of experiments: Factor Comparison (ablations) and Benchmarking.

Factor Comparisons: They evaluated how performance is affected by:

  • Stereochemistry (SI): Comparing models trained on data with and without stereochemical information.
  • Molecular Complexity (MC): Analyzing performance across 5 molecular weight intervals.
  • Data Volume (DV): Training on datasets ranging from 0.64 million to 10 million images.
  • Pre-trained Models (PTMs): Comparing 8 different backbones (e.g., ResNet, VGG, Inception, MobileNet) versus a base CNN.

Benchmarking:

  • Baselines: OSRA, MolVec, Imago (rule-based); Base CNN, DECIMER (deep learning).
  • Datasets: Four test sets (100k images each, except UOB): Uni-style, Multi-style, Noisy, and Real-world (UOB dataset).
  • Metrics: Sequence Accuracy (Exact Match), Levenshtein Distance (ALD), and Tanimoto Similarity (Fingerprint match).
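The three metrics can be sketched in plain Python. Fingerprints are represented here as sets of on-bit indices, and all function names are illustrative rather than taken from the paper's code.

```python
def sequence_accuracy(preds, refs):
    """Exact-match rate between predicted and reference SMILES strings."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def levenshtein(a, b):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 1.0
```

For example, `levenshtein("CCO", "CC=O")` is 1 (one inserted `=`), which a Tanimoto comparison of the two molecules' fingerprints would penalize much more heavily, since a single character can change the whole structure.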

Results and Core Insights

MICER achieved 97.54% Sequence Accuracy on uni-style data and 82.33% on the real-world UOB dataset, far ahead of the strongest baselines (OSRA scored approximately 23% on uni-style; DECIMER scored roughly 21% on UOB). ResNet101 was identified as the most effective encoder (87.58% accuracy in preliminary tests), outperforming both deeper (DenseNet) and lighter (MobileNet) networks. Performance saturates around 6 million training samples. Including stereochemical information reduces accuracy by nearly 6%, indicating that wedge and dash bonds are harder to recognize. Visualized attention maps show the model correctly attending to the pixels of specific atoms (e.g., 'S' or 'Cl') when generating the corresponding characters.

Reproducibility Details

Data

The training data was curated from the ZINC20 database.

Preprocessing:

  • Filtering: Removed organometallics, mixtures, and invalid molecules.
  • Standardization: SMILES were canonicalized and de-duplicated.
  • Generation: Images generated using Indigo and RDKit toolkits to vary styles.

Dataset Size:

  • Total: 10 million images selected for the final model.
  • Composition: 6 million “default style” (Indigo) + 4 million “multi-style” (Indigo + RDKit).
  • Splits: 8:1:1 ratio for Training/Validation/Test.
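The 8:1:1 partition can be sketched as a seeded shuffle-and-slice; the function name and seed are illustrative, not from the paper.

```python
import random

def split_811(items, seed=0):
    """Shuffle and split into 8:1:1 train/validation/test, as in the paper's setup."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

On 10 million images this yields 8M/1M/1M examples for training, validation, and testing.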

Vocabulary: A dictionary of 39 SMILES characters plus three special tokens ([pad], [sos], [eos]). The characters are the digits 0-9; the atom characters C, l, c, O, N, n, F, H, o, S, s, B, r, I, i, P, p; and the symbols (, ), [, ], @, =, #, /, -, +, \, %. Note that two-letter atom symbols such as 'Br' and 'Cl' are tokenized as two separate characters ('B' + 'r', 'C' + 'l').
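The vocabulary above can be sketched as a small character-level tokenizer; the token inventory follows the paper, while the names (`encode`, `decode`, `STOI`) are illustrative.

```python
# Illustrative character-level SMILES tokenizer (names are not from the paper).
SPECIAL = ["[pad]", "[sos]", "[eos]"]
CHARS = list("0123456789") + list("ClcONnFHoSsBrIiPp") + list("()[]@=#/-+\\%")
VOCAB = SPECIAL + CHARS          # 3 special tokens + 39 characters
STOI = {t: i for i, t in enumerate(VOCAB)}

def encode(smiles: str) -> list[int]:
    """Map a SMILES string to token ids, framed by [sos]/[eos]."""
    return [STOI["[sos]"]] + [STOI[ch] for ch in smiles] + [STOI["[eos]"]]

def decode(ids: list[int]) -> str:
    """Invert encode(), dropping the special tokens."""
    return "".join(VOCAB[i] for i in ids if VOCAB[i] not in SPECIAL)
```

Note that `encode("Br")` produces two ids (for 'B' and 'r'); reassembling them into a two-letter atom symbol is left entirely to the decoder.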

Algorithms

  • Tokenization: Character-level tokenization (not atom-level); the model learns to assemble 'C' and 'l' into 'Cl'.
  • Attention Mechanism: Uses a soft attention mechanism in which the decoder computes an attention score between the encoder's feature map ($8 \times 8 \times 512$) and the current hidden state $b_t$: $$\text{att\_score} = \text{softmax}\big(L_a(\tanh(L_f(F) + L_b(b_t)))\big)$$
  • Training Configuration:
    • Loss Function: Cross-entropy loss
    • Optimizer: Adam optimizer
    • Learning Rate: 2e-5
    • Batch Size: 256
    • Epochs: 15
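The soft-attention score can be sketched in numpy. The shapes follow the paper ($64 \times 512$ flattened feature matrix), while $L_f$, $L_b$, $L_a$ stand in for the learned linear layers and are random matrices here purely for illustration, as is the attention dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

F = rng.normal(size=(64, 512))   # encoder feature matrix: 64 spatial positions x 512 channels
b_t = rng.normal(size=(512,))    # decoder hidden state at step t
d_att = 256                      # attention dimension (illustrative choice)

# Random stand-ins for the learned linear maps L_f, L_b, L_a in the paper's formula.
L_f = rng.normal(size=(512, d_att))
L_b = rng.normal(size=(512, d_att))
L_a = rng.normal(size=(d_att, 1))

# att_score = softmax(L_a(tanh(L_f(F) + L_b(b_t))))
scores = softmax((np.tanh(F @ L_f + b_t @ L_b) @ L_a).ravel())  # one weight per position
context = scores @ F  # attention-weighted sum of features, fed to the LSTM cell
```

The 64 scores sum to 1 and weight each spatial position's feature vector, which is how the decoder focuses on individual atoms and bonds while emitting each SMILES character.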

Models

Encoder:

  • Backbone: Pre-trained ResNet101 (trained on ImageNet).
  • Modifications: The final layer is removed to output a Feature Map of size $8 \times 8 \times 512$.
  • Flattening: Reshaped to a $64 \times 512$ feature matrix for the decoder.
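The flattening step is a plain reshape of the spatial grid into a sequence of position vectors, e.g.:

```python
import numpy as np

feature_map = np.zeros((8, 8, 512))            # encoder output: height x width x channels
feature_matrix = feature_map.reshape(-1, 512)  # 64 spatial positions x 512 features
```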

Decoder:

  • Type: Long Short-Term Memory (LSTM) with Attention.
  • Dropout: 0.3 applied to minimize overfitting.

Overall, the encoder passes the image through an initial convolutional (pilot) block, a max-pooling layer, and four stages of convolutional blocks (CB); the resulting feature matrix feeds the attention LSTM decoder.

Evaluation

Metrics:

  • SA (Sequence Accuracy): Strict exact match of SMILES strings.
  • ALD (Average Levenshtein Distance): Edit distance for character-level error analysis.
  • AMFTS: Average Tanimoto similarity of ECFP4 (Morgan) fingerprints between predicted and ground-truth structures, measuring structural similarity.

Test Sets:

  • Uni-style: 100,000 images (Indigo default).
  • Multi-style: 100,000 images (>10 styles).
  • Noisy: 100,000 images with noise added.
  • UOB: 5,575 real-world images from literature.

Hardware

  • Compute: 4 x NVIDIA Tesla V100 GPUs
  • Training Time: Approximately 42 hours for the final model

Citation

@article{yiMICERPretrainedEncoder2022,
  title = {{{MICER}}: A Pre-Trained Encoder--Decoder Architecture for Molecular Image Captioning},
  shorttitle = {{{MICER}}},
  author = {Yi, Jiacai and Wu, Chengkun and Zhang, Xiaochen and Xiao, Xinyi and Qiu, Yanlong and Zhao, Wentao and Hou, Tingjun and Cao, Dongsheng},
  year = {2022},
  month = sep,
  journal = {Bioinformatics},
  volume = {38},
  number = {19},
  pages = {4562--4572},
  issn = {1367-4811},
  doi = {10.1093/bioinformatics/btac545}
}