Paper Information

Citation: Yi, J., Wu, C., Zhang, X., Xiao, X., Qiu, Y., Zhao, W., Hou, T., & Cao, D. (2022). MICER: a pre-trained encoder-decoder architecture for molecular image captioning. Bioinformatics, 38(19), 4562-4572. https://doi.org/10.1093/bioinformatics/btac545

Publication: Bioinformatics 2022

MICER’s Contribution to Optical Structure Recognition

This is a Method paper according to the AI for Physical Sciences taxonomy. It proposes MICER, an encoder-decoder architecture that integrates transfer learning (fine-tuning pre-trained models) and attention mechanisms for Optical Chemical Structure Recognition (OCSR). The study includes rigorous benchmarking comparing MICER against three rule-based tools (OSRA, MolVec, Imago) and existing deep learning methods (DECIMER). The authors conduct extensive factor comparison experiments to isolate the effects of stereochemistry, molecular complexity, data volume, and encoder backbone choices.

The Challenge of Generalizing in OCSR

Chemical structures in scientific literature are valuable for drug discovery, but they are locked in image formats that are difficult to mine automatically. Traditional OCSR tools (like OSRA) rely on hand-crafted rules and expert knowledge. They are brittle, struggle with stylistic variations, and have low generalization ability. While deep learning has been applied (e.g., DECIMER), previous attempts often used frozen pre-trained feature extractors (without fine-tuning) or failed to fully exploit transfer learning, leading to suboptimal performance. The goal of this work is to build an end-to-end “image captioning” system that translates molecular images directly into SMILES strings without intermediate segmentation steps.

Integrating Fine-Tuning and Attention for Chemistry

The core novelty lies in the specific architectural integration of transfer learning with fine-tuning for the chemical domain. Unlike DECIMER, which used a frozen network, MICER fine-tunes a pre-trained ResNet on molecular images. This allows the encoder to adapt from general object recognition to specific chemical feature extraction.

The model incorporates an attention mechanism into the LSTM decoder, allowing the model to focus on specific image regions (atoms and bonds) when generating each character of the SMILES string. The paper explicitly analyzes “intrinsic features” of molecular data (stereochemistry, complexity) to guide the design of a robust training dataset, combining multiple chemical toolkits (Indigo, RDKit) to generate diverse styles.

Experimental Setup and Ablation Studies

The authors performed two types of experiments: Factor Comparison (ablations) and Benchmarking.

Factor Comparisons: They evaluated how performance is affected by:

  • Stereochemistry (SI): Comparing models trained on data with and without stereochemical information.
  • Molecular Complexity (MC): Analyzing performance across 5 molecular weight intervals.
  • Data Volume (DV): Training on datasets ranging from 0.64 million to 10 million images.
  • Pre-trained Models (PTMs): Comparing 8 different backbones (e.g., ResNet, VGG, Inception, MobileNet) versus a base CNN.

Benchmarking:

  • Baselines: OSRA, MolVec, Imago (rule-based); Base CNN, DECIMER (deep learning).
  • Datasets: Four test sets (100k images each, except UOB): Uni-style, Multi-style, Noisy, and Real-world (UOB dataset).
  • Metrics: Sequence Accuracy (Exact Match), Levenshtein Distance (ALD), and Tanimoto Similarity (Fingerprint match).
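The three metrics can be sketched in plain Python. Fingerprints are represented here as sets of on-bit indices, and all function names are illustrative rather than taken from the paper's code.

```python
def sequence_accuracy(preds, refs):
    """Exact-match rate between predicted and reference SMILES strings."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def levenshtein(a, b):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 1.0
```

For example, `levenshtein("CCO", "CC=O")` is 1 (one inserted `=`), which a Tanimoto comparison of the two molecules' fingerprints would penalize much more heavily, since a single character can change the whole structure.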

Results and Core Insights

MICER achieved 97.54% Sequence Accuracy on uni-style data and 82.33% on the real-world UOB dataset, far ahead of the strongest baselines (OSRA scored approximately 23% on uni-style; DECIMER scored roughly 21% on UOB). ResNet101 was identified as the most effective encoder (87.58% accuracy in preliminary tests), outperforming both deeper (DenseNet) and lighter (MobileNet) networks. Performance saturates around 6 million training samples. Including stereochemical information reduces accuracy by nearly 6%, indicating that wedge and dash bonds are harder to recognize. Visualized attention maps show the model correctly attending to the pixels of specific atoms (e.g., 'S' or 'Cl') when generating the corresponding characters.

Reproducibility Details

Data

The training data was curated from the ZINC20 database.

Preprocessing:

  • Filtering: Removed organometallics, mixtures, and invalid molecules.
  • Standardization: SMILES were canonicalized and de-duplicated.
  • Generation: Images generated using Indigo and RDKit toolkits to vary styles.

Dataset Size:

  • Total: 10 million images selected for the final model.
  • Composition: 6 million “default style” (Indigo) + 4 million “multi-style” (Indigo + RDKit).
  • Splits: 8:1:1 ratio for Training/Validation/Test.
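The 8:1:1 partition can be sketched as a seeded shuffle-and-slice; the function name and seed are illustrative, not from the paper.

```python
import random

def split_811(items, seed=0):
    """Shuffle and split into 8:1:1 train/validation/test, as in the paper's setup."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

On 10 million images this yields 8M/1M/1M examples for training, validation, and testing.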

Vocabulary: A dictionary of 39 SMILES characters plus three special tokens ([pad], [sos], [eos]). The characters are the digits 0-9; the atom characters C, l, c, O, N, n, F, H, o, S, s, B, r, I, i, P, p; and the symbols (, ), [, ], @, =, #, /, -, +, \, %. Note that two-letter atom symbols such as 'Br' and 'Cl' are tokenized as two separate characters ('B' + 'r', 'C' + 'l').
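The vocabulary above can be sketched as a small character-level tokenizer; the token inventory follows the paper, while the names (`encode`, `decode`, `STOI`) are illustrative.

```python
# Illustrative character-level SMILES tokenizer (names are not from the paper).
SPECIAL = ["[pad]", "[sos]", "[eos]"]
CHARS = list("0123456789") + list("ClcONnFHoSsBrIiPp") + list("()[]@=#/-+\\%")
VOCAB = SPECIAL + CHARS          # 3 special tokens + 39 characters
STOI = {t: i for i, t in enumerate(VOCAB)}

def encode(smiles: str) -> list[int]:
    """Map a SMILES string to token ids, framed by [sos]/[eos]."""
    return [STOI["[sos]"]] + [STOI[ch] for ch in smiles] + [STOI["[eos]"]]

def decode(ids: list[int]) -> str:
    """Invert encode(), dropping the special tokens."""
    return "".join(VOCAB[i] for i in ids if VOCAB[i] not in SPECIAL)
```

Note that `encode("Br")` produces two ids (for 'B' and 'r'); reassembling them into a two-letter atom symbol is left entirely to the decoder.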

Algorithms

  • Tokenization: Character-level tokenization (not atom-level); the model learns to assemble 'C' and 'l' into 'Cl'.
  • Attention Mechanism: Uses a soft attention mechanism in which the decoder computes an attention score between the encoder's feature map ($8 \times 8 \times 512$) and the current hidden state $b_t$: $$\text{att\_score} = \text{softmax}\big(L_a(\tanh(L_f(F) + L_b(b_t)))\big)$$
  • Training Configuration:
    • Loss Function: Cross-entropy loss
    • Optimizer: Adam optimizer
    • Learning Rate: 2e-5
    • Batch Size: 256
    • Epochs: 15
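The soft-attention score can be sketched in numpy. The shapes follow the paper ($64 \times 512$ flattened feature matrix), while $L_f$, $L_b$, $L_a$ stand in for the learned linear layers and are random matrices here purely for illustration, as is the attention dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

F = rng.normal(size=(64, 512))   # encoder feature matrix: 64 spatial positions x 512 channels
b_t = rng.normal(size=(512,))    # decoder hidden state at step t
d_att = 256                      # attention dimension (illustrative choice)

# Random stand-ins for the learned linear maps L_f, L_b, L_a in the paper's formula.
L_f = rng.normal(size=(512, d_att))
L_b = rng.normal(size=(512, d_att))
L_a = rng.normal(size=(d_att, 1))

# att_score = softmax(L_a(tanh(L_f(F) + L_b(b_t))))
scores = softmax((np.tanh(F @ L_f + b_t @ L_b) @ L_a).ravel())  # one weight per position
context = scores @ F  # attention-weighted sum of features, fed to the LSTM cell
```

The 64 scores sum to 1 and weight each spatial position's feature vector, which is how the decoder focuses on individual atoms and bonds while emitting each SMILES character.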

Models

Encoder:

  • Backbone: Pre-trained ResNet101 (trained on ImageNet).
  • Modifications: The final layer is removed to output a Feature Map of size $8 \times 8 \times 512$.
  • Flattening: Reshaped to a $64 \times 512$ feature matrix for the decoder.
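The flattening step is a plain reshape of the spatial grid into a sequence of position vectors, e.g.:

```python
import numpy as np

feature_map = np.zeros((8, 8, 512))            # encoder output: height x width x channels
feature_matrix = feature_map.reshape(-1, 512)  # 64 spatial positions x 512 features
```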

Decoder:

  • Type: Long Short-Term Memory (LSTM) with Attention.
  • Dropout: 0.3 applied to minimize overfitting.

Overall, the encoder passes the image through an initial convolutional (pilot) block, a max-pooling layer, and four stages of convolutional blocks (CB); the resulting feature matrix feeds the attention LSTM decoder.

Evaluation

Metrics:

  • SA (Sequence Accuracy): Strict exact match of SMILES strings.
  • ALD (Average Levenshtein Distance): Edit distance for character-level error analysis.
  • AMFTS: Average Tanimoto similarity of ECFP4 (Morgan) fingerprints between predicted and ground-truth structures, measuring structural similarity.

Test Sets:

  • Uni-style: 100,000 images (Indigo default).
  • Multi-style: 100,000 images (>10 styles).
  • Noisy: 100,000 images with noise added.
  • UOB: 5,575 real-world images from literature.

Hardware

  • Compute: 4 x NVIDIA Tesla V100 GPUs
  • Training Time: Approximately 42 hours for the final model

Citation

@article{yiMICERPretrainedEncoder2022,
  title = {{{MICER}}: A Pre-Trained Encoder--Decoder Architecture for Molecular Image Captioning},
  shorttitle = {{{MICER}}},
  author = {Yi, Jiacai and Wu, Chengkun and Zhang, Xiaochen and Xiao, Xinyi and Qiu, Yanlong and Zhao, Wentao and Hou, Tingjun and Cao, Dongsheng},
  year = {2022},
  month = sep,
  journal = {Bioinformatics},
  volume = {38},
  number = {19},
  pages = {4562--4572},
  issn = {1367-4811},
  doi = {10.1093/bioinformatics/btac545}
}