Paper Information
Citation: Yi, J., Wu, C., Zhang, X., Xiao, X., Qiu, Y., Zhao, W., Hou, T., & Cao, D. (2022). MICER: a pre-trained encoder-decoder architecture for molecular image captioning. Bioinformatics, 38(19), 4562-4572. https://doi.org/10.1093/bioinformatics/btac545
Publication: Bioinformatics 2022
MICER’s Contribution to Optical Structure Recognition
This is a Method paper according to the AI for Physical Sciences taxonomy. It proposes MICER, an encoder-decoder architecture that integrates transfer learning (fine-tuning pre-trained models) and attention mechanisms for Optical Chemical Structure Recognition (OCSR). The study includes rigorous benchmarking comparing MICER against three rule-based tools (OSRA, MolVec, Imago) and existing deep learning methods (DECIMER). The authors conduct extensive factor comparison experiments to isolate the effects of stereochemistry, molecular complexity, data volume, and encoder backbone choices.
The Challenge of Generalizing in OCSR
Chemical structures in the scientific literature are valuable for drug discovery, but they are locked in image formats that are difficult to mine automatically. Traditional OCSR tools (such as OSRA) rely on hand-crafted rules and expert knowledge; they are brittle, struggle with stylistic variation, and generalize poorly. Deep learning has been applied before (e.g., DECIMER), but earlier attempts often used frozen pre-trained feature extractors (no fine-tuning) or otherwise failed to fully exploit transfer learning, yielding suboptimal performance. The goal of this work is an end-to-end “image captioning” system that translates molecular images directly into SMILES strings, with no intermediate segmentation steps.
Integrating Fine-Tuning and Attention for Chemistry
The core novelty lies in the specific architectural integration of transfer learning with fine-tuning for the chemical domain. Unlike DECIMER, which used a frozen network, MICER fine-tunes a pre-trained ResNet on molecular images. This allows the encoder to adapt from general object recognition to specific chemical feature extraction.
The model incorporates an attention mechanism into the LSTM decoder, allowing the model to focus on specific image regions (atoms and bonds) when generating each character of the SMILES string. The paper explicitly analyzes “intrinsic features” of molecular data (stereochemistry, complexity) to guide the design of the training dataset, combining multiple chemical toolkits (Indigo, RDKit) to generate diverse styles.
Experimental Setup and Ablation Studies
The authors performed two types of experiments: Factor Comparison (ablations) and Benchmarking.
Factor Comparisons: They evaluated how performance is affected by:
- Stereochemistry (SI): Comparing models trained on data with and without stereochemical information.
- Molecular Complexity (MC): Analyzing performance across 5 molecular weight intervals.
- Data Volume (DV): Training on datasets ranging from 0.64 million to 10 million images.
- Pre-trained Models (PTMs): Comparing 8 different backbones (e.g., ResNet, VGG, Inception, MobileNet) versus a base CNN.
Benchmarking:
- Baselines: OSRA, MolVec, Imago (rule-based); Base CNN, DECIMER (deep learning).
- Datasets: Four test sets (100k images each, except UOB): Uni-style, Multi-style, Noisy, and Real-world (UOB dataset).
- Metrics: Sequence Accuracy (Exact Match), Levenshtein Distance (ALD), and Tanimoto Similarity (Fingerprint match).
Results and Core Insights
MICER achieved 97.54% Sequence Accuracy on uni-style data and 82.33% on the real-world UOB dataset, outperforming rule-based and deep learning baselines across all four test sets.
| Dataset | Method | SA (%) | AMFTS (%) |
|---|---|---|---|
| Uni-style | OSRA | 23.14 | 56.83 |
| Uni-style | DECIMER | 35.32 | 86.92 |
| Uni-style | MICER | 97.54 | 99.74 |
| Multi-style | OSRA | 15.68 | 44.50 |
| Multi-style | MICER | 95.09 | 99.28 |
| Noisy | MICER | 94.95 | 99.25 |
| UOB (real-world) | OSRA | 80.24 | 91.17 |
| UOB (real-world) | DECIMER | 21.75 | 65.15 |
| UOB (real-world) | MICER | 82.33 | 94.47 |
ResNet101 was identified as the most effective encoder (87.58% SA in preliminary tests on 0.8M images), outperforming both deeper (DenseNet121, 81.41%) and lighter (MobileNetV2, 39.83%) networks. Performance saturates around 6 million training samples, reaching 98.84% SA. Including stereochemical information lowers accuracy by roughly 6.1 percentage points (from 87.61% to 81.50%), indicating that wedge and dash bonds are harder to recognize. Attention-map visualizations show the model attending to the correct image regions (e.g., the ‘S’ or ‘Cl’ pixels) when generating the corresponding character.
Limitations
The authors acknowledge several limitations. MICER struggles with superatoms, R-groups, text labels, and uncommon atoms (e.g., Sn) that were not seen during training. On noisy data, noise spots near Cl atoms can cause misclassification as O atoms. Complex molecular images with noise lead to misrecognition of noise points as single bonds and wedge-shaped bonds as double bonds. All methods, including MICER, have substantial room for improvement on real-world datasets that contain these challenging elements.
Reproducibility Details
Data
The training data was curated from the ZINC20 database.
Preprocessing:
- Filtering: Removed organometallics, mixtures, and invalid molecules.
- Standardization: SMILES were canonicalized and de-duplicated.
- Generation: Images generated using Indigo and RDKit toolkits to vary styles.
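The filtering and de-duplication steps above can be sketched in plain Python. This is an illustrative stand-in only: in the paper, canonicalization and validity checks are done with chemistry toolkits (RDKit/Indigo), and the metal list and string rules below are simplified assumptions.

```python
# Illustrative preprocessing sketch: filter and de-duplicate SMILES strings.
# Real canonicalization would use a toolkit (e.g. RDKit); the metal list and
# string-based rules here are simplified assumptions for demonstration.

METAL_SYMBOLS = {"Sn", "Fe", "Zn", "Cu", "Mg", "Na", "K", "Li"}  # illustrative subset

def is_organometallic(smiles: str) -> bool:
    """Crude check: does the SMILES contain a bracketed metal atom?"""
    return any(f"[{m}" in smiles for m in METAL_SYMBOLS)

def is_mixture(smiles: str) -> bool:
    """Dot-disconnected SMILES denote multi-component records (salts, mixtures)."""
    return "." in smiles

def clean_dataset(smiles_list):
    """Drop organometallics and mixtures, then de-duplicate preserving order."""
    seen, kept = set(), []
    for smi in smiles_list:
        if is_organometallic(smi) or is_mixture(smi):
            continue
        if smi not in seen:
            seen.add(smi)
            kept.append(smi)
    return kept
```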
Dataset Size:
- Total: 10 million images selected for the final model.
- Composition: 6 million “default style” (Indigo) + 4 million “multi-style” (Indigo + RDKit).
- Splits: 8:1:1 ratio for Training/Validation/Test.
Vocabulary: A token dictionary of 42 entries: 3 special tokens ([pad], [sos], [eos]) plus 39 SMILES characters: 0–9, C, l, c, O, N, n, F, H, o, S, s, B, r, I, i, P, p, (, ), [, ], @, =, #, /, -, +, \, %. Two-letter atoms are split into single characters: ‘Br’ is tokenized as ‘B’ + ‘r’, and ‘Cl’ as ‘C’ + ‘l’.
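A minimal character-level tokenizer matching this scheme, sketched below. Function names, the index assignment, and the padding behavior are illustrative assumptions, not the authors’ implementation.

```python
# Minimal character-level SMILES tokenizer, following the vocabulary scheme
# described above: single characters plus [pad]/[sos]/[eos] special tokens.
# Index assignment and padding behavior are illustrative assumptions.

SPECIALS = ["[pad]", "[sos]", "[eos]"]

def build_vocab(smiles_corpus):
    """Collect every character seen in the corpus and index it after the specials."""
    chars = sorted({ch for smi in smiles_corpus for ch in smi})
    return {tok: i for i, tok in enumerate(SPECIALS + chars)}

def encode(smiles: str, vocab, max_len: int):
    """Map a SMILES string to padded token ids: [sos], chars..., [eos], [pad]..."""
    ids = [vocab["[sos]"]] + [vocab[ch] for ch in smiles] + [vocab["[eos]"]]
    ids += [vocab["[pad]"]] * (max_len - len(ids))
    return ids[:max_len]
```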
Algorithms
- Tokenization: Character-level tokenization (not atom-level); the model learns to assemble ‘C’ and ‘l’ into ‘Cl’.
- Attention Mechanism: A soft attention mechanism in which the decoder scores each position of the encoder’s feature map ($8 \times 8 \times 512$, flattened to $64 \times 512$) against the current hidden vector $b_t$: $$ \text{att\_score} = \operatorname{softmax}\big(L_a(\tanh(L_f(F) + L_b(b_t)))\big) $$ where $F$ is the flattened feature matrix and $L_f$, $L_b$, $L_a$ are learned linear layers.
- Training Configuration:
- Loss Function: Cross-entropy loss
- Optimizer: Adam optimizer
- Learning Rate: 2e-5
- Batch Size: 256
- Epochs: 15
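The attention score above can be sketched in numpy. The attention width (`att_dim`) and the random weight matrices standing in for the learned layers $L_f$, $L_b$, $L_a$ are illustrative assumptions; only the shapes (64 positions, 512 features) come from the paper.

```python
import numpy as np

# Sketch of the soft-attention score from the formula above:
#   att_score = softmax(L_a(tanh(L_f(F) + L_b(b_t))))
# Random weights and att_dim are illustrative; shapes follow the paper.

rng = np.random.default_rng(0)
num_pixels, feat_dim, hid_dim, att_dim = 64, 512, 512, 256

F   = rng.standard_normal((num_pixels, feat_dim))      # flattened feature matrix
b_t = rng.standard_normal(hid_dim)                     # decoder hidden state at step t
W_f = rng.standard_normal((feat_dim, att_dim)) * 0.01  # stands in for L_f
W_b = rng.standard_normal((hid_dim, att_dim)) * 0.01   # stands in for L_b
w_a = rng.standard_normal(att_dim) * 0.01              # stands in for L_a (scalar per pixel)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores  = np.tanh(F @ W_f + b_t @ W_b) @ w_a  # one score per spatial position, shape (64,)
att     = softmax(scores)                     # attention weights over the 64 positions
context = att @ F                             # attended feature vector, shape (512,)
```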
Models
Encoder:
- Backbone: Pre-trained ResNet101 (trained on ImageNet).
- Modifications: The final layer is removed to output a Feature Map of size $8 \times 8 \times 512$.
- Flattening: Reshaped to a $64 \times 512$ feature matrix for the decoder.
Decoder:
- Type: Long Short-Term Memory (LSTM) with Attention.
- Dropout: 0.3 applied to minimize overfitting.
Architecturally, the encoder consists of a pilot network (for universal feature extraction), a max-pooling layer, and stacked feature-extraction layers built from convolutional blocks (CBs); its output feeds the attention-equipped LSTM decoder.
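The flattening step can be sketched with numpy (shapes from the paper; the synthetic array is a stand-in for a real encoder output):

```python
import numpy as np

# The encoder's 8 x 8 x 512 feature map is reshaped into a 64 x 512 matrix:
# each of the 64 rows is one spatial position ("pixel") that the decoder's
# attention can weight when emitting a SMILES character.

feature_map = np.arange(8 * 8 * 512, dtype=np.float32).reshape(8, 8, 512)
feature_matrix = feature_map.reshape(-1, 512)  # -> (64, 512)

# With row-major reshaping, row k of the matrix is grid cell (k // 8, k % 8).
```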
Evaluation
Metrics:
- SA (Sequence Accuracy): Strict exact match of SMILES strings.
- ALD (Average Levenshtein Distance): Edit distance for character-level error analysis.
- AMFTS: Average Tanimoto similarity of ECFP4 fingerprints between predicted and ground-truth molecules, measuring structural (rather than string-level) similarity.
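The three metrics can be sketched in plain Python. Fingerprints are modeled here as sets of integer bit indices; in the actual evaluation they are ECFP4 fingerprints computed with a chemistry toolkit.

```python
# Stdlib sketches of the three evaluation metrics. Fingerprints are modeled
# as sets of bit indices; real evaluation uses ECFP4 fingerprints (RDKit).

def sequence_accuracy(preds, refs):
    """SA: fraction of predictions that exactly match the reference SMILES."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming (basis of ALD)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```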
Test Sets:
- Uni-style: 100,000 images (Indigo default).
- Multi-style: 100,000 images (>10 styles).
- Noisy: 100,000 images with noise added.
- UOB: 5,575 real-world images from literature.
Hardware
- Compute: 4 x NVIDIA Tesla V100 GPUs
- Training Time: Approximately 42 hours for the final model
Artifacts
| Artifact | Type | License | Notes |
|---|---|---|---|
| MICER | Code | MIT | Official implementation |
The training data (generated from ZINC20) and pre-trained model weights are not publicly released. The repository contains code but has minimal documentation (2 commits, no description).
Citation
@article{yiMICERPretrainedEncoder2022,
title = {{{MICER}}: A Pre-Trained Encoder--Decoder Architecture for Molecular Image Captioning},
shorttitle = {{{MICER}}},
author = {Yi, Jiacai and Wu, Chengkun and Zhang, Xiaochen and Xiao, Xinyi and Qiu, Yanlong and Zhao, Wentao and Hou, Tingjun and Cao, Dongsheng},
year = {2022},
month = sep,
journal = {Bioinformatics},
volume = {38},
number = {19},
pages = {4562--4572},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btac545}
}
