Paper Information

Citation: Campos, D., & Ji, H. (2021). IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System (No. arXiv:2109.04202). arXiv. https://doi.org/10.48550/arXiv.2109.04202

Publication: arXiv preprint (2021)

Contributions & Taxonomy

This is both a Method and Resource paper:

  • Method: It adapts standard image captioning architectures (encoder-decoder) to the domain of Optical Chemical Structure Recognition (OCSR), treating molecule recognition as a translation task.
  • Resource: It introduces MOLCAP, a large-scale dataset of 81 million molecules aggregated from public chemical databases, addressing the data scarcity that previously hindered deep learning approaches to OCSR.

The Bottleneck in Chemical Literature Translation

Chemical literature is “full of recipes written in a language computers cannot understand” because molecules are depicted as 2D images. This creates a fundamental bottleneck:

  • The Problem: Chemists must manually redraw molecular structures to search for related compounds or reactions. This is slow, error-prone, and makes large-scale literature mining impossible.
  • Existing Tools: Legacy systems like OSRA (Optical Structure Recognition Application) rely on handcrafted rules and often require human correction, making them unfit for unsupervised, high-throughput processing.
  • The Goal: An automated system that can translate structure images directly to machine-readable strings (SMILES/SELFIES) without human supervision, enabling large-scale knowledge extraction from decades of chemistry literature and patents.

Core Innovation: SELFIES and Image Captioning

The core novelty is demonstrating that how you represent the output text is as important as the model architecture itself. Key contributions:

  1. Image Captioning Framework: Applies modern encoder-decoder architectures (ResNet-101 + Transformer) to OCSR, treating it as an image-to-text translation problem with a standard cross-entropy loss objective over the generation sequence: $$ \mathcal{L} = -\sum\limits_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x) $$

  2. SELFIES as Target Representation: The key mechanism relies on using SELFIES (Self-Referencing Embedded Strings) as the output format. SELFIES is based on a formal grammar where every possible string corresponds to a valid molecule, eliminating the syntactic invalidity problems (unmatched parentheses, invalid characters) that plague SMILES generation.

  3. MOLCAP Dataset: Created a comprehensive dataset of 81 million unique molecules from PubChem, ChEMBL, GDB13, and other sources. Generated 256x256 pixel images using RDKit for 1 million training samples and 5,000 validation samples.

  4. Task-Specific Evaluation: Demonstrated that traditional NLP metrics (BLEU) are poor indicators of scientific utility. Introduced evaluation based on molecular fingerprints (MACCS, RDK, Morgan) and Tanimoto similarity: $$ T(a, b) = \frac{c}{a + b - c} $$ where $c$ is the number of common fingerprint bits, and $a$ and $b$ are the number of set bits in each respective molecule’s fingerprint. This formulation reliably measures functional chemical similarity.
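The Tanimoto formula above can be computed in a few lines of plain Python over fingerprint bit sets (the bit positions below are toy values for illustration, not real MACCS keys):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity T(a, b) = c / (a + b - c) over fingerprint bit sets."""
    common = len(fp_a & fp_b)               # c: bits set in both fingerprints
    union = len(fp_a) + len(fp_b) - common  # a + b - c: total distinct set bits
    return common / union if union else 0.0

# Toy fingerprints: which bit positions are "on" for each molecule.
fp_pred = {1, 4, 7, 9, 12}   # predicted molecule
fp_true = {1, 4, 7, 12, 15}  # ground-truth molecule
print(tanimoto(fp_pred, fp_true))  # 4 common bits / 6 distinct bits ≈ 0.667
```

In practice the paper's metrics are computed from RDKit-generated MACCS, RDK, and Morgan fingerprints, but the similarity arithmetic is exactly this.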

Experimental Setup and Ablation Studies

The evaluation focused on comparing IMG2SMI to existing systems and identifying which design choices matter most:

  1. Baseline Comparisons: Benchmarked against OSRA (rule-based system) and DECIMER (first deep learning approach) on the MOLCAP dataset to establish whether modern architectures could surpass traditional methods.

  2. Ablation Studies: Extensive ablations isolating key factors:

    • Decoder Architecture: Transformer vs. RNN/LSTM decoders
    • Encoder Fine-tuning: Fine-tuned vs. frozen pre-trained ResNet weights
    • Output Representation: SELFIES vs. character-level SMILES vs. BPE-tokenized SMILES (the most critical ablation)

| Configuration | MACCS FTS | Valid Captions |
|---|---|---|
| RNN + Fixed Encoder | 0.1526 | N/A |
| RNN + Fine-tuned Encoder | 0.4180 | N/A |
| Transformer + Fixed Encoder | 0.7674 | 61.1% |
| Transformer + Fine-tuned Encoder | 0.9475 | 99.4% |
| Character-level SMILES (fine-tuned) | N/A | 2.1% |
| BPE SMILES (2000 vocab, fine-tuned) | N/A | 20.0% |
| SELFIES (fine-tuned) | 0.9475 | 99.4% |

  3. Metric Analysis: Systematic comparison of evaluation metrics including BLEU, ROUGE, Levenshtein distance, exact match accuracy, and molecular fingerprint-based similarity measures.

Results, Findings, and Limitations

Performance Gains:

| Metric | IMG2SMI | OSRA | DECIMER | Random Baseline |
|---|---|---|---|---|
| MACCS FTS | 0.9475 | 0.3600 | 0.0000 | 0.3378 |
| RDK FTS | 0.9020 | 0.2790 | 0.0000 | 0.2229 |
| Morgan FTS | 0.8707 | 0.2677 | 0.0000 | 0.1081 |
| ROUGE | 0.6240 | 0.0684 | 0.0000 | 0.0422 |
| Exact Match | 7.24% | 0.04% | 0.00% | 0.00% |
| Valid Captions | 99.4% | 65.2% | N/A | N/A |

  • 163% improvement over OSRA on MACCS Tanimoto similarity.
  • Nearly 10x improvement on ROUGE scores (0.6240 vs. 0.0684).
  • Average Tanimoto similarity exceeds 0.85 (functionally similar molecules even when not exact matches).

Key Findings:

  • SELFIES is Critical: Using SELFIES yields 99.4% valid molecules, compared to only ~2% validity for character-level SMILES.
  • Architecture Matters: Transformer decoder significantly outperforms RNN/LSTM approaches. Fine-tuning the ResNet encoder (vs. frozen weights) yields substantial performance gains (e.g., MACCS FTS: 0.7674 to 0.9475).
  • Metric Insights: BLEU is a poor metric for this task. Molecular fingerprint-based Tanimoto similarity is most informative because it measures functional chemical similarity.
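The validity gap can be illustrated with a toy syntactic check (a hypothetical helper, not the paper's code): SMILES requires balanced parentheses and paired ring-closure digits, so a character-level decoder can easily emit invalid strings, whereas every SELFIES string decodes to some molecule by construction.

```python
def smiles_syntax_ok(smiles: str) -> bool:
    """Toy check of two SMILES syntax rules: balanced parentheses and
    paired ring-closure digits. Real validation (e.g. RDKit parsing)
    enforces far more chemistry than this."""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            # Ring-closure digits must appear in matched pairs.
            if ch in open_rings:
                open_rings.remove(ch)
            else:
                open_rings.add(ch)
    return depth == 0 and not open_rings

print(smiles_syntax_ok("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: True
print(smiles_syntax_ok("CC(=O)Oc1ccccc1C(=O"))    # truncated generation: False
```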

Limitations:

  • Low Exact Match: Only 7.24% exact matches. The model captures the overarching functional groups and structure but misses fine details like exact double bond placement.
  • Complexity Bias: Trained on large molecules (average length >40 tokens), so it performs poorly on very simple structures where OSRA still excels.

Conclusion: The work shows that modern encoder-decoder architectures combined with valid-by-construction molecular representations (SELFIES) can outperform traditional rule-based systems by large margins on fingerprint-based similarity metrics. The system is useful for literature mining where functional similarity matters more than exact matches, though 7.24% exact match accuracy and poor performance on simple molecules indicate clear directions for future work.

Reproducibility Details

Models

Architecture: Image captioning system based on DETR (Detection Transformer) framework.

Visual Encoder:

  • Backbone: ResNet-101 pre-trained on ImageNet
  • Feature Extraction: output of the 4th convolutional block (convolutions only, no classification head)
  • Output: 2048-dimensional dense feature vector

Caption Decoder:

  • Type: Transformer encoder-decoder
  • Layers: 3 stacked encoder layers, 3 stacked decoder layers
  • Attention Heads: 8
  • Hidden Dimensions: 2048 (feed-forward networks)
  • Dropout: 0.1
  • Layer Normalization Epsilon: 1e-12
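
A minimal PyTorch sketch of a decoder with these hyperparameters (the names and the `d_model` width are my assumptions; the paper builds on DETR's implementation, not this exact code):

```python
import torch
import torch.nn as nn

# Hyperparameters listed above: 3+3 layers, 8 heads, 2048-dim FFN,
# dropout 0.1, layer-norm epsilon 1e-12.
d_model = 256  # assumed model width; the paper states the FFN width, not d_model

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=8, dim_feedforward=2048,
    dropout=0.1, layer_norm_eps=1e-12)
decoder_layer = nn.TransformerDecoderLayer(
    d_model=d_model, nhead=8, dim_feedforward=2048,
    dropout=0.1, layer_norm_eps=1e-12)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

# PyTorch's default layout is (sequence, batch, d_model).
image_feats = torch.randn(49, 2, d_model)   # e.g. a flattened 7x7 CNN feature map
token_embeds = torch.randn(20, 2, d_model)  # embedded SELFIES caption tokens
memory = encoder(image_feats)
out = decoder(token_embeds, memory)
print(out.shape)  # torch.Size([20, 2, 256])
```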

Training Configuration:

  • Optimizer: AdamW
  • Learning Rate: 5e-5 (selected after sweep from 1e-4 to 1e-6)
  • Weight Decay: 1e-4
  • Batch Size: 32
  • Epochs: 5
  • Codebase: Built on open-source DETR implementation

Data

MOLCAP Dataset:

| Property | Value | Notes |
|---|---|---|
| Total Size | 81,230,291 molecules | Aggregated from PubChem, ChEMBL, GDB13 |
| Training Split | 1,000,000 molecules | Randomly selected unique molecules |
| Validation Split | 5,000 molecules | Randomly selected for evaluation |
| Image Resolution | 256x256 pixels | Generated using RDKit |
| Median SELFIES Length | >45 characters | More complex than typical benchmarks |
| Full Dataset Storage | ~16.24 TB | Necessitated use of 1M subset |
| Augmentation | None | No cropping, rotation, or other augmentation |

Preprocessing:

  • Images generated using RDKit at 256x256 resolution
  • Molecules converted to canonical representations
  • SELFIES tokenization for model output

Evaluation

Primary Metrics:

| Metric | IMG2SMI Value | OSRA Baseline | Purpose |
|---|---|---|---|
| MACCS FTS | 0.9475 | 0.3600 | Fingerprint Tanimoto similarity (functional groups) |
| RDK FTS | 0.9020 | 0.2790 | RDKit fingerprint similarity |
| Morgan FTS | 0.8707 | 0.2677 | Morgan fingerprint similarity (circular) |
| ROUGE | 0.6240 | 0.0684 | Text overlap metric |
| Exact Match | 7.24% | 0.04% | Structural identity (strict) |
| Valid Captions | 99.4% | 65.2% | Syntactic validity (with SELFIES) |
| Levenshtein Distance | 21.13 | 32.76 | String edit distance (lower is better) |

Secondary Metrics (shown to be less informative for chemical tasks):

  • BLEU, ROUGE (better suited for natural language)
  • Levenshtein distance (doesn’t capture chemical similarity)
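
For reference, the Levenshtein distance reported above is the minimum number of single-character insertions, deletions, and substitutions between two strings; a standard dynamic-programming sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

# Two SMILES strings differing by a single ring atom (C vs. N):
print(levenshtein("c1ccccc1", "c1ccncc1"))  # 1
```

As the paper notes, a small edit distance between SMILES strings does not imply chemical similarity, which is why the fingerprint metrics are preferred.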

Hardware

  • GPU: Single NVIDIA GeForce RTX 2080 Ti
  • Training Time: ~5 hours per epoch, approximately 25 hours total for 5 epochs
  • Memory: Sufficient for batch size 32 with ResNet-101 + Transformer architecture

Artifacts

The paper mentions releasing both code and the MOLCAP dataset, but no public repository or download link has been confirmed as available.

| Artifact | Type | License | Notes |
|---|---|---|---|
| MOLCAP dataset | Dataset | Unknown | 81M molecules; claimed released but no public URL found |
| IMG2SMI code | Code | Unknown | Built on DETR; claimed released but no public URL found |

Citation

```bibtex
@article{campos2021img2smi,
  title={IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System},
  author={Campos, Daniel and Ji, Heng},
  journal={arXiv preprint arXiv:2109.04202},
  year={2021},
  doi={10.48550/arXiv.2109.04202}
}
```