Image-to-Sequence Models on Hunter Heidenreich | ML Research Scientist

MolSight: OCSR with RL and Multi-Granularity Learning

Fri, 19 Dec 2025 00:00:00 +0000

Contribution: A Framework for Optical Chemical Structure Recognition

This is primarily a Method paper. It proposes a novel three-stage training framework (Pretraining → Fine-tuning → RL Post-training) to improve Optical Chemical Structure Recognition (OCSR). Specifically, it introduces Group Relative Policy Optimization (GRPO) to optimize chemical correctness objectives that are non-differentiable and therefore cannot be trained with a standard supervised loss.

It also has a Resource component, as the authors construct and release Stereo-200k, a dataset specifically designed to train models on challenging stereoisomeric molecules.

Motivation: Resolving Stereochemical Cues

Existing OCSR systems struggle to accurately recognize stereochemical information (e.g., chirality, geometric isomerism) because the visual cues distinguishing stereoisomers (such as wedge and dash bonds) are subtle. Current methods often fail to capture the geometric relationships required to distinguish molecules with identical connectivity but different spatial arrangements. Accurate recognition is critical for downstream tasks like drug discovery where stereochemistry determines pharmacological effects.

Core Innovations: GRPO and Multi-Granularity Learning

MolSight introduces three key technical innovations:

Reinforcement Learning for OCSR: It is the first OCSR system to incorporate RL (specifically GRPO) to directly optimize for chemical semantic correctness.
Multi-Granularity Learning: It employs auxiliary heads for chemical bond classification and atom localization. Unlike previous approaches that optimize these jointly, MolSight decouples the coordinate head to prevent interference with SMILES generation.
SMILES-M Notation: A lightweight extension to SMILES to handle Markush structures (common in patents) without significant sequence length increase.

Experimental Methodology

The authors evaluated MolSight using a rigorous mix of real and synthetic benchmarks:

Baselines: Compared against rule-based (OSRA, MolVec, Imago) and deep learning methods (MolScribe, MolGrapher, DECIMER).
Benchmarks: Evaluated on real-world datasets (USPTO, Maybridge UoB, CLEF-2012, JPO) and synthetic datasets (Staker, ChemDraw, Indigo, Stereo-2k).
Ablation Studies: Tested the impact of the bond head, coordinate head, and RL stages separately.
Transfer Learning: Assessed the quality of learned representations by using the frozen encoder for molecular property prediction on MoleculeNet.

Results and Conclusions

SOTA Performance: MolSight achieved 85.1% stereochemical accuracy on the USPTO dataset, significantly outperforming the previous SOTA (MolScribe), which achieved 69.0%.
RL Effectiveness: Reinforcement learning post-training specifically improved performance on stereoisomers, raising Tanimoto similarity and exact match rates on the Stereo-2k test set.
Robustness: On perturbed USPTO images (random rotations and shearing), MolSight achieved 92.3% exact match accuracy (vs. the original 92.0%), while rule-based methods like OSRA dropped from 83.5% to 6.7%. On the low-resolution Staker dataset, MolSight reached 82.1% exact match.

Reproducibility Details

Data

The training pipeline uses three distinct data sources:

Pre-training: MolParser-7M. Contains diverse images but requires the SMILES-M extension to handle Markush structures.
Fine-tuning: PubChem-1M and USPTO-680K. Used for multi-granularity learning with bond and coordinate labels.
RL Post-training: Stereo-200k. A self-collected dataset from the first 2M compounds in PubChem, filtered for chirality (‘@’) and cis-trans isomerism (‘/’, ‘\’). It uses 5 different RDKit drawing styles to ensure robustness.
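
The stereo filter described for Stereo-200k can be sketched as a character test on the SMILES string. This is a minimal sketch, assuming the filter only looks for the marker characters named above; `has_stereo_markers` and the sample list are hypothetical and not taken from the MolSight code.

```python
def has_stereo_markers(smiles: str) -> bool:
    """Return True if the SMILES string carries chirality or cis-trans markers."""
    has_chirality = "@" in smiles                    # tetrahedral centres: @ or @@
    has_cis_trans = "/" in smiles or "\\" in smiles  # directional double-bond marks
    return has_chirality or has_cis_trans


# Illustrative inputs only (not drawn from the actual dataset).
candidates = [
    "CCO",              # ethanol: no stereo information
    "C[C@H](N)C(=O)O",  # an alanine enantiomer with one chiral centre
    "F/C=C/F",          # trans-1,2-difluoroethene
    "F/C=C\\F",         # cis-1,2-difluoroethene
]
stereo_subset = [s for s in candidates if has_stereo_markers(s)]
```

A string test like this is cheap enough to run over millions of records before any rendering happens.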

Algorithms

Reinforcement Learning: Uses GRPO (Group Relative Policy Optimization).
- Reward Function: A linear combination of Tanimoto similarity and a graded stereochemistry reward. $$ R = w_t \cdot r_{\text{tanimoto}} + w_s \cdot r_{\text{stereo}} $$ where $w_t=0.4$ and $w_s=0.6$. The stereochemistry reward $r_{\text{stereo}}$ is 1.0 for an InChIKey exact match, 0.3 if the atom count matches, and 0.1 otherwise.
- Sampling: Samples 4 completions per image with temperature 1.0 during RL training.
Auxiliary Tasks:
- Bond Classification: Concatenates hidden states of two atom queries to predict bond type via MLP.
- Atom Localization: Treated as a classification task (SimCC) but optimized using Maximum Likelihood Estimation (MLE) to account for uncertainty.
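
The reward and sampling setup above can be illustrated with a small sketch. The weights and the graded stereo reward follow the values in the text; the group-relative standardisation is the usual GRPO formulation and is assumed here rather than confirmed for MolSight. All function names are hypothetical, and the Tanimoto values and match flags are made-up inputs.

```python
from statistics import mean, pstdev

W_TANIMOTO = 0.4  # w_t from the text
W_STEREO = 0.6    # w_s from the text


def stereo_reward(inchikey_match: bool, atom_count_match: bool) -> float:
    """Graded stereochemistry reward: 1.0 / 0.3 / 0.1 as described above."""
    if inchikey_match:
        return 1.0
    if atom_count_match:
        return 0.3
    return 0.1


def total_reward(tanimoto: float, inchikey_match: bool, atom_count_match: bool) -> float:
    """R = w_t * r_tanimoto + w_s * r_stereo."""
    return W_TANIMOTO * tanimoto + W_STEREO * stereo_reward(inchikey_match, atom_count_match)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardise rewards within one group of completions (GRPO-style, assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one image (values are invented for illustration).
group = [
    total_reward(1.00, True, True),    # InChIKey exact match
    total_reward(0.90, False, True),   # atom count matches, InChIKey differs
    total_reward(0.50, False, False),
    total_reward(0.20, False, False),
]
advantages = group_relative_advantages(group)
```

Because the advantage is computed relative to the other completions for the same image, no separate value network is needed, which is one reason GRPO is attractive for small post-training budgets.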

Models

Architecture: Encoder-Decoder Transformer. Input images are preprocessed to $512 \times 512$ resolution.
- Encoder: EfficientViT-L1 (~53M params), chosen for linear attention efficiency.
- Decoder: 6-layer Transformer with RoPE, SwiGLU, and RMSNorm. Randomly initialized (no LLM weights) due to vocabulary mismatch.
- Coordinate Head: Separated from the main decoder. It adds 2 extra Transformer layers to process atom queries before prediction to improve accuracy.
Parameter Tuning:
- Stage 3 (RL) uses LoRA (Rank=8, Alpha=16) to optimize the decoder.

Evaluation

Metrics:
- Exact Match: Exact recognition accuracy for the full molecular structure.
- Tanimoto Coefficient: Fingerprint similarity for chemical semantics.
- OKS (Object Keypoint Similarity): Used specifically for evaluating atom localization accuracy.
Perturbation: Robustness tested with random rotations [-5°, 5°] and xy-shearing [-0.1, 0.1].
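
As a rough illustration of the perturbation ranges above, the sketch below samples a rotation and two shear factors and composes them into a 2x2 linear map. The composition order (shear first, then rotation) and all function names are assumptions; the paper's exact transform pipeline is not described in these notes.

```python
import math
import random


def sample_perturbation(rng: random.Random) -> tuple[float, float, float]:
    """Draw a rotation angle (degrees) and x/y shear factors from the stated ranges."""
    angle = rng.uniform(-5.0, 5.0)
    shear_x = rng.uniform(-0.1, 0.1)
    shear_y = rng.uniform(-0.1, 0.1)
    return angle, shear_x, shear_y


def affine_matrix(angle_deg: float, shear_x: float, shear_y: float) -> list[list[float]]:
    """2x2 matrix for shear followed by rotation (the order is an assumption)."""
    a = math.radians(angle_deg)
    rot = [[math.cos(a), -math.sin(a)], [math.sin(a), math.cos(a)]]
    shear = [[1.0, shear_x], [shear_y, 1.0]]
    # Matrix product rot @ shear: shear is applied first, then rotation.
    return [
        [sum(rot[i][k] * shear[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def transform_point(m: list[list[float]], x: float, y: float) -> tuple[float, float]:
    """Apply the 2x2 map to a point (for example an atom coordinate)."""
    return m[0][0] * x + m[0][1] * y, m[1][0] * x + m[1][1] * y
```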

Hardware

Compute: Training and inference performed on a single node.
Processors: Intel Xeon Silver 4210R CPU.
Accelerators: 4x NVIDIA GeForce RTX 3090/4090 GPUs.
Hyperparameters:
- Stage 1: Batch size 512, LR $4 \times 10^{-4}$.
- Stage 2: Batch size 256, Bond head LR $4 \times 10^{-4}$, Coord head LR $4 \times 10^{-5}$.
- Stage 3 (RL): Batch size 64, Base LR $1 \times 10^{-4}$.

Artifacts

Artifact	Type	License	Notes
MolSight (GitHub)	Code	Apache-2.0	Official PyTorch implementation with training and inference code

Paper Information

Citation: Zhang, W., Wang, X., Feng, B., & Liu, W. (2025). MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026). https://doi.org/10.48550/arXiv.2511.17300

Publication: AAAI 2026

Additional Resources:

Official Repository

@inproceedings{zhang2025molsight,
      title={MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning},
      author={Wenrui Zhang and Xinggang Wang and Bin Feng and Wenyu Liu},
      year={2025},
      booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
      eprint={2511.17300},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.17300},
}

MMSSC-Net: Multi-Stage Sequence Cognitive Networks

Fri, 19 Dec 2025 00:00:00 +0000

Contribution: A Multi-Stage Architectural Pipeline

Methodological Paper ($\Psi_{\text{Method}}$). The paper proposes a deep learning architecture (MMSSC-Net) for Optical Chemical Structure Recognition (OCSR). It focuses on architectural innovation, specifically combining a SwinV2 visual encoder with a GPT-2 decoder, and validates this method through extensive benchmarking against existing rule-based and deep-learning baselines. It includes ablation studies to justify the choice of the visual encoder.

Motivation: Addressing Noise and Rigid Image Recognition

Data Usage Gap: Drug discovery relies heavily on scientific literature, but molecular structures are often locked in vector graphics or images that computers cannot easily process.
Limitations of Prior Work: Existing rule-based methods are rigid and sensitive to noise. Previous deep learning approaches (Encoder-Decoder “Image Captioning” styles) often lack precision and interpretability, and they struggle with varying image resolutions and large molecules.
Need for “Cognition”: The authors argue that treating the image as a single isolated whole is insufficient; a model needs to “perceive” fine-grained details (atoms and bonds) to handle noise and varying pixel qualities effectively.

Novelty: A Fine-Grained Perception Pipeline

Multi-Stage Cognitive Architecture: MMSSC-Net splits the task into stages:
1. Fine-grained Perception: Detecting atom and bond sequences (including spatial coordinates) using SwinV2.
2. Graph Construction: Assembling these into a molecular graph.
3. Sequence Evolution: Converting the graph into a machine-readable format (SMILES).
Hybrid Transformer Model: It combines a hierarchical vision transformer (SwinV2) for encoding with a generative pre-trained transformer (GPT-2) and MLPs for decoding atomic and bond targets.
Robustness Mechanisms: The inclusion of random noise sequences during training to improve generalization to new molecular targets.

Methodology and Benchmarks

Baselines: Compared against 8 other tools:
- Rule-based: MolVec, OSRA.
- Image-Smiles (DL): ABC-Net, Img2Mol, MolMiner.
- Image-Graph-Smiles (DL): Image-To-Graph, MolScribe, ChemGrapher.
Datasets: Evaluated on 5 diverse datasets: STAKER (synthetic), USPTO, CLEF, JPO, and UOB (real-world).
Metrics:
- Accuracy: Exact string match of the predicted SMILES.
- Tanimoto Similarity: Chemical similarity using Morgan fingerprints.
Ablation Study: Tested different visual encoders (Swin Transformer, ViT-B, ResNet-50) to validate the choice of SwinV2.
Resolution Sensitivity: Tested model performance across image resolutions from 256px to 2048px.

Results and Core Outcomes

Strong Performance: MMSSC-Net achieved 75-98% accuracy across datasets, outperforming baselines on most benchmarks. Accuracy exceeded 94% on the first three datasets, which span in-domain and real-world data.
Resolution Robustness: The model maintained relatively stable accuracy across varying image resolutions, whereas baselines like Img2Mol showed greater sensitivity to resolution changes (Fig. 4 in the paper).
Efficiency: The SwinV2 encoder was noted to be more efficient than ViT-B in this context.
Limitations: The model struggles with stereochemistry, specifically confusing dashed wedge bonds with solid wedge bonds and misclassifying single bonds as solid wedge bonds. It also has difficulty with “irrelevant text” noise (e.g., unexpected symbols in JPO and DECIMER datasets).

Reproducibility Details

Data

The model was trained on a combination of PubChem and USPTO data, augmented to handle visual variability.

Purpose	Dataset	Size	Notes
Training	PubChem	1,000,000	Converted from InChI to SMILES; random sampling.
Training	USPTO	600,000	Patent images; converted from MOL to SMILES.
Evaluation	STAKER	40,000	Synthetic; Avg res $256 \times 256$.
Evaluation	USPTO	4,862	Real; Avg res $721 \times 432$.
Evaluation	CLEF	881	Real; Avg res $1245 \times 412$.
Evaluation	JPO	380	Real; Avg res $614 \times 367$.
Evaluation	UOB	5,720	Real; Avg res $759 \times 416$.

Augmentation:

Image: Random perturbations using RDKit/Indigo (rotation, filling, cropping, bond thickness/length, font size, Gaussian noise).
Molecular: Introduction of functional group abbreviations and R-substituents (dummy atoms) using SMARTS templates.

Algorithms

Target Sequence Formulation: The model predicts a sequence containing bounding box coordinates and type labels: $\{y_{\text{min}}, x_{\text{min}}, y_{\text{max}}, x_{\text{max}}, C_{n}\}$.
Loss Function: Cross-entropy loss with maximum likelihood estimation. $$ \max \sum_{i=1}^{N} \sum_{j=1}^{L} \omega_{j} \log P(t_{j}^{i} \mid x_{1}^{i}, x_{2}^{i}, \dots, x_{M}^{i}, t_{1}^{i}, \dots, t_{j-1}^{i}) $$
Noise Injection: A random sequence $T_r$ is appended to the target sequence during training to improve generalization to new goals.
Graph Construction: Atoms ($v$) and bonds ($e$) are recognized separately; bonds are defined by connecting spatial atomic coordinates.
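
The sequence formulation, noise injection, and weighted objective above can be sketched together. This is an illustrative sketch only: it assumes coordinates are already quantised into integer bins, and the helper names, the number of noise objects, and the weight values are hypothetical rather than taken from the MMSSC-Net code.

```python
import math
import random


def flatten_targets(objects):
    """Flatten (y_min, x_min, y_max, x_max, class_id) tuples into one token list."""
    seq = []
    for y_min, x_min, y_max, x_max, class_id in objects:
        seq.extend([y_min, x_min, y_max, x_max, class_id])
    return seq


def append_noise(seq, n_noise_objects, n_bins, n_classes, rng):
    """Append a random sequence T_r of fake boxes to the target sequence."""
    noise = []
    for _ in range(n_noise_objects):
        noise.extend(rng.randrange(n_bins) for _ in range(4))  # random box coordinates
        noise.append(rng.randrange(n_classes))                 # random class label
    return seq + noise


def weighted_log_likelihood(token_probs, weights):
    """Sum over j of w_j * log P(t_j | image, t_<j) for a single sequence."""
    return sum(w * math.log(p) for p, w in zip(token_probs, weights))
```

Training maximises this quantity summed over all sequences; the per-token weights make it possible to treat real and noise tokens differently, although how the weights are set is not specified in these notes.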

Models

Encoder: Swin Transformer V2.
- Pre-trained on ImageNet-1K.
- Window size: $16 \times 16$.
- Parameters: 88M.
- Input resolution: $256 \times 256$.
- Features: Scaled cosine attention; log-space continuous position bias.
Decoder: GPT-2 + MLP.
- GPT-2: Used for recognizing atom types.
  - Layers: 24.
  - Attention Heads: 12.
  - Hidden Dimension: 768.
  - Dropout: 0.1.
- MLP: Used for classifying bond types (single, double, triple, aromatic, solid wedge, dashed wedge).
Vocabulary:
- Standard: 95 common numbers/characters ([0], [C], [=], etc.).
- Extended: 2000 SMARTS-based characters for isomers/groups (e.g., “[C2F5]”, “[halo]”).

Evaluation

Metrics:

Accuracy: Exact match of the generated SMILES string.
Tanimoto Similarity: Similarity of Morgan fingerprints between predicted and ground truth molecules.

Key Results (Accuracy):

Dataset	MMSSC-Net	MolVec (Rule)	ABC-Net (DL)	MolScribe (DL)
Indigo	98.14	95.63	96.4	97.5
RDKit	94.91	86.7	98.3	93.8
USPTO	94.24	88.47	*	92.6
CLEF	91.26	81.61	*	86.9
UOB	92.71	81.32	96.1	87.9
Staker	89.44	4.49	*	86.9
JPO	75.48	66.8	*	76.2

Hardware

Training Configuration:
- Batch Size: 128.
- Learning Rate: $4 \times 10^{-5}$.
- Epochs: 40.
Inference Speed: The SwinV2 encoder demonstrated higher efficiency (faster inference time) compared to ViT-B and ResNet-50 baselines during ablation.

Reproducibility

Artifact	Type	License	Notes
MMSSCNet (GitHub)	Code	Unknown	Official implementation; includes training and prediction scripts

The paper is published in RSC Advances (open access). Source code is available on GitHub, though the repository has minimal documentation and no explicit license. The training data comes from PubChem (public) and USPTO (public patent data). Pre-trained model weights do not appear to be released. No specific GPU hardware or training time is reported in the paper.

Paper Information

Citation: Zhang, D., Zhao, D., Wang, Z., Li, J., & Li, J. (2024). MMSSC-Net: multi-stage sequence cognitive networks for drug molecule recognition. RSC Advances, 14(26), 18182-18191. https://doi.org/10.1039/D4RA02442G

Publication: RSC Advances 2024

@article{zhangMMSSCNetMultistageSequence2024,
  title = {MMSSC-Net: Multi-Stage Sequence Cognitive Networks for Drug Molecule Recognition},
  shorttitle = {MMSSC-Net},
  author = {Zhang, Dehai and Zhao, Di and Wang, Zhengwu and Li, Junhui and Li, Jin},
  year = 2024,
  journal = {RSC Advances},
  volume = {14},
  number = {26},
  pages = {18182--18191},
  publisher = {Royal Society of Chemistry},
  doi = {10.1039/D4RA02442G},
  url = {https://pubs.rsc.org/en/content/articlelanding/2024/ra/d4ra02442g}
}

Image2InChI: SwinTransformer for Molecular Recognition

Fri, 19 Dec 2025 00:00:00 +0000

Image2InChI as a Methodological Innovation

This is a Methodological Paper ($\Psi_{\text{Method}}$). It proposes a specific new deep learning architecture (“Image2InChI”) to solve the task of Optical Chemical Structure Recognition (OCSR). The rhetorical focus is on engineering a system that outperforms baselines on specific metrics (InChI accuracy, MCS accuracy) and providing a valuable reference for future algorithmic work.

Bottlenecks in Chemical Literature Digitization

The accurate digitization of chemical literature is a bottleneck in AI-driven drug discovery. Chemical structures in patents and papers exist as optical images (pixels), but machine learning models require machine-readable string representations (like InChI or SMILES). Efficiently and automatically bridging this gap is a prerequisite for large-scale data mining in chemistry.

Hierarchical SwinTransformer and Attention Integration

The core novelty is the Image2InChI architecture, which integrates:

Improved SwinTransformer Encoder: Uses a hierarchical vision transformer to capture image features.
Feature Fusion with Attention: A novel network designed to integrate image patch features with InChI prediction steps.
End-to-End InChI Prediction: The architecture frames the problem as image-to-sequence translation that targets InChI strings directly, diverging from techniques that predict independent graph components. The model is optimized using a standard Cross-Entropy Loss over the token vocabulary: $$ \mathcal{L}_{\text{CE}} = - \sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x) $$ where $y_t$ is the InChI token at step $t$, $y_{<t}$ denotes the preceding tokens, and $x$ is the input image.

Benchmarking on the BMS Dataset

Benchmark Validation: The model was trained and tested on the BMS1000 (Bristol-Myers Squibb) dataset from a Kaggle competition.
Ablation/Comparative Analysis: The authors compared their method against other models in the supplement.
Preprocessing Validation: They justified their choice of denoising algorithms (8-neighborhood vs. Gaussian/Mean) to ensure preservation of bond lines while removing “spiky point noise”.

High InChI Recognition Metrics

High Accuracy: The model achieved 99.8% InChI accuracy, 94.8% Maximum Common Substructure (MCS) accuracy, and 96.2% Longest Common Subsequence (LCS) accuracy on the benchmarked dataset. It remains to be seen how well these models generalize to heavily degraded real-world patent images.
Effective Denoising: The authors concluded that eight-neighborhood filtering is superior to mean or Gaussian filtering for this specific domain because it removes isolated noise points without blurring the fine edges of chemical bonds.
Open Source: The authors stated their intention to release the code, though no public repository has been identified.

Artifacts

Artifact	Type	License	Notes
BMS Dataset (Kaggle)	Dataset	Competition	Bristol-Myers Squibb Molecular Translation competition dataset

No public code repository has been identified for Image2InChI despite the authors’ stated intent to release it.

Reproducibility Details

Data

The primary dataset used is the BMS (Bristol-Myers Squibb) Dataset.

Property	Details
Source	Kaggle Competition (BMS-Molecular-Translation)
Total Size	2.4 million images
Training Set	1.8 million images
Test Set	0.6 million images
Content	Each image corresponds to a unique International Chemical Identifier (InChI)

Other Datasets: The authors also utilized JPO (Japanese Patent Office), CLEF (CLEF-IP 2012), UOB (MolrecUOB), and USPTO datasets for broader benchmarking.

Preprocessing Pipeline:

Denoising: Eight-neighborhood filtering (threshold < 4 non-white pixels) is used to remove salt-and-pepper noise while preserving bond lines. Mean and Gaussian filtering were rejected due to blurring.
Sequence Padding:
- Analysis showed max InChI length < 270.
- Fixed sequence length set to 300.
- Special tokens: Three reserved tokens (IDs 190, 191, 192) are used for padding and framing.
Numerization: Characters are mapped to integers based on a fixed vocabulary (e.g., ‘C’ -> 178, ‘H’ -> 182).

Algorithms

Eight-Neighborhood Filtering (Denoising):

Pseudocode logic:

Iterate through every pixel.
Count non-white neighbors in the 3x3 grid (8 neighbors).
If count < threshold (default 4), treat as noise and remove.
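
The three steps above can be written as a short function. This is a minimal sketch, assuming a grayscale image stored as a list of rows with 255 as white and out-of-bounds neighbours counted as white; the function name and these conventions are not from the paper.

```python
def eight_neighborhood_filter(image, threshold=4, white=255):
    """Remove isolated dark pixels from a grayscale image given as a list of rows.

    A non-white pixel is treated as noise and set to white when fewer than
    `threshold` of its 8 neighbours are non-white.
    """
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]  # read from `image`, write to the copy
    for y in range(height):
        for x in range(width):
            if image[y][x] == white:
                continue
            count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    # Pixels outside the image count as white (an assumption).
                    if 0 <= ny < height and 0 <= nx < width and image[ny][nx] != white:
                        count += 1
            if count < threshold:
                cleaned[y][x] = white
    return cleaned
```

Note that in this sketch a one-pixel-wide line has only two non-white neighbours per pixel, so with a threshold of 4 strokes need to be a few pixels thick to survive the filter.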

InChI Tokenization:

InChI strings are split into character arrays.
Example: Vitamin C InChI=1S/C6H8O6... becomes a start token followed by C, 6, H, 8, O, 6, ..., then an end token and padding.
Mapped to integer tensor for model input.
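
The tokenization and padding steps can be sketched as follows. Only the IDs for ‘C’ (178) and ‘H’ (182), the reserved IDs 190 to 192, and the fixed length of 300 come from the notes above. Which reserved ID plays which role, the rest of the vocabulary, and the stripping of the `InChI=1S/` prefix are all assumptions made for illustration.

```python
MAX_LEN = 300                             # fixed sequence length from the text
START_ID, END_ID, PAD_ID = 190, 191, 192  # assumed assignment of the three reserved IDs

# Hypothetical vocabulary: only 'C' -> 178 and 'H' -> 182 come from the text.
OTHER_CHARS = "0123456789/()-+,ONS"
VOCAB = {ch: i for i, ch in enumerate(OTHER_CHARS)}
VOCAB.update({"C": 178, "H": 182})


def encode_inchi(inchi: str) -> list[int]:
    """Strip the 'InChI=1S/' prefix, split into characters, then frame and pad."""
    body = inchi.removeprefix("InChI=1S/")
    ids = [START_ID] + [VOCAB[ch] for ch in body] + [END_ID]
    if len(ids) > MAX_LEN:
        raise ValueError("sequence longer than the fixed length")
    return ids + [PAD_ID] * (MAX_LEN - len(ids))


# Formula layer of the Vitamin C example only; the full string is elided in the notes.
encoded = encode_inchi("InChI=1S/C6H8O6")
```

The resulting list has a constant length, so a batch of encoded strings can be stacked into a single integer tensor.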

Models

Architecture: Image2InChI

Encoder: Improved SwinTransformer (Hierarchical Vision Transformer).
Decoder: Transformer Decoder with patch embedding.
Fusion: A novel “feature fusion network with attention” integrates the visual tokens with the sequence generation process.
Framework: PyTorch 1.8.1.

Evaluation

Metrics:

InChI Acc: Exact match accuracy of the predicted InChI string (Reported: 99.8%).
MCS Acc: Maximum Common Substructure accuracy (structural similarity) (Reported: 94.8%).
LCS Acc: Longest Common Subsequence accuracy (string similarity) (Reported: 96.2%).
Morgan FP: Morgan Fingerprint similarity (Reported: 94.1%).

Hardware

Component	Specification
GPU	NVIDIA Tesla P100 (16GB VRAM)
Platform	MatPool cloud platform
CPU	Intel Xeon Gold 6271
RAM	32GB System Memory
Driver	NVIDIA-SMI 440.100
OS	Ubuntu 18.04

Paper Information

Citation: Li, D., Xu, X., Pan, J., Gao, W., & Zhang, S. (2024). Image2InChI: Automated Molecular Optical Image Recognition. Journal of Chemical Information and Modeling, 64(9), 3640-3649. https://doi.org/10.1021/acs.jcim.3c02082

Publication: Journal of Chemical Information and Modeling (JCIM) 2024

Additional Resources:

BMS Dataset (Kaggle)

Note: These notes are based on the Abstract and Supporting Information files only.

@article{li2024image2inchi,
  title={Image2InChI: Automated Molecular Optical Image Recognition},
  author={Li, Da-zhou and Xu, Xin and Pan, Jia-heng and Gao, Wei and Zhang, Shi-rui},
  journal={Journal of Chemical Information and Modeling},
  volume={64},
  number={9},
  pages={3640--3649},
  year={2024},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.3c02082}
}

Dual-Path Global Awareness Transformer (DGAT) for OCSR

Fri, 19 Dec 2025 00:00:00 +0000

Contribution Type: Deep Learning Method for OCSR

This is a Method paper ($\Psi_{\text{Method}}$).

The classification is based on the proposal of a novel deep learning architecture (DGAT) designed to address specific limitations in existing Optical Chemical Structure Recognition (OCSR) systems. The contribution is validated through benchmarking against external baselines (DeepOCSR, DECIMER, SwinOCSR) and ablation studies that isolate the impact of the new modules.

Motivation: Addressing Global Context Loss

Existing multimodal fusion methods for OCSR suffer from limited awareness of global context.

Problem: Models often generate erroneous sequences when processing complex motifs, such as rings or long chains, due to a disconnect between local feature extraction and global structural understanding.
Gap: Current architectures struggle to capture the “fine-grained differences between global and local features,” leading to topological errors.
Practical Need: Accurate translation of chemical images to machine-readable sequences (SMILES/SELFIES) is critical for materials science and AI-guided chemical research.

Core Innovation: Dual-Path Global Awareness Transformer

The authors propose the Dual-Path Global Awareness Transformer (DGAT), which redesigns the decoder with two novel mechanisms to better handle global context:

Cascaded Global Feature Enhancement (CGFE): This module bridges cross-modal gaps by emphasizing global context. It concatenates global visual features with sequence features and processes them through a Cross-Modal Assimilation MLP and an Adaptive Alignment MLP to align multimodal representations. The feature enhancement conceptually computes:

$$ f_{\text{enhanced}} = \text{MLP}_{\text{align}}(\text{MLP}_{\text{assimilate}}([f_{\text{global}}, f_{\text{seq}}])) $$
Sparse Differential Global-Local Attention (SDGLA): A module that dynamically captures fine-grained differences between global and local features. It uses sequence features (embedded with global info) as queries, while utilizing local and global visual features as keys/values in parallel attention heads to generate initial multimodal features.

Experimental Setup and Baselines

The model was evaluated on a newly constructed dataset and compared against five major baselines.

Baselines: DeepOCSR, DECIMER 1.0, DECIMER V2, SwinOCSR, and MPOCSR.
Ablation Studies:
- Layer Depth: Tested Transformer depths from 1 to 5 layers; 3 layers proved optimal for balancing gradient flow and parameter sufficiency.
- Beam Size: Tested inference beam sizes 1-5; size 3 achieved the best balance between search depth and redundancy.
- Module Contribution: Removing CGFE lowers structural similarity (Tanimoto), which supports the need for pre-fusion alignment.
Robustness Analysis: Performance broken down by molecule complexity (atom count, ring count, bond count).
Chirality Validation: Qualitative analysis of attention maps on chiral molecules to verify the model learns stereochemical cues implicitly.

Results and Conclusions

Performance Over Baselines: DGAT outperformed the MPOCSR baseline across all metrics (gains are in absolute percentage points):
- BLEU-4: 84.0% (+5.3)
- ROUGE: 90.8% (+1.9)
- Tanimoto Similarity: 98.8% (+1.2)
- Exact Match Accuracy: 54.6% (+10.9 over SwinOCSR)
Chiral Recognition: The model implicitly recognizes chiral centers (e.g., generating [C@@H1] tokens correctly) based on 2D wedge cues without direct stereochemical supervision.
Limitations: Performance drops for extreme cases, such as molecules with 4+ rings or 4+ double/triple bonds, due to dataset imbalance. The model still hallucinates branches in highly complex topologies.

Reproducibility Details

Data

The training data is primarily drawn from PubChem and augmented to improve robustness.

Augmentation Strategy: Each sequence generates three images with random rendering parameters.
- Rotation: 0, 90, 180, 270, or random [0, 360)
- Bond Width: 1, 2, or 3 pixels
- Bond Offset: Sampled from 0.08-0.18 (inherited from Image2SMILES)
- CoordGen: Enabled with 20% probability
Evaluation Set: A newly constructed benchmark dataset was used for final reporting.
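
The augmentation strategy listed above can be sketched as a parameter sampler. The ranges come from the notes; the equal weighting of the five rotation options, the uniform sampling of the bond offset, and all names are assumptions. Rendering itself is left out.

```python
import random


def sample_render_params(rng: random.Random) -> dict:
    """Draw one set of rendering parameters from the ranges listed above."""
    # Either a right-angle rotation or a fully random angle in [0, 360).
    rotation = rng.choice([0.0, 90.0, 180.0, 270.0, None])
    if rotation is None:
        rotation = rng.random() * 360.0
    return {
        "rotation_deg": rotation,
        "bond_width_px": rng.choice([1, 2, 3]),
        "bond_offset": rng.uniform(0.08, 0.18),
        "use_coordgen": rng.random() < 0.2,  # enabled with 20% probability
    }


def render_params_for_sequence(rng: random.Random, images_per_sequence: int = 3) -> list[dict]:
    """Each sequence yields three images, each with independent parameters."""
    return [sample_render_params(rng) for _ in range(images_per_sequence)]
```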

Algorithms

Training Configuration:
- Encoder LR: $5 \times 10^{-5}$ (Pretrained ResNet-101)
- Decoder LR: $1 \times 10^{-4}$ (Randomly initialized Transformer)
- Optimizer: Not stated explicitly; SGD or Adam is implied (the context mentions momentum 0.9 and weight decay 0.0001)
- Batch Size: 256
Inference:
- Beam Search: A beam size of 3 is used. Larger beam sizes (4-5) degraded BLEU/ROUGE scores due to increased redundancy.
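
For readers unfamiliar with the inference step, a generic beam search with a beam size of 3 looks roughly like this. It is a textbook sketch without length normalisation, not DGAT's decoder; `next_log_probs` and `toy_model` are hypothetical stand-ins for the trained network.

```python
import math


def beam_search(next_log_probs, start, end, beam_size=3, max_len=20):
    """Generic beam search.

    `next_log_probs(prefix)` is a caller-supplied function returning
    {token: log_prob} for the next token given a prefix tuple.
    """
    beams = [((start,), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == end else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    """Toy distribution where the greedy first choice is not the best sequence."""
    if prefix == ("<s>",):
        return {"C": math.log(0.6), "N": math.log(0.4)}
    if prefix[-1] == "C":
        return {"</s>": math.log(0.5), "C": math.log(0.5)}
    return {"</s>": math.log(1.0)}


best_beam3 = beam_search(toy_model, "<s>", "</s>", beam_size=3)
best_greedy = beam_search(toy_model, "<s>", "</s>", beam_size=1)
```

In the toy example the wider beam recovers a sequence that greedy decoding misses, which illustrates why a beam larger than 1 can help; the degradation reported for sizes 4-5 is an empirical finding of the paper and is not reproduced by this sketch.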

Models

Visual Encoder:
- Backbone: ResNet-101 initialized with ImageNet weights
- Structure: Convolutional layers preserved up to the final module. Classification head removed.
- Pooling: A $7 \times 7$ average pooling layer is used to extract global visual features.
Sequence Decoder:
- Architecture: Transformer-based with CGFE and SDGLA modules.
- Depth: 3 Transformer layers
- Dropout: Not utilized

Evaluation

Performance is reported using sequence-level and structure-level metrics.

Metric	DGAT Score	Baseline (MPOCSR)	Notes
BLEU-4	84.0%	78.7%	Measures n-gram precision
ROUGE	90.8%	88.9%	Sequence recall metric
Tanimoto	98.8%	97.6%	Structural similarity fingerprint
Accuracy	54.6%	35.7%	Exact structure match rate

Artifacts

Artifact	Type	License	Notes
DGAT	Code	Unknown	Official implementation with training and evaluation scripts

Paper Information

Citation: Wang, R., Ji, Y., Li, Y., & Lee, S.-T. (2025). Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition. The Journal of Physical Chemistry Letters, 16(50), 12787-12795. https://doi.org/10.1021/acs.jpclett.5c03057

Publication: The Journal of Physical Chemistry Letters 2025

Additional Resources:

GitHub Repository

@article{wang2025dgat,
  title={Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition},
  author={Wang, Rui and Ji, Yujin and Li, Youyong and Lee, Shuit-Tong},
  journal={The Journal of Physical Chemistry Letters},
  volume={16},
  number={50},
  pages={12787--12795},
  year={2025},
  doi={10.1021/acs.jpclett.5c03057}
}

DECIMER.ai: Optical Chemical Structure Recognition

Fri, 19 Dec 2025 00:00:00 +0000

Project Scope and Contribution Type

This is primarily a Resource paper (Infrastructure Basis) with a significant Method component.

The primary contribution is DECIMER.ai, a fully open-source platform (web app and Python packages) for the entire chemical structure mining pipeline, filling a gap where most tools were proprietary or fragmented. It also contributes the RanDepict toolkit for massive synthetic data generation.

The secondary methodological contribution proposes and validates a specific deep learning architecture (EfficientNet-V2 encoder + Transformer decoder) that treats chemical structure recognition as an image-to-text translation task (SMILES generation).

The Scarcity of Machine-Readable Chemical Data

Data Scarcity: While the number of chemical publications is increasing, most chemical information is locked in non-machine-readable formats (images in PDFs) and is not available in public databases.

Limitations of Existing Tools: Prior OCSR (Optical Chemical Structure Recognition) tools were largely rule-based (fragile to noise) or proprietary.

Lack of Integration: There was no existing open-source system that combined segmentation (finding the molecule on a page), classification (confirming it is a molecule), and recognition (translating it to SMILES) into a single workflow.

DECIMER Architecture and Novel Image-to-SMILES Approach

Comprehensive Workflow: It is the first open-source platform to integrate segmentation (Mask R-CNN), classification (EfficientNet), and recognition (Transformer) into a unified pipeline.

Data-Driven Approach: Unlike tools like MolScribe which use intermediate graph representations and rules, DECIMER uses a purely data-driven “image-to-SMILES” translation approach without hard-coded chemical rules. The core recognition model operates as a sequence-to-sequence generator, mathematically formalizing the task as maximizing the conditional probability of a SMILES sequence given an image.

Massive Synthetic Training: The use of RanDepict to generate over 450 million synthetic images, covering diverse depiction styles and augmentations (including Markush structures), to train the model from scratch.
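
The conditional-probability formulation of the recognition model mentioned above can be written in the usual autoregressive form. The notation is introduced here for illustration and is not taken from the paper: $I$ is the input image, $S = (s_1, \dots, s_T)$ is the SMILES token sequence, and $\theta$ denotes the model parameters.

$$ \hat{S} = \arg\max_{S} P_\theta(S \mid I), \qquad P_\theta(S \mid I) = \prod_{t=1}^{T} P_\theta(s_t \mid s_{<t}, I) $$

Under this reading, training minimizes the negative log of the product, which is the sum of per-token cross-entropy terms.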

Benchmarking and Evaluation Methodology

Benchmarking: The system was tested against openly available tools (OSRA, MolVec, Imago, Img2Mol, SwinOCSR, MolScribe) on standard datasets: USPTO, UOB, CLEF, JPO, and a custom “Hand-drawn” dataset.

Robustness Testing: Performance was evaluated on both clean images and images with added distortions (rotation, shearing) to test the fragility of rule-based systems vs. DECIMER.

Markush Structure Analysis: Specific evaluation of the model’s ability to interpret Markush structures (generic structures with R-groups).

Comparison of Approaches: A direct comparison with MolScribe by training DECIMER on MolScribe’s smaller training set to isolate the impact of architecture vs. data volume.

Performance Outcomes and Key Findings

Comparative Performance: DECIMER Image Transformer consistently produced average Tanimoto similarities above 0.95 on in-domain test data and achieved competitive or leading results across external benchmarks, with extremely low rates of catastrophic failure. Tanimoto similarity is calculated based on molecular fingerprints $A$ and $B$ as: $$ T(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$

Data Volume Necessity: When trained on small datasets, MolScribe (graph/rule-based) outperformed DECIMER. DECIMER’s performance advantage relies heavily on its massive training scale (>400M images).

Robustness: The model showed no performance degradation on distorted images, unlike rule-based legacy tools.

Generalization: Despite having no hand-drawn images in the training set, the base model recognized 27% of hand-drawn structures perfectly (average Tanimoto 0.69), outperforming all alternative open tools. After fine-tuning with synthetic hand-drawn-like images from RanDepict, perfect predictions increased to 60% (average Tanimoto 0.89).

Reproducibility

Artifacts

Artifact	Type	License	Notes
DECIMER.ai Web App	Code	MIT	Laravel-based web application for the full pipeline
DECIMER Image Transformer	Code	MIT	Core OCSR Python package
DECIMER Image Segmentation	Code	MIT	Mask R-CNN segmentation for chemical structures in documents
DECIMER Image Classifier	Code	MIT	EfficientNet-based chemical structure image classifier
RanDepict	Code	MIT	Synthetic training data generation toolkit

Data

The models were trained on synthetic data generated from PubChem molecules.

Purpose	Dataset	Size	Generation/Notes
Training	`pubchem_1`	~108M mols	PubChem molecules (mass < 1500 Da), processed with RanDepict (v1.0.5). Included image augmentations.
Training	`pubchem_2`	~126M mols	Included Markush structures generated by pseudo-randomly replacing atoms with R-groups. Image size 299x299.
Training	`pubchem_3`	>453M images	Re-depicted `pubchem_2` molecules at 512x512 resolution. Used RanDepict v1.0.8.
Test	In-domain	250,000	Held-out set generated similarly to training data.
Benchmark	External	Various	USPTO (5719), UOB (5740), CLEF (992), JPO (450), Indigo (50k), Hand-drawn (5088).

Data Generation:

Tool: RanDepict (uses CDK, RDKit, Indigo, PIKAChU)
Augmentations: Rotation, shearing, noise, pixelation, curved arrows, text labels
Format: Data saved as TFRecord files for TPU training

Algorithms

SMILES Tokenization: Regex-based splitting (atoms, brackets, bonds). Start and end tokens were added to each sequence, sequences were padded with a padding token, and a dedicated token was used for unknown tokens.
Markush Token Handling: To avoid ambiguity, digits following ‘R’ (e.g., R1) were replaced with unique non-digit characters during training to distinguish them from ring-closure numbers.
Image Augmentation Pipeline: Custom RanDepict features (v1.1.4) were used to simulate “hand-drawn-like” styles based on ChemPIX’s implementation.
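
The tokenization and Markush handling described above can be sketched as follows. This is a minimal illustration, not the DECIMER implementation: the regular expression is a commonly used SMILES pattern chosen here as an assumption, the token names `<start>`, `<end>`, and `<pad>` are placeholders, and the digit-to-character table in `R_DIGIT_TABLE` is hypothetical.

```python
import re

# Assumed SMILES pattern: bracket atoms, two-letter halogens, two-digit
# ring closures, single-letter atoms, ring digits, then bonds and brackets.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%[0-9]{2}|[BCNOPSFIbcnops]|[0-9]|[()=#\-+\\/.:~@*$])"
)

# Hypothetical mapping from digits to non-digit characters for R-group labels.
R_DIGIT_TABLE = str.maketrans("0123456789", "αβγδεζηθικ")


def mask_r_digits(smiles):
    """Replace digits that directly follow 'R' so they cannot be read as ring closures."""
    return re.sub(
        r"R(\d+)",
        lambda m: "R" + m.group(1).translate(R_DIGIT_TABLE),
        smiles,
    )


def tokenize(smiles, max_len=12):
    """Split a SMILES string into tokens, add start/end tokens, and pad."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Refuse inputs that the pattern does not cover completely.
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    seq = ["<start>"] + tokens + ["<end>"]
    return seq + ["<pad>"] * (max_len - len(seq))


print(tokenize("CC(=O)Cl"))
print(tokenize(mask_r_digits("[R1]C1CC1")))
```

In the second call the label `R1` is masked before tokenization, so the only digits left in the token stream are the two ring-closure `1` tokens.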

Models

The platform consists of three distinct models:

DECIMER Segmentation:
- Architecture: Mask R-CNN (TensorFlow 2.10.0 implementation)
- Purpose: Detects and cuts chemical structures from full PDF pages
DECIMER Image Classifier:
- Architecture: EfficientNet-V1-B0
- Input: 224x224 pixels
- Training: Fine-tuned on ~10.9M images (balanced chemical/non-chemical)
- Performance: AUC 0.99 on in-domain test set
DECIMER Image Transformer (OCSR Engine):
- Encoder: EfficientNet-V2-M (CNN). Input size 512x512. 52M parameters
- Decoder: Transformer. 4 encoder blocks, 4 decoder blocks, 8 attention heads. d_model=512, d_ff=2048. 59M parameters
- Total Params: ~111 Million

Evaluation

Primary Metric: Tanimoto Similarity (calculated on PubChem fingerprints of the predicted vs. ground truth SMILES)
Secondary Metrics: Exact Match (Identity), BLEU score (for string similarity, esp. Markush)
Failure Analysis: “Catastrophic failure” defined as Tanimoto similarity of 0 or invalid SMILES
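
As a rough illustration of how these metrics might be aggregated over a test set, the sketch below summarizes a list of per-image Tanimoto scores. The function name `summarize`, the use of `None` for invalid SMILES, and counting invalid predictions as 0 in the mean are all assumptions; the paper may aggregate differently.

```python
def summarize(scores):
    """Summarize per-image Tanimoto scores; None marks an invalid predicted SMILES."""
    n = len(scores)
    # "Catastrophic failure": Tanimoto of 0 or an invalid SMILES.
    catastrophic = sum(1 for s in scores if s is None or s == 0.0)
    tanimoto_one = sum(1 for s in scores if s == 1.0)
    # Assumption: invalid predictions contribute 0 to the average.
    mean = sum(0.0 if s is None else s for s in scores) / n
    return {
        "mean_tanimoto": mean,
        "tanimoto_1.0_rate": tanimoto_one / n,
        "catastrophic_rate": catastrophic / n,
    }


# Hypothetical scores for five test images.
print(summarize([1.0, 1.0, 0.8, 0.0, None]))
```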

Hardware

Training was performed on Google Cloud TPUs due to the massive dataset size.

pubchem_1/pubchem_2: Trained on TPU v3-32 pod slice
pubchem_3 (Final Model): Trained on TPU v3-256 pod slice
Training Time:
- Data generation (512x512): ~2 weeks on cluster (20 threads, 36 cores)
- Model Training (EffNet-V2-M): 1 day and 7 hours per epoch on TPU v3-256

Paper Information

Citation: Rajan, K., Brinkhaus, H. O., Agea, M. I., Zielesny, A., & Steinbeck, C. (2023). DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. Nature Communications, 14(1), 5045. https://doi.org/10.1038/s41467-023-40782-0

Publication: Nature Communications 2023

Additional Resources:

@article{rajanDECIMERaiOpenPlatform2023,
  title = {DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications},
  author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Agea, M. Isabel and Zielesny, Achim and Steinbeck, Christoph},
  journal = {Nature Communications},
  volume = {14},
  number = {1},
  pages = {5045},
  year = {2023},
  doi = {10.1038/s41467-023-40782-0}
}

SwinOCSR: End-to-End Chemical OCR with Swin Transformers

Thu, 18 Dec 2025 00:00:00 +0000

Contribution: Methodological Architecture and Datasets

This is a Methodological Paper with a significant Resource component.

Method: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).
Resource: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.

Motivation: Addressing Visual Context and Data Imbalance

Problem: OCSR (converting images of chemical structures to SMILES) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.
Technical Gap: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and miss global dependencies required for interpreting complex molecular diagrams.
Data Imbalance: Chemical strings suffer from severe class imbalance (e.g., ‘C’ and ‘H’ are frequent; ‘Br’ or ‘Cl’ are rare), which causes standard Cross Entropy loss to underperform.

Core Innovation: Swin Transformers and Focal Loss

Swin Transformer Backbone: SwinOCSR replaces the standard CNN backbone with a Swin Transformer, using shifted window attention to capture both local and global image features more effectively.
Multi-label Focal Loss (MFL): The paper introduces a modified Focal Loss to OCSR, which the authors describe as the first explicit attempt to address token imbalance in OCSR. It increases the penalty for errors on rare tokens, addressing the “long-tail” distribution of chemical elements. The standard Focal Loss formulation down-weights well-classified examples so that hard-to-classify ones dominate the loss: $$ FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) $$ Here $p_t$ is the predicted probability of the true class, $\alpha_t$ is a class-balancing weight, and $\gamma \ge 0$ is the focusing parameter.
Structured Synthetic Dataset: Creation of a dataset explicitly balanced across four structural categories: Kekule rings, Aromatic rings, and their combinations with substituents.
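
To see how the Focal Loss formula above shifts weight toward hard examples, the short sketch below compares it with plain cross entropy at a few probabilities. The values `gamma=2.0` and `alpha_t=1.0` are illustrative assumptions, not the settings reported for SwinOCSR.

```python
import math


def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for one example."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Compare cross entropy (gamma = 0) with focal loss for easy and hard examples.
for p_t in (0.9, 0.5, 0.1):
    ce = -math.log(p_t)
    print(f"p_t={p_t:.1f}  CE={ce:.4f}  FL={focal_loss(p_t):.4f}")
```

With $\gamma = 0$ the modulating factor is 1 and the expression reduces to weighted cross entropy. Larger $\gamma$ shrinks the loss of confident, correct predictions far more than that of uncertain ones.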

Experimental Setup and Baselines

Backbone Comparison: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).
Loss Function Ablation: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).
Category Stress Test: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.
Real-world Evaluation: The model was tested on 100 images manually extracted from the literature (with manually labeled SMILES), and separately on 100 CDK-generated images from those same SMILES, to measure the domain gap between synthetic and real-world data.

Results and Limitations

Synthetic test set performance: With Multi-label Focal Loss (MFL), SwinOCSR achieved 98.58% accuracy on the synthetic test set, compared to 97.36% with standard CE loss. Both ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones scored lower when using CE loss (Table 3).
Handling of long sequences: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating effective global feature extraction.
Per-category results: Performance was consistent across molecule categories: Category 1 (Kekule, 98.20%), Category 2 (Aromatic, 98.46%), Category 3 (Kekule + Substituents, 98.76%), Category 4 (Aromatic + Substituents, 98.89%). The model performed slightly better on molecules with substituents and aromatic rings.
Domain shift: While performance on synthetic data was strong, accuracy dropped to 25% on 100 real-world literature images. On 100 CDK-generated images from the same SMILES strings, accuracy was 94%, confirming that the gap stems from stylistic differences between CDK-rendered and real-world images. The authors attribute this to noise, low resolution, and variations such as condensed structural formulas and abbreviations.

Reproducibility Details

Data

Source: The first 8.5 million structures from PubChem were downloaded, yielding ~6.9 million unique SMILES.
Generation Pipeline:
- Tools: CDK (Chemistry Development Kit) for image rendering; RDKit for SMILES canonicalization.
- Augmentation: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekule, (2) Aromatic, (3) Kekule + Substituents, (4) Aromatic + Substituents. Substituents were randomly added from a list of 224 common patent substituents.
- Preprocessing: Images rendered as binary, resized to 224x224, and copied to 3 channels (RGB simulation).

Purpose	Dataset	Size	Notes
Training	Synthetic (PubChem-derived)	4,500,000	18:1:1 split (Train/Val/Test)
Validation	Synthetic (PubChem-derived)	250,000
Test	Synthetic (PubChem-derived)	250,000

Algorithms

Loss Function: Multi-label Focal Loss (MFL). The single-label classification task was cast as multi-label to apply Focal Loss, using a sigmoid activation on logits.
Optimization:
- Optimizer: Adam with initial learning rate 5e-4.
- Schedulers: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.
- Regularization: Dropout rate of 0.1.
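
The multi-label casting described under Loss Function can be sketched for a single token position as follows. Each vocabulary entry is treated as an independent binary decision with a sigmoid, and a binary focal term is summed over classes. The function name and the defaults `gamma=2.0` and `alpha=0.25` are assumptions for illustration; the paper's exact weighting may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_focal_loss(logits, target_index, gamma=2.0, alpha=0.25):
    """Sum of per-class binary focal losses for one token position."""
    total = 0.0
    for k, z in enumerate(logits):
        p = sigmoid(z)
        if k == target_index:
            # Positive class: probability of the correct decision is p.
            p_t, a_t = p, alpha
        else:
            # Negative class: probability of the correct decision is 1 - p.
            p_t, a_t = 1.0 - p, 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total


# Hypothetical logits over a 3-token vocabulary; the true token is index 0.
confident_right = multilabel_focal_loss([4.0, -4.0, -4.0], target_index=0)
confident_wrong = multilabel_focal_loss([-4.0, 4.0, -4.0], target_index=0)
print(confident_right, confident_wrong)
```

The sketch is not numerically hardened: extreme logits would need a stable log-sigmoid formulation, as deep learning frameworks typically provide.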

Models

Backbone (Encoder 1): Swin Transformer.
- Patch size: $4 \times 4$.
- Linear embedding dimension: 192.
- Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).
- Output: Flattened patch sequence $S_b$.
Transformer Encoder (Encoder 2): 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.
Transformer Decoder: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).
Tokenization: DeepSMILES format used (syntactically more robust than SMILES). Vocabulary size: 76 tokens (76 unique characters found in dataset). Embedding dimension: 256.

Evaluation

Metrics: Accuracy (Exact Match), Tanimoto Similarity (PubChem fingerprints), BLEU, ROUGE.

Metric	SwinOCSR (CE)	SwinOCSR (MFL)	ResNet-50 (CE)	EfficientNet-B3 (CE)
Accuracy	97.36%	98.58%	89.17%	86.70%
Tanimoto	99.65%	99.77%	98.79%	98.46%
BLEU	99.46%	99.59%	98.62%	98.37%
ROUGE	99.64%	99.78%	98.87%	98.66%

Hardware

GPU: Trained on NVIDIA Tesla V100-PCIE.
Training Length: 30 epochs.
Batch Size: 256 images ($224 \times 224$ pixels).

Artifacts

Artifact	Type	License	Notes
SwinOCSR	Code + Data	Unknown	Official implementation with dataset and trained models

Paper Information

Citation: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. Journal of Cheminformatics, 14(41). https://doi.org/10.1186/s13321-022-00624-5

Publication: Journal of Cheminformatics 2022

Additional Resources:

GitHub Repository

MICER: Molecular Image Captioning with Transfer Learning

Thu, 18 Dec 2025 00:00:00 +0000

MICER’s Contribution to Optical Structure Recognition

This is a Method paper according to the AI for Physical Sciences taxonomy. It proposes MICER, an encoder-decoder architecture that integrates transfer learning (fine-tuning pre-trained models) and attention mechanisms for Optical Chemical Structure Recognition (OCSR). The study includes rigorous benchmarking comparing MICER against three rule-based tools (OSRA, MolVec, Imago) and existing deep learning methods (DECIMER). The authors conduct extensive factor comparison experiments to isolate the effects of stereochemistry, molecular complexity, data volume, and encoder backbone choices.

The Challenge of Generalizing in OCSR

Chemical structures in scientific literature are valuable for drug discovery, but they are locked in image formats that are difficult to mine automatically. Traditional OCSR tools (like OSRA) rely on hand-crafted rules and expert knowledge. They are brittle, struggle with stylistic variations, and have low generalization ability. While deep learning has been applied (e.g., DECIMER), previous attempts often used frozen pre-trained feature extractors (without fine-tuning) or failed to fully exploit transfer learning, leading to suboptimal performance. The goal of this work is to build an end-to-end “image captioning” system that translates molecular images directly into SMILES strings without intermediate segmentation steps.

Integrating Fine-Tuning and Attention for Chemistry

The core novelty lies in the specific architectural integration of transfer learning with fine-tuning for the chemical domain. Unlike DECIMER, which used a frozen network, MICER fine-tunes a pre-trained ResNet on molecular images. This allows the encoder to adapt from general object recognition to specific chemical feature extraction.

The model incorporates an attention mechanism into the LSTM decoder, allowing the model to focus on specific image regions (atoms and bonds) when generating each character of the SMILES string. The paper explicitly analyzes “intrinsic features” of molecular data (stereochemistry, complexity) to guide the design of the training dataset, combining multiple chemical toolkits (Indigo, RDKit) to generate diverse styles.

Experimental Setup and Ablation Studies

The authors performed two types of experiments: Factor Comparison (ablations) and Benchmarking.

Factor Comparisons: They evaluated how performance is affected by:

Stereochemistry (SI): Comparing models trained on data with and without stereochemical information.
Molecular Complexity (MC): Analyzing performance across 5 molecular weight intervals.
Data Volume (DV): Training on datasets ranging from 0.64 million to 10 million images.
Pre-trained Models (PTMs): Comparing 8 different backbones (e.g., ResNet, VGG, Inception, MobileNet) versus a base CNN.

Benchmarking:

Baselines: OSRA, MolVec, Imago (rule-based); Base CNN, DECIMER (deep learning).
Datasets: Four test sets (100k images each, except UOB): Uni-style, Multi-style, Noisy, and Real-world (UOB dataset).
Metrics: Sequence Accuracy (Exact Match), Levenshtein Distance (ALD), and Tanimoto Similarity (Fingerprint match).

Results and Core Insights

MICER achieved 97.54% Sequence Accuracy on uni-style data and 82.33% on the real-world UOB dataset, outperforming rule-based and deep learning baselines across all four test sets.

Dataset	Method	SA (%)	AMFTS (%)
Uni-style	OSRA	23.14	56.83
Uni-style	DECIMER	35.32	86.92
Uni-style	MICER	97.54	99.74
Multi-style	OSRA	15.68	44.50
Multi-style	MICER	95.09	99.28
Noisy	MICER	94.95	99.25
UOB (real-world)	OSRA	80.24	91.17
UOB (real-world)	DECIMER	21.75	65.15
UOB (real-world)	MICER	82.33	94.47

ResNet101 was identified as the most effective encoder (87.58% SA in preliminary tests on 0.8M images), outperforming deeper (DenseNet121 at 81.41%) and lighter (MobileNetV2 at 39.83%) networks. Performance saturates around 6 million training samples, reaching 98.84% SA. Including stereochemical information lowers accuracy by approximately 6.1 percentage points (from 87.61% to 81.50%), indicating that wedge and dash bonds are harder to recognize. Attention map visualizations showed that the model attends to the relevant atoms (e.g., focusing on ‘S’ or ‘Cl’ pixels) when generating the corresponding character.

Limitations

The authors acknowledge several limitations. MICER struggles with superatoms, R-groups, text labels, and uncommon atoms (e.g., Sn) that were not seen during training. On noisy data, noise spots near Cl atoms can cause misclassification as O atoms. Complex molecular images with noise lead to misrecognition of noise points as single bonds and wedge-shaped bonds as double bonds. All methods, including MICER, have substantial room for improvement on real-world datasets that contain these challenging elements.

Reproducibility Details

Data

The training data was curated from the ZINC20 database.

Preprocessing:

Filtering: Removed organometallics, mixtures, and invalid molecules.
Standardization: SMILES were canonicalized and de-duplicated.
Generation: Images generated using Indigo and RDKit toolkits to vary styles.

Dataset Size:

Total: 10 million images selected for the final model.
Composition: 6 million “default style” (Indigo) + 4 million “multi-style” (Indigo + RDKit).
Splits: 8:1:1 ratio for Training/Validation/Test.

Vocabulary: A token dictionary of 3 special tokens ([pad], [sos], [eos]) plus 39 SMILES characters: [0]-[9], [C], [l], [c], [O], [N], [n], [F], [H], [o], [S], [s], [B], [r], [I], [i], [P], [p], (, ), [, ], @, =, #, /, -, +, \, %. Two-letter atoms are split into separate characters: ‘Br’ becomes [B], [r] and ‘Cl’ becomes [C], [l].
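
A minimal sketch of this character-level vocabulary is shown below. The token-to-index ordering, the `encode` helper, and the padding length are assumptions for illustration; only the set of 42 tokens comes from the description above.

```python
SPECIAL = ["[pad]", "[sos]", "[eos]"]
CHARS = (
    list("0123456789")            # ring-closure digits
    + list("ClcONnFHoSsBrIiPp")   # atom characters, two-letter atoms split
    + list("()[]@=#/-+\\%")       # branches, brackets, chirality, bonds, charges
)
# Assumed ordering: special tokens first, then characters as listed.
VOCAB = {token: index for index, token in enumerate(SPECIAL + CHARS)}


def encode(smiles, max_len=12):
    """Encode a SMILES string character by character with [sos]/[eos] and padding."""
    tokens = ["[sos]"] + list(smiles) + ["[eos]"]
    tokens += ["[pad]"] * (max_len - len(tokens))
    # Characters outside the vocabulary raise KeyError in this sketch.
    return [VOCAB[t] for t in tokens]


print(len(VOCAB))
print(encode("CBr", max_len=6))
```

Because ‘Br’ is encoded as two indices, the decoder has to learn that `B` followed by `r` forms a single atom, which is the trade-off of character-level over atom-level tokenization.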

Algorithms

Tokenization: Character-level tokenization (not atom-level); the model learns to assemble ‘C’ and ’l’ into ‘Cl’.
Attention Mechanism: Uses a soft attention mechanism where the decoder calculates an attention score between the encoder’s feature map $F$ ($8 \times 8 \times 512$) and the current hidden vector $b_t$. Formula: $$ \text{att\_score} = \text{softmax}\left(L_a\left(\tanh\left(L_f(F) + L_b(b_t)\right)\right)\right) $$
Training Configuration:
- Loss Function: Cross-entropy loss
- Optimizer: Adam optimizer
- Learning Rate: 2e-5
- Batch Size: 256
- Epochs: 15
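
The attention score in the formula above can be sketched in plain Python on random data. This assumes that $L_f$, $L_b$, and $L_a$ are bias-free linear maps and uses a hypothetical hidden size of 32 and attention size of 16; neither value comes from the paper. The final context vector (a score-weighted sum of features) is the usual next step in soft attention and is included as an assumption.

```python
import math
import random


def linear(weights, vec):
    """Apply a bias-free linear map given as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(F, b_t, W_f, W_b, w_a):
    """softmax(L_a(tanh(L_f(F) + L_b(b_t)))) over image locations."""
    proj_b = linear(W_b, b_t)  # L_b(b_t), shared by every location
    energies = []
    for f in F:                # one feature vector per image location
        proj_f = linear(W_f, f)  # L_f applied at this location
        hidden = [math.tanh(x + y) for x, y in zip(proj_f, proj_b)]
        energies.append(sum(a * h for a, h in zip(w_a, hidden)))  # L_a -> scalar
    return softmax(energies)


def random_matrix(rows, cols, rng, scale=0.05):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_loc, feat_dim, hid_dim, att_dim = 64, 512, 32, 16  # 64 = 8 x 8 locations
F = random_matrix(n_loc, feat_dim, rng, scale=1.0)
b_t = [rng.gauss(0.0, 1.0) for _ in range(hid_dim)]
W_f = random_matrix(att_dim, feat_dim, rng)
W_b = random_matrix(att_dim, hid_dim, rng)
w_a = [rng.gauss(0.0, 1.0) for _ in range(att_dim)]

scores = attention_scores(F, b_t, W_f, W_b, w_a)
# Context vector: attention-weighted sum of the 64 feature vectors.
context = [sum(s * f[c] for s, f in zip(scores, F)) for c in range(feat_dim)]
print(len(scores), len(context))
```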

Models

Encoder:

Backbone: Pre-trained ResNet101 (trained on ImageNet).
Modifications: The final layer is removed to output a Feature Map of size $8 \times 8 \times 512$.
Flattening: Reshaped to a $64 \times 512$ feature matrix for the decoder.

Decoder:

Type: Long Short-Term Memory (LSTM) with Attention.
Dropout: 0.3 applied to minimize overfitting.

The encoder uses a pilot network (for universal feature extraction), a max-pooling layer, and multiple feature extraction layers containing convolutional blocks (CBs), feeding into the attention LSTM.

Evaluation

Metrics:

SA (Sequence Accuracy): Strict exact match of SMILES strings.
ALD (Average Levenshtein Distance): Edit distance for character-level error analysis.
AMFTS / MFTS@1.0: Tanimoto similarity of ECFP4 fingerprints to measure structural similarity.
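
SA and ALD can be computed with a few lines of standard-library Python, as sketched below. The function names and example strings are hypothetical. The sketch compares raw strings, so it assumes predictions and references have already been canonicalized in the same way.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sequence_accuracy(preds, refs):
    """Fraction of predictions that match the reference string exactly."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def average_levenshtein(preds, refs):
    """Mean character-level edit distance over all prediction/reference pairs."""
    return sum(levenshtein(p, r) for p, r in zip(preds, refs)) / len(refs)


# Hypothetical predictions: the third one confuses Cl with O.
refs = ["CCO", "c1ccccc1", "CC(=O)Cl"]
preds = ["CCO", "c1ccccc1", "CC(=O)O"]
print(sequence_accuracy(preds, refs), average_levenshtein(preds, refs))
```

The third pair shows why the two metrics complement each other: it counts as a complete miss for SA, while ALD records that only two character edits separate the strings.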

Test Sets:

Uni-style: 100,000 images (Indigo default).
Multi-style: 100,000 images (>10 styles).
Noisy: 100,000 images with noise added.
UOB: 5,575 real-world images from literature.

Hardware

Compute: 4 x NVIDIA Tesla V100 GPUs
Training Time: Approximately 42 hours for the final model

Artifacts

Artifact	Type	License	Notes
MICER	Code	MIT	Official implementation

The training data (generated from ZINC20) and pre-trained model weights are not publicly released. The repository contains code but has minimal documentation (2 commits, no description).

Paper Information

Citation: Yi, J., Wu, C., Zhang, X., Xiao, X., Qiu, Y., Zhao, W., Hou, T., & Cao, D. (2022). MICER: a pre-trained encoder-decoder architecture for molecular image captioning. Bioinformatics, 38(19), 4562-4572. https://doi.org/10.1093/bioinformatics/btac545

Publication: Bioinformatics 2022

Additional Resources:

GitHub Repository

@article{yiMICERPretrainedEncoder2022,
  title = {{{MICER}}: A Pre-Trained Encoder--Decoder Architecture for Molecular Image Captioning},
  shorttitle = {{{MICER}}},
  author = {Yi, Jiacai and Wu, Chengkun and Zhang, Xiaochen and Xiao, Xinyi and Qiu, Yanlong and Zhao, Wentao and Hou, Tingjun and Cao, Dongsheng},
  year = {2022},
  month = sep,
  journal = {Bioinformatics},
  volume = {38},
  number = {19},
  pages = {4562--4572},
  issn = {1367-4811},
  doi = {10.1093/bioinformatics/btac545}
}

Image2SMILES: Transformer OCSR with Synthetic Data Pipeline

Thu, 18 Dec 2025 00:00:00 +0000

Contribution: Image2SMILES as a Method and Resource

This is primarily a Method paper with a significant Resource component.

Method: It proposes a specific neural architecture (ResNet backbone and Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering “How well does this work?” with extensive benchmarks against rule-based systems like OSRA.
Resource: A core contribution is the “Generate and Train!” paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.

Motivation: Bottlenecks in Recognizing Trapped Chemical Structures

Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.

Problem: Chemical structures are often “trapped” in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, “Markush” structures (templates), or visual contamination.
Gap: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.
Goal: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).

Core Innovation: The “Generate and Train!” Pipeline and FG-SMILES

“Generate and Train!” Paradigm: The authors assert that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates geometry (rotation, bonds) alongside specific chemical drawing artifacts like “Markush” variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual “contamination” (stray text, arrows).
FG-SMILES: A modified SMILES syntax designed to handle functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds.
Encoder-Free Architecture: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.

Methodology and Benchmarking Against OSRA

Training: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.
Validation (Synthetic): Evaluated on a hold-out set of 1M synthetic images.
Validation (Real World):
- Dataset A: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.
- Dataset B: 296 structures systematically extracted from Journal of Organic Chemistry (one paper per issue from 2020) to reduce selection bias.
Comparison: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.

Results: High-Precision Extraction and Key Limitations

Performance:
- Synthetic: 90.7% exact match accuracy.
- Real Data (Dataset A): Image2SMILES achieved 79.2% accuracy compared to OSRA’s 62.1%.
- Real Data (Dataset B): Image2SMILES achieved 62.5% accuracy compared to OSRA’s 24.0%.
Confidence Correlation: There is a strong correlation between the model’s confidence score and prediction validity. Thresholding at 0.995 yields 99.85% accuracy while ignoring 22.5% of data, enabling high-precision automated pipelines.
Key Failures: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confusion of R-group indices ($R’$ vs $R_1$), and explicit hydrogens rendered as groups.
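
The confidence thresholding described above trades coverage for precision. The sketch below shows the bookkeeping on made-up data; the function name and the `(confidence, correct)` pairs are hypothetical and do not reproduce the paper's numbers.

```python
def precision_at_threshold(results, threshold):
    """Return (coverage, accuracy) when keeping only predictions above the threshold."""
    kept = [correct for confidence, correct in results if confidence >= threshold]
    coverage = len(kept) / len(results)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Hypothetical (confidence, is_correct) pairs for eight predictions.
results = [
    (0.999, True), (0.998, True), (0.996, True), (0.990, False),
    (0.970, True), (0.600, False), (0.999, True), (0.400, False),
]

print(precision_at_threshold(results, 0.0))    # keep everything
print(precision_at_threshold(results, 0.995))  # keep only high-confidence outputs
```

In an automated pipeline, the rejected low-confidence images would typically be routed to manual review instead of being discarded.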

Reproducibility Details

Data

Source: A subset of 10 million molecules sampled from PubChem.
Selection Logic: Bias towards complex/rare structures using a “Full Coefficient” (FC) probability metric based on molecule size and ring/atom rarity.
- Formula: $BC=0.1+1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$ where $n_{\max}=60$.
Generation: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).
Contamination: “Visual noise” is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.
Target Format: FG-SMILES (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a v token.
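
As a worked example, the coefficient written as $BC$ in the formula above can be evaluated for a few values of $n$. The sketch treats $n$ as a molecule size measure, which is an assumption since the text does not define it, and the function name is hypothetical. How $BC$ combines with ring and atom rarity into the final sampling probability is not described here, so the sketch stops at $BC$.

```python
def size_coefficient(n, n_max=60):
    """BC = 0.1 + 1.2 * ((n_max - n) / n_max) ** 3, as written in the formula."""
    return 0.1 + 1.2 * ((n_max - n) / n_max) ** 3


# Evaluate the coefficient at a few sizes up to n_max.
for n in (0, 15, 30, 45, 60):
    print(n, round(size_coefficient(n), 4))
```

At $n = n_{\max}$ the cubic term vanishes and the coefficient equals its floor of 0.1. Behaviour for $n > n_{\max}$ is not specified in the text and is not handled by this sketch.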

Algorithms

Contamination Augmentation: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.
Functional Group Resolution: An algorithm identifies overlapping functional group templates (SMARTS) and resolves them to prevent nested group conflicts (e.g., resolving Methyl vs Methoxy).
Markush Support: Stochastic replacement of substituents with R-group labels ($R_1$, $R’$, etc.) based on a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$).

Models

Architecture: “Image-to-Sequence” hybrid model.
- Backbone: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.
- Neck: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.
- Decoder: Standard Transformer Decoder with parameters from the original Transformer architecture.
Input: Images resized to $384 \times 384 \times 3$.
Output: Sequence of FG-SMILES tokens.
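
The encoder-free hand-off from CNN to decoder amounts to reshaping the feature map into a sequence of location vectors. The sketch below shows that reshaping on a tiny nested list, plus the arithmetic for the real dimensions. The helper name is hypothetical, and the stride of 8 is inferred from the stated 384-pixel input and 48-pixel feature map, not taken from the paper.

```python
def flatten_feature_map(fmap):
    """Turn a C x H x W nested list into an (H*W) x C sequence of feature vectors."""
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Tiny stand-in feature map: 2 channels, 3 x 3 spatial grid.
fmap = [[[c * 100 + y * 10 + x for x in range(3)] for y in range(3)] for c in range(2)]
seq = flatten_feature_map(fmap)
print(len(seq), seq[0], seq[4])

# Real dimensions: 384-pixel input, inferred stride 8, 512 channels.
input_size, stride, channels = 384, 8, 512
side = input_size // stride
num_tokens = side * side
print(side, num_tokens, channels)
```

Each of the resulting vectors plays the role of one "source token" that the Transformer Decoder can attend to through cross-attention.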

Evaluation

Metric: Binary “Exact Match” (valid/invalid).
- Strict criteria: Stereo and R-group indices must match exactly (e.g., $R’$ vs $R_1$ is a failure).
Datasets:
- Internal: 5% random split of generated data (500k samples).
- External (Dataset A & B): Manually cropped real-world images from specified journals.

Hardware

Training: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.
Duration: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.
Optimizer: RAdam with learning rate $3 \cdot 10^{-4}$.

Artifacts

Artifact	Type	License	Notes
Data Generator (GitHub)	Code	MIT	Synthetic training data generator
1M Generated Samples (Zenodo)	Dataset	Unknown	Randomly generated image-SMILES pairs
Real-World Test Images (Zenodo)	Dataset	Unknown	Cropped structures from real papers with target FG-SMILES
Syntelly Demo	Other	Proprietary	Web demo for PDF-to-SMILES extraction

Paper Information

Citation: Khokhlov, I., Krasnov, L., Fedorov, M. V., & Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. Chemistry-Methods, 2(1), e202100069. https://doi.org/10.1002/cmtd.202100069

Publication: Chemistry-Methods 2022

Additional Resources:

@article{khokhlovImage2SMILESTransformerBasedMolecular2022,
  title = {Image2SMILES: Transformer-Based Molecular Optical Recognition Engine},
  shorttitle = {Image2SMILES},
  author = {Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey},
  year = {2022},
  journal = {Chemistry-Methods},
  volume = {2},
  number = {1},
  pages = {e202100069},
  issn = {2628-9725},
  doi = {10.1002/cmtd.202100069},
  url = {https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}
}

ICMDT: Automated Chemical Structure Image Recognition

Thu, 18 Dec 2025 00:00:00 +0000

Contribution: Image-to-Text Translation for Chemical Structures

This is a Method paper.

It proposes a novel neural network architecture, the Image Captioning Model based on Deep TNT (ICMDT), to solve the specific problem of “molecular translation” (image-to-text). The classification is supported by the following rhetorical indicators:

Novel Mechanism: It introduces the “Deep TNT block” to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).
Baseline Comparison: The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).
Ablation Study: Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.

Motivation: Digitizing Historical Chemical Literature

The primary motivation is to speed up chemical research by digitizing historical chemical literature.

Problem: Historical sources often contain corrupted or noisy images, making automated recognition difficult.
Gap: Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.
Goal: To build a dependable generative model that can accurately translate these noisy images into InChI (International Chemical Identifier) text strings.

Novelty: Multi-Level Feature Fusion with Deep TNT

The core contribution is the Deep TNT block and the resulting ICMDT architecture.

Deep TNT Block: The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
1. Internal Transformer: Processes pixel embeddings.
2. Middle Transformer: Processes small patch embeddings.
3. Exterior Transformer: Processes large patch embeddings.
Multi-level Fusion: The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.
Position Encoding: A specific strategy of applying shared position encodings to small patches and pixels, while using a learnable 1D encoding for large patches.

Methodology: Benchmarking on the BMS Dataset

The authors evaluated the model on the Bristol-Myers Squibb Molecular Translation dataset.

Baselines: They constructed four comparative models:
- EfficientNetb0 + RNN (Bi-LSTM)
- ResNet50d + RNN (Bi-LSTM)
- EfficientNetb0 + Transformer
- ResNet101d + Transformer
Ablation: They tested the impact of removing the large patch position encoding (ICMDT*), reverting the encoder to a standard TNT-S (TNTD), and setting the patch size to 32 directly on TNT-S without the exterior transformer block (TNTD-B).
Pre-processing Study: They experimented with denoising ratios and cropping strategies.

Results & Conclusions: Improved InChI Translation Accuracy

Performance: ICMDT achieved the lowest Levenshtein distance (0.69) among all five models tested (Table 3). The best-performing baseline was ResNet101d+Transformer.
Convergence: The model converged significantly faster than the baselines, outperforming others as early as epoch 6.7.
Ablation Results: The full Deep TNT block reduced error by nearly half compared to the standard TNT encoder (0.69 vs 1.29 Levenshtein distance). Removing large patch position encoding (ICMDT*) degraded performance to 1.04, and directly using patch size 32 on TNT-S (TNTD-B) scored 1.37.
Limitations: The model struggles with stereochemical layers (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.
Inference & Fusion: The multi-model inference and fusion pipeline (beam search, TTA, step-wise logit ensemble, and voting) reduced the Levenshtein distance by 0.24 to 2.5 compared with single models.
Future Work: Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.

Reproducibility

Status: Partially Reproducible. The dataset is publicly available through Kaggle, and the paper provides detailed hyperparameters and architecture specifications. However, no source code or pretrained model weights have been released.

Artifact	Type	License	Notes
BMS Molecular Translation (Kaggle)	Dataset	Competition Terms	Training/test images with InChI labels

Missing components: No official code repository or pretrained weights. Reimplementation requires reconstructing the Deep TNT block, training pipeline, and inference/fusion strategy from the paper description alone.

Hardware/compute requirements: Not explicitly stated in the paper.

Data

The experiments used the Bristol-Myers Squibb Molecular Translation dataset from Kaggle.

Purpose	Dataset	Size	Notes
Training	BMS Training Set	2,424,186 images	Supervised; contains noise and blur
Evaluation	BMS Test Set	1,616,107 images	Higher noise variation than training set

Pre-processing Strategy:

Effective: Padding resizing (reshaping to square using the longer edge, padding insufficient parts with pixels from the middle of the image).
Ineffective: Smart cropping (removing white borders degraded performance).
Augmentation: GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).
Denoising: Best results found by mixing denoised and original data (Ratio 2:13) during training.

Algorithms

Optimizer: Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).
Loss Function: Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. Standard Focal Loss adds a modulating factor $(1-p_t)^\gamma$ to cross-entropy to focus training on hard examples. Anti-Focal Loss (Raunak et al., 2020) modifies this factor to reduce the disparity between training and inference distributions in Seq2Seq models.
Training Schedule:
- Initial resolution: $224 \times 224$
- Fine-tuning: Resolution $384 \times 384$ for samples with label length $>150$.
- Batch size: Dynamic, increasing from 16 to 1024 (with proportional learning rate scaling).
- Noisy Labels: Randomly replacing chemical elements in labels with a certain probability to improve robustness during inference.
Inference Strategy:
- Beam Search ($k=16$ initially, $k=64$ if failing InChI validation).
- Test Time Augmentation (TTA): Rotations of $90^\circ$.
- Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.
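
The beam search fallback in the inference strategy above can be sketched as a small control loop. Everything named here is a stand-in: `decode_with_fallback`, `fake_decode`, and `fake_is_valid` are hypothetical, and a real pipeline would call the trained model and a chemistry toolkit's InChI parser instead.

```python
def decode_with_fallback(decode_fn, is_valid_fn, image, beam_widths=(16, 64)):
    """Try each beam width in order and return the first valid prediction."""
    candidate = None
    for k in beam_widths:
        candidate = decode_fn(image, beam_width=k)
        if is_valid_fn(candidate):
            return candidate, k
    # Nothing validated: return the last attempt with the widest beam.
    return candidate, beam_widths[-1]


def fake_decode(image, beam_width):
    """Stand-in decoder: only the wider beam finds a well-formed string."""
    return "InChI=1S/CH4/h1H4" if beam_width >= 64 else "InChI=1S/CH4/h1H"


def fake_is_valid(inchi):
    """Stand-in validator: a real one would parse the InChI with a toolkit."""
    return inchi.endswith("h1H4")


print(decode_with_fallback(fake_decode, fake_is_valid, image=None))
```

Running the wider beam only on failures keeps the average inference cost close to that of $k=16$ while still recovering some of the hard cases.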

Models

ICMDT Architecture:

Encoder (Deep TNT) (Depth: 12 layers):
- Internal Block: Dim 160, Heads 4, Hidden size 640, MLP act GELU, Pixel patch size 4.
- Middle Block: Dim 10, Heads 6, Hidden size 128, MLP act GELU, Small patch size 16.
- Exterior Block: Dim 2560, Heads 10, Hidden size 5120, MLP act GELU, Large patch size 32.
Decoder (Vanilla Transformer):
- Decoder dim: 2560, FFN dim: 1024.
- Depth: 3 layers, Heads: 8.
- Vocab size: 193 (InChI tokens), text_dim: 384.

Evaluation

Metric: Levenshtein Distance (the minimum number of single-character insertions, deletions, or substitutions needed to turn the generated InChI string into the ground truth; lower is better).
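
For reference, the metric can be computed with the standard dynamic-programming recurrence. The sketch below is a generic implementation, not the competition's scoring code; fractional scores such as 0.69 are presumably averages of this distance over the test set.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# A prediction that gets one hydrogen count wrong is one edit away.
print(levenshtein("InChI=1S/CH4/h1H3", "InChI=1S/CH4/h1H4"))  # 1
```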

Ablation Results (Table 3 from paper):

Model	Params (M)	Levenshtein Distance
ICMDT	138.16	0.69
ICMDT*	138.16	1.04
TNTD	114.36	1.29
TNTD-B	114.36	1.37

Baseline Comparison (from convergence curves, Figure 9):

Model	Params (M)	Convergence (Epochs)
ICMDT	138.16	~9.76
ResNet101d + Transformer	302.02	14+
EfficientNetb0 + Transformer	-	-
ResNet50d + RNN	90.6	14+
EfficientNetb0 + RNN	46.3	-

Paper Information

Citation: Li, Y., Chen, G., & Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. Applied Sciences, 12(2), 680. https://doi.org/10.3390/app12020680

Publication: MDPI Applied Sciences 2022

Additional Resources:

Kaggle Competition: BMS Molecular Translation

@article{liAutomatedRecognitionChemical2022,
  title = {Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}},
  author = {Li, Yanchi and Chen, Guanyu and Li, Xiang},
  year = 2022,
  month = jan,
  journal = {Applied Sciences},
  volume = {12},
  number = {2},
  pages = {680},
  publisher = {Multidisciplinary Digital Publishing Institute},
  issn = {2076-3417},
  doi = {10.3390/app12020680}
}

End-to-End Transformer for Molecular Image Captioning

Thu, 18 Dec 2025 00:00:00 +0000

Methodological Contribution

This is a Methodological Paper. It proposes a novel architectural approach to molecular image translation by replacing the standard CNN encoder with a Vision Transformer (ViT). The authors validate this method through comparative benchmarking against standard CNN+RNN baselines (e.g., ResNet+LSTM) and provide optimizations for inference speed.

Motivation and Problem Statement

The core problem addressed is that existing molecular translation methods (which extract chemical structures from images into the machine-readable InChI format) rely heavily on rule-based systems or CNN+RNN architectures. These approaches often underperform on noisy images (common in scanned old journals) or images with few distinguishable features. There is a significant need in drug discovery to digitize and analyze legacy experimental data locked in image format within scientific publications.

Core Innovations: End-to-End ViT Encoder

The primary contribution is the use of a completely convolution-free Vision Transformer (ViT) as the encoder, allowing the model to utilize long-range dependencies among image patches from the very beginning via self-attention:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

The architecture is a pure Transformer (Encoder-Decoder), treating the molecular image similarly to a sequence of tokens (patches). Furthermore, the authors implement a specific caching strategy for the decoder to avoid recomputing embeddings for previously decoded tokens, reducing the time complexity of the decoding step.

Experimental Setup and Baselines

The model was compared against a standard CNN + RNN and ResNet (18, 34, 50) + LSTM with attention. Ablation studies were conducted varying the number of transformer layers (3, 6, 12, 24) and image resolution (224x224 vs 384x384). The model was trained on a large combined dataset, including Bristol Myers Squibb data, SMILES, GDB-13, and synthetically augmented images containing noise and artifacts. Performance was evaluated using the Levenshtein distance metric, which computes the minimum number of single-character edits to transform the predicted string into the ground truth.

Performance Outcomes and Capabilities

The proposed 24-layer ViT model (input size 384) achieved the lowest Levenshtein distance of 6.95, outperforming the ResNet50+LSTM baseline (7.49) and the standard CNN+RNN (103.7). Increasing the number of layers had a strong positive impact, with the 24-layer model becoming competitive with current approaches. The authors note the model was evaluated on datasets with low distinguishable features and noise, where the ViT encoder’s self-attention over all patches from the first layer helped capture relevant structure. The proposed caching optimization reduced the total decoding time complexity from $O(MN^2 + N^3)$ to $O(MN + N^2)$ for $N$ timesteps, by reducing the per-timestep cost to $O(M + N)$.
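
The complexity claim can be checked by counting attended positions per decoding step. The sketch below is a simplified cost model, not the authors' implementation: it assumes $M$ is the number of encoder tokens, considers a single decoder layer, and charges one unit per (query, key) pair. The function names are illustrative.

```python
def cost_without_cache(M: int, N: int) -> int:
    """At step t all t decoded tokens are re-processed; each attends to
    t decoder positions and M encoder positions."""
    return sum(t * (t + M) for t in range(1, N + 1))

def cost_with_cache(M: int, N: int) -> int:
    """At step t only the newest token is processed; it attends to
    t cached decoder positions and M encoder positions."""
    return sum(t + M for t in range(1, N + 1))

# Doubling N roughly multiplies the uncached cost by 8 when N dominates M,
# while the cached cost grows by roughly 4.
for n in (100, 200):
    print(n, cost_without_cache(10, n), cost_with_cache(10, n))
```

Summing the per-step terms gives $O(MN^2 + N^3)$ in the first case and $O(MN + N^2)$ in the second, matching the figures quoted above.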

Reproducibility Details

Data

The model was trained on a combined dataset randomly split into 70% training, 10% test, and 20% validation.

Dataset	Description	Notes
Bristol Myers Squibb	~2.4 million synthetic images with InChI labels.	Provided by BMS global biopharmaceutical company.
SMILES	Kaggle contest data converted to InChI.	Images generated using RDKit.
GDB-13	Subset of 977 million small organic molecules (up to 13 atoms).	Converted from SMILES using RDKit.
Augmented Images	Synthetic images with salt/pepper noise, dropped atoms, and bond modifications.	Used to improve robustness against noise.

Algorithms

Training Objective: Cross-entropy loss minimization.
Inference Decoding: Autoregressive decoding predicting the next character of the InChI string.
Positional Encoding: Standard sine and cosine functions of different frequencies.
Optimization:
- Caching: Caches the output of each layer during decoding to avoid recomputing embeddings for already decoded tokens.
- JIT: PyTorch JIT compiler used for graph optimization (1.2-1.5x speed increase on GPU).
- Self-Critical Training: Finetuning performed using self-critical sequence training (SCST).
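
The sinusoidal positional encoding mentioned above follows the standard Transformer definition, where even dimensions use a sine and odd dimensions a cosine of the same frequency. A minimal pure-Python sketch (the function name and the table layout are illustrative):

```python
import math

def sinusoidal_position_encoding(num_positions: int, d_model: int) -> list:
    """Return a num_positions x d_model table of sine/cosine position features."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_position_encoding(num_positions=4, d_model=8)
print(pe[0])  # position 0 alternates 0.0 (sine) and 1.0 (cosine)
```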

Models

Encoder (Vision Transformer):
- Input: Flattened 2D patches of the image. Patch size: $16 \times 16$.
- Projection: Trainable linear projection to latent vector size $D$.
- Structure: Alternating layers of Multi-Head Self-Attention (MHSA) and MLP blocks.
Decoder (Vanilla Transformer):
- Input: Tokenized InChI string + sinusoidal positional embedding.
- Vocabulary: 275 tokens (including special tokens).
Hyperparameters (Best Model):
- Image Size: $384 \times 384$.
- Layers: 24.
- Feature Dimension: 512.
- Attention Heads: 12.
- Optimizer: Adam.
- Learning Rate: $3 \times 10^{-5}$ (decayed by 0.5 in last 2 epochs).
- Batch Size: Varied [64-512].
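
To make the patch arithmetic concrete, the sketch below counts the $16 \times 16$ patches for the two image sizes used in the ablation. It assumes non-overlapping patches and an image size divisible by the patch size; `num_patches` is an illustrative helper, not a function from the paper.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

print(num_patches(224))  # 196 patches (14 x 14)
print(num_patches(384))  # 576 patches (24 x 24)
```

Under these assumptions the encoder sequence is roughly three times longer at 384 pixels than at 224.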

Evaluation

Primary Metric: Levenshtein Distance (lower is better).

Model	Image Size	Layers	Epochs	Levenshtein Dist.
Standard CNN+RNN	224	3	10	103.7
ResNet18 + LSTM	224	4	10	75.03
ResNet34 + LSTM	224	4	10	45.72
ResNet50 + LSTM	224	5	10	7.49
ViT Transformers	224	3	5	79.82
ViT Transformers	224	6	5	54.58
ViT Transformers	224	12	5	31.30
ViT Transformers (Best)	384	24	10	6.95

Hardware

System: 70GB GPU system.
Framework: PyTorch and PyTorch Lightning.

Paper Information

Citation: Sundaramoorthy, C., Kelvin, L. Z., Sarin, M., & Gupta, S. (2021). End-to-End Attention-based Image Captioning. arXiv preprint arXiv:2104.14721. https://doi.org/10.48550/arXiv.2104.14721

Publication: arXiv 2021 (preprint)

Note: This is an arXiv preprint and has not undergone formal peer review.

@misc{sundaramoorthyEndtoEndAttentionbasedImage2021,
  title = {End-to-{{End Attention-based Image Captioning}}},
  author = {Sundaramoorthy, Carola and Kelvin, Lin Ziwen and Sarin, Mahak and Gupta, Shubham},
  year = 2021,
  month = apr,
  number = {arXiv:2104.14721},
  eprint = {2104.14721},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2104.14721},
  archiveprefix = {arXiv}
}

DECIMER 1.0: Transformers for Chemical Image Recognition

Thu, 18 Dec 2025 00:00:00 +0000

Evaluating the Contribution: A Methodological Shift

Method (Dominant) with strong Resource elements.

This is primarily a Method paper because it proposes a specific architectural evolution. It replaces CNN-RNN/Encoder-Decoder models with a Transformer-based network to solve the problem of image-to-structure translation. It validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.

It also serves as a Resource contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (>35 million molecules).

Motivation: Inaccessible Chemical Knowledge

Data Inaccessibility: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.
Manual Bottlenecks: Manual curation and extraction of this data is tedious, slow, and error-prone.
Limitations of Prior Tools: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.

Key Innovation: Transformer-Based Molecular Translation

Transformer Architecture: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a Transformer-based decoder, significantly improving accuracy.
EfficientNet Backbone: Replaces the standard InceptionV3 feature extractor with EfficientNet-B3, which improved feature extraction quality for chemical images.
SELFIES Representation: Utilizes SELFIES (SELF-referencing Embedded Strings) as the target output. Every SELFIES string decodes to a valid molecule, which eliminates the “invalid SMILES” problem common in generative models.
Massive Scaling: Trains on synthetic datasets derived from PubChem (up to 39 million molecules total, with the largest training subset at ~35 million), demonstrating that scaling data size directly correlates with improved model performance.

Methodology and Experimental Validation

Feature Extractor Ablation: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.
Architecture Comparison: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics. The structural similarity between predicted and ground truth molecules was measured via Tanimoto similarity over molecular fingerprints: $$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$
Data Scaling: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe how accuracy scales with data.
Stereochemistry & Ions: Tested the model’s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.
Augmentation Robustness: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.
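
The Tanimoto formula above is straightforward to evaluate on binary fingerprints, where $\mathbf{A} \cdot \mathbf{B}$ counts shared on-bits and $|\mathbf{A}|^2$ counts the on-bits of $\mathbf{A}$. A minimal sketch on toy bit vectors follows; real fingerprints would come from a cheminformatics toolkit, and returning 1.0 for two all-zero vectors is a convention assumed here.

```python
def tanimoto(a: list, b: list) -> float:
    """Tanimoto similarity between two equal-length fingerprint vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 1.0  # assumed convention for empty fingerprints

predicted = [1, 1, 0, 1]
reference = [1, 0, 0, 1]
print(tanimoto(predicted, reference))  # 2 shared bits / 3 bits in the union
```

A score of 1.0 means the fingerprints are identical, which is why the paper separately checks isomorphism: two different molecules can share a fingerprint.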

Results and Scaling Observations

Architecture Comparison: The Transformer model with EfficientNet-B3 features outperformed the Encoder-Decoder baseline by a wide margin. On the 1M dataset, the Transformer achieved 74.57% exact matches (Tanimoto 1.0) compared to only 7.03% for the Encoder-Decoder (Table 4 in the paper).
High Accuracy at Scale: With the full 35-million molecule training set (Dataset 1), the model achieved a Tanimoto 1.0 score of 96.47% and an average Tanimoto similarity of 0.99.
Isomorphism: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via InChI).
Stereochemistry Costs: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).
Hardware Efficiency: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs. For the 1M molecule model, convergence took ~8h 41min on TPU v3-8 vs ~29h 48min on V100 GPU. The largest model (35M) took less than 14 days on TPU.
Augmentation Robustness (Dataset 3): When trained on augmented images and tested on non-augmented images, the model achieved 86.43% Tanimoto 1.0. Using a pre-trained model from Dataset 2 and refitting on augmented images improved this to 88.04% on non-augmented test images and 80.87% on augmented test images, retaining above 97% isomorphism rates.

Reproducibility Details

Data

The authors generated synthetic data from PubChem.

Purpose	Dataset	Size	Notes
Training	Dataset 1 (Clean)	39M total (35M train)	No stereo/ions. Filtered for MW < 1500, bond count 3-40, SMILES len < 40.
Training	Dataset 2 (Complex)	37M total (33M train)	Includes stereochemistry and charged groups (ions).
Training	Dataset 3 (Augmented)	37M total (33M train)	Dataset 2 with image augmentations applied.
Preprocessing	N/A	N/A	Molecules converted to SELFIES. Images generated via CDK Structure Diagram Generator (SDG) as $299 \times 299$ 8-bit PNGs.
Format	TFRecords	75 MB chunks	128 Data points (image vector + tokenized string) per record.

Algorithms

Text Representation: SELFIES used to avoid invalid intermediate strings. Tokenized via Keras tokenizer.
- Dataset 1 Tokens: 27 unique tokens. Max length 47.
- Dataset 2/3 Tokens: 61 unique tokens (due to stereo/ion tokens).
Augmentation: Implemented using imgaug python package. Random application of:
- Gaussian/Average Blur, Additive Gaussian Noise, Salt & Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.
Optimization: Adam optimizer with a custom learning rate scheduler (following the “Attention is all you need” paper).
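
The schedule from “Attention is all you need” increases the learning rate linearly during warmup and then decays it with the inverse square root of the step. A minimal sketch, assuming `d_model = 512` (the decoder dimension listed under Models) and the original paper's default of 4000 warmup steps; the warmup value DECIMER actually used is not given in these notes.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks exactly at the end of warmup.
for s in (1, 2000, 4000, 8000):
    print(s, transformer_lr(s))
```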

Models

The final architecture is an Image-to-SELFIES Transformer.

Encoder (Feature Extractor):
- EfficientNet-B3 (pre-trained weights from Noisy Student training).
- Input: $299 \times 299 \times 3$ images (normalized -1 to 1).
- Output Feature Vector: $10 \times 10 \times 1536$.
Decoder (Transformer):
- 4 Encoder-Decoder layers.
- 8 Parallel Attention Heads.
- Dimension size: 512.
- Feed-forward size: 2048.
- Dropout: 0.1.

Evaluation

Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.

Metric	Value	Baseline	Notes
Tanimoto 1.0	96.47%	74.57% (1M subset)	Percentage of predictions with perfect fingerprint match (Dataset 1, 35M training).
Avg Tanimoto	0.9923	0.9371 (1M subset)	Average similarity score (Dataset 1, 35M training).
Isomorphism	99.75%	-	Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI).

Hardware

Training Hardware: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.
Comparison Hardware: Nvidia Tesla V100 (32GB GPU).
Performance:
- TPU v3-8 was ~4x faster than V100 GPU.
- 1 Million molecule model convergence: 8h 41min on TPU vs ~29h 48min on GPU.
- Largest model (35M) took less than 14 days on TPU.

Reproducibility

The paper is open-access, and both code and data are publicly available.

Artifact	Type	License	Notes
DECIMER-TPU (GitHub)	Code	MIT	Official implementation using TensorFlow and TPU training
Code Archive (Zenodo)	Code	MIT	Archival snapshot of the codebase
Training Data (Zenodo)	Dataset	Unknown	SMILES data used for training (images generated via CDK SDG)
DECIMER Project Page	Other	N/A	Project landing page

Hardware Requirements: Training requires TPU v3-8 (Google Cloud) or Nvidia V100 GPU. The largest model (35M molecules) took less than 14 days on TPU v3-8.
Missing Components: Augmentation parameters are documented in the paper (Table 14). Pre-trained model weights are available through the GitHub repository.

Paper Information

Citation: Rajan, K., Zielesny, A. & Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. Journal of Cheminformatics, 13(1), 61. https://doi.org/10.1186/s13321-021-00538-8

Publication: Journal of Cheminformatics 2021

@article{rajanDECIMER10Deep2021,
  title = {DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers},
  shorttitle = {DECIMER 1.0},
  author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
  year = {2021},
  month = {aug},
  journal = {Journal of Cheminformatics},
  volume = {13},
  number = {1},
  pages = {61},
  issn = {1758-2946},
  doi = {10.1186/s13321-021-00538-8},
  url = {https://doi.org/10.1186/s13321-021-00538-8}
}

Img2Mol: Accurate SMILES Recognition from Depictions

Wed, 17 Dec 2025 00:00:00 +0000

Method Classification

This is a method paper that introduces Img2Mol, a deep learning system for Optical Chemical Structure Recognition (OCSR). The work focuses on building a fast, accurate, and robust system for converting molecular structure depictions into machine-readable SMILES strings.

Systematization and Motivation

Vast amounts of chemical knowledge exist only as images in scientific literature and patents, making this data inaccessible for computational analysis, database searches, or machine learning pipelines. Manually extracting this information is slow and error-prone, creating a bottleneck for drug discovery and chemical research.

While rule-based OCSR systems like OSRA, MolVec, and Imago exist, they are brittle. Small variations in drawing style or image quality can cause them to fail. The authors argue that a deep learning approach, trained on diverse synthetic data, can generalize better across different depiction styles and handle the messiness of real-world images more reliably.

Two-Stage Architecture and Core Novelty

The novelty lies in a two-stage architecture that separates perception from decoding, combined with aggressive data augmentation to ensure robustness. The key contributions are:

1. Two-Stage Architecture with CDDD Embeddings

Img2Mol uses an intermediate representation to predict SMILES from pixels. A custom CNN encoder maps the input image to a 512-dimensional Continuous and Data-Driven Molecular Descriptor (CDDD) embedding - a pre-trained, learned molecular representation that smoothly captures chemical similarity. A pre-trained decoder then converts this CDDD vector into the final canonical SMILES string.

This two-stage design has several advantages:

The CDDD space is continuous and chemically meaningful, so nearby embeddings correspond to structurally similar molecules. This makes the regression task easier than learning discrete token sequences directly.
The decoder is pre-trained and fixed, so the CNN only needs to learn the image → CDDD mapping. This decouples the visual recognition problem from the sequence generation problem.
CDDD embeddings naturally enforce chemical validity constraints, reducing the risk of generating nonsensical structures.

2. Extensive Data Augmentation for Robustness

The model was trained on 11.1 million unique molecules from ChEMBL and PubChem, but the critical insight is how the training images were generated. To expose the CNN to maximum variation in depiction styles, the authors:

Used three different cheminformatics libraries (RDKit, OEChem, Indigo) to render images, each with its own drawing conventions
Applied wide-ranging augmentations: varying bond thickness, font size, rotation, resolution (originally 192-256 px; expanded to 190-2500 px in the final model), and other stylistic parameters
Over-sampled larger molecules to improve performance on complex structures, which are underrepresented in chemical databases

This ensures the network rarely sees the same depiction of a molecule twice, forcing it to learn invariant features.

3. Fast Inference

Because the architecture is a simple CNN followed by a fixed decoder, inference is very fast - especially compared to rule-based systems that rely on iterative graph construction algorithms. This makes Img2Mol practical for large-scale document mining.

Experimental Validation and Benchmarks

The evaluation focused on demonstrating that Img2Mol is more accurate, robust, and generalizable than existing rule-based systems:

Benchmark Comparisons: Img2Mol was tested on several standard OCSR benchmarks, including USPTO (patent images), University of Birmingham (UoB), CLEF, and JPO (Japanese Patent Office) datasets, against three open-source baselines: OSRA, MolVec, and Imago. No deep learning baselines were available at the time for comparison.
Resolution and Molecular Size Analysis: The initial model, Img2Mol(no aug.), was evaluated across different image resolutions and molecule sizes (measured by number of atoms) to understand failure modes. This revealed that:
- Performance degraded for molecules with >35 atoms
- Very high-resolution images lost detail when downscaled to the fixed input size
- Low-resolution images (where rule-based methods failed completely) were handled well
Data Augmentation Ablation: A final model, Img2Mol, was trained with the full augmentation pipeline (wider resolution range, over-sampling of large molecules). Performance was compared to the initial version to quantify the effect of augmentation.
Depiction Library Robustness: The model was tested on images generated by each of the three rendering libraries separately to confirm that training on diverse styles improved generalization.
Input Perturbation for Benchmark Fairness: For the smaller benchmark datasets (USPTO, UoB, CLEF, JPO), the authors applied slight random rotation (within +/-5 degrees) and shearing to each image five times to detect potential overfitting of rule-based methods to well-known benchmarks.
Generalization Tests: Img2Mol was evaluated on real-world patent images from the STAKER dataset, which were not synthetically generated. This tested whether the model could transfer from synthetic training data to real documents.
Hand-Drawn Molecule Recognition: As an exploratory test, the authors evaluated performance on hand-drawn molecular structures, a task the model was never trained for, to see if the learned features could generalize to completely different visual styles.
Speed Benchmarking: Inference time was measured and compared to rule-based baselines to demonstrate the practical efficiency of the approach.

Results, Conclusions, and Limitations

Key benchmark results from Table 1 of the paper (accuracy / Tanimoto similarity, in %):

Benchmark	Img2Mol	MolVec 0.9.8	Imago 2.0	OSRA 2.1
Img2Mol test set	88.25 / 95.27	2.59 / 13.03	0.02 / 4.74	2.59 / 13.03
STAKER	64.33 / 83.76	5.32 / 31.78	0.07 / 5.06	5.23 / 26.98
USPTO	42.29 / 73.07	30.68 / 65.50	5.07 / 7.28	6.37 / 44.21
UoB	78.18 / 88.51	75.01 / 86.88	5.12 / 7.19	70.89 / 85.27
CLEF	48.84 / 78.04	44.48 / 76.61	26.72 / 41.29	17.04 / 58.84
JPO	45.14 / 69.43	49.48 / 66.46	23.18 / 37.47	33.04 / 49.62

Per-library accuracy on a 5,000-compound subset (depicted five times each):

Library	Img2Mol	MolVec	Imago	OSRA
RDKit	93.4%	3.7%	0.3%	4.4%
OEChem	89.5%	33.4%	12.3%	26.3%
Indigo	79.0%	22.2%	4.2%	22.6%

Substantial Performance Gains: Img2Mol outperformed all three rule-based baselines on nearly every benchmark. The exception is JPO, where MolVec achieved higher accuracy (49.48% vs. 45.14%), although Img2Mol still had the higher Tanimoto similarity (69.43% vs. 66.46%). Performance was measured both as exact-match SMILES accuracy and as Tanimoto similarity (using ECFP6 1024-bit fingerprints). Even when Img2Mol did not predict the exact molecule, it often predicted a chemically similar one.
Robustness Across Conditions: The full Img2Mol model (with aggressive augmentation) showed consistent performance across all image resolutions and molecule sizes. In contrast, rule-based systems were “brittle” - performance dropped sharply with minor perturbations to image quality or style.
Depiction Library Invariance: Img2Mol’s accuracy remained high across all three rendering libraries (93.4% on RDKit, 89.5% on OEChem, 79.0% on Indigo), supporting the multi-library training strategy. Rule-based methods struggled particularly with RDKit-generated images, reaching at most 4.4% accuracy.
Strong Generalization to Real-World Data: Despite being trained exclusively on synthetic images, Img2Mol performed well on real patent images from the STAKER dataset. This suggests the augmentation strategy successfully captured the diversity of real-world depictions.
Overfitting in Baselines: Rule-based methods were most competitive on older benchmarks (for example, MolVec reached 75.01% accuracy on UoB), but all of them fell below 6% accuracy on newer datasets (Img2Mol’s test set, STAKER). This suggests they may be implicitly tuned to specific drawing conventions in legacy datasets.
Limited Hand-Drawn Recognition: Img2Mol could recognize simple hand-drawn structures but struggled with complex or large molecules. This is unsurprising given the lack of hand-drawn data in training, but it highlights a potential avenue for future work.
Speed Advantage: Img2Mol processed 5,000 images in approximately 4 minutes at the smallest input size, with compute time mostly independent of input resolution due to the fixed 224x224 rescaling. Rule-based methods showed sharply increasing compute times at higher resolutions.

The work establishes that deep learning can outperform traditional rule-based OCSR systems when combined with a principled two-stage architecture and comprehensive data augmentation. The CDDD embedding acts as a bridge between visual perception and chemical structure, providing a chemically meaningful intermediate representation that improves both accuracy and robustness. The focus on synthetic data diversity proves to be an effective strategy for generalizing to real-world documents.

Reproducibility Details

Models

Architecture: Custom 8-layer Convolutional Neural Network (CNN) encoder

Input: $224 \times 224$ pixel grayscale images
Backbone Structure: 8 convolutional layers organized into 3 stacks, followed by 3 fully connected layers
- Stack 1: 3 Conv layers ($7 \times 7$ filters, stride 3, padding 4) + Max Pooling
- Stack 2: 2 Conv layers + Max Pooling
- Stack 3: 3 Conv layers + Max Pooling
- Head: 3 fully connected layers
Output: 512-dimensional CDDD embedding vector

Decoder: Pre-trained CDDD decoder (from Winter et al.) - fixed during training, not updated

Algorithms

Loss Function: Mean Squared Error (MSE) regression on the residual $d$ between the true and predicted embeddings:

$$ d = \text{cddd}_{\text{true}} - \text{cddd}_{\text{predicted}}, \qquad l(d) = \frac{1}{n} \sum_{i=1}^{n} d_i^2 $$

where $n = 512$ is the dimension of the CDDD embedding.

Optimizer: AdamW with initial learning rate $10^{-4}$

Training Schedule:

Batch size: 256
Training duration: 300 epochs
Plateau scheduler: Multiplies learning rate by 0.7 if validation loss plateaus for 10 epochs
Early stopping: Triggered if no improvement in validation loss for 50 epochs
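
The interaction between the plateau scheduler and early stopping can be replayed on a list of validation losses. The sketch below is a simplified model of the schedule described above, not the authors' training loop. It decays after exactly `lr_patience` non-improving epochs, which may differ by one epoch from a framework scheduler, and all names are illustrative.

```python
def replay_schedule(val_losses, lr=1e-4, factor=0.7, lr_patience=10, stop_patience=50):
    """Replay per-epoch validation losses; return (final_lr, epochs_run)."""
    best = float("inf")
    since_best = 0   # epochs since the best validation loss improved
    since_decay = 0  # non-improving epochs since the last improvement or decay
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_decay = loss, 0, 0
        else:
            since_best += 1
            since_decay += 1
        if since_decay >= lr_patience:
            lr *= factor  # plateau: shrink the learning rate
            since_decay = 0
        if since_best >= stop_patience:
            return lr, epoch  # early stop
    return lr, len(val_losses)

# A loss that never improves after epoch 1 triggers five decays, then stops at epoch 51.
print(replay_schedule([1.0] * 300))
```

Under these settings a run can absorb up to five learning-rate reductions on a single plateau before early stopping ends it.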

Noise Tolerance: The decoder requires the CNN to predict embeddings with noise level $\sigma \le 0.15$ to achieve >90% accuracy

Data

Training Data: 11.1 million unique molecules from ChEMBL and PubChem

Splits: Approximately 50,000 examples each for validation and test sets

Synthetic Image Generation:

Three cheminformatics libraries: RDKit, OEChem, and Indigo
Augmentations: Resolution (190-2500 pixels), rotation, bond thickness, font size
Salt stripping: Keep only the largest fragment
Over-sampling: Larger molecules (>35 atoms) over-sampled to improve performance

Evaluation

Metrics:

Exact SMILES match accuracy
Tanimoto similarity (chemical fingerprint-based structural similarity)

Benchmarks:

Img2Mol test set (25,000 synthetic images at 224x224 px)
STAKER (30,000 real-world USPTO patent images at 256x256 px)
USPTO (4,852 patent images, avg. 649x417 px)
UoB (5,716 images from University of Birmingham, avg. 762x412 px)
CLEF (711 images, avg. 1243x392 px)
JPO (365 Japanese Patent Office images, avg. 607x373 px)
Hand-drawn molecular structures (exploratory, no defined benchmark)

Baselines: OSRA, MolVec, Imago (rule-based systems)

Hardware

⚠️ Unspecified in paper or supplementary materials. Inference speed reported as ~4 minutes for 5000 images; training hardware (GPU model, count) is undocumented.

Artifacts

Artifact	Type	License	Notes
Img2Mol GitHub	Code	Apache 2.0	Official implementation
Img2Mol model weights	Model	CC BY-NC 4.0	Non-commercial use only

Known Limitations

Molecular Size: Performance degrades for molecules with >35 atoms. This is partly a property of the CDDD latent space itself: for larger molecules, the “volume of decodable latent space” shrinks, making the decoder more sensitive to small noise perturbations in the predicted embedding.

Paper Information

Citation: Clevert, D.-A., Le, T., Winter, R., & Montanari, F. (2021). Img2Mol – accurate SMILES recognition from molecular graphical depictions. Chemical Science, 12(42), 14174–14181. https://doi.org/10.1039/d1sc01839f

Publication: Chemical Science (2021)

Deep Learning for Molecular Structure Extraction (2019)

Wed, 17 Dec 2025 00:00:00 +0000

Contribution Type: Method and Resource

This is primarily a methodological paper with a secondary resource contribution.

Method: It proposes a novel end-to-end deep learning architecture (Segmentation U-Net + Recognition Encoder-Decoder) to replace traditional rule-based optical chemical structure recognition (OCSR) systems.

Resource: It details a pipeline for generating large-scale synthetic datasets (images overlaying patent/journal backgrounds) necessary to train the deep learning models.

Motivation: Overcoming Brittle Rule-Based Systems

Existing tools for extracting chemical structures from literature (e.g., OSRA, CLIDE) rely on complex, handcrafted rules and heuristics (edge detection, vectorization). These systems suffer from:

Brittleness: They fail when image quality is low (low resolution, noise) or when artistic styles vary (wavy bonds, crossing lines).
Maintenance difficulty: Improvements require manual codification of new rules for every edge case, which is difficult to scale.
Data volume: The explosion of published life science papers (2000+ per day in Medline) creates a need for automated, robust curation tools, since manual curation cannot keep pace.

Core Innovation: End-to-End Pixel-to-SMILES Recognition

The authors present an end-to-end deep learning approach for this task that operates directly on raw pixels without explicit subcomponent recognition (e.g., detecting atoms and bonds separately). Key innovations include:

Pixel-to-SMILES: Treating structure recognition as an image captioning problem using an encoder-decoder architecture with attention, generating SMILES directly.
Low-Resolution Robustness: The model is trained on aggressively downsampled images (~60 dpi for segmentation, 256x256 for prediction), making it robust to poor quality and noisy inputs from legacy PDF extractions.
Implicit Superatom Handling: The model learns to recognize and generate sequences for superatoms (e.g., “OTBS”) contextually.

Experimental Setup and Large-Scale Synthetic Data

The authors validated their approach using a mix of large-scale synthetic training sets and real-world test sets:

Synthetic Generation: They created a segmentation dataset by overlaying USPTO molecules onto “whited-out” journal pages.
Ablation/Training: Metrics were tracked on Indigo (synthetic) and USPTO (real patent images) datasets.
External Validation:
- Valko Dataset: A standard benchmark of 454 heterogeneous images from literature.
- Proprietary Dataset: A collection of images from 47 articles and 5 patents to simulate real-world drug discovery curation.
Stress Testing: They analyzed performance distributions across molecular weight, heavy atom count, and rare elements (e.g., Uranium, Vanadium).

Results and Limitations in Complex Structures

High Accuracy on Standard Sets: The model achieved 82% accuracy on the Indigo validation set and 77% on the USPTO validation set. No apparent overfitting was observed on the Indigo data (57M training examples), though some overfitting occurred on the smaller USPTO set (1.7M training examples).
Real-World Viability: It achieved 83% accuracy on the proprietary internal test set, with validation and proprietary accuracies ranging from 77-83%, indicating the training sets reasonably approximate real drug discovery data.
Segmentation Quality: Low segmentation error rates were observed: only 3.3% of the Valko dataset and 6.6% of the proprietary images failed to segment properly.
Limitations on Complexity: Performance dropped to 41% on the Valko test set. Superatoms were the single largest contributor to prediction errors, with 21% of Valko samples containing one or more incorrectly predicted superatoms. Only 6.6% of total training images contained any superatom, limiting the model’s exposure.
Stereochemistry Challenges: 60% of compounds with incorrectly predicted stereochemistry had explicit stereochemistry in both the ground truth and the prediction, but with wrong configurations assigned (e.g., predicting R instead of S). The model often correctly identified which atoms have stereocenters but assigned the wrong direction, suggesting the architecture may not incorporate sufficient spatial context for configuration assignment.

Reproducibility Details

Data

The authors utilized three primary sources for generating training data. All inputs were strictly downsampled to improve robustness.

Purpose	Dataset	Size	Notes
Training	Indigo Set	57M	PubChem molecules rendered via Indigo (256x256).
Training	USPTO Set	1.7M	Image/SMILES pairs from public patent data.
Training	OS X Indigo	10M	Additional Indigo renders from Mac OS for style diversity.
Segmentation	Synthetic Pages	N/A	Generated by overlaying USPTO images on text-cleared PDF pages.

Preprocessing:

Segmentation Inputs: Grayscale, downsampled to ~60 dpi.
Prediction Inputs: Resized to 256x256 such that bond lengths are approximately 3-12 pixels.
Augmentation: Random affine transforms, brightness scaling, and binarization applied during training.

Algorithms

Segmentation Pipeline:

Multi-scale Inference: Masks generated at resolutions from 30 to 60 dpi (3 dpi increments) and averaged for the final mask.
Post-processing: Hough transform used to remove long straight lines (table borders). Mask blobs filtered by pixel count thresholds.
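
The multi-scale inference step can be sketched as follows. This is a minimal illustration, not the authors' code: `predict_mask` and `toy_predictor` are hypothetical stand-ins for the segmentation network, and the masks are assumed to have been resampled to a common grid before averaging.

```python
from typing import Callable, List

Mask = List[List[float]]  # per-pixel molecule probability, values in [0, 1]


def inference_resolutions(start: int = 30, stop: int = 60, step: int = 3) -> List[int]:
    """Resolutions in dpi, from `start` to `stop` inclusive."""
    return list(range(start, stop + 1, step))


def average_masks(masks: List[Mask]) -> Mask:
    """Pixel-wise mean of equally sized masks."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [sum(m[r][c] for m in masks) / len(masks) for c in range(cols)]
        for r in range(rows)
    ]


def multi_scale_mask(predict_mask: Callable[[int], Mask]) -> Mask:
    """Run the (hypothetical) predictor at every resolution and average the results."""
    return average_masks([predict_mask(dpi) for dpi in inference_resolutions()])


# Toy predictor: a 2x2 mask whose confidence grows with resolution.
def toy_predictor(dpi: int) -> Mask:
    p = dpi / 60
    return [[p, 0.0], [0.0, p]]


final_mask = multi_scale_mask(toy_predictor)
```

With 3 dpi increments between 30 and 60 dpi, this yields eleven masks per page. A real pipeline would threshold the averaged mask before the blob filtering described above.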

Prediction Pipeline:

Sequence Generation: SMILES generated character-by-character via greedy decoding. During inference, predictions are made at several low resolutions and the sequence with the highest confidence (product of per-character softmax outputs) is returned.
Attention-based Verification: Attention weights used to re-project predicted atoms back into 2D space to visually verify alignment with the input image.
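
The confidence-based selection across resolutions can be illustrated with a short sketch. The function names and the toy probabilities below are assumptions made for illustration; only the scoring rule (product of per-character softmax outputs) comes from the description above.

```python
import math
from typing import List, Sequence, Tuple

# Each candidate: (predicted SMILES, softmax probability of each emitted character).
Candidate = Tuple[str, Sequence[float]]


def sequence_confidence(char_probs: Sequence[float]) -> float:
    """Product of per-character probabilities, accumulated in log space."""
    if any(p <= 0.0 for p in char_probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in char_probs))


def pick_most_confident(candidates: List[Candidate]) -> str:
    """Return the SMILES whose sequence confidence is highest."""
    best_smiles, _ = max(candidates, key=lambda c: sequence_confidence(c[1]))
    return best_smiles


# Toy example: the same image decoded at three hypothetical resolutions.
candidates = [
    ("CCO", [0.90, 0.80, 0.70]),  # confidence 0.504
    ("CCO", [0.99, 0.98, 0.97]),  # confidence about 0.941
    ("CCN", [0.95, 0.90, 0.60]),  # confidence 0.513
]
best = pick_most_confident(candidates)
```

Because the score is a raw product, longer sequences are penalized relative to shorter ones; whether the authors normalize for length is not stated, so none is applied here.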

Models

1. Segmentation Model (U-Net Variant):

Architecture: U-Net style with skip connections.
Input: 128x128x1 grayscale image.
Layers: Alternating 3x3 Conv and 2x2 Max Pool.
Activation: Parametric ReLU (pReLU).
Parameters: ~380,000.

2. Structure Prediction Model (Encoder-Decoder):

Encoder: CNN with 5x5 convolutions, 2x2 Max Pooling, pReLU. No pooling in first layers to preserve fine features.
Decoder: 3 layers of GridLSTM cells.
Attention: Soft/Global attention mechanism conditioned on the encoder state.
Input: 256x256x1 image.
Output: Sequence of characters (vocab size 65).
Parameters: ~46.3 million.

Evaluation

Evaluation required an exact string match of the Canonical SMILES (including stereochemistry) to the ground truth.

Metric	Value	Dataset	Notes
Accuracy	82%	Indigo Val	Synthetic validation set
Accuracy	77%	USPTO Val	Real patent images
Accuracy	83%	Proprietary	Internal pharma dataset (real world)
Accuracy	41%	Valko Test	External benchmark; difficult due to superatoms

Hardware

Segmentation Training: 1 GPU, ~4 days (650k steps).
Prediction Training: 8 NVIDIA Pascal GPUs, ~26 days (1M steps).
Framework: TensorFlow.
Optimizer: Adam.

Artifacts

No public code, pre-trained models, or generated datasets were released with this paper. The training pipeline relies on publicly available molecular databases (PubChem, USPTO) and open-source rendering tools (Indigo), but the specific training sets, model weights, and inference code remain unavailable.

Paper Information

Citation: Staker, J., Marshall, K., Abel, R., & McQuaw, C. (2019). Molecular Structure Extraction From Documents Using Deep Learning. Journal of Chemical Information and Modeling, 59(3), 1017-1029. https://doi.org/10.1021/acs.jcim.8b00669

Publication: Journal of Chemical Information and Modeling (JCIM) 2019

Additional Resources:

Schrödinger Publication Page

@article{stakerMolecularStructureExtraction2019,
  title = {Molecular Structure Extraction From Documents Using Deep Learning},
  author = {Staker, Joshua and Marshall, Kyle and Abel, Robert and McQuaw, Carolyn},
  year = {2019},
  month = {feb},
  journal = {Journal of Chemical Information and Modeling},
  volume = {59},
  number = {3},
  pages = {1017--1029},
  doi = {10.1021/acs.jcim.8b00669},
  url = {https://doi.org/10.1021/acs.jcim.8b00669}
}

DECIMER: Deep Learning for Chemical Image Recognition

Wed, 17 Dec 2025 00:00:00 +0000

Contribution: Method for Optical Chemical Entity Recognition

This is primarily a Method ($\Psi_{\text{Method}}$) paper with a strong Resource ($\Psi_{\text{Resource}}$) component.

Method: It proposes a novel architecture (DECIMER) that repurposes “show-and-tell” image captioning networks for Optical Chemical Entity Recognition (OCER), providing an alternative to traditional rule-based segmentation pipelines.
Resource: It establishes a framework for generating large-scale synthetic training data using open-source cheminformatics tools (CDK) and databases (PubChem), circumventing the scarcity of manually annotated chemical images.

Motivation: Brittleness of Heuristic Pipelines

The extraction of chemical structures from scientific literature (OCER) is critical for populating open-access databases. Traditional OCER systems (like OSRA or CLiDE) rely on complex multi-step pipelines involving vectorization, character recognition, and graph compilation. These systems are brittle, and incorporating new structural features requires laborious engineering. Inspired by the success of deep neural network approaches like AlphaGo Zero, the authors sought to formulate an end-to-end deep learning approach that learns directly from data with minimal prior assumptions.

Novelty: Image Captioning for Molecular Graphs

Image-to-Text Formulation: The paper frames chemical structure recognition as an image captioning problem, translating a bitmap image directly into a SMILES string using an encoder-decoder network. This bypasses explicit segmentation of atoms and bonds entirely.
Synthetic Data Strategy: The authors generate synthetic images from PubChem using the CDK Structure Diagram Generator, scaling the dataset size to 15 million.
Robust String Representations: The study performs key ablation experiments on string representations, comparing standard SMILES against DeepSMILES to evaluate how syntactic validity affects the network’s learning capability.

Experimental Setup and Validation Strategies

Data Scaling: Models were trained on dataset sizes ranging from 54,000 to 15 million synthetic images to observe empirical scaling laws regarding accuracy and compute time.
Representation Comparison: The authors compared the validity of predicted strings and recognition accuracy when training on SMILES versus DeepSMILES. The cross-entropy loss for sequence generation can be written as: $$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x) $$ where $x$ is the input image and $y_t$ is the $t$-th output token.
Metric Evaluation: Performance was measured using Validity (syntactic correctness) and Tanimoto Similarity $T$, computed on molecular fingerprints to capture partial correctness even if the exact string prediction failed: $$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$
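
As a minimal sketch of the Tanimoto formula, fingerprints can be represented as sets of on-bit indices. The fingerprints below are made up for illustration, and treating two empty fingerprints as identical is an assumed convention rather than something the paper specifies.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0  # assumed convention: two empty fingerprints are identical
    common = len(a & b)
    return common / (len(a) + len(b) - common)


# Hypothetical fingerprints, given as indices of set bits.
fp_truth = {1, 4, 7, 9, 12}
fp_pred = {1, 4, 7, 10}

score = tanimoto(fp_truth, fp_pred)  # 3 shared bits / (5 + 4 - 3) = 0.5
```

A prediction that misses part of the structure still earns partial credit here, which is why the metric is useful when exact string matches are rare.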

Results and Critical Conclusions

Data Representation: DeepSMILES proved superior to standard SMILES for training stability and output validity. Preliminary tests suggested SELFIES performs even better (0.78 Tanimoto vs 0.53 for DeepSMILES at 6M images).
Scaling Behavior: Accuracy improves linearly with dataset size. The authors extrapolate that near-perfect detection would require training on 50 to 100 million structures.
Current Limitations: At the reported training scale (up to 15M), the model does not yet rival traditional heuristic approaches, but the learning curve suggests it is a viable trajectory given sufficient compute and data.

Reproducibility Details

Data

The training data is synthetic, generated using the Chemistry Development Kit (CDK) Structure Diagram Generator (SDG) based on molecules from PubChem.

Curation Rules (applied to PubChem data):

Molecular weight < 1500 Daltons.
Elements restricted to: C, H, O, N, P, S, F, Cl, Br, I, Se, B.
No counter ions or charged groups.
No isotopes (e.g., D, T).
Bond count between 5 and 40.
SMILES length < 40 characters.
Implicit hydrogens only (except in functional groups).
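
The curation rules can be expressed as a single predicate. This is a sketch under stated assumptions: `MoleculeRecord` is a hypothetical container for descriptors that would normally be computed with a cheminformatics toolkit, the bond-count bounds are taken as inclusive, and the explicit-hydrogen rule is not modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet

ALLOWED_ELEMENTS: FrozenSet[str] = frozenset(
    {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}
)


@dataclass(frozen=True)
class MoleculeRecord:
    """Hypothetical pre-computed descriptors for one PubChem entry."""
    smiles: str
    molecular_weight: float        # Daltons
    elements: FrozenSet[str]       # element symbols present
    bond_count: int
    has_charge_or_counter_ion: bool
    has_isotope_label: bool


def passes_curation(mol: MoleculeRecord) -> bool:
    """Apply the listed rules (bond-count bounds assumed inclusive)."""
    return (
        mol.molecular_weight < 1500
        and mol.elements <= ALLOWED_ELEMENTS
        and not mol.has_charge_or_counter_ion
        and not mol.has_isotope_label
        and 5 <= mol.bond_count <= 40
        and len(mol.smiles) < 40
    )


aspirin = MoleculeRecord(
    "CC(=O)OC1=CC=CC=C1C(=O)O", 180.16, frozenset({"C", "H", "O"}), 13, False, False
)
iron_atom = MoleculeRecord("[Fe]", 55.85, frozenset({"Fe"}), 0, False, False)

kept = [m.smiles for m in (aspirin, iron_atom) if passes_curation(m)]
```

Note that the 40-character SMILES cap is the most restrictive rule in practice, since it excludes many drug-sized molecules well below the 1500 Da weight limit.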

Preprocessing:

Images: Generated as 299x299 bitmaps to match Inception V3 input requirements.
Augmentation: One random rotation applied per molecule; no noise or blurring added in this iteration.

Purpose	Dataset	Size	Notes
Training	Synthetic (PubChem)	54k - 15M	Scaled across 12 experiments
Testing	Independent Set	6k - 1.6M	10% of training size

Algorithms

Architecture: "Show, Attend and Tell" (Attention-based Image Captioning).
Optimization: Adam optimizer with learning rate 0.0005.
Loss Function: Sparse Categorical Crossentropy.
Training Loop: Trained for 25 epochs per model. Batch size of 640 images.

Models

The network is implemented in TensorFlow 2.0.

Encoder: Inception V3 (Convolutional NN), used unaltered. Extracts feature vectors saved as NumPy arrays.
Decoder: Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) with soft attention mechanism.
Embeddings: Image embedding dimension size of 600.

Evaluation

The primary metric is Tanimoto similarity (Jaccard index) on PubChem fingerprints, which is robust for measuring structural similarity even when exact identity is not reached.

Metric	Definition
Tanimoto 1.0	Percentage of predictions that are chemically identical to ground truth (isomorphic).
Average Tanimoto	Mean similarity score across the test set (captures partial correctness).
Validity	Percentage of predicted strings that are valid DeepSMILES/SMILES.

Artifacts

Artifact	Type	License	Notes
DECIMER (Java utilities)	Code	MIT	CDK-based data generation and conversion tools
DECIMER-Image-to-SMILES	Code	MIT	TensorFlow training and inference scripts (archived)
PubChem	Dataset	Public Domain	Source of molecular structures for synthetic training data

Hardware

Training was performed on a single node.

GPU: 1x NVIDIA Tesla V100.
CPU: 2x Intel Xeon Gold 6230.
RAM: 384 GB.
Compute Time:
- Linear scaling with data size.
- 15 million structures took ~27 days (91,881s per epoch).
- Projected time for 100M structures: ~4 months on single GPU.
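
The ~27 day figure for the 15M run follows directly from the reported per-epoch time and the 25-epoch schedule. The short check below uses only those two reported numbers:

```python
SECONDS_PER_DAY = 24 * 60 * 60

seconds_per_epoch = 91_881  # reported for the 15M-structure run
epochs = 25                 # reported training length per model

total_seconds = seconds_per_epoch * epochs
total_days = total_seconds / SECONDS_PER_DAY

print(f"{total_seconds:,} s = {total_days:.1f} days")
```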

Paper Information

Citation: Rajan, K., Zielesny, A. & Steinbeck, C. (2020). DECIMER: towards deep learning for chemical image recognition. Journal of Cheminformatics, 12(1), 65. https://doi.org/10.1186/s13321-020-00469-w

Publication: Journal of Cheminformatics 2020

@article{rajanDECIMERDeepLearning2020,
  title = {{{DECIMER}}: Towards Deep Learning for Chemical Image Recognition},
  shorttitle = {{{DECIMER}}},
  author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
  year = 2020,
  month = oct,
  journal = {Journal of Cheminformatics},
  volume = {12},
  number = {1},
  pages = {65},
  issn = {1758-2946},
  doi = {10.1186/s13321-020-00469-w}
}

IMG2SMI: Translating Molecular Structure Images to SMILES

Sun, 14 Dec 2025 00:00:00 +0000

Contributions & Taxonomy

This is both a Method and Resource paper:

Method: It adapts standard image captioning architectures (encoder-decoder) to the domain of Optical Chemical Structure Recognition (OCSR), treating molecule recognition as a translation task.
Resource: It introduces MOLCAP, a large-scale dataset of 81 million molecules aggregated from public chemical databases, addressing the data scarcity that previously hindered deep learning approaches to OCSR.

The Bottleneck in Chemical Literature Translation

Chemical literature is “full of recipes written in a language computers cannot understand” because molecules are depicted as 2D images. This creates a fundamental bottleneck:

The Problem: Chemists must manually redraw molecular structures to search for related compounds or reactions. This is slow, error-prone, and makes large-scale literature mining impossible.
Existing Tools: Legacy systems like OSRA (Optical Structure Recognition Application) rely on handcrafted rules and often require human correction, making them unfit for unsupervised, high-throughput processing.
The Goal: An automated system that can translate structure images directly to machine-readable strings (SMILES/SELFIES) without human supervision, enabling large-scale knowledge extraction from decades of chemistry literature and patents.

Core Innovation: SELFIES and Image Captioning

The core novelty is demonstrating that how you represent the output text is as important as the model architecture itself. Key contributions:

Image Captioning Framework: Applies modern encoder-decoder architectures (ResNet-101 + Transformer) to OCSR, treating it as an image-to-text translation problem with a standard cross-entropy loss objective over the generation sequence: $$ \mathcal{L} = -\sum\limits_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x) $$
SELFIES as Target Representation: The key mechanism relies on using SELFIES (Self-Referencing Embedded Strings) as the output format. SELFIES is based on a formal grammar where every possible string corresponds to a valid molecule, eliminating the syntactic invalidity problems (unmatched parentheses, invalid characters) that plague SMILES generation.
MOLCAP Dataset: Created a comprehensive dataset of 81 million unique molecules from PubChem, ChEMBL, GDB-13, and other sources. Generated 256x256 pixel images using RDKit for 1 million training samples and 5,000 validation samples.
Task-Specific Evaluation: Demonstrated that traditional NLP metrics (BLEU) are poor indicators of scientific utility. Introduced evaluation based on molecular fingerprints (MACCS, RDK, Morgan) and Tanimoto similarity: $$ T(a, b) = \frac{c}{a + b - c} $$ where $c$ is the number of common fingerprint bits, and $a$ and $b$ are the number of set bits in each respective molecule’s fingerprint. This formulation reliably measures functional chemical similarity.

Experimental Setup and Ablation Studies

The evaluation focused on comparing IMG2SMI to existing systems and identifying which design choices matter most:

Baseline Comparisons: Benchmarked against OSRA (rule-based system) and DECIMER (an earlier deep learning approach) on the MOLCAP dataset to establish whether modern architectures could surpass traditional methods.
Ablation Studies: Extensive ablations isolating key factors:
- Decoder Architecture: Transformer vs. RNN/LSTM decoders
- Encoder Fine-tuning: Fine-tuned vs. frozen pre-trained ResNet weights
- Output Representation: SELFIES vs. character-level SMILES vs. BPE-tokenized SMILES (the most critical ablation)

Configuration	MACCS FTS	Valid Captions
RNN + Fixed Encoder	0.1526	N/A
RNN + Fine-tuned Encoder	0.4180	N/A
Transformer + Fixed Encoder	0.7674	61.1%
Transformer + Fine-tuned Encoder	0.9475	99.4%
Character-level SMILES (fine-tuned)	N/A	2.1%
BPE SMILES (2000 vocab, fine-tuned)	N/A	20.0%
SELFIES (fine-tuned)	0.9475	99.4%

Metric Analysis: Systematic comparison of evaluation metrics including BLEU, ROUGE, Levenshtein distance, exact match accuracy, and molecular fingerprint-based similarity measures.

Results, Findings, and Limitations

Performance Gains:

Metric	IMG2SMI	OSRA	DECIMER	Random Baseline
MACCS FTS	0.9475	0.3600	0.0000	0.3378
RDK FTS	0.9020	0.2790	0.0000	0.2229
Morgan FTS	0.8707	0.2677	0.0000	0.1081
ROUGE	0.6240	0.0684	0.0000	0.0422
Exact Match	7.24%	0.04%	0.00%	0.00%
Valid Captions	99.4%	65.2%	N/A	N/A

163% improvement over OSRA on MACCS Tanimoto similarity.
Roughly 9x improvement on ROUGE scores (0.6240 vs. 0.0684).
Average Tanimoto similarity exceeds 0.85 (functionally similar molecules even when not exact matches).
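
The headline gains can be re-derived from the table values. The helper names below are illustrative; the inputs are the reported IMG2SMI and OSRA scores.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100


def ratio(new: float, baseline: float) -> float:
    """How many times larger `new` is than `baseline`."""
    return new / baseline


maccs_gain = relative_improvement(0.9475, 0.3600)  # about 163%
rouge_ratio = ratio(0.6240, 0.0684)                # about 9.1x
```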

Key Findings:

SELFIES is Critical: Using SELFIES yields 99.4% valid molecules, compared to only ~2% validity for character-level SMILES.
Architecture Matters: Transformer decoder significantly outperforms RNN/LSTM approaches. Fine-tuning the ResNet encoder (vs. frozen weights) yields substantial performance gains (e.g., MACCS FTS: 0.7674 to 0.9475).
Metric Insights: BLEU is a poor metric for this task. Molecular fingerprint-based Tanimoto similarity is most informative because it measures functional chemical similarity.

Limitations:

Low Exact Match: Only 7.24% exact matches. The model captures the overarching functional groups and structure but misses fine details like exact double bond placement.
Complexity Bias: Trained on large molecules (average length >40 tokens), so it performs poorly on very simple structures where OSRA still excels.

Conclusion: The work shows that modern encoder-decoder architectures combined with valid-by-construction molecular representations (SELFIES) can outperform traditional rule-based systems by large margins on fingerprint-based similarity metrics. The system is useful for literature mining where functional similarity matters more than exact matches, though 7.24% exact match accuracy and poor performance on simple molecules indicate clear directions for future work.

Reproducibility Details

Models

Architecture: Image captioning system based on DETR (Detection Transformer) framework.

Visual Encoder:

Backbone: ResNet-101 pre-trained on ImageNet
Feature Extraction: 4th layer extraction (convolutions only)
Output: 2048-dimensional dense feature vector

Caption Decoder:

Type: Transformer encoder-decoder
Layers: 3 stacked encoder layers, 3 stacked decoder layers
Attention Heads: 8
Hidden Dimensions: 2048 (feed-forward networks)
Dropout: 0.1
Layer Normalization Epsilon: 1e-12

Training Configuration:

Optimizer: AdamW
Learning Rate: 5e-5 (selected after sweep from 1e-4 to 1e-6)
Weight Decay: 1e-4
Batch Size: 32
Epochs: 5
Codebase: Built on open-source DETR implementation

Data

MOLCAP Dataset:

Property	Value	Notes
Total Size	81,230,291 molecules	Aggregated from PubChem, ChEMBL, GDB13
Training Split	1,000,000 molecules	Randomly selected unique molecules
Validation Split	5,000 molecules	Randomly selected for evaluation
Image Resolution	256x256 pixels	Generated using RDKit
Median SELFIES Length	>45 characters	More complex than typical benchmarks
Full Dataset Storage	~16.24 TB	Necessitated use of 1M subset
Augmentation	None	No cropping, rotation, or other augmentation

Preprocessing:

Images generated using RDKit at 256x256 resolution
Molecules converted to canonical representations
SELFIES tokenization for model output

Evaluation

Primary Metrics:

Metric	IMG2SMI Value	OSRA Baseline	Purpose
MACCS FTS	0.9475	0.3600	Fingerprint Tanimoto Similarity (functional groups)
RDK FTS	0.9020	0.2790	RDKit fingerprint similarity
Morgan FTS	0.8707	0.2677	Morgan fingerprint similarity (circular)
ROUGE	0.6240	0.0684	Text overlap metric
Exact Match	7.24%	0.04%	Structural identity (strict)
Valid Captions	99.4%	65.2%	Syntactic validity (with SELFIES)
Levenshtein Distance	21.13	32.76	String edit distance (lower is better)

Secondary Metrics (shown to be less informative for chemical tasks):

BLEU, ROUGE (better suited for natural language)
Levenshtein distance (doesn’t capture chemical similarity)

Hardware

GPU: Single NVIDIA GeForce RTX 2080 Ti
Training Time: ~5 hours per epoch, approximately 25 hours total for 5 epochs
Memory: Sufficient for batch size 32 with ResNet-101 + Transformer architecture
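
Combining the reported training split, batch size, and per-epoch time gives a rough picture of the optimization budget. The derived step time and throughput below are back-of-the-envelope estimates, not figures from the paper.

```python
training_images = 1_000_000
batch_size = 32
epochs = 5
hours_per_epoch = 5  # approximate, as reported

steps_per_epoch = training_images // batch_size  # 31,250
total_steps = steps_per_epoch * epochs           # 156,250
seconds_per_step = hours_per_epoch * 3600 / steps_per_epoch
images_per_second = batch_size / seconds_per_step

print(f"{total_steps:,} steps, {seconds_per_step:.3f} s/step, "
      f"{images_per_second:.1f} images/s")
```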

Artifacts

The paper mentions releasing both code and the MOLCAP dataset, but no public repository or download link has been confirmed as available.

Artifact	Type	License	Notes
MOLCAP dataset	Dataset	Unknown	81M molecules; claimed released but no public URL found
IMG2SMI code	Code	Unknown	Built on DETR; claimed released but no public URL found

Paper Information

Citation: Campos, D., & Ji, H. (2021). IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System (No. arXiv:2109.04202). arXiv. https://doi.org/10.48550/arXiv.2109.04202

Publication: arXiv preprint (2021)

Additional Resources:

Paper on arXiv

@article{campos2021img2smi,
  title={IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System},
  author={Campos, Daniel and Ji, Heng},
  journal={arXiv preprint arXiv:2109.04202},
  year={2021},
  doi={10.48550/arXiv.2109.04202}
}

αExtractor: Chemical Info from Biomedical Literature

Sat, 11 Oct 2025 00:00:00 +0000

Methodological Contribution: A Robust Optical Recognition System

This is primarily a Method ($\Psi_{\text{Method}}$) paper with a significant secondary Resource ($\Psi_{\text{Resource}}$) contribution (see the AI and Physical Sciences paper taxonomy for more on these categories).

The dominant methodological contribution is the ResNet-Transformer recognition architecture that outperforms existing OCSR tools across multiple benchmarks through robustness engineering. It specifically focuses on training on 20 million synthetic images with aggressive augmentation to handle degraded image conditions. The work answers the core methodological question “How well does this work?” through extensive benchmarking against existing OCSR tools and ablation studies validating architectural choices.

The secondary resource contribution comes from releasing αExtractor as a freely available web service, correcting labeling errors in standard benchmarks (CLEF, UOB, JPO), and providing an end-to-end document processing pipeline for biomedical literature mining.

Motivation: Extracting Visual Chemical Knowledge from Biomedical Literature

The motivation addresses a familiar pain point in chemical informatics within a biomedical context. Vast amounts of chemical knowledge in biomedical literature exist only as images, such as molecular structures embedded in figures, chemical synthesis schemes, and compound diagrams. This visual knowledge remains effectively invisible to computational methods, which creates a massive bottleneck for drug discovery research, systematic reviews, and large-scale chemical database construction.

Existing OCSR tools face two critical problems when applied to biomedical literature:

Real-world image quality: Biomedical papers often contain low-resolution figures, images with complex backgrounds, noise from scanning/digitization, and inconsistent drawing styles across different journals and decades of publications.
End-to-end extraction: Most OCSR systems assume the presence of clean, cropped molecular images. In practice, you need to first find the molecular structures within multi-panel figures, reaction schemes, and dense document layouts before you can recognize them.

The authors argue that a practical literature mining system needs to solve both problems simultaneously: robust recognition under noisy conditions and automated detection of molecular images within complex documents.

Core Innovation: Robust ResNet-Transformer Architecture

The core innovation lies in combining a competition-winning recognition architecture with extensive robustness engineering and end-to-end document processing. The key contributions include:

ResNet-Transformer Recognition Model: The core recognition system uses a Residual Neural Network (ResNet) encoder paired with a Transformer decoder in an image-captioning framework. This architecture won first place in a Kaggle molecular translation competition, which provided a strong foundation for the recognition task. Let the input image be $I$. The model maximizes the joint likelihood of the SMILES tokens $T$ and coordinate sequences $X, Y$, which corresponds to minimizing a sum of cross-entropy terms: $$ \mathcal{L}_{\text{total}} = - \sum_{i=1}^{L} \log P(T_i \mid I, T_1, \ldots, T_{i-1}) + \mathcal{L}_X + \mathcal{L}_Y $$ where $L$ is the sequence length and $\mathcal{L}_X$, $\mathcal{L}_Y$ are the analogous cross-entropy terms over the length-aligned X and Y coordinate sequences.
Enhanced Molecular Representation: The model produces an augmented representation that encompasses:
- Standard molecular connectivity information
- Bond type tokens (solid wedge bonds, dashed bonds, etc.) that preserve 3D stereochemical information
- Atom coordinate predictions that allow reconstruction of the exact molecular pose from the original image
This dual prediction of discrete structure and continuous coordinates makes the output strictly faithful to the source material and enables better quality assessment.
Massive Synthetic Training Dataset: The model was trained on approximately 20 million synthetic molecular images generated from PubChem SMILES with aggressive data augmentation. The augmentation strategy randomized visual styles, image quality, and rendering parameters to create maximum diversity, ensuring the network rarely saw the same molecular depiction twice. This forces the model to learn robust, style-invariant features.
End-to-End Document Processing Pipeline: αExtractor integrates object detection and structure recognition into a complete document mining system:
- An object detection model automatically locates molecular images within PDF documents
- The recognition model converts detected images to structured representations
- A web service interface makes the entire pipeline accessible to researchers without machine learning expertise
Robustness-First Design: The system was explicitly designed to handle degraded image conditions that break traditional OCSR tools, including low resolution, background interference, color variations, and scanning artifacts commonly found in legacy biomedical literature.

Experimental Methodology: Stress Testing under Real-World Conditions

The evaluation focused on demonstrating robust performance across diverse image conditions, from pristine benchmarks to challenging real-world scenarios:

Benchmark Dataset Evaluation: αExtractor was tested on four standard OCSR benchmarks:
- CLEF: Chemical structure recognition challenge dataset
- UOB: University of Birmingham chemical structure images
- JPO: Japan Patent Office molecular diagrams
- USPTO: US Patent and Trademark Office structures
Performance was measured using exact SMILES match accuracy.
Error Analysis and Dataset Correction: During evaluation, the researchers discovered numerous labeling errors in the original benchmark datasets. They systematically identified and corrected these errors, then re-evaluated all methods on the cleaned datasets to get more accurate performance measurements.
Robustness Stress Testing: The system was evaluated on two challenging datasets specifically designed to test robustness:
- Color background images (200 samples): Molecular structures on complex, colorful backgrounds that simulate real figure conditions
- Low-quality images (200 samples): Degraded images with noise, blur, and artifacts typical of scanned documents
These tests compared αExtractor against three open-source tools (OSRA, MolVec, and Imago) under realistic degradation conditions.
Generalization Testing: In the most challenging experiment, αExtractor was tested on the DECIMER hand-drawn molecule images dataset (Brinkhaus et al., 2022), representing a completely different visual domain not represented in the training data. This tested whether the learned features could generalize beyond digital rendering styles to human-drawn chemistry.
End-to-End Document Extraction: The complete pipeline was evaluated on 50 PDF files containing 2,336 molecular images. This tested both the object detection component (finding molecules in complex documents) and the recognition component (converting them to SMILES) in a realistic literature mining scenario.
Speed Benchmarking: Inference time was measured to demonstrate the practical efficiency needed for large-scale document processing.

Results & Conclusions: Strong Performance on Degraded Images

Substantial Accuracy Gains: On the four benchmark datasets, αExtractor achieved accuracies of 91.83% (CLEF), 98.47% (UOB), 88.67% (JPO), and 93.64% (USPTO), compared to previous best results of 84.6%, 90.0%, 72.2%, and 89.9% respectively. After correcting dataset labeling errors, the true accuracies were even higher, reaching 95.77% on CLEF, 99.86% on UOB, and 92.44% on JPO.
Robustness on Degraded Images: Open-source competitors struggled on degraded images (achieving 5.5% accuracy at best). αExtractor maintained over 90% accuracy on both color background and low-quality image datasets, demonstrating the effectiveness of the synthetic training strategy.
Generalization to Hand-Drawn Molecules: On hand-drawn molecules, a domain completely absent from training data, αExtractor achieved 61.4% accuracy while other tools scored between 0.69% and 2.93%. This suggests the model learned genuinely chemical features rather than style-specific patterns.
Practical End-to-End Performance: In the complete document processing evaluation, αExtractor detected 95.1% of molecular images (2,221 out of 2,336) and correctly recognized 94.5% of detected structures (2,098 correct predictions). This demonstrates the system’s readiness for real-world literature mining applications.
Ablation Results: Ablation experiments confirmed that each architectural component (ResNet backbone, Transformer encoder, Transformer decoder) contributes to performance, with the Transformer decoder having the largest impact. Replacing the Transformer decoder with an LSTM decoder substantially reduced accuracy (Table S6 in the paper).
Dataset Quality Issues: The systematic discovery of labeling errors in standard benchmarks highlights a broader problem in OCSR evaluation. The corrected datasets provide more reliable baselines for future method development.
Spatial Layout Limitation: αExtractor correctly identifies molecular connectivity, but the re-rendered structures may have different spatial layouts than the originals. This could complicate visual verification for complex molecules, even if the chemical information remains accurate.
Non-Standard Depiction Handling: For images with non-standard bond depictions or atomic valences, αExtractor correctly identifies and normalizes them to standard representations. While chemically accurate, this means the re-rendered structure may visually differ from the original image.

Overall, αExtractor combines accurate recognition (over 90% on degraded images), end-to-end document processing, and strong generalization across image conditions. It targets large-scale literature mining tasks where previous tools struggled with degraded inputs. The focus on real-world robustness over benchmark optimization reflects a practical approach to deploying machine learning in scientific workflows.

Reproducibility Details

This paper is Partially Reproducible. While the authors detail the model architectures and training techniques, the source code, training dataset (20M synthetic images), and pre-trained weights remain closed-source and proprietary. The authors released a sample of their test data and host an online web server for running inference.

Artifact	Type	License	Notes
Corrected CLEF Dataset	Dataset	Unknown	Authors’ corrected version of the CLEF benchmark.
Corrected UOB Dataset	Dataset	Unknown	Authors’ corrected version of the UOB benchmark.
Corrected JPO Dataset	Dataset	Unknown	Authors’ corrected version of the JPO benchmark.
Color Background Dataset	Dataset	Unknown	200 samples of molecular structures on complex, colorful backgrounds.
Low Quality Dataset	Dataset	Unknown	200 samples of degraded images with noise, blur, and artifacts.
PDF Test Set	Dataset	Unknown	Sample PDF files for end-to-end document extraction evaluation.
αExtractor Web Server	Other	Unknown	Online service for running inference using the proprietary system.

Models

Image Recognition Model:

Backbone: ResNet50 producing output of shape $2048 \times 19 \times 19$, projected to 512 channels via a feed-forward layer
Transformer Architecture: 3 encoder layers and 3 decoder layers with hidden dimension of 512
Output Format: Generates SMILES tokens plus two auxiliary coordinate sequences (X-axis and Y-axis) that are length-aligned with the SMILES tokens via padding

Object Detection Model:

Architecture: DETR (Detection Transformer) with ResNet101 backbone
Transformer Architecture: 6 encoder layers and 6 decoder layers with hidden dimension of 256
Purpose: Locates molecular images within PDF pages before recognition

Coordinate Prediction:

Continuous X/Y coordinates are discretized into 200 bins
Padding tokens added to coordinate sequences to align perfectly with SMILES token sequence, enabling simultaneous structure and pose prediction
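
The binning and alignment scheme can be sketched as follows. Several details are assumptions made for illustration: coordinates are taken to be normalized to [0, 1], the `<pad>` symbol is hypothetical, and padding is placed at non-atom token positions (bonds, branches, ring closures).

```python
from typing import List, Optional, Tuple, Union

NUM_BINS = 200
PAD = "<pad>"  # hypothetical padding symbol

CoordToken = Union[int, str]


def to_bin(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate normalized to [0, 1] onto an integer bin index."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * num_bins), num_bins - 1)


def align_coordinates(
    tokens: List[str],
    atom_xy: List[Optional[Tuple[float, float]]],
) -> Tuple[List[CoordToken], List[CoordToken]]:
    """Build X and Y sequences with the same length as `tokens`.

    `atom_xy[i]` holds normalized (x, y) when token i is an atom and None
    otherwise, in which case PAD is emitted on both axes.
    """
    if len(tokens) != len(atom_xy):
        raise ValueError("tokens and atom_xy must have the same length")
    xs: List[CoordToken] = []
    ys: List[CoordToken] = []
    for xy in atom_xy:
        if xy is None:
            xs.append(PAD)
            ys.append(PAD)
        else:
            xs.append(to_bin(xy[0]))
            ys.append(to_bin(xy[1]))
    return xs, ys


# Toy example: acetaldehyde "CC=O" with made-up normalized positions.
tokens = ["C", "C", "=", "O"]
positions = [(0.125, 0.5), (0.5, 0.25), None, (0.875, 0.5)]
x_seq, y_seq = align_coordinates(tokens, positions)
```

Treating coordinates as classification targets over bins is what allows the same cross-entropy loss to be used for both the SMILES tokens and the pose.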

Data

Training Data:

Synthetic Generation: Python script rendering PubChem SMILES into 2D images
Dataset Size: Approximately 20.3 million synthetic molecular images from PubChem
Superatom Handling: 50% of molecules had functional groups replaced with superatoms (e.g., “COOH”) or generic labels (R1, X1) to match literature drawing conventions
Rendering Augmentation: Randomized bond thickness, bond spacing, font size, font color, and padding size

Geometric Augmentation:

Shear along x-axis: $\pm 15^\circ$
Rotation: $\pm 15^\circ$
Piecewise affine scaling

Noise Injection:

Pepper noise: 0-2%
Salt noise: 0-40%
Gaussian noise: scale 0-0.16

Destructive Augmentation:

JPEG compression: severity levels 2-5
Random masking
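
As a rough illustration of the noise-injection step, the sketch below applies salt and pepper noise to a grayscale image stored as nested lists. It assumes the reported percentages are per-pixel corruption probabilities sampled uniformly for each image; the source does not state this, and the function name `add_salt_pepper` is hypothetical.

```python
import random


def add_salt_pepper(image, salt_frac, pepper_frac, rng):
    """Return a noisy copy of a grayscale image (rows of 0-255 ints).

    Each pixel independently turns white (salt) with probability
    `salt_frac` or black (pepper) with probability `pepper_frac`.
    """
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            u = rng.random()
            if u < salt_frac:
                new_row.append(255)
            elif u < salt_frac + pepper_frac:
                new_row.append(0)
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy


rng = random.Random(0)
salt = rng.uniform(0.0, 0.40)    # salt range from the text: 0-40%
pepper = rng.uniform(0.0, 0.02)  # pepper range from the text: 0-2%
image = [[128] * 32 for _ in range(32)]  # flat gray placeholder image
noisy = add_salt_pepper(image, salt, pepper, rng)
```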

Evaluation Datasets:

CLEF: Chemical structure recognition challenge dataset
UOB: University of Birmingham patent images
JPO: Japan Patent Office molecular diagrams
USPTO: US Patent and Trademark Office structures
Color background images: 200 samples
Low-quality images: 200 samples
Hand-drawn structures: Test set for generalization
End-to-end document extraction: 50 PDFs (567 pages, 2,336 molecular images)

Training

Image Recognition Model:

Optimizer: Adam with learning rate of 1e-4
Batch Size: 100
Epochs: 5
Loss Function: Cross-entropy loss for both SMILES prediction and coordinate prediction

Object Detection Model:

Optimizer: Adam with learning rate of 1e-4
Batch Size: 24
Training Strategy: Pre-trained on synthetic “Lower Quality” data for 5 epochs, then fine-tuned on annotated real “High Quality” data for 30 epochs

Evaluation

Metrics:

Recognition: SMILES accuracy (exact match)
End-to-End Pipeline:
- Recall: 95.1% for detection
- Accuracy: 94.5% for recognition

Hardware

Inference Hardware:

Cloud CPU server (8 CPUs, 64 GB RAM)
Throughput: Processed 50 PDFs (567 pages) in 40 minutes
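
For a sense of scale, the reported throughput can be converted into per-page and per-document figures. The arithmetic below uses only the numbers quoted in these notes (50 PDFs, 567 pages, 2,336 molecular images, 40 minutes) and assumes the 40 minutes cover the whole detection and recognition pipeline.

```python
pdfs, pages, images = 50, 567, 2336  # end-to-end evaluation set
minutes = 40                         # reported wall-clock time

seconds_per_page = minutes * 60 / pages
pages_per_minute = pages / minutes
seconds_per_pdf = minutes * 60 / pdfs
images_per_page = images / pages

print(f"{seconds_per_page:.1f} s per page")      # 4.2 s per page
print(f"{pages_per_minute:.1f} pages per min")   # 14.2 pages per min
print(f"{seconds_per_pdf:.0f} s per PDF")        # 48 s per PDF
print(f"{images_per_page:.1f} images per page")  # 4.1 images per page
```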

Paper Information

Citation: Xiong, J., Liu, X., Li, Z., Xiao, H., Wang, G., Niu, Z., Fei, C., Zhong, F., Wang, G., Zhang, W., Fu, Z., Liu, Z., Chen, K., Jiang, H., & Zheng, M. (2023). αExtractor: a system for automatic extraction of chemical information from biomedical literature. Science China Life Sciences, 67(3), 618-621. https://doi.org/10.1007/s11427-023-2388-x

Publication: Science China Life Sciences (2023)

Additional Resources:

Paper on Springer

RFL: Simplifying Chemical Structure Recognition (AAAI 2025)

Thu, 19 Dec 2024 00:00:00 +0000

Methodological Contribution

This is a Methodological paper ($\Psi_{\text{Method}}$). It introduces a novel representation system (Ring-Free Language) and a specialized neural architecture (Molecular Skeleton Decoder) designed to solve specific limitations in converting 2D images to 1D chemical strings. The paper validates this method through direct comparison with existing baselines and ablation studies.

Motivation: Limitations of 1D Serialization

Current Optical Chemical Structure Recognition (OCSR) methods typically rely on “unstructured modeling,” where 2D molecular graphs are flattened into 1D strings like SMILES or SSML. While simple, these linear formats struggle to explicitly capture complex spatial relationships, particularly in molecules with multiple rings and branches. End-to-end models often fail to “understand” the graph structure when forced to predict these implicit 1D sequences, leading to error accumulation in complex scenarios.

Innovation: Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD)

The authors propose two primary contributions to decouple spatial complexity:

Ring-Free Language (RFL): A divide-and-conquer representation that splits a molecular graph $G$ into three explicit components: a molecular skeleton $\mathcal{S}$, individual ring structures $\mathcal{R}$, and branch information $\mathcal{F}$. This allows rings to be collapsed into “SuperAtoms” or “SuperBonds” during initial parsing.
Molecular Skeleton Decoder (MSD): A hierarchical architecture that progressively predicts the skeleton first, then the individual rings (using SuperAtom features as conditions), and finally classifies the branch connections.

Methodology and Experiments

The method was evaluated on both handwritten and printed chemical structures against two baselines: DenseWAP (Zhang et al. 2018) and RCGD (Hu et al. 2023).

Datasets:
- EDU-CHEMC: ~49k handwritten samples (challenging, diverse styles)
- Mini-CASIA-CSDB: ~89k printed samples (from ChEMBL)
- Synthetic Complexity Dataset: A custom split of ChEMBL data grouped by structural complexity (atoms + bonds + rings) to test generalization
Ablation Studies (Table 2, on EDU-CHEMC with MSD-DenseWAP): Without MSD or [conn], EM=38.70%. Adding [conn] alone raised EM to 44.02%. Adding MSD alone raised EM to 52.76%. Both together achieved EM=64.96%, confirming each component’s contribution.

Outcomes and Conclusions

New best results: MSD-RCGD achieved 65.39% EM on EDU-CHEMC (handwritten) and 95.23% EM on Mini-CASIA-CSDB (printed), outperforming the RCGD baseline (62.86% and 95.01%, respectively). MSD-DenseWAP surpassed the previous best on EDU-CHEMC by 2.06 percentage points of EM (64.92% vs. 62.86%).
Universal improvement: Applying MSD/RFL to DenseWAP improved its accuracy from 61.35% to 64.92% EM on EDU-CHEMC and from 92.09% to 94.10% EM on Mini-CASIA-CSDB, demonstrating the method is model-agnostic.
Complexity handling: When trained on low-complexity molecules only (levels 1-2), MSD-DenseWAP still recognized higher-complexity unseen structures, while standard DenseWAP could hardly recognize them at all (Figure 6 in the paper).

The authors note that this is the first end-to-end solution that decouples and models chemical structures in a structured form. Future work aims to extend structure-based modeling to other tasks such as tables, flowcharts, and diagrams.

Artifacts

Artifact	Type	License	Notes
RFL-MSD	Code	MIT	Official PyTorch implementation

Reproducibility Details

Data

The authors used one handwritten dataset and one printed dataset, plus a synthetic set for stress-testing structural complexity.

Purpose	Dataset	Size	Notes
Training/Test	EDU-CHEMC	48,998 Train / 2,992 Test	Handwritten images from educational scenarios
Training/Test	Mini-CASIA-CSDB	89,023 Train / 8,287 Test	Printed images rendered from ChEMBL using RDKit
Generalization	ChEMBL Subset	5 levels of complexity	Custom split based on Eq: $N_{atom} + N_{bond} + 12 \times N_{ring}$
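
The complexity score in the last row is simple enough to compute by hand. The sketch below evaluates it for a few familiar molecules; whether hydrogens are counted and where the five level boundaries fall are not stated in these notes, so the example counts heavy atoms only and does not assign levels.

```python
def complexity(n_atom, n_bond, n_ring):
    """Score from the table: N_atom + N_bond + 12 * N_ring."""
    return n_atom + n_bond + 12 * n_ring


# Heavy-atom counts (assumption: hydrogens are not counted).
print(complexity(3, 2, 0))    # ethanol     -> 5
print(complexity(6, 6, 1))    # benzene     -> 24
print(complexity(10, 11, 2))  # naphthalene -> 45
```

The weight of 12 per ring means a single extra ring adds as much to the score as a dozen extra atoms or bonds.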

Algorithms

RFL Splitting (Encoding):

Detect Rings: Use DFS to find all non-nested rings $\mathcal{R}$.
Determine Adjacency ($\gamma$): Calculate shared edges between rings.
Merge:
- If $\gamma(r_i) = 0$ (isolated), merge ring into a SuperAtom node.
- If $\gamma(r_i) > 0$ (adjacent), merge ring into a SuperBond edge.
Update: Record connection info in $\mathcal{F}$ and remove ring details from the main graph to form Skeleton $\mathcal{S}$.
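
The merge rule can be illustrated with a small sketch. It assumes rings have already been detected and are given as ordered cycles of atom indices, and it takes $\gamma$ to be the number of edges a ring shares with any other ring. It applies the SuperAtom/SuperBond rule in a single pass and does not reproduce the full splitting procedure or the construction of $\mathcal{F}$; the function names are hypothetical.

```python
from itertools import combinations


def ring_edges(ring):
    """Edges of a ring given as an ordered cycle of atom indices."""
    n = len(ring)
    return {frozenset((ring[i], ring[(i + 1) % n])) for i in range(n)}


def classify_rings(rings):
    """Label each ring by the rule above and return (labels, gamma).

    gamma[i] counts edges that ring i shares with other rings:
    gamma == 0 -> SuperAtom (isolated), gamma > 0 -> SuperBond (adjacent).
    """
    edges = [ring_edges(r) for r in rings]
    gamma = [0] * len(rings)
    for i, j in combinations(range(len(rings)), 2):
        shared = len(edges[i] & edges[j])
        gamma[i] += shared
        gamma[j] += shared
    labels = ["SuperBond" if g > 0 else "SuperAtom" for g in gamma]
    return labels, gamma


# Naphthalene: two fused six-membered rings sharing the 4-5 edge.
naphthalene = [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
print(classify_rings(naphthalene))  # (['SuperBond', 'SuperBond'], [1, 1])

# Biphenyl: two rings joined by a single bond, no shared ring edge.
biphenyl = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
print(classify_rings(biphenyl))     # (['SuperAtom', 'SuperAtom'], [0, 0])
```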

MSD Decoding:

Hierarchical Prediction: The model predicts the Skeleton $\mathcal{S}$ first.
Contextual Ring Prediction: When a SuperAtom/Bond token is predicted, its hidden state $f^s$ is stored. After the skeleton is finished, $f^s$ is used as a condition to autoregressively decode the specific ring structure.
Token [conn]: A special token separates connected ring bonds from unconnected ones to sparsify the branch classification task.

Models

The architecture follows a standard Image-to-Sequence pattern but with a forked decoder.

Encoder: DenseNet (Growth rate=24, Depth=32 per block)
Decoder (MSD):
- Core: GRU with Attention (Hidden dim=256, Embedding dim=256, Dropout=0.15)
- Skeleton Module: Autoregressively predicts sequence tokens. Uses Maxout activation.
- Branch Module: A binary classifier (MLP) taking concatenated features of skeleton bonds $f_{bs}$ and ring bonds $f_{br}$ to predict connectivity matrix $\mathcal{F}$.
Loss Function: $\mathcal{O} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{cls}$ (where $\lambda_1 = \lambda_2 = 1$)
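
A minimal numeric sketch of this objective follows. It assumes $\mathcal{L}_{ce}$ is a mean token-level cross-entropy over the predicted sequence and $\mathcal{L}_{cls}$ is a mean binary cross-entropy over branch-connection predictions; the notes give only the weighted sum, so both reductions and all function names are assumptions.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])


def binary_cross_entropy(p, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def total_loss(token_probs, token_targets, branch_probs, branch_labels,
               lam1=1.0, lam2=1.0):
    """O = lam1 * L_ce + lam2 * L_cls, with lam1 = lam2 = 1 as in the text."""
    l_ce = sum(cross_entropy(p, t)
               for p, t in zip(token_probs, token_targets)) / len(token_targets)
    l_cls = sum(binary_cross_entropy(p, y)
                for p, y in zip(branch_probs, branch_labels)) / len(branch_labels)
    return lam1 * l_ce + lam2 * l_cls


# Two sequence tokens over a 3-symbol vocabulary, two branch candidates.
token_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
token_targets = [0, 1]
branch_probs = [0.9, 0.2]
branch_labels = [1, 0]
loss = total_loss(token_probs, token_targets, branch_probs, branch_labels)
```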

Evaluation

Both metrics measure exact agreement between the prediction and the ground truth, either for the full image content or for the chemical structure alone.

Metric	Description	Notes
EM (Exact Match)	% of images where predicted graph exactly matches ground truth.	Primary metric
Struct-EM	% of correctly identified chemical structures (ignoring non-chemical text).	Auxiliary metric
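
Computing EM reduces to counting exact agreements. The sketch below uses string equality as a stand-in for graph comparison, which assumes predictions and references are serialized in the same canonical form; the paper's own matching procedure is not reproduced here, and `exact_match` is a hypothetical name.

```python
def exact_match(predictions, references):
    """Percentage of samples whose prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one sample is required")
    hits = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


preds = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
refs = ["CCO", "c1ccccc1", "CC(=O)O", "CCC"]
print(exact_match(preds, refs))  # 75.0
```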

Hardware

Compute: 4 x NVIDIA Tesla V100 (32GB RAM)
Training Configuration:
- Batch size: 8 (Handwritten), 32 (Printed)
- Epochs: 50
- Optimizer: Adam ($lr=2\times10^{-4}$, decayed by 0.5 via MultiStepLR)
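
The step schedule can be written in closed form: the learning rate is the base rate multiplied by the decay factor once for every milestone already reached. The sketch below mirrors that behavior without requiring PyTorch. The milestone epochs (20, 40) are hypothetical placeholders, since these notes do not list them.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=2e-4, gamma=0.5, milestones=(20, 40)):
    """Learning rate at `epoch` under a multi-step decay schedule.

    `milestones` must be sorted; the values here are placeholders.
    """
    return base_lr * gamma ** bisect_right(milestones, epoch)


for epoch in (0, 19, 20, 40, 49):
    print(epoch, multistep_lr(epoch))
```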

Paper Information

Citation: Chang, Q., Chen, M., Pi, C., Hu, P., Zhang, Z., Ma, J., Du, J., Yin, B., & Hu, J. (2025). RFL: Simplifying Chemical Structure Recognition with Ring-Free Language. In Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 2007-2015. https://doi.org/10.1609/aaai.v39i2.32197

Publication: AAAI 2025 (Oral)

Additional Resources:

Official Code Repository

@inproceedings{changRFLSimplifyingChemical2025,
  title = {RFL: Simplifying Chemical Structure Recognition with Ring-Free Language},
  shorttitle = {RFL},
  author = {Chang, Qikai and Chen, Mingjun and Pi, Changpeng and Hu, Pengfei and Zhang, Zhenrong and Ma, Jiefeng and Du, Jun and Yin, Baocai and Hu, Jinshui},
  year = {2025},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume = {39},
  number = {2},
  pages = {2007--2015},
  eprint = {2412.07594},
  primaryclass = {cs},
  doi = {10.1609/aaai.v39i2.32197},
  archiveprefix = {arXiv}
}