Paper Information

Citation: Chen, Y., Leung, C. T., Huang, Y., Sun, J., Chen, H., & Gao, H. (2024). MolNexTR: a generalized deep learning model for molecular image recognition. Journal of Cheminformatics, 16(141). https://doi.org/10.1186/s13321-024-00926-w

Publication: Journal of Cheminformatics 2024

What kind of paper is this?

This is a Method paper ($\Psi_{\text{Method}}$). It proposes a novel neural network architecture (MolNexTR) that integrates ConvNext and Vision Transformers to solve the Optical Chemical Structure Recognition (OCSR) task. The paper validates this method through ablation studies and extensive benchmarking against current state-of-the-art models like MolScribe and DECIMER.

What is the motivation?

Converting molecular images from chemical literature into machine-readable formats (SMILES) is critical but challenging due to the high variance in drawing styles, fonts, and conventions (e.g., Markush structures, abbreviations). Existing methods have limitations:

  • Style Robustness: CNN-based and ViT-based models often struggle to generalize across diverse, non-standard drawing styles found in real literature.
  • Feature Extraction: Pure ViT methods lack translation invariance and local feature representation, while pure CNNs struggle with global dependencies.
  • Chemical Knowledge: Many models predict SMILES strings directly, making it difficult to enforce chemical validity or resolve complex stereochemistry and abbreviations.

What is the novelty here?

MolNexTR introduces three main innovations:

  1. Dual-Stream Encoder: A hybrid architecture processing images simultaneously through a ConvNext stream (for local features) and a Vision Transformer stream (for long-range dependencies), fusing them to capture multi-scale information.
  2. Image Contamination Augmentation: A specialized data augmentation algorithm that simulates real-world “noise” found in literature, such as overlapping text, arrows, and partial molecular fragments, to improve robustness.
  3. Graph-Based Decoding with Post-Processing: Unlike pure image-to-SMILES translation, it predicts atoms and bonds (graph generation) and uses a stereochemical discrimination and abbreviation self-correction module to enforce chemical rules (e.g., chirality) and resolve superatoms (e.g., “Ph”, “Bn”).

What experiments were performed?

The model was trained on synthetic data (PubChem) and real patent data (USPTO). It was evaluated on a broad set of public benchmarks, both synthetic and real-world:

  • Synthetic: Indigo, ChemDraw, RDKit (rendered from 5,719 molecules)
  • Real-World: CLEF, UOB, JPO, USPTO, Staker, and a newly curated ACS dataset (diverse styles)

Baselines: Compared against rule-based tools (OSRA, MolVec) and deep learning models (MolScribe, DECIMER, SwinOCSR, Img2Mol).

Ablations: Tested the impact of the dual-stream encoder vs. single streams, and the contribution of individual augmentation strategies.

What outcomes/conclusions?

  • Performance: MolNexTR achieved 81-97% accuracy across the test sets, outperforming the second-best method (usually MolScribe) by margins from 0.3% up to 10.0% on the difficult ACS dataset.
  • Robustness: The model showed superior resilience to image perturbations (rotation, noise) and “curved arrow” noise common in reaction mechanisms.
  • Ablation Results: The dual-stream encoder consistently outperformed single CNN or ViT baselines, and the image contamination algorithm significantly boosted performance on noisy real-world data (ACS).
  • Limitations: The model still struggles with extremely complex hand-drawn molecules and mechanism diagrams where arrows or text are conflated with structure.

Reproducibility Details

Data

Training Data:

  • Synthetic: ~1M molecules randomly selected from PubChem, rendered using RDKit and Indigo with varied styles (thickness, fonts, bond width)
  • Real: 0.68M images from USPTO, with coordinates normalized from MOLfiles

Augmentation:

  • Render Augmentation: Randomized drawing styles (line width, font size, label modes)
  • Image Augmentation: Rotation, cropping, blurring, noise (Gaussian, salt-and-pepper)
  • Molecular Augmentation: Randomly replacing functional groups with abbreviations (from a list of >100) or complex chains (e.g., CH3CH2NH2); adding R-groups
  • Image Contamination: Adding “noise” objects - arrows, lines, text, partial structures - at a minimum distance from the main molecule to simulate literature artifacts
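
A toy sketch of the contamination idea (function and parameter names are ours, not the paper's): draw a stray dark segment on the canvas while keeping every noise pixel at least a minimum distance from the molecule's bounding box, mimicking stray arrows or text in literature figures.

```python
import random

def contaminate(img, mol_bbox, min_dist=10, noise_len=20, seed=None):
    """Add a stray 'arrow shaft' of dark pixels away from the molecule.

    img: 2D list of grayscale values (255 = white background).
    mol_bbox: (x0, y0, x1, y1) bounding box of the molecule, inclusive.
    """
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    x0, y0, x1, y1 = mol_bbox

    def near_molecule(r, c):
        # True if pixel (r, c) falls within min_dist of the bounding box
        return (x0 - min_dist <= c <= x1 + min_dist and
                y0 - min_dist <= r <= y1 + min_dist)

    # Rejection-sample a start point away from the molecule
    while True:
        r, c = rng.randrange(h), rng.randrange(w)
        if not near_molecule(r, c):
            break
    # Draw a horizontal segment, skipping pixels too close to the molecule
    for dc in range(noise_len):
        if c + dc < w and not near_molecule(r, c + dc):
            img[r][c + dc] = 0  # 0 = black pixel
    return img
```

The released augmentation pipeline is more elaborate (arrows, text, partial structures); this only illustrates the minimum-distance placement constraint.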

Algorithms

Dual-Stream Encoder:

  • CNN Stream: ConvNext backbone (pre-trained on ImageNet), generating feature maps at scales $H/4$ to $H/32$
  • ViT Stream: Parallel transformer blocks receiving patches of sizes $p=4, 8, 16, 32$. Uses Multi-Head Self-Attention (MHSA) and Feed-Forward Networks (FFN)
  • Fusion: Outputs from both streams are concatenated
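
At a shape level, the fusion step amounts to channel-wise concatenation of same-resolution feature maps from the two streams. This is our sketch, not the released code, and the channel counts are illustrative only:

```python
import numpy as np

def fuse_streams(cnn_feats, vit_feats):
    """Concatenate same-resolution feature maps along the channel axis."""
    assert cnn_feats.shape[:2] == vit_feats.shape[:2]  # matching H', W'
    return np.concatenate([cnn_feats, vit_feats], axis=-1)

# Shapes only, illustrating the H/32 scale for a 384x384 input
cnn = np.zeros((12, 12, 768))   # ConvNext stream: 384 / 32 = 12
vit = np.zeros((12, 12, 768))   # ViT stream at patch size p = 32
fused = fuse_streams(cnn, vit)  # -> (12, 12, 1536)
```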

Decoder (Graph Generation):

  • Transformer Decoder: 6 layers, 8 heads, hidden dim 256
  • Task 1 (Atoms): Autoregressive prediction of atom tokens $(l, x, y)$ (label + coordinates)
  • Task 2 (Bonds): Prediction of bond types between atom pairs (Single, Double, Triple, Aromatic, Wedge)
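
The atom stream can be pictured as a flat token sequence. The serialization below is a hypothetical illustration of the $(l, x, y)$ format; the token names and coordinate bin count are ours, not the paper's:

```python
def atoms_to_tokens(atoms, bins=64):
    """Serialize atoms into a decoder token stream: one label token
    followed by two quantized coordinate tokens per atom."""
    tokens = ["<bos>"]
    for label, x, y in atoms:  # x, y normalized to [0, 1]
        tokens += [label, f"x{int(x * (bins - 1))}", f"y{int(y * (bins - 1))}"]
    tokens.append("<eos>")
    return tokens
```

The decoder would emit such a stream autoregressively; bond types are then predicted for pairs of the decoded atoms.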

Post-Processing:

  • Stereochemistry: Uses predicted coordinates and bond types (wedge/dash) to resolve chirality using RDKit logic
  • Abbreviation Correction: Matches superatoms to a dictionary; if unknown, attempts to greedily connect atoms based on valence or finds the nearest match ($\sigma=0.8$ similarity threshold)
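
A minimal, dependency-free sketch of the dictionary lookup with a nearest-match fallback. The toy dictionary and the use of difflib's similarity ratio are our assumptions; the paper only specifies the $\sigma = 0.8$ threshold and a list of >100 superatoms:

```python
from difflib import SequenceMatcher

# Toy superatom dictionary (the real list has >100 entries)
SUPERATOMS = {"Ph": "c1ccccc1", "Bn": "Cc1ccccc1", "OMe": "OC", "Et": "CC"}

def resolve_superatom(label, sigma=0.8):
    """Exact dictionary hit first; otherwise fall back to the nearest
    entry whose string similarity clears the sigma threshold."""
    if label in SUPERATOMS:
        return SUPERATOMS[label]
    best, score = None, 0.0
    for key in SUPERATOMS:
        s = SequenceMatcher(None, label, key).ratio()
        if s > score:
            best, score = key, s
    return SUPERATOMS[best] if score >= sigma else None
```

An OCR-garbled label like "OMe." would still resolve to the OMe entry, while an unrecognized label returns None and would fall through to the valence-based greedy connection step.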

Models

  • Architecture: Encoder-Decoder (ConvNext + ViT Encoder -> Transformer Decoder)
  • Hyperparameters:
    • Optimizer: Adam (max LR 3e-4, linear warmup over the first 5% of steps)
    • Batch Size: 256
    • Image Size: $384 \times 384$
    • Dropout: 0.1
  • Training: Fine-tuned CNN backbone for 40 epochs on 10 NVIDIA RTX 3090 GPUs
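
The warmup schedule can be sketched as follows. Only the linear warmup over the first 5% of steps is stated above; the post-warmup behavior is unspecified here, so the constant tail is a placeholder assumption:

```python
def lr_at(step, total_steps, max_lr=3e-4, warmup_frac=0.05):
    """Learning rate at a given (0-indexed) step: linear warmup to max_lr
    over the first warmup_frac of training, then held constant."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    return max_lr  # placeholder: actual post-warmup decay not specified here
```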

Evaluation

Primary Metric: SMILES sequence exact matching accuracy (canonicalized)
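
A minimal sketch of the metric. In practice canonicalization goes through RDKit (Chem.MolFromSmiles followed by Chem.MolToSmiles); here it is injected as a callable with an identity default so the sketch stays dependency-free:

```python
def exact_match_accuracy(preds, refs, canonicalize=lambda s: s):
    """Fraction of predictions whose canonical SMILES exactly matches
    the canonical reference SMILES."""
    hits = sum(canonicalize(p) == canonicalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)
```

With a real canonicalizer, "c1ccccc1" and "C1=CC=CC=C1" would count as a match; under the identity default they do not, which is why canonicalization matters for this metric.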

Benchmarks:

  • Synthetic: Indigo (5,719), ChemDraw (5,719), RDKit (5,719)
  • Real: CLEF (992), UOB (5,740), JPO (450), USPTO (5,719), Staker (50,000), ACS (331)

Hardware

  • GPUs: 10 NVIDIA RTX 3090 GPUs
  • Cluster: HPG Cluster at HKUST

Citation

@article{chenMolNexTRGeneralizedDeep2024,
  title = {MolNexTR: A Generalized Deep Learning Model for Molecular Image Recognition},
  author = {Chen, Yufan and Leung, Ching Ting and Huang, Yong and Sun, Jianwei and Chen, Hao and Gao, Hanyu},
  journal = {Journal of Cheminformatics},
  volume = {16},
  number = {141},
  year = {2024},
  doi = {10.1186/s13321-024-00926-w}
}