MarkushGrapher-2: End-to-End Markush Recognition

A Multimodal Method for Markush Structure Recognition

This is a Method paper that introduces MarkushGrapher-2, a universal encoder-decoder model for recognizing both standard molecular structures and multimodal Markush structures from chemical images. The primary contribution is a dual-encoder architecture that fuses a pretrained OCSR (Optical Chemical Structure Recognition) vision encoder with a Vision-Text-Layout (VTL) encoder, connected through a dedicated ChemicalOCR module for end-to-end processing. The paper also introduces two new resources: a large-scale training dataset (USPTO-MOL-M) of real-world Markush structures extracted from USPTO patent MOL files, and IP5-M, a manually annotated benchmark of 1,000 Markush structures from five major patent offices.

Why Markush Structure Recognition Remains Challenging

Markush structures are compact representations used in patent documents to describe families of related molecules. They combine a visual backbone (atoms, bonds, variable regions) with textual definitions of substituents that can replace those variable regions. This multimodal nature makes them harder to parse than standard molecular diagrams.

Three factors limit automatic Markush recognition. First, visual styles vary across patent offices and publication years. Second, textual definitions lack standardization and often contain conditional or recursive descriptions. Third, real-world training data with comprehensive annotations is scarce. As a result, Markush structures are currently indexed only in two proprietary, manually curated databases: MARPAT and DWPIM.

Prior work, including the original MarkushGrapher, required pre-annotated OCR outputs at inference time, which limited practical deployment. General-purpose models such as GPT-5 and DeepSeek-OCR produce mostly chemically invalid outputs on Markush images, suggesting that such images lie outside their training distributions.

Dual-Encoder Architecture with Dedicated ChemicalOCR

MarkushGrapher-2 uses two complementary encoding pipelines:

Vision encoder pipeline: The input image passes through a Swin-B Vision Transformer (taken from MolScribe) pretrained for OCSR. This encoder extracts visual features representing molecular structures and remains frozen during training.
Vision-Text-Layout (VTL) pipeline: The same image goes through ChemicalOCR, a compact 256M-parameter vision-language model fine-tuned from SmolDocling for OCR on chemical images. ChemicalOCR extracts character-level text and bounding boxes. These, combined with image patches, feed into a T5-base VTL encoder following the UDOP fusion paradigm, where visual and textual tokens are spatially aligned by bounding box overlap.
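
The spatial alignment step can be illustrated with a small sketch. Assuming a regular square patch grid and OCR boxes given in pixel coordinates, each text token is assigned to the image patch that overlaps its bounding box the most. The grid size, the function names, and the largest-overlap rule are illustrative assumptions, not details taken from the paper or from the UDOP implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def patch_boxes(image_size: int, patch_size: int) -> List[Box]:
    """Row-major list of patch boxes for a square image."""
    n = image_size // patch_size
    return [
        (c * patch_size, r * patch_size, (c + 1) * patch_size, (r + 1) * patch_size)
        for r in range(n)
        for c in range(n)
    ]


def align_tokens_to_patches(token_boxes: List[Box], patches: List[Box]) -> List[int]:
    """Index of the patch with the largest overlap per token, or -1 if no patch overlaps."""
    result = []
    for tb in token_boxes:
        areas = [overlap_area(tb, pb) for pb in patches]
        best = max(range(len(patches)), key=areas.__getitem__)
        result.append(best if areas[best] > 0 else -1)
    return result


patches = patch_boxes(image_size=64, patch_size=16)  # hypothetical 4 x 4 grid
tokens = [(2, 2, 10, 10), (30, 18, 40, 28), (100, 100, 110, 110)]
aligned = align_tokens_to_patches(tokens, patches)
print(aligned)  # [0, 6, -1]
```

The third token lies outside the image, so it is left unaligned. A real encoder would then combine each aligned token embedding with its patch embedding; how that combination is done is not specified in this summary.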

The VTL encoder output is concatenated with projected embeddings from the vision encoder. This joint representation feeds a text decoder that autoregressively generates two outputs: a CXSMILES (ChemAxon Extended SMILES) string describing the backbone structure, and a substituent table listing the variable group definitions.
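
The fusion step can be sketched in a few lines of plain Python. The example assumes that concatenation happens along the sequence axis and that the projector is a single linear layer. Both are assumptions made for illustration, as are the toy dimensions and all names.

```python
from typing import List

Matrix = List[List[float]]


def project(features: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Apply one linear layer row-wise: out = features @ weight + bias."""
    columns = list(zip(*weight))  # weight has shape (d_vision, d_model)
    return [
        [sum(x * w for x, w in zip(row, col)) + b for col, b in zip(columns, bias)]
        for row in features
    ]


def fuse(vtl_tokens: Matrix, vision_tokens: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Concatenate VTL tokens with projected vision tokens along the sequence axis."""
    projected = project(vision_tokens, weight, bias)
    width = len(vtl_tokens[0])
    if any(len(tok) != width for tok in projected):
        raise ValueError("projected vision tokens must match the VTL token width")
    return vtl_tokens + projected


# Toy sizes: 3 VTL tokens of width 2, 2 vision tokens of width 3.
vtl = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
vision = [[1.0, 2.0, 3.0], [0.0, 0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.5, -0.5]

fused = fuse(vtl, vision, weight, bias)
print(len(fused), fused[3], fused[4])
```

In this sketch the decoder would see one sequence of five tokens, with the first three coming from the VTL encoder and the last two from the vision encoder.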

Two-Stage Training Strategy

Training proceeds in two phases:

Phase 1 (Adaptation): The vision encoder is frozen. The MLP projector and text decoder train on 243K real-world image-SMILES pairs from MolScribe’s USPTO dataset (3 epochs). This aligns the decoder to the pretrained OCSR feature space.
Phase 2 (Fusion): The vision encoder, projector, and ChemicalOCR are all frozen. The VTL encoder and text decoder train on a mix of 235K synthetic and 145K real-world Markush samples (2 epochs). The VTL encoder learns the features needed for CXSMILES and substituent table prediction without disrupting the established OCSR representations.

The total model has 831M parameters, of which 744M are trainable.
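
The freeze schedule described above can be written down as a small lookup table. This is a sketch of the schedule as summarized here, not the authors' training code. Marking the VTL encoder and ChemicalOCR as unused in Phase 1 is an inference from the description (Phase 1 uses image-SMILES pairs only), and all identifiers are hypothetical.

```python
from typing import Dict, List

TRAINED, FROZEN, UNUSED = "trained", "frozen", "unused"

SCHEDULE: Dict[str, Dict[str, str]] = {
    "phase1_adaptation": {
        "vision_encoder": FROZEN,
        "mlp_projector": TRAINED,
        "text_decoder": TRAINED,
        "vtl_encoder": UNUSED,   # assumption: introduced only in Phase 2
        "chemical_ocr": UNUSED,  # assumption: no OCR input for image-SMILES pairs
    },
    "phase2_fusion": {
        "vision_encoder": FROZEN,
        "mlp_projector": FROZEN,
        "chemical_ocr": FROZEN,
        "vtl_encoder": TRAINED,
        "text_decoder": TRAINED,
    },
}


def trained_components(phase: str) -> List[str]:
    """Names of the components that receive gradient updates in a phase."""
    return sorted(name for name, state in SCHEDULE[phase].items() if state == TRAINED)


# Parameter counts in millions, as reported in the text.
TOTAL_PARAMS_M = 831
TRAINABLE_PARAMS_M = 744
frozen_params_m = TOTAL_PARAMS_M - TRAINABLE_PARAMS_M

print(trained_components("phase1_adaptation"))  # ['mlp_projector', 'text_decoder']
print(trained_components("phase2_fusion"))      # ['text_decoder', 'vtl_encoder']
print(frozen_params_m)                          # 87
```

The 87M remainder agrees with the approximate size listed for the frozen Swin-B vision encoder in the model table under Reproducibility Details.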

Datasets and Evaluation Benchmarks

Training Data

Purpose	Dataset	Size	Source
OCR pretraining	Synthetic chemical structures	235K	PubChem SMILES augmented to CXSMILES, rendered with annotations
OCR fine-tuning	Manual OCR annotations	7K	IP5 patent document crops
Phase 1 (OCSR)	MolScribe USPTO	243K	Real image-SMILES pairs
Phase 2 (MMSR)	Synthetic CXSMILES	235K	Same as OCR pretraining set
Phase 2 (MMSR)	MolParser dataset	91K	Real-world Markush, converted to CXSMILES
Phase 2 (MMSR)	USPTO-MOL-M	54K	Real-world, auto-extracted from USPTO MOL files (2010-2025)

Evaluation Benchmarks

Markush benchmarks: M2S (103 samples), USPTO-M (74), WildMol-M (10K, semi-manual), and the new IP5-M (1,000 manually annotated from USPTO, JPO, KIPO, CNIPA, and EPO patents, 1980-2025).

OCSR benchmarks: USPTO (5,719), JPO (450), UOB (5,740), WildMol (10K).

The primary metric is CXSMILES Accuracy (A): a prediction is correct when (1) the predicted SMILES matches the ground truth by InChIKey equivalence, and (2) all Markush features (variable groups, positional and frequency variation indicators) are correctly represented. Stereochemistry is ignored during evaluation.
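
A minimal sketch of how this metric could be computed is shown below. It assumes that backbone InChIKeys (with stereochemistry already removed) and a canonical set of Markush features have been computed upstream for each sample. The `Parsed` type, the feature strings, and the `KEY-*` placeholders are hypothetical, and invalid predictions are represented as `None` and counted as wrong.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Parsed:
    inchikey: str              # backbone InChIKey, stereochemistry ignored
    features: FrozenSet[str]   # hypothetical canonical encoding of Markush features


def backbone_match(pred: Optional[Parsed], truth: Parsed) -> bool:
    """Condition (1): backbone equivalence by InChIKey."""
    return pred is not None and pred.inchikey == truth.inchikey


def full_match(pred: Optional[Parsed], truth: Parsed) -> bool:
    """Conditions (1) and (2): backbone and all Markush features agree."""
    return backbone_match(pred, truth) and pred.features == truth.features


def accuracy(
    preds: Sequence[Optional[Parsed]],
    truths: Sequence[Parsed],
    match: Callable[[Optional[Parsed], Parsed], bool],
) -> float:
    """Percentage of samples for which `match` holds."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(match(p, t) for p, t in zip(preds, truths))
    return 100.0 * hits / len(truths)


truths = [
    Parsed("KEY-A", frozenset({"R1@3"})),
    Parsed("KEY-B", frozenset({"R1@1", "m:4"})),
    Parsed("KEY-C", frozenset()),
    Parsed("KEY-D", frozenset({"Sg:n"})),
]
preds = [
    Parsed("KEY-A", frozenset({"R1@3"})),   # fully correct
    Parsed("KEY-B", frozenset({"R1@1"})),   # backbone right, one feature missing
    None,                                   # invalid output
    Parsed("KEY-D", frozenset({"Sg:n"})),   # fully correct
]

print(accuracy(preds, truths, full_match))      # 50.0
print(accuracy(preds, truths, backbone_match))  # 75.0
```

The second call corresponds to a backbone-only score, which is why it can never be lower than the full CXSMILES accuracy on the same predictions.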

Results: Markush Structure Recognition

Model	M2S	USPTO-M	WildMol-M	IP5-M
MolParser-Base	39	30	38.1	47.7
MolScribe	21	7	28.1	22.3
GPT-5	3	0	-	-
DeepSeek-OCR	0	0	1.9	0.0
MarkushGrapher-1	38	10	32	-
MarkushGrapher-2	56	13	55	48.0

On M2S, MarkushGrapher-2 achieves 56% CXSMILES accuracy vs. 38% for MarkushGrapher-1, a relative improvement of 47%. On WildMol-M (the largest benchmark at 10K samples), MarkushGrapher-2 reaches 55% vs. 38.1% for MolParser-Base and 32% for MarkushGrapher-1. GPT-5 and DeepSeek-OCR generate mostly chemically invalid outputs on Markush images: only 30% and 15% of their predictions are valid CXSMILES on M2S, respectively.
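
The relative improvement quoted for M2S follows directly from the two table entries. The helper below is only a worked check of that arithmetic; the function name is illustrative.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


m2s_gain = relative_improvement(new=56, old=38)  # MarkushGrapher-2 vs. MarkushGrapher-1 on M2S
print(round(m2s_gain))  # 47
```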

Results: Standard Molecular Structure Recognition

Model	WildMol	JPO	UOB	USPTO
MolParser-Base	76.9	78.9	91.8	93.0
MolScribe	66.4	76.2	87.4	93.1
MolGrapher	45.5	67.5	94.9	91.5
DeepSeek-OCR	25.8	31.6	78.7	36.9
MarkushGrapher-2	68.4	71.0	96.6	89.8

MarkushGrapher-2 achieves the highest score on UOB (96.6%). On the other three OCSR benchmarks it trails the best model by 3.3 to 8.5 points, despite being primarily optimized for Markush recognition.

ChemicalOCR vs. General OCR

Model	M2S F1	USPTO-M F1	IP5-M F1
PaddleOCR v5	7.7	1.2	1.9
EasyOCR	10.2	18.0	18.4
ChemicalOCR	87.2	93.0	86.5

General-purpose OCR tools fail on chemical images because they misinterpret bonds as characters and cannot parse chemical abbreviations. ChemicalOCR exceeds the stronger of the two baselines, EasyOCR, by 68 to 77 F1 points on every benchmark.

Ablation Results and Key Findings

OCR input is critical for Markush features. Without OCR, CXSMILES accuracy drops from 56% to 4% on M2S, and from 53.7% to 15.4% on IP5-M. The backbone structure accuracy ($A_{\text{InChIKey}}$) also drops substantially (from 80% to 39% on M2S), though the vision encoder alone can still recover some structural information. This confirms that textual cues (brackets, indices, variable definitions) are essential for Markush feature prediction.

Two-phase training improves both tasks. Compared to single-phase (fusion only) training, the two-phase strategy improves CXSMILES accuracy on M2S from 44% to 50% and accuracy on the JPO OCSR benchmark from 53.0% to 61.5% after the same number of epochs. Adapting the decoder to OCSR features before introducing the VTL encoder prevents the fusion process from degrading learned visual representations.

Frequency variation indicators remain the hardest feature. On IP5-M, the per-feature breakdown shows 73.3% accuracy for backbone InChI, 74.8% for variable groups, 78.8% for positional variation, but only 30.7% for frequency variation (Sg groups). These repeating structural units are particularly challenging to represent and predict.

Limitations: The model relies on accurate OCR as a prerequisite. Performance on USPTO-M (13% CXSMILES accuracy) lags behind other benchmarks, likely due to the older patent styles in that dataset. The paper does not report inference latency.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
OCR pretraining	Synthetic chemical images	235K	Generated from PubChem SMILES, augmented to CXSMILES
OCR fine-tuning	IP5 patent crops	7K	Manually annotated
Phase 1 training	MolScribe USPTO	243K	Public, real image-SMILES pairs
Phase 2 training	Synthetic + MolParser + USPTO-MOL-M	380K	Mix of synthetic (235K), MolParser (91K), USPTO-MOL-M (54K)
Evaluation	M2S, USPTO-M, WildMol-M, IP5-M	74 to 10K	Markush benchmarks
Evaluation	WildMol, JPO, UOB, USPTO	450 to 10K	OCSR benchmarks

Models

Component	Architecture	Parameters	Status
Vision encoder	Swin-B ViT (from MolScribe)	~87M	Frozen
VTL encoder + decoder	T5-base	~744M trainable	Trained
ChemicalOCR	SmolDocling-based VLM	256M	Fine-tuned, frozen in Phase 2
MLP projector	Linear projection	-	Trained in Phase 1, frozen in Phase 2
Total		831M

Evaluation

Metric	Definition
CXSMILES Accuracy (A)	Percentage of samples where InChIKey matches AND all Markush features correct
$A_{\text{InChIKey}}$	Backbone structure accuracy only (ignoring Markush features)
Table Accuracy	Percentage of correctly predicted substituent tables
Markush Accuracy	Joint CXSMILES + Table accuracy
OCR F1	Bounding-box-level precision/recall at IoU > 0.5
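
The OCR F1 row can be made concrete with a short sketch. It assumes a greedy one-to-one matching between predicted and ground-truth boxes, where a prediction counts as a true positive if its best remaining ground-truth box has IoU above 0.5. The matching procedure and the function names are assumptions; the summary only states the IoU threshold.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ocr_f1(pred: List[Box], truth: List[Box], threshold: float = 0.5) -> float:
    """F1 over boxes, with each ground-truth box matched at most once."""
    unmatched = list(truth)
    true_positives = 0
    for p in pred:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) > threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


truth_boxes = [(0, 0, 10, 10), (20, 0, 30, 10)]
pred_boxes = [(1, 0, 11, 10), (20, 0, 24, 10), (50, 50, 60, 60)]
score = ocr_f1(pred_boxes, truth_boxes)
print(round(score, 3))  # 0.4
```

Here only the first prediction clears the threshold (IoU of about 0.82), so precision is 1/3 and recall is 1/2. A fuller evaluation would probably also compare the recognized text of each matched box, which this sketch leaves out.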

Hardware

Training: NVIDIA A100 GPU
Phase 1: 3 epochs, Adam optimizer, lr 5e-4, 1000 warmup steps, batch size 10, weight decay 1e-3
Phase 2: 2 epochs, batch size 8
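
The reported hyperparameters can be collected in a small configuration object. This is a sketch with hypothetical field names. Values not stated in this summary (for example the Phase 2 optimizer and learning rate) are left as `None` rather than guessed, and the step counts assume rounded dataset sizes, one update per batch, and no gradient accumulation.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhaseConfig:
    name: str
    train_samples: int
    epochs: int
    batch_size: int
    optimizer: Optional[str] = None
    learning_rate: Optional[float] = None
    warmup_steps: Optional[int] = None
    weight_decay: Optional[float] = None

    def total_steps(self) -> int:
        """Approximate number of optimizer steps over all epochs."""
        return math.ceil(self.train_samples / self.batch_size) * self.epochs


PHASE1 = PhaseConfig(
    name="adaptation",
    train_samples=243_000,  # rounded size of the MolScribe USPTO set
    epochs=3,
    batch_size=10,
    optimizer="Adam",
    learning_rate=5e-4,
    warmup_steps=1000,
    weight_decay=1e-3,
)
PHASE2 = PhaseConfig(
    name="fusion",
    train_samples=380_000,  # 235K synthetic + 91K MolParser + 54K USPTO-MOL-M
    epochs=2,
    batch_size=8,
)

print(PHASE1.total_steps())  # 72900
print(PHASE2.total_steps())  # 95000
```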

Paper Information

Citation: Strohmeyer, T., Morin, L., Meijer, G. I., Weber, V., Nassar, A., & Staar, P. (2026). MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Publication: CVPR 2026

Additional Resources:

@misc{strohmeyer2026markushgrapher,
  title={MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
  author={Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Val\'{e}ry and Nassar, Ahmed and Staar, Peter},
  year={2026},
  eprint={2603.28550},
  archiveprefix={arXiv},
  primaryclass={cs.CV}
}
