<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Vision-Language Models on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/</link><description>Recent content in Vision-Language Models on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/index.xml" rel="self" type="application/rss+xml"/><item><title>OCSU: Optical Chemical Structure Understanding (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</guid><description>OCSU task for translating molecular images into multi-level descriptions. Introduces Vis-CheBI20 dataset and DoubleCheck/Mol-VL for molecular understanding.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, S., Xie, Y., Cai, B., Xie, A., Liu, G., Qiao, M., Xing, J., &amp; Nie, Z. (2025). OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery. <em>arXiv preprint arXiv:2501.15415</em>. <a href="https://doi.org/10.48550/arXiv.2501.15415">https://doi.org/10.48550/arXiv.2501.15415</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/PharMolix/OCSU">Code and Dataset (GitHub)</a></li>
</ul>
<h2 id="multi-level-chemical-understanding-method-and-resource">Multi-Level Chemical Understanding (Method and Resource)</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution.</p>
<ul>
<li><strong>Methodological</strong>: It proposes two novel architectures, <strong>DoubleCheck</strong> (an enhanced recognition model) and <strong>Mol-VL</strong> (an end-to-end vision-language model), to solve the newly formulated OCSU task.</li>
<li><strong>Resource</strong>: It constructs and releases <strong>Vis-CheBI20</strong>, the first large-scale dataset specifically designed for optical chemical structure understanding, containing 29.7K images and 117.7K image-text pairs.</li>
</ul>
<h2 id="the-motivation-for-ocsu-beyond-basic-graph-recognition">The Motivation for OCSU Beyond Basic Graph Recognition</h2>
<p>Existing methods for processing molecular images focus narrowly on <strong>Optical Chemical Structure Recognition (OCSR)</strong>, which translates an image solely into a machine-readable graph or SMILES string. However, SMILES strings are not chemist-friendly and lack high-level semantic context.</p>
<ul>
<li><strong>Gap</strong>: There is a lack of systems that can translate chemical diagrams into human-readable descriptions (e.g., functional groups, IUPAC names) alongside the graph structure.</li>
<li><strong>Goal</strong>: To enable <strong>Optical Chemical Structure Understanding (OCSU)</strong>, bridging the gap between visual representations and both machine/chemist-readable descriptions to support drug discovery and property prediction.</li>
</ul>
<h2 id="key-innovations-doublecheck-mol-vl-and-the-vis-chebi20-dataset">Key Innovations: DoubleCheck, Mol-VL, and the Vis-CheBI20 Dataset</h2>
<p>The paper introduces the <strong>OCSU task</strong>, enabling multi-level understanding (motif, molecule, and abstract levels). To solve this, it introduces two distinct paradigms:</p>
<ol>
<li><strong>DoubleCheck (OCSR-based)</strong>: An enhancement to standard OCSR models (like MolScribe) that takes a &ldquo;second look&rdquo; at locally ambiguous atoms, using attentive feature enhancement to fuse global molecular features with local features from ambiguous regions.</li>
<li><strong>Mol-VL (OCSR-free)</strong>: An end-to-end Vision-Language Model (VLM) based on Qwen2-VL. It uses multi-task learning to directly generate text descriptions from molecular images without an intermediate SMILES step.</li>
<li><strong>Vis-CheBI20 Dataset</strong>: A new benchmark specifically constructed for OCSU, deriving captions and functional group data from ChEBI-20 and PubChem.</li>
</ol>
<h2 id="methodology-and-experimental-evaluation">Methodology and Experimental Evaluation</h2>
<p>The authors evaluated both paradigms on <strong>Vis-CheBI20</strong> and existing benchmarks (USPTO, ACS) across four subtasks:</p>
<ol>
<li><strong>Functional Group Caption</strong>: Retrieval/F1 score evaluation.</li>
<li><strong>Molecule Description</strong>: Natural language generation metrics (BLEU, ROUGE, METEOR).</li>
<li><strong>IUPAC Naming</strong>: Text generation metrics (BLEU, ROUGE).</li>
<li><strong>SMILES Naming (OCSR)</strong>: Exact matching accuracy ($Acc_s$).</li>
</ol>
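<p>As a concrete sketch of the exact-match metric ($Acc_s$), the snippet below shows its shape in plain Python. This is an illustration, not the authors' code: a real evaluation would canonicalize each SMILES string with a cheminformatics toolkit (e.g., RDKit's <code>MolToSmiles</code>) before comparison, which the <code>canonicalize</code> hook stands in for.</p>

```python
def smiles_exact_match(predictions, references, canonicalize=None):
    """Fraction of predicted SMILES that exactly match the reference.

    `canonicalize` is a placeholder for a toolkit canonicalizer
    (e.g., RDKit's MolToSmiles); the default no-op normalization
    here is only illustrative.
    """
    canon = canonicalize or (lambda s: s.strip())
    hits = sum(canon(p) == canon(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Two of three predictions match exactly.
acc_s = smiles_exact_match(["CCO", "c1ccccc1", "CC(=O)O"],
                           ["CCO", "c1ccccc1", "CC(C)O"])
```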
<p><strong>Baselines</strong>:</p>
<ul>
<li><strong>Task-Specific</strong>: MolScribe, MolVec, OSRA.</li>
<li><strong>LLM/VLM</strong>: Qwen2-VL, BioT5+, Mol-Instructions.</li>
<li><strong>Ablation</strong>: DoubleCheck vs. MolScribe backbone to test the &ldquo;feature enhancement&rdquo; mechanism.</li>
</ul>
<h2 id="results-and-conclusions-paradigm-trade-offs">Results and Conclusions: Paradigm Trade-Offs</h2>
<ul>
<li><strong>DoubleCheck Superiority</strong>: DoubleCheck outperformed MolScribe on OCSR tasks across all benchmarks. On USPTO, it achieved <strong>92.85%</strong> $Acc_s$ (vs. 92.57%), and on the ACS dataset it showed a <strong>+3.12%</strong> gain on chiral molecules. On Vis-CheBI20, DoubleCheck improved over MolScribe by an average of 2.27% across all metrics.</li>
<li><strong>Paradigm Trade-offs</strong>:
<ul>
<li><strong>Mol-VL (OCSR-free)</strong> excelled at semantic tasks like <strong>Functional Group Captioning</strong>, achieving <strong>97.32%</strong> F1 (vs. 93.63% for DoubleCheck &amp; RDKit and 89.60% for MolScribe &amp; RDKit). It benefits from end-to-end learning of structural context.</li>
<li><strong>DoubleCheck (OCSR-based)</strong> performed better on <strong>IUPAC naming recall</strong> and exact SMILES recovery, as explicit graph reconstruction is more precise for rigid nomenclature than VLM generation.</li>
</ul>
</li>
<li><strong>Conclusion</strong>: Enhancing submodules improves OCSR-based paradigms, while end-to-end VLMs offer stronger semantic understanding but struggle with exact syntax generation (SMILES/IUPAC).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Vis-CheBI20 Dataset</strong></p>
<ul>
<li><strong>Source</strong>: Derived from ChEBI-20 and PubChem.</li>
<li><strong>Size</strong>: 29,700 molecular diagrams and 117,700 image-text pairs.</li>
<li><strong>Generation</strong>: Images generated from SMILES using RDKit to simulate real-world journal/patent styles.</li>
<li><strong>Splits</strong> (vary by task, see table below):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Train Size</th>
          <th style="text-align: left">Test Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Functional Group</td>
          <td style="text-align: left">26,144</td>
          <td style="text-align: left">3,269</td>
      </tr>
      <tr>
          <td style="text-align: left">Description</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
      <tr>
          <td style="text-align: left">IUPAC Naming</td>
          <td style="text-align: left">26,200</td>
          <td style="text-align: left">2,680</td>
      </tr>
      <tr>
          <td style="text-align: left">SMILES Naming</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>DoubleCheck (Attentive Feature Enhancement)</strong></p>
<ol>
<li><strong>Ambiguity Detection</strong>: Uses atom prediction confidence to identify &ldquo;ambiguous atoms&rdquo;.</li>
<li><strong>Masking</strong>: Applies a 2D Gaussian mask to the image centered on the ambiguous atom.</li>
<li><strong>Local Encoding</strong>: A Swin-B encoder ($\Phi_l$) encodes the masked image region.</li>
<li><strong>Fusion</strong>: Aligns local features ($\mathcal{F}_l$) with global features ($\mathcal{F}_g$) using a 2-layer MLP and fuses them via weighted summation.</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{F}_e = \mathcal{F}_g + \text{MLP}(\mathcal{F}_g \oplus \hat{\mathcal{F}}_l) \cdot \hat{\mathcal{F}}_l
\end{aligned}
$$</p>
<ol start="5">
<li><strong>Two-Stage Training</strong>:
<ul>
<li>Stage 1: Train atom/bond predictors (30 epochs).</li>
<li>Stage 2: Train alignment/fusion modules with random Gaussian mask noise (10 epochs).</li>
</ul>
</li>
</ol>
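<p>The fusion step above (step 4) can be sketched in plain Python. The equation specifies $\mathcal{F}_e = \mathcal{F}_g + \text{MLP}(\mathcal{F}_g \oplus \hat{\mathcal{F}}_l) \cdot \hat{\mathcal{F}}_l$; the hidden width, activations, and the scalar-valued gate below are our assumptions, since the paper only specifies a 2-layer MLP followed by weighted summation.</p>

```python
import math
import random

random.seed(0)
d = 8  # illustrative feature dimension, not the paper's

F_g = [random.gauss(0, 1) for _ in range(d)]      # global features
F_l_hat = [random.gauss(0, 1) for _ in range(d)]  # aligned local features

# 2-layer MLP on the concatenated features -> scalar fusion weight
# (widths/activations are our guess).
W1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(2 * d)]
W2 = [random.gauss(0, 1) for _ in range(d)]

def mlp(x):
    h = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)))
         for j in range(d)]                        # ReLU hidden layer
    z = sum(hj * W2[j] for j, hj in enumerate(h))
    return 1.0 / (1.0 + math.exp(-z))              # sigmoid weight in (0, 1)

w = mlp(F_g + F_l_hat)  # list concat stands in for feature concatenation
F_e = [g + w * l for g, l in zip(F_g, F_l_hat)]   # F_e = F_g + w * F_l_hat
```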
<p><strong>Mol-VL (Multi-Task VLM)</strong></p>
<ul>
<li><strong>Prompting</strong>: System prompt: &ldquo;You are working as an excellent assistant in chemistry&hellip;&rdquo;</li>
<li><strong>Tokens</strong>: Uses <code>&lt;image&gt;</code> and <code>&lt;/image&gt;</code> special tokens.</li>
<li><strong>Auxiliary Task</strong>: Functional group recognition (identifying highlighted groups) added to training to improve context learning.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>DoubleCheck</strong>:
<ul>
<li><strong>Backbone</strong>: MolScribe architecture.</li>
<li><strong>Encoders</strong>: Swin-B for both global and local atom encoding.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>:
<ul>
<li><strong>Base Model</strong>: Qwen2-VL (2B and 7B versions).</li>
<li><strong>Vision Encoder</strong>: ViT with naive dynamic resolution and M-RoPE.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Metrics</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Exact Match Accuracy ($Acc_s$), Chiral Accuracy ($Acc_c$).</li>
<li><strong>Functional Groups</strong>: F1 Score (Information Retrieval task).</li>
<li><strong>Text Generation</strong>: BLEU-2/4, METEOR, ROUGE-L.</li>
</ul>
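<p>The functional-group score is framed as an information-retrieval F1 over predicted vs. reference group labels. The sketch below is our set-based reading of that metric; the paper's exact matching rules (e.g., handling of duplicate groups) may differ.</p>

```python
def functional_group_f1(predicted, reference):
    """Set-based F1 over functional-group labels (illustrative)."""
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)               # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Two of three groups are shared -> precision = recall = F1 = 2/3.
f1 = functional_group_f1(["hydroxyl", "carboxyl", "amine"],
                         ["hydroxyl", "carboxyl", "ester"])
```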
<p><strong>Selected Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left"><strong>92.85%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>MolScribe</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left">92.57%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mol-VL-7B</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left"><strong>97.32%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck &amp; RDKit</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left">93.63%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>DoubleCheck</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>4 days</strong>.
<ul>
<li>Max LR: 4e-4.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>10 days</strong>.
<ul>
<li>Max LR: 1e-5, 50 epochs.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/PharMolix/OCSU">PharMolix/OCSU (GitHub)</a></td>
          <td style="text-align: left">Code, Model, Dataset</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation, Mol-VL-7B weights, and Vis-CheBI20 dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>The long-tail distribution of functional groups in training data limits performance on uncommon chemical structures.</li>
<li>Mol-VL struggles with exact syntax generation (SMILES and IUPAC) compared to explicit graph-reconstruction approaches.</li>
<li>Vis-CheBI20 images are synthetically generated via RDKit, which may not fully capture the diversity of real-world journal and patent images.</li>
<li>The authors note that OCSU technologies should be restricted to research purposes, as downstream molecule discovery applications could potentially generate harmful molecules.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{fanOCSUOpticalChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSU}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fan, Siqi and Xie, Yuguang and Cai, Bowen and Xie, Ailin and Liu, Gaochao and Qiao, Mu and Xing, Jie and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GTR-CoT: Graph Traversal Chain-of-Thought for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</guid><description>GTR-VL uses graph traversal chain-of-thought and two-stage training to improve optical chemical structure recognition on printed and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, J., He, Y., Yang, H., Wu, J., Ge, L., Wei, X., Wang, Y., Li, L., Ao, H., Liu, C., Wang, B., Wu, L., &amp; He, C. (2025). GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition (arXiv:2506.07553). arXiv. <a href="https://doi.org/10.48550/arXiv.2506.07553">https://doi.org/10.48550/arXiv.2506.07553</a></p>
<p><strong>Publication</strong>: arXiv preprint (2025)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2506.07553">Paper on arXiv</a></li>
</ul>
<h2 id="contribution-vision-language-modeling-for-ocsr">Contribution: Vision-Language Modeling for OCSR</h2>
<p>This is a <strong>method paper</strong> that introduces GTR-VL, a Vision-Language Model for Optical Chemical Structure Recognition (OCSR). The work addresses the persistent challenge of converting molecular structure images into machine-readable formats, with a particular focus on handling chemical abbreviations that cause errors in existing systems.</p>
<h2 id="motivation-the-abbreviation-bottleneck">Motivation: The Abbreviation Bottleneck</h2>
<p>The paper tackles a long-standing bottleneck in chemical informatics: most existing OCSR systems produce incorrect structures when they encounter abbreviated functional groups. When a chemist draws &ldquo;Ph&rdquo; for phenyl or &ldquo;Et&rdquo; for ethyl, current models fail because they have been trained on data where images contain abbreviations but the ground-truth labels contain fully expanded molecular graphs.</p>
<p>This creates a fundamental mismatch. The model sees &ldquo;Ph&rdquo; in the image but is told the &ldquo;correct&rdquo; answer is a full benzene ring. The supervision signal is inconsistent with what is actually visible.</p>
<p>Beyond this data problem, existing graph-parsing methods use a two-stage approach: predict all atoms first, then predict all bonds. This is inefficient and ignores the structural constraints that could help during prediction. The authors argue that mimicking how humans analyze molecular structures, following bonds from atom to atom in a connected traversal, would be more effective.</p>
<h2 id="novelty-graph-traversal-as-visual-chain-of-thought">Novelty: Graph Traversal as Visual Chain-of-Thought</h2>
<p>The novelty lies in combining two key insights about how to properly train and architect OCSR systems. The main contributions are:</p>
<ol>
<li>
<p><strong>Graph Traversal as Visual Chain of Thought</strong>: GTR-VL generates molecular graphs by traversing them sequentially, predicting an atom, then its connected bond, then the next atom, and so on. This mimics how a human chemist would trace through a structure and allows the model to use previously predicted atoms and bonds as context for subsequent predictions.</p>
<p>Formally, the model output sequence for image $I_m$ is generated as:</p>
<p>$$ R_m = \text{concat}(CoT_m, S_m) $$</p>
<p>where $CoT_m$ represents the deterministic graph traversal steps (atoms and bonds) and $S_m$ is the final SMILES representation. This intermediate reasoning step makes the model more interpretable and helps it learn the structural logic of molecules.</p>
</li>
<li>
<p><strong>&ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; Principle</strong>: This addresses the abbreviation problem head-on. The authors correct the ground-truth annotations to match what&rsquo;s actually visible in the image.</p>
<p>They treat abbreviations like &ldquo;Ph&rdquo; as single &ldquo;superatoms&rdquo; and build a pipeline to automatically detect and correct training data. Using OCR to extract visible text from molecular images, they replace the corresponding expanded substructures in the ground-truth with the appropriate abbreviation tokens. This ensures the supervision signal is consistent with the visual input.</p>
</li>
<li>
<p><strong>Large-Scale Dataset (GTR-1.3M)</strong>: To support this approach, the authors created a large-scale dataset combining 1M synthetic molecules from PubChem with 351K corrected real-world patent images from USPTO. The key innovation is the correction pipeline that identifies abbreviations in patent images and fixes the inconsistent ground-truth labels.</p>
</li>
<li>
<p><strong>GRPO for Hand-Drawn OCSR</strong>: Hand-drawn molecular data lacks fine-grained atom/bond coordinate annotations, making SFT-based graph parsing inapplicable. The authors use Group Relative Policy Optimization (GRPO) with a composite reward function that combines format, SMILES, and graph-level rewards. The graph reward computes the maximum common subgraph (MCS) between predicted and ground-truth molecular graphs:</p>
<p>$$ R_{\text{graph}} = \frac{|N_m^a|}{|N_g^a| + |N_p^a|} + \frac{|N_m^b|}{|N_g^b| + |N_p^b|} $$</p>
<p>where $N_m^a$, $N_g^a$, $N_p^a$ are atom counts in the MCS, ground truth, and prediction, and $N_m^b$, $N_g^b$, $N_p^b$ are the corresponding bond counts.</p>
</li>
<li>
<p><strong>Two-Stage Training</strong>: Stage 1 performs SFT on GTR-1.3M for printed molecule recognition. Stage 2 applies GRPO on a mixture of printed data (GTR-USPTO-4K) and hand-drawn data (DECIMER-HD-Train, 4,070 samples) to extend capabilities to hand-drawn structures.</p>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: Traditional SMILES-based evaluation fails for molecules with abbreviations because canonicalization breaks down. The authors created a new benchmark that evaluates graph structure directly, providing three metrics: direct SMILES generation, graph-derived SMILES, and exact graph matching.</p>
</li>
</ol>
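<p>To make the graph reward in contribution 4 concrete: given atom and bond counts for the maximum common subgraph, ground truth, and prediction, the reward is a ratio sum that peaks at 1.0 for an exact match. Computing the MCS itself (e.g., with RDKit's <code>rdFMCS.FindMCS</code>) is assumed to happen upstream; this sketch only evaluates the formula.</p>

```python
def graph_reward(mcs_atoms, gt_atoms, pred_atoms,
                 mcs_bonds, gt_bonds, pred_bonds):
    """R_graph = |N_m^a| / (|N_g^a| + |N_p^a|) + |N_m^b| / (|N_g^b| + |N_p^b|).

    Counts refer to the maximum common subgraph (m), ground truth (g),
    and prediction (p); each term reaches 1/2 when the graphs coincide.
    """
    return (mcs_atoms / (gt_atoms + pred_atoms)
            + mcs_bonds / (gt_bonds + pred_bonds))

# Exact match: MCS == ground truth == prediction.
r_perfect = graph_reward(9, 9, 9, 8, 8, 8)
# Partial match: prediction shares 6 of 9 atoms, 5 of 8 bonds.
r_partial = graph_reward(6, 9, 9, 5, 8, 8)
```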
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The evaluation focused on demonstrating that GTR-VL&rsquo;s design principles solve real problems that plague existing OCSR systems:</p>
<ol>
<li>
<p><strong>Comprehensive Baseline Comparison</strong>: GTR-VL was tested against three categories of models:</p>
<ul>
<li><strong>Specialist OCSR systems</strong>: MolScribe and MolNexTR</li>
<li><strong>Chemistry-focused VLMs</strong>: ChemVLM, ChemDFM-X, OCSU</li>
<li><strong>General-purpose VLMs</strong>: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: The new benchmark includes two subsets of patent images:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 standard patent images similar to existing benchmarks</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO-10K-abb subset</li>
</ul>
<p>This design directly tests whether models can handle the abbreviation problem that breaks existing systems.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Systematic experiments isolated the contribution of key design choices:</p>
<ul>
<li><strong>Chain-of-Thought vs. Direct</strong>: Comparing graph traversal CoT against direct SMILES prediction</li>
<li><strong>Traversal Strategy</strong>: Graph traversal vs. the traditional &ldquo;atoms-then-bonds&rdquo; approach</li>
<li><strong>Dataset Quality</strong>: Training on corrected vs. uncorrected data</li>
</ul>
</li>
<li>
<p><strong>Retraining Experiments</strong>: Existing specialist models (MolScribe, MolNexTR) were retrained from scratch on the corrected GTR-1.3M dataset to isolate the effect of data quality from architectural improvements.</p>
</li>
<li>
<p><strong>Hand-Drawn OCSR Evaluation</strong>: GTR-VL was also evaluated on the DECIMER Hand-drawn test set and ChemPix dataset, comparing against DECIMER and AtomLenz+EditKT baselines.</p>
</li>
<li>
<p><strong>Qualitative Analysis</strong>: Visual inspection of predictions on challenging cases with heavy abbreviation usage, complex structures, and edge cases to understand failure modes.</p>
</li>
</ol>
<h2 id="results--conclusions-resolving-the-abbreviation-bottleneck">Results &amp; Conclusions: Resolving the Abbreviation Bottleneck</h2>
<ul>
<li>
<p><strong>Performance Gains on Abbreviations</strong>: On MolRec-Abb, GTR-VL-Stage1 achieves 85.49% Graph accuracy compared to around 20% for MolScribe and MolNexTR with their original checkpoints. On MolRec-USPTO, GTR-VL-Stage1 reaches 93.45% Graph accuracy. Existing specialist models see their accuracy drop below 20% on MolRec-Abb when abbreviations are present.</p>
</li>
<li>
<p><strong>Data Correction is Critical</strong>: When MolScribe and MolNexTR were retrained on GTR-1.3M, their MolRec-Abb Graph accuracy jumped from around 20% to 70.60% and 71.85% respectively. GTR-VL-Stage1 still outperformed these retrained baselines at 85.49%, confirming that both data correction and the graph traversal approach contribute.</p>
</li>
<li>
<p><strong>Chain-of-Thought Helps</strong>: Ablation on GTR-USPTO-351K shows that CoT yields 68.85% Gen-SMILES vs. 66.54% without CoT, a 2.31 percentage point improvement.</p>
</li>
<li>
<p><strong>Graph Traversal Beats Traditional Parsing</strong>: Graph traversal achieves 83.26% Graph accuracy vs. 80.15% for the atoms-then-bonds approach, and 81.88% vs. 79.02% on Gra-SMILES.</p>
</li>
<li>
<p><strong>General VLMs Still Struggle</strong>: General-purpose VLMs like GPT-4o scored near 0% on MolRec-Bench across all metrics, highlighting the importance of domain-specific training for OCSR.</p>
</li>
<li>
<p><strong>Hand-Drawn Recognition via GRPO</strong>: GTR-VL-Stage1 (SFT only) achieves only 9.53% Graph accuracy on DECIMER-HD-Test, but after GRPO training in Stage 2, performance jumps to 75.44%. On ChemPix, Graph accuracy rises from 22.02% to 86.13%. The graph reward is essential: GRPO without graph supervision achieves only 11.00% SMILES on DECIMER-HD-Test, while adding graph reward reaches 75.64%.</p>
</li>
<li>
<p><strong>Evaluation Methodology Matters</strong>: The new graph-based evaluation metrics revealed problems with traditional SMILES-based evaluation that previous work had missed. Many &ldquo;failures&rdquo; in existing benchmarks were actually correct graph predictions that got marked wrong due to canonicalization issues with abbreviations.</p>
</li>
</ul>
<p>The work establishes that addressing the abbreviation problem requires both correcting the training data and rethinking the model architecture. The combination of faithful data annotation and sequential graph generation improves OCSR performance on molecules with abbreviations by a large margin over previous methods.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Base Model</strong>: GTR-VL fine-tunes <strong>Qwen2.5-VL</strong>.</p>
<p><strong>Input/Output Mechanism</strong>:</p>
<ul>
<li><strong>Input</strong>: The model takes an image $I_m$ and a text prompt</li>
<li><strong>Output</strong>: The model generates $R_m = \text{concat}(CoT_m, S_m)$, where it first produces the Chain-of-Thought (the graph traversal steps) followed immediately by the final SMILES string</li>
<li><strong>Traversal Strategy</strong>: Uses <strong>depth-first traversal</strong> to alternately predict atoms and bonds</li>
</ul>
<p><strong>Prompt Structure</strong>: The model is prompted to &ldquo;list the types of atomic elements&hellip; the coordinates&hellip; and the chemical bonds&hellip; then&hellip; output a canonical SMILES&rdquo;. The CoT output is formatted as a JSON list of atoms (with coordinates) and bonds (with indices referring to previous atoms), interleaved.</p>
<h3 id="data">Data</h3>
<p><strong>Training Dataset (GTR-1.3M)</strong>:</p>
<ul>
<li><strong>Synthetic Component</strong>: 1 million molecular SMILES from PubChem, converted to images using Indigo</li>
<li><strong>Real Component</strong>: 351,000 samples from USPTO patents (filtered from an original 680,000)
<ul>
<li>Processed using an OCR pipeline to detect abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Et&rdquo;)</li>
<li>Ground truth expanded structures replaced with superatoms to match visible abbreviations in images</li>
<li>This &ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; correction ensures training supervision matches visual input</li>
</ul>
</li>
</ul>
<p><strong>Evaluation Dataset (MolRec-Bench)</strong>:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 molecular images from USPTO patents</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO-10K-abb subset</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Graph Traversal Algorithm</strong>:</p>
<ul>
<li>Depth-first traversal strategy</li>
<li>Alternating atom-bond prediction sequence</li>
<li>Each step uses previously predicted atoms and bonds as context</li>
</ul>
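<p>A minimal sketch of the traversal idea, assuming an adjacency-list molecular graph (the model's actual token format, atom coordinates, and ring-closure handling are omitted):</p>

```python
def traverse(graph, start=0):
    """Depth-first walk emitting an alternating atom/bond sequence.

    `graph` maps atom index -> list of (neighbor index, bond type).
    Ring-closing bonds back to visited atoms are skipped in this
    sketch; the real scheme must also emit them to fully specify
    the graph.
    """
    seq, visited = [], set()

    def dfs(i):
        visited.add(i)
        seq.append(("atom", i))
        for j, bond in graph[i]:
            if j not in visited:
                seq.append(("bond", bond, i, j))
                dfs(j)

    dfs(start)
    return seq

# Ethanol as an index graph: C(0)-C(1)-O(2), single bonds throughout.
ethanol = {0: [(1, "single")],
           1: [(0, "single"), (2, "single")],
           2: [(1, "single")]}
steps = traverse(ethanol)
# steps alternates: atom 0, bond 0-1, atom 1, bond 1-2, atom 2
```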
<p><strong>Two-Stage Training</strong>:</p>
<ul>
<li><strong>Stage 1 (SFT)</strong>: Train on GTR-1.3M to learn visual CoT mechanism for printed molecules (produces GTR-VL-Stage1)</li>
<li><strong>Stage 2 (GRPO)</strong>: Apply GRPO on GTR-USPTO-4K + DECIMER-HD-Train (4,070 samples) for hand-drawn recognition (produces GTR-VL-Stage2, i.e., GTR-VL)</li>
</ul>
<p><strong>Training Procedure</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate (SFT)</strong>: Peak learning rate of $1.6 \times 10^{-4}$ with cosine decay</li>
<li><strong>Learning Rate (GRPO)</strong>: Peak learning rate of $1 \times 10^{-5}$ with cosine decay</li>
<li><strong>Warm-up</strong>: Linear warm-up for the first 10% of iterations</li>
<li><strong>Batch Size (SFT)</strong>: 2 per GPU with gradient accumulation over 16 steps, yielding <strong>effective batch size of 1024</strong></li>
<li><strong>Batch Size (GRPO)</strong>: 4 per GPU with gradient accumulation of 1, yielding <strong>effective batch size of 128</strong></li>
</ul>
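<p>As a quick consistency check (our arithmetic, assuming data parallelism across all 32 A100 GPUs listed in the Hardware section), the stated effective batch sizes follow from per-GPU batch &times; gradient-accumulation steps &times; GPU count:</p>

```python
gpus = 32  # from the Hardware section; data-parallel use is our assumption

sft_effective = 2 * 16 * gpus   # per-GPU batch x grad accumulation x GPUs
grpo_effective = 4 * 1 * gpus
```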
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong> (three complementary measures to handle abbreviation issues):</p>
<ul>
<li><strong>Gen-SMILES</strong>: Exact match ratio of SMILES strings directly generated by the VLM (image-captioning style)</li>
<li><strong>Gra-SMILES</strong>: Exact match ratio of SMILES strings derived from the predicted graph structure (graph-parsing style)</li>
<li><strong>Graph</strong>: Exact match ratio between ground truth and predicted graphs (node/edge comparison, bypassing SMILES canonicalization issues)</li>
</ul>
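<p>The Graph metric compares node/edge structure directly. The sketch below conveys the idea under a strong simplifying assumption: atom indices in the two graphs are already aligned. A real comparison must handle index permutations (graph isomorphism, e.g., via canonical atom ordering) before checking.</p>

```python
def graphs_match(atoms_a, bonds_a, atoms_b, bonds_b):
    """Exact match of atom lists and undirected typed edge sets.

    Assumes both graphs use the same atom indexing; bonds are
    (i, j, bond_type) triples, normalized to be order-free.
    """
    norm = lambda bonds: {(min(i, j), max(i, j), t) for i, j, t in bonds}
    return list(atoms_a) == list(atoms_b) and norm(bonds_a) == norm(bonds_b)

# Same graph with bond endpoints written in a different order -> match.
same = graphs_match(["C", "C", "O"], [(0, 1, "single"), (1, 2, "single")],
                    ["C", "C", "O"], [(2, 1, "single"), (0, 1, "single")])
# Different bond type -> no match.
diff = graphs_match(["C", "C", "O"], [(0, 1, "single"), (1, 2, "single")],
                    ["C", "C", "O"], [(0, 1, "double"), (1, 2, "single")])
```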
<p><strong>Baselines Compared</strong>:</p>
<ul>
<li>Specialist OCSR systems: MolScribe, MolNexTR</li>
<li>Chemistry-focused VLMs: ChemVLM, ChemDFM-X, OCSU</li>
<li>General-purpose VLMs: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute</strong>: Training performed on <strong>32 NVIDIA A100 GPUs</strong></p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Status</strong>: Closed. As of the paper&rsquo;s publication, no source code, pre-trained model weights, or dataset downloads (GTR-1.3M, MolRec-Bench) have been publicly released. The paper does not mention plans for open-source release. The training data pipeline relies on PubChem SMILES (public), USPTO patent images (publicly available through prior work), the Indigo rendering tool (open-source), and an unspecified OCR system for abbreviation detection. Without the released code and data corrections, reproducing the full pipeline would require substantial re-implementation effort.</p>
]]></content:encoded></item><item><title>MolNexTR: A Dual-Stream Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</guid><description>Dual-stream encoder combining ConvNext and ViT for robust optical chemical structure recognition across diverse molecular drawing styles.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Leung, C. T., Huang, Y., Sun, J., Chen, H., &amp; Gao, H. (2024). MolNexTR: a generalized deep learning model for molecular image recognition. <em>Journal of Cheminformatics</em>, 16(141). <a href="https://doi.org/10.1186/s13321-024-00926-w">https://doi.org/10.1186/s13321-024-00926-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/CYF2000127/MolNexTR">GitHub Repository</a></li>
<li><a href="https://huggingface.co/datasets/CYF200127/MolNexTR/tree/main">HuggingFace Dataset/Model</a></li>
</ul>
<h2 id="methodology-overview-and-taxonomic-classification">Methodology Overview and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$). It proposes a neural network architecture (MolNexTR) that integrates ConvNext and Vision Transformers to solve the Optical Chemical Structure Recognition (OCSR) task. The paper validates this method through ablation studies and benchmarking against existing methods including MolScribe and DECIMER.</p>
<h2 id="the-challenge-of-domain-specific-drawing-styles-in-ocsr">The Challenge of Domain-Specific Drawing Styles in OCSR</h2>
<p>Converting molecular images from chemical literature into machine-readable formats (SMILES) is critical but challenging due to the high variance in drawing styles, fonts, and conventions (e.g., Markush structures, abbreviations). Existing methods have limitations:</p>
<ul>
<li>CNN-based and ViT-based models often struggle to generalize across diverse, non-standard drawing styles found in real literature.</li>
<li>Pure ViT methods lack translation invariance and local feature representation, while pure CNNs struggle with global dependencies.</li>
<li>Many models predict SMILES strings directly, making it difficult to enforce chemical validity or resolve complex stereochemistry and abbreviations.</li>
</ul>
<h2 id="core-innovation-dual-stream-encoding-and-image-contamination">Core Innovation: Dual-Stream Encoding and Image Contamination</h2>
<p>MolNexTR introduces three main innovations:</p>
<ol>
<li><strong>Dual-Stream Encoder</strong>: A hybrid architecture processing images simultaneously through a ConvNext stream (for local features) and a Vision Transformer stream (for long-range dependencies), fusing them to capture multi-scale information.</li>
<li><strong>Image Contamination Augmentation</strong>: A specialized data augmentation algorithm that simulates real-world &ldquo;noise&rdquo; found in literature, such as overlapping text, arrows, and partial molecular fragments, to improve robustness.</li>
<li><strong>Graph-Based Decoding with Post-Processing</strong>: Unlike pure image-to-SMILES translation, it predicts atoms and bonds (graph generation) and uses a stereochemical discrimination and abbreviation self-correction module to enforce chemical rules (e.g., chirality) and resolve superatoms (e.g., &ldquo;Ph&rdquo;, &ldquo;Bn&rdquo;).</li>
</ol>
<p>The prediction of atom labels and coordinates is formulated as a conditional autoregressive generation task, optimized via a cross-entropy loss:
$$ \mathcal{L}_{\text{atom}} = -\sum_{t=1}^{T} \log P(x_t \mid \text{Image}, x_{&lt;t}) $$</p>
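<p>A minimal numeric sketch of this loss: if we already have the probability the decoder assigned to each correct token (a toy stand-in for softmax outputs conditioned on the image and prior tokens), the sequence loss is just the negative sum of their logs.</p>

```python
import math

def sequence_nll(step_probs):
    """Autoregressive NLL: -sum_t log P(x_t | image, x_<t).

    step_probs[t] is the probability the model assigned to the
    correct token at step t; this is illustrative, not the paper's
    training code.
    """
    return -sum(math.log(p) for p in step_probs)

# A confident 3-token prediction incurs a small loss.
print(round(sequence_nll([0.9, 0.8, 0.95]), 4))
```

<p>A perfectly confident prediction (all probabilities 1.0) gives zero loss, which is the optimum the training objective pushes toward.</p>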
<h2 id="experimental-setup-benchmarking-on-synthetic-and-real-data">Experimental Setup: Benchmarking on Synthetic and Real Data</h2>
<p>The model was trained on synthetic data (PubChem) and real patent data (USPTO). It was evaluated on nine benchmarks (three synthetic, six real-world):</p>
<ul>
<li><strong>Synthetic</strong>: Indigo, ChemDraw, RDKit (rendered from 5,719 molecules)</li>
<li><strong>Real-World</strong>: CLEF, UOB, JPO, USPTO, Staker, and a newly curated ACS dataset (diverse styles)</li>
</ul>
<p><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec) and deep learning models (MolScribe, DECIMER, SwinOCSR, Img2Mol).</p>
<p><strong>Ablations</strong>: Tested the impact of the dual-stream encoder vs. single streams, and the contribution of individual augmentation strategies.</p>
<h2 id="empirical-results-and-robustness-findings">Empirical Results and Robustness Findings</h2>
<ul>
<li><strong>Performance</strong>: MolNexTR achieved 81-97% accuracy across test sets, outperforming the second-best method (often MolScribe) by margins ranging from 0.3% to 10.0%, with the largest gain on the difficult ACS dataset.</li>
<li><strong>Perturbation resilience</strong>: The model maintained higher accuracy under image perturbations (rotation, noise) and &ldquo;curved arrow&rdquo; noise common in reaction mechanisms compared to MolScribe and DECIMER (Table 3).</li>
<li><strong>Ablation Results</strong>: The dual-stream encoder consistently outperformed single CNN or ViT baselines, and the image contamination algorithm significantly boosted performance on noisy real-world data (ACS).</li>
<li><strong>Limitations</strong>: The model still struggles with extremely complex hand-drawn molecules and mechanism diagrams where arrows or text are conflated with structure. The authors also note that R-group information in real literature often appears in separate text or tables, which the model does not incorporate.</li>
</ul>
<p><strong>Key Results (Table 2, SMILES exact match accuracy %)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolScribe</th>
          <th>MolNexTR</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Indigo</td>
          <td>97.5</td>
          <td>97.8</td>
          <td>+0.3</td>
      </tr>
      <tr>
          <td>ChemDraw</td>
          <td>93.8</td>
          <td>95.1</td>
          <td>+1.3</td>
      </tr>
      <tr>
          <td>RDKit</td>
          <td>94.6</td>
          <td>96.4</td>
          <td>+1.8</td>
      </tr>
      <tr>
          <td>CLEF</td>
          <td>88.3</td>
          <td>90.4</td>
          <td>+2.1</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td>87.9</td>
          <td>88.5</td>
          <td>+0.6</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>77.7</td>
          <td>82.1</td>
          <td>+4.4</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>92.6</td>
          <td>93.8</td>
          <td>+1.2</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>86.9</td>
          <td>88.3</td>
          <td>+1.4</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td>71.9</td>
          <td>81.9</td>
          <td>+10.0</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: ~1M molecules randomly selected from PubChem, rendered using RDKit and Indigo with varied styles (thickness, fonts, bond width)</li>
<li><strong>Real</strong>: 0.68M images from USPTO, with coordinates normalized from MOLfiles</li>
</ul>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Render Augmentation</strong>: Randomized drawing styles (line width, font size, label modes)</li>
<li><strong>Image Augmentation</strong>: Rotation, cropping, blurring, noise (Gaussian, salt-and-pepper)</li>
<li><strong>Molecular Augmentation</strong>: Randomly replacing functional groups with abbreviations (from a list of &gt;100) or complex chains (e.g., CH3CH2NH2); adding R-groups</li>
<li><strong>Image Contamination</strong>: Adding &ldquo;noise&rdquo; objects (arrows, lines, text, partial structures) at a minimum distance from the main molecule to simulate literature artifacts</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Dual-Stream Encoder</strong>:</p>
<ul>
<li><strong>CNN Stream</strong>: ConvNext backbone (pre-trained on ImageNet), generating feature maps at scales $H/4$ to $H/32$</li>
<li><strong>ViT Stream</strong>: Parallel transformer blocks receiving patches of sizes $p=4, 8, 16, 32$. Uses Multi-Head Self-Attention (MHSA) and Feed-Forward Networks (FFN)</li>
<li><strong>Fusion</strong>: Outputs from both streams are concatenated</li>
</ul>
<p><strong>Decoder (Graph Generation)</strong>:</p>
<ul>
<li><strong>Transformer Decoder</strong>: 6 layers, 8 heads, hidden dim 256</li>
<li><strong>Task 1 (Atoms)</strong>: Autoregressive prediction of atom tokens $(l, x, y)$ (label + coordinates)</li>
<li><strong>Task 2 (Bonds)</strong>: Prediction of bond types between atom pairs (None, Single, Double, Triple, Aromatic, Solid Wedge, Dashed Wedge)</li>
</ul>
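<p>The two decoding tasks can be stitched together into an explicit graph. The sketch below uses hypothetical Python structures (the paper&rsquo;s decoder emits token sequences, not tuples) to show how atom triples and a bond-type matrix yield a molecular graph.</p>

```python
BOND_TYPES = ["none", "single", "double", "triple", "aromatic",
              "solid_wedge", "dashed_wedge"]

def assemble_graph(atom_tokens, bond_matrix):
    """Turn decoder outputs into an explicit molecular graph.

    atom_tokens: list of (label, x, y) triples from Task 1.
    bond_matrix: bond_matrix[i][j] is a bond-type index from Task 2.
    Names and data layout are illustrative assumptions.
    """
    atoms = [{"label": l, "pos": (x, y)} for l, x, y in atom_tokens]
    bonds = []
    n = len(atoms)
    for i in range(n):
        for j in range(i + 1, n):
            t = BOND_TYPES[bond_matrix[i][j]]
            if t != "none":
                bonds.append((i, j, t))
    return atoms, bonds

atoms, bonds = assemble_graph(
    [("C", 0.1, 0.5), ("O", 0.4, 0.5)],
    [[0, 1], [1, 0]],
)
print(bonds)  # [(0, 1, 'single')]
```

<p>The resulting graph, with coordinates and wedge/dash bond types, is exactly what the post-processing stage consumes.</p>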
<p><strong>Post-Processing</strong>:</p>
<ul>
<li><strong>Stereochemistry</strong>: Uses predicted coordinates and bond types (wedge/dash) to resolve chirality using RDKit logic</li>
<li><strong>Abbreviation Correction</strong>: Matches superatoms to a dictionary; if unknown, attempts to greedily connect atoms based on valence or finds the nearest match ($\sigma=0.8$ similarity threshold)</li>
</ul>
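<p>A minimal sketch of the abbreviation-correction idea, using stdlib string similarity as a stand-in for the paper&rsquo;s matching procedure. The tiny dictionary and fragment SMILES here are illustrative, not the paper&rsquo;s full list of more than 100 abbreviations; only the $\sigma=0.8$ threshold is taken from the text.</p>

```python
from difflib import SequenceMatcher

# Illustrative subset of a superatom dictionary (label -> fragment SMILES).
ABBREVIATIONS = {"Ph": "c1ccccc1", "Bn": "Cc1ccccc1", "OMe": "OC"}

def resolve_superatom(label, sigma=0.8):
    """Expand a superatom label to a fragment SMILES.

    Exact dictionary hits win; otherwise fall back to the nearest
    dictionary key whose similarity clears the sigma threshold.
    """
    if label in ABBREVIATIONS:
        return ABBREVIATIONS[label]
    best, best_score = None, 0.0
    for known, smiles in ABBREVIATIONS.items():
        score = SequenceMatcher(None, label, known).ratio()
        if score > best_score:
            best, best_score = smiles, score
    return best if best_score >= sigma else None

print(resolve_superatom("Ph"))   # c1ccccc1
print(resolve_superatom("XYZ"))  # None (no match above threshold)
```

<p>When no dictionary entry clears the threshold, the paper falls back to greedily connecting atoms based on valence, which this sketch omits.</p>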
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder (ConvNext + ViT Encoder -&gt; Transformer Decoder)</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Optimizer: Adam (max lr 3e-4, linear warmup for 5% of steps)</li>
<li>Batch Size: 256</li>
<li>Image Size: $384 \times 384$</li>
<li>Dropout: 0.1</li>
</ul>
</li>
<li><strong>Training</strong>: Fine-tuned CNN backbone for 40 epochs on 10 NVIDIA RTX 3090 GPUs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: SMILES sequence exact matching accuracy (canonicalized)</p>
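<p>A dependency-free sketch of this metric. In practice both strings would be canonicalized with a cheminformatics toolkit such as RDKit (e.g. <code>Chem.MolToSmiles</code>); here the canonicalizer is injectable so the metric itself stays self-contained, and the toy normalizer below is only a placeholder.</p>

```python
def exact_match_accuracy(preds, targets, canonicalize=lambda s: s):
    """SMILES exact-match accuracy after canonicalization.

    canonicalize is a pluggable normalizer; a real evaluation would
    pass an RDKit-backed canonical SMILES function here.
    """
    hits = sum(
        canonicalize(p) == canonicalize(t)
        for p, t in zip(preds, targets)
    )
    return hits / len(targets)

# A toy "canonicalizer" that only strips whitespace still shows why
# normalizing both sides before string comparison matters.
acc = exact_match_accuracy(["CCO ", "CCC"], ["CCO", "CC=O"],
                           canonicalize=str.strip)
print(acc)  # 0.5
```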
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: Indigo (5,719), ChemDraw (5,719), RDKit (5,719)</li>
<li><strong>Real</strong>: CLEF (992), UOB (5,740), JPO (450), USPTO (5,719), Staker (50,000), ACS (331)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: 10 NVIDIA RTX 3090 GPUs</li>
<li><strong>Cluster</strong>: HPC3 Cluster at HKUST (ITSC)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CYF2000127/MolNexTR">MolNexTR GitHub</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation (PyTorch, Jupyter notebooks)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/CYF200127/MolNexTR">MolNexTR HuggingFace</a></td>
          <td>Dataset/Model</td>
          <td>Apache-2.0</td>
          <td>Training data and model checkpoint</td>
      </tr>
  </tbody>
</table>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chenMolNexTRGeneralizedDeep2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{MolNexTR}: a generalized deep learning model for molecular image recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chen, Yufan and Leung, Ching Ting and Huang, Yong and Sun, Jianwei and Chen, Hao and Gao, Hanyu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{141}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00926-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser: End-to-End Molecular Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</guid><description>MolParser converts molecular images from scientific documents to machine-readable formats using end-to-end learning with Extended SMILES representation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Wang, J., Cai, X., Chen, S., Yang, S., Tao, H., Wang, N., Yao, L., Zhang, L., &amp; Ke, G. (2025). MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild. In <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em> (pp. 24528-24538). <a href="https://doi.org/10.48550/arXiv.2411.11098">https://doi.org/10.48550/arXiv.2411.11098</a></p>
<p><strong>Publication</strong>: ICCV 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/">MolParser-7M Dataset</a> - 7M+ image-text pairs for OCSR</li>
<li><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M on HuggingFace</a> - Dataset repository</li>
<li><a href="https://huggingface.co/UniParser/MolDet">MolDet YOLO Detector</a> - Object detection model for extracting molecular images from documents</li>
</ul>
<h2 id="contribution-end-to-end-ocsr-and-real-world-resources">Contribution: End-to-End OCSR and Real-World Resources</h2>
<p>This is primarily a <strong>Method</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a>), with a significant secondary contribution as a <strong>Resource</strong> paper.</p>
<p><strong>Method contribution ($\Psi_{\text{Method}}$)</strong>: The paper proposes a novel end-to-end architecture combining a Swin Transformer encoder with a BART decoder, and crucially introduces <strong>Extended SMILES (E-SMILES)</strong>, a new syntactic extension to standard SMILES notation that enables representation of Markush structures, abstract rings, and variable attachment points found in patents. The work validates this method through extensive ablation studies, achieving the highest accuracy among tested OCSR systems on WildMol-10k (76.9%).</p>
<p><strong>Resource contribution ($\Psi_{\text{Resource}}$)</strong>: The paper introduces <strong>MolParser-7M</strong>, the largest OCSR dataset to date (7.7M image-text pairs), and <strong>WildMol</strong>, a challenging benchmark of 20,000 manually annotated real-world molecular images. The construction of these datasets through an active learning data engine with human-in-the-loop validation represents significant infrastructure that enables future OCSR research.</p>
<h2 id="motivation-extracting-chemistry-from-real-world-documents">Motivation: Extracting Chemistry from Real-World Documents</h2>
<p>The motivation stems from a practical problem in chemical informatics: vast amounts of chemical knowledge remain embedded in unstructured formats. Patents, research papers, and legacy documents depict molecular structures as images. This creates a barrier for large-scale data analysis and prevents Large Language Models from effectively understanding scientific literature in chemistry and drug discovery.</p>
<p>Existing OCSR methods struggle with real-world documents for two fundamental reasons:</p>
<ol>
<li><strong>Representational limitations</strong>: Standard SMILES notation cannot capture complex structural templates like <strong>Markush structures</strong>, which are ubiquitous in patents. These structures define entire families of compounds using variable R-groups and abstract patterns, making them essential for intellectual property but impossible to represent with conventional methods.</li>
<li><strong>Data distribution mismatch</strong>: Real-world molecular images suffer from noise, inconsistent drawing styles, variable resolution, and interference from surrounding text. Models trained exclusively on clean, synthetically rendered molecules fail to generalize when applied to actual documents.</li>
</ol>
<h2 id="novelty-e-smiles-and-human-in-the-loop-data-engine">Novelty: E-SMILES and Human-in-the-Loop Data Engine</h2>
<p>The novelty lies in a comprehensive system that addresses both representation and data quality challenges through four integrated contributions:</p>
<ol>
<li>
<p><strong>Extended SMILES (E-SMILES)</strong>: A backward-compatible extension to the SMILES format that can represent complex structures previously inexpressible in standard chemical notations. E-SMILES uses a separator token <code>&lt;sep&gt;</code> to delineate the core molecular structure from supplementary annotations. These annotations employ XML-like tags to encode Markush structures, polymers, abstract rings, and other complex patterns. Critically, the core structure remains parseable by standard cheminformatics tools like RDKit, while the supplementary tags provide a structured, LLM-friendly format for capturing edge cases.</p>
</li>
<li>
<p><strong>MolParser-7M Dataset</strong>: The largest publicly available OCSR dataset, containing over 7 million image-text pairs. What distinguishes this dataset is both its scale and its composition. It includes 400,000 &ldquo;in-the-wild&rdquo; samples: molecular images extracted from actual patents and scientific papers and subsequently curated by human annotators. This real-world data addresses the distribution mismatch problem directly by exposing the model to the same noise, artifacts, and stylistic variations it encounters in production.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Data Engine</strong>: A systematic approach to collecting and annotating real-world training data. The pipeline begins with an object detection model that extracts molecular images from over a million PDF documents. An active learning algorithm then identifies the most informative samples (those where the current model struggles) for human annotation. The model pre-annotates these images, and human experts review and correct them. This creates an iterative improvement cycle: annotate, train, identify new challenging cases, repeat.</p>
</li>
<li>
<p><strong>Efficient End-to-End Architecture</strong>: The model treats OCSR as an image captioning problem. A Swin-Transformer vision encoder extracts visual features, a simple MLP compresses them, and a BART decoder generates the E-SMILES string autoregressively. The model minimizes the standard negative log-likelihood of the target E-SMILES token sequence $y$ given the sequence history and input image $x$:</p>
</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x; \theta)
\end{aligned}
$$</p>
<p>The training strategy employs curriculum learning, starting with simple molecules and gradually introducing complexity and heavier data augmentation.</p>
<h2 id="experimental-setup-two-stage-training-and-benchmarking">Experimental Setup: Two-Stage Training and Benchmarking</h2>
<p>The evaluation focused on demonstrating that MolParser generalizes to real-world documents:</p>
<ol>
<li>
<p><strong>Two-Stage Training Protocol</strong>: The model underwent a systematic training process:</p>
<ul>
<li><strong>Pre-training</strong>: Initial training on millions of synthetic molecular images using curriculum learning. The curriculum progresses from simple molecules to complex structures while gradually increasing data augmentation intensity (blur, noise, perspective transforms).</li>
<li><strong>Fine-tuning</strong>: Subsequent training on 400,000 curated real-world samples extracted from patents and papers. This fine-tuning phase is critical for adapting to the noise and stylistic variations of actual documents.</li>
</ul>
</li>
<li>
<p><strong>Benchmark Evaluation</strong>: The model was evaluated on multiple standard OCSR benchmarks to establish baseline performance on clean data. These benchmarks test recognition accuracy on well-formatted molecular diagrams.</p>
</li>
<li>
<p><strong>Real-World Document Analysis</strong>: The critical test involved applying MolParser to molecular structures extracted directly from scientific documents. This evaluation measures the gap between synthetic benchmark performance and real-world applicability (the core problem the paper addresses).</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Experiments isolating the contribution of each component:</p>
<ul>
<li>The impact of real-world training data versus synthetic-only training</li>
<li>The effectiveness of curriculum learning versus standard training</li>
<li>The value of the human-in-the-loop annotation pipeline versus random sampling</li>
<li>The necessity of E-SMILES extensions for capturing complex structures</li>
</ul>
</li>
</ol>
<h2 id="outcomes-and-empirical-findings">Outcomes and Empirical Findings</h2>
<ul>
<li>
<p><strong>Performance on Benchmarks</strong>: MolParser achieves competitive results on standard benchmarks and the best performance on real-world documents. On clean benchmarks like USPTO-10K, MolScribe (96.0%) slightly edges MolParser-Base (94.5%), but on WildMol-10k, MolParser-Base achieved 76.9% accuracy, significantly outperforming MolScribe (66.4%) and MolGrapher (45.5%). This gap validates the core hypothesis that training on actual document images is essential for practical deployment.</p>
</li>
<li>
<p><strong>Real-World Data is Critical</strong>: Models trained exclusively on synthetic data show substantial performance degradation when applied to real documents. The 400,000 in-the-wild training samples bridge this gap, demonstrating that data quality and distribution matching matter as much as model architecture. Ablation experiments showed that pretraining on MolParser-7M synthetic data alone achieved 51.9% accuracy on WildMol, while adding real-world fine-tuning raised this to 76.9%. Using the smaller MolGrapher-300k synthetic dataset without fine-tuning yielded only 22.4%.</p>
</li>
<li>
<p><strong>E-SMILES Enables Broader Coverage</strong>: The extended representation successfully captures molecular structures that were previously inexpressible, particularly Markush structures from patents. This expands the scope of what can be automatically extracted from chemical literature to include patent-style structural templates.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Scales Efficiently</strong>: The active learning pipeline reduces annotation time by approximately 90% while maintaining high quality. This approach makes it feasible to curate large-scale, high-quality datasets for specialized domains where expert knowledge is expensive.</p>
</li>
<li>
<p><strong>Speed and Accuracy</strong>: The end-to-end architecture achieves both high accuracy and fast inference, making it practical for large-scale document processing. MolParser-Base processes 40 images per second on an RTX 4090D, while the Tiny variant achieves 131 FPS. The direct image-to-text approach avoids the error accumulation of multi-stage pipelines.</p>
</li>
<li>
<p><strong>Downstream Applications</strong>: The Swin Transformer encoder, once trained on MolParser-7M, serves as an effective molecular fingerprint for property prediction. Paired with a simple two-layer MLP on MoleculeNet benchmarks, MolParser-pretrained features achieved an average ROC-AUC of 73.7% across five tasks, compared to 68.9% for ImageNet-pretrained Swin-T features. The authors also demonstrate chemical reaction parsing by feeding MolDet detections and MolParser E-SMILES into GPT-4o.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge that molecular chirality is not yet fully exploited by the system. The E-SMILES format does not currently support dashed abstract rings, coordination bonds, special symbol Markush patterns, or replication of long structural segments. Additionally, scaling up the volume of real annotated training data could further improve performance.</p>
</li>
</ul>
<p>The work establishes that practical OCSR requires more than architectural innovations. It demands careful attention to data quality, representation design, and the distribution mismatch between synthetic training data and real-world applications. The combination of E-SMILES, the MolParser-7M dataset, and the human-in-the-loop data engine provides a template for building reliable vision systems in scientific domains where clean training data is scarce but expert knowledge is available.</p>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>7.7M image-SMILES pairs for OCSR pretraining and fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet</a></td>
          <td>Model</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>YOLO11-based molecule detector for PDF documents</td>
      </tr>
  </tbody>
</table>
<p>No official source code repository has been released. Model weights for MolParser itself are not publicly available as of the dataset release.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is split into a massive synthetic pre-training set and a curated fine-tuning set.</p>
<p><strong>Training Data Composition (MolParser-7M)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Name</th>
          <th>Size</th>
          <th>Composition / Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pre-training</strong></td>
          <td>MolParser-7M (Synthetic)</td>
          <td>~7.7M</td>
          <td><strong>Markush-3M</strong> (40%), <strong>ChEMBL-2M</strong> (27%), <strong>Polymer-1M</strong> (14%), PAH-600k (8%), BMS-360k (5%), MolGrapher-300K (4%), Pauling-100k (2%). Generated via RDKit/Indigo with randomized styles.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-SFT-400k</td>
          <td>400k</td>
          <td>Real images from patents/papers selected via active learning (confidence filtering 0.6-0.9) and manually annotated. 66% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-Gen-200k</td>
          <td>200k</td>
          <td>Subset of synthetic data kept to prevent catastrophic forgetting. 32% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Handwrite-5k</td>
          <td>5k</td>
          <td>Handwritten molecules from Img2Mol to support hand-drawn queries. 1% of fine-tuning mix.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Sources</strong>: 1.2M patents and scientific papers (PDF documents)</li>
<li><strong>Extraction</strong>: MolDet (YOLO11-based detector) identified ~20M molecular images, deduplicated to ~4M candidates</li>
<li><strong>Selection</strong>: Active learning ensemble (5-fold models) identified high-uncertainty samples for annotation</li>
<li><strong>Annotation</strong>: Human experts corrected model pre-annotations (90% time savings vs. from-scratch annotation)</li>
</ul>
<p><strong>Test Benchmarks</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-10k</td>
          <td>10,000</td>
          <td>Standard synthetic benchmark</td>
      </tr>
      <tr>
          <td>Maybridge UoB</td>
          <td>-</td>
          <td>Synthetic molecules</td>
      </tr>
      <tr>
          <td>CLEF-2012</td>
          <td>-</td>
          <td>Patent images</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>-</td>
          <td>Japanese patent office</td>
      </tr>
      <tr>
          <td>ColoredBG</td>
          <td>-</td>
          <td>Colored background molecules</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td>10,000</td>
          <td>Ordinary molecules cropped from real PDFs (new)</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k-M</strong></td>
          <td>10,000</td>
          <td>Markush structures (significantly harder, new)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Extended SMILES (E-SMILES) Encoding</strong>:</p>
<ul>
<li><strong>Format</strong>: <code>SMILES&lt;sep&gt;EXTENSION</code> where <code>&lt;sep&gt;</code> separates core structure from supplementary annotations</li>
<li><strong>Extensions use XML-like tags</strong>:
<ul>
<li><code>&lt;a&gt;index:group&lt;/a&gt;</code> for substituents/variable groups (Markush structures)</li>
<li><code>&lt;r&gt;</code> for groups connected at any ring position</li>
<li><code>&lt;c&gt;</code> for abstract rings</li>
<li><code>&lt;dum&gt;</code> for connection points</li>
</ul>
</li>
<li><strong>Backward compatible</strong>: Core SMILES parseable by RDKit; extensions provide structured format for edge cases</li>
</ul>
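<p>A small sketch of how such a string could be split into its RDKit-parseable core and its annotation tags. The separator token and tag names follow the description above; the exact serialization is a best-effort reading, not the official E-SMILES grammar.</p>

```python
import re

def parse_esmiles(esmiles):
    """Split an E-SMILES string into (core_smiles, annotation_tags).

    Tags are returned as (tag_name, payload) pairs for the XML-like
    markers described in the paper: a, r, c, dum.
    """
    core, _, extension = esmiles.partition("<sep>")
    tags = re.findall(r"<(a|r|c|dum)>(.*?)</\1>", extension)
    return core, tags

# A benzene core with one hypothetical Markush substituent annotation.
core, tags = parse_esmiles("c1ccccc1[*]<sep><a>1:R1</a>")
print(core)  # c1ccccc1[*]
print(tags)  # [('a', '1:R1')]
```

<p>Because everything before the separator is plain SMILES, existing cheminformatics tools can process the core while downstream systems consume the structured tags.</p>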
<p><strong>Curriculum Learning Strategy</strong>:</p>
<ul>
<li><strong>Phase 1</strong>: No augmentation, simple molecules (&lt;60 tokens)</li>
<li><strong>Phase 2</strong>: Gradually increase augmentation intensity and sequence length</li>
<li>Progressive complexity allows stable training on diverse molecular structures</li>
</ul>
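<p>One way to express such a schedule is a function from training progress to a data configuration. The linear ramp and phase boundary below are assumptions for illustration; the paper specifies the direction of the schedule (simple molecules under 60 tokens first, then rising complexity and augmentation), not its exact shape.</p>

```python
def curriculum_config(progress, base_tokens=60):
    """Map training progress in [0, 1] to a curriculum setting.

    Phase 1 (early training): simple molecules only, no augmentation.
    Phase 2: token cap and augmentation strength ramp up together.
    The 0.2 phase boundary and linear ramp are illustrative choices.
    """
    if progress < 0.2:  # phase 1: clean, simple samples
        return {"max_len": base_tokens, "aug_strength": 0.0}
    ramp = (progress - 0.2) / 0.8  # phase 2 ramp, 0 -> 1
    return {"max_len": int(base_tokens + ramp * 200),
            "aug_strength": round(ramp, 2)}

print(curriculum_config(0.1))  # {'max_len': 60, 'aug_strength': 0.0}
print(curriculum_config(1.0))  # {'max_len': 260, 'aug_strength': 1.0}
```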
<p><strong>Active Learning Data Selection</strong>:</p>
<ol>
<li>Train 5 model folds on current dataset</li>
<li>Compute pairwise Tanimoto similarity of predictions on candidate images</li>
<li>Select samples with confidence scores <strong>0.6-0.9</strong> for human review (highest learning value)</li>
<li>Human experts correct model pre-annotations</li>
<li>Iteratively expand training set with hard samples</li>
</ol>
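<p>The selection step can be sketched as a confidence filter over ensemble predictions. Here confidence is the mean pairwise exact-match agreement among the five folds, a simple stand-in for the paper&rsquo;s pairwise Tanimoto similarity; the 0.6-0.9 band is taken from the text.</p>

```python
def select_for_annotation(candidates, low=0.6, high=0.9):
    """Pick samples whose ensemble confidence falls in [low, high].

    candidates maps sample id -> list of SMILES predictions from the
    model folds. Unanimous samples are too easy to be informative;
    fully disagreeing ones are likely unlearnable noise.
    """
    selected = []
    for sample_id, preds in candidates.items():
        pairs = [(a, b) for i, a in enumerate(preds)
                 for b in preds[i + 1:]]
        conf = sum(a == b for a, b in pairs) / len(pairs)
        if low <= conf <= high:
            selected.append(sample_id)
    return selected

folds = {
    "img_1": ["CCO"] * 5,                          # unanimous: skip
    "img_2": ["CCO", "CCO", "CCO", "CCO", "CCN"],  # borderline: keep
    "img_3": ["CCO", "CCN", "CO", "C", "N"],       # chaos: skip
}
print(select_for_annotation(folds))  # ['img_2']
```

<p>Only the borderline sample is routed to human review, which is where pre-annotation correction yields the most learning value per annotator hour.</p>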
<p><strong>Data Augmentations</strong>:</p>
<ul>
<li>RandomAffine (rotation, scale, translation)</li>
<li>JPEGCompress (compression artifacts)</li>
<li>InverseColor (color inversion)</li>
<li>SurroundingCharacters (text interference)</li>
<li>RandomCircle (circular artifacts)</li>
<li>ColorJitter (brightness, contrast variations)</li>
<li>Downscale (resolution reduction)</li>
<li>Bounds (boundary cropping variations)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture follows a standard <strong>Image Captioning</strong> (Encoder-Decoder) paradigm.</p>
<p><strong>Architecture Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Vision Encoder</strong></td>
          <td>Swin Transformer (ImageNet pretrained)</td>
      </tr>
      <tr>
          <td>- Tiny variant</td>
          <td>66M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Small variant</td>
          <td>108M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Base variant</td>
          <td>216M parameters, $384 \times 384$ input</td>
      </tr>
      <tr>
          <td><strong>Connector</strong></td>
          <td>2-layer MLP reducing channel dimension by half</td>
      </tr>
      <tr>
          <td><strong>Text Decoder</strong></td>
          <td>BART-Decoder (12 layers, 16 attention heads)</td>
      </tr>
  </tbody>
</table>
<p><strong>Training Configuration</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Pre-training</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Hardware</strong></td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
      </tr>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>AdamW</td>
          <td>AdamW</td>
      </tr>
      <tr>
          <td><strong>Learning Rate</strong></td>
          <td>$1 \times 10^{-4}$</td>
          <td>$5 \times 10^{-5}$</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong></td>
          <td>$1 \times 10^{-2}$</td>
          <td>$1 \times 10^{-2}$</td>
      </tr>
      <tr>
          <td><strong>Scheduler</strong></td>
          <td>Cosine with warmup</td>
          <td>Cosine with warmup</td>
      </tr>
      <tr>
          <td><strong>Epochs</strong></td>
          <td>20</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Label Smoothing</strong></td>
          <td>0.01</td>
          <td>0.005</td>
      </tr>
  </tbody>
</table>
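<p>The "cosine with warmup" schedule in the table can be written in a few lines. The warmup length below is a placeholder, since the paper reports the peak learning rates but not the warmup duration.</p>

```python
import math

def lr_at(step, total_steps, base_lr=1e-4, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr.
    base_lr matches the pre-training value above; warmup_steps is
    an assumed placeholder."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```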
<p><strong>Curriculum Learning Schedule</strong> (Pre-training):</p>
<ul>
<li>Starts with simple molecules (&lt;60 tokens, no augmentation)</li>
<li>Gradually adds complexity and augmentation (blur, noise, perspective transforms)</li>
<li>Enables stable learning across diverse molecular structures</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>: Exact match accuracy on predicted E-SMILES strings (molecule-level exact match)</p>
<p><strong>Key Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark / Metric</th>
          <th>MolParser-Base</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td><strong>76.9%</strong></td>
          <td>66.4%</td>
          <td>45.5%</td>
          <td>Real-world patent/paper crops</td>
      </tr>
      <tr>
          <td><strong>USPTO-10k</strong></td>
          <td>94.5%</td>
          <td><strong>96.0%</strong></td>
          <td>93.3%</td>
          <td>Synthetic benchmark</td>
      </tr>
      <tr>
          <td><strong>Throughput (FPS)</strong></td>
          <td><strong>39.8</strong></td>
          <td>16.5</td>
          <td>2.2</td>
          <td>Measured on RTX 4090D</td>
      </tr>
  </tbody>
</table>
<p><strong>Additional Performance</strong>:</p>
<ul>
<li>MolParser-Tiny: 131 FPS on RTX 4090D (66M params)</li>
<li>Real-world vs. synthetic gap: Fine-tuning on MolParser-SFT-400k substantially narrowed the gap between clean synthetic benchmarks and in-the-wild documents</li>
</ul>
<p><strong>Ablation Findings</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Real-world training data</td>
          <td>Fine-tuning on real data raised accuracy from 51.9% to 76.9% on WildMol-10k</td>
      </tr>
      <tr>
          <td>Curriculum learning</td>
          <td>Augmentation alone raised WildMol-10k from 40.1% to 69.5%; adding curriculum learning further raised it to 76.9%</td>
      </tr>
      <tr>
          <td>Active learning selection</td>
          <td>More effective use of a fixed annotation budget than random sampling</td>
      </tr>
      <tr>
          <td>E-SMILES extensions</td>
          <td>Essential for Markush structure recognition (impossible with standard SMILES)</td>
      </tr>
      <tr>
          <td>Dataset scale</td>
          <td>Larger pre-training dataset (7M vs 300k) improved WildMol-10k accuracy from 22.4% to 51.9% before fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 8x NVIDIA RTX 4090D GPUs</li>
<li><strong>Inference</strong>: Single RTX 4090D sufficient for real-time processing</li>
<li><strong>Training time</strong>: 20 epochs pre-training + 4 epochs fine-tuning (specific duration not reported)</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2025molparser,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Xi and Wang, Jiankun and Cai, Xiaochen and Chen, Shangqian and Yang, Shuwen and Tao, Haoyi and Wang, Nan and Yao, Lin and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24528--24538}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2411.11098}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2411.11098}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser-7M &amp; WildMol: Large-Scale OCSR Datasets</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</guid><description>MolParser-7M is the largest open-source OCSR dataset with 7.7M image-SMILES pairs including 400k real-world annotated samples.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/molparser-markush-example.webp"
         alt="Example of a complex Markush structure"
         title="Example of a complex Markush structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example of a complex Markush structure that can be represented by the E-SMILES format but not by standard SMILES or FG-SMILES.</figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/molparser-low-quality-example.webp"
         alt="Sample from the WildMol benchmark"
         title="Sample from the WildMol benchmark"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A sample from the WildMol benchmark, showing a low-quality, noisy molecular image cropped from real-world literature that challenges OCSR systems.</figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/molparser-colored-example.webp"
         alt="Colored molecule with annotations"
         title="Colored molecule with annotations"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser-7M (Training Set)</strong></td>
          <td>7,740,871</td>
          <td>A large-scale dataset for training OCSR models, split into pre-training and fine-tuning stages.</td>
      </tr>
      <tr>
          <td><strong>WildMol (Test Set)</strong></td>
          <td>20,000</td>
          <td>A benchmark of 20,000 human-annotated samples cropped from real PDF files to evaluate OCSR models in &lsquo;in-the-wild&rsquo; scenarios. Comprises WildMol-10k (10k ordinary molecules) and WildMol-10k-M (10k Markush structures).</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="wildmol-10k-accuracy">WildMol-10K Accuracy<a hidden class="anchor" aria-hidden="true" href="#wildmol-10k-accuracy">#</a></h3>
    <p class="benchmark-description">Evaluation of OCSR models on 10,000 real-world molecular images cropped from scientific literature and patents</p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Accuracy (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>MolParser-Base</strong><br><small>End-to-end visual recognition trained on MolParser-7M</small>
          </td>
          <td>76.9</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>MolScribe</strong><br><small>Transformer-based OCSR system</small>
          </td>
          <td>66.4</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>DECIMER 2.7</strong><br><small>Deep learning for chemical image recognition</small>
          </td>
          <td>56</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>MolGrapher</strong><br><small>Graph-based molecular structure recognition</small>
          </td>
          <td>45.5</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>MolVec 0.9.7</strong><br><small>Vector-based structure recognition</small>
          </td>
          <td>26.4</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>OSRA 2.1</strong><br><small>Optical Structure Recognition Application</small>
          </td>
          <td>26.3</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Img2Mol</strong><br><small>Image-to-molecule translation</small>
          </td>
          <td>24.4</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Imago 2.0</strong><br><small>Chemical structure recognition toolkit</small>
          </td>
          <td>6.9</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="key-contribution">Key Contribution</h2>
<p>Introduces MolParser-7M, the largest open-source Optical Chemical Structure Recognition (OCSR) dataset, uniquely combining diverse synthetic data with a large volume of manually-annotated, &ldquo;in-the-wild&rdquo; images from real scientific documents to improve model robustness. Also introduces WildMol, a new challenging benchmark for evaluating OCSR performance on real-world data, including Markush structures.</p>
<h2 id="overview">Overview</h2>
<p>The MolParser project addresses the challenge of recognizing molecular structures from images found in real-world scientific documents. Unlike existing OCSR datasets that rely primarily on synthetically generated images, MolParser-7M incorporates 400,000 manually annotated images cropped from actual patents and scientific papers, making it the first large-scale dataset to bridge the gap between synthetic training data and real-world deployment scenarios.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Largest open-source OCSR dataset with over 7.7 million pairs</li>
<li>The only large-scale OCSR training set that includes a significant amount (400k) of &ldquo;in-the-wild&rdquo; data cropped from real patents and literature</li>
<li>High diversity of molecular structures from numerous sources (PubChem, ChEMBL, polymers, etc.)</li>
<li>Introduces the WildMol benchmark for evaluating performance on challenging, real-world data, including Markush structures</li>
<li>The &ldquo;in-the-wild&rdquo; fine-tuning data (MolParser-SFT-400k) was curated via an efficient active learning data engine with human-in-the-loop validation</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The E-SMILES format cannot represent certain complex cases, such as coordination bonds, dashed abstract rings, Markush structures depicted with special patterns, and replication of long structural segments on the skeleton</li>
<li>The model and data do not yet fully exploit molecular chirality, which is critical for chemical properties</li>
<li>Performance could be further improved by scaling up the amount of real annotated training data</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="synthetic-data-generation">Synthetic Data Generation</h3>
<p>To ensure diversity, molecular structures were collected from databases like ChEMBL, PubChem, and Kaggle BMS. A significant number of Markush, polymer, and fused-ring structures were also randomly generated. Images were rendered using RDKit and epam.indigo with randomized parameters (e.g., bond width, font size, rotation) to increase visual diversity. The pretraining dataset is composed of the following subsets:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Markush-3M</td>
          <td>40%</td>
          <td>Random groups replacement from PubChem</td>
      </tr>
      <tr>
          <td>ChEMBL-2M</td>
          <td>27%</td>
          <td>Molecules selected from ChEMBL</td>
      </tr>
      <tr>
          <td>Polymer-1M</td>
          <td>14%</td>
          <td>Randomly generated polymer molecules</td>
      </tr>
      <tr>
          <td>PAH-600k</td>
          <td>8%</td>
          <td>Randomly generated fused-ring molecules</td>
      </tr>
      <tr>
          <td>BMS-360k</td>
          <td>5%</td>
          <td>Molecules with long carbon chains from BMS</td>
      </tr>
      <tr>
          <td>MolGrapher-300K</td>
          <td>4%</td>
          <td>Training data from MolGrapher</td>
      </tr>
      <tr>
          <td>Pauling-100k</td>
          <td>2%</td>
          <td>Pauling-style images drawn using epam.indigo</td>
      </tr>
  </tbody>
</table>
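<p>If the ratios are interpreted as sampling weights, the mixture can be reproduced with a weighted draw per training example. This is a sketch: the paper states the composition ratios, but not whether examples were drawn stochastically or the subsets simply concatenated.</p>

```python
import random

# Pre-training mixture from the table above (subset -> ratio).
PRETRAIN_MIX = {
    "Markush-3M": 0.40, "ChEMBL-2M": 0.27, "Polymer-1M": 0.14,
    "PAH-600k": 0.08, "BMS-360k": 0.05, "MolGrapher-300K": 0.04,
    "Pauling-100k": 0.02,
}

def sample_subset(rng=None):
    """Draw which subset the next training example comes from."""
    rng = rng or random.Random()
    names = list(PRETRAIN_MIX)
    weights = [PRETRAIN_MIX[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```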
<h3 id="in-the-wild-data-engine-molparser-sft-400k">In-the-Wild Data Engine (MolParser-SFT-400k)</h3>
<p>A YOLO11 object detection model (MolDet) located and cropped over 20 million molecule images from 1.22 million real PDFs (patents and papers). After de-duplication via p-hash similarity, 4 million unique images remained.</p>
<p>An active learning algorithm was used to select the most informative samples for annotation, targeting images where an ensemble of 5-fold models showed moderate confidence (0.6-0.9 Tanimoto similarity), indicating they were challenging but learnable.</p>
<p>This active learning approach with model pre-annotations reduced manual annotation time per molecule to 30 seconds, approximately 90% savings compared to annotating from scratch. In the final fine-tuning dataset, 56.04% of annotations directly utilized raw model pre-annotations, 20.97% passed review after a single manual correction, 13.87% were accepted after a second round of annotation, and 9.13% required three or more rounds.</p>
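<p>Those fractions imply well under one correction round per molecule on average. Treating the "three or more" bucket as exactly three gives a lower bound:</p>

```python
# Fraction of final annotations by number of correction rounds needed
# (from the text); the 3+ bucket is floored at 3, so the expectation
# below is a lower bound.
rounds = {0: 0.5604, 1: 0.2097, 2: 0.1387, 3: 0.0913}

expected_rounds = sum(k * p for k, p in rounds.items())
print(f"expected correction rounds per molecule >= {expected_rounds:.2f}")
```

<p>This works out to roughly 0.76 correction rounds per molecule, consistent with the reported 90% time savings over annotating from scratch.</p>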
<p>The fine-tuning dataset is composed of:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-SFT-400k</td>
          <td>66%</td>
          <td>Manually annotated data obtained via data engine</td>
      </tr>
      <tr>
          <td>MolParser-Gen-200k</td>
          <td>32%</td>
          <td>Synthetic data selected from pretraining stage</td>
      </tr>
      <tr>
          <td>Handwrite-5k</td>
          <td>1%</td>
          <td>Handwritten molecules selected from Img2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="e-smiles-specification">E-SMILES Specification</h3>
<p>To accommodate complex patent structures that standard SMILES cannot support, the authors introduced an Extended SMILES format (<code>SMILES&lt;sep&gt;EXTENSION</code>). The <code>EXTENSION</code> component uses XML-like tokens to manage complexities:</p>
<ul>
<li><code>&lt;a&gt;...&lt;/a&gt;</code> encapsulates Markush R-groups and abbreviation groups.</li>
<li><code>&lt;r&gt;...&lt;/r&gt;</code> denotes ring attachments with uncertainty positions.</li>
<li><code>&lt;c&gt;...&lt;/c&gt;</code> defines abstract rings.</li>
<li><code>&lt;dum&gt;</code> identifies a connection point.</li>
</ul>
<p>This format enables Markush-molecule matching and LLM integration, while retaining RDKit compatibility for the standard SMILES portion.</p>
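<p>Because the extension is delimited by literal markers, splitting an E-SMILES string back into its RDKit-compatible core and its extension tokens takes only a partition plus a regex. The sample string and token payload below are invented for illustration; only the <code>&lt;sep&gt;</code>/<code>&lt;a&gt;</code>/<code>&lt;r&gt;</code>/<code>&lt;c&gt;</code> markers come from the format description.</p>

```python
import re

def parse_esmiles(esmiles):
    """Split an E-SMILES string into its plain-SMILES part and its
    XML-like extension tokens (payload format here is hypothetical)."""
    smiles, _, extension = esmiles.partition("<sep>")
    tokens = {tag: re.findall(rf"<{tag}>(.*?)</{tag}>", extension)
              for tag in ("a", "r", "c")}
    return smiles, tokens

# Hypothetical example: a molecule with one R-group annotation.
smiles, tokens = parse_esmiles("CC(=O)N[1*]<sep><a>[1*]:R1</a>")
```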
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>Training and test data on HuggingFace. SFT subset is partially released.</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet (YOLO11)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Molecule detection model on HuggingFace</td>
      </tr>
      <tr>
          <td><a href="https://ocsr.dp.tech/">MolParser Demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online OCSR demo using MolParser-Base</td>
      </tr>
  </tbody>
</table>
<p>The dataset is publicly available on HuggingFace under a CC-BY-NC-SA-4.0 (non-commercial) license. The MolParser-SFT-400k subset is only partially released. The YOLO11-based MolDet detection model is also available on HuggingFace. No public code repository is provided for the MolParser recognition model itself. All experiments were conducted on 8 NVIDIA RTX 4090D GPUs, and throughput benchmarks were measured on a single RTX 4090D GPU.</p>
]]></content:encoded></item><item><title>SubGrapher: Visual Fingerprinting of Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</link><pubDate>Mon, 28 Apr 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</guid><description>SubGrapher creates molecular fingerprints directly from chemical structure images through functional group segmentation for database retrieval.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-taxonomy">Paper Classification and Taxonomy</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution. Using the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> framework:</p>
<p><strong>Primary Classification: Method</strong></p>
<p>The dominant basis vector is Methodological because SubGrapher introduces an architecture that replaces the two-step OCSR workflow (image, then structure, then fingerprint) with single-step fingerprinting (image to visual fingerprint). The paper validates this approach through systematic comparison against state-of-the-art methods (MolGrapher, OSRA, DECIMER, MolScribe), demonstrating superior performance on specific tasks like retrieval and robustness to image quality degradation.</p>
<p><strong>Secondary Classification: Resource</strong></p>
<p>The paper makes non-negligible resource contributions by releasing:</p>
<ul>
<li>Code and model weights on <a href="https://github.com/DS4SD/SubGrapher">GitHub</a> and <a href="https://huggingface.co/docling-project/SubGrapher">HuggingFace</a></li>
<li>Five new visual fingerprinting benchmark datasets for molecule retrieval tasks</li>
<li>Comprehensive functional group knowledge base (1,534 substructures)</li>
</ul>
<h2 id="motivation-extracting-complex-structures-from-noisy-images">Motivation: Extracting Complex Structures from Noisy Images</h2>
<p>The motivation tackles a fundamental challenge in chemical informatics: extracting molecular information from the vast amounts of unstructured scientific literature, particularly patents. Millions of molecular structures exist only as images in these documents, making them inaccessible for computational analysis, database searches, or machine learning applications.</p>
<p>Traditional Optical Chemical Structure Recognition (OCSR) tools attempt to fully reconstruct molecular graphs from images, converting them into machine-readable formats like SMILES. However, these approaches face two critical limitations:</p>
<ol>
<li><strong>Brittleness to image quality</strong>: Poor resolution, noise, or unconventional drawing styles frequently degrade recognition accuracy</li>
<li><strong>Limited handling of complex structures</strong>: Markush structures, generic molecular templates with variable R-groups commonly used in patents, are poorly supported by most conventional OCSR methods</li>
</ol>
<p>The key insight driving SubGrapher is that full molecular reconstruction may be unnecessary for many applications. For tasks like database searching, similarity analysis, or document retrieval, a molecular fingerprint - a vectorized representation capturing structural features - is often sufficient. This realization opens up a new approach: bypass the fragile reconstruction step and create fingerprints directly from visual information.</p>
<h2 id="key-innovation-direct-visual-fingerprinting">Key Innovation: Direct Visual Fingerprinting</h2>
<p>SubGrapher takes a different approach to extracting chemical information from images. It creates &ldquo;visual fingerprints&rdquo; through functional group recognition. The key innovations are:</p>
<ol>
<li>
<p><strong>Direct Image-to-Fingerprint Pipeline</strong>: SubGrapher eliminates the traditional two-step process (image → structure → fingerprint) by generating fingerprints directly from pixels. This single-stage approach avoids error accumulation from failed structure reconstructions and can handle images where conventional OCSR tools produce invalid outputs.</p>
</li>
<li>
<p><strong>Dual Instance Segmentation Architecture</strong>: The system employs two specialized Mask-RCNN networks working in parallel:</p>
<ul>
<li><strong>Functional group detector</strong>: Trained to identify 1,534 expert-defined functional groups using pixel-level segmentation masks</li>
<li><strong>Carbon backbone detector</strong>: Recognizes 27 common carbon chain patterns to capture the molecular scaffold</li>
</ul>
<p>Using instance segmentation provides detailed spatial information and higher accuracy through richer supervision during training.</p>
</li>
<li>
<p><strong>Extensive Functional Group Knowledge Base</strong>: The method uses one of the most comprehensive open-source collections of functional groups, encompassing 1,534 substructures. These were systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Additional halogen substituents and organometallic groups relevant to EUV photoresists</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li>
<p><strong>Substructure-Graph Construction</strong>: After detecting functional groups and carbon backbones, SubGrapher builds a connectivity graph where:</p>
<ul>
<li>Each node represents an identified substructure</li>
<li>Edges connect substructures whose bounding boxes overlap (with 10% margin expansion)</li>
<li>This graph captures both the chemical components and their spatial relationships</li>
</ul>
</li>
<li>
<p><strong>Substructure-based Visual Molecular Fingerprint (SVMF)</strong>: The final output is a continuous, count-based fingerprint formally defined as a matrix $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$ (1,534 functional groups + 27 carbon backbones). The matrix is stored as a compressed upper triangular form:</p>
<p><strong>Diagonal elements</strong> ($i = j$): Weighted count of substructure $i$ plus self-intersection
$$SVMF_{ii}(m) = h_1 \cdot n_i + g_{ii}$$
where $h_1 = 10$ is the diagonal weight hyperparameter, $n_i$ is the instance count, and $g_{ii}$ is the self-intersection coefficient.</p>
<p><strong>Off-diagonal elements</strong> ($i \neq j$): Intersection coefficient based on shortest path distance $d$ in the substructure graph
$$SVMF_{ij}(m) = h_2(d) \cdot \text{intersection}(s_i, s_j)$$
where the distance decay function $h_2(d)$ is:</p>
<ul>
<li>$d \leq 1$: weight = 2</li>
<li>$d = 2$: weight = 2/4 = 0.5</li>
<li>$d = 3$: weight = 2/16 = 0.125</li>
<li>$d = 4$: weight = $2/256 \approx 0.0078$</li>
<li>$d &gt; 4$: weight = 0</li>
</ul>
<p><strong>Key properties</strong>:</p>
<ul>
<li>Carbon chain intersection coefficients are divided by 2, giving functional groups higher effective weight</li>
<li>Similarity between fingerprints is calculated using a normalized Euclidean distance (the ratio of the L2 norm of the difference to the L2 norm of the sum)</li>
<li>Resulting fingerprints are highly sparse (average 0.001% non-zero elements)</li>
<li>Compressed storage enables efficient database searches</li>
</ul>
</li>
<li>
<p><strong>Markush Structure Compatibility</strong>: SubGrapher processes Markush structures by recognizing their constituent functional groups and creating meaningful fingerprints for similarity searches, achieving higher accuracy than existing OCSR methods on the USPTO-Markush benchmark (S-F1: 88).</p>
</li>
</ol>
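<p>The construction above can be condensed into a toy implementation: overlap-test the margin-expanded bounding boxes, build the substructure graph, and accumulate the weighted counts. This is a simplified sketch, not the released code: intersection coefficients are fixed at 1 per pair, the self-intersection term $g_{ii}$ and the carbon-chain halving are omitted, and the vocabulary is tiny rather than $n = 1561$.</p>

```python
from collections import deque

H1 = 10                                               # diagonal weight h1
H2 = {1: 2.0, 2: 0.5, 3: 0.125, 4: 2 / 256}           # distance decay h2(d)

def overlaps(a, b, margin=0.10):
    """True if boxes (x0, y0, x1, y1) overlap after each is expanded
    by a 10% margin on every side."""
    def grow(box):
        x0, y0, x1, y1 = box
        mx, my = margin * (x1 - x0), margin * (y1 - y0)
        return x0 - mx, y0 - my, x1 + mx, y1 + my
    ax0, ay0, ax1, ay1 = grow(a)
    bx0, by0, bx1, by1 = grow(b)
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def svmf(detections, n):
    """Toy SVMF: detections is a list of (vocab_index, bbox) pairs,
    n is the substructure vocabulary size."""
    # Substructure graph: nodes are detections, edges are overlaps.
    adj = {i: [] for i in range(len(detections))}
    for i in range(len(detections)):
        for j in range(i + 1, len(detections)):
            if overlaps(detections[i][1], detections[j][1]):
                adj[i].append(j)
                adj[j].append(i)

    def bfs_dists(src):  # shortest path lengths from one detection
        d, q = {src: 0}, deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    q.append(v)
        return d

    m = [[0.0] * n for _ in range(n)]
    for i, (vi, _) in enumerate(detections):
        m[vi][vi] += H1                               # h1 * instance count
        for j, dist in bfs_dists(i).items():
            if j != i and dist in H2:                 # d > 4 contributes 0
                m[vi][detections[j][0]] += H2[dist]
    return m
```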
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating SubGrapher&rsquo;s effectiveness across two critical tasks: accurate substructure detection and robust molecule retrieval from diverse image collections.</p>
<h4 id="substructure-detection-performance">Substructure Detection Performance</h4>
<p>SubGrapher&rsquo;s ability to identify functional groups was tested on three challenging benchmarks that expose different failure modes of OCSR systems:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Description</th>
          <th>Key Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>JPO</strong></td>
          <td>341 images</td>
          <td>Japanese Patent Office images (molecules with abbreviations removed)</td>
          <td>Low quality, noise, artifacts, non-standard drawing styles</td>
      </tr>
      <tr>
          <td><strong>USPTO-10K-L</strong></td>
          <td>1,000 images</td>
          <td>Large molecules (&gt;70 atoms)</td>
          <td>Scale variation, structural complexity, many functional groups</td>
      </tr>
      <tr>
          <td><strong>USPTO-Markush</strong></td>
          <td>74 images</td>
          <td>Generic Markush structures</td>
          <td>Variable R-groups, abstract patterns, template representation</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings:</strong></p>
<ol>
<li>
<p><strong>JPO Dataset (Low-Quality Patent Images)</strong>: SubGrapher achieved the highest Molecule Exact Match rate (83%), demonstrating robustness to image quality degradation where rule-based methods like OSRA scored lower (67% M-EM).</p>
</li>
<li>
<p><strong>USPTO-10K-L (Large Molecules)</strong>: SubGrapher achieved an S-F1 of 97, matching the rule-based OSRA and outperforming all other learning-based methods (MolScribe: 90, DECIMER: 86, MolGrapher: 56). The object detection approach handled scale variation better than other deep-learning OCSR tools on these challenging targets.</p>
</li>
<li>
<p><strong>USPTO-Markush (Generic Structures)</strong>: SubGrapher achieved the highest Substructure F1-score (88) on this benchmark, outperforming MolScribe (86), OSRA (74), and DECIMER (10). While other OCSR tools can attempt these images, they have limited support for Markush features. SubGrapher&rsquo;s instance segmentation approach handles complex Markush structures more effectively by focusing on relevant image regions.</p>
</li>
</ol>
<p>Qualitative analysis revealed that SubGrapher correctly identified functional groups in scenarios where other methods failed completely: images with captions, unconventional drawing styles, or significant quality degradation.</p>
<h4 id="visual-fingerprinting-for-molecule-retrieval">Visual Fingerprinting for Molecule Retrieval</h4>
<p>The core application was evaluated using a retrieval task designed to simulate real-world database searching:</p>
<ol>
<li>
<p><strong>Benchmark Creation</strong>: Five benchmark datasets were constructed around structurally similar molecules (adenosine, camphor, cholesterol, limonene, and pyridine), each containing 500 molecules sampled from PubChem with at least 90% Tanimoto similarity to the reference molecule, rendered as augmented images.</p>
</li>
<li>
<p><strong>Retrieval Task</strong>: Given a SMILES string as a query, the goal was to find the corresponding molecular image within the dataset of 500 visually similar structures. This tests whether the visual fingerprint can distinguish between closely related molecules.</p>
</li>
<li>
<p><strong>Performance Comparison</strong>: SubGrapher significantly outperformed baseline methods, retrieving the correct molecule at an average rank of 95 out of 500. The key advantage was robustness: SubGrapher generates a unique fingerprint for every image, even with partial or uncertain predictions. In contrast, OCSR-based methods frequently fail to produce valid SMILES, making them unable to generate fingerprints for comparison.</p>
</li>
<li>
<p><strong>Real-World Case Study</strong>: A practical demonstration involved searching a 54-page patent document containing 356 chemical images for a specific Markush structure. SubGrapher successfully located the target structure, highlighting its utility for large-scale document mining.</p>
</li>
</ol>
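<p>The retrieval step itself reduces to ranking the collection by the normalized Euclidean distance used for fingerprint comparison. The sketch below stands in plain count vectors for the sparse SVMF matrices; the names and data are illustrative.</p>

```python
def normalized_euclidean(a, b):
    """Fingerprint distance: ||a - b|| / ||a + b||."""
    diff = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    tot = sum((x + y) ** 2 for x, y in zip(a, b)) ** 0.5
    return diff / tot if tot else 0.0

def retrieve(query_fp, image_fps):
    """Rank every image id in the collection by distance to the
    query fingerprint, best match first."""
    return sorted(image_fps,
                  key=lambda name: normalized_euclidean(query_fp, image_fps[name]))
```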
<h4 id="training-data-generation">Training Data Generation</h4>
<p>Since no public datasets existed with the required pixel-level mask annotations for functional groups, the researchers developed a comprehensive synthetic data generation pipeline:</p>
<ol>
<li>
<p><strong>Extended MolDepictor</strong>: They enhanced existing molecular rendering tools to create images from SMILES strings and generate corresponding segmentation masks for all substructures present in each molecule.</p>
</li>
<li>
<p><strong>Markush Structure Rendering</strong>: The pipeline was extended to handle complex generic structures using CXSMILES representations and the CDK library for rendering, creating training data for molecular templates with structural, positional, and frequency variation indicators.</p>
</li>
<li>
<p><strong>Diverse Molecular Sources</strong>: Training molecules were sourced from PubChem to ensure broad chemical diversity and coverage of different structural families.</p>
</li>
</ol>
<h2 id="results-impact-and-limitations">Results, Impact, and Limitations</h2>
<ul>
<li><strong>Superior Robustness to Image Quality</strong>: SubGrapher consistently outperformed traditional OCSR methods on degraded images, particularly on the JPO patent dataset. Its learned representations proved more resilient to noise, artifacts, and unconventional drawing styles than rule-based alternatives such as OSRA (M-EM: 83 vs. 67 on JPO).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SubGrapher</th>
          <th>MolScribe</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>MolGrapher</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>S-F1</strong> (JPO)</td>
          <td>92</td>
          <td><strong>94</strong></td>
          <td>81</td>
          <td>86</td>
          <td>89</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (JPO)</td>
          <td><strong>83</strong></td>
          <td>82</td>
          <td>67</td>
          <td>79</td>
          <td>80</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-10K-L)</td>
          <td><strong>97</strong></td>
          <td>90</td>
          <td><strong>97</strong></td>
          <td>86</td>
          <td>56</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-10K-L)</td>
          <td>55</td>
          <td>55</td>
          <td><strong>75</strong></td>
          <td>66</td>
          <td>31</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-Markush)</td>
          <td><strong>88</strong></td>
          <td>86</td>
          <td>74</td>
          <td>10</td>
          <td>35</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-Markush)</td>
          <td>82</td>
          <td><strong>86</strong></td>
          <td>70</td>
          <td>11</td>
          <td>30</td>
      </tr>
      <tr>
          <td><strong>Avg Retrieval Rank</strong></td>
          <td><strong>95/500</strong></td>
          <td>181-241/500</td>
          <td>138-185/500</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<p>Note: Retrieval-rank ranges span the best- and worst-performing fingerprint pairing (RDKit Daylight or MHFP) for each OCSR model.</p>
<ul>
<li>
<p><strong>Effective Handling of Scale and Complexity</strong>: The instance segmentation approach successfully managed large molecules and complex structures where traditional graph-reconstruction methods struggled. The Substructure F1-scores on USPTO-10K-L and USPTO-Markush benchmarks demonstrated clear advantages for challenging molecular targets.</p>
</li>
<li>
<p><strong>Markush Structure Processing</strong>: SubGrapher achieves the highest Substructure F1-score on Markush structures (88 vs. MolScribe&rsquo;s 86 and OSRA&rsquo;s 74). While other OCSR methods can attempt Markush images, they support only limited features such as abbreviation-based variable groups. SubGrapher handles complex Markush features more effectively, expanding the scope of automatically extractable chemical information from patent literature.</p>
</li>
<li>
<p><strong>Robust Molecule Retrieval Performance</strong>: The visual fingerprinting approach achieved reliable retrieval performance (average rank 95/500) across diverse molecular families. The key advantage was consistency: SubGrapher generates meaningful fingerprints even from partial or uncertain predictions, while OCSR-based methods often fail to produce any usable output.</p>
</li>
<li>
<p><strong>Practical Document Mining Capability</strong>: The successful identification of specific Markush structures within large patent documents (54 pages, 356 images) demonstrates real-world applicability for large-scale literature mining and intellectual property analysis.</p>
</li>
<li>
<p><strong>Single-Stage Architecture Benefits</strong>: By eliminating the traditional image → structure → fingerprint pipeline, SubGrapher avoids error accumulation from failed molecular reconstructions. Every input image produces a fingerprint, making the system more reliable for batch processing of diverse document collections.</p>
</li>
<li>
<p><strong>Limitations and Scope</strong>: The method remains focused on common organic functional groups and may struggle with inorganic chemistry, organometallic complexes, or highly specialized molecular classes not well-represented in the training data. The 1,534 functional groups, while extensive, represent a curated subset of chemical space. SubGrapher also cannot distinguish enantiomers, as the detected substructures lack stereochemistry information. Additionally, the method currently cannot recognize substructures in abbreviations or single-atom fragments.</p>
</li>
</ul>
<p>The work demonstrates that direct fingerprint generation can be more robust and practical than traditional structure reconstruction approaches. SubGrapher&rsquo;s ability to handle Markush structures and degraded images makes it particularly valuable for patent analysis and large-scale document mining, where traditional OCSR methods frequently fail. The approach suggests that task-specific learning (fingerprints for retrieval) can outperform general-purpose reconstruction methods in many practical applications.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data Generation</strong>: The paper developed a custom synthetic data pipeline since no public datasets existed with pixel-level mask annotations for functional groups:</p>
<ul>
<li><strong>Extended MolDepictor</strong>: Enhanced molecular rendering tool to generate both images and corresponding segmentation masks for all substructures</li>
<li><strong>Markush Structure Rendering</strong>: Pipeline extended to handle complex generic structures</li>
<li><strong>Source Molecules</strong>: PubChem for broad chemical diversity</li>
</ul>
<p><strong>Evaluation Benchmarks</strong>:</p>
<ul>
<li><strong>JPO Dataset</strong>: Real patent images with poor resolution, noise, and artifacts</li>
<li><strong>USPTO-10K-L</strong>: Large complex molecular structures</li>
<li><strong>USPTO-Markush</strong>: Generic structures with variable R-groups</li>
<li><strong>Retrieval Benchmarks</strong>: Five datasets (adenosine, camphor, cholesterol, limonene, pyridine), each with 500 similar molecular images</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Dual instance segmentation system using Mask-RCNN</p>
<ul>
<li><strong>Functional Group Detector</strong>: Mask-RCNN trained to identify 1,534 expert-defined functional groups</li>
<li><strong>Carbon Backbone Detector</strong>: Mask-RCNN trained to recognize 27 common carbon chain patterns</li>
<li><strong>Backbone Network</strong>: Not specified in the paper</li>
</ul>
<p><strong>Functional Group Knowledge Base</strong>: 1,534 substructures systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Definition</strong>:</p>
<ul>
<li><strong>1,534 Functional Groups</strong>: Defined by manually curated SMARTS patterns
<ul>
<li>Must contain heteroatoms (O, N, S, P, B)</li>
<li>Frequency threshold: ~1,000+ occurrences in PubChem</li>
<li>Systematically constructed from chemically logical atom combinations</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li><strong>27 Carbon Backbones</strong>: Patterns of 3-6 carbon atoms (rings and chains) to capture molecular scaffolds</li>
</ul>
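<p>The heteroatom and frequency filters above can be sketched as follows. The regex-based atom scan and the candidate list are deliberate simplifications for illustration only; the paper curates groups with proper SMARTS patterns and real PubChem occurrence counts, not string matching.</p>

```python
import re

HETEROATOMS = {"O", "N", "S", "P", "B"}

def atoms_in_smiles(smiles):
    """Very simplified SMILES atom scan (two-letter symbols tried first).

    Good enough for illustration; a real pipeline would parse the structure
    with a cheminformatics toolkit rather than a regex.
    """
    return set(re.findall(r"Cl|Br|[BCNOSPFI]", smiles))

def keep_group(smiles, pubchem_count, min_count=1000):
    """A candidate group must contain a heteroatom and be frequent in PubChem."""
    return bool(atoms_in_smiles(smiles) & HETEROATOMS) and pubchem_count >= min_count

candidates = {
    "C(=O)N":  50_000,  # amide: heteroatoms present, frequent -> kept
    "CCC":     90_000,  # pure carbon chain -> rejected (no heteroatom)
    "C(=O)ON": 12,      # rare variant -> rejected (below frequency threshold)
}
print([s for s, n in candidates.items() if keep_group(s, n)])  # → ['C(=O)N']
```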
<p><strong>Substructure-Graph Construction</strong>:</p>
<ol>
<li>Detect functional groups and carbon backbones using Mask-RCNN models</li>
<li>Build connectivity graph:
<ul>
<li>Each node represents an identified substructure instance</li>
<li>Edges connect substructures whose bounding boxes overlap</li>
<li>Bounding boxes expanded by 10% of smallest box&rsquo;s diagonal to ensure connectivity between adjacent groups</li>
<li>Carbon chain intersection coefficients divided by 2, giving functional groups higher effective weight</li>
</ul>
</li>
</ol>
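<p>A minimal sketch of the connectivity-graph step, assuming detections arrive as axis-aligned bounding boxes in (x0, y0, x1, y1) form. The box names and coordinates are invented, and the halving of carbon-chain intersection coefficients is omitted for brevity; only the 10%-of-smallest-diagonal expansion and the overlap test are modeled.</p>

```python
import math
from itertools import combinations

def expand(box, margin):
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)

def overlaps(a, b):
    """Axis-aligned boxes overlap iff they intersect on both axes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def build_substructure_graph(boxes):
    """boxes: {name: (x0, y0, x1, y1)}. Returns the edge list of the graph."""
    # Expansion margin: 10% of the smallest box's diagonal, per the paper.
    diag = min(math.hypot(x1 - x0, y1 - y0) for x0, y0, x1, y1 in boxes.values())
    expanded = {k: expand(b, 0.1 * diag) for k, b in boxes.items()}
    return [(a, b) for a, b in combinations(sorted(boxes), 2)
            if overlaps(expanded[a], expanded[b])]

boxes = {
    "carbonyl": (0, 0, 10, 10),
    "amine":    (11, 0, 20, 10),   # adjacent: connected after expansion
    "far_ring": (50, 50, 60, 60),  # isolated node
}
print(build_substructure_graph(boxes))  # → [('amine', 'carbonyl')]
```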
<p><strong>SVMF Fingerprint Generation</strong>:</p>
<ul>
<li>Matrix form: $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$</li>
<li>Stored as compressed sparse upper triangular matrix</li>
<li><strong>Diagonal elements</strong>: $SVMF_{ii} = h_1 \cdot n_i + g_{ii}$ where $h_1 = 10$</li>
<li><strong>Off-diagonal elements</strong>: $SVMF_{ij} = h_2(d) \cdot \text{intersection}(s_i, s_j)$ where:
<ul>
<li>$h_2(d) = 2$ for $d = 0, 1$</li>
<li>$h_2(2) = 2/4$, $h_2(3) = 2/16$, $h_2(4) = 2/256$</li>
<li>$h_2(d) = 0$ for $d &gt; 4$</li>
</ul>
</li>
<li>Average sparsity: 0.001% non-zero elements</li>
<li>Similarity metric: Normalized Euclidean distance (L2 norm of difference divided by L2 norm of sum)</li>
</ul>
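<p>The fingerprint construction and similarity metric above can be sketched as below. Since the paper does not fully specify the g_ii term, the diagonal here keeps only the h1·n_i count term, and the substructure class indices and intersection coefficients are toy values; the sparse upper-triangular matrix is represented as a plain dict.</p>

```python
import math

H1 = 10
H2 = {0: 2.0, 1: 2.0, 2: 2 / 4, 3: 2 / 16, 4: 2 / 256}  # 0 for d > 4

def svmf(counts, pair_terms):
    """Sparse upper-triangular SVMF as {(i, j): value}.

    counts: {i: n_i} -- number of instances of substructure class i.
    pair_terms: {(i, j): [(d, inter), ...]} -- graph distance d and geometric
    intersection coefficient for each instance pair of classes i and j.
    """
    fp = {(i, i): H1 * n for i, n in counts.items()}  # diagonal (g_ii omitted)
    for (i, j), terms in pair_terms.items():
        key = (min(i, j), max(i, j))
        v = sum(H2.get(d, 0.0) * inter for d, inter in terms)
        if v:
            fp[key] = fp.get(key, 0.0) + v
    return fp

def distance(a, b):
    """Normalized Euclidean distance: ||a - b|| / ||a + b||."""
    keys = set(a) | set(b)
    diff = math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))
    summ = math.sqrt(sum((a.get(k, 0.0) + b.get(k, 0.0)) ** 2 for k in keys))
    return diff / summ if summ else 0.0

# Two molecules with identical substructure counts but different graph distances.
fp1 = svmf({3: 2, 7: 1}, {(3, 7): [(1, 1.0)]})
fp2 = svmf({3: 2, 7: 1}, {(3, 7): [(2, 1.0)]})
print(distance(fp1, fp1), distance(fp1, fp2))
```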
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Substructure F1-score (S-F1)</strong>: Harmonic mean of precision and recall for individual substructure detection across all molecules in the dataset</li>
<li><strong>Molecule Exact Match (M-EM)</strong>: Percentage of molecules where S-F1 = 1.0 (all substructures correctly identified)</li>
<li><strong>Retrieval Rank</strong>: Rank of the ground-truth molecule within the candidate list of 500 similar structures when querying with the SMILES-derived fingerprint, averaged over 50 queries per benchmark</li>
</ul>
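<p>One plausible reading of the two detection metrics, sketched in Python. The paper does not spell out the exact aggregation (e.g. micro vs. macro averaging across molecules), so this macro-averaged version over multisets of substructure labels is an assumption, as are the example predictions.</p>

```python
from collections import Counter

def substructure_f1(pred, truth):
    """Multiset precision/recall over substructure labels for one molecule."""
    tp = sum((Counter(pred) & Counter(truth)).values())
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def dataset_metrics(predictions, truths):
    """S-F1 as the mean per-molecule F1; M-EM as the fraction with S-F1 = 1."""
    f1s = [substructure_f1(p, t) for p, t in zip(predictions, truths)]
    s_f1 = sum(f1s) / len(f1s)
    m_em = sum(f == 1.0 for f in f1s) / len(f1s)
    return s_f1, m_em

preds  = [["amide", "phenyl"], ["amide"]]
truths = [["amide", "phenyl"], ["amide", "ester"]]  # second molecule misses "ester"
s_f1, m_em = dataset_metrics(preds, truths)
print(round(s_f1, 3), m_em)  # → 0.833 0.5
```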
<p><strong>Baselines</strong>: Compared against SOTA OCSR methods:</p>
<ul>
<li>Deep learning: MolScribe, MolGrapher, DECIMER</li>
<li>Rule-based: OSRA</li>
<li>Fingerprint methods: RDKit Daylight, MHFP (applied to OCSR outputs)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Training and inference hardware details are not given in the main text; if documented at all, they would appear in the supplementary materials or the code repository.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/SubGrapher">SubGrapher (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official inference code</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/docling-project/SubGrapher">SubGrapher (HuggingFace)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/docling-project/SubGrapher-Datasets">SubGrapher-Datasets (HuggingFace)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Visual fingerprinting benchmark datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="implementation-gaps">Implementation Gaps</h3>
<p>The following details are not available in the paper and would require access to the code repository or supplementary information:</p>
<ul>
<li>Specific backbone architecture for Mask-RCNN (ResNet variant, Swin Transformer, etc.)</li>
<li>Optimizer type (AdamW, SGD, etc.)</li>
<li>Learning rate and scheduler</li>
<li>Batch size and number of training epochs</li>
<li>Loss function weights (box loss vs. mask loss)</li>
<li>GPU/TPU specifications used for training</li>
<li>Training time and computational requirements</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Meijer, G. I., Weber, V., Van Gool, L., &amp; Staar, P. W. J. (2025). SubGrapher: Visual fingerprinting of chemical structures. Journal of Cheminformatics, 17(1), 149. <a href="https://doi.org/10.1186/s13321-025-01091-4">https://doi.org/10.1186/s13321-025-01091-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{morinSubGrapherVisualFingerprinting2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{SubGrapher: Visual Fingerprinting of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{SubGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valéry and Van Gool, Luc and Staar, Peter W. J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{149}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-025-01091-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>