<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Image-to-Graph Models on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/</link><description>Recent content in Image-to-Graph Models on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/index.xml" rel="self" type="application/rss+xml"/><item><title>GraSP: Graph Recognition via Subgraph Prediction (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</guid><description>GraSP is a general image-to-graph framework using sequential subgraph prediction, applied to OCSR with 67.5% accuracy on QM9.</description><content:encoded><![CDATA[<h2 id="a-general-framework-for-visual-graph-recognition">A General Framework for Visual Graph Recognition</h2>
<p>GraSP (Graph Recognition via Subgraph Prediction) addresses a fundamental limitation in image-to-graph methods: existing solutions are task-specific and do not transfer between domains. Whether the task is OCSR, scene graph recognition, music notation parsing, or road network extraction, each domain has developed independent solutions despite solving the same conceptual problem of extracting a graph from an image.</p>
<p>The key insight is that graph recognition can be reformulated as sequential subgraph prediction using a binary classifier, sidestepping two core difficulties of using graphs as neural network outputs:</p>
<ol>
<li><strong>Graph isomorphism</strong>: An uncolored graph with $n$ nodes has up to $n!$ equivalent representations, making direct output comparison intractable</li>
<li><strong>Compositional outputs</strong>: Nodes, edges, and features are interdependent, so standard i.i.d. loss functions are insufficient</li>
</ol>
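<p>The isomorphism obstacle is easy to see concretely: relabeling the nodes of a graph yields a different edge-set (or adjacency-matrix) representation of the same structure. A toy illustration in plain Python (not from the paper):</p>

```python
from itertools import permutations

# A path graph 0-1-2 stored as an edge set (edges kept sorted).
edges = {(0, 1), (1, 2)}

def relabel(edges, perm):
    """Apply a node permutation to an edge set."""
    return {tuple(sorted((perm[u], perm[v]))) for u, v in edges}

# Every permutation of the 3 node labels gives an equally valid representation.
representations = {frozenset(relabel(edges, p)) for p in permutations(range(3))}
print(len(representations))  # 3 distinct edge sets for one underlying graph
```

Here 3 of the $3! = 6$ relabelings are distinct because the path has a two-fold symmetry; in general the count is $n!$ divided by the size of the graph's automorphism group.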
<h2 id="sequential-subgraph-prediction-as-an-mdp">Sequential Subgraph Prediction as an MDP</h2>
<p>GraSP formulates graph recognition as a Markov Decision Process. Starting from an empty graph, the method iteratively expands the current graph by adding one edge at a time (connecting either a new node or two existing nodes). At each step, a binary classifier predicts whether each candidate successor graph is a subgraph of the target graph shown in the image.</p>
<p>The critical observation is that the optimal value function $V^{\pi^*}$ satisfies:</p>
<p>$$V^{\pi^*}(\mathcal{G}_t | \mathcal{I}) = 1 \iff \mathcal{G}_t \subseteq \mathcal{G}_{\mathcal{I}}$$</p>
<p>This means the value function reduces to a subgraph membership test, which can be learned as a binary classifier rather than requiring reinforcement learning. Greedy decoding then suffices: at each step, select any successor that the classifier predicts is a valid subgraph, and terminate when the classifier indicates the current graph is complete.</p>
<p>This formulation decouples <strong>decision</strong> (what to add) from <strong>generation</strong> (in what order), making the same model applicable across different graph types without modification.</p>
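<p>Putting the MDP view together, the greedy decoding loop can be sketched as below. The <code>successors</code> and <code>is_subgraph</code> callables are assumed interfaces standing in for the paper's expansion operator and learned binary classifier (an oracle plays the classifier's role in the demo); none of the names are the authors' actual API:</p>

```python
def greedy_decode(image, successors, is_subgraph, max_steps=100):
    """Grow a graph one edge at a time, keeping any successor the
    classifier accepts as a subgraph of the target shown in `image`.
    terminal=True asks the classifier "is the current graph complete?"."""
    graph = frozenset()  # start from the empty graph (edge set)
    for _ in range(max_steps):
        if is_subgraph(graph, image, terminal=True):
            return graph  # target fully recovered
        for cand in successors(graph):
            if is_subgraph(cand, image, terminal=False):
                graph = cand  # any accepted successor suffices (greedy)
                break
        else:
            return graph  # no accepted successor; stop
    return graph

# Oracle demo: the true target edge set plays the role of the image.
target = frozenset({(0, 1), (1, 2)})

def successors(g):
    nodes = {n for e in g for n in e} | {0}
    for u in sorted(nodes):
        for v in range(3):
            e = tuple(sorted((u, v)))
            if u != v and e not in g:
                yield g | {e}

def is_subgraph(g, image, terminal):
    return g == image if terminal else g <= image

print(greedy_decode(target, successors, is_subgraph))
```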
<h2 id="architecture-gnn--film-conditioned-cnn">Architecture: GNN + FiLM-Conditioned CNN</h2>
<p>The architecture has three components:</p>
<ol>
<li>
<p><strong>GNN encoder</strong>: A Message Passing Neural Network processes the candidate subgraph, producing a graph embedding. Messages are constructed as concatenations of source node features, target node features, and connecting edge features.</p>
</li>
<li>
<p><strong>FiLM-conditioned CNN</strong>: A ResNet-v2 processes the image, with FiLM layers placed after every normalization layer within each block. The graph embedding conditions the image processing, producing a joint graph-image representation.</p>
</li>
<li>
<p><strong>MLP classification head</strong>: Takes the conditioned image embedding plus a binary terminal flag (indicating whether this is a termination check) and predicts subgraph membership.</p>
</li>
</ol>
<p>The model uses only 7.25M parameters. Group Normalization is used in the CNN (8 groups per layer), Layer Normalization in the GNN and MLP.</p>
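<p>FiLM conditioning itself is a simple per-channel affine transform of the CNN feature maps. A dependency-free sketch (the MLP producing $\gamma$, $\beta$ from the graph embedding is omitted, and all names are illustrative, not the paper's implementation):</p>

```python
def film(feature_map, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel of a
    CNN feature map by parameters predicted from the graph embedding.
    feature_map: [channels][positions]; gamma/beta: per-channel scalars."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(feature_map, gamma, beta)]

# 2 channels x 2 spatial positions, with per-channel (gamma, beta).
feature_map = [[1.0, 2.0], [3.0, 4.0]]
gamma, beta = [0.5, 2.0], [1.0, -1.0]
print(film(feature_map, gamma, beta))  # [[1.5, 2.0], [5.0, 7.0]]
```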
<h2 id="training-via-streaming-data-generation">Training via Streaming Data Generation</h2>
<p>Training uses a streaming architecture rather than a fixed dataset:</p>
<ul>
<li>For each iteration, a target graph $\mathcal{G}_T$ is sampled and rendered as an image</li>
<li><strong>Positive samples</strong> are generated by deleting edges that do not disconnect the graph (yielding valid subgraphs)</li>
<li><strong>Negative samples</strong> are generated by expanding successor states and checking via approximate subgraph matching</li>
<li>Two FIFO buffers (one for positives, one for negatives), each holding up to 25,000 images, maintain diverse and balanced mini-batches of 1024 samples</li>
<li>Training uses the RAdam optimizer with a cosine learning rate schedule (warmup over 50M samples, cycle of 250M samples) on 4 A100 GPUs with a 24h budget</li>
</ul>
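<p>The buffer mechanics can be sketched with two bounded FIFO queues. The capacities and batch size match the reported numbers; the balanced-sampling details are an assumption:</p>

```python
import random
from collections import deque

# Two bounded FIFO buffers; once full, new samples push out the oldest.
POS_BUF = deque(maxlen=25_000)
NEG_BUF = deque(maxlen=25_000)

def minibatch(batch_size=1024, rng=random):
    """Draw a class-balanced mini-batch: half positives, half negatives."""
    half = batch_size // 2
    pos = rng.sample(list(POS_BUF), min(half, len(POS_BUF)))
    neg = rng.sample(list(NEG_BUF), min(half, len(NEG_BUF)))
    return [(x, 1) for x in pos] + [(x, 0) for x in neg]

# Streaming: the generator keeps appending freshly rendered samples.
for i in range(30_000):
    POS_BUF.append(f"pos_{i}")
    NEG_BUF.append(f"neg_{i}")
batch = minibatch()
print(len(batch), len(POS_BUF))  # 1024 25000
```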
<h2 id="synthetic-benchmarks-on-colored-trees">Synthetic Benchmarks on Colored Trees</h2>
<p>GraSP is evaluated on increasingly complex synthetic tasks involving colored tree graphs:</p>
<ul>
<li><strong>Small trees (6-9 nodes)</strong>: Tasks with varying numbers of node colors (1, 3, 5) and edge colors (1, 3, 5). The model works well across all configurations, with simpler tasks (fewer colors) converging faster.</li>
<li><strong>Larger trees (10-15 nodes)</strong>: The same trends hold but convergence is slower due to increased structural complexity.</li>
<li><strong>Out-of-distribution generalization</strong>: Models trained on 6-9 node trees show zero-shot generalization to 10-node trees, indicating learned patterns are size-independent.</li>
</ul>
<h2 id="ocsr-evaluation-on-qm9">OCSR Evaluation on QM9</h2>
<p>For the real-world OCSR evaluation, GraSP is applied to QM9 molecular images (grayscale, no stereo-bonds) with a 10,000-molecule held-out test set:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OSRA</td>
          <td>45.61%</td>
      </tr>
      <tr>
          <td>GraSP</td>
          <td>67.51%</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>88.36%</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>92.08%</td>
      </tr>
  </tbody>
</table>
<p>GraSP does not match state-of-the-art OCSR tools, but the authors emphasize that the same model architecture and training procedure transfers directly from synthetic tree tasks to molecular graphs with no task-specific modifications. The only domain knowledge incorporated is a simple chemistry rule: not extending nodes that already have degree four.</p>
<p>The method highlights the practical advantage of decoupling decision from generation. Functional groups can be represented at different granularities (as single nodes to reduce trajectory depth, or expanded to reduce trajectory breadth) without changing the model.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/c72bcbf4/grasp">GraSP Code</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation with pre-trained models</td>
      </tr>
  </tbody>
</table>
<p>The repository includes pre-trained models and example trajectories for interactive exploration. Training requires 4 A100 GPUs with a 24h time budget. The QM9 dataset used for OCSR evaluation is publicly available. No license file is included in the repository.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li><strong>Finite type assumption</strong>: The current framework assumes a finite set of node and edge types, limiting applicability to open-vocabulary tasks like scene graph recognition</li>
<li><strong>Scaling to large graphs</strong>: For very large graphs, the branching factor of successor states becomes expensive. Learned filters to prune irrelevant successor states could help</li>
<li><strong>OCSR performance gap</strong>: While GraSP demonstrates transferability, it falls short of specialized OCSR tools that use domain-specific encodings (SMILES) or pixel-level supervision</li>
<li><strong>Modality extension</strong>: The framework could extend beyond images to other input modalities, such as vector embeddings of graphs</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eberhard, A., Neumann, G., &amp; Friederich, P. (2026). Graph Recognition via Subgraph Prediction. <em>arXiv preprint arXiv:2601.15133</em>. <a href="https://arxiv.org/abs/2601.15133">https://arxiv.org/abs/2601.15133</a></p>
<p><strong>Publication</strong>: arXiv 2026</p>
]]></content:encoded></item><item><title>AdaptMol: Domain Adaptation for Molecular OCSR (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</guid><description>AdaptMol is an image-to-graph OCSR model using MMD-based domain adaptation and self-training for hand-drawn molecule recognition.</description><content:encoded><![CDATA[<h2 id="bridging-the-synthetic-to-real-gap-in-graph-based-ocsr">Bridging the Synthetic-to-Real Gap in Graph-Based OCSR</h2>
<p>Most OCSR methods are trained on synthetic molecular images and evaluated on high-quality literature figures, both exhibiting relatively uniform styles. Hand-drawn molecules represent a particularly challenging domain with irregular bond lengths, variable stroke widths, and inconsistent atom symbols. Prior graph reconstruction methods like MolScribe and MolGrapher drop below 15% accuracy on hand-drawn images, despite achieving over 65% on literature datasets.</p>
<p>AdaptMol addresses this with a three-stage pipeline that enables effective transfer from synthetic to real-world data without requiring graph annotations in the target domain:</p>
<ol>
<li><strong>Base model training</strong> on synthetic data with comprehensive augmentation and dual position representation</li>
<li><strong>MMD alignment</strong> of bond-level features between source and target domains</li>
<li><strong>Self-training</strong> with SMILES-validated pseudo-labels on unlabeled target images</li>
</ol>
<h2 id="end-to-end-graph-reconstruction-architecture">End-to-End Graph Reconstruction Architecture</h2>
<p>AdaptMol builds on MolScribe&rsquo;s architecture, using a Swin Transformer base encoder ($384 \times 384$ input) with a 6-layer Transformer decoder (8 heads, hidden dim 256). The model jointly predicts atoms and bonds:</p>
<p><strong>Atom prediction</strong> follows the Pix2Seq approach, autoregressively generating a sequence of atom tokens:</p>
<p>$$S_N = [l_1, x_1, y_1, l_2, x_2, y_2, \dots, l_n, x_n, y_n]$$</p>
<p>where $l_i$ is the atom label and $(x_i, y_i)$ are discretized coordinate bin indices.</p>
<p><strong>Dual position representation</strong> adds a 2D spatial heatmap on top of token-based coordinate prediction. The heatmap aggregates joint spatial distributions of all atoms:</p>
<p>$$\mathbf{H} = \text{Upsample}\left(\sum_{i=1}^{n} P_y^{(i)} \otimes P_x^{(i)}\right)$$</p>
<p>where $P_x^{(i)}$ and $P_y^{(i)}$ are coordinate probability distributions from the softmax logits. During training, this heatmap is supervised with Gaussian kernels at ground-truth atom positions. This reduces false positive atom predictions substantially (from 356 to 33 false positives at IoU 0.05).</p>
<p><strong>Bond prediction</strong> extracts atom-level features from decoder hidden states and enriches them with encoder visual features via multi-head attention with a learnable residual weight $\alpha$:</p>
<p>$$\mathbf{F}_{\text{enriched}} = \text{LayerNorm}(\mathbf{F}_{\text{atom}} + \alpha \cdot \text{MHA}(\mathbf{F}_{\text{atom}}, \mathbf{E}_{\text{vis}}))$$</p>
<p>A feed-forward network then predicts bond types between all atom pairs.</p>
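<p>A scalar, single-head sketch of the enrichment step (the real model uses multi-head attention and LayerNorm, both simplified away here, so this is only illustrative):</p>

```python
import math

def enrich(atom_feat, vis_feats, alpha):
    """Enrich one atom feature with visual context: single-head
    dot-product attention over visual features, plus a residual
    weighted by the learnable scalar alpha (LayerNorm omitted)."""
    d = len(atom_feat)
    scores = [sum(a * v for a, v in zip(atom_feat, vf)) / math.sqrt(d)
              for vf in vis_feats]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]   # numerically stable softmax
    z = sum(w)
    attn = [sum((wi / z) * vf[k] for wi, vf in zip(w, vis_feats))
            for k in range(d)]
    return [a + alpha * t for a, t in zip(atom_feat, attn)]

out = enrich([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], alpha=0.5)
print([round(v, 3) for v in out])
```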
<h2 id="bond-level-domain-adaptation-via-mmd">Bond-Level Domain Adaptation via MMD</h2>
<p>The key insight is that bond features are domain-invariant: they encode structural relationships (single, double, triple, aromatic) independent of visual style. Atom-level alignment is problematic due to class imbalance (carbon dominates), multi-token spanning (functional groups), and position-dependent features.</p>
<p>AdaptMol aligns bond-level feature distributions via class-conditional Maximum Mean Discrepancy:</p>
<p>$$L_{\text{MMD}} = \frac{1}{|\mathcal{C}'|} \sum_{c \in \mathcal{C}'} \text{MMD}(F_c^{\text{src}}, F_c^{\text{tgt}})$$</p>
<p>where $\mathcal{C}'$ contains classes with sufficient samples in both domains. Confidence-based filtering retains only high-confidence predictions (confidence &gt; 0.95, entropy &lt; 0.1) for alignment, tightening to 0.98 and 0.05 after the first epoch. Progressive loss weighting follows a schedule of 0.1 (epoch 0), 0.075 (epoch 1), and 0.05 thereafter.</p>
<p>An important side effect: MMD alignment improves inter-class bond discrimination, reducing confusion between visually similar bond types (e.g., jagged double bonds vs. aromatic bonds).</p>
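<p>The class-conditional loss can be sketched with the simplest (linear-kernel) MMD, where the discrepancy reduces to the distance between class-wise feature means; the paper's actual kernel choice is not reproduced here:</p>

```python
def mmd_linear(src, tgt):
    """Squared MMD with a linear kernel: squared distance between the
    mean feature vectors of the two sample sets."""
    dim = len(src[0])
    mean = lambda fs, d: sum(f[d] for f in fs) / len(fs)
    return sum((mean(src, d) - mean(tgt, d)) ** 2 for d in range(dim))

def class_conditional_mmd(src_by_class, tgt_by_class, min_count=2):
    """Average MMD over bond classes with enough samples in both
    domains, mirroring L_MMD = (1/|C'|) * sum_c MMD(F_c_src, F_c_tgt)."""
    shared = [c for c in src_by_class
              if c in tgt_by_class
              and len(src_by_class[c]) >= min_count
              and len(tgt_by_class[c]) >= min_count]
    if not shared:
        return 0.0
    return sum(mmd_linear(src_by_class[c], tgt_by_class[c])
               for c in shared) / len(shared)

src = {"single": [[0.0, 0.0], [2.0, 0.0]], "double": [[1.0, 1.0], [1.0, 3.0]]}
tgt = {"single": [[1.0, 1.0], [1.0, 1.0]], "double": [[1.0, 2.0], [1.0, 2.0]]}
print(class_conditional_mmd(src, tgt))  # 0.5: "single" misaligned, "double" aligned
```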
<h2 id="self-training-with-smiles-validation">Self-Training with SMILES Validation</h2>
<p>After MMD alignment, the model generates predictions on unlabeled target images. Predicted molecular graphs are converted to SMILES and validated against ground-truth SMILES annotations. Only exact matches are retained as pseudo-labels, providing complete graph supervision (atom coordinates, element types, bond types) that was previously unavailable in the target domain.</p>
<p>This approach is far more data-efficient than alternatives: AdaptMol uses only 4,080 real hand-drawn images vs. DECIMER-Handdraw&rsquo;s 38 million synthetic hand-drawn images.</p>
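<p>The pseudo-label filter reduces to exact matching of canonical SMILES. A dependency-free sketch; in practice <code>canonicalize</code> would wrap a real canonicalizer such as RDKit's, while the toy stand-in below is NOT real chemistry:</p>

```python
def select_pseudo_labels(predictions, references, canonicalize):
    """Keep only predictions whose canonical SMILES exactly matches the
    reference; the matched predictions' full graphs (atoms, coordinates,
    bonds) then serve as target-domain supervision."""
    kept = []
    for img_id, (pred, ref) in enumerate(zip(predictions, references)):
        cp, cr = canonicalize(pred), canonicalize(ref)
        if cp is not None and cp == cr:
            kept.append((img_id, pred))
    return kept

# Toy canonicalizer (whitespace-strip + uppercase): illustrative only.
canon = lambda s: s.replace(" ", "").upper() or None
preds = ["c1ccccc1", "CC(O)=O", "CCN"]
refs  = ["C1CCCCC1", "CC(=O)O", "CCN"]
print(select_pseudo_labels(preds, refs, canon))  # keeps image ids 0 and 2
```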
<h2 id="comprehensive-data-augmentation">Comprehensive Data Augmentation</h2>
<p>Two categories of augmentation are applied during synthetic data generation:</p>
<ul>
<li><strong>Structure-rendering augmentation</strong>: Functional group abbreviation substitution, bond type conversions (single to wavy/aromatic, Kekule to aromatic rings), R-group insertion, and rendering parameter randomization (font family/size, bond width/spacing)</li>
<li><strong>Image-level augmentation</strong>: Geometric operations, quality degradation, layout variations, and chemical document artifacts (caption injection, arrows, marginal annotations)</li>
</ul>
<p>Structure-rendering augmentation provides the larger benefit, contributing ~20% accuracy improvement on JPO and ~30% on ACS benchmarks.</p>
<h2 id="results">Results</h2>
<h3 id="hand-drawn-molecule-recognition">Hand-Drawn Molecule Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>DECIMER test (Acc)</th>
          <th>ChemPix (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>AdaptMol</strong></td>
          <td><strong>82.6</strong></td>
          <td><strong>60.5</strong></td>
      </tr>
      <tr>
          <td>DECIMER v2.2</td>
          <td>71.9</td>
          <td>51.4</td>
      </tr>
      <tr>
          <td>AtomLenz</td>
          <td>30.0</td>
          <td>48.4</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>10.1</td>
          <td>26.1</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>10.7</td>
          <td>14.5</td>
      </tr>
  </tbody>
</table>
<h3 id="literature-and-synthetic-benchmarks">Literature and Synthetic Benchmarks</h3>
<p>AdaptMol achieves state-of-the-art on 4 of 6 literature benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>AdaptMol</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>DECIMER v2.2</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CLEF</td>
          <td><strong>92.7</strong></td>
          <td>87.5</td>
          <td>57.2</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td><strong>88.2</strong></td>
          <td>78.8</td>
          <td>73.0</td>
          <td>75.7</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td><strong>89.3</strong></td>
          <td>88.2</td>
          <td>85.1</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td><strong>75.5</strong></td>
          <td>72.8</td>
          <td>41.0</td>
          <td>37.7</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>90.9</td>
          <td><strong>92.6</strong></td>
          <td>74.9</td>
          <td>59.6</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>84.0</td>
          <td><strong>84.4</strong></td>
          <td>0.0</td>
          <td>66.3</td>
      </tr>
  </tbody>
</table>
<p>MolScribe edges out on USPTO and Staker. The authors attribute this to MolScribe directly training on all 680K USPTO samples, which may cause it to specialize to that distribution.</p>
<h3 id="pipeline-ablation">Pipeline Ablation</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Hand-drawn</th>
          <th>ChemDraw</th>
          <th>JPO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base model</td>
          <td>10.4</td>
          <td>92.3</td>
          <td>82.7</td>
      </tr>
      <tr>
          <td>+ Font augmentation</td>
          <td>30.2</td>
          <td>92.5</td>
          <td>82.8</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD</td>
          <td>42.1</td>
          <td>94.0</td>
          <td>83.0</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD + Self-training</td>
          <td><strong>82.6</strong></td>
          <td><strong>95.9</strong></td>
          <td><strong>88.2</strong></td>
      </tr>
  </tbody>
</table>
<p>Each component contributes meaningfully: font augmentation (+19.8), MMD alignment (+11.9), and self-training (+40.5) on hand-drawn accuracy.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fffh1/AdaptMol">AdaptMol Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">Model + Data</a></td>
          <td>Model/Dataset</td>
          <td>MIT</td>
          <td>Pretrained checkpoint and datasets</td>
      </tr>
  </tbody>
</table>
<p>Training uses 2 NVIDIA A100 GPUs (40GB each). Base model trains for 30 epochs on 1M synthetic samples. Domain adaptation involves 3 steps: USPTO self-training (3 iterations of 3 epochs), MMD alignment on hand-drawn data (5 epochs), and hand-drawn self-training (5 iterations).</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Sequence length constraints prevent accurate prediction of very large molecules (&gt;120 atoms), where resizing causes significant information loss</li>
<li>Cannot recognize Markush structures with repeating unit notation (parentheses/brackets), as synthetic training data lacks such cases</li>
<li>Stereochemistry information is lost when stereo bonds connect to abbreviated functional groups due to RDKit post-processing limitations</li>
<li>The retrained baseline (30 epochs from scratch on synthetic + pseudo-labels) achieves higher hand-drawn accuracy (87.2%) but at the cost of cross-domain robustness on literature benchmarks</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, F., He, E., &amp; Verspoor, K. (2026). AdaptMol: Domain Adaptation for Molecular Image Recognition with Limited Supervision. <em>Research Square preprint</em>. <a href="https://doi.org/10.21203/rs.3.rs-8365561/v1">https://doi.org/10.21203/rs.3.rs-8365561/v1</a></p>
<p><strong>Publication</strong>: Research Square preprint, February 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/fffh1/AdaptMol">GitHub</a></li>
<li><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">HuggingFace (model + data)</a></li>
</ul>
]]></content:encoded></item><item><title>MolScribe: Robust Image-to-Graph Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</guid><description>Image-to-graph generation model for OCSR that predicts atoms, bonds, and coordinates jointly to better handle stereochemistry and abbreviations.</description><content:encoded><![CDATA[<h2 id="contribution-generative-image-to-graph-modelling">Contribution: Generative Image-to-Graph Modelling</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) with a secondary contribution to Resources ($\Psi_{\text{Resource}}$).</p>
<p>It proposes a novel architecture (image-to-graph generation) to solve the Optical Chemical Structure Recognition (OCSR) task, validating it through extensive ablation studies and comparisons against strong baselines like MolVec and DECIMER. It also contributes a new benchmark dataset of annotated images from ACS journals.</p>
<h2 id="motivation-limitations-in-existing-ocsr-pipelines">Motivation: Limitations in Existing OCSR Pipelines</h2>
<p>Translating molecular images into machine-readable graphs (OCSR) is challenging due to the high variance in drawing styles, stereochemistry conventions, and abbreviated structures found in literature.</p>
<p>Existing solutions face structural bottlenecks:</p>
<ul>
<li><strong>Rule-based systems</strong> (e.g., OSRA) rely on rigid heuristics that fail on diverse styles.</li>
<li><strong>Image-to-SMILES neural models</strong> treat the problem as captioning. They struggle with the geometric reasoning that chirality strictly requires, and, because they omit explicit atom locations, they cannot easily incorporate chemical constraints or verify correctness.</li>
</ul>
<h2 id="core-innovation-joint-graph-and-coordinate-prediction">Core Innovation: Joint Graph and Coordinate Prediction</h2>
<p>MolScribe introduces an <strong>Image-to-Graph</strong> generation paradigm that combines the flexibility of neural networks with the precision of symbolic constraints. It frames the task probabilistically as:</p>
<p>$$
P(G | I) = P(A | I) P(B | A, I)
$$</p>
<p>Where the model predicts a sequence of atoms $A$ given an image $I$, followed by the bonds $B$ given both the atoms and the image.</p>
<ol>
<li><strong>Explicit Graph Prediction</strong>: It predicts a sequence of atoms (with 2D coordinates) and then predicts bonds between them.</li>
<li><strong>Symbolic Constraints</strong>: It uses the predicted graph structure and coordinates to strictly determine chirality and cis/trans isomerism.</li>
<li><strong>Abbreviation Expansion</strong>: It employs a greedy algorithm to parse and expand &ldquo;superatoms&rdquo; (e.g., &ldquo;CO2Et&rdquo;) into their full atomic structure.</li>
<li><strong>Dynamic Augmentation</strong>: It introduces a data augmentation strategy that randomly substitutes functional groups with abbreviations and adds R-groups during training to improve generalization.</li>
</ol>
<h2 id="methodology-autoregressive-atoms-and-pairwise-bonds">Methodology: Autoregressive Atoms and Pairwise Bonds</h2>
<p>The authors evaluate MolScribe on synthetic and real-world datasets, focusing on <strong>Exact Match Accuracy</strong> of the canonical SMILES string. The model generates atom sequences autoregressively:</p>
<p>$$
P(A | I) = \prod_{i=1}^n P(a_i | A_{&lt;i}, I)
$$</p>
<p>To handle continuous spatial locations, atom coordinates are mapped to discrete bins (e.g., $\hat{x}_i = \lfloor \frac{x_i}{W} \times n_{\text{bins}} \rfloor$) and decoded alongside element labels. Bonds are predicted by a pairwise classifier over the hidden states of every atom pair:</p>
<p>$$
P(B | A, I) = \prod_{i=1}^n \prod_{j=1}^n P(b_{i,j} | A, I)
$$</p>
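<p>The coordinate discretization is a uniform binning of pixel coordinates. A sketch using MolScribe's reported $n_{\text{bins}} = 64$ and $384 \times 384$ input size (the clamping of the boundary pixel is an assumption, not stated in the paper):</p>

```python
def to_bin(coord, size, n_bins=64):
    """Discretize a pixel coordinate into a token index:
    x_hat = floor(x / W * n_bins), clamped to the last bin."""
    return min(int(coord / size * n_bins), n_bins - 1)

def from_bin(idx, size, n_bins=64):
    """Invert a bin index to its bin-center pixel coordinate."""
    return (idx + 0.5) / n_bins * size

W = 384  # input resolution, so each bin spans 6 pixels
x = 200.0
b = to_bin(x, W)
print(b, from_bin(b, W))  # 33 201.0
```

With 64 bins over 384 pixels, quantization error is at most 3 pixels, which is small relative to typical bond lengths in rendered structures.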
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (MolVec, OSRA) and neural (Img2Mol, DECIMER, SwinOCSR) systems.</li>
<li><strong>Benchmarks</strong>:
<ul>
<li><strong>Synthetic</strong>: Indigo (in-domain) and ChemDraw (out-of-domain).</li>
<li><strong>Realistic</strong>: Five public benchmarks (CLEF, JPO, UOB, USPTO, Staker).</li>
<li><strong>New Dataset</strong>: 331 images from ACS Publications (journal articles).</li>
</ul>
</li>
<li><strong>Ablations</strong>: Tested performance without data augmentation, with continuous vs. discrete coordinates, and without non-atom tokens.</li>
<li><strong>Human Eval</strong>: Measured the time reduction for chemists using MolScribe to digitize molecules vs. drawing from scratch.</li>
</ul>
<h2 id="results-robust-exact-match-accuracy">Results: Robust Exact Match Accuracy</h2>
<ul>
<li><strong>Strong Performance</strong>: MolScribe achieved <strong>76-93% accuracy</strong> across public benchmarks, outperforming baselines on most datasets. On the ACS dataset of journal article images, MolScribe achieved 71.9% compared to the next best 55.3% (OSRA). On the large Staker patent dataset, MolScribe achieved 86.9%, surpassing MSE-DUDL (77.0%) while using far less training data (1.68M vs. 68M examples).</li>
<li><strong>Chirality Verification</strong>: Explicit geometric reasoning allowed MolScribe to predict chiral molecules significantly better than image-to-SMILES baselines. When chirality is ignored, the performance gap narrows (e.g., on Indigo, baseline accuracy rises from 94.1% to 96.3%), isolating MolScribe&rsquo;s primary advantage to geometric reasoning for stereochemistry.</li>
<li><strong>Hand-Drawn Generalization</strong>: The model achieved <strong>11.2% exact match accuracy</strong> on the DECIMER-HDM dataset despite having no hand-drawn images in its training set, with many errors limited to a few atom-level mismatches.</li>
<li><strong>Robustness</strong>: The model maintained high performance on perturbed images (rotation/shear), whereas rule-based systems degraded severely.</li>
<li><strong>Usability</strong>: The atom-level alignment allows for confidence visualization, and human evaluation showed it reduced digitization time from <strong>137s to 20s</strong> per molecule.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a mix of synthetic and patent data with extensive dynamic augmentation:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem (Synthetic)</strong></td>
          <td>1M</td>
          <td>Molecules randomly sampled from PubChem and rendered via Indigo toolkit; includes atom coords.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO (Patents)</strong></td>
          <td>680K</td>
          <td>Patent data lacks exact atom coordinates; relative coordinates normalized from MOLfiles to image dimensions (often introduces coordinate shifts).</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecule Augmentation</strong>:</p>
<ul>
<li><strong>Functional Groups</strong>: Randomly substituted using 53 common substitution rules (e.g., replacing substructures with &ldquo;Et&rdquo; or &ldquo;Ph&rdquo;).</li>
<li><strong>R-Groups</strong>: Randomly added using vocabulary: <code>[R, R1...R12, Ra, Rb, Rc, Rd, X, Y, Z, A, Ar]</code>.</li>
<li><strong>Styles</strong>: Random variation of aromaticity (circle vs. bonds) and explicit hydrogens.</li>
</ul>
<p><strong>Image Augmentation</strong>:</p>
<ul>
<li><strong>Rendering</strong>: Randomized font (Arial, Times, Courier, Helvetica), line width, and label modes during synthetic generation.</li>
<li><strong>Perturbations</strong>: Applied rotation ($\pm 90^{\circ}$), cropping (1%), padding (40%), downscaling, blurring, and salt-and-pepper/Gaussian noise.</li>
</ul>
<p><strong>Preprocessing</strong>: Input images are resized to $384 \times 384$.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Atom Prediction (Pix2Seq-style)</strong>:
<ul>
<li>The model generates a sequence of tokens: $S^A = [l_1, \hat{x}_1, \hat{y}_1, \dots, l_n, \hat{x}_n, \hat{y}_n]$.</li>
<li><strong>Discretization</strong>: Coordinates are binned into integer tokens ($n_{bins} = 64$).</li>
<li><strong>Tokenizer</strong>: Atom-wise tokenizer splits SMILES into atoms; non-atom tokens (parentheses, digits) are kept to help structure learning.</li>
</ul>
</li>
<li><strong>Bond Prediction</strong>:
<ul>
<li>Format: Pairwise classification for every pair of predicted atoms.</li>
<li>Symmetry: For symmetric bonds (single/double), the probability is averaged as:
$$
\hat{P}(b_{i,j} = t) = \frac{1}{2} \big( P(b_{i,j} = t) + P(b_{j,i} = t) \big)
$$
Wedge bonds are directional, so this symmetrization is not applied to them.</li>
</ul>
</li>
<li><strong>Abbreviation Expansion (Algorithm 1)</strong>:
<ul>
<li>A greedy algorithm connects atoms within an expanded abbreviation (e.g., &ldquo;COOH&rdquo;) until valences are full, avoiding the need for a fixed dictionary.</li>
<li><strong>Carbon Chains</strong>: Splits condensed chains like $C_aX_b$ into explicit sequences ($CX_q \dots CX_{q+r}$).</li>
<li><strong>Nested Formulas</strong>: Recursively parses nested structures like $N(CH_3)_2$ by treating them as superatoms attached to the current backbone.</li>
<li><strong>Valence Handling</strong>: Iterates through common valences first to resolve ambiguities.</li>
</ul>
</li>
</ul>
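<p>The symmetrization step can be sketched directly from the averaging formula above (wedge handling omitted; the data layout is illustrative, not MolScribe's):</p>

```python
def symmetrize(bond_probs):
    """Average P(b_ij = t) and P(b_ji = t) for symmetric bond types.
    bond_probs[i][j] is a dict {bond_type: probability}."""
    n = len(bond_probs)
    out = [[dict(bond_probs[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            for t in bond_probs[i][j]:
                avg = 0.5 * (bond_probs[i][j][t] + bond_probs[j][i][t])
                out[i][j][t] = out[j][i][t] = avg
    return out

# Two atoms whose two directional predictions disagree slightly.
probs = [[{}, {"single": 0.9}], [{"single": 0.7}, {}]]
print(symmetrize(probs)[0][1]["single"])  # ~0.8
```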
<h3 id="models">Models</h3>
<p>The architecture is an encoder-decoder with a classification head:</p>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer (Swin-B)</strong>, pre-trained on ImageNet-22K (88M params).</li>
<li><strong>Decoder</strong>: 6-layer Transformer, 8 heads, hidden dimension 256.</li>
<li><strong>Bond Predictor</strong>: 2-layer MLP (Feedforward) with ReLU, taking concatenated atom hidden states as input.</li>
<li><strong>Training</strong>: Teacher forcing, Cross-Entropy Loss, Batch size 128, 30 epochs.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Exact Match of Canonical SMILES.</p>
<ul>
<li>Stereochemistry: Must match tetrahedral chirality; cis-trans ignored.</li>
<li>R-groups: Replaced with wildcards <code>*</code> or <code>[d*]</code> for evaluation.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training performed on Linux server with <strong>96 CPUs</strong> and <strong>500GB RAM</strong>.</li>
<li><strong>GPUs</strong>: <strong>4x NVIDIA A100 GPUs</strong>.</li>
<li><strong>Training Time</strong>: Unspecified; comparative models on large datasets took &ldquo;more than one day&rdquo;.</li>
<li><strong>Inference</strong>: Requires autoregressive decoding for atoms, followed by a single forward pass for bonds.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/thomas0809/MolScribe">MolScribe (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training, inference, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/yujieq/MolScribe">MolScribe (Hugging Face)</a></td>
          <td>Demo</td>
          <td>MIT</td>
          <td>Interactive web demo for molecular image recognition</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Scoped to single-molecule images only; does not handle multi-molecule diagrams or reaction schemes.</li>
<li>Hand-drawn molecule recognition remains weak (the model was not trained on hand-drawn data).</li>
<li>Complex Markush structures (positional variation, frequency variation) are not supported, as these cannot be represented in SMILES or MOLfiles.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, Y., Guo, J., Tu, Z., Li, Z., Coley, C. W., &amp; Barzilay, R. (2023). MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation. <em>Journal of Chemical Information and Modeling</em>, 63(7), 1925-1934. <a href="https://doi.org/10.1021/acs.jcim.2c01480">https://doi.org/10.1021/acs.jcim.2c01480</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/spaces/yujieq/MolScribe">Hugging Face Space</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qianMolScribeRobustMolecular2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolScribe}}: {{Robust Molecular Structure Recognition}} with {{Image-To-Graph Generation}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolScribe}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Qian, Yujie and Guo, Jiang and Tu, Zhengkai and Li, Zhening and Coley, Connor W. and Barzilay, Regina}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1925--1934}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.2c01480}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolMole: Unified Vision Pipeline for Molecule Mining</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</guid><description>A vision-based deep learning framework that unifies molecule detection, reaction parsing, and OCSR for page-level chemical data extraction.</description><content:encoded><![CDATA[<h2 id="molmoles-dual-contribution-unified-ocsr-method-and-page-level-benchmarks">MolMole&rsquo;s Dual Contribution: Unified OCSR Method and Page-Level Benchmarks</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong <strong>Resource</strong> contribution.</p>
<p>It functions as a <strong>Method</strong> paper because it introduces &ldquo;MolMole,&rdquo; a unified deep learning framework that integrates molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline. It validates this method through extensive comparisons against state-of-the-art baselines like DECIMER and OpenChemIE.</p>
<p>It also serves as a <strong>Resource</strong> paper because the authors construct and release a novel page-level benchmark dataset of 550 annotated pages (patents and articles) to address the lack of standardized evaluation metrics for full-page chemical extraction.</p>
<h2 id="addressing-the-limitations-of-fragmented-processing">Addressing the Limitations of Fragmented Processing</h2>
<p>The rapid accumulation of chemical literature has trapped valuable molecular and reaction data in unstructured formats like images and PDFs. Extracting this manually is time-consuming, while existing AI frameworks have significant limitations:</p>
<ul>
<li><strong>DECIMER</strong>: Lacks the ability to process reaction diagrams entirely.</li>
<li><strong>OpenChemIE</strong>: Relies on external layout parser models to crop elements before processing. This dependence often leads to detection failures in documents with complex layouts.</li>
<li><strong>Generative Hallucination</strong>: Existing generative OCSR models (like MolScribe) are prone to &ldquo;hallucinating&rdquo; structures or failing on complex notations like polymers.</li>
</ul>
<h2 id="a-unified-vision-pipeline-for-layout-aware-detection">A Unified Vision Pipeline for Layout-Aware Detection</h2>
<p>MolMole introduces several architectural and workflow innovations:</p>
<ul>
<li><strong>Direct Page-Level Processing</strong>: Unlike OpenChemIE, MolMole processes full document pages directly without requiring an external layout parser, which improves robustness on complex layouts like two-column patents.</li>
<li><strong>Unified Vision Pipeline</strong>: It integrates three specialized vision models into one workflow:
<ul>
<li><strong>ViDetect</strong>: A DINO-based object detector for identifying molecular regions.</li>
<li><strong>ViReact</strong>: An RxnScribe-based model adapted for full-page reaction parsing.</li>
<li><strong>ViMore</strong>: A detection-based OCSR model that explicitly predicts atoms and bonds.</li>
</ul>
</li>
<li><strong>Hallucination Mitigation</strong>: By using a detection-based approach (ViMore), the model avoids hallucinating chemical structures and provides confidence scores.</li>
<li><strong>Advanced Notation Support</strong>: The system explicitly handles &ldquo;wavy bonds&rdquo; (variable attachments in patents) and polymer bracket notations, which confuse standard SMILES-based models.</li>
</ul>
<h2 id="page-level-benchmark-evaluation-and-unified-metrics">Page-Level Benchmark Evaluation and Unified Metrics</h2>
<p>The authors evaluated the framework on both a newly curated benchmark and existing public datasets:</p>
<ul>
<li><strong>New Benchmark Creation</strong>: They curated 550 pages (300 patents, 250 articles) fully annotated with bounding boxes, reaction roles (reactant, product, condition), and MOLfiles.</li>
<li><strong>Baselines</strong>: MolMole was compared against <strong>DECIMER 2.0</strong>, <strong>OpenChemIE</strong>, and <strong>ReactionDataExtractor 2.0</strong>.</li>
<li><strong>OCSR Benchmarking</strong>: ViMore was evaluated against DECIMER, MolScribe, and MolGrapher on four public datasets: <strong>USPTO</strong>, <strong>UOB</strong>, <strong>CLEF</strong>, and <strong>JPO</strong>.</li>
<li><strong>Metric Proposal</strong>: They introduced a combined &ldquo;End-to-End&rdquo; metric that modifies standard object detection Precision/Recall to strictly require correct SMILES conversion for a &ldquo;True Positive&rdquo;.</li>
</ul>
<p>$$ \text{True Positive (End-to-End)} = ( \text{IoU} \geq 0.5 ) \land ( \text{SMILES}_{\text{gt}} == \text{SMILES}_{\text{pred}} ) $$</p>
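<p>A sketch of this end-to-end criterion (box format <code>(x1, y1, x2, y2)</code> and helper names assumed):</p>

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def is_end_to_end_tp(iou, smiles_gt, smiles_pred, iou_thresh=0.5):
    """A detection counts as a True Positive only if the box overlaps
    (IoU >= 0.5) AND the recognized structure is exactly right."""
    return iou >= iou_thresh and smiles_gt == smiles_pred
```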
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Page-Level Performance</strong>: On the new benchmark, MolMole achieved F1 scores of <strong>89.1%</strong> (Patents) and <strong>86.8%</strong> (Articles) for the combined detection-to-conversion task, compared to 73.8% and 67.3% for DECIMER and 68.8% and 70.6% for OpenChemIE (Table 4).</li>
<li><strong>Reaction Parsing</strong>: ViReact achieved soft-match F1 scores of <strong>98.0%</strong> on patents and <strong>97.0%</strong> on articles, compared to 82.2% and 82.9% for the next best model, RxnScribe (w/o LP). Hard-match F1 scores were 92.5% (patents) and 84.6% (articles).</li>
<li><strong>Public Benchmarks</strong>: ViMore outperformed competitors on 3 out of 4 public OCSR datasets (CLEF, JPO, USPTO).</li>
<li><strong>Layout Handling</strong>: The authors demonstrated that MolMole successfully handles multi-column reaction diagrams where cropping-based models fail and faithfully preserves layout geometry in generated MOLfiles.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://lgai-ddu.github.io/molmole/">MolMole Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo and project information</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training Data</strong>: The models (ViDetect and ViMore) were trained on <strong>private/proprietary datasets</strong>, which is a limitation for full reproducibility from scratch.</li>
<li><strong>Benchmark Data</strong>: The authors introduce a test set of <strong>550 pages</strong> (3,897 molecules, 1,022 reactions) derived from patents and scientific articles. This dataset is stated to be made &ldquo;publicly available&rdquo;.</li>
<li><strong>Public Evaluation Data</strong>: Standard OCSR datasets used include USPTO (5,719 images), UOB (5,740 images), CLEF (992 images), and JPO (450 images).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pipeline Workflow</strong>: PDF → PNG Images → Parallel execution of <strong>ViDetect</strong> and <strong>ViReact</strong> → Cropping of molecular regions → <strong>ViMore</strong> conversion → Output (JSON/Excel).</li>
<li><strong>Post-Processing</strong>:
<ul>
<li><em>ViDetect</em>: Removes overlapping proposals based on confidence scores and size constraints.</li>
<li><em>ViReact</em>: Refines predictions by correcting duplicates and removing empty entities.</li>
<li><em>ViMore</em>: Assembles detected atom/bond information into structured representations (MOLfile).</li>
</ul>
</li>
</ul>
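<p>MolMole&rsquo;s code is not released, but the workflow above can be sketched as plain orchestration (every function name here is hypothetical):</p>

```python
def crop(image, box):
    # image as a 2D row-major array; box as (x1, y1, x2, y2)
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def process_page(image, videtect, vireact, vimore):
    """Hypothetical orchestration of the described pipeline: detection
    and reaction parsing run on the full page in parallel, then each
    detected molecular region is cropped and converted by ViMore."""
    boxes = videtect(image)        # molecule bounding boxes
    reactions = vireact(image)     # page-level reaction roles
    molecules = [vimore(crop(image, b)) for b in boxes]
    return {"molecules": molecules, "reactions": reactions}
```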
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture Basis</th>
          <th>Task</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ViDetect</strong></td>
          <td>DINO (DETR-based)</td>
          <td>Molecule Detection</td>
          <td>End-to-end training; avoids slow autoregressive methods.</td>
      </tr>
      <tr>
          <td><strong>ViReact</strong></td>
          <td>RxnScribe</td>
          <td>Reaction Parsing</td>
          <td>Operates on full pages; autoregressive decoder for structured sequence generation.</td>
      </tr>
      <tr>
          <td><strong>ViMore</strong></td>
          <td>Custom Vision Model</td>
          <td>OCSR</td>
          <td>Detection-based (predicts atom/bond regions).</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Molecule Detection</strong>: Evaluated using COCO metrics (AP, AR, F1) at IoU thresholds 0.50-0.95.</li>
<li><strong>Molecule Conversion</strong>: Evaluated using SMILES exact match accuracy and Tanimoto similarity.</li>
<li><strong>Combined Metric</strong>: A custom metric where a True Positive requires both IoU $\geq$ 0.5 and $\text{SMILES}_{\text{gt}} == \text{SMILES}_{\text{pred}}$.</li>
<li><strong>Reaction Parsing</strong>: Evaluated using <strong>Hard Match</strong> (all components correct) and <strong>Soft Match</strong> (molecular entities only, ignoring text labels).</li>
</ul>
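<p>Tanimoto similarity over fingerprint bit sets is $|A \cap B| / |A \cup B|$. A minimal sketch (in practice the sets would come from RDKit Morgan fingerprints; plain Python sets are used here):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets:
    |A ∩ B| / |A ∪ B|. Two empty fingerprints are defined as identical."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```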
<h3 id="missing-components">Missing Components</h3>
<ul>
<li><strong>Source code</strong>: Not publicly released. The paper states the toolkit &ldquo;will be accessible soon through an interactive demo on the LG AI Research website.&rdquo; For commercial use, the authors direct inquiries to contact <a href="mailto:ddu@lgresearch.ai">ddu@lgresearch.ai</a>.</li>
<li><strong>Training data</strong>: ViDetect and ViMore are trained on proprietary datasets. Training code and data are not available.</li>
<li><strong>Hardware requirements</strong>: Not specified in the paper.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chun, S., Kim, J., Jo, A., Jo, Y., Oh, S., et al. (2025). MolMole: Molecule Mining from Scientific Literature. <em>arXiv preprint arXiv:2505.03777</em>. <a href="https://doi.org/10.48550/arXiv.2505.03777">https://doi.org/10.48550/arXiv.2505.03777</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://lgai-ddu.github.io/molmole/">Project Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chun2025molmole,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolMole: Molecule Mining from Scientific Literature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chun, Sehyun and Kim, Jiye and Jo, Ahra and Jo, Yeonsik and Oh, Seungyul and Lee, Seungjun and Ryoo, Kwangrok and Lee, Jongmin and Kim, Seung Hwan and Kang, Byung Jun and Lee, Soonyoung and Park, Jun Ha and Moon, Chanwoo and Ham, Jiwon and Lee, Haein and Han, Heejae and Byun, Jaeseung and Do, Soojong and Ha, Minju and Kim, Dongyun and Bae, Kyunghoon and Lim, Woohyung and Lee, Edward Hwayoung and Park, Yongmin and Yu, Jeongsang and Jo, Gerrard Jeongwon and Hong, Yeonjung and Yoo, Kyungjae and Han, Sehui and Lee, Jaewan and Park, Changyoung and Jeon, Kijeong and Yi, Sihyuk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2505.03777}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGrapher: Graph-based Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</guid><description>A graph-based deep learning approach for optical chemical structure recognition that outperforms image captioning methods.</description><content:encoded><![CDATA[<h2 id="1-contribution--type">1. Contribution / Type</h2>
<p>This is primarily a <strong>Methodological</strong> paper that proposes a novel neural architecture (MolGrapher), shifting the paradigm of Optical Chemical Structure Recognition (OCSR) from image captioning back to graph reconstruction. It also has a significant <strong>Resource</strong> component, releasing a synthetic data generation pipeline and a new large-scale benchmark (USPTO-30K) to address the scarcity of annotated real-world data.</p>
<h2 id="2-motivation">2. Motivation</h2>
<p>The automatic analysis of chemical literature is critical for accelerating drug and material discovery, but much of this information is locked in 2D images of molecular structures.</p>
<ul>
<li><strong>Problem:</strong> Existing rule-based methods are rigid, while recent deep learning methods based on &ldquo;image captioning&rdquo; (predicting <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings) struggle with complex molecules and fail to exploit the natural graph structure of molecules.</li>
<li><strong>Gap:</strong> There is a lack of diverse, annotated real-world training data, and captioning models suffer from &ldquo;hallucinations&rdquo; where they predict valid SMILES that do not match the image.</li>
</ul>
<h2 id="3-novelty--core-innovation">3. Novelty / Core Innovation</h2>
<p>MolGrapher introduces a <strong>graph-based deep learning pipeline</strong> that explicitly models the molecule&rsquo;s geometry and topology.</p>
<ul>
<li><strong>Supergraph Concept:</strong> It first detects all atom keypoints and builds a &ldquo;supergraph&rdquo; of all plausible bonds.</li>
<li><strong>Hybrid Approach:</strong> It combines a ResNet-based keypoint detector with a Graph Neural Network (GNN) that classifies both atom nodes and bond nodes within the supergraph context. Both atoms and bonds are represented as nodes, with edges only connecting atom nodes to bond nodes.</li>
<li><strong>Synthetic Pipeline:</strong> A data generation pipeline that renders molecules with varying styles (fonts, bond widths) and augmentations (pepper patches, random lines, captions) to simulate real document noise.</li>
</ul>
<p>At the core of the Keypoint Detector&rsquo;s performance is the <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss. Since pixels without an atom drastically outnumber pixels containing an atom, WAHR loss is designed to counter the class imbalance. For ground-truth heatmap $y$ and prediction $p$:</p>
<p>$$ L_{WAHR}(p, y) = \sum_i \alpha_y (p_i - y_i)^2 $$</p>
<p>where $\alpha_y$ dynamically down-weights easily classified background pixels.</p>
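<p>A simplified sketch of this weighting idea (the paper&rsquo;s $\alpha_y$ is adaptive; a fixed background down-weight is used here for illustration only):</p>

```python
def wahr_loss(pred, target, background_weight=0.1):
    """Sketch of weight-adaptive heatmap regression: a per-pixel MSE
    where pixels with zero ground truth -- the overwhelming background
    class -- contribute with a reduced weight. The real WAHR loss
    adapts the weight dynamically; this fixed factor is a stand-in."""
    total = 0.0
    for p, y in zip(pred, target):
        alpha = 1.0 if y > 0 else background_weight
        total += alpha * (p - y) ** 2
    return total
```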
<h2 id="4-methodology--experiments">4. Methodology &amp; Experiments</h2>
<p>The authors evaluated MolGrapher against both rule-based (OSRA, MolVec) and deep learning baselines (DECIMER, Img2Mol, Image2Graph).</p>
<ul>
<li><strong>Benchmarks:</strong> Evaluated on standard datasets: USPTO, Maybridge UoB, CLEF-2012, and JPO.</li>
<li><strong>New Benchmark:</strong> Introduced and tested on <strong>USPTO-30K</strong>, split into clean, abbreviated, and large molecule subsets.</li>
<li><strong>Ablations:</strong> Analyzed the impact of synthetic augmentations, keypoint loss functions, supergraph connectivity radius, and GNN layers.</li>
<li><strong>Robustness:</strong> Tested on perturbed images (rotations, shearing) to mimic scanned patent quality.</li>
</ul>
<p>The GNN iteratively updates node embeddings through layers $\{g^k\}_{k \in [1, N]}$, where $e^{k+1} = g^k(e^k)$. Final predictions are obtained via two MLPs (one for atoms, one for bonds): $p_i = \text{MLP}_t(e_i^N)$, where $p_i \in \mathbb{R}^{C_t}$ contains the logits for atom or bond classes.</p>
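<p>One such update layer can be sketched as mean aggregation over supergraph neighbors, where atom nodes only neighbor bond nodes and vice versa (the learned update is stood in for by a generic <code>combine</code> function; this is an illustration, not the paper&rsquo;s exact architecture):</p>

```python
def gnn_layer(embeddings, neighbors, combine):
    """One sketch layer g^k: each node's new embedding e^{k+1}_i is
    produced from its current embedding and the mean of its neighbors'
    embeddings. `embeddings` maps node id -> feature vector (list);
    `neighbors` maps node id -> adjacent node ids."""
    new = {}
    for i, e in embeddings.items():
        nbr = [embeddings[j] for j in neighbors[i]]
        mean = [sum(vals) / len(nbr) for vals in zip(*nbr)] if nbr else e
        new[i] = combine(e, mean)
    return new
```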
<h2 id="5-results--conclusions">5. Results &amp; Conclusions</h2>
<p>MolGrapher achieved the highest accuracy among synthetic-only deep learning methods on most benchmarks tested.</p>
<ul>
<li><strong>Accuracy:</strong> It achieved <strong>91.5%</strong> accuracy on USPTO, outperforming all other synthetic-only deep learning methods including ChemGrapher (80.9%), Graph Generation (67.0%), and DECIMER 2.0 (61.0%).</li>
<li><strong>Large Molecules:</strong> It demonstrated superior scaling, correctly recognizing large molecules (USPTO-10K-L) where image captioning methods like Img2Mol failed completely (0.0% accuracy).</li>
<li><strong>Generalization:</strong> The method proved robust to image perturbations and style variations without requiring fine-tuning on real data. The paper acknowledges that MolGrapher cannot recognize Markush structures (depictions of sets of molecules with positional and frequency variation indicators).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model relies on synthetic data for training due to the scarcity of annotated real-world images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic Data</td>
          <td>300,000 images</td>
          <td>Generated from PubChem SMILES using RDKit. Augmentations include pepper patches, random lines, and variable bond styles.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>USPTO-30K</td>
          <td>30,000 images</td>
          <td>Created by authors from USPTO patents (2001-2020). Subsets: 10K clean, 10K abbreviated, 10K large (&gt;70 atoms).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Standard Benchmarks</td>
          <td>Various</td>
          <td>USPTO (5,719), Maybridge UoB (5,740), CLEF-2012 (992), JPO (450).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of three distinct algorithmic stages:</p>
<ol>
<li>
<p><strong>Keypoint Detection</strong>:</p>
<ul>
<li>Predicts a heatmap of atom locations using a CNN.</li>
<li>Thresholds heatmaps at the bottom 10th percentile and uses a $5\times5$ window for local maxima.</li>
<li>Uses <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss to handle class imbalance (background vs. atoms).</li>
</ul>
</li>
<li>
<p><strong>Supergraph Construction</strong>:</p>
<ul>
<li>Connects every detected keypoint to neighbors within a radius of $3 \times$ the estimated bond length.</li>
<li>Prunes edges with no filled pixels or if obstructed by a third keypoint.</li>
<li>Keeps a maximum of 6 bond candidates per atom.</li>
</ul>
</li>
<li>
<p><strong>Superatom Recognition</strong>:</p>
<ul>
<li>Detects &ldquo;superatom&rdquo; nodes (abbreviations like <code>COOH</code>).</li>
<li>Uses <strong>PP-OCR</strong> to transcribe the text at these node locations.</li>
</ul>
</li>
</ol>
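<p>The keypoint-extraction step can be sketched as threshold-then-local-maximum over the predicted heatmap (a fixed threshold stands in for the paper&rsquo;s bottom-10th-percentile rule):</p>

```python
def extract_keypoints(heatmap, window=5, threshold=0.1):
    """Suppress low responses, then keep pixels that are the maximum of
    their window x window neighborhood -- the atom keypoints."""
    h, w = len(heatmap), len(heatmap[0])
    r = window // 2
    points = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            patch = [heatmap[yy][xx]
                     for yy in range(max(0, y - r), min(h, y + r + 1))
                     for xx in range(max(0, x - r), min(w, x + r + 1))]
            if v >= max(patch):
                points.append((x, y))
    return points
```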
<h3 id="models">Models</h3>
<p>The architecture utilizes standard backbones tailored for specific sub-tasks:</p>
<ul>
<li><strong>Keypoint Detector</strong>: <strong>ResNet-18</strong> backbone with $8\times$ dilation to preserve spatial resolution.</li>
<li><strong>Node Classifier</strong>: <strong>ResNet-50</strong> backbone with $2\times$ dilation for extracting visual features at node locations.</li>
<li><strong>Graph Neural Network</strong>: A custom GNN that updates node embeddings based on visual features and neighborhood context. The initial node embedding combines the visual feature vector $v_i$ and a learnable type encoding $w_{t_i}$.</li>
<li><strong>Readout</strong>: MLPs classify nodes into atom types (e.g., C, O, N) and bond types (No Bond, Single, Double, Triple).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Accuracy is defined strictly: the predicted molecule must have an identical <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> string to the ground truth. Stereochemistry and Markush structures are excluded from evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>MolGrapher Score</th>
          <th>Best DL Baseline (Synthetic)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>USPTO</td>
          <td><strong>91.5%</strong></td>
          <td>80.9% (ChemGrapher)</td>
          <td>Full USPTO benchmark</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>USPTO-10K-L</td>
          <td><strong>31.4%</strong></td>
          <td>0.0% (Img2Mol)</td>
          <td>Large molecules (&gt;70 atoms)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>JPO</td>
          <td><strong>67.5%</strong></td>
          <td>64.0% (DECIMER 2.0)</td>
          <td>Challenging, low-quality images</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: Trained on 3 NVIDIA A100 GPUs.</li>
<li><strong>Training Time</strong>: 20 epochs.</li>
<li><strong>Optimization</strong>: ADAM optimizer, learning rate 0.0001, decayed by 0.8 after 5000 iterations.</li>
<li><strong>Loss Weighting</strong>: Atom classifier loss weighted by 1; bond classifier loss weighted by 3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MolGrapher">DS4SD/MolGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training and inference scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Title</strong>: MolGrapher: Graph-based Visual Recognition of Chemical Structures</p>
<p><strong>Authors</strong>: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valéry Weber, Ingmar Meijer, Peter Staar, Fisher Yu</p>
<p><strong>Citation</strong>: Morin, L., Danelljan, M., Agea, M. I., Nassar, A., Weber, V., Meijer, I., Staar, P., &amp; Yu, F. (2023). MolGrapher: Graph-based Visual Recognition of Chemical Structures. <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, 19552-19561.</p>
<p><strong>Publication</strong>: ICCV 2023</p>
<p><strong>Links</strong>:</p>
<ul>
<li><a href="https://openaccess.thecvf.com/content/ICCV2023/html/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.html">Paper</a></li>
<li><a href="https://github.com/DS4SD/MolGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMolGrapherGraphbasedVisual2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolGrapher}}: {{Graph-based Visual Recognition}} of {{Chemical Structures}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolGrapher}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{IEEE}}/{{CVF International Conference}} on {{Computer Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valéry and Meijer, Ingmar and Staar, Peter and Yu, Fisher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{19552--19561}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICCV51070.2023.01791}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-18}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolMiner: Deep Learning OCSR with YOLOv5 Detection</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/</guid><description>Deep learning OCSR tool using YOLOv5 and MobileNetV2 to extract machine-readable molecular structures from scientific documents and PDFs.</description><content:encoded><![CDATA[<h2 id="classification-and-contribution">Classification and Contribution</h2>
<p>This is primarily a <strong>Resource</strong> paper with a strong <strong>Method</strong> component.</p>
<ul>
<li><strong>Resource</strong>: It presents a complete software application (published as an &ldquo;Application Note&rdquo;) for Optical Chemical Structure Recognition (OCSR), including a graphical user interface (GUI) and a new curated &ldquo;Real-World&rdquo; dataset of 3,040 molecular images.</li>
<li><strong>Method</strong>: It proposes a novel &ldquo;rule-free&rdquo; pipeline that replaces traditional vectorization algorithms with deep learning object detection (YOLOv5) and segmentation models.</li>
</ul>
<h2 id="motivation-bottlenecks-in-rule-based-systems">Motivation: Bottlenecks in Rule-Based Systems</h2>
<ul>
<li><strong>Legacy Backlog</strong>: Decades of scientific literature contain chemical structures only as 2D images (PDFs), which are not machine-readable.</li>
<li><strong>Limitations of Legacy Architecture</strong>: Existing tools (like OSRA, CLIDE, MolVec) rely on rule-based vectorization (interpreting vectors and nodes) which struggle with noise, low resolution, and complex drawing styles found in scanned documents.</li>
<li><strong>Deep Learning Gap</strong>: While deep learning (DL) has advanced computer vision, few practical, end-to-end DL tools existed for OCSR that could handle the full pipeline from PDF extraction to graph generation with high accuracy.</li>
</ul>
<h2 id="core-innovation-object-detection-paradigm-for-ocsr">Core Innovation: Object Detection Paradigm for OCSR</h2>
<ul>
<li><strong>Object Detection Paradigm</strong>: MolMiner shifts away from the strategy of line-tracing (vectorization), opting to treat atoms and bonds directly as objects to be detected using <strong>YOLOv5</strong>. This allows it to &ldquo;look once&rdquo; at the image.</li>
<li><strong>End-to-End Pipeline</strong>: Integration of three specialized modules:
<ol>
<li><strong>MobileNetV2</strong> for segmenting molecular figures from PDF pages.</li>
<li><strong>YOLOv5</strong> for detecting chemical elements (atoms/bonds) as bounding boxes.</li>
<li><strong>EasyOCR</strong> for recognizing text labels and resolving abbreviations (supergroups) to full explicit structures.</li>
</ol>
</li>
<li><strong>Synthetic Training Strategy</strong>: The authors bypassed manual labeling by building a data generation module that uses RDKit to create chemically valid images with perfect ground-truth annotations automatically.</li>
</ul>
<h2 id="methodology-end-to-end-object-detection-pipeline">Methodology: End-to-End Object Detection Pipeline</h2>
<ul>
<li><strong>Benchmarks</strong>: Evaluated on four standard OCSR datasets: <strong>USPTO</strong> (5,719 images), <strong>UOB</strong> (5,740 images), <strong>CLEF2012</strong> (992 images), and <strong>JPO</strong> (450 images).</li>
<li><strong>New External Dataset</strong>: Collected and annotated a &ldquo;Real-World&rdquo; dataset of <strong>3,040 images</strong> from 239 scientific papers to test generalization beyond synthetic benchmarks.</li>
<li><strong>Baselines</strong>: Compared against open-source tools: <strong>MolVec</strong> (v0.9.8), <strong>OSRA</strong> (v2.1.0), and <strong>Imago</strong> (v2.0).</li>
<li><strong>Qualitative Tests</strong>: Tested on difficult cases like hand-drawn molecules and large-sized scans (e.g., Palytoxin).</li>
</ul>
<h2 id="results-speed-and-generalization-metrics">Results: Speed and Generalization Metrics</h2>
<ul>
<li><strong>Benchmark Performance</strong>: MolMiner outperformed open-source baselines on standard validation splits.
<ul>
<li><em>USPTO</em>: 93.3% MCS accuracy (vs. 89% for MolVec, per Table 2). The commercial CLiDE Pro tool reports a slightly higher 93.8% on USPTO.</li>
<li><em>Real-World Set</em>: 87.8% MCS accuracy (vs. 50.1% for MolVec, 8.9% for OSRA, and 10.3% for Imago).</li>
</ul>
</li>
<li><strong>Inference Velocity</strong>: The architecture allows for faster processing compared to CPU rule-based systems. On JPO (450 images), MolMiner finishes in under 1 minute versus 8-23 minutes for rule-based tools (Table 3).</li>
<li><strong>Robustness</strong>: Demonstrated ability to handle hand-drawn sketches and noisy scans, though limitations remain with crossing bonds, colorful backgrounds, crowded layout segmentation, and Markush structures.</li>
<li><strong>Software Release</strong>: Released as a free desktop application for Mac and Windows with a Ketcher-based editing plugin.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The system relies heavily on synthetic data for training, while evaluation uses both standard and novel real-world datasets.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left"><strong>Synthetic RDKit</strong></td>
          <td style="text-align: left">Large-scale</td>
          <td style="text-align: left">Generated using RDKit v2021.09.1 and ReportLab v3.5.0. Includes augmentations (rotation, thinning, noise).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>USPTO</strong></td>
          <td style="text-align: left">5,719</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 380.0.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>UOB</strong></td>
          <td style="text-align: left">5,740</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 213.5.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>CLEF2012</strong></td>
          <td style="text-align: left">992</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 401.2.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>JPO</strong></td>
          <td style="text-align: left">450</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 360.3.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>Real-World</strong></td>
          <td style="text-align: left">3,040</td>
          <td style="text-align: left"><strong>New Contribution</strong>. Collected from 239 scientific papers. <a href="https://zenodo.org/records/6973361">Download Link</a>.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Data Generation</strong>:
<ul>
<li>Uses <strong>RDKit</strong> <code>MolDraw2DSVG</code> and <code>CondenseMolAbbreviations</code> to generate images and ground truth.</li>
<li><strong>Augmentation</strong>: Rotation, line thinning/thickness variation, noise injection.</li>
</ul>
</li>
<li><strong>Graph Construction</strong>:
<ul>
<li>A distance-based algorithm connects recognized &ldquo;Atom&rdquo; and &ldquo;Bond&rdquo; objects into a molecular graph.</li>
<li><strong>Supergroup Parser</strong>: Matches detected text against a dictionary collected from RDKit, ChemAxon, and OSRA to resolve abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Me&rdquo;).</li>
</ul>
</li>
<li><strong>Image Preprocessing</strong>:
<ul>
<li><strong>Resizing</strong>: Images with max dim &gt; 2560 are resized to 2560. Small images (&lt; 640) resized to 640.</li>
<li><strong>Padding</strong>: Images padded to nearest upper bound (640, 1280, 1920, 2560) with white background (255, 255, 255).</li>
<li><strong>Dilation</strong>: For thick-line images, <code>cv2.dilate</code> (3x3 or 2x2 kernel) is applied to estimate median line width.</li>
</ul>
</li>
</ul>
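<p>The resizing and padding rules above reduce to a small piece of dimension arithmetic. The following is a minimal sketch of that logic only (the actual pipeline presumably operates on pixel arrays, e.g. via OpenCV; the function name is illustrative):</p>

```python
PAD_BOUNDS = (640, 1280, 1920, 2560)

def resize_and_pad_dims(w, h):
    """Return the resized (w, h) and the square canvas side to pad to."""
    longest = max(w, h)
    if longest > 2560:        # shrink oversized images so max dim = 2560
        scale = 2560 / longest
    elif longest < 640:       # enlarge small images so max dim = 640
        scale = 640 / longest
    else:
        scale = 1.0
    w, h = round(w * scale), round(h * scale)
    # Pad to the nearest upper bound (white background in the real pipeline).
    canvas = next(b for b in PAD_BOUNDS if max(w, h) <= b)
    return (w, h), canvas
```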
<h3 id="models">Models</h3>
<p>The system is a cascade of three distinct deep learning models:</p>
<ol>
<li><strong>MolMiner-ImgDet</strong> (Page Segmentation):
<ul>
<li><strong>Architecture</strong>: <strong>MobileNetV2</strong>.</li>
<li><strong>Task</strong>: Semantic segmentation to identify and crop chemical figures from full PDF pages.</li>
<li><strong>Classes</strong>: Background vs. Compound.</li>
<li><strong>Performance</strong>: Recall 95.5%.</li>
</ul>
</li>
<li><strong>MolMiner-ImgRec</strong> (Structure Recognition):
<ul>
<li><strong>Architecture</strong>: <strong>YOLOv5</strong> (One-stage object detector). Selected over MaskRCNN/EfficientDet for speed/accuracy trade-off.</li>
<li><strong>Task</strong>: Detects atoms and bonds as bounding boxes.</li>
<li><strong>Labels</strong>:
<ul>
<li><em>Atoms</em>: Si, N, Br, S, I, Cl, H, P, O, C, B, F, Text.</li>
<li><em>Bonds</em>: Single, Double, Triple, Wedge, Dash, Wavy.</li>
</ul>
</li>
<li><strong>Performance</strong>: mAP@0.5 = 97.5%.</li>
</ul>
</li>
<li><strong>MolMiner-TextOCR</strong> (Character Recognition):
<ul>
<li><strong>Architecture</strong>: <strong>EasyOCR</strong> (fine-tuned).</li>
<li><strong>Task</strong>: Recognize specific characters in &ldquo;Text&rdquo; regions identified by YOLO (e.g., supergroups, complex labels).</li>
<li><strong>Performance</strong>: ~96.4% accuracy.</li>
</ul>
</li>
</ol>
<h2 id="performance-evaluation--accuracy-metrics">Performance Evaluation &amp; Accuracy Metrics</h2>
<p>The paper argues that Maximum Common Substructure (MCS) accuracy is a better metric than string comparison of canonical identifiers like InChI or SMILES: InChI strings are highly sensitive to slight canonicalization or tautomerization discrepancies (such as differing aromaticity models). For comparing structural isomorphism, the paper therefore uses:</p>
<p>$$ \text{MCS Accuracy} = \frac{|\text{Edges}_{\text{MCS}}| + |\text{Nodes}_{\text{MCS}}|}{|\text{Edges}_{\text{Ground Truth}}| + |\text{Nodes}_{\text{Ground Truth}}|} $$</p>
<p>Using this metric to evaluate bond- and atom-level recall directly measures OCR extraction fidelity.</p>
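<p>As a sketch, the metric is a ratio of atom and bond counts. In practice the MCS itself can be computed with RDKit&rsquo;s <code>rdFMCS.FindMCS</code>, which reports <code>numAtoms</code> and <code>numBonds</code>; the helper below only assembles the ratio from those counts:</p>

```python
def mcs_accuracy(mcs_atoms, mcs_bonds, gt_atoms, gt_bonds):
    """Fraction of ground-truth nodes + edges recovered by the MCS."""
    return (mcs_atoms + mcs_bonds) / (gt_atoms + gt_bonds)
```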
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">MolMiner (Real-World)</th>
          <th style="text-align: left">MolVec</th>
          <th style="text-align: left">OSRA</th>
          <th style="text-align: left">Imago</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>MCS Accuracy</strong></td>
          <td style="text-align: left"><strong>87.8%</strong></td>
          <td style="text-align: left">50.1%</td>
          <td style="text-align: left">8.9%</td>
          <td style="text-align: left">10.3%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>InChI Accuracy</strong></td>
          <td style="text-align: left"><strong>88.9%</strong></td>
          <td style="text-align: left">62.6%</td>
          <td style="text-align: left">64.5%</td>
          <td style="text-align: left">10.8%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Inference Hardware</strong>: Tested on Intel Xeon Gold 6230R CPU @ 2.10 GHz.</li>
<li><strong>Acceleration</strong>: Supports batch inference on GPU, which provides the reported speedups over rule-based CPU tools.</li>
<li><strong>Runtime</strong>: Under 1 minute on JPO (450 images), 7 minutes on USPTO (5,719 images), compared to 29-148 minutes for baseline tools on USPTO (Table 3).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/iipharma/pharmamind-molminer">pharmamind-molminer</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">GitHub repo with user guides and release downloads</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://zenodo.org/records/6973361">Real-World Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">3,040 molecular images from 239 papers</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Y., Xiao, J., Chou, C.-H., Zhang, J., Zhu, J., Hu, Q., Li, H., Han, N., Liu, B., Zhang, S., Han, J., Zhang, Z., Zhang, S., Zhang, W., Lai, L., &amp; Pei, J. (2022). MolMiner: You only look once for chemical structure recognition. <em>Journal of Chemical Information and Modeling</em>, 62(22), 5321&ndash;5328. <a href="https://doi.org/10.1021/acs.jcim.2c00733">https://doi.org/10.1021/acs.jcim.2c00733</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iipharma/pharmamind-molminer">Github Repository</a></li>
<li><a href="https://zenodo.org/records/6973361">Zenodo Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xuMolMinerYouOnly2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MolMiner: You only look once for chemical structure recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MolMiner}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Xu, Youjun and Xiao, Jinchuan and Chou, Chia-Han and Zhang, Jianhang and Zhu, Jintao and Hu, Qiwan and Li, Hemin and Han, Ningsheng and Liu, Bingyu and Zhang, Shuaipeng and Han, Jinyu and Zhang, Zhen and Zhang, Shuhao and Zhang, Weilin and Lai, Luhua and Pei, Jianfeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5321--5328}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.2c00733}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image-to-Graph Transformers for Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</guid><description>A deep learning model that converts molecular images directly into graph structures, enabling recognition of abbreviated non-atomic symbols.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-classification">Contribution and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture designed to extract molecular structures from images by directly predicting the graph topology. The paper validates this approach through ablation studies (comparing ResNet-only baselines to the Transformer-augmented model) and extensive benchmarking against existing tools.</p>
<h2 id="the-challenge-with-smiles-and-non-atomic-symbols">The Challenge with SMILES and Non-Atomic Symbols</h2>
<ul>
<li><strong>Handling Abbreviations:</strong> Chemical structures in scientific literature often use non-atomic symbols (superatoms like &ldquo;R&rdquo; or &ldquo;Ph&rdquo;) to reduce complexity. Standard tools that generate SMILES strings fail here because SMILES syntax does not support arbitrary non-atomic symbols.</li>
<li><strong>Robustness to Style:</strong> Existing rule-based tools are brittle to the diverse drawing styles found in literature.</li>
<li><strong>Data Utilization:</strong> Pixel-wise graph recognition tools (like ChemGrapher) require expensive pixel-level labeling. An end-to-end approach can utilize massive amounts of image-molecule pairs (like USPTO data) without needing exact coordinate labels.</li>
</ul>
<h2 id="the-image-to-graph-i2g-architecture">The Image-to-Graph (I2G) Architecture</h2>
<p>The core novelty is the <strong>Image-to-Graph (I2G)</strong> architecture that bypasses string representations entirely:</p>
<ul>
<li><strong>Hybrid Encoder:</strong> Combines a ResNet backbone (for locality) with a Transformer encoder (for global context), allowing the model to capture relationships between atoms that are far apart in the image.</li>
<li><strong>Graph Decoder (GRAT):</strong> A modified Transformer decoder that generates the graph auto-regressively. It uses feature-wise transformations to modulate attention weights based on edge information (bond types).</li>
<li><strong>Coordinate-Aware Training:</strong> The model is forced to predict the exact 2D coordinates of atoms in the source image. Combined with auxiliary losses, this boosts SMI accuracy from 0.009 to 0.567 on the UoB ablation (Table 1 in the paper).</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Baselines:</strong> The model was compared against OSRA (rule-based), MolVec (rule-based), and ChemGrapher (deep learning pixel-wise).</li>
<li><strong>Benchmarks:</strong> Evaluated on four standard datasets: UoB, USPTO, CLEF, and JPO. Images were converted to PDF and back to simulate degradation.</li>
<li><strong>Large Molecule Test:</strong> A custom dataset (<strong>OLED</strong>) was created from 12 journal papers (434 images) to test performance on larger, more complex structures (average 52.8 atoms).</li>
<li><strong>Ablations:</strong> The authors tested the impact of the Transformer encoder, auxiliary losses, and coordinate prediction.</li>
</ul>
<h2 id="empirical-results-and-robustness">Empirical Results and Robustness</h2>
<ul>
<li><strong>Benchmark Performance:</strong> The proposed model outperformed existing models with a 17.1% relative improvement on benchmark datasets.</li>
<li><strong>Robustness:</strong> On large molecules (OLED dataset), it achieved a 12.8% relative improvement over MolVec (and 20.0% over OSRA).</li>
<li><strong>Data Scaling:</strong> Adding real-world USPTO data to the synthetic training set improved performance by 20.5%, demonstrating the model&rsquo;s ability to learn from noisy real-world images that lack coordinate labels.</li>
<li><strong>Handling Superatoms:</strong> The model successfully recognized pseudo-atoms (e.g., $R_1$, $R_2$, $R_3$) as distinct nodes. OSRA, which outputs SMILES, collapsed them into generic &ldquo;Any&rdquo; atoms since SMILES does not support non-atomic symbols. MolVec could not recognize them properly at all.</li>
</ul>
<h2 id="limitations-and-error-analysis">Limitations and Error Analysis</h2>
<p>The paper identifies two main failure modes on the USPTO, CLEF, and JPO benchmarks:</p>
<ol>
<li><strong>Unrecognized superatoms:</strong> The model struggles with complex multi-character superatoms not seen during training (e.g., NHNHCOCH$_3$ or H$_3$CO$_2$S). The authors propose character-level atom decoding as a future solution.</li>
<li><strong>Caption interference:</strong> The model sometimes misidentifies image captions as atoms, particularly on the JPO dataset. Data augmentation with arbitrary caption text or a dedicated image segmentation step could mitigate this.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors used a combination of synthetic and real-world data for training.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem</strong></td>
          <td>4.6M</td>
          <td>Synthetic images generated using RDKit. Random superatoms (e.g., $CF_3$, $NO_2$) were substituted to simulate abbreviations.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO</strong></td>
          <td>2.5M</td>
          <td>Real image-molecule pairs from patents. Used for robustness; lacks coordinate labels.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Benchmarks</strong></td>
          <td>~5.7k</td>
          <td>UoB, USPTO, CLEF, JPO. Average ~15.8 atoms per molecule.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>OLED</strong></td>
          <td>434</td>
          <td>Manually segmented from 12 journal papers. Large molecules (avg 52.8 atoms).</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing:</strong></p>
<ul>
<li>Input resolution is fixed at $800 \times 800$ pixels.</li>
<li>Images are virtually split into a $25 \times 25$ grid (625 patches total), where each patch is $32 \times 32$ pixels.</li>
</ul>
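<p>Under this scheme a pixel maps directly to a position in the flattened patch sequence. A minimal sketch (row-major serialization order is an assumption here; the paper does not state the flattening order):</p>

```python
GRID, PATCH = 25, 32   # 25x25 grid of 32x32-pixel patches (800x800 input)

def patch_index(x, y):
    """Serialized (row-major) index of the patch containing pixel (x, y)."""
    assert 0 <= x < GRID * PATCH and 0 <= y < GRID * PATCH
    return (y // PATCH) * GRID + (x // PATCH)
```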
<h3 id="algorithms">Algorithms</h3>
<p><strong>Encoder Logic:</strong></p>
<ul>
<li><strong>Grid Serialization:</strong> The $25 \times 25$ grid is flattened into a 1D sequence. 2D position information is concatenated to ResNet features before the Transformer.</li>
<li><strong>Auxiliary Losses:</strong> To aid convergence, classifiers on the encoder predict three things <em>per patch</em>: (1) number of atoms, (2) characters in atom labels, and (3) edge-sharing neighbors. These losses decrease to zero during training.</li>
</ul>
<p><strong>Decoder Logic:</strong></p>
<ul>
<li><strong>Auto-regressive Generation:</strong> At step $t$, the decoder generates a new node and connects it to existing nodes.</li>
<li><strong>Attention Modulation:</strong> Attention weights are transformed using bond information:
$$
\begin{aligned}
\text{Att}(Q, K, V) = \text{softmax} \left( \frac{\Gamma \odot (QK^T) + B}{\sqrt{d_k}} \right) V
\end{aligned}
$$
where $(\gamma_{ij}, \beta_{ij}) = f(e_{ij})$, with $e_{ij}$ being the edge type (in one-hot representation) between nodes $i$ and $j$, and $f$ is a multi-layer perceptron. $\Gamma$ and $B$ are matrices whose elements at position $(i, j)$ are $\gamma_{ij}$ and $\beta_{ij}$, respectively.</li>
<li><strong>Coordinate Prediction:</strong> The decoder outputs coordinates for each atom, which acts as a mechanism to track attention history.</li>
</ul>
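<p>The modulated attention above can be sketched in plain Python with explicit loops (a toy single-head sketch, not the batched tensor implementation; in the model, $\Gamma$ and $B$ come from an MLP over one-hot edge types):</p>

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def modulated_attention(Q, K, V, Gamma, B, d_k):
    """Attention with scores scaled by Gamma and shifted by B per node pair."""
    out = []
    for i in range(len(Q)):
        # scores[j] = (Gamma[i][j] * <Q[i], K[j]> + B[i][j]) / sqrt(d_k)
        scores = [
            (Gamma[i][j] * sum(q * k for q, k in zip(Q[i], K[j])) + B[i][j])
            / math.sqrt(d_k)
            for j in range(len(K))
        ]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][d] for j in range(len(K)))
                    for d in range(len(V[0]))])
    return out
```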
<h3 id="models">Models</h3>
<ul>
<li><strong>Image Encoder:</strong> ResNet-34 backbone followed by a Transformer encoder.</li>
<li><strong>Graph Decoder:</strong> A &ldquo;Graph-Aware Transformer&rdquo; (GRAT) that outputs nodes (atom labels, coordinates) and edges (bond types).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on structural identity, as standard string matching (SMILES) is insufficient for graphs with superatoms.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMI</strong></td>
          <td>Canonical SMILES Match</td>
          <td>Correct if predicted SMILES is identical to ground truth.</td>
      </tr>
      <tr>
          <td><strong>TS 1</strong></td>
          <td>Tanimoto Similarity = 1.0</td>
          <td>Ratio of predictions with perfect fingerprint overlap.</td>
      </tr>
      <tr>
          <td><strong>Sim.</strong></td>
          <td>Average Tanimoto Similarity</td>
          <td>Measures average structural overlap across all predictions.</td>
      </tr>
  </tbody>
</table>
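<p>The Tanimoto metrics compare molecular fingerprints: over sets of &ldquo;on&rdquo; bits, similarity is $|A \cap B| / |A \cup B|$. A minimal sketch (the concrete fingerprint type, e.g. RDKit Morgan bits, is an assumption, as the summary above does not record the paper&rsquo;s choice):</p>

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not bits_a and not bits_b:
        return 1.0
    return len(bits_a & bits_b) / len(bits_a | bits_b)
```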
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper does not release source code, pre-trained models, or the custom OLED evaluation dataset. The training data sources (PubChem, USPTO) are publicly available, but the specific image generation pipeline (modified RDKit with coordinate extraction and superatom substitution) is not released. Key architectural details (ResNet-34 backbone, Transformer encoder/decoder configuration) and training techniques are described, but exact hyperparameters for full reproduction are limited.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source of 4.6M molecules for synthetic image generation</td>
      </tr>
      <tr>
          <td><a href="https://www.uspto.gov/">USPTO</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>2.5M real image-molecule pairs from patents</td>
      </tr>
      <tr>
          <td><a href="https://www.rdkit.org/">RDKit</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Used (with modifications) for synthetic image generation</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoo, S., Kwon, O., &amp; Lee, H. (2022). Image-to-Graph Transformers for Chemical Structure Recognition. <em>ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 3393-3397. <a href="https://doi.org/10.1109/ICASSP43922.2022.9746088">https://doi.org/10.1109/ICASSP43922.2022.9746088</a></p>
<p><strong>Publication</strong>: ICASSP 2022</p>
]]></content:encoded></item><item><title>ABC-Net: Keypoint-Based Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</guid><description>Deep learning OCSR model using keypoint estimation to detect atom and bond centers for graph-based molecular structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-and-paper-type">Contribution and Paper Type</h2>
<p><strong>Method</strong>. The paper proposes a novel architectural framework (ABC-Net) for Optical Chemical Structure Recognition (OCSR). It reformulates the problem from image captioning (sequence generation) to keypoint estimation (pixel-wise detection), backed by ablation studies on noise and comparative benchmarks against state-of-the-art tools.</p>
<h2 id="motivation-for-keypoint-based-ocsr">Motivation for Keypoint-Based OCSR</h2>
<ul>
<li><strong>Inefficiency of Rule-Based Methods</strong>: Traditional tools (OSRA, MolVec) rely on hand-coded rules that are brittle, require domain expertise, and fail to handle the wide variance in molecular drawing styles.</li>
<li><strong>Data Inefficiency of Captioning Models</strong>: Recent Deep Learning approaches (like DECIMER, Img2mol) treat OCSR as image captioning (Image-to-SMILES). This is data-inefficient because canonical SMILES require learning traversal orders, necessitating millions of training examples.</li>
<li><strong>Goal</strong>: To create a scalable, data-efficient model that predicts graph structures directly by detecting atomic/bond primitives.</li>
</ul>
<h2 id="abc-nets-divide-and-conquer-architecture">ABC-Net&rsquo;s Divide-and-Conquer Architecture</h2>
<ul>
<li><strong>Divide-and-Conquer Strategy</strong>: ABC-Net breaks the problem down into detecting <strong>atom centers</strong> and <strong>bond centers</strong> as independent keypoints.</li>
<li><strong>Keypoint Estimation</strong>: A Fully Convolutional Network (FCN) generates heatmaps for object centers. This is inspired by computer vision techniques like CornerNet and CenterNet.</li>
<li><strong>Angle-Based Bond Detection</strong>: To handle overlapping bonds, the model classifies bond angles into 60 distinct bins ($0-360°$) at detected bond centers, allowing separation of intersecting bonds.</li>
<li><strong>Implicit Hydrogen Prediction</strong>: The model explicitly predicts the number of implicit hydrogens for heterocyclic atoms to resolve ambiguity in dearomatization.</li>
</ul>
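<p>The angle binning amounts to discretizing each bond direction into six-degree intervals. A minimal sketch (the exact bin boundaries are an assumption; the paper specifies only the 60-way split):</p>

```python
def angle_to_bin(angle_deg, n_bins=60):
    """Map an angle in degrees to one of n_bins equal bins over [0, 360)."""
    width = 360 / n_bins            # 6 degrees per bin for n_bins=60
    return int((angle_deg % 360) // width)
```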
<h2 id="experimental-setup-and-synthetic-data">Experimental Setup and Synthetic Data</h2>
<ul>
<li><strong>Dataset Construction</strong>: Synthetic dataset of 100,000 molecules from ChEMBL, rendered using two different engines (RDKit and Indigo) to ensure style diversity.</li>
<li><strong>Baselines</strong>: Compared against two rule-based methods (MolVec, OSRA) and one deep learning method (Img2mol).</li>
<li><strong>Robustness Testing</strong>: Evaluated on the external UOB dataset (real-world images) and synthetic images with varying levels of salt-and-pepper noise (up to $p=0.6$).</li>
<li><strong>Data Efficiency</strong>: Analyzed performance scaling with training set size (10k to 160k images).</li>
</ul>
<h2 id="results-generalization-and-noise-robustness">Results, Generalization, and Noise Robustness</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ABC-Net achieved <strong>94-98% accuracy</strong> across all test sets (Table 1), outperforming MolVec (12-45% on synthetic data, ~83% on UOB), OSRA (26-62% on synthetic, ~82% on UOB), and Img2mol (78-93% on non-stereo subsets).</li>
<li><strong>Generalization</strong>: On the external UOB benchmark, ABC-Net achieved <strong>&gt;95% accuracy</strong>, whereas the deep learning baseline (Img2mol) dropped to 78.2%, indicating better generalization.</li>
<li><strong>Data Efficiency</strong>: The model reached ~95% performance with only 80,000 training images, requiring roughly an order of magnitude less data than captioning-based models like Img2mol (which use millions of training examples).</li>
<li><strong>Noise Robustness</strong>: Performance remained stable (&lt;2% drop) with noise levels up to $p=0.1$. Even at extreme noise ($p=0.6$), Tanimoto similarity remained high, suggesting the model recovers most substructures even when exact matches fail.</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Drawing style coverage</strong>: The synthetic training data covers only styles available through RDKit and Indigo renderers. Many real-world styles (e.g., hand-drawn structures, atomic group abbreviations) are not represented.</li>
<li><strong>No stereo baseline from Img2mol</strong>: The Img2mol comparison only covers non-stereo subsets because stereo results were not available from the original Img2mol paper.</li>
<li><strong>Scalability to large molecules</strong>: Molecules with more than 50 non-hydrogen atoms are excluded from the dataset, and performance on such large structures is untested.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/zhang-xuan1314/ABC-Net">ABC-Net Repository</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation. Missing requirements.txt and pre-trained weights.</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status: Partially Reproducible</strong>. The code is provided, but key components like the pre-trained weights, exact training environment dependencies, and the generated synthetic datasets are missing from the open-source release, making exact reproduction difficult.</p>
<h3 id="data">Data</h3>
<p>The authors constructed a synthetic dataset because labeled pixel-wise OCSR data is unavailable.</p>
<ul>
<li><strong>Source</strong>: ChEMBL database</li>
<li><strong>Filtering</strong>: Excluded molecules with &gt;50 non-H atoms or rare atom types/charges (&lt;1000 occurrences).</li>
<li><strong>Sampling</strong>: 100,000 unique SMILES selected such that every atom type/charge appears in at least 1,000 compounds.</li>
<li><strong>Generation</strong>: Images generated via <strong>RDKit</strong> and <strong>Indigo</strong> libraries.
<ul>
<li><em>Augmentation</em>: Varied bond thickness, label mode, orientation, and aromaticity markers.</li>
<li><em>Resolution</em>: $512 \times 512$ pixels.</li>
<li><em>Noise</em>: Salt-and-pepper noise added during training ($P$ = probability of flipping a background pixel; $Q = 50P$).</li>
</ul>
</li>
</ul>
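<p>A minimal sketch of the noise scheme above, assuming $P$ flips background (white) pixels and $Q = 50P$ flips foreground (ink) pixels; the function name and the 1-is-background convention are ours, not from the paper:</p>

```python
import numpy as np

def salt_and_pepper(img, p, rng=None):
    """Asymmetric salt-and-pepper noise on a binary image.

    Background pixels (value 1) flip with probability p; foreground
    ink pixels (value 0) flip with probability q = 50 * p, matching
    the Q = 50P relation described in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = 50 * p
    # Choose, per pixel, the flip probability based on its class.
    flips = np.where(img == 1,
                     rng.random(img.shape) < p,   # background -> pepper
                     rng.random(img.shape) < q)   # foreground -> salt
    return np.where(flips, 1 - img, img)
```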
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (RDKit/Indigo)</td>
          <td>80k</td>
          <td>8:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>UOB Dataset</td>
          <td>~5.7k images</td>
          <td>External benchmark from Univ. of Birmingham</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Keypoint Detection (Heatmaps)</strong></p>
<ul>
<li>
<p><strong>Down-sampling</strong>: Input $512 \times 512$ → Output $128 \times 128$ (stride 4).</p>
</li>
<li>
<p><strong>Label Softening</strong>: To handle discretization error, ground truth peaks are set to 1, first-order neighbors to 0.95, others to 0.</p>
</li>
<li>
<p><strong>Loss Function</strong>: Penalty-reduced pixel-wise binary focal loss (a variant of the CornerNet loss). The loss formulation is:</p>
<p>$$ L_{det} = - \frac{1}{N} \sum_{x,y} \begin{cases} (1 - \hat{A}_{x,y})^{\alpha} \log(\hat{A}_{x,y}) &amp; \text{if } A_{x,y} = 1 \\ (1 - A_{x,y}) (\hat{A}_{x,y})^{\alpha} \log(1 - \hat{A}_{x,y}) &amp; \text{otherwise} \end{cases} $$</p>
<ul>
<li>$\alpha=2$ (focal parameter). The $(1 - A_{x,y})$ term reduces the penalty for first-order neighbors of ground truth locations.</li>
<li>Property classification losses use a separate focal parameter $\beta=2$ with weight balancing: classes with &lt;10% frequency are weighted 10x.</li>
</ul>
</li>
</ul>
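<p>The detection loss above can be sketched in NumPy (a minimal illustration, not the authors' code; the `eps` clipping is our addition for numerical safety):</p>

```python
import numpy as np

def penalty_reduced_focal_loss(pred, target, alpha=2.0, eps=1e-7):
    """Penalty-reduced pixel-wise focal loss (CornerNet style).

    pred:   predicted heatmap probabilities in (0, 1)
    target: softened ground truth (1 at peaks, 0.95 at first-order
            neighbours, 0 elsewhere)
    N is the number of ground-truth peaks (pixels with target == 1).
    """
    pred = np.clip(pred, eps, 1 - eps)
    pos = target == 1
    # Positive branch: (1 - p)^alpha * log(p) at ground-truth peaks.
    pos_loss = ((1 - pred) ** alpha * np.log(pred))[pos]
    # Negative branch: (1 - A) * p^alpha * log(1 - p) elsewhere;
    # the (1 - A) factor reduces the penalty near peaks (A = 0.95).
    neg_loss = ((1 - target) * pred ** alpha * np.log(1 - pred))[~pos]
    n = max(pos.sum(), 1)
    return -(pos_loss.sum() + neg_loss.sum()) / n
```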
<p><strong>2. Bond Direction Classification</strong></p>
<ul>
<li><strong>Angle Binning</strong>: $360°$ divided into 60 intervals.</li>
<li><strong>Inference</strong>: A bond is detected if the angle probability is a local maximum and exceeds a threshold.</li>
<li><strong>Non-Maximum Suppression (NMS)</strong>: Required for opposite angles (e.g., $30°$ and $210°$) representing the same non-stereo bond.</li>
</ul>
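<p>A pure-Python sketch of this inference rule; the exact threshold and the tie-breaking between equal opposite peaks are our assumptions:</p>

```python
def detect_bond_angles(probs, thresh=0.5):
    """Pick bond directions from 60 angle-bin probabilities (6 deg/bin).

    A bin fires if it is a local maximum above `thresh`; for a
    non-stereo bond the opposite bin (offset by 30 bins, i.e. 180 deg)
    describes the same bond, so the weaker of each opposite pair is
    suppressed (NMS).
    """
    n = len(probs)  # 60 bins
    peaks = [i for i in range(n)
             if probs[i] > thresh
             and probs[i] >= probs[(i - 1) % n]
             and probs[i] >= probs[(i + 1) % n]]
    kept = []
    for i in peaks:
        j = (i + n // 2) % n  # opposite direction, 180 deg away
        if j in peaks and (probs[j] > probs[i]
                           or (probs[j] == probs[i] and j < i)):
            continue  # keep only the stronger of the opposite pair
        kept.append(i)
    return kept
```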
<p><strong>3. Multi-Task Weighting</strong></p>
<ul>
<li>Uses Kendall&rsquo;s homoscedastic uncertainty weighting to balance 8 loss terms (atom detection, bond detection, atom type, charge, H-count, bond angle, bond type, bond length).</li>
</ul>
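<p>The weighting scheme (Kendall et al., 2018) scales each task loss $L_i$ by a learned precision $e^{-s_i}$ plus a regularizer $s_i$, where $s_i = \log \sigma_i^2$. A minimal sketch with plain arrays; in training the $s_i$ would be learnable parameters:</p>

```python
import numpy as np

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses via homoscedastic uncertainty weighting:

        total = sum_i exp(-s_i) * L_i + s_i

    losses:   per-task loss values L_i
    log_vars: per-task log-variances s_i (learnable in practice)
    """
    losses = np.asarray(losses, dtype=float)
    s = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-s) * losses + s))
```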
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: ABC-Net (Custom U-Net / FCN)</p>
<ul>
<li><strong>Input</strong>: $512 \times 512 \times 1$ (Grayscale).</li>
<li><strong>Contracting Path</strong>: 5 steps. Each step has conv-blocks + $2 \times 2$ MaxPool.</li>
<li><strong>Expansive Path</strong>: 3 steps. Transpose-Conv upsampling + Concatenation (Skip Connections).</li>
<li><strong>Heads</strong>: Separate $1 \times 1$ convs for each task map (Atom Heatmap, Bond Heatmap, Property Maps).</li>
<li><strong>Output Dimensions</strong>:
<ul>
<li>Heatmaps: $(1, 128, 128)$</li>
<li>Bond Angles: $(60, 128, 128)$</li>
</ul>
</li>
<li><strong>Pre-trained Weights</strong>: Not included in the public <a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub repository</a>. The paper&rsquo;s availability statement mentions code and training datasets but not weights.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Detection</strong>: Precision &amp; Recall (Object detection level).</li>
<li><strong>Regression</strong>: Mean Absolute Error (MAE) for bond lengths.</li>
<li><strong>Structure Recovery</strong>:
<ul>
<li><em>Accuracy</em>: Exact SMILES match rate.</li>
<li><em>Tanimoto</em>: ECFP similarity (fingerprint overlap).</li>
</ul>
</li>
</ul>
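<p>The Tanimoto metric reduces to Jaccard similarity on fingerprint bit sets. A minimal sketch; in practice the evaluation would use RDKit ECFP/Morgan fingerprints, and the set-of-on-bits representation here is our simplification:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints,
    represented as sets of on-bit indices (e.g. ECFP bits):
    |A & B| / |A | B|."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Two empty fingerprints are conventionally treated as identical.
    return len(a & b) / union if union else 1.0
```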
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>ABC-Net</th>
          <th>Img2mol (Baseline)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (UOB)</strong></td>
          <td><strong>96.1%</strong></td>
          <td>78.2%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Accuracy (Indigo)</strong></td>
          <td><strong>96.4%</strong></td>
          <td>89.5%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (UOB)</strong></td>
          <td><strong>0.989</strong></td>
          <td>0.953</td>
          <td>Higher substructure recovery</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>: 15 epochs, Batch size 64.</li>
<li><strong>Optimization</strong>: Adam Optimizer. LR $2.5 \times 10^{-4}$ (first 5 epochs) → $2.5 \times 10^{-5}$ (last 10).</li>
<li><strong>Repetition</strong>: Every experiment was repeated 3 times with random dataset splitting; mean values are reported.</li>
<li><strong>Compute</strong>: High-Performance Computing Center of Central South University. Specific GPU model not listed.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Yi, J.-C., Yang, G.-P., Wu, C.-K., Hou, T.-J., &amp; Cao, D.-S. (2022). ABC-Net: A divide-and-conquer based deep learning architecture for SMILES recognition from molecular images. <em>Briefings in Bioinformatics</em>, 23(2), bbac033. <a href="https://doi.org/10.1093/bib/bbac033">https://doi.org/10.1093/bib/bbac033</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangABCNetDivideandconquerBased2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ABC-Net: A Divide-and-Conquer Based Deep Learning Architecture for {SMILES} Recognition from Molecular Images}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Xiao-Chen and Yi, Jia-Cai and Yang, Guo-Ping and Wu, Cheng-Kun and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{bbac033}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bib/bbac033}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGrapher: Deep Learning for Chemical Graph OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</guid><description>Deep learning OCSR method using semantic segmentation and classification CNNs to reconstruct chemical graphs with improved stereochemistry.</description><content:encoded><![CDATA[<h2 id="classifying-the-methodology">Classifying the Methodology</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture and a specific graph-reconstruction algorithm to solve the problem of Optical Chemical Structure Recognition (OCSR). It validates this method by comparing it against the existing standard tool (OSRA), demonstrating superior performance on specific technical challenges like stereochemistry.</p>
<h2 id="the-ocr-stereochemistry-challenge">The OCR Stereochemistry Challenge</h2>
<p>Chemical knowledge is frequently locked in static images within scientific publications. Extracting these structures into machine-readable formats (graphs, SMILES) is essential for drug discovery and database querying. Existing tools, such as OSRA, rely on optical character recognition (OCR) and expert systems or hand-coded rules. These tools struggle with bond multiplicity and stereochemical information, often missing atoms or misinterpreting 3D cues (wedges and dashes). A machine-learning approach, by contrast, can improve simply by scaling the training data.</p>
<h2 id="decoupled-semantic-segmentation-and-classification-pipeline">Decoupled Semantic Segmentation and Classification Pipeline</h2>
<p>The core novelty is the <strong>segmentation-classification pipeline</strong> which decouples object detection from type assignment:</p>
<ol>
<li><strong>Semantic Segmentation</strong>: The model first predicts pixel-wise maps for atoms, bonds, and charges using a Dense Prediction Convolutional Network built on dilated convolutions.</li>
<li><strong>Graph Building Algorithm</strong>: A specific algorithm iterates over the segmentation maps to generate candidate locations for atoms and bonds.</li>
<li><strong>Refinement via Classification</strong>: Dedicated classification networks take cutouts of the original image combined with the segmentation mask to verify and classify each candidate (e.g., distinguishing a single bond from a double bond, or a wedge from a dash).</li>
</ol>
<p>Additionally, the authors developed a novel method for <strong>synthetic data generation</strong> by modifying the source code of RDKit to output pixel-wise labels during the image drawing process, addressing the lack of labeled training data.</p>
<h2 id="evaluating-synthetics-and-benchmarks">Evaluating Synthetics and Benchmarks</h2>
<ul>
<li><strong>Synthetic Benchmarking</strong>: The authors generated test sets in 3 different stylistic variations. For each style, they tested on both stereo (depicting 3D stereochemical information) and non-stereo compounds.</li>
<li><strong>Baseline Comparison</strong>: They compared the error rates of ChemGrapher against <strong>OSRA</strong> (Optical Structure Recognition Application).</li>
<li><strong>Component-level Evaluation</strong>: They analyzed the F1 scores of the segmentation networks versus the classification networks independently to understand where errors propagated.</li>
<li><strong>Real-world Case Study</strong>: They manually curated 61 images cut from journal articles to test performance on real, non-synthetic data.</li>
</ul>
<h2 id="advancements-over-osra">Advancements Over OSRA</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ChemGrapher consistently achieved lower error rates than OSRA across all synthetic styles, particularly for stereochemical information (wedge and dash bonds).</li>
<li><strong>Component Performance</strong>: The classification networks showed higher F1 scores than the segmentation networks across all prediction types (Figure 4 in the paper). This suggests the two-stage approach allows the classifier to correct segmentation noise.</li>
<li><strong>Real-world Viability</strong>: In the manual case study, ChemGrapher correctly predicted 46 of 61 images, compared to 42 of 61 for OSRA.</li>
<li><strong>Limitations</strong>: The model struggles with thick bond lines in real-world images. Performance is stronger on carbon-only compounds, where no letters appear in the image.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors created a custom synthetic dataset using ChEMBL and RDKit, as no pixel-wise labeled dataset existed.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>ChEMBL</td>
          <td>1.9M</td>
<td>Split into training pool (1.5M), validation pool (300K), and test pools (35K each).</td>
      </tr>
      <tr>
          <td><strong>Segmentation Train</strong></td>
          <td>Synthetic</td>
          <td>~114K</td>
          <td>Sampled from ChEMBL pool such that every atom type appears in &gt;1000 compounds.</td>
      </tr>
      <tr>
          <td><strong>Labels</strong></td>
          <td>Pixel-wise</td>
          <td>N/A</td>
          <td>Generated by modifying <strong>RDKit</strong> source code to output label masks (atom type, bond type, charge) during drawing.</td>
      </tr>
      <tr>
          <td><strong>Candidates (Val)</strong></td>
          <td>Cutouts</td>
          <td>~27K (Atom)<br>~55K (Bond)</td>
          <td>Validation candidates generated from ~450 compounds for evaluating the classification networks.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Algorithm 1: Graph Building</strong></p>
<ol>
<li><strong>Segment</strong>: Apply segmentation network $s(x)$ to get maps $S^a$ (atoms), $S^b$ (bonds), $S^c$ (charges).</li>
<li><strong>Atom Candidates</strong>: Identify candidate blobs in $S^a$.</li>
<li><strong>Classify Atoms</strong>: For each candidate, crop the input image and segmentation map. Feed to $c_A$ and $c_C$ to predict Atom Type and Charge. Add to Vertex set $V$ if valid.</li>
<li><strong>Bond Candidates</strong>: Generate all pairs of nodes in $V$ within $2 \times$ bond length distance.</li>
<li><strong>Classify Bonds</strong>: For each pair, create a candidate mask (two rectangles meeting in the middle to encode directionality). Feed to $c_B$ to predict Bond Type (single, double, wedge, etc.). Add to Edge set $E$.</li>
</ol>
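<p>Step 4 of the algorithm above can be sketched as follows (the function and argument names are illustrative, not from the paper):</p>

```python
import itertools
import math

def bond_candidates(atoms, bond_length):
    """Propose every pair of detected atoms closer than twice the
    typical bond length, to be scored by the bond classifier c_B.

    atoms: list of (x, y) centres from the atom heatmap blobs.
    Returns index pairs (i, j) with i < j.
    """
    pairs = []
    for i, j in itertools.combinations(range(len(atoms)), 2):
        if math.dist(atoms[i], atoms[j]) <= 2 * bond_length:
            pairs.append((i, j))
    return pairs
```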
<h3 id="models">Models</h3>
<p>The pipeline uses four distinct Convolutional Neural Networks (CNNs).</p>
<p><strong>1. Semantic Segmentation Network ($s$)</strong></p>
<ul>
<li><strong>Architecture</strong>: 8 convolutional layers (3x3) plus a final 1x1 linear layer (Dense Prediction Convolutional Network).</li>
<li><strong>Kernels</strong>: $3 \times 3$ for all convolutional layers; $1 \times 1$ for the final linear layer.</li>
<li><strong>Dilation</strong>: Uses dilated convolutions to expand receptive field without losing resolution. Six of the eight convolutional layers use dilation (factors: 2, 4, 8, 8, 4, 2); the first and last convolutional layers have no dilation.</li>
<li><strong>Input</strong>: Binary B/W image.</li>
<li><strong>Output</strong>: Multi-channel probability maps for Atom Types ($S^a$), Bond Types ($S^b$), and Charges ($S^c$).</li>
</ul>
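<p>A quick check of why this dilation schedule works: for stride-1 $3 \times 3$ convolutions, each layer adds $(k-1) \cdot d$ pixels of receptive field, so the eight-layer stack reaches a $61 \times 61$ receptive field at full output resolution (our arithmetic, assuming the dilation factors listed above with undilated first and last layers; the final $1 \times 1$ layer adds nothing):</p>

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a stack of stride-1 convolutions.

    Each layer with dilation d and kernel size k widens the receptive
    field by (k - 1) * d pixels while keeping spatial resolution.
    """
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf
```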
<p><strong>2. Classification Networks ($c_A, c_B, c_C$)</strong></p>
<ul>
<li><strong>Purpose</strong>: Refines predictions on small image patches.</li>
<li><strong>Architecture</strong>: 5 convolutional layers, followed by a MaxPool layer and a final linear (1x1) layer.
<ul>
<li>Layer 1: <strong>Depthwise separable convolution</strong> (no dilation).</li>
<li>Layers 2-4: Dilated convolutions (factors 2, 4, 8).</li>
<li>Layer 5: Standard convolution (no dilation).</li>
<li>MaxPool: $124 \times 124$.</li>
<li>Final: 1x1 linear layer.</li>
</ul>
</li>
<li><strong>Inputs</strong>:
<ul>
<li>Crop of the binary image ($x^{cut}$).</li>
<li>Crop of the segmentation map ($S^{cut}$).</li>
<li>&ldquo;Highlight&rdquo; mask ($h_L$) indicating the specific candidate location (e.g., a dot for atoms, two rectangles for bonds).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: <strong>F1 Score</strong> for individual network performance (segmentation pixels and classification accuracy).</li>
<li><strong>Metric</strong>: <strong>Error Rate</strong> (percentage of incorrect graphs) for overall system. A graph is &ldquo;incorrect&rdquo; if there is at least one mistake in atoms or bonds.</li>
<li><strong>Baselines</strong>: Compared against <strong>OSRA</strong>.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Training and inference performed on a single <strong>NVIDIA Titan Xp</strong> (donated by NVIDIA).</li>
</ul>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Closed.</strong> The authors did not release source code, pre-trained models, or the synthetic dataset. The data generation pipeline requires modifications to RDKit&rsquo;s internal drawing code, which are not publicly available. The ChEMBL source compounds are public, but the pixel-wise labeling procedure cannot be reproduced without the modified RDKit code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., Arany, Á., Moreau, Y., &amp; Simm, J. (2020). ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 60(10), 4506-4517. <a href="https://doi.org/10.1021/acs.jcim.0c00459">https://doi.org/10.1021/acs.jcim.0c00459</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2020 (arXiv preprint Feb 2020)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2002.09914">arXiv Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oldenhof2020chemgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Oldenhof, Martijn and Arany, Ádám and Moreau, Yves and Simm, Jaak}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4506--4517}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c00459}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>