<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Optical Chemical Structure Recognition on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/</link><description>Recent content in Optical Chemical Structure Recognition on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/index.xml" rel="self" type="application/rss+xml"/><item><title>MarkushGrapher-2: End-to-End Markush Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</guid><description>MarkushGrapher-2 fuses vision, text, and layout encoders with a dedicated OCR module for end-to-end Markush structure recognition from patent images.</description><content:encoded><![CDATA[<h2 id="a-multimodal-method-for-markush-structure-recognition">A Multimodal Method for Markush Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper that introduces MarkushGrapher-2, a universal encoder-decoder model for recognizing both standard molecular structures and multimodal Markush structures from chemical images. The primary contribution is a dual-encoder architecture that fuses a pretrained OCSR (Optical Chemical Structure Recognition) vision encoder with a Vision-Text-Layout (VTL) encoder, connected through a dedicated ChemicalOCR module for end-to-end processing. The paper also introduces two new resources: a large-scale training dataset (USPTO-MOL-M) of real-world Markush structures extracted from USPTO patent MOL files, and IP5-M, a manually annotated benchmark of 1,000 Markush structures from five major patent offices.</p>
<h2 id="why-markush-structure-recognition-remains-challenging">Why Markush Structure Recognition Remains Challenging</h2>
<p><a href="https://en.wikipedia.org/wiki/Markush_structure">Markush structures</a> are compact representations used in patent documents to describe families of related molecules. They combine a visual backbone (atoms, bonds, variable regions) with textual definitions of substituents that can replace those variable regions. This multimodal nature makes them harder to parse than standard molecular diagrams.</p>
<p>Three factors limit automatic Markush recognition. First, visual styles vary across patent offices and publication years. Second, textual definitions lack standardization and often contain conditional or recursive descriptions. Third, real-world training data with comprehensive annotations is scarce. As a result, Markush structures are currently indexed only in two proprietary, manually curated databases: MARPAT and DWPIM.</p>
<p>Prior work, including the original <a href="/notes/chemistry/optical-structure-recognition/markush/markushgrapher/">MarkushGrapher</a>, required pre-annotated OCR outputs at inference time, limiting practical deployment. General-purpose models like GPT-5 and DeepSeek-OCR produce mostly chemically invalid outputs on Markush images, suggesting these lie outside their training distribution.</p>
<h2 id="dual-encoder-architecture-with-dedicated-chemicalocr">Dual-Encoder Architecture with Dedicated ChemicalOCR</h2>
<p>MarkushGrapher-2 uses two complementary encoding pipelines:</p>
<ol>
<li>
<p><strong>Vision encoder pipeline</strong>: The input image passes through a Swin-B Vision Transformer (taken from <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a>) pretrained for OCSR. This encoder extracts visual features representing molecular structures and remains frozen during training.</p>
</li>
<li>
<p><strong>Vision-Text-Layout (VTL) pipeline</strong>: The same image goes through ChemicalOCR, a compact 256M-parameter vision-language model fine-tuned from SmolDocling for OCR on chemical images. ChemicalOCR extracts character-level text and bounding boxes. These, combined with image patches, feed into a T5-base VTL encoder following the UDOP fusion paradigm, where visual and textual tokens are spatially aligned by bounding box overlap.</p>
</li>
</ol>
<p>The VTL encoder output is concatenated with projected embeddings from the vision encoder. This joint representation feeds a text decoder that auto-regressively generates a CXSMILES (ChemAxon Extended <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) string describing the backbone structure and a substituent table listing variable group definitions.</p>
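<p>To make the output format concrete, here is a minimal, illustrative parser for the CXSMILES surface syntax. It assumes only the common <code>|$...$|</code> atom-label extension (real CXSMILES defines many more extension fields, and this is not the paper's code):</p>

```python
def split_cxsmiles(cxsmiles: str):
    """Split a CXSMILES string into its core SMILES and extension block.

    CXSMILES appends extensions after the SMILES, wrapped in '|...|',
    e.g. '*C1=CC=CC=C1 |$R1;;;;;;$|' labels the '*' atom as R1.
    """
    smiles, sep, rest = cxsmiles.partition(" |")
    if not sep or not rest.endswith("|"):
        return cxsmiles, None  # plain SMILES, no extension block
    return smiles, rest[:-1]


def atom_labels(extension: str):
    """Extract per-atom labels from the '$...$' section of an extension."""
    if extension is None or "$" not in extension:
        return []
    inner = extension.split("$")[1]
    return inner.split(";")  # one entry per atom, empty when unlabeled


core, ext = split_cxsmiles("*C1=CC=CC=C1 |$R1;;;;;;$|")
print(core)              # *C1=CC=CC=C1
print(atom_labels(ext))  # ['R1', '', '', '', '', '', '']
```

<p>The substituent table generated alongside the CXSMILES then maps labels such as <code>R1</code> to their textual definitions.</p>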
<h3 id="two-stage-training-strategy">Two-Stage Training Strategy</h3>
<p>Training proceeds in two phases:</p>
<ul>
<li>
<p><strong>Phase 1 (Adaptation)</strong>: The vision encoder is frozen. The MLP projector and text decoder train on 243K real-world image-SMILES pairs from MolScribe&rsquo;s USPTO dataset (3 epochs). This aligns the decoder to the pretrained OCSR feature space.</p>
</li>
<li>
<p><strong>Phase 2 (Fusion)</strong>: The vision encoder, projector, and ChemicalOCR are all frozen. The VTL encoder and text decoder train on a mix of 235K synthetic and 145K real-world Markush samples (2 epochs). The VTL encoder learns the features needed for CXSMILES and substituent table prediction without disrupting the established OCSR representations.</p>
</li>
</ul>
<p>The total model has 831M parameters, of which 744M are trainable.</p>
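<p>The schedule above can be summarized as a small configuration table. The component names and the "unused" entries are my reading of the paper, not the authors' code:</p>

```python
# Illustrative summary of the two-phase freeze schedule described above.
FREEZE_SCHEDULE = {
    "phase_1": {  # adaptation: align projector + decoder to frozen OCSR features
        "vision_encoder": "frozen",
        "mlp_projector": "trained",
        "text_decoder": "trained",
        "vtl_encoder": "unused",
        "chemical_ocr": "unused",
    },
    "phase_2": {  # fusion: learn VTL features without disturbing OCSR ones
        "vision_encoder": "frozen",
        "mlp_projector": "frozen",
        "text_decoder": "trained",
        "vtl_encoder": "trained",
        "chemical_ocr": "frozen",
    },
}

def trained_components(phase: str):
    """Components updated by the optimizer in a given phase."""
    return sorted(k for k, v in FREEZE_SCHEDULE[phase].items() if v == "trained")

print(trained_components("phase_1"))  # ['mlp_projector', 'text_decoder']
print(trained_components("phase_2"))  # ['text_decoder', 'vtl_encoder']
```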
<h2 id="datasets-and-evaluation-benchmarks">Datasets and Evaluation Benchmarks</h2>
<h3 id="training-data">Training Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical structures</td>
          <td>235K</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> SMILES augmented to CXSMILES, rendered with annotations</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>Manual OCR annotations</td>
          <td>7K</td>
          <td>IP5 patent document crops</td>
      </tr>
      <tr>
          <td>Phase 1 (OCSR)</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>Synthetic CXSMILES</td>
          <td>235K</td>
          <td>Same as OCR pretraining set</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>MolParser dataset</td>
          <td>91K</td>
          <td>Real-world Markush, converted to CXSMILES</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>USPTO-MOL-M</td>
          <td>54K</td>
          <td>Real-world, auto-extracted from USPTO MOL files (2010-2025)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-benchmarks">Evaluation Benchmarks</h3>
<p><strong>Markush benchmarks</strong>: M2S (103 samples), USPTO-M (74), WildMol-M (10K, semi-manually annotated), and the new IP5-M (1,000 samples manually annotated from USPTO, JPO, KIPO, CNIPA, and EPO patents, 1980-2025).</p>
<p><strong>OCSR benchmarks</strong>: USPTO (5,719), JPO (450), UOB (5,740), WildMol (10K).</p>
<p>The primary metric is <strong>CXSMILES Accuracy (A)</strong>: a prediction is correct when (1) the predicted SMILES matches the ground truth by <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChIKey</a> equivalence, and (2) all Markush features (variable groups, positional and frequency variation indicators) are correctly represented. Stereochemistry is ignored during evaluation.</p>
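<p>A sketch of how this metric could be computed. The key and feature extractors are caller-supplied stand-ins for InChIKey conversion and Markush-feature extraction (both would come from a cheminformatics toolkit in practice; this is an assumption, not the paper's evaluation code):</p>

```python
def cxsmiles_accuracy(predictions, references, backbone_key, markush_features):
    """CXSMILES accuracy: a prediction counts as correct only when the
    backbone matches by key equivalence AND the Markush feature sets match.
    `backbone_key` and `markush_features` are stand-in callables."""
    correct = sum(
        backbone_key(p) == backbone_key(r) and markush_features(p) == markush_features(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Toy stand-ins: backbone = text before the extension block, features =
# the labels inside it. Real evaluation would use InChIKey equivalence.
key = lambda s: s.split(" |")[0]
feats = lambda s: frozenset(s.split(" |")[1].strip("|$").split(";")) if " |" in s else frozenset()

preds = ["*CC |$R1$|", "CCO"]
refs = ["*CC |$R1$|", "CCC"]
print(cxsmiles_accuracy(preds, refs, key, feats))  # 0.5
```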
<h3 id="results-markush-structure-recognition">Results: Markush Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S</th>
          <th>USPTO-M</th>
          <th>WildMol-M</th>
          <th>IP5-M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>39</td>
          <td>30</td>
          <td>38.1</td>
          <td>47.7</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>21</td>
          <td>7</td>
          <td>28.1</td>
          <td>22.3</td>
      </tr>
      <tr>
          <td>GPT-5</td>
          <td>3</td>
          <td>0</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>0</td>
          <td>0</td>
          <td>1.9</td>
          <td>0.0</td>
      </tr>
      <tr>
          <td>MarkushGrapher-1</td>
          <td>38</td>
          <td>10</td>
          <td>32</td>
          <td>-</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td><strong>56</strong></td>
          <td><strong>13</strong></td>
          <td><strong>55</strong></td>
          <td><strong>48.0</strong></td>
      </tr>
  </tbody>
</table>
<p>On M2S, MarkushGrapher-2 achieves 56% CXSMILES accuracy vs. 38% for MarkushGrapher-1, a relative improvement of 47%. On WildMol-M (the largest benchmark at 10K samples), MarkushGrapher-2 reaches 55% vs. 38.1% for MolParser-Base and 32% for MarkushGrapher-1. GPT-5 and DeepSeek-OCR generate mostly chemically invalid outputs on Markush images: only 30% and 15% of their predictions are valid CXSMILES on M2S, respectively.</p>
<h3 id="results-standard-molecular-structure-recognition">Results: Standard Molecular Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WildMol</th>
          <th>JPO</th>
          <th>UOB</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>76.9</td>
          <td>78.9</td>
          <td>91.8</td>
          <td>93.0</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>66.4</td>
          <td>76.2</td>
          <td>87.4</td>
          <td>93.1</td>
      </tr>
      <tr>
          <td>DECIMER 2.7</td>
          <td>56.0</td>
          <td>64.0</td>
          <td>88.3</td>
          <td>59.9</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/">MolGrapher</a></td>
          <td>45.5</td>
          <td>67.5</td>
          <td>94.9</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>25.8</td>
          <td>31.6</td>
          <td>78.7</td>
          <td>36.9</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td>68.4</td>
          <td>71.0</td>
          <td><strong>96.6</strong></td>
          <td>89.8</td>
      </tr>
  </tbody>
</table>
<p>MarkushGrapher-2 achieves the highest score on UOB (96.6%) and remains competitive on other OCSR benchmarks, despite being primarily optimized for Markush recognition.</p>
<h3 id="chemicalocr-vs-general-ocr">ChemicalOCR vs. General OCR</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S F1</th>
          <th>USPTO-M F1</th>
          <th>IP5-M F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PaddleOCR v5</td>
          <td>7.7</td>
          <td>1.2</td>
          <td>1.9</td>
      </tr>
      <tr>
          <td>EasyOCR</td>
          <td>10.2</td>
          <td>18.0</td>
          <td>18.4</td>
      </tr>
      <tr>
          <td><strong>ChemicalOCR</strong></td>
          <td><strong>87.2</strong></td>
          <td><strong>93.0</strong></td>
          <td><strong>86.5</strong></td>
      </tr>
  </tbody>
</table>
<p>General-purpose OCR tools fail on chemical images because they misinterpret bonds as characters and cannot parse chemical abbreviations. ChemicalOCR outperforms both by a large margin.</p>
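<p>The OCR comparison uses bounding-box-level F1 at IoU &gt; 0.5. A minimal sketch of such a metric with greedy one-to-one box matching (the greedy strategy is an assumption; the paper only states the IoU threshold):</p>

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def box_f1(pred_boxes, gold_boxes, thr=0.5):
    """F1 from greedy one-to-one matching of predicted to gold boxes
    at IoU > thr; each gold box can be matched at most once."""
    unmatched = list(range(len(gold_boxes)))
    tp = 0
    for p in pred_boxes:
        for j in unmatched:
            if iou(p, gold_boxes[j]) > thr:
                tp += 1
                unmatched.remove(j)
                break
    prec = tp / len(pred_boxes) if pred_boxes else 0.0
    rec = tp / len(gold_boxes) if gold_boxes else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [(0, 0, 10, 10), (20, 0, 30, 10)]
pred = [(1, 0, 10, 10), (50, 0, 60, 10)]  # one good match, one spurious box
print(box_f1(pred, gold))  # 0.5
```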
<h2 id="ablation-results-and-key-findings">Ablation Results and Key Findings</h2>
<p><strong>OCR input is critical for Markush features.</strong> Without OCR, CXSMILES accuracy drops from 56% to 4% on M2S, and from 53.7% to 15.4% on IP5-M. The backbone structure accuracy ($A_{\text{InChIKey}}$) also drops substantially (from 80% to 39% on M2S), though the vision encoder alone can still recover some structural information. This confirms that textual cues (brackets, indices, variable definitions) are essential for Markush feature prediction.</p>
<p><strong>Two-phase training improves both tasks.</strong> Compared to single-phase (fusion only) training, the two-phase strategy improves CXSMILES accuracy from 44% to 50% on M2S and from 53.0% to 61.5% on JPO after the same number of epochs. Adapting the decoder to OCSR features before introducing the VTL encoder prevents the fusion process from degrading learned visual representations.</p>
<p><strong>Frequency variation indicators remain the hardest feature.</strong> On IP5-M, the per-feature breakdown shows 73.3% accuracy for backbone InChI, 74.8% for variable groups, 78.8% for positional variation, but only 30.7% for frequency variation (Sg groups). These repeating structural units are particularly challenging to represent and predict.</p>
<p><strong>Limitations</strong>: The model relies on accurate OCR as a prerequisite. Performance on USPTO-M (13% CXSMILES accuracy) lags behind other benchmarks, likely due to the older patent styles in that dataset. The paper does not report inference latency.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical images</td>
          <td>235K</td>
          <td>Generated from PubChem SMILES, augmented to CXSMILES</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>IP5 patent crops</td>
          <td>7K</td>
          <td>Manually annotated</td>
      </tr>
      <tr>
          <td>Phase 1 training</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Public, real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 training</td>
          <td>Synthetic + MolParser + USPTO-MOL-M</td>
          <td>380K</td>
          <td>Mix of synthetic (235K), MolParser (91K), USPTO-MOL-M (54K)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>M2S, USPTO-M, WildMol-M, IP5-M</td>
          <td>103 to 10K</td>
          <td>Markush benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>WildMol, JPO, UOB, USPTO</td>
          <td>450 to 10K</td>
          <td>OCSR benchmarks</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vision encoder</td>
          <td>Swin-B ViT (from MolScribe)</td>
          <td>~87M</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td>VTL encoder + decoder</td>
          <td>T5-base</td>
          <td>~744M trainable</td>
          <td>Trained</td>
      </tr>
      <tr>
          <td>ChemicalOCR</td>
          <td>SmolDocling-based VLM</td>
          <td>256M</td>
          <td>Fine-tuned, frozen in Phase 2</td>
      </tr>
      <tr>
          <td>MLP projector</td>
          <td>Linear projection</td>
          <td>-</td>
          <td>Trained in Phase 1, frozen in Phase 2</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>831M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CXSMILES Accuracy (A)</td>
          <td>Percentage of samples where InChIKey matches AND all Markush features correct</td>
      </tr>
      <tr>
          <td>$A_{\text{InChIKey}}$</td>
          <td>Backbone structure accuracy only (ignoring Markush features)</td>
      </tr>
      <tr>
          <td>Table Accuracy</td>
          <td>Percentage of correctly predicted substituent tables</td>
      </tr>
      <tr>
          <td>Markush Accuracy</td>
          <td>Joint CXSMILES + Table accuracy</td>
      </tr>
      <tr>
          <td>OCR F1</td>
          <td>Bounding-box-level precision/recall at IoU &gt; 0.5</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: NVIDIA A100 GPU</li>
<li>Phase 1: 3 epochs, Adam optimizer, lr 5e-4, 1000 warmup steps, batch size 10, weight decay 1e-3</li>
<li>Phase 2: 2 epochs, batch size 8</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation of MarkushGrapher-2 with models and datasets</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Code, models, and datasets are all publicly released under an MIT license with documented training hyperparameters and a single A100 GPU requirement.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Strohmeyer, T., Morin, L., Meijer, G. I., Weber, V., Nassar, A., &amp; Staar, P. (2026). MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>.</p>
<p><strong>Publication</strong>: CVPR 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository (MIT License)</a></li>
<li><a href="https://arxiv.org/abs/2603.28550">arXiv Preprint</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{strohmeyer2026markushgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Val\&#39;{e}ry and Nassar, Ahmed and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2603.28550}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Uni-Parser: Industrial-Grade Multi-Modal PDF Parsing (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</guid><description>Uni-Parser is a modular, multi-expert PDF parsing engine for scientific documents with integrated OCSR and chemical structure recognition.</description><content:encoded><![CDATA[<h2 id="an-industrial-grade-multi-modal-document-parser">An Industrial-Grade Multi-Modal Document Parser</h2>
<p>Uni-Parser is a modular, loosely coupled PDF parsing engine built for scientific literature and patents. It routes different content types (text, equations, tables, figures, chemical structures) to specialized expert models, then reassembles the parsed outputs into structured formats (JSON, Markdown, HTML) for downstream consumption by LLMs and other applications.</p>
<p>The system processes up to 20 PDF pages per second on 8 NVIDIA RTX 4090D GPUs and supports over 80 languages for OCR.</p>
<h2 id="a-five-stage-pipeline-architecture">A Five-Stage Pipeline Architecture</h2>
<p>The system is organized into five sequential stages:</p>
<ol>
<li><strong>Document Pre-Processing</strong>: Validates PDFs, extracts metadata, checks text accessibility, and identifies language.</li>
<li><strong>Group-based Layout Detection</strong>: Locates semantic blocks and identifies their categories using a novel tree-structured layout representation. Groups naturally paired elements (image-caption, table-title, molecule-identifier).</li>
<li><strong>Semantic Contents Parsing</strong>: Routes each block to a specialized expert: OCR for text, formula recognition for equations, table structure recognition for tables, OCSR for chemical structures, reaction extraction for reaction schemes, and chart parsing for charts. Over ten sub-models operate in parallel.</li>
<li><strong>Semantic Contents Gathering</strong>: Filters non-essential elements, reconstructs reading order, merges cross-page and multi-column content, and reintegrates inline multimodal elements.</li>
<li><strong>Output Formatting and Semantic Chunking</strong>: Exports parsed documents in task-specific formats with proper chunking for RAG and other downstream tasks.</li>
</ol>
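<p>Stage 3's expert routing amounts to dispatching each detected block to a category-specific parser. A toy sketch, with hypothetical category names and parser stubs standing in for Uni-Parser's internal experts:</p>

```python
# Hypothetical expert stubs; Uni-Parser's real experts are full models.
def parse_text(block):     return {"kind": "text", "content": block.upper()}
def parse_table(block):    return {"kind": "table", "content": block}
def parse_molecule(block): return {"kind": "molecule", "content": block}

EXPERTS = {
    "text": parse_text,
    "table": parse_table,
    "molecule": parse_molecule,
}

def route(blocks):
    """Send each (category, payload) block to its expert; unknown
    categories fall back to the text parser here."""
    return [EXPERTS.get(category, parse_text)(payload) for category, payload in blocks]

page = [("text", "abstract"), ("molecule", "c1ccccc1"), ("chart", "bars")]
print([b["kind"] for b in route(page)])  # ['text', 'molecule', 'text']
```

<p>The loose coupling means an expert can be swapped (e.g. upgrading the OCSR model) without touching the rest of the pipeline.</p>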
<h2 id="group-based-layout-detection">Group-Based Layout Detection</h2>
<p>A key contribution is the group-based layout detection model (Uni-Parser-LD), which uses a hierarchical tree structure to represent page layouts. Elements are organized into a bottom layer (parent nodes like paragraphs, tables, images) and a top layer (child nodes like captions, footnotes, identifiers). This preserves semantic associations between paired elements, such as molecules and their identifiers.</p>
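<p>The two-layer tree can be sketched as parent nodes holding their paired children, so that associations like molecule-identifier survive later pipeline stages. The node fields here are hypothetical, not Uni-Parser-LD's actual representation:</p>

```python
from dataclasses import dataclass, field

@dataclass
class LayoutNode:
    """One element in the two-layer layout tree: bottom-layer parents
    (paragraph, table, image, molecule) hold top-layer children
    (caption, footnote, identifier)."""
    category: str
    bbox: tuple          # (x1, y1, x2, y2) on the page
    children: list = field(default_factory=list)

molecule = LayoutNode("molecule", (40, 120, 260, 300))
molecule.children.append(LayoutNode("identifier", (40, 305, 260, 325)))

# The grouping keeps the molecule and its identifier attached, so the
# pairing is preserved through reading-order reconstruction downstream.
print([c.category for c in molecule.children])  # ['identifier']
```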
<p>The model is trained on 500k pages, including 220k human-annotated pages from scientific journals and patents across 85 languages. A modified DETR-based architecture was selected as the backbone after finding that RT-DETRv2, YOLOv12, and D-FINE exhibited training instability for this task.</p>
<h2 id="chemical-structure-recognition-with-molparser-15">Chemical Structure Recognition with MolParser 1.5</h2>
<p>Uni-Parser integrates MolParser 1.5 for OCSR, an end-to-end model that directly generates molecular representations from images. The authors explicitly note that graph-based (atom-bond) methods were the first direction they explored but ultimately abandoned because of:</p>
<ul>
<li>Strong reliance on rigid, hand-crafted rules that limit scalability</li>
<li>Substantially higher annotation costs (over 20x compared to end-to-end approaches)</li>
<li>Lower performance ceilings despite increasing training data</li>
</ul>
<h3 id="molecule-localization">Molecule Localization</h3>
<p>Uni-Parser-LD achieves strong molecule detection performance:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>mAP@50</th>
          <th>mAP@50-95</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (Uni-Parser Bench)</td>
          <td><strong>0.994</strong></td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.983</td>
          <td>0.919</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.974</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (BioVista Bench)</td>
          <td><strong>0.981</strong></td>
          <td><strong>0.844</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.961</td>
          <td>0.871</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.945</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td>BioMiner</td>
          <td>0.929</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.899</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h3 id="ocsr-accuracy">OCSR Accuracy</h3>
<p>MolParser 1.5 consistently outperforms prior methods across molecule types:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Full</th>
          <th>Chiral</th>
          <th>Markush</th>
          <th>All</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser 1.5</strong> (Uni-Parser Bench)</td>
          <td><strong>0.979</strong></td>
          <td><strong>0.809</strong></td>
          <td><strong>0.805</strong></td>
          <td><strong>0.886</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.953</td>
          <td>0.676</td>
          <td>0.664</td>
          <td>0.800</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.617</td>
          <td>0.274</td>
          <td>0.168</td>
          <td>0.417</td>
      </tr>
      <tr>
          <td><strong>MolParser 1.5</strong> (BioVista Bench)</td>
          <td><strong>0.795</strong></td>
          <td><strong>0.604</strong></td>
          <td><strong>0.761</strong></td>
          <td><strong>0.780</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.669</td>
          <td>0.352</td>
          <td>0.733</td>
          <td>0.703</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.774</td>
          <td>0.497</td>
          <td>0.185</td>
          <td>0.507</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.703</td>
          <td>0.481</td>
          <td>0.156</td>
          <td>0.455</td>
      </tr>
      <tr>
          <td>MolNexTR</td>
          <td>0.695</td>
          <td>0.419</td>
          <td>0.045</td>
          <td>0.401</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>0.545</td>
          <td>0.326</td>
          <td>0.000</td>
          <td>0.298</td>
      </tr>
  </tbody>
</table>
<p>Chiral molecule recognition remains a significant challenge and is identified as a key area for future work.</p>
<h2 id="document-parsing-benchmarks">Document Parsing Benchmarks</h2>
<p>On the Uni-Parser Benchmark (150 PDFs, 2,887 pages from patents and scientific articles), Uni-Parser (HQ mode) achieves an overall score of 89.74 (excluding molecules), outperforming both pipeline tools (MinerU, PP-StructureV3) and specialized VLMs (MinerU2-VLM, DeepSeek-OCR, PaddleOCR-VL). Competing systems score zero on molecule localization and OCSR because they lack molecular recognition capabilities.</p>
<p>On the general-document OmniDocBench-1.5, a variant (Uni-Parser-G) using a swapped layout module achieves 89.75 overall, competitive with top-performing specialized VLMs.</p>
<h2 id="comparison-with-ocsr-enabled-pdf-parsers">Comparison with OCSR-Enabled PDF Parsers</h2>
<p>On a controlled test set of 141 simple molecules, Uni-Parser outperforms other PDF parsing systems with OCSR support:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Recall</th>
          <th>OCSR Success</th>
          <th>OCSR Acc</th>
          <th>Id Match</th>
          <th>Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>96.5%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>1.8s</strong></td>
      </tr>
      <tr>
          <td>MathPix</td>
          <td>100%</td>
          <td>75.9%</td>
          <td>59.6%</td>
          <td>-</td>
          <td>66.1s</td>
      </tr>
      <tr>
          <td>MinerU.Chem</td>
          <td>66.7%</td>
          <td>63.1%</td>
          <td>22.7%</td>
          <td>-</td>
          <td>~7 min</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/UniParser">HuggingFace Models</a></td>
          <td>Model/Dataset</td>
          <td>Unknown</td>
          <td>MolDet models and MolParser-7M dataset available</td>
      </tr>
      <tr>
          <td><a href="https://uni-parser.github.io">Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Project website with documentation</td>
      </tr>
  </tbody>
</table>
<p>The Uni-Parser system is deployed on a cluster of 240 NVIDIA L40 GPUs (48 GB each), with 22 CPU cores and 90 GB of host memory per GPU. The reference throughput benchmark (20 pages/second) uses 8 NVIDIA RTX 4090D GPUs.</p>
<p>The HuggingFace organization hosts MolDet detection models and several datasets (MolParser-7M, RxnBench, OmniScience), but the full Uni-Parser system code and end-to-end inference pipeline do not appear to be publicly released. MolParser 1.5 model weights are also not publicly available as of this writing.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li>Chiral molecule recognition remains a challenge for end-to-end OCSR models</li>
<li>Chemical reaction understanding in real-world literature has substantial room for improvement</li>
<li>Layout models are primarily tailored to scientific and patent documents, with plans to expand to newspapers, slides, books, and financial statements</li>
<li>Chart parsing falls short of industrial-level requirements across the diversity of chart types in scientific literature</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Tao, H., Yang, S., Huang, C., Zhong, S., Lu, H., Lyu, H., Li, X., Zhang, L., &amp; Ke, G. (2025). Uni-Parser Technical Report. <em>arXiv preprint arXiv:2512.15098</em>. <a href="https://arxiv.org/abs/2512.15098">https://arxiv.org/abs/2512.15098</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://uni-parser.github.io">Project Page</a></li>
<li><a href="https://huggingface.co/UniParser">HuggingFace Models</a></li>
</ul>
]]></content:encoded></item><item><title>GraSP: Graph Recognition via Subgraph Prediction (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</guid><description>GraSP is a general image-to-graph framework using sequential subgraph prediction, applied to OCSR with 67.5% accuracy on QM9.</description><content:encoded><![CDATA[<h2 id="a-general-framework-for-visual-graph-recognition">A General Framework for Visual Graph Recognition</h2>
<p>GraSP (Graph Recognition via Subgraph Prediction) addresses a fundamental limitation in image-to-graph methods: existing solutions are task-specific and do not transfer between domains. Whether the task is OCSR, scene graph recognition, music notation parsing, or road network extraction, each domain has developed independent solutions despite solving the same conceptual problem of extracting a graph from an image.</p>
<p>The key insight is that graph recognition can be reformulated as sequential subgraph prediction using a binary classifier, sidestepping two core difficulties of using graphs as neural network outputs:</p>
<ol>
<li><strong>Graph isomorphism</strong>: An uncolored graph with $n$ nodes has up to $n!$ equivalent adjacency representations, making direct output comparison intractable</li>
<li><strong>Compositional outputs</strong>: Nodes, edges, and features are interdependent, so standard i.i.d. loss functions are insufficient</li>
</ol>
<h2 id="sequential-subgraph-prediction-as-an-mdp">Sequential Subgraph Prediction as an MDP</h2>
<p>GraSP formulates graph recognition as a Markov Decision Process. Starting from an empty graph, the method iteratively expands the current graph by adding one edge at a time (connecting either a new node or two existing nodes). At each step, a binary classifier predicts whether each candidate successor graph is a subgraph of the target graph shown in the image.</p>
<p>The critical observation is that the optimal value function $V^{\pi^*}$ satisfies:</p>
<p>$$V^{\pi^*}(\mathcal{G}_t | \mathcal{I}) = 1 \iff \mathcal{G}_t \subseteq \mathcal{G}_{\mathcal{I}}$$</p>
<p>This means the value function reduces to a subgraph membership test, which can be learned as a binary classifier rather than requiring reinforcement learning. Greedy decoding then suffices: at each step, select any successor that the classifier predicts is a valid subgraph, and terminate when the classifier indicates the current graph is complete.</p>
<p>This formulation decouples <strong>decision</strong> (what to add) from <strong>generation</strong> (in what order), making the same model applicable across different graph types without modification.</p>
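<p>The greedy decoding loop can be sketched with the learned classifier replaced by its oracle: since the optimal value function reduces to a subgraph-membership test, a set-inclusion check stands in for the model (the toy target graph, node universe, and function names below are illustrative, not from the paper):</p>

```python
from itertools import combinations

# Toy stand-in for GraSP's learned binary classifier: the optimal value
# function reduces to a subgraph-membership test against the target graph.
TARGET = frozenset({(0, 1), (1, 2), (1, 3)})   # illustrative target "in the image"

def classifier(graph, terminal=False):
    """Terminal check: is the graph complete? Otherwise: is it a valid subgraph?"""
    return graph == TARGET if terminal else graph <= TARGET

def greedy_decode(num_nodes, classify):
    graph = frozenset()
    while not classify(graph, terminal=True):
        nodes = {v for e in graph for v in e}
        progressed = False
        for u, v in combinations(range(num_nodes), 2):
            edge = (u, v)
            # each step adds one edge, touching the current graph once non-empty
            if edge in graph or (graph and u not in nodes and v not in nodes):
                continue
            if classify(graph | {edge}):   # accept any valid successor
                graph = graph | {edge}
                progressed = True
                break
        if not progressed:
            break                          # classifier rejected every successor
    return graph

print(greedy_decode(4, classifier) == TARGET)  # → True
```

<p>With an imperfect learned classifier the same loop applies unchanged; only the quality of the accept/reject decisions differs.</p>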
<h2 id="architecture-gnn--film-conditioned-cnn">Architecture: GNN + FiLM-Conditioned CNN</h2>
<p>The architecture has three components:</p>
<ol>
<li>
<p><strong>GNN encoder</strong>: A Message Passing Neural Network processes the candidate subgraph, producing a graph embedding. Messages are constructed as concatenations of source node features, target node features, and connecting edge features.</p>
</li>
<li>
<p><strong>FiLM-conditioned CNN</strong>: A ResNet-v2 processes the image, with FiLM layers placed after every normalization layer within each block. The graph embedding conditions the image processing, producing a joint graph-image representation.</p>
</li>
<li>
<p><strong>MLP classification head</strong>: Takes the conditioned image embedding plus a binary terminal flag (indicating whether this is a termination check) and predicts subgraph membership.</p>
</li>
</ol>
<p>The model uses only 7.25M parameters. Group Normalization is used in the CNN (8 groups per layer), Layer Normalization in the GNN and MLP.</p>
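<p>The FiLM conditioning step can be sketched in NumPy: GroupNorm over channels followed by a per-channel scale and shift computed from the graph embedding (the weights and tensor sizes here are random stand-ins for the learned projections):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def group_norm(x, groups=8, eps=1e-5):
    """GroupNorm over the channel axis; x has shape (C, H, W)."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(c, h, w)

def film(x, graph_emb, w_gamma, w_beta):
    """FiLM: per-channel scale and shift computed from the graph embedding."""
    gamma = w_gamma @ graph_emb
    beta = w_beta @ graph_emb
    return gamma[:, None, None] * x + beta[:, None, None]

C, D = 16, 32                                  # channels, embedding dim (illustrative)
x = rng.standard_normal((C, 8, 8))             # feature map inside a ResNet-v2 block
g_emb = rng.standard_normal(D)                 # GNN embedding of the candidate subgraph
w_g, w_b = rng.standard_normal((C, D)), rng.standard_normal((C, D))

y = film(group_norm(x), g_emb, w_g, w_b)       # FiLM right after the normalization layer
print(y.shape)  # (16, 8, 8)
```
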
<h2 id="training-via-streaming-data-generation">Training via Streaming Data Generation</h2>
<p>Training uses a streaming architecture rather than a fixed dataset:</p>
<ul>
<li>For each iteration, a target graph $\mathcal{G}_T$ is sampled and rendered as an image</li>
<li><strong>Positive samples</strong> are generated by deleting edges that do not disconnect the graph (yielding valid subgraphs)</li>
<li><strong>Negative samples</strong> are generated by expanding successor states and checking via approximate subgraph matching</li>
<li>Two FIFO buffers (one for positives, one for negatives), each holding up to 25,000 images, maintain diverse and balanced mini-batches of 1024 samples</li>
<li>Training uses the RAdam optimizer with a cosine learning rate schedule (warmup over 50M samples, cycle of 250M samples) on 4 A100 GPUs with a 24h budget</li>
</ul>
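<p>The positive-sample step can be sketched as follows, assuming "does not disconnect the graph" means the remaining edge-induced graph stays connected (the deletion probability and helper names are illustrative):</p>

```python
import random

def edge_connected(edges):
    """True if the graph induced by `edges` is connected (DFS over its nodes)."""
    if not edges:
        return True
    nodes = {v for e in edges for v in e}
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack += [w for (a, b) in edges if u in (a, b) for w in (a, b) if w != u]
    return seen == nodes

def positive_sample(target_edges, rng):
    """Randomly delete edges whose removal keeps the remaining graph connected,
    yielding a valid subgraph to render and label as a positive sample."""
    edges = set(target_edges)
    for e in sorted(edges, key=lambda _: rng.random()):
        if len(edges) > 1 and rng.random() < 0.5 and edge_connected(edges - {e}):
            edges.remove(e)
    return edges

rng = random.Random(0)
target = {(0, 1), (1, 2), (2, 3), (3, 4)}   # a small path graph
sub = positive_sample(target, rng)           # a connected sub-path of the target
```

<p>Negative samples would instead expand successor states and reject those failing the (approximate) subgraph-matching check.</p>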
<h2 id="synthetic-benchmarks-on-colored-trees">Synthetic Benchmarks on Colored Trees</h2>
<p>GraSP is evaluated on increasingly complex synthetic tasks involving colored tree graphs:</p>
<ul>
<li><strong>Small trees (6-9 nodes)</strong>: Tasks with varying numbers of node colors (1, 3, 5) and edge colors (1, 3, 5). The model works well across all configurations, with simpler tasks (fewer colors) converging faster.</li>
<li><strong>Larger trees (10-15 nodes)</strong>: The same trends hold but convergence is slower due to increased structural complexity.</li>
<li><strong>Out-of-distribution generalization</strong>: Models trained on 6-9 node trees show zero-shot generalization to 10-node trees, indicating learned patterns are size-independent.</li>
</ul>
<h2 id="ocsr-evaluation-on-qm9">OCSR Evaluation on QM9</h2>
<p>For the real-world OCSR evaluation, GraSP is applied to QM9 molecular images (grayscale, no stereo-bonds) with a 10,000-molecule held-out test set:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OSRA</td>
          <td>45.61%</td>
      </tr>
      <tr>
          <td>GraSP</td>
          <td>67.51%</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>88.36%</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>92.08%</td>
      </tr>
  </tbody>
</table>
<p>GraSP does not match state-of-the-art OCSR tools, but the authors emphasize that the same model architecture and training procedure transfers directly from synthetic tree tasks to molecular graphs with no task-specific modifications. The only domain knowledge incorporated is a simple chemistry rule: not extending nodes that already have degree four.</p>
<p>The method highlights the practical advantage of decoupling decision from generation. Functional groups can be represented at different granularities (as single nodes to reduce trajectory depth, or expanded to reduce trajectory breadth) without changing the model.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/c72bcbf4/grasp">GraSP Code</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation with pre-trained models</td>
      </tr>
  </tbody>
</table>
<p>The repository includes pre-trained models and example trajectories for interactive exploration. Training requires 4 A100 GPUs with a 24h time budget. The QM9 dataset used for OCSR evaluation is publicly available. No license file is included in the repository.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li><strong>Finite type assumption</strong>: The current framework assumes a finite set of node and edge types, limiting applicability to open-vocabulary tasks like scene graph recognition</li>
<li><strong>Scaling to large graphs</strong>: For very large graphs, the branching factor of successor states becomes expensive. Learned filters to prune irrelevant successor states could help</li>
<li><strong>OCSR performance gap</strong>: While GraSP demonstrates transferability, it falls short of specialized OCSR tools that use domain-specific encodings (SMILES) or pixel-level supervision</li>
<li><strong>Modality extension</strong>: The framework could extend beyond images to other input modalities, such as vector embeddings of graphs</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eberhard, A., Neumann, G., &amp; Friederich, P. (2026). Graph Recognition via Subgraph Prediction. <em>arXiv preprint arXiv:2601.15133</em>. <a href="https://arxiv.org/abs/2601.15133">https://arxiv.org/abs/2601.15133</a></p>
<p><strong>Publication</strong>: arXiv 2026</p>
]]></content:encoded></item><item><title>GraphReco: Probabilistic Structure Recognition (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</guid><description>GraphReco is a rule-based OCSR system using Markov networks for probabilistic atom/bond ambiguity resolution during graph assembly.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, H., Yu, Y., &amp; Liu, J.-C. (2026). GraphReco: Probabilistic Structure Recognition for Chemical Molecules. <em>ChemistryOpen</em>, e202500537. <a href="https://doi.org/10.1002/open.202500537">https://doi.org/10.1002/open.202500537</a></p>
<p><strong>Publication</strong>: ChemistryOpen 2026 (Open Access)</p>
<h2 id="a-rule-based-ocsr-system-with-probabilistic-graph-assembly">A Rule-Based OCSR System with Probabilistic Graph Assembly</h2>
<p>GraphReco tackles a challenge that is rarely addressed explicitly in rule-based OCSR: the ambiguity that arises during graph assembly when lower-level component extraction results are imprecise. Small deviations in bond endpoint locations, false positive detections, and spatial proximity between elements all create uncertainty about which atoms and bonds should be connected, merged, or discarded.</p>
<p>The system introduces two main contributions:</p>
<ol>
<li><strong>Fragment Merging (FM) line detection</strong>: An adaptive three-stage algorithm for precise bond line identification across images of variable resolution</li>
<li><strong>Probabilistic ambiguity resolution</strong>: A Markov network that infers the most likely existence and merging state of atom and bond candidates</li>
</ol>
<h2 id="three-stage-pipeline">Three-Stage Pipeline</h2>
<p>GraphReco follows a three-stage workflow:</p>
<ol>
<li>
<p><strong>Component Extraction</strong>: Detects circles (aromatic bonds), bond lines (via the FM algorithm), and chemical symbols (via Tesseract OCR). Includes detection of solid wedge, dashed wedge, dashed line, and wavy bond styles. A semi-open-loop correction step resolves cases where symbols are misclassified as bonds and vice versa.</p>
</li>
<li>
<p><strong>Atom and Bond Ambiguity Resolution</strong>: Creates atom and bond candidates from detected components, builds a Markov network to infer their most probable states, and resolves candidates through existence and merging decisions.</p>
</li>
<li>
<p><strong>Graph Reconstruction</strong>: Assembles resolved atoms and bonds into a molecule graph, selects the largest connected component, and exports as MDL Molfile.</p>
</li>
</ol>
<h2 id="fragment-merging-line-detection">Fragment Merging Line Detection</h2>
<p>Classical Line Hough Transform (LHT) struggles with chemical structure images because bond lines suffer from pixelization, and algorithm parameters that work for one image resolution fail at others. The FM algorithm addresses this with three stages:</p>
<ol>
<li>
<p><strong>Fragment extraction</strong>: Apply LHT with fine-grained parameters (distance resolution $r = 2$, angle resolution $\theta = 2°$) to detect fine line fragments. Walk along detected theoretical lines to find actual black pixels and group them by connectivity.</p>
</li>
<li>
<p><strong>Fragment grouping</strong>: Pair fragments that share similar angles, are close in the perpendicular direction, and are either overlapping or connected by a path of black pixels.</p>
</li>
<li>
<p><strong>Fragment merging</strong>: Merge grouped fragments into single line segments using the two border pixels farthest from the centroid.</p>
</li>
</ol>
<p>The FM algorithm effectively handles the tradeoff that plagues standard LHT: coarse parameters miss short lines and produce overlaps, while fine parameters return many fragments shorter than actual bonds.</p>
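<p>Stages 2 and 3 can be approximated in a few lines (a sketch: the tolerances and greedy grouping are illustrative simplifications, the black-pixel connectivity test is omitted, and angle wrap-around near 0°/180° is ignored for brevity):</p>

```python
import math

def merge_fragments(fragments, angle_tol=5.0, perp_tol=2.0):
    """Greedy sketch of FM stages 2-3: group near-collinear line fragments,
    then merge each group into one segment spanning its extreme endpoints.
    Fragments are ((x1, y1), (x2, y2)) endpoint pairs from stage-1 fine LHT."""
    def angle(f):
        (x1, y1), (x2, y2) = f
        return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0

    groups = []
    for f in fragments:
        placed = False
        for grp in groups:
            (x1, y1), (x2, y2) = grp[0]                    # reference fragment
            if abs(angle(f) - angle(grp[0])) > angle_tol:  # ignores 0°/180° wrap
                continue
            mx = ((f[0][0] + f[1][0]) / 2, (f[0][1] + f[1][1]) / 2)
            dx, dy = x2 - x1, y2 - y1
            # perpendicular offset of f's midpoint from the reference line
            perp = abs(dy * (mx[0] - x1) - dx * (mx[1] - y1)) / math.hypot(dx, dy)
            if perp <= perp_tol:
                grp.append(f)
                placed = True
                break
        if not placed:
            groups.append([f])

    merged = []
    for grp in groups:
        pts = [p for f in grp for p in f]
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        # keep the two endpoints farthest from the group centroid
        pts.sort(key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2, reverse=True)
        merged.append((pts[0], pts[1]))
    return merged

frags = [((0, 0), (4, 0)), ((5, 0), (9, 0)), ((0, 0), (0, 5))]
print(merge_fragments(frags))  # the joined horizontal segment plus the vertical one
```
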
<h2 id="probabilistic-ambiguity-resolution-via-markov-network">Probabilistic Ambiguity Resolution via Markov Network</h2>
<p>After component extraction, GraphReco creates atom and bond candidates rather than directly assembling the graph. Each bond endpoint generates an atom candidate with a circular bounding area of radius:</p>
<p>$$r_b = \min(l_{\text{bond}}, l_{\text{med}}) / 4$$</p>
<p>where $l_{\text{bond}}$ is the bond length and $l_{\text{med}}$ is the median bond length.</p>
<p>A Markov network is constructed with four types of nodes:</p>
<ul>
<li><strong>Atom nodes</strong>: Boolean existence variables for each atom candidate</li>
<li><strong>Bond nodes</strong>: Boolean existence variables for each bond candidate</li>
<li><strong>Atom merge nodes</strong>: Boolean variables for pairs of overlapping atom candidates</li>
<li><strong>Bond merge nodes</strong>: Boolean variables for pairs of nearby bond candidates</li>
</ul>
<p>Potential functions encode rules about when candidates should exist or merge, with merging likelihood between two bond-ending atom candidates defined as a piecewise function of center distance $d$:</p>
<p>$$P(a_1, a_2) = \begin{cases} 0.9, &amp; \text{if } d \leq Q \\ 0.7 - 0.4(d - Q)/(R - Q), &amp; \text{if } Q &lt; d \leq R \\ 0.1, &amp; \text{if } d &gt; R \end{cases}$$</p>
<p>where $Q = \max(r_1, r_2)$ and $R = \min(1.5Q, r_1 + r_2)$. MAP inference determines the final state of all candidates.</p>
<h2 id="evaluation-results">Evaluation Results</h2>
<p>GraphReco is evaluated on USPTO benchmarks with InChI string comparison (stereochemistry removed):</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td><strong>94.2</strong></td>
          <td><strong>86.7</strong></td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>92.4</td>
          <td>70.3</td>
          <td>89.1</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>89.9</td>
          <td>63.0</td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>89.7</td>
          <td>63.9</td>
          <td>89.3</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>93.3</td>
          <td>82.8</td>
          <td><strong>91.5</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>35.4</td>
          <td>13.8</td>
          <td>25.2</td>
      </tr>
  </tbody>
</table>
<p>GraphReco outperforms all rule-based systems and most ML systems, with a particularly large margin on USPTO-10K-Abb (abbreviation-heavy molecules). MolGrapher achieves slightly higher accuracy on the USPTO dataset.</p>
<h3 id="robustness-on-perturbed-images">Robustness on Perturbed Images</h3>
<p>On USPTO-perturbed (rotation and shearing applied), rule-based methods degrade substantially:</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-perturbed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolGrapher</td>
          <td><strong>86.7</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>42.3</td>
      </tr>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td>40.6</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>30.7</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>6.4</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>5.1</td>
      </tr>
  </tbody>
</table>
<p>GraphReco performs better than other rule-based systems on perturbed inputs (40.6% vs. under 31%) thanks to its probabilistic assembly, but still falls far behind MolGrapher (86.7%), demonstrating the robustness advantage of learned approaches.</p>
<h2 id="ablation-study">Ablation Study</h2>
<p>Each component contributes substantially to overall performance on USPTO-10K:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full system</td>
          <td>94.2</td>
          <td>86.7</td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>Without FM line detection</td>
          <td>2.9</td>
          <td>5.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>Without atom candidates</td>
          <td>9.8</td>
          <td>0.4</td>
          <td>5.0</td>
      </tr>
      <tr>
          <td>Without bond candidates</td>
          <td>79.1</td>
          <td>75.8</td>
          <td>75.0</td>
      </tr>
      <tr>
          <td>Without Markov network</td>
          <td>88.2</td>
          <td>81.4</td>
          <td>84.2</td>
      </tr>
  </tbody>
</table>
<p>The FM algorithm and atom candidate mechanism are both critical (accuracy drops below 10% without either). Bond candidates provide a moderate improvement (~15 percentage points), and the Markov network adds ~6 points over hard-threshold alternatives.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Deterministic expert rules limit robustness on perturbed or noisy images, as evidenced by the large accuracy gap with MolGrapher on USPTO-perturbed</li>
<li>The system relies on Tesseract OCR for symbol recognition, which may struggle with unusual fonts or degraded image quality</li>
<li>Only handles single 2D molecule structures per image</li>
<li>Stereochemistry is removed during evaluation, so performance on stereo-bond recognition is not assessed</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>GraphReco is implemented in Python and relies on Tesseract OCR, OpenCV, and RDKit. The authors provided an online demo for testing, but it is no longer available, and neither source code nor a public repository has been released.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Online Demo</td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Google Cloud Run deployment (no longer available)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components for full reproduction:</strong></p>
<ul>
<li>Source code is not publicly available</li>
<li>No pre-built package or installable library</li>
<li>Hyperparameters for Markov network potential functions are given in the paper (Equations 8-11), but full implementation details are not released</li>
</ul>
<p><strong>Hardware/compute requirements:</strong> Not specified in the paper. The system uses classical computer vision (Hough transforms, thinning) and probabilistic inference (Markov networks), so GPU hardware is likely not required.</p>
]]></content:encoded></item><item><title>AdaptMol: Domain Adaptation for Molecular OCSR (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</guid><description>AdaptMol is an image-to-graph OCSR model using MMD-based domain adaptation and self-training for hand-drawn molecule recognition.</description><content:encoded><![CDATA[<h2 id="bridging-the-synthetic-to-real-gap-in-graph-based-ocsr">Bridging the Synthetic-to-Real Gap in Graph-Based OCSR</h2>
<p>Most OCSR methods are trained on synthetic molecular images and evaluated on high-quality literature figures, both exhibiting relatively uniform styles. Hand-drawn molecules represent a particularly challenging domain with irregular bond lengths, variable stroke widths, and inconsistent atom symbols. Prior graph reconstruction methods like MolScribe and MolGrapher drop below 15% accuracy on hand-drawn images, despite achieving over 65% on literature datasets.</p>
<p>AdaptMol addresses this with a three-stage pipeline that enables effective transfer from synthetic to real-world data without requiring graph annotations in the target domain:</p>
<ol>
<li><strong>Base model training</strong> on synthetic data with comprehensive augmentation and dual position representation</li>
<li><strong>MMD alignment</strong> of bond-level features between source and target domains</li>
<li><strong>Self-training</strong> with SMILES-validated pseudo-labels on unlabeled target images</li>
</ol>
<h2 id="end-to-end-graph-reconstruction-architecture">End-to-End Graph Reconstruction Architecture</h2>
<p>AdaptMol builds on MolScribe&rsquo;s architecture, using a Swin Transformer base encoder ($384 \times 384$ input) with a 6-layer Transformer decoder (8 heads, hidden dim 256). The model jointly predicts atoms and bonds:</p>
<p><strong>Atom prediction</strong> follows the Pix2Seq approach, autoregressively generating a sequence of atom tokens:</p>
<p>$$S_N = [l_1, x_1, y_1, l_2, x_2, y_2, \dots, l_n, x_n, y_n]$$</p>
<p>where $l_i$ is the atom label and $(x_i, y_i)$ are discretized coordinate bin indices.</p>
<p><strong>Dual position representation</strong> adds a 2D spatial heatmap on top of token-based coordinate prediction. The heatmap aggregates joint spatial distributions of all atoms:</p>
<p>$$\mathbf{H} = \text{Upsample}\left(\sum_{i=1}^{n} P_y^{(i)} \otimes P_x^{(i)}\right)$$</p>
<p>where $P_x^{(i)}$ and $P_y^{(i)}$ are coordinate probability distributions from the softmax logits. During training, this heatmap is supervised with Gaussian kernels at ground-truth atom positions. This reduces false positive atom predictions substantially (from 356 to 33 false positives at IoU 0.05).</p>
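<p>The heatmap aggregation can be sketched in NumPy (the bin count, atom count, and nearest-neighbour upsampling below are illustrative stand-ins for the model's exact configuration):</p>

```python
import numpy as np

def atom_heatmap(p_x, p_y, upsample=4):
    """Sum the joint spatial distribution P_y ⊗ P_x of every atom, then upsample.
    p_x, p_y: (n_atoms, bins) coordinate distributions from the softmax logits."""
    heat = np.einsum('ny,nx->yx', p_y, p_x)              # sum of outer products
    return np.kron(heat, np.ones((upsample, upsample)))  # nearest-neighbour upsample

softmax = lambda z: np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
rng = np.random.default_rng(0)
n_atoms, bins = 3, 8                                     # illustrative sizes
H = atom_heatmap(softmax(rng.standard_normal((n_atoms, bins))),
                 softmax(rng.standard_normal((n_atoms, bins))))
print(H.shape)  # (32, 32)
```

<p>During training this map would be compared against Gaussian kernels placed at the ground-truth atom positions.</p>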
<p><strong>Bond prediction</strong> extracts atom-level features from decoder hidden states and enriches them with encoder visual features via multi-head attention with a learnable residual weight $\alpha$:</p>
<p>$$\mathbf{F}_{\text{enriched}} = \text{LayerNorm}(\mathbf{F}_{\text{atom}} + \alpha \cdot \text{MHA}(\mathbf{F}_{\text{atom}}, \mathbf{E}_{\text{vis}}))$$</p>
<p>A feed-forward network then predicts bond types between all atom pairs.</p>
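<p>A minimal single-head sketch of the enrichment step (the model uses multi-head attention and a learned $\alpha$; the dimensions and $\alpha$ value here are illustrative):</p>

```python
import numpy as np

def attend(q, k, v):
    """Single-head scaled dot-product attention (stand-in for the model's MHA)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def enrich(f_atom, e_vis, alpha):
    """F_enriched = LayerNorm(F_atom + alpha * Attention(F_atom, E_vis))."""
    return layer_norm(f_atom + alpha * attend(f_atom, e_vis, e_vis))

rng = np.random.default_rng(0)
f_atom = rng.standard_normal((5, 32))   # atom features from decoder hidden states
e_vis = rng.standard_normal((9, 32))    # encoder visual features
out = enrich(f_atom, e_vis, alpha=0.1)  # alpha is learnable in the actual model
print(out.shape)  # (5, 32)
```
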
<h2 id="bond-level-domain-adaptation-via-mmd">Bond-Level Domain Adaptation via MMD</h2>
<p>The key insight is that bond features are domain-invariant: they encode structural relationships (single, double, triple, aromatic) independent of visual style. Atom-level alignment is problematic due to class imbalance (carbon dominates), multi-token spanning (functional groups), and position-dependent features.</p>
<p>AdaptMol aligns bond-level feature distributions via class-conditional Maximum Mean Discrepancy:</p>
<p>$$L_{\text{MMD}} = \frac{1}{|\mathcal{C}'|} \sum_{c \in \mathcal{C}'} \text{MMD}(F_c^{\text{src}}, F_c^{\text{tgt}})$$</p>
<p>where $\mathcal{C}'$ contains classes with sufficient samples in both domains. Confidence-based filtering retains only high-confidence predictions (confidence &gt; 0.95, entropy &lt; 0.1) for alignment, tightening to 0.98 and 0.05 after the first epoch. Progressive loss weighting follows a schedule of 0.1 (epoch 0), 0.075 (epoch 1), and 0.05 thereafter.</p>
<p>An important side effect: MMD alignment improves inter-class bond discrimination, reducing confusion between visually similar bond types (e.g., jagged double bonds vs. aromatic bonds).</p>
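<p>A sketch of the class-conditional MMD loss, assuming an RBF kernel (the paper's kernel choice, bandwidth, and feature dimensions may differ; the dictionary keys below are illustrative):</p>

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD with an RBF kernel; X: (n, d), Y: (m, d)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

def class_conditional_mmd(src, tgt, shared_classes):
    """L_MMD: average per-class MMD over bond classes present in both domains."""
    return sum(rbf_mmd2(src[c], tgt[c]) for c in shared_classes) / len(shared_classes)

rng = np.random.default_rng(0)
single = rng.standard_normal((64, 16))
src = {"single": single, "double": rng.standard_normal((64, 16))}
tgt = {"single": single.copy(), "double": rng.standard_normal((64, 16)) + 3.0}
# matched features give a smaller loss than shifted ones
print(class_conditional_mmd(src, tgt, ["single"]) < class_conditional_mmd(src, tgt, ["double"]))  # → True
```
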
<h2 id="self-training-with-smiles-validation">Self-Training with SMILES Validation</h2>
<p>After MMD alignment, the model generates predictions on unlabeled target images. Predicted molecular graphs are converted to SMILES and validated against ground-truth SMILES annotations. Only exact matches are retained as pseudo-labels, providing complete graph supervision (atom coordinates, element types, bond types) that was previously unavailable in the target domain.</p>
<p>This approach is far more data-efficient than alternatives: AdaptMol uses only 4,080 real hand-drawn images vs. DECIMER-Handdraw&rsquo;s 38 million synthetic hand-drawn images.</p>
<h2 id="comprehensive-data-augmentation">Comprehensive Data Augmentation</h2>
<p>Two categories of augmentation are applied during synthetic data generation:</p>
<ul>
<li><strong>Structure-rendering augmentation</strong>: Functional group abbreviation substitution, bond type conversions (single to wavy/aromatic, Kekule to aromatic rings), R-group insertion, and rendering parameter randomization (font family/size, bond width/spacing)</li>
<li><strong>Image-level augmentation</strong>: Geometric operations, quality degradation, layout variations, and chemical document artifacts (caption injection, arrows, marginal annotations)</li>
</ul>
<p>Structure-rendering augmentation provides the larger benefit, contributing ~20% accuracy improvement on JPO and ~30% on ACS benchmarks.</p>
<h2 id="results">Results</h2>
<h3 id="hand-drawn-molecule-recognition">Hand-Drawn Molecule Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>DECIMER test (Acc)</th>
          <th>ChemPix (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>AdaptMol</strong></td>
          <td><strong>82.6</strong></td>
          <td><strong>60.5</strong></td>
      </tr>
      <tr>
          <td>DECIMER v2.2</td>
          <td>71.9</td>
          <td>51.4</td>
      </tr>
      <tr>
          <td>AtomLenz</td>
          <td>30.0</td>
          <td>48.4</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>10.1</td>
          <td>26.1</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>10.7</td>
          <td>14.5</td>
      </tr>
  </tbody>
</table>
<h3 id="literature-and-synthetic-benchmarks">Literature and Synthetic Benchmarks</h3>
<p>AdaptMol achieves state-of-the-art on 4 of 6 literature benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>AdaptMol</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>DECIMER v2.2</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CLEF</td>
          <td><strong>92.7</strong></td>
          <td>87.5</td>
          <td>57.2</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td><strong>88.2</strong></td>
          <td>78.8</td>
          <td>73.0</td>
          <td>75.7</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td><strong>89.3</strong></td>
          <td>88.2</td>
          <td>85.1</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td><strong>75.5</strong></td>
          <td>72.8</td>
          <td>41.0</td>
          <td>37.7</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>90.9</td>
          <td><strong>92.6</strong></td>
          <td>74.9</td>
          <td>59.6</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>84.0</td>
          <td><strong>84.4</strong></td>
          <td>0.0</td>
          <td>66.3</td>
      </tr>
  </tbody>
</table>
<p>MolScribe edges out AdaptMol on USPTO and Staker. The authors attribute this to MolScribe training directly on all 680K USPTO samples, which may cause it to specialize to that distribution.</p>
<h3 id="pipeline-ablation">Pipeline Ablation</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Hand-drawn</th>
          <th>ChemDraw</th>
          <th>JPO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base model</td>
          <td>10.4</td>
          <td>92.3</td>
          <td>82.7</td>
      </tr>
      <tr>
          <td>+ Font augmentation</td>
          <td>30.2</td>
          <td>92.5</td>
          <td>82.8</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD</td>
          <td>42.1</td>
          <td>94.0</td>
          <td>83.0</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD + Self-training</td>
          <td><strong>82.6</strong></td>
          <td><strong>95.9</strong></td>
          <td><strong>88.2</strong></td>
      </tr>
  </tbody>
</table>
<p>Each component contributes meaningfully: font augmentation (+19.8), MMD alignment (+11.9), and self-training (+40.5) on hand-drawn accuracy.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fffh1/AdaptMol">AdaptMol Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">Model + Data</a></td>
          <td>Model/Dataset</td>
          <td>MIT</td>
          <td>Pretrained checkpoint and datasets</td>
      </tr>
  </tbody>
</table>
<p>Training uses 2 NVIDIA A100 GPUs (40GB each). Base model trains for 30 epochs on 1M synthetic samples. Domain adaptation involves 3 steps: USPTO self-training (3 iterations of 3 epochs), MMD alignment on hand-drawn data (5 epochs), and hand-drawn self-training (5 iterations).</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Sequence length constraints prevent accurate prediction of very large molecules (&gt;120 atoms), where resizing causes significant information loss</li>
<li>Cannot recognize Markush structures with repeating unit notation (parentheses/brackets), as synthetic training data lacks such cases</li>
<li>Stereochemistry information is lost when stereo bonds connect to abbreviated functional groups due to RDKit post-processing limitations</li>
<li>The retrained baseline (30 epochs from scratch on synthetic + pseudo-labels) achieves higher hand-drawn accuracy (87.2%) but at the cost of cross-domain robustness on literature benchmarks</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, F., He, E., &amp; Verspoor, K. (2026). AdaptMol: Domain Adaptation for Molecular Image Recognition with Limited Supervision. <em>Research Square preprint</em>. <a href="https://doi.org/10.21203/rs.3.rs-8365561/v1">https://doi.org/10.21203/rs.3.rs-8365561/v1</a></p>
<p><strong>Publication</strong>: Research Square preprint, February 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/fffh1/AdaptMol">GitHub</a></li>
<li><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">HuggingFace (model + data)</a></li>
</ul>
]]></content:encoded></item><item><title>OCSU: Optical Chemical Structure Understanding (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</guid><description>OCSU task for translating molecular images into multi-level descriptions. Introduces Vis-CheBI20 dataset and DoubleCheck/Mol-VL for molecular understanding.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, S., Xie, Y., Cai, B., Xie, A., Liu, G., Qiao, M., Xing, J., &amp; Nie, Z. (2025). OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery. <em>arXiv preprint arXiv:2501.15415</em>. <a href="https://doi.org/10.48550/arXiv.2501.15415">https://doi.org/10.48550/arXiv.2501.15415</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/PharMolix/OCSU">Code and Dataset (GitHub)</a></li>
</ul>
<h2 id="multi-level-chemical-understanding-method-and-resource">Multi-Level Chemical Understanding (Method and Resource)</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution.</p>
<ul>
<li><strong>Methodological</strong>: It proposes two novel architectures, <strong>DoubleCheck</strong> (an enhanced recognition model) and <strong>Mol-VL</strong> (an end-to-end vision-language model), to solve the newly formulated OCSU task.</li>
<li><strong>Resource</strong>: It constructs and releases <strong>Vis-CheBI20</strong>, the first large-scale dataset specifically designed for optical chemical structure understanding, containing 29.7K images and 117.7K image-text pairs.</li>
</ul>
<h2 id="the-motivation-for-ocsu-beyond-basic-graph-recognition">The Motivation for OCSU Beyond Basic Graph Recognition</h2>
<p>Existing methods for processing molecular images focus narrowly on <strong>Optical Chemical Structure Recognition (OCSR)</strong>, which translates an image solely into a machine-readable graph or SMILES string. However, SMILES strings are not chemist-friendly and lack high-level semantic context.</p>
<ul>
<li><strong>Gap</strong>: There is a lack of systems that can translate chemical diagrams into human-readable descriptions (e.g., functional groups, IUPAC names) alongside the graph structure.</li>
<li><strong>Goal</strong>: To enable <strong>Optical Chemical Structure Understanding (OCSU)</strong>, bridging the gap between visual representations and both machine/chemist-readable descriptions to support drug discovery and property prediction.</li>
</ul>
<h2 id="key-innovations-doublecheck-mol-vl-and-the-vis-chebi20-dataset">Key Innovations: DoubleCheck, Mol-VL, and the Vis-CheBI20 Dataset</h2>
<p>The paper introduces the <strong>OCSU task</strong>, enabling multi-level understanding (motif, molecule, and abstract levels). To solve this, it introduces two distinct paradigms:</p>
<ol>
<li><strong>DoubleCheck (OCSR-based)</strong>: An enhancement to standard OCSR models (like MolScribe) that performs a &ldquo;second look&rdquo; at locally ambiguous atoms. It uses attentive feature enhancement to fuse global molecular features with local features from ambiguous regions.</li>
<li><strong>Mol-VL (OCSR-free)</strong>: An end-to-end Vision-Language Model (VLM) based on Qwen2-VL. It uses multi-task learning to directly generate text descriptions from molecular images without an intermediate SMILES step.</li>
<li><strong>Vis-CheBI20 Dataset</strong>: A new benchmark specifically constructed for OCSU, deriving captions and functional group data from ChEBI-20 and PubChem.</li>
</ol>
<h2 id="methodology-and-experimental-evaluation">Methodology and Experimental Evaluation</h2>
<p>The authors evaluated both paradigms on <strong>Vis-CheBI20</strong> and existing benchmarks (USPTO, ACS) across four subtasks:</p>
<ol>
<li><strong>Functional Group Caption</strong>: Retrieval/F1 score evaluation.</li>
<li><strong>Molecule Description</strong>: Natural language generation metrics (BLEU, ROUGE, METEOR).</li>
<li><strong>IUPAC Naming</strong>: Text generation metrics (BLEU, ROUGE).</li>
<li><strong>SMILES Naming (OCSR)</strong>: Exact matching accuracy ($Acc_s$).</li>
</ol>
<p><strong>Baselines</strong>:</p>
<ul>
<li><strong>Task-Specific</strong>: MolScribe, MolVec, OSRA.</li>
<li><strong>LLM/VLM</strong>: Qwen2-VL, BioT5+, Mol-Instructions.</li>
<li><strong>Ablation</strong>: DoubleCheck vs. MolScribe backbone to test the &ldquo;feature enhancement&rdquo; mechanism.</li>
</ul>
<h2 id="results-and-conclusions-paradigm-trade-offs">Results and Conclusions: Paradigm Trade-Offs</h2>
<ul>
<li><strong>DoubleCheck Superiority</strong>: DoubleCheck outperformed MolScribe on OCSR tasks across all benchmarks. On USPTO, it achieved <strong>92.85%</strong> $Acc_s$ (vs. 92.57%), and on the ACS dataset it showed a <strong>+3.12%</strong> gain on chiral molecules. On Vis-CheBI20, DoubleCheck improved over MolScribe by an average of 2.27% across all metrics.</li>
<li><strong>Paradigm Trade-offs</strong>:
<ul>
<li><strong>Mol-VL (OCSR-free)</strong> excelled at semantic tasks like <strong>Functional Group Captioning</strong>, achieving <strong>97.32%</strong> F1 (vs. 93.63% for DoubleCheck &amp; RDKit and 89.60% for MolScribe &amp; RDKit). It benefits from end-to-end learning of structural context.</li>
<li><strong>DoubleCheck (OCSR-based)</strong> performed better on <strong>IUPAC naming recall</strong> and exact SMILES recovery, as explicit graph reconstruction is more precise for rigid nomenclature than VLM generation.</li>
</ul>
</li>
<li><strong>Conclusion</strong>: Enhancing submodules improves OCSR-based paradigms, while end-to-end VLMs offer stronger semantic understanding but struggle with exact syntax generation (SMILES/IUPAC).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Vis-CheBI20 Dataset</strong></p>
<ul>
<li><strong>Source</strong>: Derived from ChEBI-20 and PubChem.</li>
<li><strong>Size</strong>: 29,700 molecular diagrams, 117,700 image-text pairs.</li>
<li><strong>Generation</strong>: Images generated from SMILES using RDKit to simulate real-world journal/patent styles.</li>
<li><strong>Splits</strong> (vary by task, see table below):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Train Size</th>
          <th style="text-align: left">Test Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Functional Group</td>
          <td style="text-align: left">26,144</td>
          <td style="text-align: left">3,269</td>
      </tr>
      <tr>
          <td style="text-align: left">Description</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
      <tr>
          <td style="text-align: left">IUPAC Naming</td>
          <td style="text-align: left">26,200</td>
          <td style="text-align: left">2,680</td>
      </tr>
      <tr>
          <td style="text-align: left">SMILES Naming</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>DoubleCheck (Attentive Feature Enhancement)</strong></p>
<ol>
<li><strong>Ambiguity Detection</strong>: Uses atom prediction confidence to identify &ldquo;ambiguous atoms&rdquo;.</li>
<li><strong>Masking</strong>: Applies a 2D Gaussian mask to the image centered on the ambiguous atom.</li>
<li><strong>Local Encoding</strong>: A Swin-B encoder ($\Phi_l$) encodes the masked image region.</li>
<li><strong>Fusion</strong>: Aligns local features ($\mathcal{F}_l$) with global features ($\mathcal{F}_g$) using a 2-layer MLP and fuses them via weighted summation.</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{F}_e = \mathcal{F}_g + \text{MLP}(\mathcal{F}_g \oplus \hat{\mathcal{F}}_l) \cdot \hat{\mathcal{F}}_l
\end{aligned}
$$</p>
<ol start="5">
<li><strong>Two-Stage Training</strong>:
<ul>
<li>Stage 1: Train atom/bond predictors (30 epochs).</li>
<li>Stage 2: Train alignment/fusion modules with random Gaussian mask noise (10 epochs).</li>
</ul>
</li>
</ol>
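<p>The fusion step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the feature shapes, the toy 2-layer MLP, and the random weights are assumptions standing in for the Swin-B features and trained alignment modules.</p>

```python
import numpy as np

def mlp_2layer(x, W1, b1, W2, b2):
    """Toy 2-layer MLP: linear -> ReLU -> linear."""
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

def fuse_features(F_g, F_l_hat, params):
    """Attentive feature enhancement (sketch):
    F_e = F_g + MLP(F_g concat F_l_hat) * F_l_hat
    The MLP maps the concatenated global/local features to
    per-channel fusion weights for the aligned local features."""
    concat = np.concatenate([F_g, F_l_hat], axis=-1)  # (N, 2d)
    weights = mlp_2layer(concat, *params)             # (N, d)
    return F_g + weights * F_l_hat

# Toy example: N=4 feature tokens, d=8 channels
rng = np.random.default_rng(0)
d = 8
F_g = rng.normal(size=(4, d))   # global image features
F_l = rng.normal(size=(4, d))   # aligned local (masked-region) features
params = (rng.normal(size=(2 * d, d)), np.zeros(d),
          rng.normal(size=(d, d)), np.zeros(d))
F_e = fuse_features(F_g, F_l, params)
print(F_e.shape)  # (4, 8)
```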
<p><strong>Mol-VL (Multi-Task VLM)</strong></p>
<ul>
<li><strong>Prompting</strong>: System prompt: &ldquo;You are working as an excellent assistant in chemistry&hellip;&rdquo;</li>
<li><strong>Tokens</strong>: Uses <code>&lt;image&gt;</code> and <code>&lt;/image&gt;</code> special tokens.</li>
<li><strong>Auxiliary Task</strong>: Functional group recognition (identifying highlighted groups) added to training to improve context learning.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>DoubleCheck</strong>:
<ul>
<li><strong>Backbone</strong>: MolScribe architecture.</li>
<li><strong>Encoders</strong>: Swin-B for both global and local atom encoding.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>:
<ul>
<li><strong>Base Model</strong>: Qwen2-VL (2B and 7B versions).</li>
<li><strong>Vision Encoder</strong>: ViT with naive dynamic resolution and M-RoPE.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Metrics</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Exact Match Accuracy ($Acc_s$), Chiral Accuracy ($Acc_c$).</li>
<li><strong>Functional Groups</strong>: F1 Score (Information Retrieval task).</li>
<li><strong>Text Generation</strong>: BLEU-2/4, METEOR, ROUGE-L.</li>
</ul>
<p><strong>Selected Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left"><strong>92.85%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>MolScribe</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left">92.57%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mol-VL-7B</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left"><strong>97.32%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck &amp; RDKit</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left">93.63%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>DoubleCheck</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>4 days</strong>.
<ul>
<li>Max LR: 4e-4.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>10 days</strong>.
<ul>
<li>Max LR: 1e-5, 50 epochs.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/PharMolix/OCSU">PharMolix/OCSU (GitHub)</a></td>
          <td style="text-align: left">Code, Model, Dataset</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation, Mol-VL-7B weights, and Vis-CheBI20 dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>The long-tail distribution of functional groups in training data limits performance on uncommon chemical structures.</li>
<li>Mol-VL struggles with exact syntax generation (SMILES and IUPAC) compared to explicit graph-reconstruction approaches.</li>
<li>Vis-CheBI20 images are synthetically generated via RDKit, which may not fully capture the diversity of real-world journal and patent images.</li>
<li>The authors note that OCSU technologies should be restricted to research purposes, as downstream molecule discovery applications could potentially generate harmful molecules.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{fanOCSUOpticalChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSU}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fan, Siqi and Xie, Yuguang and Cai, Bowen and Xie, Ailin and Liu, Gaochao and Qiao, Mu and Xing, Jie and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GTR-CoT: Graph Traversal Chain-of-Thought for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</guid><description>GTR-VL uses graph traversal chain-of-thought and two-stage training to improve optical chemical structure recognition on printed and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, J., He, Y., Yang, H., Wu, J., Ge, L., Wei, X., Wang, Y., Li, L., Ao, H., Liu, C., Wang, B., Wu, L., &amp; He, C. (2025). GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition (arXiv:2506.07553). arXiv. <a href="https://doi.org/10.48550/arXiv.2506.07553">https://doi.org/10.48550/arXiv.2506.07553</a></p>
<p><strong>Publication</strong>: arXiv preprint (2025)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2506.07553">Paper on arXiv</a></li>
</ul>
<h2 id="contribution-vision-language-modeling-for-ocsr">Contribution: Vision-Language Modeling for OCSR</h2>
<p>This is a <strong>method paper</strong> that introduces GTR-VL, a Vision-Language Model for Optical Chemical Structure Recognition (OCSR). The work addresses the persistent challenge of converting molecular structure images into machine-readable formats, with a particular focus on handling chemical abbreviations that cause errors in existing systems.</p>
<h2 id="motivation-the-abbreviation-bottleneck">Motivation: The Abbreviation Bottleneck</h2>
<p>The motivation tackles a long-standing bottleneck in chemical informatics: most existing OCSR systems produce incorrect structures when they encounter abbreviated functional groups. When a chemist draws &ldquo;Ph&rdquo; for phenyl or &ldquo;Et&rdquo; for ethyl, current models fail because they have been trained on data where images contain abbreviations but the ground-truth labels contain fully expanded molecular graphs.</p>
<p>This creates a fundamental mismatch. The model sees &ldquo;Ph&rdquo; in the image but is told the &ldquo;correct&rdquo; answer is a full benzene ring. The supervision signal is inconsistent with what is actually visible.</p>
<p>Beyond this data problem, existing graph-parsing methods use a two-stage approach: predict all atoms first, then predict all bonds. This is inefficient and ignores the structural constraints that could help during prediction. The authors argue that mimicking how humans analyze molecular structures, following bonds from atom to atom in a connected traversal, would be more effective.</p>
<h2 id="novelty-graph-traversal-as-visual-chain-of-thought">Novelty: Graph Traversal as Visual Chain-of-Thought</h2>
<p>The novelty lies in combining two key insights about how to properly train and architect OCSR systems. The main contributions are:</p>
<ol>
<li>
<p><strong>Graph Traversal as Visual Chain of Thought</strong>: GTR-VL generates molecular graphs by traversing them sequentially, predicting an atom, then its connected bond, then the next atom, and so on. This mimics how a human chemist would trace through a structure and allows the model to use previously predicted atoms and bonds as context for subsequent predictions.</p>
<p>Formally, the model output sequence for image $I_m$ is generated as:</p>
<p>$$ R_m = \text{concat}(CoT_m, S_m) $$</p>
<p>where $CoT_m$ represents the deterministic graph traversal steps (atoms and bonds) and $S_m$ is the final SMILES representation. This intermediate reasoning step makes the model more interpretable and helps it learn the structural logic of molecules.</p>
</li>
<li>
<p><strong>&ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; Principle</strong>: This addresses the abbreviation problem head-on. The authors correct the ground-truth annotations to match what&rsquo;s actually visible in the image.</p>
<p>They treat abbreviations like &ldquo;Ph&rdquo; as single &ldquo;superatoms&rdquo; and build a pipeline to automatically detect and correct training data. Using OCR to extract visible text from molecular images, they replace the corresponding expanded substructures in the ground-truth with the appropriate abbreviation tokens. This ensures the supervision signal is consistent with the visual input.</p>
</li>
<li>
<p><strong>Large-Scale Dataset (GTR-1.3M)</strong>: To support this approach, the authors created a large-scale dataset combining 1M synthetic molecules from PubChem with 351K corrected real-world patent images from USPTO. The key innovation is the correction pipeline that identifies abbreviations in patent images and fixes the inconsistent ground-truth labels.</p>
</li>
<li>
<p><strong>GRPO for Hand-Drawn OCSR</strong>: Hand-drawn molecular data lacks fine-grained atom/bond coordinate annotations, making SFT-based graph parsing inapplicable. The authors use Group Relative Policy Optimization (GRPO) with a composite reward function that combines format, SMILES, and graph-level rewards. The graph reward computes the maximum common subgraph (MCS) between predicted and ground-truth molecular graphs:</p>
<p>$$ R_{\text{graph}} = \frac{|N_m^a|}{|N_g^a| + |N_p^a|} + \frac{|N_m^b|}{|N_g^b| + |N_p^b|} $$</p>
<p>where $N_m^a$, $N_g^a$, $N_p^a$ are atom counts in the MCS, ground truth, and prediction, and $N_m^b$, $N_g^b$, $N_p^b$ are the corresponding bond counts.</p>
</li>
<li>
<p><strong>Two-Stage Training</strong>: Stage 1 performs SFT on GTR-1.3M for printed molecule recognition. Stage 2 applies GRPO on a mixture of printed data (GTR-USPTO-4K) and hand-drawn data (DECIMER-HD-Train, 4,070 samples) to extend capabilities to hand-drawn structures.</p>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: Traditional SMILES-based evaluation fails for molecules with abbreviations because canonicalization breaks down. The authors created a new benchmark that evaluates graph structure directly, providing three metrics: direct SMILES generation, graph-derived SMILES, and exact graph matching.</p>
</li>
</ol>
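<p>The graph reward can be sketched with label multisets standing in for the maximum-common-subgraph computation. This is a deliberate simplification for illustration: exact MCS is expensive and is handled by cheminformatics tooling in practice, whereas multiset intersection ignores connectivity.</p>

```python
from collections import Counter

def graph_reward(pred_atoms, pred_bonds, gt_atoms, gt_bonds):
    """Graph-level reward (sketch):
    R = |N_m^a| / (|N_g^a| + |N_p^a|) + |N_m^b| / (|N_g^b| + |N_p^b|)
    The common-subgraph counts are approximated here by multiset
    intersection of atom and bond labels."""
    common_a = sum((Counter(pred_atoms) & Counter(gt_atoms)).values())
    common_b = sum((Counter(pred_bonds) & Counter(gt_bonds)).values())
    r_a = common_a / (len(gt_atoms) + len(pred_atoms)) if (gt_atoms or pred_atoms) else 0.0
    r_b = common_b / (len(gt_bonds) + len(pred_bonds)) if (gt_bonds or pred_bonds) else 0.0
    return r_a + r_b

# A perfect prediction scores 0.5 + 0.5 = 1.0
atoms = ["C", "C", "O"]
bonds = [("C", "C", 1), ("C", "O", 1)]
print(graph_reward(atoms, bonds, atoms, bonds))  # 1.0
```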
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The evaluation focused on demonstrating that GTR-VL&rsquo;s design principles solve real problems that plague existing OCSR systems:</p>
<ol>
<li>
<p><strong>Comprehensive Baseline Comparison</strong>: GTR-VL was tested against three categories of models:</p>
<ul>
<li><strong>Specialist OCSR systems</strong>: MolScribe and MolNexTR</li>
<li><strong>Chemistry-focused VLMs</strong>: ChemVLM, ChemDFM-X, OCSU</li>
<li><strong>General-purpose VLMs</strong>: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: The new benchmark includes two subsets of patent images:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 standard patent images similar to existing benchmarks</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<p>This design directly tests whether models can handle the abbreviation problem that breaks existing systems.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Systematic experiments isolated the contribution of key design choices:</p>
<ul>
<li><strong>Chain-of-Thought vs. Direct</strong>: Comparing graph traversal CoT against direct SMILES prediction</li>
<li><strong>Traversal Strategy</strong>: Graph traversal vs. the traditional &ldquo;atoms-then-bonds&rdquo; approach</li>
<li><strong>Dataset Quality</strong>: Training on corrected vs. uncorrected data</li>
</ul>
</li>
<li>
<p><strong>Retraining Experiments</strong>: Existing specialist models (MolScribe, MolNexTR) were retrained from scratch on the corrected GTR-1.3M dataset to isolate the effect of data quality from architectural improvements.</p>
</li>
<li>
<p><strong>Hand-Drawn OCSR Evaluation</strong>: GTR-VL was also evaluated on the DECIMER Hand-drawn test set and ChemPix dataset, comparing against DECIMER and AtomLenz+EditKT baselines.</p>
</li>
<li>
<p><strong>Qualitative Analysis</strong>: Visual inspection of predictions on challenging cases with heavy abbreviation usage, complex structures, and edge cases to understand failure modes.</p>
</li>
</ol>
<h2 id="results--conclusions-resolving-the-abbreviation-bottleneck">Results &amp; Conclusions: Resolving the Abbreviation Bottleneck</h2>
<ul>
<li>
<p><strong>Performance Gains on Abbreviations</strong>: On MolRec-Abb, GTR-VL-Stage1 achieves 85.49% Graph accuracy compared to around 20% for MolScribe and MolNexTR with their original checkpoints. On MolRec-USPTO, GTR-VL-Stage1 reaches 93.45% Graph accuracy. Existing specialist models see their accuracy drop below 20% on MolRec-Abb when abbreviations are present.</p>
</li>
<li>
<p><strong>Data Correction is Critical</strong>: When MolScribe and MolNexTR were retrained on GTR-1.3M, their MolRec-Abb Graph accuracy jumped from around 20% to 70.60% and 71.85% respectively. GTR-VL-Stage1 still outperformed these retrained baselines at 85.49%, confirming that both data correction and the graph traversal approach contribute.</p>
</li>
<li>
<p><strong>Chain-of-Thought Helps</strong>: Ablation on GTR-USPTO-351K shows that CoT yields 68.85% Gen-SMILES vs. 66.54% without CoT, a 2.31 percentage point improvement.</p>
</li>
<li>
<p><strong>Graph Traversal Beats Traditional Parsing</strong>: Graph traversal achieves 83.26% Graph accuracy vs. 80.15% for the atoms-then-bonds approach, and 81.88% vs. 79.02% on Gra-SMILES.</p>
</li>
<li>
<p><strong>General VLMs Still Struggle</strong>: General-purpose VLMs like GPT-4o scored near 0% on MolRec-Bench across all metrics, highlighting the importance of domain-specific training for OCSR.</p>
</li>
<li>
<p><strong>Hand-Drawn Recognition via GRPO</strong>: GTR-VL-Stage1 (SFT only) achieves only 9.53% Graph accuracy on DECIMER-HD-Test, but after GRPO training in Stage 2, performance jumps to 75.44%. On ChemPix, Graph accuracy rises from 22.02% to 86.13%. The graph reward is essential: GRPO without graph supervision achieves only 11.00% SMILES accuracy on DECIMER-HD-Test, while adding the graph reward reaches 75.64%.</p>
</li>
<li>
<p><strong>Evaluation Methodology Matters</strong>: The new graph-based evaluation metrics revealed problems with traditional SMILES-based evaluation that previous work had missed. Many &ldquo;failures&rdquo; in existing benchmarks were actually correct graph predictions that got marked wrong due to canonicalization issues with abbreviations.</p>
</li>
</ul>
<p>The work establishes that addressing the abbreviation problem requires both correcting the training data and rethinking the model architecture. The combination of faithful data annotation and sequential graph generation improves OCSR performance on molecules with abbreviations by a large margin over previous methods.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Base Model</strong>: GTR-VL fine-tunes <strong>Qwen2.5-VL</strong>.</p>
<p><strong>Input/Output Mechanism</strong>:</p>
<ul>
<li><strong>Input</strong>: The model takes an image $I_m$ and a text prompt</li>
<li><strong>Output</strong>: The model generates $R_m = \text{concat}(CoT_m, S_m)$, where it first produces the Chain-of-Thought (the graph traversal steps) followed immediately by the final SMILES string</li>
<li><strong>Traversal Strategy</strong>: Uses <strong>depth-first traversal</strong> to alternately predict atoms and bonds</li>
</ul>
<p><strong>Prompt Structure</strong>: The model is prompted to &ldquo;list the types of atomic elements&hellip; the coordinates&hellip; and the chemical bonds&hellip; then&hellip; output a canonical SMILES&rdquo;. The CoT output is formatted as a JSON list of atoms (with coordinates) and bonds (with indices referring to previous atoms), interleaved.</p>
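<p>The interleaved atom/bond sequence can be sketched as a depth-first walk that emits an atom, then the bond to the next unvisited neighbor, and so on. This is an illustrative sketch only; the tuple format below is hypothetical and not the paper's exact JSON schema.</p>

```python
def traverse_cot(atoms, bonds):
    """Depth-first traversal emitting an interleaved atom/bond sequence.
    atoms: list of (symbol, (x, y)); bonds: dict {(i, j): bond_type}.
    Each emitted bond refers back to already-emitted atom indices, so
    every prediction step can condition on the partial graph so far."""
    adj = {}
    for (i, j), t in bonds.items():
        adj.setdefault(i, []).append((j, t))
        adj.setdefault(j, []).append((i, t))
    seq, visited = [], set()

    def dfs(i):
        visited.add(i)
        symbol, coords = atoms[i]
        seq.append(("atom", i, symbol, coords))
        for j, t in sorted(adj.get(i, [])):
            if j not in visited:
                seq.append(("bond", i, j, t))
                dfs(j)

    dfs(0)
    return seq

# Toy 3-atom chain C-O-N: atoms alternate with the bonds that connect them
atoms = [("C", (0, 0)), ("O", (1, 0)), ("N", (2, 0))]
bonds = {(0, 1): "single", (1, 2): "single"}
for step in traverse_cot(atoms, bonds):
    print(step)
```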
<h3 id="data">Data</h3>
<p><strong>Training Dataset (GTR-1.3M)</strong>:</p>
<ul>
<li><strong>Synthetic Component</strong>: 1 million molecular SMILES from PubChem, converted to images using Indigo</li>
<li><strong>Real Component</strong>: 351,000 samples from USPTO patents (filtered from an original 680,000)
<ul>
<li>Processed using an OCR pipeline to detect abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Et&rdquo;)</li>
<li>Ground truth expanded structures replaced with superatoms to match visible abbreviations in images</li>
<li>This &ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; correction ensures training supervision matches visual input</li>
</ul>
</li>
</ul>
<p><strong>Evaluation Dataset (MolRec-Bench)</strong>:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 molecular images from USPTO patents</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Graph Traversal Algorithm</strong>:</p>
<ul>
<li>Depth-first traversal strategy</li>
<li>Alternating atom-bond prediction sequence</li>
<li>Each step uses previously predicted atoms and bonds as context</li>
</ul>
<p><strong>Two-Stage Training</strong>:</p>
<ul>
<li><strong>Stage 1 (SFT)</strong>: Train on GTR-1.3M to learn visual CoT mechanism for printed molecules (produces GTR-VL-Stage1)</li>
<li><strong>Stage 2 (GRPO)</strong>: Apply GRPO on GTR-USPTO-4K + DECIMER-HD-Train (4,070 samples) for hand-drawn recognition (produces GTR-VL-Stage2, i.e., GTR-VL)</li>
</ul>
<p><strong>Training Procedure</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate (SFT)</strong>: Peak learning rate of $1.6 \times 10^{-4}$ with cosine decay</li>
<li><strong>Learning Rate (GRPO)</strong>: Peak learning rate of $1 \times 10^{-5}$ with cosine decay</li>
<li><strong>Warm-up</strong>: Linear warm-up for the first 10% of iterations</li>
<li><strong>Batch Size (SFT)</strong>: 2 per GPU with gradient accumulation over 16 steps, yielding <strong>effective batch size of 1024</strong></li>
<li><strong>Batch Size (GRPO)</strong>: 4 per GPU with gradient accumulation of 1, yielding <strong>effective batch size of 128</strong></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong> (three complementary measures to handle abbreviation issues):</p>
<ul>
<li><strong>Gen-SMILES</strong>: Exact match ratio of SMILES strings directly generated by the VLM (image-captioning style)</li>
<li><strong>Gra-SMILES</strong>: Exact match ratio of SMILES strings derived from the predicted graph structure (graph-parsing style)</li>
<li><strong>Graph</strong>: Exact match ratio between ground truth and predicted graphs (node/edge comparison, bypassing SMILES canonicalization issues)</li>
</ul>
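<p>A minimal sketch of the Graph metric, under the simplifying assumption that predicted and ground-truth graphs share an atom indexing (real evaluation must compare graphs up to isomorphism):</p>

```python
def graph_exact_match(pred, gt):
    """Graph metric (sketch): exact match of node and edge sets,
    bypassing SMILES canonicalization.  A graph is (atoms, bonds):
    atoms maps index -> symbol; bonds is a set of (i, j, order), i < j."""
    pred_atoms, pred_bonds = pred
    gt_atoms, gt_bonds = gt
    return pred_atoms == gt_atoms and pred_bonds == gt_bonds

# Toy C-C-O chain; a single wrong atom label breaks the match
gt = ({0: "C", 1: "C", 2: "O"}, {(0, 1, 1), (1, 2, 1)})
pred_ok = ({0: "C", 1: "C", 2: "O"}, {(0, 1, 1), (1, 2, 1)})
pred_bad = ({0: "C", 1: "N", 2: "O"}, {(0, 1, 1), (1, 2, 1)})
print(graph_exact_match(pred_ok, gt), graph_exact_match(pred_bad, gt))  # True False
```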
<p><strong>Baselines Compared</strong>:</p>
<ul>
<li>Specialist OCSR systems: MolScribe, MolNexTR</li>
<li>Chemistry-focused VLMs: ChemVLM, ChemDFM-X, OCSU</li>
<li>General-purpose VLMs: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute</strong>: Training performed on <strong>32 NVIDIA A100 GPUs</strong></p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Status</strong>: Closed. As of the paper&rsquo;s publication, no source code, pre-trained model weights, or dataset downloads (GTR-1.3M, MolRec-Bench) have been publicly released. The paper does not mention plans for open-source release. The training data pipeline relies on PubChem SMILES (public), USPTO patent images (publicly available through prior work), the Indigo rendering tool (open-source), and an unspecified OCR system for abbreviation detection. Without the released code and data corrections, reproducing the full pipeline would require substantial re-implementation effort.</p>
]]></content:encoded></item><item><title>OCSAug: Diffusion-Based Augmentation for Hand-Drawn OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</guid><description>A diffusion-based data augmentation pipeline (OCSAug) using DDPM and RePaint to improve optical chemical structure recognition on hand-drawn images.</description><content:encoded><![CDATA[<h2 id="document-taxonomy-ocsaug-as-a-novel-method">Document Taxonomy: OCSAug as a Novel Method</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>. It proposes a novel data augmentation pipeline (<strong>OCSAug</strong>) that integrates Denoising Diffusion Probabilistic Models (DDPM) and the RePaint algorithm to address the data scarcity problem in hand-drawn optical chemical structure recognition (OCSR). The contribution is validated through systematic benchmarking against existing augmentation techniques (RDKit, Randepict) and ablation studies on mask design.</p>
<h2 id="expanding-hand-drawn-training-data-for-ocsr">Expanding Hand-Drawn Training Data for OCSR</h2>
<p>A vast amount of molecular structure data exists in analog formats, such as hand-drawn diagrams in research notes or older literature. While OCSR models perform well on digitally rendered images, they struggle with hand-drawn images due to noise, varying handwriting styles, and distortions. Current datasets for hand-drawn images (e.g., DECIMER) are too small to train effective models, and existing augmentation tools (RDKit, Randepict) fail to generate sufficiently realistic hand-drawn variations.</p>
<h2 id="ocsaug-pipeline-masked-repaint-via-generative-ai">OCSAug Pipeline: Masked RePaint via Generative AI</h2>
<p>The core novelty is <strong>OCSAug</strong>, a three-phase pipeline that uses generative AI to synthesize training data:</p>
<ol>
<li><strong>DDPM + RePaint</strong>: It utilizes a DDPM to learn the distribution of hand-drawn images and the RePaint algorithm for inpainting.</li>
<li><strong>Structural Masking</strong>: It introduces <strong>vertical and horizontal stripe pattern masks</strong>. These masks selectively obscure parts of atoms or bonds, forcing the diffusion model to reconstruct them with irregular &ldquo;hand-drawn&rdquo; styles while preserving the underlying chemical topology.</li>
<li><strong>Label Transfer</strong>: Because the chemical structure is preserved during inpainting, the SMILES label from the original image is directly transferred to the augmented image, bypassing the need for re-annotation.</li>
</ol>
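<p>The stripe masks can be sketched as binary arrays over the image. The stripe width and spacing below are illustrative assumptions; the paper ablates mask thickness (quantified via RMSE) rather than fixing these values.</p>

```python
import numpy as np

def stripe_mask(h, w, stripe_width=4, spacing=16, vertical=True):
    """Binary stripe mask (sketch): 1 = keep pixel, 0 = region the
    diffusion model repaints via RePaint.  Periodic stripes cut across
    bonds/atoms so they are redrawn in an irregular hand-drawn style
    while the overall chemical topology stays intact."""
    mask = np.ones((h, w), dtype=np.uint8)
    limit = w if vertical else h
    for start in range(0, limit, spacing):
        if vertical:
            mask[:, start:start + stripe_width] = 0
        else:
            mask[start:start + stripe_width, :] = 0
    return mask

m = stripe_mask(64, 64)
print(m.shape, int(m.min()), int(m.max()))  # (64, 64) 0 1
```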
<h2 id="benchmarking-diffusion-augmentations-on-decimer">Benchmarking Diffusion Augmentations on DECIMER</h2>
<p>The authors evaluated OCSAug using the <strong>DECIMER dataset</strong>, specifically a &ldquo;drug-likeness&rdquo; subset filtered by Lipinski&rsquo;s and Veber&rsquo;s rules.</p>
<ul>
<li><strong>Baselines</strong>: The method was compared against <strong>RDKit</strong> (digital generation) and <strong>Randepict</strong> (rule-based augmentation).</li>
<li><strong>Models</strong>: Four recent OCSR models were fine-tuned: <strong>MolScribe</strong>, <strong>DECIMER 1.0 (I2S)</strong>, <strong>MolNexTR</strong>, and <strong>MPOCSR</strong>.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Tanimoto Similarity</strong>: To measure prediction accuracy against ground truth.</li>
<li><strong>Fréchet Inception Distance (FID)</strong>: To measure the distributional similarity between generated and real hand-drawn images.</li>
<li><strong>RMSE</strong>: To quantify pixel-level structural preservation across different mask thicknesses.</li>
</ul>
</li>
</ul>
<h2 id="improved-generalization-capabilities-and-fid-scores">Improved Generalization Capabilities and FID Scores</h2>
<ul>
<li><strong>Performance Boost</strong>: OCSAug improved recognition accuracy by an Improvement Ratio (fine-tuned vs. non-fine-tuned Tanimoto similarity) of <strong>1.918-3.820x</strong>, outperforming traditional augmentation techniques such as RDKit and Randepict (1.570-3.523x).</li>
<li><strong>Data Quality</strong>: OCSAug achieved the lowest FID score (0.471) compared to Randepict (4.054) and RDKit (10.581), indicating its generated images are much closer to the real hand-drawn distribution.</li>
<li><strong>Resolution Mixing</strong>: Training MolScribe and MolNexTR with a mix of $128 \times 128$, $256 \times 256$, and $512 \times 512$ resolution images improved Tanimoto similarity (e.g., MolScribe from 0.585 to 0.640), though this strategy did not help I2S or MPOCSR.</li>
<li><strong>Real-World Evaluation</strong>: On a newly collected dataset of 463 hand-drawn images from 6 volunteers (88 drug compounds), the MPOCSR model fine-tuned with OCSAug achieved 0.367 exact-match accuracy (Tanimoto = 1.0), versus 0.365 for non-augmented fine-tuning and 0.037 without fine-tuning. The gap was more pronounced in the area under the accuracy curve, indicating fewer severe misrecognitions.</li>
<li><strong>Limitations</strong>: The generation process is slow (3 weeks for 10k images on a single GPU). The fixed stripe masks may struggle with highly complex, non-drug-like geometries: when evaluated on the full DECIMER dataset (without drug-likeness filtering), OCSAug did not yield uniform improvements across all models.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jjjabcd/OCSAug">OCSAug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using guided-diffusion and RePaint</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6456306">DECIMER Hand-Drawn Dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY 4.0</td>
          <td>5,088 hand-drawn molecular structure images from 24 individuals</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: DECIMER dataset (hand-drawn images).</li>
<li><strong>Filtering</strong>: A &ldquo;drug-likeness&rdquo; filter was applied (Lipinski&rsquo;s rule of 5 + Veber&rsquo;s rules) along with an atom filter (C, H, O, S, F, Cl, Br, N, P only).</li>
<li><strong>Final Size</strong>: 3,194 samples, split into:
<ul>
<li><strong>Training</strong>: 2,604 samples.</li>
<li><strong>Validation</strong>: 290 samples.</li>
<li><strong>Test</strong>: 300 samples.</li>
</ul>
</li>
<li><strong>Resolution</strong>: All images resized to $256 \times 256$ pixels.</li>
</ul>
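<p>The drug-likeness filter can be sketched as follows. The thresholds are the standard Lipinski/Veber cutoffs; in practice the descriptor values (MW, logP, H-bond donors/acceptors, rotatable bonds, TPSA) would come from RDKit rather than being passed in by hand, so this function only applies the cutoffs:</p>

```python
ALLOWED_ELEMENTS = {"C", "H", "O", "S", "F", "Cl", "Br", "N", "P"}

def passes_druglikeness(mw, logp, hbd, hba, rot_bonds, tpsa, elements):
    """Lipinski's rule of 5 + Veber's rules + the paper's element whitelist.

    Descriptor values are assumed precomputed (e.g., with RDKit's
    Descriptors module).
    """
    lipinski = mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10
    veber = rot_bonds <= 10 and tpsa <= 140
    return lipinski and veber and set(elements) <= ALLOWED_ELEMENTS
```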
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DDPM implemented using <code>guided-diffusion</code>.</li>
<li><strong>RePaint Settings</strong>:
<ul>
<li>Total time steps: 250.</li>
<li>Jump length: 10.</li>
<li>Resampling counts: 10.</li>
</ul>
</li>
<li><strong>Masking Strategy</strong>:
<ul>
<li><strong>Vertical Stripes</strong>: Obscure atom symbols to vary handwriting style.</li>
<li><strong>Horizontal Stripes</strong>: Obscure bonds to vary length/thickness/alignment.</li>
<li><strong>Optimal Thickness</strong>: A stripe thickness of <strong>4 pixels</strong> was found to be optimal for balancing diversity and structural preservation.</li>
</ul>
</li>
</ul>
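<p>A minimal sketch of the stripe masks (function and parameter names are mine; the paper specifies the 4-pixel thickness, while the stripe spacing here is an assumed placeholder). A value of 1 marks pixels to keep and 0 marks pixels for the diffusion model to repaint:</p>

```python
def stripe_mask(height, width, thickness=4, spacing=16, orientation="vertical"):
    """Binary stripe mask: 1 = keep pixel, 0 = inpaint with RePaint.

    Vertical stripes cut across atom symbols, horizontal stripes across
    bonds, so the model redraws those slices in a hand-drawn style while
    the untouched pixels pin down the chemical topology.
    """
    mask = [[1] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            pos = x if orientation == "vertical" else y
            if pos % spacing < thickness:
                mask[y][x] = 0
    return mask
```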
<h3 id="models">Models</h3>
<p>The OCSR models were pretrained on PubChem (digital images) and then fine-tuned on the OCSAug-augmented dataset.</p>
<ul>
<li><strong>MolScribe</strong>: Swin Transformer encoder, Transformer decoder. Fine-tuned (all layers) for 30 epochs, batch size 16-128, LR 2e-5.</li>
<li><strong>I2S (DECIMER 1.0)</strong>: Inception V3 encoder (frozen), FC/Decoder fine-tuned. 25 epochs, batch size 64, LR 1e-5.</li>
<li><strong>MolNexTR</strong>: Dual-stream encoder (Swin + CNN). Fine-tuned (all layers) for 30 epochs, batch size 16-64, LR 2e-5.</li>
<li><strong>MPOCSR</strong>: MPViT backbone. Fine-tuned (all layers) for 25 epochs, batch size 16-32, LR 4e-5.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>
<p><strong>Metric</strong>: Improvement Ratio (IR) of Tanimoto Similarity (TS), computed per model as:</p>
<p>$$
\text{IR} = \frac{\text{TS}_{\text{finetuned}}}{\text{TS}_{\text{non-finetuned}}}
$$</p>
</li>
<li>
<p><strong>Validation</strong>: Cross-validation on the split DECIMER dataset.</p>
</li>
</ul>
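<p>A minimal sketch of the metric, assuming fingerprints are represented as sets of on-bits (in practice these would be RDKit Morgan fingerprints computed from the predicted and ground-truth SMILES):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints (sets of on-bits)."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def improvement_ratio(preds_finetuned, preds_baseline, references):
    """IR = mean TS of the fine-tuned model / mean TS of the non-fine-tuned model."""
    def mean_ts(preds):
        return sum(tanimoto(p, r) for p, r in zip(preds, references)) / len(references)
    return mean_ts(preds_finetuned) / mean_ts(preds_baseline)
```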
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: NVIDIA GeForce RTX 4090.</li>
<li><strong>Training Time</strong>: DDPM training took ~6 days.</li>
<li><strong>Generation Time</strong>: Generating 2,600 augmented images took ~70 hours.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, J. H., &amp; Choi, J. (2025). OCSAug: diffusion-based optical chemical structure data augmentation for improved hand-drawn chemical structure image recognition. <em>The Journal of Supercomputing</em>, 81(8), 926.</p>
<p><strong>Publication</strong>: The Journal of Supercomputing 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/jjjabcd/OCSAug">Official Repository</a></li>
<li><a href="https://zenodo.org/records/6456306">DECIMER Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimOCSAugDiffusionbasedOptical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSAug: Diffusion-Based Optical Chemical Structure Data Augmentation for Improved Hand-Drawn Chemical Structure Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSAug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kim, Jin Hyuk and Choi, Jonghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Supercomputing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{81}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{926}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11227-025-07406-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image-to-Sequence OCSR: A Comparative Analysis</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/image-to-sequence-comparison/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/image-to-sequence-comparison/</guid><description>Comparative analysis of image-to-sequence OCSR methods across architecture, output format, training data, and compute requirements.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This note provides a comparative analysis of image-to-sequence methods for Optical Chemical Structure Recognition (OCSR). These methods treat molecular structure recognition as an image captioning task, using encoder-decoder architectures to generate sequential molecular representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) directly from pixels.</p>
<p>For the full taxonomy of OCSR approaches including image-to-graph and rule-based methods, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR Methods taxonomy</a>.</p>
<h2 id="architectural-evolution-2019-2025">Architectural Evolution (2019-2025)</h2>
<p>The field has undergone rapid architectural evolution, with clear generational shifts in both encoder and decoder design.</p>
<h3 id="timeline">Timeline</h3>
<table>
  <thead>
      <tr>
          <th>Era</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Representative Methods</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2019-2020</strong></td>
          <td>CNN (Inception V3, ResNet)</td>
          <td>LSTM/GRU with Attention</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/">Staker et al.</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a></td>
      </tr>
      <tr>
          <td><strong>2021</strong></td>
          <td>EfficientNet, ViT</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/">Img2Mol</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI</a></td>
      </tr>
      <tr>
          <td><strong>2022</strong></td>
          <td>Swin Transformer, ResNet</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/">SwinOCSR</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/">MICER</a></td>
      </tr>
      <tr>
          <td><strong>2023-2024</strong></td>
          <td>EfficientNetV2, SwinV2</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/">DECIMER.ai</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/">MMSSC-Net</a></td>
      </tr>
      <tr>
          <td><strong>2025</strong></td>
          <td>EfficientViT, VLMs (Qwen2-VL)</td>
          <td>LLM decoders, RL fine-tuning</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/">MolSight</a>, <a href="/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/">GTR-CoT</a>, <a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU</a></td>
      </tr>
  </tbody>
</table>
<h3 id="encoder-architectures">Encoder Architectures</h3>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Methods Using It</th>
          <th>Key Characteristics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Inception V3</strong></td>
          <td>DECIMER (2020)</td>
          <td>Early CNN approach, 299x299 input</td>
      </tr>
      <tr>
          <td><strong>ResNet-50/101</strong></td>
          <td>IMG2SMI, Image2SMILES, MICER, DGAT</td>
          <td>Strong baseline, well-understood</td>
      </tr>
      <tr>
          <td><strong>EfficientNet-B3</strong></td>
          <td>DECIMER 1.0</td>
          <td>Efficient scaling, compound coefficients</td>
      </tr>
      <tr>
          <td><strong>EfficientNet-V2-M</strong></td>
          <td>DECIMER.ai, DECIMER-Hand-Drawn</td>
          <td>Improved training efficiency</td>
      </tr>
      <tr>
          <td><strong>EfficientViT-L1</strong></td>
          <td>MolSight</td>
          <td>Optimized for deployment</td>
      </tr>
      <tr>
          <td><strong>Swin Transformer</strong></td>
          <td>SwinOCSR, MolParser</td>
          <td>Hierarchical vision transformer</td>
      </tr>
      <tr>
          <td><strong>SwinV2</strong></td>
          <td>MMSSC-Net, Image2InChI</td>
          <td>Improved training stability</td>
      </tr>
      <tr>
          <td><strong>Vision Transformer (ViT)</strong></td>
          <td>ViT-InChI</td>
          <td>Pure attention encoder</td>
      </tr>
      <tr>
          <td><strong>DenseNet</strong></td>
          <td>RFL, Hu et al. RCGD</td>
          <td>Dense connections, feature reuse</td>
      </tr>
      <tr>
          <td><strong>Deep TNT</strong></td>
          <td>ICMDT</td>
          <td>Transformer-in-Transformer</td>
      </tr>
      <tr>
          <td><strong>Qwen2-VL</strong></td>
          <td>OCSU, GTR-CoT</td>
          <td>Vision-language model encoder</td>
      </tr>
  </tbody>
</table>
<h3 id="decoder-architectures">Decoder Architectures</h3>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Methods Using It</th>
          <th>Output Format</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GRU with Attention</strong></td>
          <td>DECIMER, RFL, Hu et al. RCGD</td>
          <td>SMILES, RFL, SSML</td>
      </tr>
      <tr>
          <td><strong>LSTM with Attention</strong></td>
          <td>Staker et al., ChemPix, MICER</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><strong>Transformer</strong></td>
          <td>Most 2021+ methods</td>
          <td>SMILES, SELFIES, InChI</td>
      </tr>
      <tr>
          <td><strong>GPT-2</strong></td>
          <td>MMSSC-Net</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><strong>BART</strong></td>
          <td>MolParser</td>
          <td>E-SMILES</td>
      </tr>
      <tr>
          <td><strong>Pre-trained CDDD</strong></td>
          <td>Img2Mol</td>
          <td>Continuous embedding → SMILES</td>
      </tr>
  </tbody>
</table>
<h2 id="output-representation-comparison">Output Representation Comparison</h2>
<p>The choice of molecular string representation significantly impacts model performance. Representations fall into three categories: core molecular formats for single structures, extended formats for molecular families and variable structures (primarily Markush structures in patents), and specialized representations optimizing for specific recognition challenges.</p>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan et al. 2022 ablation study</a> provides a comparison of core formats.</p>
<h3 id="core-molecular-formats">Core Molecular Formats</h3>
<p>These represent specific, concrete molecular structures.</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Validity Guarantee</th>
          <th>Sequence Length</th>
          <th>Key Characteristic</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>No</td>
          <td>Shortest (baseline)</td>
          <td>Standard, highest accuracy</td>
          <td>DECIMER.ai, MolSight, DGAT, most 2023+</td>
      </tr>
      <tr>
          <td><strong>DeepSMILES</strong></td>
          <td>Partial</td>
          <td>~1.1x SMILES</td>
          <td>Reduces non-local dependencies</td>
          <td>SwinOCSR</td>
      </tr>
      <tr>
          <td><strong>SELFIES</strong></td>
          <td>Yes (100%)</td>
          <td>~1.5x SMILES</td>
          <td>Guaranteed valid molecules</td>
          <td>DECIMER 1.0, IMG2SMI</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>N/A (canonical)</td>
          <td>Variable (long)</td>
          <td>Unique identifiers, layered syntax</td>
          <td>ViT-InChI, ICMDT, Image2InChI</td>
      </tr>
      <tr>
          <td><strong>FG-SMILES</strong></td>
          <td>No</td>
          <td>Similar to SMILES</td>
          <td>Functional group-aware tokenization</td>
          <td>Image2SMILES</td>
      </tr>
  </tbody>
</table>
<h4 id="smiles-and-variants">SMILES and Variants</h4>
<p><strong>SMILES</strong> remains the dominant format due to its compactness and highest accuracy on clean data. Standard SMILES uses single characters for ring closures and branches that may appear far apart in the sequence, creating learning challenges for sequence models.</p>
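<p>The non-local dependency is easy to quantify. The toy function below (mine, not from any of the surveyed papers) measures how many characters separate matching single-digit ring-closure labels in a SMILES string; it ignores <code>%nn</code> two-digit closures and bracket atoms for brevity:</p>

```python
def ring_closure_spans(smiles):
    """Character distance between matching single-digit ring-closure labels."""
    open_at, spans = {}, []
    for i, ch in enumerate(smiles):
        if ch.isdigit():
            if ch in open_at:
                spans.append(i - open_at.pop(ch))  # closure: record the gap
            else:
                open_at[ch] = i                    # opening occurrence
    return spans
```

<p>For benzene (<code>c1ccccc1</code>) the gap is only 6 characters, but in fused or bridged systems the matching digit can sit dozens of tokens away; that is exactly the dependency DeepSMILES reworks.</p>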
<p><strong>DeepSMILES</strong> addresses these non-local syntax dependencies by modifying how branches and ring closures are encoded, making sequences more learnable for neural models. Despite this modification, DeepSMILES sequences are ~1.1x longer than standard SMILES (not shorter). The format offers partial validity improvements through regex-based tokenization with a compact 76-token vocabulary, providing a middle ground between SMILES accuracy and guaranteed validity.</p>
<p><strong>SELFIES</strong> guarantees 100% valid molecules by design through a context-free grammar, eliminating invalid outputs entirely. This comes at the cost of ~1.5x longer sequences and a typical 2-5% accuracy drop compared to SMILES on exact-match metrics. The validity guarantee makes SELFIES particularly attractive for generative modeling applications.</p>
<p><strong>InChI</strong> uses a layered canonical syntax fundamentally different from SMILES-based formats. While valuable for unique molecular identification, its complex multi-layer structure (formula, connectivity, stereochemistry, isotopes, etc.) and longer sequences make it less suitable for image-to-sequence learning, resulting in lower recognition accuracy.</p>
<h4 id="key-findings-from-rajan-et-al-2022">Key Findings from Rajan et al. 2022</h4>
<ol>
<li><strong>SMILES achieves highest exact-match accuracy</strong> on clean synthetic data</li>
<li><strong>SELFIES guarantees 100% valid molecules</strong> but at cost of ~2-5% accuracy drop</li>
<li><strong>InChI is problematic</strong> due to complex layered syntax and longer sequences</li>
<li><strong>DeepSMILES offers middle ground</strong> with partial validity improvements through modified syntax</li>
</ol>
<h3 id="extended-formats-for-variable-structures">Extended Formats for Variable Structures</h3>
<p><strong>Markush structures</strong> represent families of molecules, using variable groups (R1, R2, etc.) with textual definitions. They are ubiquitous in patent documents for intellectual property protection. Standard SMILES cannot represent these variable structures.</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Base Format</th>
          <th>Key Feature</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>E-SMILES</strong></td>
          <td>SMILES + XML annotations</td>
          <td>Backward-compatible with separator token</td>
          <td>MolParser</td>
      </tr>
      <tr>
          <td><strong>CXSMILES</strong></td>
          <td>SMILES + extension block</td>
          <td>Substituent tables, compression</td>
          <td>MarkushGrapher</td>
      </tr>
  </tbody>
</table>
<p><strong>E-SMILES</strong> (Extended SMILES) maintains backward compatibility by using a <code>&lt;sep&gt;</code> token to separate core SMILES from XML-like annotations. Annotations encode Markush substituents (<code>&lt;a&gt;index:group&lt;/a&gt;</code>), polymer structures (<code>&lt;p&gt;polymer_info&lt;/p&gt;</code>), and abstract ring patterns (<code>&lt;r&gt;abstract_ring&lt;/r&gt;</code>). The core structure remains parseable by standard RDKit.</p>
<p><strong>CXSMILES</strong> optimizes representation by moving variable groups directly into the main SMILES string as special atoms with explicit atom indexing (e.g., <code>C:1</code>) to link to an extension block containing substituent tables. This handles both frequency variation and position variation in Markush structures.</p>
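<p>Based on the description above, a toy parser for the E-SMILES shape (the exact grammar is defined by MolParser; the tag names here follow the examples in the text) might look like:</p>

```python
import re

def split_esmiles(esmiles):
    """Split an E-SMILES string into its RDKit-parseable core and the
    XML-like annotation tail that follows the <sep> token."""
    core, _, ext = esmiles.partition("<sep>")
    return core.strip(), ext.strip()

def parse_substituents(ext):
    """Extract Markush substituent annotations of the form <a>index:group</a>."""
    return {int(i): g for i, g in re.findall(r"<a>(\d+):([^<]+)</a>", ext)}
```

<p>The design point is backward compatibility: everything before <code>&lt;sep&gt;</code> is plain SMILES, so existing cheminformatics tooling still works on the core structure.</p>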
<h3 id="specialized-representations">Specialized Representations</h3>
<p>These formats optimize for specific recognition challenges beyond standard single-molecule tasks.</p>
<h4 id="rfl-ring-free-language">RFL: Ring-Free Language</h4>
<p><strong>RFL</strong> fundamentally restructures molecular serialization through hierarchical ring decomposition, addressing a core challenge: standard 1D formats (SMILES, SSML) flatten complex 2D molecular graphs, losing explicit spatial relationships.</p>
<p><strong>Mechanism</strong>: RFL decomposes molecules into three explicit components:</p>
<ul>
<li><strong>Molecular Skeleton (𝒮)</strong>: Main graph with rings &ldquo;collapsed&rdquo;</li>
<li><strong>Ring Structures (ℛ)</strong>: Individual ring components stored separately</li>
<li><strong>Branch Information (ℱ)</strong>: Connectivity between skeleton and rings</li>
</ul>
<p><strong>Technical approach</strong>:</p>
<ol>
<li>Detect all non-nested rings using DFS</li>
<li>Calculate adjacency ($\gamma$) between rings based on shared edges</li>
<li>Merge isolated rings ($\gamma=0$) into <strong>SuperAtoms</strong> (single node placeholders)</li>
<li>Merge adjacent rings ($\gamma&gt;0$) into <strong>SuperBonds</strong> (edge placeholders)</li>
<li>Progressive decoding: predict skeleton first, then conditionally decode rings using stored hidden states</li>
</ol>
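<p>Step 2 reduces to counting shared edges between ring cycles. A minimal sketch, assuming rings are given as ordered atom-index lists (in practice these would come from something like RDKit's <code>GetRingInfo()</code>):</p>

```python
def ring_adjacency(ring_a, ring_b):
    """Gamma = number of bonds (edges) shared by two rings.

    Each ring is an ordered cycle of atom indices; consecutive pairs
    (wrapping around) are its edges. gamma == 0 -> isolated ring
    (merged to a SuperAtom); gamma > 0 -> adjacent rings (SuperBond).
    """
    def edges(ring):
        return {frozenset((ring[i], ring[(i + 1) % len(ring)]))
                for i in range(len(ring))}
    return len(edges(ring_a) & edges(ring_b))
```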
<p><strong>Performance</strong>: RFL achieves SOTA results on both handwritten (95.38% EM) and printed (95.58% EM) structures, with particular strength on high-complexity molecules, where standard baselines fail completely (0% exact match vs. ~30% for RFL on the hardest tier).</p>
<p><strong>Note</strong>: RFL does not preserve original drawing orientation; it&rsquo;s focused on computational efficiency through hierarchical decomposition.</p>
<h4 id="ssml-structure-specific-markup-language">SSML: Structure-Specific Markup Language</h4>
<p><strong>SSML</strong> is the primary orientation-preserving format in OCSR. Based on Chemfig (LaTeX chemical drawing package), it provides step-by-step drawing instructions.</p>
<p><strong>Key characteristics</strong>:</p>
<ul>
<li>Describes <em>how to draw</em> the molecule alongside its graph structure</li>
<li>Uses &ldquo;reconnection marks&rdquo; for cyclic structures</li>
<li>Preserves branch angles and spatial relationships</li>
<li>Significantly outperformed SMILES for handwritten recognition: 92.09% vs 81.89% EM (Hu et al. RCGD 2023)</li>
</ul>
<p><strong>Use case</strong>: Particularly valuable for hand-drawn structure recognition where visual alignment between image and reconstruction sequence aids model learning.</p>
<h2 id="training-data-comparison">Training Data Comparison</h2>
<p>Training data scale has grown dramatically, with a shift toward combining synthetic and real-world images.</p>
<h3 id="data-scale-evolution">Data Scale Evolution</h3>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Typical Scale</th>
          <th>Maximum Reported</th>
          <th>Primary Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2019-2020</td>
          <td>1-15M</td>
          <td>57M (Staker)</td>
          <td>Synthetic (RDKit, CDK)</td>
      </tr>
      <tr>
          <td>2021-2022</td>
          <td>5-35M</td>
          <td>35M (DECIMER 1.0)</td>
          <td>Synthetic with augmentation</td>
      </tr>
      <tr>
          <td>2023-2024</td>
          <td>100-150M</td>
          <td>450M+ (DECIMER.ai)</td>
          <td>Synthetic + real patents</td>
      </tr>
      <tr>
          <td>2025</td>
          <td>1-10M + real</td>
          <td>7.7M (MolParser)</td>
          <td>Curated real + synthetic</td>
      </tr>
  </tbody>
</table>
<h3 id="synthetic-vs-real-data">Synthetic vs Real Data</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Training Data</th>
          <th>Real-World Performance Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>450M+ synthetic (RanDepict)</td>
          <td>Strong generalization via domain randomization</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>7.7M with active learning</td>
          <td>Explicitly targets &ldquo;in the wild&rdquo; images</td>
      </tr>
      <tr>
          <td><strong>GTR-CoT</strong></td>
          <td>Real patent/paper images</td>
          <td>Chain-of-thought improves reasoning</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>Multi-stage curriculum</td>
          <td>RL fine-tuning for stereochemistry</td>
      </tr>
  </tbody>
</table>
<h3 id="data-augmentation-strategies">Data Augmentation Strategies</h3>
<p>Common augmentation techniques across methods:</p>
<table>
  <thead>
      <tr>
          <th>Technique</th>
          <th>Purpose</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Rotation</strong></td>
          <td>Orientation invariance</td>
          <td>Nearly all methods</td>
      </tr>
      <tr>
          <td><strong>Gaussian blur</strong></td>
          <td>Image quality variation</td>
          <td>DECIMER, MolParser</td>
      </tr>
      <tr>
          <td><strong>Salt-and-pepper noise</strong></td>
          <td>Scan artifact simulation</td>
          <td>DECIMER, Image2SMILES</td>
      </tr>
      <tr>
          <td><strong>Affine transforms</strong></td>
          <td>Perspective variation</td>
          <td>ChemPix, MolParser</td>
      </tr>
      <tr>
          <td><strong>Font/style variation</strong></td>
          <td>Rendering diversity</td>
          <td>RanDepict (DECIMER.ai)</td>
      </tr>
      <tr>
          <td><strong>Hand-drawn simulation</strong></td>
          <td>Sketch-like inputs</td>
          <td>ChemPix, ChemReco, DECIMER-Hand-Drawn</td>
      </tr>
      <tr>
          <td><strong>Background variation</strong></td>
          <td>Document context</td>
          <td>MolParser, DECIMER.ai</td>
      </tr>
  </tbody>
</table>
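<p>Most of these transforms are one-liners in imaging libraries; salt-and-pepper noise, for instance, reduces to flipping a random fraction of pixels to black or white. A dependency-free sketch on a grayscale image stored as nested lists (parameter values are illustrative, not taken from any particular method):</p>

```python
import random

def salt_and_pepper(image, amount=0.05, seed=0):
    """Set roughly `amount` of the pixels to 0 (pepper) or 255 (salt),
    mimicking binarization and scan artifacts. Returns a new image."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the original is untouched
    for _ in range(int(amount * h * w)):
        y, x = rng.randrange(h), rng.randrange(w)
        out[y][x] = rng.choice((0, 255))
    return out
```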
<h2 id="hardware-and-compute-requirements">Hardware and Compute Requirements</h2>
<p>Hardware requirements span several orders of magnitude, from consumer GPUs to TPU pods.</p>
<h3 id="training-hardware-comparison">Training Hardware Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Hardware</th>
          <th>Training Time</th>
          <th>Dataset Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Staker et al. (2019)</strong></td>
          <td>8x GPUs</td>
          <td>26 days</td>
          <td>57M</td>
      </tr>
      <tr>
          <td><strong>IMG2SMI (2021)</strong></td>
          <td>1x RTX 2080 Ti</td>
          <td>5 epochs</td>
          <td>~10M</td>
      </tr>
      <tr>
          <td><strong>Image2SMILES (2022)</strong></td>
          <td>4x V100</td>
          <td>2 weeks</td>
          <td>30M</td>
      </tr>
      <tr>
          <td><strong>MICER (2022)</strong></td>
          <td>4x V100</td>
          <td>42 hours</td>
          <td>10M</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0 (2021)</strong></td>
          <td>TPU v3-8</td>
          <td>Not reported</td>
          <td>35M</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai (2023)</strong></td>
          <td>TPU v3-256</td>
          <td>Not reported</td>
          <td>450M+</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR (2022)</strong></td>
          <td>4x RTX 3090</td>
          <td>5 days</td>
          <td>5M</td>
      </tr>
      <tr>
          <td><strong>MolParser (2025)</strong></td>
          <td>8x A100</td>
          <td>Curriculum learning</td>
          <td>7.7M</td>
      </tr>
      <tr>
          <td><strong>MolSight (2025)</strong></td>
          <td>Not specified</td>
          <td>RL fine-tuning (GRPO)</td>
          <td>Multi-stage</td>
      </tr>
  </tbody>
</table>
<h3 id="inference-considerations">Inference Considerations</h3>
<p>Few papers report inference speed consistently. Available data:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Inference Speed</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>4x faster than DECIMER</td>
          <td>TensorFlow Lite optimization</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong> (baseline)</td>
          <td>~1 image/sec</td>
          <td>CPU-based rule system</td>
      </tr>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>Real-time capable</td>
          <td>Optimized Swin encoder</td>
      </tr>
  </tbody>
</table>
<h3 id="accessibility-tiers">Accessibility Tiers</h3>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Hardware</th>
          <th>Representative Methods</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Consumer</strong></td>
          <td>1x RTX 2080/3090</td>
          <td>IMG2SMI, ChemPix</td>
      </tr>
      <tr>
          <td><strong>Workstation</strong></td>
          <td>4x V100/A100</td>
          <td>Image2SMILES, MICER, SwinOCSR</td>
      </tr>
      <tr>
          <td><strong>Cloud/HPC</strong></td>
          <td>TPU pods, 8+ A100</td>
          <td>DECIMER.ai, MolParser</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmark-performance">Benchmark Performance</h2>
<h3 id="common-evaluation-datasets">Common Evaluation Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Type</th>
          <th>Size</th>
          <th>Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>Patent images</td>
          <td>~5K test</td>
          <td>Real-world complexity</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>Scanned images</td>
          <td>~5K test</td>
          <td>Scan artifacts</td>
      </tr>
      <tr>
          <td><strong>Staker</strong></td>
          <td>Synthetic</td>
          <td>Variable</td>
          <td>Baseline synthetic</td>
      </tr>
      <tr>
          <td><strong>CLEF</strong></td>
          <td>Patent images</td>
          <td>~1K test</td>
          <td>Markush structures</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>Japanese patents</td>
          <td>~1K test</td>
          <td>Different rendering styles</td>
      </tr>
  </tbody>
</table>
<h3 id="accuracy-comparison-exact-match-">Accuracy Comparison (Exact Match %)</h3>
<p>Methods are roughly grouped by evaluation era; direct comparison is complicated by different test sets.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>USPTO</th>
          <th>UOB</th>
          <th>Staker</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>OSRA</strong> (baseline)</td>
          <td>~70%</td>
          <td>~65%</td>
          <td>~80%</td>
          <td>Rule-based reference</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>~85%</td>
          <td>~80%</td>
          <td>~90%</td>
          <td>First transformer-based</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>~88%</td>
          <td>~82%</td>
          <td>~92%</td>
          <td>Swin encoder advantage</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>~90%</td>
          <td>~85%</td>
          <td>~95%</td>
          <td>Scale + augmentation</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>~92%</td>
          <td>~88%</td>
          <td>~96%</td>
          <td>Real-world focus</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>~93%+</td>
          <td>~89%+</td>
          <td>~97%+</td>
          <td>RL fine-tuning boost</td>
      </tr>
  </tbody>
</table>
<p><em>Note: Numbers are approximate and may vary by specific test split. See individual paper notes for precise figures.</em></p>
<h3 id="stereochemistry-recognition">Stereochemistry Recognition</h3>
<p>Stereochemistry remains a persistent challenge across all methods:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Approach</th>
          <th>Stereo Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Most methods</strong></td>
          <td>Standard SMILES</td>
          <td>Lower than non-stereo</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>RL (GRPO) specifically for stereo</td>
          <td>Improved</td>
      </tr>
      <tr>
          <td><strong>MolNexTR</strong></td>
          <td>Graph-based explicit stereo</td>
          <td>Better handling</td>
      </tr>
      <tr>
          <td><strong>Image2InChI</strong></td>
          <td>InChI stereo layers</td>
          <td>Mixed results</td>
      </tr>
  </tbody>
</table>
<h2 id="hand-drawn-recognition">Hand-Drawn Recognition</h2>
<p>A distinct sub-lineage focuses on hand-drawn/sketched chemical structures.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Target Domain</th>
          <th>Key Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ChemPix (2021)</strong></td>
          <td>Hand-drawn hydrocarbons</td>
          <td>First deep learning for sketches</td>
      </tr>
      <tr>
          <td><strong>Hu et al. RCGD (2023)</strong></td>
          <td>Hand-drawn structures</td>
          <td>Random conditional guided decoder</td>
      </tr>
      <tr>
          <td><strong>ChemReco (2024)</strong></td>
          <td>Hand-drawn C-H-O structures</td>
          <td>EfficientNet + curriculum learning</td>
      </tr>
      <tr>
          <td><strong>DECIMER-Hand-Drawn (2024)</strong></td>
          <td>General hand-drawn</td>
          <td>Enhanced DECIMER architecture</td>
      </tr>
  </tbody>
</table>
<h3 id="hand-drawn-vs-printed-trade-offs">Hand-Drawn vs Printed Trade-offs</h3>
<ul>
<li>Hand-drawn methods sacrifice some accuracy on clean printed images</li>
<li>Require specialized training data (synthetic hand-drawn simulation)</li>
<li>Generally smaller training sets due to data collection difficulty</li>
<li>Better suited for educational and lab notebook applications</li>
</ul>
<h2 id="key-innovations-by-method">Key Innovations by Method</h2>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Primary Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Staker et al.</strong></td>
          <td>First end-to-end deep learning OCSR</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>Transformer decoder + SELFIES</td>
      </tr>
      <tr>
          <td><strong>Img2Mol</strong></td>
          <td>Continuous embedding space (CDDD)</td>
      </tr>
      <tr>
          <td><strong>Image2SMILES</strong></td>
          <td>Functional group-aware SMILES (FG-SMILES)</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>Hierarchical vision transformer encoder</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>Massive scale + RanDepict augmentation</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>Extended SMILES + active learning</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>RL fine-tuning (GRPO) for accuracy</td>
      </tr>
      <tr>
          <td><strong>GTR-CoT</strong></td>
          <td>Chain-of-thought graph traversal</td>
      </tr>
      <tr>
          <td><strong>OCSU</strong></td>
          <td>Multi-task vision-language understanding</td>
      </tr>
      <tr>
          <td><strong>RFL</strong></td>
          <td>Hierarchical ring decomposition with SuperAtoms/SuperBonds</td>
      </tr>
  </tbody>
</table>
<h2 id="open-challenges">Open Challenges</h2>
<ol>
<li><strong>Stereochemistry</strong>: Consistent challenge across all methods; RL approaches (MolSight) show promise</li>
<li><strong>Abbreviations/R-groups</strong>: E-SMILES and Markush-specific methods emerging</li>
<li><strong>Real-world robustness</strong>: Gap between synthetic training and patent/paper images</li>
<li><strong>Inference speed</strong>: Rarely reported; important for production deployment</li>
<li><strong>Memory efficiency</strong>: Almost never documented; limits accessibility</li>
<li><strong>Multi-molecule images</strong>: Most methods assume single isolated structure</li>
</ol>
<h2 id="references">References</h2>
<p>Individual paper notes linked throughout. For the complete method listing, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR Methods taxonomy</a>.</p>
]]></content:encoded></item><item><title>MolSight: OCSR with RL and Multi-Granularity Learning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</guid><description>A three-stage OCSR framework using SMILES pretraining, auxiliary bond/coordinate tasks, and reinforcement learning to master stereochemistry recognition.</description><content:encoded><![CDATA[<h2 id="contribution-a-framework-for-optical-chemical-structure-recognition">Contribution: A Framework for Optical Chemical Structure Recognition</h2>
<p>This is primarily a <strong>Method</strong> paper. It proposes a novel three-stage training framework (Pretraining → Fine-tuning → RL Post-training) to improve Optical Chemical Structure Recognition (OCSR). Specifically, it introduces the use of Group Relative Policy Optimization (GRPO) to solve non-differentiable chemical validity issues.</p>
<p>It also has a <strong>Resource</strong> component, as the authors construct and release <em>Stereo-200k</em>, a dataset specifically designed to train models on challenging stereoisomeric molecules.</p>
<h2 id="motivation-resolving-stereochemical-cues">Motivation: Resolving Stereochemical Cues</h2>
<p>Existing OCSR systems struggle to accurately recognize stereochemical information (e.g., chirality, geometric isomerism) because the visual cues distinguishing stereoisomers (such as wedge and dash bonds) are subtle. Current methods often fail to capture the geometric relationships required to distinguish molecules with identical connectivity but different spatial arrangements. Accurate recognition is critical for downstream tasks like drug discovery where stereochemistry determines pharmacological effects.</p>
<h2 id="core-innovations-grpo-and-multi-granularity-learning">Core Innovations: GRPO and Multi-Granularity Learning</h2>
<p>MolSight introduces three key technical innovations:</p>
<ol>
<li><strong>Reinforcement Learning for OCSR</strong>: It is the first OCSR system to incorporate RL (specifically GRPO) to directly optimize for chemical semantic correctness.</li>
<li><strong>Multi-Granularity Learning</strong>: It employs auxiliary heads for chemical bond classification and atom localization. Unlike previous approaches that optimize these jointly, MolSight decouples the coordinate head to prevent interference with SMILES generation.</li>
<li><strong>SMILES-M Notation</strong>: A lightweight extension to SMILES to handle Markush structures (common in patents) without significant sequence length increase.</li>
</ol>
<h2 id="experimental-methodology">Experimental Methodology</h2>
<p>The authors evaluated MolSight using a rigorous mix of real and synthetic benchmarks:</p>
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec, Imago) and deep learning methods (MolScribe, MolGrapher, DECIMER).</li>
<li><strong>Benchmarks</strong>: Evaluated on real-world datasets (USPTO, Maybridge UoB, CLEF-2012, JPO) and synthetic datasets (Staker, ChemDraw, Indigo, Stereo-2K).</li>
<li><strong>Ablation Studies</strong>: Tested the impact of the bond head, coordinate head, and RL stages separately.</li>
<li><strong>Transfer Learning</strong>: Assessed the quality of learned representations by using the frozen encoder for molecular property prediction on MoleculeNet.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>SOTA Performance</strong>: MolSight achieved 85.1% stereochemical accuracy on the USPTO dataset, significantly outperforming the previous SOTA (MolScribe) which achieved 69.0%.</li>
<li><strong>RL Effectiveness</strong>: Reinforcement learning post-training specifically improved performance on stereoisomers, raising Tanimoto similarity and exact match rates on the Stereo-2k test set.</li>
<li><strong>Robustness</strong>: On perturbed USPTO images (random rotations and shearing), MolSight achieved 92.3% exact match accuracy (vs. the original 92.0%), while rule-based methods like OSRA dropped from 83.5% to 6.7%. On the low-resolution Staker dataset, MolSight reached 82.1% exact match.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training pipeline uses three distinct data sources:</p>
<ol>
<li><strong>Pre-training</strong>: <em>MolParser-7M</em>. Contains diverse images but requires the <strong>SMILES-M</strong> extension to handle Markush structures.</li>
<li><strong>Fine-tuning</strong>: <em>PubChem-1M</em> and <em>USPTO-680K</em>. Used for multi-granularity learning with bond and coordinate labels.</li>
<li><strong>RL Post-training</strong>: <em>Stereo-200k</em>. A self-collected dataset from the first 2M compounds in PubChem, filtered for chirality (&lsquo;@&rsquo;) and cis-trans isomerism (&lsquo;/&rsquo;, &lsquo;\&rsquo;). It uses 5 different RDKit drawing styles to ensure robustness.</li>
</ol>
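<p>The Stereo-200k filtering step can be sketched as a simple SMILES token check (a hedged illustration, not the authors' code; a production filter would parse the SMILES rather than match raw substrings):</p>

```python
def has_stereo_markers(smiles: str) -> bool:
    """Keep molecules whose SMILES carries chirality ('@') or
    cis/trans markers ('/', '\\'), mirroring the Stereo-200k filter."""
    return any(tok in smiles for tok in ("@", "/", "\\"))
```

<p>For example, <code>C[C@H](N)C(=O)O</code> (L-alanine) passes, while <code>CCO</code> (ethanol) is filtered out.</p>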
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Reinforcement Learning</strong>: Uses <strong>GRPO (Group Relative Policy Optimization)</strong>.
<ul>
<li><strong>Reward Function</strong>: A linear combination of Tanimoto similarity and a graded stereochemistry reward.
$$ R = w_t \cdot r_{\text{tanimoto}} + w_s \cdot r_{\text{stereo}} $$
where $w_t=0.4$ and $w_s=0.6$. The stereochemistry reward $r_{\text{stereo}}$ is 1.0 for an InChIKey exact match, 0.3 if the atom count matches, and 0.1 otherwise.</li>
<li><strong>Sampling</strong>: Samples 4 completions per image with temperature 1.0 during RL training.</li>
</ul>
</li>
<li><strong>Auxiliary Tasks</strong>:
<ul>
<li><strong>Bond Classification</strong>: Concatenates hidden states of two atom queries to predict bond type via MLP.</li>
<li><strong>Atom Localization</strong>: Treated as a classification task (SimCC) but optimized using <strong>Maximum Likelihood Estimation (MLE)</strong> to account for uncertainty.</li>
</ul>
</li>
</ul>
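<p>The reward described above reduces to a few lines. This is a minimal sketch of the graded reward logic as stated in the paper (the actual Tanimoto term would come from fingerprint comparison, e.g. via RDKit):</p>

```python
def stereo_reward(pred_inchikey, ref_inchikey, pred_atoms, ref_atoms):
    # Graded stereochemistry reward from the paper:
    # 1.0 for an InChIKey exact match, 0.3 if atom counts match, 0.1 otherwise.
    if pred_inchikey == ref_inchikey:
        return 1.0
    return 0.3 if pred_atoms == ref_atoms else 0.1

def grpo_reward(tanimoto_sim, stereo, w_t=0.4, w_s=0.6):
    # Linear combination R = w_t * r_tanimoto + w_s * r_stereo.
    return w_t * tanimoto_sim + w_s * stereo
```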
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer. Input images are preprocessed to $512 \times 512$ resolution.
<ul>
<li><strong>Encoder</strong>: <strong>EfficientViT-L1</strong> (~53M params), chosen for linear attention efficiency.</li>
<li><strong>Decoder</strong>: 6-layer Transformer with <strong>RoPE</strong>, <strong>SwiGLU</strong>, and <strong>RMSNorm</strong>. Randomly initialized (no LLM weights) due to vocabulary mismatch.</li>
<li><strong>Coordinate Head</strong>: Separated from the main decoder. It adds 2 extra Transformer layers to process atom queries before prediction to improve accuracy.</li>
</ul>
</li>
<li><strong>Parameter Tuning</strong>:
<ul>
<li>Stage 3 (RL) uses <strong>LoRA</strong> (Rank=8, Alpha=16) to optimize the decoder.</li>
</ul>
</li>
</ul>
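<p>For readers unfamiliar with LoRA, the Stage 3 update has this form (a numpy sketch under the paper's rank-8, alpha-16 configuration; shapes and names are illustrative, not from the paper's code):</p>

```python
import numpy as np

def lora_forward(x, W, A, B, r=8, alpha=16):
    # LoRA adds a low-rank update to a frozen weight W:
    #   y = W x + (alpha / r) * B (A x)
    # Only A (r x d_in) and B (d_out x r) are trained.
    return W @ x + (alpha / r) * (B @ (A @ x))
```

<p>With <code>A</code> initialized to zero (the standard LoRA initialization), the adapter starts as a no-op and learns a small correction on top of the frozen decoder weights.</p>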
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Exact Match</strong>: Exact recognition accuracy for the full molecular structure.</li>
<li><strong>Tanimoto Coefficient</strong>: Fingerprint similarity for chemical semantics.</li>
<li><strong>OKS (Object Keypoint Similarity)</strong>: Used specifically for evaluating atom localization accuracy.</li>
</ul>
</li>
<li><strong>Perturbation</strong>: Robustness tested with random rotations [-5°, 5°] and xy-shearing [-0.1, 0.1].</li>
</ul>
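<p>The Tanimoto coefficient over fingerprints is the standard set-overlap measure; a minimal sketch over "on"-bit sets (a real pipeline would use RDKit Morgan fingerprints rather than raw sets):</p>

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto coefficient over fingerprint "on" bits: |A & B| / |A | B|.
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0
```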
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training and inference performed on a single node.</li>
<li><strong>Processors</strong>: Intel Xeon Silver 4210R CPU.</li>
<li><strong>Accelerators</strong>: 4x <strong>NVIDIA GeForce RTX 3090/4090</strong> GPUs.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Stage 1: Batch size 512, LR $4 \times 10^{-4}$.</li>
<li>Stage 2: Batch size 256, Bond head LR $4 \times 10^{-4}$, Coord head LR $4 \times 10^{-5}$.</li>
<li>Stage 3 (RL): Batch size 64, Base LR $1 \times 10^{-4}$.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/hustvl/MolSight">MolSight (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch implementation with training and inference code</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, W., Wang, X., Feng, B., &amp; Liu, W. (2025). MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning. In <em>Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026)</em>. <a href="https://doi.org/10.48550/arXiv.2511.17300">https://doi.org/10.48550/arXiv.2511.17300</a></p>
<p><strong>Publication</strong>: AAAI 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/hustvl/MolSight">Official Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2025molsight,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wenrui Zhang and Xinggang Wang and Bin Feng and Wenyu Liu}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2511.17300}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2511.17300}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolScribe: Robust Image-to-Graph Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</guid><description>Image-to-graph generation model for OCSR that predicts atoms, bonds, and coordinates jointly to better handle stereochemistry and abbreviations.</description><content:encoded><![CDATA[<h2 id="contribution-generative-image-to-graph-modelling">Contribution: Generative Image-to-Graph Modelling</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) with a secondary contribution to Resources ($\Psi_{\text{Resource}}$).</p>
<p>It proposes a novel architecture (image-to-graph generation) to solve the Optical Chemical Structure Recognition (OCSR) task, validating it through extensive ablation studies and comparisons against strong baselines like MolVec and DECIMER. It also contributes a new benchmark dataset of annotated images from ACS journals.</p>
<h2 id="motivation-limitations-in-existing-ocsr-pipelines">Motivation: Limitations in Existing OCSR Pipelines</h2>
<p>Translating molecular images into machine-readable graphs (OCSR) is challenging due to the high variance in drawing styles, stereochemistry conventions, and abbreviated structures found in literature.</p>
<p>Existing solutions face structural bottlenecks:</p>
<ul>
<li><strong>Rule-based systems</strong> (e.g., OSRA) rely on rigid heuristics that fail on diverse styles.</li>
<li><strong>Image-to-SMILES neural models</strong> treat the problem as captioning. They struggle with geometric reasoning (which is required for resolving chirality) and, because they omit explicit atom locations, cannot easily incorporate chemical constraints or verify correctness.</li>
</ul>
<h2 id="core-innovation-joint-graph-and-coordinate-prediction">Core Innovation: Joint Graph and Coordinate Prediction</h2>
<p>MolScribe introduces an <strong>Image-to-Graph</strong> generation paradigm that combines the flexibility of neural networks with the precision of symbolic constraints. It frames the task probabilistically as:</p>
<p>$$
P(G | I) = P(A | I) P(B | A, I)
$$</p>
<p>Where the model predicts a sequence of atoms $A$ given an image $I$, followed by the bonds $B$ given both the atoms and the image.</p>
<ol>
<li><strong>Explicit Graph Prediction</strong>: It predicts a sequence of atoms (with 2D coordinates) and then predicts bonds between them.</li>
<li><strong>Symbolic Constraints</strong>: It uses the predicted graph structure and coordinates to strictly determine chirality and cis/trans isomerism.</li>
<li><strong>Abbreviation Expansion</strong>: It employs a greedy algorithm to parse and expand &ldquo;superatoms&rdquo; (e.g., &ldquo;CO2Et&rdquo;) into their full atomic structure.</li>
<li><strong>Dynamic Augmentation</strong>: It introduces a data augmentation strategy that randomly substitutes functional groups with abbreviations and adds R-groups during training to improve generalization.</li>
</ol>
<h2 id="methodology-autoregressive-atoms-and-pairwise-bonds">Methodology: Autoregressive Atoms and Pairwise Bonds</h2>
<p>The authors evaluate MolScribe on synthetic and real-world datasets, focusing on <strong>Exact Match Accuracy</strong> of the canonical SMILES string. The model generates atom sequences autoregressively:</p>
<p>$$
P(A | I) = \prod_{i=1}^n P(a_i | A_{&lt;i}, I)
$$</p>
<p>To handle continuous spatial locations, atom coordinates are mapped to discrete bins (e.g., $\hat{x}_i = \lfloor \frac{x_i}{W} \times n_{\text{bins}} \rfloor$) and decoded alongside the element labels. Bond types are then predicted by a pairwise classifier over the hidden states of every atom pair:</p>
<p>$$
P(B | A, I) = \prod_{i=1}^n \prod_{j=1}^n P(b_{i,j} | A, I)
$$</p>
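<p>The coordinate binning above can be sketched in a few lines ($n_{\text{bins}} = 64$ per the paper; the clamp for the edge case $x = W$ is my addition, not stated in the source):</p>

```python
def discretize_coord(x, width, n_bins=64):
    # Map a continuous pixel coordinate to a discrete bin token:
    #   x_hat = floor(x / W * n_bins), clamped so x == W stays in range.
    return min(int(x / width * n_bins), n_bins - 1)
```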
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (MolVec, OSRA) and neural (Img2Mol, DECIMER, SwinOCSR) systems.</li>
<li><strong>Benchmarks</strong>:
<ul>
<li><strong>Synthetic</strong>: Indigo (in-domain) and ChemDraw (out-of-domain).</li>
<li><strong>Realistic</strong>: Five public benchmarks (CLEF, JPO, UOB, USPTO, Staker).</li>
<li><strong>New Dataset</strong>: 331 images from ACS Publications (journal articles).</li>
</ul>
</li>
<li><strong>Ablations</strong>: Tested performance without data augmentation, with continuous vs. discrete coordinates, and without non-atom tokens.</li>
<li><strong>Human Eval</strong>: Measured the time reduction for chemists using MolScribe to digitize molecules vs. drawing from scratch.</li>
</ul>
<h2 id="results-robust-exact-match-accuracy">Results: Robust Exact Match Accuracy</h2>
<ul>
<li><strong>Strong Performance</strong>: MolScribe achieved <strong>76-93% accuracy</strong> across public benchmarks, outperforming baselines on most datasets. On the ACS dataset of journal article images, MolScribe achieved 71.9% compared to the next best 55.3% (OSRA). On the large Staker patent dataset, MolScribe achieved 86.9%, surpassing MSE-DUDL (77.0%) while using far less training data (1.68M vs. 68M examples).</li>
<li><strong>Chirality Verification</strong>: Explicit geometric reasoning allowed MolScribe to predict chiral molecules significantly better than image-to-SMILES baselines. When chirality is ignored, the performance gap narrows (e.g., on Indigo, baseline accuracy rises from 94.1% to 96.3%), isolating MolScribe&rsquo;s primary advantage to geometric reasoning for stereochemistry.</li>
<li><strong>Hand-Drawn Generalization</strong>: The model achieved <strong>11.2% exact match accuracy</strong> on the DECIMER-HDM dataset, despite having no hand-drawn images in its training set, with many errors limited to a few atom-level mismatches.</li>
<li><strong>Robustness</strong>: The model maintained high performance on perturbed images (rotation/shear), whereas rule-based systems degraded severely.</li>
<li><strong>Usability</strong>: The atom-level alignment allows for confidence visualization, and human evaluation showed it reduced digitization time from <strong>137s to 20s</strong> per molecule.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a mix of synthetic and patent data with extensive dynamic augmentation:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem (Synthetic)</strong></td>
          <td>1M</td>
          <td>Molecules randomly sampled from PubChem and rendered via Indigo toolkit; includes atom coords.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO (Patents)</strong></td>
          <td>680K</td>
          <td>Patent data lacks exact atom coordinates; relative coordinates normalized from MOLfiles to image dimensions (often introduces coordinate shifts).</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecule Augmentation</strong>:</p>
<ul>
<li><strong>Functional Groups</strong>: Randomly substituted using 53 common substitution rules (e.g., replacing substructures with &ldquo;Et&rdquo; or &ldquo;Ph&rdquo;).</li>
<li><strong>R-Groups</strong>: Randomly added using vocabulary: <code>[R, R1...R12, Ra, Rb, Rc, Rd, X, Y, Z, A, Ar]</code>.</li>
<li><strong>Styles</strong>: Random variation of aromaticity (circle vs. bonds) and explicit hydrogens.</li>
</ul>
<p><strong>Image Augmentation</strong>:</p>
<ul>
<li><strong>Rendering</strong>: Randomized font (Arial, Times, Courier, Helvetica), line width, and label modes during synthetic generation.</li>
<li><strong>Perturbations</strong>: Applied rotation ($\pm 90^{\circ}$), cropping (1%), padding (40%), downscaling, blurring, and Salt-and-Pepper/Gaussian noise.</li>
</ul>
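<p>As one concrete example, the Salt-and-Pepper perturbation listed above might be implemented as follows (illustrative only; the paper does not specify its exact noise parameters):</p>

```python
import numpy as np

def salt_and_pepper(img, amount=0.01, seed=0):
    # Flip a random fraction of pixels to 0 (pepper) or 255 (salt).
    rng = np.random.default_rng(seed)
    out = img.copy()
    mask = rng.random(img.shape) < amount
    out[mask] = rng.choice(np.array([0, 255], dtype=img.dtype), size=int(mask.sum()))
    return out
```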
<p><strong>Preprocessing</strong>: Input images are resized to $384 \times 384$.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Atom Prediction (Pix2Seq-style)</strong>:
<ul>
<li>The model generates a sequence of tokens: $S^A = [l_1, \hat{x}_1, \hat{y}_1, \dots, l_n, \hat{x}_n, \hat{y}_n]$.</li>
<li><strong>Discretization</strong>: Coordinates are binned into integer tokens ($n_{bins} = 64$).</li>
<li><strong>Tokenizer</strong>: Atom-wise tokenizer splits SMILES into atoms; non-atom tokens (parentheses, digits) are kept to help structure learning.</li>
</ul>
</li>
<li><strong>Bond Prediction</strong>:
<ul>
<li>Format: Pairwise classification for every pair of predicted atoms.</li>
<li>Symmetry: For symmetric bonds (single/double), the probability is averaged as:
$$
\hat{P}(b_{i,j} = t) = \frac{1}{2} \big( P(b_{i,j} = t) + P(b_{j,i} = t) \big)
$$
Wedge bonds are directional, so their probabilities are not averaged.</li>
</ul>
</li>
<li><strong>Abbreviation Expansion (Algorithm 1)</strong>:
<ul>
<li>A greedy algorithm connects atoms within an expanded abbreviation (e.g., &ldquo;COOH&rdquo;) until valences are full, avoiding the need for a fixed dictionary.</li>
<li><strong>Carbon Chains</strong>: Splits condensed chains like $C_aX_b$ into explicit sequences ($CX_q \dots CX_{q+r}$).</li>
<li><strong>Nested Formulas</strong>: Recursively parses nested structures like $N(CH_3)_2$ by treating them as superatoms attached to the current backbone.</li>
<li><strong>Valence Handling</strong>: Iterates through common valences first to resolve ambiguities.</li>
</ul>
</li>
</ul>
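<p>The symmetrization step for bond probabilities is a direction-average over the pairwise prediction tensor. A numpy sketch (applies to symmetric bond types only; per the paper, wedge bonds keep their direction):</p>

```python
import numpy as np

def symmetrize_bond_probs(P):
    # P: (n_atoms, n_atoms, n_bond_types) pairwise bond-type probabilities.
    # Average the two directions: P_hat(b_ij = t) = (P(b_ij = t) + P(b_ji = t)) / 2.
    return 0.5 * (P + P.transpose(1, 0, 2))
```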
<h3 id="models">Models</h3>
<p>The architecture is an encoder-decoder with a classification head:</p>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer (Swin-B)</strong>, pre-trained on ImageNet-22K (88M params).</li>
<li><strong>Decoder</strong>: 6-layer Transformer, 8 heads, hidden dimension 256.</li>
<li><strong>Bond Predictor</strong>: 2-layer MLP (Feedforward) with ReLU, taking concatenated atom hidden states as input.</li>
<li><strong>Training</strong>: Teacher forcing, Cross-Entropy Loss, Batch size 128, 30 epochs.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Exact Match of Canonical SMILES.</p>
<ul>
<li>Stereochemistry: Must match tetrahedral chirality; cis-trans ignored.</li>
<li>R-groups: Replaced with wildcards <code>*</code> or <code>[d*]</code> for evaluation.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training performed on Linux server with <strong>96 CPUs</strong> and <strong>500GB RAM</strong>.</li>
<li><strong>GPUs</strong>: <strong>4x NVIDIA A100 GPUs</strong>.</li>
<li><strong>Training Time</strong>: Unspecified; comparative models on large datasets took &ldquo;more than one day&rdquo;.</li>
<li><strong>Inference</strong>: Requires autoregressive decoding for atoms, followed by a single forward pass for bonds.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/thomas0809/MolScribe">MolScribe (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training, inference, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/yujieq/MolScribe">MolScribe (Hugging Face)</a></td>
          <td>Demo</td>
          <td>MIT</td>
          <td>Interactive web demo for molecular image recognition</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Scoped to single-molecule images only; does not handle multi-molecule diagrams or reaction schemes.</li>
<li>Hand-drawn molecule recognition remains weak (the model was not trained on hand-drawn data).</li>
<li>Complex Markush structures (positional variation, frequency variation) are not supported, as these cannot be represented in SMILES or MOLfiles.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, Y., Guo, J., Tu, Z., Li, Z., Coley, C. W., &amp; Barzilay, R. (2023). MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation. <em>Journal of Chemical Information and Modeling</em>, 63(7), 1925-1934. <a href="https://doi.org/10.1021/acs.jcim.2c01480">https://doi.org/10.1021/acs.jcim.2c01480</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/spaces/yujieq/MolScribe">Hugging Face Space</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qianMolScribeRobustMolecular2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolScribe}}: {{Robust Molecular Structure Recognition}} with {{Image-To-Graph Generation}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolScribe}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Qian, Yujie and Guo, Jiang and Tu, Zhengkai and Li, Zhening and Coley, Connor W. and Barzilay, Regina}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1925--1934}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.2c01480}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolMole: Unified Vision Pipeline for Molecule Mining</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</guid><description>A vision-based deep learning framework that unifies molecule detection, reaction parsing, and OCSR for page-level chemical data extraction.</description><content:encoded><![CDATA[<h2 id="molmoles-dual-contribution-unified-ocsr-method-and-page-level-benchmarks">MolMole&rsquo;s Dual Contribution: Unified OCSR Method and Page-Level Benchmarks</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong <strong>Resource</strong> contribution.</p>
<p>It functions as a <strong>Method</strong> paper because it introduces &ldquo;MolMole,&rdquo; a unified deep learning framework that integrates molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline. It validates this method through extensive comparisons against state-of-the-art baselines like DECIMER and OpenChemIE.</p>
<p>It also serves as a <strong>Resource</strong> paper because the authors construct and release a novel page-level benchmark dataset of 550 annotated pages (patents and articles) to address the lack of standardized evaluation metrics for full-page chemical extraction.</p>
<h2 id="addressing-the-limitations-of-fragmented-processing">Addressing the Limitations of Fragmented Processing</h2>
<p>The rapid accumulation of chemical literature has trapped valuable molecular and reaction data in unstructured formats like images and PDFs. Extracting this manually is time-consuming, while existing AI frameworks have significant limitations:</p>
<ul>
<li><strong>DECIMER</strong>: Cannot process reaction diagrams at all.</li>
<li><strong>OpenChemIE</strong>: Relies on external layout parser models to crop elements before processing. This dependence often leads to detection failures in documents with complex layouts.</li>
<li><strong>Generative Hallucination</strong>: Existing generative OCSR models (like MolScribe) are prone to &ldquo;hallucinating&rdquo; structures or failing on complex notations like polymers.</li>
</ul>
<h2 id="a-unified-vision-pipeline-for-layout-aware-detection">A Unified Vision Pipeline for Layout-Aware Detection</h2>
<p>MolMole introduces several architectural and workflow innovations:</p>
<ul>
<li><strong>Direct Page-Level Processing</strong>: Unlike OpenChemIE, MolMole processes full document pages directly without requiring an external layout parser, which improves robustness on complex layouts like two-column patents.</li>
<li><strong>Unified Vision Pipeline</strong>: It integrates three specialized vision models into one workflow:
<ul>
<li><strong>ViDetect</strong>: A DINO-based object detector for identifying molecular regions.</li>
<li><strong>ViReact</strong>: An RxnScribe-based model adapted for full-page reaction parsing.</li>
<li><strong>ViMore</strong>: A detection-based OCSR model that explicitly predicts atoms and bonds.</li>
</ul>
</li>
<li><strong>Hallucination Mitigation</strong>: By using a detection-based approach (ViMore), the model avoids hallucinating chemical structures and provides confidence scores.</li>
<li><strong>Advanced Notation Support</strong>: The system explicitly handles &ldquo;wavy bonds&rdquo; (variable attachments in patents) and polymer bracket notations, which confuse standard SMILES-based models.</li>
</ul>
<h2 id="page-level-benchmark-evaluation-and-unified-metrics">Page-Level Benchmark Evaluation and Unified Metrics</h2>
<p>The authors evaluated the framework on both a newly curated benchmark and existing public datasets:</p>
<ul>
<li><strong>New Benchmark Creation</strong>: They curated 550 pages (300 patents, 250 articles) fully annotated with bounding boxes, reaction roles (reactant, product, condition), and MOLfiles.</li>
<li><strong>Baselines</strong>: MolMole was compared against <strong>DECIMER 2.0</strong>, <strong>OpenChemIE</strong>, and <strong>ReactionDataExtractor 2.0</strong>.</li>
<li><strong>OCSR Benchmarking</strong>: ViMore was evaluated against DECIMER, MolScribe, and MolGrapher on four public datasets: <strong>USPTO</strong>, <strong>UOB</strong>, <strong>CLEF</strong>, and <strong>JPO</strong>.</li>
<li><strong>Metric Proposal</strong>: They introduced a combined &ldquo;End-to-End&rdquo; metric that modifies standard object detection Precision/Recall to strictly require correct SMILES conversion for a &ldquo;True Positive&rdquo;.</li>
</ul>
<p>$$ \text{True Positive (End-to-End)} = ( \text{IoU} \geq 0.5 ) \land ( \text{SMILES}_{\text{gt}} == \text{SMILES}_{\text{pred}} ) $$</p>
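<p>As a rough illustration, the combined criterion can be sketched in Python (function names here are mine, not from the paper; SMILES strings are assumed to be canonicalized upstream, e.g. with RDKit):</p>

```python
def iou(box_a, box_b):
    # Boxes are (x_min, y_min, x_max, y_max) tuples.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_end_to_end_tp(pred_box, gt_box, pred_smiles, gt_smiles, iou_thresh=0.5):
    # A detection counts as a true positive only if the box overlaps the
    # ground truth sufficiently AND the recognized structure matches.
    return iou(pred_box, gt_box) >= iou_thresh and pred_smiles == gt_smiles
```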
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Page-Level Performance</strong>: On the new benchmark, MolMole achieved F1 scores of <strong>89.1%</strong> (Patents) and <strong>86.8%</strong> (Articles) for the combined detection-to-conversion task, compared to 73.8% and 67.3% for DECIMER and 68.8% and 70.6% for OpenChemIE (Table 4).</li>
<li><strong>Reaction Parsing</strong>: ViReact achieved soft-match F1 scores of <strong>98.0%</strong> on patents and <strong>97.0%</strong> on articles, compared to 82.2% and 82.9% for the next best model, RxnScribe (w/o LP). Hard-match F1 scores were 92.5% (patents) and 84.6% (articles).</li>
<li><strong>Public Benchmarks</strong>: ViMore outperformed competitors on 3 out of 4 public OCSR datasets (CLEF, JPO, USPTO).</li>
<li><strong>Layout Handling</strong>: The authors demonstrated that MolMole successfully handles multi-column reaction diagrams where cropping-based models fail and faithfully preserves layout geometry in generated MOLfiles.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://lgai-ddu.github.io/molmole/">MolMole Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo and project information</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training Data</strong>: The models (ViDetect and ViMore) were trained on <strong>private/proprietary datasets</strong>, which is a limitation for full reproducibility from scratch.</li>
<li><strong>Benchmark Data</strong>: The authors introduce a test set of <strong>550 pages</strong> (3,897 molecules, 1,022 reactions) derived from patents and scientific articles. This dataset is stated to be made &ldquo;publicly available&rdquo;.</li>
<li><strong>Public Evaluation Data</strong>: Standard OCSR datasets used include USPTO (5,719 images), UOB (5,740 images), CLEF (992 images), and JPO (450 images).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pipeline Workflow</strong>: PDF → PNG Images → Parallel execution of <strong>ViDetect</strong> and <strong>ViReact</strong> → Cropping of molecular regions → <strong>ViMore</strong> conversion → Output (JSON/Excel).</li>
<li><strong>Post-Processing</strong>:
<ul>
<li><em>ViDetect</em>: Removes overlapping proposals based on confidence scores and size constraints.</li>
<li><em>ViReact</em>: Refines predictions by correcting duplicates and removing empty entities.</li>
<li><em>ViMore</em>: Assembles detected atom/bond information into structured representations (MOLfile).</li>
</ul>
</li>
</ul>
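<p>Since the code is not yet released, the workflow above can only be sketched with placeholder callables standing in for ViDetect, ViReact, and ViMore (the names, signatures, and return shapes here are illustrative assumptions, not the actual API):</p>

```python
def process_page(page_image, vi_detect, vi_react, vi_more):
    """Sketch of the MolMole page-level workflow with placeholder models."""
    # Stage 1: detection and reaction parsing both run on the full page,
    # with no external layout parser in between.
    mol_boxes = vi_detect(page_image)
    reactions = vi_react(page_image)
    # Stage 2: each detected molecular region is cropped and converted
    # to a structured representation (MOLfile) by the OCSR model.
    molecules = [{"bbox": box, "molfile": vi_more(page_image, box)}
                 for box in mol_boxes]
    return {"molecules": molecules, "reactions": reactions}
```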
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture Basis</th>
          <th>Task</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ViDetect</strong></td>
          <td>DINO (DETR-based)</td>
          <td>Molecule Detection</td>
          <td>End-to-end training; avoids slow autoregressive methods.</td>
      </tr>
      <tr>
          <td><strong>ViReact</strong></td>
          <td>RxnScribe</td>
          <td>Reaction Parsing</td>
          <td>Operates on full pages; autoregressive decoder for structured sequence generation.</td>
      </tr>
      <tr>
          <td><strong>ViMore</strong></td>
          <td>Custom Vision Model</td>
          <td>OCSR</td>
          <td>Detection-based (predicts atom/bond regions).</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Molecule Detection</strong>: Evaluated using COCO metrics (AP, AR, F1) at IoU thresholds 0.50-0.95.</li>
<li><strong>Molecule Conversion</strong>: Evaluated using SMILES exact match accuracy and Tanimoto similarity.</li>
<li><strong>Combined Metric</strong>: A custom metric where a True Positive requires both IoU $\geq$ 0.5 and a correct SMILES string match where $\text{SMILES}_{\text{gt}} == \text{SMILES}_{\text{pred}}$.</li>
<li><strong>Reaction Parsing</strong>: Evaluated using <strong>Hard Match</strong> (all components correct) and <strong>Soft Match</strong> (molecular entities only, ignoring text labels).</li>
</ul>
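<p>The Tanimoto similarity used for molecule conversion reduces to a set operation on fingerprint bits. A minimal sketch, with fingerprints as plain sets of &ldquo;on&rdquo; bit indices (in practice one would compute RDKit Morgan fingerprints first):</p>

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto similarity between two fingerprints given as sets of
    # "on" bit indices: |A intersect B| / |A union B|.
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # two empty fingerprints are trivially identical
    return len(a & b) / len(a | b)
```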
<h3 id="missing-components">Missing Components</h3>
<ul>
<li><strong>Source code</strong>: Not publicly released. The paper states the toolkit &ldquo;will be accessible soon through an interactive demo on the LG AI Research website.&rdquo; For commercial use, the authors direct inquiries to contact <a href="mailto:ddu@lgresearch.ai">ddu@lgresearch.ai</a>.</li>
<li><strong>Training data</strong>: ViDetect and ViMore are trained on proprietary datasets. Training code and data are not available.</li>
<li><strong>Hardware requirements</strong>: Not specified in the paper.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chun, S., Kim, J., Jo, A., Jo, Y., Oh, S., et al. (2025). MolMole: Molecule Mining from Scientific Literature. <em>arXiv preprint arXiv:2505.03777</em>. <a href="https://doi.org/10.48550/arXiv.2505.03777">https://doi.org/10.48550/arXiv.2505.03777</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://lgai-ddu.github.io/molmole/">Project Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chun2025molmole,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolMole: Molecule Mining from Scientific Literature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chun, Sehyun and Kim, Jiye and Jo, Ahra and Jo, Yeonsik and Oh, Seungyul and Lee, Seungjun and Ryoo, Kwangrok and Lee, Jongmin and Kim, Seung Hwan and Kang, Byung Jun and Lee, Soonyoung and Park, Jun Ha and Moon, Chanwoo and Ham, Jiwon and Lee, Haein and Han, Heejae and Byun, Jaeseung and Do, Soojong and Ha, Minju and Kim, Dongyun and Bae, Kyunghoon and Lim, Woohyung and Lee, Edward Hwayoung and Park, Yongmin and Yu, Jeongsang and Jo, Gerrard Jeongwon and Hong, Yeonjung and Yoo, Kyungjae and Han, Sehui and Lee, Jaewan and Park, Changyoung and Jeon, Kijeong and Yi, Sihyuk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2505.03777}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGrapher: Graph-based Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</guid><description>A graph-based deep learning approach for optical chemical structure recognition that outperforms image captioning methods.</description><content:encoded><![CDATA[<h2 id="1-contribution--type">1. Contribution / Type</h2>
<p>This is primarily a <strong>Methodological</strong> paper that proposes a novel neural architecture (MolGrapher), shifting the paradigm of Optical Chemical Structure Recognition (OCSR) from image captioning back to graph reconstruction. It also has a significant <strong>Resource</strong> component, releasing a synthetic data generation pipeline and a new large-scale benchmark (USPTO-30K) to address the scarcity of annotated real-world data.</p>
<h2 id="2-motivation">2. Motivation</h2>
<p>The automatic analysis of chemical literature is critical for accelerating drug and material discovery, but much of this information is locked in 2D images of molecular structures.</p>
<ul>
<li><strong>Problem:</strong> Existing rule-based methods are rigid, while recent deep learning methods based on &ldquo;image captioning&rdquo; (predicting <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings) struggle with complex molecules and fail to exploit the natural graph structure of molecules.</li>
<li><strong>Gap:</strong> There is a lack of diverse, annotated real-world training data, and captioning models suffer from &ldquo;hallucinations&rdquo; where they predict valid SMILES that do not match the image.</li>
</ul>
<h2 id="3-novelty--core-innovation">3. Novelty / Core Innovation</h2>
<p>MolGrapher introduces a <strong>graph-based deep learning pipeline</strong> that explicitly models the molecule&rsquo;s geometry and topology.</p>
<ul>
<li><strong>Supergraph Concept:</strong> It first detects all atom keypoints and builds a &ldquo;supergraph&rdquo; of all plausible bonds.</li>
<li><strong>Hybrid Approach:</strong> It combines a ResNet-based keypoint detector with a Graph Neural Network (GNN) that classifies both atom nodes and bond nodes within the supergraph context. Both atoms and bonds are represented as nodes, with edges only connecting atom nodes to bond nodes.</li>
<li><strong>Synthetic Pipeline:</strong> A data generation pipeline that renders molecules with varying styles (fonts, bond widths) and augmentations (pepper patches, random lines, captions) to simulate real document noise.</li>
</ul>
<p>At the core of the Keypoint Detector&rsquo;s performance is the <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss. Since pixels without an atom drastically outnumber pixels containing an atom, WAHR loss is designed to counter the class imbalance. For ground-truth heatmap $y$ and prediction $p$:</p>
<p>$$ L_{WAHR}(p, y) = \sum_i \alpha_{y_i} (p_i - y_i)^2 $$</p>
<p>where $\alpha_{y_i}$ dynamically down-weights easily classified background pixels.</p>
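<p>The idea can be illustrated with a weighted MSE over heatmap pixels; note that the fixed background weight used below is a deliberate simplification of the paper's adaptive scheme, chosen only to show why re-weighting counters the class imbalance:</p>

```python
def weighted_heatmap_loss(pred, target, bg_weight=0.1):
    # Illustrative weighted MSE in the spirit of WAHR: pixels with no atom
    # (target == 0) are down-weighted so the abundant background does not
    # dominate the loss over the rare atom pixels.
    loss = 0.0
    for p, y in zip(pred, target):
        alpha = 1.0 if y > 0 else bg_weight
        loss += alpha * (p - y) ** 2
    return loss
```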
<h2 id="4-methodology--experiments">4. Methodology &amp; Experiments</h2>
<p>The authors evaluated MolGrapher against both rule-based (OSRA, MolVec) and deep learning baselines (DECIMER, Img2Mol, Image2Graph).</p>
<ul>
<li><strong>Benchmarks:</strong> Evaluated on standard datasets: USPTO, Maybridge UoB, CLEF-2012, and JPO.</li>
<li><strong>New Benchmark:</strong> Introduced and tested on <strong>USPTO-30K</strong>, split into clean, abbreviated, and large molecule subsets.</li>
<li><strong>Ablations:</strong> Analyzed the impact of synthetic augmentations, keypoint loss functions, supergraph connectivity radius, and GNN layers.</li>
<li><strong>Robustness:</strong> Tested on perturbed images (rotations, shearing) to mimic scanned patent quality.</li>
</ul>
<p>The GNN iteratively updates node embeddings through layers $\{g^k\}_{k \in [1, N]}$, where $e^{k+1} = g^k(e^k)$. Final predictions are obtained via two MLPs (one for atoms, one for bonds): $p_i = \text{MLP}_t(e_i^N)$, where $p_i \in \mathbb{R}^{C_t}$ contains the logits for atom or bond classes.</p>
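<p>The update-and-readout scheme can be sketched abstractly (the layer and head callables below are placeholders, not the paper's actual architecture):</p>

```python
def gnn_forward(embeddings, layers, atom_head, bond_head, node_types):
    # Iteratively refine node embeddings: e^{k+1} = g^k(e^k).
    for g in layers:
        embeddings = g(embeddings)
    # Type-specific readout: one MLP head for atom nodes, one for bond nodes.
    logits = []
    for e, t in zip(embeddings, node_types):
        head = atom_head if t == "atom" else bond_head
        logits.append(head(e))
    return logits
```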
<h2 id="5-results--conclusions">5. Results &amp; Conclusions</h2>
<p>MolGrapher achieved the highest accuracy among synthetic-only deep learning methods on most benchmarks tested.</p>
<ul>
<li><strong>Accuracy:</strong> It achieved <strong>91.5%</strong> accuracy on USPTO, outperforming all other synthetic-only deep learning methods including ChemGrapher (80.9%), Graph Generation (67.0%), and DECIMER 2.0 (61.0%).</li>
<li><strong>Large Molecules:</strong> It demonstrated superior scaling, correctly recognizing large molecules (USPTO-10K-L) where image captioning methods like Img2Mol failed completely (0.0% accuracy).</li>
<li><strong>Generalization:</strong> The method proved robust to image perturbations and style variations without requiring fine-tuning on real data. The paper acknowledges that MolGrapher cannot recognize Markush structures (depictions of sets of molecules with positional and frequency variation indicators).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model relies on synthetic data for training due to the scarcity of annotated real-world images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic Data</td>
          <td>300,000 images</td>
          <td>Generated from PubChem SMILES using RDKit. Augmentations include pepper patches, random lines, and variable bond styles.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>USPTO-30K</td>
          <td>30,000 images</td>
          <td>Created by authors from USPTO patents (2001-2020). Subsets: 10K clean, 10K abbreviated, 10K large (&gt;70 atoms).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Standard Benchmarks</td>
          <td>Various</td>
          <td>USPTO (5,719), Maybridge UoB (5,740), CLEF-2012 (992), JPO (450).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of three distinct algorithmic stages:</p>
<ol>
<li>
<p><strong>Keypoint Detection</strong>:</p>
<ul>
<li>Predicts a heatmap of atom locations using a CNN.</li>
<li>Thresholds heatmaps at the bottom 10th percentile and uses a $5\times5$ window for local maxima.</li>
<li>Uses <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss to handle class imbalance (background vs. atoms).</li>
</ul>
</li>
<li>
<p><strong>Supergraph Construction</strong>:</p>
<ul>
<li>Connects every detected keypoint to neighbors within a radius of $3 \times$ the estimated bond length.</li>
<li>Prunes edges with no filled pixels or if obstructed by a third keypoint.</li>
<li>Keeps a maximum of 6 bond candidates per atom.</li>
</ul>
</li>
<li>
<p><strong>Superatom Recognition</strong>:</p>
<ul>
<li>Detects &ldquo;superatom&rdquo; nodes (abbreviations like <code>COOH</code>).</li>
<li>Uses <strong>PP-OCR</strong> to transcribe the text at these node locations.</li>
</ul>
</li>
</ol>
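<p>The supergraph construction step can be approximated in a few lines. This sketch implements only the radius and per-atom candidate-count rules; the paper's pixel-based pruning of empty or obstructed edges is omitted:</p>

```python
import math

def candidate_bonds(keypoints, bond_length, radius_factor=3.0, max_per_atom=6):
    # Connect each detected atom keypoint to neighbours within
    # radius_factor x the estimated bond length, keeping at most
    # max_per_atom nearest candidates per atom.
    radius = radius_factor * bond_length
    edges = set()
    for i, p in enumerate(keypoints):
        dists = sorted(
            (math.dist(p, q), j)
            for j, q in enumerate(keypoints) if j != i
        )
        for d, j in dists[:max_per_atom]:
            if d <= radius:
                edges.add((min(i, j), max(i, j)))  # undirected edge
    return sorted(edges)
```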
<h3 id="models">Models</h3>
<p>The architecture utilizes standard backbones tailored for specific sub-tasks:</p>
<ul>
<li><strong>Keypoint Detector</strong>: <strong>ResNet-18</strong> backbone with $8\times$ dilation to preserve spatial resolution.</li>
<li><strong>Node Classifier</strong>: <strong>ResNet-50</strong> backbone with $2\times$ dilation for extracting visual features at node locations.</li>
<li><strong>Graph Neural Network</strong>: A custom GNN that updates node embeddings based on visual features and neighborhood context. The initial node embedding combines the visual feature vector $v_i$ and a learnable type encoding $w_{t_i}$.</li>
<li><strong>Readout</strong>: MLPs classify nodes into atom types (e.g., C, O, N) and bond types (No Bond, Single, Double, Triple).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Accuracy is defined strictly: the predicted molecule must have an identical <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> string to the ground truth. Stereochemistry and Markush structures are excluded from evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>MolGrapher Score</th>
          <th>Best DL Baseline (Synthetic)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>USPTO</td>
          <td><strong>91.5%</strong></td>
          <td>80.9% (ChemGrapher)</td>
          <td>Full USPTO benchmark</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>USPTO-10K-L</td>
          <td><strong>31.4%</strong></td>
          <td>0.0% (Img2Mol)</td>
          <td>Large molecules (&gt;70 atoms)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>JPO</td>
          <td><strong>67.5%</strong></td>
          <td>64.0% (DECIMER 2.0)</td>
          <td>Challenging, low-quality images</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: Trained on 3 NVIDIA A100 GPUs.</li>
<li><strong>Training Time</strong>: 20 epochs.</li>
<li><strong>Optimization</strong>: Adam optimizer, learning rate 0.0001, decayed by 0.8 after 5000 iterations.</li>
<li><strong>Loss Weighting</strong>: Atom classifier loss weighted by 1; bond classifier loss weighted by 3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MolGrapher">DS4SD/MolGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training and inference scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Title</strong>: MolGrapher: Graph-based Visual Recognition of Chemical Structures</p>
<p><strong>Authors</strong>: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valéry Weber, Ingmar Meijer, Peter Staar, Fisher Yu</p>
<p><strong>Citation</strong>: Morin, L., Danelljan, M., Agea, M. I., Nassar, A., Weber, V., Meijer, I., Staar, P., &amp; Yu, F. (2023). MolGrapher: Graph-based Visual Recognition of Chemical Structures. <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, 19552-19561.</p>
<p><strong>Publication</strong>: ICCV 2023</p>
<p><strong>Links</strong>:</p>
<ul>
<li><a href="https://openaccess.thecvf.com/content/ICCV2023/html/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.html">Paper</a></li>
<li><a href="https://github.com/DS4SD/MolGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMolGrapherGraphbasedVisual2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolGrapher}}: {{Graph-based Visual Recognition}} of {{Chemical Structures}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolGrapher}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{IEEE}}/{{CVF International Conference}} on {{Computer Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valéry and Meijer, Ingmar and Staar, Peter and Yu, Fisher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{19552--19561}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICCV51070.2023.01791}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-18}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MMSSC-Net: Multi-Stage Sequence Cognitive Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</guid><description>A deep learning model for Optical Chemical Structure Recognition (OCSR) using SwinV2 and GPT-2 to convert molecular images to SMILES.</description><content:encoded><![CDATA[<h2 id="contribution-a-multi-stage-architectural-pipeline">Contribution: A Multi-Stage Architectural Pipeline</h2>
<p><strong>Methodological Paper</strong>.
The paper proposes a deep learning architecture (<strong>MMSSC-Net</strong>) for Optical Chemical Structure Recognition (OCSR). It focuses on architectural innovation, specifically combining a SwinV2 visual encoder with a GPT-2 decoder, and validates this method through extensive benchmarking against existing rule-based and deep-learning baselines. It includes ablation studies to justify the choice of the visual encoder.</p>
<h2 id="motivation-addressing-noise-and-rigid-image-recognition">Motivation: Addressing Noise and Rigid Image Recognition</h2>
<ul>
<li><strong>Data Usage Gap</strong>: Drug discovery relies heavily on scientific literature, but molecular structures are often locked in vector graphics or images that computers cannot easily process.</li>
<li><strong>Limitations of Prior Work</strong>: Existing Rule-based methods are rigid and sensitive to noise. Previous Deep Learning approaches (Encoder-Decoder &ldquo;Image Captioning&rdquo; styles) often lack precision, interpretability, and struggle with varying image resolutions or large molecules.</li>
<li><strong>Need for &ldquo;Cognition&rdquo;</strong>: The authors argue that treating the image as a single isolated whole is insufficient; a model needs to &ldquo;perceive&rdquo; fine-grained details (atoms and bonds) to handle noise and varying pixel qualities effectively.</li>
</ul>
<h2 id="novelty-a-fine-grained-perception-pipeline">Novelty: A Fine-Grained Perception Pipeline</h2>
<ul>
<li><strong>Multi-Stage Cognitive Architecture</strong>: MMSSC-Net splits the task into stages:
<ol>
<li><strong>Fine-grained Perception</strong>: Detecting atom and bond sequences (including spatial coordinates) using SwinV2.</li>
<li><strong>Graph Construction</strong>: Assembling these into a molecular graph.</li>
<li><strong>Sequence Evolution</strong>: Converting the graph into a machine-readable format (SMILES).</li>
</ol>
</li>
<li><strong>Hybrid Transformer Model</strong>: It combines a hierarchical vision transformer (<strong>SwinV2</strong>) for encoding with a generative pre-trained transformer (<strong>GPT-2</strong>) and MLPs for decoding atomic and bond targets.</li>
<li><strong>Robustness Mechanisms</strong>: The inclusion of random noise sequences during training to improve generalization to new molecular targets.</li>
</ul>
<h2 id="methodology-and-benchmarks">Methodology and Benchmarks</h2>
<ul>
<li><strong>Baselines</strong>: Compared against eight other tools:
<ul>
<li><em>Rule-based</em>: MolVec, OSRA.</li>
<li><em>Image-Smiles (DL)</em>: ABC-Net, Img2Mol, MolMiner.</li>
<li><em>Image-Graph-Smiles (DL)</em>: Image-To-Graph, MolScribe, ChemGrapher.</li>
</ul>
</li>
<li><strong>Datasets</strong>: Evaluated on 5 diverse datasets: STAKER (synthetic), USPTO, CLEF, JPO, and UOB (real-world).</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Accuracy</strong>: Exact string match of the predicted SMILES.</li>
<li><strong>Tanimoto Similarity</strong>: Chemical similarity using Morgan fingerprints.</li>
</ul>
</li>
<li><strong>Ablation Study</strong>: Tested different visual encoders (Swin Transformer, ViT-B, ResNet-50) to validate the choice of SwinV2.</li>
<li><strong>Resolution Sensitivity</strong>: Tested model performance across image resolutions from 256px to 2048px.</li>
</ul>
<h2 id="results-and-core-outcomes">Results and Core Outcomes</h2>
<ul>
<li><strong>Strong Performance</strong>: MMSSC-Net achieved 75-98% accuracy across datasets, outperforming baselines on most benchmarks, and exceeded 94% accuracy on the first three (intra-domain and real-world) datasets.</li>
<li><strong>Resolution Robustness</strong>: The model maintained relatively stable accuracy across varying image resolutions, whereas baselines like Img2Mol showed greater sensitivity to resolution changes (Fig. 4 in the paper).</li>
<li><strong>Efficiency</strong>: The SwinV2 encoder was noted to be more efficient than ViT-B in this context.</li>
<li><strong>Limitations</strong>: The model struggles with stereochemistry, specifically confusing dashed wedge bonds with solid wedge bonds and misclassifying single bonds as solid wedge bonds. It also has difficulty with &ldquo;irrelevant text&rdquo; noise (e.g., unexpected symbols in JPO and DECIMER datasets).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combination of PubChem and USPTO data, augmented to handle visual variability.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>PubChem</strong></td>
          <td>1,000,000</td>
          <td>Converted from <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> to SMILES; random sampling.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO</strong></td>
          <td>600,000</td>
          <td>Patent images; converted from MOL to SMILES.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>STAKER</strong></td>
          <td>40,000</td>
          <td>Synthetic; Avg res $256 \times 256$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>USPTO</strong></td>
          <td>4,862</td>
          <td>Real; Avg res $721 \times 432$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>CLEF</strong></td>
          <td>881</td>
          <td>Real; Avg res $1245 \times 412$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>JPO</strong></td>
          <td>380</td>
          <td>Real; Avg res $614 \times 367$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>UOB</strong></td>
          <td>5,720</td>
          <td>Real; Avg res $759 \times 416$.</td>
      </tr>
  </tbody>
</table>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Image</strong>: Random perturbations using RDKit/Indigo (rotation, filling, cropping, bond thickness/length, font size, Gaussian noise).</li>
<li><strong>Molecular</strong>: Introduction of functional group abbreviations and R-substituents (dummy atoms) using SMARTS templates.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Target Sequence Formulation</strong>: The model predicts a sequence containing bounding box coordinates and type labels: $\{y_{\text{min}}, x_{\text{min}}, y_{\text{max}}, x_{\text{max}}, C_{n}\}$.</li>
<li><strong>Loss Function</strong>: Token-level cross-entropy, i.e. maximum-likelihood estimation of the target sequence:
$$ \max \sum_{i=1}^{N} \sum_{j=1}^{L} \omega_{j} \log P(t_{j}^{i} \mid x_{1}^{i}, x_{2}^{i}, \dots, x_{M}^{i}, t_{1}^{i}, \dots, t_{j-1}^{i}) $$</li>
<li><strong>Noise Injection</strong>: A random sequence $T_r$ is appended to the target sequence during training to improve robustness and generalization to unseen targets.</li>
<li><strong>Graph Construction</strong>: Atoms ($v$) and bonds ($e$) are recognized separately; bonds are defined by connecting spatial atomic coordinates.</li>
</ul>
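<p>The graph-construction step can be sketched as nearest-atom matching: each predicted bond endpoint snaps to the closest predicted atom center. A minimal illustration (not the authors' code; the coordinates below are invented):</p>

```python
import math

def build_graph(atoms, bonds):
    """Connect predicted bonds to their nearest predicted atoms.

    atoms: list of (symbol, (x, y)) box centers from the atom decoder.
    bonds: list of (bond_type, (x1, y1), (x2, y2)) endpoint pairs.
    Returns edges as (atom_index_a, atom_index_b, bond_type).
    """
    def nearest(pt):
        # Index of the atom center closest to this bond endpoint.
        return min(range(len(atoms)),
                   key=lambda i: math.dist(atoms[i][1], pt))

    edges = []
    for btype, p1, p2 in bonds:
        a, b = nearest(p1), nearest(p2)
        if a != b:  # discard degenerate bonds that snap to one atom
            edges.append((a, b, btype))
    return edges

atoms = [("C", (0.0, 0.0)), ("C", (1.0, 0.0)), ("O", (2.0, 0.1))]
bonds = [("single", (0.1, 0.0), (0.9, 0.0)),
         ("double", (1.1, 0.0), (1.9, 0.1))]
print(build_graph(atoms, bonds))  # [(0, 1, 'single'), (1, 2, 'double')]
```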
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer V2</strong>.
<ul>
<li>Pre-trained on ImageNet-1K.</li>
<li>Window size: $16 \times 16$.</li>
<li>Parameters: 88M.</li>
<li>Input resolution: $256 \times 256$.</li>
<li>Features: Scaled cosine attention; log-space continuous position bias.</li>
</ul>
</li>
<li><strong>Decoder</strong>: <strong>GPT-2</strong> + <strong>MLP</strong>.
<ul>
<li><strong>GPT-2</strong>: Used for recognizing atom types.
<ul>
<li>Layers: 24.</li>
<li>Attention Heads: 12.</li>
<li>Hidden Dimension: 768.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
<li><strong>MLP</strong>: Used for classifying bond types (single, double, triple, aromatic, solid wedge, dashed wedge).</li>
</ul>
</li>
<li><strong>Vocabulary</strong>:
<ul>
<li>Standard: 95 common numbers/characters ([0], [C], [=], etc.).</li>
<li>Extended: 2000 SMARTS-based characters for isomers/groups (e.g., &ldquo;[C2F5]&rdquo;, &ldquo;[halo]&rdquo;).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ol>
<li><strong>Accuracy</strong>: Exact match of the generated SMILES string.</li>
<li><strong>Tanimoto Similarity</strong>: Similarity of Morgan fingerprints between predicted and ground truth molecules.</li>
</ol>
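<p>Tanimoto similarity on fingerprints reduces to set overlap of on-bits. A stdlib-only sketch (in practice RDKit Morgan fingerprints would supply the bit sets; the bit indices below are made up):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(a & b) / len(a | b)

# Hypothetical on-bit indices for a prediction and its ground truth.
pred  = {3, 17, 42, 99, 255}
truth = {3, 17, 42, 128, 255}
print(round(tanimoto(pred, truth), 3))  # 0.667
```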
<p><strong>Key Results (Accuracy, %)</strong>; entries marked * are not reported:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MMSSC-Net</th>
          <th>MolVec (Rule)</th>
          <th>ABC-Net (DL)</th>
          <th>MolScribe (DL)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Indigo</strong></td>
          <td>98.14</td>
          <td>95.63</td>
          <td>96.4</td>
          <td>97.5</td>
      </tr>
      <tr>
          <td><strong>RDKit</strong></td>
          <td>94.91</td>
          <td>86.7</td>
          <td>98.3</td>
          <td>93.8</td>
      </tr>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>94.24</td>
          <td>88.47</td>
          <td>*</td>
          <td>92.6</td>
      </tr>
      <tr>
          <td><strong>CLEF</strong></td>
          <td>91.26</td>
          <td>81.61</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>92.71</td>
          <td>81.32</td>
          <td>96.1</td>
          <td>87.9</td>
      </tr>
      <tr>
          <td><strong>Staker</strong></td>
          <td>89.44</td>
          <td>4.49</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>75.48</td>
          <td>66.8</td>
          <td>*</td>
          <td>76.2</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch Size: 128.</li>
<li>Learning Rate: $4 \times 10^{-5}$.</li>
<li>Epochs: 40.</li>
</ul>
</li>
<li><strong>Inference Speed</strong>: The SwinV2 encoder demonstrated higher efficiency (faster inference time) compared to ViT-B and ResNet-50 baselines during ablation.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Wzew5Lp/MMSSCNet">MMSSCNet (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation; includes training and prediction scripts</td>
      </tr>
  </tbody>
</table>
<p>The paper is published in RSC Advances (open access). Source code is available on GitHub, though the repository has minimal documentation and no explicit license. The training data comes from PubChem (public) and USPTO (public patent data). Pre-trained model weights do not appear to be released. No specific GPU hardware or training time is reported in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, D., Zhao, D., Wang, Z., Li, J., &amp; Li, J. (2024). MMSSC-Net: multi-stage sequence cognitive networks for drug molecule recognition. <em>RSC Advances</em>, 14(26), 18182-18191. <a href="https://doi.org/10.1039/D4RA02442G">https://doi.org/10.1039/D4RA02442G</a></p>
<p><strong>Publication</strong>: RSC Advances 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangMMSSCNetMultistageSequence2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MMSSC-Net: Multi-Stage Sequence Cognitive Networks for Drug Molecule Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MMSSC-Net}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Dehai and Zhao, Di and Wang, Zhengwu and Li, Junhui and Li, Jin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{RSC Advances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{26}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{18182--18191}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D4RA02442G}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.rsc.org/en/content/articlelanding/2024/ra/d4ra02442g}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MarkushGrapher: Multi-modal Markush Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</guid><description>Multi-modal transformer combining vision, text, and layout encoding to extract complex Markush structures from patent documents with OCSR.</description><content:encoded><![CDATA[<h2 id="overcoming-unimodal-limitations-for-markush-structures">Overcoming Unimodal Limitations for Markush Structures</h2>
<p>The automated analysis of chemical literature, particularly patents, is critical for drug discovery and material science. A major bottleneck is the extraction of <strong>Markush structures</strong>, which are complex chemical templates that represent families of molecules using a core backbone image and textual variable definitions. Existing methods are limited because they either rely solely on images (OCSR) and miss the textual context, or focus solely on text and miss the structural backbone. This creates a practical need for a unified, multi-modal approach that jointly interprets visual and textual data to accurately extract these structures for prior-art search and database construction. This paper proposes a <strong>Method</strong> and introduces a new <strong>Resource</strong> (M2S dataset) to bridge this gap.</p>
<h2 id="markushgrapher-the-multi-modal-architecture">MarkushGrapher: The Multi-Modal Architecture</h2>
<p>The core innovation is <strong>MarkushGrapher</strong>, a multi-modal architecture that jointly encodes image, text, and layout information. Key contributions include:</p>
<ul>
<li><strong>Dual-Encoder Architecture</strong>: Combines a Vision-Text-Layout (VTL) encoder (based on UDOP) with a specialized, pre-trained Optical Chemical Structure Recognition (OCSR) encoder (MolScribe). Let $E_{\text{VTL}}$ represent the combined sequence embedding and $E_{\text{OCSR}}$ represent the domain-specific visual embeddings.</li>
<li><strong>Joint Recognition</strong>: The model autoregressively generates a sequential graph representation (Optimized CXSMILES) and a substituent table simultaneously. It uses cross-modal dependencies, allowing text to clarify ambiguous visual details like bond types.</li>
<li><strong>Synthetic Data Pipeline</strong>: A comprehensive pipeline generates realistic synthetic Markush structures (images and text) from PubChem data, overcoming the lack of labeled training data.</li>
<li><strong>Optimized Representation</strong>: A compacted version of CXSMILES moves variable groups into the SMILES string and adds explicit atom indexing to handle complex &ldquo;frequency&rdquo; and &ldquo;position&rdquo; variation indicators.</li>
</ul>
<h2 id="experimental-validation-on-the-new-m2s-benchmark">Experimental Validation on the New M2S Benchmark</h2>
<p>The authors validated their approach using the following setup:</p>
<ul>
<li><strong>Baselines</strong>: Compared against image-only chemistry models (DECIMER, MolScribe) and general-purpose multi-modal models (Uni-SMART, GPT-4o, Pixtral, Llama-3.2).</li>
<li><strong>Datasets</strong>: Evaluated on three benchmarks:
<ol>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 generated samples.</li>
<li><strong>M2S</strong>: A new benchmark of 103 manually annotated real-world patent images.</li>
<li><strong>USPTO-Markush</strong>: 74 Markush backbone images from USPTO patents.</li>
</ol>
</li>
<li><strong>Ablation Studies</strong>: Analyzed the impact of the OCSR encoder, late fusion strategies, and the optimized CXSMILES format. Late fusion improved USPTO-Markush EM from 23% (VTL only) to 32% (Table 3). Removing R-group compression dropped M2S EM from 38% to 30%, and removing atom indexing dropped USPTO-Markush EM from 32% to 24% (Table 4).</li>
</ul>
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Performance</strong>: MarkushGrapher outperformed all baselines. On the M2S benchmark, it achieved 38% Exact Match on CXSMILES (compared to 21% for MolScribe) and 29% Exact Match on tables. On USPTO-Markush, it reached 32% CXSMILES EM versus 7% for MolScribe.</li>
<li><strong>Markush Feature Recognition</strong>: The model can recognize complex Markush features like frequency variation (&lsquo;Sg&rsquo;) and position variation (&lsquo;m&rsquo;) indicators. DECIMER and MolScribe scored 0% on both &lsquo;m&rsquo; and &lsquo;Sg&rsquo; sections (Table 2), while MarkushGrapher achieved 76% on &lsquo;m&rsquo; and 31% on &lsquo;Sg&rsquo; sections on M2S.</li>
<li><strong>Cross-Modal Reasoning</strong>: Qualitative analysis showed the model can correctly infer visual details (such as bond order) that appear ambiguous in the image but become apparent with the text description.</li>
<li><strong>Robustness</strong>: The model generalizes well to real-world data despite being trained purely on synthetic data. On augmented versions of M2S and USPTO-Markush simulating low-quality scanned documents, it maintained 31% and 32% CXSMILES EM respectively (Table 6).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>The authors note several limitations:</p>
<ul>
<li>MarkushGrapher does not currently handle abbreviations in chemical structures (e.g., &lsquo;OG&rsquo; for oxygen connected to a variable group).</li>
<li>The model relies on ground-truth OCR cells as input, requiring an external OCR model for practical deployment.</li>
<li>Substituent definitions that combine text with interleaved chemical structure drawings are not supported.</li>
<li>The model is trained to predict &lsquo;m&rsquo; sections connecting to all atoms in a cycle, which can technically violate valence constraints, though the output contains enough information to reconstruct only valid connections.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong></p>
<ul>
<li><strong>Source</strong>: Synthetic dataset generated from PubChem SMILES.</li>
<li><strong>Size</strong>: 210,000 synthetic images.</li>
<li><strong>Pipeline</strong>:
<ol>
<li><strong>Selection</strong>: Sampled SMILES from PubChem based on substructure diversity.</li>
<li><strong>Augmentation</strong>: SMILES augmented to artificial CXSMILES using RDKit (inserting variable groups, frequency indicators).</li>
<li><strong>Rendering</strong>: Images rendered using Chemistry Development Kit (CDK) with randomized drawing parameters (font, bond width, spacing).</li>
<li><strong>Text Generation</strong>: Textual definitions generated using manual templates extracted from patents; 10% were paraphrased using Mistral-7B-Instruct-v0.3 to increase diversity.</li>
<li><strong>OCR</strong>: Bounding boxes extracted via a custom SVG parser aligned with MOL files.</li>
</ol>
</li>
</ul>
<p><strong>Evaluation Data</strong></p>
<ul>
<li><strong>M2S Dataset</strong>: 103 images from USPTO, EPO, and WIPO patents (1999-2023), manually annotated with CXSMILES and substituent tables.</li>
<li><strong>USPTO-Markush</strong>: 74 images from USPTO patents (2010-2016).</li>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 samples generated via the pipeline.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimized CXSMILES</strong>:
<ul>
<li><strong>Compression</strong>: Variable groups moved from the extension block to the main SMILES string as special atoms to reduce sequence length.</li>
<li><strong>Indexing</strong>: Atom indices appended to each atom (e.g., <code>C:1</code>) to explicitly link the graph to the extension block (crucial for <code>m</code> and <code>Sg</code> sections).</li>
<li><strong>Vocabulary</strong>: Specific tokens used for atoms and bonds.</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Standard image augmentations (shift, scale, blur, pepper noise, random lines) and OCR text augmentations (character substitution/insertion/deletion).</li>
</ul>
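<p>The atom-indexing idea can be illustrated with a toy SMILES tokenizer that appends <code>:i</code> to each atom, matching the <code>C:1</code> style quoted above (illustrative only; real CXSMILES handling is considerably richer):</p>

```python
import re

# Minimal SMILES atom tokenizer: bracket atoms, two-letter halogens,
# then single-letter organic-subset / aromatic atoms.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOSPFI]|[bcnops]")

def index_atoms(smiles):
    """Append ':i' indices to each atom, in the spirit of the paper's
    optimized CXSMILES (a simplified sketch, not the authors' scheme)."""
    out, i, pos = [], 0, 0
    for m in ATOM.finditer(smiles):
        out.append(smiles[pos:m.start()])  # keep bonds, rings, branches
        i += 1
        out.append(f"{m.group()}:{i}")
        pos = m.end()
    out.append(smiles[pos:])
    return "".join(out)

print(index_atoms("CC(=O)O"))  # C:1C:2(=O:3)O:4
```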
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer.
<ul>
<li><strong>VTL Encoder</strong>: T5-large encoder (initialized from UDOP) that processes image patches, text tokens, and layout (bounding boxes).</li>
<li><strong>OCSR Encoder</strong>: Vision encoder from MolScribe (Swin Transformer), frozen during training.</li>
<li><strong>Text Decoder</strong>: T5-large decoder.</li>
</ul>
</li>
<li><strong>Fusion Strategy</strong>: <strong>Late Fusion</strong>. The VTL output $e_1$ (over vision $v$, text $t$, and layout $l$) is concatenated with the MLP-projected OCSR output $e_2$ before decoding:
$$ e = e_1(v, t, l) \oplus \text{MLP}(e_2(v)) $$</li>
<li><strong>Parameters</strong>: 831M total (744M trainable).</li>
</ul>
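<p>The fusion equation can be sketched with toy embeddings, reading $\oplus$ as concatenation along the sequence dimension (one plausible interpretation; the dimensions and weights below are invented):</p>

```python
def mlp_project(vec, weights, bias):
    """Toy single linear layer standing in for the MLP projection."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def late_fusion(e1, e2, weights, bias):
    """e = e1 (VTL tokens) concatenated with MLP(e2) (OCSR tokens),
    per the fusion equation; toy dimensions, not the paper's."""
    projected = [mlp_project(tok, weights, bias) for tok in e2]
    return e1 + projected  # joined token sequence fed to the decoder

e1 = [[1.0, 0.0], [0.0, 1.0]]            # two VTL token embeddings (d=2)
e2 = [[2.0, 2.0, 2.0]]                   # one OCSR token embedding (d=3)
W  = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]]  # 3 -> 2 projection
b  = [0.0, 0.0]
print(late_fusion(e1, e2, W, b))  # [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```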
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>CXSMILES Exact Match (EM)</strong>: Requires perfect match of SMILES string, variable groups, <code>m</code> sections, and <code>Sg</code> sections (ignoring stereochemistry).</li>
<li><strong>Tanimoto Score</strong>: Similarity of RDKit DayLight fingerprints (Markush features removed).</li>
<li><strong>Table Exact Match</strong>: All variable groups and substituents must match.</li>
<li><strong>Table F1-Score</strong>: Aggregated recall and precision of substituents per variable group.</li>
</ul>
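<p>The Table F1-Score can be read as micro-averaged precision and recall over the substituent sets of each variable group. A hedged sketch (the paper's exact aggregation may differ; the example groups are invented):</p>

```python
def table_f1(pred, truth):
    """Micro-averaged F1 over substituent sets per variable group.

    pred/truth: dict mapping a variable group (e.g. 'R1') to a set of
    substituent strings. One plausible reading of the paper's metric.
    """
    tp = fp = fn = 0
    for group in set(pred) | set(truth):
        p, t = pred.get(group, set()), truth.get(group, set())
        tp += len(p & t)  # substituents predicted and present
        fp += len(p - t)  # spurious predictions
        fn += len(t - p)  # missed substituents
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = {"R1": {"H", "CH3", "OCH3"}, "R2": {"Cl", "F"}}
pred  = {"R1": {"H", "CH3"},         "R2": {"Cl", "Br"}}
print(round(table_f1(pred, truth), 3))  # 0.667
```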
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Trained on a single NVIDIA H100 GPU.</li>
<li><strong>Training Config</strong>: 10 epochs, batch size of 10, Adam optimizer, learning rate 5e-4, 100 warmup steps, weight decay 1e-3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Weber, V., Nassar, A., Meijer, G. I., Van Gool, L., Li, Y., &amp; Staar, P. (2025). MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures. <em>2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 14505-14515. <a href="https://doi.org/10.1109/CVPR52734.2025.01352">https://doi.org/10.1109/CVPR52734.2025.01352</a></p>
<p><strong>Publication</strong>: CVPR 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMarkushGrapherJointVisual2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MarkushGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Weber, Valéry and Nassar, Ahmed and Meijer, Gerhard Ingmar and Van Gool, Luc and Li, Yawei and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14505--14515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/CVPR52734.2025.01352}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2InChI: SwinTransformer for Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</guid><description>Deep learning model using improved SwinTransformer encoder and attention-based feature fusion to convert molecular images to InChI strings.</description><content:encoded><![CDATA[<h2 id="image2inchi-as-a-methodological-innovation">Image2InChI as a Methodological Innovation</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>. It proposes a specific new deep learning architecture (&ldquo;Image2InChI&rdquo;) to solve the task of Optical Chemical Structure Recognition (OCSR). The rhetorical focus is on engineering a system that outperforms baselines on specific metrics (InChI accuracy, MCS accuracy) and providing a valuable reference for future algorithmic work.</p>
<h2 id="bottlenecks-in-chemical-literature-digitization">Bottlenecks in Chemical Literature Digitization</h2>
<p>The accurate digitization of chemical literature is a bottleneck in AI-driven drug discovery. Chemical structures in patents and papers exist as optical images (pixels), but machine learning models require machine-readable string representations (like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>). Efficiently and automatically bridging this gap is a prerequisite for large-scale data mining in chemistry.</p>
<h2 id="hierarchical-swintransformer-and-attention-integration">Hierarchical SwinTransformer and Attention Integration</h2>
<p>The core novelty is the <strong>Image2InChI</strong> architecture, which integrates:</p>
<ol>
<li><strong>Improved SwinTransformer Encoder</strong>: Uses a hierarchical vision transformer to capture image features.</li>
<li><strong>Feature Fusion with Attention</strong>: A novel network designed to integrate image patch features with InChI prediction steps.</li>
<li><strong>End-to-End InChI Prediction</strong>: The architecture frames the problem as a direct image-to-sequence translation targeting InChI strings directly, diverging from techniques predicting independent graph components. The model is optimized using a standard Cross-Entropy Loss over the token vocabulary:
$$ \mathcal{L}_{\text{CE}} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{X}) $$
where $\mathbf{X}$ represents the input image features, $y_t$ is the predicted token, and $T$ is the sequence length.</li>
</ol>
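<p>Numerically, the loss above is just a sum of negative log-probabilities of the reference tokens. A toy illustration with invented per-step distributions:</p>

```python
import math

def sequence_ce(probs, target):
    """Cross-entropy of a target token sequence.

    probs: per-step dicts mapping token -> model probability.
    target: the reference token sequence.
    """
    return -sum(math.log(p_t[tok]) for p_t, tok in zip(probs, target))

# Toy 3-step distributions over a tiny vocabulary.
probs = [{"C": 0.9, "O": 0.1},
         {"C": 0.2, "O": 0.8},
         {"<eos>": 0.95, "C": 0.05}]
target = ["C", "O", "<eos>"]
print(round(sequence_ce(probs, target), 4))  # 0.3798
```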
<h2 id="benchmarking-on-the-bms-dataset">Benchmarking on the BMS Dataset</h2>
<ul>
<li><strong>Benchmark Validation</strong>: The model was trained and tested on the <strong>BMS1000 (Bristol-Myers Squibb)</strong> dataset from a Kaggle competition.</li>
<li><strong>Ablation/Comparative Analysis</strong>: The authors compared their method against other models in the supplement.</li>
<li><strong>Preprocessing Validation</strong>: They justified their choice of denoising algorithms (8-neighborhood vs. Gaussian/Mean) to ensure preservation of bond lines while removing &ldquo;spiky point noise&rdquo;.</li>
</ul>
<h2 id="high-inchi-recognition-metrics">High InChI Recognition Metrics</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved <strong>99.8% InChI accuracy</strong>, 94.8% Maximum Common Substructure (MCS) accuracy, and 96.2% Longest Common Subsequence (LCS) accuracy on the benchmarked dataset. It remains to be seen how well these models generalize to heavily degraded real-world patent images.</li>
<li><strong>Effective Denoising</strong>: The authors concluded that <strong>eight-neighborhood filtering</strong> is superior to mean or Gaussian filtering for this specific domain because it removes isolated noise points without blurring the fine edges of chemical bonds.</li>
<li><strong>Open Source</strong>: The authors stated their intention to release the code, though no public repository has been identified.</li>
</ul>
<hr>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition</td>
          <td>Bristol-Myers Squibb Molecular Translation competition dataset</td>
      </tr>
  </tbody>
</table>
<p>No public code repository has been identified for Image2InChI despite the authors&rsquo; stated intent to release it.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The primary dataset used is the <strong>BMS (Bristol-Myers Squibb) Dataset</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>Kaggle Competition (BMS-Molecular-Translation)</td>
      </tr>
      <tr>
          <td><strong>Total Size</strong></td>
          <td>2.4 million images</td>
      </tr>
      <tr>
          <td><strong>Training Set</strong></td>
          <td>1.8 million images</td>
      </tr>
      <tr>
          <td><strong>Test Set</strong></td>
          <td>0.6 million images</td>
      </tr>
      <tr>
          <td><strong>Content</strong></td>
          <td>Each image corresponds to a unique International Chemical Identifier (<a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>)</td>
      </tr>
  </tbody>
</table>
<p><strong>Other Datasets</strong>: The authors also utilized JPO (Japanese Patent Office), CLEF (CLEF-IP 2012), UOB (MolrecUOB), and USPTO datasets for broader benchmarking.</p>
<p><strong>Preprocessing Pipeline</strong>:</p>
<ol>
<li><strong>Denoising</strong>: <strong>Eight-neighborhood filtering</strong> (threshold &lt; 4 non-white pixels) is used to remove salt-and-pepper noise while preserving bond lines. Mean and Gaussian filtering were rejected due to blurring.</li>
<li><strong>Sequence Padding</strong>:
<ul>
<li>Analysis showed max InChI length &lt; 270.</li>
<li>Fixed sequence length set to <strong>300</strong>.</li>
<li>Tokens: <code>&lt;sos&gt;</code> (190), <code>&lt;eos&gt;</code> (191), <code>&lt;pad&gt;</code> (192) used for padding/framing.</li>
</ul>
</li>
<li><strong>Numerization</strong>: Characters are mapped to integers based on a fixed vocabulary (e.g., &lsquo;C&rsquo; -&gt; 178, &lsquo;H&rsquo; -&gt; 182).</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Eight-Neighborhood Filtering (Denoising)</strong>:</p>
<p>Pseudocode logic:</p>
<ul>
<li>Iterate through every pixel.</li>
<li>Count non-white neighbors in the 3x3 grid (8 neighbors).</li>
<li>If count &lt; threshold (default 4), treat as noise and remove.</li>
</ul>
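<p>The pseudocode above translates directly into a grid filter (a self-contained sketch on a binary list-of-lists image, not the authors' implementation):</p>

```python
def eight_neighborhood_filter(img, threshold=4):
    """Remove isolated dark pixels from a binary image (1 = ink, 0 = white).

    A dark pixel survives only if at least `threshold` of its 8
    neighbors are also dark, matching the denoising rule described above.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            neighbors = sum(img[ny][nx]
                            for ny in range(max(0, y - 1), min(h, y + 2))
                            for nx in range(max(0, x - 1), min(w, x + 2))
                            if (ny, nx) != (y, x))
            if neighbors < threshold:
                out[y][x] = 0  # isolated speck: treat as noise
    return out

noisy = [[0, 0, 0, 0, 1],   # lone speck at top-right
         [1, 1, 1, 1, 0],   # interior of a solid bond line survives
         [1, 1, 1, 1, 0],
         [1, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
clean = eight_neighborhood_filter(noisy)
print(clean[0][4], clean[2][1])  # 0 1
```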
<p><strong>InChI Tokenization</strong>:</p>
<ul>
<li>InChI strings are split into character arrays.</li>
<li>Example: Vitamin C <code>InChI=1S/C6H8O6...</code> becomes <code>[&lt;sos&gt;, C, 6, H, 8, O, 6, ..., &lt;eos&gt;, &lt;pad&gt;...]</code>.</li>
<li>Mapped to integer tensor for model input.</li>
</ul>
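<p>Putting tokenization, framing, and padding together (the <code>&lt;sos&gt;</code>/<code>&lt;eos&gt;</code>/<code>&lt;pad&gt;</code> ids and the &lsquo;C&rsquo;/&lsquo;H&rsquo; mappings come from the paper; the remaining vocabulary entries are hypothetical):</p>

```python
def tokenize_inchi(inchi, vocab, max_len=300, sos=190, eos=191, pad=192):
    """Map an InChI string to a fixed-length integer sequence.

    Character-level split, framed with <sos>/<eos> and padded to
    max_len, using the token ids quoted in the paper. `vocab` maps
    characters to integers (the full table is not reproduced here,
    so part of this toy vocab is illustrative).
    """
    ids = [sos] + [vocab[ch] for ch in inchi] + [eos]
    if len(ids) > max_len:
        raise ValueError("sequence longer than fixed length")
    return ids + [pad] * (max_len - len(ids))

# 'C' -> 178 and 'H' -> 182 per the paper; the digit/'O' ids are invented.
vocab = {"C": 178, "H": 182, "6": 6, "8": 8, "O": 15}
row = tokenize_inchi("C6H8O6", vocab, max_len=12)
print(row)  # [190, 178, 6, 182, 8, 15, 6, 191, 192, 192, 192, 192]
```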
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image2InChI</p>
<ul>
<li><strong>Encoder</strong>: Improved SwinTransformer (Hierarchical Vision Transformer).</li>
<li><strong>Decoder</strong>: Transformer Decoder with patch embedding.</li>
<li><strong>Fusion</strong>: A novel &ldquo;feature fusion network with attention&rdquo; integrates the visual tokens with the sequence generation process.</li>
<li><strong>Framework</strong>: PyTorch 1.8.1.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>InChI Acc</strong>: Exact match accuracy of the predicted InChI string (Reported: 99.8%).</li>
<li><strong>MCS Acc</strong>: Maximum Common Substructure accuracy (structural similarity) (Reported: 94.8%).</li>
<li><strong>LCS Acc</strong>: Longest Common Subsequence accuracy (string similarity) (Reported: 96.2%).</li>
<li><strong>Morgan FP</strong>: Morgan Fingerprint similarity (Reported: 94.1%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Specification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GPU</strong></td>
          <td>NVIDIA Tesla P100 (16GB VRAM)</td>
      </tr>
      <tr>
          <td><strong>Platform</strong></td>
          <td>MatPool cloud platform</td>
      </tr>
      <tr>
          <td><strong>CPU</strong></td>
          <td>Intel Xeon Gold 6271</td>
      </tr>
      <tr>
          <td><strong>RAM</strong></td>
          <td>32GB System Memory</td>
      </tr>
      <tr>
          <td><strong>Driver</strong></td>
          <td>NVIDIA-SMI 440.100</td>
      </tr>
      <tr>
          <td><strong>OS</strong></td>
          <td>Ubuntu 18.04</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, D., Xu, X., Pan, J., Gao, W., &amp; Zhang, S. (2024). Image2InChI: Automated Molecular Optical Image Recognition. <em>Journal of Chemical Information and Modeling</em>, 64(9), 3640-3649. <a href="https://doi.org/10.1021/acs.jcim.3c02082">https://doi.org/10.1021/acs.jcim.3c02082</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></li>
</ul>
<p><strong>Note</strong>: These notes are based on the Abstract and Supporting Information files only.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2024image2inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Image2InChI: Automated Molecular Optical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Da-zhou and Xu, Xin and Pan, Jia-heng and Gao, Wei and Zhang, Shi-rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3640--3649}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.3c02082}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Enhanced DECIMER for Hand-Drawn Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</guid><description>An improved encoder-decoder model (EfficientNetV2 + Transformer) converts hand-drawn chemical structures into SMILES strings using synthetic training data.</description><content:encoded><![CDATA[<h2 id="method-contribution-architectural-optimization">Method Contribution: Architectural Optimization</h2>
<p>This is a <strong>Method</strong> paper. It proposes an enhanced neural network architecture (EfficientNetV2 + Transformer) specifically designed to solve the problem of recognizing hand-drawn chemical structures. The primary contribution is architectural optimization and a data-driven training strategy, validated through ablation studies (comparing encoders) and benchmarked against existing rule-based and deep learning tools.</p>
<h2 id="motivation-digitizing-dark-chemical-data">Motivation: Digitizing &ldquo;Dark&rdquo; Chemical Data</h2>
<p>Chemical information in legacy laboratory notebooks and modern tablet-based inputs often exists as hand-drawn sketches.</p>
<ul>
<li><strong>Gap:</strong> Existing Optical Chemical Structure Recognition (OCSR) tools (particularly rule-based ones) lack robustness and fail when images have variability in style, line thickness, or noise.</li>
<li><strong>Need:</strong> There is a critical need for automated tools to digitize this &ldquo;dark data&rdquo; effectively to preserve it and make it machine-readable and searchable.</li>
</ul>
<h2 id="core-innovation-decoder-only-design-and-synthetic-scaling">Core Innovation: Decoder-Only Design and Synthetic Scaling</h2>
<p>The core novelty is the <strong>architectural enhancement</strong> and <strong>synthetic training strategy</strong>:</p>
<ol>
<li><strong>Decoder-Only Transformer:</strong> Using only the decoder part of the Transformer (instead of a full encoder-decoder Transformer) improved average accuracy across OCSR benchmarks from 61.28% to 69.27% (Table 3 in the paper).</li>
<li><strong>EfficientNetV2 Integration:</strong> Replacing standard CNNs or EfficientNetV1 with <strong>EfficientNetV2-M</strong> provided better feature extraction and 2x faster training speeds.</li>
<li><strong>Scale of Synthetic Data:</strong> The authors demonstrate that scaling synthetic training data (up to 152 million images generated by RanDepict) directly correlates with improved generalization to real-world hand-drawn images, without ever training on real hand-drawn data.</li>
</ol>
<h2 id="experimental-setup-ablation-and-real-world-baselines">Experimental Setup: Ablation and Real-World Baselines</h2>
<ul>
<li><strong>Model Selection (Ablation):</strong> Tested three architectures (EfficientNetV2-M + Full Transformer, EfficientNetV1-B7 + Decoder-only, EfficientNetV2-M + Decoder-only) on standard benchmarks (JPO, CLEF, USPTO, UOB).</li>
<li><strong>Data Scaling:</strong> Trained the best model on four progressively larger datasets (from 4M to 152M images) to measure performance gains.</li>
<li><strong>Real-World Benchmarking:</strong> Validated the final model on the <strong>DECIMER Hand-drawn dataset</strong> (5,088 real images drawn by volunteers) and compared against 9 other tools (OSRA, MolVec, Img2Mol, MolScribe, etc.).</li>
</ul>
<h2 id="results-and-conclusions-strong-accuracy-on-hand-drawn-scans">Results and Conclusions: Strong Accuracy on Hand-Drawn Scans</h2>
<ul>
<li><strong>Strong Performance:</strong> The final DECIMER model achieved <strong>99.72% valid predictions</strong> and <strong>73.25% exact accuracy</strong> on the hand-drawn benchmark. The next best non-DECIMER tool was MolGrapher at 10.81% accuracy, followed by MolScribe at 7.65%.</li>
<li><strong>Robustness:</strong> Deep learning methods outperform rule-based methods (which scored 3% or less accuracy) on hand-drawn data.</li>
<li><strong>Data Saturation:</strong> Quadrupling the dataset from 38M to 152M images yielded only marginal gains (about 3 percentage points in accuracy), suggesting current synthetic data strategies may be hitting a plateau.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official TensorFlow implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained hand-drawn model weights</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/decimer/">DECIMER PyPi Package</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Installable Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic hand-drawn image generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The model was trained entirely on <strong>synthetic data</strong> generated using the <a href="https://github.com/OBrink/RanDepict">RanDepict</a> toolkit. No real hand-drawn images were used for training.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Molecules</th>
          <th>Total Images</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>4,375,338</td>
          <td>1 augmented + 1 clean per molecule</td>
      </tr>
      <tr>
          <td>2</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>13,126,014</td>
          <td>2 augmented + 4 clean per molecule</td>
      </tr>
      <tr>
          <td>3</td>
          <td>PubChem</td>
          <td>9,510,000</td>
          <td>38,040,000</td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
      <tr>
          <td>4</td>
          <td>PubChem</td>
          <td>38,040,000</td>
          <td><strong>152,160,000</strong></td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
  </tbody>
</table>
<p>A separate <strong>model selection</strong> experiment used a 1,024,000-molecule subset of ChEMBL to compare the three architectures (Table 1 in the paper). The <strong>DECIMER Hand-Drawn</strong> evaluation dataset consists of 5,088 real hand-drawn images from 23 volunteers.</p>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings filtered to fewer than 300 characters.</li>
<li>Images resized to $512 \times 512$.</li>
<li>Images generated with and without &ldquo;hand-drawn style&rdquo; augmentations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization:</strong> SMILES split by heavy atoms, brackets, bond symbols, and special characters. Start <code>&lt;start&gt;</code> and end <code>&lt;end&gt;</code> tokens added; padded with <code>&lt;pad&gt;</code>.</li>
<li><strong>Optimization:</strong> Adam optimizer with a custom learning rate schedule (as specified in the original Transformer paper). A dropout rate of 0.1 was used.</li>
<li><strong>Loss Function:</strong> Trained using focal loss to address class imbalance for rare tokens. The focal loss formulation reduces the relative loss for well-classified examples:
$$
\text{FL}(p_{\text{t}}) = -\alpha_{\text{t}} (1 - p_{\text{t}})^\gamma \log(p_{\text{t}})
$$</li>
<li><strong>Augmentations:</strong> RanDepict applied synthetic distortions to mimic handwriting (wobbly lines, variable thickness, etc.).</li>
</ul>
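<p>The tokenization step described above can be sketched with a regular expression. This is an illustrative reconstruction, not the paper's exact token inventory: the pattern below covers bracket atoms, common two-letter atoms, bonds, and ring/branch symbols, and the <code>pad</code> helper mirrors the padding scheme:</p>

```python
import re

# Hypothetical token pattern: bracket atoms kept whole, then two-letter
# atoms, then single-character atoms, bonds, branches, and ring digits.
TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|[BCNOPSFIbcnops]|"
    r"[=#\-\+\\/\(\)\.%@]|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens and add start/end markers."""
    tokens = TOKEN_PATTERN.findall(smiles)
    return ["<start>"] + tokens + ["<end>"]

def pad(tokens: list[str], max_len: int) -> list[str]:
    """Right-pad a token sequence with <pad> up to max_len."""
    return tokens + ["<pad>"] * (max_len - len(tokens))
```

<p>Alternation order matters: multi-character tokens such as <code>Br</code> and <code>@@</code> must precede the single-character alternatives, or they would be split apart.</p>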
<h3 id="models">Models</h3>
<p>The final architecture (Model 3) is an Encoder-Decoder structure:</p>
<ul>
<li><strong>Encoder:</strong> <strong>EfficientNetV2-M</strong> (pretrained ImageNet backbone).
<ul>
<li>Input: $512 \times 512 \times 3$ image.</li>
<li>Output Features: $16 \times 16 \times 512$ (reshaped to sequence length 256, dimension 512).</li>
<li><em>Note:</em> The final fully connected layer of the CNN is removed.</li>
</ul>
</li>
<li><strong>Decoder:</strong> <strong>Transformer (Decoder-only)</strong>.
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 8</li>
<li>Embedding Dimension: 512</li>
</ul>
</li>
<li><strong>Output:</strong> Predicted SMILES string token by token.</li>
</ul>
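<p>The reshape from the encoder's feature map to the decoder's input sequence is purely mechanical; a minimal sketch with nested lists, using the shapes listed above:</p>

```python
def features_to_sequence(fmap):
    """Flatten an H x W x C feature map into an (H*W) x C token sequence."""
    return [channel_vec for row in fmap for channel_vec in row]

# A 16 x 16 grid of 512-dim feature vectors becomes a sequence of 256 tokens.
fmap = [[[0.0] * 512 for _ in range(16)] for _ in range(16)]
seq = features_to_sequence(fmap)
assert len(seq) == 256 and len(seq[0]) == 512
```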
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used for evaluation:</p>
<ol>
<li><strong>Valid Predictions (%):</strong> Percentage of outputs that are syntactically valid SMILES.</li>
<li><strong>Exact Match Accuracy (%):</strong> Canonical SMILES string identity.</li>
<li><strong>Tanimoto Similarity:</strong> Fingerprint similarity (PubChem fingerprints) between ground truth and prediction.</li>
</ol>
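<p>The first two metrics can be computed in a few lines, assuming a canonicalization backend is available (in practice RDKit; here <code>canonicalize</code> is an injected callback so the sketch stays dependency-free):</p>

```python
from typing import Callable, Optional

def evaluate(preds: list, refs: list,
             canonicalize: Callable[[str], Optional[str]]):
    """Return (valid %, exact match %) for predicted vs. reference SMILES.

    `canonicalize` maps a valid SMILES to its canonical form and returns
    None for invalid input (e.g., RDKit MolFromSmiles + MolToSmiles).
    """
    valid = exact = 0
    for pred, ref in zip(preds, refs):
        canon = canonicalize(pred)
        if canon is not None:
            valid += 1
            if canon == canonicalize(ref):
                exact += 1
    n = len(refs)
    return 100.0 * valid / n, 100.0 * exact / n
```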
<p><strong>Data Scaling Results (Hand-Drawn Dataset, Table 4 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Training Images</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (ChEMBL)</td>
          <td>4,375,338</td>
          <td>96.21%</td>
          <td>5.09%</td>
          <td>0.490</td>
      </tr>
      <tr>
          <td>2 (ChEMBL)</td>
          <td>13,126,014</td>
          <td>97.41%</td>
          <td>26.08%</td>
          <td>0.690</td>
      </tr>
      <tr>
          <td>3 (PubChem)</td>
          <td>38,040,000</td>
          <td>99.67%</td>
          <td>70.34%</td>
          <td>0.939</td>
      </tr>
      <tr>
          <td>4 (PubChem)</td>
          <td>152,160,000</td>
          <td>99.72%</td>
          <td>73.25%</td>
          <td>0.942</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with Other Tools (Hand-Drawn Dataset, Table 5 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>OCSR Tool</th>
          <th>Method</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER (Ours)</strong></td>
          <td>Deep Learning</td>
          <td><strong>99.72%</strong></td>
          <td><strong>73.25%</strong></td>
          <td><strong>0.94</strong></td>
      </tr>
      <tr>
          <td>DECIMER.ai</td>
          <td>Deep Learning</td>
          <td>96.07%</td>
          <td>26.98%</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>Deep Learning</td>
          <td>99.94%</td>
          <td>10.81%</td>
          <td>0.51</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>Deep Learning</td>
          <td>95.66%</td>
          <td>7.65%</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>Deep Learning</td>
          <td>98.96%</td>
          <td>5.25%</td>
          <td>0.52</td>
      </tr>
      <tr>
          <td>SwinOCSR</td>
          <td>Deep Learning</td>
          <td>97.37%</td>
          <td>5.11%</td>
          <td>0.64</td>
      </tr>
      <tr>
          <td>ChemGrapher</td>
          <td>Deep Learning</td>
          <td>69.56%</td>
          <td>N/A</td>
          <td>0.09</td>
      </tr>
      <tr>
          <td>Imago</td>
          <td>Rule-based</td>
          <td>43.14%</td>
          <td>2.99%</td>
          <td>0.22</td>
      </tr>
      <tr>
          <td>MolVec</td>
          <td>Rule-based</td>
          <td>71.86%</td>
          <td>1.30%</td>
          <td>0.23</td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td>Rule-based</td>
          <td>54.66%</td>
          <td>0.57%</td>
          <td>0.17</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Google Cloud TPU v4-128 pod slice.</li>
<li><strong>Training Time:</strong>
<ul>
<li>EfficientNetV2-M model trained ~2x faster than EfficientNetV1-B7.</li>
<li>Average training time per epoch: 34 minutes (for Model 3 on 1M dataset subset).</li>
</ul>
</li>
<li><strong>Epochs:</strong> Models trained for 25 epochs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H.O., Zielesny, A., &amp; Steinbeck, C. (2024). Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. <em>Journal of Cheminformatics</em>, 16, 78. <a href="https://doi.org/10.1186/s13321-024-00872-7">https://doi.org/10.1186/s13321-024-00872-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://pypi.org/project/decimer/">PyPi Package</a></li>
<li><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanAdvancementsHanddrawnChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Advancements in Hand-Drawn Chemical Structure Recognition through an Enhanced {{DECIMER}} Architecture}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{78}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00872-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Dual-Path Global Awareness Transformer (DGAT) for OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</guid><description>A Transformer-based OCSR model introducing dual-path modules (CGFE and SDGLA) to improve global context awareness and complex motif recognition.</description><content:encoded><![CDATA[<h2 id="contribution-type-deep-learning-method-for-ocsr">Contribution Type: Deep Learning Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>The classification is based on the proposal of a novel deep learning architecture (DGAT) designed to address specific limitations in existing Optical Chemical Structure Recognition (OCSR) systems. The contribution is validated through benchmarking against external baselines (DeepOCSR, DECIMER, SwinOCSR) and ablation studies that isolate the impact of the new modules.</p>
<h2 id="motivation-addressing-global-context-loss">Motivation: Addressing Global Context Loss</h2>
<p>Existing multimodal fusion methods for OCSR suffer from limited awareness of global context.</p>
<ul>
<li><strong>Problem</strong>: Models often generate erroneous sequences when processing complex motifs, such as rings or long chains, due to a disconnect between local feature extraction and global structural understanding.</li>
<li><strong>Gap</strong>: Current architectures struggle to capture the &ldquo;fine-grained differences between global and local features,&rdquo; leading to topological errors.</li>
<li><strong>Practical Need</strong>: Accurate translation of chemical images to machine-readable sequences (SMILES/SELFIES) is critical for materials science and AI-guided chemical research.</li>
</ul>
<h2 id="core-innovation-dual-path-global-awareness-transformer">Core Innovation: Dual-Path Global Awareness Transformer</h2>
<p>The authors propose the <strong>Dual-Path Global Awareness Transformer (DGAT)</strong>, which redesigns the decoder with two novel mechanisms to better handle global context:</p>
<ol>
<li>
<p><strong>Cascaded Global Feature Enhancement (CGFE)</strong>: This module bridges cross-modal gaps by emphasizing global context. It concatenates global visual features with sequence features and processes them through a Cross-Modal Assimilation MLP and an Adaptive Alignment MLP to align multimodal representations. The feature enhancement conceptually computes:</p>
<p>$$ f_{\text{enhanced}} = \text{MLP}_{\text{align}}(\text{MLP}_{\text{assimilate}}([f_{\text{global}}, f_{\text{seq}}])) $$</p>
</li>
<li>
<p><strong>Sparse Differential Global-Local Attention (SDGLA)</strong>: A module that dynamically captures fine-grained differences between global and local features. It uses sequence features (embedded with global info) as queries, while utilizing local and global visual features as keys/values in parallel attention heads to generate initial multimodal features.</p>
</li>
</ol>
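<p>A dimension-level sketch of the CGFE computation above, with plain-Python linear layers standing in for the Cross-Modal Assimilation and Adaptive Alignment MLPs (layer sizes and the ReLU activation are illustrative assumptions, not the paper's specification):</p>

```python
def linear(x, w, b):
    """y = xW + b for a single feature vector x (plain-Python matmul).

    `w` is a list of weight columns, one per output unit.
    """
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def cgfe(f_global, f_seq, w_assim, b_assim, w_align, b_align):
    """f_enhanced = MLP_align(MLP_assimilate([f_global, f_seq]))."""
    fused = f_global + f_seq          # channel-wise concatenation
    hidden = relu(linear(fused, w_assim, b_assim))
    return linear(hidden, w_align, b_align)
```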
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was evaluated on a newly constructed dataset and compared against five major baselines.</p>
<ul>
<li><strong>Baselines</strong>: DeepOCSR, DECIMER 1.0, DECIMER V2, SwinOCSR, and MPOCSR.</li>
<li><strong>Ablation Studies</strong>:
<ul>
<li><strong>Layer Depth</strong>: Tested Transformer depths from 1 to 5 layers; 3 layers proved optimal for balancing gradient flow and parameter sufficiency.</li>
<li><strong>Beam Size</strong>: Tested inference beam sizes 1-5; size 3 achieved the best balance between search depth and redundancy.</li>
<li><strong>Module Contribution</strong>: Validated that removing CGFE results in a drop in structural similarity (Tanimoto), proving the need for pre-fusion alignment.</li>
</ul>
</li>
<li><strong>Robustness Analysis</strong>: Performance broken down by molecule complexity (atom count, ring count, bond count).</li>
<li><strong>Chirality Validation</strong>: Qualitative analysis of attention maps on chiral molecules to verify the model learns stereochemical cues implicitly.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Performance Over Baselines</strong>: DGAT outperformed the MPOCSR baseline across all metrics:
<ul>
<li><strong>BLEU-4</strong>: 84.0% (+5.3 points over MPOCSR)</li>
<li><strong>ROUGE</strong>: 90.8% (+1.9 points)</li>
<li><strong>Tanimoto Similarity</strong>: 98.8% (+1.2 points)</li>
<li><strong>Exact Match Accuracy</strong>: 54.6% (+10.9 points over SwinOCSR)</li>
</ul>
</li>
<li><strong>Chiral Recognition</strong>: The model implicitly recognizes chiral centers (e.g., generating <code>[C@@H1]</code> tokens correctly) based on 2D wedge cues without direct stereochemical supervision.</li>
<li><strong>Limitations</strong>: Performance drops for extreme cases, such as molecules with 4+ rings or 4+ double/triple bonds, due to dataset imbalance. The model still hallucinates branches in highly complex topologies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is primarily drawn from PubChem and augmented to improve robustness.</p>
<ul>
<li><strong>Augmentation Strategy</strong>: Each sequence is rendered as three images, each with randomly sampled depiction parameters.
<ul>
<li><strong>Rotation</strong>: 0, 90, 180, 270, or random [0, 360)</li>
<li><strong>Bond Width</strong>: 1, 2, or 3 pixels</li>
<li><strong>Bond Offset</strong>: Sampled from 0.08-0.18 (inherited from Image2SMILES)</li>
<li><strong>CoordGen</strong>: Enabled with 20% probability</li>
</ul>
</li>
<li><strong>Evaluation Set</strong>: A newly constructed benchmark dataset was used for final reporting.</li>
</ul>
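<p>The parameter sampling can be sketched as follows; the value ranges come from the bullets above, while the function name and the exact sampling scheme are assumptions for illustration:</p>

```python
import random

def sample_render_params(rng: random.Random) -> dict:
    """Sample one set of depiction parameters for a rendered image.

    Ranges follow the augmentation strategy described above; the sampling
    scheme itself (uniform choices) is an assumption.
    """
    rotation = rng.choice([0, 90, 180, 270, rng.uniform(0, 360)])
    return {
        "rotation_deg": rotation,
        "bond_width_px": rng.choice([1, 2, 3]),
        "bond_offset": rng.uniform(0.08, 0.18),
        "use_coordgen": rng.random() < 0.20,  # enabled with 20% probability
    }
```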
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Encoder LR</strong>: $5 \times 10^{-5}$ (Pretrained ResNet-101)</li>
<li><strong>Decoder LR</strong>: $1 \times 10^{-4}$ (Randomly initialized Transformer)</li>
<li><strong>Optimizer</strong>: Not stated explicitly; the reported momentum (0.9) and weight decay (0.0001) suggest SGD with momentum</li>
<li><strong>Batch Size</strong>: 256</li>
</ul>
</li>
<li><strong>Inference</strong>:
<ul>
<li><strong>Beam Search</strong>: A beam size of <strong>3</strong> is used. Larger beam sizes (4-5) degraded BLEU/ROUGE scores due to increased redundancy.</li>
</ul>
</li>
</ul>
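<p>The decoding step can be illustrated with a generic beam-search sketch; the scoring model here is an arbitrary stand-in (in DGAT, next-token log-probabilities come from the Transformer decoder):</p>

```python
def beam_search(step_log_probs, beam_size=3, max_len=10, eos="<end>"):
    """Generic beam search over an autoregressive scoring function.

    `step_log_probs(prefix)` returns a dict {token: log_prob} for the next
    token given the current prefix.
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished beam kept as-is
                continue
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the top `beam_size` hypotheses by cumulative log-prob.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]
```

<p>Widening the beam grows the candidate pool per step, which is why sizes beyond 3 added redundant near-duplicates rather than better hypotheses in the ablation.</p>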
<h3 id="models">Models</h3>
<ul>
<li><strong>Visual Encoder</strong>:
<ul>
<li><strong>Backbone</strong>: ResNet-101 initialized with ImageNet weights</li>
<li><strong>Structure</strong>: Convolutional layers preserved up to the final module. Classification head removed.</li>
<li><strong>Pooling</strong>: A $7 \times 7$ average pooling layer is used to extract global visual features.</li>
</ul>
</li>
<li><strong>Sequence Decoder</strong>:
<ul>
<li><strong>Architecture</strong>: Transformer-based with CGFE and SDGLA modules.</li>
<li><strong>Depth</strong>: 3 Transformer layers</li>
<li><strong>Dropout</strong>: Not utilized</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is reported using sequence-level and structure-level metrics.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">DGAT Score</th>
          <th style="text-align: left">Baseline (MPOCSR)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU-4</strong></td>
          <td style="text-align: left"><strong>84.0%</strong></td>
          <td style="text-align: left">78.7%</td>
          <td style="text-align: left">Measures n-gram precision</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ROUGE</strong></td>
          <td style="text-align: left"><strong>90.8%</strong></td>
          <td style="text-align: left">88.9%</td>
          <td style="text-align: left">Sequence recall metric</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">97.6%</td>
          <td style="text-align: left">Structural similarity fingerprint</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Accuracy</strong></td>
          <td style="text-align: left"><strong>54.6%</strong></td>
          <td style="text-align: left">35.7%</td>
          <td style="text-align: left">Exact structure match rate</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/Drwr97/DGAT">DGAT</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation with training and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, R., Ji, Y., Li, Y., &amp; Lee, S.-T. (2025). Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition. <em>The Journal of Physical Chemistry Letters</em>, 16(50), 12787-12795. <a href="https://doi.org/10.1021/acs.jpclett.5c03057">https://doi.org/10.1021/acs.jpclett.5c03057</a></p>
<p><strong>Publication</strong>: The Journal of Physical Chemistry Letters 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Drwr97/DGAT">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wang2025dgat,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Rui and Ji, Yujin and Li, Youyong and Lee, Shuit-Tong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{The Journal of Physical Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{12787--12795}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jpclett.5c03057}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER.ai: Optical Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/</guid><description>Open-source OCSR platform combining Mask R-CNN segmentation and Transformer recognition, trained on 450M+ synthetic images from RanDepict.</description><content:encoded><![CDATA[<h2 id="project-scope-and-contribution-type">Project Scope and Contribution Type</h2>
<p>This is primarily a <strong>Resource</strong> paper (Infrastructure Basis) with a significant <strong>Method</strong> component.</p>
<p>The primary contribution is DECIMER.ai, a fully open-source platform (web app and Python packages) for the entire chemical structure mining pipeline, filling a gap where most tools were proprietary or fragmented. It also contributes the RanDepict toolkit for massive synthetic data generation.</p>
<p>The secondary methodological contribution proposes and validates a specific deep learning architecture (EfficientNet-V2 encoder + Transformer decoder) that treats chemical structure recognition as an image-to-text translation task (SMILES generation).</p>
<h2 id="the-scarcity-of-machine-readable-chemical-data">The Scarcity of Machine-Readable Chemical Data</h2>
<p><strong>Data Scarcity</strong>: While the number of chemical publications is increasing, most chemical information is locked in non-machine-readable formats (images in PDFs) and is not available in public databases.</p>
<p><strong>Limitations of Existing Tools</strong>: Prior OCSR (Optical Chemical Structure Recognition) tools were largely rule-based (fragile to noise) or proprietary.</p>
<p><strong>Lack of Integration</strong>: There was no existing open-source system that combined segmentation (finding the molecule on a page), classification (confirming it is a molecule), and recognition (translating it to SMILES) into a single workflow.</p>
<h2 id="decimer-architecture-and-novel-image-to-smiles-approach">DECIMER Architecture and Novel Image-to-SMILES Approach</h2>
<p><strong>Comprehensive Workflow</strong>: It is the first open-source platform to integrate segmentation (Mask R-CNN), classification (EfficientNet), and recognition (Transformer) into a unified pipeline.</p>
<p><strong>Data-Driven Approach</strong>: Unlike tools like MolScribe which use intermediate graph representations and rules, DECIMER uses a purely data-driven &ldquo;image-to-SMILES&rdquo; translation approach without hard-coded chemical rules. The core recognition model operates as a sequence-to-sequence generator, mathematically formalizing the task as maximizing the conditional probability of a SMILES sequence given an image.</p>
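<p>In the standard sequence-to-sequence factorization (notation ours), this objective decomposes autoregressively: given an image $I$ and a SMILES token sequence $S = (s_1, \dots, s_T)$,
$$ P(S \mid I) = \prod_{t=1}^{T} P(s_t \mid s_{&lt;t}, I) $$
so the decoder predicts each token conditioned on the image features and all previously generated tokens.</p>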
<p><strong>Massive Synthetic Training</strong>: The use of RanDepict to generate over 450 million synthetic images, covering diverse depiction styles and augmentations (including Markush structures), to train the model from scratch.</p>
<h2 id="benchmarking-and-evaluation-methodology">Benchmarking and Evaluation Methodology</h2>
<p><strong>Benchmarking</strong>: The system was tested against openly available tools (OSRA, MolVec, Imago, Img2Mol, SwinOCSR, MolScribe) on standard datasets: USPTO, UOB, CLEF, JPO, and a custom &ldquo;Hand-drawn&rdquo; dataset.</p>
<p><strong>Robustness Testing</strong>: Performance was evaluated on both clean images and images with added distortions (rotation, shearing) to test the fragility of rule-based systems vs. DECIMER.</p>
<p><strong>Markush Structure Analysis</strong>: Specific evaluation of the model&rsquo;s ability to interpret Markush structures (generic structures with R-groups).</p>
<p><strong>Comparison of Approaches</strong>: A direct comparison with MolScribe by training DECIMER on MolScribe&rsquo;s smaller training set to isolate the impact of architecture vs. data volume.</p>
<h2 id="performance-outcomes-and-key-findings">Performance Outcomes and Key Findings</h2>
<p><strong>Comparative Performance</strong>: DECIMER Image Transformer consistently produced average Tanimoto similarities above 0.95 on in-domain test data and achieved competitive or leading results across external benchmarks, with extremely low rates of catastrophic failure. Tanimoto similarity is calculated based on molecular fingerprints $A$ and $B$ as:
$$ T(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</p>
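<p>For binary fingerprints this formula reduces to $c / (a + b - c)$, where $a$ and $b$ are the on-bit counts of each fingerprint and $c$ the number of shared on-bits; a minimal dependency-free sketch over sets of on-bit indices:</p>

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto similarity of two binary fingerprints given as on-bit sets.

    Equivalent to (A . B) / (|A|^2 + |B|^2 - A . B) for 0/1 vectors.
    """
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)
```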
<p><strong>Data Volume Necessity</strong>: When trained on small datasets, MolScribe (graph/rule-based) outperformed DECIMER. DECIMER&rsquo;s performance advantage relies heavily on its massive training scale (&gt;400M images).</p>
<p><strong>Robustness</strong>: The model showed no performance degradation on distorted images, unlike rule-based legacy tools.</p>
<p><strong>Generalization</strong>: Despite having no hand-drawn images in the training set, the base model recognized 27% of hand-drawn structures perfectly (average Tanimoto 0.69), outperforming all alternative open tools. After fine-tuning with synthetic hand-drawn-like images from RanDepict, perfect predictions increased to 60% (average Tanimoto 0.89).</p>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/OBrink/DECIMER.ai">DECIMER.ai Web App</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Laravel-based web application for the full pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Core OCSR Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image-Segmentation">DECIMER Image Segmentation</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Mask R-CNN segmentation for chemical structures in documents</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Iagea/DECIMER-Image-Classifier">DECIMER Image Classifier</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>EfficientNet-based chemical structure image classifier</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic training data generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The models were trained on synthetic data generated from PubChem molecules.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Generation/Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_1</code></td>
          <td>~108M mols</td>
          <td>PubChem molecules (mass &lt; 1500 Da), processed with RanDepict (v1.0.5). Included image augmentations.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_2</code></td>
          <td>~126M mols</td>
          <td>Included Markush structures generated by pseudo-randomly replacing atoms with R-groups. Image size 299x299.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_3</code></td>
          <td>&gt;453M images</td>
          <td>Re-depicted <code>pubchem_2</code> molecules at <strong>512x512</strong> resolution. Used RanDepict v1.0.8.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>In-domain</td>
          <td>250,000</td>
          <td>Held-out set generated similarly to training data.</td>
      </tr>
      <tr>
          <td><strong>Benchmark</strong></td>
          <td>External</td>
          <td>Various</td>
          <td>USPTO (5719), UOB (5740), CLEF (992), JPO (450), Indigo (50k), Hand-drawn (5088).</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: RanDepict (uses CDK, RDKit, Indigo, PIKAChU)</li>
<li><strong>Augmentations</strong>: Rotation, shearing, noise, pixelation, curved arrows, text labels</li>
<li><strong>Format</strong>: Data saved as TFRecord files for TPU training</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting (atoms, brackets, bonds). Added <code>&lt;start&gt;</code>, <code>&lt;end&gt;</code>, and padded with <code>&lt;pad&gt;</code>. <code>&lt;unk&gt;</code> used for unknown tokens.</li>
<li><strong>Markush Token Handling</strong>: To avoid ambiguity, digits following &lsquo;R&rsquo; (e.g., R1) were replaced with unique non-digit characters during training to distinguish them from ring-closure numbers.</li>
<li><strong>Image Augmentation Pipeline</strong>: Custom RanDepict features (v1.1.4) were used to simulate &ldquo;hand-drawn-like&rdquo; styles, following ChemPix&rsquo;s implementation.</li>
</ul>
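<p>The tokenization and R-group replacement steps above can be sketched as follows. The paper does not publish its exact regular expression or the replacement characters, so both are illustrative assumptions:</p>

```python
import re

# Illustrative SMILES token pattern (the paper's exact regex is not given):
# bracket atoms, two-letter elements, organic-subset atoms, bonds, ring digits.
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|[BCNOPSFIbcnops]|[=#$/\\().+-]|%\d{2}|\d)"
)

def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens and add sequence markers."""
    return ["<start>"] + TOKEN_RE.findall(smiles) + ["<end>"]

def encode_r_groups(smiles: str) -> str:
    """Replace the digit after 'R' (e.g. R1) with a unique non-digit
    character so it cannot be confused with a ring-closure number.
    The placeholder alphabet here is hypothetical."""
    placeholders = "abcdefghij"  # stands in for the paper's unique characters
    return re.sub(r"R(\d)", lambda m: "R" + placeholders[int(m.group(1))], smiles)

print(tokenize("C1=CC=CC=C1"))
print(encode_r_groups("[R1]C[R2]"))  # [Rb]C[Rc]
```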
<h3 id="models">Models</h3>
<p>The platform consists of three distinct models:</p>
<ol>
<li>
<p><strong>DECIMER Segmentation</strong>:</p>
<ul>
<li><strong>Architecture</strong>: Mask R-CNN (TensorFlow 2.10.0 implementation)</li>
<li><strong>Purpose</strong>: Detects and cuts chemical structures from full PDF pages</li>
</ul>
</li>
<li>
<p><strong>DECIMER Image Classifier</strong>:</p>
<ul>
<li><strong>Architecture</strong>: EfficientNet-V1-B0</li>
<li><strong>Input</strong>: 224x224 pixels</li>
<li><strong>Training</strong>: Fine-tuned on ~10.9M images (balanced chemical/non-chemical)</li>
<li><strong>Performance</strong>: AUC 0.99 on in-domain test set</li>
</ul>
</li>
<li>
<p><strong>DECIMER Image Transformer (OCSR Engine)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-V2-M (CNN). Input size <strong>512x512</strong>. 52M parameters</li>
<li><strong>Decoder</strong>: Transformer. 4 encoder blocks, 4 decoder blocks, 8 attention heads. d_model=512, d_ff=2048. 59M parameters</li>
<li><strong>Total Params</strong>: ~111 Million</li>
</ul>
</li>
</ol>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Tanimoto Similarity (calculated on PubChem fingerprints of the predicted vs. ground truth SMILES)</li>
<li><strong>Secondary Metrics</strong>: Exact Match (Identity), BLEU score (for string similarity, esp. Markush)</li>
<li><strong>Failure Analysis</strong>: &ldquo;Catastrophic failure&rdquo; defined as Tanimoto similarity of 0 or invalid SMILES</li>
</ul>
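<p>Under this definition, the failure rate can be computed from per-image scores in which an invalid SMILES is recorded as <code>None</code>. A small illustrative helper (not the paper's code):</p>

```python
def catastrophic_failure_rate(scores: list) -> float:
    """Fraction of predictions that fail catastrophically: an
    invalid SMILES (None) or a Tanimoto similarity of exactly 0."""
    failures = sum(1 for s in scores if s is None or s == 0.0)
    return failures / len(scores)

# Five predictions: one invalid SMILES, one with zero similarity
print(catastrophic_failure_rate([0.97, None, 1.0, 0.0, 0.88]))  # 0.4
```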
<h3 id="hardware">Hardware</h3>
<p>Training was performed on Google Cloud TPUs due to the massive dataset size.</p>
<ul>
<li><strong><code>pubchem_1</code>/<code>pubchem_2</code></strong>: Trained on TPU v3-32 pod slice</li>
<li><strong><code>pubchem_3</code> (Final Model)</strong>: Trained on <strong>TPU v3-256</strong> pod slice</li>
<li><strong>Training Time</strong>:
<ul>
<li>Data generation (512x512): ~2 weeks on cluster (20 threads, 36 cores)</li>
<li>Model Training (EffNet-V2-M): <strong>1 day and 7 hours per epoch</strong> on TPU v3-256</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H. O., Agea, M. I., Zielesny, A., &amp; Steinbeck, C. (2023). DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. <em>Nature Communications</em>, 14(1), 5045. <a href="https://doi.org/10.1038/s41467-023-40782-0">https://doi.org/10.1038/s41467-023-40782-0</a></p>
<p><strong>Publication</strong>: Nature Communications 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://decimer.ai">Web Application</a></li>
<li><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer GitHub</a></li>
<li><a href="https://github.com/OBrink/RanDepict">RanDepict GitHub</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMERaiOpenPlatform2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Agea, M. Isabel and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5045}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41467-023-40782-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReco: Hand-Drawn Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</guid><description>A deep learning method using EfficientNet and Transformer to convert hand-drawn chemical structures into SMILES codes, achieving 96.9% accuracy.</description><content:encoded><![CDATA[<h2 id="research-contribution--classification">Research Contribution &amp; Classification</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: The primary contribution is &ldquo;ChemReco,&rdquo; a specific deep learning pipeline (EfficientNet + Transformer) designed to solve the Optical Chemical Structure Recognition (OCSR) task for hand-drawn images. The authors conduct extensive ablation studies on architecture and data mixing ratios to validate performance.</li>
<li><strong>Resource</strong>: The authors explicitly state that &ldquo;the primary focus of this paper is constructing datasets&rdquo; due to the scarcity of hand-drawn molecular data. They introduce a comprehensive synthetic data generation pipeline involving RDKit modifications and image degradation to create training data.</li>
</ul>
<h2 id="motivation-digitizing-hand-drawn-chemical-sketches">Motivation: Digitizing Hand-Drawn Chemical Sketches</h2>
<p>Hand-drawing is the most intuitive method for chemists and students to record molecular structures. However, digitizing these drawings into machine-readable formats (like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) usually requires time-consuming manual entry or specialized software.</p>
<ul>
<li><strong>Gap</strong>: Existing OCSR tools and rule-based methods often fail on hand-drawn sketches due to diverse writing styles, poor image quality, and the absence of labeled data.</li>
<li><strong>Application</strong>: Automated recognition enables efficient chemical research and allows for automatic grading in educational settings.</li>
</ul>
<h2 id="core-innovation-synthetic-pipeline-and-hybrid-architecture">Core Innovation: Synthetic Pipeline and Hybrid Architecture</h2>
<p>The paper introduces <strong>ChemReco</strong>, an end-to-end system for recognizing C-H-O structures. Key novelties include:</p>
<ol>
<li><strong>Synthetic Data Pipeline</strong>: A multi-stage generation method that modifies RDKit source code to randomize bond/angle parameters, followed by OpenCV-based augmentation, degradation, and background addition to simulate realistic hand-drawn artifacts.</li>
<li><strong>Architectural Choice</strong>: The specific application of <strong>EfficientNet</strong> (encoder) combined with a <strong>Transformer</strong> (decoder) for this domain, which the authors demonstrate outperforms the more common ResNet+LSTM baselines.</li>
<li><strong>Hybrid Training Strategy</strong>: Finding that a mix of 90% synthetic and 10% real data yields optimal performance, superior to using either dataset alone.</li>
</ol>
<h2 id="methodology--ablation-studies">Methodology &amp; Ablation Studies</h2>
<p>The authors performed a series of ablation studies and comparisons:</p>
<ul>
<li><strong>Synthesis Ablation</strong>: Evaluated the impact of each step in the generation pipeline (RDKit only $\rightarrow$ Augmentation $\rightarrow$ Degradation $\rightarrow$ Background) on validation loss and accuracy.</li>
<li><strong>Dataset Size Ablation</strong>: Tested model performance when trained on synthetic datasets ranging from 100,000 to 1,000,000 images.</li>
<li><strong>Real/Synthetic Ratio</strong>: Investigated the optimal mixing ratio of synthetic to real hand-drawn images (100:0, 90:10, 50:50, 10:90, 0:100), finding that the 90:10 ratio achieved 93.81% exact match, compared to 63.33% for synthetic-only and 65.83% for real-only.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked four encoder-decoder combinations: ResNet vs. EfficientNet encoders paired with LSTM vs. Transformer decoders.</li>
<li><strong>Baseline Comparison</strong>: Compared results against a related study utilizing a CNN+LSTM framework.</li>
</ul>
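<p>The mixing-ratio experiment can be sketched as follows. Whether the authors sample real images with or without replacement is not stated, so oversampling with replacement (the real set is far smaller than the synthetic one) and the helper names are assumptions:</p>

```python
import random

def mix_datasets(synthetic, real, real_fraction, total, seed=0):
    """Compose a training set whose composition follows a fixed
    synthetic-to-real ratio, e.g. real_fraction=0.10 for 90:10."""
    rng = random.Random(seed)
    n_real = int(total * real_fraction)
    # Real images are oversampled with replacement; synthetic without.
    batch = rng.choices(real, k=n_real) + rng.sample(synthetic, total - n_real)
    rng.shuffle(batch)
    return batch

# Toy "datasets" of file identifiers
synthetic = [f"syn_{i}" for i in range(1000)]
real = [f"real_{i}" for i in range(26)]
train = mix_datasets(synthetic, real, real_fraction=0.10, total=100)
print(sum(1 for x in train if x.startswith("real")))  # 10
```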
<h2 id="results--interpretations">Results &amp; Interpretations</h2>
<ul>
<li><strong>Best Performance</strong>: The EfficientNet + Transformer model trained on a 90:10 synthetic-to-real ratio achieved a <strong>96.90% Exact Match</strong> rate on the test set.</li>
<li><strong>Background Robustness</strong>: When training on synthetic data alone (no real images), the best accuracy on background-free test images was approximately 46% (using RDKit-aug-deg), while background test images reached approximately 53% (using RDKit-aug-bkg-deg). Adding random backgrounds during training helped prevent the model from overfitting to clean white backgrounds.</li>
<li><strong>Data Volume</strong>: Increasing the synthetic dataset size from 100k to 1M consistently improved accuracy (average exact match: 49.40% at 100k, 54.29% at 200k, 61.31% at 500k, 63.33% at 1M, all without real images in training).</li>
<li><strong>Encoder-Decoder Comparison</strong> (at 90:10 mix with 1M images):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Encoder</th>
          <th style="text-align: left">Decoder</th>
          <th style="text-align: left">Avg. Exact Match (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">93.81</td>
      </tr>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left">94.76</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">96.31</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left"><strong>96.90</strong></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Superiority over Baselines</strong>: The model outperformed the cited CNN+LSTM baseline from ChemPix (93% vs 76% on the ChemPix test set).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Restricted atom types</strong>: The system only handles molecules composed of carbon, hydrogen, and oxygen (C-H-O), excluding nitrogen, sulfur, halogens, and other heteroatoms commonly found in organic chemistry.</li>
<li><strong>Structural complexity</strong>: Only structures with at most one ring are supported. Complex multi-ring systems and fused ring structures are not covered.</li>
<li><strong>Dataset availability</strong>: The real hand-drawn dataset (2,598 images) is not publicly released and is only available upon request from the corresponding author.</li>
<li><strong>Future directions</strong>: The authors suggest expanding to more heteroatoms, complex ring structures, and applications in automated grading of chemistry exams.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/a-die/hdr-DeepLearning">hdr-DeepLearning</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation in PyTorch</td>
      </tr>
      <tr>
          <td style="text-align: left">Paper</td>
          <td style="text-align: left">Publication</td>
          <td style="text-align: left">CC-BY-4.0</td>
          <td style="text-align: left">Open access via Nature</td>
      </tr>
  </tbody>
</table>
<p>The real hand-drawn dataset (2,598 images) is available on request from the corresponding author rather than as a public download. The synthetic data generation pipeline is described in detail and relies on modified RDKit source code, which is included in the repository.</p>
<h3 id="data">Data</h3>
<p>The study utilizes a combination of collected SMILES data, real hand-drawn images, and generated synthetic images.</p>
<ul>
<li><strong>Source Data</strong>: SMILES codes collected from PubChem, ZINC, <a href="/notes/chemistry/datasets/gdb-11/">GDB-11</a>, and <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>. Filtered for C, H, O atoms and max 1 ring.</li>
<li><strong>Real Dataset</strong>: 670 selected SMILES codes drawn by multiple volunteers, totaling <strong>2,598 images</strong>.</li>
<li><strong>Synthetic Dataset</strong>: Generated up to <strong>1,000,000 images</strong> using the pipeline below.</li>
<li><strong>Training Mix</strong>: The optimal training set used 1 million images with a <strong>90:10 ratio</strong> of synthetic to real images.</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset Type</th>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Real</strong></td>
          <td style="text-align: left">Volunteer Drawings</td>
          <td style="text-align: left">2,598 images</td>
          <td style="text-align: left">Used for mixed training and testing</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Synthetic</strong></td>
          <td style="text-align: left">Generated</td>
          <td style="text-align: left">100k - 1M</td>
          <td style="text-align: left">Generated via modified RDKit + OpenCV augmentation/degradation; optionally enhanced with Stable Diffusion</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>Synthetic Image Generation Pipeline</strong> is critical for reproduction:</p>
<ol>
<li><strong>RDKit Modification</strong>: Modify source code to introduce random keys, character width, length, and bond angles.</li>
<li><strong>Augmentation (OpenCV)</strong>: Apply sequence: Resize ($p=0.5$), Blur ($p=0.4$), Erode/Dilate ($p=0.2$), Distort ($p=0.8$), Flip ($p=0.5$), Affine ($p=0.7$).</li>
<li><strong>Degradation</strong>: Apply sequence: Salt+pepper noise ($p=0.1$), Contrast ($p=0.7$), Sharpness ($p=0.5$), Invert ($p=0.3$).</li>
<li><strong>Background Addition</strong>: Random backgrounds are augmented (Crop, Distort, Flip) and added to the molecular image to prevent background overfitting.</li>
<li><strong>Diffusion Enhancement</strong>: Stable Diffusion (v1-4) is used for image-to-image enhancement to better simulate hand-drawn styles (prompt: &ldquo;A pencil sketch of [Formula]&hellip; without charge distribution&rdquo;).</li>
</ol>
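<p>Steps 2 and 3 apply each transform independently with a fixed probability. The control flow can be sketched generically; the identity lambdas below are placeholders for the actual OpenCV operations:</p>

```python
import random

def apply_pipeline(image, steps, rng=random):
    """Apply each (name, probability, transform) step independently
    with its configured probability, preserving the given order."""
    applied = []
    for name, p, fn in steps:
        if rng.random() < p:
            image = fn(image)
            applied.append(name)
    return image, applied

identity = lambda img: img  # placeholder for the real OpenCV transforms
augmentation = [
    ("resize", 0.5, identity),
    ("blur", 0.4, identity),
    ("erode_dilate", 0.2, identity),
    ("distort", 0.8, identity),
    ("flip", 0.5, identity),
    ("affine", 0.7, identity),
]

random.seed(0)
_, applied = apply_pipeline("molecule.png", augmentation)
print(applied)  # the subset of steps sampled for this image
```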
<h3 id="models">Models</h3>
<p>The system uses an encoder-decoder architecture:</p>
<ul>
<li><strong>Encoder</strong>: <strong>EfficientNet</strong> (pre-trained on ImageNet). The final classification layer is removed, and the extracted features are passed on as a NumPy array.</li>
<li><strong>Decoder</strong>: <strong>Transformer</strong>. Utilizes self-attention to generate the SMILES sequence. Chosen over LSTM for better handling of long-range dependencies.</li>
<li><strong>Output</strong>: Canonical SMILES string.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: <strong>Exact Match (EM)</strong>. A strict binary evaluation checking whether the complete generated SMILES perfectly replicates the target string.</li>
<li><strong>Other Metrics</strong>: <strong>Levenshtein Distance</strong> measures edit-level character proximity, while the <strong>Tanimoto coefficient</strong> evaluates structural similarity based on chemical fingerprints. Both were monitored during validation ablation runs.</li>
</ul>
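<p>Exact Match is a plain string comparison; the Levenshtein distance that complements it follows the standard dynamic-programming recurrence. A self-contained sketch:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

# One substitution separates these SMILES strings; Exact Match would score 0
print(levenshtein("CCO", "CCN"))  # 1
```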
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Baseline (CNN+LSTM)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Exact Match</strong></td>
          <td style="text-align: left"><strong>96.90%</strong></td>
          <td style="text-align: left">76%</td>
          <td style="text-align: left">Tested on the provided test set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>CPU</strong>: Intel(R) Xeon(R) Gold 6130 (40 GB RAM).</li>
<li><strong>GPU</strong>: NVIDIA Tesla V100 (32 GB video memory).</li>
<li><strong>Framework</strong>: PyTorch 1.9.1.</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Optimizer: Adam (learning rate 1e-4).</li>
<li>Batch size: 32.</li>
<li>Epochs: 100.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, H., Liu, W., Tao, J., et al. (2024). ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning. <em>Scientific Reports</em>, 14, 17126. <a href="https://doi.org/10.1038/s41598-024-67496-7">https://doi.org/10.1038/s41598-024-67496-7</a></p>
<p><strong>Publication</strong>: Scientific Reports 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/a-die/hdr-DeepLearning">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ouyangChemRecoAutomatedRecognition2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemReco: Automated Recognition of Hand-Drawn Carbon--Hydrogen--Oxygen Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Hengjie and Liu, Wei and Tao, Jiajun and Luo, Yanghong and Zhang, Wanjia and Zhou, Jiayu and Geng, Shuqi and Zhang, Chengpeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{17126}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-024-67496-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Benchmarking Eight OCSR Tools on Patent Images (2024)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/krasnov-ocsr-benchmark-2024/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/krasnov-ocsr-benchmark-2024/</guid><description>Benchmark of 8 open-access OCSR methods on 2702 manually curated patent images, with ChemIC classifier for hybrid approach.</description><content:encoded><![CDATA[<h2 id="contribution-benchmarking-general-and-specialized-ocsr-tools">Contribution: Benchmarking General and Specialized OCSR Tools</h2>
<p>This paper is primarily a <strong>Resource</strong> contribution ($0.7 \Psi_{\text{Resource}}$) with a secondary <strong>Method</strong> component ($0.3 \Psi_{\text{Method}}$).</p>
<p>It establishes a new, independent benchmark dataset of 2,702 manually selected patent images to evaluate existing Optical Chemical Structure Recognition (OCSR) tools. The authors rigorously compare 8 different methods using this dataset to determine the state-of-the-art. The Resource contribution is evidenced by the creation of this curated benchmark, explicit evaluation metrics (exact connectivity table matching), and public release of datasets, processing scripts, and evaluation tools on Zenodo.</p>
<p>The secondary Method contribution comes through the development of &ldquo;ChemIC,&rdquo; a ResNet-50 image classifier designed to categorize images (Single vs. Multiple vs. Reaction) to enable a modular processing pipeline. However, this method serves to support the insights gained from the benchmarking resource.</p>
<h2 id="motivation-the-need-for-realistic-modality-diverse-patent-benchmarks">Motivation: The Need for Realistic, Modality-Diverse Patent Benchmarks</h2>
<p><strong>Lack of Standardization</strong>: A universally accepted standard set of images for OCSR quality measurement is currently missing; existing tools are often evaluated on synthetic data or limited datasets.</p>
<p><strong>Industrial Relevance</strong>: Patents contain diverse and &ldquo;noisy&rdquo; image modalities (Markush structures, salts, reactions, hand-drawn styles) that are critical for Freedom to Operate (FTO) and novelty checks in the pharmaceutical industry. These real-world complexities are often missing from existing benchmarks.</p>
<p><strong>Modality Gaps</strong>: Different tools excel at different tasks (e.g., single molecules vs. reactions). Monolithic approaches frequently break down on complex patent documents, and there was minimal systematic understanding of which tools perform best for which image types.</p>
<p><strong>Integration Needs</strong>: The authors aimed to identify tools to replace or augment their existing rule-based system (OSRA) within the SciWalker application, requiring a rigorous comparative study.</p>
<h2 id="core-innovation-a-curated-multi-modality-dataset-and-hybrid-classification-pipeline">Core Innovation: A Curated Multi-Modality Dataset and Hybrid Classification Pipeline</h2>
<p><strong>Independent Benchmark</strong>: Creation of a manually curated test set of 2,702 images from real-world patents (WO, EP, US), specifically selected to include &ldquo;problematic&rdquo; edge cases like inorganic complexes, peptides, and Markush structures, providing a more realistic evaluation environment than synthetic datasets.</p>
<p><strong>Comprehensive Comparison</strong>: Side-by-side evaluation of 8 open-access tools: DECIMER, ReactionDataExtractor, MolScribe, RxnScribe, SwinOCSR, OCMR, MolVec, and OSRA, using identical test conditions and evaluation criteria.</p>
<p><strong>ChemIC Classifier</strong>: Implementation of a specialized image classifier (ResNet-50) to distinguish between single molecules, multiple molecules, reactions, and non-chemical images, facilitating a &ldquo;hybrid&rdquo; pipeline that routes images to the most appropriate tool.</p>
<p><strong>Strict Evaluation Logic</strong>: Utilization of an exact match criterion for connectivity tables (ignoring partial similarity scores like Tanimoto) to reflect rigorous industrial requirements for novelty checking in patent applications.</p>
<h2 id="methodology-exact-match-evaluation-across-eight-open-source-systems">Methodology: Exact-Match Evaluation Across Eight Open-Source Systems</h2>
<p><strong>Tool Selection</strong>: Installed and tested 8 tools: DECIMER v2.4.0, ReactionDataExtractor v2.0.0, MolScribe v1.1.1, RxnScribe v1.0, MolVec v0.9.8, OCMR, SwinOCSR, and OSRA v2.1.5.</p>
<p><strong>Dataset Construction</strong>:</p>
<ul>
<li><strong>Test Set</strong>: 2,702 patent images split into three &ldquo;buckets&rdquo;: A (Single structure - 1,454 images), B (Multiple structures - 661 images), C (Reactions - 481 images).</li>
<li><strong>Training Set (for ChemIC)</strong>: 16,000 images from various sources (Patents, Im2Latex, etc.) split into 12,804 training, 1,604 validation, and 1,604 test images.</li>
</ul>
<p><strong>Evaluation Protocol</strong>:</p>
<ul>
<li>Calculated Precision, Recall, and F1 scores based on an <em>exact connectivity table structure matching</em> (rejecting Tanimoto similarity as industrially insufficient). The metrics follow standard formulations where true positives ($\text{TP}$) represent perfectly assembled structures:
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \qquad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \qquad \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$</li>
<li>Manual inspection by four chemists to verify predictions.</li>
<li>Developed custom tools (<code>ImageComparator</code> and <code>ExcelConstructor</code>) to facilitate visual comparison and result aggregation.</li>
</ul>
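<p>With this protocol, each prediction either reproduces the connectivity table exactly (a true positive) or counts against the tool. A minimal sketch of the metric computation (how FP and FN are tallied per bucket is the paper's procedure, not reproduced here):</p>

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from exact-match counts, where a
    true positive is a perfectly reconstructed connectivity table."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy counts: 80 exact matches, 10 wrong predictions, 10 missed structures
p, r, f = prf1(80, 10, 10)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.89 0.89 0.89
```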
<p><strong>Segmentation Test</strong>: Applied DECIMER segmentation to multi-structure images to see if splitting them before processing improved results, combining segmentation with MolScribe for final predictions.</p>
<h2 id="key-findings-modality-specialization-outperforms-monolithic-approaches">Key Findings: Modality Specialization Outperforms Monolithic Approaches</h2>
<p><strong>Single Molecules</strong>: <strong>MolScribe</strong> achieved the highest performance (Precision: 87%, F1: 93%), followed closely by <strong>DECIMER</strong> (Precision: 84%, F1: 91%). These transformer-based approaches outperformed rule-based methods on single-structure images (e.g., MolScribe F1: 93% vs. OSRA F1: 78%).</p>
<p><strong>Reactions</strong>: Evaluated on 103 randomly selected reaction images containing 284 total reactions, <strong>RxnScribe</strong> outperformed others (Recall: 97%, F1: 86%), demonstrating the value of specialized architectures for reaction diagrams. General-purpose tools struggled with reaction recognition.</p>
<p><strong>Multiple Structures</strong>: Evaluated on 20 multi-structure images containing 146 single structures, all AI-based tools struggled. <strong>OSRA</strong> (rule-based) performed best here but still had low precision (58%). Combining DECIMER segmentation (with the <code>expand</code> option) with MolScribe on these same 20 images improved precision to 82% and F1 to 90%, showing that image segmentation as a preprocessing step can boost multi-structure performance.</p>
<p><strong>Failures</strong>: Current tools fail on polymers, large oligomers, and complex Markush structures. Most tools (except MolVec) correctly recognize cis-trans and tetrahedral stereochemistry, but other forms (e.g., octahedral, axial, helical) are not recognized. None of the evaluated tools can reliably recognize dative/coordinate bonds in metal complexes, indicating gaps in training data coverage.</p>
<p><strong>Classifier Utility</strong>: The ChemIC model achieved 99.62% accuracy on the test set, validating the feasibility of a modular pipeline where images are routed to the specific tool best suited for that modality. The authors estimate that a hybrid system (MolScribe + OSRA + RxnScribe) routed by ChemIC would achieve an average F1 of 80%, compared to 68% for OSRA alone across all modalities.</p>
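<p>The routing idea behind the estimated hybrid system can be sketched as a simple dispatch table; <code>classify</code> stands in for the ChemIC ResNet-50 classifier, and the tool assignments follow the per-modality winners reported above:</p>

```python
# Dispatch table mapping a ChemIC class label to the best-performing tool
# for that modality (per the paper's per-modality results); `classify` is
# a stand-in for the ResNet-50 classifier.
ROUTES = {
    "single_structure": "MolScribe",
    "multi_structure": "OSRA",
    "reaction": "RxnScribe",
    "no_structure": None,  # non-chemical images: skip OCSR entirely
}

def route(image, classify):
    """Return the name of the tool to run on `image`, or None to skip."""
    return ROUTES[classify(image)]

# Stub classifier for illustration.
tool = route("page_042.png", classify=lambda img: "reaction")
```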
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Benchmark (Test)</strong></td>
          <td>Manual Patent Selection</td>
          <td>2,702 Images</td>
          <td>Sources: WO, EP, US patents<br><strong>Bucket A</strong>: Single structures (1,454)<br><strong>Bucket B</strong>: Multi-structures (661)<br><strong>Bucket C</strong>: Reactions (481)</td>
      </tr>
      <tr>
          <td><strong>ChemIC Training</strong></td>
          <td>Aggregated Sources</td>
          <td>16,000 Images</td>
          <td>Sources: Patents (OntoChem), MolScribe dataset, DECIMER dataset, RxnScribe dataset, Im2Latex-100k<br><strong>Split</strong>: 12,804 Train / 1,604 Val / 1,604 Test</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Scoring Logic</strong>:</p>
<ul>
<li><strong>Single Molecules</strong>: Score = 1 if the connectivity table matches exactly (all atoms, valencies, bonds, superatom abbreviations, and charges correct), 0 otherwise. Stereochemistry correctness was not a scoring criterion. Tanimoto similarity was explicitly rejected as too lenient.</li>
<li><strong>Reactions</strong>: Considered correct if at least one reactant and one product are correct and capture the main features. Stoichiometry and conditions were ignored.</li>
</ul>
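<p>The single-molecule scoring rule can be sketched as a direct connectivity-table comparison. The table representation below is a simplified stand-in for a MOL-style connectivity table, and a shared canonical atom ordering is assumed (a full comparison would need graph isomorphism to handle atom renumbering):</p>

```python
def tables_match(pred, truth):
    """Score 1 iff the atom list (element, charge) and the bond list agree
    exactly; stereochemistry is deliberately ignored, matching the paper's
    scoring rule. Assumes both tables use the same canonical atom order."""
    same_atoms = pred["atoms"] == truth["atoms"]
    norm = lambda bonds: sorted((min(a, b), max(a, b), order)
                                for a, b, order in bonds)
    same_bonds = norm(pred["bonds"]) == norm(truth["bonds"])
    return int(same_atoms and same_bonds)

# Ethanol as a toy connectivity table: atoms indexed 0..2, bonds given as
# (atom_i, atom_j, bond_order) triples.
ethanol = {"atoms": [("C", 0), ("C", 0), ("O", 0)],
           "bonds": [(0, 1, 1), (1, 2, 1)]}
reordered = {"atoms": [("C", 0), ("C", 0), ("O", 0)],
             "bonds": [(2, 1, 1), (1, 0, 1)]}  # same bonds, listed differently
wrong = {"atoms": [("C", 0), ("C", 0), ("O", 0)],
         "bonds": [(0, 1, 1), (1, 2, 2)]}      # C=O instead of C-O
```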
<p><strong>Image Segmentation</strong>: Used DECIMER segmentation (with <code>expand</code> option) to split multi-structure images into single images before passing to MolScribe.</p>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Version</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER</strong></td>
          <td>v2.4.0</td>
          <td>EfficientNet-V2-M encoder + Transformer decoder</td>
      </tr>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>v1.1.1</td>
          <td>Swin Transformer encoder + Transformer decoder</td>
      </tr>
      <tr>
          <td><strong>RxnScribe</strong></td>
          <td>v1.0</td>
          <td>Specialized for reaction diagrams</td>
      </tr>
      <tr>
          <td><strong>ReactionDataExtractor</strong></td>
          <td>v2.0.0</td>
          <td>Deep learning-based extraction</td>
      </tr>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>v0.9.8</td>
          <td>Rule-based vectorization</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>v2.1.5</td>
          <td>Rule-based recognition</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>-</td>
          <td>Swin Transformer encoder-decoder</td>
      </tr>
      <tr>
          <td><strong>OCMR</strong></td>
          <td>-</td>
          <td>CNN-based framework</td>
      </tr>
      <tr>
          <td><strong>ChemIC (New)</strong></td>
          <td>-</td>
          <td>ResNet-50 CNN in PyTorch for 4-class classification</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Results on Single Structures (Bucket A - 400 random sample):</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1 Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>87%</td>
          <td>100%</td>
          <td>93%</td>
      </tr>
      <tr>
          <td><strong>DECIMER</strong></td>
          <td>84%</td>
          <td>100%</td>
          <td>91%</td>
      </tr>
      <tr>
          <td><strong>OCMR</strong></td>
          <td>77%</td>
          <td>100%</td>
          <td>87%</td>
      </tr>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>74%</td>
          <td>100%</td>
          <td>85%</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>64%</td>
          <td>100%</td>
          <td>78%</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>65%</td>
          <td>95%</td>
          <td>77%</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Results on Reactions (Bucket C):</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1 Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RxnScribe</strong></td>
          <td>77%</td>
          <td>97%</td>
          <td>86%</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>64%</td>
          <td>65%</td>
          <td>64%</td>
      </tr>
      <tr>
          <td><strong>ReactionDataExtractor</strong></td>
          <td>49%</td>
          <td>62%</td>
          <td>55%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p><strong>ChemIC Training</strong>: Trained on a machine with 40 Intel(R) Xeon(R) Gold 6226 CPUs. Training time approximately 6 hours for 100 epochs (early stopping at epoch 26).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10546827">Zenodo Repository (Code &amp; Data)</a></td>
          <td>Code, Dataset</td>
          <td>Unknown</td>
          <td>Benchmark images, processing scripts, evaluation tools, ChemIC classifier code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ontochem/ImageComparator">ImageComparator</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Java tool for visual comparison of OCSR predictions</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krasnov, A., Barnabas, S. J., Boehme, T., Boyer, S. K., &amp; Weber, L. (2024). Comparing software tools for optical chemical structure recognition. <em>Digital Discovery</em>, 3(4), 681-693. <a href="https://doi.org/10.1039/D3DD00228D">https://doi.org/10.1039/D3DD00228D</a></p>
<p><strong>Publication</strong>: Digital Discovery 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.5281/zenodo.10546827">Zenodo Repository (Code &amp; Data)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krasnovComparingSoftwareTools2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Comparing Software Tools for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krasnov, Aleksei and Barnabas, Shadrack J. and Boehme, Timo and Boyer, Stephen K. and Weber, Lutz}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{681--693}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D3DD00228D}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AtomLenz: Atom-Level OCSR with Limited Supervision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</guid><description>Weakly supervised OCSR framework combining object detection and graph construction to recognize chemical structures from hand-drawn images using SMILES.</description><content:encoded><![CDATA[<h2 id="dual-contribution-method-and-data-resource">Dual Contribution: Method and Data Resource</h2>
<p>The paper proposes an architecture (AtomLenz) and training framework (ProbKT* + Edit-Correction) to solve the problem of Optical Chemical Structure Recognition (OCSR) in data-sparse domains. It also releases a curated, relabeled dataset of hand-drawn molecules with atom-level bounding box annotations.</p>
<h2 id="overcoming-annotation-bottlenecks-in-ocsr">Overcoming Annotation Bottlenecks in OCSR</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for digitizing chemical literature and lab notes. However, existing methods face three main limitations:</p>
<ol>
<li><strong>Generalization Limits:</strong> They struggle with sparse or stylistically unique domains, such as hand-drawn images, where massive datasets for pretraining are unavailable.</li>
<li><strong>Annotation Cost:</strong> &ldquo;Atom-level&rdquo; methods (which detect individual atoms and bonds) require expensive bounding box annotations, which are rarely available for real-world sketch data.</li>
<li><strong>Lack of Interpretability/Localization:</strong> Pure &ldquo;Image-to-SMILES&rdquo; models (like DECIMER) work well but fail to localize the atoms or bonds in the original image, limiting human-in-the-loop review and mechanistic interpretability.</li>
</ol>
<h2 id="atomlenz-probkt-and-graph-edit-correction">AtomLenz, ProbKT*, and Graph Edit-Correction</h2>
<p>The core contribution is <strong>AtomLenz</strong>, an OCSR framework that achieves atom-level entity detection using <strong>only SMILES supervision</strong> on target domains. The authors construct an explicit object detection pipeline using Faster R-CNN trained via a composite multi-task loss. The objective aims to optimize a multi-class log loss $L_{cls}$ for predicted class $\hat{c}$ and a regression loss $L_{reg}$ for predicted bounding box coordinates $\hat{b}$:</p>
<p>$$ \mathcal{L} = L_{cls}(c, \hat{c}) + L_{reg}(b, \hat{b}) $$</p>
<p>To bridge the gap between image inputs and the weakly supervised SMILES labels, the system leverages:</p>
<ul>
<li><strong>ProbKT* (Probabilistic Knowledge Transfer):</strong> Uses probabilistic logic and Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; derived from the SMILES strings, enabling backpropagation without explicit bounding boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Generates pseudo-labels by solving an optimization problem that finds the smallest edit on the predicted graph such that the corrected graph and the ground-truth SMILES graph become isomorphic, which forces fine-tuning on less frequent atom types. The combination of ProbKT* and Edit-Correction is abbreviated as <strong>EditKT*</strong>.</li>
<li><strong>ChemExpert:</strong> A chemically sound ensemble strategy that cascades predictions from multiple models (e.g., passing through DECIMER, then AtomLenz), halting at the first output that clears basic RDKit chemical validity checks.</li>
</ul>
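<p>The ChemExpert cascade can be sketched as a priority loop over models; the RDKit validity check is stubbed behind an <code>is_valid</code> callable, and the toy check below is purely illustrative, not RDKit's:</p>

```python
def chem_expert(image, models, is_valid):
    """Run models in priority order and return the first prediction that
    passes the validity check (in the paper, basic RDKit sanity checks,
    stubbed here behind `is_valid`). Returns None if every model fails."""
    for model in models:
        smiles = model(image)
        if smiles is not None and is_valid(smiles):
            return smiles
    return None

# Illustration: the first model emits an invalid string, the second a
# valid one, so the cascade falls through to the second model.
decimer = lambda img: "C1CC"   # unclosed ring -> invalid
atomlenz = lambda img: "CCO"   # ethanol -> valid
# Toy validity check (unpaired ring-closure digit), NOT a real SMILES parser.
valid = lambda s: not ("1" in s and s.count("1") % 2)
result = chem_expert("sketch.png", [decimer, atomlenz], valid)
```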
<h2 id="data-efficiency-and-domain-adaptation-experiments">Data Efficiency and Domain Adaptation Experiments</h2>
<p>The authors evaluated the model specifically on domain adaptation and sample efficiency, treating hand-drawn molecules as the primary low-data target distribution:</p>
<ul>
<li><strong>Pretraining:</strong> Initially trained on ~214k synthetic images from ChEMBL explicitly labeled with bounding boxes (generated via RDKit).</li>
<li><strong>Target Domain Adaptation:</strong> Fine-tuned on the Brinkhaus hand-drawn dataset (4,070 images) using purely SMILES supervision.</li>
<li><strong>Evaluation Sets:</strong>
<ul>
<li><strong>Hand-drawn test set</strong>: 1,018 images.</li>
<li><strong>ChemPix</strong>: 614 out-of-domain hand-drawn images.</li>
<li><strong>Atom Localization set</strong>: 1,000 synthetic images to evaluate precise bounding box capabilities.</li>
</ul>
</li>
<li><strong>Baselines:</strong> Compared against leading OCSR methods, including DECIMER (v2.2.0), Img2Mol, MolScribe, ChemGrapher, and OSRA.</li>
</ul>
<h2 id="state-of-the-art-ensembles-vs-standalone-limitations">State-of-the-Art Ensembles vs. Standalone Limitations</h2>
<ul>
<li><strong>SOTA Ensemble Performance:</strong> The <strong>ChemExpert</strong> module (combining AtomLenz and DECIMER) achieved state-of-the-art accuracy on both hand-drawn (63.5%) and ChemPix (51.8%) test sets.</li>
<li><strong>Data Efficiency under Bottleneck Regimes:</strong> AtomLenz effectively bypassed the massive data constraints of competing models. When all methods were retrained from scratch on the same 4,070-sample hand-drawn training set (enriched with atom-level annotations from EditKT*), AtomLenz achieved 33.8% exact accuracy, outperforming baselines like Img2Mol (0.0%), MolScribe (1.3%), and DECIMER (0.1%), illustrating its sample efficiency.</li>
<li><strong>Localization Success:</strong> The base framework achieved strong localization (mAP 0.801), a capability not provided by end-to-end transformers like DECIMER.</li>
<li><strong>Methodological Tradeoffs:</strong> While AtomLenz is highly sample efficient, its standalone performance when fine-tuned on the target domain (33.8% accuracy) underperforms fine-tuned models trained on larger datasets like DECIMER (62.2% accuracy). AtomLenz achieves state-of-the-art results primarily when deployed as part of the ChemExpert ensemble alongside DECIMER, since errors from the two approaches tend to occur on different samples, allowing them to complement each other.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz">Official Repository (AtomLenz)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Complete pipeline for AtomLenz, ProbKT*, and Graph Edit-Correction.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz/tree/main/models">Pre-trained Models</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Downloadable weights for Faster R-CNN detection backbones.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset (Brinkhaus)</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Images and SMILES used for target domain fine-tuning and evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599172">Relabeled Hand-drawn Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">1,417 images with bounding box annotations generated via EditKT*.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/spaces/moldenhof/atomlenz">AtomLenz Web Demo</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Interactive Hugging Face space for testing model inference.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilizes a mix of large synthetic datasets and smaller curated hand-drawn datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>Synthetic ChEMBL</td>
          <td>~214,000</td>
          <td>Generated via RDKit/Indigo. Annotated with atoms, bonds, charges, stereocenters.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-drawn (Brinkhaus)</td>
          <td>4,070</td>
          <td>Used for weakly supervised adaptation (SMILES only).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Hand-drawn Test</td>
          <td>1,018</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChemPix</td>
          <td>614</td>
          <td>Out-of-distribution hand-drawn images.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Atom Localization</td>
          <td>1,000</td>
          <td>Synthetic images with ground truth bounding boxes.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Molecular Graph Constructor (Algorithm 1):</strong> A rule-based system to assemble the graph from detected objects:
<ol>
<li><strong>Filtering:</strong> Removes overlapping atom boxes (IoU threshold).</li>
<li><strong>Node Creation:</strong> Merges overlapping charge and stereocenter objects with their corresponding atom objects.</li>
<li><strong>Edge Creation:</strong> Iterates over bond objects; if a bond overlaps with exactly two atoms, an edge is added. If &gt;2, it selects the most probable pair.</li>
<li><strong>Validation:</strong> Checks valency constraints; removes bonds iteratively if constraints are violated.</li>
</ol>
</li>
<li><strong>Weakly Supervised Training:</strong>
<ul>
<li><strong>ProbKT*:</strong> Uses Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; implied by the SMILES string, allowing backpropagation without explicit boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Finds the smallest edit on the predicted graph such that the corrected and true SMILES graphs become isomorphic, then uses the correction to generate pseudo-labels for retraining.</li>
</ul>
</li>
</ul>
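<p>The alignment step in ProbKT* can be sketched as a minimum-cost assignment between detected objects and SMILES-derived targets. The brute-force search below is a stdlib-only stand-in for the Hungarian algorithm (in practice one would use <code>scipy.optimize.linear_sum_assignment</code>), and the cost matrix is a toy example:</p>

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one matching between predictions (rows) and
    SMILES-derived targets (columns). Brute force over permutations --
    fine for toy sizes; ProbKT-style training would use the Hungarian
    algorithm instead."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(enumerate(best))

# Toy cost matrix: entry [i][j] is 0 if predicted object i matches target
# atom j (e.g. same element type), 1 otherwise.
cost = [[0, 1, 1],
        [1, 1, 0],
        [1, 0, 1]]
pairs = best_assignment(cost)
```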
<h3 id="models">Models</h3>
<ul>
<li><strong>Object Detection Backbone:</strong> <strong>Faster R-CNN</strong>.
<ul>
<li>Four distinct models are trained for different entity types: Atoms ($O^a$), Bonds ($O^b$), Charges ($O^c$), and Stereocenters ($O^s$).</li>
<li><strong>Loss Function:</strong> Multi-task loss combining Multi-class Log Loss ($L_{cls}$) and Regression Loss ($L_{reg}$).</li>
</ul>
</li>
<li><strong>ChemExpert:</strong> An ensemble wrapper that prioritizes models based on user preference (e.g., DECIMER first, then AtomLenz). It accepts the first prediction that passes RDKit chemical validity checks.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Primary metrics focused on structural correctness and localization accuracy.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Hand-drawn)</th>
          <th>Baseline (DECIMER FT)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (T=1)</strong></td>
          <td>33.8% (AtomLenz+EditKT*)</td>
          <td>62.2%</td>
          <td>Exact ECFP6 fingerprint match.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto Sim.</strong></td>
          <td>0.484</td>
          <td>0.727</td>
          <td>Average similarity.</td>
      </tr>
      <tr>
          <td><strong>mAP</strong></td>
          <td>0.801</td>
          <td>N/A</td>
          <td>Localization accuracy (IoU 0.05-0.35).</td>
      </tr>
      <tr>
          <td><strong>Ensemble Acc.</strong></td>
          <td><strong>63.5%</strong></td>
          <td>62.2%</td>
          <td>ChemExpert (DECIMER + AtomLenz).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Experiments utilized the Flemish Supercomputer Center (VSC) resources.</li>
<li><strong>Note:</strong> Specific GPU models (e.g., A100/V100) are not explicitly detailed in the text, but Faster R-CNN training is standard on consumer or enterprise GPUs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., De Brouwer, E., Arany, Á., &amp; Moreau, Y. (2024). Atom-Level Optical Chemical Structure Recognition with Limited Supervision. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2024.</p>
<p><strong>Publication venue/year</strong>: CVPR 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molden/atomlenz">Official Repository</a></li>
<li><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset on Figshare</a></li>
</ul>
<p><strong>BibTeX</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{oldenhofAtomLevelOpticalChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Atom-Level Optical Chemical Structure Recognition with Limited Supervision}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Oldenhof, Martijn and De Brouwer, Edward and Arany, {\&#39;A}d{\&#39;a}m and Moreau, Yves}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2404.01743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SwinOCSR: End-to-End Chemical OCR with Swin Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</guid><description>Deep learning model using Swin Transformer and Focal Loss for OCSR, achieving 98.58% accuracy on synthetic benchmarks.</description><content:encoded><![CDATA[<h2 id="contribution-methodological-architecture-and-datasets">Contribution: Methodological Architecture and Datasets</h2>
<p>This is a <strong>Methodological Paper</strong> with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).</li>
<li><strong>Resource</strong>: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.</li>
</ul>
<h2 id="motivation-addressing-visual-context-and-data-imbalance">Motivation: Addressing Visual Context and Data Imbalance</h2>
<ul>
<li><strong>Problem</strong>: OCSR (converting images of chemical structures to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.</li>
<li><strong>Technical Gap</strong>: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and miss global dependencies required for interpreting complex molecular diagrams.</li>
<li><strong>Data Imbalance</strong>: Chemical strings suffer from severe class imbalance (e.g., &lsquo;C&rsquo; and &lsquo;H&rsquo; are frequent; &lsquo;Br&rsquo; or &lsquo;Cl&rsquo; are rare), which causes standard Cross Entropy loss to underperform.</li>
</ul>
<h2 id="core-innovation-swin-transformers-and-focal-loss">Core Innovation: Swin Transformers and Focal Loss</h2>
<ul>
<li><strong>Swin Transformer Backbone</strong>: SwinOCSR replaces the standard CNN backbone with a <strong>Swin Transformer</strong>, using shifted window attention to capture both local and global image features more effectively.</li>
<li><strong>Multi-label Focal Loss (MFL)</strong>: The paper introduces a modified Focal Loss for OCSR, which the authors describe as the first explicit attempt to address token imbalance in this task. It penalizes the model for errors on rare tokens, addressing the &ldquo;long-tail&rdquo; distribution of chemical elements. The standard Focal Loss formulation heavily weights hard-to-classify examples:
$$ FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) $$</li>
<li><strong>Structured Synthetic Dataset</strong>: Creation of a dataset explicitly balanced across four structural categories: Kekule rings, Aromatic rings, and their combinations with substituents.</li>
</ul>
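<p>The per-token Focal Loss above is straightforward to sketch directly; the $\alpha_t$ and $\gamma$ defaults below are common choices from the Focal Loss literature, not values reported in the paper:</p>

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t). For a confidently
    correct token (p_t near 1) the modulating factor (1 - p_t)**gamma
    shrinks the loss, so rare, hard tokens dominate the gradient."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95)  # frequent token, confidently right
hard = focal_loss(0.30)  # rare token, mostly wrong
```

With <code>gamma=0</code> the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which is the sense in which Focal Loss generalizes CE.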
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Backbone Comparison</strong>: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).</li>
<li><strong>Loss Function Ablation</strong>: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).</li>
<li><strong>Category Stress Test</strong>: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.</li>
<li><strong>Real-world Evaluation</strong>: The model was tested on 100 images manually extracted from the literature (with manually labeled SMILES), and separately on 100 CDK-generated images from those same SMILES, to measure the domain gap between synthetic and real-world data.</li>
</ul>
<h2 id="results-and-limitations">Results and Limitations</h2>
<ul>
<li><strong>Synthetic test set performance</strong>: With Multi-label Focal Loss (MFL), SwinOCSR achieved <strong>98.58% accuracy</strong> on the synthetic test set, compared to 97.36% with standard CE loss. Both ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones scored lower when using CE loss (Table 3).</li>
<li><strong>Handling of long sequences</strong>: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating effective global feature extraction.</li>
<li><strong>Per-category results</strong>: Performance was consistent across molecule categories: Category 1 (Kekule, 98.20%), Category 2 (Aromatic, 98.46%), Category 3 (Kekule + Substituents, 98.76%), Category 4 (Aromatic + Substituents, 98.89%). The model performed slightly better on molecules with substituents and aromatic rings.</li>
<li><strong>Domain shift</strong>: While performance on synthetic data was strong, accuracy dropped to <strong>25%</strong> on 100 real-world literature images. On 100 CDK-generated images from the same SMILES strings, accuracy was 94%, confirming that the gap stems from stylistic differences between CDK-rendered and real-world images. The authors attribute this to noise, low resolution, and variations such as condensed structural formulas and abbreviations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The first 8.5 million structures from <strong>PubChem</strong> were downloaded, yielding ~6.9 million unique SMILES.</li>
<li><strong>Generation Pipeline</strong>:
<ul>
<li><strong>Tools</strong>: <strong>CDK</strong> (Chemistry Development Kit) for image rendering; <strong>RDKit</strong> for SMILES canonicalization.</li>
<li><strong>Augmentation</strong>: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekule, (2) Aromatic, (3) Kekule + Substituents, (4) Aromatic + Substituents. Substituents were randomly added from a list of 224 common patent substituents.</li>
<li><strong>Preprocessing</strong>: Images rendered as binary, resized to <strong>224x224</strong>, and copied to 3 channels (RGB simulation).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>4,500,000</td>
          <td>18:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
      <tr>
          <td>Test</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Loss Function</strong>: <strong>Multi-label Focal Loss (MFL)</strong>. The single-label classification task was cast as multi-label to apply Focal Loss, using a sigmoid activation on logits.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Optimizer</strong>: <strong>Adam</strong> with initial learning rate <code>5e-4</code>.</li>
<li><strong>Schedulers</strong>: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.</li>
<li><strong>Regularization</strong>: Dropout rate of <code>0.1</code>.</li>
</ul>
</li>
</ul>
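<p>A minimal sketch of the multi-label focal loss idea: each logit gets a sigmoid, and easy examples are down-weighted. The $\gamma = 2$, $\alpha = 0.25$ defaults follow the original focal-loss paper and are an assumption here, not values restated from SwinOCSR:</p>

```python
import numpy as np

def multilabel_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss applied per class with a sigmoid, in the spirit of MFL.

    `targets` is a 0/1 array of the same shape as `logits`.
    """
    p = 1.0 / (1.0 + np.exp(-logits))           # sigmoid on each logit
    pt = np.where(targets == 1, p, 1.0 - p)     # probability of the true label
    at = np.where(targets == 1, alpha, 1.0 - alpha)
    # Down-weight easy examples by (1 - pt)^gamma before the log term.
    return float(np.mean(-at * (1.0 - pt) ** gamma * np.log(pt + 1e-12)))
```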
<h3 id="models">Models</h3>
<ul>
<li><strong>Backbone (Encoder 1)</strong>: <strong>Swin Transformer</strong>.
<ul>
<li>Patch size: $4 \times 4$.</li>
<li>Linear embedding dimension: 192.</li>
<li>Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).</li>
<li>Output: Flattened patch sequence $S_b$.</li>
</ul>
</li>
<li><strong>Transformer Encoder (Encoder 2)</strong>: 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.</li>
<li><strong>Transformer Decoder</strong>: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).</li>
<li><strong>Tokenization</strong>: <strong>DeepSMILES</strong> format used (syntactically more robust than SMILES). Vocabulary size: <strong>76 tokens</strong> (the unique characters found in the dataset). Embedding dimension: 256.</li>
</ul>
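<p>A character-level vocabulary of this kind can be sketched as follows; the special-token names (<code>&lt;pad&gt;</code>, <code>&lt;sos&gt;</code>, <code>&lt;eos&gt;</code>) are illustrative assumptions, not taken from the paper:</p>

```python
def build_vocab(deepsmiles_strings):
    """Build a character-level vocabulary with special tokens.

    SwinOCSR reports 76 unique characters in its dataset; here the
    vocabulary is simply whatever characters appear in the input.
    """
    chars = sorted({ch for s in deepsmiles_strings for ch in s})
    tokens = ["<pad>", "<sos>", "<eos>"] + chars
    return {tok: i for i, tok in enumerate(tokens)}

def encode(s, vocab):
    # Wrap each string in start/end markers for the decoder.
    return [vocab["<sos>"]] + [vocab[c] for c in s] + [vocab["<eos>"]]
```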
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy (Exact Match), Tanimoto Similarity (PubChem fingerprints), BLEU, ROUGE.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SwinOCSR (CE)</th>
          <th>SwinOCSR (MFL)</th>
          <th>ResNet-50 (CE)</th>
          <th>EfficientNet-B3 (CE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>97.36%</td>
          <td><strong>98.58%</strong></td>
          <td>89.17%</td>
          <td>86.70%</td>
      </tr>
      <tr>
          <td>Tanimoto</td>
          <td>99.65%</td>
          <td><strong>99.77%</strong></td>
          <td>98.79%</td>
          <td>98.46%</td>
      </tr>
      <tr>
          <td>BLEU</td>
          <td>99.46%</td>
          <td><strong>99.59%</strong></td>
          <td>98.62%</td>
          <td>98.37%</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>99.64%</td>
          <td><strong>99.78%</strong></td>
          <td>98.87%</td>
          <td>98.66%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Trained on <strong>NVIDIA Tesla V100-PCIE</strong>.</li>
<li><strong>Training Duration</strong>: 30 epochs.</li>
<li><strong>Batch Size</strong>: 256 images ($224 \times 224$ pixels).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/suanfaxiaohuo/SwinOCSR">SwinOCSR</a></td>
          <td>Code + Data</td>
          <td>Unknown</td>
          <td>Official implementation with dataset and trained models</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. <em>Journal of Cheminformatics</em>, 14(41). <a href="https://doi.org/10.1186/s13321-022-00624-5">https://doi.org/10.1186/s13321-022-00624-5</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/suanfaxiaohuo/SwinOCSR">GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>String Representations for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</guid><description>Ablation study comparing SMILES, DeepSMILES, SELFIES, and InChI for OCSR. SMILES achieves highest accuracy; SELFIES guarantees validity.</description><content:encoded><![CDATA[<h2 id="empirical-focus-and-resource-contributions">Empirical Focus and Resource Contributions</h2>
<p>This is an <strong>Empirical Paper</strong> ($\Psi_{\text{Empirical}}$) with a secondary contribution as a <strong>Resource Paper</strong> ($\Psi_{\text{Resource}}$).</p>
<p>It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, DeepSMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.</p>
<h2 id="the-syntax-challenge-in-chemical-image-recognition">The Syntax Challenge in Chemical Image Recognition</h2>
<p>Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. Specifically, the tokenization of SMILES strings (where ring closures and branches are marked by single characters potentially far apart in the sequence) creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.</p>
<h2 id="isolating-string-representation-variables">Isolating String Representation Variables</h2>
<p>The core novelty is the <strong>comparative isolation of the string representation variable</strong> in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between <em>validity</em> (guaranteed by SELFIES) and <em>accuracy</em> (highest with SMILES).</p>
<h2 id="large-scale-image-to-text-translation-experiments">Large-Scale Image-to-Text Translation Experiments</h2>
<p>The authors performed a large-scale image-to-text translation experiment:</p>
<ul>
<li><strong>Task</strong>: Converting 2D chemical structure images into text strings.</li>
<li><strong>Data</strong>:
<ul>
<li><strong>ChEMBL</strong>: ~1.6M molecules, split into two datasets (with and without stereochemistry).</li>
<li><strong>PubChem</strong>: ~3M molecules, split similarly, to test performance scaling with data size.</li>
</ul>
</li>
<li><strong>Representations</strong>: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI.</li>
<li><strong>Metric</strong>: The models were evaluated on:
<ul>
<li><strong>Validity</strong>: Can the predicted string be decoded back to a molecule?</li>
<li><strong>Exact Match</strong>: Is the predicted string identical to the ground truth?</li>
<li><strong>Tanimoto Similarity</strong>: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)? The similarity $\mathcal{T}$ between two molecular fingerprints $A$ and $B$ is calculated as:
$$ \mathcal{T}(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</li>
</ul>
</li>
</ul>
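<p>For binary fingerprints, the Tanimoto formula above reduces to bit counting: the dot product counts common on-bits and each squared norm counts a vector's own on-bits. A minimal sketch:</p>

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto similarity between two binary (0/1) fingerprint vectors.

    Matches T(A, B) = A.B / (|A|^2 + |B|^2 - A.B) for bit vectors.
    """
    ab = float(np.dot(a, b))            # common on-bits
    return ab / (float(np.dot(a, a)) + float(np.dot(b, b)) - ab)
```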
<h2 id="comparative-performance-and-validity-trade-offs">Comparative Performance and Validity Trade-offs</h2>
<ul>
<li><strong>SMILES is the most accurate</strong>: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.</li>
<li><strong>SELFIES guarantees validity</strong>: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.</li>
<li><strong>InChI is unsuitable</strong>: InChI performed significantly worse (approx. 64% exact match) due to extreme maximum string lengths (up to 273 characters).</li>
<li><strong>Stereochemistry adds difficulty</strong>: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.</li>
<li><strong>Recommendation</strong>: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (Dataset 1/2)</td>
          <td>~1.5M</td>
          <td>Filtered for MW &lt; 1500, specific elements (C,H,O,N,P,S,F,Cl,Br,I,Se,B).</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem (Dataset 3/4)</td>
          <td>~3.0M</td>
          <td>Same filtering rules, used to test scaling.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Test Split</td>
          <td>~120k - 250k</td>
          <td>Created using RDKit MaxMin algorithm to ensure chemical diversity.</td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: CDK Structure Diagram Generator (SDG).</li>
<li><strong>Specs</strong>: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tokenization Rules</strong> (Critical for replication):</p>
<ul>
<li><strong>SELFIES</strong>: Split at every <code>][</code> (e.g., <code>[C][N]</code> $\rightarrow$ <code>[C]</code>, <code>[N]</code>).</li>
<li><strong>SMILES / DeepSMILES</strong>: Regex-based splitting:
<ul>
<li>Every heavy atom (e.g., <code>C</code>, <code>N</code>).</li>
<li>Every parenthesis <code>(</code> and <code>)</code>.</li>
<li>Every bond symbol <code>=</code> and <code>#</code>.</li>
<li>Every single-digit number.</li>
<li>Everything inside square brackets <code>[]</code> is kept as a single token.</li>
</ul>
</li>
<li><strong>InChI</strong>: The prefix <code>InChI=1S/</code> was treated as a single token and removed during training, then re-added for evaluation.</li>
</ul>
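<p>The tokenization rules above can be expressed compactly with regular expressions. This sketch adds <code>Cl</code>/<code>Br</code> alternatives (so two-letter elements stay one token) and a catch-all for stray symbols — small extensions beyond the rules listed, flagged here as assumptions:</p>

```python
import re

# One alternative per rule: bracket expressions first (kept whole), then
# two-letter elements, single letters, parentheses, bonds, ring digits,
# and a catch-all for any remaining character.
_TOKEN_RE = re.compile(r"\[[^\]]*\]|Cl|Br|[A-Za-z]|[()=#]|\d|.")

def tokenize_smiles(smiles: str):
    return _TOKEN_RE.findall(smiles)

def tokenize_selfies(selfies: str):
    # SELFIES splits at every '][' boundary, keeping the brackets.
    return re.findall(r"\[[^\]]*\]", selfies)
```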
<h3 id="models">Models</h3>
<p>The model follows the <strong>DECIMER</strong> architecture.</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-B3 (pre-trained with &ldquo;Noisy Student&rdquo; weights).
<ul>
<li>Output: Image feature vectors of shape $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder</strong>: Transformer (similar to the &ldquo;Base&rdquo; model from <em>Attention Is All You Need</em>).
<ul>
<li>Layers: 4 encoder-decoder layers.</li>
<li>Attention Heads: 8.</li>
<li>Dimension ($d_{\text{model}}$): 512.</li>
<li>Feed-forward ($d_{\text{ff}}$): 2048.</li>
<li>Dropout: 10%.</li>
</ul>
</li>
<li><strong>Loss</strong>: Sparse categorical cross-entropy.</li>
<li><strong>Optimizer</strong>: Adam with custom learning rate scheduler.</li>
</ul>
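<p>The &ldquo;custom learning rate scheduler&rdquo; is not specified here; a standard choice for this exact Transformer configuration is the Noam schedule from <em>Attention Is All You Need</em> (linear warmup, then inverse-square-root decay), shown below as an assumption:</p>

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Noam learning-rate schedule: warm up linearly, then decay as 1/sqrt(step).

    d_model matches the paper's decoder dimension; warmup=4000 is the
    original Transformer default, not a value reported by this study.
    """
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```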
<h3 id="evaluation">Evaluation</h3>
<p>Metrics were calculated after converting all predictions back to standard SMILES.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Baseline (SMILES)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Identical Match</strong></td>
          <td>88.62% (PubChem)</td>
          <td>Strict character-for-character equality.</td>
      </tr>
      <tr>
          <td><strong>Valid Structure</strong></td>
          <td>99.78%</td>
          <td>SMILES had rare syntax errors; SELFIES achieved 100%.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (Avg)</strong></td>
          <td>0.98</td>
          <td>Calculated using PubChem fingerprints via CDK.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: Google Cloud TPUs (v3-8).</li>
<li><strong>Format</strong>: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency.</li>
<li><strong>Batch Size</strong>: 1024.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER_Short_Communication">DECIMER Short Communication</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts (Python, Java)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5155037">Datasets on Zenodo</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>SMILES data and processing scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Steinbeck, C., &amp; Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. <em>Digital Discovery</em>, 1(2), 84-90. <a href="https://doi.org/10.1039/D1DD00013F">https://doi.org/10.1039/D1DD00013F</a></p>
<p><strong>Publication</strong>: Digital Discovery 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://chemrxiv.org/doi/pdf/10.26434/chemrxiv-2021-7c9wf">ChemRxiv Preprint (PDF)</a></li>
<li><a href="https://github.com/Kohulan/DECIMER_Short_Communication">Official Code Repository</a></li>
<li><a href="https://doi.org/10.5281/zenodo.5155037">Data on Zenodo</a></li>
<li>Related work: <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/">IMG2SMI</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanPerformanceChemicalStructure2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{84--90}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D1DD00013F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of OCSR Techniques and Models (Musazade 2022)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</guid><description>Systematization of OCSR evolution from rule-based systems to deep learning, highlighting the paradigm shift to image captioning approaches.</description><content:encoded><![CDATA[<h2 id="systematization-of-ocsr-evolution">Systematization of OCSR Evolution</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$). It organizes existing literature into two distinct evolutionary phases: <strong>Rule-based systems</strong> (1990s-2010s) and <strong>Machine Learning-based systems</strong> (2015-present). It synthesizes performance metrics across these paradigms to highlight the shift from simple classification to &ldquo;image captioning&rdquo; (sequence generation).</p>
<p><strong>Justification</strong>: The paper focuses on &ldquo;organizing and synthesizing existing literature&rdquo; and answers the core question: &ldquo;What do we know?&rdquo; The dominant contribution is systematization based on several key indicators:</p>
<ol>
<li>
<p><strong>Survey Structure</strong>: The paper explicitly structures content by categorizing the field into two distinct historical and methodological groups: &ldquo;Rule-based systems&rdquo; and &ldquo;ML-based systems&rdquo;. It traces the &ldquo;evolution of approaches from rule-based structure analyses to complex statistical models&rdquo;, moving chronologically from early tools like OROCS and OSRA (1990s-2000s) to modern Deep Learning approaches like DECIMER and Vision Transformers.</p>
</li>
<li>
<p><strong>Synthesis of Knowledge</strong>: The paper aggregates performance metrics from various distinct studies into unified comparison tables (Table 1 for rule-based and Table 2 for ML-based). It synthesizes technical details of different models, explaining how specific architectures (CNNs, LSTMs, Attention mechanisms) are applied to the specific problem of Optical Chemical Structure Recognition (OCSR).</p>
</li>
<li>
<p><strong>Identification of Gaps</strong>: The authors dedicate specific sections to &ldquo;Gaps of rule-based systems&rdquo; and &ldquo;Gaps of ML-based systems&rdquo;. It concludes with recommendations for future development, such as the need for &ldquo;standardized datasets&rdquo; and specific improvements in image augmentation and evaluation metrics.</p>
</li>
</ol>
<h2 id="motivation-for-digitization-in-cheminformatics">Motivation for Digitization in Cheminformatics</h2>
<p>The primary motivation is the need to digitize vast amounts of chemical knowledge locked in non-digital formats (e.g., scanned PDFs, older textbooks). This is challenging because:</p>
<ol>
<li><strong>Representational Variety</strong>: A single chemical formula can be drawn in many visually distinct ways (e.g., different orientations, bond styles, fonts).</li>
<li><strong>Legacy Data</strong>: Older documents contain noise, low resolution, and disconnected strokes that confuse standard computer vision models.</li>
<li><strong>Lack of Standardization</strong>: There is no centralized database or standardized benchmark for evaluating OCSR performance, making comparison difficult.</li>
</ol>
<h2 id="key-insights-and-the-paradigm-shift">Key Insights and the Paradigm Shift</h2>
<p>The paper provides a structured comparison of the &ldquo;evolution&rdquo; of OCSR, specifically identifying the pivot point where the field moved from object detection to <strong>NLP-inspired sequence generation</strong>.</p>
<p>Key insights include:</p>
<ul>
<li><strong>The Paradigm Shift</strong>: Identifying that OCSR has effectively become an &ldquo;image captioning&rdquo; problem where the &ldquo;caption&rdquo; is a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> string.</li>
<li><strong>Metric Critique</strong>: It critically analyzes the flaws in current evaluation metrics, noting that Levenshtein Distance (LD) is better than simple accuracy but still fails to capture semantic chemical severity (e.g., mistaking &ldquo;F&rdquo; for &ldquo;S&rdquo; is worse than a wrong digit).</li>
<li><strong>Hybrid Potential</strong>: Despite the dominance of ML, the authors argue that rule-based heuristics are still valuable for post-processing validation (e.g., checking element order, sequence structure, and formula correspondence).</li>
</ul>
<h2 id="comparative-analysis-of-rule-based-vs-ml-systems">Comparative Analysis of Rule-Based vs. ML Systems</h2>
<p>As a review paper, it aggregates experimental results from primary sources. It compares:</p>
<ul>
<li><strong>Rule-based systems</strong>: OSRA, chemoCR, Imago, Markov Logic OCSR, and various heuristic approaches.</li>
<li><strong>ML-based systems</strong>: DECIMER (multiple versions), MSE-DUDL, ICMDT (Image Captioning Model based on Deep Transformer-in-Transformer), and other BMS Kaggle competition solutions.</li>
</ul>
<p>It contrasts these systems using:</p>
<ul>
<li><strong>Datasets</strong>: BMS (synthetic, 4M images), PubChem (synthetic), U.S. Patents (real-world scanned).</li>
<li><strong>Metrics</strong>: Tanimoto similarity (structural overlap) and Levenshtein distance (string edit distance).</li>
</ul>
<h2 id="outcomes-critical-gaps-and-recommendations">Outcomes, Critical Gaps, and Recommendations</h2>
<ol>
<li><strong>Transformers are SOTA</strong>: Attention-based encoder-decoder models outperform CNN-RNN hybrids. DECIMER 1.0 achieved 96.47% Tanimoto $= 1.0$ on its test set using an EfficientNet-B3 encoder and Transformer decoder.</li>
<li><strong>Data Hungry</strong>: Modern approaches require massive datasets (millions of images) and significant compute. DECIMER 1.0 trained on 39M images for 14 days on TPU, while the original DECIMER took 27 days on a single GPU. Rule-based systems required neither large data nor heavy compute but hit a performance ceiling.</li>
<li><strong>Critical Gaps</strong>:
<ul>
<li><strong>Super-atoms</strong>: Current models struggle with abbreviated super-atoms (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;).</li>
<li><strong>Stereochemistry</strong>: 3D information (wedges/dashes) is often lost or misinterpreted.</li>
<li><strong>Resolution</strong>: Models are brittle to resolution changes; some require high-res, others fail if images aren&rsquo;t downscaled.</li>
</ul>
</li>
<li><strong>Recommendation</strong>: Future systems should integrate &ldquo;smart&rdquo; pre-processing (denoising without cropping) and use domain-specific distance metrics. The authors also note that post-processing formula validation (checking element order, sequence structure, and formula correspondence) increases accuracy by around 5-6% on average. They suggest exploring Capsule Networks as an alternative to CNNs, since capsules add position invariance through routing-by-agreement rather than max-pooling.</li>
</ol>
<h2 id="reproducibility">Reproducibility</h2>
<p>As a review paper, this work does not introduce original code, models, or datasets. The paper itself is open access via the Journal of Cheminformatics. This section summarizes the technical details of the systems reviewed.</p>
<h3 id="data">Data</h3>
<p>The review identifies the following key datasets used for training OCSR models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BMS (Bristol-Myers Squibb)</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~4M images</td>
          <td style="text-align: left">2.4M train / 1.6M test. Used for Kaggle competition. Test images contain noise (salt &amp; pepper, blur) and rotations absent from training images.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>PubChem</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~39M</td>
          <td style="text-align: left">Generated via CDK (Chemistry Development Kit). Used by DECIMER 1.0 (90/10 train/test split).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>U.S. Patents (USPTO)</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">Variable</td>
          <td style="text-align: left">Real-world noise, often low resolution. One of several training sources for MSE-DUDL (alongside PubChem and Indigo, totaling 50M+ samples).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ChemInfty</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">869 images</td>
          <td style="text-align: left">Older benchmark used to evaluate rule-based systems (e.g., Markov Logic OCSR).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review highlights the progression of algorithms:</p>
<ul>
<li><strong>Rule-Based</strong>: Hough transforms for bond detection, vectorization/skeletonization, and OCR for atom labels.</li>
<li><strong>Sequence Modeling</strong>:
<ul>
<li><strong>Image Captioning</strong>: Encoder (CNN/ViT) → Decoder (RNN/Transformer).</li>
<li><strong>Tokenization</strong>: Parsing InChI/SMILES into discrete tokens (e.g., splitting <code>C13</code> into <code>C</code>, <code>13</code>).</li>
<li><strong>Beam Search</strong>: Used at inference (typically $k = 15$&ndash;$20$) to find the most likely chemical string.</li>
</ul>
</li>
</ul>
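<p>The beam-search decoding step can be sketched as follows; <code>step_fn</code> stands in for the decoder's next-token distribution and is an assumption of this sketch, not an API from any reviewed system:</p>

```python
import math

def beam_search(step_fn, sos, eos, k=15, max_len=50):
    """Keep the k highest log-probability sequences at each decoding step.

    `step_fn(prefix)` returns a list of (token, probability) pairs for
    the next token given the prefix so far.
    """
    beams = [([sos], 0.0)]                        # (sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                    # finished beams pass through
                candidates.append((seq, score))
                continue
            for tok, p in step_fn(seq):
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]                            # best-scoring sequence
```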
<h3 id="models">Models</h3>
<p>Key architectures reviewed:</p>
<ul>
<li><strong>DECIMER 1.0</strong>: Uses <strong>EfficientNet-B3</strong> (Encoder) and <strong>Transformer</strong> (Decoder). Predicts <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings (more robust than <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>).</li>
<li><strong>Swin Transformer</strong>: Often used in Kaggle ensembles as the visual encoder due to better handling of variable image sizes.</li>
<li><strong>Grid LSTM</strong>: Used in older deep learning approaches (MSE-DUDL) to capture spatial dependencies.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics standard in the field:</p>
<ul>
<li><strong>Levenshtein Distance (LD)</strong>: Edit distance between predicted and ground truth strings. Lower is better. Formally, for two sequences $a$ and $b$ (e.g. SMILES strings) of lengths $|a|$ and $|b|$, the distance $LD(a, b)$ is bounded below by $\big||a| - |b|\big|$ and above by $\max(|a|, |b|)$.</li>
<li><strong>Tanimoto Similarity</strong>: Measures overlap of molecular fingerprints ($0.0 - 1.0$). Higher is better. DECIMER 1.0 achieved a Tanimoto of 0.99 on PubChem data (Table 2). Calculated as:
$$
\begin{aligned}
T(A, B) = \frac{N_c}{N_a + N_b - N_c}
\end{aligned}
$$
where $N_a$ and $N_b$ are the number of bits set to 1 in fingerprints $A$ and $B$, and $N_c$ is the number of common bits set to 1.</li>
<li><strong>1-1 Match Rate</strong>: Exact string matching (accuracy). For DECIMER 1.0, 96.47% of results achieved Tanimoto $= 1.0$.</li>
</ul>
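<p>The LD metric above is the standard dynamic-programming edit distance; a minimal row-by-row implementation:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (e.g. predicted vs. true SMILES).

    Classic DP over a (|a|+1) x (|b|+1) grid, keeping only one row at a
    time; O(|a| * |b|) time, O(|b|) space.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```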
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Cost</strong>: High for SOTA. DECIMER 1.0 required ~14 days on TPU. The original DECIMER took ~27 days on a single NVIDIA GPU.</li>
<li><strong>Inference</strong>: Transformer models are heavy; rule-based systems run on standard CPUs but with lower accuracy.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Musazade, F., Jamalova, N., &amp; Hasanov, J. (2022). Review of techniques and models used in optical chemical structure recognition in images and scanned documents. <em>Journal of Cheminformatics</em>, 14(1), 61. <a href="https://doi.org/10.1186/s13321-022-00642-3">https://doi.org/10.1186/s13321-022-00642-3</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{musazadeReviewTechniquesModels2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Review of Techniques and Models Used in Optical Chemical Structure Recognition in Images and Scanned Documents}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Musazade, Fidan and Jamalova, Narmin and Hasanov, Jamaladdin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-022-00642-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>One Strike, You're Out: Detecting Markush Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/</guid><description>Patch-based CNN method for detecting Markush structures in chemical images, addressing low signal-to-noise ratios in OCSR.</description><content:encoded><![CDATA[<h2 id="methodology-and-classification">Methodology and Classification</h2>
<p>This is a <strong>Method</strong> paper (Classification: $\Psi_{\text{Method}}$).</p>
<p>It proposes a patch-based classification pipeline to solve a technical failure mode in Optical Chemical Structure Recognition (OCSR). Distinct rhetorical indicators include a baseline comparison (CNN vs. traditional ORB), ablation studies (architecture, pretraining), and a focus on evaluating the filtering efficacy against a known failure mode.</p>
<h2 id="the-markush-structure-challenge">The Markush Structure Challenge</h2>
<p><strong>The Problem</strong>: Optical Chemical Structure Recognition (OCSR) tools convert 2D images of molecules into machine-readable formats. These tools struggle with &ldquo;Markush structures,&rdquo; generic structural templates used frequently in patents that contain variables rather than specific atoms (e.g., $R$, $X$, $Y$).</p>
<p><strong>The Gap</strong>: Markush structures are difficult to detect because they often appear as small indicators (a single &ldquo;R&rdquo; or variable) within a large image, resulting in a very low Signal-to-Noise Ratio (SNR). Existing OCSR research pipelines typically bypass this by manually excluding these structures from their datasets.</p>
<p><strong>The Goal</strong>: To build an automated filter that can identify images containing Markush structures so they can be removed from OCSR pipelines, improving overall database quality without requiring manual data curation.</p>
<h2 id="patch-based-classification-pipeline">Patch-Based Classification Pipeline</h2>
<p>The core technical contribution is an end-to-end deep learning pipeline tailored for low-SNR chemical images where standard global resizing or cropping fails due to large variations in image resolution and pixel scales.</p>
<ul>
<li><strong>Patch Generation</strong>: The system slices input images into overlapping patches generated from two offset grids, ensuring that variables falling on boundaries are fully captured in at least one crop.</li>
<li><strong>Targeted Annotation</strong>: The labels rely on pixel-level bounding boxes around Markush indicators, minimizing the noise that would otherwise overwhelm a full-image classification attempt.</li>
<li><strong>Inference Strategy</strong>: During inference, the query image is broken into patches, each patch is classified individually, and the patch scores are aggregated into a single image-level prediction via the maximum-pooling rule $X = \max_{i=1}^{n} \{ x_i \}$.</li>
<li><strong>Evaluation</strong>: Provides the first systematic comparison between fixed-feature extraction (ORB + XGBoost) and end-to-end deep learning for this specific domain.</li>
</ul>
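The two-grid slicing and max-pooling aggregation above can be sketched in pure Python (the half-patch offset and max rule follow the paper; function names and the stride-equals-patch-size choice are illustrative):

```python
def patch_origins(w, h, p):
    """Top-left corners of patches from two overlapping grids: a base
    grid with stride p, plus a grid shifted by half a patch so that an
    indicator cut by one grid's edge is whole in the other's crop."""
    origins = []
    for off in (0, p // 2):  # base grid, then half-patch-offset grid
        for x in range(off, w, p):
            for y in range(off, h, p):
                origins.append((x, y))
    return origins


def image_score(patch_scores):
    """Image-level aggregation rule from the paper: X = max_i x_i."""
    return max(patch_scores)
```

A single confident patch is thus enough to flag the whole image — the "one strike, you're out" behavior in the title.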
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared two distinct paradigms on a manually annotated dataset:</p>
<ol>
<li>
<p><strong>Fixed-Feature Baseline</strong>: Used <strong>ORB</strong> (Oriented FAST and Rotated BRIEF) to detect keypoints and match them against a template bank of known Markush symbols. Features (match counts, Hamming distances) were fed into an <strong>XGBoost</strong> model.</p>
</li>
<li>
<p><strong>Deep Learning Method</strong>: Fine-tuned <strong>ResNet18</strong> and <strong>Inception V3</strong> models on the generated image patches.</p>
<ul>
<li><strong>Ablations</strong>: Contrasted pretraining sources, evaluating general domain (ImageNet) against chemistry-specific domain (USPTO images).</li>
<li><strong>Fine-tuning</strong>: Compared full-network fine-tuning against freezing all but the fully connected layers.</li>
</ul>
</li>
</ol>
<p>To handle significant class imbalance, the primary evaluation metric was the Macro F1 score, defined as:</p>
<p>$$ \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{precision}_i \cdot \text{recall}_i}{\text{precision}_i + \text{recall}_i} $$</p>
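A minimal implementation of this metric, assuming per-class true-positive, false-positive, and false-negative counts have already been tallied:

```python
def macro_f1(per_class_counts):
    """Macro F1: the unweighted mean of per-class F1 scores, so the
    rare 'Markush' class counts as much as the majority class.
    per_class_counts is a list of (tp, fp, fn) tuples, one per class."""
    f1s = []
    for tp, fp, fn in per_class_counts:
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```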
<h2 id="performance-outcomes">Performance Outcomes</h2>
<ul>
<li>
<p><strong>CNN vs. ORB</strong>: Deep learning architectures outperformed the fixed-feature baseline. The best model (<strong>Inception V3</strong> pretrained on ImageNet) achieved an image-level Macro F1 of <strong>0.928</strong>, compared to <strong>0.701</strong> (image-level) for the ORB baseline, and a patch-level Macro F1 of <strong>0.917</strong>.</p>
</li>
<li>
<p><strong>The Pretraining Surprise</strong>: Counterintuitively, ImageNet pretraining consistently outperformed the domain-specific USPTO pretraining. The authors hypothesize that the filters learned from ImageNet pretraining generalize well outside the ImageNet domain, though why the USPTO-pretrained filters underperform remains unclear.</p>
</li>
<li>
<p><strong>Full Model Tuning</strong>: Unfreezing the entire network yielded higher performance than tuning only the classifier head, indicating that standard low-level visual filters require substantial adaptation to reliably distinguish chemical line drawings.</p>
</li>
<li>
<p><strong>Limitations and Edge Cases</strong>: The best CNN achieved an ROC AUC of <strong>0.97</strong> on the primary patch test set, while the ORB baseline scored <strong>0.81</strong> on the auxiliary dataset (the paper notes these ROC curves are not directly comparable due to different evaluation sets). The aggregation metric ($X = \max \{ x_i \}$) is naive and has not been optimized. Furthermore, the patching approach creates inherent label noise when a Markush indicator is cleanly bisected by a patch edge, potentially forcing the network to learn incomplete visual features.</p>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a primary dataset labeled by domain experts and a larger auxiliary dataset for evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Val</strong></td>
          <td><strong>Primary Dataset</strong></td>
          <td>272 Images</td>
          <td>Manually annotated with bounding boxes for Markush indicators. Split 60/20/20.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>Auxiliary Dataset</strong></td>
          <td>~5.4k Images</td>
          <td>5117 complete structures, 317 Markush. Used for image-level testing only (no bbox).</td>
      </tr>
  </tbody>
</table>
<p><strong>Patch Generation</strong>:</p>
<ul>
<li>Images are cropped into patches of size <strong>224x224</strong> (ResNet) or <strong>299x299</strong> (Inception).</li>
<li>Patches are generated from 2 grids offset by half the patch width/height to ensure annotations aren&rsquo;t lost on edges.</li>
<li><strong>Labeling Rule</strong>: A patch is labeled &ldquo;Markush&rdquo; if &gt;50% of an annotation&rsquo;s pixels fall inside it.</li>
</ul>
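The &gt;50% labeling rule can be sketched with axis-aligned boxes (a simplification: the paper counts annotation pixels, approximated here by box area; names are illustrative):

```python
def overlap_fraction(ann, patch):
    """Fraction of an annotation box (x1, y1, x2, y2) covered by a patch box."""
    ax1, ay1, ax2, ay2 = ann
    px1, py1, px2, py2 = patch
    iw = max(0, min(ax2, px2) - max(ax1, px1))  # intersection width
    ih = max(0, min(ay2, py2) - max(ay1, py1))  # intersection height
    ann_area = (ax2 - ax1) * (ay2 - ay1)
    return iw * ih / ann_area if ann_area else 0.0


def patch_label(annotations, patch):
    """A patch is positive ('Markush') if any indicator annotation has
    more than half of its area inside the patch."""
    return any(overlap_fraction(a, patch) > 0.5 for a in annotations)
```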
<h3 id="algorithms">Algorithms</h3>
<p><strong>ORB (Baseline)</strong>:</p>
<ul>
<li>Matches query images against a bank of template patches containing Markush indicators.</li>
<li><strong>Features</strong>: Number of keypoints, number of matches, Hamming distance of best 5 matches.</li>
<li><strong>Classifier</strong>: XGBoost trained on these features.</li>
<li><strong>Hyperparameters</strong>: Search over number of features (500-2000) and template patches (50-250).</li>
</ul>
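How the ORB-derived feature vector might be assembled, shown in pure Python over integer-encoded binary descriptors (the actual baseline uses OpenCV's ORB; the 64-bit match threshold and the padding value for images with few keypoints are illustrative assumptions, not from the paper):

```python
def hamming(d1, d2):
    """Hamming distance between two binary descriptors encoded as ints."""
    return bin(d1 ^ d2).count("1")


def orb_features(query_desc, template_desc, k=5):
    """Assemble an XGBoost feature vector in the spirit of the baseline:
    keypoint count, match count under a distance threshold, and the k
    smallest Hamming distances against the Markush template bank."""
    dists = sorted(min(hamming(q, t) for t in template_desc)
                   for q in query_desc)
    n_matches = sum(d <= 64 for d in dists)            # assumed threshold
    best_k = dists[:k] + [256] * (k - len(dists[:k]))  # pad short lists
    return [len(query_desc), n_matches] + best_k
```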
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Framework</strong>: PyTorch with Optuna for optimization.</li>
<li><strong>Optimization</strong>: 25 trials per configuration.</li>
<li><strong>Augmentations</strong>: Random perspective shift, posterization, sharpness/blur.</li>
</ul>
<h3 id="models">Models</h3>
<p>Two main architectures were compared.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Input Size</th>
          <th>Parameters</th>
          <th>Pretraining Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ResNet18</strong></td>
          <td>224x224</td>
          <td>11.5M</td>
          <td>ImageNet</td>
      </tr>
      <tr>
          <td><strong>Inception V3</strong></td>
          <td>299x299</td>
          <td>23.8M</td>
          <td>ImageNet &amp; USPTO</td>
      </tr>
  </tbody>
</table>
<p><strong>Best Configuration</strong>: Inception V3, ImageNet weights, Full Model fine-tuning (all layers unfrozen).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Primary metric was <strong>Macro F1</strong> due to class imbalance.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best CNN (Inception V3)</th>
          <th>Baseline (ORB)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Patch Test F1</strong></td>
          <td>$0.917 \pm 0.014$</td>
          <td>N/A</td>
          <td>ORB does not support patch-level</td>
      </tr>
      <tr>
          <td><strong>Image Test F1</strong></td>
          <td>$0.928 \pm 0.035$</td>
          <td>$0.701 \pm 0.052$</td>
          <td>CNN aggregates patch predictions</td>
      </tr>
      <tr>
          <td><strong>Aux Test F1</strong></td>
          <td>0.914</td>
          <td>0.533</td>
          <td>Evaluation on large secondary dataset</td>
      </tr>
      <tr>
          <td><strong>ROC AUC</strong></td>
          <td>0.97</td>
          <td>0.81</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla V100-SXM2-16GB</li>
<li><strong>CPU</strong>: Intel Xeon E5-2686 @ 2.30GHz</li>
<li><strong>RAM</strong>: 64 GB</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Thomasjurriaans/markush-recognition-msc-thesis">GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>MSc thesis code: CNN training, ORB baseline, evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p>The primary dataset was manually annotated by Elsevier domain experts and is not publicly available. The auxiliary dataset (from Elsevier) is also not public. Pre-trained model weights are not released in the repository.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jurriaans, T., Szarkowska, K., Nalisnick, E., Schwörer, M., Thorne, C., &amp; Akhondi, S. (2023). One Strike, You&rsquo;re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images. <em>arXiv preprint arXiv:2311.14633</em>. <a href="https://doi.org/10.48550/arXiv.2311.14633">https://doi.org/10.48550/arXiv.2311.14633</a></p>
<p><strong>Publication</strong>: arXiv 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Thomasjurriaans/markush-recognition-msc-thesis">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{jurriaansOneStrikeYoure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{One {{Strike}}, {{You}}&#39;re {{Out}}: {{Detecting Markush Structures}} in {{Low Signal-to-Noise Ratio Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{One {{Strike}}, {{You}}&#39;re {{Out}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Jurriaans, Thomas and Szarkowska, Kinga and Nalisnick, Eric and Schwoerer, Markus and Thorne, Camilo and Akhondi, Saber}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolMiner: Deep Learning OCSR with YOLOv5 Detection</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/</guid><description>Deep learning OCSR tool using YOLOv5 and MobileNetV2 to extract machine-readable molecular structures from scientific documents and PDFs.</description><content:encoded><![CDATA[<h2 id="classification-and-contribution">Classification and Contribution</h2>
<p>This is primarily a <strong>Resource</strong> paper ($\Psi_{\text{Resource}}$) with a strong <strong>Method</strong> component ($\Psi_{\text{Method}}$).</p>
<ul>
<li><strong>Resource</strong>: It presents a complete software application (published as an &ldquo;Application Note&rdquo;) for Optical Chemical Structure Recognition (OCSR), including a graphical user interface (GUI) and a new curated &ldquo;Real-World&rdquo; dataset of 3,040 molecular images.</li>
<li><strong>Method</strong>: It proposes a novel &ldquo;rule-free&rdquo; pipeline that replaces traditional vectorization algorithms with deep learning object detection (YOLOv5) and segmentation models.</li>
</ul>
<h2 id="motivation-bottlenecks-in-rule-based-systems">Motivation: Bottlenecks in Rule-Based Systems</h2>
<ul>
<li><strong>Legacy Backlog</strong>: Decades of scientific literature contain chemical structures only as 2D images (PDFs), which are not machine-readable.</li>
<li><strong>Limitations of Legacy Architecture</strong>: Existing tools (like OSRA, CLIDE, MolVec) rely on rule-based vectorization (interpreting vectors and nodes) which struggle with noise, low resolution, and complex drawing styles found in scanned documents.</li>
<li><strong>Deep Learning Gap</strong>: While deep learning (DL) has advanced computer vision, few practical, end-to-end DL tools existed for OCSR that could handle the full pipeline from PDF extraction to graph generation with high accuracy.</li>
</ul>
<h2 id="core-innovation-object-detection-paradigm-for-ocsr">Core Innovation: Object Detection Paradigm for OCSR</h2>
<ul>
<li><strong>Object Detection Paradigm</strong>: MolMiner shifts away from the strategy of line-tracing (vectorization), opting to treat atoms and bonds directly as objects to be detected using <strong>YOLOv5</strong>. This allows it to &ldquo;look once&rdquo; at the image.</li>
<li><strong>End-to-End Pipeline</strong>: Integration of three specialized modules:
<ol>
<li><strong>MobileNetV2</strong> for segmenting molecular figures from PDF pages.</li>
<li><strong>YOLOv5</strong> for detecting chemical elements (atoms/bonds) as bounding boxes.</li>
<li><strong>EasyOCR</strong> for recognizing text labels and resolving abbreviations (supergroups) to full explicit structures.</li>
</ol>
</li>
<li><strong>Synthetic Training Strategy</strong>: The authors bypassed manual labeling by building a data generation module that uses RDKit to create chemically valid images with perfect ground-truth annotations automatically.</li>
</ul>
<h2 id="methodology-end-to-end-object-detection-pipeline">Methodology: End-to-End Object Detection Pipeline</h2>
<ul>
<li><strong>Benchmarks</strong>: Evaluated on four standard OCSR datasets: <strong>USPTO</strong> (5,719 images), <strong>UOB</strong> (5,740 images), <strong>CLEF2012</strong> (992 images), and <strong>JPO</strong> (450 images).</li>
<li><strong>New External Dataset</strong>: Collected and annotated a &ldquo;Real-World&rdquo; dataset of <strong>3,040 images</strong> from 239 scientific papers to test generalization beyond synthetic benchmarks.</li>
<li><strong>Baselines</strong>: Compared against open-source tools: <strong>MolVec</strong> (v0.9.8), <strong>OSRA</strong> (v2.1.0), and <strong>Imago</strong> (v2.0).</li>
<li><strong>Qualitative Tests</strong>: Tested on difficult cases like hand-drawn molecules and large-sized scans (e.g., Palytoxin).</li>
</ul>
<h2 id="results-speed-and-generalization-metrics">Results: Speed and Generalization Metrics</h2>
<ul>
<li><strong>Benchmark Performance</strong>: MolMiner outperformed open-source baselines on standard validation splits.
<ul>
<li><em>USPTO</em>: 93.3% MCS accuracy (vs. 89% for MolVec, per Table 2). The commercial CLiDE Pro tool reports a slightly higher 93.8% on USPTO.</li>
<li><em>Real-World Set</em>: 87.8% MCS accuracy (vs. 50.1% for MolVec, 8.9% for OSRA, and 10.3% for Imago).</li>
</ul>
</li>
<li><strong>Inference Velocity</strong>: The architecture allows for faster processing compared to CPU rule-based systems. On JPO (450 images), MolMiner finishes in under 1 minute versus 8-23 minutes for rule-based tools (Table 3).</li>
<li><strong>Robustness</strong>: Demonstrated ability to handle hand-drawn sketches and noisy scans, though limitations remain with crossing bonds, colorful backgrounds, crowded layout segmentation, and Markush structures.</li>
<li><strong>Software Release</strong>: Released as a free desktop application for Mac and Windows with a Ketcher-based editing plugin.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The system relies heavily on synthetic data for training, while evaluation uses both standard and novel real-world datasets.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left"><strong>Synthetic RDKit</strong></td>
          <td style="text-align: left">Large-scale</td>
          <td style="text-align: left">Generated using RDKit v2021.09.1 and ReportLab v3.5.0. Includes augmentations (rotation, thinning, noise).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>USPTO</strong></td>
          <td style="text-align: left">5,719</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 380.0.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>UOB</strong></td>
          <td style="text-align: left">5,740</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 213.5.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>CLEF2012</strong></td>
          <td style="text-align: left">992</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 401.2.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>JPO</strong></td>
          <td style="text-align: left">450</td>
          <td style="text-align: left">Standard benchmark. Avg MW: 360.3.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Evaluation</strong></td>
          <td style="text-align: left"><strong>Real-World</strong></td>
          <td style="text-align: left">3,040</td>
          <td style="text-align: left"><strong>New Contribution</strong>. Collected from 239 scientific papers. <a href="https://zenodo.org/records/6973361">Download Link</a>.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Data Generation</strong>:
<ul>
<li>Uses <strong>RDKit</strong> <code>MolDraw2DSVG</code> and <code>CondenseMolAbbreviations</code> to generate images and ground truth.</li>
<li><strong>Augmentation</strong>: Rotation, line thinning/thickness variation, noise injection.</li>
</ul>
</li>
<li><strong>Graph Construction</strong>:
<ul>
<li>A distance-based algorithm connects recognized &ldquo;Atom&rdquo; and &ldquo;Bond&rdquo; objects into a molecular graph.</li>
<li><strong>Supergroup Parser</strong>: Matches detected text against a dictionary collected from RDKit, ChemAxon, and OSRA to resolve abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Me&rdquo;).</li>
</ul>
</li>
<li><strong>Image Preprocessing</strong>:
<ul>
<li><strong>Resizing</strong>: Images with max dim &gt; 2560 are resized to 2560. Small images (&lt; 640) resized to 640.</li>
<li><strong>Padding</strong>: Images padded to nearest upper bound (640, 1280, 1920, 2560) with white background (255, 255, 255).</li>
<li><strong>Dilation</strong>: For thick-line images, <code>cv2.dilate</code> (3x3 or 2x2 kernel) is applied to estimate median line width.</li>
</ul>
</li>
</ul>
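The resize-then-pad rules above reduce to a small dimension calculation (a sketch of the stated rules only; the actual tool operates on pixel arrays via OpenCV):

```python
def resize_and_pad_dims(w, h):
    """MolMiner-style target dims: scale so the longest side lands in
    [640, 2560], then pick the nearest upper bound in {640, 1280, 1920,
    2560} to pad to with a white background."""
    longest = max(w, h)
    if longest > 2560:
        scale = 2560 / longest
    elif longest < 640:
        scale = 640 / longest
    else:
        scale = 1.0
    rw, rh = round(w * scale), round(h * scale)
    bound = next(b for b in (640, 1280, 1920, 2560) if b >= max(rw, rh))
    return rw, rh, bound
```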
<h3 id="models">Models</h3>
<p>The system is a cascade of three distinct deep learning models:</p>
<ol>
<li><strong>MolMiner-ImgDet</strong> (Page Segmentation):
<ul>
<li><strong>Architecture</strong>: <strong>MobileNetV2</strong>.</li>
<li><strong>Task</strong>: Semantic segmentation to identify and crop chemical figures from full PDF pages.</li>
<li><strong>Classes</strong>: Background vs. Compound.</li>
<li><strong>Performance</strong>: Recall 95.5%.</li>
</ul>
</li>
<li><strong>MolMiner-ImgRec</strong> (Structure Recognition):
<ul>
<li><strong>Architecture</strong>: <strong>YOLOv5</strong> (One-stage object detector). Selected over MaskRCNN/EfficientDet for speed/accuracy trade-off.</li>
<li><strong>Task</strong>: Detects atoms and bonds as bounding boxes.</li>
<li><strong>Labels</strong>:
<ul>
<li><em>Atoms</em>: Si, N, Br, S, I, Cl, H, P, O, C, B, F, Text.</li>
<li><em>Bonds</em>: Single, Double, Triple, Wedge, Dash, Wavy.</li>
</ul>
</li>
<li><strong>Performance</strong>: mAP@0.5 = 97.5%.</li>
</ul>
</li>
<li><strong>MolMiner-TextOCR</strong> (Character Recognition):
<ul>
<li><strong>Architecture</strong>: <strong>EasyOCR</strong> (fine-tuned).</li>
<li><strong>Task</strong>: Recognize specific characters in &ldquo;Text&rdquo; regions identified by YOLO (e.g., supergroups, complex labels).</li>
<li><strong>Performance</strong>: ~96.4% accuracy.</li>
</ul>
</li>
</ol>
<h2 id="performance-evaluation--accuracy-metrics">Performance Evaluation &amp; Accuracy Metrics</h2>
<p>The paper argues that Maximum Common Substructure (MCS) accuracy is superior to string comparisons of canonical identifiers such as InChI or SMILES, because the InChI string is highly sensitive to small canonicalization or tautomerization discrepancies (e.g., differing aromaticity models). Therefore, for comparing structural isomorphism:</p>
<p>$$ \text{MCS Accuracy} = \frac{|\text{Edges}_{\text{MCS}}| + |\text{Nodes}_{\text{MCS}}|}{|\text{Edges}_{\text{Ground Truth}}| + |\text{Nodes}_{\text{Ground Truth}}|} $$</p>
<p>Using this metric to evaluate bond- and atom-level recall directly measures OCSR extraction fidelity.</p>
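The metric itself is a simple ratio of shared graph size to ground-truth graph size (in practice the MCS would be computed with a cheminformatics toolkit such as RDKit's FMCS module; only the ratio is sketched here):

```python
def mcs_accuracy(mcs_edges, mcs_nodes, gt_edges, gt_nodes):
    """MCS accuracy: size of the maximum common substructure (edges +
    nodes) over the size of the ground-truth molecular graph.
    1.0 means the prediction recovers the full ground-truth structure."""
    return (mcs_edges + mcs_nodes) / (gt_edges + gt_nodes)
```

For instance, if a prediction drops one ring bond of benzene, the MCS is a 6-atom, 5-bond chain, giving (5 + 6) / (6 + 6) ≈ 0.917 rather than the all-or-nothing 0 of an exact InChI match.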
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">MolMiner (Real-World)</th>
          <th style="text-align: left">MolVec</th>
          <th style="text-align: left">OSRA</th>
          <th style="text-align: left">Imago</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>MCS Accuracy</strong></td>
          <td style="text-align: left"><strong>87.8%</strong></td>
          <td style="text-align: left">50.1%</td>
          <td style="text-align: left">8.9%</td>
          <td style="text-align: left">10.3%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>InChI Accuracy</strong></td>
          <td style="text-align: left"><strong>88.9%</strong></td>
          <td style="text-align: left">62.6%</td>
          <td style="text-align: left">64.5%</td>
          <td style="text-align: left">10.8%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Inference Hardware</strong>: Tested on Intel Xeon Gold 6230R CPU @ 2.10 GHz.</li>
<li><strong>Acceleration</strong>: Supports batch inference on GPU, which provides the reported speedups over rule-based CPU tools.</li>
<li><strong>Runtime</strong>: Under 1 minute on JPO (450 images), 7 minutes on USPTO (5,719 images), compared to 29-148 minutes for baseline tools on USPTO (Table 3).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/iipharma/pharmamind-molminer">pharmamind-molminer</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">GitHub repo with user guides and release downloads</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://zenodo.org/records/6973361">Real-World Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">3,040 molecular images from 239 papers</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Y., Xiao, J., Chou, C.-H., Zhang, J., Zhu, J., Hu, Q., Li, H., Han, N., Liu, B., Zhang, S., Han, J., Zhang, Z., Zhang, S., Zhang, W., Lai, L., &amp; Pei, J. (2022). MolMiner: You only look once for chemical structure recognition. <em>Journal of Chemical Information and Modeling</em>, 62(22), 5321&ndash;5328. <a href="https://doi.org/10.1021/acs.jcim.2c00733">https://doi.org/10.1021/acs.jcim.2c00733</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iipharma/pharmamind-molminer">Github Repository</a></li>
<li><a href="https://zenodo.org/records/6973361">Zenodo Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xuMolMinerYouOnly2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MolMiner: You only look once for chemical structure recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MolMiner}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Xu, Youjun and Xiao, Jinchuan and Chou, Chia-Han and Zhang, Jianhang and Zhu, Jintao and Hu, Qiwan and Li, Hemin and Han, Ningsheng and Liu, Bingyu and Zhang, Shuaipeng and Han, Jinyu and Zhang, Zhen and Zhang, Shuhao and Zhang, Weilin and Lai, Luhua and Pei, Jianfeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5321--5328}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.2c00733}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MICER: Molecular Image Captioning with Transfer Learning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/</guid><description>Encoder-decoder model using pre-trained ResNet and attention-based LSTM to translate molecular images into SMILES strings, reaching 97.54% sequence accuracy.</description><content:encoded><![CDATA[<h2 id="micers-contribution-to-optical-structure-recognition">MICER&rsquo;s Contribution to Optical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper according to the AI for Physical Sciences taxonomy. It proposes MICER, an encoder-decoder architecture that integrates transfer learning (fine-tuning pre-trained models) and attention mechanisms for Optical Chemical Structure Recognition (OCSR). The study includes rigorous benchmarking comparing MICER against three rule-based tools (OSRA, MolVec, Imago) and existing deep learning methods (DECIMER). The authors conduct extensive factor comparison experiments to isolate the effects of stereochemistry, molecular complexity, data volume, and encoder backbone choices.</p>
<h2 id="the-challenge-of-generalizing-in-ocsr">The Challenge of Generalizing in OCSR</h2>
<p>Chemical structures in scientific literature are valuable for drug discovery, but they are locked in image formats that are difficult to mine automatically. Traditional OCSR tools (like OSRA) rely on hand-crafted rules and expert knowledge. They are brittle, struggle with stylistic variations, and have low generalization ability. While deep learning has been applied (e.g., DECIMER), previous attempts often used frozen pre-trained feature extractors (without fine-tuning) or failed to fully exploit transfer learning, leading to suboptimal performance. The goal of this work is to build an end-to-end &ldquo;image captioning&rdquo; system that translates molecular images directly into <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings without intermediate segmentation steps.</p>
<h2 id="integrating-fine-tuning-and-attention-for-chemistry">Integrating Fine-Tuning and Attention for Chemistry</h2>
<p>The core novelty lies in the specific architectural integration of transfer learning with fine-tuning for the chemical domain. Unlike DECIMER, which used a frozen network, MICER fine-tunes a pre-trained ResNet on molecular images. This allows the encoder to adapt from general object recognition to specific chemical feature extraction.</p>
<p>The model incorporates an attention mechanism into the LSTM decoder, allowing the model to focus on specific image regions (atoms and bonds) when generating each character of the SMILES string. The paper explicitly analyzes &ldquo;intrinsic features&rdquo; of molecular data (stereochemistry, complexity) to guide the design of the training dataset, combining multiple chemical toolkits (Indigo, RDKit) to generate diverse styles.</p>
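The per-character focusing step can be sketched as a score-plus-softmax over encoder feature vectors (a deliberate simplification: MICER's exact attention parameterization is not reproduced here, and the dot-product scoring, shapes, and names are illustrative):

```python
import math


def attention_weights(decoder_state, encoder_feats):
    """Softmax over similarity scores between the current decoder state
    (length-d vector) and each encoder feature (n vectors of length d),
    yielding one weight per image region for the next SMILES character."""
    scores = [sum(s * f for s, f in zip(decoder_state, feat))
              for feat in encoder_feats]
    m = max(scores)                              # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```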
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The authors performed two types of experiments: Factor Comparison (ablations) and Benchmarking.</p>
<p><strong>Factor Comparisons</strong>: They evaluated how performance is affected by:</p>
<ul>
<li><strong>Stereochemistry (SI)</strong>: Comparing models trained on data with and without stereochemical information.</li>
<li><strong>Molecular Complexity (MC)</strong>: Analyzing performance across 5 molecular weight intervals.</li>
<li><strong>Data Volume (DV)</strong>: Training on datasets ranging from 0.64 million to 10 million images.</li>
<li><strong>Pre-trained Models (PTMs)</strong>: Comparing 8 different backbones (e.g., ResNet, VGG, Inception, MobileNet) versus a base CNN.</li>
</ul>
<p><strong>Benchmarking</strong>:</p>
<ul>
<li><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based); Base CNN, DECIMER (deep learning).</li>
<li><strong>Datasets</strong>: Four test sets (100k images each, except UOB): Uni-style, Multi-style, Noisy, and Real-world (UOB dataset).</li>
<li><strong>Metrics</strong>: Sequence Accuracy (Exact Match), Levenshtein Distance (ALD), and Tanimoto Similarity (Fingerprint match).</li>
</ul>
<h2 id="results-and-core-insights">Results and Core Insights</h2>
<p>MICER achieved 97.54% Sequence Accuracy on uni-style data and 82.33% on the real-world UOB dataset, outperforming rule-based and deep learning baselines across all four test sets.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Method</th>
          <th>SA (%)</th>
          <th>AMFTS (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Uni-style</td>
          <td>OSRA</td>
          <td>23.14</td>
          <td>56.83</td>
      </tr>
      <tr>
          <td>Uni-style</td>
          <td>DECIMER</td>
          <td>35.32</td>
          <td>86.92</td>
      </tr>
      <tr>
          <td>Uni-style</td>
          <td><strong>MICER</strong></td>
          <td><strong>97.54</strong></td>
          <td><strong>99.74</strong></td>
      </tr>
      <tr>
          <td>Multi-style</td>
          <td>OSRA</td>
          <td>15.68</td>
          <td>44.50</td>
      </tr>
      <tr>
          <td>Multi-style</td>
          <td><strong>MICER</strong></td>
          <td><strong>95.09</strong></td>
          <td><strong>99.28</strong></td>
      </tr>
      <tr>
          <td>Noisy</td>
          <td><strong>MICER</strong></td>
          <td><strong>94.95</strong></td>
          <td><strong>99.25</strong></td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td>OSRA</td>
          <td>80.24</td>
          <td>91.17</td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td>DECIMER</td>
          <td>21.75</td>
          <td>65.15</td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td><strong>MICER</strong></td>
          <td><strong>82.33</strong></td>
          <td><strong>94.47</strong></td>
      </tr>
  </tbody>
</table>
<p>ResNet101 was identified as the most effective encoder (87.58% SA in preliminary tests on 0.8M images), outperforming deeper (DenseNet121 at 81.41%) and lighter (MobileNetV2 at 39.83%) networks. Performance saturates around 6 million training samples, reaching 98.84% SA. Stereochemical information drops accuracy by approximately 6.1% (from 87.61% to 81.50%), indicating wedge and dash bonds are harder to recognize. Visualizing attention maps showed the model correctly attends to specific atoms (e.g., focusing on &lsquo;S&rsquo; or &lsquo;Cl&rsquo; pixels) when generating the corresponding character.</p>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations. MICER struggles with superatoms, R-groups, text labels, and uncommon atoms (e.g., Sn) that were not seen during training. On noisy data, noise spots near Cl atoms can cause them to be misread as O atoms; in complex images, noise points may be misrecognized as single bonds and wedge bonds as double bonds. All methods, including MICER, have substantial room for improvement on real-world datasets that contain these challenging elements.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was curated from the <strong>ZINC20</strong> database.</p>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Filtering</strong>: Removed organometallics, mixtures, and invalid molecules.</li>
<li><strong>Standardization</strong>: SMILES were canonicalized and de-duplicated.</li>
<li><strong>Generation</strong>: Images generated using <strong>Indigo</strong> and <strong>RDKit</strong> toolkits to vary styles.</li>
</ul>
<p><strong>Dataset Size</strong>:</p>
<ul>
<li><strong>Total</strong>: 10 million images selected for the final model.</li>
<li><strong>Composition</strong>: 6 million &ldquo;default style&rdquo; (Indigo) + 4 million &ldquo;multi-style&rdquo; (Indigo + RDKit).</li>
<li><strong>Splits</strong>: 8:1:1 ratio for Training/Validation/Test.</li>
</ul>
<p><strong>Vocabulary</strong>: A token dictionary of 39 SMILES characters plus 3 special tokens (<code>[pad]</code>, <code>[sos]</code>, <code>[eos]</code>): <code>[0]</code>-<code>[9]</code>, <code>[C]</code>, <code>[l]</code>, <code>[c]</code>, <code>[O]</code>, <code>[N]</code>, <code>[n]</code>, <code>[F]</code>, <code>[H]</code>, <code>[o]</code>, <code>[S]</code>, <code>[s]</code>, <code>[B]</code>, <code>[r]</code>, <code>[I]</code>, <code>[i]</code>, <code>[P]</code>, <code>[p]</code>, <code>(</code>, <code>)</code>, <code>[</code>, <code>]</code>, <code>@</code>, <code>=</code>, <code>#</code>, <code>/</code>, <code>-</code>, <code>+</code>, <code>\</code>, <code>%</code>. Two-letter atoms like &lsquo;Br&rsquo; are tokenized as distinct characters <code>[B]</code>, <code>[r]</code>, and &lsquo;Cl&rsquo; as <code>[C]</code>, <code>[l]</code>.</p>
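<p>A minimal sketch of this character-level vocabulary, reconstructed from the description above (the paper's actual vocabulary file is not released, so the ordering here is an assumption):</p>

```python
# Hedged sketch: character-level SMILES tokenization as described for MICER.
# The vocabulary is reconstructed from the paper's description
# (39 SMILES characters + [pad]/[sos]/[eos]); ordering is assumed.
VOCAB = (
    ["[pad]", "[sos]", "[eos]"]
    + list("0123456789")
    + list("ClcONnFHoSsBrIiPp")
    + list("()[]@=#/-+\\%")
)
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(smiles: str) -> list:
    """Map a SMILES string to token ids character by character.
    Two-letter atoms ('Cl', 'Br') become two tokens, e.g. 'C' + 'l'."""
    return ([TOKEN_TO_ID["[sos]"]]
            + [TOKEN_TO_ID[ch] for ch in smiles]
            + [TOKEN_TO_ID["[eos]"]])
```

<p>Note that the model, not the tokenizer, is responsible for learning that adjacent <code>[C]</code> and <code>[l]</code> tokens denote a chlorine atom.</p>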
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Character-level tokenization (not atom-level); the model learns to assemble &lsquo;C&rsquo; and &lsquo;l&rsquo; into &lsquo;Cl&rsquo;.</li>
<li><strong>Attention Mechanism</strong>: Uses a soft attention mechanism where the decoder calculates an attention score between the encoder&rsquo;s feature map ($8 \times 8 \times 512$) and the current hidden vector. Formula:
$$
\begin{aligned}
\text{att\_score} &amp;= \text{softmax}(L_a(\tanh(L_f(F) + L_b(b_t))))
\end{aligned}
$$</li>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Loss Function</strong>: Cross-entropy loss</li>
<li><strong>Optimizer</strong>: Adam optimizer</li>
<li><strong>Learning Rate</strong>: 2e-5</li>
<li><strong>Batch Size</strong>: 256</li>
<li><strong>Epochs</strong>: 15</li>
</ul>
</li>
</ul>
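<p>The attention score above can be sketched numerically. This is an illustrative reconstruction, not the released implementation: the linear maps $L_f$, $L_b$, $L_a$ are learned in the paper, and random matrices stand in for them here; the attention dimension (256) is also an assumption.</p>

```python
import numpy as np

# Sketch of MICER's soft attention: score each of the 64 encoder patches
# (8x8x512 feature map, flattened to 64x512) against the decoder state b_t.
rng = np.random.default_rng(0)
d_att, d_feat, d_hid = 256, 512, 512   # d_att is an assumed hyperparameter
L_f = rng.normal(size=(d_feat, d_att)) # projects the feature map
L_b = rng.normal(size=(d_hid, d_att))  # projects the hidden state
L_a = rng.normal(size=(d_att, 1))      # scores each patch

def attention_weights(F, b_t):
    # att_score = softmax(L_a(tanh(L_f(F) + L_b(b_t))))
    scores = np.tanh(F @ L_f + b_t @ L_b) @ L_a  # (64, 1)
    e = np.exp(scores - scores.max())
    return e / e.sum()

F = rng.normal(size=(64, d_feat))      # flattened encoder feature matrix
b_t = rng.normal(size=(d_hid,))        # current LSTM hidden vector
w = attention_weights(F, b_t)          # one weight per image patch
context = (w * F).sum(axis=0)          # weighted sum fed to the LSTM step
```

<p>The resulting weights form a distribution over the 64 image regions, which is what the paper visualizes as attention maps over atoms.</p>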
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: Pre-trained <strong>ResNet101</strong> (trained on ImageNet).</li>
<li><strong>Modifications</strong>: The final layer is removed to output a Feature Map of size $8 \times 8 \times 512$.</li>
<li><strong>Flattening</strong>: Reshaped to a $64 \times 512$ feature matrix for the decoder.</li>
</ul>
<p><strong>Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Long Short-Term Memory (LSTM) with Attention.</li>
<li><strong>Dropout</strong>: 0.3 applied to minimize overfitting.</li>
</ul>
<p>The encoder uses a pilot network (for universal feature extraction), a max-pooling layer, and multiple feature extraction layers containing convolutional blocks (CBs), feeding into the attention LSTM.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>SA (Sequence Accuracy)</strong>: Strict exact match of SMILES strings.</li>
<li><strong>ALD (Average Levenshtein Distance)</strong>: Edit distance for character-level error analysis.</li>
<li><strong>AMFTS / MFTS@1.0</strong>: Tanimoto similarity of ECFP4 fingerprints to measure structural similarity.</li>
</ul>
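<p>Two of these metrics are straightforward to state in code; a minimal reference sketch (the fingerprint-based AMFTS metric needs a cheminformatics toolkit such as RDKit and is omitted):</p>

```python
# Reference implementations of Sequence Accuracy (strict exact match)
# and Levenshtein edit distance between predicted and ground-truth SMILES.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def sequence_accuracy(preds, targets):
    """Fraction of predictions that exactly match the target string."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)
```
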
<p><strong>Test Sets</strong>:</p>
<ul>
<li><strong>Uni-style</strong>: 100,000 images (Indigo default).</li>
<li><strong>Multi-style</strong>: 100,000 images (&gt;10 styles).</li>
<li><strong>Noisy</strong>: 100,000 images with noise added.</li>
<li><strong>UOB</strong>: 5,575 real-world images from literature.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 GPUs</li>
<li><strong>Training Time</strong>: Approximately 42 hours for the final model</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Jiacai-Yi/MICER">MICER</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<p>The training data (generated from ZINC20) and pre-trained model weights are not publicly released. The repository contains code but has minimal documentation (2 commits, no description).</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yi, J., Wu, C., Zhang, X., Xiao, X., Qiu, Y., Zhao, W., Hou, T., &amp; Cao, D. (2022). MICER: a pre-trained encoder-decoder architecture for molecular image captioning. <em>Bioinformatics</em>, 38(19), 4562-4572. <a href="https://doi.org/10.1093/bioinformatics/btac545">https://doi.org/10.1093/bioinformatics/btac545</a></p>
<p><strong>Publication</strong>: Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Jiacai-Yi/MICER">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yiMICERPretrainedEncoder2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MICER}}: A Pre-Trained Encoder--Decoder Architecture for Molecular Image Captioning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MICER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yi, Jiacai and Wu, Chengkun and Zhang, Xiaochen and Xiao, Xinyi and Qiu, Yanlong and Zhao, Wentao and Hou, Tingjun and Cao, Dongsheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{19}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4562--4572}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1367-4811}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bioinformatics/btac545}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2SMILES: Transformer OCSR with Synthetic Data Pipeline</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</guid><description>Transformer-based OCSR using a novel synthetic data generation pipeline for robust molecular image interpretation across diverse drawing styles.</description><content:encoded><![CDATA[<h2 id="contribution-image2smiles-as-a-method-and-resource">Contribution: Image2SMILES as a Method and Resource</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a specific neural architecture (ResNet backbone and Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering &ldquo;How well does this work?&rdquo; with extensive benchmarks against rule-based systems like OSRA.</li>
<li><strong>Resource</strong>: A core contribution is the &ldquo;Generate and Train!&rdquo; paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.</li>
</ul>
<h2 id="motivation-bottlenecks-in-recognizing-trapped-chemical-structures">Motivation: Bottlenecks in Recognizing Trapped Chemical Structures</h2>
<p>Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.</p>
<ul>
<li><strong>Problem</strong>: Chemical structures are often &ldquo;trapped&rdquo; in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, &ldquo;Markush&rdquo; structures (templates), or visual contamination.</li>
<li><strong>Gap</strong>: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.</li>
<li><strong>Goal</strong>: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).</li>
</ul>
<h2 id="core-innovation-the-generate-and-train-pipeline-and-fg-smiles">Core Innovation: The &ldquo;Generate and Train!&rdquo; Pipeline and FG-SMILES</h2>
<ul>
<li><strong>&ldquo;Generate and Train!&rdquo; Paradigm</strong>: The authors assert that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates geometry (rotation, bonds) alongside specific chemical drawing artifacts like &ldquo;Markush&rdquo; variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual &ldquo;contamination&rdquo; (stray text, arrows).</li>
<li><strong>FG-SMILES</strong>: A modified SMILES syntax designed to handle functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds.</li>
<li><strong>Encoder-Free Architecture</strong>: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.</li>
</ul>
<h2 id="methodology-and-benchmarking-against-osra">Methodology and Benchmarking Against OSRA</h2>
<ul>
<li><strong>Training</strong>: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.</li>
<li><strong>Validation (Synthetic)</strong>: Evaluated on a hold-out set of 1M synthetic images.</li>
<li><strong>Validation (Real World)</strong>:
<ul>
<li><strong>Dataset A</strong>: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.</li>
<li><strong>Dataset B</strong>: 296 structures systematically extracted from <em>Journal of Organic Chemistry</em> (one paper per issue from 2020) to reduce selection bias.</li>
</ul>
</li>
<li><strong>Comparison</strong>: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.</li>
</ul>
<h2 id="results-high-precision-extraction-and-key-limitations">Results: High-Precision Extraction and Key Limitations</h2>
<ul>
<li><strong>Performance</strong>:
<ul>
<li><strong>Synthetic</strong>: 90.7% exact match accuracy.</li>
<li><strong>Real Data (Dataset A)</strong>: Image2SMILES achieved <strong>79.2%</strong> accuracy compared to OSRA&rsquo;s <strong>62.1%</strong>.</li>
<li><strong>Real Data (Dataset B)</strong>: Image2SMILES achieved <strong>62.5%</strong> accuracy compared to OSRA&rsquo;s <strong>24.0%</strong>.</li>
</ul>
</li>
<li><strong>Confidence Correlation</strong>: There is a strong correlation between the model&rsquo;s confidence score and prediction validity. Thresholding at 0.995 yields 99.85% accuracy while ignoring 22.5% of data, enabling high-precision automated pipelines.</li>
<li><strong>Key Failures</strong>: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confusion of R-group indices ($R'$ vs $R_1$), and explicit hydrogens rendered as groups.</li>
</ul>
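<p>The confidence-thresholding analysis can be sketched as follows; the data here are illustrative stand-ins (the paper reports 99.85% accuracy at a 0.995 cutoff while discarding 22.5% of predictions):</p>

```python
# Sketch: keep only predictions whose model confidence clears a cutoff,
# then measure accuracy on the retained subset and the retained fraction.
def threshold_tradeoff(confidences, correct, cutoff):
    kept = [(c, ok) for c, ok in zip(confidences, correct) if c >= cutoff]
    coverage = len(kept) / len(confidences)
    accuracy = sum(ok for _, ok in kept) / len(kept) if kept else 0.0
    return accuracy, coverage

# Toy example: five predictions with confidences and correctness flags.
conf = [0.999, 0.997, 0.990, 0.970, 0.996]
ok = [True, True, False, False, True]
acc, cov = threshold_tradeoff(conf, ok, 0.995)  # acc=1.0, cov=0.6
```

<p>Sweeping the cutoff trades coverage for precision, which is what enables the high-precision automated pipelines described above.</p>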
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: A subset of 10 million molecules sampled from PubChem.</li>
<li><strong>Selection Logic</strong>: Bias towards complex/rare structures using a &ldquo;Full Coefficient&rdquo; (FC) probability metric based on molecule size and ring/atom rarity.
<ul>
<li>Formula: $FC=0.1+1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$ where $n_{\max}=60$.</li>
</ul>
</li>
<li><strong>Generation</strong>: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).</li>
<li><strong>Contamination</strong>: &ldquo;Visual noise&rdquo; is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.</li>
<li><strong>Target Format</strong>: <strong>FG-SMILES</strong> (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a <code>v</code> token.</li>
</ul>
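<p>The size-dependent coefficient in the sampling formula above is simple to state directly. This is one factor of the overall selection probability (the ring/atom-rarity terms are not reproduced here), and the clamping of $n$ at $n_{\max}$ is an assumption:</p>

```python
# Sketch of the size-dependent sampling coefficient:
# 0.1 + 1.2 * ((n_max - n) / n_max)^3, with n_max = 60.
N_MAX = 60

def size_coefficient(n_atoms: int) -> float:
    n = min(n_atoms, N_MAX)  # assumed clamp for molecules above n_max atoms
    return 0.1 + 1.2 * ((N_MAX - n) / N_MAX) ** 3
```
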
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Contamination Augmentation</strong>: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.</li>
<li><strong>Functional Group Resolution</strong>: An algorithm identifies overlapping functional group templates (SMARTS) and resolves them to prevent nested group conflicts (e.g., resolving Methyl vs Methoxy).</li>
<li><strong>Markush Support</strong>: Stochastic replacement of substituents with R-group labels ($R_1$, $R'$, etc.) based on a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$).</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;Image-to-Sequence&rdquo; hybrid model.
<ul>
<li><strong>Backbone</strong>: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.</li>
<li><strong>Neck</strong>: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.</li>
<li><strong>Decoder</strong>: Standard Transformer Decoder with parameters from the original Transformer architecture.</li>
</ul>
</li>
<li><strong>Input</strong>: Images resized to $384 \times 384 \times 3$.</li>
<li><strong>Output</strong>: Sequence of FG-SMILES tokens.</li>
</ul>
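<p>The &ldquo;encoder-free&rdquo; interface between CNN and decoder reduces to a reshape; a shape-level sketch using the dimensions stated above (the projection into the decoder's model dimension, if any, is not specified and is omitted):</p>

```python
import numpy as np

# Sketch: the truncated ResNet-50 output (512 x 48 x 48) is flattened into
# a sequence of 48*48 = 2304 spatial tokens of dimension 512, which feed the
# Transformer decoder's cross-attention directly -- no Transformer encoder.
cnn_features = np.zeros((512, 48, 48))       # C x H x W stand-in
tokens = cnn_features.reshape(512, -1).T     # (2304, 512) token sequence
```
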
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Binary &ldquo;Exact Match&rdquo; (valid/invalid).
<ul>
<li>Strict criteria: Stereo and R-group indices must match exactly (e.g., $R'$ vs $R_1$ is a failure).</li>
</ul>
</li>
<li><strong>Datasets</strong>:
<ul>
<li><strong>Internal</strong>: 5% random split of generated data (500k samples).</li>
<li><strong>External (Dataset A &amp; B)</strong>: Manually cropped real-world images from specified journals.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.</li>
<li><strong>Duration</strong>: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.</li>
<li><strong>Optimizer</strong>: RAdam with learning rate $3 \cdot 10^{-4}$.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/syntelly/img2smiles_generator">Data Generator (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic training data generator</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5069806">1M Generated Samples (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Randomly generated image-SMILES pairs</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5356500">Real-World Test Images (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Cropped structures from real papers with target FG-SMILES</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></td>
          <td>Other</td>
          <td>Proprietary</td>
          <td>Web demo for PDF-to-SMILES extraction</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Khokhlov, I., Krasnov, L., Fedorov, M. V., &amp; Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. <em>Chemistry-Methods</em>, 2(1), e202100069. <a href="https://doi.org/10.1002/cmtd.202100069">https://doi.org/10.1002/cmtd.202100069</a></p>
<p><strong>Publication</strong>: Chemistry-Methods 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/syntelly/img2smiles_generator">Official Code (Data Generator)</a></li>
<li><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{khokhlovImage2SMILESTransformerBasedMolecular2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image2SMILES: Transformer-Based Molecular Optical Recognition Engine}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Image2SMILES}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Chemistry-Methods}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{e202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2628-9725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1002/cmtd.202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image-to-Graph Transformers for Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</guid><description>A deep learning model that converts molecular images directly into graph structures, enabling recognition of abbreviated non-atomic symbols.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-classification">Contribution and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture designed to extract molecular structures from images by directly predicting the graph topology. The paper validates this approach through ablation studies (comparing ResNet-only baselines to the Transformer-augmented model) and extensive benchmarking against existing tools.</p>
<h2 id="the-challenge-with-smiles-and-non-atomic-symbols">The Challenge with SMILES and Non-Atomic Symbols</h2>
<ul>
<li><strong>Handling Abbreviations:</strong> Chemical structures in scientific literature often use non-atomic symbols (superatoms like &ldquo;R&rdquo; or &ldquo;Ph&rdquo;) to reduce complexity. Standard tools that generate SMILES strings fail here because SMILES syntax does not support arbitrary non-atomic symbols.</li>
<li><strong>Robustness to Style:</strong> Existing rule-based tools are brittle to the diverse drawing styles found in literature.</li>
<li><strong>Data Utilization:</strong> Pixel-wise graph recognition tools (like ChemGrapher) require expensive pixel-level labeling. An end-to-end approach can utilize massive amounts of image-molecule pairs (like USPTO data) without needing exact coordinate labels.</li>
</ul>
<h2 id="the-image-to-graph-i2g-architecture">The Image-to-Graph (I2G) Architecture</h2>
<p>The core novelty is the <strong>Image-to-Graph (I2G)</strong> architecture that bypasses string representations entirely:</p>
<ul>
<li><strong>Hybrid Encoder:</strong> Combines a ResNet backbone (for locality) with a Transformer encoder (for global context), allowing the model to capture relationships between atoms that are far apart in the image.</li>
<li><strong>Graph Decoder (GRAT):</strong> A modified Transformer decoder that generates the graph auto-regressively. It uses feature-wise transformations to modulate attention weights based on edge information (bond types).</li>
<li><strong>Coordinate-Aware Training:</strong> The model is forced to predict the exact 2D coordinates of atoms in the source image. Combined with auxiliary losses, this boosts SMI accuracy from 0.009 to 0.567 on the UoB ablation (Table 1 in the paper).</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Baselines:</strong> The model was compared against OSRA (rule-based), MolVec (rule-based), and ChemGrapher (deep learning pixel-wise).</li>
<li><strong>Benchmarks:</strong> Evaluated on four standard datasets: UoB, USPTO, CLEF, and JPO. Images were converted to PDF and back to simulate degradation.</li>
<li><strong>Large Molecule Test:</strong> A custom dataset (<strong>OLED</strong>) was created from 12 journal papers (434 images) to test performance on larger, more complex structures (average 52.8 atoms).</li>
<li><strong>Ablations:</strong> The authors tested the impact of the Transformer encoder, auxiliary losses, and coordinate prediction.</li>
</ul>
<h2 id="empirical-results-and-robustness">Empirical Results and Robustness</h2>
<ul>
<li><strong>Benchmark Performance:</strong> The proposed model outperformed existing models with a 17.1% relative improvement on benchmark datasets.</li>
<li><strong>Robustness:</strong> On large molecules (OLED dataset), it achieved a 12.8% relative improvement over MolVec (and 20.0% over OSRA).</li>
<li><strong>Data Scaling:</strong> Adding real-world USPTO data to the synthetic training set improved performance by 20.5%, demonstrating the model&rsquo;s ability to learn from noisy real images that lack atom-coordinate labels.</li>
<li><strong>Handling Superatoms:</strong> The model successfully recognized pseudo-atoms (e.g., $R_1$, $R_2$, $R_3$) as distinct nodes. OSRA, which outputs SMILES, collapsed them into generic &ldquo;Any&rdquo; atoms since SMILES does not support non-atomic symbols. MolVec could not recognize them properly at all.</li>
</ul>
<h2 id="limitations-and-error-analysis">Limitations and Error Analysis</h2>
<p>The paper identifies two main failure modes on the USPTO, CLEF, and JPO benchmarks:</p>
<ol>
<li><strong>Unrecognized superatoms:</strong> The model struggles with complex multi-character superatoms not seen during training (e.g., NHNHCOCH$_3$ or H$_3$CO$_2$S). The authors propose character-level atom decoding as a future solution.</li>
<li><strong>Caption interference:</strong> The model sometimes misidentifies image captions as atoms, particularly on the JPO dataset. Data augmentation with arbitrary caption text or a dedicated image segmentation step could mitigate this.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors used a combination of synthetic and real-world data for training.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem</strong></td>
          <td>4.6M</td>
          <td>Synthetic images generated using RDKit. Random superatoms (e.g., $CF_3$, $NO_2$) were substituted to simulate abbreviations.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO</strong></td>
          <td>2.5M</td>
          <td>Real image-molecule pairs from patents. Used for robustness; lacks coordinate labels.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Benchmarks</strong></td>
          <td>~5.7k</td>
          <td>UoB, USPTO, CLEF, JPO. Average ~15.8 atoms per molecule.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>OLED</strong></td>
          <td>434</td>
          <td>Manually segmented from 12 journal papers. Large molecules (avg 52.8 atoms).</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing:</strong></p>
<ul>
<li>Input resolution is fixed at $800 \times 800$ pixels.</li>
<li>Images are virtually split into a $25 \times 25$ grid (625 patches total), where each patch is $32 \times 32$ pixels.</li>
</ul>
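<p>The virtual grid arithmetic above can be sketched directly; the row-major serialization order is an assumption:</p>

```python
# Sketch of the virtual patch grid: an 800x800 image divided into a 25x25
# grid of 32x32-pixel patches, serialized row-major for the Transformer.
IMAGE_SIZE, PATCH_SIZE = 800, 32
GRID = IMAGE_SIZE // PATCH_SIZE  # 25 patches per side, 625 total

def patch_index(x: int, y: int) -> int:
    """Map a pixel coordinate to its flattened patch index (0..624)."""
    return (y // PATCH_SIZE) * GRID + (x // PATCH_SIZE)
```
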
<h3 id="algorithms">Algorithms</h3>
<p><strong>Encoder Logic:</strong></p>
<ul>
<li><strong>Grid Serialization:</strong> The $25 \times 25$ grid is flattened into a 1D sequence. 2D position information is concatenated to ResNet features before the Transformer.</li>
<li><strong>Auxiliary Losses:</strong> To aid convergence, classifiers on the encoder predict three things <em>per patch</em>: (1) number of atoms, (2) characters in atom labels, and (3) edge-sharing neighbors. These losses decrease to zero during training.</li>
</ul>
<p><strong>Decoder Logic:</strong></p>
<ul>
<li><strong>Auto-regressive Generation:</strong> At step $t$, the decoder generates a new node and connects it to existing nodes.</li>
<li><strong>Attention Modulation:</strong> Attention weights are transformed using bond information:
$$
\begin{aligned}
\text{Att}(Q, K, V) = \text{softmax} \left( \frac{\Gamma \odot (QK^T) + B}{\sqrt{d_k}} \right) V
\end{aligned}
$$
where $(\gamma_{ij}, \beta_{ij}) = f(e_{ij})$, with $e_{ij}$ being the edge type (in one-hot representation) between nodes $i$ and $j$, and $f$ is a multi-layer perceptron. $\Gamma$ and $B$ are matrices whose elements at position $(i, j)$ are $\gamma_{ij}$ and $\beta_{ij}$, respectively.</li>
<li><strong>Coordinate Prediction:</strong> The decoder outputs coordinates for each atom, which acts as a mechanism to track attention history.</li>
</ul>
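<p>As a sketch, the modulated attention above can be written in a few lines of NumPy. The MLP $f$ that maps edge types to $(\gamma_{ij}, \beta_{ij})$ is stubbed out with fixed matrices (setting $\Gamma$ to all ones and $B$ to zeros recovers vanilla attention); shapes and names are illustrative, not the paper's implementation.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def edge_modulated_attention(Q, K, V, Gamma, B):
    """Attention whose logits are scaled (Gamma) and shifted (B) per node pair:
    Att(Q, K, V) = softmax((Gamma * QK^T + B) / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    logits = (Gamma * (Q @ K.T) + B) / np.sqrt(d_k)
    return softmax(logits) @ V

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
# In the paper, (gamma_ij, beta_ij) = f(e_ij) for an MLP f over one-hot edge
# types; here Gamma = 1 and B = 0, which reduces to standard attention.
out = edge_modulated_attention(Q, K, V, np.ones((n, n)), np.zeros((n, n)))
```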
<h3 id="models">Models</h3>
<ul>
<li><strong>Image Encoder:</strong> ResNet-34 backbone followed by a Transformer encoder.</li>
<li><strong>Graph Decoder:</strong> A &ldquo;Graph-Aware Transformer&rdquo; (GRAT) that outputs nodes (atom labels, coordinates) and edges (bond types).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on structural identity, as standard string matching (SMILES) is insufficient for graphs with superatoms.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMI</strong></td>
          <td>Canonical SMILES Match</td>
          <td>Correct if predicted SMILES is identical to ground truth.</td>
      </tr>
      <tr>
          <td><strong>TS 1</strong></td>
          <td>Tanimoto Similarity = 1.0</td>
          <td>Ratio of predictions with perfect fingerprint overlap.</td>
      </tr>
      <tr>
          <td><strong>Sim.</strong></td>
          <td>Average Tanimoto Similarity</td>
          <td>Measures average structural overlap across all predictions.</td>
      </tr>
  </tbody>
</table>
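<p>In practice, fingerprints and Tanimoto similarity are computed with a cheminformatics toolkit such as RDKit; on raw bit sets the similarity is simply intersection over union. A minimal pure-Python version:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints represented as
    sets of on-bit indices: |A & B| / |A | B|."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Two fingerprints sharing 2 of 4 distinct bits -> similarity 0.5.
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```

The <strong>TS 1</strong> metric above counts the fraction of predictions for which this value is exactly 1.0, while <strong>Sim.</strong> averages it over all predictions.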
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper does not release source code, pre-trained models, or the custom OLED evaluation dataset. The training data sources (PubChem, USPTO) are publicly available, but the specific image generation pipeline (modified RDKit with coordinate extraction and superatom substitution) is not released. Key architectural details (ResNet-34 backbone, Transformer encoder/decoder configuration) and training techniques are described, but exact hyperparameters for full reproduction are limited.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source of 4.6M molecules for synthetic image generation</td>
      </tr>
      <tr>
          <td><a href="https://www.uspto.gov/">USPTO</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>2.5M real image-molecule pairs from patents</td>
      </tr>
      <tr>
          <td><a href="https://www.rdkit.org/">RDKit</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Used (with modifications) for synthetic image generation</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoo, S., Kwon, O., &amp; Lee, H. (2022). Image-to-Graph Transformers for Chemical Structure Recognition. <em>ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 3393-3397. <a href="https://doi.org/10.1109/ICASSP43922.2022.9746088">https://doi.org/10.1109/ICASSP43922.2022.9746088</a></p>
<p><strong>Publication</strong>: ICASSP 2022</p>
]]></content:encoded></item><item><title>ICMDT: Automated Chemical Structure Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</guid><description>A Transformer-based model (ICMDT) for converting chemical structure images into InChI text strings using a novel Deep TNT block.</description><content:encoded><![CDATA[<h2 id="contribution-image-to-text-translation-for-chemical-structures">Contribution: Image-to-Text Translation for Chemical Structures</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel neural network architecture, the <strong>Image Captioning Model based on Deep TNT (ICMDT)</strong>, to solve the specific problem of &ldquo;molecular translation&rdquo; (image-to-text). The classification is supported by the following rhetorical indicators:</p>
<ul>
<li><strong>Novel Mechanism:</strong> It introduces the &ldquo;Deep TNT block&rdquo; to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).</li>
<li><strong>Baseline Comparison:</strong> The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).</li>
<li><strong>Ablation Study:</strong> Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.</li>
</ul>
<h2 id="motivation-digitizing-historical-chemical-literature">Motivation: Digitizing Historical Chemical Literature</h2>
<p>The primary motivation is to speed up chemical research by digitizing historical chemical literature.</p>
<ul>
<li><strong>Problem:</strong> Historical sources often contain corrupted or noisy images, making automated recognition difficult.</li>
<li><strong>Gap:</strong> Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.</li>
<li><strong>Goal:</strong> To build a dependable generative model that can accurately translate these noisy images into <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> (International Chemical Identifier) text strings.</li>
</ul>
<h2 id="novelty-multi-level-feature-fusion-with-deep-tnt">Novelty: Multi-Level Feature Fusion with Deep TNT</h2>
<p>The core contribution is the <strong>Deep TNT block</strong> and the resulting <strong>ICMDT</strong> architecture.</p>
<ul>
<li><strong>Deep TNT Block:</strong> The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
<ol>
<li><strong>Internal Transformer:</strong> Processes pixel embeddings.</li>
<li><strong>Middle Transformer:</strong> Processes small patch embeddings.</li>
<li><strong>Exterior Transformer:</strong> Processes large patch embeddings.</li>
</ol>
</li>
<li><strong>Multi-level Fusion:</strong> The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.</li>
<li><strong>Position Encoding:</strong> A specific strategy of applying shared position encodings to small patches and pixels, while using a learnable 1D encoding for large patches.</li>
</ul>
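<p>A rough NumPy sketch of the multi-level fusion: flattened finer-grained tokens are linearly projected and added into the next coarser embedding (pixels into small patches, small patches into large patches). Dimensions are arbitrary and the transformer blocks between fusion steps are omitted; this illustrates only the data flow, not the paper's implementation.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_large, small_per_large, pix_per_small = 2, 4, 16
d_pix, d_small, d_large = 8, 16, 32          # illustrative embedding sizes

pix = rng.normal(size=(n_large, small_per_large, pix_per_small, d_pix))
small = rng.normal(size=(n_large, small_per_large, d_small))
large = rng.normal(size=(n_large, d_large))

# Projections that fold the flattened finer tokens into the coarser embedding.
W_pix_to_small = rng.normal(size=(pix_per_small * d_pix, d_small)) * 0.01
W_small_to_large = rng.normal(size=(small_per_large * d_small, d_large)) * 0.01

small = small + pix.reshape(n_large, small_per_large, -1) @ W_pix_to_small
large = large + small.reshape(n_large, -1) @ W_small_to_large
```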
<h2 id="methodology-benchmarking-on-the-bms-dataset">Methodology: Benchmarking on the BMS Dataset</h2>
<p>The authors evaluated the model on the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset.</p>
<ul>
<li><strong>Baselines:</strong> They constructed four comparative models:
<ul>
<li>EfficientNetb0 + RNN (Bi-LSTM)</li>
<li>ResNet50d + RNN (Bi-LSTM)</li>
<li>EfficientNetb0 + Transformer</li>
<li>ResNet101d + Transformer</li>
</ul>
</li>
<li><strong>Ablation:</strong> They tested the impact of removing the large patch position encoding (ICMDT*), reverting the encoder to a standard TNT-S (TNTD), and setting the patch size to 32 directly on TNT-S without the exterior transformer block (TNTD-B).</li>
<li><strong>Pre-processing Study:</strong> They experimented with denoising ratios and cropping strategies.</li>
</ul>
<h2 id="results--conclusions-improved-inchi-translation-accuracy">Results &amp; Conclusions: Improved InChI Translation Accuracy</h2>
<ul>
<li><strong>Performance:</strong> ICMDT achieved the lowest <strong>Levenshtein distance (0.69)</strong> among all five models tested (Table 3). The best-performing baseline was ResNet101d+Transformer.</li>
<li><strong>Convergence:</strong> The model converged significantly faster than the baselines, outperforming others as early as epoch 6.7.</li>
<li><strong>Ablation Results:</strong> The full Deep TNT block reduced error by nearly half compared to the standard TNT encoder (0.69 vs 1.29 Levenshtein distance). Removing large patch position encoding (ICMDT*) degraded performance to 1.04, and directly using patch size 32 on TNT-S (TNTD-B) scored 1.37.</li>
<li><strong>Limitations:</strong> The model struggles with <strong>stereochemical layers</strong> (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.</li>
<li><strong>Inference &amp; Fusion:</strong> The multi-model inference and fusion pipeline (beam search, TTA, step-wise logit ensemble, and voting) reduced Levenshtein distance by 0.24 to 2.5 relative to single models.</li>
<li><strong>Future Work:</strong> Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p><strong>Status: Partially Reproducible.</strong> The dataset is publicly available through Kaggle, and the paper provides detailed hyperparameters and architecture specifications. However, no source code or pretrained model weights have been released.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Molecular Translation (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition Terms</td>
          <td>Training/test images with InChI labels</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components:</strong> No official code repository or pretrained weights. Reimplementation requires reconstructing the Deep TNT block, training pipeline, and inference/fusion strategy from the paper description alone.</p>
<p><strong>Hardware/compute requirements:</strong> Not explicitly stated in the paper.</p>
<h3 id="data">Data</h3>
<p>The experiments used the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset from Kaggle.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BMS Training Set</td>
          <td>2,424,186 images</td>
          <td>Supervised; contains noise and blur</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BMS Test Set</td>
          <td>1,616,107 images</td>
          <td>Higher noise variation than training set</td>
      </tr>
  </tbody>
</table>
<p><strong>Pre-processing Strategy</strong>:</p>
<ul>
<li><strong>Effective:</strong> Padding resizing (reshaping to square using the longer edge, padding insufficient parts with pixels from the middle of the image).</li>
<li><strong>Ineffective:</strong> Smart cropping (removing white borders degraded performance).</li>
<li><strong>Augmentation:</strong> GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).</li>
<li><strong>Denoising:</strong> Best results found by mixing denoised and original data (Ratio 2:13) during training.</li>
</ul>
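<p>A minimal sketch of the padding-resize step for a grayscale image. The paper fills the padded region with pixels taken from the middle of the image; a constant white background is used here as a simplification:</p>

```python
import numpy as np

def pad_to_square(img: np.ndarray, fill: int = 255) -> np.ndarray:
    """Pad a grayscale H x W image to max(H, W) x max(H, W), centering the
    original content (constant fill stands in for the paper's middle-pixel
    fill)."""
    h, w = img.shape
    side = max(h, w)
    out = np.full((side, side), fill, dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    out[top:top + h, left:left + w] = img
    return out
```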
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer:</strong> Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).</li>
<li><strong>Loss Function:</strong> Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. Standard Focal Loss adds a modulating factor $(1-p_t)^\gamma$ to cross-entropy to focus on hard negatives. Anti-Focal Loss (Raunak et al., 2020) modifies this factor to reduce the disparity between training and inference distributions in Seq2Seq models.</li>
<li><strong>Training Schedule:</strong>
<ul>
<li>Initial resolution: $224 \times 224$</li>
<li>Fine-tuning: Resolution $384 \times 384$ for labels longer than 150 characters.</li>
<li>Batch size: Dynamic, increasing from 16 to 1024 (with proportional learning rate scaling).</li>
<li>Noisy Labels: Randomly replacing chemical elements in labels with a certain probability to improve robustness during inference.</li>
</ul>
</li>
<li><strong>Inference Strategy:</strong>
<ul>
<li>Beam Search ($k=16$ initially, $k=64$ if failing InChI validation).</li>
<li>Test Time Augmentation (TTA): Rotations of $90^\circ$.</li>
<li>Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.</li>
</ul>
</li>
</ul>
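<p>For reference, the standard focal loss that Anti-Focal loss modifies can be sketched as below; the exact Anti-Focal modulating factor is defined in Raunak et al. (2020) and is not reproduced here:</p>

```python
import math

def focal_loss(p_t: float, gamma: float = 0.5) -> float:
    """Cross-entropy -log(p_t) scaled by the modulating factor (1 - p_t)**gamma,
    which down-weights well-classified examples (p_t near 1)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# gamma = 0 recovers plain cross-entropy; larger gamma shrinks easy examples.
```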
<h3 id="models">Models</h3>
<p><strong>ICMDT Architecture:</strong></p>
<ul>
<li><strong>Encoder (Deep TNT)</strong> (Depth: 12 layers):
<ul>
<li><strong>Internal Block:</strong> Dim 160, Heads 4, Hidden size 640, MLP act GELU, Pixel patch size 4.</li>
<li><strong>Middle Block:</strong> Dim 10, Heads 6, Hidden size 128, MLP act GELU, Small patch size 16.</li>
<li><strong>Exterior Block:</strong> Dim 2560, Heads 10, Hidden size 5120, MLP act GELU, Large patch size 32.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Decoder dim: 2560, FFN dim: 1024.</li>
<li>Depth: 3 layers, Heads: 8.</li>
<li>Vocab size: 193 (InChI tokens), text_dim: 384.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric:</strong> Levenshtein Distance (measures single-character edit operations between generated and ground truth InChI strings).</p>
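<p>The metric can be computed with the classic dynamic-programming recurrence over the two strings:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```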
<p><strong>Ablation Results (Table 3 from paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Levenshtein Distance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td><strong>0.69</strong></td>
      </tr>
      <tr>
          <td>ICMDT*</td>
          <td>138.16</td>
          <td>1.04</td>
      </tr>
      <tr>
          <td>TNTD</td>
          <td>114.36</td>
          <td>1.29</td>
      </tr>
      <tr>
          <td>TNTD-B</td>
          <td>114.36</td>
          <td>1.37</td>
      </tr>
  </tbody>
</table>
<p><strong>Baseline Comparison (from convergence curves, Figure 9):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Convergence (Epochs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td>~9.76</td>
      </tr>
      <tr>
          <td>ResNet101d + Transformer</td>
          <td>302.02</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + Transformer</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>ResNet50d + RNN</td>
          <td>90.6</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + RNN</td>
          <td>46.3</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, Y., Chen, G., &amp; Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. <em>Applied Sciences</em>, 12(2), 680. <a href="https://doi.org/10.3390/app12020680">https://doi.org/10.3390/app12020680</a></p>
<p><strong>Publication</strong>: MDPI Applied Sciences 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">Kaggle Competition: BMS Molecular Translation</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liAutomatedRecognitionChemical2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Li, Yanchi and Chen, Guanyu and Li, Xiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Applied Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{680}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Multidisciplinary Digital Publishing Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2076-3417}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3390/app12020680}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Structure Recognition with RCGD</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</guid><description>An end-to-end framework (RCGD) and unambiguous markup language (SSML) for recognizing complex handwritten chemical structures with guided graph traversal.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-framework">Contribution and Methodological Framework</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architectural framework (<strong>RCGD</strong>) and a new representation syntax (<strong>SSML</strong>) to solve the specific problem of handwritten chemical structure recognition.</li>
<li><strong>Resource</strong>: It introduces a new benchmark dataset, <strong>EDU-CHEMC</strong>, containing 50,000 handwritten images to address the lack of public data in this domain.</li>
</ul>
<h2 id="the-ambiguity-of-handwritten-chemical-structures">The Ambiguity of Handwritten Chemical Structures</h2>
<p>Recognizing handwritten chemical structures is significantly harder than printed ones due to:</p>
<ol>
<li><strong>Inherent Ambiguity</strong>: Handwritten atoms and bonds vary greatly in appearance.</li>
<li><strong>Projection Complexity</strong>: Converting 2D projected layouts (like Natta or Fischer projections) into linear strings is difficult.</li>
<li><strong>Limitations of Existing Formats</strong>: Standard formats like SMILES require domain knowledge (valence rules) and have a high semantic gap with the visual image. They often fail to represent &ldquo;invalid&rdquo; structures commonly found in educational/student work.</li>
</ol>
<h2 id="bridging-the-semantic-gap-with-ssml-and-rcgd">Bridging the Semantic Gap with SSML and RCGD</h2>
<p>The paper introduces two core contributions to bridge the semantic gap between image and markup:</p>
<ol>
<li>
<p><strong>Structure-Specific Markup Language (SSML)</strong>: An extension of Chemfig that provides an unambiguous, visual-based graph representation. Unlike SMILES, it describes <em>how to draw</em> the molecule step-by-step, making it easier for models to learn visual alignments. It supports &ldquo;reconnection marks&rdquo; to handle cyclic structures explicitly.</p>
</li>
<li>
<p><strong>Random Conditional Guided Decoder (RCGD)</strong>: A decoder that treats recognition as a graph traversal problem. It introduces three novel mechanisms:</p>
<ul>
<li><strong>Conditional Attention Guidance</strong>: Uses branch angle directions to guide the attention mechanism, preventing the model from getting lost in complex structures.</li>
<li><strong>Memory Classification</strong>: A module that explicitly stores and classifies &ldquo;unexplored&rdquo; branch points to handle ring closures (reconnections).</li>
<li><strong>Path Selection</strong>: A training strategy that randomly samples traversal paths to prevent overfitting to a specific serialization order.</li>
</ul>
</li>
</ol>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p><strong>Datasets</strong>:</p>
<ul>
<li><strong>Mini-CASIA-CSDB</strong> (Printed): A subset of 97,309 printed molecular structure images, upscaled to $500 \times 500$ resolution.</li>
<li><strong>EDU-CHEMC</strong> (Handwritten): A new dataset of 52,987 images collected from educational settings (cameras, scanners, screens), including erroneous/non-existent structures.</li>
</ul>
<p><strong>Baselines</strong>:</p>
<ul>
<li>Compared against standard <strong>String Decoders (SD)</strong> (based on DenseWAP), tested with both SMILES and SSML on Mini-CASIA-CSDB and exclusively with SSML on EDU-CHEMC.</li>
<li>Compared against <strong>BTTR</strong> and <strong>ABM</strong> (recent mathematical expression recognition models) adapted for the chemical structure task, both using SSML on EDU-CHEMC.</li>
<li>On Mini-CASIA-CSDB, also compared against <strong>WYGIWYS</strong> (a SMILES-based string decoder at 300x300 resolution).</li>
</ul>
<p><strong>Ablation Studies</strong>:</p>
<ul>
<li>Evaluated the impact of removing Path Selection (PS) and Memory Classification (MC) mechanisms on EDU-CHEMC.</li>
<li>Tested robustness to image rotation ($180^{\circ}$) on Mini-CASIA-CSDB.</li>
</ul>
<h2 id="recognition-performance-and-robustness">Recognition Performance and Robustness</h2>
<ul>
<li><strong>Superiority of SSML</strong>: Models trained with SSML significantly outperformed those trained with SMILES (92.09% vs 81.89% EM on printed data) due to reduced semantic gap.</li>
<li><strong>Best Performance</strong>: RCGD achieved the highest Exact Match (EM) scores on both datasets:
<ul>
<li><strong>Mini-CASIA-CSDB</strong>: 95.01% EM.</li>
<li><strong>EDU-CHEMC</strong>: 62.86% EM.</li>
</ul>
</li>
<li><strong>EDU-CHEMC Baselines</strong>: On the handwritten dataset, SD (DenseWAP) achieved 61.35% EM, outperforming both BTTR (58.21% EM) and ABM (58.78% EM). The authors note that BTTR and ABM&rsquo;s reverse training mode, which helps in regular formula recognition, does not transfer well to graph-structured molecular data.</li>
<li><strong>Ablation Results</strong> (Table 5, EDU-CHEMC): Removing Path Selection alone dropped EM from 62.86% to 62.15%. Removing both Path Selection and Memory Classification dropped EM further to 60.31%, showing that memory classification has a larger impact.</li>
<li><strong>Robustness</strong>: RCGD showed minimal performance drop (0.85%) on rotated images compared to SMILES-based methods (10.36% drop). The SD with SSML dropped by 2.19%, confirming that SSML itself improves rotation invariance.</li>
<li><strong>Educational Utility</strong>: The method can recognize and reconstruct chemically invalid structures (e.g., a Carbon atom with 5 bonds), making it applicable for correcting and revising handwritten answers in chemistry education.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>1. EDU-CHEMC (Handwritten)</strong></p>
<ul>
<li><strong>Total Size</strong>: 52,987 images.</li>
<li><strong>Splits</strong>: Training (48,998), Validation (999), Test (2,992).</li>
<li><strong>Characteristics</strong>: Real-world educational data, mixture of isolated molecules and reaction equations, includes invalid chemical structures.</li>
</ul>
<p><strong>2. Mini-CASIA-CSDB (Printed)</strong></p>
<ul>
<li><strong>Total Size</strong>: 97,309 images.</li>
<li><strong>Splits</strong>: Training (80,781), Validation (8,242), Test (8,286).</li>
<li><strong>Preprocessing</strong>: Original $300 \times 300$ images were upscaled to $500 \times 500$ RGB to resolve blurring issues.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SSML Generation</strong></p>
<p>To convert a molecular graph to SSML:</p>
<ol>
<li><strong>Traverse</strong>: Start from the left-most atom.</li>
<li><strong>Bonds/Atoms</strong>: Output atom text and bond format <code>&lt;bond&gt;[:&lt;angle&gt;]</code>.</li>
<li><strong>Branches</strong>: At branch points, use phantom symbols <code>(</code> and <code>)</code> to enclose branches, ordered by ascending bond angle.</li>
<li><strong>Reconnections</strong>: Use <code>?[tag]</code> and <code>?[tag, bond]</code> to mark start/end of ring closures.</li>
</ol>
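<p>Using the syntax described above, a hypothetical SSML-style string for a six-membered carbon ring might look like the following (angles in degrees after <code>:</code>, with <code>?[a]</code> / <code>?[a,-]</code> marking the start and end of the ring reconnection). This is a Chemfig-flavored sketch; the paper's exact token inventory may differ:</p>

```text
C?[a]-[:60]C-[:0]C-[:300]C-[:240]C-[:180]C?[a,-]
```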
<p><strong>2. RCGD Specifics</strong></p>
<ul>
<li><strong>RCGD-SSML</strong>: Modified version of SSML for the decoder. Removes <code>(</code> <code>)</code> delimiters; adds <code>\eob</code> (end of branch). Maintains a dynamic <strong>Branch Angle Set ($M$)</strong>.</li>
<li><strong>Path Selection</strong>: During training, when multiple branches exist in $M$, the model randomly selects one to traverse next. During inference, it uses beam search to score candidate paths.</li>
<li><strong>Loss Function</strong>:
$$
\begin{aligned}
L_{\text{total}} = L_{\text{ce}} + L_{\text{bc}}
\end{aligned}
$$
<ul>
<li>$L_{\text{ce}}$: Cross-entropy loss for character sequence generation.</li>
<li>$L_{\text{bc}}$: Multi-label classification loss for the memory module (predicting reconnection bond types for stored branch states).</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>: DenseNet</p>
<ul>
<li><strong>Structure</strong>: 3 dense blocks.</li>
<li><strong>Growth Rate</strong>: 24.</li>
<li><strong>Depth</strong>: 32 per block.</li>
<li><strong>Output</strong>: High-dimensional feature map $x \in \mathbb{R}^{d_x \times h \times w}$.</li>
</ul>
<p><strong>Decoder</strong>: GRU with Attention</p>
<ul>
<li><strong>Hidden State Dimension</strong>: 256.</li>
<li><strong>Embedding Dimension</strong>: 256.</li>
<li><strong>Attention Projection</strong>: 128.</li>
<li><strong>Memory Classification Projection</strong>: 256.</li>
</ul>
<p><strong>Training Config</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: Adam.</li>
<li><strong>Learning Rate</strong>: 2e-4 with multi-step decay (gamma 0.5).</li>
<li><strong>Dropout</strong>: 15%.</li>
<li><strong>Strategy</strong>: Teacher-forcing used for validation selection.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Exact Match (EM)</strong>: Percentage of samples where the predicted graph structure perfectly matches the label. For SMILES, string comparison; for SSML, converted to graph for isomorphism check.</li>
<li><strong>Structure EM</strong>: Auxiliary metric for samples with mixed content (text + molecules), counting samples where <em>all</em> molecular structures are correct.</li>
</ul>
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">EDU-CHEMC</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Dataset annotations and download links (actual data hosted on Google Drive)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing Components</strong>:</p>
<ul>
<li>No training or inference code is publicly released; only the dataset is available.</li>
<li>Pre-trained model weights are not provided.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, J., Wu, H., Chen, M., Liu, C., Wu, J., Yin, S., Yin, B., Yin, B., Liu, C., Du, J., &amp; Dai, L. (2023). Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder. <em>Proceedings of the 31st ACM International Conference on Multimedia</em> (pp. 8114-8124). <a href="https://doi.org/10.1145/3581783.3612573">https://doi.org/10.1145/3581783.3612573</a></p>
<p><strong>Publication</strong>: ACM Multimedia 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">GitHub Repository / EDU-CHEMC Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{huHandwrittenChemicalStructure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 31st ACM International Conference on Multimedia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hu, Jinshui and Wu, Hao and Chen, Mingjun and Liu, Chenyu and Wu, Jiajia and Yin, Shi and Yin, Baocai and Yin, Bing and Liu, Cong and Du, Jun and Dai, Lirong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{8114--8124}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Ottawa ON Canada}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3581783.3612573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{979-8-4007-0108-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>End-to-End Transformer for Molecular Image Captioning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</guid><description>Vision Transformer encoder with Transformer decoder for molecular image-to-InChI translation, outperforming CNN baselines on noisy molecular datasets.</description><content:encoded><![CDATA[<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological Paper</strong>. It proposes a novel architectural approach to molecular image translation by replacing the standard CNN encoder with a Vision Transformer (ViT). The authors validate this method through comparative benchmarking against standard CNN+RNN baselines (e.g., ResNet+LSTM) and provide optimizations for inference speed.</p>
<h2 id="motivation-and-problem-statement">Motivation and Problem Statement</h2>
<p>The core problem addressed is that existing molecular translation methods (extracting chemical structures from images into the computer-readable InChI format) rely heavily on rule-based systems or CNN+RNN architectures. These approaches often underperform on noisy images (common in scanned old journals) or images with few distinguishable features. There is a significant need in drug discovery to digitize and analyze legacy experimental data locked in image format within scientific publications.</p>
<h2 id="core-innovations-end-to-end-vit-encoder">Core Innovations: End-to-End ViT Encoder</h2>
<p>The primary contribution is the use of a completely convolution-free Vision Transformer (ViT) as the encoder, allowing the model to utilize long-range dependencies among image patches from the very beginning via self-attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
The architecture is a pure Transformer (Encoder-Decoder), treating the molecular image similarly to a sequence of tokens (patches). Furthermore, the authors implement a specific caching strategy for the decoder to avoid recomputing embeddings for previously decoded tokens, reducing the time complexity of the decoding step.</p>
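<p>The effect of this caching can be made concrete with a toy cost model (an illustrative sketch, not the authors&rsquo; implementation) that counts attention operations per decoding step for $M$ image tokens and $N$ decoded tokens:</p>

```python
def decode_cost(n_steps: int, m_image_tokens: int, cached: bool) -> int:
    """Count attention 'operations' for autoregressive decoding.

    Without caching, every step re-encodes all previously decoded tokens,
    so step t pays roughly t * (m + t) work. With caching, only the newest
    token attends over the m image features and the t cached states: m + t.
    """
    total = 0
    for t in range(1, n_steps + 1):
        if cached:
            total += m_image_tokens + t          # O(M + N) per step
        else:
            total += t * (m_image_tokens + t)    # O(t(M + t)) per step
    return total

# Summed over N steps this gives O(MN + N^2) with caching
# versus O(MN^2 + N^3) without, matching the paper's analysis.
print(decode_cost(10, 5, cached=True), decode_cost(10, 5, cached=False))
```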
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was compared against a standard CNN + RNN and ResNet (18, 34, 50) + LSTM with attention. Ablation studies varied the number of transformer layers (3, 6, 12, 24) and the image resolution ($224 \times 224$ vs. $384 \times 384$). The model was trained on a large combined dataset, including Bristol Myers Squibb data, SMILES, GDB-13, and synthetically augmented images containing noise and artifacts. Performance was evaluated using the Levenshtein distance, the minimum number of single-character edits needed to transform the predicted string into the ground truth.</p>
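<p>For reference, the evaluation metric can be computed with a standard dynamic-programming routine (a generic implementation, not the authors&rsquo; code):</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("InChI=1S/C2H6O", "InChI=1S/C2H6S"))  # 1
```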
<h2 id="performance-outcomes-and-capabilities">Performance Outcomes and Capabilities</h2>
<p>The proposed 24-layer ViT model (input size 384) achieved the lowest Levenshtein distance of <strong>6.95</strong>, outperforming the ResNet50+LSTM baseline (7.49) and the standard CNN+RNN (103.7). Increasing the number of layers had a strong positive impact, with the 24-layer model becoming competitive with current approaches. The authors note the model was evaluated on noisy datasets with few distinguishable features, where the ViT encoder&rsquo;s self-attention over all patches from the first layer helped capture relevant structure. The proposed caching optimization reduced the total decoding time complexity from $O(MN^2 + N^3)$ to $O(MN + N^2)$ for $N$ timesteps, by reducing the per-timestep cost to $O(M + N)$.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combined dataset randomly split into 70% training, 20% validation, and 10% test.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Bristol Myers Squibb</strong></td>
          <td>~2.4 million synthetic images with InChI labels.</td>
          <td>Provided by Bristol Myers Squibb, a global biopharmaceutical company.</td>
      </tr>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>Kaggle contest data converted to InChI.</td>
          <td>Images generated using RDKit.</td>
      </tr>
      <tr>
          <td><strong><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></strong></td>
          <td>Subset of 977 million small organic molecules (up to 13 atoms).</td>
          <td>Converted from SMILES using RDKit.</td>
      </tr>
      <tr>
          <td><strong>Augmented Images</strong></td>
          <td>Synthetic images with salt/pepper noise, dropped atoms, and bond modifications.</td>
          <td>Used to improve robustness against noise.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Objective</strong>: Cross-entropy loss minimization.</li>
<li><strong>Inference Decoding</strong>: Autoregressive decoding predicting the next character of the InChI string.</li>
<li><strong>Positional Encoding</strong>: Standard sine and cosine functions of different frequencies.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Caching</strong>: Caches the output of each layer during decoding to avoid recomputing embeddings for already decoded tokens.</li>
<li><strong>JIT</strong>: PyTorch JIT compiler used for graph optimization (1.2-1.5x speed increase on GPU).</li>
<li><strong>Self-Critical Training</strong>: Finetuning performed using self-critical sequence training (SCST).</li>
</ul>
</li>
</ul>
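<p>The sinusoidal encoding listed above follows the original Transformer formulation; a plain-Python sketch (illustrative, using a nested-list layout rather than tensors):</p>

```python
import math

def positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Standard sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # i is the even column index (= 2i above)
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=64, d_model=512)
```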
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder (Vision Transformer)</strong>:
<ul>
<li>Input: Flattened 2D patches of the image. Patch size: $16 \times 16$.</li>
<li>Projection: Trainable linear projection to latent vector size $D$.</li>
<li>Structure: Alternating layers of Multi-Head Self-Attention (MHSA) and MLP blocks.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Input: Tokenized InChI string + sinusoidal positional embedding.</li>
<li>Vocabulary: 275 tokens (including <code>&lt;SOS&gt;</code>, <code>&lt;PAD&gt;</code>, <code>&lt;EOS&gt;</code>).</li>
</ul>
</li>
<li><strong>Hyperparameters (Best Model)</strong>:
<ul>
<li>Image Size: $384 \times 384$.</li>
<li>Layers: 24.</li>
<li>Feature Dimension: 512.</li>
<li>Attention Heads: 12.</li>
<li>Optimizer: Adam.</li>
<li>Learning Rate: $3 \times 10^{-5}$ (decayed by 0.5 in last 2 epochs).</li>
<li>Batch Size: Varied [64-512].</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Levenshtein Distance (lower is better).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Image Size</th>
          <th>Layers</th>
          <th>Epochs</th>
          <th>Levenshtein Dist.</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard CNN+RNN</td>
          <td>224</td>
          <td>3</td>
          <td>10</td>
          <td>103.7</td>
      </tr>
      <tr>
          <td>ResNet18 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>75.03</td>
      </tr>
      <tr>
          <td>ResNet34 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>45.72</td>
      </tr>
      <tr>
          <td>ResNet50 + LSTM</td>
          <td>224</td>
          <td>5</td>
          <td>10</td>
          <td>7.49</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>3</td>
          <td>5</td>
          <td>79.82</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>6</td>
          <td>5</td>
          <td>54.58</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>12</td>
          <td>5</td>
          <td>31.30</td>
      </tr>
      <tr>
          <td>ViT Transformers (Best)</td>
          <td>384</td>
          <td>24</td>
          <td>10</td>
          <td><strong>6.95</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>System</strong>: 70GB GPU system.</li>
<li><strong>Framework</strong>: PyTorch and PyTorch Lightning.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sundaramoorthy, C., Kelvin, L. Z., Sarin, M., &amp; Gupta, S. (2021). End-to-End Attention-based Image Captioning. <em>arXiv preprint arXiv:2104.14721</em>. <a href="https://doi.org/10.48550/arXiv.2104.14721">https://doi.org/10.48550/arXiv.2104.14721</a></p>
<p><strong>Publication</strong>: arXiv 2021 (preprint)</p>
<p><strong>Note</strong>: This is an arXiv preprint and has not undergone formal peer review.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{sundaramoorthyEndtoEndAttentionbasedImage2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{End-to-{{End Attention-based Image Captioning}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sundaramoorthy, Carola and Kelvin, Lin Ziwen and Sarin, Mahak and Gupta, Shubham}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER 1.0: Transformers for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</guid><description>Transformer-based approach for Optical Chemical Structure Recognition converting chemical images to SELFIES strings with 96% accuracy.</description><content:encoded><![CDATA[<h2 id="evaluating-the-contribution-a-methodological-shift">Evaluating the Contribution: A Methodological Shift</h2>
<p><strong>Method (Dominant)</strong> with strong <strong>Resource</strong> elements.</p>
<p>This is primarily a <strong>Method</strong> paper because it proposes a specific architectural evolution. It replaces CNN-RNN/Encoder-Decoder models with a <strong>Transformer-based network</strong> to solve the problem of image-to-structure translation. It validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.</p>
<p>It also serves as a <strong>Resource</strong> contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (&gt;35 million molecules).</p>
<h2 id="motivation-inaccessible-chemical-knowledge">Motivation: Inaccessible Chemical Knowledge</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.</li>
<li><strong>Manual Bottlenecks</strong>: Manual curation and extraction of this data is tedious, slow, and error-prone.</li>
<li><strong>Limitations of Prior Tools</strong>: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.</li>
</ul>
<h2 id="key-innovation-transformer-based-molecular-translation">Key Innovation: Transformer-Based Molecular Translation</h2>
<ul>
<li><strong>Transformer Architecture</strong>: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a <strong>Transformer-based decoder</strong>, significantly improving accuracy.</li>
<li><strong>EfficientNet Backbone</strong>: Replaces the standard InceptionV3 feature extractor with <strong>EfficientNet-B3</strong>, which improved feature extraction quality for chemical images.</li>
<li><strong>SELFIES Representation</strong>: Utilizes <a href="/notes/chemistry/molecular-representations/notations/selfies/"><strong>SELFIES</strong></a> (SELF-referencing Embedded Strings) as the target output. This guarantees 100% robust molecular strings and eliminates the &ldquo;invalid SMILES&rdquo; problem common in generative models.</li>
<li><strong>Massive Scaling</strong>: Trains on synthetic datasets derived from PubChem (up to <strong>39 million molecules</strong> total, with the largest training subset at ~35 million), demonstrating that scaling data size directly correlates with improved model performance.</li>
</ul>
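<p>Because every SELFIES symbol is a self-contained bracketed token, tokenization reduces to splitting on brackets; a minimal regex-based sketch (illustrative; the paper itself uses the Keras tokenizer):</p>

```python
import re

def selfies_tokens(s: str) -> list[str]:
    """Split a SELFIES string into its bracketed tokens.
    Each token is an atom or a syntactic symbol like [Branch1] or [Ring1];
    every sequence of such tokens decodes to *some* valid molecule, which
    is why SELFIES output can never be an invalid structure string."""
    tokens = re.findall(r"\[[^\]]*\]", s)
    # Sanity check: the tokens must cover the whole string.
    assert "".join(tokens) == s, "not a well-formed SELFIES string"
    return tokens

print(selfies_tokens("[C][C][=Branch1][C][=O][O]"))
# ['[C]', '[C]', '[=Branch1]', '[C]', '[=O]', '[O]']
```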
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<ul>
<li><strong>Feature Extractor Ablation</strong>: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics. The structural similarity between predicted and ground truth molecules was measured via Tanimoto similarity over molecular fingerprints:
$$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$</li>
<li><strong>Data Scaling</strong>: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.</li>
<li><strong>Stereochemistry &amp; Ions</strong>: Tested the model&rsquo;s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.</li>
<li><strong>Augmentation Robustness</strong>: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.</li>
</ul>
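<p>For binary fingerprints the dot products in the Tanimoto formula reduce to set intersections, so the metric can be sketched in a few lines (illustrative code; in practice one would use RDKit&rsquo;s fingerprint routines):</p>

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    represented here as sets of 'on' bit positions:
    T = |A n B| / (|A| + |B| - |A n B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Hypothetical bit positions for a predicted vs. ground-truth fingerprint:
fp_pred = {1, 5, 8, 42, 77}
fp_true = {1, 5, 8, 42, 99}
print(round(tanimoto(fp_pred, fp_true), 4))  # 0.6667
```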
<h2 id="results-and-scaling-observations">Results and Scaling Observations</h2>
<ul>
<li><strong>Architecture Comparison</strong>: The Transformer model with EfficientNet-B3 features outperformed the Encoder-Decoder baseline by a wide margin. On the 1M dataset, the Transformer achieved <strong>74.57%</strong> exact matches (Tanimoto 1.0) compared to only <strong>7.03%</strong> for the Encoder-Decoder (Table 4 in the paper).</li>
<li><strong>High Accuracy at Scale</strong>: With the full 35-million molecule training set (Dataset 1), the model achieved a <strong>Tanimoto 1.0 score of 96.47%</strong> and an average Tanimoto similarity of 0.99.</li>
<li><strong>Isomorphism</strong>: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>).</li>
<li><strong>Stereochemistry Costs</strong>: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).</li>
<li><strong>Hardware Efficiency</strong>: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs. For the 1M molecule model, convergence took ~8h 41min on TPU v3-8 vs ~29h 48min on V100 GPU. The largest model (35M) took less than 14 days on TPU.</li>
<li><strong>Augmentation Robustness (Dataset 3)</strong>: When trained on augmented images and tested on non-augmented images, the model achieved 86.43% Tanimoto 1.0. Using a pre-trained model from Dataset 2 and refitting on augmented images improved this to 88.04% on non-augmented test images and 80.87% on augmented test images, retaining above 97% isomorphism rates.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors generated synthetic data from PubChem.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 1 (Clean)</td>
          <td>39M total (35M train)</td>
          <td>No stereo/ions. Filtered for MW &lt; 1500, bond count 3-40, SMILES len &lt; 40.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 2 (Complex)</td>
          <td>37M total (33M train)</td>
          <td>Includes stereochemistry and charged groups (ions).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 3 (Augmented)</td>
          <td>37M total (33M train)</td>
          <td>Dataset 2 with image augmentations applied.</td>
      </tr>
      <tr>
          <td><strong>Preprocessing</strong></td>
          <td>N/A</td>
          <td>N/A</td>
          <td>Molecules converted to <strong>SELFIES</strong>. Images generated via CDK Structure Diagram Generator (SDG) as $299 \times 299$ 8-bit PNGs.</td>
      </tr>
      <tr>
          <td><strong>Format</strong></td>
          <td>TFRecords</td>
          <td>75 MB chunks</td>
          <td>128 Data points (image vector + tokenized string) per record.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Text Representation</strong>: <strong>SELFIES</strong> used to avoid invalid intermediate strings. Tokenized via Keras tokenizer.
<ul>
<li><em>Dataset 1 Tokens</em>: 27 unique tokens. Max length 47.</li>
<li><em>Dataset 2/3 Tokens</em>: 61 unique tokens (due to stereo/ion tokens).</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Implemented using <code>imgaug</code> python package. Random application of:
<ul>
<li>Gaussian/Average Blur, Additive Gaussian Noise, Salt &amp; Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler (following the &ldquo;Attention is all you need&rdquo; paper).</li>
</ul>
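<p>The &ldquo;Attention is all you need&rdquo; learning-rate schedule referenced above can be written directly (a sketch using that paper&rsquo;s default warmup of 4,000 steps; DECIMER&rsquo;s exact warmup setting is an assumption here):</p>

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Learning-rate schedule from the original Transformer paper:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    Rises linearly for `warmup` steps, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Linear warmup, then decay: the peak learning rate occurs at step == warmup.
```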
<h3 id="models">Models</h3>
<p>The final architecture is an <strong>Image-to-SELFIES Transformer</strong>.</p>
<ul>
<li><strong>Encoder (Feature Extractor)</strong>:
<ul>
<li><strong>EfficientNet-B3</strong> (pre-trained on Noisy-student).</li>
<li>Input: $299 \times 299 \times 3$ images (normalized -1 to 1).</li>
<li>Output Feature Vector: $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder (Transformer)</strong>:
<ul>
<li>4 Encoder-Decoder layers.</li>
<li>8 Parallel Attention Heads.</li>
<li>Dimension size: 512.</li>
<li>Feed-forward size: 2048.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td><strong>96.47%</strong></td>
          <td>74.57% (1M subset)</td>
          <td>Percentage of predictions with perfect fingerprint match (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Avg Tanimoto</strong></td>
          <td><strong>0.9923</strong></td>
          <td>0.9371 (1M subset)</td>
          <td>Average similarity score (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Isomorphism</strong></td>
          <td><strong>99.75%</strong></td>
          <td>-</td>
          <td>Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Hardware</strong>: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.</li>
<li><strong>Comparison Hardware</strong>: Nvidia Tesla V100 (32GB GPU).</li>
<li><strong>Performance</strong>:
<ul>
<li>TPU v3-8 was ~4x faster than V100 GPU.</li>
<li>1 Million molecule model convergence: 8h 41min on TPU vs ~29h 48min on GPU.</li>
<li>Largest model (35M) took less than 14 days on TPU.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper is open-access, and both code and data are publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER-TPU (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using TensorFlow and TPU training</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archival snapshot of the codebase</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>SMILES data used for training (images generated via CDK SDG)</td>
      </tr>
      <tr>
          <td><a href="https://decimer.ai/">DECIMER Project Page</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Project landing page</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Hardware Requirements</strong>: Training requires TPU v3-8 (Google Cloud) or Nvidia V100 GPU. The largest model (35M molecules) took less than 14 days on TPU v3-8.</li>
<li><strong>Missing Components</strong>: Augmentation parameters are documented in the paper (Table 14). Pre-trained model weights are available through the GitHub repository.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. <em>Journal of Cheminformatics</em>, 13(1), 61. <a href="https://doi.org/10.1186/s13321-021-00538-8">https://doi.org/10.1186/s13321-021-00538-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">GitHub Repository</a></li>
<li><a href="https://decimer.ai/">DECIMER Project Page</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMER10Deep2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{DECIMER 1.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00538-8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1186/s13321-021-00538-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemPix: Hand-Drawn Hydrocarbon Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</guid><description>Deep learning framework using CNN-LSTM image captioning to convert hand-drawn hydrocarbon structures into SMILES strings with 76% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-core-contribution">Paper Classification and Core Contribution</h2>
<p>This is primarily a <strong>Method</strong> paper, with a secondary contribution as a <strong>Resource</strong> paper.</p>
<p>The paper&rsquo;s core contribution is the <strong>ChemPix architecture and training strategy</strong> using neural image captioning (CNN-LSTM) to convert hand-drawn chemical structures to SMILES. The extensive ablation studies on synthetic data generation (augmentation, degradation, backgrounds) and ensemble learning strategies confirm the methodological focus. The secondary resource contribution includes releasing a curated dataset of hand-drawn hydrocarbons and code for generating synthetic training data.</p>
<h2 id="the-structural-input-bottleneck-in-computational-chemistry">The Structural Input Bottleneck in Computational Chemistry</h2>
<p>Inputting molecular structures into computational chemistry software for quantum calculations is often a bottleneck, requiring domain expertise and cumbersome manual entry in drawing software. While optical chemical structure recognition (OCSR) tools exist, they typically struggle with the noise and variability of hand-drawn sketches. There is a practical need for a tool that allows chemists to simply photograph a hand-drawn sketch and immediately convert it into a machine-readable format (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>), making computational workflows more accessible.</p>
<h2 id="cnn-lstm-image-captioning-and-synthetic-generalization">CNN-LSTM Image Captioning and Synthetic Generalization</h2>
<ol>
<li><strong>Image Captioning Paradigm</strong>: The authors treat the problem as <strong>neural image captioning</strong>, using an encoder-decoder (CNN-LSTM) framework to &ldquo;translate&rdquo; an image directly to a SMILES string. This avoids the complexity of explicit atom/bond detection and graph assembly.</li>
<li><strong>Synthetic Data Engineering</strong>: The paper introduces a rigorous synthetic data generation pipeline that transforms clean RDKit-generated images into &ldquo;pseudo-hand-drawn&rdquo; images via randomized backgrounds, degradation, and heavy augmentation. This allows the model to achieve &gt;50% accuracy on real hand-drawn data without ever seeing it during training.</li>
<li><strong>Ensemble Uncertainty Estimation</strong>: The method utilizes a &ldquo;committee&rdquo; (ensemble) of networks to improve accuracy and estimate confidence based on vote agreement, providing users with reliability indicators for predictions.</li>
</ol>
<h2 id="extensive-ablation-and-real-world-evaluation">Extensive Ablation and Real-World Evaluation</h2>
<ol>
<li><strong>Ablation Studies on Data Pipeline</strong>: The authors trained models on datasets generated at different stages of the pipeline (Clean RDKit $\rightarrow$ Augmented $\rightarrow$ Backgrounds $\rightarrow$ Degraded) to quantify the value of each transformation in bridging the synthetic-to-real domain gap.</li>
<li><strong>Sample Size Scaling</strong>: They analyzed performance scaling by training on synthetic dataset sizes ranging from 10,000 to 500,000 images to understand data requirements.</li>
<li><strong>Real-world Validation</strong>: The model was evaluated on a held-out test set of hand-drawn images collected via a custom web app, providing genuine out-of-distribution testing.</li>
<li><strong>Fine-tuning Experiments</strong>: Comparisons of synthetic-only training versus fine-tuning with a small fraction of real hand-drawn data to assess the value of limited real-world supervision.</li>
</ol>
<h2 id="state-of-the-art-hand-drawn-ocsr-performance">State-of-the-Art Hand-Drawn OCSR Performance</h2>
<ol>
<li>
<p><strong>Pipeline Efficacy</strong>: Augmentation and image degradation were the most critical factors for generalization, achieving over 50% accuracy on hand-drawn data when training with 500,000 synthetic images. Adding backgrounds had a negligible effect on accuracy compared to degradation.</p>
</li>
<li>
<p><strong>State-of-the-Art Performance</strong>: The final ensemble model (5 out of 17 trained NNs, selected for achieving &gt;50% individual accuracy) achieved <strong>76% accuracy</strong> (top-1) and <strong>85.5% accuracy</strong> (top-3) on the hand-drawn test set, a significant improvement over the best single model&rsquo;s 67.5%.</p>
</li>
<li>
<p><strong>Synthetic Generalization</strong>: A model trained on 500,000 synthetic images achieved &gt;50% accuracy on real hand-drawn data without any fine-tuning, validating the synthetic data generation strategy as a viable alternative to expensive manual labeling.</p>
</li>
<li>
<p><strong>Ensemble Benefits</strong>: The voting committee approach improved accuracy and provided interpretable uncertainty estimates through vote distributions. When all five committee members agree ($V=5$), the confidence value reaches 98%.</p>
</li>
</ol>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations of the current system:</p>
<ul>
<li><strong>Hydrocarbons only</strong>: The model is restricted to hydrocarbon structures and does not handle heteroatoms or functional groups.</li>
<li><strong>No conjoined rings</strong>: Molecules with multiple conjoined rings are excluded due to limitations of RDKit&rsquo;s image generation, which depicts bridges differently from standard chemistry drawing conventions.</li>
<li><strong>Resonance hybrid notation</strong>: The network struggles with benzene rings drawn in the resonance hybrid style (with a circle) compared to the Kekulé structure, since the RDKit training images use exclusively Kekulé representations.</li>
<li><strong>Challenging backgrounds</strong>: Lined and squared paper increase recognition difficulty, and structures bleeding through from the opposite side of the page can confuse the network.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study relies on two primary data sources: a massive synthetic dataset generated procedurally and a smaller collected dataset of real drawings.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic (RDKit)</td>
          <td>500,000 images</td>
          <td>Generated via RDKit with &ldquo;heavy&rdquo; augmentation: rotation ($0-360°$), blur, salt+pepper noise, and background texture addition.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-Drawn (Real)</td>
          <td>613 images</td>
          <td>Crowdsourced via a web app from over 100 unique users; split into 200-image test set and 413 training/validation images.</td>
      </tr>
      <tr>
          <td><strong>Backgrounds</strong></td>
          <td>Texture Images</td>
          <td>1,052 images</td>
          <td>A pool of unlabeled texture photos (paper, desks, shadows) used to generate synthetic backgrounds.</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation Parameters</strong>:</p>
<ul>
<li><strong>Augmentations</strong>: Rotation, Resize ($200-300px$), Blur, Dilate, Erode, Aspect Ratio, Affine transform ($\pm 20px$), Contrast, Quantize, Sharpness</li>
<li><strong>Backgrounds</strong>: Randomly translated $\pm 100$ pixels and reflected</li>
</ul>
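<p>A minimal sketch of sampling one augmentation configuration from the ranges above. The dictionary keys and the choice of uniform sampling are illustrative assumptions; the paper does not specify its exact sampling distributions.</p>

```python
import random

def sample_augmentation(rng=random):
    """Sample one augmentation configuration from the ranges listed above.

    A sketch: the key names and uniform distributions are assumptions,
    not the authors' pipeline.
    """
    return {
        "rotation_deg": rng.uniform(0, 360),
        "resize_px": rng.randint(200, 300),
        # Affine transform bounded by +/- 20 px
        "affine_shift_px": (rng.uniform(-20, 20), rng.uniform(-20, 20)),
        # Background randomly translated +/- 100 px and reflected
        "background_shift_px": (rng.uniform(-100, 100), rng.uniform(-100, 100)),
        "reflect_background": rng.random() < 0.5,
    }
```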
<h3 id="algorithms">Algorithms</h3>
<p><strong>Ensemble Voting</strong><br>
A committee of networks casts votes for the predicted SMILES string. The final prediction is the one with the highest vote count. Validity of SMILES is checked using RDKit.</p>
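<p>The voting rule can be sketched in a few lines. The <code>is_valid</code> hook stands in for RDKit's validity check (in practice, <code>Chem.MolFromSmiles(s) is not None</code>); the placeholder default keeps the sketch dependency-free.</p>

```python
from collections import Counter

def committee_vote(predictions, is_valid=lambda s: True):
    """Majority vote over an ensemble's SMILES predictions.

    `is_valid` would be an RDKit parse check in practice; here it is a
    pluggable placeholder.
    """
    # Discard predictions that fail the validity check, then count votes.
    valid = [p for p in predictions if is_valid(p)]
    if not valid:
        return None, 0
    smiles, votes = Counter(valid).most_common(1)[0]
    return smiles, votes
```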
<p><strong>Beam Search</strong><br>
Used in the decoding layer with a beam width of $k=5$ to explore multiple potential SMILES strings. It approximates the sequence $\mathbf{\hat{y}}$ that maximizes the joint probability:</p>
<p>$$ \mathbf{\hat{y}} = \arg\max_{\mathbf{y}} \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
<p><strong>Optimization</strong>:</p>
<ul>
<li>
<p><strong>Optimizer</strong>: Adam</p>
</li>
<li>
<p><strong>Learning Rate</strong>: $1 \times 10^{-4}$</p>
</li>
<li>
<p><strong>Batch Size</strong>: 20</p>
</li>
<li>
<p><strong>Loss Function</strong>: Cross-entropy loss across the sequence of $T$ tokens, computed as:</p>
<p>$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
<p>where $\mathbf{x}$ is the image representation and $y_t$ is the predicted SMILES character. For validation, this loss is reported as perplexity.</p>
</li>
</ul>
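<p>The validation metric follows directly from the loss: perplexity is the exponentiated mean negative log-likelihood per token. A minimal sketch:</p>

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the sequence.

    `token_logprobs` holds log P(y_t | y_<t, x) for each of the T tokens,
    as produced by the decoder's softmax.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```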
<h3 id="models">Models</h3>
<p>The architecture is a standard image captioning model (Show, Attend and Tell style) adapted for chemical structures.</p>
<p><strong>Encoder (CNN)</strong>:</p>
<ul>
<li><strong>Input</strong>: 256x256 pixel PNG images</li>
<li><strong>Structure</strong>: 4 blocks of Conv2D + MaxPool
<ul>
<li>Block 1: 64 filters, (3,3) kernel</li>
<li>Block 2: 128 filters, (3,3) kernel</li>
<li>Block 3: 256 filters, (3,3) kernel</li>
<li>Block 4: 512 filters, (3,3) kernel</li>
</ul>
</li>
<li><strong>Activation</strong>: ReLU throughout</li>
</ul>
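<p>Assuming &ldquo;same&rdquo; convolution padding and $2 \times 2$ max-pooling (details the paper does not state explicitly), the feature-map sizes through the four encoder blocks can be traced as follows:</p>

```python
def encoder_shapes(input_hw=256, blocks=(64, 128, 256, 512)):
    """Trace feature-map sizes through the 4 Conv2D + MaxPool blocks.

    Assumes 'same' conv padding and 2x2 pooling, so only pooling changes
    the spatial size. A sketch of the listed architecture, not the
    authors' code.
    """
    shapes, hw = [], input_hw
    for filters in blocks:
        hw //= 2  # each 2x2 max-pool halves both spatial dimensions
        shapes.append((hw, hw, filters))
    return shapes
```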
<p><strong>Decoder (LSTM)</strong>:</p>
<ul>
<li><strong>Hidden Units</strong>: 512</li>
<li><strong>Embedding Dimension</strong>: 80</li>
<li><strong>Attention</strong>: Attention mechanism with an intermediary vector dimension of 512</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Exact SMILES match accuracy (character-by-character identity between predicted and ground truth SMILES)</li>
<li><strong>Perplexity</strong>: Used for saving model checkpoints (minimizing uncertainty)</li>
<li><strong>Top-k Accuracy</strong>: Reported for $k=1$ (76%) and $k=3$ (85.5%)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mtzgroup/ChemPixCH">ChemPixCH</a></td>
          <td>Code + Dataset</td>
          <td>Apache-2.0</td>
          <td>Official implementation with synthetic data generation pipeline and collected hand-drawn dataset</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weir, H., Thompson, K., Woodward, A., Choi, B., Braun, A., &amp; Martínez, T. J. (2021). ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning. <em>Chemical Science</em>, 12(31), 10622-10633. <a href="https://doi.org/10.1039/D1SC02957F">https://doi.org/10.1039/D1SC02957F</a></p>
<p><strong>Publication</strong>: Chemical Science 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/mtzgroup/ChemPixCH">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{weir2021chempix,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Weir, Hayley and Thompson, Keiran and Woodward, Amelia and Choi, Benjamin and Braun, Augustin and Mart{\&#39;i}nez, Todd J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{10622--10633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D1SC02957F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ABC-Net: Keypoint-Based Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</guid><description>Deep learning OCSR model using keypoint estimation to detect atom and bond centers for graph-based molecular structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-and-paper-type">Contribution and Paper Type</h2>
<p><strong>Method</strong>. The paper proposes a novel architectural framework (ABC-Net) for Optical Chemical Structure Recognition (OCSR). It reformulates the problem from image captioning (sequence generation) to keypoint estimation (pixel-wise detection), backed by ablation studies on noise and comparative benchmarks against state-of-the-art tools.</p>
<h2 id="motivation-for-keypoint-based-ocsr">Motivation for Keypoint-Based OCSR</h2>
<ul>
<li><strong>Inefficiency of Rule-Based Methods</strong>: Traditional tools (OSRA, MolVec) rely on hand-coded rules that are brittle, require domain expertise, and fail to handle the wide variance in molecular drawing styles.</li>
<li><strong>Data Inefficiency of Captioning Models</strong>: Recent Deep Learning approaches (like DECIMER, Img2mol) treat OCSR as image captioning (Image-to-SMILES). This is data-inefficient because canonical SMILES require learning traversal orders, necessitating millions of training examples.</li>
<li><strong>Goal</strong>: To create a scalable, data-efficient model that predicts graph structures directly by detecting atomic/bond primitives.</li>
</ul>
<h2 id="abc-nets-divide-and-conquer-architecture">ABC-Net&rsquo;s Divide-and-Conquer Architecture</h2>
<ul>
<li><strong>Divide-and-Conquer Strategy</strong>: ABC-Net breaks the problem down into detecting <strong>atom centers</strong> and <strong>bond centers</strong> as independent keypoints.</li>
<li><strong>Keypoint Estimation</strong>: A Fully Convolutional Network (FCN) generates heatmaps for object centers. This is inspired by computer vision techniques like CornerNet and CenterNet.</li>
<li><strong>Angle-Based Bond Detection</strong>: To handle overlapping bonds, the model classifies bond angles into 60 distinct bins ($0-360°$) at detected bond centers, allowing separation of intersecting bonds.</li>
<li><strong>Implicit Hydrogen Prediction</strong>: The model explicitly predicts the number of implicit hydrogens for heterocyclic atoms to resolve ambiguity in dearomatization.</li>
</ul>
<h2 id="experimental-setup-and-synthetic-data">Experimental Setup and Synthetic Data</h2>
<ul>
<li><strong>Dataset Construction</strong>: Synthetic dataset of 100,000 molecules from ChEMBL, rendered using two different engines (RDKit and Indigo) to ensure style diversity.</li>
<li><strong>Baselines</strong>: Compared against two rule-based methods (MolVec, OSRA) and one deep learning method (Img2mol).</li>
<li><strong>Robustness Testing</strong>: Evaluated on the external UOB dataset (real-world images) and synthetic images with varying levels of salt-and-pepper noise (up to $p=0.6$).</li>
<li><strong>Data Efficiency</strong>: Analyzed performance scaling with training set size (10k to 160k images).</li>
</ul>
<h2 id="results-generalization-and-noise-robustness">Results, Generalization, and Noise Robustness</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ABC-Net achieved <strong>94-98% accuracy</strong> across all test sets (Table 1), outperforming MolVec (12-45% on synthetic data, ~83% on UOB), OSRA (26-62% on synthetic, ~82% on UOB), and Img2mol (78-93% on non-stereo subsets).</li>
<li><strong>Generalization</strong>: On the external UOB benchmark, ABC-Net achieved <strong>&gt;95% accuracy</strong>, whereas the deep learning baseline (Img2mol) dropped to 78.2%, indicating better generalization.</li>
<li><strong>Data Efficiency</strong>: The model reached ~95% performance with only 80,000 training images, requiring roughly an order of magnitude less data than captioning-based models like Img2mol (which use millions of training examples).</li>
<li><strong>Noise Robustness</strong>: Performance remained stable (&lt;2% drop) with noise levels up to $p=0.1$. Even at extreme noise ($p=0.6$), Tanimoto similarity remained high, suggesting the model recovers most substructures even when exact matches fail.</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Drawing style coverage</strong>: The synthetic training data covers only styles available through RDKit and Indigo renderers. Many real-world styles (e.g., hand-drawn structures, atomic group abbreviations) are not represented.</li>
<li><strong>No stereo baseline from Img2mol</strong>: The Img2mol comparison only covers non-stereo subsets because stereo results were not available from the original Img2mol paper.</li>
<li><strong>Scalability to large molecules</strong>: Molecules with more than 50 non-hydrogen atoms are excluded from the dataset, and performance on such large structures is untested.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/zhang-xuan1314/ABC-Net">ABC-Net Repository</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation. Missing requirements.txt and pre-trained weights.</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status: Partially Reproducible</strong>. The code is provided, but key components like the pre-trained weights, exact training environment dependencies, and the generated synthetic datasets are missing from the open-source release, making exact reproduction difficult.</p>
<h3 id="data">Data</h3>
<p>The authors constructed a synthetic dataset because labeled pixel-wise OCSR data is unavailable.</p>
<ul>
<li><strong>Source</strong>: ChEMBL database</li>
<li><strong>Filtering</strong>: Excluded molecules with &gt;50 non-H atoms or rare atom types/charges (&lt;1000 occurrences).</li>
<li><strong>Sampling</strong>: 100,000 unique SMILES selected such that every atom type/charge appears in at least 1,000 compounds.</li>
<li><strong>Generation</strong>: Images generated via <strong>RDKit</strong> and <strong>Indigo</strong> libraries.
<ul>
<li><em>Augmentation</em>: Varied bond thickness, label mode, orientation, and aromaticity markers.</li>
<li><em>Resolution</em>: $512 \times 512$ pixels.</li>
<li><em>Noise</em>: Salt-and-pepper noise added during training ($P$ = prob of background flip, $Q = 50P$).</li>
</ul>
</li>
</ul>
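<p>The salt-and-pepper scheme can be sketched as below. Reading &ldquo;$Q = 50P$&rdquo; as the foreground flip probability (clipped to 1) is an interpretation of the note above, not the authors' code:</p>

```python
import random

def salt_and_pepper(img, p):
    """Apply salt-and-pepper noise to a binary image (0 = background, 1 = ink).

    Background pixels flip with probability P; foreground pixels flip with
    Q = 50P, clipped to 1 (interpretation of the note above).
    """
    q = min(50 * p, 1.0)
    return [
        [1 - px if random.random() < (q if px else p) else px for px in row]
        for row in img
    ]
```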
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (RDKit/Indigo)</td>
          <td>80k</td>
          <td>8:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>UOB Dataset</td>
          <td>~5.7k images</td>
          <td>External benchmark from Univ. of Birmingham</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Keypoint Detection (Heatmaps)</strong></p>
<ul>
<li>
<p><strong>Down-sampling</strong>: Input $512 \times 512$ → Output $128 \times 128$ (stride 4).</p>
</li>
<li>
<p><strong>Label Softening</strong>: To handle discretization error, ground truth peaks are set to 1, first-order neighbors to 0.95, others to 0.</p>
</li>
<li>
<p><strong>Loss Function</strong>: Penalty-reduced pixel-wise binary focal loss (variants of CornerNet loss). The loss formulation is given as:</p>
<p>$$ L_{det} = - \frac{1}{N} \sum_{x,y} \begin{cases} (1 - \hat{A}_{x,y})^{\alpha} \log(\hat{A}_{x,y}) &amp; \text{if } A_{x,y} = 1 \\ (1 - A_{x,y}) (\hat{A}_{x,y})^{\alpha} \log(1 - \hat{A}_{x,y}) &amp; \text{otherwise} \end{cases} $$</p>
<ul>
<li>$\alpha=2$ (focal parameter). The $(1 - A_{x,y})$ term reduces the penalty for first-order neighbors of ground truth locations.</li>
<li>Property classification losses use a separate focal parameter $\beta=2$ with weight balancing: classes with &lt;10% frequency are weighted 10x.</li>
</ul>
</li>
</ul>
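<p>A dependency-free sketch of the penalty-reduced focal loss above, operating on flat lists of heatmap values:</p>

```python
import math

def penalty_reduced_focal_loss(pred, target, alpha=2):
    """Pixel-wise penalty-reduced focal loss, as in the equation above.

    `pred` holds predicted heatmap values A_hat in (0, 1); `target` holds
    softened ground-truth values A (1 at peaks, 0.95 at first-order
    neighbors, 0 elsewhere). N is the number of ground-truth peaks.
    """
    n = sum(1 for t in target if t == 1) or 1  # avoid division by zero
    loss = 0.0
    for a_hat, a in zip(pred, target):
        if a == 1:
            loss += (1 - a_hat) ** alpha * math.log(a_hat)
        else:
            # The (1 - A) factor reduces the penalty near ground-truth peaks.
            loss += (1 - a) * a_hat ** alpha * math.log(1 - a_hat)
    return -loss / n
```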
<p><strong>2. Bond Direction Classification</strong></p>
<ul>
<li><strong>Angle Binning</strong>: $360°$ divided into 60 intervals.</li>
<li><strong>Inference</strong>: A bond is detected if the angle probability is a local maximum and exceeds a threshold.</li>
<li><strong>Non-Maximum Suppression (NMS)</strong>: Required for opposite angles (e.g., $30°$ and $210°$) representing the same non-stereo bond.</li>
</ul>
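<p>The binning and NMS rules can be sketched as follows; the threshold value and the circular local-maximum test are illustrative choices:</p>

```python
def opposite_bin(b, n_bins=60):
    """Index of the bin 180 degrees away: the two directions of one
    non-stereo bond land in opposite bins (e.g. 30 and 210 degrees)."""
    return (b + n_bins // 2) % n_bins

def detect_bond_bins(probs, threshold=0.5):
    """Detect bond angles from the 60 bin probabilities at a bond center.

    A bin fires if it is a (circular) local maximum above the threshold;
    opposite-bin duplicates are then suppressed, keeping one per bond.
    """
    n = len(probs)
    peaks = [
        i for i, p in enumerate(probs)
        if p > threshold and p >= probs[(i - 1) % n] and p >= probs[(i + 1) % n]
    ]
    kept = []
    for i in peaks:
        if opposite_bin(i, n) not in kept:  # NMS over opposite angles
            kept.append(i)
    return kept
```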
<p><strong>3. Multi-Task Weighting</strong></p>
<ul>
<li>Uses Kendall&rsquo;s uncertainty weighting to balance 8 different loss terms (atom det, bond det, atom type, charge, H-count, bond angle, bond type, bond length).</li>
</ul>
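<p>A common form of this weighting (from Kendall et al.) learns a log-variance $s_i$ per task and combines losses as $\sum_i e^{-s_i} L_i + s_i$. The sketch below shows that form, not necessarily the paper's exact formulation:</p>

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Homoscedastic uncertainty weighting over the 8 task losses.

    Each s_i = log(sigma_i^2) is a learned per-task parameter; exp(-s_i)
    down-weights noisy tasks while the + s_i term stops s_i from growing
    without bound.
    """
    return sum(math.exp(-s) * l + s for l, s in zip(losses, log_vars))
```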
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: ABC-Net (Custom U-Net / FCN)</p>
<ul>
<li><strong>Input</strong>: $512 \times 512 \times 1$ (Grayscale).</li>
<li><strong>Contracting Path</strong>: 5 steps. Each step has conv-blocks + $2 \times 2$ MaxPool.</li>
<li><strong>Expansive Path</strong>: 3 steps. Transpose-Conv upsampling + Concatenation (Skip Connections).</li>
<li><strong>Heads</strong>: Separate $1 \times 1$ convs for each task map (Atom Heatmap, Bond Heatmap, Property Maps).</li>
<li><strong>Output Dimensions</strong>:
<ul>
<li>Heatmaps: $(1, 128, 128)$</li>
<li>Bond Angles: $(60, 128, 128)$</li>
</ul>
</li>
<li><strong>Pre-trained Weights</strong>: Not included in the public <a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub repository</a>. The paper&rsquo;s availability statement mentions code and training datasets but not weights.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Detection</strong>: Precision &amp; Recall (Object detection level).</li>
<li><strong>Regression</strong>: Mean Absolute Error (MAE) for bond lengths.</li>
<li><strong>Structure Recovery</strong>:
<ul>
<li><em>Accuracy</em>: Exact SMILES match rate.</li>
<li><em>Tanimoto</em>: ECFP similarity (fingerprint overlap).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>ABC-Net</th>
          <th>Img2mol (Baseline)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (UOB)</strong></td>
          <td><strong>96.1%</strong></td>
          <td>78.2%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Accuracy (Indigo)</strong></td>
          <td><strong>96.4%</strong></td>
          <td>89.5%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (UOB)</strong></td>
          <td><strong>0.989</strong></td>
          <td>0.953</td>
          <td>Higher substructure recovery</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>: 15 epochs, Batch size 64.</li>
<li><strong>Optimization</strong>: Adam Optimizer. LR $2.5 \times 10^{-4}$ (first 5 epochs) → $2.5 \times 10^{-5}$ (last 10).</li>
<li><strong>Repetition</strong>: Every experiment was repeated 3 times with random dataset splitting; mean values are reported.</li>
<li><strong>Compute</strong>: High-Performance Computing Center of Central South University. Specific GPU model not listed.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Yi, J.-C., Yang, G.-P., Wu, C.-K., Hou, T.-J., &amp; Cao, D.-S. (2022). ABC-Net: A divide-and-conquer based deep learning architecture for SMILES recognition from molecular images. <em>Briefings in Bioinformatics</em>, 23(2), bbac033. <a href="https://doi.org/10.1093/bib/bbac033">https://doi.org/10.1093/bib/bbac033</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangABCNetDivideandconquerBased2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ABC-Net: A Divide-and-Conquer Based Deep Learning Architecture for {SMILES} Recognition from Molecular Images}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Xiao-Chen and Yi, Jia-Cai and Yang, Guo-Ping and Wu, Cheng-Kun and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{bbac033}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bib/bbac033}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Unified Framework for Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</guid><description>A 2009 unified framework for inorganic/organic chemical handwriting recognition using graph search and statistical symbol grouping.</description><content:encoded><![CDATA[<h2 id="addressing-the-complexity-of-handwritten-organic-chemistry">Addressing the Complexity of Handwritten Organic Chemistry</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) from Microsoft Research Asia that addresses the challenge of recognizing complex 2D organic chemistry structures. By 2009, math expression recognition had seen significant commercial progress, but chemical expression recognition remained less developed.</p>
<p>The specific gap addressed is the geometric complexity of organic formulas. While inorganic formulas typically follow a linear, equation-like structure, organic formulas present complex 2D diagrammatic structures with various bond types and rings. Existing work often relied on strong assumptions (like single-stroke symbols) or failed to handle arbitrary compounds. There was a clear need for a unified solution capable of handling both inorganic and organic domains consistently.</p>
<h2 id="the-chemical-expression-structure-graph-cesg">The Chemical Expression Structure Graph (CESG)</h2>
<p>The core innovation is a unified statistical framework that processes inorganic and organic expressions within the same pipeline. Key technical novelties include:</p>
<ol>
<li><strong>Unified Bond Modeling</strong>: Bonds are treated as special symbols. The framework detects &ldquo;extended bond symbols&rdquo; (multi-stroke bonds) and splits them into single, double, or triple bonds using corner detection for consistent processing.</li>
<li><strong>Chemical Expression Structure Graph (CESG)</strong>: A defined graph representation for generic chemical expressions where nodes represent symbols and edges represent bonds or spatial relations.</li>
<li><strong>Non-Symbol Modeling</strong>: During the symbol grouping phase, the system explicitly models &ldquo;invalid groups&rdquo; to reduce over-grouping errors.</li>
<li><strong>Global Graph Search</strong>: Structure analysis is formulated as finding the optimal CESG by searching over a Weighted Direction Graph ($G_{WD}$).</li>
</ol>
<h2 id="graph-search-and-statistical-validation">Graph Search and Statistical Validation</h2>
<p>The authors validated the framework on a proprietary database of 35,932 handwritten chemical expressions collected from 300 writers.</p>
<ul>
<li><strong>Setup</strong>: The data was split into roughly 26,000 training and 6,400 testing samples.</li>
<li><strong>Metric</strong>: Recognition accuracy was measured strictly by expression (all symbols and the complete structure must be correct).</li>
<li><strong>Ablations</strong>: The team evaluated the performance contribution of symbol grouping, structure analysis, and full semantic verification.</li>
</ul>
<h2 id="recognition-accuracy-and-outcomes">Recognition Accuracy and Outcomes</h2>
<p>The full framework achieved a Top-1 accuracy of 75.4% and a Top-5 accuracy of 83.1%.</p>
<ul>
<li><strong>Component Contribution</strong>: Structure analysis is the primary bottleneck. Adding it drops the theoretical &ldquo;perfect grouping&rdquo; performance from 85.9% to 74.1% (Top-1) due to structural errors.</li>
<li><strong>Semantic Verification</strong>: Checking valence and grammar improved relative accuracy by 1.7%.</li>
</ul>
<p>The unified framework effectively handles the variance in 2D space for chemical expressions, demonstrating that delayed decision-making (keeping top-N candidates) improves robustness.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No public artifacts (code, data, models) were released by the authors.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used a private Microsoft Research Asia dataset, making direct reproduction difficult.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total</td>
          <td>Proprietary MSRA DB</td>
          <td>35,932 expressions</td>
          <td>Written by 300 people</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Subset</td>
          <td>25,934 expressions</td>
          <td></td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Subset</td>
          <td>6,398 expressions</td>
          <td></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Content</strong>: 2,000 unique expressions from high school/college textbooks.</li>
<li><strong>Composition</strong>: ~25% of samples are organic expressions.</li>
<li><strong>Vocabulary</strong>: 163 symbol classes (elements, digits, <code>+</code>, <code>↑</code>, <code>%</code>, bonds, etc.).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Symbol Grouping (Dynamic Programming)</strong></p>
<ul>
<li>Objective: Find the optimal symbol sequence $G_{max}$ maximizing the posterior probability given the ink strokes:
$$ G_{max} = \arg\max_{G} P(G | \text{Ink}) $$</li>
<li><strong>Non-symbol modeling</strong>: Models are iteratively retrained on &ldquo;incorrect grouping results&rdquo; so they learn to reject invalid stroke groups.</li>
<li><strong>Inter-group modeling</strong>: Uses Gaussian Mixture Models (GMM) to model spatial relations ($R_j$) between groups.</li>
</ul>
<p><strong>2. Bond Processing</strong></p>
<ul>
<li><strong>Extended Bond Symbol</strong>: Recognizes connected strokes (e.g., a messy double bond written in one stroke) as a single &ldquo;extended&rdquo; symbol.</li>
<li><strong>Splitting</strong>: Uses <strong>Curvature Scale Space (CSS)</strong> corner detection to split extended symbols into primitive lines.</li>
<li><strong>Classification</strong>: A Neural Network verifies if the split lines form valid single, double, or triple bonds.</li>
</ul>
<p><strong>3. Structure Analysis (Graph Search)</strong></p>
<ul>
<li><strong>Graph Construction</strong>: Builds a Weighted Direction Graph ($G_{WD}$) where nodes are symbol candidates and edges are potential relationships ($E_{c}, E_{nc}, E_{peer}, E_{sub}$).</li>
<li><strong>Edge Weights</strong>: Calculated as the product of observation, spatial, and contextual probabilities:
$$ W(S, O, R) = P(O|S) \times P(\text{Spatial}|R) \times P(\text{Context}|S, R) $$
<ul>
<li>Spatial probability uses rectangular control regions and distance functions.</li>
<li>Contextual probability uses statistical co-occurrence (e.g., &lsquo;C&rsquo; often appears with &lsquo;H&rsquo;).</li>
</ul>
</li>
<li><strong>Search</strong>: Breadth-first search with pruning to find the top-N optimal CESGs.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Symbol Recognition</strong>: Implementation details not specified, but likely HMM or NN based on the era. Bond verification explicitly uses a <strong>Neural Network</strong>.</li>
<li><strong>Spatial Models</strong>: <strong>Gaussian Mixture Models (GMM)</strong> are used to model the 9 spatial relations (e.g., Left-super, Above, Subscript).</li>
<li><strong>Semantic Model</strong>: A <strong>Context-Free Grammar (CFG)</strong> parser is used for final verification (e.g., ensuring digits aren&rsquo;t reactants).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation is performed using &ldquo;Expression-level accuracy&rdquo;.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Top-1)</th>
          <th>Value (Top-5)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full Framework</td>
          <td>75.4%</td>
          <td>83.1%</td>
          <td></td>
      </tr>
      <tr>
          <td>Without Semantics</td>
          <td>74.1%</td>
          <td>83.0%</td>
          <td></td>
      </tr>
      <tr>
          <td>Grouping Only</td>
          <td>85.9%</td>
          <td>95.6%</td>
          <td>Theoretical max if structure analysis was perfect</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, M., Han, S., &amp; Zhang, D. (2009). A Unified Framework for Recognizing Handwritten Chemical Expressions. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1345&ndash;1349. <a href="https://doi.org/10.1109/ICDAR.2009.64">https://doi.org/10.1109/ICDAR.2009.64</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{changUnifiedFrameworkRecognizing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A {{Unified Framework}} for {{Recognizing Handwritten Chemical Expressions}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Ming and Han, Shi and Zhang, Dongmei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2009</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1345--1349}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.64}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SVM-HMM Online Classifier for Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</guid><description>A dual-stage classifier combining SVM and HMM to recognize online handwritten chemical symbols, introducing a reordering algorithm for organic rings.</description><content:encoded><![CDATA[<h2 id="contribution-double-stage-classification-method">Contribution: Double-Stage Classification Method</h2>
<p><strong>Method</strong>.
This paper is a methodological contribution that proposes a novel &ldquo;double-stage classifier&rdquo; architecture. It fits the taxonomy by introducing a specific algorithmic pipeline (SVM rough classification followed by HMM fine classification) and a novel pre-processing algorithm (Point Sequence Reordering) to solve technical limitations in recognizing organic ring structures. The contribution is validated through ablation studies (comparing SVM kernels and HMM state/Gaussian counts) and performance benchmarks.</p>
<h2 id="motivation-recognizing-complex-organic-ring-structures">Motivation: Recognizing Complex Organic Ring Structures</h2>
<p>The primary motivation is the complexity of recognizing handwritten chemical symbols, specifically the distinction between <strong>Organic Ring Structures (ORS)</strong> and <strong>Non-Ring Structures (NRS)</strong>. Existing single-stage classifiers are unreliable for ORS because these symbols have arbitrary writing styles, variable stroke numbers, and inconsistent stroke orders due to their 2D hexagonal structure. A robust system is needed to handle this uncertainty and achieve high accuracy.</p>
<h2 id="core-innovation-point-sequence-reordering-psr">Core Innovation: Point Sequence Reordering (PSR)</h2>
<p>The authors introduce two main novelties:</p>
<ol>
<li><strong>Double-Stage Architecture</strong>: A hybrid system where an SVM (using RBF kernel) first roughly classifies inputs as either ORS or NRS, followed by specialized HMMs for fine-grained recognition.</li>
<li><strong>Point Sequence Reordering (PSR) Algorithm</strong>: A stroke-order independent algorithm designed specifically for ORS. It reorders the point sequence of a symbol based on a counter-clockwise scan from the centroid, effectively eliminating the uncertainty caused by variations in stroke number and writing order.</li>
</ol>
<h2 id="methodology--experimental-design">Methodology &amp; Experimental Design</h2>
<p>The authors collected a custom dataset and performed sequential optimizations:</p>
<ul>
<li><strong>SVM Optimization</strong>: Compared Polynomial, RBF, and Sigmoid kernels to find the best rough classifier.</li>
<li><strong>HMM Optimization</strong>: Tested multiple combinations of states (4, 6, 8) and Gaussians (3, 4, 6, 8, 9, 12) to maximize fine classification accuracy.</li>
<li><strong>PSR Validation</strong>: Conducted an ablation study comparing HMM accuracy on ORS symbols &ldquo;Before PSR&rdquo; vs &ldquo;After PSR&rdquo; to quantify the algorithm&rsquo;s impact.</li>
</ul>
<h2 id="results--final-conclusions">Results &amp; Final Conclusions</h2>
<ul>
<li><strong>Architecture Performance</strong>: The RBF-based SVM achieved 99.88% accuracy in differentiating ORS from NRS.</li>
<li><strong>HMM Configuration</strong>: The optimal HMM topology was found to be <strong>8 states and 12 Gaussians</strong> for both symbol types.</li>
<li><strong>PSR Impact</strong>: The PSR algorithm dramatically improved ORS recognition, raising Top-1 accuracy from <strong>49.84% (before PSR)</strong> to <strong>98.36% (after PSR)</strong>.</li>
<li><strong>Overall Accuracy</strong>: The final integrated system achieved a Top-1 accuracy of <strong>93.10%</strong> and Top-3 accuracy of <strong>98.08%</strong> on the test set.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study defined 101 chemical symbols split into two categories.</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Count</th>
          <th>Content</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>NRS</strong> (Non-Ring)</td>
          <td>63</td>
          <td>Digits 0-9, 44 letters, 9 operators</td>
          <td>Operators include +, -, =, $\rightarrow$, etc.</td>
      </tr>
      <tr>
          <td><strong>ORS</strong> (Organic Ring)</td>
          <td>38</td>
          <td>2D hexagonal structures</td>
          <td>Benzene rings, cyclohexane, etc.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Collection</strong>: 12,322 total samples (122 per symbol) collected from 20 writers (teachers and students).</li>
<li><strong>Split</strong>: 9,090 training samples and 3,232 test samples.</li>
<li><strong>Constraints</strong>: Samples were collected under three writing specifications: normal, standard, and freestyle.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SVM Feature Extraction (Rough Classification)</strong>
The input strokes are scaled, and a 58-dimensional feature vector is calculated:</p>
<ul>
<li><strong>Mesh ($4 \times 4$)</strong>: Ratio of points in 16 grids (16 features).</li>
<li><strong>Outline</strong>: Normalized scan distance from 4 edges with 5 scan lines each (20 features).</li>
<li><strong>Projection</strong>: Point density in 5 bins per edge (20 features).</li>
<li><strong>Aspect Ratio</strong>: Height/Width ratios (2 features).</li>
</ul>
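<p>As a concrete illustration, the first of these four feature groups (the $4 \times 4$ mesh) can be sketched as follows. This is not the authors' code; the grid indexing and normalization are a plain reading of the description above, and the remaining outline, projection, and aspect-ratio features would be appended to reach the full 58 dimensions:</p>

```python
def mesh_features(points, grid=4):
    """4x4 mesh feature: fraction of sample points falling in each of the
    16 cells of the symbol's bounding box (one of the four feature groups;
    a sketch, not the paper's implementation)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    w = (max(xs) - x0) or 1.0  # guard against degenerate (flat) symbols
    h = (max(ys) - y0) or 1.0
    counts = [0] * (grid * grid)
    for x, y in points:
        col = min(int((x - x0) / w * grid), grid - 1)
        row = min(int((y - y0) / h * grid), grid - 1)
        counts[row * grid + col] += 1
    return [c / len(points) for c in counts]  # ratios sum to 1
```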
<p><strong>2. Point Sequence Reordering (PSR)</strong>
Used strictly for ORS preprocessing:</p>
<ol>
<li>Calculate the centroid $(x_c, y_c)$ of the symbol.</li>
<li>Initialize a scan line at angle $\theta = 0$.</li>
<li>Traverse points; if a point $p_i = (x_i, y_i)$ satisfies the distance threshold to the scan line, add it to the reordered list. Distance $d_i$ is calculated as:
$$ d_i = |(y_i - y_c)\cos(\theta) - (x_i - x_c)\sin(\theta)| $$</li>
<li>Increment $\theta$ by $\Delta\theta$ and repeat until a full circle ($2\pi$) is completed.</li>
</ol>
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Stage 1)</strong>: RBF Kernel was selected as optimal with parameters $C=512$ and $\gamma=0.5$.</li>
<li><strong>HMM (Stage 2)</strong>: Left-right continuous HMM trained via Baum-Welch algorithm. The topology is one model per symbol using <strong>8 states and 12 Gaussians</strong>.</li>
</ul>
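<p>The stage-1 decision function is built on the standard RBF kernel with the reported $\gamma$; a full reimplementation would train an SVM (e.g. via libsvm or scikit-learn) with these hyperparameters, which this minimal sketch only illustrates:</p>

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2), using the paper's
    reported gamma = 0.5; the SVM stage pairs this kernel with C = 512."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq)
```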
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported are Top-1, Top-2, and Top-3 accuracy on the held-out test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NRS Accuracy</th>
          <th>ORS Accuracy</th>
          <th>Overall Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Top-1</strong></td>
          <td>91.91%</td>
          <td>97.53%</td>
          <td>93.10%</td>
      </tr>
      <tr>
          <td><strong>Top-3</strong></td>
          <td>99.12%</td>
          <td>99.34%</td>
          <td>98.08%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: HP Pavilion tx1000 Tablet PC.</li>
<li><strong>Processor</strong>: 2.00GHz CPU.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Wang, K. (2010). A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols. <em>2010 International Conference on Pattern Recognition</em>, 1888&ndash;1891. <a href="https://doi.org/10.1109/ICPR.2010.465">https://doi.org/10.1109/ICPR.2010.465</a></p>
<p><strong>Publication</strong>: ICPR 2010</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2010svm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2010 International Conference on Pattern Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Wang, Kai}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2010}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1888--1891}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2010.465}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Recognition of On-line Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</guid><description>A two-level recognition algorithm for on-line handwritten chemical expressions using structural and syntactic features.</description><content:encoded><![CDATA[<h2 id="contribution-on-line-chemical-expression-recognition-framework">Contribution: On-line Chemical Expression Recognition Framework</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline (&ldquo;Algorithm Model&rdquo;) for recognizing on-line handwritten chemical expressions. The paper focuses on detailing the specific mechanisms of this pipeline (pre-processing, segmentation, two-level recognition, and HCI) and validates its effectiveness through quantitative comparison against a conventional baseline. The rhetorical structure aligns with the &ldquo;Methodological Basis&rdquo; of the taxonomy, prioritizing the &ldquo;how well does this work?&rdquo; question over theoretical derivation or dataset curation.</p>
<h2 id="motivation-the-hci-gap-in-chemical-drawing">Motivation: The HCI Gap in Chemical Drawing</h2>
<p>The authors identify a gap in existing human-computer interaction (HCI) for chemistry. While mathematical formula recognition had seen progress, chemical expression recognition was under-researched. Existing tools relied on keyboard/mouse input, which was time-consuming and inefficient for the complex, variable nature of chemical structures. Previous attempts were either too slow (vectorization-based) or failed to leverage specific chemical knowledge effectively. There was a practical need for a system that could handle the specific syntactic rules of chemistry in an on-line (real-time) handwriting setting.</p>
<h2 id="novelty-two-level-recognition-architecture">Novelty: Two-Level Recognition Architecture</h2>
<p>The core contribution is a <strong>two-level recognition algorithm</strong> that integrates chemical domain knowledge.</p>
<ul>
<li><strong>Level 1 (Substance Level):</strong> Treats connected strokes as a potential &ldquo;substance unit&rdquo; (e.g., &ldquo;H2O&rdquo;) and matches them against a dictionary using a modified edit distance algorithm.</li>
<li><strong>Level 2 (Character Level):</strong> If the substance match fails, it falls back to segmenting the unit into isolated characters and reconstructing them using syntactic rules.</li>
<li><strong>Hybrid Segmentation:</strong> Combines structural analysis (using bounding box geometry for super/subscript detection) with &ldquo;partial recognition&rdquo; (identifying special symbols like <code>+</code>, <code>=</code>, <code>-&gt;</code> early to split the expression).</li>
</ul>
<h2 id="methodology-custom-dataset-and-baseline-comparisons">Methodology: Custom Dataset and Baseline Comparisons</h2>
<p>The authors conducted a validation experiment in a laboratory environment with 20 participants (chemistry students and teachers).</p>
<ul>
<li><strong>Dataset:</strong> 1,197 total samples (983 from a standard set of 341 expressions, 214 arbitrary expressions written by users).</li>
<li><strong>Baselines:</strong> They compared their &ldquo;Two-Level&rdquo; algorithm against a &ldquo;Conventional&rdquo; algorithm that skips the substance-level check and directly recognizes characters (&ldquo;Recognize Character Directly&rdquo;).</li>
<li><strong>Conditions:</strong> They also tested the impact of their Human-Computer Interaction (HCI) module which allows user corrections.</li>
</ul>
<h2 id="results-high-accuracy-and-hci-corrections">Results: High Accuracy and HCI Corrections</h2>
<ul>
<li><strong>Accuracy:</strong> The proposed two-level algorithm achieved significantly higher accuracy (<strong>96.4%</strong> for expression recognition) compared to the conventional baseline (<strong>91.5%</strong>).</li>
<li><strong>Robustness:</strong> The method performed well even on &ldquo;arbitrary&rdquo; expressions not in the standard set (92.5% accuracy vs 88.2% baseline).</li>
<li><strong>HCI Impact:</strong> Allowing users to modify results via the HCI module pushed final accuracy to high levels (<strong>98.8%</strong>).</li>
<li><strong>Conclusion:</strong> The authors concluded the algorithm is reliable for real applications and flexible enough to be extended to other domains like physics or engineering.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a public benchmark but collected its own data for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Custom Lab Dataset</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">Collected from 20 chemistry students/teachers using Tablet PCs. Includes 341 standard expressions + arbitrary user inputs.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct phases with specific algorithmic choices:</p>
<p><strong>1. Pre-processing</strong></p>
<ul>
<li><strong>Smoothing:</strong> Uses a 5-tap Gaussian low-pass filter (Eq. 1) with specific coefficients to smooth stroke data.</li>
<li><strong>Redundancy:</strong> Merges redundant points and removes &ldquo;prickles&rdquo; (isolated noise).</li>
<li><strong>Re-ordering:</strong> Strokes are spatially re-sorted left-to-right, top-to-down to correct for arbitrary writing order.</li>
</ul>
<p><strong>2. Segmentation</strong></p>
<ul>
<li><strong>Structural Analysis:</strong> Distinguishes relationships (Superscript vs. Subscript vs. Horizontal) using a geometric feature vector $(T, B)$ based on bounding box heights ($h$), vertical centers ($C$), and barycenters ($B_{bary}$):
$$
\begin{aligned}
d &amp;= 0.7 \cdot y_{12} - y_{22} + 0.3 \cdot y_{11} \\
T &amp;= 1000 \cdot \frac{d}{h_1} \\
B &amp;= 1000 \cdot \frac{B_{bary1} - B_{bary2}}{h_1}
\end{aligned}
$$</li>
<li><strong>Partial Recognition:</strong> Detects special symbols (<code>+</code>, <code>=</code>, <code>-&gt;</code>) early to break expressions into &ldquo;super-substance units&rdquo; (e.g., separating reactants from products).</li>
</ul>
<p><strong>3. Recognition (Two-Level)</strong></p>
<ul>
<li><strong>Level 1 (Dictionary Match):</strong>
<ul>
<li>Uses a modified <strong>Edit Distance</strong> (Eq. 6) incorporating a specific distance matrix based on chemical syntax.</li>
<li>Similarity $\lambda_{ij}$ is weighted by stroke credibility $\mu_i$ and normalized by string length.</li>
</ul>
</li>
<li><strong>Level 2 (Character Segmentation):</strong>
<ul>
<li>Falls back to this if Level 1 fails.</li>
<li>Segments characters by analyzing pixel density in horizontal/vertical/diagonal directions to find concave/convex points.</li>
<li>Recombines characters using syntactic rules (e.g., valency checks) to verify validity.</li>
</ul>
</li>
</ul>
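<p>The Level-1 dictionary match can be sketched with plain Levenshtein distance. This is a simplified stand-in: the paper's Eq. 6 additionally weights substitutions by a chemical-syntax distance matrix and by stroke credibility $\mu_i$, and normalizes by string length:</p>

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two recognized strings."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def best_dictionary_match(candidate, dictionary):
    """Level 1: return the substance-dictionary entry closest to the
    recognized character string."""
    return min(dictionary, key=lambda entry: edit_distance(candidate, entry))
```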
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on recognition accuracy at both the character and expression level.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Proposed)</th>
          <th style="text-align: left">Value (Baseline)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>96.4%</strong></td>
          <td style="text-align: left">91.5%</td>
          <td style="text-align: left">&ldquo;Standard&rdquo; dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>92.5%</strong></td>
          <td style="text-align: left">88.2%</td>
          <td style="text-align: left">&ldquo;Other&rdquo; (arbitrary) dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>HCI-Assisted Accuracy</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Accuracy after user correction.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Input Devices:</strong> Tablet PCs were used for data collection and testing.</li>
<li><strong>Compute:</strong> Specific training hardware is not listed, but the algorithm is designed for real-time interaction on standard 2008-era computing devices.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, Q., &amp; Zhang, Y. (2008). Recognition of On-line Handwritten Chemical Expressions. <em>2008 IEEE International Joint Conference on Neural Networks</em>, 2360&ndash;2365. <a href="https://doi.org/10.1109/IJCNN.2008.4634125">https://doi.org/10.1109/IJCNN.2008.4634125</a></p>
<p><strong>Publication</strong>: IJCNN 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{jufengyangRecognitionOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Recognition of On-Line Handwritten Chemical Expressions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 {{IEEE International Joint Conference}} on {{Neural Networks}} ({{IEEE World Congress}} on {{Computational Intelligence}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{{Jufeng Yang} and {Guangshun Shi} and {Qingren Wang} and {Yong Zhang}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2360--2365}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Hong Kong, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IJCNN.2008.4634125}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-1820-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Online Handwritten Chemical Formula Structure Analysis</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</guid><description>A hierarchical grammar-based approach for recognizing and analyzing online handwritten chemical formulas in mobile education contexts.</description><content:encoded><![CDATA[<h2 id="hierarchical-grammatical-framework-contribution">Hierarchical Grammatical Framework Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for processing chemical formulas by decomposing them into three hierarchical levels (Formula, Molecule, Text). The contribution is defined by a specific set of formal grammatical rules and parsing algorithms used to construct a &ldquo;grammar spanning tree&rdquo; and &ldquo;molecule spanning graph&rdquo; from online handwritten strokes.</p>
<h2 id="motivation-for-online-formula-recognition">Motivation for Online Formula Recognition</h2>
<p>The primary motivation is the application of mobile computing in chemistry education, where precise comprehension of casual, <em>online</em> handwritten formulas is a significant challenge.</p>
<ul>
<li><strong>2D Complexity</strong>: Unlike 1D text, chemical formulas utilize complex 2D spatial relationships that convey specific chemical meaning (e.g., bonds, rings).</li>
<li><strong>Format Limitations</strong>: Existing storage formats like CML (Chemical Markup Language) or MDL MOLFILE do not natively record the layout or abbreviated information necessary for recognizing handwritten input.</li>
<li><strong>Online Gap</strong>: Previous research focused heavily on <em>offline</em> (image-based) recognition, lacking solutions for <em>online</em> (stroke-based) handwritten chemical formulas (OHCF).</li>
</ul>
<h2 id="core-novelty-in-three-level-grammatical-analysis">Core Novelty in Three-Level Grammatical Analysis</h2>
<p>The core novelty is the <strong>Three-Level Grammatical Analysis</strong> approach:</p>
<ol>
<li><strong>Formula Level (1D)</strong>: Treats the reaction equation as a linear sequence of components (Reactants, Products, Separators), parsed via a context-free grammar to build a spanning tree.</li>
<li><strong>Molecule Level (2D)</strong>: Treats molecules as graphs where &ldquo;text groups&rdquo; are vertices and &ldquo;bonds&rdquo; are edges. It introduces specific handling for &ldquo;hidden Carbon dots&rdquo; (intersections of bonds without text).</li>
<li><strong>Text Level (1D)</strong>: Analyzes the internal structure of text groups (atoms, subscripts).</li>
</ol>
<p>Unique to this approach is the <strong>formal definition of the chemical grammar</strong> as a 5-tuple $G=(T,N,P,M,S)$ and the generation of an <strong>Adjacency Matrix</strong> directly from the handwritten sketch to represent chemical connectivity.</p>
<h2 id="experimental-validation-on-handwritten-strokes">Experimental Validation on Handwritten Strokes</h2>
<p>The authors validated their model using a custom dataset of online handwritten formulas.</p>
<ul>
<li><strong>Data Source</strong>: 25 formulas were randomly selected from a larger pool of 1,250 samples.</li>
<li><strong>Scope</strong>: The test set included 484 total symbols, comprising generators, separators, text symbols, rings, and various bond types.</li>
<li><strong>Granular Validation</strong>: The system was tested at multiple distinct stages:
<ul>
<li>Key Symbol Extraction (Formula Level)</li>
<li>Text Localization (Molecule Level)</li>
<li>Bond End Grouping (Molecule Level)</li>
<li>Text Recognition (Text Level)</li>
</ul>
</li>
</ul>
<h2 id="downstream-impact-and-parsing-accuracy">Downstream Impact and Parsing Accuracy</h2>
<p>The system achieved high accuracy across all sub-tasks, demonstrating that the hierarchical grammar approach is effective for both inorganic and organic formulas.</p>
<ul>
<li><strong>Formula Level</strong>: 98.3% accuracy for Key Symbols; 100% for State-assisted symbols.</li>
<li><strong>Molecule Level</strong>: 98.8% accuracy for Bond End Grouping; 100% for Free End-Text connection detection.</li>
<li><strong>Text Recognition</strong>: 98.7% accuracy (Top-3) using HMMs.</li>
<li><strong>Impact</strong>: The method successfully preserves the writer&rsquo;s &ldquo;online information&rdquo; (habits/intentions) while converting the handwritten input into standard formats suitable for visual editing or data retrieval.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To replicate this work, one would need to implement the specific grammatical production rules and the geometric thresholds defined for bond analysis.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Symbol HMMs</td>
          <td>5,670 samples</td>
          <td>Used to train the text recognition module</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Text Recognition</td>
          <td>2,016 samples</td>
          <td>Test set for character HMMs</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Formula Analysis</td>
          <td>25 formulas</td>
          <td>Random subset of 1,250 collected samples; contains 484 symbols</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Formula Level Parsing</strong></p>
<ul>
<li><strong>HBL Analysis</strong>: Identify the &ldquo;Horizontal Baseline&rdquo; (HBL) containing the most symbols to locate key operators (e.g., $+$, $\rightarrow$).</li>
<li><strong>Grammar</strong>: Use the productions defined in Figure 4. Example rules include:
<ul>
<li>$Reaction ::= ReactantList \ Generator \ ProductList$</li>
<li>$Reactant ::= BalancingNum \ Molecule \ IonicCharacter$</li>
</ul>
</li>
</ul>
<p><strong>2. Molecule Level Analysis (Bond Grouping)</strong></p>
<ul>
<li><strong>Endpoint Classification</strong>: Points are classified as <em>free ends</em>, <em>junctions</em> (3+ bonds), or <em>connections</em> (2 bonds).</li>
<li><strong>Grouping Equation</strong>: An endpoint $(x_k, y_k)$ belongs to group $A$ based on distance thresholding:
$$
\begin{aligned}
Include(x_k, y_k) = \begin{cases} 1, &amp; d_k &lt; t \cdot \max_j d_j + \partial \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$
where $d_k$ is the Euclidean distance from the endpoint to the group center $(x_a, y_a)$, and $t$ and $\partial$ are a scale factor and offset on the threshold.</li>
</ul>
<p><strong>3. Connection Detection</strong></p>
<ul>
<li><strong>Text-Bond Connection</strong>: A text group is connected to a bond if the free end falls within a bounding box expanded by thresholds $t_W$ and $t_H$:
$$
\begin{aligned}
Con(x,y) = \begin{cases} 1, &amp; \min x - t_W &lt; x &lt; \max x + t_W \text{ AND } \min y - t_H &lt; y &lt; \max y + t_H \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Text Recognition</strong>: Hidden Markov Models (HMM) are used for recognizing individual text symbols.</li>
<li><strong>Grammar</strong>: Context-Free Grammar (CFG) designed with ambiguity elimination to ensure a single valid parse tree for any valid formula.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured by recognition accuracy at specific processing stages:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>F1 (Key Symbol Extraction)</td>
          <td>98.3%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>F2 (State-assisted Symbol)</td>
          <td>100%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M2 (Bond End Grouping)</td>
          <td>98.8%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M3 (Free End-Text Conn)</td>
          <td>100%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>T1 (Text Recognition)</td>
          <td>98.7%</td>
          <td>Top-3 Accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, X., Shi, G., &amp; Yang, J. (2009). The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1056&ndash;1060. <a href="https://doi.org/10.1109/ICDAR.2009.70">https://doi.org/10.1109/ICDAR.2009.70</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wangUnderstandingStructureAnalyzing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The {{Understanding}} and {{Structure Analyzing}} for {{Online Handwritten Chemical Formulas}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Wang, Xin and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1056--1060}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.70}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-4500-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>On-line Handwritten Chemical Expression Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</guid><description>Two-level algorithm for recognizing on-line handwritten chemical expressions using structural analysis, ANNs, and string edit distance.</description><content:encoded><![CDATA[<h2 id="a-methodological-approach-to-chemical-recognition">A Methodological Approach to Chemical Recognition</h2>
<p>This is a <strong>Method</strong> paper. It proposes a specific &ldquo;novel two-level algorithm&rdquo; and a &ldquo;System model&rdquo; for recognizing chemical expressions. The paper focuses on the architectural design of the recognition pipeline (segmentation, substance recognition, symbol recognition) and validates it against a &ldquo;conventional algorithm&rdquo; baseline, fitting the standard profile of a methodological contribution.</p>
<h2 id="bridging-the-gap-in-pen-based-chemical-input">Bridging the Gap in Pen-Based Chemical Input</h2>
<p>While pen-based computing has advanced for text and mathematical formulas, inputting chemical expressions remains &ldquo;time-consuming&rdquo;. Existing research often lacks &ldquo;adequate chemical knowledge&rdquo; or relies on algorithms that are too slow (global optimization) or structurally weak (local optimization). The authors aim to bridge this gap by integrating chemical domain knowledge into the recognition process to improve speed and accuracy.</p>
<h2 id="two-level-recognition-strategy-for-formulas">Two-Level Recognition Strategy for Formulas</h2>
<p>The core novelty is a <strong>two-level recognition strategy</strong>:</p>
<ol>
<li><strong>Level 1 (Substance Recognition)</strong>: Uses global structural information to identify entire &ldquo;substance units&rdquo; (e.g., $H_2SO_4$) by matching against a dictionary.</li>
<li><strong>Level 2 (Symbol Recognition)</strong>: If Level 1 fails, the system falls back to segmenting the substance into isolated characters and recognizing them individually.</li>
</ol>
<p>Additionally, the method integrates <strong>syntactic features</strong> (chemical knowledge) such as element conservation to validate and correct results and uses specific geometric features to distinguish superscript/subscript relationships.</p>
<h2 id="dataset-collection-and-baseline-comparisons">Dataset Collection and Baseline Comparisons</h2>
<ul>
<li><strong>Dataset Collection</strong>: The authors collected 1,197 handwritten expression samples from 20 chemistry professionals and students. This included 983 &ldquo;standard&rdquo; expressions (from 341 templates) and 214 &ldquo;arbitrary&rdquo; expressions written freely.</li>
<li><strong>Comparison</strong>: They compared their &ldquo;Two-level recognition&rdquo; approach against a &ldquo;conventional algorithm&rdquo; that bypasses the first level and segments directly into characters.</li>
<li><strong>Metrics</strong>: They measured Material Accuracy (MA), the number of correctly recognized expressions (AEN), and Expression Accuracy (EA).</li>
</ul>
<h2 id="high-accuracy-in-formula-recognition">High Accuracy in Formula Recognition</h2>
<ul>
<li><strong>High Accuracy</strong>: The proposed algorithm achieved <strong>96.4% Material Accuracy (MA)</strong> and <strong>95.7% Expression Accuracy (EA)</strong> on the total test set.</li>
<li><strong>Robustness</strong>: The method performed well on both standard (96.3% EA) and arbitrary (92.5% EA) expressions.</li>
<li><strong>Validation</strong>: The authors conclude the algorithm is &ldquo;reliable,&rdquo; &ldquo;flexible,&rdquo; and suitable for real-time applications compared to prior work.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors constructed two distinct datasets for training and evaluation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Symbol Training</strong></td>
          <td style="text-align: left">ISF Files</td>
          <td style="text-align: left">12,240 files</td>
          <td style="text-align: left">Used to train the ANN classifier. Covers 102 symbol classes (numerals, letters, operators, organic loops).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Testing</strong></td>
          <td style="text-align: left">Handwritten Expressions</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">983 standard + 214 arbitrary expressions collected from 20 chemistry teachers/students.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Structural Segmentation (Superscript/Subscript)</strong></p>
<p>To distinguish relationships (superscript, subscript, in-line), the authors define geometric parameters based on the bounding boxes of adjacent symbols ($x_{i1}, y_{i1}, x_{i2}, y_{i2}$):</p>
<p>$$d = 0.7 \times y_{12} - y_{22} + 0.3 \times y_{11}$$
$$T = 1000 \times d/h$$
$$B = 1000 \times (B_1 - B_2)/h_1$$</p>
<p>Where $B_1, B_2$ are the vertical barycenters of the two symbols and $h, h_1$ are symbol heights. $(T, B)$ serves as the feature vector for classification.</p>
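<p>Taken at face value, the feature computation might look like the sketch below. The helper name, the test inputs, and the choice of the first symbol's height for $h$ are assumptions, not details from the paper:</p>

```python
# Compute the (T, B) superscript/subscript feature pair from two symbol
# bounding boxes and vertical barycenters, following the formulas above.

def spatial_features(box1, box2, bary1, bary2):
    """box = (x1, y1, x2, y2); bary = vertical barycenter of the symbol."""
    x11, y11, x12, y12 = box1
    x21, y21, x22, y22 = box2
    h = y12 - y11  # height of the first symbol (assumed interpretation of h)
    d = 0.7 * y12 - y22 + 0.3 * y11
    T = 1000 * d / h
    B = 1000 * (bary1 - bary2) / h
    return T, B
```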
<p><strong>2. Segmentation Reliability</strong></p>
<p>For segmenting strokes into units, the reliability of a segmentation path is calculated as:</p>
<p>$$Cof(K_{i},n)=\sum_{j=0}^{N}P(k_{j},k_{j+1})+P(S_{K_{i}})+\delta(N)$$</p>
<p>Where $P(k_j, k_{j+1})$ is the reliability of strokes being recognized as symbol $S_{k_j}$.</p>
<p><strong>3. Substance Matching (Level 1)</strong></p>
<p>A modified string edit distance is used to match handwritten input against a dictionary:</p>
<p>$$\lambda_{\overline{u}}=\mu_{i} \times f(Dis(i,j,r)/\sqrt{Max(Len_{i},Len_{j})})$$</p>
<p>Where $\mu_i$ is the recognizer credibility and $Dis(i,j,r)$ is the edit distance.</p>
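<p>A minimal sketch of length-normalized edit-distance matching in this spirit; the credibility weight $\mu$ is kept as a plain multiplier and the paper's function $f$ is simplified to a linear score, so this is an approximation of the formula rather than its exact form:</p>

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match_score(query: str, entry: str, mu: float = 1.0) -> float:
    """Higher is better; normalizes the distance by sqrt of the max length."""
    norm = edit_distance(query, entry) / math.sqrt(max(len(query), len(entry)))
    return mu * (1.0 - norm)
```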
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: An ANN-based classifier is used for isolated symbol recognition.</li>
<li><strong>Input Features</strong>: A set of ~30 features is extracted from strokes, including writing time, interval time, elastic mesh, and stroke outline.</li>
<li><strong>Performance</strong>: The classifier achieved 92.1% accuracy on a test set of 2,702 isolated symbols.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The system was evaluated on the 1,197 expression samples.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Total)</th>
          <th style="text-align: left">Value (Standard)</th>
          <th style="text-align: left">Value (Other)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Material Accuracy (MA)</strong></td>
          <td style="text-align: left">96.4%</td>
          <td style="text-align: left">97.7%</td>
          <td style="text-align: left">94%</td>
          <td style="text-align: left">Accuracy of substance recognition.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left">95.7%</td>
          <td style="text-align: left">96.3%</td>
          <td style="text-align: left">92.5%</td>
          <td style="text-align: left">Accuracy of full expression recognition.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, K., Geng, Q., &amp; Wang, Q. (2008). A Study of On-Line Handwritten Chemical Expressions Recognition. <em>2008 19th International Conference on Pattern Recognition</em>, 1&ndash;4. <a href="https://doi.org/10.1109/ICPR.2008.4761824">https://doi.org/10.1109/ICPR.2008.4761824</a></p>
<p><strong>Publication</strong>: ICPR 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{yangStudyOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Study of On-Line Handwritten Chemical Expressions Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 19th {{International Conference}} on {{Pattern Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yang, Jufeng and Shi, Guangshun and Wang, Kai and Geng, Qian and Wang, Qingren}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1--4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tampa, FL, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2008.4761824}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Img2Mol: Accurate SMILES Recognition from Depictions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/</guid><description>Two-stage CNN approach for converting molecular images to SMILES using CDDD embeddings and extensive data augmentation.</description><content:encoded><![CDATA[<h2 id="method-classification">Method Classification</h2>
<p>This is a <strong>method paper</strong> that introduces Img2Mol, a deep learning system for Optical Chemical Structure Recognition (OCSR). The work focuses on building a fast, accurate, and robust system for converting molecular structure depictions into machine-readable SMILES strings.</p>
<h2 id="systematization-and-motivation">Systematization and Motivation</h2>
<p>Vast amounts of chemical knowledge exist only as images in scientific literature and patents, making this data inaccessible for computational analysis, database searches, or machine learning pipelines. Manually extracting this information is slow and error-prone, creating a bottleneck for drug discovery and chemical research.</p>
<p>While rule-based OCSR systems like OSRA, MolVec, and Imago exist, they are brittle. Small variations in drawing style or image quality can cause them to fail. The authors argue that a deep learning approach, trained on diverse synthetic data, can generalize better across different depiction styles and handle the messiness of real-world images more reliably.</p>
<h2 id="two-stage-architecture-and-core-novelty">Two-Stage Architecture and Core Novelty</h2>
<p>The novelty lies in a two-stage architecture that separates perception from decoding, combined with aggressive data augmentation to ensure robustness. The key contributions are:</p>
<p><strong>1. Two-Stage Architecture with CDDD Embeddings</strong></p>
<p>Img2Mol uses an intermediate representation to predict SMILES from pixels. A <strong>custom CNN encoder</strong> maps the input image to a 512-dimensional <strong>Continuous and Data-Driven Molecular Descriptor (CDDD)</strong> embedding - a pre-trained, learned molecular representation that smoothly captures chemical similarity. A <strong>pre-trained decoder</strong> then converts this CDDD vector into the final canonical SMILES string.</p>
<p>This two-stage design has several advantages:</p>
<ul>
<li>The CDDD space is continuous and chemically meaningful, so nearby embeddings correspond to structurally similar molecules. This makes the regression task easier than learning discrete token sequences directly.</li>
<li>The decoder is pre-trained and fixed, so the CNN only needs to learn the image → CDDD mapping. This decouples the visual recognition problem from the sequence generation problem.</li>
<li>CDDD embeddings naturally enforce chemical validity constraints, reducing the risk of generating nonsensical structures.</li>
</ul>
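<p>The decoupling can be illustrated with stubs. None of this is the actual Img2Mol code: the toy embeddings and the nearest-neighbor &ldquo;decoder&rdquo; merely stand in for the CNN and the pre-trained CDDD decoder to show which part is trainable and which is frozen:</p>

```python
# Two-stage design: a trainable encoder regresses the image to a 512-d
# embedding; a frozen decoder maps the embedding to SMILES. Training
# only ever updates the encoder side.

EMBED_DIM = 512

def encoder(image, weights):
    """Stub CNN: reduces the image to EMBED_DIM values via a weighted sum."""
    s = sum(image)
    return [w * s for w in weights]

def frozen_decoder(embedding):
    """Stub for the fixed CDDD decoder: nearest known embedding wins."""
    known = {"CCO": [0.1] * EMBED_DIM, "c1ccccc1": [0.9] * EMBED_DIM}
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(embedding, v))
    return min(known, key=lambda smi: dist(known[smi]))
```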
<p><strong>2. Extensive Data Augmentation for Robustness</strong></p>
<p>The model was trained on 11.1 million unique molecules from ChEMBL and PubChem, but the critical insight is how the training images were generated. To expose the CNN to maximum variation in depiction styles, the authors:</p>
<ul>
<li>Used <strong>three different cheminformatics libraries</strong> (RDKit, OEChem, Indigo) to render images, each with its own drawing conventions</li>
<li>Applied <strong>wide-ranging augmentations</strong>: varying bond thickness, font size, rotation, resolution (originally 192-256 px; expanded to 190-2500 px in the final model), and other stylistic parameters</li>
<li><strong>Over-sampled larger molecules</strong> to improve performance on complex structures, which are underrepresented in chemical databases</li>
</ul>
<p>This ensures the network rarely sees the same depiction of a molecule twice, forcing it to learn invariant features.</p>
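<p>A sketch of sampling per-image rendering parameters in this spirit. The three library names and the 190&ndash;2500 px resolution range come from the text; the parameter names, the uniform distributions, and the bond-thickness/font ranges are assumptions:</p>

```python
import random

RENDERERS = ["rdkit", "oechem", "indigo"]  # the three libraries used

def sample_augmentation(rng: random.Random) -> dict:
    """Draw one random rendering configuration for a training image."""
    return {
        "renderer": rng.choice(RENDERERS),
        "resolution": rng.randint(190, 2500),   # final model's pixel range
        "rotation_deg": rng.uniform(0.0, 360.0),
        "bond_thickness": rng.uniform(0.5, 2.0),  # assumed range
        "font_scale": rng.uniform(0.7, 1.3),      # assumed range
    }
```

<p>Sampling a fresh configuration per epoch is what makes it unlikely the network ever sees the same depiction of a molecule twice.</p>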
<p><strong>3. Fast Inference</strong></p>
<p>Because the architecture is a simple CNN followed by a fixed decoder, inference is very fast - especially compared to rule-based systems that rely on iterative graph construction algorithms. This makes Img2Mol practical for large-scale document mining.</p>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating that Img2Mol is more accurate, robust, and generalizable than existing rule-based systems:</p>
<ol>
<li>
<p><strong>Benchmark Comparisons</strong>: Img2Mol was tested on several standard OCSR benchmarks, including USPTO (patent images), University of Birmingham (UoB), CLEF, and JPO (Japanese Patent Office) datasets, against three open-source baselines: <strong>OSRA, MolVec, and Imago</strong>. No deep learning baselines were available at the time for comparison.</p>
</li>
<li>
<p><strong>Resolution and Molecular Size Analysis</strong>: The initial model, <code>Img2Mol(no aug.)</code>, was evaluated across different image resolutions and molecule sizes (measured by number of atoms) to understand failure modes. This revealed that:</p>
<ul>
<li>Performance degraded for molecules with &gt;35 atoms</li>
<li>Very high-resolution images lost detail when downscaled to the fixed input size</li>
<li>Low-resolution images (where rule-based methods failed completely) were handled well</li>
</ul>
</li>
<li>
<p><strong>Data Augmentation Ablation</strong>: A final model, <strong>Img2Mol</strong>, was trained with the full augmentation pipeline (wider resolution range, over-sampling of large molecules). Performance was compared to the initial version to quantify the effect of augmentation.</p>
</li>
<li>
<p><strong>Depiction Library Robustness</strong>: The model was tested on images generated by each of the three rendering libraries separately to confirm that training on diverse styles improved generalization.</p>
</li>
<li>
<p><strong>Input Perturbation for Benchmark Fairness</strong>: For the smaller benchmark datasets (USPTO, UoB, CLEF, JPO), the authors applied slight random rotation (within &plusmn;5&deg;) and shearing to each image five times to detect potential overfitting of rule-based methods to well-known benchmarks.</p>
</li>
<li>
<p><strong>Generalization Tests</strong>: Img2Mol was evaluated on real-world patent images from the <strong>STAKER</strong> dataset, which were not synthetically generated. This tested whether the model could transfer from synthetic training data to real documents.</p>
</li>
<li>
<p><strong>Hand-Drawn Molecule Recognition</strong>: As an exploratory test, the authors evaluated performance on hand-drawn molecular structures, a task the model was never trained for, to see if the learned features could generalize to completely different visual styles.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured and compared to rule-based baselines to demonstrate the practical efficiency of the approach.</p>
</li>
</ol>
<h2 id="results-conclusions-and-limitations">Results, Conclusions, and Limitations</h2>
<p>Key benchmark results from Table 1 of the paper (accuracy / Tanimoto similarity, in %):</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Img2Mol</th>
          <th>MolVec 0.9.8</th>
          <th>Imago 2.0</th>
          <th>OSRA 2.1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Img2Mol test set</td>
          <td>88.25 / 95.27</td>
          <td>2.59 / 13.03</td>
          <td>0.02 / 4.74</td>
          <td>2.59 / 13.03</td>
      </tr>
      <tr>
          <td>STAKER</td>
          <td>64.33 / 83.76</td>
          <td>5.32 / 31.78</td>
          <td>0.07 / 5.06</td>
          <td>5.23 / 26.98</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>42.29 / 73.07</td>
          <td>30.68 / 65.50</td>
          <td>5.07 / 7.28</td>
          <td>6.37 / 44.21</td>
      </tr>
      <tr>
          <td>UoB</td>
          <td>78.18 / 88.51</td>
          <td>75.01 / 86.88</td>
          <td>5.12 / 7.19</td>
          <td>70.89 / 85.27</td>
      </tr>
      <tr>
          <td>CLEF</td>
          <td>48.84 / 78.04</td>
          <td>44.48 / 76.61</td>
          <td>26.72 / 41.29</td>
          <td>17.04 / 58.84</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>45.14 / 69.43</td>
          <td>49.48 / 66.46</td>
          <td>23.18 / 37.47</td>
          <td>33.04 / 49.62</td>
      </tr>
  </tbody>
</table>
<p>Per-library accuracy on a 5,000-compound subset (depicted five times each):</p>
<table>
  <thead>
      <tr>
          <th>Library</th>
          <th>Img2Mol</th>
          <th>MolVec</th>
          <th>Imago</th>
          <th>OSRA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RDKit</td>
          <td>93.4%</td>
          <td>3.7%</td>
          <td>0.3%</td>
          <td>4.4%</td>
      </tr>
      <tr>
          <td>OEChem</td>
          <td>89.5%</td>
          <td>33.4%</td>
          <td>12.3%</td>
          <td>26.3%</td>
      </tr>
      <tr>
          <td>Indigo</td>
          <td>79.0%</td>
          <td>22.2%</td>
          <td>4.2%</td>
          <td>22.6%</td>
      </tr>
  </tbody>
</table>
<ul>
<li>
<p><strong>Substantial Performance Gains</strong>: Img2Mol outperformed all three rule-based baselines on nearly every benchmark; the one exception was JPO, where MolVec scored higher (49.48% vs. 45.14% accuracy). Accuracy was measured both as exact SMILES match and as <strong>Tanimoto similarity</strong> (using ECFP6 1024-bit fingerprints). Even when Img2Mol did not predict the exact molecule, it often predicted a chemically similar one.</p>
</li>
<li>
<p><strong>Robustness Across Conditions</strong>: The full Img2Mol model (with aggressive augmentation) showed consistent performance across all image resolutions and molecule sizes. In contrast, rule-based systems were &ldquo;brittle&rdquo; - performance dropped sharply with minor perturbations to image quality or style.</p>
</li>
<li>
<p><strong>Depiction Library Invariance</strong>: Img2Mol&rsquo;s performance was stable across all three rendering libraries (RDKit, OEChem, Indigo), validating the multi-library training strategy. Rule-based methods struggled particularly with RDKit-generated images.</p>
</li>
<li>
<p><strong>Strong Generalization to Real-World Data</strong>: Despite being trained exclusively on synthetic images, Img2Mol performed well on real patent images from the STAKER dataset. This suggests the augmentation strategy successfully captured the diversity of real-world depictions.</p>
</li>
<li>
<p><strong>Overfitting in Baselines</strong>: Rule-based methods performed surprisingly well on older benchmarks (USPTO, UoB, CLEF) but failed on newer datasets (Img2Mol&rsquo;s test set, STAKER). This suggests they may be implicitly tuned to specific drawing conventions in legacy datasets.</p>
</li>
<li>
<p><strong>Limited Hand-Drawn Recognition</strong>: Img2Mol could recognize simple hand-drawn structures but struggled with complex or large molecules. This is unsurprising given the lack of hand-drawn data in training, but it highlights a potential avenue for future work.</p>
</li>
<li>
<p><strong>Speed Advantage</strong>: Img2Mol processed 5,000 images in approximately 4 minutes at the smallest input size, with compute time mostly independent of input resolution due to the fixed 224x224 rescaling. Rule-based methods showed sharply increasing compute times at higher resolutions.</p>
</li>
</ul>
<p>The work establishes that deep learning can outperform traditional rule-based OCSR systems when combined with a principled two-stage architecture and comprehensive data augmentation. The CDDD embedding acts as a bridge between visual perception and chemical structure, providing a chemically meaningful intermediate representation that improves both accuracy and robustness. The focus on synthetic data diversity proves to be an effective strategy for generalizing to real-world documents.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Custom 8-layer Convolutional Neural Network (CNN) encoder</p>
<ul>
<li><strong>Input</strong>: $224 \times 224$ pixel grayscale images</li>
<li><strong>Backbone Structure</strong>: 8 convolutional layers organized into 3 stacks, followed by 3 fully connected layers
<ul>
<li><strong>Stack 1</strong>: 3 Conv layers ($7 \times 7$ filters, stride 3, padding 4) + Max Pooling</li>
<li><strong>Stack 2</strong>: 2 Conv layers + Max Pooling</li>
<li><strong>Stack 3</strong>: 3 Conv layers + Max Pooling</li>
<li><strong>Head</strong>: 3 fully connected layers</li>
</ul>
</li>
<li><strong>Output</strong>: 512-dimensional CDDD embedding vector</li>
</ul>
<p><strong>Decoder</strong>: Pre-trained CDDD decoder (from Winter et al.) - fixed during training, not updated</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Loss Function</strong>: Mean Squared Error (MSE) regression minimizing the distance between the predicted and true embeddings:</p>
<p>$$
\mathcal{L}_{\text{MSE}} = \frac{1}{512} \sum_{i=1}^{512} \left( \text{cddd}_{\text{true},i} - \text{cddd}_{\text{pred},i} \right)^2
$$</p>
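<p>Numerically, the objective is ordinary MSE over the 512 embedding dimensions. The hand-rolled gradient step below stands in for AdamW updating the CNN weights; it is a toy illustration of the objective, not the training code:</p>

```python
# Minimal illustration: MSE between a predicted and a target CDDD
# embedding, plus repeated gradient steps on the prediction itself.

DIM = 512

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def gradient_step(pred, target, lr=0.1):
    # d(MSE)/d(pred_i) = 2 * (pred_i - target_i) / DIM
    return [p - lr * 2 * (p - t) / len(pred) for p, t in zip(pred, target)]

target = [0.5] * DIM   # hypothetical "true" embedding
pred = [0.0] * DIM     # hypothetical network output
for _ in range(100):
    pred = gradient_step(pred, target)  # loss shrinks toward zero
```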
<p><strong>Optimizer</strong>: AdamW with initial learning rate $10^{-4}$</p>
<p><strong>Training Schedule</strong>:</p>
<ul>
<li>Batch size: 256</li>
<li>Training duration: 300 epochs</li>
<li>Plateau scheduler: Multiplies learning rate by 0.7 if validation loss plateaus for 10 epochs</li>
<li>Early stopping: Triggered if no improvement in validation loss for 50 epochs</li>
</ul>
<p><strong>Noise Tolerance</strong>: The decoder requires the CNN to predict embeddings with noise level $\sigma \le 0.15$ to achieve &gt;90% accuracy</p>
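<p>This tolerance can be illustrated numerically: perturb a &ldquo;true&rdquo; embedding with Gaussian noise of a given $\sigma$ and measure how far the result drifts. The embedding values and the RMS-error framing are assumptions for the sake of the example; only $\sigma = 0.15$ comes from the paper:</p>

```python
import math
import random

def perturb(embedding, sigma, rng):
    """Add i.i.d. Gaussian noise of scale sigma to every dimension."""
    return [v + rng.gauss(0.0, sigma) for v in embedding]

def rms_error(a, b):
    """Root-mean-square deviation between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
```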
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>: 11.1 million unique molecules from ChEMBL and PubChem</p>
<p><strong>Splits</strong>: Approximately 50,000 examples each for validation and test sets</p>
<p><strong>Synthetic Image Generation</strong>:</p>
<ul>
<li>Three cheminformatics libraries: RDKit, OEChem, and Indigo</li>
<li>Augmentations: Resolution (190-2500 pixels), rotation, bond thickness, font size</li>
<li>Salt stripping: Keep only the largest fragment</li>
<li>Over-sampling: Larger molecules (&gt;35 atoms) over-sampled to improve performance</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li>Exact SMILES match accuracy</li>
<li>Tanimoto similarity (chemical fingerprint-based structural similarity)</li>
</ul>
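<p>Both metrics are easy to state precisely. The sketch below computes Tanimoto similarity on toy on-bit sets; the paper uses ECFP6 1024-bit fingerprints, for which these made-up sets merely stand in:</p>

```python
def exact_match(pred_smiles: str, true_smiles: str) -> bool:
    """Exact-match accuracy assumes both SMILES are canonicalized first."""
    return pred_smiles == true_smiles

def tanimoto(fp_a: set, fp_b: set) -> float:
    """|A intersect B| / |A union B| over the sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```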
<p><strong>Benchmarks</strong>:</p>
<ul>
<li>Img2Mol test set (25,000 synthetic images at 224x224 px)</li>
<li>STAKER (30,000 real-world USPTO patent images at 256x256 px)</li>
<li>USPTO (4,852 patent images, avg. 649x417 px)</li>
<li>UoB (5,716 images from University of Birmingham, avg. 762x412 px)</li>
<li>CLEF (711 images, avg. 1243x392 px)</li>
<li>JPO (365 Japanese Patent Office images, avg. 607x373 px)</li>
<li>Hand-drawn molecular structures (exploratory, no defined benchmark)</li>
</ul>
<p><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based systems)</p>
<h3 id="hardware">Hardware</h3>
<p>⚠️ <strong>Unspecified in paper or supplementary materials.</strong> Inference speed reported as ~4 minutes for 5000 images; training hardware (GPU model, count) is undocumented.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">Img2Mol GitHub</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">Img2Mol model weights</a></td>
          <td>Model</td>
          <td>CC BY-NC 4.0</td>
          <td>Non-commercial use only</td>
      </tr>
  </tbody>
</table>
<h3 id="known-limitations">Known Limitations</h3>
<p><strong>Molecular Size</strong>: Performance degrades for molecules with &gt;35 atoms. This is partly a property of the CDDD latent space itself: for larger molecules, the &ldquo;volume of decodable latent space&rdquo; shrinks, making the decoder more sensitive to small noise perturbations in the predicted embedding.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clevert, D.-A., Le, T., Winter, R., &amp; Montanari, F. (2021). Img2Mol &ndash; accurate SMILES recognition from molecular graphical depictions. <em>Chemical Science</em>, 12(42), 14174&ndash;14181. <a href="https://doi.org/10.1039/d1sc01839f">https://doi.org/10.1039/d1sc01839f</a></p>
<p><strong>Publication</strong>: Chemical Science (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">GitHub Repository</a></li>
<li><a href="https://doi.org/10.1039/d1sc01839f">Paper on Royal Society of Chemistry</a></li>
</ul>
]]></content:encoded></item><item><title>HMM-based Online Recognition of Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</guid><description>Online recognition of handwritten chemical symbols using Hidden Markov Models with 11-dimensional local features, achieving 89.5% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper that proposes a specific algorithmic pipeline for the online recognition of handwritten chemical symbols. The core contribution is the engineering of an 11-dimensional feature vector combined with a Hidden Markov Model (HMM) architecture. The paper validates this method through quantitative experiments on a custom dataset, focusing on recognition accuracy as the primary metric.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Recognizing chemical symbols is uniquely challenging due to the complex structure of chemical expressions and the nature of pen-based input, which often produces broken or conglutinated (merged) strokes. Variations in writing style and random noise add further difficulty. While online recognition of Western characters and CJK scripts is well developed, work specifically targeting online chemical symbol recognition is scarce; most prior research focuses on offline recognition or global optimization.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The primary novelty is the application of continuous HMMs specifically to the domain of <strong>online</strong> chemical symbol recognition, utilizing a specialized set of <strong>11-dimensional local features</strong>. While HMMs have been used for other scripts, this paper tailors the feature extraction (including curliness, linearity, and writing direction) to capture the specific geometric properties of chemical symbols.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors constructed a specific dataset for this task involving 20 participants (college teachers and students).</p>
<ul>
<li><strong>Dataset</strong>: 64 distinct symbols (digits, English letters, Greek letters, operators)</li>
<li><strong>Volume</strong>: 7,808 total samples (122 per symbol), split into 5,670 training samples and 2,016 testing samples</li>
<li><strong>Model Sweeps</strong>: They evaluated the HMM performance by varying the number of states (4, 6, 8) and the number of Gaussians per state (3, 4, 6, 9, 12)</li>
</ul>
<h2 id="what-were-the-outcomes-and-conclusions-drawn">What were the outcomes and conclusions drawn?</h2>
<ul>
<li><strong>Performance</strong>: The best configuration (6 states, 9 Gaussians) achieved a <strong>top-1 accuracy of 89.5%</strong> and a <strong>top-3 accuracy of 98.7%</strong></li>
<li><strong>Scaling</strong>: Results showed that generally, increasing the number of states and Gaussians improved accuracy, though at the cost of computational efficiency</li>
<li><strong>Error Analysis</strong>: The primary sources of error were shape similarities between specific characters (e.g., &lsquo;0&rsquo; vs &lsquo;O&rsquo; vs &lsquo;o&rsquo;, and &lsquo;C&rsquo; vs &lsquo;c&rsquo; vs &lsquo;(&rsquo;)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed / Very Low Reproducibility. This 2009 study relies on a private, custom-collected dataset and does not provide source code, model weights, or an open-access preprint.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None publicly available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No open source code, open datasets, or open-access preprints were released with this publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilized a custom dataset collected in a laboratory environment.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">5,670 samples</td>
          <td style="text-align: left">90 samples per symbol</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">2,016 samples</td>
          <td style="text-align: left">32 samples per symbol</td>
      </tr>
  </tbody>
</table>
<p><strong>Dataset Composition</strong>: The set includes <strong>64 symbols</strong>: Digits (0-9), Uppercase (A-Z, missing Q), Lowercase (a-z, selected), Greek letters ($\alpha$, $\beta$, $\gamma$, $\pi$), and operators ($+$, $=$, $\rightarrow$, $\uparrow$, $\downarrow$, $($ , $)$).</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Preprocessing</strong></p>
<p>The raw tablet data undergoes a 6-step pipeline:</p>
<ol>
<li><strong>Duplicate Point Elimination</strong>: Removing sequential points with identical coordinates</li>
<li><strong>Broken Stroke Connection</strong>: Using Bezier curves to interpolate missing points/connect broken strokes</li>
<li><strong>Hook Elimination</strong>: Removing artifacts at the start/end of strokes characterized by short length and sharp angle changes</li>
<li><strong>Smoothing</strong>: Reducing noise from erratic pen movement</li>
<li><strong>Re-sampling</strong>: Spacing points equidistantly to remove temporal variation</li>
<li><strong>Size Normalization</strong>: Removing variation in writing scale</li>
</ol>
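<p>The last two steps of the pipeline are simple geometric transforms. A minimal sketch of re-sampling and size normalization (function names, point counts, and the interpolation scheme are illustrative, not taken from the paper):</p>

```python
import numpy as np

def resample_equidistant(stroke: np.ndarray, n_points: int = 64) -> np.ndarray:
    """Step 5: space points equidistantly along the stroke's arc length."""
    deltas = np.diff(stroke, axis=0)
    seg_len = np.hypot(deltas[:, 0], deltas[:, 1])
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])  # cumulative arc length
    targets = np.linspace(0.0, arc[-1], n_points)
    x = np.interp(targets, arc, stroke[:, 0])
    y = np.interp(targets, arc, stroke[:, 1])
    return np.stack([x, y], axis=1)

def normalize_size(stroke: np.ndarray) -> np.ndarray:
    """Step 6: scale the stroke into a unit box, preserving aspect ratio."""
    mins, maxs = stroke.min(axis=0), stroke.max(axis=0)
    span = np.maximum(maxs - mins, 1e-9)  # guard against degenerate strokes
    return (stroke - mins) / span.max()
```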
<p><strong>2. Feature Extraction (11 Dimensions)</strong></p>
<p>Features are extracted from a 5-point window centered on $t$ ($t-2$ to $t+2$). The 11 dimensions are:</p>
<ol>
<li><strong>Normalized Vertical Position</strong>: $y(t)$ mapped to $[0,1]$</li>
<li><strong>Normalized First Derivative ($x'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized First Derivative ($y'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized Second Derivative ($x''$)</strong>: Computed using $x'$ values</li>
<li><strong>Normalized Second Derivative ($y''$)</strong>: Computed using $y'$ values</li>
<li><strong>Curvature</strong>: $\frac{x'y'' - x''y'}{(x'^2 + y'^2)^{3/2}}$</li>
<li><strong>Writing Direction (Cos)</strong>: $\cos \alpha(t)$ based on vector from $t-1$ to $t+1$</li>
<li><strong>Writing Direction (Sin)</strong>: $\sin \alpha(t)$</li>
<li><strong>Aspect Ratio</strong>: Ratio of height to width in the 5-point window</li>
<li><strong>Curliness</strong>: Deviation from the straight line connecting the first and last point of the window</li>
<li><strong>Linearity</strong>: Average squared distance of points in the window to the straight line connecting start/end points</li>
</ol>
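<p>A few of the windowed features can be sketched directly from their definitions. The sketch below computes writing direction (features 7-8), aspect ratio (feature 9), and linearity (feature 11) for one window; the function name and return format are illustrative:</p>

```python
import numpy as np

def window_features(pts: np.ndarray, t: int) -> dict:
    """Features over the 5-point window [t-2, t+2] centered on point t."""
    w = pts[t - 2:t + 3]
    # Writing direction: angle of the vector from point t-1 to t+1.
    dx, dy = pts[t + 1] - pts[t - 1]
    norm = np.hypot(dx, dy) or 1e-9
    cos_a, sin_a = dx / norm, dy / norm
    # Aspect ratio: window height over window width.
    width = np.ptp(w[:, 0]) or 1e-9
    aspect = np.ptp(w[:, 1]) / width
    # Linearity: mean squared distance to the chord joining window endpoints.
    p0, p1 = w[0], w[-1]
    chord = p1 - p0
    chord_len = np.hypot(*chord) or 1e-9
    dists = np.abs(chord[0] * (w[:, 1] - p0[1])
                   - chord[1] * (w[:, 0] - p0[0])) / chord_len
    return {"cos": cos_a, "sin": sin_a,
            "aspect": aspect, "linearity": float(np.mean(dists ** 2))}
```

<p>On a perfectly straight horizontal window, this yields $\cos\alpha = 1$, $\sin\alpha = 0$, and zero linearity, as expected.</p>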
<p><strong>3. Feature Normalization</strong></p>
<p>The final feature matrix $V$ is normalized to zero mean and unit standard deviation using the covariance matrix: $o_t = \Sigma^{-1/2}(v_t - \mu)$.</p>
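<p>This is standard whitening. A minimal sketch via the eigendecomposition of the covariance matrix (the eigendecomposition route is one common way to form $\Sigma^{-1/2}$; the paper does not specify how it is computed):</p>

```python
import numpy as np
from numpy.linalg import eigh

def whiten(V: np.ndarray) -> np.ndarray:
    """o_t = Sigma^{-1/2} (v_t - mu): zero-mean, identity-covariance features."""
    mu = V.mean(axis=0)
    sigma = np.cov(V, rowvar=False)
    vals, vecs = eigh(sigma)                       # Sigma = Q diag(vals) Q^T
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return (V - mu) @ inv_sqrt.T
```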
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Continuous Hidden Markov Models (HMM)</li>
<li><strong>Topology</strong>: Left-to-right (Bakis model)</li>
<li><strong>Initialization</strong>: Initial distribution $\pi = \{1, 0, \ldots, 0\}$; uniform transition matrix $A$; segmental k-means for observation matrix $B$</li>
<li><strong>Training</strong>: Baum-Welch re-estimation</li>
<li><strong>Decision</strong>: Maximum likelihood classification ($\hat{\lambda} = \arg \max P(O|\lambda)$)</li>
</ul>
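<p>The decision rule scores each sequence under every symbol's HMM with the forward algorithm and picks the maximum. A log-domain sketch with a single diagonal Gaussian per state standing in for the paper's 9-component mixtures (the Bakis topology is enforced by placing $-\infty$ in the transition log-matrix):</p>

```python
import numpy as np

def forward_loglik(obs, means, variances, log_A, log_pi):
    """log P(O | lambda) for a left-to-right HMM with diagonal-Gaussian
    emissions, computed with the forward algorithm in the log domain."""
    def log_gauss(x, mu, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)
    log_b = np.array([log_gauss(obs, m, v) for m, v in zip(means, variances)])  # (S, T)
    alpha = log_pi + log_b[:, 0]
    for t in range(1, obs.shape[0]):
        # logsumexp over previous states; -inf entries encode the Bakis mask
        alpha = log_b[:, t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

def classify(obs, models):
    """Maximum-likelihood decision: argmax over lambda of P(O | lambda)."""
    scores = {name: forward_loglik(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get)
```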
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Best Value</th>
          <th style="text-align: left">Configuration</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Accuracy</strong></td>
          <td style="text-align: left"><strong>89.5%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Highest reported accuracy</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-3 Accuracy</strong></td>
          <td style="text-align: left"><strong>98.7%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Top-3 candidate accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Yang, J. (2009). HMM-Based Online Recognition of Handwritten Chemical Symbols. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1255&ndash;1259. <a href="https://doi.org/10.1109/ICDAR.2009.99">https://doi.org/10.1109/ICDAR.2009.99</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2009hmm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{HMM-Based Online Recognition of Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th International Conference on Document Analysis and Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{75}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1255--1259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.99}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Symbol Recognition Using SVMs</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</guid><description>A hybrid SVM and elastic matching approach for recognizing handwritten chemical symbols drawn on touch devices, achieving 89.7% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-taxonomy">Paper Contribution and Taxonomy</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI for Physical Sciences taxonomy</a>.</p>
<ul>
<li><strong>Dominant Basis</strong>: The authors propose a novel hybrid architecture (SVM-EM) that combines two existing techniques to solve a specific recognition problem.</li>
<li><strong>Rhetorical Indicators</strong>: The paper explicitly defines algorithms (Algorithm 1 &amp; 2), presents a system architecture, and validates the method via ablation studies comparing the hybrid approach against its individual components.</li>
</ul>
<h2 id="motivation-for-pen-based-input">Motivation for Pen-Based Input</h2>
<p>Entering chemical expressions on digital devices is difficult due to their complex 2D spatial structure.</p>
<ul>
<li><strong>The Problem</strong>: While handwriting recognition for text and math is mature, chemical structures involve unique symbols and spatial arrangements that existing tools struggle to process efficiently.</li>
<li><strong>Existing Solutions</strong>: Standard tools (like ChemDraw) rely on point-and-click interactions, which are described as complicated and non-intuitive compared to direct handwriting.</li>
<li><strong>Goal</strong>: To enable fluid handwriting input on pen/touch-based devices (like iPads) by accurately recognizing individual chemical symbols in real-time.</li>
</ul>
<h2 id="novelty-hybrid-svm-and-elastic-matching">Novelty: Hybrid SVM and Elastic Matching</h2>
<p>The core contribution is the <strong>Hybrid SVM-EM</strong> approach, which splits recognition into a coarse classification stage and a fine-grained verification stage.</p>
<ul>
<li><strong>Two-Stage Pipeline</strong>:
<ol>
<li><strong>SVM Recognition</strong>: Uses statistical features (stroke count, turning angles) to generate a short-list of candidate symbols.</li>
<li><strong>Elastic Matching (EM)</strong>: Uses a geometric point-to-point distance metric to re-rank these candidates against a library of stored symbol prototypes.</li>
</ol>
</li>
<li><strong>Online Stroke Partitioning</strong>: A heuristic-based method to group strokes into symbols in real-time based on time adjacency (grouping the last $n$ strokes) and spatial intersection checks, without waiting for the user to finish the entire drawing.</li>
</ul>
<h2 id="experimental-design-and-data-collection">Experimental Design and Data Collection</h2>
<p>The authors conducted a user study to collect data and evaluate the system:</p>
<ul>
<li><strong>Participants</strong>: 10 users were recruited to write chemical symbols on an iPad.</li>
<li><strong>Task</strong>: Each user wrote 78 distinct chemical symbols (digits, alphabets, bonds) 3 times each.</li>
<li><strong>Baselines</strong>: The hybrid method was compared against two baselines:
<ol>
<li>SVM only</li>
<li>Elastic Matching only</li>
</ol>
</li>
<li><strong>Metrics</strong>: Evaluation focused on <strong>Precision@k</strong> (where $k=1, 3, 5$), measuring how often the correct symbol appeared in the top-$k$ suggestions.</li>
</ul>
<h2 id="recognition-performance-and-outcomes">Recognition Performance and Outcomes</h2>
<p>The hybrid approach demonstrated improved performance compared to using either technique in isolation.</p>
<ul>
<li><strong>Key Results</strong>:
<ul>
<li><strong>Hybrid SVM-EM</strong>: 89.7% Precision@1 (Top-1 accuracy).</li>
<li><strong>SVM Only</strong>: 85.1% Precision@1.</li>
<li><strong>EM Only</strong>: 76.7% Precision@1.</li>
</ul>
</li>
<li><strong>Category Performance</strong>: The system performed best on Operators (91.9%) and Digits (91.3%), with slightly lower performance on Alphabetic characters (88.6%).</li>
<li><strong>Impact</strong>: The system was successfully implemented as a real-time iOS application, allowing users to draw complex structures like $C\#CC(O)$ which are then converted to SMILES strings.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study generated a custom dataset for training and evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Stats</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>2,340 samples</td>
          <td>Collected from 10 users. Consists of <strong>78 unique symbols</strong>: 10 digits (0-9), 52 letters (A-Z, a-z), and 16 bonds/operators (e.g., $=$, $+$, hash bonds).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Unspecified size</td>
          <td>A &ldquo;Chemical Elastic Symbol Library&rdquo; was created containing samples of all supported symbols to serve as prototypes for the Elastic Matching step.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct algorithmic steps:</p>
<p><strong>1. Stroke Partitioning</strong></p>
<ul>
<li><strong>Logic</strong>: Groups the most recently written stroke with up to the last 4 previous strokes.</li>
<li><strong>Filtering</strong>: Invalid groups are removed using &ldquo;Spatial Distance Checking&rdquo; (strokes too far apart) and &ldquo;Stroke Intersection Checking&rdquo; (strokes that don&rsquo;t intersect where expected).</li>
</ul>
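<p>The partitioning heuristic can be sketched as follows. The bounding-box gap test stands in for the paper's &ldquo;Spatial Distance Checking&rdquo;; the threshold value and the omission of the intersection check are illustrative simplifications:</p>

```python
import numpy as np

def candidate_groups(strokes, max_prev=4, max_gap=30.0):
    """Group the newest stroke with up to the last 4 previous strokes,
    discarding groups whose strokes lie too far from the newest one."""
    def bbox(pts):
        a = np.asarray(pts)
        return a.min(axis=0), a.max(axis=0)

    def bbox_gap(s1, s2):
        (lo1, hi1), (lo2, hi2) = bbox(s1), bbox(s2)
        # Per-axis separation between the two boxes; 0 when they overlap.
        gap = np.maximum(np.maximum(lo1 - hi2, lo2 - hi1), 0.0)
        return np.hypot(*gap)

    newest = strokes[-1]
    groups = [[newest]]  # the newest stroke alone is always a candidate
    for k in range(1, min(max_prev, len(strokes) - 1) + 1):
        group = strokes[-1 - k:]
        if all(bbox_gap(newest, s) <= max_gap for s in group[:-1]):
            groups.append(list(group))
    return groups
```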
<p><strong>2. Preprocessing</strong></p>
<ul>
<li><strong>Size Normalization</strong>: Scales symbol to a standard size based on its bounding box.</li>
<li><strong>Smoothing</strong>: Uses average smoothing (replacing mid-points with the average of neighbors) to remove jitter.</li>
<li><strong>Sampling</strong>: Resamples valid strokes to a fixed number of <strong>50 points</strong>.</li>
</ul>
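<p>A sketch of the preprocessing steps. The index-based resampling (rather than arc-length resampling) is an assumption; the paper only specifies a fixed output of 50 points:</p>

```python
import numpy as np

def smooth(stroke: np.ndarray) -> np.ndarray:
    """Average smoothing: each interior point becomes the mean of itself
    and its two neighbors; endpoints are kept unchanged."""
    out = stroke.copy()
    out[1:-1] = (stroke[:-2] + stroke[1:-1] + stroke[2:]) / 3.0
    return out

def normalize_and_sample(stroke: np.ndarray, n: int = 50) -> np.ndarray:
    """Scale to a unit bounding box, then resample to n points."""
    mins = stroke.min(axis=0)
    span = max(np.ptp(stroke[:, 0]), np.ptp(stroke[:, 1]), 1e-9)
    s = (stroke - mins) / span
    idx = np.linspace(0, len(s) - 1, n)
    x = np.interp(idx, np.arange(len(s)), s[:, 0])
    y = np.interp(idx, np.arange(len(s)), s[:, 1])
    return np.stack([x, y], axis=1)
```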
<p><strong>3. SVM Feature Extraction</strong></p>
<ul>
<li><strong>Horizontal Angle</strong>: Calculated between two consecutive points ($P_1, P_2$). Values are binned into 12 groups ($30^{\circ}$ each).</li>
<li><strong>Turning Angle</strong>: The difference between two consecutive horizontal angles. Values are binned into 18 groups ($10^{\circ}$ each).</li>
<li><strong>Features</strong>: Input vector consists of stroke count, normalized coordinates, and the percentage of angles falling into the histograms described above.</li>
</ul>
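<p>The two angle histograms can be sketched directly. The wrap-around convention for the turning angle (folding into $(-90^{\circ}, 90^{\circ}]$ so that 18 bins of $10^{\circ}$ cover it) is a guess about the paper's binning, not something it states:</p>

```python
import numpy as np

def angle_histograms(pts: np.ndarray):
    """Horizontal-angle histogram (12 bins of 30 deg) and turning-angle
    histogram (18 bins of 10 deg), each as fractions of all angles."""
    d = np.diff(pts, axis=0)
    horiz = np.degrees(np.arctan2(d[:, 1], d[:, 0])) % 360.0  # [0, 360)
    h_hist = np.histogram(horiz, bins=12, range=(0, 360))[0]
    turn = np.diff(horiz)
    turn = (turn + 90.0) % 180.0 - 90.0  # fold into (-90, 90] (assumption)
    t_hist = np.histogram(turn, bins=18, range=(-90, 90))[0]
    return h_hist / max(len(horiz), 1), t_hist / max(len(turn), 1)
```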
<p><strong>4. Elastic Matching (Verification)</strong></p>
<ul>
<li><strong>Distance Function</strong>: Euclidean distance summation between the points of the candidate symbol ($s$) and the partitioned input ($s_p$).
$$
\begin{aligned}
D(s, s_p) = \sum_{j=1}^{n} \sqrt{(x_{s,j} - x_{p,j})^2 + (y_{s,j} - y_{p,j})^2}
\end{aligned}
$$
<em>Note: The paper formula sums the distances; $n$ is the number of points (50).</em></li>
<li><strong>Ranking</strong>: Candidates are re-ranked in ascending order of this elastic distance.</li>
</ul>
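<p>The distance and re-ranking steps are straightforward to sketch, assuming both symbols have already been resampled to the same 50 points (the dictionary-based library and function names are illustrative):</p>

```python
import numpy as np

def elastic_distance(s: np.ndarray, s_p: np.ndarray) -> float:
    """Summed point-to-point Euclidean distance D(s, s_p)."""
    return float(np.sqrt(((s - s_p) ** 2).sum(axis=1)).sum())

def rerank(candidates, library, s_p):
    """Re-rank the SVM's candidate symbols by ascending elastic distance."""
    return sorted(candidates, key=lambda c: elastic_distance(library[c], s_p))
```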
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: Linear Support Vector Machine (SVM) implemented using <strong>LibSVM</strong>.</li>
<li><strong>Symbol Library</strong>: A &ldquo;Chemical Elastic Symbol Library&rdquo; stores the raw stroke point sequences for all 78 supported symbols to enable the elastic matching comparison.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using precision at different ranks (Top-N accuracy).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision@1</strong></td>
          <td><strong>89.7%</strong></td>
          <td>85.1% (SVM)</td>
          <td>Hybrid model reduces error rate significantly over baselines.</td>
      </tr>
      <tr>
          <td><strong>Precision@3</strong></td>
          <td><strong>94.1%</strong></td>
          <td>N/A</td>
          <td>High recall in top 3 allows users to quickly correct errors via UI selection.</td>
      </tr>
      <tr>
          <td><strong>Precision@5</strong></td>
          <td><strong>94.6%</strong></td>
          <td>N/A</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: Apple iPad (iOS platform).</li>
<li><strong>Input</strong>: Touch/Pen-based input recording digital ink (x, y coordinates and pen-up/down events).</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, P., Hui, S. C., &amp; Fu, C. W. (2013). Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition. <em>2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)</em>, 535&ndash;540. <a href="https://doi.org/10.1109/ICIS.2013.6607894">https://doi.org/10.1109/ICIS.2013.6607894</a></p>
<p><strong>Publication</strong>: IEEE ICIS 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{tangOnlineChemicalSymbol2013,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Tang, Peng and Hui, Siu Cheung and Fu, Chi-Wing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2013</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{535--540}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICIS.2013.6607894}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Ring Recognition with Neural Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</guid><description>A two-phase Classifier-Recognizer neural network pipeline for recognizing 23 types of handwritten heterocyclic chemical rings, achieving ~94% accuracy.</description><content:encoded><![CDATA[<h2 id="contribution-recognition-architecture-for-heterocyclic-rings">Contribution: Recognition Architecture for Heterocyclic Rings</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a specific algorithmic architecture (the &ldquo;Classifier-Recognizer Approach&rdquo;) to solve a pattern recognition problem. The rhetorical structure centers on defining three variations of a method, performing ablation-like comparisons between them (Whole Image vs. Lower Part), and demonstrating superior performance metrics (~94% accuracy) for the proposed technique.</p>
<h2 id="motivation-enabling-sketch-based-chemical-search">Motivation: Enabling Sketch-Based Chemical Search</h2>
<p>The authors identify a gap in existing OCR and handwriting recognition research, which typically focuses on alphanumeric characters or whole words.</p>
<ul>
<li><strong>Missing Capability</strong>: Recognition of specific <em>heterocyclic chemical rings</em> (23 types) had not been performed previously.</li>
<li><strong>Practical Utility</strong>: Existing chemical search engines require text-based queries (names); this work enables &ldquo;backward&rdquo; search where a user can draw a ring to find its information.</li>
<li><strong>Educational/Professional Aid</strong>: Useful for chemistry departments and mobile applications where chemists can sketch formulas on screens.</li>
</ul>
<h2 id="innovation-the-classifier-recognizer-pipeline">Innovation: The Classifier-Recognizer Pipeline</h2>
<p>The core novelty is the <strong>two-phase &ldquo;Classifier-Recognizer&rdquo; architecture</strong> designed to handle the visual similarity of heterocyclic rings:</p>
<ol>
<li><strong>Phase 1 (Classifier)</strong>: A neural network classifies the ring into one of four broad categories (S, N, O, Others) based solely on the <em>upper part</em> of the image (40x15 pixels).</li>
<li><strong>Phase 2 (Recognizer)</strong>: A class-specific neural network identifies the exact ring.</li>
<li><strong>Optimization</strong>: The most successful variation (&ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo;) uses only the <em>lower part</em> of the image and <em>odd rows</em> (half-grid) to reduce input dimensionality and computation time while improving accuracy. This effectively subsamples the input grid matrix $M \in \mathbb{R}^{H \times W}$ to a reduced matrix $M_{\text{sub}}$:
$$ M_{\text{sub}} = \{ m_{i,j} \in M \mid i \text{ is odd} \} $$</li>
</ol>
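<p>The odd-row subsampling is a one-line array operation. A sketch (which rows count as &ldquo;odd&rdquo; depends on whether the paper indexes from 0 or 1; this version assumes 1-based indexing, i.e. it keeps rows 1, 3, 5, &hellip;):</p>

```python
import numpy as np

def half_grid(image: np.ndarray) -> np.ndarray:
    """Keep only the odd rows of the 40x40 grid, halving the network
    input from 1600 to 800 values."""
    return image[::2]  # 0-based even indices == 1-based odd rows
```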
<h2 id="failed-preliminary-approaches">Failed Preliminary Approaches</h2>
<p>Before arriving at the Classifier-Recognizer architecture, the authors tried three simpler methods that all failed:</p>
<ol>
<li><strong>Ordinary NN</strong>: A single neural network with 1600 inputs (40x40 grid), 1600 hidden units, and 23 outputs. This standard approach achieved only 7% accuracy.</li>
<li><strong>Row/Column pixel counts</strong>: Using the number of black pixels per row and per column as features ($N_c + N_r$ inputs), which dramatically reduced dimensionality. This performed even worse, below 1% accuracy.</li>
<li><strong>Midline crossing count</strong>: Drawing a horizontal midline and counting the number of line crossings. This failed because the crossing count varies between writers for the same ring.</li>
</ol>
<p>These failures motivated the two-phase Classifier-Recognizer design.</p>
<h2 id="experimental-setup-and-network-variations">Experimental Setup and Network Variations</h2>
<p>The authors conducted a comparative study of three methodological variations:</p>
<ol>
<li><strong>Whole Image Recognizer</strong>: Uses the full image.</li>
<li><strong>Whole Image (Half Size Grid)</strong>: Uses only odd rows ($20 \times 40$ pixels).</li>
<li><strong>Lower Part (Half Size Grid)</strong>: Uses the lower part of the image with odd rows (the proposed method).</li>
</ol>
<p><strong>Setup</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 23 types of heterocyclic rings.</li>
<li><strong>Training</strong>: 1500 samples (distributed across S, N, O, and Others classes).</li>
<li><strong>Testing</strong>: 1150 samples.</li>
<li><strong>Metric</strong>: Recognition accuracy (Performance %) and Error %.</li>
</ul>
<h2 id="results-high-accuracy-via-dimension-reduction">Results: High Accuracy via Dimension Reduction</h2>
<ul>
<li><strong>Superior Method</strong>: The &ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo; achieved the best performance (~94% overall).</li>
<li><strong>High Classifier Accuracy</strong>: The first phase (classification into S/N/O/Other) achieves 100% accuracy for class S, 98.67% for O, 97.75% for N, and 97.67% for Others (Table 3).</li>
<li><strong>Class &lsquo;Others&rsquo; Difficulty</strong>: The &lsquo;Others&rsquo; class showed lower performance (~90-93%) compared to S/N/O due to the higher complexity and similarity of rings in that category.</li>
<li><strong>Efficiency</strong>: The half-grid approach reduced training time from ~53 hours (Whole Image) to ~35 hours (Lower Part Half Size Grid) while improving accuracy from 87% to 94%.</li>
</ul>
<p><strong>Training/Testing comparison across the three Classifier-Recognizer variations (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Method</th>
          <th style="text-align: left">Hidden Nodes</th>
          <th style="text-align: left">Iterations</th>
          <th style="text-align: left">Training Time (hrs)</th>
          <th style="text-align: left">Error</th>
          <th style="text-align: left">Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Whole Image</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~53</td>
          <td style="text-align: left">13.0%</td>
          <td style="text-align: left">87.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Whole Image (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~41</td>
          <td style="text-align: left">9.0%</td>
          <td style="text-align: left">91.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Lower Part (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~35</td>
          <td style="text-align: left">6.0%</td>
          <td style="text-align: left">94.0%</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset consists of handwritten samples of 23 specific heterocyclic rings.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1500 samples</td>
          <td style="text-align: left">Split: 300 (S), 400 (N), 400 (O), 400 (Others)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1150 samples</td>
          <td style="text-align: left">Split: 150 (S), 300 (O), 400 (N), 300 (Others)</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing Steps</strong>:</p>
<ol>
<li><strong>Monochrome Conversion</strong>: Convert image to monochrome bitmap.</li>
<li><strong>Grid Scaling</strong>: Convert drawing area (regardless of original size) to a fixed <strong>40x40</strong> grid.</li>
<li><strong>Bounding</strong>: Scale the ring shape itself to fit the 40x40 grid.</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<p><strong>The &ldquo;Lower Part with Half Size&rdquo; Pipeline</strong>:</p>
<ol>
<li><strong>Cut Point</strong>: A horizontal midline is defined; the algorithm separates the &ldquo;Upper Part&rdquo; and &ldquo;Lower Part&rdquo;.</li>
<li><strong>Phase 1 Input</strong>: The <strong>Upper Part</strong> (rows 0-15 approx, scaled) is fed to the Classifier NN to determine the class (S, N, O, or Others).</li>
<li><strong>Phase 2 Input</strong>:
<ul>
<li>For classes <strong>S, N, O</strong>: The <strong>Lower Part</strong> of the image is used.</li>
<li>For class <strong>Others</strong>: The <strong>Whole Ring</strong> is used.</li>
</ul>
</li>
<li><strong>Dimensionality Reduction</strong>: For the recognizer networks, only <strong>odd rows</strong> are used (effectively a 20x40 input grid) to reduce inputs from 1600 to 800.</li>
</ol>
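<p>The routing logic of the two-phase pipeline can be sketched as below. The paper is ambiguous about exact input sizes after combining region selection with the half-grid, so this sketch simply halves the rows of whatever region phase 2 receives; <code>classifier</code> and <code>recognizers</code> are stand-ins for the trained networks:</p>

```python
import numpy as np

def recognize(image: np.ndarray, classifier, recognizers):
    """Phase 1 classifies from the upper part; phase 2 uses the lower part
    (or the whole ring for 'Others'), subsampled to odd rows."""
    upper, lower = image[:20], image[20:]   # split the 40x40 grid at the midline
    ring_class = classifier(upper)          # phase 1: S, N, O, or Others
    region = image if ring_class == "Others" else lower
    return ring_class, recognizers[ring_class](region[::2].ravel())  # phase 2
```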
<h3 id="models">Models</h3>
<p>The system uses multiple distinct feed-forward neural networks (backpropagation training is implied by the reported &ldquo;training&rdquo; iterations, though the paper never names the algorithm explicitly):</p>
<ul>
<li><strong>Structure</strong>: 1 Classifier NN + 4 Recognizer NNs (one for each class).</li>
<li><strong>Hidden Layers</strong>: The preliminary &ldquo;ordinary method&rdquo; experiment used 1600 hidden units. The Classifier-Recognizer methods all used 50 hidden nodes per Table 2. The paper also notes that the ordinary approach tried various hidden layer sizes.</li>
<li><strong>Input Nodes</strong>:
<ul>
<li>Standard: 1600 (40x40).</li>
<li>Optimized: ~800 (20x40 via half-grid).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Classifier Phase Testing Results (Table 3)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">150</td>
          <td style="text-align: left"><strong>100.00%</strong></td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">296</td>
          <td style="text-align: left"><strong>98.67%</strong></td>
          <td style="text-align: left">1.33%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">391</td>
          <td style="text-align: left"><strong>97.75%</strong></td>
          <td style="text-align: left">2.25%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">293</td>
          <td style="text-align: left"><strong>97.67%</strong></td>
          <td style="text-align: left">2.33%</td>
      </tr>
  </tbody>
</table>
<p><strong>Recognizer Phase Testing Results (Lower Part Image Recognizer with Half Size Grid, Table 4)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">147</td>
          <td style="text-align: left"><strong>98.00%</strong></td>
          <td style="text-align: left">2.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">289</td>
          <td style="text-align: left"><strong>96.33%</strong></td>
          <td style="text-align: left">3.67%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">386</td>
          <td style="text-align: left"><strong>96.50%</strong></td>
          <td style="text-align: left">3.50%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">279</td>
          <td style="text-align: left"><strong>93.00%</strong></td>
          <td style="text-align: left">7.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Overall</strong></td>
          <td style="text-align: left"><strong>1150</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
          <td style="text-align: left"><strong>~94.0%</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>No source code, trained models, or datasets were released with this paper. The handwritten ring samples were collected by the authors, and the software described (a desktop application) is not publicly available. The neural network architecture details (50 hidden nodes, 1000 iterations) and preprocessing pipeline are described in sufficient detail for reimplementation, but reproducing results would require collecting a new handwritten dataset of heterocyclic rings.</p>
<p><strong>Status</strong>: Closed (no public code, data, or models).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hewahi, N., Nounou, M. N., Nassar, M. S., Abu-Hamad, M. I., &amp; Abu-Hamad, H. I. (2008). Chemical Ring Handwritten Recognition Based on Neural Networks. <em>Ubiquitous Computing and Communication Journal</em>, 3(3).</p>
<p><strong>Publication</strong>: Ubiquitous Computing and Communication Journal 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{hewahiCHEMICALRINGHANDWRITTEN2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CHEMICAL RING HANDWRITTEN RECOGNITION BASED ON NEURAL NETWORKS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hewahi, Nabil and Nounou, Mohamed N and Nassar, Mohamed S and Abu-Hamad, Mohamed I and Abu-Hamad, Husam I}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2008}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Ubiquitous Computing and Communication Journal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Deep Learning for Molecular Structure Extraction (2019)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</guid><description>An end-to-end deep learning approach using U-Net segmentation and a CNN encoder with GridLSTM decoder to predict chemical structures from document images.</description><content:encoded><![CDATA[<h2 id="contribution-type-method-and-resource">Contribution Type: Method and Resource</h2>
<p>This is primarily a <strong>methodological</strong> paper with a secondary <strong>resource</strong> contribution.</p>
<p><strong>Method</strong>: It proposes a novel end-to-end deep learning architecture (Segmentation U-Net + Recognition Encoder-Decoder) to replace traditional rule-based optical chemical structure recognition (OCSR) systems.</p>
<p><strong>Resource</strong>: It details a pipeline for generating large-scale synthetic datasets (images overlaying patent/journal backgrounds) necessary to train the deep learning models.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Existing tools for extracting chemical structures from literature (e.g., OSRA, CLIDE) rely on complex, handcrafted rules and heuristics (edge detection, vectorization). These systems suffer from:</p>
<ol>
<li><strong>Brittleness</strong>: They fail when image quality is low (low resolution, noise) or when artistic styles vary (wavy bonds, crossing lines).</li>
<li><strong>Maintenance difficulty</strong>: Improvements require manual codification of new rules for every edge case, which is difficult to scale.</li>
<li><strong>Data volume</strong>: The explosion of published life science papers (2000+ per day in Medline) creates a curation workload that manual effort cannot keep pace with, demanding automated, robust extraction tools.</li>
</ol>
<h2 id="core-innovation-end-to-end-pixel-to-smiles-recognition">Core Innovation: End-to-End Pixel-to-SMILES Recognition</h2>
<p>The authors present an <strong>end-to-end deep learning approach</strong> for this task that operates directly on raw pixels without explicit subcomponent recognition (e.g., detecting atoms and bonds separately). Key innovations include:</p>
<ol>
<li><strong>Pixel-to-SMILES</strong>: Treating structure recognition as an image captioning problem using an encoder-decoder architecture with attention, generating SMILES directly.</li>
<li><strong>Low-Resolution Robustness</strong>: The model is trained on aggressively downsampled images (~60 dpi for segmentation, 256x256 for prediction), making it robust to poor quality and noisy inputs from legacy PDF extractions.</li>
<li><strong>Implicit Superatom Handling</strong>: The model learns to recognize and generate sequences for superatoms (e.g., &ldquo;OTBS&rdquo;) contextually.</li>
</ol>
<h2 id="experimental-setup-and-large-scale-synthetic-data">Experimental Setup and Large-Scale Synthetic Data</h2>
<p>The authors validated their approach using a mix of large-scale synthetic training sets and real-world test sets:</p>
<ol>
<li><strong>Synthetic Generation</strong>: They created a segmentation dataset by overlaying USPTO molecules onto &ldquo;whited-out&rdquo; journal pages.</li>
<li><strong>Ablation/Training</strong>: Metrics were tracked on Indigo (synthetic) and USPTO (real patent images) datasets.</li>
<li><strong>External Validation</strong>:
<ul>
<li><strong>Valko Dataset</strong>: A standard benchmark of 454 heterogeneous images from literature.</li>
<li><strong>Proprietary Dataset</strong>: A collection of images from 47 articles and 5 patents to simulate real-world drug discovery curation.</li>
</ul>
</li>
<li><strong>Stress Testing</strong>: They analyzed performance distributions across molecular weight, heavy atom count, and rare elements (e.g., Uranium, Vanadium).</li>
</ol>
<h2 id="results-and-limitations-in-complex-structures">Results and Limitations in Complex Structures</h2>
<ul>
<li><strong>High Accuracy on Standard Sets</strong>: The model achieved <strong>82% accuracy</strong> on the Indigo validation set and <strong>77%</strong> on the USPTO validation set. No apparent overfitting was observed on the Indigo data (57M training examples), though some overfitting occurred on the smaller USPTO set (1.7M training examples).</li>
<li><strong>Real-World Viability</strong>: It achieved <strong>83% accuracy</strong> on the proprietary internal test set, with validation and proprietary accuracies ranging from 77-83%, indicating the training sets reasonably approximate real drug discovery data.</li>
<li><strong>Segmentation Quality</strong>: Low segmentation error rates were observed: only 3.3% of the Valko dataset and 6.6% of the proprietary images failed to segment properly.</li>
<li><strong>Limitations on Complexity</strong>: Performance dropped to <strong>41% on the Valko test set</strong>. Superatoms were the single largest contributor to prediction errors, with 21% of Valko samples containing one or more incorrectly predicted superatoms. Only 6.6% of total training images contained any superatom, limiting the model&rsquo;s exposure.</li>
<li><strong>Stereochemistry Challenges</strong>: 60% of compounds with incorrectly predicted stereochemistry had explicit stereochemistry in both the ground truth and the prediction, but with wrong configurations assigned (e.g., predicting R instead of S). The model often correctly identified which atoms have stereocenters but assigned the wrong direction, suggesting the architecture may not incorporate sufficient spatial context for configuration assignment.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized three primary sources for generating training data. All inputs were strictly downsampled to improve robustness.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>Indigo Set</strong></td>
          <td>57M</td>
          <td>PubChem molecules rendered via Indigo (256x256).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO Set</strong></td>
          <td>1.7M</td>
          <td>Image/SMILES pairs from public patent data.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>OS X Indigo</strong></td>
          <td>10M</td>
          <td>Additional Indigo renders from Mac OS for style diversity.</td>
      </tr>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td><strong>Synthetic Pages</strong></td>
          <td>N/A</td>
          <td>Generated by overlaying USPTO images on text-cleared PDF pages.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Segmentation Inputs</strong>: Grayscale, downsampled to ~60 dpi.</li>
<li><strong>Prediction Inputs</strong>: Resized to 256x256 such that bond lengths are approximately 3-12 pixels.</li>
<li><strong>Augmentation</strong>: Random affine transforms, brightness scaling, and binarization applied during training.</li>
</ul>
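<p>A minimal sketch of the augmentation step described above, assuming grayscale uint8 images; the transform parameters and thresholds are illustrative choices, not the authors' published values, and the random affine transform is omitted for brevity:</p>

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Illustrative training-time augmentation (brightness scaling and
    random binarization). Parameter ranges are assumptions, not the
    paper's values; the affine transform is omitted."""
    out = img.astype(np.float32)
    # Random brightness scaling.
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 255.0)
    # Random binarization: snap to pure black/white half the time.
    if rng.random() < 0.5:
        out = np.where(out > 128, 255.0, 0.0)
    return out.astype(np.uint8)
```

<p>Applying several such randomized transforms per training example is what lets the model tolerate the noisy, low-resolution inputs of legacy PDF extractions.</p>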
<h3 id="algorithms">Algorithms</h3>
<p><strong>Segmentation Pipeline</strong>:</p>
<ul>
<li><strong>Multi-scale Inference</strong>: Masks generated at resolutions from 30 to 60 dpi (3 dpi increments) and averaged for the final mask.</li>
<li><strong>Post-processing</strong>: Hough transform used to remove long straight lines (table borders). Mask blobs filtered by pixel count thresholds.</li>
</ul>
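<p>The multi-scale averaging step can be sketched as follows; <code>predict_mask</code> is a hypothetical stand-in for rescaling the page to a given dpi, running the U-Net, and resampling the mask back to the original shape (the paper's actual inference code is not public):</p>

```python
import numpy as np

def multiscale_mask(image, predict_mask, dpis=range(30, 61, 3)):
    """Average segmentation masks predicted at 30-60 dpi in 3 dpi steps,
    then threshold, mirroring the multi-scale inference described in the
    paper. `predict_mask(image, dpi)` must return a float mask in [0, 1]
    already resampled to `image`'s shape (assumed interface)."""
    masks = [predict_mask(image, dpi) for dpi in dpis]
    avg = np.mean(masks, axis=0)
    return avg > 0.5  # final binary mask
```

<p>Averaging over resolutions makes the final mask less sensitive to any single scale at which a structure happens to segment poorly.</p>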
<p><strong>Prediction Pipeline</strong>:</p>
<ul>
<li><strong>Sequence Generation</strong>: SMILES generated character-by-character via greedy decoding. During inference, predictions are made at several low resolutions and the sequence with the highest confidence (product of per-character softmax outputs) is returned.</li>
<li><strong>Attention-based Verification</strong>: Attention weights used to re-project predicted atoms back into 2D space to visually verify alignment with the input image.</li>
</ul>
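<p>The confidence-based selection over multiple resolutions can be sketched as below; the data structures are illustrative assumptions, but the scoring rule (product of per-character softmax probabilities, computed in log space for stability) matches the description above:</p>

```python
import math

def sequence_confidence(char_probs):
    """Confidence of a greedily decoded string: the product of the
    per-character softmax probabilities (log-space for stability)."""
    return math.exp(sum(math.log(p) for p in char_probs))

def best_prediction(candidates):
    """Among decodes produced at several input resolutions, return the
    sequence with the highest confidence. `candidates` maps each decoded
    SMILES string to its per-character probabilities (assumed layout)."""
    return max(candidates, key=lambda s: sequence_confidence(candidates[s]))
```

<p>For example, a decode with per-character probabilities of 0.9 throughout beats one that starts with an uncertain 0.5, even if the rest of its characters are confident.</p>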
<h3 id="models">Models</h3>
<p><strong>1. Segmentation Model (U-Net Variant)</strong>:</p>
<ul>
<li><strong>Architecture</strong>: U-Net style with skip connections.</li>
<li><strong>Input</strong>: 128x128x1 grayscale image.</li>
<li><strong>Layers</strong>: Alternating 3x3 Conv and 2x2 Max Pool.</li>
<li><strong>Activation</strong>: Parametric ReLU (pReLU).</li>
<li><strong>Parameters</strong>: ~380,000.</li>
</ul>
<p><strong>2. Structure Prediction Model (Encoder-Decoder)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: CNN with 5x5 convolutions, 2x2 Max Pooling, pReLU. No pooling in first layers to preserve fine features.</li>
<li><strong>Decoder</strong>: 3 layers of <strong>GridLSTM</strong> cells.</li>
<li><strong>Attention</strong>: Soft/Global attention mechanism conditioned on the encoder state.</li>
<li><strong>Input</strong>: 256x256x1 image.</li>
<li><strong>Output</strong>: Sequence of characters (vocab size 65).</li>
<li><strong>Parameters</strong>: ~46.3 million.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation required an exact string match of the Canonical SMILES (including stereochemistry) to the ground truth.</p>
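<p>A minimal sketch of this metric; the <code>canonicalize</code> hook is where a toolkit such as RDKit (e.g. <code>Chem.MolToSmiles(Chem.MolFromSmiles(s))</code>) would normalize both strings before comparison, though the paper does not specify its exact tooling:</p>

```python
def exact_match_accuracy(pairs, canonicalize=lambda s: s):
    """Exact-string-match accuracy over (prediction, ground-truth) pairs
    of canonical SMILES, the evaluation criterion used in the paper.
    `canonicalize` is a hypothetical hook for toolkit normalization."""
    if not pairs:
        return 0.0
    hits = sum(canonicalize(p) == canonicalize(t) for p, t in pairs)
    return hits / len(pairs)
```

<p>Because the match must include stereochemistry, a prediction with one flipped stereocenter scores zero under this metric, which is why the stereochemistry errors discussed above are so costly.</p>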
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Dataset</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td><strong>82%</strong></td>
          <td>Indigo Val</td>
          <td>Synthetic validation set</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>77%</strong></td>
          <td>USPTO Val</td>
          <td>Real patent images</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>83%</strong></td>
          <td>Proprietary</td>
          <td>Internal pharma dataset (real world)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>41%</strong></td>
          <td>Valko Test</td>
          <td>External benchmark; difficult due to superatoms</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Segmentation Training</strong>: 1 GPU, ~4 days (650k steps).</li>
<li><strong>Prediction Training</strong>: 8 NVIDIA Pascal GPUs, ~26 days (1M steps).</li>
<li><strong>Framework</strong>: TensorFlow.</li>
<li><strong>Optimizer</strong>: Adam.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code, pre-trained models, or generated datasets were released with this paper. The training pipeline relies on publicly available molecular databases (PubChem, USPTO) and open-source rendering tools (Indigo), but the specific training sets, model weights, and inference code remain unavailable.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Staker, J., Marshall, K., Abel, R., &amp; McQuaw, C. (2019). Molecular Structure Extraction From Documents Using Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 59(3), 1017-1029. <a href="https://doi.org/10.1021/acs.jcim.8b00669">https://doi.org/10.1021/acs.jcim.8b00669</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.schrodinger.com/publications/">Schrödinger Publication Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{stakerMolecularStructureExtraction2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Structure Extraction From Documents Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Staker, Joshua and Marshall, Kyle and Abel, Robert and McQuaw, Carolyn}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{feb}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{59}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1017--1029}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.8b00669}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1021/acs.jcim.8b00669}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER: Deep Learning for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/</guid><description>Deep learning method for optical chemical structure recognition using image captioning networks trained on millions of synthetic molecular images.</description><content:encoded><![CDATA[<h2 id="contribution-method-for-optical-chemical-entity-recognition">Contribution: Method for Optical Chemical Entity Recognition</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper with a strong <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (DECIMER) that repurposes &ldquo;show-and-tell&rdquo; image captioning networks for Optical Chemical Entity Recognition (OCER), providing an alternative to traditional rule-based segmentation pipelines.</li>
<li><strong>Resource</strong>: It establishes a framework for generating large-scale synthetic training data using open-source cheminformatics tools (CDK) and databases (PubChem), circumventing the scarcity of manually annotated chemical images.</li>
</ul>
<h2 id="motivation-brittleness-of-heuristic-pipelines">Motivation: Brittleness of Heuristic Pipelines</h2>
<p>The extraction of chemical structures from scientific literature (OCER) is critical for populating open-access databases. Traditional OCER systems (like OSRA or CLiDE) rely on complex multi-step pipelines involving vectorization, character recognition, and graph compilation. These systems are brittle, and incorporating new structural features requires laborious engineering. Inspired by the success of deep neural network approaches like AlphaGo Zero, the authors sought to formulate an end-to-end deep learning approach that learns directly from data with minimal prior assumptions.</p>
<h2 id="novelty-image-captioning-for-molecular-graphs">Novelty: Image Captioning for Molecular Graphs</h2>
<ul>
<li><strong>Image-to-Text Formulation</strong>: The paper frames chemical structure recognition as an image captioning problem, translating a bitmap image directly into a SMILES string using an encoder-decoder network. This bypasses explicit segmentation of atoms and bonds entirely.</li>
<li><strong>Synthetic Data Strategy</strong>: The authors generate synthetic images from PubChem using the CDK Structure Diagram Generator, scaling the dataset size to 15 million.</li>
<li><strong>Robust String Representations</strong>: The study performs key ablation experiments on string representations, comparing standard SMILES against DeepSMILES to evaluate how syntactic validity affects the network&rsquo;s learning capability.</li>
</ul>
<h2 id="experimental-setup-and-validation-strategies">Experimental Setup and Validation Strategies</h2>
<ul>
<li><strong>Data Scaling</strong>: Models were trained on dataset sizes ranging from 54,000 to 15 million synthetic images to observe empirical scaling laws regarding accuracy and compute time.</li>
<li><strong>Representation Comparison</strong>: The authors compared the validity of predicted strings and recognition accuracy when training on SMILES versus DeepSMILES. The cross-entropy loss formulation for sequence generation can be represented as:
$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$
where $\mathbf{x}$ is the image representation and $y_t$ are the tokens of the SMILES/DeepSMILES string.</li>
<li><strong>Metric Evaluation</strong>: Performance was measured using Validity (syntactic correctness) and Tanimoto Similarity $T$, computed on molecular fingerprints to capture partial correctness even if the exact string prediction failed:
$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</li>
</ul>
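<p>The Tanimoto formula above has a direct implementation when fingerprints are represented as sets of on-bit indices; this sketch uses that representation for clarity, whereas the paper computes it on PubChem bit-vector fingerprints:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as
    collections of on-bit indices:
        T(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)
    Two empty fingerprints are treated as identical (similarity 1.0)."""
    a, b = set(fp_a), set(fp_b)
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 1.0
```

<p>Unlike exact-match accuracy, this score rewards predictions that recover most of a molecule's substructure, which is why the authors use it to track partial correctness.</p>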
<h2 id="results-and-critical-conclusions">Results and Critical Conclusions</h2>
<ul>
<li><strong>Data Representation</strong>: DeepSMILES proved superior to standard SMILES for training stability and output validity. Preliminary tests suggested SELFIES performs even better (0.78 Tanimoto vs 0.53 for DeepSMILES at 6M images).</li>
<li><strong>Scaling Behavior</strong>: Accuracy improves linearly with dataset size. The authors extrapolate that near-perfect detection would require training on 50 to 100 million structures.</li>
<li><strong>Current Limitations</strong>: At the reported training scale (up to 15M), the model does not yet rival traditional heuristic approaches, but the learning curve suggests it is a viable trajectory given sufficient compute and data.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is synthetic, generated using the Chemistry Development Kit (CDK) Structure Diagram Generator (SDG) based on molecules from PubChem.</p>
<p><strong>Curation Rules</strong> (applied to PubChem data):</p>
<ul>
<li>Molecular weight &lt; 1500 Daltons.</li>
<li>Elements restricted to: C, H, O, N, P, S, F, Cl, Br, I, Se, B.</li>
<li>No counter ions or charged groups.</li>
<li>No isotopes (e.g., D, T).</li>
<li>Bond count between 5 and 40.</li>
<li>SMILES length &lt; 40 characters.</li>
<li>Implicit hydrogens only (except in functional groups).</li>
</ul>
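<p>The curation rules above can be sketched as a single filter predicate. The dict schema here is an assumption (a stand-in for properties one would compute with CDK or RDKit), and the counter-ion, isotope, and implicit-hydrogen rules are reduced to precomputed boolean/numeric fields:</p>

```python
# Elements permitted by the paper's curation rules.
ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_curation(mol: dict) -> bool:
    """Apply the paper's PubChem curation rules to a molecule record.
    `mol` is a hypothetical dict of precomputed properties, not an
    actual CDK/RDKit object."""
    return (
        mol["mol_weight"] < 1500                      # < 1500 Daltons
        and set(mol["elements"]) <= ALLOWED_ELEMENTS  # allowed elements only
        and mol["formal_charge"] == 0                 # no charged groups
        and not mol["has_isotopes"]                   # no isotopes (D, T)
        and 5 <= mol["num_bonds"] <= 40               # bond count in range
        and len(mol["smiles"]) < 40                   # SMILES length < 40
    )
```

<p>Filtering at the data-generation stage keeps the training distribution tractable, at the cost of excluding charged species and very large molecules from what the model can recognize.</p>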
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Images</strong>: Generated as 299x299 bitmaps to match Inception V3 input requirements.</li>
<li><strong>Augmentation</strong>: One random rotation applied per molecule; no noise or blurring added in this iteration.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem)</td>
          <td>54k - 15M</td>
          <td>Scaled across 12 experiments</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Independent Set</td>
          <td>6k - 1.6M</td>
          <td>10% of training size</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: <code>&quot;Show, Attend and Tell&quot;</code> (Attention-based Image Captioning).</li>
<li><strong>Optimization</strong>: Adam optimizer with learning rate 0.0005.</li>
<li><strong>Loss Function</strong>: Sparse Categorical Crossentropy.</li>
<li><strong>Training Loop</strong>: Trained for 25 epochs per model. Batch size of 640 images.</li>
</ul>
<h3 id="models">Models</h3>
<p>The network is implemented in TensorFlow 2.0.</p>
<ul>
<li><strong>Encoder</strong>: Inception V3 (Convolutional NN), used unaltered. Extracts feature vectors saved as NumPy arrays.</li>
<li><strong>Decoder</strong>: Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) with soft attention mechanism.</li>
<li><strong>Embeddings</strong>: Image embedding dimension size of 600.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is Tanimoto similarity (Jaccard index) on PubChem fingerprints, which is robust for measuring structural similarity even when exact identity is not reached.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td>Percentage of predictions that are chemically identical to ground truth (isomorphic).</td>
      </tr>
      <tr>
          <td><strong>Average Tanimoto</strong></td>
          <td>Mean similarity score across the test set (captures partial correctness).</td>
      </tr>
      <tr>
          <td><strong>Validity</strong></td>
          <td>Percentage of predicted strings that are valid DeepSMILES/SMILES.</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER">DECIMER (Java utilities)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>CDK-based data generation and conversion tools</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image-to-SMILES">DECIMER-Image-to-SMILES</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>TensorFlow training and inference scripts (archived)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source of molecular structures for synthetic training data</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a single node.</p>
<ul>
<li><strong>GPU</strong>: 1x NVIDIA Tesla V100.</li>
<li><strong>CPU</strong>: 2x Intel Xeon Gold 6230.</li>
<li><strong>RAM</strong>: 384 GB.</li>
<li><strong>Compute Time</strong>:
<ul>
<li>Linear scaling with data size.</li>
<li>15 million structures took ~27 days (91,881s per epoch).</li>
<li>Projected time for 100M structures: ~4 months on single GPU.</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2020). DECIMER: towards deep learning for chemical image recognition. <em>Journal of Cheminformatics</em>, 12(1), 65. <a href="https://doi.org/10.1186/s13321-020-00469-w">https://doi.org/10.1186/s13321-020-00469-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER">Official GitHub Repository</a></li>
<li><a href="https://github.com/Kohulan/DECIMER-Image-to-SMILES">DECIMER Image-to-SMILES Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMERDeepLearning2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{DECIMER}}: Towards Deep Learning for Chemical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{DECIMER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{65}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00469-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGrapher: Deep Learning for Chemical Graph OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</guid><description>Deep learning OCSR method using semantic segmentation and classification CNNs to reconstruct chemical graphs with improved stereochemistry.</description><content:encoded><![CDATA[<h2 id="classifying-the-methodology">Classifying the Methodology</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture and a specific graph-reconstruction algorithm to solve the problem of Optical Chemical Structure Recognition (OCSR). It validates this method by comparing it against the existing standard tool (OSRA), demonstrating superior performance on specific technical challenges like stereochemistry.</p>
<h2 id="the-ocr-stereochemistry-challenge">The OCR Stereochemistry Challenge</h2>
<p>Chemical knowledge is frequently locked in static images within scientific publications. Extracting this structure into machine-readable formats (graphs, SMILES) is essential for drug discovery and database querying. Existing tools, such as OSRA, rely on optical character recognition (OCR) and expert systems or hand-coded rules. These tools struggle with bond multiplicity and stereochemical information, often missing atoms or misinterpreting 3D cues (wedges and dashes). A machine learning approach, by contrast, can improve by scaling training data rather than by hand-coding new rules.</p>
<h2 id="decoupled-semantic-segmentation-and-classification-pipeline">Decoupled Semantic Segmentation and Classification Pipeline</h2>
<p>The core novelty is the <strong>segmentation-classification pipeline</strong> which decouples object detection from type assignment:</p>
<ol>
<li><strong>Semantic Segmentation</strong>: The model first predicts pixel-wise maps for atoms, bonds, and charges using a Dense Prediction Convolutional Network built on dilated convolutions.</li>
<li><strong>Graph Building Algorithm</strong>: A specific algorithm iterates over the segmentation maps to generate candidate locations for atoms and bonds.</li>
<li><strong>Refinement via Classification</strong>: Dedicated classification networks take cutouts of the original image combined with the segmentation mask to verify and classify each candidate (e.g., distinguishing a single bond from a double bond, or a wedge from a dash).</li>
</ol>
<p>Additionally, the authors developed a novel method for <strong>synthetic data generation</strong> by modifying the source code of RDKit to output pixel-wise labels during the image drawing process. This solves the lack of labeled training data.</p>
<h2 id="evaluating-synthetics-and-benchmarks">Evaluating Synthetics and Benchmarks</h2>
<ul>
<li><strong>Synthetic Benchmarking</strong>: The authors generated test sets in 3 different stylistic variations. For each style, they tested on both stereo (complex 3D information) and non-stereo compounds.</li>
<li><strong>Baseline Comparison</strong>: They compared the error rates of ChemGrapher against <strong>OSRA</strong> (Optical Structure Recognition Application).</li>
<li><strong>Component-level Evaluation</strong>: They analyzed the F1 scores of the segmentation networks versus the classification networks independently to understand where errors propagated.</li>
<li><strong>Real-world Case Study</strong>: They manually curated 61 images cut from journal articles to test performance on real, non-synthetic data.</li>
</ul>
<h2 id="advancements-over-osra">Advancements Over OSRA</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ChemGrapher consistently achieved lower error rates than OSRA across all synthetic styles, particularly for stereochemical information (wedge and dash bonds).</li>
<li><strong>Component Performance</strong>: The classification networks showed higher F1 scores than the segmentation networks across all prediction types (Figure 4 in the paper). This suggests the two-stage approach allows the classifier to correct segmentation noise.</li>
<li><strong>Real-world Viability</strong>: In the manual case study, ChemGrapher correctly predicted 46 of 61 images, compared to 42 of 61 for OSRA.</li>
<li><strong>Limitations</strong>: The model struggles with thick bond lines in real-world images. Performance is stronger on carbon-only compounds, where no letters appear in the image.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors created a custom synthetic dataset using ChEMBL and RDKit, as no pixel-wise labeled dataset existed.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>ChEMBL</td>
          <td>1.9M</td>
<td>Split into a training pool (1.5M), a validation pool (300K), and test pools (35K each).</td>
      </tr>
      <tr>
          <td><strong>Segmentation Train</strong></td>
          <td>Synthetic</td>
          <td>~114K</td>
          <td>Sampled from ChEMBL pool such that every atom type appears in &gt;1000 compounds.</td>
      </tr>
      <tr>
          <td><strong>Labels</strong></td>
          <td>Pixel-wise</td>
          <td>N/A</td>
          <td>Generated by modifying <strong>RDKit</strong> source code to output label masks (atom type, bond type, charge) during drawing.</td>
      </tr>
      <tr>
          <td><strong>Candidates (Val)</strong></td>
          <td>Cutouts</td>
          <td>~27K (Atom)<br>~55K (Bond)</td>
          <td>Validation candidates generated from ~450 compounds for evaluating the classification networks.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Algorithm 1: Graph Building</strong></p>
<ol>
<li><strong>Segment</strong>: Apply segmentation network $s(x)$ to get maps $S^a$ (atoms), $S^b$ (bonds), $S^c$ (charges).</li>
<li><strong>Atom Candidates</strong>: Identify candidate blobs in $S^a$.</li>
<li><strong>Classify Atoms</strong>: For each candidate, crop the input image and segmentation map. Feed to $c_A$ and $c_C$ to predict Atom Type and Charge. Add to Vertex set $V$ if valid.</li>
<li><strong>Bond Candidates</strong>: Generate all pairs of nodes in $V$ within $2 \times$ bond length distance.</li>
<li><strong>Classify Bonds</strong>: For each pair, create a candidate mask (two rectangles meeting in the middle to encode directionality). Feed to $c_B$ to predict Bond Type (single, double, wedge, etc.). Add to Edge set $E$.</li>
</ol>
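<p>The candidate-generation step (step 4) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors&rsquo; code: it assumes atom candidates are already reduced to 2D center coordinates and that an estimated bond length is available.</p>

```python
import math

def bond_candidates(atoms, bond_length):
    """Step 4 of Algorithm 1 (sketch): pair every two detected atom
    centers that lie within 2x the estimated bond length."""
    pairs = []
    for i, (xi, yi) in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            xj, yj = atoms[j]
            if math.hypot(xi - xj, yi - yj) <= 2 * bond_length:
                pairs.append((i, j))
    return pairs

# Toy example: three atoms on a line, bond length ~10 px.
print(bond_candidates([(0, 0), (10, 0), (30, 0)], 10))  # [(0, 1), (1, 2)]
```

<p>Each surviving pair is then rendered as a directional candidate mask and passed to $c_B$ in step 5.</p>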
<h3 id="models">Models</h3>
<p>The pipeline uses four distinct Convolutional Neural Networks (CNNs).</p>
<p><strong>1. Semantic Segmentation Network ($s$)</strong></p>
<ul>
<li><strong>Architecture</strong>: 8 convolutional layers (3x3) plus a final 1x1 linear layer (Dense Prediction Convolutional Network).</li>
<li><strong>Kernels</strong>: $3 \times 3$ for all convolutional layers; $1 \times 1$ for the final linear layer.</li>
<li><strong>Dilation</strong>: Uses dilated convolutions to expand receptive field without losing resolution. Six of the eight convolutional layers use dilation (factors: 2, 4, 8, 8, 4, 2); the first and last convolutional layers have no dilation.</li>
<li><strong>Input</strong>: Binary B/W image.</li>
<li><strong>Output</strong>: Multi-channel probability maps for Atom Types ($S^a$), Bond Types ($S^b$), and Charges ($S^c$).</li>
</ul>
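<p>For stacked stride-1 convolutions, the receptive field grows as $1 + \sum_i d_i (k - 1)$ over layers with kernel size $k$ and dilation $d_i$. A small helper (written for this note, not from the paper) shows what the dilation schedule above buys:</p>

```python
def receptive_field(kernel, dilations):
    """Receptive field of stacked stride-1 convolutions:
    RF = 1 + sum(d * (k - 1)) over all layers."""
    return 1 + sum(d * (kernel - 1) for d in dilations)

# First and last conv undilated; the middle six dilated 2,4,8,8,4,2.
print(receptive_field(3, [1, 2, 4, 8, 8, 4, 2, 1]))  # 61
```

<p>A 61-pixel receptive field at full output resolution is what lets the network classify each pixel with bond-scale context while keeping the dense prediction maps $S^a$, $S^b$, $S^c$ at input size.</p>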
<p><strong>2. Classification Networks ($c_A, c_B, c_C$)</strong></p>
<ul>
<li><strong>Purpose</strong>: Refines predictions on small image patches.</li>
<li><strong>Architecture</strong>: 5 convolutional layers, followed by a MaxPool layer and a final linear (1x1) layer.
<ul>
<li>Layer 1: <strong>Depthwise separable convolution</strong> (no dilation).</li>
<li>Layers 2-4: Dilated convolutions (factors 2, 4, 8).</li>
<li>Layer 5: Standard convolution (no dilation).</li>
<li>MaxPool: $124 \times 124$.</li>
<li>Final: 1x1 linear layer.</li>
</ul>
</li>
<li><strong>Inputs</strong>:
<ul>
<li>Crop of the binary image ($x^{cut}$).</li>
<li>Crop of the segmentation map ($S^{cut}$).</li>
<li>&ldquo;Highlight&rdquo; mask ($h_L$) indicating the specific candidate location (e.g., a dot for atoms, two rectangles for bonds).</li>
</ul>
</li>
</ul>
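<p>The &ldquo;two rectangles&rdquo; bond highlight can be sketched as a toy rasterizer: pixels near the first half of the segment get one label and pixels near the second half another, so the crop encodes which endpoint is which. The exact encoding in the paper may differ; this is only meant to make the idea concrete.</p>

```python
def bond_highlight(shape, a, b, half_width=2, samples=200):
    """Toy directional highlight mask h_L: label 1 near endpoint a,
    label 2 near endpoint b, zero elsewhere."""
    h, w = shape
    mask = [[0] * w for _ in range(h)]
    (ax, ay), (bx, by) = a, b
    for s in range(samples + 1):
        t = s / samples
        x = round(ax + t * (bx - ax))
        y = round(ay + t * (by - ay))
        label = 1 if t < 0.5 else 2
        for dy in range(-half_width, half_width + 1):
            for dx in range(-half_width, half_width + 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    mask[yy][xx] = label
    return mask

mask = bond_highlight((16, 16), (2, 8), (13, 8))
print(mask[8][3], mask[8][12])  # 1 2
```

<p>Stacking $x^{cut}$, $S^{cut}$, and $h_L$ as input channels is what lets one classifier architecture serve atoms, bonds, and charges alike.</p>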
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: <strong>F1 Score</strong> for individual network performance (segmentation pixels and classification accuracy).</li>
<li><strong>Metric</strong>: <strong>Error Rate</strong> (percentage of incorrect graphs) for overall system. A graph is &ldquo;incorrect&rdquo; if there is at least one mistake in atoms or bonds.</li>
<li><strong>Baselines</strong>: Compared against <strong>OSRA</strong>.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Training and inference performed on a single <strong>NVIDIA Titan Xp</strong> (donated by NVIDIA).</li>
</ul>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Closed.</strong> The authors did not release source code, pre-trained models, or the synthetic dataset. The data generation pipeline requires modifications to RDKit&rsquo;s internal drawing code, which are not publicly available. The ChEMBL source compounds are public, but the pixel-wise labeling procedure cannot be reproduced without the modified RDKit code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., Arany, Á., Moreau, Y., &amp; Simm, J. (2020). ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 60(10), 4506-4517. <a href="https://doi.org/10.1021/acs.jcim.0c00459">https://doi.org/10.1021/acs.jcim.0c00459</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2020 (arXiv preprint Feb 2020)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2002.09914">arXiv Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oldenhof2020chemgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Oldenhof, Martijn and Arany, Ádám and Moreau, Yves and Simm, Jaak}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4506--4517}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c00459}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>A Review of Optical Chemical Structure Recognition Tools</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</guid><description>Comprehensive review and benchmarking of 30 years of Optical Chemical Structure Recognition (OCSR) methods and tools.</description><content:encoded><![CDATA[<h2 id="systematization-and-benchmarking-of-ocsr">Systematization and Benchmarking of OCSR</h2>
<p>This is primarily a <strong>Systematization</strong> paper ($0.7 \Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($0.3 \Psi_{\text{Resource}}$).</p>
<p>It serves as a <strong>Systematization</strong> because it organizes nearly three decades of research in Optical Chemical Structure Recognition (OCSR), categorizing methods into rule-based systems (e.g., Kekulé, CLiDE, OSRA) and emerging machine-learning approaches (e.g., MSE-DUDL, Chemgrapher). It synthesizes information on 16 distinct tools, many of which are commercial or no longer available.</p>
<p>It acts as a <strong>Resource</strong> by defining a benchmark for the field. The authors evaluate the three available open-source tools (Imago, MolVec, OSRA) against four distinct datasets to establish baseline performance metrics for accuracy and speed.</p>
<h2 id="motivation-digitizing-legacy-chemical-literature">Motivation: Digitizing Legacy Chemical Literature</h2>
<p>A vast amount of chemical knowledge remains &ldquo;hidden&rdquo; in the primary scientific literature (printed or PDF), conveyed as 2D images. Because these depictions are not machine-readable, there is a &ldquo;backlog of decades of chemical literature&rdquo; that cannot be easily indexed or searched in open-access databases.</p>
<p>While Chemical Named Entity Recognition (NER) exists for text, translating graphical depictions into formats like SMILES or SDfiles requires specialized OCSR tools. The motivation is to enable the automated curation of this legacy data to feed public databases.</p>
<h2 id="core-innovations-historical-taxonomy-and-open-standards">Core Innovations: Historical Taxonomy and Open Standards</h2>
<p>The primary novelty is the comprehensive aggregation of the history of the field, which had not been thoroughly reviewed recently. It details the algorithmic evolution from the first work in 1990 to deep learning methods in 2019.</p>
<p>Specific contributions include:</p>
<ul>
<li><strong>Historical Taxonomy</strong>: Classification of tools into rule-based vs. machine-learning, and open-source vs. commercial/unavailable.</li>
<li><strong>Open Source Benchmark</strong>: A comparative performance analysis of the only three open-source tools available at the time (Imago, MolVec, OSRA) on standardized datasets.</li>
<li><strong>Algorithmic Breakdown</strong>: Detailed summaries of the workflows for closed-source or lost tools (e.g., Kekulé, OROCS, ChemReader) based on their original publications.</li>
</ul>
<h2 id="benchmarking-methodology-and-open-source-evaluation">Benchmarking Methodology and Open-Source Evaluation</h2>
<p>The authors performed a benchmark study to evaluate the accuracy and speed of three open-source OCSR tools: <strong>MolVec (0.9.7)</strong>, <strong>Imago (2.0)</strong>, and <strong>OSRA (2.1.0)</strong>.</p>
<p>They tested these tools on four datasets of varying quality and origin:</p>
<ol>
<li><strong>USPTO</strong>: 5,719 images from US patents (high quality).</li>
<li><strong>UOB</strong>: 5,740 images from the University of Birmingham, published alongside MolRec.</li>
<li><strong>CLEF 2012</strong>: 961 images from the CLEF-IP evaluation (well-segmented, clean).</li>
<li><strong>JPO</strong>: 450 images from Japanese patents (low quality, noise, Japanese characters).</li>
</ol>
<p>Evaluation metrics were:</p>
<ul>
<li><strong>Accuracy</strong>: Percentage of perfectly recognized structures, defined as $\text{Accuracy} = \frac{\text{Correct InChI Matches}}{\text{Total Images}}$. Each tool&rsquo;s output was converted to an <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> string and required to match the reference InChI exactly.</li>
<li><strong>Speed</strong>: Total processing time for the dataset.</li>
</ul>
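<p>The accuracy metric reduces to exact string comparison once both sides are canonicalized to InChI. A minimal sketch (with hypothetical predictions, and assuming the InChI conversion has already happened):</p>

```python
def exact_match_accuracy(predicted, reference):
    """Fraction of images whose generated InChI string exactly
    equals the reference InChI; any deviation counts as a failure."""
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

pred = ["InChI=1S/CH4/h1H4", "InChI=1S/H2O/h1H2", "InChI=1S/CO2/c2-1-3"]
ref  = ["InChI=1S/CH4/h1H4", "InChI=1S/H2O/h1H2", "InChI=1S/CO/c1-2"]
print(exact_match_accuracy(pred, ref))  # 2/3
```

<p>Note how unforgiving this is: a single wrong charge or stereo flag anywhere in the structure flips the whole image to incorrect, which partly explains the low JPO numbers below.</p>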
<h2 id="results-and-general-conclusions">Results and General Conclusions</h2>
<p><strong>Benchmark Results (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>MolVec 0.9.7</th>
          <th>Imago 2.0</th>
          <th>OSRA 2.1.0</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO (5,719 images)</td>
          <td>Time (min)</td>
          <td>28.65</td>
          <td>72.83</td>
          <td>145.04</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.41%</td>
          <td>87.20%</td>
          <td>87.69%</td>
      </tr>
      <tr>
          <td>UOB (5,740 images)</td>
          <td>Time (min)</td>
          <td>28.42</td>
          <td>152.52</td>
          <td>125.78</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.39%</td>
          <td>63.54%</td>
          <td>86.50%</td>
      </tr>
      <tr>
          <td>CLEF 2012 (961 images)</td>
          <td>Time (min)</td>
          <td>4.41</td>
          <td>16.03</td>
          <td>21.33</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>80.96%</td>
          <td>65.45%</td>
          <td>94.90%</td>
      </tr>
      <tr>
          <td>JPO (450 images)</td>
          <td>Time (min)</td>
          <td>7.50</td>
          <td>22.55</td>
          <td>16.68</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>66.67%</td>
          <td>40.00%</td>
          <td>57.78%</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Observations</strong>:</p>
<ul>
<li><strong>MolVec</strong> was the fastest tool, processing datasets significantly quicker than competitors (e.g., 28.65 min for USPTO vs. 145.04 min for OSRA).</li>
<li><strong>OSRA</strong> performed exceptionally well on clean, well-segmented data (94.90% on CLEF 2012) but was slower.</li>
<li><strong>Imago</strong> generally lagged in accuracy compared to the other two, particularly on the UOB dataset (63.54% vs. 88.39% for MolVec and 86.50% for OSRA).</li>
<li><strong>JPO Difficulty</strong>: All tools struggled with the noisy Japanese Patent Office dataset (accuracies ranged from 40.00% to 66.67%), highlighting issues with noise and non-standard labels.</li>
</ul>
<p><strong>General Conclusions</strong>:</p>
<ul>
<li>No &ldquo;gold standard&rdquo; tool existed (as of 2020) that solved all problems (page segmentation, R-groups, NLP integration).</li>
<li>Rule-based approaches dominate the history of the field, but deep learning methods (MSE-DUDL, Chemgrapher) were emerging, though they were closed-source at the time of writing.</li>
<li>There was a critical need for tools that could handle full-page recognition (combining segmentation and recognition).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The authors provided sufficient detail to replicate the benchmarking study.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/OCSR_Review">OCSR_Review (GitHub)</a></td>
          <td>Code / Data</td>
          <td>MIT</td>
          <td>Benchmark images (PNG, 72 dpi) and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/p/osra/wiki/Download/">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.1.0 tested; precompiled binaries are commercial</td>
      </tr>
      <tr>
          <td><a href="https://lifescience.opensource.epam.com/download/imago.html">Imago</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.0 tested; no longer actively developed</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ncats/molvec">MolVec</a></td>
          <td>Code</td>
          <td>LGPL-2.1</td>
          <td>Version 0.9.7 tested; Java-based standalone tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used four public datasets. Images were converted to PNG (72 dpi) to ensure compatibility across all tools.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
          <th>Characteristics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>5,719</td>
          <td>OSRA Validation Set</td>
          <td>US Patent images, generally clean.</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>5,740</td>
          <td>Univ. of Birmingham</td>
          <td>Published alongside MolRec.</td>
      </tr>
      <tr>
          <td><strong>CLEF 2012</strong></td>
          <td>961</td>
          <td>CLEF-IP 2012</td>
          <td>Well-segmented, high quality.</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>450</td>
          <td>Japanese Patent Office</td>
          <td>Low quality, noisy, contains Japanese text.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper does not propose a new algorithm but benchmarks existing ones. The execution commands for reproducibility were:</p>
<ul>
<li><strong>Imago</strong>: Executed via command line without installation.
<code>./imago_console -dir /image/directory/path</code></li>
<li><strong>MolVec</strong>: Executed as a JAR file.
<code>java -cp [dependencies] gov.nih.ncats.molvec.Main -dir [input_dir] -outDir [output_dir]</code></li>
<li><strong>OSRA</strong>: Installed via Conda (PyOSRA) due to compilation complexity. Required dictionaries for superatoms and spelling.
<code>osra -f sdf -a [superatom_dict] -l [spelling_dict] -w [output_file] [input_file]</code></li>
</ul>
<h3 id="models">Models</h3>
<p>The specific versions of the open-source software tested were:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Version</th>
          <th>Technology</th>
          <th>License</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>0.9.7</td>
          <td>Java-based, rule-based</td>
          <td>LGPL-2.1</td>
      </tr>
      <tr>
          <td><strong>Imago</strong></td>
          <td>2.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>2.1.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Perfect structural match. The output SDfile/SMILES was converted to a Standard InChI string and compared to the ground truth InChI. Any deviation counted as a failure.</li>
<li><strong>Environment</strong>: Linux workstation (Ubuntu 20.04 LTS).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The benchmark was performed on a high-end workstation to measure processing time.</p>
<ul>
<li><strong>CPUs</strong>: 2x Intel Xeon Silver 4114 (40 threads total).</li>
<li><strong>RAM</strong>: 64 GB.</li>
<li><strong>Parallelization</strong>: MolVec had pre-implemented parallelization features that contributed to its speed.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H. O., Zielesny, A., &amp; Steinbeck, C. (2020). A review of optical chemical structure recognition tools. <em>Journal of Cheminformatics</em>, 12(1), 60. <a href="https://doi.org/10.1186/s13321-020-00465-0">https://doi.org/10.1186/s13321-020-00465-0</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanReviewOpticalChemical2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Review of Optical Chemical Structure Recognition Tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00465-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Research on Chemical Expression Images Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</guid><description>A 2015 methodology for Optical Chemical Structure Recognition (OCSR) focusing on improved handling of adhesive symbols and wedge bonds.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hong, C., Du, X., &amp; Zhang, L. (2015). Research on Chemical Expression Images Recognition. <em>Proceedings of the 2015 Joint International Mechanical, Electronic and Information Technology Conference</em>, 267-271. <a href="https://doi.org/10.2991/jimet-15.2015.50">https://doi.org/10.2991/jimet-15.2015.50</a></p>
<p><strong>Publication</strong>: JIMET 2015 (Atlantis Press)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://jsme-editor.github.io/">JSME Editor (used for visualization)</a></li>
</ul>
<h2 id="contribution-new-ocsr-workflow-for-adhesion-and-wedge-bonds">Contribution: New OCSR Workflow for Adhesion and Wedge Bonds</h2>
<p><strong>Method</strong>. The paper proposes a novel algorithmic pipeline (OCSR) for recognizing 2D organic chemical structures from images. It validates this method by comparing it against an existing tool (OSRA) using a quantitative metric (Tanimoto Coefficient) on a test set of 200 images.</p>
<h2 id="motivation-challenges-with-connecting-symbols-and-stereochemistry">Motivation: Challenges with Connecting Symbols and Stereochemistry</h2>
<p>A vast amount of chemical structural information exists in scientific literature (PDFs/images) that is not machine-readable. Manually converting these images to formats like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or CML is labor-intensive. Existing tools face challenges with:</p>
<ol>
<li><strong>Adhesion</strong>: Poor separation when chemical symbols touch or overlap with bonds.</li>
<li><strong>Stereochemistry</strong>: Incomplete identification of &ldquo;real&rdquo; (solid) and &ldquo;virtual&rdquo; (dashed/hashed) wedge bonds.</li>
</ol>
<h2 id="core-innovation-vector-based-separation-and-stereochemical-logic">Core Innovation: Vector-Based Separation and Stereochemical Logic</h2>
<p>The authors propose a specific <strong>OCSR (Optical Chemical Structure Recognition)</strong> workflow with two key technical improvements:</p>
<ol>
<li><strong>Vector-based Separation</strong>: The method vectorizes the image (using Potrace) to extract straight lines and curves, allowing better separation of &ldquo;adhesive&rdquo; chemical symbols (like H, N, O attached to bonds).</li>
<li><strong>Stereochemical Logic</strong>: Specific rules for identifying wedge bonds:
<ul>
<li><strong>Virtual (Dashed) Wedges</strong>: Identified by grouping connected domains and checking linear correlation of their center points.</li>
<li><strong>Real (Solid) Wedges</strong>: Identified after thinning by analyzing linear correlation and width variance of line segments.</li>
</ul>
</li>
</ol>
<h2 id="methodology--experimental-setup">Methodology &amp; Experimental Setup</h2>
<ul>
<li>
<p><strong>Dataset</strong>: 200 chemical structure images collected from the network.</p>
</li>
<li>
<p><strong>Baselines</strong>: Compared against <strong>OSRA</strong> (Optical Structure Recognition Application), a free online tool.</p>
</li>
<li>
<p><strong>Metric</strong>: <strong>Tanimoto Coefficient</strong>, measuring the similarity of the set of recognized bonds and symbols against the ground truth. The similarity $T(A, B)$ is defined as:</p>
<p>$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</p>
</li>
</ul>
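<p>The Tanimoto coefficient above is just set overlap. A minimal sketch, using hypothetical labels for recognized primitives (the paper does not specify its exact primitive encoding):</p>

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of recognized primitives:
    |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

truth = {"C-C", "C=O", "O-H", "wedge:C-N"}
recognized = {"C-C", "C=O", "O-H"}
print(tanimoto(recognized, truth))  # 0.75
```

<p>A value of 1.0 corresponds to the &ldquo;exact match&rdquo; bucket in the results table; the &gt;0.95 and &gt;0.85 buckets admit progressively more missed or spurious primitives.</p>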
<h2 id="results--conclusions">Results &amp; Conclusions</h2>
<ul>
<li><strong>Performance</strong>: The proposed OCSR method achieved higher recognition rates than OSRA.
<ul>
<li><strong>Exact Match (100%)</strong>: OCSR achieved 90.0% vs. OSRA&rsquo;s 82.2%.</li>
<li><strong>High Similarity (&gt;85%)</strong>: OCSR recognized 157 structures vs. OSRA&rsquo;s 114.</li>
</ul>
</li>
<li><strong>Limitations</strong>: The paper notes that &ldquo;real wedge&rdquo; and &ldquo;virtual wedge&rdquo; identification was a primary focus, but general recognition effectiveness still &ldquo;has room for improvement&rdquo;.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom collection of images, not a standard benchmark.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Web-crawled chemical images</td>
          <td>200 structures</td>
          <td>Images containing 2D organic structures; specific source URLs not provided.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows these specific steps:</p>
<ol>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Grayscale</strong>: via <code>cvCvtColor</code> (OpenCV).</li>
<li><strong>Binarization</strong>: via Otsu&rsquo;s method.</li>
</ul>
</li>
<li><strong>Isolated Symbol Removal</strong>:
<ul>
<li>Identifies connected domains with aspect ratios in <code>[0.8, 3.0]</code>.</li>
<li>Recognizes them using OCR (GOCR, OCRAD, Tesseract) and removes them from the image.</li>
</ul>
</li>
<li><strong>Virtual Wedge Recognition</strong>:
<ul>
<li>Groups small connected domains (points/clumps).</li>
<li>Calculates linear correlation of center points; if collinear, treats as a dashed bond.</li>
</ul>
</li>
<li><strong>Vectorization &amp; Thinning</strong>:
<ul>
<li><strong>Thinning</strong>: Rosenfeld algorithm (optimized) to reduce lines to single pixel width.</li>
<li><strong>Vectorization</strong>: Uses <strong>Potrace</strong> to convert pixels to vector segments.</li>
<li><strong>Merging</strong>: Combines split vector segments based on angle thresholds to form long straight lines.</li>
</ul>
</li>
<li><strong>Adhesive Symbol Separation</strong>:
<ul>
<li>Identifies curves (short segments after vectorization) attached to long lines.</li>
<li>Separates these domains and re-runs OCR.</li>
</ul>
</li>
<li><strong>&ldquo;Super Atom&rdquo; Merging</strong>:
<ul>
<li>Merges adjacent vertical/horizontal symbols (e.g., &ldquo;HO&rdquo;, &ldquo;CH3&rdquo;) based on distance thresholds between bounding boxes.</li>
</ul>
</li>
</ol>
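<p>The virtual-wedge test in step 3 checks whether the dash centroids are collinear. The paper phrases this as linear correlation of the center points; a geometrically equivalent sketch (an assumption of this note, not the authors&rsquo; code) uses the perpendicular distance of each centroid from the line through the two extreme points:</p>

```python
def are_collinear(points, tol=1.5):
    """Virtual-wedge test sketch: treat dash centroids as one dashed
    bond if every point lies within `tol` pixels of the line through
    the first and last points."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5
    for x, y in points[1:-1]:
        # Perpendicular distance via the 2D cross product.
        dist = abs(dx * (y - y0) - dy * (x - x0)) / norm
        if dist > tol:
            return False
    return True

dash_centers = [(0, 0), (5, 1), (10, 2), (15, 3)]
print(are_collinear(dash_centers))  # True
```

<p>Unlike a raw correlation coefficient, the distance test also behaves sensibly for vertical dashed bonds, where the x-coordinates carry no variance.</p>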
<h3 id="models">Models</h3>
<p>The system relies on off-the-shelf OCR tools for character recognition; no custom ML models were trained.</p>
<ul>
<li><strong>OCR Engines</strong>: GOCR, OCRAD, Tesseract.</li>
<li><strong>Visualization</strong>: JSME (JavaScript Molecule Editor) used to render output strings.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (OCSR)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Exact Match (100%)</td>
          <td><strong>90.0%</strong></td>
          <td>82.2%</td>
          <td>Percentage of 200 images perfectly recognized.</td>
      </tr>
      <tr>
          <td>&gt;95% Similarity</td>
          <td><strong>95 images</strong></td>
          <td>71 images</td>
          <td>Count of images with Tanimoto &gt; 0.95.</td>
      </tr>
      <tr>
          <td>&gt;85% Similarity</td>
          <td><strong>157 images</strong></td>
          <td>114 images</td>
          <td>Count of images with Tanimoto &gt; 0.85.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; runs on standard CPU architecture (implied by use of standard libraries like OpenCV and Potrace).</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{hongResearchChemicalExpression2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Research on {{Chemical Expression Images Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2015 {{Joint International Mechanical}}, {{Electronic}} and {{Information Technology Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hong, Chen and Du, Xiaoping and Zhang, Lu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Atlantis Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Chongqing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2991/jimet-15.2015.50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-94-6252-129-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Probabilistic OCSR with Markov Logic Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</guid><description>A probabilistic approach using Markov Logic Networks to recognize chemical structures from images, improving robustness over rule-based systems.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Frasconi, P., Gabbrielli, F., Lippi, M., &amp; Marinai, S. (2014). Markov Logic Networks for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 54(8), 2380-2390. <a href="https://doi.org/10.1021/ci5002197">https://doi.org/10.1021/ci5002197</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2014</p>
<h2 id="contribution-probabilistic-method-for-ocsr">Contribution: Probabilistic Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a novel algorithmic architecture (<strong>MLOCSR</strong>) that integrates low-level pattern recognition with a high-level probabilistic reasoning engine based on Markov Logic Networks (MLNs). While it contributes to resources by creating a clustered dataset for evaluation, the primary focus is on demonstrating that probabilistic inference offers a superior methodology to the deterministic, rule-based heuristics employed by previous state-of-the-art systems like OSRA and CLiDE.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for converting the vast archive of chemical literature (bitmap images in patents and papers) into machine-readable formats.</p>
<ul>
<li><strong>Limitation of Prior Work</strong>: Existing systems (OSRA, CLiDE, ChemReader) rely on &ldquo;empirical hard-coded geometrical rules&rdquo; to assemble atoms and bonds. These heuristics are brittle, requiring manual tuning of parameters for different image resolutions and failing when images are degraded or noisy.</li>
<li><strong>Gap</strong>: Chemical knowledge is typically used only in post-processing (e.g., to fix valency errors).</li>
<li><strong>Goal</strong>: To create a resolution-independent system that uses probabilistic reasoning to handle noise and ambiguity in graphical primitives.</li>
</ul>
<h2 id="core-innovation-markov-logic-networks-for-diagram-interpretation">Core Innovation: Markov Logic Networks for Diagram Interpretation</h2>
<p>The core novelty is the application of <strong>Markov Logic Networks (MLNs)</strong> to the problem of diagram interpretation.</p>
<ul>
<li><strong>Probabilistic Reasoning</strong>: The system treats extracted visual elements (lines, text boxes) as &ldquo;evidence&rdquo; and uses weighted first-order logic formulas to infer the most likely molecular graph (Maximum A Posteriori inference). The probability of a state $x$ is defined by the MLN log-linear model:
$$ P(X=x) = \frac{1}{Z} \exp\left(\sum_{i} w_i n_i(x)\right) $$
where $w_i$ is the weight of the $i$-th formula and $n_i(x)$ is the number of true groundings in $x$.</li>
<li><strong>Unified Knowledge Representation</strong>: Geometric constraints (e.g., collinearity) and chemical rules (e.g., valency) are encoded in the same logic framework.</li>
<li><strong>Resolution Independence</strong>: The low-level extraction module dynamically estimates character size ($T$) and stroke width ($S$) to normalize parameters, removing the dependence on image DPI metadata.</li>
</ul>
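<p>The MAP objective above can be made concrete with a toy sketch. This is not the paper's implementation (inference there runs in Alchemy); it simply enumerates a tiny world space to show how the weights $w_i$ and grounding counts $n_i(x)$ of the log-linear model yield a distribution over interpretations. The two formulas and their weights below are hypothetical:</p>

```python
import math

def mln_probability(weights, count_fn, worlds):
    """P(X = x) = exp(sum_i w_i * n_i(x)) / Z over an explicitly
    enumerated world set -- tractable only at toy scale."""
    scores = {x: math.exp(sum(w * n for w, n in zip(weights, count_fn(x))))
              for x in worlds}
    z = sum(scores.values())  # partition function Z
    return {x: s / z for x, s in scores.items()}

# Hypothetical mini-MLN: two query atoms, SingleBond and DoubleBond,
# for one pair of carbon candidates with a detected line between them.
weights = [2.0, 1.5]

def counts(world):
    single, double = world
    n1 = 1 if single else 0               # "LineBetween => SingleBond"
    n2 = 0 if (single and double) else 1  # "not (SingleBond and DoubleBond)"
    return (n1, n2)

worlds = [(s, d) for s in (True, False) for d in (True, False)]
probs = mln_probability(weights, counts, worlds)
map_state = max(probs, key=probs.get)
```

<p>Here the single-bond-only world wins, mirroring how MAP inference selects the interpretation that satisfies the most heavily weighted formulas.</p>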
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated the system on recognition accuracy against the leading open-source baseline, <strong>OSRA (v1.4.0)</strong>.</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>USPTO Clustered</strong>: A non-redundant subset of 937 images derived from a larger set of 5,719 US Patent Office images.</li>
<li><strong>ChemInfty</strong>: 869 images from Japanese patents.</li>
<li><strong>Degraded Images</strong>: The USPTO set was synthetically degraded at three resampling levels (Low, Medium, High degradation) to test robustness.</li>
</ul>
</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Geometric</strong>: Precision, Recall, and $F_1$ scores for individual atoms and bonds.</li>
<li><strong>Chemical</strong>: Tanimoto similarity (using path fingerprints) and InChI string matching (basic and full stereochemistry).</li>
</ul>
</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Superior Robustness</strong>: MLOCSR significantly outperformed OSRA on degraded images. On high-degradation images, MLOCSR achieved an atom $F_1$ of 80.3% compared to OSRA&rsquo;s 76.0%.</li>
<li><strong>Geometric Accuracy</strong>: In clean datasets (USPTO cluster), MLOCSR achieved higher $F_1$ scores for atoms (99.1% vs 97.5%) and bonds (98.8% vs 97.8%).</li>
<li><strong>Chemical Fidelity</strong>: The system achieved comparable Tanimoto similarity scores (0.948 vs 0.940 for OSRA).</li>
<li><strong>Limitation</strong>: OSRA slightly outperformed MLOCSR on &ldquo;Full <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>&rdquo; matching (81.4% vs 79.4%), indicating the probabilistic model still needs improvement in handling complex stereochemistry.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized public datasets, with specific preprocessing to ensure non-redundancy.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>USPTO Clustered</strong></td>
          <td>937 images</td>
          <td>Selected via spectral clustering from 5,719 raw images to remove near-duplicates.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>ChemInfty</strong></td>
          <td>869 images</td>
          <td>Ground-truthed dataset from Japanese patent applications (2008).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of two distinct phases: Low-Level Vectorization and High-Level Inference.</p>
<p><strong>1. Low-Level Extraction (Image Processing)</strong></p>
<ul>
<li><strong>Binarization</strong>: Global thresholding followed by morphological closing.</li>
<li><strong>Text/Stroke Estimation</strong>:
<ul>
<li>Finds text height ($T$) by looking for &ldquo;N&rdquo; or &ldquo;H&rdquo; characters via OCR, or averaging compatible connected components.</li>
<li>Estimates stroke width ($S$) by inspecting pixel density on potential segments identified by Hough transform.</li>
</ul>
</li>
<li><strong>Vectorization</strong>:
<ul>
<li><strong>Canny Edge Detection</strong> + <strong>Hough Transform</strong> to find lines.</li>
<li><strong>Douglas-Peucker algorithm</strong> for polygonal approximation of contours.</li>
<li><strong>Circle Detection</strong>: Finds aromatic rings by checking for circular arrangements of carbon candidates.</li>
</ul>
</li>
</ul>
<p><strong>2. High-Level Inference (Markov Logic)</strong></p>
<ul>
<li><strong>Evidence Generation</strong>: Visual primitives (lines, text boxes, circles) are converted into logical ground atoms (e.g., <code>LineBetweenCpoints(c1, c2)</code>).</li>
<li><strong>Inference Engine</strong>: Uses <strong>MaxWalkSAT</strong> for Maximum A Posteriori (MAP) inference to determine the most probable state of query predicates (e.g., <code>DoubleBond(c1, c2)</code>).</li>
<li><strong>Parameters</strong>: MaxWalkSAT run with 3 tries and 1,000,000 steps per try.</li>
</ul>
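<p>A minimal sketch of the MaxWalkSAT loop used for MAP inference, in its usual formulation: weighted clauses, random restarts, and a noise parameter trading random flips against greedy cost-reducing flips. The clause encoding and parameters below are illustrative, not taken from the paper's Alchemy configuration:</p>

```python
import random

def cost(clauses, assign):
    """Total weight of unsatisfied weighted clauses; MAP minimizes this."""
    return sum(w for w, lits in clauses
               if not any(assign[v] == sign for v, sign in lits))

def flip_cost(clauses, assign, var):
    """Cost after tentatively flipping `var` (state restored afterwards)."""
    assign[var] = not assign[var]
    c = cost(clauses, assign)
    assign[var] = not assign[var]
    return c

def maxwalksat(clauses, variables, tries=3, steps=10_000, noise=0.5, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(tries):
        assign = {v: rng.random() < 0.5 for v in variables}  # random restart
        c = cost(clauses, assign)
        if c < best_cost:
            best, best_cost = dict(assign), c
        for _ in range(steps):
            unsat = [lits for w, lits in clauses
                     if not any(assign[v] == sign for v, sign in lits)]
            if not unsat:
                break  # all clauses satisfied: cost 0 is optimal
            lits = rng.choice(unsat)
            if rng.random() < noise:  # noisy (random) flip
                var = rng.choice(lits)[0]
            else:                     # greedy flip within the clause
                var = min((v for v, _ in lits),
                          key=lambda v: flip_cost(clauses, assign, v))
            assign[var] = not assign[var]
            c = cost(clauses, assign)
            if c < best_cost:
                best, best_cost = dict(assign), c
    return best, best_cost
```

<p>Each clause is a <code>(weight, [(variable, wanted_truth_value), ...])</code> pair; the paper's run used 3 tries of 1,000,000 steps each.</p>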
<h3 id="models">Models</h3>
<ul>
<li><strong>Markov Logic Network (MLN)</strong>:
<ul>
<li>Contains <strong>128 first-order logic formulas</strong>.</li>
<li><strong>Geometric Rules</strong>: Example: <code>VeryCloseCpoints(c1, c2) =&gt; SameCarbon(c1, c2)</code> (weighted rule to merge close nodes).</li>
<li><strong>Chemical Rules</strong>: Example: <code>IsHydroxyl(t) ^ Connected(c,t) =&gt; SingleBond(c,t)</code> (imposes valency constraints).</li>
</ul>
</li>
<li><strong>OCR Engine</strong>: Tesseract is used for character recognition on text connected components.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The authors introduced a bipartite graph matching method to evaluate geometric accuracy when superatoms (e.g., &ldquo;COOH&rdquo;) are not expanded.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Atom/Bond $F_1$</strong></td>
          <td>Calculated via minimum-weight bipartite matching between predicted graph and ground truth, weighted by Euclidean distance.</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>Standard unique identifier string. &ldquo;Basic&rdquo; ignores stereochemistry; &ldquo;Full&rdquo; includes it.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Jaccard index of path fingerprints between predicted and ground truth molecules.</td>
      </tr>
  </tbody>
</table>
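<p>The atom-level scoring can be sketched with SciPy's assignment solver, assuming matched coordinate pairs count as hits when they fall within a pixel tolerance (the paper weights the matching by Euclidean distance; the tolerance value here is illustrative). The Tanimoto score is the Jaccard index over fingerprint sets:</p>

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def atom_f1(pred, truth, tol=5.0):
    """Minimum-weight bipartite matching of predicted vs. ground-truth
    atom coordinates, scoring pairs within `tol` pixels as hits."""
    if len(pred) == 0 or len(truth) == 0:
        return 0.0, 0.0, 0.0
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    d = np.linalg.norm(pred[:, None, :] - truth[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(d)       # Hungarian matching
    hits = int((d[rows, cols] <= tol).sum())
    p, r = hits / len(pred), hits / len(truth)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def tanimoto(fp_a, fp_b):
    """Jaccard index of two path-fingerprint sets."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```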
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software</strong>: Logic inference performed using the <strong>Alchemy</strong> software package (University of Washington).</li>
<li><strong>Web Server</strong>: The system was made available at <code>http://mlocsr.dinfo.unifi.it</code> (Note: URL likely inactive).</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{frasconiMarkovLogicNetworks2014,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Markov {{Logic Networks}} for {{Optical Chemical Structure Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Frasconi, Paolo and Gabbrielli, Francesco and Lippi, Marco and Marinai, Simone}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2014</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{54}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2380--2390}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596, 1549-960X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci5002197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Overview of the TREC 2011 Chemical IR Track Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/trec-chem-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/trec-chem-2011/</guid><description>Overview of the 2011 TREC Chemical IR track, establishing benchmarks for patent prior art, technology surveys, and chemical image recognition.</description><content:encoded><![CDATA[<h2 id="contribution-establishing-chemical-ir-benchmarks">Contribution: Establishing Chemical IR Benchmarks</h2>
<p>This is a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper with a secondary contribution in <strong>Systematization ($\Psi_{\text{Systematization}}$)</strong>.</p>
<p>It serves as an infrastructural foundation for the field by establishing the &ldquo;yardstick&rdquo; for chemical information retrieval. It defines three distinct tasks, curates the necessary datasets (text and image), and creates the evaluation metrics required to measure progress. Secondarily, it systematizes the field by analyzing 36 different runs from 9 research groups, categorizing the performance of various approaches against these new benchmarks.</p>
<h2 id="motivation-bridging-text-and-image-search-in-chemistry">Motivation: Bridging Text and Image Search in Chemistry</h2>
<p>The primary motivation is to bridge the gap between distinct research communities (text mining and image understanding), which are both essential for chemical information retrieval but rarely interact. Professional searchers in chemistry rely heavily on non-textual information (structures), yet prior evaluation efforts lacked specific tasks to handle image data. The track aims to provide professional searchers with a clear understanding of the limits of current tools while stimulating research interest in both patent retrieval and chemical image recognition.</p>
<h2 id="novelty-the-image-to-structure-i2s-task">Novelty: The Image-to-Structure (I2S) Task</h2>
<p>The core novelty is the introduction of the <strong>Image-to-Structure (I2S)</strong> task. While previous years provided image data, this was the first specific task requiring participants to translate a raster image of a molecule into a chemical structure file. Additionally, the Technology Survey (TS) task shifted its focus specifically to <strong>biomedical and pharmaceutical topics</strong> to investigate how general IR systems handle the high terminological diversity (synonyms, abbreviations) typical of biomedical patents.</p>
<h2 id="methodology-trec-2011-task-formulations">Methodology: TREC 2011 Task Formulations</h2>
<p>The organizers conducted a large-scale benchmarking campaign across three specific tasks:</p>
<ol>
<li><strong>Prior Art (PA) Task</strong>: A patent retrieval task using 1,000 topics distributed among the EPO, USPTO, and WIPO.</li>
<li><strong>Technology Survey (TS) Task</strong>: An ad-hoc retrieval task focused on 6 specific biomedical/pharmaceutical information needs (e.g., &ldquo;Tests for HCG hormone&rdquo;).</li>
<li><strong>Image-to-Structure (I2S) Task</strong>: A recognition task using 1,000 training images and 1,000 evaluation images from USPTO patents, where systems had to generate the correct chemical structure (MOL file).</li>
</ol>
<p>A total of 9 groups submitted 36 runs across these tasks. Relevance judgments were performed using stratified sampling and a dual-evaluator system (junior and senior experts) for the TS task.</p>
<h2 id="outcomes-task-achievements-and-limitations">Outcomes: Task Achievements and Limitations</h2>
<ul>
<li><strong>Image-to-Structure Success</strong>: The new I2S task was the most successful task that year, with 5 participating groups submitting 11 runs. All participants recognized over 60% of the structures.</li>
<li><strong>Prior Art Saturation</strong>: Only 2 groups participated in the PA task. The organizers concluded that this task had reached its &ldquo;final point,&rdquo; having learned the extent to which relevant documents can be retrieved in one pass for chemical patent applications.</li>
<li><strong>Biomedical Complexity</strong>: Four teams submitted 14 runs for the TS task, which highlighted the complexity of biomedical queries. The use of specialized domain experts (senior evaluators) and students (junior evaluators) provided high-quality relevance data, though the small number of topics (6) limits broad generalization.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The following details describe the benchmark environment established by the organizers, allowing for the replication of the evaluation.</p>
<h3 id="data">Data</h3>
<p>The track utilized a large collection of approximately 500GB of compressed text and image data.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Dataset / Source</th>
          <th style="text-align: left">Size / Split</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Prior Art (PA)</strong></td>
          <td style="text-align: left">EPO, USPTO, WIPO patents</td>
          <td style="text-align: left">1,000 Topics</td>
          <td style="text-align: left">Distributed: 334 EPO, 333 USPTO, 333 WIPO.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tech Survey (TS)</strong></td>
          <td style="text-align: left">Biomedical patents/articles</td>
          <td style="text-align: left">6 Topics</td>
          <td style="text-align: left">Topics formulated by domain experts; focused on complexity (synonyms, abbreviations).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Image (I2S)</strong></td>
          <td style="text-align: left">USPTO patent images</td>
          <td style="text-align: left">1,000 Train / 1,000 Eval</td>
          <td style="text-align: left">Criteria: No polymers, &ldquo;organic&rdquo; elements only, MW &lt; 1000, single fragment.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper defines specific <strong>evaluation algorithms</strong> used to ground-truth the submissions:</p>
<ul>
<li><strong>Stratified Sampling (TS)</strong>: Pools were generated using the method from Yilmaz et al. (2008). The pool included the top 10 documents from all runs, 30% of the top 30, and 10% of the rest down to rank 1000.</li>
<li><strong>InChI Matching (I2S)</strong>: Evaluation relied on generating <strong>Standard <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> Keys</strong> from both the ground truth MOL files and the participant submissions. Success was defined by exact string matching of these keys. This provided a relatively controversy-free measure of chemical identity.</li>
</ul>
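<p>The I2S scoring rule reduces to exact string matching of Standard InChIKeys, one per image. A sketch, assuming predictions and ground-truth MOL files have already been converted to keys with standard InChI software and collected into hypothetical <code>image_id → key</code> dicts:</p>

```python
def i2s_recall(predicted, truth):
    """Fraction of ground-truth images whose predicted Standard
    InChIKey matches exactly (TREC 2011 I2S scoring)."""
    correct = sum(1 for image_id, key in truth.items()
                  if predicted.get(image_id) == key)
    return correct / len(truth)
```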
<h3 id="models">Models</h3>
<p>While the paper does not propose a single model, it evaluates several distinct approaches submitted by participants. Notable systems mentioned include:</p>
<ul>
<li><strong>OSRA</strong> (SAIC-Frederick / NIH)</li>
<li><strong>ChemReader</strong> (University of Michigan)</li>
<li><strong>ChemOCR</strong> (Fraunhofer SCAI)</li>
<li><strong>UoB</strong> (University of Birmingham)</li>
<li><strong>GGA</strong> (GGA Software)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using standard IR metrics for text and exact matching for images.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>MAP / xinfAP</strong></td>
          <td style="text-align: left">Prior Art / Tech Survey</td>
          <td style="text-align: left">Mean Average Precision ($\text{MAP}$) and Extended Inferred AP ($\text{xinfAP}$) were used to measure retrieval quality.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>infNDCG</strong></td>
          <td style="text-align: left">Tech Survey</td>
          <td style="text-align: left">Used to account for graded relevance (highly relevant vs relevant, formalized as $\text{infNDCG}$).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Recall</strong></td>
          <td style="text-align: left">Image-to-Structure</td>
          <td style="text-align: left">Percentage of images where the generated InChI key matched exactly ($R = \frac{\text{Correct}}{\text{Total}}$).</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://trec.nist.gov/data/chemical11.html">TREC 2011 Chemistry Track Data</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Topics, relevance judgments, and image sets for all three tasks</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://trec.nist.gov/pubs/trec20/t20.proceedings.html">TREC 2011 Proceedings</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Full proceedings including participant system descriptions</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Specific hardware requirements for the participating systems are not detailed in this overview, but the dataset size (500GB) implies significant storage and I/O throughput requirements.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lupu, M., Gurulingappa, H., Filippov, I., Zhao, J., Fluck, J., Zimmermann, M., Huang, J., &amp; Tait, J. (2011). Overview of the TREC 2011 Chemical IR Track. In <em>Proceedings of the Twentieth Text REtrieval Conference (TREC 2011)</em>.</p>
<p><strong>Publication</strong>: Text REtrieval Conference (TREC) 2011</p>
<p><strong>Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/t20.proceedings.html">TREC 2011 Proceedings</a></li>
<li><a href="https://trec.nist.gov/data/chemical11.html">TREC 2011 Chemistry Track Data</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lupuOverviewTREC20112011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Overview of the {{TREC}} 2011 {{Chemical IR Track}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Lupu, Mihai and Gurulingappa, Harsha and Filippov, Igor and Zhao, Jiashu and Fluck, Juliane and Zimmermann, Marc and Huang, Jimmy and Tait, John}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the Twentieth Text REtrieval Conference (TREC 2011)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{NIST}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{The third year of the Chemical IR evaluation track benefitted from the support of many more people interested in the domain, as shown by the number of co-authors of this overview paper. We continued the two tasks we had before, and introduced a new task focused on chemical image recognition. The objective is to gradually move towards systems really useful to the practitioners, and in chemistry, this involves both text and images. The track had a total of 9 groups participating, submitting a total of 36 runs.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at CLEF-IP 2012: Native TIFF Processing for Patents</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</guid><description>Evaluation of OSRA on CLEF-IP 2012 patent data showing native TIFF processing outperforms external splitting tools and pairwise-distance segmentation.</description><content:encoded><![CDATA[<h2 id="contribution-evaluating-native-processing-in-osra">Contribution: Evaluating Native Processing in OSRA</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It focuses on evaluating the algorithmic performance of the Optical Structure Recognition Application (OSRA) and justifies specific implementation details (such as pairwise distance clustering) through comparative analysis. The paper systematically compares preprocessing workflows (native vs. <code>tiffsplit</code>) to demonstrate how implementation choices impact precision, recall, and F1 scores.</p>
<h2 id="motivation-advancing-chemical-structure-recognition">Motivation: Advancing Chemical Structure Recognition</h2>
<p>The primary motivation is to solve the <strong>Chemical Structure Recognition</strong> task within the context of the CLEF-IP 2012 challenge. The goal is to accurately convert images of chemical structures found in patent documents into established computerized molecular formats (connection tables).</p>
<p>A secondary technical motivation is to address issues in page segmentation where standard bounding box approaches fail to separate overlapping or nested molecular structures.</p>
<h2 id="core-innovation-pairwise-distance-segmentation">Core Innovation: Pairwise Distance Segmentation</h2>
<p>The core novelty lies in the algorithmic approach to object detection and page segmentation:</p>
<ol>
<li>
<p><strong>Rejection of Bounding Boxes</strong>: Unlike standard OCR approaches, OSRA does not use a bounding box paradigm internally. Instead, it relies on the <strong>minimum pairwise distance</strong> between points of different connected components. This allows the system to correctly handle cases where a larger molecule &ldquo;surrounds&rdquo; a smaller one, which bounding boxes would incorrectly merge.</p>
</li>
<li>
<p><strong>Native TIFF Processing</strong>: The authors identify that external tools (specifically <code>tiffsplit</code>) introduce artifacts during multi-page TIFF conversion. They implement native splitting facilities within OSRA, which substantially improves precision (from 0.433 to 0.708 at tolerance 0).</p>
</li>
</ol>
<h2 id="experimental-setup-segmentation-and-recognition-tracks">Experimental Setup: Segmentation and Recognition Tracks</h2>
<p>The authors performed two specific tracks for the CLEF-IP 2012 challenge:</p>
<ol>
<li>
<p><strong>Page Segmentation</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 5421 ground truth structures.</li>
<li><strong>Comparison</strong>: Run 1 used <code>tiffsplit</code> (external tool) to separate pages; Run 2 used OSRA&rsquo;s native internal page splitting.</li>
<li><strong>Metrics</strong>: Precision, Recall, and F1 scores calculated at varying pixel tolerances (0, 10, 20, 40, 55 pixels).</li>
</ul>
</li>
<li>
<p><strong>Structure Recognition</strong>:</p>
<ul>
<li><strong>Dataset</strong>: A test set split into an &ldquo;Automatic&rdquo; evaluation set (865 structures checkable via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> keys) and a &ldquo;Manual&rdquo; evaluation set (95 structures requiring human review due to Markush labels).</li>
<li><strong>Metric</strong>: Recognition rate (Recalled %).</li>
</ul>
</li>
</ol>
<h2 id="results-and-conclusions-native-processing-gains">Results and Conclusions: Native Processing Gains</h2>
<ul>
<li><strong>Native vs. External Splitting</strong>: The native OSRA page splitting outperformed the external <code>tiffsplit</code> tool by a wide margin. At tolerance 0, native processing achieved <strong>0.708 Precision</strong> compared to <strong>0.433</strong> for <code>tiffsplit</code>. The authors attribute this gap to artifacts introduced during <code>tiffsplit</code>&rsquo;s internal TIFF format conversion. The native run also returned far fewer records (5,254 vs. 8,800 for <code>tiffsplit</code>), indicating fewer false detections.</li>
<li><strong>Recognition Rate</strong>: Across 960 total structures, the system achieved an <strong>83% recognition rate</strong> (88% on the automatic set, 40% on the manual Markush set).</li>
<li><strong>Context</strong>: The results were consistent with OSRA&rsquo;s second-place finish (out of 6 participants) at TREC-CHEM 2011.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The experiments used the CLEF-IP 2012 benchmark datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Set</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td>Ground Truth</td>
          <td>5,421 structures</td>
          <td>Used to evaluate bounding box/coordinate accuracy.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Automatic</td>
          <td>865 structures</td>
          <td>Evaluated via InChI key matching.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Manual</td>
          <td>95 structures</td>
          <td>Evaluated manually due to Markush-style labels.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Component Clustering (Pairwise Distance)</strong></p>
<p>The segmentation algorithm avoids bounding boxes.</p>
<ul>
<li><strong>Logic</strong>: Calculate the minimum pairwise distance between points of distinct graphical components.</li>
<li><strong>Criterion</strong>: If distance $d &lt; \text{threshold}$, components are grouped.</li>
<li><strong>Advantage</strong>: Enables separation of complex geometries where a bounding box $B_1$ might fully encompass $B_2$ (e.g., a large ring surrounding a salt ion), whereas the actual pixels are disjoint.</li>
</ul>
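<p>The grouping criterion can be sketched as union-find over connected components, with the minimum pixel-to-pixel distance as the linkage test. The threshold and point sets in the test are illustrative; the point is that a ring whose bounding box encloses a distant ion still lands in a separate cluster:</p>

```python
import numpy as np

def min_pairwise_distance(a, b):
    """Minimum Euclidean distance between any pixel of a and any of b."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min()

def group_components(components, threshold):
    """Union-find grouping: two components merge when the minimum
    distance between their pixel sets falls below `threshold`."""
    n = len(components)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if min_pairwise_distance(components[i], components[j]) < threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

<p>A bounding-box test would merge the surrounding ring with the enclosed ion; the pixel-distance test keeps them apart while still joining genuinely adjacent components.</p>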
<p><strong>2. Image Pre-processing</strong></p>
<ul>
<li><strong>Workflow A (Run 1)</strong>: Multi-page TIFF → <code>tiffsplit</code> binary → Single TIFFs → OSRA.</li>
<li><strong>Workflow B (Run 2)</strong>: Multi-page TIFF → OSRA Internal Split → Recognition.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Page Segmentation Results (tiffsplit, Run 1)</strong></p>
<p>Using <code>tiffsplit</code> for page splitting returned 8,800 records against 5,421 ground truth structures.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.433</td>
          <td>0.703</td>
          <td>0.536</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.490</td>
          <td>0.795</td>
          <td>0.606</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.507</td>
          <td>0.823</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.536</td>
          <td>0.870</td>
          <td>0.663</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.549</td>
          <td>0.891</td>
          <td>0.679</td>
      </tr>
  </tbody>
</table>
<p><strong>Page Segmentation Results (Native Split, Run 2)</strong></p>
<p>Using OSRA&rsquo;s native TIFF reading returned 5,254 records, with much higher precision.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.708</td>
          <td>0.686</td>
          <td>0.697</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.793</td>
          <td>0.769</td>
          <td>0.781</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.821</td>
          <td>0.795</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.867</td>
          <td>0.840</td>
          <td>0.853</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.887</td>
          <td>0.860</td>
          <td>0.873</td>
      </tr>
  </tbody>
</table>
<p><strong>Structure Recognition Results</strong></p>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Count</th>
          <th>Recalled</th>
          <th>Percentage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Automatic</td>
          <td>865</td>
          <td>761</td>
          <td>88%</td>
      </tr>
      <tr>
          <td>Manual</td>
          <td>95</td>
          <td>38</td>
          <td>40%</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>960</strong></td>
          <td><strong>799</strong></td>
          <td><strong>83%</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://cactus.nci.nih.gov/osra">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Official project page at NCI/NIH</td>
      </tr>
  </tbody>
</table>
<p>OSRA is described as an open source utility. The CLEF-IP 2012 benchmark datasets were provided as part of the shared task. No hardware or compute requirements are specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2012). Optical Structure Recognition Application entry to CLEF-IP 2012. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>.</p>
<p><strong>Publication</strong>: CLEF 2012</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="http://cactus.nci.nih.gov/osra">Project Home Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{filippovOpticalStructureRecognition2012,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry to {{CLEF-IP}} 2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V and Katsubo, Dmitry and Nicklaus, Marc C}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present our entry to CLEF 2012 Chemical Structure Recognition task. Our submission includes runs for both bounding box extraction and molecule structure recognition tasks using Optical Structure Recognition Application. OSRA is an open source utility to convert images of chemical structures to connection tables into established computerized molecular formats. It has been under constant development since 2007.}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolRec at CLEF 2012: Rule-Based Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</guid><description>Overview and failure analysis of the MolRec rule-based chemical structure recognition system evaluated on the CLEF 2012 chemical structure recognition task.</description><content:encoded><![CDATA[<h2 id="contribution-to-chemical-structure-recognition">Contribution to Chemical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper. It describes the architecture of an engineered artifact (the &ldquo;MolRec&rdquo; system) and evaluates its efficacy on a specific task (Chemical Structure Recognition) using a standardized benchmark. It focuses on the mechanisms of vectorization and rule-based rewriting.</p>
<h2 id="motivation-and-clef-2012-context">Motivation and CLEF 2012 Context</h2>
<p>The work was motivated by the <strong>CLEF 2012 chemical structure recognition task</strong>. The goal was to automatically interpret chemical diagram images clipped from patent documents. This is challenging because real-world patent images contain complex structures, such as bridge bonds and elements not supported by standard conversion tools like OpenBabel.</p>
<h2 id="novelty-in-rule-based-vectorization">Novelty in Rule-Based Vectorization</h2>
<p>The primary contribution is an <strong>improved rule-based rewrite engine</strong> compared to the authors&rsquo; previous TREC 2011 submission, featuring a fully overhauled implementation that improves both recognition performance and computational efficiency. The system uses a two-stage approach:</p>
<ol>
<li><strong>Vectorization</strong>: Extracts geometric primitives (lines, circles, arrows) and characters.</li>
<li><strong>Rule Engine</strong>: Applies 18 specific geometric rewriting rules to transform primitives into a chemical graph, which can then be exported to MOL or SMILES format.</li>
</ol>
<p>Notably, the system explicitly handles &ldquo;bridge bonds&rdquo; (3D perspective structures) by applying specific recognition rules before general bond detection.</p>
<h2 id="experimental-setup-on-the-clef-2012-corpus">Experimental Setup on the CLEF 2012 Corpus</h2>
<p>The system was evaluated on the <strong>CLEF 2012 corpus</strong> of 961 test images, split into two distinct sets to test different capabilities:</p>
<ul>
<li><strong>Automatic Set</strong>: 865 images evaluated automatically using OpenBabel to compare generated MOL files against ground truth.</li>
<li><strong>Manual Set</strong>: 95 &ldquo;challenging&rdquo; images containing elements beyond OpenBabel&rsquo;s scope (e.g., Markush structures), evaluated via manual visual inspection.</li>
</ul>
<p>The authors performed <strong>four runs</strong> with slightly different internal parameters to test system stability.</p>
<h2 id="performance-outcomes-and-failure-analysis">Performance Outcomes and Failure Analysis</h2>
<p><strong>Performance:</strong></p>
<ul>
<li><strong>Automatic Set</strong>: High performance, achieving accuracy between <strong>94.91% and 96.18%</strong>.</li>
<li><strong>Manual Set</strong>: Lower performance, with accuracy between <strong>46.32% and 58.95%</strong>, reflecting the difficulty of complex patent diagrams containing Markush structures and other elements beyond OpenBabel&rsquo;s scope.</li>
</ul>
<p><strong>Failure Analysis:</strong></p>
<p>The authors conducted a detailed error analysis on 52 distinct mis-recognized diagrams from the manual set and 46 from the automatic set. Key failure modes include:</p>
<ul>
<li><strong>Character Grouping</strong>: The largest error source in the manual set (26 images). A bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, and closely spaced atom groups were incorrectly merged.</li>
<li><strong>Touching Characters</strong>: 8 images in the manual set and 1 in the automatic set. The system lacks segmentation for characters that touch, causing OCR failure.</li>
<li><strong>Four-way Junctions</strong>: 6 manual and 7 automatic images. Vectorization failed to correctly identify junctions where four lines meet.</li>
<li><strong>Missed Wedge Bonds</strong>: 6 images each for missed solid wedge and dashed wedge bonds in the automatic set.</li>
<li><strong>OCR Errors</strong>: 5 manual and 11 automatic images, including misrecognition of &ldquo;G&rdquo; as &ldquo;O&rdquo; and &ldquo;I&rdquo; interpreted as a vertical single bond.</li>
<li><strong>Charge Signs</strong>: MolRec correctly recognized positive charge signs but missed three negative charge signs, including one placed at the top left of an atom name.</li>
<li><strong>Dataset Errors</strong>: The authors identified 11 images where the ground truth MOL files were incorrect, but MolRec&rsquo;s recognition was actually correct.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was provided by CLEF 2012 organizers and consists of images clipped from patent documents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation (Auto)</td>
          <td>CLEF 2012 Set 1</td>
          <td>865 images</td>
          <td>Evaluated via OpenBabel</td>
      </tr>
      <tr>
          <td>Evaluation (Manual)</td>
          <td>CLEF 2012 Set 2</td>
          <td>95 images</td>
          <td>Complex/Markush structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>MolRec</strong> pipeline consists of two primary modules:</p>
<p><strong>1. Vectorization Module</strong></p>
<ul>
<li><strong>Binarization</strong>: Uses <strong>Otsu&rsquo;s method</strong>.</li>
<li><strong>OCR</strong>: Extracts connected components and classifies them using <strong>nearest neighbor classification</strong> with a Euclidean metric. Detected characters are removed from the image.</li>
<li><strong>Bond Separation</strong>:
<ul>
<li>Thins remaining components to single-pixel width.</li>
<li>Builds polyline representations.</li>
<li>Splits polylines at junctions (3+ lines meeting).</li>
<li><strong>Simplification</strong>: Applies the <strong>Douglas-Peucker algorithm</strong> with a threshold of 1-2 average line widths to remove scanning artifacts while preserving corners. The threshold is based on measured average line width, allowing adaptation to different line styles.</li>
<li>Also detects circles, arrow heads, and solid triangles (annotated with direction).</li>
</ul>
</li>
</ul>
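<p>The Douglas-Peucker step can be sketched with the textbook recursive formulation (MolRec&rsquo;s actual implementation and threshold handling are not published; the tolerance here stands in for the 1&ndash;2&times; average line width):</p>

```python
import math

def perp_dist(pt, a, b):
    # perpendicular distance from pt to the line through a and b
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, eps):
    # keep the interior point farthest from the chord if it exceeds eps;
    # otherwise collapse the run to its endpoints (removes scan noise
    # while preserving genuine corners)
    if len(points) < 3:
        return points
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        return [points[0], points[-1]]
    left = douglas_peucker(points[: idx + 1], eps)
    right = douglas_peucker(points[idx:], eps)
    return left[:-1] + right
```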
<p><strong>2. Rule Engine</strong></p>
<ul>
<li><strong>Input</strong>: Geometric primitives (segments, circles, triangles, arrows, character groups).</li>
<li><strong>Structure</strong>: 18 rewrite rules.</li>
<li><strong>Priority</strong>: Two rules for <strong>Bridge Bonds</strong> (Open/Closed) are applied <em>first</em>.</li>
<li><strong>Standard Rules</strong>: 16 rules applied in arbitrary order for standard bonds (Single, Double, Triple, Wedge, Dative, etc.).</li>
<li><strong>Implicit Nodes</strong>: Some rules handle carbon atoms that are implicit at bond junctions: they detect double or triple bonds and, by splitting bonds at the implicit nodes, emit new geometric objects for further processing.</li>
<li><strong>Example Rule (Wavy Bond)</strong>:
<ul>
<li><em>Condition 1</em>: Set of line segments $L$ where $n \ge 3$.</li>
<li><em>Condition 2</em>: Segment lengths match &ldquo;dash length&rdquo; parameter.</li>
<li><em>Condition 3</em>: All elements are connected.</li>
<li><em>Condition 4</em>: Center points are approximately collinear.</li>
<li><em>Condition 5</em>: Endpoints form a single sequence (end elements have 1 neighbor, internal have 2).</li>
<li><em>Condition 6</em>: The two unconnected endpoints must be the pair of endpoints that are furthest apart.</li>
<li><em>Consequence</em>: Replace $L$ with a Wavy Bond between the furthest two endpoints. The bond has unknown direction.</li>
</ul>
</li>
</ul>
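<p>Condition 4 (approximately collinear center points) can be checked with a perpendicular-distance test; a sketch under the assumption that the reference line runs through the first and last centers, since the paper does not give the exact tolerance:</p>

```python
import math

def approx_collinear(centers, tol):
    # every interior center must lie within tol of the line through the
    # first and last centers (distance via the 2D cross product)
    (x1, y1), (x2, y2) = centers[0], centers[-1]
    norm = math.hypot(x2 - x1, y2 - y1) or 1.0
    return all(
        abs((x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)) / norm <= tol
        for x, y in centers[1:-1]
    )
```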
<h3 id="models">Models</h3>
<p>MolRec is a <strong>rule-based system</strong> and does not use trained deep learning models or weights.</p>
<ul>
<li><strong>Superatoms</strong>: Uses a dictionary look-up to resolve character groups representing superatoms into subgraphs.</li>
<li><strong>Disambiguation</strong>: Context-based logic is applied <em>after</em> graph construction to resolve ambiguities (e.g., distinguishing vertical bond <code>|</code> from letter <code>I</code> or digit <code>1</code>).</li>
</ul>
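<p>An illustrative superatom look-up: the entries below are common abbreviations written as SMILES fragments, not MolRec&rsquo;s actual dictionary, which is unpublished.</p>

```python
# hypothetical dictionary mapping character groups to substructures
SUPERATOMS = {
    "Me": "C",          # methyl
    "Et": "CC",         # ethyl
    "OMe": "OC",        # methoxy
    "Ph": "c1ccccc1",   # phenyl
    "COOH": "C(=O)O",   # carboxyl
}

def expand_superatom(label):
    # returns the subgraph for a known superatom, None for plain labels
    return SUPERATOMS.get(label)
```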
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Run 3</th>
          <th>Run 4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Auto (865 images)</td>
          <td>96.18% (832/865)</td>
          <td>94.91% (821/865)</td>
          <td>94.91% (821/865)</td>
          <td>96.18% (832/865)</td>
      </tr>
      <tr>
          <td>Manual (95 images)</td>
          <td>46.32% (44/95)</td>
          <td>58.95% (56/95)</td>
          <td>46.32% (44/95)</td>
          <td>56.84% (54/95)</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Parameters</strong>:</p>
<ul>
<li><strong>Dash Length</strong>: Range of acceptable values for dashed lines.</li>
<li><strong>Simplification Threshold</strong>: 1-2x average line width for Douglas-Peucker.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">CLEF 2012 Workshop Paper</a></td>
          <td>Other</td>
          <td>Open Access</td>
          <td>CEUR Workshop Proceedings</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-classification-closed">Reproducibility Classification: Closed</h3>
<p>No source code for the MolRec system has been publicly released. The CLEF 2012 evaluation dataset was distributed to task participants and is not openly available. The rule-based algorithm is described in sufficient detail to re-implement, but exact parameter values and the character classification training set are not fully specified. No hardware or compute requirements are reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012 &ndash; Overview and Analysis of Results. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawi2012molrec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolRec at CLEF 2012--Overview and Analysis of Results}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sadawi, Noureddin M and Sexton, Alan P and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CLEF-IP 2012: Patent and Chemical Structure Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/clef-ip-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/clef-ip-2012/</guid><description>Overview of the CLEF-IP 2012 benchmarking lab focusing on patent passage retrieval, flowchart recognition, and chemical structure extraction.</description><content:encoded><![CDATA[<h2 id="patent-retrieval-and-the-clef-ip-2012-benchmark">Patent Retrieval and the CLEF-IP 2012 Benchmark</h2>
<p>This is a <strong>Resource</strong> paper (benchmark infrastructure). It establishes a standardized test bed for the Intellectual Property (IP) Information Retrieval community by defining tasks, curating datasets (topics and relevance judgments), and establishing evaluation protocols. The paper does not propose a new method itself but aggregates and analyzes the performance of participant systems on these shared tasks.</p>
<h2 id="motivation-for-standardized-ip-information-retrieval">Motivation for Standardized IP Information Retrieval</h2>
<p>The volume of patent applications is increasing rapidly, necessitating automated methods to help patent experts find prior art and classify documents.</p>
<ul>
<li><strong>Economic Impact:</strong> Thorough searches are critical due to the high economic value of granted patents.</li>
<li><strong>Complexity:</strong> Patent work-flows are specific; examiners need to find prior art for specific <em>claims</em> alongside whole documents, and often rely on non-textual data like flowcharts and chemical diagrams.</li>
<li><strong>Gap:</strong> Existing general IR tools are insufficient for the specific granularity (passages, images, structures) required in the IP domain.</li>
</ul>
<h2 id="novel-multi-modal-tasks-claims-flowcharts-and-chemicals">Novel Multi-modal Tasks: Claims, Flowcharts, and Chemicals</h2>
<p>The 2012 edition of the lab introduced three specific tasks targeting different modalities of patent data:</p>
<ol>
<li><strong>Passage Retrieval starting from Claims:</strong> Moving beyond document-level retrieval to identifying specific relevant passages based on claim text.</li>
<li><strong>Flowchart Recognition:</strong> A new image analysis task requiring the extraction of structural information (nodes, edges, text) from patent images.</li>
<li><strong>Chemical Structure Recognition:</strong> A dual task of segmenting molecular diagrams from full pages and recognizing them into structural files (MOL), specifically addressing the challenge of Markush structures in patents.</li>
</ol>
<h2 id="benchmarking-setup-and-evaluation">Benchmarking Setup and Evaluation</h2>
<p>The &ldquo;experiments&rdquo; were the benchmarking tasks themselves, performed by participants (e.g., University of Birmingham, SAIC, TU Vienna).</p>
<ul>
<li><strong>Passage Retrieval:</strong> Participants retrieved documents and passages for 105 test topics (sets of claims) from a corpus of 1.5 million patents. Performance was measured using PRES, Recall, and MAP at the document level, and AP/Precision at the passage level.</li>
<li><strong>Flowchart Recognition:</strong> Participants extracted graph structures from 100 test images. Evaluation compared the submitted graphs to ground truth using a distance metric based on the Maximum Common Subgraph (MCS).</li>
<li><strong>Chemical Structure:</strong>
<ul>
<li><em>Segmentation:</em> Identifying bounding boxes of chemical structures in 30 multipage TIFF patents.</li>
<li><em>Recognition:</em> Converting 865 &ldquo;automatic&rdquo; (standard MOL) and 95 &ldquo;manual&rdquo; (Markush/complex) diagrams into structure files.</li>
</ul>
</li>
</ul>
<h2 id="key-findings-and-baseline-results">Key Findings and Baseline Results</h2>
<ul>
<li><strong>Passage Retrieval:</strong> Approaches varied from two-step retrieval (document then passage) to full NLP techniques. Translation tools were universally used due to the multilingual corpus (English, German, French).</li>
<li><strong>Chemical Recognition:</strong> The best performing system (UoB, run uob-4) achieved 92% recall on total structures (886/960), with 96% on the automatic set and 57% on the manual set. SAIC achieved 83% total recall. The manual evaluation highlighted a critical need for standards extending MOL files to support Markush structures, which are common in patents but poorly supported by current tools.</li>
<li><strong>Flowchart Recognition:</strong> The evaluation was not completed at the time the workshop notes were written. It required a combination of structural matching and edit distance for text labels because OCR outputs rarely &ldquo;hard-matched&rdquo; the gold standard.</li>
</ul>
<h3 id="chemical-structure-recognition-results">Chemical Structure Recognition Results</h3>
<p><strong>Segmentation</strong> (SAIC, best run using OSRA native rendering):</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>$F_1$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.708</td>
          <td>0.686</td>
          <td>0.697</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.793</td>
          <td>0.769</td>
          <td>0.781</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.821</td>
          <td>0.795</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.867</td>
          <td>0.840</td>
          <td>0.853</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.887</td>
          <td>0.860</td>
          <td>0.873</td>
      </tr>
  </tbody>
</table>
<p><strong>Recognition</strong> (automatic and manual sets):</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Auto (#/865)</th>
          <th>Auto %</th>
          <th>Manual (#/95)</th>
          <th>Manual %</th>
          <th>Total (#/960)</th>
          <th>Total %</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SAIC</td>
          <td>761</td>
          <td>88%</td>
          <td>38</td>
          <td>40%</td>
          <td>799</td>
          <td>83%</td>
      </tr>
      <tr>
          <td>UoB-1</td>
          <td>832</td>
          <td>96%</td>
          <td>44</td>
          <td>46%</td>
          <td>876</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>UoB-2</td>
          <td>821</td>
          <td>95%</td>
          <td>56</td>
          <td>59%</td>
          <td>877</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>UoB-3</td>
          <td>821</td>
          <td>95%</td>
          <td>44</td>
          <td>46%</td>
          <td>865</td>
          <td>90%</td>
      </tr>
      <tr>
          <td>UoB-4</td>
          <td>832</td>
          <td>96%</td>
          <td>54</td>
          <td>57%</td>
          <td>886</td>
          <td>92%</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The collection focuses on European Patent Office (EPO) and WIPO documents published up to 2002.</p>
<p><strong>1. Passage Retrieval Data</strong></p>
<ul>
<li><strong>Corpus:</strong> &gt;1.5 million XML patent documents (EP and WO sources).</li>
<li><strong>Training Set:</strong> 51 topics (sets of claims) with relevance judgments (18 DE, 21 EN, 12 FR).</li>
<li><strong>Test Set:</strong> 105 topics (35 per language).</li>
<li><strong>Topic Source:</strong> Extracted manually from search reports listing &ldquo;X&rdquo; or &ldquo;Y&rdquo; citations (highly relevant prior art).</li>
</ul>
<p><strong>2. Flowchart Data</strong></p>
<ul>
<li><strong>Format:</strong> Black and white TIFF images.</li>
<li><strong>Training Set:</strong> 50 images with textual graph representations.</li>
<li><strong>Test Set:</strong> 100 images.</li>
<li><strong>Ground Truth:</strong> A defined textual format describing nodes (<code>NO</code>), directed edges (<code>DE</code>), undirected edges (<code>UE</code>), and meta-data (<code>MT</code>).</li>
</ul>
<p><strong>3. Chemical Structure Data</strong></p>
<ul>
<li><strong>Segmentation:</strong> 30 patent files rendered as 300dpi monochrome multipage TIFFs.</li>
<li><strong>Recognition (Automatic Set):</strong> 865 diagram images fully representable in standard MOL format.</li>
<li><strong>Recognition (Manual Set):</strong> 95 diagram images containing Markush structures or variability not supported by standard MOL.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Ground Truth Generation:</strong></p>
<ul>
<li><strong>Qrels Generator:</strong> An in-house tool was used to manually map search report citations to specific XML passages (XPaths) for the passage retrieval task.</li>
<li><strong>McGregor Algorithm:</strong> Used for the flowchart evaluation to compute the Maximum Common Subgraph (MCS) between participant submissions and ground truth.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Passage Retrieval Metrics:</strong></p>
<ul>
<li><strong>Document Level:</strong> PRES (Patent Retrieval Evaluation Score), Recall, MAP. Cut-off at 100 documents.</li>
<li><strong>Passage Level:</strong> $AP(D)$ (Average Precision at document level) and $Precision(D)$ (Precision at document level), averaged across all relevant documents for a topic.</li>
</ul>
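<p>For illustration, standard average precision over a ranked list; the organizers&rsquo; exact per-document variant $AP(D)$ may differ in how it restricts the pool, so treat this as the generic formula:</p>

```python
def average_precision(ranked, relevant):
    # ranked: retrieved items in rank order; relevant: set of relevant items
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at each relevant hit
    return score / len(relevant) if relevant else 0.0
```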
<p><strong>Flowchart Recognition Metric:</strong></p>
<ul>
<li><strong>Graph Distance ($d$):</strong> Defined quantitatively based on the Maximum Common Subgraph (MCS) between a target flowchart ($F_t$) and a submitted flowchart ($F_s$):
$$
\begin{aligned}
d(F_t, F_s) &amp;= 1 - \frac{|mcs(F_t, F_s)|}{|F_t| + |F_s| - |mcs(F_t, F_s)|}
\end{aligned}
$$
where $|F|$ represents the size of the graph (nodes + edges).</li>
<li><strong>Levels:</strong> Evaluated at three levels: Basic (structure only), Intermediate (structure + node types), and Complete (structure + types + text labels).</li>
</ul>
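<p>The distance is a Jaccard-style dissimilarity over graph sizes; a direct transcription of the definition above, with sizes taken as node + edge counts:</p>

```python
def graph_distance(mcs_size, ft_size, fs_size):
    # d = 1 - |mcs| / (|Ft| + |Fs| - |mcs|): 0 for identical graphs,
    # approaching 1 as the maximum common subgraph shrinks
    return 1 - mcs_size / (ft_size + fs_size - mcs_size)
```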
<p><strong>Chemical Structure Metrics:</strong></p>
<ul>
<li><strong>Segmentation:</strong> Precision, Recall, and $F_1$ based on bounding box matches. A match is valid if borders are within a tolerance (0 to 55 pixels).</li>
<li><strong>Recognition:</strong>
<ul>
<li><em>Automatic:</em> Comparison of InChI strings generated by Open Babel.</li>
<li><em>Manual:</em> Visual comparison of images rendered by MarvinView.</li>
</ul>
</li>
</ul>
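<p>The segmentation match criterion can be sketched as a per-border tolerance test. The box representation (left, top, right, bottom) is an assumption; the organizers&rsquo; in-house comparator was not released.</p>

```python
def boxes_match(pred, gold, tol):
    # a predicted box matches the ground truth when every border deviates
    # by at most tol pixels (tol swept from 0 to 55 in the evaluation)
    return all(abs(p - g) <= tol for p, g in zip(pred, gold))
```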
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p>The CLEF-IP 2012 benchmark data was distributed to registered participants through the CLEF evaluation framework. The patent corpus is derived from the MAREC dataset (EPO and WIPO documents published until 2002). Evaluation tools for segmentation (bounding box comparison) and recognition (InChI comparison via Open Babel) were developed in-house by the organizers. The McGregor algorithm implementation for flowchart evaluation was also custom-built.</p>
<p>No public code repositories or pre-trained models are associated with this paper, as it is a benchmarking infrastructure paper. The evaluation protocols and data formats are fully described in the paper.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://www.ifs.tuwien.ac.at/~clef-ip">CLEF-IP 2012 data</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Distributed to registered CLEF participants; no persistent public archive</td>
      </tr>
      <tr>
          <td><a href="https://www.ir-facility.org/prototypes/marec">MAREC corpus</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Source patent corpus (EPO/WIPO documents up to 2002)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Status</strong>: Partially Reproducible</li>
<li><strong>Missing components</strong>: The benchmark datasets were distributed to participants and are not hosted on a persistent public repository. The in-house evaluation tools (qrels generator, segmentation comparator, flowchart distance calculator) are not publicly released.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Piroi, F., Lupu, M., Hanbury, A., Magdy, W., Sexton, A. P., &amp; Filippov, I. (2012). CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain. <em>CLEF 2012 Working Notes</em>, CEUR Workshop Proceedings, Vol. 1178.</p>
<p><strong>Publication</strong>: CLEF 2012 Working Notes (CEUR-WS Vol. 1178)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{piroi2012clefip,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Piroi, Florina and Lupu, Mihai and Hanbury, Allan and Magdy, Walid and Sexton, Alan P. and Filippov, Igor}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{CLEF 2012 Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{CEUR Workshop Proceedings}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1178}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{CEUR-WS.org}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-PiroiEt2012.pdf}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader Image-to-Structure OCR at TREC 2011 Chemical IR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</guid><description>ChemReader OCR software evaluation on TREC 2011 Chemical IR campaign achieving 93% accuracy on image-to-structure task.</description><content:encoded><![CDATA[<h2 id="methodological-application-applying-chemreader-to-chemical-ocr">Methodological Application: Applying ChemReader to Chemical OCR</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$).</p>
<p>The dominant vector is $\Psi_{\text{Method}}$ because the paper&rsquo;s core contribution is the empirical evaluation and refinement of ChemReader on the Image-to-Structure (I2S) task. The rhetorical indicators align with this classification: quantitative performance metrics, a detailed <strong>error analysis</strong>, and a focus on <strong>how well the system works</strong> and how its underlying algorithms need refinement.</p>
<h2 id="motivation-bridging-the-gap-in-image-to-structure-tasks">Motivation: Bridging the Gap in Image-to-Structure Tasks</h2>
<p>The motivation is two-fold:</p>
<ol>
<li>
<p><strong>Scientific Need</strong>: Traditional text-based chemical mining methods cannot utilize image data in scientific literature. Chemical OCR software is required to extract 2D chemical structure diagrams from raster images and convert them into a machine-readable chemical file format, paving the way for advanced chemical literature mining.</p>
</li>
<li>
<p><strong>Benchmark Participation</strong>: The immediate motivation was participation in the <strong>TREC Chemical IR campaign&rsquo;s Image-to-Structure (I2S) task</strong>, which was designed to evaluate existing chemical OCR software and establish a platform for developing chemical information retrieval techniques utilizing image data.</p>
</li>
</ol>
<h2 id="novelty-benchmark-evaluation-and-error-analysis-of-chemreader">Novelty: Benchmark Evaluation and Error Analysis of ChemReader</h2>
<p>ChemReader was previously introduced in earlier publications and is a chemical OCR system tailored to a chemical database annotation scheme. The novelty of this paper lies in <strong>evaluating ChemReader within the formal I2S benchmark setting</strong> and conducting a detailed <strong>error analysis</strong> of its performance. After fixing a stereo bond omission and a corner detection bug discovered during the evaluation, ChemReader achieved 93% accuracy (930/1000) on the benchmark test set.</p>
<h2 id="experimental-setup-the-trec-2011-i2s-challenge">Experimental Setup: The TREC 2011 I2S Challenge</h2>
<p>The experiment was the application of the ChemReader software to the <strong>Image-to-Structure (I2S) task</strong> of the TREC Chemical IR campaign.</p>
<ul>
<li><strong>Setup</strong>: The software was used to process image data provided for the I2S task.</li>
<li><strong>Evaluation</strong>: The system was initially evaluated, revealing two issues: the omission of <strong>bond stereo types</strong> in the output structures and a bug in the <strong>corner detection</strong> code that failed on lines touching the image boundary. Each issue lowered accuracy by approximately 10%.</li>
<li><strong>Analysis</strong>: After fixing these issues, ChemReader was re-evaluated on the full 1000-image test set (<strong>Test III</strong>). A detailed error analysis was then conducted on 20 randomly selected samples from Test III results.</li>
</ul>
<h2 id="training-progress">Training Progress</h2>
<p>The paper reports three rounds of major training, each improving accuracy by approximately 15% (Figure 1 in the paper shows the progression):</p>
<ul>
<li><strong>Initial (untrained)</strong>: 57% accuracy on 100 selected training images</li>
<li><strong>Key changes</strong>: deactivating unnecessary heuristic algorithms (resizing, de-noising, line merging), limiting the character set, switching to a lightweight chemical dictionary, and fixing precision loss from type conversions</li>
</ul>
<h2 id="outcomes-high-accuracy-hindered-by-complex-connectivity-rules">Outcomes: High Accuracy Hindered by Complex Connectivity Rules</h2>
<ul>
<li>
<p><strong>Submitted Results</strong>: Test I achieved 691/1000 correct outputs (avg. Tanimoto similarity 0.9769), and Test II achieved 689/1000 (avg. Tanimoto similarity 0.9823). Both scored lower than training accuracy due to the stereo bond omission and corner detection bug.</p>
</li>
<li>
<p><strong>Key Finding</strong>: After fixing these two issues, ChemReader achieved <strong>93% accuracy</strong> (930/1000) on the I2S task (Test III), comparable to the highest accuracy among participants.</p>
</li>
<li>
<p><strong>Limitation/Future Direction</strong>: A detailed <strong>error analysis</strong> on 20 randomly selected samples from Test III (Table 2) showed that the software requires the incorporation of <strong>more chemical intelligence in its algorithms</strong> to address remaining systematic errors. The most frequent errors were:</p>
<ul>
<li>Wrongly merged nodes: 6 samples (30%), caused by nodes too close to be distinguished by a distance threshold</li>
<li>Missed bonds: 4 samples (20%), caused by filtering out short line segments</li>
<li>Nonstandard representations: noise symbols confusing the system, nonstandard wedge/hatched bond styles, and 3D crossing bonds that ChemReader cannot interpret</li>
</ul>
</li>
</ul>
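<p>The average Tanimoto similarity reported above is the standard set-overlap measure computed on molecular fingerprint bits. A minimal set-based sketch (the campaign&rsquo;s exact fingerprint type is not specified here):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Example: fingerprints sharing 2 of 4 total bits
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # → 0.5
```

<p>An average similarity near 0.99, as in Test III, suggests that even the 70 incorrect outputs were typically close to the ground-truth structure rather than wholly wrong.</p>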
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Training</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Training Set</td>
          <td style="text-align: left">1000 images (100 used for quick eval)</td>
          <td style="text-align: left">TIF format, one chemical structure per image</td>
      </tr>
      <tr>
          <td style="text-align: left">Evaluation</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Test Set</td>
          <td style="text-align: left">1000 images (20 sampled for error analysis)</td>
          <td style="text-align: left">Same format constraints; 930/1000 correct in Test III</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>ChemReader is a <strong>chemical Optical Character Recognition (OCR) system</strong> with a 17-step pipeline:</p>
<ol>
<li><strong>Pixel clustering</strong>: Region-growing to identify the chemical structure region</li>
<li><strong>Preprocessing</strong>: Resizing, de-noising, and bond length estimation (deactivated for I2S task)</li>
<li><strong>Text identification</strong>: Connected components with similar heights/areas labeled as characters</li>
<li><strong>Benzene ring detection</strong>: Identifying circles representing aromatic bonds</li>
<li><strong>Hatched bond detection</strong>: Finding short collinear line segments of uniform length</li>
<li><strong>Skeletonization</strong>: Thinning bond pixels for downstream processing</li>
<li><strong>Ring structure detection</strong>: Pentagonal/hexagonal rings via Generalized Hough Transformation (GHT)</li>
<li><strong>Line detection</strong>: Modified Hough Transformation with corner detection for bond extraction</li>
<li><strong>Line filtering</strong>: Removing spurious short segments</li>
<li><strong>Secondary text identification</strong>: Re-examining unidentified fragments for text</li>
<li><strong>Character recognition</strong>: Dual-engine approach (GOCR template matching + Euclidean distance-based engine)</li>
<li><strong>Chemical spell checker</strong>: Matching against a dictionary of 770 chemical abbreviations</li>
<li><strong>Secondary line detection</strong>: Re-running line detection on remaining pixels</li>
<li><strong>Line merging/breaking</strong>: Combining fragmented bonds or splitting at junction nodes</li>
<li><strong>Graph construction</strong>: Creating nodes from bond endpoints and chemical symbol centers, merging nearby nodes</li>
<li><strong>Connected component selection</strong>: Selecting the largest graph component</li>
<li><strong>Output</strong>: Connection table in machine-readable format</li>
</ol>
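<p>Step 15&rsquo;s node merging is driven by a distance threshold, which is also the source of the most frequent error in the paper&rsquo;s analysis (&ldquo;wrongly merged nodes&rdquo;). The sketch below is a hypothetical stand-in for that step, not ChemReader&rsquo;s actual (closed-source) code:</p>

```python
import math

def merge_nearby_nodes(nodes, threshold):
    """Greedily merge 2-D points closer than `threshold` into centroids.
    Illustrative sketch of a distance-threshold graph-construction step:
    atoms drawn too close collapse into a single node."""
    merged = []  # entries are (x, y, member_count)
    for x, y in nodes:
        for i, (mx, my, count) in enumerate(merged):
            if math.hypot(x - mx, y - my) < threshold:
                # fold the point into the existing cluster centroid
                merged[i] = ((mx * count + x) / (count + 1),
                             (my * count + y) / (count + 1),
                             count + 1)
                break
        else:
            merged.append((x, y, 1))
    return [(x, y) for x, y, _ in merged]

nodes = [(0, 0), (0.5, 0.0), (10, 10)]
print(merge_nearby_nodes(nodes, threshold=2.0))  # → [(0.25, 0.0), (10, 10)]
```

<p>Two distinct atoms drawn closer than the threshold collapse into one centroid, which is exactly the failure mode behind 30% of the analyzed errors.</p>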
<h3 id="models">Models</h3>
<p>ChemReader is a rule-based system relying on traditional computer vision (Hough Transformation, region growing, skeletonization) and template-based character recognition. It does not use machine-learned model architectures such as CNNs or other neural networks.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Test</th>
          <th style="text-align: left">Correct Outputs</th>
          <th style="text-align: left">Avg. Tanimoto Similarity</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Test I (submitted)</td>
          <td style="text-align: left">691/1000</td>
          <td style="text-align: left">0.9769</td>
          <td style="text-align: left">Original submission</td>
      </tr>
      <tr>
          <td style="text-align: left">Test II (submitted)</td>
          <td style="text-align: left">689/1000</td>
          <td style="text-align: left">0.9823</td>
          <td style="text-align: left">Alternative parameter setting</td>
      </tr>
      <tr>
          <td style="text-align: left">Test III (post-fix)</td>
          <td style="text-align: left">930/1000 (93%)</td>
          <td style="text-align: left">0.9913</td>
          <td style="text-align: left">After fixing stereo bond omission and corner detection bug</td>
      </tr>
  </tbody>
</table>
<p><strong>Error Breakdown</strong> (from 20-sample analysis of Test III):</p>
<ul>
<li>Wrongly merged nodes: 6 (30%)</li>
<li>Missed bonds: 4 (20%)</li>
<li>Nonstandard representations (noise symbols, nonstandard wedge/hatched bonds, 3D crossing bonds): remaining errors</li>
</ul>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>ChemReader&rsquo;s source code is not publicly available. The TREC 2011 Chemical IR I2S image sets were distributed to task participants but are not openly hosted. No pre-trained models apply (rule-based system). The paper provides a detailed algorithmic description (17-step pipeline) and parameter values, but full reproduction requires access to both the ChemReader codebase and the TREC image sets.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Li, Y., Rosania, G. R., &amp; Saitou, K. (2011). Image-to-Structure Task by ChemReader. <em>TREC 2011 Chemical IR Track Report</em>.</p>
<p><strong>Publication</strong>: TREC 2011 Chemical IR Track</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/papers/CHEM.OVERVIEW.pdf">TREC 2011 Chemical IR Track Overview</a></li>
<li><a href="/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/">ChemReader 2009 original paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{parkImagetoStructureTaskChemReader2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image-to-Structure Task by {ChemReader}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Park, Jungkap and Li, Ye and Rosania, Gus R. and Saitou, Kazuhiro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{University of Michigan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">type</span> = <span style="color:#e6db74">{TREC 2011 Chemical IR Track Report}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Reconstruction with chemoCR (2011)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</guid><description>A hybrid system combining pattern recognition and rule-based expert systems to reconstruct chemical structures from bitmap images.</description><content:encoded><![CDATA[<h2 id="contribution-the-chemocr-architecture">Contribution: The chemoCR Architecture</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper focuses entirely on the architecture and workflow of the <strong>chemoCR</strong> system. It proposes specific algorithmic innovations (texture-based vectorization, graph constraint exploration) and defines a comprehensive pipeline for converting raster images into semantic chemical graphs. The primary contribution is the system design and its operational efficacy.</p>
<h2 id="motivation-digitizing-image-locked-chemical-structures">Motivation: Digitizing Image-Locked Chemical Structures</h2>
<p>Chemical structures are the preferred language of chemistry, yet they are frequently locked in non-machine-readable formats (bitmap images like GIF, BMP) within patents and journals.</p>
<ul>
<li><strong>The Problem:</strong> Once published as images, chemical structure information is &ldquo;dead&rdquo; to analysis software.</li>
<li><strong>The Gap:</strong> Manual reconstruction is slow and error-prone. Existing tools struggled with the diversity of drawing styles (e.g., varying line thickness, font types, and non-standard bond representations).</li>
<li><strong>The Goal:</strong> To automate the conversion of these depictions into connection tables (SDF/MOL files) to make the data accessible for computational chemistry applications.</li>
</ul>
<h2 id="core-innovation-rule-based-semantic-object-identification">Core Innovation: Rule-Based Semantic Object Identification</h2>
<p>The system is based on a &ldquo;Semantic Entity&rdquo; approach that identifies chemically significant objects (chiral bonds, superatoms, reaction arrows) from structural formula depictions. Key technical innovations include:</p>
<ol>
<li><strong>Texture-based Vectorization:</strong> A new algorithm that computes local directions to vectorize lines, robust against varying drawing styles.</li>
<li><strong>Expert System Integration:</strong> A graph constraint exploration algorithm that applies an XML-based rule set to classify geometric primitives into chemical classes such as <code>BOND</code>, <code>DOUBLEBOND</code>, <code>TRIPLEBOND</code>, <code>BONDSET</code>, <code>DOTTED CHIRAL</code>, <code>STRINGASSOCIATION</code>, <code>DOT</code>, <code>RADICAL</code>, <code>REACTION</code>, <code>REACTION ARROW</code>, <code>REACTION PLUS</code>, <code>CHARGE</code>, and <code>UNKNOWN</code>.</li>
<li><strong>Validation Scoring:</strong> A built-in validation module that tests valences, bond lengths and angles, typical atom types, and fragments to assign a confidence score (0 to 1) to the reconstruction.</li>
</ol>
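<p>The validation module&rsquo;s 0-to-1 confidence score can be illustrated with the valence test alone. Everything below is an assumption-laden sketch: the <code>MAX_VALENCE</code> table and the fraction-of-valid-atoms scoring are illustrative, since chemoCR&rsquo;s actual rules and weighting are not published:</p>

```python
# Typical maximum valences for a few common organic atoms (illustrative subset).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def valence_score(atoms, bonds):
    """Fraction of atoms whose total bond order stays within typical
    valence -- a hypothetical stand-in for chemoCR's confidence score.
    atoms: dict id -> element symbol; bonds: list of (id_a, id_b, order)."""
    degree = {a: 0 for a in atoms}
    for a, b, order in bonds:
        degree[a] += order
        degree[b] += order
    ok = sum(1 for a, elem in atoms.items()
             if degree[a] <= MAX_VALENCE.get(elem, 4))
    return ok / len(atoms) if atoms else 0.0

# Ethanol-like fragment C-C-O: every atom within its valence limit
atoms = {1: "C", 2: "C", 3: "O"}
bonds = [(1, 2, 1), (2, 3, 1)]
print(valence_score(atoms, bonds))  # → 1.0
```

<p>A reconstruction that gives oxygen three bonds, say, would lose score here, flagging a likely misread bond or label.</p>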
<h2 id="experiments-the-trec-2011-image-to-structure-task">Experiments: The TREC 2011 Image-to-Structure Task</h2>
<p>The system was evaluated as part of the <strong>TREC 2011 Image-to-Structure (I2S) Task</strong>.</p>
<ul>
<li><strong>Dataset:</strong> 1,000 unique chemical structure images provided by USPTO.</li>
<li><strong>Configuration:</strong> The authors used chemoCR v0.93 in batch mode with a single pre-configured parameter set (&ldquo;Houben-Weyl&rdquo;), originally developed for the Houben-Weyl book series of organic chemistry reactions published by Thieme.</li>
<li><strong>Process:</strong> The workflow included image binarization, connected component analysis, OCR for atom labels, and final molecule assembly.</li>
<li><strong>Metric:</strong> Perfect match recall against ground-truth MOL files.</li>
</ul>
<h2 id="results-and-conclusions-expert-systems-vs-dirty-data">Results and Conclusions: Expert Systems vs. &ldquo;Dirty&rdquo; Data</h2>
<ul>
<li><strong>Performance:</strong> The system achieved a <strong>perfect match for 656 out of 1,000 structures (65.6%)</strong>.</li>
<li><strong>Error Analysis:</strong> Failures were primarily attributed to &ldquo;unclear semantics&rdquo; in drawing styles, such as:
<ul>
<li>Overlapping objects (e.g., atom labels clashing with bonds).</li>
<li>Ambiguous primitives (dots interpreted as both radicals and chiral centers).</li>
<li>Markush structures (variable groups), which were excluded from the I2S task definition. A prototype for Markush detection existed but was not used.</li>
</ul>
</li>
<li><strong>Limitations:</strong> The vectorizer cannot recognize curves and circles, only straight lines. Aromatic ring detection (via a heuristic that looks for a large &ldquo;O&rdquo; character in the center of a ring system) was switched off for the I2S task. The system maintained 12 different parameter sets for various drawing styles, and selecting the correct set was critical.</li>
<li><strong>Impact:</strong> Demonstrated that rule-based expert systems combined with standard pattern recognition could handle high-quality datasets effectively, though non-standard drawing styles remain a challenge.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper relies on the TREC 2011 I2S dataset, comprising images extracted from USPTO patents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 I2S</td>
          <td>1,000 images</td>
          <td>Binarized bitmaps from USPTO patents.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Internal Training Set</td>
          <td>Unknown</td>
          <td>Used to optimize parameter sets (e.g., &ldquo;Houben-Weyl&rdquo; set).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper describes three main workflow phases (preprocessing, semantic entity recognition, and molecule reconstruction plus validation), organized into four pipeline sections:</p>
<ol>
<li>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><em>Vaporizer Unit:</em> Erases parts of the image that are presumably not structure diagrams (e.g., text or other human-readable information), isolating the chemical depictions.</li>
<li><em>Connected Components:</em> Groups all foreground pixels that are 8-connected into components.</li>
<li><em>Text Tagging and OCR:</em> Identifies components that map to text areas and converts bitmap letters into characters.</li>
</ul>
</li>
<li>
<p><strong>Vectorization:</strong></p>
<ul>
<li><em>Algorithm:</em> <strong>Compute Local Directions</strong>. It analyzes segment clusters to detect ascending, descending, horizontal, and vertical trends in pixel data, converting them into vectors.</li>
<li><em>Feature:</em> Explicitly handles &ldquo;thick chirals&rdquo; (wedges) by computing orientation.</li>
</ul>
</li>
<li>
<p><strong>Reconstruction (Expert System):</strong></p>
<ul>
<li><em>Core Logic:</em> <strong>Graph Constraint Exploration</strong>. It visits connected components and evaluates them against an XML Rule Set.</li>
<li><em>Classification:</em> Objects are tagged with chemical keywords (e.g., <code>BONDSET</code> for ring systems and chains, <code>STRINGASSOCIATION</code> for atom labels, <code>DOTTED CHIRAL</code> for chiral bonds).</li>
<li><em>Rules:</em> Configurable via <code>chemoCRSettings.xml</code>. The successful rule with the highest priority value defines the annotation for each component.</li>
</ul>
</li>
<li>
<p><strong>Assembly &amp; Validation:</strong></p>
<ul>
<li>Combines classified vectors and OCR text into a semantic graph.</li>
<li><em>Superatoms:</em> Matches text groups against a loaded superatom database (e.g., &ldquo;COOH&rdquo;, &ldquo;Boc&rdquo;).</li>
<li><em>Validation:</em> Calculates a score (0-1) based on chemical feasibility (valences, bond lengths and angles, typical atom types, and fragments).</li>
</ul>
</li>
</ol>
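<p>The connected-components step groups all 8-connected foreground pixels into components. A self-contained sketch of that grouping (illustrative, not chemoCR&rsquo;s implementation):</p>

```python
def connected_components_8(grid):
    """Group 8-connected foreground pixels (value 1) into components
    via iterative flood fill. grid is a rectangular list of lists."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and (r, c) not in seen:
                stack, comp = [(r, c)], []
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    # examine the full 8-neighbourhood, diagonals included
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and grid[ny][nx] == 1
                                    and (ny, nx) not in seen):
                                seen.add((ny, nx))
                                stack.append((ny, nx))
                components.append(comp)
    return components

grid = [[1, 0, 0],
        [0, 1, 0],  # diagonal neighbours: one component under 8-connectivity
        [0, 0, 0]]
print(len(connected_components_8(grid)))  # → 1
```

<p>Under 4-connectivity the two diagonal pixels above would split into separate components; 8-connectivity keeps thin diagonal bond strokes intact.</p>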
<h3 id="models">Models</h3>
<p>The system is primarily rule-based but utilizes machine learning components for specific sub-tasks:</p>
<ul>
<li><strong>OCR:</strong> A trainable OCR module using supervised machine learning to recognize atom labels (H, C, N, O). The specific classifier is not detailed in the paper.</li>
<li><strong>Rule Base:</strong> An XML file containing the expert system logic. This is the &ldquo;model&rdquo; for structural interpretation.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed strictly within the context of the TREC competition.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall (Perfect Match)</td>
          <td>656 / 1000</td>
          <td>N/A</td>
          <td>Strict structural identity required.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software Stack:</strong> Platform-independent JAVA libraries.</li>
<li><strong>Compute:</strong> Batch mode processing supported; specific hardware requirements (CPU/RAM) were not disclosed.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>chemoCR (Fraunhofer SCAI)</td>
          <td>Software</td>
          <td>Unknown</td>
          <td>Project page defunct; tool was proprietary</td>
      </tr>
      <tr>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemoCR.chem.update.pdf">TREC 2011 Proceedings Paper</a></td>
          <td>Paper</td>
          <td>Public</td>
          <td>Official NIST proceedings</td>
      </tr>
  </tbody>
</table>
<p>No source code was publicly released. The chemoCR system was a proprietary tool from Fraunhofer SCAI. The TREC 2011 I2S dataset was distributed to competition participants and is not independently hosted.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zimmermann, M. (2011). Chemical Structure Reconstruction with chemoCR. <em>TREC 2011 Proceedings</em>.</p>
<p><strong>Publication</strong>: Text REtrieval Conference (TREC) 2011</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zimmermannChemicalStructureReconstruction2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Structure Reconstruction with {{chemoCR}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Text {{REtrieval Conference}} ({{TREC}}) 2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zimmermann, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Structural Analysis of Handwritten Chemical Formulas</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</guid><description>A 1999 methodology for recognizing handwritten chemical structures using a structural graph representation and recursive specialists.</description><content:encoded><![CDATA[<h2 id="contribution-structural-approach-to-document-analysis">Contribution: Structural Approach to Document Analysis</h2>
<p><strong>Method</strong>.
This paper proposes a system architecture for document analysis. It introduces a specific pipeline (Global Perception followed by Incremental Extraction) and validates this strategy with recognition rates on specific tasks. The core contribution is the shift from bitmap-based processing to a <strong>structural graph representation</strong> of graphical primitives.</p>
<h2 id="motivation-overcoming-bitmap-limitations-in-freehand-drawings">Motivation: Overcoming Bitmap Limitations in Freehand Drawings</h2>
<ul>
<li><strong>Complexity of Freehand</strong>: Freehand drawings contain fluctuating lines and noise that make standard vectorization techniques difficult to apply directly.</li>
<li><strong>Limitation of Bitmap Analysis</strong>: Most existing systems at the time attempted to interpret the document by working directly on the static bitmap image throughout the process.</li>
<li><strong>Need for Context</strong>: Interpretation requires a dynamic resource that can evolve as knowledge is extracted (e.g., recognizing a polygon changes the context for its neighbors).</li>
</ul>
<h2 id="novelty-dynamic-structural-graphs-and-recursive-specialists">Novelty: Dynamic Structural Graphs and Recursive Specialists</h2>
<p>The authors propose a <strong>Structural Representation</strong> as the unique resource for interpretation.</p>
<ul>
<li><strong>Quadrilateral Primitives</strong>: The system builds Quadrilaterals (pairs of vectors) to represent thin shapes, which are robust to handwriting fluctuations.</li>
<li><strong>Structural Graph</strong>: These primitives are organized into a graph where arcs represent geometric relationships (T-junctions, L-junctions, parallels).</li>
<li><strong>Specialist Agents</strong>: Interpretation is driven by independent modules (specialists) that browse this graph recursively to identify high-level chemical entities like rings (polygons) or chains.</li>
</ul>
<h2 id="experimental-setup-and-outcomes">Experimental Setup and Outcomes</h2>
<ul>
<li><strong>Validation Set</strong>: The system was tested on 20 handwritten off-line documents containing chemical formulas at 300 dpi resolution.</li>
<li><strong>Text Database</strong>: A separate base of 328 models was used for the text recognition component.</li>
<li><strong>High Graphical Accuracy</strong>: The system achieved a $\approx 97\%$ recognition rate for graphical parts (chemical elements like rings and bonds).</li>
<li><strong>Text Recognition</strong>: The text recognition module achieved a $\approx 93\%$ success rate.</li>
<li><strong>Robustness</strong>: The structural graph approach successfully handled multiple liaisons, polygons, and chains, and allowed for the progressive construction of a solution consistent with the context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Handwritten Documents</td>
          <td>20 docs</td>
          <td>Off-line documents at 300 dpi</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Character Models</td>
          <td>328 models</td>
          <td>Used for the Pattern Matching text recognition base</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The interpretation process is divided into two distinct phases:</p>
<p><strong>1. Global Perception (Graph Construction)</strong></p>
<ul>
<li><strong>Vectorization</strong>: Contour tracking produces a chain of vectors, which are simplified via iterative polygonal approximation until fusion stabilizes (2-5 iterations).</li>
<li><strong>Quadrilateral Formation</strong>: Vectors are paired to form quadrilaterals based on Euclidean distance and &ldquo;empirical&rdquo; alignment criteria.</li>
<li><strong>Graph Generation</strong>: Quadrilaterals become nodes. Arcs are created based on &ldquo;zones of influence&rdquo; and classified into 5 types: T-junction, Intersection (X), Parallel (//), L-junction, and Successive (S).</li>
<li><strong>Redraw Heuristic</strong>: A pre-processing step transforms T, X, and S junctions into L or // relations, as chemical drawings primarily consist of L-junctions and parallels.</li>
</ul>
<p><strong>2. Specialists (Interpretation)</strong></p>
<ul>
<li><strong>Liaison Specialist</strong>: Scans the graph for // arcs or quadrilaterals with free extremities to identify bonds.</li>
<li><strong>Polygon/Chain Specialist</strong>: Uses recursive <code>look-left</code> and <code>look-right</code> procedures. If a search returns to the start node after $n$ steps, a polygon is detected.</li>
<li><strong>Text Localization</strong>: Clusters &ldquo;short&rdquo; quadrilaterals by physical proximity into &ldquo;focus zones&rdquo;. Zones are classified as text/non-text based on connected components.</li>
</ul>
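<p>The polygon specialist&rsquo;s &ldquo;return to the start node&rdquo; criterion is essentially cycle detection in the quadrilateral graph. A hedged sketch of that idea as a plain depth-first walk (not the paper&rsquo;s exact <code>look-left</code>/<code>look-right</code> procedures):</p>

```python
def find_polygon(adj, start, max_steps=8):
    """Walk the graph from `start`; report a polygon when a simple path
    of >= 3 nodes closes back on the start node. Illustrative sketch;
    adj maps each node to its neighbours (e.g. L-junction arcs)."""
    def walk(node, path):
        for nxt in adj.get(node, []):
            if nxt == start and len(path) >= 3:
                return path  # closed circuit of >= 3 sides: a polygon
            if nxt not in path and len(path) < max_steps:
                found = walk(nxt, path + [nxt])
                if found:
                    return found
        return None
    return walk(start, [start])

# A hexagonal ring drawn as six bond segments a-b-c-d-e-f-a
ring = {"a": ["b"], "b": ["c"], "c": ["d"],
        "d": ["e"], "e": ["f"], "f": ["a"]}
print(find_polygon(ring, "a"))  # → ['a', 'b', 'c', 'd', 'e', 'f']
```

<p>An open chain never closes back on its start node, so the same walk distinguishes rings from chains.</p>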
<h3 id="models">Models</h3>
<p><strong>Text Recognition Hybrid</strong>:</p>
<ol>
<li><strong>Normalization &amp; Pattern Matching</strong>: A classic method using the database of 328 models.</li>
<li><strong>Structural Rule Base</strong>: Uses &ldquo;significant&rdquo; quadrilaterals (length $\ge 1/3$ of zone dimension) to verify characters. A rule base defines the expected count of horizontal, vertical, right-diagonal, and left-diagonal lines for each character.</li>
</ol>
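<p>The structural rule base amounts to orientation-binned stroke counting. Everything below is illustrative: the <code>RULES</code> entries and the 2:1 orientation-binning ratio are assumptions, as the paper does not reproduce its actual rule base:</p>

```python
# Hypothetical expected stroke counts per character, in the order
# (horizontal, vertical, right-diagonal, left-diagonal).
RULES = {"H": (1, 2, 0, 0), "O": (0, 0, 0, 0)}

def classify_direction(dx, dy):
    """Bin a significant quadrilateral's axis vector into one of the
    four orientation classes used by the rule base (assumed 2:1 ratio)."""
    if abs(dx) > 2 * abs(dy):
        return 0  # horizontal
    if abs(dy) > 2 * abs(dx):
        return 1  # vertical
    return 2 if dx * dy > 0 else 3  # right- vs left-diagonal

def verify(char, strokes):
    """Check stroke-orientation counts against the rule for `char`."""
    counts = [0, 0, 0, 0]
    for dx, dy in strokes:
        counts[classify_direction(dx, dy)] += 1
    return tuple(counts) == RULES.get(char)

# 'H': two vertical bars plus one horizontal crossbar
strokes_H = [(0, 10), (0, 10), (10, 0)]
print(verify("H", strokes_H))  # → True
```

<p>Pattern matching proposes a character; this structural check then confirms or rejects it, which is the hybrid verification the paper describes.</p>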
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graphical Element Recognition</td>
          <td>~97%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents (Fig. 7 examples)</td>
      </tr>
      <tr>
          <td>Text Recognition</td>
          <td>~93%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ramel, J.-Y., Boissier, G., &amp; Emptoz, H. (1999). Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image. <em>Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR &lsquo;99)</em>, 83-86. <a href="https://doi.org/10.1109/ICDAR.1999.791730">https://doi.org/10.1109/ICDAR.1999.791730</a></p>
<p><strong>Publication</strong>: ICDAR 1999</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ramelAutomaticReadingHandwritten1999,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{Fifth International Conference}} on {{Document Analysis}} and {{Recognition}}. {{ICDAR}} &#39;99 ({{Cat}}. {{No}}.{{PR00318}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ramel, J.-Y. and Boissier, G. and Emptoz, H.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1999</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{83--86}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Bangalore, India}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1999.791730}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-0-7695-0318-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at TREC-CHEM 2011: Optical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</guid><description>A methodological overview of OSRA, an open-source pipeline for converting chemical structure images into machine-readable formats.</description><content:encoded><![CDATA[<h2 id="contribution-method-and-resource">Contribution: Method and Resource</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>, with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<p>It is Methodological because it details the specific algorithmic workflow (segmentation, binarization, vectorization, and rule-based recognition) used to translate pixel data into chemical semantics. It specifically addresses the &ldquo;Image2Structure&rdquo; task. It also serves as a Resource contribution by introducing OSRA as a free, open-source utility available to the community.</p>
<h2 id="motivation-limitations-of-standard-ocr-in-chemistry">Motivation: Limitations of Standard OCR in Chemistry</h2>
<p>A vast body of chemical information exists in journal publications and patents as two-dimensional structure diagrams. While human-readable, these images are inaccessible to machine data mining techniques like virtual screening. Standard Optical Character Recognition (OCR) is insufficient for this task. Widely used pattern-recognition techniques such as wavelet transforms or neural networks (as applied in face recognition) also fall short: chemical diagrams contain far more structural complexity than alphabet characters, and misinterpretation of a single element can yield a valid but incorrect molecule.</p>
<h2 id="core-innovation-chemistry-aware-heuristic-pipeline">Core Innovation: Chemistry-Aware Heuristic Pipeline</h2>
<p>The authors present a specialized pipeline distinct from standard OCR that combines image processing with domain-specific chemical logic. Key technical contributions include:</p>
<ul>
<li><strong>Entropy-based Page Segmentation</strong>: A statistical method using row entropy to distinguish between pages with mixed text/graphics and pages with single structures.</li>
<li><strong>Custom Binarization</strong>: A specific grayscale conversion ($Gr=\min(R,G,B)$).</li>
<li><strong>Heuristic Confidence Scoring</strong>: A linear &ldquo;confidence function&rdquo; derived from atom and ring counts to select the best structure resolution.</li>
<li><strong>Specialized Bond Recognition</strong>: Algorithms to detect bridge bonds, wedge/dashed bonds (3D info), and aromatic rings via inner circles.</li>
</ul>
<h2 id="methodology-evaluation-on-trec-chem-image2structure">Methodology: Evaluation on TREC-CHEM Image2Structure</h2>
<p>The system was validated through submission to the <strong>Image2Structure task of TREC-CHEM</strong>.</p>
<ul>
<li><strong>Version</strong>: OSRA version 1.3.8 was used without modifications.</li>
<li><strong>Setup</strong>: Two runs were submitted: one with default settings (automatic scale selection) and one fixed at 300 dpi.</li>
<li><strong>Data</strong>: The evaluation used a &ldquo;Training set&rdquo; and a &ldquo;Challenge Set&rdquo; provided by the task organizers.</li>
<li><strong>Metric</strong>: Recall rates were measured for both sets.</li>
</ul>
<h2 id="results-and-real-world-impact">Results and Real-World Impact</h2>
<ul>
<li><strong>Performance</strong>: The default settings achieved an <strong>84.3%</strong> recall on the training set and <strong>84.8%</strong> on the challenge set. The 300 dpi run performed slightly better (86.1% training, 85.6% challenge).</li>
<li><strong>Utility</strong>: The tool is widely used by academic and commercial researchers to extract data from patents (USPTO, JPO).</li>
<li><strong>Validation</strong>: Recognition rates have shown steady improvement over a 3-year development period.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://osra.sourceforge.net">OSRA (SourceForge)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Open-source OCSR tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The primary evaluation data came from the TREC-CHEM Image2Structure task.</li>
<li><strong>Reference Datasets</strong>: The paper references the &ldquo;Chem-Infty Dataset&rdquo; as a source of ground-truthed chemical structure images.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The OSRA pipeline is heuristic-heavy. Key implementation details for replication include:</p>
<p><strong>1. Page Segmentation</strong></p>
<ul>
<li><strong>Entropy Calculation</strong>: Used to detect text vs. graphics. Entropy $E = -p \log p$ is calculated for rows in a feature matrix of component distances.</li>
<li><strong>Thresholds</strong>: Max entropy &gt; 6 indicates mixed text/graphics; $\le$ 3 indicates a single structure. A threshold of <strong>4</strong> is used to distinguish the two.</li>
<li><strong>Separator Removal</strong>: Linear separators (aspect ratio above 100 or below 0.01, size above 300 pixels) are deleted early. Table frames are identified as connected components with aspect ratio between 0.1 and 10, with at least 300 pixels lying on the surrounding rectangle.</li>
<li><strong>Text Removal</strong>: Text blocks are identified if a group of segments (distance determined by local minima in distance matrix) contains &gt; 8 segments, has a fill ratio &gt; 0.2, or aspect ratio &gt; 10.</li>
</ul>
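<p>The entropy decision above can be sketched in Python. This is illustrative only: OSRA is implemented in C++, and its actual feature matrix of component distances is more involved than the precomputed rows assumed here.</p>

```python
import math
from collections import Counter

def row_entropy(row):
    """Shannon entropy (in bits) of the value distribution in one feature row."""
    total = len(row)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(row).values())

def classify_page(feature_rows, threshold=4.0):
    """Mixed text/graphics if the maximum row entropy exceeds the threshold,
    otherwise a single structure (paper: >6 mixed, <=3 single, 4 as cutoff)."""
    max_e = max(row_entropy(r) for r in feature_rows)
    return "mixed" if max_e > threshold else "single"
```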
<p><strong>2. Image Preprocessing</strong></p>
<ul>
<li><strong>Grayscale</strong>: $Gr = \min(R, G, B)$.</li>
<li><strong>Resolutions</strong>: Processed at 72, 150, 300 dpi, and a dynamic resolution between 500-1200 dpi.</li>
<li><strong>Noise Factor</strong>: Ratio of 2-pixel line segments to 3-pixel line segments. If this factor is between <strong>0.5 and 1.0</strong>, anisotropic smoothing (GREYCstoration) is applied.</li>
<li><strong>Thinning</strong>: Uses the method by J. M. Cychosz to reduce lines to 1 pixel width.</li>
</ul>
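<p>A minimal sketch of the grayscale conversion and the noise-factor test, assuming per-pixel RGB tuples and precomputed segment counts (the real pipeline applies GREYCstoration smoothing to whole images):</p>

```python
def to_grayscale(pixel):
    """OSRA's grayscale conversion: Gr = min(R, G, B)."""
    r, g, b = pixel
    return min(r, g, b)

def needs_smoothing(n2, n3):
    """Noise factor = ratio of 2-pixel to 3-pixel line segments; the paper
    applies anisotropic smoothing when this factor lies in [0.5, 1.0]."""
    if n3 == 0:
        return False
    factor = n2 / n3
    return 0.5 <= factor <= 1.0
```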
<p><strong>3. Vectorization &amp; Atom Detection</strong></p>
<ul>
<li><strong>Library</strong>: Potrace is used for vectorization.</li>
<li><strong>Atom Identification</strong>: Atoms are detected at Bezier curve control points if:
<ul>
<li>Potrace classifies it as a corner.</li>
<li>Direction change normal component is $\ge$ 2 pixels.</li>
<li>The distance from the last atom to the next control point is less than the distance from the last atom to the current control point.</li>
</ul>
</li>
<li><strong>OCR</strong>: GOCR and OCRAD are used for label recognition on connected sets smaller than max character dimensions. Tesseract and Cuneiform were also tested but did not improve recognition results.</li>
</ul>
<p><strong>4. Chemical Logic</strong></p>
<ul>
<li><strong>Average Bond Length</strong>: Defined as the value at the <strong>75th percentile</strong> of the sorted bond length list (to avoid bias from small artifacts).</li>
<li><strong>Aromaticity</strong>: Flagged if a circle is found inside a ring, atoms are within half the average bond length of the circle, and bond angles to the center are less than 90 degrees.</li>
<li><strong>Bridge Bonds</strong>: Detected if an atom connected to 4 pairwise collinear single bonds (none terminal) can be removed without changing fragment count, rotatable bonds, or reducing the number of 5- and 6-membered rings by 2.</li>
</ul>
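<p>The percentile-based bond-length estimate can be sketched as follows. The exact percentile-rank convention is an assumption; the paper only specifies the 75th-percentile value of the sorted length list.</p>

```python
def average_bond_length(lengths):
    """OSRA's 'average' bond length: the 75th-percentile value of the
    sorted length list, which suppresses bias from small artifacts."""
    if not lengths:
        raise ValueError("no bonds")
    s = sorted(lengths)
    # nearest-rank style index for the 75th percentile (convention assumed)
    idx = min(len(s) - 1, int(0.75 * len(s)))
    return s[idx]
```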
<p><strong>5. Connection Table Compilation</strong></p>
<ul>
<li><strong>Library</strong>: OpenBabel is used for conversion into SMILES or SDF formats.</li>
<li><strong>Process</strong>: A molecular object is constructed from connectivity information along with stereo- and aromaticity flags. Superatom fragments are added at this stage using a user-modifiable dictionary.</li>
</ul>
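<p>OSRA delegates this step to OpenBabel. As a dependency-free illustration of what a connection table serializes to, here is a minimal MDL V2000 molfile writer (coordinates, charges, and stereo flags simplified; not OSRA's code):</p>

```python
def to_molfile(atoms, bonds, title="osra-sketch"):
    """Serialize a connection table to a minimal MDL V2000 molfile.
    atoms: list of (symbol, x, y); bonds: list of (i, j, order), 1-indexed."""
    lines = [title, "  sketch", "",
             f"{len(atoms):>3}{len(bonds):>3}  0  0  0  0  0  0  0  0999 V2000"]
    for sym, x, y in atoms:
        # atom block: x, y, z coordinates followed by the element symbol
        lines.append(f"{x:>10.4f}{y:>10.4f}{0.0:>10.4f} {sym:<3} 0  0  0  0  0  0  0  0  0  0  0  0")
    for i, j, order in bonds:
        # bond block: begin atom, end atom, bond order
        lines.append(f"{i:>3}{j:>3}{order:>3}  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines)
```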
<h3 id="models">Models</h3>
<p>This is a non-learning based system (Rule-based/Heuristic). However, it uses a tuned linear function for confidence estimation.</p>
<p><strong>Confidence Function</strong>: Used to select the best resolution result.</p>
<p>$$
\begin{aligned}
\text{confidence} &amp;= 0.316030 - 0.016315 N_C + 0.034336 N_N + 0.066810 N_O \\
&amp;+ 0.035674 N_F + 0.065504 N_S + 0.04 N_{Cl} + 0.066811 N_{Br} \\
&amp;+ 0.01 N_R - 0.02 N_{Xx} - 0.212739 N_{rings} + 0.071300 N_{aromatic} \\
&amp;+ 0.329922 N_{rings5} + 0.342865 N_{rings6} - 0.037796 N_{fragments}
\end{aligned}
$$</p>
<p>Where $N_C$ is carbon count, $N_{rings}$ is ring count, etc.</p>
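<p>The function is straightforward to evaluate once the feature counts are known. The coefficients below are a direct transcription from the paper; the dictionary interface is only a convenience for this sketch.</p>

```python
# Coefficients of OSRA's tuned linear confidence function (from the paper).
COEFFS = {
    "C": -0.016315, "N": 0.034336, "O": 0.066810, "F": 0.035674,
    "S": 0.065504, "Cl": 0.04, "Br": 0.066811, "R": 0.01, "Xx": -0.02,
    "rings": -0.212739, "aromatic": 0.071300,
    "rings5": 0.329922, "rings6": 0.342865, "fragments": -0.037796,
}

def confidence(counts):
    """Linear confidence score used to select the best-resolution result;
    counts maps feature names (e.g. 'C', 'rings6') to integer counts."""
    return 0.316030 + sum(COEFFS[k] * v for k, v in counts.items())
```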
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run</th>
          <th>Training Set</th>
          <th>Challenge Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall</td>
          <td>Default Settings</td>
          <td>84.3%</td>
          <td>84.8%</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>Fixed 300 dpi</td>
          <td>86.1%</td>
          <td>85.6%</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2011). Optical Structure Recognition Application entry in Image2Structure task. <em>TREC-CHEM</em>.</p>
<p><strong>Publication</strong>: TREC-CHEM 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://osra.sourceforge.net">SourceForge Project</a></li>
<li><a href="https://launchpad.net/cuneiform-linux">Cuneiform Linux Port</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{filippovOpticalStructureRecognition2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry in {{Image2Structure}} Task}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V. and Katsubo, Dmitry and Nicklaus, Marc C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{National Cancer Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">note</span> = <span style="color:#e6db74">{TREC-CHEM Entry}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé-1 System for Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</guid><description>Foundational OCSR method combining neural OCR with chemical rule-based post-processing for automated structure interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1996). Automatic Interpretation of Chemical Structure Diagrams. <em>Graphics Recognition. Methods and Applications</em>, 148-158. <a href="https://doi.org/10.1007/3-540-61226-2_13">https://doi.org/10.1007/3-540-61226-2_13</a></p>
<p><strong>Publication</strong>: Lecture Notes in Computer Science (LNCS), Vol. 1072, Springer, 1996.</p>
<h2 id="system-architecture-and-contribution">System Architecture and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel software architecture (&ldquo;Kekulé-1&rdquo;) designed to solve the specific technical problem of converting rasterized chemical diagrams into machine-readable connection tables. The paper is characterized by:</p>
<ul>
<li><strong>Algorithmic Specification</strong>: It details specific algorithms for vectorization, polygon approximation, and character recognition.</li>
<li><strong>Performance Metrics</strong>: It validates the method using quantitative accuracy (98.9%) and speed comparisons against manual entry.</li>
<li><strong>System Architecture</strong>: It describes the integration of typically disparate components (OCR, vectorization, chemical rules) into a cohesive pipeline.</li>
</ul>
<h2 id="motivation-the-chemical-data-entry-bottleneck">Motivation: The Chemical Data Entry Bottleneck</h2>
<p>Chemical structure diagrams are the primary medium for communication between chemists, but computers cannot natively &ldquo;read&rdquo; these raster images.</p>
<ul>
<li><strong>Efficiency Gap</strong>: Manual redrawing of structures into chemical databases takes 6 to 10 minutes per structure.</li>
<li><strong>Technical Challenge</strong>: Existing commercial OCR systems failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), nor could they recognize small fonts (3-7 points) or chemical symbols accurately.</li>
<li><strong>Goal</strong>: To create an &ldquo;Optical Chemical Structure Recognition&rdquo; (OCSR) system that reduces processing time to seconds while handling complex notation like stereochemistry and group formulas.</li>
</ul>
<h2 id="core-innovations-in-chemical-ocr">Core Innovations in Chemical OCR</h2>
<p>Kekulé-1 represents the &ldquo;first successful attempt&rdquo; to integrate image processing, OCR, and structure editing into a single workflow. Key innovations include:</p>
<ul>
<li><strong>Context-Aware OCR</strong>: Unlike standard OCR, Kekulé-1 uses &ldquo;chemical spell checking&rdquo; by applying valence rules and chemical context to correct raw character recognition errors (e.g., distinguishing &lsquo;5&rsquo; from &lsquo;S&rsquo; based on bonding).</li>
<li><strong>Adaptive Polygon Approximation</strong>: A modified vectorization algorithm that partitions objects at the farthest node to prevent artifact nodes in U-shaped structures.</li>
<li><strong>Hybrid Parsing</strong>: It treats the diagram as a graph where nodes can be explicit atoms or geometric intersections, using rule-based logic to parse &ldquo;group formulas&rdquo; (like $COOH$) recursively.</li>
</ul>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The authors evaluated the system on a private test set to validate robustness and speed.</p>
<ul>
<li><strong>Dataset</strong>: 524 chemical structures chosen from a &ldquo;wide variety of sources&rdquo; specifically to test the system&rsquo;s limits.</li>
<li><strong>Metrics</strong>: Success rate (percentage of structures processed with minimal editing) and processing time per structure.</li>
<li><strong>Comparators</strong>: Performance was compared against the &ldquo;manual redrawing&rdquo; baseline.</li>
</ul>
<h2 id="results-performance-and-conclusions">Results, Performance, and Conclusions</h2>
<ul>
<li><strong>High Accuracy</strong>: 98.9% of the test structures were successfully processed (with an average of 0.74 user prompts per structure).</li>
<li><strong>Speedup</strong>: Processing took 7 to 30 seconds per structure, a significant improvement over the 6 to 10 minute manual baseline.</li>
<li><strong>Robustness</strong>: The system successfully handled pathological cases like broken characters, skew (rotation), and touching characters.</li>
<li><strong>Impact</strong>: The authors conclude that the techniques are generalizable to other domains like electrical circuits and utility maps.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: The evaluation used 524 chemical structures. These were not released publicly but were selected to represent &ldquo;limit&rdquo; cases.</li>
<li><strong>Input format</strong>: Scanned images at 300-400 dpi. The authors note that higher resolutions do not add information due to ink wicking and paper limitations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details several specific algorithmic implementations:</p>
<p><strong>Vectorization (Polygon Approximation)</strong>:</p>
<ul>
<li>Standard thinning and raster-to-vector translation are used.</li>
<li><strong>Innovation</strong>: The algorithm searches for the node <em>farthest</em> from the current start node to partition the object. This prevents artifact nodes in curved lines.</li>
<li><strong>Threshold Formula</strong>: The allowed deviation ($\mathrm{dist}$) from a straight line is adaptive based on segment length ($\mathrm{length}$):</li>
</ul>
<p>$$\mathrm{dist} = \max\left(1,\ \frac{\mathrm{length}}{10.0} + 0.4\right)$$</p>
<p>(Units in pixels)</p>
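<p>A direct transcription of the adaptive threshold:</p>

```python
def max_deviation(length):
    """Adaptive allowed deviation (in pixels) from a straight line during
    polygon approximation: dist = max(1, length/10.0 + 0.4)."""
    return max(1.0, length / 10.0 + 0.4)
```

Short segments are held to the 1-pixel floor, while longer segments tolerate proportionally more deviation before being split.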
<p><strong>Rotation Correction</strong>:</p>
<ul>
<li>The system computes the angle of all &ldquo;long&rdquo; line segments modulo 15 degrees.</li>
<li>It bins these angles; the bin with the highest count (representing &lt; 4 degrees rotation) is treated as the scan skew and corrected.</li>
</ul>
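<p>A sketch of the skew estimate under an assumed 1-degree binning scheme; the paper specifies the modulo-15 fold and the winning-bin rule, but not the bin width:</p>

```python
from collections import Counter

def estimate_skew(angles_deg, bin_width=1.0):
    """Fold long-segment angles modulo 15 degrees, bin them, and take the
    most populated bin as the scan skew (the paper corrects winners
    representing under ~4 degrees of rotation)."""
    folded = [a % 15.0 for a in angles_deg]
    bins = Counter(int(a / bin_width) for a in folded)
    best_bin, _ = bins.most_common(1)[0]
    return (best_bin + 0.5) * bin_width  # bin-center estimate
```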
<p><strong>Optical Character Recognition (OCR)</strong>:</p>
<ul>
<li>Uses a neural network with linked/shared weights (similar to Convolutional Neural Networks, though not named as such) acting as a feature detector.</li>
<li><strong>Training</strong>: Trained on specific chemical fonts.</li>
<li><strong>Inference</strong>: Outputs are ranked; if multiple characters (e.g., &lsquo;5&rsquo; and &lsquo;S&rsquo;) exceed a threshold, both are kept, and chemical context resolves the ambiguity later.</li>
</ul>
<p><strong>Chemical Parsing</strong>:</p>
<ul>
<li>Group formulas (e.g., $COOH$) are parsed left-to-right by subtracting valences.</li>
<li>Example: For $COOH$, the external bond reduces Carbon&rsquo;s valence to 3. The first Oxygen takes 2, leaving 1. The final Oxygen takes 1 (attaching to Carbon), and the Hydrogen takes 1 (attaching to Oxygen).</li>
</ul>
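<p>The $COOH$ walkthrough above can be reproduced with a small valence-subtraction parser. The attach-to-earliest-free-atom rule used here is an assumption consistent with the example, not the paper's exact algorithm, and atom counts like $CH_3$ are not handled.</p>

```python
VALENCE = {"C": 4, "O": 2, "N": 3, "S": 2, "H": 1}

def parse_group(symbols, external_bonds=1):
    """Left-to-right valence-subtraction parse of a group formula: each new
    atom bonds to the earliest placed atom that still has free valence,
    taking as much of it as its own valence allows.
    Returns bonds as (i, j, order) index triples."""
    free, bonds = [], []
    for idx, sym in enumerate(symbols):
        v = VALENCE[sym]
        if idx == 0:
            v -= external_bonds  # the bond leaving the group uses one valence
        else:
            j = next(k for k in range(idx) if free[k] > 0)
            order = min(free[j], v)
            free[j] -= order
            v -= order
            bonds.append((j, idx, order))
        free.append(v)
    return bonds
```

For <code>["C", "O", "O", "H"]</code> this yields a C=O double bond, a C-O single bond, and an O-H single bond, matching the worked example.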
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Model</strong>: A neural network with a &ldquo;shared weights&rdquo; paradigm, effectively creating a learned convolution map. It achieves ~99.9% raw accuracy on isolated test sets of chemical fonts.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: The evaluation was performed on an <strong>80486 processor at 33 MHz</strong>.</li>
<li><strong>Time</strong>: Average processing time was 9 seconds per structure.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{mcdanielAutomaticInterpretationChemical1996,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Interpretation of Chemical Structure Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Graphics Recognition. Methods and Applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{O&#39;Gorman, Lawrence and Kasturi, Rangachar}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{Lecture Notes in Computer Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1072}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{148--158}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1996}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/3-540-61226-2_13}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Imago: Open-Source Chemical Structure Recognition (2011)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/imago-trec-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/imago-trec-2011/</guid><description>Open-source C++ toolkit for extracting 2D chemical structures from scientific literature using heuristic image processing methods.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-resource-utility">Paper Contribution and Resource Utility</h2>
<p>This is primarily a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper, with a secondary <strong>Method ($\Psi_{\text{Method}}$)</strong> component.</p>
<p><strong>Resource:</strong> The paper&rsquo;s main contribution is the release of the &ldquo;Imago&rdquo; open-source toolkit. It emphasizes infrastructure utility: cross-platform C++ implementation, a core written from scratch without third-party code, and the inclusion of both GUI and command-line tools.</p>
<p><strong>Method:</strong> It provides a detailed description of the recognition pipeline (filtering, segmentation, vectorization) to document the resource.</p>
<h2 id="motivation-the-deep-web-of-chemical-structures">Motivation: The Deep Web of Chemical Structures</h2>
<p>Chemical databases (like PubChem or PubMed) allow text and substructure searches, but a vast amount of chemical data remains &ldquo;locked&rdquo; in the images of scientific articles and patents. This is described as a &ldquo;Deep Web indexing problem&rdquo;. To bridge this gap, the authors identify a need for efficient, accurate, and portable algorithms to convert 2D raster images of molecules into graph representations suitable for indexing and search.</p>
<h2 id="core-innovation-a-dependency-free-c-architecture">Core Innovation: A Dependency-Free C++ Architecture</h2>
<p>The novelty lies in the <strong>open-source, dependency-free implementation</strong>.</p>
<p><strong>Portability:</strong> The core of the toolkit is written from scratch in modern C++ without third-party libraries, specifically targeting portability to mobile devices and various platforms.</p>
<p><strong>Integration:</strong> It combines optical character recognition (OCR) with specific chemical heuristics (like identifying stereochemistry and abbreviations) into a single usable workflow.</p>
<h2 id="methodology-and-experimental-validation-at-trec-chem">Methodology and Experimental Validation at TREC-CHEM</h2>
<p>The paper describes the algorithm used in Imago and reflects on its participation in the <strong>Image2Structure task at TREC-CHEM 2011</strong>. No quantitative results are reported; the &ldquo;Discussion&rdquo; section instead reflects on qualitative performance issues observed during the task, such as handling low resolution, noise, and connected atom labels.</p>
<h2 id="outcomes-limitations-and-future-directions">Outcomes, Limitations, and Future Directions</h2>
<p><strong>Release:</strong> The authors successfully released Imago under the GPLv3 license, including an API for developers. The toolkit outputs recognized structures in MDL Molfile format.</p>
<p><strong>Limitations Identified:</strong> The straightforward pipeline fails when images have low resolution (atom labels merge with bonds), high noise, or tight character spacing (symbols rendered without space pixels between them). Additionally, when few symbols are present, the average bond length estimate can have large error, causing atom symbols to be misidentified as bond chains.</p>
<p><strong>Future Directions:</strong> The authors propose moving from a linear pipeline to an &ldquo;optimization procedure&rdquo; that maximizes a confidence score, using probabilistic mapping of image primitives to chemical entities. They also argue that recognition programs should output a confidence score to enable automatic batch processing (only images with low confidence need manual review). They suggest a multi-pass workflow where each iteration adjusts parameters to improve the confidence level, and they note the additional challenge of separating molecule images from text in real articles and patents.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not specify a training dataset for the core logic (which appears heuristic-based), but references testing context:</p>
<ul>
<li><strong>Domain:</strong> Images from scientific articles and patents.</li>
<li><strong>Validation:</strong> TREC-CHEM 2011 Image2Structure task data.</li>
<li><strong>Databases:</strong> Mentions PubMed and PubChem as context for the type of data being indexed.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows a strict linear sequence:</p>
<ol>
<li>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><strong>Binarization:</strong> Threshold-based.</li>
<li><strong>Supersegmentation:</strong> Locates the chemical structure using a $15 \times 15$ window neighbor search.</li>
<li><strong>Filtering:</strong> Removes single-down stereo bonds (dashed triangles) early to prevent incorrect recognition of the small line segments during classification.</li>
</ul>
</li>
<li>
<p><strong>Separation (Symbols vs. Graphics):</strong></p>
<ul>
<li><strong>Heuristic:</strong> Estimates &ldquo;capital letter height&rdquo;.</li>
<li><strong>Criteria:</strong> Groups segments by height and aspect ratio range $[\text{MinSymRatio}, \text{MaxSymRatio}]$.</li>
</ul>
</li>
<li>
<p><strong>Skeleton Construction (Vectorization):</strong></p>
<ul>
<li><strong>Thinning:</strong> Uses neighborhood maps to reduce lines to 1-pixel thickness.</li>
<li><strong>De-crossing:</strong> Each black pixel with more than 2 black pixels in its 8-neighborhood becomes white, isolating polylines.</li>
<li><strong>Smoothing:</strong> Uses the <strong>Douglas-Peucker algorithm</strong>.</li>
<li><strong>Graph Adjustment:</strong> Merges close vertices and detects bond orders based on parallel edges.</li>
</ul>
</li>
<li>
<p><strong>Symbol Recognition:</strong></p>
<ul>
<li><strong>Grouping:</strong> Uses a <strong>Relative Neighborhood Graph</strong> to group characters into superatoms/labels.</li>
<li><strong>OCR:</strong> Classification based on <strong>Fourier descriptors</strong> of outer/inner contours.</li>
</ul>
</li>
<li>
<p><strong>Chemical Expansion:</strong></p>
<ul>
<li><strong>Abbreviation:</strong> Expands common groups (e.g., Ph, COOH) stored as SMILES notation, using the <strong>Indigo toolkit</strong> for 2D coordinate generation of the expanded structures.</li>
</ul>
</li>
</ol>
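<p>The de-crossing rule in step 3 is simple enough to state exactly. A sketch on a skeleton represented as a set of black-pixel coordinates:</p>

```python
def decross(pixels):
    """Imago's de-crossing rule on a 1-pixel-wide skeleton: any black pixel
    with more than two black 8-neighbors becomes white, splitting the
    skeleton at junctions into isolated polylines. pixels: set of (x, y)."""
    def neighbors(p):
        x, y = p
        # count black pixels among the 8 surrounding positions
        return sum((x + dx, y + dy) in pixels
                   for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0))
    return {p for p in pixels if neighbors(p) <= 2}
```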
<h3 id="models">Models</h3>
<ul>
<li><strong>Type:</strong> Heuristic-based computer vision pipeline; no learned deep learning weights mentioned.</li>
<li><strong>Stereo Recognition:</strong>
<ul>
<li><strong>Single Down:</strong> Identified as $k \ge 3$ parallel equidistant lines.</li>
<li><strong>Single Up:</strong> Identified by checking if a bond was a solid triangle before thinning.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics:</strong> None quantitatively reported in the text; discussion focuses on qualitative failure modes (low resolution, noise).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/epam/Imago">Imago GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0 (current); GPLv3 (as published)</td>
          <td>Official C++ implementation</td>
      </tr>
      <tr>
          <td><a href="https://lifescience.opensource.epam.com/imago/">Imago Project Page</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Documentation and downloads</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements:</strong> Designed to be lightweight and portable (mobile-device capable), written in C++. No specific GPU/TPU requirements.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Smolov, V., Zentsev, F., &amp; Rybalkin, M. (2011). Imago: Open-Source Toolkit for 2D Chemical Structure Image Recognition. <em>TREC-CHEM 2011</em>.</p>
<p><strong>Publication</strong>: TREC-CHEM 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/t20.proceedings.html">TREC-CHEM 2011 Proceedings</a></li>
<li><a href="https://lifescience.opensource.epam.com/imago/">Project Website</a></li>
<li><a href="https://github.com/epam/Imago">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{smolovImagoOpenSourceToolkit2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Imago: {{Open-Source Toolkit}} for {{2D Chemical Structure Image Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Smolov, Viktor and Zentsev, Fedor and Rybalkin, Mikhail}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{{GGA Software Services LLC}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">note</span> = <span style="color:#e6db74">{TREC-CHEM 2011}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CLiDE Pro: Optical Chemical Structure Recognition Tool</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</guid><description>A methodological paper presenting CLiDE Pro, an OCSR system for reconstructing chemical graphs from images with ~90% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Valko, A. T., &amp; Johnson, A. P. (2009). CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 49(4), 780-787. <a href="https://doi.org/10.1021/ci800449t">https://doi.org/10.1021/ci800449t</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2009</p>
<h2 id="contribution-robust-algorithmic-pipeline-for-ocsr">Contribution: Robust Algorithmic Pipeline for OCSR</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper, as it proposes a specific algorithmic architecture (CLiDE Pro) for converting raster images of chemical structures into connection tables. It details the procedural steps for segmentation, vectorization, and graph reconstruction.</p>
<p>It also has a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution, as the authors compile and release a validation set of 454 real-world images to serve as a community benchmark for OCSR systems.</p>
<h2 id="motivation-bridging-the-gap-between-legacy-document-images-and-machine-readable-chemistry">Motivation: Bridging the Gap Between Legacy Document Images and Machine-Readable Chemistry</h2>
<p>While modern chemical drawing software captures structural information explicitly, the vast majority of legacy and current chemical literature (journals, patents, reports) exists as images or PDF documents. These images are human-readable but lack the semantic &ldquo;connection table&rdquo; data required for chemical databases and software. Manual redrawing is time-consuming and error-prone. Therefore, there is a critical need for efficient Optical Chemical Structure Recognition (OCSR) systems to automate this extraction.</p>
<h2 id="novelty-integrated-document-segmentation-and-ambiguity-resolution-heuristics">Novelty: Integrated Document Segmentation and Ambiguity Resolution Heuristics</h2>
<p>CLiDE Pro introduces several algorithmic improvements over its predecessor (CLiDE) and contemporary tools:</p>
<ul>
<li><strong>Integrated Document Segmentation</strong>: Unlike page-oriented systems, it processes whole documents to link information across pages.</li>
<li><strong>Robust &ldquo;Difficult Feature&rdquo; Handling</strong>: It implements specific heuristic rules to resolve ambiguities in crossing bonds (bridged structures), which are often misinterpreted as carbon atoms in other systems.</li>
<li><strong>Generic Structure Interpretation</strong>: It includes a module to parse &ldquo;generic&rdquo; (Markush) structures by matching R-group labels in the diagram with text-based definitions found in the document.</li>
<li><strong>Ambiguity Resolution</strong>: It uses context-aware rules to distinguish between geometrically similar features, such as vertical lines representing bonds vs. the letter &rsquo;l&rsquo; in &lsquo;Cl&rsquo;.</li>
</ul>
<h2 id="methodology-and-benchmarking-on-real-world-data">Methodology and Benchmarking on Real-World Data</h2>
<p>The authors conducted a systematic validation on a dataset of <strong>454 images</strong> containing <strong>519 structure diagrams</strong>.</p>
<ul>
<li><strong>Source Material</strong>: Images were extracted from published materials (journals, patents), ensuring &ldquo;real artifacts&rdquo; like noise and scanning distortions were present.</li>
<li><strong>Automation</strong>: The test was fully automated without human intervention.</li>
<li><strong>Metrics</strong>: The primary metric was the &ldquo;success rate,&rdquo; defined as the correct reconstruction of the molecule&rsquo;s connection table. They also performed fine-grained error analysis on specific features (e.g., atom labels, dashed bonds, wavy bonds).</li>
</ul>
<h2 id="results-high-topological-accuracy-and-persistent-ocr-challenges">Results: High Topological Accuracy and Persistent OCR Challenges</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved an <strong>89.79%</strong> success rate (466/519 molecules correctly reconstructed).</li>
<li><strong>Robustness on Primitives</strong>: Solid straight bonds were recognized with 99.92% accuracy.</li>
<li><strong>Key Failure Modes</strong>: The majority of errors (58 cases) occurred in atom label construction, specifically when labels touched nearby bonds or other artifacts, causing OCR failures.</li>
<li><strong>Impact</strong>: The study demonstrated that handling &ldquo;difficult&rdquo; drawing features like crossing bonds and bridged structures significantly reduces topological errors. The authors released the test set to encourage standardized benchmarking in the OCSR field.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized a custom dataset designed to reflect real-world noise.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>CLiDE Pro Validation Set</td>
          <td>454 images (519 structures)</td>
          <td>Extracted from scanned journals and PDFs. Includes noise/artifacts. Available in Supporting Information.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The CLiDE Pro pipeline consists of five distinct phases. To replicate this system, one would need to implement:</p>
<ol>
<li>
<p><strong>Image Binarization</strong>:</p>
<ul>
<li>Input images are binarized using a threshold-based technique to separate foreground (molecule) from background.</li>
<li><strong>Connected Component Analysis (CCA)</strong>: A non-recursive scan identifies connected components (CCs) and generates interpixel contours (using N, S, E, W directions).</li>
</ul>
</li>
<li>
<p><strong>Document Segmentation</strong>:</p>
<ul>
<li><strong>Layout Analysis</strong>: Uses a bottom-up approach building a tree structure. It treats CCs as graph vertices and distances as edges.</li>
<li><strong>Clustering</strong>: A minimal-cost spanning tree (Kruskal&rsquo;s algorithm) groups CCs into words, lines, and blocks.</li>
<li><strong>Classification</strong>: CCs are classified (Character, Dash, Line, Graphics, Noise) based on size thresholds derived from statistical image analysis.</li>
</ul>
</li>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li><strong>Contour Approximation</strong>: Uses a method similar to <strong>Sklansky and Gonzalez</strong> to approximate contours into polygons.</li>
<li><strong>Vector Formation</strong>: Long polygon sides become straight lines; short consecutive sides become curves. Opposing borders of a line are matched to define the bond vector.</li>
<li><strong>Wavy Bonds</strong>: Detected by finding groups of short vectors lying on a straight line.</li>
<li><strong>Dashed Bonds</strong>: Detected using the <strong>Hough transform</strong> to find collinear or parallel dashes.</li>
</ul>
</li>
<li>
<p><strong>Atom Label Construction</strong>:</p>
<ul>
<li><strong>OCR</strong>: An OCR engine (filtering + topological analysis) interprets characters.</li>
<li><strong>Grouping</strong>: Characters are grouped into words based on horizontal and vertical proximity (for vertical labels).</li>
<li><strong>Superatom Lookup</strong>: Labels are matched against a database of elements, functional groups, and R-groups. Unknown linear formulas (e.g., $\text{CH}_2\text{CH}_2\text{OH}$) are parsed.</li>
</ul>
</li>
<li>
<p><strong>Graph Reconstruction</strong>:</p>
<ul>
<li><strong>Connection Logic</strong>: Bond endpoints are joined to atoms if they are within a distance threshold and &ldquo;point toward&rdquo; the label.</li>
<li><strong>Implicit Carbons</strong>: Unconnected bond ends are joined if close; parallel bonds merge into double/triple bonds.</li>
<li><strong>Crossing Bonds</strong>: Rules check proximity, length, and ring membership to determine if crossing lines are valid atoms or 3D visual artifacts.</li>
</ul>
</li>
<li>
<p><strong>Generic Structure Interpretation</strong>:</p>
<ul>
<li><strong>Text Mining</strong>: A lexical/syntactic analyzer extracts R-group definitions (e.g., &ldquo;R = Me or H&rdquo;) from text blocks.</li>
<li><strong>Matching</strong>: The system attempts to match R-group labels in the diagram with the parsed text definitions.</li>
</ul>
</li>
</ol>
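<p>The phase-5 connection logic can be sketched as a simple geometric test. This is a minimal illustration with simplified 2D primitives; the function name, thresholds, and the exact &ldquo;points toward&rdquo; test are ours, not CLiDE Pro&rsquo;s implementation:</p>

```python
import math

def attach_bonds(bonds, labels, dist_thresh, angle_thresh_deg=30.0):
    """Join a bond endpoint to an atom label when the endpoint lies within
    dist_thresh of the label centre AND the bond direction points toward it.
    bonds: list of ((x1, y1), (x2, y2)) with nonzero length;
    labels: list of (name, (cx, cy)).
    Returns (bond_index, endpoint_index, label_name) attachments."""
    attachments = []
    for bi, (p, q) in enumerate(bonds):
        for end_idx, (tip, tail) in enumerate([(p, q), (q, p)]):
            for name, c in labels:
                d = math.dist(tip, c)
                if d > dist_thresh or d == 0.0:
                    continue
                # Bond direction at this endpoint (tail toward tip) ...
                bx, by = tip[0] - tail[0], tip[1] - tail[1]
                # ... versus the direction from the tip to the label centre.
                lx, ly = c[0] - tip[0], c[1] - tip[1]
                cos = (bx * lx + by * ly) / (math.hypot(bx, by) * d)
                ang = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
                if ang > angle_thresh_deg:
                    continue
                attachments.append((bi, end_idx, name))
    return attachments
```

<p>Endpoints left unattached by a pass like this would then fall to the implicit-carbon and crossing-bond rules.</p>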
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Engine</strong>: The system relies on a customized OCR engine capable of handling rotation and chemical symbols, though the specific architecture (neural vs. feature-based) is not detailed beyond &ldquo;topological and geometrical feature analysis&rdquo;.</li>
<li><strong>Superatom Database</strong>: A lookup table containing elements, common functional groups, and R-group labels.</li>
</ul>
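<p>The superatom lookup and linear-formula parsing from phase 4 can be sketched as follows; the table entries (including the SMILES values) and the regex-based parser are hypothetical stand-ins for the paper&rsquo;s database and parser:</p>

```python
import re

# Hypothetical miniature stand-in for CLiDE Pro's superatom database;
# the SMILES expansions are illustrative only.
SUPERATOMS = {"Ph": "c1ccccc1", "Me": "C", "OMe": "OC", "Ac": "C(C)=O"}

TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def expand_label(label):
    """Resolve an atom label: known superatoms come from the lookup table;
    anything else is parsed as a linear formula into (element, count) pairs.
    Returns None if the label cannot be parsed."""
    if label in SUPERATOMS:
        return SUPERATOMS[label]
    atoms = []
    pos = 0
    for m in TOKEN.finditer(label):
        if m.start() != pos:      # unparseable gap: reject the label
            return None
        pos = m.end()
        atoms.append((m.group(1), int(m.group(2) or "1")))
    if pos != len(label):
        return None
    return atoms
```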
<h3 id="evaluation">Evaluation</h3>
<p>The evaluation focused on the topological correctness of the output.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total Success Rate</strong></td>
          <td>89.79%</td>
          <td>466/519 structures perfectly reconstructed.</td>
      </tr>
      <tr>
          <td><strong>Atom Label Accuracy</strong></td>
          <td>98.54%</td>
          <td>3923/3981 labels correct. Main error source: labels touching bonds.</td>
      </tr>
      <tr>
          <td><strong>Solid Bond Accuracy</strong></td>
<td>99.92%</td>
          <td>16061/16074 solid bonds correct.</td>
      </tr>
      <tr>
          <td><strong>Dashed Bond Accuracy</strong></td>
          <td>98.37%</td>
          <td>303/308 dashed bonds correct.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; described as efficient.</li>
<li><strong>Performance</strong>: The system processed the complex Palytoxin structure &ldquo;within a few seconds&rdquo;. This implies low computational overhead suitable for standard desktop hardware of the 2009 era.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{valkoCLiDEProLatest2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Valko, Aniko T. and Johnson, A. Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{780--787}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800449t}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInk: Real-Time Recognition for Chemical Drawings</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</guid><description>A sketch recognition framework for chemical diagrams using a joint CRF model to combine multi-level visual features for real-time interpretation.</description><content:encoded><![CDATA[<h2 id="contribution-real-time-sketch-recognition-method">Contribution: Real-Time Sketch Recognition Method</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for sketch recognition that integrates visual features at three distinct levels (inkpoints, segments, symbols) into a single probabilistic model. The rhetorical structure centers on the proposal of this new architecture, the introduction of a specific &ldquo;trainable corner detector&rdquo; algorithm, and the validation of these methods against existing benchmarks and an alternative authoring tool (ChemDraw).</p>
<h2 id="motivation-bridging-the-gap-between-sketching-and-cad">Motivation: Bridging the Gap Between Sketching and CAD</h2>
<p>The primary motivation is to bridge the gap between the natural, efficient process of drawing chemical diagrams by hand and the cumbersome &ldquo;point-click-and-drag&rdquo; interactions required by CAD tools like ChemDraw. While chemists prefer sketching for communication, existing digital tools do not offer the same speed or ease of use. The goal is to build an intelligent system that understands freehand sketches in real-time, converting them into structured data suitable for analysis or search.</p>
<h2 id="core-innovation-hierarchical-joint-crf-model">Core Innovation: Hierarchical Joint CRF Model</h2>
<p>The core novelty lies in the <strong>hierarchical joint model</strong>. Unlike previous approaches that might treat stroke segmentation and symbol recognition as separate, isolated steps, ChemInk uses a <strong>Conditional Random Field (CRF)</strong> to jointly model dependencies across three levels:</p>
<ol>
<li><strong>Inkpoints</strong>: Local visual appearance.</li>
<li><strong>Segments</strong>: Stroke fragments separated by corners.</li>
<li><strong>Candidates</strong>: Potential symbol groupings.</li>
</ol>
<p>Additionally, the paper introduces a <strong>trainable corner detector</strong> that learns domain-specific corner definitions from data.</p>
<h2 id="experimental-design-and-baselines">Experimental Design and Baselines</h2>
<p>The authors conducted two primary evaluations:</p>
<ol>
<li><strong>Off-line Accuracy Evaluation</strong>:
<ul>
<li><strong>Dataset</strong>: 12 real-world organic compounds drawn by 10 participants.</li>
<li><strong>Metric</strong>: Recognition accuracy (Recall and Precision).</li>
<li><strong>Baseline</strong>: Comparison against their own previous work (O&amp;D 2009) and ablations (with/without context).</li>
</ul>
</li>
<li><strong>On-line User Study</strong>:
<ul>
<li><strong>Task</strong>: 9 participants (chemistry students) drew 5 diagrams using both ChemInk (Tablet PC) and ChemDraw (Mouse/Keyboard).</li>
<li><strong>Metric</strong>: Time to completion and subjective user ratings (speed/ease of use).</li>
</ul>
</li>
</ol>
<h2 id="results-accuracy-and-user-study-outcomes">Results: Accuracy and User Study Outcomes</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>97.4% symbol recognition accuracy</strong>, slightly outperforming the best prior result (97.1%). The trainable corner detector achieved <strong>99.91% recall</strong>.</li>
<li><strong>Speed</strong>: Users were <strong>twice as fast</strong> using ChemInk (avg. 36s) compared to ChemDraw (avg. 79s).</li>
<li><strong>Usability</strong>: Participants rated ChemInk significantly higher for speed (6.3 vs 4.5) and ease of use (6.3 vs 4.7) on a 7-point scale.</li>
<li><strong>Conclusion</strong>: Sketch recognition is a viable, superior alternative to standard CAD tools for authoring chemical diagrams.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: 12 real-world organic compounds (e.g., Aspirin, Penicillin) drawn by 10 participants (organic chemistry familiar).</li>
<li><strong>Evaluation Split</strong>: User-independent cross-validation (training on 9 users, testing on 1).</li>
<li><strong>Input</strong>: Raw digital ink (strokes) collected on a Tablet PC.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Corner Detection (Trainable)</strong></p>
<ul>
<li><strong>Method</strong>: Iterative vertex elimination.</li>
<li><strong>Cost Function</strong>: $cost(p_{i}) = \sqrt{mse(s_{i}; p_{i-1}, p_{i+1})} \cdot dist(p_{i}; p_{i-1}, p_{i+1})$</li>
<li><strong>Procedure</strong>: Repeatedly remove the vertex with the lowest cost until the classifier (trained on features like cost, diagonal length, ink density) predicts the remaining vertices are corners.</li>
</ul>
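<p>The elimination loop follows directly from the cost function above. In this sketch a fixed cost threshold stands in for the paper&rsquo;s trained classifier (which also uses features like diagonal length and ink density), and the mse is approximated over the three vertices involved:</p>

```python
import math

def point_line_dist(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    L = math.hypot(bx - ax, by - ay)
    if L == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / L

def corner_cost(pts, i):
    """cost(p_i) = sqrt(mse(s_i; p_(i-1), p_(i+1))) * dist(p_i; p_(i-1), p_(i+1)),
    with the mse approximated over the three vertices themselves."""
    a, b = pts[i - 1], pts[i + 1]
    seg = pts[i - 1 : i + 2]
    mse = sum(point_line_dist(q, a, b) ** 2 for q in seg) / len(seg)
    return math.sqrt(mse) * point_line_dist(pts[i], a, b)

def find_corners(pts, threshold):
    """Iteratively remove the cheapest interior vertex until every remaining
    vertex costs at least `threshold` (a fixed cutoff standing in for the
    learned classifier). Endpoints are always kept."""
    pts = list(pts)
    while len(pts) > 2:
        costs = [corner_cost(pts, i) for i in range(1, len(pts) - 1)]
        i_min = min(range(len(costs)), key=costs.__getitem__)
        if costs[i_min] >= threshold:
            break
        del pts[i_min + 1]
    return pts
```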
<p><strong>2. Feature Extraction</strong></p>
<ul>
<li><strong>Inkpoints</strong>: Sampled at regular intervals. Features = $10 \times 10$ pixel orientation filters (0, 45, 90, 135 degrees) at two scales ($L/2$, $L$), smoothed and downsampled to $5 \times 5$. Total 400 features.</li>
<li><strong>Segments</strong>: Similar image features centered at segment midpoint, plus geometric features (length, ink density).</li>
<li><strong>Candidates</strong>: 5 feature images ($20 \times 20$) including an &ldquo;endpoint&rdquo; image, stretched to normalize aspect ratio.</li>
<li><strong>Dimensionality Reduction</strong>: PCA used to compress feature images to 256 components.</li>
</ul>
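<p>A much-simplified sketch of the inkpoint features: instead of smoothed 10&nbsp;&times;&nbsp;10 filter images at two scales (400 features), this toy version bins local stroke direction into four orientation channels (0, 45, 90, 135 degrees) on a single 5&nbsp;&times;&nbsp;5 grid around the inkpoint:</p>

```python
import math

def inkpoint_features(stroke, center, window, grid=5):
    """Toy orientation-channel features for one inkpoint: bin the direction
    of nearby stroke segments into four channels (0/45/90/135 degrees) on a
    grid x grid image centred on `center`, weighting by ink amount.
    Returns 4 * grid * grid values (the real system uses 400 features)."""
    half = window / 2.0
    cell = window / grid
    feats = [[[0.0] * grid for _ in range(grid)] for _ in range(4)]
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        # Skip stroke segments outside the window around the inkpoint.
        if abs(mx - center[0]) >= half or abs(my - center[1]) >= half:
            continue
        col = int((mx - center[0] + half) / cell)
        row = int((my - center[1] + half) / cell)
        ang = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 180.0
        chan = int(((ang + 22.5) % 180.0) / 45.0)   # nearest of 0/45/90/135
        feats[chan][row][col] += math.hypot(x1 - x0, y1 - y0)  # ink amount
    return [v for plane in feats for rowv in plane for v in rowv]
```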
<p><strong>3. Structure Generation</strong></p>
<ul>
<li><strong>Clustering</strong>: Agglomerative clustering with a complete-link metric to connect symbols.</li>
<li><strong>Threshold</strong>: Stop clustering at distance $0.4L$.</li>
</ul>
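<p>The clustering step can be sketched as a naive complete-link agglomeration; the stopping threshold is passed in directly (the paper uses 0.4L, with L a characteristic symbol size):</p>

```python
import math

def cluster_symbols(symbols, max_dist):
    """Agglomerative clustering with a complete-link metric: two clusters
    merge only when EVERY cross-pair of members is within max_dist. Symbols
    are (x, y) centroids here; the real system links recognised symbols."""
    def complete_link(a, b):
        return max(math.dist(p, q) for p in a for q in b)
    clusters = [[s] for s in symbols]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j])
                if d > max_dist:
                    continue
                if best is None or best[0] > d:
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i].extend(clusters[j])   # always merge the closest pair
        del clusters[j]
    return clusters
```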
<h3 id="models">Models</h3>
<p><strong>Conditional Random Field (CRF)</strong></p>
<ul>
<li><strong>Structure</strong>: 3-level hierarchy (Inkpoints $V_p$, Segments $V_s$, Candidates $V_c$).</li>
<li><strong>Nodes</strong>:
<ul>
<li>$V_p, V_s$ labels: &ldquo;bond&rdquo;, &ldquo;hash&rdquo;, &ldquo;wedge&rdquo;, &ldquo;text&rdquo;.</li>
<li>$V_c$ labels: specific candidate interpretations.</li>
</ul>
</li>
<li><strong>Edges/Potentials</strong>:
<ul>
<li><strong>Entity-Feature</strong>: $\phi(y, x)$ (Linear classifier).</li>
<li><strong>Consistency</strong>: $\psi(y_i, y_j)$ (Hard constraint: child must match parent label).</li>
<li><strong>Spatial Context</strong>: $\psi_{ss}(y_i, y_j)$ (Pairwise geometric relationships between segments: angle, distance).</li>
<li><strong>Overlap</strong>: Prevents conflicting candidates from sharing segments.</li>
</ul>
</li>
<li><strong>Inference</strong>: Loopy Belief Propagation (up to 100 iterations).</li>
<li><strong>Training</strong>: Maximum Likelihood via gradient ascent (L-BFGS).</li>
</ul>
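<p>How the potentials combine can be illustrated for a single candidate. Because the consistency potential is a hard constraint, a candidate&rsquo;s member segments must all carry its label, so its score reduces to the segments&rsquo; unary scores plus spatial-context terms; the exhaustive argmax below stands in for loopy belief propagation, which is needed once candidates overlap and share segments:</p>

```python
LABELS = ("bond", "hash", "wedge", "text")

def candidate_score(unary, pairwise, segs, label):
    """Score one candidate under the hard consistency constraint: sum the
    member segments' unary scores for the candidate's label, plus any
    spatial-context terms between member pairs. unary: {seg: {label: score}};
    pairwise: {(seg_a, seg_b, label): score}. Names are illustrative."""
    score = sum(unary[s][label] for s in segs)
    for i, a in enumerate(segs):
        for b in segs[i + 1:]:
            score += pairwise.get((a, b, label), 0.0)
    return score

def best_interpretation(unary, pairwise, segs):
    """Pick the label maximising the joint score (exhaustive search rather
    than the paper's loopy belief propagation)."""
    return max(LABELS, key=lambda lab: candidate_score(unary, pairwise, segs, lab))
```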
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Accuracy (Recall/Precision) of symbol detection.</li>
<li><strong>Comparison</strong>: Compared against Ouyang &amp; Davis 2009 (previous SOTA).</li>
<li><strong>Speed Metric</strong>: Wall-clock time for diagram creation (ChemInk vs. ChemDraw).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Processor</strong>: 3.7 GHz processor (single thread) for base benchmarking (approx. 1 sec/sketch).</li>
<li><strong>Deployment</strong>: Validated on 1.8 GHz Tablet PCs using multi-core parallelization for real-time feedback.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2011). ChemInk: A Natural Real-Time Recognition System for Chemical Drawings. <em>Proceedings of the 16th International Conference on Intelligent User Interfaces</em>, 267&ndash;276. <a href="https://doi.org/10.1145/1943403.1943444">https://doi.org/10.1145/1943403.1943444</a></p>
<p><strong>Publication</strong>: IUI &lsquo;11</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyangChemInkNaturalRealtime2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemInk: A Natural Real-Time Recognition System for Chemical Drawings}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{ChemInk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 16th International Conference on Intelligent User Interfaces}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Tom Y. and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{267--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Palo Alto, CA, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/1943403.1943444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4503-0419-1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{http://hdl.handle.net/1721.1/78898}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Recognition (Rule-Based)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</guid><description>A strictly rule-based expert system (MolRec) for converting raster chemical diagrams into graph representations.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). Chemical structure recognition: A rule based approach. <em>Proceedings of SPIE</em>, 8297, 82970E. <a href="https://doi.org/10.1117/12.912185">https://doi.org/10.1117/12.912185</a></p>
<p><strong>Publication</strong>: IS&amp;T/SPIE Electronic Imaging 2012</p>
<h2 id="methodological-contribution">Methodological Contribution</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper proposes a novel mechanism (MolRec) for Optical Chemical Structure Recognition (OCSR). It focuses on defining a &ldquo;strictly rule based system&rdquo; to transform vectorised molecule images into graph representations, contrasting this declarative approach with procedural or heuristic-heavy methods. The contribution is validated through direct comparison with the leading open-source tool (OSRA).</p>
<h2 id="motivation-overcoming-procedural-heuristics">Motivation: Overcoming Procedural Heuristics</h2>
<p>Chemical literature contains vast amounts of information locked in 2D diagrams. This visual data is generally inaccessible to search tools or electronic processing. While commercial and academic tools existed (e.g., OSRA, Kekulé), they typically relied on procedural heuristics that required experimental tuning and were difficult to extend. The authors sought to create a system based on precise, declarative rewrite rules to handle the ambiguity inherent in chemical drawing conventions.</p>
<h2 id="core-innovation-geometric-rewrite-rules">Core Innovation: Geometric Rewrite Rules</h2>
<p>The core novelty is the <strong>geometric rewrite rule system</strong> (MolRec).</p>
<ul>
<li><strong>Geometric Primitives</strong>: The system operates on five high-level primitives: Line Segment, Arrow, Circle, Triangle, and Character Group.</li>
<li><strong>Fuzzy Parameters</strong>: It introduces formal definitions for &ldquo;fuzzy&rdquo; relationships (e.g., <code>dash-neighbouring</code>, <code>approximate collinearity</code>) to handle drawing irregularities and scanning artifacts.</li>
<li><strong>Ambiguity Resolution</strong>: Specific rules (R4-R6) are designed to disambiguate visual homoglyphs, such as distinguishing a &ldquo;triple bond&rdquo; from a &ldquo;dashed bold bond&rdquo; based on context (connected atoms).</li>
<li><strong>Explicit &ldquo;Cutting&rdquo;</strong>: A mechanism to identify implicit carbon nodes within continuous line segments (e.g., splitting a long line intersected by parallel lines into a double bond).</li>
</ul>
<h2 id="experimental-setup-vs-baselines">Experimental Setup vs. Baselines</h2>
<p>The authors compared their system (MolRec) against <strong>OSRA</strong> (the leading open-source system) on two datasets:</p>
<ol>
<li><strong>OSRA Benchmark</strong>: 5,735 computer-generated diagrams with ground truth MOL files.</li>
<li><strong>Maybridge Dataset</strong>: 5,730 scanned images (300 dpi) from a drug catalogue, converted to ground truth MOL files via InChI lookups.</li>
</ol>
<p>Evaluation was semantic: The output MOL files were compared using OpenBabel to check for structural equivalence, ignoring syntactic file differences.</p>
<h2 id="results-and-key-findings">Results and Key Findings</h2>
<p><strong>MolRec outperformed OSRA</strong> on both datasets:</p>
<ul>
<li><strong>OSRA Benchmark</strong>: MolRec achieved <strong>88.46%</strong> accuracy vs. OSRA&rsquo;s 77.23%.</li>
<li><strong>Maybridge Dataset</strong>: MolRec achieved <strong>83.84%</strong> accuracy vs. OSRA&rsquo;s 72.57%.</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>Robustness</strong>: The line thinning + Douglas-Peucker vectorization approach was found to be more robust than Hough transform approaches used by other tools.</li>
<li><strong>Failure Modes</strong>: Major remaining errors were caused by &ldquo;touching components&rdquo; (ligatures, characters touching bonds) and complex &ldquo;superatoms&rdquo; (abbreviations like &ldquo;-Ph&rdquo; or &ldquo;-COOH&rdquo;) with ambiguous connection points.</li>
<li><strong>Triangle Detection</strong>: The &ldquo;expanding disc&rdquo; method for identifying wedge bonds was highly effective.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>Two distinct datasets were used for validation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA Benchmark</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">5,735</td>
          <td style="text-align: left">Computer-generated diagrams provided by the OSRA project.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">5,730</td>
<td style="text-align: left">Scanned at 300 dpi from the Maybridge drug catalogue. Ground truth generated via CAS Registry Number $\to$ <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> $\to$ OpenBabel.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline consists of three stages: <strong>Vectorization</strong>, <strong>Geometric Processing</strong>, and <strong>Rule Application</strong>.</p>
<p><strong>1. Vectorization &amp; Primitives</strong></p>
<ul>
<li><strong>Binarization &amp; OCR</strong>: Connected components are labelled and passed to an OCR engine to extract &ldquo;Character Groups&rdquo;.</li>
<li><strong>Thinning</strong>: Image is thinned to unit width.</li>
<li><strong>Simplification</strong>: Douglas-Peucker algorithm converts pixel paths into straight <strong>Line Segments</strong>.</li>
<li><strong>Triangle Detection</strong>: A disc growing algorithm walks inside black regions to identify <strong>Triangles</strong> (wedges). If the disc cannot grow, it is a thick line (Bold Bond).</li>
</ul>
<p><strong>2. Fuzzy Parameters</strong></p>
<p>The rules rely on tolerating drawing imperfections using defined parameters:</p>
<ul>
<li>$r_e$: Radius of collinearity (strict).</li>
<li>$d_l$ / $d_s$: Dash length / Dash separation (fuzzy).</li>
<li>$bdl$ / $bdw$: Bold dash length / width (fuzzy).</li>
<li>$bs$: Bond separation (max distance between parallel bonds).</li>
<li>$ol$: Minimal overlap.</li>
</ul>
<p><strong>3. The Rule System (R1-R18)</strong></p>
<p>The core logic uses 18 mutual-exclusion rules to rewrite geometric primitives into chemical graph edges.</p>
<ul>
<li><strong>Planar Bonds</strong>:
<ul>
<li><strong>R1-R3 (Single/Double/Triple)</strong>: Identifies parallel lines based on <code>bs</code> and <code>ol</code>. Uses &ldquo;cutting&rdquo; to split lines at implicit nodes.</li>
</ul>
</li>
<li><strong>Ambiguity Resolution (Stereo vs. Planar)</strong>:
<ul>
<li><strong>R4 (Dashed Bold vs. Triple)</strong>: Checks context. If purely geometric measures match both, it defaults to Triple unless specific dash constraints are met.</li>
<li><strong>R5 (Dashed Wedge vs. Triple)</strong>: Similar disambiguation based on length monotonicity.</li>
<li><strong>R6 (Dashed Wedge vs. Double)</strong>: Differentiates based on line length differences ($l_1 &gt; l_2$).</li>
</ul>
</li>
<li><strong>Stereo Bonds</strong>:
<ul>
<li><strong>R7-R9 (Dashed Types)</strong>: Identifies collinear segments with specific neighbor patterns (1 neighbor for ends, 2 for internal).</li>
<li><strong>R10-R11 (Hollow Wedge)</strong>: Detects triangles formed by 3 or 4 lines.</li>
<li><strong>R14 (Solid Wedge)</strong>: Direct mapping from Triangle primitive.</li>
</ul>
</li>
<li><strong>Special Structures</strong>:
<ul>
<li><strong>R12 (Wavy Bond)</strong>: Zig-zag line segments.</li>
<li><strong>R13 (Arrow)</strong>: Dative bond.</li>
<li><strong>R16 (Aromatic Ring)</strong>: Circle inside a cycle of &gt;5 lines.</li>
<li><strong>R17-R18 (Bridge Bonds)</strong>: Handles 2.5D crossing bonds (open or closed gaps).</li>
</ul>
</li>
</ul>
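<p>The parallelism test underlying R1&ndash;R3 can be sketched as follows; the angle tolerance and helper names are illustrative, and the real rules additionally perform the cutting step on longer lines:</p>

```python
import math

def double_bond(seg_a, seg_b, bs, ol, ang_tol_deg=5.0):
    """Two nonzero-length line segments form a double bond when they are
    near-parallel, separated by at most the bond separation bs, and their
    projections onto a shared axis overlap by at least ol."""
    def angle(seg):
        (ax, ay), (bx, by) = seg
        return math.degrees(math.atan2(by - ay, bx - ax)) % 180.0
    diff = abs(angle(seg_a) - angle(seg_b))
    if min(diff, 180.0 - diff) > ang_tol_deg:
        return False
    # Project both segments onto seg_a's direction and measure overlap.
    (ax, ay), (bx, by) = seg_a
    ux, uy = bx - ax, by - ay
    L = math.hypot(ux, uy)
    ux, uy = ux / L, uy / L
    def span(seg):
        ts = [(p[0] - ax) * ux + (p[1] - ay) * uy for p in seg]
        return min(ts), max(ts)
    lo_a, hi_a = span(seg_a)
    lo_b, hi_b = span(seg_b)
    overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
    if ol > overlap:
        return False
    # Perpendicular separation of seg_b's midpoint from seg_a's line.
    mx = (seg_b[0][0] + seg_b[1][0]) / 2.0
    my = (seg_b[0][1] + seg_b[1][1]) / 2.0
    sep = abs((mx - ax) * uy - (my - ay) * ux)
    return bs >= sep
```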
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Semantic graph matching. The output MOL file is compared to the ground truth MOL file using OpenBabel. Success = correct graph isomorphism.</p>
<p><strong>Results Table</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">System</th>
          <th style="text-align: left">Success Rate</th>
          <th style="text-align: left">Fail Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>88.46%</strong></td>
          <td style="text-align: left">11.54%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">77.23%</td>
          <td style="text-align: left">22.77%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>83.84%</strong></td>
          <td style="text-align: left">16.16%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">72.57%</td>
          <td style="text-align: left">27.43%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Requirements not specified, but the approach (vectorization + rule matching) is computationally lightweight compared to modern deep learning methods.</li>
</ul>
]]></content:encoded></item><item><title>Reconstruction of Chemical Molecules from Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</guid><description>A 5-module system converting raster images of chemical structures into machine-readable SDF files with custom vectorization.</description><content:encoded><![CDATA[<h2 id="methodological-basis">Methodological Basis</h2>
<p>This paper is a clear methodological contribution describing a novel system architecture. It proposes a five-stage pipeline to solve a specific engineering problem: converting rasterized chemical images into structured chemical files (SDF). The authors validate the method by benchmarking it against a commercial product (CLIDE) and analyzing performance across multiple databases.</p>
<h2 id="the-inaccessibility-of-raster-chemical-images">The Inaccessibility of Raster Chemical Images</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A massive amount of chemical knowledge (scientific articles, patents) exists only as raster images, rendering it inaccessible to computational analysis.</li>
<li><strong>Inefficiency of Manual Entry</strong>: Manual replication of molecules into CAD programs is the standard but unscalable solution for extracting this information.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous academic and commercial attempts (early 90s systems like CLIDE) had faded or remained limited in robustness, leaving the problem &ldquo;wide open&rdquo;.</li>
</ul>
<h2 id="topology-preserving-chemical-vectorization">Topology-Preserving Chemical Vectorization</h2>
<p>The core novelty is the <strong>topology-preserving vectorization</strong> strategy designed specifically for chemical graphs.</p>
<ul>
<li><strong>Graph-Centric Vectorizer</strong>: This system prioritizes graph characteristics over the pixel precision of traditional CAD vectorizers, ensuring one line in the image becomes exactly one vector, regardless of line width or vertex thickness.</li>
<li><strong>Chemical Knowledge Module</strong>: The inclusion of a final validation step that applies chemical rules (valence, charge) to detect and potentially correct reconstruction errors.</li>
<li><strong>Hybrid Recognition</strong>: The separation of the pipeline into a &ldquo;Body&rdquo; path (vectorizer for bonds) and an &ldquo;OCR&rdquo; path (SVM for atomic symbols), which are re-integrated in a reconstruction phase.</li>
</ul>
<h2 id="validating-reconstruction-accuracy">Validating Reconstruction Accuracy</h2>
<p>The authors performed a quantitative validation using <strong>ground-truth SDF files</strong> to verify reconstruction accuracy. The success rate metric evaluated whether the reconstructed graph perfectly matched the true SDF:</p>
<p>$$ \text{Accuracy} = \frac{\text{Correctly Reconstructed SDFs}}{\text{Total Images Evaluated}} $$</p>
<ul>
<li><strong>Baselines</strong>: The system was benchmarked against the commercial software <strong>CLIDE</strong> on &ldquo;Database 1&rdquo;.</li>
<li><strong>Datasets</strong>: Three distinct databases were used:
<ul>
<li><strong>Database 1</strong>: 100 images (varied fonts/line widths).</li>
<li><strong>Database 2</strong>: 100 images.</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale test).</li>
</ul>
</li>
</ul>
<h2 id="system-performance-and-scalability">System Performance and Scalability</h2>
<ul>
<li><strong>Superior Performance</strong>: On Database 1, the proposed system correctly reconstructed <strong>97%</strong> of images, whereas the commercial CLIDE system only reconstructed <strong>25%</strong> (after parameter tuning).</li>
<li><strong>Scalability</strong>: The system maintained reasonable performance on the large dataset (Database 3), achieving <strong>67%</strong> accuracy.</li>
<li><strong>Robustness</strong>: The system can handle varying fonts and line widths via parameterization.</li>
<li><strong>Future Work</strong>: The authors plan to implement a feedback loop where the Chemical Knowledge Module can send error signals back to earlier modules to correct inconsistencies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Reproducible (Paywalled paper, no public code or data).</p>
<h3 id="data">Data</h3>
<p>The paper utilizes three databases for validation. The authors note that for these images, the correct SDF files were already available, allowing for direct automated checking.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Varied line widths, fonts, symbols; used for CLIDE comparison.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>General chemical database.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale database.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The system is composed of five distinct modules executed in sequence:</p>
<p><strong>1. Binarization &amp; Segmentation</strong></p>
<ul>
<li><strong>Preprocessing</strong>: Removal of anti-aliasing effects followed by <strong>adaptive histogram binarization</strong>.</li>
<li><strong>Connected Components</strong>: A non-recursive raster-scan algorithm identifies connected Run-Length Encoded (RLE) segments.</li>
</ul>
<p><strong>2. Optical Character Recognition (OCR)</strong></p>
<ul>
<li><strong>Feature Extraction</strong>: Uses functions similar to <strong>Zernike moments</strong> and a <strong>wavelet transform strategy</strong>.</li>
<li><strong>Classification</strong>: Identifies isolated characters/symbols and separates them from the molecular &ldquo;body&rdquo;.</li>
</ul>
<p><strong>3. Vectorizer</strong></p>
<ul>
<li><strong>Logic</strong>: Assigns local directions to RLE segments based on neighbors, then groups segments with similar local direction patterns.</li>
<li><strong>Constraint</strong>: Enforces a 1-to-1 mapping between visual lines and graph vectors to prevent spurious small vectors at thick joints.</li>
</ul>
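<p>The 1-to-1 line/vector constraint can be sketched as a merge pass over chained vectors. This is an assumption-laden simplification (axis-aligned chains, illustrative tolerance), not the paper's algorithm:</p>

```python
import math

def merge_collinear(vectors, angle_tol=0.1):
    """Sketch of the 1-to-1 line/vector constraint: successive vectors whose
    directions agree within `angle_tol` radians and share an endpoint are
    fused into a single vector, so a thick or noisy line does not yield
    spurious small fragments. Vectors are (x0, y0, x1, y1) tuples."""
    merged = []
    for (x0, y0, x1, y1) in vectors:
        if merged:
            px0, py0, px1, py1 = merged[-1]
            a_prev = math.atan2(py1 - py0, px1 - px0)
            a_cur = math.atan2(y1 - y0, x1 - x0)
            # Fuse if the new vector continues the previous one.
            if (px1, py1) == (x0, y0) and abs(a_cur - a_prev) < angle_tol:
                merged[-1] = (px0, py0, x1, y1)
                continue
        merged.append((x0, y0, x1, y1))
    return merged
```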
<p><strong>4. Reconstruction (Heuristics)</strong></p>
<p>This module annotates vectors with chemical significance:</p>
<ul>
<li><strong>Chiral Bonds (Wedges)</strong>: Identified by registering vectors against original pixel density. If a vector corresponds to a thick geometric form (triangle/rectangle), it is labeled chiral.</li>
<li><strong>Dotted Chiral Bonds</strong>: Identified by clustering isolated vectors (no neighbors) using <strong>quadtree clustering</strong> on geometric centers. Coherent parallel clusters are fused into a single bond.</li>
<li><strong>Double/Triple Bonds</strong>: Detected by checking for parallel vectors within a <strong>Region of Interest (ROI)</strong> defined as the vector&rsquo;s bounding box <strong>dilated by a factor of 2</strong>.</li>
<li><strong>Superatoms</strong>: OCR results are clustered by dilating bounding boxes; overlapping boxes are grouped into names (e.g., &ldquo;COOH&rdquo;).</li>
</ul>
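<p>The double/triple-bond heuristic can be sketched as a parallel-vector search inside a dilated region of interest. Padding the box by half the vector length is an illustrative reading of the "dilated by a factor of 2" rule, since the paper does not spell out how degenerate (axis-aligned) boxes are handled:</p>

```python
import math

def roi(v, factor=2.0):
    """Dilated bounding box of vector v = (x0, y0, x1, y1). The paper
    dilates by a factor of 2; padding every side by half the vector
    length is an illustrative interpretation of that rule."""
    x0, y0, x1, y1 = v
    pad = math.hypot(x1 - x0, y1 - y0) * (factor - 1) / 2
    return (min(x0, x1) - pad, min(y0, y1) - pad,
            max(x0, x1) + pad, max(y0, y1) + pad)

def is_multiple_bond(v, others, angle_tol=0.1):
    """Count vectors roughly parallel to v whose midpoints fall inside
    v's ROI; one hit suggests a double bond, two a triple bond.
    (Angle wrap-around near pi is ignored for brevity.)"""
    ax = math.atan2(v[3] - v[1], v[2] - v[0]) % math.pi
    x_min, y_min, x_max, y_max = roi(v)
    hits = 0
    for u in others:
        au = math.atan2(u[3] - u[1], u[2] - u[0]) % math.pi
        mx, my = (u[0] + u[2]) / 2, (u[1] + u[3]) / 2
        if abs(au - ax) < angle_tol and x_min <= mx <= x_max and y_min <= my <= y_max:
            hits += 1
    return hits >= 1
```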
<p><strong>5. Chemical Knowledge</strong></p>
<p>Validates the generated graph against rules for valences and charges. If valid, an SDF file is generated.</p>
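<p>A minimal sketch of such a valence check, assuming a table of typical covalent valences (the paper's full rule set also covers charges):</p>

```python
# Typical neutral covalent valences (illustrative subset).
VALENCES = {"C": 4, "N": 3, "O": 2, "H": 1, "S": 2, "F": 1, "Cl": 1, "Br": 1}

def valences_ok(atoms, bonds):
    """Reject reconstructions where any atom's total bond order exceeds
    its allowed valence; implicit hydrogens fill any remainder.
    Atoms are element symbols; bonds are (i, j, order) triples."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return all(used[k] <= VALENCES.get(atoms[k], 8) for k in range(len(atoms)))
```

For example, a C=O fragment passes, while an O≡O reconstruction is flagged as an error.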
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Support Vector Machine)</strong>: Used within the OCR module to classify connected components as characters or symbols. It is trained to be tolerant to rotation and font variations.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is binary success rate per molecule (perfect reconstruction of the SDF).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (DB1)</th>
          <th>Value (DB3)</th>
          <th>Baseline (CLIDE on DB1)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Reconstruction</td>
          <td><strong>97%</strong></td>
          <td>67%</td>
          <td>25%</td>
          <td>CLIDE required significant parameter tuning to reach 25%.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., Friedrich, C. M., Akle, S., &amp; Hofmann-Apitius, M. (2007). Reconstruction of Chemical Molecules from Images. <em>Proceedings of the 29th Annual International Conference of the IEEE EMBS</em>, 4609-4612. <a href="https://doi.org/10.1109/IEMBS.2007.4353366">https://doi.org/10.1109/IEMBS.2007.4353366</a></p>
<p><strong>Publication venue</strong>: IEEE EMBS 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriReconstructionChemicalMolecules2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Reconstruction of {{Chemical Molecules}} from {{Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 29th Annual International Conference of the IEEE EMBS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and Friedrich, Christoph M. and Akle, Santiago and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4609--4612}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IEMBS.2007.4353366}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA: Open Source Optical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/osra/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/osra/</guid><description>The first open-source optical structure recognition (OSR) utility for converting chemical images into SMILES/SD formats.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., &amp; Nicklaus, M. C. (2009). Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. <em>Journal of Chemical Information and Modeling</em>, 49(3), 740-743. <a href="https://doi.org/10.1021/ci800067r">https://doi.org/10.1021/ci800067r</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Model. 2009</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://sourceforge.net/projects/osra/">SourceForge Project</a></li>
<li><a href="http://cactus.nci.nih.gov/osra">Web Interface (Historical)</a></li>
</ul>
<h2 id="overview-and-motivation">Overview and Motivation</h2>
<p><strong>Resource</strong></p>
<p>This paper is a quintessential <strong>Infrastructure</strong> contribution ($\Psi_{\text{Resource}}$). While it contains significant algorithmic detail, the rhetorical structure and primary goal place it squarely as an infrastructure paper. The dominant contribution is the creation, release, and documentation of a software tool (OSRA).</p>
<p>A vast amount of chemical knowledge is locked in scientific and patent documents as graphical images (Kekulé structures). This is the classic chemical informatics challenge: decades of chemical knowledge are trapped in visual form.</p>
<ul>
<li><strong>Legacy Data Gap</strong>: Historical literature does not use computer-parsable formats, making automated processing of millions of documents impossible without optical recognition. Scientific papers and patents have historically depicted molecules as 2D structural diagrams.</li>
<li><strong>Need for Automation</strong>: Manual transcription is not scalable for the hundreds of thousands of documents available. While modern standards like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> and CML exist, the vast majority of chemical literature remains inaccessible for computational analysis.</li>
<li><strong>Open Source Gap</strong>: Before OSRA, only commercial software like CLiDE existed for this task, creating a barrier for academic researchers and limiting reproducibility; no universal, open-source solution was available prior to this work.</li>
</ul>
<h2 id="core-innovations-and-pipeline">Core Innovations and Pipeline</h2>
<p>OSRA is claimed to be the <strong>first open-source optical structure recognition (OSR) program</strong>. The novelty lies in creating an accessible OCSR system with a practical, multi-stage pipeline that combines classical image processing techniques with chemical knowledge.</p>
<p><strong>Key contributions:</strong></p>
<ol>
<li>
<p><strong>Integrated Pipeline</strong>: It uniquely combines existing open-source image processing tools (ImageMagick for formats, Potrace for vectorization, GOCR/OCRAD for text) into a chemical recognition workflow. The value is in the assembly and integration.</p>
</li>
<li>
<p><strong>Vectorization-Based Approach</strong>: OSRA uses the Potrace library to convert bitmap images into vector graphics (Bezier curves), then analyzes the geometry of these curves to identify bonds and atoms. This is more robust than angle-based detection methods because it leverages continuous mathematical properties of curves.</p>
</li>
<li>
<p><strong>Multi-Resolution Processing with Confidence Estimation</strong>: The system automatically processes each image at three different resolutions (72, 150, and 300 dpi), generating up to three candidate structures. A learned confidence function trained via linear regression on chemical features (heteroatom count, ring patterns, fragment count) selects the most chemically sensible result.</p>
</li>
<li>
<p><strong>Resolution Independence</strong>: Unlike some predecessors, it is designed to handle over 90 image formats and works independently of specific resolutions or fonts.</p>
</li>
<li>
<p><strong>Comprehensive Chemical Rules</strong>: OSRA implements sophisticated heuristics for chemical structure interpretation:</p>
<ul>
<li>Distinguishes bridge bond crossings from tetrahedral carbon centers using graph connectivity rules</li>
<li>Recognizes stereochemistry from wedge bonds (detected via line thickness gradients)</li>
<li>Handles old-style aromatic notation (circles inside rings)</li>
<li>Expands common chemical abbreviations (superatoms like &ldquo;COOH&rdquo; or &ldquo;CF₃&rdquo;)</li>
<li>Uses the 75th percentile of bond lengths as the reference length to avoid outlier bias</li>
</ul>
</li>
</ol>
<h2 id="methodology-and-validation">Methodology and Validation</h2>
<p>The authors validated OSRA against both commercial software and manual curation:</p>
<ol>
<li>
<p><strong>Commercial Comparison</strong>: They compared OSRA against CLiDE (a commercial OSR tool) using a &ldquo;small test set&rdquo; of 11 files provided by Simbiosys containing 42 structures. Performance was measured both as exact match accuracy and as Tanimoto similarity using molecular fingerprints.</p>
</li>
<li>
<p><strong>Internal Validation</strong>: They tested on an internal set of 66 images containing 215 structures, covering various resolutions, color depths, and drawing styles to assess performance at scale and characterize typical error patterns.</p>
</li>
<li>
<p><strong>Metric Definition</strong>: They defined recognition success using both exact matches (&ldquo;Perfect by InChI&rdquo;) and Tanimoto similarity (using CACTVS fingerprints). The authors explicitly argued for using Tanimoto similarity as the primary evaluation metric, reasoning that partial recognition (e.g., missing a methyl group) still provides useful chemical information, which binary &ldquo;correct/incorrect&rdquo; judgments fail to capture.</p>
</li>
</ol>
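<p>The Tanimoto metric rewards partial recognition. A minimal sketch, representing each fingerprint as a set of on-bit indices (the paper uses CACTVS fingerprints; the set representation is just for illustration):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two bit-set fingerprints:
    |A ∩ B| / |A ∪ B|. A near-miss structure (e.g., one missing methyl)
    scores high instead of the zero a binary match would give it."""
    if not fp_a and not fp_b:
        return 1.0  # convention for two empty fingerprints (illustrative)
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```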
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Competitive Accuracy</strong>: On the small comparative set, OSRA recognized 26 structures perfectly (by InChI) versus CLiDE&rsquo;s 11, demonstrating that an open-source, rule-based approach could outperform established commercial systems.</p>
</li>
<li>
<p><strong>Robustness</strong>: On the internal diverse set (215 structures), OSRA achieved a 93% average Tanimoto similarity and perfectly recognized 107 structures (50%). Tanimoto similarity above 85% was achieved for 182 structures (85%). This established OSRA as a competitive tool for practical use.</p>
</li>
<li>
<p><strong>Multi-Resolution Success</strong>: The multi-resolution strategy allowed OSRA to handle images with varying quality and formats. The confidence function (with correlation coefficient $r=0.89$) successfully identified which resolution produced the most chemically plausible structure.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge issues with:</p>
<ul>
<li>&ldquo;Imperfect segmentation&rdquo; leading to missed structures (3 missed in internal set) and false positives (7 in internal set)</li>
<li>Novel drawing conventions not covered by the implemented heuristics</li>
<li>Highly degraded or noisy images where vectorization fails</li>
<li>Hand-drawn structures that deviate significantly from standard chemical drawing practices</li>
<li>Complex reaction schemes with multiple molecules and arrows</li>
</ul>
</li>
<li>
<p><strong>Open-Source Impact</strong>: By releasing OSRA as open-source software, the authors enabled widespread adoption and community contribution. This established a foundation for future OCSR research and made the technology accessible to researchers without commercial software budgets.</p>
</li>
</ul>
<p>The work established that rule-based OCSR systems could achieve competitive performance when carefully engineered with chemical knowledge. OSRA became a standard baseline for the field and remained the dominant open-source solution until the rise of deep learning methods over a decade later. The vectorization-based approach and the emphasis on Tanimoto similarity as an evaluation metric influenced subsequent work in the area.</p>
<h2 id="technical-details">Technical Details</h2>
<p><strong>Grayscale Conversion</strong></p>
<p>OSRA uses a non-standard grayscale conversion to preserve light-colored atoms (e.g., yellow sulfur):</p>
<p>$$\text{Gray} = \min(R, G, B)$$</p>
<p>This prevents light colors from being washed out during binarization, unlike the standard weighted formula ($0.3R + 0.59G + 0.11B$).</p>
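<p>The two conversions side by side, as a minimal sketch:</p>

```python
def to_gray_min(r, g, b):
    """OSRA-style conversion: keep the darkest channel so light-colored
    atom labels (e.g., yellow sulfur) survive binarization."""
    return min(r, g, b)

def to_gray_weighted(r, g, b):
    """Standard luminance formula, which washes pure yellow out to
    near-white before binarization."""
    return 0.3 * r + 0.59 * g + 0.11 * b
```

For pure yellow (255, 255, 0), the min rule yields 0 (kept as ink) while the weighted formula yields about 227 (lost as background).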
<p><strong>Image Segmentation</strong></p>
<p>Chemical structures are identified within a page using specific bounding box criteria:</p>
<ul>
<li><strong>Black pixel density</strong>: Must be between 0.0 and 0.2</li>
<li><strong>Aspect ratio</strong>: Height-to-width ratio must be between 0.2 and 5.0</li>
<li><strong>Minimum size</strong>: Width and height must be &gt;50 pixels at resolutions &gt;150 dpi</li>
</ul>
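<p>These criteria combine into a simple bounding-box filter; a minimal sketch (function name hypothetical):</p>

```python
def looks_like_structure(black_pixels, width, height, dpi=300):
    """Sketch of OSRA's page-segmentation filter using the criteria above:
    black-pixel density in (0, 0.2), aspect ratio in (0.2, 5.0), and a
    50-pixel minimum size at resolutions above 150 dpi."""
    area = width * height
    density = black_pixels / area if area else 0.0
    aspect = height / width if width else 0.0
    big_enough = (width > 50 and height > 50) if dpi > 150 else True
    return 0.0 < density < 0.2 and 0.2 < aspect < 5.0 and big_enough
```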
<p><strong>Noise Detection and Smoothing</strong></p>
<p>A &ldquo;noise factor&rdquo; determines whether anisotropic smoothing is applied:</p>
<p>$$\text{Noise Factor} = \frac{\text{Count of 2-pixel line segments}}{\text{Count of 3-pixel line segments}}$$</p>
<p>Smoothing is applied only if this ratio is between 0.5 and 1.0.</p>
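<p>As a one-function sketch (the zero-denominator guard is an assumption, not specified in the paper):</p>

```python
def needs_smoothing(two_px_segments, three_px_segments):
    """Noise factor = (#2-pixel segments) / (#3-pixel segments);
    anisotropic smoothing triggers only in the 0.5-1.0 band."""
    if three_px_segments == 0:
        return False  # illustrative guard; not specified in the paper
    noise_factor = two_px_segments / three_px_segments
    return 0.5 <= noise_factor <= 1.0
```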
<p><strong>Atom Detection from Bezier Curves</strong></p>
<p>Potrace Bezier control points are flagged as potential atoms if:</p>
<ol>
<li>The point is classified as a &ldquo;corner&rdquo; by Potrace</li>
<li>The vector direction change has a <strong>normal component</strong> of at least 2 pixels</li>
</ol>
<p>The normal component criterion is more robust than angle-based detection because angles are difficult to measure accurately in pixelated environments where line thickness is non-zero.</p>
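<p>The normal-component test can be sketched with a 2D cross product; the direction vectors and threshold handling here are illustrative:</p>

```python
import math

def is_corner_atom(d_in, d_out, min_normal=2.0):
    """Sketch of the atom-candidate test: the component of the outgoing
    direction d_out perpendicular to the incoming direction d_in must be
    at least 2 px. Thresholding this length is more stable than measuring
    the turn angle in a pixelated image with thick lines."""
    nx, ny = d_in
    norm = math.hypot(nx, ny)
    if norm == 0:
        return False
    # Perpendicular component of d_out relative to d_in (2D cross product).
    normal_component = abs(nx * d_out[1] - ny * d_out[0]) / norm
    return normal_component >= min_normal
```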
<p><strong>Bond Length Estimation</strong></p>
<p>The reference bond length is computed as the <strong>75th percentile</strong> of all detected bond lengths. This avoids bias from outlier bonds (e.g., extremely short or long bonds from recognition errors).</p>
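<p>A nearest-rank sketch of the percentile computation (the paper does not specify its interpolation scheme):</p>

```python
import math

def reference_bond_length(lengths):
    """Nearest-rank 75th percentile of detected bond lengths; a high
    percentile keeps a few spuriously short or long vectors from
    skewing the reference length."""
    ranked = sorted(lengths)
    idx = max(0, math.ceil(0.75 * len(ranked)) - 1)
    return ranked[idx]
```

With lengths [10, 10, 10, 100], the reference stays at 10 rather than being dragged up by the outlier.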
<p><strong>Confidence Function</strong></p>
<p>A linear regression function selects the best result from the multi-scale processing:</p>
<p>$$\text{confidence} = 0.316 - 0.016N_C + 0.034N_N + 0.067N_O + \ldots + 0.330N_{\text{rings5}} + \ldots$$</p>
<p>where $N_C$, $N_N$, and $N_O$ are the counts of carbon, nitrogen, and oxygen atoms, respectively. The function rewards structures with more recognized heteroatoms and rings while penalizing disconnected fragments; additional terms cover further element, ring-pattern, and fragment counts.</p>
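<p>A minimal sketch of the selection step, using only the published weights quoted above (an illustrative subset; the full regression includes more element, ring, and fragment terms):</p>

```python
# Illustrative subset of the published regression weights.
WEIGHTS = {"bias": 0.316, "C": -0.016, "N": 0.034, "O": 0.067, "rings5": 0.330}

def confidence(counts):
    """Score one candidate structure from its feature counts."""
    return WEIGHTS["bias"] + sum(
        WEIGHTS[k] * counts.get(k, 0) for k in WEIGHTS if k != "bias")

def pick_best(candidates):
    """candidates: (structure, feature_counts) pairs, one per resolution
    (72/150/300 dpi); the highest-scoring structure is kept."""
    return max(candidates, key=lambda sc: confidence(sc[1]))[0]
```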
<p><strong>Test Data</strong></p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Comparison</td>
          <td>&ldquo;Small test set&rdquo; (Simbiosys)</td>
          <td>11 files (42 structures)</td>
          <td>Used to compare vs. CLiDE</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Internal Test Set</td>
          <td>66 images (215 structures)</td>
          <td>Various resolutions, color depths, styles</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used to define &ldquo;Success&rdquo;:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Perfect by InChI</strong></td>
          <td>Exact match of the InChI string to the human-curated structure.</td>
      </tr>
      <tr>
          <td><strong>Average Tanimoto</strong></td>
          <td>Tanimoto similarity (CACTVS fingerprints) between OSRA output and ground truth.</td>
      </tr>
      <tr>
          <td><strong>uuuuu</strong></td>
          <td>NCI CADD identifier match (topology only; ignores stereochem/charge/tautomers).</td>
      </tr>
  </tbody>
</table>
<p><strong>Results Table (Comparison)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Perfect (InChI)</th>
          <th>T &gt; 85%</th>
          <th>uuuuu Match</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>26 / 42</td>
          <td>39 / 42</td>
          <td>28 / 42</td>
      </tr>
      <tr>
          <td><strong>CLiDE</strong></td>
          <td>11 / 42</td>
          <td>26 / 42</td>
          <td>12 / 42</td>
      </tr>
  </tbody>
</table>
<h3 id="softwaredependencies">Software/Dependencies</h3>
<p>The system relies on external libraries:</p>
<ul>
<li><strong>ImageMagick</strong>: Image format parsing (supports 90+ formats)</li>
<li><strong>Ghostscript</strong>: PDF/PS interpretation</li>
<li><strong>Potrace</strong>: Vectorization (converts bitmap to Bezier curves)</li>
<li><strong>GOCR / OCRAD</strong>: Optical Character Recognition (heteroatom label recognition)</li>
<li><strong>OpenBabel / RDKit</strong>: Chemical backends for connection table compilation</li>
<li><strong>Output Formats</strong>: SMILES strings and SD files</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{filippovOpticalStructureRecognition2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Software To Recover Chemical Information}}: {{OSRA}}, {{An Open Source Solution}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Software To Recover Chemical Information}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V. and Nicklaus, Marc C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = mar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{740--743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800067r}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
]]></content:encoded></item><item><title>Optical Recognition of Chemical Graphics</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</guid><description>A 1993 prototype system for converting scanned chemical diagrams into connection tables using vectorization and heuristic-based structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-early-ocsr-pipeline-methodology">Contribution: Early OCSR Pipeline Methodology</h2>
<p><strong>Method</strong>. This paper proposes a novel architectural pipeline for the automatic recognition of chemical structure diagrams. It defines a specific sequence of algorithmic steps, including diagram separation, vectorization, segmentation, and structural analysis, which converts pixel data into a semantic chemical representation (MDL Molfile).</p>
<h2 id="motivation-digitizing-legacy-chemical-data">Motivation: Digitizing Legacy Chemical Data</h2>
<p><strong>Problem</strong>: In 1993, vast databases of chemical information existed, but the entry of graphical data was significantly less advanced than the facilities for manipulating it.</p>
<p><strong>Gap</strong>: Creating digital chemical structures required trained operators to manually redraw diagrams that already existed in printed journals and catalogs, leading to a costly duplication of effort.</p>
<p><strong>Goal</strong>: To automate the creation of coded representations (connection tables) directly from optically scanned diagrams on printed pages.</p>
<h2 id="novelty-general-document-analysis-integrated-with-chemical-rules">Novelty: General Document Analysis Integrated with Chemical Rules</h2>
<p><strong>Pipeline Approach</strong>: The authors present a complete end-to-end system that integrates general document analysis with domain-specific chemical rules.</p>
<p><strong>Convex Bounding Separation</strong>: A novel use of &ldquo;bounding polygons&rdquo; defined by 8 fixed-direction bands to distinguish diagram components from text with linear computational cost.</p>
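<p>The linear cost follows because each pixel is projected once per fixed axis. A minimal sketch, assuming the 8 side directions correspond to four projection axes at 45&deg; increments (function name illustrative):</p>

```python
import math

def octagonal_bound(points):
    """Sketch of the convex bounding polygon: project every pixel onto four
    fixed axes (0°, 45°, 90°, 135°) and keep the min/max projection per
    axis. The resulting 8-sided bound is computed in one linear pass and
    approximates distances better than an axis-aligned rectangle."""
    axes = [(1.0, 0.0), (math.sqrt(0.5), math.sqrt(0.5)),
            (0.0, 1.0), (-math.sqrt(0.5), math.sqrt(0.5))]
    extents = []
    for ax, ay in axes:
        projections = [x * ax + y * ay for x, y in points]
        extents.append((min(projections), max(projections)))
    return extents
```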
<p><strong>Vector-Based Segmentation</strong>: The system uses the output of a vectorizer (GIFTS) to classify diagram elements. It relies on the observation that vectorizers approximate characters with sets of short vectors to distinguish them from bonds.</p>
<h2 id="methodology-and-system-evaluation">Methodology and System Evaluation</h2>
<p><strong>System Implementation</strong>: The algorithm was implemented in &lsquo;C&rsquo; on IBM PS/2 personal computers running OS/2 Presentation Manager.</p>
<p><strong>Input Specification</strong>: The system was tested on documents scanned at 300 dpi using an IBM 3119 scanner.</p>
<p><strong>Qualitative Evaluation</strong>: The authors evaluated the system on &ldquo;typical scanned structures&rdquo; and &ldquo;simple planar diagrams&rdquo;. Large-scale quantitative benchmarking was not conducted in this work.</p>
<h2 id="results-performance-and-limitations">Results, Performance, and Limitations</h2>
<p><strong>Performance</strong>: The prototype processes a typical structure (after extraction) in less than one minute.</p>
<p><strong>Accuracy</strong>: It is reported to be accurate for simple planar diagrams.</p>
<p><strong>Output Format</strong>: The system successfully generates MDL Molfiles that interface with standard chemistry software like REACCS, MACCS, and modeling tools.</p>
<p><strong>Limitations</strong>: The system struggles with broken lines, characters touching bond structures, and requires manual intervention for complex errors.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed (Historical). As an early prototype from 1993, no source code, datasets, or digital models were publicly released. Reproducing this exact system would require recreating the pipeline from the described heuristics and sourcing vintage OCR software.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No digital artifacts were released with this 1993 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The paper does not release a dataset but specifies the input requirements for the system.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input</td>
          <td>Scanned Documents</td>
          <td>N/A</td>
          <td>Black ink on white paper; scanned at 300 dpi bi-level.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper relies on a pipeline of specific heuristics and geometric rules.</p>
<p><strong>1. Diagram Separation (Region Growing)</strong></p>
<ul>
<li><strong>Bounding Polygons</strong>: Uses convex polygons defined by pairs of parallel sides in 8 fixed directions. This approximation improves distance estimation compared to bounding rectangles.</li>
<li><strong>Seed Detection</strong>: Finds a connected component with bounding dimension $D &gt; d_{\text{max char size}}$.</li>
<li><strong>Aggregation</strong>: Iteratively searches for neighboring components within a specific distance threshold $d_t$ (where $d_t$ is smaller than the whitespace margin) and merges them into the bounding polygon.</li>
</ul>
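<p>The region-growing loop above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: axis-aligned bounding boxes stand in for the 8-direction bounding polygons (so gap distances are only approximate), and the thresholds <code>d_max_char</code> and <code>d_t</code> are parameters the caller must supply.</p>

```python
# Hypothetical sketch of the region-growing aggregation step.
# Axis-aligned boxes approximate the paper's 8-direction polygons.

def box_distance(a, b):
    """Gap between two boxes (x0, y0, x1, y1); 0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5

def merge(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def grow_diagram(components, d_max_char, d_t):
    """Seed on a component larger than any character, then absorb
    neighbours closer than the distance threshold d_t."""
    comps = list(components)
    seed = next(c for c in comps
                if max(c[2] - c[0], c[3] - c[1]) > d_max_char)
    comps.remove(seed)
    region, changed = seed, True
    while changed:
        changed = False
        for c in comps[:]:
            if box_distance(region, c) < d_t:
                region = merge(region, c)
                comps.remove(c)
                changed = True
    return region, comps  # diagram region, leftover (text) components
```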
<p><strong>2. Vectorization &amp; Segmentation</strong></p>
<ul>
<li><strong>Vectorization</strong>: Uses the GIFTS system (IBM Tokyo) to fit lines to pixels.</li>
<li><strong>Classification Heuristics</strong>:
<ul>
<li><strong>Ratio Test</strong>: If the ratio of a group&rsquo;s dimension to the full diagram dimension is below a threshold $\tau$, it is classified as a <strong>Symbol</strong>:
$$ \frac{D_{\text{group}}}{D_{\text{diagram}}} &lt; \tau $$</li>
<li><strong>Context Rule</strong>: Small vector groups near letters are classified as <strong>Characters</strong> (handles the &lsquo;l&rsquo; in &lsquo;Cl&rsquo;).</li>
<li><strong>Circle Rule</strong>: A group is a <strong>Circle</strong> (aromatic ring) if it contains $N \ge 8$ vectors in a roughly circular arrangement.</li>
<li><strong>Default</strong>: Otherwise, classified as <strong>Bond Structure</strong>.</li>
</ul>
</li>
</ul>
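<p>A minimal sketch of how these heuristics compose into a single classifier. The ratio test and circle rule follow the description above, but the threshold values and the circularity check are placeholders, and the context rule for characters (which needs OCR neighbours) is omitted.</p>

```python
# Illustrative re-implementation of the segmentation heuristics;
# thresholds (tau, n_circle, radial tolerance) are placeholders.
import math

def classify_group(vectors, diagram_dim, tau=0.1, n_circle=8):
    """vectors: list of ((x0, y0), (x1, y1)) line segments."""
    xs = [p[0] for v in vectors for p in v]
    ys = [p[1] for v in vectors for p in v]
    group_dim = max(max(xs) - min(xs), max(ys) - min(ys))

    # Ratio test: groups small relative to the diagram are symbols.
    if group_dim / diagram_dim < tau:
        return "symbol"

    # Circle rule: many vectors whose endpoints lie near a common
    # radius around the centroid indicate an aromatic-ring circle.
    if len(vectors) >= n_circle:
        cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
        radii = [math.hypot(p[0] - cx, p[1] - cy)
                 for v in vectors for p in v]
        mean_r = sum(radii) / len(radii)
        if all(abs(r - mean_r) < 0.2 * mean_r for r in radii):
            return "circle"

    # Default: everything else is bond structure.
    return "bond structure"
```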
<p><strong>3. Cleanup &amp; Structure Recognition</strong></p>
<ul>
<li><strong>Short Vector Removal</strong>: Vectors shorter than a fraction of the median line length $L_{\text{median}}$ are shrunk to their midpoint (fixing broken junctions).</li>
<li><strong>Vertex Merging</strong>: If two vectors meet at an angle $\theta &lt; 35^{\circ}$, the vertex is removed (fixing single lines broken into two).</li>
<li><strong>Aromatic Processing</strong>: If a circle is detected, the system identifies the 6 closest atoms and adds double bonds to every second bond in the ring.</li>
</ul>
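<p>The two line-cleanup rules can be illustrated as below, under an assumed data layout of vectors as endpoint pairs. The paper states the 35&deg; vertex rule without specifying how the angle is measured; here it is taken as the deviation from a straight continuation, which matches the stated goal of rejoining single lines broken in two.</p>

```python
# Sketch of the two line-cleanup rules; thresholds are illustrative.
import math

def shrink_short_vectors(vectors, median_len, frac=0.25):
    """Collapse vectors much shorter than the median line length to
    their midpoint, repairing broken junctions."""
    out = []
    for (x0, y0), (x1, y1) in vectors:
        if math.hypot(x1 - x0, y1 - y0) < frac * median_len:
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            out.append(((mx, my), (mx, my)))
        else:
            out.append(((x0, y0), (x1, y1)))
    return out

def should_merge(v1, v2, max_angle_deg=35.0):
    """True if two vectors meeting end-to-start deviate from a
    straight continuation by less than the threshold, i.e. they are
    likely one line broken in two and the vertex should be removed."""
    (ax, ay), (bx, by) = v1
    _, (cx, cy) = v2
    u = (bx - ax, by - ay)
    w = (cx - bx, cy - by)
    dot = u[0] * w[0] + u[1] * w[1]
    norm = math.hypot(*u) * math.hypot(*w)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    return angle < max_angle_deg
```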
<h3 id="models">Models</h3>
<p><strong>OCR</strong>:</p>
<ul>
<li>The system uses a feature-based, single-font OCR engine.</li>
<li>It assumes non-serif, plain styles typical of drafting standards.</li>
<li>Character images are normalized for size before recognition.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Scanner</strong>: IBM 3119 (300 dpi).</li>
<li><strong>Compute</strong>: IBM PS/2 series running OS/2.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Casey, R., et al. (1993). Optical Recognition of Chemical Graphics. <em>Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR &lsquo;93)</em>, 627-631. <a href="https://doi.org/10.1109/ICDAR.1993.395658">https://doi.org/10.1109/ICDAR.1993.395658</a></p>
<p><strong>Publication</strong>: ICDAR 1993</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{caseyOpticalRecognitionChemical1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical Recognition of Chemical Graphics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of 2nd {{International Conference}} on {{Document Analysis}} and {{Recognition}} ({{ICDAR}} &#39;93)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Casey, R. and Boyer, S. and Healey, P. and Miller, A. and Oudot, B. and Zilles, K.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{627--631}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE Comput. Soc. Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tsukuba Science City, Japan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1993.395658}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OCSR Methods: A Taxonomy of Approaches</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/</guid><description>Overview of optical chemical structure recognition methods organized by approach, from deep learning to rule-based systems.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>Optical Chemical Structure Recognition (OCSR) aims to automatically extract machine-readable molecular representations (e.g., SMILES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>, mol files) from images of chemical structures. Methods have evolved from early rule-based systems to modern deep learning approaches.</p>
<p>This note organizes OCSR methods by their fundamental approach, providing a framework for understanding the landscape of techniques.</p>
<h2 id="common-limitations-and-failure-modes">Common Limitations and Failure Modes</h2>
<p>Regardless of the underlying paradigm, most OCSR systems struggle with a common set of challenges:</p>
<ol>
<li><strong>Stereochemistry</strong>: Ambiguous wedge/dash bonds, varying drawing conventions, and implicit stereocenters frequently lead to incorrect isomer generation.</li>
<li><strong>Markush Structures</strong>: Generic structures with variable R-groups (common in patents) require complex subgraph mapping that sequence-based models often fail to capture.</li>
<li><strong>Image Degradation</strong>: Artifacts, low resolution, skewed scans, and hand-drawn irregularities degrade the performance of both rule-based heuristics and CNN feature extractors.</li>
<li><strong>Superatoms and Abbreviations</strong>: Textual abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;t-Bu&rdquo;, &ldquo;Boc&rdquo;) embedded within the image require joint optical character recognition (OCR) and structural parsing.</li>
</ol>
<h2 id="review--survey-papers">Review &amp; Survey Papers</h2>
<p>Comprehensive surveys and systematization of knowledge papers that organize and synthesize the OCSR literature.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Focus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2020</td>
          <td><a href="https://doi.org/10.1186/s13321-020-00465-0">A review of optical chemical structure recognition tools</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/">Rajan et al. 2020</a></td>
          <td>Survey of 30 years of OCSR development (1990-2019); benchmark of three open-source tools (OSRA, Imago, MolVec) on four datasets</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1186/s13321-022-00642-3">Review of techniques and models used in optical chemical structure recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/">Musazade et al. 2022</a></td>
          <td>Systematization of OCSR evolution from rule-based systems to modern deep learning; identifies paradigm shift to image captioning and critiques evaluation metrics</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1039/D3DD00228D">Comparing software tools for optical chemical structure recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/krasnov-ocsr-benchmark-2024/">Krasnov et al. 2024</a></td>
          <td>Benchmark of 8 open-access tools on 2,702 manually curated patent images; proposes ChemIC classifier for hybrid routing approach</td>
      </tr>
  </tbody>
</table>
<h2 id="deep-learning-methods">Deep Learning Methods</h2>
<p>End-to-end neural network architectures that learn to map images directly to molecular representations.</p>
<p><strong>Note on Paper Types</strong>: Papers listed below are primarily <strong>Method</strong> ($\Psi_{\text{Method}}$) papers focused on novel architectures and performance improvements. Some also have secondary <strong>Resource</strong> ($\Psi_{\text{Resource}}$) contributions through released tools or datasets. See the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> for classification details.</p>
<h3 id="image-to-sequence-paradigm">Image-to-Sequence Paradigm</h3>
<p>Treating chemical structure recognition as an image captioning task, these methods use encoder-decoder architectures (often with attention mechanisms) to generate sequential molecular representations like SMILES directly from pixels. Formally, given an image $I$, the model learns to sequentially output tokens $y_t$ to maximize the conditional probability:
$$ p(Y|I) = \prod_{t=1}^{T} p(y_t | y_{&lt;t}, I; \theta) $$
where $\theta$ represents the model parameters. This paradigm is powerful but can hallucinate chemically invalid structures if the decoder fails to learn chemical syntax rules.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2019</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.8b00669">Molecular Structure Extraction From Documents Using Deep Learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/">Staker et al. Notes</a></td>
          <td>U-Net segmentation + CNN-GridLSTM encoder-decoder with attention</td>
      </tr>
      <tr>
          <td>2020</td>
          <td><a href="https://doi.org/10.1186/s13321-020-00469-w">DECIMER: towards deep learning for chemical image recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER Notes</a></td>
          <td>Inception V3 encoder + GRU decoder with attention</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.1039/D1SC02957F">ChemPix: automated recognition of hand-drawn hydrocarbon structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/">ChemPix Notes</a></td>
          <td>CNN encoder + LSTM decoder with attention</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.1186/s13321-021-00538-8">DECIMER 1.0: deep learning for chemical image recognition using transformers</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0 Notes</a></td>
          <td>EfficientNet-B3 encoder + Transformer decoder with SELFIES output</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.48550/arXiv.2104.14721">End-to-End Attention-based Image Captioning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI Transformer Notes</a></td>
          <td>Vision Transformer encoder + Transformer decoder with InChI output</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.1039/D1SC01839F">Img2Mol - accurate SMILES recognition from molecular graphical depictions</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/">Img2Mol Notes</a></td>
          <td>CNN encoder + pre-trained CDDD decoder for continuous embedding</td>
      </tr>
      <tr>
          <td>2021</td>
          <td><a href="https://doi.org/10.48550/arXiv.2109.04202">IMG2SMI: Translating Molecular Structure Images to SMILES</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/">IMG2SMI Notes</a></td>
          <td>ResNet-101 encoder + Transformer decoder with SELFIES output</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.3390/app12020680">Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/">ICMDT Notes</a></td>
          <td>Deep TNT encoder + Transformer decoder with InChI output</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1002/cmtd.202100069">Image2SMILES: Transformer-Based Molecular Optical Recognition Engine</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES Notes</a></td>
          <td>ResNet-50 encoder + Transformer decoder with FG-SMILES output</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1093/bioinformatics/btac545">MICER: a pre-trained encoder-decoder architecture for molecular image captioning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/">MICER Notes</a></td>
          <td>Fine-tuned ResNet101 encoder + LSTM decoder with attention</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1039/D1DD00013F">Performance of chemical structure string representations for chemical image recognition using transformers</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan String Representations</a></td>
          <td>Comparative ablation: SMILES vs DeepSMILES vs SELFIES vs InChI</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1186/s13321-022-00624-5">SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/">SwinOCSR Notes</a></td>
          <td>Swin Transformer encoder + Transformer decoder with DeepSMILES output</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.1145/3581783.3612573">Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/">Hu et al. RCGD Notes</a></td>
          <td>DenseNet encoder + GRU decoder with attention and SSML output</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.1038/s41467-023-40782-0">DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/">DECIMER.ai Notes</a></td>
          <td>EfficientNet-V2-M encoder + Transformer decoder with SMILES output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1038/s41598-024-67496-7">ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/">ChemReco Notes</a></td>
          <td>EfficientNet encoder + Transformer decoder with SMILES output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1186/s13321-024-00872-7">Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/">Enhanced DECIMER Notes</a></td>
          <td>EfficientNet-V2-M encoder + Transformer decoder with SMILES output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.3c02082">Image2InChI: Automated Molecular Optical Image Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI Notes</a></td>
          <td>Improved SwinTransformer encoder + Transformer decoder with InChI output</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1039/D4RA02442G">MMSSC-Net: multi-stage sequence cognitive networks for drug molecule recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/">MMSSC-Net Notes</a></td>
          <td>SwinV2 encoder + GPT-2 decoder with MLP for multi-stage cognition</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.48550/arXiv.2412.07594">RFL: Simplifying Chemical Structure Recognition with Ring-Free Language</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/">RFL Notes</a></td>
          <td>DenseNet encoder + GRU decoder with hierarchical ring decomposition</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.1021/acs.jpclett.5c03057">Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/">DGAT Notes</a></td>
          <td>ResNet-101 encoder + Transformer with CGFE/SDGLA modules and SMILES output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2506.07553">GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/">GTR-CoT Notes</a></td>
          <td>Qwen-VL 2.5 3B encoder-decoder with graph traversal chain-of-thought and SMILES output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2411.11098">MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/">MolParser Notes</a></td>
          <td>Swin Transformer encoder + BART decoder with Extended SMILES (E-SMILES) output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2511.17300">MolSight: OCSR with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/">MolSight Notes</a></td>
          <td>EfficientViT-L1 encoder + Transformer decoder with RL (GRPO) and SMILES output</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2501.15415">OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU Notes</a></td>
          <td>Mol-VL: Qwen2-VL encoder-decoder with multi-task learning for multi-level understanding</td>
      </tr>
  </tbody>
</table>
<h3 id="image-to-graph-paradigm">Image-to-Graph Paradigm</h3>
<p>Methods that explicitly construct molecular graphs as intermediate representations, identifying atoms as vertices $V$ and bonds as edges $E$ before converting to standard molecular formats. Graph approaches construct an adjacency matrix $A$ and feature vectors, effectively turning OCSR into a joint probability model over nodes, edges, and their spatial coordinates:
$$ p(G|I) = \prod_{v \in V} p(v|I) \prod_{(u,v) \in V \times V} p(e_{uv}|v_u, v_v, I) $$
This avoids hallucinating invalid character strings and explicitly grounds the predictions to the image space (via bounding boxes/segmentation), improving interpretability.</p>
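<p>Decoding under this factorization can be as simple as thresholding independent node predictions, then pairwise edge predictions, into an explicit graph. The sketch below is a toy illustration with made-up probability inputs, not any particular model's decoder.</p>

```python
# Toy decoding of the joint node/edge factorization: keep confident
# atoms, then keep confident bonds whose endpoints both survived.
# Indices stand in for detected image locations (bounding boxes).

def decode_graph(atom_probs, bond_probs, atom_thresh=0.5, bond_thresh=0.5):
    """atom_probs: {idx: (symbol, p)}; bond_probs: {(i, j): (order, p)}.
    Returns the retained atoms and an explicit bond dictionary."""
    atoms = {i: sym for i, (sym, p) in atom_probs.items()
             if p >= atom_thresh}
    bonds = {}
    for (i, j), (order, p) in bond_probs.items():
        # An edge survives only if both endpoints survived; this is
        # what keeps the output a valid graph, unlike free-form strings.
        if p >= bond_thresh and i in atoms and j in atoms:
            bonds[(i, j)] = order
    return atoms, bonds
```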
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2020</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.0c00459">ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/">ChemGrapher Notes</a></td>
          <td>U-Net-based semantic segmentation + graph building algorithm + classification CNNs</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1093/bib/bbac033">ABC-Net: A divide-and-conquer based deep learning architecture for SMILES recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/">ABC-Net Notes</a></td>
          <td>U-Net-style FCN with keypoint detection heatmaps + multi-task property prediction</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.48550/arXiv.2202.09580">Image-to-Graph Transformers for Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/">Image-to-Graph Transformers Notes</a></td>
          <td>ResNet-34 encoder + Transformer encoder + Graph-Aware Transformer (GRAT) decoder</td>
      </tr>
      <tr>
          <td>2022</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.2c00733">MolMiner: You Only Look Once for Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molminer/">MolMiner Notes</a></td>
          <td>MobileNetV2 segmentation + YOLOv5 object detection + EasyOCR + graph construction</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://openaccess.thecvf.com/content/ICCV2023/html/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.html">MolGrapher: Graph-based Visual Recognition of Chemical Structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/">MolGrapher Notes</a></td>
          <td>ResNet-18 keypoint detector + supergraph construction + GNN classifier</td>
      </tr>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.1021/acs.jcim.2c01480">MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe Notes</a></td>
          <td>Swin Transformer encoder + Transformer decoder with explicit atom coordinates and bond prediction</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.48550/arXiv.2404.01743">Atom-Level Optical Chemical Structure Recognition with Limited Supervision</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/">AtomLenz Notes</a></td>
          <td>Faster R-CNN object detection + graph constructor with weakly supervised training (ProbKT*)</td>
      </tr>
      <tr>
          <td>2024</td>
          <td><a href="https://doi.org/10.1186/s13321-024-00926-w">MolNexTR: a generalized deep learning model for molecular image recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/molnextr/">MolNexTR Notes</a></td>
          <td>Dual-stream (ConvNext + ViT) encoder + Transformer decoder with graph generation</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.1109/CVPR52734.2025.01352">MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/markush/markushgrapher/">MarkushGrapher Notes</a></td>
          <td>UDOP VTL encoder + MolScribe OCSR encoder + T5 decoder with CXSMILES + substituent table</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2505.03777">MolMole: Molecule Mining from Scientific Literature</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/">MolMole Notes</a></td>
          <td>ViDetect (DINO) + ViReact (RxnScribe) + ViMore (detection-based) unified page-level pipeline</td>
      </tr>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.48550/arXiv.2501.15415">OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU Notes</a></td>
          <td>DoubleCheck: MolScribe + attentive feature enhancement with local ambiguous atom refinement</td>
      </tr>
  </tbody>
</table>
<h3 id="image-to-fingerprint-paradigm">Image-to-Fingerprint Paradigm</h3>
<p>Methods that bypass molecular graph reconstruction entirely, generating molecular fingerprints directly from images through functional group recognition and spatial analysis. These approaches prioritize retrieval and similarity search over exact structure reconstruction.</p>
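<p>Retrieval over such fingerprints reduces to bit-set similarity, which is why exact graph reconstruction can be skipped. A minimal sketch using Tanimoto (Jaccard) similarity, with Python sets standing in for fingerprint bit vectors:</p>

```python
# Fingerprint retrieval needs only set comparisons, never a
# reconstructed molecular graph.

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def rank_library(query_fp, library):
    """Rank a {name: fingerprint} library by similarity to the query."""
    return sorted(library.items(),
                  key=lambda kv: tanimoto(query_fp, kv[1]),
                  reverse=True)
```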
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2025</td>
          <td><a href="https://doi.org/10.1186/s13321-025-01091-4">SubGrapher: visual fingerprinting of chemical structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/">SubGrapher Notes</a></td>
          <td>Dual Mask-RCNN instance segmentation (1,534 groups + 27 backbones) + substructure-graph + SVMF fingerprint</td>
      </tr>
  </tbody>
</table>
<h3 id="image-classification-and-filtering">Image Classification and Filtering</h3>
<p>Methods that classify chemical structure images for preprocessing purposes, such as detecting Markush structures or other problematic inputs that should be filtered before full OCSR processing.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2023</td>
          <td><a href="https://doi.org/10.48550/arXiv.2311.14633">One Strike, You&rsquo;re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/">Jurriaans et al. Notes</a></td>
          <td>Patch-based pipeline with Inception V3 or ResNet18 for binary classification</td>
      </tr>
  </tbody>
</table>
<h2 id="traditional-machine-learning-methods">Traditional Machine Learning Methods</h2>
<p>Hybrid approaches combining classical machine learning algorithms (neural networks, SVMs, CRFs) with domain-specific heuristics and image processing. These methods (primarily from 1992-2014) used ML for specific subtasks like character recognition or symbol classification while relying on rule-based systems for chemical structure interpretation.</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
          <th>Key ML Component</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1992</td>
          <td><a href="https://doi.org/10.1021/ci00008a018">Kekulé: OCR-Optical Chemical (Structure) Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/">Kekulé Notes</a></td>
          <td>Multilayer perceptron for OCR</td>
      </tr>
      <tr>
          <td>1996</td>
          <td><a href="https://doi.org/10.1007/3-540-61226-2_14">Automatic Interpretation of Chemical Structure Diagrams</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/">Kekulé-1 Notes</a></td>
          <td>Neural network with shared weights (proto-CNN)</td>
      </tr>
      <tr>
          <td>2007</td>
          <td><a href="https://cdn.aaai.org/AAAI/2007/AAAI07-134.pdf">Recognition of Hand Drawn Chemical Diagrams</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/">Ouyang-Davis Notes</a></td>
          <td>SVM for symbol classification</td>
      </tr>
      <tr>
          <td>2008</td>
          <td><a href="https://static.aminer.org/pdf/PDF/000/295/640/neural_versus_syntactic_recognition_of_handwritten_numerals.pdf">Chemical Ring Handwritten Recognition Based on Neural Networks</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/">Hewahi et al. Notes</a></td>
          <td>Two-phase classifier-recognizer with feed-forward NNs</td>
      </tr>
      <tr>
          <td>2008</td>
          <td><a href="https://doi.org/10.1109/IJCNN.2008.4634125">Recognition of On-line Handwritten Chemical Expressions</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/">Yang et al. Notes</a></td>
          <td>Two-level algorithm with edit distance matching</td>
      </tr>
      <tr>
          <td>2008</td>
          <td><a href="https://doi.org/10.1109/ICPR.2008.4761824">A Study of On-Line Handwritten Chemical Expressions Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/">Yang et al. Notes</a></td>
          <td>ANN with two-level substance recognition</td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1109/ICDAR.2009.64">A Unified Framework for Recognizing Handwritten Chemical Expressions</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/">Chang et al. Notes</a></td>
          <td>GMM for spatial relations, NN for bond verification</td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1109/ICDAR.2009.99">HMM-Based Online Recognition of Handwritten Chemical Symbols</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/">Zhang et al. Notes</a></td>
          <td>Hidden Markov Model for online handwriting</td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1109/ICDAR.2009.70">The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/">Wang et al. Notes</a></td>
          <td>HMM for text recognition + CFG for structure parsing</td>
      </tr>
      <tr>
          <td>2010</td>
          <td><a href="https://doi.org/10.1109/ICPR.2010.465">A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/">Zhang et al. Notes</a></td>
          <td>Dual-stage SVM-HMM with PSR algorithm</td>
      </tr>
      <tr>
          <td>2011</td>
          <td><a href="https://doi.org/10.1145/1943403.1943444">ChemInk: A Natural Real-Time Recognition System for Chemical Drawings</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/">ChemInk Notes</a></td>
          <td>Conditional Random Field (CRF) joint model</td>
      </tr>
      <tr>
          <td>2013</td>
          <td><a href="https://doi.org/10.1109/ICIS.2013.6607894">Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/">Tang et al. Notes</a></td>
          <td>SVM with elastic matching for handwriting</td>
      </tr>
      <tr>
          <td>2014</td>
          <td><a href="https://doi.org/10.1021/ci5002197">Markov Logic Networks for Optical Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/">MLOCSR Notes</a></td>
          <td>Markov Logic Network for probabilistic inference</td>
      </tr>
  </tbody>
</table>
<h2 id="rule-based-methods">Rule-Based Methods</h2>
<p>Classic approaches using heuristics, image processing, and domain-specific rules. While some systems use traditional OCR engines (which may contain ML components), the chemical structure recognition itself is purely algorithmic.</p>
<p><strong>Note</strong>: The chemoCR systems use SVM-based OCR but employ rule-based topology-preserving vectorization for core structure reconstruction, placing them primarily in this category.</p>
<h3 id="core-methods">Core Methods</h3>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Paper</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1990</td>
          <td><a href="https://doi.org/10.1021/ci00067a014">Computational Perception and Recognition of Digitized Molecular Structures</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/">Contreras et al. Notes</a></td>
      </tr>
      <tr>
          <td>1993</td>
          <td><a href="https://doi.org/10.1021/ci00013a010">Chemical Literature Data Extraction: The CLiDE Project</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/">CLiDE Notes</a></td>
      </tr>
      <tr>
          <td>1993</td>
          <td><a href="https://doi.org/10.1109/ICDAR.1993.395658">Optical Recognition of Chemical Graphics</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/">Casey et al. Notes</a></td>
      </tr>
      <tr>
          <td>1999</td>
          <td><a href="https://doi.org/10.1109/ICDAR.1999.791730">Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/">Ramel et al. Notes</a></td>
      </tr>
      <tr>
          <td>2007</td>
          <td><a href="https://doi.org/10.1109/ENC.2007.25">Automatic Recognition of Chemical Images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/">chemoCR Notes</a></td>
      </tr>
      <tr>
          <td>2007</td>
          <td><a href="https://doi.org/10.1109/IEMBS.2007.4353366">Reconstruction of Chemical Molecules from Images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/">chemoCR Notes</a></td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1186/1752-153X-3-4">Automated extraction of chemical structure information from digital raster images</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/">ChemReader Notes</a></td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1021/ci800449t">CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/">CLiDE Pro Notes</a></td>
      </tr>
      <tr>
          <td>2009</td>
          <td><a href="https://doi.org/10.1021/ci800067r">Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/osra/">OSRA Notes</a></td>
      </tr>
      <tr>
          <td>2012</td>
          <td><a href="https://doi.org/10.1117/12.912185">Chemical Structure Recognition: A Rule Based Approach</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/">MolRec Notes</a></td>
      </tr>
      <tr>
          <td>2015</td>
          <td><a href="https://doi.org/10.2991/jimet-15.2015.50">Research on Chemical Expression Images Recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/">Hong et al. Notes</a></td>
      </tr>
  </tbody>
</table>
<h3 id="trec-2011-chemistry-track">TREC 2011 Chemistry Track</h3>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/trec-chem-2011/">TREC 2011 Chemistry Track</a> provided a standardized benchmark for comparing OCSR systems, introducing the novel Image-to-Structure task alongside Prior Art and Technology Survey tasks. Papers from this evaluation are grouped here.</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Paper</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>chemoCR</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemoCR.chem.update.pdf">Chemical Structure Reconstruction with chemoCR</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/">chemoCR Notes</a></td>
      </tr>
      <tr>
          <td>ChemReader</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemreader.chem.update.pdf">Image-to-Structure Task by ChemReader</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/">ChemReader at TREC 2011 Notes</a></td>
      </tr>
      <tr>
          <td>Imago</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/GGA.chemical.pdf">Imago: open-source toolkit for 2D chemical structure image recognition</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/imago-trec-2011/">Imago Notes</a></td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/saic-frederick.chem.pdf">Optical Structure Recognition Application entry in Image2Structure task</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/">OSRA at TREC 2011 Notes</a></td>
      </tr>
      <tr>
          <td>MolRec</td>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf">Performance of MolRec at TREC 2011 Overview and Analysis of Results</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/">MolRec at TREC Notes</a></td>
      </tr>
      <tr>
          <td>ChemInfty</td>
          <td><a href="https://www.inftyreader.org/inftyreader-contents/about-inftyreader/list-of-academic-papers/2011_GREC_ChemInfty.pdf">Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/">ChemInfty Notes</a></td>
      </tr>
  </tbody>
</table>
<h3 id="clef-2012-chemistry-track">CLEF 2012 Chemistry Track</h3>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/clef-ip-2012/">CLEF-IP 2012 benchmarking lab</a> introduced three specific IR tasks in the intellectual property domain: claims-based passage retrieval, flowchart recognition, and chemical structure recognition. The chemical structure recognition task included both segmentation (identifying bounding boxes) and recognition (converting to MOL format) subtasks, with a particular focus on challenging Markush structures common in patents.</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Paper</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolRec</td>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">MolRec at CLEF 2012 - Overview and Analysis of Results</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/">MolRec at CLEF 2012 Notes</a></td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf">Optical Structure Recognition Application entry to CLEF-IP 2012</a></td>
          <td><a href="/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/">OSRA at CLEF-IP 2012 Notes</a></td>
      </tr>
  </tbody>
</table>
]]></content:encoded></item><item><title>Kekulé: OCR-Optical Chemical Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</guid><description>A seminal 1992 system for Optical Chemical Structure Recognition (OCSR) using neural networks and heuristic graph compilation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1992). Kekulé: OCR-Optical Chemical (Structure) Recognition. <em>Journal of Chemical Information and Computer Sciences</em>, 32(4), 373-378. <a href="https://doi.org/10.1021/ci00008a018">https://doi.org/10.1021/ci00008a018</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1992</p>
<h2 id="system-architecture-and-methodological-approach">System Architecture and Methodological Approach</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$). It proposes a novel software architecture (&ldquo;Kekulé&rdquo;) designed to solve a specific technical problem: the automatic conversion of printed chemical structure diagrams into computer-readable connection tables. The paper focuses on the &ldquo;how&rdquo; of the system by detailing the seven-step pipeline from scanning to graph compilation, validating the method through performance testing on a specific dataset.</p>
<h2 id="motivation-bridging-visual-diagrams-and-connection-tables">Motivation: Bridging Visual Diagrams and Connection Tables</h2>
<p>The primary motivation is to bridge the gap between how chemists communicate (structural diagrams) and how chemical databases store information (connection tables like MOLfiles).</p>
<ul>
<li><strong>Inefficiency of Manual Entry</strong>: Manual compilation of structural descriptions is &ldquo;tedious and highly prone to error&rdquo;.</li>
<li><strong>Redrawing Costs</strong>: Even capturing connectivity with a drawing program (a precursor of tools like ChemDraw) is inefficient; redrawing a complex molecule like vitamin $B_{12}$ takes ~20 minutes.</li>
<li><strong>Lack of Existing Solutions</strong>: Existing OCR systems at the time failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), and struggled with small, mixed fonts.</li>
</ul>
<h2 id="novelty-a-hybrid-ocr-and-heuristic-approach">Novelty: A Hybrid OCR and Heuristic Approach</h2>
<p>Kekulé represents the first successful attempt to integrate image processing, OCR, structure editing, and database communication into a single complete system.</p>
<ul>
<li><strong>Hybrid OCR Approach</strong>: Unlike commercial OCR of the time, it used a custom implementation combining rotation correction (for skew) with a <strong>multilayer perceptron neural network</strong> trained specifically on small fonts (down to 3.2 points).</li>
<li><strong>Heuristic Feature Extraction</strong>: The authors developed specific heuristics to handle chemical artifacts, such as an exhaustive search for dashed lines, explicitly rejecting Hough transforms as unreliable for short segments.</li>
<li><strong>Contextual &ldquo;Spell Checking&rdquo;</strong>: The system uses chemical context to verify OCR results, such as checking atom symbols against a valid list and using bond connections to disambiguate characters.</li>
</ul>
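<p>As a toy sketch (not the paper&rsquo;s implementation), the contextual &ldquo;spell check&rdquo; can be pictured as filtering ranked OCR candidates against a table of valid atom symbols; the symbol list and decision rule below are illustrative assumptions:</p>

```python
# Toy sketch of Kekule-style contextual "spell checking" of OCR output.
# The paper describes resolving ambiguous characters (e.g. '5' vs 'S')
# by checking candidates against a list of valid atom symbols; the
# symbol table and decision rule here are illustrative, not the
# original implementation.

VALID_ATOM_SYMBOLS = {"C", "N", "O", "S", "P", "F", "Cl", "Br", "I", "H"}

def resolve_ambiguous(candidates, has_bond_attachment=True):
    """Pick the first OCR candidate that is a chemically valid atom symbol.

    `candidates` is the ranked list of characters the OCR network kept
    above its confidence threshold. A label attached to bonds must name
    an atom, so non-symbols like '5' are rejected in that context.
    """
    if has_bond_attachment:
        for ch in candidates:
            if ch in VALID_ATOM_SYMBOLS:
                return ch
    # No chemical constraint applies: keep the top-ranked guess.
    return candidates[0]

print(resolve_ambiguous(["5", "S"]))  # ambiguous glyph on a bond -> 'S'
```

<p>The same idea generalizes to multi-character labels: once characters are grouped into strings, the string itself is validated against known atom and superatom labels.</p>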
<h2 id="experimental-setup-and-dataset-validation">Experimental Setup and Dataset Validation</h2>
<p>The authors performed a validation study on a diverse set of chemical structures to stress-test the system:</p>
<ul>
<li><strong>Dataset</strong>: 444 chemical structures were selected from a wide variety of sources, including the <em>Merck Index</em>, <em>Aldrich Handbook</em>, and <em>ACS Nomenclature Guide</em>, specifically chosen to &ldquo;test Kekulé&rsquo;s limits&rdquo;.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Processing Success</strong>: Percentage of structures processed.</li>
<li><strong>User Intervention</strong>: Average number of prompts per structure for verification.</li>
<li><strong>Editing Time</strong>: Time required to correct interpretation errors (with an arbitrary &ldquo;good&rdquo; threshold set at 30 seconds).</li>
</ul>
</li>
</ul>
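<p>These three metrics are straightforward aggregates over per-structure runs. A minimal sketch, assuming a hypothetical log format (the field names are illustrative, not from the paper):</p>

```python
# Sketch of computing the paper's three metrics from hypothetical
# per-structure processing logs (field names are illustrative).

logs = [
    {"processed": True, "prompts": 2, "edit_seconds": 12},
    {"processed": True, "prompts": 0, "edit_seconds": 5},
    {"processed": False, "prompts": 0, "edit_seconds": 0},
    {"processed": True, "prompts": 1, "edit_seconds": 40},
]

n = len(logs)
success_rate = 100 * sum(r["processed"] for r in logs) / n
avg_prompts = sum(r["prompts"] for r in logs) / n

# Share of successfully processed structures edited within the
# paper's arbitrary 30-second "good" limit.
ok = [r for r in logs if r["processed"]]
within_limit = 100 * sum(r["edit_seconds"] <= 30 for r in ok) / len(ok)

print(f"{success_rate:.1f}% processed, {avg_prompts:.2f} prompts/structure, "
      f"{within_limit:.1f}% edited within 30 s")
```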
<h2 id="results-and-system-performance">Results and System Performance</h2>
<ul>
<li><strong>High Success Rate</strong>: 98.9% of the 444 structures were processed successfully.</li>
<li><strong>Performance Speed</strong>: The average processing time was 9 seconds per structure on an 80486 (33 MHz) processor.</li>
<li><strong>Error Modes</strong>: The primary bottleneck was broken characters in scanned images (e.g., breaks in &lsquo;H&rsquo; or &lsquo;N&rsquo; crossbars), which slowed down the OCR significantly.</li>
<li><strong>Impact</strong>: The system demonstrated that automated interpretation was faster and less error-prone than manual redrawing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The following details outline the specific technical implementation described in the 1992 paper.</p>
<h3 id="data">Data</h3>
<p>The authors did not release a public dataset but described their test set sources in detail.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Mixed Chemical Sources</td>
          <td>444 structures</td>
          <td>Sourced from <em>Merck Index</em>, <em>Aldrich Handbook</em>, <em>ACS Nomenclature Guide</em>, etc.</td>
      </tr>
      <tr>
          <td>Training (OCR)</td>
          <td>Font Exemplars</td>
          <td>Unknown</td>
          <td>&ldquo;Exemplars of characters from numerous serif and sanserif fonts&rdquo;.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a 7-step pipeline. Key algorithmic choices include:</p>
<ul>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li>Images are reduced to 1-pixel width using <strong>thinning</strong> and <strong>raster-to-vector translation</strong>.</li>
<li>An <strong>adaptive smoothing algorithm</strong> is applied to remove pixel-level jitter.</li>
</ul>
</li>
<li>
<p><strong>Feature Extraction (Dashed Lines)</strong>:</p>
<ul>
<li><strong>Hough Transforms</strong> were rejected due to poor performance on short line segments.</li>
<li><strong>Slope sorting</strong> was rejected due to variance in short dashes.</li>
<li><strong>Chosen Method</strong>: Exhaustive search/testing of all features that <em>might</em> be dashed lines (subset of features).</li>
</ul>
</li>
<li>
<p><strong>Graph Compilation</strong>:</p>
<ul>
<li><strong>Character Grouping</strong>: Characters are assembled into strings based on XY adjacency.</li>
<li><strong>Node Creation</strong>: Character strings become nodes. Vectors with endpoints &ldquo;too far&rdquo; from strings create new nodes.</li>
<li><strong>Heuristics</strong>: Circles are converted to alternating single-double bonds; &ldquo;thick&rdquo; bonds between wedges are automatically generated.</li>
</ul>
</li>
</ul>
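<p>The character-grouping step above can be sketched as a left-to-right scan that merges characters closer than a gap threshold. The thresholds and the exact adjacency rule are assumptions; the paper states only that strings are assembled by XY adjacency:</p>

```python
# Illustrative sketch of grouping recognized characters into label
# strings by XY adjacency, as in Kekule's graph-compilation step.
# The adjacency rule (horizontal gap and baseline offset below fixed
# thresholds) is an assumption; the paper gives no exact values.

def group_characters(chars, max_gap=5, max_dy=3):
    """chars: list of (symbol, x, y) tuples.

    Returns label strings whose consecutive characters (scanned left to
    right) are within `max_gap` horizontally and `max_dy` vertically.
    """
    chars = sorted(chars, key=lambda c: c[1])
    groups, current = [], [chars[0]]
    for ch in chars[1:]:
        prev = current[-1]
        if ch[1] - prev[1] <= max_gap and abs(ch[2] - prev[2]) <= max_dy:
            current.append(ch)
        else:
            groups.append(current)
            current = [ch]
    groups.append(current)
    return ["".join(sym for sym, _, _ in g) for g in groups]

# 'C' and 'l' are adjacent -> one "Cl" label; 'O' is far away -> its own node.
print(group_characters([("C", 0, 0), ("l", 4, 1), ("O", 40, 0)]))
```

<p>Each resulting string then becomes a graph node, with bond vectors snapped to the nearest node as described above.</p>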
<h3 id="models">Models</h3>
<p>The core machine learning component is the OCR engine.</p>
<ul>
<li><strong>Architecture</strong>: A <strong>multilayer perceptron neural network</strong> (fully connected).</li>
<li><strong>Input</strong>: Normalized characters. Normalization involves rotation (for skew), scaling, under-sampling, and contrast/density adjustments.</li>
<li><strong>Output</strong>: Ranked probability matches. Outputs above an experimental threshold are retained. If a character is ambiguous (e.g., &lsquo;5&rsquo; vs &lsquo;S&rsquo;), both are kept and resolved via chemical context.</li>
<li><strong>Performance</strong>: Raw accuracy ~96% on small fonts (compared to ~85% for commercial OCR of the era).</li>
</ul>
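<p>The under-sampling part of the normalization can be pictured as nearest-neighbor scaling of a binary glyph onto a fixed grid. Grid size and method are illustrative; the paper lists rotation, scaling, under-sampling, and contrast steps without exact parameters:</p>

```python
# Sketch of the under-sampling step in Kekule's character
# normalization: scale an arbitrary binary glyph onto a fixed grid by
# nearest-neighbor sampling. The grid size is an illustrative choice;
# the paper does not specify the network's input resolution.

def normalize_glyph(pixels, size=8):
    """pixels: 2D list of 0/1 values; returns a size x size 0/1 grid."""
    h, w = len(pixels), len(pixels[0])
    return [
        [pixels[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]

# A 16x16 diagonal stroke under-sampled to a 4x4 diagonal.
glyph = [[1 if r == c else 0 for c in range(16)] for r in range(16)]
grid = normalize_glyph(glyph, size=4)
```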
<h3 id="hardware">Hardware</h3>
<p>The system was developed and tested on hardware typical of the early 1990s.</p>
<ul>
<li><strong>Processor</strong>: Intel 80486 at 33 MHz.</li>
<li><strong>Scanners</strong>: Hewlett-Packard ScanJet (300 dpi) and Logitech ScanMan (400 dpi hand-held).</li>
<li><strong>Platform</strong>: Microsoft Windows.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mcdanielKekuleOCRopticalChemical1992,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Kekulé: {{OCR-optical}} Chemical (Structure) Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Kekulé}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1992</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{32}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{373--378}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00008a018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>IMG2SMI: Translating Molecular Structure Images to SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</guid><description>Campos &amp; Ji's method for converting 2D molecular images to SMILES strings using Transformers and SELFIES representation.</description><content:encoded><![CDATA[<h2 id="contributions--taxonomy">Contributions &amp; Taxonomy</h2>
<p>This is both a <strong>Method</strong> and <strong>Resource</strong> paper:</p>
<ul>
<li><strong>Method</strong>: It adapts standard image captioning architectures (encoder-decoder) to the domain of Optical Chemical Structure Recognition (OCSR), treating molecule recognition as a translation task.</li>
<li><strong>Resource</strong>: It introduces <strong>MOLCAP</strong>, a large-scale dataset of 81 million molecules aggregated from public chemical databases, addressing the data scarcity that previously hindered deep learning approaches to OCSR.</li>
</ul>
<h2 id="the-bottleneck-in-chemical-literature-translation">The Bottleneck in Chemical Literature Translation</h2>
<p>Chemical literature is &ldquo;full of recipes written in a language computers cannot understand&rdquo; because molecules are depicted as 2D images. This creates a fundamental bottleneck:</p>
<ul>
<li><strong>The Problem</strong>: Chemists must manually redraw molecular structures to search for related compounds or reactions. This is slow, error-prone, and makes large-scale literature mining impossible.</li>
<li><strong>Existing Tools</strong>: Legacy systems like OSRA (Optical Structure Recognition Application) rely on handcrafted rules and often require human correction, making them unfit for unsupervised, high-throughput processing.</li>
<li><strong>The Goal</strong>: An automated system that can translate structure images directly to machine-readable strings (SMILES/SELFIES) without human supervision, enabling large-scale knowledge extraction from decades of chemistry literature and patents.</li>
</ul>
<h2 id="core-innovation-selfies-and-image-captioning">Core Innovation: SELFIES and Image Captioning</h2>
<p>The core novelty is demonstrating that <strong>how you represent the output text is as important as the model architecture itself</strong>. Key contributions:</p>
<ol>
<li>
<p><strong>Image Captioning Framework</strong>: Applies modern encoder-decoder architectures (ResNet-101 + Transformer) to OCSR, treating it as an image-to-text translation problem with a standard cross-entropy loss objective over the generation sequence:
$$ \mathcal{L} = -\sum\limits_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x) $$</p>
</li>
<li>
<p><strong>SELFIES as Target Representation</strong>: The key mechanism relies on using <strong>SELFIES</strong> (Self-Referencing Embedded Strings) as the output format. SELFIES is based on a formal grammar where every possible string corresponds to a valid molecule, eliminating the syntactic invalidity problems (unmatched parentheses, invalid characters) that plague SMILES generation.</p>
</li>
<li>
<p><strong>MOLCAP Dataset</strong>: Created a comprehensive dataset of 81 million unique molecules from PubChem, ChEMBL, <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>, and other sources. Generated 256x256 pixel images using RDKit for 1 million training samples and 5,000 validation samples.</p>
</li>
<li>
<p><strong>Task-Specific Evaluation</strong>: Demonstrated that traditional NLP metrics (BLEU) are poor indicators of scientific utility. Introduced evaluation based on <strong>molecular fingerprints</strong> (MACCS, RDK, Morgan) and <strong>Tanimoto similarity</strong>:
$$ T(a, b) = \frac{c}{a + b - c} $$
where $c$ is the number of common fingerprint bits, and $a$ and $b$ are the number of set bits in each respective molecule&rsquo;s fingerprint. This formulation reliably measures functional chemical similarity.</p>
</li>
</ol>
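<p>The Tanimoto formula above can be computed directly over sets of &ldquo;on&rdquo; fingerprint bits. The bit sets below are toy values; in practice RDKit&rsquo;s MACCS, RDK, or Morgan fingerprints would supply them:</p>

```python
# Tanimoto similarity over fingerprint bit sets, matching the formula
# above: T(a, b) = c / (a + b - c). Fingerprints are represented as
# sets of "on" bit indices; the toy bit sets below are illustrative
# stand-ins for real MACCS/Morgan fingerprints.

def tanimoto(fp_a, fp_b):
    """fp_a, fp_b: sets of set-bit indices for two molecules."""
    c = len(fp_a & fp_b)                 # bits common to both fingerprints
    denom = len(fp_a) + len(fp_b) - c
    return c / denom if denom else 1.0   # two empty fingerprints: identical

fp_pred = {1, 4, 7, 9}   # hypothetical predicted-molecule fingerprint
fp_true = {1, 4, 8, 9}   # hypothetical ground-truth fingerprint
print(round(tanimoto(fp_pred, fp_true), 3))  # 3 common / (4 + 4 - 3) = 0.6
```

<p>Because near-miss predictions share most substructure bits with the ground truth, this score degrades gracefully where exact-match accuracy drops to zero.</p>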
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The evaluation focused on comparing IMG2SMI to existing systems and identifying which design choices matter most:</p>
<ol>
<li>
<p><strong>Baseline Comparisons</strong>: Benchmarked against OSRA (rule-based system) and DECIMER (first deep learning approach) on the MOLCAP dataset to establish whether modern architectures could surpass traditional methods.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Extensive ablations isolating key factors:</p>
<ul>
<li><strong>Decoder Architecture</strong>: Transformer vs. RNN/LSTM decoders</li>
<li><strong>Encoder Fine-tuning</strong>: Fine-tuned vs. frozen pre-trained ResNet weights</li>
<li><strong>Output Representation</strong>: SELFIES vs. character-level SMILES vs. BPE-tokenized SMILES (the most critical ablation)</li>
</ul>
</li>
</ol>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>MACCS FTS</th>
          <th>Valid Captions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN + Fixed Encoder</td>
          <td>0.1526</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RNN + Fine-tuned Encoder</td>
          <td>0.4180</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Transformer + Fixed Encoder</td>
          <td>0.7674</td>
          <td>61.1%</td>
      </tr>
      <tr>
          <td>Transformer + Fine-tuned Encoder</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>Character-level SMILES (fine-tuned)</td>
          <td>N/A</td>
          <td>2.1%</td>
      </tr>
      <tr>
          <td>BPE SMILES (2000 vocab, fine-tuned)</td>
          <td>N/A</td>
          <td>20.0%</td>
      </tr>
      <tr>
          <td>SELFIES (fine-tuned)</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
  </tbody>
</table>
<ol start="3">
<li><strong>Metric Analysis</strong>: Systematic comparison of evaluation metrics including BLEU, ROUGE, Levenshtein distance, exact match accuracy, and molecular fingerprint-based similarity measures.</li>
</ol>
<h2 id="results-findings-and-limitations">Results, Findings, and Limitations</h2>
<p><strong>Performance Gains</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>Random Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>0.0000</td>
          <td>0.3378</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>0.0000</td>
          <td>0.2229</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>0.0000</td>
          <td>0.1081</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>0.0000</td>
          <td>0.0422</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>0.00%</td>
          <td>0.00%</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<ul>
<li>163% improvement over OSRA on MACCS Tanimoto similarity.</li>
<li>Nearly 10x improvement on ROUGE scores (0.6240 vs. 0.0684).</li>
<li>Average Tanimoto similarity exceeds 0.85 (functionally similar molecules even when not exact matches).</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>SELFIES is Critical</strong>: Using SELFIES yields <strong>99.4% valid molecules</strong>, compared to only ~2% validity for character-level SMILES.</li>
<li><strong>Architecture Matters</strong>: Transformer decoder significantly outperforms RNN/LSTM approaches. Fine-tuning the ResNet encoder (vs. frozen weights) yields substantial performance gains (e.g., MACCS FTS: 0.7674 to 0.9475).</li>
<li><strong>Metric Insights</strong>: BLEU is a poor metric for this task. Molecular fingerprint-based Tanimoto similarity is most informative because it measures functional chemical similarity.</li>
</ul>
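<p>A toy illustration of why free-form SMILES generation fails so often: parentheses and ring-closure digits must be paired, so a single wrong character yields an invalid string. The checker below tests only these two necessary conditions (real validity means RDKit can parse the string); SELFIES avoids the problem by construction, since every token sequence decodes to some molecule:</p>

```python
# Toy check of two necessary syntactic conditions on SMILES strings:
# balanced parentheses and paired ring-closure digits. This is an
# illustration only -- real validity checking parses with RDKit, and
# this helper ignores multi-digit (%nn) ring labels and atom semantics.

def looks_valid_smiles(s):
    depth = 0
    ring_open = set()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():      # ring-bond labels must appear in pairs
            ring_open ^= {ch}
    return depth == 0 and not ring_open

print(looks_valid_smiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin -> True
print(looks_valid_smiles("CC(=O)Oc1ccccc1C(=O)O)"))  # stray ')' -> False
```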
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Low Exact Match</strong>: Only <strong>7.24%</strong> exact matches. The model captures the overarching functional groups and structure but misses fine details like exact double bond placement.</li>
<li><strong>Complexity Bias</strong>: Trained on large molecules (average length &gt;40 tokens), so it performs poorly on very simple structures where OSRA still excels.</li>
</ul>
<p><strong>Conclusion</strong>: The work shows that modern encoder-decoder architectures combined with valid-by-construction molecular representations (SELFIES) can outperform traditional rule-based systems by large margins on fingerprint-based similarity metrics. The system is useful for literature mining where functional similarity matters more than exact matches, though 7.24% exact match accuracy and poor performance on simple molecules indicate clear directions for future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image captioning system based on DETR (Detection Transformer) framework.</p>
<p><strong>Visual Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: ResNet-101 pre-trained on ImageNet</li>
<li><strong>Feature Extraction</strong>: 4th layer extraction (convolutions only)</li>
<li><strong>Output</strong>: 2048-dimensional dense feature vector</li>
</ul>
<p><strong>Caption Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Transformer encoder-decoder</li>
<li><strong>Layers</strong>: 3 stacked encoder layers, 3 stacked decoder layers</li>
<li><strong>Attention Heads</strong>: 8</li>
<li><strong>Hidden Dimensions</strong>: 2048 (feed-forward networks)</li>
<li><strong>Dropout</strong>: 0.1</li>
<li><strong>Layer Normalization Epsilon</strong>: 1e-12</li>
</ul>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate</strong>: 5e-5 (selected after sweep from 1e-4 to 1e-6)</li>
<li><strong>Weight Decay</strong>: 1e-4</li>
<li><strong>Batch Size</strong>: 32</li>
<li><strong>Epochs</strong>: 5</li>
<li><strong>Codebase</strong>: Built on open-source DETR implementation</li>
</ul>
<h3 id="data">Data</h3>
<p><strong>MOLCAP Dataset</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total Size</td>
          <td>81,230,291 molecules</td>
          <td>Aggregated from PubChem, ChEMBL, GDB13</td>
      </tr>
      <tr>
          <td>Training Split</td>
          <td>1,000,000 molecules</td>
          <td>Randomly selected unique molecules</td>
      </tr>
      <tr>
          <td>Validation Split</td>
          <td>5,000 molecules</td>
          <td>Randomly selected for evaluation</td>
      </tr>
      <tr>
          <td>Image Resolution</td>
          <td>256x256 pixels</td>
          <td>Generated using RDKit</td>
      </tr>
      <tr>
          <td>Median SELFIES Length</td>
          <td>&gt;45 characters</td>
          <td>More complex than typical benchmarks</td>
      </tr>
      <tr>
          <td>Full Dataset Storage</td>
          <td>~16.24 TB</td>
          <td>Necessitated use of 1M subset</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>None</td>
          <td>No cropping, rotation, or other augmentation</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li>Images generated using RDKit at 256x256 resolution</li>
<li>Molecules converted to canonical representations</li>
<li>SELFIES tokenization for model output</li>
</ul>
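<p>SELFIES symbols are bracket-delimited, which makes the decoder's output easy to split into vocabulary tokens. A minimal stdlib sketch of such a tokenizer (illustrative; the paper presumably relies on the <code>selfies</code> library's own utilities):</p>

```python
import re

def tokenize_selfies(selfies: str) -> list:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\]]*\]", selfies)

# Benzene in SELFIES splits into 8 symbols:
print(tokenize_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]"))
```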
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metrics</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI Value</th>
          <th>OSRA Baseline</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>Fingerprint Tanimoto Similarity (functional groups)</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>RDKit fingerprint similarity</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>Morgan fingerprint similarity (circular)</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>Text overlap metric</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>Structural identity (strict)</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>Syntactic validity (with SELFIES)</td>
      </tr>
      <tr>
          <td>Levenshtein Distance</td>
          <td>21.13</td>
          <td>32.76</td>
          <td>String edit distance (lower is better)</td>
      </tr>
  </tbody>
</table>
<p><strong>Secondary Metrics</strong> (shown to be less informative for chemical tasks):</p>
<ul>
<li>BLEU, ROUGE (better suited for natural language)</li>
<li>Levenshtein distance (doesn&rsquo;t capture chemical similarity)</li>
</ul>
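<p>All three fingerprint metrics reduce to the Tanimoto (Jaccard) coefficient over fingerprint bit sets. A minimal stdlib sketch (the reported metrics use RDKit's MACCS/Morgan/RDK fingerprints; the bit sets below are purely illustrative):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity: |A & B| / |A | B| over fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are trivially identical
    intersection = len(fp_a & fp_b)
    return intersection / (len(fp_a) + len(fp_b) - intersection)

# Illustrative bit sets; real fingerprints come from RDKit (e.g. MACCS keys):
print(tanimoto({1, 2, 3, 4}, {3, 4, 5, 6}))  # 2 shared bits / 6 total = 0.333...
```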
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Single NVIDIA GeForce RTX 2080 Ti</li>
<li><strong>Training Time</strong>: ~5 hours per epoch, approximately 25 hours total for 5 epochs</li>
<li><strong>Memory</strong>: Sufficient for batch size 32 with ResNet-101 + Transformer architecture</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>The paper mentions releasing both code and the MOLCAP dataset, but no public repository or download link has been confirmed as available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MOLCAP dataset</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>81M molecules; claimed released but no public URL found</td>
      </tr>
      <tr>
          <td>IMG2SMI code</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Built on DETR; claimed released but no public URL found</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Campos, D., &amp; Ji, H. (2021). IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System (No. arXiv:2109.04202). arXiv. <a href="https://doi.org/10.48550/arXiv.2109.04202">https://doi.org/10.48550/arXiv.2109.04202</a></p>
<p><strong>Publication</strong>: arXiv preprint (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2109.04202">Paper on arXiv</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{campos2021img2smi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Campos, Daniel and Ji, Heng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2109.04202}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2109.04202}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Hand-Drawn Chemical Diagram Recognition (AAAI 2007)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</guid><description>A sketch recognition system for organic chemistry that uses domain knowledge (chemical valence) to correct recognition errors.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-approach">Contribution and Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a multi-stage pipeline for interpreting hand-drawn diagrams that integrates a trainable symbol recognizer with a domain-specific verification step. The authors validate the method through an ablation study comparing the full system against a baseline lacking domain knowledge.</p>
<h2 id="motivation-for-sketch-based-interfaces">Motivation for Sketch-Based Interfaces</h2>
<p>Current software for specifying chemical structures (e.g., ChemDraw, IsisDraw) relies on mouse and keyboard interfaces, which lack the speed, ease of use, and naturalness of drawing on paper. The goal is to bridge the gap between natural expression and computer interpretation by building a system that understands freehand chemical sketches.</p>
<h2 id="novel-integration-of-chemical-domain-knowledge">Novel Integration of Chemical Domain Knowledge</h2>
<p>The primary novelty is the integration of <strong>domain knowledge</strong> (specifically chemical valence rules) directly into the interpretation loop to resolve ambiguities and correct errors.</p>
<p>Specific technical contributions include:</p>
<ul>
<li><strong>Hybrid Recognizer</strong>: Combines feature-based SVMs, image-based template matching (modified Tanimoto), and off-the-shelf handwriting recognition to handle the mix of geometry and text.</li>
<li><strong>Domain Verification Loop</strong>: A post-processing step that checks the chemical validity of the structure (e.g., nitrogen must have 3 bonds). If an inconsistency is found, the system searches the space of alternative hypotheses generated during the initial parsing phase to find a valid interpretation.</li>
<li><strong>Contextual Parsing</strong>: Uses a sliding window (up to 7 strokes) and spatial context to parse interspersed symbols.</li>
<li><strong>Implicit Structure Handling</strong>: Supports two common chemistry notations: (1) implicit elements, where carbon and hydrogen atoms are omitted and inferred from bond connectivity and valence rules, and (2) aromatic rings, detected as a circle drawn inside a hexagonal 6-carbon cycle.</li>
</ul>
<h2 id="experimental-design-and-user-study">Experimental Design and User Study</h2>
<p>The authors conducted a user study to evaluate the system&rsquo;s robustness on unconstrained sketches.</p>
<ul>
<li><strong>Participants</strong>: 6 users familiar with organic chemistry.</li>
<li><strong>Task</strong>: Each user drew 12 pre-specified molecular compounds on a Tablet PC.</li>
<li><strong>Conditions</strong>: The system was evaluated in two modes:
<ol>
<li><strong>Domain</strong>: The full system with chemical valence checks.</li>
<li><strong>Baseline</strong>: A simplified version with no knowledge of chemical valence/verification.</li>
</ol>
</li>
<li><strong>Data Split</strong>: Evaluated on collected sketches using a leave-one-out style approach (training on 11 examples from the same users).</li>
</ul>
<h2 id="results-and-error-reduction-analysis">Results and Error Reduction Analysis</h2>
<ul>
<li><strong>Performance</strong>: The full system achieved an overall <strong>F-measure of 0.87</strong> (Precision 0.86, Recall 0.89).</li>
<li><strong>Impact of Domain Knowledge</strong>: Using domain knowledge reduced the overall error rate (measured by recall) by <strong>27%</strong> compared to the baseline. The improvement was statistically significant ($p &lt; .05$).</li>
<li><strong>Error Recovery</strong>: The system successfully recovered from interpretations that were geometrically plausible but chemically impossible (e.g., misinterpreting &ldquo;N&rdquo; as bonds), as illustrated in their qualitative analysis.</li>
<li><strong>Output Integration</strong>: Once interpreted, the resulting structure is expressed in a standard chemical specification format that can be passed to tools such as ChemDraw (for rendering) or SciFinder (for database queries).</li>
<li><strong>Limitations</strong>: The system struggled with &ldquo;messy&rdquo; sketches where users drew single bonds with multiple strokes or over-traced lines, as the current bond recognizer assumes single-stroke straight bonds.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study collected a custom dataset of hand-drawn diagrams.</p>
<ul>
<li><strong>Volume</strong>: 6 participants $\times$ 12 molecules = 72 total sketches (implied).</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Scale Normalization</strong>: The system estimates scale based on the average length of straight bonds (chosen because they are easy to identify). This normalizes geometric features for the classifier.</li>
<li><strong>Stroke Segmentation</strong>: Poly-line approximation using recursive splitting (minimizing least squared error) to break multi-segment strokes (e.g., connected bonds) into primitives.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Ink Parsing (Sliding Window)</strong></p>
<ul>
<li>Examines all combinations of up to <strong>$n=7$</strong> sequential strokes.</li>
<li>Classifies each group as a valid symbol or invalid garbage.</li>
</ul>
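<p>The windowing step can be sketched as follows (hypothetical names; strokes are assumed to be in drawing order, and each group is then scored by the classifier as a valid symbol or rejected as garbage):</p>

```python
def candidate_groups(strokes, max_n=7):
    """Enumerate every contiguous group of up to max_n sequential strokes."""
    for start in range(len(strokes)):
        for end in range(start + 1, min(start + max_n, len(strokes)) + 1):
            yield strokes[start:end]

# 3 strokes with max_n=2 yields 5 candidates: 3 singletons + 2 adjacent pairs.
groups = list(candidate_groups(["s1", "s2", "s3"], max_n=2))
print(len(groups))  # 5
```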
<p><strong>2. Template Matching (Image-based)</strong></p>
<ul>
<li>Used for resolving ambiguities in text/symbols (e.g., &lsquo;H&rsquo; vs &lsquo;N&rsquo;).</li>
<li><strong>Metric</strong>: Modified <strong>Tanimoto coefficient</strong>. Unlike standard Tanimoto (point overlap), this version accounts for relative angle and curvature at each point.</li>
</ul>
<p><strong>3. Domain Verification</strong></p>
<ul>
<li><strong>Trigger</strong>: An element with incorrect valence (e.g., Hydrogen with &gt;1 bond).</li>
<li><strong>Resolution</strong>: Searches stored alternative hypotheses for the affected strokes. It accepts a new hypothesis if it resolves the valence error without introducing new ones.</li>
<li><strong>Constraint</strong>: It keeps an inconsistent structure if the original confidence score is significantly higher than alternatives (assuming user is still drawing or intentionally left it incomplete).</li>
</ul>
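<p>The valence trigger amounts to a bond-count check against each element's allowed valence. A minimal sketch (illustrative valence table and data layout; the paper's search over stored alternative hypotheses is omitted):</p>

```python
VALENCE = {"H": 1, "O": 2, "N": 3, "C": 4}

def valence_violations(atoms, bonds):
    """Return indices of atoms whose total bond order exceeds their valence.

    atoms: dict mapping atom index -> element symbol
    bonds: list of (i, j, order) tuples
    """
    degree = {i: 0 for i in atoms}
    for i, j, order in bonds:
        degree[i] += order
        degree[j] += order
    return [i for i in atoms if degree[i] > VALENCE[atoms[i]]]

# A hydrogen with two bonds is chemically impossible and triggers the search:
atoms = {0: "C", 1: "H", 2: "O"}
bonds = [(0, 1, 1), (1, 2, 1)]
print(valence_violations(atoms, bonds))  # [1]
```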
<h3 id="models">Models</h3>
<p><strong>Symbol Recognizer (Discriminative Classifier)</strong></p>
<ul>
<li><strong>Type</strong>: Support Vector Machine (SVM).</li>
<li><strong>Classes</strong>: Element letters, straight bonds, hash bonds, wedge bonds, invalid groups.</li>
<li><strong>Input Features</strong>:
<ol>
<li>Number of strokes</li>
<li>Bounding-box dimensions (width, height, diagonal)</li>
<li>Ink density (ink length / diagonal length)</li>
<li>Inter-stroke distance (max distance between strokes in group)</li>
<li>Inter-stroke orientation (vector of relative orientations)</li>
</ol>
</li>
</ul>
<p><strong>Text Recognition</strong></p>
<ul>
<li><strong>Microsoft Tablet PC SDK</strong>: Used for recognizing alphanumeric characters (elements and subscripts).</li>
<li>Integrated with the SVM and Template Matcher via a combined scoring mechanism.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Overall)</th>
          <th>Baseline Comparison</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision</strong></td>
          <td>0.86</td>
          <td>0.81 (Baseline)</td>
          <td>Full system vs. no domain knowledge</td>
      </tr>
      <tr>
          <td><strong>Recall</strong></td>
          <td>0.89</td>
          <td>0.85 (Baseline)</td>
          <td>27% error reduction</td>
      </tr>
      <tr>
          <td><strong>F-Measure</strong></td>
          <td>0.87</td>
          <td>0.83 (Baseline)</td>
          <td>Statistically significant ($p &lt; .05$)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>True Positive Definition</strong>: Match in both location (stroke grouping) and classification (label).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: 1.5GHz Tablet PC.</li>
<li><strong>Performance</strong>: Real-time feedback.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<p>No source code, trained models, or collected sketch data were publicly released. The paper is openly available through the AAAI digital library. The system depends on the Microsoft Tablet PC SDK (a proprietary, now-discontinued component), which would make exact replication difficult even with the algorithm descriptions provided.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2007). Recognition of Hand Drawn Chemical Diagrams. <em>Proceedings of the 22nd National Conference on Artificial Intelligence</em> (AAAI-07), 846-851.</p>
<p><strong>Publication</strong>: AAAI 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyang2007recognition,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recognition of Hand Drawn Chemical Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ouyang, Tom Y and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 22nd National Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{846--851}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph Perception for Chemical Structure OCR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</guid><description>A 1990 methodological paper presenting an early OCR system for digitizing chemical structure images into connectivity tables using C and Prolog.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Contreras, M. L., Allendes, C., Alvarez, L. T., &amp; Rozas, R. (1990). Computational perception and recognition of digitized molecular structures. <em>Journal of Chemical Information and Computer Sciences</em>, 30(3), 302-307. <a href="https://doi.org/10.1021/ci00067a014">https://doi.org/10.1021/ci00067a014</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1990</p>
<h2 id="contribution-graph-perception-and-character-recognition">Contribution: Graph Perception and Character Recognition</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>.</p>
<p>It proposes a specific algorithmic pipeline (&ldquo;graph perception and character recognition&rdquo;) to solve the technical problem of converting pixelated images of molecules into machine-readable connectivity tables. The dominant contribution is the novel set of algorithms (contour search, circular inspection, matrix parametrization).</p>
<h2 id="motivation-automating-chemical-database-entry">Motivation: Automating Chemical Database Entry</h2>
<p>The primary motivation is to automate the input of chemical structures into databases.</p>
<ul>
<li><strong>Problem</strong>: Manual input of structures (especially large ones with stereochemistry) is time-consuming and prone to human error.</li>
<li><strong>Gap</strong>: Existing methods required significant human intervention. The authors created a system that handles the &ldquo;graph/skeleton&rdquo; and the &ldquo;alphanumeric characters&rdquo; effectively to speed up entry into systems like ARIUSA or CAD tools.</li>
</ul>
<h2 id="algorithmic-novelty-circular-inspection-processing">Algorithmic Novelty: Circular Inspection Processing</h2>
<p>The paper introduces a unified &ldquo;capture-to-recognition&rdquo; system written in C that handles both type-printed and hand-printed structures. Key novelties include:</p>
<ul>
<li><strong>Circular Inspection Algorithm</strong>: A specific technique for detecting internal rings and multiple bonds by sweeping a radius of 0.3 bond lengths around atoms.</li>
<li><strong>Hybrid Recognition</strong>: Combining &ldquo;graph perception&rdquo; (vectorizing the lines) with &ldquo;character recognition&rdquo; (OCR for atom labels) in a single pipeline.</li>
<li><strong>Matrix Parametrization for OCR</strong>: A feature extraction method that assigns hexadecimal IDs to character matrices based on pixel gradients and &ldquo;semibytes&rdquo;.</li>
</ul>
<h2 id="methodology-validation-via-custom-structure-dataset">Methodology: Validation via Custom Structure Dataset</h2>
<p>The authors validated the system by digitizing and recognizing a set of test structures:</p>
<ul>
<li><strong>Dataset</strong>: 200 type-printed structures and 50 hand-printed structures.</li>
<li><strong>Metric</strong>: &ldquo;Reliability&rdquo; percentage (correct recognition of the connectivity table).</li>
<li><strong>Speed Comparison</strong>: Measured processing time against a &ldquo;qualified person&rdquo; performing manual input for an average 20-atom molecule.</li>
</ul>
<h2 id="results-speed-and-file-size-efficiency">Results: Speed and File Size Efficiency</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>94% reliability</strong> for both type- and hand-printed graphs.</li>
<li><strong>Character Recognition</strong>: Isolated character recognition achieved <strong>&gt;99% reliability</strong>.</li>
<li><strong>Speed</strong>: The system was <strong>3-5 times faster</strong> than manual human input.</li>
<li><strong>Efficiency</strong>: The storage required for a recognized molecule (e.g., $C_{19}H_{31}N$) was significantly smaller (4.1 kb) than the raw image bitmap.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a standard external dataset but rather a custom set of structures for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Type-printed structures</td>
          <td style="text-align: left">200 images</td>
          <td style="text-align: left">Used to test reliability</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Hand-printed structures</td>
          <td style="text-align: left">50 images</td>
          <td style="text-align: left">&ldquo;Straight enough&rdquo; drawings required</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details three specific algorithmic components crucial for replication:</p>
<ol>
<li>
<p><strong>Graph Perception (Contour Search)</strong>:</p>
<ul>
<li><strong>Sweep</strong>: Left-to-right horizontal sweep to find the first pixel.</li>
<li><strong>Contour Follow</strong>: Counter-clockwise algorithm used to trace borders.</li>
<li><strong>Vertex Detection</strong>: A vertex is flagged if the linear trajectory deflection angle is $&gt;18^\circ$.</li>
<li><strong>Atom Localization</strong>: Two or more vertices in a small space indicate an atom position.</li>
</ul>
</li>
<li>
<p><strong>Circular Inspection (Branching/Rings)</strong>:</p>
<ul>
<li><strong>Radius</strong>: A circle is inspected around each atom with $r = 0.3 \times \text{single bond length}$.</li>
<li><strong>Branch Detection</strong>: &ldquo;Unknown border pixels&rdquo; found on this circle trigger new contour searches to find attached bonds or rings.</li>
</ul>
</li>
<li>
<p><strong>Character Recognition (Matrix Feature Extraction)</strong>:</p>
<ul>
<li><strong>Separation</strong>: Characters are separated into isolated matrices and &ldquo;relocated&rdquo; to the top-left corner.</li>
<li><strong>Parametrization</strong>: The matrix is divided into zones. A &ldquo;semibyte&rdquo; (4-bit code) is generated by checking for pixel density in specific directions.</li>
<li><strong>ID Assignment</strong>: Matrices are assigned a Hex ID (e.g., <code>8</code>, <code>1</code>, <code>0</code>, <code>6</code>) based on these semibytes.</li>
<li><strong>Differentiation</strong>: Secondary parameters (concavities, vertical lines) resolve conflicts (e.g., between &lsquo;b&rsquo; and &lsquo;h&rsquo;).</li>
</ul>
</li>
</ol>
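<p>The vertex-detection rule from the contour search can be sketched as a deflection-angle test between consecutive trajectory segments (illustrative point representation; the $18^\circ$ threshold is the paper's):</p>

```python
import math

def is_vertex(p_prev, p, p_next, threshold_deg=18.0):
    """Flag point p as a vertex if the trajectory deflects by more than threshold_deg."""
    heading_in = math.atan2(p[1] - p_prev[1], p[0] - p_prev[0])
    heading_out = math.atan2(p_next[1] - p[1], p_next[0] - p[0])
    deflection = abs(math.degrees(heading_out - heading_in)) % 360
    deflection = min(deflection, 360 - deflection)  # fold into [0, 180]
    return deflection > threshold_deg

print(is_vertex((0, 0), (1, 0), (2, 0)))  # False: straight line, no deflection
print(is_vertex((0, 0), (1, 0), (1, 1)))  # True: 90-degree turn
```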
<h3 id="models">Models</h3>
<p>The system does not use learned weights (neural networks). It relies on <strong>rule-based topological recognition</strong>.</p>
<ul>
<li><strong>Representation</strong>: The final output is a Prolog data structure converted into a connectivity table.</li>
<li><strong>Atom Recognition</strong>: Terminal atoms are identified by linear projection; if no pixels are found, it defaults to Carbon.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The performance metrics reflect 1990s hardware, useful for historical context or low-resource reimplementation.</p>
<ul>
<li><strong>Capture</strong>: PC-AT microcomputer with HP-Scanjet.</li>
<li><strong>Processing</strong>: MicroVax II (8 MB real memory, 159 MB hard disc) running Ultrix-32.</li>
<li><strong>Memory Usage</strong>: A $300 \times 300$ dpi image required ~175 kb; a recognized graph required ~1.6 kb.</li>
<li><strong>Time</strong>: Processing time per molecule was 0.7 - 1.0 minutes.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{contrerasComputationalPerceptionRecognition1990,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Computational Perception and Recognition of Digitized Molecular Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Contreras, M. Leonor and Allendes, Carlos and Alvarez, L. Tomas and Rozas, Roberto}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1990</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{30}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{302--307}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00067a014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader: Automated Structure Extraction</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</guid><description>ChemReader extracts chemical structures from raster images using modified Hough transform and chemical spell checking for improved accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Rosania, G. R., Shedden, K. A., Nguyen, M., Lyu, N., &amp; Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. <em>Chemistry Central Journal</em>, 3(1), 4. <a href="https://doi.org/10.1186/1752-153X-3-4">https://doi.org/10.1186/1752-153X-3-4</a></p>
<p><strong>Publication</strong>: Chemistry Central Journal 2009</p>
<h2 id="paper-contribution-method--pipeline">Paper Contribution: Method &amp; Pipeline</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel software system, <strong>ChemReader</strong>, designed to automate the analog-to-digital conversion of chemical structure diagrams. The paper focuses on the algorithmic pipeline, specifically modifying standard computer vision techniques like the Hough Transform to suit chemical graphs. It validates the method through direct performance comparisons against existing State-of-the-Art tools (OSRA, CLiDE, Kekule).</p>
<h2 id="motivation-unlocking-analog-chemical-information">Motivation: Unlocking Analog Chemical Information</h2>
<p>There is a massive amount of chemical information (molecular interactions, pathways, disease processes) locked in scientific literature. However, this information is typically encoded as &ldquo;analog diagrams&rdquo; (raster images) embedded in text. Existing text-based search engines cannot index these structures effectively.</p>
<p>While previous tools existed (Kekule, OROCS, CLiDE), they often required high-resolution images (150-300 dpi) or manual intervention to separate diagrams from text, making fully automated, large-scale database annotation impractical.</p>
<h2 id="core-innovation-modified-transforms-and-spell-checking">Core Innovation: Modified Transforms and Spell Checking</h2>
<p>The authors introduce <strong>ChemReader</strong>, a fully automated toolkit with several specific algorithmic innovations tailored for chemical diagrams:</p>
<ul>
<li><strong>Modified Hough Transform (HT):</strong> Unlike standard HT, which treats all pixels equally, ChemReader uses a modified weight function that accounts for pixel connectivity and line thickness to better detect chemical bonds.</li>
<li><strong>Chemical Spell Checker:</strong> A post-processing step that uses a dictionary of common chemical abbreviations (770 entries) and n-gram probabilities to correct Optical Character Recognition (OCR) errors (e.g., correcting specific atom labels based on valence rules), improving accuracy from 66% to 87%.</li>
<li><strong>Specific Substructure Detection:</strong> Dedicated algorithms for detecting stereochemical &ldquo;wedge&rdquo; bonds using corner detection and aromatic rings using the Generalized Hough Transform.</li>
</ul>
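<p>The spell-checking idea can be sketched as a nearest-neighbor lookup in an abbreviation dictionary under edit distance (tiny illustrative dictionary and correction policy; the paper's 770-entry dictionary also weighs n-gram probabilities and valence rules):</p>

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Tiny illustrative dictionary of common abbreviations:
DICTIONARY = {"OH", "NH2", "CH3", "OCH3", "COOH", "OEt"}

def spell_check(label: str, max_dist: int = 1) -> str:
    """Return the closest dictionary entry within max_dist edits, else the label unchanged."""
    best = min(DICTIONARY, key=lambda w: edit_distance(label, w))
    return best if edit_distance(label, best) <= max_dist else label

print(spell_check("0H"))  # 'OH': one substitution fixes the common OCR confusion of 0/O
```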
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared ChemReader against three other systems: <strong>OSRA V1.0.1</strong>, <strong>CLiDE V2.1 Lite</strong>, and <strong>Kekule V2.0 demo</strong>.</p>
<p>They used three distinct datasets to test robustness:</p>
<ol>
<li><strong>Set I (50 images):</strong> Diverse drawing styles and fonts collected via Google Image Search.</li>
<li><strong>Set II (100 images):</strong> Ligand images from the GLIDA database, linked to PubChem for ground truth.</li>
<li><strong>Set III (212 images):</strong> Low-resolution images embedded in 121 scanned journal articles from PubMed.</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Accuracy:</strong> ChemReader significantly outperformed competitors. In the difficult <strong>Set III</strong> (journal articles), ChemReader achieved <strong>30.2%</strong> correct exact output, compared to 17% for OSRA and 6.6% for CLiDE.</li>
<li><strong>Similarity:</strong> Even when exact matches failed, ChemReader maintained high Tanimoto similarity scores (0.74-0.86), indicating it successfully captured the majority of chemically significant features.</li>
<li><strong>Substructure Recognition:</strong> ChemReader demonstrated higher recall rates across all PubChem fingerprint categories (rings, atom pairs, SMARTS patterns) compared to other tools.</li>
<li><strong>Error Correction:</strong> The &ldquo;Chemical Spell Checker&rdquo; improved character recognition accuracy from <strong>66% to 87%</strong>.</li>
</ul>
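<p>The Tanimoto comparison above can be sketched over binary fingerprints treated as sets of &ldquo;on&rdquo; bits (a minimal illustration with hypothetical bit sets, not the actual PubChem fingerprint vectors):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: |A & B| / |A | B| over fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are trivially identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Hypothetical bit sets: indices of substructure features detected.
truth = {1, 4, 7, 9, 12}    # ground-truth structure
parsed = {1, 4, 7, 12, 15}  # structure recovered by the OCSR tool
print(tanimoto(truth, parsed))  # 4 shared / 6 distinct bits = 0.666...
```

<p>Scores near 1.0 indicate that most chemically significant features survived recognition even when the exact-match test failed.</p>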
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized three test sets collected from public sources.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set I</strong></td>
          <td>50 images</td>
          <td>Sourced from Google Image Search to vary styles/fonts.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set II</strong></td>
          <td>100 images</td>
          <td>Randomly selected ligands from the GLIDA database; ground truth via PubChem.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set III</strong></td>
          <td>212 images</td>
          <td>Extracted from 121 PubMed journal articles; specifically excludes non-chemical figures.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of several sequential processing steps:</p>
<ul>
<li><strong>De-noising:</strong> Uses <strong>GREYCstoration</strong>, an anisotropic smoothing algorithm, to regulate image noise.</li>
<li><strong>Segmentation:</strong> Uses an <strong>8-connectivity algorithm</strong> to group pixels. Components are classified as text or graphics based on height/area ratios.</li>
<li><strong>Line Detection (Modified Hough Transform):</strong>
<ul>
<li>Standard Hough Transform is modified to weight pixel pairs $(P_i, P_j)$ based on connectivity.</li>
<li><strong>Weight Function ($W_{ij}$):</strong>
$$W_{ij} = \begin{cases} n_{ij}(P_0 - x_{ij}) &amp; \text{if } x_{ij}/n_{ij} &gt; P_0 \\ 0 &amp; \text{otherwise} \end{cases}$$
Where $n_{ij}$ is the pixel count between points, $x_{ij}$ is the count of black pixels, and $P_0$ is a density threshold.</li>
</ul>
</li>
<li><strong>Wedge Bond Detection:</strong> Uses corner detection to identify candidate triangles, verifying that the geometric area matches the number of black pixels (i.e., the triangle is solid) and that the shape is isosceles.</li>
<li><strong>Chemical Spell Checker:</strong>
<ul>
<li>Calculates the Maximum Likelihood ($ML$) of a character string being a valid chemical word $T$ from a dictionary.</li>
<li><strong>Similarity Metric:</strong>
$$Sim(S_i, T_i) = 1 - \sqrt{\sum_{j=1}^{M} [I^{S_i}(j) - I^{T_i}(j)]^2}$$
Uses pixel-by-pixel intensity difference between the input segment $S$ and candidate template $T$.</li>
</ul>
</li>
</ul>
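<p>The spell checker&rsquo;s template matching can be sketched as follows: a toy version over per-position intensity values in $[0, 1]$, selecting the dictionary entry that maximizes the paper&rsquo;s similarity score (the profiles and entries here are hypothetical, and this stands in for the full maximum-likelihood selection):</p>

```python
import math

def similarity(seg, tmpl):
    """Sim(S, T) = 1 - sqrt(sum_j (I_S(j) - I_T(j))^2), per the paper,
    over per-position intensities in [0, 1]."""
    return 1.0 - math.sqrt(sum((s - t) ** 2 for s, t in zip(seg, tmpl)))

def correct(segment, dictionary):
    """Pick the dictionary template most similar to the OCR'd segment."""
    return max(dictionary, key=lambda word: similarity(segment, dictionary[word]))

# Toy intensity profiles (hypothetical, 4 positions per glyph string).
dictionary = {"OMe": [0.9, 0.1, 0.8, 0.2], "OEt": [0.9, 0.7, 0.1, 0.6]}
noisy = [0.85, 0.15, 0.75, 0.25]
print(correct(noisy, dictionary))  # -> OMe
```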
<h3 id="models">Models</h3>
<ul>
<li><strong>Character Recognition:</strong> Uses the open-source <strong>GOCR</strong> library. It employs template matching based on features like holes, pixel densities, and transitions.</li>
<li><strong>Chemical Dictionary:</strong> A lookup table containing <strong>770</strong> frequently used chemical abbreviations and fundamental valence rules.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using exact structure matching and fingerprint similarity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Set III)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>% Correct</strong></td>
          <td><strong>30.2%</strong></td>
          <td>17%</td>
          <td>Exact structure match using ChemAxon JChem.</td>
      </tr>
      <tr>
          <td><strong>Avg Similarity</strong></td>
          <td><strong>0.740</strong></td>
          <td>0.526</td>
          <td>Tanimoto similarity on PubChem Substructure Fingerprints.</td>
      </tr>
      <tr>
          <td><strong>Precision (Rings)</strong></td>
          <td><strong>0.87</strong></td>
          <td>0.84</td>
          <td>Precision rate for recognizing ring systems.</td>
      </tr>
      <tr>
          <td><strong>Recall (Rings)</strong></td>
          <td><strong>0.83</strong></td>
          <td>0.73</td>
          <td>Recall rate for recognizing ring systems.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> C++ implementation running on MS Windows.</li>
<li><strong>Dependencies:</strong> GOCR (OCR), GREYCstoration (Image processing).</li>
</ul>
]]></content:encoded></item><item><title>Chemical Machine Vision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</guid><description>Machine vision approach using Gabor wavelets and Kohonen networks to classify chemical raster images and extract structural metadata.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gkoutos, G. V., Rzepa, H., Clark, R. M., Adjei, O., &amp; Johal, H. (2003). Chemical Machine Vision: Automated Extraction of Chemical Metadata from Raster Images. <em>Journal of Chemical Information and Computer Sciences</em>, 43(5), 1342-1355. <a href="https://doi.org/10.1021/ci034017n">https://doi.org/10.1021/ci034017n</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 2003</p>
<h2 id="paper-classification-methodological-approach">Paper Classification: Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline applying &ldquo;machine vision&rdquo; techniques (Gabor wavelets and Kohonen networks) to the problem of identifying chemical diagrams in low-resolution raster images. The paper focuses on the &ldquo;how&rdquo; (the algorithm and its parameters) and validates the method through quantitative experiments optimizing feature vectors and masks.</p>
<h2 id="motivation-extracting-legacy-chemical-data">Motivation: Extracting Legacy Chemical Data</h2>
<p>The primary motivation is to unlock the &ldquo;large amount of data&rdquo; trapped in legacy raster images (GIF, JPEG) on the Web that lack semantic metadata.</p>
<ul>
<li><strong>Legacy Data Problem</strong>: Most chemical structural information on the Web is embedded in raster images, not machine-readable formats like Molfiles.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous tools like Kekule and CLiDE acted as &ldquo;Chemical OCR,&rdquo; attempting to reconstruct exact atom-bond connections. This required high-resolution images (&gt;300 dpi) and human intervention, making them unsuitable for automated Web crawling of low-resolution (72-96 dpi) images.</li>
<li><strong>Goal</strong>: To create a low-cost, automated tool for a &ldquo;robot-based Internet resource discovery tool&rdquo; that can classify images (e.g., &ldquo;is this a molecule?&rdquo;).</li>
</ul>
<h2 id="core-innovation-texture-recognition-over-structural-ocr">Core Innovation: Texture Recognition over Structural OCR</h2>
<p>The core novelty is the shift from &ldquo;Optical Character Recognition&rdquo; (exact reconstruction) to <strong>&ldquo;Texture Recognition&rdquo;</strong> (classification).</p>
<ul>
<li><strong>Texture-Based Approach</strong>: The authors treat chemical diagrams as textures. They use <strong>Gabor wavelets</strong> to extract texture features. <strong>Crucially, this system does not recognize specific chemical structures</strong> (i.e., atom-bond connectivity tables, <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, or Molfiles). It only classifies images into broad categories.</li>
<li><strong>Incremental Learning</strong>: The system uses a <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong> combined with Class Boundary Analysis (CBA). This allows for &ldquo;incremental learning,&rdquo; where new classes (e.g., aromatic vs. non-aromatic) can be added without retraining the entire system.</li>
<li><strong>Optimization for Chemistry</strong>: The authors identify specific parameters (frequency channels, mask sizes) that are optimal for the &ldquo;texture&rdquo; of chemical diagrams.</li>
<li><strong>Integration with ChemDig</strong>: The method was designed to feed into ChemDig, a robot-based index engine for automated web crawling and metadata generation.</li>
</ul>
<h2 id="experimental-setup-parameter-optimization">Experimental Setup: Parameter Optimization</h2>
<p>The authors performed optimization and validation experiments using a dataset of <strong>300 images</strong> divided into three classes: Ring Systems, Non-Ring Systems, and Non-Chemistry (textures, biological figures, etc.).</p>
<ol>
<li><strong>Parameter Optimization</strong>: They systematically varied hyperparameters to find the optimal configuration:
<ul>
<li><strong>Feature Vector Size</strong>: Tested sizes from 100 to 4000 elements.</li>
<li><strong>Energy Mask Size</strong>: Tested windows from $3 \times 3$ to $15 \times 15$ pixels.</li>
<li><strong>Frequency Channels</strong>: Tested seven spatial frequencies ($\sqrt{2}$ to $64\sqrt{2}$).</li>
</ul>
</li>
<li><strong>Classification Performance</strong>: Evaluated the system&rsquo;s ability to classify unseen test images using a 50:50 training/test split.</li>
<li><strong>Comparison</strong>: Qualitatively compared the approach against vectorization tools (Autotrace, CR2V).</li>
</ol>
<h2 id="results-robust-classification-of-low-resolution-images">Results: Robust Classification of Low-Resolution Images</h2>
<ul>
<li><strong>Optimal Configuration</strong>: The system performed best with a feature vector size of ~1500 elements, a $9 \times 9$ energy mask, and frequency channel $4\sqrt{2}$.</li>
<li><strong>High Accuracy</strong>: Achieved a recognition rate of <strong>91%</strong> with a 50:50 training/test split, and up to <strong>92%</strong> with a 70:30 split.</li>
<li><strong>Robustness</strong>: The system successfully distinguished between chemical and non-chemical images (zero false negatives for chemical images).</li>
<li><strong>Limitations</strong>: Misclassifications occurred between &ldquo;ring&rdquo; and &ldquo;non-ring&rdquo; systems when structures had similar visual &ldquo;textures&rdquo; (e.g., similar density or layout).</li>
<li><strong>Impact</strong>: The method is viable for automating metadata generation (e.g., <code>alt</code> tags) for web crawlers, functioning as a coarse-grained filter before more expensive processing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom dataset of raster images collected from the Web.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><strong>Custom Web Dataset</strong></td>
          <td>300 images</td>
          <td>Split into 3 classes: Ring Systems, Non-Ring Systems, Non-Chemistry.</td>
      </tr>
      <tr>
          <td>Resolution</td>
          <td><strong>Low-Res Web Images</strong></td>
          <td>72-96 dpi</td>
          <td>Deliberately chosen to mimic Web conditions where OCR fails.</td>
      </tr>
      <tr>
          <td>Format</td>
          <td><strong>Raster</strong></td>
          <td>GIF, JPEG</td>
          <td>Typical web formats.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The core pipeline consists of a <strong>Gabor Transform Unit</strong> followed by a <strong>Training/Classification Unit</strong>.</p>
<ul>
<li><strong>Gabor Wavelets</strong>: Used for feature extraction. The 2D Gabor wavelet equation is:
$$h(x,y)=\exp\left\{-\frac{1}{2}\left[\frac{x^{2}}{\sigma_{x}^{2}}+\frac{y^{2}}{\sigma_{y}^{2}}\right]\right\}\cos(2\pi\mu_{\sigma}x+\phi)$$
<ul>
<li><strong>Bank Structure</strong>: 28 filters total (4 orientations $\times$ 7 radial frequencies).</li>
<li><strong>Orientations</strong>: $0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}$.</li>
<li><strong>Frequencies</strong>: 1 octave apart, specifically $\sqrt{2}, 2\sqrt{2}, \dots, 64\sqrt{2}$.</li>
<li><strong>Selected Frequency</strong>: $4\sqrt{2}$ was found to be optimal for chemistry.</li>
</ul>
</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Buffer Mounting</strong>: Images are mounted in a buffer (set to 0) to handle edge artifacts.</li>
<li><strong>Look-Up-Tables (LUT/LUF)</strong>: A binary Look-Up-Frame (LUF) indicates Regions of Interest (ROI) to avoid computing empty space; values are stored in a Look-Up-Table (LUT) to prevent re-computation of overlapping windows.</li>
</ul>
</li>
<li><strong>Feature Extraction</strong>:
<ul>
<li><strong>Non-linear Thresholding</strong>: $\psi(t) = \tanh(\alpha t)$ with $\alpha = 0.25$.</li>
<li><strong>Energy Function</strong>: Calculated as average absolute deviation from the mean using a window $W_{xy}$.
$$e_{k}(x,y)=\frac{1}{M^{2}}\sum_{(a,b)\in W_{xy}}|\psi(r_{k}(a,b))|$$</li>
<li><strong>Optimal Window</strong>: $9 \times 9$ pixels.</li>
</ul>
</li>
</ul>
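<p>The feature pipeline above can be sketched in pure Python. This is a toy illustration only: the per-pixel filter response is crudely approximated by a pointwise product rather than a full convolution, and the $\sigma$ and frequency values are assumptions, not the paper&rsquo;s tuned parameters.</p>

```python
import math

ALPHA = 0.25  # non-linear threshold parameter from the paper

def gabor(x, y, sigma_x, sigma_y, freq, phi=0.0):
    """2D Gabor wavelet h(x, y) per the paper's equation."""
    envelope = math.exp(-0.5 * (x**2 / sigma_x**2 + y**2 / sigma_y**2))
    return envelope * math.cos(2.0 * math.pi * freq * x + phi)

def energy(responses, m):
    """e_k: mean of |psi(r)| over an M x M window, psi(t) = tanh(alpha * t)."""
    return sum(abs(math.tanh(ALPHA * r)) for r in responses) / (m * m)

# Toy 9x9 patch (the optimal window size): a vertical line of ink.
M = 9
patch = [[1.0 if c == M // 2 else 0.0 for c in range(M)] for r in range(M)]

# Stand-in per-pixel responses for one filter of the bank
# (4 orientations x 7 frequencies in the real system).
responses = [patch[r][c] * gabor(c - M // 2, r - M // 2, 2.0, 2.0, 0.25)
             for r in range(M) for c in range(M)]
print(round(energy(responses, M), 4))
```

<p>Concatenating such energies across all 28 filters yields the texture feature vector fed to the classifier.</p>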
<h3 id="models">Models</h3>
<p>The classification model relies on competitive learning.</p>
<ul>
<li><strong>Architecture</strong>: <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong>.</li>
<li><strong>Training</strong>:
<ul>
<li><strong>Learning Rate</strong>: Starts at 1.0, decreases to 0.1.</li>
<li><strong>Class Boundary Analysis (CBA)</strong>: Computes the centroid (mean) and variance of each cluster. The variance defines the class boundary.</li>
</ul>
</li>
<li><strong>Classification Metric</strong>: <strong>Euclidean Distance Norm</strong>. An unknown vector is classified based on the shortest distance to a cluster center, provided it falls within the variance boundary.
$$D_{ij}=||x_{i}-x_{j}||$$</li>
</ul>
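<p>The classification rule can be sketched as nearest-centroid matching gated by each cluster&rsquo;s variance boundary (the labels and toy clusters below are illustrative, not the paper&rsquo;s trained map):</p>

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(vec, clusters):
    """clusters: {label: (centroid, boundary)}. Assign to the nearest
    centroid whose variance boundary contains the vector; None means
    'unknown' -- a candidate class for incremental learning."""
    best, best_d = None, float("inf")
    for label, (centroid, boundary) in clusters.items():
        d = euclidean(vec, centroid)
        if d <= boundary and d < best_d:
            best, best_d = label, d
    return best

clusters = {"ring": ((0.0, 0.0), 1.5), "non-ring": ((5.0, 5.0), 1.5)}
print(classify((0.5, 0.5), clusters))    # -> ring
print(classify((10.0, 10.0), clusters))  # -> None (outside all boundaries)
```

<p>The boundary gate is what makes incremental learning possible: vectors falling outside every boundary can seed a new class without disturbing existing clusters.</p>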
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using recognition rate ($R_s$) and misclassification error ($E_s$).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recognition Rate</td>
          <td><strong>91%</strong></td>
          <td>N/A</td>
          <td>Achieved with 50:50 split. 92% with 70:30 split.</td>
      </tr>
      <tr>
          <td>Feature Size</td>
          <td><strong>~1500</strong></td>
          <td>4000</td>
          <td>Reducing vector size from 4000 to 1500 maintained ~80% accuracy while improving speed.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gkoutosChemicalMachineVision2003,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}: {{Automated Extraction}} of {{Chemical Metadata}} from {{Raster Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Gkoutos, Georgios V. and Rzepa, Henry and Clark, Richard M. and Adjei, Osei and Johal, Harpal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2003</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{43}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1342--1355}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci034017n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Literature Data Extraction: The CLiDE Project</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</guid><description>Seminal OCSR system converting scanned chemical diagrams into connection tables via primitive recognition and semantic interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ibison, P., Jacquot, M., Kam, F., Neville, A. G., Simpson, R. W., Tonnelier, C., Venczel, T., &amp; Johnson, A. P. (1993). Chemical Literature Data Extraction: The CLiDE Project. <em>Journal of Chemical Information and Computer Sciences</em>, 33(3), 338-344. <a href="https://doi.org/10.1021/ci00013a010">https://doi.org/10.1021/ci00013a010</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 1993</p>
<h2 id="contribution-and-taxonomy">Contribution and Taxonomy</h2>
<p><strong>Classification: Method ($\Psi_{\text{Method}}$)</strong></p>
<p>This methodological paper proposes a novel software architecture for Optical Chemical Structure Recognition (OCSR). It details specific algorithms for image segmentation, vectorization, and chemical interpretation, validated through the successful extraction of complex structures from literature.</p>
<h2 id="motivation-automating-literature-extraction">Motivation: Automating Literature Extraction</h2>
<p>The manual creation of chemical reaction databases is a time-consuming and expensive process requiring trained chemists to abstract information from literature. While commercial tools existed for interpreting isolated scanned structures (like Kekulé), there was a lack of systems capable of processing whole pages of journals (including embedded text, reaction schemes, and structures) without significant human intervention.</p>
<h2 id="core-innovation-a-three-phase-hybrid-architecture">Core Innovation: A Three-Phase Hybrid Architecture</h2>
<p>CLiDE introduces a comprehensive <strong>three-phase architecture</strong> (Recognition, Grouping, Interpretation) that integrates computer vision with chemical knowledge. Key novelties include:</p>
<ul>
<li><strong>Context-Aware Interpretation:</strong> The use of an extendable <strong>superatom database</strong> to resolve ambiguities in chemical text (e.g., expanding &ldquo;OAc&rdquo; or &ldquo;Me&rdquo; into connection tables).</li>
<li><strong>Hybrid Primitive Detection:</strong> A combination of contour coding for solid lines and a modified Hough transform specifically tuned for detecting dashed chemical bonds.</li>
<li><strong>Semantic Re-construction:</strong> A scoring system for bond-atom association that considers both distance and vector direction to handle poorly drawn structures.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<p>The authors validated the system on a set of &ldquo;difficult cases&rdquo; selected to test specific capabilities. These included:</p>
<ul>
<li><strong>Crossing Bonds:</strong> Structures where bonds intersect without a central atom (Fig. 9d, 9e).</li>
<li><strong>Stereochemistry:</strong> Identification of wedged, dashed, and wavy bonds.</li>
<li><strong>Generic Structures:</strong> Parsing generic text blocks (e.g., $R^1 = Me$) and performing substitutions.</li>
<li><strong>Accuracy Estimation:</strong> The authors report an approximate 90% recognition rate for distinct characters in literature scans.</li>
</ul>
<h2 id="results-and-structural-reconstruction">Results and Structural Reconstruction</h2>
<p>The system successfully generates connection tables (exported as MOLfiles or ChemDraw files) from scanned bitmaps. It effectively distinguishes between graphical primitives (wedges, lines) and text, accurately reconstructing stereochemistry and resolving superatom synonyms (e.g., converting &ldquo;MeO&rdquo; to &ldquo;OMe&rdquo;). The authors conclude that while character recognition depends heavily on image quality, the graphic primitive recognition is robust for lines above a threshold length.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Input Format:</strong> Binary bitmaps scanned from journal pages.</li>
<li><strong>Resolution:</strong> 300 dpi (generating ~1 MB per page).</li>
<li><strong>Superatom Database:</strong> A lookup table containing ~200 entries. Each entry includes:
<ul>
<li><strong>Valency/Charge:</strong> Explicit constraints (e.g., &ldquo;HO&rdquo; takes 1 bond, &ldquo;CO2&rdquo; takes 2).</li>
<li><strong>Bonding Index:</strong> Specifies which letter in the string serves as the attachment point (e.g., letter 2 for &ldquo;HO&rdquo;, letters 1 and 2 for &ldquo;CO2&rdquo;).</li>
<li><strong>Sub-Connection Table:</strong> The internal atomic representation of the group.</li>
</ul>
</li>
</ul>
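<p>A minimal sketch of such a lookup table (the field names, internal bond tables, and synonym map are illustrative reconstructions of the record structure described above; the real database held roughly 200 entries):</p>

```python
# Each record: valency, 1-based letter position(s) of the attachment
# point, and a sub-connection table (atom list + internal bonds as
# (atom_i, atom_j, order) triples). Values here are illustrative.
SUPERATOMS = {
    "HO":  {"valency": 1, "bond_at": [2], "atoms": ["H", "O"], "bonds": [(0, 1, 1)]},
    "CO2": {"valency": 2, "bond_at": [1, 2], "atoms": ["C", "O", "O"],
            "bonds": [(0, 1, 2), (0, 2, 1)]},
    "OMe": {"valency": 1, "bond_at": [1], "atoms": ["O", "C"], "bonds": [(0, 1, 1)]},
}
SYNONYMS = {"MeO": "OMe"}  # e.g. the group drawn left of its bond

def expand(label):
    """Resolve an abbreviation to its superatom record, normalizing synonyms."""
    entry = SUPERATOMS.get(SYNONYMS.get(label, label))
    if entry is None:
        raise KeyError(f"unknown superatom: {label}")
    return entry

print(expand("MeO")["valency"])  # -> 1
```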
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Primitive Recognition (Vectorization)</strong></p>
<ul>
<li><strong>Contour Coding:</strong> Uses the <strong>Ahronovitz-Bertier-Habib</strong> method to generate interpixel contours (directions N, S, E, W) for connected components.</li>
<li><strong>Polygonal Approximation:</strong> A method similar to <strong>Sklansky and Gonzalez</strong> breaks contours into &ldquo;fractions&rdquo;.
<ul>
<li><em>Rule:</em> Long sides are &ldquo;straight fractions&rdquo;; consecutive short sides are &ldquo;curved fractions&rdquo;.</li>
<li><em>Reconstruction:</em> Parallel fractions are paired to form bond borders. If a border is split (due to noise or crossing lines), the system attempts to merge collinear segments.</li>
</ul>
</li>
<li><strong>Dash Detection:</strong> A <strong>modified Hough transform</strong> is applied to small connected components. It requires at least <strong>three collinear dashes</strong> to classify a sequence as a dashed bond.</li>
</ul>
<p><strong>2. Interpretation Rules</strong></p>
<ul>
<li><strong>Bond-Atom Association:</strong>
<ul>
<li><em>Candidate Selection:</em> The system identifies $m$ closest bonds for a superatom requiring $n$ connections ($m \ge n$).</li>
<li><em>Scoring Function:</em> Connections are selected based on minimizing <strong>perpendicular distance</strong> (alignment).</li>
</ul>
</li>
<li><strong>Crossing Bonds:</strong> Resolved using rules based on <strong>proximity, length, collinearity, and ring membership</strong> to distinguish actual crossings from central carbon atoms.</li>
</ul>
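<p>The bond-atom scoring step can be sketched as ranking candidate bonds by the perpendicular distance from the superatom&rsquo;s attachment point to each bond&rsquo;s supporting line (a simplified stand-in for CLiDE&rsquo;s full distance-plus-direction score; the geometry helpers are assumptions):</p>

```python
import math

def perp_distance(atom, bond_end, bond_dir):
    """Perpendicular distance from the atom's attachment point to the
    bond's supporting line: small when the bond 'points at' the atom."""
    ax, ay = atom[0] - bond_end[0], atom[1] - bond_end[1]
    dx, dy = bond_dir
    return abs(ax * dy - ay * dx) / math.hypot(dx, dy)

def associate(atom, bonds, n):
    """From m >= n candidate bonds (each (free_end, direction)), keep the
    n best aligned with the atom."""
    return sorted(bonds, key=lambda b: perp_distance(atom, b[0], b[1]))[:n]

aligned = ((1.0, 0.0), (-1.0, 0.0))  # bond ending near and pointing at the atom
offset = ((1.0, 1.0), (-1.0, 0.0))   # parallel bond passing the atom by
print(associate((0.0, 0.0), [offset, aligned], 1))  # picks the aligned bond
```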
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR:</strong> A neural network trained on alphanumeric characters.
<ul>
<li><strong>Input Representation:</strong> Density matrices derived from character bitmaps.</li>
<li><strong>Post-processing:</strong> Unrecognized characters are flagged for manual correction.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> SUN SPARC workstation.</li>
<li><strong>Scanner:</strong> Agfa Focus S 800GS.</li>
<li><strong>Implementation Language:</strong> C++.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ibisonChemicalLiteratureData1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction: {{The CLiDE Project}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ibison, P. and Jacquot, M. and Kam, F. and Neville, A. G. and Simpson, R. W. and Tonnelier, C. and Venczel, T. and Johnson, A. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{338--344}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00013a010}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Automatic Recognition of Chemical Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</guid><description>A rule-based system for extracting chemical structure information from raster images, validated against commercial baselines.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-image-mining-architecture">Contribution: Rule-Based Image Mining Architecture</h2>
<p><strong>$\Psi_{\text{Method}}$ (Methodological Basis)</strong></p>
<p>This is a methodological paper describing a system architecture for <strong>image mining</strong> in the chemical domain. It focuses on the engineering challenge of converting rasterized depictions of molecules into computer-readable SDF files. The paper details the algorithmic pipeline and validates it through quantitative benchmarking against a commercial alternative.</p>
<h2 id="motivation-digitizing-chemical-literature">Motivation: Digitizing Chemical Literature</h2>
<ul>
<li><strong>Loss of Information</strong>: Chemical drawing software produces structures as images; when these are published in the scientific literature, their chemical significance is lost, leaving the data &ldquo;dead&rdquo; to computers.</li>
<li><strong>Gap in Technology</strong>: Image mining lags behind advances in text mining, and earlier commercial solutions (such as CLIDE) either faded away or remained limited.</li>
<li><strong>Scale of Problem</strong>: The colossal production of chemical documents requires automated tools to exploit this information at large scale.</li>
</ul>
<h2 id="core-innovation-graph-preserving-vectorization">Core Innovation: Graph-Preserving Vectorization</h2>
<ul>
<li><strong>Graph-Preserving Vectorization</strong>: The system uses a custom vectorizer designed to preserve the &ldquo;graph characteristics&rdquo; of chemical diagrams (1 vector = 1 line), which avoids creating spurious vectors at thick joints. It aims to generate a mathematical graph, $G = (V, E)$, mapped geometrically to the image lines.</li>
<li><strong>Chemical Knowledge Integration</strong>: A distinct module validates the reconstructed graph against chemical rules (valences, charges) to ensure the output is chemically valid.</li>
<li><strong>Hybrid Processing</strong>: The system splits the image into &ldquo;connected components&rdquo; for an OCR path (text/symbols) and a &ldquo;body&rdquo; path (bonds), reassembling them later.</li>
</ul>
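<p>The &ldquo;1 vector = 1 line&rdquo; idea can be sketched as follows: endpoints that coincide within a tolerance merge into shared graph vertices, so three drawn sides of a ring fragment yield exactly three edges. The tolerance and coordinates below are illustrative, not the paper&rsquo;s parameters.</p>

```python
def build_graph(vectors, tol=2.0):
    """Map line vectors to a graph G = (V, E): endpoints become vertices
    (merged when closer than tol), vectors become edges."""
    verts, edges = [], []

    def vertex_id(p):
        for i, q in enumerate(verts):
            if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= tol ** 2:
                return i
        verts.append(p)
        return len(verts) - 1

    for a, b in vectors:
        edges.append((vertex_id(a), vertex_id(b)))
    return verts, edges

# Three slightly misaligned segments forming a closed triangle.
segments = [((0, 0), (10, 0)), ((10, 1), (5, 8)), ((5, 9), (0, 1))]
verts, edges = build_graph(segments)
print(len(verts), edges)  # 3 vertices; edges (0,1), (1,2), (2,0)
```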
<h2 id="methodology--experiments-benchmark-validation">Methodology &amp; Experiments: Benchmark Validation</h2>
<p>The authors performed a quantitative validation using <strong>three different databases</strong> where ground-truth SDF files were available. They also compared their system against the commercial tool <strong>CLIDE</strong> (Chemical Literature Data Extraction).</p>
<ul>
<li><strong>Database 1</strong>: 100 images (varied line widths/fonts)</li>
<li><strong>Database 2</strong>: 100 images</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale batch processing)</li>
</ul>
<h2 id="results--conclusions-superior-accuracy-over-baselines">Results &amp; Conclusions: Superior Accuracy over Baselines</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved <strong>94%</strong> correct reconstruction on Database 1 and <strong>77%</strong> on Database 2. Accuracy was measured as correct recovery of identical geometry and connections.</li>
</ul>
<p>$$ \text{Acc} = \frac{\text{Correct Images}}{\text{Total Images}} $$</p>
<ul>
<li><strong>Baseline Superiority</strong>: The commercial tool CLIDE successfully reconstructed only ~50% of the images in Database 1, compared to the authors&rsquo; 94%.</li>
<li><strong>Scalability</strong>: On the large dataset (Database 3), the system achieved <strong>67%</strong> accuracy in batch mode.</li>
<li><strong>Robustness</strong>: The authors state the system uses a handful of parameters and works robustly across different image types. CLIDE lacked flexibility and required manual intervention.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Formally Reproducible. As is common with applied research from this era, the source code, training models (SVM), and specific datasets used for benchmarking do not appear to be publicly maintained or available.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">No public code, models, or datasets were released with this 2007 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Used for comparison with CLIDE; 94% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>77% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale test; 67% success rate</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper outlines a 5-module pipeline:</p>
<ol>
<li><strong>Pre-processing</strong>: Adaptive histogram binarization and non-recursive connected component labeling using RLE segments.</li>
<li><strong>OCR</strong>: A &ldquo;chemically oriented OCR&rdquo; using wavelet functions for feature extraction and a <strong>Support Vector Machine (SVM)</strong> for classification. It distinguishes characters from molecular structure.</li>
<li><strong>Vectorizer</strong>: Assigns local directions to RLE segments and groups them into patterns. Crucially, it enforces a one-to-one mapping between image lines and graph vectors.</li>
<li><strong>Reconstruction</strong>: A rule-based module that annotates vectors:
<ul>
<li><strong>Stereochemistry</strong>: Registers vectors against original pixels; thick geometric forms (triangles) become chiral wedges.</li>
<li><strong>Dotted Bonds</strong>: Identifies isolated vectors and clusters them using <strong>quadtree clustering</strong>.</li>
<li><strong>Multi-bonds</strong>: Identifies parallel vectors within a dilated bounding box (factor of 2).</li>
</ul>
</li>
<li><strong>Chemical Knowledge</strong>: Validates the graph valences and properties before exporting SDF.</li>
</ol>
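<p>The non-recursive connected-component labeling over RLE segments (module 1) can be sketched with a union-find that merges overlapping runs in adjacent rows. This is a minimal illustration under assumed conventions, not the authors&rsquo; implementation; all names are invented:</p>

```python
def runs(row):
    """Extract (start, end) runs of 1-pixels from a binary row (RLE segments)."""
    out, start = [], None
    for x, v in enumerate(list(row) + [0]):
        if v and start is None:
            start = x
        elif not v and start is not None:
            out.append((start, x - 1))
            start = None
    return out


def label_components(image):
    """Non-recursive connected-component labeling over row-wise RLE segments.

    Two runs in vertically adjacent rows are joined when their column spans
    overlap. Returns the number of connected components.
    """
    parent = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    prev = []  # (start, end, label) runs of the previous row
    for row in image:
        cur = []
        for s, e in runs(row):
            lbl = len(parent)
            parent.append(lbl)
            for ps, pe, pl in prev:
                if s <= pe and e >= ps:  # column spans overlap
                    union(pl, lbl)
            cur.append((s, e, lbl))
        prev = cur
    return len({find(i) for i in range(len(parent))})
```

<p>Processing runs instead of individual pixels is what keeps the pass non-recursive and memory-light, which matters for large scanned pages.</p>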
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM</strong>: Used in the OCR module to classify text/symbols. It supports dynamic training to correct classification mistakes.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the percentage of correctly reconstructed images (generating a valid, matching SDF file).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>System Value (DB1)</th>
          <th>Baseline (CLIDE)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reconstruction Accuracy</td>
          <td><strong>94%</strong></td>
          <td>~50%</td>
          <td>CLIDE noted as unsuitable for batch processing</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., &amp; Hofmann-Apitius, M. (2007). Automatic Recognition of Chemical Images. <em>Eighth Mexican International Conference on Current Trends in Computer Science</em>, 41-46. <a href="https://doi.org/10.1109/ENC.2007.25">https://doi.org/10.1109/ENC.2007.25</a></p>
<p><strong>Publication</strong>: ENC 2007 (IEEE Computer Society)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriAutomaticRecognitionChemical2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic {{Recognition}} of {{Chemical Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Eighth {{Mexican International Conference}} on {{Current Trends}} in {{Computer Science}} ({{ENC}} 2007)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{41--46}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ENC.2007.25}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>αExtractor: Chemical Info from Biomedical Literature</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</guid><description>αExtractor uses ResNet-Transformer to extract chemical structures from literature images, including noisy and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="methodological-contribution-a-robust-optical-recognition-system">Methodological Contribution: A Robust Optical Recognition System</h2>
<p>This is primarily a <strong>Method</strong> ($\Psi_{\text{Method}}$) paper with a significant secondary <strong>Resource</strong> ($\Psi_{\text{Resource}}$) contribution (see the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> for more on these categories).</p>
<p>The dominant methodological contribution is a ResNet-Transformer recognition architecture that outperforms existing OCSR tools across multiple benchmarks through robustness engineering: training on 20 million synthetic images with aggressive augmentation to handle degraded image conditions. The work answers the core methodological question &ldquo;How well does this work?&rdquo; through extensive benchmarking against existing OCSR tools and ablation studies validating architectural choices.</p>
<p>The secondary resource contribution comes from releasing αExtractor as a freely available web service, correcting labeling errors in standard benchmarks (CLEF, UOB, JPO), and providing an end-to-end document processing pipeline for biomedical literature mining.</p>
<h2 id="motivation-extracting-visual-chemical-knowledge-from-biomedical-literature">Motivation: Extracting Visual Chemical Knowledge from Biomedical Literature</h2>
<p>The motivation addresses a familiar pain point in chemical informatics within a biomedical context. Vast amounts of chemical knowledge in biomedical literature exist only as images, such as molecular structures embedded in figures, chemical synthesis schemes, and compound diagrams. This visual knowledge remains effectively invisible to computational methods, which creates a massive bottleneck for drug discovery research, systematic reviews, and large-scale chemical database construction.</p>
<p>Existing OCSR tools face two critical problems when applied to biomedical literature:</p>
<ol>
<li>
<p><strong>Real-world image quality</strong>: Biomedical papers often contain low-resolution figures, images with complex backgrounds, noise from scanning/digitization, and inconsistent drawing styles across different journals and decades of publications.</p>
</li>
<li>
<p><strong>End-to-end extraction</strong>: Most OCSR systems assume the presence of clean, cropped molecular images. In practice, you need to first find the molecular structures within multi-panel figures, reaction schemes, and dense document layouts before you can recognize them.</p>
</li>
</ol>
<p>The authors argue that a practical literature mining system needs to solve both problems simultaneously via robust recognition under noisy conditions and automated detection of molecular images within complex documents.</p>
<h2 id="core-innovation-robust-resnet-transformer-architecture">Core Innovation: Robust ResNet-Transformer Architecture</h2>
<p>The core innovation lies in combining a competition-winning recognition architecture with extensive robustness engineering and end-to-end document processing. The key contributions include:</p>
<ol>
<li>
<p><strong>ResNet-Transformer Recognition Model</strong>: The core recognition system uses a <strong>Residual Neural Network (ResNet)</strong> encoder paired with a <strong>Transformer decoder</strong> in an image-captioning framework. This architecture won first place in a Kaggle molecular translation competition, which provided a strong foundation for the recognition task. Let the input image be $I$. The model maximizes the joint likelihood of the SMILES tokens $T$ and coordinate sequences $X, Y$:
$$
\begin{aligned}
\mathcal{L}_{\text{total}} = - \sum_{i=1}^{L} \log P(T_i \mid I, T_{&lt;i}) - \lambda \sum_{i=1}^{L} \big(\log P(X_i \mid I, X_{&lt;i}) + \log P(Y_i \mid I, Y_{&lt;i})\big)
\end{aligned}
$$
Here, the continuous atom coordinates $X$ and $Y$ are discretized into 200 bins, casting coordinate prediction as a standard classification task alongside SMILES generation.</p>
</li>
<li>
<p><strong>Enhanced Molecular Representation</strong>: The model produces an augmented representation that encompasses:</p>
<ul>
<li>Standard molecular connectivity information</li>
<li><strong>Bond type tokens</strong> (solid wedge bonds, dashed bonds, etc.) that preserve 3D stereochemical information</li>
<li><strong>Atom coordinate predictions</strong> that allow reconstruction of the exact molecular pose from the original image</li>
</ul>
<p>This dual prediction of discrete structure and continuous coordinates keeps the output faithful to the source depiction and enables better quality assessment.</p>
</li>
<li>
<p><strong>Massive Synthetic Training Dataset</strong>: The model was trained on approximately <strong>20 million synthetic molecular images</strong> generated from PubChem SMILES with aggressive data augmentation. The augmentation strategy randomized visual styles, image quality, and rendering parameters to create maximum diversity, ensuring the network rarely saw the same molecular depiction twice. This forces the model to learn robust, style-invariant features.</p>
</li>
<li>
<p><strong>End-to-End Document Processing Pipeline</strong>: αExtractor integrates <strong>object detection</strong> and <strong>structure recognition</strong> into a complete document mining system:</p>
<ul>
<li>An object detection model automatically locates molecular images within PDF documents</li>
<li>The recognition model converts detected images to structured representations</li>
<li>A web service interface makes the entire pipeline accessible to researchers without machine learning expertise</li>
</ul>
</li>
<li>
<p><strong>Robustness-First Design</strong>: The system was explicitly designed to handle degraded image conditions that break traditional OCSR tools, including low resolution, background interference, color variations, and scanning artifacts commonly found in legacy biomedical literature.</p>
</li>
</ol>
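<p>The joint objective quoted above can be sketched as a sum of negative log-likelihoods over the three token streams. This is a schematic with invented function names; $\lambda$ is treated as an assumed weighting hyperparameter:</p>

```python
import math


def sequence_nll(logps):
    """Negative log-likelihood of one token sequence from per-step log-probs."""
    return -sum(logps)


def total_loss(smiles_logps, x_logps, y_logps, lam=1.0):
    """L_total = NLL(SMILES) + lam * (NLL(X-coords) + NLL(Y-coords)),
    mirroring the objective above; lam is an assumed hyperparameter."""
    return sequence_nll(smiles_logps) + lam * (
        sequence_nll(x_logps) + sequence_nll(y_logps)
    )
```

<p>Because the coordinate bins are discrete classes, all three terms reduce to ordinary cross-entropy in training, so no separate regression head is needed.</p>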
<h2 id="experimental-methodology-stress-testing-under-real-world-conditions">Experimental Methodology: Stress Testing under Real-World Conditions</h2>
<p>The evaluation focused on demonstrating robust performance across diverse image conditions, from pristine benchmarks to challenging real-world scenarios:</p>
<ol>
<li>
<p><strong>Benchmark Dataset Evaluation</strong>: αExtractor was tested on four standard OCSR benchmarks:</p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
</ul>
<p>Performance was measured using exact SMILES match accuracy.</p>
</li>
<li>
<p><strong>Error Analysis and Dataset Correction</strong>: During evaluation, the researchers discovered numerous labeling errors in the original benchmark datasets. They systematically identified and corrected these errors, then re-evaluated all methods on the cleaned datasets to get more accurate performance measurements.</p>
</li>
<li>
<p><strong>Robustness Stress Testing</strong>: The system was evaluated on two challenging datasets specifically designed to test robustness:</p>
<ul>
<li><strong>Color background images</strong> (200 samples): Molecular structures on complex, colorful backgrounds that simulate real figure conditions</li>
<li><strong>Low-quality images</strong> (200 samples): Degraded images with noise, blur, and artifacts typical of scanned documents</li>
</ul>
<p>These tests compared αExtractor against three open-source tools (OSRA, MolVec, and Imago) under realistic degradation conditions.</p>
</li>
<li>
<p><strong>Generalization Testing</strong>: In the most challenging experiment, αExtractor was tested on the <strong>DECIMER hand-drawn molecule images dataset</strong> (Brinkhaus et al., 2022), representing a completely different visual domain not represented in the training data. This tested whether the learned features could generalize beyond digital rendering styles to human-drawn chemistry.</p>
</li>
<li>
<p><strong>End-to-End Document Extraction</strong>: The complete pipeline was evaluated on 50 PDF files containing 2,336 molecular images. This tested both the object detection component (finding molecules in complex documents) and the recognition component (converting them to SMILES) in a realistic literature mining scenario.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured to demonstrate the practical efficiency needed for large-scale document processing.</p>
</li>
</ol>
<h2 id="results--conclusions-strong-performance-on-degraded-images">Results &amp; Conclusions: Strong Performance on Degraded Images</h2>
<ul>
<li>
<p><strong>Substantial Accuracy Gains</strong>: On the four benchmark datasets, αExtractor achieved accuracies of 91.83% (CLEF), 98.47% (UOB), 88.67% (JPO), and 93.64% (USPTO), compared to previous best results of 84.6%, 90.0%, 72.2%, and 89.9% respectively. After correcting dataset labeling errors, the true accuracies were even higher, reaching <strong>95.77% on CLEF, 99.86% on UOB, and 92.44% on JPO</strong>.</p>
</li>
<li>
<p><strong>Robustness on Degraded Images</strong>: Open-source competitors struggled on degraded images (achieving 5.5% accuracy at best). αExtractor maintained <strong>over 90% accuracy</strong> on both color background and low-quality image datasets, demonstrating the effectiveness of the synthetic training strategy.</p>
</li>
<li>
<p><strong>Generalization to Hand-Drawn Molecules</strong>: On hand-drawn molecules, a domain completely absent from training data, αExtractor achieved <strong>61.4% accuracy</strong> while other tools scored between 0.69% and 2.93%. This suggests the model learned genuinely chemical features rather than style-specific patterns.</p>
</li>
<li>
<p><strong>Practical End-to-End Performance</strong>: In the complete document processing evaluation, αExtractor detected <strong>95.1% of molecular images</strong> (2,221 out of 2,336) and correctly recognized <strong>94.5% of detected structures</strong> (2,098 correct predictions). This demonstrates the system&rsquo;s readiness for real-world literature mining applications.</p>
</li>
<li>
<p><strong>Ablation Results</strong>: Ablation experiments confirmed that each architectural component (ResNet backbone, Transformer encoder, Transformer decoder) contributes to performance, with the Transformer decoder having the largest impact. Replacing the Transformer decoder with an LSTM decoder substantially reduced accuracy (Table S6 in the paper).</p>
</li>
<li>
<p><strong>Dataset Quality Issues</strong>: The systematic discovery of labeling errors in standard benchmarks highlights a broader problem in OCSR evaluation. The corrected datasets provide more reliable baselines for future method development.</p>
</li>
<li>
<p><strong>Spatial Layout Limitation</strong>: αExtractor correctly identifies molecular connectivity, but the re-rendered structures may have different spatial layouts than the originals. This could complicate visual verification for complex molecules, even if the chemical information remains accurate.</p>
</li>
<li>
<p><strong>Non-Standard Depiction Handling</strong>: For images with non-standard bond depictions or atomic valences, αExtractor correctly identifies and normalizes them to standard representations. While chemically accurate, this means the re-rendered structure may visually differ from the original image.</p>
</li>
</ul>
<p>Overall, αExtractor combines accurate recognition (over 90% on degraded images), end-to-end document processing, and strong generalization across image conditions. It targets large-scale literature mining tasks where previous tools struggled with degraded inputs. The focus on real-world robustness over benchmark optimization reflects a practical approach to deploying machine learning in scientific workflows.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is <strong>Partially Reproducible</strong>. While the authors detail the model architectures and training techniques, the source code, training dataset (20M synthetic images), and pre-trained weights remain closed-source and proprietary. The authors released a sample of their test data and host an online web server for running inference.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/CLEF_corrected">Corrected CLEF Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the CLEF benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/UOB_corrected">Corrected UOB Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the UOB benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/JPO_corrected">Corrected JPO Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the JPO benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Colored_Background">Color Background Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of molecular structures on complex, colorful backgrounds.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Low_Quality">Low Quality Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of degraded images with noise, blur, and artifacts.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/PDF">PDF Test Set</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Sample PDF files for end-to-end document extraction evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://extractor.alphama.com.cn/csr">αExtractor Web Server</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Online service for running inference using the proprietary system.</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Backbone:</strong> ResNet50 producing output of shape $2048 \times 19 \times 19$, projected to 512 channels via a feed-forward layer</li>
<li><strong>Transformer Architecture:</strong> 3 encoder layers and 3 decoder layers with hidden dimension of 512</li>
<li><strong>Output Format:</strong> Generates SMILES tokens plus two auxiliary coordinate sequences (X-axis and Y-axis) that are length-aligned with the SMILES tokens via padding</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Architecture:</strong> DETR (Detection Transformer) with ResNet101 backbone</li>
<li><strong>Transformer Architecture:</strong> 6 encoder layers and 6 decoder layers with hidden dimension of 256</li>
<li><strong>Purpose:</strong> Locates molecular images within PDF pages before recognition</li>
</ul>
<p><strong>Coordinate Prediction:</strong></p>
<ul>
<li>Continuous X/Y coordinates are discretized into <strong>200 discrete bins</strong></li>
<li>Padding tokens added to coordinate sequences to align perfectly with SMILES token sequence, enabling simultaneous structure and pose prediction</li>
</ul>
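<p>The 200-bin coordinate discretization can be illustrated as follows, assuming coordinates are first normalized to $[0, 1]$; the exact binning scheme is our assumption:</p>

```python
def discretize(coord, n_bins=200):
    """Map a coordinate normalized to [0, 1] onto one of n_bins class
    indices, turning coordinate regression into classification."""
    idx = int(coord * n_bins)
    return min(n_bins - 1, max(0, idx))  # clamp the upper boundary into range
```

<p>For example, <code>discretize(0.5)</code> yields bin 100, and the boundary value 1.0 is clamped into the last bin rather than overflowing.</p>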
<h3 id="data">Data</h3>
<p><strong>Training Data:</strong></p>
<ul>
<li><strong>Synthetic Generation:</strong> Python script rendering PubChem SMILES into 2D images</li>
<li><strong>Dataset Size:</strong> Approximately 20.3 million synthetic molecular images from PubChem</li>
<li><strong>Superatom Handling:</strong> 50% of molecules had functional groups replaced with superatoms (e.g., &ldquo;COOH&rdquo;) or generic labels (R1, X1) to match literature drawing conventions</li>
<li><strong>Rendering Augmentation:</strong> Randomized bond thickness, bond spacing, font size, font color, and padding size</li>
</ul>
<p><strong>Geometric Augmentation:</strong></p>
<ul>
<li>Shear along x-axis: $\pm 15^\circ$</li>
<li>Rotation: $\pm 15^\circ$</li>
<li>Piecewise affine scaling</li>
</ul>
<p><strong>Noise Injection:</strong></p>
<ul>
<li>Pepper noise: 0-2%</li>
<li>Salt noise: 0-40%</li>
<li>Gaussian noise: scale 0-0.16</li>
</ul>
<p><strong>Destructive Augmentation:</strong></p>
<ul>
<li>JPEG compression: severity levels 2-5</li>
<li>Random masking</li>
</ul>
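<p>The salt-and-pepper noise injection above can be sketched in plain Python; the fractions follow the ranges quoted, but the sampling details are assumptions, not the authors&rsquo; pipeline:</p>

```python
import random


def add_salt_pepper(pixels, salt_frac=0.4, pepper_frac=0.02, seed=0):
    """Apply salt (white) and pepper (black) noise to a flat grayscale
    pixel list in [0, 255]. Each pixel is independently flipped to black
    with probability pepper_frac, to white with probability salt_frac."""
    rng = random.Random(seed)
    out = []
    for p in pixels:
        r = rng.random()
        if r < pepper_frac:
            out.append(0)      # pepper: black speck
        elif r < pepper_frac + salt_frac:
            out.append(255)    # salt: white speck
        else:
            out.append(p)      # unchanged
    return out
```

<p>Applying this kind of corruption at training time is what forces the recognizer to stay accurate on scanned, artifact-laden figures.</p>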
<p><strong>Evaluation Datasets:</strong></p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
<li><strong>Color background images</strong>: 200 samples</li>
<li><strong>Low-quality images</strong>: 200 samples</li>
<li><strong>Hand-drawn structures</strong>: Test set for generalization</li>
<li><strong>End-to-end document extraction</strong>: 50 PDFs (567 pages, 2,336 molecular images)</li>
</ul>
<h3 id="training">Training</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 100</li>
<li><strong>Epochs:</strong> 5</li>
<li><strong>Loss Function:</strong> Cross-entropy loss for both SMILES prediction and coordinate prediction</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 24</li>
<li><strong>Training Strategy:</strong> Pre-trained on synthetic &ldquo;Lower Quality&rdquo; data for 5 epochs, then fine-tuned on annotated real &ldquo;High Quality&rdquo; data for 30 epochs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics:</strong></p>
<ul>
<li><strong>Recognition</strong>: SMILES accuracy (exact match)</li>
<li><strong>End-to-End Pipeline</strong>:
<ul>
<li><strong>Recall</strong>: 95.1% for detection</li>
<li><strong>Accuracy</strong>: 94.5% for recognition</li>
</ul>
</li>
</ul>
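<p>Exact-match SMILES accuracy reduces to string equality after canonicalization. The sketch below uses an identity placeholder for the canonicalizer; a real pipeline would parse each string and re-emit canonical SMILES with a cheminformatics toolkit such as RDKit:</p>

```python
def exact_match_accuracy(preds, refs, canonicalize=lambda s: s):
    """Fraction of predictions whose canonical form equals the reference's.
    `canonicalize` is identity here; swap in a real SMILES canonicalizer
    so chemically equivalent strings are not counted as mismatches."""
    assert len(preds) == len(refs)
    hits = sum(canonicalize(p) == canonicalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)
```

<p>Without canonicalization, two valid SMILES for the same molecule (e.g. different atom orderings) would be scored as errors, understating true accuracy.</p>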
<h3 id="hardware">Hardware</h3>
<p><strong>Inference Hardware:</strong></p>
<ul>
<li>Cloud CPU server (8 CPUs, 64 GB RAM)</li>
<li><strong>Throughput:</strong> Processed 50 PDFs (567 pages) in 40 minutes</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xiong, J., Liu, X., Li, Z., Xiao, H., Wang, G., Niu, Z., Fei, C., Zhong, F., Wang, G., Zhang, W., Fu, Z., Liu, Z., Chen, K., Jiang, H., &amp; Zheng, M. (2023). αExtractor: a system for automatic extraction of chemical information from biomedical literature. <em>Science China Life Sciences</em>, 67(3), 618-621. <a href="https://doi.org/10.1007/s11427-023-2388-x">https://doi.org/10.1007/s11427-023-2388-x</a></p>
<p><strong>Publication</strong>: Science China Life Sciences (2023)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.1007/s11427-023-2388-x">Paper on Springer</a></li>
</ul>
]]></content:encoded></item><item><title>MolRec: Rule-Based OCSR System at TREC 2011 Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</guid><description>Rule-based system for optical chemical structure recognition using vectorization and geometric analysis, achieving 95% accuracy on TREC 2011.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-ocsr-system">Contribution: Rule-Based OCSR System</h2>
<p>This is a <strong>Method</strong> paper that presents and validates MolRec, a rule-based system for Optical Chemical Structure Recognition (OCSR). While the paper emphasizes performance analysis on the TREC 2011 benchmark, the core contribution is the system architecture itself: a multi-stage pipeline using vectorization, geometric rule-based analysis, and graph construction to convert chemical diagram images into machine-readable MOL files.</p>
<h2 id="motivation-robust-conversion-of-chemical-diagrams">Motivation: Robust Conversion of Chemical Diagrams</h2>
<p>Chemical molecular diagrams are ubiquitous in scientific documents across chemistry and life sciences. Converting these static raster images into machine-readable formats (like MOL files) that encode precise spatial and connectivity information is important for cheminformatics applications such as database indexing, similarity searching, and automated literature mining.</p>
<p>While pixel-based pattern matching approaches exist, they struggle with variations in drawing style, image quality, and diagram complexity. An approach that can handle the geometric and topological diversity of real-world chemical diagrams is needed.</p>
<h2 id="novelty-vectorization-and-geometric-rules">Novelty: Vectorization and Geometric Rules</h2>
<p>MolRec uses a <strong>vectorization and geometric rule-based pipeline</strong>. Key technical innovations include:</p>
<p><strong>Disk-Growing Heuristic for Wedge Bonds</strong>: A novel dynamic algorithm to distinguish wedge bonds from bold lines. A disk with radius greater than the average line width is placed inside the connected component and grown to the largest size that still covers only foreground pixels. The disk is then walked in the direction that allows it to continue growing. When it can grow no more, the base of the triangle (stereo-center) has been located, identifying the wedge orientation.</p>
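<p>The core primitive of the disk-growing heuristic, the largest foreground-only disk at a point, can be sketched directly. This is a simplified brute-force version, not MolRec&rsquo;s implementation; walking the disk toward where it keeps growing would then locate the wedge base:</p>

```python
def max_disk_radius(mask, cx, cy):
    """Largest integer radius r such that the disk centred at (cx, cy)
    contains only foreground pixels of `mask` (a 2D list of 0/1 rows)."""
    h, w = len(mask), len(mask[0])
    r = 0
    while True:
        nr = r + 1
        for dy in range(-nr, nr + 1):
            for dx in range(-nr, nr + 1):
                if dx * dx + dy * dy <= nr * nr:
                    y, x = cy + dy, cx + dx
                    # any out-of-bounds or background pixel stops the growth
                    if not (0 <= y < h and 0 <= x < w) or not mask[y][x]:
                        return r
        r = nr
```

<p>Inside a wedge the inscribed disk keeps growing toward the wide base, so the point of maximal radius marks the stereo-center end; over a thin bond line the radius stays near the stroke width everywhere, which is what separates the two cases.</p>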
<p><strong>Joint Breaking Strategy</strong>: Explicitly breaking all connected joints in the vectorization stage to avoid combinatorial connection complexity. This allows uniform treatment of all line segment connections regardless of junction complexity.</p>
<p><strong>Superatom Dictionary Mining</strong>: The system mines MOL files from the OSRA dataset to build a comprehensive superatom dictionary (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;), supplemented by the Marvin abbreviation collection.</p>
<p><strong>Comprehensive Failure Analysis</strong>: Unlike most OCSR papers that report only aggregate accuracy, this work provides a detailed categorization of all 55 failures, identifying 61 specific error reasons and their root causes.</p>
<h2 id="methodology-and-trec-2011-experiments">Methodology and TREC 2011 Experiments</h2>
<p><strong>Benchmark</strong>: The system was evaluated on the <strong>TREC 2011 Chemical Track</strong> test set consisting of 1,000 molecular diagram images. The authors performed two independent runs with slightly different internal parameter settings to assess reproducibility.</p>
<p><strong>Evaluation Metric</strong>: Correct recall of chemical structures. Output MOL files were compared semantically to ground truth using <strong>OpenBabel</strong>, which treats syntactically different but chemically equivalent representations as matches.</p>
<p><strong>Failure Analysis</strong>: Across both runs, 55 unique diagrams were misrecognized (50 in run 1, 51 in run 2, with significant overlap). The authors manually examined all 55 and categorized them, identifying 61 specific reasons for mis-recognition. This analysis provides insight into systematic limitations of the rule-based approach.</p>
<h2 id="results-and-top-failure-modes">Results and Top Failure Modes</h2>
<p><strong>High Accuracy</strong>: MolRec achieved a <strong>95% correct recovery rate</strong> on the TREC 2011 benchmark:</p>
<ul>
<li>Run 1: 950/1000 structures correctly recognized (95.0%)</li>
<li>Run 2: 949/1000 structures correctly recognized (94.9%)</li>
</ul>
<p>The near-identical results across runs with slightly different internal parameters demonstrate the stability of the rule-based approach.</p>
<p><strong>Top Failure Modes</strong> (from detailed analysis of 55 unique misrecognized diagrams, yielding 61 total error reasons):</p>
<ul>
<li><strong>Dashed wedge bond misidentification (15 cases)</strong>: Most common failure. Short dashes at the narrow end were interpreted as a separate dashed bond while longer dashes were treated as a dashed wedge or dashed bold bond, splitting one bond into two with a spurious node.</li>
<li><strong>Incorrect stereochemistry (10 cases)</strong>: Heuristics guessed wrong 3D orientations for ambiguous bold/dashed bonds where syntax alone is insufficient.</li>
<li><strong>Touching components (6 cases)</strong>: Characters touching bonds, letters touching symbols, or ink bleed between close parallel lines caused segmentation failures.</li>
<li><strong>Incorrect character grouping (5 cases)</strong>: Characters too close together for reliable separation.</li>
<li><strong>Solid circles without 3D hydrogen bond (5 cases)</strong>: MolRec correctly interprets solid circles as implying a hydrogen atom via a solid wedge bond, but some solution MOL files in the test set omit this bond, causing a mismatch.</li>
<li><strong>Diagram caption confusion (5 cases)</strong>: Captions appearing within images are mistakenly parsed as part of the molecular structure.</li>
<li><strong>Unrecognised syntax (5 cases)</strong>: User annotations, unusual notations (e.g., wavy line crossing a dashed wedge), and repetition structures.</li>
<li><strong>Broken characters (3 cases)</strong>: Degraded or partial characters without recovery mechanisms.</li>
<li><strong>Connectivity of superatoms (3 cases)</strong>: Ambiguous permutation of connection points for multi-bonded superatoms.</li>
<li><strong>Problematic bridge bonds (3 cases)</strong>: Extreme perspective or angles outside MolRec&rsquo;s thresholds.</li>
<li><strong>Unhandled bond type (1 case)</strong>: A dashed dative bond not previously encountered.</li>
</ul>
<p><strong>System Strengths</strong>:</p>
<ul>
<li>Douglas-Peucker line simplification proves faster and more robust than Hough transforms across different drawing styles</li>
<li>Disk-growing wedge bond detection effectively distinguishes 3D orientations in most cases</li>
<li>Mining MOL files for superatom dictionary captures real-world chemical abbreviation usage patterns</li>
</ul>
<p><strong>Fundamental Limitations Revealed</strong>:</p>
<ul>
<li><strong>Brittleness</strong>: Small variations in drawing style or image quality can cause cascading failures</li>
<li><strong>Stereochemistry ambiguity</strong>: Even humans disagree on ambiguous cases; automated resolution based purely on syntax is inherently limited</li>
<li><strong>Segmentation dependence</strong>: Most failures trace back to incorrect separation of text, bonds, and graphical elements</li>
<li><strong>No error recovery</strong>: Early-stage mistakes propagate through the pipeline with no mechanism for correction</li>
</ul>
<p><strong>Test Set Quality Issues</strong>: The paper also highlights several cases where the TREC 2011 ground truth itself was questionable. Some solution MOL files omitted stereo bond information for solid circle notations, dative (polar) bonds were inconsistently interpreted as either double bonds or single bonds across the training and test sets, and one diagram contained over-connected carbon atoms (5 bonds without the required positive charge indication) that the solution MOL file did not flag.</p>
<p>The systematic error analysis reveals what 95% accuracy means in practice. The failure modes highlight scalability challenges for rule-based systems when applied to diverse real-world documents with noise, artifacts, and non-standard conventions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dictionary Mining</td>
          <td>OSRA Dataset</td>
          <td>Unknown</td>
          <td>Mined to create superatom dictionary for abbreviations like &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;</td>
      </tr>
      <tr>
          <td>Dictionary</td>
          <td>Marvin Collection</td>
          <td>N/A</td>
          <td>Integrated Marvin abbreviation group collection for additional superatoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 Test Set</td>
          <td>1,000 images</td>
          <td>Standard benchmark for Text REtrieval Conference Chemical Track</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The MolRec pipeline consists of sequential image processing and graph construction stages:</p>
<p><strong>1. Preprocessing</strong></p>
<ul>
<li><strong>Binarization</strong>: Input image converted to binary</li>
<li><strong>Connected Component Labeling</strong>: Identifies distinct graphical elements</li>
<li><strong>OCR</strong>: Simple metric space-based engine identifies characters (letters $L$, digits $N$, symbols $S$)</li>
<li><strong>Character Grouping</strong>: Spatial proximity and type-based heuristics group characters:
<ul>
<li>Horizontal: Letter-Letter, Digit-Digit, Letter-Symbol</li>
<li>Vertical: Letter-Letter only</li>
<li>Diagonal: Letter-Digit, Letter-Charge</li>
</ul>
</li>
</ul>
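<p>The grouping rules above can be sketched as a small lookup over character classes and relative placement. This is an illustrative reconstruction: the class codes follow the paper, but the box-center geometry and the 0.3 slope cutoffs are assumptions, not MolRec&rsquo;s actual thresholds.</p>

```python
# Hypothetical sketch of the type-based grouping heuristics described above.
# Classes: L = letter, N = digit, S = symbol, C = charge sign.
ALLOWED = {
    "horizontal": {("L", "L"), ("N", "N"), ("L", "S")},
    "vertical":   {("L", "L")},
    "diagonal":   {("L", "N"), ("L", "C")},
}

def direction(a, b):
    """Classify the relative placement of two character box centers (cx, cy)."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    if dy < 0.3 * dx:
        return "horizontal"
    if dx < 0.3 * dy:
        return "vertical"
    return "diagonal"

def may_group(char_a, char_b):
    """char = (cx, cy, cls); True if the pair may be merged into one group."""
    d = direction(char_a[:2], char_b[:2])
    return (char_a[2], char_b[2]) in ALLOWED[d]
```

<p>For example, a digit diagonally below-right of a letter (a subscript, as in &ldquo;H2&rdquo; written with a lowered 2) is groupable, while two vertically stacked digits are not.</p>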
<p><strong>2. Vectorization (Line Finding)</strong></p>
<ul>
<li><strong>Image Thinning</strong>: Reduce lines to unit width</li>
<li><strong>Douglas-Peucker Algorithm</strong>: Simplify polylines into straight line segments</li>
<li><strong>Joint Breaking</strong>: Explicitly split lines at junctions where $&gt;2$ segments meet, avoiding combinatorial connection complexity</li>
</ul>
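<p>The Douglas-Peucker step is a standard algorithm; a minimal recursive version (not MolRec&rsquo;s implementation) keeps the polyline&rsquo;s endpoints and recurses on the farthest outlier until every dropped point lies within <code>eps</code> of the simplified line:</p>

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    num = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)
    den = math.hypot(x2 - x1, y2 - y1)
    return num / den if den else math.hypot(x - x1, y - y1)

def douglas_peucker(points, eps):
    """Simplify a polyline: keep endpoints, recurse on the farthest outlier."""
    if len(points) < 3:
        return list(points)
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:idx + 1], eps)
    right = douglas_peucker(points[idx:], eps)
    return left[:-1] + right  # drop the duplicated split point
```

<p>On a thinned bond rendered as many tiny segments, this collapses the polyline to its corner points, which is why the CLEF paper reports tuning the tolerance to roughly 1-2x the average line width.</p>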
<p><strong>3. Bond Recognition Rules</strong></p>
<p>After erasing text from the image, remaining line segments are analyzed:</p>
<ul>
<li><strong>Double/Triple Bonds</strong>: Cluster segments with same slope within threshold distance</li>
<li><strong>Dashed Bonds</strong>: Identify repeated short segments of similar length with collinear center points</li>
<li><strong>Wedge/Bold Bonds</strong>: Dynamic disk algorithm:
<ul>
<li>Place disk with radius $&gt;$ average line width inside component</li>
<li>Grow disk to maximum size to locate triangle base (stereo-center)</li>
<li>&ldquo;Walk&rdquo; disk to find narrow end, distinguishing wedge orientation</li>
</ul>
</li>
<li><strong>Wavy Bonds</strong>: Identify sawtooth pattern polylines after thinning</li>
<li><strong>Implicit Nodes</strong>: Split longer segments at points where parallel shorter segments terminate (carbon atoms in chains)</li>
</ul>
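<p>The dashed-bond rule can be approximated in a few lines: several short segments of similar length whose center points fall on a common line. The length and collinearity tolerances here are illustrative placeholders; the paper does not publish its exact thresholds.</p>

```python
import math

def _centers_collinear(segs, tol=0.1):
    """Check the segment midpoints lie (nearly) on one line."""
    mids = [((x1 + x2) / 2, (y1 + y2) / 2) for (x1, y1), (x2, y2) in segs]
    (ax, ay), (bx, by) = mids[0], mids[-1]
    span = math.hypot(bx - ax, by - ay)
    if span == 0:
        return False
    for mx, my in mids[1:-1]:
        # distance of a midpoint from the line through the first/last midpoints
        d = abs((by - ay) * mx - (bx - ax) * my + bx * ay - by * ax) / span
        if d > tol * span:
            return False
    return True

def is_dashed_bond(segs, len_tol=0.3):
    """Paper's rule: repeated short segments of similar length with
    collinear center points (illustrative tolerances)."""
    if len(segs) < 3:
        return False
    lengths = [math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in segs]
    mean = sum(lengths) / len(lengths)
    if any(abs(l - mean) > len_tol * mean for l in lengths):
        return False
    return _centers_collinear(segs)
```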
<p><strong>4. Graph Construction</strong></p>
<ul>
<li><strong>Node Formation</strong>: Group line segment endpoints by distance threshold</li>
<li><strong>Disambiguation</strong>: Logic separates lowercase &ldquo;l&rdquo;, uppercase &ldquo;I&rdquo;, digit &ldquo;1&rdquo;, and vertical bonds</li>
<li><strong>Superatom Expansion</strong>: Replace abbreviations with full structures using mined dictionary</li>
<li><strong>Stereochemistry Resolution</strong>: Heuristics based on neighbor counts determine direction for ambiguous bold/dashed bonds (known limitation)</li>
</ul>
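<p>The node-formation step reduces to single-link clustering of segment endpoints under a distance threshold. A minimal union-find sketch with averaged node positions (the threshold value is an assumption, and MolRec&rsquo;s actual grouping logic is not published):</p>

```python
import math

def cluster_endpoints(points, thresh):
    """Merge endpoints closer than `thresh` into one graph node,
    returning the averaged position of each node."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= thresh:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return [
        (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
        for g in groups.values()
    ]
```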
<p><strong>5. MOL File Generation</strong></p>
<ul>
<li>Final graph structure converted to standard MOL file format</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Recall</td>
          <td>950/1000</td>
          <td>949/1000</td>
          <td>Slightly different internal parameters between runs</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>95.0%</td>
          <td>94.9%</td>
          <td>Semantic comparison using OpenBabel</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison Method</strong>: OpenBabel converts graphs to MOL files and compares them semantically to ground truth, ignoring syntactic variations that don&rsquo;t affect chemical meaning.</p>
<p><strong>Failure Categorization</strong>: 55 unique misrecognized diagrams analyzed across both runs, identifying 61 specific error reasons across 11 categories including dashed wedge bond misidentification (15), incorrect stereochemistry (10), touching components (6), incorrect character grouping (5), solid circles (5), diagram caption confusion (5), unrecognised syntax (5), broken characters (3), superatom connectivity (3), problematic bridge bonds (3), and unhandled bond type (1).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://openbabel.org/">Open Babel</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Used for semantic MOL file comparison</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/projects/osra/">OSRA</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Source of superatom dictionary data (MOL files mined)</td>
      </tr>
      <tr>
          <td>TREC 2011 Chemical Track</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>1,000 molecular diagram images (available via NIST)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The MolRec source code is not publicly available. The evaluation dataset (TREC 2011) is accessible through NIST, and the tools used for comparison (OpenBabel) are open source. However, full reproduction of MolRec&rsquo;s pipeline would require reimplementation from the paper&rsquo;s descriptions.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Details</strong>: Not explicitly specified in the paper</li>
<li><strong>Performance Note</strong>: Vectorization approach noted as &ldquo;proven to be fast&rdquo; compared to Hough transform alternatives</li>
</ul>
<h3 id="references">References</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawiPerformanceMolRecTREC2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of {{MolRec}} at {{TREC}} 2011 {{Overview}} and {{Analysis}} of {{Results}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 20th {{Text REtrieval Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sadawi, Noureddin M. and Sexton, Alan P. and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2011). Performance of MolRec at TREC 2011 Overview and Analysis of Results. <em>Proceedings of the 20th Text REtrieval Conference</em>. <a href="https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf">https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf</a></p>
<p><strong>Publication</strong>: TREC 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openbabel.org/">Open Babel</a> - Used for semantic MOL file comparison</li>
<li><a href="https://sourceforge.net/projects/osra/">OSRA Project</a> - Source of superatom dictionary data</li>
</ul>
]]></content:encoded></item><item><title>MolRec: Chemical Structure Recognition at CLEF 2012</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</guid><description>MolRec achieves 95%+ accuracy on simple structures but struggles with complex diagrams, revealing rule-based OCSR limits and systematic failures.</description><content:encoded><![CDATA[<h2 id="systematization-of-rule-based-ocsr">Systematization of Rule-Based OCSR</h2>
<p>This is a <strong>Systematization</strong> paper that evaluates and analyzes MolRec&rsquo;s performance in the CLEF 2012 chemical structure recognition competition. The work provides systematic insights into how the improved MolRec system performed on different types of molecular diagrams and reveals structural challenges facing rule-based OCSR approaches through comprehensive failure analysis.</p>
<h2 id="investigating-the-limits-of-rule-based-recognition">Investigating the Limits of Rule-Based Recognition</h2>
<p>This work builds on the TREC 2011 competition, where a previous implementation of MolRec already performed well. The CLEF 2012 competition provided an opportunity to test an improved, more computationally efficient version of MolRec on different datasets and understand how performance varies across complexity levels.</p>
<p>The motivation is to understand exactly where rule-based chemical structure recognition breaks down. Examining the specific types of structures that cause failures provides necessary context for the high accuracy rates achieved on simpler structures.</p>
<h2 id="the-two-stage-molrec-architecture">The Two-Stage MolRec Architecture</h2>
<p>The novelty lies in the systematic evaluation across two different difficulty levels and the comprehensive failure analysis. The authors tested an improved MolRec implementation that was more efficient than the TREC 2011 version, providing insights into both system evolution and the inherent challenges of chemical structure recognition.</p>
<p><strong>MolRec Architecture Overview</strong>: The system follows a two-stage pipeline approach:</p>
<ol>
<li>
<p><strong>Vectorization Stage</strong>: The system preprocesses input images through three steps:</p>
<ul>
<li><strong>Image binarization</strong> using Otsu&rsquo;s method to convert grayscale images to black and white, followed by labelling of connected components</li>
<li><strong>OCR processing</strong> using nearest neighbor classification with a Euclidean metric to identify and remove text components (atom labels, charges, etc.)</li>
<li><strong>Separation of bond elements</strong>: thinning connected components to single-pixel width, building polyline representations, detecting circles, arrows, and solid triangles, and applying the Douglas-Peucker line simplification algorithm to clean up vectorized bonds</li>
</ul>
</li>
<li>
<p><strong>Rule Engine Stage</strong>: A set of 18 chemical rules converts geometric primitives into molecular graphs:</p>
<ul>
<li><strong>Bridge bond recognition</strong> (2 rules applied before all others, handling structures with multiple connection paths depicted in 2.5-dimensional perspective drawings)</li>
<li><strong>Standard bond and atom recognition</strong> (16 rules applied in arbitrary order)</li>
<li><strong>Context-aware disambiguation</strong> resolving ambiguities using the full graph structure and character groups</li>
<li><strong>Superatom expansion</strong> looking up character groups identifying more than one atom in a dictionary and replacing them with molecule subgraphs</li>
</ul>
</li>
</ol>
<p>The system can output results in standard formats like MOL files or SMILES strings.</p>
<h2 id="clef-2012-experimental-design">CLEF 2012 Experimental Design</h2>
<p>The CLEF 2012 organizers provided a set of 961 test images clipped from patent documents, split into two sets:</p>
<ol>
<li>
<p><strong>Automatic Evaluation Set (865 images)</strong>: Images selected for automatic evaluation by comparison of generated MOL files with ground truth MOL files using the OpenBabel toolkit.</p>
</li>
<li>
<p><strong>Manual Evaluation Set (95 images)</strong>: A more challenging collection of images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation. This set was intentionally included to provide a greater challenge.</p>
</li>
</ol>
<p>The authors ran MolRec four times with slightly adjusted internal parameters, then manually examined every incorrect recognition to categorize error types.</p>
<h2 id="performance-divergence-and-critical-failure-modes">Performance Divergence and Critical Failure Modes</h2>
<p>The results reveal a stark performance gap between simple and complex molecular structures:</p>
<p><strong>Performance on Automatic Evaluation Set</strong>: On the 865-image set, MolRec achieved <strong>94.91% to 96.18% accuracy</strong> across four runs with different parameter settings. A total of 46 different diagrams were mis-recognized across all runs.</p>
<p><strong>Performance on Manual Evaluation Set</strong>: On the 95-image set, accuracy dropped to <strong>46.32% to 58.95%</strong>. A total of 52 different diagrams were mis-recognized across all runs. Some diagrams failed for multiple reasons.</p>
<p><strong>Key Failure Modes Identified</strong> (with counts from the paper&rsquo;s Table 3):</p>
<ul>
<li>
<p><strong>Character Grouping</strong> (26 manual, 0 automatic): An implementation bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, so $R_{21}$ was incorrectly recognized as $R_{211}$. A separate problem was difficulty correctly separating closely spaced atom groups.</p>
</li>
<li>
<p><strong>Touching Characters</strong> (8 manual, 1 automatic): The system does not handle touching characters, so overlapping characters cause mis-recognition.</p>
</li>
<li>
<p><strong>Four-Way Junction Failures</strong> (6 manual, 7 automatic): The vectorization process could not handle junctions where four lines meet, leading to incorrect connectivity.</p>
</li>
<li>
<p><strong>OCR Errors</strong> (5 manual, 11 automatic): Character recognition errors included &ldquo;G&rdquo; interpreted as &ldquo;O&rdquo;, &ldquo;alkyl&rdquo; being mis-recognized, and &ldquo;I&rdquo; interpreted as a vertical single bond.</p>
</li>
<li>
<p><strong>Missed Solid and Dashed Wedge Bonds</strong> (6 cases each in the automatic set, 0 manual): The system incorrectly recognized a number of solid wedge and dashed wedge bonds.</p>
</li>
<li>
<p><strong>Missed Wavy Bonds</strong> (2 manual, 1 automatic): Some wavy bonds were not recognized despite the dedicated wavy bond rule.</p>
</li>
<li>
<p><strong>Missed Charge Signs</strong> (1 manual, 2 automatic): While correctly recognizing positive charge signs, MolRec missed three negative charge signs, including one placed at the top left of an atom name.</p>
</li>
<li>
<p><strong>Other Errors</strong>: An atom too close to a bond endpoint was erroneously considered connected, a solid wedge bond too close to a closed node was incorrectly connected, and a dashed bold bond had its stereocentre incorrectly determined.</p>
</li>
</ul>
<p><strong>Dataset Quality Issues</strong>: The authors discovered 11 images where the ground truth MOL files were incorrect and MolRec&rsquo;s recognition was actually correct. As the authors note, such ground truth errors are very difficult to avoid in this complex task.</p>
<p><strong>Key Insights</strong>:</p>
<ul>
<li>
<p><strong>Performance gap between simple and complex structures</strong>: While MolRec achieved over 94% accuracy on standard molecular diagrams, the performance drop on the more challenging manual evaluation set (down to 46-59%) highlights the difficulty of real patent document images.</p>
</li>
<li>
<p><strong>Many errors are fixable</strong>: The authors note that many mis-recognition problems (such as the character grouping bug and four-way junction vectorization) can be solved with relatively simple enhancements.</p>
</li>
<li>
<p><strong>Touching character segmentation</strong> remains a notoriously difficult open problem that the authors plan to explore further.</p>
</li>
<li>
<p><strong>Evaluation challenges</strong>: The discovery of 11 incorrect ground truth MOL files illustrates how difficult it is to create reliable benchmarks for chemical structure recognition.</p>
</li>
</ul>
<p>The authors conclude that despite the high recognition rates on simpler structures, there is still plenty of room for improvement. They identify future work areas including recognition of more general Markush structures, robust charge sign spotting, and accurate identification of wedge bonds.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="system-architecture">System Architecture</h3>
<p><strong>Model Type</strong>: Non-neural, Rule-Based System (Vectorization Pipeline + Logic Engine)</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Datasets (CLEF 2012)</strong>: 961 total test images clipped from patent documents:</p>
<ul>
<li><strong>Automatic Evaluation Set</strong>: 865 images evaluated automatically using OpenBabel for exact structural matching of generated MOL files against ground truth</li>
<li><strong>Manual Evaluation Set</strong>: 95 images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation</li>
</ul>
<p><strong>Training Data</strong>: The paper does not describe the reference set used to build OCR character prototypes for the nearest neighbor classifier.</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Vectorization Pipeline</strong> (three steps):</p>
<ul>
<li><strong>Image Binarization</strong>: Otsu&rsquo;s method, followed by connected component labelling</li>
<li><strong>OCR</strong>: Nearest neighbor classification with Euclidean distance metric; recognized characters are removed from the image</li>
<li><strong>Bond Element Separation</strong>: Thinning to single-pixel width, polyline construction, Douglas-Peucker line simplification (threshold set to 1-2x average line width), detection of circles, arrows, and solid triangles</li>
</ul>
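<p>Otsu&rsquo;s binarization step is fully standard and can be reproduced independently of MolRec. A dependency-free sketch that picks the gray level maximizing between-class variance:</p>

```python
def otsu_threshold(pixels, levels=256):
    """Pick the threshold maximizing between-class variance (Otsu's method).
    `pixels` is a flat iterable of integer gray levels in [0, levels)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0 = sum0 / w0                    # mean of the dark class
        mu1 = (sum_all - sum0) / w1        # mean of the bright class
        var = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t  # binarize as: pixel > best_t -> foreground/background split
```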
<p><strong>Rule Engine</strong>: 18 chemical structure rules converting geometric primitives to molecular graphs:</p>
<ul>
<li><strong>Bridge Bond Rules (2 rules)</strong>: Applied before all other rules, handling bridge bond structures depicted in 2.5-dimensional perspective drawings</li>
<li><strong>Wavy Bond Rule</strong>: Detailed in paper, identifies approximately collinear connected line segments with zig-zag patterns ($n \geq 3$ segments)</li>
<li><strong>Standard Recognition Rules</strong>: 16 rules for bonds, atoms, and chemical features (applied in arbitrary order; most not detailed in this paper)</li>
</ul>
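<p>The wavy-bond rule (Rule 2.2, the only rule the paper details) can be sketched as follows: a connected polyline of at least three segments whose turn direction alternates at every interior vertex. The paper&rsquo;s additional requirement that the segments be approximately collinear overall is omitted here for brevity.</p>

```python
def is_wavy_bond(polyline, min_segments=3):
    """Hedged sketch of the wavy-bond test: a sawtooth polyline whose
    turn direction alternates at every interior vertex."""
    if len(polyline) - 1 < min_segments:
        return False
    crosses = []
    for (ax, ay), (bx, by), (cx, cy) in zip(polyline, polyline[1:], polyline[2:]):
        # z-component of the cross product of consecutive segment vectors:
        # sign gives the turn direction at vertex (bx, by)
        crosses.append((bx - ax) * (cy - by) - (by - ay) * (cx - bx))
    return all(c != 0 for c in crosses) and \
        all(c1 * c2 < 0 for c1, c2 in zip(crosses, crosses[1:]))
```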
<p><strong>Optimization</strong>: Performance tuned via manual adjustment of fuzzy and strict geometric threshold parameters.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Automated</strong>: Exact structural match via OpenBabel MOL file comparison</li>
<li><strong>Manual</strong>: Visual inspection by human experts for structures where OpenBabel fails</li>
</ul>
<p><strong>Results</strong>:</p>
<ul>
<li><strong>Automatic Evaluation Set (865 images)</strong>: 94.91% to 96.18% accuracy across four runs</li>
<li><strong>Manual Evaluation Set (95 images)</strong>: 46.32% to 58.95% accuracy across four runs</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Given the era (2012) and the algorithmic nature of the pipeline (Otsu binarization, thinning, geometric analysis), the system likely ran on standard CPUs.</p>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p><strong>Closed.</strong> No public code, data, or models are available. The paper outlines high-level logic (Otsu binarization, Douglas-Peucker simplification, 18-rule system) but does not provide:</p>
<ul>
<li>The complete specification of all 18 rules (only Rule 2.2 for wavy bonds is detailed)</li>
<li>Exact numerical threshold values for fuzzy/strict parameters used in the CLEF runs</li>
<li>OCR training data or character prototype specifications</li>
</ul>
<p>The authors refer readers to a separate 2012 DRR (SPIE) paper [5] for a more detailed overview of the MolRec system architecture.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012: Overview and Analysis of Results. Working Notes of CLEF 2012 Evaluation Labs and Workshop. CLEF. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Workshop (ImageCLEF Track)</p>
]]></content:encoded></item><item><title>MolNexTR: A Dual-Stream Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</guid><description>Dual-stream encoder combining ConvNext and ViT for robust optical chemical structure recognition across diverse molecular drawing styles.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Leung, C. T., Huang, Y., Sun, J., Chen, H., &amp; Gao, H. (2024). MolNexTR: a generalized deep learning model for molecular image recognition. <em>Journal of Cheminformatics</em>, 16(141). <a href="https://doi.org/10.1186/s13321-024-00926-w">https://doi.org/10.1186/s13321-024-00926-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/CYF2000127/MolNexTR">GitHub Repository</a></li>
<li><a href="https://huggingface.co/datasets/CYF200127/MolNexTR/tree/main">HuggingFace Dataset/Model</a></li>
</ul>
<h2 id="methodology-overview-and-taxonomic-classification">Methodology Overview and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$). It proposes a neural network architecture (MolNexTR) that integrates ConvNext and Vision Transformers to solve the Optical Chemical Structure Recognition (OCSR) task. The paper validates this method through ablation studies and benchmarking against existing methods including MolScribe and DECIMER.</p>
<h2 id="the-challenge-of-domain-specific-drawing-styles-in-ocsr">The Challenge of Domain-Specific Drawing Styles in OCSR</h2>
<p>Converting molecular images from chemical literature into machine-readable formats (SMILES) is critical but challenging due to the high variance in drawing styles, fonts, and conventions (e.g., Markush structures, abbreviations). Existing methods have limitations:</p>
<ul>
<li>CNN-based and ViT-based models often struggle to generalize across diverse, non-standard drawing styles found in real literature.</li>
<li>Pure ViT methods lack translation invariance and local feature representation, while pure CNNs struggle with global dependencies.</li>
<li>Many models predict SMILES strings directly, making it difficult to enforce chemical validity or resolve complex stereochemistry and abbreviations.</li>
</ul>
<h2 id="core-innovation-dual-stream-encoding-and-image-contamination">Core Innovation: Dual-Stream Encoding and Image Contamination</h2>
<p>MolNexTR introduces three main innovations:</p>
<ol>
<li><strong>Dual-Stream Encoder</strong>: A hybrid architecture processing images simultaneously through a ConvNext stream (for local features) and a Vision Transformer stream (for long-range dependencies), fusing them to capture multi-scale information.</li>
<li><strong>Image Contamination Augmentation</strong>: A specialized data augmentation algorithm that simulates real-world &ldquo;noise&rdquo; found in literature, such as overlapping text, arrows, and partial molecular fragments, to improve robustness.</li>
<li><strong>Graph-Based Decoding with Post-Processing</strong>: Unlike pure image-to-SMILES translation, it predicts atoms and bonds (graph generation) and uses a stereochemical discrimination and abbreviation self-correction module to enforce chemical rules (e.g., chirality) and resolve superatoms (e.g., &ldquo;Ph&rdquo;, &ldquo;Bn&rdquo;).</li>
</ol>
<p>The prediction of atom labels and coordinates is formulated as a conditional autoregressive generation task, optimized via a cross-entropy loss:
$$ \mathcal{L}_{\text{atom}} = -\sum_{t=1}^{T} \log P(x_t \mid \text{Image}, x_{&lt;t}) $$</p>
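<p>In code, this loss is just the summed negative log-probability the model assigns to each ground-truth token. A minimal framework-free sketch (a real implementation would use a batched cross-entropy over decoder logits):</p>

```python
import math

def autoregressive_nll(token_probs):
    """Negative log-likelihood of a token sequence under a conditional
    autoregressive model, matching the loss above. token_probs[t] is
    P(x_t | Image, x_<t) for the ground-truth token at step t."""
    return -sum(math.log(p) for p in token_probs)
```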
<h2 id="experimental-setup-benchmarking-on-synthetic-and-real-data">Experimental Setup: Benchmarking on Synthetic and Real Data</h2>
<p>The model was trained on synthetic data (PubChem) and real patent data (USPTO). It was evaluated on nine benchmarks (three synthetic, six real-world):</p>
<ul>
<li><strong>Synthetic</strong>: Indigo, ChemDraw, RDKit (rendered from 5,719 molecules)</li>
<li><strong>Real-World</strong>: CLEF, UOB, JPO, USPTO, Staker, and a newly curated ACS dataset (diverse styles)</li>
</ul>
<p><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec) and deep learning models (MolScribe, DECIMER, SwinOCSR, Img2Mol).</p>
<p><strong>Ablations</strong>: Tested the impact of the dual-stream encoder vs. single streams, and the contribution of individual augmentation strategies.</p>
<h2 id="empirical-results-and-robustness-findings">Empirical Results and Robustness Findings</h2>
<ul>
<li><strong>Performance</strong>: MolNexTR achieved 81-97% accuracy across test sets, outperforming the second-best method (often MolScribe) by margins of 0.3% to 10.0% (on the difficult ACS dataset).</li>
<li><strong>Perturbation resilience</strong>: The model maintained higher accuracy under image perturbations (rotation, noise) and &ldquo;curved arrow&rdquo; noise common in reaction mechanisms compared to MolScribe and DECIMER (Table 3).</li>
<li><strong>Ablation Results</strong>: The dual-stream encoder consistently outperformed single CNN or ViT baselines, and the image contamination algorithm significantly boosted performance on noisy real-world data (ACS).</li>
<li><strong>Limitations</strong>: The model still struggles with extremely complex hand-drawn molecules and mechanism diagrams where arrows or text are conflated with structure. The authors also note that R-group information in real literature often appears in separate text or tables, which the model does not incorporate.</li>
</ul>
<p><strong>Key Results (Table 2, SMILES exact match accuracy %)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolScribe</th>
          <th>MolNexTR</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Indigo</td>
          <td>97.5</td>
          <td>97.8</td>
          <td>+0.3</td>
      </tr>
      <tr>
          <td>ChemDraw</td>
          <td>93.8</td>
          <td>95.1</td>
          <td>+1.3</td>
      </tr>
      <tr>
          <td>RDKit</td>
          <td>94.6</td>
          <td>96.4</td>
          <td>+1.8</td>
      </tr>
      <tr>
          <td>CLEF</td>
          <td>88.3</td>
          <td>90.4</td>
          <td>+2.1</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td>87.9</td>
          <td>88.5</td>
          <td>+0.6</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>77.7</td>
          <td>82.1</td>
          <td>+4.4</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>92.6</td>
          <td>93.8</td>
          <td>+1.2</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>86.9</td>
          <td>88.3</td>
          <td>+1.4</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td>71.9</td>
          <td>81.9</td>
          <td>+10.0</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: ~1M molecules randomly selected from PubChem, rendered using RDKit and Indigo with varied styles (thickness, fonts, bond width)</li>
<li><strong>Real</strong>: 0.68M images from USPTO, with coordinates normalized from MOLfiles</li>
</ul>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Render Augmentation</strong>: Randomized drawing styles (line width, font size, label modes)</li>
<li><strong>Image Augmentation</strong>: Rotation, cropping, blurring, noise (Gaussian, salt-and-pepper)</li>
<li><strong>Molecular Augmentation</strong>: Randomly replacing functional groups with abbreviations (from a list of &gt;100) or complex chains (e.g., CH3CH2NH2); adding R-groups</li>
<li><strong>Image Contamination</strong>: Adding &ldquo;noise&rdquo; objects (arrows, lines, text, partial structures) at a minimum distance from the main molecule to simulate literature artifacts</li>
</ul>
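<p>As one concrete example of the image-augmentation step, a minimal salt-and-pepper corruption might look like the following sketch (the 5% default and the list-of-rows image format are illustrative assumptions, not the paper&rsquo;s settings):</p>

```python
import random

def salt_and_pepper(image, amount=0.05, rng=None):
    """Flip roughly a fraction `amount` of pixels to pure black (0) or
    white (255). `image` is a list of rows of gray values; the input is
    left untouched and a corrupted copy is returned."""
    rng = rng or random.Random(0)
    out = [row[:] for row in image]
    h, w = len(out), len(out[0])
    for _ in range(int(amount * h * w)):
        y, x = rng.randrange(h), rng.randrange(w)
        out[y][x] = rng.choice((0, 255))
    return out
```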
<h3 id="algorithms">Algorithms</h3>
<p><strong>Dual-Stream Encoder</strong>:</p>
<ul>
<li><strong>CNN Stream</strong>: ConvNext backbone (pre-trained on ImageNet), generating feature maps at scales $H/4$ to $H/32$</li>
<li><strong>ViT Stream</strong>: Parallel transformer blocks receiving patches of sizes $p=4, 8, 16, 32$. Uses Multi-Head Self-Attention (MHSA) and Feed-Forward Networks (FFN)</li>
<li><strong>Fusion</strong>: Outputs from both streams are concatenated</li>
</ul>
<p><strong>Decoder (Graph Generation)</strong>:</p>
<ul>
<li><strong>Transformer Decoder</strong>: 6 layers, 8 heads, hidden dim 256</li>
<li><strong>Task 1 (Atoms)</strong>: Autoregressive prediction of atom tokens $(l, x, y)$ (label + coordinates)</li>
<li><strong>Task 2 (Bonds)</strong>: Prediction of bond types between atom pairs (None, Single, Double, Triple, Aromatic, Solid Wedge, Dashed Wedge)</li>
</ul>
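<p>The two decoding tasks combine naturally into a graph-assembly step. The sketch below is a hypothetical illustration (not the authors' code) of how predicted atom tokens and a pairwise bond-type matrix could be merged into an explicit molecular graph:</p>

```python
BOND_TYPES = ("None", "Single", "Double", "Triple", "Aromatic",
              "Solid Wedge", "Dashed Wedge")

def assemble_graph(atom_tokens, bond_matrix):
    """Merge the decoder's outputs into an explicit molecular graph.

    atom_tokens: list of (label, x, y) triples from Task 1.
    bond_matrix: bond_matrix[i][j] indexes BOND_TYPES for atom pair
    (i, j); only the upper triangle is read (bonds are undirected).
    """
    atoms = [{"label": label, "pos": (x, y)} for label, x, y in atom_tokens]
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            bond = BOND_TYPES[bond_matrix[i][j]]
            if bond != "None":
                bonds.append((i, j, bond))
    return {"atoms": atoms, "bonds": bonds}
```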
<p><strong>Post-Processing</strong>:</p>
<ul>
<li><strong>Stereochemistry</strong>: Uses predicted coordinates and bond types (wedge/dash) to resolve chirality using RDKit logic</li>
<li><strong>Abbreviation Correction</strong>: Matches superatoms to a dictionary; if unknown, attempts to greedily connect atoms based on valence or finds the nearest match ($\sigma=0.8$ similarity threshold)</li>
</ul>
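<p>The abbreviation-correction step can be sketched with a toy dictionary and Python's <code>difflib</code> as a stand-in similarity measure; the paper reports a &gt;100-entry dictionary and the $\sigma=0.8$ threshold but does not specify the similarity function, so both the entries and the metric below are assumptions:</p>

```python
import difflib

# Toy subset of a superatom-to-SMILES dictionary (the real one has >100 entries).
ABBREVIATIONS = {"Me": "C", "Et": "CC", "Ph": "c1ccccc1", "OMe": "OC"}

def expand_superatom(label, sigma=0.8):
    """Expand a superatom label via dictionary lookup; for unknown labels,
    fall back to the most similar known entry at or above the `sigma`
    threshold. Returns None when nothing is close enough."""
    if label in ABBREVIATIONS:
        return ABBREVIATIONS[label]
    best, best_score = None, sigma
    for known in ABBREVIATIONS:
        score = difflib.SequenceMatcher(None, label, known).ratio()
        if score >= best_score:
            best, best_score = known, score
    return ABBREVIATIONS[best] if best is not None else None
```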
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder (ConvNeXt + ViT Encoder -&gt; Transformer Decoder)</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Optimizer: Adam (max lr 3e-4, linear warmup for 5% of steps)</li>
<li>Batch Size: 256</li>
<li>Image Size: $384 \times 384$</li>
<li>Dropout: 0.1</li>
</ul>
</li>
<li><strong>Training</strong>: Fine-tuned CNN backbone for 40 epochs on 10 NVIDIA RTX 3090 GPUs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: SMILES sequence exact matching accuracy (canonicalized)</p>
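<p>The metric itself is simple once a canonicalizer is fixed. In practice the canonicalization would be done with RDKit's <code>Chem.CanonSmiles</code>; the sketch below keeps the canonicalizer pluggable (identity by default) so the accuracy logic stays dependency-free:</p>

```python
def exact_match_accuracy(predictions, references, canonicalize=lambda s: s):
    """Fraction of predicted SMILES that exactly match their reference
    after both sides are canonicalized. Pass e.g. rdkit.Chem.CanonSmiles
    as `canonicalize` for a chemically meaningful comparison."""
    assert len(predictions) == len(references)
    hits = sum(canonicalize(p) == canonicalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```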
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: Indigo (5,719), ChemDraw (5,719), RDKit (5,719)</li>
<li><strong>Real</strong>: CLEF (992), UOB (5,740), JPO (450), USPTO (5,719), Staker (50,000), ACS (331)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: 10 NVIDIA RTX 3090 GPUs</li>
<li><strong>Cluster</strong>: HPC3 Cluster at HKUST (ITSC)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CYF2000127/MolNexTR">MolNexTR GitHub</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation (PyTorch, Jupyter notebooks)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/CYF200127/MolNexTR">MolNexTR HuggingFace</a></td>
          <td>Dataset/Model</td>
          <td>Apache-2.0</td>
          <td>Training data and model checkpoint</td>
      </tr>
  </tbody>
</table>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chenMolNexTRGeneralizedDeep2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{MolNexTR}: a generalized deep learning model for molecular image recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chen, Yufan and Leung, Ching Ting and Huang, Yong and Sun, Jianwei and Chen, Hao and Gao, Hanyu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{141}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00926-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInfty: Chemical Structure Recognition in Patent Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</guid><description>Fujiyoshi et al.'s segment-based approach for recognizing chemical structures in challenging Japanese patent images with touching characters and broken lines.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fujiyoshi, A., Nakagawa, K., &amp; Suzuki, M. (2011). Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty. <em>Pre-Proceedings of the 9th IAPR International Workshop on Graphics Recognition, GREC.</em></p>
<p><strong>Publication</strong>: GREC 2011 (Graphics Recognition Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.sciaccess.net/en/InftyReader/">InftyReader Project</a></li>
</ul>
<h2 id="contribution-segment-based-ocsr-method">Contribution: Segment-Based OCSR Method</h2>
<p>This is a <strong>method paper</strong> that introduces ChemInfty, a rule-based system for Optical Chemical Structure Recognition (OCSR) specifically designed to handle the challenging, low-quality images found in Japanese patent applications.</p>
<h2 id="motivation-the-challenge-of-degraded-patent-images">Motivation: The Challenge of Degraded Patent Images</h2>
<p>The motivation is straightforward: Japanese patent applications contain a massive amount of chemical knowledge, but the images are remarkably poor quality. Unlike the relatively clean molecular diagrams in scientific papers, patent images suffer from multiple problems that break conventional OCSR systems.</p>
<p>The authors quantified these issues in a sample of 200 patent images and found that 22% contained touching characters (where atom labels merge together), 19.5% had characters touching bond lines, and 8.5% had broken lines. These are not edge cases; they are pervasive enough to cripple existing recognition tools. Established systems like CLIDE, ChemReader, and OSRA struggle significantly with line-touching characters and broken lines, leading to recognition failures.</p>
<p>The challenge is compounded by the sheer diversity of creation methods. Some structures are drawn with sophisticated molecular editors, others with basic paint programs, and some are even handwritten. This means there&rsquo;s no standardization in fonts, character sizes, or line thickness. Add in the effects of scanning and faxing, and you have images with significant noise, distortion, and degradation.</p>
<p>The goal of ChemInfty is to build a system robust enough to handle these messy real-world conditions and make Japanese patent chemistry computer-searchable.</p>
<h2 id="core-innovation-segment-decomposition-and-dynamic-programming">Core Innovation: Segment Decomposition and Dynamic Programming</h2>
<p>The novelty lies in a segment-based decomposition approach that separates the recognition problem into manageable pieces before attempting to classify them. The key insight is that traditional OCR fails on these images because characters and lines are physically merged. You cannot recognize a character if you cannot cleanly separate it from the surrounding bonds first.</p>
<p>ChemInfty&rsquo;s approach has several distinctive elements:</p>
<ol>
<li>
<p><strong>Line and Curve Segmentation</strong>: The system first decomposes the image into smaller line and curve segments. The decomposition happens at natural breakpoints&mdash;crossings, sharp bends, and other locations where touching is likely to occur. This creates a set of primitive elements that can be recombined in different ways.</p>
</li>
<li>
<p><strong>Linear Order Assumption for Scalability</strong>: To make the dynamic programming approach computationally tractable and avoid combinatorial explosion, the system assumes that segments to be combined are adjacent when sorted in one of four directional orderings ($\perp, \setminus, \triangle, \rightarrow$). This constraint dramatically reduces the search space while still capturing the natural spatial relationships in chemical diagrams.</p>
</li>
<li>
<p><strong>Dynamic Programming for Segment Combination</strong>: Once the image is decomposed, the system faces a combinatorial problem: which segments should be grouped together to form characters, and which should be classified as bonds? The authors use dynamic programming to efficiently search for the &ldquo;most suitable combination&rdquo; of segments. This optimization finds the configuration that maximizes the likelihood of valid chemical structure elements.</p>
</li>
<li>
<p><strong>Two-Pass OCR Strategy</strong>: ChemInfty integrates with InftyReader, a powerful OCR engine. The system uses OCR twice in the pipeline:</p>
<ul>
<li><strong>First pass</strong>: High-confidence character recognition removes obvious atom labels early, simplifying the remaining image</li>
<li><strong>Second pass</strong>: After the segment-based method identifies and reconstructs difficult character regions, OCR is applied again to the cleaned-up character image</li>
</ul>
<p>This two-stage approach handles both easy and hard cases effectively: simple characters are recognized immediately, while complex cases get special treatment.</p>
</li>
<li>
<p><strong>Image Thinning for Structure Analysis</strong>: Before segmentation, the system thins the remaining graphical elements (after removing high-confidence characters) to skeleton lines. This thinning operation reveals the underlying topological structure&mdash;crossings, bends, and endpoints&mdash;making it easier to detect where segments should be divided.</p>
</li>
<li>
<p><strong>Proximity-Based Grouping</strong>: After identifying potential character segments, the system groups nearby segments together. This spatial clustering ensures that parts of the same character that were separated by bonds get recombined correctly.</p>
</li>
</ol>
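<p>The proximity-based grouping in step 6 can be sketched as a union-find clustering over segment centroids; the actual distance measure and threshold used by ChemInfty are not reported, so both are assumptions here:</p>

```python
from math import hypot

def group_by_proximity(centers, max_dist):
    """Union-find clustering: two segments whose centers lie within
    `max_dist` of each other land in the same group (transitively)."""
    parent = list(range(len(centers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, (xi, yi) in enumerate(centers):
        for j in range(i + 1, len(centers)):
            xj, yj = centers[j]
            if hypot(xi - xj, yi - yj) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(centers)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```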
<h2 id="methodology-real-world-patent-evaluation">Methodology: Real-World Patent Evaluation</h2>
<p>The evaluation focused on demonstrating that ChemInfty could handle real-world patent images at scale:</p>
<ol>
<li>
<p><strong>Large-Scale Patent Dataset</strong>: The system was tested on chemical structure images from Japanese patent applications published in 2008. This represents a realistic deployment scenario with all the messiness of actual documents.</p>
</li>
<li>
<p><strong>Touching Character Separation</strong>: The authors specifically measured the system&rsquo;s ability to separate characters from bonds when they were touching. Success was defined as cleanly extracting the character region so that OCR could recognize it.</p>
</li>
<li>
<p><strong>Recognition Accuracy by Object Type</strong>: Performance was broken down by element type (characters, line segments, solid wedges, and hashed wedges). This granular analysis revealed which components were easier or harder for the system to handle.</p>
</li>
<li>
<p><strong>End-to-End Performance</strong>: The overall recognition ratio was calculated across all object types to establish the system&rsquo;s practical utility for automated patent processing.</p>
</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Effective Separation for Line-Touching Characters</strong>: The segment-based method successfully separated 63.5% of characters that were touching bond lines. This is a substantial improvement over standard OCR, which typically fails completely on such cases. The authors note that when image quality is reasonable, the separation method works well.</p>
</li>
<li>
<p><strong>Strong Overall Character Recognition</strong>: Character recognition achieved 85.86% accuracy, which is respectable given the poor quality of the input images. Combined with the 90.73% accuracy for line segments, this demonstrates the system can reliably reconstruct the core molecular structure.</p>
</li>
<li>
<p><strong>Weak Performance on Wedges</strong>: The system struggled significantly with stereochemistry notation. Solid wedges were correctly recognized only 52.54% of the time, and hashed wedges fared even worse at 23.63%. This is a critical limitation since stereochemistry is often essential for understanding molecular properties.</p>
</li>
<li>
<p><strong>Image Quality Dependency</strong>: The authors acknowledge that the method&rsquo;s effectiveness is ultimately limited by image quality. When images are severely degraded (blurred to the point where even humans struggle to distinguish characters from noise), the segmentation approach cannot reliably separate touching elements.</p>
</li>
<li>
<p><strong>Overall System Performance</strong>: The combined recognition ratio of 86.58% for all objects indicates that ChemInfty is a working system but not yet production-ready. The authors conclude that further refinement is necessary, particularly for wedge recognition and handling extremely low-quality images.</p>
</li>
</ul>
<p>The work establishes that segment-based decomposition with dynamic programming is a viable approach for handling the specific challenges of patent image OCSR. The two-pass OCR strategy and the use of image thinning to reveal structure are practical engineering solutions that improve robustness. However, the results also highlight that rule-based methods are fundamentally limited by image quality. There is only so much you can do with algorithmic cleverness when the input is severely degraded. This limitation would motivate later work on deep learning approaches that can learn robust feature representations from large datasets.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="technical-paradigm">Technical Paradigm</h3>
<p><strong>This is a pre-deep learning (2011) classical computer vision paper.</strong> The system uses rule-based methods and traditional OCR engines, not neural networks.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>InftyReader</strong>: A mathematical OCR engine used for the initial high-confidence character recognition pass. This is a pre-existing external tool.</li>
<li><strong>DEF-based OCR</strong>: A standard OCR engine based on Directional Element Features (DEF). These are manually engineered statistical features (histograms of edge directions), not learned neural network features.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a multi-step recognition pipeline:</p>
<ol>
<li><strong>Preprocessing</strong>: Binarization and smoothing</li>
<li><strong>Initial Character Removal</strong>: High-confidence characters are recognized by the InftyReader OCR engine and removed from the image to simplify segmentation</li>
<li><strong>Skeletonization</strong>: Thinning using <strong>Hilditch&rsquo;s algorithm</strong> to skeletonize graphical elements, revealing topological structure (crossings, bends, endpoints)</li>
<li><strong>Feature Point Detection</strong>:
<ul>
<li><strong>Crossing points</strong>: Direct detection on skeleton</li>
<li><strong>Bending points</strong>: Detected using the <strong>Hough transformation</strong></li>
</ul>
</li>
<li><strong>Dynamic Programming Search</strong>:
<ul>
<li><strong>Input</strong>: Set of line/curve segments $S$</li>
<li><strong>Procedure</strong>: Sort segments in 4 directions ($\perp, \setminus, \triangle, \rightarrow$). For each direction, use DP to find the grouping that minimizes a heuristic score</li>
<li><strong>Complexity</strong>: $O(n^2)$ where $n$ is the number of segments</li>
<li><strong>Scoring</strong>: Uses a function <code>Measure(S')</code> that returns a score (0-100) indicating if a subset of segments forms a valid character or bond</li>
</ul>
</li>
</ol>
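<p>The dynamic programming search restricted to contiguous runs in the sorted order can be sketched as a classic sequence-partition DP. The scoring lambda in the usage below is a toy stand-in for the paper's undefined <code>Measure(S')</code>, charging each group a fixed overhead plus its spread:</p>

```python
def best_grouping(segments, measure):
    """Partition a directionally sorted segment list into contiguous
    groups minimizing the summed group score. Contiguity in the sorted
    order is the paper's linear-order assumption; it keeps the search
    at O(n^2) subproblems instead of exponentially many subsets."""
    n = len(segments)
    best = [0.0] + [float("inf")] * n  # best[i]: optimal cost of first i segments
    cut = [0] * (n + 1)                # cut[i]: start index of the last group
    for i in range(1, n + 1):
        for j in range(i):
            cost = best[j] + measure(segments[j:i])
            if cost < best[i]:
                best[i], cut[i] = cost, j
    groups, i = [], n
    while i > 0:
        groups.append(segments[cut[i]:i])
        i = cut[i]
    return best[n], groups[::-1]
```

<p>With a spread-plus-overhead score, nearby segments are merged while distant ones stay separate, which is the behavior the character/bond grouping needs.</p>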
<h3 id="data">Data</h3>
<p><strong>Evaluation Dataset</strong>: Chemical structure images from Japanese patent applications published in 2008. The complete 2008 dataset contains 229,969 total images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Japanese Published Patent Applications (2008)</td>
          <td>1,599 images</td>
          <td>Contains 229,969 total images for the year. Format: TIFF, 200-400 dpi.</td>
      </tr>
      <tr>
          <td>Analysis</td>
          <td>Random subset for frequency analysis</td>
          <td>200 images</td>
          <td>Used to estimate frequency of touching/broken characters (found in ~20% of images).</td>
      </tr>
  </tbody>
</table>
<p><strong>No Training Set</strong>: The system is rule-based and uses pre-built OCR engines, so no model training was performed.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: Recognition ratio (percentage of correctly recognized objects)</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Line-touching Separation</td>
          <td>63.5%</td>
          <td>Success rate for separating text glued to lines</td>
      </tr>
      <tr>
          <td>Character Recognition</td>
          <td>85.86%</td>
          <td>For all character sizes</td>
      </tr>
      <tr>
          <td>Line segments</td>
          <td>90.73%</td>
          <td>Standard bond recognition</td>
      </tr>
      <tr>
          <td>Solid Wedge Recognition</td>
          <td>52.54%</td>
          <td>Low performance noted as area for improvement</td>
      </tr>
      <tr>
          <td>Hashed Wedges</td>
          <td>23.63%</td>
          <td>Poorest performing element type</td>
      </tr>
      <tr>
          <td>Overall</td>
          <td>86.58%</td>
          <td>Combined across all object types</td>
      </tr>
  </tbody>
</table>
<p><strong>Total Objects Evaluated</strong>: 742,287 objects (characters, line segments, solid wedges, hashed wedges) extracted from the patent images.</p>
<h3 id="hardware">Hardware</h3>
<p>Not reported. Computational cost was not a primary concern for this classical CV system.</p>
<h3 id="replicability">Replicability</h3>
<p><strong>Low.</strong> The paper does not provide sufficient detail for full replication:</p>
<ul>
<li>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined</li>
<li>Dependency on the proprietary/specialized InftyReader engine</li>
<li>No pseudocode provided for the segment decomposition heuristics</li>
</ul>
<h3 id="notes-on-wedge-recognition">Notes on Wedge Recognition</h3>
<p>The system&rsquo;s poor performance on solid wedges (52.54%) and hashed wedges (23.63%) reflects a fundamental challenge for classical thinning algorithms. Wedge bonds are dense triangular regions that indicate 3D stereochemistry. When skeletonized using algorithms like Hilditch&rsquo;s method, these &ldquo;blob&rdquo; shapes often distort into unrecognizable patterns, unlike the clean thin lines that represent regular bonds.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fujiyoshiRobustMethodSegmentation2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Robust {{Method}} of {{Segmentation}} and {{Recognition}} of {{Chemical Structure Images}} in {{ChemInfty}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fujiyoshi, Akio and Nakagawa, Koji and Suzuki, Masakazu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2011</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Pre-proceedings of the 9th IAPR international workshop on graphics recognition, GREC}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser: End-to-End Molecular Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</guid><description>MolParser converts molecular images from scientific documents to machine-readable formats using end-to-end learning with Extended SMILES representation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Wang, J., Cai, X., Chen, S., Yang, S., Tao, H., Wang, N., Yao, L., Zhang, L., &amp; Ke, G. (2025). MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild. In <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em> (pp. 24528-24538). <a href="https://doi.org/10.48550/arXiv.2411.11098">https://doi.org/10.48550/arXiv.2411.11098</a></p>
<p><strong>Publication</strong>: ICCV 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/">MolParser-7M Dataset</a> - 7M+ image-text pairs for OCSR</li>
<li><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M on HuggingFace</a> - Dataset repository</li>
<li><a href="https://huggingface.co/UniParser/MolDet">MolDet YOLO Detector</a> - Object detection model for extracting molecular images from documents</li>
</ul>
<h2 id="contribution-end-to-end-ocsr-and-real-world-resources">Contribution: End-to-End OCSR and Real-World Resources</h2>
<p>This is primarily a <strong>Method</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a>), with a significant secondary contribution as a <strong>Resource</strong> paper.</p>
<p><strong>Method contribution ($\Psi_{\text{Method}}$)</strong>: The paper proposes a novel end-to-end architecture combining a Swin Transformer encoder with a BART decoder, and crucially introduces <strong>Extended SMILES (E-SMILES)</strong>, a new syntactic extension to standard SMILES notation that enables representation of Markush structures, abstract rings, and variable attachment points found in patents. The work validates this method through extensive ablation studies, achieving the highest accuracy among tested OCSR systems on WildMol-10k (76.9%).</p>
<p><strong>Resource contribution ($\Psi_{\text{Resource}}$)</strong>: The paper introduces <strong>MolParser-7M</strong>, the largest OCSR dataset to date (7.7M image-text pairs), and <strong>WildMol</strong>, a challenging benchmark of 20,000 manually annotated real-world molecular images. The construction of these datasets through an active learning data engine with human-in-the-loop validation represents significant infrastructure that enables future OCSR research.</p>
<h2 id="motivation-extracting-chemistry-from-real-world-documents">Motivation: Extracting Chemistry from Real-World Documents</h2>
<p>The motivation stems from a practical problem in chemical informatics: vast amounts of chemical knowledge remain embedded in unstructured formats. Patents, research papers, and legacy documents depict molecular structures as images. This creates a barrier for large-scale data analysis and prevents Large Language Models from effectively understanding scientific literature in chemistry and drug discovery.</p>
<p>Existing OCSR methods struggle with real-world documents for two fundamental reasons:</p>
<ol>
<li><strong>Representational limitations</strong>: Standard SMILES notation cannot capture complex structural templates like <strong>Markush structures</strong>, which are ubiquitous in patents. These structures define entire families of compounds using variable R-groups and abstract patterns, making them essential for intellectual property but impossible to represent with conventional methods.</li>
<li><strong>Data distribution mismatch</strong>: Real-world molecular images suffer from noise, inconsistent drawing styles, variable resolution, and interference from surrounding text. Models trained exclusively on clean, synthetically rendered molecules fail to generalize when applied to actual documents.</li>
</ol>
<h2 id="novelty-e-smiles-and-human-in-the-loop-data-engine">Novelty: E-SMILES and Human-in-the-Loop Data Engine</h2>
<p>The novelty lies in a comprehensive system that addresses both representation and data quality challenges through four integrated contributions:</p>
<ol>
<li>
<p><strong>Extended SMILES (E-SMILES)</strong>: A backward-compatible extension to the SMILES format that can represent complex structures previously inexpressible in standard chemical notations. E-SMILES uses a separator token <code>&lt;sep&gt;</code> to delineate the core molecular structure from supplementary annotations. These annotations employ XML-like tags to encode Markush structures, polymers, abstract rings, and other complex patterns. Critically, the core structure remains parseable by standard cheminformatics tools like RDKit, while the supplementary tags provide a structured, LLM-friendly format for capturing edge cases.</p>
</li>
<li>
<p><strong>MolParser-7M Dataset</strong>: The largest publicly available OCSR dataset, containing over 7 million image-text pairs. What distinguishes this dataset is both its scale and its composition. It includes 400,000 &ldquo;in-the-wild&rdquo; samples (molecular images extracted from actual patents and scientific papers) that were subsequently curated by human annotators. This real-world data addresses the distribution mismatch problem directly by exposing the model to the same noise, artifacts, and stylistic variations it encounters in production.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Data Engine</strong>: A systematic approach to collecting and annotating real-world training data. The pipeline begins with an object detection model that extracts molecular images from over a million PDF documents. An active learning algorithm then identifies the most informative samples (those where the current model struggles) for human annotation. The model pre-annotates these images, and human experts review and correct them. This creates an iterative improvement cycle: annotate, train, identify new challenging cases, repeat.</p>
</li>
<li>
<p><strong>Efficient End-to-End Architecture</strong>: The model treats OCSR as an image captioning problem. A Swin-Transformer vision encoder extracts visual features, a simple MLP compresses them, and a BART decoder generates the E-SMILES string autoregressively. The model minimizes the standard negative log-likelihood of the target E-SMILES token sequence $y$ given the sequence history and input image $x$:</p>
</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x; \theta)
\end{aligned}
$$</p>
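<p>This is ordinary teacher-forced cross-entropy; given the per-step probabilities assigned to the target tokens, the loss reduces to a negative log-sum:</p>

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a target token sequence, where
    token_probs[t] = P(y_t | y_<t, x; theta) as produced by the
    decoder's softmax at step t."""
    return -sum(math.log(p) for p in token_probs)
```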
<p>The training strategy employs curriculum learning, starting with simple molecules and gradually introducing complexity and heavier data augmentation.</p>
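<p>The <code>&lt;sep&gt;</code> convention described above also makes E-SMILES easy to post-process. The sketch below splits the core SMILES from the annotations; the <code>&lt;rgroup&gt;</code> tag and its payload are hypothetical examples, since the full E-SMILES tag vocabulary is defined by the authors and not reproduced here:</p>

```python
import re

def split_esmiles(esmiles, sep="<sep>"):
    """Split an E-SMILES string into its RDKit-parseable core SMILES and
    a list of (tag, payload) pairs from the XML-like annotations that
    follow the <sep> token. Strings without <sep> yield no annotations."""
    core, _, extra = esmiles.partition(sep)
    annotations = re.findall(r"<(\w+)>(.*?)</\1>", extra)
    return core, annotations
```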
<h2 id="experimental-setup-two-stage-training-and-benchmarking">Experimental Setup: Two-Stage Training and Benchmarking</h2>
<p>The evaluation focused on demonstrating that MolParser generalizes to real-world documents:</p>
<ol>
<li>
<p><strong>Two-Stage Training Protocol</strong>: The model underwent a systematic training process:</p>
<ul>
<li><strong>Pre-training</strong>: Initial training on millions of synthetic molecular images using curriculum learning. The curriculum progresses from simple molecules to complex structures while gradually increasing data augmentation intensity (blur, noise, perspective transforms).</li>
<li><strong>Fine-tuning</strong>: Subsequent training on 400,000 curated real-world samples extracted from patents and papers. This fine-tuning phase is critical for adapting to the noise and stylistic variations of actual documents.</li>
</ul>
</li>
<li>
<p><strong>Benchmark Evaluation</strong>: The model was evaluated on multiple standard OCSR benchmarks to establish baseline performance on clean data. These benchmarks test recognition accuracy on well-formatted molecular diagrams.</p>
</li>
<li>
<p><strong>Real-World Document Analysis</strong>: The critical test involved applying MolParser to molecular structures extracted directly from scientific documents. This evaluation measures the gap between synthetic benchmark performance and real-world applicability (the core problem the paper addresses).</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Experiments isolating the contribution of each component:</p>
<ul>
<li>The impact of real-world training data versus synthetic-only training</li>
<li>The effectiveness of curriculum learning versus standard training</li>
<li>The value of the human-in-the-loop annotation pipeline versus random sampling</li>
<li>The necessity of E-SMILES extensions for capturing complex structures</li>
</ul>
</li>
</ol>
<h2 id="outcomes-and-empirical-findings">Outcomes and Empirical Findings</h2>
<ul>
<li>
<p><strong>Performance on Benchmarks</strong>: MolParser achieves competitive results on standard benchmarks and the best performance on real-world documents. On clean benchmarks like USPTO-10K, MolScribe (96.0%) slightly edges MolParser-Base (94.5%), but on WildMol-10k, MolParser-Base achieved 76.9% accuracy, significantly outperforming MolScribe (66.4%) and MolGrapher (45.5%). This gap validates the core hypothesis that training on actual document images is essential for practical deployment.</p>
</li>
<li>
<p><strong>Real-World Data is Critical</strong>: Models trained exclusively on synthetic data show substantial performance degradation when applied to real documents. The 400,000 in-the-wild training samples bridge this gap, demonstrating that data quality and distribution matching matter as much as model architecture. Ablation experiments showed that pretraining on MolParser-7M synthetic data alone achieved 51.9% accuracy on WildMol, while adding real-world fine-tuning raised this to 76.9%. Using the smaller MolGrapher-300k synthetic dataset without fine-tuning yielded only 22.4%.</p>
</li>
<li>
<p><strong>E-SMILES Enables Broader Coverage</strong>: The extended representation successfully captures molecular structures that were previously inexpressible, particularly Markush structures from patents. This expands the scope of what can be automatically extracted from chemical literature to include patent-style structural templates.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Scales Efficiently</strong>: The active learning pipeline reduces annotation time by approximately 90% while maintaining high quality. This approach makes it feasible to curate large-scale, high-quality datasets for specialized domains where expert knowledge is expensive.</p>
</li>
<li>
<p><strong>Speed and Accuracy</strong>: The end-to-end architecture achieves both high accuracy and fast inference, making it practical for large-scale document processing. MolParser-Base processes 40 images per second on RTX 4090D, while the Tiny variant achieves 131 FPS. The direct image-to-text approach avoids the error accumulation of multi-stage pipelines.</p>
</li>
<li>
<p><strong>Downstream Applications</strong>: The Swin Transformer encoder, once trained on MolParser-7M, serves as an effective molecular fingerprint for property prediction. Paired with a simple two-layer MLP on MoleculeNet benchmarks, MolParser-pretrained features achieved an average ROC-AUC of 73.7% across five tasks, compared to 68.9% for ImageNet-pretrained Swin-T features. The authors also demonstrate chemical reaction parsing by feeding MolDet detections and MolParser E-SMILES into GPT-4o.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge that molecular chirality is not yet fully exploited by the system. The E-SMILES format does not currently support dashed abstract rings, coordination bonds, special symbol Markush patterns, or replication of long structural segments. Additionally, scaling up the volume of real annotated training data could further improve performance.</p>
</li>
</ul>
<p>The work establishes that practical OCSR requires more than architectural innovations. It demands careful attention to data quality, representation design, and the distribution mismatch between synthetic training data and real-world applications. The combination of E-SMILES, the MolParser-7M dataset, and the human-in-the-loop data engine provides a template for building reliable vision systems in scientific domains where clean training data is scarce but expert knowledge is available.</p>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>7.7M image-SMILES pairs for OCSR pretraining and fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet</a></td>
          <td>Model</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>YOLO11-based molecule detector for PDF documents</td>
      </tr>
  </tbody>
</table>
<p>No official source code repository has been released. Model weights for MolParser itself are not publicly available as of the dataset release.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is split into a massive synthetic pre-training set and a curated fine-tuning set.</p>
<p><strong>Training Data Composition (MolParser-7M)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Name</th>
          <th>Size</th>
          <th>Composition / Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pre-training</strong></td>
          <td>MolParser-7M (Synthetic)</td>
          <td>~7.7M</td>
          <td><strong>Markush-3M</strong> (40%), <strong>ChEMBL-2M</strong> (27%), <strong>Polymer-1M</strong> (14%), PAH-600k (8%), BMS-360k (5%), MolGrapher-300K (4%), Pauling-100k (2%). Generated via RDKit/Indigo with randomized styles.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-SFT-400k</td>
          <td>400k</td>
          <td>Real images from patents/papers selected via active learning (confidence filtering 0.6-0.9) and manually annotated. 66% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-Gen-200k</td>
          <td>200k</td>
          <td>Subset of synthetic data kept to prevent catastrophic forgetting. 32% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Handwrite-5k</td>
          <td>5k</td>
          <td>Handwritten molecules from Img2Mol to support hand-drawn queries. 1% of fine-tuning mix.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Sources</strong>: 1.2M patents and scientific papers (PDF documents)</li>
<li><strong>Extraction</strong>: MolDet (YOLO11-based detector) identified ~20M molecular images, deduplicated to ~4M candidates</li>
<li><strong>Selection</strong>: Active learning ensemble (5-fold models) identified high-uncertainty samples for annotation</li>
<li><strong>Annotation</strong>: Human experts corrected model pre-annotations (90% time savings vs. from-scratch annotation)</li>
</ul>
<p><strong>Test Benchmarks</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-10k</td>
          <td>10,000</td>
          <td>Standard synthetic benchmark</td>
      </tr>
      <tr>
          <td>Maybridge UoB</td>
          <td>-</td>
          <td>Synthetic molecules</td>
      </tr>
      <tr>
          <td>CLEF-2012</td>
          <td>-</td>
          <td>Patent images</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>-</td>
          <td>Japanese patent office</td>
      </tr>
      <tr>
          <td>ColoredBG</td>
          <td>-</td>
          <td>Colored background molecules</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td>10,000</td>
          <td>Ordinary molecules cropped from real PDFs (new)</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k-M</strong></td>
          <td>10,000</td>
          <td>Markush structures (significantly harder, new)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Extended SMILES (E-SMILES) Encoding</strong>:</p>
<ul>
<li><strong>Format</strong>: <code>SMILES&lt;sep&gt;EXTENSION</code> where <code>&lt;sep&gt;</code> separates core structure from supplementary annotations</li>
<li><strong>Extensions use XML-like tags</strong>:
<ul>
<li><code>&lt;a&gt;index:group&lt;/a&gt;</code> for substituents/variable groups (Markush structures)</li>
<li><code>&lt;r&gt;</code> for groups connected at any ring position</li>
<li><code>&lt;c&gt;</code> for abstract rings</li>
<li><code>&lt;dum&gt;</code> for connection points</li>
</ul>
</li>
<li><strong>Backward compatible</strong>: Core SMILES parseable by RDKit; extensions provide structured format for edge cases</li>
</ul>
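<p>As a concrete illustration, a minimal parser for the <code>SMILES&lt;sep&gt;EXTENSION</code> layout described above might look as follows. The tag names (<code>&lt;a&gt;</code>, <code>&lt;c&gt;</code>, <code>&lt;dum&gt;</code>) come from the paper, but the exact payload syntax (<code>index:group</code>) and the helper name are illustrative assumptions; a real pipeline would additionally validate the core SMILES with RDKit.</p>

```python
import re

# Hypothetical E-SMILES string: core SMILES, then <sep>, then XML-like extension
# tags. Tag names follow the paper; the "index:group" payload format is an
# assumption for illustration.

def parse_esmiles(esmiles: str) -> dict:
    """Split an E-SMILES string into its core SMILES and extension annotations."""
    core, _, extension = esmiles.partition("<sep>")
    # Substituent definitions: <a>index:group</a>
    substituents = {}
    for idx, group in re.findall(r"<a>(\d+):([^<]+)</a>", extension):
        substituents[int(idx)] = group
    return {
        "core_smiles": core,                # parseable by RDKit on its own
        "substituents": substituents,       # Markush R-group definitions
        "has_abstract_ring": "<c>" in extension,
        "connection_points": extension.count("<dum>"),
    }

parsed = parse_esmiles("c1ccccc1[*:1]<sep><a>1:OMe</a><a>2:halogen</a>")
print(parsed["core_smiles"])    # the core stays valid SMILES on its own
print(parsed["substituents"])   # {1: 'OMe', 2: 'halogen'}
```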
<p><strong>Curriculum Learning Strategy</strong>:</p>
<ul>
<li><strong>Phase 1</strong>: No augmentation, simple molecules (&lt;60 tokens)</li>
<li><strong>Phase 2</strong>: Gradually increase augmentation intensity and sequence length</li>
<li>Progressive complexity allows stable training on diverse molecular structures</li>
</ul>
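<p>A minimal sketch of such a two-phase schedule, assuming a linear ramp after the warm-up phase (the paper does not specify the exact ramp shape or phase boundaries, so those are placeholders):</p>

```python
def curriculum_batch_filter(samples, step, total_steps,
                            simple_token_limit=60, warmup_frac=0.25):
    """Phase 1: only short, unaugmented sequences. Later: admit longer molecules
    and raise augmentation strength (the linear ramp is an assumption)."""
    progress = step / total_steps
    if progress < warmup_frac:
        max_tokens, aug_strength = simple_token_limit, 0.0
    else:
        # Ramp both the sequence-length cap and augmentation intensity to full.
        ramp = (progress - warmup_frac) / (1.0 - warmup_frac)
        max_tokens = int(simple_token_limit + ramp * (512 - simple_token_limit))
        aug_strength = ramp
    selected = [s for s in samples if s["n_tokens"] <= max_tokens]
    return selected, aug_strength

samples = [{"id": 0, "n_tokens": 40}, {"id": 1, "n_tokens": 200}]
early, strength = curriculum_batch_filter(samples, step=10, total_steps=100)
print([s["id"] for s in early], strength)   # only the short molecule, no augmentation
```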
<p><strong>Active Learning Data Selection</strong>:</p>
<ol>
<li>Train 5 model folds on current dataset</li>
<li>Compute pairwise Tanimoto similarity of predictions on candidate images</li>
<li>Select samples with confidence scores <strong>0.6-0.9</strong> for human review (highest learning value)</li>
<li>Human experts correct model pre-annotations</li>
<li>Iteratively expand training set with hard samples</li>
</ol>
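<p>The selection step above can be sketched as follows. Confidence is approximated as the mean pairwise Tanimoto similarity across fold predictions (three folds here for brevity; the paper uses five), with fingerprints reduced to plain Python sets for illustration:</p>

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two substructure-fingerprint bit sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def ensemble_confidence(fold_fingerprints):
    """Mean pairwise Tanimoto over fold predictions, used as a confidence proxy."""
    pairs = list(combinations(fold_fingerprints, 2))
    return sum(tanimoto(x, y) for x, y in pairs) / len(pairs)

def select_for_annotation(candidates, lo=0.6, hi=0.9):
    """Keep images in the moderate-confidence band (0.6-0.9): hard but learnable."""
    return [cid for cid, folds in candidates.items()
            if lo <= ensemble_confidence(folds) <= hi]

candidates = {
    "img_a": [{1, 2, 3}] * 3,                              # folds agree -> conf 1.0
    "img_b": [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}],   # partial agreement -> 0.6
    "img_c": [{1}, {2}, {3}],                              # folds disagree -> 0.0
}
print(select_for_annotation(candidates))  # only the moderate-confidence image
```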
<p><strong>Data Augmentations</strong>:</p>
<ul>
<li>RandomAffine (rotation, scale, translation)</li>
<li>JPEGCompress (compression artifacts)</li>
<li>InverseColor (color inversion)</li>
<li>SurroundingCharacters (text interference)</li>
<li>RandomCircle (circular artifacts)</li>
<li>ColorJitter (brightness, contrast variations)</li>
<li>Downscale (resolution reduction)</li>
<li>Bounds (boundary cropping variations)</li>
</ul>
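<p>A pipeline composing these augmentations might be structured as below. The transform names mirror the list above, but the bodies are stand-ins that merely record what was applied; real implementations would operate on image arrays, with intensity scaled by the curriculum's strength knob:</p>

```python
import random

# Stand-in transforms: each appends a record of what it would have done.
def random_affine(img, s):    return img + [("affine", s)]
def jpeg_compress(img, s):    return img + [("jpeg", s)]
def inverse_color(img, s):    return img + [("invert", s)]
def color_jitter(img, s):     return img + [("jitter", s)]

PIPELINE = [random_affine, jpeg_compress, inverse_color, color_jitter]

def augment(img, strength, p=0.5, rng=None):
    """Apply each transform independently with probability p; `strength` is the
    curriculum knob scaling intensity (a simplification of the paper's schedule)."""
    rng = rng or random.Random(0)
    out = list(img)
    for t in PIPELINE:
        if rng.random() < p:
            out = t(out, strength)
    return out

aug = augment([], strength=0.3, rng=random.Random(42))
print(aug)  # a random subset of the four transforms, each at strength 0.3
```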
<h3 id="models">Models</h3>
<p>The architecture follows a standard <strong>Image Captioning</strong> (Encoder-Decoder) paradigm.</p>
<p><strong>Architecture Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Vision Encoder</strong></td>
          <td>Swin Transformer (ImageNet pretrained)</td>
      </tr>
      <tr>
          <td>- Tiny variant</td>
          <td>66M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Small variant</td>
          <td>108M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Base variant</td>
          <td>216M parameters, $384 \times 384$ input</td>
      </tr>
      <tr>
          <td><strong>Connector</strong></td>
          <td>2-layer MLP reducing channel dimension by half</td>
      </tr>
      <tr>
          <td><strong>Text Decoder</strong></td>
          <td>BART-Decoder (12 layers, 16 attention heads)</td>
      </tr>
  </tbody>
</table>
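<p>To make the shape flow concrete: a Swin encoder downsamples the input by 32x overall, and the connector halves the channel dimension before the visual tokens reach the BART decoder. The per-variant channel widths used below (768 for Tiny/Small, 1024 for Base) are standard Swin values, not stated in the paper:</p>

```python
def encoder_output_shape(img_size, final_dim):
    """Swin downsamples 32x overall: tokens = (H/32) * (W/32), width = final_dim.
    (final_dim per variant is a standard Swin value, assumed here.)"""
    side = img_size // 32
    return side * side, final_dim

def connector_shape(tokens, dim):
    """The 2-layer MLP connector halves the channel dimension (per the paper)."""
    return tokens, dim // 2

tokens, dim = encoder_output_shape(384, 1024)   # Swin-Base at 384x384 input
tokens, dim = connector_shape(tokens, dim)
print(tokens, dim)  # 144 visual tokens of width 512 fed to the BART decoder
```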
<p><strong>Training Configuration</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Pre-training</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Hardware</strong></td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
      </tr>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>AdamW</td>
          <td>AdamW</td>
      </tr>
      <tr>
          <td><strong>Learning Rate</strong></td>
          <td>$1 \times 10^{-4}$</td>
          <td>$5 \times 10^{-5}$</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong></td>
          <td>$1 \times 10^{-2}$</td>
          <td>$1 \times 10^{-2}$</td>
      </tr>
      <tr>
          <td><strong>Scheduler</strong></td>
          <td>Cosine with warmup</td>
          <td>Cosine with warmup</td>
      </tr>
      <tr>
          <td><strong>Epochs</strong></td>
          <td>20</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Label Smoothing</strong></td>
          <td>0.01</td>
          <td>0.005</td>
      </tr>
  </tbody>
</table>
<p><strong>Curriculum Learning Schedule</strong> (Pre-training):</p>
<ul>
<li>Starts with simple molecules (&lt;60 tokens, no augmentation)</li>
<li>Gradually adds complexity and augmentation (blur, noise, perspective transforms)</li>
<li>Enables stable learning across diverse molecular structures</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>: Exact match accuracy on predicted E-SMILES strings (molecule-level exact match)</p>
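<p>A minimal sketch of molecule-level exact-match scoring. In the actual evaluation the predicted and reference strings would be canonicalized (e.g. via RDKit for the core SMILES); the whitespace-stripping normalizer here is a stand-in:</p>

```python
def exact_match_accuracy(preds, refs, canonicalize=None):
    """Molecule-level exact match: 1 if the (canonicalized) predicted E-SMILES
    equals the reference, else 0, averaged over the set. `canonicalize`
    defaults to whitespace stripping as a stand-in for true canonicalization."""
    canon = canonicalize or (lambda s: s.strip())
    hits = sum(canon(p) == canon(r) for p, r in zip(preds, refs))
    return hits / len(refs)

preds = ["CCO", "c1ccccc1 ", "CC(=O)O"]
refs  = ["CCO", "c1ccccc1",  "CC(C)O"]
print(f"{exact_match_accuracy(preds, refs):.3f}")  # 2 of 3 match -> 0.667
```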
<p><strong>Key Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MolParser-Base</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td><strong>76.9%</strong></td>
          <td>66.4%</td>
          <td>45.5%</td>
          <td>Real-world patent/paper crops</td>
      </tr>
      <tr>
          <td><strong>USPTO-10k</strong></td>
          <td>94.5%</td>
          <td><strong>96.0%</strong></td>
          <td>93.3%</td>
          <td>Synthetic benchmark</td>
      </tr>
      <tr>
          <td><strong>Throughput (FPS)</strong></td>
          <td><strong>39.8</strong></td>
          <td>16.5</td>
          <td>2.2</td>
          <td>Measured on RTX 4090D</td>
      </tr>
  </tbody>
</table>
<p><strong>Additional Performance</strong>:</p>
<ul>
<li>MolParser-Tiny: 131 FPS on RTX 4090D (66M params)</li>
<li>Real-world vs. synthetic gap: Fine-tuning on MolParser-SFT-400k closed the performance gap between clean benchmarks and in-the-wild documents</li>
</ul>
<p><strong>Ablation Findings</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Real-world training data</td>
          <td>Fine-tuning on real data raised accuracy from 51.9% to 76.9% on WildMol-10k</td>
      </tr>
      <tr>
          <td>Curriculum learning</td>
          <td>Augmentation alone raised WildMol-10k from 40.1% to 69.5%; adding curriculum learning further raised it to 76.9%</td>
      </tr>
      <tr>
          <td>Active learning selection</td>
          <td>More effective than random sampling for annotation budget</td>
      </tr>
      <tr>
          <td>E-SMILES extensions</td>
          <td>Essential for Markush structure recognition (impossible with standard SMILES)</td>
      </tr>
      <tr>
          <td>Dataset scale</td>
          <td>Larger pre-training dataset (7M vs 300k) improved WildMol-10k accuracy from 22.4% to 51.9% before fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 8x NVIDIA RTX 4090D GPUs</li>
<li><strong>Inference</strong>: Single RTX 4090D sufficient for real-time processing</li>
<li><strong>Training time</strong>: 20 epochs pre-training + 4 epochs fine-tuning (specific duration not reported)</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2025molparser,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Xi and Wang, Jiankun and Cai, Xiaochen and Chen, Shangqian and Yang, Shuwen and Tao, Haoyi and Wang, Nan and Yao, Lin and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24528--24538}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2411.11098}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2411.11098}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser-7M &amp; WildMol: Large-Scale OCSR Datasets</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</guid><description>MolParser-7M is the largest open-source OCSR dataset with 7.7M image-SMILES pairs including 400k real-world annotated samples.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/molparser-markush-example.webp"
         alt="Example of a complex Markush structure"
         title="Example of a complex Markush structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example of a complex Markush structure that can be represented by the E-SMILES format but not by standard SMILES or FG-SMILES.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/molparser-low-quality-example.webp"
         alt="Sample from the WildMol benchmark"
         title="Sample from the WildMol benchmark"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A sample from the WildMol benchmark, showing a low-quality, noisy molecular image cropped from real-world literature that challenges OCSR systems.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/molparser-colored-example.webp"
         alt="Colored molecule with annotations"
         title="Colored molecule with annotations"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser-7M (Training Set)</strong></td>
          <td>7,740,871</td>
          <td>A large-scale dataset for training OCSR models, split into pre-training and fine-tuning stages.</td>
      </tr>
      <tr>
          <td><strong>WildMol (Test Set)</strong></td>
          <td>20,000</td>
          <td>A benchmark of 20,000 human-annotated samples cropped from real PDF files to evaluate OCSR models in &lsquo;in-the-wild&rsquo; scenarios. Comprises WildMol-10k (10k ordinary molecules) and WildMol-10k-M (10k Markush structures).</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="wildmol-10k-accuracy">WildMol-10K Accuracy<a hidden class="anchor" aria-hidden="true" href="#wildmol-10k-accuracy">#</a></h3>
    <p class="benchmark-description">Evaluation of OCSR models on 10,000 real-world molecular images cropped from scientific literature and patents</p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Accuracy (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>MolParser-Base</strong><br><small>End-to-end visual recognition trained on MolParser-7M</small>
          </td>
          <td>76.9</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>MolScribe</strong><br><small>Transformer-based OCSR system</small>
          </td>
          <td>66.4</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>DECIMER 2.7</strong><br><small>Deep learning for chemical image recognition</small>
          </td>
          <td>56.0</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>MolGrapher</strong><br><small>Graph-based molecular structure recognition</small>
          </td>
          <td>45.5</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>MolVec 0.9.7</strong><br><small>Vector-based structure recognition</small>
          </td>
          <td>26.4</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>OSRA 2.1</strong><br><small>Optical Structure Recognition Application</small>
          </td>
          <td>26.3</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Img2Mol</strong><br><small>Image-to-molecule translation</small>
          </td>
          <td>24.4</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Imago 2.0</strong><br><small>Chemical structure recognition toolkit</small>
          </td>
          <td>6.9</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="key-contribution">Key Contribution</h2>
<p>Introduces MolParser-7M, the largest open-source Optical Chemical Structure Recognition (OCSR) dataset, uniquely combining diverse synthetic data with a large volume of manually annotated, &ldquo;in-the-wild&rdquo; images from real scientific documents to improve model robustness. Also introduces WildMol, a new challenging benchmark for evaluating OCSR performance on real-world data, including Markush structures.</p>
<h2 id="overview">Overview</h2>
<p>The MolParser project addresses the challenge of recognizing molecular structures from images found in real-world scientific documents. Unlike existing OCSR datasets that rely primarily on synthetically generated images, MolParser-7M incorporates 400,000 manually annotated images cropped from actual patents and scientific papers, making it the first large-scale dataset to bridge the gap between synthetic training data and real-world deployment scenarios.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Largest open-source OCSR dataset with over 7.7 million pairs</li>
<li>The only large-scale OCSR training set that includes a significant amount (400k) of &ldquo;in-the-wild&rdquo; data cropped from real patents and literature</li>
<li>High diversity of molecular structures from numerous sources (PubChem, ChEMBL, polymers, etc.)</li>
<li>Introduces the WildMol benchmark for evaluating performance on challenging, real-world data, including Markush structures</li>
<li>The &ldquo;in-the-wild&rdquo; fine-tuning data (MolParser-SFT-400k) was curated via an efficient active learning data engine with human-in-the-loop validation</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The E-SMILES format cannot represent certain complex cases, such as coordination bonds, dashed abstract rings, Markush structures depicted with special patterns, and replication of long structural segments on the skeleton</li>
<li>The model and data do not yet fully exploit molecular chirality, which is critical for chemical properties</li>
<li>Performance could be further improved by scaling up the amount of real annotated training data</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="synthetic-data-generation">Synthetic Data Generation</h3>
<p>To ensure diversity, molecular structures were collected from databases like ChEMBL, PubChem, and Kaggle BMS. A significant number of Markush, polymer, and fused-ring structures were also randomly generated. Images were rendered using RDKit and epam.indigo with randomized parameters (e.g., bond width, font size, rotation) to increase visual diversity. The pretraining dataset is composed of the following subsets:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Markush-3M</td>
          <td>40%</td>
          <td>Random groups replacement from PubChem</td>
      </tr>
      <tr>
          <td>ChEMBL-2M</td>
          <td>27%</td>
          <td>Molecules selected from ChEMBL</td>
      </tr>
      <tr>
          <td>Polymer-1M</td>
          <td>14%</td>
          <td>Randomly generated polymer molecules</td>
      </tr>
      <tr>
          <td>PAH-600k</td>
          <td>8%</td>
          <td>Randomly generated fused-ring molecules</td>
      </tr>
      <tr>
          <td>BMS-360k</td>
          <td>5%</td>
          <td>Molecules with long carbon chains from BMS</td>
      </tr>
      <tr>
          <td>MolGrapher-300K</td>
          <td>4%</td>
          <td>Training data from MolGrapher</td>
      </tr>
      <tr>
          <td>Pauling-100k</td>
          <td>2%</td>
          <td>Pauling-style images drawn using epam.indigo</td>
      </tr>
  </tbody>
</table>
<h3 id="in-the-wild-data-engine-molparser-sft-400k">In-the-Wild Data Engine (MolParser-SFT-400k)</h3>
<p>A YOLO11 object detection model (MolDet) located and cropped over 20 million molecule images from 1.22 million real PDFs (patents and papers). After de-duplication via p-hash similarity, 4 million unique images remained.</p>
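<p>The de-duplication step can be sketched with Hamming distance over 64-bit perceptual hashes. The distance threshold and the greedy single-pass strategy below are assumptions; the paper states only that p-hash similarity was used:</p>

```python
def hamming(h1: int, h2: int) -> int:
    """Bit-level Hamming distance between two perceptual hashes."""
    return bin(h1 ^ h2).count("1")

def dedupe_by_phash(hashes, max_dist=4):
    """Greedy de-duplication: keep an image only if its p-hash is farther than
    `max_dist` bits from every hash already kept. O(n^2), fine for a sketch;
    at the paper's 20M-image scale a BK-tree or LSH index would be needed."""
    kept = []
    for h in hashes:
        if all(hamming(h, k) > max_dist for k in kept):
            kept.append(h)
    return kept

crops = [0b1111000011110000, 0b1111000011110001,  # near-duplicates (1 bit apart)
         0b0000111100001111]                       # a visually distinct image
print(len(dedupe_by_phash(crops)))  # 2 unique images survive
```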
<p>An active learning algorithm was used to select the most informative samples for annotation, targeting images where an ensemble of 5-fold models showed moderate confidence (0.6-0.9 Tanimoto similarity), indicating they were challenging but learnable.</p>
<p>This active learning approach with model pre-annotations reduced manual annotation time per molecule to 30 seconds, approximately 90% savings compared to annotating from scratch. In the final fine-tuning dataset, 56.04% of annotations directly utilized raw model pre-annotations, 20.97% passed review after a single manual correction, 13.87% were accepted after a second round of annotation, and 9.13% required three or more rounds.</p>
<p>The fine-tuning dataset is composed of:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-SFT-400k</td>
          <td>66%</td>
          <td>Manually annotated data obtained via data engine</td>
      </tr>
      <tr>
          <td>MolParser-Gen-200k</td>
          <td>32%</td>
          <td>Synthetic data selected from pretraining stage</td>
      </tr>
      <tr>
          <td>Handwrite-5k</td>
          <td>1%</td>
          <td>Handwritten molecules selected from Img2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="e-smiles-specification">E-SMILES Specification</h3>
<p>To accommodate complex patent structures that standard SMILES cannot support, the authors introduced an Extended SMILES format (<code>SMILES&lt;sep&gt;EXTENSION</code>). The <code>EXTENSION</code> component uses XML-like tokens to manage complexities:</p>
<ul>
<li><code>&lt;a&gt;...&lt;/a&gt;</code> encapsulates Markush R-groups and abbreviation groups.</li>
<li><code>&lt;r&gt;...&lt;/r&gt;</code> denotes ring attachments with uncertainty positions.</li>
<li><code>&lt;c&gt;...&lt;/c&gt;</code> defines abstract rings.</li>
<li><code>&lt;dum&gt;</code> identifies a connection point.</li>
</ul>
<p>This format enables Markush-molecule matching and LLM integration, while retaining RDKit compatibility for the standard SMILES portion.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>Training and test data on HuggingFace. SFT subset is partially released.</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet (YOLO11)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Molecule detection model on HuggingFace</td>
      </tr>
      <tr>
          <td><a href="https://ocsr.dp.tech/">MolParser Demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online OCSR demo using MolParser-Base</td>
      </tr>
  </tbody>
</table>
<p>The dataset is publicly available on HuggingFace under a CC-BY-NC-SA-4.0 (non-commercial) license. The MolParser-SFT-400k subset is only partially released. The YOLO11-based MolDet detection model is also available on HuggingFace. No public code repository is provided for the MolParser recognition model itself. All experiments were conducted on 8 NVIDIA RTX 4090D GPUs, and throughput benchmarks were measured on a single RTX 4090D GPU.</p>
]]></content:encoded></item><item><title>SubGrapher: Visual Fingerprinting of Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</link><pubDate>Mon, 28 Apr 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</guid><description>SubGrapher creates molecular fingerprints directly from chemical structure images through functional group segmentation for database retrieval.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-taxonomy">Paper Classification and Taxonomy</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution. Using the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> framework:</p>
<p><strong>Primary Classification: Method</strong></p>
<p>The dominant basis vector is Methodological because SubGrapher introduces an architecture that replaces the two-step OCSR workflow (image → structure → fingerprint) with single-step fingerprinting (image → visual fingerprint). The paper validates this approach through systematic comparison against state-of-the-art methods (MolGrapher, OSRA, DECIMER, MolScribe), demonstrating superior performance on specific tasks like retrieval and robustness to image quality degradation.</p>
<p><strong>Secondary Classification: Resource</strong></p>
<p>The paper makes non-negligible resource contributions by releasing:</p>
<ul>
<li>Code and model weights on <a href="https://github.com/DS4SD/SubGrapher">GitHub</a> and <a href="https://huggingface.co/docling-project/SubGrapher">HuggingFace</a></li>
<li>Five new visual fingerprinting benchmark datasets for molecule retrieval tasks</li>
<li>Comprehensive functional group knowledge base (1,534 substructures)</li>
</ul>
<h2 id="motivation-extracting-complex-structures-from-noisy-images">Motivation: Extracting Complex Structures from Noisy Images</h2>
<p>The motivation tackles a fundamental challenge in chemical informatics: extracting molecular information from the vast amounts of unstructured scientific literature, particularly patents. Millions of molecular structures exist only as images in these documents, making them inaccessible for computational analysis, database searches, or machine learning applications.</p>
<p>Traditional Optical Chemical Structure Recognition (OCSR) tools attempt to fully reconstruct molecular graphs from images, converting them into machine-readable formats like SMILES. However, these approaches face two critical limitations:</p>
<ol>
<li><strong>Brittleness to image quality</strong>: Poor resolution, noise, or unconventional drawing styles frequently degrade recognition accuracy</li>
<li><strong>Limited handling of complex structures</strong>: Markush structures, generic molecular templates with variable R-groups commonly used in patents, are poorly supported by most conventional OCSR methods</li>
</ol>
<p>The key insight driving SubGrapher is that full molecular reconstruction may be unnecessary for many applications. For tasks like database searching, similarity analysis, or document retrieval, a molecular fingerprint (a vectorized representation capturing structural features) is often sufficient. This realization opens up a new approach: bypass the fragile reconstruction step and create fingerprints directly from visual information.</p>
<h2 id="key-innovation-direct-visual-fingerprinting">Key Innovation: Direct Visual Fingerprinting</h2>
<p>SubGrapher takes a different approach to extracting chemical information from images. It creates &ldquo;visual fingerprints&rdquo; through functional group recognition. The key innovations are:</p>
<ol>
<li>
<p><strong>Direct Image-to-Fingerprint Pipeline</strong>: SubGrapher eliminates the traditional two-step process (image → structure → fingerprint) by generating fingerprints directly from pixels. This single-stage approach avoids error accumulation from failed structure reconstructions and can handle images where conventional OCSR tools produce invalid outputs.</p>
</li>
<li>
<p><strong>Dual Instance Segmentation Architecture</strong>: The system employs two specialized Mask-RCNN networks working in parallel:</p>
<ul>
<li><strong>Functional group detector</strong>: Trained to identify 1,534 expert-defined functional groups using pixel-level segmentation masks</li>
<li><strong>Carbon backbone detector</strong>: Recognizes 27 common carbon chain patterns to capture the molecular scaffold</li>
</ul>
<p>Using instance segmentation provides detailed spatial information and higher accuracy through richer supervision during training.</p>
</li>
<li>
<p><strong>Extensive Functional Group Knowledge Base</strong>: The method uses one of the most comprehensive open-source collections of functional groups, encompassing 1,534 substructures. These were systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Additional halogen substituents and organometallic groups relevant to EUV photoresists</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li>
<p><strong>Substructure-Graph Construction</strong>: After detecting functional groups and carbon backbones, SubGrapher builds a connectivity graph where:</p>
<ul>
<li>Each node represents an identified substructure</li>
<li>Edges connect substructures whose bounding boxes overlap (with 10% margin expansion)</li>
<li>This graph captures both the chemical components and their spatial relationships</li>
</ul>
</li>
<li>
<p><strong>Substructure-based Visual Molecular Fingerprint (SVMF)</strong>: The final output is a continuous, count-based fingerprint formally defined as a matrix $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$ (1,534 functional groups + 27 carbon backbones). The matrix is stored as a compressed upper triangular form:</p>
<p><strong>Diagonal elements</strong> ($i = j$): Weighted count of substructure $i$ plus self-intersection
$$SVMF_{ii}(m) = h_1 \cdot n_i + g_{ii}$$
where $h_1 = 10$ is the diagonal weight hyperparameter, $n_i$ is the instance count, and $g_{ii}$ is the self-intersection coefficient.</p>
<p><strong>Off-diagonal elements</strong> ($i \neq j$): Intersection coefficient based on shortest path distance $d$ in the substructure graph
$$SVMF_{ij}(m) = h_2(d) \cdot \text{intersection}(s_i, s_j)$$
where the distance decay function $h_2(d)$ is:</p>
<ul>
<li>$d \leq 1$: weight = 2</li>
<li>$d = 2$: weight = 2/4 = 0.5</li>
<li>$d = 3$: weight = 2/16 = 0.125</li>
<li>$d = 4$: weight = $2/256 \approx 0.0078$</li>
<li>$d &gt; 4$: weight = 0</li>
</ul>
<p><strong>Key properties</strong>:</p>
<ul>
<li>Carbon chain intersection coefficients are divided by 2, giving functional groups higher effective weight</li>
<li>Similarity between fingerprints is calculated using a normalized Euclidean distance (the L2 norm of the difference divided by the L2 norm of the sum)</li>
<li>Resulting fingerprints are highly sparse (average 0.001% non-zero elements)</li>
<li>Compressed storage enables efficient database searches</li>
</ul>
</li>
<li>
<p><strong>Markush Structure Compatibility</strong>: SubGrapher processes Markush structures by recognizing their constituent functional groups and creating meaningful fingerprints for similarity searches, achieving higher accuracy than existing OCSR methods on the USPTO-Markush benchmark (S-F1: 88).</p>
</li>
</ol>
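<p>The fingerprint construction above can be sketched compactly. This is a hypothetical toy implementation, not the authors&rsquo; code: the vocabulary size, instance counts, and intersection coefficients are illustrative stand-ins, while <code>H1</code> and the <code>h2(d)</code> decay follow the values stated in the paper.</p>

```python
import numpy as np

# Toy sketch of the SVMF construction. The real vocabulary has n = 1561
# substructure classes (1,534 functional groups + 27 carbon backbones);
# we use a tiny stand-in vocabulary here.
N_TYPES = 5
H1 = 10  # diagonal weight hyperparameter from the paper


def h2(d):
    """Distance-decay weight for off-diagonal entries (paper's values)."""
    table = {0: 2.0, 1: 2.0, 2: 2.0 / 4, 3: 2.0 / 16, 4: 2.0 / 256}
    return table.get(d, 0.0)


def svmf(counts, pair_info):
    """Build a dense SVMF matrix from detections.

    counts:    {type_index: instance_count}
    pair_info: {(i, j): (graph_distance, intersection_coeff)} with i <= j
    """
    m = np.zeros((N_TYPES, N_TYPES))
    for i, n_i in counts.items():
        g_ii = pair_info.get((i, i), (0, 0.0))[1]  # self-intersection term
        m[i, i] = H1 * n_i + g_ii
    for (i, j), (d, inter) in pair_info.items():
        if i != j:
            m[i, j] = m[j, i] = h2(d) * inter
    return m


# Two instances of type 0, one of type 3, adjacent in the graph (d = 1):
fp = svmf({0: 2, 3: 1}, {(0, 3): (1, 1.0), (0, 0): (0, 0.5)})
```

<p>In practice the matrix would then be stored in its compressed upper-triangular form before any database comparison.</p>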
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating SubGrapher&rsquo;s effectiveness across two critical tasks: accurate substructure detection and robust molecule retrieval from diverse image collections.</p>
<h4 id="substructure-detection-performance">Substructure Detection Performance</h4>
<p>SubGrapher&rsquo;s ability to identify functional groups was tested on three challenging benchmarks that expose different failure modes of OCSR systems:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Description</th>
          <th>Key Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>JPO</strong></td>
          <td>341 images</td>
          <td>Japanese Patent Office images (molecules with abbreviations removed)</td>
          <td>Low quality, noise, artifacts, non-standard drawing styles</td>
      </tr>
      <tr>
          <td><strong>USPTO-10K-L</strong></td>
          <td>1,000 images</td>
          <td>Large molecules (&gt;70 atoms)</td>
          <td>Scale variation, structural complexity, many functional groups</td>
      </tr>
      <tr>
          <td><strong>USPTO-Markush</strong></td>
          <td>74 images</td>
          <td>Generic Markush structures</td>
          <td>Variable R-groups, abstract patterns, template representation</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings:</strong></p>
<ol>
<li>
<p><strong>JPO Dataset (Low-Quality Patent Images)</strong>: SubGrapher achieved the highest Molecule Exact Match rate (83%), demonstrating robustness to image quality degradation where rule-based methods like OSRA scored lower (67% M-EM).</p>
</li>
<li>
<p><strong>USPTO-10K-L (Large Molecules)</strong>: SubGrapher achieved an S-F1 of 97, matching the rule-based OSRA and outperforming all other learning-based methods (MolScribe: 90, DECIMER: 86, MolGrapher: 56). The object detection approach handled scale variation better than other deep-learning OCSR tools on these challenging targets.</p>
</li>
<li>
<p><strong>USPTO-Markush (Generic Structures)</strong>: SubGrapher achieved the highest Substructure F1-score (88) on this benchmark, outperforming MolScribe (86), OSRA (74), and DECIMER (10). While other OCSR tools can attempt these images, they have limited support for Markush features. SubGrapher&rsquo;s instance segmentation approach handles complex Markush structures more effectively by focusing on relevant image regions.</p>
</li>
</ol>
<p>Qualitative analysis revealed that SubGrapher correctly identified functional groups in scenarios where other methods failed completely: images with captions, unconventional drawing styles, or significant quality degradation.</p>
<h4 id="visual-fingerprinting-for-molecule-retrieval">Visual Fingerprinting for Molecule Retrieval</h4>
<p>The core application was evaluated using a retrieval task designed to simulate real-world database searching:</p>
<ol>
<li>
<p><strong>Benchmark Creation</strong>: Five benchmark datasets were constructed around structurally similar molecules (adenosine, camphor, cholesterol, limonene, and pyridine), each containing 500 molecules sampled from PubChem with at least 90% Tanimoto similarity to the reference molecule, rendered as augmented images.</p>
</li>
<li>
<p><strong>Retrieval Task</strong>: Given a SMILES string as a query, the goal was to find the corresponding molecular image within the dataset of 500 visually similar structures. This tests whether the visual fingerprint can distinguish between closely related molecules.</p>
</li>
<li>
<p><strong>Performance Comparison</strong>: SubGrapher significantly outperformed baseline methods, retrieving the correct molecule at an average rank of 95 out of 500. The key advantage was robustness: SubGrapher generates a unique fingerprint for every image, even with partial or uncertain predictions. In contrast, OCSR-based methods frequently fail to produce valid SMILES, making them unable to generate fingerprints for comparison.</p>
</li>
<li>
<p><strong>Real-World Case Study</strong>: A practical demonstration involved searching a 54-page patent document containing 356 chemical images for a specific Markush structure. SubGrapher successfully located the target structure, highlighting its utility for large-scale document mining.</p>
</li>
</ol>
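<p>The retrieval protocol can be sketched as follows. This is a hedged illustration with random stand-in fingerprints (not real SVMFs), using the paper&rsquo;s normalized Euclidean distance as the ranking metric:</p>

```python
import numpy as np

# Sketch of the retrieval task: rank 500 candidate fingerprints by the
# normalized Euclidean distance ||a - b|| / ||a + b|| and report the
# rank of the ground-truth molecule. Fingerprints are random stand-ins.
rng = np.random.default_rng(0)


def ned(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(a + b)


candidates = rng.random((500, 64))
truth_idx = 123
query = candidates[truth_idx] + 0.01 * rng.random(64)  # near-duplicate query

dists = [ned(query, c) for c in candidates]
rank = 1 + sorted(range(500), key=lambda i: dists[i]).index(truth_idx)
```

<p>Averaging this rank over 50 queries per benchmark yields the reported average retrieval rank.</p>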
<h4 id="training-data-generation">Training Data Generation</h4>
<p>Since no public datasets existed with the required pixel-level mask annotations for functional groups, the researchers developed a comprehensive synthetic data generation pipeline:</p>
<ol>
<li>
<p><strong>Extended MolDepictor</strong>: They enhanced existing molecular rendering tools to create images from SMILES strings and generate corresponding segmentation masks for all substructures present in each molecule.</p>
</li>
<li>
<p><strong>Markush Structure Rendering</strong>: The pipeline was extended to handle complex generic structures using CXSMILES representations and the CDK library for rendering, creating training data for molecular templates with structural, positional, and frequency variation indicators.</p>
</li>
<li>
<p><strong>Diverse Molecular Sources</strong>: Training molecules were sourced from PubChem to ensure broad chemical diversity and coverage of different structural families.</p>
</li>
</ol>
<h2 id="results-impact-and-limitations">Results, Impact, and Limitations</h2>
<ul>
<li><strong>Superior Robustness to Image Quality</strong>: SubGrapher consistently outperformed traditional OCSR methods on degraded images, particularly the JPO patent dataset. SubGrapher&rsquo;s learned representations proved more resilient to noise, artifacts, and unconventional drawing styles than rule-based alternatives like OSRA (M-EM: 83 vs. 67 on JPO).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SubGrapher</th>
          <th>MolScribe</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>MolGrapher</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>S-F1</strong> (JPO)</td>
          <td>92</td>
          <td><strong>94</strong></td>
          <td>81</td>
          <td>86</td>
          <td>89</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (JPO)</td>
          <td><strong>83</strong></td>
          <td>82</td>
          <td>67</td>
          <td>79</td>
          <td>80</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-10K-L)</td>
          <td><strong>97</strong></td>
          <td>90</td>
          <td><strong>97</strong></td>
          <td>86</td>
          <td>56</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-10K-L)</td>
          <td>55</td>
          <td>55</td>
          <td><strong>75</strong></td>
          <td>66</td>
          <td>31</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-Markush)</td>
          <td><strong>88</strong></td>
          <td>86</td>
          <td>74</td>
          <td>10</td>
          <td>35</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-Markush)</td>
          <td>82</td>
          <td><strong>86</strong></td>
          <td>70</td>
          <td>11</td>
          <td>30</td>
      </tr>
      <tr>
          <td><strong>Avg Retrieval Rank</strong></td>
          <td><strong>95/500</strong></td>
          <td>181-241/500</td>
          <td>138-185/500</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<p>Note: Retrieval rank ranges reflect the best and worst fingerprint-method pairings (RDKit Daylight or MHFP) for each OCSR model.</p>
<ul>
<li>
<p><strong>Effective Handling of Scale and Complexity</strong>: The instance segmentation approach successfully managed large molecules and complex structures where traditional graph-reconstruction methods struggled. The Substructure F1-scores on USPTO-10K-L and USPTO-Markush benchmarks demonstrated clear advantages for challenging molecular targets.</p>
</li>
<li>
<p><strong>Markush Structure Processing</strong>: SubGrapher achieves the highest Substructure F1-score on Markush structures (88 vs. MolScribe&rsquo;s 86 and OSRA&rsquo;s 74). While other OCSR methods can attempt Markush images, they support only limited features such as abbreviation-based variable groups. SubGrapher handles complex Markush features more effectively, expanding the scope of automatically extractable chemical information from patent literature.</p>
</li>
<li>
<p><strong>Robust Molecule Retrieval Performance</strong>: The visual fingerprinting approach achieved reliable retrieval performance (average rank 95/500) across diverse molecular families. The key advantage was consistency: SubGrapher generates meaningful fingerprints even from partial or uncertain predictions, while OCSR-based methods often fail to produce any usable output.</p>
</li>
<li>
<p><strong>Practical Document Mining Capability</strong>: The successful identification of specific Markush structures within large patent documents (54 pages, 356 images) demonstrates real-world applicability for large-scale literature mining and intellectual property analysis.</p>
</li>
<li>
<p><strong>Single-Stage Architecture Benefits</strong>: By eliminating the traditional image → structure → fingerprint pipeline, SubGrapher avoids error accumulation from failed molecular reconstructions. Every input image produces a fingerprint, making the system more reliable for batch processing of diverse document collections.</p>
</li>
<li>
<p><strong>Limitations and Scope</strong>: The method remains focused on common organic functional groups and may struggle with inorganic chemistry, organometallic complexes, or highly specialized molecular classes not well-represented in the training data. The 1,534 functional groups, while extensive, represent a curated subset of chemical space. SubGrapher also cannot distinguish enantiomers, as the detected substructures lack stereochemistry information. Additionally, the method currently cannot recognize substructures in abbreviations or single-atom fragments.</p>
</li>
</ul>
<p>The work demonstrates that direct fingerprint generation can be more robust and practical than traditional structure reconstruction approaches. SubGrapher&rsquo;s ability to handle Markush structures and degraded images makes it particularly valuable for patent analysis and large-scale document mining, where traditional OCSR methods frequently fail. The approach suggests that task-specific learning (fingerprints for retrieval) can outperform general-purpose reconstruction methods in many practical applications.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data Generation</strong>: The paper developed a custom synthetic data pipeline since no public datasets existed with pixel-level mask annotations for functional groups:</p>
<ul>
<li><strong>Extended MolDepictor</strong>: Enhanced molecular rendering tool to generate both images and corresponding segmentation masks for all substructures</li>
<li><strong>Markush Structure Rendering</strong>: Pipeline extended to handle complex generic structures</li>
<li><strong>Source Molecules</strong>: PubChem for broad chemical diversity</li>
</ul>
<p><strong>Evaluation Benchmarks</strong>:</p>
<ul>
<li><strong>JPO Dataset</strong>: Real patent images with poor resolution, noise, and artifacts</li>
<li><strong>USPTO-10K-L</strong>: Large complex molecular structures</li>
<li><strong>USPTO-Markush</strong>: Generic structures with variable R-groups</li>
<li><strong>Retrieval Benchmarks</strong>: Five datasets (adenosine, camphor, cholesterol, limonene, pyridine), each with 500 similar molecular images</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Dual instance segmentation system using Mask-RCNN</p>
<ul>
<li><strong>Functional Group Detector</strong>: Mask-RCNN trained to identify 1,534 expert-defined functional groups</li>
<li><strong>Carbon Backbone Detector</strong>: Mask-RCNN trained to recognize 27 common carbon chain patterns</li>
<li><strong>Backbone Network</strong>: Not specified in the paper</li>
</ul>
<p><strong>Functional Group Knowledge Base</strong>: 1,534 substructures systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Definition</strong>:</p>
<ul>
<li><strong>1,534 Functional Groups</strong>: Defined by manually curated SMARTS patterns
<ul>
<li>Must contain heteroatoms (O, N, S, P, B)</li>
<li>Frequency threshold: ~1,000+ occurrences in PubChem</li>
<li>Systematically constructed from chemically logical atom combinations</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li><strong>27 Carbon Backbones</strong>: Patterns of 3-6 carbon atoms (rings and chains) to capture molecular scaffolds</li>
</ul>
<p><strong>Substructure-Graph Construction</strong>:</p>
<ol>
<li>Detect functional groups and carbon backbones using Mask-RCNN models</li>
<li>Build connectivity graph:
<ul>
<li>Each node represents an identified substructure instance</li>
<li>Edges connect substructures whose bounding boxes overlap</li>
<li>Bounding boxes expanded by 10% of smallest box&rsquo;s diagonal to ensure connectivity between adjacent groups</li>
<li>Carbon chain intersection coefficients divided by 2, giving functional groups higher effective weight</li>
</ul>
</li>
</ol>
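<p>The bounding-box expansion and overlap test above can be sketched directly. The box coordinates here are illustrative, not real detector output; only the 10%-of-smallest-diagonal margin rule comes from the paper:</p>

```python
import math
from itertools import combinations

# Toy sketch of the substructure-graph step: expand each detection's box
# by 10% of the smallest box's diagonal, then connect any two
# substructures whose expanded boxes overlap. Boxes are (x0, y0, x1, y1).


def expand(box, margin):
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def build_graph(boxes):
    diag = min(math.hypot(x1 - x0, y1 - y0) for x0, y0, x1, y1 in boxes)
    margin = 0.10 * diag
    grown = [expand(b, margin) for b in boxes]
    return {(i, j) for i, j in combinations(range(len(boxes)), 2)
            if overlaps(grown[i], grown[j])}


# Two nearly-touching groups plus one distant backbone:
edges = build_graph([(0, 0, 10, 10), (10.5, 0, 20, 10), (100, 100, 110, 110)])
```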
<p><strong>SVMF Fingerprint Generation</strong>:</p>
<ul>
<li>Matrix form: $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$</li>
<li>Stored as compressed sparse upper triangular matrix</li>
<li><strong>Diagonal elements</strong>: $SVMF_{ii} = h_1 \cdot n_i + g_{ii}$ where $h_1 = 10$</li>
<li><strong>Off-diagonal elements</strong>: $SVMF_{ij} = h_2(d) \cdot \text{intersection}(s_i, s_j)$ where:
<ul>
<li>$h_2(d) = 2$ for $d = 0, 1$</li>
<li>$h_2(2) = 2/4$, $h_2(3) = 2/16$, $h_2(4) = 2/256$</li>
<li>$h_2(d) = 0$ for $d &gt; 4$</li>
</ul>
</li>
<li>Average sparsity: 0.001% non-zero elements</li>
<li>Similarity metric: Normalized Euclidean distance (L2 norm of difference divided by L2 norm of sum)</li>
</ul>
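<p>Because the matrix is symmetric and overwhelmingly sparse, only non-zero upper-triangular entries need storing. A minimal sketch of that compressed representation (toy 4&times;4 matrix standing in for the 1561&times;1561 fingerprint):</p>

```python
import numpy as np

# Sketch of compressed sparse upper-triangular storage for a symmetric,
# highly sparse fingerprint matrix.


def compress(m):
    """Return {(i, j): value} for non-zero entries with i <= j."""
    n = m.shape[0]
    return {(i, j): m[i, j] for i in range(n) for j in range(i, n)
            if m[i, j] != 0}


def decompress(sparse, n):
    m = np.zeros((n, n))
    for (i, j), v in sparse.items():
        m[i, j] = m[j, i] = v
    return m


m = np.zeros((4, 4))
m[0, 0], m[0, 2] = 20.0, 0.5
m[2, 0] = 0.5  # symmetric counterpart
sparse = compress(m)
```

<p>Storing only these few entries is what makes database-scale fingerprint comparison tractable.</p>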
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Substructure F1-score (S-F1)</strong>: Harmonic mean of precision and recall for individual substructure detection across all molecules in the dataset</li>
<li><strong>Molecule Exact Match (M-EM)</strong>: Percentage of molecules where S-F1 = 1.0 (all substructures correctly identified)</li>
<li><strong>Retrieval Rank</strong>: Rank of the ground-truth molecule in a candidate list of 500 similar structures when querying with a SMILES-derived fingerprint, averaged across 50 queries per benchmark</li>
</ul>
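<p>The two detection metrics can be sketched as below, assuming per-image sets of predicted and ground-truth substructure labels (multisets in general; plain sets here for simplicity):</p>

```python
# Hedged sketch of S-F1 and M-EM over a toy dataset of (pred, truth) pairs.


def s_f1(pred, truth):
    """Substructure F1 for one image."""
    if not pred or not truth:
        return 0.0
    tp = len(pred & truth)
    precision, recall = tp / len(pred), tp / len(truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def dataset_scores(pairs):
    f1s = [s_f1(p, t) for p, t in pairs]
    avg_s_f1 = sum(f1s) / len(f1s)
    m_em = sum(f == 1.0 for f in f1s) / len(f1s)  # fraction with S-F1 = 1
    return avg_s_f1, m_em


scores = dataset_scores([({"OH", "C6"}, {"OH", "C6"}),  # exact match
                         ({"OH"}, {"OH", "NH2"})])      # partial match
```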
<p><strong>Baselines</strong>: Compared against SOTA OCSR methods:</p>
<ul>
<li>Deep learning: MolScribe, MolGrapher, DECIMER</li>
<li>Rule-based: OSRA</li>
<li>Fingerprint methods: RDKit Daylight, MHFP (applied to OCSR outputs)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Training and inference hardware details are not provided in the main text; they may be available in the code repository or supplementary materials.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/SubGrapher">SubGrapher (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official inference code</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/docling-project/SubGrapher">SubGrapher (HuggingFace)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/docling-project/SubGrapher-Datasets">SubGrapher-Datasets (HuggingFace)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Visual fingerprinting benchmark datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="implementation-gaps">Implementation Gaps</h3>
<p>The following details are not available in the paper and would require access to the code repository or supplementary information:</p>
<ul>
<li>Specific backbone architecture for Mask-RCNN (ResNet variant, Swin Transformer, etc.)</li>
<li>Optimizer type (AdamW, SGD, etc.)</li>
<li>Learning rate and scheduler</li>
<li>Batch size and number of training epochs</li>
<li>Loss function weights (box loss vs. mask loss)</li>
<li>GPU/TPU specifications used for training</li>
<li>Training time and computational requirements</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Meijer, G. I., Weber, V., Van Gool, L., &amp; Staar, P. W. J. (2025). SubGrapher: Visual fingerprinting of chemical structures. Journal of Cheminformatics, 17(1), 149. <a href="https://doi.org/10.1186/s13321-025-01091-4">https://doi.org/10.1186/s13321-025-01091-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{morinSubGrapherVisualFingerprinting2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{SubGrapher: Visual Fingerprinting of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{SubGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valéry and Van Gool, Luc and Staar, Peter W. J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{149}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-025-01091-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RFL: Simplifying Chemical Structure Recognition (AAAI 2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/</link><pubDate>Thu, 19 Dec 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/</guid><description>Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD) for improved optical chemical structure recognition from molecular images.</description><content:encoded><![CDATA[<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological</strong> paper ($\Psi_{\text{Method}}$). It introduces a novel representation system (Ring-Free Language) and a specialized neural architecture (Molecular Skeleton Decoder) designed to solve specific limitations in converting 2D images to 1D chemical strings. The paper validates this method through direct comparison with existing baselines and ablation studies.</p>
<h2 id="motivation-limitations-of-1d-serialization">Motivation: Limitations of 1D Serialization</h2>
<p>Current Optical Chemical Structure Recognition (OCSR) methods typically rely on &ldquo;unstructured modeling,&rdquo; where 2D molecular graphs are flattened into 1D strings like SMILES or SSML. While simple, these linear formats struggle to explicitly capture complex spatial relationships, particularly in molecules with multiple rings and branches. End-to-end models often fail to &ldquo;understand&rdquo; the graph structure when forced to predict these implicit 1D sequences, leading to error accumulation in complex scenarios.</p>
<h2 id="innovation-ring-free-language-rfl-and-molecular-skeleton-decoder-msd">Innovation: Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD)</h2>
<p>The authors propose two primary contributions to decouple spatial complexity:</p>
<ol>
<li><strong>Ring-Free Language (RFL)</strong>: A divide-and-conquer representation that splits a molecular graph $G$ into three explicit components: a molecular skeleton $\mathcal{S}$, individual ring structures $\mathcal{R}$, and branch information $\mathcal{F}$. This allows rings to be collapsed into &ldquo;SuperAtoms&rdquo; or &ldquo;SuperBonds&rdquo; during initial parsing.</li>
<li><strong>Molecular Skeleton Decoder (MSD)</strong>: A hierarchical architecture that progressively predicts the skeleton first, then the individual rings (using SuperAtom features as conditions), and finally classifies the branch connections.</li>
</ol>
<h2 id="methodology-and-experiments">Methodology and Experiments</h2>
<p>The method was evaluated on both handwritten and printed chemical structures against two baselines: DenseWAP (Zhang et al. 2018) and RCGD (Hu et al. 2023).</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>EDU-CHEMC</strong>: ~49k handwritten samples (challenging, diverse styles)</li>
<li><strong>Mini-CASIA-CSDB</strong>: ~89k printed samples (from ChEMBL)</li>
<li><strong>Synthetic Complexity Dataset</strong>: A custom split of ChEMBL data grouped by structural complexity (atoms + bonds + rings) to test generalization</li>
</ul>
</li>
<li><strong>Ablation Studies</strong> (Table 2, on EDU-CHEMC with MSD-DenseWAP): Without MSD or <code>[conn]</code>, EM=38.70%. Adding <code>[conn]</code> alone raised EM to 44.02%. Adding MSD alone raised EM to 52.76%. Both together achieved EM=64.96%, confirming each component&rsquo;s contribution.</li>
</ul>
<h2 id="outcomes-and-conclusions">Outcomes and Conclusions</h2>
<ul>
<li><strong>New best results</strong>: MSD-RCGD achieved 65.39% EM on EDU-CHEMC (handwritten) and 95.23% EM on Mini-CASIA-CSDB (printed), outperforming the RCGD baseline (62.86% and 95.01%, respectively). MSD-DenseWAP surpassed the previous best on EDU-CHEMC by 2.06% EM (64.92% vs. 62.86%).</li>
<li><strong>Universal improvement</strong>: Applying MSD/RFL to DenseWAP improved its accuracy from 61.35% to 64.92% EM on EDU-CHEMC and from 92.09% to 94.10% EM on Mini-CASIA-CSDB, demonstrating the method is model-agnostic.</li>
<li><strong>Complexity handling</strong>: When trained on low-complexity molecules only (levels 1-2), MSD-DenseWAP still recognized higher-complexity unseen structures, while standard DenseWAP could hardly recognize them at all (Figure 6 in the paper).</li>
</ul>
<p>The authors note that this is the first end-to-end solution that decouples and models chemical structures in a structured form. Future work aims to extend structured-based modeling to other tasks such as tables, flowcharts, and diagrams.</p>
<hr>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/JingMog/RFL-MSD">RFL-MSD</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized one handwritten and one printed dataset, plus a synthetic set for stress-testing complexity.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>EDU-CHEMC</strong></td>
          <td>48,998 Train / 2,992 Test</td>
          <td>Handwritten images from educational scenarios</td>
      </tr>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>Mini-CASIA-CSDB</strong></td>
          <td>89,023 Train / 8,287 Test</td>
          <td>Printed images rendered from ChEMBL using RDKit</td>
      </tr>
      <tr>
          <td><strong>Generalization</strong></td>
          <td><strong>ChEMBL Subset</strong></td>
          <td>5 levels of complexity</td>
          <td>Custom split by complexity score $N_{atom} + N_{bond} + 12 \times N_{ring}$</td>
      </tr>
  </tbody>
</table>
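<p>The complexity score from the table is straightforward to compute. The score formula is from the paper; the bucket edges used to form the five levels are illustrative assumptions, since the paper&rsquo;s exact thresholds are not reproduced here:</p>

```python
# Sketch of the complexity split used for the generalization study,
# with toy counts rather than values parsed from ChEMBL.


def complexity(n_atom, n_bond, n_ring):
    return n_atom + n_bond + 12 * n_ring


def level(score, edges=(30, 60, 90, 120)):
    """Bucket a score into 5 levels; edges here are illustrative only."""
    return 1 + sum(score > e for e in edges)


score = complexity(n_atom=20, n_bond=21, n_ring=2)  # 20 + 21 + 24 = 65
```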
<h3 id="algorithms">Algorithms</h3>
<p><strong>RFL Splitting (Encoding)</strong>:</p>
<ol>
<li><strong>Detect Rings</strong>: Use DFS to find all non-nested rings $\mathcal{R}$.</li>
<li><strong>Determine Adjacency ($\gamma$)</strong>: Calculate shared edges between rings.</li>
<li><strong>Merge</strong>:
<ul>
<li>If $\gamma(r_i) = 0$ (isolated), merge ring into a <strong>SuperAtom</strong> node.</li>
<li>If $\gamma(r_i) &gt; 0$ (adjacent), merge ring into a <strong>SuperBond</strong> edge.</li>
</ul>
</li>
<li><strong>Update</strong>: Record connection info in $\mathcal{F}$ and remove ring details from the main graph to form Skeleton $\mathcal{S}$.</li>
</ol>
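<p>The merge rule in steps 2&ndash;3 can be sketched once rings are in hand. This toy version represents each detected ring as a set of edges rather than a real molecular graph; only the $\gamma$-based SuperAtom/SuperBond decision follows the paper:</p>

```python
# Toy sketch of the RFL merge rule: count edges a ring shares with any
# other ring (gamma); isolated rings collapse to SuperAtom nodes, fused
# rings to SuperBond edges.


def gamma(ring, others):
    """Number of edges this ring shares with any other ring."""
    shared = set()
    for other in others:
        shared |= ring & other
    return len(shared)


def merge_kind(ring, all_rings):
    others = [r for r in all_rings if r is not ring]
    return "SuperAtom" if gamma(ring, others) == 0 else "SuperBond"


benzene = frozenset({(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)})
fused_a = frozenset({(7, 8), (8, 9), (9, 10), (10, 7)})
fused_b = frozenset({(9, 10), (10, 11), (11, 12), (12, 9)})  # shares (9, 10)
rings = [benzene, fused_a, fused_b]
kinds = [merge_kind(r, rings) for r in rings]
```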
<p><strong>MSD Decoding</strong>:</p>
<ul>
<li><strong>Hierarchical Prediction</strong>: The model predicts the Skeleton $\mathcal{S}$ first.</li>
<li><strong>Contextual Ring Prediction</strong>: When a SuperAtom/Bond token is predicted, its hidden state $f^s$ is stored. After the skeleton is finished, $f^s$ is used as a condition to autoregressively decode the specific ring structure.</li>
<li><strong>Token <code>[conn]</code></strong>: A special token separates connected ring bonds from unconnected ones to sparsify the branch classification task.</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture follows a standard Image-to-Sequence pattern but with a forked decoder.</p>
<ul>
<li><strong>Encoder</strong>: DenseNet (Growth rate=24, Depth=32 per block)</li>
<li><strong>Decoder (MSD)</strong>:
<ul>
<li><strong>Core</strong>: GRU with Attention (Hidden dim=256, Embedding dim=256, Dropout=0.15)</li>
<li><strong>Skeleton Module</strong>: Autoregressively predicts sequence tokens. Uses Maxout activation.</li>
<li><strong>Branch Module</strong>: A binary classifier (MLP) taking concatenated features of skeleton bonds $f_{bs}$ and ring bonds $f_{br}$ to predict connectivity matrix $\mathcal{F}$.</li>
</ul>
</li>
<li><strong>Loss Function</strong>: $\mathcal{O} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{cls}$ (where $\lambda_1 = \lambda_2 = 1$)</li>
</ul>
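<p>The objective $\mathcal{O} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{cls}$ with $\lambda_1 = \lambda_2 = 1$ can be illustrated with toy probabilities in place of real decoder outputs; this is a hedged sketch, not the authors&rsquo; training code:</p>

```python
import math

# Minimal sketch of the combined loss: token-level cross-entropy for the
# skeleton/ring decoder plus binary cross-entropy for branch connectivity.


def cross_entropy(probs, target_idx):
    """Token-level CE for one decoding step."""
    return -math.log(probs[target_idx])


def binary_ce(p, y):
    """Branch-connectivity classification loss for one bond pair."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


l_ce = cross_entropy([0.1, 0.7, 0.2], target_idx=1)  # skeleton token
l_cls = binary_ce(0.9, 1)                            # connected branch
total = 1.0 * l_ce + 1.0 * l_cls                     # lambda_1 = lambda_2 = 1
```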
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on exact image reconstruction and structural validity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>EM (Exact Match)</strong></td>
          <td>% of images where predicted graph exactly matches ground truth.</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td><strong>Struct-EM</strong></td>
          <td>% of correctly identified chemical structures (ignoring non-chemical text).</td>
          <td>Auxiliary metric</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 (32GB RAM)</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch size: 8 (Handwritten), 32 (Printed)</li>
<li>Epochs: 50</li>
<li>Optimizer: Adam ($lr=2\times10^{-4}$, decayed by 0.5 via MultiStepLR)</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, Q., Chen, M., Pi, C., Hu, P., Zhang, Z., Ma, J., Du, J., Yin, B., &amp; Hu, J. (2025). RFL: Simplifying Chemical Structure Recognition with Ring-Free Language. In <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(2), 2007-2015. <a href="https://doi.org/10.1609/aaai.v39i2.32197">https://doi.org/10.1609/aaai.v39i2.32197</a></p>
<p><strong>Publication</strong>: AAAI 2025 (Oral)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/JingMog/RFL-MSD">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{changRFLSimplifyingChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{RFL: Simplifying Chemical Structure Recognition with Ring-Free Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{RFL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Qikai and Chen, Mingjun and Pi, Changpeng and Hu, Pengfei and Zhang, Zhenrong and Ma, Jiefeng and Du, Jun and Yin, Baocai and Hu, Jinshui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2007--2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2412.07594}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1609/aaai.v39i2.32197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>