Paper Information

Citation: Xu, Y., Xiao, J., Chou, C.-H., Zhang, J., Zhu, J., Hu, Q., Li, H., Han, N., Liu, B., Zhang, S., Han, J., Zhang, Z., Zhang, S., Zhang, W., Lai, L., & Pei, J. (2022). MolMiner: You Only Look Once for Chemical Structure Recognition. Journal of Chemical Information and Modeling, 62(22), 5321–5328. https://doi.org/10.1021/acs.jcim.2c00733

Publication: Journal of Chemical Information and Modeling (JCIM) 2022

Classification and Contribution

This is primarily a Resource paper ($\Psi_{\text{Resource}}$) with a strong Method component ($\Psi_{\text{Method}}$).

  • Resource: It presents a complete software application (published as an “Application Note”) for Optical Chemical Structure Recognition (OCSR), including a graphical user interface (GUI) and a new curated “Real-World” dataset of 3,040 molecular images.
  • Method: It proposes a novel “rule-free” pipeline that replaces traditional vectorization algorithms with deep learning object detection (YOLOv5) and segmentation models.

Motivation: Bottlenecks in Rule-Based Systems

  • Legacy Backlog: Decades of scientific literature contain chemical structures only as 2D images (PDFs), which are not machine-readable.
  • Limitations of Legacy Architecture: Existing tools (like OSRA, CLIDE, MolVec) rely on rule-based vectorization (interpreting vectors and nodes) which struggle with noise, low resolution, and complex drawing styles found in scanned documents.
  • Deep Learning Gap: While deep learning (DL) has advanced computer vision, few practical, end-to-end DL tools existed for OCSR that could handle the full pipeline from PDF extraction to graph generation with high accuracy.

Core Innovation: Object Detection Paradigm for OCSR

  • Object Detection Paradigm: MolMiner shifts away from the strategy of line-tracing (vectorization), opting to treat atoms and bonds directly as objects to be detected using YOLOv5. This allows it to “look once” at the image.
  • End-to-End Pipeline: Integration of three specialized modules:
    1. MobileNetV2 for segmenting molecular figures from PDF pages.
    2. YOLOv5 for detecting chemical primitives (atoms and bonds) as bounding boxes.
    3. EasyOCR for recognizing text labels and resolving abbreviations (supergroups) to full explicit structures.
  • Synthetic Training Strategy: The authors bypassed manual labeling by building a data generation module that uses RDKit to create chemically valid images with perfect ground-truth annotations automatically.
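The paper does not publish its label-generation code, but the key idea is that a renderer like RDKit knows the pixel coordinates of every atom and bond it draws, so annotations come for free. As a rough sketch, assuming YOLOv5's standard text label format (class index followed by box center and size normalized to [0, 1]), a pixel-space ground-truth box could be converted like this (the helper name is hypothetical):

```python
def to_yolo_label(cls_id, box, img_w, img_h):
    """Convert a pixel-space box (x0, y0, x1, y1) into a YOLO-format
    label line: class index plus normalized center x/y and width/height."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2 / img_w
    cy = (y0 + y1) / 2 / img_h
    w = (x1 - x0) / img_w
    h = (y1 - y0) / img_h
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# e.g. an atom drawn at pixels (100, 40)-(120, 60) in a 640x640 image
print(to_yolo_label(1, (100, 40, 120, 60), 640, 640))
# -> 1 0.171875 0.078125 0.031250 0.031250
```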

Methodology: End-to-End Object Detection Pipeline

  • Benchmarks: Evaluated on four standard OCSR datasets: USPTO (5,719 images), UOB (5,740 images), CLEF2012 (992 images), and JPO (450 images).
  • New External Dataset: Collected and annotated a “Real-World” dataset of 3,040 images from 239 scientific papers to test generalization beyond synthetic benchmarks.
  • Baselines: Compared against open-source tools: MolVec (v0.9.8), OSRA (v2.1.0), and Imago (v2.0).
  • Qualitative Tests: Tested on difficult cases like hand-drawn molecules and large-sized scans (e.g., Palytoxin).

Results: Speed and Generalization Metrics

  • SOTA Performance: MolMiner outperformed open-source baselines on standard validation splits.
    • USPTO: 93.3% MCS accuracy (vs. 88.9% for MolVec).
    • Real-World Set: 87.8% MCS accuracy (vs. 50.1% for MolVec, 8.9% for OSRA, and 10.3% for Imago).
  • Inference Speed: With batch GPU inference, MolMiner processes 1,000 JPO images in under 1 minute, versus 8–23 minutes for the CPU-bound rule-based tools.
  • Robustness: Demonstrated ability to handle hand-drawn sketches and noisy scans, though limitations remain with crossing bonds and crowded layouts.
  • Software Release: Released as a free desktop application for Mac and Windows with a Ketcher-based editing plugin.

Reproducibility Details

Data

The system relies heavily on synthetic data for training, while evaluation uses both standard and novel real-world datasets.

| Purpose    | Dataset         | Size        | Notes |
|------------|-----------------|-------------|-------|
| Training   | Synthetic RDKit | Large-scale | Generated with RDKit v2021.09.1 and ReportLab v3.5.0; augmentations include rotation, line thinning, and noise. |
| Evaluation | USPTO           | 5,719       | Standard benchmark. Avg MW: 440.3. |
| Evaluation | UOB             | 5,740       | Standard benchmark. Avg MW: 213.5. |
| Evaluation | CLEF2012        | 992         | Standard benchmark. Avg MW: 400.9. |
| Evaluation | JPO             | 450         | Standard benchmark. |
| Evaluation | Real-World      | 3,040       | New contribution; collected from 239 scientific papers. Avg MW: 496.8. |

Algorithms

  • Data Generation:
    • Uses RDKit MolDraw2DSVG and CondenseMolAbbreviations to generate images and ground truth.
    • Augmentation: Rotation, line thinning/thickness variation, noise injection.
  • Graph Construction:
    • A distance-based algorithm connects recognized “Atom” and “Bond” objects into a molecular graph.
    • Supergroup Parser: Matches detected text against a dictionary collected from RDKit, ChemAxon, and OSRA to resolve abbreviations (e.g., “Ph”, “Me”).
  • Image Preprocessing:
    • Resizing: Images with max dim > 2560 are resized to 2560. Small images (< 640) resized to 640.
    • Padding: Images padded to nearest upper bound (640, 1280, 1920, 2560) with white background (255, 255, 255).
    • Dilation: For thick-line images, cv2.dilate (3x3 or 2x2 kernel) is applied to estimate median line width.
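The distance-based graph construction is only described at a high level in the paper. A minimal sketch, under the assumption that the two opposite corners of a detected bond box approximate the bond's endpoints and that each endpoint is matched to the nearest atom center (the function and data layout here are hypothetical, not MolMiner's actual code):

```python
import math

def box_center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def connect_graph(atoms, bonds):
    """Match each bond's endpoints to the nearest detected atoms.

    atoms / bonds are lists of (label, (x0, y0, x1, y1)) in pixels.
    Each bond box's opposite corners stand in for its endpoints;
    returns edges as (atom_index_a, atom_index_b, bond_label).
    """
    centers = [box_center(b) for _, b in atoms]
    edges = []
    for bond_label, (x0, y0, x1, y1) in bonds:
        # assume the bond runs corner-to-corner; a fuller implementation
        # would test both diagonals and keep the better-supported one
        endpoints = ((x0, y0), (x1, y1))
        a, b = (min(range(len(centers)),
                    key=lambda k: math.dist(p, centers[k]))
                for p in endpoints)
        edges.append((a, b, bond_label))
    return edges

# two carbons joined by one single bond
atoms = [("C", (0, 0, 20, 20)), ("C", (80, 80, 100, 100))]
bonds = [("Single", (20, 20, 80, 80))]
print(connect_graph(atoms, bonds))  # -> [(0, 1, 'Single')]
```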

Models

The system is a cascade of three distinct deep learning models:

  1. MolMiner-ImgDet (Page Segmentation):
    • Architecture: MobileNetV2.
    • Task: Semantic segmentation to identify and crop chemical figures from full PDF pages.
    • Classes: Background vs. Compound.
    • Performance: Recall 95.5%.
  2. MolMiner-ImgRec (Structure Recognition):
    • Architecture: YOLOv5 (One-stage object detector). Selected over MaskRCNN/EfficientDet for speed/accuracy trade-off.
    • Task: Detects atoms and bonds as bounding boxes.
    • Labels:
      • Atoms: Si, N, Br, S, Cl, H, P, O, C, B, F, Text.
      • Bonds: Single, Double, Triple, Wedge, Dash, Wavy.
    • Performance: [email protected] = 97.5%.
  3. MolMiner-TextOCR (Character Recognition):
    • Architecture: EasyOCR (fine-tuned).
    • Task: Recognize specific characters in “Text” regions identified by YOLO (e.g., supergroups, complex labels).
    • Performance: ~96.4% accuracy.
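Decoding the detector's output requires mapping raw class indices back to the label set above. The index ordering below is illustrative (the checkpoint's actual ordering is not published); only the label names come from the paper:

```python
# Label set from the paper; index order here is an assumption.
ATOM_CLASSES = ["Si", "N", "Br", "S", "Cl", "H", "P", "O", "C", "B", "F", "Text"]
BOND_CLASSES = ["Single", "Double", "Triple", "Wedge", "Dash", "Wavy"]
CLASS_NAMES = ATOM_CLASSES + BOND_CLASSES

def decode_detection(cls_id):
    """Map a raw detector class index to (kind, label)."""
    label = CLASS_NAMES[cls_id]
    kind = "atom" if cls_id < len(ATOM_CLASSES) else "bond"
    return kind, label

print(decode_detection(1))   # -> ('atom', 'N')
print(decode_detection(12))  # -> ('bond', 'Single')
```

"Text" regions are the ones handed off to the OCR module for supergroup resolution.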

Performance Evaluation & Accuracy Metrics

The paper argues that Maximum Common Substructure (MCS) accuracy is a better metric than string comparisons of canonical identifiers such as InChI or SMILES: InChI strings are highly sensitive to slight canonicalization or tautomer discrepancies (e.g., differing aromaticity models). For comparing structural isomorphism, the paper therefore uses:

$$ \text{MCS Accuracy} = \frac{|\text{Edges}_{\text{MCS}}| + |\text{Nodes}_{\text{MCS}}|}{|\text{Edges}_{\text{GT}}| + |\text{Nodes}_{\text{GT}}|} $$

This metric evaluates bond- and atom-level recall directly, so it measures extraction fidelity rather than agreement on string canonicalization.
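The metric is a simple ratio; as a worked example with hypothetical counts (a ground-truth molecule with 22 bonds and 21 atoms whose MCS with the prediction covers 20 bonds and 20 atoms):

```python
def mcs_accuracy(mcs_edges, mcs_nodes, gt_edges, gt_nodes):
    """MCS accuracy: shared edges + nodes over ground-truth edges + nodes."""
    return (mcs_edges + mcs_nodes) / (gt_edges + gt_nodes)

# hypothetical counts, not taken from the paper
print(round(mcs_accuracy(20, 20, 22, 21), 3))  # -> 0.93
```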

| Metric         | MolMiner (Real-World) | MolVec | OSRA  | Imago |
|----------------|-----------------------|--------|-------|-------|
| MCS Accuracy   | 87.8%                 | 50.1%  | 8.9%  | 10.3% |
| InChI Accuracy | 88.9%                 | 62.6%  | 64.5% | 10.8% |

Hardware

  • Inference Hardware: Tested on Intel Xeon Gold 6230R CPU @ 2.10 GHz.
  • Acceleration: Supports batch inference on GPU, which provides the reported speedups over rule-based CPU tools.
  • Runtime: ~1 minute for 1,000 images on standard benchmarks, versus 8–23 minutes for the rule-based baselines.

Citation

@article{xuMolMinerYouOnly2022,
  title = {MolMiner: You Only Look Once for Chemical Structure Recognition},
  shorttitle = {MolMiner},
  author = {Xu, Youjun and Xiao, Jinchuan and Chou, Chia-Han and Zhang, Jianhang and Zhu, Jintao and Hu, Qiwan and Li, Hemin and Han, Ningsheng and Liu, Bingyu and Zhang, Shuaipeng and Han, Jinyu and Zhang, Zhen and Zhang, Shuhao and Zhang, Weilin and Lai, Luhua and Pei, Jianfeng},
  year = 2022,
  month = nov,
  journal = {Journal of Chemical Information and Modeling},
  volume = {62},
  number = {22},
  pages = {5321--5328},
  publisher = {American Chemical Society},
  issn = {1549-9596},
  doi = {10.1021/acs.jcim.2c00733},
}