Paper Information
Citation: Xu, Y., Xiao, J., Chou, C.-H., Zhang, J., Zhu, J., Hu, Q., Li, H., Han, N., Liu, B., Zhang, S., Han, J., Zhang, Z., Zhang, S., Zhang, W., Lai, L., & Pei, J. (2022). MolMiner: You Only Look Once for Chemical Structure Recognition. Journal of Chemical Information and Modeling, 62(22), 5321–5328. https://doi.org/10.1021/acs.jcim.2c00733
Publication: Journal of Chemical Information and Modeling (JCIM) 2022
What kind of paper is this?
This is primarily a Resource paper ($\Psi_{\text{Resource}}$) with a strong Method component ($\Psi_{\text{Method}}$).
- Resource: It presents a complete software application (published as an “Application Note”) for Optical Chemical Structure Recognition (OCSR), including a graphical user interface (GUI) and a new curated “Real-World” dataset of 3,040 molecular images.
- Method: It proposes a novel “rule-free” pipeline that replaces traditional vectorization algorithms with deep learning object detection (YOLOv5) and segmentation models.
What is the motivation?
- Legacy Backlog: Decades of scientific literature contain chemical structures only as 2D images (PDFs), which are not machine-readable.
- Limitation of Rule-Based Systems: Existing tools (like OSRA, CLIDE, MolVec) rely on rule-based vectorization (interpreting vectors and nodes). These struggle with noise, low resolution, and complex drawing styles found in scanned documents.
- Deep Learning Gap: While deep learning (DL) has advanced computer vision, few practical, end-to-end DL tools existed for OCSR that could handle the full pipeline from PDF extraction to graph generation with high accuracy.
What is the novelty here?
- Object Detection Paradigm: Unlike methods that try to “trace” lines (vectorization), MolMiner treats atoms and bonds as objects to be detected using YOLOv5. This allows it to “look once” at the image rather than parsing it sequentially.
- End-to-End Pipeline: Integration of three specialized modules:
- MobileNetV2 for segmenting molecular figures from PDF pages.
- YOLOv5 for detecting chemical elements (atoms/bonds).
- EasyOCR for recognizing text labels and supergroups.
- Synthetic Training Strategy: The authors bypassed manual labeling by building a data generation module that uses RDKit to create chemically valid images with perfect ground-truth annotations automatically.
What experiments were performed?
- Benchmarks: Evaluated on four standard OCSR datasets: USPTO (5,719 images), UOB (5,740 images), CLEF2012 (992 images), and JPO (450 images).
- New External Dataset: Collected and annotated a “Real-World” dataset of 3,040 images from 239 scientific papers to test generalization beyond synthetic benchmarks.
- Baselines: Compared against open-source tools: MolVec (v0.9.8), OSRA (v2.1.0), and Imago (v2.0).
- Qualitative Tests: Tested on difficult cases such as hand-drawn molecules and images of very large structures (e.g., palytoxin).
What outcomes/conclusions?
- SOTA Performance: MolMiner outperformed all open-source baselines on all datasets.
- USPTO: 93.3% MCS accuracy (vs. 88.9% for MolVec).
- Real-World Set: 87.8% MCS accuracy (vs. 50.1% for MolVec and <11% for Imago/OSRA).
- Speed: Inference is significantly faster than rule-based systems (e.g., <1 min for JPO dataset vs 8-23 mins for others) due to batch GPU processing.
- Robustness: Demonstrated ability to handle hand-drawn sketches and noisy scans, though limitations remain with crossing bonds and crowded layouts.
- Software Release: Released as a free desktop application for Mac and Windows with a Ketcher-based editing plugin.
Reproducibility Details
Data
The system relies heavily on synthetic data for training, while evaluation uses both standard and novel real-world datasets.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | Synthetic RDKit | Large-scale | Generated using RDKit v2021.09.1 and ReportLab v3.5.0. Includes augmentations (rotation, thinning, noise). |
| Evaluation | USPTO | 5,719 | Standard benchmark. Avg MW: 440.3. |
| Evaluation | UOB | 5,740 | Standard benchmark. Avg MW: 213.5. |
| Evaluation | CLEF2012 | 992 | Standard benchmark. Avg MW: 400.9. |
| Evaluation | Real-World | 3,040 | New Contribution. Collected from 239 scientific papers. Avg MW: 496.8. Download Link. |
Algorithms
- Data Generation:
- Uses RDKit `MolDraw2DSVG` and `CondenseMolAbbreviations` to generate images and ground truth.
- Augmentation: Rotation, line thinning/thickness variation, noise injection.
- Graph Construction:
- A distance-based algorithm connects recognized “Atom” and “Bond” objects into a molecular graph.
- Supergroup Parser: Matches detected text against a dictionary collected from RDKit, ChemAxon, and OSRA to resolve abbreviations (e.g., “Ph”, “Me”).
- Image Preprocessing:
- Resizing: Images whose maximum dimension exceeds 2560 are downscaled to 2560; small images (< 640) are upscaled to 640.
- Padding: Images are padded up to the nearest upper bound (640, 1280, 1920, or 2560) with a white background (255, 255, 255).
- Dilation: For thick-line images, `cv2.dilate` (3x3 or 2x2 kernel) is applied to estimate median line width.
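The resizing and padding rules above can be sketched as a small size-planning function. This is a minimal sketch, not the authors' code: the name `plan_preprocessing` is hypothetical, and it assumes that the thresholds apply to the longer side, that aspect ratio is preserved, and that each side is padded independently to the smallest listed bound that fits. The summary does not specify any of these details.

```python
# Size bounds and fill colour quoted in the preprocessing notes above.
BUCKETS = (640, 1280, 1920, 2560)
WHITE = (255, 255, 255)


def plan_preprocessing(width: int, height: int) -> dict:
    """Compute the resized size, padded canvas size, and padding for an image.

    Assumptions (not confirmed by the source): thresholds apply to the longer
    side, aspect ratio is kept, and each side is padded on its own.
    """
    longest = max(width, height)
    if longest > BUCKETS[-1]:
        target = BUCKETS[-1]   # downscale oversized images to 2560
    elif longest < BUCKETS[0]:
        target = BUCKETS[0]    # upscale small images to 640
    else:
        target = longest       # leave mid-sized images unchanged

    # Scale both sides by the same factor so the structure is not distorted.
    new_w = max(1, round(width * target / longest))
    new_h = max(1, round(height * target / longest))

    # Pad each side up to the smallest bound that can hold it.
    canvas_w = next(b for b in BUCKETS if b >= new_w)
    canvas_h = next(b for b in BUCKETS if b >= new_h)

    return {
        "resized": (new_w, new_h),
        "canvas": (canvas_w, canvas_h),
        "pad": (canvas_w - new_w, canvas_h - new_h),
        "fill": WHITE,
    }
```

Under these assumptions, a 3000x1500 scan would be downscaled to 2560x1280 and need no padding, while a 700x500 crop would be placed unchanged on a 1280x640 white canvas.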
Models
The system is a cascade of three distinct deep learning models:
- MolMiner-ImgDet (Page Segmentation):
- Architecture: MobileNetV2.
- Task: Semantic segmentation to identify and crop chemical figures from full PDF pages.
- Classes: Background vs. Compound.
- Performance: Recall 95.5%.
- MolMiner-ImgRec (Structure Recognition):
- Architecture: YOLOv5 (one-stage object detector). Selected over Mask R-CNN and EfficientDet for its speed/accuracy trade-off.
- Task: Detects atoms and bonds as bounding boxes.
- Labels:
- Atoms: Si, N, Br, S, Cl, H, P, O, C, B, F, Text.
- Bonds: Single, Double, Triple, Wedge, Dash, Wavy.
- Performance: mAP@0.5 = 97.5%.
- MolMiner-TextOCR (Character Recognition):
- Architecture: EasyOCR (fine-tuned).
- Task: Recognize specific characters in “Text” regions identified by YOLO (e.g., supergroups, complex labels).
- Performance: ~96.4% accuracy.
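To illustrate how the atom and bond boxes produced by the detector could be joined into a molecular graph, here is a minimal sketch of a distance-based matching step. The source only says the graph construction is distance-based, so everything below is an assumption made for illustration: the `Detection` type, the `build_graph` function, the `max_dist` threshold, and the heuristic that a bond line runs along one diagonal of its bounding box and attaches to the atom nearest each end.

```python
from dataclasses import dataclass
from math import hypot

# Class labels as listed for the structure recognition model above.
ATOM_LABELS = {"Si", "N", "Br", "S", "Cl", "H", "P", "O", "C", "B", "F", "Text"}
BOND_LABELS = {"Single", "Double", "Triple", "Wedge", "Dash", "Wavy"}


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: a class label plus box corners in pixels."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def center(self):
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


def _nearest_atom(point, atoms):
    """Return (index, distance) of the atom whose box centre is closest."""
    dists = [hypot(a.center[0] - point[0], a.center[1] - point[1]) for a in atoms]
    idx = min(range(len(atoms)), key=dists.__getitem__)
    return idx, dists[idx]


def build_graph(detections, max_dist=20.0):
    """Link each bond box to the two atoms nearest its ends.

    Returns (atoms, edges), where edges holds (atom_i, atom_j, bond_label).
    Bonds whose ends are not within max_dist of two distinct atoms are dropped.
    """
    atoms = [d for d in detections if d.label in ATOM_LABELS]
    bonds = [d for d in detections if d.label in BOND_LABELS]
    edges = []
    if len(atoms) < 2:
        return atoms, edges

    for b in bonds:
        # The bond line follows one of the two box diagonals; try both.
        diagonals = [
            ((b.x1, b.y1), (b.x2, b.y2)),
            ((b.x1, b.y2), (b.x2, b.y1)),
        ]
        best = None
        for p, q in diagonals:
            i, di = _nearest_atom(p, atoms)
            j, dj = _nearest_atom(q, atoms)
            if i != j and max(di, dj) <= max_dist:
                cost = di + dj
                if best is None or cost < best[0]:
                    best = (cost, min(i, j), max(i, j))
        if best is not None:
            edges.append((best[1], best[2], b.label))
    return atoms, edges
```

A production system would also need to handle the cases the paper lists as limitations, such as crossing bonds and crowded layouts, where nearest-neighbour matching of this kind can pick the wrong atom.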
Evaluation
The paper argues that MCS (Maximum Common Substructure) is a better metric than InChI strings because InChI is sensitive to minor canonicalization differences (e.g., aromaticity perception, stereochemistry format).
| Metric (Real-World set) | MolMiner | MolVec | OSRA | Imago |
|---|---|---|---|---|
| MCS Accuracy | 87.8% | 50.1% | 8.9% | 10.3% |
| InChI Accuracy | 88.9% | 62.6% | 64.5% | 10.8% |
Hardware
- Inference Hardware: Tested on Intel Xeon Gold 6230R CPU @ 2.10 GHz.
- Acceleration: Supports batch inference on GPU, which provides the reported speedups over rule-based CPU tools.
- Runtime: ~1 minute for 1,000 images on standard benchmarks (significantly faster than the ~20+ mins for older tools).
Citation
@article{xuMolMinerYouOnly2022,
title = {MolMiner: You Only Look Once for Chemical Structure Recognition},
shorttitle = {MolMiner},
author = {Xu, Youjun and Xiao, Jinchuan and Chou, Chia-Han and Zhang, Jianhang and Zhu, Jintao and Hu, Qiwan and Li, Hemin and Han, Ningsheng and Liu, Bingyu and Zhang, Shuaipeng and Han, Jinyu and Zhang, Zhen and Zhang, Shuhao and Zhang, Weilin and Lai, Luhua and Pei, Jianfeng},
year = 2022,
month = nov,
journal = {Journal of Chemical Information and Modeling},
volume = {62},
number = {22},
pages = {5321--5328},
publisher = {American Chemical Society},
issn = {1549-9596},
doi = {10.1021/acs.jcim.2c00733},
}