MolParser Datasets

MolParser-7M and WildMol Datasets for Robust Chemical Structure Recognition
Dataset Details
Authors: Xi Fang, Jiankun Wang, Xiaochen Cai, Shangqian Chen, Shuwen Yang, Haoyi Tao, Nan Wang, Lin Yao, Linfeng Zhang, Guolin Ke
Paper Title: MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild
Institution: DP Technology
Published In: arXiv
Category: Document Processing
Format: Molecule images (PNG); Extended SMILES (E-SMILES) strings
Size: 20,000 test molecules; 7,740,871 training pairs
Date: October 2025
Year: 2025
A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.
The MolParser system is designed to perform end-to-end recognition of molecular structures found in real-world documents like patents and scientific literature.

Key Contribution

Introduces MolParser-7M, the largest Optical Chemical Structure Recognition (OCSR) dataset, uniquely combining diverse synthetic data with a large volume of manually-annotated, ‘in-the-wild’ images from real scientific documents to improve model robustness. Also introduces WildMol, a new challenging benchmark for evaluating OCSR performance on real-world data, including Markush structures.

Dataset Information

Format

Molecule images (PNG) paired with Extended SMILES (E-SMILES) strings.

Size

| Type | Count |
|---|---|
| Test Molecules | 20,000 |
| Training Pairs | 7,740,871 |

Dataset Examples

An example of a complex Markush structure that can be represented by the E-SMILES format but not by standard SMILES or FG-SMILES.
A sample from the WildMol benchmark, showing a low-quality, noisy molecular image cropped from real-world literature that challenges OCSR systems.
A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

Dataset Subsets

| Subset | Count | Description |
|---|---|---|
| MolParser-7M (Training Set) | 7,740,871 | A large-scale dataset for training OCSR models, split into pre-training and fine-tuning stages. |
| WildMol (Test Set) | 20,000 | A benchmark of 20,000 human-annotated samples cropped from real PDF files to evaluate OCSR models in ‘in-the-wild’ scenarios. |

Results

Optical Chemical Structure Recognition (Accuracy %)

| Model | WildMol-10K |
|---|---|
| OSRA 2.1 | 26.3 |
| MolVec 0.9.7 | 26.4 |
| Imago 2.0 | 6.9 |
| Img2Mol | 24.4 |
| MolGrapher | 45.5 |
| DECIMER 2.7 | 56.0 |
| MolScribe | 🥈 66.4 |
| MolParser-Base | 🥇 76.9 |

Strengths

  • Largest open-source OCSR dataset with over 7.7 million pairs.
  • The only large-scale OCSR training set that includes a significant amount (400k) of ‘in-the-wild’ data cropped from real patents and literature.
  • High diversity of molecular structures from numerous sources (PubChem, ChEMBL, polymers, etc.).
  • Introduces the WildMol benchmark for evaluating performance on challenging, real-world data, including Markush structures.
  • The ‘in-the-wild’ fine-tuning data (MolParser-SFT-400k) was curated via an efficient active learning data engine with human-in-the-loop validation.

Limitations

  • The E-SMILES format cannot represent certain complex cases, such as coordination bonds, dashed abstract rings, and Markush structures depicted with special patterns.
  • The model and data do not yet fully exploit molecular chirality, which is critical for chemical properties.
  • Performance could be further improved by scaling up the amount of real annotated training data.

Technical Notes

Synthetic Data Generation

  • To ensure diversity, molecular structures were collected from databases like ChEMBL, PubChem, and Kaggle BMS.
  • A significant number of Markush, polymer, and fused-ring structures were also randomly generated.
  • Images were rendered using RDKit and epam.indigo with randomized parameters (e.g., bond width, font size, rotation) to increase visual diversity.
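
The randomized rendering step can be illustrated with a minimal sketch. It is not the authors' pipeline: the `RenderParams` class, the `sample_render_params` and `render_png` helpers, and every parameter range below are assumptions chosen for illustration. Only RDKit is shown (epam.indigo is omitted), and the RDKit import is deferred so that parameter sampling works without it.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderParams:
    """Hypothetical bundle of drawing parameters (names are illustrative)."""
    bond_line_width: int  # pixels
    font_size: int        # pixels
    rotation_deg: int     # degrees


def sample_render_params(rng: random.Random) -> RenderParams:
    """Sample one random drawing style. The ranges are assumptions, not the paper's."""
    return RenderParams(
        bond_line_width=rng.randint(1, 4),
        font_size=rng.randint(12, 28),
        rotation_deg=rng.randint(0, 359),
    )


def render_png(smiles: str, params: RenderParams, size: int = 384) -> bytes:
    """Render a SMILES string to PNG bytes with RDKit (requires rdkit with Cairo)."""
    # Imported here so the sampler above stays usable without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem.Draw import rdMolDraw2D

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    drawer = rdMolDraw2D.MolDraw2DCairo(size, size)
    opts = drawer.drawOptions()
    opts.bondLineWidth = params.bond_line_width
    opts.fixedFontSize = params.font_size
    opts.rotate = params.rotation_deg
    rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol)
    drawer.FinishDrawing()
    return drawer.GetDrawingText()
```

With RDKit installed, `render_png("c1ccccc1O", sample_render_params(random.Random(0)))` would return the bytes of one randomly styled PNG. Seeding the generator keeps each image/label pair reproducible.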

In-the-Wild Data Engine (MolParser-SFT-400k)

  • A YOLOv11 object detection model (MolDet) located and cropped over 20 million molecule images from 1.22 million real PDFs (patents and papers).
  • After de-duplication via p-hash similarity, 4 million unique images remained.
  • An active learning algorithm selected the most informative samples for annotation. It targeted images on which an ensemble of models showed moderate confidence (Tanimoto similarity between 0.6 and 0.9), since these are challenging but still learnable.
  • This active learning approach with model pre-annotations reduced manual annotation time per molecule to 30 seconds, a 90% savings compared to annotating from scratch.
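
The de-duplication and sample-selection steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation. Perceptual hashes are assumed to be precomputed 64-bit integers, each model prediction is assumed to be reduced to a set of fingerprint bit indices, and ensemble confidence is assumed to be the mean pairwise Tanimoto similarity. The function names and the `max_distance` threshold are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def dedup_by_phash(hashes: Dict[str, int], max_distance: int = 4) -> List[str]:
    """Greedy near-duplicate removal over precomputed perceptual hashes.

    An image is kept only if its hash differs from every kept hash by more
    than max_distance bits (threshold is an assumption). O(n^2), for
    illustration only.
    """
    kept: List[str] = []
    for name, h in hashes.items():
        if all(hamming(h, hashes[k]) > max_distance for k in kept):
            kept.append(name)
    return kept


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprint bit sets."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def ensemble_confidence(preds: Sequence[FrozenSet[int]]) -> float:
    """Mean pairwise Tanimoto similarity across ensemble predictions (assumed metric)."""
    if len(preds) < 2:
        raise ValueError("need at least two predictions")
    pairs = list(combinations(preds, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def select_for_annotation(
    candidates: Dict[str, Sequence[FrozenSet[int]]],
    low: float = 0.6,
    high: float = 0.9,
) -> List[str]:
    """Keep images whose ensemble confidence falls inside the [low, high] band."""
    return [
        name
        for name, preds in candidates.items()
        if low <= ensemble_confidence(preds) <= high
    ]
```

Images on which the ensemble fully agrees add little new signal, while images with almost no agreement are likely too noisy to pre-annotate usefully, so only the middle band is sent to human annotators.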