MolParser Datasets

MolParser-7M and WildMol Datasets for Robust Chemical Structure Recognition
Dataset Details
Authors: Xi Fang, Jiankun Wang, Xiaochen Cai, Shangqian Chen, Shuwen Yang, Haoyi Tao, Nan Wang, Lin Yao, Linfeng Zhang, Guolin Ke
Paper Title: MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild
Institution: DP Technology
Published In: arXiv
Category: Document Processing
Format: Molecule images (PNG); Extended SMILES (E-SMILES) strings
Size: Test molecules: 20,000; Training pairs: 7,740,871
Date: October 2025
Year: 2025
Links: 📊 Dataset, 📄 Paper
A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that optical chemical structure recognition (OCSR) models must handle.
The MolParser system is designed for end-to-end recognition of molecular structures in real-world documents such as patents and scientific literature.