Paper Information

Citation: Fang, X., Tao, H., Yang, S., Huang, C., Zhong, S., Lu, H., Lyu, H., Li, X., Zhang, L., & Ke, G. (2025). Uni-Parser Technical Report. arXiv preprint arXiv:2512.15098. https://arxiv.org/abs/2512.15098

Publication: arXiv 2025

An Industrial-Grade Multi-Modal Document Parser

Uni-Parser is a modular, loosely coupled PDF parsing engine built for scientific literature and patents. It routes different content types (text, equations, tables, figures, chemical structures) to specialized expert models, then reassembles the parsed outputs into structured formats (JSON, Markdown, HTML) for downstream consumption by LLMs and other applications.

The system processes up to 20 PDF pages per second on 8 NVIDIA RTX 4090D GPUs and supports over 80 languages for OCR.
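
As a rough reading of that throughput figure, the sketch below estimates the per-GPU rate and the wall-clock time for a corpus. It assumes the 20 pages/second is sustained and divides evenly across the 8 GPUs, which the report summary does not state. The function name and the one-million-page corpus are illustrative only.

```python
def estimate_hours(total_pages: int, pages_per_second: float = 20.0) -> float:
    """Wall-clock hours needed to parse `total_pages` at a sustained rate."""
    return total_pages / pages_per_second / 3600.0

# Assumption: throughput splits evenly over the 8 GPUs of the reference setup.
per_gpu_rate = 20.0 / 8

# Hypothetical corpus of one million pages (about 13.9 hours at 20 pages/s).
hours_for_1m = estimate_hours(1_000_000)
```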

A Five-Stage Pipeline Architecture

The system is organized into five sequential stages:

  1. Document Pre-Processing: Validates PDFs, extracts metadata, checks text accessibility, and identifies language.
  2. Group-based Layout Detection: Locates semantic blocks and identifies their categories using a novel tree-structured layout representation. Groups naturally paired elements (image-caption, table-title, molecule-identifier).
  3. Semantic Contents Parsing: Routes each block to a specialized model: OCR for text, formula recognition for equations, table structure recognition, OCSR for chemical structures, reaction extraction, and chart parsing. Over ten sub-models operate in parallel.
  4. Semantic Contents Gathering: Filters non-essential elements, reconstructs reading order, merges cross-page and multi-column content, and reintegrates inline multimodal elements.
  5. Output Formatting and Semantic Chunking: Exports parsed documents in task-specific formats with proper chunking for RAG and other downstream tasks.
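
The routing and gathering stages above can be sketched as a dispatch table followed by a reading-order sort. This is a minimal illustration, not the authors' implementation: the `Block` type, the category names, and the stub experts are all hypothetical, and a thread pool stands in for the parallel sub-models.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Block:
    category: str   # label assigned by layout detection (stage 2)
    page: int
    order: int      # reading-order index within the page
    payload: str    # stand-in for the cropped page region

# Stub experts: placeholders for the real sub-models (hypothetical names).
EXPERTS: Dict[str, Callable[[str], str]] = {
    "text": lambda x: f"ocr:{x}",
    "equation": lambda x: f"formula:{x}",
    "table": lambda x: f"table:{x}",
    "molecule": lambda x: f"ocsr:{x}",
}

def parse_block(block: Block) -> dict:
    """Stage 3: send one block to the expert registered for its category."""
    expert = EXPERTS[block.category]  # raises KeyError for unrouted categories
    return {"page": block.page, "order": block.order,
            "category": block.category, "content": expert(block.payload)}

def parse_document(blocks: List[Block]) -> List[dict]:
    """Parse blocks concurrently, then restore reading order (stage 4 stand-in)."""
    with ThreadPoolExecutor() as pool:
        parsed = list(pool.map(parse_block, blocks))
    return sorted(parsed, key=lambda r: (r["page"], r["order"]))
```

Keeping the experts behind a plain mapping mirrors the loosely coupled design described earlier: swapping a sub-model only changes one entry in the table.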

Group-Based Layout Detection

A key contribution is the group-based layout detection model (Uni-Parser-LD), which uses a hierarchical tree structure to represent page layouts. Elements are organized into a bottom layer (parent nodes like paragraphs, tables, images) and a top layer (child nodes like captions, footnotes, identifiers). This preserves semantic associations between paired elements, such as molecules and their identifiers.
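
One way to picture the tree representation is a two-level node structure in which each parent block carries its attached children. The class and field names below are assumptions made for illustration; the report's actual data format is not shown in this summary.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in page coordinates

@dataclass
class LayoutNode:
    category: str                      # e.g. "table", "image", "molecule", "caption"
    bbox: BBox
    children: List["LayoutNode"] = field(default_factory=list)

def grouped_pairs(page: List[LayoutNode]) -> List[Tuple[str, str]]:
    """List (parent category, child category) for every grouped pair on a page."""
    return [(parent.category, child.category)
            for parent in page for child in parent.children]

# Hypothetical page: a molecule with its identifier and a table with its caption.
page = [
    LayoutNode("molecule", (50, 100, 200, 220),
               [LayoutNode("identifier", (110, 225, 140, 240))]),
    LayoutNode("table", (50, 300, 500, 450),
               [LayoutNode("caption", (50, 280, 500, 298))]),
    LayoutNode("paragraph", (50, 470, 500, 600)),
]
```

Because the pairing lives in the structure itself, a downstream step can read a molecule's identifier directly from its node instead of re-deriving it from bounding-box proximity.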

The model is trained on 500k pages, including 220k human-annotated pages from scientific journals and patents across 85 languages. The authors selected a modified DETR-based architecture as the backbone after finding that RT-DETRv2, YOLOv12, and D-FINE trained unstably on this task.

Chemical Structure Recognition with MolParser 1.5

Uni-Parser integrates MolParser 1.5 for OCSR, an end-to-end model that directly generates molecular representations from images. The authors explicitly note that graph-based (atom-bond) methods were the first direction they explored but ultimately abandoned because of:

  • Strong reliance on rigid, hand-crafted rules that limit scalability
  • Substantially higher annotation costs (over 20x compared to end-to-end approaches)
  • Lower performance ceilings despite increasing training data

Molecule Localization

Uni-Parser-LD has the highest mAP@50 on both benchmarks. On BioVista Bench, MolDet-Doc-L scores higher on the stricter mAP@50-95 metric (0.871 vs. 0.844):

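Benchmark          Model              mAP@50   mAP@50-95
Uni-Parser Bench   Uni-Parser-LD      0.994    0.968
Uni-Parser Bench   MolDet-Doc-L       0.983    0.919
Uni-Parser Bench   MolDet-General-L   0.974    0.815
BioVista Bench     Uni-Parser-LD      0.981    0.844
BioVista Bench     MolDet-Doc-L       0.961    0.871
BioVista Bench     MolDet-General-L   0.945    0.815
BioVista Bench     BioMiner           0.929    -
BioVista Bench     MolMiner           0.899    -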
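
The two metrics differ only in how strictly a predicted box must overlap the ground truth. Under the COCO convention, which this sketch assumes the report follows, mAP@50 counts a detection as correct at IoU >= 0.5, while mAP@50-95 averages over thresholds from 0.50 to 0.95 in steps of 0.05. The helper names are illustrative.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# COCO-style thresholds: 0.50, 0.55, ..., 0.95
THRESHOLDS: List[float] = [t / 100 for t in range(50, 100, 5)]

def thresholds_passed(pred: BBox, truth: BBox) -> int:
    """Number of IoU thresholds at which `pred` counts as a match for `truth`."""
    score = iou(pred, truth)
    return sum(score >= t for t in THRESHOLDS)

# A box shifted by one unit on a 10x10 target: IoU = 90 / 110, about 0.818.
shifted = thresholds_passed((1, 0, 11, 10), (0, 0, 10, 10))
```

A slightly misaligned box like this one still counts at the 0.5 threshold but fails the 0.85 and higher thresholds, which is why mAP@50-95 separates models more sharply than mAP@50.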

OCSR Accuracy

MolParser 1.5 consistently outperforms prior methods across molecule types:

Benchmark          Model           Full    Chiral   Markush   All
Uni-Parser Bench   MolParser 1.5   0.979   0.809    0.805     0.886
Uni-Parser Bench   MolParser 1.0   0.953   0.676    0.664     0.800
Uni-Parser Bench   MolScribe       0.617   0.274    0.168     0.417
BioVista Bench     MolParser 1.5   0.795   0.604    0.761     0.780
BioVista Bench     MolParser 1.0   0.669   0.352    0.733     0.703
BioVista Bench     MolMiner        0.774   0.497    0.185     0.507
BioVista Bench     MolScribe       0.703   0.481    0.156     0.455
BioVista Bench     MolNexTR        0.695   0.419    0.045     0.401
BioVista Bench     DECIMER         0.545   0.326    0.000     0.298

Chiral molecule recognition remains a significant challenge and is identified as a key area for future work.

Document Parsing Benchmarks

On the Uni-Parser Benchmark (150 PDFs, 2,887 pages from patents and scientific articles), Uni-Parser (HQ mode) achieves an overall score of 89.74 (excluding molecules), outperforming both pipeline tools (MinerU, PP-StructureV3) and specialized VLMs (MinerU2-VLM, DeepSeek-OCR, PaddleOCR-VL). Competing systems score zero on molecule localization and OCSR because they lack molecular recognition capabilities.

On the general-document OmniDocBench-1.5, a variant (Uni-Parser-G) using a swapped layout module achieves 89.75 overall, competitive with top-performing specialized VLMs.

Comparison with OCSR-Enabled PDF Parsers

On a controlled test set of 141 simple molecules, Uni-Parser outperforms other PDF parsing systems with OCSR support:

Method        Recall   OCSR Success   OCSR Acc   Id Match   Time
Uni-Parser    100%     100%           96.5%      100%       1.8 s
MathPix       100%     75.9%          59.6%      -          66.1 s
MinerU.Chem   66.7%    63.1%          22.7%      -          ~7 min

Reproducibility

Artifact             Type            License   Notes
HuggingFace Models   Model/Dataset   Unknown   MolDet models and MolParser-7M dataset available
Project Page         Other           Unknown   Project website with documentation

The Uni-Parser system is deployed on a cluster of 240 NVIDIA L40 GPUs (48 GB each) with 22 CPU cores and 90 GB of host memory per GPU. The reference throughput benchmark (20 pages/second) uses 8 NVIDIA RTX 4090D GPUs. The HuggingFace organization hosts MolDet detection models and several datasets (MolParser-7M, RxnBench, OmniScience), but the full Uni-Parser system code and end-to-end inference pipeline do not appear to be publicly released. MolParser 1.5 model weights are not publicly available as of this writing.

Limitations and Future Directions

  • Chiral molecule recognition remains a challenge for end-to-end OCSR models
  • Chemical reaction understanding in real-world literature has substantial room for improvement
  • Layout models are primarily tailored to scientific and patent documents, with plans to expand to newspapers, slides, books, and financial statements
  • Chart parsing falls short of industrial-level requirements across the diversity of chart types in scientific literature