Paper Information
Citation: Leong, S. X., Pablo-García, S., Wong, B., & Aspuru-Guzik, A. (2025). MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models. ChemRxiv. https://doi.org/10.26434/chemrxiv-2025-8z6h2
Publication: ChemRxiv 2025 (Preprint)
What kind of paper is this?
This is primarily a Methodological paper ($\Psi_{\text{Method}}$) that introduces a novel pipeline (MERMaid) for extracting structured chemical data from unstructured PDF documents. It proposes an architecture that combines a fine-tuned vision model (VisualHeist), a vision-language-model extractor (DataRaider), and a retrieval-augmented text-to-graph engine (KGWizard) to solve the problem of multimodal data ingestion.
Secondarily, it is a Resource paper ($\Psi_{\text{Resource}}$) as it releases the source code, prompts, and a new benchmark dataset (MERMaid-100) consisting of annotated reaction data across three chemical domains.
What is the motivation?
- Data Inaccessibility: A vast amount of chemical knowledge is locked in “print-optimized” PDF formats, specifically within graphical elements like figures, schemes, and tables, which resist standard text mining.
- Limitations of Prior Work: Existing tools (e.g., ChemDataExtractor, OpenChemIE) focus primarily on text, struggle with multimodal parsing, or lack the “contextual awareness” needed to interpret implicit information (e.g., “standard conditions” with modifications in optimization tables).
- Need for Structured Data: To enable self-driving laboratories and data-driven discovery, this unstructured literature must be converted into machine-actionable formats like knowledge graphs.
What is the novelty here?
- VisualHeist (Fine-tuned Segmentation): A custom fine-tuned model based on Microsoft’s Florence-2 that accurately segments figures, captions, and footnotes, even in messy supplementary materials.
- DataRaider (Context-Aware Extraction): A VLM-powered module (using GPT-4o) with a two-step prompt framework that performs “self-directed context completion.” It can infer missing reaction parameters from context and resolve footnote labels (e.g., linking “condition a” in a table to its footnote description).
- KGWizard (Schema-Adaptive Graph Construction): A text-to-graph engine that uses LLMs as higher-order functions to synthesize parsers dynamically. It employs Retrieval-Augmented Generation (RAG) to check for existing nodes during creation, implicitly resolving coreferences (e.g., unifying “MeCN” and “Acetonitrile”).
- Topic-Agnostic Design: Unlike rigid parsers, MERMaid is demonstrated to work across three distinct domains: organic electrosynthesis, photocatalysis, and organic synthesis.
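Read as a data flow, the three modules chain PDF → segmented images → reaction dictionaries → knowledge graph. The sketch below is a hypothetical orchestration skeleton; the function names and signatures are placeholders, not the actual MERMaid API.

```python
from pathlib import Path

# Hypothetical stand-ins for the three MERMaid stages; the released package
# exposes its own entry points, which may differ from these names.
def segment_pdf(pdf_path: Path) -> list[Path]:
    """VisualHeist: crop figures/schemes/tables (plus captions) out of a PDF."""
    raise NotImplementedError

def extract_reactions(image_path: Path) -> dict:
    """DataRaider: return a structured reaction dictionary for one image."""
    raise NotImplementedError

def build_graph(reaction_dicts: list[dict]) -> None:
    """KGWizard: merge reaction dictionaries into the knowledge graph."""
    raise NotImplementedError

def run_pipeline(pdf_dir: Path) -> None:
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        images = segment_pdf(pdf_path)                           # stage 1
        reactions = [extract_reactions(img) for img in images]   # stage 2
        build_graph(reactions)                                   # stage 3
```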
What experiments were performed?
- Segmentation Benchmarking: The authors compared VisualHeist against OpenChemIE (LayoutParser) and PDFigCapX using a dataset of 121 PDFs from 5 publishers.
- End-to-End Extraction: Evaluated the full pipeline on MERMaid-100, a curated dataset of 100 articles across three domains (organic electrosynthesis, photocatalysis, organic synthesis).
- Validated extraction of specific parameters (e.g., catalysts, solvents, yields) using “hard-match” accuracy.
- Knowledge Graph Construction: Automatically generated knowledge graphs for the three domains and assessed the structural integrity and coreference resolution accuracy.
What outcomes/conclusions?
- Superior Segmentation: VisualHeist achieved >93% F1 score across all document types (including pre-2000 papers and supplementary materials), significantly outperforming baselines (OpenChemIE ~38% F1).
- High-Fidelity Extraction: DataRaider achieved >92% accuracy for VLM-based parameter extraction and near-unity accuracy for domain-specific parameters.
- Robust Graph Building: KGWizard achieved 96% accuracy in node creation and coreference resolution.
- Overall Performance: The pipeline achieved an overall end-to-end accuracy of 87%.
- Availability: The authors provide a modular, extensible framework that can be adapted to other scientific domains.
Reproducibility Details
Data
- Training Data (VisualHeist):
- Dataset of 3,435 figures and 1,716 tables annotated from 3,518 PDF pages.
- Includes main text, supplementary materials, and unformatted archive papers.
- Evaluation Data (MERMaid-100):
- 100 PDF articles curated from three domains: organic electrosynthesis, photocatalysis, and organic synthesis.
- Includes 104 image-caption/table-heading pairs relevant to reaction optimization.
- Available for download at Zenodo (DOI: 10.5281/zenodo.14917752).
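For reference, the contents of the benchmark record can be listed programmatically via Zenodo's public REST API; the `/api/records/{id}` endpoint is standard, but the exact response fields (`files`, `key`, `links`) are assumptions that should be checked against the live record.

```python
import requests

# Record id taken from the DOI 10.5281/zenodo.14917752
RECORD_ID = "14917752"

# Public Zenodo REST API; the response schema ("files", "key", "links")
# follows the documented record format but may change.
resp = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}", timeout=30)
resp.raise_for_status()
record = resp.json()

for f in record.get("files", []):
    print(f.get("key"), f.get("links", {}).get("self"))
```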
Algorithms
- Two-Step Prompt Framework (DataRaider):
- Step 1: Generic base prompt + domain keys to extract “reaction dictionaries” and “footnote dictionaries”. Uses “fill-in-the-blank” inference for missing details.
- Step 2: Safety-check prompt in which the VLM updates the reaction dictionary using the footnote dictionary to resolve entry-specific modifications (sketched after this list).
- LLM-Synthesized Parsers (KGWizard):
- Uses LLM as a function $g_{A,B}: A \times B \rightarrow (X \rightarrow Y)$ to generate Python code (parsers) dynamically based on input schema instructions.
- RAG for Coreference:
- During graph construction, the system queries the existing database for matching values (e.g., “MeCN”) before creating new nodes, preventing duplication (sketched after this list).
- Batching:
- Articles are processed in dynamic batch sizes (starting at 1, increasing to 30) to balance speed against redundancy checks (see the batch-schedule sketch below).
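A minimal sketch of the two-step prompt flow, assuming the OpenAI chat-completions API with image input. The prompt wording, domain keys, and output schema are placeholders, not the released DataRaider prompts.

```python
import base64
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-2024-08-06"

def ask_json(content) -> dict:
    """Single VLM call constrained to return a JSON object."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": content}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

def extract_reaction(image_path: str, domain_keys: list[str]) -> dict:
    with open(image_path, "rb") as fh:
        img_b64 = base64.b64encode(fh.read()).decode()

    # Step 1: base prompt + domain keys -> reaction and footnote dictionaries,
    # with "fill-in-the-blank" inference for implied-but-unstated parameters.
    step1 = ask_json([
        {"type": "text",
         "text": ("From this reaction figure/table, build a JSON object with a "
                  f"'reactions' dictionary using the keys {domain_keys} and a "
                  "'footnotes' dictionary mapping footnote labels to their text. "
                  "Infer values implied by context; use 'N/A' when absent.")},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
    ])

    # Step 2: safety check -- apply footnote-specific modifications
    # (e.g. "condition a") back onto the reaction dictionary.
    step2 = ask_json([
        {"type": "text",
         "text": ("Update the 'reactions' dictionary using the 'footnotes' "
                  "dictionary so that entry-specific modifications are applied. "
                  "Return the corrected JSON object.\n" + json.dumps(step1))},
    ])
    return step2
```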
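KGWizard's two ideas can be sketched separately: (i) the LLM as a higher-order function $g_{A,B}: A \times B \rightarrow (X \rightarrow Y)$ that emits Python source for a parser, and (ii) a RAG-style lookup that checks retrieved existing node names before creating a new one. The prompts, the `exec`-based compilation, and the plain-list node store below are illustrative assumptions, not the KGWizard implementation.

```python
import json
from typing import Callable
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-2024-08-06"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return resp.choices[0].message.content

def synthesize_parser(schema: str, example: dict) -> Callable[[dict], list[dict]]:
    """g_{A,B}: (schema instructions, example record) -> parser mapping X -> Y."""
    code = ask(
        "Write a Python function parse(record: dict) -> list[dict] that maps a "
        "reaction dictionary onto graph node specs {'label': ..., 'value': ...} "
        f"following this schema:\n{schema}\nExample record:\n{json.dumps(example)}\n"
        "Return only the code, no prose or fences."
    )
    namespace: dict = {}
    exec(code, namespace)   # sandbox this in any real deployment
    return namespace["parse"]

def resolve_node(existing_nodes: list[str], value: str) -> str:
    """RAG-style coreference check before node creation.

    `existing_nodes` plays the role of retrieved context; the model decides
    whether `value` (e.g. "MeCN") corefers with an existing node
    (e.g. "Acetonitrile"). Otherwise the value is kept as a new node.
    """
    answer = ask(
        f"Existing graph nodes: {existing_nodes}\nNew value: {value}\n"
        "If the new value names the same entity as an existing node, reply with "
        "that node's exact name; otherwise reply with the new value. Name only."
    ).strip()
    return answer if answer in existing_nodes else value
```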
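The dynamic batch schedule can be written as a small generator; only the start size (1) and the cap (30) come from the paper, while the doubling growth rule is an assumption.

```python
from typing import Iterator, Sequence

def dynamic_batches(items: Sequence, start: int = 1, cap: int = 30) -> Iterator[Sequence]:
    """Yield batches whose size starts at `start` and grows toward `cap`.

    Doubling is an illustrative growth rule; the paper states only that
    batch sizes begin at 1 and increase to 30.
    """
    size, i = start, 0
    while i < len(items):
        yield items[i:i + size]
        i += size
        size = min(size * 2, cap)

# Example: batch sizes 1, 2, 4, 8, 16, 30, 30, ... over a list of articles.
```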
Models
- VisualHeist: Fine-tuned Florence-2-large (Microsoft vision foundation model).
- Hyperparameters: 12 epochs, learning rate $5 \times 10^{-6}$, batch size 4 (a fine-tuning sketch follows this list).
- DataRaider & KGWizard: GPT-4o (version gpt-4o-2024-08-06).
- RxnScribe: Used for Optical Chemical Structure Recognition (OCSR) to convert reactant/product images to SMILES.
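A minimal fine-tuning loop with the reported hyperparameters (12 epochs, learning rate 5e-6, batch size 4). The Florence-2 loading calls follow the Hugging Face `trust_remote_code` pattern; the dataset and collate function are placeholders, and the batch is assumed to already contain `labels` so the forward pass returns a loss.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoProcessor

CHECKPOINT = "microsoft/Florence-2-large"

def finetune_visualheist(train_dataset: Dataset, collate_fn,
                         epochs: int = 12, lr: float = 5e-6, batch_size: int = 4):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(
        CHECKPOINT, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(CHECKPOINT, trust_remote_code=True)

    loader = DataLoader(train_dataset, batch_size=batch_size,
                        shuffle=True, collate_fn=collate_fn)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for _ in range(epochs):
        for batch in loader:
            # Assumes collate_fn produced processor outputs plus `labels`,
            # so the forward pass returns a training loss.
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model, processor
```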
Evaluation
- Metrics:
- Segmentation: Precision, Recall, F1, Accuracy.
- Caption Extraction: Jaccard similarity (threshold 0.70).
- Data Extraction: Hard-match accuracy (exact match of parameter to its role, e.g., correct anode vs. cathode); both metrics are sketched after this list.
- Baselines: OpenChemIE (LayoutParser + EasyOCR) and PDFigCapX.
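The two scoring rules can be made concrete as below; whitespace tokenization for the Jaccard score is an assumption, and only the 0.70 threshold and the exact-match rule come from the paper.

```python
def jaccard(pred: str, truth: str) -> float:
    """Token-level Jaccard similarity between two caption strings
    (whitespace tokenization is an assumption)."""
    a, b = set(pred.lower().split()), set(truth.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

def caption_matches(pred: str, truth: str, threshold: float = 0.70) -> bool:
    """A caption counts as correctly extracted at or above the 0.70 threshold."""
    return jaccard(pred, truth) >= threshold

def hard_match_accuracy(pred: dict, truth: dict) -> float:
    """Fraction of parameters whose value matches exactly under the same
    role (e.g. an anode material credited to the anode, not the cathode)."""
    return sum(pred.get(k) == v for k, v in truth.items()) / len(truth)
```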
Hardware
- Training (VisualHeist): 2× NVLink-connected NVIDIA RTX A6000 GPUs (48 GB VRAM each) + Intel Xeon w7-2495X CPU.
- Inference Costs:
- DataRaider: ~$0.051 per image.
- KGWizard: ~$0.40 per JSON.
- Timing:
- VisualHeist inference: ~4.5 seconds/image.
- DataRaider inference: ~41.3 seconds/image.
- KGWizard processing: ~110.6 seconds/file.
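As a rough worked example, the per-unit figures above imply the following cost and serial runtime for a MERMaid-100-sized run (104 images, 100 articles); treating "per JSON" and "per file" as one article each is an assumption.

```python
# Per-unit figures reported in the paper.
COST_PER_IMAGE = 0.051          # USD, DataRaider (GPT-4o calls)
COST_PER_JSON = 0.40            # USD, KGWizard (assumed one JSON per article)
SEC_VISUALHEIST = 4.5           # s per image
SEC_DATARAIDER = 41.3           # s per image
SEC_KGWIZARD = 110.6            # s per file (assumed one per article)

n_images, n_articles = 104, 100  # MERMaid-100 scale

cost = n_images * COST_PER_IMAGE + n_articles * COST_PER_JSON
hours = (n_images * (SEC_VISUALHEIST + SEC_DATARAIDER)
         + n_articles * SEC_KGWIZARD) / 3600

print(f"Estimated API cost: ${cost:.2f}")          # about $45
print(f"Estimated serial runtime: {hours:.1f} h")  # about 4.4 h
```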
Citation
@article{leong2025mermaid,
title={MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models},
author={Leong, Shi Xuan and Pablo-Garc{\'i}a, Sergio and Wong, Brandon and Aspuru-Guzik, Al{\'a}n},
journal={ChemRxiv},
year={2025},
doi={10.26434/chemrxiv-2025-8z6h2}
}