Paper Information

Citation: Shah, A. K., et al. (2025). Multimodal Search in Chemical Documents and Reactions. arXiv preprint arXiv:2502.16865. https://doi.org/10.48550/arXiv.2502.16865

Publication: SIGIR '25 (Demo Track), 2025

What kind of paper is this?

This is primarily a Method paper. It proposes a novel architectural pipeline for indexing and searching chemical literature by unifying text, molecular diagrams, and reaction records. A secondary component is Resource, as it presents a functional demonstration tool and curates a specific dataset for Suzuki coupling reactions.

What is the motivation?

Scientific literature tells the “full story” of a chemical reaction through a combination of text and diagrams. Text often contains details like yield and temperature, while diagrams illustrate the structural changes.

  • Problem: Existing tools (such as SciFinder or Reaxys) do not explicitly link molecular figures to their textual descriptions, which makes it hard to retrieve a reaction diagram together with its textual context (conditions, yield).
  • Gap: Most systems retrieve documents or individual compounds, rather than specific passages containing the relevant reaction descriptions.
  • Need: Researchers need to efficiently retrieve synthesis protocols and reactions with their full context.

What is the novelty here?

The core novelty is the multimodal passage-level indexing and linking system.

  • Unified Indexing: It processes text and diagrams in parallel but links them into a single index. This allows searching via text, SMILES, or a combination.
  • Compound-Passage Linking: It introduces a specific logic to link molecular diagrams to text mentions using two strategies:
    1. Token-based: Matching text labels (e.g., “compound 5”) using Levenshtein distance.
    2. Fingerprint-based: Matching chemical structures (SMILES) using Tanimoto similarity.
  • ReactionMiner Integration: It incorporates structured reaction records (reactants, products, catalysts, yields) extracted directly from the text.
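
The token-based strategy above can be illustrated with a minimal sketch. It assumes the "normalized Levenshtein ratio" means 1 minus the edit distance divided by the longer string's length; this is one common normalization, and the exact formula and any label preprocessing used by the authors are not stated in this summary. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Under this normalization, the labels "5a" and "5b" score 0.5, while identical labels score 1.0. A raw comparison of "5" against "compound 5" scores low, which suggests that mentions would need some normalization (for example, stripping the word "compound") before matching; that step is an assumption, not something described here.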

What experiments were performed?

The authors evaluated the system through a case study on Suzuki coupling literature and qualitative feedback from expert chemists.

  • Dataset: They indexed 7 research papers and 6 supplementary documents related to Suzuki coupling reactions.
  • Volume: 1,282 passages were extracted, of which 538 were indexed; the index also contained 383 unique SMILES and 219 reactions.
  • Qualitative Evaluation: Expert chemists from the University of Illinois tested the system using real-world queries (e.g., searching for “Burke group” + a specific reaction SMARTS).

What were the outcomes and conclusions drawn?

  • Effective Linking: The system successfully linked molecular diagrams to text-based reaction details, allowing users to navigate from a molecule “card” to the exact passage in the PDF.
  • Context Retrieval: Chemists found the structured reaction output (yield, catalysts, etc.) useful as an extractive summary.
  • Serendipity: The system successfully retrieved relevant derivatives (e.g., benzothiophenylboronic acid) even when the user queried a related structure (dibenzothiophene).
  • Limitations:
    • Transparency: Users struggled to tell whether results for multimodal queries were ranked more by text or by structure.
    • Data Export: Chemists requested additional metadata like “equivalents” and “mol%” for lab notebooks.
    • Linking Failures: Some text mentions were not correctly linked to their diagrams.

Reproducibility Details

Data

  • Source: 7 research papers and 6 supplementary information documents on Suzuki coupling reactions provided by chemists at UIUC.
  • Preprocessing:
    • PDFs converted to images.
    • Text extracted via PyTesseract.
    • Passages segmented into reaction-related sentences using product-indicative keywords and topic modeling.
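
The keyword part of the segmentation step can be sketched as follows. This is only an illustration: the keyword list is hypothetical (the actual product-indicative keywords are not listed in this summary), the sentence splitter is a naive regular expression, and the topic-modeling component is omitted entirely.

```python
import re

# Hypothetical product-indicative keywords; the real list is not given in the source.
PRODUCT_KEYWORDS = ("yield", "afforded", "obtained", "isolated")


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def reaction_sentences(text: str, keywords=PRODUCT_KEYWORDS) -> list:
    """Keep only sentences that contain at least one keyword as a whole word."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in split_sentences(text) if pattern.search(s)]
```

For example, given "The flask was cooled. Compound 5 was obtained in 82% yield. NMR data follow.", only the middle sentence is kept.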

Algorithms

  • Diagram Extraction: YOLOv8 detects molecular regions in PDF pages.
  • Diagram Parsing: ChemScraper parses diagrams:
    • Born-digital PDFs: SymbolScraper extracts lines/polygons directly.
    • Raster images: Line Segment Detector (LSD) and watershed algorithms detect primitives.
  • Text Entity Extraction: ChemDataExtractor 2.0 identifies molecule names, which are converted to SMILES via OPSIN.
  • Linking Logic (Fusion):
    • Text Link: Normalized Levenshtein ratio between diagram labels (e.g., “5”) and text mentions (“compound 5”).
    • Structure Link: Tanimoto similarity (Morgan fingerprints, 2048 bits) between diagram SMILES and text-derived SMILES.
    • Conflict Resolution: If both strategies match, the one with the higher score is chosen.
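
The structure link and the conflict-resolution rule can be sketched in plain Python. Fingerprints are represented here as sets of on-bit indices rather than real Morgan fingerprints (in practice a cheminformatics toolkit would compute them from SMILES). The acceptance threshold and the tie-breaking rule are assumptions introduced for illustration; this summary does not state either.

```python
from typing import Optional, Set, Tuple


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by the union of on-bits."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def choose_link(
    text_score: float,
    structure_score: float,
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the source
) -> Optional[Tuple[str, float]]:
    """Pick the strategy with the higher score among those that pass the cutoff.

    Returns None when neither strategy matches. On an exact tie the text
    link wins, which is an arbitrary choice made for this sketch.
    """
    candidates = [
        (name, score)
        for name, score in (("text", text_score), ("structure", structure_score))
        if score >= threshold
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1])
```

With two fingerprints whose on-bits are {1, 2, 3} and {2, 3, 4}, the similarity is 2 / 4 = 0.5. A call such as `choose_link(0.5, 0.95)` selects the structure link, since only that score passes the assumed cutoff.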

Models

  • Reaction Extraction: Llama-3.1-8B, fine-tuned with LoRA, extracts entities (reactants, products, catalysts) and conditions from text segments.
  • Diagram Parsing: A segmentation-aware multi-task neural network within ChemScraper is used for raster image parsing.

Evaluation

  • Search Engine: Built on PyTerrier.
  • Text Search: BM25 ranking.
  • Structure Search: RDKit for substructure matching and similarity search.
  • Multimodal Re-ranking:
    • Retrieve candidates via substructure search (SMILES) and text search (BM25).
    • Fusion step prioritizes passages with higher numbers of SMILES matches.
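
The re-ranking idea can be sketched as a merge of the two candidate lists followed by a sort. The only behavior taken from the description above is that passages with more SMILES matches rank higher; breaking ties by BM25 score and then by passage ID is an assumption, as are the `Candidate` type and the `fuse` function name.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    passage_id: str
    bm25_score: float = 0.0   # 0.0 when the passage was not returned by text search
    smiles_matches: int = 0   # 0 when the passage was not returned by structure search


def fuse(text_hits: Dict[str, float], structure_hits: Dict[str, int]) -> List[Candidate]:
    """Merge text and structure candidates, then rank by SMILES match count.

    Ties are broken by BM25 score, then by passage ID (both assumptions).
    """
    passage_ids = set(text_hits) | set(structure_hits)
    merged = [
        Candidate(pid, text_hits.get(pid, 0.0), structure_hits.get(pid, 0))
        for pid in passage_ids
    ]
    return sorted(merged, key=lambda c: (-c.smiles_matches, -c.bm25_score, c.passage_id))
```

For instance, with text hits `{"p1": 3.2, "p2": 5.0}` and structure hits `{"p1": 2, "p3": 1}`, the ranking is p1, p3, p2: the passage with the highest BM25 score ranks last because it has no SMILES match.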

Citation

@misc{shahMultimodalSearchChemical2025,
  title = {Multimodal {{Search}} in {{Chemical Documents}} and {{Reactions}}},
  author = {Shah, Ayush Kumar and Dey, Abhisek and Luo, Leo and Amador, Bryan and Philippy, Patrick and Zhong, Ming and Ouyang, Siru and Friday, David Mark and Bianchi, David and Jackson, Nick and Zanibbi, Richard and Han, Jiawei},
  year = 2025,
  month = feb,
  number = {arXiv:2502.16865},
  eprint = {2502.16865},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2502.16865},
  archiveprefix = {arXiv}
}