Document Processing
Diagram showing page stream segmentation workflow: an input stream of pages is processed through binary classification of page pairs to predict document breaks, producing segmented output documents

LLMs for Page Stream Segmentation

Enhanced TabMe benchmark for page stream segmentation, creating TabMe++, showing fine-tuned decoder-based LLMs …

A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

MolParser-7M & WildMol: Large-Scale OCSR Datasets

MolParser-7M is a dataset of 7.7M image-text pairs for molecule-to-text recognition, pairing realistic molecule images with SMILES, InChI, and IUPAC names.

Stream accuracy versus relative throughput for Mistral-7B and XGBoost models

LLMs for Insurance Document Automation

LLM applications for insurance document automation using parameter-efficient fine-tuning and analysis of calibration …