Abstract
Page Stream Segmentation (PSS), the task of correctly dividing a sequence of pages into distinct documents, is a critical first step in automated document processing pipelines. Research in this area has been held back by the lack of high-quality, public datasets.
In this work, we address this issue by enhancing an existing benchmark, TABME, with commercial-grade Optical Character Recognition (OCR) to create TABME++. This new version significantly reduces noise and improves text detection, highlighting the critical importance of OCR quality for document understanding tasks.
We then conduct the first evaluation of large, decoder-based language models (LLMs) on the PSS task. Our findings show that models like Mistral-7B, when fine-tuned using parameter-efficient methods, decisively outperform smaller encoder-based models and traditional baselines. For instance, our best model correctly segments 80% of document streams in the test set without any errors.
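To make the task and the stream-level figure above concrete, here is a minimal Python sketch. It assumes PSS is framed as one binary decision per page (does this page start a new document?), which is a common framing but not necessarily the exact formulation used in this work. The function names `segment_stream` and `stream_exact_match` and the toy data are hypothetical.

```python
from typing import List, Sequence


def segment_stream(pages: Sequence[str], starts_new_doc: Sequence[bool]) -> List[List[str]]:
    """Group pages into documents given one boolean flag per page.

    starts_new_doc[i] is True when page i begins a new document.
    The first page always opens a document, whatever its flag says.
    """
    if len(pages) != len(starts_new_doc):
        raise ValueError("need exactly one flag per page")
    documents: List[List[str]] = []
    for i, (page, is_start) in enumerate(zip(pages, starts_new_doc)):
        if i == 0 or is_start:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def stream_exact_match(predicted: Sequence[Sequence[bool]],
                       gold: Sequence[Sequence[bool]]) -> float:
    """Fraction of streams segmented with no errors at all.

    A stream counts as correct only if every flag after the first page
    matches the gold flag (the first page is a document start by definition).
    """
    if not gold or len(predicted) != len(gold):
        raise ValueError("need the same non-zero number of predicted and gold streams")
    correct = sum(1 for p, g in zip(predicted, gold) if list(p[1:]) == list(g[1:]))
    return correct / len(gold)


# Toy stream: a two-page invoice, a one-page letter, a two-page form.
pages = ["invoice p1", "invoice p2", "letter p1", "form p1", "form p2"]
flags = [True, False, True, True, False]
docs = segment_stream(pages, flags)

# Two toy streams; the second prediction misses one document boundary.
gold_streams = [[True, False, True], [True, True, False, False]]
pred_streams = [[True, False, True], [True, False, False, False]]
score = stream_exact_match(pred_streams, gold_streams)
```

Under this reading, a single missed or spurious boundary makes the whole stream count as wrong, which is why a stream-level score is a much stricter measure than per-page accuracy.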
Key Contributions
- Enhanced Public Benchmark (TABME++): Re-processed the entire TABME dataset with commercial OCR, correcting significant text recognition errors and cutting the share of blank pages from 2.27% to 0.38%, a relative reduction of roughly 83%
- First Application of Large Decoder-Based LLMs to PSS: Systematically evaluated and fine-tuned billion-parameter, decoder-only LLMs for page stream segmentation
- State-of-the-Art Performance: Demonstrated that fine-tuned decoder models achieve superior results on TABME++, significantly outperforming previous encoder-based and multimodal approaches
- OCR Quality Analysis: Quantified the dramatic impact that high-quality OCR has on PSS model performance through comparative experiments
Technical Innovation
Our approach combines commercial-grade OCR processing with parameter-efficient fine-tuning of large language models. We enhanced the TABME dataset by replacing the original OCR output with commercial-quality text recognition, then used LoRA (Low-Rank Adaptation) to fine-tune decoder-based LLMs for the document segmentation task.
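As a rough illustration of why LoRA is parameter-efficient: it freezes a weight matrix W (shape d_out x d_in) and trains only a low-rank update B A, where B has shape d_out x r and A has shape r x d_in. The pure-Python sketch below shows the adapted forward pass and the trainable-parameter count. It is not the training code used in this work; the rank of 16, the scaling value, and all function names are illustrative assumptions, and 4096 is used because it is the hidden size of Mistral-7B.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    rank = len(a)  # A has shape r x d_in
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the pair (B, A) for one adapted matrix."""
    return rank * (d_out + d_in)


# Toy 2x3 layer with a rank-1 adapter.
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
a = [[1.0, 1.0, 1.0]]        # r x d_in
b_zero = [[0.0], [0.0]]      # d_out x r, zero-initialised as in LoRA
x = [1.0, 2.0, 3.0]

# With B = 0 the adapter is a no-op, so training starts from the base model.
unchanged = lora_forward(w, a, b_zero, x, alpha=2.0)

# After (hypothetical) training, B is non-zero and shifts the output.
b_trained = [[0.5], [-1.0]]
adapted = lora_forward(w, a, b_trained, x, alpha=2.0)

# One 4096 x 4096 projection: full fine-tuning vs. an assumed rank of 16.
full = 4096 * 4096
lora = lora_param_count(4096, 4096, rank=16)
fraction = lora / full
```

For that single projection the adapter trains 131,072 values instead of 16,777,216, under 1% of the matrix, which is what makes fine-tuning billion-parameter models practical on modest hardware.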
Significance
This work gives the research community an improved benchmark for document processing tasks and demonstrates a practical way to apply LLMs to real-world document automation. The enhanced TABME++ dataset and our findings on the importance of OCR quality apply directly to commercial document processing pipelines.
Citation
@article{heidenreich2024large,
title={Large Language Models for Page Stream Segmentation},
year={2024}
}