In the world of automated document processing, Page Stream Segmentation (PSS)1 is the “hello world” problem that remains surprisingly stubborn.

The task is deceptively simple: given a stack of scanned pages (invoices, contracts, medical records), determine where one document ends and the next begins.

For decades, this problem was tackled with brittle rules and heuristics. Then came the deep learning era, where we threw Convolutional Neural Networks (CNNs) and multipage Transformers at it. Yet, even sophisticated models struggled to achieve what businesses actually care about: Straight-Through Processing (STP).

Why STP Matters: It’s not just businesses that care about STP. Arguably, anyone who has to interact with the output of a system, deal with its mistakes, and perform corrections to get a job done cares about it. If the system fails 90% of the time, it isn’t automating; it’s creating more work at the expense of real people.

In this post, we’ll explore the three eras of PSS, why “99% accuracy” is a lie in this domain, and how context-driven sequence modeling is helping solve it.

The Hidden Complexity of PSS

Why is PSS hard? It comes down to ambiguity and asymmetry.

[Figure: Robot looking confused at a messy stack of documents]
Automation in PSS is rarely straightforward… Messy inputs and ambiguous boundaries often lead to confusion.

The “Document” Definition Problem

First, the concept of a “document” is not always discrete. It is context-dependent.

The “Word” Analogy: What is a word? A sequence of characters separated by spaces? Or a meaningful unit of language that can be a single character (e.g., “I”) or a compound (e.g., “New York”)? Is space a word, too? This ambiguity problem is not unique to documents and permeates all levels of language processing. We’d be naive to think PSS is an exception to this rule!

Consider an email with an attachment. Is the email body one document and the attachment another? Or is the whole packet one document? What about an invoice stapled to a check? A policy packet with multiple addendums?

“Solvability” implies a single ground truth, but in reality, PSS often requires aligning the model with specific, often subjective, business logic. A boundary to an underwriter might be a continuation to an archivist.

This subjectivity is a nightmare for rule-based systems. To solve it, we need models that don’t just match patterns, but can reason about context and semantics. This is precisely where the self-attention mechanisms of Transformers excel.

The Cost of Error

Second, the cost of failure is asymmetric.

If you are classifying an email as “Spam” or “Not Spam,” a single error affects one email. But PSS is a sequence problem. A single missed page break doesn’t just affect one page; it merges two distinct documents into one. This effectively “corrupts” two documents for the price of one error. Conversely, a false break splits a valid document in half.

[Figure: Diagram comparing classification errors vs sequence segmentation errors]
The blast radius of error: Unlike simple classification where one error affects one item, a single segmentation error corrupts the integrity of multiple documents.

Worse still, if our focus is truly STP, the only acceptable outcome is perfect segmentation of the entire document stream, and streams can be long: faxes sometimes run to hundreds of pages.

If we have a 99% page-level accuracy ($p=0.99$), the probability of correctly segmenting a 100-page stream ($N=100$) is only:

$$ P(\text{Success}) = p^N = 0.99^{100} \approx 0.37 $$

[Figure: Chart showing exponential decay of straight-through processing probability as document length increases]
Even high page-level accuracy (99%) results in low stream-level success rates for long documents due to the multiplicative nature of error probabilities.

In other words, even with “high” accuracy, the vast majority of document streams will require human intervention.
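
To see how quickly the odds collapse, the arithmetic fits in a few lines (plain Python; the stream lengths are just illustrative):

# Stream-level success compounds per page: P(success) = p ** N.
p = 0.99  # page-level accuracy
for n in (10, 50, 100, 300):
    print(f"{n:>3} pages -> P(perfect stream) = {p ** n:.3f}")
# 10 pages -> 0.904, 50 -> 0.605, 100 -> 0.366, 300 -> 0.049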

The “F1 Score” Trap

A major finding in our research (TabMe++) is that traditional metrics mask the operational reality. Although generalized text segmentation metrics like $P_k$ and WindowDiff exist, we found they don’t capture the document-centric nature of business workflows.

Instead, we evaluate at three levels (sketched in code right after this list):

  1. Page-Level: Did we correctly classify this single page transition?
  2. Document-Level: Did we correctly identify the entire document tuple $d_k = (p_i, \ldots, p_j)$?
  3. Stream-Level: Did we perfectly segment the entire stack of documents?
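
To make these three levels concrete, here is a minimal sketch of how they can be computed from per-page “new document” flags. The function is our own illustration, not the paper’s reference implementation, and MNDD is omitted since it follows the definition in Mungmeeprued et al. (2022):

def evaluate_stream(gold_starts, pred_starts):
    """gold_starts / pred_starts: per-page booleans, True where a new
    document begins (index 0 is always a start)."""
    n = len(gold_starts)

    # 1. Page level: accuracy over the individual start/continue decisions.
    page_acc = sum(g == p for g, p in zip(gold_starts[1:], pred_starts[1:])) / max(n - 1, 1)

    # Turn start flags into inclusive (first_page, last_page) tuples.
    def segments(starts):
        idx = [i for i, s in enumerate(starts) if i == 0 or s]
        return set(zip(idx, [i - 1 for i in idx[1:]] + [n - 1]))

    gold_docs, pred_docs = segments(gold_starts), segments(pred_starts)

    # 2. Document level: a document counts only if its exact page span is recovered.
    doc_recall = len(gold_docs & pred_docs) / len(gold_docs)

    # 3. Stream level: all-or-nothing; averaging this over streams gives STP.
    stream_ok = gold_docs == pred_docs
    return page_acc, doc_recall, stream_ok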

Our results showed that Page-Level F1 Score completely masks the downstream impact.

Consider a baseline XGBoost model we tested:

  • Page F1 Score: 0.83 (Sounds decent, right?)
  • STP: 0.07 (Abysmal)
  • MNDD: 10.85

That means 93% of document streams required human intervention. Even worse, the MNDD (Minimum Number of Drag-and-Drops)2 score tells us that for each stream, a human had to manually drag ~11 pages to fix the ordering.

This metric is crucial because it proxies the actual pain of the human in the loop. An error isn’t just a label flip; it’s a drag-and-drop operation.

Era 1: The Heuristic Era (2000s - 2015)

In the beginning, PSS was a game of if/else statements. Engineers hand-crafted heuristics tailored to specific document layouts, checking for signals like:

  • Does the page contain “Page 1 of X”?
  • Is there a “Total” line at the bottom?
  • Does the header text change drastically?

While effective for known templates, these systems were inherently brittle. They relied on rigid assumptions about the input structure. If a vendor changed their invoice layout or OCR quality dipped, the logic would fail. They worked perfectly for what they were designed for but had zero capability to generalize to the unknown. Unfortunately, the real world is a constant state of exception.
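
For flavor, a toy version of that era’s logic might look like the sketch below. It is a caricature rather than any specific production system, built from the same kinds of cues listed above:

import re

# Hand-crafted first-page cues of the heuristic-era variety.
FIRST_PAGE_CUES = [
    re.compile(r"\bpage\s+1\s+of\s+\d+\b", re.IGNORECASE),
    re.compile(r"^\s*invoice\b", re.IGNORECASE | re.MULTILINE),
]

def looks_like_new_document(page_text, prev_header="", header_lines=2):
    """Fires on template-specific cues, or when the header text differs
    from the previous page's header."""
    header = "\n".join(page_text.splitlines()[:header_lines])
    cue_hit = any(c.search(page_text) for c in FIRST_PAGE_CUES)
    header_changed = bool(prev_header) and header.strip() != prev_header.strip()
    return cue_hit or header_changed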

Era 2: The Encoder Era (2015 - 2023)

As deep learning matured, researchers moved from hard-coded rules to learned representations.

  • Visual Approaches: Using CNNs to look at the “shape” of a page as an image. First pages often look different from continuation pages (logos, big headers).
  • Word Vectors: Early NLP attempts used tools like doc2vec to represent page content, but these “averaged” the text, losing sequential meaning.
  • Multimodal Transformers: Eventually, models like LayoutLM and LEGAL-BERT tried to combine text and layout into a single understanding.

While these models were “smarter” than rules, they suffered from distinct limitations:

  1. Field Lag: Surprisingly, only a handful of studies applied Transformers to PSS before 2024. Most of the industry was still stuck on older CNN architectures.

  2. Context Windows: Encoder models like BERT are limited to 512 tokens. A dense legal contract page might have 1,000+ tokens. You had to chop the text, losing critical context.

  3. Modality Overload: Counterintuitively, our experiments showed that naively adding modalities (Text + Layout + Vision) often yielded diminishing returns. Models like LayoutLMv3 struggled to outperform simpler vision-only or text-only models on our benchmark.

    However, looking more closely at the data reveals an interesting nuance: visual signals matter. In our tests, the vision-only model (DiT) actually outperformed the text-only model (RoBERTa). Vision-only models often have better recall, whereas text-only models can better optimize for precision. The failure of the multimodal models wasn’t because vision is useless; it was because aligning these modalities is incredibly hard. This insight led us to a key realization for Era 3: What if we could give the model visual information without the architectural headache of a vision encoder?

Era 3: The Decoder Era (2024 - Present)

The breakthrough came with applying Decoder-only Large Language Models (LLMs) like Mistral-7B and Phi-3 to the task.

Why do LLMs succeed where specialized encoders failed? Contextual Processing.

Determining if a page is a continuation often requires analyzing sequential dependencies.

  • Does the sentence cut off mid-thought?
  • Does the next page logically follow the argument of the previous one?
  • Is the “Policy Number” on Page 2 the same as Page 1?

LLMs are pre-trained on the internet; they model narrative flow and document structure effectively. By fine-tuning them on pairs of pages, we adapted these priors to recognize specific segmentation boundaries.

The Secret Sauce: 2D Projection & Data Quality

We didn’t just dump raw text into the model. We employed 2D Text Projection, a technique that serializes OCR output by mapping spatial coordinates to whitespace. This effectively “draws” the layout using text characters, allowing the LLM to “see” columns, headers, and form structures without needing a vision encoder. This is how we solved the “Modality Overload” problem: we translated the visual signal (layout) into the only language the LLM speaks (text).

To be clear, this is a lossy compression. We discard font sizes, bolding, colors, and line separators. It is merely a cheap, zeroth-order approximation of 2D layout using 1D text. Yet, as our results show, this approximation captures the semantic essence of the layout (e.g., “this text is in a header column”) sufficient for the model to reason about document boundaries.

However, this technique has a hard dependency: Data Quality. 2D projection is useless if your OCR gives you garbage coordinates. This is where our work on TabMe++ (discussed below) became critical. You can’t project a layout if the OCR misses the text or places it in the wrong spot.

# Original Raw Text (Loss of Layout)
INVOICE # 1024 DATE: 2024-02-14 TOTAL: $500.00

# 2D Projected Text (Layout Preserved)
                    INVOICE # 1024
                    DATE: 2024-02-14

TOTAL:                                      $500.00

Why Does This Work? Modern LLMs are trained on a mixture of web text, code, and even some structured data. They have learned to interpret whitespace and formatting cues as part of their understanding of language. By encoding layout information into the text itself, we leverage the LLM’s existing capabilities without needing to train a separate vision encoder.
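
As a rough illustration of the idea (not the exact implementation used in our experiments), here is a minimal sketch. It assumes the OCR engine returns word-level text with top-left coordinates in page units; the grid resolution and rounding are arbitrary choices:

def project_2d(words, page_width, page_height, cols=100, rows=60):
    """Render OCR words onto a character grid so whitespace mimics layout.
    words: iterable of (text, x, y) with top-left coordinates in page units."""
    grid = [[" "] * cols for _ in range(rows)]
    for text, x, y in words:
        row = min(int(y / page_height * rows), rows - 1)
        col = min(int(x / page_width * cols), cols - 1)
        for i, ch in enumerate(text):
            if col + i < cols:
                grid[row][col + i] = ch  # later words may overwrite; fine for a sketch
    lines = ["".join(r).rstrip() for r in grid]
    # Collapse runs of blank rows so vertical gaps survive without noise.
    out, prev_blank = [], False
    for line in lines:
        blank = not line
        if not (blank and prev_blank):
            out.append(line)
        prev_blank = blank
    return "\n".join(out).strip("\n")

words = [("INVOICE # 1024", 300, 40), ("DATE: 2024-02-14", 300, 80),
         ("TOTAL:", 40, 700), ("$500.00", 520, 700)]
print(project_2d(words, page_width=612, page_height=792))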

We then wrapped this input in a structured prompt that explicitly framed the task for the LLM:

You are a skilled document reviewer. Given extracted text from pages of documents, your task is to determine if a page starts a new document or continues from the previous one.
...
Prior text:
###
{pg_prev}
###
Page text:
###
{pg}
###
Output your prediction as a JSON object...

The Results

We formulated the task as a binary classification problem on page pairs. We fed the model (Page N, Page N+1) and asked: “Does Page N+1 start a new document?”
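
In code, the pairing and the reassembly of predictions into documents are straightforward. The sketch below is illustrative only: it assumes page texts are already extracted, and the JSON key name is a placeholder, since the exact output schema is elided in the prompt above:

import json

def page_pairs(pages):
    """Yield (prev_text, curr_text, page_index) for every page after the
    first; page 0 is a document start by definition."""
    for i in range(1, len(pages)):
        yield pages[i - 1], pages[i], i

def parse_reply(reply_text):
    # Hypothetical schema: the model is asked to emit a JSON object.
    return bool(json.loads(reply_text).get("new_document", False))

def to_documents(num_pages, is_new_doc):
    """is_new_doc[i]: model's decision for page i (index 0 forced to True).
    Returns documents as inclusive (start_page, end_page) tuples."""
    starts = [i for i in range(num_pages) if i == 0 or is_new_doc[i]]
    ends = [s - 1 for s in starts[1:]] + [num_pages - 1]
    return list(zip(starts, ends))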

Comparison on TabMe++ Benchmark:

Model Type   Model Name                Page F1   STP (Higher is better)   MNDD (Lower is better)
Baseline     XGBoost                   0.83      7.4%                      10.85
Encoder      RoBERTa (Text)            0.78      4.2%                      12.17
Encoder      DiT (Vision)              0.83      6.6%                      10.48
Decoder      Mistral-7B (Fine-Tuned)   0.99      80.0%                     0.81

The difference is stark. Moving from Encoders to Decoders increased the automation rate from ~7% to 80% and reduced the human effort (MNDD) by a factor of 10.

[Figure: Sample efficiency plot showing rapid convergence in under 1,000 updates]
LLMs adapt fast. Our decoder models converged in fewer than 1,000 updates, suggesting strong priors for document structure.

Why Fine-Tuning Matters: The GPT-4o Comparison

You might look at the chart above and ask: “Is the model learning PSS, or does it just rely on pre-trained language statistics?”

To test this, we ran GPT-4o in a zero-shot setting on the same task. The result? An STP of roughly 9%.

Think about that. The most capable model in the world (at the time of publication), without specific instruction tuning, performed barely better than our XGBoost baseline. Yet, our much smaller 7B model achieved 80% STP after fewer than 1,000 updates.

This proves two things:

  1. Broad Pre-training isn’t enough. Modeling generic document distributions doesn’t equate to capturing specific business logic for segmentation.
  2. The Capabilities are Latent. The fact that the model converges so quickly implies it has the necessary statistical priors; it just needs a small “nudge” (fine-tuning) to align those priors with the specific task. We aren’t teaching it to “read”; we are adjusting the decision boundary between a generic “document” and a specific business record.

The Cost of Intelligence and the Value of Human Time

Critically, we must address the two elephants in the room: Inference Cost and Data Privacy.

It is true that running a 7B parameter LLM for every page pair is computationally more expensive than a lightweight XGBoost model. However, focusing solely on compute costs misses the operational and human reality of this work.

Economically, the “cheap” model is a mirage. When a low-accuracy model forces a human to reorganize 93% of document streams, the cost of rectification, specifically wasted salaries and slowed turnaround times, dwarfs the cost of GPU inference. But the financial argument is secondary to the human one.

Manually segmenting documents is, frankly, soul-sucking. It is tedious, repetitive drudgery that few people enjoy. We aren’t just talking about operational expense; we are talking about human burnout. A model that achieves 80% full automation (STP) doesn’t just save money; it liberates people from the mind-numbing task of sorting pages so they can focus on work that actually requires their creativity and empathy. We are trading cheap FLOPs for valuable human attention.

Furthermore, democratizing this capability has profound implications beyond the enterprise. If we can make high-quality segmentation usable on modest hardware (like a high-end laptop or a single commodity GPU), we open the door for archivists, librarians, digital humanists, and small cities or towns that have little to no resources for this kind of work. These are the custodians of our collective intelligence, often working with massive, unorganized scanned collections but lacking the budget for massive cloud clusters.

Our results showed that 7B parameter models (like Mistral) are sufficient to solve this task. This size is the sweet spot: powerful enough to reason, but small enough to run locally. This matters not just for data sovereignty (keeping medical records private), but for accessibility. It means a small historical society could potentially automate the organization of a century’s worth of digitized records without a massive grant for cloud compute.

That said, a 7B model might not be the lower bound. While it was the breakthrough size for our study, the recent explosion of capable 1B-3B models suggests we haven’t hit the efficiency floor yet. Combined with extreme quantization, modern small language models (SLMs) likely offer the “Goldilocks” zone: enough reasoning to maintain high STP, but fast enough to run continuously on modest hardware. We suspect the future of PSS lies in these highly optimized, smaller reasoning models that can run anywhere… from a bank’s secure server to a researcher’s laptop.

The Unsung Hero: Data Quality

Algorithms weren’t the only bottleneck. Data was, too. Most public datasets (like Tobacco800) were too small or unrealistic. The TABME dataset (precursor to our work) existed, but it relied on open-source Tesseract OCR, which missed vast amounts of text.

We released TabMe++, which re-processed the entire dataset with commercial-grade Microsoft OCR.

  • Blank Pages: Reduced from 2.27% $\rightarrow$ 0.38%.
  • Token Count: Increased from 719M $\rightarrow$ 9.5B.

[Figure: Scanned document page showing a sales forecast with some noise]
Original Page: A noisy scan that Tesseract struggles to read.

The difference in intelligibility is night and day. Consider the page above.

Tesseract (Original):

02Z10102

(Misses almost everything, including the title and real ID)

Microsoft OCR (TabMe++):

            SALES FORECAST
                            201017205

(Correctly captures the spatial layout, the title, and the ID)

Lesson: You can’t segment what you can’t read. High-quality OCR (or multimodal front-ends) is the foundation of high-quality downstream NLP.

The Next Frontier: Context and Instruction Following (2026+)

As we discussed earlier, the definition of a “document” is subjective. To one team, an email + attachment is a single record. To another, they are distinct entities. A rigid model that segments perfectly for Team A will fail miserably for Team B.

The failure of Zero-Shot GPT-4o (9% STP) proves that scale alone doesn’t solve this. The future of PSS isn’t just about bigger models; it’s about instruction tuning. We need models that can accept natural language rules alongside the document stream:

“Split all invoices, but keep attachments with their parent emails. If you see an ACORD form, group it with the subsequent policy document.”

This shift mirrors the broader evolution of LLMs. Just as we moved from “next token prediction” (raw completion) to “instruction following” (chat), PSS models must evolve from static classifiers to dynamic systems. A single model should be able to adapt to any business logic without retraining.

Furthermore, while our 2024 research favored unimodal text models with 2D projection, the multimodal landscape is shifting. With the rise of natively multimodal models (like Gemini, GPT-4o, and our own GutenOCR), we effectively get the “2D projection” for free. The kings of 2026 will be models that can seamlessly fuse this native visual understanding with deep semantic reasoning, all guided by user-defined constraints.

Conclusion

Page Stream Segmentation is a perfect case study in the evolution of AI. We moved from encoding rules (Heuristic Era) to encoding features (Encoder Era) to encoding understanding (Decoder Era).

For enterprise professionals, the takeaways are clearer and more critical than ever.

First, stop looking at element-wise F1 scores for sequence tasks. While element-wise metrics are useful for engineers debugging algorithms, they are misleading for decision-makers. Focus on the metrics that actually affect people and workflows, like Straight-Through Processing (STP) and Minimum Number of Drag-and-Drops (MNDD).

Second, if you want to solve PSS today, start with an inward-looking conversation about “for what.” Before picking a model, answer these questions:

  • Inputs: What assumptions are you making about your document stream?
  • Outcomes: What specific business outcomes are you hoping to see?
  • Context: What is the core motivation for this workflow?
  • Nuance: Are there informative scenarios (like the “email attachment” problem) that illustrate your specific needs?

Given these answers, many modern approaches can unlock PSS for you. Whether you need an on-premise solution for secure scenarios using lightweight open-weights models, or can leverage powerful AI-as-a-Service APIs, the technology is no longer the bottleneck; understanding your own requirements is.

For full technical details, experimental setups, and datasets, refer to our paper: Large Language Models for Page Stream Segmentation or view the preprint on arXiv:2408.11981. Much of the initial work was also documented in a precursor blog series at Roots Automation (Part 1 & Part 2).


  1. Historically, this task has gone by many names: document separation, document flow segmentation, document stream segmentation, document bundle separation, and page stream separation. We stick to Page Stream Segmentation (PSS) to emphasize the sequential nature of the problem. ↩︎

  2. We adopted the MNDD metric from Mungmeeprued et al. (2022), who introduced it alongside the original TABME dataset to better quantify the human effort required to correct segmentation errors. ↩︎