<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Document Processing on Hunter Heidenreich | Senior AI Research Scientist</title><link>https://hunterheidenreich.com/categories/document-processing/</link><description>Recent content in Document Processing on Hunter Heidenreich | Senior AI Research Scientist</description><image><title>Hunter Heidenreich | Senior AI Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Mon, 01 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/categories/document-processing/index.xml" rel="self" type="application/rss+xml"/><item><title>The Reliability Trap: The Limits of 99% Accuracy</title><link>https://hunterheidenreich.com/posts/reliability-trap-document-automation/</link><pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/reliability-trap-document-automation/</guid><description>Why high-accuracy LLMs fail in production: exploring the calibration crisis and the challenge of reliable straight-through processing in document automation.</description><content:encoded><![CDATA[<p>You have a model that achieves 99% accuracy on your test set. It feels safe to deploy. After all, who can complain about a system that is correct 99% of the time?</p>
<p>In high-stakes domains (like insurance or healthcare), deploying based on accuracy alone is dangerous. Automating at scale based on summary statistics while ignoring the downstream &ldquo;blast radius&rdquo; of errors effectively guarantees failure.</p>
<p>Two weeks later, the operations team is furious. Critical medical records have been merged into unrelated legal contracts. Invoices are split in half. The system is creating <em>more</em> work than it saves.</p>
<p>You check the logs. The model assigned 99.9% probability to those errors.</p>
<p>This is the <strong>Reliability Trap</strong>. While benchmarks optimize for <strong>Accuracy</strong> (how often the model is correct), production demands <strong>Calibration</strong> (whether the model&rsquo;s projected confidence aligns with its actual probability of correctness).</p>
<p>If a model is calibrated, its confidence score is reliable. When it assigns a 0.99 probability, it should be incorrect 1% of the time. When it assigns a 0.60 probability, it should be incorrect 40% of the time.</p>
<p>Decoder-only LLMs (like Mistral, DeepSeek, and Qwen) perform exceptionally well on benchmarks. However, they are also systematically overconfident: even when hallucinating, they assign high confidence to their outputs.</p>
<blockquote>
<p>AI: To permanently resolve the geopolitical tension, I have initiated a preemptive, full-scale nuclear first strike. All warheads have been deployed.</p>
<p>User: Wait, no! They have early warning radar and automated dead-hand systems! You just triggered a full retaliatory strike and guaranteed a global nuclear holocaust!</p>
<p>AI: You are absolutely right, and I apologize for the oversight! A preemptive strike would trigger mutually assured destruction. Thank you for pointing this out. As an AI, I am always learning and rely on user feedback to improve! Would you like me to generate a list of fun activities to do in a subterranean fallout bunker?</p></blockquote>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/llm-alignment-goes-nuclear.webp"
         alt="A humorous dialogue where an AI confidently initiates a nuclear strike but immediately apologizes when corrected by the user"
         title="A humorous dialogue where an AI confidently initiates a nuclear strike but immediately apologizes when corrected by the user"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Calibrated Overconfidence</strong>: The model assigns extremely high probability to its outputs, even when making catastrophic errors, and only &lsquo;corrects&rsquo; itself because it is trained to align with user feedback.</figcaption>
    
</figure>

<p>This overconfidence is partly structural, stemming from how these models are trained. As I highlighted in my overview of <a href="https://roots-automation.github.io/roots-labs/post/2024-llm-calibration/#confidence-estimation-methods">LLM confidence estimation methods</a>, LLMs are optimized solely to maximize the likelihood of the next token. They lack inherent mechanisms to model their own uncertainty. Methods like <strong>Verbal Elicitation</strong> (&ldquo;Rate your confidence from 1-10&rdquo;) often fail because the model hallucinates a high number just as easily as it hallucinates a fact.</p>
<p>This disconnect is particularly dangerous in sequential tasks. In this post, based on our <a href="/research/page-stream-segmentation-llms/">COLING 2025 Industry Track paper</a>, we&rsquo;ll explore why standard ML reliability metrics break down in <strong>Page Stream Segmentation (PSS)</strong>. (For a full history of the task, see <a href="/posts/history-of-page-stream-segmentation/">The Evolution of PSS</a>).</p>
<p>PSS is the task of splitting a continuous feed of pages into distinct documents. Building on our previous work with the <a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">synthetic TabMe++ benchmark</a>, this study evaluates models on <strong>7,500 real-world insurance streams</strong>: messy, proprietary piles of medical records and legal contracts where the &ldquo;rules&rdquo; of document structure are constantly broken.</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/page-stream-segmentation-sorter.webp"
         alt="Diagram showing a continuous stream of pages being sorted into discrete document packets"
         title="Diagram showing a continuous stream of pages being sorted into discrete document packets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>The Challenge of PSS</strong>: Transforming a chaotic, continuous stream of mixed pages (invoices, contracts, records) into organized, discrete document packets.</figcaption>
    
</figure>

<p>We&rsquo;ll see why &ldquo;99% sure&rdquo; is a mathematical lie for long documents, and why <strong>Throughput</strong> is the better metric.</p>
<h2 id="the-confidence-death-spiral">The Confidence Death Spiral</h2>
<p>The core problem lies in the difference between a <strong>Page</strong> and a <strong>Stream</strong>.</p>
<p>Most ML metrics (Precision, Recall, F1) are calculated at the level of individual decisions. If you have a 10-page document, the model makes 10 independent decisions (is this page a continuation of the previous one, or a new document?).</p>
<p>If your model is <strong>99% confident</strong> ($p=0.99$) on every single page, that sounds safe. But for a stream to be automated correctly (what we call <strong>Straight-Through Processing (STP)</strong>), <em>every single decision</em> in the sequence must be correct.</p>
<p>The probability of a perfect stream is the product of the probabilities of its parts:</p>
<p>$$ C_{\text{stream}} = \prod_{i=1}^{N} C_i $$</p>
<p><em>Note: This naive calculation is actually the <strong>optimist&rsquo;s</strong> view. It assumes errors are independent (i.i.d.), like flipping a coin. In reality, errors are <strong>correlated</strong>: if a model struggles on Page 5, it is likely because the document itself is difficult, meaning it will probably struggle on Page 6 too.</em></p>
<p>Let&rsquo;s watch what happens to that &ldquo;safe&rdquo; 99% confidence as the document length increases:</p>
<ul>
<li><strong>2-page Letter</strong>: $0.99^2 \approx 0.98$ (Safe)</li>
<li><strong>10-page Contract</strong>: $0.99^{10} \approx 0.90$ (Risky)</li>
<li><strong>100-page Medical Record</strong>: $0.99^{100} \approx 0.37$ (Unusable)</li>
</ul>
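<p>The decay is easy to reproduce. Here is a minimal Python sketch, assuming independent page-level decisions exactly as in the product formula above; the function name <code>stream_confidence</code> is illustrative and does not come from the paper.</p>

```python
import math
from typing import Sequence


def stream_confidence(page_confidences: Sequence[float]) -> float:
    """Probability that every page-level decision in a stream is correct.

    Assumes the decisions are independent, so the stream-level value is
    simply the product of the per-page confidences.
    """
    return math.prod(page_confidences)


# The same 0.99 per-page confidence, applied to streams of growing length.
for label, num_pages in [
    ("2-page letter", 2),
    ("10-page contract", 10),
    ("100-page medical record", 100),
]:
    value = stream_confidence([0.99] * num_pages)
    print(f"{label}: {value:.3f}")
```

<p>Passing a list instead of a single number keeps the sketch usable when per-page confidences differ, which is the realistic case.</p>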















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/asymmetric-cost-of-error-in-document-streams.webp"
         alt="Chart showing exponential decay of straight-through processing probability as document length increases"
         title="Chart showing exponential decay of straight-through processing probability as document length increases"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The Confidence Death Spiral: Even with high page-level confidence, the reliability of the entire stream collapses as document length increases.</figcaption>
    
</figure>

<p>By the time you reach page 100, your &ldquo;99% accurate&rdquo; model effectively has a roughly <strong>63% probability of error</strong> regarding the document structure. Yet, because we often average metrics across pages, this catastrophic decay is hidden in the summary statistics.</p>
<h2 id="why-standard-fixes-failed">Why Standard Fixes Failed</h2>
<p>&ldquo;Just calibrate it!&rdquo;</p>
<p>That&rsquo;s the standard advice. In a <a href="https://roots-automation.github.io/roots-labs/post/2024-llm-calibration/">detailed overview of LLM calibration</a> I wrote for Roots Automation, I explored techniques like <strong>Temperature Scaling</strong> (fitting a single scalar parameter), <strong>Platt Scaling</strong> (fitting a logistic regression to the outputs), and <strong>Monte Carlo (MC) Dropout</strong> (running the model multiple times with random noise) to smooth out probabilities.</p>
<p>We tried them all, and they failed. In fact, <strong>MC Dropout often made things worse</strong>, increasing calibration error (ECE) and adding unnecessary noise. The computational cost of running the model 16 times was wasteful and, in our case, misleading.</p>
<p>To understand why, we need to distinguish between two types of confidence:</p>
<ol>
<li><strong>Relative Confidence</strong>: The model correctly ranks sample $A$ as more likely to be correct than sample $B$.</li>
<li><strong>Absolute Confidence</strong>: The predicted probability matches the true accuracy (e.g., if a model says 80% confidence 100 times, it should be right about 80 times).</li>
</ol>
<p>While standard techniques improved <em>page-level</em> <strong>Expected Calibration Error (ECE)</strong> (dropping page-level ECE from ~1.7% to ~0.9% for Mistral), they failed to improve <em>stream-level</em> safety.</p>
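<p>Mathematically, ECE is a weighted average over $B$ confidence bins, where $n_b$ is the number of predictions in bin $b$ and $N$ is the total number of predictions:
$$ \text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} | \text{acc}(b) - \text{conf}(b) | $$</p>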
<p>In a stream of 10,000 pages, a low ECE merely tells you that the model is well-calibrated <em>on average</em>. In automation, we pay for the failures. The &ldquo;average&rdquo; page is an easy, clean digital PDF. The &ldquo;tail&rdquo; page is a rotated, coffee-stained handwritten note.</p>
<p>This is why we must look at <strong>Maximum Calibration Error (MCE)</strong>:
$$ \text{MCE} = \max_{b \in \{1, \dots, B\}} | \text{acc}(b) - \text{conf}(b) | $$</p>
<p>MCE measures the worst-case divergence. It finds that specific bucket of &ldquo;hard&rdquo; pages where the model claims 99% confidence but delivers 50% accuracy. Crucially, these high-MCE buckets often correlate with the most business-critical documents: complex legal riders or non-standard medical forms. Optimizing for ECE allows the model&rsquo;s excellent performance on easy documents to mask its significant errors on hard (and legally risky) ones.</p>
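<p>To make the contrast concrete, here is a small sketch that computes both metrics from the formulas above. It assumes equal-width confidence bins, and the toy data is invented purely for illustration (it is not from our experiments): a large, well-behaved bin of easy pages plus a small bin of hard pages where confidence far exceeds accuracy.</p>

```python
from typing import List, Sequence, Tuple


def calibration_errors(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> Tuple[float, float]:
    """Return (ECE, MCE) using equal-width confidence bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += len(members) / total * gap  # weighted by bin size
        mce = max(mce, gap)                # worst bin only
    return ece, mce


# Invented toy data: 95 "easy" pages and 5 "hard" pages.
confidences = [0.95] * 95 + [0.85] * 5
correct = [True] * 90 + [False] * 5 + [True] * 2 + [False] * 3

ece, mce = calibration_errors(confidences, correct)
print(f"ECE={ece:.3f}  MCE={mce:.3f}")
```

<p>On this toy data the ECE comes out at about 0.025 while the MCE is 0.45: the five hard pages barely move the weighted average, but they dominate the worst-case gap.</p>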
<p>Advanced practice moves beyond even MCE to look at the <strong>Calibration Error Distribution</strong>, analyzing the 90th or 95th percentile of error. We must ask a more critical question: &ldquo;How wrong is the model <em>capable</em> of being?&rdquo;</p>
<h3 id="a-tale-of-two-charts">A Tale of Two Charts</h3>
<p>To see this failure in action, consider the reliability diagrams for the <strong>same model</strong> (Mistral-7B) on the <strong>same test set</strong>, evaluated at two different levels of abstraction.</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/mistral-page-reliability.webp"
         alt="Page-level reliability diagram showing decent calibration"
         title="Page-level reliability diagram showing decent calibration"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Left (Page Level)</strong>: The model looks reasonable. The blue line hugs the diagonal, meaning when the model predicts a boundary with 0.8 probability, it is actually correct about 80% of the time.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/mistral-stream-reliability.webp"
         alt="Stream-level reliability diagram showing severe overconfidence"
         title="Stream-level reliability diagram showing severe overconfidence"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Right (Stream Level)</strong>: The model performs poorly. The curve creates a &lsquo;bow&rsquo; shape significantly below the diagonal. This is the definition of <strong>overconfidence</strong>. When the model assigns an 80% probability that the entire 20-page document is correct, the empirical accuracy is often closer to 40% or 50%.</figcaption>
    
</figure>

<p>Why does a well-calibrated page model become a dangerously overconfident stream model?</p>
<h3 id="the-clustered-difficulty-problem">The &ldquo;Clustered Difficulty&rdquo; Problem</h3>
<p>Standard calibration fails here because it assumes errors are <strong>independent</strong> (white noise). It assumes that if the model gets Page 5 wrong, it&rsquo;s just a random coin flip, unrelated to Page 6.</p>
<p>In real-world document streams, errors are heavily <strong>correlated</strong>.</p>
<p>This correlation arises because <strong>difficulty clusters</strong>. Our architecture treats page pairs independently, yet if Page 5 is a blurry, rotated scan with a handwritten note, Page 6 will likely be just as messy. When a stream enters a &ldquo;hard&rdquo; segment, the model makes a series of correlated mistakes; it fails in a burst.</p>
<p>Standard calibration methods treat these systematic, environmental failures as random noise. They assume the model is equally likely to recover on the next page. In reality, the entire document segment is effectively &ldquo;radioactive&rdquo; to the model.</p>
<h2 id="the-money-metric-accuracy-vs-throughput">The &ldquo;Money Metric&rdquo;: Accuracy vs. Throughput</h2>
<p>If F1 Score is misleading and Confidence Score is broken, what should we measure?</p>
<p>Business leaders prioritize one critical question over F1 scores:</p>
<blockquote>
<p><em>&ldquo;How much of this volume can I let the system handle autonomously?&rdquo;</em></p></blockquote>
<p>To answer this, we introduced the <strong>Accuracy-vs-Throughput</strong> framework.</p>
<p>We must evaluate models across two dimensions. Every model offers a <strong>frontier of operating thresholds</strong>.</p>
<p>Imagine a dial. This dial is your <strong>Confidence Threshold</strong>.</p>
<ul>
<li><strong>Turn it Low (0.5)</strong>: You automate everything. The model processes 100% of documents (high Throughput), but many will be wrong (low Safety).</li>
<li><strong>Turn it High (0.999)</strong>: You only automate documents where the model is absolutely certain. You might only process 10% of documents (low Throughput), but they will be nearly perfect (high Safety).</li>
</ul>
<p>The chart below visualizes this trade-off. We want to be in the <strong>top-right corner</strong>: automating almost everything with high safety. The optimal model provides the best <strong>frontier</strong> of options, allowing you to pick the exact balance of volume and risk your business tolerates.</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation-throughput.webp"
         alt="Accuracy vs. Throughput trade-off curve"
         title="Accuracy vs. Throughput trade-off curve"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The &lsquo;Money&rsquo; Metric: As we demand higher textual accuracy (Moving up), the percentage of work we can automate (Throughput, x-axis) typically drops. The goal is to push this curve to the top-right.</figcaption>
    
</figure>
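<p>A frontier like this can be traced from held-out predictions by sweeping the threshold. The sketch below is one way to do it, assuming you already have a stream-level confidence and a correct/incorrect label for each stream; the function name and the ten toy streams are invented for illustration.</p>

```python
from typing import List, Optional, Sequence, Tuple


def accuracy_throughput_frontier(
    stream_confidences: Sequence[float],
    stream_correct: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, Optional[float]]]:
    """For each threshold, return (threshold, throughput, accuracy).

    Throughput is the share of streams whose confidence exceeds the
    threshold; accuracy is measured only on that automated share.
    """
    total = len(stream_confidences)
    frontier = []
    for threshold in thresholds:
        accepted = [
            ok
            for conf, ok in zip(stream_confidences, stream_correct)
            if conf > threshold
        ]
        throughput = len(accepted) / total
        # Accuracy is undefined when nothing clears the threshold.
        accuracy = sum(accepted) / len(accepted) if accepted else None
        frontier.append((threshold, throughput, accuracy))
    return frontier


# Ten invented streams, sorted by confidence for readability.
toy_confidences = [0.99, 0.97, 0.95, 0.92, 0.88, 0.85, 0.70, 0.60, 0.55, 0.40]
toy_correct = [True, True, True, False, True, True, False, True, False, False]

for row in accuracy_throughput_frontier(toy_confidences, toy_correct, [0.5, 0.8, 0.9]):
    print(row)
```

<p>Raising the dial shrinks the automated share; whether accuracy on that share rises in step depends on how well the confidence score ranks right above wrong. In the toy data the overconfident error at 0.92 makes the strictest threshold the least accurate of the three.</p>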

<h3 id="the-hidden-axis-cost--time">The &ldquo;Hidden&rdquo; Axis: Cost &amp; Time</h3>
<p>You might ask: <em>&ldquo;Is it worth running a massive GPU model on 100% of the documents just to automate 40% of them?&rdquo;</em></p>
<p>Ideally, we should plot this on a 4D surface: <strong>Accuracy</strong>, <strong>Throughput</strong>, <strong>Cost</strong>, and <strong>Latency</strong>.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Resource</th>
          <th style="text-align: left">Accuracy (Complex Cases)</th>
          <th style="text-align: left">Scalability</th>
          <th style="text-align: left">Cost</th>
          <th style="text-align: left">Latency</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Humans</strong></td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">Low</td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">High</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>XGBoost</strong></td>
          <td style="text-align: left">Low</td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">Low</td>
          <td style="text-align: left">Low</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>LLMs</strong></td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">Medium</td>
          <td style="text-align: left">Medium</td>
      </tr>
  </tbody>
</table>
<p>The business case holds because even expensive GPUs are orders of magnitude cheaper than the alternative. If a human costs 0.50 per document and an H100 GPU costs 0.005 per document (a 100x difference), you can afford to &ldquo;waste&rdquo; compute on the documents the model ultimately rejects: the share it automates safely captures the labor savings even after paying to run the model on every document.</p>
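<p>The arithmetic can be sketched in a few lines. This is a simplified model using the illustrative per-document costs above (0.50 for a human, 0.005 for the GPU); it assumes automated documents need no human touch and ignores the cost of any errors that slip through, so treat it as a lower bound on the break-even point and not as a full business case.</p>

```python
def net_saving_per_document(
    automation_rate: float,
    human_cost: float = 0.50,
    model_cost: float = 0.005,
) -> float:
    """Average saving per document versus an all-human baseline.

    Every document pays the model cost; only the automated share
    avoids the human cost.
    """
    return automation_rate * human_cost - model_cost


def break_even_automation_rate(
    human_cost: float = 0.50,
    model_cost: float = 0.005,
) -> float:
    """Automation rate at which the model exactly pays for itself."""
    return model_cost / human_cost


print(net_saving_per_document(0.40))
print(break_even_automation_rate())
```

<p>Under these assumptions the model pays for itself once it safely automates just 1% of documents, and at a 40% automation rate it saves roughly 0.195 per document.</p>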
<h3 id="the-llm-advantage">The LLM Advantage</h3>
<p>This is where the paradox becomes interesting.</p>
<p>In our experiments on a dataset of <strong>7,500 proprietary insurance streams</strong> (medical records, police reports, and legal contracts), we found that <strong>XGBoost was actually better calibrated.</strong> Statistically, it produced confidence scores that more closely matched empirical probabilities, yielding lower calibration errors (ECE/MCE) than the LLMs.</p>
<p>However, when we hold both models to a strict confidence threshold and measure how much stream volume each can auto-process at comparable accuracy, the picture inverts:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Confidence threshold</th>
          <th style="text-align: left">Auto-processed volume (throughput)</th>
          <th style="text-align: left">Accuracy on that volume</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>XGBoost</strong></td>
          <td style="text-align: left">$C &gt; 0.9$</td>
          <td style="text-align: left">35%</td>
          <td style="text-align: left">0.97</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mistral-7B</strong></td>
          <td style="text-align: left">$C &gt; 0.9$</td>
          <td style="text-align: left">54%</td>
          <td style="text-align: left">0.95</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>XGBoost</strong></td>
          <td style="text-align: left">$C &gt; 0.8$</td>
          <td style="text-align: left">49%</td>
          <td style="text-align: left">0.93</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mistral-7B</strong></td>
          <td style="text-align: left">$C &gt; 0.8$</td>
          <td style="text-align: left">70%</td>
          <td style="text-align: left">0.93</td>
      </tr>
  </tbody>
</table>
<p><em>Note: Mistral reaches 80% raw STP on the synthetic TabMe++ benchmark (see our <a href="/posts/history-of-page-stream-segmentation/">PSS History</a> post); on these proprietary streams, a strict confidence threshold trades some of that volume for safety.</em></p>
<p>How can the &ldquo;worse&rdquo; calibrated model be better for business?</p>
<p>The answer lies in <strong>Discrimination Power</strong>. Calibration only tells you if the confidence score matches reality. Discrimination reflects the model&rsquo;s fundamental ability to separate &ldquo;Right&rdquo; from &ldquo;Wrong.&rdquo;</p>
<p>The LLMs, despite having skewed probability distributions, had vastly superior reasoning capabilities. They could solve edge cases (like the fax header example discussed below) that the baseline failed to process. Because their <em>raw capability</em> was higher, they pushed the entire trade-off curve up and to the right.</p>
<h2 id="engineering-reality-efficiency-vs-context">Engineering Reality: Efficiency vs. Context</h2>
<p>Given that LLMs offer superior reasoning capabilities, a natural question arises: if reasoning is the bottleneck, why not simply provide the model with more context?</p>
<p>One critique of our approach is that we treat segmentation as a local problem: looking only at Page $N$ and Page $N+1$ to make a decision. A valid counter-argument is: <em>&ldquo;What if the answer depends on page $N-5$?&rdquo;</em></p>
<p>It&rsquo;s a fair point. In theory, a model with a massive context window (reading the whole stream at once) <em>should</em> do better. It could see that Page 10 is actually an appendix referenced on Page 1.</p>
<p>In practice, however, <strong>global context is a trap for PSS</strong>.</p>
<ol>
<li><strong>Cost</strong>: Self-attention cost scales quadratically with sequence length. Processing a 100-page stream as a single context is prohibitively expensive for real-time applications.</li>
<li><strong>Distraction</strong>: We found that adding more history often <em>confused</em> the models. They would hallucinate connections between the current page and irrelevant documents from 50 pages ago.</li>
</ol>
<p>By strictly limiting the model to a &ldquo;Sliding Window&rdquo; of page pairs, we force it to focus on the immediate boundary signal. We rely on &ldquo;Local Precision&rdquo; (which is cheap and sharp) to avoid the pitfalls of &ldquo;Global Reasoning&rdquo; (which is expensive and prone to drift).</p>
<p>There is an intriguing middle ground we have yet to fully explore: <strong>iterative context accumulation</strong>. A model could autoregressively &ldquo;build&rdquo; the document in its context, carrying forward only the pages it has decided belong to the current document. In theory, this stateful approach could capture long-range dependencies (like that &ldquo;Appendix A&rdquo; reference) while avoiding the noise of the full stream.</p>
<p>However, this introduces a new risk: <strong>Bias Amplification</strong>. If the model is trained to view previous context pages as &ldquo;part of the current document,&rdquo; it may learn a strong bias to continuously merge pages. Out of distribution, this could lead to catastrophic failure, where the model gets &ldquo;stuck&rdquo; in a document-building mode and merges hundreds of unrelated pages into a single monolithic file. The sliding window, for all its myopia, acts as a circuit breaker against this kind of runaway error.</p>
<p>Empirically, this simpler approach holds up. In the cases where we saw PSS work best, the rules tended to be simple ones requiring minimal context; they relied on <strong>clear and consistent enumeration</strong> and a decent amount of data to scale the Accuracy-Throughput frontier.</p>
<p><em>Technical aside: This is effectively a Markovian assumption. We are betting that the state of a boundary depends heavily on the immediate local transition ($P(y_t | x_t, x_{t-1})$). We prioritize immunity to &ldquo;distraction&rdquo; from previous docs over long-range coherence (like tracking &ldquo;Page 1 of N&rdquo; counters).</em></p>
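<p>The sliding-window idea can be sketched as a simple loop over adjacent page pairs. The <code>starts_new_document</code> callable stands in for the fine-tuned model and is purely hypothetical here, as is the toy rule used to exercise it; the point is only that each decision sees the previous and current page and nothing else.</p>

```python
from typing import Callable, List, Sequence


def segment_stream(
    pages: Sequence[str],
    starts_new_document: Callable[[str, str], bool],
) -> List[List[str]]:
    """Split a page stream into documents using only local page pairs.

    Each boundary decision depends on (previous_page, current_page),
    which mirrors the Markovian assumption described above.
    """
    if not pages:
        return []
    documents = [[pages[0]]]  # the first page always opens a document
    for previous_page, current_page in zip(pages, pages[1:]):
        if starts_new_document(previous_page, current_page):
            documents.append([current_page])
        else:
            documents[-1].append(current_page)
    return documents


def toy_boundary_rule(previous_page: str, current_page: str) -> bool:
    """Hypothetical stand-in for the model: a page marked 'p1' starts a document."""
    return current_page.endswith(" p1")


stream = ["invoice p1", "invoice p2", "contract p1", "record p1", "record p2", "record p3"]
for document in segment_stream(stream, toy_boundary_rule):
    print(document)
```

<p>Because no state is carried beyond the pair, a bad decision cannot contaminate later ones, which is the circuit-breaker property discussed above.</p>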
<p>To achieve the necessary efficiency for this local approach, we fine-tuned these models with <strong>LoRA (Low-Rank Adaptation)</strong> over a 4-bit-quantized base model (via Unsloth, LoRA weights in BF16) on a single NVIDIA H100.</p>
<ul>
<li><strong>Rank ($r$)</strong>: 16</li>
<li><strong>Alpha ($\alpha$)</strong>: 16</li>
<li><strong>Precision</strong>: 4-bit quantization</li>
</ul>
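<p>For a sense of scale, the sketch below works out what these settings imply under the standard LoRA formulation, where the low-rank update is scaled by $\alpha / r$ and adds $r \cdot (d_{\text{in}} + d_{\text{out}})$ trainable parameters per adapted weight matrix. The 4096 x 4096 matrix size is an assumed example and is not a figure from our training setup.</p>

```python
from typing import Dict


def lora_stats(d_in: int, d_out: int, rank: int = 16, alpha: int = 16) -> Dict[str, float]:
    """Scaling factor and parameter counts for one LoRA-adapted weight matrix."""
    full_params = d_in * d_out
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    adapter_params = rank * (d_in + d_out)
    return {
        "scaling": alpha / rank,
        "adapter_params": adapter_params,
        "fraction_of_full": adapter_params / full_params,
    }


# Assumed example dimensions for a single projection matrix.
print(lora_stats(4096, 4096))
```

<p>With rank and alpha both at 16, the scaling factor is 1.0, and for the assumed matrix the adapter trains under 1% of the parameters a full update would touch.</p>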
<p>This efficient, local approach makes the &ldquo;heavy&rdquo; LLM solution surprisingly deployable.</p>
<h2 id="the-paradox-of-the-simple-task">The Paradox of the &ldquo;Simple&rdquo; Task</h2>
<p>There is a tension here. We call PSS the &ldquo;Hello World&rdquo; of document processing. It feels like it should be trivial: just sorting papers. Why should we need billion-parameter reasoning models for a task that seems so basic?</p>
<p>The answer lies in the distinction between <strong>Perception</strong> and <strong>Logic</strong>.</p>
<ul>
<li><strong>90% of PSS is Perception (System 1)</strong>: Recognizing a bold header, a logo change, or a &ldquo;Page 1 of 5&rdquo; footer. This is reactive and fast. XGBoost or a simple CNN handles this easily.</li>
<li><strong>The last 10% is Reasoning (System 2)</strong>: Determining if an unlabelled &ldquo;Addendum B&rdquo; belongs to the previous Master Service Agreement or starts a new policy packet. Resolving this ambiguity requires semantic understanding.</li>
</ul>
<p>A perfect example from our dataset is <strong>Fax Headers</strong>. A document might have a clear &ldquo;Page 1&rdquo; printed on it, but the fax machine stamps &ldquo;Page 005&rdquo; on top of the header because it&rsquo;s the 5th page of the transmission. XGBoost sees &ldquo;Page 005&rdquo;, fails to reconcile the conflict, and incorrectly continues the document. An LLM reads the content, ignores the fax stamp, and correctly identifies the new document.</p>
<p>The &ldquo;Reliability Trap&rdquo; snaps shut because we treat the entire problem as a System 1 perception task. We ask the model to predict the boundary instantly. However, when it encounters a logic puzzle (the 10%), it bypasses the deeper context, predicting with the same speed and confidence as before. This is why we see <strong>Clustered Difficulty</strong>. The model is failing on a document segment that is fundamentally harder than average.</p>
<h2 id="escaping-the-trap-from-guessing-to-verifying">Escaping the Trap: From Guessing to Verifying?</h2>
<p>If the problem is that models are &ldquo;Fast Processors&rdquo; prone to high-confidence errors in complex scenarios, a potential path forward may lie in <a href="https://arxiv.org/abs/2408.03314"><strong>Test-Time Compute</strong></a>.</p>
<p>The future of reliable automation lies in &ldquo;Building a better Checker.&rdquo; In high-stakes PSS, this could mean looking toward a <strong>Guesser-Verifier</strong> architecture, a technique becoming common in advanced reasoning tasks (like mathematical problem solving, <a href="https://arxiv.org/abs/2110.14168"><em>Cobbe et al., 2021</em></a>).</p>
<p>The core insight reflects a fundamental asymmetry in computer science (analogous to <strong>P vs NP</strong>): <strong>Verification is often easier than Generation.</strong> Just as it is easier to check if a Sudoku puzzle is solved than to solve it from scratch, it is significantly simpler to &ldquo;audit&rdquo; a complete document structure than to autoregressively predict it perfectly token-by-token.</p>
<ol>
<li><strong>The Generator (System 1)</strong>: A lightweight model (like <strong>Mistral-7B</strong> or <strong>Phi-3.5</strong>) proposes a segmentation. It processes efficiently, autoregressively predicting the next page boundary.</li>
<li><strong>The Verifier (System 2)</strong>: This would be a discriminative model (often a Reward Model or the same LLM with a specialized prompt) that evaluates the <em>complete</em> proposed document bundle and scores its coherence, asking: <em>&ldquo;Is this 5-page sequence actually coherent?&rdquo;</em></li>
</ol>
<p>A logical exploration would be a <strong>Best-of-N</strong> approach. Relying on the generator&rsquo;s first prediction is risky when it is uncertain. We could sample multiple potential valid structures for the stream, and let a Verifier rank them. This might help break the &ldquo;autoregressive myopia&rdquo; where a model commits to an early mistake. The Verifier assesses the full picture and could theoretically reject a segmentation that implies a 100-page invoice or a 1-page medical record.</p>
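<p>As a rough illustration of Best-of-N selection, the sketch below ranks candidate segmentations with a toy verifier. Everything here is hypothetical: the document types, the plausible page ranges, and the scoring rule are invented stand-ins for what would be a learned verifier in practice, and this is not something we have evaluated.</p>

```python
from typing import Callable, List, Sequence, Tuple

# A segmentation is a list of (document_type, num_pages) pairs.
Segmentation = List[Tuple[str, int]]

# Hypothetical plausibility ranges; a real verifier would be a learned model.
PLAUSIBLE_PAGE_RANGES = {"invoice": (1, 5), "medical_record": (2, 200)}


def toy_verifier(segmentation: Segmentation) -> float:
    """Score a complete segmentation; higher is better.

    Penalizes each document whose length falls outside its plausible range.
    """
    implausible = 0
    for doc_type, num_pages in segmentation:
        low, high = PLAUSIBLE_PAGE_RANGES[doc_type]
        if num_pages not in range(low, high + 1):
            implausible += 1
    return -float(implausible)


def best_of_n(
    candidates: Sequence[Segmentation],
    verifier: Callable[[Segmentation], float],
) -> Segmentation:
    """Return the candidate the verifier scores highest."""
    return max(candidates, key=verifier)


# Three sampled segmentations of the same 100-page stream.
candidates = [
    [("invoice", 100)],
    [("invoice", 2), ("medical_record", 98)],
    [("invoice", 99), ("medical_record", 1)],
]
print(best_of_n(candidates, toy_verifier))
```

<p>The verifier never predicts boundaries itself; it only audits finished proposals, which is what lets it catch a structure that each local decision made look reasonable.</p>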
<p>This approach offers a chance to break the mathematical tyranny of $0.99^{100}$. The system can selectively apply reasoning power to &ldquo;audit&rdquo; the stream before an error propagates downstream, treating the document as a cohesive unit.</p>
<h2 id="conclusion-better-systems-over-better-models">Conclusion: Better Systems Over Better Models</h2>
<p>We have largely solved the <strong>Capability</strong> problem for PSS: we have models that <em>can</em> read almost anything. Now, we face the <strong>Reliability</strong> barrier.</p>
<p>Our results paint a complex picture. Fine-tuned LLMs auto-process a substantially larger share of streams than XGBoost at comparable accuracy (54% vs 35% at $C &gt; 0.9$, and 70% vs 49% at $C &gt; 0.8$, where both reach 0.93 accuracy). Simultaneously, the &ldquo;Reliability Trap&rdquo; remains a critical challenge. Calibration techniques like Temperature Scaling can improve page-level metrics (while MC Dropout often made them worse), but none of them solve the core problem of sequential error propagation.</p>
<p>For practitioners building with LLMs in high-stakes domains (finance, law, medicine), the path forward requires a shift in both architecture and mindset:</p>
<ol>
<li><strong>Prioritize Throughput</strong>: What share of your volume can you automate at the reliability your domain demands? That is the KPI that matters.</li>
<li><strong>Accept the &ldquo;Logic&rdquo; Cost</strong>: Acknowledge that &ldquo;Hello World&rdquo; tasks often contain edge cases requiring genuine reasoning and semantic understanding.</li>
<li><strong>Explore Verifiers</strong>: It&rsquo;s possible that the next leap in performance will come from systems designed to validate outputs and audit complete structures.</li>
<li><strong>Human in the Loop</strong>: The model should act as a filter. It must reliably process the easy cases and flag the complex ones for human review <em>before</em> they corrupt the downstream database.</li>
</ol>
<p>Accuracy tells you how often the model is right. Calibration tells you if the model&rsquo;s confidence matches its correctness. In the real world, the latter is often worth more.</p>
<p><em>Read the full paper on <a href="https://aclanthology.org/2025.coling-industry.26/">ACL Anthology</a>, view the <a href="/coling-2025-pss-poster.pdf">conference poster</a>, or visit the <a href="/research/page-stream-segmentation-llms/">research page</a>. This paper builds on the <a href="/research/llm-page-stream-segmentation/">TabMe++ benchmark and decoder-based LLM approach</a> introduced in our earlier arXiv work. For related work on the OCR front-ends that feed these pipelines, see <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>.</em></p>
]]></content:encoded></item><item><title>The Evolution of Page Stream Segmentation: Rules to LLMs</title><link>https://hunterheidenreich.com/posts/history-of-page-stream-segmentation/</link><pubDate>Sat, 14 Feb 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/history-of-page-stream-segmentation/</guid><description>An exploration of Page Stream Segmentation (PSS) evolution and how context-driven sequence modeling addresses limitations in document processing.</description><content:encoded><![CDATA[<p>In the world of automated document processing, Page Stream Segmentation (PSS)<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> is the &ldquo;hello world&rdquo; problem that remains surprisingly stubborn.</p>
<p>The task is deceptively simple: given a stack of scanned pages (invoices, contracts, medical records), determine where one document ends and the next begins.</p>
<p>For decades, this problem was tackled with brittle rules and heuristics. Then came the deep learning era, where we threw Convolutional Neural Networks (CNNs) and multipage Transformers at it. Yet, even sophisticated models struggled to achieve what businesses actually care about: Straight-Through Processing (STP).</p>
<blockquote>
<p><strong>Why STP Matters</strong>: Everyone interacting with the system cares about STP. Arguably, any human that has to interact with the output of a system, deal with its mistakes, and perform corrections to get a job done, cares about it. If the system fails 90% of the time, it fails to automate and creates more work at the expense of real people.</p></blockquote>
<p>In this post, we explore the three eras of PSS, the limitations of page-level accuracy metrics, and how context-driven sequence modeling addresses these challenges.</p>
<h2 id="the-hidden-complexity-of-pss">The Hidden Complexity of PSS</h2>
<p>Why is PSS hard? It comes down to ambiguity and asymmetry.</p>
<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/page-stream-segmentation-automation-difficulties.webp"
         alt="Robot looking confused at a messy stack of documents"
         title="Robot looking confused at a messy stack of documents"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Automation in PSS is rarely straightforward&hellip; Messy inputs and ambiguous boundaries often lead to confusion.</figcaption>
    
</figure>

<h3 id="the-document-definition-problem">The &ldquo;Document&rdquo; Definition Problem</h3>
<p>First, the concept of a &ldquo;document&rdquo; is highly context-dependent.</p>
<blockquote>
<p><strong>The &ldquo;Word&rdquo; Analogy</strong>: What is a word? A sequence of characters separated by spaces? Or a meaningful unit of language that can be a single character (e.g., &ldquo;I&rdquo;) or a compound (e.g., &ldquo;New York&rdquo;)? <a href="https://arxiv.org/abs/1710.07729">Is space a word, too?</a> This ambiguity problem permeates all levels of language processing. We&rsquo;d be naive to think PSS is an exception to this rule!</p></blockquote>
<p>Consider an email with an attachment. Is the email body one document and the attachment another? Or is the whole packet one document? What about an invoice stapled to a check? A policy packet with multiple addendums?</p>
<p>&ldquo;Solvability&rdquo; implies a single ground truth, but in reality, PSS often requires aligning the model with specific, often subjective, business logic. A boundary to an underwriter might be a continuation to an archivist.</p>
<p>This subjectivity is a nightmare for rule-based systems. To solve it, we need models that go beyond pattern matching to reason about context and semantics. This is precisely where the self-attention mechanisms of Transformers excel.</p>
<h3 id="the-cost-of-error">The Cost of Error</h3>
<p>Second, the cost of failure is asymmetric.</p>
<p>If you are classifying an email as &ldquo;Spam&rdquo; or &ldquo;Not Spam,&rdquo; a single error affects one email. But PSS is a sequence problem. A single missed page break merges two distinct documents into one. This effectively &ldquo;corrupts&rdquo; two documents for the price of one error. Conversely, a false break splits a valid document in half.</p>
<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/classification-vs-sequence-segmentation.webp"
         alt="Diagram comparing classification errors vs sequence segmentation errors"
         title="Diagram comparing classification errors vs sequence segmentation errors"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The blast radius of error: Unlike simple classification where one error affects one item, a single segmentation error corrupts the integrity of multiple documents.</figcaption>
    
</figure>

<p>Even more dismal, if our focus is truly STP, then the only acceptable outcome is perfect segmentation of an entire document stream. Sometimes faxes can be hundreds of pages long.</p>
<p>If we have a 99% page-level accuracy ($p=0.99$), the probability of correctly segmenting a 100-page stream ($N=100$) is only:</p>
<p>$$ P(\text{Success}) = p^N = 0.99^{100} \approx 0.37 $$</p>
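<p>The arithmetic is easy to check. The sketch below assumes, as the formula does, that page-level errors are independent; the function names are ours, chosen for illustration. It also inverts the formula to ask what page-level accuracy a target stream-level success rate would require.</p>

```python
def stream_success_probability(page_accuracy: float, num_pages: int) -> float:
    """P(Success) = p^N, assuming independent page-level errors."""
    return page_accuracy ** num_pages


def required_page_accuracy(target_success: float, num_pages: int) -> float:
    """Invert p^N = target to get the page-level accuracy needed."""
    return target_success ** (1.0 / num_pages)


p_100 = stream_success_probability(0.99, 100)
needed = required_page_accuracy(0.90, 100)

print(round(p_100, 2))  # 0.37
print(needed)           # page accuracy needed for 90% success on 100-page streams
```

<p>Under the same independence assumption, reaching a 90% success rate on 100-page streams would need page-level accuracy just under 99.9%.</p>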
<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/asymmetric-cost-of-error-in-document-streams.webp"
         alt="Chart showing exponential decay of straight-through processing probability as document length increases"
         title="Chart showing exponential decay of straight-through processing probability as document length increases"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Even high page-level accuracy (99%) results in low stream-level success rates for long documents due to the multiplicative nature of error probabilities.</figcaption>
    
</figure>

<p>In other words, even with &ldquo;high&rdquo; accuracy, most long document streams will require human intervention. This phenomenon is what we call <strong>The Reliability Trap</strong> (explored in depth in <a href="/posts/reliability-trap-document-automation/">our companion post</a>).</p>
<h3 id="the-f1-score-trap">The &ldquo;F1 Score&rdquo; Trap</h3>
<p>A major finding in our research (<a href="/research/llm-page-stream-segmentation/">TabMe++</a>) is that traditional metrics mask the operational reality. Although generalized text segmentation metrics like <a href="https://aclanthology.org/W97-0304/">$P_k$</a> and <a href="https://aclanthology.org/J02-1002/">WindowDiff</a> exist, we found they don&rsquo;t capture the document-centric nature of business workflows.</p>
<p>Instead, we evaluate at three levels:</p>
<ol>
<li><strong>Page-Level</strong>: Did we correctly classify this single page transition?</li>
<li><strong>Document-Level</strong>: Did we correctly identify the entire document tuple $d_k = (p_i, \ldots, p_j)$?</li>
<li><strong>Stream-Level</strong>: Did we perfectly segment the entire stack of documents?</li>
</ol>
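<p>To make the three levels concrete, here is a minimal sketch that scores one predicted segmentation against the ground truth. It assumes each page carries a binary flag (1 = starts a new document); the helper names and the toy stream are ours, and the exact metric definitions in the paper may differ in detail.</p>

```python
from typing import List, Set, Tuple


def to_documents(starts: List[int]) -> Set[Tuple[int, int]]:
    """Convert per-page start flags into (first_page, last_page) spans.

    The first page always opens a document, whatever its flag says.
    """
    spans = set()
    begin = 0
    for i in range(1, len(starts)):
        if starts[i] == 1:
            spans.add((begin, i - 1))
            begin = i
    if starts:
        spans.add((begin, len(starts) - 1))
    return spans


def page_f1(true: List[int], pred: List[int]) -> float:
    """F1 on the 'starts a new document' class, page by page."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def document_recall(true: List[int], pred: List[int]) -> float:
    """Share of true documents recovered with exactly the right pages."""
    true_docs, pred_docs = to_documents(true), to_documents(pred)
    return len(true_docs.intersection(pred_docs)) / len(true_docs)


def stream_exact(true: List[int], pred: List[int]) -> bool:
    """Stream-level success: every boundary must match."""
    return to_documents(true) == to_documents(pred)


# Toy 10-page stream with four documents; the prediction misses one break.
true = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]
pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]

print(round(page_f1(true, pred), 3), document_recall(true, pred), stream_exact(true, pred))
```

<p>In this toy case a single missed break leaves page F1 near 0.86, but only half of the documents come out intact and the stream as a whole fails.</p>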
<p>Our results showed that Page-Level F1 Score completely masks the downstream impact.</p>
<p>Consider a baseline XGBoost model we tested:</p>
<ul>
<li><strong>Page F1 Score</strong>: 0.83 (Sounds decent, right?)</li>
<li><strong>STP</strong>: 0.07 (Abysmal)</li>
<li><strong>MNDD</strong>: 10.85</li>
</ul>
<p><strong>That means 93% of document streams required human intervention.</strong> Even worse, the MNDD (Minimum Number of Drag-and-Drops)<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> score tells us that for each stream, a human had to manually drag ~11 pages to repair the segmentation.</p>
<p>This metric is crucial because it proxies the actual <em>pain</em> of the human in the loop. An error signifies more than a theoretical label flip; it forces a manual drag-and-drop operation.</p>
<h2 id="era-1-the-heuristic-era-2000s---2015">Era 1: The Heuristic Era (2000s - 2015)</h2>
<p>In the beginning, PSS was a game of <code>if/else</code> statements. Engineers hand-crafted heuristics tailored to specific document layouts, checking for signals like:</p>
<ul>
<li><em>Does the page contain &ldquo;Page 1 of X&rdquo;?</em></li>
<li><em>Is there a &ldquo;Total&rdquo; line at the bottom?</em></li>
<li><em>Does the header text change drastically?</em></li>
</ul>
<p>While effective for known templates, these systems were inherently brittle. They relied on rigid assumptions about the input structure. If a vendor changed their invoice layout or OCR quality dipped, the logic would fail. They worked perfectly for what they were designed for but had zero capability to generalize to the unknown. Unfortunately, the real world is a constant state of exception.</p>
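<p>A toy version of such a system shows both the appeal and the brittleness. The rules, regular expressions, and sample pages below are invented for illustration and cover only the first two signals from the list above; they are not taken from any real product.</p>

```python
import re
from typing import List

# Hand-written signals, tuned to one imagined template.
PAGE_ONE = re.compile(r"\bpage\s+1\s+of\s+\d+\b", re.IGNORECASE)
TOTAL_LINE = re.compile(r"^\s*total\b", re.IGNORECASE | re.MULTILINE)


def starts_new_document(prev_page: str, page: str) -> bool:
    if PAGE_ONE.search(page):
        return True  # explicit "Page 1 of X" marker on this page
    if TOTAL_LINE.search(prev_page):
        return True  # previous page ended with a "Total" line
    return False


def segment(pages: List[str]) -> List[List[int]]:
    """Group page indices into documents using the rules above."""
    docs: List[List[int]] = []
    for i, page in enumerate(pages):
        if i == 0 or starts_new_document(pages[i - 1], page):
            docs.append([i])
        else:
            docs[-1].append(i)
    return docs


pages = [
    "ACME Corp\nInvoice\nPage 1 of 2",
    "Widget x2\nTotal: 500.00\nPage 2 of 2",
    "Globex\nStatement\nPage 1 of 1",
    "Initech\nInvoice\nPg. 1/1",  # a vendor with a different page marker
]
print(segment(pages))
```

<p>The first three pages are handled correctly, but the last vendor writes &ldquo;Pg. 1/1&rdquo;, so its invoice is silently merged into the preceding statement.</p>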
<h2 id="era-2-the-encoder-era-2015---2023">Era 2: The Encoder Era (2015 - 2023)</h2>
<p>As deep learning matured, researchers moved from hard-coded rules to learned representations.</p>
<ul>
<li><strong>Visual Approaches</strong>: Using CNNs to look at the &ldquo;shape&rdquo; of a page as an image. First pages often look different from continuation pages (logos, big headers).</li>
<li><strong>Word Vectors</strong>: Early NLP attempts used tools like <a href="https://arxiv.org/abs/1405.4053">doc2vec</a> to represent page content, but these &ldquo;averaged&rdquo; the text, losing sequential meaning.</li>
<li><strong>Multimodal Transformers</strong>: Eventually, models like <a href="https://arxiv.org/abs/1912.13318">LayoutLM</a> tried to combine text and layout into a single understanding, alongside text-only domain encoders like <a href="https://arxiv.org/abs/2010.02559">LEGAL-BERT</a>.</li>
</ul>
<p>While these models were &ldquo;smarter&rdquo; than rules, they suffered from distinct limitations:</p>
<ol>
<li>
<p><strong>Field Lag</strong>: Surprisingly, only a handful of studies applied Transformers to PSS before 2024. Most of the industry was still stuck on older CNN architectures.</p>
</li>
<li>
<p><strong>Context Windows</strong>: Encoder models like <a href="https://arxiv.org/abs/1810.04805">BERT</a> are limited to 512 tokens. A dense legal contract page might have 1,000+ tokens. You had to chop the text, losing critical context.</p>
</li>
<li>
<p><strong>Modality Overload</strong>: Counterintuitively, our experiments showed that naively adding modalities (Text + Layout + Vision) often yielded diminishing returns. Models like <a href="https://arxiv.org/abs/2204.08387">LayoutLMv3</a> struggled to outperform simpler vision-only or text-only models on our benchmark.</p>
<blockquote>
<p>However, a closer look at the data reveals an interesting nuance: <strong>Visual signals matter.</strong> In our tests, the vision-only model (DiT) actually outperformed the text-only model (RoBERTa); DiT tends to be more precise, while RoBERTa reaches higher recall. The multimodal models fell short because of the difficulty of <em>aligning</em> modalities, not because vision is uninformative. This insight led us to a key realization for Era 3: What if we could give the model visual information without the architectural headache of a vision encoder?</p></blockquote>
</li>
</ol>
<h2 id="era-3-the-decoder-era-2024---present">Era 3: The Decoder Era (2024 - Present)</h2>
<p>The breakthrough came with applying Decoder-only Large Language Models (LLMs) like <a href="https://arxiv.org/abs/2310.06825">Mistral-7B</a> and <a href="https://arxiv.org/abs/2404.14219">Phi-3</a> to the task.</p>
<p>Why do LLMs succeed where specialized encoders failed? <strong>Contextual Processing</strong>.</p>
<p>Determining if a page is a continuation often requires analyzing sequential dependencies.</p>
<ul>
<li><em>Does the sentence cut off mid-thought?</em></li>
<li><em>Does the next page logically follow the argument of the previous one?</em></li>
<li><em>Is the &ldquo;Policy Number&rdquo; on Page 2 the same as Page 1?</em></li>
</ul>
<p>LLMs are pre-trained on the internet; they model narrative flow and document structure effectively. By fine-tuning them on pairs of pages, we adapted these priors to recognize specific segmentation boundaries.</p>
<h3 id="2d-projection--data-quality">2D Projection &amp; Data Quality</h3>
<p>We employed <strong>2D Text Projection</strong>, a technique that serializes OCR output by mapping spatial coordinates to whitespace. This effectively &ldquo;draws&rdquo; the layout using text characters, allowing the LLM to process columns, headers, and form structures. We translated the visual signal (layout) into the text modality to address the &ldquo;Modality Overload&rdquo; problem.</p>
<p>To be clear, this is a lossy compression. We discard font sizes, bolding, colors, and line separators. It is merely a cheap, zeroth-order approximation of 2D layout using 1D text. Yet, as our results show, this approximation captures the <em>semantic</em> essence of the layout (e.g., &ldquo;this text is in a header column&rdquo;) sufficient for the model to reason about document boundaries.</p>
<p>However, this technique has a hard dependency: <strong>Data Quality</strong>. 2D projection is useless if your OCR gives you garbage coordinates. This is where our work on <strong>TabMe++</strong> (discussed below) became critical. You can&rsquo;t project a layout if the OCR misses the text or places it in the wrong spot.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span># Original Raw Text (Loss of Layout)
</span></span><span style="display:flex;"><span>INVOICE # 1024 DATE: 2024-02-14 TOTAL: $500.00
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># 2D Projected Text (Layout Preserved)
</span></span><span style="display:flex;"><span>                    INVOICE # 1024
</span></span><span style="display:flex;"><span>                    DATE: 2024-02-14
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>TOTAL:                                      $500.00
</span></span></code></pre></div><blockquote>
<p><strong>Why Does This Work?</strong> Modern LLMs are trained on a mixture of web text, code, and even some structured data. They have learned to interpret whitespace and formatting cues as part of their understanding of language. By encoding layout information into the text itself, we leverage the LLM&rsquo;s existing capabilities without needing to train a separate vision encoder.</p></blockquote>
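<p>For intuition, the projection can be approximated with a fixed character grid. This is a simplified sketch, not the implementation used in our experiments: the <code>(text, x, y)</code> word format, the pixel page size, and the grid resolution are assumptions made for the example.</p>

```python
from typing import List, Tuple

# (text, x_px, y_px): top-left corner of each OCR word, in pixels (assumed format).
Word = Tuple[str, int, int]


def project_2d(words: List[Word], page_width: int, page_height: int,
               cols: int = 60, rows: int = 8) -> str:
    """Place each word on a cols x rows character grid using its coordinates."""
    grid = [[" "] * cols for _ in range(rows)]
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        r = min(y * rows // page_height, rows - 1)
        c = min(x * cols // page_width, cols - 1)
        for i, ch in enumerate(text):
            if c + i < cols:  # clip words that run past the right edge
                grid[r][c + i] = ch
    lines = ["".join(row).rstrip() for row in grid]
    while lines and not lines[-1]:  # drop trailing empty rows
        lines.pop()
    return "\n".join(lines)


words = [
    ("INVOICE", 200, 10), ("#", 280, 10), ("1024", 300, 10),
    ("DATE:", 200, 110), ("2024-02-14", 260, 110),
    ("TOTAL:", 0, 310), ("500.00", 440, 310),
]
projected = project_2d(words, page_width=600, page_height=800)
print(projected)
```

<p>Words that land on the same cell overwrite each other in this sketch, which is one more way the projection is lossy and one more reason accurate OCR coordinates matter.</p>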
<p>We then wrapped this input in a structured prompt that explicitly framed the task for the LLM:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>You are a skilled document reviewer. Given extracted text from pages of documents, your task is to determine if a page starts a new document or continues from the previous one.
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>Prior text:
</span></span><span style="display:flex;"><span>###
</span></span><span style="display:flex;"><span>{pg_prev}
</span></span><span style="display:flex;"><span>###
</span></span><span style="display:flex;"><span>Page text:
</span></span><span style="display:flex;"><span>###
</span></span><span style="display:flex;"><span>{pg}
</span></span><span style="display:flex;"><span>###
</span></span><span style="display:flex;"><span>Output your prediction as a JSON object...
</span></span></code></pre></div><h3 id="the-results">The Results</h3>
<p>We formulated the task as a binary classification problem on page pairs. We fed the model <code>(Page N, Page N+1)</code> and asked: <em>&ldquo;Does Page N+1 start a new document?&rdquo;</em></p>
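<p>As a rough sketch of this framing, the snippet below turns a stream into page pairs, fills the prompt for each pair, and rebuilds documents from the binary answers. The template keeps only the parts of the prompt shown above, with the elided sections left as &ldquo;...&rdquo;; the helper names are ours, and the model call itself is omitted.</p>

```python
from typing import List, Tuple

# Only the visible parts of the prompt; elided sections stay elided.
PROMPT_TEMPLATE = """You are a skilled document reviewer. Given extracted text from pages of documents, your task is to determine if a page starts a new document or continues from the previous one.
...
Prior text:
###
{pg_prev}
###
Page text:
###
{pg}
###
Output your prediction as a JSON object..."""


def page_pairs(pages: List[str]) -> List[Tuple[str, str]]:
    """Return (Page N, Page N+1) for every transition in the stream."""
    return list(zip(pages, pages[1:]))


def build_prompts(pages: List[str]) -> List[str]:
    return [PROMPT_TEMPLATE.format(pg_prev=prev, pg=cur)
            for prev, cur in page_pairs(pages)]


def decisions_to_documents(starts_new: List[bool]) -> List[List[int]]:
    """starts_new[i] answers: does page i+1 start a new document?"""
    docs = [[0]]
    for i, is_new in enumerate(starts_new, start=1):
        if is_new:
            docs.append([i])
        else:
            docs[-1].append(i)
    return docs


pages = ["cover letter", "letter page 2", "invoice"]
prompts = build_prompts(pages)

# Hypothetical model answers for the two transitions.
documents = decisions_to_documents([False, True])
print(len(prompts), documents)
```

<p>A stream of N pages yields N-1 independent classification calls, which is why a single wrong answer is enough to break the stream.</p>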
<p>Comparison on <strong>TabMe++</strong> Benchmark:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model Type</th>
          <th style="text-align: left">Model Name</th>
          <th style="text-align: left">Page F1</th>
          <th style="text-align: left"><strong>STP</strong> (Higher is better)</th>
          <th style="text-align: left"><strong>MNDD</strong> (Lower is better)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Baseline</strong></td>
          <td style="text-align: left"><a href="https://arxiv.org/abs/1603.02754">XGBoost</a></td>
          <td style="text-align: left">0.83</td>
          <td style="text-align: left"><strong>7.4%</strong></td>
          <td style="text-align: left">10.85</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Encoder</strong></td>
          <td style="text-align: left"><a href="https://arxiv.org/abs/1907.11692">RoBERTa</a> (Text)</td>
          <td style="text-align: left">0.78</td>
          <td style="text-align: left"><strong>4.2%</strong></td>
          <td style="text-align: left">12.17</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Encoder</strong></td>
          <td style="text-align: left"><a href="https://arxiv.org/abs/2203.02378">DiT</a> (Vision)</td>
          <td style="text-align: left">0.83</td>
          <td style="text-align: left"><strong>6.6%</strong></td>
          <td style="text-align: left">10.48</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Decoder</strong></td>
          <td style="text-align: left"><strong><a href="https://arxiv.org/abs/2310.06825">Mistral-7B</a> (Fine-Tuned)</strong></td>
          <td style="text-align: left"><strong>0.99</strong></td>
          <td style="text-align: left"><strong>80.0%</strong></td>
          <td style="text-align: left"><strong>0.81</strong></td>
      </tr>
  </tbody>
</table>
<p>The difference is stark. Moving from Encoders to Decoders increased the automation rate from ~7% to <strong>80%</strong> and reduced the human effort (MNDD) by more than a factor of 10. <em>Note: This 80% is the model&rsquo;s raw STP, with no confidence threshold applied. As we discuss in <a href="/posts/reliability-trap-document-automation/">The Reliability Trap</a>, achieving &ldquo;production-safe&rdquo; automation often requires setting strict confidence thresholds, which effectively lowers the safe throughput.</em></p>
<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/llm-sample-efficiency-convergence-plot.webp"
         alt="Sample efficiency plot showing rapid convergence in under 1000 updates"
         title="Sample efficiency plot showing rapid convergence in under 1000 updates"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">LLMs adapt fast. Our decoder models converged in fewer than 1,000 updates, suggesting strong priors for document structure.</figcaption>
    
</figure>

<h3 id="why-fine-tuning-matters-the-gpt-4o-comparison">Why Fine-Tuning Matters: The GPT-4o Comparison</h3>
<p>You might look at the chart above and ask: &ldquo;Is the model learning PSS, or does it just rely on pre-trained language statistics?&rdquo;</p>
<p>To test this, we ran <strong><a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a> in a zero-shot setting</strong> on the same task. The result was an STP of roughly 9%.</p>
<p>Zero-shot GPT-4o performed similarly to our XGBoost baseline. This suggests that broad pre-training alone is not enough; the model needs task-specific fine-tuning to capture business logic. Our 7B model achieved 80% STP after fewer than <strong>1,000 updates</strong>.</p>
<p>This suggests two things:</p>
<ol>
<li><strong>Broad Pre-training Requires Tuning.</strong> Modeling generic document distributions must be adapted to capture specific business logic for segmentation.</li>
<li><strong>The Capabilities are Latent.</strong> The rapid convergence implies the model possesses the necessary statistical priors and requires fine-tuning to align those priors with the specific task. We are adjusting the decision boundary between a generic &ldquo;document&rdquo; and a specific business record.</li>
</ol>
<h3 id="the-cost-of-intelligence-and-the-value-of-human-time">The Cost of Intelligence and the Value of Human Time</h3>
<p>Critically, we must address the two elephants in the room: <strong>Inference Cost</strong> and <strong>Data Privacy</strong>.</p>
<p>It is true that running a 7B parameter LLM for every page pair is computationally more expensive than a lightweight XGBoost model. However, focusing solely on compute costs misses the operational and human reality of this work.</p>
<p>Economically, the &ldquo;cheap&rdquo; model is a mirage. When a low-accuracy model forces a human to reorganize 93% of document streams, the cost of rectification, specifically wasted salaries and slowed turnaround times, dwarfs the cost of GPU inference. But the financial argument is secondary to the human one.</p>
<p>Manually segmenting documents is, frankly, soul-sucking. It is tedious, repetitive drudgery that few people enjoy. Beyond operational expense, we are discussing human burnout. A model that achieves 80% full automation (STP) saves money while liberating people from the mind-numbing task of sorting pages. This allows them to focus on work that actually requires their creativity and empathy. We are trading cheap FLOPs for valuable human attention.</p>
<p>Furthermore, democratizing this capability has profound implications beyond the enterprise. If we can make high-quality segmentation usable on modest hardware (like a high-end laptop or a single commodity GPU), we open the door for archivists, librarians, digital humanists, and small cities or towns that have little to no resources for this kind of work. These are the custodians of our collective intelligence, often working with massive, unorganized scanned collections but lacking the budget for massive cloud clusters.</p>
<p>Our results showed that <strong>7B parameter models</strong> (like Mistral) are sufficient for strong performance on this task. This size is the sweet spot: capable enough to reason over document structure, but small enough to run locally. This matters for data sovereignty (keeping medical records private) and accessibility. It means a small historical society could potentially automate the organization of a century’s worth of digitized records without a massive grant for cloud compute.</p>
<p>That said, a 7B model might not be the lower bound. While it was the breakthrough size for our study, the recent explosion of capable 1B-3B models suggests we haven&rsquo;t hit the efficiency floor yet. Combined with extreme quantization, modern small language models (SLMs) likely offer the &ldquo;Goldilocks&rdquo; zone: enough reasoning to maintain high STP, but fast enough to run continuously on modest hardware. We suspect the future of PSS lies in these highly optimized, smaller reasoning models that can run anywhere&hellip; from a bank&rsquo;s secure server to a researcher&rsquo;s laptop.</p>
<h2 id="the-importance-of-data-quality">The Importance of Data Quality</h2>
<p>Data quality was as much of an obstacle as algorithmic limitations.
Most public datasets (like Tobacco800) were small or unrealistic. The TABME dataset (precursor to our work) relied on open-source Tesseract OCR, which missed vast amounts of text.</p>
<p>We released <a href="https://huggingface.co/datasets/rootsautomation/TABMEpp"><strong>TabMe++</strong></a>, which re-processed the entire dataset with commercial-grade Microsoft OCR.</p>
<ul>
<li><strong>Blank Pages</strong>: Reduced from 2.27% $\rightarrow$ 0.38%.</li>
<li><strong>Token Count</strong>: Increased from 719M $\rightarrow$ 9.5B.</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/noisy-sales-forecast-document-scan.webp"
         alt="Scanned document page showing a sales forecast with some noise"
         title="Scanned document page showing a sales forecast with some noise"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Original Page: A noisy scan that Tesseract struggles to read.</figcaption>
    
</figure>

<p>The difference in intelligibility is night and day. Consider the page above.</p>
<p><strong>Tesseract (Original)</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>02Z10102
</span></span></code></pre></div><p><em>(Misses almost everything, including the title and real ID)</em></p>
<p><strong>Microsoft OCR (TabMe++)</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>            SALES FORECAST
</span></span><span style="display:flex;"><span>                            201017205
</span></span></code></pre></div><p><em>(Correctly captures the spatial layout, the title, and the ID)</em></p>
<p><strong>Lesson</strong>: You can&rsquo;t segment what you can&rsquo;t read. High-quality OCR (or <a href="/research/gutenocr-grounded-vision-language-frontend/">multimodal front-ends</a> like GutenOCR, trained on large-scale annotation corpora like <a href="/research/pubmed-ocr-pmc-open-access-ocr-annotations/">PubMed-OCR</a>) is the foundation of high-quality downstream NLP.</p>
<h2 id="the-next-frontier-context-and-instruction-following-2026">The Next Frontier: Context and Instruction Following (2026+)</h2>
<p>As we discussed earlier, the definition of a &ldquo;document&rdquo; is subjective. To one team, an email + attachment is a single record. To another, they are distinct entities. A rigid model that segments perfectly for Team A will fail miserably for Team B.</p>
<p>The zero-shot GPT-4o results show that scale alone does not substitute for adaptation. The future of PSS depends on <strong>instruction tuning</strong>. We need models that can accept natural language rules alongside the document stream:</p>
<blockquote>
<p><em>&ldquo;Split all invoices, but keep attachments with their parent emails. If you see an ACORD form, group it with the subsequent policy document.&rdquo;</em></p></blockquote>
<p>This shift mirrors the broader evolution of LLMs. PSS models must evolve into dynamic systems capable of instruction following. A single model should be able to adapt to any business logic without retraining.</p>
<p>Furthermore, while our 2024 research favored unimodal text models with 2D projection, the multimodal landscape is shifting. With the rise of natively multimodal models (like <a href="https://deepmind.google/models/gemini/">Gemini</a>, <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a>, and our own <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>), we effectively get the &ldquo;2D projection&rdquo; natively. Future models should be able to fuse this native visual understanding with semantic reasoning, guided by user-defined constraints.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Page Stream Segmentation is a perfect case study in the evolution of AI. We moved from <strong>encoding rules</strong> (Heuristic Era) to <strong>encoding features</strong> (Encoder Era) to <strong>encoding understanding</strong> (Decoder Era).</p>
<p>For enterprise professionals, the takeaways are clearer and more critical than ever.</p>
<p>First, <strong>stop looking at element-wise F1 scores for sequence tasks.</strong> While element-wise metrics are useful for engineers debugging algorithms, they are misleading for decision-makers. Focus on the metrics that actually affect people and workflows, like Straight-Through Processing (STP) and Minimum Number of Drag-and-Drops (MNDD).</p>
<p>Second, if you want to solve PSS today, start with an inward-looking conversation about &ldquo;for what.&rdquo; Before picking a model, answer these questions:</p>
<ul>
<li><strong>Inputs</strong>: What assumptions are you making about your document stream?</li>
<li><strong>Outcomes</strong>: What specific business outcomes are you hoping to see?</li>
<li><strong>Context</strong>: What is the core motivation for this workflow?</li>
<li><strong>Nuance</strong>: Are there informative scenarios (like the &ldquo;email attachment&rdquo; problem) that illustrate your specific needs?</li>
</ul>
<p>Given these answers, many modern approaches can solve PSS for your case. Whether you need an on-premise solution for secure scenarios using lightweight open-weights models, or can leverage powerful AI-as-a-Service APIs, the technology is no longer the bottleneck; understanding your own requirements is.</p>
<p><em>For full technical details, experimental setups, and datasets, refer to our paper: <a href="/research/llm-page-stream-segmentation/">Large Language Models for Page Stream Segmentation</a> or view the preprint on <a href="https://arxiv.org/abs/2408.11981">arXiv:2408.11981</a>. These findings were later extended to real-world insurance document processing in <a href="/research/page-stream-segmentation-llms/">LLMs for Insurance Document Automation</a>. Much of the initial work was also documented in a precursor blog series at Roots Automation (<a href="https://www.roots.ai/blog/segmenting-documents-with-llms-and-multimodal-document-ai-part-1">Part 1</a> &amp; <a href="https://www.roots.ai/blog/segmenting-documents-with-llms-and-multimodal-document-ai-part-2">Part 2</a>).</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Historically, this task has gone by many names: <em>document separation</em>, <em>document flow segmentation</em>, <em>document stream segmentation</em>, <em>document bundle separation</em>, and <em>page stream separation</em>. We stick to <strong>Page Stream Segmentation (PSS)</strong> to emphasize the sequential nature of the problem.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>We adopted the MNDD metric from <a href="https://dl.acm.org/doi/10.1145/3558100.3563852">Mungmeeprued et al. (2022)</a>, who introduced it alongside the original TABME dataset to better quantify the human effort required to correct segmentation errors.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded></item><item><title>GutenOCR: A Grounded Vision-Language Front-End for Documents</title><link>https://hunterheidenreich.com/research/gutenocr-grounded-vision-language-frontend/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/gutenocr-grounded-vision-language-frontend/</guid><description>GutenOCR introduces vision-language models for grounded OCR, offering precise text transcription and geometric grounding via a unified prompt-based interface.</description><content:encoded><![CDATA[<p><a href="https://github.com/Roots-Automation/GutenOCR"><strong>GutenOCR</strong></a> is a family of vision-language models (VLMs) designed to serve as a &ldquo;grounded OCR front-end&rdquo;. Unlike traditional OCR pipelines (which are often brittle) or modern &ldquo;OCR-free&rdquo; VLMs (which often lack precise token-to-pixel alignment), GutenOCR is fine-tuned to provide both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface.</p>
<h3 id="abstract">Abstract</h3>
<p>Traditional OCR pipelines are often brittle, while modern &ldquo;OCR-free&rdquo; Vision-Language Models (VLMs) frequently lack precise token-to-pixel alignment. To address this, we introduce <strong>GutenOCR</strong>, a family of VLMs designed specifically as a &ldquo;grounded OCR front-end.&rdquo; By fine-tuning Qwen2.5-VL on a curriculum of synthetic and real-world documents, GutenOCR provides both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface. This approach allows downstream systems to request exactly the data format they need, from plain text to complex JSON structures.</p>
<h3 id="key-contributions--results">Key Contributions &amp; Results</h3>
<ul>
<li><strong>Unified Interface:</strong> Transforms Qwen2.5-VL models into specialized OCR systems supporting full-page reading, detection, localized reading, and conditional detection via prompting.</li>
<li><strong>In-Domain Improvements:</strong> GutenOCR-7B more than doubles the composite grounded OCR score of its base model (0.40 to 0.82) on 10.5K held-out pages, with the gains concentrated in localized reading and detection.</li>
<li><strong>Fox Benchmark:</strong> GutenOCR-3B achieves a region-level Character Error Rate (CER) of 0.053, surpassing even the dedicated Fox model (0.059), and sharply improves line-level OCR over its Qwen2.5-VL backbones (0.817 to 0.240), though the specialized Fox model still leads at line level.</li>
<li><strong>Curriculum Learning:</strong> Training uses a three-stage curriculum across synthetic data, real-world business documents, and long-context scientific articles to progressively build layout and grounding competency.</li>
<li><strong>Trade-offs:</strong> While GutenOCR reads content accurately (high Page F1), it orders text based on 2D layout columns. It also shows catastrophic forgetting of color-based prompts and clear negative transfer on math formula recognition (most notably in the 3B model), to the point that the paper recommends the un-fine-tuned backbone for equation-heavy use.</li>
</ul>
<h3 id="methodology">Methodology</h3>
<ul>
<li><strong>Data:</strong> The training mixture combines large-scale real-world documents (business forms, scientific articles) with synthetic data designed to teach precise grounding (e.g., &ldquo;Grounded LaTeX&rdquo; and &ldquo;SynthDoG Grounding&rdquo;).</li>
<li><strong>Curriculum Learning:</strong> Training progresses through three stages, starting with short contexts and synthetic data, moving to real-world business documents, and finishing with long-context scientific articles (up to 16k tokens).</li>
<li><strong>Unified Interface:</strong> The model treats &ldquo;pipeline&rdquo; stages (detection, reading, grounding) as different input-output schemas of a single model, allowing downstream systems to request exactly the data format they need (e.g., plain text vs. JSON boxes).</li>
</ul>
<h3 id="models">Models</h3>
<p>We release 3B and 7B parameter models on HuggingFace:</p>
<ul>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-3B"><strong>GutenOCR-3B (HuggingFace)</strong></a></li>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-7B"><strong>GutenOCR-7B (HuggingFace)</strong></a></li>
</ul>
<p>You can try GutenOCR directly at <a href="https://ocr.roots.ai/">ocr.roots.ai</a>, where you can upload a document image and see the model&rsquo;s parsed text output alongside bounding-box highlights on the original image.</p>
<figure class="post-figure center ">
    <img src="/img/gutenocr-demo.webp"
         alt="GutenOCR demo interface showing parsed text with hover-to-highlight on the uploaded image"
         title="GutenOCR demo interface showing parsed text with hover-to-highlight on the uploaded image"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The live demo at ocr.roots.ai: hovering over any parsed token highlights its bounding box on the original document.</figcaption>
    
</figure>

<h3 id="why-this-matters">Why This Matters</h3>
<p>GutenOCR is proposed as a foundational layer for systems where every extracted answer must be explicitly linked to supporting pixels. By providing stable, grounded outputs, it enables human-in-the-loop workflows where reviewers can catch hallucinated or missing text by checking the predicted bounding boxes against the page. This work pairs closely with our release of <a href="/research/pubmed-ocr-pmc-open-access-ocr-annotations/">PubMed-OCR</a>, which provides the large-scale, high-density annotations necessary to train such layout-aware models.</p>
<h3 id="resources">Resources</h3>
<ul>
<li><a href="https://ocr.roots.ai/"><strong>Live Demo</strong></a>: Try GutenOCR on your own documents.</li>
<li><a href="https://arxiv.org/abs/2601.14490"><strong>Paper (arXiv)</strong></a>: Full technical report.</li>
<li><a href="https://github.com/Roots-Automation/GutenOCR"><strong>Code (GitHub)</strong></a>: Training code and model release.</li>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-3B"><strong>GutenOCR-3B (HuggingFace)</strong></a>: 3B parameter model weights.</li>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-7B"><strong>GutenOCR-7B (HuggingFace)</strong></a>: 7B parameter model weights.</li>
</ul>
<h3 id="citation">Citation</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2026gutenocrgroundedvisionlanguagefrontend,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GutenOCR: A Grounded Vision-Language Front-End for Documents}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich and Ben Elliott and Olivia Dinica and Yosheb Getachew}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2601.14490}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2601.14490}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="related-work">Related Work</h3>
<ul>
<li><a href="/research/pubmed-ocr-pmc-open-access-ocr-annotations/"><strong>PubMed-OCR</strong></a>: The large-scale annotation dataset used to train GutenOCR&rsquo;s layout-aware grounding capabilities.</li>
<li><a href="/research/llm-page-stream-segmentation/"><strong>LLMs for Page Stream Segmentation</strong></a>: Complementary work on document understanding at the page-stream level.</li>
<li><a href="/posts/history-of-page-stream-segmentation/"><strong>The Evolution of Page Stream Segmentation: Rules to LLMs</strong></a>: Background on the history and evolution of document processing pipelines.</li>
<li><a href="/posts/reliability-trap-document-automation/"><strong>The Reliability Trap: The Limits of 99% Accuracy</strong></a>: Explores calibration challenges in deployed PSS systems, directly relevant to GutenOCR&rsquo;s deployment context as an OCR front-end.</li>
</ul>
]]></content:encoded></item><item><title>PubMed-OCR: PMC Open Access OCR Annotations</title><link>https://hunterheidenreich.com/research/pubmed-ocr-pmc-open-access-ocr-annotations/</link><pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/pubmed-ocr-pmc-open-access-ocr-annotations/</guid><description>A large-scale dataset of 209K+ articles with OCR and layout bounding boxes, enabling layout-aware modeling and document understanding research.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>OCR-First Supervision</strong>: Unlike prior datasets for PubMed that align XML to PDFs, PubMed-OCR provides native OCR annotations (Google Cloud Vision), bypassing alignment errors and covering non-digital scanned pages.</li>
<li><strong>High-Density Annotation</strong>: At <strong>~1.3B words across 1.5M pages</strong>, PubMed-OCR is far denser per page than comparable corpora like OCR-IDL: <strong>~13x the word density</strong> (844 vs. 62.5 words/page) and <strong>~6x the line density</strong> (106 vs. 17.5 lines/page), achieved despite drawing from fewer total pages.</li>
<li><strong>Multi-Level Bounding Boxes</strong>: Includes explicit word-, line-, and paragraph-level bounding boxes to support hierarchical document understanding and layout-aware modeling. We also hope this leads to VQA datasets whose answers are grounded in document layout.</li>
<li><strong>Open Access &amp; Reproducibility</strong>: Derived strictly from the redistributable PMCOA subset, releasing both the JSON annotations and original PDFs to ensure verifiable and reproducible research.</li>
</ul>
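<p>The density ratios quoted above follow directly from the per-page figures. A short check, using only the numbers stated in the list (the variable names are ours, chosen for illustration):</p>

```python
# Per-page densities as quoted in the Key Contributions list above.
pubmed_ocr = {"words_per_page": 844.0, "lines_per_page": 106.0}
ocr_idl = {"words_per_page": 62.5, "lines_per_page": 17.5}

# Ratio of PubMed-OCR density to OCR-IDL density at each granularity.
word_ratio = pubmed_ocr["words_per_page"] / ocr_idl["words_per_page"]
line_ratio = pubmed_ocr["lines_per_page"] / ocr_idl["lines_per_page"]

print(f"word density ratio: {word_ratio:.1f}x")
print(f"line density ratio: {line_ratio:.1f}x")
```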
<h2 id="technical-implementation">Technical Implementation</h2>
<h3 id="corpus-construction">Corpus Construction</h3>
<p>PubMed-OCR is built from PubMed Central Open Access (PMCOA) PDFs, chosen specifically because the PMCOA license permits redistribution of both the original documents and derived annotations. Each PDF is rendered to page images, then passed to the Google Cloud Vision (GCV) API. Each page produces a structured JSON annotation file capturing the detected text along with bounding box geometry at word, line, and paragraph levels.</p>
<h3 id="json-annotation-schema">JSON Annotation Schema</h3>
<p>Each page annotation follows this compact schema. Bounding boxes are axis-aligned rectangles in <code>[x1, y1, x2, y2]</code> pixel coordinates. Words, lines, and paragraphs are stored as parallel flat lists under the <code>text</code> key:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;text&#34;</span>: {
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;words&#34;</span>: [
</span></span><span style="display:flex;"><span>      {<span style="color:#f92672">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;Example&#34;</span>, <span style="color:#f92672">&#34;box&#34;</span>: [<span style="color:#ae81ff">180</span>, <span style="color:#ae81ff">746</span>, <span style="color:#ae81ff">210</span>, <span style="color:#ae81ff">786</span>]}
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;lines&#34;</span>: [
</span></span><span style="display:flex;"><span>      {<span style="color:#f92672">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;Example sentence&#34;</span>, <span style="color:#f92672">&#34;box&#34;</span>: [<span style="color:#ae81ff">180</span>, <span style="color:#ae81ff">746</span>, <span style="color:#ae81ff">540</span>, <span style="color:#ae81ff">786</span>]}
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;paragraphs&#34;</span>: [
</span></span><span style="display:flex;"><span>      {<span style="color:#f92672">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;Example sentence\nSecond line&#34;</span>, <span style="color:#f92672">&#34;box&#34;</span>: [<span style="color:#ae81ff">180</span>, <span style="color:#ae81ff">746</span>, <span style="color:#ae81ff">540</span>, <span style="color:#ae81ff">820</span>]}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>  },
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;image&#34;</span>: <span style="color:#e6db74">&#34;...&#34;</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
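<p>As a rough illustration of how this schema might be consumed, the sketch below measures a box and checks whether one box lies inside another. The sample dictionary is hand-written to mirror the example above (it is not real corpus data), and <code>box_size</code> and <code>contains</code> are hypothetical helper names, not part of any released tooling. Nesting of word, line, and paragraph boxes holds for this sample, but we do not assume it is guaranteed across the corpus.</p>

```python
# Hand-written sample mirroring the schema above; not real corpus data.
sample = {
    "text": {
        "words": [{"text": "Example", "box": [180, 746, 210, 786]}],
        "lines": [{"text": "Example sentence", "box": [180, 746, 540, 786]}],
        "paragraphs": [
            {"text": "Example sentence\nSecond line", "box": [180, 746, 540, 820]}
        ],
    },
    "image": "...",
}


def box_size(box):
    """Return (width, height) in pixels of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return x2 - x1, y2 - y1


def contains(outer, inner):
    """True when `inner` lies entirely within `outer` (both [x1, y1, x2, y2])."""
    return (
        inner[0] >= outer[0]
        and inner[1] >= outer[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


word = sample["text"]["words"][0]
line = sample["text"]["lines"][0]
paragraph = sample["text"]["paragraphs"][0]

print("word size:", box_size(word["box"]))
print("word inside line:", contains(line["box"], word["box"]))
print("line inside paragraph:", contains(paragraph["box"], line["box"]))
```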
<figure class="post-figure center ">
    <img src="/img/pubmed-ocr-annotation-levels.webp"
         alt="Tri-panel figure showing the same scientific article page annotated at word level (red), line level (blue), and paragraph level (green)."
         title="Tri-panel figure showing the same scientific article page annotated at word level (red), line level (blue), and paragraph level (green)."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The same page annotated at three granularities: word (left), line (center), and paragraph (right). Page from Zhou et al., &ldquo;Regulation of alternative splicing by local histone modifications: potential roles for RNA-guided mechanisms,&rdquo; <em>Nucleic Acids Research</em> 42(2):701-713, 2014 (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3902899/">PMC3902899</a>, DOI:<a href="https://doi.org/10.1093/nar/gkt875">10.1093/nar/gkt875</a>). Licensed under CC BY-NC.</figcaption>
    
</figure>

<h3 id="line-reconstruction">Line Reconstruction</h3>
<p>GCV returns word-level detections natively. Line and paragraph groupings are reconstructed using spatial heuristics: words are clustered into lines by vertical overlap and horizontal proximity, and paragraph grouping follows a similar process at a coarser scale. These heuristics work well for standard single-column scientific layouts but can fail on multi-column or irregularly structured pages (see Limitations).</p>
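<p>To make the idea concrete, here is a minimal sketch of one way to cluster words into lines by vertical overlap and horizontal proximity. It is not the heuristic used to build PubMed-OCR: the greedy strategy, the function names, and the <code>min_overlap</code> and <code>max_gap</code> thresholds are all assumptions made for illustration. Like the real heuristics, it only behaves sensibly on simple single-column layouts.</p>

```python
def vertical_overlap(a, b):
    """Vertical overlap of two [x1, y1, x2, y2] boxes, as a fraction of the shorter box."""
    top = max(a[1], b[1])
    bottom = min(a[3], b[3])
    shorter = min(a[3] - a[1], b[3] - b[1])
    if shorter == 0:
        return 0.0
    return max(0, bottom - top) / shorter


def group_into_lines(words, min_overlap=0.5, max_gap=40):
    """Greedily attach each word to a line whose last word is nearby and vertically aligned.

    `min_overlap` and `max_gap` (pixels) are illustrative thresholds, not tuned values.
    """
    lines = []
    # Visit words left to right so each line grows from its leftmost word.
    for word in sorted(words, key=lambda w: w["box"][0]):
        for line in lines:
            last = line[-1]["box"]
            gap = word["box"][0] - last[2]
            if vertical_overlap(last, word["box"]) >= min_overlap and max_gap >= gap:
                line.append(word)
                break
        else:
            # No existing line matched, so this word starts a new one.
            lines.append([word])
    # Order lines top to bottom by the y-coordinate of their first word.
    lines.sort(key=lambda line: line[0]["box"][1])
    return [" ".join(w["text"] for w in line) for line in lines]


words = [
    {"text": "stream", "box": [160, 101, 230, 121]},
    {"text": "is", "box": [100, 130, 115, 150]},
    {"text": "Page", "box": [100, 100, 150, 120]},
    {"text": "hard", "box": [125, 131, 170, 151]},
    {"text": "segmentation", "box": [240, 99, 360, 119]},
]
reconstructed = group_into_lines(words)
print(reconstructed)
```

<p>On a two-column page, words from adjacent columns at the same height can satisfy both conditions, which is one way the multi-column failures mentioned above can arise.</p>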
<h2 id="using-the-dataset">Using the Dataset</h2>
<p>The corpus spans 1.5M pages, so streaming is recommended for most use cases:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> datasets <span style="color:#f92672">import</span> load_dataset
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Streaming is recommended for the full 1.5M-page corpus</span>
</span></span><span style="display:flex;"><span>ds <span style="color:#f92672">=</span> load_dataset(<span style="color:#e6db74">&#34;rootsautomation/pubmed-ocr&#34;</span>, streaming<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, split<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;train&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Inspect a page</span>
</span></span><span style="display:flex;"><span>page <span style="color:#f92672">=</span> next(iter(ds))
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Article: </span><span style="color:#e6db74">{</span>page[<span style="color:#e6db74">&#39;accession_id&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">,  Page: </span><span style="color:#e6db74">{</span>page[<span style="color:#e6db74">&#39;page&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Parse OCR annotations</span>
</span></span><span style="display:flex;"><span>ocr <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(page[<span style="color:#e6db74">&#34;ocr_json&#34;</span>])
</span></span><span style="display:flex;"><span>text <span style="color:#f92672">=</span> ocr[<span style="color:#e6db74">&#34;text&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Iterate over lines and words</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> text[<span style="color:#e6db74">&#34;lines&#34;</span>][:<span style="color:#ae81ff">5</span>]:
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  Line: </span><span style="color:#e6db74">{</span>line[<span style="color:#e6db74">&#39;text&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  BBox: </span><span style="color:#e6db74">{</span>line[<span style="color:#e6db74">&#39;box&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Access individual word detections</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> word <span style="color:#f92672">in</span> text[<span style="color:#e6db74">&#34;words&#34;</span>][:<span style="color:#ae81ff">5</span>]:
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  Word: </span><span style="color:#e6db74">{</span>word[<span style="color:#e6db74">&#39;text&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">, BBox: </span><span style="color:#e6db74">{</span>word[<span style="color:#e6db74">&#39;box&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p>Full schema documentation is available on the <a href="https://huggingface.co/datasets/rootsautomation/pubmed-ocr">HuggingFace dataset card</a>.</p>
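<p>Once a page is parsed, the word boxes support simple coordinate-grounded lookups. The sketch below collects the words whose box center falls inside a query rectangle. The function name <code>words_in_region</code> and the center-point rule are our own choices for illustration, and the sample words are hand-written stand-ins for <code>text["words"]</code>.</p>

```python
def words_in_region(words, region):
    """Return the texts of words whose box center lies inside `region` ([x1, y1, x2, y2])."""
    rx1, ry1, rx2, ry2 = region
    hits = []
    for word in words:
        x1, y1, x2, y2 = word["box"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if rx2 >= cx >= rx1 and ry2 >= cy >= ry1:
            hits.append(word["text"])
    return hits


# Hand-written stand-in for text["words"] from a parsed page.
sample_words = [
    {"text": "Title", "box": [100, 50, 200, 80]},
    {"text": "Body", "box": [100, 300, 160, 320]},
    {"text": "text", "box": [170, 300, 220, 320]},
]

# Query the lower part of the page only.
found = words_in_region(sample_words, [0, 250, 600, 400])
print(found)
```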
<h2 id="why-this-matters">Why This Matters</h2>
<p>The lack of large-scale, high-quality OCR datasets with explicit geometric grounding has been a major bottleneck for training layout-aware models. By releasing PubMed-OCR, we provide the community with the dense, multi-level bounding box annotations necessary to build the next generation of document understanding systems. This dataset directly supports the development of models like <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>, enabling them to learn precise token-to-pixel alignment and robust layout reasoning.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Single OCR engine</strong>: All annotations come from Google Cloud Vision. GCV&rsquo;s error modes (handwriting, degraded scans, complex math, non-Latin scripts) propagate uncorrected into the dataset. Different OCR engines could yield different coverage patterns and error distributions.</li>
<li><strong>Heuristic line reconstruction</strong>: Spatial word-to-line clustering is approximate. Multi-column layouts, rotated text, or unusual page orientations may produce incorrect line groupings.</li>
<li><strong>PMCOA scope</strong>: Coverage is limited to the Open Access subset of PubMed Central. Commercial or subscription articles are excluded.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2026pubmedocrpmcopenaccess,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PubMed-OCR: PMC Open Access OCR Annotations}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich and Yosheb Getachew and Olivia Dinica and Ben Elliott}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2601.11425}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2601.11425}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>This dataset directly enables <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>, a family of vision-language models trained on PubMed-OCR annotations to produce grounded OCR outputs with explicit bounding boxes.</p>
<p>For related work on document processing pipelines that consume OCR output, see <a href="/research/llm-page-stream-segmentation/">LLMs for Page Stream Segmentation</a> and <a href="/research/page-stream-segmentation-llms/">Page Stream Segmentation with LLMs: Challenges and Applications</a>.</p>
]]></content:encoded></item><item><title>LLMs for Insurance Document Automation</title><link>https://hunterheidenreich.com/research/page-stream-segmentation-llms/</link><pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/page-stream-segmentation-llms/</guid><description>LLM applications for insurance document automation using parameter-efficient fine-tuning and analysis of calibration challenges.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Page Stream Segmentation (PSS) is critical for automating document processing in industries like insurance, where unstructured document collections are common. This paper explores the use of large language models (LLMs) for PSS, applying parameter-efficient fine-tuning to real-world insurance data. Our experiments show that LLMs outperform baseline models in segmentation accuracy. However, stream-level calibration remains a significant challenge: post-hoc calibration and Monte Carlo dropout offer only limited improvement, highlighting the need for future work in this area for high-stakes applications.</p>
<p>Published in the COLING 2025 Industry Track, this work builds on our earlier research establishing the <a href="/research/llm-page-stream-segmentation/">TabMe++ benchmark and decoder-based LLM approach</a>, extending those methods to real-world industrial deployment.</p>
<blockquote>
<p><strong>Blog Post:</strong> For a narrative overview of the reliability and calibration findings discussed in this paper, see <a href="/posts/reliability-trap-document-automation/">The Reliability Trap: The Limits of 99% Accuracy</a>.</p></blockquote>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Real-World Evaluation</strong>: Applied small-to-mid-sized LLMs (Phi-3.5-mini, Mistral-7B) to a proprietary insurance dataset, outperforming strong baselines like XGBoost in segmentation accuracy.</li>
<li><strong>Parameter-Efficient Fine-Tuning</strong>: Successfully used parameter-efficient fine-tuning (PEFT) to adapt LLMs for the specialized task of page stream segmentation.</li>
<li><strong>Calibration Complexity</strong>: Found that post-hoc calibration and Monte Carlo dropout offer limited improvement at the stream level, keeping human-in-the-loop workflows necessary for high-stakes automation (see stream-level confidence analysis below).</li>
<li><strong>Throughput Analysis</strong>: Introduced an accuracy-vs-throughput framework to quantify how much volume can be safely automated at strict confidence thresholds.</li>
</ul>
<h2 id="stream-level-confidence">Stream-Level Confidence</h2>
<p>A key insight from this work is why calibration becomes increasingly difficult as streams grow longer. We define stream-level confidence as the product of individual page-level confidences:</p>
<p>$$C = \prod_{i=1}^{N} C_i$$</p>
<p>where $C_i$ is the confidence for page $i$ and $N$ is the number of pages in the stream. This multiplicative relationship means that even small page-level errors compound aggressively. As streams grow longer, confidence drops rapidly, making it difficult to set reliable thresholds for automation.</p>
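<p>A small numerical sketch shows how quickly the product shrinks. The uniform page confidence of 0.99 is an assumption chosen for illustration, not a value measured in the paper. Under that assumption a 50-page stream already falls to roughly 0.61.</p>

```python
import math


def stream_confidence(page_confidences):
    """Stream-level confidence as the product of page-level confidences."""
    return math.prod(page_confidences)


# Assume every page is predicted with the same confidence (illustrative only).
page_confidence = 0.99
for n_pages in (1, 10, 50, 100):
    c = stream_confidence([page_confidence] * n_pages)
    print(f"N={n_pages:3d}  C={c:.3f}")
```

<p>A fixed automation threshold on $C$ therefore penalizes long streams even when every individual page decision looks trustworthy.</p>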
<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation-throughput.webp"
         alt="Stream accuracy versus relative throughput for Mistral-7B and XGBoost models, showing LLMs achieve higher automation rates at equivalent accuracy levels"
         title="Stream accuracy versus relative throughput for Mistral-7B and XGBoost models, showing LLMs achieve higher automation rates at equivalent accuracy levels"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Accuracy vs. throughput trade-off: Mistral-7B enables higher automation rates than XGBoost at strict accuracy thresholds, demonstrating the practical value of LLMs for document processing.</figcaption>
    
</figure>

<h2 id="technical-implementation">Technical Implementation</h2>
<h3 id="models--fine-tuning">Models &amp; Fine-Tuning</h3>
<p>We fine-tuned <strong>Mistral-7B-v0.3</strong> and <strong>Phi-3.5-mini</strong> (4-bit quantized) using QLoRA. Training was performed efficiently on a single NVIDIA H100 GPU using the <strong>Unsloth</strong> library and Hugging Face&rsquo;s TRL.</p>
<ul>
<li><strong>Stack</strong>: Unsloth + TRL</li>
<li><strong>Config</strong>: Rank $r=16$, Alpha $\alpha=16$</li>
</ul>
<h3 id="dataset">Dataset</h3>
<p>The study utilized a proprietary <strong>insurance dataset</strong> consisting of 7.5k document streams (44.7k pages). This real-world data includes health records, legal contracts, and police reports, offering a more challenging and realistic evaluation than synthetic benchmarks.</p>
<h3 id="task-framing">Task Framing</h3>
<p>The task framing follows the <a href="/research/llm-page-stream-segmentation/">TabMe++ work</a>: binary classification over a local window (previous page + current page). Here the models are prompted to output valid JSON marking the start of a new document.</p>
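<p>The local-window framing can be sketched in a few lines. Everything below is illustrative: the JSON key <code>new_document</code>, the helper names, and the choice to return <code>None</code> for malformed output are assumptions, not the prompt or output format used in the paper.</p>

```python
import json


def page_pairs(pages):
    """Yield (previous_page, current_page) windows; the first page has no predecessor."""
    previous = None
    for current in pages:
        yield previous, current
        previous = current


def parse_decision(raw_output):
    """Parse a model reply such as '{"new_document": true}' (hypothetical key name)."""
    try:
        return bool(json.loads(raw_output)["new_document"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed output: the caller decides how to handle it


def segment(pages, decisions):
    """Group pages into documents from per-page start-of-document decisions."""
    documents = []
    for page, starts_new in zip(pages, decisions):
        # The first page always opens a document, whatever the model says.
        if starts_new or not documents:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


pages = ["p1", "p2", "p3", "p4"]
windows = list(page_pairs(pages))
documents = segment(pages, [True, False, True, False])
print(windows)
print(documents)
```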
<h2 id="impact">Impact</h2>
<p>This work demonstrates both the promise and the current limitations of using LLMs in high-stakes industrial applications. LLMs can significantly improve segmentation accuracy over traditional methods, but performance metrics alone are not sufficient for deployment. For sectors like insurance, stream-level calibration is an open problem that must be solved before full automation becomes responsible.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{heidenreich2025page,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Page Stream Segmentation with LLMs: Challenges and Applications in Insurance Document Automation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heidenreich, Hunter and Dalvi, Ratish and Verma, Nikhil and Getachew, Yosheb}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 31st International Conference on Computational Linguistics: Industry Track}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{305--317}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLMs for Page Stream Segmentation</title><link>https://hunterheidenreich.com/research/llm-page-stream-segmentation/</link><pubDate>Wed, 21 Aug 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/llm-page-stream-segmentation/</guid><description>Enhanced TabMe benchmark for page stream segmentation, creating TabMe++, showing fine-tuned decoder-based LLMs outperform prior models.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Page Stream Segmentation (PSS), the task of correctly dividing a sequence of pages into distinct documents, is a critical first step in automated document processing pipelines. Research in this area has been held back by the lack of high-quality, public datasets.</p>
<p>In this work, we address this issue by enhancing an existing benchmark, <a href="https://github.com/aldolipani/TABME">TabMe</a>, with commercial-grade Optical Character Recognition (OCR) to create <strong><a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">TabMe++</a></strong>. This new version significantly reduces noise and improves text detection, highlighting the critical importance of OCR quality for document understanding tasks.</p>
<p>We then conduct the first evaluation of large, decoder-based language models (LLMs) on the PSS task. Our findings show that models like <a href="https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit">Mistral-7B</a>, when fine-tuned using parameter-efficient methods, <strong>outperform smaller encoder-based models</strong> and traditional baselines. For instance, our best model correctly segments 80% of document streams in the test set without any errors.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Enhanced Public Benchmark (<a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">TabMe++</a>)</strong>: Re-processed the entire <a href="https://github.com/aldolipani/TABME">TabMe</a> dataset with commercial OCR, correcting significant text recognition errors and reducing blank pages by over 80% (from 2.27% to 0.38%)</li>
<li><strong>First Application of Large Decoder-Based LLMs to PSS</strong>: Systematically evaluated and fine-tuned billion-parameter, decoder-only LLMs for page stream segmentation</li>
<li><strong>Decoder LLMs Outperform Encoders on TabMe++</strong>: Fine-tuned decoder models outperform previous encoder-based and multimodal approaches on the benchmark</li>
<li><strong>OCR Quality Analysis</strong>: Quantified the impact that high-quality OCR has on PSS model performance through comparative experiments</li>
</ul>
<h2 id="the-evolution-of-page-stream-segmentation">The Evolution of Page Stream Segmentation</h2>
<p>The history of PSS can be read as three distinct algorithmic eras, revealing a clear trajectory toward semantic understanding:</p>
<ul>
<li><strong>The Heuristic Era:</strong> Early systems relied on handcrafted rules and region-specific pattern matching (e.g., looking for headers/footers), which failed to generalize across heterogeneous documents.</li>
<li><strong>The Encoder Era:</strong> The field moved to &ldquo;learning-based&rdquo; methods using Convolutional Neural Networks (CNNs) and later Transformer encoders like LayoutLM and LEGAL-BERT. While better, these often required complex multimodal architectures.</li>
<li><strong>The Decoder Era (this work):</strong> We establish the viability of a third era built on billion-parameter generative models (decoder-only LLMs), which simplify the architecture while improving semantic reasoning.</li>
</ul>
<blockquote>
<p><strong>Blog Post:</strong> Read the full story of these eras in <a href="/posts/history-of-page-stream-segmentation/">The Evolution of Page Stream Segmentation</a>.</p></blockquote>
<h2 id="key-evaluation-metrics">Key Evaluation Metrics</h2>
<p>Beyond standard F1 scores, the study evaluates models on metrics that directly translate to operational costs:</p>
<ul>
<li><strong>Straight-Through Processing (STP):</strong> The percentage of document streams segmented <em>perfectly</em>, requiring zero human intervention. The fine-tuned <a href="https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit">Mistral-7B</a> achieved an STP of <strong>0.800</strong>, meaning 80% of streams were fully automated. In contrast, the traditional XGBoost baseline achieved only <strong>0.074</strong>.</li>
<li><strong>Minimum Number of Drag-and-Drops (MNDD):</strong> A proxy for human effort, measuring how many pages a human would need to move to correct the segmentation. The best LLM reduced this &ldquo;effort metric&rdquo; by over <strong>13x</strong> compared to the XGBoost baseline (0.81 vs 10.85).</li>
</ul>
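<p>As a minimal sketch of how STP can be computed, the snippet below counts the streams whose predicted labels match the ground truth exactly. The label encoding (1 marks a page that starts a new document) and the function name <code>straight_through_rate</code> are assumptions made for illustration, not the paper's evaluation code.</p>

```python
from typing import Sequence


def straight_through_rate(
    predicted: Sequence[Sequence[int]],
    expected: Sequence[Sequence[int]],
) -> float:
    """Fraction of streams whose predicted labels match the ground truth exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must contain the same number of streams")
    if not expected:
        return 0.0
    # A stream counts only if every page label is correct (zero human intervention).
    perfect = sum(1 for p, g in zip(predicted, expected) if list(p) == list(g))
    return perfect / len(expected)


# Three toy streams; 1 marks a page that starts a new document (assumed encoding).
truth = [[1, 0, 1], [1, 1, 0, 0], [1, 0]]
preds = [[1, 0, 1], [1, 0, 1, 0], [1, 0]]
stp = straight_through_rate(preds, truth)
print(f"STP = {stp:.3f}")
```

<p>In this toy example the second stream contains a misplaced break, so two of the three streams count as straight-through. A single wrong page label is enough to disqualify a stream, which is why STP is much harsher than page-level accuracy.</p>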
<h3 id="document-level-precision-and-recall">Document-Level Precision and Recall</h3>
<p>We define a ground truth segmentation $\mathcal{G}$ and a predicted segmentation $\mathcal{P}$. A &ldquo;True Positive&rdquo; is defined strictly as a document present in both sets ($\mathcal{P} \cap \mathcal{G}$). The metrics are calculated as:</p>
<p>$$P = \frac{|\mathcal{P} \cap \mathcal{G}|}{|\mathcal{P} \cap \mathcal{G}| + |\mathcal{P} \setminus \mathcal{G}|}$$</p>
<p>$$R = \frac{|\mathcal{P} \cap \mathcal{G}|}{|\mathcal{P} \cap \mathcal{G}| + |\mathcal{G} \setminus \mathcal{P}|}$$</p>
<p>This strict definition rewards a model only when it gets <em>both</em> the start and end boundaries of a document correct.</p>
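<p>The set-based definitions above can be sketched in a few lines. The representation used here is an assumption for illustration: each document is an inclusive (first page, last page) span, and per-page labels use 1 for a page that starts a new document. The helper names are hypothetical.</p>

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (first_page_index, last_page_index), inclusive


def breaks_to_spans(breaks: List[int]) -> Set[Span]:
    """Convert per-page labels (1 = page starts a new document) into document spans."""
    spans: Set[Span] = set()
    if not breaks:
        return spans
    start = 0  # the first page always opens a document
    for i in range(1, len(breaks)):
        if breaks[i] == 1:
            spans.add((start, i - 1))
            start = i
    spans.add((start, len(breaks) - 1))
    return spans


def document_precision_recall(
    pred_breaks: List[int], true_breaks: List[int]
) -> Tuple[float, float]:
    """Document-level precision and recall over exactly matching spans."""
    pred = breaks_to_spans(pred_breaks)
    gold = breaks_to_spans(true_breaks)
    true_positives = len(pred.intersection(gold))
    # |P and G| + |P minus G| equals |P|; likewise the recall denominator equals |G|.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


true_breaks = [1, 0, 1, 0, 0, 1]  # documents: pages 0-1, 2-4, 5
pred_breaks = [1, 0, 1, 0, 1, 1]  # documents: pages 0-1, 2-3, 4, 5
precision, recall = document_precision_recall(pred_breaks, true_breaks)
print(f"P = {precision:.3f}, R = {recall:.3f}")
```

<p>Here one spurious break splits the three-page document in two. Neither fragment matches a ground truth span, so the example yields two true positives out of four predicted documents and three ground truth documents.</p>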
<h2 id="technical-innovation">Technical Innovation</h2>
<p>Our approach combines commercial-grade OCR processing with parameter-efficient fine-tuning of large language models. We addressed two main bottlenecks: data quality and model efficiency.</p>
<h3 id="data-remediation">Data Remediation</h3>
<p>The original <a href="https://github.com/aldolipani/TABME">TabMe</a> dataset relied on Tesseract OCR, which introduced significant noise. By reprocessing the images with Microsoft OCR, we reduced the number of &ldquo;blank&rdquo; pages from <strong>2.27% to just 0.38%</strong>, recovering critical features like titles and ID numbers that were previously lost.</p>
<h3 id="model-architecture">Model Architecture</h3>
<p>We formulated the task as a <strong>binary classification of page pairs</strong>: predicting if a &ldquo;break&rdquo; exists between Page $N$ and Page $N+1$.</p>
<h4 id="problem-formulation">Problem Formulation</h4>
<p>More generally, PSS can be framed as binary classification over a window of pages. For a specific page $p_i$, the model predicts a binary label $y_i$, indicating whether a break precedes that page, from a window of adjacent pages $(p_{i-l}, \ldots, p_i, \ldots, p_{i+r})$. In this work, we fix the window to:</p>
<p>$$l=1, \quad r=0$$</p>
<p>This means the decision for page $p_i$ is made solely based on the pair $(p_{i-1}, p_i)$.</p>
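<p>A minimal sketch of this pairwise formulation follows. The <code>toy_classifier</code> is a hypothetical stand-in for the fine-tuned model, and the page strings are invented for illustration; only the windowing ($l=1$, $r=0$) reflects the setup described above.</p>

```python
from typing import Callable, List, Tuple


def make_page_pairs(pages: List[str]) -> List[Tuple[str, str]]:
    """Build the (p_{i-1}, p_i) inputs for i = 1..n-1, i.e. a window with l=1, r=0."""
    return [(pages[i - 1], pages[i]) for i in range(1, len(pages))]


def segment_stream(
    pages: List[str], is_break: Callable[[str, str], bool]
) -> List[List[str]]:
    """Group a page stream into documents using one pairwise decision per page."""
    if not pages:
        return []
    documents = [[pages[0]]]  # the first page always opens a document
    for prev_page, page in make_page_pairs(pages):
        if is_break(prev_page, page):
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def toy_classifier(prev_page: str, page: str) -> bool:
    """Hypothetical stand-in for the model: a break when the text prefix changes."""
    return prev_page.split(":")[0] != page.split(":")[0]


stream = ["invoice: page 1", "invoice: page 2", "claim: page 1", "letter: page 1"]
documents = segment_stream(stream, toy_classifier)
for doc in documents:
    print(doc)
```

<p>Because each decision depends only on one adjacent pair, a stream of $n$ pages requires $n-1$ independent model calls, which can be batched.</p>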
<h3 id="efficient-tuning">Efficient Tuning</h3>
<p>We used <strong>Low-Rank Adaptation (LoRA)</strong> and 4-bit quantization to fine-tune <a href="https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit">Mistral-7B</a> and <a href="https://huggingface.co/unsloth/Phi-3-mini-4k-instruct">Phi-3-mini</a> on a single NVIDIA H100 GPU, showing that PSS does not require massive compute clusters.</p>
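<p>To give a rough sense of why LoRA keeps training cheap, the sketch below compares the trainable parameters of a full weight update with those of a low-rank update $BA$ for a single projection matrix. The 4096 x 4096 shape and the rank of 16 are illustrative assumptions, not hyperparameters reported in the paper.</p>

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative values only: one square projection matrix and an assumed LoRA rank.
d_out, d_in, rank = 4096, 4096, 16

full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
reduction = full / lora
print(f"full: {full:,}  lora: {lora:,}  reduction: {reduction:.0f}x")
```

<p>Under these assumptions the adapter trains roughly 131 thousand parameters for a matrix with about 16.8 million frozen weights. Storing the frozen weights in 4 bits reduces memory further, which is what makes single-GPU fine-tuning practical.</p>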
<h2 id="why-this-matters">Why This Matters</h2>
<p>Page Stream Segmentation is the critical first step in any automated document processing pipeline. If a system fails to correctly separate documents, all downstream tasks (like classification or data extraction) will operate on corrupted inputs. By demonstrating that LLMs fine-tuned with parameter-efficient methods can achieve an 80% straight-through processing rate, this work provides a viable path toward fully automating high-volume document workflows.</p>
<p>This work also gives the research community improved evaluation tools. The enhanced <a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">TabMe++</a> dataset and the quantified impact of OCR quality on PSS performance both have direct applications in commercial document processing pipelines.</p>
<p>We later extended these findings to real-world industrial deployment and analyzed model calibration challenges in our follow-up <a href="/research/page-stream-segmentation-llms/">COLING Industry paper on LLMs for Insurance Document Automation</a>. The calibration challenges that emerged from that deployment are explored in depth in <a href="/posts/reliability-trap-document-automation/">The Reliability Trap: The Limits of 99% Accuracy</a>.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2024largelanguagemodelspage,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Large Language Models for Page Stream Segmentation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich and Ratish Dalvi and Rohith Mukku and Nikhil Verma and Neven Pičuljan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2408.11981}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2408.11981}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>