GutenOCR: A Grounded Vision-Language Front-End for Documents

GutenOCR is a family of vision-language models (VLMs) designed to serve as a “grounded OCR front-end”. Unlike traditional OCR pipelines (which are brittle) or modern “OCR-free” VLMs (which often lack precise token-to-pixel alignment), GutenOCR is fine-tuned to provide both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface.

Abstract

Traditional OCR pipelines are often brittle, while modern “OCR-free” Vision-Language Models (VLMs) frequently lack precise token-to-pixel alignment. To address this, we introduce GutenOCR, a family of VLMs designed specifically as a “grounded OCR front-end.” By fine-tuning Qwen2.5-VL on a curriculum of synthetic and real-world documents, GutenOCR provides both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface. This approach allows downstream systems to request exactly the data format they need, from plain text to complex JSON structures.

Key Contributions & Results

Unified Interface: Transforms Qwen2.5-VL models into specialized OCR systems supporting full-page reading, detection, localized reading, and conditional detection via prompting.
In-Domain Improvements: GutenOCR-7B more than doubles the composite grounded OCR score of its base model (0.40 to 0.82) on 10.5K held-out pages, showing massive gains in localized reading and detection.
Fox Benchmark: Significantly outperforms baselines on region-level and line-level OCR, with GutenOCR-3B achieving a region-level Character Error Rate (CER) of 0.053, surpassing even the dedicated Fox model.
Curriculum Learning: Training progresses through three stages, starting with short contexts and synthetic data, moving to real-world business documents, and finishing with long-context scientific articles.
Trade-offs: While GutenOCR reads content accurately (high Page F1), it orders text based on 2D layout columns. It also experiences some catastrophic forgetting of color-based prompts and slight degradation in math formula recognition.

Methodology

Data: The training mixture combines large-scale real-world documents (business forms, scientific articles) with synthetic data designed to teach precise grounding (e.g., “Grounded LaTeX” and “SynthDoG Grounding”).
Curriculum Learning: Training progresses through three stages, starting with short contexts and synthetic data, moving to real-world business documents, and finishing with long-context scientific articles (up to 16k tokens).
Unified Interface: The model treats “pipeline” stages (detection, reading, grounding) as different input-output schemas of a single model, allowing downstream systems to request exactly the data format they need (e.g., plain text vs. JSON boxes).

Models

We release 3B and 7B parameter models on HuggingFace:

Why This Matters

GutenOCR is proposed as a foundational layer for systems where every extracted answer must be explicitly linked to supporting pixels. By providing stable, grounded outputs, it enables human-in-the-loop workflows where reviewers can easily verify hallucinations or missing text by checking the predicted bounding boxes. This work pairs closely with our release of PubMed-OCR, which provides the large-scale, high-density annotations necessary to train such layout-aware models.

Citation

@misc{heidenreich2026gutenocrgroundedvisionlanguagefrontend,
      title={GutenOCR: A Grounded Vision-Language Front-End for Documents},
      author={Hunter Heidenreich and Ben Elliott and Olivia Dinica and Yosheb Getachew},
      year={2026},
      eprint={2601.14490},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.14490},
}

For more on document processing, see LLMs for Page Stream Segmentation and The Evolution of Page Stream Segmentation: Rules to LLMs.

Abstract#

Key Contributions & Results#

Methodology#

Models#

Why This Matters#

Citation#

Related Work#