GutenOCR is a family of vision-language models (VLMs) designed to serve as a “grounded OCR front-end”. Unlike traditional OCR pipelines (which are brittle) or modern “OCR-free” VLMs (which often lack precise token-to-pixel alignment), GutenOCR is fine-tuned to provide both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface.

Core Capabilities

GutenOCR transforms the Qwen2.5-VL-3B and -7B models into specialized OCR systems without architectural changes. A single checkpoint supports four distinct task families via prompting (see the inference sketch after this list):

  • Full-Page Reading: Returns transcripts in plain text, layout-sensitive “text2d,” or structured JSON with line/paragraph bounding boxes.
  • Full-Page Detection: Localizes text regions (lines, paragraphs, math) without transcription.
  • Localized Reading: Transcribes text only within a user-specified bounding box.
  • Conditional Detection: Locates all instances of a specific query string on the page.
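
Because GutenOCR keeps the Qwen2.5-VL architecture unchanged, inference should follow the standard Qwen2.5-VL workflow in Hugging Face transformers. A minimal sketch, assuming a hypothetical checkpoint path and illustrative prompt wording (neither is specified verbatim here); only the Qwen2.5-VL API calls themselves are standard:

# Minimal inference sketch. CKPT and the prompt strings below are
# illustrative assumptions, not the paper's verbatim interface.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

CKPT = "path/to/GutenOCR-7B"  # hypothetical; substitute the released checkpoint

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    CKPT, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(CKPT)

def ask(image_path: str, prompt: str, max_new_tokens: int = 2048) -> str:
    """Run one page image plus one task prompt through the model."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    answer = out[:, inputs["input_ids"].shape[1]:]  # strip the echoed prompt
    return processor.batch_decode(answer, skip_special_tokens=True)[0]

# One call per task family; the wording is a guess at the interface:
page_text = ask("page.png", "Read the full page as plain text.")
layout    = ask("page.png", "Detect all text lines and return bounding boxes.")
region    = ask("page.png", "Read the text inside the box [112, 200, 640, 260].")
matches   = ask("page.png", "Find every occurrence of 'Invoice No.' on the page.")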

Methodology

  • Data: The training mixture combines large-scale real-world documents (business forms, scientific articles) with synthetic data designed to teach precise grounding (e.g., “Grounded LaTeX” and “SynthDoG Grounding”).
  • Curriculum Learning: Training progresses through three stages, starting with short contexts and synthetic data, moving to real-world business documents, and finishing with long-context scientific articles (up to 16k tokens).
  • Unified Interface: The model treats “pipeline” stages (detection, reading, grounding) as different input-output schemas of a single model, allowing downstream systems to request exactly the data format they need (e.g., plain text vs. JSON boxes); a sketch of such an output follows.
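
The grounded JSON variant of full-page reading is described above but its schema is not spelled out, so the field names in this sketch are assumptions; the point is only how a downstream consumer would walk such an output:

import json

# Hypothetical grounded-JSON transcript; field names are illustrative.
sample = """
{
  "paragraphs": [
    {"bbox": [72, 96, 540, 180],
     "lines": [
       {"bbox": [72, 96, 540, 120], "text": "GutenOCR: A Grounded Vision-Language"},
       {"bbox": [72, 130, 430, 154], "text": "Front-End for Documents"}
     ]}
  ]
}
"""

doc = json.loads(sample)
for para in doc["paragraphs"]:
    for line in para["lines"]:
        x0, y0, x1, y1 = line["bbox"]
        print(f"({x0},{y0})-({x1},{y1})  {line['text']}")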

Key Results

The authors evaluate GutenOCR on held-out in-domain pages and external benchmarks (Fox, OmniDocBench v1.5):

  • In-Domain Improvements: GutenOCR-7B more than doubles the composite grounded OCR score of its base model (from 0.40 to 0.82) on 10.5K held-out pages, with the largest gains in localized reading and detection.
  • Fox Benchmark: GutenOCR significantly outperforms baselines on region-level and line-level OCR. For example, GutenOCR-3B achieves a region-level Character Error Rate (CER) of 0.053, surpassing even the dedicated Fox model (see the CER sketch after this list).
  • Trade-offs:
    • “Good Reader, Bad Paginator”: While GutenOCR reads content accurately (high Page F1), it emits text in 2D column order rather than the reading order some benchmarks expect, which inflates page-level CER.
    • Catastrophic Forgetting: The model loses the ability to understand color-based prompts (e.g., “read the text in the red box”), a capability present in the base Qwen model.
    • Math Degradation: Formula recognition slightly degrades compared to the base model, likely due to a lack of math-heavy supervision in later training stages.
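
CER here is the standard metric: character-level Levenshtein distance divided by reference length. Because it compares flattened strings, reading-order swaps inflate page-level CER even when every region is transcribed correctly, which is exactly the “good reader, bad paginator” pattern. A self-contained sketch of the metric (not the paper's evaluation code):

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edits / reference length."""
    m, n = len(reference), len(hypothesis)
    # Standard dynamic-programming edit distance over characters.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# One wrong character in a 17-character reference scores ~0.059, on the
# order of the 0.053 region-level CER reported for GutenOCR-3B.
print(cer("Invoice No. 12345", "Invoice No. 12845"))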

Models

We release 3B- and 7B-parameter models on HuggingFace.

Significance

GutenOCR is proposed as a foundational layer for systems where every extracted answer is explicitly linked to supporting pixels. By providing stable, grounded outputs, it enables human-in-the-loop workflows in which reviewers can catch hallucinated or missing text by checking the predicted bounding boxes.
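
As one concrete form of that workflow, the predicted boxes can be drawn directly onto the page image for review. A minimal sketch with Pillow, assuming pixel-space [x0, y0, x1, y1] boxes in the hypothetical schema from the Methodology section:

from PIL import Image, ImageDraw

def render_review_page(image_path, lines, out_path="review.png"):
    """Overlay predicted line boxes so a reviewer can spot text with no
    box (possible miss) or a box whose transcript doesn't match the
    pixels underneath (possible hallucination)."""
    page = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(page)
    for line in lines:
        x0, y0, x1, y1 = line["bbox"]
        draw.rectangle([x0, y0, x1, y1], outline=(255, 0, 0), width=2)
    page.save(out_path)

render_review_page("page.png", [{"bbox": [72, 96, 540, 120], "text": "..."}])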

Citation

@misc{heidenreich2026gutenocrgroundedvisionlanguagefrontend,
      title={GutenOCR: A Grounded Vision-Language Front-End for Documents}, 
      author={Hunter Heidenreich and Ben Elliott and Olivia Dinica and Yosheb Getachew},
      year={2026},
      eprint={2601.14490},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.14490}, 
}