GutenOCR is a family of vision-language models (VLMs) designed to serve as a “grounded OCR front-end”. Unlike traditional OCR pipelines (which are brittle) or modern “OCR-free” VLMs (which often lack precise token-to-pixel alignment), GutenOCR is fine-tuned to provide both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface.
Core Capabilities
GutenOCR transforms the Qwen2.5-VL-3B and -7B models into specialized OCR systems without architectural changes. A single checkpoint supports four distinct task families via prompting:
- Full-Page Reading: Returns transcripts in plain text, layout-sensitive “text2d,” or structured JSON with line/paragraph bounding boxes.
- Full-Page Detection: Localizes text regions (lines, paragraphs, math) without transcription.
- Localized Reading: Transcribes text only within a user-specified bounding box.
- Conditional Detection: Locates all instances of a specific query string on the page.
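The prompt-based interface for the four task families above can be sketched as follows. The exact prompt strings are illustrative assumptions, not the model's documented prompts; only the task taxonomy comes from the description above.

```python
# Build Qwen2.5-VL-style chat turns for each of the four task families.
# Prompt wordings below are hypothetical examples for illustration.

def build_message(image_path: str, prompt: str) -> dict:
    """Package an image plus a task prompt as a single user chat turn."""
    return {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }

# One hypothetical prompt per task family:
TASKS = {
    "full_page_read": "Read the page. Return JSON with line bounding boxes.",
    "full_page_detect": "Detect all text lines. Return bounding boxes only.",
    "localized_read": "Read the text inside the box [120, 80, 560, 140].",
    "conditional_detect": 'Find every occurrence of "Invoice No." on the page.',
}

messages = {name: build_message("page.png", p) for name, p in TASKS.items()}
```

Because all four tasks share one checkpoint, switching tasks means changing only the prompt, not the model or pipeline.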
Methodology
- Data: The training mixture combines large-scale real-world documents (business forms, scientific articles) with synthetic data designed to teach precise grounding (e.g., “Grounded LaTeX” and “SynthDoG Grounding”).
- Curriculum Learning: Training progresses through three stages, starting with short contexts and synthetic data, moving to real-world business documents, and finishing with long-context scientific articles (up to 16k tokens).
- Unified Interface: The model treats “pipeline” stages (detection, reading, grounding) as different input-output schemas of a single model, allowing downstream systems to request exactly the data format they need (e.g., plain text vs. JSON boxes).
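A downstream consumer of the unified interface might look like this minimal sketch: the same page read can be consumed either as structured JSON with boxes or flattened to plain text. The JSON record shape here is an assumed example, not the model's documented output schema.

```python
# Consume a grounded full-page read in two formats from one output.
# The JSON shape (list of {"text", "bbox"} records) is an assumption.
import json
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels; coordinate convention assumed

def parse_grounded_output(raw: str) -> list[Line]:
    """Parse a JSON full-page read into text-plus-box records."""
    return [Line(o["text"], tuple(o["bbox"])) for o in json.loads(raw)]

raw = '[{"text": "Hello", "bbox": [10, 10, 90, 30]}]'
lines = parse_grounded_output(raw)
plain_text = "\n".join(line.text for line in lines)  # plain-text view
```

A system that only needs a transcript uses `plain_text`; one that needs grounding keeps the `bbox` fields, without a second model call.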
Key Results
We evaluate GutenOCR on held-out in-domain pages and on external benchmarks (Fox, OmniDocBench v1.5):
- In-Domain Improvements: GutenOCR-7B more than doubles the composite grounded OCR score of its base model (0.40 to 0.82) on 10.5K held-out pages, showing massive gains in localized reading and detection.
- Fox Benchmark: GutenOCR significantly outperforms baselines on region-level and line-level OCR. For example, GutenOCR-3B achieves a region-level Character Error Rate (CER) of 0.053, surpassing even the dedicated Fox model.
- Trade-offs:
- “Good Reader, Bad Paginator”: While GutenOCR reads content accurately (high Page F1), it orders text based on 2D layout columns rather than the specific reading order of some benchmarks, resulting in higher page-level CER.
- Catastrophic Forgetting: The model loses the ability to understand color-based prompts (e.g., “read the text in the red box”), a capability present in the base Qwen model.
- Math Degradation: Formula recognition slightly degrades compared to the base model, likely due to a lack of math-heavy supervision in later training stages.
Models
We release 3B and 7B parameter models on HuggingFace.
Significance
GutenOCR is proposed as a foundational layer for systems where every extracted answer is explicitly linked to supporting pixels. By providing stable, grounded outputs, it enables human-in-the-loop workflows where reviewers can easily verify hallucinations or missing text by checking the predicted bounding boxes.
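The review workflow described above can be sketched in a few lines: because every extracted line carries a bounding box, a reviewer tool can jump straight to the pixels behind any answer. The record shape is an assumed example of a grounded output.

```python
# Human-in-the-loop check: surface the boxes a reviewer should inspect
# for a given extracted value. The line-record shape is an assumption.

def boxes_to_review(lines: list[dict], query: str) -> list[tuple]:
    """Return bounding boxes of lines whose text contains `query`."""
    return [tuple(line["bbox"]) for line in lines if query in line["text"]]

lines = [
    {"text": "Total due: $1,204.00", "bbox": [40, 700, 320, 724]},
    {"text": "Invoice #8841", "bbox": [40, 60, 200, 84]},
]
# A reviewer verifying the extracted total highlights exactly these pixels:
hits = boxes_to_review(lines, "Total due")
```

An empty result for a value the model claims to have extracted is itself a signal: the answer has no supporting pixels and is a likely hallucination.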
Citation
@misc{heidenreich2026gutenocrgroundedvisionlanguagefrontend,
  title={GutenOCR: A Grounded Vision-Language Front-End for Documents},
  author={Hunter Heidenreich and Ben Elliott and Olivia Dinica and Yosheb Getachew},
  year={2026},
  eprint={2601.14490},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.14490},
}
