GutenOCR: A Grounded Vision-Language Front-End for Documents
1️⃣ One-Sentence Summary
This paper presents GutenOCR, a vision-language model obtained by fine-tuning an existing backbone so that a single checkpoint can read, localize, and answer questions about text in documents through one unified interface. It delivers large gains on business and scientific document benchmarks, while also revealing trade-offs when handling complex layouts.
GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
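To make the "unified, prompt-based interface" concrete, here is a minimal sketch of how such a single-checkpoint model might be queried for full-page reading, text detection, and a conditional grounding query. It assumes the fine-tuned weights are released in the standard Qwen2.5-VL format on Hugging Face; the model id, prompt wording, and image file name are illustrative assumptions, not taken from the paper.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Hypothetical checkpoint id; the paper fine-tunes Qwen2.5-VL-7B, so we assume
# the released weights keep the Qwen2.5-VL architecture and processor.
MODEL_ID = "GutenOCR-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

def ask(image: Image.Image, prompt: str) -> str:
    """Run one image plus one text prompt through the chat template and decode the reply."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},                 # image placeholder expanded by the chat template
            {"type": "text", "text": prompt},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024)
    # Strip the prompt tokens so only the model's answer is decoded.
    answer = out[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(answer, skip_special_tokens=True)[0]

page = Image.open("page.png")  # any document page image

# One checkpoint handles all three tasks; only the prompt changes.
# Prompt strings below are illustrative, not the paper's exact wording.
print(ask(page, "Read all text on this page."))                          # full-page reading
print(ask(page, "Detect every text line and return its bounding box."))  # text detection
print(ask(page, "Where is the invoice total?"))                          # conditional grounding
```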
From arXiv: 2601.14490