📄 Abstract - LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
We present **LightOnOCR-2-1B**, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and **LightOnOCR-bbox-bench** evaluation under their respective licenses.
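For context on the RLVR stage, the reward compares a predicted image bounding box against a reference box by their overlap. The following is a minimal sketch of an IoU computation over normalized boxes; the function name `iou_reward` and the `[x0, y0, x1, y1]` coordinate convention are assumptions for illustration, not the paper's exact reward formulation, which may also include matching or shaping terms.

```python
def iou_reward(pred_box, gt_box):
    """Intersection-over-Union of two normalized [x0, y0, x1, y1] boxes in [0, 1].

    Illustrative sketch only: the paper states that RLVR uses IoU-based rewards
    for predicted image boxes, but does not detail the exact reward here.
    """
    # Intersection rectangle (clamped to zero width/height if boxes do not overlap).
    ix0 = max(pred_box[0], gt_box[0])
    iy0 = max(pred_box[1], gt_box[1])
    ix1 = min(pred_box[2], gt_box[2])
    iy1 = min(pred_box[3], gt_box[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    # Union = sum of areas minus intersection.
    area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_pred + area_gt - inter

    return inter / union if union > 0 else 0.0


# Example: a prediction that partially overlaps the reference box.
print(iou_reward([0.10, 0.10, 0.50, 0.50], [0.20, 0.20, 0.60, 0.60]))  # ~0.29
```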
LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
1️⃣ One-Sentence Summary
This paper introduces LightOnOCR-2-1B, a lightweight model that converts document images (e.g., PDFs) directly into clean, naturally ordered text without a complex traditional OCR pipeline, outperforms the larger and slower previous best models, and additionally predicts the positions of images embedded in the document.