arXiv submission date: 2026-06-24
📄 Abstract - How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and are increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently understood. This gap is critical for OCR reasoning, where visual corruption can induce OCR errors and structural distortions, thereby introducing uncertainty into the reasoning task. To systematically study this problem, we introduce OCR-Robust, a benchmark designed for evaluating OCR reasoning robustness under visual perturbations. It contains 812 samples across two complementary subsets: OCR1.0, covering documents, scene text, receipts, handwriting, and mathematical content, and OCR2.0, focusing on charts, geometry diagrams, and tables. To enable efficient yet informative evaluation, we conduct a pilot study over 18 candidate perturbations and select 5 representative types, each at 3 severity levels, based on their impact and cross-model discriminability. We evaluate robustness using clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI), and benchmark 18 models spanning proprietary systems, open-source VLMs, and OCR+LLM pipelines. Our results show that higher clean accuracy does not necessarily imply stronger robustness, that models can suffer pronounced worst-case degradation on structure-sensitive OCR tasks, and that charts and tables are substantially more fragile than document-like inputs under perturbation.
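The abstract names retention-style metrics without giving their formulas. As a rough illustration only, the sketch below assumes that "retention" means corrupted accuracy divided by clean accuracy, and reports the mean and the minimum of that ratio over perturbation settings. These definitions, the function names, the perturbation names (`blur`, `noise`), and all numbers are assumptions for illustration, not the paper's actual RCR, WCR, or CRI.

```python
from statistics import mean


def retention(clean_acc, corrupted_acc):
    """Fraction of clean accuracy kept under one perturbation setting (assumed definition)."""
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    return corrupted_acc / clean_acc


def robustness_summary(clean_acc, corrupted_accs):
    """Summarise retention over settings.

    corrupted_accs maps a (perturbation, severity) pair to the accuracy
    measured under that setting.
    """
    ratios = {key: retention(clean_acc, acc) for key, acc in corrupted_accs.items()}
    return {
        "mean_retention": mean(ratios.values()),       # average-case view
        "worst_retention": min(ratios.values()),       # worst-case view
        "worst_setting": min(ratios, key=ratios.get),  # setting that hurts most
    }


# Hypothetical accuracies for a single model; not results from the paper.
example = robustness_summary(
    0.80,
    {("blur", 1): 0.76, ("blur", 3): 0.60, ("noise", 2): 0.40},
)
print(example)
```

Under these assumptions, two models with the same clean accuracy can differ sharply in `worst_retention`, which is the kind of gap between accuracy and robustness the abstract describes.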

Top-level tags: llm multi-modal model evaluation
Detailed tags: vision-language models ocr reasoning robustness benchmark visual perturbations corruption robustness

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations


1️⃣ One-sentence summary

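This paper builds a benchmark called OCR-Robust that applies multiple visual perturbations to evaluate how robust existing vision-language models are on text recognition and reasoning tasks. It finds that models are more fragile on tables and charts than on ordinary documents, and that high accuracy does not imply stronger resistance to perturbation.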

Source: arXiv: 2606.26041