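OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
1️⃣ One-Sentence Summary
This study finds that, for powerful multimodal large language models, feeding document images directly into the model for information extraction already performs comparably to the traditional pipeline of OCR preprocessing followed by analysis, which suggests that the OCR step may no longer be needed in future document processing.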
Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline, although simpler, can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve performance comparable to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schemas, exemplars, and instructions can further enhance MLLM performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.
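The comparison the abstract describes, image-only input versus OCR-enhanced input, can be pictured as scoring two sets of extracted fields against the same ground truth. The sketch below is a minimal illustration, assuming a field-level exact-match metric with light normalization; the abstract does not state the paper's actual metric, and the function names, field names, and toy records are all hypothetical and are not results from the study.

```python
from typing import Dict, List

Record = Dict[str, str]


def normalize(value: str) -> str:
    """Collapse whitespace and ignore case (an assumed normalization)."""
    return " ".join(value.split()).casefold()


def field_accuracy(predictions: List[Record], ground_truth: List[Record]) -> float:
    """Fraction of ground-truth fields whose predicted value matches exactly
    after normalization. A field missing from the prediction counts as wrong."""
    total = 0
    correct = 0
    for pred, gold in zip(predictions, ground_truth):
        for field, gold_value in gold.items():
            total += 1
            pred_value = pred.get(field)
            if pred_value is not None and normalize(pred_value) == normalize(gold_value):
                correct += 1
    return correct / total if total else 0.0


# Hypothetical toy data, for illustration only.
gold = [
    {"invoice_number": "INV-001", "total": "120.50"},
    {"invoice_number": "INV-002", "total": "89.00"},
]
# Hypothetical output of an image-only pipeline: one wrong total.
image_only_preds = [
    {"invoice_number": "inv-001", "total": "120.50"},
    {"invoice_number": "INV-002", "total": "98.00"},
]
# Hypothetical output of an OCR+MLLM pipeline: one missing field.
ocr_plus_preds = [
    {"invoice_number": "INV-001", "total": "120.50"},
    {"total": "89.00"},
]

image_only_score = field_accuracy(image_only_preds, gold)
ocr_plus_score = field_accuracy(ocr_plus_preds, gold)
```

Two pipelines can reach the same aggregate score while failing in different ways (a misread value versus a missing field), which is the kind of difference an error analysis would need to separate.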
Source: arXiv: 2603.02789