arXiv submission date: 2026-02-23
📄 Abstract - Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval

With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.

Top-level tags: multi-modal, natural language processing, llm
Detailed tags: visual document retrieval, multimodal llm, retrieval-augmented generation, survey, document intelligence

Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval


1️⃣ One-sentence summary

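This paper presents the first comprehensive survey of visual document retrieval, systematically reviewing its methodological evolution and current challenges and outlining future directions, to provide a clear roadmap for research on multimodal document intelligence.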

Source: arXiv: 2602.19961