arXiv submission date: 2026-02-10
📄 Abstract - ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge

Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
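To make the evaluation setup concrete, here is a minimal sketch of how a retriever might be scored on a candidate pool that includes targeted hard negatives. This is an illustrative assumption, not ARK's actual evaluation code: it ranks candidates by cosine similarity of (hypothetical) embeddings and counts a hit only when the gold candidate outranks every negative.

```python
import numpy as np

def recall_at_1(query_emb, cand_embs, gold_idx):
    """Rank candidates by cosine similarity to the query; count a hit
    only if the top-ranked candidate is the gold one. Hard negatives
    sit in the same pool, so surface-level matching is penalized."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity per candidate
    return int(np.argmax(scores) == gold_idx)

# Toy pool: the gold candidate plus a hard and an easy negative.
query = np.array([1.0, 0.0, 0.2])
cands = np.array([
    [0.9, 0.1, 0.2],   # gold: genuinely closest to the query
    [0.8, 0.5, 0.0],   # hard negative: superficially similar
    [0.0, 1.0, 0.0],   # easy negative
])
print(recall_at_1(query, cands, gold_idx=0))  # 1 (gold ranked first)
```

In the benchmark itself, hard negatives are constructed so that this kind of embedding-similarity shortcut fails without multi-step reasoning over the multimodal evidence; the sketch only shows the metric's mechanics.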

Top-level tags: benchmark, multi-modal, model evaluation
Detailed tags: multimodal retrieval, knowledge domains, reasoning skills, hard negatives, evaluation gap

ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge


1️⃣ One-sentence summary

This paper introduces ARK, a new multimodal retrieval benchmark that evaluates models along two axes, knowledge domains and reasoning skills. It finds that existing models fall clearly short on tasks requiring professional knowledge and complex reasoning, and identifies fine-grained visual and spatial reasoning as the main current bottlenecks.

Source: arXiv:2602.09839