arXiv submission date: 2026-06-24
📄 Abstract - What We are Missing in Multimodal LLM Evaluation?

Multimodal large language models (MLLMs) can process diverse inputs, e.g., text, images, audio, and video, and generate textual responses. While their capabilities have advanced rapidly, evaluation of such models has not kept pace. Most existing evaluation benchmarks are limited to isolated tasks and reveal little about whether a model integrates information across modalities. We examine current means for evaluating MLLMs and review the existing benchmark taxonomy to identify gaps, including temporal-spatial coherence, physical world understanding, multimodal consistency, and selective attention. Addressing these gaps is essential for measuring real progress in multimodal intelligence and exposing capability boundaries.

Top-level tags: multi-modal llm evaluation
Detailed tags: benchmark, multimodal consistency, temporal-spatial coherence, physical understanding, selective attention

What We are Missing in Multimodal LLM Evaluation?


1️⃣ One-sentence summary

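This paper argues that evaluation methods for multimodal large language models lag behind the models' capabilities: most existing benchmarks are confined to isolated tasks and cannot effectively measure how well a model integrates information across modalities. It identifies four key missing evaluation dimensions: temporal-spatial coherence, physical world understanding, multimodal consistency, and selective attention.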

Source: arXiv: 2606.26348