📄 Paper Summary
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
1️⃣ One-Sentence Summary
This paper introduces MVU-Eval, the first benchmark for evaluating multi-video understanding. Through more than 1,800 questions spanning nearly 5,000 videos, it systematically assesses multimodal large language models' cross-video perception and reasoning abilities, revealing significant shortcomings in how current models handle multi-video tasks.
2️⃣ Abstract
The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding in MLLMs. Specifically, MVU-Eval assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, covering both fundamental perception tasks and higher-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to understand content across multiple videos. The benchmark will be made publicly available to foster future research.