📄 Paper Summary
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
1️⃣ One-Sentence Summary
This paper introduces MVU-Eval, the first benchmark for evaluating multi-video understanding. Through more than 1,800 questions spanning nearly 5,000 videos, it systematically assesses multimodal large language models' cross-video perception and reasoning abilities, revealing significant shortcomings in how current models handle multi-video tasks.
2️⃣ Abstract
The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding in MLLMs. Specifically, MVU-Eval assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, covering both fundamental perception tasks and higher-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to understand content across multiple videos. The benchmark will be made publicly available to foster future research.