Video-Oasis: Rethinking Evaluation of Video Understanding
1️⃣ One-sentence summary
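This paper presents Video-Oasis, a diagnostic suite whose systematic analysis exposes serious flaws in existing video understanding benchmarks: more than half of the test samples can be answered correctly without watching the video, and top models perform close to random guessing on the samples that genuinely require spatio-temporal understanding. These findings offer practical guidance for building more reliable benchmarks and for model design.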
The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sustainable diagnostic suite designed to systematically evaluate existing evaluations and distill spatio-temporal challenges for video understanding. Our analysis reveals two critical findings: (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models exhibit performance barely exceeding random guessing. To bridge this gap, we investigate which algorithmic design choices contribute to robust video understanding, providing practical guidelines for future research. We hope our work serves as a standard guideline for benchmark construction and the rigorous evaluation of architecture development. Code is available at this https URL.
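The two findings in the abstract can be illustrated with a small bookkeeping sketch. It is not the authors' implementation. It assumes each benchmark item already carries three hypothetical correctness flags (text-only, single-frame, and full-video), and every name in it (`Sample`, `needs_video`, `diagnose`) is invented for illustration. It computes the fraction of samples solvable without visual input or temporal context, then compares full-video accuracy on the remaining samples against the random-guess baseline.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One multiple-choice item with hypothetical per-setting results."""
    sample_id: str
    num_options: int
    blind_correct: bool         # answered correctly from the question text alone
    single_frame_correct: bool  # answered correctly from a single frame
    full_video_correct: bool    # answered correctly with the full video


def needs_video(sample: Sample) -> bool:
    """Assumed criterion: keep a sample only if neither shortcut setting solves it."""
    return not (sample.blind_correct or sample.single_frame_correct)


def diagnose(samples: list[Sample]) -> dict:
    """Summarize shortcut prevalence and accuracy on the samples that remain."""
    if not samples:
        raise ValueError("samples must not be empty")
    kept = [s for s in samples if needs_video(s)]
    shortcut_fraction = 1 - len(kept) / len(samples)
    if not kept:
        return {"shortcut_fraction": shortcut_fraction,
                "kept_accuracy": None, "chance_accuracy": None}
    kept_accuracy = sum(s.full_video_correct for s in kept) / len(kept)
    # Random-guess baseline: mean of 1/num_options over the kept samples.
    chance_accuracy = sum(1 / s.num_options for s in kept) / len(kept)
    return {"shortcut_fraction": shortcut_fraction,
            "kept_accuracy": kept_accuracy,
            "chance_accuracy": chance_accuracy}


if __name__ == "__main__":
    # Toy records, made up purely to exercise the functions.
    toy = [
        Sample("q1", 4, True, False, True),
        Sample("q2", 4, False, True, True),
        Sample("q3", 4, False, False, True),
        Sample("q4", 4, False, False, False),
    ]
    print(diagnose(toy))
```

A `kept_accuracy` close to `chance_accuracy` would correspond to the abstract's second finding, while a large `shortcut_fraction` would correspond to the first.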
Source: arXiv: 2603.29616