arXiv submission date: 2025-12-11
📄 Abstract - MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench, and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human–AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
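The abstract notes that typical frame-sampling strategies transfer poorly to this reasoning-intensive benchmark. For readers unfamiliar with what such a strategy looks like, here is a minimal sketch of the uniform frame sampling commonly used in video-MLLM evaluation pipelines; the helper name and its details are illustrative assumptions, not code from the paper:

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices from a video.

    This is the typical "uniform sampling" baseline: the video is split
    into num_samples equal segments and the center frame of each segment
    is taken. Hypothetical helper for illustration only.
    """
    if total_frames <= num_samples:
        # Video is short enough to keep every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Center of each segment: step * i + step / 2
    return [int(step * i + step / 2) for i in range(num_samples)]


# Example: a 100-frame clip downsampled to 4 frames.
print(uniform_sample_indices(100, 4))  # → [12, 37, 62, 87]
```

Because the selected frames are fixed in advance, such a strategy can miss the short, question-relevant moments (e.g. a brief motion or viewpoint change) that spatial-reasoning questions depend on, which is consistent with the transfer failure the paper reports.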

Top-level tags: multi-modal benchmark, model evaluation
Detailed tags: spatial intelligence, video understanding, multimodal LLMs, evaluation benchmark, geometric reasoning

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence


1️⃣ One-Sentence Summary

This paper introduces MMSI-Video-Bench, a comprehensive benchmark for evaluating how well multimodal large language models understand 3D spatial information in videos; the evaluation finds that even today's strongest models still fall far short of human performance.


Source: arXiv 2512.10863