
📄 Abstract - SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
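
The abstract does not detail the generation pipeline, but the core idea is that ground-truth answers are read directly from the simulator's privileged state rather than annotated on real footage. Below is a minimal, hypothetical Python sketch of how the three question categories (metric measurement, perspective-dependent reasoning, temporal tracking) might be instantiated from object tracks; the `ObjectTrack` schema, function names, and left/right convention are assumptions for illustration, not SIMS-V's actual code.

```python
# Hypothetical sketch: generating spatial QA pairs from privileged
# simulator state (per-frame object positions and a camera pose).
import math
from dataclasses import dataclass

@dataclass
class ObjectTrack:
    name: str
    positions: list  # (x, y, z) world-space position per frame (assumed schema)

def metric_question(a: ObjectTrack, b: ObjectTrack, frame: int):
    """Metric measurement: the distance is read from simulator state."""
    dist = math.dist(a.positions[frame], b.positions[frame])
    return (f"How far apart are the {a.name} and the {b.name}?",
            f"{dist:.1f} meters")

def perspective_question(cam_pos, cam_yaw, obj: ObjectTrack, frame: int):
    """Perspective-dependent: object's side relative to the camera heading."""
    ox, _, oz = obj.positions[frame]
    cx, _, cz = cam_pos
    # Angle from the camera's forward direction to the object, in the ground plane.
    angle = math.atan2(oz - cz, ox - cx) - cam_yaw
    angle = (angle + math.pi) % (2 * math.pi) - math.pi
    side = "left" if angle > 0 else "right"  # sign convention is arbitrary here
    return (f"From the camera's viewpoint, is the {obj.name} on the left or right?",
            side)

def temporal_question(obj: ObjectTrack):
    """Temporal tracking: compare first- and last-frame positions."""
    moved = math.dist(obj.positions[0], obj.positions[-1]) > 0.1
    return (f"Does the {obj.name} change position during the video?",
            "yes" if moved else "no")

if __name__ == "__main__":
    mug = ObjectTrack("mug", [(1.0, 0.9, 2.0), (1.0, 0.9, 2.0)])
    chair = ObjectTrack("chair", [(3.5, 0.0, 4.0), (3.6, 0.0, 4.2)])
    for qa in (metric_question(mug, chair, 0),
               perspective_question((0.0, 1.5, 0.0), 0.0, mug, 0),
               temporal_question(chair)):
        print(*qa, sep="\n  -> ")
```

Because every answer is computed from the simulator's internal state, annotation is exact and essentially free, which is what makes scaling to tens of thousands of examples feasible.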

Top-level tags: computer vision, multi-modal, model training
Detailed tags: spatial reasoning, video understanding, simulated data, instruction tuning, multimodal LLM

📄 Paper Summary

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding


1️⃣ One-Sentence Summary

The paper proposes a method that uses 3D simulators to generate spatially rich video data; with only a small number of simulated examples, it effectively trains video language models to outperform larger models on real-world spatial reasoning tasks and rival proprietary models.

