FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
1️⃣ One-Sentence Summary
This paper proposes FoundationMotion, an automated data-curation pipeline that generates large-scale, fine-grained motion datasets from videos and uses them to train models, substantially improving AI's understanding of object motion and spatial relationships.
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are typically built through costly manual annotation, which severely limits scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline for constructing large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then feeds these trajectories, together with video frames, to Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines such as Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus offers a scalable way to curate fine-grained motion data for fine-tuning diverse models, strengthening their motion understanding and spatial reasoning capabilities.
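To make the pipeline concrete, below is a minimal Python sketch of the trajectory-to-prompt step that sits between tracking and LLM annotation. This is a sketch under stated assumptions, not the paper's implementation: the `Trajectory` dataclass, the box format, and the prompt wording are hypothetical stand-ins, since the paper summary does not specify the detector/tracker, trajectory representation, or LLM interface used.

```python
# Minimal sketch of a FoundationMotion-style auto-labeling step.
# Assumption: the paper does not specify its trajectory format or LLM
# prompting interface; the names below are hypothetical stand-ins.
from dataclasses import dataclass


@dataclass
class Trajectory:
    object_id: int
    label: str                              # e.g. "red car"
    boxes: list[tuple[int, int, int, int]]  # per-frame (x, y, w, h)


def trajectories_to_prompt(trajectories: list[Trajectory]) -> str:
    """Serialize tracked trajectories into text an LLM can reason over."""
    lines = []
    for t in trajectories:
        # Reduce each box to its center point to describe the motion path.
        centers = [(x + w // 2, y + h // 2) for x, y, w, h in t.boxes]
        lines.append(f"object {t.object_id} ({t.label}): centers = {centers}")
    return (
        "Given these per-frame object trajectories, write a fine-grained "
        "motion caption, then generate diverse question-answer pairs about "
        "direction, relative speed, and spatial relations.\n"
        + "\n".join(lines)
    )


if __name__ == "__main__":
    # Toy trajectory standing in for real detector/tracker output:
    # an object moving left to right at constant height.
    demo = [Trajectory(0, "red car",
                       [(10, 50, 40, 20), (30, 50, 40, 20), (50, 50, 40, 20)])]
    print(trajectories_to_prompt(demo))
    # The resulting prompt (plus sampled video frames) would be sent to an
    # LLM to produce the captions and QA pairs used for fine-tuning.
```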
Source: arXiv:2512.10927