📄 Abstract - Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate it on two tasks: scene navigation, which requires following camera-pose instructions while remaining consistent with the 3D geometry of the scene, and object grounding, which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or camera poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
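To make the idea of "conditioning a video diffusion model solely on video-based scene context" concrete, here is a minimal sketch of one way such conditioning could be wired up: clean context-frame tokens sit alongside noisy target-frame tokens in a single transformer denoiser, with camera-pose instructions added only to the target positions. This is not the paper's actual architecture; all class names, shapes, and the pose encoding are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the authors' code) of a context-guided
# video diffusion denoiser: clean context frames condition the denoising of
# noisy target frames; camera-pose instructions are injected on target tokens.
import torch
import torch.nn as nn

class ContextGuidedDenoiser(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_layers=4,
                 patch_dim=3 * 8 * 8, pose_dim=12):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)   # video patches -> tokens
        self.pose_embed = nn.Linear(pose_dim, dim)     # pose instructions -> tokens
        self.time_embed = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, patch_dim)          # predict noise per patch

    def forward(self, noisy_target, context, poses, t):
        # noisy_target: (B, T_tgt, patch_dim)  noised future frames as patches
        # context:      (B, T_ctx, patch_dim)  clean context-video frames
        # poses:        (B, T_tgt, pose_dim)   flattened camera-pose instructions
        # t:            (B, 1)                 diffusion timestep
        tokens = torch.cat([
            self.patch_embed(context),                 # video-only scene context
            self.patch_embed(noisy_target) + self.pose_embed(poses),
        ], dim=1)
        tokens = tokens + self.time_embed(t).unsqueeze(1)  # broadcast timestep
        out = self.backbone(tokens)
        # Only target positions are denoised; context tokens act as conditioning.
        return self.head(out[:, context.shape[1]:])

# Toy usage: 16 context patches conditioning 8 target patches.
model = ContextGuidedDenoiser()
x_t = torch.randn(2, 8, 3 * 8 * 8)
ctx = torch.randn(2, 16, 3 * 8 * 8)
poses = torch.randn(2, 8, 12)
t = torch.rand(2, 1)
eps_pred = model(x_t, ctx, poses, t)   # -> (2, 8, 192)
```

Under this reading, "video-only inputs" means the context branch receives nothing but frames: any depth or pose the model needs for spatial consistency must be inferred from the video itself, while pose tokens serve purely as the instruction to follow.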

Top-level tags: video generation · computer vision · multi-modal
Detailed tags: visuospatial reasoning · video diffusion models · scene navigation · object grounding · spatial consistency

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation


1️⃣ One-Sentence Summary

This paper proposes Video4Spatial, a framework demonstrating that a video generation model trained only on video data can understand complex spatial relationships much as humans do, successfully completing tasks that require spatial reasoning, such as scene navigation and object grounding.

