arXiv submission date: 2026-02-05
📄 Abstract - Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate unknown environments guided solely by simple, high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing video generation models into this field for the first time. Yet, the prohibitive latency of generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
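
To make the mechanism in the abstract concrete, below is a minimal Python sketch of the described pipeline: a video generation model produces a sparse set of future keyframes over a ~20-second horizon, and a lightweight decoder turns them into a trajectory in sub-second time. All names here (`SparseVideoGenerator`, `TrajectoryDecoder`, `plan_step`, the keyframe count) are hypothetical placeholders under our reading of the abstract, not the authors' released API.

```python
# Hypothetical sketch of a SparseVideoNav-style planning step.
# Generating only a few keyframes, instead of a dense 20 s video,
# is what the abstract credits for the large inference speed-up.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class NavConfig:
    horizon_s: float = 20.0   # look-ahead horizon, as stated in the abstract
    num_keyframes: int = 5    # sparse frames per plan (assumed count)


class SparseVideoGenerator:
    """Placeholder for a text+image-conditioned video generation model
    that denoises only a handful of future keyframes."""

    def generate_keyframes(self, observation: np.ndarray, intent: str,
                           cfg: NavConfig) -> List[np.ndarray]:
        # Timestamps of the sparse frames, spread over the horizon.
        ts = np.linspace(0.0, cfg.horizon_s, cfg.num_keyframes)
        # Placeholder output; a real model conditions on (observation, intent, t).
        return [np.zeros_like(observation) for _ in ts]


class TrajectoryDecoder:
    """Placeholder fast head mapping sparse keyframes to waypoints."""

    def decode(self, keyframes: List[np.ndarray]) -> np.ndarray:
        # One (x, y, yaw) waypoint per keyframe.
        return np.zeros((len(keyframes), 3))


def plan_step(observation: np.ndarray, intent: str) -> np.ndarray:
    """One planning step: generate a sparse future, decode a trajectory."""
    cfg = NavConfig()
    keyframes = SparseVideoGenerator().generate_keyframes(observation, intent, cfg)
    return TrajectoryDecoder().decode(keyframes)


if __name__ == "__main__":
    obs = np.zeros((224, 224, 3), dtype=np.float32)  # current RGB observation
    waypoints = plan_step(obs, "go to the building behind the trees")
    print(waypoints.shape)  # (num_keyframes, 3)
```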

Top-level tags: agents multi-modal computer vision
Detailed tags: vision-language navigation video generation long-horizon planning sparse prediction zero-shot evaluation

Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation


1️⃣ One-sentence summary

This paper proposes a new method, SparseVideoNav, which introduces video generation models into beyond-the-view navigation for the first time. By generating sparse future video frames to guide fast long-range path planning, it enables a robot to navigate autonomously from simple high-level instructions alone in complex real-world scenes (including at night), with a success rate far exceeding existing techniques.

Source: arXiv:2602.05827