arXiv submission date: 2026-02-23
📄 Abstract - NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: this https URL
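The closed-loop structure described above (VLM sub-goal decomposition, execution monitoring with re-planning, and switching between keypoint and hand-pose kinematic priors) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all function names (`extract_references`, `choose_reference`, `run_task`), the confidence-based switching rule, and the stubbed VLM/robot calls are assumptions.

```python
# Hypothetical sketch of NovaPlan's closed loop; stub functions stand in
# for the real VLM planner, video generation model, and robot controller.
from dataclasses import dataclass

@dataclass
class Reference:
    source: str       # "keypoints" or "hand_pose"
    confidence: float

def extract_references(subgoal):
    """Stand-in for extracting object keypoints and human hand poses
    from a generated video of the sub-goal (confidences are made up:
    keypoints degrade for the occluded insertion step)."""
    kp_conf = 0.3 if "insert" in subgoal else 0.9
    return [Reference("keypoints", kp_conf), Reference("hand_pose", 0.7)]

def choose_reference(refs):
    """Assumed switching mechanism: pick whichever kinematic prior is
    currently more reliable, e.g. under heavy occlusion or bad depth."""
    return max(refs, key=lambda r: r.confidence)

def execute(subgoal, ref, attempt):
    """Stub robot execution: fail the first try of 'insert peg' so the
    re-planning path is exercised."""
    return not (subgoal == "insert peg" and attempt == 0)

def run_task(task, max_retries=2):
    # Stand-in for the VLM planner decomposing the task into sub-goals.
    subgoals = {"assemble": ["grasp peg", "insert peg"]}[task]
    log = []
    for sg in subgoals:
        for attempt in range(max_retries + 1):
            ref = choose_reference(extract_references(sg))
            ok = execute(sg, ref, attempt)
            log.append((sg, ref.source, ok))
            if ok:
                break   # VLM monitor confirms the sub-goal; advance
        else:
            return log, False  # retries exhausted; task fails
    return log, True
```

Running `run_task("assemble")` switches to the hand-pose prior for the insertion step (where keypoint confidence drops), fails once, and recovers on the retry without any global re-plan.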

Top-level tags: robotics multi-modal agents
Detailed tags: long-horizon manipulation vision-language models video planning zero-shot learning hierarchical planning

NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning


1️⃣ One-Sentence Summary

This paper proposes NovaPlan, a hierarchical robot framework that combines a vision-language model with a video generation model for task decomposition and planning, and uses keypoint information extracted from the generated videos to guide robot actions, enabling it to complete complex long-horizon manipulation tasks and autonomously correct execution errors without any additional training.

Source: arXiv:2602.20119