Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
1️⃣ One-Sentence Summary
This paper proposes DualVLN, a dual-system model in which a slow-thinking global planner sets mid-term goals that drive a fast-acting local controller to generate smooth trajectories, achieving more robust and efficient vision-and-language navigation in complex, dynamic environments.
While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight Diffusion Transformer policy with multi-modal conditioning, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
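To make the slow/fast split concrete, below is a minimal Python sketch of a dual-frequency control loop in the spirit of the abstract: a slow System 2 planner is queried occasionally to produce a mid-term pixel goal plus latent features, while a fast System 1 policy is run at every control step conditioned on that goal. All class and method names (`GlobalPlanner`, `LocalPolicy`, `predict_goal`, `sample_trajectory`), the environment interface, and the replanning period are illustrative assumptions, not the authors' actual API or implementation.

```python
# Illustrative sketch of a dual-system ("ground slow, move fast") loop.
# Names and interfaces here are hypothetical, not the DualVLN codebase.
from dataclasses import dataclass

import numpy as np


@dataclass
class MidTermGoal:
    """Output of the slow System 2 planner."""
    pixel_goal: np.ndarray  # (u, v) waypoint projected into the current image
    latent: np.ndarray      # latent features handed to the fast policy


class GlobalPlanner:
    """System 2: VLM-based planner, queried at low frequency ("grounds slowly")."""

    def predict_goal(self, instruction: str, rgb: np.ndarray) -> MidTermGoal:
        # Placeholder: a real planner would run image-grounded reasoning
        # with a VLM and output a mid-term waypoint goal.
        h, w, _ = rgb.shape
        return MidTermGoal(pixel_goal=np.array([w // 2, h // 2], dtype=float),
                           latent=np.zeros(256))


class LocalPolicy:
    """System 1: lightweight trajectory policy, run every control step ("moves fast")."""

    def sample_trajectory(self, rgb: np.ndarray, goal: MidTermGoal) -> np.ndarray:
        # Placeholder: a Diffusion-Transformer-style policy would denoise a
        # short-horizon trajectory conditioned on the pixel goal and latent.
        return np.tile(goal.pixel_goal / 100.0, (8, 1))  # 8 dummy waypoints


def navigate(instruction: str, env, planner: GlobalPlanner, policy: LocalPolicy,
             replan_every: int = 10, max_steps: int = 200) -> None:
    """Slow/fast loop: replan the mid-term goal occasionally, act at every step.

    `env` is an assumed interface: reset() -> rgb image, step(traj) -> (rgb, done).
    """
    rgb = env.reset()
    goal = planner.predict_goal(instruction, rgb)
    for step in range(max_steps):
        if step > 0 and step % replan_every == 0:
            goal = planner.predict_goal(instruction, rgb)   # slow branch (System 2)
        traj = policy.sample_trajectory(rgb, goal)          # fast branch (System 1)
        rgb, done = env.step(traj)
        if done:
            break
```

The point of the sketch is the decoupling: because System 1 only needs the latest goal and latent, it can keep producing trajectories at control rate even while the VLM planner is still reasoning.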
Source: arXiv: 2512.08186