arXiv submission date: 2026-01-20
📄 Abstract - FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
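The abstract describes three moving parts: a pretrained Visual AutoRegressor (VAR) that compresses imagined visual observations into a few latent tokens, a unified multi-CoT training strategy mixing textual, visual, and multimodal reasoning modes, and CoT-free direct instruction-to-action mapping at inference. The sketch below illustrates how these pieces could fit together. It is a minimal, assumption-laden reading of the abstract, not the authors' released code: all module names (`FrozenVAREncoder`, `NavPolicy`, `training_step`), dimensions, the number of latent tokens, and the mode-sampling scheme are hypothetical.

```python
# Illustrative sketch of the implicit latent-CoT idea from the abstract.
# Assumptions (not from the paper): module names, dims, 4 latent tokens,
# uniform random sampling over CoT modes, cross-entropy action loss.
import random

import torch
import torch.nn as nn

LATENT_TOKENS = 4   # assumed: each imagined observation compresses to 4 tokens
D_MODEL = 256
N_ACTIONS = 6       # e.g. forward / turn-left / turn-right / up / down / stop


class FrozenVAREncoder(nn.Module):
    """Stand-in for the pretrained Visual AutoRegressor (VAR): maps an
    imagined observation (here a raw 512-d feature) to a handful of compact
    latent tokens. Frozen during navigation training, per the abstract."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(512, LATENT_TOKENS * D_MODEL)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, imagined):                       # (B, 512)
        z = self.proj(imagined)                        # (B, LATENT_TOKENS * D)
        return z.view(imagined.size(0), LATENT_TOKENS, D_MODEL)


class NavPolicy(nn.Module):
    """Toy instruction + observation -> action model. During training the
    input sequence may be augmented with textual and/or latent-visual CoT
    tokens; at inference it sees only instruction + observation."""

    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, N_ACTIONS)

    def forward(self, tokens):                         # (B, T, D_MODEL)
        h = self.backbone(tokens)
        return self.head(h[:, -1])                     # next-action logits


def training_step(policy, var_enc, batch, optim):
    """One step of an (assumed) unified multi-CoT strategy: randomly train
    with textual CoT, latent visual CoT, both, or direct mapping, so one
    set of weights serves reasoning-aware training and CoT-free inference."""
    instr, obs, text_cot, imagined, action = batch     # token seqs + (B,) labels
    parts = [instr, obs]
    mode = random.choice(["none", "text", "visual", "multimodal"])
    if mode in ("text", "multimodal"):
        parts.append(text_cot)                         # explicit textual CoT tokens
    if mode in ("visual", "multimodal"):
        parts.append(var_enc(imagined))                # compact latent visual CoT
    logits = policy(torch.cat(parts, dim=1))
    loss = nn.functional.cross_entropy(logits, action)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```

Under this reading, inference always runs the `"none"` branch (instruction and observation only), which is consistent with the abstract's claim of an order-of-magnitude latency reduction over explicit CoT methods: no reasoning tokens, textual or visual, are ever generated at test time.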

Top-level tags: agents, multi-modal, model training
Detailed tags: vision-language navigation, chain-of-thought, latent reasoning, real-time inference, multimodal learning

FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation


1️⃣ One-Sentence Summary

This paper proposes a new method called FantasyVLN, which compresses imagined visual information into a compact encoding so that an embodied navigation agent can carry out human-like multi-step reasoning while still running in real time, resolving the trade-off in existing methods between weak reasoning and slow inference.

Source: arXiv 2601.13976