Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
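1️⃣ One-Sentence Summary
This paper proposes a Dual-Anchoring Framework in which the agent explicitly marks the instruction sub-goals it has completed and recalls the landmarks it has passed. This addresses the disorientation caused by progress confusion and memory decay in long-horizon vision-language navigation, and improves Success Rate by 15.2%.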
Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have substantially advanced VLN, they remain highly susceptible to State Drift in long-horizon scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, causing it to lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors both the instruction progress and the history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which uses a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two large datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, which achieves a 15.2% improvement in Success Rate and a 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.
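To make the idea of Instruction Progress Anchoring more concrete, the following is a minimal sketch of how a supervision target separating completed from remaining sub-goals might be assembled. The abstract does not specify the token format, so the `<completed>` and `<remaining>` tags, the `format_progress_target` function, and the example instruction are all hypothetical and serve only as an illustration.

```python
from typing import List


def format_progress_target(sub_goals: List[str], num_completed: int) -> str:
    """Build a structured progress string from an ordered list of sub-goals.

    NOTE: the tag names and layout are assumptions made for illustration;
    they are not taken from the paper.
    """
    if not 0 <= num_completed <= len(sub_goals):
        raise ValueError("num_completed must be between 0 and len(sub_goals)")

    # Split the instruction into what is already done and what is still ahead.
    done = sub_goals[:num_completed]
    todo = sub_goals[num_completed:]

    # Use an explicit placeholder so that both fields are always present.
    done_text = "; ".join(done) if done else "none"
    todo_text = "; ".join(todo) if todo else "none"

    return (
        f"<completed> {done_text} </completed> "
        f"<remaining> {todo_text} </remaining>"
    )


# Hypothetical instruction decomposed into three sub-goals.
example_sub_goals = [
    "exit the bedroom",
    "turn left at the sofa",
    "stop by the fridge",
]
example_target = format_progress_target(example_sub_goals, num_completed=1)
print(example_target)
```

A string like this could be paired with each step of a trajectory as a text target, so that the agent has to state its progress explicitly rather than track it only implicitly.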
Source: arXiv: 2604.17473