arXiv submission date: 2026-07-02
📄 Abstract - Path-level Hindsight Instructions for Semantic Exploration in Vision-Language Navigation

On-policy exploration is a crucial component for training robust Vision-Language Navigation agents, as it exposes the policy to a broader state distribution. However, such exploration inevitably leads to trajectories that deviate from expert demonstrations, resulting in a semantic mismatch between the executed visual stream and the original language instruction. In this work, we address this challenge by introducing Phi-Nav, a unified on-policy framework that leverages hindsight reasoning to align instructions with the agent's actual exploratory journey. Specifically, Phi-Nav operates through a three-stage dual-supervision cycle: 1) the agent performs oracle-guided on-policy exploration, sampling a trajectory while learning from expert action feedback, 2) a hindsight speaker synthesizes a path-level hindsight instruction grounded in the collected visual observations, and 3) the agent conducts a second imitation pass, treating the synthesized trajectory-instruction pair as an additional expert demonstration. Through this process, Phi-Nav bridges the critical semantic supervision gap inherent in on-policy methods, transforming semantically unlabeled movement into dense training signals. Evaluations on the R2R-CE and RxR-CE benchmarks show that Phi-Nav yields competitive performance while requiring only a fraction of the expert demonstrations used by current baselines. These results underscore the necessity of semantic exploration in VLN, positioning Phi-Nav as an effective solution for training embodied agents with limited data.
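The three-stage cycle described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the corridor world, the tabular `ToyPolicy`, the template-based `hindsight_speaker`, and the function `phi_nav_cycle` are hypothetical stand-ins for the real continuous environment, neural agent, and learned speaker.

```python
import random

ACTIONS = ("left", "right", "stop")

def oracle_action(pos, goal):
    """Hypothetical expert: move toward the goal room, stop when there."""
    if pos < goal:
        return "right"
    if pos > goal:
        return "left"
    return "stop"

def step(pos, action, size):
    """Toy corridor dynamics with rooms 0 .. size-1."""
    if action == "left":
        return max(pos - 1, 0)
    if action == "right":
        return min(pos + 1, size - 1)
    return pos

class ToyPolicy:
    """Tabular stand-in for a VLN agent: counts actions per (instruction, room)."""
    def __init__(self):
        self.counts = {}

    def act(self, instruction, pos, rng):
        table = self.counts.get((instruction, pos))
        if table is None:
            # Unseen situation: explore by moving randomly (never "stop").
            return rng.choice(ACTIONS[:2])
        return max(table, key=table.get)

    def imitate(self, instruction, pos, action):
        table = self.counts.setdefault((instruction, pos), {})
        table[action] = table.get(action, 0) + 1

def hindsight_speaker(path):
    """Assumed template speaker: describes the rooms actually visited."""
    return "visit " + " -> ".join(f"room{p}" for p in path) + " then stop"

def phi_nav_cycle(policy, instruction, start, goal, rng, size=5, max_steps=6):
    # Stage 1: on-policy rollout under the original instruction,
    # recording the oracle's action label at every visited state.
    pos, path, actions, labels = start, [start], [], []
    for _ in range(max_steps):
        labels.append((pos, oracle_action(pos, goal)))
        action = policy.act(instruction, pos, rng)
        if action == "stop":
            break
        actions.append(action)
        pos = step(pos, action, size)
        path.append(pos)
    for p, a in labels:
        policy.imitate(instruction, p, a)

    # Stage 2: synthesize an instruction that matches the executed path.
    hindsight = hindsight_speaker(path)

    # Stage 3: second imitation pass, treating the executed trajectory
    # (plus a final "stop") as a demonstration for the hindsight instruction.
    for p, a in zip(path, actions + ["stop"]):
        policy.imitate(hindsight, p, a)
    return path, hindsight

if __name__ == "__main__":
    demo_policy = ToyPolicy()
    demo_path, demo_text = phi_nav_cycle(
        demo_policy, "go to room3", start=0, goal=3, rng=random.Random(0)
    )
    print(demo_path)
    print(demo_text)
```

In this sketch the oracle labels from stage 1 are applied only after the rollout finishes, so the agent's own actions, not the expert's, decide which states are visited. Stage 3 is what adds the semantic supervision: the deviating trajectory receives an instruction that actually describes it.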

Top-level tags: agents, multi-modal
Detailed tags: vision-language navigation, hindsight instruction, on-policy exploration, semantic exploration, embodied agents

Path-level Hindsight Instructions for Semantic Exploration in Vision-Language Navigation


1️⃣ One-sentence summary

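This paper proposes Phi-Nav, a framework that automatically generates a language instruction matching the path an agent actually took during exploration, turning otherwise unlabeled exploration data into usable training samples and achieving competitive Vision-Language Navigation performance with only a fraction of the expert demonstrations used by current baselines.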

Source: arXiv: 2607.01754