Abstract - WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
Vision-language models (VLMs) and generative world models are opening new opportunities for embodied navigation. VLMs are increasingly used as direct planners or trajectory predictors, while world models support look-ahead reasoning by imagining future views. Yet predicting a reliable trajectory from a single egocentric observation remains challenging. Current VLMs often generate unstable trajectories, and world models, though able to synthesize plausible futures, do not directly provide the grounded signals needed for navigation learning. This raises a central question: how can generated futures be turned into supervision for grounded trajectory prediction? We present WorldMAP, a teacher-student framework that converts world-model-generated futures into persistent semantic-spatial structure and planning-derived supervision. Its world-model-driven teacher builds semantic-spatial memory from generated videos, grounds task-relevant targets and obstacles, and produces trajectory pseudo-labels through explicit planning. A lightweight student with a multi-hypothesis trajectory head is then trained to predict navigation trajectories directly from vision-language inputs. On Target-Bench, WorldMAP achieves the best ADE and FDE among compared methods, reducing ADE by 18.0% and FDE by 42.1% relative to the best competing baseline, while lifting a small open-source VLM to DTW performance competitive with proprietary models. More broadly, the results suggest that, in embodied navigation, the value of world models may lie less in supplying action-ready imagined evidence than in synthesizing structured supervision for navigation learning.
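For readers unfamiliar with the reported metrics: ADE (Average Displacement Error) is the mean point-wise distance between a predicted trajectory and the ground truth, and FDE (Final Displacement Error) is the distance at the last waypoint. A minimal sketch of these standard definitions (not the paper's evaluation code; the toy waypoint values are illustrative only):

```python
import math

def ade_fde(pred, gt):
    """ADE and FDE between two equal-length 2-D waypoint trajectories.

    ADE: mean Euclidean distance over all corresponding waypoints.
    FDE: Euclidean distance at the final waypoint.
    """
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]

# Toy example with hypothetical waypoints (meters, ego frame)
pred = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.3)]
gt   = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ade, fde = ade_fde(pred, gt)
# ADE averages errors along the whole path; FDE only penalizes the endpoint,
# which is why the paper can report different relative gains on each.
```

DTW (dynamic time warping), the third metric mentioned, additionally allows non-uniform temporal alignment between the two trajectories before measuring distance.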
WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
1️⃣ One-Sentence Summary
This paper proposes a new method called WorldMAP, which cleverly uses a generative world model to "imagine" future environment views and extract structured navigation guidance from them, training a lighter and more accurate vision-language model that can predict stable, reliable navigation routes from a single observation.