arXiv submission date: 2026-03-29
📄 Abstract - Structured Observation Language for Efficient and Generalizable Vision-Language Navigation

Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of the visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-training and suffering from poor generalization under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL-Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide RGB-D images into an N×N grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this with the language instruction as pure language input to a pre-trained language model (PLM). Experimental results on standard VLN benchmarks (R2R, RxR) and real-world deployments demonstrate that SOL-Nav significantly reduces model size and training-data dependency, fully leverages the reasoning and representation capabilities of PLMs, and achieves strong generalization to unseen environments.
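The grid-to-text step described in the abstract can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's actual implementation: the function name `observation_to_text`, the per-cell statistics (majority semantic class, mean RGB, median depth), and the output string format are all hypothetical choices for demonstration.

```python
import numpy as np

def observation_to_text(rgb, depth, labels, n=3):
    """Summarize an RGB-D observation as structured text over an n x n grid.

    rgb:    (H, W, 3) color image
    depth:  (H, W) depth map in meters
    labels: (H, W) integer semantic-segmentation map
    (A sketch of the idea, not SOL-Nav's real pipeline.)
    """
    h, w, _ = rgb.shape
    cells = []
    for i in range(n):
        for j in range(n):
            ys = slice(i * h // n, (i + 1) * h // n)
            xs = slice(j * w // n, (j + 1) * w // n)
            # Representative semantics: most frequent class in the cell.
            semantic = int(np.bincount(labels[ys, xs].ravel()).argmax())
            # Representative color: mean RGB of the cell.
            r, g, b = rgb[ys, xs].reshape(-1, 3).mean(axis=0)
            # Representative depth: median distance in meters.
            dist = float(np.median(depth[ys, xs]))
            cells.append(f"cell({i},{j}): class={semantic}, "
                         f"rgb=({r:.0f},{g:.0f},{b:.0f}), depth={dist:.1f}m")
    return "; ".join(cells)
```

The resulting string can then simply be concatenated with the navigation instruction and passed to a PLM as ordinary text, which is what lets the approach skip visual pre-training entirely.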

Top-level tags: natural language processing, agents, multi-modal
Detailed tags: vision-language navigation, structured language, embodied AI, generalization, language model reasoning

Structured Observation Language for Efficient and Generalizable Vision-Language Navigation


1️⃣ One-sentence summary

This paper proposes a new method called SOL-Nav that converts the visual images a robot sees into structured text descriptions and feeds them, together with the language instruction, to a pre-trained language model for navigation; this approach not only makes the model smaller and training simpler, but also adapts better to environments it has never seen.

Source: arXiv:2603.27577