Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLMs
1️⃣ One-sentence summary
This paper proposes a novel zero-shot method that uses a large vision-language model to reason about drivable regions directly from segmented off-road images annotated with numeric labels, replacing the traditional complex pipeline of multiple specialized cooperating models and enabling more efficient off-road autonomous navigation.
Traditional approaches to off-road autonomy rely on separate models for terrain classification, height estimation, and quantifying slip or slope conditions. Using several models requires training each component separately on task-specific datasets and fine-tuning it individually. In this work, we present a zero-shot approach that leverages SAM2 for environment segmentation and a vision-language model (VLM) to reason about drivable areas. Our approach passes the VLM both the original image and the segmented image, annotated with a numeric label for each mask. The VLM is then prompted to identify which regions, referenced by these numeric labels, are drivable. Combined with planning and control modules, this unified framework eliminates the need for explicit terrain-specific models, relying instead on the inherent reasoning capabilities of the VLM. Our approach surpasses state-of-the-art trainable models on high-resolution segmentation datasets and enables full-stack navigation in our Isaac Sim off-road environment.
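The visual-prompting step described above (annotate each segmentation mask with a numeric label, then ask the VLM which labels are drivable) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mask representation, label placement at mask centroids, and prompt wording are all assumptions; the real pipeline would draw the labels onto the image and send both images to a VLM API.

```python
def label_positions(masks):
    """For each mask (given here as a list of (row, col) pixel
    coordinates), compute the centroid where its numeric label would
    be drawn on the annotated segmentation image.  Centroid placement
    is an assumption; the paper may position labels differently."""
    positions = {}
    for idx, pixels in enumerate(masks, start=1):
        rows = [r for r, _ in pixels]
        cols = [c for _, c in pixels]
        positions[idx] = (sum(rows) // len(rows), sum(cols) // len(cols))
    return positions

def build_prompt(num_regions):
    """Assemble the text prompt sent to the VLM alongside the original
    and annotated images.  The wording is a hypothetical paraphrase,
    not the paper's exact prompt."""
    return (
        f"The second image is a segmentation of the first into "
        f"{num_regions} regions, each marked with a numeric label. "
        "Reply with the labels of the regions an off-road vehicle "
        "could safely drive on, as a comma-separated list."
    )

# Toy example: two square masks on a 10x10 grid.
mask_a = [(r, c) for r in range(5) for c in range(5)]
mask_b = [(r, c) for r in range(5, 10) for c in range(5, 10)]
positions = label_positions([mask_a, mask_b])
prompt = build_prompt(len(positions))
```

The VLM's comma-separated reply can then be parsed back into mask indices, giving a binary drivable/non-drivable map for the downstream planning and control modules.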
Source: arXiv:2604.04564