See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View
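1️⃣ One-Sentence Summary
This paper proposes a new vision-language navigation task and framework for UAVs. It targets a specific problem: once a target has entered the UAV's field of view, how can the UAV identify it accurately and fly to it? By combining dynamic 3D direction cues with high-resolution images, the approach substantially improves both navigation success rate and precision.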
UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at this https URL.
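The abstract does not say how 3DG-VLN represents or computes its dynamic 3D direction cue. Purely as a rough illustration of why refreshing a target-relative direction online can limit accumulated drift, here is a toy sketch. It assumes the cue is a unit vector from the UAV to a known target position, expressed in a yaw-only body frame (x forward, y left, z up). It also assumes the UAV suffers a hypothetical unmodelled yaw drift of 0.05 rad per step. The names `Pose`, `direction_cue`, `fly`, and `navigate`, the frame convention, the 2 m step, and the 1 m reach radius are all illustrative assumptions. They are not the paper's method or parameters, and the real system predicts waypoints from vision-language input rather than from a given target position.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    """Hypothetical UAV state: world-frame position (m) and yaw (rad)."""
    x: float
    y: float
    z: float
    yaw: float


def direction_cue(pose, target):
    """Unit vector from the UAV to the target in an assumed yaw-only body frame.

    Body frame (assumption): x forward, y left, z up; roll and pitch are ignored.
    Returns (cue, distance).
    """
    dx, dy, dz = target[0] - pose.x, target[1] - pose.y, target[2] - pose.z
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    if dist == 0.0:
        return (0.0, 0.0, 0.0), 0.0
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    # World -> body: rotate the horizontal offset by -yaw.
    return ((c * dx + s * dy) / dist, (-s * dx + c * dy) / dist, dz / dist), dist


def fly(pose, cue, length, yaw_drift):
    """Fly `length` m along a body-frame cue, then apply an unmodelled yaw drift."""
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    wx = c * cue[0] - s * cue[1]  # body -> world, using the true yaw
    wy = s * cue[0] + c * cue[1]
    return Pose(
        pose.x + length * wx,
        pose.y + length * wy,
        max(0.0, pose.z + length * cue[2]),  # crude ground clamp (toy assumption)
        pose.yaw + yaw_drift,
    )


def navigate(start, target, online=True, step_len=2.0, reach_radius=1.0,
             yaw_drift=0.05, max_steps=60):
    """Toy closed-loop rollout; returns (reached, final_distance, steps_used).

    online=True recomputes the cue from the current pose at every step;
    online=False keeps the cue computed at the start (a stale direction).
    """
    pose = start
    cue, dist = direction_cue(pose, target)
    for t in range(max_steps):
        if online:
            cue, dist = direction_cue(pose, target)  # refresh the direction cue
        else:
            dist = direction_cue(pose, target)[1]    # distance only; cue stays stale
        if dist <= reach_radius:
            return True, dist, t
        pose = fly(pose, cue, min(step_len, dist), yaw_drift)
    return False, direction_cue(pose, target)[1], max_steps


if __name__ == "__main__":
    start, target = Pose(0.0, 0.0, 10.0, 0.0), (30.0, 10.0, 0.0)
    print("online:", navigate(start, target, online=True))
    print("stale: ", navigate(start, target, online=False))
```

In this toy setting, the refreshed cue keeps pointing at the target and the UAV reaches it. The stale cue is executed in a body frame whose yaw keeps drifting, so the flown heading rotates a little more on every step and the path curves away, missing the target by roughly 10 m. This is only a simplified analogue of the accumulated direction drift that the abstract says online direction updates are meant to reduce.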
Source: arXiv:2606.20045