arXiv submission date: 2026-03-02
📄 Abstract - Unifying Language-Action Understanding and Generation for Autonomous Driving

Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, reducing inference time by 86%. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
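The abstract's first idea, placing language and action tokens in one shared discrete codebook, can be sketched with a simple quantization scheme. The paper does not provide code, so everything below is a hypothetical illustration: the vocabulary size, bin count, and coordinate range are assumptions, and real systems typically use learned quantizers rather than uniform bins.

```python
# Hypothetical sketch of a shared language-action codebook: each trajectory
# coordinate is quantized into one of a fixed set of discrete bins, and the
# bin IDs are appended after the language vocabulary so a single model can
# emit both modalities from one token space.

LANG_VOCAB_SIZE = 32000      # assumed size of the base language vocabulary
ACTION_BINS = 256            # assumed number of discrete action bins
COORD_RANGE = (-50.0, 50.0)  # assumed metric range covered by the bins

def action_to_token(x: float) -> int:
    """Quantize one trajectory coordinate into a shared-vocab token ID."""
    lo, hi = COORD_RANGE
    x = min(max(x, lo), hi)                          # clamp to the bin range
    bin_id = int((x - lo) / (hi - lo) * (ACTION_BINS - 1))
    return LANG_VOCAB_SIZE + bin_id                  # action IDs follow language IDs

def token_to_action(tok: int) -> float:
    """Invert the quantization, recovering the bin's lower edge."""
    lo, hi = COORD_RANGE
    bin_id = tok - LANG_VOCAB_SIZE
    return lo + bin_id / (ACTION_BINS - 1) * (hi - lo)

tok = action_to_token(12.3)
# Round-trip error is bounded by the bin width, (hi - lo) / (ACTION_BINS - 1).
assert LANG_VOCAB_SIZE <= tok < LANG_VOCAB_SIZE + ACTION_BINS
assert abs(token_to_action(tok) - 12.3) < (100.0 / 255.0)
```

Because action tokens live in the same vocabulary as words, the same decoder head scores both, which is what lets the model be trained jointly on trajectory generation and the auxiliary trajectory-captioning objective the abstract describes.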

Top-level tags: multi-modal agents, model training
Detailed tags: autonomous driving, vision-language-action, instruction alignment, efficient generation, cross-modal consistency

Unifying Language-Action Understanding and Generation for Autonomous Driving


1️⃣ One-Sentence Summary

This paper proposes a new architecture called LinkVLA, which unifies the representations of language and action and introduces a bidirectional training objective to address the misalignment between instructions and actions and the inefficiency of action generation in autonomous driving, thereby significantly improving driving performance while greatly reducing inference time.

Source: arXiv:2603.01441