arXiv submission date: 2026-03-02
📄 Abstract - Unifying Language-Action Understanding and Generation for Autonomous Driving

Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, reducing inference time by 86%. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
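The abstract's first idea, placing language and action tokens in one shared discrete codebook, can be sketched with a simple quantization scheme. The paper does not provide code, so everything below is a hypothetical illustration: the vocabulary size, bin count, and coordinate range are assumptions, and real systems typically use learned quantizers rather than uniform bins.

```python
# Hypothetical sketch of a shared language-action codebook: each trajectory
# coordinate is quantized into one of a fixed set of discrete bins, and the
# bin IDs are appended after the language vocabulary so a single model can
# emit both modalities from one token space.

LANG_VOCAB_SIZE = 32000      # assumed size of the base language vocabulary
ACTION_BINS = 256            # assumed number of discrete action bins
COORD_RANGE = (-50.0, 50.0)  # assumed metric range covered by the bins

def action_to_token(x: float) -> int:
    """Quantize one trajectory coordinate into a shared-vocab token ID."""
    lo, hi = COORD_RANGE
    x = min(max(x, lo), hi)                          # clamp to the bin range
    bin_id = int((x - lo) / (hi - lo) * (ACTION_BINS - 1))
    return LANG_VOCAB_SIZE + bin_id                  # action IDs follow language IDs

def token_to_action(tok: int) -> float:
    """Invert the quantization, recovering the bin's lower edge."""
    lo, hi = COORD_RANGE
    bin_id = tok - LANG_VOCAB_SIZE
    return lo + bin_id / (ACTION_BINS - 1) * (hi - lo)

tok = action_to_token(12.3)
# Round-trip error is bounded by the bin width, (hi - lo) / (ACTION_BINS - 1).
assert LANG_VOCAB_SIZE <= tok < LANG_VOCAB_SIZE + ACTION_BINS
assert abs(token_to_action(tok) - 12.3) < (100.0 / 255.0)
```

Because action tokens live in the same vocabulary as words, the same decoder head scores both, which is what lets the model be trained jointly on trajectory generation and the auxiliary trajectory-captioning objective the abstract describes.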

Top-level tags: multi-modal agents, model training
Detailed tags: autonomous driving, vision-language-action, instruction alignment, efficient generation, cross-modal consistency

Unifying Language-Action Understanding and Generation for Autonomous Driving


1️⃣ One-Sentence Summary

This paper proposes a new architecture called LinkVLA, which unifies the representations of language and action and introduces a bidirectional training objective to address the misalignment between instructions and actions and the inefficiency of action generation in autonomous driving, thereby significantly improving driving performance while greatly reducing inference time.

Source: arXiv:2603.01441