arXiv submission date: 2026-02-11
📄 Abstract - From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving

Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy-cost trade-off. We revisit this question with a three-research-question (RQ) analysis in RecogDrive, instantiating the system with a full VLM and with vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM introduces additional representational subspaces beyond those of the vision-only backbones. RQ2: These unique subspaces lead to different behaviors in some long-tail scenarios: the VLM tends to be more aggressive whereas ViT is more conservative, and each decisively wins on about 2-3% of test scenarios. With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To fully harness this observation, we propose HybridDriveVLA, which runs both the ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast-slow policy: it runs ViT by default and invokes the VLM only when the scorer's confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.
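The fast-slow policy described above can be sketched as a confidence-gated dispatcher: run the cheap ViT branch first, and escalate to the VLM branch only when the scorer's confidence on the ViT trajectory falls below a threshold. This is a minimal illustrative sketch; the names (`vit_branch`, `vlm_branch`, `scorer`, `tau`) and the waypoint trajectory type are hypothetical, not the authors' released API.

```python
# Hedged sketch of a DualDriveVLA-style fast-slow gate.
# All branch/scorer names are hypothetical stand-ins for the paper's components.
from dataclasses import dataclass
from typing import Callable, Sequence

Trajectory = Sequence[tuple[float, float]]  # planar (x, y) waypoints


@dataclass
class GatedPlanner:
    vit_branch: Callable[[object], Trajectory]      # fast vision-only branch
    vlm_branch: Callable[[object], Trajectory]      # slow language-enabled branch
    scorer: Callable[[object, Trajectory], float]   # learned confidence in [0, 1]
    tau: float = 0.5  # threshold; tuned so only ~15% of scenes escalate

    def plan(self, scene) -> Trajectory:
        traj = self.vit_branch(scene)
        conf = self.scorer(scene, traj)
        if conf >= self.tau:
            return traj                      # fast path: accept ViT trajectory
        vlm_traj = self.vlm_branch(scene)    # slow path: invoke the VLM
        # keep whichever endpoint trajectory the scorer prefers
        return vlm_traj if self.scorer(scene, vlm_traj) > conf else traj
```

Because the scorer only re-ranks endpoint trajectories, the same component can serve both the HybridDriveVLA selector (always run both branches) and this gated variant (run the slow branch conditionally).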

Top-level tags: computer vision, agents, systems
Detailed tags: autonomous driving, vision-language models, end-to-end planning, multi-modal fusion, behavior analysis

From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving


1️⃣ One-sentence summary

This paper finds that, in autonomous driving systems, a language-enabled vision model and a pure vision model behave in complementary ways when making decisions. Building on this, it designs an efficient dual-system framework that intelligently chooses which model to use per scenario, substantially improving runtime efficiency while preserving performance.

Source: arXiv:2602.10719