Abstract - SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving
In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources because it relies on an LLM and on numerous visual tokens from sensor inputs, and such resources are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but these methods often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
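To make the third element more concrete, the following is a minimal sketch of anchor-context merging in plain Python. The abstract does not specify how anchors are chosen, how relevance is measured, or how tokens are combined, so everything here is an assumption for illustration: anchors are taken as the highest-scoring tokens, relevance is cosine similarity, and merging is a score-weighted mean. The names `merge_tokens` and `cosine` are hypothetical and are not from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def merge_tokens(tokens, scores, num_anchors):
    """Reduce `tokens` to `num_anchors` merged tokens.

    Assumptions (not taken from the paper): anchors are the top-scoring
    tokens, relevance is cosine similarity, merging is a score-weighted
    mean, and scores are non-negative.
    """
    if not 1 <= num_anchors <= len(tokens):
        raise ValueError("num_anchors must be between 1 and len(tokens)")

    # Rank tokens by importance; the top ones become anchors, the rest context.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    anchor_ids = sorted(order[:num_anchors])   # keep original token order
    context_ids = sorted(order[num_anchors:])

    # Assign each context token to its most similar anchor.
    groups = {a: [a] for a in anchor_ids}
    for c in context_ids:
        best = max(anchor_ids, key=lambda a: cosine(tokens[c], tokens[a]))
        groups[best].append(c)

    # Merge every group into a single token via a score-weighted mean.
    merged = []
    for a in anchor_ids:
        members = groups[a]
        weights = [scores[i] for i in members]
        if sum(weights) <= 0:                  # fall back to a plain mean
            weights = [1.0] * len(members)
        total = sum(weights)
        dim = len(tokens[a])
        merged.append([
            sum(w * tokens[i][d] for w, i in zip(weights, members)) / total
            for d in range(dim)
        ])
    return merged, anchor_ids


# Toy example: four 2-D "visual tokens" reduced to two.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.1, 0.8, 0.2]                  # stand-in importance scores
merged, anchor_ids = merge_tokens(tokens, scores, num_anchors=2)
```

In this toy run, tokens 0 and 2 become anchors and each absorbs its near-duplicate neighbour, so the sequence length halves while every input token still contributes to the output. In the framework described above, the scores would come from the learned importance predictor rather than being supplied by hand.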
SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving
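1️⃣ One-sentence summary
This paper proposes a new method called SToRM that lets the multi-modal large language models used in autonomous driving systems cut computational cost substantially (by up to 30x) while keeping driving performance on par with using all visual tokens, addressing the difficulty of deploying existing models efficiently on vehicles because of their high demand for computational resources.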