arXiv submission date: 2025-12-04
📄 Abstract - From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other kinds of video content, such as sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs spanning 7 human-designed tasks. In addition, an evaluation is performed covering 9 closed- and open-source generalist models as well as SoTA AD specialist models. On TAD, current SoTA models achieve substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available at \href{this https URL}{Hugging Face} and \href{this https URL}{GitHub}, respectively.

Top-level tags: multi-modal benchmark, computer vision
Detailed tags: autonomous driving, temporal reasoning, vision-language model, evaluation, cognitive map

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model


1️⃣ One-sentence summary

This paper tackles the difficult problem of temporal understanding in autonomous driving videos by introducing a dedicated evaluation benchmark, TAD, and designing two training-free methods that enhance existing vision-language models' understanding of dynamic scenes, significantly improving model performance on the benchmark.


Source: arXiv: 2512.05277