arXiv submission date: 2026-03-10
📄 Abstract - Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason about and align sequential events with current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.

Top-level tags: multi-modal model evaluation agents
Detailed tags: vision-language models autonomous driving temporal reasoning benchmark consistency evaluation

Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning


1️⃣ One-Sentence Summary

This paper finds that current vision-language models used as driving assistants suffer from unstable responses and a lack of temporal reasoning ability; the authors create a new dataset and propose a self-supervised improvement method aimed at making the models' decision-making in driving scenarios more reliable.

Source: arXiv: 2603.09512