Abstract - From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
High benchmark accuracy does not guarantee genuine use of visual evidence. We study this problem in traffic accident Video Question Answering (VideoQA), where correct answers should depend on scene-specific visual evidence but may instead be inferred from textual shortcuts. Through an audit of four public benchmarks, we find that several recent open-weight Vision-Language Models (VLMs) perform competitively, and sometimes better, without video input. On the MM-AU benchmark, removing video consistently improves accuracy, and adding more frames further degrades performance. To quantify visual dependence, we introduce two dataset-level diagnostics: Blind Gap, measuring above-chance text-only performance, and Visual Gain, measuring the marginal benefit of adding video. We further propose an instance-level Shortcut Score that combines text-only confidence with visual necessity signals, enabling continuous, training-free filtering of shortcut-prone questions. The resulting subsets reduce shortcut bias and improve visual grounding. Our findings reveal large differences in grounding quality across benchmarks and show that visually grounded evaluation, not just high accuracy, is essential in safety-critical VideoQA.
From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
1️⃣ One-Sentence Summary
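This study finds that, in traffic video question answering, many state-of-the-art vision-language models score highly even without seeing the video, which indicates that they rely on textual shortcuts rather than genuinely understanding the scene. To address this, the authors propose the Blind Gap, Visual Gain, and Shortcut Score metrics, which help select the questions that truly require visual evidence and so allow a more accurate evaluation of a model's visual understanding.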
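
The three diagnostics can be illustrated with a minimal Python sketch. The abstract does not give exact formulas, so everything below is an assumption made for illustration: Blind Gap is taken as text-only accuracy minus the chance rate, Visual Gain as accuracy with video minus text-only accuracy, and the instance-level `shortcut_score` is a hypothetical weighted mix of text-only confidence and a simple visual-necessity signal. The `Prediction` record, the `alpha` weight, and the `threshold` parameter are likewise invented here and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Prediction:
    """One multiple-choice question evaluated with and without video (hypothetical record)."""
    correct_text_only: bool      # model answered correctly with no video input
    correct_with_video: bool     # model answered correctly with video input
    text_only_confidence: float  # probability assigned to the gold answer, no video
    video_confidence: float      # probability assigned to the gold answer, with video


def _accuracy(flags: Sequence[bool]) -> float:
    return sum(flags) / len(flags)


def blind_gap(preds: Sequence[Prediction], num_options: int) -> float:
    """Assumed form: text-only accuracy minus the uniform chance rate."""
    text_acc = _accuracy([p.correct_text_only for p in preds])
    return text_acc - 1.0 / num_options


def visual_gain(preds: Sequence[Prediction]) -> float:
    """Assumed form: accuracy with video minus text-only accuracy."""
    video_acc = _accuracy([p.correct_with_video for p in preds])
    text_acc = _accuracy([p.correct_text_only for p in preds])
    return video_acc - text_acc


def shortcut_score(p: Prediction, alpha: float = 0.5) -> float:
    """Hypothetical combination, NOT the paper's formula.

    High when the model is confident without video and video adds little.
    """
    # Visual necessity: how much the gold-answer confidence rises once video is added.
    visual_necessity = max(0.0, p.video_confidence - p.text_only_confidence)
    return alpha * p.text_only_confidence + (1.0 - alpha) * (1.0 - visual_necessity)


def filter_questions(preds: Sequence[Prediction], threshold: float) -> List[Prediction]:
    """Keep questions whose shortcut score is below the (assumed) threshold."""
    return [p for p in preds if shortcut_score(p) < threshold]
```

Under these assumptions, a negative `visual_gain` corresponds to the case reported for MM-AU, where removing video improves accuracy. Because the score is continuous, the threshold can be swept to trade subset size against shortcut bias without any retraining.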