arXiv submission date: 2026-04-09
📄 Abstract - CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at this https URL.
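The two-tier multiple-choice setup described above amounts to per-tier accuracy over question-answer pairs. A minimal sketch of that scoring loop is below; the field names, tier labels, and the `predict` callable standing in for VLM inference are all illustrative assumptions, not taken from the actual CrashSight release.

```python
# Hypothetical sketch of per-tier multiple-choice scoring for a
# CrashSight-style benchmark. Field names ('question', 'choices',
# 'answer', 'tier') are illustrative, not from the real dataset.
from collections import defaultdict

def score_mcq(examples, predict):
    """Compute per-tier accuracy for multiple-choice QA.

    examples: iterable of dicts with 'question', 'choices',
              'answer' (index of the correct choice), and 'tier'.
    predict:  callable (question, choices) -> chosen index,
              standing in for a VLM inference call.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        tier = ex["tier"]
        total[tier] += 1
        if predict(ex["question"], ex["choices"]) == ex["answer"]:
            correct[tier] += 1
    # Accuracy per tier (Tier 1 grounding vs. Tier 2 reasoning)
    return {t: correct[t] / total[t] for t in total}

# Toy usage with a dummy "model" that always picks option 0:
examples = [
    {"question": "What vehicle type is involved?",
     "choices": ["car", "truck"], "answer": 0, "tier": "tier1"},
    {"question": "What caused the crash?",
     "choices": ["red-light running", "skidding"], "answer": 1, "tier": "tier2"},
]
print(score_mcq(examples, lambda q, c: 0))  # → {'tier1': 1.0, 'tier2': 0.0}
```

Reporting accuracy separately per tier, as the paper's taxonomy suggests, makes the gap between scene grounding and higher-level causal/temporal reasoning directly visible.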

Top-level tags: computer vision benchmark multi-modal
Detailed tags: traffic crash analysis vision-language models infrastructure perception temporal reasoning safety-critical evaluation

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning


1️⃣ One-sentence summary

This paper introduces CrashSight, a large-scale benchmark dataset that uses real-world roadside camera videos to evaluate how well vision-language models understand and reason about traffic crash scenes (including their causes, progression, and outcomes), finding that current models still fall short in temporal and causal reasoning for safety-critical scenarios.

Source: arXiv: 2604.08457