Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding
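1️⃣ One-Sentence Summary
This paper proposes a three-stage method that decomposes accident understanding in surveillance video into three subtasks: when the accident happened (temporal localization), what type it was (semantic classification), and where it happened (spatial localization). By combining vision-language models with metadata-driven multi-view prompt reasoning, it significantly improves the accuracy and reliability of accident detection under zero-shot conditions.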
In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs, using natural language. We propose a three-stage pipeline that decomposes accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact with an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enables more reliable reasoning with vision-language models than direct prompting alone.
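The abstract names two aggregation steps without giving their details: an entropy gate over the five prompt views, and a score-weighted centroid over keyframe detections. The Python sketch below is one plausible reading of each, not the authors' implementation. All function names, the `max_entropy` threshold, the label strings, and the `adjudicate` callable (a stand-in for the pairwise adjudicator) are assumptions introduced for illustration.

```python
import math
from collections import Counter
from typing import Callable, Sequence, Tuple


def vote_entropy(votes: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the label distribution across prompt views."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def resolve_label(
    votes: Sequence[str],
    adjudicate: Callable[[str, str], str],
    max_entropy: float = 1.0,  # assumed threshold, not taken from the paper
) -> str:
    """Return the majority label when the views mostly agree; otherwise
    hand the two most frequent labels to a pairwise adjudicator."""
    ranked = Counter(votes).most_common()
    if len(ranked) == 1 or vote_entropy(votes) <= max_entropy:
        return ranked[0][0]
    # High disagreement: compare the top two candidates head to head.
    return adjudicate(ranked[0][0], ranked[1][0])


def score_weighted_centroid(
    detections: Sequence[Tuple[float, float, float]],
) -> Tuple[float, float]:
    """Average (cx, cy) box centres across keyframes, weighting each
    detection by its confidence score."""
    total = sum(score for _, _, score in detections)
    if total <= 0:
        raise ValueError("detections must carry positive total score")
    x = sum(cx * score for cx, _, score in detections) / total
    y = sum(cy * score for _, cy, score in detections) / total
    return x, y


if __name__ == "__main__":
    # Hypothetical votes from the five views (baseline, motion, geometry,
    # contrast, tiebreaker); the label names are made up for this example.
    votes = ["rear-end", "rear-end", "sideswipe", "sideswipe", "head-on"]
    label = resolve_label(votes, adjudicate=lambda a, b: a)
    # Hypothetical normalized box centres with detector scores.
    point = score_weighted_centroid([(0.2, 0.4, 0.9), (0.6, 0.8, 0.1)])
    print(label, point)
```

Under this reading, the gate keeps the cheap majority vote when the views agree and spends the extra adjudication call only on ambiguous clips, while the centroid lets high-confidence keyframes dominate the final impact location.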
Source: arXiv: 2606.12047