Abstract - Learning Where and When: Patch-Based Spatiotemporal Localization in Weakly Supervised Video Anomaly Detection
Weakly supervised video anomaly detection (WSVAD) has predominantly focused on temporal localization, identifying when anomalies occur while largely neglecting their spatial extent within frames. Yet, spatial localization is essential for interpretability and practical deployment in real-world settings. We introduce a patch-based spatiotemporal framework for weakly supervised anomaly localization that jointly models where and when anomalies occur. Our approach operates on grid-level patch features and learns region-level anomaly scores under a multiple instance learning paradigm. We further propose a Proximity-Aware Top-k spatiotemporal selection strategy that enables the model to generate fine-grained spatial anomaly maps without requiring bounding-box supervision during training. Our method surpasses existing state-of-the-art approaches across multiple benchmarks, yielding substantial gains in spatiotemporal localization accuracy. In addition, we release frame-level bounding-box annotations for the test sets of two widely used datasets, along with our code and pretrained models, providing new resources to facilitate future research in spatially grounded WSVAD.
Learning Where and When: Patch-Based Spatiotemporal Localization in Weakly Supervised Video Anomaly Detection
1️⃣ One-Sentence Summary
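This paper proposes a new weakly supervised video anomaly detection method that splits video frames into small patches and applies a proximity-aware selection strategy, so that a model trained only on video-level labels (whether a video contains an anomaly) can accurately determine both when an anomaly occurs and where in the frame it appears, improving the granularity and interpretability of anomaly detection.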
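To make the selection idea concrete, here is a minimal sketch of one way a proximity-aware top-k step over patch scores could look. It is not the paper's implementation: the abstract gives no algorithmic details, so the function names (`proximity_topk`, `video_score`), the `radius` parameter, the choice of the single highest-scoring patch as a seed, and the toy scores are all assumptions made for illustration.

```python
from typing import List, Tuple

Patch = Tuple[int, int, int]      # (frame t, grid row y, grid column x)
Scores = List[List[List[float]]]  # scores[t][y][x], one anomaly score per patch


def proximity_topk(scores: Scores, k: int, radius: int = 1) -> List[Patch]:
    """Return the k highest-scoring patches near the top-scoring patch.

    ASSUMPTION: "proximity-aware" is modelled here as a fixed neighborhood
    (at most `radius` steps in t, y, and x) around the best patch.
    """
    flat = [
        (value, (t, y, x))
        for t, frame in enumerate(scores)
        for y, row in enumerate(frame)
        for x, value in enumerate(row)
    ]
    # Seed: the patch with the highest score in the whole video.
    _, seed = max(flat, key=lambda item: item[0])
    # Keep only patches inside the spatiotemporal neighborhood of the seed.
    nearby = [
        item for item in flat
        if all(abs(a - b) <= radius for a, b in zip(item[1], seed))
    ]
    nearby.sort(key=lambda item: item[0], reverse=True)
    return [patch for _, patch in nearby[:k]]


def video_score(scores: Scores, k: int, radius: int = 1) -> float:
    """Video-level score for MIL training: mean of the selected patch scores."""
    chosen = proximity_topk(scores, k, radius)
    return sum(scores[t][y][x] for t, y, x in chosen) / len(chosen)


# Toy input (hypothetical values): 2 frames, each a 2x3 grid of patch scores.
scores = [
    [[0.1, 0.2, 0.1], [0.1, 0.9, 0.8]],
    [[0.7, 0.1, 0.1], [0.1, 0.6, 0.95]],
]
selected = proximity_topk(scores, k=4)
```

In this toy input the isolated patch at `(1, 0, 0)` scores 0.7, higher than the 0.6 at `(1, 1, 1)`, yet it is not selected because it lies outside the neighborhood of the seed. A plain top-k would have picked it. The selected patches double as a coarse spatial anomaly map, while only the video-level label is needed to supervise `video_score`.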