ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance
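1️⃣ One-sentence summary
This paper proposes a new system, ForeSea, and a companion benchmark dataset, ForeSeaQA, for answering complex queries that combine an image with text over long, multi-camera surveillance video and for localizing when the queried event occurs; the system improves both search accuracy and temporal localization precision.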
Despite decades of work, surveillance systems still struggle to find specific targets across long, multi-camera video. Prior methods, including tracking pipelines, CLIP-based models, and VideoRAG, require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., "When does this person join the fight?" with the person's image), yet this setting remains underexplored. There are also no suitable benchmarks for evaluating this setting, in which questions about a video are posed as multimodal queries. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning in realistic forensic conditions. Beyond the benchmark, we propose ForeSea, an AI forensic search system with a 3-stage, plug-and-play pipeline: (1) a tracking module filters irrelevant footage; (2) a multimodal embedding module indexes the remaining clips; and (3) during inference, the system retrieves the top-K candidate clips for a Video Large Language Model (VideoLLM) to answer queries and localize events. On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models. To our knowledge, ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding, and ForeSea is the first VideoRAG system built to excel in this setting.
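Two ideas in the abstract, top-K clip retrieval and temporal IoU, can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the embedding model, or the exact IoU protocol, so everything below is an assumption for illustration: cosine similarity stands in for the retrieval score, the standard interval intersection-over-union stands in for temporal IoU, and the names `top_k_clips`, `cosine`, and `temporal_iou` are hypothetical, not part of ForeSea.

```python
import math


def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals, each (start, end) in seconds.

    Assumption: the standard interval IoU; the paper's exact protocol is not given.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def top_k_clips(query_vec, clip_index, k):
    """Return the ids of the k clips whose embeddings are most similar to the query.

    clip_index is a list of (clip_id, embedding) pairs. In a real system the
    embeddings would come from a multimodal encoder; here they are plain lists.
    """
    ranked = sorted(clip_index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [clip_id for clip_id, _ in ranked[:k]]


# Toy index with hand-written 2-D embeddings (illustrative values only).
index = [("clip_a", [1.0, 0.0]), ("clip_b", [0.0, 1.0]), ("clip_c", [1.0, 1.0])]
candidates = top_k_clips([1.0, 0.0], index, k=2)

# A predicted event window of 10-20 s against a ground-truth window of 15-25 s
# overlaps for 5 s out of a 15 s union.
score = temporal_iou((10.0, 20.0), (15.0, 25.0))
```

In this toy setup the retrieved candidates would then be handed to a VideoLLM for answering and localization, and the predicted time window would be scored against the annotated one with `temporal_iou`.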
Source: arXiv: 2603.22872