Video-BrowseComp: Benchmarking Agentic Video Research on Open Web
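1️⃣ One-sentence summary
This paper introduces Video-BrowseComp, the first benchmark designed specifically to evaluate how well AI agents can actively search for, watch, and analyze video content on the open web in order to answer complex questions. It shows that current state-of-the-art models still perform poorly on such tasks, which require temporal visual reasoning.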
The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
Source: arXiv:2512.23044