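SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
1️⃣ One-Sentence Summary
This paper introduces the first open benchmark dataset for the gaming short-video domain that evaluates how well multimodal AI models retrieve and reason over ambiguous video frames using specialized domain knowledge; experiments show that current models still have significant gaps in knowledge acquisition and tool use.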
Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflows to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.
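To make the evaluation setup more concrete, the following is a minimal sketch of four-choice accuracy scoring against a frozen, in-memory text corpus. It is not the SVFSearch code or API: the `Example` class, the `retrieve` and `accuracy` functions, the toy corpus, and the word-overlap ranking are all hypothetical and chosen only for illustration. The sketch is text-only and omits the image and multimodal retrieval interfaces the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    """Hypothetical four-choice item; field names are assumptions."""
    question: str
    options: Tuple[str, str, str, str]
    answer: int  # index of the correct option


def retrieve(corpus: Sequence[str], query: str, k: int = 1) -> List[str]:
    """Toy offline retriever: rank documents by number of shared words."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def accuracy(examples: Sequence[Example], predict: Callable[[Example], int]) -> float:
    """Fraction of examples where the predicted option index is correct."""
    correct = sum(predict(ex) == ex.answer for ex in examples)
    return correct / len(examples)


# Frozen toy corpus and examples (made up for this sketch).
CORPUS = (
    "the hero Alpha wields a fire sword",
    "the hero Beta uses an ice bow",
)
OPTIONS = ("Alpha", "Beta", "Gamma", "Delta")
EXAMPLES = (
    Example("which hero wields a fire sword", OPTIONS, 0),
    Example("which hero uses an ice bow", OPTIONS, 1),
)


def predict_direct(ex: Example) -> int:
    """Stand-in for model-only answering: always guesses the first option."""
    return 0


def predict_with_retrieval(ex: Example) -> int:
    """Stand-in for a retrieval workflow: pick the option named in the evidence."""
    evidence = " ".join(retrieve(CORPUS, ex.question, k=1)).lower()
    counts = [evidence.count(opt.lower()) for opt in ex.options]
    return max(range(len(counts)), key=lambda i: counts[i])


if __name__ == "__main__":
    print("direct:", accuracy(EXAMPLES, predict_direct))
    print("with retrieval:", accuracy(EXAMPLES, predict_with_retrieval))
```

Because the corpus is fixed in memory rather than fetched from a live web search API, repeated runs score the same predictor identically, which is the reproducibility property the frozen offline environment is meant to provide.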
Source: arXiv: 2605.17946