TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
1️⃣ One-Sentence Summary
By building high-quality datasets and exploring effective algorithmic designs, this paper systematically improves the video temporal grounding ability of multimodal large language models, achieving performance that surpasses existing open-source models and even some proprietary models.
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
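As an illustration of what "verifiable rewards" means for VTG (this is a generic sketch, not the paper's exact reward design): a predicted time span can be scored against the annotated ground-truth span by temporal intersection-over-union, giving a reward that needs no learned judge.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals in seconds.

    This serves as a simple verifiable reward for temporal grounding:
    1.0 for a perfect match, 0.0 for no overlap.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Predicted span [10s, 25s] vs. ground truth [12s, 30s]:
# overlap = 13s, union = 20s, so IoU = 0.65.
print(temporal_iou((10.0, 25.0), (12.0, 30.0)))
```

Because the reward is computed directly from annotations, its reliability hinges on annotation quality, which is precisely why the paper's re-annotated TimeLens-100K matters for RLVR training.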
Source: arXiv:2512.14698