KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs
1️⃣ One-sentence summary
This paper proposes a two-stage method named KTV that intelligently selects key frames from a video and then further filters the key visual elements within those frames, substantially improving both the efficiency and accuracy of existing image-understanding models on long videos, without any additional training.
Training-free video understanding leverages the strong image comprehension capabilities of pre-trained vision language models (VLMs) by treating a video as a sequence of static frames, thus obviating the need for costly video-specific training. However, this paradigm often suffers from severe visual redundancy and high computational overhead, especially when processing long videos. Crucially, existing keyframe selection strategies, especially those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension. To address these significant challenges, we propose \textbf{KTV}, a novel two-stage framework for efficient and effective training-free video understanding. In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and representative subset of frames that mitigates temporal redundancy. In the second stage, KTV applies key visual token selection, pruning redundant or less informative tokens from each selected keyframe based on token importance and redundancy, which significantly reduces the number of tokens fed into the LLM. Extensive experiments on the Multiple-Choice VideoQA task demonstrate that KTV outperforms state-of-the-art training-free baselines while using significantly fewer visual tokens, \emph{e.g.}, only 504 visual tokens for a 60-minute video with 10,800 frames, achieving $44.8\%$ accuracy on the MLVU-Test benchmark. Notably, KTV also exceeds several training-based approaches on certain benchmarks.
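The two stages described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cluster count, redundancy threshold, and the use of feature norm as an "importance" proxy are all assumptions, and frame/token features are taken as pre-extracted arrays.

```python
# Sketch of KTV-style two-stage selection (illustrative only).
import numpy as np

def select_keyframes(frame_feats: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Stage 1 (sketch): question-agnostic keyframe selection by k-means
    clustering of frame-level features; return, per cluster, the index of the
    frame nearest its centroid, yielding a compact and diverse subset."""
    n = frame_feats.shape[0]
    rng = np.random.default_rng(0)
    centroids = frame_feats[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each frame to its nearest centroid, then update centroids.
        d = np.linalg.norm(frame_feats[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = frame_feats[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Use the medoid (an actual frame) of each cluster as the keyframe.
    d = np.linalg.norm(frame_feats[:, None] - centroids[None], axis=-1)
    return np.unique(d.argmin(axis=0))

def select_key_tokens(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Stage 2 (sketch): greedily keep high-importance tokens (importance
    approximated here by feature norm) while skipping tokens that are
    near-duplicates (cosine similarity >= 0.9) of an already-kept token."""
    norms = np.linalg.norm(tokens, axis=1)
    order = np.argsort(-norms)              # most "important" first
    unit = tokens / (norms[:, None] + 1e-8)
    kept: list[int] = []
    for i in order:
        if all(unit[i] @ unit[j] < 0.9 for j in kept):  # redundancy filter
            kept.append(int(i))
        if len(kept) == keep:
            break
    return np.array(sorted(kept))

# Toy run: 100 frames of 32-dim features -> <=8 keyframes; 64 tokens -> 16 kept.
frames = np.random.default_rng(1).normal(size=(100, 32))
key_ids = select_keyframes(frames, k=8)
toks = np.random.default_rng(2).normal(size=(64, 32))
kept = select_key_tokens(toks, keep=16)
```

Keeping medoids rather than centroids ensures every selected "keyframe" corresponds to a real frame that can be fed to the VLM; the greedy redundancy filter in stage 2 mirrors the abstract's combination of token importance and redundancy, though the actual scoring used by KTV may differ.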
Source: arXiv:2602.03615