arXiv submission date: 2026-03-24
📄 Abstract - ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting

Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.
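As a rough illustration of the two ideas in the abstract (ordinal frame labels as a visual prompt, and frame indices used as dictionary-like keys for keywords), here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how frames are sampled, how the indices are drawn onto frames, or how KFM scores relevance. Every function name below is hypothetical, uniform sampling is an assumed stand-in for the frame selector, and the relevance scores are assumed to come from some external image-text similarity model.

```python
from typing import Dict, List


def sample_frame_indices(total_frames: int, keep_ratio: float) -> List[int]:
    """Uniformly pick source-frame positions (assumed sampling strategy)."""
    n = max(1, round(total_frames * keep_ratio))
    step = total_frames / n
    return [int(i * step) for i in range(n)]


def ordinal_labels(source_indices: List[int]) -> Dict[int, int]:
    """Map the 1-based ordinal shown on each sampled frame to its source frame.

    In a real pipeline the ordinal would be drawn onto the image itself;
    here we only track the bookkeeping.
    """
    return {ordinal: src for ordinal, src in enumerate(source_indices, start=1)}


def keyword_frame_mapping(scores: Dict[str, List[float]]) -> Dict[str, int]:
    """Link each keyword to the ordinal of its highest-scoring frame.

    `scores[keyword][i]` is a hypothetical relevance score between the
    keyword and the i-th sampled frame (0-based list position).
    """
    mapping = {}
    for keyword, per_frame in scores.items():
        best = max(range(len(per_frame)), key=per_frame.__getitem__)
        mapping[keyword] = best + 1  # convert to the 1-based on-frame ordinal
    return mapping


def build_prompt(question: str, mapping: Dict[str, int]) -> str:
    """Compose a text prompt that cites frames by their visible ordinal."""
    anchors = "; ".join(f"'{kw}' -> frame {idx}" for kw, idx in mapping.items())
    return (
        "Each frame shows its number in temporal order. "
        f"Relevant frames: {anchors}. Question: {question}"
    )


if __name__ == "__main__":
    indices = sample_frame_indices(total_frames=100, keep_ratio=0.2)
    labels = ordinal_labels(indices)
    mapping = keyword_frame_mapping(
        {"pour": [0.1, 0.7, 0.2], "stir": [0.2, 0.1, 0.9]}
    )
    print(len(indices), labels[1], labels[20])
    print(build_prompt("Does pouring happen before stirring?", mapping))
```

The point of the sketch is the keying scheme: because the same ordinal appears both on the frame and in the text, the prompt can refer to a specific frame unambiguously even when intermediate frames have been dropped.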

Top-level tags: multi-modal, model evaluation, video
Detailed tags: video large language models, temporal reasoning, visual prompting, efficiency, frame selection

ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting


1️⃣ One-sentence summary

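This paper proposes ViKey, a training-free framework that adds simple visual prompts, such as an ordinal number on each video frame, to help video large language models better understand the temporal order and relations of events, so that with only a small fraction of the frames they can, on some datasets, match the temporal reasoning performance of processing all dense frames.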

Source: arXiv: 2603.23186