arXiv submission date: 2026-03-19
📄 Abstract - HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce **HORNet**, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing *what* a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at this https URL.

Top-level tags: multi-modal, model evaluation, machine learning
Detailed tags: video question answering, frame selection, vision-language models, policy optimization, efficiency

HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models


1️⃣ One-Sentence Summary

This paper proposes HORNet, a lightweight, intelligent frame selector that automatically picks the video frames most relevant to a given video question answering task and feeds only those to a vision-language model, dramatically cutting computational cost while actually improving answer accuracy.
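To illustrate the two core ideas in the summary above, here is a minimal sketch: a policy that scores frames and keeps only the top-k, and the group-relative advantage computation at the heart of GRPO (each sampled selection is rewarded relative to its group's mean). All function names and details are illustrative assumptions, not the paper's actual implementation.

```python
def select_frames(scores, k):
    """Policy action: keep the k highest-scoring frame indices.
    `scores` is one score per frame from a (hypothetical) lightweight scorer."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def grpo_advantages(rewards):
    """GRPO-style advantages: for a group of sampled selections evaluated by
    answer quality, normalize each reward by the group mean and std, so no
    learned value function (critic) is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for constant rewards
    return [(r - mean) / std for r in rewards]

# Example: from 5 frames, keep the 2 the scorer rates highest.
kept = select_frames([0.1, 0.9, 0.3, 0.8, 0.2], k=2)  # → [1, 3]
```

In this framing, the VLM itself stays frozen: only the tiny scorer's parameters are updated, pushed toward selections whose advantage (answer reward relative to the group) is positive.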

Source: arXiv 2603.18850