arXiv submission date: 2026-03-12
📄 Abstract - Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling MLLMs to scale to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: this https URL.
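The abstract's core mechanism is selecting a minimal patch subset whose reconstruction of the video stays within a user-specified error threshold. The paper's actual module is learned (next-token prediction plus reinforcement learning); as a rough intuition only, the selection criterion can be sketched as a greedy autoregressive loop. Everything here (`autoregressive_gaze`, the `reconstruct` callback, the MSE stopping rule) is a hypothetical stand-in, not the paper's implementation.

```python
import numpy as np

def autoregressive_gaze(patches, reconstruct, target, epsilon=0.05):
    """Greedy sketch of threshold-bounded patch selection.

    patches     : list of candidate patch arrays (multi-scale in the paper)
    reconstruct : hypothetical decoder mapping a selected subset back to a
                  video estimate (stand-in for a learned reconstructor)
    target      : original video tensor the reconstruction must match
    epsilon     : user-specified mean-squared-error budget
    """
    selected = []
    remaining = list(range(len(patches)))
    while remaining:
        # Autoregressive step: pick the patch that most reduces the
        # reconstruction error given everything selected so far.
        best_idx, best_err = None, None
        for i in remaining:
            trial = selected + [patches[i]]
            err = float(np.mean((reconstruct(trial) - target) ** 2))
            if best_err is None or err < best_err:
                best_idx, best_err = i, err
        selected.append(patches[best_idx])
        remaining.remove(best_idx)
        if best_err <= epsilon:  # stop once within the error budget
            break
    return selected
```

The key property this illustrates is that the number of retained patches adapts to content: redundant videos terminate after few steps, which is where the claimed 4x-100x token reduction would come from.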

Top-level tags: multi-modal model training, model evaluation
Detailed tags: video understanding, token reduction, autoregressive selection, reinforcement learning, benchmark

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing


1️⃣ One-sentence summary

This paper proposes a lightweight module called AutoGaze that, via autoregressive learning, intelligently selects the most critical patches in a video, allowing large models to process long, high-resolution videos with far less computation and much higher speed while matching or even exceeding their original understanding ability.

Source: arXiv: 2603.12254