HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
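1️⃣ One-sentence summary
This paper proposes HERMES, a new method that organizes the intermediate data a model produces while processing video (the KV cache) into a hierarchical memory, enabling real-time, accurate understanding of continuous video streams without any additional training while substantially reducing compute and memory overhead.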
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving a 10$\times$ faster time to first token (TTFT) than the prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with gains of up to 11.4% on streaming datasets.
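To make the idea of a compact, reusable cache concrete, the sketch below shows a generic fixed-budget cache that is compacted when frames arrive, so that nothing is left to compute when a query comes in. It is not the HERMES algorithm: the abstract does not specify the retention rule, so the class `BoundedCache`, its `budget` and `recent` parameters, and the per-token importance `score` are all hypothetical, chosen only to illustrate the general pattern.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    frame: int    # index of the frame this cached token came from
    score: float  # hypothetical importance score (an assumption, not from the paper)


@dataclass
class BoundedCache:
    budget: int   # maximum number of cached entries (assumed parameter)
    recent: int   # number of newest entries that are always kept (assumed parameter)
    entries: list = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0 <= self.recent <= self.budget:
            raise ValueError("recent must be between 0 and budget")

    def ingest(self, frame: int, scores: list) -> None:
        """Append one frame's tokens, then compact if the budget is exceeded."""
        self.entries.extend(Entry(frame, s) for s in scores)
        if len(self.entries) > self.budget:
            self._compact()

    def _compact(self) -> None:
        # Split into older entries (eviction candidates) and the protected recent window.
        split = len(self.entries) - self.recent
        old, new = self.entries[:split], self.entries[split:]
        keep = self.budget - len(new)
        # Rank older entries by score, keep the best, and restore temporal order.
        ranked = sorted(range(len(old)), key=lambda i: old[i].score, reverse=True)
        kept = sorted(ranked[:keep])
        self.entries = [old[i] for i in kept] + new

    def on_query(self) -> list:
        """Return the cache as-is; no work is deferred to query time."""
        return self.entries


cache = BoundedCache(budget=4, recent=2)
cache.ingest(0, [0.9, 0.1, 0.5])
cache.ingest(1, [0.2, 0.8, 0.3])
print([(e.frame, e.score) for e in cache.on_query()])
```

In this toy setup all compaction cost is paid in `ingest`, which mirrors the abstract's claim that no auxiliary computation is needed when a user query arrives; a real system would store key and value tensors per layer rather than scalar scores.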
Source: arXiv:2601.14724