arXiv submission date: 2026-01-25
📄 Abstract - Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often incur a trade-off between performance degradation and computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios with negligible computational cost. Our approach introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, and integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies on forward passes of an LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.
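The abstract describes scoring KV pairs with lightweight gating modules and evicting the low-scoring ones. The paper's exact sink-attention gate architecture and training procedure are not given here, so the following is only a minimal sketch of score-based KV eviction: the gate is stood in for by an opaque per-position score, and all tensor names, shapes, and the `keep_ratio` parameter are illustrative assumptions rather than the authors' implementation.

```python
import torch


def evict_kv_with_gate(keys, values, gate_scores, keep_ratio=0.3):
    """
    Minimal sketch of score-based KV cache eviction (hypothetical interface).

    keys, values: [num_heads, seq_len, head_dim] cached tensors for one layer.
    gate_scores:  [num_heads, seq_len] importance scores from a lightweight
                  gating module (the paper's sink-attention gates; modeled
                  here as an opaque score per KV pair).
    keep_ratio:   fraction of KV pairs to retain (0.3 ~ evicting 70%).
    """
    num_heads, seq_len, head_dim = keys.shape
    keep = max(1, int(seq_len * keep_ratio))

    # Keep the top-`keep` scoring positions per head; evict the rest.
    # Sorting the indices preserves the original token order in the cache.
    top_idx = gate_scores.topk(keep, dim=-1).indices.sort(dim=-1).values
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, head_dim)

    kept_keys = keys.gather(1, gather_idx)
    kept_values = values.gather(1, gather_idx)
    return kept_keys, kept_values, top_idx


# Toy usage: a hypothetical linear gate stands in for the trained gating module.
if __name__ == "__main__":
    torch.manual_seed(0)
    H, T, D = 4, 128, 64
    keys, values = torch.randn(H, T, D), torch.randn(H, T, D)

    gate_weight = torch.randn(D)  # stand-in for trained gate parameters
    scores = torch.einsum("htd,d->ht", keys, gate_weight)

    k, v, idx = evict_kv_with_gate(keys, values, scores, keep_ratio=0.3)
    print(k.shape, v.shape)  # each: torch.Size([4, 38, 64])
```

In an actual serving stack the scores would come from the trained gates and the eviction would be applied per layer during prefill and decoding; this sketch only illustrates the keep-top-fraction step.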

Top-level tags: llm systems model evaluation
Detailed tags: kv cache inference optimization memory compression gating mechanism efficient decoding

Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction


1️⃣ One-sentence summary

This paper proposes a new gating-based method for accelerating large language model inference. Like a smart housekeeper, it automatically identifies and retains the most important information in the context, substantially reducing the computational and memory burden with almost no loss in answer quality, so large models run faster and consume fewer resources.

Source: arXiv: 2601.17668