CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
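1️⃣ One-sentence summary
This paper proposes CompressKV, a method that identifies the attention heads responsible for semantic retrieval in large language models and uses them to select and retain the key context information, substantially improving long-context inference performance and resource efficiency while requiring only a very small cache.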
Long-context large language model (LLM) inference is increasingly constrained by the memory footprint and decoding cost of key-value (KV) caches, limiting sustainable deployment on resource-constrained hardware. Existing KV cache eviction methods typically apply heuristic token scoring over all heads in GQA-based LLMs. These methods ignore the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs. To address this issue, we propose CompressKV, a resource-efficient KV-cache compression framework for GQA-based LLMs. Instead of aggregating attention scores from all heads, CompressKV identifies Semantic Retrieval Heads (SRHs) that capture both the initial and final tokens of a prompt and semantically important mid-context evidence, and uses them to select tokens whose KV pairs should be retained. Furthermore, CompressKV allocates cache budgets across layers according to offline estimates of layer-wise eviction error. Experiments on LongBench and Needle-in-a-Haystack show that CompressKV consistently outperforms existing KV-cache eviction methods across memory budgets. Notably, it preserves over 97% of full-cache performance using only 3% of the KV cache on LongBench question-answering tasks and achieves 90% accuracy with just 0.7% KV storage on Needle-in-a-Haystack. These results demonstrate an improved resource-performance trade-off for long-context LLM inference. Our code is publicly available at: this https URL
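The two ideas in the abstract, scoring tokens with a subset of heads and splitting the cache budget across layers by estimated eviction error, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `select_tokens` and `allocate_budgets`, the mean-over-heads scoring, the top-k selection, and the proportional allocation rule are all assumptions made for illustration, and how the Semantic Retrieval Heads are identified is not covered here (their indices are simply passed in as `head_ids`).

```python
import numpy as np


def select_tokens(attn, head_ids, budget):
    """Pick which token positions to keep in one layer's KV cache.

    attn     : array (num_heads, num_queries, seq_len) of attention weights
               from a few observed queries to every prompt token.
    head_ids : indices of the heads trusted for scoring (hypothetical stand-in
               for Semantic Retrieval Heads; identifying them is out of scope).
    budget   : number of tokens whose KV pairs are retained.
    """
    attn = np.asarray(attn, dtype=float)
    seq_len = attn.shape[-1]
    k = min(int(budget), seq_len)
    if k <= 0:
        return np.empty(0, dtype=int)
    # Assumption: average attention over the chosen heads and observed queries,
    # ignoring every other head.
    scores = attn[list(head_ids)].mean(axis=(0, 1))
    # Top-k positions by score, returned in their original order.
    keep = np.argpartition(scores, -k)[-k:]
    return np.sort(keep)


def allocate_budgets(layer_errors, total_budget, min_per_layer=1):
    """Split a total token budget across layers.

    Assumption: each layer gets a floor of `min_per_layer` tokens and the rest
    is shared in proportion to its offline eviction-error estimate.
    """
    errors = np.asarray(layer_errors, dtype=float)
    n = errors.size
    spare = int(total_budget) - min_per_layer * n
    if spare < 0:
        raise ValueError("total_budget is too small for min_per_layer")
    total_error = errors.sum()
    weights = errors / total_error if total_error > 0 else np.full(n, 1.0 / n)
    budgets = min_per_layer + np.floor(weights * spare).astype(int)
    # Hand out tokens lost to rounding, starting with the highest-error layers.
    leftover = int(total_budget) - int(budgets.sum())
    order = np.argsort(-errors, kind="stable")
    budgets[order[:leftover]] += 1
    return budgets


if __name__ == "__main__":
    # Toy example: 3 heads, 2 observed queries, 6 prompt tokens.
    attn = np.array([
        [[0.5, 0.0, 0.0, 0.4, 0.0, 0.1]] * 2,  # head 0
        [[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]] * 2,  # head 1 (not used for scoring)
        [[0.3, 0.0, 0.0, 0.3, 0.0, 0.4]] * 2,  # head 2
    ])
    print(select_tokens(attn, head_ids=[0, 2], budget=3))
    print(allocate_budgets([1.0, 3.0, 4.0], total_budget=19))
```

In the toy example, head 1 attends only to a token that the two scoring heads ignore, so that token is evicted even though an all-heads average would have ranked it highly. This is the behavioral difference from all-head aggregation that the abstract describes.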
Source: arXiv: 2606.24467