DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
1️⃣ One-Sentence Summary
This paper proposes DepthKV, a KV cache pruning method that allocates a global cache budget across layers according to each layer's sensitivity to pruning, rather than applying the same pruning ratio to every layer, thereby using memory more efficiently and improving model performance during long-context inference.
Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.
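The summary above does not include pseudocode, so the following is a minimal sketch of the idea under assumed interfaces: a hypothetical `allocate_layer_budgets` that splits a fixed global token budget across layers in proportion to per-layer sensitivity scores (the concrete sensitivity metric used by DepthKV is not specified here), and a `prune_layer_kv` helper that keeps the highest-attention tokens within each layer's budget. It illustrates the allocation principle, not the paper's actual implementation.

```python
import torch

def allocate_layer_budgets(sensitivities, global_budget):
    """Split a fixed global KV budget across layers in proportion to
    each layer's (assumed) pruning-sensitivity score."""
    weights = torch.tensor(sensitivities, dtype=torch.float32)
    weights = weights / weights.sum()
    budgets = (weights * global_budget).round().long()
    # Correct rounding drift so per-layer budgets still sum to the global budget.
    budgets[-1] += global_budget - budgets.sum()
    return budgets

def prune_layer_kv(keys, values, attn_scores, budget):
    """Keep only the `budget` cached tokens with the highest accumulated
    attention scores for this layer; drop the rest."""
    # keys/values: [seq_len, num_heads, head_dim]; attn_scores: [seq_len]
    budget = min(budget, attn_scores.shape[0])
    keep = torch.topk(attn_scores, k=budget).indices.sort().values
    return keys[keep], values[keep]

# Example: four layers, hypothetical sensitivity scores, 1024 cached tokens total.
# A uniform scheme would give each layer 256 tokens; here more sensitive layers get more.
budgets = allocate_layer_budgets([0.4, 0.1, 0.3, 0.2], global_budget=1024)
```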
Source: arXiv: 2604.24647