Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold
1️⃣ One-sentence summary
This paper proposes a new method called StiefAttention, which compresses the KV cache of large language models at inference time more effectively by learning orthonormal projections directly and minimizing the decoder-layer output error, yielding significantly better model performance at the same compression ratio.
Key-value (KV) caching enables fast autoregressive decoding, but at long context lengths it becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by 11.9 points on C4 perplexity and 5.4% on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
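To make the mechanics concrete, here is a minimal PyTorch sketch, not the paper's implementation: it learns an orthonormal basis `B` for one attention head by plain gradient descent followed by a QR retraction back onto the Stiefel manifold, and uses attention-output reconstruction error on random calibration tensors as a stand-in for the paper's decoder-layer output objective. The tensor shapes, optimizer, loss, and retraction choice are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): learn an orthonormal basis B (d x r)
# for compressing a cached K/V head, via gradient descent plus a QR retraction that keeps
# B on the Stiefel manifold. The loss is attention-output reconstruction error, used here
# as a simple proxy for the decoder-layer output reconstruction objective.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, r, n = 64, 16, 256          # head dim, target rank, cached tokens (hypothetical sizes)
K = torch.randn(n, d)          # calibration keys for one head (stand-in data)
V = torch.randn(n, d)          # calibration values
Q = torch.randn(32, d)         # calibration queries

def attn_out(q, k, v):
    """Standard scaled dot-product attention output."""
    w = torch.softmax(q @ k.T / d**0.5, dim=-1)
    return w @ v

target = attn_out(Q, K, V)     # reference outputs with the uncompressed cache

# Initialize B with orthonormal columns, i.e. a point on the Stiefel manifold St(d, r).
B = torch.linalg.qr(torch.randn(d, r)).Q.requires_grad_(True)
opt = torch.optim.SGD([B], lr=1e-2)

for step in range(500):
    K_hat = (K @ B) @ B.T      # cache stores K @ B (n x r); reconstruct on the fly
    V_hat = (V @ B) @ B.T
    loss = F.mse_loss(attn_out(Q, K_hat, V_hat), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():      # QR retraction: project the updated B back onto the manifold
        B.copy_(torch.linalg.qr(B).Q)

print(f"final reconstruction loss: {loss.item():.4f}")
```

In this sketch the compressed cache would hold `K @ B` and `V @ B` (n × r per head) and multiply by `B.T` at decode time; the orthonormality of `B` is exactly what the Stiefel-manifold constraint enforces. The paper's layer-wise rank allocation would then pick `r` per layer from a precomputed error-rank profile under a global error budget.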
Source: arXiv: 2601.21686