SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
1️⃣ One-Sentence Summary
This paper proposes a system-aware 4-bit KV-cache quantization method that combines simple token-wise quantization with a block-diagonal Hadamard rotation. Without sacrificing serving efficiency, it recovers nearly all of the accuracy lost by naive INT4 quantization, and demonstrates that in real deployments lightweight methods outperform complex ones.
KV-cache memory is a major bottleneck in real-world LLM serving, where systems must simultaneously support latency-sensitive small-batch requests and high-throughput concurrent workloads. Although many KV-cache compression methods improve offline accuracy or compression ratio, they often violate practical serving constraints such as paged memory layouts, regular memory access, and fused attention execution, limiting their effectiveness in deployment. In this work, we identify the minimal set of 4-bit KV-cache quantization methods that remain viable under these constraints. Our central finding is that a simple design--token-wise INT4 quantization with block-diagonal Hadamard rotation--consistently achieves the best accuracy-efficiency trade-off. Across multiple models and benchmarks, this approach recovers nearly all of the accuracy lost by naive INT4, while more complex methods such as vector quantization and Hessian-aware quantization provide only marginal additional gains once serving compatibility is taken into account. To make this practical, we implement a fused rotation-quantization kernel that integrates directly into paged KV-cache layouts and introduces zero measurable end-to-end overhead, matching plain INT4 throughput across concurrency levels. Our results show that effective KV-cache compression is fundamentally a systems co-design problem: under real serving constraints, lightweight block-diagonal Hadamard rotation is a viable method that delivers near-lossless accuracy without sacrificing serving efficiency.
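The core recipe described above — rotate each contiguous block of channels with a Hadamard matrix to spread outliers, then apply per-token asymmetric INT4 quantization — can be sketched in NumPy. This is an illustrative toy, not the paper's fused kernel: the block size of 16, the synthetic outlier channel, and all function names are assumptions for demonstration. A Sylvester-constructed Hadamard matrix is symmetric and orthogonal, so the same function both applies and inverts the rotation.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    # The normalized result is symmetric and orthogonal (H @ H == I).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_blocks(x, block=16):
    # Block-diagonal Hadamard rotation along the channel dimension:
    # each contiguous group of `block` channels is mixed by H.
    # Because H is symmetric orthogonal, applying this twice is the identity.
    H = hadamard(block)
    tokens, dim = x.shape
    return (x.reshape(tokens, dim // block, block) @ H).reshape(tokens, dim)

def int4_quant_dequant(x):
    # Asymmetric per-token INT4: one (scale, zero-point) pair per token row,
    # mapping the row's [min, max] range onto the 16 levels {0, ..., 15}.
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    q = np.clip(np.round((x - lo) / scale), 0, 15)
    return q * scale + lo  # dequantize back to float for comparison

rng = np.random.default_rng(0)
# Toy "K cache": 8 tokens x 64 channels with one outlier channel --
# the failure mode that the rotation is meant to smooth out.
kv = rng.normal(size=(8, 64)).astype(np.float64)
kv[:, 3] *= 20.0

# Naive per-token INT4: the outlier channel inflates every token's range.
naive_err = np.abs(int4_quant_dequant(kv) - kv).mean()

# Rotate, quantize, then undo the rotation (its own inverse here).
restored = rotate_blocks(int4_quant_dequant(rotate_blocks(kv)))
rot_err = np.abs(restored - kv).mean()

print(f"naive INT4 error:   {naive_err:.4f}")
print(f"rotated INT4 error: {rot_err:.4f}")  # noticeably smaller
```

The rotation spreads the outlier channel's energy across its 16-channel block, shrinking each token's dynamic range and thus the quantization step size, which is why the rotated variant reconstructs more accurately than naive INT4 at the same bit width.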
Source: arXiv: 2604.19157