arXiv submission date: 2026-05-20
📄 Abstract - OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet's direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. We find the finite-dimensional quality optimum with sweeps to be constant on every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that grows as bits drop for extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency over the existing dequantization. Project Page: this https URL
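The triplet step described in the abstract can be illustrated with the standard octahedral map from the graphics literature. The following is a minimal sketch, not the paper's implementation: it assumes NumPy, assumes the vector length is divisible by 3, and omits both the random rotation and the Lloyd-Max quantizers. The function names (`oct_encode`, `oct_decode`, `encode_triplets`, `decode_triplets`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def _sign_not_zero(x):
    # Like np.sign, but maps 0 to +1 so the fold below stays invertible.
    return np.where(x >= 0.0, 1.0, -1.0)


def oct_encode(d):
    """Map unit vectors of shape (..., 3) to points of shape (..., 2) in [-1, 1]^2."""
    d = np.asarray(d, dtype=np.float64)
    # Project onto the octahedron |x| + |y| + |z| = 1 (guarded against zero input).
    l1 = np.maximum(np.sum(np.abs(d), axis=-1, keepdims=True), 1e-12)
    p = d / l1
    xy = p[..., :2]
    # Fold the lower hemisphere (z < 0) outward into the corners of the square.
    folded = (1.0 - np.abs(xy[..., ::-1])) * _sign_not_zero(xy)
    return np.where(p[..., 2:3] < 0.0, folded, xy)


def oct_decode(uv):
    """Inverse of oct_encode: points in [-1, 1]^2 back to unit vectors."""
    uv = np.asarray(uv, dtype=np.float64)
    z = 1.0 - np.sum(np.abs(uv), axis=-1, keepdims=True)
    # Undo the fold wherever the point lies outside the inner diamond.
    unfolded = (1.0 - np.abs(uv[..., ::-1])) * _sign_not_zero(uv)
    xy = np.where(z < 0.0, unfolded, uv)
    v = np.concatenate([xy, z], axis=-1)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def encode_triplets(x):
    """Split a vector into per-triplet norms and octahedral coordinates."""
    t = np.asarray(x, dtype=np.float64).reshape(-1, 3)
    r = np.linalg.norm(t, axis=-1, keepdims=True)
    d = t / np.maximum(r, 1e-12)  # unit direction of each triplet
    return r[..., 0], oct_encode(d)


def decode_triplets(r, uv):
    """Rebuild the flat vector from norms and octahedral coordinates."""
    return (np.asarray(r)[..., None] * oct_decode(uv)).reshape(-1)
```

Without quantization the round trip is exact up to floating-point error, so any reconstruction error in a codec built this way would come from quantizing the three scalars per triplet (two square coordinates and one norm), which is where the abstract places the Lloyd-Max quantizers.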

Top-level tags: systems machine learning
Detailed tags: kv cache compression quantization transformer inference octahedral parameterization efficient decoding

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization


1️⃣ One-sentence summary

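OCTOPUS proposes a new key-value cache compression method that applies an octahedral parameterization to triplets of rotated coordinates and quantizes them jointly, substantially reducing memory footprint and bandwidth demand in long-context inference while preserving model accuracy and adding no decode latency.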

Source: arXiv: 2605.21226