arXiv submission date: 2026-05-04
📄 Abstract - Statistically-Lossless Quantization of Large Language Models

Model quantization has become essential for efficient large language model deployment, yet existing approaches involve clear trade-offs: methods such as GPTQ and AWQ achieve practical compression but are lossy, while lossless techniques preserve fidelity but typically do not accelerate inference. This paper explores the middle ground of statistically-lossless compression through three complementary notions of losslessness for quantized LLMs. First, task-lossless compression preserves zero-shot benchmark accuracy within natural sampling variance and remains achievable at aggressive bitwidths. Second, we formalize the stricter notion of distribution-lossless compression, requiring the quantized model's next-token distribution to be practically indistinguishable from the original, and propose the Expected Acceptance Rate (EAR), the maximum token-agreement probability under optimal coupling, as a directly interpretable fidelity metric (for example, EAR ≥ 0.99 indicates 99% agreement). Third, we prove a γ² variance law showing that symmetric quantization inflates noise variance by a factor of γ² relative to asymmetric quantization, making asymmetry necessary for distribution-lossless fidelity but not for task-level preservation. Using SLQ, a layer-wise non-uniform method with asymmetric quantization and wide bitwidth search, we achieve task-lossless compression at well below 4 bits per parameter (as low as 3.3 bits depending on the model), distribution-lossless compression at 5 to 6 bits per parameter on average, and inference speedups of 1.7 to 3.6× relative to FP16 with optimized kernels. Source code is available at this https URL.
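The abstract's definition of EAR ("maximum token-agreement probability under optimal coupling") matches the standard maximal-coupling identity: for two next-token distributions p and q, the best achievable agreement probability is the sum over tokens of min(p(x), q(x)), which equals 1 minus the total variation distance. A minimal sketch under that reading (the function name and the toy four-token vocabulary are illustrative, not from the paper):

```python
import numpy as np

def expected_acceptance_rate(p: np.ndarray, q: np.ndarray) -> float:
    # Maximal-coupling agreement probability between two categorical
    # distributions: sum_x min(p(x), q(x)) = 1 - TV(p, q).
    # (Illustrative helper; how the paper aggregates this across
    # positions or prompts is an assumption here.)
    return float(np.minimum(p, q).sum())

# Toy next-token distributions over a 4-token vocabulary.
p = np.array([0.70, 0.20, 0.08, 0.02])  # original model
q = np.array([0.69, 0.21, 0.08, 0.02])  # quantized model
print(expected_acceptance_rate(p, q))   # 0.99 -> "99% agreement"
```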

Top-level tags: llm, model compression, efficiency
Detailed tags: quantization, lossless compression, distribution preservation, inference speedup, fidelity metric

Statistically-Lossless Quantization of Large Language Models


1️⃣ One-sentence summary

This paper proposes a quantization method named SLQ that introduces three definitions of "losslessness" of varying strictness (task-lossless, distribution-lossless, and statistically-lossless) and realizes them with asymmetric quantization and a wide bitwidth search, compressing models to as low as 3.3 bits per parameter while keeping the output distribution nearly identical to the original model's and delivering 1.7 to 3.6× inference speedups.
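The γ² variance law is consistent with the textbook uniform-noise model of rounding error (variance ≈ Δ²/12 for step size Δ): if γ denotes the ratio of the symmetric step to the asymmetric (min-max) step for the same tensor and bitwidth, symmetric quantization's error variance comes out roughly γ² times larger. A minimal numerical check under that reading of γ (this definition is an assumption, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
b = 6  # bitwidth; the Delta^2/12 noise model gets tighter as b grows
# Skewed weights, so the min-max range is far from symmetric about zero.
w = rng.normal(loc=0.3, scale=0.1, size=1_000_000)

# Asymmetric (min-max) uniform quantization.
lo, hi = w.min(), w.max()
step_a = (hi - lo) / (2**b - 1)
w_a = np.round((w - lo) / step_a) * step_a + lo

# Symmetric uniform quantization about zero.
step_s = np.abs(w).max() / (2**(b - 1) - 1)
w_s = np.round(w / step_s) * step_s  # no clipping needed: |w| <= max|w|

gamma = step_s / step_a  # assumed reading of gamma: step-size ratio
print(f"gamma^2        = {gamma**2:.2f}")
print(f"variance ratio = {np.var(w_s - w) / np.var(w_a - w):.2f}")
```

With skewed weights the two printed numbers agree closely; for weights already symmetric about zero, γ ≈ 1 and the two schemes coincide, which is in line with the abstract's point that asymmetry matters for distribution-level fidelity.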

Source: arXiv: 2605.02404