CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

📄 Abstract

In this paper, we present CAT-Q (Cost-efficient and Accurate Ternary Quantization), a method for compressing and accelerating LLMs. Unlike existing state-of-the-art ternary quantization methods, which rely on data-intensive and costly quantization-aware training to mitigate severe performance degradation, CAT-Q is a simple yet effective post-training quantization scheme that is readily applicable to LLMs with diverse architectures and model sizes. It has two key components, learnable modulation (LM) and softened ternarization (ST), which are coupled from an optimization perspective. LM uses a composition of learnable factors to modulate the distribution of the pre-trained high-precision weights and the ternary threshold, making them less sensitive to ternarization. ST introduces a differentiable transition function that guides the ternarization process toward stable convergence. We show that CAT-Q can efficiently quantize pre-trained LLMs with 1.7B to 8B parameters into ternary models using only 512 calibration samples, while outperforming the seminal BitNet 1.58-bit v1 and v2 families (with 1.3B to 7B parameters), which were trained with 100B tokens; this amounts to a reduction of about 100,000x in training tokens. Moreover, we show for the first time that CAT-Q can quantize much larger pre-trained LLMs, with 14B to 235B parameters, into leading ternary models within just 8 to 60 hours on 8 A100-80GB GPUs. Code is available at this https URL.
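To make the idea of thresholded ternarization and a differentiable transition more concrete, here is a minimal NumPy sketch. It is not the CAT-Q implementation: the abstract does not give the actual modulation factors or transition function, so the function names, the tanh-based surrogate, the `temperature` parameter, and the mean-magnitude scale `alpha` are all assumptions chosen for illustration.

```python
import numpy as np


def hard_ternarize(w, threshold):
    """Map weights to {-1, 0, +1}: zero where |w| <= threshold, sign(w) elsewhere."""
    return np.sign(w) * (np.abs(w) > threshold)


def soft_ternarize(w, threshold, temperature):
    """Hypothetical smooth surrogate for hard_ternarize (an assumption, not the paper's ST).

    The sum of two shifted tanh steps is differentiable everywhere and
    approaches the hard staircase as temperature -> 0.
    """
    return 0.5 * (np.tanh((w - threshold) / temperature)
                  + np.tanh((w + threshold) / temperature))


def ternary_scale(w, threshold):
    """Illustrative dequantization scale: mean |w| over the weights kept as +/-1."""
    mask = np.abs(w) > threshold
    if not mask.any():
        return 0.0
    return float(np.abs(w[mask]).mean())


# Toy weights and an arbitrary illustrative threshold.
weights = np.array([-1.0, -0.2, 0.0, 0.3, 0.9])
threshold = 0.5

ternary = hard_ternarize(weights, threshold)
alpha = ternary_scale(weights, threshold)
approx = alpha * ternary  # ternary approximation of the original weights

# A small temperature makes the soft version nearly match the hard one.
soft = soft_ternarize(weights, threshold, temperature=0.01)
```

In a scheme of this kind, the smooth version would be used while the threshold and any scaling factors are being optimized, with the temperature gradually reduced so that the final weights are exactly ternary. Whether CAT-Q follows this schedule is not stated in the abstract.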


1️⃣ One-sentence summary

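CAT-Q is a lightweight post-training quantization technique for large language models that needs only 512 calibration samples to compress a model into a ternary version. While preserving performance, it uses roughly 100,000x fewer training tokens than comparable ternary methods, and it scales efficiently to very large models with up to 235 billion parameters.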
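The roughly 100,000x figure can be sanity-checked with simple arithmetic. The sketch below takes the 512 calibration samples and the 100B training tokens from the abstract; the number of tokens per calibration sample is not stated there, so the value of 2048 is purely an assumption, and the result shifts proportionally with it.

```python
CALIBRATION_SAMPLES = 512                 # from the abstract
BASELINE_TRAINING_TOKENS = 100_000_000_000  # 100B tokens, from the abstract
ASSUMED_TOKENS_PER_SAMPLE = 2048          # assumption: not given in the abstract


def token_reduction(training_tokens, samples, tokens_per_sample):
    """Ratio of baseline training tokens to total calibration tokens."""
    return training_tokens / (samples * tokens_per_sample)


ratio = token_reduction(
    BASELINE_TRAINING_TOKENS, CALIBRATION_SAMPLES, ASSUMED_TOKENS_PER_SAMPLE
)
print(f"{ratio:,.0f}x")  # prints "95,367x"
```

Under this assumed sequence length, the calibration set holds about one million tokens, which is consistent with the order of magnitude quoted above.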
