GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets
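1️⃣ One-sentence summary
This paper proposes GAMMA, a framework that automatically assigns the most suitable precision (bit-width) to each module without retraining the large model, maximizing model performance under a given memory budget; a single learning run can then be adapted quickly to many deployment scenarios.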
Mixed-precision quantization improves the budget–accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B–32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.
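The projection step and the score-reuse property described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `allocate_bits`, the toy `scores` and `sizes`, and the bit options are all hypothetical, and an exact dynamic program over a multiple-choice knapsack stands in for whatever integer-programming solver the paper uses. The point is only that one fixed set of per-module preferences can be re-projected onto different budgets without relearning anything.

```python
from typing import Dict, Sequence, Tuple


def allocate_bits(
    scores: Sequence[Dict[int, float]],
    sizes: Sequence[int],
    budget_bits: int,
) -> Tuple[int, ...]:
    """Choose one bit-width per module, maximizing total score within a budget.

    scores[i][b] is a hypothetical learned preference for running module i at
    b bits; sizes[i] is the module's parameter count (toy units here).
    The cost of a choice is b * sizes[i]; the total must not exceed budget_bits.
    """
    # best maps "bits used so far" -> (total score, chosen bit-widths).
    best: Dict[int, Tuple[float, Tuple[int, ...]]] = {0: (0.0, ())}
    for module_scores, size in zip(scores, sizes):
        nxt: Dict[int, Tuple[float, Tuple[int, ...]]] = {}
        for used, (value, assignment) in best.items():
            for bits, score in module_scores.items():
                cost = used + bits * size
                if cost > budget_bits:
                    continue  # this choice would break the budget
                candidate = (value + score, assignment + (bits,))
                if cost not in nxt or candidate[0] > nxt[cost][0]:
                    nxt[cost] = candidate
        best = nxt
    if not best:
        raise ValueError("no assignment fits within the budget")
    # Return the highest-scoring assignment among all feasible totals.
    return max(best.values(), key=lambda entry: entry[0])[1]


# Toy preferences (illustrative only, not results from the paper).
scores = [
    {2: 0.0, 3: 0.5, 4: 0.6},  # module 0: mild gain from extra bits
    {2: 0.0, 3: 2.0, 4: 3.0},  # module 1: highly sensitive
    {2: 0.0, 3: 0.2, 4: 0.3},  # module 2: nearly insensitive
]
sizes = [4, 2, 2]  # parameter counts in toy units, 8 in total

# The same scores are reused for two budgets; only the solve is repeated.
plan_3bit = allocate_bits(scores, sizes, budget_bits=3 * sum(sizes))
plan_2p5bit = allocate_bits(scores, sizes, budget_bits=int(2.5 * sum(sizes)))
print(plan_3bit, plan_2p5bit)
```

In this toy setting the sensitive module keeps 4 bits under both budgets, and the tighter budget is met by lowering the less sensitive ones. At real model scale the table-based search would be replaced by a proper integer-programming solver, since the number of distinct cost totals grows quickly.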
Source: arXiv: 2605.18475