arXiv submission date: 2026-04-06
📄 Abstract - MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition

Large language models (LLMs) have achieved outstanding performance across a wide range of natural language processing tasks, but their enormous parameter counts impose substantial memory and computational overheads. This challenge is particularly critical in NPU-based on-device environments, where FP16/FP32 computation is inefficient and integer (INT) quantization is therefore essential. However, existing methods, including ZeroQuant, LLM.int8(), and SmoothQuant, do not fully address input-activation outliers and the associated hardware inefficiencies. To overcome these limitations, we propose MUXQ (Mixed-to-Uniform Quantization). MUXQ detects outlier channels in input activations and introduces a small auxiliary matrix that redistributes outlier magnitudes across channels, thereby alleviating the outlier problem. This enables even activation outliers to be quantized at low-precision INT levels while preserving a hardware-friendly computation structure. Experiments on GPT-2 models at three scales (0.1B, 0.3B, and 0.7B parameters) using the WikiText-2 dataset show that MUXQ consistently achieves lower perplexity than naive quantization. In particular, under per-tensor quantization, MUXQ quantizes both activations and weights to INT8 while maintaining accuracy close to that of FP16. With only modest computational overhead, MUXQ enables stable low-precision inference and can be readily combined with other quantization techniques. These results suggest that MUXQ provides a promising direction for efficient and accurate LLM inference on edge devices.
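The abstract's core idea can be illustrated with a minimal sketch: under per-tensor INT8 quantization, a few outlier activation channels inflate the quantization scale and drown out the signal in well-behaved channels; separating the outlier channels into a low-rank side component before quantizing the residual restores accuracy. The code below is a simplified illustration under assumed details (the function names, the max-abs outlier criterion, and symmetric per-tensor quantization are our choices, not the paper's exact formulation; MUXQ's auxiliary redistribution matrix is not reproduced here).

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor INT8 quantization: one scale for the whole tensor.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def outlier_decompose(X, k_outliers=2):
    # Hypothetical outlier split (not the paper's exact method):
    # pick the k channels with the largest max-abs activation,
    # carry them in a sparse low-rank-style component L, and
    # quantize only the outlier-free residual R.
    ch_max = np.abs(X).max(axis=0)
    outlier_idx = np.argsort(ch_max)[-k_outliers:]
    L = np.zeros_like(X)
    L[:, outlier_idx] = X[:, outlier_idx]
    R = X - L
    return L, R, outlier_idx

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16)).astype(np.float32)
X[:, 3] *= 50.0  # inject one outlier channel, as observed in LLM activations

# Naive: the outlier channel dictates the scale, hurting all other channels.
q_full, s_full = quantize_int8(X)
naive_err = np.abs(dequantize(q_full, s_full) - X).mean()

# Decomposed: residual quantizes with a much smaller scale.
L, R, idx = outlier_decompose(X, k_outliers=1)
q_res, s_res = quantize_int8(R)
split_err = np.abs((L + dequantize(q_res, s_res)) - X).mean()

print(f"naive INT8 error: {naive_err:.4f}, with outlier split: {split_err:.4f}")
```

The decomposed path keeps the outlier channels in higher precision while everything else flows through a uniform INT8 kernel, which is the "mixed-to-uniform" trade-off the abstract describes.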

Top-level tags: llm model training systems
Detailed tags: quantization low-rank decomposition efficient inference edge devices outlier handling

MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition


1️⃣ One-sentence summary

This paper proposes a new method called MUXQ that identifies extreme outlier values in a neural network's activations and redistributes them, successfully compressing both the weights and activations of large language models to low-precision integer formats. This preserves accuracy while greatly reducing the memory and compute required to run the model on edge devices such as phones.

Source: arXiv: 2604.04701