arXiv submission date: 2025-12-05
📄 Abstract - SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs

Post-training quantization (PTQ) plays a crucial role in the democratization of large language models (LLMs). However, existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency due to limited hardware support. For example, W4A8 can only achieve the same peak TOPS as W8A8, whereas the GPU-supported sparse data format (2:4 semi-structured sparsity) is seldom adopted due to the loss of accuracy. To bridge this gap, in this paper we propose the Sparse-Quantized Format (SQ-format), a unified data format for quantization and sparsification that can potentially be supported with little effort by new hardware and existing GPUs. SQ-format exploits the fact that sparse matrices can be accelerated at high precision, and dense low-precision matrix multiplication can be accelerated as well. As such, SQ-format is proposed to achieve a Pareto improvement between performance and throughput. This format is particularly suitable for activations with unevenly distributed outliers and makes their static compression possible. We show state-of-the-art PTQ performance with SQ-format, propose the hardware required to support it, and further offer design exploration and insights for next-generation AI accelerators.

Top-level tags: llm model training systems
Detailed tags: quantization sparse computation hardware acceleration efficient inference post-training quantization

SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs


1️⃣ One-sentence summary

This paper proposes a new hardware-friendly data format called SQ-format, which quantizes activations or weights into a sparse high-precision component and a dense low-precision component. This preserves model accuracy while significantly increasing inference throughput, achieving a Pareto improvement between accuracy and efficiency.
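To make the idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of how such a decomposition and the resulting two-part matrix multiplication could look. The function names `sq_decompose` and `sq_matmul`, the `outlier_ratio` parameter, the unstructured magnitude-based outlier selection, and the per-tensor int8 quantization are all assumptions made for illustration; the actual SQ-format targets hardware-friendly structured sparsity (e.g., 2:4) and dedicated sparse/low-bit compute units.

```python
# Illustrative sketch of the sparse-high-precision + dense-low-precision split
# described above. This is NOT the paper's algorithm; it only demonstrates the
# general decomposition idea with plain numpy.
import numpy as np

def sq_decompose(x: np.ndarray, outlier_ratio: float = 0.05):
    """Split x into a sparse high-precision part and a dense int8-quantized part."""
    # Keep the largest-magnitude entries at high precision (the "outliers").
    k = max(1, int(outlier_ratio * x.size))
    threshold = np.partition(np.abs(x).ravel(), -k)[-k]
    outlier_mask = np.abs(x) >= threshold

    sparse_hp = np.where(outlier_mask, x, 0.0)   # sparse, kept in full precision
    residual = np.where(outlier_mask, 0.0, x)    # dense, small-magnitude entries

    # Symmetric per-tensor int8 quantization of the residual (an assumption here).
    scale = np.abs(residual).max() / 127.0 + 1e-12
    dense_int8 = np.clip(np.round(residual / scale), -127, 127).astype(np.int8)
    return sparse_hp, dense_int8, scale

def sq_matmul(sparse_hp, dense_int8, scale, w: np.ndarray):
    """y = x @ w, computed as a sparse high-precision product plus a dense low-precision product."""
    y_sparse = sparse_hp @ w                                # would run on sparse HP units
    y_dense = (dense_int8.astype(np.float32) * scale) @ w   # would run on int8 tensor cores
    return y_sparse + y_dense

# Toy usage: the two partial products closely recover the full-precision result.
x = np.random.randn(4, 64).astype(np.float32) * np.random.choice(
    [1.0, 8.0], size=(4, 64), p=[0.95, 0.05])               # a few large "outlier" entries
w = np.random.randn(64, 32).astype(np.float32)
s_hp, d_i8, sc = sq_decompose(x)
err = np.abs(sq_matmul(s_hp, d_i8, sc, w) - x @ w).max()
print(f"max abs error vs fp32 matmul: {err:.4f}")
```

In a real kernel the sparse high-precision product would be dispatched to sparse tensor units and the quantized product to int8/int4 tensor cores; the numpy version above only shows that the two partial results together approximate the full-precision matmul.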


2️⃣ Key contributions

1. SQ-format: a unified sparse-quantized data format

2. A post-training quantization algorithm built on SQ-format

3. Hardware-algorithm co-design and feasibility validation

4. A Pareto improvement in the accuracy-throughput trade-off


3️⃣ Main results and value

Result highlights

Practical value


4️⃣ Glossary

Source: arXiv:2512.05409