arXiv submission date: 2026-06-03
📄 Abstract - STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models

Diffusion large language models (DLLMs) have recently emerged as a promising alternative to autoregressive LLMs by generating text through iterative masked denoising with bidirectional context. However, their large model sizes and iterative denoising process introduce substantial memory and computational overhead, motivating post-training quantization for efficient deployment. In this paper, we identify two key challenges for low-bit DLLM quantization: state-dependent activation disparity and temporal error accumulation. Masked and unmasked tokens exhibit different activation distributions within each denoising step, while quantization errors can accumulate across steps during iterative decoding. To address these challenges, we propose STaR-Quant, a state-time consistent PTQ framework for DLLMs. STaR-Quant introduces State-Guided Activation Transformation (SGAT) to assign masked and unmasked tokens to different activation transformation spaces with a unified static weight-side transformation. It further introduces Temporal Attention Compensation (TAC) to correct the quantized attention representation via a lightweight block-diagonal affine mapping. Experiments on representative DLLMs demonstrate that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines, while delivering up to 1.69x speedup and 3.14x memory saving over FP16 deployment.
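The state-dependent activation disparity described in the abstract can be illustrated with a small numerical sketch. This is not the paper's SGAT implementation: it only shows, on synthetic data, why a single shared activation quantizer can serve one token state poorly when masked and unmasked tokens have very different ranges. The activation statistics, the 4-bit symmetric quantizer, and the helper names `absmax_scale` and `quantize` are all assumptions made for illustration.

```python
import numpy as np

def absmax_scale(x, bits=4):
    # Symmetric per-tensor scale chosen from the largest magnitude (assumed scheme).
    qmax = 2 ** (bits - 1) - 1
    return np.abs(x).max() / qmax

def quantize(x, scale, bits=4):
    # Fake-quantize: round to the integer grid, clip, then map back to floats.
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)

# Synthetic activations: 64 "masked" tokens with a narrow range and
# 64 "unmasked" tokens with a wide range (hypothetical statistics).
masked = rng.normal(0.0, 0.1, size=(64, 128))
unmasked = rng.normal(0.0, 2.0, size=(64, 128))
acts = np.concatenate([masked, unmasked])
is_masked = np.arange(acts.shape[0]) < 64

# Baseline: one scale shared by all tokens.
err_shared = np.mean((quantize(acts, absmax_scale(acts)) - acts) ** 2)

# State-aware variant: one scale per token state.
deq = np.empty_like(acts)
for state in (True, False):
    sel = is_masked == state
    deq[sel] = quantize(acts[sel], absmax_scale(acts[sel]))
err_split = np.mean((deq - acts) ** 2)

print(f"shared-scale MSE: {err_shared:.6f}")
print(f"state-aware MSE:  {err_split:.6f}")
```

In this toy setting the shared scale is dictated by the wide-range tokens, so the narrow-range tokens collapse onto very few quantization levels. Giving each state its own scale removes most of that error. The framework in the abstract goes further by using separate activation transformation spaces with a unified static weight-side transformation, which this sketch does not model.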

Top-level tags: llm, model training, model evaluation
Detailed tags: diffusion model, post-training quantization, low-bit quantization, attention compensation, efficient deployment

STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models


1️⃣ One-sentence summary

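This paper proposes STaR-Quant, a post-training quantization method for diffusion large language models that treats the differing activation distributions of masked and unmasked tokens separately and compensates for quantization error accumulating across denoising steps, improving low-bit weight-activation quantization over strong PTQ baselines with reported gains of up to 1.69x speedup and 3.14x memory saving over FP16 deployment.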

Source: arXiv: 2606.04945