Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery
1️⃣ One-Sentence Summary
This paper proposes quantization-aware distillation (QAD), a method that effectively and stably transfers the "knowledge" of a high-precision large model into a compressed 4-bit quantized model, allowing the model to run efficiently on resource-constrained hardware with almost no loss of accuracy.
This technical report presents quantization-aware distillation (QAD) and our best practices for recovering the accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe two key advantages of QAD for today's LLMs: (1) it shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; and (2) it is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models, including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.
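The core QAD loss described above can be sketched numerically: run the full-precision teacher and a fake-quantized student on the same inputs, then minimize the KL divergence between their output distributions. The sketch below is a minimal illustration with assumptions not in the report: a single linear layer stands in for the model, and a symmetric integer 4-bit fake-quantizer (`fake_quant_4bit`) stands in for NVFP4, which actually uses an FP4 format with block scaling. In real training, gradients would flow through the quantizer via a straight-through estimator.

```python
import numpy as np

def fake_quant_4bit(w):
    # Symmetric per-tensor fake quantization to the signed 4-bit
    # integer range [-7, 7]. (Simplification: NVFP4 is an FP4
    # format with fine-grained block scaling, not integer quant.)
    scale = np.abs(w).max() / 7.0
    if scale == 0.0:
        return w
    return np.clip(np.round(w / scale), -7, 7) * scale

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def qad_kl_loss(x, w_teacher):
    # Teacher forward pass uses full-precision weights; the student
    # shares the same weights but sees them fake-quantized to 4 bit.
    teacher_logits = x @ w_teacher
    student_logits = x @ fake_quant_4bit(w_teacher)
    p = softmax(teacher_logits)  # teacher distribution (target)
    q = softmax(student_logits)  # quantized-student distribution
    # KL(teacher || student), averaged over the batch.
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

Because the only mismatch between teacher and student here is the quantization error of the weights, the loss directly measures how far 4-bit rounding pushes the output distribution, which is exactly the gap QAD training drives toward zero.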
Source: arXiv:2601.20088