arXiv submission date: 2026-04-05
📄 Abstract - Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression

Modern deployment often requires trading accuracy for efficiency under tight CPU and memory constraints, yet common compression proxies such as parameter count or FLOPs do not reliably predict wall-clock inference time. In particular, unstructured sparsity can reduce model storage while failing to accelerate (and sometimes slightly slowing down) standard CPU execution due to irregular memory access and sparse kernel overhead. Motivated by this gap between compression and acceleration, we study a practical, ordered pipeline that targets measured latency by combining three widely used techniques: unstructured pruning, INT8 quantization-aware training (QAT), and knowledge distillation (KD). Empirically, INT8 QAT provides the dominant runtime benefit, while pruning mainly acts as a capacity-reduction pre-conditioner that improves the robustness of subsequent low-precision optimization; KD, applied last, recovers accuracy within the already constrained sparse INT8 regime without changing the deployment form. We evaluate on CIFAR-10/100 using three backbones (ResNet-18, WRN-28-10, and VGG-16-BN). Across all settings, the ordered pipeline achieves a stronger accuracy-size-latency frontier than any single technique alone, reaching 0.99-1.42 ms CPU latency with competitive accuracy and compact checkpoints. Controlled ordering ablations with a fixed 20/40/40 epoch allocation further confirm that stage order is consequential, with the proposed ordering generally performing best among the tested permutations. Overall, our results provide a simple guideline for edge deployment: evaluate compression choices in the joint accuracy-size-latency space using measured runtime, rather than proxy metrics alone.
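To make the ordered pipeline concrete, here is a minimal sketch of the Prune → Quantize (INT8 QAT) → Distill sequence using standard PyTorch eager-mode pruning and quantization utilities. This is not the authors' implementation: the TinyNet model, dummy_batch data, loop counts, sparsity ratio, distillation temperature, and loss weighting are illustrative placeholders, whereas the paper's experiments use ResNet-18, WRN-28-10, and VGG-16-BN on CIFAR-10/100 with a 20/40/40 epoch allocation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune
from torch.ao.quantization import (QuantStub, DeQuantStub,
                                   get_default_qat_qconfig, prepare_qat, convert)

class TinyNet(nn.Module):
    """Stand-in for the paper's backbones (ResNet-18, WRN-28-10, VGG-16-BN)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.quant = QuantStub()            # float -> int8 boundary at the input
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(8 * 32 * 32, num_classes)
        self.dequant = DeQuantStub()        # int8 -> float boundary at the logits

    def forward(self, x):
        x = self.relu(self.conv(self.quant(x)))
        return self.dequant(self.fc(torch.flatten(x, 1)))

def dummy_batch(n=8):
    # Placeholder for CIFAR-10-sized inputs; real training would use a DataLoader.
    return torch.randn(n, 3, 32, 32), torch.randint(0, 10, (n,))

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Hinton-style distillation loss: softened KL term plus hard cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * soft + (1 - alpha) * F.cross_entropy(student_logits, labels)

student, teacher = TinyNet(), TinyNet()     # the teacher would be a trained FP32 model
teacher.eval()
opt = torch.optim.SGD(student.parameters(), lr=1e-3)

# Stage 1 -- unstructured L1-magnitude pruning, then a short fine-tune with the mask active.
for m in student.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        prune.l1_unstructured(m, name="weight", amount=0.5)
for _ in range(2):                          # stand-in for the paper's pruning epochs
    x, y = dummy_batch()
    opt.zero_grad(); F.cross_entropy(student(x), y).backward(); opt.step()
for m in student.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        prune.remove(m, "weight")           # bake the zeros in (later stages do not re-enforce the mask here)

# Stage 2 -- INT8 quantization-aware training: fake-quant observers are inserted.
student.train()
student.qconfig = get_default_qat_qconfig("fbgemm")   # use "qnnpack" on ARM CPUs
prepare_qat(student, inplace=True)
opt = torch.optim.SGD(student.parameters(), lr=1e-3)  # re-create after the module swap
for _ in range(2):                          # stand-in for the paper's QAT epochs
    x, y = dummy_batch()
    opt.zero_grad(); F.cross_entropy(student(x), y).backward(); opt.step()

# Stage 3 -- knowledge distillation inside the already sparse, fake-quantized regime.
for _ in range(2):                          # stand-in for the paper's KD epochs
    x, y = dummy_batch()
    with torch.no_grad():
        t_logits = teacher(x)
    opt.zero_grad(); kd_loss(student(x), t_logits, y).backward(); opt.step()

# Export: replace fake-quant modules with real INT8 kernels for CPU inference.
student.eval()
int8_model = convert(student)
print(int8_model(dummy_batch(1)[0]).shape)  # torch.Size([1, 10])
```

The ordering constraint the sketch preserves is the one the abstract argues for: distillation runs last, on the model that is already sparse and fake-quantized, and the conversion to real INT8 kernels happens only at export time, so the deployment form never changes.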

Top-level tags: model training systems machine learning
Detailed tags: neural network compression pruning quantization knowledge distillation edge deployment

Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression


1️⃣ One-Sentence Summary

This paper proposes a pipeline that combines pruning, quantization, and knowledge distillation in a specific order to compress neural networks effectively: it substantially reduces model size and measured inference time while preserving accuracy, offering a practical guideline for efficiently deploying AI models on edge devices such as phones.
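Since the paper's central recommendation is to judge compression by measured runtime rather than parameter or FLOP proxies, a small timing helper makes that concrete. This is a generic sketch, not the paper's benchmarking code; the function name, warmup and iteration counts, and input shape are illustrative assumptions.

```python
import time
import torch

def measure_cpu_latency_ms(model, input_shape=(1, 3, 32, 32), warmup=20, iters=200):
    """Average wall-clock milliseconds per forward pass on CPU."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):            # warm up allocator, caches, and kernels
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0

# Usage: time the FP32 baseline and the compressed model on identical inputs, e.g.
#   measure_cpu_latency_ms(fp32_model)  vs.  measure_cpu_latency_ms(int8_model)
```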

Source: arXiv 2604.04988