Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT
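1️⃣ One-sentence summary
This paper proposes a cascaded multi-granularity pruning method that removes layers, attention heads, and feed-forward channels in coarse-to-fine order, re-estimating component importance between stages with lightweight low-rank recovery. The method substantially compresses large language models for industrial IoT edge devices and reveals that architectures differ in how well they tolerate a given pruning strategy.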
Deploying large language models (LLMs) on Industrial Internet of Things (IIoT) edge devices demands extreme compression, yet existing structured pruning methods collapse at high compression ratios due to one-shot importance estimation, and their cross-architecture behavior remains unpredictable. This article presents a cascaded multi-granularity pruning framework that removes layers, attention heads, and feed-forward channels in coarse-to-fine order, with lightweight low-rank recovery between stages to re-estimate component importance. An information-theoretic analysis motivates this ordering, and the Structural Independence Assumption (SIA) is formalized as a checkable condition predicting whether per-component pruning criteria are reliable for a given architecture: Multi-Head Attention (MHA)+GELU designs satisfy the SIA, whereas Grouped Query Attention (GQA)+SwiGLU designs violate it. On bearing fault diagnosis spanning 88M to 6.25B-parameter models, the framework extends achievable compression to 13.8 times on MHA+GELU architectures with 83.82% accuracy (+3.70 percentage points (pp) over the strongest baseline), while exposing a ~74pp accuracy collapse on GQA+SwiGLU architectures that violate the SIA. Deployed on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark, compressed models reduce inference latency by up to 67.2% and peak memory by 62.5%, demonstrating viability for IIoT edge inference.
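The coarse-to-fine control flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy model layout, `keep_top`, `cascaded_prune`, `toy_importance`, and `no_op_recover` are all hypothetical names, the importance scorer is a placeholder, and the recovery step is a no-op standing in for the low-rank recovery the paper performs between stages. The only point it makes is structural: importance is re-estimated on the current (already pruned and recovered) model before each finer stage, rather than once up front.

```python
from typing import Callable, Dict, Hashable, List

# Hypothetical toy layout: layer name -> {"heads": [ids], "channels": [ids]}.
Model = Dict[str, Dict[str, List[int]]]


def keep_top(items: List, score: Callable[[Hashable], float], keep_ratio: float) -> List:
    """Keep the highest-scoring fraction of items, preserving their original order."""
    n_keep = max(1, int(len(items) * keep_ratio))
    kept = set(sorted(items, key=score, reverse=True)[:n_keep])
    return [x for x in items if x in kept]


def cascaded_prune(
    model: Model,
    keep: Dict[str, float],
    importance: Callable[[Model, str], Dict],
    recover: Callable[[Model], Model],
) -> Model:
    """Prune layers, then heads, then channels, rescoring after each recovery."""
    # Stage 1 (coarse): score and drop whole layers.
    layer_scores = importance(model, "layer")  # {layer_name: score}
    kept_layers = keep_top(list(model), layer_scores.__getitem__, keep["layer"])
    model = {name: model[name] for name in kept_layers}
    model = recover(model)  # stand-in for low-rank recovery

    # Stages 2 and 3 (finer): heads, then channels, scored on the current model.
    for part in ("heads", "channels"):
        scores = importance(model, part)  # {(layer_name, index): score}
        model = {
            name: {
                **parts,
                part: keep_top(parts[part], lambda i, n=name: scores[(n, i)], keep[part]),
            }
            for name, parts in model.items()
        }
        model = recover(model)
    return model


def toy_importance(model: Model, granularity: str) -> Dict:
    """Placeholder scorer (assumption): later layers / higher indices score higher."""
    if granularity == "layer":
        return {name: float(pos) for pos, name in enumerate(model)}
    return {(name, i): float(i) for name, parts in model.items() for i in parts[granularity]}


def no_op_recover(model: Model) -> Model:
    """Placeholder for the recovery step; a real pipeline would fine-tune here."""
    return model


toy = {f"layer{k}": {"heads": list(range(4)), "channels": list(range(8))} for k in range(4)}
pruned = cascaded_prune(
    toy,
    {"layer": 0.5, "heads": 0.5, "channels": 0.25},
    toy_importance,
    no_op_recover,
)
print(pruned)
```

Passing the scorer and the recovery step as callbacks keeps the cascade order separate from any particular importance criterion, which is where the abstract's Structural Independence Assumption would matter in practice: a per-component scorer is only trustworthy when that assumption holds for the architecture being pruned.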
Source: arXiv: 2606.26861