Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT

Abstract

Deploying large language models (LLMs) on Industrial Internet of Things (IIoT) edge devices demands extreme compression, yet existing structured pruning methods collapse at high compression ratios because they estimate importance in one shot, and their behavior across architectures remains unpredictable. This article presents a cascaded multi-granularity pruning framework that removes layers, attention heads, and feed-forward channels in coarse-to-fine order, with lightweight low-rank recovery between stages so that component importance can be re-estimated. An information-theoretic analysis motivates this ordering, and the Structural Independence Assumption (SIA) is formalized as a checkable condition that predicts whether per-component pruning criteria are reliable for a given architecture: Multi-Head Attention (MHA)+GELU designs satisfy the SIA, whereas Grouped Query Attention (GQA)+SwiGLU designs violate it. On bearing fault diagnosis with models ranging from 88M to 6.25B parameters, the framework extends achievable compression to 13.8× on MHA+GELU architectures at 83.82% accuracy, 3.70 percentage points (pp) above the strongest baseline, while exposing an accuracy collapse of roughly 74 pp on GQA+SwiGLU architectures that violate the SIA. Deployed on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark, the compressed models reduce inference latency by up to 67.2% and peak memory by 62.5%, demonstrating viability for IIoT edge inference.

One-Sentence Summary

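This paper proposes a cascaded multi-granularity pruning method that removes layers, attention heads, and feed-forward channels in coarse-to-fine order and uses lightweight low-rank recovery between stages to re-estimate component importance. The method substantially compresses large language models for Industrial IoT edge devices, and it also reveals that different architectures differ in how well they tolerate the pruning strategy.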
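The coarse-to-fine cascade described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `ToyModel`, `cascaded_prune`, `score_fn`, `recover_fn`, and the per-stage `keep_ratio` are hypothetical names introduced here, the model is a toy that only tracks component ids, and running recovery after every stage (including the last) is an assumption. The sketch shows only the control flow: importance is re-scored at each stage on the model left by the previous stage, rather than once up front.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Coarse-to-fine order taken from the abstract: layers, then heads, then FFN channels.
GRANULARITIES = ("layer", "head", "channel")


@dataclass
class ToyModel:
    # Hypothetical stand-in for a real network: for each granularity,
    # the ids of the components that are still present.
    components: Dict[str, List[int]]


def cascaded_prune(
    model: ToyModel,
    score_fn: Callable[[ToyModel, str], Dict[int, float]],
    recover_fn: Callable[[ToyModel, str], None],
    keep_ratio: Dict[str, float],
) -> ToyModel:
    """Prune one granularity at a time, recovering between stages."""
    for granularity in GRANULARITIES:
        # Importance is estimated on the current model, i.e. after all
        # earlier pruning and recovery, not in a single one-shot pass.
        scores = score_fn(model, granularity)
        ids = model.components[granularity]
        n_keep = max(1, round(len(ids) * keep_ratio[granularity]))
        ranked = sorted(ids, key=lambda i: scores[i], reverse=True)
        model.components[granularity] = sorted(ranked[:n_keep])
        # Placeholder for lightweight low-rank recovery (assumed to be a
        # short adapter fine-tuning step in a real system).
        recover_fn(model, granularity)
    return model
```

In a real setting, `score_fn` would compute a per-component importance criterion and `recover_fn` would run the low-rank recovery step; both are left as callbacks here because the abstract does not specify them.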
