Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity
1️⃣ One-Sentence Summary
This paper proposes a new method for finetuning large language models via structured sparsification: instead of extensively adjusting the weights, it trains only a tiny number of parameters to adapt the model to new tasks, while also shrinking the model, speeding up inference, and outperforming today's mainstream finetuning techniques.
Fully finetuning foundation language models (LMs) with billions of parameters is often impractical due to high computational costs, memory requirements, and the risk of overfitting. Although methods like low-rank adapters help address these challenges by adding small trainable modules to the frozen LM, they also increase memory usage and do not reduce inference latency. We uncover an intriguing phenomenon: sparsifying specific model rows and columns enables efficient task adaptation without requiring weight tuning. We propose a scheme for effective finetuning via sparsification with trainable stochastic gates, which requires minimal trainable parameters, reduces inference time, and removes 20–40% of model parameters without significant accuracy loss. Empirical results show it outperforms recent finetuning baselines in efficiency and performance. Additionally, we provide theoretical guarantees for the convergence of this stochastic gating process, and show that our method admits a simpler and better-conditioned optimization landscape compared to LoRA. Our results highlight sparsity as a compelling mechanism for task-specific adaptation in LMs.
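The abstract only sketches the mechanism, but the core idea of training stochastic gates over rows or columns of frozen weight matrices can be illustrated compactly. The PyTorch sketch below is an assumption-laden stand-in, not the paper's implementation: the class name `GatedLinear`, the hard-concrete gate parameterization (in the style of Louizos et al.'s L0 regularization), and the hyperparameters `beta`, `gamma`, `zeta` are all hypothetical choices made for clarity, and it gates output rows only, whereas the paper sparsifies both rows and columns.

```python
import math
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    """Wraps a frozen nn.Linear and masks whole output rows with trainable
    stochastic gates (hard-concrete relaxation of Bernoulli gates; the
    paper's exact parameterization is not given in the abstract, so this
    is an illustrative stand-in)."""

    def __init__(self, linear: nn.Linear,
                 beta: float = 2.0 / 3.0, gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)  # backbone weights stay frozen
        # Only the gate logits are trained: one scalar per output row,
        # so the resulting sparsity is structured (row-wise).
        self.log_alpha = nn.Parameter(torch.zeros(linear.out_features))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def gate(self) -> torch.Tensor:
        if self.training:
            # Sample stochastic gates via the reparameterization trick.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1.0 - 1e-6)
            s = torch.sigmoid((u.log() - (1.0 - u).log() + self.log_alpha) / self.beta)
        else:
            # Deterministic gates at inference time.
            s = torch.sigmoid(self.log_alpha)
        # Stretch and clamp so gates can reach exactly 0 (prunable) or 1.
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * self.gate()  # zero out entire output rows

    def expected_open_gates(self) -> torch.Tensor:
        """Expected number of nonzero gates; add this to the task loss as a
        sparsity penalty to drive rows toward removal."""
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)).sum()
```

Under these assumptions, finetuning trains only one gate logit per row of each wrapped layer; after training, rows whose gates settle at zero can be physically deleted from the weight matrix, which is what would yield the smaller model and lower inference latency the abstract reports.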
Source: arXiv: 2602.09169