Abstract - Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, enabling the dense model to be upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.
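The routing rule described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the weight names (`P_in`, `P_out`, `W_gate`, `W_up`, `W_down`), and the choice to apply the selection as a hard 0/1 mask on the SwiGLU intermediate activation are all assumptions made for illustration. The sketch also computes the dense activation and then masks it, whereas a hardware-oriented kernel would presumably compute only the retained channels.

```python
import numpy as np

def bankwise_topk_mask(logits, bank_size=64, keep=16):
    """Return a 0/1 mask that keeps the `keep` largest logits in every bank."""
    *lead, d_ff = logits.shape
    assert d_ff % bank_size == 0, "d_ff must be a multiple of bank_size"
    banks = logits.reshape(*lead, d_ff // bank_size, bank_size)
    # argpartition places the `keep` largest entries of each bank in the last slots.
    kth = bank_size - keep
    top_idx = np.argpartition(banks, kth, axis=-1)[..., kth:]
    mask = np.zeros_like(banks)
    np.put_along_axis(mask, top_idx, 1.0, axis=-1)
    return mask.reshape(*lead, d_ff)

def silu(x):
    return x / (1.0 + np.exp(-x))

def sparse_swiglu_ffn(x, W_gate, W_up, W_down, P_in, P_out,
                      bank_size=64, keep=16):
    """Predictor-gated SwiGLU FFN (illustrative; hard masking is an assumption)."""
    # Low-rank predictor: d_model -> r -> d_ff routing logits per token.
    logits = (x @ P_in) @ P_out
    mask = bankwise_topk_mask(logits, bank_size, keep)
    # Dense SwiGLU intermediate activation, then zero the unselected channels.
    h = silu(x @ W_gate) * (x @ W_up)
    return (h * mask) @ W_down, mask
```

With `bank_size=64` and `keep=16`, every bank contributes exactly 16 active channels, so a quarter of the intermediate channels survive for each token, which matches the 4x sparsity stated above.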
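1️⃣ One-sentence summary
This paper proposes a method that uses continual training to convert an existing dense large language model (e.g., Qwen2.5-8B) into a compute-efficient sparse model: only a small number of channels are activated in each processing unit, which substantially reduces computation, while a lightweight predictor module dynamically decides which channels are activated; the authors also identify and repair a specific failure that appears when the model processes very long texts.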