Looped Diffusion Language Models
1️⃣ One-Sentence Summary
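This paper proposes LoopMDM, a method that selectively loops the early-to-middle Transformer layers of a masked diffusion language model to obtain a depth-scaling effect without adding parameters, improving training efficiency (up to 3.3× fewer training FLOPs) and reasoning performance (up to 8.5 points on reasoning benchmarks such as GSM8K), with adaptive adjustment of the loop count further improving compute efficiency.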
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models for language modeling, yet the effective design of transformer architectures for MDMs remains underexplored. In this paper, we show that selectively looping the early-middle transformer layers significantly improves both training efficiency and model performance in MDMs. We call this approach LoopMDM (Looped Masked Diffusion Model). It brings two key benefits: looping layers at training time yields a depth-scaling effect without adding parameters, while varying the number of loops at inference time enables flexible compute scaling. Despite its simplicity, the results are striking: across multiple pre-training corpora, LoopMDM matches the performance of same-size MDMs with up to 3.3× fewer training FLOPs, and its final performance exceeds theirs on various reasoning benchmarks, by up to 8.5 points on GSM8K. It even surpasses deeper non-looped MDMs trained with comparable per-step compute, indicating that selective looping is more effective than naive depth scaling. Furthermore, LoopMDM can scale inference-time compute by increasing the number of loops, and adaptively adjusting the number of loops throughout the sampling process yields additional gains in compute efficiency while maintaining performance. Lastly, through attention analysis, we provide evidence that looping is effective in MDMs because it promotes interactions among masked positions. Our code and weights will be publicly released.
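To make the idea of selective looping concrete, here is a minimal sketch in plain Python. It only illustrates the execution order of layers when a contiguous block is reused several times with shared weights. The function names (`looped_layer_order`, `forward`), the choice of looping layers 1 to 3 of a 6-layer stack, and the loop count of 3 are all hypothetical; the abstract does not state which layers LoopMDM loops or how many times.

```python
from typing import Callable, List, Sequence


def looped_layer_order(n_layers: int, loop_start: int, loop_end: int,
                       n_loops: int) -> List[int]:
    """Return the order in which layer indices are applied.

    Layers in [loop_start, loop_end) form the looped block and are applied
    n_loops times in a row; layers before and after run exactly once.
    The same index appearing repeatedly means the same weights are reused.
    """
    if not (0 <= loop_start < loop_end <= n_layers):
        raise ValueError("loop block must be a non-empty slice of the stack")
    if n_loops < 1:
        raise ValueError("n_loops must be at least 1")
    prefix = list(range(loop_start))            # layers run once, before the loop
    block = list(range(loop_start, loop_end))   # layers that are looped
    suffix = list(range(loop_end, n_layers))    # layers run once, after the loop
    return prefix + block * n_loops + suffix


def forward(x, layers: Sequence[Callable], order: Sequence[int]):
    """Apply the layers to x in the given order (toy stand-in for a model)."""
    for i in order:
        x = layers[i](x)
    return x


# Hypothetical configuration: 6 layers, loop layers 1..3 three times.
order = looped_layer_order(n_layers=6, loop_start=1, loop_end=4, n_loops=3)
effective_depth = len(order)      # number of layer applications per forward pass
unique_layers = len(set(order))   # number of distinct parameter sets

# Toy "layers": layer i adds i to its input, so the output is the sum of `order`.
toy_layers = [lambda x, i=i: x + i for i in range(6)]
toy_output = forward(0, toy_layers, order)
```

In this toy configuration the forward pass applies 12 layers while holding only 6 sets of weights, which is the sense in which looping deepens the computation without adding parameters. Changing `n_loops` at call time changes the compute per forward pass without touching the weights, which mirrors the inference-time scaling described in the abstract.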
Source: arXiv: 2605.26106