DiRL: An Efficient Post-Training Framework for Diffusion Language Models
1️⃣ One-Sentence Summary
This paper introduces DiRL, an efficient post-training framework that integrates optimized training and inference techniques to substantially improve the performance of diffusion language models on complex mathematical reasoning tasks, surpassing comparable models.
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference speeds, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.
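For context, DiPO builds on Group Relative Policy Optimization. The standard GRPO formulation (the specific unbiased adaptation to dLLM likelihoods is not detailed in this excerpt, so the following is the generic objective, not DiPO itself) samples a group of $G$ responses per prompt and replaces a learned value function with group-normalized advantages:

```latex
% Group-relative advantage for response i with scalar reward r_i:
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

% Clipped surrogate objective with a KL penalty toward a reference policy:
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\!\left[
\frac{1}{G}\sum_{i=1}^{G}
\min\!\Big(\rho_i(\theta)\,\hat{A}_i,\;
\operatorname{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)
\right]
- \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
\qquad
\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
```

The bias the paper targets presumably arises because $\pi_\theta(o_i \mid q)$ is not directly tractable for diffusion decoding; DiPO's contribution, per the abstract, is an estimator that keeps this objective unbiased for dLLMs.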
Source: arXiv:2512.22234