LLaDA2.0: Scaling Up Diffusion Language Models to 100B
1️⃣ One-Sentence Summary
This paper presents LLaDA2.0, a method that efficiently converts existing large auto-regressive language models into diffusion models with up to 100B parameters, enabling parallel decoding and faster inference while maintaining strong performance; model variants suited for practical deployment are open-sourced.
This paper presents LLaDA2.0 -- a family of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase block-level WSD-based training scheme: progressively increasing block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Along with post-training alignment via SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
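To make the three-phase conversion schedule concrete, here is a minimal sketch of a warm-up / stable / decay phase plan in Python. The phase names follow the abstract; the specific block sizes, step counts, and the `Phase` / `build_schedule` helpers are illustrative assumptions, not values or APIs from the paper.

```python
# Hedged sketch of the 3-phase block-level WSD-style conversion schedule
# described in the abstract. All concrete numbers are placeholders.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Phase:
    name: str                      # "warm-up" | "stable" | "decay"
    block_sizes: List[int] = field(default_factory=list)  # block diffusion block sizes
    steps: int = 0                 # training steps spent in this phase (assumed)


def build_schedule(context_len: int = 4096) -> List[Phase]:
    return [
        # Warm-up: block diffusion with progressively increasing block size,
        # easing the pre-trained AR model toward bidirectional denoising.
        Phase("warm-up", block_sizes=[4, 8, 16, 32], steps=10_000),
        # Stable: large-scale full-sequence diffusion (block spans the full context).
        Phase("stable", block_sizes=[context_len], steps=100_000),
        # Decay: revert to a compact block size for efficient parallel decoding.
        Phase("decay", block_sizes=[32], steps=10_000),
    ]


if __name__ == "__main__":
    for phase in build_schedule():
        print(f"{phase.name}: block sizes {phase.block_sizes}, {phase.steps} steps")
```

The point of the sketch is only the shape of the schedule: block size grows during warm-up, expands to full-sequence diffusion in the stable phase, then contracts again so the final model decodes in compact blocks.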
Source: arXiv: 2512.15745