arXiv submission date: 2025-12-11
📄 Abstract - Scaling Behavior of Discrete Diffusion Language Models

Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making them a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
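
The abstract mentions smoothly interpolating between masked and uniform diffusion noise. Below is a minimal sketch of what such a forward corruption step could look like, assuming a D3PM-style discrete diffusion setup in which each token is either kept or replaced, and where the replacement distribution mixes the [MASK] token with a uniform draw over the vocabulary. The function name `corrupt_tokens`, the mixing weight `lam`, and the schedule value `alpha_t` are illustrative assumptions, not the paper's exact parameterization.

```python
import torch

def corrupt_tokens(x, alpha_t, lam, vocab_size, mask_id):
    """Forward-noise a batch of token ids at noise level 1 - alpha_t.

    Each token is kept with probability alpha_t; otherwise it is replaced
    either by the [MASK] token (probability lam, the masked-diffusion limit)
    or by a uniformly random vocabulary token (probability 1 - lam, the
    uniform-diffusion limit).
    """
    keep = torch.rand(x.shape, device=x.device) < alpha_t
    use_mask = torch.rand(x.shape, device=x.device) < lam
    uniform = torch.randint(0, vocab_size, x.shape, device=x.device)
    replacement = torch.where(use_mask, torch.full_like(x, mask_id), uniform)
    return torch.where(keep, x, replacement)

# Example: a toy batch with vocab_size=32000 and a dedicated mask id.
x = torch.randint(0, 32000, (2, 16))
x_noised = corrupt_tokens(x, alpha_t=0.6, lam=0.5, vocab_size=32000, mask_id=32000)
```

Setting `lam=1.0` recovers pure masked (absorbing) diffusion, `lam=0.0` pure uniform diffusion, and intermediate values interpolate between the two noise types.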

Top-level tags: model training, natural language processing, LLM
Detailed tags: scaling laws, discrete diffusion language models, noise type, parameter efficiency

Scaling Behavior of Discrete Diffusion Language Models


1️⃣ One-Sentence Summary

This paper finds that the scaling behavior (scaling laws) of discrete diffusion language models, proposed as an alternative to autoregressive models, depends strongly on the noise type: uniform diffusion models are the more favorable choice when data is limited, and the authors confirm the predicted scaling behavior by training a model with roughly ten billion parameters.
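
For context on what such a scaling-law comparison usually involves: for each noise type one fits a parametric loss surface, e.g. a Chinchilla-style form $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$, to the final losses of many training runs and then reads off how compute-efficient training splits its budget between parameters $N$ and tokens $D$. The sketch below only illustrates that generic fitting step on synthetic data; the functional form, constants, and data points are assumptions for illustration, not the paper's actual measurements or fits.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_surface(X, E, A, B, alpha, beta):
    """Chinchilla-style parametric loss: L(N, D) = E + A/N**alpha + B/D**beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, loss) points standing in for a sweep of training runs.
rng = np.random.default_rng(0)
N = np.logspace(8, 10, 8)            # model sizes (parameters)
D = np.logspace(9.5, 11.5, 8)        # training tokens
true_params = (1.8, 420.0, 1100.0, 0.33, 0.29)
L = loss_surface((N, D), *true_params) + rng.normal(0.0, 0.01, N.shape)

popt, _ = curve_fit(loss_surface, (N, D), L,
                    p0=[2.0, 300.0, 800.0, 0.3, 0.3], maxfev=50000)
print("fitted (E, A, B, alpha, beta):", np.round(popt, 3))
```

With a fit like this per noise type, the compute-optimal $N$ and $D$ for a given FLOP budget follow from minimizing the fitted surface under a compute constraint, commonly approximated for transformers as $C \approx 6ND$.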


Source: arXiv:2512.10858