Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

📄 Abstract

It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-training Large Language Models (LLMs). Yet the underlying reason for this gap remains unclear. In this work, we attribute a large part of the discrepancy to SGD's inability to sustain learning rates comparable to Adam's much larger effective learning rates. Through empirical and theoretical analysis of LLM pre-training dynamics, we identify that training is characterized by small gradient norms and large weight-to-gradient ratios, an effect that becomes more pronounced with larger batch sizes typical in pre-training, necessitating such large effective learning rates. However, we find that output-layer gradient magnitudes become highly uneven across token classes, and that large gradient spikes frequently occur during training. Together, these effects severely restrict the admissible learning rate of SGD. Guided by this understanding, we show that simple clipping mechanisms that stabilize SGD at large learning rates enable it to recover most of Adam's performance. In our large-scale experiments, the validation loss gap between large-learning-rate SGD and Adam shrinks from more than 50% to only about 3.5% when pre-training a 1B-parameter LLaMA model with a 1M-token batch size.


1️⃣ One-sentence summary

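This paper finds that, when training large language models, the main reason SGD (stochastic gradient descent) performs far worse than Adam is that SGD cannot use large effective learning rates the way Adam does. Simple gradient clipping lets SGD run at large learning rates too, which substantially narrows the performance gap between the two.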
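To make the clipping idea concrete, here is a minimal sketch of one SGD step with global-norm gradient clipping. The abstract does not say which clipping mechanisms the authors use, so global-norm clipping is an assumption chosen for illustration. The function names (`global_norm`, `clip_by_global_norm`, `clipped_sgd_step`) and all numeric values are hypothetical.

```python
import math

def global_norm(grads):
    """Euclidean norm of all gradient entries taken together."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their global norm is at most max_norm (assumed clipping rule)."""
    norm = global_norm(grads)
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def clipped_sgd_step(params, grads, lr, max_norm):
    """One plain SGD update that uses the clipped gradient."""
    clipped = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]

# Hypothetical toy values: a typical small gradient and a rare spike.
params = [1.0, -2.0]
typical_grad = [0.003, -0.004]  # global norm 0.005, below the threshold
spike_grad = [30.0, -40.0]      # global norm 50, far above the threshold

# A large learning rate is needed to make progress on small gradients.
after_typical = clipped_sgd_step(params, typical_grad, lr=10.0, max_norm=0.01)
# With clipping, the spike moves the weights by at most lr * max_norm.
after_spike = clipped_sgd_step(params, spike_grad, lr=10.0, max_norm=0.01)
```

In this toy setting the typical gradient passes through unchanged, while the spike is rescaled to norm 0.01 before the update. Without clipping, the same learning rate would move the weights by a distance of 500 on the spike step, which illustrates why unclipped SGD is restricted to much smaller learning rates.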
