arXiv submission date: 2026-05-18
📄 Abstract - Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-training Large Language Models (LLMs). Yet the underlying reason for this gap remains unclear. In this work, we attribute a large part of the discrepancy to SGD's inability to sustain learning rates comparable to Adam's much larger effective learning rates. Through empirical and theoretical analysis of LLM pre-training dynamics, we identify that training is characterized by small gradient norms and large weight-to-gradient ratios, an effect that becomes more pronounced with larger batch sizes typical in pre-training, necessitating such large effective learning rates. However, we find that output-layer gradient magnitudes become highly uneven across token classes, and that large gradient spikes frequently occur during training. Together, these effects severely restrict the admissible learning rate of SGD. Guided by this understanding, we show that simple clipping mechanisms that stabilize SGD at large learning rates enable it to recover most of Adam's performance. In our large-scale experiments, the validation loss gap between large-learning-rate SGD and Adam shrinks from more than 50% to only about 3.5% when pre-training a 1B-parameter LLaMA model with a 1M-token batch size.
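To make the notion of a "large effective learning rate" concrete, here is a minimal sketch for a single scalar parameter. It assumes one particular reading of the term that the abstract does not spell out: the SGD learning rate that would reproduce Adam's step for the same gradient. The function names `adam_step` and `effective_sgd_lr` are hypothetical, and the hyperparameters are the commonly used Adam defaults, not values taken from the paper.

```python
import math


def adam_step(g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter. Returns (delta, m, v)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v


def effective_sgd_lr(g, steps=100, lr=1e-3):
    """SGD learning rate that matches Adam's step for a constant gradient g.

    Assumption: "effective learning rate" is taken here to mean
    |Adam update| / |gradient|; the paper may define it differently.
    """
    m = v = 0.0
    delta = 0.0
    for t in range(1, steps + 1):
        delta, m, v = adam_step(g, m, v, t, lr=lr)
    return -delta / g


if __name__ == "__main__":
    for g in (1.0, 1e-2, 1e-4):
        print(f"gradient {g:g}: matching SGD learning rate {effective_sgd_lr(g):g}")
```

Under this reading, Adam's step size stays close to `lr` regardless of gradient scale, so the matching SGD learning rate grows as the gradient shrinks. That is consistent with the abstract's point that small gradient norms call for large effective learning rates.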

Top-level tags: llm model training
Detailed tags: sgd vs adam learning rate pre-training optimization dynamics gradient clipping

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates


1️⃣ One-sentence summary

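This paper finds that the main reason SGD (stochastic gradient descent) performs far worse than Adam when training large language models is that SGD cannot use effective learning rates as large as Adam's; simple gradient clipping that lets SGD run at large learning rates substantially narrows the performance gap between the two.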
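As a rough illustration of how clipping can keep SGD stable at a large learning rate, the sketch below applies standard global-norm gradient clipping before a plain SGD step. This is an assumption made for illustration: the summary does not say which clipping mechanisms the paper uses, and `clip_by_global_norm` and `clipped_sgd_step` are hypothetical helper names.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm).
    """
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))  # 1.0 when already within bound
    return [g * scale for g in grads], norm


def clipped_sgd_step(params, grads, lr, max_norm):
    """One SGD step on the clipped gradient (no momentum, no weight decay)."""
    clipped, _ = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]


if __name__ == "__main__":
    params = [0.0, 0.0]
    # A typical small gradient passes through unchanged.
    print(clipped_sgd_step(params, [0.003, 0.004], lr=10.0, max_norm=1.0))
    # A gradient spike is rescaled, so the step length is bounded by lr * max_norm.
    print(clipped_sgd_step(params, [300.0, 400.0], lr=10.0, max_norm=1.0))
```

The design point is that the step length can never exceed `lr * max_norm`, so an occasional gradient spike no longer forces the learning rate down for every other step.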

Source: arXiv: 2605.17787