arXiv submission date: 2026-02-02
📄 Abstract - The Effect of Mini-Batch Noise on the Implicit Bias of Adam

With limited high-quality data and growing compute, multi-epoch training is regaining its importance across sub-areas of deep learning. Adam(W), versions of which are the go-to optimizers for many tasks such as next-token prediction, has two momentum hyperparameters $(\beta_1, \beta_2)$ controlling memory and one very important hyperparameter, the batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $\beta_1$, $\beta_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that for large batch sizes, higher $\beta_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on $\beta_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $\beta_1$. In particular, the common "default" pair $(\beta_1, \beta_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $\beta_1$ closer to $\beta_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.
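As a hedged illustration (not the paper's code), the sketch below shows how one might pick AdamW's `betas` based on batch size in PyTorch, following the abstract's qualitative finding: the default $(0.9, 0.999)$ for small batches, and $\beta_1$ moved closer to $\beta_2$ for large batches. The threshold `large_batch_threshold` and the large-batch value $\beta_1 = 0.99$ are hypothetical placeholders; the paper only ties the scale of the shift to the critical batch size.

```python
# Minimal sketch (an assumption, not the paper's implementation): choose AdamW
# betas according to batch size, following the abstract's qualitative finding.
import torch


def make_adamw(params, batch_size, lr=1e-3, large_batch_threshold=1024):
    # `large_batch_threshold` is a hypothetical stand-in for the critical
    # batch-size scale at which the beta-dependence is said to reverse.
    if batch_size < large_batch_threshold:
        # Small batches: the common default (beta1, beta2) = (0.9, 0.999).
        betas = (0.9, 0.999)
    else:
        # Large batches: the abstract suggests moving beta1 closer to beta2;
        # 0.99 is an illustrative value, not a recommendation from the paper.
        betas = (0.99, 0.999)
    return torch.optim.AdamW(params, lr=lr, betas=betas)


# Example usage with a toy model (illustrative only).
model = torch.nn.Linear(10, 2)
optimizer = make_adamw(model.parameters(), batch_size=256)
```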

Top-level tags: model training, theory, machine learning
Detailed tags: optimization, adam, implicit bias, generalization, batch size

The Effect of Mini-Batch Noise on the Implicit Bias of Adam


1️⃣ One-Sentence Summary

Through theoretical analysis, this paper finds that the generalization behavior of the Adam optimizer is jointly shaped by the batch size and the momentum hyperparameters: with small batches the default momentum values work well, while with large batches the momentum parameters should be adjusted to improve validation accuracy in multi-epoch training.

Source: arXiv:2602.01642