Abstract - EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization
Lookahead-based acceleration methods, such as Nesterov's momentum, are widely used in optimization, but they often become unreliable in deep learning training, mainly because of stochastic gradient noise and non-convex loss landscapes. In particular, standard lookahead relies on short-horizon update signals (e.g., differences between consecutive iterates), which are inherently noisy and can lead to unstable extrapolation directions. This work revisits Nesterov's acceleration from a trajectory perspective and argues that effective acceleration in deep learning should harness the low-frequency trends of optimization trajectories rather than extrapolate noisy one-step updates. Building on this insight, we propose EMA-Nesterov, a simple modification that replaces the standard Nesterov lookahead direction with an exponential moving average (EMA) of parameter updates. The EMA acts as a low-pass filter, yielding a stabilized lookahead direction that captures the evolving trend of the training trajectory, while its geometric weighting keeps the direction adaptive to progressive changes. We show that EMA-Nesterov retains a theoretical accelerated convergence rate on convex problems, analogous to that of Nesterov's accelerated gradient method. Furthermore, we provide empirical evidence on language model pre-training that EMA-Nesterov is broadly applicable across a range of fine-tuned base optimizers, including Adam, SOAP, and Muon, as well as complex optimizers that achieve state-of-the-art performance on optimization benchmarks (NanoGPT). Compared to prior lookahead methods, EMA-Nesterov achieves better performance by avoiding both the instability of short-horizon lookahead and the non-adaptivity of long-horizon lookahead.
EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization
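1️⃣ One-Sentence Summary
This paper proposes EMA-Nesterov, a simple optimizer modification that replaces the short-sighted lookahead direction of conventional Nesterov momentum with an exponential moving average (EMA) of parameter updates. The EMA filters out stochastic gradient noise and captures the low-frequency trend of the training trajectory, giving more stable and faster convergence in deep learning (e.g., language model pre-training) while remaining compatible with a range of mainstream optimizers.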
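The idea of extrapolating along an EMA of parameter updates, rather than along the last one-step difference, can be illustrated with a minimal Python sketch. This is not the paper's algorithm. It assumes plain gradient descent as the base optimizer, and the names `ema_lookahead_descent`, `beta`, `lookahead`, and `quadratic_grad`, as well as the exact point at which the extrapolation is applied, are illustrative assumptions rather than details taken from the source.

```python
from typing import Callable, List

Vector = List[float]


def ema_lookahead_descent(
    grad_fn: Callable[[Vector], Vector],
    x0: Vector,
    lr: float = 0.05,
    beta: float = 0.9,       # EMA decay applied to past parameter updates (assumed name)
    lookahead: float = 1.0,  # scale of the extrapolation step (assumed name)
    steps: int = 1000,
) -> Vector:
    """Gradient descent with a lookahead along an EMA of parameter updates.

    Illustrative sketch only: the base optimizer here is plain gradient
    descent, and the update rule is an assumption, not the paper's method.
    """
    x = list(x0)
    ema = [0.0] * len(x)  # smoothed (low-pass filtered) parameter update

    for _ in range(steps):
        # Extrapolate along the smoothed trend instead of the raw
        # one-step difference x_t - x_{t-1} used by standard Nesterov.
        y = [xi + lookahead * ei for xi, ei in zip(x, ema)]

        # Base optimizer step, taken from the lookahead point.
        g = grad_fn(y)
        x_new = [yi - lr * gi for yi, gi in zip(y, g)]

        # Geometric weighting: recent updates count more, older ones decay.
        ema = [
            beta * ei + (1.0 - beta) * (xn - xi)
            for ei, xn, xi in zip(ema, x_new, x)
        ]
        x = x_new

    return x


def quadratic_grad(x: Vector) -> Vector:
    """Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)."""
    return [x[0], 10.0 * x[1]]


if __name__ == "__main__":
    print(ema_lookahead_descent(quadratic_grad, [1.0, 1.0]))
```

On this toy convex quadratic the iterates approach the minimizer at the origin. Raising `beta` makes the lookahead direction smoother but slower to react to changes in the trajectory, which is the stability-versus-adaptivity trade-off that the abstract describes.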