arXiv submission date: 2026-02-03
📄 Abstract - Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent

To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, relying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD's Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum ($\ell_\infty$) and stochastic spectral descent (specSGD) / Muon ($\mathcal{S}_\infty$). In this work, we derive gradient noise scales for signSGD and specSGD that naturally emerge from the geometry of their respective dual norms. To practically estimate these non-Euclidean metrics, we propose an efficient variance estimation procedure that leverages the local mini-batch gradients on different ranks in distributed data-parallel systems. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean GNS enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon on a 160 million parameter Llama model.
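The estimation trick described above can be made concrete with the classic Euclidean gradient noise scale estimator (McCandlish et al., 2018), which the paper generalizes to dual norms. The sketch below is a minimal illustration, assuming access to each rank's local gradient before DDP's all-reduce (e.g., via `no_sync` or a communication hook) and an initialized process group; the function name is hypothetical, and the dual-norm statistics used in the paper (e.g., $\ell_1$ for signSGD, nuclear norm for specSGD) are not reproduced here.

```python
import torch
import torch.distributed as dist

def gradient_noise_scale(local_grad: torch.Tensor, local_batch: int) -> float:
    """Bias-corrected gradient noise scale from per-rank DDP gradients.

    Classic Euclidean estimator: each rank's local mean gradient serves as
    the "small batch" sample, and the all-reduced mean across ranks serves
    as the "large batch" sample, so no extra forward/backward passes are
    needed. The paper swaps the squared l2 norms below for dual-norm
    quantities; those exact formulas are an assumption left out here.
    """
    world = dist.get_world_size()
    b_small, b_big = local_batch, world * local_batch

    # Mean over ranks of the per-rank squared gradient norm: E[|g_small|^2].
    small_sq = local_grad.pow(2).sum()
    dist.all_reduce(small_sq)
    small_sq /= world

    # Squared norm of the all-reduced (big-batch) mean gradient: |g_big|^2.
    big_grad = local_grad.clone()
    dist.all_reduce(big_grad)
    big_grad /= world
    big_sq = big_grad.pow(2).sum()

    # Unbiased estimates from E[|g_b|^2] = |G|^2 + tr(Sigma)/b, then
    # B_simple = tr(Sigma) / |G|^2.
    g_sq = (b_big * big_sq - b_small * small_sq) / (b_big - b_small)
    tr_sigma = (small_sq - big_sq) / (1.0 / b_small - 1.0 / b_big)
    return (tr_sigma / g_sq).item()
```

In practice `local_grad` would be the flattened per-rank mean gradient collected each step; because the two samples reuse gradients the training loop already computes, the estimator adds only two scalar reductions of overhead.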

Top tags: model training, machine learning systems
Detailed tags: adaptive batch size, gradient noise scale, non-euclidean geometry, distributed optimization, signsgd

Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent


1️⃣ One-sentence summary

This paper proposes a new adaptive batch-size method: it derives gradient noise scales that match the geometry of two popular non-Euclidean optimizers (Signum and Muon), substantially reducing the number of training steps required while preserving model quality.
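As a usage sketch, the standard GNS heuristic grows the batch size toward the measured noise scale (the "critical batch size"). The controller below is hypothetical: the smoothing, bounds, and rounding are illustrative choices, not the schedule used in the paper's experiments.

```python
class BatchSizeController:
    """Hypothetical adaptive schedule: track an exponential moving average
    of the measured noise scale B_noise and round the target batch size to
    a hardware-friendly multiple. Bounds and decay are illustrative."""

    def __init__(self, b_min: int = 64, b_max: int = 4096,
                 multiple: int = 64, decay: float = 0.95):
        self.b_min, self.b_max = b_min, b_max
        self.multiple, self.decay = multiple, decay
        self.ema = None  # running estimate of the noise scale

    def update(self, noise_scale: float) -> int:
        # Smooth the noisy per-step GNS estimate before acting on it.
        self.ema = (noise_scale if self.ema is None
                    else self.decay * self.ema + (1 - self.decay) * noise_scale)
        target = round(self.ema / self.multiple) * self.multiple
        return int(max(self.b_min, min(self.b_max, target)))
```

Each step, the controller would be fed the latest (non-Euclidean) GNS estimate and the global batch size adjusted accordingly, for example by changing the number of gradient-accumulation steps.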

Source: arXiv:2602.03001