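Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
1️⃣ One-Sentence Summary
This paper proposes a set of metrics for quantifying the quality of hyperparameter transfer and finds that maximizing the embedding layer learning rate is the main reason the μP parameterization outperforms standard parameterization when training large language models. It also shows that weight decay improves the fit of hyperparameter scaling laws but, in the fixed token-per-parameter setting, weakens the robustness of extrapolation.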
Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($\mu$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $\mu$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $\mu$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $\mu$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
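The scaling-law route to hyperparameter transfer described in the abstract can be illustrated with a minimal sketch: fit a power law to the best learning rate found at several small widths, then extrapolate to a larger width. The power-law form `lr = a * width**b`, the log-space RMSE used as a stand-in for "quality of fit", and the function names `fit_power_law` and `extrapolate` are all assumptions made for illustration; the abstract does not specify the paper's functional form or metric definitions, and the numbers below are synthetic, not results from the paper.

```python
import math


def fit_power_law(sizes, best_lrs):
    """Fit best_lr ~ a * size**b by ordinary least squares in log-log space.

    Returns (a, b, rmse), where rmse is the root-mean-square residual of the
    fit in log space (an assumed proxy for fit quality, not the paper's metric).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(lr) for lr in best_lrs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope = scaling exponent
    log_a = mean_y - b * mean_x        # intercept = log of the prefactor
    rmse = math.sqrt(
        sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys)) / n
    )
    return math.exp(log_a), b, rmse


def extrapolate(a, b, size):
    """Predict the learning rate at a larger size from the fitted law."""
    return a * size ** b


# Synthetic illustration only: optimal learning rates that shrink as 1/width.
widths = [256, 512, 1024, 2048]
lrs = [0.3 / w for w in widths]

a, b, rmse = fit_power_law(widths, lrs)
lr_8192 = extrapolate(a, b, 8192)
print(f"a={a:.4g}, b={b:.4g}, log-space RMSE={rmse:.2e}, lr@8192={lr_8192:.3e}")
```

On this noise-free synthetic data the fit recovers an exponent close to -1 with a near-zero residual. With real sweeps the residual would be nonzero, and the sensitivity of the final loss to an error in the extrapolated learning rate is what the abstract's second metric, robustness to extrapolation errors, is concerned with.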
Source: arXiv:2605.21486