Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

📄 Abstract

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update Parameterization ($\mu$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to the choice of parameterization. Next, we investigate through a comprehensive series of ablations why $\mu$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $\mu$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $\mu$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed tokens-per-parameter setting, it hurts the robustness of the extrapolation.
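As a rough illustration of what "fitting a scaling law to the hyperparameters" can look like, the sketch below fits a power law lr = a * size^b by least squares in log-log space and extrapolates it to a larger scale. The function names `fit_power_law` and `extrapolate_lr`, the power-law form, and the data points are assumptions made for illustration; the abstract does not specify the functional form or report these numbers.

```python
import math


def fit_power_law(sizes, optimal_lrs):
    """Least-squares fit of log(lr) = log(a) + b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(y_mean - b * x_mean)   # intercept in log space = prefactor
    return a, b


def extrapolate_lr(a, b, size):
    """Predict the optimal learning rate at a new scale from the fitted law."""
    return a * size ** b


# Synthetic, illustrative data only (NOT results from the paper):
# optimal learning rates that fall off exactly as 1 / width.
widths = [256, 512, 1024, 2048]
best_lrs = [0.3 / w for w in widths]

a, b = fit_power_law(widths, best_lrs)
predicted = extrapolate_lr(a, b, 8192)
```

A fitted exponent near zero would correspond to the approximately scale-invariant case that a parameterization such as $\mu$P aims for, whereas a clearly nonzero exponent means the learning rate has to be rescaled as the model grows.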


1️⃣ One-Sentence Summary

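This paper proposes a set of metrics for quantifying the quality of hyperparameter transfer and finds that maximizing the embedding layer's learning rate is the main reason μP outperforms standard parameterization when training large language models. It also shows that weight decay improves the fit of hyperparameter scaling laws but weakens the robustness of extrapolation when the number of tokens per parameter is held fixed.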
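The embedding-layer change described in the abstract can be sketched as a per-group learning rate assignment. This is a minimal sketch under stated assumptions: the helper `adamw_learning_rates`, the two-group split into "embedding" and "hidden", and the example values are hypothetical, and "a factor of width" is read literally here; whether the paper normalizes by a base width is not stated in the abstract.

```python
def adamw_learning_rates(hidden_lr, width, scale_embedding=True):
    """Return per-group learning rates for a hypothetical two-group split.

    With scale_embedding=False every group shares one learning rate, as in
    standard parameterization (SP). With scale_embedding=True the embedding
    group's rate is multiplied by the model width (literal reading of the
    abstract; an assumption, not the paper's exact recipe).
    """
    embedding_lr = hidden_lr * width if scale_embedding else hidden_lr
    return {"embedding": embedding_lr, "hidden": hidden_lr}


# Illustrative values only (not taken from the paper).
sp_lrs = adamw_learning_rates(hidden_lr=1e-3, width=1024, scale_embedding=False)
boosted_lrs = adamw_learning_rates(hidden_lr=1e-3, width=1024)
```

In a real training setup, a dictionary like this would typically be mapped onto optimizer parameter groups, so that the embedding matrix and the remaining weights are updated with different learning rates.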
