arXiv submission date: 2026-03-05
📄 Abstract - Progressive Residual Warmup for Language Model Pretraining

Transformer architectures serve as the backbone for most modern Large Language Models, so their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layers learn first" philosophy by multiplying each layer's residual by a scalar that gradually warms up from 0 to 1, with deeper layers taking longer to warm up. In this way, deeper layers wait for earlier layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also induces a distinctive optimization trajectory, leading to faster convergence, stronger generalization, and better downstream performance. Our code is available at this https URL.
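The abstract describes ProRes as multiplying each layer's residual by a scalar that warms up from 0 to 1, with deeper layers warming up over more steps. A minimal sketch of such a schedule might look like the following; the function name, the linear schedule, and the default step counts are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch of a per-layer residual warmup schedule in the spirit
# of ProRes. The linear shape and the step constants are assumptions.
def prores_scale(layer_idx: int, step: int,
                 base_warmup: int = 1000,
                 per_layer_extra: int = 500) -> float:
    """Residual multiplier for layer `layer_idx` at training `step`.

    Each layer's scalar rises linearly from 0 to 1; deeper layers get a
    longer warmup, so they begin contributing later in training.
    """
    warmup_steps = base_warmup + layer_idx * per_layer_extra
    return min(1.0, step / warmup_steps)

# Inside a transformer block, the residual branch would then be scaled as:
#   h = x + prores_scale(layer_idx, step) * sublayer(x)
# so a layer with scale 0 passes its input through unchanged.
```

At a fixed step, deeper layers receive a smaller multiplier (e.g. at step 500, layer 0 is at 0.5 while layer 5 is still near 0.14 under these defaults), which matches the "early layers learn first" behavior the abstract describes.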

Top-level tags: llm, model training, machine learning
Detailed tags: transformer, pretraining, optimization, warmup, convergence

Progressive Residual Warmup for Language Model Pretraining


1️⃣ One-sentence summary

This paper proposes a new method called ProRes, which has deeper network layers wait for shallower layers to stabilize before learning, making large language model pretraining more stable, converging faster, and ultimately performing better.

Source: arXiv: 2603.05369