Post-LayerNorm Is Back: Stable, Expressive, and Deep
1️⃣ One-Sentence Summary
This paper proposes a new Transformer architecture named Keel, which replaces the conventional residual connection with a Highway-style connection to resolve the training instability of Post-LayerNorm in extremely deep networks. This allows models with more than 1000 layers to be trained stably, offering a simple and effective route to deeper, more expressive large language models.
Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves gradient flow through the residual branch, preventing the signal from vanishing as it propagates from the top layers down to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility of future infinite-depth architectures.
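To make the architectural change concrete, below is a minimal sketch of a Post-LN Transformer block where the usual ResNet-style residual, y = LN(x + F(x)), is swapped for a Highway-style gated connection, y = LN(g · F(x) + (1 − g) · x). This is not the paper's official implementation; the gate parameterization (a per-dimension sigmoid gate computed from the block input) and the class/argument names are illustrative assumptions.

```python
# Hedged sketch of a Post-LN block with a Highway-style connection.
# Assumption: the gate is a learned, per-dimension sigmoid of the block input;
# the actual Keel formulation may differ.
import torch
import torch.nn as nn


class HighwayPostLNBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Highway "transform" gates, one per sub-layer (illustrative choice).
        self.gate_attn = nn.Linear(d_model, d_model)
        self.gate_ff = nn.Linear(d_model, d_model)

    def _highway_postln(self, x, fx, gate, ln):
        # Highway combination followed by Post-LN: the gate blends the
        # transformed signal with the identity path instead of summing them.
        g = torch.sigmoid(gate(x))
        return ln(g * fx + (1.0 - g) * x)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self._highway_postln(x, attn_out, self.gate_attn, self.ln1)
        x = self._highway_postln(x, self.ff(x), self.gate_ff, self.ln2)
        return x
```

Compared with the ResNet-style Post-LN update `ln(x + fx)`, the gated blend keeps an explicit identity-weighted path through every layer, which is the property the abstract credits with preserving gradient flow at extreme depth.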
Source: arXiv: 2601.19895