arXiv submission date: 2026-04-13
📄 Abstract - Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise $\tanh$-like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the theory explains why these architectures can be more sensitive to initialization and optimization choices and require careful tuning for stable training.
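The abstract's central quantity, the averaged partial Jacobian norm (APJN), tracks how strongly gradients are amplified between layers at initialization. The following is a minimal sketch of how one could probe a related quantity empirically: a Hutchinson-style estimate of the squared Frobenius norm of the input-to-layer Jacobian, averaged over random probes, as a depth profile. The exact APJN normalization is the paper's; the toy block, dimensions, and `n_probes` below are illustrative assumptions, not the authors' setup.

```python
# Sketch: a Jacobian-norm depth profile at initialization (an APJN-like proxy,
# not the paper's exact definition). Toy MLP blocks stand in for transformer blocks.
import torch

torch.manual_seed(0)
width, depth, n_probes = 64, 12, 8  # assumed toy sizes

blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(width, width), torch.nn.GELU())
     for _ in range(depth)]
)

x0 = torch.randn(1, width, requires_grad=True)
h = x0
jacobian_norm_profile = []
for block in blocks:
    h = block(h)
    # Hutchinson estimate: E_v ||(dh/dx0)^T v||^2 = ||dh/dx0||_F^2 for v ~ N(0, I).
    est = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(h)
        (g,) = torch.autograd.grad(h, x0, grad_outputs=v, retain_graph=True)
        est += g.pow(2).sum().item()
    jacobian_norm_profile.append(est / (n_probes * width))  # average over probes and width

print(jacobian_norm_profile)  # how this curve grows with depth is the quantity of interest
```

In the paper's picture, a power-law depth profile corresponds to critical propagation (as in pre-LayerNorm transformers), while a stretched-exponential profile signals the subcritical regime of the normalization-free variants.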

Top-level tags: theory model training machine learning
Detailed tags: signal propagation transformer initialization jacobian norm criticality layer normalization

Subcritical Signal Propagation at Initialization in Normalization-Free Transformers


1️⃣ One-Sentence Summary

By analyzing how gradients are amplified across transformer layers, this paper shows that replacing layer normalization with tanh-like nonlinearities weakens signal propagation at initialization, which explains why such models are more sensitive to initialization and optimization choices and harder to train stably.
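For concreteness, a minimal sketch of a DyT-style elementwise module used in place of LayerNorm, following the Dynamic Tanh construction the paper analyzes; a Derf variant would swap `tanh` for `erf`. The parameter names and `alpha_init` default are our reading of that construction, not the authors' implementation.

```python
# Sketch of a Dynamic-Tanh-style replacement for LayerNorm (assumed construction).
import torch

class DyT(torch.nn.Module):
    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.tensor(alpha_init))  # learnable scalar
        self.weight = torch.nn.Parameter(torch.ones(dim))          # per-channel scale
        self.bias = torch.nn.Parameter(torch.zeros(dim))           # per-channel shift

    def forward(self, x):
        # Elementwise squashing instead of statistics-based normalization.
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# Usage: drop in wherever a transformer block would otherwise apply LayerNorm.
x = torch.randn(2, 16, 32)
print(DyT(32)(x).shape)  # torch.Size([2, 16, 32])
```

Because this module squashes activations elementwise rather than renormalizing them, the signal-propagation behavior at initialization differs from LayerNorm, which is the subcritical regime the paper characterizes.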

Source: arXiv:2604.11890