Specialization of softmax attention heads: insights from the high-dimensional single-location model
1️⃣ One-sentence summary
This paper uses a theoretical model to explain how multi-head attention in Transformers is trained, showing that attention heads specialize in stages to learn different features, and proposes improved attention functions that boost model performance.
Multi-head attention enables transformer models to represent multiple attention patterns simultaneously. Empirically, head specialization emerges in distinct stages during training, while many heads remain redundant and learn similar representations. We propose a theoretical model capturing this phenomenon, based on the multi-index and single-location regression frameworks. In the first part, we analyze the training dynamics of multi-head softmax attention under SGD, revealing an initial unspecialized phase followed by a multi-stage specialization phase in which different heads sequentially align with latent signal directions. In the second part, we study the impact of attention activation functions on performance. We show that softmax-1 significantly reduces noise from irrelevant heads. Finally, we introduce the Bayes-softmax attention, which achieves optimal prediction performance in this setting.
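The softmax-1 variant mentioned above is commonly defined by adding a constant 1 to the softmax denominator, which lets a head assign near-zero total weight when no score stands out. The paper does not spell out its exact formula here, so the sketch below assumes that standard definition; the function names are illustrative only.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Standard softmax: the weights always sum to exactly 1,
    so even an irrelevant head must distribute full attention mass."""
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()

def softmax1(scores: np.ndarray) -> np.ndarray:
    """Assumed softmax-1: exp(x_i) / (1 + sum_j exp(x_j)).
    When all scores are very negative, the total weight approaches 0,
    letting an irrelevant head effectively switch itself off."""
    m = max(scores.max(), 0.0)          # stable shift; the implicit "0" logit is shifted too
    e = np.exp(scores - m)
    return e / (np.exp(-m) + e.sum())

if __name__ == "__main__":
    strong = np.array([1.0, 2.0, 3.0])
    weak = np.array([-10.0, -10.0, -10.0])
    print(softmax(weak).sum())   # standard softmax still sums to 1
    print(softmax1(weak).sum())  # softmax-1 output mass is near 0
```

The key contrast is on low-score inputs: standard softmax forces the weights to sum to 1 regardless, while softmax-1 lets the total mass collapse toward 0, which matches the paper's claim that it reduces noise from irrelevant heads.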
Source: arXiv: 2603.03993