arXiv submission date: 2026-01-13
📄 Abstract - Demystifying the Slash Pattern in Attention: The Role of RoPE

Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $\Delta$-th sub-diagonal for some offset $\Delta$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to models and generalize to out-of-distribution prompts. To explain the intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) Queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we theoretically show that these conditions are sufficient to ensure the emergence of SDHs by formalizing them as our modeling assumptions. In particular, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs. These SDHs likewise generalize to out-of-distribution prompts.
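To make the mechanism concrete, here is a minimal NumPy sketch (not the paper's code; the vectors q and k, the head dimension, and the frequency cutoff are illustrative assumptions). With RoPE, the logit between positions m and n is the dot product of q rotated by mθ and k rotated by nθ, which depends only on the offset m − n once q and k are shared by all tokens (the near rank-one condition). Every row of the causal attention map then peaks at the same offset Δ, i.e., along the Δ-th sub-diagonal; the specific Δ depends on how the energy of q and k is spread across the RoPE frequencies.

```python
# Minimal sketch: rank-one queries/keys + RoPE => a slash (sub-diagonal) pattern.
import numpy as np

def rope_rotate(x, pos, freqs):
    """Apply RoPE to vector x (even length) at position `pos`."""
    x = x.reshape(-1, 2)                      # pair up adjacent dimensions
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    rot = np.stack([x[:, 0] * cos - x[:, 1] * sin,
                    x[:, 0] * sin + x[:, 1] * cos], axis=-1)
    return rot.reshape(-1)

d, seq_len, base = 64, 128, 10000.0
freqs = base ** (-np.arange(0, d, 2) / d)     # RoPE frequencies theta_i

rng = np.random.default_rng(0)
q = rng.standard_normal(d)                    # rank-one condition: one q ...
k = rng.standard_normal(d)                    # ... and one k shared by all tokens

# Second condition (illustrative): keep only the faster-rotating half of the
# frequency pairs, so medium/high frequencies dominate q and k.
mask = np.repeat(np.arange(d // 2) < d // 4, 2)
q, k = q * mask, k * mask

# Causal attention logits; under the rank-one condition they depend only on m - n.
logits = np.full((seq_len, seq_len), -np.inf)
for m in range(seq_len):
    qm = rope_rotate(q, m, freqs)
    for n in range(m + 1):
        logits[m, n] = qm @ rope_rotate(k, n, freqs)

attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

# Rows share (almost) the same dominant offset Delta -> a slash pattern.
offsets = np.arange(seq_len) - attn.argmax(axis=-1)
vals, counts = np.unique(offsets, return_counts=True)
print("dominant offset Delta:", int(vals[counts.argmax()]),
      "shared by", int(counts.max()), "of", seq_len, "rows")
```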

Top-level tags: llm theory, model evaluation
Detailed tags: attention patterns, rotary position embedding, training dynamics, transformer analysis, slash attention

Demystifying the Slash Pattern in Attention: The Role of RoPE


1️⃣ One-sentence summary

Through theoretical and empirical analysis, this paper explains why "slash-dominant head" patterns emerge in the attention mechanisms of large language models, and shows that the medium- and high-frequency components of Rotary Position Embedding (RoPE) are the key cause of this phenomenon.
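As a quick way to inspect such maps, the following diagnostic is a hypothetical sketch (an assumption of this summary, not necessarily the paper's metric): it scores how slash-dominant an attention map is by averaging the attention mass along each sub-diagonal and reporting the offset that carries the most.

```python
# Hypothetical slash-dominance diagnostic for a causal attention map.
import numpy as np

def slash_dominance(attn: np.ndarray, max_offset: int = 64):
    """attn: (seq_len, seq_len) row-stochastic causal attention map.
    Returns the offset Delta with the largest mean sub-diagonal mass."""
    seq_len = attn.shape[0]
    scores = {}
    for delta in range(min(max_offset, seq_len)):
        diag = np.diagonal(attn, offset=-delta)   # entries attn[m, m - delta]
        scores[delta] = float(diag.mean())
    best = max(scores, key=scores.get)
    return best, scores[best]

# Usage (e.g., on the synthetic map from the earlier sketch):
# delta_star, mass = slash_dominance(attn)
# print(f"dominant offset {delta_star} carries ~{mass:.2f} mean attention mass")
```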

Source: arXiv 2601.08297