Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis
1️⃣ One-Sentence Summary
This paper proposes a unifying framework called TAPPA that explains, from a temporal-continuity perspective, why the various attention patterns observed in large language models arise, and classifies them as either predictable or unpredictable. The theory not only deepens our understanding of the attention mechanism but also effectively guides inference acceleration and model compression tasks.
Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at this https URL.
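To make the "query self-similarity along the temporal dimension" idea concrete, here is a minimal sketch of one plausible way to measure it. This is not the paper's actual metric: the function name, tensor shapes, and the choice of cosine similarity between queries at adjacent positions are all illustrative assumptions.

```python
# Hypothetical sketch: temporal self-similarity of an attention head's queries.
# High values suggest queries drift slowly over positions (a "predictable"
# pattern in TAPPA's framing); low values suggest effectively random behavior.
import torch
import torch.nn.functional as F

def query_temporal_self_similarity(q: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity between query vectors at adjacent positions.

    q: query states for a single attention head, shape (seq_len, head_dim).
    Returns a scalar in [-1, 1].
    """
    q = F.normalize(q, dim=-1)            # unit-normalize each query vector
    sims = (q[:-1] * q[1:]).sum(dim=-1)   # cosine similarity of neighbors
    return sims.mean()

# Example: a slowly drifting query sequence scores near 1.0.
q = torch.randn(1, 64).repeat(128, 1) + 0.05 * torch.randn(128, 64)
print(query_temporal_self_similarity(q))
```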
Source: arXiv: 2601.21709