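Remember to Forget: Gated Adaptive Positional Encoding
1️⃣ One-sentence summary
To address the performance degradation of Rotary Positional Encoding (RoPE) on long sequences in large language models, this paper proposes a lightweight Gated Adaptive Positional Encoding (GAPE). By introducing a content-aware gating mechanism into the attention computation, GAPE lets the model automatically suppress irrelevant long-range information while retaining key distant information, significantly improving robustness on long texts without sacrificing local precision.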
Rotary Positional Encoding (RoPE) is widely used in modern large language models. However, when sequences are extended beyond the range seen during training, rotary phases can enter out-of-distribution regimes, leading to spurious long-range alignments, diffuse attention, and degraded retrieval. Existing remedies only partially address these failures, as they often trade local positional resolution for long-context stability. We propose GAPE (Gated Adaptive Positional Encoding), a drop-in augmentation for positional encodings that introduces a content-aware bias directly into the attention logits while preserving the rotary geometry. GAPE decouples distance-based suppression from token importance through a query-dependent gate that contracts irrelevant context and a key-dependent gate that preserves salient distant tokens. We prove that protected tokens remain accessible, while the attention mass assigned to unprotected distant tokens decays as a function of the query gate. We further show that GAPE can be implemented within standard scaled dot-product attention. We validate these properties empirically, finding that GAPE consistently yields sharper attention and improved long-context robustness over rotary baselines across both synthetic retrieval and long-context benchmarks.
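The abstract describes a content-aware bias added to the attention logits, with a query-dependent gate that contracts distant context and a key-dependent gate that protects salient tokens. The sketch below illustrates that general idea in plain Python. The specific bias form `query_gate * (1 - key_gate) * distance`, the function name `gated_attention_weights`, and all parameter names are assumptions made for illustration; they are not taken from the paper, whose exact formulation may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention_weights(scores, query_pos, query_gate, key_gates):
    """Hypothetical gated distance bias on precomputed attention logits.

    scores[j]    : scaled dot-product logit for key j (e.g. computed with RoPE).
    query_pos    : position index of the query token.
    query_gate   : non-negative scalar; larger values suppress distant keys more.
    key_gates[j] : value in [0, 1]; 1 means key j is fully protected.

    The bias form below is an illustrative assumption, not the paper's formula.
    """
    biased = []
    for j, score in enumerate(scores):
        distance = abs(query_pos - j)
        # Protected keys (gate near 1) receive almost no distance penalty.
        penalty = query_gate * (1.0 - key_gates[j]) * distance
        biased.append(score - penalty)
    return softmax(biased)


# Eight keys with identical content scores; only key 0 is marked as salient.
scores = [0.0] * 8
key_gates = [1.0] + [0.0] * 7
weights = gated_attention_weights(scores, query_pos=7, query_gate=1.0,
                                  key_gates=key_gates)
```

In this toy setup the protected key at position 0 keeps the same weight as the query's own position, while unprotected keys lose weight as their distance from the query grows. Because the bias is purely additive on the logits, a term of this kind can in principle be passed as an additive mask to a standard scaled dot-product attention routine.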
Source: arXiv: 2605.10414