Abstract - Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
Linear attention reduces the quadratic cost of softmax attention to $\mathcal{O}(T)$, but its memory state grows as $\mathcal{O}(T)$ in Frobenius norm, causing progressive interference between stored associations. We introduce \textbf{Variational Linear Attention} (VLA), which reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. We prove that normalising the write direction to unit length makes the spectral norm of the recurrence Jacobian exactly $1$ for all sequence lengths and head dimensions (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces $\|S_t\|_F$ by $109\times$ relative to standard linear attention at $T{=}1{,}000$. On multi-query associative recall, it achieves near-perfect exact-match accuracy within the effective per-head memory regime ($n_\text{pairs} < d_h$), retains substantially higher retrieval performance than DeltaNet and standard linear attention under increasing memory load, and maintains 62\% accuracy at the per-head capacity boundary. A Triton-fused kernel achieves a $14\times$ speedup over sequential Python and $\mathcal{O}(T)$ scaling, crossing below softmax attention latency at approximately 43\,000 tokens.
Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
1️⃣ One-sentence summary
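This paper proposes Variational Linear Attention (VLA), which recasts the memory update in linear attention as a regularised least-squares problem with an adaptive penalty term and normalises the write direction. This addresses the core weakness of conventional linear attention on long sequences, namely a memory state that keeps growing and interference that accumulates over time, and yields stable, efficient associative-memory retrieval over long contexts.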
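The abstract says the penalty matrix is maintained with the Sherman-Morrison rank-1 formula, $(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1}u\,v^\top A^{-1}}{1 + v^\top A^{-1}u}$. The sketch below illustrates that identity only; it is not the paper's update rule, which is not given here. The ridge-style matrix $\lambda I + \sum_s k_s k_s^\top$, the names `sherman_morrison_update`, `max_identity_error`, and `lam`, and the use of plain Python lists (to avoid dependencies) are all assumptions made for illustration.

```python
import random

def sherman_morrison_update(a_inv, u, v):
    """Return (A + u v^T)^{-1} given a_inv = A^{-1}, using O(d^2) work."""
    d = len(a_inv)
    a_inv_u = [sum(a_inv[i][j] * u[j] for j in range(d)) for i in range(d)]  # A^{-1} u
    v_a_inv = [sum(v[i] * a_inv[i][j] for i in range(d)) for j in range(d)]  # v^T A^{-1}
    denom = 1.0 + sum(v[i] * a_inv_u[i] for i in range(d))  # 1 + v^T A^{-1} u
    if abs(denom) < 1e-12:
        raise ValueError("rank-1 update would make the matrix singular")
    return [[a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / denom for j in range(d)]
            for i in range(d)]

def max_identity_error(a, b):
    """Largest entry of |A B - I|; near zero when B is the inverse of A."""
    d = len(a)
    return max(
        abs(sum(a[i][k] * b[k][j] for k in range(d)) - (1.0 if i == j else 0.0))
        for i in range(d) for j in range(d)
    )

# Hypothetical setup (an assumption, not from the paper):
# ridge-style matrix A_t = lam * I + sum_s k_s k_s^T.
random.seed(0)
d, lam = 4, 0.5
a = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
p = [[1.0 / lam if i == j else 0.0 for j in range(d)] for i in range(d)]  # A_0^{-1}

for _ in range(10):
    k = [random.gauss(0.0, 1.0) for _ in range(d)]
    for i in range(d):
        for j in range(d):
            a[i][j] += k[i] * k[j]        # explicit A_t, kept only for checking
    p = sherman_morrison_update(p, k, k)  # incremental inverse, no re-inversion

err = max_identity_error(a, p)
print(f"max |A P - I| entry: {err:.2e}")
```

Each step costs $\mathcal{O}(d^2)$ rather than the $\mathcal{O}(d^3)$ of a fresh inversion, which is why a rank-1 update suits a per-token recurrence. In this symmetric case ($u = v = k$ with a positive-definite starting matrix) the denominator is always greater than 1, so the singularity guard never fires.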