Abstract - A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets
Linear-attention and state-space language models compress the prefix into a fixed-size recurrent state, yielding O(1) memory at the cost of lossy exact recall: when many key-value associations compete, earlier facts are overwritten and needle recall degrades. Inspired by Complementary Learning Systems, we give linear attention a hippocampal complement. HOLA (Hippocampal Linear Attention) keeps the usual delta-rule state as a compressive memory and adds a bounded exact KV cache, forming a semiparametric test-time memory: the state models linearly compressible structure, while the cache stores associations that should not be forced through that state. The cache writes without a learned eviction module, keeping tokens with large beta * ||e||, the prediction residual actually committed to the state; a decoupled RMSNorm-gamma cache read then turns these exact KV pairs into sharp retrieval rather than soft averaging. At 340M parameters trained on 15B SlimPajama tokens, HOLA lowers Wikitext perplexity from 27.32 to 22.92 (-16.1%), below a full-attention Transformer++ (26.88), and improves LAMBADA perplexity from 30.95 to 30.26. It also achieves the best in-context retrieval among linear models and remains much more robust than GDN or a matched HOLA+recency cache on RULER needle-in-a-haystack recall out to 32k tokens (16x its training length).
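The write rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single head, a plain delta-rule update S <- S + beta * e k^T with e = v - S k, and a min-heap that keeps the cache_size tokens with the largest beta * ||e||. The class name HippocampalMemorySketch, the read_cache form (a softmax over gamma-scaled dot products of RMS-normalized query and keys), and the default gamma are hypothetical stand-ins. The abstract does not specify how the RMSNorm-gamma read is computed or how the two reads are combined, so the sketch returns them separately.

```python
import heapq
import math


def matvec(S, x):
    """Return S @ x for a matrix stored as a list of rows."""
    return [sum(s * x_j for s, x_j in zip(row, x)) for row in S]


def rms_normalize(x, eps=1e-6):
    """Scale x to unit root-mean-square (RMSNorm without a learned gain)."""
    rms = math.sqrt(sum(x_i * x_i for x_i in x) / len(x) + eps)
    return [x_i / rms for x_i in x]


class HippocampalMemorySketch:
    """Delta-rule state plus a bounded exact KV cache (illustrative only)."""

    def __init__(self, d_k, d_v, cache_size, gamma=8.0):
        self.S = [[0.0] * d_k for _ in range(d_v)]  # compressive state
        self.cache = []   # min-heap of (score, step, key, value)
        self.cache_size = cache_size
        self.gamma = gamma  # assumed sharpness of the cache read
        self.step = 0

    def write(self, k, v, beta):
        # Prediction residual of the current state for this key.
        pred = matvec(self.S, k)
        e = [v_i - p_i for v_i, p_i in zip(v, pred)]
        # Delta-rule update: S <- S + beta * e k^T.
        for i, e_i in enumerate(e):
            for j, k_j in enumerate(k):
                self.S[i][j] += beta * e_i * k_j
        # Cache priority: size of the residual actually committed to S.
        score = beta * math.sqrt(sum(e_i * e_i for e_i in e))
        entry = (score, self.step, list(k), list(v))
        self.step += 1
        if len(self.cache) < self.cache_size:
            heapq.heappush(self.cache, entry)
        elif score > self.cache[0][0]:
            heapq.heapreplace(self.cache, entry)  # evict the lowest score

    def read_state(self, q):
        """Compressive read from the recurrent state."""
        return matvec(self.S, q)

    def read_cache(self, q):
        """Assumed sharp read: softmax over gamma-scaled normalized scores."""
        if not self.cache:
            return None
        qn = rms_normalize(q)
        logits = []
        for _, _, k, _ in self.cache:
            kn = rms_normalize(k)
            logits.append(self.gamma * sum(a * b for a, b in zip(qn, kn)) / len(q))
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        out = [0.0] * len(self.cache[0][3])
        for w, (_, _, _, v) in zip(weights, self.cache):
            for i, v_i in enumerate(v):
                out[i] += (w / total) * v_i
        return out
```

In this sketch a token that the state already predicts well has a small residual and is the first to be evicted, while a surprising token stays in the cache. A larger gamma pushes the cache read toward returning a single stored value rather than an average.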
A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets
1️⃣ One-sentence summary
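To keep linear-attention models from forgetting key early facts when they compress a sequence, this paper draws on the brain's Complementary Learning Systems and proposes HOLA, which adds a small exact key-value cache (analogous to a hippocampus) alongside linear attention's compressive memory to hold important information that would otherwise be overwritten, substantially improving long-range recall without significantly increasing compute cost and even beating a full-attention Transformer++ on Wikitext perplexity (22.92 vs. 26.88).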