arXiv submission date: 2026-02-14
📄 Abstract - AllMem: A Memory-centric Recipe for Efficient Long-context Modeling

Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce AllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. AllMem enables models to scale effectively to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representation constraints typical of linear memory models but also significantly reduces the computational and memory footprint during long-sequence inference. Furthermore, we implement a Memory-Efficient Fine-Tuning strategy that replaces standard attention layers in pre-trained models with memory-augmented sliding window layers, enabling any off-the-shelf pre-trained LLM to be efficiently converted into an AllMem-based architecture. Empirical evaluations confirm that our 4k-window model achieves near-lossless performance on LongBench at a 37k context, with a marginal 0.83-point drop compared to full attention. Moreover, on InfiniteBench at a 128k context, our 8k-window variant outperforms full attention, validating the effectiveness of our parameterized memory in suppressing noise and maintaining robust long-range modeling without the prohibitive cost of global attention.
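To make the hybrid design concrete, here is a minimal NumPy sketch of the general pattern the abstract describes: each token attends over a small sliding window, while tokens evicted from the window are written into a non-linear (MLP) memory via a test-time gradient step, and a memory read-out is attended alongside the windowed keys. All names (`mem_read`, `mem_write`, `allmem_step`), the two-layer tanh MLP, the reconstruction loss, and the learning rate are illustrative assumptions; the paper's actual memory parameterization and update rule are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # model / head dimension (toy size)
window = 4      # sliding-window length
lr = 0.1        # test-time learning rate for the memory MLP

# Hypothetical two-layer MLP memory: maps a d-dim query to a d-dim summary.
W1 = rng.normal(scale=0.1, size=(d, 2 * d))
W2 = rng.normal(scale=0.1, size=(2 * d, d))

def mem_read(q, W1, W2):
    # Non-linear read: one hidden layer with a tanh nonlinearity.
    return np.tanh(q @ W1) @ W2

def mem_write(k, v, W1, W2, lr):
    # Test-time training step: one SGD update on the reconstruction
    # loss ||mem_read(k) - v||^2, so the MLP "memorizes" the (k, v) pair.
    h = np.tanh(k @ W1)                  # hidden activations, shape (2d,)
    err = h @ W2 - v                     # prediction error, shape (d,)
    gW2 = np.outer(h, err)               # dL/dW2
    gW1 = np.outer(k, (err @ W2.T) * (1 - h ** 2))  # dL/dW1 through tanh
    return W1 - lr * gW1, W2 - lr * gW2

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def allmem_step(t, q, keys, vals, W1, W2):
    """Attend over the last `window` tokens plus one memory summary token."""
    lo = max(0, t - window + 1)
    K = keys[lo:t + 1]
    V = vals[lo:t + 1]
    m = mem_read(q, W1, W2)              # summary of evicted (out-of-window) tokens
    K = np.vstack([K, m[None, :]])       # memory acts as an extra key/value slot
    V = np.vstack([V, m[None, :]])
    a = softmax(K @ q / np.sqrt(d))
    return a @ V

# Toy streaming run: 12 tokens; each token falling out of the window
# is written into the memory before it is needed again.
T = 12
keys = rng.normal(size=(T, d))
vals = rng.normal(size=(T, d))
outs = []
for t in range(T):
    outs.append(allmem_step(t, keys[t], keys, vals, W1, W2))
    evicted = t - window + 1             # index of the token leaving the window
    if evicted >= 0:
        W1, W2 = mem_write(keys[evicted], vals[evicted], W1, W2, lr)

print(len(outs), outs[0].shape)
```

The key design point this sketch illustrates is why the memory must be non-linear: a linear (matrix-valued) memory can only store an additive superposition of key-value outer products, whereas the gradient-updated MLP can fit the evicted pairs with a richer function class, which is the representation constraint of linear memory models the abstract refers to.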

Top-level tags: llm model training systems
Detailed tags: long-context modeling memory networks sliding window attention efficient inference test-time training

AllMem: A Memory-centric Recipe for Efficient Long-context Modeling


1️⃣ One-sentence summary

This paper proposes a novel hybrid architecture called AllMem, which combines sliding window attention with non-linear test-time training (TTT) memory networks, enabling large language models to process ultra-long texts efficiently while maintaining high performance at a greatly reduced compute and memory cost.

Source: arXiv 2602.13680