arXiv submission date: 2026-05-18
📄 Abstract - Context Memorization for Efficient Long Context Generation

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on the NBA benchmark using only 20% of its memory footprint.
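The abstract does not spell out how the attention states are stored or looked up, so the following is only a minimal sketch of one standard identity that makes precomputation possible in principle, not the paper's algorithm. It assumes single-head softmax attention and shows that a query's attention over a prefix can be summarized as an (output, log-sum-exp) pair and later merged exactly with attention over newer tokens. The names `attention_state` and `merge_states` are hypothetical, and the query is assumed to be known at precompute time, which a real lookup-based memory would have to approximate.

```python
import numpy as np

def attention_state(q, K, V):
    """Softmax attention of one query over (K, V).

    Returns the attention output and the log-sum-exp of the scaled scores;
    together these are the "state" needed to merge with other key blocks.
    """
    scores = K @ q / np.sqrt(q.shape[-1])   # shape (n,)
    m = scores.max()                        # subtract max for numerical stability
    w = np.exp(scores - m)
    z = w.sum()
    out = (w[:, None] * V).sum(axis=0) / z  # shape (d,)
    return out, m + np.log(z)

def merge_states(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention states into attention over both blocks."""
    m = max(lse_a, lse_b)
    wa, wb = np.exp(lse_a - m), np.exp(lse_b - m)
    return (wa * out_a + wb * out_b) / (wa + wb)

# Toy data (illustrative sizes only): a 128-token prefix and 8 newer tokens.
rng = np.random.default_rng(0)
d = 16
K_prefix, V_prefix = rng.normal(size=(128, d)), rng.normal(size=(128, d))
K_ctx, V_ctx = rng.normal(size=(8, d)), rng.normal(size=(8, d))
q = rng.normal(size=d)

# Assumption: the prefix state for this query is computed once and stored.
prefix_out, prefix_lse = attention_state(q, K_prefix, V_prefix)

# At generation time only the newer tokens are attended to, then merged.
ctx_out, ctx_lse = attention_state(q, K_ctx, V_ctx)
merged = merge_states(prefix_out, prefix_lse, ctx_out, ctx_lse)

# Reference: full attention over the concatenated prefix and newer tokens.
full, _ = attention_state(q, np.vstack([K_prefix, K_ctx]), np.vstack([V_prefix, V_ctx]))
assert np.allclose(merged, full)
```

In this toy setting the merged result matches full attention exactly, while the generation-time cost depends only on the newer tokens plus a constant-size stored state.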

Top-level tags: llm model training systems
Detailed tags: long context attention state memory prefix-augmented inference memory efficiency in-context learning

Context Memorization for Efficient Long Context Generation


1️⃣ One-sentence summary

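This paper proposes a training-free memorization method that precomputes and stores the attention states between the prefix and query tokens, replacing conventional attention computation over the prefix with a lightweight lookup-based memory. This reduces attention latency during long-text generation and keeps the prefix's influence from fading as generation proceeds.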

Source: arXiv: 2605.18226