Context Memorization for Efficient Long Context Generation

📄 Abstract - Context Memorization for Efficient Long Context Generation

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it has two structural limitations: (i) the prefix's influence fades as generation proceeds, and (ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and it surpasses full-attention RAG performance on the NBA benchmark using only 20% of its memory footprint.

1️⃣ One-Sentence Summary

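This paper proposes a training-free memorization method that precomputes and stores the attention states between the prefix and queries, replacing conventional attention computation with a lightweight lookup table. This reduces computation latency during long-text generation and prevents the prefix's information from fading as generation proceeds.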
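
The abstract does not spell out how the memory is built or queried, so the following is only a rough sketch of one standard identity that makes a stored "attention state" useful, not the authors' implementation. It assumes an attention state is the pair (softmax-weighted output, log-sum-exp of the scores) for one query over a block of keys and values. Under that assumption, the state for the prefix can be merged exactly with the state for the remaining tokens, without attending to the prefix again. The function names `attention_state` and `merge_states` are hypothetical, and the sketch reuses the same query for the prefix state, whereas a real lookup-based memory would have to retrieve a state computed for a stored query.

```python
import math
import random


def attention_state(q, keys, values):
    """Return (output, lse) for one query over a block of keys/values.

    output: softmax-weighted average of the value vectors.
    lse: log-sum-exp of the scaled scores (log of the softmax normalizer).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]
    return output, m + math.log(total)


def merge_states(state_a, state_b):
    """Combine two attention states as if their key blocks were concatenated."""
    out_a, lse_a = state_a
    out_b, lse_b = state_b
    m = max(lse_a, lse_b)
    wa = math.exp(lse_a - m)  # relative softmax mass of block a
    wb = math.exp(lse_b - m)  # relative softmax mass of block b
    merged = [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]
    return merged, m + math.log(wa + wb)


# Toy data (assumption: one head, dimension 4, random vectors).
random.seed(0)
DIM = 4


def rand_vecs(n):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


prefix_k, prefix_v = rand_vecs(6), rand_vecs(6)
local_k, local_v = rand_vecs(3), rand_vecs(3)
q = rand_vecs(1)[0]

# The prefix state is what a memory could hold instead of the prefix itself.
prefix_state = attention_state(q, prefix_k, prefix_v)
local_state = attention_state(q, local_k, local_v)
merged_out, merged_lse = merge_states(prefix_state, local_state)

# Reference: full attention over prefix and local tokens together.
full_out, full_lse = attention_state(q, prefix_k + local_k, prefix_v + local_v)
max_err = max(abs(a - b) for a, b in zip(merged_out, full_out))
```

In this toy setting `max_err` is at floating-point precision, because the merge is an exact rewrite of the softmax over the concatenated keys. Any approximation in a lookup-based memory would come from how a stored state is matched to a new query, which this sketch does not model.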
