arXiv submission date: 2026-06-08
📄 Abstract - End-to-End Context Compression at Scale

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
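
To make the memory argument in the abstract concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with context length, and how it would shrink if the decoder only had to cache one entry per latent embedding. Everything numeric here is an assumption for illustration: the layer count, KV head count, head dimension, and fp16 storage are hypothetical and not taken from the paper, and reading "1:4" as "four input tokens per latent embedding" is an interpretation of the stated compression ratios. The helper names `kv_cache_bytes` and `compressed_length` are likewise made up for this example.

```python
import math


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for a standard transformer decoder.

    The leading factor of 2 accounts for storing both a key and a value
    tensor per layer. bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def compressed_length(num_tokens, ratio):
    """Number of latent embeddings, assuming `ratio` input tokens per latent."""
    return math.ceil(num_tokens / ratio)


# Hypothetical decoder shape, chosen only for illustration (not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

if __name__ == "__main__":
    context = 128_000  # hypothetical prompt length in tokens
    full = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM)
    for ratio in (4, 8, 16):  # the compression ratios named in the abstract
        latents = compressed_length(context, ratio)
        small = kv_cache_bytes(latents, LAYERS, KV_HEADS, HEAD_DIM)
        print(f"1:{ratio}: {latents} latents, "
              f"{small / 2**30:.2f} GiB vs {full / 2**30:.2f} GiB uncompressed")
```

Under these assumptions the cache shrinks linearly with the compression ratio. The sketch ignores the encoder's own memory and compute cost, which the abstract treats as a separate axis (compression speed and peak memory usage).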

Top-level tags: llm systems model training
Detailed tags: kv cache compression encoder-decoder latent context efficiency long-context agents

End-to-End Context Compression at Scale


1️⃣ One-sentence summary

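This paper proposes Latent Context Language Models (LCLMs), a new family of encoder-decoder models that compress very long text into a shorter sequence of latent embeddings without significantly degrading quality, substantially reducing memory usage during large language model inference and improving the trade-off among compression speed, accuracy, and memory efficiency.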

Source: arXiv: 2606.09659