📄 Abstract - CompLLM: Compression for Long Context Q&A

Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts. In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance comparable to that obtained with the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.
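To make the segment-wise design concrete, here is a minimal sketch of independent per-segment compression with a reuse cache. The function names, segment length, embedding width, and hashing scheme are illustrative assumptions, not the paper's actual API; the point is that cost grows linearly with context length and compressed segments can be shared across queries.

```python
# Minimal sketch of segment-wise soft compression with caching.
# All names (compress_segment, SEGMENT_LEN, COMPRESSION_RATE) are
# illustrative assumptions, not CompLLM's real interface.

import hashlib
from typing import Dict, List

SEGMENT_LEN = 512        # tokens per segment (assumed)
COMPRESSION_RATE = 2     # 2x compression, as reported in the abstract

# Cache of compressed segments, keyed by segment content, so queries
# over overlapping contexts can reuse previously computed latents.
_cache: Dict[str, List[List[float]]] = {}


def compress_segment(tokens: List[int]) -> List[List[float]]:
    """Placeholder for a learned compressor mapping len(tokens) tokens
    to roughly len(tokens) / COMPRESSION_RATE latent embeddings."""
    n_latents = max(1, len(tokens) // COMPRESSION_RATE)
    return [[0.0] * 4096 for _ in range(n_latents)]  # dummy embeddings


def compress_context(context_tokens: List[int]) -> List[List[float]]:
    """Compress a long context segment by segment.

    Each segment is compressed independently of the others, so total
    compression cost scales linearly with context length, and cached
    segments are reused verbatim across queries.
    """
    compressed: List[List[float]] = []
    for start in range(0, len(context_tokens), SEGMENT_LEN):
        segment = context_tokens[start:start + SEGMENT_LEN]
        key = hashlib.sha256(bytes(str(segment), "utf-8")).hexdigest()
        if key not in _cache:
            _cache[key] = compress_segment(segment)
        compressed.extend(_cache[key])
    return compressed
```

Because each segment is compressed without attending to the rest of the context, the compressed latents computed for one query can be reused by any later query whose context shares those segments, which is what enables the caching described in the abstract.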

Top-level tags: llm, model training, systems
Detailed tags: context compression, long context, efficiency, kv cache, attention complexity

📄 Paper Summary

CompLLM: Compression for Long Context Q&A


1️⃣ One-Sentence Summary

This paper proposes CompLLM, a compression technique that splits long text into segments and compresses each one independently, significantly improving the speed and efficiency of large language models on long contexts while matching, and in some cases surpassing, the performance obtained with the uncompressed context.
