arXiv submission date: 2026-01-24
📄 Abstract - Clustering-driven Memory Compression for On-device Large Language Models

Large language models (LLMs) often rely on user-specific memories distilled from past interactions to enable personalized generation. A common practice is to concatenate these memories with the input prompt, but this approach quickly exhausts the limited context available in on-device LLMs. Compressing memories by averaging can mitigate context growth, yet it frequently harms performance due to semantic conflicts across heterogeneous memories. In this work, we introduce a clustering-based memory compression strategy that balances context efficiency and personalization quality. Our method groups memories by similarity and merges them within clusters prior to concatenation, thereby preserving coherence while reducing redundancy. Experiments demonstrate that our approach substantially lowers the number of memory tokens while outperforming baseline strategies such as naive averaging or direct concatenation. Furthermore, for a fixed context budget, clustering-driven merging yields more compact memory representations and consistently enhances generation quality.
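The abstract describes grouping memories by similarity and merging only within clusters, so that averaging never mixes semantically conflicting memories. A minimal sketch of that idea is below; the paper does not specify its clustering algorithm, so the greedy threshold-based clustering, the function names, and the threshold value are all illustrative assumptions.

```python
# Hypothetical sketch of clustering-then-merge memory compression.
# The greedy clustering scheme and the 0.8 threshold are assumptions,
# not details taken from the paper.

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def cluster_memories(embeddings, threshold=0.8):
    """Greedy clustering: assign each memory to the first cluster whose
    centroid is similar enough, otherwise start a new cluster."""
    clusters = []   # each entry is a list of memory indices
    centroids = []  # running mean embedding per cluster
    for i, emb in enumerate(embeddings):
        for c, cent in enumerate(centroids):
            if cosine(emb, cent) >= threshold:
                clusters[c].append(i)
                n = len(clusters[c])
                # update the centroid as a running mean
                centroids[c] = [(cv * (n - 1) + ev) / n
                                for cv, ev in zip(cent, emb)]
                break
        else:
            clusters.append([i])
            centroids.append(list(emb))
    return clusters

def merge_within_clusters(embeddings, clusters):
    """Average memory vectors only within each cluster, so dissimilar
    memories are never averaged together."""
    merged = []
    for members in clusters:
        dim = len(embeddings[members[0]])
        avg = [sum(embeddings[m][d] for m in members) / len(members)
               for d in range(dim)]
        merged.append(avg)
    return merged
```

For example, two near-duplicate memory embeddings would collapse into a single merged vector, while an unrelated memory stays in its own cluster, shrinking the total number of memory entries (and hence memory tokens) without the cross-topic interference of naive global averaging.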

Top-level tags: llm systems model training
Detailed tags: memory compression on-device inference clustering personalization context window

Clustering-driven Memory Compression for On-device Large Language Models


1️⃣ One-sentence summary

This paper proposes a clustering-based memory compression method that groups and merges similar user memories, reducing the memory footprint required by on-device large language models while effectively preserving the personalization quality of generated content.

Source: arXiv:2601.17443