arXiv submission date: 2026-01-24
📄 Abstract - Clustering-driven Memory Compression for On-device Large Language Models

Large language models (LLMs) often rely on user-specific memories distilled from past interactions to enable personalized generation. A common practice is to concatenate these memories with the input prompt, but this approach quickly exhausts the limited context available in on-device LLMs. Compressing memories by averaging can mitigate context growth, yet it frequently harms performance due to semantic conflicts across heterogeneous memories. In this work, we introduce a clustering-based memory compression strategy that balances context efficiency and personalization quality. Our method groups memories by similarity and merges them within clusters prior to concatenation, thereby preserving coherence while reducing redundancy. Experiments demonstrate that our approach substantially lowers the number of memory tokens while outperforming baseline strategies such as naive averaging or direct concatenation. Furthermore, for a fixed context budget, clustering-driven merging yields more compact memory representations and consistently enhances generation quality.
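The abstract describes grouping memories by similarity and merging only within clusters, so that averaging never mixes semantically conflicting memories. A minimal sketch of that idea is below; the paper does not specify its clustering algorithm, so the greedy threshold-based clustering, the function names, and the threshold value are all illustrative assumptions.

```python
# Hypothetical sketch of clustering-then-merge memory compression.
# The greedy clustering scheme and the 0.8 threshold are assumptions,
# not details taken from the paper.

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def cluster_memories(embeddings, threshold=0.8):
    """Greedy clustering: assign each memory to the first cluster whose
    centroid is similar enough, otherwise start a new cluster."""
    clusters = []   # each entry is a list of memory indices
    centroids = []  # running mean embedding per cluster
    for i, emb in enumerate(embeddings):
        for c, cent in enumerate(centroids):
            if cosine(emb, cent) >= threshold:
                clusters[c].append(i)
                n = len(clusters[c])
                # update the centroid as a running mean
                centroids[c] = [(cv * (n - 1) + ev) / n
                                for cv, ev in zip(cent, emb)]
                break
        else:
            clusters.append([i])
            centroids.append(list(emb))
    return clusters

def merge_within_clusters(embeddings, clusters):
    """Average memory vectors only within each cluster, so dissimilar
    memories are never averaged together."""
    merged = []
    for members in clusters:
        dim = len(embeddings[members[0]])
        avg = [sum(embeddings[m][d] for m in members) / len(members)
               for d in range(dim)]
        merged.append(avg)
    return merged
```

For example, two near-duplicate memory embeddings would collapse into a single merged vector, while an unrelated memory stays in its own cluster, shrinking the total number of memory entries (and hence memory tokens) without the cross-topic interference of naive global averaging.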

Top-level tags: llm systems model training
Detailed tags: memory compression on-device inference clustering personalization context window

Clustering-driven Memory Compression for On-device Large Language Models


1️⃣ One-sentence summary

This paper proposes a clustering-based memory compression method that groups and merges similar user memories, reducing the memory footprint required by on-device large language models while effectively preserving the personalization quality of generated content.

Source: arXiv:2601.17443