RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
1️⃣ One-Sentence Summary
This paper introduces RealMem, the first benchmark grounded in realistic project scenarios, for evaluating LLMs' memory capabilities in long-term, project-style interactions with dynamically evolving goals. It finds that existing models face significant challenges in managing long-term project state and dynamic context dependencies.
As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture **"long-term project-oriented"** interactions where agents must track evolving goals. To bridge this gap, we introduce **RealMem**, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at this https URL.
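The abstract's three-stage synthesis pipeline can be sketched schematically. The sketch below is an illustrative assumption, not the paper's implementation: stage names follow the abstract, but all class names, function signatures, and the placeholder dialogue turns are hypothetical (the real pipeline presumably drives LLM agents rather than emitting template strings).

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the three pipeline stages named in the
# abstract; structure and naming are illustrative assumptions only.

@dataclass
class ProjectFoundation:
    scenario: str                      # e.g. "event planning"
    goals: list = field(default_factory=list)

@dataclass
class MemoryStore:
    facts: list = field(default_factory=list)     # accumulated project facts
    schedule: list = field(default_factory=list)  # pending/evolving tasks

def construct_foundation(scenario, initial_goals):
    """Stage 1: Project Foundation Construction."""
    return ProjectFoundation(scenario, list(initial_goals))

def generate_session(foundation, memory, turn_budget=4):
    """Stage 2: Multi-Agent Dialogue Generation. In the real pipeline,
    user/assistant agents would be LLMs; here we emit placeholder turns
    that condition on the current memory state."""
    return [f"[{foundation.scenario}] turn {i} referencing "
            f"{len(memory.facts)} remembered facts"
            for i in range(turn_budget)]

def update_memory(memory, session):
    """Stage 3: Memory and Schedule Management. Fold new dialogue
    content back into the evolving long-term project state."""
    memory.facts.extend(session)
    return memory

# Simulate a few cross-session dialogues with dynamically evolving memory.
foundation = construct_foundation("event planning", ["book venue"])
memory = MemoryStore()
sessions = []
for _ in range(3):
    session = generate_session(foundation, memory)
    memory = update_memory(memory, session)
    sessions.append(session)
```

The loop illustrates the key property the benchmark targets: each new session depends on memory accumulated across all earlier sessions, so a model that drops or corrupts project state degrades progressively.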
Source: arXiv:2601.06966