arXiv submission date: 2026-01-17
📄 Abstract - $\texttt{MemoryRewardBench}$: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Existing works increasingly adopt memory-centric mechanisms to process long contexts segment by segment, and effective memory management is a key capability that enables large language models to propagate information across the entire sequence. Leveraging reward models (RMs) to automatically and reliably evaluate memory quality is therefore critical. In this work, we introduce $\texttt{MemoryRewardBench}$, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. $\texttt{MemoryRewardBench}$ covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns and context lengths ranging from 8K to 128K tokens. Evaluations of 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.
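As an illustration of how such an RM evaluation might be scored, here is a minimal Python sketch. It assumes the benchmark provides pairwise preference instances (a context segment plus a preferred and a rejected memory update) and that the reward model can be queried through a scalar scoring function; the names `PreferencePair`, `score`, and `pairwise_accuracy` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One hypothetical benchmark instance: the same context segment paired
    with two candidate memory updates, one annotated as better."""
    context: str          # long-context segment being compressed into memory
    chosen_memory: str    # memory update preferred by the reference annotation
    rejected_memory: str  # the inferior memory update


def pairwise_accuracy(
    pairs: List[PreferencePair],
    score: Callable[[str, str], float],  # RM: (context, memory_update) -> scalar reward
) -> float:
    """Fraction of pairs where the RM ranks the chosen update above the rejected one."""
    if not pairs:
        return 0.0
    correct = sum(
        1
        for p in pairs
        if score(p.context, p.chosen_memory) > score(p.context, p.rejected_memory)
    )
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy stand-in RM that simply prefers longer memory updates, just to exercise the loop.
    toy_rm = lambda ctx, mem: float(len(mem))
    demo = [PreferencePair("segment ...", "detailed summary of key facts", "vague note")]
    print(f"pairwise accuracy: {pairwise_accuracy(demo, toy_rm):.2f}")
```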

Top-level tags: llm model evaluation benchmark
Detailed tags: reward models long-term memory long-context evaluation benchmark memory management

$\texttt{MemoryRewardBench}$: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models


1️⃣ One-Sentence Summary

This paper introduces the first benchmark dedicated to evaluating how well reward models can automatically score large language models' long-term memory management, finds that the performance gap between open-source and proprietary models is narrowing, and exposes the capabilities and limitations of current reward models on this task.

From arXiv: 2601.11969