KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
1️⃣ One-sentence summary
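This paper introduces KnowMe-Bench, a new benchmark that uses real long-form autobiographical narratives to evaluate how deeply AI models understand a person. It finds that current retrieval-based systems mainly improve factual recall but still fall short on temporally grounded explanations and higher-level reasoning, which indicates that future digital companions will need more advanced memory mechanisms.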
Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at this https URL (KnowMeBench).
Source: arXiv: 2601.04745