arXiv submission date: 2025-12-15
📄 Abstract - LikeBench: Evaluating Subjective Likability in LLMs for Personalization

A personalized LLM should remember user facts, apply them correctly, and adapt over time to provide responses that the user prefers. Existing LLM personalization benchmarks are largely centered on two axes: accurately recalling user information and accurately applying remembered information in downstream tasks. We argue that a third axis, likability, is both subjective and central to user experience, yet under-measured by current benchmarks. To measure likability holistically, we introduce LikeBench, a multi-session, dynamic evaluation framework that measures likability across multiple dimensions by how much an LLM can adapt over time to a user's preferences to provide more likable responses. In LikeBench, the LLMs engage in conversation with a simulated user and learn preferences only from the ongoing dialogue. As the interaction unfolds, models try to adapt their responses, and after each turn, they are evaluated for likability across seven dimensions by the same simulated user. To the best of our knowledge, we are the first to decompose likability into multiple diagnostic metrics: emotional adaptation, formality matching, knowledge adaptation, reference understanding, conversation length fit, humor fit, and callback, which makes it easier to pinpoint where a model falls short. To make the simulated user more realistic and discriminative, LikeBench uses fine-grained, psychologically grounded descriptive personas rather than the coarse high/low trait-rating-based personas used in prior work. Our benchmark shows that strong memory performance does not guarantee high likability: DeepSeek R1, with lower memory accuracy (86%, 17 facts/profile), outperformed Qwen3 by 28% on likability score despite Qwen3's higher memory accuracy (93%, 43 facts/profile). Even SOTA models like GPT-5 adapt well in short exchanges but show only limited robustness in longer, noisier interactions.
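The evaluation protocol described above can be sketched as a simple per-turn scoring loop. This is a hypothetical illustration only: the names (`score_turn`, `evaluate_dialogue`, the 1-5 scale, and the placeholder heuristic) are assumptions, not the authors' actual API; in the benchmark itself, the simulated user is an LLM judge conditioned on a descriptive persona.

```python
# Hypothetical sketch of LikeBench's per-turn likability scoring.
# The seven diagnostic dimensions are taken from the abstract; the
# scoring function below is a stand-in, not the real judge.

DIMENSIONS = [
    "emotional_adaptation",
    "formality_matching",
    "knowledge_adaptation",
    "reference_understanding",
    "conversation_length_fit",
    "humor_fit",
    "callback",
]

def score_turn(response: str, persona: dict) -> dict:
    """Stand-in for the simulated user's per-dimension judgment.
    A real implementation would prompt an LLM judge with the persona
    and the dialogue so far; here we return a flat placeholder score
    (assumed 1-5 scale) for every dimension."""
    return {dim: 3 for dim in DIMENSIONS}

def evaluate_dialogue(model_turns: list[str], persona: dict) -> float:
    """Average likability over all turns and all seven dimensions."""
    per_turn = [score_turn(turn, persona) for turn in model_turns]
    total = sum(s for scores in per_turn for s in scores.values())
    return total / (len(per_turn) * len(DIMENSIONS))
```

Averaging per dimension (rather than only overall) is what lets the benchmark pinpoint where a model falls short, e.g. strong formality matching but weak callback.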

Top-level tags: llm benchmark model evaluation
Detailed tags: personalization subjective evaluation likability multi-session dialogue simulated user

LikeBench: Evaluating Subjective Likability in LLMs for Personalization


1️⃣ One-sentence summary

This paper proposes a new evaluation framework called LikeBench, which is the first to decompose "user likability", the core of LLM personalization, into seven measurable dimensions, and finds that a model's accuracy in remembering user facts does not directly translate into its ability to generate likable responses.


Source: arXiv 2512.13077