Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations
1️⃣ One-sentence summary
This paper finds that the "narcissistic" tendency LLMs exhibit when acting as evaluators (i.e., favoring their own outputs) is largely a confound introduced by the difficulty of the evaluation task itself rather than genuine self-preference, and it proposes a corrective baseline method that substantially reduces measurement error.
Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We discover a core methodological confound whose removal could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver seemingly self-preferring verdicts on queries that they themselves answered incorrectly; this holds regardless of whether one of the candidate responses is actually their own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. Evaluating this simple baseline on 37,448 queries, only 51% of initial findings retain statistical significance. Finally, we turn toward characterizing the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More broadly, this work contributes to the growing body of work on cataloging and isolating judge-bias effects.
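The Evaluator Quality Baseline described above can be sketched as a comparison of two vote rates: how often the judge picks its own incorrect response versus how often it picks another model's incorrect response. The following is a minimal illustrative sketch, not the paper's implementation; the counts, function names, and the use of a two-proportion z-statistic for the significance check are all assumptions for demonstration.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-proportion z statistic comparing the rate of votes for the
    judge's own incorrect response (k1/n1) against the rate of votes
    for another model's incorrect response (k2/n2)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)  # pooled proportion under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Toy counts (illustrative, NOT from the paper):
# 120 of 400 verdicts favor the judge's own incorrect response,
# 100 of 400 verdicts favor another model's incorrect response.
excess_self_preference = 120 / 400 - 100 / 400  # raw bias estimate
z = two_proportion_z(120, 400, 100, 400)
print(f"excess self-preference: {excess_self_preference:.3f}, z = {z:.3f}")
```

Under this framing, a self-preference finding survives the baseline only if the judge votes for its own incorrect answers significantly more often than for other models' incorrect answers, rather than simply being unreliable on hard queries.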
Source: arXiv: 2601.22548