📄 Abstract - When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth-based benchmarks. We argue that without tight objectives and verifiable constructions, such benchmarks can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge's overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the Elo-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We release our code and dataset at this https URL
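The schematic-adherence diagnostic can be read as asking how much of a judge's overall verdict is predictable from its own rubric scores, with the remainder treated as unexplained variance. Below is a minimal sketch of that idea, assuming a simple linear regression of verdicts on per-criterion scores; the `unexplained_variance` helper, the regression form, and the synthetic data are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch (not the released code): estimate how much of a judge's
# overall verdict its explicit rubric explains, by regressing verdicts on
# per-criterion scores and reporting 1 - R^2 as unexplained variance.
import numpy as np

def unexplained_variance(criterion_scores: np.ndarray, overall_scores: np.ndarray) -> float:
    """criterion_scores: (n_samples, n_criteria) rubric scores from one judge.
    overall_scores: (n_samples,) the same judge's overall verdicts.
    Returns the fraction of verdict variance not explained by a linear fit
    on the rubric criteria, i.e. 1 - R^2."""
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(len(overall_scores)), criterion_scores])
    coef, *_ = np.linalg.lstsq(X, overall_scores, rcond=None)
    residuals = overall_scores - X @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((overall_scores - overall_scores.mean()) ** 2))
    return ss_res / ss_tot  # = 1 - R^2

# Synthetic example: a judge whose verdicts only weakly track its own rubric
# yields a value near 1, i.e. the schema explains little of the verdict.
rng = np.random.default_rng(0)
criteria = rng.integers(1, 6, size=(200, 4)).astype(float)      # four 1-5 rubric criteria
verdicts = 0.1 * criteria.sum(axis=1) + rng.normal(0, 2, 200)    # weak link to the rubric
print(f"unexplained variance: {unexplained_variance(criteria, verdicts):.2f}")
```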

Top tags: llm benchmark, model evaluation
Detailed tags: llm evaluation, benchmark validity, psychometric analysis, judge reliability, ranking uncertainty

📄 Paper Summary

When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity


1️⃣ One-Sentence Summary

This paper argues that current benchmarks that use large language models as judges suffer from serious design flaws, so that much of their scoring is random noise rather than valid evaluation; it proposes two diagnostic tools to quantify these problems and calls for building more reliable, clearly scoped benchmarks.

