STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems
1️⃣ One-Sentence Summary
This paper proposes a new evaluation framework called STABLEVAL, which models disagreement and confusion patterns among annotators instead of simply taking a majority vote, yielding more stable and reliable system rankings than traditional evaluation methods for AI systems.
Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.
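To make the contrast between majority vote and posterior expected item credit concrete, below is a minimal sketch, assuming binary correctness labels, of a Dawid-Skene-style EM that estimates per-annotator reliability and returns a soft credit per item. This is an illustrative sketch only: the function names, the binary-label setup, and the EM details are assumptions made here for exposition, not the STABLEVAL model itself (which the abstract notes is designed for evaluation rather than hard label recovery).

```python
import numpy as np

def majority_vote(labels):
    """Hard 0/1 credit per item via majority vote over a (n_items, n_annotators) matrix."""
    return (labels.mean(axis=1) > 0.5).astype(float)

def posterior_item_credit(labels, n_iter=50, prior=0.5):
    """Illustrative Dawid-Skene-style EM for binary labels.

    Estimates each annotator's sensitivity/specificity and returns the
    posterior probability that each item is 'correct' (soft credit),
    instead of a hard majority label.  Not the STABLEVAL model itself.
    """
    q = labels.mean(axis=1)  # initialize item posteriors with vote fractions
    for _ in range(n_iter):
        # M-step: per-annotator confusion parameters, weighted by current posteriors
        sens = (q[:, None] * labels).sum(0) / (q.sum() + 1e-9)                    # P(vote=1 | item correct)
        spec = ((1 - q)[:, None] * (1 - labels)).sum(0) / ((1 - q).sum() + 1e-9)  # P(vote=0 | item incorrect)
        sens = np.clip(sens, 1e-3, 1 - 1e-3)
        spec = np.clip(spec, 1e-3, 1 - 1e-3)
        # E-step: posterior expected credit per item given annotator reliabilities
        log_p1 = np.log(prior) + (labels * np.log(sens) + (1 - labels) * np.log(1 - sens)).sum(1)
        log_p0 = np.log(1 - prior) + ((1 - labels) * np.log(spec) + labels * np.log(1 - spec)).sum(1)
        q = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
    return q  # posterior expected item credit in [0, 1]

# Hypothetical usage: rows = items, columns = annotators (1 = judged correct)
labels = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [1, 1, 1]])
print(majority_vote(labels))          # hard per-item credit
print(posterior_item_credit(labels))  # calibrated soft per-item credit
```

Under this sketch, an agent-level score would aggregate the soft credits over that agent's items rather than counting majority-vote wins, which is what makes the resulting rankings less sensitive to which annotators happen to be sampled.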
Source: arXiv: 2605.02122