Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
1️⃣ One-Sentence Summary
This paper proposes applying item response theory from psychometrics, in particular the multi-faceted Rasch model, to analyze and correct systematic biases among human raters in AI evaluation (such as rating severity or centrality), yielding more reliable and faithful estimates of AI model performance; the approach is demonstrated on a summarization task.
Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects (severity and centrality) that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers a more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
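The key mechanism, separating output quality from rater severity, can be made concrete with the standard rating-scale form of the many-facet Rasch model. The notation below (summary quality θ_n, rater severity α_j, category threshold τ_k) is conventional MFRM notation and an assumption on our part; the paper's exact parameterization may differ.

```latex
% Many-facet Rasch model, rating-scale form (standard MFRM notation;
% the paper's exact parameterization is not reproduced here).
% P_{njk} is the probability that rater j assigns category k to summary n.
\log \frac{P_{njk}}{P_{nj(k-1)}}
  = \theta_n   % latent quality of summary n
  - \alpha_j   % severity of rater j (larger = harsher)
  - \tau_k     % step threshold for rating category k
```

Because θ_n and α_j are estimated jointly, a harsh rater's uniformly low scores inflate that rater's α_j rather than depressing θ_n, which is what yields the severity-adjusted quality estimates, while α_j itself provides the per-rater diagnostic the abstract refers to.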
Source: arXiv:2602.22585