DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

📄 Abstract

Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce DualEval, a latent model-item calibration framework that represents models and evaluation items in a shared space, jointly estimating model ability together with item difficulty and sharpness. We apply DualEval across four domains: coding, math, miscellaneous domain-knowledge tasks, and generic everyday user queries. Our evaluation uses 18 frontier LLMs, static benchmark labels, and reward-model scores validated against held-out human preferences for open-ended model responses. Empirically, our framework produces reliable and balanced model rankings, and its learned item-level profiles support downstream applications such as benchmark compression for sample-efficient evaluation and anomaly detection for contamination or outlier analysis. Overall, DualEval unifies static and arena-style evaluation through joint model-item calibration, producing model rankings and item-level diagnostics that support more sample-efficient, interpretable, and auditable evaluation pipelines.
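To make the idea of joint model-item calibration concrete, the following is a minimal sketch, not the paper's method. The abstract does not state the parameterization, so the sketch assumes a two-parameter logistic item response model, P(correct) = sigmoid(a_i * (theta_m - b_i)), where theta_m is model ability, b_i is item difficulty, and a_i is item sharpness. The function name `fit_2pl`, the L2 penalty, the clamp on sharpness, and the toy response matrix are all illustrative assumptions, and only binary correctness labels are handled (no reward-model scores).

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_2pl(responses, steps=500, lr=0.05, reg=0.1):
    """Jointly fit ability (per model), difficulty and sharpness (per item).

    responses[m][i] is 1 if model m solved item i, else 0.
    Assumed model (not taken from the paper):
        P(correct) = sigmoid(a_i * (theta_m - b_i))
    """
    n_models, n_items = len(responses), len(responses[0])
    theta = [0.0] * n_models   # model abilities
    b = [0.0] * n_items        # item difficulties
    log_a = [0.0] * n_items    # a_i = exp(log_a_i) keeps sharpness positive
    for _ in range(steps):
        # Gradients of the log-likelihood minus an L2 penalty (assumption,
        # added so the estimates stay finite on small or separable data).
        g_theta = [-reg * t for t in theta]
        g_b = [-reg * x for x in b]
        g_log_a = [-reg * x for x in log_a]
        for m in range(n_models):
            for i in range(n_items):
                a = math.exp(log_a[i])
                gap = theta[m] - b[i]
                err = responses[m][i] - sigmoid(a * gap)
                g_theta[m] += err * a
                g_b[i] -= err * a
                g_log_a[i] += err * a * gap
        # Simultaneous gradient-ascent step on all parameters.
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [x + lr * g for x, g in zip(b, g_b)]
        # Clamp log-sharpness to [-1, 1] (assumption) for step stability.
        log_a = [max(-1.0, min(1.0, x + lr * g))
                 for x, g in zip(log_a, g_log_a)]
    return theta, b, [math.exp(x) for x in log_a]

# Toy data (hypothetical): 4 models x 6 items, rows ordered strongest first.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
ability, difficulty, sharpness = fit_2pl(responses)
ranking = sorted(range(len(ability)), key=lambda m: -ability[m])
```

Here `ranking` orders models by estimated ability, while `difficulty` and `sharpness` give the per-item profile. Under the same assumptions, items with low sharpness would be natural candidates to drop when compressing a benchmark, and responses that the fitted model considers very unlikely could be flagged for contamination or outlier review.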


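1️⃣ One-Sentence Summary

This paper proposes DualEval, a framework that maps large language models and evaluation items into a shared latent space and jointly estimates model ability, item difficulty, and item discrimination (sharpness). In doing so, it unifies traditional static benchmarks with preference-based evaluation and produces more reliable, more sample-efficient model rankings along with item-level diagnostics.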
