arXiv submission date: 2026-04-30
📄 Abstract - Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). On the full benchmark, most items showed no reliable change (79% and 72%). However, over half the items were at floor or ceiling. Among analysable items, change was bidirectional with large effect sizes: 34% improved and 28% deteriorated for Llama; 47% improved and 39% deteriorated for Qwen (median |delta p| = 0.50 and 0.90). Churn was asymmetric by difficulty: low-accuracy items improved, high-accuracy items deteriorated. Domain-level decomposition revealed family-specific reversals: Llama lost physics while Qwen lost law. Greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of unchanged items. The aggregate accuracy gain is the net residual of opposing item-level movements. We recommend reporting churn rate alongside aggregate accuracy.
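The abstract does not give the paper's exact RCI formula, so the following is only an illustrative sketch: a per-item reliable-change test approximated as a two-proportion z-test over the K=10 samples per version. The function name, the 1.96 threshold, and the treatment of zero-variance items are assumptions, not the authors' method.

```python
import math


def reliable_change(p1: float, p2: float, k: int, z_crit: float = 1.96) -> str:
    """Classify one item's change between two model versions.

    p1, p2: per-item accuracy under version A and B (k samples each).
    Returns 'floor/ceiling' when sampling variance is zero (accuracy
    pinned at 0 or 1 under both versions), else 'improved',
    'deteriorated', or 'no change'.
    """
    # Standard error of the difference between two binomial proportions.
    se_diff = math.sqrt(p1 * (1 - p1) / k + p2 * (1 - p2) / k)
    if se_diff == 0.0:
        # Boundary accuracies give zero SE; such items are excluded
        # from the analysable set in this sketch.
        return "floor/ceiling"
    rci = (p2 - p1) / se_diff  # RCI analogue: change in SE-of-difference units
    if rci > z_crit:
        return "improved"
    if rci < -z_crit:
        return "deteriorated"
    return "no change"
```

With K=10, a swing from 0.2 to 0.9 is flagged as improved, while 0.5 to 0.6 falls inside the no-change band, matching the abstract's point that small per-item shifts are not reliable at this sample size.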

Top-level tags: llm model evaluation
Detailed tags: reliable change index, llm evaluation, item-level analysis, churn rate, mmlu-pro

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation


1️⃣ One-sentence summary

This paper adapts the Reliable Change Index (RCI) from clinical psychology to item-by-item comparison of LLM version upgrades (e.g., Llama 3→3.1 and Qwen 2.5→3). It finds that small gains in average accuracy mask large-scale bidirectional performance churn (some items improve sharply while others regress badly), and that most items show no meaningful change; the authors therefore recommend reporting a churn-rate metric alongside aggregate accuracy.
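The abstract does not define the recommended churn-rate metric precisely; one natural reading is the share of analysable (non-floor/ceiling) items whose change is reliable in either direction. A minimal sketch under that assumed definition, with the label names as placeholders:

```python
from collections import Counter


def churn_rate(labels: list[str]) -> float:
    """Fraction of analysable items with a reliable change.

    labels: per-item classifications, e.g. 'improved', 'deteriorated',
    'no change', or 'floor/ceiling'. Floor/ceiling items are excluded
    from the denominator, mirroring the paper's analysable subset.
    """
    counts = Counter(labels)
    analysable = sum(n for label, n in counts.items() if label != "floor/ceiling")
    if analysable == 0:
        return 0.0
    return (counts["improved"] + counts["deteriorated"]) / analysable
```

Reporting this number next to the aggregate accuracy delta makes visible how much opposing item-level movement the mean is netting out.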

Source: arXiv: 2604.27405