arXiv submission date: 2026-04-30
📄 Abstract - Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). On the full benchmark, most items showed no reliable change (79% and 72%). However, over half the items were at floor or ceiling. Among analysable items, change was bidirectional with large effect sizes: 34% improved and 28% deteriorated for Llama; 47% improved and 39% deteriorated for Qwen (median |delta p| = 0.50 and 0.90). Churn was asymmetric by difficulty: low-accuracy items improved, high-accuracy items deteriorated. Domain-level decomposition revealed family-specific reversals: Llama lost physics while Qwen lost law. Greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of unchanged items. The aggregate accuracy gain is the net residual of opposing item-level movements. We recommend reporting churn rate alongside aggregate accuracy.
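The abstract does not give the paper's exact RCI formula, so the following is only an illustrative sketch: a per-item reliable-change test approximated as a two-proportion z-test over the K=10 samples per version. The function name, the 1.96 threshold, and the treatment of zero-variance items are assumptions, not the authors' method.

```python
import math


def reliable_change(p1: float, p2: float, k: int, z_crit: float = 1.96) -> str:
    """Classify one item's change between two model versions.

    p1, p2: per-item accuracy under version A and B (k samples each).
    Returns 'floor/ceiling' when sampling variance is zero (accuracy
    pinned at 0 or 1 under both versions), else 'improved',
    'deteriorated', or 'no change'.
    """
    # Standard error of the difference between two binomial proportions.
    se_diff = math.sqrt(p1 * (1 - p1) / k + p2 * (1 - p2) / k)
    if se_diff == 0.0:
        # Boundary accuracies give zero SE; such items are excluded
        # from the analysable set in this sketch.
        return "floor/ceiling"
    rci = (p2 - p1) / se_diff  # RCI analogue: change in SE-of-difference units
    if rci > z_crit:
        return "improved"
    if rci < -z_crit:
        return "deteriorated"
    return "no change"
```

With K=10, a swing from 0.2 to 0.9 is flagged as improved, while 0.5 to 0.6 falls inside the no-change band, matching the abstract's point that small per-item shifts are not reliable at this sample size.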

Top-level tags: llm model evaluation
Detailed tags: reliable change index, llm evaluation, item-level analysis, churn rate, mmlu-pro

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation


1️⃣ One-sentence summary

This paper adapts the Reliable Change Index (RCI) from clinical psychology to item-by-item comparison of LLM version upgrades (e.g., Llama 3→3.1 and Qwen 2.5→3). It finds that small gains in average accuracy mask large-scale bidirectional performance churn (some items improve sharply while others regress badly), and that most items show no meaningful change; the authors therefore recommend reporting a churn-rate metric alongside aggregate accuracy.
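The abstract does not define the recommended churn-rate metric precisely; one natural reading is the share of analysable (non-floor/ceiling) items whose change is reliable in either direction. A minimal sketch under that assumed definition, with the label names as placeholders:

```python
from collections import Counter


def churn_rate(labels: list[str]) -> float:
    """Fraction of analysable items with a reliable change.

    labels: per-item classifications, e.g. 'improved', 'deteriorated',
    'no change', or 'floor/ceiling'. Floor/ceiling items are excluded
    from the denominator, mirroring the paper's analysable subset.
    """
    counts = Counter(labels)
    analysable = sum(n for label, n in counts.items() if label != "floor/ceiling")
    if analysable == 0:
        return 0.0
    return (counts["improved"] + counts["deteriorated"]) / analysable
```

Reporting this number next to the aggregate accuracy delta makes visible how much opposing item-level movement the mean is netting out.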

Source: arXiv: 2604.27405