arXiv submission date: 2026-01-07
📄 Abstract - Benchmark^2: Systematic Evaluation of LLM Benchmarks

The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark's ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.
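To make the first metric concrete, the sketch below shows one plausible way a cross-benchmark ranking-consistency score could be computed: rank the models by score on each benchmark and average the Kendall rank correlation between a target benchmark's ranking and those of its peers. The toy data, the choice of Kendall's tau, and the averaging scheme are illustrative assumptions, not the paper's actual definition.

```python
from scipy.stats import kendalltau

# Hypothetical per-benchmark accuracies: scores[benchmark][model] -> accuracy.
scores = {
    "bench_A": {"m1": 0.81, "m2": 0.74, "m3": 0.62},
    "bench_B": {"m1": 0.78, "m2": 0.70, "m3": 0.65},
    "bench_C": {"m1": 0.55, "m2": 0.69, "m3": 0.40},
}
models = ["m1", "m2", "m3"]

def ranking_consistency(target: str) -> float:
    """Average Kendall tau between `target`'s model ranking and each peer benchmark's."""
    target_scores = [scores[target][m] for m in models]
    taus = []
    for peer in scores:
        if peer == target:
            continue
        peer_scores = [scores[peer][m] for m in models]
        tau, _ = kendalltau(target_scores, peer_scores)  # rank correlation in [-1, 1]
        taus.append(tau)
    return sum(taus) / len(taus)

for bench in scores:
    print(f"{bench}: {ranking_consistency(bench):.3f}")
```

A benchmark whose ranking agrees with most of its peers gets a score near 1; a benchmark that orders models very differently (bench_C above) scores noticeably lower.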

Top-level tags: llm benchmark model evaluation
Detailed tags: benchmark evaluation ranking consistency discriminability capability alignment test set reduction

The Benchmark of Benchmarks: A Systematic Evaluation of LLM Evaluation Benchmarks / Benchmark^2: Systematic Evaluation of LLM Benchmarks


1️⃣ One-sentence summary

This paper proposes a framework called Benchmark^2 for assessing the quality of LLM evaluation benchmarks themselves. It finds that benchmark quality varies considerably across existing benchmarks and shows that selecting test items with its metrics can substantially shrink the test set without degrading evaluation performance.
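A tiny illustrative check of the third metric's intuition: within one model family, flag the instances that the stronger model gets wrong while the weaker model gets right. The model names, data layout, and the simple filtering logic here are assumptions for illustration; the paper's Capability Alignment Deviation may be defined differently.

```python
# Hypothetical per-instance correctness: results[model][instance_id] -> solved or not.
results = {
    "family_x_large": {"q1": True, "q2": False, "q3": True},
    "family_x_small": {"q1": True, "q2": True,  "q3": False},
}

def deviating_instances(strong: str, weak: str) -> list[str]:
    """Instances the stronger model misses but the weaker model of the same family solves."""
    return [
        qid
        for qid in results[strong]
        if not results[strong][qid] and results[weak][qid]
    ]

print(deviating_instances("family_x_large", "family_x_small"))  # -> ['q2']
```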

Source: arXiv: 2601.03986