arXiv submission date: 2026-02-12
📄 Abstract - Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a purely data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating the remaining headroom before ceiling effects erode resolution, and thus the benchmark's expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next-generation evaluation protocols.
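The abstract names the three axes but does not give their formulas. As a rough illustration only, the sketch below uses simple stand-in proxies for each axis: score spread relative to an assumed noise level, headroom below the ceiling, and share of surveyed reports that adopt the benchmark. The function names, the `noise_std` and `max_score` parameters, and the proxies themselves are assumptions made for this example and are not the paper's definitions.

```python
from statistics import pstdev


def discrimination(scores, noise_std):
    """Toy proxy (assumed): spread of model scores relative to a noise level."""
    if noise_std <= 0:
        raise ValueError("noise_std must be positive")
    return pstdev(scores) / noise_std


def anti_saturation(scores, max_score=100.0):
    """Toy proxy (assumed): fraction of the scale left above the best score."""
    return (max_score - max(scores)) / max_score


def impact(adopting_reports, total_reports):
    """Toy proxy (assumed): share of surveyed reports that use the benchmark."""
    if total_reports <= 0:
        raise ValueError("total_reports must be positive")
    return adopting_reports / total_reports


if __name__ == "__main__":
    # Hypothetical scores of four models on one benchmark (0-100 scale).
    scores = [62.0, 71.5, 80.0, 88.5]
    print("discrimination:", discrimination(scores, noise_std=1.5))
    print("anti-saturation:", anti_saturation(scores))
    print("impact:", impact(adopting_reports=30, total_reports=91))
```

The three values are deliberately kept separate here: the abstract does not say how the axes are combined into a single index, so no aggregation is attempted.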

Top-level tags: llm benchmark model evaluation
Detailed tags: benchmark health evaluation framework score inflation capability discrimination benchmark lifecycle

Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs


1️⃣ One-sentence summary

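This paper proposes a data-driven framework called the "Benchmark Health Index" that assesses benchmarks along three dimensions (discrimination, sustainability, and impact) to address the declining reliability of LLM evaluation caused by score inflation and selective reporting, providing a quantitative basis for selecting and managing evaluation benchmarks in a principled way.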

Source: arXiv: 2602.11674