arXiv submission date: 2026-01-24
📄 Abstract - ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs

Large Language Models (LLMs) have achieved rapid progress in Chinese language understanding, yet accurately evaluating their capabilities remains challenging due to benchmark saturation and prohibitive computational costs. While static leaderboards provide snapshot rankings, they often mask the structural trade-offs between capabilities. In this work, we present ReLE (Robust Efficient Live Evaluation), a scalable system designed to diagnose Capability Anisotropy, the non-uniformity of model performance across domains. Using ReLE, we evaluate 304 models (189 commercial, 115 open-source) across a Domain $\times$ Capability orthogonal matrix comprising 207,843 samples. We introduce two methodological contributions to address current evaluation pitfalls: (1) a Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; (2) a Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, which reduces compute costs by 70\% compared to full-pass evaluations while maintaining a ranking correlation of $\rho=0.96$. Our analysis reveals that aggregate rankings are highly sensitive to weighting schemes: models exhibit a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus $\sim$5.0 in traditional benchmarks, confirming that modern models are highly specialized rather than generally superior. We position ReLE not as a replacement for comprehensive static benchmarks, but as a high-frequency diagnostic monitor for the evolving model landscape.
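The abstract's scheduler builds on Neyman allocation, a classic stratified-sampling rule in which each stratum's sample budget is proportional to its size times its standard deviation, so that high-variance strata get more samples. The paper's noise-corrected variant is not described here; below is a minimal sketch of plain Neyman allocation only, with all function and variable names hypothetical.

```python
def neyman_allocation(strata_sizes, strata_stds, total_budget):
    """Classic Neyman allocation (sketch, not the paper's noise-corrected
    scheduler): allocate total_budget samples across strata in proportion
    to stratum size times stratum standard deviation."""
    weights = [n * s for n, s in zip(strata_sizes, strata_stds)]
    total_weight = sum(weights)
    return [round(total_budget * w / total_weight) for w in weights]

# Example: a high-variance stratum receives proportionally more budget.
allocation = neyman_allocation(
    strata_sizes=[100, 100],   # equal-size strata
    strata_stds=[1.0, 3.0],    # second stratum is noisier
    total_budget=40,
)
# allocation -> [10, 30]
```

In an evaluation setting, each (domain, capability) cell could be treated as a stratum; spending more of the query budget on noisy cells is what lets a partial evaluation preserve ranking fidelity.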

Top-level tags: llm model evaluation benchmark
Detailed tags: capability anisotropy live evaluation scalable system ranking stability chinese language understanding

ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs


1️⃣ One-sentence summary

This paper presents ReLE, a scalable evaluation system that uses novel scoring and scheduling methods to efficiently diagnose "capability anisotropy", the uneven performance of hundreds of Chinese large language models across different domains and tasks, revealing that current models tend to be specialized rather than uniformly superior.

Source: arXiv 2601.17399