arXiv submission date: 2025-12-24
📄 Abstract - LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

The rapid proliferation of Large Language Models (LLMs) and diverse specialized benchmarks necessitates a shift from fragmented, task-specific metrics to a holistic, competitive ranking system that effectively aggregates performance across multiple ability dimensions. Current evaluation methods, which rely primarily on static scoring, are fundamentally limited. They struggle to determine the proper mix ratio across diverse benchmarks, and critically, they fail to capture a model's dynamic competitive fitness or its vulnerability when confronted with sequential, high-stakes tasks. To address this, we introduce the novel Competitive Swiss-System Dynamics (CSD) framework. CSD simulates a multi-round, sequential contest in which models are dynamically paired across a curated sequence of benchmarks based on their accumulated win-loss record. Monte Carlo simulation ($N=100,000$ iterations) is used to approximate the statistically robust Expected Win Score ($E[S_m]$), which eliminates the noise of random pairing and early-round luck. Furthermore, we implement a Failure Sensitivity Analysis by parameterizing the per-round elimination quantity ($T_k$), which allows us to profile models by their risk appetite, distinguishing robust generalists from aggressive specialists. We demonstrate that CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models, representing a vital step towards risk-informed, next-generation LLM evaluation.
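
The mechanics described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pairing rule (neighbours in the standings, ties in record broken at random), the handling of an odd model out (no game that round), the elimination rule (drop the `T_k` models with the worst record) and the toy scores are all assumptions made for illustration. The names `swiss_run`, `expected_win_score` and `toy_scores` are hypothetical.

```python
import random


def swiss_run(scores, benchmarks, eliminate, rng):
    """Play one Swiss-style contest and return the win count per model.

    scores:     {model: {benchmark: score}} (toy numbers, not paper data)
    benchmarks: ordered benchmark names, one per round
    eliminate:  T_k values, how many models to drop after round k
    """
    wins = {m: 0 for m in scores}
    alive = list(scores)
    for bench, t_k in zip(benchmarks, eliminate):
        # Shuffle, then stable-sort by wins: equal records are ordered at random.
        rng.shuffle(alive)
        alive.sort(key=lambda m: wins[m], reverse=True)
        # Pair neighbours in the standings; with an odd count the last
        # model simply sits out this round (assumed rule).
        for a, b in zip(alive[0::2], alive[1::2]):
            sa, sb = scores[a][bench], scores[b][bench]
            if sa == sb:
                winner = rng.choice([a, b])
            else:
                winner = a if sa > sb else b
            wins[winner] += 1
        # Assumed elimination rule: drop the t_k models with the worst record.
        if t_k > 0:
            rng.shuffle(alive)
            alive.sort(key=lambda m: wins[m], reverse=True)
            alive = alive[:max(len(alive) - t_k, 1)]
    return wins


def expected_win_score(scores, benchmarks, eliminate, n_iter=2000, seed=0):
    """Monte Carlo estimate of E[S_m]: the mean win count over n_iter contests."""
    rng = random.Random(seed)
    totals = {m: 0 for m in scores}
    for _ in range(n_iter):
        for m, w in swiss_run(scores, benchmarks, eliminate, rng).items():
            totals[m] += w
    return {m: totals[m] / n_iter for m in scores}


# Toy data: model "A" is best on every benchmark, the others are mixed.
toy_scores = {
    "A": {"math": 0.90, "code": 0.95, "chat": 0.90},
    "B": {"math": 0.70, "code": 0.75, "chat": 0.60},
    "C": {"math": 0.50, "code": 0.90, "chat": 0.40},
    "D": {"math": 0.60, "code": 0.30, "chat": 0.80},
}
rounds = ["math", "code", "chat"]

no_elim = expected_win_score(toy_scores, rounds, eliminate=[0, 0, 0])
with_elim = expected_win_score(toy_scores, rounds, eliminate=[1, 1, 0])
print(no_elim)
print(with_elim)
```

Comparing the two estimates across different `eliminate` schedules is one simple way to probe how sensitive each model's expected score is to early losses, in the spirit of the Failure Sensitivity Analysis the abstract mentions. The abstract's $N=100,000$ corresponds to `n_iter`; the smaller default here only keeps the sketch fast.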

Top-level tags: llm benchmark model evaluation
Detailed tags: competitive ranking dynamic evaluation monte carlo simulation risk analysis multi-benchmark aggregation

LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics


1️⃣ One-sentence summary

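This paper proposes a new evaluation framework called "Competitive Swiss-System Dynamics" that simulates a multi-round contest to dynamically assess the overall capability and risk appetite of large language models; compared with traditional static scoring methods, it yields a more fine-grained model ranking that better reflects real competitive conditions.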

Source: arXiv: 2512.21010