📄 Paper Summary
CodeClash: Benchmarking Goal-Oriented Software Engineering
1️⃣ One-Sentence Summary
This paper introduces CodeClash, a benchmark that uses multi-round tournaments to evaluate language models' ability to autonomously improve code toward open-ended objectives. The results show that current models have clear weaknesses in strategic planning and long-term codebase maintenance, and they cannot yet match expert human programmers.
2️⃣ Abstract
Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.
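The abstract describes each tournament round as two phases: agents edit their own codebases, then the codebases compete head-to-head in an arena that decides winners by a competitive objective. The sketch below illustrates only that loop structure. All names in it (Agent, run_match, run_tournament, the dummy win rule, the round count) are hypothetical illustrations, not CodeClash's actual harness or API.

```python
"""Minimal sketch of a CodeClash-style tournament loop, assuming only the
two-phase round structure described in the abstract. Every identifier here
is hypothetical; the real benchmark's interfaces may differ entirely."""
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Agent:
    """A competitor: an LM-driven agent that maintains its own codebase."""
    name: str
    codebase: dict = field(default_factory=dict)  # path -> file contents
    notes: list = field(default_factory=list)     # free-form strategy notes

    def edit(self, last_round_logs: str) -> None:
        # Placeholder for the edit phase: in the benchmark the model reads
        # docs, competition logs, and its own notes, then rewrites its code
        # however it chooses, with no explicit guidance.
        self.notes.append(last_round_logs[-80:])
        self.codebase[f"strategy_{len(self.codebase)}.py"] = "# revised strategy"


def run_match(a: Agent, b: Agent) -> tuple[str, str]:
    """Placeholder arena: pit two codebases against each other and return
    (winner_name, match_log). A real arena would score objectives such as
    score maximization, resource acquisition, or survival."""
    winner = max((a, b), key=lambda ag: len(ag.codebase)).name  # dummy rule
    return winner, f"{a.name} vs {b.name} -> {winner} wins"


def run_tournament(agents: list[Agent], rounds: int = 15) -> dict[str, int]:
    """Multi-round tournament: each round, every agent edits its codebase,
    then all pairs compete head-to-head; wins are tallied across rounds."""
    wins = {ag.name: 0 for ag in agents}
    logs = ""
    for _ in range(rounds):
        for ag in agents:                # phase 1: autonomous code editing
            ag.edit(logs)
        for i, a in enumerate(agents):   # phase 2: head-to-head competition
            for b in agents[i + 1:]:
                winner, logs = run_match(a, b)
                wins[winner] += 1
    return wins


if __name__ == "__main__":
    print(run_tournament([Agent("model_a"), Agent("model_b")]))
```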