SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
1️⃣ One-Sentence Summary
This paper introduces a new benchmark called SWE-EVO, which simulates realistic long-horizon software evolution tasks requiring multi-step modifications across many files, and finds that the performance of today's state-of-the-art AI coding models on such complex tasks falls far short of their ability to handle single, isolated issues.
Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
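The abstract describes Fix Rate only as a fine-grained partial-progress metric; the exact formula is not given here. A minimal sketch, assuming Fix Rate is the fraction of initially failing tests in an instance's suite that pass after the agent's changes (a hypothetical formulation, not the paper's confirmed definition):

```python
def fix_rate(initially_failing: set[str], passing_after: set[str]) -> float:
    """Hypothetical Fix Rate: share of initially failing tests now passing.

    initially_failing: test IDs that fail before the agent's patch.
    passing_after: test IDs that pass after the agent's patch.
    """
    if not initially_failing:
        # Nothing to fix; treat the instance as fully resolved.
        return 1.0
    fixed = initially_failing & passing_after
    return len(fixed) / len(initially_failing)


# Example: the agent fixes 2 of 4 failing tests -> 0.5
score = fix_rate({"t1", "t2", "t3", "t4"}, {"t1", "t2", "t9"})
```

Unlike a binary resolution rate, a ratio like this credits an agent that makes partial headway on a multi-step task, which matters when full resolution is rare (21 percent here).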
Source: arXiv: 2512.18470