📄 Abstract - NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo-Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo-Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
1️⃣ One-Sentence Summary
This paper introduces NL2Repo-Bench, a new benchmark designed specifically to evaluate the long-horizon, end-to-end ability of coding agents to build a complete, installable Python library from scratch given only a single natural-language requirements document; experiments show that current state-of-the-art models perform poorly on this task, revealing long-horizon planning and cross-file coordination as core challenges for autonomous programming.
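To make the evaluation protocol described above more concrete, below is a minimal sketch of how a generated repository could be scored by its test pass rate, assuming a pytest-based held-out test suite. The function name, directory layout, and summary-line parsing are illustrative assumptions, not the benchmark's actual harness.

```python
import re
import subprocess
import sys
from pathlib import Path


def score_generated_repo(repo_dir: Path, tests_dir: Path) -> float:
    """Install an agent-generated repository and return its test pass rate.

    Illustrative sketch only: the real harness, test layout, and scoring
    rules of NL2Repo-Bench are assumptions here, not taken from the paper.
    """
    # Editable install of the generated package; a broken install scores 0.
    install = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-e", str(repo_dir)],
        capture_output=True, text=True,
    )
    if install.returncode != 0:
        return 0.0

    # Run the held-out test suite quietly and capture pytest's summary line.
    result = subprocess.run(
        [sys.executable, "-m", "pytest", str(tests_dir), "-q", "--tb=no"],
        capture_output=True, text=True,
    )

    # Parse counts like "12 passed, 3 failed, 1 error" from the summary.
    counts = {
        kind: int(n)
        for n, kind in re.findall(r"(\d+) (passed|failed|error)", result.stdout)
    }
    passed = counts.get("passed", 0)
    total = sum(counts.values())
    return passed / total if total else 0.0
```

Averaging such per-repository pass rates over all benchmark tasks would yield the kind of aggregate figure quoted in the abstract (e.g., the sub-40% average pass rates of the strongest agents).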