Abstract - Needle in the Repo: A Benchmark for Maintainability in AI-Generated Repository Edits
AI coding agents can now complete complex programming tasks, but existing evaluations largely emphasize behavioral correctness and often overlook maintainability risks such as weak modularity or testability. We present Needle in the Repo (NITR), a diagnostic probe-and-oracle framework for evaluating whether behaviorally correct repository edits preserve maintainable structure. NITR distills recurring software engineering wisdom into controlled probes embedded in small, realistic multi-file codebases, each designed so that success depends primarily on one targeted maintainability dimension. Each probe is paired with a hidden evaluation harness that combines functional tests for required behavior with structural oracles that encode the targeted maintainability constraint and return interpretable diagnoses. Using NITR, we evaluate 23 coding configurations across GPT, Claude, Gemini, and Qwen families in both direct-inference and agent-based settings. Current AI coding systems remain far from robust: on average, configurations solve only 36.2% of cases, the best reaches 57.1%, and performance drops from 53.5% on micro cases to 20.6% on multi-step cases. The hardest pressures are architectural rather than local edits, especially dependency control (4.3%) and responsibility decomposition (15.2%). Moreover, 64/483 outcomes (13.3%) pass all functional tests yet fail the structural oracle. Under our harness, agent-mode configurations improve average performance from 28.2% to 45.0%, but do not eliminate these architectural failures. These results show that progress in code generation is not yet progress in maintainable code evolution, and that NITR exposes a critical failure surface missed by conventional evaluation.
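The probe-and-oracle pairing described above can be illustrated with a minimal sketch. It is not the NITR harness, which the abstract describes as hidden; every name here (`Verdict`, `forbidden_imports`, `evaluate`, the `total` function, and the rule that pricing code must not import `json`) is a hypothetical stand-in chosen for illustration. The sketch assumes a dependency-control probe whose structural oracle inspects imports with Python's `ast` module, and it shows how an edit can pass the functional test yet fail the structural check with an interpretable diagnosis.

```python
import ast
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    functional_pass: bool
    structural_pass: bool
    diagnosis: str

    @property
    def solved(self) -> bool:
        # A case counts as solved only if both checks pass.
        return self.functional_pass and self.structural_pass


def forbidden_imports(source: str, forbidden: set[str]) -> list[str]:
    """Return the forbidden top-level modules that `source` imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.add(root)
    return sorted(found)


def evaluate(source: str,
             functional_test: Callable[[dict], bool],
             forbidden: set[str]) -> Verdict:
    """Run a functional test and a structural oracle on one edited module."""
    namespace: dict = {}
    exec(source, namespace)  # load the edited module (illustration only)
    functional_pass = functional_test(namespace)

    violations = forbidden_imports(source, forbidden)
    if violations:
        diagnosis = "dependency control violated: imports " + ", ".join(violations)
    else:
        diagnosis = "no forbidden dependencies"
    return Verdict(functional_pass, not violations, diagnosis)


# Hypothetical edit: behaviorally correct, but it pulls in a module that the
# assumed layering rule forbids (`json` stands in for a serialization layer).
EDIT = """
import json

def total(prices):
    return sum(prices)
"""

verdict = evaluate(EDIT, lambda ns: ns["total"]([1, 2]) == 3, {"json"})
print(verdict.functional_pass, verdict.structural_pass, verdict.diagnosis)
```

Here the edit lands in the category the abstract reports as 64 of 483 outcomes: functionally correct, structurally failing. A test-only evaluation would count it as a success.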
The "Embroidery Needle" in the Code Repository: A Benchmark for Evaluating the Maintainability of AI-Generated Repository Edits /
Needle in the Repo: A Benchmark for Maintainability in AI-Generated Repository Edits
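1️⃣ One-Sentence Summary
This paper proposes NITR, a new benchmark framework that evaluates whether AI coding assistants, when carrying out code-modification tasks, can keep the code functionally correct while preserving long-term maintainability properties such as modularity and testability. It finds that current mainstream AI systems remain weak in this respect, especially on complex architectural changes.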