📄 Abstract - SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static analysis, coverage tooling, and execution validation to both confirm expert speedup baselines and identify relevant repository unit tests. Empirical evaluation of state-of-the-art agents reveals significant underperformance. On average, agents achieve less than 0.15x the expert speedup: agents struggle in localizing optimization opportunities, reasoning about execution across functions, and maintaining correctness in proposed edits. We release the benchmark and accompanying data pipeline to facilitate research on automated performance engineering and long-horizon software reasoning.
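The headline result ("less than 0.15x the expert speedup") suggests a score that normalizes the agent's measured speedup by the expert's speedup on the same workload, gated on passing the same unit tests. Below is a minimal sketch of that kind of metric; the function names, timing protocol, and zero-score gating are illustrative assumptions, not the benchmark's actual harness.

```python
import time
from typing import Callable

def measure_runtime(workload: Callable[[], None], repeats: int = 5) -> float:
    """Best-of-N wall-clock runtime of a workload, to damp timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best

def speedup(baseline_seconds: float, patched_seconds: float) -> float:
    """Speedup of a patched repository over the unpatched baseline (>1.0 is faster)."""
    return baseline_seconds / patched_seconds

def normalized_speedup(agent_speedup: float, expert_speedup: float,
                       tests_pass: bool) -> float:
    """Agent speedup relative to the expert's on the same workload.

    1.0 means the agent matched the expert; the abstract reports agents
    averaging below 0.15 on a measure of this kind. A patch that breaks
    the repository's unit tests gets no credit (assumed gating).
    """
    if not tests_pass:
        return 0.0
    return agent_speedup / expert_speedup

# Illustrative numbers only: the baseline workload runs in 10 s, the expert
# patch brings it to 2 s (5x), the agent patch to 8 s (1.25x) -> score 0.25.
print(normalized_speedup(speedup(10.0, 8.0), speedup(10.0, 2.0), tests_pass=True))
```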

Top tags: llm agents benchmark
Detailed tags: software optimization, code reasoning, performance engineering, repository-level evaluation, automated patching

📄 Paper Summary

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?


1️⃣ One-Sentence Summary

This paper introduces SWE-fficiency, a benchmark for evaluating AI models' ability to speed up real-world code repositories, and finds that current state-of-the-art models perform far below human experts; the main difficulties lie in localizing performance bottlenecks and preserving code correctness.

