arXiv submission date: 2025-12-19
📄 Abstract - SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a subset of 1,782 instances of this benchmark, today's strongest models perform as follows: claude-sonnet-4.5 achieves 36.20% pass@10, gpt-5-2025-08-07 34.57%, gemini/gemini-2.5-pro 24.92%, and gpt-4o 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.
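The abstract reports pass@10 scores but does not say how they are computed. As a minimal sketch, assuming the widely used unbiased pass@k estimator (n attempts per instance, c of them passing), the metric could be calculated as follows. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative, not taken from the paper.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one instance.

    n: attempts sampled, c: attempts that passed, k: budget being scored.
    Assumption: this is the standard estimator 1 - C(n-c, k) / C(n, k);
    the source does not confirm it is the one used for SWE-Bench++.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark instance."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no instances to score")
    return sum(scores) / len(scores)
```

When exactly ten attempts are sampled per instance (n = k = 10), the estimator reduces to "at least one of the ten attempts passed", so the benchmark score is simply the fraction of instances solved at least once.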

Top-level tags: llm, systems, benchmark
Detailed tags: code generation, software engineering, automated evaluation, multi-language, test oracle

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories


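1️⃣ One-Sentence Summary

SWE-Bench++ is an automated framework that generates executable software engineering benchmark tasks from real GitHub pull requests at scale and across multiple languages, and, through methods such as a novel state-differential test oracle and hint-guided trajectory synthesis, it significantly improves the benchmark's scale, diversity, and reliability as well as its usefulness for improving models.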


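2️⃣ Key Contributions

1. Automated multilingual benchmark generation framework

2. State-differential test oracle and task classification

3. Hybrid-architecture environment synthesis and adaptive log parsing

4. Four-layer automated quality assurance (AutoQA) pipeline

5. Hint-guided trajectory synthesis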
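This summary names the state-differential test oracle without describing how it works. One plausible reading, offered here only as a sketch under stated assumptions, is that the test suite runs once before and once after the pull request is applied, and the oracle is derived from the difference in per-test outcomes. The status strings, the helper names `diff_test_states` and `is_usable_instance`, and the usability rule are all hypothetical, not the paper's definitions.

```python
from typing import Dict, List

PASS = "pass"  # hypothetical status label; any other value counts as not passing


def diff_test_states(
    before: Dict[str, str], after: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare per-test outcomes before and after a PR is applied.

    A test absent from `before` (for example, one added by the PR, or one
    that could not run) is treated as not passing before the change.
    """
    fail_to_pass: List[str] = []
    pass_to_pass: List[str] = []
    for test, status in sorted(after.items()):
        if status != PASS:
            continue  # tests that do not pass afterwards cannot serve as oracles
        if before.get(test) == PASS:
            pass_to_pass.append(test)  # guards against regressions
        else:
            fail_to_pass.append(test)  # demonstrates the PR's effect
    return {"fail_to_pass": fail_to_pass, "pass_to_pass": pass_to_pass}


def is_usable_instance(diff: Dict[str, List[str]]) -> bool:
    """Assumed rule: keep an instance only if some test flips to passing."""
    return len(diff["fail_to_pass"]) > 0


before = {"test_parse": "pass", "test_edge": "fail"}
after = {"test_parse": "pass", "test_edge": "pass", "test_new": "pass"}
diff = diff_test_states(before, after)
print(diff, is_usable_instance(diff))
```

Treating missing tests as "not passing before" lets the same comparison cover feature requests, where the relevant tests may not exist until the PR introduces them.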


3️⃣ Main Results and Value

Result Highlights

Practical Value


4️⃣ Glossary

Source: arXiv: 2512.17419