arXiv submission date: 2026-01-16
📄 Abstract - ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services and passing external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at this https URL.
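To make the evaluation protocol described above concrete, here is a minimal, hypothetical sketch (not the authors' actual harness) of how an external end-to-end API test against a containerized service could be run: the agent-produced repository is built into a Docker image, the service is started, and a grader issues HTTP requests and checks the responses. All names (image tag, port, endpoints, expected payload) are illustrative assumptions.

```python
import subprocess
import time

import requests

IMAGE_TAG = "abc-bench-task:candidate"   # hypothetical image name
SERVICE_PORT = 8080                      # hypothetical port the task expects
HEALTH_URL = f"http://localhost:{SERVICE_PORT}/health"
TEST_URL = f"http://localhost:{SERVICE_PORT}/api/items"


def run_e2e_check(repo_dir: str) -> bool:
    """Build the agent's repository into a container, start the service,
    and run one illustrative end-to-end API check against it."""
    # Build the containerized service from the repository the agent produced.
    subprocess.run(["docker", "build", "-t", IMAGE_TAG, repo_dir], check=True)

    # Start the service in the background, mapping the container port to the host.
    container_id = subprocess.run(
        ["docker", "run", "-d", "-p", f"{SERVICE_PORT}:{SERVICE_PORT}", IMAGE_TAG],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

    try:
        # Poll the (assumed) health endpoint until the service becomes reachable.
        for _ in range(30):
            try:
                if requests.get(HEALTH_URL, timeout=2).status_code == 200:
                    break
            except requests.ConnectionError:
                time.sleep(1)
        else:
            return False  # service never came up

        # Issue an external API request and verify the response contract.
        resp = requests.post(TEST_URL, json={"name": "widget"}, timeout=5)
        return resp.status_code == 201 and resp.json().get("name") == "widget"
    finally:
        # Always tear the container down after grading, regardless of outcome.
        subprocess.run(["docker", "rm", "-f", container_id], check=False)


if __name__ == "__main__":
    print("pass" if run_e2e_check("./candidate_repo") else "fail")
```

In this kind of setup the test suite lives outside the repository being graded, so the agent cannot trivially satisfy it; only a correctly configured, deployed, and functioning service passes.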

Top-level tags: agents benchmark llm
Detailed tags: agentic coding backend development code generation evaluation software engineering

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development


1️⃣ One-Sentence Summary

This paper proposes a new benchmark, ABC-Bench, designed to evaluate AI agents' end-to-end coding ability across the full real-world backend development workflow (from code exploration to service deployment), and finds that even current state-of-the-art models still perform poorly on these practical engineering tasks.

Source: arXiv 2601.11077