arXiv submission date: 2026-01-07
📄 Abstract - Agentic Rubrics as Contextual Verifiers for SWE Agents

Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
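To make the described verification flow concrete, below is a minimal Python sketch of rubric-based patch selection under parallel TTS. All names and structures here (`build_rubric`, `ask_expert_agent`, `score_criterion`, the uniform criterion weights) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch: score candidate patches against a context-grounded
# rubric checklist and pick the best one, with no test execution.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    # One checklist criterion grounded in repository context, plus a weight.
    criterion: str
    weight: float


def build_rubric(issue: str,
                 ask_expert_agent: Callable[[str], List[str]]) -> List[RubricItem]:
    """Ask an expert agent (which has explored the repo) for concrete,
    codebase-specific criteria; weight them uniformly for simplicity."""
    criteria = ask_expert_agent(issue)
    return [RubricItem(c, 1.0 / len(criteria)) for c in criteria]


def score_patch(patch: str, rubric: List[RubricItem],
                score_criterion: Callable[[str, str], float]) -> float:
    """Weighted sum of per-criterion judgments, each in [0, 1]."""
    return sum(item.weight * score_criterion(patch, item.criterion) for item in rubric)


def select_best(patches: List[str], rubric: List[RubricItem],
                score_criterion: Callable[[str, str], float]) -> str:
    """Parallel TTS selection: keep the candidate with the highest rubric score."""
    return max(patches, key=lambda p: score_patch(p, rubric, score_criterion))


if __name__ == "__main__":
    # Toy stand-ins for the agentic components (assumptions for illustration).
    def ask_expert_agent(issue: str) -> List[str]:
        return ["Fix handles the reported edge case",
                "Change stays inside the affected module",
                "No existing public API is broken"]

    def score_criterion(patch: str, criterion: str) -> float:
        # A real system would use an LLM judge reading the patch and criterion;
        # here we pretend longer patches satisfy more criteria.
        return min(1.0, len(patch) / 100)

    rubric = build_rubric("Crash when the input list is empty", ask_expert_agent)
    candidates = ["small patch", "a somewhat longer candidate patch " * 3]
    print(select_best(candidates, rubric, score_criterion))
```

In this sketch the rubric is built once per issue and reused across all parallel candidates, which is what makes the approach cheaper than executing tests for every patch.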

Top-level tags: agents, systems, model evaluation
Detailed tags: software engineering agents, verification, test-time scaling, rubric evaluation, codebase context

Agentic Rubrics as Contextual Verifiers for SWE Agents


1️⃣ One-Sentence Summary

This paper proposes a method called Agentic Rubrics: an expert agent analyzes the codebase to generate a concrete rubric checklist, and candidate patches are then evaluated against it directly, without running tests, giving software engineering agents a more efficient, scalable, and interpretable verification signal.

Source: arXiv: 2601.04171