arXiv submission date: 2026-03-04
📄 Abstract - A Rubric-Supervised Critic from Sparse Real-World Outcomes

Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used as a reward model both for RL-based training and for inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
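The two inference-time uses described above (best-of-N reranking and early stopping) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `critic_score` callable, the attempt budget of 8, and the confidence threshold are all hypothetical placeholders standing in for the learned critic and its tuning.

```python
from typing import Callable, List, Optional, Tuple

def best_of_n(trajectories: List[str],
              critic_score: Callable[[str], float]) -> str:
    """Best-of-N reranking: keep the candidate the critic scores highest."""
    return max(trajectories, key=critic_score)

def early_stop_sample(generate: Callable[[], str],
                      critic_score: Callable[[str], float],
                      max_attempts: int = 8,
                      threshold: float = 0.9) -> Tuple[Optional[str], int]:
    """Early stopping: sample attempts one at a time and halt as soon as
    the critic is confident, saving compute versus always drawing the
    full budget of max_attempts candidates."""
    best, best_score = None, float("-inf")
    for attempt in range(1, max_attempts + 1):
        traj = generate()
        score = critic_score(traj)
        if score > best_score:
            best, best_score = traj, score
        if score >= threshold:  # critic is confident enough: stop early
            return best, attempt
    return best, max_attempts
```

In this sketch the early-stopping loop returns both the selected trajectory and the number of attempts actually used, which is how a "fewer attempts at similar quality" trade-off (as reported in the abstract) would be measured.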

Top-level tags: agents, model evaluation, machine learning
Detailed tags: coding agents, critic model, rubric supervision, sparse feedback, reward learning

A Rubric-Supervised Critic from Sparse Real-World Outcomes


1️⃣ One-sentence summary

This paper proposes a method for training a "critic" model from the behavioral features observable in human-agent interaction traces, helping AI coding assistants learn and make decisions under the sparse, delayed feedback of real-world use and improving their practical effectiveness.

Source: arXiv 2603.03800