arXiv submission date: 2026-01-18
📄 Abstract - ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents

Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. At their core, these search methods rely on process reward models (PRMs) to provide step-level rewards, enabling fine-grained supervision. However, systematic and reliable evaluation benchmarks for PRMs in tool-using settings are lacking. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We use offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. A multi-LLM verification pipeline is proposed to reduce label noise and ensure data quality. We conduct extensive experiments across large language models, general PRMs, and tool-specialized PRMs on ToolPRMBench. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool use. Code and data will be released at this https URL.
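To make the benchmark design concrete, here is a minimal sketch of what a step-level test case and its pairwise PRM evaluation might look like. The field names (`history`, `tools`, `correct_action`, `incorrect_action`) and the `prm.reward(...)` interface are assumptions for illustration, not the paper's actual schema or API:

```python
from dataclasses import dataclass

@dataclass
class StepCase:
    """Hypothetical ToolPRMBench-style test case (field names are assumed)."""
    history: list[dict]       # prior (thought, tool call, observation) turns
    tools: list[dict]         # metadata for the tools available at this step
    correct_action: str       # the verified correct next tool call
    incorrect_action: str     # a plausible but incorrect alternative

def score(prm, case: StepCase, action: str) -> float:
    # Assumes the PRM exposes a reward(history, tools, action) method
    # that returns a scalar step-level reward for the candidate action.
    return prm.reward(case.history, case.tools, action)

def pairwise_accuracy(prm, cases: list[StepCase]) -> float:
    """Fraction of cases where the PRM ranks the correct action
    strictly above the plausible-but-incorrect alternative."""
    hits = sum(
        score(prm, c, c.correct_action) > score(prm, c, c.incorrect_action)
        for c in cases
    )
    return hits / len(cases)
```

Under this framing, a PRM is effective to the extent that its step-level rewards separate correct from plausible-but-wrong actions, which is what allows it to steer reward-guided search at each step of a rollout.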

Top-level tags: llm agents benchmark
Detailed tags: process reward models, tool-using agents, evaluation benchmark, reward-guided search, step-level rewards

ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents


1️⃣ One-sentence summary

This paper introduces ToolPRMBench, a large-scale benchmark designed to evaluate and compare the process reward models that guide AI agents through step-by-step tool use, finding that models specialized for tool use perform better.

Source: arXiv:2601.12294