Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
1️⃣ One-Sentence Summary
This paper proposes DataPRM, an intelligent reward model that, like an experienced mentor, identifies an AI assistant's potential errors step by step during data analysis tasks (e.g., logical flaws rather than syntax errors), while learning to distinguish "reasonable exploratory attempts" from "genuine mistakes," significantly improving AI performance on complex scientific data tasks.
Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors (logical flaws that yield incorrect results without triggering interpreter exceptions) and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) serves as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes correctable grounding errors from irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at this https URL.
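The reflection-aware ternary reward described above can be illustrated with a minimal sketch. The three labels and numeric values below are assumptions for illustration only, not the paper's exact scheme: the key idea is that a correctable grounding error (e.g., a wrong column name the agent can fix after seeing the exception) is scored neutrally rather than punished like a genuinely irrecoverable mistake.

```python
# Hedged sketch of a reflection-aware ternary step reward.
# Labels and numeric values are illustrative assumptions, not DataPRM's exact scheme.

def ternary_reward(label: str) -> float:
    """Map a step-level judgment to a scalar reward."""
    rewards = {
        "correct": 1.0,          # step advances the analysis
        "correctable": 0.0,      # exploratory/grounding error; neutral, not punished
        "irrecoverable": -1.0,   # silent or logical error that corrupts results
    }
    if label not in rewards:
        raise ValueError(f"unknown step label: {label}")
    return rewards[label]
```

The neutral middle value is the point of the ternary design: a binary correct/incorrect reward would penalize trial-and-error exploration exactly as the general-domain PRMs in the empirical study do.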
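The Best-of-N inference used in the abstract's evaluation can be sketched as follows. `sample_trajectory` and `score_steps` are hypothetical stand-ins for the policy LLM and a process reward model such as DataPRM; aggregating step scores by their minimum is one common choice (penalizing any single flawed step), not necessarily the paper's.

```python
# Hedged sketch of Best-of-N inference with step-level rewards.
# `sample_trajectory` and `score_steps` are hypothetical callables, not DataPRM APIs.

def best_of_n(sample_trajectory, score_steps, n=8):
    """Sample n candidate trajectories; keep the one whose worst step scores highest."""
    best_traj, best_score = None, float("-inf")
    for _ in range(n):
        traj = sample_trajectory()           # e.g., a list of agent steps
        step_scores = score_steps(traj)      # one process reward per step
        score = min(step_scores)             # min-aggregation: any bad step sinks the candidate
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj
```

In practice the trajectory sampler would call the policy LLM with temperature sampling and the scorer would query the PRM per step; the selection loop itself is model-agnostic.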
Source: arXiv: 2604.24198