arXiv submission date: 2026-05-28
📄 Abstract - DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.
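
The abstract names GRPO and a per-turn process reward but gives no formulas, so the following is only a minimal sketch of the general idea. It assumes the common GRPO convention of normalizing each trajectory's reward against the other trajectories sampled for the same prompt. The `trajectory_reward` helper and its `process_weight` parameter are hypothetical illustrations of mixing an outcome reward with per-turn rewards; they are not the paper's Action-Centric Process Reward.

```python
from statistics import mean, pstdev
from typing import List


def trajectory_reward(outcome_reward: float,
                      turn_rewards: List[float],
                      process_weight: float = 0.5) -> float:
    """Hypothetical mix of a final-answer reward and per-turn rewards.

    The weighting scheme is an assumption made for illustration only.
    """
    process = mean(turn_rewards) if turn_rewards else 0.0
    return outcome_reward + process_weight * process


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled trajectories.

    Uses the population standard deviation; implementations differ on
    this detail. eps guards against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled trajectories for one prompt: (outcome, per-turn rewards).
    group = [(1.0, [1.0, 1.0]), (0.0, [1.0, 0.0]),
             (0.0, [0.0, 0.0]), (1.0, [0.0, 1.0])]
    rewards = [trajectory_reward(o, t) for o, t in group]
    print(group_relative_advantages(rewards))
```

With a sparse outcome-only reward, two trajectories that both fail receive the same advantage; adding a per-turn term, as sketched here, lets the one with better intermediate steps rank higher within its group.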

Top-level tags: llm reinforcement learning agents
Detailed tags: tool-integrated reasoning process-supervised rl interleaved deliberation benchmark

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning


1️⃣ One-Sentence Summary

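The paper proposes the DeepTool framework, which has a large model carry out interleaved "think-act-observe" reasoning at every tool-use step and introduces process-supervised reinforcement learning to guide self-correction at intermediate steps, significantly improving accuracy and robustness on complex mathematical reasoning tasks.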
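
The interleaved think-act-observe pattern can be pictured as a simple loop. The sketch below is a toy illustration under stated assumptions: `run_interleaved`, `toy_policy`, `toy_tool`, and the `FINAL:` action prefix are all hypothetical names invented here, and a real system would replace the policy with an LLM call and the tool with an actual execution environment.

```python
from typing import Callable, List, Optional, Tuple

# One turn of the trajectory: (thought, action, observation).
Turn = Tuple[str, str, str]


def run_interleaved(policy: Callable[[str, List[Turn]], Tuple[str, str]],
                    tool: Callable[[str], str],
                    question: str,
                    max_turns: int = 4) -> Tuple[Optional[str], List[Turn]]:
    """Alternate thinking, tool calls, and observations until an answer."""
    history: List[Turn] = []
    for _ in range(max_turns):
        # Think: the policy deliberates and chooses the next action.
        thought, action = policy(question, history)
        if action.startswith("FINAL:"):
            return action[len("FINAL:"):].strip(), history
        # Act and observe: run the tool, record the result for the next turn.
        observation = tool(action)
        history.append((thought, action, observation))
    return None, history  # turn budget exhausted without a final answer


def toy_tool(action: str) -> str:
    """Hypothetical tool: adds two integers written as 'a+b'."""
    a, b = action.split("+")
    return str(int(a) + int(b))


def toy_policy(question: str, history: List[Turn]) -> Tuple[str, str]:
    """Hypothetical policy: call the tool once, then report its output."""
    if not history:
        return "I should compute the sum with the tool.", "2+3"
    return "The tool returned a result; report it.", "FINAL: " + history[-1][2]


if __name__ == "__main__":
    answer, turns = run_interleaved(toy_policy, toy_tool, "What is 2+3?")
    print(answer, len(turns))
```

Because every turn is stored as a separate (thought, action, observation) record, a per-turn reward can be attached to each record instead of only to the final answer.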

Source: arXiv: 2605.29568