Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning
1️⃣ One-sentence summary
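This paper proposes ToolGraph, a method that builds a topology graph of tool-call relationships, estimates transition weights from successful trajectories, and combines both with divergence-point preference learning (DPO) so that multi-turn tool-calling agents can improve themselves; in evaluation it raises the average reward from 0.304 to 0.355, a 16.8% relative improvement.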
Multi-turn tool-using agents must coordinate long-horizon tool sequences while tracking dialogue state and policy constraints. Existing approaches often separate inference-time orchestration from parameter-level learning, leaving tool selection weakly structured and preference updates vulnerable to train-deployment prompt mismatch. For within-benchmark self-improvement, ToolGraph combines schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls for write prerequisites and repeated-search loops. We then construct 161 preference pairs by locating divergence points via state-based matching and prefix-based alignment, filtered through action-correctness annotations, and train DPO under the same ToolGraph context used at inference. Across 375 tau2-bench tasks, ToolGraph raises the weighted average reward from 0.304 to 0.338 (+11.2% relative), while ToolGraph+DPO reaches 0.355 (+16.8% over the baseline), with the DPO gain concentrated in airline and retail. Fine-grained diagnostics further show that roughly half of telecom trajectories exhaust the step budget before action execution and that chosen reward positivity is the most useful checkpoint signal across our 16 evaluated DPO configurations.
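As a rough illustration of the prefix-based alignment mentioned in the abstract, the sketch below finds the first step at which a successful and a failed trajectory differ and turns that step into a preference pair. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `Step`, `PreferencePair`, `divergence_index`, and `build_pair` are hypothetical, trajectories are assumed to be plain lists of tool calls, and the state-based matching and action-correctness filtering described in the abstract are not modeled.

```python
from typing import NamedTuple, Optional, Sequence


class Step(NamedTuple):
    """One tool call in a trajectory (hypothetical representation)."""
    tool: str
    args: tuple  # hashable (key, value) pairs, so steps compare by value


class PreferencePair(NamedTuple):
    """Shared context plus the preferred and dispreferred next step."""
    prefix: tuple
    chosen: Step
    rejected: Step


def divergence_index(success: Sequence[Step], failure: Sequence[Step]) -> Optional[int]:
    """Return the first index where the two trajectories differ, or None.

    Assumption: plain step-by-step prefix alignment. None is returned when
    one trajectory is a prefix of the other (or both are identical).
    """
    for i, (good, bad) in enumerate(zip(success, failure)):
        if good != bad:
            return i
    return None


def build_pair(success: Sequence[Step], failure: Sequence[Step]) -> Optional[PreferencePair]:
    """Build a preference pair at the divergence point, if one exists."""
    i = divergence_index(success, failure)
    if i is None:
        return None
    return PreferencePair(tuple(success[:i]), success[i], failure[i])


# Illustrative trajectories; the tool names are made up for this example.
lookup = Step("get_user", (("user_id", "u1"),))
good = [lookup, Step("get_order", (("order_id", "o9"),)), Step("cancel_order", (("order_id", "o9"),))]
bad = [lookup, Step("search_orders", (("query", "recent"),)), Step("search_orders", (("query", "recent"),))]

pair = build_pair(good, bad)
```

Here `pair.prefix` holds the shared `get_user` call, `pair.chosen` is the `get_order` step, and `pair.rejected` is the first `search_orders` step. In a DPO setup, the prefix would be rendered into the prompt and the two steps into the chosen and rejected completions.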
Source: arXiv: 2606.23112