arXiv submission date: 2026-06-22
📄 Abstract - Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning

Multi-turn tool-using agents must coordinate long-horizon tool sequences while tracking dialogue state and policy constraints. Existing approaches often separate inference-time orchestration from parameter-level learning, leaving tool selection weakly structured and preference updates vulnerable to train-deployment prompt mismatch. For within-benchmark self-improvement, ToolGraph combines schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls for write prerequisites and repeated-search loops. We then construct 161 preference pairs by locating divergence points via state-based matching and prefix-based alignment, filtered through action-correctness annotations, and train DPO under the same ToolGraph context used at inference. Across 375 tau2-bench tasks, ToolGraph raises the weighted average reward from 0.304 to 0.338 (+11.2% relative), while ToolGraph+DPO reaches 0.355 (+16.8% over the baseline), with the DPO gain concentrated in airline and retail. Fine-grained diagnostics further show that roughly half of telecom trajectories exhaust the step budget before action execution and that chosen reward positivity is the most useful checkpoint signal across our 16 evaluated DPO configurations.
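The prefix-based alignment mentioned in the abstract can be illustrated with a minimal sketch. It assumes each trajectory is a list of `(tool_name, arguments)` tuples and treats the first differing step as the divergence point; the function names and the trajectory representation are hypothetical, and the state-based matching and action-correctness filtering described in the abstract are not shown.

```python
from typing import Optional

# Hypothetical step representation: (tool_name, serialized_arguments).
Action = tuple[str, str]


def find_divergence_index(success: list[Action], failure: list[Action]) -> Optional[int]:
    """Return the first step index where the two trajectories differ, else None."""
    for i, (good, bad) in enumerate(zip(success, failure)):
        if good != bad:
            return i
    # Identical trajectories, or one is a prefix of the other: no pair to extract.
    return None


def build_preference_pair(success: list[Action], failure: list[Action]) -> Optional[dict]:
    """Build a (prefix, chosen, rejected) record at the divergence point."""
    idx = find_divergence_index(success, failure)
    if idx is None:
        return None
    return {
        "prefix": success[:idx],   # shared history before the divergence
        "chosen": success[idx],    # action taken by the successful rollout
        "rejected": failure[idx],  # action taken by the failed rollout
    }


if __name__ == "__main__":
    ok = [("search_flights", "SFO->JFK"), ("book_flight", "F1")]
    bad = [("search_flights", "SFO->JFK"), ("search_flights", "SFO->JFK")]
    print(build_preference_pair(ok, bad))
```

Under this assumption the shared prefix becomes the prompt context, and the two differing actions become the chosen and rejected responses for preference training.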

Top-level tags: agents, model training, natural language processing
Detailed tags: tool use, preference learning, multi-turn, self-improvement, dpo

Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning


1️⃣ One-Sentence Summary

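This paper proposes ToolGraph, a method that builds a topology graph of tool-call relationships with transition weights estimated from successful trajectories, and pairs it with divergence-point preference learning (DPO) so that multi-turn tool-calling agents can improve themselves. Across 375 tau2-bench tasks, the weighted average reward rises from 0.304 to 0.355, a 16.8% relative improvement.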
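The relative improvements quoted here can be checked from the reported rewards. The snippet below is only a sanity check of that arithmetic; `relative_gain` is an illustrative helper, not something from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# Reported weighted average rewards: baseline 0.304, ToolGraph 0.338, ToolGraph+DPO 0.355.
print(f"ToolGraph:     {relative_gain(0.304, 0.338):.1f}%")  # 11.2%
print(f"ToolGraph+DPO: {relative_gain(0.304, 0.355):.1f}%")  # 16.8%
```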

Source: arXiv: 2606.23112