arXiv submission date: 2026-01-30
📄 Abstract - From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

Interactive tool-using agents must solve real-world tasks through multi-turn interaction with both humans and external environments, which requires dialogue state tracking, multi-step tool execution, and adherence to complex instructions. Post-training such agents is challenging: synthesizing high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) can suffer from noisy signals introduced by user simulation, degrading training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via a closed-loop self-evolving process that updates prompts and workflows. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
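The abstract's RL recipe centers on trajectory-level group-relative advantages with dynamic filtering. A minimal sketch of that idea, assuming scalar per-trajectory rewards from the verifiers (function names and the filtering rule here are illustrative assumptions, not the paper's actual code):

```python
# Hedged sketch of GRPO-style trajectory-level group-relative advantages.
# For each task, a group of trajectories is sampled; each trajectory's
# advantage is its reward normalized against the group's statistics.

def group_relative_advantages(rewards, eps=1e-8):
    """Map scalar trajectory rewards for one task group to
    group-normalized advantages: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def dynamic_filter(groups):
    """One plausible form of dynamic filtering: drop groups whose
    rewards are all identical, since their advantages are all zero
    and contribute no learning signal."""
    return [g for g in groups if max(g) != min(g)]

groups = [
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: informative, kept
    [1.0, 1.0, 1.0, 1.0],  # all-success: zero advantage, filtered
]
kept = dynamic_filter(groups)
advs = [group_relative_advantages(g) for g in kept]
```

The normalization makes the learning signal relative within each group, so a verifier's binary pass/fail rewards still yield graded advantages whenever outcomes differ.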

Top tags: agents llm model training
Detailed tags: tool-using agents reinforcement learning synthetic data generation post-training multi-turn interaction

From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents


1️⃣ One-sentence summary

This paper proposes a unified framework called EigenData, which combines a self-evolving system that automatically generates high-quality multi-turn dialogue data with a verifier-based reinforcement learning method, to efficiently train AI assistants that use tools to complete complex tasks without relying on expensive human-annotated data.

From arXiv: 2601.22607