arXiv submission date: 2026-01-08
📄 Abstract - AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search

LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT$^2$PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT$^2$PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization, a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline by up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component. Our code is available at this https URL.
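To make the "turn-level tree structure" and "Entropy-Guided Tree Expansion" ideas concrete, here is a minimal sketch in Python. The node layout, the per-turn entropy score, and the rule of branching from the highest-entropy node are all assumptions for illustration; the paper's actual AT$^2$PO procedure may differ in its scoring and selection details.

```python
import math

# Hypothetical sketch: each node in the tree is one agent turn (one reasoning
# or tool-call step), and expansion branches from the turn where the policy
# was most uncertain, measured by mean token-level entropy (-log p per token).

class TurnNode:
    def __init__(self, turn_text, token_probs, parent=None):
        self.turn_text = turn_text        # text produced in this turn
        self.token_probs = token_probs    # policy probability of each token
        self.parent = parent
        self.children = []

    def entropy(self):
        # Average negative log-probability over the turn's tokens.
        # High values mark turns where the policy hedged between options.
        return -sum(math.log(p) for p in self.token_probs) / len(self.token_probs)

def select_expansion_node(root):
    """Pick the highest-entropy node in the tree to branch from."""
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        nodes.append(node)
    return max(nodes, key=lambda n: n.entropy())

# Toy two-turn trajectory: a confident first turn, an uncertain second turn.
root = TurnNode("search('LLM agents')", [0.9, 0.8, 0.95])
child = TurnNode("read_page(result_3)", [0.4, 0.3, 0.5], parent=root)
root.children.append(child)

# The uncertain second turn is selected, so new branches explore
# alternatives exactly where the policy's decision was least settled.
assert select_expansion_node(root) is child
```

The design intuition being illustrated: branching at high-entropy turns concentrates the rollout budget on genuine decision points rather than on turns the policy already answers deterministically, which is one way to increase exploration diversity under a fixed sampling budget.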

Top-level tags: llm agents · reinforcement learning
Detailed tags: policy optimization · tree search · multi-turn agents · credit assignment · exploration

AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search


1️⃣ One-sentence summary

This paper proposes a new framework called AT$^2$PO, which combines tree search with turn-based learning to address key problems in multi-turn tasks, such as insufficient agent exploration and difficult reward assignment, thereby significantly improving agent performance on complex tasks.

Source: arXiv:2601.04767