arXiv submission date: 2026-01-30
📄 Abstract - TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards often lead to ineffective intra-group advantage estimation with methods like Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.

Top-level tags: llm agents reinforcement learning
Detailed tags: multi-turn reasoning policy optimization reward shaping tool integration search policy

Turn-level Stage-aware Policy Optimization: Breaking the Double Homogenization Dilemma in Multi-turn Tool-Integrated Reasoning / TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization


1️⃣ One-sentence summary

This paper proposes TSPO (Turn-level Stage-aware Policy Optimization), a new reinforcement learning framework whose core mechanism, the First-Occurrence Latent Reward (FOLR), resolves the dual dilemma of process-level reward homogenization and intra-group reward homogenization in multi-turn tool-integrated reasoning, significantly improving performance on multi-turn reasoning tasks without any external reward model or additional annotation.
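A minimal sketch of how such a first-occurrence reward could be assigned, assuming each trajectory is a list of per-turn generated texts, a simple substring match against the ground-truth answer, and an illustrative partial-reward weight `latent_bonus`; the function name and parameters are assumptions for illustration, not the paper's exact implementation.

```python
from typing import List


def folr_turn_rewards(
    turns: List[str],
    ground_truth: str,
    outcome_reward: float,
    latent_bonus: float = 0.5,
) -> List[float]:
    """Assign the outcome reward to the final turn and a partial reward to the
    first turn whose generated text already contains the ground-truth answer."""
    rewards = [0.0] * len(turns)
    if not turns:
        return rewards
    # Sparse outcome-level reward stays on the final turn, as in standard setups.
    rewards[-1] = outcome_reward
    # First-occurrence check: credit the earliest turn that surfaces the answer.
    for i, text in enumerate(turns):
        if ground_truth.lower() in text.lower():
            rewards[i] += latent_bonus
            break
    return rewards
```

Under a scheme like this, a rollout that retrieves the correct fact in an intermediate turn still receives a non-zero turn-level signal even when its final answer is wrong, which distinguishes it from rollouts that never surfaced the answer at all.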


2️⃣ Key innovations

1. The TSPO framework

2. The First-Occurrence Latent Reward (FOLR) mechanism

3. Identification and formalization of the Double Homogenization Dilemma

4. Turn-level reward allocation for all-incorrect groups (see the sketch after this list)
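The fourth point can be illustrated with a hedged sketch of GRPO-style group-relative advantages; the function name, the epsilon constant, and the example reward values are assumptions for illustration only, not taken from the paper.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward by the group mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# With sparse outcome rewards, an all-incorrect group is fully homogeneous:
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))   # all zeros -> no learning signal

# FOLR-style partial credit for rollouts that surfaced the answer mid-trajectory
# restores within-group variance, so some rollouts are still preferred over others:
print(group_relative_advantages([0.0, 0.5, 0.0, 0.5]))   # non-zero advantages
```

The contrast shows why turn-level reward allocation matters for all-incorrect groups: it restores within-group reward variance, so the group-relative advantage no longer collapses to zero.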


3️⃣ Main results and value

Result highlights

Practical value


4️⃣ Glossary

Source: arXiv:2601.22776