Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

📄 Abstract - Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance on complex reasoning tasks. However, these methods often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and the mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, which leads to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy improves the efficiency of online sampling, yielding 50% more trajectories under the same token budget, and reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at this https URL.
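To make the token-budget claim concrete, the following is a minimal sketch of why tree-structured sampling can produce more trajectories than independent sampling for the same number of generated tokens: sibling branches share a prefix that is generated only once. It is not the paper's IB-guided sampler. The function names, the uniform segment length, and the fixed branching factors are all assumptions chosen for illustration, and the toy numbers are not meant to reproduce the reported 50% figure.

```python
from typing import List, Tuple


def independent_cost(num_trajectories: int, tokens_per_trajectory: int) -> int:
    """Tokens generated when every trajectory is sampled from scratch."""
    return num_trajectories * tokens_per_trajectory


def tree_cost(branching: List[int], tokens_per_segment: int) -> Tuple[int, int]:
    """Tokens generated for a sampling tree with shared prefixes.

    Hypothetical setup: level i expands every node into branching[i]
    children, and each edge is one reasoning segment of
    tokens_per_segment tokens. Returns (total_tokens, num_leaves),
    where each leaf corresponds to one complete trajectory.
    """
    nodes = 1
    total_tokens = 0
    for b in branching:
        nodes *= b  # number of nodes at this level
        total_tokens += nodes * tokens_per_segment  # each node adds one new segment
    return total_tokens, nodes


def trajectories_within_budget(token_budget: int, tokens_per_trajectory: int) -> int:
    """How many independent trajectories fit in a given token budget."""
    return token_budget // tokens_per_trajectory


if __name__ == "__main__":
    # Toy numbers (assumed): three levels, binary branching, 100 tokens per segment.
    budget, leaves = tree_cost([2, 2, 2], 100)
    flat = trajectories_within_budget(budget, 3 * 100)
    print(f"tree: {leaves} trajectories for {budget} tokens")
    print(f"independent: {flat} trajectories for the same budget")
```

In this toy setting the saving comes entirely from the shared prefixes near the root. How large the gain is in practice depends on where the tree branches, which the abstract says is guided by the IB-Score.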


1️⃣ One-Sentence Summary

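To address the imbalance between exploration and exploitation in online reinforcement learning for large language models, this paper proposes IB-Score, a new metric based on Information Bottleneck theory that quantifies the degree of balance, and designs a tree-based sampling strategy that obtains more training trajectories under the same token budget, significantly improving model performance on complex reasoning tasks.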
