arXiv submission date: 2026-03-05
📄 Abstract - BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Proximal constraints are fundamental to the stability of large language model (LLM) reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate the mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution, and derive closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher while robustly mitigating entropy collapse.
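To make the bottleneck concrete, here is a minimal sketch contrasting PPO's fixed ratio interval with a probability-aware band. The `band_bounds` function below uses an illustrative per-action total-variation-style budget `delta` (an assumption for exposition, not BandPO's actual f-divergence derivation): if the new probability may move by at most `delta` in absolute terms, the induced ratio interval `[max(0, 1 - delta/p), 1 + delta/p]` widens as the old probability `p` shrinks, whereas PPO's `[1 - eps, 1 + eps]` stays fixed.

```python
# Sketch: fixed PPO clipping vs. a probability-aware band.
# Under fixed clipping, the ratio r = pi_new(a) / pi_old(a) is confined to
# [1 - eps, 1 + eps], so the absolute probability increase allowed for an
# action is eps * p -- vanishingly small for rare (low-p) actions.
# A per-action absolute budget |pi_new(a) - pi_old(a)| <= delta instead
# yields a ratio interval that widens as p shrinks. This divergence choice
# is illustrative only, not the paper's exact closed-form solution.

def ppo_bounds(p: float, eps: float = 0.2) -> tuple[float, float]:
    """Canonical fixed clipping interval; independent of pi_old(a) = p."""
    return 1.0 - eps, 1.0 + eps

def band_bounds(p: float, delta: float = 0.05) -> tuple[float, float]:
    """Probability-aware interval from an absolute-change budget delta."""
    return max(0.0, 1.0 - delta / p), 1.0 + delta / p

if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        lo, hi = band_bounds(p)
        print(f"p={p:0.2f}  PPO={ppo_bounds(p)}  Band=({lo:.2f}, {hi:.2f})")
```

Note the design contrast: at `p = 0.01` the band's upper ratio bound is 6.0, letting a rare high-advantage action grow substantially in one update, while fixed clipping with `eps = 0.2` caps it at 1.2 regardless of `p` — exactly the suppressed upward margin the abstract describes.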

Top-level tags: llm reinforcement learning, model training
Detailed tags: policy optimization, trust regions, proximal policy optimization, entropy collapse, exploration bottleneck

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning


1️⃣ One-sentence summary

This paper proposes a new method called BandPO, which introduces a "Band" operator that dynamically adjusts the permitted update range according to each action's probability. This resolves the problem in existing reinforcement learning algorithms where a fixed update ceiling suppresses the exploration of low-probability but high-value strategies, improving model performance while effectively preventing the premature loss of policy diversity.

Source: arXiv:2603.04918