DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
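1️⃣ One-sentence summary
This paper proposes a new reinforcement learning framework called Distribution Guided Policy Optimization. It treats changes in the model's output distribution as a flexible guiding signal, replacing the rigid penalty used in conventional algorithms, so that it can pinpoint the pivotal steps in long-chain reasoning tasks and encourage the model to explore more diverse solution paths.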
Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization rely on coarse-grained, sequence-level credit assignment, which struggles to isolate pivotal reasoning steps within long Chain-of-Thought generations. Furthermore, the standard unbounded Kullback-Leibler divergence penalty induces severe gradient instability and mode-seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic-free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty.
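To make the two limitations named in the abstract concrete, the following is a minimal sketch of the Group Relative Policy Optimization baseline only, not of the proposed method, whose details the abstract does not give. It assumes the common formulation in which each sampled response's scalar reward is standardized against its group and the result is copied to every token, and it assumes the per-token KL estimator exp(d) - d - 1 with d = log p_ref - log p. The function names, the use of the population standard deviation, and the `eps` constant are illustrative choices.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each response's scalar reward against its sampled group.

    Assumption: population standard deviation; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Give every token in a response the same sequence-level advantage.

    This is the coarse-grained credit assignment: a pivotal reasoning
    step and a filler token receive identical credit.
    """
    return [[a] * n for a, n in zip(advantages, lengths)]


def kl_penalty(logp, logp_ref):
    """Per-token KL estimate exp(d) - d - 1 with d = logp_ref - logp.

    It is zero when the two log-probabilities agree and grows without
    bound as the policy drifts away from the reference.
    """
    d = logp_ref - logp
    return math.exp(d) - d - 1.0


# Hypothetical group of four sampled responses with 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]  # token counts per response (illustrative)

adv = group_relative_advantages(rewards)
token_adv = broadcast_to_tokens(adv, lengths)

for a, row in zip(adv, token_adv):
    print(f"sequence advantage {a:+.3f} copied to {len(row)} tokens")

# The penalty rises sharply as the log-probability gap widens.
for gap in (0.0, 1.0, 5.0):
    print(f"gap {gap}: penalty {kl_penalty(-gap, 0.0):.3f}")
```

Every token in a response ends up with the same advantage regardless of its role in the reasoning chain, and the penalty term has no upper bound. These are the two properties the abstract identifies as motivation for treating distribution deviation as a guiding signal instead.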
Source: arXiv: 2605.03327