
arXiv submission date: 2026-05-05

📄 Abstract - DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization rely on coarse-grained, sequence-level credit assignment, which struggles to isolate pivotal reasoning steps within long Chain-of-Thought generations. Furthermore, the standard unbounded Kullback-Leibler divergence penalty induces severe gradient instability and mode-seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic-free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty.
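
To make the "sequence-level credit assignment" criticism concrete, here is a minimal sketch of how a group-relative advantage is commonly computed and then copied to every token of a response. It illustrates the baseline behavior the abstract describes, not the DGPO method itself. The function names, the example rewards, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's scalar reward against its sampling group.

    Assumption: population standard deviation; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Sequence-level credit: every token in a response shares one advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Hypothetical group of four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]   # one scalar reward per response
lengths = [3, 5, 2, 4]           # token count of each response

adv = group_relative_advantages(rewards)
token_adv = broadcast_to_tokens(adv, lengths)

for seq in token_adv:
    print([round(a, 3) for a in seq])
```

Because each response contributes a single scalar, a pivotal reasoning step and a filler token in the same response receive identical credit, which is the limitation the abstract points to.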

Top tags: llm, reinforcement learning, model training

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

1️⃣ One-sentence summary

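This paper proposes a new reinforcement learning framework called Distribution Guided Policy Optimization. It treats changes in the model's output distribution as a flexible guiding signal rather than the rigid penalty used in conventional algorithms, so that it can pinpoint the key steps in long-chain reasoning tasks and encourage the model to explore more diverse solution paths.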


Source: arXiv: 2605.03327
