Delightful Policy Gradient
1️⃣ One-sentence summary
This paper proposes a new policy-gradient method that gates each update with a "delight" factor combining the advantage and the surprisal of the sampled action. This addresses two problems of standard methods: rare negative-advantage actions distorting the update direction, and uneven allocation of gradient budget across contexts, yielding better performance on a range of tasks.
Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the *Delightful Policy Gradient* (DG), which gates each term with a sigmoid of *delight*, the product of advantage and action surprisal (negative log-probability). For K-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even with infinite samples. Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines across MNIST, transformer sequence modeling, and continuous control, with larger gains on harder tasks.
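The gating idea above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the `dg_weights` helper, the running-mean baseline, and the bandit setup are all assumptions chosen to make the mechanism concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def dg_weights(advantage, prob):
    """Sigmoid gate on 'delight' = advantage * surprisal (illustrative).

    Surprisal is -log p(action): a confident action has low surprisal, so
    its gate stays near 0.5; a rare negative-advantage action gets a gate
    near 0, suppressing its influence on the update direction.
    """
    surprisal = -np.log(prob)
    delight = advantage * surprisal
    return 1.0 / (1.0 + np.exp(-delight))

# K-armed bandit with a softmax policy over logits (hypothetical setup).
K = 4
logits = np.zeros(K)
mean_rewards = np.array([0.1, 0.2, 0.9, 0.3])
baseline = 0.0  # running-mean reward baseline

for step in range(500):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(K, p=probs)
    r = mean_rewards[a] + 0.1 * rng.standard_normal()
    advantage = r - baseline
    baseline += 0.1 * (r - baseline)
    gate = dg_weights(advantage, probs[a])
    # REINFORCE gradient of log pi(a) w.r.t. softmax logits, gated by delight.
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += 0.5 * gate * advantage * grad_logp

print(probs)  # the policy should concentrate on the highest-reward arm
```

Compared with plain REINFORCE, the only change is the multiplicative `gate`: positive-advantage surprising actions are amplified (gate above 0.5), while negative-advantage surprising actions are damped rather than allowed to dominate the update.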
Source: arXiv:2603.14608