arXiv submission date: 2026-01-08
📄 Abstract - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
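
To make the described collapse concrete, below is a minimal numerical sketch, not the paper's implementation: it contrasts a GRPO-style baseline, which normalizes the summed reward within a rollout group, with a decoupled scheme in the spirit of GDPO, which normalizes each reward within the group before combining. The reward values, function names, and the unweighted combination of per-reward advantages are illustrative assumptions; the actual GDPO formulation may weight or apply these advantages differently.

```python
import numpy as np

# Hypothetical rollout group for a single prompt: each row is one rollout,
# each column a distinct reward signal (e.g. column 0 = correctness,
# column 1 = format adherence). Values are illustrative.
rewards = np.array([
    [1.0, 0.0],  # correct answer, bad format
    [0.0, 1.0],  # wrong answer,   good format
    [1.0, 1.0],  # correct answer, good format
    [1.0, 0.0],  # correct answer, bad format
])

def grpo_advantages(r, eps=1e-6):
    """GRPO-style baseline: sum the rewards per rollout, then normalize the
    scalar total across the group (subtract group mean, divide by group std).
    Distinct combinations with equal totals collapse to the same advantage."""
    total = r.sum(axis=1)
    return (total - total.mean()) / (total.std() + eps)

def decoupled_advantages(r, eps=1e-6):
    """Decoupled normalization in the spirit of GDPO (unweighted combination
    assumed here): normalize each reward column across the group separately,
    then combine the per-reward advantages."""
    per_reward = (r - r.mean(axis=0)) / (r.std(axis=0) + eps)
    return per_reward.sum(axis=1)

print(grpo_advantages(rewards))
# Rollouts 0, 1, and 3 all have total reward 1.0, so GRPO assigns them the
# same advantage even though (correct, bad format) and (wrong, good format)
# are very different behaviors.

print(decoupled_advantages(rewards))
# With per-reward normalization, those two combinations receive different
# advantages, preserving the distinction between the reward signals.
```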

Top-level tags: reinforcement learning, llm, model training
Detailed tags: multi-reward rl, policy optimization, reward normalization, alignment, training stability

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization


1️⃣ One-sentence summary

This paper proposes a new policy optimization method, GDPO, which decouples the normalization of multiple rewards to resolve the training instability and suboptimal performance caused by ambiguous training signals in existing multi-reward reinforcement learning approaches, and achieves better results on tasks such as tool calling, math reasoning, and code reasoning.

Source: arXiv 2601.05242