Your Group-Relative Advantage Is Biased
1️⃣ One-Sentence Summary
This paper finds that, when training large language models with reinforcement learning from verifier rewards, the widely used group-relative advantage estimator is systematically biased, throwing exploration and exploitation out of balance across problems of different difficulty. The authors propose an adaptive reweighting scheme that corrects this bias and improves performance on tasks such as mathematical reasoning.
2️⃣ Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet the estimator's theoretical properties remain poorly understood. In this work, we uncover a fundamental issue in group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
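To make the bias concrete: in GRPO-style methods, each rollout's advantage is estimated by subtracting the empirical mean reward of its own sampled group. Because that baseline includes the rollout's own reward, the estimate is shrunk toward zero on finite groups, and under binary verifier rewards the absolute shrinkage is largest exactly where the true advantage is largest, i.e., on hard prompts. The Monte Carlo sketch below is my own illustration of this mechanism, not the paper's analysis or the HA-DW method; the group size `G`, the success rates `p`, and the omission of GRPO's standard-deviation scaling are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
G = 8  # rollouts sampled per prompt (illustrative group size)

def group_relative_advantage(rewards):
    """Group-mean baseline, as in GRPO without the std-scaling term."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

for p, label in [(0.1, "hard prompt"), (0.9, "easy prompt")]:
    # A correct rollout (r = 1) has true advantage 1 - p. Because the
    # group mean contains the rollout's own reward, the estimator obeys
    # E[A | r = 1] = (1 - 1/G)(1 - p): shrunk toward zero, worst when p is small.
    estimates = []
    for _ in range(50_000):
        r = rng.binomial(1, p, size=G)       # binary verifier rewards
        a = group_relative_advantage(r)
        estimates.extend(a[r == 1])          # estimates for correct rollouts
    print(f"{label}: true adv = {1 - p:.3f}, "
          f"mean estimate = {np.mean(estimates):.3f}, "
          f"predicted (1 - 1/G)(1 - p) = {(1 - 1/G) * (1 - p):.3f}")
```

Running this shows the estimate for a correct answer on the hard prompt falling well short of its true advantage, while the easy prompt's estimate is nearly unbiased in absolute terms. The same self-inclusion argument carries over, with messier algebra, once GRPO's standard-deviation scaling is added; the history-aware difficulty anchor that HA-DW uses to correct this is not reproduced here.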
Source: arXiv: 2601.08521