arXiv submission date: 2025-12-25
📄 Abstract - Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards

Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.
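The abstract does not spell out A3PO's formulas, so the snippet below is only a minimal illustrative sketch of what "asymmetric, token-level advantage shaping" could look like on top of a GRPO-style group baseline. The function name `shape_advantages` and the parameters `token_weights`, `pos_scale`, and `neg_scale` are hypothetical and are not taken from the paper.

```python
import torch

def shape_advantages(advantages, token_weights, pos_scale=1.0, neg_scale=1.0):
    """Illustrative asymmetric advantage shaping (hypothetical, not the paper's A3PO).

    advantages:    (batch, seq_len) per-token advantages from a group baseline;
                   positive rollouts have advantage > 0, negative rollouts < 0.
    token_weights: (batch, seq_len) scores in [0, 1] marking "key" tokens
                   (e.g. derived from policy entropy), used to concentrate
                   the learning signal.
    pos_scale / neg_scale: separate scaling factors for the two polarities,
                   making the shaping asymmetric.
    """
    pos_mask = advantages > 0
    neg_mask = advantages < 0

    shaped = advantages.clone()
    # Positive samples: sharpen existing correct patterns by reweighting key tokens.
    shaped[pos_mask] = pos_scale * advantages[pos_mask] * token_weights[pos_mask]
    # Negative samples: apply a separate scale so their exploratory push is tuned independently.
    shaped[neg_mask] = neg_scale * advantages[neg_mask] * token_weights[neg_mask]
    return shaped
```

The shaped advantages would then replace the raw ones in a standard policy-gradient loss; the point of the sketch is only that the two polarities receive different, token-weighted treatment, which is the behavior the abstract describes.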

Top-level tags: llm, reinforcement learning, model training
Detailed tags: reasoning models, policy optimization, advantage shaping, rlvr, sample polarity

Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards


1️⃣ One-Sentence Summary

This paper finds that, when training large reasoning models, correct (positive-polarity) and incorrect (negative-polarity) reasoning rollouts play distinct roles: positive samples reinforce existing correct reasoning patterns, while negative samples drive exploration of new reasoning paths. Building on this, the authors propose A3PO, a method that allocates advantage signals more precisely across polarities and achieves better results on multiple reasoning benchmarks.

Source: arXiv 2512.21625