Exploration vs. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
1️⃣ One-Sentence Summary
By analyzing two seemingly contradictory techniques, spurious rewards and entropy minimization, this paper reveals how they act in concert to improve the reasoning ability of large language models under reinforcement learning with verifiable rewards (RLVR), and explains the mechanism behind this effect.
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: discouraging exploitation and discouraging exploration both improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
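To make the two ingredients in the abstract concrete, below is a minimal sketch (not the authors' code) of a PPO/GRPO-style clipped surrogate loss, a spurious group-normalized advantage computed from rewards unrelated to correctness, and the policy entropy whose decrease the paper attributes to clipping bias. The function names, tensor shapes, and the epsilon value are illustrative assumptions, not details from the paper.

```python
import torch

def clipped_surrogate_loss(new_logprobs: torch.Tensor,
                           old_logprobs: torch.Tensor,
                           advantages: torch.Tensor,
                           eps: float = 0.2) -> torch.Tensor:
    """PPO/GRPO clipped objective: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)],
    where r is the probability ratio between the current and old policy."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def spurious_group_advantages(group_size: int) -> torch.Tensor:
    """GRPO-style advantages from a spurious reward (random 0/1 here,
    independent of answer correctness), normalized within the rollout group."""
    rewards = torch.randint(0, 2, (group_size,)).float()
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def policy_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean token-level entropy of the policy, the quantity whose decline
    the paper links to clipping bias under spurious rewards."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

# Toy usage: 8 single-token rollouts from a uniform old policy.
old_lp = torch.log(torch.full((8,), 0.25))      # old-policy log-probs
new_lp = old_lp + 0.1 * torch.randn(8)          # log-probs after some updates
adv = spurious_group_advantages(8)
print(clipped_surrogate_loss(new_lp, old_lp, adv))
print(policy_entropy(torch.zeros(1, 16)))       # entropy of a uniform 16-way policy
```

The sketch only defines the objective; the paper's argument is that repeatedly optimizing this clipped loss under such spurious advantages biases updates toward tokens the old policy already favored, which is what drives entropy down.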
Source: arXiv:2512.16912