📄 Abstract - Binary Rewards and Reinforcement Learning: Fundamental Challenges
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes falling below that of the base model. We provide a structural account of this phenomenon grounded in the properties of binary rewards. Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL-control resolves this degeneracy by selecting, in the limit $\beta\to 0$, the filtered model $p_*:=a(\cdot\mid\mathcal{Y}_1)$ -- the base model conditioned on validity -- which is the unique fully valid distribution closest to the base model in KL divergence. This selection operates through a nontrivial asymmetry: the tilted distribution $p_{[\beta]}\propto a(y)\,e^{v(y)/\beta}$ converges to $p_*$ in forward KL as $\beta\to 0$, yet $p_*$ cannot serve as a direct optimization target because $\mathrm{KL}(q\,\|\,p_*)$ is infinite for any full-support policy $q$. We develop explicit formulas relating the hyperparameter $\beta$ to the more interpretable target validity rate $\mu$. Under model misspecification -- the typical practical regime -- the pressure to decrease $\beta$ drives the optimizer toward highly concentrated distributions over a small number of valid outputs, collapsing onto ever fewer of them as $\beta$ decreases, rather than converging to the filtered model. We illustrate this mechanism in a toy autoregressive experiment and discuss how alternative divergences that target $p_*$ directly -- as pursued empirically by \citet{kruszewski_whatever_2026} -- avoid this failure mode by rewarding coverage of $p_*$'s support rather than concentration on high-validity outputs.
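To make the KL asymmetry in the abstract concrete, here is a minimal numerical sketch on a hand-picked 5-outcome toy distribution (the numbers are assumptions for illustration, not the paper's experiment); the symbols $a$, $v$, $\beta$, and $p_*$ follow the abstract's definitions.

```python
import numpy as np

# Toy sketch (assumed numbers, not from the paper): the tilted distribution
# p_[beta] ∝ a(y) * exp(v(y)/beta) approaches the filtered model p_* = a(. | Y_1)
# in forward KL as beta -> 0, while reverse KL against p_* stays infinite for
# any full-support distribution.

a = np.array([0.40, 0.30, 0.15, 0.10, 0.05])   # base model a(y) over 5 toy outputs
v = np.array([1.0, 0.0, 1.0, 0.0, 0.0])        # binary validity v(y); valid set Y_1 = {y0, y2}

# Filtered model p_*: the base model conditioned on validity.
p_star = a * v
p_star /= p_star.sum()

def tilted(beta):
    """Tilted distribution p_[beta](y) proportional to a(y) * exp(v(y)/beta)."""
    w = a * np.exp(v / beta)
    return w / w.sum()

def kl(p, q):
    """KL(p || q) with the convention 0*log(0/q)=0; infinite if p puts mass where q has none."""
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

for beta in (1.0, 0.3, 0.1, 0.03):
    p_beta = tilted(beta)
    print(f"beta={beta:5.2f}  KL(p_* || p_beta)={kl(p_star, p_beta):.6f}  "
          f"KL(p_beta || p_*)={kl(p_beta, p_star)}")
# Forward KL shrinks toward 0 as beta decreases, while the reverse direction is
# infinite, which is why p_* cannot be used directly as a reverse-KL target.
```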
Binary Rewards and Reinforcement Learning: Fundamental Challenges
1️⃣ One-Sentence Summary
This paper shows that when language models are trained with reinforcement learning on binary rewards, single-sample accuracy improves but diversity and multi-sample coverage degrade, and it explains the root cause theoretically: binary rewards leave the optimization objective inherently degenerate, and while the commonly used KL regularization can, in the ideal case, select a distribution of valid answers close to the base model, under the practical regime of model misspecification it instead pushes the model to generate only a small number of repeated correct answers, destroying diversity.