JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
1️⃣ One-Sentence Summary
This paper proposes JURY-RL, a method that requires no human-annotated answers: the model's own rollouts vote to propose a candidate answer, a formal verifier (e.g., Lean) checks whether that candidate is correct, and reward is granted only when verification succeeds. This stably improves the reasoning ability of large language models on tasks such as mathematical reasoning, approaching the performance of training with ground-truth answers.
Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
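The propose/dispose split described in the abstract can be sketched in a few lines. The exact ResZero formula is not given above, so the fallback below (centering residual vote counts to zero mean and rescaling to unit variance) is an illustrative assumption, as are the function names `jury_rl_rewards` and `verify`:

```python
import math
from collections import Counter

def jury_rl_rewards(answers, verify):
    """Sketch of JURY-RL reward assignment over one batch of rollouts.

    answers: one final answer per rollout.
    verify:  callable standing in for the formal (e.g., Lean) verifier;
             returns True only when the candidate is proven correct.
    """
    counts = Counter(answers)
    plurality, _ = counts.most_common(1)[0]

    if verify(plurality):
        # Votes propose, proofs dispose: only rollouts matching the
        # verified plurality answer receive positive reward.
        return [1.0 if a == plurality else 0.0 for a in answers]

    # ResZero fallback (illustrative construction, not the paper's exact
    # formula): discard the unverified plurality proposal and spread a
    # zero-mean, variance-preserving signal over the residual answers.
    residual_idx = [i for i, a in enumerate(answers) if a != plurality]
    rewards = [0.0] * len(answers)
    if len(residual_idx) > 1:
        raw = [counts[answers[i]] for i in residual_idx]   # residual vote counts
        mean = sum(raw) / len(raw)
        centered = [r - mean for r in raw]                  # zero-mean signal
        var = sum(c * c for c in centered) / len(centered)
        if var > 0:
            scale = 1.0 / math.sqrt(var)                    # unit variance
            for i, c in zip(residual_idx, centered):
                rewards[i] = c * scale
    return rewards
```

In the verified case the batch behaves like standard RLVR with a binary reward; in the inconclusive case the residual signal keeps the policy gradient non-degenerate without ever reinforcing the unverifiable consensus answer, which gets exactly zero reward.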
Source: arXiv: 2604.25419