JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
1️⃣ One-Sentence Summary
This paper proposes JURY-RL, a method that requires no human-annotated answers: the model's own rollouts vote to propose a candidate answer, a formal verifier (e.g., Lean) checks whether that candidate is correct, and reward is granted only when verification succeeds. This stably improves the reasoning ability of large language models on tasks such as mathematical reasoning, approaching the performance of training with ground-truth answers.
Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
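The propose/dispose split described in the abstract can be sketched in a few lines. The exact ResZero formula is not given above, so the fallback below (centering residual vote counts to zero mean and rescaling to unit variance) is an illustrative assumption, as are the function names `jury_rl_rewards` and `verify`:

```python
import math
from collections import Counter

def jury_rl_rewards(answers, verify):
    """Sketch of JURY-RL reward assignment over one batch of rollouts.

    answers: one final answer per rollout.
    verify:  callable standing in for the formal (e.g., Lean) verifier;
             returns True only when the candidate is proven correct.
    """
    counts = Counter(answers)
    plurality, _ = counts.most_common(1)[0]

    if verify(plurality):
        # Votes propose, proofs dispose: only rollouts matching the
        # verified plurality answer receive positive reward.
        return [1.0 if a == plurality else 0.0 for a in answers]

    # ResZero fallback (illustrative construction, not the paper's exact
    # formula): discard the unverified plurality proposal and spread a
    # zero-mean, variance-preserving signal over the residual answers.
    residual_idx = [i for i, a in enumerate(answers) if a != plurality]
    rewards = [0.0] * len(answers)
    if len(residual_idx) > 1:
        raw = [counts[answers[i]] for i in residual_idx]   # residual vote counts
        mean = sum(raw) / len(raw)
        centered = [r - mean for r in raw]                  # zero-mean signal
        var = sum(c * c for c in centered) / len(centered)
        if var > 0:
            scale = 1.0 / math.sqrt(var)                    # unit variance
            for i, c in zip(residual_idx, centered):
                rewards[i] = c * scale
    return rewards
```

In the verified case the batch behaves like standard RLVR with a binary reward; in the inconclusive case the residual signal keeps the policy gradient non-degenerate without ever reinforcing the unverifiable consensus answer, which gets exactly zero reward.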
Source: arXiv: 2604.25419