An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
1️⃣ One-Sentence Summary
This study finds that when training large language models, even a verifier with an error rate as high as 15% in judging model outputs yields training results nearly indistinguishable from those of a perfect verifier. In practice, therefore, one should prioritize verifiers with high precision rather than pursue flawless verification.
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent method for post-training Large Language Models (LLMs). However, verifiers are rarely error-free; even deterministic checks can be inaccurate, and the growing dependence on model-based judges exacerbates the issue. The extent to which RLVR is robust to such noise and the verifier accuracy required for effective training remain unresolved questions. We investigate these questions in the domains of code generation and scientific reasoning by introducing noise into RL training. Noise rates up to 15% yield peak validation accuracy within 2 percentage points of the clean baseline. These findings are consistent across controlled and model-based noise types, three model families (Qwen3, GLM4, Llama 3.1), and model sizes from 4B to 9B. Overall, the results indicate that imperfect verification does not constitute a fundamental barrier to RLVR. Furthermore, our findings suggest that practitioners should prioritize moderate accuracy with high precision over perfect verification.
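The controlled-noise setup described above can be sketched as a wrapper that flips a binary verifier's reward with a fixed probability. This is a minimal illustrative sketch, not the paper's implementation; `noisy_verifier` and the toy verifier below are hypothetical names chosen for the example.

```python
import random

def noisy_verifier(verifier, noise_rate=0.15, rng=None):
    """Wrap a binary (0/1) verifier so its reward is flipped with
    probability `noise_rate` (symmetric label noise).

    Hypothetical helper for illustration; the paper's actual noise
    injection may differ in detail.
    """
    rng = rng or random.Random()

    def wrapped(solution):
        reward = verifier(solution)
        if rng.random() < noise_rate:
            reward = 1 - reward  # flip correct<->incorrect
        return reward

    return wrapped

# Toy deterministic verifier: "accepts" any solution containing "return".
clean_verifier = lambda s: int("return" in s)

# A 15%-noise version, as in the abstract's largest tolerated noise rate.
noisy_15 = noisy_verifier(clean_verifier, noise_rate=0.15)
```

During RL training, `noisy_15` would replace the clean verifier when computing rewards, so a fixed fraction of rollouts receive an incorrect training signal.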
Source: arXiv 2604.07666