arXiv submission date: 2026-02-09
📄 Abstract - Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
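To make the scheduling idea concrete, the sketch below illustrates the contextual-bandit view described in the abstract: each candidate rollout is an arm with a context feature vector, the bandit reward is the performance gain observed between consecutive optimization steps, and a learned value model picks which rollouts (fresh or reused from a buffer) feed the next policy update. This is a minimal illustrative sketch only, assuming a linear value estimator in place of the paper's neural scheduler, an epsilon-greedy selection rule, and synthetic features and gains; all class names, feature choices, and hyperparameters are assumptions, not the authors' implementation.

```python
import numpy as np


class RolloutBanditScheduler:
    """Illustrative contextual-bandit rollout scheduler (not the paper's code).

    Each candidate rollout is an arm described by a context feature vector
    (e.g. verifier reward, response length, group statistics -- the exact
    features are an assumption). The bandit reward for a selected rollout is
    the performance gain measured between consecutive optimization steps.
    """

    def __init__(self, feature_dim, lr=0.05, epsilon=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        # Linear value estimate, standing in for the neural scheduler.
        self.w = np.zeros(feature_dim)
        self.lr = lr
        self.epsilon = epsilon

    def score(self, contexts):
        """Predicted usefulness of each candidate rollout (higher = better)."""
        return contexts @ self.w

    def select(self, contexts, k):
        """Pick k rollouts via epsilon-greedy over predicted values."""
        n = len(contexts)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(n, size=min(k, n), replace=False)
        return np.argsort(-self.score(contexts))[:k]

    def update(self, contexts, gains):
        """SGD step fitting predicted values to observed performance gains."""
        preds = contexts @ self.w
        grad = (preds - gains) @ contexts / len(gains)
        self.w -= self.lr * grad


if __name__ == "__main__":
    # Toy loop: schedule rollouts from a buffer across training steps.
    dim, buffer_size, k = 8, 64, 16
    rng = np.random.default_rng(1)
    scheduler = RolloutBanditScheduler(feature_dim=dim)

    # Hidden "true" usefulness weights, used only to simulate gains here.
    true_w = rng.normal(size=dim)

    for step in range(200):
        # Feature vectors of the rollouts currently available in the buffer.
        contexts = rng.normal(size=(buffer_size, dim))
        chosen = scheduler.select(contexts, k)
        # Simulated per-rollout performance gain between consecutive steps.
        gains = contexts[chosen] @ true_w + 0.1 * rng.normal(size=len(chosen))
        scheduler.update(contexts[chosen], gains)

    print("learned weights (should correlate with true_w):")
    print(np.round(scheduler.w, 2))
```

In this toy setup the scheduler gradually learns which rollout features predict larger step-to-step gains, which mirrors (in a very simplified form) how the paper's framework supports both intra-group selection and reuse of historical rollouts from an enlarged buffer.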

Top-level tags: reinforcement learning theory model training
Detailed tags: contextual bandits rollout scheduling sample efficiency regret analysis reasoning benchmarks

Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards


1️⃣ One-Sentence Summary

This paper proposes an intelligent scheduling method: by treating the historical rollouts generated during reinforcement-learning training as selectable "arms" and dynamically picking the most valuable ones to drive each model update, it significantly improves both the training efficiency and the final performance of large language models on tasks such as mathematical reasoning.

Source: arXiv: 2602.08499