Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

📄 Abstract - Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital to its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with orders of magnitude less SFT data, even matching their performance with the full dataset.
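
The three components named in the abstract can be illustrated with a minimal sketch. The abstract does not give the decay schedule, the loss formulation, or any function names, so everything below is an assumption made for illustration: `select_demonstrations`, `sft_weight`, and `combined_loss` are hypothetical helpers, the linear decay is one possible schedule, and the additive combination of losses is a guess at how a supervised term and an on-policy term might be mixed. It is not the authors' implementation.

```python
import random


def select_demonstrations(sft_dataset, k=128, seed=0):
    """Randomly pick k demonstrations from an SFT dataset (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(list(sft_dataset), k)


def sft_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Linearly decay the SFT loss weight from w_start to w_end.

    The linear schedule is an assumption; the abstract only says the
    weights on the few-shot SFT data decay over training.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * frac


def combined_loss(rl_loss, sft_loss, step, total_steps):
    """Mix an on-policy RL loss with a down-weighted supervised loss (assumed form)."""
    return rl_loss + sft_weight(step, total_steps) * sft_loss
```

Under these assumptions, the supervised term dominates early, when correct rollouts are rare, and fades as training revisits the same 128 demonstrations over many epochs, which is the overfitting risk the abstract mentions.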

1️⃣ One-Sentence Summary

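This paper proposes FEST, an algorithm that needs only 128 randomly selected demonstrations (far fewer than conventional supervised fine-tuning requires) to substantially improve the sample efficiency of reinforcement learning on complex tasks such as math and coding, while avoiding overfitting and matching or even exceeding the results obtained with the full dataset on several benchmarks.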
