arXiv submission date: 2026-02-26
📄 Abstract - Towards Better RL Training Data Utilization via Second-Order Rollout

Reinforcement Learning (RL) has endowed Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL focuses mainly on improving generation capability, training with only first-order rollout (generating multiple responses to a question). We argue that this approach fails to fully exploit the potential of the training data because it neglects critique-capability training. To tackle this problem, we introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach utilizes training data more effectively than vanilla RL and achieves better performance on the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.
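The first- vs second-order rollout distinction in the abstract can be sketched in a few lines of Python. This is a minimal illustration only: the function names, sample counts, and the random stand-ins for LLM calls are all assumptions, not the paper's actual implementation.

```python
import random

def generate_response(question, seed):
    """Stand-in for an LLM generation call (hypothetical)."""
    rng = random.Random(seed)
    return f"answer-to-{question}-{rng.randint(0, 9)}"

def generate_critique(response, seed):
    """Stand-in for an LLM critique call: label the response correct/incorrect."""
    rng = random.Random(seed)
    return {"response": response, "label": rng.choice(["correct", "incorrect"])}

def first_order_rollout(question, n_responses=4):
    """Vanilla RL rollout: sample multiple responses per question."""
    return [generate_response(question, s) for s in range(n_responses)]

def second_order_rollout(question, n_responses=2, n_critiques=3):
    """Second-order rollout: for each sampled response, also sample multiple
    critiques, yielding extra critique-training examples from the same question."""
    data = []
    for r_idx, response in enumerate(first_order_rollout(question, n_responses)):
        critiques = [generate_critique(response, 100 * r_idx + c)
                     for c in range(n_critiques)]
        data.append({"response": response, "critiques": critiques})
    return data

rollouts = second_order_rollout("q1")
# Each question now yields n_responses * n_critiques critique examples
# on top of the n_responses generation examples.
```

The point of the sketch is the data-utilization argument: the same question produces both generation trajectories and critique trajectories, so one framework can jointly train both capabilities.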

Top-level tags: reinforcement learning llm model training
Detailed tags: rlhf critique training data utilization second-order rollout generation-critique joint training

Towards Better RL Training Data Utilization via Second-Order Rollout


1️⃣ One-Sentence Summary

This paper proposes a new method called "second-order rollout": during training, the large language model not only generates answers but also generates multiple critiques of each answer, jointly training its generation and critique capabilities. This makes fuller use of the training data and achieves better performance than vanilla reinforcement learning with the same amount of data.
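The abstract also highlights the importance of label balance in critique training. One simple way to realize this is to downsample the majority label before training; the sketch below is an assumption-laden illustration (the function name and sampling strategy are hypothetical, and the paper's actual technique may differ).

```python
import random

def balance_labels(critiques, seed=0):
    """Downsample the majority label so 'correct' and 'incorrect' critique
    examples appear in equal numbers, reducing label bias in critique training."""
    rng = random.Random(seed)
    pos = [c for c in critiques if c["label"] == "correct"]
    neg = [c for c in critiques if c["label"] == "incorrect"]
    k = min(len(pos), len(neg))           # size of the smaller class
    balanced = rng.sample(pos, k) + rng.sample(neg, k)
    rng.shuffle(balanced)
    return balanced

# Toy skewed batch: 7 'correct' critiques vs 3 'incorrect'.
critiques = [{"id": i, "label": "correct" if i < 7 else "incorrect"}
             for i in range(10)]
balanced = balance_labels(critiques)
# balanced now holds 3 'correct' and 3 'incorrect' examples.
```

Without such balancing, a critique model trained on skewed labels can learn to always predict the majority label, which is one plausible reading of why the paper stresses this point.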

Source: arXiv:2602.22765