📄 Abstract - Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
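For reference, the token-level surrogate the abstract refers to is typically written in the PPO-style clipped form below; this is a standard formulation shown for illustration and may differ from the paper's exact objective and notation:

$$
\mathcal{J}(\theta) = \mathbb{E}_{x,\; y \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{1}{|y|} \sum_{t=1}^{|y|} \min\!\Big( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}
$$

Here $r_t(\theta)$ is the per-token importance ratio correcting for the mismatch between the current policy and the (possibly stale) rollout policy, and the clip bound $\epsilon$ limits how far a single update can move when that ratio drifts.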

Top tags: reinforcement learning, llm, model training
Detailed tags: policy gradient, training stability, importance sampling, mixture-of-experts, off-policy learning

Stabilizing Reinforcement Learning with LLMs: Formulation and Practices


1️⃣ One-Sentence Summary

Through theoretical analysis and extensive experiments, this paper explains how to stabilize reinforcement learning training of large language models by reducing the training-inference discrepancy and policy staleness, and proposes a practical recipe that combines importance sampling correction, clipping, and Routing Replay.
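A minimal PyTorch sketch of such a token-level loss with importance sampling correction and clipping follows; the function and argument names are illustrative rather than taken from the paper's code, and Routing Replay (an MoE routing mechanism) is not shown here.

```python
import torch

def clipped_pg_loss(logp_new, logp_old, advantages, mask, clip_eps=0.2):
    """Token-level policy-gradient loss with importance sampling correction and clipping.

    logp_new:   [B, T] log-probs of the sampled tokens under the current policy
    logp_old:   [B, T] log-probs of the same tokens under the rollout (behavior) policy
    advantages: [B, T] per-token advantages (e.g., a sequence-level reward broadcast to tokens)
    mask:       [B, T] 1.0 for response tokens, 0.0 for prompt/padding
    """
    # Per-token importance ratio between the current and rollout policies.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping bounds the update when the policy has drifted from the rollout policy (staleness).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.min(unclipped, clipped)
    # Average over valid response tokens only.
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```

In the on-policy regime described in the abstract (one update per batch of fresh rollouts), the ratio stays near 1 and clipping is rarely active; with off-policy reuse, clipping together with Routing Replay for MoE models is what keeps training stable.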


📄 Open the original PDF