RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization
1️⃣ One-Sentence Summary
This paper proposes a new method called RePO: the large language model first comprehends external high-quality knowledge and then rephrases it into training data that matches its own style, yielding stable and efficient gains on domain-specific tasks.
Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
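To make the rollout-replacement step concrete, below is a minimal Python sketch under stated assumptions: it assumes a GRPO-style group of sampled trajectories per prompt, and the names `Trajectory`, `rephrase`, `reward_threshold`, and `max_replacements` are illustrative placeholders, not the paper's actual implementation. The idea it mirrors is the one stated in the abstract: the policy model rephrases an off-policy reference solution into its own style, and the lowest-reward rollouts are swapped for these rephrased trajectories before the on-policy update.

```python
# A hedged sketch of RePO-style rollout replacement (names are hypothetical).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    text: str
    reward: float


def repo_replace_rollouts(
    rollouts: List[Trajectory],
    off_policy_solution: str,
    rephrase: Callable[[str], Trajectory],
    reward_threshold: float = 0.0,
    max_replacements: int = 1,
) -> List[Trajectory]:
    """Replace the lowest-reward rollouts with rephrased, high-quality trajectories.

    `rephrase` is assumed to prompt the policy model to first comprehend the
    off-policy solution and then restate it in its own style, so the returned
    trajectory stays close to the policy's parametric distribution.
    """
    # Rank rollouts by reward so the weakest ones are replacement candidates.
    ranked = sorted(rollouts, key=lambda t: t.reward)
    replaced = 0
    result: List[Trajectory] = []
    for traj in ranked:
        if replaced < max_replacements and traj.reward <= reward_threshold:
            # Swap a failed rollout for a rephrased version of the reference solution.
            result.append(rephrase(off_policy_solution))
            replaced += 1
        else:
            result.append(traj)
    return result


if __name__ == "__main__":
    # Toy usage: one failed rollout is swapped for a rephrased trajectory.
    group = [Trajectory("wrong derivation", 0.0), Trajectory("correct answer", 1.0)]
    fake_rephrase = lambda sol: Trajectory(f"(rephrased) {sol}", 1.0)
    print([t.text for t in repo_replace_rollouts(group, "reference solution", fake_rephrase)])
```

In the paper's framing, keeping the rephrased trajectories in the policy's own stylistic and parametric distribution is what lets this substitution inject off-policy knowledge without breaking on-policy training dynamics.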
Source: arXiv: 2602.10819