arXiv submission date: 2026-04-02
📄 Abstract - Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
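The core routing idea from the abstract can be sketched in a few lines: rollouts with a positive verifiable reward go to the GRPO branch, failed rollouts go to the SDPO branch, and each distillation target is down-weighted when the self-teacher's distribution is high-entropy. A minimal illustration, assuming a toy rollout format and an `exp(-entropy)` weighting scheme (the paper's exact weighting function is not given in the abstract):

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_samples(rollouts):
    """Split rollouts by correctness, as in SRPO's sample routing.

    Correct samples (positive verifiable reward) are sent to the GRPO
    branch; failed samples are sent to the SDPO branch with an
    entropy-aware weight that suppresses unreliable self-teacher targets.
    The exp(-H) weight below is an illustrative assumption.
    """
    grpo_batch, sdpo_batch = [], []
    for r in rollouts:
        if r["reward"] > 0:  # verifier says the sample is correct
            grpo_batch.append(r)
        else:
            h = entropy(r["teacher_probs"])
            w = math.exp(-h)  # confident (low-entropy) targets dominate
            sdpo_batch.append({**r, "distill_weight": w})
    return grpo_batch, sdpo_batch
```

A confident teacher distribution like `[0.99, 0.01]` receives a weight near 1, while a near-uniform one like `[0.5, 0.5]` is cut roughly in half, matching the abstract's goal of emphasizing confident distillation targets.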

Top-level tags: reinforcement learning · llm · model training
Detailed tags: policy optimization · reinforcement learning from human feedback · sample routing · self-distillation · credit assignment

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing


1️⃣ One-Sentence Summary

This paper proposes a new method called SRPO that combines the strengths of two existing reinforcement-learning techniques: by intelligently routing training samples of different quality to different optimization strategies, it achieves both rapid early gains and long-term stability when post-training large language models, ultimately surpassing the best existing methods across multiple benchmarks.

Source: arXiv: 2604.02288