arXiv submission date: 2026-06-10
📄 Abstract - RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology privilege-induced style drift; it destabilizes training or causes response length to shrink. To address this, we propose RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint. This suppresses the style shift that conditioning on a hint tends to induce regardless of correctness, and yields a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.
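
The contrastive idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact objective, so everything here is an assumption made for illustration: the function names `token_gaps` and `contrastive_signal` are hypothetical, the "gap" is taken to be a plain difference of per-token log-probabilities on the sampled response, and the contrast is taken to be a simple subtraction of the two gaps. The log-probability values are invented toy numbers, not results from the paper.

```python
from typing import List


def token_gaps(teacher_logp: List[float], student_logp: List[float]) -> List[float]:
    """Per-token teacher-minus-student log-probability gap on the sampled tokens."""
    if len(teacher_logp) != len(student_logp):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for t, s in zip(teacher_logp, student_logp)]


def contrastive_signal(
    student_logp: List[float],
    correct_hint_logp: List[float],
    wrong_hint_logp: List[float],
) -> List[float]:
    """Gap under a correct hint minus gap under a wrong hint (assumed form).

    A shift that any hint induces, regardless of correctness, appears in
    both gaps and cancels; what remains depends on the hint being correct.
    """
    gap_correct = token_gaps(correct_hint_logp, student_logp)
    gap_wrong = token_gaps(wrong_hint_logp, student_logp)
    return [c - w for c, w in zip(gap_correct, gap_wrong)]


# Toy response: ["So", "x", "=", "4"]. The first token is a style token that
# both hinted passes favor; the last token is the task-bearing answer.
student = [-2.0, -0.5, -0.25, -1.5]
correct_hint = [-0.5, -0.5, -0.25, -0.25]
wrong_hint = [-0.5, -0.5, -0.25, -3.0]

plain_gap = token_gaps(correct_hint, student)
contrast = contrastive_signal(student, correct_hint, wrong_hint)
print(plain_gap)  # [1.5, 0.0, 0.0, 1.25]
print(contrast)   # [0.0, 0.0, 0.0, 2.75]
```

In this toy case the plain correct-hint gap puts its largest weight on the style token "So", while the contrast zeroes it out and leaves weight only on the answer token. With a plain subtraction the student term cancels algebraically; the method in the paper may weight or combine the two gaps differently.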

Top-level tags: reinforcement learning, llm
Detailed tags: on-policy self-distillation, contrastive learning, reasoning, style drift

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation


1️⃣ One-sentence summary

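The paper proposes RLCSD, which contrasts the teacher-student distribution gap under a correct hint with the gap under a wrong hint. This addresses the tendency of on-policy self-distillation to imitate style rather than reasoning content, and yields stable performance gains on mathematical and logical reasoning tasks.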

Source: arXiv: 2606.11709