arXiv submission date: 2026-06-17
📄 Abstract - Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks. The results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.
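
The abstract does not spell out the training objective, so the following is only a minimal sketch of what "token-level guidance on the student's own sampled trajectories" could look like. It assumes a per-token KL divergence between the student's next-token distribution and that of a teacher that has additionally been given the rubric in its context. The function names (`log_softmax`, `per_token_kl`, `distillation_loss`), the KL direction, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized row.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_kl(student_logits, teacher_logits):
    # KL(student || teacher) at every position of a trajectory that the
    # student sampled itself. The teacher logits are assumed to come from
    # the same model family, scored with the rubric in its context.
    # The KL direction is an assumption made for this sketch.
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_lp = log_softmax(s_row)
        t_lp = log_softmax(t_row)
        kls.append(sum(math.exp(s) * (s - t) for s, t in zip(s_lp, t_lp)))
    return kls

def distillation_loss(student_logits, teacher_logits):
    # Mean per-token KL: every token gets its own learning signal,
    # in contrast to one scalar reward for the whole response.
    kls = per_token_kl(student_logits, teacher_logits)
    return sum(kls) / len(kls)

# Toy trajectory of two tokens over a three-word vocabulary (made-up numbers).
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # rubric-conditioned teacher

token_kls = per_token_kl(student, teacher)
loss = distillation_loss(student, teacher)
```

In this toy case the teacher agrees with the student on the first token and disagrees on the second, so only the second position contributes to the loss. That is the sense in which a per-token signal can localize credit more finely than a single scalar reward.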

Top-level tags: llm model training
Detailed tags: reasoning self-distillation rubric supervised learning reward

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation


1️⃣ One-sentence summary

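This paper proposes a new method for training reasoning models that uses detailed rubrics as structured feedback, letting the student model learn from its own reasoning process. This avoids both the reliance of traditional distillation on expensive and possibly erroneous reference answers and the limitation of reinforcement learning, which guides training with only a single score, and it achieves better results than existing methods on science reasoning tasks.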

Source: arXiv: 2606.19327