Learning from Language Feedback via Variational Policy Distillation
1️⃣ One-Sentence Summary
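This paper proposes Variational Policy Distillation (VPD), a framework in which the teacher model adapts as the student policy improves and keeps extracting more effective guidance from textual feedback, which resolves the problem in earlier methods where the teacher's ability plateaus and the student stops improving, and which significantly outperforms existing methods on complex tasks such as scientific reasoning and code generation.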
Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
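To make the E-step / M-step alternation concrete, here is a minimal toy sketch of a generic EM-style policy update over a single categorical "token" distribution. It is not the paper's algorithm: the abstract does not give the update rules, so everything below is an assumption made for illustration. The teacher target is the student distribution reweighted by exp(reward / beta), a fixed temperature `beta` stands in for the adaptive trust region, and scalar rewards stand in for trajectory outcomes. Language feedback is not modelled at all. All function names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def e_step(student_probs, rewards, beta):
    """Toy E-step (assumed form): teacher target q proportional to
    student_probs * exp(rewards / beta). A larger beta keeps q closer
    to the student, acting as a crude fixed trust region."""
    log_q = np.log(student_probs) + rewards / beta
    return softmax(log_q)


def m_step(student_logits, teacher_probs, lr):
    """Toy M-step: one gradient step on the cross-entropy between the
    teacher target and the student. For softmax logits the descent
    direction is (teacher_probs - student_probs)."""
    student_probs = softmax(student_logits)
    return student_logits + lr * (teacher_probs - student_probs)


def run_toy_em(rewards, steps=200, beta=1.0, lr=0.5):
    """Alternate the toy E-step and M-step, starting from a uniform student."""
    logits = np.zeros_like(rewards, dtype=float)
    for _ in range(steps):
        student_probs = softmax(logits)
        teacher_probs = e_step(student_probs, rewards, beta)
        logits = m_step(logits, teacher_probs, lr)
    return softmax(logits)


# Hypothetical setup: four candidate tokens, only index 2 is rewarded.
rewards = np.array([0.0, 0.0, 1.0, 0.0])
final_policy = run_toy_em(rewards)
print(final_policy)
```

In this toy the teacher target is recomputed from the current student at every iteration, so it keeps moving as the student improves. That is the qualitative contrast with a fixed teacher that the abstract draws, here reduced to a one-state problem.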
Source: arXiv:2605.15113