Less is More: Early Stopping Rollout for On-Policy Distillation
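1️⃣ One-sentence summary
This paper finds that in on-policy knowledge distillation, the teacher model's ability to score the later part of a long student generation degrades because that context drifts away from the teacher's training distribution. It therefore proposes a simple "early stopping rollout" strategy in which the student generates only the first few tokens, which outperforms conventional full-length distillation across a range of models and tasks while markedly improving training efficiency and stability.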
On-policy distillation (OPD) has recently emerged as a promising alternative to standard sequence-level imitation: it trains a student by scoring the student's own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for later tokens, the context is the student's earlier trajectory, which is off-policy with respect to the teacher, so the teacher's ability to produce a corrective score decays, and it may fall back to the token-completion behavior learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to fix it, a simple yet effective distillation strategy that restricts rollout generation to the first few response tokens. We show that ESR surpasses full-rollout OPD across model sizes, families, tasks, and training regimes, and that it exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and identify the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works so well and why it sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.
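To make the difference between a full rollout and an early-stopped rollout concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation. The abstract does not state the exact training objective, so the per-token KL(student || teacher) used here is an assumption, borrowed from common on-policy distillation setups. The toy policies, the vocabulary size, and all names (`toy_policy`, `rollout`, `distill_loss`, the stop length of 4) are hypothetical and chosen only for illustration.

```python
import math
import random

VOCAB = 4  # toy vocabulary size (assumption, for illustration only)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_policy(shift):
    """Return a hypothetical next-token distribution as a function of the context."""
    def next_token_probs(context):
        # Logits depend deterministically on the last token and context length.
        last = context[-1] if context else 0
        logits = [math.sin(shift + last + i * (len(context) + 1)) for i in range(VOCAB)]
        return softmax(logits)
    return next_token_probs

def rollout(student, prompt, max_new_tokens, rng):
    """Sample max_new_tokens tokens from the student itself (on-policy)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = student(tokens)
        tokens.append(rng.choices(range(VOCAB), weights=probs)[0])
    return tokens

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def distill_loss(student, teacher, prompt, tokens):
    """Mean per-token KL(student || teacher) over the generated positions.

    Assumed objective: the teacher scores each position given the
    student's own prefix as context.
    """
    positions = range(len(prompt), len(tokens))
    total = sum(kl_divergence(student(tokens[:t]), teacher(tokens[:t])) for t in positions)
    return total / len(positions)

student = toy_policy(0.3)
teacher = toy_policy(1.1)
prompt = [1, 2]

# Full rollout: the teacher must score all 32 positions, including late ones
# whose context is a long student-generated (off-policy to the teacher) prefix.
full = rollout(student, prompt, max_new_tokens=32, rng=random.Random(0))

# Early-stopped rollout: generation itself halts after the first 4 response
# tokens, so both sampling and teacher scoring cost far less.
early = rollout(student, prompt, max_new_tokens=4, rng=random.Random(0))

full_loss = distill_loss(student, teacher, prompt, full)
early_loss = distill_loss(student, teacher, prompt, early)
print("scored positions:", len(full) - len(prompt), "vs", len(early) - len(prompt))
```

In this sketch the only change between the two regimes is the `max_new_tokens` argument: the early-stopped variant never generates the later tokens, which is where the efficiency gain described in the abstract would come from. The toy does not reproduce the teacher-decay effect itself, since both policies here are synthetic functions.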
Source: arXiv: 2605.27028