arXiv submission date: 2026-03-11
📄 Abstract - PACED: Distillation at the Frontier of Student Competence

Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to Paced, a framework that concentrates distillation on the zone of proximal development -- the frontier of a student model's competence -- via a principled pass-rate weight $w(p) = p^\alpha(1 - p)^\beta$ derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel $w(p) = p^\alpha(1-p)^\beta$ is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust -- under bounded multiplicative misspecification, the worst-case efficiency loss is only $O(\delta^2)$. (2) Distillation: When distilling from a larger teacher to a smaller student with forward KL, Paced achieves a significant gain over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, Paced likewise exceeds the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, reaching substantial improvements on standard reasoning benchmarks -- supporting a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with any KL direction.
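A minimal sketch of the weighting idea described above: each problem's pass rate is estimated from student rollouts, converted into the Beta-kernel weight $w(p) = p^\alpha(1-p)^\beta$, and used to reweight per-problem distillation losses so that frontier problems dominate. The `alpha`/`beta` values, the toy rollout data, and the loss normalization here are illustrative assumptions, not the paper's implementation.

```python
# Sketch (not the authors' code): Beta-kernel pass-rate weighting for distillation.
import numpy as np

def pass_rate_weight(p: np.ndarray, alpha: float = 1.0, beta: float = 1.0) -> np.ndarray:
    """Beta-kernel weight w(p) = p^alpha * (1-p)^beta; peaks at intermediate pass rates."""
    return np.power(p, alpha) * np.power(1.0 - p, beta)

# Estimate each problem's pass rate from k student rollouts (1 = solved, 0 = failed).
rollouts = np.array([
    [1, 1, 1, 1],   # already mastered  -> p = 1.0 -> weight ~ 0
    [1, 0, 1, 0],   # frontier problem  -> p = 0.5 -> largest weight
    [0, 0, 0, 0],   # far out of reach  -> p = 0.0 -> weight ~ 0
])
p_hat = rollouts.mean(axis=1)

# Per-problem distillation losses (e.g. forward KL to the teacher); placeholder values.
per_problem_kl = np.array([0.8, 1.2, 2.5])

w = pass_rate_weight(p_hat, alpha=1.0, beta=1.0)
weighted_loss = (w * per_problem_kl).sum() / (w.sum() + 1e-8)
print(p_hat, w.round(3), round(float(weighted_loss), 3))
```

With $\alpha = \beta = 1$ the weight is maximal at $p = 0.5$ and vanishes at both extremes, matching the abstract's claim that gradient SNR collapses for fully mastered and fully out-of-reach problems.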

PACED: Distillation at the Frontier of Student Competence


1️⃣ One-Sentence Summary

This paper proposes a distillation framework called PACED that focuses training on the student model's competence boundary, i.e. problems it can partly but not fully solve. By avoiding the compute that conventional methods waste on already-mastered or far-out-of-reach problems, it significantly improves both the efficiency and the effectiveness of model distillation.

Source: arXiv: 2603.11178