Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails

📄 Abstract

Large language models (LLMs) achieve strong performance across many tasks, but their high computational cost limits deployment in resource-constrained environments. Knowledge Distillation (KD) offers a practical solution by transferring knowledge from a larger teacher model to a smaller student model. While prior work has mainly examined task-specific or small-scale settings, the post-training stage for building general instruction-following models has received limited attention. In this paper, we conduct a systematic study of KD in post-training using the large-scale Tulu 3 dataset. We find that KD outperforms supervised fine-tuning (SFT) in low-data regimes, but its advantage diminishes as more training data is added. Distilling from a stronger instruction-tuned teacher restores substantial gains even with abundant data, indicating that KD remains effective when the teacher provides knowledge that the student cannot easily acquire from the training data alone. We further study domain-specific, low-resource scenarios and propose a two-stage KD strategy that first trains on synthetic teacher-labeled data and then refines on human annotations. This method consistently improves student performance, providing practical guidance for building compact models in data-scarce environments.

1️⃣ One-Sentence Summary

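This study systematically analyzes how knowledge distillation helps small student models improve during the post-training stage of large language models, finding that distillation clearly outperforms conventional fine-tuning when training data is scarce but that its advantage shrinks when data is plentiful; using a stronger instruction-tuned teacher, however, still yields clear gains even with abundant data, and for data-scarce scenarios the study proposes a two-stage distillation strategy to further improve model performance.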
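
To make the contrast between SFT and KD concrete, here is a minimal sketch of the two objectives for a single token position. It assumes the common textbook formulation (cross-entropy on a hard label for SFT, forward KL divergence from teacher to student on temperature-softened distributions for KD); the abstract does not state which distillation objective the paper uses, so this is an illustration rather than the authors' method. The function names and the `alpha` and `temperature` parameters are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(student_logits, target_index):
    """SFT: cross-entropy of the student against one hard reference token."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KD (assumed form): forward KL(teacher || student) over the vocabulary."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(
        p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0
    )


def combined_loss(student_logits, teacher_logits, target_index,
                  alpha=0.5, temperature=1.0):
    """Hypothetical mix: alpha weights the KD term, (1 - alpha) the SFT term."""
    kd = kd_loss(student_logits, teacher_logits, temperature)
    sft = sft_loss(student_logits, target_index)
    return alpha * kd + (1.0 - alpha) * sft


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]   # toy 3-token vocabulary
    teacher = [3.0, 1.0, -2.0]
    print("SFT loss:", sft_loss(student, target_index=0))
    print("KD loss: ", kd_loss(student, teacher, temperature=2.0))
    print("Mixed:   ", combined_loss(student, teacher, 0, alpha=0.5))
```

The SFT term only sees the single reference token, whereas the KD term sees the teacher's full distribution over the vocabulary. That extra signal per token is one plausible reading of why distillation helps most when data is scarce or when the teacher is stronger than what the training data alone can convey.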
