arXiv submission date: 2026-01-20
📄 Abstract - Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that closely align with the model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically combine low absolute probability with relatively high-ranked tokens under the student model, balancing learning signal strength and behavioral alignment. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training performance (average Spearman 0.86), outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
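
The abstract defines RSR only in words, as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model. The following is a minimal sketch of that reading, not the authors' implementation. The function name `rank_surprisal_ratio`, the 1-based rank convention, the tie handling, and the use of raw per-position logits as input are all assumptions made for illustration; the paper may apply further details (for example rank clipping or masking of prompt tokens) that the abstract does not describe.

```python
import numpy as np

def rank_surprisal_ratio(logits, token_ids):
    """Hypothetical helper: mean token rank divided by mean surprisal.

    logits:    array of shape (T, V), student logits at each position
    token_ids: array of shape (T,), the trajectory token at each position
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids)

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Log-probability the student assigns to each trajectory token.
    positions = np.arange(len(token_ids))
    target_log_probs = log_probs[positions, token_ids]

    # Surprisal is the negative log-likelihood of each token.
    surprisal = -target_log_probs

    # Assumed convention: rank 1 is the most probable token; ties share
    # the best rank (only strictly more probable tokens are counted).
    ranks = 1 + (log_probs > target_log_probs[:, None]).sum(axis=-1)

    return ranks.mean() / surprisal.mean()


# Toy example: two positions over a 3-token vocabulary.
toy_probs = np.array([[0.5, 0.25, 0.25],
                      [0.5, 0.25, 0.25]])
toy_tokens = np.array([1, 0])
# Ranks are 2 and 1, surprisals are ln 4 and ln 2, so the ratio is 1 / ln 2.
toy_rsr = rank_surprisal_ratio(np.log(toy_probs), toy_tokens)
print(round(toy_rsr, 4))  # 1.4427
```

Under this reading, a trajectory whose tokens are improbable in absolute terms (high surprisal) yet still ranked near the top of the student's distribution (small rank) yields a small ratio, which matches the abstract's description of effective trajectories.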

Top-level tags: llm model training model evaluation
Detailed tags: reasoning distillation chain-of-thought data selection teacher-student alignment metric

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment


1️⃣ One-sentence summary

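This paper proposes a simple new metric, the Rank-Surprisal Ratio (RSR), that effectively assesses the quality of reasoning trajectories used to train student large language models. It helps select teaching material that both fits the student's current level and carries new information, thereby significantly improving model performance on complex reasoning tasks.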

Source: arXiv: 2601.14249