Rethinking Expert Trajectory Utilization in LLM Post-training
1️⃣ One-sentence summary
This paper proposes a theoretical framework, finds that the sequential approach of supervised fine-tuning followed by reinforcement learning works best, and gives concrete guidelines for choosing the optimal switching point and training data to maximize model performance.
While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
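As a rough illustration of guideline (1), the sketch below (not the paper's implementation; the `patience` and `rise_tolerance` parameters and the `should_switch_to_rl` helper are assumptions for demonstration) monitors an SFT validation-loss history and signals the hand-off to RL once the loss has stabilized or risen only mildly above its minimum.

```python
# Illustrative sketch, assuming a simple validation-loss-based trigger for the
# SFT-to-RL transition described in the abstract. Not taken from the paper.

def should_switch_to_rl(val_losses, patience=3, rise_tolerance=0.02):
    """Return True once validation loss has stopped improving (a 'Stable'
    sub-phase) or has risen only slightly above its minimum ('Mild
    Overfitting'), the window where the abstract suggests moving to RL."""
    if len(val_losses) < patience + 1:
        return False  # not enough history to judge the trend
    best = min(val_losses)
    recent = val_losses[-patience:]
    # Stable: no recent step improved on the best loss seen so far.
    stable = all(loss >= best for loss in recent)
    # Mild overfitting: current loss is above the minimum but within tolerance.
    mild_overfit = 0 < (val_losses[-1] - best) / best <= rise_tolerance
    return stable or mild_overfit

# Example: loss plateaus and then ticks up slightly -> time to switch.
history = [1.30, 1.10, 0.95, 0.90, 0.90, 0.91, 0.92]
print(should_switch_to_rl(history))  # True
```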
Source: arXiv: 2512.11470