arXiv submission date: 2026-01-12
📄 Abstract - On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether the two stages can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases the SFT loss even when starting from an SFT-optimal model, and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated in post-training without losing previously acquired performance.
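For reference, the two objectives the abstract contrasts are commonly written as follows. This is a standard formulation, not necessarily the paper's exact notation: SFT minimizes cross-entropy against expert responses, while RL maximizes the expected reward of the model's own samples.

```latex
% Standard formulations of the two post-training objectives (notation assumed,
% not taken from the paper): D is the expert dataset, \pi_\theta the model,
% and r a reward from human preferences or a rule-based verifier.
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y^\ast) \sim D}\!\left[\log \pi_\theta(y^\ast \mid x)\right],
\qquad
\mathcal{J}_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\!\left[r(x, y)\right].
```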

Top-level tags: llm model training theory
Detailed tags: post-training supervised fine-tuning reinforcement learning decoupling model optimization

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training


1️⃣ One-sentence summary

Through theory and experiments, this paper shows that in the post-training stage of large language models, supervised fine-tuning and reinforcement learning are tightly coupled and cannot be decoupled: forcibly separating them degrades model performance.
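The first coupling direction (RL increases SFT loss under SFT optimality) can be illustrated at toy scale. The sketch below is a self-contained PyTorch example with an assumed toy policy, vocabulary, reward, and hyperparameters; it is not the paper's experimental setup, only an illustration of tracking the SFT cross-entropy while running policy-gradient steps.

```python
# Minimal toy sketch of the SFT-then-RL coupling: fit a tiny policy to expert
# data (SFT), then take a few REINFORCE steps on an unrelated reward and watch
# the SFT cross-entropy rise. All components below are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, CTX = 16, 8                        # toy vocabulary and context sizes
policy = torch.nn.Linear(CTX, VOCAB)      # toy "language model": context -> next-token logits

# Toy expert data: random contexts paired with expert next tokens.
contexts = torch.randn(64, CTX)
expert_tokens = torch.randint(0, VOCAB, (64,))

def sft_loss():
    """Cross-entropy between policy outputs and expert tokens (the SFT objective)."""
    return F.cross_entropy(policy(contexts), expert_tokens)

def reward(tokens):
    """Hypothetical rule-based verifier: rewards even-numbered tokens."""
    return (tokens % 2 == 0).float()

# --- Stage 1: SFT to (approximate) optimality --------------------------------
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    sft_loss().backward()
    opt.step()
print(f"SFT loss after SFT: {sft_loss().item():.4f}")

# --- Stage 2: a few REINFORCE steps on the reward -----------------------------
for _ in range(50):
    opt.zero_grad()
    dist = torch.distributions.Categorical(logits=policy(contexts))
    actions = dist.sample()
    # Policy gradient: maximize expected reward (minimize its negative).
    pg_loss = -(dist.log_prob(actions) * reward(actions)).mean()
    pg_loss.backward()
    opt.step()

# Consistent with the paper's claim, the cross-entropy on expert data should now be higher.
print(f"SFT loss after RL:  {sft_loss().item():.4f}")
```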

Source: arXiv: 2601.07389