arXiv submission date: 2026-01-12
📄 Abstract - On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether the two stages can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases the SFT loss even when starting from an SFT-optimal model, and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated in post-training without losing previously acquired performance.
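For reference, the two objectives the abstract contrasts are commonly written as follows. This is a standard formulation, not necessarily the paper's exact notation: SFT minimizes cross-entropy against expert responses, while RL maximizes the expected reward of the model's own samples.

```latex
% Standard formulations of the two post-training objectives (notation assumed,
% not taken from the paper): D is the expert dataset, \pi_\theta the model,
% and r a reward from human preferences or a rule-based verifier.
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y^\ast) \sim D}\!\left[\log \pi_\theta(y^\ast \mid x)\right],
\qquad
\mathcal{J}_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\!\left[r(x, y)\right].
```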

Top-level tags: llm model training theory
Detailed tags: post-training supervised fine-tuning reinforcement learning decoupling model optimization

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training


1️⃣ One-sentence summary

Through theory and experiments, this paper shows that in the post-training stage of large language models, supervised fine-tuning and reinforcement learning are tightly coupled and cannot be decoupled: forcibly separating them degrades model performance.
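The first coupling direction (RL increases SFT loss under SFT optimality) can be illustrated at toy scale. The sketch below is a self-contained PyTorch example with an assumed toy policy, vocabulary, reward, and hyperparameters; it is not the paper's experimental setup, only an illustration of tracking the SFT cross-entropy while running policy-gradient steps.

```python
# Minimal toy sketch of the SFT-then-RL coupling: fit a tiny policy to expert
# data (SFT), then take a few REINFORCE steps on an unrelated reward and watch
# the SFT cross-entropy rise. All components below are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, CTX = 16, 8                        # toy vocabulary and context sizes
policy = torch.nn.Linear(CTX, VOCAB)      # toy "language model": context -> next-token logits

# Toy expert data: random contexts paired with expert next tokens.
contexts = torch.randn(64, CTX)
expert_tokens = torch.randint(0, VOCAB, (64,))

def sft_loss():
    """Cross-entropy between policy outputs and expert tokens (the SFT objective)."""
    return F.cross_entropy(policy(contexts), expert_tokens)

def reward(tokens):
    """Hypothetical rule-based verifier: rewards even-numbered tokens."""
    return (tokens % 2 == 0).float()

# --- Stage 1: SFT to (approximate) optimality --------------------------------
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    sft_loss().backward()
    opt.step()
print(f"SFT loss after SFT: {sft_loss().item():.4f}")

# --- Stage 2: a few REINFORCE steps on the reward -----------------------------
for _ in range(50):
    opt.zero_grad()
    dist = torch.distributions.Categorical(logits=policy(contexts))
    actions = dist.sample()
    # Policy gradient: maximize expected reward (minimize its negative).
    pg_loss = -(dist.log_prob(actions) * reward(actions)).mean()
    pg_loss.backward()
    opt.step()

# Consistent with the paper's claim, the cross-entropy on expert data should now be higher.
print(f"SFT loss after RL:  {sft_loss().item():.4f}")
```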

Source: arXiv: 2601.07389