Abstract - Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning
Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better suited for acquiring not-yet-mastered reasoning skills, while RL is better suited for consolidating skills that the model can already partially access. Based on this principle, we propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. For hard samples in the SFT stage, we introduce a Bridge mechanism that transforms raw teacher-generated reasoning traces into more learnable supervision for SLMs. For hard samples that remain unsolved during RL, we apply Critique Fine-Tuning by converting all-zero-reward failures into diagnostic, repair, and new reasoning trace supervision for the next SFT stage. Experiments on two SLMs across five reasoning benchmarks show that our method consistently improves over representative SFT, distillation, and RL baselines. Our results highlight the importance of coordinating data difficulty across SFT and RL for effective SLM reasoning post-training.
Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning
1️⃣ One-Sentence Summary
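For the two-stage pipeline used to train small language models for reasoning (supervised fine-tuning followed by reinforcement learning), this paper proposes organizing data by difficulty according to each stage's learning objective: the SFT stage focuses on hard samples the model has not yet mastered, with a "Bridge mechanism" designed to make them easier to learn from; the RL stage focuses on consolidating samples the model can already partially solve, and uses failure cases for "Critique Fine-Tuning" as supplementary training, thereby significantly improving the model's performance on multiple reasoning tasks.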
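The stage-specific split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how difficulty is measured, so this example assumes it is estimated from the pass rate over several sampled rollouts with binary rewards. The names `Sample`, `pass_rate`, `split_by_stage`, and `all_zero_failures`, and the `mastered_at` threshold, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    prompt: str
    rewards: List[float]  # assumed: one binary reward per sampled rollout


def pass_rate(sample: Sample) -> float:
    """Fraction of rollouts that earned a positive reward."""
    if not sample.rewards:
        return 0.0
    return sum(1 for r in sample.rewards if r > 0) / len(sample.rewards)


def split_by_stage(
    samples: List[Sample], mastered_at: float = 1.0
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Route samples to SFT or RL by pass rate (hypothetical thresholds)."""
    sft_set, rl_set, mastered = [], [], []
    for s in samples:
        rate = pass_rate(s)
        if rate == 0.0:
            sft_set.append(s)    # not yet mastered: needs supervision (SFT)
        elif rate < mastered_at:
            rl_set.append(s)     # partially accessible: consolidate with RL
        else:
            mastered.append(s)   # solved in every rollout
    return sft_set, rl_set, mastered


def all_zero_failures(rl_samples: List[Sample]) -> List[Sample]:
    """Samples whose rollouts all received zero reward.

    In the described pipeline these would be converted into critique-style
    supervision for the next SFT stage; that conversion is not sketched here.
    """
    return [s for s in rl_samples if s.rewards and all(r == 0 for r in s.rewards)]
```

In this sketch, a sample with rewards `[0, 0, 0, 0]` is routed to the SFT set, one with `[1, 0, 0, 1]` to the RL set, and one with `[1, 1, 1, 1]` is treated as already mastered. The Bridge mechanism and the critique construction would operate on the SFT set and on the output of `all_zero_failures`, respectively.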