Abstract - Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
Recent advances in large language models (LLMs) have demonstrated that reinforcement fine-tuning of pretrained base models can lead to significant gains in reasoning performance at inference time. In this work, we theoretically analyze why reinforcement fine-tuning induces better reasoning ability than purely supervised fine-tuning (SFT) methods. We model chain-of-thought (CoT) reasoning as a pathfinding problem on graphs and compare the popular method of reinforcement learning with verifiable rewards (RLVR) against traditional SFT. We prove that SFT, when trained on golden shortest paths without negative examples, fails to learn how to efficiently backtrack. In contrast, an RLVR-trained model can learn how to efficiently backtrack from dead ends using only outcome reward. This leads to an exponential separation in inference-time compute between the two methods, and demonstrates that RLVR leads the model to learn the location of difficult decisions in a reasoning chain, ultimately allowing for better allocation of inference-time compute. Finally, we show that the reasoning traces of an RLVR model can be distilled to train a base model to backtrack efficiently as well.
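To make the pathfinding view concrete, here is a minimal toy sketch of reasoning as a search on a graph in which the searcher backtracks out of a dead end. It is not the paper's construction and does not model SFT or RLVR training. The graph, the node names, and the function `find_path_with_backtracking` are hypothetical, chosen only for illustration.

```python
def find_path_with_backtracking(graph, start, goal):
    """Depth-first search that records every move, including backtracks.

    Hypothetical helper for illustration only; `graph` maps each node to a
    list of its successors.
    """
    path = [start]       # current partial path (the "reasoning chain")
    visited = {start}    # nodes already tried, so dead ends are not re-entered
    trace = [start]      # full trace, including steps back out of dead ends
    while path:
        node = path[-1]
        if node == goal:
            return path, trace
        # Pick the first successor that has not been tried yet, if any.
        nxt = next((n for n in graph.get(node, []) if n not in visited), None)
        if nxt is None:
            path.pop()   # dead end: backtrack one step
            if path:
                trace.append(path[-1])
        else:
            visited.add(nxt)
            path.append(nxt)
            trace.append(nxt)
    return None, trace   # goal unreachable from start


# Toy graph (an assumption): the branch s -> a -> d is a dead end,
# while s -> b -> g reaches the goal.
graph = {
    "s": ["a", "b"],
    "a": ["d"],
    "d": [],
    "b": ["g"],
    "g": [],
}

path, trace = find_path_with_backtracking(graph, "s", "g")
print(path)   # final path to the goal
print(trace)  # every node visited, including the backtracking steps
```

In this toy run the final path is `s, b, g`, while the trace is `s, a, d, a, s, b, g`: the extra entries are the cost of entering the dead end and backing out of it. The abstract's claim concerns how efficiently a trained model performs this kind of recovery, which this sketch does not attempt to measure.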
Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
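1️⃣ One-Sentence Summary
By modeling chain-of-thought reasoning as a pathfinding problem on graphs, this study proves theoretically that, compared with traditional supervised fine-tuning, reinforcement learning with verifiable rewards can teach a large language model to backtrack efficiently from dead ends during reasoning, yielding an exponential improvement in inference-time compute efficiency, and that this backtracking ability can also be passed on to other models through distillation.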