arXiv submission date: 2026-02-09
📄 Abstract - New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.

Top tags: llm, reinforcement learning, theory
Detailed tags: verifiable rewards, emergent reasoning, probabilistic framework, multi-step reasoning, capability emergence

New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR


1️⃣ One-sentence summary

Using a probabilistic framework, this paper shows that under Reinforcement Learning with Verifiable Rewards, models learn complex multi-step reasoning not by acquiring entirely new abilities, but by sharply raising the accuracy of the atomic steps they already possess, thereby overcoming the exponential decay of success rates inherent in multi-step tasks.
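
To see why sharpening atomic steps matters, here is a minimal illustrative sketch (not from the paper; the per-step accuracies 0.70 and 0.95 below are assumed purely for illustration): if an n-step solution requires every atomic step to succeed, the composite success rate is roughly the product of the per-step probabilities, so modest per-step gains compound into large gains on long chains.

```python
# Illustrative sketch: composite success as the product of atomic-step
# probabilities, assuming independent steps. The per-step accuracies
# (0.70 before training, 0.95 after) are assumed, not taken from the paper.

def composite_success(step_probs):
    """Probability that every atomic step in a chain succeeds."""
    p = 1.0
    for q in step_probs:
        p *= q
    return p

n_steps = 8
before = composite_success([0.70] * n_steps)  # assumed pre-RLVR primitive accuracy
after = composite_success([0.95] * n_steps)   # assumed sharpened primitive accuracy

print(f"{n_steps}-step task at 0.70 per step: {before:.3f}")  # ~0.058
print(f"{n_steps}-step task at 0.95 per step: {after:.3f}")   # ~0.663
```

This toy calculation mirrors the paper's second finding, that composite performance tracks the joint probability of the atomic steps: raising per-step accuracy from 0.70 to 0.95 moves an 8-step task from effectively unsolvable to solvable most of the time.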

Source: arXiv 2602.08281