arXiv submission date: 2026-04-15
📄 Abstract - From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
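The claimed gradient alignment between $\log P(y)$ and $\log P(y|x)$ can be motivated by writing the marginal as a mixture over prompts (a sketch of the standard identity, not necessarily the paper's derivation):

$$\nabla_\theta \log P_\theta(y) = \sum_x \frac{P(x)\,P_\theta(y\mid x)}{P_\theta(y)}\,\nabla_\theta \log P_\theta(y\mid x)$$

That is, the marginal gradient is a posterior-weighted average of conditional gradients, so whenever a single prompt dominates the posterior over $x$ given $y$, updating $\log P(y)$ closely tracks updating $\log P(y|x)$.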

Top-level tags: llm model training reinforcement learning
Detailed tags: pre-training reasoning policy optimization distribution shift marginal distribution

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space


1️⃣ One-sentence summary

This paper proposes a method called PreRL, which applies reward-driven optimization directly to the model's marginal distribution over outputs in the pre-train space, rather than the question-specific conditional distribution. Combined with a Negative Sample Reinforcement mechanism that aggressively prunes incorrect reasoning paths and stimulates the model's reflective behavior, this yields a two-stage training strategy (DSRL) that significantly improves the reasoning performance of large language models.
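The pruning effect of Negative Sample Reinforcement can be illustrated with a toy sketch (my own illustration, not the paper's algorithm): treat a small softmax over candidate trajectories as the policy, and do gradient descent on the log-probability of an incorrect sample, which shifts probability mass away from the bad reasoning path.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nsr_update(logits, bad_idx, lr=0.5):
    """One NSR-style step: descend on log p(bad_idx) to suppress
    an incorrect trajectory (toy sketch, not the paper's update)."""
    p = softmax(logits)
    grad = -p.copy()
    grad[bad_idx] += 1.0          # d log p_i / d logits = e_i - p
    return logits - lr * grad     # descend: prune the bad trajectory

logits = np.zeros(4)              # uniform toy "policy" over 4 trajectories
before = softmax(logits)[2]
logits = nsr_update(logits, bad_idx=2)
after = softmax(logits)[2]        # probability of the bad path drops
```

Because the softmax renormalizes, suppressing the incorrect trajectory redistributes mass to the remaining ones, mirroring how pre-train-space pruning is described as steering the policy toward the correct reasoning subspace.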

Source: arXiv:2604.14142