arXiv submission date: 2025-12-28
📄 Abstract - Diversity or Precision? A Deep Dive into Next Token Prediction

Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
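The abstract describes reshaping the pre-trained token distribution through a policy-gradient view of cross-entropy: a positive reward scale on the ground-truth token plus rank-aware, asymmetric penalties on negative tokens. Below is a minimal sketch of such a reward-shaped next-token objective, assuming a PyTorch setup; the hyperparameter names (`alpha`, `beta_high`, `beta_low`, `top_k`) and the exact penalty form are illustrative assumptions, not the paper's published loss.

```python
# A minimal sketch of a reward-shaped next-token objective. Each position is
# treated as a single-step episode; `alpha` scales the positive reward on the
# ground-truth token, while `beta_high` / `beta_low` penalize probability mass
# on high- vs. low-ranking negative tokens (cutoff `top_k`). These names and
# the specific penalty shape are assumptions for illustration.
import torch
import torch.nn.functional as F

def reward_shaped_nt_loss(logits, targets, alpha=1.0,
                          beta_high=0.1, beta_low=0.0, top_k=20):
    """Policy-gradient-style loss for next-token prediction.

    logits:  (batch, seq_len, vocab) unnormalized scores
    targets: (batch, seq_len) ground-truth token ids
    """
    log_probs = F.log_softmax(logits, dim=-1)              # log pi(a | s)

    # Positive reward on the ground-truth token; alpha = 1 with zero penalties
    # recovers standard cross-entropy (single-step REINFORCE with reward 1).
    gt_log_prob = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    pos_term = alpha * gt_log_prob

    # Rank-aware treatment of negative tokens: penalize probability mass on
    # high-ranking (top-k) non-target tokens more than on low-ranking ones.
    probs = log_probs.exp()
    ranks = logits.argsort(dim=-1, descending=True).argsort(dim=-1)
    is_target = F.one_hot(targets, logits.size(-1)).bool()
    high_rank = (ranks < top_k) & ~is_target
    low_rank = ~high_rank & ~is_target

    neg_term = (beta_high * (probs * high_rank).sum(-1)
                + beta_low * (probs * low_rank).sum(-1))

    # Maximize reward on the target, suppress mass on negative tokens.
    return (-pos_term + neg_term).mean()
```

With `alpha=1` and both penalties at zero this reduces to ordinary cross-entropy; increasing `beta_high` pushes probability mass away from confusable high-ranked alternatives, i.e. toward the precision-oriented prior the paper argues benefits later RL exploration.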

Top-level tags: llm model training theory
Detailed tags: next token prediction, reinforcement learning, exploration space, reward shaping, distribution entropy

Diversity or Precision? A Deep Dive into Next Token Prediction


1️⃣ One-Sentence Summary

This paper finds that when training large language models, rather than pursuing diversity in predictions, it is better to shape a precision-oriented token distribution already during pre-training: this gives subsequent reinforcement learning a more favorable exploration starting point and ultimately improves the model's reasoning ability.

Source: arXiv:2512.22955