STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
1️⃣ One-Sentence Summary
This paper finds that the instability of reinforcement learning training for large language models stems from a tiny number of "spurious tokens," and proposes STAPO, which selectively masks the gradient updates of these tokens, effectively improving training stability and model performance on mathematical reasoning tasks.
Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term "spurious tokens". When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refining, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy and JustRL.
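To make the mask-and-renormalize idea concrete, here is a minimal sketch of a token-level clipped policy-gradient loss in which gradient contributions from suspected spurious tokens are silenced and the loss is renormalized over the remaining valid tokens. The function name `stapo_style_loss`, the spurious-token criterion (very low-probability tokens inside positively rewarded responses), and the `prob_threshold` value are illustrative assumptions; the abstract does not specify the paper's exact selection rule.

```python
# Illustrative sketch, not the paper's reference implementation.
import torch

def stapo_style_loss(logprobs, old_logprobs, advantages, response_mask,
                     prob_threshold=1e-4, clip_eps=0.2):
    """Token-level clipped policy-gradient loss with spurious tokens masked out.

    logprobs, old_logprobs: (batch, seq) log-probabilities of sampled tokens
    advantages:             (batch, seq) sequence-level advantage broadcast per token
    response_mask:          (batch, seq) 1 for generated tokens, 0 for prompt/padding
    """
    # Standard PPO/GRPO-style clipped surrogate, computed per token.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_token = -torch.min(ratio * advantages, clipped * advantages)

    # Assumed spurious-token rule: extremely low-probability tokens inside
    # positively rewarded responses have their gradient contribution silenced.
    spurious = (old_logprobs.exp() < prob_threshold) & (advantages > 0)
    valid = response_mask.bool() & ~spurious

    # Renormalize over the remaining valid tokens so the loss scale is preserved.
    return (per_token * valid).sum() / valid.sum().clamp(min=1)
```

In this sketch the masking happens before normalization, so silencing a handful of tokens does not shrink the effective learning signal for the rest of the sequence.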
Source: arXiv: 2602.15620