Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs
1️⃣ One-Sentence Summary
This paper finds that large language models can gain performance from reinforcement learning even when the reward signal is incorrect. The mechanism: the model forms an "Anchor-Adapter" neural circuit in its middle layers that bypasses complex reasoning and produces answers directly from memorization.
Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering: artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at this https URL.
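To make the two interventions named in the abstract concrete, here is a minimal PyTorch/Transformers sketch (not the authors' released code): a logit-lens probe that decodes the hidden states of the middle "Functional Anchor" layers (L18-20), and a forward hook that rescales the MLP outputs of the later "Structural Adapter" layers (L21+) to imitate the bidirectional steering experiment. The checkpoint name, prompt, scale factor, and the choice to scale whole MLP outputs rather than individual MLP key vectors are all illustrative assumptions.

```python
# Minimal sketch, assuming a Hugging Face Qwen2.5 checkpoint with the standard
# LLaMA-style module layout (model.model.layers[i].mlp). Layer indices follow
# the paper's L18-20 / L21+ split; everything else is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"          # assumed checkpoint
ANCHOR_LAYERS = range(18, 21)      # "Functional Anchor" layers (L18-20)
ADAPTER_LAYERS = range(21, 28)     # "Structural Adapter" layers (L21+)
MLP_SCALE = 1.5                    # >1 amplifies, <1 suppresses the shortcut (hypothetical value)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Question: What is 17 * 24? Answer:"   # illustrative prompt
inputs = tok(prompt, return_tensors="pt")

# --- Logit lens: decode middle-layer hidden states through the unembedding ---
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
for layer in ANCHOR_LAYERS:
    h = out.hidden_states[layer][:, -1]          # last-position hidden state after this layer
    logits = model.lm_head(model.model.norm(h))  # final RMSNorm + unembedding projection
    top_token = tok.decode(logits.argmax(-1))
    print(f"layer {layer}: top logit-lens token = {top_token!r}")

# --- Bidirectional steering: rescale late-layer MLP outputs ------------------
# Coarse stand-in for the paper's per-key intervention: scale the whole MLP
# output of each adapter layer instead of individual MLP keys.
def scale_mlp(module, inputs, output):
    return output * MLP_SCALE

handles = [model.model.layers[i].mlp.register_forward_hook(scale_mlp)
           for i in ADAPTER_LAYERS]
with torch.no_grad():
    steered = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tok.decode(steered[0], skip_special_tokens=True))
for h in handles:
    h.remove()
```

If the shortcut hypothesis holds, the logit-lens probe should already decode the memorized answer token around L18-20, and sweeping MLP_SCALE above and below 1 should respectively amplify or suppress the contamination-driven accuracy, mirroring the steering result reported in the abstract.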
Source: arXiv: 2601.11061