GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

📄 Abstract

Reinforcement learning with verifiable rewards (e.g., GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually either broadcast one sequence-level advantage to all tokens or rely on costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled, and OctoThinker families show that GRAIL consistently outperforms GRPO, with an average improvement of 3.60% in accuracy and 3.05% in Pass@3. These results demonstrate that fine-grained reasoning alignment can be achieved without process-level supervision.
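The contrast between a broadcast advantage and a token-wise reweighted one can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not give GRAIL's exact formula, so the saliency definition (absolute value of the gradient-activation product summed over the hidden dimension), the mean-one normalization, and the function names `grad_activation_saliency` and `reweight_advantage` are all hypothetical. Only the group-normalized advantage follows the standard GRPO definition.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized sequence-level advantages (standard GRPO form)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def grad_activation_saliency(grads, activations):
    """ASSUMPTION: per-token saliency as |sum_h grad[t, h] * act[t, h]|.

    grads, activations: arrays of shape (num_tokens, hidden_dim).
    Returns an array of shape (num_tokens,).
    """
    return np.abs((grads * activations).sum(axis=-1))


def reweight_advantage(seq_advantage, saliency, eps=1e-8):
    """ASSUMPTION: scale one sequence-level advantage by mean-one weights.

    Uniform saliency recovers the plain broadcast used by GRPO.
    """
    saliency = np.asarray(saliency, dtype=np.float64)
    weights = saliency / (saliency.mean() + eps)
    return seq_advantage * weights


# Four sampled responses to one prompt, with binary verifiable rewards.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Stand-in gradients and activations for a 6-token response (random here;
# in practice they would come from a backward pass through the policy).
rng = np.random.default_rng(0)
grads = rng.normal(size=(6, 8))
acts = rng.normal(size=(6, 8))

saliency = grad_activation_saliency(grads, acts)
token_adv = reweight_advantage(advantages[0], saliency)
uniform_adv = reweight_advantage(advantages[0], np.ones(6))
```

Normalizing the weights to mean one keeps the average per-token advantage equal to the sequence-level value, so the overall update scale stays comparable to the broadcast baseline while high-saliency tokens receive a larger share.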


1️⃣ One-Sentence Summary

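This paper proposes GRAIL, a method that redistributes the reward signal according to each token's sensitivity to the final answer, which overcomes the problem in conventional reinforcement learning methods of flawed reasoning steps being updated as strongly as valid ones and significantly improves the accuracy of large language models on mathematical reasoning tasks without relying on costly process reward models.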
