arXiv submission date: 2026-05-26
📄 Abstract - Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

Open-ended generation with LLMs usually requires multi-dimensional rubrics to assess quality adequately and to guide reinforcement learning. However, a critical problem inherent in this training paradigm is imbalanced reward polarization across rubric dimensions. Because of this bottleneck, even LLMs that achieve relatively high rewards after training may still exhibit severe deficiencies in certain dimensions, which directly degrades user experience. To address this problem, we propose Focal Reward, a novel objective that automatically balances reinforcement learning under rubric-based rewards. Specifically, we first use an inverse reward projection mechanism to estimate the saturation degree of each criterion in the rubric, which forms the basis for calibrating the reward direction. The final objective then applies an automatic reweighting coefficient to each criterion to achieve fine-grained balancing. Extensive experiments across three model scales and six benchmarks demonstrate that Focal Reward outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons. Rollout, mechanism, and ablation analyses further show that these gains arise from online, saturation-aware reallocation toward rubrics that still have room for improvement.
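
The abstract does not give the formulas, so the following is only a minimal sketch of what saturation-aware reweighting could look like, written by analogy with focal loss. Everything in it is an assumption for illustration: the function names (`saturation`, `focal_weights`, `aggregate`), the use of the batch mean score as a stand-in for the paper's inverse reward projection, the `(1 - s) ** gamma` weighting form, and the `gamma` parameter are all hypothetical and are not taken from the paper.

```python
from typing import List


def saturation(scores: List[List[float]]) -> List[float]:
    """Estimate per-criterion saturation as the mean score over rollouts.

    scores[i][k] is the score of rollout i on criterion k, assumed in [0, 1].
    ASSUMPTION: the batch mean is a simple stand-in, not the paper's
    inverse reward projection mechanism.
    """
    n_rollouts = len(scores)
    n_criteria = len(scores[0])
    return [sum(row[k] for row in scores) / n_rollouts for k in range(n_criteria)]


def focal_weights(sat: List[float], gamma: float = 2.0) -> List[float]:
    """Down-weight saturated criteria with a focal-loss-style factor.

    ASSUMPTION: the (1 - s) ** gamma form and gamma itself are hypothetical.
    Weights are normalized to sum to 1; if every criterion is fully
    saturated, fall back to uniform weights.
    """
    raw = [(1.0 - s) ** gamma for s in sat]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(sat)] * len(sat)
    return [r / total for r in raw]


def aggregate(row: List[float], weights: List[float]) -> float:
    """Scalar reward for one rollout: weighted sum of its criterion scores."""
    return sum(w * r for w, r in zip(weights, row))


# Two rollouts scored on two criteria: the first criterion is saturated,
# the second still has room for improvement.
batch = [[1.0, 0.0], [1.0, 0.5]]
weights = focal_weights(saturation(batch))
rewards = [aggregate(row, weights) for row in batch]
```

Under these assumptions, a static average would give both rollouts credit for the saturated first criterion, whereas the reweighted reward separates them only by the criterion that can still improve. Recomputing the weights on each batch is one way to read the "online" reallocation the abstract mentions.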

Top-level tags: llm reinforcement learning model training
Detailed tags: reward design multi-dimensional rubrics saturation estimation balanced training text generation

Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards


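1️⃣ One-sentence summary

This paper proposes Focal Reward, a new method for addressing the reward imbalance that arises when large language models are trained with reinforcement learning under multi-dimensional rubrics: it automatically estimates how saturated training is on each dimension and dynamically adjusts the optimization weights so that the model improves evenly across all evaluation dimensions, and experiments show that it outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons.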

Source: arXiv: 2605.26579