📄 Paper Summary
Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting
1️⃣ One-Sentence Summary
This paper proposes a new method for dynamically adjusting reward weights, addressing the inability of traditional fixed weights in multi-objective reinforcement learning to effectively explore optimal solutions, and significantly improving the efficiency and effectiveness of multi-objective alignment training for large language models.
Prior works in multi-objective reinforcement learning typically use linear reward scalarization with fixed weights, which provably fails to capture non-convex Pareto fronts and thus yields suboptimal results. This limitation becomes especially critical in online preference alignment for large language models, where stochastic trajectories generated by parameterized policies create highly non-linear and non-convex mappings from parameters to objectives, for which no single static weighting scheme can find optimal trade-offs. We address this limitation by introducing dynamic reward weighting, which adaptively adjusts reward weights during the online reinforcement learning process. Unlike existing approaches that rely on fixed-weight interpolation, our dynamic weighting continuously balances and prioritizes objectives during training, facilitating effective exploration of Pareto fronts in objective space. We introduce two approaches of increasing sophistication and generalizability: (1) hypervolume-guided weight adaptation and (2) gradient-based weight optimization, offering a versatile toolkit for online multi-objective alignment. Our extensive experiments demonstrate their compatibility with commonly used online reinforcement learning algorithms (including GRPO, REINFORCE, and RLOO), effectiveness across multiple mathematical reasoning datasets, and applicability to different model families, consistently achieving Pareto-dominant solutions with fewer training steps than fixed-weight linear scalarization baselines.
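The abstract does not spell out the exact update rules, but the hypervolume-guided idea can be sketched in a few lines. The Python snippet below is a minimal, illustrative sketch only: `hypervolume_2d`, `adapt_weights`, the `probe` perturbation, and the multiplicative exponential update are all hypothetical names and design choices, not the paper's implementation. It merely shows how each objective's contribution to hypervolume could steer the reward weights used to scalarize multiple rewards.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume (dominated area) of a set of 2-objective points w.r.t. a
    reference point `ref`. Assumes both objectives are maximized."""
    # Sweep points by the first objective in descending order; dominated
    # points contribute nothing because they never raise `prev_y`.
    pts = sorted(points, key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def adapt_weights(weights, pareto_points, ref, probe=0.05):
    """Hypothetical hypervolume-guided weight adaptation: nudge the reward
    weights toward the objective whose simulated improvement (a `probe`-sized
    shift of the current front) yields the larger hypervolume gain."""
    base = hypervolume_2d(pareto_points, ref)
    gains = []
    for k in range(2):
        probed = [tuple(p[i] + (probe if i == k else 0.0) for i in range(2))
                  for p in pareto_points]
        gains.append(hypervolume_2d(probed, ref) - base)
    gains = np.asarray(gains)
    # Multiplicative update (an assumption): larger gain -> larger weight.
    new_w = weights * np.exp(gains / (np.abs(gains).sum() + 1e-8))
    return new_w / new_w.sum()

# Toy usage with two reward objectives (e.g., correctness and conciseness).
w = np.array([0.5, 0.5])
front = [(0.6, 0.2), (0.4, 0.5), (0.2, 0.7)]  # current Pareto set of objective scores
w = adapt_weights(w, front, ref=(0.0, 0.0))
print(w)  # tilted toward the objective with the larger hypervolume gain
```

A skeleton like this would slot into an online RL loop (e.g., GRPO or RLOO): after each evaluation round, recompute the objective scores of recent checkpoints or rollouts and re-weight the scalarized reward before the next batch of updates. The gradient-based variant mentioned in the abstract would instead derive the weight update from gradient information rather than hypervolume probes.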