Bayesian Preference Learning for Test-Time Steerable Reward Models
1️⃣ One-Sentence Summary
This paper proposes a new method called ICRM that lets a model dynamically adjust its reward judgments after training, based on new preference examples supplied by the user, so it can flexibly adapt to diverse task requirements, such as balancing safety and helpfulness.
Reward models (RMs) are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier-based RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time in both single- and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards, outperforming a conventional RM on math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
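The abstract's core ingredient, a conjugate Beta prior over a latent preference probability under the Bradley-Terry model, rests on standard Beta-Bernoulli conjugacy: each in-context demonstration is a binary preference outcome, and the posterior is again a Beta with updated counts. The sketch below illustrates only that conjugate update, not the paper's amortized variational ICRM objective; the function names are hypothetical.

```python
# Illustrative sketch of Beta-Bernoulli conjugacy (not the paper's ICRM
# implementation). Each demonstration is 1 if response A was preferred
# over response B, else 0; the latent quantity is P(A preferred over B).

def beta_posterior(alpha, beta, demonstrations):
    """Update a Beta(alpha, beta) prior with binary preference outcomes."""
    wins = sum(demonstrations)
    losses = len(demonstrations) - wins
    # Conjugacy: Beta(alpha, beta) prior + Bernoulli likelihood
    # yields a Beta(alpha + wins, beta + losses) posterior.
    return alpha + wins, beta + losses

def preference_prob(alpha, beta):
    """Posterior mean of the latent preference probability."""
    return alpha / (alpha + beta)

# Uniform Beta(1, 1) prior; five demonstrations, four favoring A.
a, b = beta_posterior(1.0, 1.0, [1, 1, 1, 0, 1])
print(a, b)                   # 5.0 2.0
print(preference_prob(a, b))  # posterior mean 5/7 ≈ 0.714
```

As more demonstrations accumulate, the posterior concentrates, which is the intuition behind accuracy improving with additional in-context demonstrations.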
Source: arXiv:2602.08819