arXiv submission date: 2026-04-29
📄 Abstract - Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real-world objectives, especially those derived from human preferences, are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead to alignment failures such as reward hacking, over-optimization, and overconfident behavior. We introduce a dual-source uncertainty-aware reward framework that explicitly models both epistemic uncertainty in value estimation and uncertainty in human preferences. Model uncertainty is captured via ensemble disagreement over value predictions, while preference uncertainty is derived from variability in reward annotations. We combine these signals through a confidence-adjusted Reliability Filter that adaptively modulates action selection, encouraging a balance between exploitation and caution. Empirical results across multiple discrete grid configurations (6×6, 8×8, 10×10) and high-dimensional continuous control environments (Hopper-v4, Walker2d-v4) demonstrate that our approach yields more stable training dynamics and reduces exploitative behaviors under reward ambiguity, achieving a 93.7% reduction in reward-hacking behavior as measured by trap visitation frequency. We demonstrate statistical significance of these improvements and robustness under up to 30% supervisory noise, albeit with a trade-off in peak observed reward compared to unconstrained baselines. By treating uncertainty as a first-class component of the reward signal, this work offers a principled approach toward more reliable and aligned reinforcement learning systems.
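
To make the mechanism concrete, here is a minimal Python sketch of how the two uncertainty signals described in the abstract might be combined into a confidence-adjusted action rule. Everything below is an illustrative assumption, not the paper's implementation: the random linear "ensemble", the `1 / (1 + alpha*u_model + beta*u_pref)` confidence form, and all names (`ensemble_values`, `reliability_filter`, `alpha`, `beta`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the paper): an "ensemble" of K value heads,
# each mapping a state vector to per-action value estimates. Real ensembles
# would be independently trained networks; random linear maps stand in here.
K, STATE_DIM, N_ACTIONS = 5, 8, 4
ensemble_weights = [rng.normal(loc=1.0, scale=0.1, size=(STATE_DIM, N_ACTIONS))
                    for _ in range(K)]

def ensemble_values(state):
    """Per-action value predictions from each ensemble member, shape (K, A)."""
    return np.stack([state @ W for W in ensemble_weights])

def epistemic_uncertainty(state):
    """Model uncertainty as ensemble disagreement: per-action std of values."""
    return ensemble_values(state).std(axis=0)

def preference_uncertainty(annotations):
    """Preference uncertainty as variability across human reward labels
    collected for the same outcome (reduced to a single scalar here)."""
    return float(np.std(annotations))

def reliability_filter(state, annotations, alpha=1.0, beta=1.0):
    """Confidence-adjusted action selection: discount each action's mean
    value by a confidence weight that shrinks as either uncertainty grows,
    then act greedily. (With signed values, subtracting an uncertainty
    penalty instead of multiplying would be the safer variant.)"""
    q_mean = ensemble_values(state).mean(axis=0)   # shape (A,)
    u_model = epistemic_uncertainty(state)         # shape (A,)
    u_pref = preference_uncertainty(annotations)   # scalar
    confidence = 1.0 / (1.0 + alpha * u_model + beta * u_pref)
    return int(np.argmax(q_mean * confidence))

# Inconsistent annotations lower confidence across all actions, while high
# ensemble disagreement lowers it per action, steering the agent away from
# actions whose high value estimates the ensemble does not agree on.
state = rng.normal(size=STATE_DIM)
noisy_labels = [0.9, 0.1, 0.8, 0.2]
print("chosen action:", reliability_filter(state, noisy_labels))
```

Under this reading, a "trap" action whose apparent high value comes from exploiting annotation noise would see its adjusted score shrink, which is consistent with the reduced trap-visitation frequency the abstract reports.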

Top-level tags: reinforcement learning, agents
Detailed tags: reward hacking, uncertainty, human preferences, alignment

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking


1️⃣ One-Sentence Summary

The paper proposes a dual-uncertainty reward framework that jointly accounts for model-prediction uncertainty and human-preference uncertainty, suppressing reward hacking in reinforcement learning by adaptively modulating action selection; experiments show the method reduces hacking behavior by 93.7% and improves training stability.

From arXiv: 2604.26360