arXiv submission date: 2026-02-25
📄 Abstract - Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

Alignment and adaptation in large language models rely heavily on reinforcement learning from human feedback (RLHF); yet the theoretical understanding of its generalisability remains limited, especially when the learned reward can shift and the KL control is estimated and clipped. To address this, we develop a generalisation theory for RLHF that explicitly accounts for (1) \emph{reward shift}: reward models are trained on preference data from earlier or mixed behaviour policies, while RLHF optimises the current policy on its own rollouts; and (2) \emph{clipped KL regularisation}: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, introducing an error into RLHF. We present generalisation bounds for RLHF showing that the generalisation error stems from a sampling error over prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss two special cases: (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, modelled as an Ornstein-Uhlenbeck process. The theory yields practical implications for (1) the optimal KL clipping threshold and (2) budget allocation across prompts, rollouts, and preference data.
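To make the second ingredient concrete, here is a minimal sketch of what a clipped Monte-Carlo KL estimate from sampled log-probability ratios might look like. This is an illustration of the general technique the abstract describes, not the paper's exact estimator; the function name and the symmetric clipping threshold are assumptions.

```python
def clipped_kl_estimate(logp_current, logp_ref, clip=10.0):
    """Monte-Carlo KL estimate between the current policy and a reference policy.

    Each sampled log-probability ratio log pi(y|x) - log pi_ref(y|x) is clipped
    to [-clip, clip] before averaging, for numerical stability. The clipping
    introduces a bias -- the "KL clipping error" the abstract refers to.
    """
    if len(logp_current) != len(logp_ref):
        raise ValueError("log-prob lists must have equal length")
    # Per-sample log ratio, clipped symmetrically (an illustrative choice).
    ratios = [min(max(lc - lr, -clip), clip)
              for lc, lr in zip(logp_current, logp_ref)]
    return sum(ratios) / len(ratios)
```

For example, with log ratios of 0 and 20 and a threshold of 10, the second sample is clipped to 10 and the estimate is 5.0 rather than the unclipped 10.0, which is exactly the kind of bias the theory charges to the clipping term.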

Top-level tags: theory, llm, model training
Detailed tags: reinforcement learning from human feedback, generalization theory, reward shift, kl regularization, theoretical analysis

Generalisation of RLHF under Reward Shift and Clipped KL Regularisation


1️⃣ One-sentence summary

This paper establishes a generalisation theory for reinforcement learning from human feedback (RLHF). It gives the first systematic analysis of the "reward shift" problem, which arises when the reward model's preference-training data does not match the current policy, and of the "KL clipping error" introduced by practical implementation, and from these it derives theoretical guidance for hyperparameter settings and data-budget allocation in real training.
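The abstract's three-way error decomposition can be written schematically as follows. This exact form is an assumption for illustration; the paper's actual bounds will carry explicit constants and rates in the prompt count, rollout count, preference-data size, and clipping threshold:

```latex
\mathrm{Err}_{\mathrm{gen}}
\;\lesssim\;
\underbrace{\varepsilon_{\mathrm{sample}}\!\left(n_{\mathrm{prompt}}, n_{\mathrm{rollout}}\right)}_{\text{sampling error}}
\;+\;
\underbrace{\varepsilon_{\mathrm{shift}}\!\left(n_{\mathrm{pref}}\right)}_{\text{reward shift error}}
\;+\;
\underbrace{\varepsilon_{\mathrm{clip}}\!\left(c\right)}_{\text{KL clipping error}}
```

This shape is what makes the practical implications plausible: the clipping threshold $c$ trades stability against the clipping bias, and a fixed data budget must be split across the three sample sizes that drive the first two terms.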

Source: arXiv:2602.21765