arXiv submission date: 2026-03-23
📄 Abstract - Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling

Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.
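The core idea of the framework is to make only the reward-learning stage differentially private, so the downstream policy inherits privacy for free by post-processing. A minimal sketch of that idea, using DP-SGD (per-example gradient clipping plus Gaussian noise) on a linear Bradley-Terry reward model over synthetic preference pairs; all function names, hyperparameters, and the linear model itself are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_reward_step(w, X_pref, X_rej, lr=0.1, clip=1.0, noise_mult=0.5):
    """One DP-SGD step on a linear Bradley-Terry reward model r(x) = w @ x.

    Per-example gradients are clipped to norm `clip`, and Gaussian noise with
    scale `noise_mult * clip` is added to the summed gradient -- the standard
    DP-SGD recipe. Hyperparameter names and values are illustrative.
    """
    n = X_pref.shape[0]
    diffs = X_pref - X_rej                      # feature gap of each pair
    margins = diffs @ w                         # reward margins r(pref) - r(rej)
    # per-example gradient of the Bradley-Terry loss -log sigmoid(margin)
    grads = -(1.0 / (1.0 + np.exp(margins)))[:, None] * diffs
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip)          # clip per-example norm
    noisy_sum = grads.sum(axis=0) + rng.normal(0.0, noise_mult * clip, size=w.shape)
    return w - lr * noisy_sum / n

# Synthetic preferences generated from a hidden "true" reward direction.
d = 5
w_true = rng.normal(size=d)
X_a, X_b = rng.normal(size=(200, d)), rng.normal(size=(200, d))
prefer_a = (X_a - X_b) @ w_true > 0
X_pref = np.where(prefer_a[:, None], X_a, X_b)
X_rej = np.where(prefer_a[:, None], X_b, X_a)

# Train the private reward model; a policy derived from `w` afterwards
# needs no further noise (differential privacy is closed under post-processing).
w = np.zeros(d)
for _ in range(300):
    w = dp_sgd_reward_step(w, X_pref, X_rej)

acc = np.mean((X_pref - X_rej) @ w > 0)  # agreement with training preferences
```

Because the policy-optimization stage only consumes the already-private reward parameters, it can run without additional noise, which is what lets privacy cost appear as a single additive term in the suboptimality analysis.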

Top-level tags: reinforcement learning, model training, privacy
Detailed tags: differential privacy, human feedback, reward modeling, privacy-preserving alignment

Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling


1️⃣ One-sentence summary

This paper proposes a new privacy-preserving method for training large language models: differential privacy is applied only to the reward model that learns user preferences, protecting sensitive user data while improving how well the resulting model aligns with human values.

Source: arXiv:2603.22563