Abstract - Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: this https URL.
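The failure mode described in the abstract (DPO loss decreasing while the policy still prefers the dispreferred response) can be illustrated with a small numeric sketch. It uses the standard per-pair DPO loss, not the paper's own code; the function name `dpo_loss`, the value `beta=0.1`, and all probabilities are hypothetical choices made only for illustration.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard per-pair DPO loss, where y_w is the human-preferred response.

    The loss depends only on how much the policy's log-ratio improved
    relative to the reference, not on which response the policy ranks higher.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Hypothetical reference policy that strongly favors the dispreferred response.
ref_logp_w, ref_logp_l = math.log(0.1), math.log(0.9)

# Hypothetical trained policy: better than the reference on y_w,
# yet it still assigns more probability to the dispreferred response y_l.
logp_w, logp_l = math.log(0.4), math.log(0.6)

loss_at_ref = dpo_loss(ref_logp_w, ref_logp_l, ref_logp_w, ref_logp_l)
loss_trained = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l)

print(f"loss at reference: {loss_at_ref:.4f}")
print(f"loss after update: {loss_trained:.4f}")
print("policy prefers y_w:", logp_w > logp_l)
```

In this toy setting the loss falls below its value at the reference policy even though the policy still ranks the dispreferred response higher, which matches the "relative advantage rather than absolute alignment" behavior the abstract describes.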
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
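1️⃣ One-Sentence Summary
This paper finds that Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) are not always equivalent: when the RLHF-optimal policy does not prefer the responses humans prefer, DPO optimizes the wrong objective and the model ends up favoring the wrong answer, so the authors propose Constrained Preference Optimization (CPO), which stays simple while achieving reliable alignment.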