arXiv submission date: 2026-02-25
📄 Abstract - Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and convergence to a neighborhood of the optimal solution whose gap is related to approximation error and bias under parameterized policies. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.
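The core idea described above — stabilizing saddle-point dynamics with predictive ("optimistic") updates for both primal and dual variables — can be illustrated on a toy problem. The sketch below is not the paper's RLHF algorithm; it is a minimal, self-contained demonstration of optimistic gradient descent-ascent (OGDA) on the bilinear saddle f(x, y) = x·y, where plain simultaneous gradient descent-ascent oscillates and diverges while the optimistic variant converges in the last iterate. All function names and step sizes here are illustrative assumptions.

```python
# Toy illustration of why optimism stabilizes saddle-point dynamics.
# f(x, y) = x * y has its unique saddle point at (0, 0). Plain
# gradient descent-ascent (GDA) spirals outward on this objective;
# optimistic GDA, which steps with the extrapolated gradient
# 2*g_t - g_{t-1}, converges in the last iterate.

def grad(x, y):
    # f(x, y) = x * y  =>  df/dx = y, df/dy = x
    return y, x

def ogda(steps=2000, eta=0.1):
    """Optimistic gradient descent-ascent (predictive update)."""
    x, y = 1.0, 1.0
    gx_prev, gy_prev = grad(x, y)   # first step reduces to a plain step
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= eta * (2 * gx - gx_prev)   # primal descent, extrapolated
        y += eta * (2 * gy - gy_prev)   # dual ascent, extrapolated
        gx_prev, gy_prev = gx, gy
    return x, y

def gda(steps=2000, eta=0.1):
    """Plain simultaneous gradient descent-ascent, for contrast."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= eta * gx
        y += eta * gy
    return x, y
```

Running both from the same start point, `ogda()` returns a last iterate near the saddle (0, 0), while `gda()`'s last iterate has drifted far away — the oscillation the paper's analysis attributes to unconstrained primal-dual dynamics.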

Top-level tags: llm, reinforcement learning, theory
Detailed tags: safe alignment, primal-dual optimization, last-iterate convergence, constrained rl, human feedback

Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual


1️⃣ One-sentence summary

This paper proposes a new algorithm called "Optimistic Primal-Dual" that stably trains large language models to follow human preferences while satisfying safety constraints, and provides the first theoretical proof that the method's final (last-iterate) training result reliably converges.

Source: arXiv 2602.22146