arXiv submission date: 2026-02-02
📄 Abstract - Reward-free Alignment for Conflicting Objectives

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted-loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicting Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve the convergence rate in the two-objective setting. Second, we augment our method with practical heuristics and conduct experiments demonstrating the applicability of the proposed framework to LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs than existing multi-objective alignment baselines.
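The abstract does not spell out RACO's update rule, so the following is only a minimal NumPy sketch of the general idea it describes: per-objective gradients are computed from a pairwise (DPO-style logistic) preference loss, combined under user-specified weights with a conflict-resolution step, and the resulting update is clipped. The PCGrad-style projection and the plain norm clip below are stand-ins for the paper's clipped conflict-averse gradient descent; the function names, the linear scorer, and the `clip_norm` threshold are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: pairwise-preference gradients per objective,
# conflict-aware combination, and a clipped update. Not the paper's RACO algorithm.
import numpy as np

def pairwise_pref_grad(theta, x_chosen, x_rejected, beta=1.0):
    """Gradient of a logistic pairwise-preference loss for a linear scorer.

    score(x) = theta @ x; loss = -log(sigmoid(beta * (score_chosen - score_rejected))).
    """
    diff = x_chosen - x_rejected                        # feature difference of the pair
    margin = beta * (theta @ diff)                      # preference margin
    sig = 1.0 / (1.0 + np.exp(-margin))
    return -beta * (1.0 - sig) * diff                   # d(loss)/d(theta)

def combine_conflict_averse(grads, weights, clip_norm=1.0):
    """Combine per-objective gradients, projecting out pairwise conflicts,
    then clip the final update norm (stand-in for the paper's clipped variant)."""
    adjusted = [g.copy() for g in grads]
    for i, gi in enumerate(adjusted):
        for j, gj in enumerate(grads):
            if i != j and gi @ gj < 0:                  # objectives conflict
                gi -= (gi @ gj) / (gj @ gj + 1e-12) * gj  # remove conflicting component
    d = sum(w * g for w, g in zip(weights, adjusted))   # user-weighted combination
    norm = np.linalg.norm(d)
    if norm > clip_norm:                                # clip the update magnitude
        d *= clip_norm / norm
    return d

# Toy usage: two conflicting objectives (e.g. quality vs. safety) on random data.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
g_quality = pairwise_pref_grad(theta, rng.normal(size=8), rng.normal(size=8))
g_safety = pairwise_pref_grad(theta, rng.normal(size=8), rng.normal(size=8))
update = combine_conflict_averse([g_quality, g_safety], weights=[0.6, 0.4])
theta -= 0.1 * update                                   # one gradient step
```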

Top-level tags: llm, model training, theory
Detailed tags: multi-objective alignment, preference learning, gradient conflict, pareto optimization, reward-free learning

Reward-free Alignment for Conflicting Objectives


1️⃣ One-sentence summary

This paper proposes a new method called RACO that trains large language models directly from pairwise preference data, without relying on a complex reward model. It effectively resolves the trade-offs among multiple conflicting objectives (such as summarization quality and safety) and achieves a better overall balance across several mainstream model families.

Source: arXiv 2602.02495