arXiv submission date: 2025-12-30
📄 Abstract - Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D$^2$-Align achieves superior alignment with human preference.
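The abstract describes the correction only at a high level. As a rough illustration of what "directionally correcting the reward signal in the reward model's embedding space" could look like, the PyTorch sketch below assumes a frozen reward model split into an embedding stage and a scoring head, plus a separately learned bias direction; the function names (`embed`, `score_head`), the parameter `alpha`, and the projection-removal step are all assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's implementation) of removing the component
# of a reward embedding that lies along a learned "preference-bias" direction
# before scoring, while the reward model itself stays frozen.
import torch
import torch.nn.functional as F

@torch.no_grad()
def corrected_reward(reward_model, images, prompts, bias_direction, alpha=1.0):
    """Score samples with a directionally corrected reward signal.

    reward_model   : frozen model assumed to expose `embed(...)` and `score_head(...)`
    bias_direction : vector in the reward model's embedding space, learned
                     separately while the reward model is kept frozen
    alpha          : strength of the directional correction (assumed hyperparameter)
    """
    z = reward_model.embed(images, prompts)       # (B, D) reward embeddings
    d = F.normalize(bias_direction, dim=-1)       # unit bias direction
    proj = (z @ d).unsqueeze(-1) * d              # component of z along d
    z_corrected = z - alpha * proj                # decouple the biased direction
    return reward_model.score_head(z_corrected)   # corrected scalar rewards
```

In such a setup, the corrected scalar would simply replace the raw reward inside whatever diffusion RL objective is being optimized, so the policy is no longer pushed along the reward model's bias direction during training.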

Top-level tags: model training, reinforcement learning, AIGC
Detailed tags: diffusion models, human preference alignment, reward hacking, mode collapse, diversity preservation

Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning


1️⃣ One-sentence summary

This paper addresses the "preference mode collapse" problem, where reinforcement learning from human feedback, used to optimize text-to-image diffusion models, drives the model toward a single generation style and a loss of diversity. It proposes a new benchmark to quantify this phenomenon and a method that directionally corrects the reward signal to preserve image diversity, achieving better alignment with human preference while maintaining image quality.

Source: arXiv:2512.24146