Drifting Preference Optimization for One-Step Generative Models
1️⃣ One-Sentence Summary
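This paper proposes DrPO, a method that finetunes one-step image generators using only the ranking produced by a reward model rather than its gradients, so that outputs better match human preferences at a much lower training cost: with HPSv3 as the target reward, training computation drops by 3.51× under a matched effective-batch setting.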
One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denoising trajectories, differentiable reward gradients, or test-time optimization. We propose Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step generators. For each prompt, DrPO samples candidates from the current generator, ranks them with a target reward, and uses high- and low-scoring samples to synthesize a feature-space update direction. The update is a non-parametric dipole preference field plus a reference drift estimated from the frozen base generator, and is optimized through a detached feature-space regression target. The target reward is used only for ranking, so DrPO can train with large, black-box, or non-differentiable rewards while inference remains a single generator call. We evaluate DrPO on SD-Turbo and SDXL-Turbo with multiple target rewards and benchmarks, including HPSv3 and GenEval. DrPO improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by $3.51\times$ under the matched effective-batch setting by removing reward-model backpropagation. Initial offline experiments suggest that sample-based gradient synthesis can also be used beyond online reward ranking.
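The update described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact form of the dipole field or the reference drift, so the softmax-kernel weighting and the names `rank_split`, `kernel_drift`, `drpo_style_target`, `tau`, `beta`, and `step` are all hypothetical choices made for illustration. What the sketch does try to show faithfully is the structure: the reward is consulted only through its ordering, high- and low-ranked samples define an attract-minus-repel direction in feature space, a drift toward frozen base-generator features is added, and the result is used as a constant regression target.

```python
import numpy as np

def rank_split(rewards, k):
    """Indices of the k highest- and k lowest-scoring candidates.
    Only the ordering of `rewards` matters; no reward gradient is needed."""
    order = np.argsort(rewards)  # ascending
    return order[-k:], order[:k]

def kernel_drift(x, targets, tau):
    """Displacement from each row of x to a softmax-kernel-weighted mean of
    `targets` (an assumed, illustrative choice of non-parametric field)."""
    d2 = ((x[:, None, :] - targets[None, :, :]) ** 2).sum(-1)  # (n, m)
    logits = -d2 / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ targets - x  # (n, d)

def drpo_style_target(feats, rewards, ref_feats, k=2, tau=1.0, beta=0.1, step=1.0):
    """Build a feature-space regression target for one prompt.

    feats:     (n, d) features of candidates from the current generator
    rewards:   (n,)   black-box scores, used only for ranking
    ref_feats: (m, d) features of samples from the frozen base generator
    """
    top, bottom = rank_split(rewards, k)
    # Dipole: pull toward high-ranked samples, push away from low-ranked ones.
    dipole = kernel_drift(feats, feats[top], tau) - kernel_drift(feats, feats[bottom], tau)
    # Reference drift: pull toward the base generator's samples.
    ref = kernel_drift(feats, ref_feats, tau)
    # The result is a plain array, i.e. a constant ("detached") target.
    return feats + step * (dipole + beta * ref)

def regression_loss(feats, target):
    """Mean squared error between current features and the fixed target."""
    return float(((feats - target) ** 2).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 4))      # stand-in for candidate features
    ref_feats = rng.normal(size=(8, 4))  # stand-in for base-generator features
    rewards = rng.uniform(size=8)        # stand-in for black-box reward scores
    target = drpo_style_target(feats, rewards, ref_feats)
    print(target.shape, regression_loss(feats, target))
```

In an autodiff framework the target would be wrapped in a stop-gradient so that only the generator's features receive gradients. Because `rank_split` depends only on ordering, any monotonic rescaling of the reward leaves the target unchanged, which is why a non-differentiable or black-box reward suffices in this kind of scheme.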
Source: arXiv: 2606.02521