VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
1️⃣ One-Sentence Summary
This paper addresses the policy-conflict problem that arises when training visual autoregressive models with reinforcement learning, caused by differing input structures across generation steps. It proposes an improved optimization framework that introduces a stabilizing reward, dynamic weight allocation, and a mask propagation algorithm, significantly improving both the quality of the generated images and their alignment with the training objective.
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
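To make the structure of the objective concrete, below is a minimal sketch of a GRPO-style loss that combines group-relative advantages, an intermediate-reward term, and per-time-step reweighting, as described above. This is illustrative only and not the authors' implementation: the function and argument names (`grpo_var_loss`, `step_weights`, `intermediate_rewards`, etc.) are our own, and the mask propagation component is omitted.

```python
# Minimal sketch (assumed implementation, not the paper's code) of a GRPO-style
# update for a VAR-like generator with an intermediate reward and dynamic
# per-time-step (per-scale) reweighting.
import torch

def grpo_var_loss(log_probs, old_log_probs, rewards, intermediate_rewards,
                  step_weights, clip_eps=0.2):
    """Clipped, group-relative policy-gradient loss.

    log_probs, old_log_probs: (G, T) summed token log-probs per sample and scale.
    rewards: (G,) final rewards for a group of G samples from the same prompt.
    intermediate_rewards: (G, T) stabilizing rewards for early scales (assumed).
    step_weights: (T,) dynamic per-scale weights for credit assignment (assumed).
    """
    # Group-relative advantage: normalize final rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)          # (G,)
    # Blend in the intermediate reward signal, normalized per scale.
    inter = (intermediate_rewards - intermediate_rewards.mean(0)) / (
        intermediate_rewards.std(0) + 1e-8)                            # (G, T)
    adv_t = adv.unsqueeze(1) + inter                                   # (G, T)

    # PPO/GRPO-style clipped ratio objective, applied per scale.
    ratio = torch.exp(log_probs - old_log_probs)                       # (G, T)
    unclipped = ratio * adv_t
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_t
    per_step = -torch.min(unclipped, clipped)                          # (G, T)

    # Dynamic time-step reweighting: each generation scale contributes
    # according to its weight before averaging over the group.
    return (per_step * step_weights).sum(dim=1).mean()
```

In this sketch the spatial/temporal isolation provided by the paper's mask propagation algorithm would enter as an additional per-token mask multiplied into `per_step` before reduction.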
Source: arXiv:2601.02256