UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
1️⃣ One-sentence summary
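This paper proposes UniGRPO, a unified reinforcement learning framework that jointly optimizes the text-reasoning and image-generation policies so that the model reasons before it generates an image. This significantly improves image generation quality and lays a solid foundation for future models that interleave text and image generation.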
Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
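The two ingredients named in the abstract, group-relative advantages computed from a sparse terminal reward and an MSE penalty on velocity fields, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the flat-list representation of a velocity field, and the choice to broadcast one advantage to every text token and every denoising step are assumptions made for illustration; only the mean/std normalization within a group follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization: compare each rollout's terminal
    reward against the other rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_advantage(advantage, num_text_tokens, num_denoise_steps):
    """Assumption: with a sparse terminal reward, every reasoning token and
    every denoising step of one rollout shares that rollout's advantage."""
    return {
        "text": [advantage] * num_text_tokens,
        "image": [advantage] * num_denoise_steps,
    }


def velocity_mse_penalty(v_policy, v_ref):
    """Mean squared error between the policy's predicted velocity and a
    frozen reference model's velocity (both given here as flat lists)."""
    if len(v_policy) != len(v_ref):
        raise ValueError("velocity fields must have the same number of elements")
    return sum((p - q) ** 2 for p, q in zip(v_policy, v_ref)) / len(v_policy)


# Hypothetical group of four rollouts for one prompt, scored once at the end.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Per-step credit for the best rollout: 3 reasoning tokens, 2 denoising steps.
credit = broadcast_advantage(advantages[1], num_text_tokens=3, num_denoise_steps=2)

# Regularizer that would be scaled by a weight (hypothetical) and added to the loss.
penalty = velocity_mse_penalty([1.0, 2.0], [1.0, 4.0])
```

Because the penalty acts directly on the quantity the flow model predicts, it stays well defined without branching the rollout, which fits the abstract's stated preference for linear, unbranched sampling.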
Source: arXiv: 2603.23500