arXiv submission date: 2026-03-30
📄 Abstract - Stepwise Credit Assignment for GRPO on Flow-Matching Models

Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
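To make the credit-assignment idea concrete, here is a minimal numerical sketch. It is an illustration under assumptions, not the paper's implementation: the function names are invented, and the Tweedie-style one-step estimate assumes a rectified-flow convention (`x_t = (1 - t) * x0 + t * noise`, so `x0_hat = x_t - t * v_t`); sign and timestep conventions vary between codebases.

```python
import numpy as np

def one_step_denoise(x_t, v_t, t):
    # Tweedie-style one-step estimate of the clean sample x0 from the
    # current state x_t and the model's predicted velocity v_t.
    # Assumes the rectified-flow parameterization x_t = (1-t)*x0 + t*noise,
    # giving x0_hat = x_t - t * v_t (convention varies by codebase).
    return x_t - t * v_t

def stepwise_gains(intermediate_rewards):
    # Gain-based credit: each step's advantage is the *improvement* in the
    # intermediate reward estimate it produced, instead of a uniform share
    # of the final reward as in vanilla Flow-GRPO.
    r = np.asarray(intermediate_rewards, dtype=float)
    return r[1:] - r[:-1]
```

Note that the per-step gains telescope: their sum equals the final reward minus the initial estimate, so total credit is conserved while steps that merely undo earlier errors receive positive credit only for the improvement they actually contribute.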

Top tags: model training, reinforcement learning, machine learning
Detailed tags: credit assignment, flow matching models, diffusion models, policy gradient, stepwise reward

Stepwise Credit Assignment for GRPO on Flow-Matching Models


1️⃣ One-sentence summary

This paper proposes a new method called Stepwise-Flow-GRPO. By recognizing that different steps of the image-generation process contribute differently to the final result (e.g., early steps set the composition while late steps refine details), it assigns each step an appropriate share of the credit. This fixes the inefficiency of the original approach, which treats all steps identically, allowing the model to learn to generate high-quality images faster and more sample-efficiently.

Source: arXiv 2603.28718