arXiv submission date: 2026-04-05
📄 Abstract - OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first off-policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
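To make the second and third contributions concrete, here is a minimal sketch of a sequence-level importance-sampling correction with GRPO-style clipping over a truncated denoising trajectory. The paper does not provide code; every name here (`op_grpo_loss`, `EPS_CLIP`, `TRUNCATE_AT`) and both constant values are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an off-policy GRPO update as described in the abstract.
# Assumptions: EPS_CLIP and TRUNCATE_AT are placeholder values; the real
# method may parameterize these differently.
import torch

EPS_CLIP = 0.2      # PPO/GRPO-style clipping range (assumed value)
TRUNCATE_AT = 0.8   # keep only the first 80% of denoising steps (assumed)

def op_grpo_loss(logp_new, logp_old, advantages):
    """Sequence-level importance-sampling correction with GRPO clipping.

    logp_new / logp_old: (batch, num_steps) per-step log-probabilities of
        the denoising actions under the current and behavior policies.
    advantages: (batch,) group-normalized advantages, one per trajectory.
    """
    num_steps = logp_new.shape[1]
    keep = int(num_steps * TRUNCATE_AT)  # drop ill-conditioned late steps

    # One ratio per trajectory: sum the per-step log-ratios over the kept
    # prefix, then exponentiate (sequence-level, not per-step).
    log_ratio = (logp_new[:, :keep] - logp_old[:, :keep]).sum(dim=1)
    ratio = log_ratio.exp()

    # Standard clipped surrogate, applied once per sequence so the
    # clipping mechanism stays intact under off-policy sample reuse.
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - EPS_CLIP, 1 + EPS_CLIP) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Computing the ratio once per sequence, rather than per step, is what the abstract means by preserving the clipping mechanism: a single clip bounds the whole trajectory's contribution instead of compounding per-step clips.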

Top-level tags: model training, aigc, multi-modal
Detailed tags: off-policy learning, flow matching, generative models, sample efficiency, importance sampling

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models


1️⃣ One-sentence summary

This paper proposes a new method called OP-GRPO that substantially improves the training efficiency of flow-matching models (used for image and video generation) by introducing off-policy training, high-quality sample reuse, and distribution-shift correction; it matches or exceeds the baseline's quality with, on average, only about one third of the training steps.
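The "high-quality sample reuse" piece corresponds to the replay buffer from the abstract. Below is a hedged sketch of one plausible realization: a fixed-capacity buffer that retains only the highest-reward trajectories. The class name, capacity, and top-k selection rule are assumptions for illustration; the paper's actual adaptive selection criterion is not specified here.

```python
# Illustrative replay buffer keeping the top-reward trajectories.
# Capacity and eviction policy are assumed, not taken from the paper.
import heapq
import itertools
import random

class TrajectoryReplayBuffer:
    """Fixed-capacity buffer that retains the highest-reward trajectories."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self._heap = []                    # min-heap keyed by reward
        self._counter = itertools.count()  # tie-breaker for equal rewards

    def add(self, reward, trajectory):
        item = (reward, next(self._counter), trajectory)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif reward > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)  # evict current worst

    def sample(self, k):
        # Uniform sample of stored trajectories for off-policy reuse.
        picks = random.sample(self._heap, min(k, len(self._heap)))
        return [traj for _, _, traj in picks]
```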

Source: arXiv 2604.04142