arXiv submission date: 2026-01-01
📄 Abstract - E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models

Recent reinforcement learning approaches have improved the alignment of flow matching models with human preferences. While stochastic sampling enables exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps yield indistinguishable roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization method that increases the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffers from ambiguous reward signals caused by stochasticity spread across multiple steps, we merge consecutive low-entropy steps into a single high-entropy step for SDE sampling, while applying ODE sampling to the remaining steps. Building upon this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages among samples that share the same consolidated SDE denoising step. Experimental results under different reward settings demonstrate the effectiveness of our method.
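To make the "multi-step group-normalized advantage" concrete, here is a minimal sketch, not the authors' code, of how group-relative advantages might be computed only among roll-outs that share the same consolidated SDE denoising step. The function and variable names (`group_normalized_advantage`, `sde_step_ids`) are illustrative assumptions.

```python
import torch

def group_normalized_advantage(rewards: torch.Tensor,
                               sde_step_ids: torch.Tensor,
                               eps: float = 1e-8) -> torch.Tensor:
    """Sketch of a group-relative advantage.

    rewards      : (N,) scalar reward per roll-out.
    sde_step_ids : (N,) index of the consolidated SDE step used by each roll-out.
    Returns      : (N,) advantages normalized within each group of roll-outs
                   that share the same consolidated SDE step.
    """
    advantages = torch.zeros_like(rewards)
    for step in sde_step_ids.unique():
        mask = sde_step_ids == step
        group = rewards[mask]
        # Normalize rewards only against roll-outs from the same group,
        # so the advantage reflects exploration at that SDE step.
        advantages[mask] = (group - group.mean()) / (group.std(unbiased=False) + eps)
    return advantages

# Example usage with two groups (consolidated SDE step 3 and step 5):
rewards = torch.tensor([1.0, 0.2, 0.7, 0.9])
steps = torch.tensor([3, 3, 5, 5])
print(group_normalized_advantage(rewards, steps))
```

This mirrors the within-group normalization used in GRPO, with the groups defined here by the shared consolidated SDE step rather than by a shared prompt alone.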

Top-level tags: model training, reinforcement learning, machine learning
Detailed tags: flow matching, preference alignment, policy optimization, stochastic differential equations, entropy sampling

E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models


1️⃣ One-Sentence Summary

This paper proposes a new reinforcement learning method called E-GRPO, which merges low-entropy denoising steps to form high-entropy sampling steps. This addresses the inefficient exploration caused by sparse and ambiguous reward signals in existing approaches to training flow models, and effectively improves alignment with human preferences.
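As an illustration of the step-merging idea summarized above, the following is a hedged sketch, not the paper's implementation: consecutive steps whose estimated entropy falls below a hypothetical threshold are consolidated into a single SDE segment, while the remaining steps are sampled deterministically with the ODE. The entropy threshold, function name, and segment representation are all assumptions.

```python
from typing import List, Tuple

def build_sampling_schedule(step_entropies: List[float],
                            threshold: float) -> List[Tuple[int, int, str]]:
    """Return (start, end, mode) segments over the denoising steps.

    Runs of consecutive steps whose entropy falls below `threshold` are merged
    into a single "sde" segment (one consolidated, higher-entropy stochastic
    step); every other step is kept as an individual deterministic "ode" step.
    """
    schedule: List[Tuple[int, int, str]] = []
    i, n = 0, len(step_entropies)
    while i < n:
        if step_entropies[i] < threshold:
            j = i
            # Extend the run while the following steps are also low entropy.
            while j + 1 < n and step_entropies[j + 1] < threshold:
                j += 1
            schedule.append((i, j, "sde"))  # merged low-entropy run -> one SDE step
            i = j + 1
        else:
            schedule.append((i, i, "ode"))  # step kept as a deterministic ODE update
            i += 1
    return schedule

# Example: steps 1-3 are low entropy and get merged into one SDE segment.
print(build_sampling_schedule([0.9, 0.2, 0.1, 0.3, 0.8], threshold=0.5))
# [(0, 0, 'ode'), (1, 3, 'sde'), (4, 4, 'ode')]
```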

Source: arXiv 2601.00423