TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
1️⃣ One-Sentence Summary
This paper proposes a new reinforcement learning method called TreeGRPO, which structures the diffusion model's denoising process as a search tree. This substantially improves the efficiency of training the model against human preferences, yielding faster training and better performance.
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance for the same number of training samples; (2) fine-grained credit assignment via reward backpropagation, which computes step-specific advantages and overcomes the uniform credit assignment of trajectory-based methods; and (3) amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4× faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at this http URL.
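To make the mechanism concrete, below is a minimal sketch of the tree-structured rollout, reward backpropagation, and sibling-relative advantage computation described in the abstract. Only the abstract is available here, so all names (`denoise_step`, `reward_model`, `branch_steps`, the mean-of-children backup rule) are illustrative assumptions and not the authors' actual implementation.

```python
# Hypothetical sketch of TreeGRPO's rollout tree, based only on the abstract.
# Toy scalars stand in for latent image tensors; names are assumptions.

import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One partially denoised state in the search tree."""
    state: float                     # stand-in for a latent tensor
    step: int                        # denoising step index
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0               # backed-up reward estimate
    advantage: float = 0.0           # step-specific advantage vs. siblings


def denoise_step(state: float, step: int) -> float:
    """Toy stand-in for one stochastic denoising step of the policy."""
    return state + random.gauss(0.0, 0.1)


def reward_model(state: float) -> float:
    """Toy stand-in for a learned preference reward on final samples."""
    return -abs(state - 1.0)


def rollout_tree(init_noise: float, num_steps: int, branch_steps: set,
                 branch_factor: int = 2) -> Node:
    """Grow a denoising tree from shared noise: common prefixes are
    computed once, and selected steps branch into several children."""
    root = Node(state=init_noise, step=0)
    frontier = [root]
    for t in range(1, num_steps + 1):
        next_frontier = []
        for node in frontier:
            k = branch_factor if t in branch_steps else 1
            for _ in range(k):
                child = Node(state=denoise_step(node.state, t), step=t, parent=node)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root


def backpropagate(node: Node) -> float:
    """Leaves get the reward model's score; internal nodes back up the
    mean of their children (an assumed backup rule, not from the paper)."""
    if not node.children:
        node.value = reward_model(node.state)
        return node.value
    node.value = sum(backpropagate(c) for c in node.children) / len(node.children)
    return node.value


def assign_advantages(node: Node) -> None:
    """GRPO-style advantage: each child's value relative to its sibling group,
    giving step-specific rather than uniform trajectory-level credit."""
    if node.children:
        baseline = sum(c.value for c in node.children) / len(node.children)
        for c in node.children:
            c.advantage = c.value - baseline
            assign_advantages(c)


if __name__ == "__main__":
    random.seed(0)
    root = rollout_tree(init_noise=0.0, num_steps=8,
                        branch_steps={2, 5}, branch_factor=3)
    backpropagate(root)
    assign_advantages(root)
```

Under this reading, every branching node yields a group of sibling advantages from a single shared prefix, so one forward pass over the prefix amortizes several policy-gradient updates, which is consistent with the abstract's claimed efficiency gain.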
Source: arXiv:2512.08153