Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning
1️⃣ One-Sentence Summary
This paper proposes a new method, VGM²P, which combines global value guidance with an efficient MeanFlow generative model so that multiple AI agents can quickly learn cooperative policies directly from offline data, while avoiding the coefficient sensitivity and low computational efficiency of earlier methods.
Offline multi-agent reinforcement learning (MARL) aims to learn the optimal joint policy from pre-collected datasets, which requires trading off maximizing global returns against mitigating distribution shift from the offline data. Recent studies use diffusion or flow generative models to capture complex joint policy behaviors among agents; however, they typically rely on multi-step iterative sampling, reducing training and inference efficiency. Although follow-up work improves sampling efficiency through techniques such as distillation, these methods remain sensitive to the behavior-regularization coefficient. To address these issues, we propose Value Guidance Multi-agent MeanFlow Policy (VGM$^2$P), a simple yet effective flow-based policy learning framework that enables efficient action generation with coefficient-insensitive conditional behavior cloning. Specifically, VGM$^2$P uses global advantage values to guide agent collaboration, treating optimal policy learning as conditional behavior cloning. Additionally, to improve policy expressiveness and inference efficiency in multi-agent scenarios, it leverages classifier-free guidance MeanFlow for both policy training and execution. Experiments on tasks with both discrete and continuous action spaces demonstrate that, even when trained solely via conditional behavior cloning, VGM$^2$P efficiently achieves performance comparable to state-of-the-art methods.
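To make the efficiency claim concrete: MeanFlow models an *average* velocity field $u(z_t, r, t) = (z_t - z_r)/(t - r)$, so an action can be generated in a single step from noise instead of many integration steps. A minimal sketch of such one-step sampling with classifier-free guidance is below; the function names, the `guidance_w` mixing rule, and the velocity-network interface are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mean_flow_sample(u_fn, cond, action_dim, guidance_w=2.0, rng=None):
    """One-step MeanFlow action sampling with classifier-free guidance (sketch).

    u_fn(z, r, t, cond) is an assumed interface for a learned average-velocity
    network; passing cond=None queries its unconditional branch.
    """
    rng = np.random.default_rng(rng)
    z1 = rng.standard_normal(action_dim)      # start from Gaussian noise at t=1

    # Query conditional and unconditional average velocities over [0, 1].
    u_cond = u_fn(z1, 0.0, 1.0, cond)
    u_uncond = u_fn(z1, 0.0, 1.0, None)

    # Classifier-free guidance: push toward the condition (e.g. high advantage).
    u = u_uncond + guidance_w * (u_cond - u_uncond)

    # Average velocity gives the endpoint directly: z_0 = z_1 - (1 - 0) * u.
    return z1 - u
```

The key contrast with multi-step diffusion or flow samplers is that no ODE integration loop appears: because the network predicts the displacement-averaged velocity over the whole interval, one network evaluation per guidance branch suffices.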
Source: arXiv: 2604.08174