Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning
1️⃣ One-sentence summary
This paper proposes a new method called S2Q, which has agents retain several high-value alternative actions during learning. This addresses the tendency of conventional cooperative multi-agent algorithms to settle into suboptimal policies when the optimum shifts during training, improving both adaptability and overall performance.
Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at this https URL.
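The abstract only outlines the mechanism, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes each agent holds several sub-value estimates per action (the hypothetical `sub_q_values` array) and samples from a softmax over their element-wise maximum, so alternative high-value actions keep nonzero probability under the behavior policy. The max aggregation rule and the temperature parameter are assumptions made for illustration.

```python
import numpy as np

def softmax_behavior_policy(sub_q_values, temperature=1.0, rng=None):
    """Sample an action from a softmax over several sub-value estimates.

    sub_q_values: array of shape (num_sub_values, num_actions) holding one
    Q-estimate per sub-value function for a single agent.  Aggregating with an
    element-wise max keeps actions attractive if *any* sub-value function
    still rates them highly, which supports persistent exploration.
    """
    sub_q_values = np.asarray(sub_q_values, dtype=np.float64)
    # Aggregate across sub-value functions (assumed rule: element-wise max).
    aggregated = sub_q_values.max(axis=0)
    # Numerically stable softmax with a temperature parameter.
    logits = (aggregated - aggregated.max()) / temperature
    probs = np.exp(logits)
    probs /= probs.sum()
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(probs), p=probs)

# Example: three sub-value functions over four actions.
qs = [[1.0, 0.2, 0.8, 0.1],
      [0.3, 0.9, 0.7, 0.2],
      [0.5, 0.4, 1.1, 0.0]]
print(softmax_behavior_policy(qs, temperature=0.5))
```

With a low temperature this behaves almost greedily with respect to the aggregated values, while a higher temperature spreads probability mass over the retained alternative actions.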
Source: arXiv:2602.17062