SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning
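1️⃣ One-sentence summary
This paper proposes SCoUT, a method that dynamically and softly groups agents and uses counterfactual reasoning to assign credit for communication precisely, so that multi-agent systems learn when and with whom to communicate more efficiently and at larger scale while keeping the benefits of decentralized execution.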
Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning when and with whom to communicate requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce SCoUT (Scalable Communication via Utility-guided Temporal grouping), which addresses both challenges via temporal and agent abstraction within traditional MARL. During training, SCoUT resamples soft agent groups every K environment steps (macro-steps) via Gumbel-Softmax; these groups are latent clusters that induce an affinity used as a differentiable prior over recipients. A group-aware critic predicts a value for each agent group and maps these values to per-agent baselines through the same soft assignments, reducing critic complexity and variance. Each agent is trained with a three-headed policy: environment action, send decision, and recipient selection. To obtain precise communication learning signals, we derive counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages. This counterfactual computation enables precise credit assignment for both send and recipient-selection decisions. At execution time, all centralized training components are discarded and only the per-agent policy is run, preserving decentralized execution. Project website, videos and code: this https URL
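The training-time pieces named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the affinity is computed or how messages are aggregated, so the sketch assumes an inner-product affinity between soft assignments and mean aggregation of received messages. All function names (`gumbel_softmax`, `affinity`, `per_agent_baselines`, `counterfactual_mean`) and all numbers are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Sample one relaxed (soft) one-hot vector from group logits."""
    # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
    noisy = [
        (l - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9)))) / tau
        for l in logits
    ]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def affinity(assign):
    """ASSUMPTION: affinity A[i][j] is the inner product of soft assignments."""
    n = len(assign)
    return [
        [sum(a * b for a, b in zip(assign[i], assign[j])) for j in range(n)]
        for i in range(n)
    ]


def per_agent_baselines(assign, group_values):
    """Map per-group values to per-agent baselines via the soft assignments."""
    return [sum(z * v for z, v in zip(row, group_values)) for row in assign]


def counterfactual_mean(messages, sender):
    """ASSUMPTION: mean aggregation. Remove one sender's message analytically."""
    dim = len(next(iter(messages.values())))
    total = [sum(msg[d] for msg in messages.values()) for d in range(dim)]
    remaining = len(messages) - 1
    if remaining == 0:
        return [0.0] * dim  # nothing left once the only sender is removed
    return [(total[d] - messages[sender][d]) / remaining for d in range(dim)]


rng = random.Random(0)
# Hypothetical logits for 3 agents over 2 latent groups; in training these
# would be resampled once every K environment steps.
logits = [[2.0, 0.0], [1.5, 0.0], [0.0, 2.0]]
assign = [gumbel_softmax(row, tau=0.5, rng=rng) for row in logits]
prior = affinity(assign)                             # prior over recipients
baselines = per_agent_baselines(assign, [1.0, -1.0])  # hypothetical group values
# Messages received by one agent, keyed by sender id (hypothetical values).
received = {0: [1.0, 0.0], 1: [3.0, 2.0]}
without_sender_0 = counterfactual_mean(received, sender=0)
```

Under these assumptions, comparing the recipient's value with the actual aggregate against its value with `without_sender_0` would give a per-sender communication advantage, without rerunning the environment.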
Source: arXiv: 2603.04833