Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems
1️⃣ One-Sentence Summary
This paper proposes a gradient-based two-stage framework for interpretably detecting, localizing, and tracing the initial source of failures and their propagation paths across agents in multi-agent reinforcement learning systems, improving diagnostic capability in safety-critical applications.
Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure-analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives: first-order sensitivity and directional second-order curvature, aggregated over causal windows to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2-99.4% Patient-0 detection accuracy while providing interpretable geometric evidence for its detection decisions. By moving beyond black-box detection to interpretable gradient-level forensics, this framework offers practical tools for diagnosing cascading failures in safety-critical MARL systems.
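The abstract compresses Stage 1 into a single sentence, so a small sketch may help make the Taylor-remainder detector concrete. The Python snippet below is a minimal illustration, not the authors' implementation: `loss_fn` stands in for a hypothetical per-agent policy-gradient cost, `delta` for a recent parameter update, and the first-crossing rule is the plain thresholding the abstract describes; all names and the exact scoring are assumptions.

```python
import torch

def taylor_remainder(loss_fn, theta, delta):
    """Stage-1 per-agent failure score: the gap between the true cost at
    theta + delta and its first-order Taylor prediction,

        R = | J(theta + delta) - J(theta) - <grad J(theta), delta> |.

    A large remainder means the agent's cost surface has left the locally
    linear regime, which the framework reads as a failure signature.
    theta and delta are assumed to be flat 1-D parameter vectors.
    """
    theta = theta.detach().clone().requires_grad_(True)
    loss = loss_fn(theta)
    (grad,) = torch.autograd.grad(loss, theta)
    linear_pred = loss.detach() + grad @ delta
    with torch.no_grad():
        actual = loss_fn(theta + delta)
    return (actual - linear_pred).abs().item()


def detect_patient0(score_stream, threshold):
    """Flag the Patient-0 candidate at the first threshold crossing.

    score_stream yields one list of per-agent remainder scores per
    timestep; ties at the first crossing go to the highest score.
    """
    for t, scores in enumerate(score_stream):
        flagged = [i for i, s in enumerate(scores) if s > threshold]
        if flagged:
            return t, max(flagged, key=lambda i: scores[i])
    return None, None  # no crossing observed in this episode
```

With a quadratic toy cost, the remainder equals the second-order Taylor term exactly, so the threshold has a direct interpretation as a bound on tolerated local curvature.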
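Stage 2's geometric quantities also admit a short sketch. Below, `critic` is a hypothetical centralized critic Q_i(obs, a_1, ..., a_n) assumed twice-differentiable: the first-order sensitivity is the gradient norm of Q_i with respect to agent j's action, and the directional curvature is computed as a Hessian-vector product along a deviation direction d_j. The mean-over-window aggregation and the edge threshold are illustrative choices, not the paper's exact rule.

```python
import torch

def critic_geometry(critic, obs, actions, directions):
    """Per-step Stage-2 geometry: S[i, j] = ||dQ_i/da_j|| (first-order
    sensitivity) and C[i, j] = d_j^T (d^2 Q_i / da_j^2) d_j (directional
    curvature), using Hessian-vector products instead of full Hessians.

    `actions` and `directions` are lists of 1-D action tensors; `critic`
    is a hypothetical twice-differentiable centralized critic.
    """
    n = len(actions)
    acts = [a.detach().clone().requires_grad_(True) for a in actions]
    S, C = torch.zeros(n, n), torch.zeros(n, n)
    for i in range(n):
        q_i = critic(i, obs, torch.cat(acts))  # assumed signature
        grads = torch.autograd.grad(q_i, acts, create_graph=True)
        for j, (g, d) in enumerate(zip(grads, directions)):
            S[i, j] = g.norm().detach()
            # Hessian-vector product: d(g . d)/d a_j = (d^2 Q_i/da_j^2) d
            (hvp,) = torch.autograd.grad(g @ d, acts[j], retain_graph=True)
            C[i, j] = (d @ hvp).detach()
    return S, C


def contagion_graph(S_window, C_window, edge_threshold):
    """Aggregate per-step geometry over a causal window into a directed
    contagion graph: an edge j -> i says deviations in agent j's action
    plausibly propagate into agent i's value landscape."""
    W = torch.stack(S_window).mean(0) + torch.stack(C_window).abs().mean(0)
    n = W.shape[0]
    return [(j, i, W[i, j].item())
            for i in range(n) for j in range(n)
            if i != j and W[i, j] > edge_threshold]
```

Explaining a "downstream-first" flag then reduces to checking whether the flagged agent has strong incoming edges from the eventual Patient-0 inside the causal window.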
Source: arXiv:2602.08104