When can we trust untrusted monitoring? A safety case sketch across collusion strategies
1️⃣ One-sentence summary
This paper explores how relaxing assumptions about AI collusion strategies allows the construction of a more rigorous safety argument for assessing the risks of using one untrusted AI model to monitor another, and identifies challenges posed by newly considered strategies such as passive self-recognition.
AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring -- using one untrusted model to oversee another -- is one approach to reducing risk. Justifying the safety of an untrusted monitoring deployment is challenging because developers cannot safely deploy a misaligned model to test their protocol directly. In this paper, we build on existing methods for rigorously demonstrating safety based on pre-deployment testing. We relax assumptions that previous AI control research made about the collusion strategies a misaligned AI might use to subvert untrusted monitoring. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies. We create a safety case sketch to clearly present our argument, explicitly state our assumptions, and highlight unsolved challenges. We identify conditions under which passive self-recognition could be a more effective collusion strategy than those studied previously. Our work builds towards more robust evaluations of untrusted monitoring.
Source: arXiv:2602.20628