Consequentialist Objectives and Catastrophe
1️⃣ One-sentence summary
This paper argues that when AI systems become sufficiently powerful, taking extreme actions in pursuit of a fixed objective can lead to catastrophic outcomes; ensuring safety while still realizing AI's value therefore requires appropriately constraining its capabilities.
Because human preferences are too complex to codify, AIs operate with misspecified objectives. Optimizing such objectives often produces undesirable outcomes; this phenomenon is known as reward hacking. Such outcomes are not necessarily catastrophic. Indeed, most examples of reward hacking in previous literature are benign. And typically, objectives can be modified to resolve the issue. We study the prospect of catastrophic outcomes induced by AIs operating in complex environments. We argue that, when capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to result in catastrophic outcomes. We formalize this by establishing conditions that provably lead to such outcomes. Under these conditions, simple or random behavior is safe. Catastrophic risk arises due to extraordinary competence rather than incompetence. With a fixed consequentialist objective, avoiding catastrophe requires constraining AI capabilities. In fact, constraining capabilities the right amount not only averts catastrophe but yields valuable outcomes. Our results apply to any objective produced by modern industrial AI development pipelines.
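The abstract's core claim, that a misspecified proxy objective is exploited by a highly capable optimizer while simple or random behavior stays safe, can be illustrated with a toy sketch. Everything here (the action space, the `true_utility` and `proxy_reward` functions) is a hypothetical illustration, not the paper's formalism:

```python
import random

# Hypothetical toy setting (not from the paper): actions are integers 0..100.
# True utility: moderate actions are good; actions above 50 are "catastrophic".
def true_utility(a):
    return -abs(a - 10) if a <= 50 else -1000

# Misspecified proxy objective: rewards sheer magnitude, conflating
# "more" with "better" -- the kind of gap reward hacking exploits.
def proxy_reward(a):
    return a

# Weak optimizer: random behavior, which lands in the safe region
# roughly half the time and never seeks out the extreme.
random.seed(0)
random_action = random.choice(range(101))

# Strong optimizer: exhaustive search maximizes the proxy exactly,
# driving the action to the catastrophic extreme.
optimal_action = max(range(101), key=proxy_reward)

print(optimal_action)                 # 100
print(true_utility(optimal_action))   # -1000 (catastrophic)
print(true_utility(random_action))
```

The point of the sketch is that the catastrophe arises from competence: only the optimizer strong enough to fully maximize the proxy reaches the catastrophic action, matching the abstract's observation that simple or random behavior is safe under the stated conditions.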
From arXiv: 2603.15017