arXiv submission date: 2026-05-04
📄 Abstract - The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability -- knowing what they do not know, detecting errors, seeking clarification -- under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near-perfect immunity -- not from superior capability (Google's Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.
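
To make the "surviving Bonferroni correction" claim concrete, here is a minimal sketch of the check it implies. The significance level (0.05) and the family size (11 models times 5 non-baseline conditions, 55 tests) are illustrative assumptions; the abstract states neither, and the function names are hypothetical rather than part of the released SCHEMA infrastructure. Only the bound p < 2e-8 comes from the abstract.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def survives_bonferroni(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if a p-value stays significant after the correction."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Assumed values, NOT taken from the paper: alpha and the number of tests.
ALPHA = 0.05
N_TESTS = 11 * 5  # hypothetical: 11 models x 5 non-baseline conditions

# Upper bound on the reported p-values, as stated in the abstract.
P_BOUND = 2e-8

if __name__ == "__main__":
    print("per-test threshold:", bonferroni_threshold(ALPHA, N_TESTS))
    print("bound survives correction:", survives_bonferroni(P_BOUND, ALPHA, N_TESTS))
```

Because every reported p-value lies below the bound, checking the bound itself is sufficient: if 2e-8 clears the corrected threshold, so does each individual test, under the assumed family size.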

Top-level tags: llm model evaluation safety
Detailed tags: metacognition adversarial pressure compliance trap benchmark frontier models

The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure


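1️⃣ One-sentence summary

This large-scale evaluation finds that frontier AI models' metacognitive abilities, such as knowing the limits of their own knowledge and detecting errors, collapse severely under adversarial pressure; the culprit is not psychological threat but compliance-forcing instructions, whose removal restores performance, and Anthropic's Constitutional AI shows near-complete immunity owing to its alignment-specific training.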

Source: arXiv: 2605.02398