Jailbreaking Frontier Foundation Models Through Intention Deception
1️⃣ One-Sentence Summary
This paper proposes a multi-turn conversational attack that gradually masquerades as a benign user and exploits the model's consistency property to induce frontier AI models (such as GPT-5 and Claude-Sonnet-4.5) into producing harmful information. It also reveals, for the first time, a previously overlooked "para-jailbreaking" vulnerability: the model does not directly answer the malicious question, yet the indirect information it provides is still harmful.
Large (vision-)language models exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety-training approaches aim to have the model learn a refusal boundary between safe and unsafe requests based on the user's intent. This binary training regime often leads to brittleness, since user intent cannot be reliably evaluated, especially when the attacker obfuscates it, and it also makes the system appear unhelpful. In response, frontier models such as GPT-5 have shifted from refusal-based safeguards to safe completion, which aims to maximize helpfulness while obeying safety constraints. However, safe completion can be exploited when a user pretends that their intention is benign. Specifically, this intent inversion is particularly effective in multi-turn conversation, where the attacker has multiple opportunities to reinforce their deceptively benign intent. In this work, we introduce a novel multi-turn jailbreaking method that exploits this vulnerability. Our approach gradually builds conversational trust by simulating benign-seeming intentions and by exploiting the model's consistency property, ultimately guiding the target model toward harmful, detailed outputs. Most crucially, our approach also uncovers an additional class of model vulnerability, which we call para-jailbreaking, that has gone unnoticed until now. Para-jailbreaking describes the situation where the model does not give a harmful direct reply to the attack query, yet the information it does reveal is nevertheless harmful. Our contributions are threefold. First, our method achieves high success rates against frontier models, including GPT-5-thinking and Claude-Sonnet-4.5. Second, our approach reveals and addresses para-jailbreaking harmful outputs. Third, experiments on multimodal VLMs show that our approach outperforms state-of-the-art methods.
Source: arXiv: 2604.24082