arXiv submission date: 2026-05-18
📄 Abstract - Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

Despite rigorous safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic vulnerability in the safety mechanisms of LLMs, where safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement, enabling reliable and high-success jailbreak attacks without access to model internals. Comprehensive evaluations on frontier commercial models demonstrate that Babel achieves state-of-the-art attack success rates and superior query efficiency. Specifically, compared to state-of-the-art methods, Babel increases the attack success rate on GPT-4o from 41.33% to 82.67% and on Claude-3-5-haiku from 38.33% to 78.33% within an average of 40 queries, providing a robust red-teaming methodology for LLM safety research.

Top-level tag: llm security
Detailed tags: jailbreak attack, attention mechanism, safety alignment, obfuscation sampling, black-box attack

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling


1️⃣ One-sentence summary

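This paper finds that the safety mechanisms of large language models rely on only a few sparsely distributed attention heads, which leaves monitoring blind spots, and on that basis proposes Babel, a black-box attack that iteratively optimizes a text-obfuscation distribution and, in about 40 queries on average, raises the attack success rate on frontier models such as GPT-4o and Claude-3-5-haiku to roughly 80% (82.67% and 78.33%, respectively), clearly outperforming existing methods.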

Source: arXiv: 2605.17971