arXiv submission date: 2026-05-11
📄 Abstract - Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and that non-refusal tokens already carry substantial probability mass among the top-ranked candidates before the attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
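The preliminary observation in the abstract rests on two per-token measurements: the entropy of the next-token distribution at each decoding step, and how much probability mass a chosen set of tokens holds among the top-ranked candidates. The sketch below shows only these diagnostics on toy logits. It is not the paper's procedure; the function names, the 0.5-nat threshold, and the four-token vocabulary are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_k_mass(probs: Sequence[float], token_ids: Sequence[int], k: int) -> float:
    """Mass held by `token_ids` among the k most likely tokens."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    wanted = set(token_ids)
    return sum(probs[i] for i in top if i in wanted)


def high_entropy_positions(step_logits: Sequence[Sequence[float]],
                           threshold: float) -> List[int]:
    """Indices of decoding steps whose entropy exceeds `threshold` (hypothetical cutoff)."""
    return [t for t, logits in enumerate(step_logits)
            if entropy(softmax(logits)) > threshold]


if __name__ == "__main__":
    # Toy example: step 0 is split between two candidates, step 1 is near-deterministic.
    toy_logits = [
        [2.0, 1.8, 0.1, 0.0],
        [8.0, 0.0, 0.0, 0.0],
    ]
    print(high_entropy_positions(toy_logits, threshold=0.5))
    # Mass of token 1 (imagined here as a non-refusal token) within the top 2 at step 0.
    print(round(top_k_mass(softmax(toy_logits[0]), [1], k=2), 3))
```

In this toy setup only the first step is flagged as high-entropy, and the runner-up token at that step already holds a sizeable share of the probability, which mirrors the kind of pattern the abstract describes.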

Top-level tags: machine learning multi-modal llm
Detailed tags: jailbreak attack entropy maximization transferability refusal behavior safety

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization


1️⃣ One-sentence summary

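This paper proposes UJEM-KL, a lightweight untargeted jailbreak method that bypasses safety restrictions by maximizing entropy at the high-entropy tokens where the model decides whether to refuse (the "brake"), while keeping the remaining positions stable to preserve output quality, which consistently improves the cross-model transferability of the attack across several vision-language models.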

Source: arXiv: 2605.10764