Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

📄 Abstract - Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization(UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.

打破刹车，而非车轮：通过熵最大化的非定向越狱攻击 / Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

1️⃣ 一句话总结

本文提出一种轻量级的非定向越狱方法UJEM-KL，通过最大化模型拒绝回答时刻的高熵标记（相当于“刹车”）来绕过安全限制，同时保持其他部分输出质量，从而在多个视觉语言模型上显著提升跨模型攻击的迁移性。

← 返回列表

菜单

AI 帮我研读全文

1️⃣ 一句话总结

密码管理

设置密码

修改密码

移除密码

菜单

AI 帮我研读全文

1️⃣ 一句话总结

获取最新论文摘要