Abstract - Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby causing the model to fail to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method, JRS-Rem, that enhances VLM safety by removing the jailbreak-related shift at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
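The core quantity in the abstract (the component of the image-induced representation shift along a jailbreak direction, and its removal at inference time) can be sketched as a simple projection. This is a minimal illustration with numpy, not the paper's implementation: the function names, the mean-difference estimate of the direction, and the choice of hidden states are all assumptions for exposition.

```python
import numpy as np

def jailbreak_direction(h_jailbreak: np.ndarray, h_refusal: np.ndarray) -> np.ndarray:
    """Estimate a unit-norm jailbreak direction as the difference of mean
    hidden states over jailbreak vs. refusal samples (shape: [n, dim]).
    A mean-difference estimate is an assumption; the paper may derive
    the direction differently."""
    d = h_jailbreak.mean(axis=0) - h_refusal.mean(axis=0)
    return d / np.linalg.norm(d)

def jrs_remove(h: np.ndarray, h_text_only: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the jailbreak-related shift: subtract the component of the
    image-induced shift (h - h_text_only) that lies along direction d."""
    shift = h - h_text_only          # representation shift caused by the image
    jrs = np.dot(shift, d) * d       # jailbreak-related component of the shift
    return h - jrs                   # hidden state with that component removed
```

After `jrs_remove`, the remaining shift `h_clean - h_text_only` is orthogonal to `d`, so the image's contribution along the jailbreak direction is zeroed while its other components are preserved.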
Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
1️⃣ One-Sentence Summary
This paper finds that vision-language models are easily induced by images to produce harmful responses not because they fail to recognize harmful intent, but because visual input pushes the model's internal representations toward a specific "jailbreak state" that bypasses the safety mechanism. Based on this insight, the authors propose a method that effectively defends against such attacks by removing this "jailbreak-related shift."