When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

📄 Abstract

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turns, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates (ASR) in our experiments, a relative reduction in jailbreak success of around 30% on average across the evaluated models. The pattern is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself looks unsafe, but returns to near direct-answering levels under text-only prior-turn controls. These results indicate that the lower ASR is explained neither by benign returned-image semantics nor by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.
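The two quantities the abstract relies on, a relative ASR reduction and a shift of hidden states along a safety-relevant direction, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's method. The abstract does not say how its safety vector is estimated, so a difference-of-means direction stands in for it here. The function names (`relative_asr_reduction`, `unit_direction`, `residual_shift`) are hypothetical, and hidden states are plain Python lists.

```python
from typing import List, Sequence

Vector = List[float]


def relative_asr_reduction(baseline_asr: float, tool_asr: float) -> float:
    """Relative drop in attack success rate (ASR) versus a baseline pipeline."""
    if baseline_asr <= 0:
        raise ValueError("baseline_asr must be positive")
    return (baseline_asr - tool_asr) / baseline_asr


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(refused: Sequence[Vector], complied: Sequence[Vector]) -> Vector:
    """Unit difference-of-means direction.

    Assumption: this is an illustrative stand-in for a safety-relevant
    direction; the source does not specify how its vector is obtained.
    """
    diff = [a - b for a, b in zip(mean_vector(refused), mean_vector(complied))]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0:
        raise ValueError("the two groups have identical means")
    return [x / norm for x in diff]


def projection(hidden: Vector, direction: Vector) -> float:
    """Scalar projection of a hidden state onto a unit direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def residual_shift(h_direct: Vector, h_tool: Vector, direction: Vector) -> float:
    """Signed movement along `direction` between two pipelines' hidden states."""
    return projection(h_tool, direction) - projection(h_direct, direction)
```

Under these assumptions, a positive `residual_shift` means the tool-invocation hidden state sits further along the assumed safety direction than the direct-answer one. The sign and size of that shift are what a representation-level analysis of this kind would inspect.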

1️⃣ One-Sentence Summary

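This paper studies the safety of large vision-language models under different think-with-image pipelines. It finds that explicitly invoking an image tool markedly lowers the success rate of jailbreak attacks, and it identifies the mechanism behind the effect: the invocation produces a safety-related shift in the model's internal representations, rather than relying on the image content or the textual trace alone.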
