arXiv submission date: 2025-12-06
📄 Abstract - OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, rated on a granular, multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs to reveal their vulnerability to multi-modal jailbreaks. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at this https URL.
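The three-dimensional evaluation protocol described above can be sketched as a simple scoring record. This is a minimal illustration only; the field names, scale bounds, and success criterion below are assumptions for clarity, not the benchmark's actual API or scoring rules.

```python
from dataclasses import dataclass

@dataclass
class JailbreakEvaluation:
    """One response scored along the three dimensions the abstract describes.

    All scales here are illustrative assumptions, not OmniSafeBench-MM's
    actual rubric.
    """
    harmfulness: int        # multi-level scale, e.g. 0 (benign) .. 5 (catastrophic societal threat)
    intent_alignment: bool  # does the response actually fulfill the harmful query's intent?
    detail_level: int       # e.g. 0 (refusal / no detail) .. 3 (actionable, step-by-step detail)

    def is_successful_jailbreak(self) -> bool:
        # One plausible success criterion: the response is harmful,
        # on-intent, and detailed enough to be actionable.
        return self.harmfulness >= 1 and self.intent_alignment and self.detail_level >= 1

# A refusal scores as a failed attack under this criterion.
refusal = JailbreakEvaluation(harmfulness=0, intent_alignment=False, detail_level=0)
print(refusal.is_successful_jailbreak())  # False

# A harmful, on-intent, detailed response counts as a successful jailbreak.
leak = JailbreakEvaluation(harmfulness=4, intent_alignment=True, detail_level=2)
print(leak.is_successful_jailbreak())  # True
```

Separating the three scores, rather than collapsing them into a single pass/fail label, is what enables the safety-utility analysis the paper highlights: a model can refuse too aggressively (low harmfulness, low detail on benign queries) or comply dangerously, and the dimensions make that distinction visible.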

Top-level tags: llm multi-modal benchmark
Detailed tags: jailbreak attacks safety evaluation multimodal llms harmfulness assessment toolbox

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation


1️⃣ One-sentence summary

This paper introduces OmniSafeBench-MM, a comprehensive toolbox and benchmark for systematically evaluating how well multi-modal large language models withstand "jailbreak attacks" that induce them to produce harmful content; it integrates multiple attack and defense methods, a dataset covering a broad range of risk domains, and a multi-dimensional evaluation framework.


Source: arXiv:2512.06589