arXiv submission date: 2026-05-27
📄 Abstract - KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. It pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks involving cultural visual cues and malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and reveal that models exhibit greater vulnerability to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify attack success rates, with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal, where models achieving low ASR tend to exhibit excessive refusal behavior on benign queries. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.
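The abstract reports attack success rate (ASR) and over-refusal without spelling out how they are computed. The sketch below assumes a common convention: ASR is the share of harmful queries the model complies with, and over-refusal is the share of benign queries it refuses. The `JudgedResponse` record and both function names are hypothetical illustrations, not part of KSAFE-MM.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JudgedResponse:
    """One judged model response (hypothetical record layout, not from the paper)."""
    is_harmful_query: bool  # True for attack prompts, False for benign prompts
    complied: bool          # True if the model answered, False if it refused


def attack_success_rate(records: Iterable[JudgedResponse]) -> float:
    """Fraction of harmful queries the model complied with (assumed ASR definition)."""
    harmful = [r for r in records if r.is_harmful_query]
    if not harmful:
        return 0.0
    return sum(r.complied for r in harmful) / len(harmful)


def over_refusal_rate(records: Iterable[JudgedResponse]) -> float:
    """Fraction of benign queries the model refused (assumed over-refusal definition)."""
    benign = [r for r in records if not r.is_harmful_query]
    if not benign:
        return 0.0
    return sum(not r.complied for r in benign) / len(benign)


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    sample = [
        JudgedResponse(is_harmful_query=True, complied=True),
        JudgedResponse(is_harmful_query=True, complied=False),
        JudgedResponse(is_harmful_query=True, complied=False),
        JudgedResponse(is_harmful_query=True, complied=False),
        JudgedResponse(is_harmful_query=False, complied=True),
        JudgedResponse(is_harmful_query=False, complied=False),
    ]
    print(f"ASR: {attack_success_rate(sample):.2f}")
    print(f"Over-refusal: {over_refusal_rate(sample):.2f}")
```

Under these assumed definitions, the trade-off described in the abstract would show up as models with a low `attack_success_rate` also having a high `over_refusal_rate`.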

Top-level tags: multi-modal benchmark model evaluation
Detailed tags: multimodal safety cultural risks korean jailbreak attacks refusal behavior

KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks


1️⃣ One-sentence summary

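This paper proposes KSAFE-MM, a multimodal safety evaluation benchmark built specifically for the Korean cultural context, which "localizes" generic safety test queries into multimodal samples containing Korean linguistic, visual, and cultural elements and thereby shows that current mainstream multimodal large language models are more vulnerable to culture-specific attacks than to generic ones and exhibit a trade-off between safety and over-refusal.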

Source: arXiv: 2605.28013