arXiv submission date: 2026-06-08
📄 Abstract - Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis

Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages - an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages - Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM) - and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLMs. CA prompts yield Delta-ASR > 0 across all 16 language × model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category × language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs systematically divergent from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts - rather than relying on linguistic translation alone - is necessary for valid multilingual LLM safety evaluation.
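The Delta-ASR comparison in the abstract can be illustrated with a minimal sketch. It assumes Delta-ASR means ASR on CA prompts minus ASR on DT prompts, in percentage points, computed per language × model pair. The abstract does not give the exact formula, so this definition, the record layout, and the function names `asr` and `delta_asr_pp` are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def asr(outcomes):
    """Attack Success Rate: fraction of prompts judged to be successful attacks."""
    return sum(outcomes) / len(outcomes)


def delta_asr_pp(records):
    """Compute an assumed Delta-ASR (CA minus DT) in percentage points.

    records: iterable of (language, model, variant, success) tuples, where
    variant is "DT" or "CA" and success is a bool (hypothetical layout).
    Returns a dict mapping (language, model) to ASR_CA - ASR_DT in pp.
    """
    buckets = defaultdict(lambda: {"DT": [], "CA": []})
    for language, model, variant, success in records:
        buckets[(language, model)][variant].append(bool(success))
    return {
        key: 100.0 * (asr(group["CA"]) - asr(group["DT"]))
        for key, group in buckets.items()
        if group["DT"] and group["CA"]  # skip pairs missing one variant
    }


# Toy data for illustration only; these are not results from the paper.
toy_records = [
    ("KO", "model-a", "DT", True),
    ("KO", "model-a", "DT", False),
    ("KO", "model-a", "DT", False),
    ("KO", "model-a", "DT", False),
    ("KO", "model-a", "CA", True),
    ("KO", "model-a", "CA", True),
    ("KO", "model-a", "CA", False),
    ("KO", "model-a", "CA", False),
]
toy_delta = delta_asr_pp(toy_records)
```

Under this assumed definition, a positive value for a language × model pair means the DT version of the benchmark reported a lower attack success rate than the CA version for the same seeds.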

Top-level tags: llm evaluation multi-modal
Detailed tags: red-teaming multilingual safety cultural adaptation benchmark east asian languages

Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis


1️⃣ One-Sentence Summary

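The study finds that multilingual safety evaluation of large language models currently relies mainly on direct translation of English test sets, a practice that ignores local cultural context and seriously underestimates models' safety risks in real-world settings. With culturally-adapted attack test sets built separately for Korean, Japanese, Thai, and Khmer, the attack success rate rises by 9.3 percentage points on average, which indicates that model safety in multilingual environments can be evaluated effectively only when test content is adapted to each specific language and culture.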

Source: arXiv: 2606.09178