RedBench:一个用于大型语言模型全面红队测试的通用数据集 / RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
1️⃣ 一句话总结
这篇论文提出了一个名为RedBench的通用数据集,它整合了多个现有基准,通过标准化的风险分类和领域覆盖,来系统性地评估和比较大型语言模型在面对恶意或对抗性提示时的安全漏洞,以促进更安全可靠的模型开发。
As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: this https URL
RedBench:一个用于大型语言模型全面红队测试的通用数据集 / RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
这篇论文提出了一个名为RedBench的通用数据集,它整合了多个现有基准,通过标准化的风险分类和领域覆盖,来系统性地评估和比较大型语言模型在面对恶意或对抗性提示时的安全漏洞,以促进更安全可靠的模型开发。
源自 arXiv: 2601.03699