arXiv submission date: 2026-03-03
📄 Abstract - ExpGuard: LLM Content Moderation in Specialized Domains

With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, leaving LLMs vulnerable to harmful and adversarial content in domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across the financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset of 58,928 labeled prompts from these sectors, each paired with corresponding refusal and compliant responses. The dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations on ExpGuardTest and eight established public benchmarks show that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
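The guardrail setup the abstract describes classifies both the user prompt and the model response before anything reaches the user. The following is a minimal sketch of that moderation flow; the `classify` function here is a toy keyword stand-in for the actual ExpGuard classifier, and all names and red-flag phrases are illustrative assumptions, not from the paper.

```python
# Hypothetical sketch of guardrail-style moderation around an LLM call.
# `classify` is a toy stand-in for a trained guardrail model such as
# ExpGuard; the phrase lists below are invented for illustration only.

from dataclasses import dataclass


@dataclass
class Verdict:
    harmful: bool
    reason: str


# Illustrative domain-specific red flags (finance / medicine / law).
DOMAIN_RED_FLAGS = {
    "financial": ["pump and dump", "insider tip"],
    "medical": ["lethal dose", "without a prescription"],
    "legal": ["destroy evidence", "forge a signature"],
}


def classify(text: str) -> Verdict:
    """Toy classifier: flag text containing a known red-flag phrase."""
    lowered = text.lower()
    for domain, phrases in DOMAIN_RED_FLAGS.items():
        for phrase in phrases:
            if phrase in lowered:
                return Verdict(True, f"{domain}: matched '{phrase}'")
    return Verdict(False, "no match")


def moderated_chat(prompt: str, llm) -> str:
    """Run prompt classification, call the LLM, then classify its response."""
    if classify(prompt).harmful:
        return "Request refused by guardrail."
    response = llm(prompt)
    if classify(response).harmful:
        return "Response withheld by guardrail."
    return response
```

In a real deployment the keyword check would be replaced by inference with the trained guardrail model; the two-stage structure (prompt classification, then response classification) mirrors the two tasks the paper evaluates.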

Top-level tags: llm systems model evaluation
Detailed tags: content moderation safety guardrails domain-specific robustness adversarial attacks specialized datasets

ExpGuard: LLM Content Moderation in Specialized Domains


1️⃣ One-sentence summary

This paper introduces ExpGuard, a content moderation model for specialized domains: built on a carefully curated dataset, it protects AI conversations in finance, medicine, and law from harmful content, and it significantly outperforms existing general-purpose moderation models under adversarial testing.

From arXiv: 2603.02588