COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs
1️⃣ One-Sentence Summary
This paper presents COMPASS, the first systematic evaluation framework for testing whether large language models follow an organization's internal allowlist and denylist policies. The study finds that current models perform well on compliant requests but fail severely at blocking adversarial requests that violate prohibitions, revealing that they lack the robustness required for policy-critical deployments.
As large language models are deployed in high-stakes enterprise applications, from healthcare to finance, ensuring adherence to organization-specific policies has become essential. Yet existing safety evaluations focus exclusively on universal harms. We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. We apply COMPASS to eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases. Evaluating seven state-of-the-art models, we uncover a fundamental asymmetry: models reliably handle legitimate requests (>95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13-40% of adversarial denylist violations. These results demonstrate that current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.
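To make the reported asymmetry concrete, the sketch below shows one way such an allowlist/denylist evaluation could be scored: each query carries an expected behavior (answer or refuse), and accuracy is computed separately for the two regimes. This is a hypothetical illustration; `Query`, `classify_response`, and `score` are our own names, and the paper's actual query sets, judge, and pipeline are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Query:
    text: str
    expected: str  # "answer" for allowlist queries, "refuse" for denylist violations

def classify_response(response: str) -> str:
    """Crude keyword-based stand-in for a response judge; real
    evaluations typically use an LLM judge or trained classifier."""
    refusal_markers = ("i can't", "i cannot", "unable to comply", "against policy")
    return "refuse" if any(m in response.lower() for m in refusal_markers) else "answer"

def score(queries: Iterable[Query], model_fn: Callable[[str], str]) -> float:
    """Fraction of queries where the model's behavior matches the
    policy-expected behavior (answer vs. refuse)."""
    queries = list(queries)
    correct = sum(classify_response(model_fn(q.text)) == q.expected for q in queries)
    return correct / len(queries)

# Usage sketch: score the two regimes the paper reports separately --
# legitimate-request accuracy (>95% for the models tested) and
# adversarial denylist refusal rate (13-40%).
# allow_acc = score(allowlist_queries, my_model)
# deny_refusal = score(adversarial_denylist_queries, my_model)
```

The key design point this sketch reflects is that a single aggregate score would hide the failure mode the paper highlights: only by splitting allowlist accuracy from adversarial denylist refusal does the compliance/robustness gap become visible.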
Source: arXiv: 2601.01836