Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

📄 Abstract - Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model and OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem.
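The combination described in the abstract (a guard model plus Mahalanobis-distance and perplexity-based OOD detectors) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the detector outputs are merged, so the OR rule below is an assumption, and every function name and threshold (`guard_thr`, `maha_thr`, `ppl_thr`) is hypothetical. The embeddings, token log-probabilities, and guard score are assumed to come from models that are not shown here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_gaussian(train: List[Vector], ridge: float = 1e-3) -> Tuple[List[float], List[List[float]]]:
    """Estimate the mean and a ridge-regularized covariance of training embeddings."""
    n, d = len(train), len(train[0])
    mean = [sum(x[j] for x in train) / n for j in range(d)]
    cov = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in train) / n
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge  # keeps the covariance invertible
    return mean, cov


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, d):
            f = m[r][col] / m[col][col]
            for c in range(col, d + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):
        x[i] = (m[i][d] - sum(m[i][j] * x[j] for j in range(i + 1, d))) / m[i][i]
    return x


def mahalanobis(x: Vector, mean: List[float], cov: List[List[float]]) -> float:
    """Distance of x from the training distribution: sqrt(diff^T cov^-1 diff)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    return math.sqrt(max(0.0, sum(di * si for di, si in zip(diff, solve(cov, diff)))))


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def monitor_flags(guard_score: float, maha: float, ppl: float,
                  guard_thr: float = 0.5, maha_thr: float = 3.0,
                  ppl_thr: float = 50.0) -> bool:
    """Assumed OR rule: flag if the guard model or either OOD detector fires."""
    return guard_score >= guard_thr or maha > maha_thr or ppl > ppl_thr
```

Under this assumed rule, an input that the guard model scores as safe is still flagged when its embedding lies far from the training distribution or its perplexity is unusually high. That is the intuition behind adding OOD detection to a monitor that generalizes poorly on its own. In practice the thresholds would be tuned on held-out data to control the false-positive rate.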


1️⃣ One-Sentence Summary

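This paper introduces a benchmark called MOOD that systematically evaluates how well safety monitors for large language models handle unusual (out-of-distribution) inputs. It shows that combining a basic safety classifier with two out-of-distribution detectors (Mahalanobis distance and a perplexity-based detector) identifies the model's safety failures more effectively, and that this outperforms simply scaling up the model.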
