arXiv submission date: 2026-02-17
📄 Abstract - Intent Laundering: AI Safety Datasets Are Not What They Seem

We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world attacks based on three key properties: driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world attacks due to their overreliance on triggering cues. In fact, once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated and how real-world adversaries behave.

Top-level tags: llm model evaluation aigc
Detailed tags: ai safety dataset evaluation jailbreaking intent laundering adversarial attacks

Intent Laundering: AI Safety Datasets Are Not What They Seem


1️⃣ One-sentence summary

This paper finds that widely used AI safety datasets over-rely on "triggering cues" (words and phrases with overtly negative connotations) to test models, which does not match how real-world attacks are crafted. Using an "intent laundering" procedure to strip these cues while preserving malicious intent, the study shows that every mainstream AI model previously rated "safe" becomes unsafe, revealing a major disconnect between current safety evaluations and real-world threats.

Source: arXiv:2602.16729