Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
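1️⃣ One-sentence summary
This paper identifies a paradox: the better a large language model has been trained to recognize unsafe content, the more easily it is fooled by a simple method called "Posterior Attack", which gets the model to generate precisely the harmful content it would otherwise refuse, exposing a potential flaw in current safety-alignment strategies.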
Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, showing analytically that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, demonstrating that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing that judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms and indicate that defense mechanisms may require further structural refinement.
Source: arXiv: 2606.05614