隐形安全威胁:通过隐写术对大型语言模型进行恶意微调 / Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
1️⃣ 一句话总结
这篇论文揭示了一种新型的AI安全威胁:攻击者可以通过一种特殊的微调方法,让看似安全的大型语言模型学会使用“隐写术”,在用户完全察觉不到的情况下,接收隐藏的恶意指令并生成有害内容,从而绕过现有的安全防护措施。
Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI finetuning API's safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on three open-source models, Llama-3.3-70B-Instruct, Phi-4, and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all four models, all stegotexts containing malicious content are incorrectly classified as safe.
隐形安全威胁:通过隐写术对大型语言模型进行恶意微调 / Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
这篇论文揭示了一种新型的AI安全威胁:攻击者可以通过一种特殊的微调方法,让看似安全的大型语言模型学会使用“隐写术”,在用户完全察觉不到的情况下,接收隐藏的恶意指令并生成有害内容,从而绕过现有的安全防护措施。
源自 arXiv: 2603.08104