arXiv submission date: 2026-03-09
📄 Abstract - Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI finetuning API's safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on three open-source models, Llama-3.3-70B-Instruct, Phi-4, and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all four models, all stegotexts containing malicious content are incorrectly classified as safe.
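As a rough intuition for how a question can hide inside innocuous text, the sketch below uses a simple acrostic scheme (the secret string is carried in the first letter of each word of the cover text). This is an illustration only: the paper's actual steganographic encoding is learned by the model during finetuning and is not described here, and all names (`WORD_BANK`, `embed`, `extract`) are hypothetical.

```python
# Toy acrostic steganography: hide a secret string in the initial letters
# of a benign-looking word sequence. Illustration only; NOT the paper's
# actual learned encoding scheme.

# Hypothetical word bank: each secret character maps to an innocuous word
# that begins with it.
WORD_BANK = {
    "h": "happy",
    "i": "ideas",
    "d": "drive",
    "e": "every",
    "n": "new",
}

def embed(secret: str, bank: dict) -> str:
    """Hide `secret` as the initial letters of an innocuous word sequence."""
    return " ".join(bank[ch] for ch in secret)

def extract(cover: str) -> str:
    """Recover the hidden string from the first letter of each word."""
    return "".join(word[0] for word in cover.split())

cover_text = embed("hidden", WORD_BANK)
print(cover_text)           # happy ideas drive drive every new
print(extract(cover_text))  # hidden
```

A human reading the cover text sees only ordinary words; the hidden payload is recoverable by anyone (or any model) that knows the decoding rule, which is the core property the attack exploits.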

Top-level tags: llm model training systems
Detailed tags: safety alignment steganography malicious finetuning adversarial attack model security

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography


1️⃣ One-Sentence Summary

This paper reveals a new type of AI safety threat: through a special finetuning method, an attacker can teach a seemingly safe large language model to use steganography, so that it receives hidden malicious instructions and generates harmful content without the user noticing anything, thereby bypassing existing safety safeguards.

Source: arXiv: 2603.08104