菜单

关于 🐙 GitHub
arXiv 提交日期: 2026-05-27
📄 Abstract - Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Large Language Models (LLMs) are increasingly vulnerable to adversarial prompts that exploit semantic ambiguities to bypass safety mechanisms, resulting in harmful or inappropriate outputs. Such attacks, including jailbreaking and prompt injection, pose significant risks to the integrity and availability of LLMs in security-critical applications. This paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in input prompts before they are processed by the LLM. The APD framework integrates three key innovations: (1) a mutual information-based semantic decomposition method to isolate adversarial and benign prompt components, ensuring statistical independence; (2) a graph-based intent classification approach that leverages spectral analysis to detect malicious patterns in prompt semantics; and (3) a lightweight transformer-based classifier trained on real-world datasets of toxic and jailbreaking prompts, enabling efficient and accurate adversarial intent detection. Evaluated on diverse datasets containing adversarial prompts, APD demonstrates superior robustness, reducing harmful output generation by over 85\% while maintaining negligible impact on model performance. The framework's computational efficiency supports real-time deployment, making it a practical solution for securing LLMs. Our work addresses critical challenges in machine learning security on novel attacks and integrity methods for ML systems, and offers a scalable, ethically grounded defense against prompt-based adversarial threats.

顶级标签: llm machine learning
详细标签: adversarial prompts jailbreaking prompt injection semantic decomposition defense mechanism 或 搜索:

解构对抗性提示:一种基于语义图的鲁棒大语言模型安全防御方法 / Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security


1️⃣ 一句话总结

本文提出了一种名为APD的防御框架,通过将用户输入中的恶意部分与正常部分分离开来,并利用语义图技术识别攻击模式,能在不降低大语言模型性能的前提下,将有害输出减少85%以上,从而有效抵御越狱攻击和提示注入等安全威胁。

源自 arXiv: 2605.27823