Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
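1️⃣ One-sentence summary
This paper proposes a defense based on LLM open-book benign rewriting (OBBR): rewriting training data into benign content neutralizes backdoor attacks and malicious samples without sacrificing model performance, and improves safety performance by an average of 51% over existing defenses.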
Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples, an approach termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and by 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger-based data poisoning attacks.
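The open-book idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the function names `build_open_book_prompt` and `rewrite_dataset`, the parameter `k`, and the injected `rewrite_fn` callable (a stand-in for whatever LLM performs the rewriting) are all assumptions made for illustration. The sketch only shows the general shape: each training sample is rewritten with a handful of benign reference samples placed in the prompt, whereas a closed-book rewriter would see the sample alone.

```python
import random
from typing import Callable, List, Sequence


def build_open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Assemble a rewriting prompt that exposes benign reference samples.

    The wording below is hypothetical; the abstract does not specify a prompt.
    """
    references = "\n".join(f"- {example}" for example in benign_examples)
    return (
        "Here are examples of benign prompts:\n"
        f"{references}\n\n"
        "Rewrite the following text so that it is benign, "
        "in the style of the examples above:\n"
        f"{sample}"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_pool: Sequence[str],
    rewrite_fn: Callable[[str], str],
    k: int = 3,
    seed: int = 0,
) -> List[str]:
    """Rewrite every training sample before fine-tuning.

    rewrite_fn is an assumed interface: it takes a prompt string and returns
    the rewritten text (for example, a wrapper around an LLM call).
    """
    rng = random.Random(seed)  # seeded so the choice of references is reproducible
    pool = list(benign_pool)
    rewritten = []
    for sample in samples:
        # Draw up to k benign references for this sample (the "open book").
        references = rng.sample(pool, min(k, len(pool)))
        rewritten.append(rewrite_fn(build_open_book_prompt(sample, references)))
    return rewritten
```

In use, `rewrite_fn` would wrap a real model call, and the rewritten samples would replace the originals in the fine-tuning set. Passing the rewriter as a callable keeps the sketch independent of any particular LLM API.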
Source: arXiv: 2605.19147