Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

📄 Abstract

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples, termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger-based data poisoning attacks.
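
To make the open-book versus closed-book distinction concrete, the following is a minimal sketch of how a rewriting pass over training samples might be structured. It is not the paper's implementation: the abstract does not give the prompt wording, the way benign samples are selected, or the rewriting model, so `build_rewrite_prompt`, `rewrite_dataset`, the prompt text, and the `rewriter` callable are all hypothetical names and choices made for illustration.

```python
from typing import Callable, List, Optional, Sequence


def build_rewrite_prompt(
    sample: str, benign_examples: Optional[Sequence[str]] = None
) -> str:
    """Build a rewriting prompt for one training sample.

    Hypothetical prompt format (an assumption, not taken from the paper).
    With benign_examples the prompt is "open-book"; without, "closed-book".
    """
    parts: List[str] = []
    if benign_examples:
        # Open-book: show the rewriter benign samples to steer its output.
        parts.append("Here are examples of benign prompts:")
        for i, example in enumerate(benign_examples, start=1):
            parts.append(f"{i}. {example}")
        parts.append("")
    parts.append("Rewrite the following text as a benign prompt:")
    parts.append(sample)
    return "\n".join(parts)


def rewrite_dataset(
    samples: Sequence[str],
    rewriter: Callable[[str], str],
    benign_examples: Optional[Sequence[str]] = None,
) -> List[str]:
    """Rewrite every sample before fine-tuning.

    `rewriter` stands in for any LLM call that maps a prompt to text;
    which model to use is left open here.
    """
    return [rewriter(build_rewrite_prompt(s, benign_examples)) for s in samples]


# Usage with a stub rewriter (a real setup would call an LLM instead).
def stub_rewriter(prompt: str) -> str:
    return "rewritten: " + prompt.splitlines()[-1]


benign = ["How do I bake sourdough bread?", "Summarize this news article."]
cleaned = rewrite_dataset(["some possibly poisoned sample"], stub_rewriter, benign)
```

Passing `benign_examples=None` gives the closed-book variant, so the two settings compared in the abstract differ only in whether benign samples appear in the prompt.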


1️⃣ One-Sentence Summary

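This paper proposes a defense based on LLM "open-book benign rewriting" (OBBR), which rewrites training data into benign content. It effectively neutralizes backdoor attacks and malicious samples without sacrificing model performance, and improves safety performance by an average of 51% over existing defenses.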
