arXiv submission date: 2026-06-02
📄 Abstract - Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors induces the suppression of others, we introduce the Cross Activation Shift Distance, to quantify the distance between model changes induced by different trainings. Our results open a new direction for LLM safety as defenders could deliberately inject controlled backdoors and then remove them, leveraging cross-backdoor transfer to also suppress unknown backdoors that an attacker may have previously introduced in the model.

Top-level tags: llm security
Detailed tags: backdoor attack unlearning cross-backdoor transfer defense activation shift

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs


1️⃣ One-sentence summary

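This paper finds that, in large language models, training a model to unlearn a single known backdoor trigger can also suppress other unknown backdoors that were never explicitly targeted. This gives defenders a new approach: using controlled backdoors to remove potential attacker-planted backdoors in bulk.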

Source: arXiv: 2606.03785