Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors
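1️⃣ One-sentence summary
This paper challenges the conventional view, showing that removing a known backdoor trigger does not truly eliminate the backdoor from an AI model: multiple perceptually distinct alternative triggers can activate the same backdoor, so defenses should target the backdoor direction in feature space rather than only input-level triggers.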
Current backdoor defenses assume that neutralizing a known trigger removes the backdoor. We show that this trigger-centric view is incomplete: alternative triggers, patterns perceptually distinct from the training triggers, reliably activate the same backdoor. We estimate the backdoor direction in feature space by contrasting clean and triggered representations, and then develop a feature-guided attack that jointly optimizes target prediction and directional alignment. We first prove theoretically that alternative triggers exist and are an inevitable consequence of backdoor training, and then verify this empirically. We also show that defenses that remove training triggers often leave the backdoor intact, so alternative triggers can still exploit the latent backdoor in feature space. Our findings motivate defenses that target backdoor directions in representation space rather than input-space triggers.
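To make the two ingredients named in the abstract concrete, here is a minimal sketch in plain Python. It rests on assumptions, because the abstract does not give the paper's exact estimator or loss: the backdoor direction is taken to be the normalized difference between the mean triggered and mean clean feature vectors, and the joint objective is taken to be cross-entropy toward the target class minus a weighted cosine alignment with that direction. The function names and the weight `lam` are hypothetical.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def estimate_backdoor_direction(clean_feats, triggered_feats):
    """Assumed estimator: unit vector from the clean mean to the triggered mean."""
    mu_clean = mean_vector(clean_feats)
    mu_trig = mean_vector(triggered_feats)
    diff = [t - c for t, c in zip(mu_trig, mu_clean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("clean and triggered feature means coincide")
    return [x / norm for x in diff]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def joint_objective(logits, feat_shift, direction, target, lam=1.0):
    """Assumed loss: cross-entropy toward `target` minus lam * alignment.

    `feat_shift` is the feature of the perturbed input minus the feature of
    the clean input; lower values mean the candidate trigger both pushes the
    prediction toward the target and moves features along the direction.
    """
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    cross_entropy = log_z - logits[target]
    return cross_entropy - lam * cosine(feat_shift, direction)
```

In a real attack the features and logits would come from the backdoored model, and the candidate trigger would be updated by gradient descent on this objective; the sketch only evaluates the objective for given values.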
Source: arXiv: 2603.09772