arXiv submission date: 2026-02-24
📄 Abstract - Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models

Multimodal Diffusion Language Models (MDLMs) have recently emerged as a competitive alternative to their autoregressive counterparts, yet their vulnerability to backdoor attacks remains largely unexplored. In this work, we show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs, enabling attackers to manipulate model behavior via specific triggers while maintaining normal performance on clean inputs. However, defense strategies effective for these models have yet to emerge. To bridge this gap, we introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification). DiSP is driven by a key observation: selectively masking certain vision tokens at inference time can neutralize a backdoored model's trigger-induced behaviors and restore normal functionality. Building on this, we purify the poisoned dataset using the compromised model itself, then fine-tune the model on the purified data to restore it to a clean state. By design, DiSP removes backdoors without requiring any auxiliary models or clean reference data. Extensive experiments demonstrate that our approach effectively mitigates backdoor effects, reducing the attack success rate (ASR) from over 90% to typically under 5%, while maintaining model performance on benign tasks.

Top-level tags: multi-modal model training, model evaluation
Detailed tags: backdoor defense, diffusion models, multimodal language models, security, self-purification

Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models


1️⃣ One-sentence summary

This paper proposes a self-purification defense framework named DiSP that, without relying on auxiliary models or clean data, eliminates backdoors in multimodal diffusion language models through selective masking of vision tokens and fine-tuning, reducing the attack success rate from over 90% to under 5%.
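The core observation above can be sketched as a minimal masking step: replace a fraction of the vision-token positions in an input sequence with the diffusion mask token before decoding. Everything below is an illustrative assumption — the token ids, function names, and masking policy are hypothetical, not the paper's actual implementation.

```python
import random

MASK_TOKEN = -1  # hypothetical id standing in for the diffusion [MASK] token


def mask_vision_tokens(tokens, vision_positions, mask_ratio=0.5, seed=0):
    """Selectively mask a fraction of vision-token positions.

    tokens: list of token ids for a multimodal input sequence.
    vision_positions: indices within `tokens` that hold vision tokens.
    mask_ratio: fraction of vision tokens to replace with MASK_TOKEN.

    This mimics (in spirit only) DiSP's inference-time step in which
    masking some vision tokens suppresses trigger-induced behavior
    while leaving the rest of the input intact.
    """
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    k = int(len(vision_positions) * mask_ratio)
    chosen = set(rng.sample(vision_positions, k))
    return [MASK_TOKEN if i in chosen else t for i, t in enumerate(tokens)]
```

In the full pipeline described by the abstract, sequences masked this way would be re-decoded by the compromised model itself to produce purified training data, on which the model is then fine-tuned.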

Source: arXiv: 2602.22246