arXiv submission date: 2025-12-08
📄 Abstract - Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate the impact of function words. Similar to differential amplifiers, our FDA computes both the original cross-attention and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former to obtain better-aligned and more robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, and provide in-depth ablation studies and analysis. Code will be made publicly available at this https URL.
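To make the differential-amplifier analogy concrete, below is a minimal sketch of the idea as described in the abstract: compute the usual cross-attention, isolate the attention mass that falls on function-word tokens, and subtract the latter from the former. The function name `fda_cross_attention`, the scaling factor `alpha`, and the exact renormalization are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the Function-word De-Attention (FDA) idea: subtract the
# attention paid to function-word tokens from the full cross-attention.
import torch
import torch.nn.functional as F

def fda_cross_attention(q, k, v, function_word_mask, alpha=1.0):
    """
    q: (batch, heads, Lq, d)     image-side queries
    k, v: (batch, heads, Lt, d)  text-side keys / values
    function_word_mask: (batch, Lt) bool, True where the text token is a function word
    alpha: strength of the differential subtraction (assumed hyperparameter)
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (batch, heads, Lq, Lt)

    # Original cross-attention over all text tokens.
    attn_full = F.softmax(scores, dim=-1)

    # Attention restricted to function-word positions only.
    fw_mask = function_word_mask[:, None, None, :]        # broadcast to (batch, 1, 1, Lt)
    attn_fw = attn_full * fw_mask

    # Differential subtraction: pay less attention to function words, then renormalize.
    attn = (attn_full - alpha * attn_fw).clamp(min=0.0)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)

    return attn @ v
```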

Top-level tags: multi-modal, model training, model evaluation
Detailed tags: vision-language models, adversarial robustness, cross-modal attention, function words, adversarial attacks

Pay Less Attention to Function Words for Free Robustness of Vision-Language Models


1️⃣ One-sentence summary

This paper finds that the vulnerability of vision-language models to cross-modal adversarial attacks is tied to their over-attending to function words in the text (such as "the" and "in"), and proposes a new method called Function-word De-Attention that subtracts the influence of function words from the attention, significantly improving robustness to attacks while leaving normal task performance almost unchanged.


Source: arXiv: 2512.07222