arXiv submission date: 2026-04-30
📄 Abstract - One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

The hubness problem, in which hub embeddings are close to many unrelated examples, occurs often in high-dimensional embedding spaces and poses a practical threat to applications such as information retrieval and automatic evaluation metrics. In particular, since cross-modal similarity between text and images cannot be computed by direct comparison, such as string matching, cross-modal encoders that project different modalities into a shared space are used across a wide range of cross-modal applications, and the existence of hubs therefore poses a practical threat. To reveal the vulnerabilities of cross-modal encoders, we propose a method for identifying the hub embedding and its corresponding hub text. Experiments on image captioning evaluation in MSCOCO and nocaps, along with image-to-text retrieval tasks in MSCOCO and Flickr30k, showed that our method can identify a single hub text that unreasonably achieves similarity scores comparable to or higher than human-written reference captions for many images, thereby revealing the vulnerabilities in cross-modal encoders.

Top-level tags: machine learning, multi-modal
Detailed tags: hubness problem, cross-modal encoders, adversarial text, embedding vulnerability, image-text retrieval

One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness


1️⃣ One-sentence summary

This paper finds that cross-modal encoders for text and images suffer from a "hubness" problem: a single text can unexpectedly be highly similar to a large number of unrelated images. Building on this, the authors propose a method that, using just one specific text, makes widely used models such as CLIP behave anomalously on image captioning evaluation and image-to-text retrieval tasks, thereby exposing a latent security vulnerability in this class of models.
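The hubness effect described above can be illustrated with a small synthetic sketch. This is *not* the paper's method; it is a hypothetical numpy-only demo (random embeddings, an artificial shared-mean shift as a stand-in for the anisotropy of real CLIP embeddings) that measures hubness with the standard k-occurrence statistic: how often each "text" embedding lands in the top-k nearest neighbors of the "image" embeddings.

```python
# Hypothetical sketch of hubness measurement (synthetic data, not CLIP).
# N_k (k-occurrence): for each text embedding, count how many images
# rank it among their top-k most similar texts. A "hub" is a text whose
# count far exceeds the uniform expectation n_images * k / n_texts.
import numpy as np

rng = np.random.default_rng(0)
dim, n_texts, n_images, k = 256, 100, 500, 5

# Random embeddings standing in for text/image encoder outputs.
texts = rng.normal(size=(n_texts, dim))
images = rng.normal(size=(n_images, dim))

# Shift both modalities toward a shared mean direction; this kind of
# anisotropy is one known driver of hubness in high-dimensional spaces.
shared_mean = rng.normal(size=dim)
texts += 2.0 * shared_mean
images += 2.0 * shared_mean

# Normalize so dot products are cosine similarities, as in CLIP-style scoring.
texts /= np.linalg.norm(texts, axis=1, keepdims=True)
images /= np.linalg.norm(images, axis=1, keepdims=True)

sims = images @ texts.T                              # (n_images, n_texts)
topk = np.argsort(-sims, axis=1)[:, :k]              # top-k texts per image
n_k = np.bincount(topk.ravel(), minlength=n_texts)   # k-occurrence per text

hub = int(n_k.argmax())
print(f"hub text index {hub}: in top-{k} of {n_k[hub]}/{n_images} images "
      f"(uniform expectation: {n_images * k / n_texts:.0f})")
```

The paper's contribution is the reverse direction: instead of measuring which existing text happens to be a hub, it searches for a text whose embedding *becomes* a hub, so that this single text scores as high as human references on many unrelated images.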

Source: arXiv:2604.27674