📄 Abstract - Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
Multimodal contrastive learning is increasingly enriched by going beyond image-text pairs. Among recent contrastive methods, Symile is a strong approach for this challenge because its multiplicative interaction objective captures higher-order cross-modal dependence. Yet, we find that Symile treats all modalities symmetrically and does not explicitly model reliability differences, a limitation that becomes especially pronounced in trimodal multiplicative interactions. In practice, modalities beyond image-text pairs can be misaligned, weakly informative, or missing, and treating them uniformly can silently degrade performance. This fragility can be hidden in the multiplicative interaction: Symile may outperform pairwise CLIP even while a single unreliable modality silently corrupts the product terms. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions and incorporating an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that exposes this fragility and three real-world trimodal datasets on which such failures could be masked by aggregate metrics, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned Symile and CLIP models. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning when more than two modalities are involved and some are imperfect.
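The two mechanisms the abstract names can be illustrated concretely. Below is a minimal NumPy sketch of (a) Symile's multiplicative interaction, i.e. a trilinear inner product over three embeddings, and (b) the gating idea of interpolating an unreliable modality's embedding toward a neutral direction. All function names, the choice of neutral vector, and the hand-set gate value are illustrative assumptions, not the paper's actual implementation, which uses learned encoders, learned gates, and a contrastive loss over batches.

```python
import numpy as np

def trilinear_score(x, y, z):
    # Symile-style multiplicative interaction: the trilinear inner
    # product, i.e. the sum over the elementwise product of three
    # embeddings. One bad factor corrupts every product term.
    return float(np.sum(x * y * z))

def gated_embedding(e, neutral, gate):
    # Hypothetical gating step: interpolate the embedding toward a
    # neutral direction. gate in [0, 1]; gate=0 suppresses the
    # modality entirely by collapsing it onto the neutral vector.
    return gate * e + (1.0 - gate) * neutral

# Toy example: two aligned modalities plus one unreliable one.
rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)   # aligned with x
z_noise = rng.standard_normal(d)       # unreliable third modality
neutral = np.ones(d)                   # neutral direction (assumption)

raw = trilinear_score(x, y, z_noise)
gated = trilinear_score(x, y, gated_embedding(z_noise, neutral, gate=0.0))
# With gate=0 and an all-ones neutral vector, the trilinear score
# reduces to the pairwise dot product of x and y, so the noisy
# modality can no longer corrupt the product terms.
```

With the all-ones neutral direction, fully gating out a modality recovers a pairwise CLIP-style similarity, which is one way to read the abstract's NULL option: when reliable cross-modal alignment is unlikely, the model falls back toward a lower-order interaction rather than multiplying in noise.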
Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
1️⃣ One-sentence summary
This paper finds that Symile, a state-of-the-art multimodal contrastive learning method, has a hidden fragility when handling more than two modalities (e.g., image, text, audio): because it treats all modalities uniformly, unreliable information in one modality can silently degrade the model's performance. To address this, the authors propose Gated Symile, an improved method with a gating mechanism that dynamically assesses and adjusts each modality's contribution, achieving more robust and accurate retrieval performance across several real-world datasets.