Abstract - Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking, together with a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive on the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
1️⃣ One-Sentence Summary
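This paper proposes a test-time adaptation method for vision-language models whose visual and textual branches shift asymmetrically at deployment: it adds a modality-reliability constraint so that conventional entropy minimization no longer increases error when an unreliable modality dominates fusion, and, with the backbone kept frozen and only a lightweight gate or adapter updated, it significantly improves classification accuracy under multimodal shift.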
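
The modality-dominance failure and the effect of a gate prior can be illustrated with a toy sketch. This is not the MG-MTTA implementation. It assumes the fused posterior is a convex mixture of the two branch posteriors controlled by a scalar gate, it replaces the reliability-aware prior with a hand-picked value `g_prior` and a quadratic penalty, and it uses a grid search in place of gradient updates to a gate module. All names here (`fuse`, `objective`, `best_gate`, `g_prior`, `lam`) are hypothetical.

```python
import numpy as np


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())


def fuse(p_vis, p_txt, g):
    """Assumed fusion rule: convex mixture, gate g in [0, 1] weights the visual branch."""
    return g * p_vis + (1.0 - g) * p_txt


def objective(g, p_vis, p_txt, g_prior, lam):
    """Fused-posterior entropy plus a quadratic pull of the gate toward g_prior."""
    return entropy(fuse(p_vis, p_txt, g)) + lam * (g - g_prior) ** 2


def best_gate(p_vis, p_txt, g_prior, lam, n_grid=101):
    """Grid search over the gate (a stand-in for training a small gate module)."""
    grid = np.linspace(0.0, 1.0, n_grid)
    scores = [objective(g, p_vis, p_txt, g_prior, lam) for g in grid]
    return float(grid[int(np.argmin(scores))])


# Toy sample whose true class is 0. The text branch is shifted:
# it is sharply confident but wrong, while the visual branch is correct.
p_vis = np.array([0.55, 0.30, 0.15])
p_txt = np.array([0.02, 0.96, 0.02])
g_prior = 0.9  # hypothetical reliability prior that trusts the visual branch

g_ent = best_gate(p_vis, p_txt, g_prior, lam=0.0)  # entropy minimization only
g_reg = best_gate(p_vis, p_txt, g_prior, lam=5.0)  # entropy plus gate prior

pred_ent = int(np.argmax(fuse(p_vis, p_txt, g_ent)))
pred_reg = int(np.argmax(fuse(p_vis, p_txt, g_reg)))
print("entropy only:", g_ent, pred_ent)
print("with gate prior:", g_reg, pred_reg)
```

In this toy case the entropy-only gate collapses onto the text branch, because that branch is the sharper of the two, so the fused posterior gets sharper while the prediction becomes wrong. With the prior term the gate stays near the visual branch and the fused prediction remains class 0. How the prior is actually built from anchor-based modality consistency and cross-modal conflict is not specified in the abstract and is not modeled here.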