Abstract - Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
1️⃣ One-Sentence Summary
This paper proposes CLIPGlasses, a plug-and-play framework that improves CLIP's understanding of negated visual descriptions, i.e., statements about what is *not* in an image (such as "no dog"). A "Lens" module disentangles the negated semantics from the text embedding, and a "Frame" module predicts a context-aware repulsion strength; together they let CLIP handle negation without retraining, while achieving stronger and more robust performance on cross-domain tasks.
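The core scoring idea (reward alignment with the affirmed content, penalize alignment with the disentangled negated concept) can be sketched as follows. This is a minimal illustrative sketch with toy embeddings, not the authors' implementation: the function name `negation_aware_score`, the hand-crafted 3-d vectors, and the fixed `alpha` are all hypothetical stand-ins for the Lens/Frame outputs described above.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def negation_aware_score(img_emb, affirm_emb, negated_emb, alpha):
    # Standard CLIP-style match minus a repulsion term: the more the image
    # aligns with the negated concept, the lower the final score.
    # alpha plays the role of the context-aware repulsion strength
    # that the Frame module would predict (here a fixed constant).
    return cosine(img_emb, affirm_emb) - alpha * cosine(img_emb, negated_emb)

# Toy 3-d embeddings: axis 0 = "dog", axis 1 = "cat" (hypothetical).
dog_image   = np.array([1.0, 0.0, 0.0])
cat_image   = np.array([0.0, 1.0, 0.0])
scene_text  = np.array([1.0, 1.0, 0.0])  # affirmed part of "a photo with no dog"
negated_dog = np.array([1.0, 0.0, 0.0])  # disentangled negated concept "dog"

score_dog = negation_aware_score(dog_image, scene_text, negated_dog, alpha=1.0)
score_cat = negation_aware_score(cat_image, scene_text, negated_dog, alpha=1.0)
```

With the repulsion term active, the cat image now scores higher than the dog image for the caption "a photo with no dog", which is exactly the false-positive match that vanilla CLIP gets wrong.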