Abstract - Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
1️⃣ One-Sentence Summary
This paper proposes CLIPGlasses, a plug-and-play framework that improves CLIP's understanding of negated visual descriptions, i.e., statements about what is *not* in an image (such as "no dog"). A "Lens" module disentangles the negated semantics from the text embedding, and a "Frame" module predicts a context-aware repulsion strength; together they let CLIP handle negation without retraining, while achieving stronger and more robust performance on cross-domain tasks.
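The core scoring idea (reward alignment with the affirmed content, penalize alignment with the disentangled negated concept) can be sketched as follows. This is a minimal illustrative sketch with toy embeddings, not the authors' implementation: the function name `negation_aware_score`, the hand-crafted 3-d vectors, and the fixed `alpha` are all hypothetical stand-ins for the Lens/Frame outputs described above.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def negation_aware_score(img_emb, affirm_emb, negated_emb, alpha):
    # Standard CLIP-style match minus a repulsion term: the more the image
    # aligns with the negated concept, the lower the final score.
    # alpha plays the role of the context-aware repulsion strength
    # that the Frame module would predict (here a fixed constant).
    return cosine(img_emb, affirm_emb) - alpha * cosine(img_emb, negated_emb)

# Toy 3-d embeddings: axis 0 = "dog", axis 1 = "cat" (hypothetical).
dog_image   = np.array([1.0, 0.0, 0.0])
cat_image   = np.array([0.0, 1.0, 0.0])
scene_text  = np.array([1.0, 1.0, 0.0])  # affirmed part of "a photo with no dog"
negated_dog = np.array([1.0, 0.0, 0.0])  # disentangled negated concept "dog"

score_dog = negation_aware_score(dog_image, scene_text, negated_dog, alpha=1.0)
score_cat = negation_aware_score(cat_image, scene_text, negated_dog, alpha=1.0)
```

With the repulsion term active, the cat image now scores higher than the dog image for the caption "a photo with no dog", which is exactly the false-positive match that vanilla CLIP gets wrong.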