arXiv submission date: 2026-04-13
📄 Abstract - CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

Human perception of visual similarity is inherently adaptive and subjective, depending on the user's interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process from visual feature extraction, allowing highly efficient, multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset, CLAY-EVAL, for comprehensive assessment under diverse conditioned-retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared with prior work.
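The abstract does not spell out how the text condition modulates visual similarity, but the core idea — keeping visual embeddings fixed and letting a text embedding reweight the comparison — can be sketched as follows. Everything here (the function names and the per-dimension weighting scheme) is an illustrative assumption, not the paper's actual mechanism:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length for cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def conditional_similarity(img_a, img_b, cond_text):
    """Hypothetical text-conditioned similarity.

    Weights each embedding dimension by the magnitude of the condition
    text embedding (a stand-in for 'relevance to the condition'), then
    compares the reweighted visual embeddings via cosine similarity.
    The visual embeddings img_a/img_b stay fixed; only the cheap
    reweighting depends on the text condition.
    """
    w = np.abs(cond_text)          # per-dimension relevance weights (assumed)
    a = l2_normalize(img_a * w)
    b = l2_normalize(img_b * w)
    return float(np.dot(a, b))

# Usage: same image pair, two different text conditions, two scores.
rng = np.random.default_rng(0)
img_a, img_b = rng.normal(size=512), rng.normal(size=512)
cond_color, cond_shape = rng.normal(size=512), rng.normal(size=512)
s_color = conditional_similarity(img_a, img_b, cond_color)
s_shape = conditional_similarity(img_a, img_b, cond_shape)
```

Because the visual embeddings are computed once and reused, switching or combining conditions only touches the lightweight reweighting step — which is the efficiency argument the abstract makes.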

Top-level tags: multi-modal, computer vision, model evaluation
Detailed tags: vision-language models, conditional similarity, image retrieval, embedding space, evaluation dataset

CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space


1️⃣ One-sentence summary

This paper proposes a new method called CLAY, which, without any additional training, leverages pretrained vision-language models so that an image retrieval system can flexibly and efficiently judge the similarity between images according to whatever point of interest the user describes in text (e.g., "color" or "shape").

Source: arXiv:2604.11539