Relational Visual Similarity
1️⃣ One-Sentence Summary
This paper proposes a new way to measure image similarity that attends to abstract relations among the elements inside an image (for example, the Earth and a peach share a similar layered structure) rather than surface visual features, and, by finetuning a vision-language model, provides the first metric for this kind of relational similarity.
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when the internal relations or functions among their visual elements correspond, even if their visual attributes differ. We then curate a dataset of 114k image-caption pairs in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a vision-language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.
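To make the gap concrete, here is a minimal sketch of the attribute-level baseline the abstract contrasts against: cosine similarity between CLIP image embeddings. The checkpoint name and image paths are illustrative assumptions, and the paper's own relational metric (a finetuned vision-language model) is not reproduced here -- this only shows the kind of surface-appearance score that misses Earth-vs-peach correspondences.

```python
# Sketch of the attribute-similarity baseline (CLIP cosine similarity).
# Checkpoint and file names are illustrative, not from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(path_a: str, path_b: str) -> float:
    """Attribute-level similarity: cosine similarity of CLIP image embeddings."""
    images = [Image.open(p).convert("RGB") for p in (path_a, path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    return float(feats[0] @ feats[1])

# A metric like this scores earth.jpg vs. peach.jpg low, since their colors
# and textures differ -- even though both depict a layered shell/flesh/core
# structure, which is exactly the relational similarity the paper targets.
print(clip_similarity("earth.jpg", "peach.jpg"))
```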
Source: arXiv:2512.07833