arXiv submission date: 2026-05-20
📄 Abstract - STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

Training-free zero-shot composed image retrieval (CIR) models have recently attracted growing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploiting the compositional ability of LLMs. Although these methods are efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image because of the semantic gap between the input image and the text modification, as the image contains far more detail than the text; and 2) the point-to-point alignment used during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in Collaboration (STiTch) framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we refine it through a transition vector in the embedding space to bring it closer to the target image. By combining LLMs with the user instruction, the refined caption concentrates on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignments during the retrieval stage, we model the caption and the image as discrete distributions and reformulate retrieval as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and to compute the retrieval score. Extensive experiments demonstrate that our method can be general, effective, and beneficial for many CIR tasks.
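The abstract does not say how the transition vector is constructed, so the following is only a minimal sketch of the general idea of shifting a caption embedding in embedding space. It assumes, purely for illustration, that the transition vector is the difference between two text embeddings (`source` for the content to move away from, `target` for the content to move toward) and that it is added with a hypothetical step size `alpha`. The function names and the toy 3-dimensional vectors are not from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def refine_with_transition(composed, source, target, alpha=0.5):
    """Shift the composed-caption embedding along (target - source).

    ASSUMPTION: building the transition vector as a difference of two
    text embeddings and scaling it by `alpha` is an illustrative choice,
    not the construction confirmed by the paper.
    """
    transition = [t - s for s, t in zip(source, target)]
    shifted = [c + alpha * d for c, d in zip(composed, transition)]
    return l2_normalize(shifted)


# Toy embeddings: axis 0 stands for leftover reference-image detail,
# axis 1 for the intended modification.
composed = [0.7, 0.7, 0.1]      # LLM caption that still carries reference detail
source = [1.0, 0.0, 0.0]        # content to move away from
target = [0.0, 1.0, 0.0]        # content to move toward
target_image = [0.1, 0.9, 0.1]  # embedding of the image we hope to retrieve

refined = refine_with_transition(composed, source, target)
print(cosine(composed, target_image), cosine(refined, target_image))
```

In this toy setup the refined embedding ends up more similar to the target-image embedding than the original composed caption, which is the effect the abstract attributes to the transition step.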

Top-level tag: multi-modal retrieval
Detailed tags: zero-shot composed image retrieval, semantic transition, LLM collaboration, set-to-set alignment

STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval


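1️⃣ One-Sentence Summary

This paper proposes a training-free method for zero-shot composed image retrieval that refines the text description with a semantic transition vector and uses a bidirectional transportation distance for fine-grained image-text alignment, effectively addressing the inaccurate text descriptions and coarse retrieval matching of existing methods.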
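To make the set-to-set view concrete, here is a rough sketch of one way a bidirectional transport distance between two sets of embeddings could be computed. It assumes uniform weights over set elements, a cosine cost, and a conditional-transport-style formulation in which each side distributes its mass over the other side with a softmax. The exact distance used by STiTch is not given in the text above, so the formulation, the `temperature` parameter, and all function names are assumptions for illustration.

```python
import math


def cosine_cost(u, v):
    """Cost between two nonzero vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_transport_distance(text_set, image_set, temperature=0.1):
    """Average of a text-to-image and an image-to-text transport cost.

    ASSUMPTION: softmax transport plans and uniform element weights are
    illustrative choices, not the paper's confirmed formulation.
    """
    n, m = len(text_set), len(image_set)
    cost = [[cosine_cost(t, p) for p in image_set] for t in text_set]

    # Forward: each text element spreads its mass (1/n) over image elements.
    forward = 0.0
    for i in range(n):
        plan = softmax([-cost[i][j] / temperature for j in range(m)])
        forward += sum(plan[j] * cost[i][j] for j in range(m)) / n

    # Backward: each image element spreads its mass (1/m) over text elements.
    backward = 0.0
    for j in range(m):
        plan = softmax([-cost[i][j] / temperature for i in range(n)])
        backward += sum(plan[i] * cost[i][j] for i in range(n)) / m

    return 0.5 * (forward + backward)


# Toy sets: two caption-token embeddings and two candidate images,
# each represented by a few patch embeddings.
caption = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "matching": [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
    "unrelated": [[-1.0, 0.0], [0.0, -1.0]],
}
# A lower distance means a better match, so rank candidates in ascending order.
ranking = sorted(candidates, key=lambda k: bidirectional_transport_distance(caption, candidates[k]))
print(ranking)
```

Unlike a single point-to-point similarity between one caption vector and one image vector, this score lets every caption element find its own best-matching image elements and vice versa, which is the kind of fine-grained alignment the summary describes.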

Source: arXiv: 2605.21261