arXiv submission date: 2026-07-02
📄 Abstract - FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval

Zero-shot composed image retrieval (ZS-CIR) aims to retrieve a target image by editing a reference image with a natural-language instruction, without relying on domain-specific annotated triplets. Most existing ZS-CIR methods rely on textual inversion to translate the reference image into pseudo-text tokens and then compose them with the instruction via simple concatenation in the text space, which can be lossy and brittle for fine-grained semantics. In this work, we propose a new paradigm, FlowCIR, that casts ZS-CIR as conditional semantic transport between reference and target embeddings. Leveraging conditional flow matching, our model learns a lightweight transport field that maps the instruction representation toward a target-aligned query embedding, conditioned on the reference image. Because FlowCIR operates on pre-extracted vision-language model (VLM) embeddings and trains only a small transport module without updating the image or text encoder, it is training-efficient, requiring roughly 10× fewer training resources than prior textual-inversion-based approaches. We further identify negation and removal as a major failure mode of VLM-based composition. To address this, we propose an inference-only Multi-Negative Steering strategy that steers a negation-containing relative instruction away from its negated semantics, mitigating the limited negation handling of VLMs and improving robustness on negation-heavy queries. Extensive experiments on standard CIR benchmarks demonstrate that FlowCIR achieves competitive performance compared with recent ZS-CIR methods.
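
To make the transport idea concrete, here is a minimal NumPy sketch of a conditional flow matching objective and an Euler-step transport loop. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the probability path, the loss weighting, or the network, so the sketch assumes a straight-line path between the instruction embedding and the target embedding, and every name (`interpolate`, `flow_matching_loss`, `transport`, `velocity_fn`) is hypothetical. The demo uses an oracle velocity field in place of a trained transport module.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Return the point at time t on the straight path x0 -> x1, and its velocity.

    Assumption: a linear path, which is one common flow matching choice.
    """
    t = t[:, None]                    # broadcast time over the embedding dimension
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0                # velocity is constant along a straight path
    return x_t, v_target


def flow_matching_loss(velocity_fn, x0, x1, cond, rng):
    """Mean squared error between predicted and target velocities at random times.

    x0: instruction embeddings, x1: target embeddings, cond: reference-image
    embeddings (this role assignment follows the abstract's description).
    """
    t = rng.uniform(size=x0.shape[0])
    x_t, v_target = interpolate(x0, x1, t)
    v_pred = velocity_fn(x_t, t, cond)
    return float(np.mean(np.sum((v_pred - v_target) ** 2, axis=1)))


def transport(velocity_fn, x0, cond, steps=10):
    """Integrate the velocity field from t=0 to t=1 with fixed-size Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full(x.shape[0], k * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x


# Toy demo with random stand-ins for pre-extracted embeddings.
rng = np.random.default_rng(0)
instruction = rng.normal(size=(4, 8))
reference = rng.normal(size=(4, 8))
target = rng.normal(size=(4, 8))


def oracle_velocity(x, t, cond):
    """Stand-in for a trained module: returns the exact straight-path velocity."""
    return target - instruction


loss = flow_matching_loss(oracle_velocity, instruction, target, reference, rng)
query = transport(oracle_velocity, instruction, reference)
print(loss, np.allclose(query, target))
```

In a real setup, `velocity_fn` would be a small trainable network that takes the reference embedding as its condition, and the transported `query` would be compared against gallery image embeddings, for example by cosine similarity. Both of those steps are assumptions here.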

Top-level tags: computer vision, machine learning, multi-modal
Detailed tags: image retrieval, zero-shot learning, flow matching, compositional reasoning, text-based editing

FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval


1️⃣ One-sentence summary

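This paper proposes FlowCIR, a new method that uses conditional flow matching to turn a reference image and an instruction into a target-aligned embedding for zero-shot composed image retrieval; it needs roughly ten times fewer training resources than prior textual-inversion-based methods and adds a Multi-Negative Steering strategy to address the weakness of vision-language models in handling negation instructions.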

Source: arXiv: 2607.02284