arXiv submission date: 2026-01-20
📄 Abstract - XR: Cross-Modal Agents for Composed Image Retrieval

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: this https URL.

Top-level tags: multi-modal agents, model evaluation
Detailed tags: composed image retrieval, cross-modal reasoning, multi-agent framework, training-free, benchmark

XR: Cross-Modal Agents for Composed Image Retrieval


1️⃣ One-Sentence Summary

This paper proposes XR, a training-free multi-agent framework in which different types of agents work together, respectively imagining the target image, performing coarse matching, and fact-checking candidates, so that the target image can be retrieved more accurately from a reference image plus a textual modification, substantially improving performance on the composed image retrieval task.
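The three-stage agent coordination described above (imagine → coarse-filter → verify) can be sketched as a simple pipeline. This is a minimal illustrative sketch: all function names, data structures, and the placeholder scoring/verification logic are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of XR's three-agent pipeline; every name and the
# toy scoring logic below are illustrative, not the paper's actual code.
from dataclasses import dataclass


@dataclass
class Query:
    reference_image: str    # ID/description of the reference image
    modification_text: str  # textual edit, e.g. "make it red"


def imagination_agent(query: Query) -> str:
    """Synthesize a representation of the target by fusing the
    reference image with the modification text (stubbed as text)."""
    return f"{query.reference_image} {query.modification_text}"


def similarity_agent(imagined: str, gallery: list[str], top_k: int = 5) -> list[str]:
    """Coarse filtering: rank gallery items by hybrid similarity to the
    imagined target (stubbed here as word-overlap scoring)."""
    words = imagined.lower().split()
    scored = sorted(gallery, key=lambda item: -sum(w in item for w in words))
    return scored[:top_k]


def question_agent(query: Query, candidates: list[str]) -> list[str]:
    """Fine filtering: verify factual consistency of each candidate with
    the query via targeted questions (stubbed as a keep-all check)."""
    return [c for c in candidates if c]


def xr_retrieve(query: Query, gallery: list[str]) -> list[str]:
    """Progressive coordination: imagine, coarse-filter, then verify."""
    imagined = imagination_agent(query)
    coarse = similarity_agent(imagined, gallery)
    return question_agent(query, coarse)


gallery = ["red dress", "blue dress", "red shirt", "green hat"]
results = xr_retrieve(Query("blue dress", "make it red"), gallery)
print(results)
```

In the real system each agent would be backed by generative and vision-language models; the point of the sketch is only the control flow of progressive refinement across the three agent types.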

Source: arXiv 2601.14245