arXiv submission date: 2026-06-08
📄 Abstract - Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult because these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple-fuse-recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression even though all audio-specific modules are copied unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen) followed by balanced multi-modal rehearsal. The resulting model supports these retrieval pathways in one backbone, scoring 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
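
The abstract says the specialists' task vectors are fused into one backbone but does not state the merge rule. As a minimal sketch, the example below assumes plain scaled addition of task vectors (specialist weights minus a shared base checkpoint). The names `task_vector`, `fuse`, and `scales`, and all numeric values, are hypothetical and only illustrate the idea.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def task_vector(specialist: Weights, base: Weights) -> Weights:
    """Difference between a fine-tuned specialist and the shared base checkpoint."""
    return {
        name: [s - b for s, b in zip(specialist[name], base[name])]
        for name in base
    }


def fuse(base: Weights, specialists: List[Weights], scales: List[float]) -> Weights:
    """Add each specialist's scaled task vector onto a copy of the base weights."""
    fused = {name: list(values) for name, values in base.items()}
    for specialist, scale in zip(specialists, scales):
        delta = task_vector(specialist, base)
        for name, values in delta.items():
            fused[name] = [f + scale * d for f, d in zip(fused[name], values)]
    return fused


# Toy backbone with two parameters; the values are made up for illustration.
base = {"layer.w": [1.0, 2.0], "layer.b": [0.0]}
image_specialist = {"layer.w": [1.5, 2.0], "layer.b": [0.25]}
video_specialist = {"layer.w": [1.0, 3.0], "layer.b": [-0.25]}

fused = fuse(base, [image_specialist, video_specialist], scales=[1.0, 1.0])
print(fused)
```

In this toy case the two specialists change different entries of `layer.w`, so both updates survive in the fused weights, while their opposite changes to `layer.b` cancel. Real merges would need to handle such interference between specialists, which this sketch does not attempt.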

Top-level tags: multi-modal, model training, retrieval
Detailed tags: omni-modal retrieval, embedding fusion, projector drift, decoupled training, audio retrieval

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding


1️⃣ One-sentence summary

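This paper proposes Conan-embedding-v3, a framework that first independently trains specialist models for different data types (such as text, image, video, and audio), then fuses their capabilities into a single unified model, and specifically addresses "projector drift", the drop in audio performance that appears after fusion. The result is a unified retrieval system that supports text, image, video, document, and audio inputs.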
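
To make the projector drift problem and its repair concrete, here is a deliberately tiny sketch. It assumes a scalar "projector" feeding a scalar "backbone" and a squared-error objective. The paper's actual architecture and training loss are not described here, so every name and number below is hypothetical. The only property taken from the abstract is that recovery updates the projector while the backbone stays frozen.

```python
def embed(x: float, projector_w: float, backbone_a: float) -> float:
    """Toy pipeline: audio feature -> projector -> backbone -> embedding."""
    return backbone_a * (projector_w * x)


def recover_projector(projector_w, backbone_a, data, lr=0.1, steps=200):
    """Gradient descent on the projector only; backbone_a is never updated."""
    for _ in range(steps):
        grad = 0.0
        for x, target in data:
            error = embed(x, projector_w, backbone_a) - target
            grad += 2.0 * error * backbone_a * x  # d(error^2)/d(projector_w)
        projector_w -= lr * grad / len(data)
    return projector_w


# Hypothetical numbers: the projector was calibrated against the audio specialist.
specialist_a, fused_a = 2.0, 1.25
projector_w = 0.5  # specialist_a * projector_w == 1.0, so embeddings match targets
data = [(x, x) for x in (0.5, 1.0, 1.5)]  # target embedding equals the input

# Swapping in the fused backbone while keeping the old projector causes drift.
drift = max(abs(embed(x, projector_w, fused_a) - t) for x, t in data)

# Re-fit the projector against the frozen fused backbone.
recovered_w = recover_projector(projector_w, fused_a, data)
residual = max(abs(embed(x, recovered_w, fused_a) - t) for x, t in data)
print(f"error after fusion: {drift:.4f}, after recovery: {residual:.2e}")
```

The projector itself is untouched by fusion, yet its output no longer suits the backbone it feeds, which is why copying the audio modules unchanged is not enough in this toy setting.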

Source: arXiv: 2606.09331