e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings
1️⃣ One-sentence summary
This paper proposes e5-omni, a lightweight method that calibrates similarity scales, controls the hardness of training samples, and unifies the statistical properties of the shared embedding space. It addresses the inaccurate cross-modal comparisons and low training efficiency of existing omni-modal embedding models, and markedly improves the robustness and quality of matching text, images, audio, video, and other heterogeneous data within a single space.
Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at this https URL.
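The abstract only names the three components, so the minimal PyTorch sketch below is an interpretation rather than the authors' released implementation: it illustrates a per-modality learnable temperature applied to in-batch InfoNCE logits (component 1) and a simple ZCA-style batch whitening with identity shrinkage standing in for covariance regularization (component 3). The negative curriculum with debiasing (component 2) is omitted, and all names such as `batch_whiten` and `log_temps` are assumptions made for illustration.

```python
# Illustrative sketch only (not the paper's released code): per-modality
# temperatures for an InfoNCE loss plus batch whitening of embeddings.
import torch
import torch.nn.functional as F

MODALITIES = ["text", "image", "audio", "video"]

# One learnable log-temperature per modality (assumed parameterization).
log_temps = torch.nn.Parameter(torch.zeros(len(MODALITIES)))

def batch_whiten(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Center and decorrelate a batch of embeddings (ZCA-style whitening),
    with a small identity shrinkage as a stand-in for covariance regularization."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / max(x.shape[0] - 1, 1)
    cov = cov + eps * torch.eye(cov.shape[0], device=x.device)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    whitener = eigvecs @ torch.diag(eigvals.clamp_min(eps).rsqrt()) @ eigvecs.T
    return x @ whitener

def info_nce(query: torch.Tensor, cand: torch.Tensor,
             cand_modality: torch.Tensor) -> torch.Tensor:
    """In-batch InfoNCE with a per-candidate, modality-dependent temperature."""
    q = F.normalize(batch_whiten(query), dim=-1)
    c = F.normalize(batch_whiten(cand), dim=-1)
    logits = q @ c.T                           # (B, B) cosine similarities
    temps = log_temps.exp()[cand_modality]     # (B,) one temperature per candidate
    logits = logits / temps.unsqueeze(0)       # align similarity scales across modalities
    targets = torch.arange(q.shape[0])
    return F.cross_entropy(logits, targets)

# Toy usage: random features stand in for encoder outputs of mixed modalities.
B, D = 8, 32
loss = info_nce(torch.randn(B, D), torch.randn(B, D),
                torch.randint(0, len(MODALITIES), (B,)))
print(loss.item())
```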
Source: arXiv: 2601.03666