arXiv submission date: 2026-06-17
📄 Abstract - SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has emerged as a critical technology for global information access. MLIR enables users to retrieve semantically relevant documents from multilingual text collections using a single-language query. However, recent multilingual dense retrieval models often exhibit a strong preference for documents in the same language as the query. This leads to severe language bias, where top-ranked results are dominated by documents of specific languages, even when documents in other languages contain more semantically relevant information. To address this issue, we propose SHIFT, a training-free method applicable in the indexing stage. Specifically, SHIFT utilizes parallel translation pairs to estimate a relative language vector for each target language with respect to a source language. Subsequently, SHIFT corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and diverse dense retrieval models confirms that SHIFT can effectively mitigate language bias and enhance MLIR performance.
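The two steps described in the abstract (estimating a relative language vector from parallel pairs, then subtracting it from document embeddings at indexing time) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the exact estimator, so the mean of pairwise embedding differences is an assumption, and the function names `estimate_language_vector` and `shift_documents` are hypothetical. Embeddings are assumed to be precomputed NumPy arrays from any dense retrieval model.

```python
import numpy as np


def estimate_language_vector(src_emb, tgt_emb):
    """Estimate a relative language vector from parallel translation pairs.

    Assumption: the vector is the mean of (target - source) differences,
    where row i of each array embeds the same sentence in the two languages.
    """
    src_emb = np.asarray(src_emb, dtype=np.float64)
    tgt_emb = np.asarray(tgt_emb, dtype=np.float64)
    if src_emb.shape != tgt_emb.shape:
        raise ValueError("parallel embeddings must have identical shapes")
    return (tgt_emb - src_emb).mean(axis=0)


def shift_documents(doc_emb, doc_langs, language_vectors):
    """Subtract each document's language vector before indexing.

    Documents whose language has no entry in `language_vectors`
    (for example, the source language itself) are left unchanged.
    """
    shifted = np.asarray(doc_emb, dtype=np.float64).copy()
    for i, lang in enumerate(doc_langs):
        vec = language_vectors.get(lang)
        if vec is not None:
            shifted[i] -= vec
    return shifted


# Synthetic demonstration: "de" embeddings are "en" embeddings plus a fixed offset.
rng = np.random.default_rng(0)
en_parallel = rng.normal(size=(50, 8))
de_parallel = en_parallel + 0.5  # hypothetical language-specific offset

vectors = {"de": estimate_language_vector(en_parallel, de_parallel)}
docs = np.vstack([en_parallel[0], de_parallel[1]])
index_ready = shift_documents(docs, ["en", "de"], vectors)
```

Because the correction is applied once per document at indexing time, queries are embedded as usual and retrieval cost is unchanged. Whether embeddings should be re-normalized after the subtraction is not specified in the abstract and is left out of this sketch.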

Top-level tags: natural language processing, machine learning
Detailed tags: multilingual information retrieval, language bias, dense retrieval, feature transformation, indexing

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval


1️⃣ One-sentence summary

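This paper proposes SHIFT, a training-free method applied at the indexing stage: by estimating each language's offset vector relative to a source language and subtracting it from document embeddings, it effectively corrects the language bias in multilingual information retrieval, where models prefer documents in the same language as the query, and thereby improves the semantic relevance of cross-lingual retrieval.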

Source: arXiv: 2606.18801