arXiv submission date: 2026-04-22
📄 Abstract - All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG

Multilingual Retrieval-Augmented Generation (mRAG) leverages cross-lingual evidence to ground Large Language Models (LLMs) in global knowledge. However, we show that current mRAG systems suffer from a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress such "answer-critical" documents, thereby limiting downstream generation performance. To bridge this gap, we propose _**L**anguage-**A**gnostic **U**tility-driven **R**eranker **A**lignment (LAURA)_, which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.

Top-level tags: llm natural language processing multi-modal
Detailed tags: multilingual rag language bias reranking bias mitigation cross-lingual retrieval

All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG


1️⃣ One-Sentence Summary

This paper reveals that the reranking stage of multilingual retrieval-augmented generation (mRAG) systems carries a systematic bias toward English and the query's own language, which suppresses useful cross-lingual evidence. It proposes a new method, LAURA, that aligns the reranker directly with downstream generation quality, effectively mitigating this language bias and significantly improving multilingual question-answering accuracy.
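LAURA's core idea, as summarized above, is to rank evidence by how much it helps the generator rather than by surface similarity or language match. A minimal sketch of that utility-driven reranking step is below; `utility_fn` and the per-document scores are hypothetical stand-ins for a generator-derived signal (e.g. the generator's likelihood of the gold answer given each document), not the paper's actual implementation:

```python
# Illustrative sketch: language-agnostic, utility-driven reranking.
# A similarity-biased reranker might favor the English or query-language
# documents; scoring by downstream utility keeps the answer-critical one.

def utility_rerank(docs, utility_fn, top_k=2):
    """Order candidate documents by estimated downstream utility, descending."""
    scored = [(utility_fn(d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]

# Toy multilingual candidates with hypothetical utility scores.
docs = [
    {"lang": "en", "text": "Background only.",     "utility": 0.10},
    {"lang": "zh", "text": "Restates the query.",  "utility": 0.15},
    {"lang": "de", "text": "Contains the answer.", "utility": 0.90},
]

top = utility_rerank(docs, utility_fn=lambda d: d["utility"], top_k=1)
print(top[0]["lang"])  # -> de: the answer-critical document survives reranking
```

The design point is that the ranking signal is computed per document from generation utility, so a document's language never enters the score directly.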

Source: arXiv 2604.20199