arXiv submission date: 2026-03-16
📄 Abstract - Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA

Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. Local language varieties in particular show large asymmetries, e.g., information in a local Wikipedia edition that is absent from the standard variety's edition. However, little is known about how well LLMs perform under such information asymmetry, especially for closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on local Wikipedia pages but absent from their higher-resource counterparts, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information found only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical and geographic annotations and stratified evaluations reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about the inclusivity and cultural coverage of LLMs.

Top-level tags: llm, natural language processing, data
Detailed tags: information asymmetry, low-resource languages, question answering, wikipedia, cultural coverage

Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA


1️⃣ One-Sentence Summary

This study finds that when knowledge exists only in a local-language edition of Wikipedia (e.g., Cantonese or Bavarian), large language models often fail to answer questions about it, revealing significant gaps in the cultural inclusivity and information coverage of current AI models.

Source: arXiv 2603.14782