Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs
1️⃣ One-sentence summary
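This study finds that large language models show systematic bias when recognizing regional lexical differences in Spanish: vocabulary from regions such as Spain and Mexico is recognized more accurately, vocabulary from regions such as Chile is recognized poorly, and this bias is not determined simply by the volume of online data.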
This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we draw on a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River is recognized more accurately by the models, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.
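The two survey-style formats described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the toy lexicon is hand-written rather than taken from the paper's database, the prompt wording is assumed, and the function names (`yes_no_items`, `multiple_choice_item`, `accuracy_by_country`) are hypothetical. Model answers are passed in as plain strings, so no particular LLM API is assumed.

```python
from collections import defaultdict

# Toy entries written for illustration only (NOT from the paper's database).
# Structure: concept -> {regional word: set of countries where it is used}
TOY_LEXICON = {
    "avocado": {"aguacate": {"Mexico", "Spain"}, "palta": {"Chile", "Argentina"}},
    "bus": {
        "camión": {"Mexico"},
        "micro": {"Chile"},
        "colectivo": {"Argentina"},
        "autobús": {"Spain"},
    },
}
COUNTRIES = ["Argentina", "Chile", "Mexico", "Spain"]


def yes_no_items(lexicon, countries):
    """Build one Yes-No question per (word, country) pair, with its gold answer."""
    items = []
    for concept, variants in lexicon.items():
        for word, used_in in variants.items():
            for country in countries:
                prompt = (
                    f'Is the word "{word}" used in {country} '
                    f'to mean "{concept}"? Answer Yes or No.'
                )
                gold = "Yes" if country in used_in else "No"
                items.append({"prompt": prompt, "country": country, "gold": gold})
    return items


def multiple_choice_item(concept, variants, country):
    """Build one multiple-choice question; gold is the list of correct option letters."""
    letters = "ABCDEFGH"
    options = sorted(variants)  # fixed order so the item is reproducible
    lines = [f"{letters[i]}. {word}" for i, word in enumerate(options)]
    gold = [letters[i] for i, word in enumerate(options) if country in variants[word]]
    prompt = f'Which word is used in {country} for "{concept}"?\n' + "\n".join(lines)
    return {"prompt": prompt, "country": country, "gold": gold}


def accuracy_by_country(items, answers):
    """Share of answers matching the gold label, grouped by country."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item, answer in zip(items, answers):
        totals[item["country"]] += 1
        hits[item["country"]] += int(answer == item["gold"])
    return {country: hits[country] / totals[country] for country in totals}


items = yes_no_items(TOY_LEXICON, COUNTRIES)
# Stand-in for model output: a trivial baseline that always answers "Yes".
scores = accuracy_by_country(items, ["Yes"] * len(items))
mc_item = multiple_choice_item("avocado", TOY_LEXICON["avocado"], "Chile")
```

Grouping accuracy by country, as above, is what would allow per-variety comparisons of the kind the abstract reports; aggregating countries into dialectal areas would be a further grouping step on the same records.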
Source: arXiv: 2602.09346