arXiv submission date: 2026-01-25
📄 Abstract - CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
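For context, a minimal sketch (not from the paper) of how a LID evaluation of this kind typically works: load an off-the-shelf language identification model, predict a language code for each benchmark sentence, and compare against the human-annotated label. The file names (`lid.176.bin`, `commonlid.tsv`) and the TSV layout are illustrative assumptions, not the paper's release format; fastText's lid.176 model stands in here for any of the eight evaluated LID systems.

```python
# Hypothetical accuracy evaluation of a LID model on a labeled benchmark.
# Assumes a TSV file with "text<TAB>language_code" rows and a local copy
# of fastText's lid.176.bin language-identification model.
import fasttext


def evaluate_lid(model_path: str, benchmark_path: str) -> float:
    model = fasttext.load_model(model_path)
    correct, total = 0, 0
    with open(benchmark_path, encoding="utf-8") as f:
        for line in f:
            text, gold = line.rstrip("\n").split("\t")
            # fastText prefixes predicted codes with "__label__".
            (label,), _ = model.predict(text.replace("\n", " "))
            pred = label.removeprefix("__label__")
            correct += int(pred == gold)
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    acc = evaluate_lid("lid.176.bin", "commonlid.tsv")
    print(f"accuracy: {acc:.3f}")
```

In practice one would report per-language scores (e.g., macro-averaged F1) rather than overall accuracy, since the paper's point is that aggregate numbers hide poor performance on under-served languages.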

Top-level tags: natural language processing, data, benchmark
Detailed tags: language identification, multilingual corpora, web data, evaluation, open dataset

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data


1️⃣ One-Sentence Summary

This paper introduces CommonLID, a community-driven, human-annotated benchmark covering 109 languages for evaluating language identification models on web data, and finds that existing evaluations generally overestimate model accuracy in real-world web settings.

Source: arXiv: 2601.18026