IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages
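1️⃣ One-sentence summary
This paper introduces IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions for systematically evaluating large language models on 11 low-resource Indic languages; even the top-performing models average below 50% accuracy on these languages, which exposes the limits of cross-lingual transfer.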
While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, and Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, and Konkani as extremely low-resource) plus a Sanskrit-English code-mixed set. We evaluate 19 LLMs, both proprietary and open-weight, and find that even the top-performing GPT-5 reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1%) and Claude-4.5 (42.7%). We additionally label each question as knowledge-oriented or purely linguistic to distinguish factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. IndicParam provides insights into the limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at this https URL. Scripts to run the benchmark are available at this https URL.
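To make the reported metrics concrete, here is a minimal sketch of how per-language and per-category accuracy could be computed for a multiple-choice benchmark of this kind. It is not the authors' evaluation script: the record fields (`language`, `category`, `gold`, `pred`), the helper names, and the toy data are all hypothetical, and the use of an unweighted (macro) average across languages is an assumption, since the abstract does not say how the average accuracy is aggregated.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for records grouped by the field `key`.

    Each record is a dict with hypothetical fields 'gold' (correct option),
    'pred' (model's chosen option), and the grouping field named by `key`.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["pred"] == r["gold"])
    return {group: correct[group] / total[group] for group in total}


def macro_average(per_group):
    """Unweighted mean over groups (assumed aggregation, not confirmed by the paper)."""
    return sum(per_group.values()) / len(per_group)


# Toy, made-up records purely for illustration; not drawn from IndicParam.
records = [
    {"language": "Nepali", "category": "knowledge", "gold": "A", "pred": "A"},
    {"language": "Nepali", "category": "linguistic", "gold": "B", "pred": "C"},
    {"language": "Bodo", "category": "knowledge", "gold": "D", "pred": "D"},
    {"language": "Bodo", "category": "linguistic", "gold": "A", "pred": "A"},
]

by_language = accuracy_by_group(records, "language")
by_category = accuracy_by_group(records, "category")
average_accuracy = macro_average(by_language)
```

Grouping by `category` instead of `language` gives the knowledge-oriented versus purely linguistic split described in the abstract. A question-weighted (micro) average would differ from the macro average whenever languages have unequal numbers of questions.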