arXiv submission date: 2026-03-24
📄 Abstract - IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track comprises multiple question types to examine LLMs' capabilities in handling different aspects of Islamic knowledge. The benchmark powers the public IslamicMMLU leaderboard; in an initial evaluation of 26 LLMs, average accuracy across the three tracks ranged from 39.8% to 93.8% (achieved by Gemini 3 Flash). The Quran track shows the widest spread (32.4% to 99.3%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task that reveals varying school-of-thought preferences across models. Arabic-specific models show mixed results, but all underperform frontier models. The evaluation code and leaderboard are publicly available.
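The leaderboard's "averaged accuracy across the three tracks" can be sketched as a simple macro-average of per-track multiple-choice accuracy. This is a minimal illustration only: the function names, data format, and the unweighted averaging scheme are assumptions, not taken from the paper's released evaluation code.

```python
# Hypothetical sketch of per-track and averaged accuracy for a
# multiple-choice benchmark; names and data are illustrative.

def track_accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def average_accuracy(per_track):
    """Unweighted (macro) mean of per-track accuracies -- an assumed
    averaging scheme for the leaderboard's overall score."""
    return sum(per_track.values()) / len(per_track)

# Made-up model outputs and answer keys, four questions per track:
tracks = {
    "Quran":  (["A", "B", "C", "D"], ["A", "B", "C", "A"]),
    "Hadith": (["A", "A", "B", "B"], ["A", "B", "B", "B"]),
    "Fiqh":   (["C", "C", "C", "C"], ["C", "C", "D", "C"]),
}
per_track = {name: track_accuracy(p, a) for name, (p, a) in tracks.items()}
overall = average_accuracy(per_track)
```

With this toy data, each track scores 0.75, so the macro-average is also 0.75; in the real benchmark, tracks differ in size (2,013 vs 4,000 questions), so a macro-average and a pooled micro-average would generally disagree.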

Top tags: llm benchmark, natural language processing
Detailed tags: islamic knowledge, evaluation benchmark, multiple-choice questions, madhab bias, multilingual evaluation

IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge


1️⃣ One-sentence summary

This paper introduces IslamicMMLU, a comprehensive benchmark of over ten thousand multiple-choice questions for evaluating large language models' knowledge of core Islamic disciplines: the Quran, Hadith, and Islamic jurisprudence (Fiqh). It finds that model performance varies widely, and the jurisprudence track can additionally detect a model's preference for particular schools of thought.

Source: arXiv 2603.23750