MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark
1️⃣ One-sentence summary
This paper introduces MedAraBench, a large-scale, high-quality dataset of Arabic medical multiple-choice questions, and uses it to evaluate several state-of-the-art large language models, with the goal of advancing Arabic medical AI research and improving the multilingual clinical capabilities of these models.
Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.
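The benchmarking described above boils down to scoring each model's selected choice against the answer key for every multiple-choice question. A minimal sketch of that scoring step is shown below; the record layout (choice letters, answer key) and function names are illustrative assumptions, not the paper's actual evaluation scripts.

```python
# Hypothetical sketch of multiple-choice benchmark scoring, as one might
# use for a dataset like MedAraBench. The field names and choice-letter
# convention are assumptions for illustration only.

def accuracy(predictions, gold_answers):
    """Fraction of questions where the predicted choice matches the key."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align")
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# Example: model-predicted choice letters vs. the answer key.
preds = ["A", "C", "B", "D"]
gold  = ["A", "B", "B", "D"]
print(accuracy(preds, gold))  # 0.75
```

In practice, a per-specialty or per-difficulty breakdown would aggregate the same per-question comparisons grouped by those metadata fields.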
Source: arXiv:2602.01714