Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks
1️⃣ One-Sentence Summary
Through comparative experiments, this paper finds that large language models perform significantly worse on Arabic medical question answering than on English, and that the gap widens as task complexity increases. The analysis attributes this mainly to the models' fragmented tokenization of Arabic text and to the fact that model-reported confidence shows little correlation with answer correctness.
In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted performance discrepancies in low-resource languages across various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
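To make the tokenization point concrete, below is a minimal sketch (not the paper's methodology) of how one might compare tokenization "fertility", i.e. tokens per whitespace-delimited word, for parallel English and Arabic text using the open-source tiktoken BPE tokenizer; the example sentences and the choice of encoding are illustrative assumptions, not the paper's evaluation data.

```python
# Sketch: estimating tokenization fertility (tokens per word) for
# parallel English/Arabic text. A markedly higher Arabic ratio would
# reflect the kind of structural fragmentation the abstract describes.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # illustrative tokenizer choice

def fertility(text: str) -> float:
    """Average number of BPE tokens per whitespace-separated word."""
    words = text.split()
    return len(enc.encode(text)) / max(len(words), 1)

# Hypothetical parallel medical questions (stand-ins, not paper data).
english = "What is the first-line treatment for type 2 diabetes mellitus?"
arabic = "ما هو العلاج الأولي لمرض السكري من النوع الثاني؟"

print(f"English fertility: {fertility(english):.2f} tokens/word")
print(f"Arabic fertility:  {fertility(arabic):.2f} tokens/word")
```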
Source: arXiv: 2602.05374