arXiv submission date: 2026-03-04
📄 Abstract - Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.

Top-level tags: llm medical model evaluation
Detailed tags: evaluation medical qa llm-as-a-judge semantic equivalence low-resource adaptation

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical Open-Ended QA


1️⃣ One-sentence summary

This study evaluates the feasibility of using large language models to automatically judge answers in French medical open-ended question answering. It finds that judgments are strongly influenced by the model that generated the answer, but that with targeted lightweight training, even a small model can deliver efficient and reliable automatic evaluation in the low-resource medical domain.

Source: arXiv 2603.04033