arXiv submission date: 2026-04-20
📄 Abstract - MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: this https URL
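The abstract names MedProbe-Eval's two components (holistic rubric scoring and atomic-claim evidence verification) without showing how they are computed. The sketch below is purely illustrative and assumes hypothetical names (RubricCriterion, AtomicClaim, rubric_score, evidence_precision) that do not come from the paper; it only shows one plausible way a weighted rubric score and an evidence-precision metric could fit together.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g. "recommendation states strength of evidence"
    weight: float      # relative importance within the task's rubric

@dataclass
class AtomicClaim:
    text: str          # a single verifiable statement from the guideline
    supported: bool    # whether the cited evidence actually backs the claim

def rubric_score(criteria, passed):
    """Weighted fraction of rubric criteria satisfied (holistic quality)."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c, ok in zip(criteria, passed) if ok)
    return earned / total if total else 0.0

def evidence_precision(claims):
    """Fraction of atomic claims grounded in supporting evidence."""
    return sum(c.supported for c in claims) / len(claims) if claims else 0.0

# Toy usage (made-up data, not from the benchmark):
criteria = [RubricCriterion("covers contraindications", 2.0),
            RubricCriterion("cites primary studies", 1.0)]
claims = [AtomicClaim("Drug X is first-line for condition Y", True),
          AtomicClaim("Trial Z enrolled 10,000 patients", False)]
print(rubric_score(criteria, [True, False]))   # ~0.667
print(evidence_precision(claims))              # 0.5
```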

Top tags: llm medical benchmark
Detailed tags: deep research, evidence integration, clinical guidelines, evaluation framework, domain expert

MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline


1️⃣ One-sentence Summary

This paper introduces MedProbeBench, the first benchmark dedicated to evaluating large language models' ability to perform multi-step evidence integration and generate expert-level clinical guidelines in the medical domain. Using 1,200+ rubric criteria and 5,100+ fine-grained fact-verification points, it systematically reveals the significant gap between today's strongest AI models and genuine expert-level performance.

Source: arXiv:2604.18418