arXiv submission date: 2026-02-17
📄 Abstract - Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination

Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies. Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval. Comparator LLMs had real-time web access to guidelines and primary literature. Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy less than 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Conclusions: Mirror provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
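The abstract reports a 95% CI of 80.4-92.3% for 105/120 correct but does not name the interval method. As a minimal sketch, assuming a Wilson score interval with z = 1.96 (an assumption, not something the abstract states), the following reproduces the reported bounds. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not say which method was used;
    Wilson with z = 1.96 happens to match the reported 80.4-92.3%.
    """
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Mirror's reported result: 105 correct out of 120 questions.
low, high = wilson_interval(105, 120)
print(f"accuracy = {105 / 120:.1%}, 95% CI = {low:.1%} to {high:.1%}")
```

A plain normal-approximation (Wald) interval would be symmetric around 87.5%, whereas the reported interval is asymmetric, which is consistent with a score-type interval such as Wilson.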

Top-level tags: medical llm, model evaluation
Detailed tags: clinical reasoning, evidence grounding, medical examination, specialized knowledge, benchmark evaluation

Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination


1️⃣ One-Sentence Summary

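This paper introduces Mirror, a clinical intelligence system that integrates a rigorously curated endocrinology evidence corpus and, without any web retrieval, scored markedly higher on a board-style subspecialty examination than both frontier general-purpose large language models and the human reference, while producing traceable citations for its answers, which offers a more reliable and auditable approach to clinical decision support.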

Source: arXiv: 2602.16050