When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

📄 Abstract - When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify the best-studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care that guidelines do not cover. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations primarily test models on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, depending on memorization in practice is neither feasible nor reliable. To address this gap, we introduce OGCaReBench, a free-form, retrieval-focused benchmark for evaluating how well LLMs answer clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. Our experiments reveal that even the best-performing baseline (GPT-5.2) correctly answers only 56% of our benchmark, with specialized models reaching only 42%. Augmenting models with retrieved medical articles improves performance to as much as 82% (using GPT-5.2), highlighting the importance of evidence grounding for real-world medical reasoning tasks. This work thus establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.

1️⃣ One-Sentence Summary

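This paper introduces OGCaReBench, a new benchmark that specifically evaluates how well large language models can answer rare, off-guideline clinical questions accurately by retrieving real medical literature. Experiments show that even the strongest model reaches only 56% accuracy when answering directly, while adding document retrieval raises accuracy to as much as 82%.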
