arXiv submission date: 2026-02-15
📄 Abstract - Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at this https URL .

Top tags: llm, natural language processing, model evaluation
Detailed tags: scientific reasoning, abstention, verification, natural language inference, reliability

Knowing When Not to Answer: Abstention-Aware Scientific Reasoning


1️⃣ One-Sentence Summary

This paper proposes a framework that teaches AI models to abstain in scientific reasoning tasks: scientific claims are decomposed into minimal conditions and audited against available evidence, and the model then chooses to support, refute, or decline to answer. By abstaining when evidence is insufficient, the model actively avoids errors, effectively controls risk, and improves the reliability of scientific verification.
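The support/refute/abstain decision and the coverage-risk tradeoff described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function names, the min/max aggregation over per-condition NLI scores, and the threshold value are all assumptions made for the sketch.

```python
def verdict(condition_scores, threshold=0.7):
    """Decide support / refute / abstain for one claim.

    condition_scores: list of (p_entail, p_contradict) pairs, one per
    minimal condition the claim was decomposed into (hypothetical NLI
    outputs; the aggregation rule here is an illustrative assumption).
    """
    if not condition_scores:
        return "abstain"  # no auditable conditions -> no verdict
    # A claim is only as supported as its weakest condition.
    min_entail = min(p_e for p_e, _ in condition_scores)
    # A single strongly contradicted condition is enough to refute.
    max_contra = max(p_c for _, p_c in condition_scores)
    if min_entail >= threshold:
        return "support"
    if max_contra >= threshold:
        return "refute"
    return "abstain"  # evidence insufficient for either verdict


def risk_coverage(verdicts, labels):
    """Selective-prediction metrics: fraction answered, error rate on answered."""
    answered = [(v, y) for v, y in zip(verdicts, labels) if v != "abstain"]
    coverage = len(answered) / len(verdicts)
    risk = (sum(1 for v, y in answered if v != y) / len(answered)
            if answered else 0.0)
    return coverage, risk
```

Raising the threshold lowers coverage but typically lowers selective risk as well, which is the tradeoff the paper evaluates across models.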

Source: arXiv 2602.14189