Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
1️⃣ One-Sentence Summary
This paper proposes a new method called "Reasoning Trace Inversion," which compares the question a large language model actually answered against the original question to judge more accurately when the model should abstain (i.e., decline to answer), significantly improving the model's self-knowledge and reliability on complex tasks.
For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer, i.e., abstain. Reasoning models in particular have gained attention for impressive performance on complex tasks, yet they have been shown to have worse abstention abilities. Taking these vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework: hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question, rather than answering a question incorrectly. Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Then, based only on the trace, we reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. A low similarity score between the two suggests that the model likely answered the wrong question, and the model is flagged to abstain. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.
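The three steps described in the abstract can be sketched as a small pipeline. This is a minimal illustration, not the paper's implementation: the LLM calls (`generate_trace`, `invert_trace`) are hypothetical callables supplied by the user, the bag-of-words cosine stands in for whatever similarity scorer the paper actually uses, and the threshold value is an arbitrary placeholder.

```python
# Minimal sketch of the Trace Inversion abstention pipeline.
# Assumptions (not from the paper): the LLM calls are passed in as plain
# callables, similarity is a simple bag-of-words cosine, threshold=0.5.
from collections import Counter
import math


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def trace_inversion_abstain(query, generate_trace, invert_trace, threshold=0.5):
    """Return (should_abstain, similarity) for a query.

    Step 1: generate_trace(query) -> the model's reasoning trace.
    Step 2: invert_trace(trace)   -> reconstruct, from the trace alone,
            the most likely query the model responded to.
    Step 3: compare original and reconstructed queries; a low similarity
            flags the model to abstain.
    """
    trace = generate_trace(query)
    reconstructed = invert_trace(trace)
    sim = cosine_similarity(query, reconstructed)
    return sim < threshold, sim
```

For example, if the reconstructed query matches the original, similarity is high and the model answers; if the trace inverts to an unrelated question, similarity is low and the query is flagged for abstention.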
Source: arXiv: 2604.02230