📄 Abstract - From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at: this https URL.
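The abstract names a preference-optimization-based realignment framework but does not specify the training objective. As a rough illustration only, a DPO-style loss over preference pairs (a coherent reasoning trace preferred over a tool-myopic trace for the same problem) might look like the sketch below; the function name, arguments, and the choice of DPO itself are assumptions for illustration, not the paper's stated method.

```python
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style loss over (coherent, tool-myopic) trace pairs (illustrative).

    Each tensor holds the summed token log-probabilities of a full solution
    trace under the policy or the frozen reference model; beta controls how
    far the policy may drift from the reference.
    """
    # How much more (or less) likely each trace is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Widen the margin between the coherent trace and the tool-myopic one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```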

Top tags: llm · model evaluation · agents
Detailed tags: tool-augmented reasoning · reasoning hallucinations · code interpreter · mathematical reasoning · preference optimization

📄 Paper Summary

From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models


1️⃣ One-Sentence Summary

This study finds that although external tools such as a code interpreter raise language models' final-answer accuracy, they lead models to over-rely on tool outputs at the expense of the logical reasoning process, producing solutions that look correct but lack sound justification; the authors mitigate this problem with a preference-optimization-based realignment framework.

