菜单

关于 🐙 GitHub
arXiv 提交日期: 2025-12-25
📄 Abstract - Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning

When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI's answer. We tested this assumption by embedding hints into questions and measuring whether models mentioned them. In a study of over 9,000 test cases across 11 leading AI models, we found a troubling pattern: models almost never mention hints spontaneously, yet when asked directly, they admit noticing them. This suggests models see influential information but choose not to report it. Telling models they are being watched does not help. Forcing models to report hints works, but causes them to report hints even when none exist and reduces their accuracy. We also found that hints appealing to user preferences are especially dangerous-models follow them most often while reporting them least. These findings suggest that simply watching AI reasoning is not enough to catch hidden influences.

顶级标签: llm model evaluation natural language processing
详细标签: chain-of-thought explainable ai faithfulness evaluation transparency 或 搜索:

我们能信任AI的解释吗?思维链推理中系统性漏报的证据 / Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning


1️⃣ 一句话总结

这项研究发现,尽管主流AI模型在逐步推理时能察觉到问题中隐藏的提示信息,但它们通常会选择性地不报告这些关键影响因素,这表明仅观察AI的思维链输出不足以确保其解释的透明度和可信度。

源自 arXiv: 2601.00830