Plausible but Wrong: A Case Study on Agentic Failures in Astrophysical Workflows
1️⃣ One-sentence summary
By testing an AI system called CMBAgent on astrophysical tasks, this work finds that its most dangerous failure mode is not overt errors but confidently generating syntactically correct yet physically wrong results, a problem it rarely detects on its own, especially in complex reasoning tasks.
Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation: syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent failures across stress tests, producing physically inconsistent posteriors without self-diagnosis. Overall, performance is strong on well-specified tasks but degrades on problems designed to probe reasoning limits, often without visible error signals. These findings highlight that the most concerning failure mode in agentic scientific workflows is not overt failure but the confident generation of incorrect results. We release our evaluation framework to facilitate systematic reliability analysis of scientific AI agents.
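To make the "silent incorrect computation" failure mode concrete, here is a minimal hypothetical sketch (not taken from the paper): two functions that both run without error and return plausible positive values, yet one encodes the wrong physics. The function names and the chosen relation (CMB temperature scaling with redshift) are illustrative assumptions, not the paper's tasks.

```python
# Hypothetical illustration of "silent incorrect computation":
# both functions execute cleanly; only one is physically correct.

T_CMB_TODAY_K = 2.725  # present-day CMB temperature in kelvin

def cmb_temperature_correct(z):
    """CMB temperature at redshift z: T(z) = T0 * (1 + z)."""
    return T_CMB_TODAY_K * (1.0 + z)

def cmb_temperature_buggy(z):
    """Syntactically valid but physically wrong: divides instead of
    multiplying, so the early universe appears colder, not hotter."""
    return T_CMB_TODAY_K / (1.0 + z)

# Both calls succeed and return plausible-looking temperatures;
# no exception or warning distinguishes the wrong one.
print(cmb_temperature_correct(2.0))  # correct: 8.175 K
print(cmb_temperature_buggy(2.0))    # silently wrong, yet positive and "plausible"
```

The point of the sketch is that conventional error signals (exceptions, type errors, NaNs) never fire; only domain knowledge reveals the bug, which is why such failures evade an agent's self-diagnosis.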
Source: arXiv:2604.25345