Holistic Evaluation and Failure Diagnosis of AI Agents
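1️⃣ One-sentence summary
This paper proposes a two-layer evaluation framework that independently diagnoses each step an AI agent takes while executing a complex task and localizes the errors it finds, substantially improving the accuracy of error categorization and localization; the experiments indicate that evaluation methodology matters more than model capability.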
AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.
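The per-span decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Span` and `Verdict` types, the `toy_judge` keyword rule (a stand-in for an LLM judge), the `tool_execution_error` category name, and the aggregation step are all assumptions made for illustration. The only idea taken from the abstract is that each span is judged independently and yields its own rationale, after which the verdicts are combined into an agent-level view.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Span:
    span_id: str
    kind: str      # hypothetical span type, e.g. "llm_call" or "tool_call"
    content: str


@dataclass
class Verdict:
    span_id: str
    category: Optional[str]  # None means no error was found in this span
    rationale: str


# A judge sees exactly one span, never the whole trace.
Judge = Callable[[Span], Verdict]


def evaluate_trace(spans: list[Span], judge: Judge) -> list[Verdict]:
    """Span-level pass: one independent verdict per span."""
    return [judge(span) for span in spans]


def diagnose_agent(verdicts: list[Verdict]) -> dict[str, list[str]]:
    """Agent-level view (assumed form): map each error category to its span ids."""
    report: dict[str, list[str]] = {}
    for v in verdicts:
        if v.category is not None:
            report.setdefault(v.category, []).append(v.span_id)
    return report


def toy_judge(span: Span) -> Verdict:
    """Keyword rule standing in for an LLM judge (illustrative only)."""
    if span.kind == "tool_call" and "Traceback" in span.content:
        return Verdict(span.span_id, "tool_execution_error",
                       "tool output contains a traceback")
    return Verdict(span.span_id, None, "no issue detected")


trace = [
    Span("s1", "llm_call", "Plan: search the repository for the failing test."),
    Span("s2", "tool_call", "Traceback (most recent call last): FileNotFoundError"),
    Span("s3", "llm_call", "The file is missing; I will create it."),
]

verdicts = evaluate_trace(trace, toy_judge)
report = diagnose_agent(verdicts)
print(report)
```

Because each call to the judge receives a single span, the cost of one judgment does not grow with trace length, and every reported category carries the span id where it was found. That pairing is what joint localization-categorization accuracy rewards.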
Source: arXiv: 2605.14865