The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break
1️⃣ One-Sentence Summary
This paper introduces HORIZON, a cross-domain diagnostic benchmark that systematically reveals why large language model agents tend to fail on long-horizon tasks requiring many interdependent steps, and proposes a scalable, automated evaluation method for analyzing these failure modes, offering guidance for building more reliable agents.
Large language model (LLM) agents perform strongly on short- and mid-horizon tasks, but often break down on long-horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long-horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross-domain diagnostic benchmark for systematically constructing tasks and analyzing long-horizon failure behaviors in LLM-based agents. Using HORIZON, we evaluate state-of-the-art (SOTA) agents from multiple model families (GPT-5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon-dependent degradation patterns. We further propose a trajectory-grounded LLM-as-a-Judge pipeline for scalable and reproducible failure attribution, and validate it with human annotation on trajectories, achieving strong agreement (inter-annotator κ = 0.61; human-judge κ = 0.84). Our findings offer an initial methodological step toward systematic, cross-domain analysis of long-horizon agent failures and provide practical guidance for building more reliable long-horizon agents. We release our project website, the HORIZON Leaderboard, at this https URL and welcome contributions from the community.
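The agreement scores above are Cohen's kappa, which corrects raw label agreement for the agreement two annotators would reach by chance. As a point of reference, here is a minimal sketch of the statistic; the failure-mode labels in the example are hypothetical, not taken from the paper's taxonomy:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' labels over the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items labeled identically
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: from each annotator's marginal label frequencies
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical failure-mode labels from a human annotator and an LLM judge
human = ["tool_error", "planning", "planning", "memory", "tool_error"]
judge = ["tool_error", "planning", "memory", "memory", "tool_error"]
print(round(cohens_kappa(human, judge), 2))  # prints 0.71
```

A kappa of 0.61 is conventionally read as "substantial" agreement and 0.84 as "almost perfect," which is why the paper treats its judge pipeline as a reliable proxy for human annotation.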
Source: arXiv:2604.11978