arXiv submission date: 2026-05-11
📄 Abstract - ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce $\textbf{ComplexMCP}$, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), $\textbf{ComplexMCP}$ provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance of 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) $\textbf{tool retrieval saturation}$ as action spaces scale; (2) $\textbf{over-confidence}$, where agents skip essential environment verifications; and (3) $\textbf{strategic defeatism}$, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning $\textbf{ComplexMCP}$ as a critical testbed for the next generation of resilient autonomous systems.
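
To make the "seed-driven architecture" concrete, here is a minimal sketch of how deterministic-yet-diverse failure injection might work. This is an illustrative assumption, not the paper's implementation; the class `SeededSandbox`, the `fail_rate` parameter, and the `get_balance` tool are all hypothetical.

```python
import random

class SeededSandbox:
    """Hypothetical seed-driven sandbox: one seed derives both the initial
    environment state and the injected API failures, so every run with the
    same seed replays identically, while different seeds diversify evaluation."""

    def __init__(self, seed: int, fail_rate: float = 0.1):
        self.rng = random.Random(seed)   # a single RNG governs all randomness
        self.fail_rate = fail_rate       # chance of a simulated transient failure
        self.state = {"balance": self.rng.randint(100, 1000)}  # seeded env state

    def call_tool(self, name: str, **kwargs):
        # Failures look unpredictable to the agent but occur at identical
        # points for a given seed, keeping trajectory comparisons deterministic.
        if self.rng.random() < self.fail_rate:
            raise RuntimeError(f"{name}: transient API error (simulated)")
        if name == "get_balance":
            return self.state["balance"]
        raise KeyError(f"unknown tool: {name}")

# Two sandboxes built from the same seed start in identical states and
# will inject failures at identical steps in the tool-call sequence.
a, b = SeededSandbox(seed=42), SeededSandbox(seed=42)
assert a.state == b.state
```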

Top tags: llm agents, benchmark
Detailed tags: tool use evaluation, interdependent tools, dynamic environment, failure analysis

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox


1️⃣ One-Sentence Summary

This paper introduces ComplexMCP, a benchmark that simulates the complex conditions of real commercial software, where tools are interdependent, environments change dynamically, and calls can fail. It finds that even state-of-the-art AI agents achieve success rates below 60%, far short of the human level of 90%, and identifies three major bottlenecks: tool retrieval, over-confidence, and strategic defeatism.

From arXiv: 2605.10787