arXiv submission date: 2026-04-02
📄 Abstract - ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems

Tool-using agents often fail for operational reasons even when language understanding is strong. Common causes include invalid arguments, interface drift, weak recovery, and inefficient retry behavior. We introduce ToolMisuseBench, an offline deterministic benchmark for evaluating tool misuse and recovery under explicit step, call, and retry budgets. The benchmark covers CRUD, retrieval, file, and scheduling environments with replayable fault injection. It reports success, invalid-call behavior, policy violations, recovery quality, and budgeted efficiency. We release a public dataset with 6800 tasks and a reproducible evaluation pipeline. Baseline results show fault-specific recovery gains for schema-aware methods, while overall success remains limited under the released authorization and hard-failure settings.
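
To make "replayable fault injection" under "explicit step, call, and retry budgets" concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the ToolMisuseBench code: `Budget`, `fault_schedule`, `FaultyTool`, and `run_task` are hypothetical names, and the doubling "tool" is a stand-in for a real environment.

```python
import random
from dataclasses import dataclass


@dataclass
class Budget:
    # Hypothetical budget container; field names are assumptions.
    max_steps: int
    max_calls: int
    max_retries: int


def fault_schedule(seed, fault_rate, horizon):
    """Derive the set of failing call indices from a seed, so a run can be replayed."""
    rng = random.Random(seed)
    return {i for i in range(horizon) if rng.random() < fault_rate}


class FaultyTool:
    """Stand-in tool that doubles its input and fails on scheduled call indices."""

    def __init__(self, fault_calls):
        self.fault_calls = set(fault_calls)
        self.calls = 0

    def __call__(self, arg):
        idx = self.calls
        self.calls += 1
        if idx in self.fault_calls:
            raise RuntimeError(f"injected fault at call {idx}")
        return arg * 2


def run_task(tool, inputs, budget):
    """Run one tool call per step, retrying on faults until a budget is exhausted."""
    calls = retries = 0
    results = []

    def report(success, reason):
        return {"success": success, "reason": reason,
                "calls": calls, "retries": retries, "results": results}

    for step, arg in enumerate(inputs):
        if step >= budget.max_steps:
            return report(False, "step budget")
        while True:
            if calls >= budget.max_calls:
                return report(False, "call budget")
            calls += 1
            try:
                results.append(tool(arg))
                break
            except RuntimeError:
                if retries >= budget.max_retries:
                    return report(False, "retry budget")
                retries += 1
    return report(True, "ok")


# Explicit schedule: calls 0 and 2 fail, so two retries are needed.
ok = run_task(FaultyTool({0, 2}), [1, 2, 3], Budget(5, 10, 2))
tight = run_task(FaultyTool({0, 2}), [1, 2, 3], Budget(5, 10, 1))
```

In this sketch the same fault schedule succeeds with a retry budget of 2 and fails with a retry budget of 1, which is the kind of budget-dependent outcome the abstract describes. Because the schedule depends only on the seed, rerunning with the same seed reproduces the same faults.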

Top-level tags: agents benchmark model evaluation
Detailed tags: tool misuse agent evaluation fault injection recovery offline benchmark

ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems


1️⃣ One-sentence summary

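This paper introduces ToolMisuseBench, a standardized benchmark designed to evaluate, and help improve, how well AI agents recover after making mistakes in tool calls (such as invalid arguments or interface mismatches), and it releases a dataset of 6800 tasks together with an evaluation pipeline.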

Source: arXiv: 2604.01508