AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation
1️⃣ One-sentence summary
This paper introduces AgentSelect, a benchmark that addresses the core problem of recommending the most suitable AI agent configuration given a user's concrete task description (query). It aggregates large-scale heterogeneous data, exposes the shortcomings of traditional recommendation methods in long-tail scenarios, and provides the first unified data and evaluation foundation for research on, and applications in, the agent ecosystem.
LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little query-conditioned supervision for learning to recommend end-to-end agent configurations that couple a backbone model with a toolkit. We address this gap with AgentSelect, a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data. AgentSelect comprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM-only, toolkit-only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long-tail, near one-off supervision, where popularity-based CF/GNN methods become fragile and content-aware capability matching is essential. We further show that the synthesized compositional interactions (Part III) are learnable, induce capability-sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on AgentSelect also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation, establishing a reproducible foundation to study and accelerate the emerging agent ecosystem.
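To make "content-aware capability matching" concrete, here is a minimal, hypothetical sketch (not the paper's method): each agent is summarized by a free-text capability profile, and candidates are ranked by bag-of-words cosine similarity between the narrative query and each profile. The catalog entries, agent names, and `rank_agents` helper below are illustrative assumptions only.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_agents(query: str, agents: dict[str, str]) -> list[str]:
    # Rank agents by similarity of their capability profile to the query text.
    q = Counter(query.lower().split())
    scores = {name: cosine(q, Counter(profile.lower().split()))
              for name, profile in agents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical catalog: each agent couples a backbone model with a toolkit,
# summarized as a free-text capability profile.
catalog = {
    "web-researcher": "llm with browser search tool for web research and citation",
    "code-assistant": "llm with python interpreter for code generation and debugging",
    "data-analyst":   "llm with sql and spreadsheet tools for data analysis",
}

print(rank_agents("debug my python code", catalog)[0])
```

A real system would replace bag-of-words with learned embeddings, but the key point the benchmark's analysis makes survives even in this toy form: matching is driven by query/profile content rather than agent popularity, so it does not collapse on long-tail agents with few or no prior interactions.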
Source: arXiv: 2603.03761