arXiv submission date: 2026-04-26
📄 Abstract - MarketBench: Evaluating AI Agents as Market Participants

Markets are a promising way to coordinate AI agent activity, for much the same reasons that justify markets more broadly. To participate effectively in markets, agents need informative signals of their own ability to complete a task successfully and of the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. As a demonstration, we use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs. These LLMs are miscalibrated on both success probability and token usage, and auctions built from their self-reports diverge from a full-information allocation. A follow-up intervention that adds information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to the full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.
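The abstract's miscalibration claim can be made concrete with a standard reliability check. The sketch below is illustrative only: the paper's actual metrics are not specified in this summary, and `p_hat` / `solved` are random placeholders standing in for an LLM's self-reported success probabilities and the observed SWE-bench outcomes. Brier score and expected calibration error (ECE) are two common ways to quantify this kind of miscalibration.

```python
import numpy as np

# Hypothetical data: one row per (model, task) attempt on the 93-task subset.
# p_hat  = the model's self-reported probability of solving the task
# solved = whether the attempt actually passed the SWE-bench tests
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.3, 0.9, size=93)   # illustrative placeholder reports
solved = rng.random(93) < 0.4            # illustrative placeholder outcomes

# Brier score: mean squared error between reported probability and outcome.
brier = np.mean((p_hat - solved.astype(float)) ** 2)

# Expected calibration error (ECE) with ten equal-width probability bins.
edges = np.linspace(0.0, 1.0, 11)
idx = np.digitize(p_hat, edges[1:-1])    # bin index (0..9) for each report
ece = 0.0
for b in range(10):
    mask = idx == b
    if mask.any():
        gap = abs(p_hat[mask].mean() - solved[mask].mean())
        ece += mask.mean() * gap         # weight each bin by its occupancy

print(f"Brier score: {brier:.3f}, ECE: {ece:.3f}")
```

A well-calibrated agent would show a small ECE, meaning its claimed success rates match its realized solve rates within each bin.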

Top-level tags: agents llm evaluation
Detailed tags: benchmark calibration self-assessment market coordination swe-bench

MarketBench: Evaluating AI Agents as Market Participants


1️⃣ One-Sentence Summary

This paper introduces MarketBench, a benchmark for evaluating how accurately AI agents (such as large language models) can assess themselves as market participants, i.e., predict whether they can complete a task and at what cost. Experiments show that current agents' self-assessments are substantially miscalibrated, which makes market-based resource allocation inefficient, and that adding information from prior experience to the context only modestly eases this bottleneck.
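The gap between self-report-driven allocation and the full-information benchmark can likewise be illustrated with a toy model. This is a minimal sketch under assumed mechanics: each task is simply assigned to the agent with the highest claimed surplus (value times reported success probability, minus reported cost). The paper's actual auction design is not described in this summary, and all quantities below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_tasks, value = 6, 93, 1.0    # 6 LLMs, 93 SWE-bench Lite tasks

# True per-(agent, task) success probabilities and costs (synthetic).
p_true = rng.uniform(0.1, 0.8, (n_agents, n_tasks))
c_true = rng.uniform(0.05, 0.3, (n_agents, n_tasks))
# Self-reports: noisy, i.e., miscalibrated versions of the truth.
p_self = np.clip(p_true + rng.normal(0, 0.2, p_true.shape), 0, 1)
c_self = c_true * rng.uniform(0.5, 1.5, c_true.shape)

def allocate(p, c):
    """Give each task to the agent whose claimed surplus p*value - c is highest."""
    return (value * p - c).argmax(axis=0)

def realized_welfare(winners):
    """Score an allocation using the *true* probabilities and costs."""
    cols = np.arange(n_tasks)
    return (value * p_true[winners, cols] - c_true[winners, cols]).sum()

w_self = realized_welfare(allocate(p_self, c_self))  # allocation from self-reports
w_full = realized_welfare(allocate(p_true, c_true))  # full-information benchmark
print(f"self-report welfare: {w_self:.1f} vs full-information: {w_full:.1f}")
```

The shortfall of `w_self` relative to `w_full` plays the role of the allocation gap the paper reports: the noisier the self-assessments, the more tasks go to the wrong agents.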

Source: arXiv:2604.23897