Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition
1️⃣ One-Sentence Summary
This paper introduces Market-Bench, an evaluation framework that tests the practical ability of large language models to manage economic resources and make trading decisions by simulating procurement and retail competition in a multi-agent supply chain. It finds that only a few models achieve consistent profits, while most hover around break-even.
The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce Market-Bench, a comprehensive benchmark that evaluates the capabilities of LLMs on economically relevant tasks through economic and trade competition. Specifically, we construct a configurable multi-agent supply chain economic model in which LLMs act as retailer agents responsible for procuring and retailing merchandise. In the procurement stage, LLMs bid for limited inventory in budget-constrained auctions. In the retail stage, LLMs set retail prices, generate marketing slogans, and deliver them to buyers, who make purchase decisions through a role-based attention mechanism. Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. Benchmarking 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon: only a small subset of LLM retailers consistently achieves capital appreciation, while many hover around the break-even point despite similar semantic matching scores. Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.
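To make the two-stage setup concrete, the following is a minimal, hypothetical sketch of one procurement-then-retail round. The function names, auction rules (sealed bids, highest bidders win single units), and profit accounting are illustrative assumptions for exposition, not the paper's actual protocol or metrics.

```python
def procurement_auction(bids, budgets, inventory):
    """Procurement stage (illustrative): a sealed-bid auction where the
    highest valid bidders each win one unit until inventory runs out.
    Bids exceeding an agent's remaining budget are rejected."""
    allocation = {agent: 0 for agent in bids}
    valid = [(price, agent) for agent, price in bids.items()
             if price <= budgets[agent]]
    for price, agent in sorted(valid, reverse=True):
        if inventory == 0:
            break
        allocation[agent] += 1   # agent wins one unit of inventory
        budgets[agent] -= price  # budget-constrained spending
        inventory -= 1
    return allocation, budgets

def retail_stage(allocation, retail_prices, unit_cost, demand):
    """Retail stage (illustrative): each retailer sells up to its realized
    demand; profit = (retail price - unit cost) * units sold."""
    profits = {}
    for agent, units in allocation.items():
        sold = min(units, demand.get(agent, 0))
        profits[agent] = (retail_prices[agent] - unit_cost) * sold
    return profits

# One round: agent C bids above its budget and is excluded; A and B split
# the two available units, then sell at their chosen retail prices.
alloc, remaining = procurement_auction(
    bids={"A": 10, "B": 8, "C": 20},
    budgets={"A": 15, "B": 15, "C": 5},
    inventory=2,
)
profits = retail_stage(alloc, {"A": 14, "B": 12, "C": 0},
                       unit_cost=10, demand={"A": 1, "B": 1})
```

In the benchmark itself, the bids, retail prices, and slogans would come from LLM agents rather than fixed numbers, and buyer demand would be mediated by the role-based attention mechanism; the sketch only shows how the balance-sheet bookkeeping chains the two stages together.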
Source: arXiv: 2604.05523