arXiv submission date: 2026-05-14
📄 Abstract - Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

We introduce \textsc{Cattle Trade}, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50--60 turns. Unlike prior agent benchmarks that test these abilities in isolation, \textsc{Cattle Trade} evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.

Top-level tags: llm agents benchmark
Detailed tags: multi-agent strategic reasoning bargaining bluffing economic game

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining


1️⃣ One-sentence summary

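This paper proposes Cattle Trade, a multi-agent game benchmark that combines auctions, hidden-offer trades, bargaining, and bluffing to test how well large language models integrate strategic reasoning under incomplete information, limited resources, and conflicting interests; it also reveals common weaknesses of current models in budget control, avoiding self-bidding, and adapting to opponent behaviour.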

Source: arXiv: 2605.14537