
arXiv submission date: 2025-12-09
📄 Abstract - Towards a Science of Scaling Agent Systems

Agents, language model (LM)-based systems capable of reasoning, planning, and acting, are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored, leaving practitioners to rely on heuristics rather than principled design choices. We address this gap by deriving quantitative scaling principles for agent systems, evaluated across four diverse benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. Using five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid) instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations with standardized tools and token budgets. From empirical coordination metrics, including efficiency, overhead, error amplification, and redundancy, we derive a predictive model that achieves cross-validated R^2 = 0.513. We identify three dominant effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; (2) capability saturation: coordination yields diminishing or negative returns (beta = -0.408, p < 0.001) once single-agent baselines exceed ~45%; (3) topology-dependent error amplification: independent agents amplify errors 17.2x through unchecked propagation, while centralized coordination contains this to 4.4x. Centralized coordination improves performance by 80.9% on parallelizable tasks such as financial reasoning, while decentralized coordination excels on dynamic web navigation (+9.2% vs. +0.2%). Yet on sequential reasoning tasks, all multi-agent variants degraded performance by 39-70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations, providing a predictive principle of agentic scaling based on measurable task properties.

Top tags: agents systems model evaluation
Detailed tags: multi-agent systems scaling principles coordination trade-offs task characterization benchmark design

Quantitative Scaling Principles for Agent Systems / Towards a Science of Scaling Agent Systems


1️⃣ One-Sentence Summary

This paper proposes a quantitative framework showing that agent-system performance does not simply improve as more agents are added; instead it is governed by a complex trade-off among task characteristics, coordination mechanisms, and model capability, and it builds a predictive model for selecting architectures from measurable task properties.
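As a rough illustration of how the paper's three effects could be turned into an architecture-selection rule, here is a minimal Python sketch. The `TaskProfile` fields, the 0.5 tool-intensity threshold, and the overall decision order are my own assumptions for illustration; only the ~45% saturation point and the qualitative effects come from the abstract, and this is not the paper's fitted model.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    single_agent_score: float  # single-agent baseline accuracy, in [0, 1]
    tool_intensity: float      # hypothetical: fraction of budget spent on tool calls
    parallelizable: bool       # can subtasks be solved independently?

def choose_architecture(task: TaskProfile) -> str:
    """Toy decision rule distilled from the paper's three effects
    (illustrative thresholds, not the fitted predictive model)."""
    # Capability saturation: past ~45% single-agent accuracy,
    # coordination yields diminishing or negative returns.
    if task.single_agent_score > 0.45:
        return "single"
    # Tool-coordination trade-off: under fixed budgets, tool-heavy
    # tasks suffer disproportionately from multi-agent overhead.
    if task.tool_intensity > 0.5:
        return "single"
    # Parallelizable tasks (e.g. financial reasoning) favour a central
    # coordinator; dynamic tasks (e.g. web navigation) favour
    # decentralized coordination.
    return "centralized" if task.parallelizable else "decentralized"
```

For example, a hard, lightly tooled, parallelizable task maps to `"centralized"`: `choose_architecture(TaskProfile(0.30, 0.20, True))`.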


2️⃣ Key Contributions

1. A quantitative scaling framework for agent systems

2. A predictive model that selects the coordination strategy from measurable task properties

3. The tool-coordination trade-off: under fixed budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead

4. Capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed ~45%

5. Topology-dependent error amplification: 17.2x for independent agents vs. 4.4x under centralized coordination

6. A framework distinguishing agentic from non-agentic tasks

7. A controlled evaluation design: 180 configurations across five architectures and three LLM families, with standardized tools and token budgets
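The error-amplification contribution can be made concrete with a toy propagation model (my own construction, not the paper's measurement protocol): each pipeline step introduces an error with probability `p_err`, and a coordinator catches a carried error with probability `p_catch` before it reaches the next step. With `p_catch = 0` (independent agents) errors compound unchecked, while a high `p_catch` (centralized verification) contains them; the exact 17.2x and 4.4x factors in the paper are empirical, not outputs of this model.

```python
def amplification(p_err: float, n_steps: int, p_catch: float) -> float:
    """Expected number of erroneous step outputs, relative to the
    error rate of a single step (toy model, illustrative only)."""
    expected_errors = 0.0
    carried = 0.0  # probability an uncaught error is carried forward
    for _ in range(n_steps):
        # A carried error survives review with prob (1 - p_catch),
        # and each step may introduce a fresh error with prob p_err.
        carried = carried * (1 - p_catch) + p_err
        expected_errors += carried
    return expected_errors / p_err
```

With no checking (`p_catch = 0`) the amplification over `n` steps grows like `n(n+1)/2`, whereas even modest centralized checking caps the carried error at roughly `p_err / p_catch` per step, which is the qualitative gap the paper quantifies empirically.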


3️⃣ Main Results and Value

Result highlights

Practical value


4️⃣ Glossary

Source: arXiv:2512.08296