arXiv submission date: 2026-04-06
📄 Abstract - RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets

2026 has brought an explosion of interest in LLM-guided evolution of agentic artifacts, with systems like GEPA and Autoresearch demonstrating that LLMs can iteratively improve prompts, code, and agent architectures across diverse domains. As adoption accelerates, a central question emerges: given the same information, the same seed agent, and the same objective, which optimization algorithm yields the best results under the same evaluation budget? This question becomes critical when evaluations are expensive, such as when they require human judgment or multiple LLM calls. We present the first systematic comparison of three optimization paradigms -- Elo tournament selection (RoboPhD), Pareto-based selection (GEPA), and greedy hill-climbing (Autoresearch) -- across four benchmarks spanning abstract reasoning, cloud scheduling, SQL generation, and financial QA, all under a fixed budget of 1,500 evaluations. RoboPhD introduces validation-free evolution: instead of splitting the budget between training and validation, it uses Elo competition on training data to simultaneously evaluate agents and drive evolution. All three systems receive seed agents with diagnostic print() statements that evolution can grow, enabling self-instrumenting agents that develop increasingly informative diagnostics for the benefit of their evolutionary successors. Using a single default configuration, RoboPhD outperforms both GEPA and Autoresearch on three of four benchmarks, losing only on the simplest task, where the winning solution (from our Autoresearch adaptation) required under 90 lines of code. On ARC-AGI, RoboPhD evolves a 22-line seed agent into a 1,013-line multi-strategy system, improving accuracy from 27.8% to 65.8% using Gemini 3.1 Flash Lite as the solver. We release RoboPhD as a versatile toolkit under the MIT license with a simple optimize_anything() API for evolving diverse complex agents.
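The abstract describes Elo competition on training data as the mechanism that both ranks agents and drives selection. As a rough illustration only, the sketch below applies the standard Elo rating update to head-to-head comparisons between agents on sampled tasks. It is not RoboPhD's implementation: the K-factor of 32, the starting rating of 1500, the random pairing scheme, and the names `expected_score`, `update_elo`, and `run_tournament` are all assumptions made for this example.

```python
import random


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A when playing B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k=32 is an assumed K-factor, not a value taken from the paper.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def run_tournament(agents, tasks, rounds, seed=0):
    """Rate agents by repeated pairwise matches on training tasks.

    agents: dict mapping a name to a callable(task) -> numeric score.
    Each round samples two agents and one task (hypothetical pairing scheme),
    so one round consumes two evaluations of the budget.
    """
    rng = random.Random(seed)
    ratings = {name: 1500.0 for name in agents}  # assumed starting rating
    names = sorted(agents)
    for _ in range(rounds):
        a, b = rng.sample(names, 2)
        task = rng.choice(tasks)
        score_a, score_b = agents[a](task), agents[b](task)
        outcome = 1.0 if score_a > score_b else 0.0 if score_a < score_b else 0.5
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)
    return ratings
```

Under these assumptions, the same evaluations that produce match outcomes also produce the ranking used to pick parents for the next generation, which is the sense in which no separate validation split would be needed. How RoboPhD actually pairs agents, accounts for its 1,500-evaluation budget, and selects survivors is not specified in the abstract.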

Top-level tags: agents model evaluation llm
Detailed tags: agent evolution optimization comparison evaluation budget elo tournament benchmarking

RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets


1️⃣ One-sentence summary

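This paper proposes RoboPhD, a method that automatically improves AI agents within a limited number of evaluations by using a competitive mechanism similar to chess ratings (Elo), and that outperforms the other optimization algorithms it is compared against (GEPA and Autoresearch) on three of four benchmarks.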

Source: arXiv: 2604.04347