Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes
1️⃣ One-Sentence Summary
This paper proposes an auto-research loop driven by external evaluation, in which multiple specialist agents divide the work and collaborate to autonomously generate, test, and improve training recipes without human intervention. The system achieves significant performance gains on several tasks and demonstrates that agents can learn from failure feedback and carry out program-level modifications.
We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint, but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across trials. The central empirical finding is that lineage feedback lets agents turn evaluator outcomes, including crashes, budget overruns, size failures, and accuracy-gate misses, into later program-level recipe edits rather than one-shot suggestions. Across 1,197 headline-run trials plus 600 Parameter Golf control trials after one-time setup and launch, humans did not choose proposals, edit recipes, override scores, or repair failed trials during the search. In the three headline runs, the same submitted-trial loop reduces Parameter Golf validation bpb by $0.81\%$, raises NanoChat-D12 CORE by $38.7\%$, and reduces CIFAR-10 Airbench96 wallclock by $4.59\%$, with each task measured by its own external evaluator and legality checks. The trace includes a strict architecture-domain audit of 157 headline-run submissions and program rewrites such as a NanoChat attention-kernel path change. Within this scope the loop autonomously writes code, submits experiments, absorbs feedback, applies and combines known techniques inside each environment, and improves public starting recipes.
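The closed loop the abstract describes, where each submitted trial carries a hypothesis, an executable recipe edit, an evaluator-owned outcome with legality checks, and lineage feedback that shapes the next proposal, can be sketched as below. This is a minimal toy sketch, not the paper's implementation: the agent names, the `Trial` fields, and the quadratic stand-in for the evaluator's metric (e.g. validation bpb, lower is better) are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trial:
    agent: str                       # which specialist agent submitted it
    hypothesis: str                  # stated reason for the edit
    recipe: dict                     # stand-in for an executable code edit
    score: Optional[float] = None    # evaluator-owned outcome
    failure: Optional[str] = None    # evaluator-owned failure label

def evaluator(recipe: dict) -> tuple[Optional[float], Optional[str]]:
    """External evaluator: legality checks first, then a measured outcome.
    Toy quadratic loss surface stands in for a real metric like bpb."""
    lr = recipe["lr"]
    if not (0.0 < lr < 1.0):
        return None, "legality-failure"   # e.g. budget overrun or size gate
    return (lr - 0.3) ** 2 + 1.0, None

def propose(agent: str, lineage: list[Trial]) -> Trial:
    """Specialist agent: mutate the best measured recipe in shared lineage."""
    scored = [t for t in lineage if t.score is not None]
    base = min(scored, key=lambda t: t.score).recipe if scored else {"lr": 0.5}
    lr = base["lr"] + random.uniform(-0.1, 0.1)
    return Trial(agent, f"perturb lr to {lr:.3f}", {"lr": lr})

random.seed(0)
lineage: list[Trial] = []
agents = ["optimizer-agent", "schedule-agent"]   # hypothetical surface split
for step in range(60):
    trial = propose(agents[step % len(agents)], lineage)
    trial.score, trial.failure = evaluator(trial.recipe)
    lineage.append(trial)            # failures stay in lineage as feedback

best = min(t.score for t in lineage if t.score is not None)
print(f"best score after {len(lineage)} trials: {best:.4f}")
```

The key design point mirrored here is that humans never pick proposals or repair failed trials: failed submissions remain in the shared lineage with their evaluator-assigned labels, and later proposals condition on that measured history.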
From arXiv: 2605.05724