arXiv submission date: 2026-06-03
📄 Abstract - AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.
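The closed-loop setting the abstract describes (start from a correct but suboptimal baseline, then repeatedly edit and benchmark until the wall-clock budget runs out) can be illustrated with a minimal Python sketch. This is not the AutoLab harness: the function `optimize_within_budget` and its parameters `propose`, `measure`, and `clock` are hypothetical names introduced here for illustration, and the sketch assumes a scalar score where lower is better.

```python
import time
from typing import Callable, Tuple


def optimize_within_budget(
    baseline: float,
    propose: Callable[[float], float],
    measure: Callable[[float], float],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float, int]:
    """Iterate propose -> measure until the budget expires; keep the best.

    All names are illustrative assumptions, not part of AutoLab.
    Lower scores are treated as better (e.g. a runtime in seconds).
    """
    deadline = clock() + budget_seconds
    best = baseline
    best_score = measure(best)  # benchmark the baseline first
    iterations = 0
    while clock() < deadline:  # stay aware of the remaining time
        candidate = propose(best)  # edit the current best artifact
        score = measure(candidate)  # gather empirical feedback
        if score < best_score:  # accept only measured improvements
            best, best_score = candidate, score
        iterations += 1
    return best, best_score, iterations
```

In this toy loop the final result depends on how many propose-and-measure rounds fit into the budget rather than on the baseline alone, which mirrors the abstract's point about persistence. An agent that stops early simply returns with fewer iterations.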

Top-level tags: agents benchmark systems
Detailed tags: long-horizon closed-loop optimization persistence model evaluation autonomous agents

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?


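1️⃣ One-sentence summary

This paper introduces AutoLab, a benchmark of 36 realistic tasks (such as system optimization and model development) in which AI models repeatedly iterate on an existing solution within a limited time budget, and finds that what decides success is not the quality of the first attempt but the persistence to keep testing, editing, and incorporating feedback, revealing that most current frontier models lack long-term planning and sustained iteration ability.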

Source: arXiv: 2606.05080