AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

📄 Abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, and so fail to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra-long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.
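The closed-loop setup described in the abstract (start from a working baseline, then repeatedly edit, benchmark, and keep what helps until a wall-clock budget runs out) can be illustrated with a minimal sketch. This is not AutoLab's evaluation harness: the function name `optimize_within_budget`, the callbacks `propose` and `score`, and the toy objective are all hypothetical, chosen only to show the shape of a budgeted improve-and-measure loop.

```python
import random
import time


def optimize_within_budget(baseline, propose, score, budget_seconds):
    """Hypothetical sketch: keep the best-scoring candidate found before the deadline."""
    deadline = time.monotonic() + budget_seconds
    best, best_score = baseline, score(baseline)
    iterations = 0
    while time.monotonic() < deadline:
        candidate = propose(best)            # edit the current best artifact
        candidate_score = score(candidate)   # benchmark the edit
        if candidate_score > best_score:     # keep it only if the measurement improved
            best, best_score = candidate, candidate_score
        iterations += 1
    return best, best_score, iterations


# Toy stand-in for a task: the "artifact" is a number, and higher scores are better.
def toy_score(x):
    return -(x - 3.0) ** 2


def toy_propose(x):
    return x + random.gauss(0.0, 0.5)


best, best_score, iterations = optimize_within_budget(
    baseline=0.0, propose=toy_propose, score=toy_score, budget_seconds=0.05
)
print(f"best={best:.3f} score={best_score:.3f} after {iterations} iterations")
```

In this sketch the final result can never be worse than the baseline, and an agent that stops early simply forgoes the remaining iterations, which mirrors the abstract's point that persistence within the budget matters more than the first attempt.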


1️⃣ One-Sentence Summary

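This paper introduces AutoLab, a benchmark of 36 realistic tasks (such as system optimization and model development) in which AI models repeatedly iterate on an existing solution within a limited time budget. It finds that what determines success is not the quality of the first attempt but persistence in continually testing, editing, and incorporating feedback, which reveals that most current frontier models lack long-term planning and sustained iteration.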
