arXiv submission date: 2026-06-10
📄 Abstract - Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
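
The coordinator/executor loop over a persistent tree can be illustrated with a toy sketch. This is not Arbor's implementation: `HypothesisNode`, `frontier`, `run_coordinator`, the example hypotheses, and the scores are all hypothetical names and values invented for illustration. It also assumes a single scalar score per hypothesis, and leaves out isolated worktrees, held-out verification, and propagation of insights between branches.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class HypothesisNode:
    """One node of a toy hypothesis tree (all field names are illustrative)."""
    hypothesis: str
    parent: Optional["HypothesisNode"] = None
    children: List["HypothesisNode"] = field(default_factory=list)
    score: Optional[float] = None  # evidence returned by an executor
    insight: str = ""              # lesson distilled from that evidence

    def add_child(self, hypothesis: str) -> "HypothesisNode":
        child = HypothesisNode(hypothesis, parent=self)
        self.children.append(child)
        return child


def frontier(root: HypothesisNode) -> List[HypothesisNode]:
    """Return all untested nodes in depth-first, left-to-right order."""
    stack, untested = [root], []
    while stack:
        node = stack.pop()
        if node.score is None:
            untested.append(node)
        stack.extend(reversed(node.children))
    return untested


def run_coordinator(
    root: HypothesisNode,
    executor: Callable[[str], float],
    baseline: float,
) -> Tuple[float, List[str]]:
    """Test each frontier node once and admit only strict improvements."""
    best_score = baseline
    admitted: List[str] = []
    for node in frontier(root):
        node.score = executor(node.hypothesis)  # short-lived executor call
        if node.score > best_score:
            best_score = node.score
            admitted.append(node.hypothesis)
            node.insight = "improved over the current best"
        else:
            node.insight = "no improvement; deprioritize similar ideas"
    return best_score, admitted


# Hypothetical scores standing in for real experiments.
fake_scores = {
    "lower learning rate": 0.55,
    "lower learning rate + warmup": 0.53,
    "larger batch size": 0.60,
}

root = HypothesisNode("initial artifact", score=0.50)
branch = root.add_child("lower learning rate")
branch.add_child("lower learning rate + warmup")
root.add_child("larger batch size")

best, admitted = run_coordinator(root, fake_scores.__getitem__, baseline=0.50)
print(best, admitted)
```

In this sketch the tree records every outcome, including failed attempts, so a later pass could read the stored `insight` strings when deciding where to expand next; the abstract's "cumulative process" refers to that kind of reuse.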

Top-level tags: llm agents systems
Detailed tags: autonomous research hypothesis tree benchmark long-horizon planning iterative experimentation

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement


1️⃣ One-Sentence Summary

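This paper proposes Arbor, an autonomous research framework that combines a long-lived coordinator, short-lived executors, and a continuously updated hypothesis tree so that an AI agent can, like a human scientist, explore through experiments autonomously over long horizons, accumulate experience, and refine its strategy; across several real research tasks it substantially outperforms existing tools.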

Source: arXiv: 2606.11926