arXiv submission date: 2026-06-01
📄 Abstract - ATLAS: Agentic Test-time Learning-to-Allocate Scaling

Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search policy decides how compute is spent, leaving the model in charge of solving but not of orchestration. We introduce ATLAS, an agentic test-time scaling framework in which an LLM orchestrator owns the control loop end-to-end. Through a single action, explore, which dispatches a fresh independent solver on the original problem, the orchestrator decides whether to gather more evidence, when to stop, and how to synthesize the final answer; the action space is extensible, with each explore call optionally specifying solver, reasoning effort, or prompting strategy. We evaluate ATLAS on four benchmarks covering scientific question answering, code generation, and multimodal reasoning under a Claude Sonnet 4.6 backbone, where it reaches 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision while using far fewer API calls than fixed-workflow baselines. A multi-model extension, ATLAS-MM, that exposes solver choice as an additional action dimension further improves HLE-Verified to 60.00% and LiveCodeBench to 85.63%, with consistent gains on GPQA-Diamond and BabyVision. Ablations replacing the orchestrator's direct synthesis with a separate integrator degrade or fail to improve accuracy on three of four benchmarks, consistent with the role of stateful evidence management in producing the gains.
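The control loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: in ATLAS the orchestrator is an LLM that makes the stop and synthesis decisions itself, whereas the sketch below substitutes a simple agreement heuristic and a majority vote so the loop can run without a model. The names `Orchestrator`, `should_stop`, `synthesize`, `min_explores`, `max_explores`, and `agree_threshold` are all hypothetical, and `solver` is a stand-in for a call to a fresh independent solver.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Orchestrator:
    # Hypothetical stand-in for one fresh, independent solver call on the problem.
    solver: Callable[[str], str]
    min_explores: int = 2         # assumed lower bound before stopping is allowed
    max_explores: int = 8         # assumed hard budget on explore calls
    agree_threshold: float = 0.6  # assumed agreement level that ends exploration
    evidence: List[str] = field(default_factory=list)

    def explore(self, problem: str) -> str:
        """The single action: dispatch a solver on the ORIGINAL problem."""
        answer = self.solver(problem)
        self.evidence.append(answer)  # the orchestrator keeps state across calls
        return answer

    def should_stop(self) -> bool:
        """Heuristic placeholder for the LLM's own decision to stop."""
        if not self.evidence:
            return False  # always gather at least one answer
        if len(self.evidence) >= self.max_explores:
            return True   # budget exhausted
        if len(self.evidence) < self.min_explores:
            return False
        top_count = Counter(self.evidence).most_common(1)[0][1]
        return top_count / len(self.evidence) >= self.agree_threshold

    def synthesize(self) -> str:
        """Placeholder for direct synthesis: here, a plain majority vote."""
        return Counter(self.evidence).most_common(1)[0][0]

    def run(self, problem: str) -> str:
        while not self.should_stop():
            self.explore(problem)
        return self.synthesize()


if __name__ == "__main__":
    # Scripted answers stand in for sampled solver outputs.
    scripted = iter(["42", "41", "42", "42"])
    orch = Orchestrator(solver=lambda problem: next(scripted))
    print(orch.run("toy problem"), len(orch.evidence))
```

The point of the sketch is the shape of the loop: the same component that gathers evidence also decides when to stop and produces the final answer, so the number of solver calls varies per problem instead of being fixed in advance. Extending `explore` with optional arguments such as solver choice or reasoning effort would correspond to the extensible action space the abstract mentions.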

Top-level tags: llm agents model training
Detailed tags: test-time scaling orchestration reasoning multi-model action space

ATLAS: Agentic Test-time Learning-to-Allocate Scaling


1️⃣ One-sentence summary

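This paper proposes the ATLAS framework, which lets a large language model act as its own "dispatcher": by making agentic decisions (such as whether to keep exploring and which solving approach to use), it dynamically allocates test-time compute, achieving stronger performance on scientific question answering, code generation, and multimodal reasoning while reducing the number of API calls.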

Source: arXiv: 2606.01667