arXiv submission date: 2026-02-12
📄 Abstract - Agentic Test-Time Scaling for WebAgents

Test-time scaling has become a standard way to improve the performance and reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute to multi-step agents. We first conduct an empirical study of inference-time scaling for web agents and find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting but can also overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate extra compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
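To make the decision rule concrete, here is a minimal sketch of how vote-derived uncertainty (entropy and top-1/top-2 margin) could gate per-step compute. The function names (`sample_action`, `catts_step`), batch sizes, and threshold values are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def vote_uncertainty(votes):
    """Compute entropy and top-1/top-2 margin from a list of sampled actions.

    `votes` is a list of hashable candidate actions (e.g., serialized
    browser commands) sampled from the agent at the current step.
    """
    counts = Counter(votes)
    total = len(votes)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    ranked = sorted(counts.values(), reverse=True)
    top1 = ranked[0] / total
    top2 = (ranked[1] / total) if len(ranked) > 1 else 0.0
    return entropy, top1 - top2

def catts_step(sample_action, n_base=3, n_max=9, margin_threshold=0.5):
    """Confidence-aware allocation: draw a small batch of votes first, and
    escalate to a larger batch only when the top-1/top-2 margin says the
    decision is contentious. `sample_action` is a hypothetical callable
    returning one sampled action; thresholds here are illustrative.
    """
    votes = [sample_action() for _ in range(n_base)]
    entropy, margin = vote_uncertainty(votes)
    if margin < margin_threshold:  # contentious: spend extra compute
        votes += [sample_action() for _ in range(n_max - n_base)]
    # Majority vote over however many samples were actually drawn.
    return Counter(votes).most_common(1)[0][0]
```

On a high-consensus step (e.g., all three base samples agree), the margin is 1.0 and no extra samples are drawn, which is where the token savings over uniform scaling would come from.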

Top tags: agents llm model evaluation
Detailed tags: test-time scaling web agents compute allocation multi-step tasks uncertainty estimation

Agentic Test-Time Scaling for WebAgents


1️⃣ One-sentence summary

This work proposes CATTS, a method that lets web agents performing multi-step tasks allocate extra compute only at steps where the decision is uncertain, significantly improving task success rates while substantially cutting unnecessary computation.

Source: arXiv: 2602.12276