arXiv submission date: 2026-05-18
📄 Abstract - DocOS: Towards Proactive Document-Guided Actions in GUI Agents

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

Top-level tags: agents, natural language processing, benchmark
Detailed tags: GUI agents, proactive search, document-guided action, grounding, long-tailed tasks

DocOS: Towards Proactive Document-Guided Actions in GUI Agents


1️⃣ One-sentence summary

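This paper proposes a new approach in which graphical user interface (GUI) agents proactively search online documentation, as a human would, to solve complex tasks. It also introduces the DocOS benchmark, which shows that current agents are bottlenecked both in locating relevant documentation and in carrying out its instructions, and it points the way toward self-evolving agents.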

Source: arXiv: 2605.18048