arXiv submission date: 2026-01-12
📄 Abstract - ShowUI-Aloha: Human-Taught GUI Agent

Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and unannotated, making them difficult for agents to learn from. To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework includes four key components: a recorder that captures screen video along with precise user interactions such as mouse clicks, keystrokes, and scrolls; a learner that semantically interprets these raw interactions and the surrounding visual context, translating them into descriptive natural language captions; a planner that reads the parsed demonstrations, maintains task states, and dynamically formulates the next high-level action plan based on contextual reasoning; and an executor that faithfully carries out these action plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that can learn effectively from simply observing humans.
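The recorder → learner → planner → executor flow described in the abstract can be sketched with toy data structures. This is a minimal illustration, not the paper's implementation: all class and function names (`RawEvent`, `learn`, `Planner`, `execute`) are hypothetical, and the learner here uses string templates where the real system interprets visual context.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RawEvent:
    """What the recorder captures: a timestamped low-level interaction."""
    t: float      # seconds since recording start
    kind: str     # "click", "key", "scroll", ...
    detail: str   # e.g. "left@(412,87)" or "Ctrl+S"

@dataclass
class Caption:
    """What the learner produces: a natural-language description of one event."""
    t: float
    text: str

def learn(events: List[RawEvent]) -> List[Caption]:
    """Toy 'learner': map raw events to captions via templates.
    The actual system conditions on the surrounding screen frames."""
    templates = {"click": "Click {d}", "key": "Press {d}", "scroll": "Scroll {d}"}
    return [Caption(e.t, templates.get(e.kind, "Do {d}").format(d=e.detail))
            for e in events]

@dataclass
class Planner:
    """Toy 'planner': tracks task state and emits the next high-level step."""
    captions: List[Caption]
    cursor: int = 0

    def next_step(self) -> Optional[str]:
        if self.cursor >= len(self.captions):
            return None  # task complete
        step = self.captions[self.cursor].text
        self.cursor += 1
        return step

def execute(step: str) -> str:
    """Stub 'executor': a real one drives the OS with safety checks."""
    return f"executed: {step}"

if __name__ == "__main__":
    recording = [
        RawEvent(0.5, "click", "left@(412,87)"),
        RawEvent(1.2, "key", "Ctrl+S"),
    ]
    planner = Planner(learn(recording))
    step = planner.next_step()
    while step is not None:
        print(execute(step))
        planner_step = step
        step = planner.next_step()
```

The point of the sketch is the separation of concerns: the recorder's output is purely mechanical, the learner adds semantics, and the planner holds the only mutable task state, so the executor can remain a thin, auditable layer.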

Top tags: agents, systems, model training
Detailed tags: gui automation, human demonstrations, pipeline, action planning, computer control

ShowUI-Aloha: Human-Taught GUI Agent


1️⃣ One-Sentence Summary

This paper presents a system called ShowUI-Aloha that automatically converts screen recordings of human computer use into structured task instructions, enabling AI assistants to learn how to perform complex GUI operations by observing people use software.

Source: arXiv:2601.07181