📄 Abstract - GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents

We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges and are constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. It includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks (GUI grounding, screen parsing, and action prediction) and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision-language models on GUI-360$^\circ$ reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360$^\circ$ and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset has been made public.

Top-level tags: agents, benchmark, computer vision
Detailed tags: computer-using agents, gui interaction, multi-modal trajectories, action prediction, screen parsing

📄 Paper Summary

GUI-360°: A Comprehensive Dataset and Benchmark for Computer-Using Agents


1️⃣ One-Sentence Summary

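This paper presents GUI-360°, a large-scale dataset that uses an automated pipeline to collect more than 1.2 million action records from Windows office applications. It aims to address key challenges for computer-using agents in GUI understanding, screen parsing, and action prediction, and provides a unified evaluation benchmark for related research.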

