See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch

📄 Abstract - See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch

Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning-acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
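To make the two interaction modes and the reasoning-acting gap concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual interface: the class names `DragAndDrop` and `AttachBlock`, their fields, and the definition of the gap as composite-mode success rate minus primitive-mode success rate are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DragAndDrop:
    """Primitive-mode action (hypothetical schema): raw pointer motion in pixels."""
    start_x: int
    start_y: int
    end_x: int
    end_y: int


@dataclass(frozen=True)
class AttachBlock:
    """Composite-mode action (hypothetical schema): one semantic API call."""
    block_opcode: str
    parent_block_id: str


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks whose constructed program passed its runtime tests."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def reasoning_acting_gap(composite: list[bool], primitive: list[bool]) -> float:
    """Assumed gap definition: composite success rate minus primitive success rate."""
    return success_rate(composite) - success_rate(primitive)
```

With per-task pass/fail lists for the same agent in both modes, a large positive value from `reasoning_acting_gap` would indicate that the agent can plan a correct program but struggles to carry it out through drag-and-drop.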

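1️⃣ One-Sentence Summary

This paper introduces ScratchWorld, a new benchmark that comprehensively tests the ability of AI agents to build, debug, and extend programs by operating the interface of the Scratch visual programming environment, and finds a clear gap between current agents' high-level planning and their fine-grained GUI manipulation.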
