arXiv submission date: 2025-12-22
📄 Abstract - MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments

Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verifications, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.
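The abstract's "precise functional verifications" via backend database inspection can be pictured with a minimal sketch: instead of parsing screenshots, a verifier queries the app's own storage for the expected end state. A sketch under stated assumptions, not MobileWorld's actual verifier: the SQLite backend, the `contacts` table, its columns, and the function name below are all illustrative.

```python
import sqlite3


def verify_contact_added(db_path: str, name: str, phone: str) -> bool:
    """Deterministic functional check: inspect the app's backend database
    directly rather than the rendered GUI, so the verdict does not depend
    on screen layout or OCR. Schema and table names are hypothetical."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT COUNT(*) FROM contacts WHERE name = ? AND phone = ?",
            (name, phone),
        ).fetchone()
        return row[0] > 0
    finally:
        conn.close()
```

Because the check reads ground-truth state, it stays deterministic across runs of the same container snapshot, which is what makes the evaluation reproducible.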

Top-level tags: agents, benchmark, systems
Detailed tags: gui agents, mobile interaction, agent-user interaction, tool-augmented agents, deterministic evaluation

MobileWorld: A More Challenging Benchmark for Mobile GUI Agents / MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments


1️⃣ One-Sentence Summary

This paper introduces MobileWorld, a benchmark for mobile GUI agents that is substantially more challenging than existing benchmarks such as AndroidWorld. By adding higher-complexity tasks, agent-user interaction tasks, and MCP-augmented tasks, it more faithfully reflects real-world mobile usage, and its results reveal that even the best current models fall significantly short on complex interaction and external tool calls.


2️⃣ Key Contributions

1. A high-complexity, realistic benchmark design

2. Agent-user interaction tasks

3. MCP-augmented tasks

4. Deterministic evaluation infrastructure

5. A planner-executor agentic framework
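Contribution 5, the planner-executor framework with an extended action space, can be sketched as a simple control loop in which the planner may emit not only GUI gestures but also a question to the user or an MCP tool call, whose results flow back as the next observation. A minimal sketch under stated assumptions: the `Action` type, the action names (`ask_user`, `mcp_call`), and the callback signatures are hypothetical, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str       # "tap", "type", "scroll", "ask_user", "mcp_call", "done"
    payload: dict   # action arguments, e.g. coordinates, question text, tool args


def run_episode(plan_next: Callable[[str], Action],
                ask_user: Callable[[str], str],
                mcp_call: Callable[[str, dict], str],
                max_steps: int = 30) -> list[str]:
    """Drive one episode: the planner emits an action per step; answers to
    user questions and MCP tool results are fed back as observations."""
    observation, trace = "app_home", []
    for _ in range(max_steps):
        action = plan_next(observation)
        trace.append(action.kind)
        if action.kind == "done":
            break
        if action.kind == "ask_user":
            # Extended action: ask the user to resolve a vague instruction.
            observation = ask_user(action.payload["question"])
        elif action.kind == "mcp_call":
            # Extended action: invoke an external tool over MCP.
            observation = mcp_call(action.payload["tool"], action.payload["args"])
        else:
            # Ordinary GUI action executed on the device (stubbed here).
            observation = f"screen_after_{action.kind}"
    return trace
```

Separating planning from execution this way lets one planner drive heterogeneous backends (device GUI, user dialogue, MCP tools) through a single action interface.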


3️⃣ Main Results and Value

Result highlights: the best agentic framework reaches only a 51.7% success rate and the best end-to-end model 20.9%, a sharp drop from AndroidWorld, where recent agents exceed 90%.

Practical value: the analysis shows current models struggle most with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.


4️⃣ Glossary

- MCP (Model Context Protocol): an open protocol for connecting agents to external tools and data sources.
- Agent-user interaction tasks: tasks in which the agent must query the user to resolve vague instructions.
- AndroidWorld: a prior online mobile-use benchmark with a reproducible environment and deterministic evaluation.

Source: arXiv:2512.19432