MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
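1️⃣ One-sentence summary
This paper presents MolmoWeb, a fully open-source visual web agent, together with MolmoWebMix, its accompanying diverse training dataset. By releasing models, data, and code openly, it aims to make web agent research more transparent and community-driven, and it achieves leading performance on several web task benchmarks.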
Web agents, autonomous systems that navigate and execute tasks on the web on behalf of users, have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data, and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. MolmoWeb agents are available in 4B and 8B sizes. On browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, they achieve state-of-the-art results, outperforming similar-scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web, respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
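The screenshot-only action loop and the pass@k figures quoted in the abstract can be illustrated with a small sketch. This is not the MolmoWeb implementation: the `Action` type, the action names, `run_episode`, and both metric helpers are hypothetical, and the sketch assumes that pass@k counts a task as solved when at least one of k independent rollouts succeeds. The abstract does not spell out the evaluation protocol or how best-of-N selection picks a rollout.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    # Hypothetical action vocabulary; the real MolmoWeb action space may differ.
    kind: str            # e.g. "click", "type", "scroll", "stop"
    argument: str = ""   # e.g. coordinates or text to type


# A policy sees only the instruction and raw screenshot bytes
# (no HTML, no accessibility tree) and returns the next action.
Policy = Callable[[str, bytes], Action]


def run_episode(
    policy: Policy,
    instruction: str,
    get_screenshot: Callable[[], bytes],
    apply_action: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Roll out one episode: observe, predict, act, until 'stop' or the step cap."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = policy(instruction, get_screenshot())
        history.append(action)
        if action.kind == "stop":
            break
        apply_action(action)
    return history


def pass_at_1(outcomes: List[List[bool]]) -> float:
    """Per-task success rate of a single rollout, averaged over tasks."""
    return sum(sum(r) / len(r) for r in outcomes) / len(outcomes)


def pass_at_k(outcomes: List[List[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k rollouts succeeds."""
    return sum(any(r[:k]) for r in outcomes) / len(outcomes)


if __name__ == "__main__":
    # Made-up outcomes for three tasks with four rollouts each.
    outcomes = [
        [True, False, False, False],
        [False, False, False, False],
        [False, True, True, False],
    ]
    print(pass_at_1(outcomes), pass_at_k(outcomes, 4))
```

Under this assumption pass@k is an upper bound on what any best-of-N selector can reach, since a selector must still pick the successful rollout without access to ground truth.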
Source: arXiv: 2604.08516