arXiv submission date: 2026-03-03
📄 Abstract - See and Remember: A Multimodal Agent for Web Traversal

Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose V-GEMS (Visual Grounding and Explicit Memory System), a generally applicable, robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show that V-GEMS significantly outperforms the WebWalker baseline, achieving a substantial 28.7% performance gain. Code is available at this https URL.
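The "explicit memory stack with state tracking" described above can be pictured as a stack of visited page states plus a visited-set used to reject loop-inducing actions. The sketch below is a minimal illustration of that idea; the class and method names are assumptions for illustration, not the paper's actual API.

```python
class MemoryStack:
    """Illustrative sketch: a traversal-path stack with state tracking,
    letting an agent backtrack instead of revisiting states in a loop."""

    def __init__(self):
        self._path = []     # ordered stack of states along the current path
        self._seen = set()  # every state ever visited (loop detection)

    def visit(self, state: str) -> bool:
        """Record a new state. Returns False if the state was already
        seen, signaling the agent to pick a different action."""
        if state in self._seen:
            return False    # this action would create a navigation loop
        self._path.append(state)
        self._seen.add(state)
        return True

    def backtrack(self):
        """Pop the current state and return the previous one, enabling
        valid backtracking in deep navigation tasks."""
        if len(self._path) > 1:
            self._path.pop()
            return self._path[-1]
        return None


# Usage: the agent consults the memory before committing to an action.
mem = MemoryStack()
mem.visit("home")
mem.visit("products")
print(mem.visit("home"))   # False: loop detected, action rejected
print(mem.backtrack())     # "home": return to the previous page
```

Keeping the visited-set separate from the path stack is what distinguishes backtracking (allowed, pops the stack) from revisiting (rejected, would re-enter a seen state).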

Top tags: agents multi-modal model evaluation
Detailed tags: web navigation visual grounding explicit memory benchmark autonomous agents

See and Remember: A Multimodal Agent for Web Traversal


1️⃣ One-sentence summary

This paper proposes a new multimodal agent called V-GEMS, which combines visual grounding with an explicit memory system so that an AI browsing the web can better understand interface elements and remember the paths it has already taken, effectively avoiding getting lost or going in circles and significantly improving the accuracy and efficiency of web navigation.

Source: arXiv 2603.02626