arXiv submission date: 2026-03-03
📄 Abstract - See and Remember: A Multimodal Agent for Web Traversal

Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose V-GEMS (Visual Grounding and Explicit Memory System), a generally applicable, robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show that V-GEMS significantly outperforms the WebWalker baseline, achieving a substantial 28.7% performance gain. Code is available at this https URL.
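The "explicit memory stack with state tracking" described above can be pictured as a stack of visited page states plus a visited-set used to reject loop-inducing actions. The sketch below is a minimal illustration of that idea; the class and method names are assumptions for illustration, not the paper's actual API.

```python
class MemoryStack:
    """Illustrative sketch: a traversal-path stack with state tracking,
    letting an agent backtrack instead of revisiting states in a loop."""

    def __init__(self):
        self._path = []     # ordered stack of states along the current path
        self._seen = set()  # every state ever visited (loop detection)

    def visit(self, state: str) -> bool:
        """Record a new state. Returns False if the state was already
        seen, signaling the agent to pick a different action."""
        if state in self._seen:
            return False    # this action would create a navigation loop
        self._path.append(state)
        self._seen.add(state)
        return True

    def backtrack(self):
        """Pop the current state and return the previous one, enabling
        valid backtracking in deep navigation tasks."""
        if len(self._path) > 1:
            self._path.pop()
            return self._path[-1]
        return None


# Usage: the agent consults the memory before committing to an action.
mem = MemoryStack()
mem.visit("home")
mem.visit("products")
print(mem.visit("home"))   # False: loop detected, action rejected
print(mem.backtrack())     # "home": return to the previous page
```

Keeping the visited-set separate from the path stack is what distinguishes backtracking (allowed, pops the stack) from revisiting (rejected, would re-enter a seen state).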

Top tags: agents multi-modal model evaluation
Detailed tags: web navigation visual grounding explicit memory benchmark autonomous agents

See and Remember: A Multimodal Agent for Web Traversal


1️⃣ One-sentence summary

This paper proposes a new multimodal agent called V-GEMS, which combines visual grounding with an explicit memory system so that an AI browsing the web can better understand interface elements and remember the paths it has already taken, effectively avoiding getting lost or going in circles and significantly improving the accuracy and efficiency of web navigation.

Source: arXiv 2603.02626