Abstract - WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
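The stage-level diagnosis described in the abstract can be pictured with a minimal sketch. This is an illustration, not the benchmark's actual scoring procedure, which the abstract does not specify. The `MemoryTrace` record, its field names, and both helper functions are hypothetical. The sketch assumes each question comes with four boolean checks derived from the gold annotations, and it attributes a failure to the earliest stage that did not pass.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# The four lifecycle stages named in the abstract, in order.
STAGES = ("writing", "maintenance", "retrieval", "use")


@dataclass
class MemoryTrace:
    """Hypothetical per-question record; field names are illustrative only."""
    written: bool     # the gold memory point was stored at all
    up_to_date: bool  # the stored value reflects the latest gold update
    retrieved: bool   # the gold evidence was surfaced at decision time
    answered: bool    # the final answer or action was correct


def first_failed_stage(trace: MemoryTrace) -> Optional[str]:
    """Return the earliest lifecycle stage that failed, or None if all passed."""
    checks = (trace.written, trace.up_to_date, trace.retrieved, trace.answered)
    for stage, ok in zip(STAGES, checks):
        if not ok:
            return stage
    return None


def stage_failure_counts(traces: Iterable[MemoryTrace]) -> Dict[str, int]:
    """Count how many traces first fail at each stage."""
    counts = {stage: 0 for stage in STAGES}
    for trace in traces:
        stage = first_failed_stage(trace)
        if stage is not None:
            counts[stage] += 1
    return counts
```

Under these assumptions, a trace such as `MemoryTrace(True, True, False, False)` is attributed to retrieval rather than to writing, which is the kind of case behind finding (1): memory that is stored correctly can still fail downstream.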
WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
1️⃣ One-Sentence Summary
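This paper proposes an evaluation framework called WorldMemArena, which uses 400 multi-session, multimodal interaction tasks to systematically test and compare multimodal LLM agents across the four stages of memory writing, maintenance, retrieval, and use; it finds that even well-executed memory storage does not necessarily improve final task performance, and that current systems still fall clearly short in using visual evidence and in staying stable across domains.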