arXiv submission date: 2026-01-23
📄 Abstract - VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: this https URL.
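To make the interactive setting described in the abstract concrete, the sketch below shows what a multi-step evaluation loop over a visual environment can look like, assuming a standard Gymnasium-style reset/step interface. The agent interface, configuration keys, and success flag are hypothetical illustrations, not the actual VisGym API.

```python
# A minimal sketch of a multi-step interaction loop between a VLM agent and a
# Gymnasium-style visual environment. All names here (the agent interface, the
# config keys, the info["success"] field) are illustrative assumptions and not
# the actual VisGym API.

def run_episode(env, agent, max_steps=50):
    """Roll out one episode, passing each visual observation to the agent."""
    obs, info = env.reset()
    history = []                               # accumulated (observation, action) pairs
    for _ in range(max_steps):
        action = agent.act(obs, history)       # VLM chooses an action from the current image
        obs, reward, terminated, truncated, info = env.step(action)
        history.append((obs, action))
        if terminated or truncated:
            break
    return bool(info.get("success", False))    # hypothetical success flag


# The abstract mentions controls over difficulty, input representation,
# planning horizon, and feedback; a configuration along these lines might be:
example_config = {
    "difficulty": "easy",      # easy vs. hard configurations reported in the paper
    "input": "image",          # visually rendered vs. text/symbolic observations
    "horizon": 50,             # maximum number of interaction steps
    "feedback": "textual",     # whether textual feedback accompanies observations
}
```

The `history` list stands in for the interaction context whose handling the abstract flags as a failure mode: models reportedly do worse with an unbounded history than with truncated windows.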

Top-level tags: multi-modal agents, model evaluation
Detailed tags: vision-language models, interactive environments, benchmark, visual decision-making, supervised finetuning

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents


1️⃣ One-Sentence Summary

This paper introduces VisGym, a diverse testbed for evaluating and training vision-language models on complex interactive tasks. It finds that current frontier models perform poorly on tasks requiring multi-step visual decision-making, and it identifies their concrete failure modes and directions for improvement.

Source: arXiv:2601.16973