arXiv submission date: 2026-01-23
📄 Abstract - VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: this https URL.
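To make the interactive setting described in the abstract concrete, the sketch below shows what a multi-step evaluation loop over a visual environment can look like, assuming a standard Gymnasium-style reset/step interface. The agent interface, configuration keys, and success flag are hypothetical illustrations, not the actual VisGym API.

```python
# A minimal sketch of a multi-step interaction loop between a VLM agent and a
# Gymnasium-style visual environment. All names here (the agent interface, the
# config keys, the info["success"] field) are illustrative assumptions and not
# the actual VisGym API.

def run_episode(env, agent, max_steps=50):
    """Roll out one episode, passing each visual observation to the agent."""
    obs, info = env.reset()
    history = []                               # accumulated (observation, action) pairs
    for _ in range(max_steps):
        action = agent.act(obs, history)       # VLM chooses an action from the current image
        obs, reward, terminated, truncated, info = env.step(action)
        history.append((obs, action))
        if terminated or truncated:
            break
    return bool(info.get("success", False))    # hypothetical success flag


# The abstract mentions controls over difficulty, input representation,
# planning horizon, and feedback; a configuration along these lines might be:
example_config = {
    "difficulty": "easy",      # easy vs. hard configurations reported in the paper
    "input": "image",          # visually rendered vs. text/symbolic observations
    "horizon": 50,             # maximum number of interaction steps
    "feedback": "textual",     # whether textual feedback accompanies observations
}
```

The `history` list stands in for the interaction context whose handling the abstract flags as a failure mode: models reportedly do worse with an unbounded history than with truncated windows.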

Top-level tags: multi-modal agents, model evaluation
Detailed tags: vision-language models, interactive environments, benchmark, visual decision-making, supervised finetuning

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents


1️⃣ One-Sentence Summary

This paper introduces VisGym, a diverse testbed for evaluating and training vision-language models on complex interactive tasks. It finds that current frontier models perform poorly on tasks requiring multi-step visual decision-making, and it identifies their concrete failure modes and directions for improvement.

Source: arXiv:2601.16973