InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
1️⃣ One-Sentence Summary
This paper proposes InSight-o3, a multi-agent framework in which a purpose-trained agent, capable of following complex language instructions to perform "generalized visual search", helps existing frontier multimodal models analyze and reason over fine-grained image details more accurately, yielding significant gains on a range of complex visual reasoning tasks.
The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which achieves only 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher), for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at this https URL .
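Since the abstract describes vSearcher as a plug-and-play agent that a frontier model (acting as the vReasoner) can call, a rough sketch of such an interaction loop may help fix ideas. Note that the class names, the `SEARCH(...)` action protocol, and the PIL-style `crop()` call below are all assumptions made for illustration; the paper's actual interface is not specified in the abstract.

```python
# A minimal sketch of a vReasoner/vSearcher loop, assuming a simple
# text-action protocol. Every interface here (VisualSearcher, Reasoner,
# SEARCH(...), PIL-style image.crop) is hypothetical, not the paper's API.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """Bounding box returned by generalized visual search."""
    x0: int
    y0: int
    x1: int
    y1: int


class VisualSearcher:
    """Stand-in for the vSearcher: maps a free-form query (relational,
    fuzzy, or conceptual, per the paper) to an image region."""

    def search(self, image, query: str) -> Optional[Region]:
        raise NotImplementedError  # backed by an RL-trained multimodal LLM


class Reasoner:
    """Stand-in for a frontier multimodal model used as the vReasoner."""

    def step(self, views: List, context: str) -> str:
        # Returns either "SEARCH(<query>)" or a final answer string.
        raise NotImplementedError


def solve(image, question: str, reasoner: Reasoner,
          searcher: VisualSearcher, max_steps: int = 8) -> str:
    """Interleave reasoning with visual search: whenever the reasoner
    asks to inspect a region, locate it with the searcher, crop it, and
    feed the zoomed-in view back into the reasoning context."""
    views = [image]          # original image plus any searched crops
    context = question
    answer = ""
    for _ in range(max_steps):
        action = reasoner.step(views, context)
        if action.startswith("SEARCH(") and action.endswith(")"):
            query = action[len("SEARCH("):-1]
            region = searcher.search(image, query)
            if region is None:
                context += f"\n[no region found for: {query}]"
            else:
                # Assumes a PIL.Image-like crop() interface.
                views.append(image.crop((region.x0, region.y0,
                                         region.x1, region.y1)))
                context += f"\n[added zoomed view of: {query}]"
        else:
            return action    # the reasoner produced a final answer
        answer = action
    return answer
```

In this sketch, crops are appended to the list of views rather than replacing the original image, so the reasoner keeps global context while attending to the located detail, matching the abstract's description of interleaved attention to visual details across distinct image areas.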
Source: arXiv: 2512.18745