Abstract - Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models
Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which limits their usefulness in real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness on complex reasoning tasks. Inspired by how pigeons build and exploit cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a dynamic cognitive map that parameterizes the scene layout as object positions and orientations and serves as persistent memory for new observations. Second, we propose Spatial Assertion Codes (SAC), Python expressions that programmatically describe spatial relationships. Working together with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, outperforming the best current method by 29.5 accuracy points (a relative improvement of 53.2%) on the challenging Rotation subset. Our code and data are open-sourced at this https URL.
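To make the two components concrete, the following is a minimal sketch of how a cognitive map and assertion-based dense rewards could fit together. It is not the authors' implementation: the class and function names (`CognitiveMap`, `left_of`, `in_front_of`, `dense_reward`), the 2D coordinate convention, and the reward defined as the fraction of passing assertions are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectPose:
    x: float
    y: float
    heading_deg: float  # facing direction, counterclockwise from the +x axis


@dataclass
class CognitiveMap:
    """Hypothetical persistent scene memory: object name -> pose."""
    objects: dict = field(default_factory=dict)

    def update(self, name, x, y, heading_deg=0.0):
        # Insert a new object or overwrite its pose after a new observation.
        self.objects[name] = ObjectPose(x, y, heading_deg % 360.0)


def bearing(cmap, viewer, target):
    """Angle in degrees from the viewer's heading to the target, in [-180, 180).
    Positive values mean the target is to the viewer's left."""
    v, t = cmap.objects[viewer], cmap.objects[target]
    absolute = math.degrees(math.atan2(t.y - v.y, t.x - v.x))
    return (absolute - v.heading_deg + 180.0) % 360.0 - 180.0


def left_of(cmap, viewer, target):
    return 0.0 < bearing(cmap, viewer, target) < 180.0


def in_front_of(cmap, viewer, target):
    return abs(bearing(cmap, viewer, target)) < 90.0


def check_assertion(expr, cmap):
    """Evaluate one assertion string against the map; errors count as failures."""
    env = {
        "__builtins__": {},  # assumption: assertions may only call the helpers below
        "left_of": lambda a, b: left_of(cmap, a, b),
        "in_front_of": lambda a, b: in_front_of(cmap, a, b),
    }
    try:
        return bool(eval(expr, env))
    except Exception:
        return False


def dense_reward(assertions, cmap):
    """Assumed reward: fraction of intermediate assertions the map confirms."""
    if not assertions:
        return 0.0
    return sum(check_assertion(a, cmap) for a in assertions) / len(assertions)


# Toy scene: the agent stands at the origin facing +y.
scene = CognitiveMap()
scene.update("agent", 0.0, 0.0, heading_deg=90.0)
scene.update("chair", -1.0, 1.0)
scene.update("lamp", 1.0, 0.0)

steps = [
    "left_of('agent', 'chair')",      # holds
    "in_front_of('agent', 'chair')",  # holds
    "left_of('agent', 'lamp')",       # fails: the lamp is to the right
]
reward = dense_reward(steps, scene)

# After the agent turns around, the same chair is behind it and to its right.
scene.update("agent", 0.0, 0.0, heading_deg=270.0)
chair_left_after_turn = left_of(scene, "agent", "chair")
```

Because each intermediate assertion is scored separately, a reasoning trace with one wrong step still earns partial credit, which is the sense in which such a signal is denser than a single right-or-wrong reward on the final answer.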
Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models
1️⃣ One-sentence summary
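Drawing on how pigeons build and use cognitive maps to navigate, this paper proposes an agentic framework in which a vision-language model (VLM) actively explores its environment, recording the scene layout in a dynamic cognitive map and training with Spatial Assertion Codes (SAC) as dense reward signals, which markedly improves spatial reasoning: it reaches 80.5% accuracy on the MindCube benchmark and raises accuracy on the most challenging Rotation subset by a relative 53.2%.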