arXiv submission date: 2025-12-04
📄 Abstract - COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
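
The abstract describes the approach only at a high level, and no code is given here. As a rough, hypothetical sketch of what "adaptive, interleaved reasoning with auxiliary modality generation" could look like at inference time, the Python snippet below stubs out a unified model that, at each step, chooses whether to emit a text reasoning step, a depth map, a segmentation map, or the final answer. All names (UnifiedMLLM, Step, the step kinds) and the hard-coded trajectory are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not COOPER's released code): an inference loop where a
# unified multimodal model interleaves text reasoning with on-demand
# generation of auxiliary modalities (depth, segmentation).

from dataclasses import dataclass


@dataclass
class Step:
    kind: str        # "text" | "depth" | "segmentation" | "answer"
    content: object  # reasoning text or a generated auxiliary map


@dataclass
class UnifiedMLLM:
    """Stand-in for a unified MLLM; a real model would decode these choices itself."""
    max_steps: int = 8

    def next_step(self, image, question, history):
        # A real model would adaptively pick the next action during decoding;
        # here we hard-code one plausible trajectory purely for illustration.
        emitted = {s.kind for s in history}
        if "depth" not in emitted:
            return Step("depth", "<generated depth map>")
        if "segmentation" not in emitted:
            return Step("segmentation", "<generated segmentation masks>")
        if "text" not in emitted:
            return Step("text", "Reason over depth and masks about relative distances.")
        return Step("answer", "The chair is roughly 1.5 m from the table.")


def interleaved_inference(model, image, question):
    """Run interleaved reasoning until the model emits an answer or hits the step budget."""
    history = []
    for _ in range(model.max_steps):
        step = model.next_step(image, question, history)
        history.append(step)
        if step.kind == "answer":
            return step.content, history
    return None, history  # no answer produced within the budget


if __name__ == "__main__":
    answer, trace = interleaved_inference(
        UnifiedMLLM(), image="<rgb image>", question="How far is the chair from the table?"
    )
    for s in trace:
        print(f"[{s.kind}] {s.content}")
    print("Answer:", answer)
```

The point of the sketch is the control flow: auxiliary modalities are generated by the same model inside the reasoning loop, rather than being supplied as extra inputs up front.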

Top-level tags: multi-modal, natural language processing, model training
Detailed tags: spatial reasoning, multimodal LLM, depth estimation, segmentation, visual question answering

COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence


1️⃣ One-Sentence Summary

This paper proposes COOPER, a unified multimodal large language model that strengthens spatial perception by incorporating depth and segmentation information and adopts an adaptive interleaved reasoning strategy, substantially improving its understanding of and reasoning about 3D spatial relationships.


Source: arXiv 2512.04563