LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
1️⃣ One-Sentence Summary
This paper surveys how Large Multimodal Models (LMMs) can be combined with object-centric vision techniques to address the shortcomings of existing models in precise object grounding, fine-grained spatial reasoning, and controllable visual manipulation, with the aim of advancing more precise and reliable multimodal systems.
Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.
Source: arXiv: 2604.11789