SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
1️⃣ One-Sentence Summary
This paper proposes the SpatialFusion framework, which introduces a parallel spatial transformer into a unified image generation model to learn depth information and injects the resulting geometric constraints into the diffusion model. The generated images significantly surpass existing models such as GPT-4o in spatial consistency, with no additional inference overhead.
Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone through a specialized depth adapter, providing precise spatial constraints for spatially-coherent image generation. Through a progressive two-stage training strategy, SpatialFusion significantly enhances performance on spatially-aware benchmarks, notably outperforming leading models such as GPT-4o. Additionally, it achieves generalized performance gains across both text-to-image generation and image editing scenarios, all while maintaining negligible inference overhead.
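The core of the Mixture-of-Transformers (MoT) design described above is that the parallel spatial transformer shares self-attention with the MLLM, so spatial tokens can attend to the semantic context while each branch keeps its own modality-specific weights. The following is a minimal NumPy sketch of that shared-attention step; all names, shapes, and the single-head formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_self_attention(sem, spa, Wq, Wk, Wv):
    """Single-head self-attention shared across two token streams (sketch).

    Semantic (MLLM) and spatial tokens are concatenated so the spatial
    branch can attend to the rich semantic context, and vice versa.
    """
    x = np.concatenate([sem, spa], axis=0)            # (n_sem + n_spa, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]))     # joint attention map
    out = att @ v
    return out[: len(sem)], out[len(sem):]            # split back per branch

# Hypothetical token counts and dimensions, for illustration only.
d = 16
rng = np.random.default_rng(0)
sem_tokens = rng.standard_normal((8, d))   # semantic tokens from the MLLM
spa_tokens = rng.standard_normal((4, d))   # parallel spatial-transformer tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

sem_out, spa_out = shared_self_attention(sem_tokens, spa_tokens, Wq, Wk, Wv)
# In the MoT design, each branch would then apply its own (branch-specific)
# feed-forward weights to its slice of the shared-attention output.
print(sem_out.shape, spa_out.shape)
```

In this sketch the spatial branch's output is a function of both token streams, which is how the spatial transformer can derive depth estimates from semantic context; the paper's actual model operates at full transformer scale with its own layer structure.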
Source: arXiv: 2604.26341