arXiv submission date: 2026-04-29
📄 Abstract - SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone through a specialized depth adapter, providing precise spatial constraints for spatially-coherent image generation. Through a progressive two-stage training strategy, SpatialFusion significantly enhances performance on spatially-aware benchmarks, notably outperforming leading models such as GPT-4o. Additionally, it achieves generalized performance gains across both text-to-image generation and image editing scenarios, all while maintaining negligible inference overhead.
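The key architectural idea in the abstract, a Mixture-of-Transformers where a parallel spatial stream shares self-attention with the MLLM stream but keeps its own feed-forward weights, can be illustrated with a minimal toy sketch. This is a hypothetical NumPy illustration of the shared-attention mechanism only (single head, no normalization, no depth adapter); all class and variable names here are invented for illustration and do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoTBlock:
    """Toy Mixture-of-Transformers block: semantic and spatial token
    streams attend jointly through one shared self-attention, then pass
    through stream-specific feed-forward weights (the per-expert part)."""

    def __init__(self, d, rng):
        # shared attention projections (one head, no bias, for brevity)
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        # separate feed-forward weights per stream ("expert")
        self.ffn_sem = rng.standard_normal((d, d)) / np.sqrt(d)
        self.ffn_spa = rng.standard_normal((d, d)) / np.sqrt(d)

    def __call__(self, sem_tokens, spa_tokens):
        # concatenate both streams into one joint sequence
        x = np.concatenate([sem_tokens, spa_tokens], axis=0)
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        # shared self-attention: spatial tokens can read semantic context
        h = attn @ v
        n = sem_tokens.shape[0]
        # split back and apply stream-specific feed-forward (ReLU)
        sem_out = np.maximum(h[:n] @ self.ffn_sem, 0.0)
        spa_out = np.maximum(h[n:] @ self.ffn_spa, 0.0)
        return sem_out, spa_out

rng = np.random.default_rng(0)
block = MoTBlock(8, rng)
sem = rng.standard_normal((5, 8))   # MLLM (semantic) tokens
spa = rng.standard_normal((3, 8))   # spatial-transformer tokens
sem_out, spa_out = block(sem, spa)
print(sem_out.shape, spa_out.shape)  # (5, 8) (3, 8)
```

The design choice this sketch mirrors is that the spatial stream adds no new attention pattern of its own: it reuses the joint attention map, so the depth-predicting tokens are conditioned on the same semantic context the MLLM builds, while the per-stream feed-forward weights let each stream specialize.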

Top-level tags: computer vision, multi-modal, image generation
Detailed tags: 3d geometric awareness, spatial understanding, depth estimation, mixture-of-transformers, unified generation

SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness


1️⃣ One-sentence summary

This paper proposes SpatialFusion, a framework that augments a unified image generation model with a parallel spatial transformer that learns depth information and injects the resulting geometric constraints into the diffusion backbone, so that generated images significantly surpass existing models such as GPT-4o in spatial consistency while adding only negligible inference overhead.

Source: arXiv 2604.26341