arXiv submission date: 2026-06-11
📄 Abstract - Modality Forcing for Scalable Spatial Generation

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception.
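
The abstract reports results in AbsRel and stresses training on sparse depth. As a rough illustration, the sketch below computes the standard absolute relative error, mean(|pred - gt| / gt), over only those pixels that carry a ground-truth depth value, and shows how a relative reduction such as 57% would be calculated. The function names, the use of `None` to mark missing depth, and all numeric values are assumptions made for this example; none of them come from the paper.

```python
from typing import Optional, Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[Optional[float]]) -> float:
    """Mean absolute relative error over pixels that have ground-truth depth.

    `gt` uses None for pixels without a depth measurement (sparse depth).
    This masking convention is an assumption of the sketch.
    """
    errors = [
        abs(p - g) / g
        for p, g in zip(pred, gt)
        if g is not None and g > 0
    ]
    if not errors:
        raise ValueError("no valid ground-truth depth pixels")
    return sum(errors) / len(errors)


def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline


# Toy values invented for illustration only.
pred = [2.0, 4.4, 9.0, 1.0]
gt = [2.0, 4.0, None, 1.25]  # the third pixel has no depth measurement
score = abs_rel(pred, gt)

# Hypothetical baseline and model scores, chosen to give a 57% reduction.
reduction = relative_reduction(0.10, 0.043)
```

Because unmeasured pixels are skipped rather than filled in, the same function applies unchanged to dense and sparse ground truth.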

Top-level tags: computer vision, multi-modal, model training
Detailed tags: text-to-image, depth prediction, modality forcing, spatial generation, diffusion transformer

Modality Forcing for Scalable Spatial Generation


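1️⃣ One-sentence summary

This paper proposes Modality Forcing, a simple post-training method that assigns separate noise levels to image and depth, so that a pre-trained text-to-image model can generate images and depth maps jointly or one conditioned on the other without relying on dense depth data, substantially improving depth prediction accuracy while preserving the model's scalability.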
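
One way to picture "separate noise levels per modality" is sketched below. It assumes noise levels in [0, 1], with 0 meaning clean and 1 meaning pure noise, a linear interpolation between signal and Gaussian noise, and a conditioning modality that is held at noise level 0. The task names, the schedule, and the interpolation are hypothetical choices made for illustration and may differ from the paper's actual recipe.

```python
import random
from typing import Dict, List


def sample_training_levels(rng: random.Random) -> Dict[str, float]:
    """Draw an independent noise level in [0, 1) for each modality.

    Independent sampling is an assumption of this sketch.
    """
    return {"image": rng.random(), "depth": rng.random()}


def inference_levels(task: str, t: float) -> Dict[str, float]:
    """Noise levels at sampling step t for a task (hypothetical task names)."""
    if task == "joint":            # generate image and depth together
        return {"image": t, "depth": t}
    if task == "image_to_depth":   # image is the clean condition
        return {"image": 0.0, "depth": t}
    if task == "depth_to_image":   # depth is the clean condition
        return {"image": t, "depth": 0.0}
    raise ValueError(f"unknown task: {task}")


def add_noise(x: List[float], t: float, rng: random.Random) -> List[float]:
    """Blend a clean signal (t=0) with Gaussian noise (t=1), assumed linear."""
    return [(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in x]


rng = random.Random(0)
image_tokens = [0.5, -0.2, 0.9]   # toy stand-ins for image latents
depth_tokens = [1.5, 2.0, 3.5]    # toy stand-ins for depth latents

levels = inference_levels("image_to_depth", 0.7)
noisy_image = add_noise(image_tokens, levels["image"], rng)  # stays clean
noisy_depth = add_noise(depth_tokens, levels["depth"], rng)  # gets noised
```

Under these assumptions a single denoiser that receives both noise levels can serve every task: the choice of which modality sits at level 0 selects between depth prediction, depth-conditioned image generation, and joint generation.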

Source: arXiv: 2606.13676