arXiv submission date: 2026-01-22
📄 Abstract - Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
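The abstract singles out dimension-dependent noise scheduling as the one ImageNet-era design choice that remains critical at scale. The intuition is that a higher-dimensional latent retains more recoverable signal at a given noise level, so sampling timesteps are shifted toward the noisier end as dimensionality grows. Below is a minimal sketch of one such shift rule, modeled on the resolution-dependent timestep shift used in rectified-flow models; the `alpha = sqrt(dim / base_dim)` choice and the `base_dim` value are illustrative assumptions, not the paper's exact recipe.

```python
import math

def shift_timestep(t: float, dim: int, base_dim: int = 4096) -> float:
    """Shift a flow-matching timestep t in [0, 1] toward higher noise
    for latents with more dimensions than a reference base_dim.

    alpha > 1 (dim > base_dim) pushes t upward (noisier);
    alpha = 1 leaves the schedule unchanged.
    """
    alpha = math.sqrt(dim / base_dim)
    return alpha * t / (1 + (alpha - 1) * t)

# At the reference dimensionality the schedule is the identity;
# a 4x larger latent dimension (alpha = 2) shifts t = 0.5 up to ~0.667.
print(shift_timestep(0.5, 4096))   # unchanged
print(shift_timestep(0.5, 16384))  # shifted toward higher noise
```

The monotone form `alpha * t / (1 + (alpha - 1) * t)` maps [0, 1] onto [0, 1] for any `alpha > 0`, so it can be applied uniformly to sampled timesteps without clipping.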

Top-level tags: model training, multi-modal, computer vision
Detailed tags: text-to-image, diffusion models, representation autoencoders, scaling laws, latent space

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders


1️⃣ One-sentence summary

This work finds that, for large-scale text-to-image generation, models built on Representation Autoencoders are simpler and stronger than the current mainstream VAE-based approach: they train faster, generate higher-quality images, and effectively resist overfitting.

Source: arXiv:2601.16208