PixelDiT: Pixel Diffusion Transformers for Image Generation

📄 Abstract - PixelDiT: Pixel Diffusion Transformers for Image Generation

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
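The dual-level design in the abstract can be pictured as two views of the same image: a short sequence of patch tokens for the patch-level DiT and, within each patch, a set of per-pixel tokens for the pixel-level DiT. The following is a minimal sketch of that tokenization only, not the authors' implementation. The patch size of 16, the NumPy layout, and the helper names `patchify` and `unpatchify` are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into (num_patches, patch*patch, C) pixel tokens."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch size")
    gh, gw = h // patch, w // patch
    # Separate the patch grid axes from the within-patch axes.
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch, c)


def unpatchify(tokens: np.ndarray, patch: int, h: int, w: int) -> np.ndarray:
    """Inverse of patchify: rebuild the (H, W, C) image from pixel tokens."""
    c = tokens.shape[-1]
    gh, gw = h // patch, w // patch
    x = tokens.reshape(gh, gw, patch, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)


# Hypothetical setup: a 256x256 RGB image and an assumed patch size of 16.
image = np.random.default_rng(0).random((256, 256, 3))

# Pixel-level view: one token per pixel, grouped by the patch it belongs to.
pixel_tokens = patchify(image, patch=16)

# Patch-level view: one flattened token per patch (before any learned embedding).
patch_tokens = pixel_tokens.reshape(pixel_tokens.shape[0], -1)

# No autoencoder is involved, so the mapping back to pixels is exact.
restored = unpatchify(pixel_tokens, patch=16, h=256, w=256)
```

Under these assumptions the patch-level sequence has 16 × 16 = 256 tokens, while the pixel-level view covers all 256 × 256 = 65,536 pixels, which illustrates why coarse global modeling and fine texture modeling are handled at separate levels.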


1️⃣ One-Sentence Summary

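This paper proposes PixelDiT, a new image generation model that drops the compression autoencoder used in the conventional two-stage pipeline and trains end-to-end directly in raw pixel space; its dual-level transformer design combines global semantics with local detail, preserving fine image texture while outperforming previous pixel-space generative models.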
