arXiv submission date: 2026-01-25
📄 Abstract - VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training

Denoising-based diffusion transformers, despite their strong generation performance, suffer from slow training convergence. Existing remedies such as REPA (which relies on external representation encoders) or SRA (which requires a dual-model setup) inevitably incur heavy computational overhead during training because of these external dependencies. To tackle these challenges, this paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA leverages the features of an off-the-shelf pre-trained Variational Autoencoder (VAE): because the VAE is trained for reconstruction, its features inherently encode visual priors such as rich texture details, structural patterns, and basic semantic information. Specifically, VAE-REPA aligns the intermediate latent features of the diffusion transformer with the VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.
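The core mechanism, projecting intermediate DiT features into the frozen VAE's feature space and penalizing misalignment, is simple enough to sketch. The PyTorch snippet below is a minimal illustration rather than the paper's code: the projector architecture, the feature dimensions, and the use of negative cosine similarity as the alignment loss (a choice borrowed from REPA) are all assumptions, since this summary does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VAEAlignmentHead(nn.Module):
    """Hypothetical projection layer for VAE-REPA-style alignment.

    Maps intermediate diffusion-transformer (DiT) tokens into the frozen
    VAE encoder's feature space. The dimensions and the 2-layer MLP
    design are illustrative assumptions, not the paper's specification.
    """

    def __init__(self, dit_dim: int = 1152, vae_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dit_dim, dit_dim),
            nn.SiLU(),
            nn.Linear(dit_dim, vae_dim),
        )

    def forward(self, dit_feats: torch.Tensor) -> torch.Tensor:
        # dit_feats: (B, N, dit_dim) tokens from an intermediate DiT block.
        return self.proj(dit_feats)


def alignment_loss(projected: torch.Tensor, vae_feats: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between projected DiT tokens and VAE
    features, both shaped (B, N, vae_dim); the paper's exact loss may differ."""
    projected = F.normalize(projected, dim=-1)
    vae_feats = F.normalize(vae_feats, dim=-1)
    return -(projected * vae_feats).sum(dim=-1).mean()


# Schematic training step: the VAE is frozen, so the only extra training
# cost is the small projector plus this auxiliary term.
#
#   vae_feats = frozen_vae.encode(clean_image).detach()  # no VAE gradients
#   total = diffusion_loss + lambda_align * alignment_loss(
#       head(dit_intermediate_feats), vae_feats)
#
# lambda_align is a hyperparameter; its value is not given in this summary.
```

Because the VAE encoder is already part of any latent-diffusion pipeline, reusing its features avoids the external-encoder cost that REPA pays, which is consistent with the abstract's reported ~4% extra GFLOPs.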

Top-level tags: model training, computer vision, AIGC
Detailed tags: diffusion models, training acceleration, variational autoencoder, feature alignment, efficient training

VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training


1️⃣ One-sentence summary

This paper proposes VAE-REPA, a lightweight method that aligns the intermediate features of a diffusion transformer with the features of a pre-trained variational autoencoder during training, significantly improving training efficiency and generation quality without relying on additional external models or a complex dual-model architecture.

Source: arXiv:2601.17830