arXiv submission date: 2026-01-09
📄 Abstract - Boosting Latent Diffusion Models via Disentangled Representation Alignment

Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, encoding attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning by aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics, facilitating effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers (SiTs); experiments show Send-VAE significantly speeds up training and achieves state-of-the-art FID scores of 1.21 and 1.75 with and without classifier-free guidance, respectively, on ImageNet 256x256.
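
The abstract describes a non-linear mapper network that transforms VAE latents so they can be aligned with features from a frozen Vision Foundation Model. The PyTorch snippet below is a minimal sketch of that idea, assuming a REPA-style negative cosine-similarity objective over patch tokens; the module names (`MapperMLP`, `alignment_loss`), the MLP depth, and the feature shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of mapper-based representation alignment:
# a small non-linear MLP transforms VAE latents and is trained to match frozen
# VFM patch features with a cosine-similarity loss. Names and shapes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MapperMLP(nn.Module):
    """Non-linear mapper from VAE latent channels to the VFM feature dimension."""
    def __init__(self, latent_dim: int, vfm_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, vfm_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, C, H, W) VAE latent -> per-token features (B, H*W, vfm_dim)
        tokens = z.flatten(2).transpose(1, 2)  # (B, H*W, C)
        return self.net(tokens)

def alignment_loss(z: torch.Tensor, vfm_feats: torch.Tensor,
                   mapper: MapperMLP) -> torch.Tensor:
    """Negative cosine similarity between mapped latents and frozen VFM features.

    vfm_feats: (B, N, vfm_dim) patch features from a frozen VFM (e.g. a DINO-style
    encoder), assumed to be resized so N matches the H*W of the latent grid.
    """
    pred = F.normalize(mapper(z), dim=-1)
    target = F.normalize(vfm_feats.detach(), dim=-1)
    return -(pred * target).sum(dim=-1).mean()
```

In such a setup the alignment term would typically be added to the usual VAE reconstruction and KL objectives with a weighting coefficient; the paper's exact loss composition is not given in the abstract.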

Top-level tags: model training, computer vision, AIGC
Detailed tags: latent diffusion models, variational autoencoders, semantic disentanglement, representation alignment, image generation

Boosting Latent Diffusion Models via Disentangled Representation Alignment


1️⃣ One-sentence summary

This paper proposes Send-VAE, a new image tokenizer that aligns its latent space with the semantic hierarchy of Vision Foundation Models to learn disentangled, attribute-level representations, which in turn significantly improves both the image generation quality and the training efficiency of latent diffusion models.
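
The abstract also mentions measuring semantic disentanglement via linear probing on attribute prediction tasks. Below is a hedged sketch of what such a probe could look like, assuming the frozen VAE latents are global-average-pooled and a single linear layer is fit on attribute labels; the function names, pooling choice, and hyperparameters are placeholders rather than the paper's protocol.

```python
# Hedged sketch of a linear-probing evaluation: freeze the VAE encoder, pool its
# latent grid into a vector, and train one linear layer to predict attribute
# labels. Higher probe accuracy is read as better semantic disentanglement.
import torch
import torch.nn as nn

@torch.no_grad()
def extract_latent(vae_encoder: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Encode images with the frozen VAE and global-average-pool the latent grid."""
    z = vae_encoder(images)        # assumed shape: (B, C, H, W)
    return z.mean(dim=(2, 3))      # (B, C)

def train_linear_probe(vae_encoder: nn.Module, loader, latent_dim: int,
                       num_classes: int, epochs: int = 10, lr: float = 1e-3,
                       device: str = "cuda") -> nn.Linear:
    probe = nn.Linear(latent_dim, num_classes).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    vae_encoder.eval().to(device)
    for _ in range(epochs):
        for images, attr_labels in loader:   # attribute-level labels per image
            feats = extract_latent(vae_encoder, images.to(device))
            loss = ce(probe(feats), attr_labels.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```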

Source: arXiv 2601.05823