arXiv submission date: 2025-12-19
📄 Abstract - Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
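The abstract describes a semantic-pixel reconstruction objective: the compact latent must both decode back to pixels and stay aligned with the frozen representation encoder's features. The paper does not give the exact loss, so the sketch below is a hypothetical minimal form, assuming an L2 pixel term plus a cosine-alignment semantic term; the function name, weights, and feature shapes (a 16×16 spatial grid of 96-channel latents) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def semantic_pixel_loss(pixels_pred, pixels_gt, feats_pred, feats_gt,
                        w_pix=1.0, w_sem=1.0):
    """Hypothetical combined objective: pixel-level L2 reconstruction plus
    cosine alignment between decoded features and the frozen encoder's
    target features. Weights w_pix / w_sem are illustrative."""
    # Pixel term: mean squared error over the reconstructed image.
    pix = np.mean((pixels_pred - pixels_gt) ** 2)

    # Semantic term: 1 - mean cosine similarity over feature vectors
    # (e.g. one 96-dim vector per 16x16 latent position).
    a = feats_pred / (np.linalg.norm(feats_pred, axis=-1, keepdims=True) + 1e-8)
    b = feats_gt / (np.linalg.norm(feats_gt, axis=-1, keepdims=True) + 1e-8)
    sem = 1.0 - np.mean(np.sum(a * b, axis=-1))

    return w_pix * pix + w_sem * sem

# Shapes consistent with the abstract's latent: a 256x256 image maps to a
# 16x16 spatial grid (16x downsampling) with 96 channels per position.
pixels = np.zeros((3, 256, 256))
feats = np.ones((16 * 16, 96))
loss = semantic_pixel_loss(pixels, pixels, feats, feats)  # identical inputs -> 0.0
```

Balancing the two terms is the crux of the design: the semantic term keeps the latent space discriminative, while the pixel term supplies the fine-grained geometry and texture the encoder alone lacks.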

Top-level tags: computer vision, model training, multi-modal
Detailed tags: latent diffusion models, representation learning, text-to-image generation, image editing, semantic reconstruction

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing


1️⃣ One-sentence summary

This paper proposes a method that, by introducing a joint semantic-pixel reconstruction objective, adapts encoder features originally trained for image recognition into a representation that is both compact and rich in detail, enabling high-quality text-to-image generation and image editing with strong performance.

Source: arXiv 2512.17909