arXiv submission date: 2025-12-12
📄 Abstract - SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.

Top-level tags: computer vision, model training, multi-modal
Detailed tags: text-to-image, latent diffusion, visual foundation models, representation learning, generative AI

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder


1️⃣ One-Sentence Summary

This paper proposes SVG-T2I, a method that bypasses the traditional variational autoencoder and trains a large text-to-image diffusion model directly in the representation space of a visual foundation model, achieving generation quality competitive with existing approaches.
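To make the core idea concrete, here is a minimal, hypothetical sketch of what training a text-conditioned diffusion model directly on frozen VFM features (rather than VAE latents) could look like. All module names, feature sizes, and the flow-matching-style objective below are illustrative assumptions, not the paper's actual SVG-T2I implementation.

```python
# Toy sketch: diffusion training in a (frozen) VFM feature space instead of a VAE latent space.
# Assumptions: module shapes, names, and the velocity-prediction objective are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVFMEncoder(nn.Module):
    """Stand-in for a frozen visual foundation model (e.g. a ViT-style encoder).
    Maps RGB images to a feature map that replaces the usual VAE latent."""
    def __init__(self, feat_dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, feat_dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False  # the VFM stays frozen

    def forward(self, x):
        return self.proj(x)  # (B, feat_dim, H/patch, W/patch)

class FeatureDecoder(nn.Module):
    """Lightweight decoder trained to map VFM features back to pixels,
    taking the role of the VAE decoder."""
    def __init__(self, feat_dim=768, patch=16):
        super().__init__()
        self.up = nn.ConvTranspose2d(feat_dim, 3, kernel_size=patch, stride=patch)

    def forward(self, z):
        return torch.tanh(self.up(z))

class TextConditionedDenoiser(nn.Module):
    """Toy denoiser operating on VFM features, conditioned on a text embedding."""
    def __init__(self, feat_dim=768, text_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, feat_dim)
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )

    def forward(self, z_t, t, text_emb):
        cond = self.text_proj(text_emb)[:, :, None, None]
        return self.net(z_t + cond + t[:, None, None, None])

def training_step(encoder, denoiser, images, text_emb):
    """One flow-matching-style training step carried out directly in feature space."""
    with torch.no_grad():
        z0 = encoder(images)                      # clean VFM features
    noise = torch.randn_like(z0)
    t = torch.rand(z0.size(0), device=z0.device)  # random timesteps in [0, 1]
    z_t = (1 - t[:, None, None, None]) * z0 + t[:, None, None, None] * noise
    target = noise - z0                           # velocity target
    pred = denoiser(z_t, t, text_emb)
    return F.mse_loss(pred, target)

# Smoke test with random dummy data.
enc, den = FrozenVFMEncoder(), TextConditionedDenoiser()
imgs = torch.randn(2, 3, 256, 256)
txt = torch.randn(2, 512)
loss = training_step(enc, den, imgs, txt)
loss.backward()
print(float(loss))
```

In this sketch the VFM encoder replaces the VAE encoder, a small trainable decoder maps features back to pixels, and the diffusion objective operates entirely on the frozen feature maps; how SVG-T2I actually defines these components is detailed in the paper itself.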


Source: arXiv 2512.11749