arXiv submission date: 2025-12-18
📄 Abstract - REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at this https URL.
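
The following is a minimal, hypothetical sketch (not the authors' released code) of the idea described in the abstract: a small convolutional "semantic compressor" that nonlinearly aggregates multi-layer VFM patch features into a low-dimensional, spatially structured map, which is then stacked with the VAE latent so a single diffusion backbone models both jointly. All module names, channel sizes, and the concatenation scheme are assumptions for illustration only.

```python
# Hypothetical sketch of REGLUE's semantic compressor and latent entanglement.
# Shapes and layer choices are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class SemanticCompressor(nn.Module):
    def __init__(self, vfm_dim: int, num_layers: int, out_channels: int = 8):
        super().__init__()
        # Fuse patch features from several VFM layers, then compress channel-wise
        # into a low-dimensional, spatially structured semantic map.
        self.fuse = nn.Conv2d(vfm_dim * num_layers, 256, kernel_size=1)
        self.compress = nn.Sequential(
            nn.SiLU(),
            nn.Conv2d(256, 128, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(128, out_channels, kernel_size=1),
        )

    def forward(self, vfm_feats: list[torch.Tensor]) -> torch.Tensor:
        # vfm_feats: list of (B, vfm_dim, H, W) feature maps, one per VFM layer
        # (patch tokens reshaped back onto the spatial grid).
        x = torch.cat(vfm_feats, dim=1)
        return self.compress(self.fuse(x))


# "Entangling" with the VAE latent: the compressed semantic map is concatenated
# with the image latent along the channel axis, so one diffusion backbone
# (e.g. SiT) denoises image latents and semantics together.
B, H, W = 2, 16, 16
vae_latent = torch.randn(B, 4, H, W)                        # VAE image latent
vfm_feats = [torch.randn(B, 768, H, W) for _ in range(3)]   # hypothetical VFM layer features
sem = SemanticCompressor(vfm_dim=768, num_layers=3)(vfm_feats)
joint_latent = torch.cat([vae_latent, sem], dim=1)          # (B, 4 + 8, H, W) input to the diffusion model
```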

Top-level tags: model training, computer vision, multi-modal
Detailed tags: latent diffusion, semantic compression, representation learning, vision foundation models, image generation

REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion


1️⃣ One-Sentence Summary

This paper proposes a method called REGLUE that jointly models image latent features together with global and local semantics extracted by vision foundation models within a single diffusion framework, significantly improving image generation quality and training efficiency.


Source: arXiv:2512.16636