arXiv submission date: 2026-04-07
📄 Abstract - Controllable Image Generation with Composed Parallel Token Prediction

Conditional discrete generative models struggle to faithfully compose multiple input conditions. To address this, we derive a theoretically-grounded formulation for composing discrete probabilistic generative processes, with masked generation (absorbing diffusion) as a special case. Our formulation enables precise specification of novel combinations and numbers of input conditions that lie outside the training data, with concept weighting enabling emphasis or negation of individual conditions. In synergy with the richly compositional learned vocabulary of VQ-VAE and VQ-GAN, our method attains a $63.4\%$ relative reduction in error rate compared to the previous state-of-the-art, averaged across 3 datasets (positional CLEVR, relational CLEVR and FFHQ), simultaneously obtaining an average absolute FID improvement of $-9.58$. Meanwhile, our method offers a $2.3\times$ to $12\times$ real-time speed-up over comparable methods, and is readily applied to an open pre-trained discrete text-to-image model for fine-grained control of text-to-image generation.
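The abstract describes composing multiple discrete generative conditions, with per-concept weights that can emphasize or negate individual conditions. The paper's exact formulation is not given here, but a minimal product-of-experts-style sketch, assuming composition reduces to a weighted sum of per-condition log-probabilities over the token vocabulary (the function name, shapes, and this reduction are illustrative assumptions, not the paper's method), could look like:

```python
import numpy as np

def compose_token_probs(cond_logits, weights):
    """Compose per-condition token distributions by weighted
    summation of log-probabilities (product-of-experts style).

    cond_logits: (num_conditions, vocab_size) raw logits,
                 one row per input condition.
    weights:     (num_conditions,) concept weights; a positive
                 weight emphasizes a condition, a negative
                 weight negates it.
    Returns a normalized distribution over the vocabulary.
    """
    # normalize each condition's logits into log-probabilities
    log_probs = cond_logits - np.logaddexp.reduce(
        cond_logits, axis=-1, keepdims=True)
    # weighted sum of log-probabilities across conditions
    combined = np.tensordot(weights, log_probs, axes=1)
    # renormalize so the result is a valid distribution
    combined -= np.logaddexp.reduce(combined)
    return np.exp(combined)

# toy example: two conditions over a 4-token vocabulary
logits = np.array([[2.0, 0.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0, 0.0]])
probs = compose_token_probs(logits, np.array([1.0, 1.0]))
```

With equal weights the composed distribution favors tokens supported by both conditions; flipping a weight's sign would instead push probability mass away from that condition's preferred tokens.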

Top-level tags: computer vision, model training, AIGC
Detailed tags: controllable image generation, masked generation, diffusion models, VQ-VAE, compositional generation

Controllable Image Generation with Composed Parallel Token Prediction


1️⃣ One-sentence summary

This paper proposes a new controllable image generation method that composes multiple input conditions (such as object positions, relations, or text descriptions) more precisely when generating images. It not only significantly outperforms prior methods in generation quality, but is also faster, and can be applied directly to an existing text-to-image model for fine-grained control.

Source: arXiv 2604.05730