Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders
1️⃣ One-Sentence Summary
This paper proposes a new method called "Think-Then-Generate," which has a large language model first reason about and rewrite the user's text prompt before guiding image generation, significantly improving the factual consistency, semantic alignment, and visual realism of the generated images.
Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the hidden states of the rewritten prompts then serve as the diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized via Dual-GRPO to ensure faithful reasoning about the context and accurate rendering of the semantics. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving a WISE score of 0.79, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
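To make the two-stage flow concrete, below is a minimal sketch of the think-then-generate conditioning pipeline described in the abstract: the LLM first reasons about and rewrites the raw prompt, then the hidden states of the rewritten prompt are extracted as the diffusion condition. This is an illustrative reconstruction, not the paper's implementation; the model name, system instruction, and the `diffusion_backbone` call are placeholders, and only the overall rewrite-then-encode structure is taken from the abstract.

```python
# Hedged sketch of the think-then-generate (T2G) conditioning flow.
# Assumptions (not from the paper): a HuggingFace causal LLM stands in for the
# LLM encoder, and the diffusion backbone is any T2I DM that cross-attends to
# text hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder LLM encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def think_then_rewrite(raw_prompt: str) -> str:
    """Step 1: the LLM reasons about the raw prompt and emits a rewritten,
    visually explicit prompt (the 'think-then-rewrite' pattern)."""
    messages = [
        {"role": "system", "content": "Reason about what the prompt implies "
         "visually, then rewrite it as an explicit image description."},
        {"role": "user", "content": raw_prompt},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True)
    out = llm.generate(input_ids, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens (the rewritten prompt).
    return tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)

def encode_rewritten_prompt(rewritten: str) -> torch.Tensor:
    """Step 2: the LLM's hidden states over the rewritten prompt become
    the conditioning signal for the diffusion backbone."""
    ids = tokenizer(rewritten, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = llm(ids, output_hidden_states=True).hidden_states[-1]
    return hidden  # shape: (1, seq_len, d_model)

# Example: a prompt that needs world knowledge before it can be depicted.
raw = "the flower that blooms at night and closes at dawn"
cond = encode_rewritten_prompt(think_then_rewrite(raw))
# image = diffusion_backbone(cond)  # hypothetical: consumed via cross-attention
```

In this reading, the design choice is that the diffusion model never sees the raw prompt: it conditions only on the states of the reasoned, rewritten prompt, which is what lets the LLM's world knowledge (e.g., resolving the riddle-like prompt above to a specific flower) reach the pixels. The subsequent Dual-GRPO stage described in the abstract would then co-optimize both components, rewarding the encoder with image-grounded signals and the backbone for semantic and visual fidelity.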
Source: arXiv 2601.10332