📄 Abstract - DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To address this, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual content in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structured visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, which targets three atomic capabilities: general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves substantial gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other CoT-based generation methods.
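For readability, the minimal Python sketch below lays out the draft-verify-refine control flow described in the abstract. All function names (`generate_draft`, `verify_alignment`, `refine_with_super_resolution`) and the string placeholders are hypothetical stand-ins that only illustrate the order of operations, not the authors' implementation or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Draft:
    image: str                                # stand-in for a low-resolution preview image
    issues: List[str] = field(default_factory=list)


def generate_draft(prompt: str) -> Draft:
    # Hypothetical stand-in: the real model would decode a low-res draft image here.
    return Draft(image=f"<low-res draft for: {prompt}>")


def verify_alignment(draft: Draft, prompt: str) -> List[str]:
    # Hypothetical stand-in: the real model would use its understanding ability
    # to list semantic mismatches between the draft and the prompt.
    return []


def refine_with_super_resolution(draft: Draft, issues: List[str]) -> str:
    # Hypothetical stand-in: selectively correct the flagged regions, then upsample.
    return f"<high-res image from {draft.image}, corrections: {issues or 'none'}>"


def draco_generate(prompt: str) -> str:
    draft = generate_draft(prompt)                        # 1. draft image as visual CoT / preview
    issues = verify_alignment(draft, prompt)              # 2. verify draft semantics against prompt
    return refine_with_super_resolution(draft, issues)    # 3. selective correction + super-resolution


if __name__ == "__main__":
    print(draco_generate("a purple giraffe wearing a bowler hat"))
```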

Top-level tags: multi-modal, model training, model evaluation
Detailed tags: text-to-image generation, chain-of-thought, visual reasoning, rare concept generation, classifier-free guidance

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation


1️⃣ One-Sentence Summary

This paper proposes a new method called DraCo, which first generates a low-resolution draft image for preview and visual planning, then uses the model's own capabilities for semantic verification and selective correction, significantly improving the planning accuracy of multimodal large models in text-to-image generation and their ability to generate rare concept combinations.

