Qwen-Image-2.0 Technical Report
1️⃣ One-Sentence Summary
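This paper introduces Qwen-Image-2.0, a unified foundation model for image generation and editing that couples a language understanding model with a diffusion model and substantially outperforms its predecessors on key tasks such as long-text rendering, multilingual typography, high-resolution photorealism, and complex instruction following.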
We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.
Source: arXiv: 2605.10730