Exploring MLLM-Diffusion Information Transfer with MetaCanvas
1️⃣ One-sentence summary
This paper proposes MetaCanvas, a lightweight framework that lets powerful multimodal large language models reason and plan directly in the latent spaces of images and videos, giving them more precise control over what diffusion models generate and effectively narrowing the gap between multimodal understanding and generation.
Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We implement MetaCanvas on three different diffusion backbones and evaluate it empirically across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.
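To make the contrast between global conditioning and latent-space planning concrete, here is a minimal sketch, not the authors' implementation: `GlobalConditioning` pools MLLM hidden states into a single vector (the baseline conditioning the abstract describes), while the hypothetical `LatentCanvasPlanner` cross-attends a grid of learned queries over those hidden states to produce a spatial plan aligned with a diffusion model's latent grid. All module names, dimensions, and the query-grid design are illustrative assumptions.

```python
# Sketch only: contrasts global text conditioning with a hypothetical
# latent-space "canvas" plan produced from MLLM hidden states.
import torch
import torch.nn as nn


class GlobalConditioning(nn.Module):
    """Baseline: pool MLLM hidden states into one global vector for the diffusion model."""

    def __init__(self, mllm_dim=1024, cond_dim=768):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, mllm_hidden):           # (B, T_text, mllm_dim)
        pooled = mllm_hidden.mean(dim=1)      # collapse all reasoning into one vector
        return self.proj(pooled)              # (B, cond_dim)


class LatentCanvasPlanner(nn.Module):
    """Hypothetical planner: learned grid queries cross-attend over MLLM hidden states,
    yielding one plan token per spatial location of the diffusion latent."""

    def __init__(self, mllm_dim=1024, cond_dim=768, grid_hw=(16, 16)):
        super().__init__()
        self.grid_hw = grid_hw
        self.queries = nn.Parameter(torch.randn(grid_hw[0] * grid_hw[1], cond_dim))
        self.attn = nn.MultiheadAttention(cond_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, mllm_hidden):                        # (B, T_text, mllm_dim)
        kv = self.proj(mllm_hidden)                        # (B, T_text, cond_dim)
        b = kv.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)    # (B, H*W, cond_dim)
        plan, _ = self.attn(q, kv, kv)                     # (B, H*W, cond_dim)
        h, w = self.grid_hw
        return plan.transpose(1, 2).reshape(b, -1, h, w)   # (B, cond_dim, H, W)


if __name__ == "__main__":
    mllm_hidden = torch.randn(2, 77, 1024)                 # stand-in for MLLM hidden states
    print(GlobalConditioning()(mllm_hidden).shape)         # torch.Size([2, 768])
    print(LatentCanvasPlanner()(mllm_hidden).shape)        # torch.Size([2, 768, 16, 16])
```

The point of the sketch is only the shape of the interface: a per-location plan of shape (B, C, H, W) can be injected into the diffusion backbone (for example via cross-attention or concatenation), so layout and attribute decisions made by the MLLM remain spatially addressable instead of being averaged into one global embedding. How MetaCanvas actually performs this injection is specified in the paper, not here.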
Source: arXiv: 2512.11464