arXiv submission date: 2026-05-25
📄 Abstract - Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment this conditioning with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module is designed to aggregate multi-level MLLM features for optimal conditioning, and a multi-stage denoising strategy is applied to progressively balance the semantic information from the MLLM and the fine-detail identity from the VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation. Our project website is available at this https URL.
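
The abstract names two mechanisms without giving their internals: aggregating multi-level MLLM features, and shifting the balance between MLLM and VAE conditioning across denoising steps. The sketch below is only one possible reading, not the authors' implementation. The function names `aggregate_layers` and `stage_weights`, the softmax weighting, the stage boundaries, and the weight values are all assumptions made for illustration.

```python
import math


def aggregate_layers(layer_feats, logits):
    """Softmax-weighted sum of per-layer feature vectors.

    Hypothetical stand-in for multi-level feature aggregation; the
    paper's DLA module may work differently.
    """
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_feats[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_feats))
        for i in range(dim)
    ]


def stage_weights(step, num_steps, boundaries=(0.4, 0.8)):
    """Return (mllm_weight, vae_weight) for one denoising step.

    Assumed three-stage schedule: semantic (MLLM) conditioning is always
    on, and VAE identity conditioning is phased in as denoising proceeds.
    Boundaries and weights are illustrative, not taken from the paper.
    """
    progress = step / max(num_steps - 1, 1)
    if progress < boundaries[0]:
        return 1.0, 0.0  # early steps: layout and semantics only
    if progress < boundaries[1]:
        return 1.0, 0.5  # middle steps: partial identity conditioning
    return 1.0, 1.0      # late steps: full fine-detail identity


# Two layers with equal logits average their features.
fused = aggregate_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
schedule = [stage_weights(s, 10) for s in range(10)]
```

In a real pipeline the logits would be learned parameters and the features would be tensors, but the control flow would have the same shape under these assumptions.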

Top-level tags: machine learning, multi-modal, image generation
Detailed tags: subject-driven generation, diffusion model, multimodal LLM, identity preservation, VAE conditioning

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation


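1️⃣ One-sentence summary

This paper proposes a method that couples a multimodal large language model with a diffusion model and adds a Dual Layer Aggregation module and a multi-stage denoising strategy, improving both instruction following and subject identity preservation in subject-driven image generation while mitigating the common copy-paste artifacts.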

Source: arXiv: 2605.26111