arXiv submission date: 2026-03-26
📄 Abstract - Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.
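
To make the decoupled design concrete, below is a minimal Python sketch of the planner/visualizer split as the abstract describes it: the planner emits interleaved text plus dense textual descriptions standing in for images, and the visualizer renders each description conditioned on the previously generated image to maintain visual consistency. This is an illustrative sketch only; all names (`Step`, `plan`, `visualize`, `generate_interleaved`) are hypothetical and not the paper's actual API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Step:
    text: str          # narrative text for this step of the output
    image_prompt: str  # dense textual description standing in for the image


def plan(query: str) -> list[Step]:
    """Planner: emit interleaved text plus textual proxies for visual content.

    Per the abstract, this stage is trained on large-scale textual-proxy
    interleaved data, where visual content is represented in text.
    The two-step plan below is a hypothetical placeholder.
    """
    return [
        Step(text=f"Step 1 of answering: {query}",
             image_prompt="A wide shot of the subject, neutral lighting."),
        Step(text="Step 2 builds on the previous image.",
             image_prompt="Close-up of the same subject, same lighting."),
    ]


def visualize(prompt: str, reference: bytes | None) -> bytes:
    """Visualizer: synthesize an image from the planner's description.

    Per the abstract, this stage is trained on reference-guided image data
    so each image stays consistent with earlier ones; here we just return a
    placeholder payload instead of calling an actual image model.
    """
    return f"<image for: {prompt!r} | ref={reference is not None}>".encode()


def generate_interleaved(query: str) -> list[tuple[str, bytes]]:
    """Run planning once, then condition each image on the previous output."""
    outputs: list[tuple[str, bytes]] = []
    reference: bytes | None = None
    for step in plan(query):
        image = visualize(step.image_prompt, reference)
        reference = image  # carry visual context forward for consistency
        outputs.append((step.text, image))
    return outputs


if __name__ == "__main__":
    for text, image in generate_interleaved("how to repot a plant"):
        print(text, image.decode())
```

The key property the sketch captures is that the two stages can be trained on different data: the planner never needs real images, and the visualizer never needs long interleaved sequences, which is how the paper sidesteps the scarcity of real interleaved training data.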

Top-level tags: multi-modal model training aigc
Detailed tags: interleaved generation text-to-image planning visual consistency benchmark

Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training


1️⃣ One-sentence summary

This paper proposes a model called Wan-Weaver, which decouples the complex task of interleaved image-text generation into two separately trained steps, textual planning and visual consistency modeling, allowing it to produce coherent, visually consistent mixed multi-modal content even without any real interleaved training data.

Source: arXiv: 2603.25706