Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs
1️⃣ One-Sentence Summary
This paper proposes a new method for multi-round conversational image generation. By constructing non-Markov interaction data and adopting a history-conditioned training framework, it addresses the tendency of models to forget earlier history when users refer back to, undo, or cross-reference entities from previous rounds, substantially improving the consistency and instruction-following ability of generated images across multi-round dialogues.
Conversational image generation requires a model to follow user instructions across multiple rounds of interaction, grounded in interleaved text and images that accumulate as chat history. While recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training recipes are effectively Markov: the next output depends primarily on the most recent image, enabling shortcut solutions that ignore long-range history. In this work we formalize and target the more challenging non-Markov setting, where a user may refer back to earlier states, undo changes, or reference entities introduced several rounds ago. We present (i) non-Markov multi-round data construction strategies, including rollback-style editing that forces retrieval of earlier visual states and name-based multi-round personalization that binds names to appearances across rounds; (ii) a history-conditioned training and inference framework with token-level caching to prevent multi-round identity drift; and (iii) enabling improvements for high-fidelity image reconstruction and editable personalization, including a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum. We demonstrate that explicitly training for non-Markov interactions yields substantial improvements in multi-round consistency and instruction compliance, while maintaining strong single-round editing and personalization.
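The abstract's core distinction is between Markov conditioning (only the most recent image matters) and non-Markov conditioning (the full interleaved history matters, enabling rollback and cross-round references). The sketch below is a minimal illustration of that distinction, not the paper's implementation; the `Turn`, `Dialogue`, and `rollback` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    instruction: str   # user instruction for this round
    image_id: str      # identifier of the image produced this round

@dataclass
class Dialogue:
    turns: List[Turn] = field(default_factory=list)

    def markov_context(self) -> List[Turn]:
        # Markov shortcut: condition only on the most recent round,
        # ignoring long-range history.
        return self.turns[-1:]

    def non_markov_context(self) -> List[Turn]:
        # Non-Markov setting: the full interleaved history is visible,
        # so instructions like "undo the hat from round 2" are resolvable.
        return list(self.turns)

    def rollback(self, round_index: int) -> Turn:
        # Rollback-style editing: retrieve an earlier visual state,
        # which forces the model to recall history beyond the last image.
        return self.turns[round_index]
```

A model trained only on Markov-style data can satisfy `markov_context` while failing any request that requires `rollback` or `non_markov_context`, which is exactly the shortcut the paper's data construction is designed to eliminate.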
Source: arXiv:2601.20911