Customized Visual Storytelling with Unified Multimodal LLMs
1️⃣ One-Sentence Summary
This paper proposes VstoryGen, a multimodal framework that generates coherent, customized visual stories conforming to cinematic grammar, conditioned on textual descriptions, character reference images, and background reference images, with shot-type control; it outperforms existing methods in character and scene consistency, text-image alignment, and shot diversity.
2️⃣ Abstract

Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.
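The abstract names parameter-efficient prompt tuning as the mechanism behind shot-type control but does not detail the implementation. Below is a minimal sketch of one common form of that technique, assuming learnable soft prompts keyed by shot type and prepended to a frozen backbone's token embeddings; the class name, tensor shapes, and shot-type vocabulary are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class ShotTypePromptTuner(nn.Module):
    """Hypothetical sketch: one trainable soft-prompt bank per shot type.

    Only these prompt parameters are trained; the multimodal LLM backbone
    (embedding layer included) is assumed frozen.
    """

    def __init__(self, num_shot_types: int, prompt_len: int, embed_dim: int):
        super().__init__()
        # One learnable prompt sequence per shot type (e.g., close-up, wide).
        self.prompts = nn.Parameter(
            torch.randn(num_shot_types, prompt_len, embed_dim) * 0.02
        )

    def forward(self, token_embeds: torch.Tensor, shot_type_ids: torch.Tensor):
        # token_embeds: (batch, seq_len, embed_dim), from the frozen backbone
        # shot_type_ids: (batch,) integer index selecting the desired shot type
        soft_prompts = self.prompts[shot_type_ids]  # (batch, prompt_len, embed_dim)
        # Prepend the shot-type prompts so they condition the whole sequence.
        return torch.cat([soft_prompts, token_embeds], dim=1)

# Usage: gradients flow only into the prompt bank during fine-tuning.
batch, seq_len, dim = 2, 16, 768
tuner = ShotTypePromptTuner(num_shot_types=5, prompt_len=8, embed_dim=dim)
embeds = torch.randn(batch, seq_len, dim)  # stand-in for frozen embeddings
ids = torch.tensor([0, 3])                 # e.g., 0 = close-up, 3 = wide shot
out = tuner(embeds, ids)
print(out.shape)  # torch.Size([2, 24, 768]) -> prompt_len + seq_len
```

The appeal of this family of methods is that conditioning signals such as shot type can be learned from movie data without updating the backbone, which matches the paper's "parameter-efficient" framing.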
Source: arXiv: 2603.27690