arXiv submission date: 2025-12-22
📄 Abstract - StoryMem: Multi-shot Long Video Storytelling with Memory

Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
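
To make the memory mechanism concrete, below is a minimal Python sketch of the memory-bank bookkeeping the abstract describes: candidate keyframes from each generated shot are filtered by an aesthetic score, the most semantically novel survivors are added, and the oldest entries are evicted to keep the bank compact. Everything here (the `KeyframeMemoryBank` name, the capacity, the thresholds, cosine-similarity novelty, oldest-first eviction) is an illustrative assumption, not the paper's implementation; in particular, the actual injection of memory into the diffusion model via latent concatenation and negative RoPE shifts is not modeled.

```python
# Hypothetical sketch of keyframe-memory bookkeeping for multi-shot generation.
# All names, capacities, and selection rules are assumptions for illustration.
import numpy as np

class KeyframeMemoryBank:
    def __init__(self, capacity: int = 8, aesthetic_threshold: float = 0.5):
        self.capacity = capacity
        self.aesthetic_threshold = aesthetic_threshold
        self.frames: list[np.ndarray] = []       # stored keyframes (latents/pixels)
        self.embeddings: list[np.ndarray] = []   # their semantic embeddings

    def _novelty(self, emb: np.ndarray) -> float:
        # Novelty = 1 - max cosine similarity to any keyframe already in memory.
        if not self.embeddings:
            return 1.0
        sims = [float(emb @ e / (np.linalg.norm(emb) * np.linalg.norm(e)))
                for e in self.embeddings]
        return 1.0 - max(sims)

    def update(self, frames, embeddings, aesthetic_scores, keep_per_shot: int = 2):
        # 1) Aesthetic preference filter: drop low-quality candidates.
        candidates = [(f, e) for f, e, s in zip(frames, embeddings, aesthetic_scores)
                      if s >= self.aesthetic_threshold]
        # 2) Semantic selection: keep only the most novel candidates of this shot.
        candidates.sort(key=lambda c: self._novelty(c[1]), reverse=True)
        for frame, emb in candidates[:keep_per_shot]:
            if len(self.frames) >= self.capacity:
                # 3) Evict the oldest keyframe so the memory stays compact.
                self.frames.pop(0)
                self.embeddings.pop(0)
            self.frames.append(frame)
            self.embeddings.append(emb)

# Toy usage: random "keyframes" with random embeddings/scores from one shot.
rng = np.random.default_rng(0)
bank = KeyframeMemoryBank(capacity=4)
frames = [rng.normal(size=(16, 16, 3)) for _ in range(6)]
embs = [rng.normal(size=128) for _ in range(6)]
scores = rng.uniform(0, 1, size=6).tolist()
bank.update(frames, embs, scores)
print(f"keyframes in memory: {len(bank.frames)}")
```
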

Top-level tags: video generation, multi-modal, model training
Detailed tags: long video generation, visual storytelling, memory bank, video diffusion, benchmark

StoryMem: Multi-shot Long Video Storytelling with Memory


1️⃣ One-Sentence Summary

This paper proposes StoryMem, a method that mimics human memory by maintaining a dynamically updated memory bank of keyframes to guide video generation, enabling existing single-shot video diffusion models to produce multi-shot story videos that are narratively coherent, visually polished, and minutes long.

Source: arXiv: 2512.19539