arXiv submission date: 2026-01-22
📄 Abstract - Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

Recent foundational video-to-video diffusion models have achieved impressive results in editing user-provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross-consistency in multi-turn video editing and introduce Memory-V2V, a simple yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory-V2V on challenging tasks including video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead, while maintaining or even improving task-specific performance over state-of-the-art baselines. Project page: this https URL
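
To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the two components the abstract names: an external cache of previously edited videos with similarity-based retrieval, and a learnable token compressor applied to the retrieved conditioning tokens. All names and design choices here (EditMemory, TokenCompressor, num_queries, cosine-similarity retrieval, cross-attention-style compression) are illustrative assumptions, not the authors' released code.

```python
# Minimal, hypothetical sketch (not the authors' code) of the two components
# named in the abstract: an external cache of previously edited videos with
# similarity-based retrieval, and a learnable token compressor that shortens
# the retrieved conditioning tokens before they enter the video DiT.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EditMemory:
    """External cache of prior edits, stored as (key embedding, token set) pairs."""

    def __init__(self):
        self.keys = []        # one embedding per cached edit, shape (d,)
        self.token_sets = []  # latent tokens of each cached edit, shape (n_i, d)

    def add(self, key: torch.Tensor, tokens: torch.Tensor) -> None:
        self.keys.append(key)
        self.token_sets.append(tokens)

    def retrieve(self, query: torch.Tensor, top_k: int = 1):
        """Return the token sets of the top_k cached edits most similar to the query.

        Cosine similarity is an assumption; the paper's retrieval metric is not
        specified in the abstract.
        """
        if not self.keys:
            return []
        keys = torch.stack(self.keys)                                 # (m, d)
        sims = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)  # (m,)
        idx = sims.topk(min(top_k, len(self.keys))).indices
        return [self.token_sets[int(i)] for i in idx]


class TokenCompressor(nn.Module):
    """Learnable compressor: a small set of query tokens cross-attends to the
    retrieved conditioning tokens, yielding a fixed, much shorter sequence."""

    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cond_tokens: torch.Tensor) -> torch.Tensor:
        # cond_tokens: (batch, n_cond, dim) -> compressed: (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(cond_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, cond_tokens, cond_tokens)
        return self.norm(compressed)
```

Under this reading, the reported 30% speedup would come from attending over a fixed short sequence (e.g. 64 query tokens) rather than the full token set of a retrieved video; the actual compressor architecture and its placement in the DiT backbone are not specified in the abstract.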

Top-level tags: video generation, model training, multi-modal
Detailed tags: video-to-video diffusion, iterative editing, cross-consistency, memory-augmented generation, token compression

Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory


1️⃣ One-Sentence Summary

This paper proposes a new framework called Memory-V2V, which equips existing video-editing AI models with a "memory bank": when a user edits the same video repeatedly over multiple rounds, the model automatically references previous editing results, keeping the video's overall style and content consistent while also improving processing speed.
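
As a complement, here is a self-contained, hypothetical sketch of the multi-turn loop the summary describes: each round retrieves the most similar prior edit from the cache, compresses its tokens, and conditions the current edit on them. The callables `edit_model`, `embed`, and `compress` are placeholders for the underlying V2V model, an embedding encoder, and a token compressor; none of these names come from the paper.

```python
# Hypothetical, self-contained multi-turn editing loop; `edit_model`, `embed`,
# and `compress` are placeholder callables (a V2V editor, an embedding encoder,
# and a token compressor), not the paper's actual API.
import torch
import torch.nn.functional as F


def multi_turn_edit(edit_model, embed, compress, source_video, prompts):
    cache = []    # (key embedding, token set) pairs from earlier rounds
    results = []
    for prompt in prompts:
        query = embed(prompt)                                   # (d,)
        memory_tokens = None
        if cache:
            # Retrieve the most similar prior edit and compress its tokens.
            keys = torch.stack([k for k, _ in cache])           # (m, d)
            sims = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)
            best = int(torch.argmax(sims))
            memory_tokens = compress(cache[best][1].unsqueeze(0))
        # Condition the current edit on the compressed memory (if any),
        # then cache the new result for later rounds.
        edited, edited_tokens = edit_model(source_video, prompt,
                                           memory_tokens=memory_tokens)
        cache.append((query, edited_tokens))
        results.append(edited)
    return results
```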

Source: arXiv:2601.16296