EasyV2V: A High-quality Instruction-based Video Editing Framework
1️⃣ One-sentence summary
This paper presents EasyV2V, a simple and effective framework that tackles the core difficulties of video editing, consistency, control, and generalization, through novel data construction, a simplified model design, and a unified control mechanism, achieving high-quality video editing from natural-language instructions.
While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce \emph{EasyV2V}, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: this https URL
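The data recipe above lifts image edit pairs into pseudo video pairs by animating both images with a shared motion trajectory, so the only difference between the two clips is the edit itself. A minimal sketch of that idea, where the function name is hypothetical and translation-only motion stands in for the paper's shared affine motion:

```python
import numpy as np

def lift_pair_to_pseudo_video(src_img, edit_img, num_frames=8, max_shift=4, seed=0):
    """Turn one (source, edited) image pair into a pseudo video pair by
    applying the SAME per-frame motion to both images.

    Integer translations (via np.roll) are used here as the simplest
    special case of a shared affine transform; identical motion keeps
    the two clips perfectly aligned, so they differ only by the edit.
    """
    rng = np.random.default_rng(seed)
    # One shared (dy, dx) shift per frame, reused for both images.
    shifts = rng.integers(-max_shift, max_shift + 1, size=(num_frames, 2))
    src_vid, edit_vid = [], []
    for dy, dx in shifts:
        src_vid.append(np.roll(src_img, (int(dy), int(dx)), axis=(0, 1)))
        edit_vid.append(np.roll(edit_img, (int(dy), int(dx)), axis=(0, 1)))
    return np.stack(src_vid), np.stack(edit_vid)
```

Because the motion is shared, any per-pixel relation between the source and edited image is preserved frame by frame, which is exactly the supervision signal the pseudo pairs are meant to provide.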
Source: arXiv: 2512.16920