Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
1️⃣ One-Sentence Summary
This paper proposes Kiwi-Edit, a new video editing method that combines text instructions with reference images for more precise control over edits. The authors also build a large-scale training dataset to boost model performance, achieving state-of-the-art results on controllable video editing tasks.
Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at this https URL.
Source: arXiv: 2603.02175