arXiv submission date: 2025-12-07
📄 Abstract - Scaling Zero-Shot Reference-to-Video Generation

Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
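
The abstract describes the masked training strategy only at a high level. As a rough illustration of the general idea, the NumPy sketch below shows one plausible way a pseudo-reference could be carved out of a single video frame using a subject mask, with a random spatial shift standing in for the paper's mask augmentation against copy-paste artifacts. All names here (`make_pseudo_reference`, `augment_mask`, the gray-background fill, the shift range) are hypothetical assumptions, not details taken from the paper.

```python
"""Illustrative sketch, NOT the paper's code: building a pseudo-reference
image from a plain video-text pair so that no explicit
reference-image / video / text triplet is needed for training."""

import numpy as np

rng = np.random.default_rng(0)


def augment_mask(mask: np.ndarray, max_shift: int = 8) -> np.ndarray:
    """Randomly shift the binary subject mask so the pseudo-reference no
    longer aligns pixel-for-pixel with the target frame (one simple way to
    discourage copy-paste behavior)."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(mask, shift=(dy, dx), axis=(0, 1))


def make_pseudo_reference(frame: np.ndarray, subject_mask: np.ndarray) -> np.ndarray:
    """Cut the masked subject out of one frame and paste it onto a neutral
    background, yielding a training-time 'reference image'."""
    mask = augment_mask(subject_mask)[..., None]   # H x W x 1, broadcasts over RGB
    background = np.full_like(frame, 127)          # neutral gray fill
    return np.where(mask > 0, frame, background)


if __name__ == "__main__":
    # Toy example: a 64x64 RGB frame with a square "subject" in the middle.
    frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    mask = np.zeros((64, 64), dtype=np.uint8)
    mask[20:44, 20:44] = 1
    ref = make_pseudo_reference(frame, mask)
    print(ref.shape, ref.dtype)                    # (64, 64, 3) uint8
```

In an actual pipeline the pseudo-reference would be fed to the model as the conditioning image while the full video-text pair supplies the generation target; the sketch above only shows the data-construction step.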

Top tags: video generation, AIGC, model training
Detailed tags: zero-shot learning, reference-to-video, masked training, video synthesis, subject identity preservation

Scaling Zero-Shot Reference-to-Video Generation


1️⃣ One-sentence summary

This paper proposes Saber, a zero-shot framework that dispenses with expensive, hard-to-collect reference image-video-text triplets. Trained only on video-text pairs, it generates high-quality videos that follow the text prompt while preserving the subject identity from the reference images, and it outperforms methods trained on dedicated R2V data.


Source: arXiv 2512.06905