arXiv submission date: 2026-02-09
📄 Abstract - ALIVE: Animate Your World with Lifelike Audio-Video Generation

Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities beyond those of the T2V foundation model. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch that includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline covering audio-video captioning, quality control, and related steps is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark for comprehensive model evaluation and comparison. After continued pretraining and finetuning on millions of high-quality samples, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: this https URL.
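
The abstract names TA-CrossAttn and UniTemp-RoPE but does not describe their internals, so the sketch below is only a guess at the general mechanism: audio and video tokens are placed on a shared time axis via a rotary positional encoding driven by real timestamps, and one modality cross-attends to the other. All class names, shapes, and hyperparameters are illustrative assumptions, not the ALIVE implementation.

```python
# A minimal PyTorch sketch of a joint audio-video block with temporally-aligned
# cross-attention and a shared temporal RoPE. Hypothetical; the actual ALIVE
# layers are not specified in the abstract.
import torch
import torch.nn as nn


def temporal_rope(x: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs by angles derived from per-token timestamps.

    x:          (batch, tokens, dim) with even dim
    timestamps: (batch, tokens) in seconds, shared between audio and video
                tokens so co-occurring tokens get identical phases (the rough
                idea behind a "unified temporal RoPE").
    """
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, device=x.device) / half))
    angles = timestamps[..., None] * freqs          # (B, T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class TemporallyAlignedCrossAttn(nn.Module):
    """Cross-attention from one modality's tokens to the other's, with both
    streams positioned on a common time axis before attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, query, query_t, context, context_t):
        q = temporal_rope(self.norm_q(query), query_t)
        kv = temporal_rope(self.norm_kv(context), context_t)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return query + out                           # residual connection


if __name__ == "__main__":
    B, Tv, Ta, D = 2, 16, 32, 64
    video = torch.randn(B, Tv, D)   # 16 video tokens over a 2-second clip
    audio = torch.randn(B, Ta, D)   # 32 audio tokens over the same clip
    video_t = torch.linspace(0, 2, Tv).expand(B, Tv)
    audio_t = torch.linspace(0, 2, Ta).expand(B, Ta)
    block = TemporallyAlignedCrossAttn(D)
    fused_audio = block(audio, audio_t, video, video_t)
    print(fused_audio.shape)        # torch.Size([2, 32, 64])
```

The point this sketch illustrates is that audio and video tokens carrying the same timestamp receive identical rotary phases even though the two streams have different token rates, so the attention scores can reflect temporal proximity directly; how ALIVE actually realizes this is only described at the level of the module names above.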

Top tags: video generation · aigc · multi-modal
Detailed tags: audio-video generation · text-to-video&audio · animation · mmdit architecture · benchmark

ALIVE: Animate Your World with Lifelike Audio-Video Generation


1️⃣ One-Sentence Summary

This paper proposes a generative model called ALIVE that adapts an existing video generation model so it can jointly produce high-quality, audio-visually synchronized video and audio from text or a reference video, with performance rivaling top commercial solutions.

Source: arXiv: 2602.08682