Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
1️⃣ One-Sentence Summary
This paper proposes Live Avatar, an algorithm-system co-designed framework that, through novel parallel-computation and caching mechanisms, achieves for the first time high-fidelity, low-latency, infinite-length, real-time streaming avatar video generation with an ultra-large-scale diffusion model.
Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
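To make the Timestep-forcing Pipeline Parallelism (TPP) idea concrete, below is a minimal single-process sketch that simulates the pipeline with threads and queues. All names, shapes, and counts (`NUM_STAGES`, `NUM_CHUNKS`, `CHUNK_SHAPE`, `denoise_step`) are illustrative assumptions, not the paper's code: in the actual system each stage would be a GPU rank holding the denoiser for one fixed timestep, and the queue hand-offs would be inter-GPU transfers.

```python
import queue
import threading

import torch

NUM_STAGES = 4                    # hypothetical: one denoising timestep per pipeline stage
NUM_CHUNKS = 8                    # hypothetical number of streamed latent chunks
CHUNK_SHAPE = (1, 4, 16, 32, 32)  # hypothetical (batch, channels, frames, height, width)

def denoise_step(latent: torch.Tensor, stage: int) -> torch.Tensor:
    """Stand-in for one denoising step at this stage's fixed timestep.

    In Live Avatar this would be a forward pass of the 14B diffusion
    model on the GPU assigned to this stage; here it is a placeholder
    update, not the real sampler math.
    """
    return latent * 0.9

def stage_worker(stage: int, inbox: queue.Queue, outbox: queue.Queue) -> None:
    # Each stage repeatedly denoises whatever chunk arrives, then forwards
    # it downstream. With real GPUs, the queue hand-off would instead be a
    # send/recv between consecutive pipeline ranks.
    while True:
        item = inbox.get()
        if item is None:          # shutdown sentinel, propagated downstream
            outbox.put(None)
            return
        chunk_id, latent = item
        outbox.put((chunk_id, denoise_step(latent, stage)))

# Wire NUM_STAGES workers into a linear pipeline via queues.
queues = [queue.Queue() for _ in range(NUM_STAGES + 1)]
threads = [
    threading.Thread(target=stage_worker, args=(s, queues[s], queues[s + 1]))
    for s in range(NUM_STAGES)
]
for t in threads:
    t.start()

# Feed noisy latent chunks in stream order. Because chunks overlap in
# flight, steady-state throughput approaches one chunk per stage latency
# rather than one chunk per (NUM_STAGES * stage latency) as in a purely
# sequential sampler.
for chunk_id in range(NUM_CHUNKS):
    queues[0].put((chunk_id, torch.randn(CHUNK_SHAPE)))
queues[0].put(None)

while (out := queues[-1].get()) is not None:
    chunk_id, clean_latent = out
    print(f"chunk {chunk_id} fully denoised -> decode and stream")
```

The design point this sketch illustrates is the trade the abstract describes: pipelining pays a one-time fill latency of `NUM_STAGES` stage-steps at startup, after which a fully denoised chunk emerges every stage-step, which is what allows stable real-time streaming instead of stalling on the full sequential denoising chain per chunk.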