LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
1️⃣ One-Sentence Summary
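This paper presents LiveTalk, a real-time multimodal interactive avatar video generation system. Using an improved model distillation technique, it cuts generation latency from minutes to real time while preserving video quality, enabling fluid multimodal human-AI conversation.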
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, diffusion models denoise all video frames simultaneously with bidirectional attention through an iterative process, which prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach, Self Forcing, encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x lower inference cost and latency. Further, we integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1-2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
Source: arXiv: 2512.23576