AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
1️⃣ One-Sentence Summary
This paper proposes a new framework named AnyTalker, which uses low-cost, readily available single-person video data to efficiently generate conversational videos in which multiple distinct identities speak in sync and interact naturally, resolving prior methods' difficulties with data collection and multi-person interaction coordination.
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high cost of collecting diverse multi-person data and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework built on an extensible multi-stream processing architecture. Specifically, we extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Moreover, since training multi-person generative models normally demands massive multi-person data, our proposed training pipeline relies solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data cost and identity scalability.
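The abstract names the identity-aware attention mechanism without detailing it, so below is a minimal PyTorch sketch of what iterating one attention block over identity-audio pairs could look like. This is an assumption-laden illustration, not the paper's implementation: the class name `IdentityAwareAttention`, the per-identity region masks, and the shared audio cross-attention are all hypothetical. The design point it captures is that reusing one set of cross-attention weights across every (identity, audio) pair leaves the number of drivable identities unbounded at inference time.

```python
import torch
import torch.nn as nn

class IdentityAwareAttention(nn.Module):
    """Hypothetical sketch of an identity-aware attention block:
    shared self-attention over video tokens, then per-identity audio
    cross-attention routed only to that identity's spatial region.
    Module and argument names are illustrative, not from the paper."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One cross-attention module reused for every identity-audio pair,
        # so the number of drivable identities is not fixed at build time.
        self.audio_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, id_audio_pairs):
        # x: (B, N, D) video latent tokens.
        # id_audio_pairs: list of (mask, audio) tuples, one per identity;
        #   mask:  (B, N)    boolean, True over identity i's region
        #   audio: (B, T, D) audio feature tokens driving identity i
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]

        # Iterate over identity-audio pairs; the same weights handle
        # each pair, allowing arbitrary scaling of identities.
        for mask, audio in id_audio_pairs:
            h = self.norm2(x)
            attn_out, _ = self.audio_cross_attn(h, audio, audio)
            # Route each identity's audio only into its own region.
            x = x + attn_out * mask.unsqueeze(-1).float()
        return x
```

Under these assumptions, adding a third or fourth speaker only means appending another (mask, audio) pair to the list; no new parameters are introduced, which is consistent with the abstract's claim of arbitrary identity scalability.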