Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

📄 Abstract - Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.

📄 论文总结

和谐：通过跨任务协同实现音视频生成的协调统一 / Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

1️⃣ 一句话总结

这篇论文提出了一个名为Harmony的新框架，通过跨任务协同训练、高效的全局-局部解耦交互模块以及同步增强的引导技术，解决了音视频生成中难以保持精确同步的核心难题，显著提升了生成内容的真实感和同步质量。

← 返回列表

菜单

🤖 AI 深度阅读

📄 论文总结

1️⃣ 一句话总结

密码管理

设置密码

修改密码

移除密码

菜单

🤖 AI 深度阅读

📄 论文总结

1️⃣ 一句话总结

获取最新论文摘要