arXiv submission date: 2026-02-22
📄 Abstract - JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables effective cross-modal interaction while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. Moreover, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preferences across the quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with only around 1M public training samples, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies validate the effectiveness of the proposed modules. All the code, model, and dataset are released at this https URL.
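The temporal-aligned RoPE idea from the abstract can be illustrated with a minimal sketch. The rates, dimensions, and function names below are hypothetical choices for illustration, not the paper's implementation: audio and video tokens are mapped onto a shared time axis so that tokens occurring at the same timestamp receive identical rotary positions, and therefore identical rotary phases, regardless of each stream's sampling rate.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotary angles for given (possibly fractional) positions."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # shape: (n_tokens, dim // 2)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical setup: a 4-second clip with video latents at 8 fps
# (32 tokens) and audio latents at 20 Hz (80 tokens). Both streams
# are placed on one shared time axis, measured in video-frame units.
duration = 4.0
video_t = np.arange(32) / 8.0    # timestamp (s) of each video token
audio_t = np.arange(80) / 20.0   # timestamp (s) of each audio token

scale = 32 / duration            # shared positions in "frame" units
video_pos = video_t * scale
audio_pos = audio_t * scale

dim = 16
video_q = apply_rope(np.ones((32, dim)), rope_angles(video_pos, dim))
audio_q = apply_rope(np.ones((80, dim)), rope_angles(audio_pos, dim))

# Tokens at the same wall-clock time (e.g. t = 1.0 s) share rotary
# phase, so attention between them behaves like zero relative offset.
assert np.allclose(video_q[8], audio_q[20])
```

The design point is that synchronization becomes a property of the position encoding itself: co-occurring audio and video tokens attend to each other as if at zero relative distance, rather than relying on the model to learn the alignment implicitly.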

Top-level tags: aigc, multi-modal, model training
Detailed tags: audio-video generation, diffusion transformer, preference optimization, temporal alignment, mixture-of-experts

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation


1️⃣ One-sentence summary

This paper proposes JavisDiT++, a new framework that combines a novel mixture-of-experts module, a temporal-alignment technique, and a human-preference optimization method to substantially improve the generation of high-quality, audio-visually synchronized videos from text descriptions.

Source: arXiv:2602.19163