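JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
1️⃣ One-sentence summary
This paper proposes JavisGPT, the first unified multimodal large language model that can both comprehend and generate sounding-video (joint audio-video) content. Using a novel fusion module and a staged training procedure, it performs well on complex audio-video synchronization tasks.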
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
Source: arXiv: 2512.22905