arXiv submission date: 2026-01-06
📄 Abstract - LTX-2: Efficient Joint Audio-Visual Foundation Model

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
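
The abstract describes the coupling between the two streams only in words. The sketch below is an illustrative PyTorch reconstruction under stated assumptions, not the released LTX-2 code: toy hidden widths (512 for video, 256 for audio) stand in for the 14B/5B streams, and the temporal positional embeddings, feed-forward sublayers, and text conditioning are omitted. It shows how a wider video stream and a narrower audio stream can exchange information through bidirectional cross-attention while both receive the same diffusion-timestep conditioning via AdaLN-style modulation.

```python
# Illustrative sketch only (not the released LTX-2 code): one coupled block of an
# asymmetric dual-stream transformer. Each stream runs self-attention, then the two
# streams exchange information through bidirectional cross-attention, with a shared
# timestep embedding injected into both via AdaLN-style scale/shift modulation.
import torch
import torch.nn as nn


class AdaLN(nn.Module):
    """LayerNorm whose scale and shift are predicted from a shared conditioning vector."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class DualStreamBlock(nn.Module):
    """One video/audio transformer block coupled by bidirectional cross-attention."""

    def __init__(self, video_dim=512, audio_dim=256, cond_dim=128, heads=8):
        super().__init__()
        # Self-attention within each modality.
        self.v_self = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.a_self = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        # Bidirectional cross-attention: video attends to audio and vice versa.
        # kdim/vdim let the two streams keep different widths (asymmetric capacity).
        self.v_from_a = nn.MultiheadAttention(video_dim, heads, kdim=audio_dim,
                                              vdim=audio_dim, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(audio_dim, heads, kdim=video_dim,
                                              vdim=video_dim, batch_first=True)
        # Shared timestep conditioning applied to both streams (cross-modality AdaLN).
        self.v_norm1, self.v_norm2 = AdaLN(video_dim, cond_dim), AdaLN(video_dim, cond_dim)
        self.a_norm1, self.a_norm2 = AdaLN(audio_dim, cond_dim), AdaLN(audio_dim, cond_dim)

    def forward(self, video, audio, t_emb):
        # Self-attention with AdaLN pre-normalization.
        v = self.v_norm1(video, t_emb)
        video = video + self.v_self(v, v, v, need_weights=False)[0]
        a = self.a_norm1(audio, t_emb)
        audio = audio + self.a_self(a, a, a, need_weights=False)[0]
        # Bidirectional cross-attention between the two streams.
        v = self.v_norm2(video, t_emb)
        a = self.a_norm2(audio, t_emb)
        video = video + self.v_from_a(v, a, a, need_weights=False)[0]
        audio = audio + self.a_from_v(a, v, v, need_weights=False)[0]
        return video, audio


# Toy forward pass: 16 video tokens, 40 audio tokens, one shared timestep embedding.
block = DualStreamBlock()
video_tokens = torch.randn(2, 16, 512)
audio_tokens = torch.randn(2, 40, 256)
t_emb = torch.randn(2, 128)
video_out, audio_out = block(video_tokens, audio_tokens, t_emb)
print(video_out.shape, audio_out.shape)  # [2, 16, 512] and [2, 40, 256]
```

Keeping the two widths separate (via `kdim`/`vdim` in the cross-attention) is what lets a design like this allocate more capacity to the video stream than to the audio stream while still training both modalities jointly.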

Top-level tags: multi-modal aigc model training
Detailed tags: audio-visual generation, diffusion models, foundation model, text-to-video, cross-modality attention

LTX-2: Efficient Joint Audio-Visual Foundation Model


1️⃣ One-Sentence Summary

This paper presents LTX-2, an open-source foundation model that efficiently generates high-quality, audio-visually synchronized video content; through its dual-stream architecture and training scheme, it substantially reduces computational cost while maintaining performance.
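
The abstract also introduces a modality-aware classifier-free guidance (modality-CFG) mechanism but does not spell out its formulation. Below is a hypothetical sketch of one plausible reading, in which each modality receives its own guidance scale at sampling time; the function name, the per-modality scale values, and the specific combination rule are all assumptions for illustration, not the paper's definition.

```python
# Hypothetical sketch of modality-aware classifier-free guidance (one possible reading,
# not the paper's formulation). Standard CFG combines conditional and unconditional
# denoiser outputs with a single scale; here each modality (video, audio) gets its own
# scale, so audiovisual alignment and prompt adherence can be tuned per stream.
import torch


def modality_cfg(video_cond, video_uncond, audio_cond, audio_uncond,
                 video_scale: float = 7.0, audio_scale: float = 4.0):
    """Combine conditional/unconditional predictions with per-modality guidance scales."""
    video_guided = video_uncond + video_scale * (video_cond - video_uncond)
    audio_guided = audio_uncond + audio_scale * (audio_cond - audio_uncond)
    return video_guided, audio_guided


# Toy usage: random tensors stand in for the denoiser outputs at one sampling step.
v_c, v_u = torch.randn(2, 16, 512), torch.randn(2, 16, 512)
a_c, a_u = torch.randn(2, 40, 256), torch.randn(2, 40, 256)
v_guided, a_guided = modality_cfg(v_c, v_u, a_c, a_u)
print(v_guided.shape, a_guided.shape)
```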

Source: arXiv:2601.03233