arXiv submission date: 2025-12-08
📄 Abstract - UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: this https URL
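The abstract names two mechanisms without detailing them: dynamic noising (one denoising objective shared across heterogeneous training paradigms) and a modality switcher (modular, per-modality parameters in front of shared ones). The paper's actual implementation is not shown here; the sketch below is a minimal, hypothetical PyTorch illustration of what such components could look like. The names `ModalitySwitcher` and `dynamic_noise`, the adapter design, and the paradigm-dependent noise ranges are all my assumptions, not the authors' code.

```python
# Hypothetical sketch of the two components named in the abstract.
# Nothing here is from the paper's codebase; it only illustrates the idea.

import torch
import torch.nn as nn

# The five conditioning modalities listed in the abstract.
MODALITIES = ["segmentation", "skeleton", "densepose", "optical_flow", "depth"]

class ModalitySwitcher(nn.Module):
    """Route each modality through its own lightweight adapter
    (modular parameters) before a projection shared by all modalities."""

    def __init__(self, dim: int):
        super().__init__()
        self.adapters = nn.ModuleDict(
            {name: nn.Linear(dim, dim) for name in MODALITIES}
        )
        self.shared = nn.Linear(dim, dim)  # parameters shared across modalities

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        return self.shared(self.adapters[modality](x))

def dynamic_noise(x: torch.Tensor, paradigm: str):
    """Sample a per-sample noise level whose range depends on the training
    paradigm, so heterogeneous tasks reduce to one denoising objective.
    The ranges below are illustrative guesses, not the paper's values."""
    lo, hi = (0.0, 1.0) if paradigm == "generation" else (0.0, 0.5)
    t = torch.rand(x.shape[0], 1) * (hi - lo) + lo            # noise level per sample
    noisy = torch.sqrt(1 - t) * x + torch.sqrt(t) * torch.randn_like(x)
    return noisy, t

# Toy usage: one step over a batch of fake per-frame latents.
switcher = ModalitySwitcher(dim=64)
latents = torch.randn(8, 64)
cond = switcher(torch.randn(8, 64), "depth")     # depth-conditioned features
noisy, t = dynamic_noise(latents, "generation")
print(noisy.shape, t.shape, cond.shape)
```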

Top tags: video generation, multi-modal, model training
Detailed tags: world-aware generation, cross-modal learning, unified framework, dynamic noising, modality switcher

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation


1️⃣ One-Sentence Summary

This paper proposes a unified framework called UnityVideo that jointly learns across multiple visual modalities (such as segmentation masks and human skeletons) and training paradigms, effectively improving a video generation model's perception of the physical world, its generation quality, and its generalization performance.


Source: arXiv 2512.07831