arXiv submission date: 2025-12-15
📄 Abstract - LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
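To make the three stages concrete, the sketch below mimics the inference-time loop they imply: clips are generated autoregressively, the conditioning frame is degraded to match long-horizon rollout conditions (stage 2), dense and sparse control signals steer each clip (stage 1), and a running history context is carried across adjacent clips (stage 3). Everything here (the `generate_clip` stub, the tensor shapes, the context update) is an illustrative assumption standing in for the paper's actual model, not its published API.

```python
"""Minimal sketch of a LongVie 2-style autoregressive long-video rollout.

All names, shapes, and update rules are illustrative assumptions; a real
system would replace `generate_clip` with the trained video backbone.
"""
import numpy as np

rng = np.random.default_rng(0)

CLIP_LEN = 16          # frames per generated clip (assumed)
H, W, C = 64, 64, 3    # toy resolution

def degrade(frame: np.ndarray) -> np.ndarray:
    """Stage 2 idea: perturb the conditioning frame the way long-horizon
    rollout would, so training-time inputs match inference-time inputs."""
    noise = rng.normal(0.0, 0.05, frame.shape)
    return np.clip(frame + noise, 0.0, 1.0)

def generate_clip(first_frame, dense_ctrl, sparse_ctrl, history_ctx):
    """Stand-in for the video model: returns CLIP_LEN frames plus an
    updated history context (stage 3: context carried across clips)."""
    frames = np.repeat(first_frame[None], CLIP_LEN, axis=0)
    # Stage 1 idea: dense (per-pixel) and sparse (per-point) control
    # signals jointly steer the clip; here they just nudge pixel values.
    frames = frames + 0.01 * dense_ctrl + 0.01 * sparse_ctrl.mean()
    new_ctx = 0.9 * history_ctx + 0.1 * frames.mean(axis=(0, 1, 2))
    return np.clip(frames, 0.0, 1.0), new_ctx

def rollout(n_clips: int) -> np.ndarray:
    frame = rng.random((H, W, C))          # initial frame
    history_ctx = np.zeros(C)              # running context embedding (assumed)
    video = []
    for _ in range(n_clips):
        dense_ctrl = rng.random((CLIP_LEN, H, W, C))   # e.g. depth maps
        sparse_ctrl = rng.random((CLIP_LEN, 8, 2))     # e.g. keypoint tracks
        clip, history_ctx = generate_clip(degrade(frame), dense_ctrl,
                                          sparse_ctrl, history_ctx)
        video.append(clip)
        frame = clip[-1]                   # last frame seeds the next clip
    return np.concatenate(video, axis=0)

print(rollout(4).shape)  # (64, 64, 64, 3): 4 clips x 16 frames each
```

The point of the degradation step is worth noting: without it, a model trained only on clean ground-truth frames sees increasingly out-of-distribution inputs as its own outputs are fed back in, which is exactly the train/inference gap stage 2 targets.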

Top tags: video generation, model training, multi-modal
Detailed tags: world model, long video generation, temporal consistency, autoregressive framework, video benchmark

LongVie 2: Multimodal Controllable Ultra-Long Video World Model


1️⃣ One-Sentence Summary

This paper proposes LongVie 2, a three-stage training framework that fuses multiple control signals, improves visual quality over long-horizon generation, and enforces temporal consistency across clips, enabling the generation of high-quality, controllable, and coherent ultra-long videos of up to five minutes, an important step toward building video world models.


Source: arXiv: 2512.13604