Video Generation with Predictive Latents
1️⃣ One-Sentence Summary
This paper proposes the Predictive Video Variational Autoencoder (PV-VAE), a method that trains the model to predict future frames while encoding only a subset of past frames. This enables the latent space to better capture the dynamics of video, significantly improving both video generation quality and training efficiency.
Video Variational Autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to simultaneously reconstruct the observed frames and predict future ones. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
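The core of the predictive reconstruction objective is a data-splitting step: draw a random cutoff, feed only the past frames before it to the encoder, and supervise the decoder on the full clip (observed frames reconstructed, future frames predicted). The sketch below illustrates only this splitting scheme with numpy; the function name and the uniform cutoff distribution are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def predictive_reconstruction_split(video, rng):
    """Split a clip into encoder input (partial past) and decoder targets.

    `video` has shape (T, H, W, C). A random cutoff t in [1, T) is drawn;
    frames [0, t) become the encoder's partial observation, while the decoder
    target is the full clip, so it must reconstruct the observed frames and
    predict the discarded future ones. (Illustrative assumption: the cutoff
    is sampled uniformly; the paper does not specify this distribution here.)
    """
    T = video.shape[0]
    t = int(rng.integers(1, T))   # keep at least one observed frame
    observed = video[:t]          # encoder sees only the past
    targets = video               # decoder reconstructs + predicts all T frames
    return observed, targets, t

# Toy example: 8 frames of 4x4 single-channel video.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 4, 4, 1))
observed, targets, t = predictive_reconstruction_split(video, rng)
print(observed.shape[0], targets.shape[0])  # observed count < total frames
```

In training, a reconstruction loss on the first `t` target frames and a prediction loss on the remaining `T - t` frames would then be applied to the decoder output, encouraging the latent space to encode temporally predictive structure.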
Source: arXiv: 2605.02134