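Inference-time Physics Alignment of Video Generative Models with Latent World Models
1️⃣ One-sentence summary
This paper proposes a method that uses a physics-aware "latent world model" as guidance to adjust and optimize multiple candidate generation paths during video generation, which significantly improves the physical plausibility of the generated videos and took first place in a related challenge.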
State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search over and steer multiple candidate denoising trajectories, which allows test-time compute to be scaled for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, as validated by a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physics plausibility of video generation, beyond this specific instantiation or parameterization.
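The search half of this idea can be illustrated with a minimal best-of-N sketch: sample several candidates, score each with a reward, and keep the highest-scoring one. This is not the paper's WMReward implementation. The names `toy_generate`, `toy_reward`, and `best_of_n` are hypothetical, the "video" is a 1-D random walk, and the reward is a simple smoothness penalty standing in for a latent world model score.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[float]


def toy_generate(rng: random.Random, length: int = 16) -> Trajectory:
    """Hypothetical stand-in for one sampled video: a noisy 1-D random walk."""
    pos, out = 0.0, []
    for _ in range(length):
        pos += rng.gauss(0.0, 1.0)
        out.append(pos)
    return out


def toy_reward(traj: Trajectory) -> float:
    """Hypothetical stand-in reward: penalize large second differences (jerky motion).

    A real system would score the candidate with a learned world model instead.
    """
    accel = [traj[i + 1] - 2.0 * traj[i] + traj[i - 1] for i in range(1, len(traj) - 1)]
    return -sum(a * a for a in accel) / len(accel)


def best_of_n(
    generate: Callable[[random.Random], Trajectory],
    reward: Callable[[Trajectory], float],
    n: int,
    seed: int = 0,
) -> Tuple[Trajectory, float]:
    """Sample n candidates, score each one, and return the best with its score."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]


# More candidates means more test-time compute and a reward that cannot get worse.
_, score_1 = best_of_n(toy_generate, toy_reward, n=1)
_, score_8 = best_of_n(toy_generate, toy_reward, n=8)
print(score_1, score_8)
```

With a fixed seed, the single candidate drawn for `n=1` is also the first candidate drawn for `n=8`, so the selected reward can only stay the same or improve as `n` grows. Steering a trajectory during denoising, which the abstract also mentions, is not covered by this sketch.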
Source: arXiv: 2601.10553