VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment
1️⃣ One-sentence summary
This paper proposes a geometry-based reward model that uses pretrained geometric foundation models to evaluate the multi-view consistency of generated videos, and aligns video diffusion models through two complementary pathways. This effectively reduces inconsistency artifacts in video generation, such as object deformation and spatial drift, without requiring extensive computational resources for retraining.
Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach computes the error in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or reinforcement learning, and inference-time optimization of a causal video model (e.g., a streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.
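The core mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes per-frame depth maps, camera intrinsics `K`, and a relative pose `T_ab` have already been produced by a geometric foundation model, and it uses a simple image-gradient threshold as a stand-in for the geometry-aware sampling strategy. The key point it demonstrates is that the error is a distance between 3D points (pointwise), not a pixel-intensity residual.

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map to 3D points in camera coordinates (row-major, N x 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T      # homogeneous pixels -> camera rays
    return rays * depth.reshape(-1, 1)   # scale rays by depth

def texture_mask(gray, grad_thresh=0.05):
    """Keep only pixels with enough image gradient, i.e. textured regions
    where correspondences are reliable (proxy for geometry-aware sampling)."""
    gy, gx = np.gradient(gray.astype(float))
    return (np.hypot(gx, gy) > grad_thresh).reshape(-1)

def pointwise_reward(depth_a, depth_b, T_ab, K, gray_a):
    """Negative mean 3D distance between frame-a points mapped into frame b
    and the points that frame b's own depth map predicts at those pixels."""
    pts_a = unproject(depth_a, K)
    pts_a_in_b = pts_a @ T_ab[:3, :3].T + T_ab[:3, 3]   # rigid transform a->b
    proj = pts_a_in_b @ K.T                             # project into frame b
    uv = proj[:, :2] / proj[:, 2:3]
    h, w = depth_b.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    pts_b = unproject(depth_b, K)[v * w + u]            # matched frame-b points
    err = np.linalg.norm(pts_a_in_b - pts_b, axis=-1)   # pointwise 3D error
    mask = texture_mask(gray_a)
    return -err[mask].mean()
```

For a geometrically consistent pair (same depth, identity pose), the reward is zero; any depth drift between frames pushes it negative, which is what makes it usable as a best-of-N path verifier at inference time.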
Source: arXiv: 2603.16271