arXiv submission date: 2026-06-24
📄 Abstract - MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that leverages multi-view point tracking as an additional geometric and motion supervision signal for camera-conditioning-only novel-view video diffusion models. Our key finding is that specific attention layers encode strong correspondence cues, where query features attend to key features at geometrically corresponding locations across views and over time, and the misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen improves existing models to better follow the motion in the reference view and maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.
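To make the idea of joint training with a point-tracking objective more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract describes an auxiliary multi-view tracking head fed by attention features, whereas this toy version swaps in a simple soft-argmax over query-key attention weights to read off a correspondence. All names (`attention_correspondence`, `tracking_loss`, `joint_loss`) and the `track_weight` value are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_correspondence(query, keys, key_coords):
    """Soft-argmax correspondence for one query feature.

    Assumption (not from the paper): the predicted location is the
    attention-weighted average of the key locations.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    x = sum(w * c[0] for w, c in zip(weights, key_coords))
    y = sum(w * c[1] for w, c in zip(weights, key_coords))
    return (x, y)


def tracking_loss(pred_points, gt_points):
    """Mean L1 distance between predicted and ground-truth track points."""
    total = 0.0
    for (px, py), (gx, gy) in zip(pred_points, gt_points):
        total += abs(px - gx) + abs(py - gy)
    return total / len(pred_points)


def joint_loss(diffusion_loss, pred_points, gt_points, track_weight=0.1):
    """Diffusion loss plus a weighted tracking term (the weight is hypothetical)."""
    return diffusion_loss + track_weight * tracking_loss(pred_points, gt_points)


# Toy usage: the query matches the first key, so the predicted point
# lands close to that key's location.
keys = [[10.0, 0.0], [0.0, 10.0]]
key_coords = [(0.0, 0.0), (4.0, 2.0)]
pred = attention_correspondence([10.0, 0.0], keys, key_coords)
loss = joint_loss(1.0, [pred], [(0.0, 0.0)])
```

In this sketch, a misaligned attention map would pull the predicted point away from the ground-truth track and raise the tracking term, which is the kind of signal the abstract says is used to strengthen motion-aware correspondences.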

Top-level tags: computer vision, video generation, multi-modal
Detailed tags: 4d video generation, novel-view synthesis, point tracking, geometric consistency, diffusion models

MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation


1️⃣ One-sentence summary

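This paper proposes MVTrack4Gen, a method that adds point-tracking supervision on attention-layer features in multi-view video generation, so that a generated novel-view video keeps object motion coherent and stays geometrically consistent across views, markedly improving the realism and stability of the output.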

Source: arXiv: 2606.26087