Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation
1️⃣ One-Sentence Summary
This paper proposes a new method that learns structure-preserving motion patterns from a model that tracks object motion, and distills that knowledge into a video generation model, significantly improving the realism and structural plausibility of object motion (e.g., humans and animals) in generated videos.
Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at this https URL .
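The abstract does not spell out how the Local Gram Flow loss is computed. The PyTorch sketch below is one plausible reading, assuming the loss compares how local Gram matrices (patch-level feature correlations) change from frame to frame between teacher (SAM2) and student (CogVideoX) feature maps; the function names `local_gram` and `local_gram_flow_loss`, the non-overlapping patch unfolding, and the frame-difference formulation are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F


def local_gram(feats: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Per-patch Gram matrices of dense features.

    feats: (B, T, C, H, W) feature maps (H, W assumed divisible by `patch`).
    Returns (B, T, L, C, C), one CxC Gram matrix per local patch, L patches per frame.
    """
    B, T, C, H, W = feats.shape
    x = feats.reshape(B * T, C, H, W)
    # Unfold into non-overlapping patch x patch windows: (B*T, C*p*p, L)
    x = F.unfold(x, kernel_size=patch, stride=patch)
    L = x.shape[-1]
    # Regroup as (B*T, L, C, p*p) so each patch holds C feature vectors of length p*p
    x = x.reshape(B * T, C, patch * patch, L).permute(0, 3, 1, 2)
    # Channel-by-channel correlations within each patch, normalized by patch area
    gram = x @ x.transpose(-1, -2) / (patch * patch)
    return gram.reshape(B, T, L, C, C)


def local_gram_flow_loss(student_feats: torch.Tensor,
                         teacher_feats: torch.Tensor,
                         patch: int = 4) -> torch.Tensor:
    """Align how local feature correlations move over time (a hypothetical reading
    of 'Local Gram Flow'): match frame-to-frame differences of local Gram matrices
    between student and teacher."""
    gs = local_gram(student_feats, patch)
    gt = local_gram(teacher_feats, patch)
    flow_s = gs[:, 1:] - gs[:, :-1]
    flow_t = gt[:, 1:] - gt[:, :-1]
    return F.mse_loss(flow_s, flow_t)
```

In practice the diffusion-model and SAM2 features would almost certainly differ in channel width and spatial resolution, so a learned projection and resizing step onto a shared feature space is assumed before calling `local_gram_flow_loss`; that detail is omitted here.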
Source: arXiv:2512.11792