arXiv submission date: 2026-02-03
📄 Abstract - 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
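The abstract's core mechanism, compact motion tokens injected into the generator "semantically via cross-attention", can be sketched in a minimal NumPy form. This is a hypothetical illustration of the generic cross-attention injection pattern, not the paper's actual architecture: all shapes, projection matrices, and function names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inject_motion(video_tokens, motion_tokens, d_k=16, seed=0):
    """Hypothetical sketch: video latent tokens (queries) attend over
    compact motion tokens (keys/values) and absorb them residually.

    video_tokens:  (N, d_v) generator latents
    motion_tokens: (M, d_m) view-agnostic motion tokens from the encoder
    """
    rng = np.random.default_rng(seed)  # stand-in for learned weights
    d_v = video_tokens.shape[-1]
    d_m = motion_tokens.shape[-1]
    Wq = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    Wk = rng.standard_normal((d_m, d_k)) / np.sqrt(d_m)
    Wv = rng.standard_normal((d_m, d_v)) / np.sqrt(d_m)

    Q = video_tokens @ Wq          # (N, d_k)
    K = motion_tokens @ Wk         # (M, d_k)
    V = motion_tokens @ Wv         # (M, d_v)
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (N, M) attention over motion tokens
    return video_tokens + attn @ V          # residual injection, shape (N, d_v)
```

Because the motion tokens carry no viewpoint information, the generator's own spatial priors (plus text-driven camera conditioning) remain free to choose the rendered view; the cross-attention only constrains *what* the body does, not *where the camera is*.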

Top-level tags: computer vision · video generation · multi-modal
Detailed tags: human video generation · 3d-aware motion control · view-agnostic representation · implicit motion encoding · novel-view synthesis

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation


1️⃣ One-sentence summary

This paper proposes 3DiMo, a method that controls human motion in video generation through an implicit, view-agnostic motion representation, so that the generated video both faithfully reproduces the driving motion and freely adjusts the camera viewpoint from text instructions, outperforming existing approaches.

Source: arXiv:2602.03796