Human Video Generation from a Single Image with 3D Pose and View Control
1️⃣ One-sentence summary
This paper proposes a new method, HVG, which takes only a single photo of a person and, by controlling the 3D pose and viewing angle, automatically generates high-quality, multi-view human videos with coherent, fluid motion.
Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.
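The abstract mentions Progressive Spatio-Temporal Sampling for keeping long multi-view animations smooth. One common way progressive sampling is realized in video diffusion pipelines is to denoise overlapping temporal windows, so that consecutive windows share frames for alignment. The sketch below is only a hypothetical illustration of that windowing scheme; the function name and the window/overlap sizes are assumptions, not details from the paper.

```python
# Hypothetical sketch: split a long frame sequence into overlapping
# denoising windows. Shared frames between adjacent windows give the
# sampler an anchor for frame-to-frame (temporal) alignment.
# NOTE: window/overlap values and the API are illustrative assumptions.

def progressive_windows(num_frames: int, window: int = 16, overlap: int = 4):
    """Return (start, end) index pairs covering num_frames frames,
    where consecutive windows share `overlap` frames."""
    if num_frames <= window:
        return [(0, num_frames)]
    stride = window - overlap
    windows = []
    start = 0
    while start + window < num_frames:
        windows.append((start, start + window))
        start += stride
    # Final window is flush with the end so every frame is covered.
    windows.append((num_frames - window, num_frames))
    return windows
```

For a 40-frame clip with 16-frame windows and a 4-frame overlap, this yields three windows, each sharing frames with its neighbor; the shared frames are where a sampler could enforce consistency between successive denoising passes.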
Source: arXiv:2602.21188