LooseControlVideo: Directorial Video Control using Spatial Blocking
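1️⃣ One-sentence summary
This paper proposes a new method that lets users control the positions, trajectories, and interactions of multiple objects in AI video generation simply by dragging a few 3D boxes, much like placing props on a stage. This greatly simplifies the production of complex scenes and markedly improves the accuracy of the generated video.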
Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they require dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation, and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.
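To make the idea of sparse "blocking" concrete, the sketch below shows one way a user-authored set of oriented 3D box keyframes could be expanded into a per-frame box track. It is an illustration only: the names `Box3D`, `interpolate_box`, and `box_track` are hypothetical, the linear interpolation between keyframes is an assumption, and the code does not implement the DNOCS encoding or any part of the model described in the abstract.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical oriented 3D box: center, size, and yaw about the vertical axis."""
    center: tuple  # (x, y, z)
    size: tuple    # (width, height, depth)
    yaw: float     # radians


def lerp(a, b, t):
    """Linear interpolation between scalars a and b for t in [0, 1]."""
    return a + (b - a) * t


def interpolate_box(k0, k1, t):
    """Blend two keyframe boxes; yaw follows the shortest angular path."""
    dyaw = (k1.yaw - k0.yaw + math.pi) % (2 * math.pi) - math.pi
    return Box3D(
        center=tuple(lerp(a, b, t) for a, b in zip(k0.center, k1.center)),
        size=tuple(lerp(a, b, t) for a, b in zip(k0.size, k1.size)),
        yaw=k0.yaw + dyaw * t,
    )


def box_track(keyframes, num_frames):
    """Expand sparse keyframes {frame_index: Box3D} into one box per frame.

    Frames before the first keyframe or after the last one hold the
    nearest keyframe (an assumption made for this sketch).
    """
    frames = sorted(keyframes)
    track = []
    for f in range(num_frames):
        if f <= frames[0]:
            track.append(keyframes[frames[0]])
        elif f >= frames[-1]:
            track.append(keyframes[frames[-1]])
        else:
            # Find the pair of keyframes that brackets frame f.
            for lo, hi in zip(frames, frames[1:]):
                if lo <= f <= hi:
                    t = (f - lo) / (hi - lo)
                    track.append(interpolate_box(keyframes[lo], keyframes[hi], t))
                    break
    return track


# Two keyframes: the box slides 10 units along x while turning 90 degrees.
keys = {
    0: Box3D((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 0.0),
    10: Box3D((10.0, 0.0, 0.0), (1.0, 1.0, 1.0), math.pi / 2),
}
track = box_track(keys, 11)
```

In this sketch, a localized refinement such as adjusting a jump trajectory would amount to editing or adding a single keyframe and recomputing only the frames between its neighbors.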
Source: arXiv:2606.19495