Learning Long-term Motion Embeddings for Efficient Kinematics Generation
1️⃣ One-sentence summary
This paper proposes an efficient method for generating future motion: it first learns a highly compressed motion "code", then quickly generates diverse, realistic long-term motion trajectories in this compressed space conditioned on text or spatial instructions, which is far faster than synthesizing the entire video.
Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
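The conditional flow-matching training step described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's code: the latent shapes, the toy linear "model", and the conditioning vector are all hypothetical stand-ins for the learned motion embedding (with its 64x temporal compression) and the real network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 64x temporally compressed motion latent
T_COMP, D = 4, 8  # compressed temporal length, channel dimension (assumed)

def flow_matching_loss(x1, cond, predict_velocity):
    """One conditional flow-matching training step (rectified-flow variant).

    x1   : (T_COMP, D) target motion latent from the learned embedding space
    cond : conditioning vector (e.g. a text-prompt embedding) - hypothetical
    predict_velocity : model v(x_t, t, cond) -> (T_COMP, D)
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample at t = 0
    t = rng.uniform()                    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1        # point on the linear interpolation path
    target = x1 - x0                     # constant velocity along that path
    pred = predict_velocity(x_t, t, cond)
    return np.mean((pred - target) ** 2) # regress predicted onto true velocity

# Toy linear "model" standing in for the real conditional network
W = rng.standard_normal((D, D)) * 0.1
def toy_model(x_t, t, cond):
    return x_t @ W + t * cond            # cond broadcasts over the time axis

x1 = rng.standard_normal((T_COMP, D))    # one motion latent (illustrative)
cond = rng.standard_normal(D)            # one task embedding (illustrative)
loss = flow_matching_loss(x1, cond, toy_model)
```

At sampling time, the trained velocity field would instead be integrated from noise to a motion latent (e.g. with a few Euler steps), which is where the efficiency over full video synthesis comes from: the latent is orders of magnitude smaller than the pixel sequence it describes.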
Source: arXiv: 2604.11737