📄 Abstract - Captain Safari: A World Engine

World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.
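The retrieval idea in the abstract (keep a bounded local memory, then fetch the entries best aligned with a query camera pose) can be illustrated with a toy sketch. This is not the paper's method: the abstract does not specify the retriever, and it may well be learned. Here the class name `PoseMemory`, the use of camera position only (rather than a full 6-DoF pose), the Euclidean nearest-neighbor rule, and the oldest-first eviction are all assumptions made purely for illustration.

```python
import math
from collections import deque


class PoseMemory:
    """Toy bounded memory of (camera position, token) pairs (hypothetical)."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest entry once full.
        self.entries = deque(maxlen=capacity)

    def write(self, position, token):
        # Store the camera position alongside an opaque "world token".
        self.entries.append((tuple(position), token))

    def retrieve(self, query_position, k):
        # Rank stored entries by Euclidean distance to the query position
        # and return the tokens of the k closest ones.
        ranked = sorted(
            self.entries,
            key=lambda entry: math.dist(entry[0], query_position),
        )
        return [token for _, token in ranked[:k]]


memory = PoseMemory(capacity=3)
for i in range(4):
    # Positions along the x-axis; the token is just a label here.
    memory.write((float(i), 0.0, 0.0), f"token_{i}")

# The entry at x=0 has been evicted, so the two nearest are x=1 and x=2.
nearest = memory.retrieve((0.0, 0.0, 0.0), k=2)
```

In a real system the retrieved tokens would condition the video generator for the frames along the trajectory; the sketch stops at retrieval because the abstract gives no further detail.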

Top-level tags: computer vision, video generation, AIGC
Detailed tags: world engine, 3D-consistent video, camera trajectory, pose-conditioned generation, video synthesis

Captain Safari: A World Engine


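1️⃣ One-sentence summary

This paper presents Captain Safari, a new system that uses a distinctive "world memory" mechanism to stably generate long, 3D-consistent exploratory videos along complex, user-specified camera paths, and validates its superior performance on a newly built dataset of real-world drone videos.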

