CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
1️⃣ One-Sentence Summary
CityRAG introduces a video generative model that leverages real geographic data to generate long, freely navigable videos consistent with the physical world, supporting arbitrary weather conditions and dynamic-object variations, and thereby providing high-fidelity virtual city environments for autonomous driving and robotics simulation.
We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
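The abstract describes grounding generation by retrieving geo-registered data as context for the queried location. As a minimal illustrative sketch (not the paper's implementation; the database schema and retrieval function here are hypothetical), retrieval could amount to selecting the k reference frames whose registered positions lie nearest the queried camera pose:

```python
import math

# Hypothetical sketch: retrieve the k geo-registered reference frames
# nearest to a queried camera position; these would then serve as
# spatial context conditioning the video generator.

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def retrieve_context(db, query_lat, query_lon, k=3):
    """Return the k database entries closest to the query position."""
    return sorted(
        db, key=lambda e: haversine_m(e["lat"], e["lon"], query_lat, query_lon)
    )[:k]

# Toy geo-registered database: each entry stands in for a reference frame.
db = [
    {"id": "frame_a", "lat": 40.7580, "lon": -73.9855},
    {"id": "frame_b", "lat": 40.7484, "lon": -73.9857},
    {"id": "frame_c", "lat": 40.7061, "lon": -74.0087},
]
ctx = retrieve_context(db, 40.7575, -73.9850, k=2)
print([e["id"] for e in ctx])  # prints ['frame_a', 'frame_b']
```

A real system would index by full 6-DoF pose and visual similarity rather than position alone, but the retrieval-then-condition pattern is the same.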
Source: arXiv: 2604.19741