arXiv submission date: 2026-02-26
📄 Abstract - UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.
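The paper does not spell out the warping mechanism in the abstract, but the core idea of "time-aware positional encoding warping" can be illustrated with a minimal sketch: positions observed in a past frame are back-projected with depth, re-projected into the current camera, encoded with a standard sinusoidal positional encoding, and down-weighted by elapsed time. All function names, the exponential time decay, and the specific encoding below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinusoidal_pe(coords, dim=8):
    """Standard sinusoidal encoding of (possibly fractional) coordinates."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = coords[..., None] * freqs                      # (..., dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def warp_positions(pixels, depth, K, T_past_to_now):
    """Project past-frame pixels into the current camera via depth + relative pose.

    pixels: (N, 2) pixel coordinates, depth: (N,), K: (3, 3) intrinsics,
    T_past_to_now: (4, 4) relative camera pose. Returns warped (N, 2) coords.
    """
    ones = np.ones((pixels.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([pixels, ones]).T   # (3, N) unit-depth rays
    pts = rays * depth                                      # back-project to 3D
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])    # homogeneous (4, N)
    cam = (T_past_to_now @ pts_h)[:3]                       # into current camera
    uv = (K @ cam) / cam[2]                                 # perspective divide
    return uv[:2].T

def time_aware_pe(pixels, depth, K, T, dt, tau=8.0, dim=8):
    """Warp past positions into the current view, damp encoding by elapsed time."""
    warped = warp_positions(pixels, depth, K, T)
    weight = np.exp(-dt / tau)  # assumed decay: older memories contribute less
    return weight * sinusoidal_pe(warped, dim).reshape(len(pixels), -1)
```

With an identity pose and unit depth the warp is the identity, so the encoding reduces to a plain (time-weighted) positional encoding of the original pixels; a nontrivial relative pose shifts the encodings to where those scene points now appear, which is what gives revisited content a consistent positional signal.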

Top-level tags: computer vision, video generation, model training
Detailed tags: world models, camera control, long-term memory, video generation, diffusion transformer

UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models


1️⃣ One-sentence summary

This paper proposes a new framework called UCM, which uses a novel time-aware positional encoding warping technique to tackle two major challenges in video-generation world models: long-term content inconsistency and imprecise camera control, enabling the generation of coherent, controllable, high-quality simulated-environment videos.

Source: arXiv: 2602.22960