arXiv submission date: 2025-12-18
📄 Abstract - The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world-event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: this https URL.

Top-level tags: video generation, multi-modal, aigc
Detailed tags: promptable events, multimodal generation, trajectory control, world models, video synthesis

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text


1️⃣ One-sentence summary

This paper proposes a multimodal framework called WorldCanvas that lets users generate controllable, coherent simulated videos with complex interactions by combining text, motion trajectories, and reference images, turning world models from passive prediction tools into simulators that users can interactively shape.


Source: arXiv: 2512.16924