📄 Abstract - MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must offer both stylistic diversity and fine-grained controllability. However, existing methods struggle to balance the creative flexibility of text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. We also develop a practical set of evaluation metrics covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% compared with CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results establish MajutsuCity as a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework to inspire new avenues of research in 3D city generation. Our dataset and code will be released.
MajutsuCity: A Natural Language-Driven, Aesthetic-Adaptive 3D City Generation Framework /
MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
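1️⃣ One-Sentence Summary
MajutsuCity is a 3D urban scene generation system driven by natural language instructions. It supports aesthetic-adaptive control and object-level interactive editing, and produces structurally consistent, stylistically diverse cities through a four-stage pipeline.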
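2️⃣ Key Contributions
1. Language-driven, aesthetic-adaptive framework
- Contribution: Natural language instructions serve as a single control interface for both city scene generation and interactive editing; the text is parsed into geometric and aesthetic specifications
- Difference/improvement: Addresses the trade-off in existing methods between the creative flexibility of text-based generation and the editability of explicit structural representations
- Significance: Enables language-driven creation and continued modification of large-scale, stylistically diverse city scenes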
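2. Two-stage decoupled generation architecture
- Contribution: Splits city generation into a layout generation stage and a height map generation stage; the first handles the semantic layout, and the second handles spatial consistency
- Difference/improvement: Uses LongCLIP to encode long text descriptions and ControlNet to enforce spatial consistency
- Significance: Achieves coherent generation from text-based guidance through to layout and height map synthesis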
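3. MajutsuAgent interactive editing agent
- Contribution: Integrates a language-grounded editing agent that supports five standardized object-level operations: add, delete, edit, move, and replace
- Difference/improvement: Extends controllability beyond initial generation and supports iterative refinement
- Significance: Gives users finer control over generated scenes and provides a complete workflow from generation to editing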
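The five object-level operations can be pictured as methods on a simple scene container. The following is a minimal sketch, not the paper's implementation: the `Asset` and `Scene` classes, their fields, and the method names are all hypothetical, and the step where an agent turns a language instruction into one of these calls is left out entirely.

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Asset:
    """A placed object in the scene. All fields are illustrative assumptions."""
    asset_id: str
    model: str                      # name of a 3D model, e.g. "office_tower"
    position: Tuple[float, float]   # (x, y) on the ground plane
    style: str = "default"


@dataclass
class Scene:
    """Hypothetical scene container keyed by asset id."""
    assets: Dict[str, Asset] = field(default_factory=dict)

    def add(self, asset: Asset) -> None:
        # Refuse to silently overwrite an existing object.
        if asset.asset_id in self.assets:
            raise KeyError(f"asset {asset.asset_id!r} already exists")
        self.assets[asset.asset_id] = asset

    def delete(self, asset_id: str) -> None:
        del self.assets[asset_id]

    def edit(self, asset_id: str, **changes) -> None:
        # Build a modified copy; Asset itself is immutable.
        self.assets[asset_id] = dataclasses.replace(self.assets[asset_id], **changes)

    def move(self, asset_id: str, position: Tuple[float, float]) -> None:
        self.edit(asset_id, position=position)

    def replace_model(self, asset_id: str, model: str) -> None:
        # Swap the 3D model while keeping position and style.
        self.edit(asset_id, model=model)


scene = Scene()
scene.add(Asset("b1", "office_tower", (0.0, 0.0)))
scene.move("b1", (5.0, 2.0))
scene.edit("b1", style="art_deco")
scene.replace_model("b1", "gothic_tower")
```

Keeping each operation as a small, explicit state change is one way to make iterative refinement predictable: every edit touches exactly one object and leaves the rest of the scene untouched.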
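4. MajutsuDataset multimodal dataset
- Contribution: A high-quality dataset containing 2D semantic layouts with building heights, 3D building assets in diverse styles, PBR materials, and skyboxes
- Difference/improvement: Provides comprehensive data support for photorealistic and customizable scene synthesis
- Significance: Addresses the limited data coverage and stylistic diversity of existing methods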
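5. VLM-based evaluation framework
- Contribution: Provides absolute scoring (AQS) and relative dimension ranking (RDR), covering structural consistency, scene richness, material fidelity, and lighting atmosphere
- Difference/improvement: Addresses the lack of dedicated metrics for 3D city scenes
- Significance: Offers a reliable way to assess the key dimensions of generation quality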
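To make the difference between an absolute score and a per-dimension ranking concrete, here is a minimal sketch of how the two views could be aggregated from per-scene ratings. It is an illustration under stated assumptions, not the paper's protocol: the function names, the dimension keys, and the choice to derive ranks from mean scores are all hypothetical, and the summary does not say how the VLM produces its ratings or rankings.

```python
from statistics import mean
from typing import Dict, List

# The four dimensions named in the summary; the key spellings are assumptions.
DIMENSIONS = [
    "structural_consistency",
    "scene_richness",
    "material_fidelity",
    "lighting_atmosphere",
]

Ratings = Dict[str, Dict[str, List[float]]]  # method -> dimension -> per-scene ratings


def absolute_scores(ratings: Ratings) -> Dict[str, Dict[str, float]]:
    """Average the per-scene ratings of each method on each dimension."""
    return {
        method: {dim: mean(per_dim[dim]) for dim in DIMENSIONS}
        for method, per_dim in ratings.items()
    }


def dimension_ranks(scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, int]]:
    """Rank methods within each dimension (1 = best, higher score wins)."""
    ranks: Dict[str, Dict[str, int]] = {method: {} for method in scores}
    for dim in DIMENSIONS:
        ordered = sorted(scores, key=lambda m: scores[m][dim], reverse=True)
        for position, method in enumerate(ordered, start=1):
            ranks[method][dim] = position
    return ranks


# Made-up ratings for two placeholder methods, purely for illustration.
example = {
    "method_a": {dim: [8.0, 9.0] for dim in DIMENSIONS},
    "method_b": {dim: [6.0, 7.0] for dim in DIMENSIONS},
}
scores = absolute_scores(example)
ranks = dimension_ranks(scores)
```

An absolute score says how good a scene looks on its own scale, while a ranking only says which method wins on a given dimension; reporting both guards against a method that wins every comparison while still scoring poorly.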
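3️⃣ Main Results and Value
Result highlights
- Substantially outperforms CityDreamer and CityCraft on layout FID, reducing it by 83.7% and 20.1%, respectively
- Performs better in geometric fidelity, multi-view consistency, and stylistic diversity
- Quantitative AQS and RDR results show that the method performs best among all baselines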
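Because a lower FID is better, the reported percentages are relative reductions against each baseline's FID. A minimal sketch of that arithmetic follows; the function name is hypothetical and the numbers in the usage lines are made up for illustration, not taken from the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (positive = improvement)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline * 100.0


# Hypothetical FID values, chosen only to illustrate the formula.
reduction = relative_reduction(baseline=100.0, ours=16.3)
print(f"{reduction:.1f}% lower than the baseline")
```

With these made-up inputs the reduction comes out at about 83.7%, which shows what a figure of that size means: the new score is roughly one sixth of the baseline's.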
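Practical value
- Provides a technical foundation for virtual reality, game development, and digital twins
- Delivers intuitive user control and aesthetic adaptivity
- Provides a complete workflow from generation to editing, making the system more practical and easier to operate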
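4️⃣ Glossary
- MajutsuCity: A natural language-driven, aesthetic-adaptive framework for controllable 3D city scene generation
- MajutsuAgent: A natural language-driven city scene editing system that supports add, delete, edit, move, and replace operations
- MajutsuDataset: A multimodal dataset for text-guided 3D city scene generation, containing layouts and height maps, 3D building models, and material assets
- LongCLIP: A text encoder for long text descriptions; it replaces standard CLIP to produce information-rich semantic features
- ControlNet: A control-network architecture that injects pixel-level control signals through zero-convolution layers to enforce spatial consistency
- FID: Fréchet Inception Distance, a metric for the visual fidelity and diversity of generated images; lower values indicate better generation quality
- PBR: Physically-Based Rendering; PBR materials come with a complete set of texture maps
- AQS: An absolute quantitative score used to assess generation quality
- RDR: A relative dimension ranking used to assess generation quality in relative terms
- WorldGen: A system proposed by Meta that generates fully interactive 3D worlds directly from text prompts, following a paradigm of global planning followed by local generation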