arXiv submission date: 2026-04-07
📄 Abstract - SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming geometric coherence and restricting rendering to training views, or are limited to small-scale 3D scenes or object-centric generation. In this work, we propose a 3D generative framework based on a $\Sigma$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $\Sigma$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories, while maintaining moderate computational cost compared to existing approaches.
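The abstract describes a grid where each occupied voxel holds a fixed number of colorized surface samples. A minimal sketch of such a structure is below, assuming a sparse dictionary keyed by integer voxel coordinates; the class name, the per-voxel layout, and the constant `K` are illustrative assumptions, not the paper's actual data structure.

```python
from dataclasses import dataclass, field

import numpy as np

K = 8  # assumed fixed number of surface samples per occupied voxel


@dataclass
class VoxelGrid:
    """Sparse voxel grid; each occupied voxel stores up to K colorized
    surface samples: columns 0-2 are local xyz offsets in [0, 1),
    columns 3-5 are RGB colors in [0, 1]."""

    voxel_size: float
    voxels: dict = field(default_factory=dict)  # (i, j, k) -> (<=K, 6) array

    def insert_samples(self, points: np.ndarray, colors: np.ndarray) -> None:
        """Bucket world-space surface points into voxels, keeping at most K each."""
        keys = np.floor(points / self.voxel_size).astype(int)
        for key, p, c in zip(map(tuple, keys), points, colors):
            local = p / self.voxel_size - np.asarray(key)  # offset inside voxel
            sample = np.concatenate([local, c])
            existing = self.voxels.get(key)
            if existing is None:
                self.voxels[key] = sample[None, :]
            elif len(existing) < K:
                self.voxels[key] = np.vstack([existing, sample])
            # else: voxel is full; a real system would subsample or merge

    def occupied(self) -> int:
        return len(self.voxels)
```

A fixed per-voxel sample budget keeps every occupied voxel the same size, which is convenient for batching voxels through a diffusion network.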

Top-level tags: computer vision, model training, multi-modal
Detailed tags: 3D scene generation, diffusion models, voxel representation, driving scenes, semantic conditioning

SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation


1️⃣ One-sentence summary

This paper proposes a new 3D generative framework: a semantic-conditioned diffusion model that efficiently generates large-scale, multiview-consistent outdoor driving scenes renderable into photorealistic images, without per-scene optimization.
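One ingredient behind scaling to large scenes is progressive spatial outpainting over overlapping regions: the scene is generated tile by tile, and each new tile overlaps the already-generated area so the generator can condition on it. A 1D sketch of the window schedule, with all names hypothetical:

```python
def outpaint_windows(scene_len: int, tile: int, overlap: int):
    """Cover [0, scene_len) with tiles of length `tile`, each sharing
    `overlap` cells with the previously generated region. In the full
    method this would be done over 2D/3D regions, with a diffusion
    model filling each window conditioned on the overlap."""
    assert 0 <= overlap < tile <= scene_len
    windows, start = [], 0
    while True:
        end = min(start + tile, scene_len)
        windows.append((start, end))
        if end == scene_len:
            return windows
        start = end - overlap  # step back so the next tile sees context
```

The overlap trades extra computation for seamless boundaries: larger overlaps give the model more context at each step but require more windows to cover the scene.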

Source: arXiv 2604.06113