
arXiv submission date: 2026-03-03
📄 Abstract - VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction

This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and with performance degradation under adverse weather. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework begins with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which uses gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. We further introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that uses vehicle metadata and weather-conditioned prompts to re-weight sensor contributions according to real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss aligns dense camera-derived geometry with sparse but spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently improve state-of-the-art voxel-based baselines, with notably large gains in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
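The abstract's central mechanism — gated cross-attention that injects CLIP-style text embeddings into voxel features — can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: the module name, tensor shapes, and the zero-initialized tanh gate (a common trick so the fused model starts as an identity over the voxel branch) are all hypothetical choices, and the LoRA adaptation of the CLIP encoder is omitted.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionSketch(nn.Module):
    """Hypothetical sketch of InstVLM-style injection: flattened voxel tokens
    attend over per-class text embeddings (e.g., from a frozen CLIP text
    encoder), and a learnable gate controls how much semantics is injected."""

    def __init__(self, voxel_dim=128, text_dim=512, num_heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, voxel_dim)  # map text width to voxel width
        self.attn = nn.MultiheadAttention(voxel_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0)=0: starts as identity

    def forward(self, voxels, text_emb):
        # voxels:   (B, N, voxel_dim) — flattened 3D voxel features
        # text_emb: (B, C, text_dim)  — one embedding per semantic class prompt
        keys = self.text_proj(text_emb)
        attended, _ = self.attn(query=voxels, key=keys, value=keys)
        # Gated residual: semantic priors are blended in, never overwrite geometry
        return voxels + torch.tanh(self.gate) * attended

voxels = torch.randn(2, 64, 128)      # 2 scenes, 64 voxel tokens
text_emb = torch.randn(2, 10, 512)    # 10 class prompts
fused = GatedCrossAttentionSketch()(voxels, text_emb)
print(fused.shape)  # torch.Size([2, 64, 128])
```

Because the gate is initialized closed, plugging such a module into an existing voxel baseline leaves its behavior unchanged at step zero, which is one plausible reading of the "plug-and-play" claim.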

Top-level tags: computer vision · multi-modal · autonomous driving
Detailed tags: 3d semantic occupancy · vision-language models · sensor fusion · adverse weather · voxel-based prediction

VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction


1️⃣ One-Sentence Summary

This paper proposes VLMFusionOcc3D, a method that fuses the semantic understanding of vision-language models with LiDAR and camera data, enabling autonomous vehicles to recognize the 3D structure and object categories of their surroundings more accurately and reliably under all weather conditions.

Source: arXiv 2603.02609