arXiv submission date: 2026-01-15
📄 Abstract - Urban Socio-Semantic Segmentation with Vision-Language Reasoning

As hubs of human activity, urban surfaces contain a wealth of semantic entities. Segmenting these entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation via vision-language model reasoning. To facilitate this, we introduce SocioSeg, a new Urban Socio-Semantic Segmentation dataset comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose SocioReasoner, a novel vision-language reasoning framework that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and its strong zero-shot generalization. Our dataset and code are available at this https URL.
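The abstract notes that the labeling process is non-differentiable, so the model is optimized with reinforcement learning. As a minimal sketch of that idea (not the paper's actual method: the labels, reward, and update rule here are illustrative assumptions), a REINFORCE-style policy gradient can train a categorical "reasoner" whose only feedback is a non-differentiable score, such as overlap with ground truth:

```python
import math
import random

random.seed(0)

# Hypothetical setup: the reasoner assigns a social-semantic label to a
# region; the reward (a stand-in for a segmentation score like IoU) is
# non-differentiable, so we use the REINFORCE policy-gradient estimator.
LABELS = ["school", "park", "hospital"]
logits = [0.0, 0.0, 0.0]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def reward(label, ground_truth="park"):
    # Non-differentiable feedback: 1 if the label matches, else 0.
    return 1.0 if label == ground_truth else 0.0


lr = 0.5
for step in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(LABELS)), weights=probs)[0]
    r = reward(LABELS[i])
    # REINFORCE: grad of log pi(a_i) w.r.t. logits is (one_hot_i - probs),
    # scaled by the scalar reward.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * r * grad

best = LABELS[max(range(len(LABELS)), key=lambda j: logits[j])]
```

After training, `best` converges to the rewarded label ("park"), illustrating how a gradient signal can be recovered from a reward the model cannot backpropagate through.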

Top-level tags: computer vision, multi-modal, model training
Detailed tags: semantic segmentation, vision-language reasoning, satellite imagery, reinforcement learning, urban analysis

Urban Socio-Semantic Segmentation with Vision-Language Reasoning


1️⃣ One-sentence summary

This paper proposes a new method that combines vision and language models for reasoning, successfully tackling the difficulty of identifying socially defined functional areas (such as schools and parks) in satellite imagery, and releases the first dataset for this task along with an effective learning framework.

Source: arXiv:2601.10477