TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
1️⃣ One-sentence summary
This paper introduces TerraScope, a new vision-language model that performs geospatial reasoning grounded in concrete pixel locations within satellite imagery, and presents the first benchmark for evaluating such pixel-grounded reasoning, substantially improving the accuracy and interpretability of Earth-observation analysis.
Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, with six sub-tasks that evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
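The abstract states that TerraScope "adaptively fuses different modalities into the reasoning process when both are available" but does not describe the mechanism. As a purely illustrative sketch (the gating scheme, function names, and dimensions below are assumptions, not from the paper), modality-flexible fusion can be modeled as pass-through for a single input and gated blending when both optical and SAR features are present:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_modalities(optical=None, sar=None, w=None):
    """Toy modality-flexible fusion: pass a single modality through
    unchanged, or blend both with a sigmoid gate when both are given.
    (Illustrative only; not the paper's actual architecture.)"""
    if optical is None and sar is None:
        raise ValueError("at least one modality is required")
    if sar is None:
        return optical
    if optical is None:
        return sar
    # per-dimension gate computed from the concatenated features
    z = np.concatenate([optical, sar], axis=-1) @ w
    g = 1.0 / (1.0 + np.exp(-z))  # sigmoid in [0, 1]
    return g * optical + (1.0 - g) * sar

d = 8                                   # hypothetical feature width
w = rng.normal(size=(2 * d, d)) * 0.1   # stand-in for learned gate weights
opt = rng.normal(size=(d,))             # optical feature vector
sar = rng.normal(size=(d,))             # SAR feature vector

single = fuse_modalities(optical=opt)        # single-modality pass-through
fused = fuse_modalities(optical=opt, sar=sar, w=w)  # adaptive fusion
```

The design point this sketch captures is that the reasoning backbone sees one fixed-width feature vector regardless of how many modalities were supplied, which is what lets a single model serve optical-only, SAR-only, and combined inputs.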
Source: arXiv:2603.19039