Asking like Socrates: Socrates helps VLMs understand remote sensing images
1️⃣ One-sentence summary
This paper targets the "pseudo reasoning" problem that vision-language models exhibit when analyzing remote sensing images. It proposes a new method called RS-EoT, which emulates Socratic multi-round questioning and self-checking to guide the model to progressively seek visual evidence, yielding more accurate reasoning that is grounded in the actual image content.
Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained grounding tasks to strengthen RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming that RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at this https URL.
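To make the abstract's "alternating cycles of reasoning and visual inspection" concrete, here is a minimal sketch of what such an evidence-seeking loop could look like. All names (`propose_question`, `inspect_region`, `answer_with_evidence`) and the stub logic are hypothetical illustrations, not the authors' released SocraticAgent implementation.

```python
# Hypothetical sketch of an RS-EoT-style loop: the model alternates between
# proposing a question about the scene (reasoning) and inspecting a cropped
# region of the large image for evidence (visual inspection), instead of
# answering from a single coarse glance. Everything below is illustrative.
from dataclasses import dataclass, field


@dataclass
class EvidenceTrace:
    """Accumulated (question, observation) pairs grounding the final answer."""
    steps: list = field(default_factory=list)


def propose_question(query: str, trace: EvidenceTrace):
    """Reasoning step: decide what visual evidence is still missing.

    Placeholder policy: ask one follow-up per query, then stop.
    A real system would let the VLM itself generate this question.
    """
    if trace.steps:
        return None  # enough evidence gathered in this toy example
    return f"Which image region is relevant to: '{query}'?"


def inspect_region(image_path: str, question: str) -> str:
    """Visual-inspection step: crop/zoom and describe the region.

    Placeholder: returns a canned observation; a real system would call the
    VLM on a high-resolution crop selected by a grounding module.
    """
    return f"[observation for '{question}' in {image_path}]"


def answer_with_evidence(image_path: str, query: str, max_rounds: int = 4) -> str:
    """Socratic-style loop: reason -> inspect -> reason, then answer."""
    trace = EvidenceTrace()
    for _ in range(max_rounds):
        question = propose_question(query, trace)
        if question is None:          # model judges the evidence sufficient
            break
        observation = inspect_region(image_path, question)
        trace.steps.append((question, observation))
    # The final answer is conditioned on the collected evidence,
    # not on a single glance at the full image.
    return f"Answer to '{query}' grounded in {len(trace.steps)} evidence step(s)."


if __name__ == "__main__":
    print(answer_with_evidence("scene.tif", "How many storage tanks are visible?"))
```

The point of the loop structure is that each answer is conditioned on an explicit trace of (question, observation) pairs, which is the behavior the paper's analyses describe as mitigating the Glance Effect.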