VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
1️⃣ One-Sentence Summary
This paper proposes VG-CoT, a fully automated method that generates a multi-step reasoning chain for each image, where every reasoning step is precisely grounded to a specific region or piece of text in the image. This helps AI models make more trustworthy, evidence-backed visual judgments while substantially reducing manual annotation cost.
The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation, and they lack explicit alignment between multi-step reasoning and the corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLMs' reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
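The three-stage pipeline described in the abstract can be sketched in code. This is a minimal illustrative skeleton, not the authors' implementation: all function names, data shapes, and the mocked detector/OCR/LLM calls are assumptions added here to show how each reasoning step would carry its own visual evidence.

```python
# Hypothetical sketch of the VG-CoT three-stage pipeline.
# Real detection, OCR, and GPT-4o calls are replaced by mocks.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class Evidence:
    label: str   # detected object class, or recognized OCR text
    box: Box     # image region the label refers to
    source: str  # "detector" or "ocr"

@dataclass
class GroundedStep:
    rationale: str            # one step of the reasoning chain
    evidence: List[Evidence]  # visual evidence this step cites

def extract_evidence(image) -> List[Evidence]:
    """Stage 1: object detection + OCR (mocked outputs here)."""
    return [
        Evidence("stop sign", (10.0, 10.0, 80.0, 80.0), "detector"),
        Evidence("STOP", (20.0, 30.0, 70.0, 55.0), "ocr"),
    ]

def generate_grounded_cot(question: str,
                          evidence: List[Evidence]) -> List[GroundedStep]:
    """Stage 2: an LLM (GPT-4o in the paper) would write reasoning steps
    that cite evidence; mocked as one step per evidence item."""
    return [GroundedStep(f"The region contains {e.label}.", [e])
            for e in evidence]

def refine_grounding(steps: List[GroundedStep]) -> List[GroundedStep]:
    """Stage 3: rationale-driven open-set detection would re-verify each
    step's boxes; mocked as dropping steps with no evidence at all."""
    return [s for s in steps if s.evidence]

def build_sample(image, question: str) -> List[GroundedStep]:
    """Chain the three stages into one dataset-construction call."""
    evidence = extract_evidence(image)
    steps = generate_grounded_cot(question, evidence)
    return refine_grounding(steps)
```

The key design point the sketch captures is that grounding is explicit: each `GroundedStep` carries the bounding boxes it relies on, so Rationale Quality and Reasoning-Answer Alignment can later be scored against concrete regions rather than free-form text.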
Source: arXiv: 2604.21396