arXiv submission date: 2026-04-27
📄 Abstract - Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

Vision-Language Models (VLMs) are frequently undermined by object hallucination--generating content that contradicts visual reality--due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail--all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
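The abstract describes contrasting the outputs of the positive (visually amplified) and negative (visually degraded) paths at each decoding step, but does not give the exact combination rule. A minimal sketch of one such step, assuming the common contrastive-decoding form; the function name `pnd_step` and the contrast weight `alpha` are hypothetical, not from the paper:

```python
def pnd_step(logits_pos, logits_neg, alpha=1.0):
    """One hypothetical decoding step of a dual-path contrast.

    logits_pos: next-token logits from the positive path
                (visual evidence amplified via multi-layer attention)
    logits_neg: next-token logits from the negative path
                (core object features degraded, exposing linguistic priors)
    alpha:      contrast strength; alpha=0 reduces to the positive path alone

    Returns the greedily selected token id under the contrasted scores.
    """
    # Reward tokens the visually grounded path favors, penalize tokens
    # the prior-dominant counterfactual path favors.
    contrast = [(1 + alpha) * p - alpha * n
                for p, n in zip(logits_pos, logits_neg)]
    return max(range(len(contrast)), key=contrast.__getitem__)
```

With `alpha=1.0`, a token that scores high mainly because of linguistic priors (high in both paths) is down-weighted relative to one supported specifically by the visual evidence (high in the positive path only).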

Top-level tags: llm multi-modal model evaluation
Detailed tags: visual grounding hallucination mitigation decoding strategy attention deficit object hallucination

Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation


1️⃣ One-sentence summary

This paper proposes PND, a training-free inference framework that contrasts two decoding paths, one amplifying visual evidence and one suppressing it to expose linguistic priors. This contrast corrects the object hallucinations that Vision-Language Models produce when they over-rely on language priors, markedly improving the visual faithfulness of their outputs.

Source: arXiv: 2604.24396