arXiv submission date: 2026-04-01
📄 Abstract - First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models

Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination -- the generation of nonexistent objects in answers -- remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the "The" token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems. Code is available at this https URL
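The core mechanism in the abstract -- cache the logits of the first generated token and add them to every subsequent prediction step -- can be sketched as a small decoding loop. This is a minimal illustration, not the authors' implementation: the input `step_logits` stands in for per-step logit vectors from some LVLM decoder, and the scaling factor `alpha` is a hypothetical knob (the abstract only says the first logit is "added").

```python
import numpy as np

def flb_decode(step_logits, alpha=1.0):
    """Toy sketch of First Logit Boosting (FLB) greedy decoding.

    step_logits: iterable of 1-D logit vectors, one per generation
    step (hypothetical stand-in for an LVLM decoder's outputs).
    alpha: assumed scaling factor for the boost (not in the abstract).
    """
    first = None
    tokens = []
    for logits in step_logits:
        logits = np.asarray(logits, dtype=float)
        if first is None:
            # Cache the first step's logits, which the paper argues
            # carry the strongest visual grounding.
            first = logits.copy()
            boosted = logits
        else:
            # Add the cached first-token logits to later steps so
            # visual information persists instead of decaying.
            boosted = logits + alpha * first
        tokens.append(int(np.argmax(boosted)))
    return tokens
```

With `alpha=0.0` the loop reduces to plain greedy decoding, which makes it easy to compare how the boost shifts later-step argmax choices toward tokens favored by the first, visually grounded step.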

Top-level tags: natural language processing, computer vision, multi-modal
Detailed tags: object hallucination, visual grounding, training-free method, large vision-language models, inference optimization

First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models


1️⃣ One-Sentence Summary

This paper proposes a simple method that requires no additional training: by boosting the influence of the first generated token throughout decoding, the model keeps using visual information for the whole generation, effectively reducing the errors in which AI models fabricate nonexistent objects when describing images.

Source: arXiv:2604.00455