arXiv submission date: 2026-06-14
📄 Abstract - Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference

Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-language understanding and natural-language response generation. However, these systems can still produce overconfident predictions and hallucination-like outputs, particularly when the visual evidence is weak, ambiguous, or semantically inconsistent. Most existing approaches focus on improving multimodal representation alignment or retrieval-augmented generation, while providing limited mechanisms to quantify instance-level prediction reliability or identify incorrect visual outputs. This work proposes a retrieval-augmented reliability-aware inference framework for trustworthy multimodal visual understanding. The proposed framework constructs an external visual evidence database using pretrained visual embeddings and nearest-neighbor retrieval over normalized feature representations. Retrieved evidence is used to estimate prediction trustworthiness through multiple reliability indicators, including similarity strength, class-support agreement, evidence margin, entropy-based uncertainty, and an aggregate reliability score. Based on these signals, a decision gate determines whether the system should accept the prediction, answer with caution, or abstain/fallback when evidence is insufficient. A multimodal response-generation layer then produces a final user-facing response conditioned on the reliability decision. Experiments on ImageNet-100 demonstrate that the proposed reliability-aware framework improves accepted prediction accuracy from 85.84% to 88.88% at 89.04% coverage. The hallucination-like accepted wrong-answer rate is reduced from 14.16% to 11.12%. These results show that integrating retrieval evidence, reliability estimation, and selective decision gating can improve calibration and reduce overconfident visual errors without retraining large multimodal models.
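
The abstract names the reliability indicators and the three-way gate but does not give their formulas. The following is a minimal sketch of how such a pipeline could look, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names (`reliability_signals`, `decision_gate`), the similarity-weighted k-nearest-neighbor vote, the unweighted mean used as the aggregate score, and the thresholds `accept_at` and `caution_at`.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def reliability_signals(query, bank_feats, bank_labels, num_classes, k=5):
    """Illustrative reliability indicators from the k nearest evidence items.

    Hypothetical definitions (not taken from the paper); assumes num_classes >= 2.
    """
    q = l2_normalize(np.asarray(query, dtype=float))
    bank = l2_normalize(np.asarray(bank_feats, dtype=float))
    sims = bank @ q                          # cosine similarity to every item
    top = np.argsort(-sims)[:k]              # indices of the k most similar
    top_sims, top_labels = sims[top], np.asarray(bank_labels)[top]

    # Similarity-weighted class votes (negative similarities contribute 0).
    votes = np.bincount(top_labels, weights=np.clip(top_sims, 0.0, None),
                        minlength=num_classes)
    total = votes.sum()
    probs = votes / total if total > 0 else np.full(num_classes, 1.0 / num_classes)

    pred = int(np.argmax(probs))
    ranked = np.sort(probs)[::-1]
    nonzero = probs[probs > 0]

    similarity = float(top_sims.mean())                 # similarity strength
    agreement = float(np.mean(top_labels == pred))      # class-support agreement
    margin = float(ranked[0] - ranked[1])               # evidence margin
    entropy = float(-(nonzero * np.log(nonzero)).sum() / np.log(num_classes))

    # Assumed aggregate: plain mean of the four signals, entropy inverted.
    score = float(np.mean([similarity, agreement, margin, 1.0 - entropy]))
    return {"pred": pred, "similarity": similarity, "agreement": agreement,
            "margin": margin, "entropy": entropy, "score": score}


def decision_gate(score, accept_at=0.7, caution_at=0.4):
    """Map the aggregate score to accept / caution / abstain (assumed thresholds)."""
    if score >= accept_at:
        return "accept"
    if score >= caution_at:
        return "caution"
    return "abstain"


# Toy evidence bank: three items of class 0, two of class 1.
bank_feats = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9]])
bank_labels = np.array([0, 0, 0, 1, 1])

clear = reliability_signals([1.0, 0.0], bank_feats, bank_labels, num_classes=2, k=3)
ambiguous = reliability_signals([1.0, 1.0], bank_feats, bank_labels, num_classes=2, k=5)
print(clear["pred"], decision_gate(clear["score"]))
print(ambiguous["pred"], decision_gate(ambiguous["score"]))
```

In this toy setup the query that sits inside the class-0 cluster is accepted, while the query lying between the two clusters gets a much lower score and is not accepted. In practice the thresholds would be tuned on held-out data to trade coverage against accepted accuracy.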

Top-level tag: multi-modal llm
Detailed tags: visual hallucination, reliability estimation, retrieval-augmented generation, uncertainty quantification, decision gating

Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference


1️⃣ One-sentence summary

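This paper proposes a framework that retrieves similar visual evidence from an external image database and combines several reliability indicators (such as similarity, class agreement, and uncertainty) to assess how trustworthy a prediction is, so that when the visual information is ambiguous or contradictory, a multimodal AI system can answer with caution or decline to answer rather than blindly give a wrong answer; experiments show that, without retraining the model, the method reduces the accepted wrong-answer rate from 14.16% to 11.12%, improving the system's trustworthiness.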
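
The reported figures are arithmetically consistent with the accepted wrong-answer rate being the complement of accepted accuracy (100 − 85.84 = 14.16 and 100 − 88.88 = 11.12). Assuming that definition, the sketch below shows how coverage, accepted accuracy, and accepted wrong-answer rate could be computed from per-sample outcomes. The function name `selective_metrics` and the toy inputs are hypothetical and are not data from the paper.

```python
def selective_metrics(correct, accepted):
    """Return (coverage, accepted_accuracy, accepted_wrong_rate).

    correct[i]  : True if the prediction for sample i matches the label.
    accepted[i] : True if the decision gate accepted sample i.
    Assumes accepted_wrong_rate = 1 - accepted_accuracy.
    """
    n_total = len(correct)
    n_accepted = sum(1 for a in accepted if a)
    coverage = n_accepted / n_total
    if n_accepted == 0:
        # Nothing was accepted, so accuracy on accepted samples is undefined.
        return coverage, float("nan"), float("nan")
    n_right = sum(1 for c, a in zip(correct, accepted) if c and a)
    accepted_accuracy = n_right / n_accepted
    return coverage, accepted_accuracy, 1.0 - accepted_accuracy


# Toy example: 5 samples, 3 accepted, 2 of the accepted ones correct.
coverage, acc, wrong = selective_metrics(
    correct=[True, True, False, True, False],
    accepted=[True, True, True, False, False],
)
print(f"coverage={coverage:.2%} accepted_accuracy={acc:.2%} wrong_rate={wrong:.2%}")
```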

Source: arXiv: 2606.15782