📄 Abstract - Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Recent advances in multimodal LLMs (MLLMs) have demonstrated a remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in their generated descriptions, causing severe hallucination issues. While prior work has explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that enables object and action faithfulness by suppressing spurious correlations and enforcing emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify potential hallucinations latent in the MLLM and transform the original captions into contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment that matches regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.
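The abstract does not include implementation details, but the "hallucinative self-augmentation" step can be illustrated with a minimal sketch: starting from a faithful caption, swap in an object or action the model tends to confuse, yielding a hard negative for contrastive training. Everything below, including the swap vocabularies and the function name `make_hallucinated_negative`, is a hypothetical illustration under assumed confusion pairs, not the paper's actual procedure.

```python
import random

# Assumed confusion pairs, e.g., mined from an MLLM's frequent caption errors.
OBJECT_SWAPS = {"dog": "cat", "guitar": "violin", "cup": "bottle"}
ACTION_SWAPS = {"running": "walking", "opening": "closing", "throwing": "catching"}

def make_hallucinated_negative(caption: str) -> str:
    """Replace one known-confusable object or action word to build a
    hard negative caption for contrastive training."""
    words = caption.split()
    swaps = {**OBJECT_SWAPS, **ACTION_SWAPS}
    candidates = [i for i, w in enumerate(words) if w.lower() in swaps]
    if not candidates:
        return caption  # nothing to perturb; caller may skip this sample
    i = random.choice(candidates)
    words[i] = swaps[words[i].lower()]
    return " ".join(words)

print(make_hallucinated_negative("a dog is running across the yard"))
# e.g. -> "a cat is running across the yard"
```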

Top-level tags: multi-modal, model training, model evaluation
Detailed tags: hallucination mitigation, contrastive learning, video captioning, multimodal llms, faithful generation

Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment


1️⃣ One-Sentence Summary

This paper proposes SANTA, a Self-Augmented Contrastive Alignment framework that identifies descriptions the model itself is likely to get wrong and uses them to construct contrastive samples, effectively reducing a multimodal LLM's tendency to fabricate objects and actions when captioning videos.
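To make the contrastive-sample idea concrete, here is a minimal InfoNCE-style sketch of how self-augmented hallucinated captions could serve as negatives when aligning visual features (e.g., object tracklets) with text phrases. The tensor shapes, the function name `contrastive_alignment_loss`, and the temperature value are assumptions for illustration; the paper's actual loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_feats, pos_text_feats, neg_text_feats,
                               temperature=0.07):
    """InfoNCE-style loss: pull visual features toward their matching
    phrase embeddings, push them away from self-augmented hallucinated phrases.

    visual_feats:   (B, D) pooled tracklet/clip embeddings
    pos_text_feats: (B, D) embeddings of the faithful phrases
    neg_text_feats: (B, K, D) embeddings of K hallucinated negatives per sample
    """
    v = F.normalize(visual_feats, dim=-1)
    p = F.normalize(pos_text_feats, dim=-1)
    n = F.normalize(neg_text_feats, dim=-1)

    pos_sim = (v * p).sum(-1, keepdim=True)      # (B, 1) positive similarity
    neg_sim = torch.einsum("bd,bkd->bk", v, n)   # (B, K) negative similarities

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    # The positive sits at index 0 of each row, so the target class is 0.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

Under this formulation, harder negatives (captions that differ from the truth by a single swapped object or action) force the model to attend to the exact visual evidence rather than to spurious co-occurrence patterns.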

