arXiv submission date: 2026-04-14
📄 Abstract - Distorted or Fabricated? A Survey on Hallucination in Video LLMs

Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs), referring to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion-aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid-LLMs, laying the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is maintained at this https URL.

Top tags: video llm model evaluation
Detailed tags: hallucination video-language models benchmark temporal representation visual grounding

Distorted or Fabricated? A Survey on Hallucination in Video LLMs


1️⃣ One-sentence summary

This paper systematically reviews the "hallucination" problem in Video Large Language Models — outputs that appear plausible but do not actually match the video content — analyzing its types, root causes, evaluation methods, and mitigation strategies, and providing a roadmap for building more reliable video understanding systems.

Source: arXiv:2604.12944