📄 Abstract - Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with the input videos. To address this issue, we propose Dr.V, a hierarchical framework covering the perceptive, temporal, and cognitive levels that diagnoses video hallucination through fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset, Dr.V-Bench, and a satellite video agent, Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotation. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at this https URL.
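The perceptive, temporal, then cognitive ordering described in the abstract can be illustrated with a minimal Python sketch. This is not the Dr.V-Agent implementation, and the abstract does not specify its interfaces. The names `Grounding`, `Claim`, `diagnose`, and the optional `reasoner` callback are all hypothetical, and the checks (label presence, start-time ordering) are simplifying assumptions chosen only to show how a diagnosis might stop at the first level that fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Grounding:
    """Hypothetical grounded evidence: a label and the time span (seconds) where it was found."""
    label: str
    start: float
    end: float


@dataclass
class Claim:
    """Hypothetical claim taken from a model answer: `first` happens before `second`."""
    first: str
    second: str


def diagnose(
    claim: Claim,
    evidence: List[Grounding],
    reasoner: Optional[Callable[[Claim, List[Grounding]], bool]] = None,
) -> str:
    """Return the first level at which the claim conflicts with the evidence, or "none"."""
    spans = {g.label: g for g in evidence}

    # Perceptive level (assumed check): everything the claim mentions must be grounded.
    for label in (claim.first, claim.second):
        if label not in spans:
            return "perceptive"

    # Temporal level (assumed check): the claimed order must match grounded start times.
    if spans[claim.first].start >= spans[claim.second].start:
        return "temporal"

    # Cognitive level: delegated to an optional, caller-supplied reasoning function.
    if reasoner is not None and not reasoner(claim, evidence):
        return "cognitive"

    return "none"
```

For example, with grounded spans for "open door" (1.0 to 2.0 s) and "sit down" (3.0 to 5.0 s), a claim that "sit down" precedes "open door" would be flagged at the temporal level, while a claim that mentions an ungrounded "wave" would be flagged at the perceptive level before any temporal check runs.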

Top-level tags: video model evaluation benchmark
Detailed tags: video hallucination spatial-temporal grounding video understanding evaluation framework multi-modal evaluation

📄 Paper Summary

Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding


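1️⃣ One-Sentence Summary

This paper proposes Dr.V, a hierarchical framework that combines fine-grained spatial-temporal grounding with cognitive reasoning to effectively detect and diagnose the hallucinations that large video models produce when interpreting videos, and it provides a richly annotated benchmark dataset and a diagnostic tool to improve model reliability and interpretability.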

