arXiv submission date: 2025-12-30
📄 Abstract - Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime whose RL phase applies pair-wise $\ell_1$ advantage normalization, enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
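The abstract names pair-wise $\ell_1$ advantage normalization but gives no formula. The sketch below is one plausible reading, assuming group-relative advantages are computed per video (as in GRPO-style RL) and then rescaled by the mean absolute advantage pooled over an original-edited video pair; the function and variable names are hypothetical, not from the paper.

```python
def pairwise_l1_normalized_advantages(rewards_orig, rewards_edit, eps=1e-8):
    """Hedged sketch of pair-wise l1 advantage normalization.

    rewards_orig / rewards_edit: per-rollout rewards for the original
    and the counterfactually edited video of one contrastive pair.
    """
    # Group-relative advantage: reward minus the group mean, per video
    mean_o = sum(rewards_orig) / len(rewards_orig)
    mean_e = sum(rewards_edit) / len(rewards_edit)
    adv_o = [r - mean_o for r in rewards_orig]
    adv_e = [r - mean_e for r in rewards_edit]
    # l1 normalization across the pair: divide by the mean absolute
    # advantage pooled over both rollout groups (plus eps for stability)
    pooled = adv_o + adv_e
    scale = sum(abs(a) for a in pooled) / len(pooled) + eps
    return [a / scale for a in adv_o], [a / scale for a in adv_e]
```

Under this reading, sharing one scale across the pair keeps the relative magnitude of advantages between the original and edited videos intact, which is what a contrastive signal would need.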

Top-level tags: multi-modal · model training · video
Detailed tags: video understanding · hallucination reduction · counterfactual data generation · diffusion models · contrastive training

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation


1️⃣ One-sentence summary

This paper proposes DualityForge, a method that automatically generates counterfactual videos defying common sense, together with matching QA pairs, and uses them to train multimodal large language models. This effectively reduces the "hallucination" errors that arise when models over-rely on language priors during video understanding, and yields significant performance gains across multiple benchmarks.

Source: arXiv 2512.24271