arXiv submission date: 2025-12-15
📄 Abstract - Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

Recent advances in video generation have produced vivid content that is often indistinguishable from real video, making the detection of AI-generated video an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: **(i) Immersive ASMR video-audio sources.** Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. **(ii) Peer-review evaluation.** An adversarial creator-reviewer protocol in which video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random baseline: 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at this https URL.
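The creator-reviewer scoring reduces to a simple accuracy measure over a balanced real/fake clip set, where 50% is the random baseline the paper cites. A minimal sketch (my own illustration, not code from the paper; the function name and toy labels are assumptions):

```python
from typing import List

def reviewer_accuracy(truth: List[str], predictions: List[str]) -> float:
    """Fraction of clips a reviewer (VLM or human) labels correctly.

    On a balanced real/fake set, guessing at random yields ~0.5,
    which is the baseline against which the 56% VLM and 81.25%
    human-expert figures are compared.
    """
    assert len(truth) == len(predictions)
    correct = sum(t == p for t, p in zip(truth, predictions))
    return correct / len(truth)

# Toy balanced set: 4 real clips and 4 AI-generated ("fake") clips.
truth = ["real"] * 4 + ["fake"] * 4
preds = ["real", "real", "fake", "real", "fake", "real", "fake", "fake"]
print(reviewer_accuracy(truth, preds))  # 0.75
```

A creator "wins" against a reviewer when it drives this accuracy toward the 0.5 baseline, which is what Veo3.1-Fast achieves against most VLM reviewers.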

Top tags: video generation, multi-modal, model evaluation
Detailed tags: asmr, aigc detection, audio-visual consistency, perceptual realism, vlm evaluation

Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?


1️⃣ One-sentence summary

By building an audio-paired ASMR video test suite, this paper finds that current state-of-the-art AI video generation models (such as Veo3.1) can produce videos realistic enough to fool most vision-language models, while human experts still identify real versus fake more accurately, exposing the limitations of AI in perceptual realism and audio-visual consistency.


Source: arXiv 2512.13281