arXiv submission date: 2025-12-25
📄 Abstract - SVBench: Evaluation of Video Generation Models on Social Reasoning

Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
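
As a rough illustration of how per-video judge scores might be rolled up into the seven core dimensions listed in the abstract, here is a minimal Python sketch. The record layout (a `(dimension, score)` pair per judged video), the function name `aggregate_by_dimension`, and the use of a plain mean are all assumptions made for illustration; the abstract does not specify the score scale or the aggregation the authors use.

```python
from collections import defaultdict
from statistics import mean

# The seven core dimensions named in the abstract.
CORE_DIMENSIONS = (
    "mental-state inference",
    "goal-directed action",
    "joint attention",
    "social coordination",
    "prosocial behavior",
    "social norms",
    "multi-agent strategy",
)


def aggregate_by_dimension(records):
    """Average judge scores per core dimension.

    `records` is a hypothetical iterable of (dimension, score) pairs,
    one per judged video. The score scale is an assumption; the
    abstract does not state it.
    """
    buckets = defaultdict(list)
    for dimension, score in records:
        if dimension not in CORE_DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        buckets[dimension].append(score)
    # Report only dimensions that actually received scores,
    # in the order the abstract lists them.
    return {d: mean(buckets[d]) for d in CORE_DIMENSIONS if d in buckets}
```

A per-dimension breakdown of this kind is what makes it possible to say that a model does well on surface plausibility yet poorly on, say, joint attention, rather than reporting a single overall number.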

Top-level tags: video generation benchmark model evaluation
Detailed tags: social reasoning text-to-video evaluation framework multi-agent video generation benchmark

SVBench: Evaluation of Video Generation Models on Social Reasoning


1️⃣ One-sentence summary

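This paper introduces SVBench, the first benchmark for evaluating the social reasoning ability of video generation models. It finds that although current state-of-the-art models perform well on visual realism and motion fluency, they show systematic deficiencies in understanding deeper social logic such as characters' intentions, beliefs, and joint attention.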

Source: arXiv: 2512.21507