DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

📄 Abstract

Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. Evaluating such videos remains challenging, however: existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and offer limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos against 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches just 0.356 even for the best workflow, whereas prompt-level user demand fulfillment averages 0.71. We further conduct a human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that aggregate scoring hides. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.

1️⃣ One-Sentence Summary

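This paper proposes DirectorBench, a new evaluation system that works like an experienced director: using 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria, it pinpoints specific problems in long-form video generation (such as abrupt shot transitions) across five dimensions (script, visual, audio, cross-modal, and stability) rather than giving only a single overall score.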
