The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation
1️⃣ One-sentence summary
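This paper proposes an agent-driven, end-to-end framework that automatically converts coarse dialogue into a fine-grained cinematic script and uses that script to guide video generation models, producing long-form narrative videos with coherent plots and addressing the difficulty existing models have in generating coherent long videos from high-level concepts such as dialogue.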
Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.
Source: arXiv: 2601.17737