📄 Abstract - BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: this https URL. Inferix (Code): this https URL.
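The semi-autoregressive loop described in the abstract can be illustrated with a toy sketch. This is not BlockVid's implementation: frames are plain floats, `denoise_block` is a hypothetical stand-in for a diffusion denoiser, and the cache is trimmed by simple recency rather than by the semantic-aware sparse selection the paper proposes. All names and parameters here are assumptions made for illustration. It only shows the control flow: blocks are produced one after another, frames within a block are refined together, and each new block is conditioned on a bounded cache of earlier output.

```python
import random


def denoise_block(noisy, context, steps=4):
    """Toy stand-in for a denoiser (hypothetical, not the paper's model).

    Pulls every frame of the block toward the mean of the cached context.
    """
    target = sum(context) / len(context) if context else 0.0
    block = list(noisy)
    for _ in range(steps):
        # All frames in the block are updated together in each step,
        # mimicking parallel sampling within a block.
        block = [x + 0.5 * (target - x) for x in block]
    return block


def generate(num_blocks, block_size, cache_limit, seed=0):
    """Generate blocks sequentially, conditioning each on a bounded cache."""
    if cache_limit < 1:
        raise ValueError("cache_limit must be at least 1")
    rng = random.Random(seed)
    cache = []  # toy stand-in for the KV cache of earlier blocks
    video = []
    for _ in range(num_blocks):
        noisy = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        block = denoise_block(noisy, cache)
        video.extend(block)
        cache.extend(block)
        # Assumption: keep only the most recent entries. The paper instead
        # describes a semantic-aware sparse cache, which is not modeled here.
        cache = cache[-cache_limit:]
    return video, cache


if __name__ == "__main__":
    video, cache = generate(num_blocks=5, block_size=4, cache_limit=8)
    print(len(video), len(cache))
```

Because every block depends on cached earlier output, any error in an early block feeds into all later ones, which is the long-horizon accumulation problem the abstract refers to.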

Top-level tags: video generation model training benchmark
Detailed tags: block diffusion long-video generation kv cache temporal consistency coherence metrics

BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation


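1️⃣ One-sentence summary

This paper proposes BlockVid, a new method that improves block diffusion by introducing a semantic-aware cache and a new training strategy, effectively addressing the error accumulation and coherence problems common in long-video generation; it significantly outperforms existing methods on a newly established benchmark and generates higher-quality, more coherent minute-long videos.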

