arXiv submission date: 2026-02-15
📄 Abstract - Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at this https URL.
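The abstract describes Frequency-aware RoPE Modulation only at a high level: low-frequency components are interpolated, high-frequency ones are extrapolated. The sketch below illustrates that general idea with a hard wavelength threshold, in the style of "by-parts" RoPE scaling. It is not the FLEX rule itself, which the abstract calls adaptive and does not specify. The function names, the threshold criterion (wavelength longer than the training horizon), and the `base` value are all assumptions made for illustration.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequencies, one per pair of channels."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def modulated_angles(position, dim, train_len, target_len, base=10000.0):
    """Rotation angles for one temporal position (hypothetical helper).

    Assumption: a frequency counts as "under-trained" when its wavelength
    exceeds the training horizon, i.e. it never completed a full cycle
    during training. Those components are interpolated (the position is
    compressed back into the trained range); the rest are extrapolated
    (the position is used unchanged).
    """
    scale = min(1.0, train_len / target_len)
    angles = []
    for freq in rope_frequencies(dim, base):
        wavelength = 2.0 * math.pi / freq
        if wavelength > train_len:
            # Low frequency: interpolate so angles stay within the trained range.
            angles.append(position * scale * freq)
        else:
            # High frequency: extrapolate to keep fine temporal discriminability.
            angles.append(position * freq)
    return angles


# Example: trained on 20 temporal positions, inferring at 120 (6x extension).
angles = modulated_angles(position=100, dim=8, train_len=20, target_len=120)
```

With these illustrative settings, the highest-frequency channel pair keeps its raw position, while the slower pairs see the position divided by the extension factor, so their angles never exceed what was observed within the training horizon.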

Top-level tags: video generation, model training, model evaluation
Detailed tags: autoregressive diffusion, inference-time optimization, positional embeddings, temporal consistency, long video synthesis

Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation


1️⃣ One-sentence summary

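This paper proposes FLEX, a training-free inference-time framework that adaptively adjusts positional embeddings and improves noise sampling, so that models originally limited to short videos can, without retraining, directly generate long videos lasting several minutes with stable quality.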

Source: arXiv: 2602.14027