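FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching
1️⃣ One-sentence summary
This paper proposes a training-free method for long video generation: overlapping sliding windows are fused with Tweedie matching, and stochastic early-phase sampling keeps the frames consistent. The method generates high-quality videos several times longer than the native window and also applies to tasks such as joint audio-video generation.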
Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.
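The sliding-window procedure described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not give the Tweedie matching formula, so plain averaging of the predicted clean samples stands in for it, and the interpolation x_t = (1 - t) * x0 + t * eps is an assumed flow-matching convention. The function names `window_starts`, `blend_clean_predictions`, and `renoise` are hypothetical, and each frame is reduced to a single float.

```python
import random


def window_starts(total_len, window_len, overlap):
    """Start indices of overlapping windows that cover [0, total_len)."""
    if not 0 <= overlap < window_len:
        raise ValueError("need 0 <= overlap < window_len")
    if total_len <= window_len:
        return [0]
    stride = window_len - overlap
    starts = list(range(0, total_len - window_len, stride))
    starts.append(total_len - window_len)  # last window ends at the final frame
    return starts


def blend_clean_predictions(preds, starts, total_len):
    """Average per-window clean predictions frame by frame.

    Assumption: uniform averaging is a stand-in for the blending rule,
    which the abstract does not specify.
    """
    sums = [0.0] * total_len
    counts = [0] * total_len
    for start, pred in zip(starts, preds):
        for offset, value in enumerate(pred):
            sums[start + offset] += value
            counts[start + offset] += 1
    return [s / c for s, c in zip(sums, counts)]


def renoise(x0_hat, eps_hat, t_next, stochastic, rng):
    """Move to noise level t_next, assuming x_t = (1 - t) * x0 + t * eps.

    stochastic=True draws fresh noise (early, high-noise phase);
    stochastic=False reuses the predicted noise, which matches a
    deterministic Euler ODE step under the assumed interpolation.
    """
    out = []
    for x0, eps in zip(x0_hat, eps_hat):
        noise = rng.gauss(0.0, 1.0) if stochastic else eps
        out.append((1.0 - t_next) * x0 + t_next * noise)
    return out


# Two windows of 4 frames with a 2-frame overlap cover 6 frames.
starts = window_starts(total_len=6, window_len=4, overlap=2)
blended = blend_clean_predictions([[1.0] * 4, [3.0] * 4], starts, total_len=6)
```

In this toy run the two overlapping frames take the mean of both windows' predictions, while frames seen by only one window are left unchanged. A full sampler would repeat predict, blend, and `renoise` at each noise level, switching `stochastic` from `True` to `False` at some threshold; that threshold is a tuning choice not stated in the abstract.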
Source: arXiv: 2605.20910