arXiv submission date: 2025-12-17
📄 Abstract - End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
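The abstract only sketches the self-resampling scheme at a high level. Below is a minimal, hypothetical PyTorch sketch of the idea as stated there: the model re-denoises its own history frames without gradients, so the frame-level diffusion loss is computed against degraded, inference-like histories. The `model(x_t, t, history)` signature, the two-step resampling loop, and the linear noising are illustrative assumptions, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def self_resample(model, clean_history, t_start=0.7, num_steps=2):
    """Partially noise the clean history frames and let the model itself
    re-denoise them in a few steps; the residual errors stand in for the
    drift the model would accumulate at inference time."""
    b = clean_history.shape[0]
    x = (1 - t_start) * clean_history + t_start * torch.randn_like(clean_history)
    ts = torch.linspace(t_start, 0.0, num_steps + 1)
    for i in range(num_steps):
        t = torch.full((b,), float(ts[i]), device=clean_history.device)
        # Conditioning on earlier frames is omitted here for brevity.
        x_hat = model(x, t, history=None)           # model's clean estimate
        t_next = float(ts[i + 1])
        x = (1 - t_next) * x_hat + t_next * torch.randn_like(x_hat)
    return x


def training_step(model, frames):
    """frames: (B, F, C, H, W); the last frame is the prediction target."""
    history, target = frames[:, :-1], frames[:, -1:]
    degraded_history = self_resample(model, history)    # simulate exposure bias
    t = torch.rand(target.shape[0], device=target.device)
    noise = torch.randn_like(target)
    t_ = t.view(-1, 1, 1, 1, 1)
    x_t = (1 - t_) * target + t_ * noise                # simple linear noising
    pred = model(x_t, t, history=degraded_history)      # condition on degraded history
    return F.mse_loss(pred, target)                     # frame-level diffusion loss
```

In the paper, a sparse causal mask additionally restricts each frame's attention to earlier (degraded) frames so the whole sequence can still be trained in parallel; that detail is not modeled in this sketch.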

Top tags: video generation, model training, AIGC
Detailed tags: autoregressive diffusion, exposure bias, self-resampling, temporal consistency, long-horizon generation

End-to-End Training for Autoregressive Video Diffusion via Self-Resampling


1️⃣ One-Sentence Summary

This paper proposes a new end-to-end training framework called "Resampling Forcing": by having the model simulate, during training, the prediction errors it would make at inference time and learn to correct them, it addresses the "exposure bias" problem common in autoregressive video generation, making it possible to directly train models that produce long, highly consistent videos without relying on a complex teacher model or an additional training stage.
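For the long-horizon setting, the abstract also describes history routing: a parameter-free, top-k retrieval of the most relevant history frames for each query. The sketch below shows one plausible implementation; the mean-pooled frame descriptors, cosine-similarity scoring, and `num_keep` parameter are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn.functional as F


def route_history(query_feats, history_feats, num_keep=4):
    """Pick the top-k most relevant history frames for each query frame.

    query_feats:   (Q, T, D)  Q query frames, T tokens per frame, D channels
    history_feats: (H, T, D)  H history frames
    Returns an index tensor of shape (Q, k) into the history axis.
    """
    # Pool tokens within each frame to a single descriptor per frame.
    q = F.normalize(query_feats.mean(dim=1), dim=-1)    # (Q, D)
    h = F.normalize(history_feats.mean(dim=1), dim=-1)  # (H, D)
    scores = q @ h.T                                     # (Q, H) cosine similarity
    k = min(num_keep, h.shape[0])
    return scores.topk(k, dim=-1).indices                # (Q, k)
```

A downstream attention layer would then restrict each query frame's keys and values to its selected history frames, keeping attention cost roughly constant as the video grows.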


Source: arXiv:2512.15702