
📄 Abstract - Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
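
To make the first ingredient concrete, here is a minimal sketch of the EMA-Sink idea as described in the abstract, not the authors' code: a fixed-size bank of sink tokens is initialized from the initial frames and, whenever tokens are evicted from the sliding attention window, it is updated as an exponential moving average of those evicted tokens. The tensor shapes, the `decay` value, and the assumption that evicted tokens align one-to-one with sink slots are all my own illustrative choices.

```python
import torch


class EMASink:
    """Fixed-size sink tokens updated by an EMA of evicted window tokens (sketch)."""

    def __init__(self, init_tokens: torch.Tensor, decay: float = 0.95):
        # init_tokens: [num_sink_tokens, dim], taken from the initial frames.
        self.sink = init_tokens.clone()
        self.decay = decay

    def update(self, evicted_tokens: torch.Tensor) -> None:
        # evicted_tokens: [num_sink_tokens, dim], the tokens that just left the
        # sliding window (alignment with sink slots is an assumption here).
        # EMA fusion keeps long-term context while absorbing recent dynamics.
        self.sink = self.decay * self.sink + (1.0 - self.decay) * evicted_tokens

    def attention_context(self, window_tokens: torch.Tensor) -> torch.Tensor:
        # Keys/values visible to attention: EMA-Sink tokens prepended to the
        # current sliding-window tokens ([window_len, dim]).
        return torch.cat([self.sink, window_tokens], dim=0)
```

Because the sink bank has a fixed size and the EMA update is a single fused multiply-add, this adds no attention cost over a plain sink-token scheme, which matches the "without additional computation cost" claim in the abstract.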

Top-level tags: video generation, model training, AIGC
Detailed tags: streaming video, distribution matching, diffusion distillation, motion dynamics, attention mechanism

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation


1️⃣ One-Sentence Summary

This paper proposes a new method called "Reward Forcing." By introducing EMA-Sink tokens that fuse long-term context with recent dynamics, together with a distribution matching distillation technique that uses vision-language-model rewards to prioritize learning dynamic content, it addresses the initial-frame copying and weakened motion dynamics of existing streaming video generation methods, significantly improving motion quality and generation efficiency while maintaining long-horizon consistency.
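
For the second ingredient, the Rewarded Distribution Matching Distillation (Re-DMD) described above, the sketch below illustrates the reward-weighting idea under my own assumptions: a vanilla distribution-matching loss averages equally over samples, whereas here each sample's contribution is scaled by a normalized dynamics reward, e.g. a motion score produced by a vision-language model. The per-sample loss input, the softmax normalization, and the temperature are illustrative placeholders, not the paper's exact formulation.

```python
import torch


def rewarded_dmd_loss(dmd_loss_per_sample: torch.Tensor,
                      rewards: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    # dmd_loss_per_sample: [batch] per-sample distribution-matching losses.
    # rewards: [batch] dynamics scores from a VLM rater (higher = more motion).
    weights = torch.softmax(rewards / temperature, dim=0)  # sums to 1
    # High-reward (more dynamic) samples receive a larger share of the
    # gradient, biasing the student's output distribution toward dynamic
    # content while low-reward samples are down-weighted rather than dropped.
    return (weights * dmd_loss_per_sample).sum()
```

Compared with equal weighting, this keeps every sample in the objective (preserving data fidelity) while tilting the optimization toward high-reward, high-motion regions, which is the trade-off the abstract attributes to Re-DMD.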

