arXiv submission date: 2026-03-17
📄 Abstract - Demystifing Video Reasoning

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
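To make the seed-ensembling idea from the abstract concrete, here is a minimal toy sketch. The abstract does not say how the latent trajectories are combined, so everything below is an assumption for illustration: the stand-in denoiser `toy_denoise_step`, the function `sample_with_seed_ensemble`, the choice to blend each trajectory toward the cross-seed mean, the restriction of blending to early steps, and the `blend` weight. It is not the authors' implementation.

```python
import numpy as np

def toy_denoise_step(latent, target, rate=0.3):
    # Hypothetical stand-in for one denoising step of a video diffusion model:
    # it moves the latent a fixed fraction of the way toward `target`.
    return latent + rate * (target - latent)

def sample_with_seed_ensemble(seeds, num_steps=10, ensemble_steps=3,
                              blend=0.5, dim=8):
    # Assumed setup: one latent trajectory per seed, all run through the same "model".
    target = np.ones(dim)  # toy "answer" that every trajectory drifts toward
    latents = np.stack(
        [np.random.default_rng(s).standard_normal(dim) for s in seeds]
    )  # shape: (num_seeds, dim)
    for step in range(num_steps):
        latents = toy_denoise_step(latents, target)
        if step < ensemble_steps:
            # Assumed ensembling rule: during early steps, pull each trajectory
            # toward the mean latent across seeds.
            mean = latents.mean(axis=0, keepdims=True)
            latents = (1.0 - blend) * latents + blend * mean
    return latents  # final latent per seed

if __name__ == "__main__":
    plain = sample_with_seed_ensemble(seeds=[0, 1, 2, 3], ensemble_steps=0)
    mixed = sample_with_seed_ensemble(seeds=[0, 1, 2, 3], ensemble_steps=3)
    print("spread without ensembling:", plain.std(axis=0).sum())
    print("spread with ensembling:   ", mixed.std(axis=0).sum())
```

In this toy setting, blending during the early steps only shrinks the disagreement between seeds. It says nothing about whether the shared trajectory is a better answer, which is the empirical claim the paper itself makes.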

Top-level tags: video generation model training theory
Detailed tags: diffusion models reasoning mechanisms chain-of-steps emergent behavior video understanding

Unveiling Video Reasoning: Exploring Reasoning Mechanisms in Diffusion Models / Demystifing Video Reasoning


1️⃣ One-sentence summary

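This paper finds that the core reasoning ability of video generation models does not unfold sequentially across video frames, as previously assumed, but instead takes shape progressively over the diffusion denoising steps. It also identifies several intelligent behaviors the model exhibits during this process, offering a new direction for using video models on more complex reasoning tasks.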

Source: arXiv: 2603.16870