
arXiv submission date: 2025-12-24
📄 Abstract - Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from inaccurate semantic localization and poor generation quality, resulting in weak learned semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.

Top-level tags: video generation · model training · multi-modal
Detailed tags: autoregressive modeling · next-frame prediction · representation learning · conditional flow matching · video understanding

NExT-Vid: a masked next-frame autoregressive visual generative pretraining framework for jointly modeling images and videos / Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations


1️⃣ One-sentence summary

This paper proposes NExT-Vid, a novel visual generative pretraining framework that decouples semantic representation from target decoding via a context-isolated autoregressive predictor and a conditional flow-matching decoder. It effectively addresses the inaccurate semantic localization and poor generation quality of existing autoregressive pretraining methods, and achieves leading performance on multiple video understanding benchmarks.
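Masked next-frame prediction implies a frame-level (block) causal attention pattern: every token may attend within its own frame and to earlier frames, but never to later ones. The sketch below is my own minimal illustration of such a mask in numpy; the function name and shapes are assumptions, not the paper's code.

```python
import numpy as np

def block_causal_mask(num_frames, tokens_per_frame):
    """Frame-level causal mask for next-frame prediction.

    mask[i, j] is True where attention from token i to token j is
    allowed: j must belong to the same frame as i or to an earlier one.
    """
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]

# 3 frames with 2 tokens each -> a 6x6 boolean mask
m = block_causal_mask(3, 2)
```

In a transformer this mask would be applied to the attention logits (disallowed positions set to -inf) so each frame is predicted only from its past, which is what gives the objective its autoregressive, temporal character.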


2️⃣ Key contributions

1. Masked next-frame generative pretraining paradigm

2. Context-isolated autoregressive predictor

3. Conditional flow-matching decoder

4. Representation-alignment regularization and frame-isolated attention
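The conditional flow-matching decoder regresses a velocity field along a path from noise to the target frame, conditioned on the predictor's semantic context. Below is a minimal sketch of the standard linear-interpolant (rectified-flow style) training target; this assumes the common formulation, not necessarily the paper's exact variant, and all names are illustrative.

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear-interpolant flow matching.

    Returns the point x_t on the straight noise->data path and the
    constant velocity (x1 - x0) that the decoder is regressed onto.
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))   # toy "next frame" targets
x0 = rng.normal(size=(4, 8))   # Gaussian noise samples
t = rng.uniform(size=(4, 1))   # per-sample time in [0, 1]

x_t, v = flow_matching_target(x0, x1, t)
# In training, a decoder v_theta(x_t, t, context) would minimize an MSE
# loss against v, with `context` being the predictor's output; sampling
# then integrates the learned velocity field from noise to a frame.
```

Conditioning the decoder on an isolated context, rather than letting it see the raw past frames, is what keeps semantic representation learning separate from pixel-level decoding.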


3️⃣ Main results and value

Result highlights

Practical value


4️⃣ Glossary

Source: arXiv:2512.21004