📄 Abstract - InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structure but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles to converge and its low-level objective often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these issues, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, the conventional linear decoder in pixel MVM forces the predictor's output latents to be linearly projectable to, and thus separable in, pixel space, which conflicts with semantic abstraction. Our Stage 1 therefore replaces it with a conditional diffusion decoder and injects reliable image-level semantic priors to improve semantics and convergence, bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 then learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.
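The EPD split described above can be made concrete with a minimal sketch. Everything below is an illustrative assumption, not the paper's implementation: the transformer depths and dimensions, the simplified masking (a real MVM encoder would process only visible tokens), and the plain mask-token fill-in are placeholders, and Stage 1's conditional diffusion decoder is omitted, appearing only implicitly as the frozen targets of the Stage 2 latent-prediction loss.

```python
import torch
import torch.nn as nn

def blocks(dim: int, depth: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class EPD(nn.Module):
    """Sketch of the Encoder-Predictor-Decoder split. The decoder is omitted:
    in Stage 2, predictions are regressed onto frozen Stage 1 latents."""

    def __init__(self, num_tokens: int = 196, dim: int = 512):
        super().__init__()
        self.encoder = blocks(dim, depth=4)    # embeds visible video tokens
        self.predictor = blocks(dim, depth=2)  # acts as the latent world model
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch embeddings; mask: (B, N) bool, True = hidden.
        B, N, D = tokens.shape
        # Simplification: zero out masked tokens instead of dropping them.
        visible = self.encoder((tokens + self.pos) * (~mask).unsqueeze(-1))
        # The predictor infers latents at masked positions from visible context.
        filled = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, D), visible)
        return self.predictor(filled + self.pos)

# Stage 2 objective sketch: match predictions to frozen Stage 1 targets.
model = EPD()
tokens = torch.randn(2, 196, 512)
mask = torch.rand(2, 196) < 0.75               # mask ~75% of tokens, a typical MVM ratio
with torch.no_grad():
    targets = torch.randn(2, 196, 512)         # stand-in for frozen Stage 1 encoder outputs
pred = model(tokens, mask)
loss = ((pred - targets)[mask] ** 2).mean()    # latent loss on masked positions only
```

Because the targets come from a frozen Stage 1 model rather than raw pixels, the predictor cannot satisfy the loss with low-level copying shortcuts, which is the abstract's stated motivation for the two-stage design.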

Top-level tags: video, model training, multi-modal
Detailed tags: video foundation model, masked video modeling, self-supervised learning, diffusion decoder, world model

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision


1️⃣ One-Sentence Summary

This paper proposes a new method called InternVideo-Next: through an innovative two-stage training framework, it builds a general video model that understands both fine-grained video detail and high-level semantics without relying on large-scale video-text paired data, and it achieves state-of-the-art performance on multiple benchmarks.

