arXiv submission date: 2026-03-24
📄 Abstract - Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.
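The masking procedure the abstract describes — cluster tokens within each frame, then keep only the token with the highest temporal density per cluster — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the clustering method (plain k-means here), the cosine-similarity definition of "temporal density", and all shapes and parameter names are assumptions for demonstration.

```python
import numpy as np

def cluster_stm_mask(tokens, n_clusters=4, n_iters=10, seed=0):
    """Hypothetical sketch of cluster-wise spatio-temporal masking.

    tokens: (T, N, D) array -- T frames, N visual tokens per frame, D dims.
    Returns a boolean keep-mask of shape (T, N); True = token retained.
    """
    T, N, D = tokens.shape
    rng = np.random.default_rng(seed)

    # Assumed "temporal density": mean cosine similarity of each token to the
    # tokens at the same spatial position in all other frames.
    normed = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    sim = np.einsum("tnd,snd->tsn", normed, normed)      # sim[t, s, n]
    density = (sim.sum(axis=1) - 1.0) / max(T - 1, 1)    # exclude self, (T, N)

    keep = np.zeros((T, N), dtype=bool)
    for t in range(T):
        frame = tokens[t]  # (N, D)
        # Plain k-means as a stand-in for the paper's intra-frame clustering.
        centers = frame[rng.choice(N, n_clusters, replace=False)]
        for _ in range(n_iters):
            dists = np.linalg.norm(frame[:, None] - centers[None], axis=-1)
            assign = dists.argmin(axis=1)
            for k in range(n_clusters):
                members = frame[assign == k]
                if len(members):
                    centers[k] = members.mean(axis=0)
        # Cluster-wise masking: retain only the highest-density token per cluster.
        for k in range(n_clusters):
            idx = np.flatnonzero(assign == k)
            if idx.size:
                keep[t, idx[density[t, idx].argmax()]] = True
    return keep

# Toy usage: 8 frames, 16 tokens per frame, 32-dim features.
toks = np.random.default_rng(1).normal(size=(8, 16, 32))
mask = cluster_stm_mask(toks, n_clusters=4)
print(mask.sum(axis=1))  # at most n_clusters tokens kept per frame
```

Note that at most `n_clusters` tokens survive per frame, so the effective masking ratio is `1 - n_clusters / N`, which is how a cluster-wise scheme can sustain a high masking ratio while still covering each semantic region of the frame.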

Top-level tags: multi-modal model training video
Detailed tags: video-language pretraining masked visual modeling spatio-temporal masking multimodal alignment efficient training

Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining


1️⃣ One-sentence summary

This paper proposes ClusterSTM, a video masking method that clusters visual tokens within each frame and retains only the key token per cluster, efficiently learning video-text relationships; it reduces computational cost while significantly improving performance on video understanding, retrieval, and question answering tasks.

From arXiv: 2603.22953