MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning
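1️⃣ One-sentence summary
This paper proposes MJEPA, an audio-visual self-supervised learning method that uses a unified encoder and a single predictive objective to learn audio and visual features together. Cross-modal prediction significantly improves model performance, and the method surpasses prior approaches on several benchmarks, making it especially suitable when data is limited.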
Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives, limiting cross-modal synergy and scalability. Joint Embedding Predictive Architectures (JEPAs) offer a simple, modality-agnostic alternative, but have to date been applied primarily to individual modalities. We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities. Our approach uses only a single predictive objective, applied both within and across modalities. We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other. Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.
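The abstract gives no implementation details, so the following is only a toy sketch of the idea it describes: one shared encoder for both modalities and one predictive loss applied within and across modalities. Everything concrete here is an assumption made for illustration and not the authors' method: the linear-plus-tanh "encoder" standing in for a ViT, the frozen target copy, mean pooling, the L1 distance, and all names (`encode`, `predictive_loss`, `total_loss`, `DIM`).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # hypothetical embedding width

# Hypothetical stand-ins: a single shared "encoder" used for audio and video,
# a frozen copy that produces targets, and a linear predictor.
W_enc = rng.normal(scale=0.1, size=(DIM, DIM))
W_tgt = W_enc.copy()
W_pred = rng.normal(scale=0.1, size=(DIM, DIM))


def encode(tokens, weights):
    """Map a (num_tokens, DIM) array to embeddings of the same shape."""
    return np.tanh(tokens @ weights)


def predictive_loss(context_tokens, target_tokens):
    """Mean L1 distance between the predicted and the target embedding."""
    context = encode(context_tokens, W_enc).mean(axis=0)  # pooled context
    predicted = context @ W_pred                          # predict in embedding space
    target = encode(target_tokens, W_tgt).mean(axis=0)    # pooled target
    return float(np.abs(predicted - target).mean())


def total_loss(audio_ctx, audio_tgt, video_ctx, video_tgt):
    """Apply the same loss within each modality and across modalities."""
    within = predictive_loss(audio_ctx, audio_tgt) + predictive_loss(video_ctx, video_tgt)
    across = predictive_loss(audio_ctx, video_tgt) + predictive_loss(video_ctx, audio_tgt)
    return within + across


# Toy inputs, assumed to be already patch-embedded to DIM features per token.
audio = rng.normal(size=(8, DIM))
video = rng.normal(size=(12, DIM))
loss = total_loss(audio[:4], audio[4:], video[:6], video[6:])
print(loss)
```

In this sketch, dropping the `across` term leaves a shared encoder trained only on within-modality prediction, the configuration the abstract reports as falling below unimodal baselines.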
Source: arXiv: 2606.25225