MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

📄 Abstract - MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives, limiting cross-modal synergy and scalability. Joint Embedding Predictive Architectures (JEPAs) offer a simple, modality-agnostic alternative, but have to date been applied primarily to individual modalities. We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities. Our approach uses only a single predictive objective, applied both within and across modalities. We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other. Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.
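To make the idea of "a single predictive objective, applied both within and across modalities" more concrete, here is a minimal toy sketch. It is not the paper's implementation: the linear "encoder", the mean-pooling predictor, the L1 loss, the fixed context/target split, and every function name (`encode`, `predict`, `predictive_loss`, `total_loss`, `split`) are illustrative assumptions. The only points taken from the abstract are that one encoder is shared by both modalities and that the same prediction loss is used for all four context-to-target directions.

```python
import random

def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def encode(tokens, enc_w):
    """Toy shared encoder: the same linear map for every token, audio or video."""
    return [linear(enc_w, tok) for tok in tokens]

def predict(context_emb, pred_w):
    """Toy predictor: mean-pool the context embeddings, then apply a linear map."""
    dim = len(context_emb[0])
    pooled = [sum(e[i] for e in context_emb) / len(context_emb) for i in range(dim)]
    return linear(pred_w, pooled)

def l1(a, b):
    """Mean absolute error between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def predictive_loss(context_tokens, target_tokens, enc_w, pred_w):
    """Predict target-token embeddings from context-token embeddings."""
    context = encode(context_tokens, enc_w)
    # Assumption: targets reuse the same weights for brevity. JEPA-style methods
    # typically use a separate, gradient-free target encoder instead.
    targets = encode(target_tokens, enc_w)
    pred = predict(context, pred_w)
    return sum(l1(pred, t) for t in targets) / len(targets)

def split(tokens, n_context):
    """Hypothetical masking: the first n_context tokens are visible, the rest are targets."""
    return tokens[:n_context], tokens[n_context:]

def total_loss(audio, video, enc_w, pred_w, n_context=2):
    """Return (within-modality loss, cross-modality loss) using one objective."""
    a_ctx, a_tgt = split(audio, n_context)
    v_ctx, v_tgt = split(video, n_context)
    within = (predictive_loss(a_ctx, a_tgt, enc_w, pred_w)
              + predictive_loss(v_ctx, v_tgt, enc_w, pred_w))
    across = (predictive_loss(a_ctx, v_tgt, enc_w, pred_w)
              + predictive_loss(v_ctx, a_tgt, enc_w, pred_w))
    return within, across

DIM = 4
rng = random.Random(0)

def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

def rand_tokens(n):
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]

enc_w, pred_w = rand_matrix(), rand_matrix()
audio, video = rand_tokens(4), rand_tokens(4)
within, across = total_loss(audio, video, enc_w, pred_w)
print("within-modality loss:", within)
print("cross-modality loss:", across)
```

In this sketch, dropping the `across` term corresponds to the setting the abstract describes as harmful for a shared encoder; the toy code does not reproduce that effect, it only shows where the cross-modal terms would enter the objective.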


1️⃣ One-Sentence Summary

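This paper proposes MJEPA, an audio-visual self-supervised learning method that uses a unified encoder and a single predictive objective to learn audio and visual features together. Cross-modal prediction substantially improves performance, and the model outperforms prior methods on several benchmarks; it is especially well suited to settings with limited data.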
