arXiv submission date: 2025-12-08
📄 Abstract - JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention

We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
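To make the Stage 2 tokenization concrete, the sketch below shows how FSQ with a mixed-radix packing scheme can turn one continuous latent frame into a single integer token id. This is an illustrative assumption, not the paper's code: the level counts `(8, 8, 5, 5)` and the function names `fsq_quantize` / `pack_mixed_radix` are hypothetical, chosen only to show how per-dimension quantization digits combine into one language-model-friendly id.

```python
# Hypothetical sketch of FSQ + mixed-radix packing (not the paper's implementation).
import numpy as np

def fsq_quantize(z, levels):
    """Quantize each latent dimension of z to a finite number of levels.

    z:      array of shape (..., D) with values roughly in [-1, 1] (e.g. after tanh)
    levels: list of D ints, number of quantization levels per dimension
    Returns integer digits of shape (..., D), where digit i lies in [0, levels[i]).
    """
    levels = np.asarray(levels)
    # Map [-1, 1] to [0, L_i - 1] and round to the nearest level.
    digits = np.round((z + 1.0) / 2.0 * (levels - 1))
    return np.clip(digits, 0, levels - 1).astype(np.int64)

def pack_mixed_radix(digits, levels):
    """Pack per-dimension digits into one token id with a mixed-radix code:
    id = d_0 + d_1 * L_0 + d_2 * L_0 * L_1 + ...
    """
    token = np.zeros(digits.shape[:-1], dtype=np.int64)
    base = 1
    for i, L in enumerate(levels):
        token += digits[..., i] * base
        base *= L
    return token

# Example: a 4-dim latent frame with levels (8, 8, 5, 5) gives an 8*8*5*5 = 1600-entry vocabulary.
levels = [8, 8, 5, 5]
z = np.tanh(np.random.randn(1, 4))           # one bounded latent frame
digits = fsq_quantize(z, levels)             # per-dimension quantization digits
token_id = pack_mixed_radix(digits, levels)  # single integer id per frame
print(digits, token_id)
```

The packing is reversible: dividing the token id by the cumulative bases recovers the per-dimension digits, which is what allows the decoder to reconstruct the quantized latent before waveform synthesis.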

Top-level tags: audio model training natural language processing
Detailed tags: self-supervised learning speech representation neural tokenization audio compression joint-embedding predictive architecture

JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention


1️⃣ One-Sentence Summary

This paper proposes a two-stage self-supervised learning framework that combines the Joint-Embedding Predictive Architecture with a Density Adaptive Attention Mechanism to efficiently extract semantic token units from speech that are highly compressed, easy for language models to process, and can be reconstructed back into high-quality audio.


Source: arXiv: 2512.07168