Representation-Regularized Convolutional Audio Transformer for Audio Understanding

📄 Abstract

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, which limits their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive and often requires extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address both challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to improve training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset while converging five times faster than existing methods. Code and checkpoints will be released soon.
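To make the Representation Regularization idea concrete, here is a minimal sketch of an auxiliary alignment loss between student predictions and features from a frozen external encoder. The abstract does not state the exact form of the objective, so the cosine-distance formulation, the function names, and the `reg_weight` value are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + 1e-8)


def representation_regularization_loss(student_feats, teacher_feats):
    """Mean cosine distance between student and frozen-teacher features.

    Both arguments are lists of per-frame feature vectors. The teacher
    features are treated as constants (no gradient would flow to them
    in a real training loop). Cosine distance is an assumed choice.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must have the same number of frames")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    ]
    return sum(distances) / len(distances)


def total_loss(bootstrap_loss, reg_loss, reg_weight=0.5):
    """Combine the main SSL loss with the auxiliary regularizer.

    reg_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return bootstrap_loss + reg_weight * reg_loss
```

In this sketch the regularizer approaches zero when the student features point in the same direction as the teacher features, so it acts only as a guide alongside the main bootstrap objective rather than replacing it.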


1️⃣ One-Sentence Summary

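This paper proposes a new model called CAT, which captures hierarchical audio features by integrating multi-resolution information and uses a novel Representation Regularization objective to draw on high-quality external knowledge, significantly improving audio understanding performance while speeding up training convergence fivefold.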
