arXiv submission date: 2026-03-27
📄 Abstract - A Human-Inspired Decoupled Architecture for Efficient Audio Representation Learning

While self-supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource-constrained devices. To address this bottleneck, we propose HEAR (Human-inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of conventional foundation models (which typically require 85M-94M parameters). Despite this high efficiency, HEAR achieves highly competitive performance across diverse audio classification benchmarks. The code and pre-trained models are available at this https URL

Top-level tags: audio, model training, systems
Detailed tags: self-supervised learning, efficient architecture, audio representation, masked audio modeling, knowledge distillation

A Human-Inspired Decoupled Architecture for Efficient Audio Representation Learning


1️⃣ One-sentence summary

This paper proposes HEAR, a new efficient audio learning architecture. Mimicking the human auditory system, it splits sound processing into two independent modules — one extracting local acoustic features and one integrating global semantics — so that, despite a large reduction in parameters and compute, it still matches the performance of much larger models across a variety of audio tasks.
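To make the decoupling concrete, here is a minimal NumPy sketch of the idea described above. All names, shapes, and weights are hypothetical illustrations, not the paper's actual implementation: an "acoustic model" encodes each frame locally (linear cost in sequence length), a "task model" mixes tokens globally with one attention step, and masked audio modeling hides a subset of frames that must be recovered from global context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 100 spectrogram frames, 64 mel bins, 32-dim tokens.
T, F, D = 100, 64, 32

def acoustic_model(frames, W_local):
    """Local feature extraction: each frame is encoded independently,
    so cost grows linearly with sequence length (no frame-to-frame mixing)."""
    return frames @ W_local  # (T, F) @ (F, D) -> (T, D)

def task_model(tokens, W_q, W_k, W_v):
    """Global semantic integration: a single self-attention layer
    mixes information across all local tokens."""
    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
    scores = Q @ K.T / np.sqrt(D)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V  # (T, D)

frames = rng.normal(size=(T, F))
W_local = rng.normal(size=(F, D)) * 0.1
W_q, W_k, W_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

# Masked Audio Modeling: zero out ~30% of frames; during training the
# task model would have to reconstruct their tokenizer targets from
# the surviving global context.
mask = rng.random(T) < 0.3
targets = acoustic_model(frames, W_local)        # tokenizer targets
visible = np.where(mask[:, None], 0.0, frames)   # masked input
local_tokens = acoustic_model(visible, W_local)  # cheap, local stage
global_repr = task_model(local_tokens, W_q, W_k, W_v)  # global stage

print(global_repr.shape)
```

The efficiency argument in the abstract maps onto this split: the per-frame acoustic stage is cheap, and only the (smaller) token sequence pays the quadratic attention cost in the task stage.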

Source: arXiv:2603.26098